For more information, please see full course syllabus of Statistics

### Regression

- Intro 0:00
- Roadmap 0:05
- Roadmap
- Linear Equations 0:34
- Linear Equations: y = mx + b
- Rough Line 5:16
- Rough Line
- Regression - A 'Center' Line 7:41
- Reasons for Summarizing with a Regression Line
- Predictor and Response Variable
- Goal of Regression 12:29
- Goal of Regression
- Prediction 14:50
- Example: Servings of Milk Per Year Shown By Age
- Interpolation
- Extrapolation
- Error in Prediction 20:34
- Prediction Error
- Residual
- Example 1: Residual 23:34
- Example 2: Large and Negative Residual 26:30
- Example 3: Positive Residual 28:13
- Example 4: Interpret Regression Line & Extrapolate 29:40

### Transcription: Regression

*Hi and welcome to www.educator.com.*0000

*Today we are going to be talking about regression.*0002

*Here is the big goal of this lesson.*0007

*Basically we want to set up a conceptual understanding of regressions before we actually learn to calculate them.*0010

*Today we are going to do just a brief review of linear equations and talk about regression as a center line.*0018

*Instead of a center point like the mean, we will talk about a center line, and then we are going to talk about prediction and error.*0026

*What is a linear equation?*0037

*y = mx + b should be pretty familiar to a lot of you.*0040

*Whenever we think of y = mx + b, you can think of y as the output or f(x); x will be the input, usually whatever is on the horizontal axis, the x axis.*0048

*B is the y intercept.*0064

*Another way you could think of the y intercept is: when x = 0, what is y?*0073

*x = 0 would be here, and any point where x is 0 has to be somewhere on the y axis.*0080

*That is what we mean by y intercept.*0093

*M is this slope.*0096

*Slope is something pretty much everyone remembers, but just in case you do not, here is how we calculate slope.*0103

*Slope is the change of y over the change of x.*0118

*When we say change we think of delta, the change of y over change of x.*0124

*More commonly people refer to it as rise/run.*0132

*When you think of rise, you think of going up vertically or down vertically.*0140

*That is the rise, and the run is the movement in the more horizontal direction.*0146

*What is the rise/run?*0149

*That is what we think of as a slope.*0151

*When we think about rise/run, the positive direction is up and to the right.*0154

*The negative direction will be down and more to the left.*0162

*You could think of rise/run as an indication of rate of change.*0170

*How much does y change in relation to x, or vice versa?*0177

*These are the components of our linear equation, and it does not matter what the line looks like.*0183

*Every straight line has an equation, and from that equation you can figure out, for any x, what the y is, and for any y, what the x is.*0197

*If you have x, y, and a slope you could figure out the intercept, and if you have x, y, and the intercept you could figure out the slope.*0211

*This is a useful equation for us.*0224

*We are going to be trying to find a line that is the mean.*0227

*That is the center of the data points.*0231

*In order to do that, we would have to find its equation because the equation is what defines the line.*0235

*In statistics, we are going to use this equation but we are going to write it in a different way.*0243

*It is not really a conceptual change; we are going to change it around just very slightly and superficially.*0250

*The first thing we do is we talk about the y intercept first.*0258

*In statistics, the y intercept b comes first, and because it is the first b it is called b sub 0 or b naught.*0262

*Instead of the y intercept being added second we start from the y intercept and then we add the slope × x.*0276

*Notice that the slope is not called m anymore, it is called b sub 1.*0287

*We have b sub 0 which is the y intercept and b sub 1 which is slope.*0293
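
Written out, the two forms described above are the same line with the terms reordered and renamed. The symbols $b_0$ and $b_1$ follow the lecture's "b sub 0" and "b sub 1":

$$
y = mx + b \quad\Longleftrightarrow\quad y = b_0 + b_1 x, \qquad b_0 = b \ (\text{y intercept}), \quad b_1 = m \ (\text{slope})
$$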

*Same idea as before.*0303

*This is how I will refer to things when we talk about the equation of a regression line.*0306

*What do we mean by the center line?*0320

*If you think about a scatter of data, if you have a whole bunch of data you want to think of a line that somehow cuts through the middle of all of these points.*0325

*Right now we could just roughly draw a line and try to make it cut through the center of all these points.*0337

*That is a very rough line.*0353

*In order to find the equation of this line, all I need is 2 of the points on this line.*0356

*For example, if I take this data point and this data point I could find the equation of that line.*0365

*That is because by having 2 x's and 2 y's, one set of x and y and another set of x and y, I can calculate rise/run.*0374

*From having slope and x and y, I could calculate the y intercept.*0384
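
The two steps just described (rise/run from two points, then the intercept from the slope and one point) can be sketched in a few lines of Python. This is only an illustration: the function name `line_through` and the sample points are made up here and are not from the lecture.

```python
def line_through(p1, p2):
    """Return (b0, b1): intercept and slope of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line has no defined slope")
    b1 = (y2 - y1) / (x2 - x1)  # slope: rise over run
    b0 = y1 - b1 * x1           # intercept: solve y1 = b0 + b1 * x1 for b0
    return b0, b1


# Hypothetical points, chosen only for illustration.
b0, b1 = line_through((1, 3), (3, 7))
print(b0, b1)  # 1.0 2.0
```

Picking a different pair of points from the same scatter plot would generally give a different `b0` and `b1`, which is exactly the problem with a rough line.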

*That is a rough line, though, because it just depends on which 2 points I take.*0390

*If I pick these 2 points I will get this line but let us say I pick this point and this point then I would get an entirely different line.*0396

*Moreover if I pick this point and that point I will get an even more different line.*0414

*The question is which 2 points will you pick?*0426

*It might be not good enough for us to just eyeball things because we are not sure which 2 points to pick.*0430

*If we have 2 data points then life is easy; you could just use those 2, but usually we have more than 2 data points.*0439

*Just eyeballing a rough line may not be good enough for us.*0448

*If we only have 2 data points, we could manually find slope and intercept and find the equation of that line.*0453

*Let us talk about this regression as a center line instead of a center point.*0465

*Here are some reasons for summarizing with a regression line and notice that for all of these I’m talking about scatter plots.*0470

*Regression lines are used for scatter plots.*0478

*Here what we want to do is we want to have some variable. *0482

*Here is my first variable, variable 1.*0487

*Here is my other variable, variable 2.*0491

*We want to have a line that describes the center of all of these cases whatever the cases may be.*0496

*Why do we have a line?*0509

*Why not just a mean?*0512

*Sometimes there is not enough info from just a point.*0516

*If you just have a point, for instance this point is the mean of my x and y.*0522

*That would be x bar and y bar.*0533

*Let us say that is my center point, that point might not give us enough information about this whole distribution.*0538

*We are going to be talking about how to summarize a distribution, and what about the trend? We do not just want a point; we would like a trend.*0548

*To get more information than that point gives, it is useful to have a center line.*0557

*We want to find the summary that describes the relationship between the 2 variables.*0563

*It is not enough just to have a point, the point would not describe the relationship between the 2 but the line does.*0574

*A line will tell you whether its slope is negative or positive.*0580

*The line will tell you what kind of information you would want from a trend.*0587

*That relationship is important to us and we will not get that information from just a point.*0594

*Another reason that you want to summarize with a regression line is that it is helpful to use one variable to predict the other variable.*0603

*Often by convention we will put whatever you feel is the predictor variable on the x axis.*0617

*We may use these to predict these.*0624

*We may use someone's weight to predict their height or vice versa.*0628

*In this case it does not matter which one is the predictor.*0636

*Predictor variables are, by convention, not necessarily causal variables.*0640

*They are just variables that we use in order to predict the second variable.*0646

*That second variable is called the response variable.*0652

*One thing that is important to know is that the predictor variable by convention or by tradition goes on the x axis*0655

*and the response variable is on the y axis.*0664

*That goes along with this idea of a function: we put in x and f(x) crunches out an output for us.*0669

*That is how we think of predictors. *0681

*You put in that predictor and it will crunch out for you the response.*0683

*When we talk about prediction, those predictions lie on the regression line.*0691

*This regression line equals all of our predictions.*0698

*This means that when we think x is 27 then the prediction line shows us that y would be 180 or something like that.*0709

*All the predictions lie on the actual line.*0732

*Notice that a lot of our points do not lie on the prediction line.*0737

*There is a little bit of difference between the actual data and the predicted data.*0742

*Here is the goal of regression, the goal of this line.*0752

*Our fundamental desire is to find this line that is the center.*0758

*It describes the middle of all these points.*0762

*If you want to think about what the center means, it is that all the distances on one side*0767

*balance all the distances on the other side.*0773

*It does not mean that it has to be a perfectly symmetrical distribution.*0777

*It just means that the line in the middle has to be equally distant from all of these points on one side and all of these points on the other side.*0783

*Think about it as a balance.*0791

*It just has to balance each other out.*0794

*All of the distances have to balance each other out.*0797

*That is how I want you to think about it.*0800

*The distances on one side of the line*0802

*balance all the distances on the other side of the line.*0805

*To show you here is one distance.*0809

*Let us take this point.*0819

*This is the distance; this is the y distance away from the line.*0824

*I need all of these distances to be balanced out like all of these distances.*0831

*That is the regression line, and this is a long distance here.*0846

*I need all of these distances to balance each other out.*0852

*Now how would you find such a line because that seems like a lot of work?*0856

*We would have to find a line, find all the distances, move the line around, and make sure all the distances are perfectly, evenly matched.*0861

*That seems hard.*0870

*We will learn to calculate the precise slope and intercept of this middle line, the regression line, by using the method of least squares.*0871

*This is going to be a beautiful shortcut for us so that we can find that line without having to do all that work.*0883

*That is in the next lesson.*0889

*Let us pretend that I have just given you the beautiful regression line.*0893

*I have just found it for you.*0897

*Let us say here I will show you servings of milk per year by age.*0900

*Here on the x axis we have age; when you are like 25 you might drink less milk than when you are 15 or 12.*0908

*On the y axis we have servings of milk.*0926

*I have already drawn for you this regression line and if you trace it all the way up it intercepts the y axis at 795*0932

*and if you look at rise/run it will be a rise of 22 going to the left by 1, which is a drop of 22 going to the right by 1.*0943

*That is a slope of -22/1.*0955

*Here we have this nice line and there are 2 ways you could use prediction.*0958

*One is that you could use prediction in order to find predicted data between the data points you have.*0967

*We have data for a 12 year old and we have data for a 28 year old.*0982

*If I wanted to predict somebody in between that, say I wanted to predict a 20 year old's milk drinking,*0994

*what I can do is put 20 in the equation and find the predicted servings of milk.*1007

*I could just do 795 – 22 × 20.*1017

*I could get my predicted servings of milk.*1024

*When we make a prediction, instead of calling it y, we are going to call it y hat.*1027
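
As a small sketch of the prediction step, the milk line from the lecture (intercept 795, slope -22) can be written as a Python function. The names `B0`, `B1`, and `predict_servings` are made up for this illustration.

```python
B0 = 795  # y intercept of the lecture's milk regression line
B1 = -22  # slope: 22 fewer servings per additional year of age


def predict_servings(age):
    """Return y hat, the predicted servings of milk per year at a given age."""
    return B0 + B1 * age


y_hat = predict_servings(20)
print(y_hat)  # 355
```

The result, 355, is a point on the regression line, not an observed data value.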

*This is called interpolation.*1039

*When you have a range of x's and you are finding something within that range of x's, your predictors are within that range of x's.*1042

*You could think of it as within the boundaries.*1052

*This is the data that I use in order to create my line,*1058

*and if I stay within the range of my data, that is interpolation.*1073

*If I go outside the range of my data, that is my extrapolation.*1079

*For instance, we do not have data on 10 year olds; can I just make one up?*1086

*Can I just do that?*1102

*Can I find my predicted y for people who are 10 year old?*1105

*Obviously I can from just using the equation of the line.*1113

*That is not the hard part.*1117

*It is easy to plug in 10 but the question is can I actually do this?*1119

*Is it legal for me to do?*1128

*The reason why we separate this into 2 different ways of predicting is that extrapolation *1132

*is a little bit more risky because you are going outside the boundaries of your data.*1144

*Because you are going outside of the boundaries of our data we are not sure that our predictions are going to be accurate.*1168

*When we stay within the range of our data it is a safer way because it is most similar to the data that we used to create the line.*1178

*There is interpolation and extrapolation.*1190

*What I want you to know is extrapolation is more dangerous than interpolation.*1192
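
One way to keep the distinction straight is to check whether the x value you are plugging in falls inside the range of x values used to draw the line. The sketch below assumes the data run from age 12 to age 28, as the lecture's example suggests; the function name `prediction_kind` is hypothetical.

```python
def prediction_kind(x, x_min, x_max):
    """Classify a prediction by whether x lies within the observed range of x."""
    if x_min <= x <= x_max:
        return "interpolation"
    return "extrapolation"


# Assumed observed age range for the milk example: 12 to 28 years.
for age in (20, 10, 0):
    print(age, prediction_kind(age, 12, 28))
```

The equation will return a number for any age, so a check like this only flags when the prediction is leaving the data behind; it does not say how wrong the prediction will be.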

*Let us say we go all the way to 0 years of age, would it be true that they drink all these servings of milk?*1197

*They do not.*1207

*They drink infant formula or breast milk.*1209

*It will be wrong if I say that infants around 0 years old drink 795 servings of milk a year because that will just be wrong.*1211

*That is what we mean by extrapolation being a little bit more dangerous.*1226

*Let us talk about errors in prediction.*1237

*Even though we have this nice equation for the line, a common problem is that the servings of milk per year*1240

*that we predict, y hat, is not always going to fit with our data.*1251

*It is not always going to perfectly line up with our data.*1257

*In fact you could see here there is a lot of jitter around the line and that is called prediction error.*1260

*The prediction error is the difference between the real truth and our prediction.*1268

*Whenever we have data, it is often from a sample.*1283

*We do not know what the real truth is.*1287

*We only have the sample.*1289

*Often we want to know prediction error but this is a theoretical idea.*1291

*It is the difference between the truth and our prediction, but we do not know what the truth is.*1296

*What we do have is our data, the sample,*1302

*and what we can find is not the real prediction error but what we call the residual.*1305

*After we find the middle line, then what we can find is the difference between our data and that line.*1312

*That is called the residual.*1322

*This idea here, the distances between our actual y, the data, and our predicted y, y hat, that is called the residual.*1326

*Notice that we have a whole bunch of residuals.*1343

*Here is the thing: some of our data is greater than our prediction and some of our data is less than our prediction.*1346

*There is a whole bunch of positives and a whole bunch of negatives.*1364

*The predicted y, the perfect middle line, actually has a balance of positive and negative.*1368

*If we add in all those positives and negatives, these distances are exactly equal to those distances.*1380

*These are positive and these are negative.*1393

*When we add them all together we will get 0.*1396

*The idea is all the residuals on this side and all the residuals on this side add up to 0 because *1401

*that would mean that our line is truly on the middle of all these distances.*1407

*That is called a residual.*1413
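
To see the balancing idea numerically, here is a minimal sketch that fits a line and adds up the residuals. Two things are assumptions and not from this lecture: the data values are invented for illustration, and the line is fitted with the standard least squares formulas, which the course only introduces in the next lesson.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) using the standard least squares formulas (assumed here)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical ages and servings of milk per year, invented for illustration.
ages = [12, 16, 20, 24, 28]
servings = [540, 430, 370, 225, 190]

b0, b1 = least_squares_line(ages, servings)
residuals = [y - (b0 + b1 * x) for x, y in zip(ages, servings)]

print(residuals)       # a mix of positive and negative values
print(sum(residuals))  # zero, up to floating-point rounding
```

Individual residuals can be large in either direction; it is only their sum that balances out.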

*Let us go to our first example.*1416

*This is the same data that we were working with, and the question is what is the residual for milk drinking of a 24 year old?*1418

*Since we are finding the residual, we know that the residual is the difference between the data y and y hat, the predicted y.*1427

*To put it into our example, it is the actual servings of milk that 24 year olds drink, the data that we have.*1445

*Subtract out the predicted servings of milk that 24 year olds drink.*1452

*First things first, let us find how much milk 24 year olds drink.*1459

*If we go to 24, this is our data point right here; we can just eyeball what the point looks like to us.*1464

*It looks like maybe 24 and 225 or something.*1472

*We already have our y, 225.*1481

*We just need to find y hat.*1488

*In order to find y hat, all we have to do is put in 24 to this regression equation.*1491

*y hat is equal to 795 – 22 × 24.*1501

*That will be our predicted y.*1509

*Here I’m just going to bring up Excel and just put in 795 – 22 × 24.*1512

*We will get 267.*1530

*So y hat is equal to 267.*1534

*We have 225 – 267; that makes sense because our predicted servings of milk is above our actual servings of milk from our data.*1538

*We should get a negative number.*1561

*Let us go to Excel and put in 225 – 267, and here I get -42.*1568

*That is our residual for milk drinking of a 24 year old.*1582
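
The same arithmetic from Example 1 can be reproduced in Python instead of Excel. The observed value 225 is the lecture's eyeballed reading from the scatter plot, so treat it as approximate.

```python
y = 225                # observed servings for a 24 year old (read off the plot)
y_hat = 795 - 22 * 24  # predicted servings from the regression line
residual = y - y_hat   # residual = data minus prediction

print(y_hat)     # 267
print(residual)  # -42
```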

*Example 2, if a residual is large and negative, where is the point located with respect to the line?*1591

*What does it mean for the residual to be negative?*1599

*We already have an example of a residual being negative; it means that the point is far from the line and below it on the y axis.*1601

*Just to draw some examples for you.*1613

*If we have a line that looks like this, one idea is the point is way down here.*1616

*It will be large and negative given the y hat and the y because the residual = y – y hat.*1624

*Another example that I could draw for you is something like this.*1639

*Even in this case, this will give us a large negative residual because once again our y hat is greater than our y.*1644

*If the residual is negative, if the residual is less than 0 then it must mean that our y hat is greater than our y.*1658

*With respect to the line, the point is below the line.*1675

*What does it mean for the residual to be negative?*1682

*It means that our prediction is greater than our data point.*1685
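
The rule from Example 2 (the sign of the residual tells you which side of the line the point is on) can be written as a tiny helper. The function name `position_relative_to_line` is made up for this sketch.

```python
def position_relative_to_line(y, y_hat):
    """Say where a data point sits relative to the regression line."""
    residual = y - y_hat
    if residual > 0:
        return "above"  # data is greater than the prediction
    if residual < 0:
        return "below"  # prediction is greater than the data
    return "on"


# The 24 year old from Example 1: y = 225, y hat = 267.
print(position_relative_to_line(225, 267))  # below
```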

*Example 3, if somebody said that they have fit a line to a set of data points and all their residuals are positive, what would you say to them?*1696

*Let us just think about this.*1710

*Let us say we have some sort of a line and all the residuals are positive.*1713

*That would mean that every data point is somehow above this line because if they are below that would be negative.*1718

*Could that ever be the case if we want our line to be in the middle of all these points?*1729

*No.*1737

*I would probably say to them perhaps they have made a mistake because half of their distances should be positive and half should be negative.*1739

*Sometimes you could have 2 small positive distances and one larger negative distance.*1754

*It could balance out like that but you cannot have all positive nor can you have all negative.*1760

*I would say to them your line is not in the middle of all these points.*1766

*It is not a good regression line.*1778

*Example 4, interpret the y intercept of the regression line in the milk example.*1782

*Does it make sense to extrapolate here?*1788

*One thing that you need to know is the x axis only goes from 10 to 30, but we need to take it all the way out to 5 and then 0.*1791

*What we mean is that here is where the true y intercept is, because x has to be 0.*1806

*Does it make sense to extrapolate here?*1819

*This would mean that when x is 0 then y would be 795.*1823

*Let us think about what that means.*1833

*When x is 0, age would be 0; we are talking about newborns. Is it true that newborns drink 795 servings of milk?*1835

*We just talked about that.*1850

*It does not make sense to extrapolate here because newborns are a special case.*1852

*They do not really drink milk; they drink breast milk and infant formula, and because of that it does not make sense to talk about newborns drinking milk yet.*1857

*It does not quite make sense to extrapolate that way.*1868

*Newborns are an exception, and presumably this line will go on and on and on.*1874

*There will be a point where it crosses the x axis.*1881

*This is the x intercept when y = 0.*1886

*It may not make sense to extrapolate there either just because at a certain point the servings of milk might go into negative.*1891

*That does not make sense in our data.*1901
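
To make the x intercept concrete, here is a short sketch using the milk line from the lecture. It finds the age at which the predicted servings reach 0 and shows that the prediction goes negative beyond it. The variable names and the choice of age 40 as a test value are made up for this illustration.

```python
B0 = 795  # y intercept of the milk line
B1 = -22  # slope of the milk line


def predict_servings(age):
    return B0 + B1 * age


# x intercept: solve 0 = B0 + B1 * age for age.
zero_age = -B0 / B1
print(round(zero_age, 2))  # 36.14

# Past that age the line predicts a negative number of servings.
print(predict_servings(40))  # -85
```

Both ends of the line, age 0 and ages past about 36, lie outside the 10 to 30 range shown on the x axis, so both are extrapolations.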

*It does not quite make sense to extrapolate beyond the confines of our data.*1906

*That is conceptual understanding of regression.*1912

*Hope to see you again for calculating regression next time on www.educator.com.*1916

Post by Manoj Joseph on June 9, 2013

Dr.Son

I enjoyed your previous lecture. I am finding difficult to make sense of this session. It may be partly due to unfamiliarity with equations and compounded by the example you use to explain

Post by Brijesh Bolar on August 14, 2012

Son Sonsaengnim... your explanations are so good.. you make statistics really easy.

Post by marzena quinn on April 5, 2012

Brilliant explanation!