For more information, please see full course syllabus of Statistics

Lecture Comments (3)

0 answers

Post by Manoj Joseph on June 9, 2013


I enjoyed your previous lecture. I am finding difficult to make sense of this session. It may be partly due to unfamiliarity with equations and compounded by the example you use to explain

0 answers

Post by Brijesh Bolar on August 14, 2012

Son Sonsaengnim... your explanations are so good.. you make statistics really easy.

0 answers

Post by marzena quinn on April 5, 2012

Brilliant explanation!


Lecture Slides are screen-captured images of important points in the lecture. Students can download and print out these lecture slide images to do practice problems as well as take notes while watching the lecture.

  • Intro 0:00
  • Roadmap 0:05
    • Roadmap
  • Linear Equations 0:34
    • Linear Equations: y = mx + b
  • Rough Line 5:16
    • Rough Line
  • Regression - A 'Center' Line 7:41
    • Reasons for Summarizing with a Regression Line
    • Predictor and Response Variable
  • Goal of Regression 12:29
    • Goal of Regression
  • Prediction 14:50
    • Example: Servings of Milk Per Year Shown By Age
    • Interpolation
    • Extrapolation
  • Error in Prediction 20:34
    • Prediction Error
    • Residual
  • Example 1: Residual 23:34
  • Example 2: Large and Negative Residual 26:30
  • Example 3: Positive Residual 28:13
  • Example 4: Interpret Regression Line & Extrapolate 29:40

Transcription: Regression

Hi and welcome to

Today we are going to be talking about regression.0002

Here is the big goal of this lesson.0007

Basically we want to set up a conceptual understanding of regression before we actually learn to calculate it.0010

Today we are going to do just a brief review of linear equations and talk about regression as a center line.0018

Instead of a center point like the mean, we will talk about a center line, and then we are going to talk about prediction and error.0026

What is a linear equation?0037

y = mx + b should be pretty familiar to a lot of you.0040

Whenever we think of y = mx + b, you can think of y as the output or f(x); x will be the input, or often whatever is on the horizontal axis, the x axis.0048

b is the y intercept.0064

Another way you could think of the y intercept is: where x = 0, what is y?0073

x = 0 would be here, and any point where x is 0 will have to be somewhere on this y axis.0080

That is what we mean by y intercept.0093

m is the slope.0096

Most of you pretty much know slope, but just in case you do not, here is how we calculate slope.0103

Slope is the change of y over the change of x.0118

When we say change we think of delta, the change of y over change of x.0124

More commonly people refer to it as rise/run.0132

When you think of rise, you think of going up vertically or down vertically.0140

That is the rise, and the run is the movement in the horizontal direction.0146

What is the rise/run?0149

That is what we think of as a slope.0151

When we think about rise/run, the positive direction is up and to the right.0154

The negative direction will be down and to the left.0162

You could think of rise/run as an indication of rate of change.0170

How much y changes for a given change in x.0177

These are the components of our linear equation, and it does not matter what the line looks like.0183

Every straight line has an equation, and from that equation you can figure out, for any x, what the y is and, for any y, what the x is.0197

If you have an x, a y, and the slope you could figure out the intercept, and if you have the intercept plus an x and a y you could figure out the slope.0211

This is a useful equation for us.0224

We are going to be trying to find a line that acts like the mean.0227

That is, the center of the data points.0231

In order to do that, we would have to find its equation because the equation is what defines the line.0235

In statistics, we are going to use this equation but we are going to write it in a different way.0243

It is not a conceptual change; we are just going to change the notation around very slightly and superficially.0250

The first thing we do is we talk about the y intercept first.0258

In statistics, the y intercept b comes first, and because it is the first b it is b sub 0 or b naught.0262

Instead of the y intercept being added second, we start from the y intercept and then we add the slope × x.0276

Notice that the slope is not called m anymore, it is called b sub 1.0287

We have b sub 0 which is the y intercept and b sub 1 which is slope.0293

Same idea as before.0303

This is how I will refer to things when we talk about the equation of a regression line.0306
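Written out, the change of notation described here looks like the following. This is only a sketch of the two forms side by side; the labels (x_1, y_1) and (x_2, y_2) are my own names for any two points on the line and are not used in the lecture.

```latex
y = mx + b \quad\longrightarrow\quad y = b_0 + b_1 x,
\qquad b_1 = \frac{\Delta y}{\Delta x} = \frac{y_2 - y_1}{x_2 - x_1}
```

Here b_0 plays the role of b (the y intercept) and b_1 plays the role of m (the slope, rise over run).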

What do we mean by the center line?0320

If you think about a scatter of data, if you have a whole bunch of data you want to think of a line that somehow cuts through the middle of all of these points.0325

Right now we could just roughly draw a line and try to make it cut through the center of all these points.0337

That is a very rough line.0353

I can find the equation of this line as long as I have 2 of the points on it.0356

For example, if I take this data point and this data point I could find the equation of that line.0365

It is because by having 2 x's and 2 y's, one pair of x and y and another pair of x and y, I can calculate rise/run.0374

From having slope and x and y, I could calculate the y intercept.0384
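The two-point idea can be sketched in a few lines of Python. This is only an illustration: the function name `line_through` and the two points are made up for the example and do not come from the lecture's scatter plot.

```python
def line_through(p1, p2):
    """Return (intercept, slope) of the straight line through two (x, y) points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("vertical line: slope is undefined")
    slope = (y2 - y1) / (x2 - x1)   # rise over run
    intercept = y1 - slope * x1     # solve y1 = intercept + slope * x1
    return intercept, slope


# Hypothetical points, chosen only for illustration.
b0, b1 = line_through((1, 3), (3, 7))
print(b0, b1)  # 1.0 2.0
```

Picking a different pair of points would give a different intercept and slope, which is exactly why a rough two-point line is not good enough once there are more than 2 data points.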

That is a rough line, though, because it depends on which 2 points I take.0390

If I pick these 2 points I will get this line, but let us say I pick this point and this point, then I would get an entirely different line.0396

Moreover if I pick this point and that point I will get an even more different line.0414

The question is which 2 points will you pick?0426

It might be not good enough for us to just eyeball things because we are not sure which 2 points to pick.0430

If we have 2 data points then life is easy like you could just use those 2 but usually we have more than 2 data points.0439

Just eyeballing a rough line may not be good enough for us.0448

If we only had 2 data points, we could manually find the slope and intercept and so find the equation of that line.0453

Let us talk about this regression as a center line instead of a center point.0465

Here are some reasons for summarizing with a regression line, and notice that for all of these I’m talking about scatter plots.0470

Regression lines are used for scatter plots.0478

Here what we want to do is we want to have some variable.0482

Here is my first variable, variable 1.0487

Here is my other variable, variable 2.0491

We want to have a line that describes the center of all of these cases, whatever the cases may be.0496

Why do we have a line?0509

Why not just a mean?0512

Sometimes there is not enough info from just a point.0516

If you just have a point, for instance this point is the mean of my x and y.0522

That would be x bar and y bar.0533

Let us say that is my center point; that point might not give us enough information about this whole distribution.0538

We are going to be talking about how to summarize a distribution, and when there is a trend we do not just want a point, we want the trend.0548

To get more information than that point gives us, it is useful to have a center line.0557

We want to find the summary that describes the relationship between the 2 variables.0563

It is not enough just to have a point, the point would not describe the relationship between the 2 but the line does.0574

A line will tell you whether its slope is negative or positive.0580

The line will tell you what kind of information you would want from a trend.0587

That relationship is important to us and we will not get that information from just a point.0594

Another reason that you want to summarize with a regression line is that it is helpful to use one variable to predict the other variable.0603

Often by convention we will put whatever you feel is the predictor variable on the x axis.0617

We may use these to predict these.0624

We may use someone's weight to predict their height or vice versa.0628

In this case it does not matter which one is the predictor.0636

Predictor variables are not necessarily causal variables.0640

They are just variables that we use in order to predict the second variable.0646

That second variable is called the response variable.0652

One thing that is important to know is that the predictor variable by convention or by tradition goes on the x axis0655

and the response variable goes on the y axis.0664

That goes along with this idea of a function: we put in x and f(x) crunches out an output for us.0669

That is how we think of predictors.0681

You put in that predictor and it will crunch out for you the response.0683

When we talk about prediction, those predictions lie on the regression line.0691

This regression line is made up of all of our predictions.0698

This means that when x is 27, the prediction line shows us that y would be 180 or something like that.0709

All the predictions lie on the actual line.0732

Notice that a lot of our points do not lie on the prediction line.0737

There is a little bit of difference between the actual data and the predicted data.0742

Here is the goal of regression, the goal of this line.0752

Our fundamental desire is to find this line that is the center.0758

It describes the middle of all these points.0762

If you want to think about what the center means, think of all the distances on one side of the line.0767

They balance all the distances on the other side.0773

It does not mean that it has to be a perfectly symmetrical distribution.0777

It just means that the line in the middle has to be as far, in total, from all of these points as it is from all of these points.0783

Think about it as a balance.0791

The two sides just have to balance each other out.0794

All of the distances have to balance each other out.0797

That is how I want you to think about it.0800

The distances on one side of the line0802

balance all the distances on the other side of the line.0805

To show you here is one distance.0809

Let us take this point.0819

This is the distance, this is the y distance away from the line.0824

I need all of these distances to be balanced out by all of these distances.0831

That holds all along the regression line, and this is a long distance here.0846

I need all of these distances to balance each other out.0852

Now how would you find such a line? Because that seems like a lot of work.0856

We would have to find a line, find all the distances, move the line around, and make sure all the distances are perfectly, evenly matched.0861

That seems hard.0870

We will learn to calculate the precise slope and intercept of this middle line, the regression line, by using the method of least squares.0871

This is going to be a beautiful shortcut for us so that we can find that line without having to do all that work.0883

That is in the next lesson.0889

Let us pretend that I have just given you the beautiful regression line.0893

I have just found it for you.0897

Let us say here I show you servings of milk per year by age.0900

Here on the x axis we have age; when you are like 25 you might drink less milk than when you are 15 or 12.0908

Here is servings of milk.0926

I have already drawn this regression line for you, and if you trace it all the way up it may intercept at 795 on the y axis0932

and if you look at rise/run, it will rise 22 for each step of 1 to the left, which is a slope of –22.0943


Here we have this nice line and there are 2 ways you could use prediction.0958

One is that you could use prediction in order to fill in data that you do not have.0967

We have data for a 12 year old and we have data for a 28 year old.0982

If I wanted to predict somebody in between that, say I wanted to predict a 20 year old's milk drinking.0994

What I can do is put 20 in the equation and find the predicted servings of milk.1007

I could just do 795 – 22 × 20.1017

I could get my predicted servings of milk.1024
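Plugging 20 into the line can be sketched in Python as follows. The intercept 795 and slope –22 are the values read off the regression line in this example; the names `B0`, `B1`, and `predict_servings` are my own labels, not anything from the lecture.

```python
B0 = 795   # y intercept read off the lecture's regression line
B1 = -22   # slope: predicted servings drop by 22 per additional year of age


def predict_servings(age):
    """Predicted servings of milk per year for a given age."""
    return B0 + B1 * age


print(predict_servings(20))  # 355
```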

When we make a prediction, instead of calling it y, we are going to call it y hat.1027

This is called interpolation.1039

When you have a range of x's and you are finding something within that range of x's, your predictors are within that range of x's.1042

You could think of it as within the boundaries.1052

Staying within the range of data because this is the data that I use in order to create my line1058

and if I stay within the range of my data, that is interpolation.1073

If I go outside the range of my data, that is my extrapolation.1079

For instance, we do not have data for 10 year olds; can I just make one up?1086

Can I just do that?1102

Can I find my predicted y for people who are 10 year old?1105

Obviously I can from just using the equation of the line.1113

That is not the hard part.1117

It is easy to plug in 10 but the question is can I actually do this?1119

Is it legal for me to do?1128

The reason why we separate this into 2 different ways of predicting is that extrapolation1132

is a little bit more risky because you are going outside the boundaries of your data.1144

Because you are going outside of the boundaries of our data we are not sure that our predictions are going to be accurate.1168

When we stay within the range of our data it is safer because it is most similar to the data that we used to create the line.1178

There is interpolation and extrapolation.1190

What I want you to know is extrapolation is more dangerous than interpolation.1192
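One way to keep the two kinds of prediction apart is to check whether the x you plug in falls inside the range of the data. The sketch below assumes the observed ages run from 12 to 28, based on the 12 year old and 28 year old mentioned earlier; the exact range and the function name are assumptions for illustration.

```python
# Assumed range of observed ages (from the 12 and 28 year olds mentioned in the lecture).
AGE_MIN, AGE_MAX = 12, 28


def kind_of_prediction(age):
    """Label a prediction as interpolation (inside the data range) or extrapolation."""
    if AGE_MIN <= age <= AGE_MAX:
        return "interpolation"
    return "extrapolation"


for age in (20, 10, 0):
    print(age, kind_of_prediction(age))
```

With this assumed range, age 20 is labeled interpolation, while ages 10 and 0 are labeled extrapolation and deserve more caution.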

Let us say we go all the way to 0 years of age, would it be true that they drink all these servings of milk?1197

They do not.1207

They drink infant formula or breast milk.1209

It would be wrong if I said that infants at 0 years old drink 795 servings of milk a year.1211

That is what we mean by extrapolation being a little bit more dangerous.1226

Let us talk about errors in prediction.1237

Even though we have this nice equation for the line, a common problem is that the servings of milk per year1240

that we predict, y hat, are not always going to fit our data.1251

They are not always going to perfectly line up with our data.1257

In fact you could see here there is a lot of jitter around the line and that is called prediction error.1260

The prediction error is the difference between the real truth and our prediction.1268

Whenever we have data, it is often from a sample.1283

We do not know what the real truth is.1287

We only have the sample.1289

Often we want to know prediction error but this is a theoretical idea.1291

It is the difference between the truth and our prediction, but we do not actually know what the truth is.1296

What we do have is our data.1302

From the sample, what we can find is not the real prediction error but what we call the residual.1305

After we find the middle line, then what we can find is the difference between our data and that line.1312

That is called the residual.1322

This idea here, the distance between our actual y, the data, and our predicted y, y hat, that is called the residual.1326

Notice that we have a whole bunch of residuals.1343

Here is the thing: some of our data is greater than our prediction and some of our data is less than our prediction.1346

It is a whole bunch of positives and a whole bunch of negatives.1364

For the predicted y, the perfect middle line actually has a balance of positives and negatives.1368

If we add up all those positives and negatives, these distances are exactly equal to these distances.1380

These are positive and these are negative.1393

When we add them all together we will get 0.1396

The idea is all the residuals on this side and all the residuals on this side add up to 0 because1401

that would mean that our line is truly on the middle of all these distances.1407

That is called a residual.1413
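The balancing idea can be checked numerically. The sketch below uses made-up data (not the lecture's) and the standard least-squares formulas for slope and intercept, which this lesson names but does not derive, so treat both as assumptions for illustration.

```python
# Hypothetical data, invented for this sketch (not the lecture's data set).
ages = [12, 16, 20, 24, 28]
servings = [540, 430, 370, 225, 190]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(servings) / n

# Standard least-squares slope (b1) and intercept (b0); assumed here, derived in a later lesson.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, servings))
sxx = sum((x - x_bar) ** 2 for x in ages)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Residual = actual y minus predicted y (y hat).
residuals = [y - (b0 + b1 * x) for x, y in zip(ages, servings)]

print(any(r > 0 for r in residuals), any(r < 0 for r in residuals))  # True True
print(abs(sum(residuals)) < 1e-9)  # True
```

Some residuals are positive and some are negative, and together they add up to 0 (up to rounding), which is what it means for the line to sit in the middle.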

Let us go to our first example.1416

This is the same data that we were working with, and the question is what is the residual for the milk drinking of a 24 year old?1418

Since we are finding the residual, we know that the residual is the difference between the data y and y hat, the predicted y.1427

To put it into our example, it is the actual servings of milk that 24 year olds drink, the data that we have.1445

Subtract out the predicted servings of milk that 24 year olds drink.1452

First things first, let us find how much milk 24 year olds drink.1459

If we go to 24, this is our data point right here; we can just eyeball what the point looks like to us.1464

It looks like maybe 24 and 225 or something.1472

We already have our y, 225.1481

We just need to find y hat.1488

In order to find y hat, all we have to do is put in 24 to this regression equation.1491

y hat is equal to 795 – 22 × 24.1501

That will be our predicted y.1509

Here I’m just going to bring up Excel and just put in 795 – 22 × 24.1512

We will get 267.1530

So y hat is equal to 267.1534

We have 225 – 267, and that makes sense because our predicted servings of milk are above our actual servings of milk from our data.1538

We should get a negative number.1561

Let us go to Excel and put in 225 – 267, and here I get -42.1568

That is our residual for milk drinking of a 24 year old.1582
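The same arithmetic can be done in Python instead of Excel. This is a small sketch; the value 225 is the eyeballed data point from the plot, and the function name `residual` is my own.

```python
def residual(actual, predicted):
    """Residual = actual y minus predicted y (y hat)."""
    return actual - predicted


y = 225                 # actual servings for the 24 year old, eyeballed from the plot
y_hat = 795 - 22 * 24   # predicted servings from the regression line

print(y_hat, residual(y, y_hat))  # 267 -42
```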

Example 2, if a residual is large and negative, where is the point located with respect to the line?1591

What does it mean for the residual to be negative?1599

We already have an example of a residual being negative; it means that the point is off the line and below it on the y axis.1601

Just to draw some examples for you.1613

If we have a line that looks like this, one idea is the point is way down here.1616

The residual will be large and negative given the y hat and the y, because the residual = y – y hat.1624

Another example that I could draw for you is something like this.1639

Even in this case, this will give us a large negative residual because once again our y hat is greater than our y.1644

If the residual is negative, if the residual is less than 0 then it must mean that our y hat is greater than our y.1658

With respect to the line, the point is below the line.1675

What does it mean for the residual to be negative?1682

It means that our prediction is greater than our data point.1685

Example 3, if somebody said that they have fit a line to a set of data points and all their residuals are positive, what would you say to them?1696

Let us just think about this.1710

Let us say we have some sort of a line and all the residuals are positive.1713

That would mean that every data point is somehow above this line because if they are below that would be negative.1718

Could that ever be the case if we want our line to be in the middle of all these points?1729


I would probably say to them that perhaps they have made a mistake, because their positive distances and their negative distances should balance out.1739

Sometimes you could have 2 small positive distances and one larger negative distance.1754

It could balance out like that but you cannot have all positive nor can you have all negative.1760

I would say to them your line is not in the middle of all these points.1766

It is not a good regression line.1778

Example 4, interpret the y intercept of the regression line in the milk example.1782

Does it make sense to extrapolate here?1788

One thing that you need to know is the x axis here only goes from 10 to 30, but we need to take it all the way out to 5 and then 0.1791

What we mean is that here is where the true y intercept is, because x has to be 0.1806

Does it make sense to extrapolate here?1819

This would mean that when x is 0 then y would be 795.1823

Let us think about what that means.1833

When x is 0 age would be 0, so we are talking about newborns; is it true that newborns drink 795 servings of milk?1835

We just talked about that.1850

It does not make sense to extrapolate here because newborns are a special case.1852

They do not really drink milk, they drink breast milk and infant formula, and because of that it does not make sense to talk about newborns drinking milk yet.1857

It does not quite make sense to extrapolate that way.1868

Newborns are an exception and presumably this line will go on and on and on.1874

There will be a point where it crosses the x axis.1881

This is the x intercept, where y = 0.1886

It may not make sense to extrapolate there either, just because at a certain point the servings of milk might go negative.1891

That does not make sense in our data.1901

It does not quite make sense to extrapolate beyond the confines of our data.1906
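Both ends of the extrapolation problem show up if we push the line past the data. The sketch below reuses the intercept 795 and slope –22 from the milk example; the names and the age 40 are chosen only for illustration.

```python
B0, B1 = 795, -22  # intercept and slope from the lecture's milk example


def predict_servings(age):
    """Predicted servings of milk per year for a given age."""
    return B0 + B1 * age


x_intercept = -B0 / B1  # age at which the predicted servings reach 0

print(predict_servings(0))     # 795
print(round(x_intercept, 1))   # 36.1
print(predict_servings(40))    # -85
```

At age 0 the line claims 795 servings for a newborn, and somewhere past age 36 it starts predicting negative servings. Neither makes sense, which is the risk of going outside the range of the data.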

That is the conceptual understanding of regression.1912

Hope to see you again for calculating regression next time on