WEBVTT mathematics/statistics/son
00:00:00.000 --> 00:00:03.300
Hi and welcome to www.educator.com.
00:00:03.300 --> 00:00:07.500
In the previous lesson we learned, conceptually, about the idea of regression.
00:00:07.500 --> 00:00:15.700
In this lesson on least squares regression we are going to talk about how to actually calculate and find a regression line.
00:00:15.700 --> 00:00:25.300
Here is the roadmap: we are going to talk about what it means for a line to best fit the data.
00:00:25.300 --> 00:00:30.400
We are going to talk about the sum of squared errors and why that concept is important for regression.
00:00:30.400 --> 00:00:34.200
We are going to talk about some quantitative properties of the regression line.
00:00:34.200 --> 00:00:42.000
We know conceptually what it means, but once we do have a regression line there are some rules that the regression line conforms to.
00:00:42.000 --> 00:00:50.300
We are going to talk about how to actually find the slope and the intercept of the regression line.
00:00:50.300 --> 00:00:53.200
What does it mean to best fit the data?
00:00:53.200 --> 00:00:59.900
Well, you can think about it like this: there are any number of lines that you could draw through a set of data.
00:00:59.900 --> 00:01:07.600
We could draw that one, we could draw this one, we can draw this one, we could draw that one.
00:01:07.600 --> 00:01:17.000
There are an infinite number of possible ones, but our goal is a regression line that is in the middle of all of these data points.
00:01:17.000 --> 00:01:22.100
When it is in the middle that is what we mean by best fitting line.
00:01:22.100 --> 00:01:46.400
You can think of best fit as roughly being equal to the concept of in the middle, and the difference between all of these lines
00:01:46.400 --> 00:01:54.400
and the true regression line is that the best fitting line is roughly in the middle.
00:01:54.400 --> 00:01:57.300
How do we find the best fitting line?
00:01:57.300 --> 00:02:08.300
Quantitatively, what it means to best fit the data is that this line has the lowest sum of squared errors.
00:02:08.300 --> 00:02:15.200
Because of that the regression line is also called the least squares line.
00:02:15.200 --> 00:02:17.900
That is why it is called the least squares method.
00:02:17.900 --> 00:02:25.700
Even though in the middle and best fit are good conceptual ideas they are not quantitative ideas.
00:02:25.700 --> 00:02:31.100
This is the quantitative definition of what the best fitting line is.
00:02:31.100 --> 00:02:33.900
Let us talk about what error is.
00:02:33.900 --> 00:02:39.900
We have a particular word for error and that word is the residual.
00:02:39.900 --> 00:02:50.500
And that residual is the difference between y and the predicted y from our best fitting line.
00:02:50.500 --> 00:03:05.300
Having the lowest sum of squared errors, the lowest SSE, really means having the lowest sum of all the squared residuals.
00:03:05.300 --> 00:03:19.000
Residuals squared, and another way to write that is the sum of (y – y hat)².
00:03:19.000 --> 00:03:25.900
This is our quantitative measure of how good our line is.
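As a rough sketch of that measure, here is how the SSE of a candidate line could be computed in Python. The function name, the data, and the two candidate lines are made up for illustration and are not from the lesson.
```python
def sum_squared_errors(xs, ys, b0, b1):
    """Return the sum of squared residuals for the line y_hat = b0 + b1 * x."""
    total = 0.0
    for x, y in zip(xs, ys):
        y_hat = b0 + b1 * x        # predicted y from the candidate line
        residual = y - y_hat       # actual y minus predicted y
        total += residual ** 2     # squaring keeps positives and negatives from cancelling
    return total
# Made-up data, only for illustration.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]
sse_a = sum_squared_errors(xs, ys, b0=0.0, b1=2.0)  # candidate line y = 2x
sse_b = sum_squared_errors(xs, ys, b0=1.0, b1=1.0)  # candidate line y = 1 + x
print(sse_a, sse_b)  # the smaller value marks the better fit of the two
```
Comparing two candidates this way only tells us which of the two is better, not whether either one is the regression line.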
00:03:25.900 --> 00:03:28.300
Now there is kind of a Catch-22 here.
00:03:28.300 --> 00:03:40.600
We have to have the line before we could figure out whether it has the lowest SSE but the question is how do we find that line?
00:03:40.600 --> 00:03:45.800
First, before we go on, let us talk about why we need to square these residuals.
00:03:45.800 --> 00:03:48.600
Remember when we talked about what it means to be in the middle?
00:03:48.600 --> 00:03:54.700
It means that the distances on the positive side, the points above the line, and the distances for the points below the line,
00:03:54.700 --> 00:03:59.600
the negative distances, should all balance out.
00:03:59.600 --> 00:04:06.300
If you have a bunch of positive and a bunch of negative and you add them together you should get 0.
00:04:06.300 --> 00:04:14.800
Here is the tricky part: that sum of the residuals, y – y hat.
00:04:14.800 --> 00:04:22.900
The sum of the residuals, period, should be roughly equal to 0 for the best fitting line.
00:04:22.900 --> 00:04:31.900
Because of that we want to square these distances.
00:04:31.900 --> 00:04:43.300
I will write the squared in red because that means that this value, the sum of squared errors should be greater than 0.
00:04:43.300 --> 00:04:45.500
We definitely want to square it.
00:04:45.500 --> 00:04:55.000
There are other mathematical properties that we will be able to take advantage of in the future.
00:04:55.000 --> 00:04:58.700
We know what it means to quantitatively be the regression line.
00:04:58.700 --> 00:05:04.500
It means having the lowest sum of squared errors, but there are other quantitative properties that come along.
00:05:04.500 --> 00:05:13.100
One important property to note is that this line, the regression line, also contains the point of averages.
00:05:13.100 --> 00:05:17.300
The average of all your x and the average of all your y.
00:05:17.300 --> 00:05:21.500
The average of variable 1 and the average of variable 2.
00:05:21.500 --> 00:05:25.500
This point is often also called the center of mass.
00:05:25.500 --> 00:05:33.100
It is really easy to find this point you just take the average of your x and take the average of your y.
00:05:33.100 --> 00:05:39.400
x bar and y bar is your point of average.
00:05:39.400 --> 00:05:52.400
You can also think of it as the center of mass because if we think of all your points, the scatter plot, as something like an object, this is the center of that mass.
00:05:52.400 --> 00:06:02.400
We already know that this line has the lowest SSE of any line, and that it also contains the point of averages.
00:06:02.400 --> 00:06:10.700
The sum of the residuals, when you do not square them, should be approximately 0.
00:06:10.700 --> 00:06:19.600
And because the sum of the residual is 0, the mean of the residuals is also 0 because the mean is the sum divided by the number of points.
00:06:19.600 --> 00:06:28.700
If the sum is 0 the mean will be 0, and the variation of the residuals is as small as possible.
00:06:28.700 --> 00:06:31.400
It is smaller than other lines.
00:06:31.400 --> 00:06:36.500
One way to quantify variation is something like standard deviation.
00:06:36.500 --> 00:06:44.100
The residuals have a smaller standard deviation than they would for any other line.
00:06:44.100 --> 00:06:51.300
Those are very important quantitative properties that we need to know.
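Here is a small sketch that checks those properties numerically. The data and the comparison line are made up for illustration, and the slope and intercept are computed with the formulas covered later in this lesson.
```python
from statistics import mean, pstdev
# Made-up data, only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]
x_bar, y_bar = mean(xs), mean(ys)  # point of averages
# Least squares slope and intercept (formulas covered later in this lesson).
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar
def residuals(intercept, slope):
    """Residuals y - y_hat for the line y_hat = intercept + slope * x."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]
fit_res = residuals(b0, b1)      # residuals of the least squares line
other_res = residuals(1.0, 1.0)  # residuals of an arbitrary other line, y = 1 + x
print(abs(b0 + b1 * x_bar - y_bar) < 1e-9)   # line passes through the point of averages
print(abs(sum(fit_res)) < 1e-9)              # residuals sum to 0, up to rounding
print(pstdev(fit_res) < pstdev(other_res))   # smaller spread than the other line
```
The comparisons use a small tolerance because floating point arithmetic only gets the sum to 0 up to rounding.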
00:06:51.300 --> 00:06:53.500
This sounds like a wonderful, magical line.
00:06:53.500 --> 00:06:56.600
How do we find such a line?
00:06:56.600 --> 00:07:04.000
You might be thinking that this is sounding pretty hard and maybe we have to find the SSE
00:07:04.000 --> 00:07:10.900
for a whole bunch of different line equations and find the one with the lowest SSE.
00:07:10.900 --> 00:07:13.600
That is actually problematic.
00:07:13.600 --> 00:07:15.100
It is a good idea.
00:07:15.100 --> 00:07:19.500
It is a good conceptual idea but it is problematic, and here is why.
00:07:19.500 --> 00:07:27.500
There are an infinite number of lines.
00:07:27.500 --> 00:07:34.300
You can just change the y intercept by .0001 and get a totally different line.
00:07:34.300 --> 00:07:38.700
You can change the slope by a tiny, tiny amount and get a totally different line.
00:07:38.700 --> 00:07:45.400
There is an infinite number of lines that we would have to test.
00:07:45.400 --> 00:07:53.300
Infinite number of potential lines.
00:07:53.300 --> 00:07:58.900
We cannot find the SSE of an infinite number of lines.
00:07:58.900 --> 00:08:03.500
That is just not an option for us.
00:08:03.500 --> 00:08:18.100
Thanks to our hero Carl Gauss. He was a German mathematician, and he helped us out a lot in statistics.
00:08:18.100 --> 00:08:28.200
Carl Gauss invented this method called the method of least squares, and through Gauss's method we can easily find the slope.
00:08:28.200 --> 00:08:29.700
Here is how we do it.
00:08:29.700 --> 00:08:32.100
The slope is going to be a ratio.
00:08:32.100 --> 00:08:33.900
Slopes are always ratios.
00:08:33.900 --> 00:08:40.500
Rise/run: the ratio of change of y over change of x.
00:08:40.500 --> 00:08:44.200
This method has a similar idea to it.
00:08:44.200 --> 00:08:47.600
Here is how Carl Gauss finds the slope.
00:08:47.600 --> 00:08:52.500
Remember slope is not b sub 0, that is the intercept.
00:08:52.500 --> 00:08:54.700
It is b sub 1.
00:08:54.700 --> 00:08:56.300
B sub 1.
00:08:56.300 --> 00:09:11.900
In Gauss’s method you want to add up, take the sum of, all your x deviations, the deviations between x and its mean.
00:09:11.900 --> 00:09:26.400
That is x – x bar, each multiplied by the matching deviation of y, y – y bar.
00:09:26.400 --> 00:09:36.300
Notice that we are not using y hat, because we do not have the line yet, but we do have the center of mass.
00:09:36.300 --> 00:09:37.600
We are using that.
00:09:37.600 --> 00:09:42.500
We are finding the deviations from sort of the center of mass right.
00:09:42.500 --> 00:09:55.100
And that sum goes over the sum of (x – x bar)².
00:09:55.100 --> 00:10:04.200
You can sort of think about this as the x and y variation together over the x variation squared.
00:10:04.200 --> 00:10:12.500
When you think of rise/run you think of the y/x and you will see that here.
00:10:12.500 --> 00:10:17.600
There is the change of y and there are the changes of x.
00:10:17.600 --> 00:10:23.200
There are two changes of x and here is the change of y.
00:10:23.200 --> 00:10:32.300
This method will give you the slope of your regression line.
00:10:32.300 --> 00:10:42.400
And just as a review, remember that when we have this x here we really mean x sub i, and for y we mean y sub i.
00:10:42.400 --> 00:10:45.300
The index i goes from 1 all the way up to n.
00:10:45.300 --> 00:10:48.400
However many data points we have in our sample.
00:10:48.400 --> 00:10:59.600
It often goes without saying that we want to do this for every single data point that we have.
00:10:59.600 --> 00:11:03.600
That is Carl Gauss’s method.
00:11:03.600 --> 00:11:09.600
In order to find the slope we need to use that formula.
00:11:09.600 --> 00:11:23.100
It is the change of x, the deviations of x multiplied by the deviations of y all added up over the deviations of x².
00:11:23.100 --> 00:11:25.000
The sum of the deviations of x².
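The slope formula just described can be sketched in Python like this. The function name and the sample data are made up for illustration; the arithmetic follows the formula as stated in the lesson.
```python
def slope(xs, ys):
    """Least squares slope: sum of (x - x_bar)(y - y_bar) over sum of (x - x_bar)^2."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Multiply the deviations pair by pair, for every data point i = 1..n.
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator
# Made-up data lying exactly on y = 2x, so the slope should come out as 2.
b1 = slope([1, 2, 3], [2, 4, 6])
print(b1)
```
Note that the denominator is 0 when every x is the same, so this sketch assumes the x values are not all identical.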
00:11:25.000 --> 00:11:33.100
Let us actually do a little example here.
00:11:33.100 --> 00:11:38.500
If we had a whole bunch of x and a whole bunch of y I just put a few here.
00:11:38.500 --> 00:11:41.800
X = 1, 1.
00:11:41.800 --> 00:11:49.600
The first point is 1, 1 and the second point is 0, 0 and the third point is -1, -1.
00:11:49.600 --> 00:11:51.200
A very easy line.
00:11:51.200 --> 00:12:01.000
We already know that the line equations should be something like Y = x.
00:12:01.000 --> 00:12:13.300
Let us see if we can use Carl Gauss’s method to find the slope. I often find it useful to write down where we are going.
00:12:13.300 --> 00:12:32.400
We want the deviations of x and the deviations of y, so the sum of the deviations of x times the deviations of y, and the ratio of that sum to the sum of the deviations of x².
00:12:32.400 --> 00:12:57.100
To do this we have to find x bar and y bar, and we can easily tell here: the x values add up to 0, so their average is 0, and the y values add up to 0 as well.
00:12:57.100 --> 00:13:02.700
We already know X bar and y bar.
00:13:02.700 --> 00:13:12.500
In order to do this I’m going to have to find x - x bar.
00:13:12.500 --> 00:13:17.500
I’m going to have to draw this in a different color to make it easier.
00:13:17.500 --> 00:13:25.100
X - x bar and y - y bar.
00:13:25.100 --> 00:13:32.200
Not only that I’m going to need to know x.
00:13:32.200 --> 00:13:48.200
I need to know (x – x bar) × (y – y bar) and I'm going to need to know (x – x bar)².
00:13:48.200 --> 00:13:58.200
Let me draw some lines here.
00:13:58.200 --> 00:14:08.800
Let us get started. Because my x bar and y bar are 0, 0, this is easy for me.
00:14:08.800 --> 00:14:17.300
Let us find this difference for x deviation × y deviation.
00:14:17.300 --> 00:14:26.700
I will just multiply across: that is 1, that is 0, and that is 1.
00:14:26.700 --> 00:14:29.800
X – x bar².
00:14:29.800 --> 00:14:30.500
This is just x².
00:14:30.500 --> 00:14:33.400
That is 1, 0, and 1.
00:14:33.400 --> 00:14:43.900
I need to find the sum here and take that and put it over this one.
00:14:43.900 --> 00:14:54.900
This sum is 2 and this sum is 2 and I’m going to put that in here my b1 = 2/2 which is 1.
00:14:54.900 --> 00:14:57.300
We found our slope.
00:14:57.300 --> 00:14:59.100
Our slope is just 1.
00:14:59.100 --> 00:15:13.600
I already knew that the regression line here should be y = x, and this agrees: y = 1 × x, which is y = x.
00:15:13.600 --> 00:15:23.200
Now we know how to find slope but how do we find intercept once we have our slope.
00:15:23.200 --> 00:15:31.300
Let us use our previous example, where b1 = 1.
00:15:31.300 --> 00:15:37.100
We already know one point that falls on our regression line.
00:15:37.100 --> 00:15:45.400
That is (x bar, y bar), which is (0, 0).
00:15:45.400 --> 00:15:55.400
If we know all of those things we could find our intercept just by plugging it in.
00:15:55.400 --> 00:16:06.400
Our equation of a line in statistics is y = b naught + b sub 1 × x.
00:16:06.400 --> 00:16:12.800
All you have to do is plug in our numbers and substitute in order to find b naught.
00:16:12.800 --> 00:16:16.600
That is what we are looking for.
00:16:16.600 --> 00:16:19.200
Here is our example y, which is 0.
00:16:19.200 --> 00:16:26.100
That equals b naught, b sub 0, + 1 × 0.
00:16:26.100 --> 00:16:30.800
Here I will get b sub 0 = 0.
00:16:30.800 --> 00:16:35.000
This is definitely easy.
00:16:35.000 --> 00:16:45.800
This is just finding our missing value just by having our example y, x, and having this slope.
00:16:45.800 --> 00:16:50.500
We could just derive this formula so that in the future we will know exactly what to plug in.
00:16:50.500 --> 00:17:02.600
Instead of trying to solve for y we could just flip these things around in order to solve for b sub 0.
00:17:02.600 --> 00:17:11.600
All we have to do is move this over to that side, so that b sub 0 = y – b sub 1 × x.
00:17:11.600 --> 00:17:19.400
That is how to find b sub 0, the y intercept.
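As a small sketch of that rearranged formula, here it is in Python with the point of averages plugged in for x and y. The function name is made up for illustration; the three points and the slope of 1 are the ones from the worked example.
```python
def intercept(xs, ys, b1):
    """Solve y_bar = b0 + b1 * x_bar for b0, using the point of averages."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    return y_bar - b1 * x_bar
# The three points from the worked example: (1, 1), (0, 0), (-1, -1).
xs = [1, 0, -1]
ys = [1, 0, -1]
b0 = intercept(xs, ys, b1=1.0)  # b1 = 1 is the slope found in the worked example
print(b0)
```
Any point known to be on the line would work here; the point of averages is simply the one we always have.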
00:17:19.400 --> 00:17:21.800
Let us do some more examples.
00:17:21.800 --> 00:17:24.200
Here is example 1.
00:17:24.200 --> 00:17:28.700
Pretend that these are 3 different kinds of pizza.
00:17:28.700 --> 00:17:33.800
Let us say this is medium size pizza.
00:17:33.800 --> 00:17:39.400
Let us say that this is giant size pizza.
00:17:39.400 --> 00:17:47.900
It has 100 grams of fat per pizza but the cost is $17.50.
00:17:47.900 --> 00:17:54.400
The double size let us say is 110 grams of fat per pizza but the cost is $18.00.
00:17:54.400 --> 00:18:01.700
Pizza x has 120 grams of fat but the cost is $20.00.
00:18:01.700 --> 00:18:12.100
Maybe we have a feeling that fat makes the taste better, or raises the cost.
00:18:12.100 --> 00:18:21.700
The question is which of the following equations fits this data the best.
00:18:21.700 --> 00:18:29.500
In order to solve this problem we have to find the sum of squared errors for each of these equations.
00:18:29.500 --> 00:18:34.200
We are not sure if any of these equations is the regression line.
00:18:34.200 --> 00:18:40.200
We are just trying to find the best equation out of the 3 that we have.
00:18:40.200 --> 00:18:58.600
Which of these fits the data the best, meaning which equation has the lowest error, the lowest sum of squared errors?
00:18:58.600 --> 00:19:09.400
If you pull up the examples in the Excel file and click on example 1, the data are already right here.
00:19:09.400 --> 00:19:14.100
Here are our 3 pizzas with their fat as well as their cost.
00:19:14.100 --> 00:19:19.400
It seems just from eyeballing it that there is a positive trend.
00:19:19.400 --> 00:19:25.000
As fat goes up the cost goes up.
00:19:25.000 --> 00:19:43.100
Let us go ahead and try our first equation that we are given.
00:19:43.100 --> 00:19:52.400
The equation is y = 4.75, that is the intercept, + .1x.
00:19:52.400 --> 00:20:00.000
I separated that into the intercept and the slope because we are going to need those numbers.
00:20:00.000 --> 00:20:02.200
Here is the fat, here is the cost.
00:20:02.200 --> 00:20:06.000
Let us find the predicted cost or y hat.
00:20:06.000 --> 00:20:16.300
In order to find y hat all we have to do is plug in our x into our y equation.
00:20:16.300 --> 00:20:22.800
That would be this value, 4.75, and that will not change.
00:20:22.800 --> 00:20:36.400
I’m going to lock it in place and add that to b1 × x.
00:20:36.400 --> 00:20:50.400
B1 is not going to change either so I’m going to lock that in place.
00:20:50.400 --> 00:20:54.200
We do want b12 to keep changing.
00:20:54.200 --> 00:21:04.500
I’m going to take that predicted cost and I’m just copying and pasting.
00:21:04.500 --> 00:21:12.700
These predicted costs are all less than the actual costs.
00:21:12.700 --> 00:21:21.200
Here is what our residuals are going to be.
00:21:21.200 --> 00:21:26.400
The residuals are the actual cost – the predicted cost.
00:21:26.400 --> 00:21:32.400
All of our residuals are going to be positive.
00:21:32.400 --> 00:21:46.500
That is the case where all of our actual data are above our prediction line, and because of that we know that this is not a great regression line.
00:21:46.500 --> 00:21:50.100
Maybe it still has the smallest SSE.
00:21:50.100 --> 00:21:58.600
We have our residuals and what I’m going to do is take this residual and square it.
00:21:58.600 --> 00:22:11.400
I can find all my squared residuals, and then in order to get the sum of squared residuals I will just add them all up, and so I get 23.1875 as my sum of squared errors.
00:22:11.400 --> 00:22:23.000
Who knows, maybe that is the lowest one, we will see.
00:22:23.000 --> 00:22:31.300
Here I put in the data for the next equation.
00:22:31.300 --> 00:22:40.700
It is y = 8 + .025x.
00:22:40.700 --> 00:22:49.200
I separated out into the intercept versus the slope and let us find the sum of squared error.
00:22:49.200 --> 00:23:11.600
To find the predicted cost I need to add this, take my intercept, lock in place and add that to my slope × x.
00:23:11.600 --> 00:23:21.000
I’m going to lock my slope in place as well.
00:23:21.000 --> 00:23:28.300
And so right now we are a little bit low, still low, still really low.
00:23:28.300 --> 00:23:40.400
I can see that because these predicted costs are further off than our previous predicted costs, I’m going to guess the sum of squared errors is going to be considerably larger.
00:23:40.400 --> 00:23:44.400
Let us find the residual.
00:23:44.400 --> 00:23:48.200
The residual is the data minus the predicted.
00:23:48.200 --> 00:24:07.000
The data minus the predicted and then all I do is square that residual and then sum them all up
00:24:07.000 --> 00:24:14.100
because all of these predicted costs were further off than the previous predicted costs.
00:24:14.100 --> 00:24:17.700
The first equation is much better than this equation.
00:24:17.700 --> 00:24:30.800
Now let us test out the third one.
00:24:30.800 --> 00:24:36.800
I hope we did not see those answers and let us see what the predicted costs look like.
00:24:36.800 --> 00:24:59.300
We want to add our intercept to our slope, locked in place like that, × the x.
00:24:59.300 --> 00:25:08.600
Excel will automatically do order of operations, so I do not have to put parentheses around the multiplication first.
00:25:08.600 --> 00:25:11.500
Let us see, this is actually close to the cost.
00:25:11.500 --> 00:25:17.700
It is off by 25 cents, just below.
00:25:17.700 --> 00:25:20.300
Let us see the next one.
00:25:20.300 --> 00:25:23.200
This one is off in the opposite direction.
00:25:23.200 --> 00:25:27.200
It is off in the negative direction.
00:25:27.200 --> 00:25:30.100
This last one is off in the positive direction again.
00:25:30.100 --> 00:25:35.200
These seem like pretty good predictions, where we are getting pretty close to the cost.
00:25:35.200 --> 00:25:37.700
Let us find out what the residual is.
00:25:37.700 --> 00:25:40.300
Here we should have a mix of residuals.
00:25:40.300 --> 00:25:43.000
Some positive and some are negative.
00:25:43.000 --> 00:25:47.200
So costs - the predicted.
00:25:47.200 --> 00:25:57.300
We have 2 positive ones and one negative one, and they balance each other out quite nicely because the positive ones are smaller,
00:25:57.300 --> 00:26:00.200
but the negative one is a little bit bigger.
00:26:00.200 --> 00:26:05.700
Let us square this.
00:26:05.700 --> 00:26:22.100
Here if we sum that up we get .375 and that is considerably smaller error than 23 and 182.
00:26:22.100 --> 00:26:29.700
I can say that the third equation is the best fitting line.
00:26:29.700 --> 00:26:33.000
This one is the best one.
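The spreadsheet comparison in this example can be sketched in Python as follows. The data are the three pizzas from the example. The first intercept is taken as 4.75, the value that reproduces the 23.1875 quoted above, and the third equation is assumed to be y = 4.75 + .125x, the line found in example 2, since it is not read out here.
```python
fat = [100, 110, 120]          # grams of fat per pizza
cost = [17.50, 18.00, 20.00]   # cost in dollars
def sse(b0, b1):
    """Sum of squared residuals (cost - predicted cost) for y_hat = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(fat, cost))
# (intercept, slope) for each candidate equation.
candidates = {
    "first": (4.75, 0.1),
    "second": (8.0, 0.025),
    "third": (4.75, 0.125),    # assumed: the line found in example 2
}
results = {name: sse(b0, b1) for name, (b0, b1) in candidates.items()}
best = min(results, key=results.get)  # candidate with the lowest SSE
for name, value in results.items():
    print(name, round(value, 4))
print("best:", best)
```
Having the lowest SSE of three candidates does not by itself make a line the regression line; it only makes it the best of these three.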
00:26:33.000 --> 00:26:35.000
Here is example 2.
00:26:35.000 --> 00:26:45.100
Now it gives us the same data and asks us to find the regression line for these data points and then interpret it.
00:26:45.100 --> 00:26:54.400
If we go back to our Excel file and click on example 2 then you will see the data here for you.
00:26:54.400 --> 00:27:03.300
First thing we probably want to do is figure out all the different things we would like to get
00:27:03.300 --> 00:27:12.800
and I’m just going to use a little bit of a shorthand instead of writing x – x bar.
00:27:12.800 --> 00:27:15.600
I’m going to write deviations of x.
00:27:15.600 --> 00:27:29.300
The deviations of x and I'm also going to need deviations of y and then I'm going to need to multiply the deviations of x × the deviations of y.
00:27:29.300 --> 00:27:35.300
I’m also going to find deviations of x².
00:27:35.300 --> 00:27:37.500
These are the four things I need.
00:27:37.500 --> 00:27:46.000
In order to get these, I need X bar.
00:27:46.000 --> 00:27:53.200
Here I’m going to put averages and you need to find X bar and y bar.
00:27:53.200 --> 00:27:54.700
And that is right here.
00:27:54.700 --> 00:28:08.200
Here I’m going to put average and find my X bar which is that and also just copy and paste that over to find y bar, the average cost.
00:28:08.200 --> 00:28:15.100
My point of averages is 110 and 18.5.
00:28:15.100 --> 00:28:21.600
Let us find all the deviations of x in order to find slope.
00:28:21.600 --> 00:28:28.500
The deviations of x is x- my x bar.
00:28:28.500 --> 00:28:51.600
Here I’m going to lock my X bar in place and then I can just copy and paste all the way down.
00:28:51.600 --> 00:29:05.100
Let us also find the deviations of y which is costs minus the average cost.
00:29:05.100 --> 00:29:08.200
And then I could just copy and paste that all the way down as well.
00:29:08.200 --> 00:29:17.400
Notice that my deviations of x and deviations of y they are like helping us toward that lowering
00:29:17.400 --> 00:29:26.500
of the residual idea because the deviations of x if you look at all of them they are very balanced.
00:29:26.500 --> 00:29:31.800
Half of them are one side of the average and half of them are the other.
00:29:31.800 --> 00:29:38.700
By definition that is what average means, and the same goes for my deviations of y: half of them are on the negative side
00:29:38.700 --> 00:29:43.900
and half of them are on the positive side, and they balance one another out.
00:29:43.900 --> 00:29:53.300
Now let us multiply the deviations of x by the deviations of y, and notice I am doing this for every data point.
00:29:53.300 --> 00:29:58.800
Here I know I need to find sum.
00:29:58.800 --> 00:30:08.900
I will sum them here.
00:30:08.900 --> 00:30:11.600
That is my sum.
00:30:11.600 --> 00:30:17.300
Let me actually color these different colors so that we do not get confused.
00:30:17.300 --> 00:30:28.500
Let us also find our deviations of x² and let us find the sum of those.
00:30:28.500 --> 00:30:39.100
Here are the two sums that we need in order to find b sub 1.
00:30:39.100 --> 00:30:50.300
Finding b sub 1 we need to find the ratio between this and that.
00:30:50.300 --> 00:30:55.600
Our b sub 1 equals .125.
00:30:55.600 --> 00:31:02.100
Now that we know b sub 1 we can easily find the b sub 0.
00:31:02.100 --> 00:31:15.100
Let me color this a different color, and remember the formula for b sub 0 is just y – b sub 1 × x.
00:31:15.100 --> 00:31:22.500
I already have an X and Y, my point of averages.
00:31:22.500 --> 00:31:24.500
I forgot to put equal sign.
00:31:24.500 --> 00:31:41.500
y – b sub 1 × X and I get 4.75.
00:31:41.500 --> 00:31:51.000
In order to find my equation for the line all we do is take the two values and put them into my actual line equation.
00:31:51.000 --> 00:32:03.700
In order to find my predicted y I would take 4.75 and add that to .125 × x.
00:32:03.700 --> 00:32:10.900
That is my regression line for this set of data.
00:32:10.900 --> 00:32:14.500
In the previous example this was actually choice c.
00:32:14.500 --> 00:32:20.300
It actually happened to be the regression line as well.
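The column-by-column spreadsheet work in this example can be sketched in Python like this. The data are the three pizzas from the example; the variable names and the 105 gram pizza at the end are made up for illustration.
```python
fat = [100, 110, 120]          # x: grams of fat per pizza
cost = [17.50, 18.00, 20.00]   # y: cost in dollars
x_bar = sum(fat) / len(fat)    # 110
y_bar = sum(cost) / len(cost)  # 18.5
# The four spreadsheet columns: deviations of x, deviations of y,
# their products, and the squared deviations of x.
x_dev = [x - x_bar for x in fat]
y_dev = [y - y_bar for y in cost]
products = [dx * dy for dx, dy in zip(x_dev, y_dev)]
x_dev_sq = [dx ** 2 for dx in x_dev]
b1 = sum(products) / sum(x_dev_sq)  # slope: ratio of the two sums
b0 = y_bar - b1 * x_bar             # intercept from the point of averages
print(b1, b0)
# Hypothetical use of the line: predicted cost of a pizza with 105 grams of fat.
predicted = b0 + b1 * 105
print(predicted)
```
Each list corresponds to one spreadsheet column, so the two sums in the slope line are the two colored totals from the worksheet.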
00:32:20.300 --> 00:32:24.700
Here is the kicker though: we need to interpret this.
00:32:24.700 --> 00:32:31.800
It is not good enough for us to just have this, we need to know what this means.
00:32:31.800 --> 00:32:40.500
In order to get y, we are converting everything from fat into cost.
00:32:40.500 --> 00:32:45.600
You can think of the Y intercept as a base cost.
00:32:45.600 --> 00:32:57.300
$4.75 seems to be the base cost for these pizzas, and then for every gram of fat you add 12 ½ cents.
00:32:57.300 --> 00:33:07.600
If you have 1 g of fat, presumably you would just add 12 ½ cents to this pizza, and perhaps that pizza would not taste very good.
00:33:07.600 --> 00:33:10.700
It would be probably a lot healthier for you.
00:33:10.700 --> 00:33:31.000
If you add 100 grams of fat, and each of those grams of fat is worth .125, then you multiply those and add that to your base cost.
00:33:31.000 --> 00:33:55.200
In some ways the base cost and the slope are giving you an idea of how much every gram of fat costs.
00:33:55.200 --> 00:33:59.900
Because notice that as grams of fat goes up, the cost goes up.
00:33:59.900 --> 00:34:07.200
This data is actually wrong.
00:34:07.200 --> 00:34:17.900
This would be very cheap pizza.
00:34:17.900 --> 00:34:30.200
This equation is actually helping us to get an idea of how much each gram of fat is costing and exactly what the relationship is between grams of fat and the cost.
00:34:30.200 --> 00:34:32.400
That is the goal of the regression line.
00:34:32.400 --> 00:34:37.800
Example 3: for these 40 data points, summarize the scatter plot and then find the regression line.
00:34:37.800 --> 00:34:47.000
Presumably these data points are in the Excel file and remember how to summarize the scatter plot we are going to be doing that.
00:34:47.000 --> 00:34:55.900
We have to bring up that Excel file and click on example 3 at the bottom.
00:34:55.900 --> 00:35:06.900
This data looks sort of familiar to us, but now they are giving us different labels for these variables.
00:35:06.900 --> 00:35:15.300
Here it says student faculty ratio on the x-axis and cost per unit on the y axis.
00:35:15.300 --> 00:35:25.400
I'm presuming that each of these cases is something like a school, maybe a university.
00:35:25.400 --> 00:35:31.200
When the student faculty ratio is very high, then it is cheap to enroll at the schools.
00:35:31.200 --> 00:35:33.000
It is cheap to take units there.
00:35:33.000 --> 00:35:37.000
But when the student faculty ratio is very low then it is more expensive.
00:35:37.000 --> 00:35:41.100
That is sort of what it looks like.
00:35:41.100 --> 00:35:42.400
Number 1.
00:35:42.400 --> 00:35:43.500
What are our cases?
00:35:43.500 --> 00:35:46.700
Our cases are probably something like schools or universities.
00:35:46.700 --> 00:35:51.800
Our variables are the student faculty ratio and cost per unit.
00:35:51.800 --> 00:36:08.500
Number two in summarizing the scatter plot it seems as the general shape is linear roughly so we can just stick with that.
00:36:08.500 --> 00:36:18.100
Number 3 the trend seems to be a negative trend where as one goes up, as ratio goes up the cost goes down.
00:36:18.100 --> 00:36:24.100
As the ratio goes down, the cost goes up.
00:36:24.100 --> 00:36:33.300
Number 4, what does the strength look like?
00:36:33.300 --> 00:36:38.200
It is sort of like maybe small to medium.
00:36:38.200 --> 00:36:45.600
That is harder to eyeball. And number 5, potential explanations.
00:36:45.600 --> 00:36:57.900
Well, it might be that in order to provide more faculty per student, a better student faculty ratio, you need more faculty or you need fewer students.
00:36:57.900 --> 00:37:03.000
More faculty costs more money, and with fewer students it costs more for each student.
00:37:03.000 --> 00:37:12.200
That makes sense but it could be when you have a high cost you want to keep the student faculty ratio low.
00:37:12.200 --> 00:37:21.400
Or maybe there is some third variable, like prestige, that keeps this relationship going.
00:37:21.400 --> 00:37:26.300
We summarized the scatter plot, so now we have to find the regression line.
00:37:26.300 --> 00:37:34.100
In order to find the regression line we do not really need this chart very much.
00:37:34.100 --> 00:37:38.100
I’m just going to make it real small and put it over here.
00:37:38.100 --> 00:37:46.400
It is useful to look at later just to eyeball whether our regression line makes sense.
00:37:46.400 --> 00:37:56.000
But let us go ahead and take our steps through Carl Gauss’s method of finding b sub 1.
00:37:56.000 --> 00:38:19.800
I'm going to write here X deviations, Y deviations, X deviations × Y deviations and then X deviations².
00:38:19.800 --> 00:38:28.800
And this is when Excel comes in real handy, because it would be really sort of crazy to do all of these by hand.
00:38:28.800 --> 00:38:41.100
Just to make life easier, let us go ahead and find x bar and y bar.
00:38:41.100 --> 00:38:43.700
It does not matter where you find this.
00:38:43.700 --> 00:38:46.800
Just put it somewhere easy for you to keep track of.
00:38:46.800 --> 00:38:56.200
I’m going to find the average of my x and just use my student faculty ratio as my x.
00:38:56.200 --> 00:39:13.500
The average student faculty ratio is about 20 students per faculty, and if I just copy that over, our average cost is about $366 per unit.
00:39:13.500 --> 00:39:42.400
Let us find the x deviations, so that would be my x – x bar, and I want x bar locked in place, and then I’m also going to find my y deviations.
00:39:42.400 --> 00:40:01.700
That is y – y bar, with y bar locked in place, and then I multiply my x deviations and y deviations.
00:40:01.700 --> 00:40:06.200
I’m also going to find x deviations².
00:40:06.200 --> 00:40:17.900
Once I have this I can actually just copy and paste all four of these values all the way down for all 40 data points.
00:40:17.900 --> 00:40:26.600
If you take a look, approximately half of the x deviations should be negative and approximately half positive.
00:40:26.600 --> 00:40:37.100
And same with the Y deviations some are positive and then some are negative to balance that out.
00:40:37.100 --> 00:40:42.200
We know we need to find the sum.
00:40:42.200 --> 00:40:52.700
We need to find the sum of our x deviations × y deviations and just to help us out I’m going to pull down this little bar here.
00:40:52.700 --> 00:41:03.700
You see in this corner there is a little sandwich looking thing I pulled it down in order to lock that row in place and so that row does not move.
00:41:03.700 --> 00:41:08.100
Move that down and I know what column I am in.
00:41:08.100 --> 00:41:36.800
I want to sum of all of these together and then I'm also going to sum all of these together and I'm just going to color all of this in a different color so we know.
00:41:36.800 --> 00:41:41.500
Let us find b sub 1.
00:41:41.500 --> 00:41:50.500
b sub 1 is the ratio of the sum of the x deviations × y deviations over the sum of the squared x deviations.
00:41:50.500 --> 00:42:02.600
Our slope is negative, and that makes sense because we had a negative trend. It is -21.51.
00:42:02.600 --> 00:42:07.100
Given that let us find b sub 0.
00:42:07.100 --> 00:42:14.200
We know that in order to find b sub 0, we need to use x bar and y bar as our example point.
00:42:14.200 --> 00:42:35.600
I’m going to take y bar – b sub 1 × x bar.
00:42:35.600 --> 00:42:46.100
Our y-intercept is 795.21.
00:42:46.100 --> 00:42:58.000
I’m just going to pull this over to this side, and here I can now talk about the regression line.
00:42:58.000 --> 00:43:09.400
The regression line would be y equals, and we put the intercept first, 795.21.
00:43:09.400 --> 00:43:20.800
Instead of plus, we can just put a minus because our slope is negative, so it is minus 21.51 × x.
00:43:20.800 --> 00:43:33.400
This is our regression line, and if you want to interpret it, the idea is that the base cost is around $800,
00:43:33.400 --> 00:43:45.000
and for each one-unit increase in the student-faculty ratio, you get a deduction of about $21.50.
00:43:45.000 --> 00:43:54.600
As the ratio goes up and up and up you get a little deduction every time.
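The spreadsheet steps above (averages, deviations, the two sums, then b sub 1 and b sub 0) can be sketched in Python. This is only a minimal sketch: the function name `least_squares_line` and the five data points are made up for illustration and are not the 40 data points from the lesson, so the numbers will not match 795.21 and -21.51.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) for the least squares line y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n                      # average of x
    y_bar = sum(ys) / n                      # average of y
    x_dev = [x - x_bar for x in xs]          # x deviations
    y_dev = [y - y_bar for y in ys]          # y deviations
    sum_xy = sum(dx * dy for dx, dy in zip(x_dev, y_dev))  # sum of x dev * y dev
    sum_xx = sum(dx * dx for dx in x_dev)                  # sum of squared x dev
    b1 = sum_xy / sum_xx                     # slope
    b0 = y_bar - b1 * x_bar                  # intercept, from the point of averages
    return b0, b1


# Hypothetical data: student-faculty ratios and cost per unit (not from the lesson).
ratios = [10, 15, 20, 25, 30]
costs = [580, 470, 370, 250, 160]

b0, b1 = least_squares_line(ratios, costs)
print(f"y = {b0:.2f} + ({b1:.2f}) * x")
```

Just like in the spreadsheet, the intercept is found last, because it needs the slope together with x bar and y bar.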
00:43:54.600 --> 00:43:56.400
Here is example 4.
00:43:56.400 --> 00:44:01.700
Remember that the regression line must pass through the point of averages.
00:44:01.700 --> 00:44:13.200
That is one of the quantitative features of regression lines, and the mean of the residuals should be approximately equal to 0.
00:44:13.200 --> 00:44:17.800
One of these actually causes the other.
00:44:17.800 --> 00:44:23.900
It is either that passing through the point of averages automatically makes the mean of the residuals 0
00:44:23.900 --> 00:44:31.700
or that the mean of the residuals being 0 causes the line to pass through the point of averages.
00:44:31.700 --> 00:44:38.200
This problem is basically supposed to explore which one causes the other.
00:44:38.200 --> 00:44:45.600
Examine the mean of the residuals for the regression line, which definitely passes through the point of averages.
00:44:45.600 --> 00:44:57.200
Then take an example line that does not pass through the point of averages, and we should see whether, in that case, the mean of the residuals is still 0.
00:44:57.200 --> 00:45:03.700
Or take an example line that does pass through the point of averages but has the wrong slope.
00:45:03.700 --> 00:45:10.900
That is, a line with any slope that passes through the point of averages but is not the regression line.
00:45:10.900 --> 00:45:21.400
And then finally we want to discuss the question: is it good enough to define the regression line as the line that makes the sum or mean of the residuals 0?
00:45:21.400 --> 00:45:23.200
Let us see.
00:45:23.200 --> 00:45:33.200
If you click on example 4, I put back the pizza example that we covered at the very beginning.
00:45:33.200 --> 00:45:48.600
Here I put in our regression line, which has $4.75 as the base rate and a 12 ½ cent increase for every gram of fat.
00:45:48.600 --> 00:46:01.200
I already calculated for you the predicted costs, the residuals, and the squared residuals because we actually already did this in the first problem.
00:46:01.200 --> 00:46:07.300
The only thing that I have changed is that I also provided the sum of the residuals.
00:46:07.300 --> 00:46:11.000
Here we find that the sum of the residuals is 0.
00:46:11.000 --> 00:46:13.000
Let us think about this regression line.
00:46:13.000 --> 00:46:18.500
It definitely passes through the point of averages, and the sum of the residuals is 0.
00:46:18.500 --> 00:46:32.500
This regression line definitely fits our quantitative definition of a regression line, and it has a very low sum of squared residuals.
00:46:32.500 --> 00:46:40.300
Now given this point let us think about a line that does not pass through the point of averages.
00:46:40.300 --> 00:46:57.500
Now, if we take our line and shift it slightly up or down, it will not pass through the point of averages, because parallel lines never intersect.
00:46:57.500 --> 00:47:09.500
We can keep the same slope, .125, and just change our b sub 0 very slightly.
00:47:09.500 --> 00:47:11.500
We could just change the intercept very slightly.
00:47:11.500 --> 00:47:32.800
Maybe we move the line up just a little bit, so we get 4.8 instead of 4.75, and here our line is y = 4.8 + .125 × x.
00:47:32.800 --> 00:47:36.400
Let us find the squared residuals and all that stuff.
00:47:36.400 --> 00:48:00.400
The predicted costs would be b sub 0 + b sub 1 × x, with b sub 0 and b sub 1 locked in place.
00:48:00.400 --> 00:48:09.400
Notice that our predicted costs are very, very close because our line is not that far off.
00:48:09.400 --> 00:48:11.400
Let us calculate the residuals.
00:48:11.400 --> 00:48:24.500
The actual cost minus the predicted costs and let us also calculate the squared residuals.
00:48:24.500 --> 00:48:29.400
I am just squaring each of my residuals, and they are being added up down here.
00:48:29.400 --> 00:48:37.200
Notice that although these sums of squared errors are very close, this one is just slightly bigger than that one.
00:48:37.200 --> 00:48:40.100
It is a slightly worse fit than the regression line.
00:48:40.100 --> 00:48:47.200
The regression line is a better fit, but let us check and see whether our residuals add up to 0 for the new line.
00:48:47.200 --> 00:48:48.400
It does not.
00:48:48.400 --> 00:48:54.500
The sum is close to 0, but it does not quite add up to 0.
00:48:54.500 --> 00:49:07.800
For lines that do not quite pass through the point of averages, even though they are only a little bit off, the residuals do not add up to 0.
00:49:07.800 --> 00:49:12.400
Now that we have all this we can actually just change it.
00:49:12.400 --> 00:49:16.400
Let us move the regression line down just a little bit.
00:49:16.400 --> 00:49:23.600
Let us just move it down slightly and make this 4.5 instead of 4.75.
00:49:23.600 --> 00:49:25.100
What if we do that?
00:49:25.100 --> 00:49:28.300
Well again it is not that far off.
00:49:28.300 --> 00:49:42.600
It is still a pretty low sum of squared errors, but the regression line is still the lowest, and the residuals still do not add up to 0.
00:49:42.600 --> 00:49:48.200
If it does not pass through the point of averages then it is off by a little bit.
00:49:48.200 --> 00:49:59.200
The other thing we could do is we could keep the intercept the same and instead we could change the slope by a little bit.
00:49:59.200 --> 00:50:03.600
If we do that, then we know it does not pass through the point of averages.
00:50:03.600 --> 00:50:15.300
When we do that, what we find once again is that the sum of squared residuals is larger than it is for our regression line.
00:50:15.300 --> 00:50:18.600
Our sum of residuals still does not add up to 0.
00:50:18.600 --> 00:50:28.900
We tried a couple of lines, and whenever a line does not pass through the point of averages, we see that the residuals do not add up to 0.
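The same experiment can be sketched in Python. The data here are hypothetical, chosen only so that the least squares line comes out to the lesson's y = 4.75 + .125 × x; they are not the actual pizza data, and the helper name `residual_summary` is made up for this sketch.

```python
def residual_summary(xs, ys, b0, b1):
    """Return (sum of residuals, sum of squared residuals) for y = b0 + b1 * x."""
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]  # actual - predicted
    return sum(residuals), sum(r * r for r in residuals)


# Hypothetical data: grams of fat and cost (not the lesson's pizza data).
fat = [2, 4, 6, 8, 10]
cost = [5.10, 5.05, 5.70, 5.55, 6.10]

# The regression line, plus three lines that miss the point of averages.
candidates = {
    "regression line": (4.75, 0.125),
    "intercept raised": (4.80, 0.125),
    "intercept lowered": (4.50, 0.125),
    "slope changed": (4.75, 0.150),
}

results = {
    name: residual_summary(fat, cost, b0, b1)
    for name, (b0, b1) in candidates.items()
}

for name, (total, sse) in results.items():
    print(f"{name:18s} sum of residuals = {total:+.3f}   SSE = {sse:.4f}")
```

Only the regression line should show a sum of residuals that is 0 (up to rounding), and it should also show the smallest sum of squared errors of the four.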
00:50:28.900 --> 00:50:32.300
Now let us talk about the flip side.
00:50:32.300 --> 00:50:38.500
Consider a line that does pass through the point of averages but is still not the regression line.
00:50:38.500 --> 00:50:46.800
Well, we need to find one that passes through the point of averages but has the wrong slope.
00:50:46.800 --> 00:50:57.700
It is easiest to start from our actual point of averages and figure out a line that passes through there but has a different slope.
00:50:57.700 --> 00:50:59.500
You can pick any slope you want.
00:50:59.500 --> 00:51:01.400
I will pick a slope of 5.
00:51:01.400 --> 00:51:06.200
So y goes up 5 for every 1 in x.
00:51:06.200 --> 00:51:09.300
Let us find b sub 0.
00:51:09.300 --> 00:51:20.300
We can just use the same formula we have been using and plug in our values for the point of averages.
00:51:20.300 --> 00:51:30.600
That would be y bar – x bar × b1.
00:51:30.600 --> 00:51:33.500
Our B1 is right next to it.
00:51:33.500 --> 00:51:41.300
This is a line that definitely passes through the point of averages but obviously has the wrong slope.
00:51:41.300 --> 00:51:44.000
Let us find the predicted costs.
00:51:44.000 --> 00:51:46.400
Remember, this is a line that is totally made up.
00:51:46.400 --> 00:51:49.500
The predicted costs might be way off.
00:51:49.500 --> 00:52:16.600
The predicted costs would be the intercept, locked in place, + b1 × x, with b1 locked in place as well.
00:52:16.600 --> 00:52:21.100
We see that the predicted costs are fairly far off.
00:52:21.100 --> 00:52:28.300
We get -$31 here; this one, 18, is pretty close, but this one, 68.5, is pretty far off.
00:52:28.300 --> 00:52:30.200
Now let us find the residuals.
00:52:30.200 --> 00:52:36.000
The actual cost minus the predicted cost.
00:52:36.000 --> 00:52:43.400
And finally, let us find the squared residuals.
00:52:43.400 --> 00:52:51.700
Notice that the sum of squared residuals is very, very far off: 4,753.
00:52:51.700 --> 00:52:53.400
It is pretty off.
00:52:53.400 --> 00:52:54.800
We know that this is not a great line.
00:52:54.800 --> 00:52:56.500
It is not a well-fitting line.
00:52:56.500 --> 00:53:01.600
These other lines actually fit better, but let us check the sum of the residuals.
00:53:01.600 --> 00:53:04.900
What does that add up to be?
00:53:04.900 --> 00:53:15.400
The sum of the residuals is 0.
00:53:15.400 --> 00:53:21.500
That is just because this line passes through the point of averages.
00:53:21.500 --> 00:53:29.100
Remember, the intercept behind these residuals is always calculated using x bar and y bar.
00:53:29.100 --> 00:53:40.400
It actually makes sense that as long as the line passes through the point of averages, the sum of the residuals is going to be 0.
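The reason can be written out in a couple of lines. This is a short derivation sketch using the lesson's b sub 0 and b sub 1, with n standing for the number of data points (n is notation added here, not something named in the lesson):

```latex
\sum_{i=1}^{n} \bigl( y_i - (b_0 + b_1 x_i) \bigr)
  = n\bar{y} - n b_0 - b_1 n \bar{x}
  = n \bigl( \bar{y} - b_0 - b_1 \bar{x} \bigr)
```

If the line passes through the point of averages, then b_0 = \bar{y} - b_1 \bar{x}, so the bracket is 0 and the sum of the residuals is 0 for any slope b_1. If the line misses the point of averages, the bracket is not 0, and neither is the sum.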
00:53:40.400 --> 00:53:47.700
Now that we have all of this setup with all our nice formulas we can actually put in any slope.
00:53:47.700 --> 00:53:50.800
Let us put in -.1.
00:53:50.800 --> 00:53:58.500
It will find the b sub 0, and this line passes exactly through the point of averages.
00:53:58.500 --> 00:54:05.900
Even though our sum of squared residuals has improved, our residuals still add up to 0.
00:54:05.900 --> 00:54:19.100
They add up to 0 even though it is not the regression line. Let us try another one: -.00035.
00:54:19.100 --> 00:54:26.500
Excel will display it this way just because there are too many decimal places for it to show you, but still, you get the idea.
00:54:26.500 --> 00:54:35.200
Although it looks like a sort of crazy number, this notation means that you need to move the decimal point 18 places to the left.
00:54:35.200 --> 00:54:38.800
That is very, very close to 0.
00:54:38.800 --> 00:54:43.700
Let us try another one: 500.
00:54:43.700 --> 00:54:50.200
Once again we see that the sum of the residuals is 0.
00:54:50.200 --> 00:55:00.900
These are obviously not very good lines. They are not good regression lines because the sums of squared residuals are terribly, terribly large.
00:55:00.900 --> 00:55:10.500
But the sum of the residuals is 0 as long as the line passes through the point of averages.
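Trying different slopes through the point of averages can be sketched in Python as well. The data are hypothetical (not the lesson's pizza data), and the helper name `intercept_through_averages` is made up for this sketch. The slopes are the ones tried in the lesson.

```python
def intercept_through_averages(xs, ys, slope):
    """Return the b0 that forces y = b0 + slope * x through (x bar, y bar)."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    return y_bar - slope * x_bar


# Hypothetical data: grams of fat and cost (not the lesson's pizza data).
fat = [2, 4, 6, 8, 10]
cost = [5.10, 5.05, 5.70, 5.55, 6.10]

totals = {}  # slope -> sum of residuals
sses = {}    # slope -> sum of squared residuals

for slope in (5, -0.1, -0.00035, 500):
    b0 = intercept_through_averages(fat, cost, slope)
    residuals = [y - (b0 + slope * x) for x, y in zip(fat, cost)]
    totals[slope] = sum(residuals)
    sses[slope] = sum(r * r for r in residuals)
    # Scientific notation, because the sum may be a tiny rounding leftover.
    print(f"slope {slope:>9}: sum of residuals = {totals[slope]:.1e}, "
          f"SSE = {sses[slope]:.2f}")
```

The sums may print as tiny numbers in scientific notation rather than an exact 0. That is floating-point rounding, the same kind of thing that makes Excel show a value with 18 decimal places, and it does not mean the residuals fail to cancel.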
00:55:10.500 --> 00:55:13.800
Let us go back to example 4.
00:55:13.800 --> 00:55:28.600
Here we have seen the mean of the residuals, or the sum of the residuals, which is a similar idea, for the regression line, and the mean of the residuals equals 0.
00:55:28.600 --> 00:55:38.500
For an example line that does not pass through the point of averages, the mean of the residuals does not equal 0.
00:55:38.500 --> 00:55:44.700
Then there is an example line that does pass through the point of averages but has the wrong slope.
00:55:44.700 --> 00:55:52.800
Here we find that the mean of the residuals once again equals 0.
00:55:52.800 --> 00:55:58.800
Is it good enough to define the regression line as the line that makes the sum or mean of the residuals 0?
00:55:58.800 --> 00:56:10.000
No, that is not good enough, because any line that passes through the point of averages will have a sum or mean of the residuals of 0.
00:56:10.000 --> 00:56:14.500
Passing through the point of averages is really what causes the mean of the residuals to be 0.
00:56:14.500 --> 00:56:18.700
We also need to have all those other rules.
00:56:18.700 --> 00:56:29.000
For instance, the other rule, that the sum of squared errors is the lowest for the regression line, definitely has to be there.
00:56:29.000 --> 00:56:33.300
That is it for calculating regressions using the least squares method.
00:56:33.300 --> 00:56:36.000
See you next time on www.educator.com.