WEBVTT
00:00:00.000 --> 00:00:02.400
Hi and welcome to www.educator.com.
00:00:02.400 --> 00:00:09.800
We are going to talk about the difference between r and r².
00:00:09.800 --> 00:00:15.300
First I’m going to introduce the quantity r² and why we need to understand it.
00:00:15.300 --> 00:00:19.400
Why can we not just square r and say that is r²?
00:00:19.400 --> 00:00:22.600
We want to know the meaning of r².
00:00:22.600 --> 00:00:30.700
In order to get to the meaning of r² we have to understand that the sum of squared differences actually splits apart into different pieces.
00:00:30.700 --> 00:00:35.600
We are going to learn how to parse the different parts of the sum of squared differences.
00:00:35.600 --> 00:00:39.700
Then we are going to talk about what r² means for a very strong correlation.
00:00:39.700 --> 00:00:46.700
Then what r² may be for a very weak correlation.
00:00:46.700 --> 00:00:56.200
One of the reasons why, practically, you will need to understand r² is that often when you do regression on the computer,
00:00:56.200 --> 00:01:04.000
either in SPSS or Stata or any of these statistics packages, they will often give you r²
00:01:04.000 --> 00:01:10.000
as one of the outputs, and you might be looking at it and wondering: why are we given r²?
00:01:10.000 --> 00:01:12.900
We want to know: what is the meaning of it?
00:01:12.900 --> 00:01:16.400
Why just r²? Why not just have r?
00:01:16.400 --> 00:01:24.000
Often if you just find the correlation you will just get r but if you find the regression you will get r².
00:01:24.000 --> 00:01:26.700
It is like what is the deal?
00:01:26.700 --> 00:01:33.900
r² really is just r squared, but there is a meaning behind it.
00:01:33.900 --> 00:01:40.300
I want to just stop and say it is like the difference between feet and feet².
00:01:40.300 --> 00:01:42.100
They mean different things.
00:01:42.100 --> 00:01:49.200
It is not just that you can square the number and say it is just the number squared.
00:01:49.200 --> 00:01:53.200
It is not just about the number it is also about the actual unit.
00:01:53.200 --> 00:02:10.100
You have to understand what the unit is, because feet is a measurement of length, but square feet gives you area.
00:02:10.100 --> 00:02:11.800
Those are different things.
00:02:11.800 --> 00:02:16.200
They are obviously related to each other, but they are very different ideas.
00:02:16.200 --> 00:02:27.700
Because of that you need to not only know how to calculate r², but also know the meaning of r².
00:02:27.700 --> 00:02:33.700
Again in order to understand the meaning of r² we will need to parse the sum of squares.
00:02:33.700 --> 00:02:40.900
Remember the sum of squares that we have been talking about is something like x or y and
00:02:40.900 --> 00:02:46.900
the difference between x and x bar or the difference between y and y bar.
00:02:46.900 --> 00:02:51.800
Squaring all those and then adding them up, sum of squares.
00:02:51.800 --> 00:03:00.300
When we say sum of squares you might hear that this is about variability.
00:03:00.300 --> 00:03:08.700
Sum of squares talks about variability and it is because you are always getting that deviation between your data and the mean.
00:03:08.700 --> 00:03:19.100
Sum of squares is an idea that is highly associated with variability.
00:03:19.100 --> 00:03:27.400
Another way of thinking about parsing sum of squares is parsing variability because variability comes from a variety of sources.
00:03:27.400 --> 00:03:36.100
Here we are going to talk about a couple of those sources and how to figure out that this piece of variability comes from one source and that piece comes from another.
00:03:36.100 --> 00:03:40.800
When you put it together you have total variability.
00:03:40.800 --> 00:03:48.100
Now total variability is going to be indicated by SST or sum of squares total.
00:03:48.100 --> 00:03:52.400
This idea is all the variability in the system.
00:03:52.400 --> 00:03:54.300
All of the variability.
00:03:54.300 --> 00:04:06.500
We are going to take that and parse it, split it apart into two pieces that add up to the total, but are two different places that variability comes from.
00:04:06.500 --> 00:04:14.700
One of the sources of the variability is always from this relationship between X and Y and that can be explained by the regression line.
00:04:14.700 --> 00:04:30.400
This is the sum of squares from the regression, SSR.
00:04:30.400 --> 00:04:36.500
This one is going to be the left over sum of squares.
00:04:36.500 --> 00:04:53.700
There is going to be some variability left over that is not explained by the regression line, and that is the sum of squares error.
00:04:53.700 --> 00:05:00.000
When we say error, we do not necessarily mean that we made a mistake.
00:05:00.000 --> 00:05:04.200
It is not that we made a mistake.
00:05:04.200 --> 00:05:10.000
Error often just means variability that is unexplained.
00:05:10.000 --> 00:05:12.900
We do not know where it came from.
00:05:12.900 --> 00:05:18.400
We do not know if it is because there was some measurement error.
00:05:18.400 --> 00:05:21.500
We do not know if there is just noise in the system.
00:05:21.500 --> 00:05:27.400
We do not know if there is another variable that is causing this variation.
00:05:27.400 --> 00:05:34.300
Sum of squares error just means variability that we cannot explain.
00:05:34.300 --> 00:05:38.100
That does not necessarily mean that we made a mistake.
00:05:38.100 --> 00:05:45.000
Oftentimes statistics uses that word, error, but it does not mean that we made a mistake;
00:05:45.000 --> 00:05:49.300
it means that it is just variability that we do not know where it came from.
00:05:49.300 --> 00:05:52.900
There is no explanation for it.
00:05:52.900 --> 00:06:05.000
To break this down you could see that this is sum of squares total and that is usually what we get from looking at the difference between y and just the mean.
00:06:05.000 --> 00:06:12.100
That is like the classic sum of squares because the mean should give us some information about where y is.
00:06:12.100 --> 00:06:16.400
It is as if we predicted every single point to be at the mean.
00:06:16.400 --> 00:06:19.700
That is like error but that is the total error.
00:06:19.700 --> 00:06:31.300
Some of that errors, some of that variation away from the mean can be accounted for by regression like here it is farther and farther up from the mean.
00:06:31.300 --> 00:06:37.900
The numbers are bigger than the mean and then here the numbers are smaller than the mean.
00:06:37.900 --> 00:06:44.900
Here this is the residual, and this is what we have already looked at before.
00:06:44.900 --> 00:06:58.300
You can also think of it as residual error where it is the rest of the variation that is not accounted for by that nice regression line that we found.
00:06:58.300 --> 00:07:08.200
We could think of this as the explained variability.
00:07:08.200 --> 00:07:13.800
This is explained and what explains the variability?
00:07:13.800 --> 00:07:16.200
The regression line.
00:07:16.200 --> 00:07:20.900
The regression line says the data varies very systematically, like this.
00:07:20.900 --> 00:07:27.300
The residual is what we call unexplained variability.
00:07:27.300 --> 00:07:36.500
Where the other one comes from is real error: just variability in the system, perhaps caused by another variable.
00:07:36.500 --> 00:07:49.500
When you put the explained variability and unexplained variability altogether you will get total variability.
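The parsing just described can be summarized in one identity (writing y hat for the predicted value and y bar for the mean):

```latex
\underbrace{\sum_i (y_i - \bar{y})^2}_{\text{SST, total}}
= \underbrace{\sum_i (\hat{y}_i - \bar{y})^2}_{\text{SSR, explained}}
+ \underbrace{\sum_i (y_i - \hat{y}_i)^2}_{\text{SSE, unexplained}}
```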
00:07:49.500 --> 00:08:03.200
Let us break down specifically and mathematically what the sum of squares total, the sum of squares residual, and the sum of squares regression are.
00:08:03.200 --> 00:08:06.500
I will give you a picture of what these things are.
00:08:06.500 --> 00:08:10.100
First let us talk about sum of squares total.
00:08:10.100 --> 00:08:15.000
One thing we probably want to do is give a rough idea of what the mean is.
00:08:15.000 --> 00:08:19.500
Let us say the mean of something like this.
00:08:19.500 --> 00:08:24.600
I'm just going to call that y bar because that is the mean of y, roughly.
00:08:24.600 --> 00:08:30.200
It is closer to these points, but these points further down pull it down.
00:08:30.200 --> 00:08:38.200
I’m going to call that y bar and I want to know the sum of squares total.
00:08:38.200 --> 00:08:42.000
What is the total variability that you see here?
00:08:42.000 --> 00:08:51.100
Because we are squaring all these differences we are not just interested in that residual idea.
00:08:51.100 --> 00:08:58.300
We are interested in the area of little squares.
00:08:58.300 --> 00:09:08.400
It is not only the distance down; imagine that distance squared, giving this area.
00:09:08.400 --> 00:09:14.400
That is the sum of squared variation of one point.
00:09:14.400 --> 00:09:17.800
Imagine doing that with all of these.
00:09:17.800 --> 00:09:21.100
You create these squares.
00:09:21.100 --> 00:09:38.500
Some are big squares, some are little squares and you add up all those different areas.
00:09:38.500 --> 00:09:40.300
That is sum of squares total.
00:09:40.300 --> 00:09:45.500
That is the total variation in our data away from the mean.
00:09:45.500 --> 00:09:48.600
Would it not be nice if all our data looked something like the mean?
00:09:48.600 --> 00:09:58.200
Then it would be easy to predict this data, but this has more variation.
00:09:58.200 --> 00:10:04.400
Let us move over to sum of squares error, because that is one we already know.
00:10:04.400 --> 00:10:09.300
In order to find sum of squared error I need the regression line.
00:10:09.300 --> 00:10:14.400
I’m just going to draw a regression line like this.
00:10:14.400 --> 00:10:17.500
It might not be perfect but something like that.
00:10:17.500 --> 00:10:20.600
Remember how we found residual?
00:10:20.600 --> 00:10:35.600
To find a residual, it is just the difference between my y and my predicted y, y hat.
00:10:35.600 --> 00:10:43.600
These are my y hat and I want to know the difference between them but we are squaring that difference.
00:10:43.600 --> 00:10:51.800
Instead of just drawing a line we draw a square and imagine getting that area.
00:10:51.800 --> 00:10:57.500
That is the squared residual, or error, for one point.
00:10:57.500 --> 00:11:03.000
We are going to do that with all of the points.
00:11:03.000 --> 00:11:18.600
Find that area, that area, that area, and add up all those areas; then we get the sum of squared error.
00:11:18.600 --> 00:11:25.000
The variation away from the regression line.
00:11:25.000 --> 00:11:29.000
This is our unexplained variation.
00:11:29.000 --> 00:11:31.700
This is our total variation.
00:11:31.700 --> 00:11:34.100
Now what is this part?
00:11:34.100 --> 00:11:40.400
This is the variability that is already accounted for by the regression line.
00:11:40.400 --> 00:11:47.500
This is the difference between the predicted y and y bar.
00:11:47.500 --> 00:11:49.800
Here is the idea.
00:11:49.800 --> 00:11:55.300
If we just have y bar we do not have a lot of predictive power.
00:11:55.300 --> 00:12:00.200
We are just using y bar, the average.
00:12:00.200 --> 00:12:02.900
It is just the average and we only have one guess.
00:12:02.900 --> 00:12:04.900
The average.
00:12:04.900 --> 00:12:09.800
If we have the regression line we have a better guess.
00:12:09.800 --> 00:12:15.400
If I know what x is I could tell you more closely what y might be.
00:12:15.400 --> 00:12:27.000
I will try to redraw my regression line and pretend that is a nice regression.
00:12:27.000 --> 00:12:30.600
Here is my y hat.
00:12:30.600 --> 00:12:40.900
Also, here is my y bar.
00:12:40.900 --> 00:12:50.500
Here what I want to know is how much of the variability is simply accounted for by having this line?
00:12:50.500 --> 00:12:56.100
Having this line gives us more predictive power; how much of the variability does that predictive power account for?
00:12:56.100 --> 00:13:09.100
We want to know, for this point, this is now my difference, and then I'm just going to square that difference.
00:13:09.100 --> 00:13:15.600
Here is another point but here is the difference.
00:13:15.600 --> 00:13:18.300
The difference is very small, almost nothing.
00:13:18.300 --> 00:13:21.400
Here is the difference.
00:13:21.400 --> 00:13:32.100
It is right here, this difference.
00:13:32.100 --> 00:13:39.800
Let me give another example like right here for this point this would be the difference.
00:13:39.800 --> 00:13:50.500
Looking at all of these, you can think of it as the squared spaces in between my regression line and my mean line.
00:13:50.500 --> 00:13:59.900
I'm looking at that and that gives me how much of my variance in the data is accounted for by the regression line.
00:13:59.900 --> 00:14:02.300
That is roughly the idea.
00:14:02.300 --> 00:14:14.500
Let us think about actual formulas, and to help us out with that I have a more nicely drawn version than my crappy dots,
00:14:14.500 --> 00:14:24.100
but now you could see the squared differences between my actual data points and my mean.
00:14:24.100 --> 00:14:26.500
Here are my square differences.
00:14:26.500 --> 00:14:28.300
Here is that same data.
00:14:28.300 --> 00:14:39.700
It is the same data from before, except now we are looking at differences from the regression line not the mean line.
00:14:39.700 --> 00:14:45.300
Here we are looking at differences between the mean line and the regression line.
00:14:45.300 --> 00:14:49.700
Let us write these things down in terms of formulas.
00:14:49.700 --> 00:14:55.900
In order to find the sum of squares total let us think about what this is as an idea.
00:14:55.900 --> 00:15:01.800
Okay, we want the sum of squares, so I know it is going to be sum of squares.
00:15:01.800 --> 00:15:06.000
All of these guys are to be like this I could already write that down.
00:15:06.000 --> 00:15:25.700
SST, SSE, and SSR are all sums of squares, so each is going to be the sum of something squared.
00:15:25.700 --> 00:15:30.400
We already know that this is going to be the total variability.
00:15:30.400 --> 00:15:42.400
Here we have for every y give me the difference between that y and the mean and then square it and get that area.
00:15:42.400 --> 00:15:45.600
Get all these areas and add them up.
00:15:45.600 --> 00:15:53.000
That is just (y − y bar)².
00:15:53.000 --> 00:16:03.200
If we want to fill this out, we know this means: for every single point that we have, get y − y bar, square it, and add them up.
00:16:03.200 --> 00:16:04.800
That is the idea.
00:16:04.800 --> 00:16:07.400
That is sum of squares total.
00:16:07.400 --> 00:16:15.500
Sum of squares residual: actually, let us go over to sum of squares error.
00:16:15.500 --> 00:16:22.000
I sometimes call it also sum of squares residual because this is the idea of the residual.
00:16:22.000 --> 00:16:29.300
Remember the residual was y – y hat.
00:16:29.300 --> 00:16:42.500
And so, we are squaring the difference between y and y hat.
00:16:42.500 --> 00:16:44.300
That is really easy.
00:16:44.300 --> 00:16:47.600
Y – y hat.
00:16:47.600 --> 00:16:56.600
If you want to fill it out, you could obviously put in the (i) as well just so you know you have to do that for every single point.
00:16:56.600 --> 00:17:07.400
For the sum of squares for the regression: I think that is why they call the other one sum of squares error rather than sum of squares residual, because the r would be confusing.
00:17:07.400 --> 00:17:13.800
This one is sum of squares regression.
00:17:13.800 --> 00:17:17.100
I want to think of this guy as the good guy.
00:17:17.100 --> 00:17:23.300
It is like you want to be able to predict y from x, and this guy helps you because he sucks up some of the variance.
00:17:23.300 --> 00:17:26.900
This guy is the leftover that I do not know what to do about.
00:17:26.900 --> 00:17:36.400
When we talk about the regression we are talking about the difference between y hat and y bar.
00:17:36.400 --> 00:17:45.400
That is y hat and y bar.
00:17:45.400 --> 00:17:55.400
You could obviously do that for each point.
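The three formulas just written down can be sketched in code. This is a minimal illustration with made-up numbers (the data here is an assumption, not the lecture's data); an ordinary least-squares line is used so that the partition holds exactly:

```python
import numpy as np

# Hypothetical data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares regression line, y_hat = b0 + b1 * x.
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
y_bar = y.mean()

sst = np.sum((y - y_bar) ** 2)      # sum of squares total
sse = np.sum((y - y_hat) ** 2)      # sum of squares error (residual)
ssr = np.sum((y_hat - y_bar) ** 2)  # sum of squares regression

# For a least-squares line the partition is exact: SST = SSR + SSE.
assert np.isclose(sst, ssr + sse)
```

Each sum of squares is a sum of areas of little squares, which is why none of them can be negative.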
00:17:55.400 --> 00:18:05.100
There you have it, the formulas for these; but if you understand the ideas you could always interpret what each one is a picture of.
00:18:05.100 --> 00:18:11.400
This is a picture of the difference between the data points and y bar.
00:18:11.400 --> 00:18:15.100
Here is a picture of the difference between the data points and y hat.
00:18:15.100 --> 00:18:20.500
It may be confusing though which one is y hat?
00:18:20.500 --> 00:18:35.500
All you should do is go back to the picture and ask yourself whether it is showing the total variance, or the variance left after we have the regression line.
00:18:35.500 --> 00:18:44.900
Okay, so now that you know the SST, SSR, and SSE, now we can talk about r², because you need those components.
00:18:44.900 --> 00:18:55.200
R² is often called the coefficient of determination, not coefficient of correlation squared it is often called the coefficient of determination.
00:18:55.200 --> 00:19:00.300
One of the reasons that r² is important is that it has an interpretation.
00:19:00.300 --> 00:19:04.500
It is actually talking about a proportion of total variance.
00:19:04.500 --> 00:19:08.700
Remember variance is standard deviation².
00:19:08.700 --> 00:19:18.800
Because we are working with sums of squares: it is the proportion of the total variance of y explained by the simple regression model.
00:19:18.800 --> 00:19:20.800
Here is the idea.
00:19:20.800 --> 00:19:25.700
It is like here is all that variance and we do not know where that variance comes from.
00:19:25.700 --> 00:19:27.700
I do not know why they are all varying.
00:19:27.700 --> 00:19:29.500
We have the regression line.
00:19:29.500 --> 00:19:33.700
The regression line explains where some of the variation away from the mean comes from.
00:19:33.700 --> 00:19:38.300
It comes from this relationship of x.
00:19:38.300 --> 00:19:52.900
If that regression line is doing a good job, then a lot of the total variance is explained by the regression line, by that predicted y.
00:19:52.900 --> 00:20:07.300
If the line is not doing a very good job then it does not explain a lot of the variation there is extra variation above and beyond that.
00:20:07.300 --> 00:20:15.100
r² would be very low, because only a small portion of that variance is accounted for.
00:20:15.100 --> 00:20:26.600
Given that, let us talk about what a strong r might be and what a weak r might be.
00:20:26.600 --> 00:20:30.900
If the correlation is very strong let us think about this.
00:20:30.900 --> 00:20:37.800
Whatever your sum of squares total is, that is all the variance.
00:20:37.800 --> 00:20:43.900
Whatever that is this is going to account for a lot of it.
00:20:43.900 --> 00:20:58.200
Let us say this is 100% of the variance; this accounts for 85%, and so this would be small, 15%.
00:20:58.200 --> 00:20:59.800
This is how this works.
00:20:59.800 --> 00:21:03.800
These two added up, give you the total.
00:21:03.800 --> 00:21:15.000
If that is true, if the correlation is very strong this should be small and this should be large.
00:21:15.000 --> 00:21:24.400
If this is small then the proportion of error over the total would be a small number.
00:21:24.400 --> 00:21:26.400
Here is the formula for r².
00:21:26.400 --> 00:21:31.100
R² is 1 – that proportion of error / the total.
00:21:31.100 --> 00:21:38.200
This is the unaccounted for error, that leftover error / the total variation.
00:21:38.200 --> 00:21:43.200
This is the unexplained variation / the total variation.
00:21:43.200 --> 00:21:52.100
This number should be very, very small and when that number is very small 1 - a very small number is a number very close to 1.
00:21:52.100 --> 00:21:58.600
r² is very strong, because the maximum r² could be is 1.
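The formula just stated in words, written out; the second equality follows from the partition SST = SSR + SSE:

```latex
r^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} = \frac{\mathrm{SSR}}{\mathrm{SST}}
```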
00:21:58.600 --> 00:22:22.800
This means that if r² is large, meaning close to 1, then much of the variation is accounted for by the regression line.
00:22:22.800 --> 00:22:26.200
The regression line did a great job of explaining variation.
00:22:26.200 --> 00:22:33.500
With the regression line I could tell you: I can predict y for you, given x.
00:22:33.500 --> 00:22:38.000
It is doing a good job.
00:22:38.000 --> 00:22:41.800
On the other hand, if a correlation is weak.
00:22:41.800 --> 00:22:49.200
If it is weak (remember the correlation says how line-y the data is),
00:22:49.200 --> 00:22:54.100
Even if we have a line it does not explain all the variation.
00:22:54.100 --> 00:22:58.100
There is a lot of leftover variation.
00:22:58.100 --> 00:23:02.800
That should be low compared to that one.
00:23:02.800 --> 00:23:07.500
If this is 100%, the line is not doing a very good job explaining variation.
00:23:07.500 --> 00:23:14.800
It only explains 15% of the variation then we have 85% of the variation leftover.
00:23:14.800 --> 00:23:20.200
If we put the sum of squared error over the total this number should be large.
00:23:20.200 --> 00:23:27.100
A large proportion of that total variance is still unaccounted for, unexplained.
00:23:27.100 --> 00:23:36.500
1 minus a larger number, one that is closer to 1, will be a very small number for r².
00:23:36.500 --> 00:23:49.200
R² if it is small this means that not a lot of the variation was accounted for by the regression line.
00:23:49.200 --> 00:23:58.500
The regression line did not do a very good job of explaining the variation in our data.
00:23:58.500 --> 00:24:00.600
Let us do some examples.
00:24:00.600 --> 00:24:10.700
We have worked with this data before; for the above example data we have already found the regression line and the correlation, so they are given to us.
00:24:10.700 --> 00:24:25.500
We could look at this and it has a negative slope and there is more rise than run.
00:24:25.500 --> 00:24:29.500
Because the cost changes really fast.
00:24:29.500 --> 00:24:34.200
For every one step you go over in x, you move quite a bit here in y.
00:24:34.200 --> 00:24:49.000
It makes sense that the correlation is negative and strong; it is -.869, which is pretty strong, very line-y, but with a negative slope.
00:24:49.000 --> 00:24:51.900
But that only gets us so far.
00:24:51.900 --> 00:24:57.400
It is giving us the correlation coefficient, not the coefficient of determination, r².
00:24:57.400 --> 00:25:14.200
Find r² for this set of data in a different way, by looking at r² = 1 - the sum of squared error / the sum of squares total.
00:25:14.200 --> 00:25:23.400
Once we find that examine whether this is also r × r.
00:25:23.400 --> 00:25:33.900
If you download the examples provided for you below and go to example 1, here is our data and I just provided the graph for you so you could see.
00:25:33.900 --> 00:25:46.300
I’m just going to move it over to the side because we are not going to need it.
00:25:46.300 --> 00:25:57.500
Remember that we have this, we are going to need to calculate something in order to find the sum of squared error and the sum of squares total.
00:25:57.500 --> 00:26:08.200
One thing that I like to do is remind myself if I looked at sum of squared error, if I double clicked on that what would I see inside?
00:26:08.200 --> 00:26:20.300
Well, we know that for the sum of squared error, whatever regression line we have, we need this distance away from it, squared.
00:26:20.300 --> 00:26:32.200
That is going to be the sum of (y - y hat)², because this line gives the y hat.
00:26:32.200 --> 00:26:36.000
I know I’m going to need y hat.
00:26:36.000 --> 00:26:37.300
What else are we going to need?
00:26:37.300 --> 00:26:43.300
For sum of squares total, I need whatever my mean is.
00:26:43.300 --> 00:26:52.300
Whatever my mean is I’m going to need to know the difference between my data and my mean squared.
00:26:52.300 --> 00:26:59.600
My data and my mean squared, that is sum of squares total.
00:26:59.600 --> 00:27:05.700
That I could easily find; I should try to find y hat as well.
00:27:05.700 --> 00:27:10.200
Y hat will be easy to find because we have the regression line.
00:27:10.200 --> 00:27:17.700
We could just plug-in a whole bunch of x and get each y for all those x.
00:27:17.700 --> 00:27:23.600
Why don't we start there?
00:27:23.600 --> 00:27:29.800
Let us find the predicted values; I'm just going to call cost per unit my y, because that was on my y axis.
00:27:29.800 --> 00:27:35.600
I will talk about predicted cost per unit, predicted CPU.
00:27:35.600 --> 00:27:56.600
In order to find that I need to put in my regression formula: that is 795.207, and then subtract 21.514 times x; Excel will automatically do order of operations for you.
00:27:56.600 --> 00:27:59.200
Multiplication comes before subtraction.
00:27:59.200 --> 00:28:03.200
I’m just going to just click in x.
00:28:03.200 --> 00:28:14.900
Whatever x is this is going to find me the predicted y value.
00:28:14.900 --> 00:28:27.900
Once I have that I’m just going to drag down this to find all of my predicted CPU.
00:28:27.900 --> 00:28:42.100
It might actually be helpful for us to find the sums and averages of all of these.
00:28:42.100 --> 00:28:48.100
I’m just going to color these in red so that I know they are not part of my data.
00:28:48.100 --> 00:28:50.500
I probably do not need the sum for that.
00:28:50.500 --> 00:28:58.300
I need the average for these.
00:28:58.300 --> 00:29:05.700
I’m also going to need the average for these.
00:29:05.700 --> 00:29:12.000
We have our predicted CPU (cost per unit).
00:29:12.000 --> 00:29:14.200
That is my y hat.
00:29:14.200 --> 00:29:20.200
I also find my y bar, my average cost per unit.
00:29:20.200 --> 00:29:33.200
Let us find the error terms square and also these variations squared.
00:29:33.200 --> 00:29:55.900
Here I’m just going to write it down for myself as (y - predicted y)² and also (y - y bar)².
00:29:55.900 --> 00:30:05.100
We could also write (CPU - predicted CPU)² or (CPU - average CPU)².
00:30:05.100 --> 00:30:08.400
I am just writing it y just to save space.
00:30:08.400 --> 00:30:21.900
Let me get my y minus the predicted y, and all of that squared.
00:30:21.900 --> 00:30:25.200
Let me also do that for y and y bar.
00:30:25.200 --> 00:30:27.200
Let me get the parentheses.
00:30:27.200 --> 00:30:40.900
Y - y bar and all of that squared.
00:30:40.900 --> 00:30:50.000
Now y bar is never going to change, so I'm just going to lock that cell down.
00:30:50.000 --> 00:31:03.100
Once I have that I could just copy and paste these 2 cells all the way down.
00:31:03.100 --> 00:31:24.000
Once I have that, now I could find the sum of the squared residuals as well as the sum of these squared deviations.
00:31:24.000 --> 00:31:37.100
Sum of all these guys and sum of these guys.
00:31:37.100 --> 00:31:41.800
I have almost everything I need in order to find r².
00:31:41.800 --> 00:31:44.700
I have my sum here, my sum here.
00:31:44.700 --> 00:31:50.500
Let us find r².
00:31:50.500 --> 00:32:05.100
r² is going to be 1 minus the sum of squared error ÷ the sum of squares total, that ratio.
00:32:05.100 --> 00:32:08.400
Let us first just look at the numbers that we have calculated.
00:32:08.400 --> 00:32:12.200
This value is smaller than this value.
00:32:12.200 --> 00:32:15.400
This is 1/6.
00:32:15.400 --> 00:32:27.300
Because of that, 1 minus roughly 1/6 leaves about 5/6, which should be closer to 1 than to 0.
00:32:27.300 --> 00:32:34.900
We get .76, and so we get a pretty good r².
00:32:34.900 --> 00:32:44.000
Notice that r² is positive even though our slope is negative because r² does not actually talk about slope.
00:32:44.000 --> 00:32:50.100
It is just the proportion of variance accounted for by the regression line.
00:32:50.100 --> 00:32:57.500
It says 76% of the total variance is accounted for by that regression line, the majority of it.
00:32:57.500 --> 00:32:59.500
And so that is good.
00:32:59.500 --> 00:33:05.900
Now let us try to put in r × r so we already know what r is.
00:33:05.900 --> 00:33:11.500
Let us see if r × r will give us .76.
00:33:11.500 --> 00:33:29.600
So (-.869)²: we will get something very close, but -.869 is probably rounded, and because of that it does not give us a precise number.
00:33:29.600 --> 00:33:34.800
We do not have that precision, but it is pretty close, still 76%.
00:33:34.800 --> 00:33:45.600
If you took the actual r that you computed and squared it, you would get exactly r².
00:33:45.600 --> 00:33:55.600
We found r² for this set of data and examined whether it is r × r, and it indeed is.
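The check from Example 1 can be sketched in code; the numbers below are made up (not the lecture's cost-per-unit table), but the identity they illustrate holds for any data fit by least squares:

```python
import numpy as np

# Hypothetical data with a negative slope, standing in for the cost-per-unit table.
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
y = np.array([700.0, 540.0, 430.0, 350.0, 290.0, 240.0])

b1, b0 = np.polyfit(x, y, 1)   # least-squares slope and intercept
y_hat = b0 + b1 * x            # predicted y for each x

sse = np.sum((y - y_hat) ** 2)     # unexplained variation
sst = np.sum((y - y.mean()) ** 2)  # total variation
r2_from_ss = 1 - sse / sst

r = np.corrcoef(x, y)[0, 1]    # Pearson correlation coefficient

# r² from the sums of squares equals r × r, with no rounding error this time.
assert np.isclose(r2_from_ss, r * r)
# r² is positive even though the slope (and r) is negative.
assert r < 0 and r2_from_ss > 0
```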
00:33:55.600 --> 00:34:05.200
Example 2, the conceptual explanation of r² is that it is the proportion of total variance of y explained by the simple regression model.
00:34:05.200 --> 00:34:20.500
By a simple regression model we just mean a model of the form y hat = b0 + b1x.
00:34:20.500 --> 00:34:25.400
It can only be a line; it cannot be a curve.
00:34:25.400 --> 00:34:29.600
That is what we mean by a simple linear regression.
00:34:29.600 --> 00:34:45.800
What does it mean that the simple linear regression line is a model, and that variance is explained by that model?
00:34:45.800 --> 00:34:48.900
Let us think about this idea.
00:34:48.900 --> 00:34:52.500
Here we have our data set.
00:34:52.500 --> 00:34:58.800
I’m just going to draw some points here.
00:34:58.800 --> 00:35:02.900
These points do not exactly fall in a line.
00:35:02.900 --> 00:35:12.100
That line that we made up the regression line, the regression line is really a model.
00:35:12.100 --> 00:35:18.100
It is not actual data it is a theoretical model that we created from the data.
00:35:18.100 --> 00:35:33.000
By model we mean just like a model airplane or a model house; it is not the real house.
00:35:33.000 --> 00:35:38.800
It is like a shining example.
00:35:38.800 --> 00:35:44.600
But not only is it an example, it is idealized.
00:35:44.600 --> 00:35:46.600
It is the perfect version of the world.
00:35:46.600 --> 00:35:51.800
If the world were perfect and there were no error, the data would look like the model.
00:35:51.800 --> 00:35:57.600
When we say modeling variance: there is always variance.
00:35:57.600 --> 00:36:00.000
Where does it come from?
00:36:00.000 --> 00:36:24.900
When we create a model, we have a little theory of where that variance comes from and in our model here this is our theory that explains the variance.
00:36:24.900 --> 00:36:31.700
Our theory is that it is the relationship between x and y; it is a very simple explanation.
00:36:31.700 --> 00:36:37.100
But it is this relationship between x and y that is where the variation comes from.
00:36:37.100 --> 00:36:43.800
That is what we mean by the regression line as a model of the variance.
00:36:43.800 --> 00:36:50.700
Now the idea behind r² is how good is this theory.
00:36:50.700 --> 00:36:53.400
How good is this model?
00:36:53.400 --> 00:37:03.700
Does it explain a lot of the total variation or is it a theory that does not really help us out a lot?
00:37:03.700 --> 00:37:11.300
If we have a big r², if it is fairly large, this means that our theory is pretty good.
00:37:11.300 --> 00:37:16.900
Our theory explains, accounts for, a lot of the total variance.
00:37:16.900 --> 00:37:20.400
If our r² is very small it means our theory was not that great.
00:37:20.400 --> 00:37:24.000
We had a theory, here is a model but it is not that good.
00:37:24.000 --> 00:37:31.400
It only explains a little bit of the variance.
00:37:31.400 --> 00:37:36.300
Example 3, why does r² only range from 0 to 1?
00:37:36.300 --> 00:37:41.900
It might be helpful here to start off with what r² is.
00:37:41.900 --> 00:37:51.900
1 minus the sum of squared error over the sum of squares total, the total variance.
00:37:51.900 --> 00:38:00.500
Now let us think can SSE ever be greater than SST?
00:38:00.500 --> 00:38:12.300
No it cannot, because SST by definition equals the sum of squares regression plus the sum of squared error.
00:38:12.300 --> 00:38:19.200
SSE, by definition, has to be smaller than SST, and none of these can be negative because they are squared.
00:38:19.200 --> 00:38:27.500
Whatever they are, they have to be non-negative numbers, and it is actually the case that when you add 2 non-negative numbers together you get another non-negative sum,
00:38:27.500 --> 00:38:37.400
and that sum has to be greater than or equal to each of the parts.
00:38:37.400 --> 00:38:45.000
Either SST is greater than each of these or it is equal to one of them, because it could be that one of them is 0 and the other is 100% of the total.
00:38:45.000 --> 00:38:56.000
There is just actually no way that this could be bigger than 1.
00:38:56.000 --> 00:39:08.500
Not bigger than 1 - I mean, can SSE be bigger than SST?
00:39:08.500 --> 00:39:11.600
No, it cannot be.
00:39:11.600 --> 00:39:20.600
This proportion has to range between 0 and 1.
00:39:20.600 --> 00:39:27.000
It has got to be 1 or smaller, or they could be equal.
00:39:27.000 --> 00:39:32.400
This could be 0 and this could be 1.
00:39:32.400 --> 00:39:36.000
There is no way that SSE could be bigger than SST.
00:39:36.000 --> 00:39:49.000
Because this ratio only ranges from 0 to 1, and r² is 1 - something that ranges from 0 to 1, the whole thing can only range from 0 to 1.
00:39:49.000 --> 00:39:57.500
Because of that r² can only range from 0 to 1.
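The argument above can be checked numerically. Here is a minimal Python sketch (the lesson works in Excel; the data and the helper name `least_squares_fit` below are hypothetical): for a least-squares line, SST splits into SSR + SSE, both non-negative, so SSE/SST stays between 0 and 1 and r² does too.

```python
def least_squares_fit(xs, ys):
    """Fit y = b0 + b1*x by least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    b0 = my - b1 * mx
    return b0, b1

xs = [1, 2, 3, 4, 5]   # hypothetical predictor values
ys = [2, 4, 5, 4, 5]   # hypothetical response values
b0, b1 = least_squares_fit(xs, ys)
y_hats = [b0 + b1 * x for x in xs]

my = sum(ys) / len(ys)
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))
sst = sum((y - my) ** 2 for y in ys)
# SST = SSR + SSE for a least-squares line, and both pieces are
# non-negative sums of squares, so SSE <= SST and r^2 is in [0, 1].
r_squared = 1 - sse / sst   # 0.6 for this data
```

For this made-up data, SSE = 2.4 and SST = 6, so r² = 0.6, safely inside [0, 1].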
00:39:57.500 --> 00:40:00.400
Example 4, and this is going to be a doozy.
00:40:00.400 --> 00:40:08.400
Find r² for this set of data and examine whether this is also r × r.
00:40:08.400 --> 00:40:10.900
Let us think about what we are going to do.
00:40:10.900 --> 00:40:36.000
In order to find r × r, we need r, the correlation coefficient, and that is the average product of z scores: the sum of the products z sub x × z sub y, divided out to give the average.
00:40:36.000 --> 00:40:38.700
We are going to find that.
00:40:38.700 --> 00:40:42.000
We also have to find r².
00:40:42.000 --> 00:40:49.400
In order to find r², that is 1 - the sum of squared error divided by the total sum of squares.
00:40:49.400 --> 00:40:54.400
In order to find this, we need y hat.
00:40:54.400 --> 00:41:05.500
In order to find y hat we need the regression line.
00:41:05.500 --> 00:41:22.600
To find the regression line, one thing we could do, once we find the correlation coefficient, is use it in order to find b1.
00:41:22.600 --> 00:41:28.300
Or obviously we can also just find b1 in other ways too.
00:41:28.300 --> 00:41:41.200
But this one is a shortcut, and once we find b1 we can find the intercept: b0 = y bar - b1 × x bar.
00:41:41.200 --> 00:41:44.500
We will have a whole bunch of data.
00:41:44.500 --> 00:41:47.800
We have all this data.
00:41:47.800 --> 00:41:49.600
Let us get started.
00:41:49.600 --> 00:42:02.200
If you go to your examples and example 4, here is our data and I’m just going to move this over to the side because we are not going to be needing it for a while.
00:42:02.200 --> 00:42:07.900
We can already see that it is probably going to be a positive correlation, if anything.
00:42:07.900 --> 00:42:18.900
Let us just start by finding the correlation coefficient because it is pretty easy for us to find and once we have that we can find other things.
00:42:18.900 --> 00:42:31.900
In order to get started on that it often helps to have the sum, the average, and the standard deviation.
00:42:31.900 --> 00:42:38.000
I’m just going to make these all bolder in red so we know that they are different.
00:42:38.000 --> 00:42:43.800
I’m going to find the sum for these.
00:42:43.800 --> 00:42:48.700
We do not need the sum here, but I figured we might as well.
00:42:48.700 --> 00:42:50.500
It is not too hard.
00:42:50.500 --> 00:43:00.300
There is the average and let us get the standard deviation because we are going to need that for the z score anyway.
00:43:00.300 --> 00:43:03.000
Great.
00:43:03.000 --> 00:43:19.600
Now let us find the z scores for TV watching and also the z scores for junk food.
00:43:19.600 --> 00:43:26.500
It makes sense that there is this positive correlation.
00:43:26.500 --> 00:43:36.100
The more TV watched per week, perhaps the more junk food calories are consumed.
00:43:36.100 --> 00:43:39.300
Is the correlation strong?
00:43:39.300 --> 00:43:40.500
I do not know.
00:43:40.500 --> 00:43:58.100
In order to find the z score we need to take the TV watching data and subtract from that the mean, and I want that distance,
00:43:58.100 --> 00:44:02.600
not in terms of the raw distance, but in terms of standard deviation.
00:44:02.600 --> 00:44:04.400
How many standard deviations away?
00:44:04.400 --> 00:44:09.500
All divided by standard deviation.
00:44:09.500 --> 00:44:15.300
Here I'm just going to lock down the row.
00:44:15.300 --> 00:44:27.200
I always use the same mean and standard deviation.
00:44:27.200 --> 00:44:40.400
Once I have that, I could just drag it all the way down and also drag it across.
00:44:40.400 --> 00:44:47.200
We forgot to find these for junk food calories.
00:44:47.200 --> 00:44:52.100
Let us just double click on one of these and test it out.
00:44:52.100 --> 00:44:53.600
Let us see.
00:44:53.600 --> 00:45:00.900
It gives me (the junk food calories - the average) / the standard deviation.
00:45:00.900 --> 00:45:04.500
Perfect.
00:45:04.500 --> 00:45:06.900
Let us just eyeball this data for a second.
00:45:06.900 --> 00:45:13.000
We see that roughly half of the z scores are negative and roughly half are positive.
00:45:13.000 --> 00:45:17.000
Here too roughly half are negative and roughly half are positive.
00:45:17.000 --> 00:45:19.600
We know that we did a good job at finding z scores.
00:45:19.600 --> 00:45:33.400
In order to find the average product, we are going to need to find the product z(TV) × z(junk food).
00:45:33.400 --> 00:45:50.400
This times this and once we have all of that we could sum these and we could find the average.
00:45:50.400 --> 00:46:07.600
This divided by the count of how many data points there are, minus 1.
00:46:07.600 --> 00:46:10.400
We found the average and that is r.
00:46:10.400 --> 00:46:13.400
Just regular r.
00:46:13.400 --> 00:46:22.500
That r is .58, so it is not super weak but it is not really strong either.
00:46:22.500 --> 00:46:25.400
I’m just labeling it so that I know where it came from.
00:46:25.400 --> 00:46:44.100
Once we have r we could find b1, b sub 1.
00:46:44.100 --> 00:46:57.100
In order to find b sub 1 that will be r × the ratio between the standard deviation for y and standard deviation of x.
00:46:57.100 --> 00:47:00.400
We have that right over here.
00:47:00.400 --> 00:47:10.300
The standard deviation of y ÷ the standard deviation of x, that ratio.
00:47:10.300 --> 00:47:23.800
And so we get that b1 is 10.75, and once we have b1 we can find b sub 0.
00:47:23.800 --> 00:47:31.400
Remember, we have the point of averages, but we also have all these points.
00:47:31.400 --> 00:47:32.200
You can substitute any one of these points.
00:47:32.200 --> 00:47:37.800
Any one of the points (x, predicted y).
00:47:37.800 --> 00:47:48.700
You cannot substitute the raw data points themselves.
00:47:48.700 --> 00:48:00.900
Using the point of averages, we get b0 = y bar - b1 × x bar.
00:48:00.900 --> 00:48:11.200
Here we get that the intercept, b sub naught or b sub 0, is 186.
00:48:11.200 --> 00:48:19.500
Now that we have b1 and b0 we can now find predicted y.
00:48:19.500 --> 00:48:24.100
Let us go up here.
00:48:24.100 --> 00:48:40.800
To help us out, I am just going to color these some color so that we know that this section is all about finding the correlation coefficient.
00:48:40.800 --> 00:48:43.400
We found the correlation coefficient.
00:48:43.400 --> 00:48:49.200
Now what we want to do is find r².
00:48:49.200 --> 00:48:52.900
And so in order to find r² let us think about what we need.
00:48:52.900 --> 00:49:17.700
We need predicted y, predicted junk food, and we can easily find that; once we have that, we know we are going to need (y - predicted y)².
00:49:17.700 --> 00:49:27.600
That is our sum of squared error. But we are also going to need (y - y bar)².
00:49:27.600 --> 00:49:30.400
That is going to be our total error.
00:49:30.400 --> 00:49:33.300
Let us start with predicted y.
00:49:33.300 --> 00:49:52.500
Predicted y is always going to be b sub 0 + the slope × x, which is TV watching.
00:49:52.500 --> 00:50:06.300
We will lock down b sub naught and the slope b sub 1 because we do not want those to move.
00:50:06.300 --> 00:50:32.800
Once we have that, we could find (y - the predicted y)².
00:50:32.800 --> 00:50:52.300
And then finally we want to find (y - the average y)².
00:50:52.300 --> 00:51:02.600
We want this average to be locked in place so it does not move.
00:51:02.600 --> 00:51:14.300
Once we have all of those 3 pieces we could just do the easy job of copying and pasting all the way down.
00:51:14.300 --> 00:51:21.600
Once we do that, we could sum these up because we are going to need to have
00:51:21.600 --> 00:51:32.900
the sum of squared residuals, and I'm going to need the sum of squared deviations from the mean.
00:51:32.900 --> 00:51:39.700
In order to find the sum I could just copy and paste that.
00:51:39.700 --> 00:51:55.000
Once we have the sum I can now find r².
00:51:55.000 --> 00:52:07.200
I can just put in 1 – SSE / SST.
00:52:07.200 --> 00:52:09.000
Let us see.
00:52:09.000 --> 00:52:13.600
I will get .3377.
00:52:13.600 --> 00:52:22.100
The regression line accounts for about 34% of the variation.
00:52:22.100 --> 00:52:23.900
Let us see.
00:52:23.900 --> 00:52:27.900
Is this r × r?
00:52:27.900 --> 00:52:30.700
Is that going to be the same thing?
00:52:30.700 --> 00:52:40.600
We have r, so we can just square it, and we get exactly 34%.
00:52:40.600 --> 00:52:48.700
If we get a question like this, Excel can help.
00:52:48.700 --> 00:52:52.000
Thanks for watching www.educator.com.