WEBVTT mathematics/statistics/son
00:00:00.000 --> 00:00:02.400
Hi and welcome to www.educator.com.
00:00:02.400 --> 00:00:07.300
Today we are going to talk about correlation.
00:00:07.300 --> 00:00:12.900
First let us go back in and just briefly review summarizing scatter plots quantitatively
00:00:12.900 --> 00:00:17.800
and talk about all the other things we have talked about scatter plots.
00:00:17.800 --> 00:00:26.500
Then we will talk about eyeballing the correlation coefficient, or what we call r, Pearson's r.
00:00:26.500 --> 00:00:36.400
Actually if you have a set of data that looks a particular way often you could sort of ballpark where the correlation coefficient falls.
00:00:36.400 --> 00:00:39.500
We will also talk about precisely calculating it.
00:00:39.500 --> 00:00:49.500
Then we are going to go back and talk about the relationship between r and b1, the slope of our regression line.
00:00:49.500 --> 00:00:55.800
First let us talk about summarizing a scatter plot quantitatively.
00:00:55.800 --> 00:00:57.800
We did not deal with shape.
00:00:57.800 --> 00:01:01.800
We just looked at it and maybe that is pretty good.
00:01:01.800 --> 00:01:12.300
We will talk about shape in the next couple of lessons, but for now we are going to leave shape alone in terms of quantitatively calculating it.
00:01:12.300 --> 00:01:20.100
We did look at how to precisely calculate the trend by finding the regression line,
00:01:20.100 --> 00:01:27.000
that middle line in between all those points that summarizes the middle of all those points.
00:01:27.000 --> 00:01:40.700
And that middle line really gives us the relationship between X and Y, because it is the function that tells us, if we have x, what y we expect.
00:01:40.700 --> 00:01:46.300
We can get the relationship between those two variables.
00:01:46.300 --> 00:01:55.400
Finally, today we are going to talk about calculating strength, not just looking at it and saying that is pretty strong,
00:01:55.400 --> 00:02:00.600
but instead actually calculating the correlation coefficient, r.
00:02:00.600 --> 00:02:10.600
That idea is simply: how packed around the regression line are our data points?
00:02:10.600 --> 00:02:12.200
Are they tightly packed?
00:02:12.200 --> 00:02:17.900
Is it a strong correlation, where the points are tightly packed around that regression line?
00:02:17.900 --> 00:02:19.000
Or is it very loose?
00:02:19.000 --> 00:02:21.400
Is it dispersed?
00:02:21.400 --> 00:02:32.100
If it is not really sticking close to that line, then we would have low strength or low correlation.
00:02:32.100 --> 00:02:37.400
For instance, I am just eyeballing it here, and there are a lot of data points.
00:02:37.400 --> 00:02:48.400
You might have no relationship between two variables, and in that case, the spread looks something like this where there is no real line in there.
00:02:48.400 --> 00:02:51.800
It is just sort of this cloud of dots.
00:02:51.800 --> 00:02:58.200
Remember each of these points is a case.
00:02:58.200 --> 00:03:02.900
Each of those cases has two variables.
00:03:02.900 --> 00:03:12.500
One variable, x, is on the x-axis, and the other variable, which we will call y, is represented on the y-axis.
00:03:12.500 --> 00:03:19.300
That point represents x here and y here.
00:03:19.300 --> 00:03:24.600
In this case there is no relationship between X and Y; just because you know what x is does not tell you anything about y.
00:03:24.600 --> 00:03:29.100
Let us say we know x is here.
00:03:29.100 --> 00:03:33.300
Do you have any certainty as to where y might be?
00:03:33.300 --> 00:03:38.000
There is some y down here and some y up here.
00:03:38.000 --> 00:03:46.100
Even more so, what if x was here? Do we have any reason to say y is in a particular place?
00:03:46.100 --> 00:03:48.400
No not really.
00:03:48.400 --> 00:03:52.900
Because of that a line would not help us here.
00:03:52.900 --> 00:04:00.300
A regression line does not actually summarize this very well, and it is because the correlation coefficient is very low.
00:04:00.300 --> 00:04:01.400
There is very low strength.
00:04:01.400 --> 00:04:04.700
There is very low adherence to the line.
00:04:04.700 --> 00:04:15.000
Moving out a little bit further, you see that this one is starting to have more of an elongated shape.
00:04:15.000 --> 00:04:29.900
This is still a fairly low correlation, but you can see that there is starting to be a linear relationship between X and Y, namely as x goes up, y also goes up.
00:04:29.900 --> 00:04:32.700
This is what we call a positive correlation.
00:04:32.700 --> 00:04:39.900
There is a relationship between x and y that is linear and positive.
00:04:39.900 --> 00:04:45.700
Notice that this one on the other side is the exact opposite, where it is the same shape,
00:04:45.700 --> 00:04:54.100
but it has been flipped around, almost like we put a mirror here and looked at the mirror reflection.
00:04:54.100 --> 00:05:05.100
In this case it is the same shape cloud but now as x goes down y goes up.
00:05:05.100 --> 00:05:12.500
Here we see the opposite relationship between X and Y and we call it a negative correlation.
00:05:12.500 --> 00:05:16.100
Because of that the signs act accordingly.
00:05:16.100 --> 00:05:25.500
Here, where the slope is negative, the sign is negative: the correlation is -.4.
00:05:25.500 --> 00:05:35.800
Here, for a slope that is positive, as x goes up y goes up and as x goes down y goes down; the correlation is a positive number.
00:05:35.800 --> 00:05:46.100
Just by looking at the correlation coefficient, we could immediately know roughly what kind of relationship x and y have.
00:05:46.100 --> 00:06:00.800
Notice that as we go out even further, not only do the numbers get bigger and bigger out from 0, but the numbers correspond to how line-y the data are.
00:06:00.800 --> 00:06:03.800
How much they correspond to a line.
00:06:03.800 --> 00:06:12.900
It is not really about having more dots; it is about how much all those dots fit to a line.
00:06:12.900 --> 00:06:18.100
That is what we often call fitting the data to a regression line.
00:06:18.100 --> 00:06:21.600
We want to see is it a good fit? Is it a bad fit?
00:06:21.600 --> 00:06:30.100
The correlation coefficient gives us the strength of that fit: is it a really strong fit, or is it very weak and loose?
00:06:30.100 --> 00:06:38.600
The actual maximum for a correlation coefficient is 1 and the minimum -1.
00:06:38.600 --> 00:06:40.600
That is as far as it will go.
00:06:40.600 --> 00:06:46.800
You cannot have a correlation of like 1.1.
00:06:46.800 --> 00:06:49.300
We will talk a little bit about why.
00:06:49.300 --> 00:07:02.900
Here what you see is that it might have the same number of points as all of these, but there is very, very little variation from the line.
00:07:02.900 --> 00:07:15.200
There is not a lot of variation out from the line, whereas at .8 you can see it is better than .4, but it is not quite as line-y as 1.0.
00:07:15.200 --> 00:07:20.400
That is one way to very quickly eyeball the correlation coefficient.
00:07:20.400 --> 00:07:25.000
You can just look at the data; if it is elongated a little bit,
00:07:25.000 --> 00:07:33.300
maybe it is .4, but if it looks like a tighter ellipse, maybe it is .8.
00:07:33.300 --> 00:07:46.300
If it looks very close to a line perhaps it is close to 1.
00:07:46.300 --> 00:07:47.500
I want you to notice something.
00:07:47.500 --> 00:07:57.400
Correlation coefficient, other than caring about positive and negative slope, it does not otherwise care very much about slope.
00:07:57.400 --> 00:08:02.300
For instance, look at all of the situations, these are all lines.
00:08:02.300 --> 00:08:05.900
They are all perfectly lined up; they are maximally line-y.
00:08:05.900 --> 00:08:17.300
Notice that these lines all have positive slope, and they all have a correlation coefficient of 1.
00:08:17.300 --> 00:08:27.400
It does not matter whether, as x changes, y changes very quickly, or as x changes by 1, y changes very slowly.
00:08:27.400 --> 00:08:34.600
The slope does not matter, except for the positive or negative part.
00:08:34.600 --> 00:08:47.100
The same thing with the negative slopes: even though the slopes are all different, they all have a correlation coefficient of -1.
00:08:47.100 --> 00:08:53.200
There is an exception to this rule and it is this line right here, the perfect horizontal line.
00:08:53.200 --> 00:08:58.800
Let us think about what the equation for a horizontal line is.
00:08:58.800 --> 00:09:06.700
For a horizontal line, it does not matter what x is; y is always the same.
00:09:06.700 --> 00:09:16.100
Let us say y = 3; that would be a horizontal line, or y = -2, or y = .1.
00:09:16.100 --> 00:09:21.600
Those are all examples of horizontal lines.
00:09:21.600 --> 00:09:24.500
Let us think about in the case of a horizontal line.
00:09:24.500 --> 00:09:30.100
It is a perfect prediction, no matter where the x is.
00:09:30.100 --> 00:09:31.900
You could tell me whatever x you want.
00:09:31.900 --> 00:09:36.800
I can exactly tell you the y, because y is always the same.
00:09:36.800 --> 00:09:38.500
y in this case is 3.
00:09:38.500 --> 00:09:49.300
It is perfect prediction and it is perfect line-ness, but the correlation coefficient is 0.
00:09:49.300 --> 00:09:59.600
We will try to figure out why that is for the horizontal line as we work out the formula for the correlation coefficient,
00:09:59.600 --> 00:10:06.100
and hopefully that will become more clear.
00:10:06.100 --> 00:10:13.000
There are many, many ways in which data can have seemingly no linear pattern
00:10:13.000 --> 00:10:18.200
or a very weak linear pattern, because that is all the correlation coefficient tells us about.
00:10:18.200 --> 00:10:27.300
If you see that our data have 0 as the correlation coefficient, do we know that it looks just like a cloud?
00:10:27.300 --> 00:10:33.800
No, in fact it can look like any one of these crazy shapes down here.
00:10:33.800 --> 00:10:42.300
All of these distributions, all of these scatter plots, have a very, very weak correlation, because remember, correlation just measures how line-y the data are.
00:10:42.300 --> 00:10:47.200
None of these is line-y.
00:10:47.200 --> 00:10:56.600
Even though some of these are very, very regular shapes, the correlation coefficient cannot tell you that a data set has an interesting shape.
00:10:56.600 --> 00:11:01.500
All it tells you is whether it coheres to that regression line or not.
00:11:01.500 --> 00:11:12.200
Although these are very interesting sets of data, for instance here there are four rough clusters, and even though we could see and eyeball that,
00:11:12.200 --> 00:11:14.900
The correlation coefficient would not tell us that.
00:11:14.900 --> 00:11:23.100
Or in this case, this curved sort of data set; even here the correlation coefficient would not tell us that either.
00:11:23.100 --> 00:11:29.400
For all of these data sets, the correlation coefficient is very close to 0.
00:11:29.400 --> 00:11:35.900
I want you to see there are many ways in which you can have a correlation coefficient of 1 or -1.
00:11:35.900 --> 00:11:43.500
There are many ways in which you can have a coefficient of 0.
00:11:43.500 --> 00:11:49.300
Just because we get the correlation coefficient does not mean we can see the shape of the distribution.
00:11:49.300 --> 00:12:00.900
That is why it is often useful to do a scatter plot anyway, even just for ourselves, so that we know what the numbers are probably describing.
00:12:00.900 --> 00:12:08.500
Let us say we have this graph, and it shows us a nice correlation.
00:12:08.500 --> 00:12:16.100
It is probably pretty high, like r = .8.
00:12:16.100 --> 00:12:20.600
It is closer to 1 than 0 but not quite 1.
00:12:20.600 --> 00:12:28.600
This is a pretty good correlation and you might have two variables here.
00:12:28.600 --> 00:12:39.300
For instance, perhaps this gives us the z scores for some variable; say we are looking at twins and we want to know, does the intelligence of one twin,
00:12:39.300 --> 00:12:44.900
does the IQ of one twin, help us predict the IQ of the other twin?
00:12:44.900 --> 00:12:47.100
Maybe it is true.
00:12:47.100 --> 00:12:50.700
Maybe that does seem to be the case.
00:12:50.700 --> 00:13:02.700
Here we might put the intelligence of twin 1, as the z score of their IQ, on this axis, and then the intelligence of twin 2, also as the z score of their IQ.
00:13:02.700 --> 00:13:06.300
We will put that on the y-axis.
00:13:06.300 --> 00:13:21.300
When we have the scatter plot, it is very important that we can toggle between the trees, the individual little dots, and the forest, the big overall pattern.
00:13:21.300 --> 00:13:24.500
When we look at correlation coefficients, we are standing back.
00:13:24.500 --> 00:13:36.100
We are sort of getting a bird’s eye view and looking very far away and trying to see the overall pattern and that is the forest.
00:13:36.100 --> 00:13:40.100
It is really important to remember what are my trees?
00:13:40.100 --> 00:13:42.400
What are my cases?
00:13:42.400 --> 00:13:46.100
It is important to remember what each dot means.
00:13:46.100 --> 00:13:51.800
That is what I mean by the trees: you want to remember, what are your cases?
00:13:51.800 --> 00:13:54.100
What are your variables?
00:13:54.100 --> 00:13:58.300
That is always step one of looking at a scatter plot.
00:13:58.300 --> 00:14:08.600
In this case, each of these dots does not represent just one twin; each dot represents a set of twins, a pair of twins.
00:14:08.600 --> 00:14:21.800
This one represents both twin 1, who is a little bit below average, and twin 2, who is actually a little bit above average.
00:14:21.800 --> 00:14:26.500
Let us pick out another one.
00:14:26.500 --> 00:14:28.500
Let us say this one.
00:14:28.500 --> 00:14:38.100
This twin is a little bit above average and, guess what, so is their twin.
00:14:38.100 --> 00:14:41.200
Their twin is also a little bit above average.
00:14:41.200 --> 00:14:47.900
Each of these dots actually represents 2 people in this case, a set of twins.
00:14:47.900 --> 00:14:56.400
You want to be able to switch your perspective and to zoom in and see the trees but also zoom out and see the forest
00:14:56.400 --> 00:15:09.200
and try to estimate things like correlation coefficient or even try to estimate the regression line and try to eyeball where that might be.
00:15:09.200 --> 00:15:13.400
Okay, now let us get to the business of calculating r.
00:15:13.400 --> 00:15:21.700
You could think of the correlation coefficient as roughly the average product of the z scores for x and y.
00:15:21.700 --> 00:15:26.800
Let us recap a little bit what the z scores are.
00:15:26.800 --> 00:15:36.200
z scores just tell you how many standard deviations away you are; we do not want to know distance in terms of the raw numbers.
00:15:36.200 --> 00:15:40.900
We want to know it in terms of standard deviation.
00:15:40.900 --> 00:15:46.900
We do not want to know, like how many feet away, but we want to know how many standard deviations away.
00:15:46.900 --> 00:15:51.800
We could think of the standard deviation as jumps away from the mean.
00:15:51.800 --> 00:15:54.200
How many of those jumps away are you?
00:15:54.200 --> 00:15:56.700
That is the z score.
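As a quick sketch of that idea in code (the heights and the helper name z_score are made up for illustration):

```python
from statistics import mean, stdev

def z_score(value, data):
    """How many standard-deviation 'jumps' the value sits from the mean."""
    return (value - mean(data)) / stdev(data)   # stdev() is the n-1 version

heights = [60, 62, 65, 68, 70]          # made-up data
print(round(z_score(70, heights), 2))   # about 1.21 jumps above the mean
```

Note that the z scores of a whole data set always balance out: they sum to 0, because the positive and negative distances from the mean cancel.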
00:15:56.700 --> 00:15:58.600
Here is how we calculate r.
00:15:58.600 --> 00:16:04.700
The average product of z scores for x and y.
00:16:04.700 --> 00:16:11.300
Let us put the z scores for x and y and multiply them together because we are getting the product.
00:16:11.300 --> 00:16:27.000
The product is z(x) × z(y), and I'm going to sum those products together and then divide by n-1.
00:16:27.000 --> 00:16:32.200
Later on we will talk more about why we divide by n-1.
00:16:32.200 --> 00:16:44.700
You can roughly see it is about the average; because we are jumping from samples to populations, we need to make a little bit of a correction.
00:16:44.700 --> 00:16:56.800
This formula of adding something up and dividing by roughly n is an average, and the things we are averaging are the products of the 2 z scores.
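A minimal sketch of that formula in code, using hypothetical x and y values (pearson_r is just an illustrative name):

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Correlation as the (n-1)-average of the products of paired z scores."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)   # n-1 standard deviations
    zx = [(x - mx) / sx for x in xs]
    zy = [(y - my) / sy for y in ys]
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)

print(round(pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]), 3))   # made-up data
```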
00:16:56.800 --> 00:17:10.500
Now, for all of these formulas, you can think of these little z scores as things you can double-click to see what is inside.
00:17:10.500 --> 00:17:18.600
Each z score let me write this in blue, so each z score is the distance away from the mean,
00:17:18.600 --> 00:17:25.300
but not the raw distance; I want it in terms of standard deviation jumps away from the mean.
00:17:25.300 --> 00:17:38.100
That would just be something like y - y bar, the y bar standing for the mean, and then that distance divided by the standard deviation.
00:17:38.100 --> 00:17:49.100
Here I will just use little s and also for z(x) that is just x - x bar.
00:17:49.100 --> 00:17:56.700
That is the raw distance away from the mean, divided by x's standard deviation.
00:17:56.700 --> 00:18:05.200
I will put a little x to indicate the standard deviation of x’s and a little y there to indicate the standard deviation of the y’s.
00:18:05.200 --> 00:18:12.600
I’m going to multiply those together and add them up for every single data point that I have.
00:18:12.600 --> 00:18:18.100
If that is my twin data for every single set of twins that I have.
00:18:18.100 --> 00:18:24.900
Then we divide all of that by n-1, where n is my number of cases.
00:18:24.900 --> 00:18:26.800
How many twins have you got?
00:18:26.800 --> 00:18:28.800
How many sets of twins have you got?
00:18:28.800 --> 00:18:36.900
Although it goes without saying, this formula implicitly has an index i that goes from 1 all the way to n,
00:18:36.900 --> 00:18:45.100
because it is for every single one of my data points that I need to do this.
00:18:45.100 --> 00:18:54.800
Furthermore, we can double-click on each of these little standard deviations.
00:18:54.800 --> 00:18:57.400
Now how do we find standard deviation?
00:18:57.400 --> 00:19:06.100
A standard deviation is the square root of the average squared distance away from the mean.
00:19:06.100 --> 00:19:07.900
The average distance away.
00:19:07.900 --> 00:19:14.600
The square root of the average squared distance, which gives a typical distance away.
00:19:14.600 --> 00:19:28.300
For s sub y, think about the distance; we already know how to do distance because we have already done it.
00:19:28.300 --> 00:19:39.300
It is the average squared distance because, remember, it is the sum of squares over n-1.
00:19:39.300 --> 00:19:52.700
It is the sum of squared distances because, if we just took the sum of the differences, we would get something very close to 0.
00:19:52.700 --> 00:20:03.600
We want that, and we divide by n-1 because that sum of squares tends to be a little too small, so we need to correct for that when going from samples to populations.
00:20:03.600 --> 00:20:05.400
That is what dividing by n-1 does.
00:20:05.400 --> 00:20:11.400
Because we want the standard deviation and not the variance we are going to square root this whole thing.
00:20:11.400 --> 00:20:20.600
The same thing for s(x); it is the same thing except we substitute an x instead of a y.
00:20:20.600 --> 00:20:27.500
I forgot to put my little sigma notation because I want to do this for every single y.
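Unpacked into code, the standard deviation piece looks like this (sample_sd is an illustrative name, and the numbers are made up):

```python
from math import sqrt

def sample_sd(values):
    """Square root of the sum of squared distances from the mean, over n-1."""
    m = sum(values) / len(values)
    ss = sum((v - m) ** 2 for v in values)   # sum of squares
    return sqrt(ss / (len(values) - 1))      # n-1: the sample-to-population correction

print(round(sample_sd([2, 4, 5, 4, 5]), 4))
```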
00:20:27.500 --> 00:20:32.000
It looks sort of complicated if we write the whole thing out,
00:20:32.000 --> 00:20:42.100
and if we wrote the actual double-clicked version of s sub y in there, it might look very crazy.
00:20:42.100 --> 00:20:55.700
What you actually have to remember is ultimately less: the main idea you want to get out of today, and then take a moment to think, what is a z score?
00:20:55.700 --> 00:21:04.900
Once you unpack z score, take a moment to think, what is a standard deviation, and hopefully you will be able to unpack those things as you go.
00:21:04.900 --> 00:21:12.200
Then you do not have to remember all of that stuff at once you can just remember them one at a time.
00:21:12.200 --> 00:21:20.100
Now that you know the formula for correlation coefficient let us talk about the relationship between correlation and slope.
00:21:20.100 --> 00:21:25.400
We already know that b1 and r have the same sign.
00:21:25.400 --> 00:21:28.100
If b1 is negative, r will be negative.
00:21:28.100 --> 00:21:32.100
If b1 is positive r will be positive and vice versa.
00:21:32.100 --> 00:21:39.700
We already know that they have the same sign and because of that they already slant in the correct way.
00:21:39.700 --> 00:21:45.300
Remember, r does not have anything about rise and run in it.
00:21:45.300 --> 00:21:49.300
All it cares about is how much like a line it is.
00:21:49.300 --> 00:22:01.400
b1 and r have a very strict relationship: when you multiply r by the ratio of the standard deviation
00:22:01.400 --> 00:22:15.300
of y over the standard deviation of x, and you can almost see rise over run in that ratio, then you get the slope.
00:22:15.300 --> 00:22:29.200
Let us just think about this in our heads: if r is 1, then whatever this ratio is, that will exactly give us b1.
00:22:29.200 --> 00:22:36.800
Also, if r is 1 and these two have very similar standard deviations,
00:22:36.800 --> 00:22:47.800
so the spread of y is very similar to the spread of x, then we should get a slope of about 1.
00:22:47.800 --> 00:23:04.800
In that case you would be able to say, that makes sense: if y is varying in a similar way to x, then they should have a slope of about 1.
00:23:04.800 --> 00:23:19.200
If y is changing more slowly than x, then for every step in x you only go up a tiny bit in y.
00:23:19.200 --> 00:23:31.500
In that case, this number would be smaller than that one, and that would give us less rise, more run.
00:23:31.500 --> 00:23:35.600
Something that looks sort of less slanted.
00:23:35.600 --> 00:23:43.900
Something like this versus a slope of 1.
00:23:43.900 --> 00:23:51.600
Something a little more shallow, and that makes sense: less rise, more run.
00:23:51.600 --> 00:24:05.800
On the other side, if for every little bit of x you go up a lot of y, then that would look something like this: more rise, less run.
00:24:05.800 --> 00:24:14.500
This gives us this perfect relationship between r and b1.
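To see that relationship numerically, here is a small check with hypothetical data: multiplying r by s_y/s_x lands on the same slope that a direct least-squares calculation gives.

```python
from statistics import mean, stdev

xs = [1, 2, 3, 4, 5]   # hypothetical x values
ys = [2, 4, 5, 4, 5]   # hypothetical y values

mx, my = mean(xs), mean(ys)
sx, sy = stdev(xs), stdev(ys)   # n-1 standard deviations

# r as the (n-1)-average of the products of paired z scores
r = sum((x - mx) / sx * (y - my) / sy for x, y in zip(xs, ys)) / (len(xs) - 1)

# slope from the correlation: b1 = r * sy / sx
b1 = r * sy / sx

# direct least-squares slope for comparison
b1_direct = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

print(round(b1, 3), round(b1_direct, 3))  # the two agree
```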
00:24:14.500 --> 00:24:18.300
Using that information let us try to solve this problem.
00:24:18.300 --> 00:24:26.400
Example 1, here are the 3 pizza companies that we have looked at before, Papa John's, Dominoes and Pizza Hut.
00:24:26.400 --> 00:24:35.100
It says find the correlation between grams of fat and cost.
00:24:35.100 --> 00:24:45.000
I think these are for a whole pizza; let us make this 17.50.
00:24:45.000 --> 00:24:52.300
Let us make these $18 and $20, because it would be really cheap to have a $1.75 pizza.
00:24:52.300 --> 00:24:58.400
It would be ridiculous to have 100g of fat in one slice of pizza.
00:24:58.400 --> 00:25:13.200
If you look at the examples provided in the download below we can use the data in order to find correlation coefficient.
00:25:13.200 --> 00:25:19.500
In order to find the correlation coefficient I will break it down into component pieces,
00:25:19.500 --> 00:25:25.500
and the big component pieces I'm going to need are the z scores for x and the z scores for y.
00:25:25.500 --> 00:25:29.300
I will say that is the z score for fat and that is the z score for cost.
00:25:29.300 --> 00:25:41.200
Z score for fat and z score for cost.
00:25:41.200 --> 00:25:49.500
In order to find the z score I would need to put in the difference between this and the average.
00:25:49.500 --> 00:26:04.300
One thing that might be easier is if we actually just create a column for averages because we are probably going to need this again and again.
00:26:04.300 --> 00:26:11.000
Let me go ahead and get those averages.
00:26:11.000 --> 00:26:17.700
I’m just getting the average cost, as well as average grams of fat.
00:26:17.700 --> 00:26:25.800
I’m going to color it in a different color so that we know that this is the entirely different thing here.
00:26:25.800 --> 00:26:31.100
We have that it would be easier for us to find the score for fat.
00:26:31.100 --> 00:26:59.900
Here we want to get this fat value minus the average, and we probably want to lock that average in place, and then we want to divide by the standard deviation,
00:26:59.900 --> 00:27:04.100
and the nice thing about Excel is that it already has the function for standard deviation.
00:27:04.100 --> 00:27:25.500
This one gives us the n-1 version, so I can just lock that range down before I copy the formula over.
00:27:25.500 --> 00:27:32.200
I probably want to copy it over to E later so I’m just going to unlock the B part.
00:27:32.200 --> 00:27:37.300
As long as I stay in the same column, as long as I stay in column D, it will use column B.
00:27:37.300 --> 00:27:42.900
If I move over to Column E it should use column C.
00:27:42.900 --> 00:27:44.400
Let us try that.
00:27:44.400 --> 00:27:49.300
Here we see that the z scores are -1, 0, and 1, and that makes sense.
00:27:49.300 --> 00:27:57.100
Your z scores totaled together should roughly equal 0, because you are measuring distance on the positive side
00:27:57.100 --> 00:28:03.100
and distance on the negative side, and they should balance out around the mean.
00:28:03.100 --> 00:28:12.600
Let us check this formula: yes, it is using B3, it has the average, and it is getting the standard deviation. Perfect.
00:28:12.600 --> 00:28:20.400
Once I have that I can actually just copy and paste this over here.
00:28:20.400 --> 00:28:31.600
Here we see now it is using C and this average and getting the standard deviation of this data.
00:28:31.600 --> 00:28:38.800
We see the negative side roughly balances the positive side.
00:28:38.800 --> 00:28:48.800
We have these individual z scores; now we need to get the z score for fat multiplied by the z score for cost.
00:28:48.800 --> 00:28:56.000
That is real easy, this times this for every single data point or case that we have and we have 3 cases here.
00:28:56.000 --> 00:28:59.300
The 3 different brands of pizza.
00:28:59.300 --> 00:29:10.800
Once we have that, we could just get the average all at once, because we could put it in one formula.
00:29:10.800 --> 00:29:21.700
We could just sum these together and then divide by n-1.
00:29:21.700 --> 00:29:25.100
In this case, it is 2.
00:29:25.100 --> 00:29:34.200
If you wanted to put it in a formula you could put in COUNT minus 1, but I'm just going to put 2 here for our purposes.
00:29:34.200 --> 00:29:48.400
We get a very, very high correlation, very close to 1: as cost goes up, fat goes up.
00:29:48.400 --> 00:29:52.300
As cost goes down fat goes down.
00:29:52.300 --> 00:29:56.200
They have a very positive correlation, and it is very line-y.
00:29:56.200 --> 00:30:02.000
The data here stick very close to the line.
00:30:02.000 --> 00:30:13.400
Here we could see that this data is very highly correlated.
00:30:13.400 --> 00:30:15.200
It has a strong correlation.
00:30:15.200 --> 00:30:27.300
We do not have a lot of points, but apparently they fall very, very close to the line.
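Here is roughly what that spreadsheet computes, sketched in code. The three costs are the ones from the example; the grams-of-fat values are placeholders I made up, since the real data live in the download.

```python
from statistics import mean, stdev

fat  = [100, 110, 120]         # hypothetical grams of fat per whole pizza
cost = [17.50, 18.00, 20.00]   # the three costs from the example

mf, mc = mean(fat), mean(cost)
sf, sc = stdev(fat), stdev(cost)   # the n-1 version, like Excel's STDEV

z_fat  = [(f - mf) / sf for f in fat]
z_cost = [(c - mc) / sc for c in cost]

# sum of the products of the z scores, divided by n-1
r = sum(a * b for a, b in zip(z_fat, z_cost)) / (len(fat) - 1)
print(round(r, 2))   # very close to 1: a strong positive correlation
```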
00:30:27.300 --> 00:30:34.000
Previously, we found that the regression line for this data is this.
00:30:34.000 --> 00:30:44.500
I believe in that case the costs were 17.50, $18.00, and $20, as before.
00:30:44.500 --> 00:30:55.700
Previously we already found the regression, so let us check the relationship between r from the previous example and b1.
00:30:55.700 --> 00:31:06.700
It is asking us: is this really true for b1 in this example? We are not going to do a formal proof, but just see for ourselves.
00:31:06.700 --> 00:31:17.900
Does b1 really equal r times the ratio of the standard deviation of y over the standard deviation of x?
00:31:17.900 --> 00:31:21.700
Is this relationship really true?
00:31:21.700 --> 00:31:34.200
Well, we already know b1 is .125, and we already know r.
00:31:34.200 --> 00:31:54.700
We know r is .94, so does .94 multiplied by s sub y over s(x) equal .125?
00:31:54.700 --> 00:31:56.200
Let us see.
00:31:56.200 --> 00:32:02.500
That is not too hard; let me move that up here.
00:32:02.500 --> 00:32:15.400
We have r over here; I'm just going to find s sub y and s sub x and multiply.
00:32:15.400 --> 00:32:28.200
I will just create another column for standard deviation, and let us get the standard deviation for x and the standard deviation for y.
00:32:28.200 --> 00:32:47.900
Now we can check that r × the standard deviation of y over the standard deviation of x is equal to .125.
00:32:47.900 --> 00:33:08.200
That relationship holds; we have b1 and r over on this side, so we know what these things are.
00:33:08.200 --> 00:33:09.200
There you have it.
00:33:09.200 --> 00:33:14.600
We see that the relationship between r and b1 holds.
00:33:14.600 --> 00:33:24.700
There is a little bit of rise for a lot of run, and we know that this line is pretty shallow, so that makes sense.
00:33:24.700 --> 00:33:26.400
This is a pretty shallow slope.
00:33:26.400 --> 00:33:37.700
There is little rise over a lot of run, and because of that, the fraction is less than 1.
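Using the same hypothetical pizza numbers as before (the fat values are placeholders, the costs are from the example), you can check this relationship in code:

```python
from statistics import mean, stdev

fat  = [100, 110, 120]         # hypothetical fat values, as before
cost = [17.50, 18.00, 20.00]

mf, mc = mean(fat), mean(cost)
sf, sc = stdev(fat), stdev(cost)
r = sum((f - mf) / sf * (c - mc) / sc for f, c in zip(fat, cost)) / (len(fat) - 1)

b1 = r * sc / sf      # rise (cost) over run (fat)
print(round(b1, 3))   # with these placeholder values this lands on the example's slope of .125
```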
00:33:37.700 --> 00:33:46.200
Example 3: the mean score on a math achievement test for a community college was 504 with a standard deviation of 112.
00:33:46.200 --> 00:33:55.100
For the corresponding reading achievement test, the mean was 515 and the standard deviation was 116.
00:33:55.100 --> 00:33:57.800
The correlation coefficient is .7.
00:33:57.800 --> 00:34:01.400
Use this information to find the regression line.
00:34:01.400 --> 00:34:09.200
Here we see that we have the correlation coefficient, but they do not give us the data.
00:34:09.200 --> 00:34:10.700
Can we still do this?
00:34:10.700 --> 00:34:18.300
Yes, we can because there is a relationship between the correlation coefficient and the standard deviation.
00:34:18.300 --> 00:34:26.000
There is a relationship between the correlation coefficient and the slope, and all we need to know are the standard deviations in order to find it.
00:34:26.000 --> 00:34:32.700
B1 = r × s sub y / s sub x.
00:34:32.700 --> 00:34:37.300
We actually know s sub y, s sub x, and r, so we can find b1.
00:34:37.300 --> 00:34:42.700
Once we know b1, we also have the point of averages.
00:34:42.700 --> 00:34:50.100
We have point of averages, which is x bar and y bar.
00:34:50.100 --> 00:35:02.700
In fact, let us say math is x and reading is y, so here we have x bar = 504 and y bar = 515.
00:35:02.700 --> 00:35:10.700
We can get the slope, and with one point, the point of averages, we can find the intercept.
00:35:10.700 --> 00:35:22.100
Let us go ahead: r is .7, and s sub y, which is the reading one, is 116.
00:35:22.100 --> 00:35:27.100
S sub x is 112.
00:35:27.100 --> 00:35:34.000
We can find b1 and I’m just going to use a little bit of space down here to just do the calculations.
00:35:34.000 --> 00:35:37.300
Feel free to do this on your calculator.
00:35:37.300 --> 00:35:48.100
.7 × 116/112 and that is .725.
00:35:48.100 --> 00:35:53.700
I have here .725 as my slope.
00:35:53.700 --> 00:36:00.400
Once I have my slope I can put that into my slope-intercept form.
00:36:00.400 --> 00:36:12.900
My y is 515 and I'm looking for the intercept.
00:36:12.900 --> 00:36:26.300
That equals the intercept plus .725 × x, where x is 504.
00:36:26.300 --> 00:36:34.100
When I go ahead and solve that here, that is going to be 515 - .725 × 504.
00:36:34.100 --> 00:36:50.100
I will get 149.6.
00:36:50.100 --> 00:36:56.100
My b sub 0 = 149.6.
00:36:56.100 --> 00:37:01.500
With these two pieces we can now find the regression line.
00:37:01.500 --> 00:37:17.000
The regression line, in order to predict y, is going to be b sub 0, the intercept, 149.6, plus .725 × x, and that is our regression line.
00:37:17.000 --> 00:37:26.900
Here we see that this slope is less than 1: less rise, more run.
00:37:26.900 --> 00:37:39.000
It is a shallower slope, and notice that you do not need to have all the points in order to find the regression line.
00:37:39.000 --> 00:37:49.500
Example number 4, find the correlation coefficient for this set of data and this set of data is provided for you on the download below.
00:37:49.500 --> 00:37:55.500
If you go ahead and click on example 4 that data is all there.
00:37:55.500 --> 00:38:02.000
Previously we looked at the data and we thought this was a pretty strong linear correlation.
00:38:02.000 --> 00:38:05.100
Let us see if our eyeballing was actually right.
00:38:05.100 --> 00:38:15.900
I’m just going to move this one over a little bit because we are not going to need that as much.
00:38:15.900 --> 00:38:18.500
Let me shrink this down a little bit.
00:38:18.500 --> 00:38:31.600
It always helps me to think about what I am trying to find: the correlation coefficient, which I know is the average product of the z scores.
00:38:31.600 --> 00:38:34.900
I need to find the z scores.
00:38:34.900 --> 00:38:42.900
I need to find the z scores for the student-faculty ratio, the SFR.
00:38:42.900 --> 00:38:48.700
I also want to find the z scores for cost per unit, the CPU.
00:38:48.700 --> 00:38:50.700
Let us go ahead and do that.
00:38:50.700 --> 00:38:55.600
In order to do that it is often helpful if you have the mean and standard deviation.
00:38:55.600 --> 00:38:59.700
Let us find the mean and standard deviation somewhere.
00:38:59.700 --> 00:39:21.400
Let me just get the means here and move this over by one column, just so that I can write mean and standard deviation.
00:39:21.400 --> 00:39:30.200
Sometimes we will get confused as to what we are doing, and it is often helpful to write these things down.
00:39:30.200 --> 00:39:37.500
I like to put it in a different color because that helps me know this is not part of my data.
00:39:37.500 --> 00:39:44.000
Let us get the mean of all of our data here.
00:39:44.000 --> 00:39:50.700
The data for the student faculty ratio, as well as the cost per unit.
00:39:50.700 --> 00:40:00.100
Select that same data and find the standard deviation, because we are going to need that for the z scores.
00:40:00.100 --> 00:40:04.800
It is just useful to have it in advance.
00:40:04.800 --> 00:40:08.100
We have the mean and the standard deviation.
00:40:08.100 --> 00:40:16.400
Here I’m just going to put a little divider here for now so that I can move this down.
00:40:16.400 --> 00:40:21.800
Notice that it goes from row 7 to row 34.
00:40:21.800 --> 00:40:23.800
Let us find the z scores.
00:40:23.800 --> 00:40:26.900
Now that we have mean and standard deviation it should be really easy.
00:40:26.900 --> 00:40:37.600
It is just the difference between my point and my average all divided by standard deviation.
00:40:37.600 --> 00:40:53.100
I want to lock down that mean and standard deviation, but only partly: as long as I stay in the same column it will use these cells, but I do want it to use column E when I move over.
00:40:53.100 --> 00:40:58.200
I’m not going to lock down the column part; I’m just locking down the row.
00:40:58.200 --> 00:41:04.900
I get a z score of -1.556.
00:41:04.900 --> 00:41:10.900
If my z score calculations and my mean and all that stuff are correct.
00:41:10.900 --> 00:41:21.300
I should roughly have z scores that are both positive and negative, and they should roughly balance out.
00:41:21.300 --> 00:41:29.300
Let us take a look at our data and it seems like half of them are negative and roughly half of them are positive.
00:41:29.300 --> 00:41:30.900
They should balance out.
00:41:30.900 --> 00:41:38.700
Once I have that I could actually take all of these guys and drag that over.
00:41:38.700 --> 00:41:40.700
Let us check one of these formulas here.
00:41:40.700 --> 00:41:50.300
This one gives me the deviation, the difference between this point and the mean, divided by its standard deviation.
00:41:50.300 --> 00:41:52.700
Perfect.
00:41:52.700 --> 00:41:58.000
Once we have that I know I need to multiply and get the product of these z scores.
00:41:58.000 --> 00:42:08.100
z(SFR) × z(CPU).
00:42:08.100 --> 00:42:12.600
Let us see what we could do here.
00:42:12.600 --> 00:42:25.100
Here I’m just going to multiply this times this for every single one of my points.
00:42:25.100 --> 00:42:30.200
Once I get down here, I know I need to find the mean of these products.
00:42:30.200 --> 00:42:38.300
But I do not want to use just the formula for the mean, because that is going to divide by n.
00:42:38.300 --> 00:42:40.200
We are going to divide by n -1.
00:42:40.200 --> 00:42:43.500
So I will split it up, adding all of these up first.
00:42:43.500 --> 00:43:03.800
I am going to sum them all up and divide by the count, and instead of counting all of these myself,
00:43:03.800 --> 00:43:21.400
I’m just going to use the same points here: count all of this, subtract 1, and put it all in my green parentheses.
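The whole spreadsheet procedure can be mirrored in Python. The numbers below are made-up stand-ins, since the actual data lives in the example's download file; `sfr` and `cpu` are names I chose.

```python
# Correlation coefficient as the average product of z scores,
# dividing by n - 1 to match the sample standard deviation.
# These data are hypothetical; the real data comes from the download.
import statistics as stats

sfr = [14, 18, 22, 25, 30]        # student-faculty ratio (hypothetical)
cpu = [900, 700, 500, 420, 300]   # cost per unit (hypothetical)

n = len(sfr)
# z score: (point - mean) / sample standard deviation
z_sfr = [(x - stats.mean(sfr)) / stats.stdev(sfr) for x in sfr]
z_cpu = [(y - stats.mean(cpu)) / stats.stdev(cpu) for y in cpu]

# sum of the products of paired z scores, divided by n - 1
r = sum(zx * zy for zx, zy in zip(z_sfr, z_cpu)) / (n - 1)
print(round(r, 3))
```

As a sanity check, the z scores in each column sum to (essentially) zero, which matches the "positives and negatives should roughly balance out" check from earlier in the lesson.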
00:43:21.400 --> 00:43:27.700
We get a negative correlation coefficient that is pretty strong in magnitude.
00:43:27.700 --> 00:43:35.500
It is above .8 in magnitude, so let us take a look at our data to see if that makes sense to us.
00:43:35.500 --> 00:43:38.800
We certainly understand why it is negative.
00:43:38.800 --> 00:43:43.500
It makes sense that r is negative, and did we think it was pretty strong?
00:43:43.500 --> 00:43:54.600
We did think it was pretty strong, and it does end up being stronger than .6 or .7.
00:43:54.600 --> 00:43:58.000
That is the correlation coefficient. See you next time on www.educator.com.