WEBVTT mathematics/statistics/son
00:00:00.000 --> 00:00:02.400
Hi and welcome to www.educator.com.
00:00:02.400 --> 00:00:07.300
Today we are going to be talking about regression.
00:00:07.300 --> 00:00:10.400
Here is the big goal of this lesson.
00:00:10.400 --> 00:00:18.300
Basically we want to set up a conceptual understanding of regression before we actually learn to calculate it.
00:00:18.300 --> 00:00:26.000
Today we are going to do just a brief review of linear equations and talk about regression as a center line.
00:00:26.000 --> 00:00:37.100
Instead of a center point like the mean, we will talk about a center line, and then we are going to talk about prediction and error.
00:00:37.100 --> 00:00:39.700
What is a linear equation?
00:00:39.700 --> 00:00:47.700
y = mx + b should be pretty familiar to a lot of you. Whenever we think of y = mx + b,
00:00:47.700 --> 00:01:04.100
you can think of y as the output or f(x), and x will be the input, often whatever is on the horizontal axis, the x axis.
00:01:04.100 --> 00:01:13.300
b is the y intercept.
00:01:13.300 --> 00:01:20.100
Another way you could think of the y intercept is: when x = 0, what is y?
00:01:20.100 --> 00:01:33.500
x = 0 would be here, and anywhere that x is 0, y will have to be somewhere on the y axis.
00:01:33.500 --> 00:01:36.400
That is what we mean by y intercept.
00:01:36.400 --> 00:01:43.300
m is the slope.
00:01:43.300 --> 00:01:58.100
Slope is something you pretty much remember, but just in case you do not, here is how we calculate slope.
00:01:58.100 --> 00:02:04.200
Slope is the change of y over the change of x.
00:02:04.200 --> 00:02:12.400
When we say change we think of Δ, the change of y over change of x.
00:02:12.400 --> 00:02:20.300
More commonly people refer to it as rise/run.
00:02:20.300 --> 00:02:26.300
When you think of rise, you think of going up vertically or down vertically.
00:02:26.300 --> 00:02:29.600
That is the rise, and the run is going in a more horizontal direction.
00:02:29.600 --> 00:02:31.600
What is the rise/run?
00:02:31.600 --> 00:02:34.200
That is what we think of as a slope.
00:02:34.200 --> 00:02:42.300
When we think about rise/run, the positive direction is up and to the right.
00:02:42.300 --> 00:02:49.900
The negative direction will be down and to the left.
00:02:49.900 --> 00:02:57.100
You could think of rise/run as an indication of rate of change.
00:02:57.100 --> 00:03:03.100
How much does y change in relation to x?
00:03:03.100 --> 00:03:17.000
These are the components of our linear equation, and it does not matter what the line looks like.
00:03:17.000 --> 00:03:31.300
Every straight line has an equation, and from that equation you can figure out, for any x, what the y is, and for any y, what the x is.
00:03:31.300 --> 00:03:43.900
If you have an x and y and a slope you could figure out the intercept, and if you have the intercept and an x and y you could figure out the slope.
00:03:43.900 --> 00:03:47.200
This is a useful equation for us.
00:03:47.200 --> 00:03:51.300
We are going to be trying to find a line that is in the middle.
00:03:51.300 --> 00:03:55.000
That is the center of the data points.
00:03:55.000 --> 00:04:03.100
In order to do that, we would have to find its equation because the equation is what defines the line.
00:04:03.100 --> 00:04:09.800
In statistics, we are going to use this equation but we are going to write it in a different way.
00:04:09.800 --> 00:04:17.900
It is not a conceptual change; we are going to change the way it is written just very slightly and superficially.
00:04:17.900 --> 00:04:21.700
The first thing we do is we talk about the y intercept first.
00:04:21.700 --> 00:04:36.000
In statistics, the y intercept b comes first, and since it is the first b it is b sub 0, or b naught.
00:04:36.000 --> 00:04:47.400
Instead of the y intercept being added second we start from the y intercept and then we add the slope × x.
00:04:47.400 --> 00:04:53.200
Notice that the slope is not called m anymore, it is called b sub 1.
00:04:53.200 --> 00:05:03.500
We have b sub 0 which is the y intercept and b sub 1 which is slope.
00:05:03.500 --> 00:05:06.000
Same idea as before.
00:05:06.000 --> 00:05:19.900
This is how I will refer to things when we talk about the equation of a regression line.
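Written out, the two forms describe the same line, with the terms reordered and renamed:

```latex
y = mx + b
\quad\longrightarrow\quad
y = b_0 + b_1 x,
\qquad b_0 = b \ \text{(y intercept)}, \quad b_1 = m \ \text{(slope)}
```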
00:05:19.900 --> 00:05:25.300
What do we mean by the center line?
00:05:25.300 --> 00:05:37.600
If you think about a scatter of data, if you have a whole bunch of data you want to think of a line that somehow cuts through the middle of all of these points.
00:05:37.600 --> 00:05:53.300
Right now we could just roughly draw a line and try to make it cut through the center of all these points.
00:05:53.300 --> 00:05:56.200
That is a very rough line.
00:05:56.200 --> 00:06:04.700
I can find the equation of this line as long as I have 2 of the points on the line.
00:06:04.700 --> 00:06:13.800
For example, if I take this data point and this data point I could find the equation of that line.
00:06:13.800 --> 00:06:24.300
It is because by having 2 x's and 2 y's, a set of x and y and another set of x and y, I can calculate rise/run.
00:06:24.300 --> 00:06:30.500
From having slope and x and y, I could calculate the y intercept.
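As a minimal sketch of that calculation: the function name `line_through` and the two points are made up for illustration and are not read off the lesson's plot.

```python
def line_through(p1, p2):
    """Return (b0, b1): the intercept and slope of the line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("vertical line: slope is undefined")
    b1 = (y2 - y1) / (x2 - x1)  # slope = rise / run
    b0 = y1 - b1 * x1           # intercept from the slope and one (x, y)
    return b0, b1

# Hypothetical points, chosen only for illustration
b0, b1 = line_through((1, 5), (3, 9))
print(b0, b1)  # 3.0 2.0
```

Picking a different pair of points from the same scatter would generally return a different (b0, b1), which is the problem with eyeballing.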
00:06:30.500 --> 00:06:36.400
That is a rough line, though, because it just depends on which 2 points I take.
00:06:36.400 --> 00:06:54.500
If I pick these 2 points I will get this line, but let us say I pick this point and this point, then I would get an entirely different line.
00:06:54.500 --> 00:07:06.300
Moreover if I pick this point and that point I will get an even more different line.
00:07:06.300 --> 00:07:10.600
The question is which 2 points will you pick?
00:07:10.600 --> 00:07:19.100
It might not be good enough for us to just eyeball things because we are not sure which 2 points to pick.
00:07:19.100 --> 00:07:28.300
If we have 2 data points then life is easy, you could just use those 2, but usually we have more than 2 data points.
00:07:28.300 --> 00:07:32.700
Just eyeballing a rough line may not be good enough for us.
00:07:32.700 --> 00:07:45.000
If we only had 2 data points, we could manually find the slope and intercept and find the equation of that line.
00:07:45.000 --> 00:07:50.600
Let us talk about this regression as a center line instead of a center point.
00:07:50.600 --> 00:07:58.600
Here are some reasons for summarizing with a regression line, and notice that for all of these I am talking about scatter plots.
00:07:58.600 --> 00:08:01.700
Regression lines are used for scatter plots.
00:08:01.700 --> 00:08:06.800
Here what we want to do is we want to have some variable.
00:08:06.800 --> 00:08:11.500
Here is my first variable, variable 1.
00:08:11.500 --> 00:08:15.700
Here is my other variable, variable 2.
00:08:15.700 --> 00:08:29.400
We want to have a line that describes the center of all of these cases, whatever the cases may be.
00:08:29.400 --> 00:08:32.500
Why do we have a line?
00:08:32.500 --> 00:08:36.200
Why not just a mean?
00:08:36.200 --> 00:08:41.800
Sometimes there is not enough info from just a point.
00:08:41.800 --> 00:08:52.700
If you just have a point, for instance this point is the mean of my x and y.
00:08:52.700 --> 00:08:57.800
That would be x bar and y bar.
00:08:57.800 --> 00:09:07.700
Let us say that is my center point, that point might not give us enough information about this whole distribution.
00:09:07.700 --> 00:09:17.000
We are talking about how to summarize a distribution, and what about the trend? We do not just want a point, we would like a trend.
00:09:17.000 --> 00:09:23.000
To get more information than that point gives, it is useful to have a center line.
00:09:23.000 --> 00:09:33.700
We want to find the summary that describes the relationship between the 2 variables.
00:09:33.700 --> 00:09:40.400
It is not enough just to have a point, the point would not describe the relationship between the 2 but the line does.
00:09:40.400 --> 00:09:47.300
A line will tell you whether its slope is negative or positive.
00:09:47.300 --> 00:09:54.400
The line will tell you what kind of information you would want from a trend.
00:09:54.400 --> 00:10:03.400
That relationship is important to us, and we will not get that information from just a point.
00:10:03.400 --> 00:10:17.300
Another reason that you want to summarize with a regression line is that it is helpful to use one variable to predict the other variable.
00:10:17.300 --> 00:10:24.500
Often by convention we will put whatever you feel is the predictor variable on the x axis.
00:10:24.500 --> 00:10:28.500
We may use these to predict these.
00:10:28.500 --> 00:10:35.900
We may use someone's weight to predict their height, or vice versa.
00:10:35.900 --> 00:10:39.900
In this case it does not matter which one is the predictor.
00:10:39.900 --> 00:10:45.700
Predictor variables are not necessarily causal variables.
00:10:45.700 --> 00:10:52.000
They are just variables that we use in order to predict the second variable.
00:10:52.000 --> 00:10:55.300
That second variable is called the response variable.
00:10:55.300 --> 00:11:04.300
One thing that is important to know is that the predictor variable by convention or by tradition goes on the x axis
00:11:04.300 --> 00:11:09.400
and the response variable goes on the y axis.
00:11:09.400 --> 00:11:20.800
That goes along with this idea of a function: we put in x, and f(x) crunches out an output for us.
00:11:20.800 --> 00:11:22.800
That is how we think of predictors.
00:11:22.800 --> 00:11:31.100
You put in that predictor and it will crunch out for you the response.
00:11:31.100 --> 00:11:38.200
When we talk about prediction, those predictions lie on the regression line.
00:11:38.200 --> 00:11:49.000
This regression line is made up of all of our predictions.
00:11:49.000 --> 00:12:12.100
This means that when x is 27, the regression line shows us that y would be 180 or something like that.
00:12:12.100 --> 00:12:17.000
All the predictions lie on the actual line.
00:12:17.000 --> 00:12:22.600
Notice that a lot of our points do not lie on the prediction line.
00:12:22.600 --> 00:12:32.100
There is a little bit of difference between the actual data and the predicted data.
00:12:32.100 --> 00:12:38.600
Here is the goal of regression, the goal of this line.
00:12:38.600 --> 00:12:42.100
Our fundamental desire is to find this line that is the center.
00:12:42.100 --> 00:12:47.300
It describes the middle of all these points.
00:12:47.300 --> 00:12:53.500
If you want to think about what the center means, it is that all the distances on one side
00:12:53.500 --> 00:12:56.700
balance all the distances on the other side.
00:12:56.700 --> 00:13:03.400
It does not mean that it has to be a perfectly symmetrical distribution.
00:13:03.400 --> 00:13:11.000
It just means that the line in the middle has to be as far, in total, from all of these points as it is from all of these points.
00:13:11.000 --> 00:13:14.500
Think about it as a balance.
00:13:14.500 --> 00:13:17.100
It just has to balance each other out.
00:13:17.100 --> 00:13:20.000
All of the distances have to balance each other out.
00:13:20.000 --> 00:13:21.900
That is how I want you to think about it.
00:13:21.900 --> 00:13:25.100
The distances on one side of the line
00:13:25.100 --> 00:13:29.100
balance all the distances on the other side of the line.
00:13:29.100 --> 00:13:39.200
To show you here is one distance.
00:13:39.200 --> 00:13:44.100
Let us take this point.
00:13:44.100 --> 00:13:51.000
This is the distance; this is its distance in y away from the line.
00:13:51.000 --> 00:14:06.000
I need all of these distances to be balanced out by all of these distances.
00:14:06.000 --> 00:14:12.500
That is all of these around the regression line, and this is a long distance here.
00:14:12.500 --> 00:14:16.100
I need all of these distances to balance each other out.
00:14:16.100 --> 00:14:21.400
Now how would you find such a line because that seems like a lot of work?
00:14:21.400 --> 00:14:29.900
We would have to find a line, find all the distances, move the line around, and make sure all the distances are perfectly, evenly matched.
00:14:29.900 --> 00:14:31.400
That seems hard.
00:14:31.400 --> 00:14:42.900
We will learn to calculate the precise slope and intercept of this middle line, the regression line, by using the method of least squares.
00:14:42.900 --> 00:14:49.600
This is going to be a beautiful shortcut for us so that we can find that line without having to do all that work.
00:14:49.600 --> 00:14:53.400
That is in the next lesson.
00:14:53.400 --> 00:14:57.400
Let us pretend that I have just given you the beautiful regression line.
00:14:57.400 --> 00:15:00.300
I have just found it for you.
00:15:00.300 --> 00:15:07.900
Let us say here I will show you by age.
00:15:07.900 --> 00:15:25.700
Here on the x axis we have age; when you are like 25 you might drink less milk than when you are 15 or 12.
00:15:25.700 --> 00:15:32.200
Here are servings of milk.
00:15:32.200 --> 00:15:42.900
I have already drawn for you this regression line and if you trace it all the way up it may intercept at 795
00:15:42.900 --> 00:15:55.600
and if you look at rise/run it will be a rise of -22, going down 22, for going to the right by 1.
00:15:55.600 --> 00:15:58.400
-22/1.
00:15:58.400 --> 00:16:07.600
Here we have this nice line and there are 2 ways you could use prediction.
00:16:07.600 --> 00:16:22.400
One is that you could use prediction in order to predict data that you do not have.
00:16:22.400 --> 00:16:34.300
We have data for a 12 year old and we have data for a 28 year old.
00:16:34.300 --> 00:16:47.300
If I wanted to predict somebody in between that, say I wanted to predict a 20 year old's milk drinking,
00:16:47.300 --> 00:16:57.300
what I can do is put 20 in the equation and find the predicted servings of milk.
00:16:57.300 --> 00:17:04.000
I could just do 795 – 22 × 20.
00:17:04.000 --> 00:17:07.500
I could get my predicted servings of milk.
00:17:07.500 --> 00:17:19.400
When we make a prediction, instead of calling it y, we are going to call it y hat.
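A minimal sketch of that prediction, using the intercept (795) and slope (-22) from the milk example; the names `B0`, `B1`, and `predict_servings` are made up for illustration.

```python
B0 = 795   # y intercept of the milk regression line
B1 = -22   # slope: 22 fewer servings for each additional year of age

def predict_servings(age):
    """Predicted servings of milk (y hat) for a given age."""
    return B0 + B1 * age

print(predict_servings(20))  # 355
```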
00:17:19.400 --> 00:17:22.000
This is called interpolation.
00:17:22.000 --> 00:17:32.100
When you have a range of x's and you are finding something within that range of x's, your predictors are within that range of x's.
00:17:32.100 --> 00:17:38.300
You could think of it as within the boundaries.
00:17:38.300 --> 00:17:53.600
Staying within the range of data because this is the data that I use in order to create my line
00:17:53.600 --> 00:17:59.600
and if I stay within the range of my data, that is interpolation.
00:17:59.600 --> 00:18:06.300
If I go outside the range of my data, that is extrapolation.
00:18:06.300 --> 00:18:22.500
For instance, we do not have data for 10 year olds. Can I just make a prediction up?
00:18:22.500 --> 00:18:25.600
Can I just do that?
00:18:25.600 --> 00:18:33.400
Can I find my predicted y for people who are 10 years old?
00:18:33.400 --> 00:18:36.800
Obviously I can from just using the equation of the line.
00:18:36.800 --> 00:18:39.600
That is not the hard part.
00:18:39.600 --> 00:18:47.700
It is easy to plug in 10 but the question is can I actually do this?
00:18:47.700 --> 00:18:52.200
Is it legal for me to do?
00:18:52.200 --> 00:19:04.500
The reason why we separate this into 2 different ways of predicting is that extrapolation
00:19:04.500 --> 00:19:27.800
is a little bit more risky because you are going outside the boundaries of your data.
00:19:27.800 --> 00:19:38.000
Because you are going outside of the boundaries of our data we are not sure that our predictions are going to be accurate.
00:19:38.000 --> 00:19:49.900
When we stay within the range of our data it is safer because it is most similar to the data that we used to create the line.
00:19:49.900 --> 00:19:52.300
There is interpolation and extrapolation.
00:19:52.300 --> 00:19:57.400
What I want you to know is extrapolation is more dangerous than interpolation.
00:19:57.400 --> 00:20:07.300
Let us say we go all the way to 0 years of age, would it be true that they drink all these servings of milk?
00:20:07.300 --> 00:20:08.800
They do not.
00:20:08.800 --> 00:20:11.200
They drink infant formula or breast milk.
00:20:11.200 --> 00:20:25.800
It would be wrong if I said that infants who are 0 years old drink 795 servings of milk a year.
00:20:25.800 --> 00:20:37.500
That is what we mean by extrapolation being a little bit more dangerous.
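The distinction can be sketched as a simple range check. The 12 to 28 age range below is an assumption taken from the ages mentioned for the milk data, and `prediction_kind` is a hypothetical helper name.

```python
def prediction_kind(x, x_min, x_max):
    """Label a prediction by whether x falls inside the range of the data."""
    return "interpolation" if x_min <= x <= x_max else "extrapolation"

# Assumed range of the milk data: ages 12 through 28
AGE_MIN, AGE_MAX = 12, 28

for age in (20, 10, 0):
    print(age, prediction_kind(age, AGE_MIN, AGE_MAX))
# 20 interpolation
# 10 extrapolation
# 0 extrapolation
```

The equation will happily return a number in either case; the check only flags which predictions deserve more caution.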
00:20:37.500 --> 00:20:40.600
Let us talk about errors in prediction.
00:20:40.600 --> 00:20:51.100
Even though we have this nice equation for the line, a common problem is that the servings of milk per year
00:20:51.100 --> 00:20:57.200
that we predict, y hat, is not always going to fit our data.
00:20:57.200 --> 00:21:00.300
It is not always going to line up perfectly with our data.
00:21:00.300 --> 00:21:08.100
In fact you could see here there is a lot of jitter around the line and that is called prediction error.
00:21:08.100 --> 00:21:23.200
The prediction error is the difference between the real truth and our prediction.
00:21:23.200 --> 00:21:26.700
Whenever we have data, it is often from a sample.
00:21:26.700 --> 00:21:29.200
We do not know what the real truth is.
00:21:29.200 --> 00:21:31.300
We only have the sample.
00:21:31.300 --> 00:21:36.500
Often we want to know prediction error but this is a theoretical idea.
00:21:36.500 --> 00:21:41.800
It is the difference between the truth and our prediction, but we do not know what the truth is.
00:21:41.800 --> 00:21:45.000
What we do have is our data, the sample.
00:21:45.000 --> 00:21:52.600
What we can find is not the real prediction error but what we call the residual.
00:21:52.600 --> 00:22:02.200
After we find the middle line, then what we can find is the difference between our data and that line.
00:22:02.200 --> 00:22:06.200
That is called the residual.
00:22:06.200 --> 00:22:22.800
This idea here, the distance between our actual y, the data, and our predicted y, y hat, that is called the residual.
00:22:22.800 --> 00:22:26.100
Notice that we have a whole bunch of residuals.
00:22:26.100 --> 00:22:43.800
Here is the thing: some of our data is greater than our prediction and some of our data is less than our prediction.
00:22:43.800 --> 00:22:48.300
So there is a whole bunch of positives and a whole bunch of negatives.
00:22:48.300 --> 00:23:00.000
The predicted y, the perfect middle line, actually has a balance of positives and negatives.
00:23:00.000 --> 00:23:13.000
If we add up all those positives and negatives, this distance is exactly equal to this distance.
00:23:13.000 --> 00:23:15.700
These are positive and these are negative.
00:23:15.700 --> 00:23:20.800
When we add them all together we will get 0.
00:23:20.800 --> 00:23:27.200
The idea is all the residuals on this side and all the residuals on this side add up to 0 because
00:23:27.200 --> 00:23:33.200
that would mean that our line is truly on the middle of all these distances.
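To see the balance numerically, here is a minimal sketch. The five data points are made up for illustration, and the slope and intercept are computed with the standard least squares formulas, which belong to the next lesson and are assumed here rather than explained.

```python
from statistics import mean

# Made-up data for illustration (not the lesson's actual milk data)
ages = [12, 16, 20, 24, 28]
servings = [540, 430, 370, 225, 190]

x_bar, y_bar = mean(ages), mean(servings)

# Standard least squares slope and intercept (assumed, covered next lesson)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, servings)) / sum(
    (x - x_bar) ** 2 for x in ages
)
b0 = y_bar - b1 * x_bar

# Residual = actual y minus predicted y (y hat)
residuals = [y - (b0 + b1 * x) for x, y in zip(ages, servings)]

print(residuals)       # a mix of positive and negative values
print(sum(residuals))  # essentially 0, up to floating point rounding
```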
00:23:33.200 --> 00:23:35.700
That is called a residual.
00:23:35.700 --> 00:23:38.100
Let us go to our first example.
00:23:38.100 --> 00:23:47.200
This is the same data that we were working with, and the question is what is the residual for the milk drinking of a 24 year old?
00:23:47.200 --> 00:24:04.900
Since we are finding the residual, we know that the residual is the difference between the data, y, and y hat, the predicted y.
00:24:04.900 --> 00:24:12.100
To put it into our example, it is the actual servings of milk that 24 year olds drink, the data that we have.
00:24:12.100 --> 00:24:19.000
Subtract out the predicted servings of milk that 24 year olds drink.
00:24:19.000 --> 00:24:24.600
First things first, let us find how much milk 24 year olds drink.
00:24:24.600 --> 00:24:32.400
If we go to 24, this is our data point right here, and we can just read off what the point looks like to us.
00:24:32.400 --> 00:24:41.100
It looks like it is maybe at 24 and 225 or something.
00:24:41.100 --> 00:24:48.000
We already have our y, 225.
00:24:48.000 --> 00:24:50.700
We just need to find y hat.
00:24:50.700 --> 00:25:00.800
In order to find y hat, all we have to do is put 24 into this regression equation.
00:25:00.800 --> 00:25:08.800
y hat is equal to 795 – 22 × 24.
00:25:08.800 --> 00:25:11.900
That will be our predicted y.
00:25:11.900 --> 00:25:30.600
Here I am just going to bring up Excel and put in 795 – 22 × 24.
00:25:30.600 --> 00:25:34.600
We will get 267.
00:25:34.600 --> 00:25:37.900
That is equal to 267.
00:25:37.900 --> 00:26:01.500
We have 225 – 267, and that makes sense because our predicted servings of milk is above our actual servings of milk from our data.
00:26:01.500 --> 00:26:08.600
We should get a negative number.
00:26:08.600 --> 00:26:22.300
Let us do it in Excel: I am going to put in 225 – 267, and here I get -42.
00:26:22.300 --> 00:26:31.400
That is our residual for milk drinking of a 24 year old.
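The same arithmetic as a minimal sketch, with 225 being the value read approximately off the plot for the 24 year old; the function name `residual` is made up for illustration.

```python
def residual(y, y_hat):
    """Residual = actual y minus predicted y (y hat)."""
    return y - y_hat

y = 225                 # actual servings for the 24 year old, read off the plot
y_hat = 795 - 22 * 24   # predicted servings from the regression line

print(y_hat, residual(y, y_hat))  # 267 -42
```

A negative result means the data point sits below the line, since the prediction is larger than the actual value.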
00:26:31.400 --> 00:26:39.300
Example 2, if a residual is large and negative, where is the point located with respect to the line?
00:26:39.300 --> 00:26:41.600
What does it mean for the residual to be negative?
00:26:41.600 --> 00:26:52.700
We already have an example of a residual being negative; it means that the point is far from the line and below it on the y axis.
00:26:52.700 --> 00:26:56.400
Just to draw some examples for you.
00:26:56.400 --> 00:27:04.500
If we have a line that looks like this, one idea is the point is way down here.
00:27:04.500 --> 00:27:19.000
It will be large and negative given the y hat and the y, because the residual = y – y hat.
00:27:19.000 --> 00:27:24.600
Another example that I could draw for you is something like this.
00:27:24.600 --> 00:27:37.800
Even in this case, this will give us a large negative residual because once again our y hat is greater than our y.
00:27:37.800 --> 00:27:55.100
If the residual is negative, if the residual is less than 0 then it must mean that our y hat is greater than our y.
00:27:55.100 --> 00:28:02.300
With respect to the line, the point is below the line.
00:28:02.300 --> 00:28:05.400
What does it mean for the residual to be negative?
00:28:05.400 --> 00:28:15.900
It means that our prediction is greater than our data point.
00:28:15.900 --> 00:28:30.500
Example 3, if somebody said that they have fit a line to a set of data points and all their residuals are positive, what would you say to them?
00:28:30.500 --> 00:28:32.900
Let us just think about this.
00:28:32.900 --> 00:28:38.300
Let us say we have some sort of a line and all the residuals are positive.
00:28:38.300 --> 00:28:48.800
That would mean that every data point is somehow above this line because if they are below that would be negative.
00:28:48.800 --> 00:28:57.600
Could that ever be the case if we want our line to be in the middle of all these points?
00:28:57.600 --> 00:28:59.100
No.
00:28:59.100 --> 00:29:13.700
I would probably say to them that perhaps they have made a mistake, because the positive distances should balance the negative ones; roughly half should be positive and half negative.
00:29:13.700 --> 00:29:19.700
Sometimes you could have 2 small positive distances and one larger negative distance.
00:29:19.700 --> 00:29:26.100
It could balance out like that, but you cannot have all positive, nor can you have all negative.
00:29:26.100 --> 00:29:37.800
I would say to them your line is not in the middle of all these points.
00:29:37.800 --> 00:29:42.400
It is not a good regression line.
00:29:42.400 --> 00:29:48.000
Example 4, interpret the y intercept of the regression line in the milk example.
00:29:48.000 --> 00:29:50.900
Does it make sense to extrapolate here?
00:29:50.900 --> 00:30:06.400
One thing that you need to know is the x axis only goes from 10 – 30 but we need to take it all the way out to 5, 0.
00:30:06.400 --> 00:30:19.100
What we mean is that here is where the true y intercept is, because x has to be 0.
00:30:19.100 --> 00:30:22.700
Does it make sense to extrapolate here?
00:30:22.700 --> 00:30:33.000
This would mean that when x is 0 then y would be 795.
00:30:33.000 --> 00:30:35.000
Let us think about what that means.
00:30:35.000 --> 00:30:50.300
When x is 0, age would be 0, so we are talking about newborns. Is it true that newborns drink 795 servings of milk?
00:30:50.300 --> 00:30:52.000
We just talked about that.
00:30:52.000 --> 00:30:56.700
It does not make sense to extrapolate here because newborns are a special case.
00:30:56.700 --> 00:31:07.900
They do not really drink milk, they drink breast milk and infant formula, and because of that it does not make sense to talk about newborns drinking milk yet.
00:31:07.900 --> 00:31:14.200
It does not quite make sense to extrapolate that way.
00:31:14.200 --> 00:31:21.300
Newborns are an exception, and presumably this line will go on and on and on.
00:31:21.300 --> 00:31:26.000
There will be a point where it crosses the x axis.
00:31:26.000 --> 00:31:31.100
This is the x intercept, where y = 0.
00:31:31.100 --> 00:31:41.000
It may not make sense to extrapolate there either just because at a certain point the servings of milk might go into negative.
00:31:41.000 --> 00:31:45.700
That does not make sense in our data.
00:31:45.700 --> 00:31:52.600
It does not quite make sense to extrapolate beyond the confines of our data.
00:31:52.600 --> 00:31:55.700
That is a conceptual understanding of regression.
00:31:55.700 --> 00:32:02.000
Hope to see you again for calculating regression next time on www.educator.com.