WEBVTT mathematics/statistics/son
00:00:00.000 --> 00:00:02.400
Hi and welcome to www.educator.com.
00:00:02.400 --> 00:00:08.500
Today we are going to be talking about variability.
00:00:08.500 --> 00:00:14.600
We are going to start off with just a conceptual introduction to the different kinds of ways that you could measure variability.
00:00:14.600 --> 00:00:19.500
Then we are going to be talking about range, quartiles, and interquartile range.
00:00:19.500 --> 00:00:23.300
We are going to be talking about variance and standard deviation.
00:00:23.300 --> 00:00:30.700
In particular, we are going to focus a little bit on the concept of sum of squares.
00:00:30.700 --> 00:00:40.800
We are going to be talking about population standard deviation versus sample standard deviation and the differences in their formulas.
00:00:40.800 --> 00:00:44.300
We are going to calculate standard deviation in Excel.
00:00:44.300 --> 00:00:48.300
Let us get started.
00:00:48.300 --> 00:00:54.900
Let us think about a conceptual way of thinking about variability.
00:00:54.900 --> 00:00:58.900
There are a lot of different ways that you could actually think about variability.
00:00:58.900 --> 00:01:01.000
For instance, let me give you this example.
00:01:01.000 --> 00:01:13.800
Let us say this x right here, shown in each of these, is President Barack Obama.
00:01:13.800 --> 00:01:28.400
Let us say that this is the president and these are different groups of people that are standing within a formal event.
00:01:28.400 --> 00:01:33.400
Here we see the Secret Service and this is how far each of them is from him.
00:01:33.400 --> 00:01:40.300
Here we see the Supreme Court justices and they are scattered around him.
00:01:40.300 --> 00:01:45.500
Here are his cabinet members that he has appointed and they are scattered around him.
00:01:45.500 --> 00:01:48.000
Here are the Tea Party senators.
00:01:48.000 --> 00:01:53.700
Let us just say that they are the senators that do not like the president as much.
00:01:53.700 --> 00:01:59.200
They seem to be huddled over here.
00:01:59.200 --> 00:02:06.100
Which of these groups of people are most spread out from the president?
00:02:06.100 --> 00:02:09.600
Which of these groups of people are closest to him?
00:02:09.600 --> 00:02:13.300
Who is closest to the president?
00:02:13.300 --> 00:02:16.800
Can we describe that with a number?
00:02:16.800 --> 00:02:19.500
There are a couple of ways that you might want to think about it.
00:02:19.500 --> 00:02:31.300
One, we might just look at the farthest person away from the president in each of these sets.
00:02:31.300 --> 00:02:38.400
Maybe for this it is this guy or this guy and get that distance, maybe that is the distance that you need.
00:02:38.400 --> 00:02:42.600
For this, it is maybe this guy or this guy.
00:02:42.600 --> 00:02:46.600
Maybe here it is that guy over there.
00:02:46.600 --> 00:02:51.900
Maybe here it is this guy, maybe that guy, they seem pretty distant.
00:02:51.900 --> 00:02:54.100
I think that guy is a little bit farther.
00:02:54.100 --> 00:02:59.100
Just looking at the farthest person in the group, that is one way of looking at it.
00:02:59.100 --> 00:03:03.900
In that case, it does not matter how many people in the group you have.
00:03:03.900 --> 00:03:12.800
This group has fewer people than this group, but it would not matter if we are looking at just the one farthest guy in the group.
00:03:12.800 --> 00:03:14.300
That is one way of looking at it.
00:03:14.300 --> 00:03:22.700
Another way of looking at it is creating a little boundary and saying how many people are in that boundary.
00:03:22.700 --> 00:03:28.300
Maybe we have this little square around the president and we just look at how many people are in that square.
00:03:28.300 --> 00:03:45.800
Maybe for here if we draw a square like that, how many people fall in that square?
00:03:45.800 --> 00:03:55.800
If that was our measure we would say this group is the closest to the president. Right?
00:03:55.800 --> 00:03:59.000
Here we have 1 person in this square and none of the other groups have any people in their square.
00:03:59.000 --> 00:04:04.800
Maybe we could look at different types of squares and see if that changes anything.
00:04:04.800 --> 00:04:06.700
That may be one way of doing it.
00:04:06.700 --> 00:04:19.900
Another way of doing it might be to find the area of the border.
00:04:19.900 --> 00:04:29.700
That is another way of doing it.
00:04:29.700 --> 00:04:40.100
That one does not seem to be a very good model because it would mean that these people are the closest to the president, but this is an odd group.
00:04:40.100 --> 00:04:45.200
They are close to each other but not necessarily close to the president.
00:04:45.200 --> 00:04:49.200
Should that matter in a measure of variability?
00:04:49.200 --> 00:04:50.800
That is another thing to think about.
00:04:50.800 --> 00:05:02.900
The one that probably comes to your mind is this idea of the average distance of all these guys away from the president.
00:05:02.900 --> 00:05:09.000
Who has the closest average distance?
00:05:09.000 --> 00:05:16.200
We also would not need to worry about how many are in the group because we divide by the number of people in the group.
00:05:16.200 --> 00:05:24.700
It actually would not matter if they are close to each other or not, we just care about the distance to the president.
00:05:24.700 --> 00:05:28.600
These are different ways that you could think about variability.
00:05:28.600 --> 00:05:37.200
Notice that they are all ways of sticking a number on this concept of variability but you might come up with different numbers.
00:05:37.200 --> 00:05:46.800
You might come up with different definitions for what it means to be spread out versus very close.
00:05:46.800 --> 00:05:57.400
There are some things to think about: should we be measuring how far they are from the center or how far they are from each other?
00:05:57.400 --> 00:06:05.900
Center is going to be an important concept in variability, so should we measure it from the median, mode, or mean?
00:06:05.900 --> 00:06:09.600
Does it matter if this group has few or many members?
00:06:09.600 --> 00:06:12.100
Should that be taken into account?
00:06:12.100 --> 00:06:19.700
Does it matter what direction they are in from the president or from that center point, whether it is to the right or to the left, up or down?
00:06:19.700 --> 00:06:21.700
What about consistent clustering?
00:06:21.700 --> 00:06:23.300
Should that matter?
00:06:23.300 --> 00:06:28.400
Those are some things to think about when we think about a measure of variability.
00:06:28.400 --> 00:06:30.900
There are lots of different kinds of measures of variability.
00:06:30.900 --> 00:06:40.300
We are going to be talking about two classes of them that are going to address these questions in different ways.
00:06:40.300 --> 00:06:46.600
The first class of measures that we want to think about is range, quartiles, and interquartile range.
00:06:46.600 --> 00:06:55.900
This is the idea of just taking the one farthest guy or the one closest guy by looking at that person.
00:06:55.900 --> 00:07:01.700
Usually, these measures of variability are used with median.
00:07:01.700 --> 00:07:06.500
It is usually measuring the spread around the median.
00:07:06.500 --> 00:07:19.200
One of the reasons that this is going to be the case is that when we look at range, quartiles, and interquartile range, what we are doing is taking our distribution and cutting it up.
00:07:19.200 --> 00:07:23.800
Either cutting it up in a half which would be the median, the middle point.
00:07:23.800 --> 00:07:27.300
Or cutting it up into quartiles, right?
00:07:27.300 --> 00:07:32.500
Which would be cutting it into ¼ instead of ½.
00:07:32.500 --> 00:07:34.200
That is the idea.
00:07:34.200 --> 00:07:41.600
That is why we are going to be using median as their measure of central tendency.
00:07:41.600 --> 00:07:46.100
When we think about range, you do not need a central tendency at all.
00:07:46.100 --> 00:07:53.500
What you need is the minimum value and the maximum value and the distance in between.
00:07:53.500 --> 00:08:04.800
You could think of it as the maximum value in the set of x then subtract the minimum value in the set of x.
00:08:04.800 --> 00:08:12.700
If you have 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 as your distribution, you take 10 – 1 and your range is 9.
00:08:12.700 --> 00:08:24.500
The problem with that measure of variability, even though it is very simple and intuitive, is that it is highly susceptible to outliers.
00:08:24.500 --> 00:08:36.000
If we change our set to something like 1, 2, 3, 4, 5, 6, 7, 8, 9, 100, all of a sudden it will be 100 – 1 and our range will be 99.
00:08:36.000 --> 00:08:43.400
Just by changing one of our numbers we could drastically change the range.
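If you would like to check that range example outside of the slides, here is a minimal Python sketch. It is not part of the lecture, which uses Excel later on; the names `values`, `with_outlier`, and `value_range` are made up for illustration.

```python
# The two small distributions from the range example.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]


def value_range(xs):
    """Range = maximum value in the set minus minimum value in the set."""
    return max(xs) - min(xs)


range_plain = value_range(values)          # 10 - 1
range_outlier = value_range(with_outlier)  # 100 - 1

print("range without outlier:", range_plain)
print("range with outlier:", range_outlier)
```

Changing the single value 10 to 100 is enough to change the range from 9 to 99, which is the sensitivity to outliers described above.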
00:08:43.400 --> 00:08:48.900
Interquartile range is going to be less susceptible to those outliers but before we get into
00:08:48.900 --> 00:08:57.300
how to calculate interquartile range, we have to divide that data into quartiles.
00:08:57.300 --> 00:09:09.300
Let us just look at a simple example.
00:09:09.300 --> 00:09:16.200
Here what we would need to do is divide this data into quartiles first.
00:09:16.200 --> 00:09:30.000
Since there is an even number of values, the median would fall at 5.5, and to divide the data further into quartiles we divide it at 3 and at 8.
00:09:30.000 --> 00:09:34.800
Here is the first quartile, second quartile, third, and fourth.
00:09:34.800 --> 00:09:39.000
Because of that, these borders actually have special little names.
00:09:39.000 --> 00:09:50.100
These borders are called Q1, Q2, and Q3, just to indicate that they are the borders of the quartiles.
00:09:50.100 --> 00:10:02.600
First you divide the data into quartiles and then basically, in order to get the interquartile range, you are lopping off these guys on the ends.
00:10:02.600 --> 00:10:08.400
It is like the end of bread or cucumber, just like chopping it off, casting it aside.
00:10:08.400 --> 00:10:12.100
Just in case that there are some extreme outliers.
00:10:12.100 --> 00:10:19.800
Here what we do is we take Q3 – Q1.
00:10:19.800 --> 00:10:26.600
In this case, it would be 8 – 3 and the interquartile range would be 5.
00:10:26.600 --> 00:10:37.900
Here the interquartile range gives you the idea that 50% of the numbers fall into this range because that is two quartiles.
00:10:37.900 --> 00:10:40.100
That is 50% right there.
00:10:40.100 --> 00:10:42.000
That is why it is a nice measure.
00:10:42.000 --> 00:10:47.900
It is more robust than the actual range because it is less susceptible to outliers.
00:10:47.900 --> 00:10:55.800
It is still intuitive and you can see that a nice 50% of all the numbers fall in this range.
00:10:55.800 --> 00:11:00.700
That is interquartile range, pretty easy.
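Here is a small Python sketch of that Q3 – Q1 calculation on the same 1 through 10 data. It is only an illustration, not the lecture's own method: it assumes the "median of each half" way of finding Q1 and Q3, which matches the 3 and 8 used above, and other software may use slightly different quartile conventions. The function name `quartiles` is hypothetical.

```python
from statistics import median


def quartiles(xs):
    """Return (Q1, Q2, Q3) using the median-of-each-half convention.

    Assumption: for an odd number of values the middle value is left out
    of both halves. Other quartile conventions exist.
    """
    xs = sorted(xs)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    upper = xs[half:] if n % 2 == 0 else xs[half + 1:]
    return median(lower), median(xs), median(upper)


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

q1, q2, q3 = quartiles(values)
iqr = q3 - q1

q1_out, q2_out, q3_out = quartiles(with_outlier)
iqr_outlier = q3_out - q1_out

print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR:", iqr)
print("IQR with the outlier 100:", iqr_outlier)
```

Notice that swapping 10 for 100 leaves the interquartile range unchanged, because the extreme value sits in the part that gets lopped off.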
00:11:00.700 --> 00:11:03.500
Let us do an example.
00:11:03.500 --> 00:11:10.900
Here let us say that there are these ages and we want to know what the interquartile range of this set is.
00:11:10.900 --> 00:11:14.700
First, it helps to separate them by quartiles.
00:11:14.700 --> 00:11:27.300
There are 3, 6, 10 numbers here, and because of that, here is the midpoint.
00:11:27.300 --> 00:11:36.500
The median, also called Q2, is 30.
00:11:36.500 --> 00:11:49.500
Here is Q1 and here is Q3.
00:11:49.500 --> 00:12:01.000
In order to find interquartile range, sometimes called IQR, it is Q3 – Q1.
00:12:01.000 --> 00:12:07.100
In this case it would be 38 – 20.
00:12:07.100 --> 00:12:10.100
Interquartile range is 18.
00:12:10.100 --> 00:12:17.400
With 18, here we could just draw that distance of 18.
00:12:17.400 --> 00:12:29.200
In that distance, 50% of our numbers fall in there, between 20 and 38.
00:12:29.200 --> 00:12:33.500
We are going to be talking about variance and standard deviation.
00:12:33.500 --> 00:12:39.000
When we talk about variance and standard deviation, it is more like in that conceptual example,
00:12:39.000 --> 00:12:45.900
that distance away from the president, where we are looking at the actual distance.
00:12:45.900 --> 00:12:56.100
In statistics, what we call distance away from the mean, the president in this case, is a deviation.
00:12:56.100 --> 00:13:04.200
What we might want to do is get the average deviation but there is going to be a little bit of an issue.
00:13:04.200 --> 00:13:12.900
When we get the deviations from the mean, remember the mean is the value at the middle.
00:13:12.900 --> 00:13:18.200
The mean is actually in the middle of all the other values.
00:13:18.200 --> 00:13:24.600
Some of the values are going to be greater than the mean and some of the values are less than the mean.
00:13:24.600 --> 00:13:39.800
When we add all of those up, the formula looks like this: the summation sign, and we take each value in our distribution, x sub i, and subtract the mean.
00:13:39.800 --> 00:13:44.300
Get that distance away from the mean, that deviation from the mean.
00:13:44.300 --> 00:13:52.000
When we add all those up, where i goes from 1 all the way to n, however many we have in our sample.
00:13:52.000 --> 00:14:01.200
We basically get 0 because sometimes the value is greater than the mean and sometimes the value is less than the mean.
00:14:01.200 --> 00:14:05.400
When it is greater the number is greater than 0.
00:14:05.400 --> 00:14:08.100
When it is less, the number is less than 0.
00:14:08.100 --> 00:14:13.800
When we add up a whole bunch of positive and negative deviations, they cancel each other out and you end up with 0.
00:14:13.800 --> 00:14:21.100
That is the problem because when you get 0 as your sum and you divide by whatever your n is,
00:14:21.100 --> 00:14:28.300
no matter what n is, it is going to be 0 because 0 divided by anything is 0.
00:14:28.300 --> 00:14:30.300
This is not going to work for us.
00:14:30.300 --> 00:14:37.200
That is not going to be good to have every single average deviation being 0.
00:14:37.200 --> 00:14:38.100
That is not useful.
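To see that cancelling in actual numbers, here is a quick Python sketch. It is an illustration only and reuses the 1 through 10 values from the range example as made-up data.

```python
# Illustrative data only: the 1..10 distribution from earlier.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

mean = sum(values) / len(values)

# Deviation of each value from the mean: positive above it, negative below it.
deviations = [x - mean for x in values]

# The positive and negative deviations cancel, so the sum comes out to 0
# (up to floating-point rounding), and so does the "average deviation".
sum_of_deviations = sum(deviations)
average_deviation = sum_of_deviations / len(values)

print("deviations:", deviations)
print("sum of deviations:", sum_of_deviations)
print("average deviation:", average_deviation)
```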
00:14:38.100 --> 00:14:39.700
What do we do?
00:14:39.700 --> 00:14:43.500
Here we are going to sum the squared deviation.
00:14:43.500 --> 00:14:49.800
Instead of just summing up all the deviations, we are going to square each deviation and then sum those up.
00:14:49.800 --> 00:14:55.100
Whenever you square a nonzero number, you get a positive number.
00:14:55.100 --> 00:14:58.800
The sum of squares is never going to be negative.
00:14:58.800 --> 00:15:05.100
You will get many advantages out of doing this squaring business and we will learn more about some of those advantages later.
00:15:05.100 --> 00:15:09.400
Let us talk about how to write this in notation.
00:15:09.400 --> 00:15:18.800
Here we have that same idea, that same deviation idea where we are looking at distances away from the mean,
00:15:18.800 --> 00:15:24.800
but we are going to square each of those distances.
00:15:24.800 --> 00:15:28.800
i = 1 to n.
00:15:28.800 --> 00:15:39.300
Just a word about this summing notation, basically when you have the summing notation whatever is here,
00:15:39.300 --> 00:15:45.500
you need to do this first and then sum up everything in here.
00:15:45.500 --> 00:15:55.900
Sometimes what people do is they sum up all of the x sub i first and then subtract out x bar.
00:15:55.900 --> 00:16:04.300
But we are not summing the values, we are summing the squared deviations.
00:16:04.300 --> 00:16:07.000
You got to get the squared deviation first.
00:16:07.000 --> 00:16:17.000
Each value is going to have a distance and each of those distances needs to be squared and then you need to add them up.
00:16:17.000 --> 00:16:24.900
This would not be equal to 0 unless all your values are the same, so that every value equals the mean.
00:16:24.900 --> 00:16:29.100
So in practice it will not usually equal 0.
00:16:29.100 --> 00:16:38.500
This is going to be called sum of squares and that is often shown by using the term ss.
00:16:38.500 --> 00:16:45.700
If it is the sum of squares of the sample, sometimes you will see this notation where it has a little x down there.
00:16:45.700 --> 00:16:57.000
If it is the sum of squares of the population, which you will probably never have, it will be SS sub X.
00:16:57.000 --> 00:17:05.700
We could look at the average squared distance from the mean, average squared deviation.
00:17:05.700 --> 00:17:10.600
You will do that simply by dividing by the number of values you have.
00:17:10.600 --> 00:17:20.000
When we have the variance of the sample, that is going to be called s², that is going to be the variance.
00:17:20.000 --> 00:17:21.600
I will write it in blue, right?
00:17:21.600 --> 00:17:24.300
That is the variance of a sample.
00:17:24.300 --> 00:17:31.500
That is just going to be ss ÷ n.
00:17:31.500 --> 00:17:39.900
The problem with variance is that it is not in the same units as your mean because we have squared all the distances.
00:17:39.900 --> 00:17:45.400
In order to bring it back to the same units as the mean, so it is easier for comparison,
00:17:45.400 --> 00:17:51.500
what we are going to do is get the standard deviation by just square rooting each side.
00:17:51.500 --> 00:18:05.200
Standard deviation is just s and that is going to be just the square root of variance.
00:18:05.200 --> 00:18:13.900
Standard deviation is now roughly the average distance from the mean, instead of the average squared distance away from the mean.
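Putting those three steps together (sum of squares, then variance as SS ÷ n, then the square root), a minimal Python sketch might look like this. The data are the made-up 1 through 10 values again, and the variable names are just for illustration.

```python
import math

# Illustrative data only.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
n = len(values)
x_bar = sum(values) / n

# Step 1: square each deviation FIRST, then add the squares up.
squared_deviations = [(x - x_bar) ** 2 for x in values]
ss = sum(squared_deviations)

# Step 2: variance of the sample = SS divided by n (average squared deviation).
variance = ss / n

# Step 3: standard deviation = square root of the variance,
# which puts it back in the same units as the mean.
sd = math.sqrt(variance)

print("SS:", ss)
print("variance:", variance)
print("standard deviation:", sd)
```

Note the order of operations: squaring happens inside the sum, one value at a time, which is why the result does not cancel to 0 the way the plain deviations do.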
00:18:13.900 --> 00:18:25.200
This is going to be for samples, but in order to get variance for the population they use the lower case sigma.
00:18:25.200 --> 00:18:31.600
For variance it will be lower case σ² and for standard deviation it will be just lower case σ.
00:18:31.600 --> 00:18:36.700
I will show you in a little bit how to do that.
00:18:36.700 --> 00:18:43.000
Let us take a little bit of time to talk about sum of squares in depth.
00:18:43.000 --> 00:18:56.000
Before that, there is a little typo on this page, I’m just going to correct that so that it will be smooth when we get down here.
00:18:56.000 --> 00:19:05.200
Let us start from the beginning, sum of squares is always this sum of squared distances away from the mean of the sample.
00:19:05.200 --> 00:19:10.300
The mean of the sample is x bar, that is how we denote it.
00:19:10.300 --> 00:19:12.800
That is the symbol for it.
00:19:12.800 --> 00:19:20.200
The sum of squared distances away from the mean is going to be smaller than the sum of squares from any other point.
00:19:20.200 --> 00:19:26.800
You can pick any other number, but the mean will give you the smallest sum of squares.
00:19:26.800 --> 00:19:31.000
Any other number will give you a bigger sum of squares.
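You can try that claim out for yourself with a short Python sketch. This is a numeric check on made-up data, not a proof; the helper `sum_of_squares_about` and the list of `other_points` are hypothetical names chosen for illustration.

```python
# Illustrative data only.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
x_bar = sum(values) / len(values)


def sum_of_squares_about(center, xs):
    """Sum of squared distances of every value in xs from `center`."""
    return sum((x - center) ** 2 for x in xs)


ss_about_mean = sum_of_squares_about(x_bar, values)

# A few arbitrary "other points" to compare against the sample mean.
other_points = [4.0, 5.0, 6.0, 7.0]
ss_about_others = {c: sum_of_squares_about(c, values) for c in other_points}

print("SS about the mean:", ss_about_mean)
for c, ss in ss_about_others.items():
    print("SS about", c, "=", ss, "| bigger than SS about the mean:", ss > ss_about_mean)
```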
00:19:31.000 --> 00:19:38.700
Here is the problem, the sample mean is rarely ever the actual population mean.
00:19:38.700 --> 00:19:45.100
Because of that, the population mean is this any other point.
00:19:45.100 --> 00:19:53.500
If we had the real sum of squares from the population mean, we would actually get a bigger sum of squares than the one we actually have.
00:19:53.500 --> 00:19:54.900
That is the problem.
00:19:54.900 --> 00:20:00.800
Here is why: because we have a sum of squares that is a little bit too small,
00:20:00.800 --> 00:20:10.000
our sample standard deviation is going to be actually a little bit smaller than our population standard deviation all the time.
00:20:10.000 --> 00:20:11.100
That is an issue.
00:20:11.100 --> 00:20:15.600
We are always undershooting the population standard deviation.
00:20:15.600 --> 00:20:26.700
To correct for this, we are going to divide the sum of squares from our sample by a slightly smaller number than we currently do.
00:20:26.700 --> 00:20:37.000
Right now, to get the standard deviation of the sample, we take the square root of the sum of squares ÷ n.
00:20:37.000 --> 00:20:39.400
That is what we do right now.
00:20:39.400 --> 00:20:46.200
This will help us approximate the actual population.
00:20:46.200 --> 00:20:52.300
Here we are going to need to divide by a slightly smaller number
00:20:52.300 --> 00:20:59.200
because when we divide by a smaller number, then our resulting answer is slightly bigger.
00:20:59.200 --> 00:21:08.600
Dividing by 5 we are going to get a bigger answer than if you divide by 8.
00:21:08.600 --> 00:21:12.200
Because of that we are going to use that.
00:21:12.200 --> 00:21:32.900
Instead, in order to approximate the population standard deviation, what we are going to do is use SS ÷ (n – 1) and take the square root.
00:21:32.900 --> 00:21:41.600
This is going to be a slightly smaller number, giving us a slightly bigger estimate of the population standard deviation.
00:21:41.600 --> 00:21:46.800
Why n – 1? Why not n – 0.5 or n – 2?
00:21:46.800 --> 00:21:54.700
There is a proof that you could look up online for what is called Bessel's correction, and it is a really elegant proof if you have time to look it up.
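Short of the proof, here is a tiny Python sketch that checks the n – 1 idea by brute force. Everything in it is an assumption made for illustration: a made-up population of just 1, 2, 3, and every possible sample of size 2 drawn with replacement. It compares variances rather than standard deviations, because that is where the n – 1 correction comes out exactly right on average.

```python
from fractions import Fraction
from itertools import product

# Hypothetical tiny population, chosen so every sample can be listed.
population = [1, 2, 3]
N = len(population)
mu = Fraction(sum(population), N)
pop_variance = sum((x - mu) ** 2 for x in population) / N

# Every possible sample of size n = 2, drawn with replacement.
n = 2
samples = list(product(population, repeat=n))


def sum_of_squares(sample):
    """Sum of squared deviations from the sample's own mean (x bar)."""
    x_bar = Fraction(sum(sample), len(sample))
    return sum((x - x_bar) ** 2 for x in sample)


# Average, over all samples, of SS / n versus SS / (n - 1).
avg_div_n = sum(sum_of_squares(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_of_squares(s) / (n - 1) for s in samples) / len(samples)

print("population variance:", pop_variance)
print("average of SS / n:", avg_div_n)
print("average of SS / (n - 1):", avg_div_n_minus_1)
```

In this small case, dividing by n undershoots the population variance on average, while dividing by n – 1 lands on it exactly. Exact fractions are used so that the comparison is not blurred by rounding.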
00:21:54.700 --> 00:22:03.400
That is my spiel on sum of squares but we will come back to this because it is a pretty important idea.
00:22:03.400 --> 00:22:10.500
Let us talk about the difference between population standard deviation and sample standard deviation.
00:22:10.500 --> 00:22:17.200
We always want to make inferences from the sample to the population, that is what we would like to do.
00:22:17.200 --> 00:22:26.100
Our sample distribution is denoted by lower case x and our population distribution is denoted by upper case X.
00:22:26.100 --> 00:22:43.800
In order to make that leap, we are going from sample statistics to population parameters.
00:22:43.800 --> 00:22:55.000
We are going to be estimating things like estimating μ from x bar, that is estimating the mean of the population from the mean of the sample.
00:22:55.000 --> 00:23:08.400
We are going to estimate σ, the standard deviation of the population, from s, which is the standard deviation of the sample.
00:23:08.400 --> 00:23:22.900
Sigma is our new notation, notice that for population we are using parameters with Greek letters and here we are using regular Roman letters.
00:23:22.900 --> 00:23:27.600
Let us talk about the formulas for these.
00:23:27.600 --> 00:23:34.100
When we talk about the mean, that is μ in this case and x bar in this case.
00:23:34.100 --> 00:23:41.400
We talk about adding up all of the lower case x and dividing by lower case n.
00:23:41.400 --> 00:23:52.900
Here we add up all of the values in our upper case X and divide by upper case N, just superficial changes.
00:23:52.900 --> 00:24:04.700
When we talk about standard deviation, here we are going to be talking about lower case σ or talking about s.
00:24:04.700 --> 00:24:07.700
Let us actually write down this formula.
00:24:07.700 --> 00:24:15.100
You could write it as √(sum of squares ÷ n), that is one way to do it.
00:24:15.100 --> 00:24:22.700
One thing you could do is think about double clicking on this.
00:24:22.700 --> 00:24:26.600
Just double click on it.
00:24:26.600 --> 00:24:32.600
Then what we would get is you would see the whole shebang inside.
00:24:32.600 --> 00:24:34.600
Hopefully I could try.
00:24:34.600 --> 00:24:45.000
Sum of squares means give me all the squared deviations, distances, away from x bar, square all of those.
00:24:45.000 --> 00:25:05.600
If you want, you could put in i = 1 all the way up to n, and all of that is ÷ n.
00:25:05.600 --> 00:25:13.300
If we want to actually use this to estimate that, we will divide by n – 1.
00:25:13.300 --> 00:25:20.500
This is upper case S and I’m going to denote that by using a little bar there.
00:25:20.500 --> 00:25:33.700
In order to have this estimation, we would use lower case s.
00:25:33.700 --> 00:25:40.500
In this case, what we would do is divide our sum of squares by n – 1.
00:25:40.500 --> 00:25:44.100
That is our way of estimating from s to σ.
00:25:44.100 --> 00:25:47.000
That is our estimate.
00:25:47.000 --> 00:25:58.800
When we talk about the population standard deviation, it is still ss ÷ n but it is upper case S this time.
00:25:58.800 --> 00:26:09.600
When we double click on ss and see what is inside of it, we unpack that, here is what it looks like.
00:26:09.600 --> 00:26:21.200
It is the square root of the sum of (X sub i – μ)², all ÷ N.
00:26:21.200 --> 00:26:32.200
Here are all of these formulas.
00:26:32.200 --> 00:26:35.400
We have formulas for standard deviation of the sample, standard deviation of the population, but we also have this new idea.
00:26:35.400 --> 00:26:39.900
This is in between this one and this one.
00:26:39.900 --> 00:26:50.900
It is a way of going from sample information to estimating a population standard deviation.
00:26:50.900 --> 00:26:58.800
Usually, we do not calculate sigma directly because we do not have every single value for the population.
00:26:58.800 --> 00:27:07.900
Usually, we calculate small s which is going to be the estimated standard deviation and
00:27:07.900 --> 00:27:15.200
we hardly use this one as well because we do not really care about the standard deviation in just our sample.
00:27:15.200 --> 00:27:24.200
We want to know the standard deviation for the population.
00:27:24.200 --> 00:27:26.000
Let us go on to our examples.
00:27:26.000 --> 00:27:28.000
Here is example 1.
00:27:28.000 --> 00:27:35.600
It says find the mean and standard deviation of the variable friends in the Excel file.
00:27:35.600 --> 00:27:42.400
If you get the Excel file that you can download, go ahead and click on friends.
00:27:42.400 --> 00:27:51.100
We are going to be finding the standard deviation for the variable friends.
00:27:51.100 --> 00:27:59.800
What would be nice is if we could do everything in Excel, but before we do that I just want to make sure you understand how standard deviation works.
00:27:59.800 --> 00:28:04.300
Because of that I’m going to have you do it manually first.
00:28:04.300 --> 00:28:12.800
In order to do that, go ahead and go to data, find the variable friends, click on that column
00:28:12.800 --> 00:28:22.500
and I’m just going to copy that whole column and paste that right in here.
00:28:22.500 --> 00:28:27.300
Here I have my entire distribution of friends.
00:28:27.300 --> 00:28:31.800
I’m going to have Excel calculate the mean for us.
00:28:31.800 --> 00:28:44.900
I’m going to use the function average and select all this nice data right here, click enter.
00:28:44.900 --> 00:28:48.700
That is our mean.
00:28:48.700 --> 00:28:56.900
That mean is not going to change for anybody because mean is just the mean of the entire distribution.
00:28:56.900 --> 00:29:02.300
I’m just going to put our pointer there and I’m going to say whatever the mean is on top of me,
00:29:02.300 --> 00:29:08.700
that is the mean and I’m just going to paste that all the way down.
00:29:08.700 --> 00:29:14.300
This whole column should have the same mean.
00:29:14.300 --> 00:29:21.700
The reason I’m doing that is because that is going to make it easier for us to calculate the squared deviation.
00:29:21.700 --> 00:29:27.600
We could just use the locked version of mean too.
00:29:27.600 --> 00:29:31.100
Let us get our squared deviation.
00:29:31.100 --> 00:29:42.600
Squared deviation just means the distance from each value to the mean x bar, squared.
00:29:42.600 --> 00:29:49.200
In order to do the square we put in the caret (^) and 2.
00:29:49.200 --> 00:29:53.700
We hit enter and here is our squared deviation.
00:29:53.700 --> 00:29:59.800
I’m just going to drag that formula all the way down.
00:29:59.800 --> 00:30:04.100
Here we have a whole bunch of squared deviations.
00:30:04.100 --> 00:30:09.600
We have to sum up all those squared deviations.
00:30:09.600 --> 00:30:28.700
Here I’m just going to put in ss because that is what we are going to get and in order to get ss, we just add up this whole column.
00:30:28.700 --> 00:30:44.100
In order to get variance, or S², what we need to do is take SS ÷ n.
00:30:44.100 --> 00:30:48.900
I’m going to take ss ÷ n.
00:30:48.900 --> 00:30:54.900
I know here that my n is 100 but if you did not know for some reason, you could use the function count
00:30:54.900 --> 00:30:58.100
and just ask it count how many values there are.
00:30:58.100 --> 00:31:03.800
Not COUNTA, just COUNT, to count how many values there are.
00:31:03.800 --> 00:31:07.700
It should be 100.
00:31:07.700 --> 00:31:12.800
Indeed it is a hundred because it moved the decimal point 2 over.
00:31:12.800 --> 00:31:19.100
Now we could get standard deviation or S.
00:31:19.100 --> 00:31:22.800
In order to get that, we just square root our variance.
00:31:22.800 --> 00:31:32.200
Excel has a function called square root (sqrt) and I’m just going to square root my variance.
00:31:32.200 --> 00:31:38.600
Here I get a standard deviation of 428.64.
00:31:38.600 --> 00:31:46.400
I did all that just so that you would understand how to calculate standard deviation.
00:31:46.400 --> 00:31:48.600
Excel has a nice handy way for you to do it.
00:31:48.600 --> 00:31:55.700
Here I’m going to calculate s automatically.
00:31:55.700 --> 00:32:32.900
Here we are looking at just s, in order to calculate s we would do stdevp because that is the one where you divide by n.
00:32:32.900 --> 00:32:36.400
I’m finding the standard deviation of all my squared values, that is wrong.
00:32:36.400 --> 00:32:44.700
I should be finding the standard deviation of my actual data, right?
00:32:44.700 --> 00:32:49.300
In this method, you actually do not need any of this.
00:32:49.300 --> 00:32:51.200
I just made you go through it so you would learn.
00:32:51.200 --> 00:33:01.600
When we calculate s automatically, using stdevp you will see that we get the exact same standard deviation
00:33:01.600 --> 00:33:08.900
and we do not have to do any of that mean calculating or calculating sum of squares or variance or anything like that.
00:33:08.900 --> 00:33:24.100
There is even a way Excel will calculate for you little s, the estimate of the population standard deviation from the sample.
00:33:24.100 --> 00:33:26.800
That is the one that you will be most likely using.
00:33:26.800 --> 00:33:30.500
Because of that, I think that might be a good one for us to do.
00:33:30.500 --> 00:33:36.800
Sum of squares is going to be the same thing.
00:33:36.800 --> 00:33:39.500
I’m just going to copy all of this.
00:33:39.500 --> 00:33:49.300
The sum of squares is going to be the same thing but variance is going to be a little bit different now.
00:33:49.300 --> 00:33:56.100
Instead of dividing by n, we are going to be dividing by n – 1.
00:33:56.100 --> 00:34:03.000
I’m going to put in 99 instead of 100.
00:34:03.000 --> 00:34:13.600
Square rooting, that works the same way, square root of my variance.
00:34:13.600 --> 00:34:33.800
Notice that when we divide by n – 1, my standard deviation is slightly bigger than it would have been when we just divided by n.
00:34:33.800 --> 00:34:37.500
Let us calculate little s automatically.
00:34:37.500 --> 00:34:43.800
Excel always assumes that is probably what you will be wanting to do.
00:34:43.800 --> 00:35:02.600
It made stdev the default formula, and that one is going to divide by n – 1.
00:35:02.600 --> 00:35:07.400
We see that those two are the same values, a shortcut.
00:35:07.400 --> 00:35:16.900
You see when you automatically calculate it with Excel, you are not going to need to calculate mean
00:35:16.900 --> 00:35:23.800
or the sum of squares but it is nice to know where those things come from.
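If you do not have Excel handy, the same manual-versus-automatic check can be sketched in Python. This is an illustration under assumptions: the `friends` list below is made-up data, not the values from the lecture's Excel file, so the numbers will not match the 428.64 above. Python's `statistics.pstdev` divides by n like stdevp, and `statistics.stdev` divides by n – 1 like stdev.

```python
import math
import statistics

# Hypothetical stand-in for the "friends" column; NOT the lecture's data.
friends = [120, 250, 310, 95, 480]

n = len(friends)
mean = sum(friends) / n

# Manual route: squared deviations -> sum of squares -> variance -> square root.
ss = sum((x - mean) ** 2 for x in friends)
s_divide_by_n = math.sqrt(ss / n)                # like Excel's stdevp
s_divide_by_n_minus_1 = math.sqrt(ss / (n - 1))  # like Excel's stdev

# Automatic route: the standard library does all of that for us.
auto_p = statistics.pstdev(friends)  # divides by n
auto_s = statistics.stdev(friends)   # divides by n - 1

print("manual, divide by n:     ", s_divide_by_n)
print("automatic pstdev:        ", auto_p)
print("manual, divide by n - 1: ", s_divide_by_n_minus_1)
print("automatic stdev:         ", auto_s)
```

As in the spreadsheet, the manual and automatic versions should agree, and the n – 1 version comes out slightly bigger than the n version.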
00:35:23.800 --> 00:35:29.000
We did that already.
00:35:29.000 --> 00:35:36.900
Let us find the mean and standard deviation of the tagged photos in the Excel file.
00:35:36.900 --> 00:35:51.300
If you click over on data, let us go ahead and grab the tagged photos values in that variable column and paste it right in here.
00:35:51.300 --> 00:35:54.300
It is just easier than going back and forth.
00:35:54.300 --> 00:36:01.400
Let us find the mean in this sample.
00:36:01.400 --> 00:36:22.600
I type in average and select all of this, then I’m just going to say whatever is above me, that is the same mean.
00:36:22.600 --> 00:36:28.400
Copy and paste it all the way down, everybody else has the same mean.
00:36:28.400 --> 00:36:33.200
I’m just going to get my squared deviation.
00:36:33.200 --> 00:36:43.300
It is (my first value – the mean)².
00:36:43.300 --> 00:36:52.900
I’m going to copy and paste that all the way down.
00:36:52.900 --> 00:36:55.800
Let us get the sum of squares.
00:36:55.800 --> 00:37:08.700
In order to do that we just find the sum of all these squared deviations.
00:37:08.700 --> 00:37:20.500
In order to find variance, we will find little s², because that is the one you will be using for the most part, right?
00:37:20.500 --> 00:37:29.400
For our little s², we take this sum of squares and we divide it by n – 1.
00:37:29.400 --> 00:37:49.200
We could use count: count all of the values, then subtract 1.
00:37:49.200 --> 00:37:59.400
All of this is in my denominator and hit enter.
00:37:59.400 --> 00:38:02.400
That is my variance.
00:38:02.400 --> 00:38:05.900
What is my standard deviation?
00:38:05.900 --> 00:38:09.400
My little s, my estimated standard deviation.
00:38:09.400 --> 00:38:16.200
All I have to do is square root my variance and that is what I got.
00:38:16.200 --> 00:38:24.900
Let us check our answers by using the automatic Excel version.
00:38:24.900 --> 00:38:38.500
Here we will put in stdev, I want to put in our actual data, our actual values.
00:38:38.500 --> 00:38:44.800
This is the real distribution that we are working with here.
00:38:44.800 --> 00:38:48.500
Excel does it nice and quickly for us.
00:38:48.500 --> 00:38:49.900
We do not need all of this stuff.
00:38:49.900 --> 00:38:59.800
In the future, we will just be using this automatic version but I do want you to know where that comes from.
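The column-by-column steps above can be sketched in Python as well. This is only an illustration: the `tagged_photos` values are made up, since the actual Excel file is not shown in the transcript, and the variable names are hypothetical.

```python
import math
import statistics

# Hypothetical tagged-photo counts; the real values live in the lesson's Excel file.
tagged_photos = [12, 45, 3, 60, 27, 18]

n = len(tagged_photos)                                    # like COUNT
mean = sum(tagged_photos) / n                             # like AVERAGE
squared_devs = [(x - mean) ** 2 for x in tagged_photos]   # (value - mean)^2 for each row
ss = sum(squared_devs)                                    # sum of squares, like SUM
variance = ss / (n - 1)                                   # little s squared
s = math.sqrt(variance)                                   # little s

print(mean, ss, variance, s)

# Check against the automatic version, which also divides by n - 1.
print(statistics.stdev(tagged_photos))
```

Each intermediate variable corresponds to one column or cell in the spreadsheet walkthrough, so you can see where the final number comes from.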
00:38:59.800 --> 00:39:01.800
Let us go on to example 3.
00:39:01.800 --> 00:39:09.600
The average number of calories in a frozen yogurt is 250, with an estimated population standard deviation of 30.
00:39:09.600 --> 00:39:17.800
If 24 frozen yogurts from popular chains were sampled, what would be their ss or sum of squares?
00:39:17.800 --> 00:39:24.800
Here we do not need the actual values and the mean in order to find the sum of squares.
00:39:24.800 --> 00:39:32.800
Because we have some of the other pieces, we could just fill in what we have and figure out what is missing.
00:39:32.800 --> 00:39:38.600
We know that they have given us the estimated population standard deviation.
00:39:38.600 --> 00:39:42.300
That is little s.
00:39:42.300 --> 00:40:04.000
In order to get little s, we know that they added up all of the (x sub i – the mean)², divided by n – 1, and took the square root of that.
00:40:04.000 --> 00:40:05.400
We know that is what they did.
00:40:05.400 --> 00:40:14.300
Another way of writing that is the square root of ss ÷ (n – 1).
00:40:14.300 --> 00:40:17.100
Let us fill in what we have.
00:40:17.100 --> 00:40:28.600
We know that the estimated standard deviation is 30, so this s is 30.
00:40:28.600 --> 00:40:31.600
What we are trying to find out is this.
00:40:31.600 --> 00:40:39.400
We do not have that ss.
00:40:39.400 --> 00:40:43.700
But we do have n – 1 because n is 24.
00:40:43.700 --> 00:40:47.800
24 – 1 is 23.
00:40:47.800 --> 00:40:58.600
From that, and only that information, we could figure out ss. They have also given us this mean, 250.
00:40:58.600 --> 00:41:04.200
It is sort of a red herring; you do not actually need it in this problem.
00:41:04.200 --> 00:41:25.700
I’m going to use a little piece of my Excel as a calculator and here I know I need to square 30, 30².
00:41:25.700 --> 00:41:31.000
Then I could just multiply that by 23.
00:41:31.000 --> 00:41:35.800
I will get 20,700.
00:41:35.800 --> 00:41:44.400
My ss is 20,700.
00:41:44.400 --> 00:41:52.100
I did not actually need all my values from the distribution nor my mean.
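The rearrangement used in this example, ss = s² × (n – 1), can be checked with a few lines of Python instead of Excel. The variable names are just for illustration.

```python
import math

s = 30   # estimated population standard deviation (little s)
n = 24   # number of frozen yogurts sampled

# s = sqrt(ss / (n - 1)), so ss = s^2 * (n - 1)
ss = s ** 2 * (n - 1)
print(ss)  # 20700

# Plugging ss back into the formula recovers s.
print(math.sqrt(ss / (n - 1)))
```

Notice that the mean of 250 never appears in the calculation.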
00:41:52.100 --> 00:41:55.000
Last question, example 4.
00:41:55.000 --> 00:42:01.600
This is a conceptual question, hopefully this will test you on concepts.
00:42:01.600 --> 00:42:08.800
When we divide by n – 1, rather than by n, what effect does this have on the resulting standard deviation?
00:42:08.800 --> 00:42:12.000
n – 1 is a smaller number than n, right?
00:42:12.000 --> 00:42:15.700
Dividing by a smaller number will result in a bigger answer.
00:42:15.700 --> 00:42:24.400
The resulting standard deviation, little s, will be a little bit greater than this sigma.
00:42:24.400 --> 00:42:36.400
This one, sigma, divides by n and this one, little s, divides by n – 1.
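To see how much bigger, here is a short worked comparison, assuming both formulas are applied to the same data so the sum of squares ss is identical:

```latex
\sigma = \sqrt{\frac{SS}{n}}, \qquad
s = \sqrt{\frac{SS}{n-1}}, \qquad
\frac{s}{\sigma} = \sqrt{\frac{n}{n-1}} > 1 \quad (n > 1,\ SS > 0)
```

With n = 100, as in the first example, the ratio is the square root of 100/99, which is about 1.005, so little s is only about half a percent bigger. The gap shrinks as n grows and matters most for small samples.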
00:42:36.400 --> 00:42:38.300
That is it for variability.
00:42:38.300 --> 00:42:40.000
Thanks for using www.educator.com.