WEBVTT mathematics/statistics/son
00:00:00.000 --> 00:00:02.200
Hi, welcome to educator.com.
00:00:02.200 --> 00:00:04.100
We are going to talk about F distributions today.
00:00:04.100 --> 00:00:11.300
So first we are going to review the other distributions we covered besides F, namely the Z and the T.
00:00:11.300 --> 00:00:16.800
Then we are going to introduce the F statistic also called the variance ratio.
00:00:16.800 --> 00:00:29.000
Then we are going to talk about the distribution of all these F ratios, and finally what α and the P value mean in an F distribution.
00:00:29.000 --> 00:00:31.400
Because eventually we are going to be doing hypothesis testing with the F statistic.
00:00:31.400 --> 00:00:44.000
Okay, first, these other distributions: we know how to calculate the Z statistic, and we also know how to
00:00:44.000 --> 00:00:50.100
find the probability of such a Z value in a normal distribution.
00:00:50.100 --> 00:00:53.000
But what is a Z distribution?
00:00:53.000 --> 00:00:55.900
Well, imagine this.
00:00:55.900 --> 00:00:58.700
Take a data set, let us just call it a population.
00:00:58.700 --> 00:01:08.400
We take a data set, I will just draw a circle, and we take some sort of sample from it, of some size.
00:01:08.400 --> 00:01:27.200
And we actually calculate the Z statistic for this sample: we get the mean of this little
00:01:27.200 --> 00:01:32.900
sample, minus μ, divided by the standard error.
00:01:32.900 --> 00:01:38.100
So you do that and then you plot that Z.
00:01:38.100 --> 00:01:54.800
So imagine you replace all of that sample again, so with replacement, and you draw another sample and
00:01:54.800 --> 00:02:03.600
you do this again, and then you plot that guy, and you dump it back in, you draw another sample, you calculate Z.
00:02:03.600 --> 00:02:16.000
So you do that over and over again, many times, and what you end up getting is a normal distribution over time.
00:02:16.000 --> 00:02:29.700
So many times, if you plot Z you get a normal distribution, and because of that we also call this a Z
00:02:29.700 --> 00:02:38.700
distribution, because the distribution is made up of a whole bunch of Zs and it has the shape of a normal
00:02:38.700 --> 00:02:42.600
distribution; that is what we call a Z distribution.
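The repeated-sampling process just described can be sketched as a short simulation. This is an illustration only: the population parameters (μ = 100, σ = 15) and the sample size n = 25 are made-up numbers, and the pile of Z statistics comes out approximately standard normal.

```python
# A minimal sketch of building a Z sampling distribution, assuming a
# hypothetical normal population with mu = 100, sigma = 15, and n = 25.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 100.0, 15.0, 25

def z_statistic(sample):
    # Z = (sample mean - mu) / standard error
    return (sample.mean() - mu) / (sigma / np.sqrt(n))

# Draw many samples and collect each one's Z statistic.
zs = np.array([z_statistic(rng.normal(mu, sigma, n)) for _ in range(10_000)])

# The collection of Zs is approximately standard normal:
# mean near 0, standard deviation near 1.
print(round(zs.mean(), 2), round(zs.std(), 2))
```

Swapping `z_statistic` for a T statistic (dividing by the sample's own estimated standard error) would give the T sampling distribution discussed next.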
00:02:42.600 --> 00:02:55.100
Now, if you take that same idea and you do it again: you get a sample, and instead of calculating Z for that sample,
00:02:55.100 --> 00:03:14.700
you calculate T. If you do this, and then you plot that, and you do that over and over and over and over again, you get a T distribution.
00:03:14.700 --> 00:03:29.200
And this resulting t-distribution follows the rules of the t-distribution, where how wide it is depends on the degrees of
00:03:29.200 --> 00:03:36.900
freedom: the lower your degrees of freedom, the more variable it is, but the bigger
00:03:36.900 --> 00:03:42.500
your degrees of freedom, the less variable and more normal it looks.
00:03:42.500 --> 00:03:45.600
And so that is what we call the t-distribution.
00:03:45.600 --> 00:03:52.300
So that is how the Z statistic and the Z distribution sort of go together.
00:03:52.300 --> 00:03:57.500
And this is how the T statistic and the t-distribution sort of go together.
00:03:57.500 --> 00:04:05.000
And you just have to imagine taking a whole bunch of these samples, calculating whatever statistic, and
00:04:05.000 --> 00:04:10.200
plotting that statistic, and then looking at the shape of those statistics.
00:04:10.200 --> 00:04:20.300
So really what this is, is a sampling distribution of Z.
00:04:20.300 --> 00:04:41.500
And this is a sampling distribution of T; instead of using means or Z scores to make your plot, you use the T statistic.
00:04:41.500 --> 00:04:46.200
And you could do that for anything: you could do it for the standard deviation, you could do it for the
00:04:46.200 --> 00:04:50.200
interquartile range; you can make a sampling distribution of anything you want.
00:04:50.200 --> 00:04:53.600
That is important to keep in mind as we go into the F distribution.
00:04:53.600 --> 00:05:01.100
Okay so first thing is what is the F statistic?
00:05:01.100 --> 00:05:07.100
We know how to calculate the T statistic and the Z statistic, but what is the F statistic?
00:05:07.100 --> 00:05:16.300
Well, later on in these lessons we are going to come across what we call the ANOVA, the analysis of variance.
00:05:16.300 --> 00:05:25.200
Analyze means to break down, and variance is, well, you know what variance is: the spread, usually
00:05:25.200 --> 00:05:32.000
around the mean, of your data set. So when we analyze variance, we are going to be breaking down
00:05:32.000 --> 00:05:43.300
variance into its multiple components, and the F ratio happens to be a ratio of those component variances.
00:05:43.300 --> 00:05:52.200
And so I just want you to get sort of the big idea behind the F ratio, not exactly how to calculate it; we will get
00:05:52.200 --> 00:05:56.200
into the details of that later on, but the general concept.
00:05:56.200 --> 00:06:10.600
So the F statistic usually involves this idea that we have, let us say, two samples: x1, x2, x3 and y1, y2, y3.
00:06:10.600 --> 00:06:19.400
Now there is always some variation within a sample; within the Xs there is some variation.
00:06:19.400 --> 00:06:25.400
And within the Ys there is some variation.
00:06:25.400 --> 00:06:31.900
So there is definitely some variation, but there is another variation here that we are really interested in.
00:06:31.900 --> 00:06:37.400
We are really interested in the difference between these two things.
00:06:37.400 --> 00:06:48.000
Between these two samples. So the F statistic really is taking those ideas and turning them into a ratio, and here is what that ratio looks like.
00:06:48.000 --> 00:07:05.200
It is really the between-sample variance all over the within-sample variance.
00:07:05.200 --> 00:07:13.500
And remember, variance is always squared, the average squared distance away from the mean, and so because
00:07:13.500 --> 00:07:22.200
of that this is a squared number, and this is a squared number; they are both positive, so this ratio is always going to be greater than zero.
00:07:22.200 --> 00:07:29.200
There is no way that this number could be less than zero, so the F statistic is always going to be greater than zero.
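As a rough sketch of that ratio, here is one way the between-over-within idea can be computed for two equal-sized samples, using the one-way ANOVA definitions that the lecture gets into later (mean square between over mean square within). The sample values are made up for illustration.

```python
# Sketch: F as between-sample variance over within-sample variance,
# assuming two equal-sized samples (k = 2 groups, one-way ANOVA form).
import numpy as np

def f_ratio(x, y):
    n = len(x)
    grand_mean = np.mean(np.concatenate([x, y]))
    # Between: spread of the group means around the grand mean,
    # scaled by group size; df_between = k - 1 = 1 here.
    ms_between = n * ((x.mean() - grand_mean) ** 2 + (y.mean() - grand_mean) ** 2)
    # Within: the pooled (averaged) sample variances.
    ms_within = (x.var(ddof=1) + y.var(ddof=1)) / 2
    return ms_between / ms_within

x = np.array([1.0, 2.0, 3.0])
y = np.array([7.0, 8.0, 9.0])
# Big gap between the means, small spread within each sample: a big F.
print(f_ratio(x, y))
```

With a closer pair of samples, say y = [2, 3, 4], the same function gives a much smaller F, which matches the intuition developed below about large versus small F ratios.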
00:07:29.200 --> 00:07:34.500
Now another way to think about between-sample variance and within-sample variance is this.
00:07:34.500 --> 00:07:44.100
Whenever we do these kinds of tests, we are really interested in the differences between the samples; that is really important to us.
00:07:44.100 --> 00:07:58.200
But part of that difference is going to be just inherent variation.
00:07:58.200 --> 00:08:06.500
So sometimes there might be a difference between, let us say, men and women, or people who got a
00:08:06.500 --> 00:08:08.500
tutorial versus people who did not, right?
00:08:08.500 --> 00:08:15.200
People who studied for the test versus people who did not, people who went to private school versus people who went to public school.
00:08:15.200 --> 00:08:17.200
There might be some difference between them.
00:08:17.200 --> 00:08:20.600
But that difference is also going to have variation.
00:08:20.600 --> 00:08:28.000
So this between-sample variance often has inherent variation, just variance you cannot do anything about,
00:08:28.000 --> 00:08:40.200
plus real difference, the effect size, between samples.
00:08:40.200 --> 00:08:53.900
And notice that we keep using this word between, and that is to indicate that part; so between, that is the part that we are really interested in.
00:08:53.900 --> 00:09:16.700
Over within-sample variance. And so here there is inherent variation within the Xs and within the Ys, and that
00:09:16.700 --> 00:09:25.200
is not something we are interested in, but it is good to know how variable our little samples are.
00:09:25.200 --> 00:09:34.300
Is everyone very similar to each other, or very different? We need to compare the difference between the samples to the difference within the samples.
00:09:34.300 --> 00:09:43.100
So the within-sample variation is just inherent variation.
00:09:43.100 --> 00:09:52.700
So these are all different ways of seeing the same thing, and the reason I also like this way is
00:09:52.700 --> 00:10:00.900
because later on we are not just going to be talking about between-sample and within-sample differences; we are going to add onto those ideas.
00:10:00.900 --> 00:10:08.200
The final way I want you to sort of think about the F statistic is basically this.
00:10:08.200 --> 00:10:15.900
Ultimately, in hypothesis testing, we are going to want to know about differences between samples; that is the thing that we are really interested in.
00:10:15.900 --> 00:10:31.400
So it is going to be the variation that we want to explain because that is the reason that we did our research in the first place.
00:10:31.400 --> 00:10:46.000
All of that over the variation we cannot explain, not with this design at least.
00:10:46.000 --> 00:10:53.400
So in our experimental design we will have these two groups and hopefully these groups will be similar to
00:10:53.400 --> 00:11:00.600
each other, but different: similar within the group but different between the groups.
00:11:00.600 --> 00:11:07.100
And that is why in an F statistic we want this variation that we want to explain to be quite large, and this
00:11:07.100 --> 00:11:15.800
variation that we cannot explain, or do anything about, that just comes along for the ride, we want that to be relatively small.
00:11:15.800 --> 00:11:18.800
Okay, so let us do a little bit of thinking about the F ratio.
00:11:18.800 --> 00:11:27.800
Now if we had a very big difference between the groups what kind of F ratio would we have?
00:11:27.800 --> 00:11:30.200
Would it be greater than one, or less than one?
00:11:30.200 --> 00:11:37.000
Well, if our variation between the groups is bigger than the variation within the groups, then we should have
00:11:37.000 --> 00:11:48.600
a very large F, so that should be an F that is greater than one, at least greater than one but maybe a lot greater than one; it could be 2 over 1 or 2 over .5.
00:11:48.600 --> 00:11:55.600
Any of those values where the between-sample variance is a lot larger than the within-sample variance.
00:11:55.600 --> 00:12:07.900
And so if there is a lot of within-sample variance, then that competes with the between-sample variance; so
00:12:07.900 --> 00:12:14.000
let us say there is a big between-sample difference, but there is also a lot of difference within the
00:12:14.000 --> 00:12:24.700
samples themselves; it sort of evens out, and you might see an F that is smaller, or even less than one, if this one is bigger than this one.
00:12:24.700 --> 00:12:28.400
So that is how you could sort of think about the F statistic.
00:12:28.400 --> 00:12:41.600
Now imagine getting that F statistic over and over and over again from the population and plotting a sampling distribution of F statistics.
00:12:41.600 --> 00:12:43.200
What would you get?
00:12:43.200 --> 00:12:54.600
Well, remember that F cannot go below zero, because both numbers are going to be positive, so the F really stops at zero.
00:12:54.600 --> 00:12:58.200
But this is what the F distribution ends up looking like.
00:12:58.200 --> 00:13:05.800
This is a skewed distribution and it has a positive tail.
00:13:05.800 --> 00:13:12.000
That means it goes for a really long time on the positive side.
00:13:12.000 --> 00:13:23.000
It is one-sided, so it is not symmetrical; it is actually asymmetrical, there is only a positive side, and that is
00:13:23.000 --> 00:13:30.200
because it is a ratio of variances, and variances are positive.
00:13:30.200 --> 00:13:39.100
And like T, it is a family of distributions, and you are going to be able to find the particular F distribution you are
00:13:39.100 --> 00:13:49.200
working with by looking at the degrees of freedom in the numerator, the one about between-sample
00:13:49.200 --> 00:14:07.100
differences, and by looking at the degrees of freedom in the denominator, the sort of leftover, or within-sample, variation.
00:14:07.100 --> 00:14:14.200
So you are going to need both of those numbers in order to find out which F distribution you are working with,
00:14:14.200 --> 00:14:20.900
and in Excel, it will actually ask you for the degrees of freedom for the numerator and denominator.
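To see that F really is a family indexed by those two degrees-of-freedom numbers, here is a small simulation sketch. It builds F values from the textbook definition of F as a ratio of two scaled chi-square variables; the particular df values (3 and 20) are arbitrary choices for illustration.

```python
# Sketch: simulate an F distribution with numerator df = 3 and
# denominator df = 20, as a ratio of scaled chi-square variables.
import numpy as np

rng = np.random.default_rng(1)
df_num, df_den = 3, 20   # numerator (between) and denominator (within) df

fs = (rng.chisquare(df_num, 100_000) / df_num) / \
     (rng.chisquare(df_den, 100_000) / df_den)

# Every simulated F is positive, and the distribution is right-skewed:
# the mean sits above the median because of the long positive tail.
print(fs.min() > 0, fs.mean() > np.median(fs))
```

Changing `df_num` and `df_den` changes the shape, which is exactly why Excel (or any table) asks for both numbers.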
00:14:20.900 --> 00:14:26.600
Now let us talk a little bit about what α means here.
00:14:26.600 --> 00:14:37.400
For α here, we will still need a cutoff point, so a critical F instead of a critical T or Z.
00:14:37.400 --> 00:14:49.700
You will still need a critical F, and α will still be our probability of making a false alarm given that the null hypothesis is true.
00:14:49.700 --> 00:14:54.600
This is the null F distribution just saying.
00:14:54.600 --> 00:14:59.200
And the α would be the same thing, the probability of a false alarm.
00:14:59.200 --> 00:15:10.300
So once you know how to sort of picture that α, let us talk about what that α actually means.
00:15:10.300 --> 00:15:24.500
If you go back to the original idea of α, the original idea is that cutoff level.
00:15:24.500 --> 00:15:29.700
So it is our level of tolerance for false alarms.
00:15:29.700 --> 00:15:45.600
It is the false alarm probability that we will tolerate.
00:15:45.600 --> 00:15:48.300
We want α to be very low.
00:15:48.300 --> 00:16:01.700
Now our α will be low, a smaller α than this one, if our critical F is very big.
00:16:01.700 --> 00:16:04.400
And what does it mean for F to be large?
00:16:04.400 --> 00:16:24.600
This means our between-sample variability is much greater than our within-sample variability.
00:16:24.600 --> 00:16:32.000
And that is what it means and so as long as this is much larger than this, we have a large F and that is going
00:16:32.000 --> 00:16:37.800
to mean a smaller chance of a false alarm.
00:16:37.800 --> 00:16:47.000
Now, α is the cutoff level that we are going to set as the significance level, the level that we will tolerate.
00:16:47.000 --> 00:16:48.700
So what is the P value?
00:16:48.700 --> 00:17:10.000
So the P value will be, given our sample's F, the probability that we would get this F or higher by chance under the null distribution.
00:17:10.000 --> 00:17:37.700
So given our sample's F, actually it will be easier to put it this way: the idea is the false alarm probability for F
00:17:37.700 --> 00:18:00.000
values, F statistics, that are equal to or more extreme than the F from our sample.
00:18:00.000 --> 00:18:07.500
So it is the probability that we would get an F greater than or equal to the one that we got, the F from the sample.
00:18:07.500 --> 00:18:21.600
So once we have our sample statistic, this F value gives us the false alarm probability we would be taking on if we rejected the null there.
00:18:21.600 --> 00:18:32.400
So it is the same idea as with T statistics, the α and the P value for T statistics; we are just now applying it to a slightly different looking distribution.
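Putting α and the P value together, a hedged sketch of the idea: simulate the null F distribution, then take the P value as the fraction of null F statistics at least as extreme as the F our sample produced. The df values (1 and 10) and the sample F of 6.0 are made-up numbers for illustration.

```python
# Sketch: the P value as the null-distribution tail beyond our sample F,
# assuming hypothetical df values (1, 10) and a hypothetical sample F.
import numpy as np

rng = np.random.default_rng(2)
df_num, df_den = 1, 10

# Simulate the null F distribution (ratio of scaled chi-squares).
null_f = (rng.chisquare(df_num, 200_000) / df_num) / \
         (rng.chisquare(df_den, 200_000) / df_den)

sample_f = 6.0                          # hypothetical F from our data
p_value = np.mean(null_f >= sample_f)   # P(F >= sample F | null)

# Compare against alpha = .05: reject the null when p < alpha.
print(p_value < 0.05)
```

In practice a table, Excel, or a statistics library gives this tail probability exactly rather than by simulation; the simulation just makes the definition concrete.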
00:18:32.400 --> 00:18:36.900
Now examples.
00:18:36.900 --> 00:18:42.100
Why does the F distribution stop at zero but go on in the positive direction until infinity?
00:18:42.100 --> 00:18:44.800
Well, we know why it stops at zero.
00:18:44.800 --> 00:19:08.600
The F distribution is a ratio of two positive numbers, and we know that they are positive because variances are squared quantities, which makes them always positive.
00:19:08.600 --> 00:19:19.500
But it goes on until infinity because there is no rule that says you can only be so much bigger in the
00:19:19.500 --> 00:19:27.700
numerator than in the denominator; the numerator can be infinitely bigger than the denominator, so it could go on forever and ever.
00:19:27.700 --> 00:19:36.700
Example 2: in an F test, also called a one-way ANOVA, which we are going to talk about in a little bit, the P
00:19:36.700 --> 00:19:46.400
value: you did an F test and the P value is .034; what is the best interpretation of this result?
00:19:46.400 --> 00:19:50.900
It is plausible that all the samples are roughly equal.
00:19:50.900 --> 00:20:02.700
So here we are thinking about, let us say, two samples, and we compare this versus this.
00:20:02.700 --> 00:20:29.600
So the F value is between-variation over within-variation, and if we have a big F value, if we have a big
00:20:29.600 --> 00:20:39.200
enough F value, a sample F, then we can have a small P value, .034.
00:20:39.200 --> 00:20:47.400
So is it possible that all the samples are roughly equal?
00:20:47.400 --> 00:20:56.200
No, because we seem to have a large enough between-sample variance, so I would say no to that one.
00:20:56.200 --> 00:21:01.100
It is possible that all the sample variances are roughly equal.
00:21:01.100 --> 00:21:09.200
Well, that also is not necessarily what this means; it could be that these within-sample variations are very similar to
00:21:09.200 --> 00:21:12.600
each other, but that is not what this P value is talking about.
00:21:12.600 --> 00:21:17.800
The within sample variation is much larger than the between sample variation.
00:21:17.800 --> 00:21:23.300
Well, if that were true we would have a small F, so it is not this one.
00:21:23.300 --> 00:21:25.700
The between sample variation is much larger than within.
00:21:25.700 --> 00:21:27.800
So D is our answer.
00:21:27.800 --> 00:21:35.100
Example 3: consider the heights of the following pairs of samples.
00:21:35.100 --> 00:21:37.300
Which will have the largest F?
00:21:37.300 --> 00:21:38.900
Which will have the smallest F?
00:21:38.900 --> 00:21:41.000
Okay let us think about this.
00:21:41.000 --> 00:21:46.000
So, players from the NBA team the Lakers versus adults in LA.
00:21:46.000 --> 00:21:53.200
Well, if we draw those two populations, Lakers versus LA.
00:21:53.200 --> 00:22:00.900
This one probably has a lot of variance, a lot of variance here, that is a lot of people; this one probably has a very
00:22:00.900 --> 00:22:10.000
small variance, but there is probably a pretty sizable difference between those two groups of people,
00:22:10.000 --> 00:22:15.000
average adults versus the Lakers, who are probably all amazingly tall.
00:22:15.000 --> 00:22:18.400
Well so that is the picture here.
00:22:18.400 --> 00:22:21.100
Will this have a larger F, or a smaller one?
00:22:21.100 --> 00:22:28.100
Well, what about adults in San Francisco versus adults in LA?
00:22:28.100 --> 00:22:35.100
Well, these two probably both have a lot of within-sample variation; there are lots of adults in San Francisco, lots of
00:22:35.100 --> 00:22:41.900
adults in LA, they are all different from each other, but their averages should probably be similar; it is
00:22:41.900 --> 00:22:48.500
not like San Francisco is a magnet for tall people or LA is a magnet for tall people, so this difference
00:22:48.500 --> 00:22:55.500
between the groups will probably be very small, but the within-group variability will be very large, so I would
00:22:55.500 --> 00:23:01.500
guess this would actually have a pretty small F. And what about this one?
00:23:01.500 --> 00:23:12.800
This one is players from the NBA team the Lakers versus players from another team, and so here we might think
00:23:12.800 --> 00:23:21.500
Lakers versus Clippers, and there is probably pretty small variation here; probably everybody is about 6 feet
00:23:21.500 --> 00:23:35.800
tall, so they are probably all super tall; there is not a lot of variation, but they are also probably similar across the teams too.
00:23:35.800 --> 00:23:43.500
Because the average height on the Lakers is probably similar to the average height on the
00:23:43.500 --> 00:23:50.600
Clippers; they are just both tall groups of people. So which one of these will probably have the largest F?
00:23:50.600 --> 00:23:56.300
I think the biggest difference between the groups might actually be this one.
00:23:56.300 --> 00:24:07.100
So I would guess I would go with this one, given that I am not really sure about the variances here.
00:24:07.100 --> 00:24:12.100
The variance is smaller, but I am not sure how to compare these so far.
00:24:12.100 --> 00:24:20.100
So this is the largest F; I am just going to go by it having the largest numerator for sure.
00:24:20.100 --> 00:24:24.100
Well, which will have the smallest F?
00:24:24.100 --> 00:24:32.100
For the smallest F, I would probably go with this one, because not only does it have a small numerator, but it
00:24:32.100 --> 00:24:38.100
has an extremely large denominator, so I would say this one would definitely have the smallest F.
00:24:38.100 --> 00:24:42.700
So that is the end of F distributions.
00:24:42.700 --> 00:24:46.000
See you next time for ANOVAs on educator.com.