WEBVTT
00:00:00.000 --> 00:00:02.000
Hi, welcome to educator.com.
00:00:02.000 --> 00:00:05.100
We are going to talk about the chi-square goodness of fit test.
00:00:05.100 --> 00:00:11.800
So first, we are going to start with a big-picture review of where the chi-square test actually fits in.
00:00:11.800 --> 00:00:17.700
Amongst all the different inferential statistics we have been learning so far and then we are going to talk
00:00:17.700 --> 00:00:22.900
about a new kind of hypothesis testing, the goodness of fit hypothesis test.
00:00:22.900 --> 00:00:29.500
So it is going to be similar to the hypothesis testing we have been doing so far, but there is a slightly different logic behind it.
00:00:29.500 --> 00:00:37.400
So because it is a slightly different logic, there is a new null hypothesis as well as a new alternative hypothesis.
00:00:37.400 --> 00:00:43.700
Then we are going to introduce the chi-square distribution and the chi-square statistic.
00:00:43.700 --> 00:00:49.400
And then we are going to talk about the conditions for the chi-square test: when do we actually use it?
00:00:49.400 --> 00:00:54.100
So where does the chi-square test belong?
00:00:54.100 --> 00:00:59.400
It has been a while since we have looked at this if you are going in order with the videos, but I think it is
00:00:59.400 --> 00:01:03.500
pretty good to stop right now and sort of think: where have we come from?
00:01:03.500 --> 00:01:05.000
Where are we now?
00:01:05.000 --> 00:01:12.300
So the first thing we want to think about are the different independent variables that we have been able to look at.
00:01:12.300 --> 00:01:24.500
We have been able to look at independent variables, the predictor variables, that are either categorical or continuous.
00:01:24.500 --> 00:01:35.400
When the IV is categorical you have groups, right?
00:01:35.400 --> 00:01:37.800
Or different samples, right?
00:01:37.800 --> 00:01:47.400
When the IV is continuous you do not have different groups, you have different levels that predict something.
00:01:47.400 --> 00:01:56.300
So just to give you an idea, a categorical IV would be something like the experimental group versus the
00:01:56.300 --> 00:02:07.000
control group, or another categorical IV may be someone who gets a drug versus someone who
00:02:07.000 --> 00:02:12.000
gets the placebo, a group that gets the drug versus the group that gets the placebo. An example of a
00:02:12.000 --> 00:02:20.000
continuous IV might be looking at how much you study predicting your score on a test, so how much you
00:02:20.000 --> 00:02:22.800
study would be a continuous IV.
00:02:22.800 --> 00:02:29.700
So that is one of the dimensions that we need to know: is your IV categorical or continuous?
00:02:29.700 --> 00:02:40.500
You also need to know whether the DV is categorical or continuous. The DV is the thing that we are
00:02:40.500 --> 00:02:47.000
interested in measuring at the end of the day, the thing that we want to know changed, the thing
00:02:47.000 --> 00:02:57.600
we want to predict, right? And so far, here is where we have come.
00:02:57.600 --> 00:03:07.000
At the very beginning we looked at continuous types of tests and those types of measures and those were
00:03:07.000 --> 00:03:12.900
the regression, linear regression, as well as correlation.
00:03:12.900 --> 00:03:30.000
Remember, r and regression, that was the stuff like y = b₀ + b₁x, so that was
00:03:30.000 --> 00:03:34.700
regression and correlation way back in the day.
00:03:34.700 --> 00:03:43.900
We have been covering a lot of this quadrant actually, looking at t-tests and ANOVA, right?
00:03:43.900 --> 00:03:58.000
One important thing to know is that t-tests and ANOVAs are both hypothesis tests; so far we have not
00:03:58.000 --> 00:04:01.300
learned hypothesis testing with regression and correlation.
00:04:01.300 --> 00:04:15.100
A lot of inferential statistics courses in college do not cover hypothesis testing of regression until you get to more advanced levels of statistics.
00:04:15.100 --> 00:04:21.400
So what do ANOVAs and t-tests sort of have in common?
00:04:21.400 --> 00:04:28.900
Well they have in common that they are both categorical IV and continuous DV.
00:04:28.900 --> 00:04:37.000
The IV is categorical and you only have one, one IV.
00:04:37.000 --> 00:04:42.100
And your DV is continuous.
00:04:42.100 --> 00:04:46.800
So that sort of what they have in common, what is different about them?
00:04:46.800 --> 00:04:56.700
Well, the difference is that the IV in t-tests has two levels and only two levels, so there are only two groups or two samples.
00:04:56.700 --> 00:05:02.400
In ANOVAs we can test more than two samples; we can do that for 3, 4, 5 samples.
00:05:02.400 --> 00:05:12.500
So that IV has greater than two levels, and that is where we have been spending a lot of our time.
00:05:12.500 --> 00:05:20.600
So for the most part continuous DVs are really important because they tell us a lot; they tell us the fine ways
00:05:20.600 --> 00:05:27.300
that we could actually be different, that the data could actually be different.
00:05:27.300 --> 00:05:34.000
So, it is more rare that you will use a categorical dependent variable; that is not going to
00:05:34.000 --> 00:05:40.500
be as informative to us but it is still possible and that is where the chi-square is going to come in.
00:05:40.500 --> 00:05:47.000
The chi-square comes in right in this quadrant where we have a categorical IV and also a categorical DV, so
00:05:47.000 --> 00:05:57.400
for instance we might want to see something like: if you are given a particular drug or the placebo, do you
00:05:57.400 --> 00:06:00.300
feel like you are getting better, yes or no right?
00:06:00.300 --> 00:06:15.400
So that is a categorical DV; it is not like a score for which we can find a mean, and so this is where the chi-square tests come in.
00:06:15.400 --> 00:06:18.700
And there are going to be two chi-square tests that we are going to look at.
00:06:18.700 --> 00:06:22.000
The first one, we are going to cover today and it is called goodness of fit.
00:06:22.000 --> 00:06:25.900
The next one is in the next lesson and it is called a test of homogeneity.
00:06:25.900 --> 00:06:27.500
They are both chi-square tests.
00:06:27.500 --> 00:06:37.900
The other way you will see it written is chi-squared, so do not think, oh, what is this doing here?
00:06:37.900 --> 00:06:46.800
When it has this little curvy part here, it means chi-square, the Greek letter chi. Finally, there is a test that
00:06:46.800 --> 00:06:55.400
is rarely covered in inferential statistics but is covered at more advanced levels of statistics, and it is called
00:06:55.400 --> 00:07:04.300
the logistic test, and the logistic test takes you from a continuous IV to a categorical DV.
00:07:04.300 --> 00:07:16.600
But that is a rarer design used in conducting science; it is not as informative as continuous to continuous or categorical to continuous.
00:07:16.600 --> 00:07:20.900
Alright, so we are going to spend our time right in here.
00:07:20.900 --> 00:07:33.700
So there is a new twist on hypothesis testing; it is not totally different, it is still very similar, but there is a subtle difference.
00:07:33.700 --> 00:07:37.600
Today we are going to start off with the chi-square goodness of fit test.
00:07:37.600 --> 00:07:42.400
Basically let us think about hypothesis testing in general.
00:07:42.400 --> 00:07:50.100
In general you want to determine whether a sample is very different from expected results; that is the big idea of hypothesis testing,
00:07:50.100 --> 00:07:53.800
and expected results come from your hypothesized population.
00:07:53.800 --> 00:08:00.700
If your sample is very different, we usually determine that with some sort of test statistic, looking
00:08:00.700 --> 00:08:09.000
at how far it is on the test statistic's distribution, and we look at whether it is past that α
00:08:09.000 --> 00:08:16.200
cutoff or the critical test statistic, and then we say, oh, this sample is so different from what would be
00:08:16.200 --> 00:08:24.400
expected given that the null hypothesis is true that we are going to reject the null hypothesis.
00:08:24.400 --> 00:08:31.200
That is usual hypothesis testing. This still takes that idea of looking at whether a sample is very
00:08:31.200 --> 00:08:37.100
different from expected results, but the question is how are we going to compare these two things?
00:08:37.100 --> 00:08:41.600
We are not going to compare means anymore, we are not going to look at the distance between means,
00:08:41.600 --> 00:08:47.300
nor are we going to look at the proportion of variances that is not what we are going to look at either.
00:08:47.300 --> 00:08:58.700
Instead we are going to determine whether the sample proportions for some category are very different
00:08:58.700 --> 00:09:02.200
from the hypothesized population proportion.
00:09:02.200 --> 00:09:09.400
And the question will be how do we determine "very different", and here is what I mean by determining
00:09:09.400 --> 00:09:13.900
whether the sample proportions are different from the hypothesized population proportion.
00:09:13.900 --> 00:09:28.700
So here I am just going to draw for you sort of schematically what the hypothesized population proportions might look like.
00:09:28.700 --> 00:09:37.000
So this is just sort about the idea, so you might think of the population as being like this and in the
00:09:37.000 --> 00:09:48.200
population you might see a proportion of one third being blue, one third being red, and one third being yellow.
00:09:48.200 --> 00:09:57.500
Now it is already hard to think about; you can already sort of see, well, we cannot get the average of
00:09:57.500 --> 00:10:05.500
blue, red, and yellow, right? Like, what would be the average of that, and how would you find the variability of
00:10:05.500 --> 00:10:12.800
that? So already we are starting to see why you cannot use t-tests or ANOVAs: if you cannot find the mean or
00:10:12.800 --> 00:10:22.300
variance, you cannot use those tests. So this is what our hypothesized population looks like, and when we
00:10:22.300 --> 00:10:31.300
get a sample we get a little sample from that population, we want to know whether our sample
00:10:31.300 --> 00:10:37.200
proportions are very different from the hypothesized proportions or not. So let us say in our sample
00:10:37.200 --> 00:10:50.400
proportion we get mostly blue, a little bit of red, a little bit of yellow; so let us say 60% blue, 20% red, 20% yellow.
00:10:50.400 --> 00:10:55.500
Are those proportions different enough from our hypothesized proportion?
00:10:55.500 --> 00:11:14.500
Another sample we might get is, you know, half blue and half red and no yellow; is that really different from our hypothesized proportion?
00:11:14.500 --> 00:11:34.100
Another sample we might get might be only like 10% blue and then 40% red and then the other half will be yellow.
00:11:34.100 --> 00:11:39.800
For something like that, we want to say whether it is really different from these hypothesized population
00:11:39.800 --> 00:11:46.200
proportions, and so that is what our new goal is.
00:11:46.200 --> 00:11:53.500
How different are these proportions from those proportions? And then the question becomes, okay, how do we
00:11:53.500 --> 00:11:57.400
determine whether something is very different?
00:11:57.400 --> 00:12:04.500
Is this very different or just different?
00:12:04.500 --> 00:12:08.300
How do we determine very different, that is going to be the key question here.
00:12:08.300 --> 00:12:13.300
And that is why we are going to need the chi-square statistic and the chi-square distribution.
00:12:13.300 --> 00:12:28.800
So we are changing our hypotheses a little bit; now the null hypothesis is really about proportions, and here is what we are talking about.
00:12:28.800 --> 00:12:36.400
The null hypothesis now is about the proportions of the population, the real population that we do not know.
00:12:36.400 --> 00:12:52.000
Will this population be like the predicted or theorized proportions? So here we are asking: is this unknown
00:12:52.000 --> 00:13:01.800
population like our known population? It should sound familiar, as that is sort of the fundamental basis of inferential statistics.
00:13:01.800 --> 00:13:04.700
So that is our new null hypothesis.
00:13:04.700 --> 00:13:18.100
That the proportions in the population will be like the predicted population proportions, that they will be the same.
00:13:18.100 --> 00:13:26.700
Remember, sameness is always the hallmark of the null hypothesis. Alternatively, we want to say that at least
00:13:26.700 --> 00:13:36.600
one of the proportions in the population will be different than predicted. So going back to our example, if our
00:13:36.600 --> 00:13:50.000
hypothesized population is something like one third, one third, one third, maybe what we
00:13:50.000 --> 00:14:16.100
will find in our sample is one third blue, but then some smaller proportion like 15% red, and the rest being yellow.
00:14:16.100 --> 00:14:19.800
Now the one third should match up.
00:14:19.800 --> 00:14:24.000
The one third matches up but what about these other two?
00:14:24.000 --> 00:14:35.200
And so the alternative hypothesis is that at least one proportion in the population will be different from the predicted proportion;
00:14:35.200 --> 00:14:38.100
there just has to be one guy that is different.
00:14:38.100 --> 00:14:46.500
So just to give you an example, let us turn this problem into a null hypothesis and an alternative hypothesis.
00:14:46.500 --> 00:14:58.100
So here it said according to early polls candidate A was supposed to win 63% of the votes and candidate B was supposed to win 37%.
00:14:58.100 --> 00:15:07.800
When the votes are counted, candidate A won 340 votes while B won 166 votes. So here, just to give you that
00:15:07.800 --> 00:15:18.500
picture again, for the null hypothesis population, candidate A, and I will color A in blue, should have
00:15:07.800 --> 00:15:18.500
won 63% of the vote, and candidate B, who I will color in red, should have won 37% of the vote. So what would be our null hypothesis?
00:15:33.300 --> 00:15:45.500
Our null hypothesis would be that our unknown population will be like this predicted one; the proportions of my unknown population
00:15:45.500 --> 00:15:49.500
will have the same proportions as our predicted population.
00:15:49.500 --> 00:16:22.400
So here we might write something like: A's proportion of votes, of the actual real votes, should be like
00:16:22.400 --> 00:16:49.200
the predicted population, and B's proportion of votes should be like the predicted population.
00:16:49.200 --> 00:16:57.000
So let us say A's real proportion of votes should be like this, and so should B's; B should be like this.
00:16:57.000 --> 00:17:04.800
The other way we could say that is that the proportion of votes, the real proportion of votes, should be like
00:17:04.800 --> 00:17:10.800
the predicted proportion of votes, and then you could just say for every single category for both A and B.
00:17:10.800 --> 00:17:15.000
So what would be the alternative version of this?
00:17:15.000 --> 00:17:23.000
The alternative would say at least one of the categories, either A or B, one of those
00:17:23.000 --> 00:17:28.500
proportions will be different from the hypothesized proportion.
00:17:28.500 --> 00:17:35.700
And in fact in this example, if one of them is different the other will be different too, because since we only
00:17:35.700 --> 00:17:41.400
have two categories, if we make one really different then the other one will automatically change.
00:17:41.400 --> 00:17:50.000
But later on we might see examples with 3, 4, 5 categories, and in those cases this will make more sense.
00:17:50.000 --> 00:18:00.000
Okay, so now let us talk about how to actually find out whether our proportions are really off or not.
00:18:00.000 --> 00:18:12.200
Are our proportions statistical outliers, are they deviant, are they significant, do they stand out? That is what we want to know.
00:18:12.200 --> 00:18:18.900
And in order to do that we have to use a measure called the chi-square statistic. Instead of the t statistic,
00:18:18.900 --> 00:18:25.700
which looks at a distance away in terms of standard error, or the F statistic, which looks at the
00:18:25.700 --> 00:18:33.000
proportion of the variance we are interested in over the variance we cannot explain, the chi-square does something different.
00:18:33.000 --> 00:18:44.000
It is now looking at expected values, what would we expect and what did we actually observe, and so the
00:18:44.000 --> 00:18:54.000
chi-square is going to look like this. Be careful: usually it is like an uppercase X, but
00:18:54.000 --> 00:19:02.300
it is a little bit different than a regular letter X; it is usually a little more curvy to let you know it is chi-square.
00:19:02.300 --> 00:19:10.400
So the chi-square is really going to be interested in the difference between what we observe the actual
00:19:10.400 --> 00:19:17.100
observed frequency or percentages minus the expected frequency.
00:19:17.100 --> 00:19:30.500
So we are looking at observed versus expected: this is what we see in our sample, and this is what we
00:19:30.500 --> 00:19:40.600
would predict given our hypothesized population, so this is that predicted population part.
00:19:40.600 --> 00:19:45.200
So we are interested in the difference between those two frequencies.
00:19:45.200 --> 00:19:59.700
Now, although you could use proportions as well, you can only do that if you have a
00:19:59.700 --> 00:20:03.000
constant number of items, so you are probably safer to go with frequencies, because those are essentially
00:20:03.000 --> 00:20:06.900
weighted proportions, so you probably want to go with that.
00:20:06.900 --> 00:20:14.000
So we are interested in this difference, but remember, when we look at this difference, sometimes it can be
00:20:14.000 --> 00:20:20.500
positive and sometimes it can be negative, and so what we do here, as is usual in statistics, is we square
00:20:20.500 --> 00:20:34.300
the whole thing. But we also want to know about this difference as a proportion of what was expected, and we want to do this for every category.
00:20:34.300 --> 00:20:49.200
We sum for the number of categories: i goes from 1 to the number of categories, and there is actually an i subscript down here for everything.
00:20:49.200 --> 00:20:59.100
So what this is saying is that for each category, each proportion that you are looking at, so in our sort
00:20:59.100 --> 00:21:15.400
of toy example with the red, blue, and yellow, we would do this for blue, we would do this
00:21:15.400 --> 00:21:31.600
for red, and we would do this for yellow. So the number of categories, the categories, really speak to what the proportions are made of.
00:21:31.600 --> 00:21:55.000
So here we have three categories, so we would do this three times and add those terms up, and we
00:21:55.000 --> 00:22:00.900
eventually want to be able to find the observed frequency and the expected frequency.
00:22:00.900 --> 00:22:10.400
Now in the example that we saw with the voting for candidates A and B, one of the things I hope you
00:22:10.400 --> 00:22:16.100
noticed was that the observed frequencies were given as just the number of votes, how many people voted, but
00:22:16.100 --> 00:22:26.400
the expected frequencies come from the hypothesized population, and that was given as a percentage, so
00:22:26.400 --> 00:22:33.000
you cannot subtract votes from percentages; you have to translate them both into something that is the
00:22:33.000 --> 00:22:46.200
same, and so it is helpful to change the expected percentages into expected frequencies, and there is
00:22:46.200 --> 00:22:51.000
going to be another reason for changing into expected frequencies instead of changing the observed
00:22:51.000 --> 00:22:57.300
frequencies into observed proportions, and I am going to get to that a little bit later.
00:22:57.300 --> 00:23:04.000
So here is how I want you to think of this: it is really the squared difference between observed and expected
00:23:04.000 --> 00:23:13.700
frequencies, as a proportion of the expected frequency, and you want to sum that over all the categories.
00:23:13.700 --> 00:23:20.000
Once you have that then you get your chi-square value, now let us think about this chi-square value.
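The formula just described can be sketched in code. This is my own minimal illustration, not part of the lecture, and the color counts below are made-up numbers:

```python
# Chi-square statistic: sum over categories of (observed - expected)^2 / expected.
def chi_square_statistic(observed, expected):
    """Squared (O - E) differences, each as a proportion of E, summed over categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Toy example: three colors hypothesized at one third each, 90 items total,
# so every expected frequency is 30.
observed = [54, 18, 18]   # 60% blue, 20% red, 20% yellow
expected = [30, 30, 30]
print(chi_square_statistic(observed, expected))  # 28.8
```

When observed and expected match exactly, every term is zero and the statistic is zero, which is the "small chi-square" case described next.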
00:23:20.000 --> 00:23:32.700
If this difference is very large, so the observed frequencies are just very different from the expected ones, if that difference is very large,
00:23:32.700 --> 00:23:42.600
you are going to have a very large chi-square. Also, if this difference is very small, if they are really close to each other, then your chi-square will be very small.
00:23:42.600 --> 00:23:54.200
So chi-square is giving us a measure of how far apart the observed and expected frequencies are, also I
00:23:54.200 --> 00:23:58.800
want you to see that the chi-square cannot be negative.
00:23:58.800 --> 00:24:05.000
First of all, because we are squaring this difference, the numerator cannot be negative. Not only that,
00:24:05.000 --> 00:24:11.400
the expected frequencies also cannot be negative, because we are counting up how many things we have,
00:24:11.400 --> 00:24:17.500
how many things we observed and so this also cannot be negative so this whole thing cannot be negative.
00:24:17.500 --> 00:24:25.000
So already we see in our mind the chi-square distribution will probably be positive and positively skewed
00:24:25.000 --> 00:24:30.000
because it stops at zero; there is a wall at zero.
00:24:30.000 --> 00:24:39.000
Okay, so now let us actually talk about and draw the chi-square distribution. Imagine having some sort of data
00:24:39.000 --> 00:24:47.000
set and taking samples from it over and over again. So you have this big data set, you
00:24:47.000 --> 00:24:53.000
take a sample, and you calculate the chi-square statistic and you plot that.
00:24:53.000 --> 00:25:02.300
And then you put that back in, you take another sample, you calculate the chi-square, plot it again, and do
00:25:02.300 --> 00:25:05.200
that over and over and over and over again.
00:25:05.200 --> 00:25:14.300
You will never get a value that is below zero, and you will get values that might be way higher than zero
00:25:14.300 --> 00:25:20.000
sometimes, but for the most part they will be clustered over here, so you will get a skewed distribution, and
00:25:20.000 --> 00:25:27.300
indeed the chi-square distribution is a skewed distribution.
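That resampling procedure can be sketched as a small simulation. This is my own illustration, not from the lecture; the sample size, seed, and repetition count are arbitrary:

```python
import random
from collections import Counter

def chi_square_statistic(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

random.seed(0)
n, k = 90, 3                     # 90 draws, 3 equally likely categories
expected = [n / k] * k           # expected frequency of 30 per category
values = []
for _ in range(10_000):          # take a sample, compute chi-square, repeat
    counts = Counter(random.randrange(k) for _ in range(n))
    observed = [counts[c] for c in range(k)]
    values.append(chi_square_statistic(observed, expected))

print(min(values) >= 0)          # True: there is a wall at zero
```

Plotting `values` as a histogram would show exactly the right-skewed, non-negative pile-up the lecture describes.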
00:25:27.300 --> 00:25:36.400
Now here when we look at this you might think, hey, that looks sort of like the F distribution and you are
00:25:36.400 --> 00:25:44.300
right: overall in shape it looks just like the F distribution, and in a lot of ways we can apply the reasoning
00:25:44.300 --> 00:25:48.000
from the F distribution directly to the chi-square distribution.
00:25:48.000 --> 00:25:56.200
For instance, in the chi-square distribution our α is automatically one-tailed, it is only on one side, and so
00:25:56.200 --> 00:26:04.700
when we say something like α equals .05 this is what we mean, we mean that we will reject the null
00:26:04.700 --> 00:26:12.700
when we have a chi-square value that is somewhere out here, or here, or here, but we will fail to reject if we
00:26:12.700 --> 00:26:16.100
get a chi-square value in here from our sample.
00:26:16.100 --> 00:26:26.500
Now this chi-square distribution, like the F and t distributions, is a family of distributions, not just one
00:26:26.500 --> 00:26:31.100
distribution; the only one that is just one distribution is the normal distribution.
00:26:31.100 --> 00:26:38.500
The chi-square distribution again depends on degrees of freedom, and the degrees of freedom that the chi-
00:26:38.500 --> 00:26:48.400
square depends on is going to be the number of categories minus 1.
00:26:48.400 --> 00:26:54.900
So if you have a lot of categories the chi-square distribution will look different than if you have a small
00:26:54.900 --> 00:26:58.900
number of categories, like 2.
00:26:58.900 --> 00:27:03.400
So let us talk about what α means here.
00:27:03.400 --> 00:27:10.600
The α here is the set significance level; we are going to use this as the boundary, so
00:27:10.600 --> 00:27:23.100
that if we have a chi-square from our sample that is bigger than this boundary, then we will reject the null.
00:27:23.100 --> 00:27:26.800
What is the difference now with P value?
00:27:26.800 --> 00:27:35.800
Now the P value is a probability. So we might have a P value somewhere out here, or we might
00:27:35.800 --> 00:27:48.700
have a P value somewhere here. The P value is going to be very similar to other hypothesis tests; what the P
00:27:48.700 --> 00:28:01.200
value means in other hypothesis tests, basically, is the probability of getting a chi-square value
00:28:01.200 --> 00:28:19.900
larger, more extreme, and in this case there is only one kind of extreme, positive, larger than the one from our sample, but under a condition.
00:28:19.900 --> 00:28:23.000
Remember, in this world, which one is true?
00:28:23.000 --> 00:28:32.000
The null hypothesis is true.
00:28:32.000 --> 00:28:40.000
So considering if the null hypothesis were true this would be the probability of getting such an extreme chi-
00:28:40.000 --> 00:28:47.000
square value, one that is that large or larger; that is all we need.
00:28:47.000 --> 00:28:57.500
So, in that way, the P value is from our data, while the α is not from our data; it is just something we set as the cutoff.
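For the one-degree-of-freedom case coming up, the p-value, the probability of a chi-square this large or larger under the null, can be sketched with Python's standard library alone, because a chi-square with 1 df is the square of a standard normal. This is my own sketch, not part of the lecture, and it only covers df = 1:

```python
from math import sqrt
from statistics import NormalDist

def chi2_pvalue_df1(x):
    """P(X >= x) for a chi-square with 1 df, using X = Z^2."""
    return 2 * (1 - NormalDist().cdf(sqrt(x)))

# A chi-square right at the 3.84 cutoff gives a p-value right around alpha = .05.
print(round(chi2_pvalue_df1(3.84), 3))  # 0.05
```

For more degrees of freedom you would use a chi-square table or a statistics package rather than this normal-based shortcut.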
00:28:57.500 --> 00:29:04.700
So there are some conditions that we need to know before we use the chi-square.
00:29:04.700 --> 00:29:15.800
We cannot just always use the chi-square; there are conditions that have to be met, so one of the conditions of the chi-square is this.
00:29:15.800 --> 00:29:25.200
Each outcome in the population falls exactly into one of a fixed number of categories, so every time you
00:29:25.200 --> 00:29:32.900
have some sort of case from the population, so let us say we are drawing out votes.
00:29:32.900 --> 00:29:44.900
Each vote has to fall into one of a fixed number of categories, so if it is two candidates, it is always two
00:29:44.900 --> 00:29:52.000
candidates for every single voter; we cannot compare voters that had two candidates versus voters who had three candidates.
00:29:52.000 --> 00:30:02.000
Also these have to be mutually exclusive categories; one vote cannot go to two candidates at once, so they
00:30:02.000 --> 00:30:07.200
have to be mutually exclusive: you vote for A or you vote for B.
00:30:07.200 --> 00:30:16.300
And you cannot opt out either; every outcome has to fall into one of the fixed number of categories set ahead of time.
00:30:16.300 --> 00:30:25.700
So the numbering is slightly off here but the second condition that must be met is that you must have a
00:30:25.700 --> 00:30:31.900
random sample from your population; that is just like all other kinds of hypothesis testing.
00:30:31.900 --> 00:30:40.600
Number 3, the expected frequency in each category: once you compute all the expected
00:30:40.600 --> 00:30:50.000
frequencies in order to compute your chi-square, each cell, each square, needs to have an
00:30:50.000 --> 00:30:54.500
expected frequency of five or greater. Here is why.
00:30:54.500 --> 00:31:02.000
You need a big enough sample; if you have too small a sample, again, you get expected frequencies less than five.
00:31:02.000 --> 00:31:11.300
You also need big enough proportions. So let us say you want to compare proportions where, you know,
00:31:11.300 --> 00:31:23.600
one candidate is predicted to win 99.999% of the votes and the other candidate is only
00:31:23.600 --> 00:31:30.500
supposed to win .001% of the vote, and you only have five people in your sample.
00:31:30.500 --> 00:31:37.000
And so you need to also have big enough proportions, and these balance each other out.
00:31:37.000 --> 00:31:42.700
If you have a large enough sample then your proportions can be smaller; also, if you have large enough
00:31:42.700 --> 00:31:45.400
proportions, your sample could be smaller.
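This sample-size and proportion balance can be sketched as a small check. The helper below is hypothetical, not something from the lecture, but it encodes the "expected frequency of five or greater" condition directly:

```python
def expected_counts_ok(total_n, proportions, minimum=5):
    """True if every category's expected frequency (n * p) is at least `minimum`."""
    return all(total_n * p >= minimum for p in proportions)

print(expected_counts_ok(506, [0.63, 0.37]))       # True: 318.78 and 187.22
print(expected_counts_ok(5, [0.99999, 0.00001]))   # False: 4.99995 and 0.00005 are below 5
```

Note how the tiny sample fails even for the 99.999% category: the sample size and the proportions have to be big enough together.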
00:31:45.400 --> 00:31:53.300
And the final condition is not really a condition, it is just sort of something I wanted you to know as a rule.
00:31:53.300 --> 00:32:00.400
The chi-square goodness of fit test, that is what we have been talking about so far.
00:32:00.400 --> 00:32:07.300
This test actually applies to more than two categories.
00:32:07.300 --> 00:32:15.500
You do not just have 2 categories; you can have 3 or 4 or 5 or 6, but they do need to be mutually exclusive, and
00:32:15.500 --> 00:32:20.200
each outcome in the population must be able to fall into any one of those.
00:32:20.200 --> 00:32:23.500
So those are the conditions.
00:32:23.500 --> 00:32:27.200
So now let us move on to some examples.
00:32:27.200 --> 00:32:33.100
So the first example is the problem that we already looked at: according to early polls, candidate A
00:32:33.100 --> 00:32:38.300
was supposed to win 63% of the vote and B was supposed to win 37%.
00:32:38.300 --> 00:32:46.900
When the votes are counted, A won 340 votes while B won 166 votes.
00:32:46.900 --> 00:32:55.600
One of the things that I like to do just to help myself, when I think of
00:32:55.600 --> 00:33:10.100
the null hypothesis, is I sort of write it out in a sentence: the proportion of votes, that is my population,
00:33:10.100 --> 00:33:52.000
should be like the predicted proportions, and the alternative is that at least one of the proportions of votes will not be like the predicted population.
00:33:52.000 --> 00:34:00.400
What I also like to do is draw this out for myself, I like to draw out the predicted population, so I will
00:34:00.400 --> 00:34:13.100
color candidate A in blue, so that will be about 63%, and candidate B will be in red, 37%.
00:34:13.100 --> 00:34:19.000
And so eventually I want to know whether this is reflected in my actual votes.
00:34:19.000 --> 00:34:27.800
The significance level we can set at .05 just out of convention, and we know that it has to be one-tailed
00:34:27.800 --> 00:34:35.200
because this is definitely going to be a chi-square, and we know it is a chi-square because it is about expected proportions.
00:34:35.200 --> 00:34:40.700
So now let us set our decision stage.
00:34:40.700 --> 00:35:00.000
For our decision stage, it is helpful to draw that chi-square distribution and to sort of label it. For α
00:35:00.000 --> 00:35:11.500
here, this is our rejection region, .05. Now it would be nice to know what our critical chi-square is, and in
00:35:11.500 --> 00:35:19.000
order to find that we need degrees of freedom, and degrees of freedom is the number of categories, in this
00:35:19.000 --> 00:35:31.000
case 2, minus 1, and that is 1 degree of freedom. That is because if you know, let us say, that candidate B is
00:35:31.000 --> 00:35:38.000
supposed to win 37% of the votes, you could actually figure out candidate A's proportion; you do not need me to tell
00:35:38.000 --> 00:35:43.600
you what that is to figure it out, and candidate A's proportion cannot vary freely once you
00:35:43.600 --> 00:35:47.700
know this one, and that is why it is the number of categories minus 1.
00:35:47.700 --> 00:35:56.600
So now that we have that, it might be useful to look either in the back of your book or use an Excel
00:35:56.600 --> 00:36:01.500
spreadsheet function in order to find our critical chi-square.
00:36:01.500 --> 00:36:22.300
So in order to find the chi-square there are two functions that you need to know: just like TDIST and TINV, or FDIST and FINV, now there are CHIDIST and CHIINV.
00:36:22.300 --> 00:36:30.000
Actually, we need to use CHIINV right now, because here we have the probability, .05, and the degrees of
00:36:30.000 --> 00:36:37.700
freedom, 1, and that will give us our critical chi-square, and that is 3.84.
00:36:37.700 --> 00:36:48.500
So this critical value is the boundary we are looking for, 3.84; so anything more extreme, more positive, than
00:36:48.500 --> 00:36:54.300
3.84, and we are going to reject our null hypothesis.
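The CHIINV(0.05, 1) lookup above can be sanity-checked without Excel. Here is a minimal Python sketch, assuming only the standard library: for 1 degree of freedom, a chi-square variable is the square of a standard normal, so the upper-tail critical value is the square of the two-tailed normal cutoff.

```python
from statistics import NormalDist

# For df = 1, a chi-square variable is a squared standard normal, so the
# upper-0.05 chi-square cutoff equals the squared two-tailed normal cutoff.
# This mirrors the lecture's Excel call CHIINV(0.05, 1).
alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed normal cutoff, about 1.96
critical_chi_square = z ** 2

print(round(critical_chi_square, 2))      # 3.84, matching the lecture
```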
00:36:54.300 --> 00:37:01.000
So now that our decision stage is set, now it is helpful to actually work with our population and remember
00:37:01.000 --> 00:37:16.200
when we talk about our population — I should have left myself some room — when we talk about our actual sample, here is what we ended up having.
00:37:16.200 --> 00:37:23.400
We have observed frequencies already, so I am going to write a column for observed for candidate A and
00:37:23.400 --> 00:37:40.800
candidate B. For candidate A we observed 340 votes, so that is our observed frequency; for candidate B, we see 166 votes.
00:37:40.800 --> 00:37:54.300
Now one thing that helps is that we know what the total number of votes was; the total number of votes is going to be 340 + 166, and that is 506.
00:37:54.300 --> 00:38:03.100
So 506 people actually voted in this election; so down here I am going to write the total, 506.
00:38:03.100 --> 00:38:10.800
Now the question is, what should our expected frequencies have been?
00:38:10.800 --> 00:38:18.300
So here I am going to write expected and I know that my proportion of expected should be 63%.
00:38:18.300 --> 00:38:22.500
What does that mean in terms of the total number of people who voted?
00:38:22.500 --> 00:38:27.700
So here is our little sample of 506 people.
00:38:27.700 --> 00:38:42.800
This is our 100% but here we have 506 people in our sample, we should expect 63% of 506 to have voted
00:38:42.800 --> 00:38:47.700
for A, and so how do we find that?
00:38:47.700 --> 00:39:01.000
Well, we are going to multiply 63% by 506 to find out how many votes that little blue bit is, and so that is
00:39:01.000 --> 00:39:09.900
going to be .63 × 506, that total amount.
00:39:09.900 --> 00:39:15.200
If we multiplied 506 × 1 we would get 506, right?
00:39:15.200 --> 00:39:27.200
So if we multiply by a little bit of a smaller proportion, we get just that chunk: 318.78. Actually, I am
00:39:27.200 --> 00:39:43.200
going to put this here — let me actually draw this little table right in here, because that can help us find our chi-square much more quickly.
00:39:43.200 --> 00:39:53.700
And so, observed and expected frequencies: the observed frequencies are 340 and 166, okay.
00:39:53.700 --> 00:40:01.000
So what is the expected frequency for B? In order to find this little bit we are going to multiply
00:40:01.000 --> 00:40:14.400
.37 × 506, and that is 187.22.
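The expected-frequency arithmetic here is just hypothesized proportion times total count; a small sketch of the same step:

```python
# Expected frequency = hypothesized proportion x total observed count.
total_votes = 506
proportions = {"A": 0.63, "B": 0.37}   # predicted population proportions

expected = {cand: p * total_votes for cand, p in proportions.items()}

print(round(expected["A"], 2))  # 318.78
print(round(expected["B"], 2))  # 187.22
# The expected column must add back up to the same total as the observed one.
assert abs(sum(expected.values()) - total_votes) < 1e-9
```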
00:40:14.400 --> 00:40:22.500
And usually, if you add up this entire column, you should get roughly the same total.
00:40:22.500 --> 00:40:29.400
When you do these by hand, sometimes you might not get exactly the same number; it
00:40:29.400 --> 00:40:38.100
might be off by just a little bit because of rounding error, if you round to the nearest 10th or to the nearest integer,
00:40:38.100 --> 00:40:45.200
but you should not be off by much, so that is one way you can check that what you did was right.
00:40:45.200 --> 00:41:02.700
And so once we have this, let me just copy these down right here, 318.78 and 187.22; for each of these
00:41:02.700 --> 00:41:16.200
the total is 506. So here, one of the things we see is that the expected value for A is a little bit lower and the
00:41:16.200 --> 00:41:25.400
expected value for B is a little bit higher; but is this difference in proportions significant, is it
00:41:25.400 --> 00:41:33.500
standing out enough? In order to find that out, we need to find the chi-square, the sample chi-square.
00:41:33.500 --> 00:41:37.000
Now, we have completely run out of room here.
00:41:37.000 --> 00:41:39.900
But I will just write the chi-square formula up here.
00:41:39.900 --> 00:41:49.700
So the chi-square is going to be the sum, over all the categories, of the observed frequency minus the
00:41:49.700 --> 00:41:56.900
expected frequency, squared, as a proportion of the expected frequency.
00:41:56.900 --> 00:42:05.400
And so what I am going to do is calculate this for each category, A and B and then add them up.
00:42:05.400 --> 00:42:19.800
So right here I am going to add a column, (O minus E) squared, all over E.
00:42:19.800 --> 00:42:27.300
So I am going to do that for A and B and then sum them up.
00:42:27.300 --> 00:42:45.000
So, my observed minus expected, squared, all divided by expected; and so here I get this proportion, and I am
00:42:45.000 --> 00:42:56.800
just going to copy and paste that down here, and then here I am just going to sum them up, and I get 3.817.
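The spreadsheet step just described can be checked with a few lines of Python; this is a sketch of the same (O − E)²/E sum, using the observed and expected counts from the example:

```python
observed = [340, 166]         # votes actually counted for A and B
expected = [318.78, 187.22]   # 0.63 * 506 and 0.37 * 506

# Chi-square statistic: sum over the categories of (O - E)^2 / E.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(round(chi_square, 2))   # 3.82, a hair under the 3.84 cutoff
```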
00:42:56.800 --> 00:43:09.000
We are really close, but no cigar; we are right underneath, so our sample chi-square is just a smidge
00:43:09.000 --> 00:43:17.400
smaller than our critical chi-square, so here we are not rejecting the null, we are going to fail to reject the
00:43:17.400 --> 00:43:29.600
null. So let us find the P value; in order to find the P value you could use CHIDIST, or alternatively look it up
00:43:29.600 --> 00:43:34.300
in the back of your book, look for the chi-square distribution.
00:43:34.300 --> 00:43:47.000
It should be behind your normal, your T, your F, and then chi-square should come right behind it; it usually goes in that order, maybe a slightly different order.
00:43:47.000 --> 00:44:00.300
And our degrees of freedom remains the same, 1, and so our P value is just over .05 — if we round, .051, right?
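For 1 degree of freedom, the chi-square tail probability has a closed form via the normal distribution, so the CHIDIST lookup can be reproduced in plain Python; a sketch, taking the statistic computed in the example as given:

```python
from math import erfc, sqrt

chi_square = 3.8177   # sample statistic from the worked example
# For df = 1, P(X > x) = erfc(sqrt(x / 2)): the chi-square tail equals the
# two-sided tail of a standard normal evaluated at sqrt(x).
p_value = erfc(sqrt(chi_square / 2))

print(round(p_value, 3))  # 0.051, just above the .05 significance level
```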
00:44:00.300 --> 00:44:17.600
So because of that we are not going to reject the null so we are going to say the proportions of votes are roughly similar to the predicted proportions.
00:44:17.600 --> 00:44:23.000
Well, they are not significantly different, at least; they may not be super similar, but we cannot make a decision
00:44:23.000 --> 00:44:29.200
about that, but we can say that they are not extremely different, at least.
00:44:29.200 --> 00:44:44.400
Okay, example 2. A study asked whether college students could tell dog food apart from expensive liver pâté, liverwurst, and Spam.
00:44:44.400 --> 00:44:55.000
All were blended to the same consistency, chilled, and garnished with herbs and a lemon wedge, just to make it pretty.
00:44:55.000 --> 00:44:58.200
Students are asked to identify which was dog food.
00:44:58.200 --> 00:45:03.500
Researchers wanted to test the probability model where the students are randomly guessing.
00:45:03.500 --> 00:45:06.700
How would they cast their hypothesized model?
00:45:06.700 --> 00:45:14.400
Okay, so see the download that shows how many students picked each item to be dog food. So it seems that
00:45:14.400 --> 00:45:23.100
college students have a bunch of different choices: dog food, liver pâté, liverwurst, and Spam, and then
00:45:23.100 --> 00:45:28.000
they need to identify which was dog food; so out of those, which of those is dog food?
00:45:28.000 --> 00:45:31.900
So it is sort of like a multiple-choice question.
00:45:31.900 --> 00:45:41.500
So if you hit example 2 in the download that is listed below, you will see the number of students who selected that particular item as dog food.
00:45:41.500 --> 00:45:51.200
Now be careful, because, remember, if you really got this problem on a test you would not be told that it is a chi-square problem.
00:45:51.200 --> 00:45:58.200
Sometimes people might immediately just think, I will find the mean, and so they just go ahead and find the
00:45:58.200 --> 00:46:03.100
mean; but if you do find the mean, ask yourself, what does this mean?
00:46:03.100 --> 00:46:07.700
What is the idea or the concept?
00:46:07.700 --> 00:46:15.600
If we average this, we would find the average number of students that selected any of these items as dog
00:46:15.600 --> 00:46:19.100
food, and that is a sort of mean that does not make any sense, right?
00:46:19.100 --> 00:46:28.400
And so before you just go ahead and find the mean, ask yourself whether the mean is actually meaningful.
00:46:28.400 --> 00:46:37.700
So here we know that it is chi-square, because the students are choosing something, and it is a categorical choice.
00:46:37.700 --> 00:46:44.600
They are not giving you an answer like 20 inches or 50° or I got 10 questions correct right?
00:46:44.600 --> 00:46:52.700
They are actually just saying, that one is dog food; they have five different choices and they have
00:46:52.700 --> 00:47:01.200
chosen one of them as dog food. So out of five choices, a probability model where they are just guessing would
00:47:01.200 --> 00:47:08.000
mean that 20% of the time they should pick pâté to be dog food, 20% of the time they would pick Spam to be
00:47:08.000 --> 00:47:14.700
dog food, 20% of the time they would pick dog food to be dog food, and so on and so forth.
00:47:14.700 --> 00:47:24.000
So let us try that probability model, and with the model we also need the null hypothesis.
00:47:24.000 --> 00:47:28.500
The model, or hypothesized population: so, step one.
00:47:28.500 --> 00:47:37.800
So the null hypothesis is the idea that they will fit into this picture; so this is the population, and it is out of
00:47:37.800 --> 00:47:58.500
100%, and they have five choices — my picture is just slightly uneven; it helps to really draw this as well as you can, as it will then help you reason, too.
00:47:58.500 --> 00:48:05.600
They will have an equal chance of guessing any one of these, and there are two liver pâtés; that is why there are 5 choices.
00:48:05.600 --> 00:48:29.600
So liver pâté 1, Spam was next, then actual dog food, just as in the data set, then pâté 2 and liverwurst.
00:48:29.600 --> 00:48:43.500
So these are the five choices, and we are saying, look, if the students are just guessing they should have a 20% probability for each.
00:48:43.500 --> 00:48:58.000
Are these the right proportions for this sample; is the sample going to sort of match that, or be very different from this?
00:48:58.000 --> 00:49:14.600
The alternative is that at least one of the real proportions is different from the predicted ones.
00:49:14.600 --> 00:49:30.000
So once we have that, we can set our α to be .05. At our decision stage, we could draw the chi-square distribution, and
00:49:30.000 --> 00:49:37.700
for our degrees of freedom, we now have five categories, and so our degrees of freedom is 5 − 1, which equals 4,
00:49:37.700 --> 00:49:46.800
and it is because once we know four of these, we could actually figure out the proportion for the fifth one just from knowing the other 4.
00:49:46.800 --> 00:49:51.000
So that one is no longer free to vary, it does not have freedom anymore.
00:49:51.000 --> 00:49:58.600
So what is our critical chi-square?
00:49:58.600 --> 00:50:11.400
Well, if you want to, pull up your Excel data; here I am just going to start off with step three. In step three we need our
00:50:11.400 --> 00:50:26.300
critical chi-square; in order to find that we can use CHIINV, and put in the probability that we are interested in and our degrees of freedom, which is 4.
00:50:26.300 --> 00:50:35.500
And so our critical chi-square is 9.49.
00:50:35.500 --> 00:50:53.400
Notice that as degrees of freedom goes up, what is happening to the chi-square distribution is that it is getting
00:50:53.400 --> 00:50:59.500
fatter, it is getting more variable, and because of that we need a more extreme critical chi-square value.
00:50:59.500 --> 00:51:05.800
So that is sort of different than like T distributions or F distribution.
00:51:05.800 --> 00:51:15.100
Those distributions got sharper when we increased our degrees of freedom; chi-square distributions go the opposite way.
00:51:15.100 --> 00:51:20.500
Chi-square distributions get more variable as degrees of freedom goes up.
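The widening just described can be seen numerically. Below is a sketch that builds the chi-square tail probability for whole-number degrees of freedom from the df = 1 and df = 2 closed forms (a standard recurrence), then inverts it by bisection; the critical value climbs as df grows, unlike the t cutoffs.

```python
from math import erfc, exp, gamma, sqrt

def chi_square_sf(x, df):
    """P(X > x) for a chi-square with integer df, via the recurrence
    Q(df + 2) = Q(df) + (x/2)^(df/2) * e^(-x/2) / Gamma(df/2 + 1)."""
    q = erfc(sqrt(x / 2)) if df % 2 else exp(-x / 2)  # df = 1 or df = 2 base case
    k = 1 if df % 2 else 2
    while k < df:
        q += (x / 2) ** (k / 2) * exp(-x / 2) / gamma(k / 2 + 1)
        k += 2
    return q

def critical_value(df, alpha=0.05):
    """Invert the tail probability by bisection to get the upper-alpha cutoff."""
    lo, hi = 0.0, 200.0
    for _ in range(60):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if chi_square_sf(mid, df) > alpha else (lo, mid)
    return lo

for df in range(1, 6):
    print(df, round(critical_value(df), 2))
# df = 1 gives 3.84 and df = 4 gives 9.49, matching the lecture's two
# CHIINV lookups, and the cutoff keeps growing as df grows.
```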
00:51:20.500 --> 00:51:29.200
So once we have this now we could start working on our actual data, our actual samples.
00:51:29.200 --> 00:51:42.400
So in step four we need to find our sample chi-square, and in order to do that it helps to draw out that table; so
00:51:42.400 --> 00:51:46.100
the table might look something like this.
00:51:46.100 --> 00:52:05.300
I will just copy this down here and this is the type of food, so that is the category and here we have our observed frequencies.
00:52:05.300 --> 00:52:10.200
The actual number of students that pick that thing to be dog food.
00:52:10.200 --> 00:52:18.300
So here we see one student picked pâté 1 to be dog food, and 15 students picked liverwurst to be the dog food.
00:52:18.300 --> 00:52:22.100
What are the expected frequencies?
00:52:22.100 --> 00:52:32.900
Well in order to find expected frequencies we know that the expected proportions are going to be .2 all the way down.
00:52:32.900 --> 00:52:41.400
20% 20% 20% 20% and here I am just going to total this up.
00:52:41.400 --> 00:52:50.500
And I see that 34 students were asked this question.
00:52:50.500 --> 00:52:55.400
Our expected frequencies should add up to about 34.
00:52:55.400 --> 00:52:59.000
Our expected proportions add up to one.
00:52:59.000 --> 00:53:04.600
And that is why we cannot just directly compare these two things, they are not in the same sort of currency
00:53:04.600 --> 00:53:09.000
yet, you sort of have to change this currency into frequency.
00:53:09.000 --> 00:53:11.700
So how do we do that?
00:53:11.700 --> 00:53:18.800
Well, we imagine here are all 34 students; take 20% of them — how many students will that be?
00:53:18.800 --> 00:53:26.900
So that is 0.2 × 34, this cell times 34.
00:53:26.900 --> 00:53:33.800
And I am just going to lock down that 34, because that total sum will not change.
00:53:33.800 --> 00:53:46.500
So, this is what we should expect if they were indeed guessing; these are the expected frequencies that
00:53:46.500 --> 00:53:53.600
we should see, and if I just move that over here, we will see that that column also adds up to 34.
00:53:53.600 --> 00:54:00.000
Now once we have that, we can compute our actual chi-square; because remember, that is observed frequency
00:54:00.000 --> 00:54:06.800
minus expected frequency, squared, divided by expected, as a proportion of expected.
00:54:06.800 --> 00:54:17.000
So, that is the observed frequency minus the expected frequency, squared, divided by the expected frequency.
00:54:17.000 --> 00:54:27.900
And I can take that down for each row and then add those up, and here I get my chi-square statistic for
00:54:27.900 --> 00:54:41.200
my sample; and so my sample chi-square is going to be 16.29, and that is a larger, more extreme
00:54:41.200 --> 00:54:46.500
chi-square than my critical chi-square; and let us also find the P value here.
00:54:46.500 --> 00:54:57.200
In order to find the P value I can use CHIDIST; here I put in my chi-square and my degrees of freedom, which is 4.
00:54:57.200 --> 00:55:14.900
And so that is .003 and that is certainly smaller than .05 and so in step five, we reject the null.
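The CHIDIST lookup for this example can also be checked by hand: with 4 degrees of freedom (an even number), the chi-square tail probability reduces to a finite sum. A sketch, taking the 16.29 statistic from the worked example as given:

```python
from math import exp

chi_square = 16.29   # sample statistic from the worked example
# For df = 4 the tail probability has the closed form
# P(X > x) = exp(-x/2) * (1 + x/2).
p_value = exp(-chi_square / 2) * (1 + chi_square / 2)

print(round(p_value, 3))  # 0.003, comfortably below the .05 cutoff
```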
00:55:14.900 --> 00:55:17.800
Now I just want to make a comment here.
00:55:17.800 --> 00:55:24.700
Notice that here, after we do the chi-square, although we reject the null, just like in the ANOVA we do not
00:55:24.700 --> 00:55:30.400
actually know which of the categories is the one that is really off.
00:55:30.400 --> 00:55:40.500
This one here, we can sort of see, probably seems to be the most off, but we are just eyeballing it;
00:55:40.500 --> 00:55:42.800
we are not using actual statistical principles.
00:55:42.800 --> 00:55:49.500
So once you reject the null there are post hoc tests that you could do, but we are not going to cover those here.
00:55:49.500 --> 00:56:02.200
So it seems that students are not randomly guessing they actually have a preference for something as being dog food.
00:56:02.200 --> 00:56:05.300
My guess is liverwurst.
00:56:05.300 --> 00:56:16.000
So example 3 which of these statements describe properties of the chi-square goodness of fit test?
00:56:16.000 --> 00:56:22.800
So, if you switch the order of categories, the value of the test statistic does not change; that is actually true, it
00:56:22.800 --> 00:56:31.300
does not matter whether candidate A got added before candidate B, addition is totally order-insensitive; you
00:56:31.300 --> 00:56:38.600
could add A then B, or B then A; you can add pâté then liverwurst then dog food, or dog food then liverwurst then
00:56:38.600 --> 00:56:43.400
pâté; it does not really matter, so this is actually true, it is a true property.
00:56:43.400 --> 00:56:50.400
Observed frequencies are always whole numbers; that is also actually true, because when you observe
00:56:50.400 --> 00:56:58.500
a frequency, you are actually counting how many category members you have, so the counts are going to be made up of whole numbers.
00:56:58.500 --> 00:57:07.200
Expected frequencies are always whole numbers, that is actually not true, expected frequencies are predicted frequencies.
00:57:07.200 --> 00:57:15.000
It is not that at any one time you will have a fraction of a student saying that liverwurst is dog food; it is that on
00:57:15.000 --> 00:57:25.300
average that is what you would predict, given a certain proportion. And so this is actually not true: expected
00:57:25.300 --> 00:57:31.800
frequencies do not have to be whole numbers, because they are theoretical; they are not actually things that we counted up in real life.
00:57:31.800 --> 00:57:42.500
A high value of chi-square indicates high level of agreement between observed frequencies and the expected frequencies.
00:57:42.500 --> 00:57:52.100
Actually, if you think about the chi-square statistic, this is the opposite of the real situation.
00:57:52.100 --> 00:57:59.600
If we had a high level of agreement, the observed-minus-expected numerator would be very small, and because that numerator is small
00:57:59.600 --> 00:58:06.100
the chi-square would also be small; a high value of chi-square would actually mean that the observed frequencies are quite far
00:58:06.100 --> 00:58:13.800
from the expected ones, and so this is actually also wrong — it is the opposite.
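The direction of that argument is easy to confirm numerically: feeding the chi-square formula observed counts that nearly match the expected ones yields a tiny statistic, while the mismatched counts from Example 1 yield a much bigger one. (The 319/187 counts below are hypothetical, chosen only to illustrate near-perfect agreement.)

```python
expected = [318.78, 187.22]   # expected frequencies from Example 1

def chi_square(observed, expected):
    # Sum of (O - E)^2 / E over the categories.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

close = chi_square([319, 187], expected)   # hypothetical near-agreement
far = chi_square([340, 166], expected)     # Example 1's actual counts

print(close < far)  # True: high agreement means a SMALL chi-square, not a large one
```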
00:58:13.800 --> 00:58:23.000
So that is it for chi-square goodness of fit test, join us next time on educator.com for chi-square test of homogeneity.