WEBVTT mathematics/statistics/son
00:00:02.100 --> 00:00:02.300
Hi, welcome to educator.com.
00:00:02.300 --> 00:00:04.800
We are going to talk about the chi-square test of homogeneity.
00:00:04.800 --> 00:00:22.400
Previously we talked about the chi-square goodness of fit test. Now we are going to contrast that with this new test; it is still a chi-square test, but it is a test of homogeneity now.
00:00:22.400 --> 00:00:26.700
We are going to try and figure out when do we use which test.
00:00:26.700 --> 00:00:35.300
With this test we are testing a new idea: we are not testing goodness of fit, we are actually testing homogeneity, that is, similarity.
00:00:35.300 --> 00:00:41.300
We actually have slightly different null and alternative hypotheses.
00:00:41.300 --> 00:00:55.400
We are going to talk about how those have changed, then we are going to go over the chi-square statistic, and also how finding the expected values is going to be a little bit different in the test of homogeneity.
00:00:55.400 --> 00:01:05.300
Finally we are going to go through chi-square distributions as well as degrees of freedom and the conditions for the test of homogeneity:
00:01:05.300 --> 00:01:09.600
when can you actually conduct this test, so to speak, statistically legally?
00:01:09.600 --> 00:01:22.100
Okay so the first thing is what is the difference between the test of homogeneity and test of goodness of fit?
00:01:22.100 --> 00:01:29.200
Well in the goodness of fit hypothesis testing we wanted to determine whether sample proportions are very different from hypothesized
00:01:29.200 --> 00:01:38.100
population proportions. One way you could think about this is that you have one sample and you are comparing it to some hypothetical population.
00:01:38.100 --> 00:01:48.400
And that is why it is called goodness of fit: it is about how well these two things fit together.
00:01:48.400 --> 00:01:52.300
How well does the sample fit with the hypothesized proportions?
00:01:52.300 --> 00:02:00.600
In the test of homogeneity, homogeneous means similar, right, that they are made up of the same stuff.
00:02:00.600 --> 00:02:09.900
In test of homogeneity we want to determine whether 2 populations that are sorted into categories share the same proportions or not.
00:02:09.900 --> 00:02:22.000
And here you could also substitute the word sample for population, because ultimately we are using the sample as a proxy for the population.
00:02:22.000 --> 00:02:32.600
So here we have 2 populations and we want to know whether those two populations are similar in their proportions or not,
00:02:32.600 --> 00:02:39.200
right? We are not comparing them to some hypothesized population; we are comparing them to each other.
00:02:39.200 --> 00:02:46.700
And so really you can think of their relationship by using an analogy from the
00:02:46.700 --> 00:02:50.000
one sample to the independent samples t-test.
00:02:50.000 --> 00:02:56.600
In the one sample t-test we had one sample and we compared it to the null hypothesis right?
00:02:56.600 --> 00:03:09.900
That was when we would have null hypotheses such as μ equals zero or μ equals 200 or μ equals -5. Contrast that with independent samples.
00:03:09.900 --> 00:03:17.700
We had 2 samples and we wanted to know how similar they were to each other right or how different
00:03:17.700 --> 00:03:28.400
they were from each other, and our null hypothesis was changed to something like μ of X bar minus Y bar equals zero, right,
00:03:28.400 --> 00:03:33.100
that they are either made up of the same mean or different means.
00:03:33.100 --> 00:03:48.700
And in a similar way, the goodness of fit chi-square is really asking whether this proportion in my sample
00:03:48.700 --> 00:03:52.600
is similar to the proportion in our population.
00:03:52.600 --> 00:04:00.300
So that is what I am comparing; this is my null hypothesis in some ways.
00:04:00.300 --> 00:04:15.500
In our test of homogeneity we have 2 samples that come from 2 unknown populations, and we want to know
00:04:15.500 --> 00:04:27.500
whether these have similar proportions to each other, and so that is going to be our null hypothesis: whether these have the same proportions or different ones.
00:04:27.500 --> 00:04:34.800
The null hypothesis is similar proportions.
00:04:34.800 --> 00:04:45.500
And so in that way I hope you can see that goodness of fit and homogeneity are ideas that we have looked at before:
00:04:45.500 --> 00:04:53.900
comparing one sample to a hypothesized population or comparing two samples to each other but we have looked at it before
00:04:53.900 --> 00:04:57.600
not with proportion but with means, right?
00:04:57.600 --> 00:05:05.500
And now we are looking at it with proportions. Okay, since we are looking at proportions, we should have hypotheses about
00:05:05.500 --> 00:05:13.500
proportions, so the null hypothesis would be something like this: the proportion that
00:05:13.500 --> 00:05:22.000
falls into each category is the same for each population. So however many categories you have, let us say we have
00:05:22.000 --> 00:05:41.000
three categories.
00:05:41.000 --> 00:05:47.400
If we believe that they are the same, then they should roughly have the same proportions, so these have similar proportions.
00:05:47.400 --> 00:06:00.500
It does not actually matter what the proportions are: it could be 90, 10, it could be 10, 10, it could be 75, 20. Whatever the proportions are,
00:06:00.500 --> 00:06:07.700
we think they are similar for each population, and whatever category is 75% of one population,
00:06:07.700 --> 00:06:11.400
that category will also be 75% of the other population.
00:06:11.400 --> 00:06:22.800
The alternative hypothesis says that for at least one category the populations do not have the same proportion so just like before
00:06:22.800 --> 00:06:34.000
we are now talking about differences, but the differences are really in the populations' proportions.
00:06:34.000 --> 00:06:35.800
So just to give you an example.
00:06:35.800 --> 00:06:41.300
Here is the problem and let us try to change it into the null hypothesis as well as alternative hypothesis.
00:06:41.300 --> 00:06:49.900
So according to a poll, 406 Democrats said they were very satisfied with candidate A while 510 were unsatisfied;
00:06:49.900 --> 00:06:55.500
however, 910 Republicans were satisfied with candidate A while 60 were not.
00:06:55.500 --> 00:07:07.400
And in a chi-square test of homogeneity we could see whether the proportions of Democrats who were satisfied versus unsatisfied are
00:07:07.400 --> 00:07:15.800
similar to the proportions of Republicans who were satisfied versus unsatisfied.
00:07:15.800 --> 00:07:19.600
So let us draw this out first.
00:07:19.600 --> 00:07:31.600
So here we have about 400 Democrats saying they are satisfied while about 500 say they are unsatisfied.
00:07:31.600 --> 00:07:40.100
Let us put satisfied in blue, and so that is a little bit less than half, and the unsatisfied people are a little bit
00:07:40.100 --> 00:07:47.300
more than half, so this is what the Democratic population looks like.
00:07:47.300 --> 00:07:59.400
The Republican population looks very different so here we see most of the Republicans being pretty satisfied and
00:07:59.400 --> 00:08:03.600
only a very small minority being unsatisfied right.
00:08:03.600 --> 00:08:12.900
And so the question is, are the two similar? Are the proportions that fall into each category,
00:08:12.900 --> 00:08:16.700
satisfied or unsatisfied the same for each population?
00:08:16.700 --> 00:08:18.100
Are they different?
00:08:18.100 --> 00:08:21.000
The null hypothesis would probably say something like this.
00:08:21.000 --> 00:08:51.600
The proportions of satisfied and unsatisfied people are the same for Democrats as well as Republicans.
00:08:51.600 --> 00:09:21.300
The alternative hypothesis says that for at least one category, either satisfied or unsatisfied, Democrats and Republicans do not have the same proportion.
00:09:21.300 --> 00:09:43.700
Okay, so note that in the case of 2 categories, once the proportion of one category changes, the other one automatically changes.
00:09:43.700 --> 00:09:51.900
So if we somehow were able to change how satisfied the Democrats were with candidate A, we would also see the
00:09:51.900 --> 00:09:55.300
proportion of unsatisfied people just automatically change.
00:09:55.300 --> 00:10:06.000
So that is in the case of two categories but in the case of multiple categories maybe 2 might change but the others may
00:10:06.000 --> 00:10:11.700
not change right so in that way this would be a more general way of saying alternative hypothesis.
00:10:11.700 --> 00:10:16.200
Now let us talk about the chi-square statistic.
00:10:16.200 --> 00:10:20.900
Now the nice thing about the chi-square statistic is that it is the same as the goodness of fit test.
00:10:20.900 --> 00:10:31.400
We use the same idea, so chi-square is going to be the observed frequencies, and the difference between that and
00:10:31.400 --> 00:10:38.600
the expected frequencies, squared, as a proportion of the expected frequency.
00:10:38.600 --> 00:10:43.200
But there is just one subtle difference. Before, it was for each category.
00:10:43.200 --> 00:10:51.400
Now we have different categories in different populations, right? So we not only have category 1 and category 2,
00:10:51.400 --> 00:10:59.300
category 3 so on and so forth but we also have population 1 and population 2 at least right?
00:10:59.300 --> 00:11:08.400
And so we have multiple sets of observed frequencies, and so what do we do, right?
00:11:08.400 --> 00:11:21.400
Well, what we do here is that we consider each of these combinations of which population you are in and which category
00:11:21.400 --> 00:11:26.000
you are talking about; each of these is going to be called a cell.
00:11:26.000 --> 00:11:34.200
And so we do this for each cell, so i will go from one up to the number of cells.
00:11:34.200 --> 00:11:41.100
And how do we get the number of cells?
00:11:41.100 --> 00:11:57.900
Well, the number of cells is really how many populations, right, and that is usually shown in columns, times how many categories.
00:11:57.900 --> 00:12:12.800
And that is usually shown in rows, you can also think of the number of cells as columns times rows, how many columns you have times the number of rows.
00:12:12.800 --> 00:12:19.500
But really the idea comes from how many different populations you are comparing, since a chi-square test of homogeneity
00:12:19.500 --> 00:12:27.200
can actually compare three or four populations, not just 2, and how many categories you are comparing.
00:12:27.200 --> 00:12:38.300
So in order to use the chi-square formula, it is often helpful to set up your data in a particular way.
00:12:38.300 --> 00:12:44.600
Often these formulas will refer to rows and columns, and so you really need to have the right data in
00:12:44.600 --> 00:12:49.300
the rows and the right data in the columns in order for any of these formulas to be used correctly.
00:12:49.300 --> 00:12:52.400
So how do you set up your data in this way?
00:12:52.400 --> 00:13:00.100
Whatever your sample one is, you want to put all of the information for sample one into a column, right? So
00:13:00.100 --> 00:13:08.000
here I put sample 1 as the generic sample one; it could be college freshmen or Democrats or mice that got a certain
00:13:08.000 --> 00:13:17.900
drug, whatever it is, that is sample one, and these are the people in sample 1 who fell into category one.
00:13:17.900 --> 00:13:24.200
These are the people in sample 1 who fell in to category two and these are called cells.
00:13:24.200 --> 00:13:34.500
When you add these frequencies up you should get the total number of people in sample 1, right? So in that way all
00:13:34.500 --> 00:13:37.900
the information from sample one is in a column.
00:13:37.900 --> 00:13:43.000
Same thing with sample 2 all the information from sample 2 should be in a column.
00:13:43.000 --> 00:13:50.500
This should be the entire sample broken up into those that fell into category 1 versus category two and then the
00:13:50.500 --> 00:13:56.000
total gives you the total number of cases in sample 2.
00:13:56.000 --> 00:14:03.300
If you had sample three and four they would follow that same pattern and all the information should be in one column.
00:14:03.300 --> 00:14:14.900
On the flip side, when you look at rows you should be able to count up how many cases were in category one.
00:14:14.900 --> 00:14:28.300
And so if you count them up this way, this is not a sample; it is just how many cases in the entire data set that you are looking at
00:14:28.300 --> 00:14:37.800
are in category 1 and if you look across here this is how many cases in the entire data set fall into category 2
00:14:37.800 --> 00:14:49.500
and finally if you look at this total of totals what you should get is that is the entire data set all added up.
00:14:49.500 --> 00:14:55.800
So let us try that here with the Democrats and Republican example.
00:14:55.800 --> 00:15:09.700
So I am going to put Democrats up here, Republicans up here, satisfied and unsatisfied here, and all I need to do is make
00:15:09.700 --> 00:15:16.600
sure I find the correct information and put it into the correct cells.
00:15:16.600 --> 00:15:20.700
910 are satisfied 60 are not.
00:15:20.700 --> 00:15:29.500
When I add this up I should be able to get the number of how many Democrats total that are in the sample so this
00:15:29.500 --> 00:15:39.900
is 916; for Republicans this is 970, so we have slightly more people in our Republican sample than our Democrat sample, and that is fine.
00:15:39.900 --> 00:15:48.000
If I add the rows up like this if I get the row totals what I should get is just a number of satisfied people.
00:15:48.000 --> 00:15:58.700
It does not matter whether they are Democrats or Republicans, so we should get 1316, and this should be 570.
00:15:58.700 --> 00:16:07.100
And if I add these two, it should equal these 2 added up, because we are just adding these four numbers up
00:16:07.100 --> 00:16:13.600
in a different order so that should be 1886.
00:16:13.600 --> 00:16:29.700
So we have 1886 in our total data set across both samples, and we know how many people were satisfied, how many
00:16:29.700 --> 00:16:38.200
people were unsatisfied; we also know how many Democrats we had, how many Republicans we had, and all the different combinations, right?
00:16:38.200 --> 00:16:43.000
Democrats satisfied, Democrats unsatisfied, Republicans satisfied, Republicans unsatisfied.
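The table setup just described can be sketched in a few lines of Python. This is only an illustration, not the spreadsheet used in the lesson; the variable names (`observed`, `row_totals`, `col_totals`, `grand_total`) are made up for this sketch, and the counts are the poll numbers from the example.

```python
# Observed counts from the poll, laid out as in the lesson:
# one column per sample (population), one row per category.
observed = {
    "satisfied":   {"Democrats": 406, "Republicans": 910},
    "unsatisfied": {"Democrats": 510, "Republicans": 60},
}
samples = ["Democrats", "Republicans"]

# Row totals: how many cases fall in each category, regardless of party.
row_totals = {cat: sum(cells.values()) for cat, cells in observed.items()}

# Column totals: the size of each sample.
col_totals = {s: sum(observed[cat][s] for cat in observed) for s in samples}

# Grand total: the "total of totals" for the whole data set.
grand_total = sum(row_totals.values())

print(row_totals)   # {'satisfied': 1316, 'unsatisfied': 570}
print(col_totals)   # {'Democrats': 916, 'Republicans': 970}
print(grand_total)  # 1886
```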
00:16:43.000 --> 00:16:49.300
So this is a great way to set up your data that really can help you figure out expected frequency which is a
00:16:49.300 --> 00:16:52.600
little bit more complicated to figure out in the test of homogeneity.
00:16:52.600 --> 00:16:57.300
Not much more complicated, just a little bit more.
00:16:57.300 --> 00:17:05.700
So here is how we can figure out expected frequency so once you have it set up in this way Democrats Republicans
00:17:05.700 --> 00:17:13.500
satisfied unsatisfied, once you have it set up in this way here is the formula used for expected frequency.
00:17:13.500 --> 00:17:21.800
So E is going to equal basically the proportion of people who are in one particular category.
00:17:21.800 --> 00:17:26.100
So I just want to know how many people tend to be satisfied.
00:17:26.100 --> 00:17:33.000
I do not care whether they are a Democrat or a Republican, just in general who is satisfied, right? So that would be the row
00:17:33.000 --> 00:17:45.200
total, right? So the row total over the grand total, this one right here.
00:17:45.200 --> 00:17:57.100
This will give me the rate, the proportion, just the general rate of who tends to be satisfied.
00:17:57.100 --> 00:18:04.300
Do 70% tend to be satisfied, 20%, 95%?
00:18:04.300 --> 00:18:12.100
What is the general rate and I am going to multiply that by the total number of the sample that I am interested in
00:18:12.100 --> 00:18:16.800
so maybe I am interested in the Democratic sample so I would get the column totals.
00:18:16.800 --> 00:18:27.100
So that is the general formula; let me show you this in a more specific way, so let us talk about the expected value of
00:18:27.100 --> 00:18:30.200
Democrats who are satisfied.
00:18:30.200 --> 00:18:45.200
Right so that would be the satisfied total over the grand total so this gives us the rates of being satisfied just
00:18:45.200 --> 00:18:52.000
in general what proportion of the entire data set is satisfied and then I am going to multiply that by however
00:18:52.000 --> 00:18:57.400
many Democrats I have so Democrat total.
00:18:57.400 --> 00:19:06.300
So I could write it in this way, but it ends up that the formula is just a more general way of saying this example.
00:19:06.300 --> 00:19:11.200
So when I say Democrat total, it is the same thing as the column total.
00:19:11.200 --> 00:19:22.300
And when I say row total, it is really the same thing as the satisfied total, and the grand total is the total number in our data set.
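As a small sketch of that formula, assuming the row total, column total, and grand total are already known, the expected frequency for one cell could be computed like this. The function name `expected_frequency` is just an illustrative choice.

```python
def expected_frequency(row_total, col_total, grand_total):
    """Expected count for one cell: overall category rate times sample size."""
    general_rate = row_total / grand_total  # e.g. rate of being satisfied
    return general_rate * col_total         # scaled to one sample's size


# Democrats who are satisfied:
# (satisfied total / grand total) * Democrat total
e_dem_satisfied = expected_frequency(row_total=1316, col_total=916, grand_total=1886)
print(round(e_dem_satisfied, 2))  # 639.16
```

So if Democrats were satisfied at the same general rate as everyone in the data set, we would expect about 639 satisfied Democrats, compared with the 406 actually observed.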
00:19:22.300 --> 00:19:25.200
Democrats Republicans.
00:19:25.200 --> 00:19:36.100
So now, once you have the expected values and you have the observed frequencies, you can easily find chi-square.
00:19:36.100 --> 00:19:42.000
Once you get your chi-square how do you compare it to the chi-square distribution?
00:19:42.000 --> 00:19:49.900
Well, the nice thing is the chi-square distribution looks the same as in the goodness of fit test,
00:19:49.900 --> 00:20:01.700
and so chi-square has a wall at zero, it cannot be lower than zero, and it has a long positive tail, and you decide how much
00:20:01.700 --> 00:20:08.600
your α is, and that is what it is going to look like; α is always one tailed in a chi-square distribution.
00:20:08.600 --> 00:20:17.100
But the question is how to find degrees of freedom now that we have rows and columns?
00:20:17.100 --> 00:20:29.200
Well, the degrees of freedom is really going to be the degrees of freedom for categories times the degrees of freedom for
00:20:29.200 --> 00:20:38.000
however many populations, or samples that represent your populations, you have, and that is going to be the number of rows
00:20:38.000 --> 00:20:48.100
minus 1, right, because each category is in a row, times the number of columns you have minus 1. So that is how you find your degrees of freedom
00:20:48.100 --> 00:20:51.600
when you have more than one population that you are comparing.
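The rule just stated, (rows - 1) times (columns - 1), is short enough to write as a one-line helper. The function name is an assumption made for this sketch.

```python
def degrees_of_freedom(n_categories, n_populations):
    """(rows - 1) * (columns - 1), with categories in rows, populations in columns."""
    return (n_categories - 1) * (n_populations - 1)


print(degrees_of_freedom(2, 2))  # 1: satisfied/unsatisfied, Democrats/Republicans
print(degrees_of_freedom(3, 2))  # 2: three categories, two populations
```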
00:20:51.600 --> 00:20:58.100
So what are the conditions for the test of homogeneity?
00:20:58.100 --> 00:21:05.400
These conditions are going to be very similar to the conditions for our goodness of fit testing, so the first thing is
00:21:05.400 --> 00:21:14.800
each outcome of each population falls into exactly one of a fixed number of categories.
00:21:14.800 --> 00:21:20.300
The categories are mutually exclusive just like before: you have to be in one or the other, you cannot be in 2 categories
00:21:20.300 --> 00:21:28.100
at the same time, and you cannot opt out of being in a category. Also, the category choices must be the same for all populations.
00:21:28.100 --> 00:21:38.100
So if population one has three choices, the same three choices must be the case for population 2.
00:21:38.100 --> 00:21:49.700
The 2nd requirement or condition is that you must have independent and random samples. Before, in tests of goodness of fit,
00:21:49.700 --> 00:21:55.200
we only had the requirement that the sample had to be random, because we only had one sample.
00:21:55.200 --> 00:22:04.700
Now we have multiple samples and they must be independent of each other; they cannot come from the same pool.
00:22:04.700 --> 00:22:17.100
So the third condition: the expected frequency in each cell is five or greater, and that is the same condition that we had
00:22:17.100 --> 00:22:23.900
for goodness of fit testing; it is because you want a big enough sample as well as a big enough proportion.
00:22:23.900 --> 00:22:34.800
And number four is not really a condition; it is just so that you know how free you are with chi-square testing. You can have
00:22:34.800 --> 00:22:43.700
more than two categories and you can have more than two populations. You could have 4 categories and six populations, so you
00:22:43.700 --> 00:22:50.800
could have a whole bunch of these different combinations; you are not restricted to 2 categories and 2 populations.
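Of these conditions, the third one (expected frequency of five or greater in every cell) is the one that can be checked mechanically. Here is a minimal sketch, assuming the data is given as a list of rows with one row per category and one column per population; the helper names are made up for illustration. The other conditions, random and independent samples and mutually exclusive categories, depend on how the data was collected and cannot be checked from the counts alone.

```python
def expected_counts(observed):
    """Expected frequency for every cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def meets_expected_count_condition(observed, minimum=5):
    """True when every expected frequency is at least `minimum`."""
    return all(e >= minimum for row in expected_counts(observed) for e in row)


# Poll example: rows are satisfied / unsatisfied, columns are Democrats / Republicans.
poll = [[406, 910], [510, 60]]
print(meets_expected_count_condition(poll))  # True

# A made-up table with very few cases in the first category fails the check.
tiny = [[1, 2], [3, 40]]
print(meets_expected_count_condition(tiny))  # False
```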
00:22:50.800 --> 00:22:56.100
So now let us go on to some examples.
00:22:56.100 --> 00:23:03.400
Example 1 is just the example we have been using to talk about how to find how to set up your data and how to find
00:23:03.400 --> 00:23:17.000
expected values so I set this up in an Excel file this is just exactly the same way we set it up previously I just found
00:23:17.000 --> 00:23:20.500
the row totals as well as the column totals.
00:23:20.500 --> 00:23:26.500
And now I can start my hypothesis testing, so first things first.
00:23:26.500 --> 00:23:45.400
Step one: our null hypothesis should say something like this, that the proportions of satisfied and unsatisfied people
00:23:45.400 --> 00:23:55.100
for Democrats should be the same as for Republicans, so the proportions of category one and two, of satisfied and
00:23:55.100 --> 00:24:05.900
unsatisfied voters, should be similar for Democrats and Republicans.
00:24:05.900 --> 00:24:20.800
So the alternative hypothesis is that at least one of those proportions will be different between Democrats and Republicans.
00:24:20.800 --> 00:24:35.800
Step two: let us just set our α to be .05, and we know that because we are doing chi-square hypothesis testing it is one
00:24:35.800 --> 00:24:45.000
tailed. Step three: you might want to draw a chi-square distribution for yourself, or just in your head, and sort of
00:24:45.000 --> 00:24:47.800
color in that α part and try to think.
00:24:47.800 --> 00:24:53.600
I want to find my critical chi-square.
00:24:53.600 --> 00:24:59.100
In order to find the critical chi-square I need to find the degrees of freedom.
00:24:59.100 --> 00:25:08.800
And my degrees of freedom is going to be made up of the degrees of freedom for categories as well as the degrees of
00:25:08.800 --> 00:25:21.300
freedom for populations. There are two populations, so it is 2 - 1, and you could also see that as the columns: 2 columns - 1.
00:25:21.300 --> 00:25:34.800
And the degrees of freedom for the number of categories is, with two categories, that is satisfied and unsatisfied, 2 - 1,
00:25:34.800 --> 00:25:45.100
and that corresponds perfectly to number of rows - 1, and so the degrees of freedom here is going to be
00:25:45.100 --> 00:25:53.600
this times this, so degrees of freedom for category times degrees of freedom for population, and that is just one.
00:25:53.600 --> 00:26:03.800
So what is our critical chi-square? That is going to be found by CHIINV: we put in our probability as well as
00:26:03.800 --> 00:26:11.000
our degrees of freedom, and we find 3.84 is our critical chi-square.
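The lesson gets this critical value from the spreadsheet. As a rough cross-check without a spreadsheet, here is a sketch that works only for the special case of one degree of freedom: a chi-square variable with 1 degree of freedom is the square of a standard normal variable, so the upper-tail critical value is the square of the two-tailed z cutoff. This identity is being swapped in for illustration and is not the method used in the lesson; the function name is an assumption, and for other degrees of freedom a statistics library would be needed.

```python
from statistics import NormalDist


def critical_chi_square_df1(alpha):
    """Upper-tail chi-square critical value, valid for 1 degree of freedom only."""
    # P(Z**2 > c) = alpha  is the same as  P(|Z| > sqrt(c)) = alpha,
    # so sqrt(c) is the z value that leaves alpha/2 in each tail.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z ** 2


print(round(critical_chi_square_df1(0.05), 2))  # 3.84
```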
00:26:11.000 --> 00:26:24.400
So we are looking for sample chi-squares that are larger than 3.84.
00:26:24.400 --> 00:26:35.500
Step four looks something like this: in order to find your sample chi-square, what we need to do first is find our
00:26:35.500 --> 00:26:49.500
expected values, so here we have observed frequencies and what we need to do is find expected frequencies.
00:26:49.500 --> 00:27:03.000
So I am just going to copy and paste this down here so we do not have to keep scrolling, and so I am going to draw
00:27:03.000 --> 00:27:15.200
a table here for observed frequencies and create the same table for expected frequencies.
00:27:15.200 --> 00:27:31.600
Okay so when I look at my expected frequency I need to find out what is the general rate and then multiply it by
00:27:31.600 --> 00:27:50.000
however many people we have in that sample. So the general rate of being satisfied is 1316÷1886,
00:27:50.000 --> 00:27:54.300
so that is the general rate, and that is about 70%.
00:27:54.300 --> 00:28:00.200
Take that and multiply that by the total number of Democrats.
00:28:00.200 --> 00:28:17.600
Now this part I want to keep the same, and I want to keep it in the same column, so I am going to put a $ in front of the D
00:28:17.600 --> 00:28:28.000
to lock down that column, and here I am going to put a $ in front of both the D and the 21 in order to lock down this actual cell.
00:28:28.000 --> 00:28:35.900
Because here is what I am going to do: I am going to actually copy and paste that over here, and if you look at this, then what I am doing
00:28:35.900 --> 00:28:46.000
is I have this same rate again, the rate of being satisfied, but now it is multiplied by the total number of Republicans.
00:28:46.000 --> 00:28:57.600
And I am going to take that cell, copy and paste it down here, and here I see that now I have the rate of being
00:28:57.600 --> 00:29:10.100
unsatisfied, and I need to change this to that, and here I have the rate of being unsatisfied and then
00:29:10.100 --> 00:29:16.200
multiplied by the total number of Republicans, so these are my expected frequencies.
00:29:16.200 --> 00:29:26.300
Notice that the totals still add up to be the same, right, and usually they should; there might be some slight discrepancies,
00:29:26.300 --> 00:29:31.400
but that will just be because of rounding error so they should still be pretty close.
00:29:31.400 --> 00:29:38.900
So now we have observed frequencies as well as expected frequencies and now we need to figure out my chi-square.
00:29:38.900 --> 00:29:50.100
My chi-square is going to be made up of observed frequency minus expected frequency squared divided by expected frequency.
00:29:50.100 --> 00:30:08.400
And I am going to need to find that for Democrat and Republican as well as satisfied and unsatisfied, and then add up all of these cells.
00:30:08.400 --> 00:30:13.100
So I will say grand total and I will put that over here.
00:30:13.100 --> 00:30:29.500
Okay so let us find the observed frequency minus the expected frequency squared divided by expected frequency.
00:30:29.500 --> 00:30:40.800
And I could just copy and paste that here because Excel will just move everything down and I can take this over here because Excel
00:30:40.800 --> 00:30:43.500
will move everything over to the right.
00:30:43.500 --> 00:31:15.800
And the grand total for all four of these is going to be 547.18 and so my sample chi-square is quite large.
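The spreadsheet steps for this example can be mirrored in a short Python sketch: build the expected frequency for each cell from the row, column, and grand totals, then add up (observed - expected) squared over expected across all cells. The function name and the list-of-rows layout are choices made for this illustration.

```python
def chi_square_statistic(observed):
    """Sum of (O - E)**2 / E over every cell of a rows-by-columns table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected frequency: general rate for this row times the column total.
            e = row_totals[i] / grand_total * col_totals[j]
            statistic += (o - e) ** 2 / e
    return statistic


# Rows: satisfied, unsatisfied. Columns: Democrats, Republicans.
poll = [[406, 910], [510, 60]]
sample_chi_square = chi_square_statistic(poll)
print(round(sample_chi_square, 2))
```

Unrounded, this comes out at about 547.19, in line with the 547.18 shown in the spreadsheet, and far above the critical value of 3.84.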
00:31:15.800 --> 00:31:21.000
And so do I reject my null hypothesis?
00:31:21.000 --> 00:31:38.400
Indeed I do, and we can find the P value, so here I will put CHIDIST in order to find my probability.
00:31:38.400 --> 00:31:51.600
Here it is, degrees of freedom is going to be one, and that is a very very very small P value, so those are pretty radically
00:31:51.600 --> 00:31:56.700
different populations that we see there.
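To get a feel for how small that P value is without a spreadsheet, here is a sketch that again relies on the one-degree-of-freedom special case: because chi-square with 1 degree of freedom is a squared standard normal, its upper-tail probability can be written with the complementary error function. This is a substitute for the spreadsheet function used in the lesson, not the same tool, and the function name is made up.

```python
import math


def p_value_df1(chi_square):
    """Upper-tail probability for a chi-square value with 1 degree of freedom."""
    # P(Z**2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)) for standard normal Z.
    return math.erfc(math.sqrt(chi_square / 2))


print(p_value_df1(3.84))    # close to 0.05, matching the critical value
print(p_value_df1(547.18))  # vanishingly small
```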
00:31:56.700 --> 00:32:12.800
And that is step five. Example 2.
00:32:12.800 --> 00:32:17.400
Consider this data on pesticide residue on domestic and imported fruits.
00:32:17.400 --> 00:32:24.200
Does this data fit the conditions of a chi-square test of homogeneity? Regardless of your answer, conduct a hypothesis test.
00:32:24.200 --> 00:32:36.600
Now be careful here: although you see columns and rows, these are not the columns and rows you should be using. The columns are
00:32:36.600 --> 00:32:44.000
actually okay: domestic fruits, imported fruits, we could consider those two to be the different populations that we are interested in.
00:32:44.000 --> 00:32:54.700
But the rows actually do not show the different categories; they show things such as sample size, percentage showing no residue, and percentage showing residue in violation, right?
00:32:54.700 --> 00:33:03.300
So what we should do is we should actually transform this data into sort of the correct setup.
00:33:03.300 --> 00:33:24.900
So here you could just pull up a brand-new Excel file; I am just going to use the bottom portion here, and here is what we want.
00:33:24.900 --> 00:33:48.400
We would like it to be set up so that we have the two populations up here and we have the different categories here,
00:33:48.400 --> 00:33:55.600
so the categories are probably going to be showing no residue and showing residue in violation, but one of the things I
00:33:55.600 --> 00:34:02.300
noticed is that these percentages do not add up to 100, so there must be some other category that we are missing.
00:34:02.300 --> 00:34:13.900
So no residue, showing residue in violation of the law, so I guess that is really bad, and maybe there is just one
00:34:13.900 --> 00:34:23.600
where it is residue but not in violation, and you sort of have to figure that out from the data that they have given you.
00:34:23.600 --> 00:34:33.400
But they do give you the sample size 344 as well as 1136 so this is the total.
00:34:33.400 --> 00:34:38.800
The question is what are our observed values?
00:34:38.800 --> 00:34:58.500
In order to find the observed values all we have to do is multiply by the proportion, so 44.2% times the total.
00:34:58.500 --> 00:35:15.800
Here I lock down that row. Now for residue in violation, what I have to do is change this percentage, so the percentage is .9%.
00:35:15.800 --> 00:35:26.700
So that is .009, so that is .9%.
00:35:26.700 --> 00:35:31.500
And so what is sort of left over?
00:35:31.500 --> 00:35:43.000
Well, the leftover percentage is 1 - (.442 + .009), right, so that is sort of everybody else, and that is I guess the
00:35:43.000 --> 00:35:54.500
number of fruits that are not in violation but still have some residue on them, some pesticide residue times this.
00:35:54.500 --> 00:36:09.200
And so when I add them all up I could check and that is 344 so I have done my proportions correctly.
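Here is a small sketch of that conversion from percentages to observed counts for the domestic sample, using the numbers read out above. The category label "residue, not in violation" is the leftover category inferred in the lesson, and the variable names are made up for illustration.

```python
domestic_total = 344
pct_no_residue = 0.442   # 44.2% showing no residue
pct_violation = 0.009    # 0.9% showing residue in violation

# Whatever is left over: residue present, but not in violation.
pct_other = 1 - (pct_no_residue + pct_violation)

domestic_counts = {
    "no residue": pct_no_residue * domestic_total,
    "residue, not in violation": pct_other * domestic_total,
    "residue in violation": pct_violation * domestic_total,
}

for label, count in domestic_counts.items():
    print(f"{label}: {count:.1f}")

# The three counts should add back up to the sample size.
print(f"total: {sum(domestic_counts.values()):.1f}")
```

Because the counts come from rounded percentages, they are not whole numbers; the violation cell works out to about 3.1 fruits.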
00:36:09.200 --> 00:36:16.000
Now right away we can see that we are actually not meeting the conditions for chi-square.
00:36:16.000 --> 00:36:28.300
If you look at this cell right here, it only has three fruits in it; even if we round up generously it is 3.1, right?
00:36:28.300 --> 00:36:30.800
So there are only three fruits.
00:36:30.800 --> 00:36:40.300
Remember, expected frequencies have to be at least 5, and here the observed value is pretty small.
00:36:40.300 --> 00:36:49.900
Okay, but it said to go ahead into hypothesis testing anyway. You should not do this in real life, but
00:36:49.900 --> 00:36:52.600
for the purpose of this exercise let us do it.
00:36:52.600 --> 00:37:01.800
So now let us find the proportion of imported fruits that are observed to have no residue on them.
00:37:01.800 --> 00:37:12.400
So that is 70.4% times this total, and that is almost 800 fruits.
00:37:12.400 --> 00:37:29.200
Also we have those that have residue in violation, .036, that is 3.6% times 1136, about 41 fruits, and then
00:37:29.200 --> 00:37:41.700
I need the leftover percentage, so that is 1 - (70.4% + 3.6%).
00:37:41.700 --> 00:37:47.800
That percentage times the total.
00:37:47.800 --> 00:37:51.900
And that is 295 right?
00:37:51.900 --> 00:37:59.800
So first notice that it seems like there are way more of these imported fruits than domestic fruits, but that is because the
00:37:59.800 --> 00:38:09.100
totals are different so it does not necessarily mean that imported fruits they have so much residue on them,
00:38:09.100 --> 00:38:18.700
that is not necessarily what it means, but that is hard to compare because they have totally different totals.
00:38:18.700 --> 00:38:28.000
So it is helpful to find the row totals as well because that can help us find expected value expected frequency
00:38:28.000 --> 00:38:49.200
and so that is adding these rows together and we have a total of 1480 fruits Domestic and imported altogether.
00:38:49.200 --> 00:39:06.400
Once we have that then it would be easy for us to find expected frequency and expected frequency we could basically set up in a very similar way.
00:39:06.400 --> 00:39:22.300
So what is our expected frequency?
00:39:22.300 --> 00:39:30.300
Well, expected frequency starts with the general rate, the proportion of no residue over all the fruits, right?
00:39:30.300 --> 00:39:42.900
So that will be this row total divided by the grand total, that is the general rate, and we want to lock down this row
00:39:42.900 --> 00:40:00.900
because we want to lock those two values down, because that is always going to be the rate for no residue,
00:40:00.900 --> 00:40:09.800
times the actual number of domestic fruits.
00:40:09.800 --> 00:40:28.600
So we get 221, and here we do the same thing: I just copy and paste across, and Excel will just naturally figure out what to do.
00:40:28.600 --> 00:40:35.800
So this is the rate of no residue over total fruits times the total number of imported fruits.
00:40:35.800 --> 00:40:49.600
Then we find the rate of fruits that have residue but are not in violation, which is this total over the grand total.
00:40:49.600 --> 00:41:06.700
And then I am going to lock down those values and then I am going to multiply that by the total number of domestic fruits.
00:41:06.700 --> 00:41:16.100
And then if I copy that over, that should give me the expected value of imported fruits given this proportion.
00:41:16.100 --> 00:41:29.500
And finally the proportion of fruits with residue in violation so a lot of pesticide residue that would be this total
00:41:29.500 --> 00:41:41.900
divided by the grand total times the total.
00:41:41.900 --> 00:41:54.900
And here what we can see is if we sum these three expected frequencies together we should get something similar to 344.
00:41:54.900 --> 00:42:02.000
And indeed we do, and here it should be 1136, and indeed it is. Great.
00:42:02.000 --> 00:42:10.100
So once we have our table of observed frequencies as well as expected frequencies now we can start to calculate
00:42:10.100 --> 00:42:20.100
for each cell the observed frequency minus expected frequency, squared, as a proportion of expected frequency.
00:42:20.100 --> 00:42:37.700
So O minus E squared as a proportion of expected frequency. I will copy these cell labels, so observed frequency
00:42:37.700 --> 00:42:53.200
minus expected frequency, squared, divided by expected frequency, and just copy and paste all that. Let us check one of these.
00:42:53.200 --> 00:43:01.300
This one says that observed frequency minus expected frequency squared over expected frequency.
00:43:01.300 --> 00:43:18.700
And when we add all of these up we get 102. But we have forgotten something: we forgot to do the decision stage,
00:43:18.700 --> 00:43:21.200
so let us go ahead and do step three.
00:43:21.200 --> 00:43:30.000
So the decision stage will be our critical chi-square, and our critical chi-square is found with the degrees of freedom
00:43:30.000 --> 00:43:48.300
of the categories times the degrees of freedom of the populations multiplied together; those are the degrees of freedom for the chi-square.
00:43:48.300 --> 00:44:06.200
So categories - 1 is 2, populations - 1 is 1, so the degrees of freedom is just 2, and our critical chi-square is CHIINV.
00:44:06.200 --> 00:44:13.000
Put in .05 as our desired probability, our degrees of freedom equals 2 and we get 5.99.
00:44:13.000 --> 00:51:36.000
We see that our chi-square is much larger than that so we would reject our null.
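Example 2 can be retraced end to end in a short sketch. Everything here is an illustration under stated assumptions: the observed counts are rebuilt from the rounded percentages given in the problem, the leftover category is the one inferred in the lesson, and the critical value uses a shortcut that holds only for exactly 2 degrees of freedom, where the upper-tail probability is exp(-x/2), so the cutoff is -2 ln(α). The lesson itself uses the spreadsheet for this.

```python
import math

samples = ["domestic", "imported"]
totals = {"domestic": 344, "imported": 1136}
pct_no_residue = {"domestic": 0.442, "imported": 0.704}
pct_violation = {"domestic": 0.009, "imported": 0.036}

# Rows are categories, columns are populations (domestic, imported).
observed = [
    [pct_no_residue[s] * totals[s] for s in samples],
    [(1 - pct_no_residue[s] - pct_violation[s]) * totals[s] for s in samples],
    [pct_violation[s] * totals[s] for s in samples],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum (O - E)**2 / E over all six cells.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] / grand_total * col_totals[j]
        chi_square += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(samples) - 1)

# Shortcut valid only when df == 2: P(X > x) = exp(-x / 2).
alpha = 0.05
critical = -2 * math.log(alpha)

print(f"chi-square = {chi_square:.1f}, df = {df}, critical = {critical:.2f}")
print("reject null" if chi_square > critical else "fail to reject null")
```

The statistic comes out at roughly 102 against a critical value of about 5.99, which agrees with the decision reached above.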