For more information, please see full course syllabus of Statistics

### Chi-Square Test of Homogeneity

- Intro 0:00
- Roadmap 0:09
  - Roadmap
- Goodness-of-Fit vs. Homogeneity 1:13
  - Goodness-of-Fit HT
  - Homogeneity
  - Analogy
- Hypotheses About Proportions 5:00
  - Null Hypothesis
  - Alternative Hypothesis
  - Example
- Chi-Square Statistic 10:12
  - Same as Goodness-of-Fit Test
- Set Up Data 12:28
  - Setting Up Data Example
- Expected Frequency 16:53
  - Expected Frequency
- Chi-Square Distributions & df 19:26
  - Chi-Square Distributions & df
- Conditions for Test of Homogeneity 20:54
  - Condition 1
  - Condition 2
  - Condition 3
  - Condition 4
- Example 1: Chi-Square Test of Homogeneity 22:52
- Example 2: Chi-Square Test of Homogeneity 32:10

### General Statistics Online Course

### Transcription: Chi-Square Test of Homogeneity

*Hi, welcome to educator.com. *0002

*We are going to talk about the chi-square test of homogeneity. *0002

*Previously we talked about the chi-square goodness-of-fit test; now we are going to contrast that with this new test. It is still a chi-square test, but it is a test of homogeneity now. *0005

*We are going to try and figure out when to use which test. *0022

*With this test we are testing a new idea: we are not testing goodness of fit, we are actually testing homogeneity, or similarity. *0027

*We actually have slightly different null and alternative hypotheses. *0035

*We are going to talk about how those have changed, then we are going to go over the chi-square statistic; finding the expected values is going to be a little bit different in the test of homogeneity. *0041

*Finally, we are going to go through chi-square distributions as well as degrees of freedom and the conditions for the test of homogeneity, *0055

*that is, when you can actually conduct this test so that it is statistically legal. *0065

*Okay so the first thing is what is the difference between the test of homogeneity and test of goodness of fit? *0069

*Well, in goodness-of-fit hypothesis testing we wanted to determine whether sample proportions are very different from hypothesized *0082

*population proportions. One way you could think about this is that you have one sample and you are comparing it to some hypothetical population. *0089

*That is why it is called goodness of fit: it is about how well these two things fit together. *0098

*How well does the sample fit with the hypothesized proportions? *0108

*In the test of homogeneity, homogeneous means similar, right, that they are made up of the same stuff. *0112

*In the test of homogeneity we want to determine whether 2 populations that are sorted into categories share the same proportions or not. *0120

*And here you could also substitute the word population, because ultimately we are using the sample as a proxy for the population. *0130

*So here we have 2 populations and we want to know whether those two populations are similar in their proportions or not. *0142

*Right, we are not comparing them to some hypothesized population; we are comparing them to each other. *0152

*And so you can think of their relationship by using an analogy to the *0159

*one-sample versus the independent-samples t-test. *0167

*In the one-sample t-test we had one sample and we compared it to the null hypothesis, right? *0170

*That was when we would have null hypotheses such as mu equals zero or mu equals 200 or mu equals -5. With independent samples, *0176

*we had 2 samples and we wanted to know how similar they were to each other, right, or how different *0190

*they were from each other, and our null hypothesis changed to something like mu of X bar minus Y bar equals zero, right, *0198

*that they are either made up of the same mean or different means. *0208

*And in a similar way, the goodness-of-fit chi-square is really asking whether the proportion in my sample *0213

*is similar to the proportion in the population. *0229

*So that is what I am comparing to; this is my null hypothesis in some ways. *0232

*In our test of homogeneity we have 2 samples that come from 2 unknown populations and we want to know *0240

*whether these have similar proportions to each other, and so that is going to be our null hypothesis. Either these have the same proportions or they have different ones. *0255

*Our null hypothesis is similar proportions. *0267

*And so in that way I hope you can see that goodness of fit and homogeneity are ideas that we have looked at before: *0275

*comparing one sample to a hypothesized population or comparing two samples to each other. But we have looked at it before *0285

*not with proportions but with means, right? *0294

*And now we are looking at it with proportions. Okay, since we are looking at proportions, we should have hypotheses about *0297

*proportions. So the null hypothesis is something like this: the proportion that *0305

*falls into each category is the same for each population, for however many categories you have. So let us say we have *0313

*three categories. *0322

*If we believe that the populations are the same, then they should roughly have the same proportions, so these have similar proportions. *0341

*It does not actually matter what the proportions are; it could be 90-10, it could be 10-10, it could be 75-20. Whatever the proportions are, *0347

*we think they are similar for each population: whatever category is 75% of one population, *0360

*that category will also be 75% of the other population. *0368

*The alternative hypothesis says that for at least one category the populations do not have the same proportion. So just like before, *0371

*we are now talking about differences, but the differences are really in the predicted population proportions. *0383

*So just to give you an example. *0394

*Here is the problem, and let us try to change it into the null hypothesis as well as the alternative hypothesis. *0396

*So according to a poll, 406 Democrats said they were very satisfied with candidate A while 510 were unsatisfied. *0401

*However, 910 Republicans were satisfied with candidate A while 60 were not. *0410

*In a chi-square test of homogeneity we could see whether the proportions of Democrats that were satisfied versus unsatisfied are *0415

*similar to the proportions of Republicans that were satisfied versus unsatisfied. *0427

*So let us draw this out first. *0436

*So here we have about 400 Democrats saying they are satisfied while 500 say they are unsatisfied. *0439

*Let us put satisfied in blue, and so that is a little bit less than half, and the unsatisfied people are a little bit *0451

*more than half, so this is what the Democratic population looks like. *0460

*The Republican population looks very different, so here we see most of the Republicans being pretty satisfied and *0467

*only a very small minority being unsatisfied, right. *0479

*And so the question is, are these two similar? Are the proportions that fall into each category, *0483

*satisfied or unsatisfied, the same for each population? *0493

*Or are they different? *0497

*The null hypothesis would probably say something like this. *0498

*The proportions of satisfied and unsatisfied people are the same for Democrats as well as Republicans. *0501

*The alternative hypothesis says that for at least one category, either satisfied or unsatisfied, Democrats and Republicans do not have the same proportion. *0531

*Okay, so note that in the case of 2 categories, once the proportion of one category changes, the other one automatically changes. *0561

*So if we somehow were able to change how satisfied the Democrats were with candidate A, we would also see the *0584

*proportion of unsatisfied people just automatically change. *0592

*So that is in the case of two categories, but in the case of multiple categories maybe 2 might change but the others may *0595

*not change, right, so in that way this would be a more general way of stating the alternative hypothesis. *0606

*Now let us talk about the chi-square statistic. *0612

*Now the nice thing about the chi-square statistic is that it is the same as in the goodness-of-fit test. *0616

*We use the same idea, so chi-square is going to be the difference between the observed frequencies and the *0621

*expected frequencies, squared, as a proportion of the expected frequency. *0631

*But there is just one subtle difference: before, it was for each category. *0638

*Now we have different categories in different populations, right, so we not only have category 1 and category 2 and *0643

*category 3 and so on and so forth, but we also have population 1 and population 2 at least, right? *0651

*And so we have multiple sets of observed frequencies, and so what do we do, right? *0659

*Well, what we do here is we consider each combination of which population you are in and which category *0668

*we are talking about; each of these is going to be called a cell. *0681

*And so we do this for each cell, so i will go from 1 up to the number of cells. *0686

*And how do we get the number of cells? *0694

*Well, the number of cells is really how many populations, right, and that is usually shown in columns, times how many categories. *0701

*And that is usually shown in rows; you can also think of the number of cells as columns times rows, how many columns you have times the number of rows. *0718

*But really the idea comes from how many different populations you are comparing, since a chi-square test of homogeneity *0733

*can actually compare three or four populations, not just 2, and how many categories you are comparing. *0739

*So in order to use the chi-square formula, it is often helpful to set up your data in a particular way. *0747

*Often these formulas will refer to rows and columns, and so you really need to have the right data in *0758

*the rows and the right data in the columns in order for any of these formulas to be used correctly. *0764

*So how do you set up your data in this way? *0769

*Whatever your sample 1 is, you want to put all of the information for sample 1 into a column, right, so *0772

*here I put sample 1 as the generic sample 1; it could be college freshmen, or Democrats, or mice that got a certain *0780

*drug, whatever it is. This is sample 1, and these are the people in sample 1 who fell into category 1. *0788

*These are the people in sample 1 who fell into category 2, and these are called cells. *0798

*When you add these frequencies up, you should get the total number of people in sample 1, right, so in that way all *0804

*the information from sample 1 is in a column. *0814

*Same thing with sample 2: all the information from sample 2 should be in a column. *0818

*This should be the entire sample broken up into those that fell into category 1 versus category 2, and then the *0823

*total gives you the total number of cases in sample 2. *0830

*If you had samples 3 and 4, they would follow that same pattern, and all the information for each should be in one column. *0836

*On the flip side, when you look at rows you should be able to count up how many cases were in category 1. *0843

*And so if you count them up this way, this is not a sample; it is just how many cases in the entire data set that you are looking at *0855

*are in category 1, and if you look across here, this is how many cases in the entire data set fall into category 2, *0868

*and finally, if you look at this total of totals, what you should get is the entire data set all added up. *0878

*So let us try that here with the Democrats and Republicans example. *0889

*So I am going to put Democrats up here, Republicans up here, satisfied and unsatisfied, and all I need to do is make *0896

*sure I find the correct information and put it into the correct cells. *0910

*910 are satisfied, 60 are not. *0916

*When I add this up I should be able to get the number of how many Democrats total are in the sample, so this *0921

*is 916; for Republicans this is 970, so we have slightly more people in our Republican sample than our Democrat sample, and that is fine. *0929

*If I add the rows up like this, if I get the row totals, what I should get is just the number of satisfied people. *0940

*It does not matter whether they are Democrats or Republicans, so we should get 1316, and this should be 570. *0948

*And if I add these two, it should equal these 2 being added up, because we are just adding these four numbers up *0959

*in a different order, so that should be 1886. *0967

*So we have 1886 in our total data set across both samples and we know how many people were satisfied, how many *0973

*people were unsatisfied; we also know how many Democrats we had, how many Republicans we had, and all the different combinations, right? *0990

*Democrats satisfied, Democrats unsatisfied, Republicans satisfied, Republicans unsatisfied. *0998

*So this is a great way to set up your data that really can help you figure out expected frequency, which is a *1003

*little bit more complicated to figure out in tests of homogeneity. *1009

*Not too much more complicated, but just a little bit more. *1012
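The layout described above (one column per sample, one row per category, plus totals) can be sketched in Python. This is only an illustration, not the spreadsheet used in the lecture; the variable names are assumptions, and the counts are the poll numbers from the example.

```python
# Observed counts from the candidate A poll in the lecture.
# Each population (sample) is a column; each category is a row.
observed = {
    "Democrats":   {"satisfied": 406, "unsatisfied": 510},
    "Republicans": {"satisfied": 910, "unsatisfied": 60},
}
categories = ["satisfied", "unsatisfied"]

# Column totals: the size of each sample.
column_totals = {pop: sum(cells.values()) for pop, cells in observed.items()}

# Row totals: cases in each category across the whole data set.
row_totals = {cat: sum(observed[pop][cat] for pop in observed) for cat in categories}

# Grand total: the "total of totals".
grand_total = sum(column_totals.values())

print(column_totals)  # {'Democrats': 916, 'Republicans': 970}
print(row_totals)     # {'satisfied': 1316, 'unsatisfied': 570}
print(grand_total)    # 1886
```

Adding the row totals and adding the column totals both give the same grand total, which is a quick check that the table was filled in correctly.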

*So here is how we can figure out expected frequency. Once you have it set up in this way, Democrats, Republicans, *1017

*satisfied, unsatisfied, here is the formula used for expected frequency. *1026

*So E is going to equal basically the proportion of people who are in one particular category. *1033

*So I just want to know how many people tend to be satisfied. *1042

*I do not care whether they are Democrat or Republican, just in general who is satisfied, right, so that would be the row *1046

*total, right, so the row total over the grand total, this one right here. *1053

*This will give me the proportion, just the general rate of who tends to be satisfied: *1065

*do 70% tend to be satisfied, 20% tend to be satisfied, 95% tend to be satisfied? *1077

*That is the general rate, and I am going to multiply that by the total number in the sample that I am interested in, *1084

*so maybe I am interested in the Democratic sample, so I would get the column total. *1092

*So that is the general formula, and I will show you this in a more specific way, so let us talk about the expected value of *1097

*Democrats who are satisfied. *1107

*Right, so that would be the satisfied total over the grand total, so this gives us the rate of being satisfied just *1110

*in general, what proportion of the entire data set is satisfied, and then I am going to multiply that by however *1125

*many Democrats I have, so the Democrat total. *1132

*So I could write it in this way, but what ends up happening is that this is just a more general way of saying this example. *1137

*So when I say Democrat total, it is the same thing as the column total. *1146

*And when I say row total, it is really the same thing as the satisfied total, and the grand total is the total number in our data set, *1151

*Democrats and Republicans. *1162
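As a rough sketch of the expected-frequency rule just described (row total over grand total, times column total), here is a small Python helper. The function name `expected_frequencies` is hypothetical, and the table layout (rows are categories, columns are populations) is an assumption that mirrors the setup in the lecture.

```python
def expected_frequencies(observed):
    """Expected count per cell: (row total / grand total) * column total.

    `observed` is a list of rows (categories); each row holds one count
    per column (population).
    """
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # r * c / grand_total is the same quantity as (r / grand_total) * c.
    return [[r * c / grand_total for c in column_totals] for r in row_totals]


# Rows: satisfied, unsatisfied. Columns: Democrats, Republicans.
observed = [[406, 910],
            [510, 60]]

expected = expected_frequencies(observed)
for row in expected:
    print([round(e, 2) for e in row])
# [639.16, 676.84]
# [276.84, 293.16]
```

The expected counts keep the same row and column totals as the observed counts, so each expected column still sums to that sample's size.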

*So now, once you have the expected values and you have the observed frequencies, you can easily find chi-square. *1165

*Once you get your chi-square, how do you compare it to the chi-square distribution? *1176

*Well, the nice thing is the chi-square distribution looks the same as in the goodness-of-fit test, *1182

*and so chi-square has a wall at zero, it cannot be lower than zero, and it has a long positive tail. When you decide how much *1190

*your alpha is, that is what it is going to look like. Alpha is always one-tailed in a chi-square distribution. *1202

*But the question is how to find degrees of freedom now that we have rows and columns. *1208

*Well, the degrees of freedom is really going to be the degrees of freedom for categories times the degrees of freedom for *1217

*however many populations, or samples that represent your populations, you have, and that is going to be the number of rows *1229

*right, because each category is in a row, minus 1, times the number of columns you have minus 1, so that is how you find your degrees of freedom *1238

*when you have more than one population that you are comparing. *1248
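The degrees-of-freedom rule above is a one-line calculation. A minimal sketch follows; the function name is hypothetical, and the third call uses arbitrary table dimensions purely for illustration.

```python
def degrees_of_freedom(n_categories, n_populations):
    """(rows - 1) * (columns - 1) for a test of homogeneity."""
    return (n_categories - 1) * (n_populations - 1)


print(degrees_of_freedom(2, 2))  # 1: two categories, two populations
print(degrees_of_freedom(3, 2))  # 2: three categories, two populations
print(degrees_of_freedom(4, 6))  # 15: four categories, six populations
```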

*So what are the conditions for the test of homogeneity? *1251

*These conditions are going to be very similar to the conditions for our goodness-of-fit testing, so the first thing is *1258

*each outcome of each population falls into exactly one of a fixed number of categories. *1265

*Well, the categories are mutually exclusive just like before; you have to be in one or the other, you cannot be in 2 categories *1275

*at the same time, and you cannot opt out of being in a category. Also, the category choices must be the same for all populations. *1280

*So if population 1 has three choices, the same three choices must be the case for population 2. *1288

*The 2nd requirement or condition is that you must have independent and random samples. Before, in tests of goodness of fit, *1298

*we only had the requirement that the sample had to be random, because we only had one sample. *1310

*Now we have multiple samples and they must be independent of each other; they cannot come from the same pool. *1315

*So the third condition: the expected frequency in each cell is five or greater, and that is the same condition that we had *1325

*for goodness-of-fit testing. It is because you want a big enough sample as well as big enough proportions. *1337

*And number four is not really a condition; it is just so that you know how free you are with chi-square testing: you can have *1344

*more than two categories and you can have more than two populations. You could have 4 categories and six populations, so you *1355

*could have a whole bunch of these different combinations, so you are not restricted to 2 categories and 2 populations. *1364

*So now let us go on to some examples. *1371

*Example 1 is just the example we have been using to talk about how to set up your data and how to find *1376

*expected values, so I set this up in an Excel file. This is exactly the same way we set it up previously; I just found *1383

*the row totals as well as the column totals. *1397

*And now I can start my hypothesis testing, so first things first. *1400

*Step one: our null hypothesis should say something like this, that the proportions of satisfied and unsatisfied people *1406

*for Democrats should be the same as for Republicans, so the proportions of category one and two, of satisfied and *1425

*unsatisfied voters, should be similar for Democrats and Republicans. *1435

*So the alternative hypothesis is that at least one of those proportions will be different between Democrats and Republicans. *1446

*Step two: just set our alpha to be .05, and we know that because we are doing chi-square hypothesis testing it is *1461

*one-tailed. Step three: you might want to draw a chi-square distribution for yourself, or just in your head, and *1476

*color in that alpha part and try to think. *1485

*I want to find my critical chi-square. *1488

*In order to find the critical chi-square I need to find the degrees of freedom. *1493

*And my degrees of freedom is going to be made up of the degrees of freedom for categories as well as the degrees of *1499

*freedom for populations, and there are two populations so it is 2-1, and you could also see that as the columns: 2 columns - 1. *1509

*And the degrees of freedom for the number of categories: there are two categories, satisfied and unsatisfied, minus 1, *1521

*and that corresponds perfectly to the number of rows - 1, and so the degrees of freedom here is going to be *1535

*this times this, so degrees of freedom for categories times degrees of freedom for populations, and that is just one. *1545

*So what is our critical chi-square? That is going to be found by CHIINV: we put in our probability as well as *1553

*our degrees of freedom, and we find 3.84 is our critical chi-square. *1564

*So we are looking for sample chi-squares that are larger than 3.84. *1571
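The lecture looks this critical value up in a spreadsheet. As a hedged cross-check in Python using only the standard library, the sketch below relies on a special case: with 1 degree of freedom, a chi-square variable is a squared standard normal, so the right-tail critical value is the square of the two-tailed z cutoff. The function name is hypothetical, and this shortcut does not generalize to other degrees of freedom.

```python
from statistics import NormalDist


def chi_square_critical_df1(alpha):
    """Right-tail critical chi-square value for df = 1 only.

    Uses the fact that chi-square with 1 df is Z squared, so
    P(chi-square > x) = P(|Z| > sqrt(x)).
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * z


print(round(chi_square_critical_df1(0.05), 2))  # 3.84
```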

*Step four looks something like this: in order to find our sample chi-square, what we need to do first is find our *1584

*expected values, so here we have observed frequencies and what we need to do is find expected frequencies. *1595

*So I am just going to copy and paste this down here so we do not have to keep scrolling, and so I am going to draw *1609

*a rectangle around the table here for observed frequency and create the same table for expected frequency. *1623

*Okay, so when I look at my expected frequency I need to find out what the general rate is and then multiply it by *1635

*however many people we have in that sample, so the general rate of being satisfied is 1316÷1886, *1651

*so that is the general rate, and that is about 70%. *1670

*Take that and multiply that by the total number of Democrats. *1674

*Now this part I want to keep the same, and I want to keep it in the same column, so I am going to put a $ in front of the D *1680

*to lock down that column, and here I am going to put a $ in front of both the D and the 21 in order to lock down this actual cell. *1697

*Because here is what I am going to do: I am going to copy and paste that over here, and if you look at this, what I am doing *1708

*is I have this same rate again, the rate of being satisfied, but now it is multiplied by the total number of Republicans. *1716

*And I am going to take that cell, copy and paste it down here, and here I see that now I have the rate of being *1726

*unsatisfied (I need to change this to that), and here I have the rate of being unsatisfied, and then that is *1737

*multiplied by the total number of Republicans, so these are my expected frequencies. *1750

*Notice that the totals still add up to be the same, right, and usually they should; there might be some slight discrepancies, *1756

*but that will just be because of rounding error, so they should still be pretty close. *1766

*So now we have observed frequencies as well as expected frequencies, and now I need to figure out my chi-square. *1771

*My chi-square is going to be made up of observed frequency minus expected frequency, squared, divided by expected frequency. *1779

*And I am going to need to find that for Democrats and Republicans, satisfied as well as unsatisfied, and then add up all of these cells. *1790

*So I will label this grand total and I will put that over here. *1808

*Okay, so let us find the observed frequency minus the expected frequency, squared, divided by expected frequency. *1813

*And I could just copy and paste that here because Excel will just move everything down, and I can take this over here because Excel *1829

*will move everything over to the right. *1841

*And the grand total for all four of these is going to be 547.18, and so my sample chi-square is quite large. *1843

*And so do I reject my null hypothesis? *1876

*Indeed I do, and we can find the p-value, so here I will put CHIDIST in order to find my probability. *1881

*Here it is; degrees of freedom is going to be one, and that is a very, very, very small p-value, so these are pretty radically *1898

*different populations that we see in there. *1911
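The spreadsheet steps for Example 1 can be mirrored in a few lines of Python. This is a sketch rather than the lecture's own workbook: the variable names are assumptions, and the p-value line uses a shortcut that is valid only for 1 degree of freedom (the right-tail probability equals `erfc(sqrt(x / 2))`).

```python
import math

# Rows: satisfied, unsatisfied. Columns: Democrats, Republicans.
observed = [[406, 910],
            [510, 60]]

row_totals = [sum(row) for row in observed]
column_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum (O - E)^2 / E over every cell of the table.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * column_totals[j] / grand_total
        chi_square += (o - e) ** 2 / e

# df = 1 only: P(chi-square > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))

print(round(chi_square, 2))  # 547.19 (the lecture's spreadsheet shows 547.18)
print(chi_square > 3.84)     # True, so the null hypothesis is rejected
print(p_value < 0.05)        # True
```

The statistic is far beyond the critical value of 3.84, which matches the lecture's conclusion that the two populations have very different proportions.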

*And if you want to step five, example 2. *1917

*Consider this data on pesticide residue on domestic and imported fruits. *1933

*Does this data fit the conditions of a chi-square test of homogeneity? Regardless of your answer, conduct the hypothesis test. *1937

*Now be careful here: although you see columns and rows, these are not the columns and rows you should be using. The columns are *1944

*actually okay: domestic fruits and imported fruits; we could consider those two to be the different populations that we are interested in. *1956

*But the rows actually do not show the different categories: they are sample size, percentage showing no residue, and percentage showing residue in violation, right? *1964

*So what we should do is we should actually transform this data into sort of the correct setup. *1975

*So here you could just pull up a brand-new Excel file; I am just going to use the bottom portion here, and here is what we want. *1983

*We would like it to be set up so that we have the two populations up here and we have the different categories here, *2005

*so the categories are probably going to be showing no residue and showing residue in violation, but one of the things I *2028

*noticed is that these percentages do not add up to 100, so there must be some other category that we are missing. *2035

*So no residue, showing residue in violation of the law (so I guess that is really bad), and maybe there is just one *2042

*where there is residue but not in violation, and you sort of have to figure that out from the data that they have given you. *2054

*But they do give you the sample sizes, 344 as well as 1136, so these are the totals. *2063

*The question is, what are our observed values? *2073

*In order to find the observed values, all we have to do is multiply by the proportion, so 44.2% times the total. *2079

*Here I lock down that row. Now, for residue in violation, what I have to do is change this percentage, so the percentage is .9%. *2098

*So that is .009, so that is .9%. *2116

*And so what is sort of left over? *2127

*Well, the leftover percentage is 1 - (.442 + .009), right, so that is sort of everybody else, and that is, I guess, the *2131

*number of fruits that are not in violation but still have some residue on them, some pesticide residue, times this. *2143

*And so when I add them all up I could check, and that is 344, so I have done my proportions correctly. *2154

*Now right away we could see that we are actually not meeting the conditions for chi-square. *2169

*If you look at this cell right here, it only has three fruits in it; even if we round generously it is 3.1, right? *2176

*So there are only three fruits. *2188

*Remember, expected frequencies have to be at least 5, so here the observed value is pretty small. *2191

*Okay, but the problem said to go ahead with hypothesis testing anyway. You should not do this in real life, but *2200

*for the purpose of this exercise let us do it. *2210

*So now let us find the proportion of imported fruits that are observed to have no residue on them. *2212

*So that is 70.4% times this total, and that is almost 800 fruits. *2222

*Also we have those that have residue in violation, .036, that is 3.6%, times 1136, about 41 fruits, and then *2232

*I need the leftover percentage, so that is 1 - (70.4% + 3.6%). *2249

*That percentage times the total. *2262

*And that is 295, right? *2268
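Converting the reported percentages back into observed counts can be sketched as follows. The dictionary keys and the label `residue_not_in_violation` are assumptions; the lecture infers that third category from the fact that the two given percentages do not sum to 100%.

```python
# Sample sizes and percentages as read out in the lecture.
samples = {
    "domestic": {"n": 344,  "no_residue": 0.442, "violation": 0.009},
    "imported": {"n": 1136, "no_residue": 0.704, "violation": 0.036},
}

observed = {}
for name, s in samples.items():
    # Whatever is left over is treated as a third category:
    # residue present but not in violation.
    leftover = 1 - (s["no_residue"] + s["violation"])
    observed[name] = {
        "no_residue": s["no_residue"] * s["n"],
        "residue_not_in_violation": leftover * s["n"],
        "violation": s["violation"] * s["n"],
    }

for name, counts in observed.items():
    print(name, {k: round(v, 1) for k, v in counts.items()})
# domestic {'no_residue': 152.0, 'residue_not_in_violation': 188.9, 'violation': 3.1}
# imported {'no_residue': 799.7, 'residue_not_in_violation': 295.4, 'violation': 40.9}
```

Because the counts are reconstructed from rounded percentages, they are not whole numbers; each column still sums to its sample size.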

*So first notice that it seems like there are way more of these imported fruits than domestic fruits, but that is because the *2272

*totals are different, so it does not necessarily mean that imported fruits have so much more residue on them; *2280

*that is not necessarily what it means. It is hard to compare because they have totally different totals. *2289

*So it is helpful to find the row totals as well, because that can help us find expected frequency, *2299

*and so that is adding these rows together, and we have a total of 1480 fruits, domestic and imported altogether. *2308

*Once we have that, then it would be easy for us to find expected frequency, and expected frequency we could basically set up in a very similar way. *2329

*So what is our expected frequency? *2346

*Well, expected frequency is generally how frequent, that is, the proportion of no residue over all the fruits, right. *2362

*So that will be this row total divided by the grand total; that is the general rate, and we want to lock down this row *2370

*because we want to lock those two values down, because that is always going to be the rate for no residue, *2383

*times the actual number of domestic fruits. *2401

*So we get 221, and here we do the same thing; I just copied and pasted across, and Excel will just naturally figure out what to do. *2410

*So this is the rate of no residue over total fruits times the total number of imported fruits. *2428

*Then we find the rate of fruits that have residue but are not in violation, which is this total over the grand total. *2436

*And then I am going to lock down those values and then I am going to multiply that by the total number of domestic fruits. *2449

*And then if I copy that over, that should give me the expected value of imported fruits given this proportion. *2467

*And finally the proportion of fruits with residue in violation, so a lot of pesticide residue; that would be this total *2476

*divided by the grand total times the total. *2489

*And here what we can see is if we sum these three expected frequencies together we should get something similar to 344. *2502

*And indeed we do, and here we should get 1136, and indeed we do. Great. *2515

*So once we have our table of observed frequencies as well as expected frequencies, now we can start to calculate *2522

*for each cell the observed frequency minus the expected frequency, squared, as a proportion of the expected frequency. *2530

*So O minus E squared as a proportion of expected frequency; I will copy these cell labels, so observed frequency *2540

*minus expected frequency squared divided by expected frequency, and just copy and paste all that. Let us check one of these. *2558

*This one says observed frequency minus expected frequency, squared, over expected frequency. *2573

*And when we add all of these up we get 102, but we have forgotten something: we forgot to do the decision stage, *2581

*so let us go ahead and do step three. *2599

*So the decision stage will be our critical chi-square, and our critical chi-square is found with the degrees of freedom *2601

*of the categories times the degrees of freedom of the populations; multiplied together, those give the degrees of freedom for the chi-square. *2610

*So categories - 1 is 2, populations - 1 is 1, so the degrees of freedom is just 2, so our critical chi-square is CHIINV. *2628

*Put in .05 as our desired probability, our degrees of freedom equals 2, and we get 5.99. *2646

*We see that our chi-square is much larger than that so we would reject our null. *2653
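Example 2 can be reproduced end to end with a short Python sketch. The variable names are assumptions; the middle-row proportions (0.549 and 0.260) are the leftover percentages derived earlier, and the critical value uses a shortcut valid only for 2 degrees of freedom, where the right-tail probability is `exp(-x / 2)`.

```python
import math

# Rows: no residue, residue not in violation, residue in violation.
# Columns: domestic (n = 344), imported (n = 1136).
observed = [[0.442 * 344, 0.704 * 1136],
            [0.549 * 344, 0.260 * 1136],   # leftover percentages
            [0.009 * 344, 0.036 * 1136]]

row_totals = [sum(row) for row in observed]
column_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

expected = [[r * c / grand_total for c in column_totals] for r in row_totals]

chi_square = sum((o - e) ** 2 / e
                 for o_row, e_row in zip(observed, expected)
                 for o, e in zip(o_row, e_row))

df = (len(observed) - 1) * (len(column_totals) - 1)

# df = 2 only: the critical value solves exp(-x / 2) = alpha.
critical = -2 * math.log(0.05)

smallest_expected = min(min(row) for row in expected)

print(df)                           # 2
print(round(critical, 2))           # 5.99
print(round(chi_square))            # 102
print(round(smallest_expected, 1))  # 10.2
```

In this sketch the smallest expected frequency comes out near 10.2, while the smallest observed count is about 3.1. The condition stated earlier in the lecture is phrased in terms of expected frequencies, so it is worth checking the expected table as well as the observed one.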
