Chi-Square Test of Homogeneity

  • Intro 0:00
  • Roadmap 0:09
    • Roadmap
  • Goodness-of-Fit vs. Homogeneity 1:13
    • Goodness-of-Fit HT
    • Homogeneity
    • Analogy
  • Hypotheses About Proportions 5:00
    • Null Hypothesis
    • Alternative Hypothesis
    • Example
  • Chi-Square Statistic 10:12
    • Same as Goodness-of-Fit Test
  • Set Up Data 12:28
    • Setting Up Data Example
  • Expected Frequency 16:53
    • Expected Frequency
  • Chi-Square Distributions & df 19:26
    • Chi-Square Distributions & df
  • Conditions for Test of Homogeneity 20:54
    • Condition 1
    • Condition 2
    • Condition 3
    • Condition 4
  • Example 1: Chi-Square Test of Homogeneity 22:52
  • Example 2: Chi-Square Test of Homogeneity 32:10

Transcription: Chi-Square Test of Homogeneity

Hi, welcome to 0002

We are going to talk about the chi-square test of homogeneity. 0002

Previously we talked about the chi-square goodness-of-fit test; now we are going to contrast that with this new test. It is still a chi-square test, but it is a test of homogeneity now. 0005

We are going to try to figure out when to use which test. 0022

Because we are testing a new idea, we are not testing goodness of fit; we are actually testing homogeneity, or similarity. 0027

We actually have slightly different null and alternative hypotheses. 0035

We are going to talk about how those have changed, then we are going to go over the chi-square statistic, and also finding the expected values, which is going to be a little bit different in the test of homogeneity. 0041

Finally, we are going to go through chi-square distributions as well as degrees of freedom and the conditions for the test of homogeneity, 0055

that is, when you can actually conduct this test, sort of statistically legally. 0065

Okay so the first thing is what is the difference between the test of homogeneity and test of goodness of fit? 0069

Well, in goodness-of-fit hypothesis testing we wanted to determine whether sample proportions are very different from hypothesized 0082

population proportions. One way you could think about this is that you have one sample and you are comparing it to some hypothetical population. 0089

That is why it is called goodness of fit: it is about how well these two things fit together. 0098

How well does the sample fit with the hypothesized proportions? 0108

In the test of homogeneity, homogeneous means similar, right, that they are made up of the same stuff. 0112

In the test of homogeneity we want to determine whether 2 populations that are sorted into categories share the same proportions or not. 0120

And here you could also substitute the word sample for population, because ultimately we are using the sample as a proxy for the population. 0130

So here we have 2 populations and we want to know whether those two populations are similar in their proportions or not, 0142

right? We are not comparing them to some hypothesized population; we are comparing them to each other. 0152

And so really you can think of their relationship by using an analogy from the 0159

one-sample t-test to the independent-samples t-test. 0167

In the one-sample t-test we had one sample and we compared it to the null hypothesis, right? 0170

That was when we would have null hypotheses such as mu equals zero, or mu equals 200, or mu equals -5. Versus an independent-samples t-test: 0176

we had 2 samples and we wanted to know how similar they were to each other, right, or how different 0190

they were from each other, and our null hypothesis changed to something like mu of X bar minus Y bar equals zero, right, 0198

that they either have the same mean or different means. 0208

And in a similar way, the goodness-of-fit chi-square is really asking whether this proportion in my sample 0213

is similar to the proportion in our population. 0229

So that is what I am comparing; this is my null hypothesis in some ways. 0232

In our test of homogeneity we have 2 samples that come from 2 unknown populations, and we want to know 0240

whether these have similar proportions to each other, and so that is going to be our null hypothesis, that these have the same proportions, as opposed to different ones. 0255

The null hypothesis is similar proportions. 0267

And so in that way I hope you can see that goodness of fit and homogeneity are ideas that we have looked at before, 0275

comparing one sample to a hypothesized population or comparing two samples to each other, but we have looked at them before 0285

not with proportions but with means, right? 0294

And now we are looking at it with proportions. Okay, since we are looking at proportions, we should have hypotheses about 0297

proportions, so the null hypothesis would be something like this: the proportion that 0305

falls into each category is the same for each population, however many categories you have. So let us say we have 0313

three categories. 0322

If we believe that the populations are the same, then they should roughly have the same proportions, so these have similar proportions. 0341

It does not actually matter what the proportions are; it could be 90, 10, it could be 10, 10, it could be 75, 20. It is just the proportions 0347

that we think are similar for each population: whatever category is 75% of one population, 0360

that category will also be 75% of the other population. 0368

The alternative hypothesis says that for at least one category the populations do not have the same proportion. So just like before 0371

we are now talking about differences, but the differences are really in the populations' proportions. 0383

So just to give you an example. 0394

Here is the problem and let us try to change it into the null hypothesis as well as alternative hypothesis. 0396

So according to a poll, 406 Democrats said they were very satisfied with candidate A while 510 were unsatisfied. 0401

However, 910 Republicans were satisfied with candidate A while 60 were not. 0410

In a chi-square test of homogeneity we can see whether the proportions of Democrats who were satisfied versus unsatisfied are 0415

similar to the proportions of Republicans who were satisfied versus unsatisfied. 0427

So let us draw this out first. 0436

So here we have about 400 Democrats saying they are satisfied while about 500 say they are unsatisfied. 0439

Let me put satisfied in blue, so that is a little bit less than half, and the unsatisfied people are a little bit 0451

more than half, so this is what the Democratic population looks like. 0460

The Republican population looks very different: here we see most of the Republicans being pretty satisfied and 0467

only a very small minority being unsatisfied, right? 0479

And so the question is, are these two similar? Are the proportions that fall into each category, 0483

satisfied or unsatisfied, the same for each population? 0493

Or are they different? 0497

The null hypothesis would probably say something like this: 0498

the proportions of satisfied and unsatisfied people are the same for Democrats as well as Republicans. 0501

The alternative hypothesis says that for at least one category, either satisfied or unsatisfied, Democrats and Republicans do not have the same proportion. 0531

Okay, so note that in the case of 2 categories, once the proportion of one category changes, the other one automatically changes. 0561

So if we somehow were able to change how satisfied the Democrats were with candidate A, we would also see the 0584

proportion of unsatisfied people just automatically change. 0592

So that is in the case of two categories, but in the case of multiple categories maybe 2 might change but the others may 0595

not change, right? So in that way this would be a more general way of stating the alternative hypothesis. 0606
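
The null and alternative hypotheses described here can be written compactly. This is only a sketch: the symbol p_{jk}, meaning the proportion of population k that falls into category j, is notation assumed for illustration and is not used in the lecture.

```latex
H_0:\; p_{j1} = p_{j2} \quad \text{for every category } j
\qquad \text{versus} \qquad
H_a:\; p_{j1} \neq p_{j2} \quad \text{for at least one category } j
```

With only two categories the proportions in each population sum to 1, so a difference in one category forces a difference in the other, which matches the note about the two-category case.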

Now let us talk about the chi-square statistic. 0612

Now the nice thing about the chi-square statistic is that it is the same as the goodness of fit test. 0616

We use the same idea, so chi-square is going to be the observed frequencies, and the difference between those and the 0621

expected frequencies, squared, as a proportion of expected frequency. 0631

But there is just one subtle difference: before, it was summed over each category. 0638

Now we have different categories in different populations, right? So we not only have category 1 and category 2 and 0643

category 3, so on and so forth, but we also have population 1 and population 2 at least, right? 0651

And so we have multiple sets of observed frequencies, and so what do we do? 0659

Well, what we do here is consider each combination of which population you are in and which category 0668

we are talking about; each of these is going to be called a cell. 0681

And so we do this for each cell, so i will go from 1 up to the number of cells. 0686

And how do we get the number of cells? 0694

Well, the number of cells is really how many populations, right, and that is usually shown in columns, times how many categories. 0701

And that is usually shown in rows, so you can also think of the number of cells as columns times rows: how many columns you have times the number of rows. 0718

But really the idea comes from how many different populations you are comparing, since a chi-square test of homogeneity can 0733

actually compare three or four populations, not just 2, and how many categories you are comparing. 0739
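
Written out, the statistic described in words above looks like the following. The symbols O_i and E_i (observed and expected frequency in cell i) are assumed shorthand for illustration.

```latex
\chi^2 = \sum_{i=1}^{\text{cells}} \frac{(O_i - E_i)^2}{E_i},
\qquad
\text{cells} = (\text{number of rows}) \times (\text{number of columns})
```

For two populations sorted into two categories this is a sum over 2 x 2 = 4 cells.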

So in order to use the chi-square formula, it is often helpful to set up your data in a particular way. 0747

Often these formulas will refer to rows and columns, and so you really need to have the right data in 0758

the rows and the right data in the columns in order for any of these formulas to be used correctly. 0764

So how do you set up your data in this way? 0769

Whatever your sample 1 is, you want to put all of the information for sample 1 into a column, right? So 0772

here I put sample 1 as a generic sample 1; it could be college freshmen, or Democrats, or mice that got a certain 0780

drug, whatever it is. This is sample 1, and these are the people in sample 1 who fell into category 1. 0788

These are the people in sample 1 who fell into category 2, and these are called cells. 0798

When you add these frequencies up you should get the total number of people in sample 1, right, so in that way all 0804

the information from sample 1 is in a column. 0814

Same thing with sample 2: all the information from sample 2 should be in a column. 0818

This should be the entire sample broken up into those that fell into category 1 versus category 2, and then the 0823

total gives you the total number of cases in sample 2. 0830

If you had samples 3 and 4 they would follow that same pattern, and all the information for each should be in one column. 0836

On the flip side, when you look at rows you should be able to count up how many cases were in category 1. 0843

And so if you count them up this way, this is not a sample; it is just how many cases in the entire data set that you are looking at 0855

are in category 1, and if you look across here, this is how many cases in the entire data set fall into category 2, 0868

and finally if you look at this total of totals, what you should get is the entire data set all added up. 0878

So let us try that here with the Democrats and Republicans example. 0889

So I am going to put Democrats up here, Republicans up here, satisfied and unsatisfied, and all I need to do is make 0896

sure I find the correct information and put it into the correct cells. 0910

406 Democrats are satisfied and 510 are not; 910 Republicans are satisfied and 60 are not. 0916

When I add this up I should be able to get the number of how many Democrats in total are in the sample, so this 0921

is 916; for Republicans this is 970. So we have slightly more people in our Republican sample than in our Democrat sample, and that is fine. 0929

If I add the rows up like this, if I get the row totals, what I should get is just the number of satisfied people. 0940

It does not matter whether they are Democrats or Republicans, so we should get 1316, and this should be 570. 0948

And if I add these two up, it should equal these 2 being added up, because we are just adding these four numbers up 0959

in a different order, so that should be 1886. 0967

So we have 1886 in our total data set across both samples, and we know how many people were satisfied, how many 0973

people were unsatisfied; we also know how many Democrats we had, how many Republicans we had, and all the different combinations, right? 0990

Democrats satisfied, Democrats unsatisfied, Republicans satisfied, Republicans unsatisfied. 0998
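
The same layout can be sketched in a few lines of Python, as an alternative to the spreadsheet used in the lecture. The counts are the poll numbers quoted above; the variable names are made up for illustration.

```python
# Observed counts from the poll example: rows are categories, columns are samples.
observed = {
    "satisfied":   {"Democrats": 406, "Republicans": 910},
    "unsatisfied": {"Democrats": 510, "Republicans": 60},
}
samples = ["Democrats", "Republicans"]

# Row totals: how many cases fall into each category across both samples.
row_totals = {cat: sum(cols.values()) for cat, cols in observed.items()}

# Column totals: the size of each sample.
col_totals = {s: sum(observed[cat][s] for cat in observed) for s in samples}

# Total of totals: adding the rows or adding the columns gives the same number.
grand_total = sum(row_totals.values())
assert grand_total == sum(col_totals.values())

print(row_totals)   # {'satisfied': 1316, 'unsatisfied': 570}
print(col_totals)   # {'Democrats': 916, 'Republicans': 970}
print(grand_total)  # 1886
```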

So this is a great way to set up your data that really can help you figure out expected frequency, which is a 1003

little bit more complicated to figure out in tests of homogeneity. 1009

Not too much more complicated, but just a little bit more. 1012

So here is how we can figure out expected frequency. Once you have it set up in this way, Democrats, Republicans, 1017

satisfied, unsatisfied, here is the formula used for expected frequency. 1026

So E is going to start with basically the proportion of people who are in one particular category. 1033

So I just want to know how often people tend to be satisfied. 1042

I do not care whether they are a Democrat or a Republican, just in general who is satisfied, right? So that would be the row 1046

total over the grand total, this one right here. 1053

This will give me the general rate, the proportion of who tends to be satisfied. 1065

Do 70% tend to be satisfied? 20%? 95%? 1077

That is the general rate, and I am going to multiply that by the total number in the sample that I am interested in, 1084

so maybe I am interested in the Democratic sample, so I would use the column total. 1092

So that is the general formula, but I will show you this in a more specific way, so let us talk about the expected value of 1097

Democrats who are satisfied. 1107

Right, so that would be the satisfied total over the grand total, so this gives us the rate of being satisfied just 1110

in general, what proportion of the entire data set is satisfied, and then I am going to multiply that by however 1125

many Democrats I have, so the Democrat total. 1132

So I could write it in this way, but it ends up that this formula is just a more general way of saying this example. 1137

So when I say Democrat total, it is the same thing as the column total. 1146

And when I say row total, it is really the same thing as the satisfied total, and the grand total is the total number in our data set, 1151

Democrats and Republicans. 1162
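
As a minimal sketch, the expected-frequency rule (row total over grand total, times column total) can be written as a small function. The function and variable names are hypothetical; the totals are the ones from the poll example.

```python
def expected_frequency(row_total, col_total, grand_total):
    """Expected count for one cell: general category rate times sample size."""
    general_rate = row_total / grand_total
    return general_rate * col_total


# Totals from the poll example in the lecture.
satisfied_total = 1316   # row total
democrat_total = 916     # column total
grand_total = 1886

e_dem_satisfied = expected_frequency(satisfied_total, democrat_total, grand_total)
print(round(e_dem_satisfied, 2))  # 639.16
```

So if Democrats were satisfied at the general rate of about 70%, roughly 639 of the 916 would be satisfied, compared with the 406 actually observed.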

So once you have the expected values and you have the observed frequencies, you can easily find chi-square. 1165

Once you get your chi-square, how do you compare it to the chi-square distribution? 1176

Well, the nice thing is the chi-square distribution looks the same as in the goodness-of-fit test, 1182

and so chi-square has a wall at zero, it cannot be lower than zero, and it has a long positive tail. When you decide how much 1190

your alpha is, that is what it is going to look like; alpha is always one-tailed in a chi-square distribution. 1202

But the question is how to find degrees of freedom now that we have rows and columns. 1208

Well, the degrees of freedom is really going to be the degrees of freedom for categories times the degrees of freedom for 1217

however many populations, or samples that represent your populations, you have, and that is going to be the number of rows 1229

minus 1, right, because each category is in a row, times the number of columns you have minus 1. So that is how you find your degrees of freedom 1238

when you have more than one population that you are comparing. 1248
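
The degrees-of-freedom rule can be sketched as a one-line function. The function name is hypothetical, and it assumes categories are laid out in rows and populations in columns, as in the lecture.

```python
def degrees_of_freedom(n_categories, n_populations):
    """df = (rows - 1) * (columns - 1): categories in rows, populations in columns."""
    return (n_categories - 1) * (n_populations - 1)


print(degrees_of_freedom(2, 2))  # 1  (two categories, two populations)
print(degrees_of_freedom(3, 2))  # 2  (three categories, two populations)
print(degrees_of_freedom(4, 6))  # 15 (four categories, six populations)
```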

So what are the conditions for the test of homogeneity? 1251

These conditions are going to be very similar to the conditions for our goodness-of-fit testing. So the first thing is that 1258

each outcome of each population falls into exactly one of a fixed number of categories. 1265

So the categories are mutually exclusive just like before: you have to be in one or the other, you cannot be in 2 categories 1275

at the same time, and you cannot opt out of being in a category. Also, the category choices must be the same for all populations. 1280

So if one population has three choices, the same three choices must be the case for population 2. 1288

The 2nd condition is that you must have independent and random samples. Before, in tests of goodness of fit, 1298

we only had the requirement that the sample had to be random, because we only had one sample. 1310

Now we have multiple samples and they must be independent of each other; they cannot come from the same pool. 1315

The third condition: the expected frequency in each cell is five or greater, and that is the same condition that we had 1325

for goodness-of-fit testing. It is because you want a big enough sample as well as big enough proportions. 1337

And number four is not really a condition; it is just so that you know how free you are with chi-square testing. You can have 1344

more than two categories and you can have more than two populations; you could have 4 categories and six populations, so you 1355

could have a whole bunch of these different combinations. You are not restricted to 2 categories and 2 populations. 1364
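
Of these conditions, only the third one (expected frequency of at least five in every cell) can be checked mechanically from a table; the others depend on how the data were collected. A minimal sketch of that check follows. The function name is hypothetical, and the `tiny` table is made-up data included only to show a failing case.

```python
def cells_below_minimum(observed, minimum=5):
    """Return (row, column, expected) for every cell whose expected count is below minimum."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    flagged = []
    for i, r in enumerate(row_totals):
        for j, c in enumerate(col_totals):
            e = r / grand_total * c   # expected frequency for cell (i, j)
            if e < minimum:
                flagged.append((i, j, e))
    return flagged


# Poll example: rows are satisfied / unsatisfied, columns are Democrats / Republicans.
poll = [[406, 910], [510, 60]]
print(cells_below_minimum(poll))  # [] (every expected count is well above 5)

# Hypothetical small table, invented for illustration only.
tiny = [[3, 4], [2, 30]]
print(len(cells_below_minimum(tiny)))  # 2
```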

So now let us go on to some examples. 1371

Example 1 is just the example we have been using to talk about how to set up your data and how to find 1376

expected values, so I set this up in an Excel file exactly the same way we set it up previously; I just found 1383

the row totals as well as the column totals. 1397

And now I can start my hypothesis testing, so first things first. 1400

Step one: our null hypothesis should say something like this, that the proportions of satisfied and unsatisfied people 1406

for Democrats should be the same as for Republicans, so the proportions in categories one and two, satisfied and 1425

unsatisfied voters, should be similar for Democrats and Republicans. 1435

So the alternative hypothesis is that at least one of those proportions will be different between Democrats and Republicans. 1446

Step two: let us set our alpha to be .05, and we know that because we are doing chi-square hypothesis testing it is one-tailed. 1461

Step three: you might want to draw a chi-square distribution for yourself, or just in your head, and 1476

color in that alpha part and try to think. 1485

I want to find my critical chi-square. 1488

In order to find the critical chi-square I need to find the degrees of freedom. 1493

And my degrees of freedom is going to be made up of the degrees of freedom for categories as well as the degrees of 1499

freedom for populations. There are two populations, so it is 2 - 1, and you could also see that as 2 columns - 1. 1509

And the degrees of freedom for the number of categories is, with two categories, satisfied and unsatisfied, 2 - 1, 1521

and that corresponds perfectly to the number of rows - 1. And so the degrees of freedom here is going to be 1535

this times this, so degrees of freedom for categories times degrees of freedom for populations, and that is just one. 1545

So what is our critical chi-square? That is going to be found by CHIINV: we put in our probability as well as 1553

our degrees of freedom, and we find 3.84 is our critical chi-square. 1564

So we are looking for sample chi-squares that are larger than 3.84. 1571
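
The lecture gets 3.84 from a spreadsheet function. As a cross-check, here is a sketch that reproduces it with only the Python standard library. It is not the lecture's method: it relies on the fact that a chi-square variable with 1 degree of freedom is a squared standard normal, so it works for df = 1 only. The function name is made up.

```python
from statistics import NormalDist


def chi2_critical_df1(alpha):
    """Upper-tail critical value for a chi-square distribution with df = 1.

    With df = 1, chi-square is the square of a standard normal, so the
    cutoff is the square of the two-tailed z cutoff for the same alpha.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * z


print(round(chi2_critical_df1(0.05), 2))  # 3.84
```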

Step four looks something like this: in order to find our sample chi-square, what we need to do first is find our 1584

expected values. So here we have observed frequencies, and what we need to do is find expected frequencies. 1595

So I am just going to copy and paste this down here so we do not have to keep scrolling, and I am going to draw 1609

a rectangle around the table here for observed frequencies and create the same table for expected frequencies. 1623

Okay, so when I look at my expected frequency I need to find out what the general rate is and then multiply it by 1635

however many people we have in that sample. The general rate of being satisfied is 1316÷1886, 1651

so that is the general rate, and that is about 70%. 1670

Take that and multiply it by the total number of Democrats. 1674

Now this part I want to keep the same, and I want to keep it in the same column, so I am going to put a $ in front of it 1680

to lock down that column, and here I am going to put a $ in front of both the D and the 21 in order to lock down this actual cell. 1697

Because here is what I am going to do: I am going to copy and paste that over here, and if you look at this, what I am doing 1708

is using the same rate again, the rate of being satisfied, but now it is multiplied by the total number of Republicans. 1716

And I am going to take that cell, copy and paste it down here, and here I see that now I have the rate of being 1726

unsatisfied; I need to change this to that, and here I have the rate of being unsatisfied 1737

multiplied by the total number of Republicans. So these are my expected frequencies. 1750

Notice that the totals still add up to be the same, and usually they should; there might be some slight discrepancies, 1756

but that will just be because of rounding error, so they should still be pretty close. 1766

So now we have observed frequencies as well as expected frequencies, and now we need to figure out my chi-square. 1771

My chi-square is going to be made up of observed frequency minus expected frequency, squared, divided by expected frequency. 1779

And I am going to need to find that for Democrats and Republicans, as well as satisfied and unsatisfied, and then add up all of these cells. 1790

So I will call this the grand total and I will put that over here. 1808

Okay, so let us find the observed frequency minus the expected frequency, squared, divided by expected frequency. 1813

And I can just copy and paste that here because Excel will just move everything down, and I can take this over here because Excel 1829

will move everything over to the right. 1841

And the grand total for all four of these is going to be 547.18, and so my sample chi-square is quite large. 1843

And so do I reject my null hypothesis? 1876

Indeed I do, and we can find the p-value, so here I will put CHIDIST in order to find my probability. 1881

Here it is; degrees of freedom is going to be one, and that is a very, very, very small p-value, so those are pretty radically 1898

different populations that we have there. 1911
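
Example 1 can be reproduced end to end without a spreadsheet. The sketch below uses only the Python standard library; the p-value line uses a closed form that holds for df = 1 only and stands in for the spreadsheet function used in the lecture. Variable names are illustrative.

```python
import math

# Observed counts; columns are Democrats, Republicans.
observed = [
    [406, 910],   # satisfied
    [510, 60],    # unsatisfied
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum (O - E)^2 / E over all four cells.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] / grand_total * col_totals[j]
        chi_square += (o - e) ** 2 / e

# Upper-tail probability, valid for df = 1 only: P(Z^2 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))

print(round(chi_square, 2))  # 547.19 (the lecture quotes 547.18)
print(chi_square > 3.84)     # True, so the null hypothesis is rejected
print(p_value)               # vanishingly small
```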

And that, if you want, is step five. Example 2. 1917

Consider this data on pesticide residue on domestic and imported fruits. 1933

Does this data fit the conditions of a chi-square test of homogeneity? Regardless of your answer, conduct a hypothesis test. 1937

Now be careful here: although you see columns and rows, these are not the columns and rows you should be using. The columns are 1944

actually okay: domestic fruits, imported fruits; we could consider those two to be the different populations that we are interested in. 1956

But the rows actually do not show the different categories; they show things such as sample size, percentage showing no residue, and percentage showing residue in violation, right? 1964

So what we should do is actually transform this data into sort of the correct setup. 1975

So here you could just pull up a brand-new Excel file; I am just going to use the bottom portion here, and here is what we want. 1983

We would like it to be set up so that we have the two populations up here and we have the different categories here, 2005

so the categories are probably going to be showing no residue and showing residue in violation, but one of the things I 2028

noticed is that these percentages do not add up to 100, so there must be some other category that we are missing. 2035

So there is no residue, and showing residue in violation of the law, so I guess that is really bad, and maybe there is just one 2042

more, which is residue but not in violation, and you sort of have to figure that out from the data that they have given you. 2054

But they do give you the sample sizes, 344 as well as 1136, so these are the totals. 2063

The question is, what are our observed values? 2073

In order to find the observed values all we have to do is multiply by the proportion, so 44.2% times the total. 2079

Here I lock down that row. Now for residue in violation, what I have to do is change this percentage, so the percentage is .9%. 2098

So that is .009, so that is .9%. 2116

And so what is sort of left over? 2127

Well, the leftover percentage is 1 - (.442 + .009), right, so that is sort of everybody else, and that is I guess the 2131

number of fruits that are not in violation but still have some pesticide residue on them, times this. 2143

And so when I add them all up I can check, and that is 344, so I have done my proportions correctly. 2154

Now right away we can see that we might actually not be meeting the conditions for chi-square. 2169

If you look at this cell right here, it only has three fruits in it; even if we round up generously it is 3.1, right? 2176

So there are only three fruits. 2188

Remember, expected frequencies have to be at least 5, and here the observed value is pretty small. 2191

Okay, but the problem said to go ahead with hypothesis testing anyway. You should not do this in real life, but 2200

for the purpose of this exercise let us do it. 2210

So now let us find the number of imported fruits that are observed to have no residue on them. 2212

So that is 70.4% times this total, and that is almost 800 fruits. 2222

Also we have those that have residue in violation, .036, that is 3.6% times 1136, about 41 fruits, and then 2232

I need the leftover percentage, so that is 1 - (70.4% + 3.6%). 2249

That percentage times the total. 2262

And that is 295, right? 2268

So first notice that it seems like there are way more of these imported fruits than domestic fruits in each cell, but that is because the 2272

totals are different, so it does not necessarily mean that imported fruits have so much more residue on them. 2280

That is not necessarily what it means; it is hard to compare because they have totally different totals. 2289

So it is helpful to find the row totals as well, because that can help us find expected frequency, 2299

and so that is adding these rows together, and we have a total of 1480 fruits, domestic and imported altogether. 2308
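
The conversion from percentages to observed counts can be sketched as follows. The sample sizes and percentages are the ones quoted in the lecture; the dictionary keys and the name of the leftover category are made up for illustration.

```python
# Sample sizes and proportions quoted in the lecture for Example 2.
samples = {
    "domestic": {"n": 344,  "no_residue": 0.442, "violation": 0.009},
    "imported": {"n": 1136, "no_residue": 0.704, "violation": 0.036},
}

observed = {}
for name, s in samples.items():
    # Whatever is left over is residue present but not in violation.
    leftover = 1 - (s["no_residue"] + s["violation"])
    observed[name] = {
        "no_residue": s["no_residue"] * s["n"],
        "residue_not_in_violation": leftover * s["n"],
        "residue_in_violation": s["violation"] * s["n"],
    }

for name, counts in observed.items():
    print(name, {k: round(v, 1) for k, v in counts.items()})

# Each column adds back up to its sample size, and the total of totals is 1480.
grand_total = sum(sum(counts.values()) for counts in observed.values())
print(round(grand_total))  # 1480
```

The counts are not whole numbers because they are rebuilt from rounded percentages, which is also why the lecture sees 3.1 fruits in one cell.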

Once we have that, it will be easy for us to find expected frequency, and we can basically set expected frequency up in a very similar way. 2329

So what is our expected frequency? 2346

Well, expected frequency starts with generally how frequent no residue is, the proportion of no residue over all the fruits, right? 2362

So that will be this row total divided by the grand total; that is the general rate, and we want to lock down this row, 2370

because we want to lock those two values down, because that is always going to be the rate for no residue, 2383

times the actual number of domestic fruits. 2401

So we get 221, and here we do the same thing; I just copy and paste across, and Excel will just naturally figure out what to do. 2410

So this is the rate of no residue over total fruits times the total number of imported fruits. 2428

Then we find the rate of fruits that have residue but are not in violation, which is this total over the grand total. 2436

And then I am going to lock down those values and multiply that by the total number of domestic fruits. 2449

And then if I copy that over, that should give me the expected value of imported fruits given this proportion. 2467

And finally the proportion of fruits with residue in violation, so a lot of pesticide residue: that would be this total 2476

divided by the grand total, times the column total. 2489

And here what we can see is that if we sum these three expected frequencies together we should get something similar to 344. 2502

And indeed we do, and here we should get 1136, and indeed we do. Great. 2515
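
The expected-frequency table for Example 2 can be rebuilt with the same row-total over grand-total rule. This is a sketch under the assumption that the observed counts are the lecture's percentages times the sample sizes (with 54.9% and 26.0% as the leftover shares); the variable names are illustrative.

```python
# Rows: no residue, residue not in violation, residue in violation.
# Columns: domestic (n = 344), imported (n = 1136).
observed = [
    [0.442 * 344, 0.704 * 1136],
    [0.549 * 344, 0.260 * 1136],
    [0.009 * 344, 0.036 * 1136],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count = (row total / grand total) * column total, for every cell.
expected = [[r / grand_total * c for c in col_totals] for r in row_totals]

for row in expected:
    print([round(e, 1) for e in row])

# Column sums of the expected table match the sample sizes.
expected_col_sums = [sum(col) for col in zip(*expected)]
smallest_expected = min(min(row) for row in expected)
print(round(smallest_expected, 1))
```

Under these reconstructed counts, the smallest expected frequency (domestic fruit with residue in violation) comes out near 10, even though the observed count in that cell is only about 3. Since the condition is stated for expected rather than observed frequencies, it is worth checking this table before deciding whether the condition fails.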

So once we have our table of observed frequencies as well as expected frequencies, now we can start to calculate 2522

for each cell the observed frequency minus expected frequency, squared, as a proportion of expected frequency. 2530

So O minus E, squared, as a proportion of expected frequency. I will copy these cell labels, so observed frequency 2540

minus expected frequency, squared, divided by expected frequency, and just copy and paste all that. Let us check one of these. 2558

This one says observed frequency minus expected frequency, squared, over expected frequency. 2573

And when we add all of these up we get 102, but we have forgotten something: we forgot to do the decision stage, 2581

so let us go ahead and do step three. 2599

So the decision stage will be our critical chi-square, and our critical chi-square is found with the degrees of freedom 2601

of the categories times the degrees of freedom of the populations; multiplied together, those give the degrees of freedom for the chi-square. 2610

So categories - 1 is 2, populations - 1 is 1, so the degrees of freedom is just 2, and our critical chi-square is found with CHIINV. 2628

Put in .05 as our desired probability and 2 as our degrees of freedom, and we get 5.99. 2646

We see that our chi-square is much larger than that, so we would reject our null. 2653
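
Example 2 can also be checked end to end with the Python standard library. This sketch assumes the observed counts are the lecture's percentages times the sample sizes. The critical value and p-value use a closed form that holds for df = 2 only (the upper-tail probability is exp(-x/2)), which stands in for the spreadsheet function used in the lecture.

```python
import math

# Rows: no residue, residue not in violation, residue in violation.
# Columns: domestic (n = 344), imported (n = 1136).
observed = [
    [0.442 * 344, 0.704 * 1136],
    [0.549 * 344, 0.260 * 1136],
    [0.009 * 344, 0.036 * 1136],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum (O - E)^2 / E over all six cells.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] / grand_total * col_totals[j]
        chi_square += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Valid for df = 2 only: P(chi-square > x) = exp(-x / 2).
alpha = 0.05
critical = -2 * math.log(alpha)
p_value = math.exp(-chi_square / 2)

print(df)                     # 2
print(round(critical, 2))     # 5.99
print(round(chi_square))      # 102
print(chi_square > critical)  # True, so the null hypothesis is rejected
```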