For more information, please see full course syllabus of Statistics

### ANOVA with Independent Samples

Lecture Slides are screen-captured images of important points in the lecture. Students can download and print out these lecture slide images to do practice problems as well as take notes while watching the lecture.

- Intro
- Roadmap
- The Limitations of t-tests
- Two Major Limitations of Many t-tests
- Ronald Fisher's Solution… F-test! New Null Hypothesis
- Analysis of Variance (ANOVA) Notation
- Partitioning (Analyzing) Variance
- Time out: Review Variance & SS
- F-statistic
- S²bet = SSbet / dfbet
- S²w = SSw / dfw
- Chart of Independent Samples ANOVA
- Example 1: Who Uploads More Photos: Unknown Ethnicity, Latino, Asian, Black, or White Facebook Users?
- Hypotheses
- Significance Level
- Decision Stage
- Calculate Samples' Statistic and p-Value
- Reject or Fail to Reject H0
- Example 2: ANOVA with Independent Samples

- Intro 0:00
- Roadmap 0:05
- Roadmap
- The Limitations of t-tests 1:12
- The Limitations of t-tests
- Two Major Limitations of Many t-tests 3:26
- Two Major Limitations of Many t-tests
- Ronald Fisher's Solution… F-test! New Null Hypothesis 4:43
- Ronald Fisher's Solution… F-test! New Null Hypothesis (Omnibus Test - One Test to Rule Them All!)
- Analysis of Variance (ANOVA) Notation 7:47
- Analysis of Variance (ANOVA) Notation
- Partitioning (Analyzing) Variance 9:58
- Total Variance
- Within-group Variation
- Between-group Variation
- Time out: Review Variance & SS 17:05
- Time out: Review Variance & SS
- F-statistic 19:22
- The F Ratio (the Variance Ratio)
- S²bet = SSbet / dfbet 22:13
- What is This?
- How Many Means?
- So What is the dfbet?
- So What is SSbet?
- S²w = SSw / dfw 26:05
- What is This?
- How Many Means?
- So What is the dfw?
- So What is SSw?
- Chart of Independent Samples ANOVA 29:25
- Chart of Independent Samples ANOVA
- Example 1: Who Uploads More Photos: Unknown Ethnicity, Latino, Asian, Black, or White Facebook Users? 35:52
- Hypotheses
- Significance Level
- Decision Stage
- Calculate Samples' Statistic and p-Value
- Reject or Fail to Reject H0
- Example 2: ANOVA with Independent Samples 58:21

### General Statistics Online Course

### Transcription: ANOVA with Independent Samples

*Hi, welcome to Educator.com. *0000

*We are going to talk about ANOVA with independent samples today. *0002

*So first we need to talk a little bit about why we need to introduce the ANOVA. *0005

*We have been doing so well with t-tests so far. *0011

*Well, there are some limitations of the t-test and that is why we are going to need an ANOVA here. *0013

*ANOVA stands for analysis of variance, and the analysis of variance can also be thought of as the omnibus hypothesis test. *0020

*So it is still a hypothesis test, just like the t-test, but it is the omnibus hypothesis test; we are going to talk about what that means. *0032

*We are going to need to go over a little bit of notation in order to break down the ANOVA in detail. *0041

*And then we are really going to get to the nitty-gritty of partitioning, or analyzing, variance: *0047

*getting down to breaking apart variance into its component parts. *0055

*Then we are going to build up the F statistic out of those bits and pieces of variance, and *0059

*then finally talk about how that relates to the F distribution and hypothesis testing. *0066

*Okay so first thing, the limitations of the t-test. *0071

*Well, here is a common problem: say I want to answer this question. *0077

*Who uploads more pictures to Facebook? *0083

*Latino users, white users, Asian users, or black Facebook users? *0086

*Which of these racial or ethnic groups uploads more pictures to Facebook? *0091

*Well, let us see what would happen if we used independent-samples t-tests. *0098

*What would we have to do? *0101

*Well, we would have to compare Latino to white, Latino to Asian, Latino to black, white to Asian, white to black, and Asian to black users. *0104

*All of a sudden we have to do six different independent-samples t-tests. *0111

*That is a lot of tiny, tiny little t-tests, and really, the more t-tests you do, the greater your likelihood of a type 1 error. *0118

*Previously, to calculate the type 1 error rate we looked at one minus the probability that you would be *0127

*correct, so one minus the probability of being right, and that was something like .05, let us say, right? *0135

*But now that we want to calculate the probability of a type 1 error across six t-tests, we have to think *0144

*back to our probability principles, and it is going to look something like this. *0152

*It is one minus your correct rate raised to the sixth power, and that is going to be a much higher, *0157

*much higher type 1 error rate than you really want. *0167
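That calculation can be sketched in a couple of lines; the numbers are illustrative, assuming six independent tests each run at alpha = .05:

```python
# Familywise type 1 error: the chance of at least one false alarm across
# several independent tests, each run at alpha = .05
alpha = 0.05
num_tests = 6  # the six pairwise t-tests from the example
familywise = 1 - (1 - alpha) ** num_tests
print(familywise)  # roughly 0.26, far above the intended .05
```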

*So the problem is that the more t-tests you have, the bigger the chance of a type 1 *0174

*error, and even non-mathematically you can think about this. *0181

*Any time you do a t-test you could reject the null; every time you reject the null you have the *0186

*possibility of making a type 1 error; and so if you reject the null six times, then you have increased *0193

*your type 1 error rate, because you are just rejecting more null hypotheses. *0201

*So you should know there are two major limitations of having many, many tiny little t-tests. *0206

*Say you have six separate t-tests: one limitation is the increased likelihood of type 1 error, and that is bad. *0213

*We do not want a false alarm but there is a second problem, you are not using the full set of data in order to estimate S. *0220

*Remember how before we talked about s as an estimate of the population standard deviation? *0231

*Well, it would be nice if we had a good estimate of the population standard deviation and you *0237

*know when you have a better estimate of the population standard deviation? *0242

*When you have more data! When you do a t-test, for instance, with just Latino and white users, *0246

*then you are ignoring your perfectly usable data from your Asian and black American *0253

*samples. So that is a problem: you are ignoring some of your data in order to estimate s, and *0260

*you are estimating s a bunch of separate little times instead of having one giant estimate of s, *0267

*which would be a better way to go. So both of these are major limitations of using many, many little t-tests. *0274

*So back in the day, statisticians knew that there was this problem, and Ronald Fisher came up with a *0282

*solution; his solution is called the F-test, F for Fisher. *0291

*If you think of a new statistic, you can name it after yourself. *0296

*So he thought of something called an F-test, but this F-test also includes a new way of thinking *0302

*about hypotheses, and so the F-test can also be thought of as an omnibus test, and the way you *0308

*can think about it is like the Lord of the Rings ring idea. *0315

*It is one test to rule them all: instead of doing many, many tiny little tests, you do one test to *0319

*decide once and for all if there is a difference. *0326

*And because you have this one test you need one null hypothesis and here is what that null hypothesis is. *0329

*You need to test whether all the samples belong to the same population or whether at least *0337

*one belongs to a different population, because remember, the null hypothesis and the alternative *0346

*hypothesis have to be like two sides of the same coin; so your null hypothesis is that they are all equal. *0351

*The mu’s are all equal. *0359

*They all came from exactly the same population. *0360

*The other hypothesis, the alternative hypothesis, is that they are not all equal; but let us think about what that means. *0363

*That means at least two of them are different from each other; it does not mean that all of them are *0372

*different from each other. It means at least one guy is different from one of these guys. *0377

*That is it, that is all it means, that is all you can find out. *0382

*So let us consider this situation let us say you have these three samples. *0386

*Your null hypothesis would be that they all came from the same population. *0392

*A1, A2, and A3 all come from the same population A; but if we reject that null hypothesis, what have we found out? *0399

*We have found out that at least two of them differ; all three of them could differ from each other, or it could be just two. *0413

*It could be that A1 and A2 are the same, and A3 is different. *0421

*It could be that A2 and A3 are the same but A1 is different. *0425

*It could be that A1 is totally different from A2, which is totally different from A3. *0428

*Any of those is a possibility, so here is the good thing. *0433

*The good thing about the omnibus hypothesis is that you can test all of those things at once. *0436

*That they all come from the same population: you can test that big hypothesis at once. *0442

*The bad thing about it is that if you reject the null, it still does not tell you which populations differ. *0446

*It only tells you that at least one of the populations is different. *0454

*So when you reject the null, it is not quite as informative but still it is a very useful test. *0459

*So we need to know some notation before we go on. *0466

*We are doing an analysis of variance; that is why it is called the ANOVA, and sometimes *0471

*you might begin with a little ANOVA notation. You want to analyze the variance, so when *0479

*we want to analyze the variance we have to think really hard about what variance means. *0486

*And variance is sort of the average spread around some mean, so how much spread you have. *0492

*Are you really tightly clustered around the mean, or are you really dispersed around the mean? *0500

*Okay so first things first, consider all the data that we get from all the different groups. *0505

*That is why we have to lump together all the data from all the different groups, and look at the variance *0511

*around the grand mean; the grand mean is a new idea. *0518

*The grand mean is not just the mean of your sample; the grand mean is the mean of everybody lumped together. *0521

*Pretend there are three groups; now pretend there is just one giant group that all three data sets have been poured into. *0528

*What is the mean of that giant group? *0536

*That is called the grand mean. So for instance, here are our samples. *0538

*Our sample from A1, our sample from A2, our sample from A3; and when you have sample means, here is what the notation looks like. *0544

*It should be pretty familiar, X bar sub A1, X bar sub A2, X bar sub A3. *0552

*Now when we have a grand mean, we do not have three of them, we just have one, because remember, they are all lumped together, right? *0564

*How do we distinguish the grand mean? If we just say X bar we might confuse it for being a *0571

*sample mean instead of the grand mean, and so to signal the grand mean, the mean of all *0579

*the means, the mean of all the samples, we call it X double bar, and that is how we know that it *0585

*is the grand mean; so that is definitely one of the things you need to know. *0592
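A quick sketch of the grand mean with made-up samples; note that with unequal group sizes it is the mean of all the data lumped together, which is not the same as the simple average of the sample means:

```python
# Grand mean (X double-bar): mean of every data point lumped together
groups = [[1, 2, 3], [10, 20]]  # made-up samples of unequal size
all_data = [x for g in groups for x in g]
grand_mean = sum(all_data) / len(all_data)  # 36 / 5 = 7.2

# Simple average of the two sample means differs when the n's differ
mean_of_means = sum(sum(g) / len(g) for g in groups) / len(groups)  # (2 + 15) / 2 = 8.5
```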

*So now let us talk about partitioning or analyzing the variance. *0596

*When we are analyzing variance, what we want to start with is the total amount of variance. *0606

*First, we have to have the big thing before we break it apart. *0614

*So what is the big thing? The big variance in the room is total variance, and this is the variance *0617

*of every single data point in our giant pool around the grand mean. *0625

*And we can actually figure out how to write this as a formula just by knowing the grand *0629

*mean as well as the variance formula, right? Variance is always squared distance away from *0639

*the mean divided by however many data points you have, to get average squared distance from the mean. *0645

*Now we want the distance away from the grand mean, so I am going to go ahead and put that *0653

*there: instead of X bar I have X double bar, and I put in my data points, so that would be X sub i. *0659

*And we want to get the sum of all of those and then divide by however many data points we have. *0668

*Usually N means the number of data points in a sample. *0676

*How do we denote the N of everybody, of all your data points added together? *0682

*Here is how: you call it N sub total. *0688

*And this says it is not just the n of one of our little samples, because we have three little *0691

*samples; I mean the N of everybody, the total number in your data set. *0698

*And so even this X sub i, I do not mean just the X's in sample 1, I mean every single data *0704

*point, so I would say i goes from 1 all the way up to N total. *0713

*Sorry, this is a little small: N sub total up here. And so this will cycle through every single X, every *0719

*single data point in your entire sample lumped together. *0729

*Get their distance away from the grand mean, square it, add those squared distances together *0732

*divide by N so this is just the general idea of variance. *0741

*Average squared distance from the mean. *0747

*In this case, it is around the grand mean; and so how do we say total variance? *0750

*Well, it would be nice if we could say, oh, this is something sub total, right? *0757

*Before we go on to variance, though, I just want to stop here; before we go into average variance, *0765

*I just want to talk about this thing. What is this thing? *0773

*So let us talk about sums of squares. Variance is always going to be the sum of *0777

*squared distances, the sum of squares, divided by N; or if you are talking about s, s squared is the sum *0784

*of squared distances over N minus 1, and another way of saying that is SS over degrees of freedom. *0795
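As a quick review sketch with made-up numbers: the sum of squares first, then variance as SS over N, and s squared as SS over degrees of freedom:

```python
# Sum of squares (SS) and the two flavors of variance, on made-up data
data = [2, 4, 6, 8]
mean = sum(data) / len(data)             # 5.0
ss = sum((x - mean) ** 2 for x in data)  # sum of squared deviations: 20.0
variance = ss / len(data)                # SS / N
s_squared = ss / (len(data) - 1)         # SS / df, the estimate of sigma squared
```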

*So we are just going to stop here for a second and talk about this sum of squares, and we are going to call it sum of squares total. *0805

*So that is sum of squares total, and that is going to be important to us because later we are going to *0817

*use these sums of squares, these different sums of squares, to then talk about variance. *0824

*Sums of squares are very much related to the idea of variance. *0830

*Now, this total variance is really the idea of how much you are varying overall. *0834

*We have this total variance and we are going to partition it into two types of variance. *0840

*One is within-group variation and the other is between-group variation. *0845

*So we have 3 groups, the between group variance is going to look at how different they are from each other. *0850

*The within-group variance is just going to look at how different they are from their own group, *0860

*how different the data are from their own group, and that is going to be important because this *0865

*sum of squares total actually is made of sum of squares within plus sum of squares between. *0871
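That partition, SS total = SS within + SS between, is easy to check numerically on made-up groups:

```python
# Checking SS_total = SS_within + SS_between on made-up groups
groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
all_data = [x for g in groups for x in g]
grand_mean = sum(all_data) / len(all_data)

# Every data point's squared distance from the grand mean
ss_total = sum((x - grand_mean) ** 2 for x in all_data)
# Every data point's squared distance from its OWN group mean
ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
# Each group mean's squared distance from the grand mean, weighted by group size
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)

print(ss_total, ss_within + ss_between)  # the two quantities match
```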

*So because of this idea we can really now see that we are taking total variance and partitioning it *0882

*into within-group variance and between-group (or between-sample) variance. *0892

*So first things first, within group variance. *0899

*How do we get an idea of how different each sample is from its own mean? *0902

*Well, the idea is just like what we have been talking about before. *0912

*This is each samples variance around their own mean and we already know the notation for this mean. *0917

*So that would be something like: how much does everybody in sample A1 differ from the mean of *0938

*A1? Square those differences to get the sum of squares. *0947

*Then get the sum of squares for everybody in A2, and the same thing for everybody in A3. *0954

*So this is the regular use of variance, the regular use of sum of squares, that we have used before. *0971

*Just looking at each sample's variance from its own sample mean. *0977

*Now how do we get between group variance? *0982

*Between-group variance is going to be each sample's mean: how much does it vary from the *0986

*grand mean? The squared difference from the grand mean. So there is some grand mean, and *0999

*how much does each sample mean differ from that grand mean? *1011

*And so that is going to be between-group variation. *1016

*How much do the groups differ from that grand mean? *1020

*So first of all let us just review variance and sum of squares. *1024

*So sum of squares is the idea that we are going to use over and over again, and it is just this idea that *1033

*you are summing (the sigma sign) the squared distances of X from X bar. *1043

*So basically you take the squared distances away from the mean and add them up. *1052

*That is sum of squares. *1061

*Now what we are doing is swapping out this idea of the mean for things like the grand *1063

*mean or the sample mean, and we are also swapping out what our data points are. *1071

*Is it all the data points from 1 to N total, is it just the n from one sample, is it the group means? *1082

*So we are swapping out these two ideas in order to get our sum of squares total, sum of squares *1098

*between, or sum of squares within, but it is always the same idea. *1106

*Take the distances, square them, add them up. *1110

*Okay, now what is variance in relation to this? *1113

*Well, variance is the average squared distance, and so we always take the sum of *1116

*squares and we divide by however many data points we have. *1130

*But often we are using the estimate s instead of actually having the population standard deviation. *1139

*So we are going to be using degrees of freedom instead of just N, and we have different kinds of *1146

*degrees of freedom for between- and within-group variation, so watch out for them. *1153

*Okay now let us go back to the idea of the F statistic. *1162

*Now that we have broken it down a little bit in terms of what kinds of different variances there *1167

*are, hopefully the F statistic makes a little more sense. *1171

*The idea is that you want to take the ratio of the between-group (or between-sample) variance over the *1175

*within-group variance, and the reason we want this particular ratio is that we are actually very *1187

*interested in the between-group difference; that is what our hypothesis test is all about, whether the groups are different or the same. *1197

*The within group variation, we cannot account for. *1206

*It is variation that is just inherent in the system, and so we need to compare the between-group *1210

*variation, which we care about, with the within-group variation that we cannot explain, that we have no *1218

*explanation for, at least not in this hypothesis test; we would have to do other tests to figure that out. *1223

*Okay, so now what we need to do is replace these conceptual ideas with some of the things that we have been learning about. *1230

*In particular, the variance between and the variance within; for variance we are going to use s squared, so s squared between over s squared within. *1242

*So variance between over variance within; and now we know a little bit, we have refreshed *1260

*what variance is about and how we can break it down in terms of sums of squares. *1266

*Well, that is what we are going to do. *1272

*We are going to double-click on this guy and here is what we see inside. *1276

*We see the sum of squares between divided by the degrees of freedom between, all over the *1280

*sum of squares within divided by the degrees of freedom within, and this is how we are going to actually calculate our F statistic. *1291

*Now, we will write out the formulas for each of these, but it is good to know where the F *1301

*statistic comes from, its conceptual root; you always want to be able to go back there. *1309

*Because ultimately, when we have a large F, we want to be able to say: this means there is a *1314

*large between-group variation relative to the within-group variation. *1321

*A larger difference in the thing that were interested in over the variance that we have no explanation for. *1327

*Okay so now let us figure out how to break down this idea and remember this idea really is the breakdown of the variance between. *1332

*So we are breaking down the broken-down thing. *1343

*So conceptually what is this? *1347

*Well, conceptually this is the difference of each sample mean from the grand mean. So imagine our *1350

*little groups: there is some grand mean that all of these guys contributed to, but they each have a little sample mean of their own. *1357

*What I want to do is know the difference between these, squared, then add them up. *1376

*That is the idea behind this. *1384

*So first of all, how many means do we have, how many data sets do we have, how many data points do we have? *1386

*Well, we have a data point for every sample that we have; so how many means do we have? *1395

*Or how many samples do we have? *1402

*We actually have a term for that. *1404

*The letter that we reserve for how many samples is K, the number of samples. *1406

*And so that you could think about okay if that is the number of samples then what might be the degrees of freedom here? *1415

*Well, it is just going to be K minus 1; here is why. *1427

*In order to get the grand mean we could do a weighted average of these means, and since there *1434

*are three of them, if we knew what two of them were in advance, the third one would not be free *1442

*to vary; we are locked down with that third one. *1449

*So the degrees of freedom is K minus 1. *1451

*Okay, so what is the actual sum of squares between? Now you need to take into *1454

*consideration how many actual data points are in each group. *1463

*For instance, group one might have a lot of data points while group two might only have a few, which means group one should matter more. *1468

*Well that can be taken into account. *1476

*So first things first, how do we get the difference between this mean and the grand mean? *1479

*That is going to be this. *1486

*X bar minus X double bar; so get the difference between the mean and the grand mean. *1489

*Now we have several means here, so I am going to put an i for index, and in my sum of squares my i is *1497

*going to go from 1 up through K; so for each group that I have, I want you to get this distance and square it. *1507

*I am not going to stop there: I also want you to make it count a lot if the group has a lot of data *1515

*points. So if this guy has a lot of data points, he should get more votes; his difference from the *1526

*grand mean should count more than this guy's difference, and that is what we get by multiplying *1531

*by n: if n is very large, this distance is going to count a lot; if n is very small, this distance is not going to count as much. *1538

*And this is the sum of squares between so that is the idea. *1546
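The weighting just described can be seen in a tiny sketch with made-up, unequal-sized groups:

```python
# SS between with weighting: each squared distance is multiplied by its group's n
groups = [[2, 4, 6, 8], [10, 12]]           # made-up samples, n = 4 and n = 2
all_data = [x for g in groups for x in g]
grand_mean = sum(all_data) / len(all_data)  # 42 / 6 = 7.0
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
# the 4-point group's distance (5 - 7) gets 4 votes; the 2-point group's (11 - 7) gets 2
```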

*Okay, so now we actually know this and this, so we can create this guy by putting these two together. *1554

*Now let us talk about sum of squares within now that we know sum of squares between pretty well. *1563

*Well, first thing we need to know is that this idea sum of squares within divided by degrees of *1582

*freedom within is actually going to give us the variance within. *1587

*Let us talk about what this means conceptually. *1593

*This means the spread of all the data points from their own sample mean. *1596

*So this is the picture I want you to think of. *1604

*So every group has its own little sample mean, the X bars, and here are my *1610

*little data points; I want to get the distance of each set away from its own set's mean. *1620

*This is going to give me the within group variation. *1629

*Well, we need to think about first, how many data points do we have? *1635

*Well we have a total of N total, because you need to count all of these data points you need to add them all up. *1643

*The total number of data point. *1652

*So what is the degrees of freedom? *1656

*Well, it is not just N total minus 1. *1659

*How many means did we find? *1661

*We found three means; each time we calculate a mean, we lose a degree of freedom, so it is *1663

*really N total minus the number of means that we calculated, and here that is 3, because we have three groups. *1674

*Remember, we have a letter for how many groups we have, and that is K; so it is really going to *1684

*be N total minus K, the number of groups, and that is going to give us the degrees of freedom within. *1689

*So what is the sum of squares within? *1698

*The sum of squares within is really going to be the sum of squares here, plus the sum of squares here, plus, lastly, the sum of squares here. *1701

*So for each group, just get the sum of squares. *1713

*That is a pretty easy idea: the sum of squares within is just adding up all the sums of squares. *1718

*Now, what does this i mean? *1728

*The i indexes the sum of squares for each group, with i going from 1 to K; so for however many *1730

*groups you have, get that group's sum of squares, added to the next group's sum of squares, added to *1740

*the next group's sum of squares, and these are general formulas that work for two groups, three *1746

*groups, four groups. So that is sum of squares within, and now that we know this and this, we can calculate this. *1751

*So now let us put it all together all at once. *1764

*My apologies, because this may look a little bit tiny on your screen, but hopefully you can *1770

*reconstruct it from what you have seen before, because I am writing the same formulas, just in a *1781

*different format, to show you how they all relate to each other. *1786

*So first, the concept: this is always important, because you can forget the formula, but do not *1789

*forget the concept, because from the concept you can reconstruct the formula. *1796

*It does take a little bit of mental work, but you can do it. *1800

*So first things first, the whole idea of the F is the between group variation over the within group variation. *1803

*So that is the whole idea right there and in order to get that we are going to get the variation between over the variability within. *1817

*Actually, I wrote this in the wrong place; I should have written it down in the formula section. *1831

*So F equals the variability between divided by the variability within. *1839

*So that is the F. *1852

*Now, for the F you cannot just calculate the sums of squares, because really, the F is made up of a *1856

*bunch of sums of squares, and for F you actually need two degrees of freedom, and those are *1861

*the between-group degrees of freedom and the within-group degrees of freedom. *1865

*So these I am just going to leave them empty. *1871

*Now let us talk about between group variability. *1873

*The big idea of this is the spread of sample means around the grand mean. *1876

*So I am going to put the spread of X bars around the grand mean. *1891

*That is what we are really looking for, that idea of this spread of all the sample means around the grand mean. *1897

*The within-group variability, however, is the spread of data points from their own sample mean. *1904

*So for each little group, what is the spread there? *1920

*So that is the idea of these two things. *1923

*Now, in order to break it down into the formula, you first want to get into what s squared *1928

*between is; so if you double-click on that, it takes you here, and if you double-click on this one, it takes you here, s squared within. *1935

*So the variance between the between group variability, this is going to be just the very basic idea of variance. *1943

*Sum of squares over degrees of freedom. *1955

*Same thing here, sum of squares over degrees of freedom. *1958

*That stuff you already know; the only difference is the little "between" here and the little "within" here, so that is the only difference. *1963

*Once you get there, then you can break this down, and you can say: sum of squares *1973

*between, and if you forget what the formula is, you can look up here: spread of X bars around the grand mean. *1978

*So X bar minus grand mean. *1988

*You know you have a whole bunch of them, sum of squares and you are going to go from 1 up *1990

*through K that is how many sample means you have. *2000

*And you want it to be weighted. *2005

*You want it to count more: your distance counts more if you have more data points in your *2008

*sample. And then the degrees of freedom is fairly straightforward. *2016

*It is the number of means minus 1, because once you find your grand mean it is going to limit *2021

*one of those guys, so your degrees of freedom is lessened by one. *2030

*So for sum of squares within, let us go back to this idea of the spread of all the data points away *2034

*from their own sample mean, and that is just going to be all those sums of squares for each little *2043

*group, which you already know the formula for, added together. *2051

*So i goes from 1 up to K. *2055

*And the degrees of freedom is really just this idea that you have all these data *2058

*points, N total, minus however many means you found, because that is going to limit the *2072

*degrees of freedom for those data points, and that number of means is K. *2080

*One other thing I want to say right here: it is just this idea that you might see in your *2083

*textbook or in a statistics package, this idea called mean squared error. *2091

*So this term right here is sometimes going to be called the mean squared error term; that is a common thing that you might see. *2099

*The numerator may be called mean squared between, or you might just see mean square between *2112

*groups or something like that; "between groups" might be written out. *2126

*But almost always this denominator is going to be called mean squared error. *2130

*The reason I want to mention it here is not only to connect this lesson with whatever is going on *2135

*in your classes, but also because mean squared error will be an important term later on, when *2142

*we get to other kinds of ANOVA. *2148
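Putting the pieces together, here is a minimal sketch (on made-up groups, not the lecture's Facebook data) of F as mean squared between over mean squared error:

```python
# F = (SS_bet / df_bet) / (SS_w / df_w), i.e. MS_between / MS_error
groups = [[3, 4, 5], [6, 7, 8], [10, 11, 12]]  # made-up samples
k = len(groups)                                # number of groups
n_total = sum(len(g) for g in groups)          # total number of data points
all_data = [x for g in groups for x in g]
grand_mean = sum(all_data) / n_total

means = [sum(g) / len(g) for g in groups]
ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

ms_between = ss_between / (k - 1)      # df between = K - 1
ms_error = ss_within / (n_total - k)   # df within = N_total - K ("mean squared error")
f_stat = ms_between / ms_error
```

A large F means the between-group variation is large relative to the within-group variation we cannot explain.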

*So now let us get to examples. *2151

*So first, who uploads more photos? *2156

*People of unknown ethnicity, Latino, Asian, black, or white Facebook users? *2158

*So what are our hypotheses? And sorry, you might be like, how will I ever know? *2164

*This data set is found in your downloads. *2172

*And so the download looks like this: there is a column for however many uploaded photos, so here is *2176

*uploaded photos here; this person has uploaded 892 photos, and their ethnicity is coded zero. *2185

*And zero is just a stand-in for the term unknown or blank, so they may have left it blank. *2191

*So the Latino sample is 1, the Asian sample is 2, the black or African-American sample is 3, and the white or European-American sample is 4. *2198

*And so you can look through that data set; I recoded it just so that we can easily see where we are. *2210

*Okay let us start off with our hypotheses. *2217

*For these hypotheses, the hypothesis to rule them all, the null hypothesis should say that all *2222

*of these ethnicities, even unknown, are all the same when it comes to uploading photos. *2231

*So we have mu sub E0; I call these E0, E1, E2, E3, E4 only because that is what is in the data set. *2239

*The mu of ethnicity 0 equals the mu of ethnicity 1, equals the mu of ethnicity 2, equals the mu of ethnicity 3, equals the mu of ethnicity 4. *2251

*So we could say this in order to, say look they are all the same mathematically. *2265

*So this is how you write out that idea of they are all the same, they all came from the same population. *2276

*The reason we want to use E0 E1 E2 is just that it is going to make it a lot easier for us to write *2281

*the alternative hypotheses and this also helps us keep in mind why are we comparing the different groups. *2291

*What is the variable they differ on? The variable is ethnicity; they all differ on that *2298

*variable, they have different values of it, and that is the between-subjects variable. So at least in *2307

*our sample, people are either Latino or Asian or black or white, although they can be both, just not in our sample. *2315

*So the alternative hypothesis is that the mu sub E are not all the same, not all equal. *2323

*We do not actually write 'does not equal', because we do not know whether it is these two that are not equal, *2346

*or those two that are not equal, or this one and that one that are not equal, right? *2363

*So we do not make those claims, and that is why you do not want to write those not-equals; *2367

*you want to just write a sentence that the means are not all the same. *2371

*Now let us decide on a significance level; just like before, let us decide on a significance level of .05, as is commonly accepted. *2376

*And because we are going to be calculating an F statistic, we are going to be comparing it to this alpha. *2384

*So it is always one tail, always only on the positive tail and so this is the F distribution. *2397

*Okay now let us talk about the decision stage so in the decision stage you want to draw the F distribution, just like I did so here is alpha, here is zero. *2404

*We need to find the critical F, but in order to find the critical F we actually need to know the two *2419

*different degrees of freedom, because this distribution is going to be different based on those two degrees of freedom. *2434

*So we need to know the degrees of freedom in the numerator which in this case is the degrees of *2441

*freedom between and the degrees of freedom in the denominator and that is going to be the *2448

*degrees of freedom within, we could actually calculate that. *2457

*The degrees of freedom between is K - 1, and here our K is 1, 2, 3, 4, 5; K equals 5, five groups, so that gives *2460

*a degrees of freedom of 4, and the degrees of freedom within is going to be N total minus K. *2473

*And so let us see how many we have total. *2484

*So we could just do a count; if you go down here, I have actually sort of filled it in for *2488

*you a little bit just so that it is nice and formatted. I used E1 through E5, but really, one of them should be E0. *2500

*So K is five, we have five different groups; the degrees of freedom between is going to be 5 - 1; *2511

*for the degrees of freedom within, we are going to need to know the total number of data points we *2520

*have, so we need to count all the data points that we have. *2527

*All these different data points, minus K; so here is K. *2531

*So that is 94, so apparently we have 99 people in our sample. *2541
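As a quick check, the two degrees of freedom described above can be sketched in a few lines of Python (K = 5 groups and N = 99 total people come from this lecture's data set):

```python
# Degrees of freedom for a one-way independent-samples ANOVA.
K = 5   # number of groups: unknown, Latino, Asian, black, white
N = 99  # total number of people in the sample

df_between = K - 1  # numerator degrees of freedom
df_within = N - K   # denominator degrees of freedom

print(df_between, df_within)  # 4 94
```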

*So then we can find the critical F. *2547

*Now, once we have the degrees of freedom between and the degrees of freedom within (here, just *2550

*to remind you, this is the numerator and this is the denominator degrees of freedom). *2555

*Once we have that you can look it up in the back of your book. *2561

*Look for the F distribution chart or table; you need to find the columns and *2564

*rows. Usually the columns will say degrees of freedom numerator and the rows degrees of freedom *2574

*denominator, and then you can use both to look up your critical F at 5%, or you can look it up in Excel. *2580

*And the way we do that is by using FINV, because FDIST gives you the probability, while with FINV you put in a probability and get the F value. *2594

*So the probability is .05, only one tail, so we do not have to worry about that. *2607

*The first degrees of freedom we are looking for is the numerator one and the second degrees of *2611

*freedom we are looking for is the denominator one. *2615

*And so when we look at that, we see 2.47 to be our critical F. *2620

*So your critical F is 2.47, and so we need an F value greater than that, or a P value less than .05, in *2633

*order to reject our null hypothesis that they are all the same, all come from the same population. *2644
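If Excel's FINV is not handy, the same critical F can be looked up with scipy (using scipy here is an assumption; any F table with the right degrees of freedom works just as well):

```python
from scipy.stats import f

df_between, df_within = 4, 94
alpha = 0.05

# Excel's FINV(alpha, df1, df2) is the upper-tail cutoff,
# i.e. the (1 - alpha) quantile of the F distribution.
critical_f = f.ppf(1 - alpha, df_between, df_within)
print(round(critical_f, 2))  # ~2.47
```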

*Okay, so on to step 4 of our same question. *2650

*We need to calculate the sample statistic as well as the P value. In order to calculate the *2658

*sample statistic we need to calculate F, because F is the only test statistic that will help us rule on our omnibus hypothesis. *2666

*Remember that is going to be the variance between over the variance within. *2675

*And once we get our F, then we can find the P value at that F. *2681

*So what is the probability of getting an F value that big or bigger given that the null hypothesis is true. *2688

*And we want that P value to be very small. *2697

*So let us go ahead and go to our example. *2700

*Example 1, and here I have already put in these formulas for you, but one thing that I like to do for *2706

*myself is I like to tell myself sort of what I need. So I need this, and then I break it down one *2715

*row at a time: the next row is going to be the SS between over the degrees of freedom *2722

*between, and then I can find each one of those things separately. And then I am also going to *2730

*break down the variance within into the sum of squares within and degrees of freedom within, and I break those down. *2736

*Okay so first things first, I want to find the variance between but in order to do that I need to find *2743

*sum of squares between, and that is this idea that I take every mean; so I need the mean for every *2750

*single one of these groups: the mean for unknown, the mean for Latino users, for Asian users, and *2758

*so on and so forth and I need to find the grand mean. *2764

*I need to find the squared distances between those guys. *2768

*Okay so first, I need to know how many people are in this particular sample. *2770

*So let us find the count of E0. *2781

*That is our zero ethnicity for unknown people. *2785

*So I am going to count those people, and then I am also going to count E1, and also count *2791

*E2; I am also going to count my E3, and finally I am going to count my E4. *2807

*Now these are the same data points that I am going to be using over and over again, so what I am *2830

*going to do is lock down my data ranges. *2845

*Say: use this data whenever you are talking about E sub zero. *2848

*Use this data whenever I am talking about E1. *2854

*Use this data whenever I talk about E2 and use this data whenever I talk about E3, use this data when I talk about E4. *2862

*Now the nice thing about this is that you could see that they almost all have 20 data points in each sample. *2879

*The only one that differs is the unknown population the unknown ethnicity sample and they are just off by one. *2891

*So, what is the mean of each sample? *2900

*One thing I could do is just copy and paste these across, but what I really want is, I do *2904

*not want to get the count anymore, I want to get the average. *2915

*So once I do that I can just type AVERAGE instead of COUNT, which saves me a little bit of work, and I find all these X bars: X bars for 0, 1, 2, 3, 4. *2918

*Now let us find the grand mean. *2941

*The grand mean is going to be the mean for everybody, so that is going to be the average of every single data point that we have. *2944

*And we really only need to find the grand mean once. *2951

*If you want you could just point to the grand mean, copy and paste that down it should be the *2962

*same grand mean over and over again or you could just refer to this top one every single time. *2972

*So now let us put together our N times the distance squared before we add them all up. *2978

*So we have N times the distance, X bar minus the grand mean, square that, and that is a huge *2990

*number; and now we are going to sum them all up. *3004

*Equal sign, sum and I want to sum all of this up. *3011

*I get this giant number 8 million something. *3019

*So huge number. *3023

*So once I have that I can just put a pointer here. *3025

*I just put equal sign and point to this sum. *3031

*And that is really the sum of squares between. *3035
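The sum of squares between that this walkthrough builds in Excel can be sketched generically; here with small made-up toy data, not the Facebook numbers:

```python
# SS_between = sum over groups of n_g * (group mean - grand mean)^2
groups = [[1, 2, 3], [2, 3, 4], [6, 7, 8]]  # toy data: three groups

all_points = [x for g in groups for x in g]
grand_mean = sum(all_points) / len(all_points)

ss_between = sum(
    len(g) * (sum(g) / len(g) - grand_mean) ** 2
    for g in groups
)
print(ss_between)  # 42.0
```

This is exactly the "N times the squared distance from the grand mean, summed over groups" recipe from the lecture.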

*What about degrees of freedom between, have I already found that? *3039

*Yes I have, I found it up here. *3047

*So I am not going to calculate that again I am just going to point to it. *3049

*Once I have these two now I can get variance between groups. *3054

*So it is this: the sum of squares divided by the degrees of freedom. *3060

*We saw the giant number; it makes sense that if you take 8 million something and divide by 4, you get 2 million something. *3067

*It is still a giant number but is it more giant than the variance within? *3072

*I do not know, let us see. *3080

*So in order to find the variance within then I need to find the sum of squares within as well as the degrees of freedom within. *3082

*So how do I find sum of squares within? *3087

*Well, one thing I could do I could go to each data point and find mean, subtract each X from each *3093

*mean, square it, add them all up, or I could use a little trick. *3101

*I might use a little trick. *3107

*So just to remind you. *3113

*So here is my little trick. *3113

*So remember, the variance of anything is going to be the sum of squares divided by N - 1. *3116

*So if I find variance and I multiply it by N-1 I could get my sum of squares, I could do variance times N-1. *3129

*I can use that trick if I use Excel. *3146
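The shortcut described here, recovering a sum of squares from a variance, can be sketched with toy numbers:

```python
import statistics

data = [2, 4, 6]  # toy sample

# Direct sum of squares: squared deviations from the sample mean.
mean = statistics.mean(data)
ss_direct = sum((x - mean) ** 2 for x in data)

# Shortcut: sample variance is SS / (n - 1), so SS = variance * (n - 1).
ss_trick = statistics.variance(data) * (len(data) - 1)

print(ss_direct, ss_trick)  # both 8
```

In Excel this is the same move as `=VAR(range)*(COUNT(range)-1)`.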

*So here is what I am going to do. *3152

*I am going to find the variance. *3157

*First it might be easy if I copied these. *3159

*Just so that I do not have to go and select those. *3166

*If I find the variance and then I multiply it by N -1, I get my sum of squares. *3171

*I am just working backwards from what I know about variance. *3186

*So I am going to do that same thing here and get my variance and multiply it by N minus 1. *3189

*Get my variance multiplied by N – 1, finally variance multiplied by N – 1. *3199

*Obviously you do not have to do this; you could go ahead and actually compute the sum of squares *3234

*for each set of data, but that would take up a lot of room and typically more time. So if Excel is *3243

*handy to you, then I really highly recommend the shortcut; and then we just want to sum all these up. *3251

*That sums all of the sums of squares, and we get this giant number. *3258

*We get 42 million, really large number. *3263

*But our degrees of freedom within is also a larger number than our degrees of freedom between. *3279

*And so if I find out my variance within then let us see. *3287

*Is this smaller or bigger? *3295

*Well we see that this number, 450,000, is a smaller number than 2 million, so that is looking good for our F statistic. *3297

*So our F statistic is the variance between divided by the variance within, and we get 4.48, and *3312

*that is quite a bit larger than our critical F of 2.47. I have forgotten to put a place for the P value, *3323

*but let us calculate the P value here. In order to calculate the P value we use FDIST, and we put in *3334

*the F value and the degrees of freedom for the numerator as well as the degrees of freedom for the denominator. *3343

*And we get P = .002, quite a bit smaller than .05. *3353
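Outside of Excel, FDIST(F, df1, df2) corresponds to the upper tail of the F distribution, which can be sketched with scipy (an assumption here) using the rounded values from this example:

```python
from scipy.stats import f

# Values from Example 1 in the lecture (rounded).
f_statistic = 4.48
df_between, df_within = 4, 94

# Excel's FDIST(F, df1, df2) is the upper-tail probability,
# i.e. the survival function of the F distribution.
p_value = f.sf(f_statistic, df_between, df_within)
print(round(p_value, 3))  # ~0.002
```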

*So that is a good thing so in step five we reject the null. *3362

*But which group is different, or are multiple groups different from each other? We do not know. *3366

*We just know that the groups are not all the same that is all we know. *3374

*Okay so we got a P value equals .002 so we rejected the null hypothesis. *3378

*Here is the thing, remember at the end of this, we still do not know who is who, we just know that somebody is different. *3390

*At the end of this, what you want to do is run what are going to be like little pairwise t-tests. *3398

*They are often called contrasts, and you want to do that in order to figure out *3405

*which group actually differs from which other group, not just whether some group differs from *3414

*some other group; and so you want to do a little bit more after you do this. *3420

*These are called post hoc tests. *3425

*And in a lot of ways they are very similar to t-tests, where you look at pairs. *3427

*There is one change: they change the sort of P value that you are looking for. But you want *3439

*to do the post hoc tests afterwards, and do all the little comparisons, so that you can figure out who is different from whom. *3446

*But you are only allowed to do a post hoc test if you rejected the null hypothesis. *3452

*So you are not allowed to do a post hoc test if you have not rejected the null hypothesis; that is why *3457

*we cannot just skip to that step from the very beginning. *3464

*So the first thing we need to do, if you reject, is do post hoc tests. *3468

*Another thing you need to do is find the effect size. *3472

*In the case of an F test, you are not going to find something like Cohen's d or Hedges' g. *3475

*You are not going to find that kind of effect size. *3486

*You are going to find what is called eta squared. *3488

*Eta squared looks like an n squared (η²). *3490

*And eta squared is going to give you an idea of the effect size. *3495
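Eta squared is the between-groups sum of squares as a proportion of the total sum of squares. A sketch using the rough Example 1 figures from this lecture ("8 million something" between, "42 million" within; these are approximations, not exact spreadsheet values):

```python
# Eta squared: proportion of total variation explained by group membership.
ss_between = 8_000_000   # approximate value from the lecture
ss_within = 42_000_000   # approximate value from the lecture

eta_squared = ss_between / (ss_between + ss_within)
print(round(eta_squared, 2))  # 0.16
```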

*Now let us go to example 2. *3499

*So also the data is provided in your download. *3504

*A pharmaceutical company wants to know whether new drug had the side effect of causing patients to become jittery. *3508

*In 3 randomly selected samples, the patients were given 3 mild doses of the drug: *3513

*0, 100, or 200 mg, and they were also given a finger tapping exercise. *3518

*Does this drug affect this finger tapping behaviour? *3523

*With this one I did not format it really nicely for you, because I want you to sort of figure it out as you go, but do not worry, I will do this with you. *3527

*So first things first: the omnibus hypothesis. *3536

*And that is that all three dosages are the same so mu of dosage zero = mu of dosage 100 = mu of dosage 200. *3538

*And the alternative hypothesis is that the mu of the dosages are not all the same. *3563

*Okay step 2. *3575

*Alpha = .05. *3579

*Step three decision stage how do we make our decision to reject or fail to reject. *3581

*First you want to draw that F distribution and color in that alpha = .05; that is the error rate we are willing to tolerate. *3591

*Now what is our critical F? *3603

*In order to find our critical F we need to know the degrees of freedom for between and the degrees of freedom for within. *3607

*So if you go to the worksheet for example 2, then you can see this data set. *3615

*Now usually this is not the way data is set up, especially if you use SPSS or some of these other statistics packages. *3627

*Usually you will see the data for one person on one line just like this. *3635

*Just like example 1 the data for one person their ethnicity and their photos are on one line. *3641

*You will rarely see this, but you may see it in textbooks, so I do want you to sort of pay attention *3649

*to that. Here, different people were given the different dosages, so you *3655

*can assume each cell to be a different person. *3660

*So, we are on step three, the decision stage, and in order to figure out our critical F, we need to know *3663

*the degrees of freedom between and degrees of freedom within. That is not so pretty anymore; *3677

*it takes a long time to put all the little fancy things in there, but it is very easy. *3683

*So for the degrees of freedom between, in order to find that, it would be really helpful if we knew K, *3695

*how many groups, right? And there are three groups, three different drug dosages. *3699

*So it is K - 1, a degrees of freedom of 2. *3705

*In order to find degrees of freedom within we need to know N total. *3710

*How many total data points do we have? *3716

* And we could easily find that in Excel using COUNT and selecting all our data points; so there are 30 people, 10 people in each group. *3719

*So that is going to be N total minus K. *3730

*That should be 27. *3736

*Once we know that, we can find our critical F: use FINV with a probability of .05; the degrees of freedom *3738

*for the numerator is going to be the degrees of freedom between, the degrees of freedom for the *3747

*denominator is the degrees of freedom within, and we get 3.35 as our critical F. *3752

*Note that this is a larger critical F than before, when we had more data points. *3760

*We had about 99 data points in the other example, and that brought down our critical F. *3767

*Now let us go to step 4, step 4 we need to calculate the sample F as well as the P value. *3772

*Let us talk about how you do F. *3784

*Here we need the variance, variance between divided by the variance within. *3786

*How do we find the variance between? *3795

*Well that is going to be the sum of squares between divided by the degrees of freedom between. *3797

*How do we find sum of squares between? *3805

*Well remember, the idea of it is going to be the means for each group, distance from that mean *3809

*to the grand mean, square that distance, weight that distance by how many N we have, and then add them all up. *3816

*So in order to get that, let me set up some room up and down here; what else will we put in? *3826

*The variance within, that is going to be the sum of squares within divided by the degrees of *3837

*freedom within, just so I know how much room I have to work with. *3844

*Okay, so first it might be helpful to know which groups we are talking about, the dosages: so it is *3849

*D0, D100, and D200, those are three different groups. *3857

*What is the N for each of these groups, what is the X-bar for each of these groups, what is the *3865

*grand mean and then we want to look at N times X bar minus the grand mean, we want to *3872

*square that and then once we have that, now we want to add these up and so I will put sum here *3883

*just so that I can remember to add them up. *3894

*Okay so the N for all of these is 10, we already know that, and let us find the X-bar. *3897

*So this is the average of this, and then the next one is the same thing, we know it is the same *3906

*thing, the average, except of column B; and the next one is the average again, of column C, for 200. *3922

*How do we find the grand mean? *3934

*We find the average, and we could put a little pointer so that they all have the same grand mean. *3937

*Now we could calculate the weighted distance squared for each of these group means. *3952

*So it is N times X bar minus the grand mean, squared. *3962

*And once you have that you can just drag it all the way down, and here we sum these all up. *3970

*We sum these weighted differences up and we get a sum of squares of 394. *3983

*And we already know the degrees of freedom between groups, so we can put in this divided by this number. *3994

*We get 197. *4006

*Now let us see. *4009

*Is it going to be bigger or smaller than the variance within? Right, and in order to find the *4011

*variance within, it helps to just sort of conceptually remember: what is the sum of squares *4018

*within? It is the sum of squares for each of these groups from their own means. *4022

*And so the sum of squares for each of these dosages is going to be, and I am just going to use *4029

*that shortcut, the variance for this set multiplied by nine, that is N - 1. *4041

*And I am just going to take that and I am going to say do that for the second column as well as the third column. *4058

*And once I do that, I just want to sum these all up, and I get 419. *4073

*So now I have my sum of squares within. *4082

*I divide that by my degrees of freedom within and I get 15.53; and even before we do anything, *4085

*we can see, wow, that variance between is a lot bigger than the variance within. *4095

*So we divide, and we get 12.69, 12.7 right? And that is much larger than the critical F that we set. *4099

*What is the P value for this? *4111

*We use FDIST: we put in the F value we got, we put in the degrees of freedom between and degrees of freedom within, and we get a P value of .0001. *4114
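As a check on Example 2's arithmetic, the F ratio follows directly from the sums of squares and degrees of freedom stated above (values rounded as in the lecture):

```python
# Example 2 summary values from the lecture (rounded).
ss_between, df_between = 394, 2
ss_within, df_within = 419, 27

s2_between = ss_between / df_between  # variance between groups
s2_within = ss_within / df_within     # variance within groups

f_statistic = s2_between / s2_within
print(round(f_statistic, 1))  # 12.7
```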

*So we're pretty sure that there is a difference between these 3 groups in terms of finger tapping. *4131

*We just do not know what that difference is. *4138

*So step five would be: reject the null. And once we have decided to reject the null, then you would go on to *4140

*do post hoc tests as well as calculating effect size. *4153

*So that is one-way ANOVA with independent samples. *4157

*Thanks for using educator.com. *4163
