WEBVTT
00:00:00.000 --> 00:00:02.200
Hi and welcome to www.educator.com.
00:00:02.200 --> 00:00:07.200
Today we are going to talk about confidence intervals for the difference of two independent means.
00:00:07.200 --> 00:00:13.500
It is pretty important that these are for independent means, because later we are going to cover non-independent or paired means.
00:00:13.500 --> 00:00:22.800
We have been talking about how to find confidence intervals and hypothesis testing for one mean.
00:00:22.800 --> 00:00:28.700
We are going to talk about what that means for how we go about doing that for two means.
00:00:28.700 --> 00:00:32.300
We are going to talk about what two means means.
00:00:32.300 --> 00:00:41.600
We are going to talk a little bit about μ notation and we are going to talk about sampling distribution of the difference between two means.
00:00:41.600 --> 00:00:48.000
I am going to shorten this as SDOD. This is just my shorthand, it is not official or anything,
00:00:48.000 --> 00:00:55.300
because it is long to say sampling distribution of the difference between two means, but that is what I mean.
00:00:55.300 --> 00:01:06.500
We will talk about the rules of the SDOD and those are going to be very similar to the CLT (the central limit theorem) with just a few differences.
00:01:06.500 --> 00:01:15.500
Finally, we will set it all up so that we can find and interpret the confidence interval.
00:01:15.500 --> 00:01:21.300
One mean versus two means.
00:01:21.300 --> 00:01:30.900
So far we have only looked at how to compare one mean against some population, but that is not usually how scientific studies go.
00:01:30.900 --> 00:01:33.500
Most scientific studies involve comparisons.
00:01:33.500 --> 00:01:42.400
Comparisons either between different kinds of water samples, or language acquisition for babies who had some experience versus babies who did not.
00:01:42.400 --> 00:01:46.300
Scores from the control group versus the experimental group.
00:01:46.300 --> 00:01:52.600
In science we are often comparing two different sets of data, two different samples.
00:01:52.600 --> 00:02:00.400
Two means really means two samples.
00:02:00.400 --> 00:02:11.800
Here in the one mean scenarios we have one sample and we compare that to an idea in hypothesis testing
00:02:11.800 --> 00:02:19.700
or we use that one sample in order to derive the potential population mean.
00:02:19.700 --> 00:02:22.900
But now we are going to be using two different means.
00:02:22.900 --> 00:02:25.400
What do we do with those two means?
00:02:25.400 --> 00:02:32.600
Do we just do the one sample thing two times or is there a different way?
00:02:32.600 --> 00:02:35.200
Actually, there is a different and more efficient way to go about this.
00:02:35.200 --> 00:02:39.300
Two means is a different story.
00:02:39.300 --> 00:02:42.400
It is a related but different story.
00:02:42.400 --> 00:02:49.700
In order to talk about two means and two samples, we have to talk about some new notation.
00:02:49.700 --> 00:02:55.800
This is totally arbitrary that we use x and y.
00:02:55.800 --> 00:03:01.800
You could use j and k or m and n, whatever you want.
00:03:01.800 --> 00:03:08.900
X and y are the generic variables that we use.
00:03:08.900 --> 00:03:11.000
Feel free to use your favorite letters.
00:03:11.000 --> 00:03:23.000
One sample will just be called x and all of its members in the sample will be x sub 1, x sub 2, x sub 3.
00:03:23.000 --> 00:03:28.400
When we say x sub i, we are talking about all of these little guys.
00:03:28.400 --> 00:03:36.400
We do not just call the other sample x as well, because we would get confused.
00:03:36.400 --> 00:03:41.200
We cannot call it x2 because x sub 2 already has a meaning.
00:03:41.200 --> 00:03:44.000
What we call it is y.
00:03:44.000 --> 00:03:49.000
Y sub i now means all of these guys.
00:03:49.000 --> 00:03:51.800
We could keep them separate.
00:03:51.800 --> 00:03:56.200
In fact this x and y is going to follow us from here on out.
00:03:56.200 --> 00:04:01.100
For instance when we talk about the mean of x we call it the x bar.
00:04:01.100 --> 00:04:03.400
What would be the mean of y?
00:04:03.400 --> 00:04:06.300
Maybe y bar right.
00:04:06.300 --> 00:04:07.500
That makes sense.
00:04:07.500 --> 00:04:13.000
And if you call this b, this will be b bar.
00:04:13.000 --> 00:04:15.900
It just follows you.
00:04:15.900 --> 00:04:23.700
When we are talking about the difference between two means we are always talking about this difference.
00:04:23.700 --> 00:04:27.100
That is going to be x bar - y bar.
00:04:27.100 --> 00:04:30.700
Now you could also do y bar - x bar, it does not matter.
00:04:30.700 --> 00:04:34.500
But that is definitely what we mean by the difference between two means.
00:04:34.500 --> 00:04:45.000
We could talk about the standard error of a whole bunch of x bars, that is, the standard error of x, and likewise the standard error of y.
00:04:45.000 --> 00:04:52.300
You could also talk about the variance of x and the variance of y.
00:04:52.300 --> 00:04:58.000
You can have all kinds of things; they just need something to denote that they are a little different.
00:04:58.000 --> 00:05:12.100
Take the standard error of x: another way you could write it is sigma sub x bar, because we are not just talking about a generic standard error.
00:05:12.100 --> 00:05:21.900
When we say standard error, you need to keep in mind if we double-click on it that means the standard deviation of a whole bunch of means.
00:05:21.900 --> 00:05:28.200
Standard deviation of a whole bunch of x bars.
00:05:28.200 --> 00:05:34.300
Sometimes we do not have sigma so we cannot get this value.
00:05:34.300 --> 00:05:45.600
We might have to estimate sigma from s and that would be s sub x bar.
00:05:45.600 --> 00:05:53.600
If we wanted to know how to get this that would just be s sub x.
00:05:53.600 --> 00:06:07.100
Notice that s sub x is different from s sub x bar: the second is the standard error, and it is the actual standard deviation of your sample ÷ √n.
00:06:07.100 --> 00:06:11.700
Not just any n, but the n of your sample x.
00:06:11.700 --> 00:06:27.600
In this way we can precisely denote that we are talking about the standard error of x, the standard deviation of x, and n sub x.
00:06:27.600 --> 00:06:30.400
You could do the same thing with y.
00:06:30.400 --> 00:06:42.600
The standard error of y, if you had sigma, you can just call it sigma sub y bar because it is the standard deviation of a whole bunch of y bars.
00:06:42.600 --> 00:06:50.700
Or if you do not have sigma you could estimate sigma and use s sub y bar.
00:06:50.700 --> 00:07:03.000
Instead of just getting the standard deviation of x, we would get the standard deviation of y and divide that by √n sub y.
00:07:03.000 --> 00:07:10.100
It makes everything a little more complicated because now I have to write sub x and sub y after everything.
00:07:10.100 --> 00:07:18.500
But it is not hard because the formula if you look remains exactly the same.
00:07:18.500 --> 00:07:26.200
The only thing that is different now is that we just add a little pointer to say we are talking
00:07:26.200 --> 00:07:31.700
about the standard deviation of our x sample or standard deviation of our y sample.
00:07:31.700 --> 00:07:46.200
Even though this looks a little more complicated, deep down at the heart of the structure it is still: standard error equals the standard deviation of the sample ÷ √n.
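To make that notation concrete, here is a minimal sketch in Python (the two sample lists are made up for illustration) of computing the standard error of x and of y, each as that sample's standard deviation over the square root of its own n:

```python
import math

def standard_error(sample):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    n = len(sample)
    mean = sum(v for v in sample) / n
    # Sample variance uses n - 1 in the denominator (Bessel's correction).
    variance = sum((v - mean) ** 2 for v in sample) / (n - 1)
    return math.sqrt(variance) / math.sqrt(n)

# Two hypothetical samples, x and y, each with its own n.
x = [4.0, 5.0, 6.0, 7.0, 8.0]
y = [10.0, 12.0, 14.0, 16.0]

se_x = standard_error(x)  # s sub x / sqrt(n sub x)
se_y = standard_error(y)  # s sub y / sqrt(n sub y)
```

The formula never changes; only the subscript telling you which sample it came from does.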
00:07:46.200 --> 00:07:57.600
Let us talk about what this means, the sampling distribution of the difference between two means.
00:07:57.600 --> 00:08:00.500
Let us first start with the population level.
00:08:00.500 --> 00:08:11.100
When we talk about the population, right now we do not know anything about it.
00:08:11.100 --> 00:08:20.500
We do not know its shape, whether it is uniform, its mean, or its standard deviation.
00:08:20.500 --> 00:08:27.200
Let us call this one x and this one y.
00:08:27.200 --> 00:08:34.300
From this x population and this y population we are going to draw out samples and
00:08:34.300 --> 00:08:42.100
create the sampling distribution and that is the SDOM (the sampling distribution of the mean).
00:08:42.100 --> 00:08:50.200
Here is a whole bunch of x bars and here is a whole bunch of y bars.
00:08:50.200 --> 00:08:59.700
Thanks to the central limit theorem if we have big enough n and all that stuff then we know that we could assume normality.
00:08:59.700 --> 00:09:05.400
Here we know a little bit more than we know about the population.
00:09:05.400 --> 00:09:17.100
We know that in the SDOM, the standard error, I will write it with s from here on because
00:09:17.100 --> 00:09:24.800
we are basically going to assume real-life examples where we do not have the population standard deviation.
00:09:24.800 --> 00:09:30.300
The only time we get that is in problems given to you in a statistics textbook.
00:09:30.300 --> 00:09:45.500
We will call it s sub x bar, and that will be the standard deviation of x / √n sub x.
00:09:45.500 --> 00:10:01.100
We know those things and we also know the standard error of y and that is going to be the standard deviation of y ÷ √n sub y.
00:10:01.100 --> 00:10:07.300
Notice that on the right side you do not write s sub y bar again, because it would not make sense that
00:10:07.300 --> 00:10:12.100
the standard error would equal the standard error divided by something else.
00:10:12.100 --> 00:10:13.900
That would not quite make sense.
00:10:13.900 --> 00:10:20.800
You want to make sure that you keep this s special and different because standard error
00:10:20.800 --> 00:10:25.000
is talking about entirely different idea than the standard deviation.
00:10:25.000 --> 00:10:38.300
Now that we have two SDOMs, if we just decided to stop here, then we would not need to know anything new about creating a confidence interval for two means.
00:10:38.300 --> 00:10:44.600
You would just create two separate confidence intervals: consider that x bar,
00:10:44.600 --> 00:10:48.800
consider that y bar, and construct a 95% confidence interval for both of these guys.
00:10:48.800 --> 00:10:49.900
You are done.
00:10:49.900 --> 00:11:01.500
Actually, what we want is not to take the two means and get two separate sampling distributions.
00:11:01.500 --> 00:11:07.800
We would like one sampling distribution of the difference between two means.
00:11:07.800 --> 00:11:11.600
That is what I am going to call SDOD.
00:11:11.600 --> 00:11:22.500
Here is what you have to imagine, in order to get the SDOM what we had to do is go to the population and draw out samples of size n and plot the means.
00:11:22.500 --> 00:11:24.900
Do that millions and millions of times.
00:11:24.900 --> 00:11:26.800
That is what we had to do here.
00:11:26.800 --> 00:11:39.300
We also had to do that here: we went to the entire population of y, pulled out samples, and plotted the means until we got this distribution of means.
00:11:39.300 --> 00:11:54.700
Imagine pulling out a mean from each of these randomly, then finding the difference of those means and plotting that difference down here.
00:11:54.700 --> 00:11:58.600
Do that over and over again.
00:11:58.600 --> 00:12:07.000
You would start to get a distribution of the difference of these two means.
00:12:07.000 --> 00:12:14.500
You would get a distribution of a whole bunch of x bar - y bar.
00:12:14.500 --> 00:12:22.000
That is what this distribution looks like and that distribution looks normal.
00:12:22.000 --> 00:12:27.200
This is actually one of the principles of probability distributions that we have covered before.
00:12:27.200 --> 00:12:29.600
I think we have covered it in binomial distributions.
00:12:29.600 --> 00:12:44.000
I know this is not a binomial distribution, but the same principles apply here: if you draw from two normally distributed populations
00:12:44.000 --> 00:12:49.100
and subtract those draws from each other, you will get a normal distribution down here.
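As a sketch of that thought experiment, assuming two made-up normal populations, we can simulate drawing a mean from each SDOM, subtracting, and repeating many times; the resulting pile of differences centers near μ sub x minus μ sub y:

```python
import random
import statistics

random.seed(42)  # reproducible sketch

# Two hypothetical normal populations (all parameters are made up).
mu_x, sigma_x, n_x = 100.0, 15.0, 30
mu_y, sigma_y, n_y = 90.0, 10.0, 30

diffs = []
for _ in range(5000):
    # Draw one sample mean from each population's SDOM...
    x_bar = statistics.fmean(random.gauss(mu_x, sigma_x) for _ in range(n_x))
    y_bar = statistics.fmean(random.gauss(mu_y, sigma_y) for _ in range(n_y))
    # ...and plot the difference "down here" in the SDOD.
    diffs.append(x_bar - y_bar)

# The SDOD centers near mu_x - mu_y (here, 10) and looks normal.
center = statistics.fmean(diffs)
```

A histogram of `diffs` would show the bell shape the lecture describes.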
00:12:49.100 --> 00:13:03.000
We have this thing and what we now want to find is not just the μ sub x bar or μ sub y bar, that is not what we want to find.
00:13:03.000 --> 00:13:16.000
What we want to find is something like the μ of x bar - y bar because this is our x bar - y bar and we want to find the μ of that.
00:13:16.000 --> 00:13:20.400
Not only that but we also want to find the standard error of this thing.
00:13:20.400 --> 00:13:27.200
I think we can figure out what that might be.
00:13:27.200 --> 00:13:32.100
At least the notation for it, that would be the standard error.
00:13:32.100 --> 00:13:36.900
The standard error will always have these x bar and y bar subscripts.
00:13:36.900 --> 00:13:49.200
This is how you notate the standard deviation of x bar - y bar and that is called
00:13:49.200 --> 00:13:56.700
the standard error of the difference; "the difference" is a shortcut way of saying x bar - y bar.
00:13:56.700 --> 00:13:59.000
We could just say of the difference.
00:13:59.000 --> 00:14:04.700
You can think of this as the sampling distribution of a whole bunch of differences of means.
00:14:04.700 --> 00:14:16.100
In order to find this, we again draw on probability principles, but let us actually go to variance first.
00:14:16.100 --> 00:14:29.400
If we talk about the variance of this distribution that is going to be the variance of x bar + the variance of y bar.
00:14:29.400 --> 00:14:34.200
If you go back to your probability principles you will see why.
00:14:34.200 --> 00:14:41.100
From this we can actually figure out the standard error by square rooting both sides.
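In the lecture's notation, the variance rule and the square-rooting step look like this:

```latex
s^{2}_{\bar{x}-\bar{y}} = s^{2}_{\bar{x}} + s^{2}_{\bar{y}}
\qquad\Longrightarrow\qquad
s_{\bar{x}-\bar{y}} = \sqrt{s^{2}_{\bar{x}} + s^{2}_{\bar{y}}}
= \sqrt{\frac{s_{x}^{2}}{n_{x}} + \frac{s_{y}^{2}}{n_{y}}}
```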
00:14:41.100 --> 00:14:47.700
We are just building on all the things we have learned so far.
00:14:47.700 --> 00:14:49.200
We know population.
00:14:49.200 --> 00:14:51.200
We know how to do the SDOM.
00:14:51.200 --> 00:14:57.800
We are going to use two SDOMs in order to create a sampling distribution of differences.
00:14:57.800 --> 00:15:09.500
Let us talk about the rules of the SDOD and these are going to be very, very similar to the CLT.
00:15:09.500 --> 00:15:19.000
The first thing is this, if SDOM for x and SDOM for y are both normal then the SDOD is going to be normal too.
00:15:19.000 --> 00:15:21.900
Think about it: when are these normal?
00:15:21.900 --> 00:15:24.500
These are normal if your population is normal.
00:15:24.500 --> 00:15:26.800
That is one case where it is normal.
00:15:26.800 --> 00:15:29.200
This is also normal when n is large.
00:15:29.200 --> 00:15:38.900
In certain cases, you can assume that the SDOM is normal, and if both of these have met those conditions,
00:15:38.900 --> 00:15:42.200
then you can assume that the SDOD is normal too.
00:15:42.200 --> 00:15:49.100
We have conditions where we can assume it is normal and they are not crazy.
00:15:49.100 --> 00:15:50.800
These are things we have learned.
00:15:50.800 --> 00:15:53.500
What about the mean?
00:15:53.500 --> 00:15:56.100
It is always shape, center, spread.
00:15:56.100 --> 00:15:59.000
What about the mean for the SDOD?
00:15:59.000 --> 00:16:11.800
That is going to be characterized by μ sub x bar - y bar.
00:16:11.800 --> 00:16:14.800
That is the idea.
00:16:14.800 --> 00:16:27.300
Let us consider the null hypothesis and in the null hypothesis usually the idea is they are not different like nothing stands out.
00:16:27.300 --> 00:16:31.500
Y does not stand out from x and x does not stand out from y.
00:16:31.500 --> 00:16:34.300
That means we are saying they are very similar.
00:16:34.300 --> 00:16:48.700
If that is the case, what we are saying is that when we take x bar – y bar over and over again, on average, the difference should be 0.
00:16:48.700 --> 00:16:52.400
Sometimes the difference will be positive.
00:16:52.400 --> 00:16:54.400
Sometimes the difference will be negative.
00:16:54.400 --> 00:17:02.500
But if x and y are roughly the same then we should actually get a difference of 0 on average.
00:17:02.500 --> 00:17:06.700
For the null hypothesis that is 0.
00:17:06.700 --> 00:17:11.300
So what would be the alternative hypothesis?
00:17:11.300 --> 00:17:16.800
Something like the mean of the SDOD is not 0.
00:17:16.800 --> 00:17:31.200
The null hypothesis is the case where x and y are assumed to be the same.
00:17:31.200 --> 00:17:34.900
That is always with the null hypothesis.
00:17:34.900 --> 00:17:36.300
They are assumed to be the same.
00:17:36.300 --> 00:17:38.500
They are not significantly different from each other.
00:17:38.500 --> 00:17:42.500
That is the mean of the SDOD.
00:17:42.500 --> 00:17:44.300
What about standard error?
00:17:44.300 --> 00:17:52.900
In order to calculate standard error, you have to know whether these are independent samples or not.
00:17:52.900 --> 00:17:59.300
Remember to go back to sampling, independent samples is where you know that these two
00:17:59.300 --> 00:18:09.100
come from different populations, and picking one does not change the probability of picking the other.
00:18:09.100 --> 00:18:16.200
As long as these are independent samples, then you can use these ideas of the standard error.
00:18:16.200 --> 00:18:25.000
As we said before, it is easier when I think about the variance of the SDOD first because that rule is quite easy.
00:18:25.000 --> 00:18:41.600
The variance of the SDOD is going to be just the variance of one SDOM + the variance of the SDOM for the other guy.
00:18:41.600 --> 00:18:51.300
And notice that these are the x bars and the y bars.
00:18:51.300 --> 00:18:57.000
These are for the SDOM they are not for the populations nor the samples.
00:18:57.000 --> 00:19:09.300
From here what you can do is just derive the standard error formula.
00:19:09.300 --> 00:19:13.100
We can just square root both sides.
00:19:13.100 --> 00:19:28.500
If you wanted to just get standard error, then it would just be the square root of adding each of these variances together.
00:19:28.500 --> 00:19:35.100
Let us say you double-click on this guy, what is inside of him?
00:19:35.100 --> 00:19:52.700
He is like a stand-in for the more detailed idea of s sub x / √n sub x.
00:19:52.700 --> 00:20:04.900
Remember when we talk about standard error we are talking about standard error = s / √n.
00:20:04.900 --> 00:20:09.900
The variance of the SDOM =s² /n.
00:20:09.900 --> 00:20:19.700
If you imagine squaring this you would get s²/n, which is the variance we need.
00:20:19.700 --> 00:20:24.500
We need to add the variances together before you square root them.
00:20:24.500 --> 00:20:35.100
Here we have the variance of y / n sub y.
00:20:35.100 --> 00:20:40.400
You could write it either like this or like this.
00:20:40.400 --> 00:20:42.600
They mean the same thing.
00:20:42.600 --> 00:20:43.900
They are perfectly equivalent.
00:20:43.900 --> 00:20:52.700
You do have to remember that when you have this all under the square root sign,
00:20:52.700 --> 00:21:00.700
the square root sign acts like parentheses, so you have to do all of this before you square root.
00:21:00.700 --> 00:21:04.900
That is standard error.
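A small numeric sketch of that computation, with hypothetical summary statistics: add the two SDOM variances first, then square root the sum:

```python
import math

# Summary statistics for two hypothetical independent samples.
s_x, n_x = 12.0, 36   # sample SD and size for x
s_y, n_y = 9.0, 25    # sample SD and size for y

# Variance of each SDOM: s squared over n.
var_xbar = s_x ** 2 / n_x   # 144 / 36 = 4.0
var_ybar = s_y ** 2 / n_y   # 81 / 25 = 3.24

# The root acts like parentheses: add everything, then square root.
se_diff = math.sqrt(var_xbar + var_ybar)
```

Square rooting each variance separately and then adding would give the wrong answer, which is exactly the order-of-operations warning above.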
00:21:04.900 --> 00:21:12.900
I know it looks a little complicated, but they are just all the principles we learned before,
00:21:12.900 --> 00:21:19.500
but now we have to remember does it come from x or does come from y distributions.
00:21:19.500 --> 00:21:27.300
That is one of the few things you have to ask yourself whenever we deal with two samples.
00:21:27.300 --> 00:21:38.400
Now that we know the revised CLT for this sampling distribution of the differences,
00:21:38.400 --> 00:21:46.000
now we need to ask when can we construct a confidence interval for the difference between two means?
00:21:46.000 --> 00:21:53.700
Actually these conditions are very similar to the conditions that must be met when we construct an SDOM.
00:21:53.700 --> 00:21:58.500
There are a couple of differences because we are dealing with two samples.
00:21:58.500 --> 00:22:01.400
The three conditions have to be met.
00:22:01.400 --> 00:22:03.200
All three of these have to be checked.
00:22:03.200 --> 00:22:09.600
One is independence, the notion of independence.
00:22:09.600 --> 00:22:19.700
The first is this: the two samples were randomly and independently selected from two different populations.
00:22:19.700 --> 00:22:28.100
That is the first thing you have to meet before you can construct this confidence interval.
00:22:28.100 --> 00:22:35.200
The second thing is this, this is the assumption for normality.
00:22:35.200 --> 00:22:38.300
How do we know that the SDOD is normal?
00:22:38.300 --> 00:22:51.900
It needs to be reasonable to assume that both populations the samples come from are normal, or that your sample sizes are sufficiently large.
00:22:51.900 --> 00:22:56.100
These are the same ones that apply to the CLT.
00:22:56.100 --> 00:23:04.000
This is the case where we can assume normality not only for the SDOM but also for the SDOD.
00:23:04.000 --> 00:23:17.000
In number 3, in the case of sample surveys the population size should be at least 10 times larger than the sample size for each sample.
00:23:17.000 --> 00:23:29.000
The only reason for this is what we talked about before with replacement: sampling with replacement versus sampling without replacement.
00:23:29.000 --> 00:23:33.600
Well, whenever you are doing a sample survey you are technically sampling without replacement,
00:23:33.600 --> 00:23:46.800
but if your population is large enough then this condition actually makes it so that you could assume that it works pretty much like with replacement.
00:23:46.800 --> 00:23:49.700
If you have many people then it does not matter.
00:23:49.700 --> 00:23:53.200
That is the replacement rule.
00:23:53.200 --> 00:24:04.200
Finally, we could get to actually finding the confidence interval.
00:24:04.200 --> 00:24:10.000
Here is the deal, with confidence interval let us just review how we used to do it for one mean.
00:24:10.000 --> 00:24:15.200
One mean confidence interval.
00:24:15.200 --> 00:24:26.500
Back in the day when we did one mean and life was nice, what we would often do is take the SDOM
00:24:26.500 --> 00:24:43.800
and assume that x bar, the sample mean, is at the center of it, and then we construct something like a 95% confidence interval.
00:24:43.800 --> 00:24:56.000
These are .025 because if this is 95% and symmetrical, there is 5% left over, but it needs to be divided between both sides.
00:24:56.000 --> 00:25:25.000
What we did was find these boundary values by using this idea: the middle + or – however many standard errors away you are.
00:25:25.000 --> 00:25:28.900
We used either t or z.
00:25:28.900 --> 00:25:30.100
I am just going to use t from now on because usually we are not given the standard deviation of the population. That t gets multiplied by the standard error.
00:25:30.100 --> 00:25:36.900
That was the basic idea from before and that would give us this value, as well as this value.
00:25:36.900 --> 00:25:44.900
We could say we have 95% confidence that the population mean falls in between these boundaries.
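That one-mean recipe can be sketched with made-up numbers; the t critical value of about 2.064 for df = 24 comes from a t table (or a function like scipy.stats.t.ppf(0.975, 24)):

```python
import math

# One-sample 95% confidence interval: x_bar plus or minus t* times (s / sqrt(n)).
# All numbers here are hypothetical.
x_bar, s, n = 50.0, 10.0, 25
t_crit = 2.064  # t* for df = n - 1 = 24 at 95% confidence (from a t table)

se = s / math.sqrt(n)        # 10 / 5 = 2.0
lower = x_bar - t_crit * se  # left boundary
upper = x_bar + t_crit * se  # right boundary
```

We would then say we have 95% confidence that the population mean falls between `lower` and `upper`.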
00:25:44.900 --> 00:25:47.900
That is for one mean.
00:25:47.900 --> 00:25:49.400
What about two means?
00:25:49.400 --> 00:26:00.000
In this case, we are not going to be calculating using the SDOM anymore.
00:26:00.000 --> 00:26:01.800
We are going to use the SDOD.
00:26:01.800 --> 00:26:14.800
If before this mean was going to be x bar, the sample mean, then you can probably assume that
00:26:14.800 --> 00:26:19.700
now it might be something as simple as the difference between the two means.
00:26:19.700 --> 00:26:23.000
That is what we assume to be the center of the SDOD.
00:26:23.000 --> 00:26:32.700
Just like before, whatever level of confidence you need.
00:26:32.700 --> 00:26:38.000
If it is 99% you have 1% left over on the side.
00:26:38.000 --> 00:26:42.800
You have to divide that 1% in half, so .5% for this side and .5% for that side.
00:26:42.800 --> 00:26:51.400
In this case, let us just keep the 95%.
00:26:51.400 --> 00:26:58.300
What we need to do is find these borders.
00:26:58.300 --> 00:27:04.200
What we can do is just use the exact same idea again.
00:27:04.200 --> 00:27:09.300
We could use that exact same idea because we can find the standard error of this distribution.
00:27:09.300 --> 00:27:10.900
We know what that is.
00:27:10.900 --> 00:27:20.200
Let me write this out.
00:27:20.200 --> 00:27:25.500
We will write s sub x bar - y bar.
00:27:25.500 --> 00:27:32.500
We can actually just translate these ideas into something like this.
00:27:32.500 --> 00:27:41.600
That would be taking this, adding or subtracting how many jumps away you are, like the distance you are away.
00:27:41.600 --> 00:27:50.100
That would be something like x bar - y bar; instead of just having x bar in the middle, we have this difference in the middle.
00:27:50.100 --> 00:28:00.000
+ or – the t remains the same; it is still the t distribution, but we will have to talk about how to find degrees of freedom for this guy.
00:28:00.000 --> 00:28:11.500
Times the new SE, but now this is the SE of the difference.
00:28:11.500 --> 00:28:13.900
How do we write that?
00:28:13.900 --> 00:28:27.600
X bar - y bar + or - t × s sub x bar - y bar.
00:28:27.600 --> 00:28:38.700
If we wanted to, we could expand that out into the square root of the variance of the SDOM for x plus the variance of the SDOM for y.
00:28:38.700 --> 00:28:46.800
We could unpack all of this if we need to but this is the basic idea of the confidence interval of two means.
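Putting the pieces together, here is a sketch of the two-mean interval with hypothetical summary statistics, using df = (n sub x - 1) + (n sub y - 1) and a t-table value:

```python
import math

# Hypothetical summary statistics for two independent samples.
x_bar, s_x, n_x = 80.0, 8.0, 20
y_bar, s_y, n_y = 74.0, 6.0, 32

# Degrees of freedom for the difference: df_x + df_y.
df = (n_x - 1) + (n_y - 1)  # 19 + 31 = 50

# t* for 95% confidence at df = 50 is about 2.009 (from a t table).
t_crit = 2.009

# Standard error of the difference: add SDOM variances, then square root.
se_diff = math.sqrt(s_x ** 2 / n_x + s_y ** 2 / n_y)

lower = (x_bar - y_bar) - t_crit * se_diff
upper = (x_bar - y_bar) + t_crit * se_diff
```

With these particular numbers, 0 falls outside the interval, which foreshadows the link to hypothesis testing discussed below.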
00:28:46.800 --> 00:28:51.700
In order to do this I want you to notice something.
00:28:51.700 --> 00:29:00.300
Here we need to find t and because we need to find t we need to find degrees of freedom
00:29:00.300 --> 00:29:04.400
but not just any old degrees of freedom, because right now we have two degrees of freedom.
00:29:04.400 --> 00:29:07.500
Degrees of freedom for x and degrees of freedom for y.
00:29:07.500 --> 00:29:11.500
We need a single degrees of freedom for the difference.
00:29:11.500 --> 00:29:13.300
That is what we need.
00:29:13.300 --> 00:29:15.900
Let us figure out how to do that.
00:29:15.900 --> 00:29:20.400
We need to find degrees of freedom.
00:29:20.400 --> 00:29:23.700
We know how to find degrees of freedom for x, that is straightforward.
00:29:23.700 --> 00:29:31.600
That is n sub x -1 and degrees of freedom for y is just going to be n sub y -1.
00:29:31.600 --> 00:29:32.400
Life is good.
00:29:32.400 --> 00:29:33.300
Life is easy.
00:29:33.300 --> 00:29:37.900
How do we find the degrees of freedom for the difference between x and y?
00:29:37.900 --> 00:29:50.000
That is actually going to just be the degrees of freedom for x + degrees of freedom for y.
00:29:50.000 --> 00:29:52.000
We just add them together.
00:29:52.000 --> 00:29:57.100
If we want to unpack this, think about double-clicking on each piece and you get that:
00:29:57.100 --> 00:30:03.700
(n sub x - 1) + (n sub y - 1).
00:30:03.700 --> 00:30:09.700
I am just putting in those parentheses so you can see the natural groupings, but obviously you could
00:30:09.700 --> 00:30:15.900
do them in any order, because it is all adding and subtracting straight across.
00:30:15.900 --> 00:30:20.100
They all have the same order of operations.
00:30:20.100 --> 00:30:29.800
That is degrees of freedom and once you have that then you can easily find the t.
00:30:29.800 --> 00:30:33.600
Look it up in the back of your book or you can do it in Excel.
00:30:33.600 --> 00:30:37.300
Let us interpret confidence interval.
00:30:37.300 --> 00:30:44.400
We have the confidence interval let us think about how to say what we have found.
00:30:44.400 --> 00:30:51.900
I am just going to briefly draw that picture again because this picture anchors my thinking.
00:30:51.900 --> 00:30:57.800
Here is our difference of means.
00:30:57.800 --> 00:31:03.100
When you look at this, think of it as the difference of two means.
00:31:03.100 --> 00:31:09.100
I guess I could write DOTM but that would just be DOM.
00:31:09.100 --> 00:31:27.600
Here what we found, if we find something like a 95% confidence interval that means we have found these boundaries.
00:31:27.600 --> 00:31:31.200
We say something like this.
00:31:31.200 --> 00:31:58.900
The actual difference of the two means of the real populations, population x and population y,
00:31:58.900 --> 00:32:18.900
the real populations that the samples come from, should be within this interval 95% of the time, or something like:
00:32:18.900 --> 00:32:30.100
we have 95% confidence that the actual difference between the means of population x and population y is within this interval.
00:32:30.100 --> 00:32:35.100
That comes from that notion that this is created from the SDOM.
00:32:35.100 --> 00:32:42.600
Remember, for the SDOM, the CLT says that its mean is the mean of the population.
00:32:42.600 --> 00:32:50.600
We are getting the population means dropping down to the SDOMs, and from the SDOMs we get this.
00:32:50.600 --> 00:33:00.300
Because of that we could actually make a conclusion that goes back to the population.
00:33:00.300 --> 00:33:07.400
Let us think about if 0 is not in between here.
00:33:07.400 --> 00:33:13.300
Remember the null hypothesis when we think about two means is going to be something like this.
00:33:13.300 --> 00:33:18.200
That the μ sub x bar – y bar is going to be equal to 0.
00:33:18.200 --> 00:33:23.900
This is going to mean that on average when you subtract these two things the average is going to be 0.
00:33:23.900 --> 00:33:26.100
There is going to be no difference on average.
00:33:26.100 --> 00:33:34.900
The alternative hypothesis should then be the mean of these differences should not be 0.
00:33:34.900 --> 00:33:36.500
They are different.
00:33:36.500 --> 00:33:46.200
If 0 is not within this confidence interval then we have very little reason to suspect that this would be true.
00:33:46.200 --> 00:33:50.500
There is very little reason to think that this null hypothesis is true.
00:33:50.500 --> 00:34:00.200
We could also say that if we do not find 0 in our confidence interval, then in hypothesis testing we might be able to reject the null hypothesis.
00:34:00.200 --> 00:34:02.100
But we will get to that later.
00:34:02.100 --> 00:34:09.900
I just wanted to show you this because the confidence interval here is very tightly linked to the hypothesis testing part.
00:34:09.900 --> 00:34:12.600
They are like two sides of the same coin.
00:34:12.600 --> 00:34:25.600
That was all fairly straightforward, but I feel like I need to cover one other thing because sometimes this is emphasized in some books.
00:34:25.600 --> 00:34:36.300
Some teachers emphasize this more than others, so I am going to talk to you about s pooled, because this will come up.
00:34:36.300 --> 00:34:44.800
One of the things I hope you noticed was that,
00:34:44.800 --> 00:35:06.100
in order to find the SDOD standard error, what we did was take the variance of one SDOM
00:35:06.100 --> 00:35:09.800
and add that to the variance of the other SDOM and square root the whole thing.
00:35:09.800 --> 00:35:11.100
Let me just write that here.
00:35:11.100 --> 00:35:28.800
The s sub x bar - y bar is the square root of one of the variances + the variance of the other SDOM.
00:35:28.800 --> 00:35:37.100
Here what we did was let us just treat them separately and then combine them together.
00:35:37.100 --> 00:35:38.500
That is what we did.
00:35:38.500 --> 00:35:54.600
Although this is an okay way of doing it, in doing this we are assuming that they might have different standard deviations.
00:35:54.600 --> 00:35:59.000
The two different populations might have two different standard deviations.
00:35:59.000 --> 00:36:02.600
Normally, that is a reasonable assumption to make.
00:36:02.600 --> 00:36:06.500
Very few populations have the exact same standard deviation.
00:36:06.500 --> 00:36:16.700
That works the vast majority of the time, because we just assume that if samples come from two different populations, they probably have two different standard deviations.
00:36:16.700 --> 00:36:22.400
This is pretty reasonable to do like 98% of the time.
00:36:22.400 --> 00:36:24.600
The vast majority of time.
00:36:24.600 --> 00:36:37.800
But it is actually not as good an estimate of this value as if you had used a pooled version of the standard deviation.
00:36:37.800 --> 00:36:38.600
Here is what I mean.
00:36:38.600 --> 00:36:46.400
Right now, the x data are what we use to create the standard deviation of x,
00:36:46.400 --> 00:36:50.500
and the y data are what we use to create the standard deviation of y.
00:36:50.500 --> 00:36:53.600
Let us make that explicit.
00:36:53.600 --> 00:37:08.300
I am going to write this out so that you could actually see the variance of x and the variance of y.
00:37:08.300 --> 00:37:14.700
We use x to create this guy and we use y to create that guy and they remain separate.
00:37:14.700 --> 00:37:18.900
This is going to take a little reasoning.
00:37:18.900 --> 00:37:33.000
Think back: if you have more data, then your estimate of the population standard deviation is better; more data, more accurate.
00:37:33.000 --> 00:37:42.400
Would it not be nice if we took all the values from the x sample and all the values from the y sample and put them together?
00:37:42.400 --> 00:37:47.200
Together let us estimate the standard deviation.
00:37:47.200 --> 00:37:48.500
Would not that be nice?
00:37:48.500 --> 00:37:58.400
Then we will have more data and more data should give us a more accurate estimate of the population.
00:37:58.400 --> 00:38:13.100
You can do that but only in the case that you have reason to think that the population of x has a similar standard deviation to the population of y.
00:38:13.100 --> 00:38:19.100
If you have a reason to think they are both normally distributed.
00:38:19.100 --> 00:38:23.100
Let us say something like this.
00:38:23.100 --> 00:38:44.300
If you have reason to believe that the population x and y have similar standard deviation
00:38:44.300 --> 00:39:06.800
then you can pull samples together to estimate standard deviation.
00:39:06.800 --> 00:39:11.200
You can pool them together, and that is going to be called s-pool.
00:39:11.200 --> 00:39:17.600
There are very few populations that you can do this for.
00:39:17.600 --> 00:39:25.500
One example is something like height of males and females; height tends to be normally distributed, and we know that.
00:39:25.500 --> 00:39:33.800
Height of Asians and Latinos, or something like that, but there are not a lot of examples that come to mind where you could do this.
00:39:33.800 --> 00:39:38.200
That is why some teachers do not emphasize it, but I know that some others do.
00:39:38.200 --> 00:39:40.200
That is why I want to definitely go over it.
00:39:40.200 --> 00:39:44.200
How do you get s-pool, and where does it come in?
00:39:44.200 --> 00:39:55.900
Here is the thing: in order to find s-pool, what we would do is substitute s-pool in for s sub x and s sub y.
00:39:55.900 --> 00:40:08.400
Instead of two separate estimates of the standard deviations, use s-pool.
00:40:08.400 --> 00:40:11.300
We will be using s-pool².
00:40:11.300 --> 00:40:15.500
How do we find s-pool²?
00:40:15.500 --> 00:40:31.900
In order to find s-pool², what you would do is add up all of the sums of squares.
00:40:31.900 --> 00:40:42.500
The sum of squares of x and sum of squares of y, add them together and then divide by the sum of all the degrees of freedom.
00:40:42.500 --> 00:40:57.100
If I double-click on this, this would mean (the sum of squares of x + the sum of squares of y) ÷ (degrees of freedom of x + degrees of freedom of y).
00:40:57.100 --> 00:41:09.200
This is all you need to do in order to find s-pool², and then what you would do is substitute it in for s sub x² and s sub y².
00:41:09.200 --> 00:41:11.200
That is the deal.
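As a sketch (my own function names, assuming you have the two raw samples rather than precomputed sums of squares), the pooled calculation looks like this:

```python
import math

def pooled_variance(xs, ys):
    # s_pool^2 = (SS_x + SS_y) / (df_x + df_y), where SS is the sum of
    # squared deviations from each sample's own mean.
    def ss(data):
        m = sum(data) / len(data)
        return sum((v - m) ** 2 for v in data)
    return (ss(xs) + ss(ys)) / ((len(xs) - 1) + (len(ys) - 1))

def se_pooled(xs, ys):
    # Substitute s_pool^2 in for both s_x^2 and s_y^2 in the
    # standard-error formula.
    sp2 = pooled_variance(xs, ys)
    return math.sqrt(sp2 / len(xs) + sp2 / len(ys))
```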
00:41:11.200 --> 00:41:22.800
In the examples that are going to follow, I am not going to use s-pool, because there is usually very little reason to assume that we can use it.
00:41:22.800 --> 00:41:29.800
But a lot of times you might hear this phrase: the assumption of homogeneity of variance.
00:41:29.800 --> 00:41:42.500
If you can assume that these guys have a similar variance,
00:41:42.500 --> 00:41:48.300
a homogeneous variance, then you can use s-pool.
00:41:48.300 --> 00:41:54.200
For the most part, for the vast majority of the time, you cannot assume homogeneous variance.
00:41:54.200 --> 00:41:57.500
Because of that we will often use this one.
00:41:57.500 --> 00:42:05.400
However, I should say that some teachers do want you to be able to calculate both.
00:42:05.400 --> 00:42:07.700
That is the only thing.
00:42:07.700 --> 00:42:11.400
Finally I should just say one thing.
00:42:11.400 --> 00:42:15.900
Usually this works just as well as the pooled version.
00:42:15.900 --> 00:42:23.400
It is just that there are sometimes when we get more of a benefit from using the pooled one.
00:42:23.400 --> 00:42:28.200
If worse comes to worst, and after this statistics class you only remember this one, the unpooled version,
00:42:28.200 --> 00:42:30.800
you are pretty good to go.
00:42:30.800 --> 00:42:36.400
Let us go on to some examples.
00:42:36.400 --> 00:42:42.100
A random sample of American college students was collected to examine quantitative literacy.
00:42:42.100 --> 00:42:45.200
How good they are at reasoning about quantitative ideas.
00:42:45.200 --> 00:42:51.600
The survey sampled 1,000 students from four-year institutions, this was the mean and standard deviation.
00:42:51.600 --> 00:42:56.600
800 from two-year institutions, here is the mean and standard deviations.
00:42:56.600 --> 00:43:01.000
Are the conditions for confidence intervals met?
00:43:01.000 --> 00:43:06.600
Also construct a 95% confidence interval and interpret it.
00:43:06.600 --> 00:43:12.700
Let us think about the confidence interval requirements.
00:43:12.700 --> 00:43:16.000
First is independent random samples.
00:43:16.000 --> 00:43:23.100
It does say random sample right and these are independent populations.
00:43:23.100 --> 00:43:26.600
One is four-year institutions, one is two-year institutions.
00:43:26.600 --> 00:43:29.500
There are very few people going to both of them at the same time.
00:43:29.500 --> 00:43:32.600
First one, check.
00:43:32.600 --> 00:43:42.500
Second one, can we assume normality either because of the large n or because we know that both these populations are originally normally distributed?
00:43:42.500 --> 00:43:47.200
Well, they have pretty large n, so I am going to say number 2 check.
00:43:47.200 --> 00:43:55.600
Number 3, is this sample small enough relative to the population that it is roughly like sampling with replacement?
00:43:55.600 --> 00:44:00.200
And although 1,000 students seems like a lot, there are a lot of college students.
00:44:00.200 --> 00:44:03.300
I am pretty sure that this meets that qualification as well.
00:44:03.300 --> 00:44:07.700
Go ahead and construct the 95% confidence interval.
00:44:07.700 --> 00:44:15.700
Well, it helps to start off with a drawing of the SDOD just to anchor my thinking.
00:44:15.700 --> 00:44:27.400
And this μ sub x bar - y bar, we estimate with x bar - y bar.
00:44:27.400 --> 00:44:30.500
That is what we do with confidence intervals.
00:44:30.500 --> 00:44:37.800
We use what we have from the samples to figure out what the population might be.
00:44:37.800 --> 00:44:45.500
We want to construct a 95% confidence interval.
00:44:45.500 --> 00:45:03.200
That is going to be .025 in each tail, and then maybe it will help us to figure out the degrees of freedom so that we will know the t value to use.
00:45:03.200 --> 00:45:05.700
Let us figure out degrees of freedom.
00:45:05.700 --> 00:45:18.000
It is going to be the degrees of freedom for x, and I will call x the four-year university guys, plus the degrees of freedom for y, the two-year university guys.
00:45:18.000 --> 00:45:46.900
That is going to be 999 + 799, so it is going to be 1,800 - 2 = 1,798.
00:45:46.900 --> 00:45:55.500
We have quite large degrees of freedom and let us find the t for this place.
00:45:55.500 --> 00:46:00.000
What we need to find is this and this.
00:46:00.000 --> 00:46:05.300
Let us find the t first.
00:46:05.300 --> 00:46:12.300
This is the raw score, this is the t, and let me delete some of the stuff.
00:46:12.300 --> 00:46:22.500
I will just put x bar - y bar in there and we can find that later.
00:46:22.500 --> 00:46:28.200
The t is going to be the boundaries for this guy and the boundaries for this guy.
00:46:28.200 --> 00:46:30.400
What is our t value?
00:46:30.400 --> 00:46:39.400
You can look it up in the back of your book or you could do it in Excel.
00:46:39.400 --> 00:46:46.600
Here we want to get the t out, because we have the probability, and remember, this one
00:46:46.600 --> 00:46:59.300
wants the two-tailed probability, .05, and the degrees of freedom, which is 1798: that gives 1.961.
00:46:59.300 --> 00:47:10.900
We will write 1.961, rather than rounding to 1.96, just to distinguish it from the z value.
00:47:10.900 --> 00:47:18.200
Let us write down our confidence interval formula and see what we can do.
00:47:18.200 --> 00:47:23.700
Confidence interval is going to be x bar - y bar.
00:47:23.700 --> 00:47:34.300
The middle of this guy + or - t × standard error of this guy.
00:47:34.300 --> 00:47:37.800
That is going to be s sub x bar - y bar.
00:47:37.800 --> 00:47:42.400
It would be probably helpful to find this thing.
00:47:42.400 --> 00:47:48.000
X bar - y bar.
00:47:48.000 --> 00:48:02.800
X bar - y bar that is going to be 330 – 310.
00:48:02.800 --> 00:48:31.200
Let us also try to figure out the standard error of SDOD which is s sub x bar - y bar.
00:48:31.200 --> 00:48:38.000
What I'm trying to do is find this guy.
00:48:38.000 --> 00:48:41.400
In order to find that guy let us think about the formula.
00:48:41.400 --> 00:48:44.700
I'm just writing this for myself.
00:48:44.700 --> 00:48:57.300
The square root of the variance of x bar + the variance of y bar .
00:48:57.300 --> 00:49:03.500
We do not have the variance of x bar and y bar.
00:49:03.500 --> 00:49:07.400
Let us think about how to find the variance of x bar.
00:49:07.400 --> 00:49:18.900
The variance of x bar is going to be s sub x² ÷ n sub x.
00:49:18.900 --> 00:49:37.200
The variance of y bar is going to be s sub y² ÷ n sub y.
00:49:37.200 --> 00:49:46.600
I wanted to write all these things out just because I need to get to a place where finally I can put in s.
00:49:46.600 --> 00:49:48.100
Finally, I can do that.
00:49:48.100 --> 00:49:50.500
This is s sub x and this is s sub y.
00:49:50.500 --> 00:50:17.400
I can put in 111² ÷ n sub x which is 1000 and I could put in the standard deviation of y² ÷ 800.
00:50:17.400 --> 00:50:27.800
I have these two things and what I need to do is go back up here and add these and square root them.
00:50:27.800 --> 00:50:34.200
Square root this + this.
00:50:34.200 --> 00:50:38.400
I know that this equal that.
00:50:38.400 --> 00:51:04.000
We have our standard error, which is about 4.49, and so this is 20 + or - 1.961 × 4.49.
00:51:04.000 --> 00:51:06.200
Now I could do this.
00:51:06.200 --> 00:51:09.000
I am going to do that in my calculator as well.
00:51:09.000 --> 00:51:24.700
The confidence interval for the high boundary is going to be 20 + 1.961 × 4.49
00:51:24.700 --> 00:51:37.400
and the confidence interval for the low boundary is going to be that same thing.
00:51:37.400 --> 00:51:41.200
I am just going to change that into subtraction.
00:51:41.200 --> 00:51:45.100
11.20.
00:51:45.100 --> 00:51:50.200
Let me move this over.
00:51:50.200 --> 00:51:56.800
It is going to be 28.8.
00:51:56.800 --> 00:52:00.800
Let me put the low end first.
00:52:00.800 --> 00:52:07.400
The confidence interval is from about 11.2 through 28.8.
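Using the numbers from this example (mean difference of 20, t of 1.961, standard error of about 4.49), the hand calculation can be checked with a short sketch; the function name is mine, not from the lesson.

```python
def confidence_interval(mean_diff, t_crit, se):
    # CI = (x bar - y bar) +/- t * (standard error of the difference).
    margin = t_crit * se
    return mean_diff - margin, mean_diff + margin

low, high = confidence_interval(20, 1.961, 4.49)
# low and high come out near 11.2 and 28.8, matching the hand calculation.
```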
00:52:07.400 --> 00:52:10.300
We have to interpret it.
00:52:10.300 --> 00:52:13.600
This is the hardest part for a lot of people.
00:52:13.600 --> 00:52:16.100
We have to say something like this.
00:52:16.100 --> 00:52:26.500
The true difference between the population means, 95% of the time, is going to fall in between these two numbers.
00:52:26.500 --> 00:52:34.200
Or: we have 95% confidence that the true difference between the two population means falls in between these two numbers.
00:52:34.200 --> 00:52:37.600
Let us go to example 2.
00:52:37.600 --> 00:52:38.700
This will be our last example.
00:52:38.700 --> 00:52:46.800
If the sample size of both samples are the same, what would be the simplified formula for standard error of the difference?
00:52:46.800 --> 00:52:55.600
If in addition, the standard deviation of both samples are the same, what would be the simplified formula for standard error of the difference?
00:52:55.600 --> 00:53:02.900
This is just asking: depending on how similar the two samples are, can we simplify the formula for standard error?
00:53:02.900 --> 00:53:04.000
We can.
00:53:04.000 --> 00:53:27.100
Let us write the actual formula out: s sub x bar - y bar = the square root of the variance of x bar + the variance of y bar.
00:53:27.100 --> 00:53:43.000
If we double-click on these guys that would give the variance of x / n sub x + the variance of y / n sub y.
00:53:43.000 --> 00:53:49.800
It is asking, what if the sample size for both samples are the same?
00:53:49.800 --> 00:53:51.600
What would be the simplified formula?
00:53:51.600 --> 00:54:00.400
That is saying that if n sub x = n sub y then what would be this?
00:54:00.400 --> 00:54:11.600
We can get the square root of (the variance of x + the variance of y) ÷ n.
00:54:11.600 --> 00:54:14.400
Because the n for each of them should be the same.
00:54:14.400 --> 00:54:20.300
This would make it a lot simpler.
00:54:20.300 --> 00:54:32.500
If, in addition, the standard deviations of both samples are the same, then this would mean that,
00:54:32.500 --> 00:54:36.400
because the standard deviations are the same, the variances are the same.
00:54:36.400 --> 00:54:39.100
That would be that case.
00:54:39.100 --> 00:54:54.400
If in addition this was the case, then you would just get the square root of 2 × s² ÷ n, where s² is the common variance.
00:54:54.400 --> 00:54:58.400
That would make it a simple formula.
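A quick sketch (my own function names) confirms that the simplified form agrees with the general formula when the sample sizes and standard deviations match:

```python
import math

def se_general(s_x, n_x, s_y, n_y):
    # General form: sqrt(s_x^2/n_x + s_y^2/n_y).
    return math.sqrt(s_x ** 2 / n_x + s_y ** 2 / n_y)

def se_equal(s, n):
    # Simplified form when n_x == n_y == n and s_x == s_y == s.
    return math.sqrt(2 * s ** 2 / n)
```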
00:54:58.400 --> 00:55:03.500
That would make life a lot easier but that is not always the case.
00:55:03.500 --> 00:55:07.400
If it is, you know that it will be simpler for you.
00:55:07.400 --> 00:55:12.000
That is it for the confidence intervals for the difference between two means.
00:55:12.000 --> 00:55:14.000
Thank you for using www.educator.com.