WEBVTT mathematics/statistics/son
00:00:00.000 --> 00:00:01.500
Hi and welcome to www.educator.com.
00:00:01.500 --> 00:00:05.800
Today we are going to be talking about sampling distribution of sample proportions.
00:00:05.800 --> 00:00:14.100
The first thing we are going to do is introduce ourselves to the concept of the sampling distribution of sample proportions.
00:00:14.100 --> 00:00:19.500
This is just me, not everybody in statistics, but I am going to call it SDOS for short.
00:00:19.500 --> 00:00:24.100
We do not have to keep writing out sampling distribution of sample proportions.
00:00:24.100 --> 00:00:34.300
Then we are going to go through some notation and then finally we are going to compare and contrast the SDOS to the SDOM.
00:00:34.300 --> 00:00:40.600
They are both sampling distributions, but one is of the mean and the other is of sample proportions.
00:00:40.600 --> 00:00:45.300
We are also going to compare and contrast the binomial distribution,
00:00:45.300 --> 00:00:50.400
the probability distribution that we looked at a couple of lessons back, with the SDOS.
00:00:50.400 --> 00:00:54.500
What is this thing?
00:00:54.500 --> 00:00:55.900
What is the SDOS?
00:00:55.900 --> 00:01:02.900
First this concept is going to come into play whenever we collect some sort of categorical data.
00:01:02.900 --> 00:01:10.100
For instance, we might ask a sample of citizens: do you approve or disapprove of the president?
00:01:10.100 --> 00:01:13.000
That would be a categorical response.
00:01:13.000 --> 00:01:19.300
They are not saying I approve this much, but they are just saying I approve or disapprove.
00:01:19.300 --> 00:01:26.100
At the end of that data collection what you get is not a mean or a median
00:01:26.100 --> 00:01:30.800
but you get something like a proportion of citizens who believe the president is doing a good job.
00:01:30.800 --> 00:01:36.700
Something like 43%, 64%, 29%.
00:01:36.700 --> 00:01:44.200
Or another example might be proportion of students who plagiarized on a paper before.
00:01:44.200 --> 00:01:47.200
Here we are getting proportions.
00:01:47.200 --> 00:01:49.400
They are not means or medians.
00:01:49.400 --> 00:01:52.300
They are just percentages of the entire sample.
00:01:52.300 --> 00:01:58.900
Finally, another one that has been talked about a lot these days is the proportion of people covered under healthcare.
00:01:58.900 --> 00:02:09.100
When we collect this categorical data oftentimes we want to use that in order to estimate the proportion
00:02:09.100 --> 00:02:17.600
of the population that actually is covered by health care, or who have plagiarized before, or who believe the president is doing a good job.
00:02:17.600 --> 00:02:21.800
We want to estimate the population level parameter.
00:02:21.800 --> 00:02:27.300
However, samples are very variable.
00:02:27.300 --> 00:02:37.400
Samples are variable and that means that our estimate would not always be very good.
00:02:37.400 --> 00:02:40.800
It will be good sometimes, but it would not always be very good.
00:02:40.800 --> 00:02:55.600
It would help us out if we knew the entire distribution of potential samples.
00:02:55.600 --> 00:02:58.700
Would that not be handy?
00:02:58.700 --> 00:03:02.000
That is called the sampling distribution.
00:03:02.000 --> 00:03:06.700
And because we are not sampling and finding a mean, instead we are sampling
00:03:06.700 --> 00:03:16.300
and finding a proportion it is called a sampling distribution of sample proportions.
00:03:16.300 --> 00:03:26.000
This is the idea: the entire distribution of potential samples. But once we get each sample, what do we do with it?
00:03:26.000 --> 00:03:33.100
We do not find the mean, we find the sample proportion and we plot those on a distribution.
00:03:33.100 --> 00:03:39.900
Some things that are helpful for us to get straight.
00:03:39.900 --> 00:03:46.200
When we talk about the population proportion that parameter we will just call it p.
00:03:46.200 --> 00:03:52.100
When we talk about a sample proportion we are going to call it p hat.
00:03:52.100 --> 00:04:03.400
We have seen that notation before when we talked about regression, where we had y for the actual data and y hat for the predicted data.
00:04:03.400 --> 00:04:05.500
You can think about it like this.
00:04:05.500 --> 00:04:14.400
The real one from the world is not going to have the hat.
00:04:14.400 --> 00:04:22.500
This is sort of the truth that we are trying to find and this is how we are going to estimate that truth.
00:04:22.500 --> 00:04:30.000
We are going to use this to estimate that but we want to know how good is our estimate?
00:04:30.000 --> 00:04:30.600
Is it any good?
00:04:30.600 --> 00:04:32.700
And is it reliable or not?
00:04:32.700 --> 00:04:35.600
Think about it like this.
00:04:35.600 --> 00:04:41.000
Here the distribution of the population is binary.
00:04:41.000 --> 00:04:42.500
It is one or the other.
00:04:42.500 --> 00:04:52.000
I am just going to draw the entire population as just a bar and pretend this bar had the value of 1.0, 100%.
00:04:52.000 --> 00:04:59.300
Some proportion of this is p and the other is not p.
00:04:59.300 --> 00:05:11.800
Some proportion of these people approve of the job he's doing as a president and the other proportion does not.
00:05:11.800 --> 00:05:16.500
How should we represent that in a picture and algebraic form?
00:05:16.500 --> 00:05:24.500
Well I'm just going to draw a line here and say this part is going to be my proportion p.
00:05:24.500 --> 00:05:30.000
That is my proportion of those that agree that the president is doing a good job.
00:05:30.000 --> 00:05:35.500
Then what would this little area here be?
00:05:35.500 --> 00:05:38.400
Well, how would we represent that algebraically?
00:05:38.400 --> 00:05:43.900
That would simply be 1 – p because the whole thing is 1.
00:05:43.900 --> 00:05:50.000
This segment is p so this segment must be 1 – p.
00:05:50.000 --> 00:05:55.400
When you add p + (1 – p) what you get is 1.
00:05:55.400 --> 00:06:03.600
Notice that we did not draw like a normal distribution or anything because it is not that people have different values.
00:06:03.600 --> 00:06:05.600
It is not that some people are low, some people are high.
00:06:05.600 --> 00:06:07.100
It is just yes or no?
00:06:07.100 --> 00:06:10.300
Have you plagiarized or not?
00:06:10.300 --> 00:06:15.300
Have you gone bungee jumping or not?
00:06:15.300 --> 00:06:18.300
Are you covered by health care or not?
00:06:18.300 --> 00:06:23.100
They are just these binary characteristics that we are interested in.
00:06:23.100 --> 00:06:25.000
That is what the population is like.
00:06:25.000 --> 00:06:42.600
Now from population we draw a sample of size n just like always.
00:06:42.600 --> 00:06:45.600
We are always drawing samples of size n.
00:06:45.600 --> 00:06:52.100
When we look at that little sample of the population what does it look like?
00:06:52.100 --> 00:07:01.200
Instead of the whole thing you drew a little sample, you drew a subset of those people
00:07:01.200 --> 00:07:10.100
and presumably this little sample should most likely reflect the population that it came from.
00:07:10.100 --> 00:07:14.100
This should not be radically different from that.
00:07:14.100 --> 00:07:21.700
It can be sometimes but for the most part this sample should reflect the population that it came from.
00:07:21.700 --> 00:07:45.800
The entire sample, this whole thing, = 1, and in this entire sample we have some proportion p hat, and that is the proportion in our sample that agree.
00:07:45.800 --> 00:07:53.200
This would be represented by 1 – p hat those are the people in our sample that disagree.
00:07:53.200 --> 00:08:00.400
You might be thinking how is this whole thing 1 and how is this whole thing 1 because this one looks smaller?
00:08:00.400 --> 00:08:03.800
When we say 1 we are talking about proportions.
00:08:03.800 --> 00:08:05.700
We are really thinking 100%.
00:08:05.700 --> 00:08:10.100
When we say 1 up here, it represents 100% of the population.
00:08:10.100 --> 00:08:15.800
When we say 1 down here we are saying 100% of the sample.
00:08:15.800 --> 00:08:18.800
That is the distinction we want to make.
00:08:18.800 --> 00:08:24.600
Once you do this then you get this p hat.
00:08:24.600 --> 00:08:32.000
And once you have that p hat then you can plot it on a sampling distribution.
00:08:32.000 --> 00:08:36.400
Here is what the sampling distribution of sample proportions looks like.
00:08:36.400 --> 00:08:42.100
The lower bound and upper bound on this have to be 0 and 1.
00:08:42.100 --> 00:08:50.300
You can never have a p hat that is less than 0 and you can never have a p hat that is greater than 100%.
00:08:50.300 --> 00:08:53.900
You are inevitably stuck between 0 and 1.
00:08:53.900 --> 00:08:57.000
Those are the only sample proportions you could possibly get.
00:08:57.000 --> 00:09:00.000
Whatever we get here we plot here.
00:09:00.000 --> 00:09:04.200
Soon we will build up a sampling distribution of sample proportions.
00:09:04.200 --> 00:09:16.400
Whatever it looks like, that will be our sampling distribution of sample proportions.
00:09:16.400 --> 00:09:25.600
Let us contrast the SDOM versus the SDOS.
00:09:25.600 --> 00:09:33.400
In fact we are going to learn more about the SDOS by learning about the SDOM and how it relates.
00:09:33.400 --> 00:09:39.300
There is one key difference between these two and that is the biggest thing you really need to keep in mind.
00:09:39.300 --> 00:09:42.600
When we are talking about the SDOM we are finding a mean.
00:09:42.600 --> 00:09:45.600
You cannot find a mean between agree and disagree.
00:09:45.600 --> 00:09:48.700
Those are categorical data.
00:09:48.700 --> 00:09:59.000
What we do know is that we need data where you can find the mean, and those are continuous data.
00:09:59.000 --> 00:10:08.700
Sometimes these are also called measurement data because you actually got this by measuring something.
00:10:08.700 --> 00:10:18.800
Take continuous data, for instance: how many miles do you drive per day?
00:10:18.800 --> 00:10:24.500
Say you get a sample to find out the average number of miles people in California drive each day.
00:10:24.500 --> 00:10:33.000
That would be a continuous measure if you get data like that you can actually average it together.
00:10:33.000 --> 00:10:38.200
But if we ask the question like do you drive every day?
00:10:38.200 --> 00:10:39.500
Yes or no?
00:10:39.500 --> 00:10:42.500
That would be categorical data.
00:10:42.500 --> 00:10:56.500
The type of data we are talking about here happens to be binary because it is either you are in one category or you are in the other.
00:10:56.500 --> 00:10:59.100
There are not three categories.
00:10:59.100 --> 00:11:06.300
It is like there might be you agree with the president or you disagree with the president or you feel neutral.
00:11:06.300 --> 00:11:13.900
What we would do in order to look at it as an SDOS is to lump people together.
00:11:13.900 --> 00:11:18.500
It might be agree versus disagree or do not care.
00:11:18.500 --> 00:11:23.300
We might lump those two groups together and just call them not agrees.
00:11:23.300 --> 00:11:34.500
The shape of the SDOM what is nice about it is that as n increases, what happens to the shape?
00:11:34.500 --> 00:11:42.700
The shape approximates normal.
00:11:42.700 --> 00:11:49.800
As our sample size increases the shape is more and more reliably normal.
00:11:49.800 --> 00:11:55.100
The nice thing about the SDOS is that the same principle applies.
00:11:55.100 --> 00:12:12.300
You could just draw a little link there: as n increases, shape approximates normal.
00:12:12.300 --> 00:12:28.500
Because as we draw samples of size n, as n gets bigger, even for the SDOS we are actually seeing normal-like distributions.
00:12:28.500 --> 00:12:32.300
Let us talk about center.
00:12:32.300 --> 00:12:39.000
If you remember the central limit theorem, that is where the shape, center, and spread stuff comes from.
00:12:39.000 --> 00:12:51.200
Center if you remember, the population μ equals the center of the SDOM which is μ sub x bar.
00:12:51.200 --> 00:12:54.000
It is the mean of a whole bunch of little x bars.
00:12:54.000 --> 00:13:06.600
There is a similar idea here, but there is a difference.
00:13:06.600 --> 00:13:13.800
Basically when we talk about center here remember that we do not have the population μ.
00:13:13.800 --> 00:13:15.200
We do not have a population μ.
00:13:15.200 --> 00:13:16.500
We do not have a population mean.
00:13:16.500 --> 00:13:25.100
Instead what we have is more like a population proportion.
00:13:25.100 --> 00:13:26.400
We had that p.
00:13:26.400 --> 00:13:34.600
We want to know what is the relationship between p and p hat?
00:13:34.600 --> 00:13:51.300
In this case, what we see is that the μ we want, μ sub p hat, is going to be equal to the population proportion p.
00:13:51.300 --> 00:14:05.300
And if you think about it, let us say you have 60% of your population is approving of the president.
00:14:05.300 --> 00:14:11.600
If you draw just 1 person from that population, what is the chance that that 1 person approves of the president?
00:14:11.600 --> 00:14:19.300
That 1 person has a 60% chance of approving of the president.
00:14:19.300 --> 00:14:26.900
When you draw 2 people, each of those 2 people also has a 60% chance of approving of the president.
00:14:26.900 --> 00:14:31.900
If you draw 3 people, each one still has a 60% chance of approving of the president.
00:14:31.900 --> 00:14:48.400
The population p is equal to μ sub p hat. Remember, now we have a mean because we have a distribution of p hats.
00:14:48.400 --> 00:14:50.500
Here is the idea.
00:14:50.500 --> 00:14:54.500
Get all these p hats, the entire distribution p hats.
00:14:54.500 --> 00:15:02.100
Once you have those, if you find the mean of that, it is equal to the population proportion p.
00:15:02.100 --> 00:15:05.700
That is the nice thing about the center.
00:15:05.700 --> 00:15:15.300
Remember this number is between 0 and 1 because you cannot have lower than 0 or higher than 1.
00:15:15.300 --> 00:15:20.800
This value is also between 0 and 1.
00:15:20.800 --> 00:15:35.600
Another way to think about it is that the rate in the population will be the mean of all the rates that you get in your samples.
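To make the idea of center concrete, here is a minimal simulation sketch in Python. It is not part of the original lesson. The function name `simulate_p_hats`, the sample size of 100, the 5,000 repetitions, and the seed are all assumptions chosen for illustration; the 60% approval rate is the one used above.

```python
import random
import statistics

def simulate_p_hats(p, n, num_samples, seed=0):
    """Draw num_samples samples of size n from a yes/no population whose
    true proportion of "yes" is p, and return each sample's p hat."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    p_hats = []
    for _ in range(num_samples):
        # Each person in the sample says "yes" with probability p.
        successes = sum(1 for _ in range(n) if rng.random() < p)
        p_hats.append(successes / n)
    return p_hats

# Hypothetical setup: 60% of the population approves, samples of 100 people.
p_hats = simulate_p_hats(p=0.60, n=100, num_samples=5000)
mean_p_hat = statistics.mean(p_hats)

print("lowest p hat:", min(p_hats))
print("highest p hat:", max(p_hats))
print("mean of the p hats:", round(mean_p_hat, 3))
```

Every simulated p hat lands between 0 and 1, and the mean of all the p hats should sit very close to the population p of .60.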
00:15:35.600 --> 00:15:37.900
Let us talk about spread.
00:15:37.900 --> 00:15:43.300
When we talked about spread before, we often looked at standard deviation.
00:15:43.300 --> 00:15:48.800
Obviously you can also look at variance, which is the square of the standard deviation.
00:15:48.800 --> 00:16:01.100
Here when we talked about standard deviation of the SDOM we called it sigma sub x bar because it is the standard deviation of a bunch of x-bars,
00:16:01.100 --> 00:16:08.800
a bunch of means, and that is equal to sigma,
00:16:08.800 --> 00:16:16.000
the real population standard deviation, divided by √n, where n is your sample size.
00:16:16.000 --> 00:16:28.000
Here as n increases what happens to the standard error?
00:16:28.000 --> 00:16:30.400
We should also call it, standard error.
00:16:30.400 --> 00:16:32.500
What happens to standard error?
00:16:32.500 --> 00:16:42.600
Standard error goes down, decreases.
00:16:42.600 --> 00:16:51.000
As n goes up standard error goes down because as n gets bigger and bigger and bigger, this whole thing gets smaller and smaller.
00:16:51.000 --> 00:16:56.000
Let us talk about the spread in the SDOS.
00:16:56.000 --> 00:17:04.200
Just like here, where we did not have a population mean,
00:17:04.200 --> 00:17:07.500
We do not have a population standard deviation.
00:17:07.500 --> 00:17:10.700
There is no variability there.
00:17:10.700 --> 00:17:14.000
Instead we use a different formula.
00:17:14.000 --> 00:17:17.800
First, let us talk about variance here.
00:17:17.800 --> 00:17:29.700
In order to write variance you still use sigma, but instead of sigma sub x bar you call it sigma sub p hat.
00:17:29.700 --> 00:17:31.700
Just like μ sub p hat.
00:17:31.700 --> 00:17:39.100
You are constantly saying this is the sigma of the whole bunch of sample proportions.
00:17:39.100 --> 00:17:43.900
And because we are talking about variance you want to square that.
00:17:43.900 --> 00:18:02.000
That is going to be p × (1 – p) ÷ n.
00:18:02.000 --> 00:18:08.100
When you look at this you see that this still holds for both of these.
00:18:08.100 --> 00:18:16.900
As n increases what happens to the value of the spread?
00:18:16.900 --> 00:18:19.000
Spread goes down.
00:18:19.000 --> 00:18:21.800
As n increases spread goes down.
00:18:21.800 --> 00:18:23.900
Imagine squeezing it.
00:18:23.900 --> 00:18:40.100
If you wanted to find standard deviation, what you would see is sigma sub p hat = √(p × (1 – p) / n).
00:18:40.100 --> 00:18:45.400
We will talk a little bit more about where this comes from in the next segment.
00:18:45.400 --> 00:18:51.200
But what I want you to see here is that there is this principle as n increases,
00:18:51.200 --> 00:19:03.600
as your sample size increases your sampling distribution spread goes down.
00:19:03.600 --> 00:19:06.100
It becomes less variable.
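Here is a small Python sketch of the spread formula above, showing the standard error shrinking as n grows. It is not part of the original lesson. The helper name `standard_error`, the value p = .60, and the list of sample sizes are assumptions for illustration only.

```python
import math

def standard_error(p, n):
    """Standard deviation of the SDOS: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

p = 0.60  # hypothetical population proportion
sample_sizes = (10, 100, 1000, 10000)  # hypothetical sample sizes
errors = [standard_error(p, n) for n in sample_sizes]

for n, se in zip(sample_sizes, errors):
    print(f"n = {n:>5}: sigma sub p hat = {se:.4f}")
```

Because n sits in the denominator under the square root, making the sample 100 times bigger cuts the standard error by a factor of 10.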
00:19:06.100 --> 00:19:14.100
We see a lot of similarities across the SDOM and SDOS.
00:19:14.100 --> 00:19:22.700
Let us talk about the binomial distribution and SDOS.
00:19:22.700 --> 00:19:32.800
Hopefully you remember the binomial distribution from a few lessons ago. There we were also talking about categorical data.
00:19:32.800 --> 00:19:39.900
Not only that we are talking about binary categorical data.
00:19:39.900 --> 00:19:56.700
Remember we were talking about how many successes, k number of successes out of n.
00:19:56.700 --> 00:20:05.800
You take a sample of size n, and you are counting the number of successes and plotting all of that on a distribution.
00:20:05.800 --> 00:20:15.000
The SDOS is also categorical, and we are also looking at binary choices, one or the other.
00:20:15.000 --> 00:20:23.000
Here we are not looking at k number of successes we are looking at sample proportions.
00:20:23.000 --> 00:20:35.100
I want to stop here to briefly remind you that when we are talking about the SDOS, the lowest number
00:20:35.100 --> 00:20:41.500
that p hat could be is 0 and the highest number is 1.
00:20:41.500 --> 00:20:43.600
Those are the limits.
00:20:43.600 --> 00:20:56.200
When we talk about a binomial distribution the lowest number that this could be is 0 and the highest number over here is n.
00:20:56.200 --> 00:21:00.100
It is because we are plotting k on this distribution.
00:21:00.100 --> 00:21:07.500
0 successes, then 1, 2, 3, 4, 5, all the way up to n successes out of n.
00:21:07.500 --> 00:21:12.700
What is the shape here? We do not necessarily know.
00:21:12.700 --> 00:21:14.600
It does not have to be normal.
00:21:14.600 --> 00:21:18.800
It could be different kinds of shapes.
00:21:18.800 --> 00:21:22.300
It might be skewed; it could be different shapes.
00:21:22.300 --> 00:21:23.900
We do not necessarily know.
00:21:23.900 --> 00:21:25.800
We do not know the shape.
00:21:25.800 --> 00:21:33.300
Here we know that as n increases, the shape becomes more normal.
00:21:33.300 --> 00:21:40.300
Here we do know the shape as long as we have a large enough n.
00:21:40.300 --> 00:21:43.400
What about center?
00:21:43.400 --> 00:21:59.200
Here, when we talked about center, we looked at how many successes we would normally see out of n.
00:21:59.200 --> 00:22:03.000
What would be the average k?
00:22:03.000 --> 00:22:22.100
Before in the binomial distribution our notion of center was largely guided by the probability of success × n.
00:22:22.100 --> 00:22:34.000
You can think of it like this, here is our little sample of n and here some proportion of the sample p
00:22:34.000 --> 00:22:43.400
of this sample is going to be a success, whatever the success is.
00:22:43.400 --> 00:22:49.400
Some proportion is going to be the success and how many is that p?
00:22:49.400 --> 00:22:53.400
I want the raw value; I do not want it in terms of percent.
00:22:53.400 --> 00:22:54.900
I want it in raw value here.
00:22:54.900 --> 00:23:13.500
To get that raw count, what I would say is that the center of the binomial distribution is p × n, because this whole thing is size n.
00:23:13.500 --> 00:23:16.500
It is only p × n.
00:23:16.500 --> 00:23:21.200
If it was 100% then it would be 1 × n, all of them.
00:23:21.200 --> 00:23:29.200
If it was 75% it would be .75 × n and that will give you only 75% of n.
00:23:29.200 --> 00:23:33.700
If it was 10% of n it would be .1 × n.
00:23:33.700 --> 00:23:36.700
This is our definition of center.
00:23:36.700 --> 00:23:41.200
Here, for the SDOS, we saw a related definition of center.
00:23:41.200 --> 00:23:50.700
All we did is basically divide this by n, because we no longer want k, the number of successes; we want to know what that proportion is.
00:23:50.700 --> 00:23:53.200
We do not care what the n is.
00:23:53.200 --> 00:23:56.600
We do not care what the actual k is.
00:23:56.600 --> 00:23:57.700
We just want the proportion.
00:23:57.700 --> 00:24:07.200
Life becomes easier and μ sub p hat is actually just p.
00:24:07.200 --> 00:24:08.300
Life is simple.
00:24:08.300 --> 00:24:11.900
Let us talk about spread.
00:24:11.900 --> 00:24:29.200
If you remember spread way back in the day here, this is standard deviation so you know why I am square rooting.
00:24:29.200 --> 00:24:36.100
The standard deviation is √(n × p × (1 – p)).
00:24:36.100 --> 00:24:50.900
You can see the similarity between this and the standard deviation sigma sub p hat, where we have √(p × (1 – p)).
00:24:50.900 --> 00:24:58.800
But instead of multiplying by √n we are dividing by √n.
00:24:58.800 --> 00:25:02.000
Let us think about the implications of that.
00:25:02.000 --> 00:25:08.900
Here as n increases, what happens to the standard deviation?
00:25:08.900 --> 00:25:17.200
It gets wider and wider and wider because remember if n increases we are stretching out the space.
00:25:17.200 --> 00:25:20.800
There is more room for variation.
00:25:20.800 --> 00:25:30.200
Standard deviation increases.
00:25:30.200 --> 00:25:36.500
However, here you are always limited to 0 and 1.
00:25:36.500 --> 00:25:43.900
You can never go above that, even if you increase your n and try to get more and more people in a sample.
00:25:43.900 --> 00:25:45.600
It does not matter.
00:25:45.600 --> 00:25:47.700
You are always stuck between 0 and 1.
00:25:47.700 --> 00:25:54.000
As n increases the standard deviation decreases.
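The contrast between the two spreads can be sketched in a few lines of Python. This is not part of the original lesson. The helper names `binomial_sd` and `p_hat_sd`, the value p = .5, and the sample sizes are assumptions for illustration.

```python
import math

def binomial_sd(p, n):
    """Spread of the count k in a binomial distribution: sqrt(n * p * (1 - p))."""
    return math.sqrt(n * p * (1 - p))

def p_hat_sd(p, n):
    """Spread of the sample proportion p hat in the SDOS: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

p = 0.5  # hypothetical probability of success
for n in (25, 100, 400):  # hypothetical sample sizes
    print(f"n = {n:>3}: binomial sd = {binomial_sd(p, n):.2f}, "
          f"sd of p hat = {p_hat_sd(p, n):.4f}")
```

The binomial spread grows with n because k can range from 0 to n, while the spread of p hat shrinks because p hat is always stuck between 0 and 1. Dividing the binomial spread by n gives the spread of p hat.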
00:25:54.000 --> 00:26:05.500
Here there are some definite similarities but there are moments of contrast that are important.
00:26:05.500 --> 00:26:10.600
Let us go on to some examples.
00:26:10.600 --> 00:26:22.000
The ethnicity of about 92% of the population of China is Han Chinese, so there are other ethnic minorities in China, but not a lot, only 8%.
00:26:22.000 --> 00:26:32.000
Suppose you take a random sample of 1,000 Chinese. What is the probability of getting 90% or fewer Han Chinese in your sample?
00:26:32.000 --> 00:26:37.100
What is the probability of getting 925 Han Chinese or more?
00:26:37.100 --> 00:26:51.400
Well, one thing that helps is for us to realize here that if we wanted to, we could use the binomial distribution
00:26:51.400 --> 00:26:57.600
because we can easily translate from 90% to 900 Han Chinese.
00:26:57.600 --> 00:27:09.800
But we can also use the sampling distribution of sample proportions because we can easily change 925 into 92.5%.
00:27:09.800 --> 00:27:13.100
We can choose either path we want.
00:27:13.100 --> 00:27:16.600
I am going to go with the SDOS because that is what the lesson is about.
00:27:16.600 --> 00:27:25.100
First we know that the population I am just going to draw a fake population here, just so that we can remember.
00:27:25.100 --> 00:27:29.600
Here is my population of China, and 92% of it is Han Chinese.
00:27:29.600 --> 00:27:43.400
My real p = .92 and so my 1 – p = .08.
00:27:43.400 --> 00:27:49.100
8% is non-Han Chinese, 92% is Han Chinese.
00:27:49.100 --> 00:27:57.100
Now, given this let us say I sample a whole bunch of times and every time I sample I get a sample proportion and I plot that.
00:27:57.100 --> 00:28:03.800
Because we have a fairly large sample size I can assume that we have a normal distribution.
00:28:03.800 --> 00:28:16.400
I know that my limits are 0 and 1, and this whole axis is really p hat.
00:28:16.400 --> 00:28:25.300
The question is what is the probability of getting 90% or fewer Han Chinese in your sample?
00:28:25.300 --> 00:28:30.100
First, it would be helpful to know what this middle is.
00:28:30.100 --> 00:28:35.500
Actually, I should not draw it centered like that.
00:28:35.500 --> 00:28:36.800
The middle is not 50%.
00:28:36.800 --> 00:28:53.000
Here it should really be 92% because the μ(p hat) = p and that is 92%.
00:28:53.000 --> 00:29:02.500
The upper limit here is 1.0 and the limit down here is 0.
00:29:02.500 --> 00:29:12.600
What is the probability of getting 90% or fewer Han Chinese?
00:29:12.600 --> 00:29:21.600
In order to figure out where 90% is, it would be helpful for us to know the standard error
00:29:21.600 --> 00:29:25.200
or the standard deviation of the sampling distribution.
00:29:25.200 --> 00:29:27.100
What is my standard error?
00:29:27.100 --> 00:29:41.000
This is sigma sub p hat, and in order to find that, it is going to be √(p × (1 – p) / n).
00:29:41.000 --> 00:29:54.200
That would be 92% × .08 ÷ 1,000 and take the square root of all of that.
00:29:54.200 --> 00:30:00.100
Feel free to do it on a calculator; I am just going to show it to you in Excel.
00:30:00.100 --> 00:30:18.600
We have √(92% × 8% / 1,000).
00:30:18.600 --> 00:30:24.900
Remember order of operations does not really matter for multiplying and dividing.
00:30:24.900 --> 00:30:30.400
They can be done in either order, so it does not matter if you do this first or this first.
00:30:30.400 --> 00:30:47.700
We see that we have a tiny standard deviation .0086.
00:30:47.700 --> 00:30:55.700
Even though 90% does not seem like that far away from 92%, it actually is quite far away.
00:30:55.700 --> 00:31:01.900
How do we find how far away .90 is?
00:31:01.900 --> 00:31:14.800
You have to think and say this is the normal distribution and there is something we know about normal distribution.
00:31:14.800 --> 00:31:18.200
We could find these areas in terms of z score.
00:31:18.200 --> 00:31:21.800
If we know the z score, we can find that area.
00:31:21.800 --> 00:31:30.000
These are my p hats but I'm going to start a row for z scores.
00:31:30.000 --> 00:31:44.100
For z scores, I know the middle is going to be 0, and 1 standard deviation below it, this .0086 distance, is –1.
00:31:44.100 --> 00:31:51.400
How many of these .0086 jumps away am I?
00:31:51.400 --> 00:31:54.900
I could use my notion of z scores.
00:31:54.900 --> 00:32:00.300
My z score for .90 looks something like this.
00:32:00.300 --> 00:32:07.500
What is the distance between the middle and the score that I'm interested in?
00:32:07.500 --> 00:32:11.000
That is just .90 – .92.
00:32:11.000 --> 00:32:17.000
That is going to give me that distance but I do not want that distance in terms of percentages.
00:32:17.000 --> 00:32:22.500
I want it in terms of my standard error in terms of these little jumps.
00:32:22.500 --> 00:32:32.400
I'm going to divide by .0086 to give me how many of these .0086 jumps away I am if I am at .90.
00:32:32.400 --> 00:32:38.700
Let us put back into our calculators.
00:32:38.700 --> 00:32:50.500
I need parentheses because, by order of operations, we need to do the subtraction before the division, and Excel will not know that otherwise.
00:32:50.500 --> 00:33:10.700
(.9 – .92) / .0086 = –2.33.
00:33:10.700 --> 00:33:19.500
Here is -2 and apparently this is -2.33.
00:33:19.500 --> 00:33:33.600
Okay, now that we have that z score are we done?
00:33:33.600 --> 00:33:39.700
No, we need to know what is the probability of getting 90% or fewer Han Chinese in your sample?
00:33:39.700 --> 00:33:42.200
What we want to know is this area here.
00:33:42.200 --> 00:33:52.300
This is 90% or fewer Han Chinese in the sample.
00:33:52.300 --> 00:33:54.800
That is the area we want to know.
00:33:54.800 --> 00:34:01.700
At this point because you have the z score you could look it up in the back of the book using your z tables.
00:34:01.700 --> 00:34:16.900
Just to show you I am going to use Excel to find this and I will leave my z score there because it will come in handy.
00:34:16.900 --> 00:34:32.900
Remember normsdist: it asks me to put in the z, and once I do that, I see that this proportion is very small, only about 1%.
00:34:32.900 --> 00:34:40.300
1% is our answer.
00:34:40.300 --> 00:34:46.600
So what is the probability of getting 90% or fewer Han Chinese in our sample?
00:34:46.600 --> 00:34:48.500
It is 1%.
00:34:48.500 --> 00:35:01.900
Next we want to find out the probability of getting 925 Han Chinese or more.
00:35:01.900 --> 00:35:14.800
In this case, we do the same thing but with 92.5%, so that would be somewhere past here.
00:35:14.800 --> 00:35:22.700
925 where is that?
00:35:22.700 --> 00:35:25.200
Let us find the z score so that we could be exact.
00:35:25.200 --> 00:35:41.200
The z score of .925 is the distance between .925 and .92 divided by the little jumps, the standard error, .0086.
00:35:41.200 --> 00:36:00.700
When I do that, what do I get? (.925 – .9) / .0086 = 2.91.
00:36:00.700 --> 00:36:30.200
That does not look right to me, because this should be a smaller z score than the last one; .925 is much closer to the mean than .90 is. I should have subtracted .92, not .9.
00:36:30.200 --> 00:36:36.300
Using .92 as the mean, our z score is .58.
00:36:36.300 --> 00:36:39.900
I also drew this one in the wrong place.
00:36:39.900 --> 00:36:46.500
.925 is somewhere here.
00:36:46.500 --> 00:36:56.600
That is .58 that is our z score.
00:36:56.600 --> 00:37:07.800
Let me shade in the area we want to find, because what
00:37:07.800 --> 00:37:15.600
we are looking for is the probability of getting this score or more. That area should be less than 50%.
00:37:15.600 --> 00:37:33.100
It should be much more than the last one. I have put the z score into my normal distribution function, but remember, this will give you what is on the left side, the negative side.
00:37:33.100 --> 00:37:37.700
We need to look at 1 - that normal distribution.
00:37:37.700 --> 00:37:41.700
This is 28%.
00:37:41.700 --> 00:37:53.700
What is the probability of getting 925 Han Chinese or more? That is going to be .28.
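The steps of this example can be checked with a short Python sketch. It is not part of the original lesson, which used Excel's normsdist; here the standard library's `statistics.NormalDist` (Python 3.8 and later) is swapped in as the normal curve, and the variable names are assumptions. Like the lesson, it relies on the normal approximation to the SDOS.

```python
import math
from statistics import NormalDist

p = 0.92   # population proportion of Han Chinese
n = 1000   # sample size

# Standard error of the SDOS: sqrt(p * (1 - p) / n)
se = math.sqrt(p * (1 - p) / n)

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Part 1: probability of 90% or fewer in the sample (left tail).
z_low = (0.90 - p) / se
prob_low = standard_normal.cdf(z_low)

# Part 2: probability of 925 or more, which is 92.5% or more (right tail).
z_high = (0.925 - p) / se
prob_high = 1 - standard_normal.cdf(z_high)

print(f"standard error = {se:.4f}")
print(f"z for .90  = {z_low:.2f}, P(p hat <= .90)  = {prob_low:.3f}")
print(f"z for .925 = {z_high:.2f}, P(p hat >= .925) = {prob_high:.3f}")
```

The cdf call plays the role of normsdist: it returns the area to the left of z, which is why the second part subtracts it from 1.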
00:37:53.700 --> 00:38:05.300
Example 2, college freshmen from a wide variety of colleges across the US participate in a survey
00:38:05.300 --> 00:38:11.200
where 61% reported that they are attending the college that was their first choice.
00:38:11.200 --> 00:38:19.500
If you took a random sample of 100 freshmen how likely is it that at least 50 of those students are attending their first choice college?
00:38:19.500 --> 00:38:24.800
Saying at least 50 that is a good thing to keep in mind for later.
00:38:24.800 --> 00:38:26.700
Let us draw this population.
00:38:26.700 --> 00:38:35.600
Here is my population of college freshmen and 61% a little more than half.
00:38:35.600 --> 00:38:43.800
61% is our p, and 1 – p is not quite 40%; it is 39%.
00:38:43.800 --> 00:38:48.400
The other 39 they are not attending their first choice college.
00:38:48.400 --> 00:38:59.700
Imagine taking out of that population a random sample of 100 freshmen and looking at
00:38:59.700 --> 00:39:03.700
the sample proportion and plotting that on the SDOS.
00:39:03.700 --> 00:39:16.100
100 is still a pretty large n so I am going to go with that normal distribution.
00:39:16.100 --> 00:39:23.500
I know my SDOS μ.
00:39:23.500 --> 00:39:32.200
μ sub p hat should equal p, and that is 61%.
00:39:32.200 --> 00:39:40.100
What is my standard deviation of this SDOS because I'm not just looking at who is in here.
00:39:40.100 --> 00:39:46.400
I am looking at it if I took a sample of 100 students how good is my sample?
00:39:46.400 --> 00:39:53.100
Whenever you hear that, like how good is the sample then you know you need a sampling distribution.
00:39:53.100 --> 00:40:08.400
I should probably find my standard error, because it is a sampling distribution.
00:40:08.400 --> 00:40:27.900
Here it is √(p × (1 - p) / n); that is going to be √(.61 × .39 ÷ 100).
00:40:27.900 --> 00:40:55.900
I will just look that up here, so √(.61 × .39 ÷ 100) = .0488.
00:40:55.900 --> 00:41:21.500
This little jump here is .0488; that is how big those little jumps are.
00:41:21.500 --> 00:41:33.000
I'm looking for how likely is it that at least 50 of these students are attending their first choice college.
00:41:33.000 --> 00:41:38.800
I can turn this into a percentage by looking at 50/100.
00:41:38.800 --> 00:41:52.100
My p hat that I have been given is 50/100 and that is .5 and I want to know how likely is this p hat.
00:41:52.100 --> 00:42:00.000
It is nice to find out where the p hat is and this is the raw proportion.
00:42:00.000 --> 00:42:20.400
It would be nice to find the z score and the z score of .5 should be the distance between .5 and the mean divided by the little jumps.
00:42:20.400 --> 00:42:25.400
How big are my jumps in order to find how many jumps away.
00:42:25.400 --> 00:42:49.000
Let us put that in our calculator, (.5 - .61) ÷ .0488 = -2.25.
00:42:49.000 --> 00:43:02.800
Here we are, somewhere like this, at -2.25, and this is .5.
00:43:02.800 --> 00:43:12.400
We want to know how likely is it that at least 50 of those students are attending that first choice college.
00:43:12.400 --> 00:43:18.400
When we say at least this is the lower limit.
00:43:18.400 --> 00:43:23.100
We are looking for this whole thing.
00:43:23.100 --> 00:43:38.000
You can look that up in the back of your book or you could say the probability that p hat will be greater than or equal to .5.
00:43:38.000 --> 00:43:54.200
I do not know if you remember this notation here, but remember normsdist will give us the negative side,
00:43:54.200 --> 00:43:58.700
so we have to take 1 - this little piece.
00:43:58.700 --> 00:44:20.500
1 - normsdist, the s is for standardized, that is how we get that z, and we put in our z and we should get .9879.
00:44:20.500 --> 00:44:27.500
That is very close to 1: .9879.
00:44:27.500 --> 00:44:40.900
Almost 99% of our samples should have at least 50% of the students attending their first choice college.
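If you want to check Example 2 outside of a spreadsheet, here is a minimal Python sketch. The helper name norm_cdf is my own choice, not something from the lesson; it stands in for normsdist and returns the area to the left of a z score. It also assumes, as in the example, that n = 100 is large enough to treat the SDOS as normal.

```python
import math

def norm_cdf(z):
    """Area under the standard normal curve to the left of z (the 'negative side')."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

p = 0.61         # population proportion attending their first choice college
n = 100          # sample size
p_hat = 50 / n   # "at least 50 students" turned into a sample proportion

# Standard error of the SDOS: sqrt(p * (1 - p) / n)
standard_error = math.sqrt(p * (1 - p) / n)

# How many "little jumps" p_hat sits away from the mean of the SDOS
z = (p_hat - p) / standard_error

# "At least" means the upper side, so take 1 minus the area on the left
prob_at_least = 1 - norm_cdf(z)

print(round(standard_error, 4), round(z, 2), round(prob_at_least, 4))
```

Because this keeps the unrounded standard error, the z score can differ from -2.25 in the second decimal place, but the probability still comes out at roughly .988.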
00:44:40.900 --> 00:44:50.700
Third example, about 75% of the US population owns a cell phone and that is growing.
00:44:50.700 --> 00:44:57.700
On average, what proportion of people would you expect to have a cell phone in a sample of 10, 20 or 40?
00:44:57.700 --> 00:45:01.700
This is talking about the average proportion.
00:45:01.700 --> 00:45:11.100
We are looking at the μ sub p hat on average, what proportion of people would you expect?
00:45:11.100 --> 00:45:21.000
For 10 people it should be 75% for n=10.
00:45:21.000 --> 00:45:24.800
What about n=20?
00:45:24.800 --> 00:45:32.600
Even for that, the sampling distribution's mean should be 75%.
00:45:32.600 --> 00:45:35.500
What about n=40?
00:45:35.500 --> 00:45:40.600
This should be 75%.
00:45:40.600 --> 00:45:50.800
What it is getting at is that no matter how big or little your sample size your mean of the sampling distribution
00:45:50.800 --> 00:45:56.900
does not really change and that is similar to what we saw in the sampling distribution of the mean as well.
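To see this with numbers, here is a small Python sketch. The function name sdos_summary is hypothetical, just for illustration. It returns the mean of the SDOS, which is always p, next to the standard error sqrt(p × (1 - p) / n), which the example does not ask for but which is the part that does depend on n.

```python
import math

def sdos_summary(p, n):
    """Return (mean, standard error) of the sampling distribution of sample proportions."""
    mean_p_hat = p                               # the mean of the SDOS equals p for any n
    standard_error = math.sqrt(p * (1 - p) / n)  # the spread shrinks as n grows
    return mean_p_hat, standard_error

p = 0.75  # proportion of the population that owns a cell phone
for n in (10, 20, 40):
    mean_p_hat, standard_error = sdos_summary(p, n)
    print(n, mean_p_hat, round(standard_error, 4))
```

The mean column stays at .75 for every n, while the standard error gets smaller as the sample gets bigger.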
00:45:56.900 --> 00:46:03.300
Final example: suppose that 60% of married women are employed.
00:46:03.300 --> 00:46:10.400
If you select 75 married women, what is the probability that between 30 and 40 women are employed?
00:46:10.400 --> 00:46:33.300
Here we need to know our actual population: these are all married ladies, and 60% are employed.
00:46:33.300 --> 00:46:37.900
That is our p and 1 - p is 40%.
00:46:37.900 --> 00:46:48.000
Imagine now taking samples of 75 so this is SDOS for n=75.
00:46:48.000 --> 00:46:55.400
75 is still fairly large so I will assume normal distribution.
00:46:55.400 --> 00:47:01.700
What is the probability that between 30 and 40 women are employed?
00:47:01.700 --> 00:47:22.000
We know that μ sub p hat is 60%. We also know that the standard deviation of p hat is √(.60 × .40 / n), where n is 75,
00:47:22.000 --> 00:47:25.100
and all of that under the square root sign.
00:47:25.100 --> 00:47:30.600
I will just quickly put this into my calculator.
00:47:30.600 --> 00:47:53.000
√(.6 × .4 ÷ 75) = .0566.
00:47:53.000 --> 00:48:02.900
What is the probability that between 30 and 40 women are employed?
00:48:02.900 --> 00:48:12.200
First of all, it helps me to figure out what percentage is 30 women out of 75 and what percentage is 40 women out of 75.
00:48:12.200 --> 00:48:25.400
Let us call that p hat sub 30 that is 30 ÷ 75 and also p hat sub 40 that is 40 ÷ 75.
00:48:25.400 --> 00:48:41.800
If I want to get it in decimals, 40 ÷ 75 and 30 ÷ 75, that is .53 and .4.
00:48:41.800 --> 00:48:56.600
I am going to know that these 2 slickers.
00:48:56.600 --> 00:49:02.300
The distance in between here is about 6%.
00:49:02.300 --> 00:49:17.600
I will go about 6 down so this first one and another 6, so this would be roughly .54.
00:49:17.600 --> 00:49:26.100
Let us actually find the z scores of this.
00:49:26.100 --> 00:49:35.700
Z(.4) = these are the p hats.
00:49:35.700 --> 00:49:52.600
The z score is (.4 - .6), all divided by the little jumps.
00:49:52.600 --> 00:50:01.100
And these little jumps are .0566.
00:50:01.100 --> 00:50:24.900
(.4 - .6), we need parentheses here, divided by .0566.
00:50:24.900 --> 00:50:48.100
That is the z score of -3.5.
00:50:48.100 --> 00:51:01.300
What about the z score of .53? I'm just going to forget about the repeating part.
00:51:01.300 --> 00:51:21.000
It will just be something like (.53 - .6) ÷ .0566 and that is -1.2.
00:51:21.000 --> 00:51:49.400
Here is my big problem, first we need to know this area but there is no table that will tell us just that area.
00:51:49.400 --> 00:52:11.600
Here is what we will have to do, we will have to take everything below this and then subtract out everything below that
00:52:11.600 --> 00:52:21.100
because then we will get this entire area including this infinite tail and then take out a tiny little bit of it to chop that part off
00:52:21.100 --> 00:52:24.100
to get it in between this part and this part.
00:52:24.100 --> 00:52:40.700
In order to do that, I will use my normsdist and remember that will give me the negative side.
00:52:40.700 --> 00:53:01.400
Let me put in my bigger number first and then subtract, that is my entire area below z= -1.2.
00:53:01.400 --> 00:53:03.700
That is the entire area.
00:53:03.700 --> 00:53:08.400
I am going to subtract out the tiny sliver way over here.
00:53:08.400 --> 00:53:15.700
Area below z= -3.5.
00:53:15.700 --> 00:53:28.000
I can just use normsdist of -3.5 and that should be a really tiny, tiny, tiny number.
00:53:28.000 --> 00:53:31.900
I need to subtract this area out of this.
00:53:31.900 --> 00:53:41.500
I take this whole thing and subtract this little sliver and I get a very similar number.
00:53:41.500 --> 00:53:47.600
.119 that is my area.
00:53:47.600 --> 00:54:17.000
This area here we can call it the probability where p hat is greater than or equal to .4 and less than or equal to .53 repeating is roughly .119.
00:54:17.000 --> 00:54:29.600
About 11.9% is the probability that between 30 and 40 women are employed.
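The "everything below the bigger z minus everything below the smaller z" idea from this example can be sketched in Python like this. Again, norm_cdf is my own stand-in for normsdist, and the sketch assumes the normal approximation is fine for n = 75, as the example does.

```python
import math

def norm_cdf(z):
    """Area under the standard normal curve to the left of z (the 'negative side')."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

p = 0.60   # proportion of married women who are employed
n = 75     # sample size

# Standard error of the SDOS: sqrt(p * (1 - p) / n)
standard_error = math.sqrt(p * (1 - p) / n)

# Turn the counts 30 and 40 into sample proportions, then into z scores
z_low = (30 / n - p) / standard_error
z_high = (40 / n - p) / standard_error

# Entire area below the bigger z, minus the tiny sliver below the smaller z
prob_between = norm_cdf(z_high) - norm_cdf(z_low)

print(round(z_low, 2), round(z_high, 2), round(prob_between, 3))
```

Keeping the repeating part of 40 ÷ 75 instead of rounding to .53 moves the upper z score a little, but the area still comes out at about .119.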
00:54:29.600 --> 00:54:35.100
That is the end of sampling distribution of sample proportions.
00:54:35.100 --> 00:54:37.000
Thanks for using www.educator.com.