WEBVTT mathematics/probability/murray
00:00:00.000 --> 00:00:05.700
Hi and welcome back to the probability lectures here on www.educator.com, my name is Will Murray.
00:00:05.700 --> 00:00:11.500
Today, we are going to be discussing the glamorously named hypergeometric distribution.
00:00:11.500 --> 00:00:16.300
Let me tell you about the situation where you would use the hypergeometric distribution.
00:00:16.300 --> 00:00:20.900
I set it up in terms of picking a committee of women and men.
00:00:20.900 --> 00:00:29.200
The idea is that you have a larger group, you have a big group of N people.
00:00:29.200 --> 00:00:33.500
There is a capital N and there is a lowercase n in the hypergeometric distribution.
00:00:33.500 --> 00:00:35.400
Make sure you do not get those mixed up.
00:00:35.400 --> 00:00:42.700
You have N people total, R women, and N - R men.
00:00:42.700 --> 00:00:47.800
What you are going to do is you are going to form a committee from this larger group.
00:00:47.800 --> 00:00:57.700
Your committee is going to have n people, that is the total number of men and women you are going to put on your committee.
00:00:57.700 --> 00:01:01.400
We want to emphasize here that this is an unordered choice.
00:01:01.400 --> 00:01:06.300
You are going to just grab a group of people, it does not matter which order you are grabbing them in.
00:01:06.300 --> 00:01:10.700
You are not going to have a chair of the committee, you are not going to have any special positions.
00:01:10.700 --> 00:01:17.100
You are going to have a group of people, you can think of it maybe as a team, a sports team.
00:01:17.100 --> 00:01:21.300
It is without replacement meaning you cannot pick the same person twice.
00:01:21.300 --> 00:01:32.100
You grab this group of people and then the question is, how many women did you end up with on your committee, out of all the possible men and women?
00:01:32.100 --> 00:01:39.700
More specifically, what are the chances of getting exactly y women on your committee?
00:01:39.700 --> 00:01:45.900
Our random variable here represents the number of women that you end up with on your committee.
00:01:45.900 --> 00:01:52.800
Let us go ahead and look at all the parameters, there is a lot of them, and let us figure out the formula.
00:01:52.800 --> 00:01:58.700
There are a lot of parameters here. N is the total number of people that you are looking at.
00:01:58.700 --> 00:02:04.900
That is the number of people that are available to be selected for your committee.
00:02:04.900 --> 00:02:15.000
R is the number of women available and that means that all that remain are men, that is N - R is the number of men available.
00:02:15.000 --> 00:02:20.000
And then n is the number of people we are going to pick.
00:02:20.000 --> 00:02:25.600
When I look at this large pool, let me draw a little Venn diagram here.
00:02:25.600 --> 00:02:32.000
This large pool of N people available, there are N people available overall.
00:02:32.000 --> 00:02:37.300
R of them are women, N - R of them are men.
00:02:37.300 --> 00:02:43.600
We are going to create our committee of n people and that means that,
00:02:43.600 --> 00:02:53.300
we want to find the probability of Y of those people being women, which means that n - Y of those people are men.
00:02:53.300 --> 00:02:57.900
The probability distribution formula looks very complicated but
00:02:57.900 --> 00:03:02.200
I'm going to try to persuade you that it is actually a very easy formula to remember,
00:03:02.200 --> 00:03:06.800
if you can remember this situation that we are describing.
00:03:06.800 --> 00:03:14.400
The probability formula is R choose Y × N - R choose n - Y.
00:03:14.400 --> 00:03:18.800
Multiply those together and then divide by N choose n.
00:03:18.800 --> 00:03:22.000
I want to emphasize that these are all binomial coefficients.
00:03:22.000 --> 00:03:28.000
These are combinations, you will use the factorial formula to simplify these.
00:03:28.000 --> 00:03:33.700
That looks like a very difficult formula to remember but it is not, and here is why.
00:03:33.700 --> 00:03:41.300
The denominator just represents, remember, there are N people total and you are choosing n of them.
00:03:41.300 --> 00:03:53.900
This is the total number of ways to choose your committee.
00:03:53.900 --> 00:04:02.700
There is N people total and you are choosing n of those people to be on your committee.
00:04:02.700 --> 00:04:10.500
If you are going to disregard gender, you are just making a choice of n people out of the total number of people.
00:04:10.500 --> 00:04:16.400
Suppose you take gender into account and suppose you want to get exactly Y women on your committee.
00:04:16.400 --> 00:04:19.700
You have a fixed number of women that you want to get on your committee.
00:04:19.700 --> 00:04:26.300
Then you will look at all the women in the room and you would choose exactly Y of them to be on your committee.
00:04:26.300 --> 00:04:34.100
There are R women and you are choosing Y of them to be on your committee.
00:04:34.100 --> 00:04:40.600
You are making a choice of Y people out of R women available.
00:04:40.600 --> 00:04:46.600
Then, after you have chosen your women, you look around at all the men and you choose the number of men you need.
00:04:46.600 --> 00:04:48.000
How many men do you need?
00:04:48.000 --> 00:04:58.800
If you want to get Y women, that means you need n - Y men. And how many men are available?
00:04:58.800 --> 00:05:02.900
We said there is N – R, the number of men available.
00:05:02.900 --> 00:05:10.000
This term really represents you choosing the men to be on your committee.
00:05:10.000 --> 00:05:13.500
You have a certain number of ways you can pick the women.
00:05:13.500 --> 00:05:16.600
You have a certain number of ways you can pick the men.
00:05:16.600 --> 00:05:25.300
You multiply those together, that gives you the total number of ways to pick your committee that has exactly Y women.
00:05:25.300 --> 00:05:32.000
And then, you divide that by the total number of ways to pick your committee, if you do not pay any attention to gender at all.
00:05:32.000 --> 00:05:39.300
That is actually, I think that is a fairly easy formula to remember, even though it looks very complicated.
00:05:39.300 --> 00:05:43.900
It is definitely one of the most complicated probability distribution formulas.
00:05:43.900 --> 00:05:50.600
This Y here, the range for Y, you could have as few as 0 people, 0 women on your committee.
00:05:50.600 --> 00:05:59.100
At the other end, it is a bit complicated because the largest number of women you can have on your committee would be n,
00:05:59.100 --> 00:06:04.400
because that is the size of the committee, or R because that is the number of women available.
00:06:04.400 --> 00:06:13.500
Whichever one of those is smaller, that is the maximum possible number of women you can have on your committee.
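The formula and the range just described can be sketched in Python. This is only a minimal sketch under stated assumptions: the function name `hypergeom_pmf` and the sample numbers are made up for illustration and do not come from the lecture. It relies on `math.comb` from the standard library (Python 3.8 or later), which returns 0 when asked to choose more items than are available.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(y, N, r, n):
    """P(Y = y): chance of exactly y women on a committee of n people,
    drawn from N people of whom r are women (hypothetical helper)."""
    if y < 0 or y > n:
        return Fraction(0)
    # comb(a, b) is 0 when b > a, so impossible counts get probability 0
    return Fraction(comb(r, y) * comb(N - r, n - y), comb(N, n))

# Illustrative numbers (not from the lecture): 10 people, 4 women, committee of 3
N, r, n = 10, 4, 3
probs = [hypergeom_pmf(y, N, r, n) for y in range(0, min(n, r) + 1)]
total = sum(probs)  # probabilities over the whole range add up to 1
print(probs, total)
```

Exact fractions are used here so that the check that the probabilities sum to 1 is not blurred by rounding.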
00:06:13.500 --> 00:06:17.500
We need to get a couple of properties down with the hypergeometric distribution.
00:06:17.500 --> 00:06:24.300
The most useful one is the mean, which you remember is the same as expected value.
00:06:24.300 --> 00:06:32.600
The expected value of the hypergeometric distribution is n × R/N.
00:06:32.600 --> 00:06:37.700
Lowercase n is the size of your committee, R is the number of women available,
00:06:37.700 --> 00:06:43.400
and N is the total number of people in the room that you are choosing from.
00:06:43.400 --> 00:06:48.900
The variance is really a kind of a nasty formula, I do not recommend memorizing it.
00:06:48.900 --> 00:06:55.400
I do not use it very often but I wanted to record it for posterity, in case you do need it.
00:06:55.400 --> 00:07:01.900
These are actual fractions, let me emphasize, these are not binomial coefficients.
00:07:01.900 --> 00:07:04.600
This is just what it turns out to be.
00:07:04.600 --> 00:07:13.800
Like I said, I do not really think there is a lot of intuition to be gained from this variance.
00:07:13.800 --> 00:07:16.600
I do not think it is worth memorizing that formula.
00:07:16.600 --> 00:07:20.500
The standard deviation, of course, is just the square root of the variance.
00:07:20.500 --> 00:07:22.600
It is always the square root of the variance.
00:07:22.600 --> 00:07:29.200
I just took the variance formula and took the square root of it, to get the standard deviation.
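The variance formula on the slide is not read aloud, so it does not appear in this transcript. For reference, the standard variance and standard deviation of the hypergeometric distribution, written in the lecture's N, r, n notation, are as follows. All three factors after n are ordinary fractions.

```latex
\[
V(Y) = n \cdot \frac{r}{N} \cdot \frac{N - r}{N} \cdot \frac{N - n}{N - 1},
\qquad
\sigma = \sqrt{V(Y)}
\]
```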
00:07:29.200 --> 00:07:31.800
Let us go ahead and jump into some examples here.
00:07:31.800 --> 00:07:39.400
In example 1, we have 33 students in a class, 12 women and 21 men.
00:07:39.400 --> 00:07:45.300
We are going to pick a committee, maybe we are going to do a group project and 7 students are going to be in the group.
00:07:45.300 --> 00:07:51.500
I will pick 7 students at random, what is the chance that we will get exactly 5 women working on that project?
00:07:51.500 --> 00:07:55.400
This is a hypergeometric distribution, let me set up the parameters here.
00:07:55.400 --> 00:08:01.900
N is the total number of people available, that is 33.
00:08:01.900 --> 00:08:06.500
R is the number of women in the room, that is 12.
00:08:06.500 --> 00:08:18.900
That means that N - R is the number of men available, that is 21.
00:08:18.900 --> 00:08:27.200
The number of people on our committee, n, is 7 and we are interested in the chance that we are going to end up
00:08:18.900 --> 00:08:27.200
with 5 women on our committee, that is the value of Y, so Y is 5.
00:08:32.000 --> 00:08:35.900
That is because we want our committee to have exactly 5 women.
00:08:35.900 --> 00:08:39.900
Let me write down the formula for the hypergeometric distribution.
00:08:39.900 --> 00:08:55.700
P of Y is R choose Y, that is where we picked the women, × N - R choose n - Y, that is N - R men available and n - Y men on our committee, ÷ N choose n,
00:08:55.700 --> 00:09:01.200
that is the total number of ways we could have chosen this committee or this group of students to do a project.
00:09:01.200 --> 00:09:03.300
I will just drop the numbers in.
00:09:03.300 --> 00:09:24.900
R is 12, Y is 5, N - R is 21, n - Y is 7 - 5, which is 2, N is 33, and n is 7.
00:09:24.900 --> 00:09:32.000
I'm going to leave that as a fraction like that, I did not bother to work it out to a decimal.
00:09:32.000 --> 00:09:37.100
It would be a fairly small number, if you actually worked out the numbers, it should be pretty small.
00:09:37.100 --> 00:09:42.400
But it would be a load of factorials that I just did not want to calculate.
00:09:42.400 --> 00:09:46.700
I did not think it would be very illuminating but it would be pretty small,
00:09:46.700 --> 00:09:50.200
because if you pick 7 people at random from a class like this,
00:09:50.200 --> 00:09:57.500
the chance of you getting 5 women is very low because there are more men than women in this class.
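For anyone who does want the decimal, the factorials do not have to be worked out by hand. Here is a short sketch using Python's standard `math.comb` (the variable names are just illustrative), with the parameters of Example 1:

```python
from math import comb

# Parameters from Example 1
N, r, n, y = 33, 12, 7, 5

ways_women = comb(r, y)          # 12 choose 5 = 792
ways_men = comb(N - r, n - y)    # 21 choose 2 = 210
ways_total = comb(N, n)          # 33 choose 7 = 4272048

p = ways_women * ways_men / ways_total
print(round(p, 4))  # 0.0389
```

So the chance is a little under 4%, which agrees with the expectation that it should be pretty small.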
00:09:57.500 --> 00:10:00.700
Let me recap where those came from.
00:10:00.700 --> 00:10:06.700
First, I set up all my parameters, the N, R, n, n – R, and Y.
00:10:06.700 --> 00:10:12.500
Then I just used the probability distribution formula for the hypergeometric distribution.
00:10:12.500 --> 00:10:19.600
This is the formula, I know it looks difficult to remember but if you kind of think about what each one of those factors represents,
00:10:19.600 --> 00:10:25.200
it is really not hard to remember the formula.
00:10:25.200 --> 00:10:31.600
I think this formula kind of makes intuitive sense, if you think about the R choose Y
00:10:31.600 --> 00:10:36.900
means you are picking Y women from R available women.
00:10:36.900 --> 00:10:44.300
N - R is the number of men available and n - Y is the number of men you want.
00:10:44.300 --> 00:10:52.100
We multiply those together and N choose n is the number of ways of choosing your committee in the first place.
00:10:52.100 --> 00:10:55.800
We drop the numbers in for each one of those and we just give that as our answer.
00:10:55.800 --> 00:11:00.600
That is our chance that the committee will contain exactly 5 women.
00:11:00.600 --> 00:11:03.400
We are going to hang onto these numbers for the next example.
00:11:03.400 --> 00:11:09.800
Remember the basic setup of this example and we will go ahead and take a look at that.
00:11:09.800 --> 00:11:13.500
Example 2 was referring back to example 1.
00:11:13.500 --> 00:11:22.900
In example 1, we were picking students from a class and we are picking a committee of 7 students, maybe a group project in a class.
00:11:22.900 --> 00:11:26.100
Let me just remind you of the parameters from example 1.
00:11:26.100 --> 00:11:31.900
We had N was the number of students in the class, 33.
00:11:31.900 --> 00:11:38.300
R was the number of women in the class, I got this from example 1, there were 12 women in the class.
00:11:38.300 --> 00:11:46.400
Lowercase n was the number of people that we are picking to be on our committee, that is 7.
00:11:46.400 --> 00:11:52.600
The expected number of women is the expected value of our random variable Y.
00:11:52.600 --> 00:12:03.100
Y is the number of women on our committee.
00:12:03.100 --> 00:12:14.400
We have a formula for the expected value of a hypergeometric random variable, the mean.
00:12:14.400 --> 00:12:20.100
E of Y is n × r/N.
00:12:20.100 --> 00:12:31.600
In this case, that n is 7 × r is 12, N is 33.
00:12:31.600 --> 00:12:36.900
I guess we could simplify that, 12 and 33, you can take out a 3 from each of those.
00:12:36.900 --> 00:12:43.400
7 × 4/11, that is 28/11.
00:12:43.400 --> 00:12:49.600
Our units here are women, that is the total number of women we expect on our committee.
00:12:49.600 --> 00:12:55.500
Obviously, you cannot have fractions of women but on average, if we did this many times,
00:12:55.500 --> 00:13:01.800
we would expect to see, on average, 28/11, which is a little less than 3.
00:13:01.800 --> 00:13:08.400
A little less than 3 women on the committee, on average.
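The arithmetic for this mean can be checked with Python's standard `fractions` module. This is just a sketch with illustrative variable names:

```python
from fractions import Fraction

# Parameters carried over from Example 1
N, r, n = 33, 12, 7

expected_women = n * Fraction(r, N)   # Fraction reduces 12/33 to 4/11
print(expected_women)                 # 28/11
print(float(expected_women))          # about 2.545
```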
00:13:08.400 --> 00:13:13.500
To recap here, I got these parameters from example 1.
00:13:13.500 --> 00:13:17.200
Example 1 setup how many people there were in the room, how many women, how many men,
00:13:17.200 --> 00:13:19.900
how many people we are picking on our committee.
00:13:19.900 --> 00:13:26.400
I got this formula for the mean from the third slide at the beginning of the lecture.
00:13:26.400 --> 00:13:29.100
If you scroll back a couple of slides, you will see this mean formula.
00:13:29.100 --> 00:13:36.000
I will just drop the numbers in and I simplified that down to a certain number of women.
00:13:36.000 --> 00:13:42.700
Of course, in real life, we will have some whole number of women, maybe 1 woman, or 2 women, or 3 women.
00:13:42.700 --> 00:13:51.200
On average, we will have a bit fewer than 3 women on our committee.
00:13:51.200 --> 00:13:56.700
In example 3 here, you open up your shoe closet and you do a shoe inventory.
00:13:56.700 --> 00:14:00.400
It looks like you have 10 pairs of shoes in your closet.
00:14:00.400 --> 00:14:02.700
You have lots of pairs of shoes in your closet.
00:14:02.700 --> 00:14:06.100
You are getting ready to move to a new apartment.
00:14:06.100 --> 00:14:11.900
You are in a hurry, you grab the nearest box you see and you start throwing your shoes in.
00:14:11.900 --> 00:14:15.500
You are not really keeping track of which shoe matches up with which.
00:14:15.500 --> 00:14:19.000
You are just throwing them all in, you will unpack them after you move.
00:14:19.000 --> 00:14:25.900
You start throwing your shoes in and you get 13 shoes in the box, and it is full.
00:14:25.900 --> 00:14:32.800
You seal up the box and then you start to wonder, how many left shoes are in the box and how many right shoes are in the box?
00:14:32.800 --> 00:14:39.400
In particular, what is the probability that there are exactly 5 left shoes and 8 right shoes in the box?
00:14:39.400 --> 00:14:50.000
This is a hypergeometric distribution because if you think about it, it is just like selecting women and men to be on a committee.
00:14:50.000 --> 00:14:52.800
You had a certain number of left shoes in your closet.
00:14:52.800 --> 00:14:54.200
You have a certain number of right shoes in your closet.
00:14:54.200 --> 00:15:00.200
You grab some and put them in the box, it is just like selecting women and men to be on your committee.
00:15:00.200 --> 00:15:05.000
Let me set up the parameters here for the hypergeometric distribution.
00:15:05.000 --> 00:15:13.300
N is the total number of people in a room, or in this case, it is the total number of shoes in the closet,
00:15:13.300 --> 00:15:15.600
before you start packing them.
00:15:15.600 --> 00:15:20.400
Shoes in the closet, counting both left and right.
00:15:20.400 --> 00:15:25.700
Let us say we got 10 pairs, there are 20 of those.
00:15:25.700 --> 00:15:29.800
R is the number of left handed shoes.
00:15:29.800 --> 00:15:36.500
Left handed shoes sounds a little strange, I will just say left shoes.
00:15:36.500 --> 00:15:42.000
There are 10 left shoes in your closet, assuming that all your pairs match up.
00:15:42.000 --> 00:16:00.400
Let me go ahead and calculate N - R, that is the number of right shoes, and 20 - 10 is still 10.
00:16:00.400 --> 00:16:07.700
Lowercase n is the number of shoes that you have chosen randomly, when you throw them in a box.
00:16:07.700 --> 00:16:18.800
The number in the box and that is given to us to be 13.
00:16:18.800 --> 00:16:23.400
Y is the number of left shoes that we are interested in.
00:16:23.400 --> 00:16:38.100
Y is 5, 5 left shoes, because we are curious about the likelihood that there are exactly 5 left shoes in the box.
00:16:38.100 --> 00:16:42.900
Let me go ahead and remind you of the formula for the hypergeometric distribution.
00:16:42.900 --> 00:16:48.900
P of Y, it is not hard to remember if you think about what these things are measuring.
00:16:48.900 --> 00:16:57.600
It is R choose Y because it is the number of left shoes available, the number that you are interested in,
00:16:57.600 --> 00:17:02.000
× the number of right shoes available, that is N – R.
00:17:02.000 --> 00:17:08.400
N - R choose n - y, where n - y is the number of right shoes that should be in the box, ÷
00:17:08.400 --> 00:17:14.800
all the possible ways of choosing your shoes, that is N choose n.
00:17:14.800 --> 00:17:16.900
I will just fill in all the numbers here.
00:17:16.900 --> 00:17:33.000
R is 10, Y is 5, N - R is 10, n - Y is 13 – 5, that is 8.
00:17:33.000 --> 00:17:54.800
N was 20 and n was 13, 20 choose 13.
00:17:54.800 --> 00:17:59.900
That is the number of ways that you could have chosen 13 shoes there.
00:17:59.900 --> 00:18:08.600
Again, I did not bother to simplify this down because it will be a lot of factorials.
00:18:08.600 --> 00:18:12.700
I think I will just leave it that way.
00:18:12.700 --> 00:18:18.500
If you want to simplify that down, you could just calculate a bunch of factorials,
00:18:18.500 --> 00:18:23.800
and then do some arithmetic there and get a decimal answer.
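If you do want the decimal, here is a short sketch of that arithmetic using Python's standard `math.comb`, with the parameters of this example (the variable names are illustrative only):

```python
from math import comb

# Parameters from Example 3: 20 shoes, 10 left, 13 in the box, 5 left wanted
N, r, n, y = 20, 10, 13, 5

numerator = comb(r, y) * comb(N - r, n - y)   # 252 * 45 = 11340
denominator = comb(N, n)                      # 20 choose 13 = 77520

p = numerator / denominator
print(round(p, 4))  # 0.1463
```

One detail the numbers make visible: there are only 10 right shoes, so a box of 13 must contain at least 3 left shoes. The formula handles this on its own, because `comb(10, 11)` evaluates to 0.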
00:18:23.800 --> 00:18:27.900
Let me recap and show you where each one of those values came from.
00:18:27.900 --> 00:18:34.100
Each one of these numbers, these parameters for the problem came from somewhere in the problem.
00:18:34.100 --> 00:18:37.800
N is the total number of shoes available in the closet.
00:18:37.800 --> 00:18:42.400
There were 10 pairs which means there were 20 shoes available.
00:18:42.400 --> 00:18:45.400
R is the number of left shoes.
00:18:45.400 --> 00:18:53.200
We figure this analogously to picking a committee of people from a group of women and men.
00:18:53.200 --> 00:18:57.600
Instead, we are picking a box of shoes from a group of left and right shoes.
00:18:57.600 --> 00:19:03.100
R is the number of left shoes, and that was just a choice we made.
00:19:03.100 --> 00:19:06.800
We picked R to be the number of left shoes.
00:19:06.800 --> 00:19:10.300
We could have switched the role of left shoes and right shoes, and it really would not matter,
00:19:10.300 --> 00:19:14.900
we would end up getting the same answer here.
00:19:14.900 --> 00:19:20.900
The number of left shoes, since there are 10 pairs, is exactly 10, and that makes
00:19:20.900 --> 00:19:25.700
the number of right shoes 20 - 10, which is 10.
00:19:25.700 --> 00:19:28.000
That is easy to figure out as well.
00:19:28.000 --> 00:19:39.500
The number of shoes in the box total is 13, that is where that 13 came from, that is n right there.
00:19:39.500 --> 00:19:42.700
Y is the number of left shoes that we are interested in.
00:19:42.700 --> 00:19:49.400
We want to find the probability of getting 5 left shoes, that 5 came from that number right there.
00:19:49.400 --> 00:19:52.100
You could switch the roles of left shoes and right shoes.
00:19:52.100 --> 00:19:57.200
You could have kept track of right shoes instead, and that would have given you the same answer.
00:19:57.200 --> 00:20:02.700
For the probability of that Y, I just wrote down the formula for the hypergeometric distribution.
00:20:02.700 --> 00:20:05.700
I do remember this, even though this is kind of a complicated formula,
00:20:05.700 --> 00:20:14.200
it is not hard to remember when you think about what each one of these things is counting and what each one represents physically.
00:20:14.200 --> 00:20:21.500
I just dropped in all the parameters, r, y, N, n.
00:20:21.500 --> 00:20:31.900
We got some number that you could simplify to a fraction or to a decimal but it did not seem to me to be that relevant.
00:20:31.900 --> 00:20:41.800
We are going to hang onto this example and we are going to keep using this example in problem 4.
00:20:41.800 --> 00:20:48.000
Remember these numbers and we will look at another aspect of this in the next example.
00:20:48.000 --> 00:20:50.700
Example 4, this refers back to example 3.
00:20:50.700 --> 00:20:54.900
If you have not just watched example 3, go back and watch example 3.
00:20:54.900 --> 00:21:01.100
Or at least, read the setup before you look at example 4, and then it will make sense.
00:21:01.100 --> 00:21:06.100
Remember back then, we have a shoe closet which has 10 pairs of shoes.
00:21:06.100 --> 00:21:11.600
We start throwing the shoes into a box at random because we are getting ready to move and we are in a hurry.
00:21:11.600 --> 00:21:15.900
We are not going to bother to keep the left shoe with its corresponding right shoe.
00:21:15.900 --> 00:21:21.000
We just throw our shoes into the box and it turns out that there are 13 shoes in the box.
00:21:21.000 --> 00:21:27.000
I'm curious about how many left shoes there might be in the box?
00:21:27.000 --> 00:21:33.800
This is again a hypergeometric distribution, let me remind you of the parameters that we had in example 3.
00:21:33.800 --> 00:21:35.700
This was coming from example 3.
00:21:35.700 --> 00:21:42.400
N was the total number of shoes, that is 20 total number of shoes in your closet.
00:21:42.400 --> 00:21:53.300
r is the number of left shoes, there are 10 left shoes, which means that there are 10 right shoes.
00:21:53.300 --> 00:22:03.600
n is the number of shoes in the box which we said back in example 3, we said the box fills up when you got 13 shoes in there.
00:22:03.600 --> 00:22:05.600
Our n is 13.
00:22:05.600 --> 00:22:08.800
I want to know the expected number of left shoes in the box.
00:22:08.800 --> 00:22:12.400
Remember, we sealed up the box, we cannot go and count.
00:22:12.400 --> 00:22:18.200
Let us try to find the expected value of our random variable here.
00:22:18.200 --> 00:22:30.800
Y is the number of left shoes in the box.
00:22:30.800 --> 00:22:38.400
We want to find the expected number of left shoes, E of Y, the expected number of left shoes.
00:22:38.400 --> 00:22:43.000
We have a formula for the expected value of the hypergeometric distribution.
00:22:43.000 --> 00:22:45.300
Let me remind you what it was.
00:22:45.300 --> 00:22:48.600
It is the same as the mean.
00:22:48.600 --> 00:23:00.300
It is n × r/N, that is in this case, n is 13, r is 10.
00:23:00.300 --> 00:23:02.600
I’m just reading these from up above.
00:23:02.600 --> 00:23:09.900
N is 20, the 10 and the 20 cancel to 1/2, and it simplifies down to 13/2.
00:23:09.900 --> 00:23:20.500
13/2 which is 6.5 left shoes.
00:23:20.500 --> 00:23:25.800
In one sense that makes perfect sense and in another sense it is absurd, because you cannot have half a shoe.
00:23:25.800 --> 00:23:28.100
You are not cutting your shoes in half.
00:23:28.100 --> 00:23:36.400
It does not really mean that when we open the box, there will be 6.5 left shoes in there.
00:23:36.400 --> 00:23:42.300
You will find some whole number of shoes, you might find 4 left shoes, you might find 7 left shoes.
00:23:42.300 --> 00:23:45.000
You will not find 6.5 left shoes.
00:23:45.000 --> 00:23:53.500
What it does mean is that if you pack many boxes and there are 13 shoes in each one,
00:23:53.500 --> 00:23:59.700
on average, over the long run you will expect to find 6 1/2 left shoes per box.
00:23:59.700 --> 00:24:06.600
On average, if you add up all the left shoes and divide by the number of boxes.
00:24:06.600 --> 00:24:11.200
Of course, that does make sense because if you have 13 shoes,
00:24:11.200 --> 00:24:17.700
remember that in your shoe closet, half of the shoes were left and half of the shoes were right.
00:24:17.700 --> 00:24:20.700
On average, you expect to see half of them being left shoes.
00:24:20.700 --> 00:24:27.800
If you have 13 shoes total then on average you expect to see 6 1/2 left shoes.
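As a cross-check, the same 6 1/2 can be recovered from the original definition of expected value, by weighting each possible count of left shoes by its hypergeometric probability. This is a sketch only; the helper name `pmf` and the other variable names are made up for illustration.

```python
from fractions import Fraction
from math import comb

# Parameters from Example 3: 20 shoes, 10 left, 13 in the box
N, r, n = 20, 10, 13

def pmf(y):
    # hypergeometric probability of exactly y left shoes in the box
    return Fraction(comb(r, y) * comb(N - r, n - y), comb(N, n))

# Weighted average over every possible count of left shoes
mean_from_definition = sum(y * pmf(y) for y in range(0, min(n, r) + 1))
mean_from_formula = Fraction(n * r, N)

print(mean_from_definition, mean_from_formula)  # 13/2 13/2
```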
00:24:27.800 --> 00:24:29.800
Let me recap that problem.
00:24:29.800 --> 00:24:33.800
We took these parameters from example 3.
00:24:33.800 --> 00:24:38.800
If these numbers look strange to you, just go back and read the setup in example 3.
00:24:38.800 --> 00:24:44.800
You will see that we had 20 shoes in the closet, 10 left shoes, 10 right shoes.
00:24:44.800 --> 00:24:49.900
We took 13 of them, we threw them into a box.
00:24:49.900 --> 00:24:59.300
The mean of the number of left shoes, using the hypergeometric distribution, is n × r/N.
00:24:59.300 --> 00:25:06.000
That formula came from our slide about means and standard deviations, earlier on in this lecture.
00:25:06.000 --> 00:25:08.000
I think it was the third slide of this video.
00:25:08.000 --> 00:25:10.800
You can scroll back and see where that comes from.
00:25:10.800 --> 00:25:13.600
I just dropped in the numbers 13, 10, and 20.
00:25:13.600 --> 00:25:21.800
That simplifies down to 6.5 left shoes which, taken literally, does not make sense because you will find a whole number of shoes in the box.
00:25:21.800 --> 00:25:28.000
But as an expected value, as an average value, it makes perfect sense because out of 13 shoes,
00:25:28.000 --> 00:25:32.600
you can expect half of them to be left shoes and half of them to be right shoes.
00:25:32.600 --> 00:25:41.100
You would expect in the long run, an average of 6 1/2 left shoes in the box.
00:25:41.100 --> 00:25:48.900
Example 5 here is a little more theoretical, it is asking us to use indicator variables and linearity of expectation
00:25:48.900 --> 00:25:59.000
to prove that the expected value of a hypergeometric random variable is n × r/N.
00:25:59.000 --> 00:26:02.500
This one is a little more theoretical, we are going to prove this value.
00:26:02.500 --> 00:26:05.300
We cannot just pull it from the earlier slide.
00:26:05.300 --> 00:26:08.000
Let me show you how this works out.
00:26:08.000 --> 00:26:12.200
Remember the premise of the hypergeometric distribution.
00:26:12.200 --> 00:26:33.100
We are calculating a random variable that represents the number of women on a committee of n people,
00:26:33.100 --> 00:26:37.000
where n is the number of people on our committee.
00:26:37.000 --> 00:26:42.500
We have several parameters here.
00:26:42.500 --> 00:26:54.200
N is the total number of people in the room that we are going to pick from, total number of people.
00:26:54.200 --> 00:27:09.200
Among those total number of people, R is the number of women and N - R is the number of men in the room.
00:27:09.200 --> 00:27:12.400
N – r is the number of men.
00:27:12.400 --> 00:27:18.900
We are going to pick a committee of n people and we want to find the expected number of women.
00:27:18.900 --> 00:27:22.600
There is a very clever way to do this which is to set up indicator variables.
00:27:22.600 --> 00:27:25.400
Let me show you what I mean by indicator variables.
00:27:25.400 --> 00:27:32.300
Let me define Y1, which by definition is an indicator variable.
00:27:32.300 --> 00:27:38.000
Let us consider that we are going to pick these people to be on our committee one by one.
00:27:38.000 --> 00:27:43.300
We look around the room and say I want you, you, you, and you, to be on the committee.
00:27:43.300 --> 00:27:45.600
We are picking these people one by one.
00:27:45.600 --> 00:27:55.300
Y1 is going to be an indicator variable that tells us whether the first person on that committee is a woman or not.
00:27:55.300 --> 00:28:12.700
Y1 is defined to be 1 if we get a woman on the first pick.
00:28:12.700 --> 00:28:18.000
We pick our first person to be on the committee, Y1 is an indicator variable.
00:28:18.000 --> 00:28:27.000
It is going to be a one if it is a woman, 0 if it is a man.
00:28:27.000 --> 00:28:32.000
It is a little strange, but we can say Y1 is the number of women we get on the first choice.
00:28:32.000 --> 00:28:35.900
We either get one woman or we get a man, that is 0 women.
00:28:35.900 --> 00:28:50.000
We will define Y2 to be 1 if we get a woman on the second pick.
00:28:50.000 --> 00:28:52.000
The second person we look at.
00:28:52.000 --> 00:28:55.700
If that is a woman, we say Y2 is going to be 1.
00:28:55.700 --> 00:29:02.100
If it is a man, we say Y2 is going to be 0.
00:29:02.100 --> 00:29:06.300
Let us keep on going and we are picking n people to be on this committee.
00:29:06.300 --> 00:29:12.000
We go to Yn here, we define our indicator variables.
00:29:12.000 --> 00:29:16.200
There is one variable for each person on this committee.
00:29:16.200 --> 00:29:28.300
What that means is Y is the total number of women on the committee.
00:29:28.300 --> 00:29:35.000
What that means is it is the number of women we got on the first pick, which is either 1 or 0.
00:29:35.000 --> 00:29:39.400
Plus the number of women we got on the second pick, and so on up to Yn.
00:29:39.400 --> 00:29:47.200
The total number of women, we can count the number of women just by counting all the 1s we got from those indicator variables.
00:29:47.200 --> 00:29:50.900
That breaks down into a sum of these indicator variables.
00:29:50.900 --> 00:30:06.100
In order to find the expected value of Y, the expected number of women, it is the same as the expected value of Y1 + Y2, up to Yn.
00:30:06.100 --> 00:30:09.500
We can use linearity of expectation.
00:30:09.500 --> 00:30:18.700
This is where we are going to use linearity right here, linearity of expectation, very important here.
00:30:18.700 --> 00:30:25.600
These variables are not independent but linearity of expectation does not require that.
00:30:25.600 --> 00:30:29.000
Even though these variables are not independent, if you get a woman on the first pick,
00:30:29.000 --> 00:30:34.100
you are less likely to get a woman on the second because there are fewer women to pick now.
00:30:34.100 --> 00:30:38.200
Even though they are not independent, you can still use linearity of expectation.
00:30:38.200 --> 00:30:43.200
That is the glorious thing about linearity of expectation.
00:30:43.200 --> 00:30:48.500
It breaks up into the sum of the expected values of each of these indicator variables.
00:30:48.500 --> 00:30:52.900
What is the expected value of each of these indicator variables?
00:30:52.900 --> 00:30:56.600
Let us think about that, I will give you a good way to think about that.
00:30:56.600 --> 00:31:05.300
If you think about just looking at Y1, we pick one person out of a crowd.
00:31:05.300 --> 00:31:12.900
The original definition of expected value is, you look at all the possible values of that variable
00:31:12.900 --> 00:31:19.300
and you multiply each value × the probability of getting that value, and then you add those up.
00:31:19.300 --> 00:31:25.300
This is going back to the original definition of expected value.
00:31:25.300 --> 00:31:29.700
I covered this in one of the very early lectures on probability.
00:31:29.700 --> 00:31:33.600
You can go back and look at some of those early lectures on probability and you will see this.
00:31:33.600 --> 00:31:37.500
What are the possible values of these indicator variables?
00:31:37.500 --> 00:31:47.000
There are only 0 and 1 because we set up here that the indicator variable is always going to be 0 or 1.
00:31:47.000 --> 00:31:59.100
This expands out into 0 × the probability of 0 + 1 × the probability of 1.
00:31:59.100 --> 00:32:03.300
What is the probability that indicator variable is going to come up 0?
00:32:03.300 --> 00:32:14.500
It is the probability that we get a man because the indicator variable was 0 if we get a man + 1 ×
00:32:14.500 --> 00:32:19.800
the probability that we get a woman when we make our first pick.
00:32:19.800 --> 00:32:27.200
I do not care about the 0 term, so all that is left is the probability of getting a woman.
00:32:27.200 --> 00:32:29.700
How many people were there in the room?
00:32:29.700 --> 00:32:38.900
There were N people in the room and r of those people are women.
00:32:38.900 --> 00:32:46.900
This is exactly r/N, that is the expected value of one of those indicator variables.
00:32:46.900 --> 00:32:51.800
It is just r/N.
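Written out for the first pick, the calculation just described looks like this. The notation P(Y_1 = 0) and P(Y_1 = 1) is shorthand for "we get a man" and "we get a woman" on that pick.

```latex
\begin{aligned}
E(Y_1) &= 0 \cdot P(Y_1 = 0) + 1 \cdot P(Y_1 = 1) \\
       &= 0 \cdot \frac{N - r}{N} + 1 \cdot \frac{r}{N} \\
       &= \frac{r}{N}
\end{aligned}
```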
00:32:51.800 --> 00:32:59.800
We can say that all of those indicator variables, they all have the same expected value.
00:32:59.800 --> 00:33:12.700
Each one of these is r/N and there are n of these variables.
00:33:12.700 --> 00:33:20.500
What we get here is n × r/N.
00:33:20.500 --> 00:33:26.500
That is the expected value of our random variable.
00:33:26.500 --> 00:33:30.800
That is the expected number of women on our committee.
00:33:30.800 --> 00:33:39.300
That checks with the value of the mean that I gave you way back on the third slide of this video.
00:33:39.300 --> 00:33:43.300
That is really where that number comes from, now you have the derivation to back it up.
00:33:43.300 --> 00:33:46.300
Now, you hopefully understand it yourself.
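If you want to check that mean numerically, here is a minimal sketch. It assumes the usual hypergeometric probability formula, C(r, y) × C(N - r, n - y) / C(N, n), and the numbers N = 10, r = 4, n = 3 and the function names are made up for illustration. It adds up y × the probability of y directly and compares the result with n × r/N.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(N, r, n, y):
    """Probability of exactly y women on a committee of n,
    drawn from N people of whom r are women."""
    return Fraction(comb(r, y) * comb(N - r, n - y), comb(N, n))

def hypergeom_mean(N, r, n):
    """Mean from the original definition: sum of y * P(Y = y)."""
    return sum(y * hypergeom_pmf(N, r, n, y) for y in range(n + 1))

# Hypothetical numbers, not from the lecture.
N, r, n = 10, 4, 3
mean_from_pmf = hypergeom_mean(N, r, n)
mean_from_formula = Fraction(n * r, N)
print(mean_from_pmf, mean_from_formula)  # 6/5 6/5
```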
00:33:46.300 --> 00:33:48.900
In case that did not make sense, a quick recap here.
00:33:48.900 --> 00:33:53.200
N was the total number of people in the room.
00:33:53.200 --> 00:33:59.200
r is the number of women which leaves N - r to be the number of men left over.
00:33:59.200 --> 00:34:08.300
We are going to pick a committee of n people and Y is the number of women we get on our committee.
00:34:08.300 --> 00:34:11.600
One way to break that down is to look at our picks one by one.
00:34:11.600 --> 00:34:17.400
We pick this person and then that person and then that person and then that person, to be on our committee.
00:34:17.400 --> 00:34:25.000
For each person, we set up an indicator variable, so there are n of them, and each one is going to be 1 if we get a woman and 0 if we get a man.
00:34:25.000 --> 00:34:31.400
Each person has their own indicator variable and that means the total number of women
00:34:31.400 --> 00:34:34.900
is just the sum of all these indicator variables.
00:34:34.900 --> 00:34:41.500
It is the sum of all the women that we got when we made each one of these picks.
00:34:41.500 --> 00:34:49.600
The expected value of Y is the expected value of the sum, and here is where we use linearity of expectation.
00:34:49.600 --> 00:34:58.400
That is kind of a big deal in probability, let me highlight that. It lets us break that up into the expected value of each of these indicator variables.
00:34:58.400 --> 00:35:06.900
We can calculate the expected value of these indicator variables, we just say the only possible values they can take are 0 and 1.
00:35:06.900 --> 00:35:15.100
Using our original definition for expected value, we have 0 × the probability of 0 + 1 × the probability of 1.
00:35:15.100 --> 00:35:20.900
We really only need to calculate the probability of 1, which means the probability that we get a woman,
00:35:20.900 --> 00:35:23.900
when we pick a certain person from this room.
00:35:23.900 --> 00:35:37.100
There are r women in the room and N total people in the room, so that probability is r/N.
00:35:37.100 --> 00:35:48.100
We fill that in for each of our expected values here, it is the same for every indicator variable.
00:35:48.100 --> 00:35:52.300
We are adding up a bunch of r/N.
00:35:52.300 --> 00:36:01.700
We are adding up n of them and we get n × r/N as our answer.
00:36:01.700 --> 00:36:11.100
That checks with the mean of the hyper geometric random variable that I gave you back earlier on in this lecture.
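The whole recap can be put on one line. Here Y_i stands for the indicator variable for the i-th pick, as above.

```latex
E(Y) = E(Y_1 + Y_2 + \cdots + Y_n)
     = E(Y_1) + E(Y_2) + \cdots + E(Y_n)
     = \underbrace{\frac{r}{N} + \frac{r}{N} + \cdots + \frac{r}{N}}_{n \text{ terms}}
     = n \cdot \frac{r}{N}
```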
00:36:11.100 --> 00:36:18.900
That is our last example problem and that wraps up our lecture here on the hyper geometric distribution.
00:36:18.900 --> 00:36:23.800
You are watching the probability videos here on www.educator.com.
00:36:23.800 --> 00:36:27.000
My name is Will Murray, thank you for joining us, see you next time, bye.