WEBVTT mathematics/statistics/son
00:00:00.000 --> 00:00:05.600
Hi, welcome to the first lesson in the www.educator.com statistics course.
00:00:05.600 --> 00:00:12.000
Today we are going to talk about descriptive statistics versus inferential statistics.
00:00:12.000 --> 00:00:18.400
Here is the road map for today. First, we need to distinguish how statistics is different from other mathematics.
00:00:18.400 --> 00:00:24.300
Then we will talk about how descriptive and inferential statistics differ.
00:00:24.300 --> 00:00:30.500
Finally we are going to talk about populations versus samples and then we are going to put all of those ideas together
00:00:30.500 --> 00:00:37.600
and look at how populations, samples, descriptive, and inferential statistics all fit together.
00:00:37.600 --> 00:00:48.000
First things first: how is statistics different from other specializations in mathematics, such as trigonometry, geometry, calculus, or linear algebra?
00:00:48.000 --> 00:00:55.500
Statistics is different because it is the science of classifying, organizing, and interpreting or analyzing data.
00:00:55.500 --> 00:01:03.100
You might be thinking to yourself - "Hey science? I thought this was mathematics." Right?
00:01:03.100 --> 00:01:09.300
Statistics is linked to much of science, and because of that it is important in mathematics.
00:01:09.300 --> 00:01:13.300
Let me explain that link to you in just one second.
00:01:13.300 --> 00:01:17.400
First I want to step back and think about high school science for a moment.
00:01:17.400 --> 00:01:23.800
A lot of high school science is concerned with measurement. We go around measuring things: how fast people run,
00:01:23.800 --> 00:01:29.600
how fast things drop, how much things grow, how much things weigh,
00:01:29.600 --> 00:01:35.400
and how big things are. We are gathering a lot of data on measurement.
00:01:35.400 --> 00:01:44.100
Then we find patterns within those measurements and that is basically the fundamentals behind high school science.
00:01:44.100 --> 00:01:49.900
Those patterns can often be described as mathematical formulas.
00:01:49.900 --> 00:01:57.400
Some of you may have had the experience of trying to derive the gravitational constant.
00:01:57.400 --> 00:02:06.400
To some of you this equation might look familiar: D = ½ gt².
00:02:06.400 --> 00:02:18.200
(D) stands for distance, (g) stands for the gravitational constant and (t) stands for time.
00:02:18.200 --> 00:02:23.300
Some of you may have had the experience of dropping things off a building and timing them
00:02:23.300 --> 00:02:29.200
and putting in these numbers to try and figure out what (g) is.
00:02:29.200 --> 00:02:39.300
(g) theoretically is supposed to be 9.8 m/sec².
00:02:39.300 --> 00:02:47.200
But rarely do you calculate exactly 9.8 when you put distance and time into this equation.
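To make this concrete, here is a minimal Python sketch that rearranges D = ½ gt² into g = 2D / t² and applies it to a handful of drops. The drop height and the stopwatch timings are made up for illustration, and the names `DROP_HEIGHT_M`, `TIMINGS_S`, and `estimate_g` are hypothetical.

```python
# Hypothetical data: one drop height (meters) and five hand-timed falls (seconds).
DROP_HEIGHT_M = 20.0
TIMINGS_S = [1.98, 2.05, 2.02, 2.10, 1.95]


def estimate_g(distance_m, time_s):
    """Rearrange D = 1/2 * g * t^2 to solve for g."""
    return 2.0 * distance_m / time_s ** 2


# One estimate of g per timed drop.
g_estimates = [estimate_g(DROP_HEIGHT_M, t) for t in TIMINGS_S]

for t, g in zip(TIMINGS_S, g_estimates):
    print(f"t = {t:.2f} s  ->  g = {g:.2f} m/s^2")
```

With these invented timings, the estimates fall on both sides of 9.8 instead of landing on a single value.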
00:02:47.200 --> 00:02:52.700
Often, science students think, "I'm terrible at science, I'm not getting the right answer,"
00:02:52.700 --> 00:02:58.100
but it is because all of these measurements are inherently a little bit sloppy.
00:02:58.100 --> 00:03:06.100
Granted, high school students might be sloppier than other scientists, but in actuality all science experiments
00:03:06.100 --> 00:03:11.000
have measurement error and there is variance that comes with measurement.
00:03:11.000 --> 00:03:19.300
There is always a little bit of jiggle in that data, and often we do not pinpoint the exact right value. Even when you look at something
00:03:19.300 --> 00:03:26.800
like measuring someone's height, you might have 10 people measure the same person's height and come up with slightly different answers.
00:03:26.800 --> 00:03:33.300
It is not because they are trying to cheat, but that person might take a deep breath or slouch a little bit,
00:03:33.300 --> 00:03:41.800
or maybe they read the tape measure at their hairline instead of at their actual height.
00:03:41.800 --> 00:03:45.000
There are always different reasons for measurement error.
00:03:45.000 --> 00:03:49.800
All science is fraught with measurement error.
00:03:49.800 --> 00:04:02.600
Why? Because all experiments, even the good ones at CERN, MIT, and Caltech, will have a little bit of sloppiness.
00:04:02.600 --> 00:04:10.600
That is because we are dealing with measuring the physical world.
00:04:10.600 --> 00:04:16.000
It is not that we are looking at terrible scientists or that they are just really messy;
00:04:16.000 --> 00:04:22.000
it is just that inherently in measuring the world we are going to have a little bit of sloppiness.
00:04:22.000 --> 00:04:29.500
Now because of that sloppiness, even the best experiment will produce a scatter of numbers.
00:04:29.500 --> 00:04:48.800
The best experiments as well as the worst experiments will produce a scatter of values or measurements.
00:04:48.800 --> 00:04:50.500
That is where the problem is, right?
00:04:50.500 --> 00:04:59.500
You will not get just one number, like a nice 9.8 for the gravitational constant; you will instead get this scatter of numbers.
00:04:59.500 --> 00:05:04.900
How do we deal with that scatter? That is where statistics comes in.
00:05:04.900 --> 00:05:11.600
Statistics is the math of distributions, and now you can see how the math part and the science part fit together.
00:05:11.600 --> 00:05:16.800
Statistics was invented because we want to do better in science.
00:05:16.800 --> 00:05:30.300
We even have a special name for the scatter of measurements and that is called a distribution.
00:05:30.300 --> 00:05:36.700
Not only that but we are going to look and see how we can go from frequencies of these values
00:05:36.700 --> 00:05:41.000
to probability distributions of these values.
00:05:41.000 --> 00:05:59.700
Those are also going to be called probability distributions.
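As a rough illustration of going from frequencies to proportions, the Python sketch below counts how often each value appears in a set of repeated measurements and then divides by the total. The measurements are invented, and `heights_cm`, `frequencies`, and `relative_frequencies` are hypothetical names. Relative frequencies from a small sample are only an empirical stand-in for a probability distribution.

```python
from collections import Counter

# Hypothetical repeated measurements of one person's height, in centimeters.
heights_cm = [170, 171, 170, 169, 170, 171, 172, 170, 169, 171]

# Frequency distribution: how many times each value was observed.
frequencies = Counter(heights_cm)

# Relative frequencies: each count divided by the total number of measurements.
total = len(heights_cm)
relative_frequencies = {
    value: count / total for value, count in sorted(frequencies.items())
}

for value, proportion in relative_frequencies.items():
    print(f"{value} cm: observed {frequencies[value]} times, proportion {proportion:.1f}")
```

The proportions add up to 1, which is what lets us read them as estimated probabilities for each value.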
00:05:59.700 --> 00:06:05.800
One thing that should come to your mind is that when you have a scatter of values or a whole bunch of different probabilities
00:06:05.800 --> 00:06:13.300
predicting different values then you are not going to have just one number, you are going to have a whole set of numbers.
00:06:13.300 --> 00:06:17.700
Because of that we are going to have to deal with the mathematics a little bit differently.
00:06:17.700 --> 00:06:25.200
We are not just computing one number at a time and looking at one number and adding things to it, subtracting things from it, doing things to it.
00:06:25.200 --> 00:06:28.700
Instead we are looking at entire distributions.
00:06:28.700 --> 00:06:30.500
How do we treat these distributions?
00:06:30.500 --> 00:06:31.800
How do we interpret them?
00:06:31.800 --> 00:06:35.100
That is the question behind statistics.
00:06:35.100 --> 00:06:40.500
You might think that working with whole distributions sounds problematic.
00:06:40.500 --> 00:06:42.700
Sometimes it might seem like it.
00:06:42.700 --> 00:06:47.900
It might seem like these equations are pretty complicated because we have to deal with the whole distribution.
00:06:47.900 --> 00:06:52.500
But you will also get some great stuff out of working with distributions.
00:06:52.500 --> 00:06:59.200
One reason is that distributions are often much more predictable than individual values.
00:06:59.200 --> 00:07:15.600
Distributions are more predictable than individual values.
00:07:15.600 --> 00:07:23.800
Models of distributions or theories of distributions can often predict the mathematical nature of randomness.
00:07:23.800 --> 00:07:24.800
Is that not great?
00:07:24.800 --> 00:07:27.700
They are predicting randomness.
00:07:27.700 --> 00:07:36.100
That is a little bit of what statistics is about: dealing with that randomness and taming it.
00:07:36.100 --> 00:07:40.300
How is statistics different from other specializations in mathematics?
00:07:40.300 --> 00:07:49.100
It is born out of the science of classifying, organizing, and interpreting data, distributions of data to be more precise.
00:07:49.100 --> 00:07:54.100
And because of that statistics is the mathematics of distributions.
00:07:54.100 --> 00:08:00.400
Statistics is fundamental in all science, in both the natural and social sciences.
00:08:00.400 --> 00:08:09.700
I’m a social science professor, a psychology professor by trade but even in the natural sciences all these discoveries that you have heard of
00:08:09.700 --> 00:08:16.800
they only come about through rigorous applications of statistics in physics, biology, economics, psychology,
00:08:16.800 --> 00:08:22.400
you name it, statistics has left its mark there.
00:08:22.400 --> 00:08:26.600
There are two skills that you need when you enter into statistics.
00:08:26.600 --> 00:08:32.600
The first is the skill of data description, or what you can think of as exploration.
00:08:32.600 --> 00:08:36.600
Often you could think of it as just an open-ended examination of the data.
00:08:36.600 --> 00:08:38.600
Let us look and see what is there.
00:08:38.600 --> 00:08:44.300
We are looking for patterns and often it is helpful to make a graph or to look at averages
00:08:44.300 --> 00:08:55.400
and standard deviations, which are called summary values, when you are looking for patterns.
00:08:55.400 --> 00:09:00.400
These are tools that help us see patterns better.
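Here is a small Python sketch of two of those summary values, the average and the standard deviation, using the standard library's `statistics` module. The scores are invented for illustration, and `scores`, `mean_score`, and `sd_score` are hypothetical names.

```python
import statistics

# Hypothetical sample: test scores for one small class.
scores = [72, 85, 90, 66, 78, 85, 94, 70]

# Average: the sum of the scores divided by how many there are.
mean_score = statistics.mean(scores)

# Sample standard deviation (n - 1 in the denominator): typical distance from the average.
sd_score = statistics.stdev(scores)

print(f"mean = {mean_score:.1f}")
print(f"standard deviation = {sd_score:.1f}")
```

These two numbers only describe the scores in hand; by themselves they do not support any conclusion beyond this one class.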
00:09:00.400 --> 00:09:07.700
The problem with just exploring or describing data is that you are not able to come to any conclusions.
00:09:07.700 --> 00:09:17.100
You have to rein yourself in from making conclusions when you are just doing descriptive statistics. That is where inferential statistics will come in.
00:09:17.100 --> 00:09:26.200
When you make inferences in statistics you are doing a much more strict examination of the data according to set rules.
00:09:26.200 --> 00:09:34.900
Then you will judge whether these patterns that you find through description are likely or not according to theories
00:09:34.900 --> 00:09:39.300
and different models that you may have set up.
00:09:39.300 --> 00:09:44.700
At the end of inferential statistics you should be able to make measured conclusions.
00:09:44.700 --> 00:09:52.700
Often in science we do not say statistics has proven this theory or completely disproven this theory.
00:09:52.700 --> 00:10:00.900
Instead we make much more measured and qualified conclusions.
00:10:00.900 --> 00:10:11.000
Those skills of description and inference apply directly to descriptive statistics and inferential statistics.
00:10:11.000 --> 00:10:19.400
The thing that is different now is that you want to think about those skills and how they apply to distributions.
00:10:19.400 --> 00:10:24.200
Here is how descriptive statistics applies to distributions.
00:10:24.200 --> 00:10:36.800
These are the concepts and tools that you need in order to analyze sample distributions.
00:10:36.800 --> 00:10:52.700
They are used to describe or explore sample distributions.
00:10:52.700 --> 00:11:00.300
We have just taken the same concept of what describing data means and applied it to sample distributions.
00:11:00.300 --> 00:11:06.100
These are distributions that we have plucked out, sets of data that we have plucked out.
00:11:06.100 --> 00:11:13.800
In inferential statistics, what we need to do is apply inference to distributions.
00:11:13.800 --> 00:11:33.700
Here, it is the concepts and tools used to reason from a sample distribution.
00:11:33.700 --> 00:11:54.700
That is, to make some inference, to reason from a sample distribution to a larger population distribution.
00:11:54.700 --> 00:12:01.500
In inferential statistics, what we are doing is using those skills of inference to go beyond sample distributions,
00:12:01.500 --> 00:12:08.400
not just to understand the sample but to make some inferences about a larger population.
00:12:08.400 --> 00:12:11.500
Just to go beyond our actual data.
00:12:11.500 --> 00:12:15.000
In descriptive statistics we just stay with our sample.
00:12:15.000 --> 00:12:23.400
We do not make any inferences beyond what we have.
00:12:23.400 --> 00:12:30.700
It behooves us to figure out the difference between the population and the sample distribution.
00:12:30.700 --> 00:12:36.500
Here it might be helpful to just think of the population as sort of like the truth.
00:12:36.500 --> 00:12:39.400
This is what we are interested in.
00:12:39.400 --> 00:12:45.100
Is it the truth? This is the truth.
00:12:45.100 --> 00:12:47.500
This is the thing that we want to get at.
00:12:47.500 --> 00:12:55.200
If you think about the gravitational constant, this is that magical value that is out there in the world.
00:12:55.200 --> 00:13:00.700
The sample is not the truth, it is like a little bit of that truth.
00:13:00.700 --> 00:13:10.700
When we drop our objects from the top of the building and measure how fast they come down, we are getting samples.
00:13:10.700 --> 00:13:14.400
From those samples we are trying to get at the truth.
00:13:14.400 --> 00:13:23.600
The sample is not the whole truth but the sample does provide a window to the truth.
00:13:23.600 --> 00:13:28.600
It is important to realize that the sample is not the actual truth itself.
00:13:28.600 --> 00:13:31.700
This is not what we want to know about.
00:13:31.700 --> 00:13:38.900
We want to know about the population but we are using the sample in order to know about the population.
00:13:38.900 --> 00:13:41.800
Some pros and cons.
00:13:41.800 --> 00:13:48.300
The pro of the population is this: because it is the truth, if you happen to have all the information
00:13:48.300 --> 00:13:55.900
about the real population it will be absolutely 100% accurate.
00:13:55.900 --> 00:14:06.800
However here is the con, it is almost impossible to get.
00:14:06.800 --> 00:14:13.200
It is almost impossible to get the truth, the real population truth.
00:14:13.200 --> 00:14:21.400
For instance let us say you just want to know what the real average height of every person in the United States is.
00:14:21.400 --> 00:14:28.300
In order to do that you would have to get measurements from every single person in the United States.
00:14:28.300 --> 00:14:31.700
All of those measurements would have to be 100% accurate.
00:14:31.700 --> 00:14:34.400
Let us say I grant you that, and you even manage to do it.
00:14:34.400 --> 00:14:41.000
By the time you finish recording all of those measurements, some people will have died and new people will have been born.
00:14:41.000 --> 00:14:45.300
All of a sudden your measurements would not be accurate anymore.
00:14:45.300 --> 00:14:49.900
It is almost impossible to get the entire population.
00:14:49.900 --> 00:14:57.200
Often in statistics, they will pick a small population; for example, they will say, consider all the people who attend your school.
00:14:57.200 --> 00:15:05.600
That shrinks down the population so that you can think about it without feeling like your mind is being blown.
00:15:05.600 --> 00:15:10.400
In the real world it is basically impossible to get the real truth.
00:15:10.400 --> 00:15:16.900
On the other hand, the sample has the pro of being convenient.
00:15:16.900 --> 00:15:23.000
It is easy to get data from just a sample of the population.
00:15:23.000 --> 00:15:29.600
You do not have to get the whole population, you just have to get a sample of it and it is convenient and easy to get.
00:15:29.600 --> 00:15:33.000
Here is the big con that you need to worry about.
00:15:33.000 --> 00:15:37.800
The con is that the sample might be what is called biased.
00:15:37.800 --> 00:15:43.800
By biased, I do not necessarily mean that the sample is racist or prejudiced in some way;
00:15:43.800 --> 00:16:00.000
I just mean that the sample may not be representative of the population.
00:16:00.000 --> 00:16:05.600
The problem with that is when we look at our sample, we are going to use our sample to try to get at the truth.
00:16:05.600 --> 00:16:15.200
If our sample is different from the truth then it might lead us astray and that is called being biased.
00:16:15.200 --> 00:16:22.400
When we describe the population in terms of numbers and we get some summary values for the population,
00:16:22.400 --> 00:16:28.100
those descriptive values are going to be called parameters.
00:16:28.100 --> 00:16:36.400
A friend of mine who teaches statistics has a helpful way to remember this: population, parameter.
00:16:36.400 --> 00:16:46.100
On the other hand, for samples you would use what is called statistics.
00:16:46.100 --> 00:16:50.000
This word for statistics is the same word as the word for the class.
00:16:50.000 --> 00:16:58.400
But statistics, the class, covers all of statistics: descriptive, inferential, population, sample, all that stuff.
00:16:58.400 --> 00:17:03.900
This is the sort of smaller use of that word.
00:17:03.900 --> 00:17:12.900
Population goes with parameter; sample goes with statistic.
00:17:12.900 --> 00:17:15.700
Now let us put all those ideas together.
00:17:15.700 --> 00:17:22.000
How do we put together descriptive and inferential statistics with populations and samples?
00:17:22.000 --> 00:17:31.900
It helps us to ground ourselves by starting off with the idea that what we are interested in knowing about is the entire population.
00:17:31.900 --> 00:17:35.800
We want to know about the real population.
00:17:35.800 --> 00:17:40.100
Let us deal with one population at a time for now.
00:17:40.100 --> 00:17:49.100
Often we do not have the population's entire data in front of us, we only have a sample of that data.
00:17:49.100 --> 00:18:00.400
Our wish is to go from the sample to the population, but remember the sample can be biased, and that is problematic.
00:18:00.400 --> 00:18:02.700
Here is where statistics comes in.
00:18:02.700 --> 00:18:17.600
From samples we compute statistics and from populations we could know the parameters.
00:18:17.600 --> 00:18:25.800
But we often do not have this link either because we do not know anything about the actual population.
00:18:25.800 --> 00:18:34.000
Here is where we are: what inferential statistics will help us do is make this link.
00:18:34.000 --> 00:18:39.600
How do we go from statistics of the sample to population parameters?
00:18:39.600 --> 00:18:54.600
This jump, this inferential jump is going to be made through inferential statistics.
00:18:54.600 --> 00:19:07.500
However in order to go from the sample to statistics we will use descriptive statistics.
00:19:07.500 --> 00:19:10.600
This is how it all fits together.
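The picture above can be sketched in a few lines of Python. This is only a toy simulation under stated assumptions: the population is generated artificially (10,000 invented heights), which is exactly what we usually cannot do in real life, and the names `population`, `sample`, `population_mean`, and `sample_mean` are hypothetical.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 heights in centimeters.
population = [random.gauss(170, 10) for _ in range(10_000)]

# Parameter: a summary value of the whole population (usually unknown in practice).
population_mean = statistics.mean(population)

# Sample: 100 members of the population drawn at random, without replacement.
sample = random.sample(population, 100)

# Statistic: the same summary value, computed from the sample only.
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

Computing `sample_mean` from the sample is the descriptive step. Using it to say something about `population_mean` is the inferential jump.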
00:19:10.600 --> 00:19:12.800
Let us try some examples.
00:19:12.800 --> 00:19:20.800
Here is example 1, a pollster asks a group of voters how they intend to vote in the upcoming election for governor.
00:19:20.800 --> 00:19:30.800
In this example, is the individual pollster primarily using descriptive statistics or inferential statistics?
00:19:30.800 --> 00:19:35.400
Does he or she compute parameters or samples?
00:19:35.400 --> 00:19:40.600
Here the pollster is just asking a group of voters how they intend to vote.
00:19:40.600 --> 00:19:54.600
A poll is often just a sample of the entire set of voters so I would say the pollster is probably going to compute some sample statistics.
00:19:54.600 --> 00:20:02.400
We should say statistics, not samples.
00:20:02.400 --> 00:20:07.800
I would say the pollster is probably calculating statistics.
00:20:07.800 --> 00:20:19.600
If the pollster just got an answer such as 75% of this sample of voters are going to vote for the governor
00:20:19.600 --> 00:20:25.100
and only 25% are not, that would be counted as descriptive statistics.
00:20:25.100 --> 00:20:36.000
Once this pollster actually uses that information to make some inferences and says, "I predict the governor will win,"
00:20:36.000 --> 00:20:38.600
that would be inferential statistics.
00:20:38.600 --> 00:20:41.900
But so far, it does not say that.
00:20:41.900 --> 00:20:48.600
It seems that only descriptive statistics is being used here.
00:20:48.600 --> 00:20:59.200
Example 2, a teacher organizes his class's test grades into a distribution from best to worst and compares it to the test grades of the entire school.
00:20:59.200 --> 00:21:05.600
In this example, is the individual primarily using descriptive statistics or inferential statistics?
00:21:05.600 --> 00:21:13.100
First, he is definitely using descriptive statistics in order to organize his class's data.
00:21:13.100 --> 00:21:19.200
He is using this, but then he is comparing it to the test grades of the entire school.
00:21:19.200 --> 00:21:30.200
He is getting his sample, his class and looking at how they are relative to the entire school.
00:21:30.200 --> 00:21:34.300
That leap is going to be inferential statistics.
00:21:34.300 --> 00:21:42.200
I would say he is using both descriptive and inferential.
00:21:42.200 --> 00:21:51.100
Example 3, a statistician is interested in the choices of majors of this year’s entering freshmen at a university; 10% of them are randomly sampled.
00:21:51.100 --> 00:21:57.100
What is the population? What is the sample? What is the parameter? What is the statistic?
00:21:57.100 --> 00:22:17.600
The population seems to be all freshmen at the university, right? And the sample is this 10%.
00:22:17.600 --> 00:22:22.200
That is the population and the sample so what is the parameter?
00:22:22.200 --> 00:22:35.000
The parameter is the real major choices of all the freshmen.
00:22:35.000 --> 00:22:54.200
Maybe it will turn out that 50% are engineering, 20% are science, and 30% are humanities.
00:22:54.200 --> 00:23:03.100
Majors picked by freshmen.
00:23:03.100 --> 00:23:05.800
What is the actual statistic?
00:23:05.800 --> 00:23:21.600
The statistic is going to be made up of the majors picked by the sample.
00:23:21.600 --> 00:23:30.000
In order to go from the statistic to the parameter, you will need to use inferential statistics.
00:23:30.000 --> 00:23:37.700
Example 4, a group of pediatricians are trying to estimate the rate of increase in obesity in young children in their city.
00:23:37.700 --> 00:23:44.800
They begin a research project in which, every four years, a group of 8-year-old children is randomly sampled from the city and weighed.
00:23:44.800 --> 00:23:51.300
What is the population? What is the sample? What is the parameter? What is the statistic?
00:23:51.300 --> 00:24:06.200
The population looks like young children in the city, whichever city this happens to be.
00:24:06.200 --> 00:24:28.800
The sample is the group of 8-year-old children selected to be in this study.
00:24:28.800 --> 00:24:34.300
What is the parameter?
00:24:34.300 --> 00:24:50.300
The parameter would really be the actual rate of increase in obesity, and they do not know what that is; they cannot get that data.
00:24:50.300 --> 00:25:03.500
By looking at the different groups of 8-year-old children every four years, they could look at the rate between the samples.
00:25:03.500 --> 00:25:20.800
The statistic would be the rate among the samples taken every four years.
00:25:20.800 --> 00:25:27.600
In that way they will try to use the rate in the samples in order to estimate the rate in the population.
00:25:27.600 --> 00:25:30.500
That is the end of lesson one for www.educator.com.
00:25:30.500 --> 00:25:31.000
Thanks so much for watching.