WEBVTT mathematics/statistics/son
00:00:00.000 --> 00:00:01.900
Welcome to www.educator.com.
00:00:01.900 --> 00:00:06.700
We are going to be talking about transformations of data today.
00:00:06.700 --> 00:00:11.800
First we are going to talk about why we even transform data, then we are going to talk about
00:00:11.800 --> 00:00:16.900
two different broad types of transformations, shape preserving and shape changing transformations.
00:00:16.900 --> 00:00:23.700
Then we will talk about some common shape changing transformations that you might need to know.
00:00:23.700 --> 00:00:28.400
Some of them you already know.
00:00:28.400 --> 00:00:30.300
First, why transform?
00:00:30.300 --> 00:00:40.100
One of the big reasons to transform data, especially in the shape changing way, is that all the stuff with regression and correlation,
00:00:40.100 --> 00:00:44.500
and all the stuff we have been learning works for linear patterns.
00:00:44.500 --> 00:00:54.100
If the pattern is not linear, even if you can still fit it to a regression line and still find the correlation, it probably is not the best way to go.
00:00:54.100 --> 00:01:00.500
Because for instance, in this graph you could see this has a distinct sort of curvy shape.
00:01:00.500 --> 00:01:14.500
A simple linear regression will not account for a lot of this variation, not as well as a curved line would.
00:01:14.500 --> 00:01:25.700
Sometimes the transformation might make a nonlinear pattern more linear, thus making regression and correlation more useful.
00:01:25.700 --> 00:01:28.700
All of a sudden you can use regression and correlation and it will account for a lot of the data.
00:01:28.700 --> 00:01:30.600
That might be one reason to do that.
00:01:30.600 --> 00:01:33.200
Let us look at this data for example.
00:01:33.200 --> 00:01:42.100
This is data that we have looked at before from www.gapminder.org, where it shows the income per person, or GDP per capita.
00:01:42.100 --> 00:01:50.000
It takes all the stuff that your country buys and sells and divides it by how many people you have.
00:01:50.000 --> 00:02:00.000
It shows life expectancy here and notice that it says lin, this is a linear graph.
00:02:00.000 --> 00:02:12.700
It is just showing you even intervals, the distance between 10,000 and 20,000 is the same as the distance between 60,000 and 70,000.
00:02:12.700 --> 00:02:18.800
Same here, the distance between 55 and 60 years old is the same as the distance between 80 and 85.
00:02:18.800 --> 00:02:23.900
But one of the issues with this is that it has a distinctly curved shape.
00:02:23.900 --> 00:02:32.200
And primarily, it is that a lot of countries are very poor in terms of GDP per capita.
00:02:32.200 --> 00:02:38.000
They are very poor and they are all put together over on this side.
00:02:38.000 --> 00:02:44.900
Most countries make less than about 15,000 per person.
00:02:44.900 --> 00:02:52.200
They are all squished over on this side.
00:02:52.200 --> 00:02:58.900
It would be nice if we could somehow stretch out this part and squish those down
00:02:58.900 --> 00:03:05.900
because these countries are probably very similar because they are rich countries.
00:03:05.900 --> 00:03:13.900
A lot of them are in Europe, along with the United States; because of that, this might be nice.
00:03:13.900 --> 00:03:18.200
One way we could do that is with a log transform.
00:03:18.200 --> 00:03:24.900
Instead of giving us the income per person, we can look at it on a logarithmic scale.
00:03:24.900 --> 00:03:47.900
If you remember logs, a log works like this: log base 10 of x asks what power of 10 will give you that number x.
00:03:47.900 --> 00:03:51.500
We have log₁₀(x) = y.
00:03:51.500 --> 00:03:56.800
10^y will give us x.
00:03:56.800 --> 00:04:15.000
Instead of plotting the actual x, maybe we could transform this so that it gives us just the exponents.
00:04:15.000 --> 00:04:28.000
The way you could do this is to show this in a logarithmic scale and now the first parts of these are stretched out.
00:04:28.000 --> 00:04:37.200
The distance between 400 and 1,000 is big, and that is bigger than the distance between 20,000 and 40,000.
00:04:37.200 --> 00:04:41.000
That is our logarithmic scale or exponential scale.
00:04:41.000 --> 00:04:54.400
Here we see the same data, except now we are plotting by the log of these incomes.
00:04:54.400 --> 00:04:59.600
Here what we see is a more linear pattern.
00:04:59.600 --> 00:05:08.500
Before we saw a curved pattern but now we see roughly more of a linear pattern.
00:05:08.500 --> 00:05:12.600
That is one reason why transformations are very useful.
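A minimal sketch of this idea in Python (the lecture itself uses a web graph, not code). The income and outcome values are made up for illustration and are not the data shown on the graph; `pearson_r` is a helper written here, not a library call. It compares the correlation before and after taking log base 10 of x.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical incomes (NOT the real data), spread over several powers of 10.
income = [400, 1000, 2000, 4000, 10000, 20000, 40000]
# Hypothetical outcome that rises by a fixed amount each time income is multiplied by 10.
outcome = [45 + 10 * math.log10(v) for v in income]

# Transform x only: replace each income by its exponent of 10.
log_income = [math.log10(v) for v in income]

r_raw = pearson_r(income, outcome)      # curved pattern, weaker linear fit
r_log = pearson_r(log_income, outcome)  # straightened pattern
print(round(r_raw, 3), round(r_log, 3))
```

Because the made-up outcome is exactly linear in log income, the second correlation comes out at essentially 1, while the first is noticeably lower. Real data would be noisier.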
00:05:12.600 --> 00:05:20.400
There are two kinds of broad transformations that you should know.
00:05:20.400 --> 00:05:25.100
One is shape preserving transformations and the other is shape changing transformations.
00:05:25.100 --> 00:05:32.400
Shape preserving transformations are ones where you do not actually change the shape of the distribution.
00:05:32.400 --> 00:05:34.500
The shape looks the same.
00:05:34.500 --> 00:05:39.200
When we look at a scatter plot, the scatter plot will look exactly the same.
00:05:39.200 --> 00:05:42.000
If it is linear it will stay linear.
00:05:42.000 --> 00:05:47.400
Shape changing means that if it is linear we will make it curvilinear.
00:05:47.400 --> 00:05:50.900
If it is curvilinear we will make it more linear.
00:05:50.900 --> 00:05:53.200
Those are shape changing transformations.
00:05:53.200 --> 00:06:00.000
In order to be shape preserving, a transformation has to be a linear transformation.
00:06:00.000 --> 00:06:09.200
Remember, the equation for a line is y = mx + b.
00:06:09.200 --> 00:06:12.500
This is the classic formula for a line.
00:06:12.500 --> 00:06:19.300
Anytime you add a constant or multiply a value by a constant, that is called a linear transformation.
00:06:19.300 --> 00:06:26.800
Shape changing transformations are anything that is nonlinear.
00:06:26.800 --> 00:06:35.800
Now you need to do something more than adding a constant or multiplying by a constant.
00:06:35.800 --> 00:06:48.600
That might be changing x into x² or taking the square root of x or adding in other variables.
00:06:48.600 --> 00:06:52.600
Adding in another variable here.
00:06:52.600 --> 00:06:58.200
These are non linear transformations.
00:06:58.200 --> 00:07:06.700
This is anything you do beyond just adding or subtracting or multiplying and dividing by a constant.
00:07:06.700 --> 00:07:13.400
Here are some common shape preserving transformations.
00:07:13.400 --> 00:07:17.000
These are the ones that do not change the shape at all.
00:07:17.000 --> 00:07:19.900
When you add or subtract a constant, that is fine.
00:07:19.900 --> 00:07:22.800
If you multiply or divide by a constant, that is fine.
00:07:22.800 --> 00:07:30.000
Converting units is often a common shape preserving transformation.
00:07:30.000 --> 00:07:40.900
For instance, we collected our data in feet, but we want to see it in inches, or we recorded minutes, but we really want hours.
00:07:40.900 --> 00:07:47.800
There we are just multiplying by a constant; here you are multiplying by 12.
00:07:47.800 --> 00:07:50.100
Here we are dividing by 60.
00:07:50.100 --> 00:07:55.200
Those are shape preserving, you will have the same shape.
00:07:55.200 --> 00:08:00.700
Another shape preserving transformation is standardization, or finding the z scores.
00:08:00.700 --> 00:08:09.100
When we find the z scores, the z scores will have the same shape as your raw scores because we are subtracting a constant.
00:08:09.100 --> 00:08:15.100
We subtract x bar, the mean, and divide by a constant, the standard deviation.
00:08:15.100 --> 00:08:20.500
You could do combinations of these two things and still have a shape preserving transformation.
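A short Python sketch of why these count as shape preserving. The heights and times are hypothetical values chosen only for illustration, and `pearson_r` and `z_scores` are helpers defined here. Unit conversion and standardization are both linear, so the correlation between the two variables should not move.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def z_scores(values):
    """Subtract the mean, then divide by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

# Hypothetical measurements, for illustration only.
feet = [4.5, 5.0, 5.5, 6.0, 6.5]
minutes = [30, 45, 50, 80, 95]

inches = [12 * f for f in feet]    # multiply by a constant
hours = [m / 60 for m in minutes]  # divide by a constant

r_original = pearson_r(feet, minutes)
r_converted = pearson_r(inches, hours)
r_standardized = pearson_r(z_scores(feet), z_scores(minutes))
print(r_original, r_converted, r_standardized)
```

All three printed values should agree up to floating-point rounding, which is the numerical version of "the scatter plot looks the same".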
00:08:20.500 --> 00:08:35.500
Another common shape preserving transformation that you might want to know is the transformation from frequency to relative frequency.
00:08:35.500 --> 00:08:47.700
That is also shape preserving: you might have the raw number of people, but you might also want the proportion of the total.
00:08:47.700 --> 00:08:54.800
That is another way, because remember finding relative frequency is often just dividing by a constant.
00:08:54.800 --> 00:08:58.800
Those are shape preserving transformations that you have already seen.
00:08:58.800 --> 00:09:09.500
The shape changing transformations, the most common ones used are power transformation and log transformations.
00:09:09.500 --> 00:09:17.500
Power transformations are anything where you raise your y or x to some power.
00:09:17.500 --> 00:09:31.300
For instance, you change y into y², or y into the square root of y, or y into 1/y.
00:09:31.300 --> 00:09:34.300
That last one is raising it to a negative power.
00:09:34.300 --> 00:09:39.000
Any of these and any combination of these is a power transform.
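A small Python illustration of a power transform, using made-up data that follows y = x² exactly. Taking the square root of y (raising y to the power 1/2) is the transformation assumed here; the variable names are illustrative only.

```python
import math

xs = list(range(1, 11))
ys = [x ** 2 for x in xs]            # curved: y grows faster and faster
sqrt_y = [math.sqrt(y) for y in ys]  # power transform with exponent 1/2

# Gap between neighboring values, before and after the transform.
gaps_raw = [ys[i + 1] - ys[i] for i in range(len(ys) - 1)]
gaps_sqrt = [sqrt_y[i + 1] - sqrt_y[i] for i in range(len(sqrt_y) - 1)]
print(gaps_raw)
print(gaps_sqrt)
```

The raw gaps keep growing, which is the curve. After the square root the gaps are constant, which is what a straight line looks like when x goes up in equal steps.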
00:09:39.000 --> 00:09:43.400
Log transforms are about finding the exponent.
00:09:43.400 --> 00:09:46.700
Instead of raising to a power, you find the exponent.
00:09:46.700 --> 00:09:56.600
You could find the log of y and this will give you smaller numbers or you can find the natural log of numbers as well.
00:09:56.600 --> 00:10:00.100
Any of these are possibilities.
00:10:00.100 --> 00:10:12.200
You could also look at things like e to the y, so that is just the inverse of this.
00:10:12.200 --> 00:10:20.500
And also some other constant to the y, so we could use the exponential constant e or some other base.
00:10:20.500 --> 00:10:30.500
Although I have written y here, and oftentimes you might see y being transformed, it is also quite common to transform x as well.
00:10:30.500 --> 00:10:32.900
Sometimes you might transform both.
00:10:32.900 --> 00:10:37.800
You may transform both y and x and we will talk about those situations as well.
00:10:37.800 --> 00:10:47.100
Great, the question is: how do we know when to do this, and should we change just one variable or both?
00:10:47.100 --> 00:10:54.900
Log-log transformations are when you do log transformations on both.
00:10:54.900 --> 00:10:56.900
Log x and log y.
00:10:56.900 --> 00:11:08.500
Log-log transformations are often useful for data modeled by this basic formula.
00:11:08.500 --> 00:11:11.500
So y = ax^b.
00:11:11.500 --> 00:11:18.300
When x is raised to some constant power, you often want to do a log-log transformation.
00:11:18.300 --> 00:11:20.000
It is just a nice rule.
00:11:20.000 --> 00:11:28.300
When you do a log-log transformation you are basically shrinking and expanding variables on both axes.
00:11:28.300 --> 00:11:35.700
You are not just stretching out or shrinking one variable, you are doing that to both.
00:11:35.700 --> 00:11:45.100
Just to give you some ideas for how to do that, I'm always going to put x here and y here.
00:11:45.100 --> 00:11:59.800
This is a case where all these variables, all of these y are squished together.
00:11:59.800 --> 00:12:01.800
Here they are not rising very quickly.
00:12:01.800 --> 00:12:05.500
Here the y are not rising very quickly and then the y rise very quickly.
00:12:05.500 --> 00:12:14.000
The y are shooting up for each x.
00:12:14.000 --> 00:12:20.500
Here we would want to shrink y and expand the x.
00:12:20.500 --> 00:12:27.400
And so when you see curves that are approximately this shape, you want to think shrink y and expand x.
00:12:27.400 --> 00:12:46.700
Here we have a slightly different situation where now we still want to shrink y, because y is descending too quickly but we also want to shrink x here.
00:12:46.700 --> 00:12:50.000
That is when the curve goes like this.
00:12:50.000 --> 00:12:59.000
We can keep track of these by thinking of a circle that goes around like this.
00:12:59.000 --> 00:13:01.200
That is the order that I have written them in.
00:13:01.200 --> 00:13:03.200
Here is 1, 2, 3.
00:13:03.200 --> 00:13:07.900
In 1 you want to shrink y and expand x.
00:13:07.900 --> 00:13:12.500
In 2 you want to shrink y but you also want to shrink x.
00:13:12.500 --> 00:13:16.800
X is also expanding too quickly.
00:13:16.800 --> 00:13:21.700
Here we want to expand y but shrink x.
00:13:21.700 --> 00:13:28.000
Here y is not changing very fast up here.
00:13:28.000 --> 00:13:37.900
It is changing very little, and we want to expand that, but we want to shrink x because x is going up too quickly in relation to y.
00:13:37.900 --> 00:13:54.800
Here for the last one, number 4 we want to expand y, but we also want to expand x because y is changing in a way
00:13:54.800 --> 00:14:02.400
where it would be helpful to see it expanded outwards because here it is going down very fast.
00:14:02.400 --> 00:14:09.500
Also with x it would be helpful to expand x out because all the x are squished up here but sort of spread out there.
00:14:09.500 --> 00:14:19.600
Those are just some nice rules of thumb; obviously, you do not have to memorize these.
00:14:19.600 --> 00:14:23.600
Sometimes what I do is play around with it a little bit.
00:14:23.600 --> 00:14:32.800
I try shrinking one and expanding the other, as long as I can identify that these are all of the form y = ax^b.
00:14:32.800 --> 00:14:48.700
If x is your exponent, it is different: before it was y = ax^b, but now it is y = ab^x.
00:14:48.700 --> 00:15:00.900
Here, you probably just want to transform one variable and leave the other one alone.
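The two rules of thumb can be checked by taking logs of each model. A short derivation, assuming a > 0 in both models, x > 0 in the power model, and b > 0 in the exponential model (any log base works):

```latex
\begin{align*}
y = a x^{b} \;&\Longrightarrow\; \log y = \log a + b \log x \\
y = a b^{x} \;&\Longrightarrow\; \log y = \log a + x \log b
\end{align*}
```

In the first case log y is a straight line in log x, with slope b, so both variables get transformed. In the second case log y is a straight line in x itself, with slope log b, so only y needs the log.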
00:15:00.900 --> 00:15:10.500
If you are not able to eyeball it, what you can do is try things out; there is no harm in playing around with it.
00:15:10.500 --> 00:15:18.100
But eventually, when you do decide on a transform, you want to have a reason for it instead of it just being the thing to do.
00:15:18.100 --> 00:15:30.900
Let us go to example 1: create a set of data with this function, graph this data set, and decide what kind of transformation should be done to make this data more linear.
00:15:30.900 --> 00:15:37.800
Already we could see from this that this is an example of y = ax^b.
00:15:37.800 --> 00:15:41.800
In that case we are going to need to do a log-log transformation.
00:15:41.800 --> 00:15:57.000
We are going to need to do transformation to both x and y, but let us look at the shape of it to see what this data looks like.
00:15:57.000 --> 00:16:18.000
If you download the examples, for example 1 I have already put in the function y = ax^b; let us put in a, which is 10, and b, which is -0.4.
00:16:18.000 --> 00:16:30.700
I have already seeded this with a whole bunch of positive integers for x that steadily go all the way up to 33.
00:16:30.700 --> 00:16:34.500
Let us find the corresponding Y values.
00:16:34.500 --> 00:16:39.900
In order to find y we just have to follow this formula here.
00:16:39.900 --> 00:16:51.100
y = a × x^b.
00:16:51.100 --> 00:16:57.400
Remember, Excel knows the order of operations, so it should do the power before it does the multiplication.
00:16:57.400 --> 00:17:11.900
Once we have that, I'm just going to drag this all the way down and get a whole set of data.
00:17:11.900 --> 00:17:16.900
I think I forgot to lock down this.
00:17:16.900 --> 00:17:21.500
I forgot to lock down a and b; that is what is giving me all this craziness.
00:17:21.500 --> 00:17:35.200
Let us lock down a and b; once we have that, then I can drag it down.
00:17:35.200 --> 00:17:49.600
We see we have this nice curve; if you remember, that is the second type of curve.
00:17:49.600 --> 00:17:55.300
We know we need to do both kind of transformations already.
00:17:55.300 --> 00:18:15.400
It would be helpful for us if we could shrink y, but also shrink x, and log is a way of shrinking both of them.
00:18:15.400 --> 00:18:17.200
So let us try log.
00:18:17.200 --> 00:18:20.600
Let us do log transforms.
00:18:20.600 --> 00:18:23.900
We get log of x and log of y.
00:18:23.900 --> 00:18:26.800
Feel free to also use natural log.
00:18:26.800 --> 00:18:40.000
I'm going to use log base 10; Excel thankfully has LOG, and I'm going to use LOG10 and put in my x.
00:18:40.000 --> 00:18:48.300
It is going to change x from this into the exponent.
00:18:48.300 --> 00:18:56.100
10 to the 0 power will give us 1, and I will do the same thing to y.
00:18:56.100 --> 00:19:07.500
10¹ will give us 10.
00:19:07.500 --> 00:19:16.700
I'm going to take that, copy and paste it all the way down, and get a nice log transform.
00:19:16.700 --> 00:19:24.700
Here I have already made this graph and set this up so that it'll actually get this data.
00:19:24.700 --> 00:19:28.200
If you click on those it will show you which data it is using.
00:19:28.200 --> 00:19:33.900
I already labeled them log y instead of just y, and log x instead of just x.
00:19:33.900 --> 00:19:40.800
And what you notice is that the data that had once been curved is now straightened out.
00:19:40.800 --> 00:19:53.100
This is one way a transformation can be useful, because now we could use log x and log y instead of x and y, and put log x and log y into our regression and correlation calculations.
00:19:53.100 --> 00:20:00.900
And we should be able to get more traction out of using those tools.
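Example 1 can also be reproduced outside of Excel. The following is a sketch in Python, assuming the x values are the consecutive integers 1 through 33 (the lecture only says positive integers going up to 33); `fit_line` is a small least-squares helper written here for illustration.

```python
import math

a, b = 10, -0.4                  # the constants used in example 1
xs = list(range(1, 34))          # assumed: consecutive integers 1 through 33
ys = [a * x ** b for x in xs]    # the power is evaluated before the multiplication

# Log-log transform: take log base 10 of both variables.
log_x = [math.log10(x) for x in xs]
log_y = [math.log10(y) for y in ys]

def fit_line(us, vs):
    """Least-squares slope and intercept for v = slope * u + intercept."""
    n = len(us)
    mean_u = sum(us) / n
    mean_v = sum(vs) / n
    slope = (sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
             / sum((u - mean_u) ** 2 for u in us))
    return slope, mean_v - slope * mean_u

slope, intercept = fit_line(log_x, log_y)
print(round(slope, 6), round(intercept, 6))
```

On the log-log scale the fitted slope should come out close to b and the intercept close to log base 10 of a, which is another way of seeing that the curve has been straightened.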
00:20:00.900 --> 00:20:08.300
Let us move on to example 2.
00:20:08.300 --> 00:20:17.500
Create a set of data with this function, y = ab^x, and graph this data set.
00:20:17.500 --> 00:20:21.100
What kind of transformation should be done to make this data more linear?
00:20:21.100 --> 00:20:33.100
We could just put in whatever numbers we want for a and b; now we probably want to do just a one variable transform.
00:20:33.100 --> 00:20:36.800
Like a log transform or a power transform on one side.
00:20:36.800 --> 00:20:55.400
If we go to example 2, it already has a × b^x set up, and so we could just put in some numbers like 5.2, anything you want.
00:20:55.400 --> 00:21:00.400
Let us put in our formula.
00:21:00.400 --> 00:21:02.900
Let us not make the same mistake again; let us lock in the constants.
00:21:02.900 --> 00:21:15.200
Here is a × b^x.
00:21:15.200 --> 00:21:32.200
Here I’m going to lock in b and once we have that I could just drag this all the way down.
00:21:32.200 --> 00:21:40.600
This is very curved, very steeply curved.
00:21:40.600 --> 00:21:44.800
How can we transform this?
00:21:44.800 --> 00:21:49.500
One option might be to transform y.
00:21:49.500 --> 00:22:06.500
Let me put in log y and maybe I will put in just the same thing log base 10 y and then just drag that all the way down.
00:22:06.500 --> 00:22:09.800
Let us do it again.
00:22:09.800 --> 00:22:23.100
Here we now get this nice linear looking pattern instead of this very, very curvy, almost right-angle shape.
00:22:23.100 --> 00:22:33.800
Because we do not change x, x stays nice and linear, but here we have a logarithmic axis.
00:22:33.800 --> 00:22:39.200
logarithmic function here.
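A Python sketch of example 2. The lecture allows any constants, so a = 5 and b = 2 below are assumptions for illustration, as is the range of x values. Only y is transformed; x is left alone.

```python
import math

a, b = 5.0, 2.0             # hypothetical constants; any positive values would do
xs = list(range(1, 11))     # assumed x values in equal steps
ys = [a * b ** x for x in xs]

# Transform one variable only: log base 10 of y.
log_y = [math.log10(y) for y in ys]

# Step in y, and in log y, each time x goes up by 1.
steps_raw = [ys[i + 1] - ys[i] for i in range(len(ys) - 1)]
steps_log = [log_y[i + 1] - log_y[i] for i in range(len(log_y) - 1)]
print(steps_raw[0], steps_raw[-1])
print(steps_log[0], steps_log[-1])
```

The raw steps grow enormously from the first to the last, while each step in log y stays at log base 10 of b, so log y plotted against the untransformed x is a straight line.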
00:22:39.200 --> 00:22:46.800
Finally let us move on to example 3.
00:22:46.800 --> 00:22:58.300
Example 3 says considering this data set, the goal of a statistical model would be to allow accurate prediction of birthrate
00:22:58.300 --> 00:23:02.000
from a country's GNP, that is gross national product.
00:23:02.000 --> 00:23:05.100
And what kind of model would you choose for this data?
00:23:05.100 --> 00:23:13.200
It is often helpful to just sort of look at the data and draw for yourself what you think
00:23:13.200 --> 00:23:26.300
might be a helpful model, an ideal theory for where the underlying data come from.
00:23:26.300 --> 00:23:41.500
This looks very curvy to me and to me that looks sort of like what we saw before where we did the log log transformation that looks sort of like that.
00:23:41.500 --> 00:23:46.100
What kind of model would you choose?
00:23:46.100 --> 00:23:52.500
I assume you would not choose a linear one; I'm not going to use y = mx + b for it.
00:23:52.500 --> 00:24:00.600
It is not quite like a parabola shape, but I have seen something like this before.
00:24:00.600 --> 00:24:05.700
I have actually seen it in an exponential function.
00:24:05.700 --> 00:24:18.000
I have seen something that looks like that, and that is y = e^x.
00:24:18.000 --> 00:24:26.500
I have seen something like that before, and this is like that, just flipped around.
00:24:26.500 --> 00:24:41.700
I do not flip the y, which would be like folding it down; instead it is sort of folded along the x.
00:24:41.700 --> 00:24:46.000
For every positive x, maybe I want it to be negative.
00:24:46.000 --> 00:24:55.300
Maybe I want a model where it does not have to be e, but some constant to the -x.
00:24:55.300 --> 00:25:06.900
One way that I would advise you to do this, if you have access to a graphing calculator or www.wolframalpha,
00:25:06.900 --> 00:25:13.200
one thing is that you can put these equations in so that you can eyeball it and see if you get roughly this shape and play around with it.
00:25:13.200 --> 00:25:22.100
Feel free to put in different exponents and constants and try to get something that at least has a shape that looks like this.
00:25:22.100 --> 00:25:23.600
Exact numbers are not important.
00:25:23.600 --> 00:25:25.500
What we are really looking for is that shape.
00:25:25.500 --> 00:25:37.900
I'm going to guess that I need a shape that looks something like this, and if you do not want to put in e you could just use a: y = a^(-x).
00:25:37.900 --> 00:25:45.000
That would be the model that I would choose for this data.
00:25:45.000 --> 00:25:55.300
Example 4, same data set, but it asks what kind of transformation might be done on this data before fitting it to a regression line and finding the correlation.
00:25:55.300 --> 00:26:19.100
Well, it looks as though this data corresponds to something like y = a^x, and we said we need a -x to get a curve that looks like that,
00:26:19.100 --> 00:26:28.800
and when we see something like that perhaps one thing we might want to do is just change one variable.
00:26:28.800 --> 00:26:47.700
That might be one of the strategies that we use, since this corresponds to the basic equation y = ab^x.
00:26:47.700 --> 00:26:53.500
Whenever you see equations of that kind you probably just want to change one of your variables.
00:26:53.500 --> 00:27:01.600
You probably want to play around and either change GNP or change the birth rate to try and straighten out this data.
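One possible way to play around with this is sketched below in Python. It does not use the GNP and birth rate data from the slide; the constants c and a and the x values are hypothetical, chosen only to produce a decaying curve of the form y = c × a^(-x). The sketch transforms y alone, fits a line, and then converts a prediction back to the original scale. `fit_line` and `predict` are helpers written here.

```python
import math

def fit_line(us, vs):
    """Least-squares slope and intercept for v = slope * u + intercept."""
    n = len(us)
    mean_u = sum(us) / n
    mean_v = sum(vs) / n
    slope = (sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
             / sum((u - mean_u) ** 2 for u in us))
    return slope, mean_v - slope * mean_u

# Hypothetical constants and x values; NOT the GNP / birth rate data.
c, a = 50.0, 1.5
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8]
ys = [c * a ** (-x) for x in xs]   # decaying curve, like e^x flipped along x

# Transform only y, then fit a straight line to (x, log y).
log_y = [math.log10(y) for y in ys]
slope, intercept = fit_line(xs, log_y)

def predict(x):
    """Predict y on the original scale by undoing the log transform."""
    return 10 ** (intercept + slope * x)

print(round(slope, 4), round(predict(10), 4))
```

The fitted slope should be negative and close to minus log base 10 of a. With real, noisy data the fit would not be exact, and transforming x instead of y is worth trying as well.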
00:27:01.600 --> 00:27:08.000
That is it for transformations. Thanks for using www.educator.com.