#4 Data Transformation

Data Analysis 4: Data Transformation - Computerphile

People need to learn to use standardized measures for things.

So take me for example when I drive anywhere,

I drive in miles, I drive in miles per hour.

My fuel economy is measured in miles per gallon,

but of course, I don’t pump fuel in gallons,

I pump it in liters.

But when I run anywhere, so short distances,

I run in kilometers and I run in kilometers per hour.

So I’m using two different systems there.

And any short distances I’m measuring are going to be in meters, not feet, right.

So if I’m measuring let’s say

around my house for painting,

I’m going to measure in square meters,

so I know how much paint to buy.

But then if I’m selling a house, or I’m buying a house,

I’m going to be looking at the size of the house in square feet.

Again, what, who knows why, British people.

If I’m baking anything,

it’s going to be weight in grams or kilograms going into the recipe.

But if I’m weighing myself, it’s going to be in stones and pounds.

But of course a ton for me would be a metric ton,

not an imperial ton.

And as I said, I measure fuel in liters

and most of my liquids are measured in liters

except of course for beer and milk, which are in pints.

So this is the kind of problem you’re going to be dealing with

when you’re looking at data.

You’re trying to transform your data into a usable form.

Maybe the data is coming from different sources,

none of it goes together.

You need standardized units, standardized scales,

so we can go on and analyze it.

<04 - Data Transformation>
<Computerphile>
So let’s think back:

what we’re doing is we’re trying to prepare our data

into the densest, cleanest format

for modeling or machine learning

or some kind of statistical test

to work out what’s going on and draw knowledge from our data.

So this is going to be an iterative process,

we’re going to be cleaning the data,

we’re going to transform the data

and then we’re going to reduce the data,

and transforming data is what we’re going to do today.

So let’s imagine that you’ve cleaned your data.

So we’ve got rid of as many missing values as possible,

hopefully all of them, and we’ve deleted instances and attributes that

just weren’t going to work out for us.

Now what we’re going to try and do

is we’re going to try and transform our data

so that everything’s on the same scale

Everything makes sense together

and if we’re bringing datasets from different places,

we need to also make sure all the units are the same

and everything makes sense.

There’s no point in trying to use machine learning

or clustering or any other mechanism

to draw knowledge from our data if our data is all wrong.

So today we’re going to be looking at census data.

Now census data is kind of a classic example of a kind of data

you might look at in data analysis.

It has got lots of different kinds of attributes,

things that are going to need cleaning up and transforming.

So we’re back in R. We’re going to read the census data,

which represents samples from the US population, to begin with.

We’re going to read that in and you can see that

we’ve got 32,000 observations and 15 attributes

or variables.

So what are they, first of all?

So let’s have a quick look at just a little bit of it

and we can see the kind of thing we’re looking at.

So we’re going to say head of census

and that’s just going to produce the first few rows

so we can kind of see the kind of data.

So you can see we’ve got age

we’ve got what work classification that person has, their educational level

and a numerical representation of it, whether they’re married or not, this kind of thing.

So there’s a lot of different kinds of data here

some of it is going to be nominal

So for example, this work class:

state government, private employee.

That’s a nominal value.

We might have ordinal values or ratio values

or interval values

We’re gonna have to delve into it a little bit closer to find out what these are.

Now what we do to transform this data

into a usable format for clustering or machine learning

is going to depend on exactly what the types of these columns are

and what we want to do with them

So let’s look at just a couple of the attributes

and see what we can do with them, right?

We’re going to use a process called codification.

The idea is that maybe things like random forests or

multi-layer perceptrons, you know neural networks

aren’t going to be very amenable to putting in text-based inputs.

So what we want to do is try and replace these attributes

with a numerical score.

All right. So let’s look at, just for example, work class,

and also for example the educational level. So education.

Now work class is the kind of class of worker that we’re looking at here

So for example a state worker or in private sector,

or someone that worked in a school or something like this.

Now this is a nominal value.

That means there’s no order to this data at all

We can’t say that someone in state is higher or lower than someone in private,

and we also can’t say that, let’s say, state is two times more or less than some other one.

That makes no sense at all. Alright.

So we can replace this with numbers.

So let’s say we could replace private with zero

and state with one

and, you know, self-employed with two, and so on, right.

And that would be a perfectly reasonable thing to do,

but it’s still nominal data.

So what we can’t do is then calculate a mean and

say “ah the mean is halfway between private and public”

that doesn’t make any sense.

Just because something has been replaced by a numerical score

doesn’t mean that it actually represents something that we can quantify in that way, right?

It’s still nominal data.

Okay, so the best advice I can give is

just bear in mind that

you can calculate the mode, so you know the most common value,

but you can’t calculate the median and you can’t calculate the mean.
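
The codification just described can be sketched like this. The video works in R; this is a Python sketch using only the standard library, and the category labels and the `codes` mapping are made up for illustration rather than read from the census file.

```python
from collections import Counter

# Hypothetical work-class values; the real census column has more categories.
workclass = ["Private", "State-gov", "Private", "Self-emp", "Private", "State-gov"]

# Give each distinct label an arbitrary integer code, in order of first appearance.
codes = {}
for label in workclass:
    codes.setdefault(label, len(codes))

encoded = [codes[label] for label in workclass]

# The mode is still meaningful for nominal data: it is the most common category.
mode_code, mode_count = Counter(encoded).most_common(1)[0]
decode = {code: label for label, code in codes.items()}

print(codes)
print(encoded)
print(decode[mode_code], mode_count)
```

Taking the mean of `encoded` would run without complaint, but the number it returns depends entirely on which code each label happened to get, which is why it is left out here.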

Another example would be something like the educational level.

Now theoretically this is ordinal data,

so we could say that someone with an undergraduate degree

is maybe slightly higher, in terms of the amount of time they spent in education,

than someone with a high school diploma.

But we don’t know exactly what the distance is,

and what’s the distance between let’s say a high school and a degree and then a PhD,

and so on, an MD and things like this.

We can represent these using numbers,

and probably in order, right,

so we could say that zero is no education

and one is sort of the end of primary school

and two is the end of high school and so on and so forth

But again,

it’s difficult to calculate distances between these things

We don’t know that high school is two times more than primary school

and half of a degree or something like that.

That doesn’t really make sense.

So again,

you might be able to calculate a median on this or a mode,

but you can’t calculate an average.

You can’t say the average level of education

is halfway between high school and undergraduate.

That doesn’t make any sense either.
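
For ordinal data the same idea looks roughly like this. Again a Python sketch with invented values: the `levels` list and the `people` sample are assumptions, and the ranks only record order, not distance.

```python
import statistics

# Hypothetical ordered education levels; the ranks encode order only, not distance.
levels = ["none", "primary", "high school", "undergraduate", "PhD"]
rank = {name: i for i, name in enumerate(levels)}

people = ["high school", "undergraduate", "high school", "PhD", "primary"]
ranks = [rank[p] for p in people]

# median_low always returns a value that actually occurs in the data,
# so it maps back to a real education level.
median_level = levels[statistics.median_low(ranks)]
mode_level = levels[statistics.mode(ranks)]

print(median_level, mode_level)
```

Using `median_low` rather than `median` is a design choice: with an even number of people, an ordinary median could land halfway between two levels, which is exactly the kind of value that makes no sense for ordinal data.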

So for any kind of attribute that is nominal or

possibly ordinal and it’s sort of represented using text

we can codify this so that it’s more amenable to things like

decision trees depending on the library you’re using, right?

But you just have to be careful: all machine learning algorithms

will take any number you give them

and you just have to be careful that this makes sense to do.

So what you would do is you would go through your data

and you’d begin to systematically replace appropriate attributes

with numerical versions of themselves,

remembering all the time,

that they don’t necessarily represent true numbers,

you know in a ratio or interval format.

So that covers any text-based value.

As for the numerical values, well, they might be okay,

but the issue is going to be one of scale.

You might find, for example, in this census data

that one of the dimensions

or one of the attributes is much much larger than another one.

So for example, this dataset has hours per week

which is obviously going to be somewhere between naught and maybe 60 or 70 hours

for someone who has got, you know, a very strong work ethic,

and salary, right?

Or salary or income or any other measure of, you know, monetary gain.

Now obviously hours per week is going to be in the tens, and

salary could be into the tens of thousands, maybe even the hundreds of thousands.

Those scales are not even close to being the same.

That means if you’re doing clustering or machine learning

on this kind of data

you’re going to find that the salary

is kind of overpowering everything, right.

So it’s going to be very easy for your clustering

to find differences in salary,

and it’s harder for it to spot differences in hours,

because they’re so small in comparison, right?

So we need to start to bring everything onto the same scale.

The more attributes you have

which is another way of saying, the more dimensions you have to your data,

then the further everything is going to be spread around.

If we can scale all of these values to between

sort of let’s say around 0 and 1,

then everything gets more tightly sort of controlled in the middle,

And so it gets much easier to do clustering

or machine learning or any kind of analysis we want.

So let’s look back at our data

and see what we can do to try and scale some of this into the right range.

So we’re going to look back at the head of our data again

so our numerical values are things like the capital gain

the capital loss, which I guess is

presumably how much money they’ve made or lost that year,

probably normalized on some scale,

and then things like the hours per week that they work.

and their salary, which in this case is greater than or less than 50,000.

So let’s have a quick look at the kind of range of values we’re looking at here

so we can see if scaling’s even necessary.

Maybe we got lucky

and the person did it before they sent us the data

So we’re going to apply a function across all the columns

and we’re going to calculate the range of the data

So this is going to be apply on the census data,

dimension 2, so that’s all of our columns,

and we’re going to use the range function for this,

and this is going to tell us okay,

so for example the age ranges from 17 to 90

the educational level from 1 to 16

It gives you the range for things like nominal values as well,

but they don’t really make any sense

I mean, work class ranges from question mark to without pay,

which, you know, is meaningless.

And then so for example capital gain ranges from zero to nearly one hundred thousand,

and capital loss from zero to four thousand.

And finally the hours per week ranges from 1 to 99.

So you can see that the capital gain

is many orders of magnitude larger in scale than the hours per week.

We’re going to need to try and scale this data.
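
Here is a rough Python equivalent of that per-column range check. The rows are hypothetical; only the column names and the rough magnitudes follow what is read out in the video, and `column_ranges` is a helper invented for this sketch.

```python
# Hypothetical rows; only the column names and rough magnitudes follow the transcript.
census = [
    {"age": 39, "capital_gain": 2174, "hours_per_week": 40},
    {"age": 17, "capital_gain": 0, "hours_per_week": 13},
    {"age": 90, "capital_gain": 99999, "hours_per_week": 99},
    {"age": 52, "capital_gain": 0, "hours_per_week": 1},
]

def column_ranges(rows):
    """Return {column: (minimum, maximum)} for a list of dict rows."""
    columns = rows[0].keys()
    return {c: (min(r[c] for r in rows), max(r[c] for r in rows)) for c in columns}

ranges = column_ranges(census)
for name, (low, high) in ranges.items():
    print(f"{name}: {low} to {high}")
```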

To begin with, to make our lives a little bit easier,

let’s just focus on the numerical attributes, right,

so we don’t have to worry about the nominal values, which we’ve not codified yet.

We’re going to select all the columns from the data where they are numeric.

So that’s this line here, and paste that down here.

So we’re going to use sapply, which applies is.numeric over each of the fields,

and that’s going to give us a logical list

that says true or false depending on whether those columns are numeric.

What we’re doing here is selecting from this list any that are true

and then finding their names.

So what are the names of the columns that are numeric?
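
The same selection can be sketched in Python as a two-step filter: first a true/false flag per column, then the names where the flag is true. The row below is hypothetical and the column names are assumptions.

```python
from numbers import Number

# One hypothetical row is enough to inspect the column types.
row = {
    "age": 39,
    "workclass": "State-gov",
    "education": "Bachelors",
    "education_num": 13,
    "capital_gain": 2174,
    "hours_per_week": 40,
}

# Column name -> "is this column numeric?" (booleans are deliberately excluded).
is_numeric = {
    name: isinstance(value, Number) and not isinstance(value, bool)
    for name, value in row.items()
}

# Keep only the names whose flag is True.
numeric_columns = [name for name, flag in is_numeric.items() if flag]
print(numeric_columns)
```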

So let’s have a look at just a range of these attributes

to make our life a little bit easier.

So I’m gonna run this line

and so this is a simplified version of what I was just showing,

you can see that capital gain is massive

compared to the hours per week for example.

Let’s have a look at the standard deviation.

Recall that the standard deviation is, roughly speaking, the typical distance from the mean,

so it kinda gives us an idea of the spread of some data, right.

Is it very tight and everyone earns roughly the same,

or is it very spread out and there are huge deviations?

And the answer is there’s pretty huge deviations.

So the age has a standard deviation of 13, so obviously

that means that most people are going to be kind of in the middle

and on average they’re going to be 13 years younger or older,

but you can see that things like capital gain have a standard deviation of over 7,000,

which is a huge amount.

To give you some idea what we’re aiming for,

it’s very common to standardize this kind of data

so that the standard deviation is 1, right.

So, 7,000, much too big.
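
To make the gap concrete, here is a tiny Python sketch. The two samples are invented so that the sample standard deviations come out at about 13 and about 7,000, the figures quoted for age and capital gain; they are not the census values. Strictly, the standard deviation is the square root of the average squared distance from the mean, which is what `statistics.stdev` computes (with an n - 1 denominator).

```python
import statistics

# Invented samples chosen to mirror the spreads quoted in the video.
ages = [25, 38, 51]
capital_gain = [0, 7000, 14000]

sd_age = statistics.stdev(ages)
sd_gain = statistics.stdev(capital_gain)

print(sd_age, sd_gain)
print(sd_gain / sd_age)  # how many times wider the capital gain spread is
```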
Let’s plot an example

to give you some idea of the kind of problem we have with these massive ranges.

So I’m going to plot here a graph of age versus capital gains, right

We know age goes between about one and a hundred

and capital gain is much much larger.

So if I run this

basically the figure makes no sense at all,

because the capital gain ranges from zero to one hundred thousand

and there are a few people earning right at the top of the scale,

everything is sort of squished down the bottom.

We can’t see anything that’s going on.

There’s no way of telling whether

the capital gain of an individual is related to their age.

I mean it probably is, right

Cause retired people, people who are very young,

perhaps earn slightly less.

We can’t really see that here,

because it’s just too compressed, right

We need to start trying to bring these things together

so that we can perform better analysis.

What we’re going to do is create a new data frame

with just the numerical attributes

that we want to focus on, just to make our life a little bit easier,

and then we’re going to write a normalize function to

move all our data to between 0 and 1,

and we will do this per attribute.

So for example, if you’ve got some data which goes between a minimum and a maximum

and we want to scale this data to between 0 and 1

All we need to do is first of all, take away the minimum,

and that’s going to move everything to be

from 0, to max minus min.

And then we’re going to divide by this distance here,

so this is max minus min.

And if we divide by this everything is going to go from 0 to 1.

So that’s exactly what we’re doing in this function here

we’ve got a function of x,

and it subtracts the minimum of x

and then divides by the difference between the maximum and the minimum, alright:

(x - min(x)) / (max(x) - min(x))
So this is very standard. So I’m going to run this.

R lets you write functions like this and then use them

in applications over data.

So we’re going to calculate a normalized census dataset,

for which we’re going to apply, over dimension 2,

this normalize function we just wrote.

And then now if we look at the range, we’ll see that our range is now

between 0 and 1 for all of our data, which is exactly what we want.
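
The min-max scaling just walked through can be sketched in Python as follows. The sample values are hypothetical, and the handling of a constant column is an assumption added here, since the version described in the video would divide by zero in that case.

```python
def normalize(xs):
    """Min-max scale a list of numbers into the range [0, 1]."""
    low, high = min(xs), max(xs)
    if high == low:
        # Assumption: a constant column maps to all zeros to avoid dividing by zero.
        return [0.0 for _ in xs]
    return [(x - low) / (high - low) for x in xs]

# Hypothetical columns on very different scales.
hours_per_week = [1, 40, 50, 99]
capital_gain = [0, 2174, 99999]

print(normalize(hours_per_week))
print(normalize(capital_gain))
```

After this step both columns run from 0 to 1, so neither dominates a distance calculation simply because of its units.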

Normalization is a perfectly good way of handling your data.

If everything is between 0 and 1

we have fewer problems with the scale of things being way off right.

Now some statistical techniques, like PCA,

that we’re going to talk about in another video,

require standardized data,

that is, data that is centered around zero,

so it has a mean of zero and a standard deviation of one.

Now we can standardize data pretty easily in the same way.

Actually, we don’t need to write our own function for this,

the scale function in R performs this for us.

So we’re going to take the census data’s numerical attributes

and we’re going to call the scale function

and that’s going to take all of the attributes

and center them around their mean,

so that means the mean will become close to zero

and it’s going to divide them all by the standard deviation

so their standard deviation becomes one.

So if we run that and then we have a look at the mean of this data

So for example here, we calculate the mean.

You can see that the mean values are very, very close to zero.

That’s 10 to the minus 17 or something like that, very very small.

And if we look at the standard deviation, similarly, they’re all going to be 1.

Alright, so this is now standardized data.

This is a very good thing to do

if you want to use your data in some kind of machine learning algorithm or some kind of clustering.
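
Here is a minimal Python sketch of the same standardization, written by hand rather than with R’s scale function. The ages are hypothetical. Like scale, it divides by the sample standard deviation (n - 1 denominator).

```python
import statistics

def standardize(xs):
    """Centre on the mean, then divide by the sample standard deviation."""
    mean = statistics.fmean(xs)
    sd = statistics.stdev(xs)  # sample standard deviation, n - 1 denominator
    return [(x - mean) / sd for x in xs]

# Hypothetical ages.
ages = [17, 25, 38, 51, 90]
z = standardize(ages)

print(statistics.fmean(z))  # very close to 0, up to floating-point error
print(statistics.stdev(z))  # very close to 1
```

The mean will typically print as a tiny number rather than exactly zero, which is the same floating-point effect seen in the video.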

Let’s imagine now that we want to join some datasets together.

So we’ve standardized our data, so everything’s between 0 and 1,

or it’s centered around 0 with a standard deviation of 1,

we’ve codified some attributes.

What happens if we get other data from other sources?

You can imagine that census data from the US might be a bit useful.

But maybe we want census data from Spain

or from the UK or from another country.

Can we join all of these together

to get a bigger more useful dataset? Alright.

Now the thing to think about when you’re doing this,

is just to make sure that everything makes sense, right?

Are the scales the same?

Are they all normalized, or are none of them normalized?

Because otherwise, what you’re going to be doing is you’re going to be adding, you know,

pay between naught and a hundred thousand to pay somewhere between naught and one, and

nothing makes any sense anymore.

So let’s have a look at this on the census dataset.

We have some Spanish census data in a very similar format

to our census data from the United States.

Let’s have a quick look.

So I’m going to read the CSV file of Spain data.

Let’s remind ourselves of the columns that we had in our census data from the United States.

These are the numerical columns,

so we have age, education number

capital gain, capital loss, this kind of thing.

Let’s look at the Spanish dataset

to see if we can just join the two together.

So I’m gonna run head Spain,

that’s going to give us the first few rows

and you can see that

some of the stuff in there is as it was before,

so things like what their level of education is,

whether they work in the private sector or the public sector, right.

We’re going to need to remove these things

to keep just the numerical attributes.

And the other problem is if you look carefully,

you’ll see that the capital gain in the Spanish dataset is in euros,

not in dollars, right.

Now that’s a huge problem.

They’re not massively different, obviously;

they’re on the same order of magnitude

But we don’t want to be jamming

capital gain in euros next to dollars

because those two scales are not the same, right?

So what we need to do first

is scale this data using some kind of exchange rate.

So here what we’re going to do is we’re going to create a new column in Spain

so given a Spain data frame,

we’re going to say the Spain capital gain is equal to the

euro capital gain multiplied by 1.13,

which is the exchange rate we’re going to use.

Now it’s quite important in this kind of situation

not just to look up the exchange rate online.

You’ve got to consider that this might have been collected a while ago.

What was the exchange rate when this data was collected right,

these are things you’re going to have to think about.

So let’s run that line,

and let’s do the same thing for the capital loss.
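
The conversion step might look like this in Python. The 1.13 rate is the one used in the video; the rows and the `_eur` column names are hypothetical.

```python
# Rate used in the video: 1 euro = 1.13 US dollars (assumed valid at collection time).
EUR_TO_USD = 1.13

# Hypothetical Spanish rows with amounts recorded in euros.
spain = [
    {"age": 44, "capital_gain_eur": 1000.0, "capital_loss_eur": 0.0},
    {"age": 31, "capital_gain_eur": 0.0, "capital_loss_eur": 200.0},
]

for row in spain:
    # Add dollar columns next to the originals so the conversion stays auditable.
    row["capital_gain"] = row["capital_gain_eur"] * EUR_TO_USD
    row["capital_loss"] = row["capital_loss_eur"] * EUR_TO_USD

print(spain[0]["capital_gain"], spain[1]["capital_loss"])
```

Keeping the rate in one named constant makes it easy to swap in the rate from the period when the data was actually collected.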

Now we’re going to keep just the numerical attributes of

our census data and of the Spanish data,

and we’re also going to add another column,

that is what country they come from,

otherwise we’re not going to know.

So we’re going to use the cbind function

to combine the census data’s numerical attributes

and the native country which in this case will be the United States.

We’re going to do the exact same thing for the Spain data,

which will be basically exactly the same

except obviously we’re also going to have Spain as the native country.

And then we’re going to use the rbind function

to just join those two tables together

Now that will only work if those two datasets have the exact same attributes.

What did I do wrong?

So let’s join these two together using rbind.

There we go. And so our United dataset now has

the combined observations for the United States and Spain.
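
A Python sketch of the two steps, adding a country column and then stacking the tables, could look like this. `with_country` and `row_bind` are helper names invented for this sketch, the rows are hypothetical, and the explicit column check mirrors the point that the join only works when both datasets have exactly the same attributes.

```python
def with_country(rows, country):
    """Return copies of the rows with a native_country column added."""
    return [{**row, "native_country": country} for row in rows]

def row_bind(first, second):
    """Stack two tables, refusing to join if their columns differ.

    Assumes `first` is non-empty so its first row can define the columns.
    """
    columns = set(first[0])
    for row in first + second:
        if set(row) != columns:
            raise ValueError(f"column mismatch: {sorted(set(row) ^ columns)}")
    return first + second

# Hypothetical numerical attributes, already in the same units.
us = [{"age": 39, "capital_gain": 2174.0}, {"age": 50, "capital_gain": 0.0}]
spain = [{"age": 44, "capital_gain": 1130.0}]

united = row_bind(with_country(us, "United-States"), with_country(spain, "Spain"))
print(len(united), united[-1])
```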

Now, what you wouldn’t want to do is just join them together

and just leave it at that, right.

You want to perhaps have a little look at some plots to make sure that

the distributions of the data you’ve just joined together make sense.

For example, alright,

the United States data has a nice broad distribution of different ages.

We want to make sure that the Spanish data has that same distribution

Otherwise, you’re kind of going to skew your dataset.

So, for example, let’s have a look at roughly whether the levels of capital gain

are approximately the same for both the United States and the Spanish dataset.

So I’m gonna use ggplot for this. We’re gonna plot a bar chart

where we’ve color-coded United States and Spain,

and you can see that broadly speaking

there’s a lot kind of around zero or less than 50k,

and then there’s a few a little bit above.

Alright, so that looks broadly speaking the same distribution.

I’m fairly happy with that.

This is gonna be a judgement call

when you get your own data.

So I’ll clear the screen

and then let’s have a look at the next plot.

So the next plot is going to be capital loss versus the native country.

Let’s make sure those distributions are the same.

So it’s plotting there, and broadly speaking again, yes,

the majority are down the bottom,

and then there’s a few United States ones

and a couple of Spanish ones up at the top as well.

Again, it’s not a disaster.

That’s probably ok.

Finally, let’s have a look at ages by native country.

So if we plot this,

we can see two very very similar distributions.

You can see that it’s essentially a bell curve.

Maybe slightly skewed towards older participants

for the United States and very very similar for Spain. This is okay.

If we hypothesized that

capital gain, capital loss and salary

had something to do with your age,

then it would make sense to have the two datasets that you’re joining together

have very similar distributions in this regard.

So let’s look at one more dataset from Denmark.

Alright, so it’s the same thing, same format.

and we’re going to have a look at just the top few rows to make sure it’s in the same format,

so that’s using a head function,
and you can see actually we’ve already removed the nominal

and other text attributes from here

and we’ve just got the numerical ones.

And actually also capital gain and capital loss

are already in dollars in this dataset

so we don’t have to perform a conversion.

So we can use rbind to put these two things together,

and now we just need to check the distributions are the same.

So again,

we’re going to plot the age against the native country,

and see if these are roughly the same distributions.

And actually you can see this isn’t looking too good.

The United States and the Spanish datasets

have very similar distributions.

The participants or the people who have been polled from Denmark are much much older on average, right?

This could have an effect on things like capital gain,

so I wouldn’t necessarily feel comfortable just joining this dataset in,

without thinking about it a little bit more closely.
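
Alongside the plots, a quick numerical check of each source’s age distribution can flag this kind of mismatch. This is a Python sketch with invented ages chosen to mimic what the plots show (similar for the United States and Spain, much older for Denmark); `age_summary` is a hypothetical helper.

```python
import statistics
from collections import defaultdict

def age_summary(rows):
    """Return {country: (median age, youngest, oldest)} for the combined table."""
    by_country = defaultdict(list)
    for row in rows:
        by_country[row["native_country"]].append(row["age"])
    return {
        country: (statistics.median(ages), min(ages), max(ages))
        for country, ages in by_country.items()
    }

# Invented ages that mimic the shapes described for the three sources.
samples = {
    "United-States": [25, 38, 51],
    "Spain": [27, 39, 50],
    "Denmark": [61, 70, 78],
}
combined = [
    {"native_country": country, "age": age}
    for country, ages in samples.items()
    for age in ages
]

summary = age_summary(combined)
for country, (median, youngest, oldest) in summary.items():
    print(f"{country}: median {median}, range {youngest} to {oldest}")
```

A summary like this does not replace looking at the plots, but a median that far out of line is a reasonable prompt to stop and think before joining.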

Alright, so

whenever you’re joining datasets like this, taking data from different sources,

think carefully, to make sure that it’s fair

and what you are doing is a reasonable concatenation of datasets.

And actually these are the features

that power Spotify’s recommender system and numerous others.

So we’ve got things like acousticness.

How acoustic does it sound,

from a zero to a one?

We’ve got instrumentalness.

I’m not convinced that’s a word.

Speechiness.

That is, to what extent is it speech or not speech, alright.

And then things like tempo…
