
#### #3 Cleaning Data

Data Analysis 3: Cleaning Data - Computerphile

Well, we’re looking at chocolate datasets today,

so I thought I’d bring some research.

Yeah, good and definitely relevant

<03 - Cleaning Data>
<Data Frenzy>
We’ve been looking at techniques like data visualization

to try and explore our data

and start to draw some initial, you know, conclusions or hypotheses

We’re going to start to move towards kind of modeling our data

and actually trying to extract proper knowledge from this data

because remember just because we’ve got data

doesn’t mean we’ve got knowledge.

Now this is going to be a kind of iterative process.

We’re going to need to clean up our data

to make it as useful as possible

We need to transform it

so that we can combine datasets together

And statistically measure our datasets.

And then we’re going to need to reduce it sometimes

if our data set is too big and unmanageable.

and this combination of cleaning data

and transforming data and reducing data

is a kind of cycle where we’re going to iterate this until

our dataset is in the smallest, most useful form it can be

So if we’ve got redundant variables

which are basically the same as others

or we’ve got duplicates

These are all problems that we need to sort out.

Because if we’re going to be analyzing data with these kinds of issues,

we’re just making our life slightly more difficult.

It’s computationally inefficient,

and you know, in the worst case,

we could draw the wrong conclusions.

You might be surprised and disappointed

when you get your first dataset

that actually it’s not quite as nice as you were hoping right

It’s gonna need some cleaning up.

Maybe there’s missing values

Maybe there’s outliers that need to be dealt with

because they’re warping your distributions and your medians and means

And perhaps you also got noise in your dataset, right

these few things we can start to address with cleaning

So cleaning data is essentially the idea of

trying to correct or fill in any missing values

Or remove those bits completely.

You might be surprised that there’s missing data at all

I mean oh, what, are we not paying attention?

Like we’ve got one job and that was to collect the data

and we seem to have got missing data

But actually it’s quite common

because I mean, for example if you’re

if you’re trying to track patient records over time

Maybe the patient didn’t show up to an appointment

or maybe they were in hospital but weren’t there

when they needed to have their temperature taken

right and then your trend line of your temperature

over time is going to be missing some values

Maybe you’ve got data going back for hundreds of years

and they didn’t have certain techniques

and certain measurement systems back then

so they only have other kinds of data

so missing data is very common.

We’re gonna have to be able to deal with it.

So the dataset we’re looking at today is a kind of ratings for chocolate bars.

This is why I ate my chocolate

or at least that’s why I’m telling myself

So we’ll read the data in

we’ve got ten different variables

We’ve got nearly 1,700 observations

and let’s have a quick look using the summary

So we’ve got things like the company who produce the chocolate,

the name of the chocolate

reviews, the cocoa percentage,

the type of bean, this kind of information

Right and you can imagine what you might do if you were trying to produce better chocolate

is to look at a huge amount of this kind of data

and work out what it is that customers like and what it is they don’t like

Right this is going to be quite common in market research

So the first thing we’re going to do

Right, we’ve received this data

We know now what the columns are,

but we don’t really know anything else other than this

So we’re going to have to start looking through and seeing first of all,

is there any missing data?

So we’re going to use the sapply function for this

The sapply function
will apply a function over our dataset

so for each column or each attribute of our data we’re going to apply this

And the function we’re going to use we’re writing ourselves

So it’s going to be the sum for any time where our item is either blank or NA

Now blank means an empty string and NA means literally not available

Right, which is something that comes up in data from time to time

Alright, so in any case both of these are missing values,

so we’re going to treat them both the same
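In R, that per-column check might look something like the sketch below. The data frame name `choco` follows the video; treating the empty string and NA as equivalent is exactly the idea described here.

```r
# Count missing entries (blank string or NA) in each column of choco
sapply(choco, function(x) sum(x == "" | is.na(x)))
```

The anonymous function is applied once per column, so the result is one missing-value count per attribute.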

So if we apply this to our chocolate dataset

then we’re going to see that for example,

there are eight missing names

There are seven missing review dates,

and there are four missing cocoa percentages

So looking across the rows in our data,

there are four rows where the cocoa percent is missing, right

That’s not too bad, four, I mean,

this is a dataset of nearly 1,700 items.

Four is not too bad.

That’s quite expected

You might imagine that

if you’re pooling this data from lots of different sources

People are going to forget to add data in,

or they weren’t able to record data on that day

There’s a huge amount of reasons why you might have missing data.

And now it starts to become a little bit of a problem

when we look at things like bean type.

Because bean type has got 1,200 missing values

That’s a huge proportion of a dataset.

And in that case

The only issue we’ve got is that 1,200 is just a raw count of rows.

It’s not relative to the size of the dataset.

So we’re going to use the exact same function counting any empty entries

but this time we’re going to divide by the total number of rows

so we can get a percentage for how much missing data we’ve got.
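A minimal sketch of that normalization in R — the same `sapply` count as before, divided by the row count:

```r
# Fraction of missing values per column (0 to 1)
missing_pct <- sapply(choco, function(x) sum(x == "" | is.na(x))) / nrow(choco)
missing_pct
```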

So we can see for example that company name

has zero missing data,

whereas bean type has 74 percent missing data.

So that’s a huge problem

Now a kind of general rule of thumb is

if you’ve got over half your data is missing

it’s going to be quite hard to estimate or guess what that data is going to be

That’s when you might want to start thinking about removing it.

So what we want to do is we want to extract any of the names of any of our attributes

that have this sort of over let’s say 60% missing.

So we’re going to start by calculating all the percentages

and saving them in a variable.

And then we’re going to select only those percentages, where the value is over point six, right?

60 percent.

So we’re gonna say any attribute where the attribute is over point six

and that is just bean type at point seven four.

Or seventy four percent.

So we can now delete bean type ourselves

so we could say something like choco, all the rows,

for bean type, is NULL

by setting that to NULL, that’s just going to delete that column

We can also do it automatically

so we could actually pass in

those attributes that we just calculated as a parameter

So that would be this line here

So it’ll be something like choco, all rows, that’s here,

the names of any attributes where the percentage missing is greater than 0.6.

And that’s going to just delete bean type.
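Both approaches might look something like this in R; the column name `Bean.Type` and the data frame name `choco` are assumptions based on the video's dataset:

```r
# Manual deletion of one column by setting it to NULL
choco$Bean.Type <- NULL

# Or automatically: keep only columns whose missing fraction is at most 0.6
missing_pct <- sapply(choco, function(x) sum(x == "" | is.na(x))) / nrow(choco)
choco <- choco[, missing_pct <= 0.6]
```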

There’s not a lot we can do about bean type.

We’ve only got 25%-ish of the data.

It’s not enough to start guessing

what bean types are going to be in other chocolate bars.

Let’s have a look at now our rows of data.

For each instance they’re going to have a number of attributes,

now there’s nine left.

And we want really to keep the instances that have the majority of their data, right

So we’re gonna apply, right, so this is going to be row wise

to dimension one, so that’s the rows

we’re gonna count any of a blank or n/a

for each row over our dataset,

and we’re going to put this into missing

So what it’s going to do is return a list of values, one for every single row,

that tells you how many missing items there are in that row.

So we can now look at the first few missing items.

So we’re going to order them, right, by largest first

And then we’re going to show just the first few.
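As a sketch in R, the row-wise count uses `apply` over dimension 1 (the rows), and sorting descending surfaces the worst rows first:

```r
# Per-row count of blank or NA entries
missing <- apply(choco, 1, function(x) sum(x == "" | is.na(x)))

# Order largest-first and peek at the worst offenders
head(sort(missing, decreasing = TRUE))
```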

And you can see that actually some of them are missing seven and six attributes.

That’s quite a serious situation

Because it was only nine, right.

So essentially they’ve only got a couple of entries in their fields.

Now let’s do this again as a percentage of the number of attributes

So this is exactly the same thing

but this time we’re dividing by the number of columns,

which is nine.

and we’re going to have a look at the top of these.

and so you can see that some of these instances are missing 77% of their attributes,
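In R, it's the same `apply` call with the count divided by the column count, roughly:

```r
# Per-row missing fraction: count divided by the number of columns
missing <- apply(choco, 1, function(x) sum(x == "" | is.na(x))) / ncol(choco)
head(sort(missing, decreasing = TRUE))
```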

That’s a real problem.

Missing is the same length as the number of rows we’ve got.

So we can actually look up any rows

where there’s a greater percentage of missing values than we want

and just remove them from the dataset.

So what we’re going to do that is a bit like this.

We can say choco is choco, anywhere where

missing is less than nought point seven (0.7), and then all the columns.

And what that’s going to do is select only the rows we want

where they’ve got a nice amount of data.
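That row filter could be sketched like so in R, recomputing the per-row fraction first so the snippet stands alone:

```r
# Keep only rows where less than 70% of the attributes are missing
missing <- apply(choco, 1, function(x) sum(x == "" | is.na(x))) / ncol(choco)
choco <- choco[missing < 0.7, ]
```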

So the choco dataset is going to be a little bit smaller now, but much more useful to us

We don’t really want to be trying to do things like machine learning or statistics

when 70% of some of the data is missing.

Right, that isn’t going to be a good idea.

So it’s quite easy just to delete data, right?

I mean in some sense, it’s just more convenient to do that.

In general the rule is that if you’ve got more than 50% or 60% missing data,

it’s a good idea to delete it, right?

Delete either the instances or the attributes

depending on how much data you’ve got missing and where.

if you’re missing a huge amount of data

then you’re not going to be able to recreate it by let’s say using an average, right?

We’ve got so little data

that an average isn’t going to be reliable.

If we have got sufficient data, though,

we could maybe start to infer what these missing values might be.

We can start to try and replace them

instead of deleting them.

So what we might do

is we might for example set them all to zero.

Maybe if an attribute is missing we can say well okay,

if it’s missing, we’ll just not use it

and we’ll say it’s zero.

Now, whether you do that is going to depend on what the attribute is.

Sometimes zero is not a useful value.

Right and we’ll look at that in the chocolate dataset in a moment.

What we might also do is

we might start to add the dataset mean into those attributes

So maybe we don’t know what the rating for this chocolate bar is

but we can guess

that it’s going to be around the average rating for any chocolate bar.

Again, this is going to depend on your situation, right?

You’re still making up data in some sense

You’ve got to be very careful about what you do here.

So we’ve deleted as much of our choco data as we feel comfortable doing now.

Now let’s see if we can

fill in some of the missing values with appropriate replacements.

So let’s have a look at our attributes.

Alright, so we’ve got company, name, reference, things like this.

Bean type has been removed,

but we’ve still got things like the bean origin and the ratings,

and there’s a few of these missing from our dataset.

Can we estimate these rather than completely removing them from the dataset?

Obviously the less data you use,

the less useful things like machine learning are going to be.

So let’s look at an attribute and see what we can do.

So if we look at bar price,

and that’s the price of each chocolate bar

we can see that there’s a few missing values somewhere around 3%.

That’s something we want to deal with.

But we’ve got enough data, you know 97%

maybe we can start to guess what the prices of these chocolate bars might be.

Now this is a good instance

of a time when you wouldn’t want to just populate with zeros, right?

No chocolate bar is free, I wish.

And so what we need to do is produce a reliable value

to represent an unknown price,

rather than just setting them all to be zero.

So what we could do here is something like this.

We could set every missing bar price to be

the average of all the chocolate bar prices.

and that way at least we’re not warping our distribution up or down.

We’re keeping it exactly the same.

We’re gonna say for the chocolate dataset

for any row, where bar price is n/a and for all columns,

we’re gonna set the bar price to be

the mean of all the bar prices.

And we’re obviously gonna remove any NAs from that calculation, or it’s not going to work

And that’s already worked.
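A minimal sketch of that mean imputation in R — `Bar.Price` is an assumed column name for this dataset:

```r
# Replace missing bar prices with the mean of the known prices;
# na.rm = TRUE excludes the NAs from the mean calculation itself
choco$Bar.Price[is.na(choco$Bar.Price)] <- mean(choco$Bar.Price, na.rm = TRUE)
```

This keeps the distribution's mean exactly where it was, which is the point made above.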

So now if we recalculate our missing values,

you’ll see that bar price now has zero missing values

So we’ve fixed that problem. Great.

So that was quite an easy one, right

Bar price seems to me to be quite an intuitive time

when you would just calculate an average and put it in.

Right now actually, maybe not because

you know, bar price might depend on where in the world we’re selling it

or you know, what company is producing the chocolate bar.

So could we do the same thing for rating?

If we look, if we take the sum of all the NA values in rating

It’s eight. Right, so there are eight chocolate bars for which there is no rating

So what we can do is we could do something called a stratified replacement

We could say well, let’s group our chocolate bars by country or by company

calculate those averages,

and then we can specifically fill in companies missing ratings

based on how that company actually performs in the market

rather than just an average over everything

So what we’re going to do is we’re going to calculate an aggregate function

over the ratings by company

And we’re going to calculate a median

Median is a little bit more robust to outliers

So maybe they make a very, very expensive or very, very cheap line.

The median will get whatever the middle value is, right?

So this is going to be per company

and we can set the columns to be a little bit more helpful using colnames

and so now our per company if we look at it,

is each company and the median rating of their chocolate bars, from,

I think, one to five.

That’s how this dataset is rated.
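The aggregate-and-rename step might be sketched like this in R (column names `Rating` and `Company` are assumptions):

```r
# Median rating per company; the median is more robust to outliers than the mean
per_company <- aggregate(choco$Rating, by = list(choco$Company),
                         FUN = median, na.rm = TRUE)

# Give the result more helpful column names
colnames(per_company) <- c("Company", "MedianRating")
head(per_company)
```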

So now we know that data per company,

we can actually fill those in.

Now you could automate this process.

We don’t have much missing data.

So let’s just show an example of doing it by hand

So this is the line of code we’re going to do and I’ll talk through it

So we’re going to say for the chocolate dataset

for any value where the chocolate rating is NA, for missing, right?

and the company is Vicuna

We want to set the rating to be equal to

The Vicuna entry in our new per company average or median.
and that’s going to fill in that value there

So we do this for all the missing companies

and what we’re going to find is that we’ve replaced all our missing values with

appropriate medians for those ratings per company.
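That by-hand stratified replacement could look something like this in R. The per-company table is rebuilt here so the snippet is self-contained; "Vicuna" is the company named in the video, and the column names are assumptions:

```r
# Per-company median ratings, as built with aggregate above
per_company <- aggregate(choco$Rating, by = list(choco$Company),
                         FUN = median, na.rm = TRUE)
colnames(per_company) <- c("Company", "MedianRating")

# Fill in one company's missing ratings from its own median
rows <- which(is.na(choco$Rating) & choco$Company == "Vicuna")
choco$Rating[rows] <- per_company$MedianRating[per_company$Company == "Vicuna"]
```

Repeating this per company (or looping over the companies with missing ratings) automates the process described above.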

So the last thing we might find in our data is outliers.

So let’s imagine we do a box plot of cocoa percentage, right?

So I’m going to produce a box plot of a cocoa percentage
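In R that's a one-liner; the column name `Cocoa.Percent` is an assumption about this dataset:

```r
# Box plot of cocoa percentage; outliers are drawn as individual points
boxplot(choco$Cocoa.Percent, ylab = "Cocoa percent")
```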

Now, maybe our assumption is that cocoa percentage in some way

informs what the rating is going to be,

because maybe a higher cocoa percentage tastes nicer.

I don’t really know about chocolate.

So if we look at this box plot,

what we’ll see is we’ve got actually quite a tight distribution of cocoa percentage right

between about 50% and just above 80%

But you can see there are three outliers when it produces a box plot

By default, R’s box plot will show outliers as anything

that is more than 1.5 times the interquartile range beyond the box

What we do with these outliers is going to be a judgment call,

it’s going to depend on a situation

So, for example, we have an outlier here, which is above a hundred percent,

now that makes no sense.

We can’t have a chocolate bar with more than a hundred percent cocoa,

right, it doesn’t make sense

So that is obviously a mistake,

we would delete that item right,

and probably delete the whole row

or reestimate that value based on a stratified average or a different average
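Dropping the impossible row might be sketched like this, assuming `Cocoa.Percent` is numeric on a 0–100 scale (an assumption; some versions of this dataset store it as a string like "70%"):

```r
# A value over 100% cocoa is clearly a mistake: keep everything else
choco <- choco[is.na(choco$Cocoa.Percent) | choco$Cocoa.Percent <= 100, ]
```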

For these lower ones, this is a judgment call.

One is just above 20 and one is up closer to 30%

I don’t know whether those are outliers or not, right?

Is it possible to make a viable chocolate bar with 20% cocoa?

I mean, maybe it is, right.

You’re going to have to know a little bit about the situation that your data was collected in

and whether that’s reasonable.

So you might for example, delete the bottom one as a mistake,

but keep the top one because that’s just a low amount of cocoa.

So this is what cleaning data is about

We’re going to have missing data, we’re going to have outliers,

we might have noise.

and you’re going to have to look through your data,

and try and condense it and remove all these problems

We do this so that we can more effectively transform our data later

and also reduce our data if we need to.

And then eventually your dataset’s going to be really nice

so that we can do things like modeling or machine learning on it.

…per hour. My fuel economy is measured in miles per gallon,

but of course, I don’t pump fuel in gallons, I pump it in liters.

But then when I run anywhere, so short distances,

I run in kilometers and I run in kilometers per hour.

So I’m using two different systems there.

ericaeureka