#4 Data Transformation

Data Analysis 4: Data Transformation - Computerphile

People need to learn to use standardized measures for things.
So take me, for example: when I drive anywhere,
I drive in miles, and I drive in miles per hour.
My fuel economy is measured in miles per gallon,
but of course I don’t pump fuel in gallons,
I pump it in liters.
But then when I run anywhere, so short distances,
I run in kilometers, and I run in kilometers per hour.
So I’m using two different systems there.
And any short distances I’m measuring are going to be in meters, not feet.
So if I’m measuring, let’s say,
around my house for painting,
I’m going to measure in square meters,
so I know how much paint to buy.
But then if I’m selling a house, or I’m buying a house,
I’m going to be looking at the size of the house in square feet.
Again, who knows why. British people.
If I’m baking anything,
it’s going to be weight in grams or kilograms going into the recipe.
But if I’m weighing myself, it’s going to be in stones and pounds.
But of course a ton, for me, would be a metric ton,
not an imperial ton.
And as I said, I measure fuel in liters,
and most of my liquids are measured in liters,
except of course for beer and milk, which are in pints.
So this is the kind of problem you’re going to be dealing with
when you’re looking at data.
You’re trying to transform your data into a usable form.
Maybe the data is coming from different sources
and none of it goes together.
You need standardized units and standardized scales,
so you can go on and analyze it.
<04 - Data Transformation>
<Computerphile>
So let’s think back.
What we’re doing is trying to prepare our data
into the densest, cleanest format
for modeling or machine learning
or some kind of statistical test,
to work out what’s going on and draw knowledge from our data.
So this is going to be an iterative process:
we’re going to be cleaning the data,
we’re going to transform the data,
and then we’re going to reduce the data,
and transforming data is what we’re going to do today.
So let’s imagine that you’ve cleaned your data.
So we’ve got rid of as many missing values as possible,
hopefully all of them, and we’ve deleted instances and attributes that
just weren’t going to work out for us.
Now what we’re going to try and do
is transform our data
so that everything’s on the same scale
and everything makes sense together.
And if we’re bringing datasets together from different places,
we also need to make sure all the units are the same
and everything makes sense.
There’s no point in trying to use machine learning
or clustering or any other mechanism
to draw knowledge from our data if our data is all wrong.
So today we’re going to be looking at census data.
Now census data is a classic example of the kind of data
you might look at in data analysis.
It has got lots of different kinds of attributes,
things that are going to need cleaning up and transforming.
So we’re back in R, and we’re going to read the census data
using census <- read.csv("census.csv").
So we’ve downloaded some census data that
represents samples from the US population to begin with.
We’re going to read that in, and you can see that
we’ve got 32,000 observations and 15 attributes,
or variables.
So that’s the first step.
So let’s have a quick look at just a little bit of it
so we can see the kind of thing we’re looking at.
So we’re going to say head(census),
and that’s just going to produce the first few rows
so we can see the kind of data.
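
A minimal sketch of this loading step in R. The inline CSV text is a toy stand-in for census.csv, and its column names and values are assumptions made for illustration, not the file used in the video.

```r
# Toy stand-in for census.csv; column names and values are invented.
csv_text <- "age,workclass,education,education_num,capital_gain,capital_loss,hours_per_week
39,State-gov,Bachelors,13,2174,0,40
50,Self-emp-not-inc,Bachelors,13,0,0,13
38,Private,HS-grad,9,0,0,40
53,Private,11th,7,0,0,40"

# With a real file this would be: census <- read.csv("census.csv")
census <- read.csv(text = csv_text, stringsAsFactors = FALSE)

dim(census)   # number of observations (rows) and attributes (columns)
head(census)  # first few rows (at most six)
str(census)   # type of each column, to tell text from numeric attributes
```

Calling str() alongside head() is a small addition here: it shows directly which attributes are text and which are numeric.
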
So you can see we’ve got age,
we’ve got what working classification that person has, their educational level,
and a numerical representation of whether they’re married or not, this kind of thing.
So there’s a lot of different kinds of data here.
Some of it is going to be nominal.
So for example, this workclass:
state government, private employee.
That’s a nominal value.
We might have ordinal values or ratio values
or interval values.
We’re going to have to delve a little bit closer to find out what these are.
Now what we do to transform this data
into a usable format for clustering or machine learning
is going to depend on exactly what the types of these columns are
and what we want to do with them.
So let’s look at just a couple of the attributes
and see what we can do with them.
We’re going to use a process called codification.
The idea is that maybe things like random forests or
multi-layer perceptrons, you know, neural networks,
aren’t going to be very amenable to text-based inputs.
So what we want to do is try and replace these attributes
with a numerical score.
So let’s look, for example, at workclass,
and also, for example, the educational level, so education.
Now workclass is the class of worker that we’re looking at here.
So for example a state worker, or someone in the private sector,
or someone that worked in a school, or something like this.
Now this is a nominal value.
That means there’s no order to this data at all.
We can’t say that someone in state is higher or lower than someone in private,
and we also can’t say that, let’s say, state is two times more or less than some other one.
That makes no sense at all.
So we can replace this with numbers.
So let’s say we could replace private with zero
and state with one
and, you know, self-employed with two, and so on.
And that’s a perfectly reasonable thing to do,
but it’s still nominal data.
So what we can’t do is then calculate a mean and
say “ah, the mean is halfway between private and public”.
That doesn’t make any sense.
Just because something has been replaced by a numerical score
doesn’t mean that it actually represents something that we can quantify in that way.
It’s still nominal data.
So the best advice I can give is:
feel free to codify your data into easy-to-read numbers,
but just bear in mind that
you can calculate the mode, you know, the most common value,
but you can’t calculate the median and you can’t calculate the mean.
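
One way to sketch codification of a nominal attribute in R is with a factor. The labels and the vector below are invented for illustration, and the level order is chosen only to mirror the private = 0, state = 1, self-employed = 2 example.

```r
# Invented sample of a nominal attribute.
workclass <- c("Private", "State-gov", "Self-emp", "Private", "Private", "State-gov")

# Codify: map each label to an integer code. The order of the levels is arbitrary.
workclass_factor <- factor(workclass, levels = c("Private", "State-gov", "Self-emp"))
workclass_code <- as.integer(workclass_factor) - 1L  # zero-based codes

# The mode is meaningful for nominal data: the most frequent label.
counts <- table(workclass)
mode_label <- names(counts)[which.max(counts)]

# mean(workclass_code) would run without error, but the result has no meaning,
# because the codes are labels rather than quantities.
```
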
Another example would be something like the educational level.
Now theoretically this is ordinal data,
so we could say that someone with an undergraduate degree
is maybe slightly higher, in terms of the amount of time they spent in education,
than someone with a high school diploma.
But we don’t know exactly what the distance is,
what the distance is between, let’s say, a high school diploma and a degree and then a PhD,
and an MD, and things like this.
We can represent these using numbers,
and probably in order,
so we could say that zero is no education
and one is the end of primary school
and two is the end of high school, and so on and so forth.
But again,
it’s difficult to calculate distances between these things.
We can’t say that high school is two times more than primary school
and half of a degree, or something like that.
That doesn’t really make sense.
So again,
you might be able to calculate a median on this, or a mode,
but you can’t calculate a mean.
You can’t say the average level of education
is halfway between high school and undergraduate.
That doesn’t make any sense either.
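
For ordinal data, an ordered factor keeps the ranking without implying distances. This is a sketch only: the education labels and their ordering are assumptions for illustration.

```r
# Invented sample of an ordinal attribute.
education <- c("HS-grad", "Bachelors", "Primary", "HS-grad", "Doctorate")

# Ordered factor: the levels carry a rank, but not a distance between ranks.
levels_in_order <- c("None", "Primary", "HS-grad", "Bachelors", "Doctorate")
education_ord <- factor(education, levels = levels_in_order, ordered = TRUE)
education_code <- as.integer(education_ord) - 1L  # 0 = no education, 1 = primary, ...

# Comparisons are valid on an ordered factor.
education_ord[2] > education_ord[1]  # is Bachelors above HS-grad?

# The median is defined for ordinal data: take the middle rank and map it back.
median_code <- median(education_code)
median_label <- levels_in_order[median_code + 1]
```

With an even number of observations the median can fall between two ranks, which again has no label; that is a limit of the data type, not of the code.
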
So for any kind of attribute that is nominal or
possibly ordinal and is represented using text,
we can codify this so that it’s more amenable to things like
decision trees, depending on the library you’re using.
But you just have to be careful: all machine learning algorithms
will take any number you give them,
and you have to be careful that this makes sense to do.
So what you would do is go through your data
and begin to systematically replace appropriate attributes
with numerical versions of themselves,
remembering all the time
that they don’t necessarily represent true numbers,
you know, in a ratio or interval format.
So for any text-based value,
we’re going to start by replacing it, possibly, with numerical scores.
What about the numerical values?
Well, they might be okay,
but the issue is going to be one of scale.
You might find, for example, in this census data
that one of the dimensions,
or one of the attributes, is much, much larger than another one.
So for example, this dataset has hours per week,
which is obviously going to be somewhere between naught and maybe 60 or 70 hours
for someone who has got, you know, a very strong work ethic,
and salary.
Or salary or income or any other measure of, you know, monetary gain.
Now obviously hours per week is going to be in the tens, and
salary could be in the tens of thousands, maybe even the hundreds of thousands.
Those scales are not even close to being the same.
That means if you’re doing clustering or machine learning
on this kind of data,
you’re going to find that salary
is overbearing everything.
So it’s going to be very easy for your clustering
to find differences in salary,
and harder for it to spot differences in hours,
because they’re so small in comparison.
So we need to start to bring everything onto the same scale.
The more attributes you have,
which is another way of saying the more dimensions you have to your data,
the further everything is going to be spread around.
If we can scale all of these values to between,
let’s say, around 0 and 1,
then everything gets more tightly controlled in the middle,
and so it gets much easier to do clustering
or machine learning or any kind of analysis we want.
So let’s look back at our data
and see what we can do to try and scale some of this into the right range.
So we’re going to look back at the head of our data again.
Our numerical values are things like the capital gain and
the capital loss, which I guess is
presumably how much money they’ve made or lost that year,
probably normalized on some scale,
and then things like the hours per week that they work
and their salary, which in this case is recorded as greater than or less than 50,000.
So let’s have a quick look at the range of values we’re looking at here,
so we can see if scaling’s even necessary.
Maybe we got lucky
and the person did it before they sent us the data.
So we’re going to apply a function across all the columns,
and we’re going to calculate the range of the data.
So this is going to be apply on the census data,
dimension 2, so that’s all of our columns,
and we’re going to use the range function for this.
And this is going to tell us, okay,
so for example the age ranges from 17 to 90,
the educational level from 1 to 16.
It gives you the range for things like nominal values as well,
but they don’t really make any sense.
I mean, workclass ranges from question mark to without pay,
which, you know, is meaningless.
And then, for example, capital gain ranges from zero to nearly one hundred thousand,
and capital loss from zero to four thousand.
And finally the hours per week ranges from 1 to 99.
So you can see that the capital gain
is several orders of magnitude larger in scale than the hours per week.
We’re going to need to try and scale this data.
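
A sketch of the range check, using a small invented data frame in place of the census data (the column names and values are assumptions). It also shows why the text column gets a "range" at all: apply() converts the data frame to a matrix first, and with a text column present every value becomes a string.

```r
# Invented stand-in for the census data.
census <- data.frame(
  age = c(17, 39, 90),
  workclass = c("?", "Private", "Without-pay"),
  capital_gain = c(0, 2174, 99999),
  hours_per_week = c(1, 40, 99),
  stringsAsFactors = FALSE
)

# Margin 2 means "per column". Because of the text column, the data frame is
# coerced to a character matrix, so every range here is computed on strings.
ranges_all <- apply(census, 2, range)
ranges_all
```

Since the comparison is done on strings, it is safer to compute ranges on the numeric columns only, which is the next step in the talk.
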
To make our lives a little bit easier, we’ll begin by
focusing on just the numerical attributes,
so we don’t have to worry about the nominal values, which we’ve not codified yet.
We’re going to select all the columns from the data where they are numeric.
So that’s this line here, and I’ll paste that down here.
So we’re going to use sapply, which applies is.numeric over each of the fields,
and that’s going to give us a logical list
that says true or false depending on whether those columns are numeric.
What we’re doing here is selecting from this list any that are true
and then finding their names.
So what are the names of the columns that are numeric?
So let’s have a look at just the range of these attributes
to make our life a little bit easier.
So I’m going to run this line,
and this is a simplified version of what I was just showing.
You can see that capital gain is massive
compared to the hours per week, for example.
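
The numeric-column selection described above might look like this. The data frame and the variable names is_num and numeric_names are invented for illustration; only sapply and is.numeric are named in the talk.

```r
# Invented stand-in for the census data.
census <- data.frame(
  age = c(25, 38, 52, 61),
  workclass = c("Private", "State-gov", "Private", "Self-emp"),
  capital_gain = c(0, 0, 15024, 99999),
  hours_per_week = c(40, 50, 20, 99),
  stringsAsFactors = FALSE
)

is_num <- sapply(census, is.numeric)    # named logical vector, one entry per column
numeric_names <- names(census)[is_num]  # names of the numeric columns only

# Range (min, max) of each numeric attribute, one column per attribute.
numeric_ranges <- sapply(census[numeric_names], range)
numeric_ranges
```
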
Let’s have a look at the standard deviation.
Recall that the standard deviation is, roughly, the average distance from the mean,
so it gives us an idea of the spread of some data.
Is it very tight, with everyone roughly the same,
or is it very spread out, with huge deviations?
And the answer is there are pretty huge deviations.
So the age has a standard deviation of 13, so obviously
that means that most people are going to be in the middle,
and on average they’re going to be 13 years younger or older.
But you can see that things like capital gain have a standard deviation of over 7,000,
which is a huge amount.
To give you some idea of what we’re aiming for,
it’s very common to standardize this kind of data
so that the standard deviation is 1.
So 7,000 is much too big.
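
A short sketch of what sd() computes, on invented values. "Average distance from the mean" is a useful intuition; the actual definition is the square root of the average squared deviation, with n - 1 in the denominator for a sample.

```r
# Invented values for two attributes on very different scales.
age <- c(25, 38, 52, 61)
capital_gain <- c(0, 0, 15024, 99999)

sd_age <- sd(age)
sd_gain <- sd(capital_gain)

# The same quantity computed by hand for age.
manual_sd_age <- sqrt(sum((age - mean(age))^2) / (length(age) - 1))

# The literal "average distance from the mean" is a different, related measure.
mean_abs_dev_age <- mean(abs(age - mean(age)))
```
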
Let’s plot an example
to give you some idea of what the problem is when we have these massive ranges.
So I’m going to plot here a graph of age versus capital gain.
We know age goes between about one and a hundred,
and capital gain is much, much larger.
So if I run this,
basically the figure makes no sense at all,
because the capital gain ranges from zero to one hundred thousand,
and as there are a few people earning right at the top of the scale,
everything else is squished down at the bottom.
We can’t see anything that’s going on.
There’s no way of telling whether
the capital gain of an individual is related to their age.
I mean, it probably is.
Because retired people, and people who are very young,
perhaps earn slightly less.
We can’t really see that here,
because it’s just too compressed.
We need to start trying to bring these things together
so that we can perform better analysis.
What we’re going to do is create a new data frame
with just the numerical attributes,
so we can focus on those, just to make our life a little bit easier,
and then we’re going to write a normalize function to
move all our data to between 0 and 1,
and we will do this per attribute.
So for example, if you’ve got some data which goes between a minimum and a maximum,
and we want to scale this data to between 0 and 1,
all we need to do is, first of all, take away the minimum,
and that’s going to move everything to be
from 0 to max minus min.
And then we’re going to divide by this distance here,
so this is max minus min.
And if we divide by this, everything is going to go from 0 to 1.
So that’s exactly what we’re doing in this function here.
We’ve got function(x),
and it subtracts the minimum of x
and then divides by the difference between the maximum and the minimum: (x - min(x)) / (max(x) - min(x)).
So this is very standard. So I’m going to run this.
R lets you write functions like this and then use them
in applications over data.
So we’re going to calculate a normalized census dataset,
which is: we’re going to apply, over dimension 2,
this normalize function we just wrote.
And now if we look at the range, we’ll see that our range is now
between 0 and 1 for all of our data, which is exactly what we want.
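
The min-max normalization just described can be sketched as follows. The function body follows the transcript; the data frame census_num and its values are invented for illustration.

```r
# Min-max normalization: shift so the minimum is 0, then divide by the range.
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Invented numeric-only stand-in for the census data.
census_num <- data.frame(
  age = c(17, 39, 52, 90),
  capital_gain = c(0, 2174, 15024, 99999),
  hours_per_week = c(1, 40, 60, 99)
)

# Apply the function to each column (margin 2), then turn the matrix back
# into a data frame.
census_norm <- as.data.frame(apply(census_num, 2, normalize))

apply(census_norm, 2, range)  # every column now runs from 0 to 1
```

One caveat with this sketch: a column whose values are all identical has max(x) - min(x) equal to zero, so the division produces NaN. Such columns would need to be handled or dropped beforehand.
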
Normalization is a perfectly good way of handling your data.
If everything is between 0 and 1,
we have fewer problems with the scale of things being way off.
Now some statistical techniques, like PCA,
which we’re going to talk about in another video,
require standardized data,
that is, data that is centered around zero:
it has a mean of zero and a standard deviation of one.
Now we can standardize data pretty easily in the same way.
Actually, we don’t need to write our own function for this;
the scale function in R performs this for us.
So we’re going to take the census data over the numerical attributes,
and we’re going to call the scale function,
and that’s going to take all of the attributes
and center them around their mean,
so that means the mean will become close to zero,
and it’s going to divide them all by the standard deviation,
so their standard deviation becomes one.
So if we run that and then have a look at the mean of this data,
so for example here, we calculate the mean,
you can see that these mean values are very, very close to zero.
That’s 10 to the minus 17 or something like that, very, very small.
And if we look at the standard deviation, similarly, they’re all going to be 1.
So this is now standardized data.
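
A minimal sketch of standardization with R’s built-in scale() function, again on an invented numeric data frame (census_num and its values are assumptions).

```r
# Invented numeric-only stand-in for the census data.
census_num <- data.frame(
  age = c(17, 39, 52, 90),
  capital_gain = c(0, 2174, 15024, 99999),
  hours_per_week = c(1, 40, 60, 99)
)

# By default scale() subtracts each column's mean and divides by its
# standard deviation. The result is a numeric matrix.
census_std <- scale(census_num)

col_means <- colMeans(census_std)      # zero up to floating-point error
col_sds <- apply(census_std, 2, sd)    # one for every column
```

The means come out as tiny values rather than exact zeros because of floating-point rounding, which matches the "10 to the minus 17" seen in the talk.
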
This is a very good thing to do
if you want to use your data in some kind of machine learning algorithm or some kind of clustering.
Let’s imagine now that we want to join some datasets together.
So we’ve standardized our data: everything’s between 0 and 1,
or it’s centered around 0 with a standard deviation of 1,
and we’ve codified some attributes.
What happens if we get other data from other sources?
You can imagine that census data from the US might be a bit useful.
But maybe we want census data from Spain,
or from the UK, or from another country.
Can we join all of these together
to get a bigger, more useful dataset?
Now the thing to think about when you’re doing this
is just to make sure that everything makes sense.
Are the scales the same?
Are they all normalized, or are none of them normalized?
Because otherwise, what you’re going to be doing is adding, you know,
pay between naught and a hundred thousand to something between naught and one,
and nothing makes any sense anymore.
You’re going to wreck your data.
So let’s have a look at this on the census dataset.
We have some Spanish census data in a very similar format
to our census data from the United States.
Let’s have a quick look.
So I’m going to read the CSV file of Spain data.
Let’s remind ourselves of the columns that we had in our census data from the United States.
These are the numerical columns,
so we have age, education number,
capital gain, capital loss, this kind of thing.
Let’s look at the Spanish dataset
to see if we can just join the two together.
So I’m going to run head(Spain),
that’s going to give us the first few rows,
and you can see that
some of the stuff in there is as it was before,
so things like what their level of education is,
whether they work in the private sector or the public sector.
We’re going to need to remove these things
to leave just the numerical attributes.
And the other problem is, if you look carefully,
you’ll see that the capital gain in the Spanish dataset is in euros,
not in dollars.
Now that’s a huge problem.
They’re not massively different, obviously;
they’re on the same order of magnitude.
But we don’t want to be jamming
capital gain in euros next to capital gain in dollars,
because those two scales are not the same.
So what we need to do first
is scale this data using some kind of exchange rate.
So here what we’re going to do is create a new column in Spain.
So given the Spain data frame,
we’re going to say the Spain capital gain is equal to the
euro capital gain times 1.13,
which is the exchange rate we’re going to use.
Now it’s quite important in this kind of situation
not just to look up the exchange rate online.
You’ve got to consider that this might have been collected a while ago.
What was the exchange rate when this data was collected?
These are things you’re going to have to think about.
So let’s run that line,
and let’s do the same thing for the capital loss.
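
The unit conversion might be sketched like this. The data frame spain, its column names, and its values are hypothetical; only the 1.13 rate comes from the talk.

```r
# Hypothetical Spanish data with monetary columns recorded in euros.
spain <- data.frame(
  age = c(30, 45),
  capital_gain_eur = c(1000, 0),
  capital_loss_eur = c(0, 200)
)

# Rate used in the talk. For real data, prefer the rate that applied
# when the data was collected.
eur_to_usd <- 1.13

# New columns in dollars, so they are comparable with the US data.
spain$capital_gain <- spain$capital_gain_eur * eur_to_usd
spain$capital_loss <- spain$capital_loss_eur * eur_to_usd
```

Keeping the rate in a named variable, rather than repeating the literal, makes it easier to swap in a historical rate later.
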
Now we’re going to keep just the numerical attributes of
our census data and of the Spanish data,
and we’re also going to add another column,
that is, what country they come from,
otherwise we’re not going to know.
So we’re going to use the cbind function
to combine the numerical attributes of the census data
and the native country, which in this case will be the United States.
We’re going to do the exact same thing for the Spain data,
which will be basically exactly the same,
except obviously we’re going to have Spain as the native country.
And then we’re going to use the rbind function
to just join those two tables together.
Now that will only work if those two datasets have the exact same attributes.
‘nu_census’ is not found.
What did I do wrong?
So I had a typo.
So let’s join these two together using rbind.
There we go. And so our united dataset now has
the combined observations for the United States and Spain.
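
A sketch of the cbind and rbind steps on two tiny invented data frames. The names num_census, num_spain, native_country, and united are assumptions for illustration.

```r
# Invented numeric-only tables for the two countries, already in the same units.
num_census <- data.frame(age = c(39, 50), capital_gain = c(2174, 0), capital_loss = c(0, 0))
num_spain <- data.frame(age = c(30, 45), capital_gain = c(1130, 0), capital_loss = c(0, 226))

# cbind adds a column; the single country label is recycled down every row.
num_census <- cbind(num_census, native_country = "United States")
num_spain <- cbind(num_spain, native_country = "Spain")

# rbind on data frames needs matching column names, so check before joining.
stopifnot(identical(names(num_census), names(num_spain)))

# Stack the rows of the two tables into one.
united <- rbind(num_census, num_spain)
```
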
Now, what you wouldn’t want to do is just join them together
and leave it at that.
You want to perhaps have a little look at some plots to make sure that
the distributions of the data you’ve just joined together make sense.
For example,
the United States data has a nice broad distribution of different ages.
We want to make sure that the Spanish data has that same distribution.
Otherwise, you’re going to skew your dataset.
So, for example, let’s have a look at roughly whether the levels of capital gain
are approximately the same for both the United States and the Spanish datasets.
So I’m going to use ggplot for this. We’re going to plot a bar chart
where we’ve color-coded the United States and Spain,
and you can see that, broadly speaking,
there’s a lot around zero or less than 50k,
and then there are a few a little bit above.
So that looks, broadly speaking, like the same distribution.
I’m fairly happy with that.
This is going to be a judgement call
when you get your own data.
So I’ll clear the screen,
and then let’s have a look at the next plot.
So the next plot is going to be capital loss versus the native country.
Let’s make sure those distributions are the same.
So it’s plotting there, and broadly speaking, again, yes:
the majority are down at the bottom,
and then there are a few United States ones
and a couple of Spanish ones up at the top as well.
Again, it’s not a disaster.
That’s probably OK.
Finally, let’s have a look at ages by native country.
So if we plot this,
we can see two very, very similar distributions.
You can see that it’s essentially a bell curve,
maybe slightly skewed towards older participants
for the United States, and very, very similar for Spain. This is okay.
If we hypothesized that
capital gain, capital loss and salary
had something to do with your age,
then it would make sense for two datasets that you’re joining together
to have very similar distributions in this regard.
So let’s look at one more dataset, from Denmark.
So it’s the same thing, same format.
We’re going to read the CSV,
and we’re going to have a look at just the top few rows to make sure it’s in the same format,
so that’s using the head function,
and you can see we’ve actually already removed the nominal
and other text attributes from here,
and we’ve just got the numerical ones.
And capital gain and capital loss
are also already in dollars in this dataset,
so we don’t have to perform a conversion.
So we can use rbind to put these two things together,
and now we just need to check the distributions are the same.
So again,
we’re going to plot the age against the native country
and see if these show the same distributions.
And actually you can see this isn’t looking too good.
The United States and the Spanish datasets
have very similar distributions.
The participants, or the people who have been polled, from Denmark are much, much older on average.
This could have an effect on things like capital gain,
so I wouldn’t necessarily feel comfortable just joining this dataset in
without thinking about it a little bit more closely.
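
The talk checks distributions visually with ggplot. As a complementary sketch, the same comparison can be made numerically in base R with per-country summaries. This swaps in a different technique from the one in the video, and the ages below are invented for illustration.

```r
# Invented combined dataset with an age column and a country label.
united <- data.frame(
  age = c(25, 38, 52, 61, 27, 40, 49, 63, 58, 66, 71, 79),
  native_country = rep(c("United States", "Spain", "Denmark"), each = 4)
)

# Split ages by country, then summarize each group.
ages_by_country <- split(united$age, united$native_country)
age_summary <- lapply(ages_by_country, summary)  # min, quartiles, mean, max
age_median <- sapply(ages_by_country, median)    # one median per country

age_median
```

A group whose median sits far from the others, like Denmark in this toy data, is a signal to look more closely before joining it in.
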
Alright, so
whenever you’re joining datasets like this, taking data from different sources,
think carefully, to make sure that it’s fair
and that what you are doing is a reasonable concatenation of datasets.
And actually these are the features
that power Spotify’s recommender system and numerous others.
So we’ve got things like acousticness.
How acoustic does it sound,
from zero to one?
We’ve got instrumentalness.
I’m not convinced that’s a word.
Speechiness.
That is, to what extent is it speech or not speech.
And then things like tempo…


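Translation information

Video overview: Want to learn data analysis? Then data transformation is a lesson you can’t miss!

Transcription: collected from the web
Translator: ericaeureka
Reviewer: 审核员CH

Video source: https://www.youtube.com/watch?v=ms6EV1pG3tc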
