Data Analysis 5: Data Reduction - Computerphile

Let's imagine that you work for a major streaming media provider. So you have, I don't know, some 100 million subscribers, and you've got, I don't know, ten thousand videos on your site, or many more audio files. For each user you're going to have collected information on what they've watched, when they've watched it, how long they've watched it for, whether they went from this one to this one. Did that work? Was that good for them? So maybe you've got 30,000 data points per user. We're now talking about trillions of data points, and your job is to try and predict what someone wants to watch or listen to next. Best of luck.

<05 - Data Reduction>
So we've cleaned the data, we've transformed our data, everything's on the same scale, and we've joined datasets together. The problem is, because we've joined datasets together, perhaps our dataset has got quite large, or maybe we just work for a company that has a lot of data. Certainly the general consensus these days is to collect as much data as you can; this isn't always a good idea. What we want, remember, is the smallest, most compact and useful dataset we can get, otherwise you're just going to be wasting CPU hours or GPU hours training on it, wasting time. We want to get to the knowledge as quickly as possible, and if you can do that with a small amount of data, that's going to be great.
So we've got quite an interesting dataset to look at today, based on music. It's quite common these days, when you're building something like a streaming service, for example Spotify, that you might want to have a recommender system. This is an idea where you've maybe clustered people who are similar in their tastes: you know what kind of music they're listening to, and you know the attributes of that music. And if you know that, you can say, well, this person likes high-tempo music, so maybe they'd like this track as well. This is how playlists are generated. One of the problems is that you're going to have to produce descriptions of the audio, on things like tempo and how upbeat the tracks are, in order to machine learn on this kind of system. And that's what this dataset is about.
So we've collected a dataset here today with lots and lots of metadata on music tracks. These are freely available tracks and freely available data; we'll put a link in the description if you want to have a look at it yourself. I've cleaned it up a bit already, because obviously I've been through the process of cleaning and transforming my data. So we're going to load this now. It takes quite a long time to do, because there are quite a lot of attributes and quite a lot of instances.

[music]

It's loaded, right? How much data is this?
Well, we've got 13,500 observations, that's instances, and we've got seven hundred and sixty-two attributes. Another way of putting this, in machine learning parlance, is that we've got 13,000 instances and 760 features. Now these features are a combination of things, so let's have a quick look at the columns, so we can see what this dataset is about. So, `names(music_all)`: we've got some 760 features or attributes, and you can see there's a lot of slightly meaningless text here. But if we look at the top, you'll see some actual things that may be familiar to us. We've got the track ID, the album ID, the genre. Genre is an interesting one, because maybe we can start to use some of these audio descriptions to predict what genre a piece of music is, or something like that. There are things like the track number and the track duration, and then we get onto the actual audio description features. Now these have been generated by two different libraries. The first is called Librosa, which is a publicly available library for taking an mp3 and calculating musical attributes of it.
What we're trying to do here is represent our data in terms of attributes. An mp3 file is not an attribute; it's a lot of data. So can we summarize it in some way? Can we calculate, by looking at the mp3,

[music]

what the tempo is, what the amplitude is, how loud the track is, these kinds of things? This is the kind of thing we're measuring, and a lot of these features are going to go into a lot of detail, down at kind of a waveform level. So we have the Librosa features first, and then if we scroll down, after a while we get to some Echo Nest features. Echo Nest is a company that produces very interesting features on music. Actually, these are the features that power Spotify's recommender system, and numerous others. We've got things like acousticness: how acoustic does it sound. We've got instrumentalness; I'm not convinced that's a word. Speechiness: to what extent is it speech or not speech. And then things like tempo, how fast it is, and valence, how happy it sounds. A track with a valence of zero would be quite sad, I guess, and a track with a valence of one would be really happy and upbeat. And then of course we've got a load of features I've labelled "temporal" here, and these are going to be based on the actual music data themselves.
Often when we talk about data reduction, what we're actually using is dimensionality reduction. One way of thinking about it: as we started, we've been looking at things like attributes, and asking what the mean or standard deviation of some attribute of our data is. But actually, when we start to talk about clustering and machine learning, we're going to talk a little bit more about dimensions. In many ways, the number of attributes is the number of dimensions; it's just another term for the same thing, but certainly from a machine learning background we refer to a lot of these things as dimensions. So you can imagine, if you've got some data here, you've got your instances down here and your attributes across here. In the case of our music data, we've got each song: this is song one, this is song two, song three, and then all the attributes, like the Echo Nest attributes, the tempo and things like this. These are all dimensions in which this data can vary: two songs can be different in the first dimension, which is the track ID, but they can also be different down here in this dimension, which is the tempo. When we say some data is seven hundred dimensional, what that actually means is that it has seven hundred different ways, or different attributes, in which it can vary.
And you can imagine, first of all, that this is going to get quite big quite quickly; seven hundred attributes seems like a lot to me. And depending on what algorithm you're running, it can get quite slow when you're running on this kind of size of data. You can imagine this is a relatively small dataset compared to what Spotify might deal with on a daily basis. But another way to think about this data is as points in a space. We have some 700 different attributes that can vary, and when we take a specific track, it sits somewhere in this space. If we were looking at it in just two dimensions, track one might be over here, track two over here and track three over here. In three dimensions, track four might be back at the back here. You can imagine that the more dimensions we add, the further spread out these things are going to get. But we can still do all the same things in 700 dimensions that we can in three dimensions; it just takes a little bit longer.
So one of the problems is that some things like machine learning don't like to have too many dimensions. Things like linear regression can get quite slow if you have tens of thousands of attributes or dimensions. Remember that perhaps the default response of anyone collecting data is just to collect it all and worry about it later; well, this is the time when you have to worry about it. What we're trying to do is remove any redundant variables. If you've got two attributes of your music, like tempo and valence, that turn out to be exactly the same, why are we using both and making our problem a little bit harder? Now, in actual fact the Echo Nest features are pretty good; they don't tend to correlate that strongly. But you might find, where you've collected some data on a big scale, that actually a lot of the variables are very, very similar all the time, and you can just remove some of them, or combine some of them together, and make your problem a little bit easier.
So let's look at this on the music dataset and see what we can do. The first thing we can do is remove duplicates. It sounds like an obvious one, and perhaps one that we could also do during cleaning, but exactly when you do it doesn't really matter, as long as you're paying attention. We're going to say `music_all = unique(music_all)`, and what that's going to do is find any duplicate rows and remove them. The number of rows we've got will drop by some amount. Let's see. Thinking.

[music]

This is where you need a timer. Actually, this is quite a slow process; you've got to consider that we're going to look through every single row and try to find any other rows that match. Okay, so this has removed about 40 rows, which means we had some duplicate tracks. You can imagine that things might get accidentally added to the database twice, or maybe two tracks are actually identical because they were released multiple times, or something like this.
Now, the `unique` function actually finds rows that are exactly the same for every single attribute, or every single dimension. Of course, in practice you might find that you have two versions of the same track which differ by one second; they might have slightly different attributes, though hopefully they'll be very, very similar. So what we could also do is have a threshold where we say these are too similar, they're the same thing: the name is the same, the artist is the same, and the audio descriptors are very, very similar, so maybe we should just remove one of them. That's the other thing you could do.
Just for demonstration, we're going to focus on just a few of the genres in this dataset, to make things a little bit clearer for visualization. We're going to select just the classical, jazz, pop and spoken-word genres, because these have a good distribution of different amounts in the dataset. So we're going to run that. We're creating a list of genres, and then we're going to say music is `music_all` wherever the genre is in that list of genres we just produced, and that's going to produce a much smaller dataset of 1,600 observations, with the same number of attributes or dimensions. Now, normally you would keep most of your data in; this is just for a demonstration. But removing genres that aren't useful to you for your experiment is a perfectly reasonable way of reducing your data size, if that's a problem. Assuming they've been labelled right in the first place. Right, that's on someone else. That's someone else's job.
Let's imagine that 1,600 is still too many. Actually, computers are getting pretty quick, and maybe 1,600 observations is fine, but perhaps we want to remove some more. The first thing we could do is just chop the data in half and keep about half of it. So let's try that first: we're going to say the first music, that's the first few rows of our music, is rows 1 to 835 and all the columns. We're going to run that, and that's even smaller, so we can start to whittle down our data. This is not necessarily a good idea. We're assuming here that our genres are equally, randomly sampled around our dataset. That might not be true: you might have all the rock first, and then all the pop, or something like that. If you take the first few rows, you're just going to get all the rock; depending on what you like, that might not be for you.
So let's plot the genres in the normal dataset. You can see that we've got very little spoken word, but it is there, and we have some classical, international, jazz and pop in roughly the same amounts. If we plot after we've selected the first half, you can see we've lost two of the genres: we only have classical, international and jazz, and there's hardly any jazz. That's not a good idea, so don't do that unless you know that your data is randomized. This is not giving us a good representation of genres; if we wanted to predict genre based on the musical features, for example, cutting out half the genres seems like an unwise decision.
So a better thing to do would be to sample randomly from the dataset. We're going to use the `sample` function to give us 835 random indices into this data, and then use those to index our music data frame instead. That's this line here. Hopefully this will give us a better distribution. If we plot the original again, it looks like this, and you can see we've got a broad distribution; and if we plot the randomized version, you can see we've still got some spoken word. It's actually gone up slightly, but the distributions are broadly the same. So this has worked exactly how we want. How you select your data, if you're trying to make it a little bit smaller, is very, very important.
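The head-slice pitfall and the random-sample fix can both be sketched in a few lines (the video does this in R with `sample()`; this Python toy uses a deliberately sorted stand-in dataset):

```python
import random

random.seed(0)  # fixed seed just to make the demo reproducible

# 1,670 rows, all "rock" first then all "pop" -- sorted, not shuffled.
data = [{"genre": "rock"}] * 835 + [{"genre": "pop"}] * 835

# Taking the first half keeps only rock: the pitfall.
head = data[:835]
assert all(row["genre"] == "rock" for row in head)

# Drawing 835 random row indices (like R's sample()) keeps both genres
# in roughly their original proportions.
idx = random.sample(range(len(data)), 835)
subset = [data[i] for i in idx]
kept = {row["genre"] for row in subset}
print(kept)  # both genres survive
```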
And consider: obviously we only had 1,600 rows here, and even the whole dataset is only 13,500 rows, but you can imagine that you might have tens of millions of rows, and you've got to think about this before you start just getting rid of them completely. Randomized sampling is a perfectly good way of selecting your data. Obviously, it has a risk: if the distributions of your genres are a little bit off, and maybe you haven't got very much of a certain genre, you can't guarantee that the distributions are going to be the same on the way out. And if you're trying to predict genre, that's going to be a problem.
So perhaps the best approach is stratified sampling. This is where we try to maintain the distribution of our classes, in this case genre. Say we had 50% rock, 30% pop and 20% spoken word: we want to maintain that distribution on the way out, even if we only sample 50% of the rows. This is a little bit more complicated in R, but it can be done, and it's a good approach if you want to make absolutely sure the distributions of your sample data are the same as in your original data.
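One way to implement the idea is to sample each class separately at the same fraction. A small Python sketch, using the 50/30/20 rock/pop/spoken example from the video (the helper name and row layout are my own):

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, frac, seed=0):
    """Sample `frac` of the rows from each class separately, so the class
    proportions are preserved on the way out."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    out = []
    for members in groups.values():
        k = max(1, round(len(members) * frac))
        out.extend(rng.sample(members, k))
    return out

# 50% rock, 30% pop, 20% spoken, as in the example.
data = ([{"genre": "rock"}] * 500 + [{"genre": "pop"}] * 300
        + [{"genre": "spoken"}] * 200)
half = stratified_sample(data, "genre", 0.5)
counts = {g: sum(r["genre"] == g for r in half)
          for g in ("rock", "pop", "spoken")}
print(counts)  # {'rock': 250, 'pop': 150, 'spoken': 100}
```

Unlike plain random sampling, the 50/30/20 split is guaranteed, not just likely.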
We've just looked at some ways we can reduce the size of our dataset in terms of the number of instances, or the number of rows. Can we also make the number of dimensions, or the number of attributes, smaller? Because that's often one of the problems. The answer is yes, and there are lots of different ways we can do this, some more powerful and useful than others. One of them is something called correlation analysis. A correlation between two attributes basically tells us that when one of them increases, the other one either increases or decreases, in general, in relation to it. So you might have some data like this, with attribute one on one axis and attribute two on the other, and the points look something like this. These are the data points for all of our different data; obviously we've got a lot of them, and you can see that, roughly speaking, they kind of increase in this direction here. Now it might be that this correlation is very, very strong, so that attribute two is, more or less, a copy of attribute one. Maybe it doesn't make sense to have attribute two in our dataset; maybe we can remove it without too much of a problem. So what we can do is correlation analysis, where we pitch all of the attributes against all of the other attributes, look for high correlations, and decide, ourselves, whether to remove them. Sometimes it's useful just to keep everything in and try not to remove things too early; but on the other hand, if you've got a huge amount of data and your correlations are very high, this could be one way of doing it.
Another option is something called forward or backward attribute selection. The idea is that maybe we have a machine learning model or clustering algorithm in mind; we can measure its performance, and then remove features and see if the performance remains the same, because if it does, maybe we didn't need those features. So we could train our model on, let's say, a 720-dimensional dataset, get a certain level of accuracy, and record it. Then we could try again after removing one of the dimensions, training on seven hundred and nineteen, and maybe the accuracy is exactly the same, in which case we can say, well, we didn't really need that dimension at all, and we can start to whittle down our data this way. The other option is forward attribute selection. This is where we literally train our machine learning model on just one of the attributes, see what our accuracy is, and keep adding attributes in and retraining until our performance plateaus, and we can say: you know what, we're not gaining anything now by adding more attributes. Obviously, there's the question of which order you try them in. Usually randomly. So for backward attribute selection, for example, you would train on all the data, take one attribute out at random, and if your performance stays the same, you leave it out. If your performance gets much worse, you put it back in, don't try that one again, and try a different one. You slowly start to take dimensions away and hopefully whittle down your data.
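The backward-selection loop just described can be sketched as follows. The scoring function here is a toy stand-in for model accuracy (only two features "matter"); in practice you would plug in a real train-and-evaluate step, and the feature names are hypothetical:

```python
import random

def backward_selection(features, score, tol=1e-9, seed=0):
    """Backward attribute selection: starting from all features, try dropping
    each one (in random order); keep the drop only if the score doesn't fall."""
    rng = random.Random(seed)
    kept = list(features)
    baseline = score(kept)
    order = list(features)
    rng.shuffle(order)          # "which order do you try? usually randomly"
    for f in order:
        trial = [x for x in kept if x != f]
        if score(trial) >= baseline - tol:  # performance unchanged: leave it out
            kept = trial
        # otherwise put it back (we simply never commit the drop)
    return kept

# Toy score: only "tempo" and "valence" contribute, so every other
# dimension can be removed without hurting performance.
useful = {"tempo", "valence"}
score = lambda feats: len(useful & set(feats))

features = ["tempo", "valence", "kurt1", "kurt2", "kurt3"]
print(sorted(backward_selection(features, score)))  # ['tempo', 'valence']
```

Forward selection is the mirror image: start from one feature and keep adding until the score plateaus.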
Let's have a quick look at correlation analysis on this dataset. You might imagine that, since we're calculating features from the mp3 with both Librosa and Echo Nest, maybe they're quite similar a lot of the time, and maybe we can remove some of them. Let's have a quick look. We're just going to focus on one set of Librosa features, for simplicity. We're going to select only the attributes that contain this chroma kurtosis field, which is one of the attributes you can calculate using Librosa. So I'm going to run that. We're going to rename them, just for simplicity, to kurt1, kurt2, kurt3 and so on. Then we're going to calculate a correlation matrix of each of these different features against each other, like this. Finally, we're going to plot it and see what it looks like. Hopefully we can find some good correlations, so we have candidates for removing a few of these dimensions, if they're redundant.
And it's not too bad. You can see that we've got, for example, kurt7 here: index 7 is fairly similar to 8, with a correlation of 0.65. Maybe that means we could remove one of those two. This one here is 0.59, and we've got a 0.48 over here. These are fairly high correlations. If you're really stretched for CPU time, or you're worried about the size of your dataset, this is the kind of thing you could do to remove them. Of course, whether 0.65 is a strong enough correlation that you want to completely remove one of these dimensions is really up to you, and it's going to depend on your situation. One of the reasons the correlations aren't quite as high as you might expect is that these libraries have been designed with this in mind. If Echo Nest just produced 200 features that were exactly the same, it wouldn't be very useful for picking playlists. So they've produced 200 features that are widely different, and those aren't necessarily going to correlate all the time. That's the whole point, and that's a really useful property of this data.
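The correlation-matrix pruning step can be sketched in pure Python. This is a simplified version of what the video does in R: compute pairwise Pearson correlations, then mark the second member of any pair above a threshold for removal (column names and values are made up; the 0.9 threshold is arbitrary, and as the video says, where you set it is up to you):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def drop_correlated(columns, threshold=0.9):
    """Scan all attribute pairs; when |r| exceeds the threshold,
    drop the second column of the pair."""
    names = list(columns)
    dropped = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if abs(pearson(columns[a], columns[b])) > threshold:
                dropped.add(b)
    return [n for n in names if n not in dropped]

# kurt2 is almost a copy of kurt1; kurt3 is unrelated.
cols = {
    "kurt1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "kurt2": [1.1, 2.0, 3.1, 4.0, 5.1],
    "kurt3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
print(drop_correlated(cols))  # ['kurt1', 'kurt3']
```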
We've looked at some ways we can try to make our dataset a little bit smaller. Remember, our ultimate goal is the smallest, most useful dataset we can get our hands on; then we can put that into machine learning or clustering and really extract some knowledge. The problem is that what we might do, based on correlation analysis or forward and backward attribute selection, is just deleting data. Maybe the correlation wasn't one; the attribute wasn't completely redundant. Do we actually want to completely remove this data? Is there another way we can transform our data, to make more informed, more effective decisions about what we remove? That's PCA, or principal component analysis. At the moment we're just fitting one line through our two-dimensional data; there will be more principal components later. What we want to do is pick the direction through this data, however many attributes it has, that has the most spread. So how do we measure this? Well, quite simply…
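The transcript ends as PCA is introduced; as a taste of the idea, here is a minimal pure-Python sketch of finding that direction of most spread for 2-D data, using power iteration on the covariance matrix (the data points and function name are my own, and real code would use a library routine):

```python
def first_principal_component(points, iters=200):
    """Return the unit direction of greatest spread through 2-D data,
    via power iteration on the 2x2 covariance matrix."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centred = [(x - mx, y - my) for x, y in points]
    # Covariance matrix entries.
    sxx = sum(x * x for x, _ in centred) / n
    syy = sum(y * y for _, y in centred) / n
    sxy = sum(x * y for x, y in centred) / n
    v = (1.0, 1.0)
    for _ in range(iters):
        # Multiply by the covariance matrix, then renormalize.
        w = (sxx * v[0] + sxy * v[1], sxy * v[0] + syy * v[1])
        norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
        v = (w[0] / norm, w[1] / norm)
    return v

# Points spread mostly along the line y = x:
pts = [(0, 0), (1, 1.1), (2, 1.9), (3, 3.05), (4, 4.0)]
vx, vy = first_principal_component(pts)
print(vx, vy)  # roughly (0.71, 0.71), i.e. the diagonal
```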

Video summary

Welcome to part five of the Data Analysis series, on data reduction. In this video we discuss methods for reducing and pruning data, and the caveats involved. Don't miss it!

Transcriber: collected from the web
Translator: ericaeureka
Reviewer: CH

Video source

https://www.youtube.com/watch?v=8k56bvhXw4s