

《了解深度学习》#5 让数据变得惊艳

How to Make Data Amazing - Intro to Deep Learning #5

数据很神圣
Data is sacred.
每个小时
Every hour of every day,
都会有一个新的传感器连接到网络
a new sensor is connected to the web.
每次旅行 每段回忆 每个创造
Every trip, every memory, every creation.
都会促进我们共同建造的网络不停增长
Every discovery joins the ever-growing web we’re building together.
在一个充斥着怀疑与谎言的世界里
Amidst a world of ever-growing skepticism and falsehood,
数据是真理
data is truth.
它透明 可证明
It’s transparent, provable.
大部分数据是无序的原始的数字流
Most of it is unstructured streams of raw numbers
以惊人的速度积累
being amassed at dizzying rates.
但经过智能化
But by applying intelligence to it,
会发现其中重要的规律与联系
we can find the patterns and connections that matter.
我们能找到隐藏在数字中的意义
We can find the meaning hidden in the numbers.
经济学家说 你得到的越多 我得到的就越少
The economist says more for you is less for me.
但情侣知道 你得到的越多 我得到就越多
But the lover knows more for you is more for me too.
当我们将正确的数据可视化
And when we visualize the right data,
就会得到无比珍贵的感受
it gives us that most precious feeling
在艺术与科学的交汇处
at the intersection of art and science.
太妙了
Wonder…
大家好 我是西拉杰
Hello world, it’s Siraj,
今天我们将学习如何对数据集进行预处理
and today, we’re going to learn how to pre-process a dataset.

Yes!
准备数据是机器学习流程中最重要
Preparing data is one of the most important,
也最容易被忽视的步骤之一
yet most overlooked parts of the machine learning pipeline.
很多入门教程
A lot of introductory tutorials
仅仅让你导入已经处理过的数据集
just have you import a preprocessed version of a dataset,
比如手写字母或电影等级
like handwritten characters or movie ratings
一行代码就搞定了
in just a single line of code
但真实世界没这么简单
The real world is not that easy.
当你决定你想解决某个问题
Once you’ve decided what problem you’re trying to solve
或者你有个想要回答的问题
or you have a question that you want the answer to,
就要先找到正确的数据集
it’s time to find the right dataset.
我想知道他们发给你的计划怎么样了
I want to know what happened to the plans they sent you.
我把它们储存在了Microsoft Azure
I stored them on Microsoft Azure.
[窒息的声音]
[Choking noises]
你的深度网络做出的预测
The predictions your deep net makes
只会跟你提供的数据一样准确
are only as good as the data you give it.
垃圾进 垃圾出
Garbage in, garbage out.
因此 要确保数据与你的问题相关
So you want to make sure your data is relevant to your problem.
网络上有大量资源可以找到公开的数据集
There are tons of resources to find publicly available datasets,
在视频描述中可以找到一些链接
and I’ve linked to some in the description.
数据的标准格式是CSV
The de facto standard format for data is CSV.
大多数软件包用该格式处理数据
Most software packages out there deal with data in that format,
你可以容易地将数据转换成CSV格式
and you can convert your data into CSV format just as easily.
我们可以对数据做很多事情
There’s so much we could potentially do to our data
但有三个关键的预处理步骤
but there are three key pre-processing steps for every data set we’ll cover:
清洗 转换和消减
Cleaning, transformation, and reduction.
我们将准备三个数据集
We’re going to look at three different data sets,
并用这些步骤处理它们
and go through these steps for each one
然后就可以将它们输入模型了
to get them ready to be fed into a model.
让我们看看第一个数据集
So, let’s start with our first.
第一个数据集是基于音乐的
The first dataset we’ll use is music-based.
这个数据是从游戏Tag-a-Tune中采集来的
This data was collected from a game called Tag-a-Tune.
游戏中两个玩家听一首歌
In the game, two players listen to a song
并用他们认为相关的流派和乐器标记它
and tag it with genres and instruments that they think are relevant.
当歌曲结束时
When the song is over,
拥有最多正确标签的玩家得一分
the player who had the most correct tags gets a point.
所以我每次都会赢的
So I would win every time!
这个数据集包含25000首带有正确标签的歌曲
Our dataset has 25,000 songs with the correctly labeled tags.
我们要用这些数据训练一个模型
We want to train a model on this data
以便给出一首新歌
so that given a new song,
它将正确地对其进行分类
it will correctly classify its genre.
我们将导入pandas库来帮助解析此数据集
We’ll import pandas to help parse this dataset
然后用read_csv函数将数据存储到
Then the read_csv function will let us store the data
二维的pandas数据结构中
in the two-dimensional pandas data structure
称为DataFrame
known as a DataFrame.
DataFrame很容易修改
Data frames are easily modifiable,
我们将该变量命名为newdata
and we’ll call our variable newdata.
让我们先来探索这个数据集 好吗
Let’s explore this data first, shall we?
我们将使用head函数显示前五行
We’ll display the first five rows using the head function
其中5作为参数
with five as a parameter,
所以每行都有编号作为ID
So basically each row is numbered, as an ID,
标签旁边的1或0
and then either a 1 or a 0 next to a tag
用来标记指定的MP3是否具有该标签
to indicate whether or not the given MP3 has that tag.
看起来很简单
Seems simple enough.
我们可以用info函数来获取更多数据(只有38兆)
We can use the info function to get some more data. (Only 38 megs.)
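The loading and exploring just described can be sketched with pandas. The real Tag-a-Tune CSV and its path aren't shown in the video, so a tiny inline sample with hypothetical tag columns stands in for it:

```python
import io

import pandas as pd

# Tiny stand-in for the Tag-a-Tune CSV; the real file has 25,000 rows
# and many more tag columns (these column names are hypothetical).
csv_text = """clip_id,singing,female_vocals,guitar
0,1,0,1
1,0,1,0
2,1,1,0
3,0,0,1
4,1,0,0
5,0,1,1
"""

# read_csv stores the data in a two-dimensional DataFrame.
newdata = pd.read_csv(io.StringIO(csv_text))

print(newdata.head(5))  # show the first five rows
newdata.info()          # column types, non-null counts, memory usage
```

With a file on disk, `pd.read_csv("path/to/file.csv")` replaces the `StringIO` wrapper.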
对于数据清洗步骤
So for our cleaning step,
还需要做什么吗
is there anything we need to do?
不需要了
Not really.
每个标签都是一个简单的二进制值
Each tag is a simple binary label,
标签是一致的
it’s consistent
幸运的是 我们的数据没有空值
and luckily our data does not have empty values.
但我的灵魂很空虚
(But my soul does)
我们可以直接进入转换步骤
We can move right on to the transformation step.
我们可以对这些数据进行哪些修改
What are some modifications we can make to this data
从而使模型更容易理解呢
that will make it easier for our model to understand?
注意很多标签听起来非常相似
Well, notice how a lot of the tags are pretty similar sounding.
比如歌声 女声
Like, you know, singing, female vocals…
我们可以将这些特征统称为女性
We can generalize these features into one feature called female.
让我们来创建一个二维同义词列表
Let’s create a two-dimensional list of synonyms that we find in our data,
然后将它们合并
then we can merge them
再删除除第一列的所有其他列
and drop all the other columns, except for the first one.
对于矩阵中的每个同义词列表
For each synonym list in our matrix,
让我们获取每个特征中的最大值
let’s get the max values from each of the features
并将它们全部添加到DataFrame中的第一个同义词
and add them all to our first synonym in our data frame object,
这将有效地将值合并为一列
which will effectively merge the values into one column
然后从DataFrame中删除其余的特征
Then we’ll drop the rest of the features from the data frame.
现在我们有了更笼统的特征
Now we’ve got more generalized features.
下一步是消减
Next, for the reduction step,
我们能删除哪些不必要的数据呢
what can we remove from this data that’s not necessary?
全部数据都是可靠的
Everything seems pretty solid
让我们把数据放入训练 校验和测试数据集吧
so let’s go ahead and put it into training, validation and testing sets
这样就可以把这些数据集提供给模型了
that we can feed into our model.
注意在这个例子中
Notice how in this example
我没有思考哪些特征该用哪些特征不该用
I’m not thinking about which features to use and which not to.
在深度学习之前
Before deep learning,
我们需要挑选正确的特征提供给模型
we had to pick the right features to use to feed our model.
但深度神经网络能从我们提供的任何特征中学习高层特征
But deep neural nets learn high-level features from whatever features we give them.
它可以自己决定哪些数据是跟问题相关的
It decides for itself what is relevant to the problem from a dataset.
结构工程是新的特征工程
Architecture engineering is the new feature engineering.
我们将用的第二个数据集是一个网络连接的集合
The second dataset we’ll use is a collection of network connections,
网络连接被标记为正常或异常
either labeled normal or abnormal.
异常的连接来自于入侵者
The abnormal connections are intruders trying to break in.
根据其他特征 我们想要将一个连接分类
We want to be able to classify a connection given the set of other features.
这些数据看上去很密集
When we look at this data it seems pretty dense.
没有空缺的值 也没有异常的值
No missing values, nothing really jumps out as an outlier.
让我们跳过清洗步骤 直接进入转换步骤
So let’s skip the cleaning step and move right on to transforming it.
我们的数字特征在不同的单位上
Our numerical features areall operating on different scales,
因此我们需要将它们统一
so we should normalize them
以确保我们的模型能相同地处理每个特征
to ensure each feature is treated equally by our model.
把数据存储到pandas的DataFrame后
After storing our data into a pandas data frame,
scikit-learn库有一个方便的叫做StandardScaler的类
scikit-learn has a handy class called StandardScaler,
我们将引入它然后初始化
which we’ll import then initialize.
然后 我们就可以继续消减步骤了
After that, we’re ready to move on to our reduction step.
我们有很多的特征
We’ve got a lot of features
很多特征很有可能是高度相关的
and there are probably a lot that are highly correlated.
我们可以用一个叫做降维的技巧
We could use a technique called dimensionality reduction
来减少特征的数量
to reduce the number of features we have.
它也能让数据在二维和三维上可视化
This will also let us visualize our data in 2D or 3D space.
这不一定意味着我们的模型会更精确
This doesn’t mean that our model will be more accurate necessarily,
只是数据变得更易读
just that our data is easier to read.
应用这种技巧的其中一种方法叫做PCA
One method of doing this is called PCA,
是Porsche Club of America的缩写
which stands for Porsche Club of Amer…
等一下 弄错了
Wait, wrong definition!
主成分分析法
Principal component analysis.
我的数据有很多特征
My data’s got so many features,
把它们合并为三个特征
so squash’em into three like little creatures.
首先将数据标准化
First I’ll normalize,
然后将它们关联矩阵化
then I’ll correlation matricize.
找出特征向量和它们的值
Pull eigenvectors and values out of its eyes
将它们排序
Sort them.
我需要多少个维度呢
How many dimensions do I want?
我将选出在最前面的特征向量
I’ll select that many eigens up front.(Yeah!)
在它们的基础上做一个映射
Make a projection matrix from’em
用它把我的数据转化为3D
and use it to turn my data three dimensional
画出它们以便评判
Plot ’em, so I can judge ’em.
让我再总结一下流程
So let me summarize this process again.
假设我们有四个特征
Let’s say we had four features
我们想用PCA把它们消减为两个
and we wanted to reduce them to just two using PCA.
需要五步
There are five steps to this.
第一步是将储存在变量中的数据标准化
The first is to normalize the data once we have it stored in a variable.
然后我们需要计算一个协方差矩阵
Then we want to compute a covariance matrix.
为了构建它
To construct this,
我们计算每两个特征的协方差
we compute the covariance between each feature with every other feature.
我们用特征矩阵减去平均值
So we subtract the mean from the feature matrix,
计算转置矩阵
calculate the transpose
然后乘以特征矩阵和平均值的差
and multiply it by the feature matrix minus the mean.
最后我们用这个值除以样本数量减一
Then we take that whole value and divide it by the number of samples minus one.
这样我们就得到了协方差矩阵
This gives us our covariance matrix.
接下来我们对该协方差矩阵进行特征分解
Next we’ll perform eigendecomposition on it
从而得到特征向量和特征值
to get the eigenvectors and eigenvalues.
特征 这个词很有趣 不是么
Eigen — isn’t it such a fun word, wouldn’t you say?
特征向量是一个数据集的主要成分
Eigenvectors are the principal components of a dataset.
它给了我们转换的方向
They give us the directions along which our transformation acts.
特征值给了我们每个方向的大小
The eigenvalues give us the magnitude of each.
将特征向量和特征值倒序排列
We’ll sort both in descending order,
然后创建一个矩阵
then create a matrix out of them.
我们用点乘来转换原有的特征向量
We’ll use this matrix to transform our original feature matrix via the dot product.
然后画出数据的二维图像
We could then plot our data into 2D space
再用这些主要的成分取代很多特征
and use these principal components to replace our many features.
让我们来再看一组数据
Let’s look at one more data set.
这一次是从纽约到巴黎的往返机票价格
This time for airline prices for flights between New York and Paris.
我们想仅仅从出发日期预测机票价格
We want to predict the ticket price from just the departure date.
已知的信息有出发和到达日期 机场
We’ve got departure and arrival dates, airports,
和出发日期前220天的航班价格
and flight prices of 220 days before departure.
注意在数据中我们有好几个空缺的值
Notice how we’ve got quite a few missing values in our data,
所以对于清洗步骤来说
so for our cleaning step,
我们可以删除这些空值
we could remove these values,
用0代替它们
fill them with zeros,
或用平均价格代替它们
fill them with the average price across all the days,
或用一个学习算法来预测它们
or try to predict them using a learning algorithm.
让我们用平均函数计算每天的平均价格吧
Let’s go ahead and calculate the average price for each row across all days using the mean function,
然后遍历数据
and then we’ll iterate through the data,
如果它为空值
and if it’s null,
我们就用平均价格代替它
we’ll replace it with the mean price.
然后我们可以让数据变得平滑
Then we can smooth our data.
这意味着我们要找到异常值并删除它们
That means finding outliers in it that we can remove.
我们可以对某些值运行聚类或回归算法
To find these, we could run clustering or regression algorithms on certain values
来找到这些异常值
to find the outliers,
然后删除它们
and then remove them.
或者挑出并删除它们
Or just remove them by eye.
因为我们的数据集很小
Since our dataset is small,
让我们用后面的方法吧
let’s do the latter.
不需要消减我们的数据
No need to reduce our data.
这个看起来是一个好的数据集
This seems like a good set.
让我们把它拆开
Let’s break it down.
预处理数据集分三步
There are three steps to preprocessing a dataset.
清洗 转化 和 消减
Cleaning, transformation, and reduction.
深度学习能从数据中学习相关的特征
Deep learning learns the relevant features from our data
结构工程是新的特征工程
so architecture engineering is the new feature engineering.
主成分分析是一种流行的降维技巧
And principal component analysis is a popular dimensionality reduction technique
它可以用scikit-learn来实现
that can be implemented with scikit-learn.
上个视频的代码挑战的获胜者是Charles David Blot
The winner of the coding challenge from the last video is Charles David Blot.
Charles David只用了numpy来建造了一个三层的神经网络
Charles David used just numpy to build a three-layer neural net,
它可以预测地震
capable of predicting an earthquake,
他用随机搜索的策略
and he used the random search strategy
来为他的模型找到最佳的超参数
to find the optimal hyperparameters for his model.
本周最佳
Wizard of the week.
第二名是Siby-Jack Grove
And the runner-up is Siby-Jack Grove.
他用TensorFlow来预测 只用了三个输入
He used TensorFlow for his prediction using just three inputs.
这个视频的代码挑战
The coding challenge for this video
是用一个约会的数据集
is to use a dating dataset
基于一个人的性格特征 来预测他是否能找到一个匹配的人
to predict if someone gets a match based on their personality traits.
详情在readme文件中
Details are in the readme.
请把你的GitHub留在评论区
Post your GitHub link in the comments.
我将在下一个视频中宣布获胜者
and I’ll announce the winner next video.
如果你想看到更多类似的视频 欢迎订阅
Please subscribe if you want to see more videos like this.
看看这个相关的视频
Check out this related video,
现在我要去预测玫瑰是否闻起来像便便啦
and for now, I gotta predict if roses really smell like poo poo poo.
多谢观看
So, thanks for watching.


Translation credits

Video: 《了解深度学习》#5 让数据变得惊艳 (How to Make Data Amazing - Intro to Deep Learning #5)
Transcript: collected from the web
Translator: 鹿琳
Reviewer: 审核员YT
Video source: https://www.youtube.com/watch?v=koiTTim4M-s
