【机器学习入门】#9 TensorFlow的特征工程

Intro to Feature Engineering with TensorFlow - Machine Learning Recipes #9

【播放音乐中】
[MUSIC PLAYING]
乔许·戈登:大家好
JOSH GORDON: Hey, everyone.
欢迎回来
Welcome back.
特征就是你将关于这个世界的知识
Features are the way you represent your knowledge
呈现给分类器的一种方式
about the world for the classifier,
今天我来带你们了解一些
and today I’ll walk you through techniques
可以用来表示特征的技术 以及
you can use to represent your features and utilities
TensorFlow 提供的一些实用工具
TensorFlow provides to help.
美国的人口普查数据将作为示例数据集
We’ll use a dataset from the US census as an example,
我们的目标就是通过一些特点比如
and the goal is to predict if someone’s income is
年龄和职业 来预测某人的收入是否超过5万美元
greater than $50,000 based on attributes like their age and occupation.
数据集的文件格式是 CSV
The dataset is stored as a CSV file,
我们以前讲过怎样直接将列的原始值作为特征
and previously we’ve seen how to use the column values directly as features.
今天我们来试试通过特征工程
But today we’ll use feature engineering
来将这些值转化成更好用的形式
to transform them into a more useful representation.
讲解过程中 我会用一个叫做 Facets 的工具
As we go, I’ll visualize what these transformations do
来可视化转换过程 你们可以从视频介绍中
using a tool called Facets, and you can find a link to it
找到这个工具的链接
in the description.
此外还有训练这个 TensorFlow 估计器的
You’ll also find complete code to train a TensorFlow estimator
全部源代码
on this dataset.
好的 我们开始吧
OK, let’s get started.
【数值型属性】
[NUMERIC ATTRIBUTES]
首先从数值型属性开始 比如年龄
Let’s begin with a numeric attribute like age,
想一想我们怎么利用年龄来预测收入
and think about how we can use it to predict income.
如果你想过这个问题 那么
Now if you think about how age correlates with income,
第一反应一般是随着年龄增加
our first intuition is that as age increases,
收入也会增加
usually so does income.
应用这一想法的最简单的方式
And the simplest way to represent this
就是将原始值
would just be to take the raw numeric value
直接当作特征值
and use that as a feature.
这里我们来构建一个特征列表
Here we’re building a list of features
用这些特征来训练模型 列表中每一项
we use to train the model, and each of these
都放着一个特征列
is stored as a feature column.
特征列里包含 CSV 文件中的数据
This contains data about the column from the CSV file
以及它们的表示方式
and how to represent it.
这里我们要写一个特征 直接用年龄列的原始值
Here we’ll write a feature that just uses the raw value of age,
这个字符串对应着 CSV 文件的一列
and this string corresponds to a column in the CSV file.
那这种方式会有什么问题吗?
Now what can go wrong with this approach?
好 如果我们再仔细考虑年龄这个数据
Well, if we think more closely about age,
我们会发现收入和它的关系并不是线性的
we realize it’s not in a linear relationship with income.
曲线可能长成这样
The curve might look something like this.
孩童时期是平的 然后到工作年龄开始增长
It’s flat for children, then increases during working age,
然后在退休年龄附近开始下降
and decreases during retirement.
如果是线性分类器
A linear classifier, for example,
就不能捕捉到这种关系
is unable to capture this relationship.
因为它只会给每个特征一个权重值
That’s because it learns a single weight for each feature.
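A minimal sketch of that limitation, in plain Python rather than TensorFlow. The function names, the three sample ages, and the weights are made up for illustration. With a single weight on raw age, the score can only rise or fall steadily, so it can never be highest in the middle of the age range.

```python
def linear_score(age, weight, bias=0.0):
    """Score from a linear model that sees only the raw age value."""
    return weight * age + bias


def peaks_in_middle(weight, bias=0.0, ages=(10, 40, 80)):
    """Return True if the middle age scores strictly higher than both ends."""
    young, middle, old = (linear_score(a, weight, bias) for a in ages)
    return middle > young and middle > old


# Hypothetical weights: negative, zero, and positive.
for weight in (-0.5, 0.0, 0.5):
    print(weight, peaks_in_middle(weight))
```

Whatever weight is tried, `peaks_in_middle` stays False, which is the "flat, then up, then down" curve being out of reach for a single weight.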
【分区法】
[BUCKETING]
为了能让分类器更好的处理这种情况
To make it easier for the classifier, one thing we can do
我们需要将特征分区
is bucket the feature.
将一个数值型特征转化为
And bucketing transforms a numeric feature
几组类属型新特征
into several categorical ones based
分隔方法就是看数值落在哪个区间内
on the range it falls into, and each of these new features
每个新特征都表示某个人的年龄落在哪个年龄段
indicates whether a person’s age falls into that range.
这样线性模型就可以通过对于每个年龄段的
And now a linear model can capture the relationship
不同权重值 来捕捉到这种关系了
by learning different weights for each bucket.
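The idea can be sketched in plain Python. This is an illustration of bucketing, not TensorFlow's implementation; the boundaries and the per-bucket weights below are hypothetical values chosen only to show the shape of the curve.

```python
from bisect import bisect_right

# Hypothetical bucket boundaries; in practice, pick them using
# your knowledge of the problem.
AGE_BOUNDARIES = [18, 25, 35, 45, 55, 65]


def age_bucket(age, boundaries=AGE_BOUNDARIES):
    """Return the index of the bucket that age falls into.

    Bucket 0 is age < 18, bucket 1 is 18 <= age < 25, and so on.
    The last bucket is age >= 65.
    """
    return bisect_right(boundaries, age)


def bucketized_age(age):
    """One true/false (1.0/0.0) feature per bucket."""
    index = age_bucket(age)
    return [1.0 if i == index else 0.0 for i in range(len(AGE_BOUNDARIES) + 1)]


def linear_score(features, weights, bias=0.0):
    """A linear model: one weight per feature."""
    return bias + sum(w * x for w, x in zip(weights, features))


# Made-up weights: low for children, higher in working age, lower at retirement.
BUCKET_WEIGHTS = [-2.0, -1.0, 0.5, 1.0, 1.2, 0.8, 0.1]

for age in (10, 40, 70):
    print(age, age_bucket(age), linear_score(bucketized_age(age), BUCKET_WEIGHTS))
```

Because each bucket has its own weight, the same linear model can now score a 40-year-old above both a 10-year-old and a 70-year-old.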
我们看看在 Facets 中看起来是什么样的
Let’s see how this looks in Facets.
很方便的一点是 Facets 有一个在线演示
Conveniently, there’s a live demo
可以在浏览器中运行 这里我预加载了人口普查数据
that runs in the browser with our census data preloaded,
CSV 文件中的每个数据
and each individual from the CSV is visualized
都用不同颜色的点表示在图像中了
as a dot colored by income.
如果你点击某个点 你就可以看到这个人的详细数据
If you click on a dot, you can see stats about the person.
这就是分区表示年龄的方法
Now let’s bucket by age, and you can
你可以通过调整分段数来调整颗粒度
adjust the number of buckets to make it more or less granular.
怎么选择分段数量完全取决于你
How you choose the number of buckets is up to you,
理想状态是结合问题利用你的知识
and ideally, you’d want to use your knowledge of the problem
来将这参数调整到最佳
to do this well.
在 TensorFlow 中 我们可以通过打包设置
In TensorFlow, we can create a bucketized feature
来将 CSV 中的数值型数据进行分区
by wrapping a numeric column from the CSV.
这里我们就指定好分区个数
And here we’re specifying the number
以及我们要创建的各个分区的范围
and the ranges of the buckets we’d like created.
一旦做好这些 我们就可以将分区特征
Once this is done, we can add the bucketized feature
加进训练模型的特征列表中了
to the list used to train our model.
【类属型特征】
[CATEGORICAL FEATURES]
接下来我们来看看如何表示一个类属特征
Now let’s see how to represent a categorical feature,
这里我用 受教育程度 这一列做个例子
and I’ll use the education column as an example.
因为这一列只有几个值
Because there are only a few values,
表示这个的最好方法就是使用原始值
the best way to represent this is just to use the raw value.
所以这里我们创建一个特征列
And here we’ll create a feature column
这样 受教育程度 就可以是特征列表中的一个单独的值
that says education can be a single value from this list.
当然 你也可以从硬盘上的文件中读取到这些值
Of course, you could also read the values from a file on disk
而不是直接把它写在代码里
rather than writing them out in code.
如果可能值的数量比较少
Now using the raw value is the right thing
那么直接用原始值并没有什么问题
to do when there are only a small number of possibilities.
稍后我们会讲到
We’ll cover the case where there are thousands
有上千种可能值的情况
of possibilities in a moment.
那这里我们就先说说特征交叉
First, let’s take a look at feature crossing.
【特征交叉】
[FEATURE CROSSING]
特征交叉是将已经存在的特征
Feature crossing is a way to create new features that
合并成一个新特征的方法
are combinations of existing ones,
这个方法对线性分类器非常有用
and these can be especially helpful to linear classifiers,
因为线性分类器无法处理特征间的关系
which can’t model interactions between features.
这就是在 Facets 中的效果
Here’s what this looks like in Facets.
这里我们把之前做的年龄的分区特征
I’ll take our age buckets from before
和受教育程度来做一个交叉
and cross them with education.
你可以把它看做一个值为 真或假 的特征
Under the hood, you can think of a true-false feature being
每个年龄段都对应一个值
created for each bucket that tells
来告诉分类器某个人
the classifier whether an individual falls
在哪个年龄段
into that range.
现在看来这些年龄段包含很多信息
Now these buckets can be informative,
这些群体很有可能会有高收入
and here we see some groups are likely to have a high income,
其他则是低收入
and others low.
在代码里 特征交叉的效果和之前一样
In code, using a feature cross works the same way as before.
我们将年龄的分区特征和教育程度交叉
We’ll cross our age buckets with education
然后将其加入特征列表
and add it to the list of features to use.
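A rough sketch of what crossing produces, in plain Python rather than TensorFlow. The education values are a small illustrative subset, the bucket count is an assumption, and the helper names are hypothetical. Each (age bucket, education) pair becomes its own true/false feature, so a linear model can learn a separate weight for each combination.

```python
from itertools import product

# Illustrative subset of education values and an assumed number of age buckets.
EDUCATION = ["HS-grad", "Bachelors", "Masters"]
NUM_AGE_BUCKETS = 7

# Every (age bucket, education) pair gets its own position in the crossed feature.
CROSSED_VOCAB = {
    pair: index
    for index, pair in enumerate(product(range(NUM_AGE_BUCKETS), EDUCATION))
}


def crossed_one_hot(age_bucket_index, education):
    """Return a vector with a single 1.0 marking the crossed category."""
    vector = [0.0] * len(CROSSED_VOCAB)
    vector[CROSSED_VOCAB[(age_bucket_index, education)]] = 1.0
    return vector


print(len(CROSSED_VOCAB))
print(crossed_one_hot(3, "Bachelors").index(1.0))
```

Even this small example yields 7 x 3 = 21 crossed categories, and the count multiplies with every additional value on either side.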
特征交叉会很快产生非常多的可能性
A feature cross can generate many possibilities quickly,
这也就是我们为什么要用哈希的方式
which is why they are often represented
来表示它们
under the hood with a hash.
【哈希】
[HASHING]
哈希特征列是用来表示大容量
A hashed feature column is one way to efficiently represent
类属特征的一种行之有效的方式
a categorical feature with a large vocabulary.
更重要的是 它让你可以更容易地
More importantly, you can use these
处理你的数据
as a way to make your data easier
因为它省去了你建立一个
to work with because they free you from having
词汇表的功夫
to provide a vocabulary list.
在这个例子中 我们将从 CSV 文件中
In this example, we’ll represent the occupation column
读取的 职业 这一列表示成一个有
from our CSV file by using a hash
1000种可能值的哈希列
with 1,000 possible values.
要注意这里我们并不需要去建立词汇表
Notice we don’t have to provide a vocabulary list,
为了避免特征值冲突 我将哈希值大小
and to avoid collisions, I’ve set the hash size
设置的大于词汇表的容量
so it’s larger than the number of items in the vocabulary.
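As a sketch of the idea, the snippet below maps an occupation string to one of 1,000 buckets without any vocabulary list. It uses MD5 from Python's standard library as a stand-in hash; that choice is an assumption for illustration and is not the hash function TensorFlow uses. The occupation strings are example values.

```python
import hashlib

HASH_BUCKET_SIZE = 1000  # the 1,000 possible values mentioned in the video


def hash_bucket(value, num_buckets=HASH_BUCKET_SIZE):
    """Map a string to a stable bucket index in [0, num_buckets)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer, then wrap into the bucket range.
    return int.from_bytes(digest[:8], "big") % num_buckets


for occupation in ("Sales", "Tech-support", "Exec-managerial"):
    print(occupation, hash_bucket(occupation))
```

The same string always lands in the same bucket, and a string never seen before still gets a valid index, which is what removes the need for a vocabulary.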
这里就是它如何工作的
Here’s how this works under the hood.
通常类属特征会用
Normally, a categorical feature is represented
one hot 编码表示
as a one hot encoding.
也就是每个词汇表中的可能值
That means there’s one bit for each possible value
都有1比特的位置
in the vocabulary.
这样我们就可以进行查询因为
And we can create a lookup because we know the vocabulary
我们提前知道词汇表内容
list in advance.
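A small sketch of a one-hot encoding built from a known vocabulary. The education values listed here are an illustrative subset, and the names are hypothetical.

```python
# Illustrative vocabulary, known in advance.
EDUCATION_VOCAB = ["HS-grad", "Some-college", "Bachelors", "Masters", "Doctorate"]

# The lookup from value to position is possible because the list is known up front.
LOOKUP = {value: index for index, value in enumerate(EDUCATION_VOCAB)}


def one_hot(value):
    """Return a vector with one slot per vocabulary item and a single 1."""
    vector = [0] * len(EDUCATION_VOCAB)
    vector[LOOKUP[value]] = 1
    return vector


print(one_hot("Bachelors"))
```

A value missing from `LOOKUP` raises a `KeyError` here, which is the gap that hashing fills when the vocabulary is not known.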
那如果我们不知道词汇表内容呢
Now if we don’t know the vocab, we
我们就可以用哈希函数来自动计算比特值
can use a hash function to compute the bit automatically.
缺点就是这样做可能会产生特征值冲突
The downside is there could be collisions,
也就是不同值最后被算成相同的值
meaning different items are mapped to the same value.
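Collisions are easy to provoke with a deliberately small hash size. The sketch below again uses MD5 as an assumed stand-in hash (not TensorFlow's), and the helper names and occupation strings are illustrative. With five values and only three buckets, at least two values must share a bucket.

```python
import hashlib
from collections import defaultdict


def hash_bucket(value, num_buckets):
    """Map a string to a stable bucket index in [0, num_buckets)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def find_collisions(values, num_buckets):
    """Return {bucket index: [values]} for buckets holding more than one value."""
    buckets = defaultdict(list)
    for value in values:
        buckets[hash_bucket(value, num_buckets)].append(value)
    return {index: items for index, items in buckets.items() if len(items) > 1}


occupations = ["Sales", "Tech-support", "Exec-managerial", "Craft-repair", "Adm-clerical"]
print(find_collisions(occupations, num_buckets=3))
print(find_collisions(occupations, num_buckets=1000))
```

Raising the number of buckets makes collisions less likely at the cost of a larger feature, which is the trade-off behind choosing the hash size.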
哈希也可以用来限制内存使用
Hashes can also be used to limit memory usage
当然这会为你的训练数据带来一些噪点
at the cost of adding some noise to your training data.
如果你的词汇表很大
If you have a large vocabulary, it
它可能会在神经网络输入层中
can be memory intensive to use that as input
让内存非常吃紧
to a neural network.
哈希列可以用来限制
A hashed column can be used to limit
可能性的最大数值
the maximum number of possibilities,
但我还是更喜欢把它
but I prefer them simply as a tool
当做节省编程时间的工具
to save you programming time.
【嵌入】
[EMBEDDINGS]
最后 我要说说嵌入
Finally, I’d like to mention embeddings,
这可能会没有其他技巧直观
and these can be less intuitive than the other techniques,
但它们确实对处理深度学习中的类属数据
but they’re a powerful way to work with categorical data
非常的有帮助
in a deep learning setting.
你可以将嵌入当做一个向量
You can think of an embedding as a vector that represents
其代表一个词语的意思
the meaning of a word.
我们可以用 TensorFlow Embedding Projector
And we can visualize a dataset of word embeddings
来可视化一个词语嵌入的数据库
using the TensorFlow Embedding Projector,
在视频简介中你可以找到一个在线demo
and there’s an online demo you can find in the description.
这里有一个1万词的数据库
Here we’re looking at a dataset of 10,000 words, each of which
用了一个非常多维的向量表示
is represented by a vector with many dimensions,
这里投影到3维以便我们可以观察
projected down to 3D so we can see them.
你可以在右边的输入框里搜索词语
You can search for words in the box to the right.
如果你稍微试验一下
And if you experiment a bit, you’ll
就会发现相似词靠得很近
find similar words are often close together.
比如说 所有这些词都属于城市这一类
For example, all of the words in this cluster are cities.
嵌入好用的一个地方就是
What’s neat about embeddings is that they’re
在训练一个 DNN 网络时自动学习
learned automatically in the process of training a DNN.
为了使用它 你只要
And to make that happen, all you need to do
写一个嵌入列就行了
is write an embedding column.
这里我们为 受教育程度 创建一个嵌入
Here we’ll create an embedding for education
维度为10
with 10 dimensions.
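Conceptually, an embedding column is a table with one row per category and one column per dimension. The plain-Python sketch below uses an illustrative vocabulary and random starting values; in a real DNN the rows would be adjusted during training, which this sketch does not do. All names here are hypothetical.

```python
import random

# Illustrative vocabulary; 10 dimensions as in the video.
EDUCATION_VOCAB = ["HS-grad", "Some-college", "Bachelors", "Masters", "Doctorate"]
EMBEDDING_DIM = 10

# Random starting values stand in for weights that training would learn.
rng = random.Random(0)
EMBEDDING_TABLE = [
    [rng.uniform(-0.1, 0.1) for _ in range(EMBEDDING_DIM)]
    for _ in EDUCATION_VOCAB
]


def embed(value):
    """Look up the dense vector for a category."""
    return EMBEDDING_TABLE[EDUCATION_VOCAB.index(value)]


def embed_via_one_hot(value):
    """Same result, written as a one-hot vector multiplied by the table."""
    one_hot = [1.0 if v == value else 0.0 for v in EDUCATION_VOCAB]
    return [
        sum(one_hot[row] * EMBEDDING_TABLE[row][col] for row in range(len(EDUCATION_VOCAB)))
        for col in range(EMBEDDING_DIM)
    ]


print(len(embed("Masters")))
```

The lookup and the one-hot multiplication give the same vector; the lookup simply skips the multiplications by zero.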
如果你有一个词汇表很大的类属列
Now embeddings are helpful if you have a categorical column
那嵌入会很有帮助
with a large vocabulary and you want
你需要压缩其表示的维度大小以便分类器
to compress the representation so the classifier learns
可以学习一般概念而不是简单记忆这些
general concepts rather than memorizing
特殊词的意思
the meaning of specific words.
举个例子 如果人口普查数据
For example, imagine if the census data
有一列叫做 工作头衔
had a column called job title.
肯定有成千上万的工作头衔
There are thousands of different jobs,
嵌入则可以帮助你的分类器
and an embedding could be used to help your classifier learn
学习到程序员和软件工程师
that words like programmer and software engineer
通常是同一个工作
often mean the same thing.
【下一步】
[NEXT STEPS]
好的 希望这些可以帮到你们
OK, hope this was a helpful intro,
思考如何表示你的特征
and thinking about how to represent your features
这会是你在机器学习实验中
is one of the most important contributions
最重要的一项工作了
you can make to a machine learning experiment.
特征列很好因为
Feature columns are great because they
它们让你可以在代码中试验
let you experiment with different representations
不同的表现方式 还能创造嵌入这种
in code and make advanced features like embeddings
高级特征
accessible.
下一步 我建议你们
As a next step, I’d recommend you
尝试一下简介中的代码 看看能不能修改它们
try the code in the description and see if you can modify it
为你们自己的问题服务
for a problem you care about.
感谢各位收看 我们下次见
Thanks for watching, everyone, and I’ll see you next time.
【音乐播放中】
[MUSIC PLAYING]

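Translation credits
Video summary

This video introduces four feature engineering techniques commonly used in machine learning and deep learning: bucketing, feature crossing, hashing, and embeddings. The framework used is TensorFlow, and the visualization tool is Facets.

Transcription

Collected from the web

Translator

[B]刀子

Reviewer

审核员1024

Video source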

https://www.youtube.com/watch?v=d12ra3b_M-0
