Let’s Write a Pipeline - Machine Learning Recipes #4

[MUSIC PLAYING]

Welcome back. We've covered a lot of ground already, so today I want to review and reinforce concepts. To do that, we'll explore two things. First, we'll code up a basic pipeline for supervised learning. I'll show you how multiple classifiers can solve the same problem. Next, we'll build up a little more intuition for what it means for an algorithm to learn something from data, because that sounds kind of magical, but it's not.

To kick things off, let's look at a common experiment you might want to do. Imagine you're building a spam classifier. That's just a function that labels an incoming email as spam or not spam.
Now, say you've already collected a data set and you're ready to train a model. But before you put it into production, there's a question you need to answer first: how accurate will it be when you use it to classify emails that weren't in your training data? As best we can, we want to verify our models work well before we deploy them. And we can do an experiment to help us figure that out.

One approach is to partition our data set into two parts. We'll call these Train and Test. We'll use Train to train our model and Test to see how accurate it is on new data. That's a common pattern, so let's see how it looks in code.
To kick things off, let's import a data set into SciKit. We'll use Iris again, because it's handily included. Now, we already saw Iris in episode two. But what we haven't seen before is that I'm calling the features x and the labels y. Why is that? Well, that's because one way to think of a classifier is as a function. At a high level, you can think of x as the input and y as the output. I'll talk more about that in the second half of this episode.

After we import the data set, the first thing we want to do is partition it into Train and Test. And to do that, we can import a handy utility, and it makes the syntax clear. We're taking our x's and our y's, or our features and labels, and partitioning them into two sets. X_train and y_train are the features and labels for the training set. And X_test and y_test are the features and labels for the testing set. Here, I'm just saying that I want half the data to be used for testing. So if we have 150 examples in Iris, 75 will be in Train and 75 will be in Test.
Now we'll create our classifier. I'll use two different types here to show you how they accomplish the same task. Let's start with the decision tree we've already seen. Note there are only two lines of code that are classifier-specific. Now let's train the classifier using our training data. At this point, it's ready to be used to classify data. And next, we'll call the predict method and use it to classify our testing data. If you print out the predictions, you'll see they are a list of numbers. These correspond to the type of Iris the classifier predicts for each row in the testing data.
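For the decision tree, the two classifier-specific lines plus the fit and predict calls might look like this (a sketch under the variable names introduced above, not necessarily the video's exact code):

```python
from sklearn import tree

# The two classifier-specific lines: create and train the decision tree.
my_classifier = tree.DecisionTreeClassifier()
my_classifier.fit(x_train, y_train)

# Classify the held-out test data.
predictions = my_classifier.predict(x_test)
print(predictions)  # a list of numbers, one predicted Iris type per test row
```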
Now let's see how accurate our classifier was on the testing set. Recall that up top, we have the true labels for the testing data. To calculate our accuracy, we can compare the predicted labels to the true labels, and tally up the score. There's a convenience method in SciKit we can import to do that. Notice here, our accuracy was over 90%. If you try this on your own, it might be a little bit different because of some randomness in how the Train/Test data is partitioned.
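The convenience method being referred to is presumably accuracy_score from sklearn.metrics; a sketch:

```python
from sklearn.metrics import accuracy_score

# Compare the predicted labels to the true test labels and tally the score.
print(accuracy_score(y_test, predictions))  # typically a bit over 0.9 for Iris
```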
Now, here's something interesting. By replacing these two lines, we can use a different classifier to accomplish the same task. Instead of using a decision tree, we'll use one called KNearestNeighbors. If we run our experiment, we'll see that the code works in exactly the same way. The accuracy may be different when you run it, because this classifier works a little bit differently and because of the randomness in the Train/Test split. Likewise, if we wanted to use a more sophisticated classifier, we could just import it and change these two lines. Otherwise, our code is the same. The takeaway here is that while there are many different types of classifiers, at a high level, they have a similar interface.
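Swapping in the second classifier (presumably scikit-learn's KNeighborsClassifier) only changes the two classifier-specific lines; the rest of the pipeline stays the same. Again a sketch, not the video's verbatim code:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Replace the decision tree with a k-nearest-neighbors classifier.
my_classifier = KNeighborsClassifier()
my_classifier.fit(x_train, y_train)

predictions = my_classifier.predict(x_test)
print(accuracy_score(y_test, predictions))
```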
Now let's talk a little bit more about what it means to learn from data. Earlier, I said we called the features x and the labels y, because they were the input and output of a function. Now, of course, a function is something we already know from programming. def classify: there's our function. As we already know in supervised learning, we don't want to write this ourselves. We want an algorithm to learn it from training data.

So what does it mean to learn a function? Well, a function is just a mapping from input to output values. Here's a function you might have seen before: y equals mx plus b. That's the equation for a line, and there are two parameters: m, which gives the slope, and b, which gives the y-intercept. Given these parameters, of course, we can plot the function for different values of x.
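As a quick illustration (my own toy code, not from the video), here is that line written as a Python function with its two parameters:

```python
def line(x, m, b):
    # y = mx + b: m is the slope, b is the y-intercept.
    return m * x + b

# With the parameters fixed, we can evaluate the function for different values of x.
print([line(x, m=2.0, b=1.0) for x in range(5)])  # [1.0, 3.0, 5.0, 7.0, 9.0]
```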
Now, in supervised learning, our classifier function might have some parameters as well, but the input x is the features for an example we want to classify, and the output y is a label, like Spam or Not Spam, or a type of flower.

So what could the body of the function look like? Well, that's the part we want to write algorithmically, or in other words, learn. The important thing to understand here is we're not starting from scratch and pulling the body of the function out of thin air. Instead, we start with a model. And you can think of a model as the prototype for, or the rules that define, the body of our function. Typically, a model has parameters that we can adjust with our training data. And here's a high-level example of how this process works.

Let's look at a toy data set and think about what kind of model we could use as a classifier. Pretend we're interested in distinguishing between red dots and green dots, some of which I've drawn here on a graph. To do that, we'll use just two features: the x- and y-coordinates of a dot.
Now let's think about how we could classify this data. We want a function that considers a new dot it's never seen before, and classifies it as red or green. In fact, there might be a lot of data we want to classify. Here, I've drawn our testing examples in light green and light red. These are dots that weren't in our training data. The classifier has never seen them before, so how can it predict the right label?

Well, imagine if we could somehow draw a line across the data like this. Then we could say the dots to the left of the line are green and dots to the right of the line are red. And this line can serve as our classifier. So how can we learn this line? Well, one way is to use the training data to adjust the parameters of a model. And let's say the model we use is a simple straight line like we saw before. That means we have two parameters to adjust: m and b. And by changing them, we can change where the line appears.
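A toy sketch of how a straight line could serve as the classifier for those dots. The classify helper, the label strings, and the convention for which side of the line counts as green are all my own assumptions for illustration:

```python
def classify(dot, m, b):
    # dot is an (x, y) coordinate pair; the line y = mx + b splits the plane.
    x, y = dot
    return 'green' if y > m * x + b else 'red'

print(classify((1.0, 4.0), m=2.0, b=1.0))  # 'green': this dot lies above the line
```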
So how could we learn the right parameters? Well, one idea is that we can iteratively adjust them using our training data. For example, we might start with a random line and use it to classify the first training example. If it gets it right, we don't need to change our line, so we move on to the next one. But on the other hand, if it gets it wrong, we could slightly adjust the parameters of our model to make it more accurate.

The takeaway here is this: one way to think of learning is using training data to adjust the parameters of a model.
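The video doesn't spell out an update rule, but a perceptron-style sketch of "start with a random line and nudge the parameters whenever it gets an example wrong" could look like this. It is purely illustrative; the function name, learning rate, and update direction are assumptions, not the video's method:

```python
import random

def train_line(examples, steps=1000, lr=0.01):
    # examples: list of ((x, y), label) pairs, with label 'green' or 'red'.
    m, b = random.random(), random.random()  # start with a random line
    for _ in range(steps):
        (x, y), label = random.choice(examples)
        predicted = 'green' if y > m * x + b else 'red'
        if predicted != label:
            # Wrong: nudge the line slightly toward the misclassified point.
            direction = 1 if label == 'green' else -1
            b -= direction * lr
            m -= direction * lr * x
    return m, b
```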
Now, here's something really special. It's called tensorflow/playground. This is a beautiful example of a neural network you can run and experiment with right in your browser. Now, this deserves its own episode for sure, but for now, go ahead and play with it. It's awesome. The playground comes with different data sets you can try out. Some are very simple. For example, we could use our line to classify this one. Some data sets are much more complex. This data set is especially hard. See if you can build a network to classify it. Now, you can think of a neural network as a more sophisticated type of classifier, like a decision tree or a simple line. But in principle, the idea is similar.

OK. Hope that was helpful. I just created a Twitter that you can follow to be notified of new episodes. And the next one should be out in a couple of weeks, depending on how much work I'm doing for Google I/O. Thanks, as always, for watching, and I'll see you next time.

Video source: https://www.youtube.com/watch?v=84gqSbLcBFE
