
【机器学习入门】#5 我们的第一个分类器

Writing Our First Classifier - Machine Learning Recipes #5

(音乐播放中)
[MUSIC PLAYING]
大家好
Hey, everyone.
欢迎回来
Welcome back.
这期节目里 我们将做点特别的
In this episode, we’re going to do something special,
我们要从零编写一个分类器
and that’s write our own classifier from scratch.
如果你是机器学习的新手
If you’re new to machine learning,
这对你来说可是个里程碑
this is a big milestone.
因为如果你能跟下来 自己完成
Because if you can follow along and do this on your own,
就意味着你理解了这个谜题中的一个重要部分
it means you understand an important piece of the puzzle.
我们今天要写的分类器
The classifier we’re going to write today
是k最接近邻居算法的一个简化版本
is a scrappy version of k-Nearest Neighbors.
这是分类器里最简单的一种
That’s one of the simplest classifiers around.
先说下本集梗概
First, here’s a quick outline of what we’ll do in this episode.
我们会从第四集 《让我们编写一条管道》里的
We’ll start with our code from Episode 4, Let’s
代码开始
Write a Pipeline.
回忆下这一集 我们做了一个简单的实验
Recall in that episode we did a simple experiment.
导入了一组数据 将它切分为训练集和测试集
We imported a data set and split it into train and test.
训练集被用来训练分类器
We used train to train a classifier,
测试集是用来检验它的准确性
and test to see how accurate it was.
今天我们则要关注于
Writing the classifier is the part
编写分类器本身
we’re going to focus on today.
之前我们写了这两行来
Previously we imported the classifier
从库里引用分类器
from a library using these two lines.
这里我会将它注释掉 自己来编写
Here we’ll comment them out and write our own.
这条程序管道其它部分保持不变
The rest of the pipeline will stay exactly the same.
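作为参考 下面是按上面描述还原的管道示意代码(基于scikit-learn 变量名和导入的具体分类器只是假设 并非视频原始代码)
For reference, here is a sketch of the pipeline described above (using scikit-learn; the variable names and the exact classifier imported are assumptions, not the video's original code):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# These are the two lines we'll comment out and replace with our own classifier.
from sklearn.neighbors import KNeighborsClassifier
my_classifier = KNeighborsClassifier()

# Import the iris dataset and split it into train and test.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.5)

# Use train to train the classifier, and test to see how accurate it is.
my_classifier.fit(X_train, y_train)
predictions = my_classifier.predict(X_test)
print(accuracy_score(y_test, predictions))
```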
讲解过程中我会不时切入和切出录屏演示
I’ll pop in and out of the screencast to explain things
来对概念进行解释
as we go along.
开始我们来运行这条管道
To start, let’s run our pipeline to remind ourselves
回顾一下它的准确率是多少
what the accuracy was.
可以看到 超过了90%
As you can see, it’s over 90%.
嗯 这就是我们自己编写分类器的
And that’s the goal for the classifier
精确度目标
we’ll write ourselves.
现在我们注释掉引用
Now let’s comment out that import.
代码会立刻出错
Right off the bat, this breaks our code.
于是我们立即要做的就是修复管道
So the first thing we need to do is fix our pipeline.
这里我们要先实现自己的分类器的类
And to do that, we’ll implement a class for our classifier.
我将称它为ScrappyKNN
I’ll call it ScrappyKNN.
所谓scrappy 我的意思是只保留最基本的部分
And by scrappy, I mean bare bones.
只要让它能正常工作就可以
Just enough to get it working.
接下来 我修改管道来使用它
Next, I’ll change our pipeline to use it.
现在我们来看下需要实现哪些方法
Now let’s see what methods we need to implement.
我们看下分类器需要实现的接口
Looking at the interface for a classifier,
有两个我们要关注
we see there are two we care about– fit,
fit 用来训练的方法
which does the training, and predict,
predict 实现预测的接口
which does the prediction.
我们先来声明fit方法
First we’ll declare our fit method.
记住它的输入参数是训练数据集的
Remember this takes the features and labels for the training set
特征和标签 那么我们自己来加上这些参数
as input, so we’ll add parameters for those.
下面是predict方法
Now let’s move on to our predict method.
它的输入参数是测试数据集的特征数据
As input, this receives the features for our testing data.
输出是预测结果 也就是对应的标签
And as output, it returns predictions for the labels.
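写出来 这两个方法的签名大致如下(仅为骨架示意 方法体后面再填)
Written out, the two method signatures look roughly like this (just a skeleton sketch; the bodies get filled in later):

```python
class ScrappyKNN():
    def fit(self, X_train, y_train):
        # Takes the features and labels of the training set as input.
        pass

    def predict(self, X_test):
        # Takes the features of the testing data and returns a list of predicted labels.
        pass

# In the pipeline, the library import is commented out and replaced with:
# my_classifier = ScrappyKNN()
```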
我们第一个目标是让管道能正常运行
Our first goal is to get the pipeline working,
弄明白这些方法的作用
and to understand what these methods do.
因此在编写真实的分类器之前
So before we write our real classifier,
先从一个更简单的开始
we’ll start with something simpler.
我们来写一个随机分类器
We’ll write a random classifier.
所谓随机 我的意思是用猜测的办法来得到标签
And by random, I mean we’ll just guess the label.
我们来给fit和predict方法添加上一些代码
To start, we’ll add some code to the fit and predict methods.
在fit方法里 我将训练数据集记下来
In fit, I’ll store the training data in this class.
你可以认为这只是做了个记录
You can think of this as just memorizing it.
后面会明白我们为什么这样做
And you’ll see why we do that later on.
predict方法中
Inside the predict method, remember
我们需要返回一个预测结果的列表
that we’ll need to return a list of predictions.
参数X_test是个二维数组
That’s because the parameter, X_test, is actually
或者说是列表的列表
a 2D array, or list of lists.
特征的每一行都是一条测试示例数据
Each row contains the features for one testing example.
要对每一行做出预测
To make a prediction for each row,
这里我从训练数据集中随机挑选一个标签
I’ll just randomly pick a label from the training data
添加到我们的预测结果中
and append that to our predictions.
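随机版本的分类器大致是下面这个样子(示意代码 实现的就是上面描述的接口)
The random version of the classifier looks roughly like this (a sketch implementing the interface described above):

```python
import random

class ScrappyKNN():
    def fit(self, X_train, y_train):
        # Memorize the training data by storing it on the class.
        self.X_train = X_train
        self.y_train = y_train

    def predict(self, X_test):
        predictions = []
        for row in X_test:
            # Just guess: randomly pick a label from the training data.
            label = random.choice(self.y_train)
            predictions.append(label)
        return predictions
```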
好的 到这里 我们的管道可以恢复工作了
At this point, our pipeline is working again.
我们运行一下看看效果
So let’s run it and see how well it does.
回忆下 鸢尾花数据集里有三种不同类型的花
Recall there are three different types of flowers
所以这样随机猜测 准确率应该在33%左右
in the iris dataset, so accuracy should be about 33%.
我们现在明白了分类器的接口
Now we know the interface for a classifier.
但在我们开始这个练习的时候
But when we started this exercise,
我们的准确度是在90%以上的
our accuracy was above 90%.
那么我们来看看能否做的更好
So let’s see if we can do better.
要实现这一目标 我们要实现自己的分类器
To do that, we’ll implement our classifier,
基于k最接近邻居算法
which is based on k-Nearest Neighbors.
这里是该算法的基本实现原理
Here’s the intuition for how that algorithm works.
我们回到上一集节目里绘制
Let’s return to our drawings of green dots and red dots
红点和绿点那一段
from the last episode.
假设我们在屏幕上看到的这些点是
Imagine the dots we see on the screen
我们在fit方法里保存的训练数据
are the training data we memorized in the fit method,
数据集不大
say for a toy dataset.
现在 假如我们要对这个灰色的
Now imagine we’re asked to make a prediction for this testing
测试圆点进行预测
point that I’ll draw here in gray.
我们要怎么做
How can we do that?
额 最接近邻居的分类器
Well in a nearest neighbor classifier,
它的工作原理已经在名字中体现了
it works exactly like it sounds.
我们要找到距离测试圆点最近的
We’ll find the training point that’s
训练圆点数据
closest to the testing point.
那么这个圆点就是最接近的邻居
This point is the nearest neighbor.
我们会预测说
Then we’ll predict that the testing
测试点和这个训练点拥有相同的标签
point has the same label.
例如 我们猜测测试点是绿色
For example, we’ll guess that this testing dot is green,
因为这是它最接近邻居的颜色
because that’s the color of its nearest neighbor.
另一个例子里 如果这里有一个测试圆点
As another example, if we had a testing dot over here,
我们就会猜测它是红色的
we’d guess that it’s red.
那么 这个在中间的圆点呢?
Now what about this one right in the middle?
如果这个圆点到它两边的
Imagine that this dot is equidistant to the nearest
绿点和红点的距离是相等的呢
green dot and the nearest red one.
两边一样 我们该怎么分类?
There’s a tie, so how could we classify it?
一种办法是我们可以随机选一方
Well one way is we could randomly break the tie.
但其实还有另一种方法 这就是k的作用了
But there’s another way, and that’s where k comes in.
k指的是我们做预测时
K is the number of neighbors we consider
需要加入计算的邻居数量
when making our prediction.
如果k是1 我们就只需要看最近的一个训练数据
If k was 1, we’d just look at the closest training point.
如果k是3 我们就得看3个接近的训练数据圆点
But if k was 3, we’d look at the three closest.
这里 有两个是绿色一个是红色
In this case, two of those are green and one is red.
那么预测的结果可以是选择多数的那一类
To predict, we could vote and predict the majority class.
现在虽然算法本身还有很多细节
Now there’s more detail to this algorithm,
但作为开始已经足够了
but that’s enough to get us started.
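如果想看看k大于1时的投票是怎么回事 下面是一个说明思路的假设性小例子(视频后面的实现实际上把k固定为1 并不包含这段投票代码)
If you want to see what voting with k greater than 1 looks like, here is a hypothetical little example of the idea (the implementation later in the video fixes k at 1 and does not include this voting code):

```python
from collections import Counter

def majority_vote(neighbor_labels):
    # Count the labels of the k nearest neighbors and return the most common one.
    return Counter(neighbor_labels).most_common(1)[0][0]

# Two green neighbors and one red neighbor: the vote predicts green.
print(majority_vote(["green", "green", "red"]))  # green
```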
要完成代码 首先我们需要
To code this up, first we’ll need a way
一种能找到最接近邻居的方法
to find the nearest neighbor.
要做到这一点 我们需要测量出
And to do that, we’ll measure the straight line
两点之间的直线距离 就像使用一把尺子来测量一样
distance between two points, just like you do with a ruler.
这里有个计算它的公式 叫做“欧氏距离”
There’s a formula for that called the Euclidean Distance,
公式是这样
and here’s what the formula looks like.
它能计算出两点之间的距离
It measures the distance between two points,
原理类似于勾股定理
and it works a bit like the Pythagorean Theorem.
A平方加上B平方等于C平方
A squared plus B squared equals C squared.
设这个是A 也就是
You can think of this term as A, or the difference
前两个特征之间的差值
between the first two features.
同理 设这个是B
Likewise, you can think of this term as B,
也就是第二对特征之间的差值
or the difference between the second pair of features.
我们现在要计算的距离就是斜边的长度
And the distance we compute is the length of the hypotenuse.
接下来有个很酷的地方
Now here’s something cool.
当前我们计算基于的是
Right now we’re computing distance
二维空间 因为我们的数据集只有
in two-dimensional space, because we have just two
两个特征值
features in our toy dataset.
那么如果我们需要计算三个特征 或者说三维空间呢?
But what if we had three features or three dimensions?
那么这就变成了一个立方体了
Well then we’d be in a cube.
我们仍然能直观地想象如何用尺子
We can still visualize how to measure distance
测量出空间里的距离
in the space with a ruler.
但如果我们有四个特征或者说是四维空间呢
But what if we had four features or four dimensions,
鸢尾花那个数据集就是
like we do in iris?
额 现在我们处在超立方体中
Well, now we’re in a hypercube, and we
要用可视化方式表达就不容易了
can’t visualize this very easily.
然而好消息是 欧氏距离
The good news is the Euclidean Distance
无论在几维空间下都是相同的方法
works the same way regardless of the number of dimensions.
更多的特征 只需要往公式里添加更多的变量即可
With more features, we can just add more terms to the equation.
关于这一点线上可以找到更多资料
You can find more details about this online.
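写成一般形式 两个各有n个特征的点a和b之间的欧氏距离是
Written in general form, the Euclidean distance between two points a and b with n features each is

d(a, b) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \cdots + (a_n - b_n)^2}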
现在我们来写程序计算欧氏距离
Now let’s code up Euclidean distance.
虽然这有很多种实现方式
There are plenty of ways to do that,
但我们选择使用scipy这个库来实现
but we’ll use a library called scipy that’s
Anaconda已经安装了这个库
already installed by Anaconda.
这里A和B分别是一个用数字表示的特征列表
Here, A and B are lists of numeric features.
假设A是来自于训练数据的一个点
Say A is a point from our training data,
B是来自测试数据的一个点
and B is a point from our testing data.
函数返回的是它们之间的距离
This function returns the distance between them.
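用scipy写出来大致如下(示意代码)
Coded up with scipy, it looks roughly like this (a sketch):

```python
from scipy.spatial import distance

def euc(a, b):
    # a and b are lists of numeric features;
    # returns the straight-line (Euclidean) distance between them.
    return distance.euclidean(a, b)

# Usage example: the distance between (1, 1) and (4, 5) is 5.0.
print(euc([1, 1], [4, 5]))
```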
这是我们要用到的所有数学知识
That’s all the math we need, so now
现在我们来看下分类器的算法
let’s take a look at the algorithm for a classifier.
要对一个测试数据圆点来做预测
To make a prediction for a test point,
我们需要计算出它与所有训练圆点的距离
we’ll calculate the distance to all the training points.
然后测试数据圆点的预测结果就是
Then we’ll predict the testing point has the same label
与最接近它的圆点的标签
as the closest one.
我删掉之前写的随机逻辑
I’ll delete the random prediction we made,
换上可以找到最接近
and replace it with a method that finds the closest training
训练点的方法
point to the test point.
为了简化 在这个视频里我会把k硬编码为1
For this video, I’ll hard-code k to 1,
那么这样我们写出了一个最接近邻居分类器
so we’re writing a nearest neighbor classifier.
变量k没有在我们的代码里体现
The k variable won’t appear in our code,
因为我们总是查找最近的点
since we’ll always just find the closest point.
方法里 我们会循环对所有的点进行计算
Inside this method, we’ll loop over all the training points
来跟踪当前情况下的最接近的点
and keep track of the closest one so far.
之前我们把训练数据记录在了fit函数里
Remember that we memorized the training data in our fit
X_train包含的是所有的特征
function, and X_train contains the features.
一开始我计算的是测试数据点
To start, I’ll calculate the distance from the test point
与第一个训练数据点之间的距离
to the first training point.
我会使用这个变量来跟踪
I’ll use this variable to keep track of the shortest
目前为止我们找到的最短距离
distance we’ve found so far.
这个变量则是用来
And I’ll use this variable to keep
跟踪距离最近的训练数据点的下标
track of the index of the training point that’s closest.
后面会用这个来取出对应的标签
We’ll need this later to retrieve its label.
现在我们遍历所有其它的训练数据点
Now we’ll iterate over all the other training points.
每次发现一个更接近的点
And every time we find a closer one,
就更新变量
we’ll update our variables.
最终我们会将距离最近的训练数据的
Finally, we’ll use the index to return
下标所对应的标签返回
the label for the closest training example.
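把这些拼在一起 最近邻版本的分类器大致如下(示意代码 沿用前面的euc距离函数 closest这个辅助方法名仅作示意)
Putting it all together, the nearest-neighbor version of the classifier looks roughly like this (a sketch, reusing the euc distance function from above; the helper name closest is just illustrative):

```python
from scipy.spatial import distance

def euc(a, b):
    # Straight-line distance between two points.
    return distance.euclidean(a, b)

class ScrappyKNN():
    def fit(self, X_train, y_train):
        # Memorize the training data.
        self.X_train = X_train
        self.y_train = y_train

    def predict(self, X_test):
        predictions = []
        for row in X_test:
            # Predict the label of the closest training point for each test row.
            label = self.closest(row)
            predictions.append(label)
        return predictions

    def closest(self, row):
        # Start by treating the first training point as the closest so far.
        best_dist = euc(row, self.X_train[0])
        best_index = 0
        # Loop over the remaining training points, updating whenever we find a closer one.
        for i in range(1, len(self.X_train)):
            dist = euc(row, self.X_train[i])
            if dist < best_dist:
                best_dist = dist
                best_index = i
        # Use the index to return the label of the closest training example.
        return self.y_train[best_index]
```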
现在 我们就完成了一个最接近邻居分类器
At this point, we have a working nearest neighbor classifier,
那么我们来运行看下它的准确度如何
so let’s run it and see what the accuracy is.
可以看到 超过了90%
As you can see, it’s over 90%.
我们做到了
And we did it.
你运行自己编写的版本时
When you run this on your own, the accuracy
准确度可能有些不同 因为训练数据和测试数据
might be a bit different because of randomness in the train test
分割是随机的
split.
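如果希望每次运行都得到同样的切分 可以给切分函数固定随机种子(random_state是scikit-learn切分函数的常规参数 视频中并没有用到)
If you want the same split on every run, you can pin the random seed of the split (random_state is a standard parameter of scikit-learn's split function, not used in the video):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
# Fixing random_state makes the train/test split, and hence the accuracy, reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.5, random_state=42)
```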
如果现在你能编写出程序并且弄懂它
Now if you can code this up and understand it,
这将是个很大的进步
that’s a big accomplishment because it
因为这意味着你可以从零编写一个简单的分类器了
means you can write a simple classifier from scratch.
当前我们实现的这个算法有优点也有缺点
Now, there are a number of pros and cons
可以在网上找到相关信息
to this algorithm, many of which you can find online.
一个明显的优势在于它相对比较容易理解
The basic pro is that it’s relatively easy to understand,
对于某些问题来说它的效果还不错
and works reasonably well for some problems.
而它基础的缺陷在于效率不高
And the basic cons are that it’s slow,
因为它需要对每个训练数据点进行遍历
because it has to iterate over every training point
才能做出判断
to make a prediction.
还有重要的一点是 我们在第三集看到的
And importantly, as we saw in Episode 3,
一些特征相比较其它来说有更大的信息价值
some features are more informative than others.
但是在k最近邻居算法中
But there’s not an easy way to represent
这一点不易表示
that in k-Nearest Neighbors.
长远来说 我们需要
In the long run, we want a classifier
能够学习特征与待预测标签间更复杂关系的
that learns more complex relationships between features
一种分类器
and the label we’re trying to predict.
决策树是个好例子
A decision tree is a good example of that.
在TensorFlow Playground里看到的神经网络
And a neural network like we saw in TensorFlow Playground
则会更好
is even better.
OK 希望这些能帮到你
OK, hope that was helpful.
和以往一样 感谢观看
Thanks as always for watching.
你可以在推特关注我了解更新
You can follow me on Twitter for updates and, of course,
对 当然还是谷歌开发者频道
Google Developers.
下次再见吧
And I’ll see you guys next time.
(音乐播放中)
[MUSIC PLAYING]

译制信息
听录译者:收集自网络
翻译译者:知易行难
审核员:【MR】拾月
视频来源:https://www.youtube.com/watch?v=AoeEHqVSNOw
