
Writing our own K Nearest Neighbors in Code - Practical Machine Learning Tutorial with Python p.17

What is going on everybody, and welcome to part 17 of our machine learning tutorial series. We've been working on K nearest neighbors, so let's get right into it; I don't want to waste any more time here.
In the previous part we started writing this K nearest neighbors function, but all we've done so far is warn the user when they're trying to do something unwise. Now we actually need to build the K nearest neighbors logic itself: work out which K points are nearest to the point we're predicting, look up their classes, and base the prediction on that comparison.
So what do we need to do? How can we compare a data point to all the other data points to find out which ones are closest? Therein lies the problem with K nearest neighbors: to classify a single point, we have to compare the prediction point against every other point in the data set.
There are some things you can do to cut that cost. For example, you can use what's called a radius: you only consider points within a certain radius of the prediction point and simply ignore everything outside it. Checking whether a point falls inside or outside a radius is much cheaper than computing its exact Euclidean distance, so you only do the full distance calculation for the points that pass that filter. We're not going to get into that here, but keep in mind that this brute-force comparison is the fundamental problem with K nearest neighbors.
What we're actually going to do is build up a list of lists holding the distances. You might start with something like distances = []. Then you write for group in data; keep in mind that data is whatever gets passed in, which in this case is our data set dictionary, so this loop iterates over each class. Inside it, nest another loop, for features in data[group], which iterates over the feature lists belonging to that class.
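Putting those pieces together, the loop skeleton described so far is roughly the following; the distance calculation gets filled in over the next few steps:

```python
distances = []
for group in data:                  # each class label in the training dictionary
    for features in data[group]:    # each training point belonging to that class
        pass                        # the distance calculation goes here
```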
What do we want to calculate in there? The Euclidean distance, right? To save time I'm not going to retype the version we wrote previously; I'll just paste it in, even though we're not actually going to keep it. That version was sqrt((features[0]-predict[0])**2 + (features[1]-predict[1])**2), i.e. you sum the squared differences and take the square root of the whole thing. That's equivalent to what we wrote earlier.
Unfortunately, while that is the algorithm, it's not a fast calculation. On our toy data set it would be lightning-fast, and on the breast cancer data set it would probably still be fine, but in general it's slow, so we definitely want to change it to a faster version.
The other problem is dimensionality: this works with 2 dimensions, but what if you had 3? The expression above will not adapt; it is hard-coded to do K nearest neighbors on exactly 2 feature dimensions. Bummer, right? There are a few things we could do to counteract this; a one-line for loop over the dimensions would do it, but to handle a dynamic number of features you can do something like the following.
First, we can use numpy's square root and numpy's sum, together with numpy arrays, so we can subtract whole arrays and square them element-wise, no matter how many dimensions they have. And, perfect, we get the same value for the Euclidean distance here as well.
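In code, that dimension-agnostic numpy version is roughly:

```python
import numpy as np

# works for any number of feature dimensions, not just two
euclidean_distance = np.sqrt(np.sum((np.array(features) - np.array(predict))**2))
```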
But we're not going to use that one either, because it turns out numpy has an even simpler way to write it, and that's the one I will type out, since it's the version you're going to need.
It's euclidean_distance = np.linalg.norm, where linalg basically stands for linear algebra. We take the norm of np.array(features) minus np.array(predict); features here corresponds to the features variable in our inner loop. Obviously this no longer looks anything like the Euclidean distance formula.
That's why I didn't want to use it initially: it's much higher level, so it feels a bit like cheating. But we're going to use it because it's faster, and I did want to show you the explicit way to write it in case you ever need it.
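So the line that actually goes inside the inner loop is:

```python
# the L2 norm of the difference vector is exactly the Euclidean distance
euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
```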
Okay, so we've done that. The next thing we need is distances.append: for each training point we append a small list containing whatever the euclidean_distance happens to be, followed by the group. So distances ends up as a list of lists in which the first item of each inner list is the actual distance and the second is the group. That way we can sort the outer list, take the top entries, and read off their groups; that's my thinking.
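Inside the inner loop, that is simply:

```python
distances.append([euclidean_distance, group])
```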
Now we come down below the loops and build the votes list. I'm going to use a one-line for loop again; let me write it out and then explain it. Each element is i[1], which is the group: once we know the top k distances, we no longer care about the distance values themselves, we only needed them to rank the neighbors. So we write for i in sorted(distances)[:k], because after sorting the distances we only care about the first k entries. It's essentially a compact for loop that populates votes as a list containing just the groups, i.e. the categories, labels, classes, whatever you want to call them.
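Written out, that one-liner is:

```python
# class labels of the k nearest training points
votes = [i[1] for i in sorted(distances)[:k]]
```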
So there you have votes, and then the vote result is Counter(votes).most_common(1), because we only care about the single most common group, and then we take [0][0]. most_common returns a list of pairs, where each pair holds a group and the number of votes it received, so the first [0] pulls out that pair and the second [0] pulls out the group itself. (On camera I couldn't remember whether it was a tuple of an array or an array of a list; it is in fact a list of tuples.) We can also print Counter(votes).most_common(1) so you don't have to take my word for it.
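For a concrete sense of what most_common returns, here is a small example with made-up votes:

```python
from collections import Counter

votes = ['r', 'r', 'k']               # example ballot
Counter(votes).most_common(1)         # [('r', 2)]  -> a list holding one (group, count) tuple
Counter(votes).most_common(1)[0][0]   # 'r'         -> the winning group itself
```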
So then we've got the vote result, we return it, and K nearest neighbors is done.
Now let's actually run this thing: we say result = , copy and paste the function call, pass in the data set, predict on new_features, and set k=3. No problem. Then print result. We'll also get the extra print-out from the Counter line, but that's okay, and if we're lucky we might even get an error. Nope, no error, cool. And the printed most_common output is indeed a list containing a tuple, so it's a good thing I printed it rather than guessing.
What we get back is that the most-voted class is 'r'; with k=3, all three nearest neighbors turned out to be 'r', so of course the vote result is 'r'. So we get the class back: this data point is an 'r' type, which is exactly what we expected from looking at it.
Good. Now there are a couple of things we want to do. First, we might want to compare this to scikit-learn's version. The other thing we could do is print or graph the result: take the plotting code from before, copy it down here, uncomment it, and for new_features set color=result instead of a fixed color. I think we'll get away with that. Running it, we just need to tweak the marker size, but there's the new point, and indeed it was classified into the correct group.
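A minimal sketch of that plot, assuming matplotlib is imported as plt and that the group keys double as matplotlib color codes ('k' for black, 'r' for red) as in the series' toy data set:

```python
import matplotlib.pyplot as plt

# plot every training point in its group's colour
for group in dataset:
    for features in dataset[group]:
        plt.scatter(features[0], features[1], s=100, color=group)

# plot the new point in the colour of the class it was assigned to
plt.scatter(new_features[0], new_features[1], s=100, color=result)
plt.show()
```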
Next we want to compare this to scikit-learn's version of K nearest neighbors: we'll run our K nearest neighbors algorithm against that breast cancer data and see how well we stack up against scikit-learn. That's what we'll be doing in the next tutorial. If you have any questions, comments, or concerns, leave them below. Otherwise, as always, thanks for watching, thanks for all the support and subscriptions, and until next time.


Translation information

Video overview: In this lesson, the author hand-writes a K nearest neighbors algorithm and tests it at the end.

Transcription: [B]刀子
Translation: [B]刀子
Review: 审核员1024
Video source: https://www.youtube.com/watch?v=GWHG3cS2PKc
