
Writing our own K Nearest Neighbors in Code - Practical Machine Learning Tutorial with Python p.17

What is going on everybody, and welcome to part 17 of our machine learning tutorial series. We've been working on K nearest neighbors, so let's get right into it, no more wasted time. So far we've started creating this K nearest neighbors function, but all we've done is warn the user when they're trying to do something unwise. Now we actually need to write the K nearest neighbors part: calculate who the K nearest neighbors are, what their classes are, and therefore the class of whatever we're predicting, based on comparison to the data. Okay.
So what do we need to do? How can we compare a data point to the other data points to find out which is closest? Therein lies the problem with K nearest neighbors: to do this, we have to compare the prediction point to every other point in the data set.
Now, there are some things we can do about that. You can use what's called a radius: look only within a certain radius of a point and simply forget the outliers. That can save you a bit on the Euclidean distance calculations, because you can determine whether a point falls inside or outside a radius more cheaply than computing the full distance. Anyway, we're not going to get into that here; just keep in mind that this is the weak point of K nearest neighbors.
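As a quick sketch of why a radius test can be cheaper (not something we'll build in this series): comparing the squared distance against a squared radius skips the square root entirely. The helper name here is hypothetical:

```python
def within_radius(features, predict, radius):
    # Compare squared distance to squared radius: no sqrt required
    sq_dist = sum((f - p)**2 for f, p in zip(features, predict))
    return sq_dist <= radius**2
```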
So what we're going to do is build, basically, a list of lists of the distances. You might start with something like distances = []. Then you say for group in data. Keep in mind, data is whatever you pass through here; in this case it's our toy data set, so for group in data means for each class, basically. Inside that, you nest another loop: for features in group[]... or rather, for features in data[group]. That iterates through the feature sets within each class.
What do we want in there? Well, we need to calculate the Euclidean distance, right? I got rid of the earlier Euclidean distance code, so to save time I'll just copy and paste it rather than write it out, since we're not actually going to use this version. Going with the version I wrote previously, it would be: sqrt((features[0]-predict[0])**2 + (features[1]-predict[1])**2), taking the square root over the whole thing. That's the equivalent of what we wrote earlier. Unfortunately, while that is the algorithm, it's not a very fast calculation. On our toy data set it would be lightning fast, and on the breast cancer data set it would probably still be pretty fast, but in general, no. So we definitely want to change it.
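Assembled so far, the nested loop with the hard-coded 2D formula might look like this; the toy dataset and predict point below are assumptions standing in for the ones built earlier in the series:

```python
from math import sqrt

# Hypothetical toy data: two classes ('k' and 'r'), each a list of 2D points
dataset = {'k': [[1, 2], [2, 3], [3, 1]], 'r': [[6, 5], [7, 7], [8, 6]]}
predict = [5, 7]

distances = []
for group in dataset:                # for each class...
    for features in dataset[group]:  # ...and each point in that class
        # hard-coded 2D Euclidean distance
        euclidean_distance = sqrt((features[0] - predict[0])**2
                                  + (features[1] - predict[1])**2)
        distances.append([euclidean_distance, group])
```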
So we're going to change it to a faster version. The other problem here is dimensionality: this works with 2 dimensions, but what if you had 3? The formula above won't adapt; it's hard-coded to do K nearest neighbors on exactly 2 feature dimensions. Bummer, right? There are a few things we could do to counteract this; a one-line for loop over the dimensions would do it, for example. But to handle a dynamic number of features, you can do something like the following.
First of all, we use the numpy square root and the numpy version of sum, along with numpy arrays, which we can subtract and square element-wise. And perfect: we get the same value for the Euclidean distance. But we're not going to use that one either, because it turns out numpy has an even simpler version, which is the following.
This one I'll write out, because it's the one you're going to need. It's euclidean_distance = np.linalg.norm. linalg basically stands for linear algebra, and norm is just another name for what we're computing here, the Euclidean distance. So it's the norm of np.array(features), where features corresponds to the features in the loop above, minus np.array(predict). Okay. Obviously this no longer looks like the Euclidean distance formula, which is why I didn't want to use it at first: it's much higher level, almost like cheating. But we're going to use it because it's faster. I did want to show you the explicit way to write it, though, in case you ever need it. Okay, so we've done that.
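As a sanity check, here's a sketch showing that all three versions produce the same value; the features and predict points are hypothetical:

```python
from math import sqrt
import numpy as np

features = [2, 3]
predict = [5, 7]

# 1) Hard-coded 2D formula
d1 = sqrt((features[0] - predict[0])**2 + (features[1] - predict[1])**2)

# 2) NumPy sum of squares: handles any number of feature dimensions
d2 = np.sqrt(np.sum((np.array(features) - np.array(predict))**2))

# 3) np.linalg.norm: the Euclidean (L2) norm in one call
d3 = np.linalg.norm(np.array(features) - np.array(predict))
```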
Now, back inside the loop — whoops, hit my mouse — we need distances.append. We're going to append a list: whatever the euclidean_distance happens to be, and then the group. So distances becomes a list of lists, where the first item in each inner list is the actual distance and the second is the group. This way we can sort the list, take the top k entries, and read off their groups. That's my thinking.
So now we come down below the loop and say votes = ... Actually, let me explain how I'm going to do this first. Again I'm going to use a one-liner for loop; I'll write it out and then explain. It's votes = [i[1] for i in sorted(distances)[:k]]. Again, i[1] is the group. Once we've sorted the distances and kept only the k closest, we don't actually care what the distance values are anymore; we only needed them to rank the neighbors. So this one-liner simply populates votes as a list of the groups — the classes, categories, labels, whatever you want to call them.
So there you have votes, and then the vote result is vote_result = Counter(votes).most_common(1). How many of the most common do we care about? Just one: the single most common group. Then we take [0][0], because most_common returns a list of tuples, where each tuple holds the group and how many votes it got. So [0] first gets you the (group, count) tuple for the winner, and [0] again gets you just the group.
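The voting step might look like this sketch; the distance values here are hypothetical stand-ins for what the loop above would produce:

```python
from collections import Counter

k = 3
# Hypothetical [distance, group] pairs, as built in the loop above
distances = [[6.4, 'k'], [5.0, 'k'], [6.3, 'k'],
             [2.2, 'r'], [2.0, 'r'], [3.2, 'r']]

# Sort by distance, keep the k closest, keep only each group label (i[1])
votes = [i[1] for i in sorted(distances)[:k]]

# most_common(1) returns a list holding one (group, count) tuple,
# so [0][0] pulls out just the winning group
vote_result = Counter(votes).most_common(1)[0][0]
```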
In fact, we can print Counter(votes).most_common, just so you don't have to take my word for it. So then you've got the vote result, and we return vote_result.
And we've done K nearest neighbors. So now let's actually run this darn thing: we'll say result =, copy the function call, paste, and pass through the data set, the new_features we want to predict on, and k=3. No problem. Okay. Now let me print result. We'll also get the other print-out from the most_common line, but that's okay. Maybe we'll even get an error... nope, no error. Cool. And the print-out is actually a list containing a tuple — good thing I printed it, since I'd been misremembering the structure.
So what we get here is that the most-voted class is 'r'; k=3, and it turns out all three nearest neighbors were 'r', so of course the vote result is 'r'. We get the class back: this data point right here is an 'r' type, which, as we already saw, is what we expected. So good.
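Putting the whole episode together, the function might look roughly like this; the toy dataset, the new_features point, and the class-count warning from the previous part are assumptions here:

```python
import warnings
from collections import Counter
import numpy as np

def k_nearest_neighbors(data, predict, k=3):
    if len(data) >= k:
        # warning from the previous part: k should exceed the number of classes
        warnings.warn('K is set to a value less than total voting groups!')
    distances = []
    for group in data:
        for features in data[group]:
            # fast, dimension-agnostic Euclidean distance
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    votes = [i[1] for i in sorted(distances)[:k]]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical toy data matching the examples above
dataset = {'k': [[1, 2], [2, 3], [3, 1]], 'r': [[6, 5], [7, 7], [8, 6]]}
new_features = [5, 7]
result = k_nearest_neighbors(dataset, new_features, k=3)
```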
So now there are a couple of things we want to do. First of all, we might want to compare it to scikit-learn's version. The other thing we could do is graph this, just for example. So let me take the scatter code: copy, come down here, paste. I'll swap that point out for new_features and color it by... is it color? Yeah, color. I think we'll get away with that. So let's run it.
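The plotting described here might look like this sketch (non-interactive matplotlib backend so it runs headless; the dataset and result value are assumptions):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so this sketch runs anywhere
import matplotlib.pyplot as plt

dataset = {'k': [[1, 2], [2, 3], [3, 1]], 'r': [[6, 5], [7, 7], [8, 6]]}
new_features = [5, 7]
result = 'r'  # classification returned by k_nearest_neighbors above

for group in dataset:
    for features in dataset[group]:
        # class labels double as matplotlib colors: 'k' black, 'r' red
        plt.scatter(features[0], features[1], s=100, color=group)

# plot the new point larger, in the color of its predicted class
plt.scatter(new_features[0], new_features[1], s=200, color=result)
```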
Right, we'd just want to tweak the marker size, but that was the new point, and indeed it got classified into the correct group. So now what we want to do is compare that to scikit-learn's version of K nearest neighbors. We're going to run this K nearest neighbors algorithm against that breast cancer data and see how well we stack up against scikit-learn. That's what we'll be doing in the next tutorial.
If you have any questions, comments, concerns, or whatever, leave them below. Otherwise, as always, thanks for watching, thanks for all the support and subscriptions, and until next time.