What is going on everybody and welcome to part 17 of our machine learning tutorial series.
We’ve been working on K nearest neighbors, so let’s get into it.
Don’t want to waste any more time here. So
we’ve started creating this K nearest neighbors function.
All we’ve done so far though is just warn the user when they’re trying to do something stupid.
So now we actually need to
create the K nearest neighbors part: calculate
who the K nearest neighbors are, what their classes are, and therefore what the class of whatever we’re predicting is,
based on the comparison to the data. OK.
So what do we need to do?
How could we do this?
How can we compare a data point to the other data points
to find out who is the closest data point
and therein lies the problem with K nearest neighbors.
To do this we have to compare the prediction point
to all other points in the data set.
That’s the problem with K nearest neighbors.
Now there are actually some additional things we can do.
You can use what’s called a radius:
you look within a certain radius of a point,
and then you can kind of forget the outliers.
That can save you a bit on the Euclidean distance calculations,
because you can find out whether a point is inside or outside a radius
much more cheaply than you can calculate the Euclidean distance. Okay.
So anyways, we’re not going to get into that but just keep in mind
that this is the problem with K nearest neighbors.
So what we’re going to do is we’re just going to build, basically, a list of lists
of the distances.
So what we’re going to say is
something like this: distances = [ ].
And then you’re going to say for group in data.
And so that would be, keep in mind, data
is whatever you pass through here, but in this case it would be this data set.
So for each group, so for each class, basically.
Then, nested inside that, another for loop: for features in group.
Or rather, for features in data[group].
So this was now iterating through the features here.
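As a sketch, that loop structure might look like this. The dataset shape (a dict mapping class labels to lists of points) follows what we built earlier in the series; the actual values here are illustrative:

```python
# Illustrative two-class dataset, shaped like the one from earlier
# in the series: class label -> list of [x, y] feature points.
dataset = {'k': [[1, 2], [2, 3], [3, 1]],
           'r': [[6, 5], [7, 7], [8, 6]]}

for group in dataset:                # each class label: 'k', then 'r'
    for features in dataset[group]:  # each feature point in that class
        print(group, features)
```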
What do we want to do? Well, we need to calculate Euclidean distance, right?
And I got rid of the Euclidean distance code. So,
to save you time, I’m not going to write it out; I’m just going to copy and paste here.
Because we’re not going to actually use this version
but going with the version that I wrote previously we would say that, right?
euclidean_distance = sqrt((features[0]-predict[0])**2 + (features[1]-predict[1])**2),
taking the square root of that whole sum, right?
That is the equivalent of what we wrote earlier.
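A minimal sketch of that hand-written, two-dimensional version, using illustrative points:

```python
from math import sqrt

# Hard-coded to exactly two feature dimensions
features = [3, 4]
predict = [6, 8]
euclidean_distance = sqrt((features[0] - predict[0])**2
                          + (features[1] - predict[1])**2)
print(euclidean_distance)  # 5.0
```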
Unfortunately, that is the algorithm,
but it’s not a very fast calculation.
I mean, on our data set it would be lightning-fast,
and probably on the breast cancer data set
it would be pretty fast, but on bigger data, no.
So we definitely want to change it.
So what we’re going to change it to is, for example, this would be a faster version.
Also, the other problem here is:
this works with 2 dimensions, but what if you had 3 dimensions, right?
This will not change; it’s hard-coded to only do K nearest neighbors on 2 feature dimensions.
There are a few things that we could do to counteract this.
Specifically, a one-line for loop could figure it out,
but to account for a dynamic number of features you can do something like this.
So first of all, we’re using the numpy square root, and we’re using the numpy version of sum.
And then we’re also using numpy arrays
where we can actually just subtract numpy arrays, square those numpy arrays.
And perfect we get the same value for euclidean distance here as well.
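That faster, dimension-agnostic version might be sketched like this (the points are illustrative):

```python
import numpy as np

features = [3, 4]
predict = [6, 8]
# Subtracting numpy arrays works element-wise, so this handles
# any number of feature dimensions, not just two.
euclidean_distance = np.sqrt(np.sum((np.array(features) - np.array(predict))**2))
print(euclidean_distance)  # 5.0
```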
But we’re going to go ahead and not use that one either
because it turns out that numpy has an even simpler version
and that would be the following.
I guess I’ll write this one out,
because this is one you’re going to need to write out.
So it’s euclidean_distance = np.linalg
linalg is basically short for linear algebra, right?
And the norm is basically just another name for what we’re going to do here, which is Euclidean distance.
So this is the norm of
np.array(features), this is features corresponding with the features in the loop,
minus np.array(predict). Okay.
And obviously this no longer looks like the Euclidean distance formula.
That’s why I did not want to use it initially,
because using it is kind of like cheating; it’s much higher level.
So anyways, we’re going to use it because it’s faster. But I did want to show you
the true way to write it if you needed to.
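With the same illustrative points as before, np.linalg.norm gives the identical result:

```python
import numpy as np

features = [3, 4]
predict = [6, 8]
# np.linalg.norm of the difference vector is the Euclidean distance
euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
print(euclidean_distance)  # 5.0
```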
Okay. So we’ve done that.
And now what we’re going to say is
Let’s do… whoops, hit my mouse. Let’s do it over here.
So we’ve got…Actually what we need to do is now we need to do distances.append.
We’re going to append the following. We’re going to append the list.
So we’re going to say whatever the euclidean_distance happens to be.
And then we’re going to append the group.
So it’s the distance and then the group.
So it’s going to be a list of lists
where the first item in the list is the actual distance. Second one’s the group.
This way we can sort that list, take the top three items in it,
and then get the groups from those. That’s my thinking.
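Inside the loop, that append might look like this (the dataset and the prediction point are illustrative):

```python
import numpy as np

dataset = {'k': [[1, 2], [2, 3], [3, 1]],
           'r': [[6, 5], [7, 7], [8, 6]]}
predict = [5, 7]

distances = []
for group in dataset:
    for features in dataset[group]:
        euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
        # First item: the distance; second item: the group it belongs to
        distances.append([euclidean_distance, group])

print(sorted(distances)[0])  # the closest point's entry comes first after sorting
```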
So now we’re going to come down here and we’re going to say
the votes = [ ]. So, I guess, let me explain how I’m going to do this first.
So again I’m going to use a one-liner for loop but I’ll explain
I guess I’ll explain…I’ll write this one out and then I’ll explain it.
So basically it is going to be
votes = [i[1]
the votes are going to be i[1].
So again, i[1] is the group.
Once we know what the top three distances are, we don’t actually care what the distances themselves are.
We only care about them because we want to be able to rank them
and take the top three, the best, closest distances.
We don’t care what the distance was. So I want for i in sorted
(distances)[:k], right? Because after we’ve sorted the distances,
we only care about the first k of them.
So the whole thing reads votes = [i[1] for i in sorted(distances)[:k]].
This just allows us to populate votes as a list
that is simply the groups, right?
Right, the categories, labels, whatever you want to call it. Classes.
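So the one-liner, with an illustrative distances list:

```python
# Sorting lists of [distance, group] sorts by distance first;
# i[1] keeps only the group of each of the k closest points.
distances = [[6.4, 'k'], [2.2, 'r'], [5.0, 'k'],
             [2.0, 'r'], [3.2, 'r'], [6.3, 'k']]
k = 3
votes = [i[1] for i in sorted(distances)[:k]]
print(votes)  # ['r', 'r', 'r']
```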
So there you have votes, and then we’re going to say the vote result is equal to Counter(votes).most_common(1).
How many of the most common do we care about?
We just care about the first one, the one that is the most common.
And then we’re going to take [0][0].
So most_common gives you… first of all,
it comes as a list of a tuple.
So you take [0] first, you get the tuple, and then
you take [0] again, because that tuple tells you the most common group
and then how many there were. So there are two things in it.
I think it’s actually a list of tuples; I can’t remember.
Anyway, it’s like that, so we’re going to take those.
So, in fact, we could print Counter(votes).most_common,
just so you can see it and
not take my word for it, okay?
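Here’s what most_common actually returns, on an illustrative votes list:

```python
from collections import Counter

votes = ['r', 'r', 'k']
# most_common(1) returns a list holding one (group, count) tuple
print(Counter(votes).most_common(1))  # [('r', 2)]
# [0] gets the tuple, the second [0] gets the group out of it
vote_result = Counter(votes).most_common(1)[0][0]
print(vote_result)  # r
```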
So then you’ve got the vote result. We return the vote result.
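Putting the pieces together, the whole function so far might look like this. The dataset and new point are illustrative stand-ins for the ones built earlier in the series:

```python
import warnings
from collections import Counter
import numpy as np

def k_nearest_neighbors(data, predict, k=3):
    # Warn the user when k isn't larger than the number of classes
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    # Keep only the groups of the k closest points and take the majority vote
    votes = [i[1] for i in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result

dataset = {'k': [[1, 2], [2, 3], [3, 1]],
           'r': [[6, 5], [7, 7], [8, 6]]}
new_features = [5, 7]
result = k_nearest_neighbors(dataset, new_features, k=3)
print(result)  # r
```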
And we’ve done K nearest neighbors. So now
Let’s say we actually want to run this darn thing.
We’re going to say result =
And let’s just copy this. Copy. Paste. We’re going to pass through the data set.
We want to predict on new_features, and we’re going to say k=3. No problem. Okay.
And now let me print results.
So we’ll also get this other print out but that’ll be okay.
And then maybe we’ll get an error… Nope. No error. Cool.
So it’s actually a list of a tuple. Okay. Good thing I printed it out; I guess I was making it up.
So what we get here is that
the most voted thing was ‘r’. Turns out with k=3, all three were ‘r’.
And of course the vote result is ‘r’.
So we get the class back: this data point right here is an ‘r’ type, which,
as we already saw, we kind of expected to be the case.
So good. So now
there’s a couple of things that we want to do.
First of all we might want to compare it to scikit-learn’s version.
The other thing we could do is, we could graph this, just
for example. So let me… we can take this, cut… copy… come down here, paste.
And I’ll take this away and
new_features. So instead, yeah, we can pull that out, and then… Is it color? Yeah, color.
I think we’ll get away with that. So if we do that.
Right, we just need to change the size, but there’s the new point, right?
So indeed, it classified it into the correct group. So now
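The graphing part might be sketched like this, again with illustrative data, coloring the new point by the class the vote returned (conveniently, ‘k’ and ‘r’ are valid matplotlib color codes):

```python
import matplotlib.pyplot as plt

dataset = {'k': [[1, 2], [2, 3], [3, 1]],
           'r': [[6, 5], [7, 7], [8, 6]]}
new_features = [5, 7]
result = 'r'  # the class our vote returned for this point

# 'k' (black) and 'r' (red) double as matplotlib colors
for group in dataset:
    for features in dataset[group]:
        plt.scatter(features[0], features[1], s=100, color=group)
# Plot the new point a bit bigger, colored by its predicted class
plt.scatter(new_features[0], new_features[1], s=200, color=result)
plt.show()
```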
what we want to do is we want to compare that to scikit-learn’s version of K nearest neighbors.
Okay. So we’re going to use this K nearest neighbors algorithm against that breast cancer data.
And we’re going to see how well we compare to scikit-learn.
So that’s what we’re going to be doing in the next tutorial.
So if you have any questions, comments, concerns, or whatever, leave them below.
Otherwise, as always, thanks for watching, and thanks for all the support and subscriptions. Until next time.