

Final thoughts on K Nearest Neighbors - Practical Machine Learning Tutorial with Python p.19

What's going on everybody, and welcome to part 19 of our Machine Learning with Python tutorial series.
In this video we're going to be talking about K nearest neighbors one last time before we jettison off into the support vector machine. What we're talking about here is k, accuracy, and predictions.
The first question we might have is: if we increased k, would accuracy necessarily go up? So just for fun, let's say k is equal to 25.
And we run that. The previous accuracy was about 95. Now we got 88... or 98 rather, sorry, I misread it. Run it again: okay, 97. Run it again: 94. Run it again. Okay, you get the point.
So we're still doing pretty good. What if we raise it to 75? 95%. Okay, still doing pretty good. 97%. Still doing pretty good. Okay, 93 is one of our lower numbers.
Anyway, uh, we don't really seem to be losing much. I mean, the average around here is still about 97. So, you know, we could keep going.
I forget how many data points we actually have in this set. Let's see: we have about 600. So let's say k is equal to 200, so the vote uses the closest 200 points.
We see accuracy actually appears to be worse when we look at more points. Probably what's happening there is, recall that we actually have skewed classes: only about 30% of the samples were malignant and the other 70% were benign. So, you know, raising k doesn't necessarily do you any more of a favor.
5 is probably a pretty good guess. But depending on your dataset and all that, you might want to fiddle with k just to see if it makes a huge difference, and just test on whatever dataset you have. So anyway, that's k.
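As a rough sketch of that k-sweeping experiment (using sklearn's bundled WDBC breast cancer dataset as a stand-in for the tutorial's UCI CSV, so the exact numbers will differ):

```python
# Hedged sketch: sweep k and watch accuracy. load_breast_cancer is
# sklearn's built-in WDBC data (569 rows), standing in for the series' CSV.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

for k in (5, 25, 75, 200):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    # larger k is not automatically better; very large k drifts toward
    # always predicting the majority (benign) class
    print(k, round(clf.score(X_test, y_test), 3))
```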
Next we're going to be talking about confidence versus accuracy, and K nearest neighbors can give us confidence. Accuracy is: did we get the classification right?
But confidence actually comes from the classifier itself. The classifier can say: "Hey, 100% of the votes were in favor of this class being such-and-such." Or conversely it can come back and say: "Hey, we only have 60% of the votes. This is what the vote was, but our confidence is only 60%," right?
So, for example, you could create confidence, and confidence would basically be this, except instead of that 0th element it would be the 1th element, which is how many votes the winner got. And then it would be divided by whatever k was, right? So if you say k is five, you're hoping that this count is 5.
So anyways, we'll just leave it like this, simply because everything else, like accuracy, is in decimal form, so we're not multiplying by 100. So you have confidence there, and then you can literally just return confidence.
And we can... let's see: vote_result, and then confidence goes here. Let me make sure we get away with that.
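In code, the vote-plus-confidence idea looks roughly like this (a sketch: `distances` as a list of [distance, group] pairs follows the hand-rolled function from earlier parts, but this stand-alone helper is my own framing):

```python
from collections import Counter

def vote_with_confidence(distances, k=5):
    # take the k nearest neighbors' group labels
    votes = [group for _, group in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]  # winning class (0th element)
    # the 1th element is the winner's vote count; divide by k so confidence
    # stays in decimal form, like accuracy
    confidence = Counter(votes).most_common(1)[0][1] / k
    return vote_result, confidence

# 3 of the 5 nearest points vote 'r':
print(vote_with_confidence(
    [[0.1, 'r'], [0.2, 'r'], [0.3, 'k'], [0.4, 'r'], [0.5, 'k'], [0.9, 'k']]))
# -> ('r', 0.6)
```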
Right, so there you have that. Then basically what we could do is... you wouldn't want to do this all the time, but you could print, you know, not vote_result but confidence.
So there we go. Almost all of these were, you know, 1.0 confidence. This one was eighty percent, but this one was sixty percent.
And the interesting thing would maybe be to say vote, confidence if that's the case; else: print the confidence. So instead of every printout of confidence, we're going to comment this out, so all the numbers we're about to see are the confidence scores of the votes that we got incorrect.
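A sketch of that evaluation loop (the tiny 2-D points and the `predict` helper here are made up for a self-contained demo; in the series the classifier is the hand-rolled function built in earlier parts):

```python
from collections import Counter

def predict(features, train, k=5):
    # brute-force squared Euclidean distance to every training point
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(features, f)), g)
        for g in train for f in train[g])
    votes = [g for _, g in distances[:k]]
    top = Counter(votes).most_common(1)[0]
    return top[0], top[1] / k  # (vote, confidence)

train = {'r': [(1, 2), (2, 3), (3, 1)], 'k': [(6, 5), (7, 7), (8, 6)]}
test = {'r': [(2, 2), (5, 5)], 'k': [(7, 6)]}

correct = total = 0
for group in test:
    for features in test[group]:
        vote, confidence = predict(features, train)
        if group == vote:
            correct += 1
        else:
            # only misses get printed, so every number you see is the
            # classifier's confidence on a wrong answer
            print('missed with confidence:', confidence)
        total += 1
print('accuracy:', correct / total)
```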
So we run this. Okay, so actually this is a pretty good split here. Some of these were 100%-confidence votes that were incorrect, but then we also had some 0.6-confidence votes that were incorrect.
Let's go up here and maybe change the test size to 0.4. We're going to sacrifice a lot of data, so accuracy might actually go down, but the question in my mind was: would you see fewer 100%-confidence results among the incorrect evaluations? And sure enough, it appears so, at least in this test. It was a pretty even split, but the ones it was unconfident about were at 60%.
So you might even get to the point where, let's say you're telling someone whether or not they have cancer, and the confidence is only 60%. You might, you know, not say anything, right? You might say: "Well, the test was not confident," or whatever. So, you know, that might be a reason why you might want to take a little more care in your reports. Anyways.
So that's K's confidence, and now the comparison.
So what I want to do is... let's see, what I want to do is take this here, and basically we're going to wrap it in for i in range(25). Actually, yeah, I'll just be sloppy about it, whatever. Tab that over. And we're going to say accuracies equals an empty list.
And then here we're going to say accuracies.append() with that number. And when we're all done, we're going to print... we're going to say sum(accuracies) divided by len(accuracies). Let's just do 10. Well, do five first just to make sure the logic works.
We shouldn't have... we have to get rid of these printouts, right? Okay, so we'll get rid of the printouts.
And as we can see, at least out of 5 tests, we averaged 96.2% accuracy. The other thing we need to make sure of is that k is indeed 5. We'll stop printing accuracy, and we will stop...
Where's the other thing we were printing? Oh, it's here: else: print(confidence). Let's just delete that. Okay. So now.
Now we've got that, and we're getting different accuracies, just to show the whole process is being repeated over again.
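The averaging pattern, sketched with sklearn's bundled breast cancer data standing in for the CSV (a fresh random split on each pass plays the role of re-running the whole script):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

accuracies = []
for i in range(25):
    # no fixed random_state: each iteration gets a different split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    clf = KNeighborsClassifier(n_neighbors=5)
    clf.fit(X_train, y_train)
    accuracies.append(clf.score(X_test, y_test))

print(sum(accuracies) / len(accuracies))
```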
So let's run this 25 times. That may not be enough for you, but we'll run it 25 times. And then let's go to the other K nearest neighbors.
Let's say... you don't have to follow along if you don't want to, by the way; you can just see what the result is. I believe this is the other K nearest neighbors script, from part 14. What we can do is basically the exact same thing here, right? Let's take this, tab it over, and paste it inside the for i in range(25) loop.
I'm going to tab that in, and I'm going to stop this accuracy stuff. In fact, we don't need to do the predictions either. I'm going to leave them there just in case, because this is the old code actually. So... I usually upload this to GitHub. So anyway, we'll save that, and then we print sum(accuracies) divided by len(accuracies).
Okay. So out of 25 runs, we averaged 96.4% accuracy on the version we wrote. Now let's test sklearn's K nearest neighbors. Okay, so the average accuracy for sklearn's K nearest neighbors is 96.8%.
The only thing I will stress, though, is the following. Keep in mind that in both of these we're basically doing the exact same thing: we're loading in the dataframe, passing the data, training, testing, and doing 25 iterations.
So again, let's run this one. It's still running... I'll run this one too. And there's your answer, and then we're still waiting on the version we wrote. And we're still waiting. Oh man. Anyway. So...
So what's the difference? Okay, there are a couple of differences here. One is that sklearn's K nearest neighbors has a default parameter; actually, I think the default n_jobs is equal to 1. But a couple of things. One: K nearest neighbors can be threaded. Okay, so...
You do not have to test each prediction point sequentially. In testing you had 20% of the sample, right? So when you go to test, each set of features is its own unique snowflake, so you don't actually need to test linearly; you can test each one on its own. That means you can heavily thread K nearest neighbors on a bulk of predictions. So first of all, you can do that, but I think by default they do not.
Let me pull up sklearn's K neighbors documentation for you guys. So here we go: n_neighbors defaults to 5, sure enough. Radius: this is most likely where they're winning. They're beating us because they're most likely using that radius idea to ignore points outside the radius. n_jobs: this is what we're talking about. How many parallel jobs do you want to run for the neighbors search?
The default is 1, but if you set it to -1... where are we? Here we are. Then it will run as many jobs as possible. So that's another way to speed it up, but as you can see, it was actually already pretty fast. But this is not a high-performance Python tutorial; this is a machine learning tutorial. So if you want high performance, you'll most likely be using someone else's algorithm.
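The n_jobs switch is just a constructor argument, sketched here on sklearn's bundled data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# n_jobs=-1 parallelizes the neighbor search across all available cores;
# leaving it at the default runs it single-threaded
clf = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```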
But our accuracy was actually very similar to theirs. I would wager that if we ran, like, a million samples, we would find the accuracy was identical. It's just that 25 is not actually a decent sample size.
So anyways, that's all for K nearest neighbors, a very valid algorithm to use. A lot of people say it doesn't scale well, but it scales to very large datasets; it just doesn't scale to, like, terabytes of data. It's just not going to run very well on terabytes of data. But you can use a radius, you can thread it, and it will actually still run pretty well.
The accuracy is pretty good. One of the other upsides of the K nearest neighbors classifier is that it can work on both linear and nonlinear data. You'll see that makes a big difference, especially with something like the support vector machine, which is what we're covering next. But with algorithms we've already covered, you can actually use regression to do classification, so long as you're using linear data. So, for example, I've got a graph here.
So if I were to draw the best-fit line, the y-hat line, the regression line, for each of these two datasets, I would probably draw something like this for the blue dots, and then something like this for the orange dots. Then if I had an unknown data point, let's say maybe this point here: instead of using K nearest neighbors, what I could do is measure the squared error between the point and this regression line, and between the point and this regression line. Whichever one had the lesser squared error, that would be the class the new point belongs to, which in this case would be the orange-dot class, right?
So with regression, if the data is indeed linear, you can still do classification, right? You don't just have to forecast out.
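That regression-as-classifier idea can be sketched like this (all the point coordinates here are made up for illustration, since the transcript's graph isn't available):

```python
# Fit one best-fit line per class, then assign a new point to whichever
# line it sits closer to (smaller squared error).
import numpy as np

blue = np.array([[1, 2], [2, 3], [3, 5], [4, 6]], dtype=float)
orange = np.array([[1, 6], [2, 8], [3, 9], [4, 11]], dtype=float)

def fit_line(points):
    # least-squares slope and intercept for y = m*x + b
    m, b = np.polyfit(points[:, 0], points[:, 1], 1)
    return m, b

def classify(x, y, lines):
    # pick the class whose regression line gives the smaller squared error
    return min(lines, key=lambda name: (y - (lines[name][0] * x + lines[name][1])) ** 2)

lines = {'blue': fit_line(blue), 'orange': fit_line(orange)}
print(classify(3.5, 10.0, lines))  # -> orange (closer to the orange line)
```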
Now, what about a dataset like this, though? In this dataset, both the orange dots and the blue dots do have a best-fit line. I have no idea what they are, but they do have one. But even if we did draw the best-fit lines, the squared error and the coefficient of determination would be so poor that the actual, you know, confidence of this algorithm would be pretty bad overall. So you can't really do classification on a dataset like this; this is nonlinear data.
But you can do K nearest neighbors on nonlinear data. So, you know, we can still classify this point using K nearest neighbors, because then we're just measuring the distance between this point and, probably, this point and this point; those are our closest three points. If K were equal to three, we would have a two-to-one vote, so we would vote that it belongs to the orange-dot class, right?
So anyways, that's one of the other upsides to the K nearest neighbors classifier.
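To make the nonlinear point concrete, here's a quick sketch on a toy dataset that no single straight line can separate (make_circles is a stock sklearn generator, not data from the video):

```python
from sklearn.datasets import make_circles
from sklearn.neighbors import KNeighborsClassifier

# two concentric noisy rings: a classic nonlinear class boundary
X, y = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)
# KNN handles this easily because it only looks at local distances
print(clf.score(X, y))
```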
That's it for K nearest neighbors. The next topic we're going to be talking about is the support vector machine, so that's what we'll be getting into. If you have any questions, comments, or concerns, feel free to leave them below. Otherwise, as always: thanks for watching, thanks for all the support and subscriptions, and until next time.