
Final thoughts on K Nearest Neighbors - Practical Machine Learning Tutorial with Python p.19

What's going on everybody, and welcome to part 19 of our Machine Learning with Python tutorial series. In this video we're going to be talking about K Nearest Neighbors one last time before we jettison off into the Support Vector Machine. What we're talking about here is K, accuracy, and predictions.
The first question we might have is: well, what if we increased k? Would accuracy necessarily go up? So just for fun, let's say k is equal to 25, and we run that. The previous accuracy was about 95. Now we got 88, or 98 rather. Sorry, I can't read for crap. Run it again. Okay, 97. Run it again. 94. Run it again. Okay, you get the point. So we're still doing pretty good. What if we raise it to 75? 95%. Okay, still doing pretty good. 97%. Still doing pretty good. Okay, 93 is one of our lower numbers.
Anyway, we don't really seem… I mean, the average around here is still about 97, so, you know, we could keep going. I forget how many data points we actually have in this set; let's see, about 600. So if we said k is equal to 200, so the closest 200 points, we see accuracy actually appears to be worse. Looking at more points… what's probably happening there is, recall that we actually have a skewed dataset: only about 30% of the samples were malignant and the other 70% were benign. So adding to k doesn't necessarily do you any more of a favor. 5 is probably pretty good, just a guess. But depending on your dataset and all that, you might want to fiddle with k just to see if it makes a huge difference, and just test on whatever dataset you might have. So anyway, that's k.
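If you want to reproduce that k-fiddling experiment yourself, a minimal sketch along these lines works. It uses scikit-learn's classifier for brevity rather than the hand-rolled function, and it assumes the breast-cancer-wisconsin.data file with the id/class header row used earlier in the series:

```python
import numpy as np
import pandas as pd
from sklearn import model_selection, neighbors

# Assumed file and column names from earlier parts of the series.
df = pd.read_csv('breast-cancer-wisconsin.data')
df.replace('?', -99999, inplace=True)
df.drop(['id'], axis=1, inplace=True)

X = np.array(df.drop(['class'], axis=1))
y = np.array(df['class'])

# Try a few values of k and see whether accuracy actually improves.
for k in [5, 25, 75, 200]:
    X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)
    clf = neighbors.KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    print(k, clf.score(X_test, y_test))
```

Each run reshuffles the split, so the numbers bounce around a bit, just as they do in the video.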
Next we're going to be talking about confidence versus accuracy, and K Nearest Neighbors can give us confidence. Accuracy is: did we get the classification right? But confidence actually comes from the classifier itself. The classifier can say, "Hey, we have 100% of the votes in favor of this class being such-and-such." Or conversely it can come back and say, "Hey, we only have 60% of the votes. This is what the vote was, but our confidence is only 60%," right?
So, for example, you could create confidence, and confidence would basically be this same expression, except instead of that 0th element it would be the 1st element, which is how many votes the winner got, and then that gets divided by whatever k was, right? So if you say k is five, you're hoping that this number is 5. Anyways, we'll just leave it like this, simply because everything else, like accuracy, is in decimal form, so we're not actually multiplying by 100. So you can have confidence there, and then you literally could just return the confidence as well. So we return the vote, and then confidence goes here. Let me make sure we get away with that.
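For reference, here is roughly what the hand-rolled k_nearest_neighbors function from the earlier parts looks like once the confidence line is added. Counter(votes).most_common(1)[0][1] is the winning vote count (the "1st element" mentioned above) divided by k; the surrounding details are reconstructed from earlier parts of the series and may differ slightly from the video:

```python
import warnings
from collections import Counter
import numpy as np

def k_nearest_neighbors(data, predict, k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    distances = []
    for group in data:
        for features in data[group]:
            # Euclidean distance between a known point and the point to classify.
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    # Classes of the k closest points, then a simple majority vote.
    votes = [i[1] for i in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    # Confidence: share of the k votes that went to the winning class.
    confidence = Counter(votes).most_common(1)[0][1] / k
    return vote_result, confidence
```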
Right, so there you have that. Then basically what we could do, and you wouldn't want to do this all the time, is print not just vote_result but the confidence as well. So there we go: almost all of these were 1.0 confidence, but this one was eighty percent and this one was sixty percent. The interesting thing would maybe be to only print the confidence when the vote was wrong: if the vote matches the actual class, fine; else, print the confidence. And we're going to comment out the other printout. So all the numbers we're about to see are the confidence scores of the votes that we got incorrect.
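In the test loop, that change looks something like this. It's a sketch that assumes the train_set and test_set dictionaries built earlier in the series (class label mapped to a list of feature rows):

```python
correct = 0
total = 0
for group in test_set:
    for data in test_set[group]:
        vote, confidence = k_nearest_neighbors(train_set, data, k=5)
        if group == vote:
            correct += 1
        else:
            # Only show the confidence of the votes we got wrong.
            print(confidence)
        total += 1
print('Accuracy:', correct / total)
```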
So we run this. Okay, this is actually a pretty good split here. These were votes that were 100% confident but incorrect, and then we also had some incorrect votes at 60% confidence. Let's go up here and maybe change the test size to 0.4. We're going to sacrifice a lot of data, so accuracy might actually go down, but the question in my mind was: would you see fewer 100%-confidence votes among the incorrect evaluations? And sure enough, it appears so, at least on this test. It was a pretty even split, but the ones it was unconfident about were at 60%.
You might even get to a point where, let's say you have a patient, and you're telling them whether or not they have cancer. Okay, you're telling them whether or not they have cancer, and the confidence is only 60%. You might, you know, not say anything, right? You might say, "Well, the test was not accurate," or whatever. Or rather that the test was not confident, right? So that might be a reason why you'd want to take a little more care in your reports. Anyways.
So that's K Nearest Neighbors' confidence, and now the comparison. What I want to do is take this block here and basically wrap it in a for i in range(25) loop. Actually, yeah, I'll just be sloppy about it, whatever. Tab that over. And we're going to say accuracies equals an empty list, and then in here we're going to say accuracies.append() with that accuracy number. Then when we're all done, we're going to print len()… no, sum(accuracies), divided by len(accuracies). Let's just do 10… well, we'll do five first just to make sure the logic works. We have to get rid of these printouts, right? Okay, so we'll get rid of the printouts. And as we can see, at least out of 5 tests we averaged 96.2% accuracy.
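Put together, the averaging harness for the hand-written version looks roughly like this. It's a sketch that reuses the k_nearest_neighbors function above and the shuffle/split logic from earlier parts of the series, so details such as the exact test_size may not match the video:

```python
import random
import pandas as pd

accuracies = []

for i in range(25):
    # Reload and reshuffle the data each iteration so every run is a fresh split.
    df = pd.read_csv('breast-cancer-wisconsin.data')
    df.replace('?', -99999, inplace=True)
    df.drop(['id'], axis=1, inplace=True)
    full_data = df.astype(float).values.tolist()
    random.shuffle(full_data)

    test_size = 0.2
    train_set = {2: [], 4: []}   # 2 = benign, 4 = malignant in this dataset
    test_set = {2: [], 4: []}
    train_data = full_data[:-int(test_size * len(full_data))]
    test_data = full_data[-int(test_size * len(full_data)):]

    for row in train_data:
        train_set[row[-1]].append(row[:-1])
    for row in test_data:
        test_set[row[-1]].append(row[:-1])

    correct, total = 0, 0
    for group in test_set:
        for data in test_set[group]:
            vote, confidence = k_nearest_neighbors(train_set, data, k=5)
            if group == vote:
                correct += 1
            total += 1
    accuracies.append(correct / total)

print(sum(accuracies) / len(accuracies))
```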
The other thing we need to make sure of is that k is indeed 5. We'll stop printing the per-run accuracy, and we will stop… where's the other thing we were printing? Oh, it's here: the else: print(confidence). Let's just delete that. Okay, so now we've got that, and we're getting different accuracies each run, which just shows the whole process really is being repeated. So let's run this 25 times. That may not be enough for you, but we'll run it 25 times. Then let's go to the other K Nearest Neighbors. You don't have to follow along if you don't want to, by the way; you can just watch what the result is. I believe this is the other K Nearest Neighbors script, from part 14.
What we can do is basically the exact same thing here, right? Let's take this, tab it over, and paste it inside a for i in range(25) loop. I'm going to stop printing this accuracy stuff, and in fact we don't need to do the predictions either. I'm going to leave them there just in case, because this is actually the old code, and I usually upload this to GitHub. So anyway, we'll save that, and then accuracies.append(accuracy), and then we print sum(accuracies) divided by len(accuracies). Okay, so out of 25 runs we got 96.4% accuracy on the version we wrote. Now let's test scikit-learn's K Nearest Neighbors. Okay, so this is the average accuracy for scikit-learn's K Nearest Neighbors: 96.8%. The only thing I will stress, though, is the following. Keep in mind that on both of these we're basically doing the exact same thing: we're loading in the data frame, we're passing the data, we're training, we're testing, and we're doing 25 iterations.
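The scikit-learn side of that comparison is the same harness with the classifier swapped in. Again a sketch, assuming the same data file and column names as above:

```python
import numpy as np
import pandas as pd
from sklearn import model_selection, neighbors

accuracies = []

for i in range(25):
    df = pd.read_csv('breast-cancer-wisconsin.data')
    df.replace('?', -99999, inplace=True)
    df.drop(['id'], axis=1, inplace=True)

    X = np.array(df.drop(['class'], axis=1))
    y = np.array(df['class'])
    X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)

    clf = neighbors.KNeighborsClassifier()   # n_neighbors defaults to 5
    clf.fit(X_train, y_train)
    accuracies.append(clf.score(X_test, y_test))

print(sum(accuracies) / len(accuracies))
```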
So again, let's run this one. It's still running, and I'm going to go ahead… it's still running, so I'll run this one too. And there's your answer there, and then we're still waiting on the version we wrote. And we're still waiting. Oh man. Anyway.
So what's the difference? Okay, there are a couple of differences here. One is that scikit-learn's K Nearest Neighbors has a default parameter… actually I think the default n_jobs is equal to one, but a couple of things. One: K Nearest Neighbors can be threaded. You do not have to test each prediction point one after another. In testing you had 20% of the sample as test points, right? And when you go to test, each set of features is its own unique snowflake, so you don't actually need to test linearly; you can basically test each one on its own. So you can heavily thread K Nearest Neighbors on a bulk of predictions. First of all, you can do that, but I think the default is that they do not. Let me pull up sklearn's KNeighborsClassifier for you guys. So here we go: n_neighbors defaults to 5, sure enough. Radius: this is most likely where they're winning, where they're beating us, because they're most likely using that idea of a radius to ignore points that are outside the radius. And n_jobs, this is what we're talking about: how many parallel jobs do you want to run for the neighbors search? The default is one, but if you set that to negative one… where are we? Here we are… then it will run as many jobs as possible. So that's another way to actually speed it up, but as you can see, it was already pretty fast.
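Concretely, the only change needed on the scikit-learn side is the n_jobs argument; n_jobs=-1 asks the classifier to use all available cores for the neighbor search:

```python
from sklearn import neighbors

# Thread the neighbor search across all available cores.
clf = neighbors.KNeighborsClassifier(n_neighbors=5, n_jobs=-1)
```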
But this is not a high-performance Python tutorial; it's a machine learning tutorial. So if you want high performance, you'll most likely be using someone else's algorithm. But our accuracy was actually very similar to theirs. I would wager that if we ran something like a million samples, we would find the accuracy was identical; it's just that 25 runs is not actually a decent sample size.
So anyways, that's all for K Nearest Neighbors, a very valid algorithm to use. A lot of people say it doesn't scale well, but it does scale to very large datasets; it just doesn't scale to something like terabytes of data. It's simply not going to run very well on terabytes of data. But you can use a radius, you can thread it, and it will actually still run pretty well, and the accuracy is pretty good. One of the other upsides of the K Nearest Neighbors classifier is that it can work on both linear and nonlinear data. You'll see that makes a big difference, especially with something like the Support Vector Machine, which is what we're covering next.
But with the algorithms we've already covered, you can actually use regression to do classification, so long as you're using linear data. For example, I've got a graph here. If I were to draw the best-fit line, the y-hat line, the regression line, for these two datasets, I would probably draw something like this for the blue dots, and then something like this for the orange dots. Then if I had an unknown data point, let's say maybe this point here, then without using K Nearest Neighbors, what I could do instead is measure the squared error between that point and this regression line and that regression line. Whichever line gives the lesser squared error, that is the class the new point would belong to, which in this case would be the orange-dot class, right? So with regression, if the data is indeed linear, you can still do classification, right? You don't just have to forecast out.
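One way to sketch that "regression as a classifier" idea in code, purely illustrative and not code from the video, is to fit one best-fit line per class and assign a new point to whichever line it sits closest to. The toy data and names here are hypothetical; the slope/intercept formula is the ordinary least-squares one used earlier in the series:

```python
import numpy as np

def best_fit_line(xs, ys):
    # Ordinary least-squares slope and intercept.
    m = (np.mean(xs) * np.mean(ys) - np.mean(xs * ys)) / (np.mean(xs) ** 2 - np.mean(xs ** 2))
    b = np.mean(ys) - m * np.mean(xs)
    return m, b

# Hypothetical, roughly linear toy data for two classes.
blue_x, blue_y = np.array([1, 2, 3, 4, 5.]), np.array([5, 6, 6, 7, 8.])
orange_x, orange_y = np.array([1, 2, 3, 4, 5.]), np.array([1, 2, 2, 3, 4.])

lines = {'blue': best_fit_line(blue_x, blue_y),
         'orange': best_fit_line(orange_x, orange_y)}

new_x, new_y = 3.5, 2.5
# Assign the point to whichever class's regression line gives the smaller squared error.
errors = {cls: (new_y - (m * new_x + b)) ** 2 for cls, (m, b) in lines.items()}
print(min(errors, key=errors.get))   # -> 'orange' for this toy point
```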
Now, what about a dataset like this, though? In this dataset both the orange dots and the blue dots do have a best-fit line. I have no idea what those lines are, but they do exist. Even if we did draw the best-fit lines, though, the squared error, and the coefficient of determination, would be so poor that the actual confidence of this approach would be pretty bad overall. So you can't really do classification on a dataset like this; this is nonlinear data. But you can do K Nearest Neighbors on nonlinear data. We can still classify this point using K Nearest Neighbors, because then we're just measuring the distance between this point and, probably, this point and this point, those being our closest three points. If k were equal to three, we'd have a two-to-one vote, so we would vote that it belongs to the orange-dot class, right? So anyways, that's one of the other upsides of the K Nearest Neighbors classifier.
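As a quick illustration of that point, here is a hypothetical nonlinear toy dataset, two concentric rings, where no single straight line fits either class well, yet a K Nearest Neighbors vote still classifies new points cleanly:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two concentric rings of points (nonlinear classes).
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 100)
inner = np.c_[np.cos(theta[:50]), np.sin(theta[:50])]          # class 0, radius ~1
outer = np.c_[3 * np.cos(theta[50:]), 3 * np.sin(theta[50:])]  # class 1, radius ~3

X = np.vstack([inner, outer])
y = np.array([0] * 50 + [1] * 50)

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)
print(clf.predict([[0.0, 1.1]]))   # near the inner ring -> class 0
print(clf.predict([[2.9, 0.2]]))   # near the outer ring -> class 1
```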
That's it for K Nearest Neighbors. The next topic we're going to be talking about is the Support Vector Machine, so that's what we're getting into. If you have any questions, comments, or concerns, whatever, feel free to leave them below. Otherwise, as always, thanks for watching, thanks for all the support and subscriptions, and until next time.


Translation info

Video overview

This lesson is the final part of the K Nearest Neighbors material, mainly covering accuracy and predictions (confidence).

Transcribed by

[B]刀子

Translated by

长安小盆友

Reviewed by

审核员1024

Video source

https://www.youtube.com/watch?v=r_D5TTV9-2c
