What’s going on everybody and welcome to part 19
of our machine learning with Python tutorial series.
In this video we’re going to be talking about K nearest neighbors
one last time before we jettison off
into the support vector machine.
What we're talking about here is k, accuracy, and predictions.
The first question we might have is: if we increased k,
would accuracy necessarily go up?
So just for fun, let's say k is equal to 25, and we run that. The previous accuracy was about 95%.
Now we got 98 (not 88; sorry, I can't read for crap).
Run it again: okay, 97. Run it again: 94. Run it again.
Okay, you get the point; we're still doing pretty well.
What if we raise it to 75? 95%. Okay, still doing pretty well.
97%. Still doing pretty good.
Okay, 93 is one of our lower numbers.
Anyway, we don't really seem to be losing anything;
the average around here is still about 97.
So, you know, we could keep going.
I forget how many data points we actually have in this set; let's see...
about 600. So if we set k equal to 200,
we're voting with the closest 200 points.
We see accuracy actually appears to be worse.
So looking at more points isn't helping, and probably what's happening there is...
recall that we have a skewed dataset: only about 30% of the samples were malignant
and the other 70% were benign. With k that large, the majority class starts to dominate the vote.
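If you want to sanity-check that skew yourself, a quick sketch; this assumes the dataframe setup from the earlier parts, where the file has a 'class' column and 2 means benign, 4 means malignant:

```python
import pandas as pd

# Assumes the breast-cancer-wisconsin.data file with the column
# names added back in the earlier parts; 2 = benign, 4 = malignant.
df = pd.read_csv('breast-cancer-wisconsin.data')
print(df['class'].value_counts(normalize=True))
```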
So, you know, adding to k doesn't necessarily do you any more favors.
5 is probably a pretty good default, but depending on your dataset
you might want to fiddle with k just to see if it makes a big difference,
and test on whatever dataset you have. So anyway, that's k.
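If you'd rather sweep k programmatically than re-run by hand, something like this works; it assumes the k_nearest_neighbors function and the train_set/test_set dictionaries we built in the previous parts:

```python
# Rough sketch: re-run the test loop for several values of k and
# compare accuracies. Assumes k_nearest_neighbors, train_set and
# test_set already exist from the previous parts.
for k in [5, 25, 75, 200]:
    correct, total = 0, 0
    for group in test_set:
        for features in test_set[group]:
            vote = k_nearest_neighbors(train_set, features, k=k)
            if group == vote:
                correct += 1
            total += 1
    print(k, 'accuracy:', correct / total)
```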
Next we're going to be talking about confidence versus accuracy,
because K nearest neighbors can give us confidence.
Accuracy is: did we get the classification right?
But confidence actually comes from the classifier itself. The classifier can say, "Hey,
100% of the votes were in favor of this class being such-and-such,"
or conversely, "Hey, we only got 60% of the votes.
This is what the vote was, but our confidence is only 60%," right?
So, for example, you could create confidence, and confidence would basically be
this same most_common expression as the vote, except instead of that 0th element (the winning class)
it would be element 1, which is how many votes the winner got.
And then that gets divided by whatever k was, right?
So if you say k is five, you're hoping that number is 5.
Anyway, we'll leave it like this, as a decimal, simply because
everything else, like accuracy, is in decimal form, so we're not actually multiplying by 100.
So you have confidence there, and then you can literally just return it alongside the vote:
vote first, then confidence. Let me make sure we get away with that; the modified function is sketched below.
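Here's roughly what the whole function looks like at this point. It's the classifier we built in the earlier parts, with the confidence line and the extra return value added:

```python
from collections import Counter
import warnings
import numpy as np

def k_nearest_neighbors(data, predict, k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(
                np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    votes = [i[1] for i in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    # most_common(1)[0][1] is how many of the k votes went to the
    # winning class; dividing by k gives a 0-1 confidence,
    # e.g. 3 of 5 matching votes -> 0.6.
    confidence = Counter(votes).most_common(1)[0][1] / k
    return vote_result, confidence
```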
Right, so there you have that. Then basically what we could do,
and you wouldn't want to do this all the time,
is print, you know, not vote_result, but confidence.
So there we go. Almost all of these were, you know, 1.0 confidence.
This one was 80%, but this one was 60%.
The interesting thing would maybe be to only print the confidence when the vote was wrong:
if the vote matches the actual class, fine; else, print the confidence.
So we're going to comment out the print of every confidence.
That way, all the numbers we're about to see
are the confidence scores of the votes that we got incorrect.
So we run this.
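The test loop at this point looks something like this sketch, again assuming the train_set/test_set dictionaries from before:

```python
# Count accuracy as before, but only print the confidence for the
# votes we got wrong.
correct, total = 0, 0
for group in test_set:
    for features in test_set[group]:
        vote, confidence = k_nearest_neighbors(train_set, features, k=5)
        if group == vote:
            correct += 1
        else:
            print(confidence)  # confidence of an incorrect vote
        total += 1
print('Accuracy:', correct / total)
```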
Okay, so actually this is a pretty good split here.
Some of these were 100%-confidence votes that were incorrect,
but we also had some 60%-confidence votes that were incorrect.
Let's go up here and maybe change the test size to 0.4.
We're going to sacrifice a lot of training data.
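For reference, the shuffle-and-split from the earlier parts looks roughly like this, with the test size bumped up:

```python
import random

test_size = 0.4  # was 0.2; a bigger test slice means less training data
random.shuffle(full_data)  # full_data: list of rows from the dataframe
train_data = full_data[:-int(test_size * len(full_data))]
test_data = full_data[-int(test_size * len(full_data)):]
```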
So accuracy might actually go down, but the question in my mind was:
would you see fewer 100%-confidence results among the incorrect evaluations?
Sure enough, it appears so, at least on this test.
It was a pretty even split, but the votes it was unconfident about were at 60%.
You might even get to the point where, say, you're telling someone
whether or not they have cancer, and the confidence is only 60%.
You might, you know, not say anything definitive, right?
You might say the test was not confident, or whatever.
So that might be a reason why you might want to take a little more care
in your reports. Anyways.
So that's confidence; now for the comparison.
So what I want to do is take this whole testing block
and wrap it in a loop: for i in range(25).
Yeah, I'll just be sloppy about it, whatever; tab that over.
Above the loop we'll say accuracies equals an empty list,
and at the end of each pass we'll say accuracies.append with that accuracy number.
Then, when we're all done, we print sum(accuracies)
divided by len(accuracies).
Let's do 10 iterations; well, we'll do five first just to make sure the logic works.
We have to get rid of these printouts first, right?
Okay, so we'll get rid of the printouts.
As we can see, out of 5 tests we average 96.2% accuracy.
The other thing we need to make sure of: k is indeed 5.
We'll stop printing accuracy, and we'll stop...
where's the other thing we were printing? Oh, it's here:
else: print(confidence). Let's just delete that.
Okay, so now we've got that, and we're getting different accuracies each pass,
which just shows the whole process really is being repeated.
So let's run this 25 times; the full wrapper is sketched below.
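Put together, the averaging wrapper looks roughly like this, reusing the pieces from the earlier parts (the class labels 2 and 4 come from this dataset):

```python
# Sketch: repeat the whole shuffle/split/test pass 25 times and
# average the accuracies.
accuracies = []
for i in range(25):
    random.shuffle(full_data)
    train_data = full_data[:-int(test_size * len(full_data))]
    test_data = full_data[-int(test_size * len(full_data)):]
    train_set = {2: [], 4: []}
    test_set = {2: [], 4: []}
    for row in train_data:
        train_set[row[-1]].append(row[:-1])  # last column is the class
    for row in test_data:
        test_set[row[-1]].append(row[:-1])
    correct, total = 0, 0
    for group in test_set:
        for features in test_set[group]:
            vote, confidence = k_nearest_neighbors(train_set, features, k=5)
            if group == vote:
                correct += 1
            total += 1
    accuracies.append(correct / total)

print(sum(accuracies) / len(accuracies))
```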
25 runs may not be enough for you, but we'll go with that.
Then let's move over to the other K nearest neighbors script.
You don't have to follow along if you don't want to, by the way;
you can just see what the result is. I believe this is the other K nearest neighbors script, from part 14.
What we can do there is basically the exact same thing, right?
Take this block, tab it over, and paste it inside for i in range(25).
I'll stop the accuracy printing, and in fact we don't need the example predictions either.
I'm going to leave them there anyway, because this is the old code
and I usually upload it to GitHub.
So anyway, we'll save that, and then we print sum(accuracies)
divided by len(accuracies).
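That sklearn script, wrapped in the same 25-run average, looks something like this sketch; note I'm using the current import location for train_test_split (sklearn.model_selection), which differs from the older videos:

```python
import numpy as np
import pandas as pd
from sklearn import neighbors
from sklearn.model_selection import train_test_split

df = pd.read_csv('breast-cancer-wisconsin.data')
df.replace('?', -99999, inplace=True)   # handle missing values as before
df.drop(['id'], axis=1, inplace=True)   # the id column is meaningless here

accuracies = []
for i in range(25):
    X = np.array(df.drop(['class'], axis=1))
    y = np.array(df['class'])
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    clf = neighbors.KNeighborsClassifier()
    clf.fit(X_train, y_train)
    accuracies.append(clf.score(X_test, y_test))

print(sum(accuracies) / len(accuracies))
```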
Okay, so out of 25 runs we averaged 96.4% accuracy on the version we wrote.
Now let's test scikit-learn's K nearest neighbors.
Okay, so its average accuracy is 96.8%.
The only thing I will stress is the following: keep in mind that
both of these are basically doing the exact same thing.
We're loading in the dataframe, passing the data, training, testing,
and doing 25 iterations.
So again, let's run this one. It's still running... still running...
I'll run the sklearn one too. And there's its answer already,
while we're still waiting on the version we wrote.
And we're still waiting. Oh man. Anyway.
So what's the difference? Okay, there are a couple of differences here.
One is that scikit-learn's K nearest neighbors can be threaded
(although I think the default n_jobs is equal to one).
You do not have to test each prediction point sequentially:
in testing you had 20% of the sample, right, and each set of features
is its own unique snowflake, so you don't actually need to test linearly;
you can test each one on its own. That means you can heavily thread
K nearest neighbors across a bulk of predictions.
So first of all, you can do that, but I think the default is that they do not.
Let me pull up sklearn's K neighbors documentation for you guys. Here we go:
n_neighbors defaults to 5, sure enough. Then there's the radius.
This is most likely where they're beating us: they're using that idea of a radius
to ignore points that fall outside it.
Then n_jobs, which is what we're talking about: how many parallel jobs
do you want to run for the neighbors search?
The default is one, but if you set it to negative one,
it will run as many jobs as possible.
That's another way to speed it up, though as you can see it was already pretty fast.
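So, for example, switching the classifier construction over is a one-line change:

```python
# Use all available CPU cores for the neighbor search.
clf = neighbors.KNeighborsClassifier(n_jobs=-1)
```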
But this is not a high performance tutorial on Python. This is a machine learning tutorial.
So if you want high performance you’ll most likely be using someone else’s algorithm.
But our accuracy was actually very similar to theirs.
I would wager that if we ran, like, a million samples,
we would find the accuracy was identical;
25 just isn't a decent sample size.
So anyways, that's about all for K nearest neighbors. It's a very valid algorithm to use.
A lot of people say it doesn't scale well,
but it scales to quite large datasets;
it just doesn't scale to, like, terabytes of data.
It's not going to run very well there.
But you can use a radius, you can thread it,
and it will still run pretty well.
The accuracy is pretty good, and one of the other upsides of the K nearest neighbors classifier
is that it works on both linear and nonlinear data.
You'll see that makes a big difference,
especially with something like the support vector machine, which is what we're covering next.
But using algorithms we've already covered, you actually can do
classification with regression, so long as the data is linear.
For example, I've got a graph here.
If I were to draw the best-fit line, the y-hat line, the regression line,
for these two datasets, I'd probably draw something like this for the blue dots,
and then something like this for the orange dots.
Then if I had an unknown data point, say this point here,
instead of using K nearest neighbors I could measure the squared error
between that point and each regression line.
Whichever line gives the lesser squared error, that's the class the new point belongs to,
which in this case would be the orange dot class, right?
So with regression, if the data is indeed linear,
you can still do classification; you don't just have to forecast out.
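Here's a rough sketch of that idea, with made-up sample points purely for illustration: fit a line to each class, then assign a new point to whichever line it sits closer to by squared error.

```python
import numpy as np

# Made-up, linear-ish sample data for the two classes.
blue_x, blue_y = np.array([1, 2, 3, 4, 5]), np.array([5, 7, 9, 11, 13])
orange_x, orange_y = np.array([1, 2, 3, 4, 5]), np.array([1, 2, 3, 4, 5])

def squared_error_to_line(xs, ys, point):
    # Fit y = mx + b to the class, then return the squared error
    # between the new point and that class's regression line.
    m, b = np.polyfit(xs, ys, 1)
    x, y = point
    return (y - (m * x + b)) ** 2

new_point = (3.5, 4)
blue_err = squared_error_to_line(blue_x, blue_y, new_point)
orange_err = squared_error_to_line(orange_x, orange_y, new_point)
print('orange' if orange_err < blue_err else 'blue')  # -> orange
```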
Now, what about a dataset like this, though?
Both the orange dots and the blue dots here do have a best-fit line;
I have no idea what those lines are, but they exist.
The trouble is that even when we drew those best-fit lines,
the squared error would be huge and the coefficient of determination would be so poor
that the confidence of this approach would be pretty bad overall.
So you can't really do classification on a dataset like this; this is nonlinear data.
But you can do K nearest neighbors on nonlinear data.
We can still classify this point using K nearest neighbors,
because we're just measuring the distance to, probably, this point, this point, and this point;
those are our closest three.
If k were equal to three, we'd have a two-to-one vote,
so we would vote that the point belongs to the orange dot class, right?
So anyways, that's one of the other upsides to the K nearest neighbors classifier.
That's it for K nearest neighbors. The next topic we're going to be talking about is the support vector machine,
so that's what we'll be getting into. If you have any questions, comments, concerns, whatever,
feel free to leave them below.
Otherwise, as always, thanks for watching, thanks for all the support and subscriptions, and until next time.