[MUSIC PLAYING]
Six lines of code is all it takes
to write your first Machine Learning program.
My name’s Josh Gordon, and today I’ll
walk you through writing Hello World for Machine Learning.
In the first few episodes of the series,
we’ll teach you how to get started with Machine
Learning from scratch.
To do that, we’ll work with two open source libraries,
scikit-learn and TensorFlow.
We’ll see scikit-learn in action in a minute.
But first, let’s talk quickly about what Machine Learning is
and why it’s important.
You can think of Machine Learning as a subfield
of artificial intelligence.
Early AI programs typically excelled at just one thing.
For example, Deep Blue could play chess
at a championship level, but that’s all it could do.
Today we want to write one program that
can solve many problems without needing to be rewritten.
AlphaGo is a great example of that.
As we speak, it’s competing in the World Go Championship.
But similar software can also learn to play Atari games.
Machine Learning is what makes that possible.
It’s the study of algorithms that
learn from examples and experience
instead of relying on hard-coded rules.
So that’s the state of the art.
But here’s a much simpler example
we’ll start coding up today.
I’ll give you a problem that sounds easy but is
impossible to solve without Machine Learning.
Can you write code to tell the difference
between an apple and an orange?
Imagine I asked you to write a program that takes an image
file as input, does some analysis,
and outputs the type of fruit.
How can you solve this?
You’d have to start by writing lots of manual rules.
For example, you could write code
to count how many orange pixels there are and compare that
to the number of green ones.
The ratio should give you a hint about the type of fruit.
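A rough sketch of what such a hand-written rule could look like. The color thresholds, the function names, and the two tiny "images" (flat lists of RGB tuples) are all made up for illustration; nothing here comes from the video itself.

```python
# A hand-written rule: compare how many pixels look orange vs. green.
# Pixels are (red, green, blue) tuples with values 0-255.
# The thresholds below are illustrative guesses, not tuned values.

def looks_orange(pixel):
    r, g, b = pixel
    return r > 200 and 100 < g < 200 and b < 100

def looks_green(pixel):
    r, g, b = pixel
    return g > 150 and r < 150 and b < 150

def guess_fruit(pixels):
    orange = sum(1 for p in pixels if looks_orange(p))
    green = sum(1 for p in pixels if looks_green(p))
    # More orange than green pixels hints at an orange; otherwise guess apple.
    return "orange" if orange > green else "apple"

# Two made-up images, each just a flat list of pixels.
orange_image = [(255, 165, 0)] * 8 + [(60, 180, 75)] * 2
apple_image = [(60, 180, 75)] * 7 + [(255, 165, 0)] * 1

print(guess_fruit(orange_image))  # orange
print(guess_fruit(apple_image))   # apple
```

Every number in this rule was chosen by hand, which is exactly the kind of effort the rest of the episode tries to avoid.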
That works fine for simple images like these.
But as you dive deeper into the problem,
you’ll find the real world is messy, and the rules you
write start to break.
How would you write code to handle black-and-white photos
or images with no apples or oranges in them at all?
In fact, for just about any rule you write,
I can find an image where it won’t work.
You’d need to write tons of rules,
and that’s just to tell the difference between apples
and oranges.
If I gave you a new problem, you’d need to start all over again.
Clearly, we need something better.
To solve this, we need an algorithm
that can figure out the rules for us,
so we don’t have to write them by hand.
And for that, we’re going to train a classifier.
For now you can think of a classifier as a function.
It takes some data as input and assigns a label to it
as output.
For example, I could have a picture
and want to classify it as an apple or an orange.
Or I have an email, and I want to classify it
as spam or not spam.
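To make the "classifier as a function" idea concrete, here is a small sketch of the input/output shape only. The `Classifier` alias, the two email features, and the hard-coded rule are hypothetical placeholders; a trained classifier would have the same shape but learn its rule from examples.

```python
from typing import Callable, Sequence

# A classifier, viewed as a function: features in, label out.
Classifier = Callable[[Sequence[float]], str]

def placeholder_spam_classifier(features: Sequence[float]) -> str:
    # Hypothetical features: [number of links, number of exclamation marks].
    # The rule is hard-coded only to show the input/output shape.
    links, exclamations = features
    return "spam" if links + exclamations > 5 else "not spam"

classify: Classifier = placeholder_spam_classifier
print(classify([4, 6]))  # spam
print(classify([0, 1]))  # not spam
```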
The technique to write the classifier
automatically is called supervised learning.
It begins with examples of the problem you want to solve.
To code this up, we’ll work with scikit-learn.
Here, I’ll download and install the library.
There are a couple of different ways to do that.
But for me, the easiest has been to use Anaconda.
This makes it easy to get all the dependencies set up
and works well cross-platform.
With the magic of video, I’ll fast forward
through downloading and installing it.
Once it’s installed, you can test
that everything is working properly
by starting a Python script and importing sklearn.
Assuming that worked, that’s line one of our program down,
five to go.
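One way to run that check without crashing when the library is missing. This is a sketch, not the exact script from the video; the variable name `sklearn_available` is made up, while `sklearn` is the import name of scikit-learn.

```python
import importlib.util

# find_spec returns None when the package cannot be found.
sklearn_available = importlib.util.find_spec("sklearn") is not None

if sklearn_available:
    import sklearn  # the import that the first line of the program relies on
    print("scikit-learn version:", sklearn.__version__)
else:
    print("scikit-learn is not installed in this environment")
```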
To use supervised learning, we’ll
follow a recipe with a few standard steps.
Step one is to collect training data.
These are examples of the problem we want to solve.
For our problem, we’re going to write a function
to classify a piece of fruit.
For starters, it will take a description of the fruit
as input and predict whether it’s
an apple or an orange as output, based on features
like its weight and texture.
To collect our training data, imagine
we head out to an orchard.
We’ll look at different apples and oranges
and write down measurements that describe them in a table.
In Machine Learning these measurements
are called features.
To keep things simple, here we’ve used just two:
how much each fruit weighs in grams and its texture, which
can be bumpy or smooth.
A good feature makes it easy to discriminate
between different types of fruit.
Each row in our training data is an example.
It describes one piece of fruit.
The last column is called the label.
It identifies what type of fruit is in each row,
and there are just two possibilities:
apples and oranges.
The whole table is our training data.
Think of these as all the examples
we want the classifier to learn from.
The more training data you have, the better a classifier
you can create.
Now let’s write down our training data in code.
We’ll use two variables: features and labels.
Features contains the first two columns,
and labels contains the last.
You can think of features as the input
to the classifier and labels as the output we want.
I’m going to change the variable types of all features
to ints instead of strings, so I’ll use 0 for bumpy and 1
for smooth.
I’ll do the same for our labels, so I’ll use 0 for apple
and 1 for orange.
These are lines two and three in our program.
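A sketch of how the table might be turned into those two variables. The transcript does not list the measurements, so the four rows below are illustrative values; only the encoding (0 for bumpy, 1 for smooth, 0 for apple, 1 for orange) comes from the description above. The `TEXTURE`, `FRUIT`, and `table` names are made up.

```python
# Illustrative rows (assumed values): (weight in grams, texture, fruit).
table = [
    (140, "smooth", "apple"),
    (130, "smooth", "apple"),
    (150, "bumpy", "orange"),
    (170, "bumpy", "orange"),
]

# Integer encodings described in the episode.
TEXTURE = {"bumpy": 0, "smooth": 1}
FRUIT = {"apple": 0, "orange": 1}

# features holds the first two columns, labels holds the last one.
features = [[weight, TEXTURE[texture]] for weight, texture, _ in table]
labels = [FRUIT[fruit] for _, _, fruit in table]

print(features)  # [[140, 1], [130, 1], [150, 0], [170, 0]]
print(labels)    # [0, 0, 1, 1]
```

In the six-line program, the two variables would simply be written out as literal lists rather than derived from a table.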
Step two in our recipe is to use these examples to train
a classifier.
The type of classifier we’ll start with
is called a decision tree.
We’ll dive into the details of how
these work in a future episode.
But for now, it’s OK to think of a classifier as a box of rules.
That’s because there are many different types of classifier,
but the input and output type is always the same.
I’m going to import the tree.
Then on line four of our script, we’ll create the classifier.
At this point, it’s just an empty box of rules.
It doesn’t know anything about apples and oranges yet.
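A small sketch of the "empty box of rules" stage, assuming scikit-learn is installed. `DecisionTreeClassifier` and `NotFittedError` are real scikit-learn names; the `trained` flag and the sample input are just for illustration.

```python
from sklearn import tree
from sklearn.exceptions import NotFittedError

# Create the classifier. Nothing has been learned yet.
clf = tree.DecisionTreeClassifier()

# Asking an untrained classifier for a prediction raises NotFittedError.
try:
    clf.predict([[150, 0]])
    trained = True
except NotFittedError:
    trained = False

print("trained yet?", trained)  # trained yet? False
```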
To train it, we’ll need a learning algorithm.
If a classifier is a box of rules,
then you can think of the learning algorithm
as the procedure that creates them.
It does that by finding patterns in your training data.
For example, it might notice oranges tend to weigh more,
so it’ll create a rule saying that the heavier a fruit is,
the more likely it is to be an orange.
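To illustrate the idea of a procedure that creates a rule from data, here is a deliberately tiny learner in plain Python. It is not how scikit-learn builds decision trees; it only searches for a single weight threshold, and the function name and the sample weights are assumptions.

```python
def learn_weight_rule(weights, labels):
    """Pick the weight threshold that misclassifies the fewest examples.

    Rule form: predict orange (1) if weight > threshold, else apple (0).
    """
    candidates = sorted(set(weights))
    best_threshold, best_errors = None, len(weights) + 1
    # Try the midpoint between each pair of neighboring distinct weights.
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        errors = sum((w > threshold) != bool(y) for w, y in zip(weights, labels))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

# Illustrative training data: weights in grams, 0 = apple, 1 = orange.
weights = [140, 130, 150, 170]
labels = [0, 0, 1, 1]

threshold = learn_weight_rule(weights, labels)
print(threshold)  # 145.0
```

The rule "heavier than 145 grams means orange" was never typed in by hand; it came out of the examples.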
In scikit-learn, the training algorithm
is included in the classifier object, and it’s called fit.
You can think of fit as being a synonym for “find patterns
in data.”
We’ll get into the details of how
this happens under the hood in a future episode.
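A sketch of calling fit and then peeking at the rules it found, assuming scikit-learn 0.21 or later (for `export_text`). The training values and the feature names passed to `export_text` are illustrative. With data this small, either feature separates the two classes, so the printed rule may mention weight or texture.

```python
from sklearn import tree

# Illustrative training data: [weight in grams, texture], 0 = bumpy, 1 = smooth.
features = [[140, 1], [130, 1], [150, 0], [170, 0]]
labels = [0, 0, 1, 1]  # 0 = apple, 1 = orange

clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)  # find patterns in the data

# Print the learned rules as text.
print(tree.export_text(clf, feature_names=["weight", "texture"]))
```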
At this point, we have a trained classifier.
So let’s take it for a spin and use it to classify a new fruit.
The input to the classifier is the features for a new example.
Let’s say the fruit we want to classify
is 150 grams and bumpy.
The output will be 0 if it’s an apple or 1 if it’s an orange.
Before we hit Enter and see what the classifier predicts,
let’s think for a sec.
If you had to guess, what would you say the output should be?
To figure that out, compare this fruit to our training data.
It looks like it’s similar to an orange
because it’s heavy and bumpy.
That’s what I’d guess anyway, and if we hit Enter,
it’s what our classifier predicts as well.
(Recall the integer encoding: 0 for apple, 1 for orange.)
If everything worked for you, then
that’s it for your first Machine Learning program.
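Put together, the six lines could look like the sketch below. The training values are illustrative, since the transcript does not spell them out; the new fruit (150 grams, bumpy) and the 0/1 encodings are as described above.

```python
from sklearn import tree
features = [[140, 1], [130, 1], [150, 0], [170, 0]]  # [weight in grams, texture]
labels = [0, 0, 1, 1]  # 0 = apple, 1 = orange
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)
print(clf.predict([[150, 0]]))  # [1]
```

The prediction is 1, an orange, matching the guess made by comparing the fruit to the training data.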
You can create a new classifier for a new problem
just by changing the training data.
That makes this approach far more reusable
than writing new rules for each problem.
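A sketch of that reuse: the same training code applied to two different tables. The `train` helper, the spam features, and all the numbers are hypothetical and only meant to show that nothing but the data changes.

```python
from sklearn import tree

def train(features, labels):
    # Same code for every problem; only the training data differs.
    return tree.DecisionTreeClassifier().fit(features, labels)

# Problem 1 (illustrative): [weight in grams, texture] -> 0 = apple, 1 = orange.
fruit_clf = train([[140, 1], [130, 1], [150, 0], [170, 0]], [0, 0, 1, 1])

# Problem 2 (hypothetical): [number of links, mentions "free"] -> 0 = not spam, 1 = spam.
spam_clf = train([[0, 0], [1, 0], [8, 1], [12, 1]], [0, 0, 1, 1])

print(fruit_clf.predict([[160, 0]]))  # [1]
print(spam_clf.predict([[10, 1]]))    # [1]
```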
Now, you might be wondering why we described our fruit
using a table of features instead of using pictures
of the fruit as training data.
Well, you can use pictures, and we’ll
get to that in a future episode.
But, as you’ll see later on, the way we did it here
is more general.
The neat thing is that programming with Machine
Learning isn’t hard.
But to get it right, you need to understand
a few important concepts.
(Important concepts)
(· How does this approach work in the real world?)
(· How much training data do you need?)
(· How is a decision tree created?)
(· How do you choose good features?)
I’ll start walking you through those in the next few episodes.
Thanks very much for watching, and I’ll see you then.
[MUSIC PLAYING]
