《边学Python边学数据科学》#6 遗传算法 – 译学馆
《边学Python边学数据科学》#6 遗传算法

Genetic Algorithms - Learn Python for Data Science #6

大家好 我是西拉杰
Hello World, it’s Siraj!
在本视频中 我们将用遗传编程
In this video, we’re going to use genetic programming
来确定一些能量是否是伽马辐射
to identify if some energy is gamma radiation or not.
我要生气了 伽马射线!啊!
I’m getting angry. Gamma rays! Augh!
没有 是我希望的
Nah, I wish.
数据科学是关于发现的一种思维方式
Data science is a way of thinking about discovery.
数据科学家需要决定
A data scientist needs to decide
要问的正确问题是什么
the right question to ask,
比如 在美国大选中谁是最好的候选人?
like “Who’s the best candidate to vote for in the US election?,”
然后要决定使用什么数据集
then decide what dataset to use,
比如候选人的推特历史记录
like tweet history of candidates
以及每一位候选人过去的支持情况
and past endorsements of each candidate,
最后要根据数据决定用什么机器学习模型
and lastly decide what machine learning model to use
以发现正确答案
on the data to discover the right answer.
#生活在继续!#
♫ Life goes on! ♫
有了合适的数据 计算能力
With the right data, computing power,
和机器学习模型
and machine learning model,
就可以找到任何问题的解决方法
you can discover a solution to any problem,
但对新进数据科学家来说
but knowing which model to use can
知道用哪种模型是很有挑战的
be challenging for new data scientists.
因为模型实在太多了!
There are so many of them!
这时候遗传编程就有用了
That’s where genetic programming can help.
遗传算法受达尔文的
Genetic algorithms are inspired by
自然选择过程启发
the Darwinian process of natural selection,
它们被用于生成最优化问题
and they’re used to generate solutions to optimization
和搜索问题的答案
and search problems.
它们有三种属性
They have three properties:
选择 交叉和变异
selection, crossover, and mutation.
就给定的问题而言
You have a population of possible solutions
会有一个可能解群
to a given problem
和一个适应度函数
and a fitness function.
每次迭代
Every iteration,
我们都会用适应度函数
we evaluate how fit each solution is
评价每一种解的适应度
with our fitness function.
然后我们选择最适合的解
Then we select the fittest ones
进行交叉以创造一个新的群体
and perform crossover to create a new population.
我们用某种随机修改
We take those children and mutate them
使那些“孩子”进行变异
with some random modification and
并重复该过程直到得到最适应或最优的解
repeat the process until we get the fittest or best solution.
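The loop just described can be sketched in a few lines of plain Python. This is a toy illustration, not code from the video: the bit-string problem, the function name `evolve`, and every parameter value are assumptions chosen to keep the example small. Fitness here is simply the number of 1 bits in an individual.

```python
import random

def evolve(fitness, genome_len=20, pop_size=30, generations=40,
           mutation_rate=0.05, seed=0):
    """Toy genetic algorithm over bit strings (illustrative only)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(genome_len)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: rank by fitness and keep the fitter half as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[:pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            mom, dad = rng.sample(parents, 2)
            # Crossover: splice the two parents at a random cut point.
            cut = rng.randrange(1, genome_len)
            child = mom[:cut] + dad[cut:]
            # Mutation: flip each bit with a small probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        # Parents survive unchanged, so the best fitness never decreases.
        population = parents + children
    return max(population, key=fitness)

best = evolve(fitness=sum)  # fitness = number of 1 bits
print(sum(best), "ones out of", len(best))
```

Keeping the parents in the next generation is one common design choice (elitism); other variants replace the whole population each round.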
比如下面这个问题
So take this problem, for instance.
假如你想来一次跨越很多城市的自驾游
Let’s say you want to take a road trip across a bunch of cities.
要想最终回到自己所居住的城市
What’s the shortest possible path
而途中只经过每个城市一次
you could take to hit up each city once
可能走的最短路线是什么呢?
and then return back to your home city?
在计算机科学中
This is popularly called
这通常被称为“旅行推销员问题”
the “traveling salesman problem” in computer science,
我们可以用遗传算法
and we can use a genetic algorithm
来帮助我们解决这个问题
to help us solve it.
我们来看些高级Python代码
Let’s look at some high-level Python code.
我们将代的值设置为5000
We have the number of generations set to 5,000
将群体规模设置为100
and the population size set to 100.
我们首先用规模参数
So we start by initializing our population
初始化群体
using our size parameter.
群体中的每个个体
Each individual in our population
代表一种不同的答案路径
represents a different solution path.
至于每一代
Then, for each generation,
我们计算每种解的适应度
we compute the fitness of each solution and
并储存在群体适应度数组中
store it in our population fitness array.
现在我们将执行选择
Now we’ll perform selection
只选取群体中的前10%
by only taking the top 10% of the population,
即自驾游路线中比较短的前10%
which are our shortest road trips,
接着通过执行交叉 它们又产生支系
and produce offspring from them by performing crossover.
然后随机选取这些支系
Then you take those offspring randomly
并重复这个过程
and repeat the process.
正如在动画中看到的这样
As you can see in the animation,
最终 我们用这个过程会得到一个最优解
eventually we will get an optimal solution using this process,
这跟苹果地图不同
unlike Apple Maps.
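The transcript does not include the on-screen code, so what follows is only a rough sketch of how such a road-trip solver could look. The function names, the ordered-crossover and swap-mutation operators, and the random city coordinates are all assumptions. The population size of 100 and the top-10% selection follow the narration; the generation count is cut from 5,000 to 200 so the example runs quickly.

```python
import math
import random

def route_length(route, cities):
    """Total length of the closed tour, returning to the start city."""
    return sum(math.dist(cities[route[i]], cities[route[(i + 1) % len(route)]])
               for i in range(len(route)))

def ordered_crossover(mom, dad, rng):
    """Copy a slice from one parent, fill the rest in the other's order."""
    a, b = sorted(rng.sample(range(len(mom)), 2))
    middle = mom[a:b]
    rest = [city for city in dad if city not in middle]
    return rest[:a] + middle + rest[a:]

def solve_tsp(cities, pop_size=100, generations=200, seed=0):
    rng = random.Random(seed)
    n = len(cities)
    # Each individual is one candidate route: a permutation of city indices.
    population = [rng.sample(range(n), n) for _ in range(pop_size)]
    for _ in range(generations):
        # Fitness: shorter routes are fitter, so sort ascending by length.
        population.sort(key=lambda r: route_length(r, cities))
        elite = population[:pop_size // 10]   # selection: top 10%
        children = []
        while len(elite) + len(children) < pop_size:
            mom, dad = rng.sample(elite, 2)
            child = ordered_crossover(mom, dad, rng)
            # Mutation: swap two random cities in the child's route.
            i, j = rng.sample(range(n), 2)
            child[i], child[j] = child[j], child[i]
            children.append(child)
        population = elite + children
    return min(population, key=lambda r: route_length(r, cities))

city_rng = random.Random(1)
cities = [(city_rng.random(), city_rng.random()) for _ in range(12)]
best = solve_tsp(cities)
print(best, round(route_length(best, cities), 3))
```

A genetic algorithm like this is a heuristic: it tends to find short routes, but it does not prove that a route is the shortest possible.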
那么所有这些是如何融入数据科学的呢?
Alright, so how does this all fit into data science?
其实 选择正确的机器学习模型
Well, it turns out that choosing the right machine learning model
并为此模型选择最佳超参数
and all the best hyperparameters for that model
本身就是一个最优化问题
is itself an optimization problem.
我们将使用Python的TPOT库
We’re gonna use a Python library called TPOT,
来优化机器学习过程
built on top of scikit-learn,
该库建立于scikit-learn之上
that uses genetic programming
并使用了遗传编程算法
to optimize our machine learning pipeline.
在正确格式化数据后
So after formatting our data properly,
我们需要知道往模型中输入什么特征
we need to know what features to input to our model
以及应怎样构建这些特征
and how we should construct those features.
一旦有了这些特征
Once we have those features,
我们会将其输入模型中进行训练
we’ll input them into our model to train on,
同时我们需调整超参数或调谐旋钮
and we’ll want to tune our hyperparameters, or tuning knobs,
以得到最优结果
to get the optimal results.
TPOT并不是让我们自己
Instead of doing this all ourselves
通过反复试验来做这些
through trial and error,
而是用遗传编程
TPOT automates these steps for us
自动为我们实现这些步骤
with genetic programming,
当它完成后
and it will output the optimal code
会输出最优的代码
for us when it’s done
这样以后就可以用到
so we can use it later.
在安装依赖项之后
So we’re going to create a classifier
我们将用TPOT为伽马射线
for gamma radiation using TPOT
创建一个分类器
after installing our dependencies,
然后对结果进行分析
and then analyze the results.
TPOT建立于广泛应用的
TPOT is built on the popular scikit-learn
scikit-learn机器学习库之上
machine learning library, so we’ll want to make sure
所以我们得确保先安装好了
that we have that installed first.
然后我们会安装pandas
Then we’ll install pandas
来帮助我们分析数据
to help us analyze our data
安装numpy来执行数学计算
and numpy to perform math calculations.
我们的第一步是加载数据集
Our first step is to load our dataset.
我们将用pandas的read_csv()方法
We’ll use pandas’ read_csv() method
并将参数设置为已保存的CSV文件名
and set the parameter to the name of our saved CSV file.
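As a small illustration of the loading step, the snippet below reads CSV text with pandas. The file contents, column names, and values are made up for the example, and an in-memory string stands in for the saved file so the snippet runs on its own.

```python
import io
import pandas as pd

# Stand-in for the saved CSV file; the column names and values are made up.
csv_text = """fLength,fWidth,fAlpha,Class
28.8,16.0,40.1,g
31.6,11.7,6.4,g
162.1,136.0,76.9,h
23.8,9.6,77.0,h
"""

# With a real file on disk, pass its path to pd.read_csv() instead.
telescope = pd.read_csv(io.StringIO(csv_text))
print(telescope.shape)   # (4, 4)
print(telescope.head())
```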
这是从科学仪器“切伦科夫望远镜”
This is data collected from a scientific instrument
收集来的数据
called a “Cherenkov telescope”
它被用于测量大气中的辐射
that measures radiation in the atmosphere
这些是它采集到的
and these are a bunch of features
任意辐射类型的特征群
of whatever type of radiation it picks up.
感谢普京!
Thanks, Putin!
因为类对象早已编排过了
Since the class object is already organized,
我们将置乱数据以得到更好的结果
we’ll shuffle our data to get a better result.
望远镜变量的iloc()函数
The iloc() function of the telescope variable
是pandas获得其在索引中位置的方法
is pandas’ way of getting the positions in the index.
我们将用numpy子模块’random’的置换函数
And we’ll generate a sequence of random indices
生成与数据大小匹配的
the size of our data using the permutation function
随机索引序列
of numpy’s ‘random’ submodule.
现在所有例子都随机重新排列了
Since all the instances are now randomly rearranged,
虽然数据是乱序的
we’ll just reset all these indices so they are ordered
但我们只需将drop参数设置为“True”
even though the data is now shuffled,
用pandas的reset_index()方法
using the reset_index() method of pandas
重置所有索引即可使其有序排列
with the drop parameter set to “True.”
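The shuffle described here could look roughly like the sketch below. The tiny DataFrame is invented for illustration, and a seeded NumPy generator is used (instead of the module-level `np.random.permutation` the narration mentions) only so the result is repeatable. Note that `.iloc` is an indexer used with square brackets rather than a function call.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the telescope data, ordered by class like the original.
telescope = pd.DataFrame({
    "fLength": [28.8, 31.6, 24.2, 162.1, 23.8, 75.1],
    "Class": ["g", "g", "g", "h", "h", "h"],
})

rng = np.random.default_rng(0)            # seeded only for repeatability
order = rng.permutation(len(telescope))   # random ordering of 0..n-1

# .iloc selects rows by integer position, so this reorders the rows.
telescope_shuffle = telescope.iloc[order]

# drop=True discards the old index instead of keeping it as a new column.
tele = telescope_shuffle.reset_index(drop=True)
print(tele)
```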
我们现在要用map()方法
We’ll now let our ‘tele’ variable know
将两个分类映射到整数
what our two classes are by mapping both of them
让‘tele’变量知道它们是什么
to an integer with the map() method.
‘g’代表‘伽马’ 设置为0
So ‘g’ for “gamma” is set to 0;
‘h’代表‘强子’ 设置为1
‘h’ for “hadron” is set to 1.
储存这些‘类’标签
Let’s store those ‘Class’ labels,
稍后在另一个变量‘tele_class’中进行预估
which we’re going to predict, in a separate variable called ‘tele_class’
并用值属性对其进行检索
and use the values attribute to retrieve it.
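A minimal sketch of the label-encoding step, using an invented four-row DataFrame. The `Class` column name and the `g`/`h` codes follow the narration; everything else is an assumption.

```python
import pandas as pd

tele = pd.DataFrame({
    "fLength": [28.8, 162.1, 31.6, 23.8],
    "Class": ["g", "h", "g", "h"],
})

# Replace each label with its integer code: gamma -> 0, hadron -> 1.
tele["Class"] = tele["Class"].map({"g": 0, "h": 1})

# .values returns the column as a NumPy array.
tele_class = tele["Class"].values
print(tele_class)   # [0 1 0 1]
```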
在对模型进行训练前
Before we train our model,
我们需要将数据
we need to split our data
分成训练集和验证集
into training and validation sets.
我们将用所导入的
We’ll use the train_test_split() method
scikit-learn的train_test_split()方法
of scikit-learn that we imported
来为二者创建索引
to create the indices for both.
参数为数据集的大小
The parameters will be the size of our dataset.
我们想让两个集合都为数组
We want both sets to be arrays,
所以将‘stratify’参数设置为我们之前的数组类型
so we’ll set the ‘stratify’ parameter to our array type.
然后我们将
Then we’ll define what percent
用上面两个参数来定义想要
of our data we want to be training
训练和测试的数据百分比
and testing with these last two parameters.
现在将数据按75/25划分
We have a 75/25 split now in our data
这样就准备好训练模型了
and we’re ready to train our model.
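The split can be sketched as follows with scikit-learn's `train_test_split`. The 40-row DataFrame is invented, and `random_state` is added only for repeatability. In scikit-learn, `stratify` takes the array of class labels and keeps the class proportions the same in both subsets; it does not control the output type. The first argument here is the array of row indices to divide.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

tele = pd.DataFrame({
    "fLength": np.linspace(10.0, 200.0, 40),
    "Class": [0] * 24 + [1] * 16,     # 60% gamma, 40% hadron
})
tele_class = tele["Class"].values

# Split the row indices (not the rows themselves) 75/25, stratified by class.
training_indices, validation_indices = train_test_split(
    tele.index.to_numpy(),
    stratify=tele_class,
    train_size=0.75,
    test_size=0.25,
    random_state=0,
)
print(len(training_indices), len(validation_indices))   # 30 10
```

Because the split is stratified, the 10 validation rows contain 6 gamma and 4 hadron examples, matching the 60/40 ratio of the full set.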
用‘TPOT’分类来初始化‘tpot’变量
We’ll initialize the ‘tpot’ variable using the ‘TPOT’ class
代值设置为5
with the number of generations set to 5.
4GB运行内存的标准笔记本电脑
On a standard laptop with 4 gigs of RAM,
每代运行需要5分钟
it takes five minutes per generation to run
所以一共需要约25分钟
so this will take about 25 minutes.
这样TPOT的遗传算法就知道
This is so TPOT’s genetic algorithm knows
需要运行多少次迭代
how many iterations to run for,
将‘verbosity’设置为2
and we’ll set ‘verbosity’ to 2,
这意味着在优化过程期间
which just means “Show a progress bar in terminals
在各终端显示一个进度条
during the optimization process.”
然后在训练数据上应用fit()方法
Then we can call our fit() method on our training data
让其用遗传编程对其执行优化
to let it perform optimization using genetic programming.
第一个参数是训练特征集
The first parameter is the training feature set
在首次读取每个训练索引时
which we’ll retrieve from our ‘tele’ variable
我们将从‘tele’变量中对其进行检索
along the first axis for every training index.
第二个变量是训练类集
The second variable is our training class set,
我们也会这样从‘tele’变量中进行检索
which we’ll retrieve from our ‘tele’ variable like so.
我们可以用TPOT的score()方法
We can compute the testing error for validation
将验证特征集作为第一个参数
using TPOT’s score() method
验证类集作为第二个参数
with validation feature set as the first parameter
来计算测试误差以进行验证
and the validation class set as the second.
我们会用这种方法
We’ll export the computed Python code
将计算过的Python代码导出
to the pipeline.py file
到pipeline.py类中
using this method
并在参数中将其类型命名为字符串
and name it in the parameter as a string.
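Put together, the fit, score, and export steps might look like the sketch below. It is wrapped in a hypothetical helper, `run_tpot`, and makes one loud assumption: a classic (pre-1.0) TPOT release, where the class is imported as `TPOTClassifier`, accepts `generations` and `verbosity`, and provides `export()`. The video's `TPOT` class name is older still, and newer releases may use a different interface. The `tele` DataFrame and the index arrays are assumed to come from the earlier steps.

```python
def run_tpot(tele, training_indices, validation_indices, out_path="pipeline.py"):
    """Fit TPOT on the training rows, score on validation rows, export code.

    Assumes a classic (pre-1.0) TPOT release; newer versions may differ.
    """
    # Imported lazily because TPOT is a heavy, optional dependency.
    from tpot import TPOTClassifier

    features = tele.drop("Class", axis=1)   # axis=1 drops a column, not a row
    labels = tele["Class"]

    tpot = TPOTClassifier(generations=5, verbosity=2)
    tpot.fit(features.loc[training_indices].values,
             labels.loc[training_indices].values)
    accuracy = tpot.score(features.loc[validation_indices].values,
                          labels.loc[validation_indices].values)
    tpot.export(out_path)   # writes the best pipeline found as Python source
    return accuracy
```

Calling `run_tpot(tele, training_indices, validation_indices)` would start the search; as the narration notes, expect it to take minutes per generation on a laptop.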
我们来演示一下
Let’s demo this thing.
经过训练 我们可以看到5次迭代后
After training, we’ll see that after five generations,
TPOT选择gradient_boosting分类器
TPOT chose the gradient_boosting classifier
作为最精确的机器学习模型来使用
as the most accurate machine learning model to use.
它也为我们展示了最优超参数
It also shows the optimal hyperparameters
如学习率和估计量的个数
like the learning rate and number of estimators for us.
#耶 了不起!#
♫ Yeah, boyyy! ♫
所以 分解来看
So, to break it down:
有了适量的数据
with the right amount of data,
计算能力和机器学习模型
computing power, and machine learning model,
就可以发现任何问题的解
you can discover a solution to any problem.
遗传算法通过选择 交叉和变异
Genetic algorithms replicate evolution
复制进化
via selection, crossover, and mutation
以寻找问题最优解
to find an optimal solution to a problem,
TPOT是Python的一个库
and TPOT is a Python library
它用遗传编程帮你为使用案例
that uses genetic programming to help you
找到最优模型和超参数
find the best model and hyperparameters for your use case.
上个视频的代码挑战冠军
The winner of the coding challenge from the last video
是Peter Mitrano
is Peter Mitrano.
他为他的仓库加了一些
He added some great Deep Dream samples
很不错的Deep Dream样品
to his repository,
甚至还对我自己的视频进行了Deep Dream
and even Deep Dream’d my own video.
本周的最佳者!
Badass of the week!
亚军是Kyle Jordaan
And the runner-up is Kyle Jordaan.
他用一行代码就将所有Deep Dream框架
Good job stitching all the Deep Dream’d frames
串联在了一起 做的非常好!
together with one line of code.
本视频的挑战是
The challenge for this video is
用TPOT和我提供的气候变化数据集
to use TPOT and a climate change dataset that I’ll provide
预测你提出的一个问题的答案
to predict the answer to a question you decide.
这将是学习像数据科学家
This will be great practice in learning to think
一样思考的很好的实践
like a data scientist.
将GitHub链接粘贴在评论区
Post your GitHub link in the comments
我会在下次视频中宣布获胜者
and I’ll announce the winner next time.
现在 我要去锻炼身体好繁殖下一代了
For now, I’ve got to stay fit to reproduce,
感谢观看
so thanks for watching.

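Translation credits
Video summary

1. Data + computing power + machine learning can solve any problem
2. Genetic algorithms find solutions through selection, crossover, and mutation
3. TPOT uses genetic algorithms to optimize the machine learning pipeline

Transcription

Collected from the web

Translator

Nam

Reviewer

Reviewer #LN

Video source

https://www.youtube.com/watch?v=dSofAXnnFrY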