Learn Python for Data Science #6: Genetic Algorithms – 译学馆

#### Genetic Algorithms - Learn Python for Data Science #6

Hello World, it’s Siraj!

In this video, we’re going to use genetic programming

to identify if some energy is gamma radiation or not.

I’m getting angry. Gamma rays! Augh!

Nah, I wish.

Data science is a way of thinking about discovery.

A data scientist needs to decide what question to ask,

like “Who’s the best candidate to vote for in the US election?”

then decide what dataset to use,

like tweet history of candidates

and past endorsements of each candidate,

and lastly decide what machine learning model to use

on the data to discover the right answer.
♫ Life goes on! ♫

With the right data, computing power,

and machine learning model,

you can discover a solution to any problem,

but knowing which model to use can

be challenging for new data scientists.

There are so many of them!

That’s where genetic programming can help.

Genetic algorithms are inspired by

the Darwinian process of natural selection,

and they’re used to generate solutions to optimization

and search problems.

They have three properties:

selection, crossover, and mutation.

You have a population of possible solutions

to a given problem

and a fitness function.

Every iteration,

we evaluate how fit each solution is

with our fitness function.

Then we select the fittest ones

and perform crossover to create a new population.

We take those children and mutate them

with some random modification and

repeat the process until we get the fittest or best solution.
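The loop just described can be sketched in a few lines of Python. This is only a minimal illustration, not code from the video: it assumes a toy problem (maximize the number of 1s in a bit string), and every name in it (`fitness`, `crossover`, `mutate`, `evolve`) is made up for the example. It also assumes the selected parents survive into the next generation unchanged, which the transcript does not specify.

```python
import random

def fitness(bits):
    """Fitness of a candidate: the number of 1s (a toy problem, assumed here)."""
    return sum(bits)

def crossover(parent_a, parent_b, rng):
    """Single-point crossover: a prefix of one parent plus the rest of the other."""
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]

def mutate(bits, rng, rate=0.02):
    """Flip each bit independently with a small probability."""
    return [1 - b if rng.random() < rate else b for b in bits]

def evolve(n_bits=20, pop_size=30, generations=40, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half as parents (they survive unchanged).
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        # Crossover + mutation: refill the population with mutated children.
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness)

best = evolve()
print(fitness(best), best)
```

Because the parents are carried over as they are, the best fitness in the population can never get worse from one generation to the next.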

So take this problem, for instance.

Let’s say you want to take a road trip across a bunch of cities.

What’s the shortest possible path

you could take to hit up each city once

and then return back to your home city?

This is popularly called

the “traveling salesman problem” in computer science,

and we can use a genetic algorithm

to help us solve it.

Let’s look at some high-level Python code.

We have the number of generations set to 5,000

and the population size set to 100.

So we start by initializing our population

using our size parameter.

Each individual in our population

represents a different solution path.

Then, for each generation,

we compute the fitness of each solution and

store it in our population fitness array.

Now we’ll perform selection

by only taking the top 10% of the population

which are our shortest road trips

and produce offspring from them by performing crossover.

Then we mutate those offspring randomly

and repeat the process.
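The code shown in the video is not part of this transcript, so here is a rough sketch of what such a road-trip solver could look like. Several things are assumptions made for illustration: the city coordinates are invented, the crossover is an "ordered crossover", the mutation swaps two cities, the top 10% are carried over unchanged, and the default number of generations is lowered from 5,000 to 200 so the example finishes quickly.

```python
import math
import random

def route_length(route, cities):
    """Total length of the round trip, including the leg back to the start."""
    return sum(
        math.dist(cities[route[i]], cities[route[(i + 1) % len(route)]])
        for i in range(len(route))
    )

def ordered_crossover(parent_a, parent_b, rng):
    """Copy a slice from parent_a, then fill the gaps in parent_b's order."""
    start, end = sorted(rng.sample(range(len(parent_a)), 2))
    middle = parent_a[start:end]
    rest = [city for city in parent_b if city not in middle]
    return rest[:start] + middle + rest[start:]

def swap_mutation(route, rng, rate=0.1):
    """Occasionally swap two cities so the search keeps exploring."""
    route = route[:]
    if rng.random() < rate:
        i, j = rng.sample(range(len(route)), 2)
        route[i], route[j] = route[j], route[i]
    return route

def solve_tsp(cities, generations=200, pop_size=100, seed=0):
    rng = random.Random(seed)
    base = list(range(len(cities)))
    # Each individual is one possible visiting order of all cities.
    population = [rng.sample(base, len(base)) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the shortest 10% of routes.
        population.sort(key=lambda r: route_length(r, cities))
        elite = population[: max(2, pop_size // 10)]
        # Crossover and mutation refill the rest of the population.
        children = [
            swap_mutation(ordered_crossover(rng.choice(elite), rng.choice(elite), rng), rng)
            for _ in range(pop_size - len(elite))
        ]
        population = elite + children
    return min(population, key=lambda r: route_length(r, cities))

# Hypothetical city coordinates, made up for this example.
cities = [(0, 0), (1, 5), (5, 2), (6, 6), (8, 3), (2, 9), (9, 9), (4, 4)]
best_route = solve_tsp(cities)
print(best_route, round(route_length(best_route, cities), 2))
```

A sketch like this returns the best route it has found so far; it does not prove that the route is the shortest one possible.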

As you can see in the animation,

eventually we will get an optimal solution using this process,

unlike Apple Maps.

Alright, so how does this all fit into data science?

Well, it turns out that choosing the right machine learning model

and all the best hyperparameters for that model

is itself an optimization problem.

We’re gonna use a Python library called TPOT,

built on top of scikit-learn,

that uses genetic programming

to optimize our machine learning pipeline.

So after formatting our data properly,

we need to know what features to input to our model

and how we should construct those features.

Once we have those features,

we’ll input them into our model to train on,

and we’ll want to tune our hyperparameters, or tuning knobs,

to get the optimal results.
Instead of doing this all ourselves

through trial and error,

TPOT automates these steps for us

with genetic programming,

and it will output the optimal code

for us when it’s done

so we can use it later.

So we’re going to create a classifier

after installing our dependencies,

and then analyze the results.

TPOT is built on the popular scikit-learn

machine learning library, so we’ll want to make sure

that we have that installed first.

Then we’ll install pandas

to help us analyze our data

and numpy to perform math calculations.

Our first step is to load our dataset.

We’ll use pandas’ read_csv() function

and set the parameter to the name of our saved CSV file.

This is data collected from a scientific instrument

called a “Cherenkov telescope”

that measures radiation in the atmosphere

and these are a bunch of features

of whatever type of radiation it picks up.

Thanks, Putin!

Since the class object is already organized,

we’ll shuffle our data to get a better result.

The iloc indexer of the telescope variable

is pandas’ way of selecting rows by integer position.

And we’ll generate a sequence of random indices

the size of our data using the permutation function

of numpy’s ‘random’ submodule.

Since all the instances are now randomly rearranged,

we’ll just reset all these indices so they are ordered

even though the data is now shuffled,

using the reset_index() method of pandas

with the drop parameter set to “True.”
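The loading and shuffling steps above might look roughly like this. Since the real CSV file is not part of this transcript, the sketch reads a few made-up rows from an in-memory string instead; the feature column names are hypothetical, and the random seed is only there to make the example repeatable.

```python
import io

import numpy as np
import pandas as pd

# A few made-up rows standing in for the telescope CSV; the column names
# other than 'Class' are hypothetical.
csv_text = """fLength,fWidth,Class
28.8,16.0,g
31.6,11.7,g
162.1,136.0,g
75.4,30.9,h
51.6,21.2,h
48.2,17.4,h
"""

# read_csv accepts a file path or, as here, any file-like object.
tele = pd.read_csv(io.StringIO(csv_text))

# Draw a random ordering of row positions, then pick the rows in that order.
np.random.seed(42)
shuffled_positions = np.random.permutation(len(tele))
tele_shuffled = tele.iloc[shuffled_positions]

# Renumber the index 0..n-1 and discard the old, now scrambled, index.
tele = tele_shuffled.reset_index(drop=True)
print(tele)
```

With drop=True the old index is thrown away; without it, pandas would keep the old index as an extra column.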

We’ll now let our ‘tele’ variable know

what our two classes are by mapping both of them

to an integer with the map() method.

So ‘g’ for “gamma” is set to 0;

‘h’ for “hadron” is set to 1.

Let’s store those ‘Class’ labels,

which we’re going to predict, in a separate variable called ‘tele_class’

and use the values attribute to retrieve it.
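As a small illustration of the label mapping, assuming a tiny stand-in DataFrame with a ‘Class’ column (the ‘fLength’ feature column and its numbers are invented for the example):

```python
import pandas as pd

# Tiny stand-in for the shuffled telescope data; 'fLength' is a hypothetical feature.
tele = pd.DataFrame(
    {"fLength": [28.8, 75.4, 31.6, 51.6], "Class": ["g", "h", "g", "h"]}
)

# Replace each letter with an integer code: gamma -> 0, hadron -> 1.
tele["Class"] = tele["Class"].map({"g": 0, "h": 1})

# .values returns the column as a plain NumPy array.
tele_class = tele["Class"].values
print(tele_class)
```

Any label that is missing from the dictionary would become NaN, so it is worth checking that the mapping covers every class in the data.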

Before we train our model,

we need to split our data

into training and validation sets.

We’ll use the train_test_split() function

of scikit-learn that we imported

to create the indices for both.

The first parameter will be the size of our dataset.

We want both sets to keep the same mix of classes,

so we’ll set the ‘stratify’ parameter to our array of class labels.

Then we’ll define what percent

of our data we want to be training

and testing with these last two parameters.

We have a 75/25 split now in our data

and we’re ready to train our model.
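Here is a hedged sketch of the split, using eight made-up rows rather than the real telescope data. It assumes a current scikit-learn where train_test_split() lives in sklearn.model_selection, and it splits an array of row positions rather than the rows themselves; the random_state is only there to make the example repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data: 8 rows, half gamma (0) and half hadron (1).
tele = pd.DataFrame({"fLength": np.arange(8, dtype=float), "Class": [0, 1] * 4})
tele_class = tele["Class"].values

# Split the row positions 0..n-1; stratify keeps the class mix the same in both sets.
training_indices, validation_indices = train_test_split(
    np.arange(len(tele)),
    stratify=tele_class,
    train_size=0.75,
    test_size=0.25,
    random_state=0,
)
print(len(training_indices), len(validation_indices))
```

Splitting positions instead of rows makes it easy to pull the matching features and labels out of the same DataFrame later.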

We’ll initialize the ‘tpot’ variable using the ‘TPOT’ class

with the number of generations set to 5.
On a standard laptop with 4 gigs of RAM,

it takes five minutes per generation to run

so this will take about 25 minutes.

This is so TPOT’s genetic algorithm knows

how many iterations to run for,

and we’ll set ‘verbosity’ to 2,

which just means “Show a progress bar in the terminal

during the optimization process.”

Then we can call our fit() method on our training data

to let it perform optimization using genetic programming.

The first parameter is the training feature set

which we’ll retrieve from our ‘tele’ variable

along the first axis for every training index.

The second parameter is our training class set,

which we’ll retrieve from our ‘tele’ variable like so.

We can compute the testing error for validation

using TPOT’s score() method

with validation feature set as the first parameter

and the validation class set as the second.

We’ll export the computed Python code

to the pipeline.py file

using this method

and name it in the parameter as a string.
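Put together, the fit, score, and export steps could be wrapped up roughly as below. This is a sketch under assumptions, not the code from the video: the transcript calls the class ‘TPOT’, while the sketch uses TPOTClassifier, the name used by later classic TPOT releases, and it assumes that release accepts ‘generations’ and ‘verbosity’ and offers an export() method. The function name run_tpot_search is made up, and the indices are assumed to be row positions in a DataFrame with a ‘Class’ column.

```python
def run_tpot_search(tele, training_indices, validation_indices, generations=5):
    """Search for a pipeline with TPOT, score it, and export the winning code.

    Assumes a classic TPOT release that provides TPOTClassifier with the
    'generations' and 'verbosity' arguments and an export() method.
    """
    # Imported inside the function so this file loads even without TPOT installed.
    from tpot import TPOTClassifier

    # Everything except the 'Class' column is treated as a feature.
    features = tele.drop("Class", axis=1).values
    labels = tele["Class"].values

    tpot = TPOTClassifier(generations=generations, verbosity=2)
    # Fit on the training rows only.
    tpot.fit(features[training_indices], labels[training_indices])
    # Score on the held-out validation rows.
    validation_score = tpot.score(
        features[validation_indices], labels[validation_indices]
    )
    # Write the best pipeline found as standalone Python code.
    tpot.export("pipeline.py")
    return validation_score
```

Calling run_tpot_search(tele, training_indices, validation_indices) would start the search; as mentioned above, expect it to take a while on a laptop.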

Let’s demo this thing.

After training, we’ll see that after five generations,

TPOT has picked one classifier

as the most accurate machine learning model to use.

It also shows the optimal hyperparameters

like the learning rate and number of estimators for us.
♫ Yeah, boyyy! ♫

So, to break it down:

with the right amount of data,

computing power, and machine learning model,

you can discover a solution to any problem.

Genetic algorithms replicate evolution

via selection, crossover, and mutation

to find an optimal solution to a problem,

and TPOT is a Python library

that uses genetic programming to help you

find the best model and hyperparameters for your use case.

The winner of the coding challenge from the last video

is Peter Mitrano.

He added some great Deep Dream samples

to his repository,

and even Deep Dream’d my own video.

And the runner-up is Kyle Jordaan.

Good job stitching all the Deep Dream’d frames

together with one line of code.

The challenge for this video is

to use TPOT and a climate change dataset that I’ll provide

to predict the answer to a question you decide.

This will be great practice in learning to think

like a data scientist.

And I’ll announce the winner next time.

For now, I’ve got to stay fit to reproduce,

so thanks for watching.

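##### Translation info

1. Data + computing + machine learning can solve any problem
2. Genetic algorithms find solutions through selection, crossover, and mutation
3. TPOT uses genetic algorithms to optimize the machine learning process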

Nam