#### Regression Intro - Practical Machine Learning Tutorial with Python p.2

Alright, so now we are at least going to get started with setting up a simple linear regression example.

The first thing that we need to make sure we have is scikit-learn, pandas, and Quandl. So open up a terminal, command prompt, whatever.

And pip install scikit-learn. pip install quandl. And pip install pandas.

Once you have all those, you are good to go. So to install those, go ahead and pause the video and pick back up once you have them.

Ok, so once you have those, let’s go ahead and get started with a simple example. So we are starting with regression,

and the idea of regression is to take continuous data,

and figure out a best fit line to that data.

and basically what that just boils down to is

we are trying to, like, “model” your data

and the way we do that with regression at least with simple linear regression

is just with a straight line so the equation of the line as we will talk more about down the line

but as you might remember from school, y=mx+b,

so if you have x, you can figure out what y is, also if you have m and b.

So basically the whole point of regression is to find out what m and b are.
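To make "finding m and b" concrete, here is a minimal sketch of an ordinary least-squares fit in plain Python. The helper name `fit_line` and the sample points are made up for illustration; the video itself does not show this code, and later it relies on scikit-learn rather than a hand-written fit.

```python
def fit_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    m = cov_xy / var_x
    # Intercept: the best-fit line always passes through the mean point.
    b = mean_y - m * mean_x
    return m, b


# Made-up points that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
m, b = fit_line(xs, ys)
print(m, b)
```

Once m and b are known, any new x can be turned into a predicted y with `m * x + b`.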

So for example, a lot of people use regression with stock prices so that’s what we are gonna do

at least in this one.

And, so the idea is, this is continuous data and you’ve got months and months of stock prices,

and each price is in its own kind of unique day.

But all the data is kind of one dataset together as opposed to with like classification,

where each group of data has its own unique label.

So with machine learning, basically everything boils down, at least with supervised machine learning,

everything boils down to features and labels.

Features are like your attributes, or in this case, the continuous data.

So, let’s go ahead and get started and we’ll talk a little bit more about features.

So first of all, let’s go ahead and import pandas as pd.

And then we are gonna import Quandl with a capital Q.

And then what we are gonna say is df for dataframe
equals Quandl.get and we’ll put in the ticker.

You can get this from quandl.

So if you just go to quandl.com

You can use a little search and find stuff like

we can probably find it.

Let’s see I am trying to find, we are using the wiki dataset

Let’s just do free. Anyway, when you find it you can find

all kinds of different datasets here but we are looking

just simply for the wiki one.

Here it is.

You will pick up a dataset and you can

come over here

here is the Quandl code, and then you can click on, like,

Python, and this is the exact statement to get it.

If you have an account, you can make basically unlimited

requests for free data. If you don’t use an account,

like we are not gonna use an account here,

like we are not gonna use an auth token.

If you don’t have an account, I think it’s limited to, like, 50 calls a day.

We are actually only gonna use Quandl fairly short term here

and then maybe later on. So you really don’t

need to create an account, but if you like quandl,

you might as well make an account at some point.

So anyways, quandl.get and then wiki/google was the ticker there

so then we can just simply print

just so we can see what it is we are working with.
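The download step described above might look roughly like the sketch below. Several things here are assumptions rather than what is shown on screen: the dataset code `"WIKI/GOOGL"` (the transcript only says "wiki/google"), the lowercase `quandl` import used by newer releases of the package (the video imports `Quandl` with a capital Q), and the helper name `fetch_prices`. The call needs network access, and the free WIKI dataset may no longer be updated.

```python
def fetch_prices(code="WIKI/GOOGL"):
    """Download one Quandl dataset as a pandas DataFrame (needs network access)."""
    # Imported inside the function so that merely defining it
    # does not require the quandl package to be installed.
    import quandl  # assumption: lowercase module name in newer releases

    return quandl.get(code)


# Typical use, left commented out because it performs a network request:
# df = fetch_prices()
# print(df.head())
```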

We’ll see that basically each column here

is a feature.

So the open, high, low, close, these are features.

So in machine learning you can have all the features you want but you want to have meaningful features

features that actually have something to do with your data

So some people are pretty avid believers in the ideas like

pattern recognition with stock prices

and that might be you but

do you need every single one of these

open high low close columns to do

pattern recognition? No.

Also, you will notice we’ve got

open, high, low, close, volume, and then the adjusted ones

so with a stock split,

maybe your company has 10 shares

and each share is \$1000

and you decide I want people to be able to buy shares of my company for less than \$1000

So you might say, ok, BAM,

every share is now two shares

so we have 20 total shares and the share price is \$500

so you have adjusted prices to account for that
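The split arithmetic from this example can be written out as a short sketch. The helper name `split_adjust` is hypothetical, and real adjusted prices also account for things like dividends, which this ignores.

```python
def split_adjust(price, split_ratio):
    """Scale a pre-split price so it is comparable with post-split prices."""
    return price / split_ratio


shares_before, price_before = 10, 1000.0
split_ratio = 2  # every share becomes two shares

shares_after = shares_before * split_ratio
price_after = price_before / split_ratio

# The total value of the company is unchanged by the split.
value_before = shares_before * price_before
value_after = shares_after * price_after

# The old $1000 price, expressed in post-split terms.
adjusted_old_price = split_adjust(price_before, split_ratio)
print(shares_after, price_after, adjusted_old_price)
```

With the adjusted series, the day of the split no longer looks like a 50% crash.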

so it doesn’t look like the stock price went from \$1000 to \$500

so we are gonna be using those

but again, each one of these is really related to the other one

like the correlation of these two columns is super high
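One way to check the claim that two price columns are highly correlated is pandas' `Series.corr`. This is a sketch on made-up numbers, and the `Adj. Open` / `Adj. Close` column names are assumed to match the downloaded data.

```python
import pandas as pd

# Made-up prices: each close sits near that day's open.
df = pd.DataFrame(
    {
        "Adj. Open": [50.0, 51.0, 52.5, 51.5, 53.0],
        "Adj. Close": [50.4, 51.6, 52.9, 51.8, 53.6],
    }
)

# Pearson correlation between the two columns, in the range [-1, 1].
r = df["Adj. Open"].corr(df["Adj. Close"])
print(round(r, 3))
```

A value close to 1 suggests the second column adds little information beyond the first.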

so would you use each one of these columns

Does the next one really bring that much meaningful data? No

but one thing to always think about

when you have features and labels is maybe like

what about the relationship between those columns

so when we get into something like deep learning and then some of the other algorithms

you can start to discover relationships between attributes

but with regression, just simply no.

what you wanna do

you wanna like simplify your data as much as possible.

You want as many meaningful features as you can get

but useless features as we’ll show kinda through this series

can really cause a lot of trouble for your machine learning classifiers

especially the more simple ones in supervised learning and so on

anyways let’s close out of this and let’s go ahead and grab some features

what we’re gonna say

first we are gonna pare this down,

we are gonna say dataframe equals the df

and then we are going to create a long list of basically all of the columns that we wanna have

so we are gonna take adjusted, open, and then I’m just gonna go ahead and copy this

copy, ok

so that’s adjusted open, and then we are gonna take

open, so high, low, close, and volume

ok so now we have just these columns so we

kinda recreated our dataframe to just be the open high low close and volume of the adjusted ones.
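Paring the frame down to the adjusted columns could look like the following sketch. The rows are made up, and the exact column names (`Adj. Open` and so on) are an assumption about how the downloaded data is labeled.

```python
import pandas as pd

# Made-up rows standing in for the downloaded data.
df = pd.DataFrame(
    {
        "Open": [100.0, 102.0],
        "High": [105.0, 106.0],
        "Low": [99.0, 101.0],
        "Close": [104.0, 103.0],
        "Volume": [1000.0, 1200.0],
        "Adj. Open": [50.0, 51.0],
        "Adj. High": [52.5, 53.0],
        "Adj. Low": [49.5, 50.5],
        "Adj. Close": [52.0, 51.5],
        "Adj. Volume": [2000.0, 2400.0],
    },
    index=pd.to_datetime(["2004-08-19", "2004-08-20"]),
)

# Indexing with a list of names keeps only those columns, in that order.
keep = ["Adj. Open", "Adj. High", "Adj. Low", "Adj. Close", "Adj. Volume"]
df = df[keep]
print(df.columns.tolist())
```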

so then, like I was saying, some of these columns are relatively worthless

but they do have some relationships

so for example, what is interesting about high and low

is that the margin between high and low tells us

a little bit about volatility for the day.

Also, the open price, that’s the starting price for the day,

and its relationship to the close price

tells us did the price go up

if so, by how much

and did it go down? If so, by how much and so on.

so the relationship there is very valuable.

But a simple linear regression is not gonna seek out that

relationship. It’s just gonna work with whatever

features you feed through it

so what we need to do is define those special relationships

and then use those as our features rather than

almost redundant prices that are not gonna really give us

anything else very useful

first let’s do the high minus the low percent

so this is like the percent volatility almost

so we are gonna define a new column

we are gonna call it HL_percent

and then that is going to be

I’m having a hard time here

that’s gonna be equal to

so percent change is

in this case it would be the high minus low

divided by the low times 100

so for us it would be df Adj high minus the df Adj close

and what’s happening here is just on a per row basis

which is just this column minus this column

so that column divided by df Adj close and then times 100

you can either multiply by 100 or not

the classifier really is not gonna care about that

we are just doing that for ourselves

So that’s the high minus low percent
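A sketch of this column on made-up numbers follows. Note that the formula is first described as high minus low over low, while the expression as read out uses the adjusted close; the sketch follows the expression as read out. The column names are assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Adj. High": [55.0, 102.0],
        "Adj. Low": [49.0, 97.0],
        "Adj. Close": [50.0, 100.0],
    }
)

# Computed row by row: (high - close) / close, expressed as a percentage.
df["HL_percent"] = (df["Adj. High"] - df["Adj. Close"]) / df["Adj. Close"] * 100.0
print(df)
```

Swapping `Adj. Close` for `Adj. Low` in both places would give the high-to-low spread relative to the low instead.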

and then we actually want just the daily percent change

like the daily move

so I’m just gonna copy that whole line, paste

and then we are gonna call this one percent_change

and that is equal to pretty much the same thing only we need to change some stuff

so normally percent change is new minus the old divided by the old times 100

so new minus the old

divided by the old times 100

oh, I am sorry, ha, we did it the wrong way.

divided by the old, so this would be open times 100

so that’s percent change.
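The daily percent change, new minus old over old, can be sketched the same way, with the close as the new value and the open as the old one. The data and column names are again assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Adj. Open": [50.0, 100.0],
        "Adj. Close": [55.0, 98.0],
    }
)

# (new - old) / old * 100, where old is the open and new is the close.
df["percent_change"] = (df["Adj. Close"] - df["Adj. Open"]) / df["Adj. Open"] * 100.0
print(df)
```

A positive value means the price rose over the day, and a negative value means it fell.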

actually, you can pass close here again

the classifier doesn’t really care as long as everything is kinda normalized

but yeah, so either way would have been fine

this is the actual way you should do it anyways

once we have that data

we are gonna define a new dataframe

and we are instead gonna say
so it’s gonna be df equals df[]

and then now we define the only columns that we care about

and so in our case the columns we care about are gonna be

adjusted close, the high low percent, the percent change

and then volume is also somewhat useful to have

so volume is just how many trades occurred, basically, that day

so volume is also kinda related to volatility

so you can also make more features with some sort of relationship there

but we’ll try to keep this pretty simple

so for now we’ll just print df.head

and we wait just to make sure everything worked out

and sure enough it did
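Putting the steps of this tutorial together, a compact sketch might look like this. The function name `build_features`, the sample rows, and the column names are assumptions; the real frame would come from the Quandl download.

```python
import pandas as pd


def build_features(df):
    """Return a new frame with the four columns discussed in this tutorial."""
    out = df.copy()
    out["HL_percent"] = (out["Adj. High"] - out["Adj. Close"]) / out["Adj. Close"] * 100.0
    out["percent_change"] = (out["Adj. Close"] - out["Adj. Open"]) / out["Adj. Open"] * 100.0
    # Keep only the columns we care about, in a fixed order.
    return out[["Adj. Close", "HL_percent", "percent_change", "Adj. Volume"]]


# Made-up rows standing in for the downloaded data.
raw = pd.DataFrame(
    {
        "Adj. Open": [50.0, 100.0],
        "Adj. High": [55.0, 102.0],
        "Adj. Low": [49.0, 97.0],
        "Adj. Close": [52.0, 98.0],
        "Adj. Volume": [2000.0, 2400.0],
    }
)

features = build_features(raw)
print(features.head())
```

Working on a copy leaves the original frame untouched, which makes it easier to try other feature ideas later.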

so we have all the numbers we are kinda interested in

so we got our features and eventually

this will actually wind up being, possibly, our label

but we’ll get to…, I guess think about

between now and the next tutorial

features are kinda the attributes that make up the label

and the label is, hopefully, some sort of prediction

into the future

so will the adjusted close, will this column,

actually be a feature? or will it be a label

as it stands right now.

so think about that, and in the next tutorial

we’ll pick it up and start getting closer to actually

making real predictions with this data

so if you have any questions, comments, whatever, leave them below

otherwise, as always, thanks for watching, thanks for all the support and subscriptions, and until next time
