
Notes : <Hands-on ML with Sklearn & TF> Chapter 2

Chapter 2 - Housing

 

 

 

Main Steps
  1. Look at the big picture
  2. Get the data
  3. Discover and visualize the data to gain insights
  4. Prepare the data for Machine Learning algorithms
  5. Select a model and train it
  6. Fine-tune the model
  7. Present the solution
  8. Launch, monitor, and maintain the system

 

 Frame the Problem and Look at the Big Picture

 
  1. Define the objective in business terms.
  2. How will your solution be used?
  3. What are the current solutions/workarounds (if any)?
  4. How should you frame this problem (supervised/unsupervised, online/offline, etc.)?
  5. How should performance be measured?
  6. Is the performance measure aligned with the business objective?
  7. What would be the minimum performance needed to reach the business objective?
  8. What are comparable problems? Can you reuse experience or tools?
  9. Is human expertise available?
  10. How would you solve the problem manually?
  11. List the assumptions you (or others) have made so far.
  12. Verify assumptions if possible.
 
  1. Objective: revenue; the model's output (a district's predicted median house value) is fed into a separate downstream ML system that decides whether the area is worth investing in (this is a pipeline).
  2. Current solution: district data is gathered and updated manually, and values are estimated with a complex set of rules.
  3. Problem framing: supervised learning (label = median house value), multivariate regression, batch learning (if the dataset were huge, split it and use MapReduce).
  4. Performance measure: RMSE or MAE, i.e. the l2 or l1 norm of the errors, or some other norm (see the sketch after this list).
  5. Comparable problems: Chapter 1, Example 1-1.
  6. Assumption: the downstream system needs the actual predicted values.
  7. Verify the assumption: it needs actual prices, not price categories.
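
RMSE and MAE both measure the distance between the vector of predictions and the vector of target values (essentially the l2 and l1 norms of the error vector). A minimal sketch with made-up numbers, just to illustrate the two measures:

import numpy as np

# hypothetical labels and predictions, purely for illustration
y_true = np.array([250000., 180000., 310000.])
y_pred = np.array([240000., 200000., 300000.])

errors = y_pred - y_true
rmse = np.sqrt(np.mean(errors ** 2))   # root mean square error (based on the l2 norm)
mae = np.mean(np.abs(errors))          # mean absolute error (based on the l1 norm)
print(rmse, mae)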
 

Get the Data

 

Note: automate as much as possible so you can easily get fresh data.

  1. List the data you need and how much you need.
  2. Find and document where you can get that data.
  3. Check how much space it will take.
  4. Check legal obligations, and get authorization if necessary.
  5. Get access authorizations.
  6. Create a workspace (with enough storage space).
  7. Get the data.
  8. Convert the data to a format you can easily manipulate (without changing the data itself).
  9. Ensure sensitive information is deleted or protected (e.g., anonymized).
  10. Check the size and type of data (time series, sample, geographical, etc.).
  11. Sample a test set, put it aside, and never look at it (no data snooping!).
In [1]:
# Fetch the data
import os
import tarfile
from six.moves import urllib

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = "datasets/housing"
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"

def fetch_housing_data(housing_url = HOUSING_URL, housing_path = HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)  # download housing.tgz (skip if already downloaded)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path = housing_path)
    housing_tgz.close()

fetch_housing_data()
In [2]:
# Load the data and take a quick look at it
import pandas as pd

def load_housing_data(housing_path = HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()
housing.head()
Out[2]:
 
 longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value  ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
In [3]:
housing.info()
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
In [4]:
housing["ocean_proximity"].value_counts()
Out[4]:
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64
In [5]:
housing.describe()
Out[5]:
 
 longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value
count 20640.000000 20640.000000 20640.000000 20640.000000 20433.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean -119.569704 35.631861 28.639486 2635.763081 537.870553 1425.476744 499.539680 3.870671 206855.816909
std 2.003532 2.135952 12.585558 2181.615252 421.385070 1132.462122 382.329753 1.899822 115395.615874
min -124.350000 32.540000 1.000000 2.000000 1.000000 3.000000 1.000000 0.499900 14999.000000
25% -121.800000 33.930000 18.000000 1447.750000 296.000000 787.000000 280.000000 2.563400 119600.000000
50% -118.490000 34.260000 29.000000 2127.000000 435.000000 1166.000000 409.000000 3.534800 179700.000000
75% -118.010000 37.710000 37.000000 3148.000000 647.000000 1725.000000 605.000000 4.743250 264725.000000
max -114.310000 41.950000 52.000000 39320.000000 6445.000000 35682.000000 6082.000000 15.000100 500001.000000
In [6]:
%matplotlib inline
# magic command: embed generated figures in the Jupyter notebook
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
plt.show()
 
In [7]:
%magic
 

Notice:

  1. Some attributes have been scaled and capped; discuss with the downstream team whether the capped values are a problem for them.
  2. Many attributes have tail-heavy distributions; try transforming them into more bell-shaped distributions (see the sketch below).
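
For example, a log transform often makes a tail-heavy attribute such as population look more bell-shaped. A rough sketch, assuming the housing DataFrame loaded above (the choice of column is just an illustration):

import numpy as np
import matplotlib.pyplot as plt

housing["population"].hist(bins=50)          # raw distribution: long right tail
plt.show()
np.log(housing["population"]).hist(bins=50)  # log-transformed: much closer to bell-shaped
plt.show()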
 

To avoid data snooping bias, create a test set and set it aside.

In [8]:
import numpy as np
import numpy.random as rnd
rnd.seed(42) # to make this notebook's output identical at every run: the same seed produces the same pseudo-random sequence
# without a fixed seed, a different permutation would be generated on each run
def split_train_test(data, test_ratio):
    shuffled_indices = rnd.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

train_set, test_set = split_train_test(housing, 0.2)
print(len(train_set), len(test_set))
 
16512 4128
 

Although seed(number) yields the same pseudo-random sequence on every run, this solution breaks the next time you fetch an updated dataset. A more robust approach is to use each instance's identifier to decide whether or not it should go into the test set. The housing data has no identifier column, so you can either use housing.reset_index() to add an index column, or build a unique identifier from the most stable features.

In [9]:
import hashlib

def test_set_check(identifier, test_ratio, hash):
    return hash(np.int64(identifier)).digest()[-1] < 256*test_ratio  # compare the last byte of the hash (0-255) with 256*test_ratio; instances below the threshold go into the test set

def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash)) #http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html#pandas.DataFrame.apply
    return data.loc[~in_test_set],data.loc[in_test_set]  #http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html#pandas.DataFrame.loc

# Option 1: use the row index (0...20639) as the identifier to hash
housing_with_id_1 = housing.reset_index()   #http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html
train_set_1, test_set_1 = split_train_test_by_id(housing_with_id_1, 0.2, "index")
print(len(train_set_1),len(test_set_1))

# Option 2: combine longitude and latitude into an id and hash that
housing_with_id = housing.copy()
housing_with_id["id"] = housing["longitude"]*1000 + housing["latitude"]
train_set_2, test_set_2 = split_train_test_by_id(housing_with_id, 0.2, "id")
print(len(train_set_2),len(test_set_2))

# Option 3: simply use Scikit-Learn's train_test_split
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
print(len(train_set),len(test_set))
 
16362 4278
16267 4373
16512 4128
In [10]:
# Stratified sampling: the test set should be representative of the whole dataset, and you need a sufficient number of instances in your dataset for each stratum
housing["income_cat"] = np.ceil(housing["median_income"]/1.5)
housing["income_cat"].where(housing["income_cat"]<5, 5.0, inplace=True)  #小于5的保留,大于的归入5。http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html#pandas.DataFrame.where

from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)  #http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html
for train_index,test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

housing["income_cat"].value_counts()/len(housing)
Out[10]:
3.0    0.350581
2.0    0.318847
4.0    0.176308
5.0    0.114438
1.0    0.039826
Name: income_cat, dtype: float64
In [11]:
strat_train_set["income_cat"].value_counts()/len(strat_train_set)
Out[11]:
3.0    0.350594
2.0    0.318859
4.0    0.176296
5.0    0.114402
1.0    0.039850
Name: income_cat, dtype: float64
In [12]:
for set_ in (strat_test_set, strat_train_set):
    set_.drop(["income_cat"], axis=1, inplace=True)  # remove income_cat so the data is back to its original attributes
 

Explore the Data

 

Note: try to get insights from a field expert for these steps.

  1. Create a copy of the data for exploration (sampling it down to a manageable size if necessary).
  2. Create a Jupyter notebook to keep a record of your data exploration.
  3. Study each attribute and its characteristics:

    • Name
    • Type (categorical, int/float, bounded/unbounded, text, structured, etc.)
    • % of missing values
    • Noisiness and type of noise (stochastic, outliers, rounding errors, etc.)
    • Possibly useful for the task?
    • Type of distribution (Gaussian, uniform, logarithmic, etc.)

  4. For supervised learning tasks, identify the target attribute(s).
  5. Visualize the data.
  6. Study the correlations between attributes.
  7. Study how you would solve the problem manually.
  8. Identify the promising transformations you may want to apply.
  9. Identify extra data that would be useful (go back to “Get the Data”).
  10. Document what you have learned.
 

Visualizing Geographical Data

In [13]:
# create a copy of the training set for exploration
housing = strat_train_set.copy()
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=.1)
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff1756e8f60>
 
In [14]:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4, s=housing["population"]/100, label="population", c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True)
plt.legend()
Out[14]:
<matplotlib.legend.Legend at 0x7ff172cfe550>
 
 

Looking for Correlations (compute the standard correlation coefficient between every pair of attributes)

In [15]:
corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)
Out[15]:
median_house_value    1.000000
median_income         0.687160
total_rooms           0.135097
housing_median_age    0.114110
households            0.064506
total_bedrooms        0.047689
population           -0.026920
longitude            -0.047432
latitude             -0.142724
Name: median_house_value, dtype: float64
In [16]:
# another way to check for correlations between attributes is to use Pandas' scatter_matrix
from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
scatter_matrix(housing[attributes],figsize=(24,16))
Out[16]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7ff17598bc50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff1757b21d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff17578d4e0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff172cfc198>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7ff172d28630>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff175847b70>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff172d89898>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff1758e0d30>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7ff1758a2400>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff17585b9b0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff175a09550>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff172d124e0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7ff175b19d68>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff175bc58d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff174e36a90>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff17294dda0>]], dtype=object)
 
In [17]:
#zoom in
housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff172930fd0>
 
 

Experimenting with Attribute Combinations (check how the correlation coefficients change compared with the original attributes)

In [18]:
#try out various attribute combinations
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"] = housing["population"]/housing["households"]

#correlation matrix 
corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)
Out[18]:
median_house_value          1.000000
median_income               0.687160
rooms_per_household         0.146285
total_rooms                 0.135097
housing_median_age          0.114110
households                  0.064506
total_bedrooms              0.047689
population_per_household   -0.021985
population                 -0.026920
longitude                  -0.047432
latitude                   -0.142724
bedrooms_per_room          -0.259984
Name: median_house_value, dtype: float64
 

Prepare the Data

 

Notes:

  1. Work on copies of the data (keep the original dataset intact).
  2. Write functions for all data transformations you apply, for five reasons:
    • So you can easily prepare the data the next time you get a fresh dataset
    • So you can apply these transformations in future projects
    • To clean and prepare the test set
    • To clean and prepare new data instances once your solution is live
    • To make it easy to treat your preparation choices as hyperparameters
  3. Data cleaning:
    • Fix or remove outliers (optional).
    • Fill in missing values (e.g., with zero, mean, median…) or drop their rows (or columns).
  4. Feature selection (optional):
    • Drop the attributes that provide no useful information for the task.
  5. Feature engineering, where appropriate:
    • Discretize continuous features.
    • Decompose features (e.g., categorical, date/time, etc.).
    • Add promising transformations of features (e.g., log(x), sqrt(x), x^2, etc.).
    • Aggregate features into promising new features.
  6. Feature scaling: standardize or normalize features.
 

Requirements

  1. Reproduce these transformations easily on any dataset.
  2. Gradually build a library of transformation functions that can be reused.
  3. Use these functions in your live system to transform new data before feeding it to your algorithms.
  4. Easily try various transformations and see which combination of transformations works best.
In [19]:
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

# housing.dropna(subset=["total_bedrooms"])      # option 1: drop the corresponding districts
# housing.drop("total_bedrooms", axis=1)         # option 2: drop the whole attribute
# median = housing["total_bedrooms"].median()    # option 3: fill the missing values with some value (here, the median)
# housing["total_bedrooms"].fillna(median)       # option 3

# option 3 with Scikit-Learn's Imputer
from sklearn.preprocessing import Imputer
imputer = Imputer(strategy = "median")
housing_num = housing.drop("ocean_proximity", axis=1)
imputer.fit(housing_num)
print(imputer.statistics_)
print(housing_num.median().values)
X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns = housing_num.columns)
 
[ -118.51      34.26      29.      2119.5      433.      1164.       408.
     3.5409]
[ -118.51      34.26      29.      2119.5      433.      1164.       408.
     3.5409]
 

Scikit-Learn's API Design

  • Estimator : any object that can estimate some parameters based on a dataset
    • fit()
      • the estimation itself is performed by the fit() method, which takes one or two datasets as parameters
      • any other parameters are hyperparameters, set as public instance variables
    • Transformer : transform()
      • transforms a dataset and returns the transformed dataset
      • generally relies on the learned parameters
    • Predictor : predict()
      • takes a new dataset and returns a dataset of corresponding predictions (labels in supervised learning)
      • also has a score() method that measures the quality of the predictions
  • Inspection
    • an estimator's hyperparameters are accessible directly via public instance variables
    • an estimator's learned parameters are accessible via public instance variables with an underscore suffix
  • Nonproliferation of classes
    • datasets are NumPy arrays or SciPy sparse matrices
    • hyperparameters are plain Python strings or numbers
  • Composition
    • existing building blocks are reused as much as possible
  • Sensible defaults
    • provides reasonable default values for most parameters, making it easy to create a baseline working system (a minimal custom-transformer sketch follows this list)
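
To illustrate this duck-typed API, here is a minimal sketch of a custom standardizing transformer (the class name and logic are made up for illustration; the CombinedAttributesAdder defined further down in this notebook follows the same pattern):

from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

class SimpleStandardizer(BaseEstimator, TransformerMixin):
    def __init__(self, with_mean=True):    # hyperparameter, stored as a public instance variable
        self.with_mean = with_mean
    def fit(self, X, y=None):
        self.mean_ = np.mean(X, axis=0)    # learned parameters get an underscore suffix
        self.scale_ = np.std(X, axis=0)
        return self                        # fit() returns self so calls can be chained
    def transform(self, X):
        X_centered = X - self.mean_ if self.with_mean else X
        return X_centered / self.scale_    # returns the transformed dataset

# TransformerMixin provides fit_transform() for free;
# BaseEstimator provides get_params()/set_params() as long as __init__ uses explicit keyword arguments.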
In [20]:
imputer.strategy
Out[20]:
'median'
 

Handling Text and Categorical Attributes

In [21]:
# convert these text labels to numbers
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
housing_cat = housing["ocean_proximity"]
housing_cat_encoded = encoder.fit_transform(housing_cat)
housing_cat_encoded
Out[21]:
array([0, 0, 4, ..., 1, 0, 3])
In [22]:
print(encoder.classes_)
 
['<1H OCEAN' 'INLAND' 'ISLAND' 'NEAR BAY' 'NEAR OCEAN']
In [23]:
# use OneHotEncoder encoder to convert integer categorical values into one-hot vector
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
# fit_transform expects a 2D array, so reshape the 1D array first
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1, 1))  
housing_cat_1hot
Out[23]:
<16512x5 sparse matrix of type '<class 'numpy.float64'>'
	with 16512 stored elements in Compressed Sparse Row format>
In [24]:
housing_cat_1hot.toarray()
Out[24]:
array([[ 1.,  0.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.],
       ..., 
       [ 0.,  1.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.]])
In [25]:
# use LabelBinarizer to do both steps above (LabelEncoder + OneHotEncoder) in one shot
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()
housing_cat_1hot = encoder.fit_transform(housing_cat)  # pass sparse_output=True to the constructor to get a SciPy sparse matrix instead
housing_cat_1hot
Out[25]:
array([[1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1],
       ..., 
       [0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 1, 0]])
In [26]:
# custom transformer for adding combined attributes
from sklearn.base import BaseEstimator, TransformerMixin

rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True):   # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix]/X[:, household_ix]
        population_per_household = X[:, population_ix]/X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix]/X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
        
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room = False)
housing_extra_attribs = attr_adder.transform(housing.values)
In [27]:
housing.values
Out[27]:
array([[-121.89, 37.29, 38.0, ..., 339.0, 2.7042, '<1H OCEAN'],
       [-121.93, 37.05, 14.0, ..., 113.0, 6.4214, '<1H OCEAN'],
       [-117.2, 32.77, 31.0, ..., 462.0, 2.8621, 'NEAR OCEAN'],
       ..., 
       [-116.4, 34.09, 9.0, ..., 765.0, 3.2723, 'INLAND'],
       [-118.01, 33.82, 31.0, ..., 356.0, 4.0625, '<1H OCEAN'],
       [-122.45, 37.77, 52.0, ..., 639.0, 3.575, 'NEAR BAY']], dtype=object)
In [28]:
housing_extra_attribs = pd.DataFrame(housing_extra_attribs, columns=list(housing.columns)+["rooms_per_household", "population_per_household"])
housing_extra_attribs.head()
Out[28]:
 
 longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  ocean_proximity  rooms_per_household  population_per_household
0 -121.89 37.29 38 1568 351 710 339 2.7042 <1H OCEAN 4.62537 2.0944
1 -121.93 37.05 14 679 108 306 113 6.4214 <1H OCEAN 6.00885 2.70796
2 -117.2 32.77 31 1952 471 936 462 2.8621 NEAR OCEAN 4.22511 2.02597
3 -119.61 36.31 25 1847 371 1460 353 1.8839 INLAND 5.23229 4.13598
4 -118.59 34.23 17 6592 1525 4459 1463 3.0347 <1H OCEAN 4.50581 3.04785
 

Feature Scaling :

  1. min-max scaling : MinMaxScaler : rescales each attribute to the 0-1 range by default
  2. standardization : StandardScaler : subtracts the mean, then divides by the standard deviation, giving zero mean and unit variance (see the sketch below)
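
A minimal sketch contrasting the two scalers, reusing the imputed numerical DataFrame housing_tr from the cells above:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

min_max_scaler = MinMaxScaler()    # rescales each attribute to the 0-1 range by default
std_scaler = StandardScaler()      # subtracts the mean and divides by the standard deviation

print(min_max_scaler.fit_transform(housing_tr)[:2])
print(std_scaler.fit_transform(housing_tr)[:2])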
 

Transformation Pipelines : apply a sequence of transformations

In [29]:
# use a Pipeline to run the estimators' fit_transform() methods in sequence
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([('imputer', Imputer(strategy='median')),\
                        ('attribs_adder', CombinedAttributesAdder()),\
                        ('std_scaler', StandardScaler()),])  # three steps, each step's output feeds the next: impute missing values -> add combined attributes -> standardize
# housing_num contains only the numerical columns
housing_num_tr = num_pipeline.fit_transform(housing_num)

housing_num_tr[0:5]
Out[29]:
array([[-1.15604281,  0.77194962,  0.74333089, -0.49323393, -0.44543821,
        -0.63621141, -0.42069842, -0.61493744, -0.31205452, -0.08649871,
         0.15531753],
       [-1.17602483,  0.6596948 , -1.1653172 , -0.90896655, -1.0369278 ,
        -0.99833135, -1.02222705,  1.33645936,  0.21768338, -0.03353391,
        -0.83628902],
       [ 1.18684903, -1.34218285,  0.18664186, -0.31365989, -0.15334458,
        -0.43363936, -0.0933178 , -0.5320456 , -0.46531516, -0.09240499,
         0.4222004 ],
       [-0.01706767,  0.31357576, -0.29052016, -0.36276217, -0.39675594,
         0.03604096, -0.38343559, -1.04556555, -0.07966124,  0.08973561,
        -0.19645314],
       [ 0.49247384, -0.65929936, -0.92673619,  1.85619316,  2.41221109,
         2.72415407,  2.57097492, -0.44143679, -0.35783383, -0.00419445,
         0.2699277 ]])
In [30]:
# join the transformed numerical columns and the categorical (text) column with a FeatureUnion
from sklearn.pipeline import FeatureUnion

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

# apply the LabelBinarizer on the categorical values
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([('selector', DataFrameSelector(num_attribs)),\
                         ('imputer', Imputer(strategy='median')),\
                         ('attribs_adder', CombinedAttributesAdder()),\
                         ('std_scaler', StandardScaler())])
cat_pipeline = Pipeline([('selector', DataFrameSelector(cat_attribs)),\
                         ('label_binarizer', LabelBinarizer()),])
full_pipeline = FeatureUnion(transformer_list=[("num_pipeline", num_pipeline), ("cat_pipeline", cat_pipeline)])
In [79]:
from sklearn.pipeline import FeatureUnion

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        return X[self.attribute_names].values

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('imputer', Imputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('label_binarizer', LabelBinarizer()),
    ])

preparation_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])
In [32]:
housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared
Out[32]:
array([[-1.15604281,  0.77194962,  0.74333089, ...,  0.        ,
         0.        ,  0.        ],
       [-1.17602483,  0.6596948 , -1.1653172 , ...,  0.        ,
         0.        ,  0.        ],
       [ 1.18684903, -1.34218285,  0.18664186, ...,  0.        ,
         0.        ,  1.        ],
       ..., 
       [ 1.58648943, -0.72478134, -1.56295222, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.78221312, -0.85106801,  0.18664186, ...,  0.        ,
         0.        ,  0.        ],
       [-1.43579109,  0.99645926,  1.85670895, ...,  0.        ,
         1.        ,  0.        ]])
In [33]:
housing_prepared.shape
Out[33]:
(16512, 16)
 

Select and Train a Model

 

Notes: If the data is huge, you may want to sample smaller training sets so you can train many different models in a reasonable time (be aware that this penalizes complex models such as large neural nets or Random Forests). Once again, try to automate these steps as much as possible.

  • Train many quick and dirty models from different categories (e.g., linear, naive Bayes, SVM, Random Forests, neural net, etc.) using standard parameters.
  • Measure and compare their performance.
    • For each model, use N-fold cross-validation and compute the mean and standard deviation of the performance measure on the N folds.
  • Analyze the most significant variables for each algorithm.
  • Analyze the types of errors the models make.
    • What data would a human have used to avoid these errors?
  • Have a quick round of feature selection and engineering.
  • Have one or two more quick iterations of the five previous steps. Short-list the top three to five most promising models, preferring models that make different types of errors.
In [34]:
# Linear Regression
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
Out[34]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [35]:
#prediction
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)

print("Prediation:\t", lin_reg.predict(some_data_prepared))
print("Labels:\t\t", list(some_labels))
 
Predictions:	 [ 210644.60459286  317768.80697211  210956.43331178   59218.98886849
  189747.55849879]
Labels:		 [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]
In [36]:
#RMSE
from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse
Out[36]:
68628.198198489234
In [37]:
# Decision Tree
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)

housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse  # overfitting
Out[37]:
0.0
In [38]:
# cross validation
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

def display_scores(scores):
    print("Scores:\t",scores)
    print("Mean:\t",scores.mean())
    print("Standard deviation:", scores.std())
    
display_scores(tree_rmse_scores)
 
Scores:	 [ 69316.02634772  65498.84994772  71404.25935862  69098.46240168
  70580.30735263  75540.88413124  69717.93143674  70428.42648461
  75888.17618283  68976.12268448]
Mean:	 70644.9446328
Standard deviation: 2938.93789263
In [39]:
scores = cross_val_score(lin_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-scores)

display_scores(lin_rmse_scores)
 
Scores:	 [ 66777.82733235  66965.45468791  70347.95244419  74772.12135067
  68031.13388938  71241.5745495   64960.48650029  68274.99450087
  71552.91566558  67665.10082067]
Mean:	 69058.9561741
Standard deviation: 2743.77518885
 

The Decision Tree is badly overfitting the training data, while the Linear Regression model is underfitting.

In [40]:
# Random Forests work by training many Decision Trees on random subsets of the features
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)

forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
    
display_scores(forest_rmse_scores)
 
Scores:	 [ 52869.23106834  49189.93801195  51726.73647871  54995.98190463
  50979.93079904  55978.43765914  52283.7609046   51001.92227546
  54447.35786983  53389.94422283]
Mean:	 52686.3241195
Standard deviation: 1971.26547795
In [41]:
from sklearn.svm import SVR

svm_reg = SVR(kernel="linear")
svm_reg.fit(housing_prepared, housing_labels)
housing_predictions = svm_reg.predict(housing_prepared)
svm_mse = mean_squared_error(housing_labels, housing_predictions)
svm_rmse = np.sqrt(svm_mse)
svm_rmse
Out[41]:
111094.6308539982
 

Fine-tune the System

 

Notes: You will want to use as much data as possible for this step, especially as you move toward the end of fine-tuning. As always automate what you can.

  • Fine-tune the hyperparameters using cross-validation.
    • Treat your data transformation choices as hyperparameters, especially when you are not sure about them (e.g., should I replace missing values with zero or with the median value? Or just drop the rows?).
    • Unless there are very few hyperparameter values to explore, prefer random search over grid search. If training is very long, you may prefer a Bayesian optimization approach (e.g., using Gaussian process priors, as described by Jasper Snoek, Hugo Larochelle, and Ryan Adams).1
  • Try Ensemble methods. Combining your best models will often perform better than running them individually (a naive averaging sketch follows this list).
  • Once you are confident about your final model, measure its performance on the test set to estimate the generalization error.
    • Warning: don't tweak your model after measuring the generalization error; you would just start overfitting the test set.
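
As a rough illustration of ensembling (not code from the book), one could simply average the predictions of two models already fitted above, e.g. lin_reg and forest_reg, and check the RMSE of the blend on the training set (a proper comparison would use cross-validation):

# naive averaging ensemble of two already-fitted models, illustration only
lin_pred = lin_reg.predict(housing_prepared)
forest_pred = forest_reg.predict(housing_prepared)
blended_pred = (lin_pred + forest_pred) / 2

blend_rmse = np.sqrt(mean_squared_error(housing_labels, blended_pred))
blend_rmse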
 

Tell GridSearchCV which hyperparameters you want it to experiment with and what values to try out, and it will evaluate all possible combinations of hyperparameter values using cross-validation.

In [42]:
# grid search
from sklearn.model_selection import GridSearchCV

param_grid = [
        {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
        {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
    ]

forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)
Out[42]:
GridSearchCV(cv=5, error_score='raise',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid=[{'max_features': [2, 4, 6, 8], 'n_estimators': [3, 10, 30]}, {'bootstrap': [False], 'max_features': [2, 3, 4], 'n_estimators': [3, 10]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='neg_mean_squared_error', verbose=0)
In [43]:
grid_search.best_params_
Out[43]:
{'max_features': 6, 'n_estimators': 30}
In [44]:
grid_search.best_estimator_
Out[44]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features=6, max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=30, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
In [45]:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
 
64912.0351358 {'max_features': 2, 'n_estimators': 3}
55535.2786524 {'max_features': 2, 'n_estimators': 10}
52940.2696165 {'max_features': 2, 'n_estimators': 30}
60384.0908354 {'max_features': 4, 'n_estimators': 3}
52709.9199934 {'max_features': 4, 'n_estimators': 10}
50503.5985321 {'max_features': 4, 'n_estimators': 30}
59058.1153485 {'max_features': 6, 'n_estimators': 3}
52172.0292957 {'max_features': 6, 'n_estimators': 10}
49958.9555932 {'max_features': 6, 'n_estimators': 30}
59122.260006 {'max_features': 8, 'n_estimators': 3}
52441.5896087 {'max_features': 8, 'n_estimators': 10}
50041.4899416 {'max_features': 8, 'n_estimators': 30}
62371.1221202 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54572.2557534 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
59634.0533132 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52456.0883904 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
58825.665239 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
52012.9945396 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}
In [46]:
pd.DataFrame(grid_search.cv_results_)
Out[46]:
 
 mean_fit_time  mean_score_time  mean_test_score  mean_train_score  param_bootstrap  param_max_features  param_n_estimators  params  rank_test_score  split0_test_score  ...  split2_test_score  split2_train_score  split3_test_score  split3_train_score  split4_test_score  split4_train_score  std_fit_time  std_score_time  std_test_score  std_train_score
0 0.060753 0.003462 -4.213572e+09 -1.122089e+09 NaN 2 3 {'max_features': 2, 'n_estimators': 3} 18 -4.322392e+09 ... -4.091199e+09 -1.132659e+09 -4.048299e+09 -1.084169e+09 -4.278616e+09 -1.181979e+09 0.007624 0.000016 1.194097e+08 4.503304e+07
1 0.190027 0.009662 -3.084167e+09 -5.686194e+08 NaN 2 10 {'max_features': 2, 'n_estimators': 10} 11 -2.920668e+09 ... -3.189759e+09 -5.684440e+08 -2.977423e+09 -5.753131e+08 -3.140389e+09 -5.569981e+08 0.008264 0.000134 1.133146e+08 1.555889e+07
2 0.560841 0.027835 -2.802672e+09 -4.390709e+08 NaN 2 30 {'max_features': 2, 'n_estimators': 30} 9 -2.635798e+09 ... -2.899767e+09 -4.299952e+08 -2.628577e+09 -4.459977e+08 -2.910563e+09 -4.319555e+08 0.005079 0.000553 1.398004e+08 6.703308e+06
3 0.093439 0.003463 -3.646238e+09 -9.779480e+08 NaN 4 3 {'max_features': 4, 'n_estimators': 3} 16 -3.583831e+09 ... -3.950913e+09 -9.887841e+08 -3.308822e+09 -1.011182e+09 -3.662211e+09 -9.190933e+08 0.001568 0.000022 2.083609e+08 3.282495e+07
4 0.320713 0.009649 -2.778336e+09 -5.111719e+08 NaN 4 10 {'max_features': 4, 'n_estimators': 10} 8 -2.703532e+09 ... -2.884782e+09 -4.948073e+08 -2.650746e+09 -5.355259e+08 -2.882622e+09 -5.245530e+08 0.023930 0.000276 9.396457e+07 2.504651e+07
5 0.953676 0.027456 -2.550613e+09 -3.959620e+08 NaN 4 30 {'max_features': 4, 'n_estimators': 30} 3 -2.302148e+09 ... -2.682066e+09 -3.923106e+08 -2.492072e+09 -4.120956e+08 -2.653622e+09 -3.940042e+08 0.035897 0.000251 1.402365e+08 8.599915e+06
6 0.130158 0.003463 -3.487861e+09 -9.302686e+08 NaN 6 3 {'max_features': 6, 'n_estimators': 3} 13 -3.323532e+09 ... -3.477330e+09 -8.673024e+08 -3.255834e+09 -9.719544e+08 -3.719818e+09 -9.505729e+08 0.002415 0.000006 1.818530e+08 3.502555e+07
7 0.428550 0.009539 -2.721921e+09 -5.009736e+08 NaN 6 10 {'max_features': 6, 'n_estimators': 10} 5 -2.605933e+09 ... -2.871511e+09 -4.944972e+08 -2.601547e+09 -5.127494e+08 -2.799685e+09 -4.853162e+08 0.003952 0.000054 1.062510e+08 1.527767e+07
8 1.303528 0.027117 -2.495897e+09 -3.848766e+08 NaN 6 30 {'max_features': 6, 'n_estimators': 30} 1 -2.410445e+09 ... -2.600516e+09 -3.791315e+08 -2.304437e+09 -3.834466e+08 -2.627380e+09 -3.763532e+08 0.023792 0.000425 1.215337e+08 7.051109e+06
9 0.167242 0.003468 -3.495442e+09 -8.965714e+08 NaN 8 3 {'max_features': 8, 'n_estimators': 3} 14 -3.274179e+09 ... -3.517974e+09 -9.317195e+08 -3.512932e+09 -9.331547e+08 -3.562802e+09 -8.541539e+08 0.001919 0.000029 1.160130e+08 3.128209e+07
10 0.557235 0.009574 -2.750120e+09 -5.032131e+08 NaN 8 10 {'max_features': 8, 'n_estimators': 10} 6 -2.694581e+09 ... -2.883188e+09 -4.955736e+08 -2.540331e+09 -5.046915e+08 -2.845775e+09 -4.838147e+08 0.004840 0.000046 1.227074e+08 1.561460e+07
11 1.705456 0.027242 -2.504151e+09 -3.825022e+08 NaN 8 30 {'max_features': 8, 'n_estimators': 30} 2 -2.371638e+09 ... -2.565840e+09 -3.751654e+08 -2.377880e+09 -3.897076e+08 -2.653704e+09 -3.785671e+08 0.052605 0.000643 1.112988e+08 5.231629e+06
12 0.091101 0.003988 -3.890157e+09 0.000000e+00 False 2 3 {'bootstrap': False, 'max_features': 2, 'n_est... 17 -3.617603e+09 ... -4.217359e+09 -0.000000e+00 -3.780422e+09 -0.000000e+00 -3.677274e+09 -0.000000e+00 0.002504 0.000067 2.492080e+08 0.000000e+00
13 0.294909 0.011605 -2.978131e+09 0.000000e+00 False 2 10 {'bootstrap': False, 'max_features': 2, 'n_est... 10 -2.815093e+09 ... -3.044746e+09 -0.000000e+00 -2.827508e+09 -0.000000e+00 -3.097349e+09 -0.000000e+00 0.001960 0.000504 1.298188e+08 0.000000e+00
14 0.117953 0.003990 -3.556220e+09 0.000000e+00 False 3 3 {'bootstrap': False, 'max_features': 3, 'n_est... 15 -3.546021e+09 ... -3.625256e+09 -0.000000e+00 -3.465998e+09 -0.000000e+00 -3.596042e+09 -0.000000e+00 0.002875 0.000036 5.415723e+07 0.000000e+00
15 0.395329 0.011699 -2.751641e+09 0.000000e+00 False 3 10 {'bootstrap': False, 'max_features': 3, 'n_est... 7 -2.604595e+09 ... -2.789225e+09 -0.000000e+00 -2.644243e+09 -0.000000e+00 -2.895713e+09 -0.000000e+00 0.005156 0.000551 1.101169e+08 0.000000e+00
16 0.150052 0.003963 -3.460459e+09 0.000000e+00 False 4 3 {'bootstrap': False, 'max_features': 4, 'n_est... 12 -3.060089e+09 ... -3.597422e+09 -0.000000e+00 -3.416000e+09 -0.000000e+00 -3.699168e+09 -0.000000e+00 0.005011 0.000038 2.203775e+08 0.000000e+00
17 0.494180 0.011123 -2.705352e+09 0.000000e+00 False 4 10 {'bootstrap': False, 'max_features': 4, 'n_est... 4 -2.534795e+09 ... -2.748411e+09 -0.000000e+00 -2.497470e+09 -0.000000e+00 -2.897782e+09 -0.000000e+00 0.006378 0.000020 1.622491e+08 0.000000e+00

18 rows × 23 columns

In [47]:
# randomized search
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
        'n_estimators': randint(low=1, high=200),
        'max_features': randint(low=1, high=8),
    }

forest_reg = RandomForestRegressor()
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error')
rnd_search.fit(housing_prepared, housing_labels)
Out[47]:
RandomizedSearchCV(cv=5, error_score='raise',
          estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False),
          fit_params={}, iid=True, n_iter=10, n_jobs=1,
          param_distributions={'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7ff15d8c6d68>, 'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7ff15d8ca0b8>},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score=True, scoring='neg_mean_squared_error',
          verbose=0)
In [48]:
cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
 
50239.6442738 {'max_features': 3, 'n_estimators': 121}
50307.8432326 {'max_features': 3, 'n_estimators': 187}
49185.0150532 {'max_features': 6, 'n_estimators': 88}
49133.3305418 {'max_features': 5, 'n_estimators': 137}
49021.6318804 {'max_features': 7, 'n_estimators': 197}
49636.8878839 {'max_features': 6, 'n_estimators': 39}
52273.457854 {'max_features': 2, 'n_estimators': 50}
54413.8506712 {'max_features': 1, 'n_estimators': 184}
51953.3364641 {'max_features': 2, 'n_estimators': 71}
49174.1414792 {'max_features': 6, 'n_estimators': 140}
 

Analyze the Best Models and Their Errors

In [49]:
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances
Out[49]:
array([  7.38167445e-02,   6.93425001e-02,   4.41741354e-02,
         1.80040251e-02,   1.65486595e-02,   1.80013616e-02,
         1.59794977e-02,   3.32716759e-01,   5.57319056e-02,
         1.05464076e-01,   8.70481930e-02,   9.90812199e-03,
         1.43083072e-01,   1.03976446e-04,   3.87833961e-03,
         6.19863203e-03])
In [50]:
extra_attribs = ["rooms_per_household", "population_per_household", "bedrooms_per_room"]
cat_one_hot_attribs = list(encoder.classes_)
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)
Out[50]:
[(0.33271675888101543, 'median_income'),
 (0.143083072155538, 'INLAND'),
 (0.10546407641519769, 'population_per_household'),
 (0.087048193008911728, 'bedrooms_per_room'),
 (0.073816744482605889, 'longitude'),
 (0.069342500136275007, 'latitude'),
 (0.055731905592885149, 'rooms_per_household'),
 (0.0441741353556422, 'housing_median_age'),
 (0.018004025077788675, 'total_rooms'),
 (0.018001361574586823, 'population'),
 (0.016548659526332148, 'total_bedrooms'),
 (0.015979497714769569, 'households'),
 (0.0099081219902823602, '<1H OCEAN'),
 (0.0061986320346514171, 'NEAR OCEAN'),
 (0.003878339607642522, 'NEAR BAY'),
 (0.00010397644587548846, 'ISLAND')]
 

Evaluate Your System on the Test Set

In [51]:
final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_transformed = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_transformed)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse
Out[51]:
47728.481110152476
 

Present Your Solution

 
  1. Document what you have done.
  2. Create a nice presentation.
    • Make sure you highlight the big picture first.
  3. Explain why your solution achieves the business objective.
  4. Don’t forget to present interesting points you noticed along the way.
    • Describe what worked and what did not.
    • List your assumptions and your system’s limitations.
  5. Ensure your key findings are communicated through beautiful visualizations or easy-to-remember statements (e.g., “the median income is the number-one predictor of housing prices”).
 

Launch!

 
  1. Get your solution ready for production (plug into production data inputs, write unit tests, etc.).
  2. Write monitoring code to check your system’s live performance at regular intervals and trigger alerts when it drops.
    • Beware of slow degradation too: models tend to “rot” as data evolves.
    • Measuring performance may require a human pipeline (e.g., via a crowdsourcing service).
    • Also monitor your inputs’ quality (e.g., a malfunctioning sensor sending random values, or another team’s output becoming stale). This is particularly important for online learning systems.
  3. Retrain your models on a regular basis on fresh data (automate as much as possible).
 

Exercises

 
  1. Try a Support Vector Machine regressor (sklearn.svm.SVR), with various hyperparameters such as kernel="linear" (with various values for the C hyperparameter) or kernel="rbf" (with various values for the C and gamma hyperparameters). Don't worry about what these hyperparameters mean for now. How does the best SVR predictor perform?
In [52]:
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

param_grid = [{"kernel" : ["linear"], "C" : [10., 50.]},
              {"kernel" : ['rbf'], "C" : [300., 600.], 'gamma' : [.001]}]

svr_reg = SVR()
svr_search = GridSearchCV(svr_reg, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=4, verbose=2)
svr_search.fit(housing_prepared, housing_labels)

svres = svr_search.cv_results_
for mean_score, params in zip(svres["mean_test_score"], svres["params"]):
    print(np.sqrt(-mean_score), params)
 
Fitting 5 folds for each of 4 candidates, totalling 20 fits
...
 
[Parallel(n_jobs=4)]: Done  20 out of  20 | elapsed:  5.4min finished
 
84654.0893002 {'C': 10.0, 'kernel': 'linear'}
73235.0217516 {'C': 50.0, 'kernel': 'linear'}
115058.204591 {'C': 300.0, 'kernel': 'rbf', 'gamma': 0.001}
111584.382328 {'C': 600.0, 'kernel': 'rbf', 'gamma': 0.001}
 
  2. Try replacing GridSearchCV with RandomizedSearchCV.
In [53]:
svr_reg.get_params()
Out[53]:
{'C': 1.0,
 'cache_size': 200,
 'coef0': 0.0,
 'degree': 3,
 'epsilon': 0.1,
 'gamma': 'auto',
 'kernel': 'rbf',
 'max_iter': -1,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}
In [54]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import expon, reciprocal

# see https://docs.scipy.org/doc/scipy-0.19.0/reference/stats.html
# for `expon()` and `reciprocal()` documentation and more probability distribution functions.

# Note: gamma is ignored when kernel is "linear"
param_distribs = {
        'kernel': ['linear', 'rbf'],
        'C': reciprocal(20, 200), #handson-ml answers 20000
        'gamma': expon(scale=1.0),
    }

svm_reg = SVR()
rnd_search = RandomizedSearchCV(svm_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error', verbose=2, n_jobs=4)
rnd_search.fit(housing_prepared, housing_labels)
 
Fitting 5 folds for each of 10 candidates, totalling 50 fits
...
 
[Parallel(n_jobs=4)]: Done  33 tasks      | elapsed:  9.7min
 


 
 
 
[Parallel(n_jobs=4)]: Done  50 out of  50 | elapsed: 13.4min finished
Out[54]:
RandomizedSearchCV(cv=5, error_score='raise',
          estimator=SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False),
          fit_params={}, iid=True, n_iter=10, n_jobs=4,
          param_distributions={'C': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7ff15d8c3a90>, 'kernel': ['linear', 'rbf'], 'gamma': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7ff15d8c3eb8>},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score=True, scoring='neg_mean_squared_error',
          verbose=2)
In [55]:
negative_mse = rnd_search.best_score_
rmse = np.sqrt(-negative_mse)
rmse
Out[55]:
71204.177308638871
In [56]:
rnd_search.best_params_
Out[56]:
{'C': 141.89295169056408, 'gamma': 1.0123312540374287, 'kernel': 'linear'}
In [57]:
expon_distrib = expon(scale=1.)
samples = expon_distrib.rvs(10000)
plt.figure(figsize=(10, 4))
plt.subplot(121)
plt.title("Exponential distribution (scale=1.0)")
plt.hist(samples, bins=50)
plt.subplot(122)
plt.title("Log of this distribution")
plt.hist(np.log(samples), bins=50)
plt.show()
 
 
  3. Try adding a transformer in the preparation pipeline to select only the most important attributes.
 

The feature selector below assumes you have already computed the feature importances (here, taken from the grid search's best RandomForestRegressor).

In [82]:
from sklearn.base import BaseEstimator, TransformerMixin

def indices_of_top_k(arr, k):
    return np.sort(np.argpartition(np.array(arr), -k)[-k:])

class TopFeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, feature_importances, k):
        self.feature_importances = feature_importances
        self.k = k
    def fit(self, X, y=None):
        self.feature_indices_ = indices_of_top_k(self.feature_importances, self.k)
        return self
    def transform(self, X, y=None):
        return X[:, self.feature_indices_]
    
#define k
k = 5

#look at the selected features
top_k_feature_indices = indices_of_top_k(feature_importances, k)
print(top_k_feature_indices)
print(np.array(attributes)[top_k_feature_indices])

sorted(zip(feature_importances, attributes), reverse=True)[:k]
 
[ 0  7  9 10 12]
['longitude' 'median_income' 'population_per_household' 'bedrooms_per_room'
 'INLAND']
Out[82]:
[(0.33271675888101543, 'median_income'),
 (0.143083072155538, 'INLAND'),
 (0.10546407641519769, 'population_per_household'),
 (0.087048193008911728, 'bedrooms_per_room'),
 (0.073816744482605889, 'longitude')]
In [59]:
#pipeline
preparation_and_feature_selection_pipeline = Pipeline([
    ('preparation', full_pipeline),
    ('feature_selection', TopFeatureSelector(feature_importances, k))
])
In [60]:
#fit_transform
housing_prepared_top_k_features = preparation_and_feature_selection_pipeline.fit_transform(housing)
In [61]:
housing_prepared_top_k_features
Out[61]:
array([[-1.15604281, -0.61493744, -0.08649871,  0.15531753,  0.        ],
       [-1.17602483,  1.33645936, -0.03353391, -0.83628902,  0.        ],
       [ 1.18684903, -0.5320456 , -0.09240499,  0.4222004 ,  0.        ],
       ..., 
       [ 1.58648943, -0.3167053 , -0.03055414, -0.52177644,  1.        ],
       [ 0.78221312,  0.09812139,  0.06150916, -0.30340741,  0.        ],
       [-1.43579109, -0.15779865, -0.09586294,  0.10180567,  0.        ]])
 
  4. Try creating a single pipeline that does the full data preparation plus the final prediction.

Note: be sure to replace the LabelBinarizer with a supervision-friendly version, otherwise the pipeline's fit() call will fail!

In [93]:
class SupervisionFriendlyLabelBinarizer(LabelBinarizer):
    def fit_transform(self, X, y=None):
        return super(SupervisionFriendlyLabelBinarizer, self).fit_transform(X)

# Replace the Labelbinarizer with a SupervisionFriendlyLabelBinarizer
cat_pipeline.steps[1] = ("label_binarizer", SupervisionFriendlyLabelBinarizer())

# Now you can create a full pipeline with a supervised predictor at the end.
full_pipeline_with_predictor = Pipeline([
        ("preparation", preparation_pipeline),
        ("linear", LinearRegression())
    ])

full_pipeline_with_predictor.fit(housing, housing_labels)
full_pipeline_with_predictor.predict(some_data)
Out[93]:
array([ 210644.60459286,  317768.80697211,  210956.43331178,
         59218.98886849,  189747.55849879])
In [94]:
prepare_select_and_predict_pipeline = Pipeline([
    ('preparation', preparation_pipeline),
    ('feature_selection', TopFeatureSelector(feature_importances, k)),
    ('svr_reg', SVR(C=122659.12862707644, gamma=0.22653313890837068, kernel='rbf')),
])
In [95]:
prepare_select_and_predict_pipeline.fit(housing, housing_labels)
Out[95]:
Pipeline(steps=[('preparation', FeatureUnion(n_jobs=1,
       transformer_list=[('num_pipeline', Pipeline(steps=[('selector', DataFrameSelector(attribute_names=['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income'])), ('imputer', Imputer(... gamma=0.22653313890837068, kernel='rbf', max_iter=-1, shrinking=True,
  tol=0.001, verbose=False))])
 

Finally found the cause of the earlier error: the LabelBinarizer had not been replaced with a supervision-friendly version!

 
  5. Automatically explore some preparation options using GridSearchCV.
In [96]:
param_grid = [
        {'preparation__num_pipeline__imputer__strategy': ['mean', 'median', 'most_frequent'],
         'feature_selection__k': [3, 4, 5, 6, 7]}
]

grid_search_prep = GridSearchCV(prepare_select_and_predict_pipeline, param_grid, cv=5,
                                scoring='neg_mean_squared_error', verbose=2, n_jobs=4)
grid_search_prep.fit(housing, housing_labels)
 
Fitting 5 folds for each of 15 candidates, totalling 75 fits
...
 
[Parallel(n_jobs=4)]: Done  33 tasks      | elapsed:  9.1min
 
...
 
[Parallel(n_jobs=4)]: Done  75 out of  75 | elapsed: 18.7min finished
Out[96]:
GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('preparation', FeatureUnion(n_jobs=1,
       transformer_list=[('num_pipeline', Pipeline(steps=[('selector', DataFrameSelector(attribute_names=['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income'])), ('imputer', Imputer(... gamma=0.22653313890837068, kernel='rbf', max_iter=-1, shrinking=True,
  tol=0.001, verbose=False))]),
       fit_params={}, iid=True, n_jobs=4,
       param_grid=[{'preparation__num_pipeline__imputer__strategy': ['mean', 'median', 'most_frequent'], 'feature_selection__k': [3, 4, 5, 6, 7]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='neg_mean_squared_error', verbose=2)
In [97]:
grid_search_prep.best_params_
Out[97]:
{'feature_selection__k': 7,
 'preparation__num_pipeline__imputer__strategy': 'median'}
In [98]:
housing.shape
Out[98]:
(16512, 9)