# "Practical Data Analysis" by Tomasz Drabas, reading notes: Chapter 6 - Regression models

Personal study notes on Python data analysis - index

Chapter 6 covers many regression models, both linear and nonlinear. We also revisit random forests and support vector machines, which can be used to solve either classification or regression problems.

·Identifying and tackling multicollinearity in the data
·Building a linear regression model to predict the amount of electricity a power plant produces
·Predicting the amount of electricity produced with OLS
·Estimating the amount of electricity produced with CART
·Using kNN models for regression problems
·Using random forest models for regression analysis
·Predicting the amount of electricity produced with SVM
·Training a neural network to predict the amount of electricity produced

6.1 Introduction

Fuel codes: NG natural gas, HYC hydroelectric, SUN solar/photovoltaic, WND wind.

Mover codes: ST steam turbine, HY hydraulic turbine, GT gas turbine.

6.2 Identifying and tackling multicollinearity in the data

# this is needed to load helper from the parent folder
import sys
sys.path.append('..')

# the rest of the imports
import helper as hlp
import pandas as pd
import numpy as np
import sklearn.decomposition as dc

def reduce_PCA(x, n):
    '''
        Reduce the dimensions using Principal Component
        Analysis
    '''
    # create the PCA object
    pca = dc.PCA(n_components=n, whiten=True)

    # learn the principal components from all the features
    return pca.fit(x)

# the file name of the dataset
r_filename = '../../Data/Chapter06/power_plant_dataset.csv'

# read the data (restored here; these lines were elided in the notes)
csv_read = pd.read_csv(r_filename)

# keep only the independent variables (the dependent variable,
# net_generation_MWh, is assumed to be the last column)
x = csv_read[csv_read.columns[:-1]].copy()

# produce the correlation matrix for the independent variables
corr = x.corr()

# and check the eigenvectors and eigenvalues of the matrix
w, v = np.linalg.eig(corr)
print('Eigenvalues: ', w)

# eigenvalues that are close to 0 indicate multicollinearity
s = np.nonzero(w < 0.01)

# inspect which variables are collinear
print('Indices of eigenvalues close to 0:', s[0])

all_columns = []
for i in s[0]:
    print('\nIndex: {0}. '.format(i))

    # variables with large loadings on a near-zero eigenvector
    # form the collinear set
    t = np.nonzero(abs(v[:, i]) > 0.33)
    all_columns += list(t[0]) + [i]
    print('Collinear: ', t[0])

/*
Indices of eigenvalues close to 0: [28 29 30 31 32]
*/

/*
Index: 28.
Collinear:  [0 1 3]

Index: 29.
Collinear:  [ 9 11 12 13]

Index: 30.
Collinear:  [15]

Index: 31.
Collinear:  [ 2 10]

Index: 32.
Collinear:  [ 4 14]
*/

for i in np.unique(all_columns):
    print('Variable {0}: {1}'.format(i, x.columns[i]))

/*
Variable 0: fuel_aer_NG
Variable 1: fuel_aer_DFO
Variable 2: fuel_aer_HYC
Variable 3: fuel_aer_SUN
Variable 4: fuel_aer_WND
Variable 9: mover_GT
Variable 10: mover_HY
Variable 11: mover_IC
Variable 12: mover_PV
Variable 13: mover_ST
Variable 14: mover_WT
Variable 15: state_CA
Variable 28: state_OH
Variable 29: state_GA
Variable 30: state_WA
Variable 31: total_fuel_cons
Variable 32: total_fuel_cons_mmbtu
*/

One way of dealing with multicollinearity (other than dropping variables from the model) is to reduce the dimensionality of the dataset. We can do this with PCA (see Recipe 5.3 in this book).

# and reduce the data keeping only 5 principal components
n_components = 5
z = reduce_PCA(x, n=n_components)
pc = z.transform(x)

# how much variance does each component explain?
print('\nVariance explained by each principal component: ',
    z.explained_variance_ratio_)

# and the total variance accounted for
print('Total variance explained: ',
    np.sum(z.explained_variance_ratio_))

/*
Variance explained by each principal component:  [0.14446578 0.13030196 0.11030824 0.0935897  0.06694611]
Total variance explained:  0.5456117974906106
*/

# append the reduced dimensions to the dataset
for i in range(0, n_components):
    col_name = 'p_{0}'.format(i)
    x[col_name] = pd.Series(pc[:, i])
    # the components also need to land in csv_read before it is
    # written out (this step was elided in the notes)
    csv_read[col_name] = pd.Series(pc[:, i])

# output to file
w_filename = '../../Data/Chapter06/power_plant_dataset_pc.csv'
with open(w_filename, 'w', newline='') as output:
    output.write(csv_read.to_csv(index=False))

6.3 Building a linear regression model

# this is needed to load helper from the parent folder
import sys
sys.path.append('..')

# the rest of the imports
import helper as hlp
import pandas as pd
import numpy as np
import sklearn.linear_model as lm

@hlp.timeit
def regression_linear(x, y):
    '''
        Estimate a linear regression
    '''
    # create the regressor object
    linear = lm.LinearRegression(fit_intercept=True,
        normalize=True, copy_X=True, n_jobs=-1)

    # estimate the model
    linear.fit(x, y)

    # return the object
    return linear

# the file name of the dataset
r_filename = '../../Data/Chapter06/power_plant_dataset_pc.csv'

# read the data (restored; this step was elided in the notes)
csv_read = pd.read_csv(r_filename)

# select the names of columns (the dependent variable is assumed
# to be the last column, net_generation_MWh)
dependent = [csv_read.columns[-1]]
independent_reduced = [
    col
    for col
    in csv_read.columns
    if col.startswith('p')
]

independent = [
    col
    for col
    in csv_read.columns
    if      col not in independent_reduced
        and col not in dependent
]

# split into independent and dependent features
x = csv_read[independent]
y = csv_read[dependent]

# estimate the model using all variables (without PC)
regressor = regression_linear(x, y)

# print model summary
print('\nR^2: {0}'.format(regressor.score(x, y)))
coeff = [(nm, coeff)
    for nm, coeff
    in zip(x.columns, regressor.coef_)]
intercept = regressor.intercept_
print('Coefficients: ', coeff)
print('Intercept', intercept)
print('Total number of variables: ',
    len(coeff) + 1)

Inside the regression_linear(...) method, we first create the model object with the .LinearRegression(...) method. We specify that the model should estimate the intercept (fit_intercept=True), normalize the data, and copy the independent variables (copy_X=True).
The copy matters because Python (by default) passes arguments by reference (unlike C or C++, which pass by value), so there is always the possibility that a method modifies the data passed into it. See http://www.python-course.eu/passing_arguments.php.
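A minimal sketch of that pitfall (the function name and data here are made up for illustration):

```python
import pandas as pd

def normalize_in_place(df):
    # df is a reference to the caller's DataFrame,
    # so this assignment changes the original object
    df['a'] = df['a'] / df['a'].max()

data = pd.DataFrame({'a': [1.0, 2.0, 4.0]})
normalize_in_place(data)
print(data['a'].tolist())  # [0.25, 0.5, 1.0] -- the caller's data changed
```

This is why copy_X=True is a sensible default: the regressor works on a copy and leaves the caller's matrix untouched.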

The LinearRegression object exposes the .score(...) method. Our model scores as follows:

/*
R^2: 0.9965751657989851
*/

/*
Intercept -11531009013.842264
*/

# estimate the model
regressor_nm = regression_linear(x_no_state, y)

# print model summary
print('\nR^2: {0}'.format(regressor_nm.score(x_no_state, y)))
coeff = [(nm, coeff)
    for nm, coeff
    in zip(x_no_state.columns, regressor_nm.coef_)]
intercept = regressor_nm.intercept_
print('Coefficients: ', coeff)
print('Intercept', intercept)
print('Total number of variables: ',
    len(coeff) + 1)

/*
R^2: 0.12740548278756392
*/

# remove the state variables and total_fuel_cons
columns = [col for col in independent if 'state' not in col and col != 'total_fuel_cons']
x_no_state = x[columns]

R² drops, but not significantly:

/*
R^2: 0.9957289744500172
*/

# stack the principal components into long form
pc_stack = pd.DataFrame()

for col in x_red.columns:
    series = pd.DataFrame()
    series['x'] = x_red[col]
    series['y'] = y
    series['PC'] = col
    # note: in pandas >= 2.0, DataFrame.append was removed;
    # use pd.concat([pc_stack, series]) there
    pc_stack = pc_stack.append(series)

# show the results of a linear regression within each
# principal component (seaborn is imported as sns and
# matplotlib.pylab as pl)
sns.lmplot(x='x', y='y', col='PC', hue='PC', data=pc_stack,
           col_wrap=2, size=5)

pl.savefig('../../Data/Chapter06/charts/regression_linear.png',
    dpi=300)

The .lmplot(...) method fits a linear regression to our data. The col parameter specifies how to group the charts; we will produce one chart for each of the five principal components. The hue parameter varies the color per principal component. col_wrap=2 means we place two charts per row, each 5 inches tall (size=5; newer seaborn renames this parameter to height, as the warning below notes).

Tips:

/*
D:\tools\Python37\lib\site-packages\seaborn\regression.py:546: UserWarning: The size paramter has been renamed to height; please update your code.
warnings.warn(msg, UserWarning)
*/

/*
sns.lmplot(x='x', y='y', col='fuel', hue='fuel',
data=fuel_stack, col_wrap=2, height=5)
*/

# select only the fuel consumption
fuel_cons = ['total_fuel_cons', 'total_fuel_cons_mmbtu']
x = csv_read[fuel_cons]

/* MMBTU: million British Thermal Units.
1 MMBtu = 2.52×10^8 cal = 2.52×10^5 kcal
1 barrel of crude oil ≈ 5.8 MMBTU
*/

6.4 Predicting the amount of electricity produced with OLS

OLS (Ordinary Least Squares) is also a linear model; fitting by least squares is in fact how a linear regression is estimated. However, OLS can estimate a model even when the relationship between the independent and dependent variables is not linear, as long as the model is linear in its parameters.

import statsmodels.api as sm

@hlp.timeit
def regression_ols(x, y):
    '''
        Estimate a linear regression
    '''
    # add a constant to the data (restored; this line was
    # elided in the notes)
    x = sm.add_constant(x)

    # create the model object
    model = sm.OLS(y, x)

    # and return the fit model
    return model.fit()

# the file name of the dataset
r_filename = '../../Data/Chapter06/power_plant_dataset_pc.csv'

# read the data (restored; this step was elided in the notes)
csv_read = pd.read_csv(r_filename)

# select the names of columns (the dependent variable is assumed
# to be the last column)
dependent = [csv_read.columns[-1]]
independent_reduced = [
    col
    for col
    in csv_read.columns
    if col.startswith('p')
]

independent = [
    col
    for col
    in csv_read.columns
    if      col not in independent_reduced
        and col not in dependent
]

# split into independent and dependent features
x = csv_read[independent]
y = csv_read[dependent]

# estimate the model using all variables (without PC)
regressor = regression_ols(x, y)
print(regressor.summary())

Tips:

/*
D:\tools\Python37\lib\site-packages\numpy\core\fromnumeric.py:2495: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
return ptp(axis=axis, out=out, **kwargs)
*/

The R², along with the statistical significance of total_fuel_cons and total_fuel_cons_mmbtu, confirms our earlier finding: these two variables alone determine how much electricity a power plant generates. The kind of fuel makes little difference: the same energy input yields the same output, whatever the fuel. A power plant is essentially an energy converter: it turns one form of energy into another. Which raises the question: why use fossil fuels rather than renewables?

# remove insignificant variables
significant = ['total_fuel_cons', 'total_fuel_cons_mmbtu']
x_red = x[significant]

# estimate the model with the reduced set of variables
regressor = regression_ols(x_red, y)
print(regressor.summary())

import mlpy as ml

@hlp.timeit
def regression_linear(x, y):
    '''
        Estimate a linear regression
    '''
    # create the model object
    ols = ml.OLS()

    # estimate the model
    ols.learn(x, y)

    # and return the fit model
    return ols

# the file name of the dataset
r_filename = '../../Data/Chapter06/power_plant_dataset_pc.csv'

# read the data (restored; this step was elided in the notes)
csv_read = pd.read_csv(r_filename)

# keep only the most significant variable (the split below was
# elided in the notes; the dependent variable is assumed to be
# the last column)
significant = ['total_fuel_cons_mmbtu']
x = csv_read[significant]
y = csv_read[csv_read.columns[-1]]

# estimate the model
regressor = regression_linear(x, y)

# predict the output
predicted = regressor.pred(x)

# and calculate the R^2
score = hlp.get_score(y, predicted)
print('R2: ', score)

MLPY's .OLS() method can only build a simple linear model, so we use the most significant variable, total_fuel_cons_mmbtu. Since MLPY provides no information about model performance, we wrote our own scoring function (in the helper.py file):

def get_score(y, predicted):
    '''
        Method to calculate R^2
    '''
    # calculate the mean of actuals
    mean_y = y.mean()

    # calculate the total sum of squares and residual
    # sum of squares
    sum_of_square_total = np.sum((y - mean_y)**2)
    sum_of_square_resid = np.sum((y - predicted)**2)

    return 1 - sum_of_square_resid / sum_of_square_total

The get_score(...) method takes two parameters: the observed values of the dependent variable and the model's predictions.

/*
The method regression_linear took 0.00 sec to run.
R2:  0.9951670803634637
*/
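The hand-rolled scorer can be sanity-checked against scikit-learn's r2_score; on any pair of vectors the two should agree (a quick check, not part of the book's code):

```python
import numpy as np
import sklearn.metrics as mt

def get_score(y, predicted):
    '''R^2, as implemented in helper.py'''
    mean_y = y.mean()
    sum_of_square_total = np.sum((y - mean_y)**2)
    sum_of_square_resid = np.sum((y - predicted)**2)
    return 1 - sum_of_square_resid / sum_of_square_total

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])
print(get_score(y_true, y_hat))    # 0.98
print(mt.r2_score(y_true, y_hat))  # identical
```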

6.5 Estimating the amount of electricity produced with CART

import sklearn.tree as sk

@hlp.timeit
def regression_cart(x, y):
    '''
        Estimate a CART regressor
    '''
    # create the regressor object
    cart = sk.DecisionTreeRegressor(min_samples_split=80,
        max_features="auto", random_state=66666,
        max_depth=5)

    # estimate the model
    cart.fit(x, y)

    # return the object
    return cart

# print out the results
print('R2: ', regressor.score(x,y))

/*
The method regression_cart took 0.01 sec to run.
R2:  0.9595455136505057

The method regression_cart took 0.01 sec to run.
R:  0.7035525409649943
0. p_0: 0.17273577812298352
1. p_1: 0.08688236875805344
2. p_2: 0.0
3. p_3: 0.5429677683710962
4. p_4: 0.19741408474786684
*/

for counter, (nm, label) \
    in enumerate(
        zip(x.columns, regressor.feature_importances_)
    ):
    print("{0}. {1}: {2}".format(counter, nm, label))
/*

0. fuel_aer_NG: 0.0
1. fuel_aer_DFO: 0.0
2. fuel_aer_HYC: 0.0
3. fuel_aer_SUN: 0.0
4. fuel_aer_WND: 0.0
5. fuel_aer_COL: 0.0
6. fuel_aer_MLG: 0.0
7. fuel_aer_NUC: 0.0
8. mover_CT: 0.0
9. mover_GT: 0.0
10. mover_HY: 0.0
11. mover_IC: 0.0
12. mover_PV: 0.0
13. mover_ST: 0.0
14. mover_WT: 0.0
15. state_CA: 0.0
16. state_TX: 0.0
17. state_NY: 0.0
18. state_FL: 0.0
19. state_MN: 0.0
20. state_MI: 0.0
21. state_NC: 0.0
22. state_PA: 0.0
23. state_MA: 0.0
24. state_WI: 0.0
25. state_NJ: 0.0
26. state_IA: 0.0
27. state_IL: 0.0
28. state_OH: 0.0
29. state_GA: 0.0
30. state_WA: 0.0
31. total_fuel_cons: 0.0
32. total_fuel_cons_mmbtu: 1.0

*/

# and export to a .dot file
sk.export_graphviz(regressor,
    out_file='../../Data/Chapter06/CART/tree.dot')

/*
D:\PythonLearn\数据分析实战-托马兹·卓巴斯\tools\GraphViz\release\bin\dot -Tpdf D:\Java2018\practicalDataAnalysis\Data\Chapter06\CART\tree.dot -o D:\Java2018\practicalDataAnalysis\Data\Chapter06\CART\tree.pdf

D:\PythonLearn\数据分析实战-托马兹·卓巴斯\tools\GraphViz\release\bin\dot -Tpdf D:\Java2018\practicalDataAnalysis\Data\Chapter06\CART\tree_red.dot -o D:\Java2018\practicalDataAnalysis\Data\Chapter06\CART\tree_red.pdf

*/

X[32]: checking the list of column names, it is indeed total_fuel_cons_mmbtu.

# estimate the model using the Principal Components only
regressor_red = regression_cart(x_red, y)

# print out the results
print('R^2: ', regressor_red.score(x_red, y))

The R² value is also acceptable:

/*
R^2:  0.7035525409649943
*/

/*
0. p_0: 0.17273577812298352
1. p_1: 0.08688236875805344
2. p_2: 0.0
3. p_3: 0.5429677683710962
4. p_4: 0.19741408474786684
*/

The (usual) problem with using principal components is that the results are not directly interpretable; recovering their real-world meaning takes extra work (https://onlinecourses.science.psu.edu/stat505/node/54).
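One way to recover some interpretability is to look at the loadings: pca.components_ maps each principal component back onto the original features. A sketch on synthetic data (the column names here are invented, not the chapter's):

```python
import numpy as np
import pandas as pd
import sklearn.decomposition as dc

rng = np.random.RandomState(42)
df = pd.DataFrame({
    'fuel': rng.normal(size=100),
    'mover': rng.normal(size=100),
})
# 'cons' is (almost) collinear with 'fuel'
df['cons'] = 2 * df['fuel'] + rng.normal(scale=0.1, size=100)

pca = dc.PCA(n_components=2, whiten=True).fit(df)

# rows are components, columns the original features; a large
# absolute loading shows which features drive each component
loadings = pd.DataFrame(pca.components_,
                        columns=df.columns,
                        index=['p_0', 'p_1'])
print(loadings)
```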

6.6 Using kNN models for regression problems

import sklearn.neighbors as nb
import sklearn.model_selection as cv  # sklearn.cross_validation in older releases

@hlp.timeit
def regression_kNN(x, y):
    '''
        Build the kNN regressor
    '''
    # create the regressor object
    knn = nb.KNeighborsRegressor(n_neighbors=80,
        algorithm='kd_tree', n_jobs=-1)

    # fit the data
    knn.fit(x, y)

    # return the regressor
    return knn

# test the sensitivity of R^2 (x_sig holds the significant
# variables; their selection was elided in the notes)
scores = cv.cross_val_score(regressor, x_sig, y, cv=100)
print('Expected R^2: {0:.2f} (+/- {1:.2f})'\
    .format(scores.mean(), scores.std()**2))

/*
The method regression_kNN took 0.00 sec to run.
R^2:  0.9433911310360819
Expected R^2: 0.91 (+/- 0.03)
*/

Tips:

/*
ModuleNotFoundError: No module named 'sklearn.cross_validation'
*/

/*
import sklearn.model_selection as cv
*/

# estimate the model using the Principal Components only
regressor_principal = regression_kNN(x_principal, y)

print('R^2: ', regressor_principal.score(x_principal, y))

/*
The method regression_kNN took 0.01 sec to run.
R^2:  0.4602300520714968
Expected R^2: -18.55 (+/- 14427.43)
*/

6.7 Using random forest models for regression analysis

import sys
sys.path.append('..')

# the rest of the imports
import helper as hlp
import pandas as pd
import numpy as np
import sklearn.ensemble as en
import sklearn.model_selection as cv

@hlp.timeit
def regression_rf(x, y):
    '''
        Estimate a random forest regressor
    '''
    # create the regressor object
    random_forest = en.RandomForestRegressor(
        min_samples_split=80, random_state=666,
        max_depth=5, n_estimators=10)

    # estimate the model
    random_forest.fit(x, y)

    # return the object
    return random_forest

The .RandomForestRegressor(...) method creates the model object. As with decision trees, min_samples_split sets the minimum number of samples a node must hold to be split, random_state fixes the model's initial random state, and max_depth caps how many levels a tree can have. The last parameter, n_estimators, tells the model how many trees to grow in the forest.
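The effect of those caps is easy to verify on synthetic data (a sketch with made-up data, not the chapter's dataset): every tree in the forest respects max_depth, and the forest still fits a smooth nonlinear signal well.

```python
import numpy as np
import sklearn.ensemble as en

# noisy nonlinear signal
rng = np.random.RandomState(666)
xs = rng.uniform(-2, 2, size=(500, 1))
ys = np.sin(3 * xs[:, 0]) + rng.normal(scale=0.1, size=500)

# the recipe's settings: 10 shallow, regularized trees
forest = en.RandomForestRegressor(min_samples_split=80,
    random_state=666, max_depth=5, n_estimators=10)
forest.fit(xs, ys)

# no individual tree exceeds the depth cap
depths = [est.tree_.max_depth for est in forest.estimators_]
print(depths)
print('R^2: ', forest.score(xs, ys))
```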

# print out the results
print('R^2: ', regressor_red.score(x_red, y))

# test the sensitivity of R^2
scores = cv.cross_val_score(regressor_red, x_red, y, cv=100)
print('Expected R^2: {0:.2f} (+/- {1:.2f})'\
    .format(scores.mean(), scores.std()**2))

# print feature importances
for counter, (nm, label) \
    in enumerate(
        zip(x_red.columns, regressor_red.feature_importances_)
    ):
    print("{0}. {1}: {2}".format(counter, nm, label))

Tips:

/*
Traceback (most recent call last):
File "D:\Java2018\practicalDataAnalysis\Codes\Chapter06\regression_randomForest.py", line 79, in <module>

KeyError: "None of [Int64Index([32], dtype='int64')] are in the [columns]"
*/

/*

The method regression_rf took 0.06 sec to run.
R^2:  0.9704592485243615
Expected R^2: 0.82 (+/- 0.21)
0. fuel_aer_NG: 0.0
1. fuel_aer_DFO: 0.0
2. fuel_aer_HYC: 0.0
......
8. mover_CT: 0.0
9. mover_GT: 0.0
10. mover_HY: 0.0
11. mover_IC: 0.0
12. mover_PV: 0.0
13. mover_ST: 1.9909417055476134e-05
14. mover_WT: 0.0
15. state_CA: 0.0
16. state_TX: 0.0
......
28. state_OH: 0.0
29. state_GA: 0.0
30. state_WA: 0.0
31. total_fuel_cons: 0.0
32. total_fuel_cons_mmbtu: 0.9999800905829446
*/

/*
The method regression_rf took 0.03 sec to run.
R^2:  0.9704326205779759
Expected R^2: 0.81 (+/- 0.22)
0. total_fuel_cons_mmbtu: 1.0
*/

6.8 Predicting the amount of electricity produced with SVM

SVM (Support Vector Machine) has been popular for years. With a variety of kernel tricks, it can model even the most complex relationships between the independent and dependent variables.

import sys
sys.path.append('..')

# the rest of the imports
import helper as hlp
import pandas as pd
import numpy as np
import sklearn.svm as sv
import matplotlib.pyplot as plt

@hlp.timeit
def regression_svm(x, y, **kw_params):
    '''
        Estimate a SVM regressor
    '''
    # create the regressor object
    svm = sv.SVR(**kw_params)

    # estimate the model
    svm.fit(x, y)

    # return the object
    return svm

# simulated dataset
x = np.arange(-2, 2, 0.004)
errors = np.random.normal(0, 0.5, size=len(x))
y = 0.8 * x**4 - 2 * x**2 + errors

# reshape the x array so it's in a column form
x_reg = x.reshape(-1, 1)

models_to_test = [
    {'kernel': 'linear'},
    {'kernel': 'poly', 'gamma': 0.5, 'C': 0.5, 'degree': 4},
    {'kernel': 'poly', 'gamma': 0.5, 'C': 0.5, 'degree': 6},
    {'kernel': 'rbf', 'gamma': 0.5, 'C': 0.5}
]

The models_to_test list holds four sets of keyword parameters for the regression_svm(...) method: we estimate an SVM with a linear kernel, two with polynomial kernels (one of degree 4, one of degree 6), and one with an RBF kernel.
The regression_svm(...) method follows the same logic as all our model-estimating methods: first create the model object, then estimate the model, and finally return it.

# create chart frame
plt.figure(figsize=(len(models_to_test) * 2 + 3, 9.5))

# adjust the subplot margins (the leading arguments were
# elided in the notes)
plt.subplots_adjust(
    bottom=.05, top=.96, wspace=.1, hspace=.15)

for i, model in enumerate(models_to_test):
    # estimate the model
    regressor = regression_svm(x_reg, y, **model)

    # score
    score = regressor.score(x_reg, y)

    # plot the chart
    plt.subplot(2, 2, i + 1)
    if model['kernel'] == 'poly':
        plt.title('Kernel: {0}, deg: {1}'\
            .format(model['kernel'], model['degree']))
    else:
        plt.title('Kernel: {0}'.format(model['kernel']))
    plt.ylim([-4, 8])
    plt.scatter(x, y)
    plt.plot(x, regressor.predict(x_reg), color='r')
    plt.text(.9, .9, ('R^2: {0:.2f}'.format(score)),
                 transform=plt.gca().transAxes, size=15,
                 horizontalalignment='right')

plt.savefig('../../data/Chapter06/charts/regression_svm.png',
    dpi=300)

Optunity is not bundled with Anaconda, so we need to install it ourselves:

>pip install optunity

/*
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting optunity
.......
Successfully built optunity
Installing collected packages: optunity
Successfully installed optunity-1.1.1
FINISHED

*/

import optunity as opt
import optunity.metrics as mt

# simulated dataset
x = np.arange(-2, 2, 0.002)
errors = np.random.normal(0, 0.5, size=len(x))
y = 0.8 * x**4 - 2 * x**2 + errors

# reshape the x array so it's in a column form
x_reg = x.reshape(-1, 1)

@opt.cross_validated(x=x_reg, y=y, num_folds=10, num_iter=5)
def regression_svm(
    x_train, y_train, x_test, y_test, logC, logGamma):
    '''
        Estimate a SVM regressor
    '''
    # create the regressor object
    svm = sv.SVR(kernel='rbf',
        C=0.1 * logC, gamma=0.1 * logGamma)

    # estimate the model
    svm.fit(x_train, y_train)

    # decision function
    decision_values = svm.decision_function(x_test)

    # return the evaluation metric
    return mt.roc_auc(y_test, decision_values)

num_folds is the number of folds used for cross-validation: 1/10 of the sample is held out for validation and the remaining 9/10 is used for training. num_iter is the number of cross-validation iterations.
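The notes elide the actual optimization call. With optunity it would be something along the lines of `hps, _, _ = opt.maximize(regression_svm, num_evals=100, logC=[0, 10], logGamma=[0, 10])` (the evaluation budget and box constraints here are assumptions, not from the source). The same kind of box-constrained hyperparameter search can be sketched with scikit-learn's GridSearchCV instead:

```python
import numpy as np
import sklearn.svm as sv
import sklearn.model_selection as ms

# simulated dataset, shaped like the recipe's
xs = np.arange(-2, 2, 0.02)
rng = np.random.RandomState(0)
ys = 0.8 * xs**4 - 2 * xs**2 + rng.normal(0, 0.5, size=len(xs))
xs_reg = xs.reshape(-1, 1)

# search a small grid of C and gamma for an RBF-kernel SVR,
# scoring each candidate by cross-validated R^2
grid = ms.GridSearchCV(
    sv.SVR(kernel='rbf'),
    param_grid={'C': [0.1, 1.0, 10.0],
                'gamma': [0.1, 0.5, 1.0, 2.0]},
    cv=ms.KFold(n_splits=10, shuffle=True, random_state=0))
grid.fit(xs_reg, ys)
print(grid.best_params_)
print('best CV R^2: ', grid.best_score_)
```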

# and the values are...
print('The optimal values are: C - {0:.2f}, gamma - {1:.2f}'\
    .format(0.1 * hps['logC'], 0.1 * hps['logGamma']))

# estimate the model with optimal values
regressor = sv.SVR(kernel='rbf',
            C=0.1 * hps['logC'],
            gamma=0.1 * hps['logGamma'])\
        .fit(x_reg, y)

# predict the output
predicted = regressor.predict(x_reg)

# and calculate the R^2
score = hlp.get_score(y, predicted)
print('R2: ', score)

# plot the chart
plt.scatter(x, y)
plt.plot(x, predicted, color='r')
plt.title('Kernel: RBF')
plt.ylim([-4, 8])
plt.text(.9, .9, ('R^2: {0:.2f}'.format(score)),
             transform=plt.gca().transAxes, size=15,
             horizontalalignment='right')

plt.savefig(
    '../../data/Chapter06/charts/regression_svm_alt.png',
    dpi=300
)

/*
D:\tools\Python37\lib\site-packages\optunity\metrics.py:204: RuntimeWarning: invalid value encountered in true_divide
return float(TP) / (TP + FN)
...
D:\tools\Python37\lib\site-packages\optunity\metrics.py:204: RuntimeWarning: invalid value encountered in true_divide
return float(TP) / (TP + FN)
The optimal values are: C - 0.38, gamma - 1.44
R^2:  0.8662968996734814

*/

6.9 Training a neural network to predict the amount of electricity produced

import sys
sys.path.append('..')

# the rest of the imports
import helper as hlp
import pandas as pd
import pybrain.structure as st
import pybrain.supervised.trainers as tr
import pybrain.tools.shortcuts as pb

@hlp.timeit
def fitANN(data):
    '''
        Build a neural network regressor
    '''
    # determine the number of inputs and outputs
    inputs_cnt = data['input'].shape[1]
    target_cnt = data['target'].shape[1]

    # create the regressor object
    ann = pb.buildNetwork(inputs_cnt,
        inputs_cnt * 3,
        target_cnt,
        hiddenclass=st.TanhLayer,
        outclass=st.LinearLayer,
        bias=True
    )

    # create the trainer object
    trainer = tr.BackpropTrainer(ann, data,
        verbose=True, batchlearning=False)

    # and train the network
    trainer.trainUntilConvergence(maxEpochs=50, verbose=True,
        continueEpochs=2, validationProportion=0.25)

    # and return the regressor
    return ann

# split the data into training and testing
train_x, train_y, \
test_x,  test_y, \
labels = hlp.split_data(csv_read,  # (this line was elided in the notes)
    y='net_generation_MWh', x=['total_fuel_cons_mmbtu'])

# create the ANN training and testing datasets
training = hlp.prepareANNDataset((train_x, train_y),
    prob='regression')
testing  = hlp.prepareANNDataset((test_x, test_y),
    prob='regression')

prob (for "problem") specifies that the dataset will be used to solve a regression problem; for classification we would need the class indicator and its complement to 1, abs(item - 1) (see Recipe 3.8 in this book).
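The classification encoding mentioned in passing can be illustrated in isolation (a sketch of the idea, not helper.py's actual code): each 0/1 class indicator is paired with its complement to 1, giving one target column per output neuron.

```python
import numpy as np

# binary class indicators
target = np.array([0, 1, 1, 0])

# pair each indicator with its complement to 1: abs(item - 1)
encoded = np.column_stack((target, np.abs(target - 1)))
print(encoded.tolist())  # [[0, 1], [1, 0], [1, 0], [0, 1]]
```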

# train the model
regressor = fitANN(training)

# predict the output from the unseen data
predicted = regressor.activateOnDataset(testing)

# and calculate the R^2
score = hlp.get_score(test_y, predicted[:, 0])
print('R^2: ', score)

The fitANN(...) method first creates the neural network. Ours has one input neuron, three hidden neurons (with tanh activation functions), and one output neuron (with a linear activation function).

/*
The method fitANN took 68.70 sec to run.
R^2:  0.9843443717345242
*/

Tips:

/* Traceback (most recent call last):
File "D:\Java2018\practicalDataAnalysis\Codes\Chapter06\regression_ann.py", line 8, in <module>
import pybrain.structure as st
File "D:\tools\Python37\lib\site-packages\pybrain\__init__.py", line 1, in <module>
from structure.__init__ import *
ModuleNotFoundError: No module named 'structure'
*/

/*
pip install git+https://github.com/pybrain/pybrain.git
This seems to be an issue with the library when used from Python 3 and installed using pip. The latest version listed is 0.3.3, but I'm still getting v0.3.0 when installing.

Try installing numpy and scipy in Anaconda, then (after activating the conda environment) install directly from the git repo:

pip install git+https://github.com/pybrain/pybrain.git@0.3.3

Running command git clone -q https://github.com/pybrain/pybrain.git 'd:\Temp\pip-req-build-0soc1yjx'
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting git+https://github.com/pybrain/pybrain.git@0.3.3
Cloning https://github.com/pybrain/pybrain.git (to revision 0.3.3) to d:\temp\pip-req-build-0soc1yjx
Running command git checkout -q 882cf66a89b6b6bb8112f41d200c55461464aa06
Requirement already satisfied: scipy in d:\tools\python37\lib\site-packages (from PyBrain==0.3.1) (1.3.3)
Requirement already satisfied: numpy>=1.13.3 in d:\tools\python37\lib\site-packages (from scipy->PyBrain==0.3.1) (1.17.4)
Building wheels for collected packages: PyBrain
Building wheel for PyBrain (setup.py): started
Building wheel for PyBrain (setup.py): finished with status 'done'
Created wheel for PyBrain: filename=PyBrain-0.3.1-cp37-none-any.whl size=472111 sha256=07f80767b5065f40565e595826df759d2dc66740d3fb6db709d7343cbdc2a8c0
Stored in directory: d:\Temp\pip-ephem-wheel-cache-h4bvm49q\wheels\a3\1c\d1\9cfc0e65ef0ed5559fd4f2819e4423e9fa4cfe02ff3a5c3b3e
Successfully built PyBrain
Installing collected packages: PyBrain
Found existing installation: PyBrain 0.3
Uninstalling PyBrain-0.3:
Successfully uninstalled PyBrain-0.3
Successfully installed PyBrain-0.3.1
FINISHED

D:\tools\Python37\Scripts>pip3  install https://github.com/pybrain/pybrain/archive/0.3.3.zip
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting https://github.com/pybrain/pybrain/archive/0.3.3.zip
/ 1.9MB 163kB/s
Requirement already satisfied: scipy in d:\tools\python37\lib\site-packages (from PyBrain==0.3.1) (1.3.3)
Requirement already satisfied: numpy>=1.13.3 in d:\tools\python37\lib\site-packages (from scipy->PyBrain==0.3.1) (1.17.4)
Building wheels for collected packages: PyBrain
Building wheel for PyBrain (setup.py) ... done
Created wheel for PyBrain: filename=PyBrain-0.3.1-cp37-none-any.whl size=468236 sha256=ce9591da503451919d71cd322324930d3d2259c7fdbe7c3041a8309bcf1ff9b2
Stored in directory: d:\Temp\pip-ephem-wheel-cache-ouwcf150\wheels\0b\04\38\2f174aa3c578350870947ca6ab12e0eb89aef3478c9610eb0a
Successfully built PyBrain
Installing collected packages: PyBrain
Successfully installed PyBrain-0.3.1

D:\tools\Python37\Scripts>

*/

pip install https://github.com/pybrain/pybrain/archive/0.3.3.zip

/*
File "D:\tools\Python37\lib\site-packages\pybrain\tools\functions.py", line 4, in <module>
from scipy.linalg import inv, det, svd, logm, expm2
ImportError: cannot import name 'expm2' from 'scipy.linalg' (D:\tools\Python37\lib\site-packages\scipy\linalg\__init__.py)
*/

/* Total error:  0.0234644034879
Total error:  0.00448033176755
Total error:  0.00205475261566
Total error:  0.000943766244883
Total error:  0.000441922091808
Total error:  0.00023137206216
Total error:  0.00014716265305
Total error:  0.000113546124273
Total error:  0.000100591874178
Total error:  9.48731641822e-05
Total error:  9.17187548879e-05
Total error:  9.00545474786e-05
Total error:  8.8754314954e-05
Total error:  8.78622211966e-05
Total error:  8.64148298491e-05
Total error:  8.52580031627e-05
Total error:  8.45496751335e-05
Total error:  8.31773350186e-05
Total error:  8.22848381841e-05
Total error:  8.13736136075e-05
Total error:  8.01448506931e-05
Total error:  7.91777696701e-05
Total error:  7.82493713543e-05
Total error:  7.75330633562e-05
Total error:  7.63823501292e-05
Total error:  7.55286037883e-05
Total error:  7.44680561538e-05
Total error:  7.36680418929e-05
Total error:  7.27012120487e-05
Total error:  7.18980979088e-05
Total error:  7.09005144214e-05
Total error:  6.99157006989e-05
Total error:  6.9222957236e-05
Total error:  6.82360145737e-05
Total error:  6.71353795523e-05
Total error:  6.63631396761e-05
Total error:  6.56685823709e-05
Total error:  6.53158010387e-05
Total error:  6.4518956904e-05
Total error:  6.35146379129e-05
Total error:  6.28304190101e-05
Total error:  6.22125569291e-05
Total error:  6.13769678787e-05
Total error:  6.09232354752e-05
Total error:  6.0155987076e-05
Total error:  5.94907225251e-05
Total error:  5.88562745717e-05
Total error:  5.80850744217e-05
Total error:  5.75905974532e-05
Total error:  5.65098746457e-05
Total error:  5.60159718614e-05
('train-errors:', '[0.023464 , 0.00448  , 0.002055 , 0.000944 , 0.000442 , 0.000231 , 0.000147 , 0.000114 , 0.000101 , 9.5e-05  , 9.2e-05  , 9e-05    , 8.9e-05  , 8.8e-05  , 8.6e-05  , 8.5e-05  , 8.5e-05  , 8.3e-05  , 8.2e-05  , 8.1e-05  , 8e-05    , 7.9e-05  , 7.8e-05  , 7.8e-05  , 7.6e-05  , 7.6e-05  , 7.4e-05  , 7.4e-05  , 7.3e-05  , 7.2e-05  , 7.1e-05  , 7e-05    , 6.9e-05  , 6.8e-05  , 6.7e-05  , 6.6e-05  , 6.6e-05  , 6.5e-05  , 6.5e-05  , 6.4e-05  , 6.3e-05  , 6.2e-05  , 6.1e-05  , 6.1e-05  , 6e-05    , 5.9e-05  , 5.9e-05  , 5.8e-05  , 5.8e-05  , 5.7e-05  , 5.6e-05  ]')
('valid-errors:', '[2.98916  , 0.006323 , 0.003087 , 0.001464 , 0.000735 , 0.00041  , 0.000263 , 0.000197 , 0.000168 , 0.000154 , 0.000147 , 0.000144 , 0.00014  , 0.000143 , 0.000136 , 0.000132 , 0.000131 , 0.000131 , 0.000136 , 0.000127 , 0.000126 , 0.000124 , 0.000124 , 0.000122 , 0.000123 , 0.000121 , 0.00012  , 0.000117 , 0.000116 , 0.000118 , 0.000114 , 0.000113 , 0.000112 , 0.000117 , 0.00011  , 0.000108 , 0.000107 , 0.000107 , 0.000105 , 0.000105 , 0.000105 , 0.000103 , 0.000101 , 0.000102 , 0.000103 , 9.8e-05  , 9.7e-05  , 9.6e-05  , 9.5e-05  , 9.6e-05  , 9.7e-05  , 9.4e-05  ]')
The method fitANN took 68.70 sec to run.
R^2:  0.9843443717345242
*/

End of Chapter 6.


posted @ 2020-03-28 01:34 by 邀月