# Machine Learning Notes (6): Linear Regression

OK, back to the main topic. The question now is: how do we measure whether a fit is "as good as possible"? The usual metrics are:

• Mean squared error (MSE)
• Root mean squared error (RMSE)
• Mean absolute error (MAE)
• R Squared

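One problem with MSE is that when y carries a unit, MSE changes that unit. For example, if y is measured in dollars, the MSE is in $dollar^2$. RMSE avoids this problem: taking the square root brings the result back to the unit of y.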
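For reference, these are the standard definitions of the first three metrics, written for $m$ samples with true values $y^{(i)}$ and predictions $\hat y^{(i)}$ (the notation is an assumption chosen to match the rest of these notes):

$$MSE = \frac{1}{m}\sum _{i=1}^m(\hat y^{(i)} - y^{(i)})^2$$

$$RMSE = \sqrt{\frac{1}{m}\sum _{i=1}^m(\hat y^{(i)} - y^{(i)})^2}$$

$$MAE = \frac{1}{m}\sum _{i=1}^m\left|\hat y^{(i)} - y^{(i)}\right|$$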

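R Squared:

$$R^2 = 1 - \frac {\sum _{i=1}^m(\hat y^{(i)} - y^{(i)})^2} {\sum _{i=1}^m(\bar y - y^{(i)})^2}$$

Here $\bar y$ is the mean of the true values. $R^2 = 1$ means a perfect fit, and $R^2 = 0$ means the model does no better than always predicting $\bar y$.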
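The four metrics can be sketched in a few lines of NumPy. This is a minimal illustration, not the code from the original post: the function names and the toy arrays are assumptions made for the example.

```python
import numpy as np


def mean_squared_error(y_true, y_pred):
    """MSE: mean of the squared residuals (unit is the square of y's unit)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_pred - y_true) ** 2)


def root_mean_squared_error(y_true, y_pred):
    """RMSE: square root of MSE, so it has the same unit as y."""
    return np.sqrt(mean_squared_error(y_true, y_pred))


def mean_absolute_error(y_true, y_pred):
    """MAE: mean of the absolute residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_pred - y_true))


def r2_score(y_true, y_pred):
    """R squared: 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_pred - y_true) ** 2)
    ss_tot = np.sum((np.mean(y_true) - y_true) ** 2)
    return 1.0 - ss_res / ss_tot


# Toy values (made up for illustration)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

print("MSE :", mean_squared_error(y_true, y_pred))
print("RMSE:", root_mean_squared_error(y_true, y_pred))
print("MAE :", mean_absolute_error(y_true, y_pred))
print("R^2 :", r2_score(y_true, y_pred))
```

For these toy values the residuals are -1, 0, 1, 0, so MSE and MAE are both 0.5 and $R^2$ is 1 - 2/20 = 0.9. scikit-learn ships equivalent functions in `sklearn.metrics` (`mean_squared_error`, `mean_absolute_error`, `r2_score`).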

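With $n$ features, the multiple linear regression model is:

$y = a_1x_1+a_2x_2+\dots+a_nx_n+b$

Writing $\theta_0 = b$ and $\theta_j = a_j$ for $j = 1, \dots, n$, the prediction for the $i$-th sample becomes a single row-times-column product:

$$\hat y^{(i)} = \begin{bmatrix} 1 & X_1^{(i)} & X_2^{(i)} & \dots & X_n^{(i)} \end{bmatrix}\begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}$$

where the parameter vector is

$\theta=\begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}$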
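The matrix form above can be sketched in NumPy as follows: prepend a column of ones to the feature matrix so that $\theta_0$ acts as the intercept, then multiply by $\theta$. The helper names and the toy data are hypothetical, and solving for $\theta$ with `np.linalg.lstsq` is an assumption made here to keep the sketch runnable; these notes do not state which solver is used.

```python
import numpy as np


def add_intercept_column(X):
    """Return X_b = [1, X_1, ..., X_n]: X with a leading column of ones."""
    X = np.asarray(X, dtype=float)
    return np.hstack([np.ones((X.shape[0], 1)), X])


def predict(X, theta):
    """Compute y_hat = X_b @ theta for every sample at once."""
    return add_intercept_column(X) @ theta


def fit_least_squares(X, y):
    """Find the theta that minimises the sum of squared residuals."""
    X_b = add_intercept_column(X)
    theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)
    return theta


# Toy data generated from y = 2*x1 - 3*x2 + 5 with no noise
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0], [5.0, 5.0]])
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 5.0

theta = fit_least_squares(X, y)
print("theta:", theta)            # close to [5, 2, -3]: intercept first
print("y_hat:", predict(X, theta))
```

Here `theta[0]` plays the role of the intercept $b$ and `theta[1:]` the role of the coefficients $a_1, \dots, a_n$.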

Boston House Prices dataset
===========================

Notes
------
Data Set Characteristics:

:Number of Instances: 506

:Number of Attributes: 13 numeric/categorical predictive

:Median Value (attribute 14) is usually the target

:Attribute Information (in order):
- CRIM     per capita crime rate by town
- ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS    proportion of non-retail business acres per town
- CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX      nitric oxides concentration (parts per 10 million)
- RM       average number of rooms per dwelling
- AGE      proportion of owner-occupied units built prior to 1940
- DIS      weighted distances to five Boston employment centres
- TAX      full-value property-tax rate per $10,000
- PTRATIO  pupil-teacher ratio by town
- B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT    % lower status of the population
- MEDV     Median value of owner-occupied homes in $1000's

:Missing Attribute Values: None

:Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.

**References**

- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
- many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)

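The boston dataset has 13 features, including the number of rooms, the age of the building, whether the tract borders the river, the distance to employment centres, and so on, plus one label: the house price.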

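import numpy as np
from sklearn import datasets

boston = datasets.load_boston()  # removed in scikit-learn 1.2; needs an older version
X = boston.data
y = boston.target
X = X[y < 50.0]  # keep only samples whose price is below 50.0
y = y[y < 50.0]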

from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)

print(lin_reg.coef_)
# array([-1.05574295e-01,  3.52748549e-02, -4.35179251e-02,
#         4.55405227e-01, -1.24268073e+01,  3.75411229e+00,
#        -2.36116881e-02, -1.21088069e+00,  2.50740082e-01,
#        -1.37702943e-02, -8.38888137e-01,  7.93577159e-03,
#        -3.50952134e-01])
print(boston.feature_names[np.argsort(lin_reg.coef_)])
# array(['NOX', 'DIS', 'PTRATIO', 'LSTAT', 'CRIM', 'INDUS', 'AGE', 'TAX',
#        'B', 'ZN', 'RAD', 'CHAS', 'RM'], dtype='<U7')


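The larger a coefficient, the more strongly the feature pushes the predicted price up; the smaller (more negative) it is, the more strongly it pushes the price down. In the example above, 'NOX' has the most negative coefficient and 'RM' the most positive one. Keep in mind that each coefficient is expressed per unit of its feature, so comparing raw coefficients across features with different scales is only a rough guide.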
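The coefficient ranking can be sketched on synthetic data, which also shows how much the ranking by size depends on feature scale. Everything in this block is an assumption made for illustration: the feature names, the data-generating formula, and the choice of `StandardScaler` are not from the original post.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
m = 500

# Hypothetical features with very different scales
feature_names = np.array(["rooms", "pollution", "unrelated"])
rooms = rng.normal(6.0, 1.0, size=m)        # spread of about 1
pollution = rng.normal(0.5, 0.1, size=m)    # spread of about 0.1
unrelated = rng.normal(0.0, 100.0, size=m)  # spread of about 100, no effect on y
X = np.column_stack([rooms, pollution, unrelated])
y = 4.0 * rooms - 10.0 * pollution + rng.normal(0.0, 0.5, size=m)

# Fit on the raw features, as in the post
raw = LinearRegression().fit(X, y)

# Fit again after scaling every feature to zero mean and unit variance
X_std = StandardScaler().fit_transform(X)
std = LinearRegression().fit(X_std, y)

# argsort orders features from most negative to most positive coefficient
print("raw order:", feature_names[np.argsort(raw.coef_)])
print("raw coefs:", raw.coef_)
print("std coefs:", std.coef_)
print("R^2      :", std.score(X_std, y))
```

On the raw features the pollution coefficient has the largest magnitude simply because that feature varies over a small range. After standardization the rooms coefficient dominates, while the signs and the model's predictions stay the same.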

posted @ 2018-12-27 18:12  core!