Multivariate Linear Regression



Multivariate linear regression is simply the multivariate generalization of simple linear regression: the input x goes from a single feature to a vector of n features, each feature is multiplied by its own coefficient, and a bias term θ₀ is added at the end.

$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$

The solution approach is the same as for simple linear regression.

1. Mathematical Derivation of Multivariate Linear Regression

Stack the m training samples into a matrix and prepend a column of ones so that the intercept is absorbed into the parameter vector:

$$X_b = \begin{bmatrix} 1 & x_1^{(1)} & \cdots & x_n^{(1)} \\ \vdots & \vdots & & \vdots \\ 1 & x_1^{(m)} & \cdots & x_n^{(m)} \end{bmatrix}, \qquad \theta = (\theta_0, \theta_1, \ldots, \theta_n)^T, \qquad \hat{y} = X_b \theta$$

As in simple linear regression, the objective is to minimize the sum of squared errors:

$$J(\theta) = \sum_{i=1}^{m} \big(y^{(i)} - \hat{y}^{(i)}\big)^2 = (y - X_b\theta)^T (y - X_b\theta)$$

Taking the gradient with respect to θ and setting it to zero gives

$$\nabla_\theta J(\theta) = -2\,X_b^T (y - X_b\theta) = 0 \;\Rightarrow\; X_b^T X_b\,\theta = X_b^T y \;\Rightarrow\; \theta = (X_b^T X_b)^{-1} X_b^T y$$

This expression is known as the Normal Equation solution of multivariate linear regression.

Drawback: the time complexity is O(n³) (about O(n^2.4) with optimized matrix algorithms).

Advantage: the data does not need to be normalized.
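
As a quick sanity check of the Normal Equation (a minimal sketch on made-up synthetic data, not part of the original example), it should recover known coefficients:

import numpy as np

# Synthetic data (hypothetical): y = 4 + 3*x1 - 2*x2 plus a little noise
rng = np.random.RandomState(666)
x = rng.uniform(0., 10., size=(100, 2))
y = 4. + x.dot(np.array([3., -2.])) + rng.normal(0., 0.1, size=100)

X_b = np.hstack([np.ones((len(x), 1)), x])                # prepend a column of ones
theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)   # Normal Equation
print(theta)                                              # approximately [4, 3, -2]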

2. Implementing Multivariate Linear Regression

In the actual implementation, the parameters are split into two parts: the intercept and the coefficients.

That is, from the full parameter vector $\theta = (\theta_0, \theta_1, \ldots, \theta_n)^T$, the intercept $\theta_0$ is stored as interception_ and the coefficients $(\theta_1, \ldots, \theta_n)$ as coef_.

import numpy as np
from play_ML.multi_linear_regression.metrics import r2_score

class LinearRegression(object):

    def __init__(self):
        """Initialize the multivariate linear regression model."""
        self.coef_ = None           # coefficients theta_1 ... theta_n
        self.interception_ = None   # intercept theta_0
        self._theta = None          # full parameter vector

    def fit_normal(self, x_train, y_train):
        assert x_train.shape[0] == y_train.shape[0], "the size of x_train must be equal to the size of y_train"

        # Prepend a column of ones so the intercept is absorbed into theta
        X = np.hstack([np.ones((len(x_train), 1)), x_train])
        # Normal Equation: theta = (X^T X)^(-1) X^T y
        self._theta = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y_train)

        self.interception_ = self._theta[0]
        self.coef_ = self._theta[1:]

        return self

    def predict(self, x_predict):
        assert self.interception_ is not None and self.coef_ is not None, "must fit before predict"
        assert x_predict.shape[1] == len(self.coef_), "the feature number must be equal to x_train"

        X = np.hstack([np.ones((len(x_predict), 1)), x_predict])
        return X.dot(self._theta)

    def score(self, x_test, y_test):
        """Return the R^2 score on the test set."""
        y_predict = self.predict(x_test)
        return r2_score(y_test, y_predict)

    def __repr__(self):
        return "Multi_Linear_Regression"

Next, use the Boston housing dataset from sklearn to make predictions:

if __name__ == '__main__':
    import numpy as np
    from sklearn import datasets
    from sklearn.model_selection import train_test_split

    boston = datasets.load_boston()
    x = boston.data
    y = boston.target

    X = x[y < 50]
    Y = y[y < 50]

    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=666)
    reg = LinearRegression()
    reg.fit_normal(x_train, y_train)
    print(reg.coef_)
    print(reg.interception_)
    print(reg.score(x_test, y_test))

Output:

[-1.20354261e-01  3.64423279e-02 -3.61493155e-02  5.12978140e-02
 -1.15775825e+01  3.42740062e+00 -2.32311760e-02 -1.19487594e+00
  2.60101728e-01 -1.40219119e-02 -8.35430488e-01  7.80472852e-03
 -3.80923751e-01]
34.117399723201785
0.8129794056212823

3. Linear Regression in scikit-learn

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

boston = datasets.load_boston()
x = boston.data
y = boston.target

X = x[y < 50]
Y = y[y < 50]

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=666)

reg = LinearRegression()
reg.fit(x_train, y_train)
print(reg.coef_)
print(reg.intercept_)
print(reg.score(x_test, y_test))

Output:

array([-1.20354261e-01,  3.64423279e-02, -3.61493155e-02,  5.12978140e-02,
       -1.15775825e+01,  3.42740062e+00, -2.32311760e-02, -1.19487594e+00,
        2.60101728e-01, -1.40219119e-02, -8.35430488e-01,  7.80472852e-03,
       -3.80923751e-01])
34.117399723229845
0.8129794056212809

Comparing the outputs, the results match our own implementation above exactly. That said, what sklearn wraps internally is not quite the same as what we wrote here; this will come up again later.
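
One concrete difference (an assumption about sklearn's internals, stated as a hedge rather than a guarantee) is that sklearn does not explicitly invert $X_b^T X_b$; it solves the least-squares problem with a dedicated solver, which is numerically more stable. A minimal sketch of that approach, reusing x_train and y_train from the split above:

import numpy as np

# Solve the same least-squares problem without forming an explicit inverse
X_b = np.hstack([np.ones((len(x_train), 1)), x_train])
theta, residuals, rank, sv = np.linalg.lstsq(X_b, y_train, rcond=None)
print(theta[0])    # intercept, should match reg.intercept_
print(theta[1:])   # coefficients, should match reg.coef_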

4. Multivariate Regression with KNN

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

boston = datasets.load_boston()
x = boston.data
y = boston.target

X = x[y < 50]
Y = y[y < 50]

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=666)

knn_reg = KNeighborsRegressor()     # k=5 by default
knn_reg.fit(x_train, y_train)
print(knn_reg.score(x_test, y_test))

Output: 0.5865412198300899

Clearly, hyperparameter tuning is needed:

from sklearn.model_selection import GridSearchCV
param_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': [i for i in range(1, 11)]
    },
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)],
        'p': [i for i in range(1, 6)]
    }
]
knn_reg = KNeighborsRegressor()
grid_search = GridSearchCV(knn_reg, param_grid, n_jobs=-1, verbose=2)
grid_search.fit(x_train, y_train)

Output:

Fitting 3 folds for each of 60 candidates, totalling 180 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 tasks      | elapsed:    1.6s
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed:    2.0s finished
GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=None, n_neighbors=5, p=2,
          weights='uniform'),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid=[{'weights': ['uniform'], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}, 
{'weights': ['distance'], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'p': [1, 2, 3, 4, 5]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=2)
grid_search.best_params_
grid_search.best_score_
grid_search.best_estimator_
grid_search.best_estimator_.score(x_test, y_test)

Output:

{'n_neighbors': 5, 'p': 1, 'weights': 'distance'}
0.6340477954176972
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=None, n_neighbors=5, p=1,
          weights='distance')
0.7044357727037996

With grid search cross-validation, the score improves from 0.5865412198300899 to 0.6340477954176972, and the best parameter combination is k=5 with distance weighting using the Manhattan distance (p=1). Note, however, that best_score_ is a cross-validation score computed on the training folds, so it is not calculated the same way as the test-set score we reported for multivariate linear regression and cannot be compared with it directly. Evaluating grid_search.best_estimator_.score on the test set gives 0.7044357727037996, which still falls somewhat short of multivariate linear regression. In general, scores obtained under different evaluation setups should not be compared casually.
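
To put the two models on the same footing (a small sketch reusing x_train, y_train, x_test, y_test and grid_search from above), compare cross-validation scores with cross-validation scores, and test-set scores with test-set scores:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# best_score_ is a cross-validation score on the training folds;
# .score(x_test, y_test) is an R^2 on the held-out test set.
lin_cv = cross_val_score(LinearRegression(), x_train, y_train, cv=3).mean()
print(lin_cv)                                             # comparable to grid_search.best_score_
print(grid_search.best_estimator_.score(x_test, y_test))  # comparable to reg.score(x_test, y_test)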

5. Interpretability of Linear Regression

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

boston = datasets.load_boston()
x = boston.data
y = boston.target

X = x[y < 50]
Y = y[y < 50]

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=666)

reg = LinearRegression()
reg.fit(x_train, y_train)
print(reg.coef_)

Output:

array([-1.20354261e-01,  3.64423279e-02, -3.61493155e-02,  5.12978140e-02,
       -1.15775825e+01,  3.42740062e+00, -2.32311760e-02, -1.19487594e+00,
        2.60101728e-01, -1.40219119e-02, -8.35430488e-01,  7.80472852e-03,
       -3.80923751e-01])

These are the coefficients linear regression produces for each feature: a positive value indicates a positive correlation with the target, a negative value a negative correlation. Let's sort these numbers:

np.argsort(reg.coef_)

Output: array([ 4, 7, 10, 12, 0, 2, 6, 9, 11, 1, 3, 8, 5], dtype=int64)

First, let's see what the features are:

boston.feature_names

Output: array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

Then order the feature names according to this sort:

boston.feature_names[np.argsort(reg.coef_)]

Output: array(['NOX', 'DIS', 'PTRATIO', 'LSTAT', 'CRIM', 'INDUS', 'AGE', 'TAX', 'B', 'ZN', 'CHAS', 'RAD', 'RM'], dtype='<U7')

What exactly do these features mean? Check with DESCR:

print(boston.DESCR)
# Output:
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

From the descriptions above, the three most negatively correlated features are 'NOX' (nitric oxides concentration), 'DIS' (weighted distance to the Boston employment centres), and 'PTRATIO' (pupil-teacher ratio); the three most positively correlated are 'RM' (average number of rooms), 'RAD' (accessibility to radial highways), and 'CHAS' (whether the tract bounds the Charles River, i.e. riverside property). These few observations show that linear regression has good interpretability.
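
A slightly more readable way to inspect this (a small sketch reusing reg and boston from the code above) is to print each feature name next to its coefficient, sorted from most negative to most positive:

# Pair feature names with their coefficients and sort by coefficient value
for name, c in sorted(zip(boston.feature_names, reg.coef_), key=lambda t: t[1]):
    print(f"{name:>8s}  {c:+.4f}")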

To be continued~
