Machine Learning (3): Multiple Linear Regression (Mathematical Derivation and Code Implementation)

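The previous post discussed y = ax + b, which considers only a single feature (factor). In many cases there is more than one feature: to predict a house price, for example, we need to consider not just the floor area but also the location, the year of construction, the layout, and so on. This is where multiple linear regression comes in.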

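$(\theta_{0},\theta_{1},\theta_{2},\theta_{3},\ldots,\theta_{n})$: $\theta$ denotes the parameters (weights) we need to learn.
$(1,X_{1}^{(i)},X_{2}^{(i)},X_{3}^{(i)},\ldots,X_{n}^{(i)})$: $X^{(i)}$ denotes the features of the $i$-th training sample, for example area, orientation, layout, decoration, and transport links.
The price of a house is determined by many factors together; the weights of these factors are what we need to solve for.
house price = area * weight of area + orientation * weight of orientation + layout * weight of layout + decoration * weight of decoration + transport * weight of transport
$\hat{y}^{(i)} =\theta_{0}X_{0}^{(i)}+\theta_{1}X_{1}^{(i)}+\theta_{2}X_{2}^{(i)}+\theta_{3}X_{3}^{(i)}+\cdots+\theta_{n}X_{n}^{(i)}$

Here $X_{0}^{(i)} = 1$, so $\theta_{0}X_{0}^{(i)} = \theta_{0} = b$ is the bias (intercept):

$\hat{y}^{(i)} =b+\theta_{1}X_{1}^{(i)}+\theta_{2}X_{2}^{(i)}+\theta_{3}X_{3}^{(i)}+\cdots+\theta_{n}X_{n}^{(i)}$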
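
As a minimal sketch of the formula above, the prediction for one sample is the bias plus a weighted sum of its features. The weights and feature values below are made-up numbers for illustration, not values from any dataset, and `predict_one` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical weights: theta[0] is the bias b, theta[1:] are the feature weights.
theta = np.array([10.0, 2.0, -1.0, 0.5])
# Hypothetical sample with three features.
x_i = np.array([3.0, 4.0, 2.0])


def predict_one(theta, x):
    """Return b + theta_1 * x_1 + ... + theta_n * x_n for a single sample."""
    return theta[0] + np.dot(theta[1:], x)


y_hat = predict_one(theta, x_i)  # 10 + 2*3 - 1*4 + 0.5*2
```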

We use the same loss function as in simple (single-variable) linear regression:
$\underset{\theta}{\arg\min}\sum_{i=1}^{m}(y^{(i)} - \hat{y}^{(i)})^2$
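
The quantity being minimized is the sum of squared residuals. A small sketch, with made-up true values and predictions (`sum_squared_error` is a hypothetical helper name):

```python
import numpy as np


def sum_squared_error(y_true, y_pred):
    """Sum over all samples of (y - y_hat)^2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sum((y_true - y_pred) ** 2))


# Residuals are 0.5, 0.0 and -1.0, so the loss is 0.25 + 0 + 1.
loss = sum_squared_error([3.0, 5.0, 7.0], [2.5, 5.0, 8.0])
```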

In matrix form:
$\theta=(\theta_{0},\theta_{1},\theta_{2},\theta_{3},\ldots,\theta_{n})^T$
$X^{(i)}=(1,X^{(i)}_{1},X^{(i)}_{2},X^{(i)}_{3},\ldots,X^{(i)}_{n})^T$
The equation then becomes:

$\hat y^{(i)} =\theta \cdot X^{(i)} = \theta^T X^{(i)}$
This is the vector form of $y = ax + b$, with the bias $b$ absorbed into $\theta_{0}$.

Input samples:
$X = \left[ \begin{matrix} 1 & X^{(1)}_{1} & X^{(1)}_{2} & \cdots & X^{(1)}_{n}\\ 1 & X^{(2)}_{1} & X^{(2)}_{2} & \cdots & X^{(2)}_{n}\\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X^{(m)}_{1} & X^{(m)}_{2} & \cdots & X^{(m)}_{n}\\ \end{matrix} \right]$

Weights to solve for: $\theta = \left[ \begin{matrix} \theta_{0} \\ \theta_{1} \\ \theta_{2} \\ \vdots \\ \theta_{n} \\\end{matrix} \right]$ True values: $y = \left[ \begin{matrix} y^{(1)} \\ y^{(2)} \\ y^{(3)} \\ \vdots \\ y^{(m)} \\\end{matrix} \right]$

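$X$ is an $m \times (n+1)$ matrix.
$\theta$ is an $(n+1) \times 1$ column vector.
$y$ is an $m \times 1$ column vector.

The subscript ($1,\ldots,n$) indexes the features; $n$ is the number of features.
The superscript ($(1),\ldots,(m)$) indexes the training samples; $m$ is the number of samples.
The first column of $X$ is all ones, so $\theta_{0}$ plays the role of the bias $b$.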
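
The shapes can be checked with a short sketch. The data and weights are made up ($m = 4$ samples, $n = 2$ features); prepending a column of ones lets a single matrix product compute every prediction, bias included.

```python
import numpy as np

# Hypothetical raw features: m = 4 samples, n = 2 features.
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.0],
                  [3.0, 1.0],
                  [4.0, 3.0]])
theta = np.array([0.5, 2.0, -1.0])  # (theta_0, theta_1, theta_2)

# Prepend a column of ones so that theta_0 acts as the bias.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])  # shape (m, n + 1)

# All m predictions at once.
y_hat = X @ theta  # shape (m,)

# The same predictions computed one sample at a time.
y_hat_loop = np.array([theta[0] + theta[1:] @ row for row in X_raw])
```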

$\underset{\theta}{\arg\min}\sum_{i=1}^{m}(y^{(i)} - \hat{y}^{(i)})^2$ → $\underset{\theta}{\arg\min}\,\|Y - X\theta\|^2$ → $\underset{\theta}{\arg\min}\,(Y - X\theta)^T(Y - X\theta)$
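
The three ways of writing the loss give the same number. A quick numerical check on random data (the seed and sizes are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])  # 5 samples, 2 features
Y = rng.normal(size=5)
theta = rng.normal(size=3)

residual = Y - X @ theta

# Sum over samples of squared residuals.
loss_sum = sum((Y[i] - X[i] @ theta) ** 2 for i in range(len(Y)))
# Squared Euclidean norm of the residual vector.
loss_norm = np.linalg.norm(residual) ** 2
# residual^T residual (for 1-D arrays, @ is the inner product).
loss_quadratic = residual @ residual
```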

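So the problem becomes finding the minimum of the expression above. This is a convex optimization problem; for problems of this kind we usually first show that the function is convex and continuously differentiable.

(The derivation below can be skipped; you can go straight to the conclusion.)
General outline of the derivation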

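Definition of a convex set

Let $D \subset R^n$. If for any elements $x,y \in D$ and any $a \in [0,1]$ we have $ax + (1 - a)y \in D$, then $D$ is called a convex set.

In other words, a set is convex if the line segment joining any two of its points lies entirely inside the set.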
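
To make the definition concrete, the sketch below samples points $ax + (1-a)y$ on the segment between two points and tests whether they stay in the set. The unit disk and the ring are example sets chosen for illustration. Sampling a few values of $a$ is only a spot check, not a proof of convexity.

```python
import numpy as np


def in_unit_disk(p):
    """Membership test for the disk of radius 1 (a convex set)."""
    return 1.0 >= np.linalg.norm(p)


def in_ring(p):
    """Membership test for the ring with radii 0.5 and 1 (not convex)."""
    r = np.linalg.norm(p)
    return 1.0 >= r >= 0.5


x = np.array([0.8, 0.0])
y = np.array([-0.8, 0.0])
alphas = np.linspace(0.0, 1.0, 11)

# Every sampled point on the segment stays inside the disk.
disk_ok = all(in_unit_disk(a * x + (1 - a) * y) for a in alphas)
# The midpoint (the origin) falls into the hole of the ring.
ring_ok = all(in_ring(a * x + (1 - a) * y) for a in alphas)
```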

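Definition of the gradient

Let $f(x)$ be a function of $n$ variables. If the partial derivatives $\frac {\partial f(x)}{\partial x_i}$ $(i=1,2,\ldots,n)$ with respect to every component $x_i$ of $x = (x_1,x_2,\ldots,x_n)^T$ all exist, then $f(x)$ is said to be first-order differentiable at $x$, and the vector
$\nabla f(x) = \left[ \begin{matrix} \frac {\partial f(x)}{\partial x_1} \\ \frac {\partial f(x)}{\partial x_2} \\ \vdots \\ \frac {\partial f(x)}{\partial x_n} \end{matrix} \right]$
is called the first derivative, or gradient, of $f(x)$ at $x$, written $\nabla f(x)$ (a column vector).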
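
The gradient can be approximated numerically, one partial derivative at a time. The sketch below uses central differences, which is a numerical approximation rather than the analytic definition; the test function $f(x) = x_1^2 + 3x_1x_2$ and the helper name `numerical_gradient` are chosen for illustration. Its exact gradient is $(2x_1 + 3x_2,\ 3x_1)^T$.

```python
import numpy as np


def numerical_gradient(f, x, h=1e-6):
    """Approximate each partial derivative of f at x by a central difference."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad


def f(x):
    return x[0] ** 2 + 3 * x[0] * x[1]


# At x = (1, 2) the exact gradient is (2*1 + 3*2, 3*1) = (8, 3).
g = numerical_gradient(f, [1.0, 2.0])
```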

Hessian matrix

Definition of the Hessian matrix: let $f(x)$ be a function of $n$ variables. If the second-order partial derivatives $\frac{\partial ^2f(x)}{\partial x_i \partial x_j}$ $(i,j = 1,2,\ldots,n)$ with respect to the components $x_i$ of $x =(x_1,x_2,\ldots,x_n)^T$ all exist, then $f(x)$ is said to be twice differentiable at $x$, and the matrix
$\nabla ^2 f(x) = \left[ \begin{matrix} \frac {\partial ^2 f(x)}{\partial x_1^2 } & \frac {\partial ^2 f(x)}{\partial x_1 \partial x_2 } & \cdots & \frac {\partial ^2 f(x)}{\partial x_1\partial x_n} \\ \frac {\partial ^2 f(x)}{\partial x_2 \partial x_1} & \frac {\partial ^2 f(x)}{\partial x_2^2 } & \cdots & \frac {\partial ^2 f(x)}{\partial x_2\partial x_n} \\ \vdots & \vdots& \ddots & \vdots \\ \frac {\partial ^2 f(x)}{\partial x_n \partial x_1 } & \frac {\partial ^2 f(x)}{\partial x_n \partial x_2 } & \cdots & \frac {\partial ^2 f(x)}{\partial x_n^2} \end{matrix} \right]$
is called the second derivative, or Hessian matrix, of $f(x)$ at $x$, written $\nabla ^2 f(x)$. If all second-order partial derivatives of $f(x)$ are continuous, then $\frac{\partial ^2f(x)}{\partial x_i \partial x_j} = \frac{\partial ^2f(x)}{\partial x_j \partial x_i}$, i.e. the Hessian is symmetric.

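Convexity test for multivariate real-valued functions

Let $D \subset R^n$ be a nonempty open convex set and let $f: D \to R$ be twice continuously differentiable on $D$. If the Hessian matrix $\nabla ^2 f(x)$ is positive definite on $D$, then $f(x)$ is strictly convex on $D$.
Sufficiency theorem for convex functions
If $f:R^n \to R$ is convex and continuously differentiable, then $x^*$ is a global optimum (global minimum) if and only if $\nabla f(x ^*) = 0$, where $\nabla f(x)$ is the first derivative (gradient) of $f(x)$ with respect to $x$.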

The [scalar-by-vector] matrix differentiation formulas are:
$\frac{\partial y}{\partial x} = \left(\begin{matrix} \frac{\partial y}{\partial x_1} \\ \frac{\partial y}{\partial x_2} \\ \vdots \\ \frac{\partial y}{\partial x_n} \end{matrix} \right)$
(denominator layout)
$\frac{\partial y}{\partial x} = \left(\begin{matrix} \frac{\partial y}{\partial x_1} & \frac{\partial y}{\partial x_2} & \cdots & \frac{\partial y}{\partial x_n} \end{matrix} \right)$
(numerator layout)
where $x =(x_1,x_2,\ldots,x_n)^T$ is an $n$-dimensional vector and $y$ is a scalar function of the $n$ components of $x$.
Which layout to use is a matter of convention; the denominator layout is used below.
From the [scalar-by-vector] formula we get:
$\frac {\partial x^Ta}{\partial x} =\frac {\partial a^T x}{\partial x} = \left(\begin{matrix} \frac{\partial (a_1x_1 + a_2x_2 +\cdots+a_nx_n)}{\partial x_1} \\ \frac{\partial (a_1x_1 + a_2x_2 +\cdots+a_nx_n)}{\partial x_2} \\ \vdots \\ \frac{\partial (a_1x_1 + a_2x_2 +\cdots+a_nx_n)}{\partial x_n} \end{matrix} \right) = \left(\begin{matrix} a_1\\a_2 \\ \vdots \\a_n \end{matrix} \right) = a$
Similarly: $\frac{\partial x^TBx}{\partial x} =(B +B^T)x$
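
Both identities can be spot-checked numerically. The sketch below compares a central-difference gradient (a numerical approximation; `numerical_gradient` is a hypothetical helper) against $a$ and $(B + B^T)x$ for random $a$, $B$ and $x$. $B$ is deliberately not symmetric, so the $B^T$ term matters.

```python
import numpy as np


def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad


rng = np.random.default_rng(1)
a = rng.normal(size=3)
B = rng.normal(size=(3, 3))  # not symmetric in general
x = rng.normal(size=3)

# d(a^T x)/dx should equal a.
grad_linear = numerical_gradient(lambda v: a @ v, x)
# d(x^T B x)/dx should equal (B + B^T) x.
grad_quadratic = numerical_gradient(lambda v: v @ B @ v, x)
```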

With the results above, we can begin the proof.

Show that the loss function $E_\theta$ is convex in $\theta$.

$\frac{\partial E_\theta}{\partial \theta }=\frac{\partial\left[(Y- X\theta)^T (Y - X\theta)\right]}{\partial \theta}$

= $\frac{\partial}{\partial \theta}\left[(Y^T-\theta ^T X^T) (Y - X \theta)\right]$

= $\frac{ \partial}{\partial \theta}\left[Y^TY - Y^TX\theta - \theta^TX^TY + \theta^TX^TX\theta\right]$

= $\frac{ \partial Y^TY}{\partial \theta} - \frac{ \partial Y^TX \theta}{\partial \theta} -\frac{ \partial \theta ^T X^TY}{\partial \theta} +\frac{ \partial \theta^TX^TX\theta}{\partial \theta}$

Since $Y^TY$ does not depend on $\theta$, its derivative is 0. Applying $\frac {\partial x^Ta}{\partial x} =\frac {\partial a^T x}{\partial x} = a$ with $a = X^TY$, and $\frac{\partial x^TBx}{\partial x} =(B +B^T)x$ with $B = X^TX$, gives:

$\frac{\partial E_\theta}{\partial\theta} = 0 - X^TY - X^TY + (X^TX + X^TX)\theta$

= $2X^T(X\theta - Y)$, which is the first-order derivative.
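
As a sanity check on the formula $2X^T(X\theta - Y)$, the sketch below compares it with a finite-difference slope of the loss along a random direction. The data, the seed and the step size `h` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 3))])
Y = rng.normal(size=20)
theta = rng.normal(size=4)


def loss(t):
    """E(theta) = (Y - X theta)^T (Y - X theta)."""
    r = Y - X @ t
    return r @ r


def gradient(t):
    """Closed-form gradient 2 X^T (X theta - Y)."""
    return 2 * X.T @ (X @ t - Y)


# Directional derivative along a random direction, computed two ways.
direction = rng.normal(size=4)
h = 1e-6
slope_numeric = (loss(theta + h * direction) - loss(theta - h * direction)) / (2 * h)
slope_formula = gradient(theta) @ direction
```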

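Differentiating the first-order derivative once more gives the second-order derivative:
$\frac {\partial ^2E_\theta}{\partial \theta\,\partial \theta^T} = \frac{\partial }{\partial \theta}\left(\frac{\partial E_\theta}{\partial \theta}\right) = \frac{\partial}{\partial \theta}(2X^TX\theta - 2X^TY) = 2X^TX$ (the Hessian matrix)

Assume $X^TX$ is positive definite, which holds when $X$ has full column rank. Then the Hessian $2X^TX$ is positive definite and $E_\theta$ is a strictly convex function of $\theta$.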
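
Whether $X^TX$ is positive definite can be checked through its eigenvalues. In the sketch below (random data, chosen for illustration) a full-column-rank $X$ gives strictly positive eigenvalues for $2X^TX$, while duplicating a column makes the smallest eigenvalue numerically zero, so the assumption fails and $(X^TX)^{-1}$ does not exist.

```python
import numpy as np

rng = np.random.default_rng(3)
# Full column rank: a ones column plus two random feature columns.
X_full = np.hstack([np.ones((10, 1)), rng.normal(size=(10, 2))])
# Rank-deficient variant: the last column duplicates the second one.
X_dup = np.hstack([X_full, X_full[:, [1]]])

# eigvalsh is meant for symmetric matrices and returns real eigenvalues.
eig_full = np.linalg.eigvalsh(2 * X_full.T @ X_full)
eig_dup = np.linalg.eigvalsh(2 * X_dup.T @ X_dup)

rank_full = np.linalg.matrix_rank(X_full)
rank_dup = np.linalg.matrix_rank(X_dup)
```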
Set the first-order derivative of $E_\theta$ with respect to $\theta$, derived above, to 0:
$\frac{\partial E_\theta}{\partial \theta } = 2X^T(X\theta - Y)=0$
$2X^TX\theta - 2X^TY =0$
$2X^TX\theta = 2X^TY$
Divide both sides by 2:
$X^TX\theta = X^TY$
Left-multiply both sides by $(X^TX)^{-1}$:
$(X^TX)^{-1}X^TX\theta = (X^TX)^{-1}X^TY$
This gives: $\theta = (X^TX)^{-1}X^TY$

This completes the derivation:
$\theta = (X^TX)^{-1}X^TY$
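
A minimal sketch of the closed-form solution on synthetic data (the true weights, noise level and seed are made up). It computes $\theta$ with the explicit inverse from the formula, with `np.linalg.solve` applied to $X^TX\theta = X^TY$, and with `np.linalg.lstsq` as an independent reference, then evaluates the gradient at the solution.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 50, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
true_theta = np.array([1.0, 2.0, -3.0, 0.5])  # hypothetical ground truth
Y = X @ true_theta + 0.01 * rng.normal(size=m)

# Literal translation of theta = (X^T X)^{-1} X^T Y.
theta_inv = np.linalg.inv(X.T @ X) @ X.T @ Y
# Solve X^T X theta = X^T Y without forming the inverse explicitly.
theta_solve = np.linalg.solve(X.T @ X, X.T @ Y)
# Least-squares solver as an independent reference.
theta_lstsq = np.linalg.lstsq(X, Y, rcond=None)[0]

# The gradient 2 X^T (X theta - Y) should vanish at the minimizer.
grad_at_solution = 2 * X.T @ (X @ theta_solve - Y)
```

Solving the linear system instead of inverting $X^TX$ is a common numerical preference; the result is the same up to rounding.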

Drawback: high time complexity, O( $n^3$ ) (about O( $n^{2.4}$ ) with optimized algorithms).
Advantage: the data does not need to be normalized.
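
The normalization claim can be illustrated with a sketch: rescaling the features changes the fitted weights but, in exact arithmetic, not the predictions. The data, the scale factors and the helpers `fit` and `predict` are assumptions made for this example. Very large differences in scale can still hurt the numerical conditioning of $X^TX$.

```python
import numpy as np


def fit(X_raw, y):
    """Normal-equation fit; returns theta with the bias first."""
    X = np.hstack([np.ones((len(X_raw), 1)), X_raw])
    return np.linalg.solve(X.T @ X, X.T @ y)


def predict(theta, X_raw):
    return theta[0] + X_raw @ theta[1:]


rng = np.random.default_rng(5)
X_raw = rng.normal(size=(30, 2))
y = 4.0 + X_raw @ np.array([1.5, -2.0]) + 0.1 * rng.normal(size=30)

# Rescale each feature by a different (hypothetical) factor.
scale = np.array([100.0, 0.01])
theta_raw = fit(X_raw, y)
theta_scaled = fit(X_raw * scale, y)

pred_raw = predict(theta_raw, X_raw)
pred_scaled = predict(theta_scaled, X_raw * scale)
```

The weights absorb the scaling (`theta_scaled[1:] * scale` matches `theta_raw[1:]`), which is why the predictions agree.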

import numpy as np


def mean_squared_error(y_true, y_predict):
    """Mean squared error (helper used by r_Squared; defined here so the code runs on its own)."""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"
    return np.sum((y_true - y_predict) ** 2) / len(y_true)


def r_Squared(y_true, y_predict):
    """R^2 score: 1 - MSE / Var(y_true)."""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"
    return 1 - (mean_squared_error(y_true, y_predict) / np.var(y_true))


class LinearRegression:
    def __init__(self):
        """Initialize the linear regression model."""
        self.coef_ = None
        self.interception_ = None
        self._theta = None

    def fit_normal(self, X_train, y_train):
        """Fit the model with the normal equation."""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of x_train must be equal to the size of y_train"
        # The training samples have no intercept column; prepend a column of ones.
        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        # Compute theta = (X^T X)^{-1} X^T y.
        self._theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y_train)
        self.interception_ = self._theta[0]
        self.coef_ = self._theta[1:]
        return self

    def predict(self, X_predict):
        """Return predictions for the rows of X_predict."""
        assert self.coef_ is not None and self.interception_ is not None, \
            "you must run fit_normal before predict"
        assert X_predict.shape[1] == len(self.coef_), \
            "the size of X_predict must be equal to the size of X_train"
        X_predict_b = np.hstack([np.ones((len(X_predict), 1)), X_predict])
        return X_predict_b.dot(self._theta)

    def score(self, X_test, y_test):
        """Return the R^2 score of the model on the test data."""
        y_predict = self.predict(X_test)
        return r_Squared(y_test, y_predict)

    def __repr__(self):
        return "LinearRegression()"

Testing

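import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from ml_utils.data_split import train_test_split
from sklearn import datasets
# Note: load_boston was removed in scikit-learn 1.2, so this needs an older version.
boston = datasets.load_boston()
x = boston.data
y = boston.target
# Split into training and test sets; the return order of this helper
# is assumed to match scikit-learn's train_test_split.
x_train, x_test, y_train, y_test = train_test_split(x, y)
reg = LinearRegression()
reg.fit_normal(x_train, y_train)
print(reg.coef_)
print(reg.interception_)
print(reg.score(x_test, y_test))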

Linear regression in scikit-learn

from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
# Note: load_boston was removed in scikit-learn 1.2, so this needs an older version.
boston = datasets.load_boston()
x = boston.data
y = boston.target
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=666)
lin_reg = LinearRegression()
lin_reg.fit(x_train, y_train)
# Learned weights
lin_reg.coef_
# Intercept
lin_reg.intercept_
# R^2 score on the test set
lin_reg.score(x_test, y_test)

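With a linear model we can read off which features matter. RM has the largest weight, which says that the number of rooms is positively correlated with the house price and carries a large weight. CHAS has the second-largest weight; it indicates whether a Boston house borders the river, and bordering the river is also positively correlated with the price. NOX, which ranks last, is the nitric oxide concentration and is negatively correlated with the price. This shows that linear regression makes the data interpretable: whatever the predictive quality of the model, fitting a linear model first gives a direct view of the patterns in the data.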
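
The same kind of reading can be sketched on synthetic data, where the true weights are known. The feature names, weights and noise level below are hypothetical and are not the Boston results; the point is only that sorting the fitted coefficients recovers which features push the target up or down.

```python
import numpy as np

rng = np.random.default_rng(6)
feature_names = ["rooms", "near_river", "pollution"]  # hypothetical names
X_raw = rng.normal(size=(200, 3))
true_weights = np.array([3.0, 1.0, -2.0])  # made-up ground truth
y = 20.0 + X_raw @ true_weights + 0.1 * rng.normal(size=200)

# Fit by least squares with a ones column for the intercept.
X = np.hstack([np.ones((200, 1)), X_raw])
theta = np.linalg.lstsq(X, y, rcond=None)[0]
coef = theta[1:]

# Sort features from the most positive to the most negative weight.
order = np.argsort(coef)[::-1]
ranking = [(feature_names[i], float(coef[i])) for i in order]
```

Comparing magnitudes across features is meaningful here because all synthetic features share the same scale; with features in different units, the size of a weight also reflects the unit.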

posted @ 2019-12-15 14:23 caomaoboy
