机器学习(四) - 梯度下降法

梯度下降

梯度下降是一种非常通用的优化算法,能够为大范围的问题找到最优解。梯度下降中的目的就是通过迭代地调整参数从而使成本函数最小化。

假设你迷失在山上的迷雾中,你能感觉到的只有你脚下路面的坡度。快速到达山脚的一个策略就是沿着最陡的方向下坡。这就是梯度下降的思想:通过测量 $\theta$ 相关的误差函数的局部梯度，并不断沿着降梯度的方向调整，直到梯度降为0，到达最小值！

具体来说，首先使用一个随机 $\theta$ 值(称之为随机初始化)，然后逐步改进，每次踏出yibu，每一步都尝试将第一点成本函数(如MSE¹)，直到算法收敛出一个最小值,如下图

梯度下降中一个重要参数是每一步的步长，梯度下降中称为学习率(超参数)。如果学习率太低，就需要更多次的迭代才能达到收敛，如果学习率太高可能直接越过山脚最低点，如果学习率太高非但不能走到山脚反而可能一不小心走过头走到了另一个比之前还要高的位置。

上图可以看出当学习率不能太低也不能太高机器学习中一般为 0.01
由 MSE作为损失函数而MSE正好又是一个凸函数可以找到全局最小值
而在现实中损失函数可能没有那么好的性质只能找到局部最小值。

$\hat{y}^{(i)} =b+\theta_{1}X_{1}^{(i)}+\theta_{2}X_{2}^{(i)}+\theta_{3}X_{3}^{(i)},.....,\theta_{n}X_{n}^{(i)}$
$X_d = \left[ \begin{matrix} 1 & X^{(1)}_{(1)} & X^{(1)}_{(2)} & .... & X^{(1)}_{(n)}\\ 1 & X^{(2)}_{(1)} & X^{(2)}_{(2)} & .... & X^{(2)}_{(n)}\\ ..... \\1 & X^{(m)}_{(1)} & X^{(m)}_{(2)} & .... & X^{(m)}_{(n)}\\ \end{matrix} \right]$
$\theta=(\theta_{0},\theta_{1},\theta_{2},\theta_{3},.....,\theta_{n})^T$
$\hat{y}^{(i)}=X_d\theta$
$\hat{y}^{(i)} ∈ {m*n +1}$ 维向量

目标使损失函数最小化: ${argmin}\sum_{i=1}^{m}(y^{(i)} - \hat{y}^{(i)})^2$
根据求导法则可得:
$\nabla J(\theta) = \left[ \begin{matrix} \frac {\partial J}{\partial \theta_1} \\ \frac {\partial J}{\partial \theta_2} \\ \frac {\partial J}{\partial \theta_3} \\ ... \\ \frac {\partial J}{\partial \theta_n} \end{matrix} \right]$ = $\ \left[ \begin{matrix} \sum_{i=1}^m2(y^{(i)} - X_d^{(i)}\theta)\cdot(-1) \\ \sum_{i=1}^m2(y^{(i)} - X_d^{(i)}\theta)\cdot(-X_1 ^{(i)}) \\ \sum_{i=1}^m2(y^{(i)} - X_d^{(i)}\theta)\cdot(-X_2 ^{(i)}) \\ ... \\ \sum_{i=1}^m2(y^{(i)} - X_d^{(i)}\theta)\cdot(-X_n ^{(i)}) \end{matrix} \right]$ = $\frac{2}{m}\ \left[ \begin{matrix} \sum_{i=1}^m(y^{(i)} - X_d^{(i)}\theta)\cdot(-1) \\ \sum_{i=1}^m(y^{(i)} - X_d^{(i)}\theta)\cdot(-X_1 ^{(i)}) \\ \sum_{i=1}^m(y^{(i)} - X_d^{(i)}\theta)\cdot(-X_2 ^{(i)}) \\ ... \\ \sum_{i=1}^m(y^{(i)} - X_d^{(i)}\theta)\cdot(-X_n ^{(i)}) \end{matrix} \right]$
式子前除以m 是因为在计算梯度的时候随着样本量的增加梯度也会变大但梯度不应该和样本量挂钩所以除以m。
所以此时优化函数就变成了: ${argmin}\frac {1}{m}\sum_{i=1}^{m}(y^{(i)} - \hat{y}^{(i)})^2$
而这个函数就是 $J(\theta)=MSE(y,\hat y)$
还有一种常用的形式: ${argmin}\frac {1}{2m}\sum_{i=1}^{m}(y^{(i)} - \hat{y}^{(i)})^2$ 这种方式求导后分母2会消掉。

批量梯度下降法

def J(theta,x_b,y):
    try:
    	#通过一组theta计算损失
        return np.sumy(y - x_b.dot(theta)) ** 2/len(x_b)
    except:
        return float('INF')
def dJ(theta,x_b,y):
	#初始化梯度数组
    res = np.empty(len(theta))
    res[0] = np.sum((x_b.dot(theta) - y))
    for i in range(1,len(theta)):
    	#根据初始化为一组 0 的theta求导数 
        res[1] = (x_b.dot(theta)- y).dot(x_b[:,i])   
    return res * 2 /len(x_b)
def gradient_descent(x_b,y,initial_theta,eta,iters_num = 1e4,epsilon = 1e-8):
    theta = initial_theta
    i_iter = 0
    while i_iter < iters_num:
    	#计算梯度值
        gradient = dJ(theta,x_b,y)
        last_theta = theta
        #不停地更新损失函数 eta 为学习率 
        theta = theta - eta * gradient
        #迭代次数
        i_iter += 1
        #比较两次求出的theta值如果2次 下降的间隔 小于一定的值 则认为达到最小点
        if(abs(J(theta,x_b,y) - J(last_theta,x_b,y))<epsilon):
            break
    return theta

梯度求导向量化

上面对theta点求导用的是循环更简洁的直接使用矩阵乘法效率更高。

$\frac{2}{m}\ \left[ \begin{matrix} \sum_{i=1}^m(y^{(i)} - X_d^{(i)}\theta)\cdot(-1) \\ \sum_{i=1}^m(y^{(i)} - X_d^{(i)}\theta)\cdot(-X_1 ^{(i)}) \\ \sum_{i=1}^m(y^{(i)} - X_d^{(i)}\theta)\cdot(-X_2 ^{(i)}) \\ ... \\ \sum_{i=1}^m(y^{(i)} - X_d^{(i)}\theta)\cdot(-X_n ^{(i)}) \end{matrix} \right]$ = $\frac{2}{m}(X_d^{(1)}\theta -y^{(1)},X_d^{(2)}\theta -y^{(2)},X_d^{(3)}\theta -y^{(3)},X_d^{(4)}\theta -y^{(4)},.....,X_d^{(m)}\theta -y^{(m)})\cdot\left[ \begin{matrix} 1 & X^{(1)}_{(1)} & X^{(1)}_{(2)} & .... & X^{(1)}_{(n)}\\ 1 & X^{(2)}_{(1)} & X^{(2)}_{(2)} & .... & X^{(2)}_{(n)}\\ ..... \\1 & X^{(m)}_{(1)} & X^{(m)}_{(2)} & .... & X^{(m)}_{(n)}\\ \end{matrix} \right]$ = $\frac{2}{m}\cdot(X_d\theta - y)^T\cdot X_d$ = $\frac{2}{m} X_d^T\cdot (X_d\theta - y)$

#矩阵化后求导代码
return x_b.T.dot(x_b.dot(theta) -y) * 2. /len(y)

随机梯度下降法(Stochastic Gradient Descent)

前面讨论了用所有样本点进行计算搜索梯度,当样本量非常大是可能非常耗时，有没有其他方法呢?
$2\cdot (X_d^{(i)})^T\cdot (X_d^{(i)}\theta - y^{(i)})$
可以拿出一个单独数据来搜索方向。

随机梯度每次迭代不会朝着更小的点前进可能还会往梯度更大的值走但在现实情况下随机梯度下降法是能找到最小值的附近的。
在随机梯度下降中学习率随着迭代次数的增加需要不断变小,否则就算到达最小点可能也会在下一次迭代中跳过。
$\eta=\frac{a}{i\_iters+b}$
a 和 b 为超参数 i_iters为循环迭代次数
随着循环次数增加,学习率是下降的
模拟退火的思想
$\eta=\frac{t_0}{i\_iters+t_1}$

def learing_rate(t):
    return t0 / (t + t1)
def dJ_Sgd(theta, x_i_b, y_i):
    return  x_i_d.T.dot((x_i_d.dot(theta) - y_i)) * 2

梯度的调试

当2个蓝点取任意都无限趋近于红点时求极限用2点之间的斜率当做红点的斜率。 $\frac{y_1 - y_2}{x_1 -x_2}$ 斜率公式

$x_1 =\theta + \epsilon$
$x_2 =\theta - \epsilon$
$y_1 =f(\theta + \epsilon)$
$y_2 =f(\theta - \epsilon)$

$\frac{f(\theta + \epsilon)-f(\theta - \epsilon)}{\theta + \epsilon - \theta + \epsilon}$
可得:
$\frac{df}{d\theta} = \frac {f(\theta + \epsilon) - f(\theta - \epsilon)}{2\epsilon}$

假设有n维参数:
$\theta =(\theta_0,\theta_1,\theta_2,\theta_3,\cdots,\theta_n)$

$\theta =(\frac{\partial f}{\theta_0},\frac{\partial f}{\theta_1},\frac{\partial f}{\theta_2},\frac{\partial f}{\theta_3},\cdots,\frac{\partial f}{\theta_n})$

如果要求 $theta_0$ 维度导数

$\theta^+_0 =(\theta_0 +\varepsilon,\theta_1,\theta_2,\theta_3,\cdots,\theta_n)$

$\theta^-_0 =(\theta_0 -\varepsilon,\theta_1,\theta_2,\theta_3,\cdots,\theta_n)$

$\frac{df}{d\theta} = \frac {f(\theta_1^+) - f(\theta_1^-}{2\epsilon}$

以此类推通过不断的近似

要计算对 $\frac{df}{d\theta}$ 可以通过将每个 $\theta$ 加上一个最小的值和减去一个最小的值然后带入
$\frac{df}{d\theta} = \frac {f(\theta + \epsilon) - f(\theta - \epsilon)}{2\epsilon}$ 求导.
然后再用梯度下降的方式一步步找到最小值。

def J(theta,x_b,y):
    try:
        return np.sum((y - x_b.dot(theta)) ** 2) /len(x_b)
    except:
        return float('inf')
def dj_math(theta,x_b,y):
    return x_b.T.dot(x_b.dot(theta) -y) * 2. /len(y)
def dj_debug(theta,x_b,y,epsilon = 0.01):
    res = np.empty(len(theta))
    for i in range(len(theta)):
    	#拷贝 theta
        theta_p = theta.copy()
        theta_m = theta.copy()
        #找到theta左边的点
        theta_p[i] += epsilon
        #找到theta右边的点
        theta_m[i] -= epsilon
        #通2个点 斜率公式 当 2个点 和 点theta 无限接近的时候 2点之间的斜率就是 j在 theta 的导数 (导数定义)
        res[i] = (J(theta_p,x_b,y) - J(theta_m,x_b,y))/(2 * epsilon)
    return res
def gradient_descent(dj,x_b,y,initial_theta,eta,iters_num = 1e4,epsilon = 1e-4):
    theta = initial_theta
    i_iter = 0
    while i_iter < iters_num:
        gradient = dj(theta,x_b,y)
        last_theta = theta
        #通过不断迭代 theta
        theta = theta - eta * gradient
        i_iter += 1
        #当 2 次 theta 波动小于某个值时则认为 到达了局部最小值 当函数是凸函数则是全局最小值
        if(abs(J(theta,x_b,y) - J(last_theta,x_b,y))<epsilon):
            break
    return theta
X_b = np.hstack((np.ones([len(X),1]),X))
initial_theta =np.zeros(x_b.shape[1])
theta gradient_descent(dj_debug,X_b,y,eta=0.01,initial_theta=initial_theta)

LinearRegression封装代码

import numpy as np
from ml_utils.metrics import r_Squared

class LinearRegression:
    def __init__(self):
        """初始化linear Regression模型"""
        self.coef_ = None
        self.interception_ = None
        self._theta = None
    def fit_normal(self,X_train,y_train):
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of x_train must be equal to the size of y_train"
        #传入的训练样本没有截距 在每一行加一个截距项
        X_b = np.hstack([np.ones((len(X_train),1)),X_train])
        #根据公式计算 theta 的值
        self._theta = np .linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y_train)
        self.interception_=self._theta[0]
        self.coef_ = self._theta[1:]
        return self
    def fit_gd(self,X_train,y_train,eta,iters_num = 1e4,epsilon = 1e-7):
        def J( theta, x_b, y):
            try:
                return np.sum((y - x_b.dot(theta)) ** 2) / len(y)
            except:
                return float('INF')
        def dJ(theta, x_b, y):
            return x_b.T.dot(x_b.dot(theta) -y) * 2./len(y)
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of x_train must be equal to the size of y_train"
        b = np.ones([X_train.shape[0], 1])
        X_train = np.hstack((b,X_train))
        i_iter = 0
        initial_theta = np.zeros(X_train.shape[1])
        theta = initial_theta
        while i_iter < iters_num:
            gradient = dJ(theta, X_train, y_train)
            last_theta = theta
            theta = theta - eta * gradient
            i_iter += 1
            if (abs(J(theta, X_train, y_train) - J(last_theta, X_train, y_train)) < epsilon):
                break
        self._theta = theta
        self.coef_ = theta[1:]
        self.interception_=theta[0]
    def sgd(self,X_train,y_train,epoch_num = 2):
        assert epoch_num >= 1,\
            "epoch must bigger than one."
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_predict must be equal to the size of X_train"
        t0 = 5
        t1 =  50
        X_train = np.hstack((np.ones((X_train.shape[0],1)),X_train))
        init_theta = np.zeros(X_train.shape[1])
        theta = init_theta
        def learing_rate(t):
            return t0 / (t + t1)
        def dJ_Sgd(theta, x_i_b, y_i):
            return  x_i_b.T.dot((x_i_b.dot(theta) - y_i)) * 2
        for cur_epoch in range(epoch_num):
            disRupt_index = np.random.permutation(len(X_train))
            disRupt_X_train = X_train[disRupt_index]
            disRupt_y_train =y_train[disRupt_index]
            random_test = np.random.randint(0,X_train.shape[0],10)
            for i in range(X_train.shape[0]):
                gradient = dJ_Sgd(theta,disRupt_X_train[i],disRupt_y_train[i])
                theta  = theta - learing_rate(i) * gradient
            self._theta = theta
            self.coef_ = theta[1:]
            self.interception_ = theta[0]
            xt = disRupt_X_train[random_test][:,1:]
            epoch_score = self.score(xt,disRupt_y_train[random_test])
            print("now traing epoch:" + str(cur_epoch) +' loss:' + str(epoch_score))
        return self
    def predict(self,X_predict):
        assert self.coef_ is not None and  self.interception_ is not None, \
            "you must run fit_normal before predict"
        assert X_predict.shape[1] == len(self.coef_), \
            "the size of X_predict must be equal to the size of coef_"
        X_predict_b = np.hstack([np.ones((len(X_predict),1)), X_predict])
        return X_predict_b.dot(self._theta)

    def score(self, X_test,y_test):
        y_predict = self.predict(X_test)
        return r_Squared(y_test,y_predict)
    def __repr__(self):
        return "LinearRegression()"

$\frac {1}{m}\sum_{i=1}^m(y^{(i)}_{test} - \hat {y}^{(i)}_{tets})^2$ ↩︎

posted @ 2019-12-15 14:23 caomaoboy 阅读(186) 评论(0) 收藏举报

刷新页面返回顶部

caomaoboy