Exercises for Chapter 6 of 《统计学习方法》 (Statistical Learning Methods)

Exercise 6.1

First, we explain what the exponential family (also called the exponential family of distributions) is: it consists of probability distributions that can be written in the form \(f(x|\theta) = h(x)\exp(\eta(\theta) \cdot T(x) - A(\eta))\), where \(f(x|\theta)\) may be a probability density function.

Consider the logistic distribution, ignoring the bias term (or equivalently, augment \(x\) with an extra dimension equal to the constant 1): \(P(Y=1|x) = \frac{\exp(w \cdot x)}{1 + \exp(w \cdot x)}\) and \(P(Y=0|x) = \frac{1}{1 + \exp(w \cdot x)}\).

Therefore, the logistic distribution can be written as

\(P(y|x) = \left(\frac{\exp(w \cdot x)}{1 + \exp(w \cdot x)}\right)^y \left(\frac{1}{1 + \exp(w \cdot x)}\right)^{1-y} \\ = \exp\left(y\log\frac{\exp(w \cdot x)}{1 + \exp(w \cdot x)} + (1-y)\log\frac{1}{1 + \exp(w \cdot x)}\right) \\ = \exp\left(y\,(w \cdot x) - \log(1 + \exp(w \cdot x))\right)\)

Comparing with the exponential-family form above, we can identify \(h(y) = 1\), \(T(y) = y\), \(\eta(x) = w \cdot x\), and \(A(\eta) = \log(1 + \exp(\eta))\).

Therefore, the logistic distribution belongs to the exponential family.
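To double-check the algebra, here is a minimal numerical sketch (assuming numpy; the function names and the random test point are illustrative):

```python
import numpy as np

# Check that the exponential-family form exp(y * (w.x) - A(w.x)) matches
# the logistic probabilities P(Y=1|x) and P(Y=0|x) directly.
def logistic_prob(w, x, y):
    z = np.dot(w, x)
    p1 = np.exp(z) / (1.0 + np.exp(z))
    return p1 if y == 1 else 1.0 - p1

def exp_family_prob(w, x, y):
    eta = np.dot(w, x)               # eta(x) = w . x
    A = np.log(1.0 + np.exp(eta))    # A(eta) = log(1 + exp(eta))
    return np.exp(y * eta - A)       # h(y) = 1, T(y) = y

rng = np.random.default_rng(0)
w, x = rng.normal(size=3), rng.normal(size=3)
for y in (0, 1):
    assert np.isclose(logistic_prob(w, x, y), exp_family_prob(w, x, y))
```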

Exercise 6.2

Suppose the logistic regression model to be estimated is \(h_{\theta}(x) = P(Y=1|x) = \frac{1}{1 + e^{-\theta \cdot x}}\).

The loss function is

\(L(\theta) = \sum\limits_i -y^{(i)}\log h_{\theta}(x^{(i)}) - (1-y^{(i)})\log(1 - h_{\theta}(x^{(i)})) \\ = \sum\limits_i -y^{(i)}\log\frac{e^{x^{(i)}\theta}}{1 + e^{x^{(i)}\theta}} + y^{(i)}\log\frac{1}{1 + e^{x^{(i)}\theta}} - \log\frac{1}{1 + e^{x^{(i)}\theta}} \\ = \sum\limits_i -y^{(i)}\,x^{(i)}\theta + \log(1 + e^{x^{(i)}\theta})\)

Therefore, \(\frac{\partial}{\partial \theta} L(\theta) = \sum\limits_i -y^{(i)}x^{(i)} + \frac{e^{x^{(i)}\theta}}{1 + e^{x^{(i)}\theta}}x^{(i)} = \sum\limits_i x^{(i)}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)\).

Let \(f(\theta) = L(\theta)\) and \(g(\theta) = \frac{\partial}{\partial \theta} L(\theta) = \nabla f(\theta)\).
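Both functions translate directly into code. Below is a minimal sketch of \(f\) and \(g\) with a finite-difference check of the derived gradient (assuming numpy; the toy data X, Y are illustrative):

```python
import numpy as np

def f(theta, X, Y):
    """L(theta) = sum_i -y_i * (x_i . theta) + log(1 + exp(x_i . theta))."""
    z = X @ theta
    return np.sum(-Y * z + np.log(1.0 + np.exp(z)))

def g(theta, X, Y):
    """Gradient: sum_i x_i * (h_theta(x_i) - y_i)."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return X.T @ (h - Y)

# Finite-difference check of the derived gradient on toy data.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
Y = (rng.random(20) < 0.5).astype(float)
theta = rng.normal(size=3)
eps = 1e-6
numeric = np.array([(f(theta + eps * e, X, Y) - f(theta - eps * e, X, Y)) / (2 * eps)
                    for e in np.eye(3)])
assert np.allclose(numeric, g(theta, X, Y), atol=1e-5)
```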

The gradient descent algorithm for learning the logistic regression model

Input: objective function \(f(\theta)\), gradient function \(g(\theta)\), learning rate \(\eta\), precision \(\epsilon\)

Output: a minimizer \(\theta^*\) of \(f(\theta)\)

(1) Choose an initial value \(\theta^{(0)}\) and set \(k = 0\)

(2) Compute \(f(\theta^{(k)})\)

(3) Compute the gradient \(g(\theta^{(k)})\). If \(\left\| g(\theta^{(k)}) \right\| < \epsilon\), stop the iteration and set \(\theta^* = \theta^{(k)}\); otherwise, set \(\theta^{(k+1)} = \theta^{(k)} - \eta\, g(\theta^{(k)})\)

(4) Compute \(f(\theta^{(k+1)})\). If \(\left\| f(\theta^{(k+1)}) - f(\theta^{(k)}) \right\| < \epsilon\) or \(\left\| \theta^{(k+1)} - \theta^{(k)} \right\| < \epsilon\), stop the iteration and set \(\theta^* = \theta^{(k+1)}\)

(5) Otherwise, set \(k = k + 1\) and go to (3)
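As a companion to the algorithm, a minimal sketch of steps (1) through (5), reusing f, g, X, Y from the sketch above (the fixed learning rate and iteration cap are illustrative choices):

```python
import numpy as np

def gradient_descent(f, g, theta0, eta=0.1, eps=1e-6, max_iter=10000):
    """Steps (1)-(5): iterate theta <- theta - eta * g(theta) until the
    gradient norm, the change in f, or the change in theta drops below eps."""
    theta = theta0                                      # step (1)
    for _ in range(max_iter):
        grad = g(theta)
        if np.linalg.norm(grad) < eps:                  # step (3): small gradient
            break
        theta_new = theta - eta * grad                  # step (3): update
        if (abs(f(theta_new) - f(theta)) < eps or       # step (4): small progress
                np.linalg.norm(theta_new - theta) < eps):
            theta = theta_new
            break
        theta = theta_new                               # step (5): k <- k + 1
    return theta

theta_star = gradient_descent(lambda t: f(t, X, Y), lambda t: g(t, X, Y),
                              np.zeros(3))
print(theta_star, np.linalg.norm(g(theta_star, X, Y)))
```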

Exercise 6.3

Erratum: in Appendix B, Section 5 (the Broyden class of algorithms) of 《统计学习方法》, the iteration formula for \(G_k\) obtained from the DFP algorithm should be (B.24).

The maximum entropy model is \(P_w(y|x) = \frac{1}{Z_w(x)} \exp\left(\sum\limits_{i=1}^n w_i f_i(x,y)\right)\), where \(Z_w(x) = \sum\limits_y \exp\left(\sum\limits_{i=1}^n w_i f_i(x,y)\right)\).

That is, the maximum entropy model is \(P_w(y|x) = \frac{\exp\left(\sum\limits_{i=1}^n w_i f_i(x,y)\right)}{\sum\limits_y \exp\left(\sum\limits_{i=1}^n w_i f_i(x,y)\right)}\).

The objective function for maximum entropy model learning is \(\min\limits_w f(w) = \sum\limits_x \widetilde{P}(x) \log \sum\limits_y \exp\left(\sum\limits_{i=1}^n w_i f_i(x,y)\right) - \sum\limits_{x,y} \widetilde{P}(x,y) \sum\limits_{i=1}^n w_i f_i(x,y)\).

The gradient is \(g(w) = \left(\frac{\partial f(w)}{\partial w_1}, \frac{\partial f(w)}{\partial w_2}, ..., \frac{\partial f(w)}{\partial w_n}\right)^T\), where \(\frac{\partial f(w)}{\partial w_i} = \sum\limits_{x,y} \widetilde{P}(x) P_w(y|x) f_i(x,y) - E_{\widetilde{P}}(f_i)\).
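To make the objective and gradient concrete, here is a minimal sketch on a toy discrete problem (assuming numpy; the feature functions are random, and the empirical distribution is generated from a ground-truth model so that a minimizer exists; all of this setup is illustrative, not from the book):

```python
import numpy as np

# Toy setting: x in {0,1}, y in {0,1}, n = 3 feature functions encoded as
# F[x, y, i] = f_i(x, y). The empirical distribution P~(x, y) comes from a
# ground-truth model so that the objective attains its minimum.
rng = np.random.default_rng(0)
F = rng.integers(0, 2, size=(2, 2, 3)).astype(float)
s_true = F @ rng.normal(size=3)
P_x = np.array([0.4, 0.6])                                  # empirical P~(x)
P_xy = P_x[:, None] * np.exp(s_true) / np.exp(s_true).sum(axis=1, keepdims=True)

def f(w):
    """Objective: sum_x P~(x) log Z_w(x) - sum_{x,y} P~(x,y) sum_i w_i f_i(x,y)."""
    s = F @ w                          # s[x, y] = sum_i w_i f_i(x, y)
    return np.sum(P_x * np.log(np.exp(s).sum(axis=1))) - np.sum(P_xy * s)

def g(w):
    """Component i: sum_{x,y} P~(x) P_w(y|x) f_i(x,y) - E_{P~}[f_i]."""
    s = F @ w
    P_w = np.exp(s) / np.exp(s).sum(axis=1, keepdims=True)  # P_w(y|x)
    return np.einsum('x,xy,xyi->i', P_x, P_w, F) - np.einsum('xy,xyi->i', P_xy, F)
```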

The DFP algorithm is a quasi-Newton method; its key idea is to use a matrix \(G_k\) as an approximation to the inverse Hessian \(H_k^{-1}\). The DFP algorithm requires \(G_k\) to be positive definite at every iteration, and requires the quasi-Newton condition \(G_{k+1} y_k = \delta_k\) to hold.

Here \(y_k = g_{k+1} - g_k\) and \(\delta_k = x^{(k+1)} - x^{(k)}\).

Suppose the matrix is updated at each iteration as \(G_{k+1} = G_k + \Delta G_k\).

The DFP algorithm makes the following ansatz: \(G_{k+1} = G_k + P_k + Q_k\), where \(P_k\) and \(Q_k\) are to be determined, so that \(G_{k+1} y_k = G_k y_k + P_k y_k + Q_k y_k\).

To satisfy the quasi-Newton condition, we may require \(P_k y_k = \delta_k\) and \(Q_k y_k = -G_k y_k\). Accordingly, take \(P_k = \frac{\delta_k \delta_k^T}{\delta_k^T y_k}\) and \(Q_k = -\frac{G_k y_k y_k^T G_k}{y_k^T G_k y_k}\); indeed, \(P_k y_k = \delta_k \frac{\delta_k^T y_k}{\delta_k^T y_k} = \delta_k\) and \(Q_k y_k = -\frac{G_k y_k (y_k^T G_k y_k)}{y_k^T G_k y_k} = -G_k y_k\).

Therefore \(G_{k+1} = G_k + \frac{\delta_k \delta_k^T}{\delta_k^T y_k} - \frac{G_k y_k y_k^T G_k}{y_k^T G_k y_k}\).
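This update can be checked numerically against the quasi-Newton condition \(G_{k+1} y_k = \delta_k\); a minimal sketch, assuming numpy and random test vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
G = A @ A.T + np.eye(4)            # a positive definite, symmetric G_k
y = rng.normal(size=4)             # y_k
d = rng.normal(size=4)             # delta_k

G_next = (G + np.outer(d, d) / (d @ y)
            - (G @ np.outer(y, y) @ G) / (y @ G @ y))
assert np.allclose(G_next @ y, d)  # quasi-Newton condition G_{k+1} y_k = delta_k
```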

The DFP algorithm for learning the maximum entropy model

Input: feature functions \(f_1, f_2, ..., f_n\); empirical distribution \(\widetilde{P}(x,y)\); objective function \(f(w)\); gradient \(g(w) = \nabla f(w)\); learning rate \(\eta\); precision \(\epsilon\)

Output: optimal parameters \(w^*\); optimal model \(P_{w^*}(y|x)\)

(1) Choose an initial value \(w^{(0)}\), take \(G_0\) to be a positive definite symmetric matrix, and set \(k = 0\)

(2) Compute \(g_k = g(w^{(k)})\). If \(\left\| g(w^{(k)}) \right\| < \epsilon\), stop the iteration and set \(w^* = w^{(k)}\); otherwise go to (3)

(3) Set \(w^{(k+1)} = w^{(k)} - \eta\, G_k g_k\)

(4) Compute \(g_{k+1} = g(w^{(k+1)})\). If \(\left\| g(w^{(k+1)}) \right\| < \epsilon\), stop the iteration and set \(w^* = w^{(k+1)}\); otherwise, compute \(G_{k+1} = G_k + \frac{\delta_k \delta_k^T}{\delta_k^T y_k} - \frac{G_k y_k y_k^T G_k}{y_k^T G_k y_k}\), where \(y_k = g_{k+1} - g_k\) and \(\delta_k = w^{(k+1)} - w^{(k)}\)

(5) Set \(k = k + 1\) and go to (3)
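Finally, a minimal sketch of steps (1) through (5), reusing the toy f and g from the sketch above (the fixed learning rate, iteration cap, and curvature guard are illustrative; the book's DFP algorithm determines the step length by line search rather than a fixed \(\eta\)):

```python
import numpy as np

def dfp(f, g, w0, eta=1.0, eps=1e-6, max_iter=1000):
    """DFP iteration: w <- w - eta * G g, with G updated by
    G <- G + d d^T / (d^T y) - G y y^T G / (y^T G y)."""
    w = w0
    G = np.eye(len(w0))                    # step (1): G_0 positive definite
    g_k = g(w)
    for _ in range(max_iter):
        if np.linalg.norm(g_k) < eps:      # steps (2)/(4): stopping rule
            break
        w_new = w - eta * (G @ g_k)        # step (3)
        g_new = g(w_new)
        d, y = w_new - w, g_new - g_k      # delta_k, y_k
        if d @ y > 1e-12:                  # guard: keep the update well-defined
            G = (G + np.outer(d, d) / (d @ y)
                   - (G @ np.outer(y, y) @ G) / (y @ G @ y))
        w, g_k = w_new, g_new              # step (5): k <- k + 1
    return w

# Reuse the toy f and g defined in the sketch after the gradient formula.
w_star = dfp(f, g, np.zeros(3))
print(w_star, np.linalg.norm(g(w_star)))
```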
