Machine Learning Fundamentals
1. Linear Regression
- Hypothesis: \(h_\theta(x) = \theta_0 + \theta_1x\)
- Parameters: \(\theta_0, \theta_1\)
- Cost function: \[J(\theta_0,\theta_1) = \frac{1}{2m} \sum\limits^m_{i=1} (h_\theta(x^{(i)}) - y^{(i)})^2\]
1.1 Gradient Descent Algorithm
\(
\begin{align*}
& \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0,\theta_1) \quad \text{(for } j=0 \text{ and } j=1) \newline
& \alpha : \text{learning rate}
\end{align*}
\)
\(
\begin{align}
\text{repeat until convergence: } \lbrace & \newline
\theta_0 := & \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)}) \newline \theta_1 := &
\theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)}) x^{(i)} \newline
\rbrace &
\end{align}
\)
When the slope is negative, the value of \(\theta_j\) increases and when it is positive, the value of \(\theta_j\) decreases. \(\theta_j\) eventually converges to its minimum value.
\( \begin{align} & \mbox{simultaneous update} \newline & temp0:=\theta_0 - \alpha \frac{\partial}{\partial\theta_0} J(\theta_0,\theta_1) \newline & temp1:=\theta_1 - \alpha \frac{\partial}{\partial\theta_1} J(\theta_0,\theta_1) \newline & \theta_0 :=temp0 \newline & \theta_1:=temp1 \newline \end{align} \)
Note that, while gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has only one global optimum and no other local optima; thus gradient descent always converges (assuming the learning rate \(\alpha\) is not too large) to the global minimum. Indeed, \(J\) is a convex quadratic function.
- Tips For Gradient Descent
  - Feature scaling (divide each feature by its range): \[x' = \frac{x}{\max(x) - \min(x)}\]
  - Mean normalization: \[x' = \frac{x - \mathrm{mean}(x)}{\max(x) - \min(x)}\] Both are sketched in Octave below.
  - Learning rate:
    (1) If \(\alpha\) is too small: slow convergence.
    (2) If \(\alpha\) is too large: \(J(\theta)\) may not decrease on every iteration and thus may not converge.
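A minimal Octave sketch of these normalizations; the function name featureNormalize and its return values are illustrative, not from the original post:

```octave
% Mean normalization, assuming X is an m-by-n matrix of raw features
% (no bias column). Octave broadcasting handles the row-wise arithmetic.
function [X_norm, mu, range] = featureNormalize(X)
  mu = mean(X);                 % 1 x n row vector of per-feature means
  range = max(X) - min(X);      % 1 x n row vector of per-feature ranges
  X_norm = (X - mu) ./ range;   % subtract the mean, divide by the range
end
```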
1.2 Multivariate Linear Regression
Hypothesis
\(
\begin{align}
& h_\theta(x) = \theta_0x_0 + \theta_1x_1 + \theta_2x_2 + ... + \theta_nx_n \newline \newline
& (x^{(i)}_0 = 1, \; i = 1, 2, ..., m) \newline \newline
& \theta =
\left[
\begin{array}{c}
\theta_0 \newline \theta_1 \newline
\theta_2 \newline \vdots \newline
\theta_n
\end{array}
\right] \quad
x =
\left[
\begin{array}{c}
x_0 \newline x_1 \newline
x_2 \newline \vdots \newline
x_n
\end{array}
\right] \newline \newline
& h_\theta(x) = \theta^Tx
\end{align}
\)
- gradient descent algorithm
\( \begin{align} & \text{Repeat } \lbrace \newline & \theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} \newline & \text{(simultaneously update } \theta_j \text{ for } j = 0, ..., n) \newline & \rbrace \newline & \theta_0 := \theta_0 - \alpha\frac{1}{m}\sum\limits^m_{i=1} (h_\theta(x^{(i)}) - y^{(i)})x^{(i)}_0 \newline & \theta_1 := \theta_1 - \alpha\frac{1}{m}\sum\limits^m_{i=1} (h_\theta(x^{(i)}) - y^{(i)})x^{(i)}_1 \newline & \theta_2 := \theta_2 - \alpha\frac{1}{m}\sum\limits^m_{i=1} (h_\theta(x^{(i)}) - y^{(i)})x^{(i)}_2 \newline & \; \vdots \end{align} \)
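A vectorized Octave sketch of this loop; X (an \(m\times(n+1)\) matrix with a leading column of ones), y, theta, alpha, and num_iters are assumed to be defined already:

```octave
m = length(y);                 % number of training examples
for iter = 1:num_iters
  % X*theta - y is the vector of errors h_theta(x^(i)) - y^(i); multiplying
  % by X' sums the errors weighted by x_j^(i) for every j at once, so all
  % theta_j are updated simultaneously.
  theta = theta - (alpha/m) * X' * (X*theta - y);
end
```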
1.3 Normal Equation 正规方程
normal equation: Method to solve for \(\theta\) analytically
\( \begin{align} & X = \left[ \begin{array}{cccc} x^{(1)}_0 & x^{(1)}_1 & ... & x^{(1)}_n \newline x^{(2)}_0 & x^{(2)}_1 & ... & x^{(2)}_n \newline & & \vdots & \newline x^{(m)}_0 & x^{(m)}_1 & ... & x^{(m)}_n \end{array} \right] \; (x_0 = 1) \quad y = \left[ \begin{array}{c} y^{(1)} \newline y^{(2)} \newline \vdots \newline y^{(m)} \end{array} \right] \newline \newline & \theta = (X^TX)^{-1}X^Ty \end{align} \)
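In Octave the analytic solution is a one-liner; using pinv instead of inv is a common safeguard in case \(X^TX\) is non-invertible (e.g. redundant features or \(m \le n\)):

```octave
theta = pinv(X' * X) * X' * y;   % normal equation: no alpha, no iterations
```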
1.4 GD vs NE
| Gradient Descent | Normal Equation |
|---|---|
| Needs to choose \(\alpha\) | No need to choose \(\alpha\) |
| Needs many iterations | No need to iterate |
| Works well when n is large | Slow if n is very large |
| \(O(kn^2)\) | \(O(n^3)\) |
- \(n \le 10{,}000\): Normal Equation
- \(n > 10{,}000\): Gradient Descent
2. Logistic Regression
2.1 Sigmoid Function
\(
\begin{align}
& h_\theta(x) = g(\theta^Tx), \quad g(z) = \frac{1}{1+e^{-z}} \newline
& \text{Suppose: predict } y = 1 \; \text{if} \; h_\theta(x) \ge 0.5 \Leftrightarrow \theta^Tx \ge 0 \newline
& \phantom{\text{Suppose: }} \text{predict } y = 0 \; \text{if} \; h_\theta(x) \lt 0.5 \Leftrightarrow \theta^Tx \lt 0
\end{align}
\)
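A minimal Octave sketch of \(g(z)\); the element-wise operators let it accept a scalar, a vector, or a matrix:

```octave
function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));   % g(z) = 1/(1 + e^(-z)), applied element-wise
end
```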
- Cost function
We cannot use the same cost function that we use for linear regression because the Logistic Function will cause the output to be wavy, causing many local optima. In other words, it will not be a convex function.
\( \begin{align} & \text{cost function for logistic regression looks like:} \newline & J(\theta) = \frac{1}{m} \sum\limits^m_{i=1}Cost(h_\theta(x^{(i)}), y^{(i)}) \newline & h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}} \newline \newline & Cost(h_\theta(x), y) = -\log(h_\theta(x)) & if \; y = 1 \newline & Cost(h_\theta(x), y) = -\log(1-h_\theta(x)) & if \; y = 0 \end{align} \)
\( \begin{align} & Cost(h_\theta(x), y) = 0 \; if \; h_\theta(x) = y \newline & Cost(h_\theta(x), y) \rightarrow \infty \; if \; y = 0 \; and \; h_\theta(x) \rightarrow 1 \newline & Cost(h_\theta(x), y) \rightarrow \infty \; if \; y = 1 \; and \; h_\theta(x) \rightarrow 0 \newline \newline & \text{So, the cost function can be simplified:} \newline \newline & J(\theta) = -\frac{1}{m} \sum\limits^m_{i=1} [y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))] \newline \newline & \text{A vectorized implementation is:} \newline \newline & h = g(X\theta) \newline & J(\theta) = \frac{1}{m} \cdot (-y^T\log(h)-(1-y)^T\log(1-h)) \end{align} \)
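The vectorized form translates almost symbol-for-symbol into Octave (using the sigmoid helper sketched above):

```octave
h = sigmoid(X * theta);                               % m x 1 vector of predictions
J = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h));   % scalar logistic cost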
2.2 Gradient Descent
\( \begin{align} & J(\theta) = -\frac{1}{m}\sum\limits^m_{i=1}[y^{(i)}\log(h_\theta(x^{(i)}))+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))] \newline \newline & \theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta) \newline & \frac{\partial}{\partial\theta_j}J(\theta) = \frac{1}{m}\sum\limits^m_{i=1}(h_\theta(x^{(i)})-y^{(i)})x^{(i)}_j \newline & \text{Repeat } \lbrace \newline & \theta_j := \theta_j - \frac{\alpha}{m}\sum\limits^m_{i=1}(h_\theta(x^{(i)})-y^{(i)})x^{(i)}_j \newline & \rbrace \newline \newline & \text{A vectorized implementation is:} \newline & \theta := \theta - \frac{\alpha}{m}X^T(g(X\theta)-\vec{y}) \end{align} \)
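One vectorized update step in Octave, matching the last line above:

```octave
theta = theta - (alpha/m) * X' * (sigmoid(X*theta) - y);
```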
Derivation:
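A sketch of the key steps, using the identity \(g'(z) = g(z)(1-g(z))\):

\(
\begin{align}
& \frac{\partial}{\partial\theta_j}\log(h_\theta(x)) = \frac{h_\theta(x)(1-h_\theta(x))}{h_\theta(x)}x_j = (1-h_\theta(x))x_j \newline
& \frac{\partial}{\partial\theta_j}\log(1-h_\theta(x)) = \frac{-h_\theta(x)(1-h_\theta(x))}{1-h_\theta(x)}x_j = -h_\theta(x)x_j \newline
& \frac{\partial}{\partial\theta_j}J(\theta) = -\frac{1}{m}\sum\limits^m_{i=1}\left[y^{(i)}(1-h_\theta(x^{(i)})) - (1-y^{(i)})h_\theta(x^{(i)})\right]x^{(i)}_j = \frac{1}{m}\sum\limits^m_{i=1}(h_\theta(x^{(i)})-y^{(i)})x^{(i)}_j
\end{align}
\)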
2.3 Multiclass Classification
multiclass classification: one-vs-all
\( \begin{align} & y \in \{0, 1, ..., n\} \newline & h^{(0)}_\theta(x) = P(y=0 \,|\, x; \theta) \newline & h^{(1)}_\theta(x) = P(y=1 \,|\, x; \theta) \newline & ... \newline & h^{(n)}_\theta(x) = P(y=n \,|\, x; \theta) \newline & \text{prediction} = \max\limits_i \left( h^{(i)}_\theta(x) \right) \end{align} \)
We are basically choosing one class and then lumping all the others into a single second class. We do this repeatedly, applying binary logistic regression to each case, and then use the hypothesis that returned the highest value as our prediction.
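A minimal Octave sketch of the prediction step; all_theta is assumed to be a \(K\times(n+1)\) matrix whose \(k\)-th row holds the parameters learned for class \(k\):

```octave
probs = sigmoid(X * all_theta');         % m x K: h_theta(x) for every class
[~, predictions] = max(probs, [], 2);    % per row, index of the highest score
```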
3. Regularization
Regularization cost function:
\( J(\theta) = \frac{1}{2m}[\sum\limits^m_{i=1}(h_\theta(x^{(i)})-y^{(i)})^2+\lambda\sum\limits^n_{j=1}\theta^2_j] \)
\(\lambda\): regularization parameter (it determines how much the costs of our theta parameters are inflated).
3.1 Overfitting Problem
- Address the issue of overfitting:
  - Reduce the number of features:
    - Manually select which features to keep.
    - Use a model selection algorithm.
  - Regularization:
    - Keep all the features, but reduce the magnitude of the parameters \(\theta_j\).
    - Regularization works well when we have a lot of slightly useful features.
3.2 Regularized Linear Regression
Cost function: \( J(\theta) = \frac{1}{2m}[\sum\limits^m_{i=1}(h_\theta(x^{(i)})-y^{(i)})^2 + \lambda\sum\limits^n_{j=1}\theta^2_j ] \)
- Gradient descent
\( \begin{align} & \text{Repeat} \; \lbrace \newline & \theta_0 := \theta_0 - \alpha\frac{1}{m} \sum\limits^m_{i=1}(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}_0 \newline & \theta_j := \theta_j - \alpha[(\frac{1}{m}\sum\limits^m_{i=1}(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}_j) + \frac{\lambda}{m}\theta_j] \newline & \rbrace \newline & \theta_j \; \text{can also be represented as:} \newline & \theta_j := \theta_j(1-\alpha\frac{\lambda}{m}) - \alpha\frac{1}{m}\sum\limits^m_{i=1}(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}_j \newline & \text{We should not regularize } \theta_0 \text{, which is used for the bias term.} \end{align} \)
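A minimal Octave sketch of one such step; zeroing the first entry of the penalty term keeps \(\theta_0\) unregularized:

```octave
grad = (1/m) * X' * (X*theta - y);   % unregularized gradient
reg = (lambda/m) * theta;
reg(1) = 0;                          % do not regularize the bias term theta_0
theta = theta - alpha * (grad + reg);
```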
- Normal equation
\( \begin{align} & \theta = (X^TX + \lambda\cdot L)^{-1}X^Ty \newline & \text{where} \; L = \left[ \begin{array}{ccccc} 0 & & & & \newline & 1 & & & \newline & & 1 & & \newline & & & \ddots & \newline & & & & 1 \end{array} \right] \newline & L : (n+1)\times(n+1) \end{align} \)
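In Octave, L can be built from the identity matrix with its top-left entry zeroed; a sketch assuming n is already defined:

```octave
L = eye(n + 1);
L(1, 1) = 0;                               % exempt theta_0 from the penalty
theta = pinv(X'*X + lambda*L) * X' * y;    % regularized normal equation
```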
3.3 Regularized Logistic Regression
\( \begin{align} & \text{cost function:} \newline & J(\theta) = -\frac{1}{m}\sum\limits^m_{i=1}[y^{(i)}\log(h_\theta(x^{(i)}))+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))] + \frac{\lambda}{2m}\sum\limits^n_{j=1}\theta^2_j \end{align} \)
- Gradient descent
\( \begin{align} & \text{Repeat} \; \lbrace \newline & \theta_0 := \theta_0 - \alpha\frac{1}{m}\sum\limits^m_{i=1}(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}_0 \newline & \theta_j := \theta_j - \alpha[(\frac{1}{m}\sum\limits^m_{i=1}(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}_j) + \frac{\lambda}{m}\theta_j] \newline & (j = 1, 2, 3, ..., n) \newline & \rbrace \end{align} \)
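A vectorized Octave sketch of this gradient; prepending a zero to theta(2:end) applies the \(\frac{\lambda}{m}\theta_j\) penalty to every parameter except \(\theta_0\):

```octave
h = sigmoid(X * theta);
grad = (1/m) * X' * (h - y) + (lambda/m) * [0; theta(2:end)];
theta = theta - alpha * grad;
```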
4. Neural Networks
layer 1 (first layer): input layer
layer 2: hidden layer
layer 3 (final layer): output layer
\(a^{(j)}_i\) = "activation" of unit \(i\) in layer \(j\)
\(\Theta^{(j)}\) = matrix of weights controlling function mapping from layer \(j\) to layer \(j+1\)
\( \begin{align} & a^{(2)}_1 = g(\Theta^{(1)}_{10}x_0 + \Theta^{(1)}_{11}x_1 + \Theta^{(1)}_{12}x_2 + \Theta^{(1)}_{13}x_3) \newline & a^{(2)}_2 = g(\Theta^{(1)}_{20}x_0 + \Theta^{(1)}_{21}x_1 + \Theta^{(1)}_{22}x_2 + \Theta^{(1)}_{23}x_3) \newline & a^{(2)}_3 = g(\Theta^{(1)}_{30}x_0 + \Theta^{(1)}_{31}x_1 + \Theta^{(1)}_{32}x_2 + \Theta^{(1)}_{33}x_3) \newline \newline & h_\theta(x) = g(\Theta^{(2)}_{10}a^{(2)}_0 + \Theta^{(2)}_{11}a^{(2)}_1 + \Theta^{(2)}_{12}a^{(2)}_2 + \Theta^{(2)}_{13}a^{(2)}_3) \end{align} \)
If a network has \(s_j\) units in layer \(j\) and \(s_{j+1}\) units in layer \(j+1\), then \(\Theta^{(j)}\) will be of dimension \(s_{j+1}\times(s_j+1)\).
\( \begin{align} & a^{(1)} = x \newline & z^{(2)}_1 = \Theta^{(1)}_1 a^{(1)} \newline & z^{(2)}_2 = \Theta^{(1)}_2 a^{(1)} \newline & z^{(2)}_3 = \Theta^{(1)}_3 a^{(1)} \newline \newline & a^{(2)}_1 = g(z^{(2)}_1) \newline & a^{(2)}_2 = g(z^{(2)}_2) \newline & a^{(2)}_3 = g(z^{(2)}_3) \end{align} \)
In other words, for layer j=2 and node k, the variable z will be:
\( z^{(2)}_k = \Theta^{(1)}_{k,0}x_0 + \Theta^{(1)}_{k,1}x_1 + ... + \Theta^{(1)}_{k,n}x_n \)
The vector representations of \(x\) and \(z^{(j)}\) are:
\( \begin{align} x = \left[ \begin{array}{c} x_0 \newline x_1 \newline \vdots \newline x_n \end{array} \right] \quad z^{(j)} = \left[ \begin{array}{c} z^{(j)}_1 \newline z^{(j)}_2 \newline \vdots \newline z^{(j)}_n \end{array} \right] \end{align} \)
Setting \(x=a^{(1)}\), we can rewrite the equation as:
\( z^{(j)} = \Theta^{(j-1)}a^{(j-1)} \)
4.1 Cost Function (Classification)
\(\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), ... , (x^{(m)}, y^{(m)})\}\)
\(L\) = total no. of layers in network
\(s_l\) = no. of units (not counting bias unit) in layer \(l\)
\(K\) = the total number of possible labels
\( \begin{gather*} J(\Theta) = - \frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \left[y^{(i)}_k \log ((h_\Theta (x^{(i)}))_k) + (1 - y^{(i)}_k)\log (1 - (h_\Theta(x^{(i)}))_k)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} ( \Theta_{j,i}^{(l)})^2 \end{gather*} \)
\( y = \left[ \begin{array}{c} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{array} \right], \; \left[ \begin{array}{c} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{array} \right], \; ... \; , \left[ \begin{array}{c} 0 \\ 0 \\ 0 \\ \vdots \\ 1 \end{array} \right] \)
The number of columns in our current theta matrix is equal to the number of nodes in our current layer (including the bias unit). The number of rows in our current theta matrix is equal to the number of nodes in the next layer (excluding the bias unit).
4.2 Forward Propagation Algorithm
Given one training example \((x, y)\)
Forward Propagation:
\( \begin{align} & a^{(1)} = x \newline & z^{(2)} = \Theta^{(1)}a^{(1)} \newline & a^{(2)} = g(z^{(2)}) & (add \; a^{(2)}_0) \newline & z^{(3)} = \Theta^{(2)}a^{(2)} \newline & a^{(3)} = g(z^{(3)}) & (add \; a^{(3)}_0) \newline & z^{(4)} = \Theta^{(3)}a^{(3)} \newline & a^{(4)} = h_\Theta(x) = g(z^{(4)}) \end{align} \)
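A minimal Octave sketch of these steps; Theta1, Theta2, Theta3 stand for \(\Theta^{(1)}, \Theta^{(2)}, \Theta^{(3)}\), and x is a single training input:

```octave
a1 = [1; x];                      % a^(1) = x with the bias unit a_0 = 1
a2 = [1; sigmoid(Theta1 * a1)];   % z^(2) = Theta1*a1, activate, add bias
a3 = [1; sigmoid(Theta2 * a2)];   % z^(3) = Theta2*a2, activate, add bias
h  = sigmoid(Theta3 * a3);        % a^(4) = h_Theta(x)
```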
4.3 Backpropagation Algorithm
\( \begin{gather*} J(\Theta) = - \frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \left[y^{(i)}_k \log ((h_\Theta (x^{(i)}))_k) + (1 - y^{(i)}_k)\log (1 - (h_\Theta(x^{(i)}))_k)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} ( \Theta_{j,i}^{(l)})^2 \end{gather*} \)
Intuition: \(\delta^{(l)}_j\) = "error" of node \(j\) in layer \(l\).
For each output unit (layer L = 4)
\(\delta^{(4)}_j = a^{(4)}_j - y_j\)
Vector:
\(
\begin{align}
& \delta^{(4)} = a^{(4)} - y \newline
& \delta^{(3)} = (\Theta^{(3)})^T\delta^{(4)} .* g'(z^{(3)}) & \leftarrow g'(z^{(3)}) = a^{(3)} .* (1-a^{(3)}) \newline
& \delta^{(2)} = (\Theta^{(2)})^T\delta^{(3)} .* g'(z^{(2)}) & \leftarrow g'(z^{(2)}) = a^{(2)} .* (1-a^{(2)})
\end{align}
\)
\( \begin{align} & \text{Backpropagation Algorithm} \newline & \text{Training set} \; \{(x^{(1)}, y^{(1)}), ...\;, (x^{(m)}, y^{(m)})\} \newline & \text{Set} \; \Delta^{(l)}_{ij} = 0 \; \text{(for all } l, i, j) \newline & \text{For} \; i = 1 \; \text{to} \; m \newline & \quad \text{Set} \; a^{(1)} = x^{(i)} \newline & \quad \text{Perform forward propagation to compute} \; a^{(l)} \; \text{for} \; l = 2, 3, ..., L \newline & \quad \text{Using} \; y^{(i)}, \; \text{compute} \; \delta^{(L)} = a^{(L)} - y^{(i)} \newline & \quad \text{Compute} \; \delta^{(L-1)}, \delta^{(L-2)}, ..., \delta^{(2)} \newline & \quad \Delta^{(l)}_{ij} := \Delta^{(l)}_{ij} + a^{(l)}_j\delta^{(l+1)}_i \quad (\text{vectorized:} \; \Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T) \newline \newline & \frac{\partial}{\partial\Theta^{(l)}_{ij}} J(\Theta) = D^{(l)}_{ij} = \begin{cases} \frac{1}{m}\Delta^{(l)}_{ij} + \frac{\lambda}{m}\Theta^{(l)}_{ij} & \quad \text{for} \; j \ge 1 \newline \frac{1}{m}\Delta^{(l)}_{ij} & \quad \text{for} \; j = 0 \end{cases} \end{align} \)
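A minimal Octave sketch of the accumulation for one example in a 3-layer network; a1, a2, z2, a3 come from forward propagation, yk is the one-hot label, and sigmoidGradient (computing \(g'(z) = g(z) .* (1-g(z))\)) is an assumed helper:

```octave
d3 = a3 - yk;                                        % delta^(3) at the output layer
d2 = (Theta2' * d3)(2:end) .* sigmoidGradient(z2);   % back-propagate, drop bias row
Delta1 = Delta1 + d2 * a1';                          % accumulate Delta^(1)
Delta2 = Delta2 + d3 * a2';                          % accumulate Delta^(2)
```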
4.4 Gradient Checking
Gradient checking will assure that our backpropagation works as intended.
- Two-sided difference
\( \begin{align} & \frac{\partial}{\partial\Theta}J(\Theta) \approx \frac{J(\Theta+\epsilon) - J(\Theta-\epsilon)}{2\epsilon} \newline & \text{Use a small value for } \epsilon \text{, such as } \epsilon = 10^{-4}. \newline \newline & \text{Parameter vector } \theta \in \mathbb{R}^n: \newline & \frac{\partial}{\partial\theta_1}J(\theta) \approx \frac{J(\theta_1 + \epsilon, \theta_2, \theta_3, ..., \theta_n) - J(\theta_1 - \epsilon, \theta_2, \theta_3, ..., \theta_n)}{2\epsilon} \newline & \frac{\partial}{\partial\theta_2}J(\theta) \approx \frac{J(\theta_1, \theta_2 + \epsilon, \theta_3, ..., \theta_n) - J(\theta_1, \theta_2 - \epsilon, \theta_3, ..., \theta_n)}{2\epsilon} \newline & \quad \vdots \newline & \frac{\partial}{\partial\theta_n}J(\theta) \approx \frac{J(\theta_1, \theta_2, \theta_3, ..., \theta_n + \epsilon) - J(\theta_1, \theta_2, \theta_3, ..., \theta_n - \epsilon)}{2\epsilon} \end{align} \)
Hence, we are only adding or subtracting \(\epsilon\) to one component of \(\theta\) at a time. In Octave we can do it as follows:
```octave
epsilon = 1e-4;
for i = 1:n,
  thetaPlus = theta;
  thetaPlus(i) += epsilon;
  thetaMinus = theta;
  thetaMinus(i) -= epsilon;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2*epsilon);
end;
```
- Implementation Note
  - Implement backprop to compute DVec (the unrolled \(D^{(1)}, D^{(2)}, D^{(3)}\)).
  - Implement a numerical gradient check to compute gradApprox.
  - Make sure they give similar values.
  - Turn off gradient checking. Use the backprop code for learning.
- Important
  - Be sure to disable your gradient checking code before training your classifier. If you run the numerical gradient computation on every iteration of gradient descent (or in the inner loop of costFunction(...)), your code will be very slow.
4.5 Random Initialization
Initializing all \(\theta\) weights to zero does not work with neural networks. When it backpropagates, all nodes will update to the same value repeatedly. Instead we can randomly initialize weights.
- Symmetry breaking
Initialize each \(\Theta^{(l)}_{ij}\) to a random value in \([-\epsilon, \epsilon] \quad (i.e. \; -\epsilon \le \Theta^{(l)}_{ij} \le \epsilon)\)
(Note: the \(\epsilon\) used above is unrelated to the \(\epsilon\) from Gradient Checking)
Octave:
```octave
Theta1 = rand(10, 11) * (2*INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(1, 11) * (2*INIT_EPSILON) - INIT_EPSILON;
```
One effective strategy for choosing \(\epsilon_{init}\) is to base it on the number of units in the network. A good choice is \(\epsilon_{init} = \frac{\sqrt{6}}{\sqrt{L_{in}+L_{out}}}\), where \(L_{in} = s_l\) and \(L_{out} = s_{l+1}\) are the numbers of units in the layers adjacent to \(\Theta^{(l)}\).
4.6 Training
(1) Randomly initialize weights.
(2) Implement forward propagation to get \(h_\theta(x^{(i)})\) for any \(x^{(i)}\).
(3) Implement code to compute cost function \(J(\Theta)\).
(4) Implement backprop to compute partial derivatives \(\frac{\partial}{\partial\Theta^{(l)}_{jk}}J(\Theta)\).
(5) Use gradient checking to compare \(\frac{\partial}{\partial\Theta^{(l)}_{jk}}J(\Theta)\) computed using backpropagation with the numerical estimate of the gradient of \(J(\Theta)\). Then disable the gradient checking code.
(6) Use gradient descent or advanced optimization method with backpropagation to try to minimize \(J(\Theta)\) as a function of parameters \(\Theta\).
This article is from 博客园 (cnblogs), author: WindChen. When reposting, please cite the original link: https://www.cnblogs.com/WindChenCC/p/14971335.html

浙公网安备 33010602011771号