Machine Learning - Andrew Ng, Study Notes (4)
Linear Regression with Multiple Variables
Multiple Features
Notation
- \(n\) = number of features.
- \(x^{(i)}\) = input (features) of the \(i^{th}\) training example.
- \(x_j^{(i)}\) = value of feature \(j\) in the \(i^{th}\) training example.
Hypothesis
- Previously: \(h_\theta(x) = \theta_0 + \theta_1x\)
- Four features: \(h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 + \theta_4x_4\)
- Multiple features: \(h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \ldots + \theta_nx_n\)
For convenience of notation, define \(x_0 = 1\) (i.e. \(x_0^{(i)} = 1\) for every training example).
Then \(h_\theta(x) = \theta_0x_0 + \theta_1x_1 + \theta_2x_2 + \ldots + \theta_nx_n = \theta^Tx\).
This is called multivariate linear regression.
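As a toy illustration (not from the course materials), the vectorized hypothesis is just an inner product once \(x_0 = 1\) is prepended to the feature vector:

```python
# Toy sketch: h_theta(x) = theta^T x, where x includes the convention x_0 = 1.
def hypothesis(theta, x):
    # inner product theta_0*x_0 + theta_1*x_1 + ... + theta_n*x_n
    return sum(t * xi for t, xi in zip(theta, x))

theta = [1.0, 2.0, 3.0]      # theta_0, theta_1, theta_2
x = [1.0, 4.0, 5.0]          # x_0 = 1, then two feature values
print(hypothesis(theta, x))  # 1*1 + 2*4 + 3*5 = 24.0
```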
Gradient Descent for Multiple Variables
Hypothesis
\(h_\theta(x) = \theta^Tx = \theta_0x_0 + \theta_1x_1 + \theta_2x_2 + \ldots + \theta_nx_n\)
Parameters
\(\theta_0, \theta_1, \ldots, \theta_n\) --> an \((n + 1)\)-dimensional vector \(\theta\)
Cost function
\(J(\theta) = J(\theta_0, \theta_1, \ldots, \theta_n) = \frac{1}{2m}\sum^m_{i = 1}(h_\theta(x^{(i)}) - y^{(i)})^2\) --> a function of the \((n + 1)\)-dimensional vector \(\theta\)
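A minimal sketch of the cost function above, on hypothetical toy data (each \(x^{(i)}\) is assumed to already include \(x_0 = 1\)):

```python
# Sketch of J(theta) = (1/2m) * sum of squared errors (toy data).
def compute_cost(theta, X, y):
    m = len(y)
    total = 0.0
    for xi, yi in zip(X, y):
        h = sum(t * x for t, x in zip(theta, xi))  # h_theta(x^(i))
        total += (h - yi) ** 2
    return total / (2 * m)

X = [[1.0, 1.0], [1.0, 2.0]]           # x_0 = 1 plus one feature
y = [2.0, 4.0]
print(compute_cost([0.0, 2.0], X, y))  # perfect fit -> 0.0
print(compute_cost([0.0, 0.0], X, y))  # (4 + 16) / (2*2) = 5.0
```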
Gradient descent
- Repeat {
  \(\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\ldots,\theta_n)\)
  } (simultaneously update for every \(j = 0, \ldots, n\))
- Previously (\(n = 1\)):
  Repeat {
  \(\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i = 1}^m(h_\theta(x^{(i)}) - y^{(i)})\)
  \(\theta_1 := \theta_1 - \alpha\frac{1}{m}\sum_{i = 1}^m(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}\)
  }
- New algorithm (\(n \geq 1\)):
  Repeat {
  \(\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i = 1}^m(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}\)
  } (simultaneously update \(\theta_j\) for every \(j = 0, \ldots, n\))
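The update rule above can be sketched in a few lines of Python; this is an illustrative toy run (data and names are hypothetical), fitting \(y = 2x_1\) exactly:

```python
# One simultaneous update of the rule above, then a short run on toy data.
def gradient_descent_step(theta, X, y, alpha):
    m, n = len(y), len(theta)
    # errors h_theta(x^(i)) - y^(i)
    err = [sum(theta[j] * X[i][j] for j in range(n)) - y[i] for i in range(m)]
    # update every theta_j simultaneously, using the old theta throughout
    return [theta[j] - alpha * sum(err[i] * X[i][j] for i in range(m)) / m
            for j in range(n)]

X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]  # x_0 = 1 plus one feature
y = [2.0, 4.0, 6.0]                       # exactly y = 2 * x_1
theta = [0.0, 0.0]
for _ in range(2000):
    theta = gradient_descent_step(theta, X, y, alpha=0.1)
print(theta)  # approaches [0.0, 2.0]
```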
Gradient Descent in Practice I: Feature Scaling
Feature Scaling
- Idea: make sure features are on a similar scale.
- E.g. \(x_1\) = size (0-2000 \(\text{feet}^2\)), \(x_2\) = number of bedrooms (1-5)
  ---> \(x_1 = \frac{\text{size}(\text{feet}^2)}{2000}\), \(x_2 = \frac{\text{number of bedrooms}}{5}\)
- Get every feature into approximately a \(-1 \leq x_i \leq 1\) range; ranges that are far too large or far too small are a problem:
  \(-100 \leq x_i \leq 100\) or \(-0.0001 \leq x_i \leq 0.0001\) (×)
Mean normalization
- Replace \(x_i\) with \(x_i - \mu_i\) to make features have approximately zero mean (do not apply to \(x_0 = 1\)).
- E.g. \(x_1 = \frac{\text{size} - 1000}{2000}\), \(x_2 = \frac{\text{bedrooms} - 2}{5}\) --> \(-0.5 \leq x_1 \leq 0.5\), \(-0.5 \leq x_2 \leq 0.5\).
- More generally: \(x_1 := \frac{x_1 - \mu_1}{s_1}\),
  where \(\mu_1\) is the average value of \(x_1\) in the training set, and \(s_1\) is the range of the feature (maximum minus minimum) or, alternatively, its standard deviation.
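Mean normalization with the max-min range can be sketched as follows (the house sizes here are illustrative, not course data):

```python
# Replace each x_i by (x_i - mu) / s, where s is the max-min range.
def mean_normalize(values):
    mu = sum(values) / len(values)
    s = max(values) - min(values)
    return [(v - mu) / s for v in values]

sizes = [2104.0, 1416.0, 1534.0, 852.0]
scaled = mean_normalize(sizes)
print(scaled)  # every value falls in roughly [-0.5, 0.5]; the mean is 0
```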
Gradient Descent in Practice II: Learning Rate
Making sure gradient descent is working correctly
- \(J(\theta)\) should decrease after every iteration. Plot \(J(\theta)\) against the number of iterations; when the curve flattens out, gradient descent has essentially converged.
  Note: the number of iterations gradient descent needs can vary widely from one problem to another.
- Example automatic convergence test:
  Declare convergence if \(J(\theta)\) decreases by less than some small threshold \(\varepsilon\) (e.g. \(10^{-3}\)) in one iteration.
  Note: choosing an appropriate \(\varepsilon\) is usually difficult, so the plot-based check above is generally preferred.
Choosing the learning rate \(\alpha\)
- Summary:
  If \(\alpha\) is too small: slow convergence.
  If \(\alpha\) is too large: \(J(\theta)\) may not decrease on every iteration and may not converge.
- To choose \(\alpha\), try a range of values roughly three times apart, e.g. \(\ldots, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, \ldots\)
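The effect of different learning rates can be seen in a self-contained sketch: on a toy problem (the data and the particular \(\alpha\) values here are illustrative), a small \(\alpha\) makes \(J(\theta)\) decrease every iteration, while a too-large \(\alpha\) makes it grow:

```python
# Compare several learning rates by tracking J(theta) after each iteration.
def cost(theta, X, y):
    m = len(y)
    return sum((sum(t * x for t, x in zip(theta, xi)) - yi) ** 2
               for xi, yi in zip(X, y)) / (2 * m)

def step(theta, X, y, alpha):
    m, n = len(y), len(theta)
    err = [sum(theta[j] * X[i][j] for j in range(n)) - y[i] for i in range(m)]
    return [theta[j] - alpha * sum(err[i] * X[i][j] for i in range(m)) / m
            for j in range(n)]

X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5]]
y = [1.0, 2.0, 3.0]
history = {}
for alpha in [0.01, 0.1, 0.3, 1.0]:
    theta = [0.0, 0.0]
    costs = [cost(theta, X, y)]
    for _ in range(50):
        theta = step(theta, X, y, alpha)
        costs.append(cost(theta, X, y))
    history[alpha] = costs
    print(alpha, round(costs[-1], 4))  # alpha = 1.0 diverges on this data
```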
Features and Polynomial Regression
Housing prices prediction
\(h_\theta(x) = \theta_0 + \theta_1 \times \text{frontage} + \theta_2 \times \text{depth}\)
--> Land area: \(x = \text{frontage} \times \text{depth}\) --> \(h_\theta(x) = \theta_0 + \theta_1x\)
Sometimes, looking at the problem from a different angle and defining a new feature, instead of using the raw features you started with, gives a better model.
Polynomial regression
E.g. when a straight line does not fit the data well, choose a quadratic or cubic model:
\(h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 = \theta_0 + \theta_1(\text{size}) + \theta_2(\text{size})^2 + \theta_3(\text{size})^3\)
where \(x_1 = (\text{size})\), \(x_2 = (\text{size})^2\), \(x_3 = (\text{size})^3\). With features on such different scales, feature scaling becomes especially important.
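Building the polynomial features can be sketched in one line (an illustrative helper, not from the course):

```python
# x_1 = size, x_2 = size^2, x_3 = size^3 from a single raw feature.
def poly_features(size, degree=3):
    return [size ** d for d in range(1, degree + 1)]

print(poly_features(10.0))  # [10.0, 100.0, 1000.0]
```

With house sizes around 1000 \(\text{feet}^2\), the cubed feature is around \(10^9\), which is why feature scaling matters so much here.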
Normal Equation
Overview
Method to solve for \(\theta\) analytically.
Unlike gradient descent, this method computes the optimal \(\theta\) directly in one step, with no iteration.
Intuition
- If \(\theta\) is a scalar (\(\theta \in \mathbb{R}\)):
  set \(\frac{d}{d\theta}J(\theta) = 0\) and solve for \(\theta\).
- If \(\theta \in \mathbb{R}^{n+1}\), with \(J(\theta_0,\theta_1,\ldots,\theta_n) = \frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2\):
  set \(\frac{\partial}{\partial\theta_j}J(\theta) = 0\) for every \(j\), and solve for \(\theta_0, \theta_1,\ldots,\theta_n\).
- In matrix form, \(\theta\) can be computed directly (proof omitted): from \(X\theta \approx y\), the least-squares solution satisfies \(X^TX\theta = X^Ty\), so \(\theta = (X^TX)^{-1}X^Ty\).
  In Octave: pinv(X'*X)*X'*y
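A pure-Python sketch of the normal equation on toy data: form \(A = X^TX\) and \(b = X^Ty\), then solve \(A\theta = b\) by Gauss-Jordan elimination rather than inverting \(A\) (in practice you would use a linear-algebra library, as the Octave line above does):

```python
# Solve (X^T X) theta = X^T y for the least-squares theta.
def normal_equation(X, y):
    n = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(n)]
    M = [A[i] + [b[i]] for i in range(n)]  # augmented matrix [A | b]
    for col in range(n):
        # partial pivoting: bring the largest entry in this column to the top
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]  # first column is x_0 = 1
y = [3.0, 5.0, 7.0]                       # exactly y = 1 + 2*x_1
print(normal_equation(X, y))              # [1.0, 2.0]
```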
Advantages and disadvantages
m training examples, n features.
- Gradient Descent
  - Need to choose \(\alpha\).
  - Needs many iterations.
  - Costs \(O(kn^2)\) for \(k\) iterations.
  - Works well even when \(n\) is large.
- Normal Equation
  - No need to choose \(\alpha\).
  - No need to iterate.
  - Need to compute \((X^TX)^{-1}\), which costs roughly \(O(n^3)\).
  - Slow if \(n\) is very large.
Normal Equation and Non-invertibility
What if \(X^TX\) is non-invertible (singular/degenerate)?
- Redundant features, i.e. two features that are very closely related (linearly dependent) --> delete one of the redundant features.
  E.g. \(x_1\) = size in \(\text{feet}^2\), \(x_2\) = size in \(m^2\)
- Too many features (e.g. \(m \leq n\)) --> delete some features, or use regularization.
Review
Quiz
- Suppose \(m = 4\) students have taken some class, and the class had a midterm exam and a final exam. You have collected a dataset of their scores on the two exams, which is as follows:

  | midterm exam | (midterm exam)\(^2\) | final exam |
  | --- | --- | --- |
  | 89 | 7921 | 96 |
  | 72 | 5184 | 74 |
  | 94 | 8836 | 87 |
  | 69 | 4761 | 78 |

  You'd like to use polynomial regression to predict a student's final exam score from their midterm exam score. Concretely, suppose you want to fit a model of the form \(h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2\), where \(x_1\) is the midterm score and \(x_2\) is \((\)midterm score\()^2\). Further, you plan to use both feature scaling (dividing by the "max-min", or range, of a feature) and mean normalization.
What is the normalized feature \(x_1^{(1)}\)? (Hint: midterm = 89, final = 96 is training example 1.) Please round off your answer to two decimal places and enter in the text box below.
0.32
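The answer can be checked directly with the mean-normalization formula from the section above, dividing by the max-min range of the midterm feature:

```python
# Normalize the first midterm score: (x - mean) / range.
midterms = [89, 72, 94, 69]
mu = sum(midterms) / len(midterms)   # 324 / 4 = 81.0
s = max(midterms) - min(midterms)    # 94 - 69 = 25
x1_1 = (midterms[0] - mu) / s        # (89 - 81) / 25
print(round(x1_1, 2))                # 0.32
```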
- You run gradient descent for 15 iterations with \(\alpha = 0.3\) and compute \(J(\theta)\) after each iteration. You find that the value of \(J(\theta)\) decreases quickly then levels off. Based on this, which of the following conclusions seems most plausible?
- Suppose you have \(m = 28\) training examples with \(n = 4\) features (excluding the additional all-ones feature for the intercept term, which you should add). The normal equation is \(\theta = (X^TX)^{-1}X^Ty\). For the given values of \(m\) and \(n\), what are the dimensions of \(\theta\), \(X\), and \(y\) in this equation?
- Suppose you have a dataset with \(m = 50\) examples and \(n = 15\) features for each example. You want to use multivariate linear regression to fit the parameters \(\theta\) to the data. Should you prefer gradient descent or the normal equation?
- Which of the following are reasons for using feature scaling?