Machine Learning - Andrew Ng, Study Notes (4)
Linear Regression with Multiple Variables
Multiple Features
Notation
- \(n\) = number of features.
- \(x^{(i)}\) = input (features) of the \(i^{th}\) training example.
- \(x_j^{(i)}\) = value of feature \(j\) in the \(i^{th}\) training example.
Hypothesis
- Previously: \(h_\theta(x) = \theta_0 + \theta_1x\)
- Four features: \(h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 + \theta_4x_4\)
- Multiple features: \(h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \ldots + \theta_nx_n\)
For convenience of notation, define \(x_0 = 1\) (i.e. \(x_0^{(i)} = 1\) for every training example).
Then \(h_\theta(x) = \theta_0x_0 + \theta_1x_1 + \theta_2x_2 + \ldots + \theta_nx_n = \theta^Tx\).
This is called multivariate linear regression.
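As a toy illustration (not from the course materials), the vectorized hypothesis is just an inner product once \(x_0 = 1\) is prepended to the feature vector:

```python
# Toy sketch: h_theta(x) = theta^T x, where x includes the convention x_0 = 1.
def hypothesis(theta, x):
    # inner product theta_0*x_0 + theta_1*x_1 + ... + theta_n*x_n
    return sum(t * xi for t, xi in zip(theta, x))

theta = [1.0, 2.0, 3.0]      # theta_0, theta_1, theta_2
x = [1.0, 4.0, 5.0]          # x_0 = 1, then two feature values
print(hypothesis(theta, x))  # 1*1 + 2*4 + 3*5 = 24.0
```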
Gradient Descent for Multiple Variables
Hypothesis
\(h_\theta(x) = \theta^Tx = \theta_0x_0 + \theta_1x_1 + \theta_2x_2 + \ldots + \theta_nx_n\)
Parameters
\(\theta_0, \theta_1, \ldots, \theta_n\) --> an \((n + 1)\)-dimensional vector \(\theta\)
Cost function
\(J(\theta) = J(\theta_0, \theta_1, \ldots, \theta_n) = \frac{1}{2m}\sum^m_{i = 1}(h_\theta(x^{(i)}) - y^{(i)})^2\) --> a function of the \((n + 1)\)-dimensional vector \(\theta\)
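A minimal sketch of the cost function above, on hypothetical toy data (each \(x^{(i)}\) is assumed to already include \(x_0 = 1\)):

```python
# Sketch of J(theta) = (1/2m) * sum of squared errors (toy data).
def compute_cost(theta, X, y):
    m = len(y)
    total = 0.0
    for xi, yi in zip(X, y):
        h = sum(t * x for t, x in zip(theta, xi))  # h_theta(x^(i))
        total += (h - yi) ** 2
    return total / (2 * m)

X = [[1.0, 1.0], [1.0, 2.0]]           # x_0 = 1 plus one feature
y = [2.0, 4.0]
print(compute_cost([0.0, 2.0], X, y))  # perfect fit -> 0.0
print(compute_cost([0.0, 0.0], X, y))  # (4 + 16) / (2*2) = 5.0
```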
Gradient descent
- Repeat {
  \(\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\ldots,\theta_n)\)
  } (simultaneously update for every \(j = 0, \ldots, n\))
- Previously (\(n = 1\)):
  Repeat {
  \(\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i = 1}^m(h_\theta(x^{(i)}) - y^{(i)})\)
  \(\theta_1 := \theta_1 - \alpha\frac{1}{m}\sum_{i = 1}^m(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}\)
  }
- New algorithm (\(n \geq 1\)):
  Repeat {
  \(\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i = 1}^m(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}\)
  } (simultaneously update \(\theta_j\) for every \(j = 0, \ldots, n\))
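The update rule above can be sketched in a few lines of Python; this is an illustrative toy run (data and names are hypothetical), fitting \(y = 2x_1\) exactly:

```python
# One simultaneous update of the rule above, then a short run on toy data.
def gradient_descent_step(theta, X, y, alpha):
    m, n = len(y), len(theta)
    # errors h_theta(x^(i)) - y^(i)
    err = [sum(theta[j] * X[i][j] for j in range(n)) - y[i] for i in range(m)]
    # update every theta_j simultaneously, using the old theta throughout
    return [theta[j] - alpha * sum(err[i] * X[i][j] for i in range(m)) / m
            for j in range(n)]

X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]  # x_0 = 1 plus one feature
y = [2.0, 4.0, 6.0]                       # exactly y = 2 * x_1
theta = [0.0, 0.0]
for _ in range(2000):
    theta = gradient_descent_step(theta, X, y, alpha=0.1)
print(theta)  # approaches [0.0, 2.0]
```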
Gradient Descent in Practice I: Feature Scaling
Feature Scaling
- Idea: make sure features are on a similar scale.
- E.g. \(x_1\) = size (0-2000 \(\text{feet}^2\)), \(x_2\) = number of bedrooms (1-5)
  ---> \(x_1 = \frac{\text{size}(\text{feet}^2)}{2000}\), \(x_2 = \frac{\text{number of bedrooms}}{5}\)
- Get every feature into approximately a \(-1 \leq x_i \leq 1\) range; ranges that are far too large or far too small are a problem:
  \(-100 \leq x_i \leq 100\) or \(-0.0001 \leq x_i \leq 0.0001\) (×)
Mean normalization
- Replace \(x_i\) with \(x_i - \mu_i\) to make features have approximately zero mean (do not apply to \(x_0 = 1\)).
- E.g. \(x_1 = \frac{\text{size} - 1000}{2000}\), \(x_2 = \frac{\text{bedrooms} - 2}{5}\) --> \(-0.5 \leq x_1 \leq 0.5\), \(-0.5 \leq x_2 \leq 0.5\).
- More generally: \(x_1 := \frac{x_1 - \mu_1}{s_1}\),
  where \(\mu_1\) is the average value of \(x_1\) in the training set, and \(s_1\) is the range of the feature (maximum minus minimum) or, alternatively, its standard deviation.
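Mean normalization with the max-min range can be sketched as follows (the house sizes here are illustrative, not course data):

```python
# Replace each x_i by (x_i - mu) / s, where s is the max-min range.
def mean_normalize(values):
    mu = sum(values) / len(values)
    s = max(values) - min(values)
    return [(v - mu) / s for v in values]

sizes = [2104.0, 1416.0, 1534.0, 852.0]
scaled = mean_normalize(sizes)
print(scaled)  # every value falls in roughly [-0.5, 0.5]; the mean is 0
```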
Gradient Descent in Practice II: Learning Rate
Making sure gradient descent is working correctly
- \(J(\theta)\) should decrease after every iteration. Plot \(J(\theta)\) against the number of iterations; when the curve flattens out, gradient descent has essentially converged.
  Note: the number of iterations gradient descent needs can vary widely from one problem to another.
- Example automatic convergence test:
  Declare convergence if \(J(\theta)\) decreases by less than some small threshold \(\varepsilon\) (e.g. \(10^{-3}\)) in one iteration.
  Note: choosing an appropriate \(\varepsilon\) is usually difficult, so the plot-based check above is generally preferred.
Choosing the learning rate \(\alpha\)
- Summary:
  If \(\alpha\) is too small: slow convergence.
  If \(\alpha\) is too large: \(J(\theta)\) may not decrease on every iteration and may not converge.
- To choose \(\alpha\), try a range of values roughly three times apart, e.g. \(\ldots, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, \ldots\)
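The effect of different learning rates can be seen in a self-contained sketch: on a toy problem (the data and the particular \(\alpha\) values here are illustrative), a small \(\alpha\) makes \(J(\theta)\) decrease every iteration, while a too-large \(\alpha\) makes it grow:

```python
# Compare several learning rates by tracking J(theta) after each iteration.
def cost(theta, X, y):
    m = len(y)
    return sum((sum(t * x for t, x in zip(theta, xi)) - yi) ** 2
               for xi, yi in zip(X, y)) / (2 * m)

def step(theta, X, y, alpha):
    m, n = len(y), len(theta)
    err = [sum(theta[j] * X[i][j] for j in range(n)) - y[i] for i in range(m)]
    return [theta[j] - alpha * sum(err[i] * X[i][j] for i in range(m)) / m
            for j in range(n)]

X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5]]
y = [1.0, 2.0, 3.0]
history = {}
for alpha in [0.01, 0.1, 0.3, 1.0]:
    theta = [0.0, 0.0]
    costs = [cost(theta, X, y)]
    for _ in range(50):
        theta = step(theta, X, y, alpha)
        costs.append(cost(theta, X, y))
    history[alpha] = costs
    print(alpha, round(costs[-1], 4))  # alpha = 1.0 diverges on this data
```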
Features and Polynomial Regression
Housing prices prediction
\(h_\theta(x) = \theta_0 + \theta_1 \times \text{frontage} + \theta_2 \times \text{depth}\)
--> Land area: \(x = \text{frontage} \times \text{depth}\) --> \(h_\theta(x) = \theta_0 + \theta_1x\)
Sometimes, looking at the problem from a different angle and defining a new feature, instead of using the raw features you started with, gives a better model.
Polynomial regression
E.g. when a straight line does not fit the data well, choose a quadratic or cubic model:
\(h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 = \theta_0 + \theta_1(\text{size}) + \theta_2(\text{size})^2 + \theta_3(\text{size})^3\)
where \(x_1 = (\text{size})\), \(x_2 = (\text{size})^2\), \(x_3 = (\text{size})^3\). With features on such different scales, feature scaling becomes especially important.
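Building the polynomial features can be sketched in one line (an illustrative helper, not from the course):

```python
# x_1 = size, x_2 = size^2, x_3 = size^3 from a single raw feature.
def poly_features(size, degree=3):
    return [size ** d for d in range(1, degree + 1)]

print(poly_features(10.0))  # [10.0, 100.0, 1000.0]
```

With house sizes around 1000 \(\text{feet}^2\), the cubed feature is around \(10^9\), which is why feature scaling matters so much here.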
Normal Equation
Overview
Method to solve for \(\theta\) analytically.
Unlike gradient descent, this method computes the optimal \(\theta\) directly in one step, with no iteration.
Intuition
- If \(\theta\) is a scalar (\(\theta \in \mathbb{R}\)):
  set \(\frac{d}{d\theta}J(\theta) = 0\) and solve for \(\theta\).
- If \(\theta \in \mathbb{R}^{n+1}\), with \(J(\theta_0,\theta_1,\ldots,\theta_n) = \frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2\):
  set \(\frac{\partial}{\partial\theta_j}J(\theta) = 0\) for every \(j\), and solve for \(\theta_0, \theta_1,\ldots,\theta_n\).
- In matrix form, \(\theta\) can be computed directly (proof omitted): from \(X\theta \approx y\), the least-squares solution satisfies \(X^TX\theta = X^Ty\), so \(\theta = (X^TX)^{-1}X^Ty\).
  In Octave: pinv(X'*X)*X'*y
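A pure-Python sketch of the normal equation on toy data: form \(A = X^TX\) and \(b = X^Ty\), then solve \(A\theta = b\) by Gauss-Jordan elimination rather than inverting \(A\) (in practice you would use a linear-algebra library, as the Octave line above does):

```python
# Solve (X^T X) theta = X^T y for the least-squares theta.
def normal_equation(X, y):
    n = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(n)]
    M = [A[i] + [b[i]] for i in range(n)]  # augmented matrix [A | b]
    for col in range(n):
        # partial pivoting: bring the largest entry in this column to the top
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]  # first column is x_0 = 1
y = [3.0, 5.0, 7.0]                       # exactly y = 1 + 2*x_1
print(normal_equation(X, y))              # [1.0, 2.0]
```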
Advantages and disadvantages
m training examples, n features.
- Gradient Descent
  - Need to choose \(\alpha\).
  - Needs many iterations.
  - Costs \(O(kn^2)\) for \(k\) iterations.
  - Works well even when \(n\) is large.
- Normal Equation
  - No need to choose \(\alpha\).
  - No need to iterate.
  - Need to compute \((X^TX)^{-1}\), which costs roughly \(O(n^3)\).
  - Slow if \(n\) is very large.
Normal Equation and Non-invertibility
What if \(X^TX\) is non-invertible (singular/degenerate)?
- Redundant features, i.e. two features that are very closely related (linearly dependent) --> delete one of the redundant features.
  E.g. \(x_1\) = size in \(\text{feet}^2\), \(x_2\) = size in \(m^2\)
- Too many features (e.g. \(m \leq n\)) --> delete some features, or use regularization.
Review
Quiz
- Suppose \(m = 4\) students have taken some class, and the class had a midterm exam and a final exam. You have collected a dataset of their scores on the two exams, which is as follows:

  | midterm exam | (midterm exam)\(^2\) | final exam |
  | --- | --- | --- |
  | 89 | 7921 | 96 |
  | 72 | 5184 | 74 |
  | 94 | 8836 | 87 |
  | 69 | 4761 | 78 |

  You'd like to use polynomial regression to predict a student's final exam score from their midterm exam score. Concretely, suppose you want to fit a model of the form \(h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2\), where \(x_1\) is the midterm score and \(x_2\) is \((\)midterm score\()^2\). Further, you plan to use both feature scaling (dividing by the "max-min", or range, of a feature) and mean normalization.
What is the normalized feature \(x_1^{(1)}\)? (Hint: midterm = 89, final = 96 is training example 1.) Please round off your answer to two decimal places and enter in the text box below.
0.32
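The answer can be checked directly with the mean-normalization formula from the section above, dividing by the max-min range of the midterm feature:

```python
# Normalize the first midterm score: (x - mean) / range.
midterms = [89, 72, 94, 69]
mu = sum(midterms) / len(midterms)   # 324 / 4 = 81.0
s = max(midterms) - min(midterms)    # 94 - 69 = 25
x1_1 = (midterms[0] - mu) / s        # (89 - 81) / 25
print(round(x1_1, 2))                # 0.32
```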
- You run gradient descent for 15 iterations with \(\alpha = 0.3\) and compute \(J(\theta)\) after each iteration. You find that the value of \(J(\theta)\) decreases quickly then levels off. Based on this, which of the following conclusions seems most plausible?
- Suppose you have \(m = 28\) training examples with \(n = 4\) features (excluding the additional all-ones feature for the intercept term, which you should add). The normal equation is \(\theta = (X^TX)^{-1}X^Ty\). For the given values of \(m\) and \(n\), what are the dimensions of \(\theta\), \(X\), and \(y\) in this equation?
- Suppose you have a dataset with \(m = 50\) examples and \(n = 15\) features for each example. You want to use multivariate linear regression to fit the parameters \(\theta\) to the data. Should you prefer gradient descent or the normal equation?
- Which of the following are reasons for using feature scaling?