Part I Linear Regression with One Variable
Just a study note for the machine learning course by Andrew Ng
Univariate linear regression
Model Representation
- \(m\): Number of training examples
- \(x^{(i)}\): "input" variable / features
- \(y^{(i)}\): "output" variable / "target" variable
- \((x,y)\): one training example
- \((x^{(i)},y^{(i)})\): the i-th training example
- Function h:
- \(h_\theta(x)=\theta_0+\theta_1x\)
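As a quick illustration (not part of the course materials), the hypothesis can be written as a small Python function; the parameter and input values below are made-up examples:

```python
def h(theta0, theta1, x):
    """Hypothesis for univariate linear regression: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# Made-up example values: with theta0 = 1.0 and theta1 = 2.0,
# the prediction for x = 3.0 is 1.0 + 2.0 * 3.0 = 7.0.
print(h(1.0, 2.0, 3.0))  # 7.0
```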
Cost Function
- \(h_\theta(x)=\theta_0+\theta_1x\)
- \(\theta_i\): parameters
- Cost function: measures the accuracy of the hypothesis function
- Idea: choose \(\theta_0,\theta_1\) so that \(h_\theta(x)\) is close to \(y\) for our training examples \((x,y)\)
- The squared error cost function (a code sketch follows this list):
  - \(J(\theta_0,\theta_1)=\frac{1}{2m}\sum^{m}_{i=1}(h_\theta(x^{(i)})-y^{(i)})^2\)
- Goal: minimize \(J(\theta_0,\theta_1)\)
- Contour plot: a plot of the contour lines of \(J(\theta_0,\theta_1)\) over the \((\theta_0,\theta_1)\) plane; points on the same line have equal cost
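A minimal sketch of the squared error cost computed directly from the definition above; the tiny training set here is made up for illustration:

```python
def compute_cost(theta0, theta1, xs, ys):
    """Squared error cost J(theta0, theta1) = (1/(2m)) * sum((h(x_i) - y_i)^2)."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Made-up training set generated from y = 2x, so (theta0, theta1) = (0, 2) gives zero cost.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
print(compute_cost(0.0, 2.0, xs, ys))  # 0.0
print(compute_cost(0.0, 1.0, xs, ys))  # about 2.33, a worse fit costs more
```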
Gradient descent
- To minimize \(J(\theta_0,\theta_1)\)
- Outline
- Start with some \(\theta_0,\theta_1\) (say, \(\theta_0=0,\theta_1=0\))
- Keep changing \(\theta_0,\theta_1\) to reduce \(J(\theta_0,\theta_1)\) until we hopefully end up at a minimum.
- The distance between each 'star' in the graph above represents a step determined by our parameter \(\alpha\). A smaller \(\alpha\) results in a smaller step, and a larger \(\alpha\) results in a larger step.
- The direction in which the step is taken is determined by the partial derivative of \(J(\theta_0,\theta_1)\).
- Gradient descent algorithm
- Repeat until convergence:
\[\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)\qquad(\text{for } j=0 \text{ and } j=1)\]
- \(:=\) denotes assignment, as opposed to the equality assertion \(=\)
- \(\alpha\): a positive constant, the "learning rate"
- Both parameters must be given a "simultaneous update" (see the sketch below)
- To simplify the problem, first consider the case with only one parameter \(\theta_1\)
- \(\theta_1:=\theta_1-\alpha\frac{d}{d\theta_1}J(\theta_1)\)
- \(\theta_1\) eventually converges to its minimum value
- We should adjust our parameter \(\alpha\) to ensure that the gradient descent algorithm converges in a reasonable time (a small numerical sketch follows this list)
  - If \(\alpha\) is too small, gradient descent can be slow
  - If \(\alpha\) is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge
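A small numerical sketch (with a made-up one-parameter cost \(J(\theta_1)=(\theta_1-3)^2\), whose minimum is at \(\theta_1=3\)) showing convergence with a reasonable \(\alpha\) and divergence when \(\alpha\) is too large:

```python
def dJ(theta1):
    """Derivative of the made-up cost J(theta1) = (theta1 - 3)**2."""
    return 2.0 * (theta1 - 3.0)

def run(alpha, steps=50):
    theta1 = 0.0
    for _ in range(steps):
        theta1 = theta1 - alpha * dJ(theta1)  # one-parameter update rule
    return theta1

print(run(alpha=0.1))  # converges toward 3.0
print(run(alpha=1.5))  # overshoots further on every step and diverges
```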
- When \(\theta_1\) is at a local optimum, \(\frac{d}{d\theta_1}J(\theta_1)=0\), so the update leaves \(\theta_1\) unchanged
- As we approach a local minimum, gradient descent automatically takes smaller steps, because the magnitude of the derivative decreases even with \(\alpha\) fixed
- Now consider the problem with both parameters \(\theta_0,\theta_1\)
- \(\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)=\frac{\partial}{\partial\theta_j}\frac{1}{2m}\sum^{m}_{i=1}(h_\theta(x^{(i)})-y^{(i)})^2=\frac{\partial}{\partial\theta_j}\frac{1}{2m}\sum^{m}_{i=1}(\theta_0+\theta_1x^{(i)}-y^{(i)})^2\)
- \(j=0:\ \frac{\partial}{\partial\theta_0}J(\theta_0,\theta_1)=\frac{1}{m}\sum^{m}_{i=1}(h_\theta(x^{(i)})-y^{(i)})\)
- \(j=1:\ \frac{\partial}{\partial\theta_1}J(\theta_0,\theta_1)=\frac{1}{m}\sum^{m}_{i=1}(h_\theta(x^{(i)})-y^{(i)})\,x^{(i)}\)
- So we obtain the gradient descent update rule for linear regression (repeat until convergence, updating both parameters simultaneously; a full code sketch follows at the end of this section):
\[\theta_0:=\theta_0-\alpha\frac{1}{m}\sum^{m}_{i=1}(h_\theta(x^{(i)})-y^{(i)})\]
\[\theta_1:=\theta_1-\alpha\frac{1}{m}\sum^{m}_{i=1}(h_\theta(x^{(i)})-y^{(i)})\,x^{(i)}\]
- The optimization problem posed here for linear regression has only one global optimum and no other local optima.
- Thus gradient descent always converges to the global minimum (assuming the learning rate \(\alpha\) is not too large); indeed, \(J\) is a convex quadratic function.
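Putting the pieces together, here is a minimal sketch of batch gradient descent for univariate linear regression using the two partial derivatives derived above. The training data, learning rate, and iteration count are made-up choices for illustration:

```python
def gradient_descent(xs, ys, alpha=0.01, iterations=5000):
    """Batch gradient descent for the hypothesis h_theta(x) = theta0 + theta1 * x."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Prediction errors h_theta(x_i) - y_i for the current parameters
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of J(theta0, theta1) from the derivation above
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update of both parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Made-up data generated from y = 1 + 2x; since J is convex with a single
# global minimum, gradient descent should recover theta0 close to 1 and theta1 close to 2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
print(gradient_descent(xs, ys))
```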