CS229: Introduction to deep learning
deep learning
- computational power
- data
- algorithms
logistic regression
e.g. 1: find a cat in an image
- input: suppose the image is 64*64 with 3 color channels, so the input vector x has shape (64*64*3, 1)
- function: \(\hat{y} = \sigma(wx+b) = \sigma(\theta^{T}x)\)
- steps:
- initialize parameters(weights and bias)
- find optimal w and b
- loss function: \(L = -[y\log\hat{y}+(1-y)\log(1-\hat{y})]\) (from maximum likelihood estimation)
- gradient descent
- use the model to predict (see the sketch after this example)
neuron = linear part + activation; here, the linear part is wx+b and the activation is the sigmoid function
model = architecture + parameters
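A minimal NumPy sketch of these steps, assuming a toy batch of flattened 64*64*3 images with random values; the function name `train_logistic_regression` and the hyperparameters are illustrative, not from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, iters=1000):
    """X: (n_features, m), y: (1, m) with 0/1 labels."""
    n, m = X.shape
    w = np.zeros((1, n))                      # initialize parameters (weights and bias)
    b = 0.0
    for _ in range(iters):
        y_hat = sigmoid(w @ X + b)            # forward: sigma(wx + b)
        dz = y_hat - y                        # dL/dz for sigmoid + cross-entropy loss
        dw = (dz @ X.T) / m
        db = np.sum(dz) / m
        w -= lr * dw                          # gradient descent step
        b -= lr * db
    return w, b

# usage on random data just to show the shapes
X = np.random.rand(64 * 64 * 3, 10)           # 10 flattened images as columns
y = np.random.randint(0, 2, size=(1, 10))
w, b = train_logistic_regression(X, y)
pred = sigmoid(w @ X + b) > 0.5               # use the model to predict
```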
e.g. 2: find cat/lion/iguana in an image
apply more neurons at the same time; they are in the same layer and don't communicate with each other. A neuron like \(a^{[1]}_{2}\) is the second neuron in the first layer.
the dataset must be labeled with more information.
how you label the data affects the inner structure of the network.
this is robust because the neurons in the same layer are independent of each other (see the sketch below).
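A hedged sketch of this multi-neuron setup, assuming three independent sigmoid output neurons (cat, lion, iguana) each scored with its own binary cross-entropy; all sizes and names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_features, n_classes, m = 64 * 64 * 3, 3, 10
W = np.random.randn(n_classes, n_features) * 0.01   # one row of weights per neuron
b = np.zeros((n_classes, 1))

X = np.random.rand(n_features, m)                    # batch of flattened images (one per column)
Y = np.random.randint(0, 2, (n_classes, m))          # multi-hot labels: several classes can be 1

A = sigmoid(W @ X + b)                               # row k = neuron k's independent prediction
loss = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))   # binary cross-entropy per neuron
```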
e.g. 3: add a constraint: exactly one animal per image
- modify the neurons using softmax:
- \(z^{[1]}_{2}\) indicates the linear part of the second neuron in the first layer
- the activation is \(\dfrac{e^{z^{[1]}_{2}}}{\Sigma^{n}_{i=1}e^{z^{[1]}_{i}}}\), where n is the number of neurons in this layer
- the outputs of this layer sum to 1, and we select the class with the largest probability (see the sketch below)
- softmax multi-class network
- cross entropy loss: \(L = -\Sigma^{n}_{k=1}y_{k}\log\hat{y}_{k}\)
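A small sketch of the softmax activation and cross-entropy loss described above; the numbers are made up just to show that each column of probabilities sums to 1.

```python
import numpy as np

def softmax(Z):
    """Column-wise softmax: Z has shape (n_classes, m)."""
    Z = Z - Z.max(axis=0, keepdims=True)       # subtract the max for numerical stability
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=0, keepdims=True)

Z = np.array([[2.0, 0.5],
              [1.0, 3.0],
              [0.1, 0.2]])                     # linear parts z[1]_k for 3 classes, 2 examples
Y_hat = softmax(Z)
print(Y_hat.sum(axis=0))                       # [1. 1.]  -> the outputs of the layer sum to 1
print(Y_hat.argmax(axis=0))                    # select the largest probability per example

Y = np.array([[1, 0],
              [0, 1],
              [0, 0]])                         # one-hot labels (unique animal per image)
loss = -np.sum(Y * np.log(Y_hat), axis=0).mean()   # cross-entropy loss
```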
Neural network
- more neurons and more layers (architecture)
- for classification, the output layer must have as many neurons as there are classes
- input layer and hidden layers
- hidden layers can capture complicated structure in the raw data
- fully connected (different from how a human would design the connections)
Propagation equation (forward)
- each layer's input comes from the previous layer's output; after the linear part and the activation, the output is fed to the next layer
- be careful with the matrix sizes
input batch of m examples
- input: \(X = (x^{(1)},x^{(2)},\cdots,x^{(m)})\), each column is one input example
- \(m\) is the number of examples in total
- \(n_{0}\) is the number of original features
- \(n_{i}\) is the number of neurons in layer \(i\)
- \(X\): \((n_{0},m)\)
- layer: \(Z^{[i]} = w^{[i]}X + b^{[i]}\)
- \(Z^{[i]}\) : \((n_{i},m)\)
- \(w^{[i]}\) : \((n_{i},n_{i-1})\)
- \(b^{[i]}\): \((n_{i},m)\) (broadcast from \((n_{i},1)\); each row repeats the same value)
- in the first layer the input is the original feature matrix \(X\); in later layers it is replaced by the previous layer's activation \(a^{[i-1]}\) (see the sketch below)
- the architecture should be chosen based on the complexity of the problem
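A minimal forward-pass sketch over a batch, checking the shapes listed above; the layer widths and the sigmoid activation are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m, n0, n1, n2 = 10, 64 * 64 * 3, 4, 3          # batch size and layer widths (illustrative)
X = np.random.rand(n0, m)                      # X: (n0, m), one example per column

W1, b1 = np.random.randn(n1, n0) * 0.01, np.zeros((n1, 1))   # W[1]: (n1, n0), b[1]: (n1, 1)
W2, b2 = np.random.randn(n2, n1) * 0.01, np.zeros((n2, 1))   # W[2]: (n2, n1), b[2]: (n2, 1)

Z1 = W1 @ X + b1                               # (n1, m); b1 broadcasts across the m columns
A1 = sigmoid(Z1)                               # activation of layer 1, fed to layer 2
Z2 = W2 @ A1 + b2                              # (n2, m)
A2 = sigmoid(Z2)                               # network output
assert Z1.shape == (n1, m) and Z2.shape == (n2, m)
```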
Optimizing parameters
- define loss/cost functions
- \(J(\hat{y},y) = \dfrac{1}{m}\Sigma^{m}_{i=1} L^{(i)}\)
- with \(L^{(i)} = -[y^{(i)}\log\hat{y}^{(i)}+(1-y^{(i)})\log(1-\hat{y}^{(i)})]\)
- Backward propagation
- \(w^{[i]} = w^{[i]}-\alpha \dfrac{\partial J}{\partial w^{[i]}}\)
- \(b^{[i]} = b^{[i]}-\alpha \dfrac{\partial J}{\partial b^{[i]}}\)
- begin from the weights of the last layer, e.g. \(w^{[3]}\) (closest to the output)
- use the chain rule to propagate gradients backwards (see the sketch below)
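A hedged sketch of one backpropagation + gradient-descent step for a small 2-layer network with sigmoid activations and the cross-entropy cost above; the layer sizes, learning rate, and random data are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m, n0, n1 = 10, 5, 4
alpha = 0.1                                    # learning rate
X = np.random.rand(n0, m)
Y = np.random.randint(0, 2, (1, m))

W1, b1 = np.random.randn(n1, n0) * 0.01, np.zeros((n1, 1))
W2, b2 = np.random.randn(1, n1) * 0.01, np.zeros((1, 1))

# forward pass
Z1 = W1 @ X + b1
A1 = sigmoid(Z1)
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)
J = -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))   # cost = average loss

# backward pass: start from the layer closest to the output and apply the chain rule
dZ2 = A2 - Y                                   # dJ/dZ2 for sigmoid + cross-entropy
dW2 = (dZ2 @ A1.T) / m
db2 = dZ2.mean(axis=1, keepdims=True)
dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)             # chain rule through layer 2 and sigmoid'
dW1 = (dZ1 @ X.T) / m
db1 = dZ1.mean(axis=1, keepdims=True)

# gradient descent update: w <- w - alpha * dJ/dw
W2 -= alpha * dW2; b2 -= alpha * db2
W1 -= alpha * dW1; b1 -= alpha * db1
```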
