Introduction to Machine Learning

1 Traditional Methods

1.1 Perceptron

Formula.

Error Bound Theorem.

Multiclass.

1.2 Linear Regression

MLE & MAP.

Formula.

Regularization.

1.3 kNN

Graph.

Overfit lines.

1.4 Logistic Regression

MLE & MAP.

Softmax & sigmoid.
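For reference, the two standard link functions (binary and multiclass):

\[\sigma(z)=\frac{1}{1+e^{-z}},\qquad \text{softmax}(z)_k=\frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}. \]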

Multiclass.

1.5 SVM

1.5.1 Primal

Formula.

1.5.2 Dual

Formula.

1.5.3 Feature Map

Quadratic example.
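One standard quadratic example for \(x\in\mathbb R^2\): the feature map

\[\phi(x)=\big(x_1^2,\ x_2^2,\ \sqrt2\,x_1x_2,\ \sqrt2\,x_1,\ \sqrt2\,x_2,\ 1\big)^\top\quad\text{satisfies}\quad \phi(x)^\top\phi(x')=(x^\top x'+1)^2. \]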

1.5.4 Kernel Machine

Formula.

Gram Matrix.

Complexity analysis.

1.6 Bagging and Boosting

1.6.1 Bagging and OOB

Definition.

1.6.2 Boosting Algorithms

Definition.

1.7 GMM

1.7.1 k-Means and ++

1.7.2 E-M

1.8 Gradient Descent Methods

1.8.1 Projection

1.8.2 AdaGrad

Also called Adaptive Gradient.
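The per-coordinate AdaGrad update (elementwise; \(\epsilon\) is a small stabilizing constant):

\[G_t=G_{t-1}+g_t\odot g_t,\qquad \theta_{t+1}=\theta_t-\frac{\eta}{\sqrt{G_t}+\epsilon}\odot g_t. \]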

1.8.3 RMSProp

1.8.4 Adam

Also called Adaptive Moment Estimation.
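The Adam update with bias-corrected moments (common defaults \(\beta_1=0.9\), \(\beta_2=0.999\)):

\[m_t=\beta_1 m_{t-1}+(1-\beta_1)g_t,\qquad v_t=\beta_2 v_{t-1}+(1-\beta_2)g_t\odot g_t,\]

\[\hat m_t=\frac{m_t}{1-\beta_1^t},\qquad \hat v_t=\frac{v_t}{1-\beta_2^t},\qquad \theta_{t+1}=\theta_t-\frac{\eta\,\hat m_t}{\sqrt{\hat v_t}+\epsilon}. \]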

1.9 Decision Trees

1.9.1 Random Forest

1.9.2 Hyperparameters

2 Modern Methods

2.1 Back Propagation

2.2 Common NN

2.2.1 CNN

Optimize.

2.2.2 RNN

Also the earliest widely used neural sequence model.

2.2.3 GRU

2.3 Generative Models

2.3.1 GAN

An adversarial (minimax) setting between a generator and a discriminator.
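For reference, the original minimax objective between generator \(G\) and discriminator \(D\):

\[\min_G\max_D\;\;\mathbb E_{x\sim p_{\text{data}}}[\log D(x)]+\mathbb E_{z\sim p(z)}[\log(1-D(G(z)))]. \]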

2.3.2 VAE

A latent variable method, with \(\text{dim}(z)\ll \text{dim}(x)\).

We optimize the ELBO (evidence lower bound).

Reparameterization trick: write \(z\) as a deterministic transform of noise so that gradients flow through the sampling step.
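For reference, the ELBO and the Gaussian reparameterization:

\[\log p_\theta(x)\ \ge\ \mathbb E_{q_\phi(z\mid x)}[\log p_\theta(x\mid z)]-D_{\text{KL}}\big(q_\phi(z\mid x)\,\|\,p(z)\big),\qquad z=\mu_\phi(x)+\sigma_\phi(x)\odot\varepsilon,\ \ \varepsilon\sim\mathcal N(0,I). \]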

2.3.3 Normalizing Flow

Also a latent variable method, but \(\text{dim}(z)= \text{dim}(x)\).

Assume the flow \(x=z_0\rightarrow z_1\rightarrow\cdots\rightarrow z_K=z\), with steps \(f_k: z_{k-1}\mapsto z_k\) and \(f=f_K\circ\cdots\circ f_1: x\rightarrow z\); then

\[\log p_X(x)=\log p_Z(z_K)+\sum_{k=1}^K \log\left|\det \big(J_{f_k}(z_{k-1})\big)\right|. \]

Sometimes we instead write \(T^{-1}:=f_K\circ\cdots\circ f_1\) (so \(T\) maps \(z\) to \(x\)); under a triangular assumption on \(T\),

\[\log p_X(x)=\log p_Z\big(T^{-1}(x)\big)-\sum_{j=1}^d\log\left|\partial_j T_j\big(T^{-1}(x)\big)\right|. \]
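A minimal sketch of this change-of-variables computation in code, assuming a toy stack of elementwise affine steps (the scales and shifts below are made-up illustration values) and a standard normal base density:

```python
import numpy as np

def affine_flow_logpdf(x, scales, shifts):
    """log p_X(x) for a stack of elementwise affine steps z_k = s_k * z_{k-1} + b_k,
    accumulating log|det J_{f_k}| = sum_j log|s_{k,j}| at each step."""
    z = np.asarray(x, dtype=float)
    log_det_sum = 0.0
    for s, b in zip(scales, shifts):        # f_k: z -> s * z + b (maps x toward the base z)
        z = s * z + b
        log_det_sum += np.sum(np.log(np.abs(s)))
    # base density: standard normal on z_K
    d = z.size
    log_pz = -0.5 * (d * np.log(2 * np.pi) + np.sum(z ** 2))
    return log_pz + log_det_sum

# toy usage: two affine steps on a 3-dimensional x
x = np.array([0.5, -1.0, 2.0])
scales = [np.array([2.0, 0.5, 1.5]), np.array([1.0, 3.0, 0.7])]
shifts = [np.array([0.1, 0.0, -0.2]), np.array([0.0, 0.5, 0.0])]
print(affine_flow_logpdf(x, scales, shifts))
```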

2.3.4 An Expectation View

MLE (Maximum Likelihood Estimation) and MAP (Maximum A Posteriori):

  • MLE is frequentist; MAP is Bayesian.

  • MLE is a special case of MAP (with a flat/uniform prior).

  • Regularization = MAP with a suitable prior (worked example after this list).

  • CE and OLS are MLE (under categorical and Gaussian noise models, respectively).
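A worked instance of the Regularization = MAP point: assume Gaussian noise \(y\mid x,w\sim\mathcal N(w^\top x,\sigma^2)\) and a Gaussian prior \(w\sim\mathcal N(0,\tau^2 I)\); then MAP for linear regression is exactly ridge regression,

\[\underset{w}{\text{argmax}}\;\;\log p(y\mid X,w)+\log p(w)=\underset{w}{\text{argmin}}\;\;\|y-Xw\|^2+\lambda\|w\|^2,\qquad \lambda=\frac{\sigma^2}{\tau^2}. \]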

Derivation of CE:

\[H(p,q)=\mathbb E_{y|x\sim p}[-\log q(y|x)]=-\sum_{k=1}^K p_k\log q_k. \]

Note \(H(p,q)\neq H(q,p)\).

Derivation of KL-divergence:

\[D_{\text{KL}}(p\| q)=H(p,q)-H(p)=\sum_{k=1}^K p_k\log\frac{p_k}{q_k}. \]

Note \(D_{\text{KL}}(p\| q)\neq D_{\text{KL}}(q\| p)\).

Derivation of F-divergence:

\[D_f(p\| q)=\mathbb E_{x\sim q}\left [f\left(\frac{p(x)}{q(x)}\right)\right]. \]

Note \(D_{f}(p\| q)\neq D_{f}(q\| p)\) in general; \(f\) is convex with \(f(1)=0\), e.g. \(f(t)=t\log t\) recovers \(D_{\text{KL}}(p\|q)\).
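A minimal numeric sketch of the three quantities on discrete distributions (the vectors p and q below are made-up examples), which also makes the asymmetry notes concrete:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution
q = np.array([0.5, 0.3, 0.2])   # model distribution

def cross_entropy(p, q):        # H(p, q) = -sum_k p_k log q_k
    return -np.sum(p * np.log(q))

def kl(p, q):                   # D_KL(p || q) = sum_k p_k log(p_k / q_k)
    return np.sum(p * np.log(p / q))

def f_div(p, q, f):             # D_f(p || q) = E_{x~q}[ f(p(x)/q(x)) ]
    return np.sum(q * f(p / q))

print(cross_entropy(p, q), cross_entropy(q, p))   # not symmetric
print(kl(p, q), kl(q, p))                         # not symmetric
print(f_div(p, q, lambda t: t * np.log(t)))       # equals kl(p, q) for f(t) = t log t
```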

2.4 Transformer Models

2.4.1 Examples

2.4.2 Scaling Law

2.5 Differential Privacy and Robustness

2.6 RLHF

PPO Formula.

Reward Term + KL Penalty Term + LM Term.

The LM Term is a language-modeling / MLE term on pretraining data; it is added to mitigate the alignment tax, i.e., the drop in pretraining-task performance caused by alignment fine-tuning. The KL term keeps the policy close to the SFT model.
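A common concrete form (a sketch following the InstructGPT-style PPO objective with a pretraining mix; \(r_\theta\) is the reward model, \(\pi^{\text{SFT}}\) the supervised policy, and \(\beta,\gamma\) the penalty/mixing coefficients):

\[\underset{\phi}{\text{argmax}}\;\;\mathbb E_{(x,y)\sim \pi^{\text{RL}}_\phi}\left[r_\theta(x,y)-\beta\log\frac{\pi^{\text{RL}}_\phi(y\mid x)}{\pi^{\text{SFT}}(y\mid x)}\right]+\gamma\,\mathbb E_{x\sim D_{\text{pretrain}}}\left[\log \pi^{\text{RL}}_\phi(x)\right]. \]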

3 Tips

  • What's the difference between online and offline learning?

Online learning receives the data one example (or mini-batch) at a time rather than all at once; offline (batch) learning has the full dataset available up front.

  • What is Statistical Learning?

\[\underset{w}{\text{argmin}}\;\;\mathbb E_{(x,y)\sim P}[\ell_w(x,y)]. \]
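In practice \(P\) is unknown, so we minimize the empirical risk on a finite sample instead:

\[\underset{w}{\text{argmin}}\;\;\frac1n\sum_{i=1}^n \ell_w(x_i,y_i). \]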

  • What is an optimizer?

Generally, for first-order methods, the answer is GD (gradient descent); variants such as AGD, APGD, and Adam are all optimizers as well.

  • Common curves.

ReLU, sigmoid, softmax, hinge.

  • Remember the multivariate Gaussian density.

\[\frac1{(2\pi)^{\frac d2}\sqrt{|\Sigma|}}\exp \left(-\frac12 (\bm x-\bm \mu)^\top \Sigma^{-1} (\bm x-\bm \mu)\right). \]

  • What is a weighted sum?

Examples include attention (a weighted sum of value vectors) and divergences and their variants (expectations, i.e., probability-weighted sums).

  • What is the difference between discriminative and generative?

\[\underset{\theta}{\text{argmax}}\;\;\mathbb E_{(x,y)\sim p}[\log(p_\theta (y\mid x))]\quad \text{and}\quad \underset{\theta}{\text{argmax}}\;\;\mathbb E_{x\sim p}[\log p_\theta (x)]. \]

  • What is the difference between \(p\) and \(q\)?

\(p\) is the true (data) distribution; \(q\) is the model (approximating) distribution.

  • What are the three main types of learning?

Supervised, unsupervised, reinforcement.

  • What is Statistical Inference?

Examples include Variational Inference, the Method of Moments, Maximum Likelihood, OLS, Hypothesis Testing, and Confidence Intervals.
