CS229 Study Notes

Introduction to ML

definitions

  • Arthur Samuel (1959): the field of study that gives computers the ability to learn without being explicitly programmed
  • Tom Mitchell (1998): a computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E

division

  • supervised learning
  • unsupervised learning

applications

  • organizing computing clusters
  • social network analysis
  • market segmentation
  • astronomical data analysis
  • cocktail party problem
  • autonomous helicopter

Linear Regression and Gradient Descent

  • some preliminary definitions:
    • \(\theta\): parameters (what the learning algorithm needs to produce)
    • \(m\): the number of training examples
    • \(n\): the number of features
    • \(x\): input, features (by convention \(x_0=1\))
    • \(y\): output, target
    • \((x,y)\): training example
    • \((x^{(i)},y^{(i)})\): the i-th training example
  • Learning algorithm:
    • input: training set
    • output: hypothesis (the thing used in classification/prediction)
      • input: data
      • output: the prediction
      • goal: find \(\theta\) s.t. \(h(x) \approx y\) on the training set
        • reformulation: find \(\argmin\limits_\theta J(\theta)\)
          • \(J(\theta)\): cost function
  • the representation of hypothesis:
    • linear function:
      • one \(x\): \(h(x)=\theta_0+\theta_1x\)
      • multiple \(x\): \(h_\theta(x)=\sum\limits_{j=0}^n \theta_j x_j\) (the sum starts at \(j=0\) because \(x_0=1\))
        • vector version: \(h_\theta(x)=\theta x^T\)
          • \(\theta=(\theta_{i-1})_{n+1}^T\)
          • \(x=(x_{i-1})_{n+1}^T\)
  • linear regression
    • def of \(J(\theta)\): \(J(\theta)=\frac 12\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2\)
    • way to minimize \(J(\theta)\): gradient descent (batch gradient descent)
      1. start from some initial \(\theta\) (random, or all zeros)
      2. compute the gradient of \(J(\theta)\) with respect to \(\theta\)
      3. update \(\theta_j := \theta_j-\alpha{\partial \over \partial\theta_j}J(\theta)\) for every \(j\), then repeat from step 2 until convergence
        • \(\alpha\): learning rate (usually start with 0.01; try several values and keep the one that performs best)
        • \(:=\): assignment
      • problem: every update of \(\theta_j\) scans the whole training set, which is slow on a large dataset
      • alternative: stochastic gradient descent:
        • principle: update all \(\theta_j\) in \(\theta\) using the gradient from only one training example at a time
        • when to stop: when \(J(\theta)\) stops going down
        • implementation:
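          while True:  # in practice, stop once J(theta) stops decreasing
              for i in range(m):  # one training example per update
                  err = h(theta, x[i]) - y[i]
                  for j in range(n + 1):  # update every theta_j, including theta_0
                      theta[j] = theta[j] - alpha * err * x[i][j]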
          
    • the update equation in linear regression:
      • \({\partial \over \partial \theta_j}J(\theta)=\sum\limits_{i=1}^m{\partial \over \partial \theta_j}\frac 12(h_\theta(x^{(i)})-y^{(i)})^2=\sum\limits_{i=1}^m[(h_\theta(x^{(i)})-y^{(i)})\cdot{\partial \over \partial \theta_j}(h_\theta(x^{(i)})-y^{(i)})]=\sum\limits_{i=1}^m x^{(i)}_j(h_\theta(x^{(i)})-y^{(i)})\)
      • \(\theta_j := \theta_j-\alpha\sum\limits_{i=1}^m x_j^{(i)}(h_\theta(x^{(i)})-y^{(i)})\)
    • property of \(J(\theta)\): it is convex (shaped like a big bowl), so the only local optimum is the global one and gradient descent cannot get stuck in a wrong minimum
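The two update rules above can be sketched in plain Python. This is only a minimal illustration: the toy data (generated from \(y=1+2x\)), the learning rate, and the iteration counts are assumptions chosen so that both variants converge, not values from the lecture.

```python
# Minimal sketch: batch vs. stochastic gradient descent for linear regression.
# Toy data and hyperparameters are illustrative assumptions.

def h(theta, x):
    """Hypothesis h_theta(x) = sum_j theta_j * x_j (x[0] is the constant 1)."""
    return sum(t * xj for t, xj in zip(theta, x))

def cost(theta, xs, ys):
    """J(theta) = 1/2 * sum_i (h(x_i) - y_i)^2."""
    return 0.5 * sum((h(theta, x) - y) ** 2 for x, y in zip(xs, ys))

def batch_gd(xs, ys, alpha=0.05, steps=2000):
    theta = [0.0] * len(xs[0])
    for _ in range(steps):
        errs = [h(theta, x) - y for x, y in zip(xs, ys)]
        # theta_j := theta_j - alpha * sum_i x_j^(i) * (h(x^(i)) - y^(i))
        theta = [t - alpha * sum(e * x[j] for e, x in zip(errs, xs))
                 for j, t in enumerate(theta)]
    return theta

def sgd(xs, ys, alpha=0.05, epochs=2000):
    theta = [0.0] * len(xs[0])
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = h(theta, x) - y  # error on a single example
            theta = [t - alpha * err * xj for t, xj in zip(theta, x)]
    return theta

# Toy data from y = 1 + 2x, with x_0 = 1 prepended to every input.
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [1.0, 3.0, 5.0, 7.0]
theta_batch = batch_gd(xs, ys)
theta_sgd = sgd(xs, ys)
```

On this noise-free data both variants should end up near \(\theta=(1,2)\); on noisy data SGD with a fixed \(\alpha\) typically keeps oscillating around the minimum instead of settling exactly.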
  • special version of linear regression (normal equation):
    • features:
      • only works for linear regression
      • jumps to the global optimum in one step
    • deriving the equation:
      • some defs:
        • \(\nabla_\theta J(\theta)=({\partial J \over \partial \theta_{i-1}})_{n+1}\)
        • \(f: \R^{m \times n} \rightarrow \R \Rightarrow \nabla_A f(A)=\lgroup{\partial f \over \partial A_{ij}}\rgroup_{m*n}\)
        • \({\rm tr}A_n=\sum\limits_{i=1}^n A_{ii}\)
          • \(\nabla_A {\rm tr}AB=B^T\)
          • \({\rm tr}ABC={\rm tr}CAB={\rm tr}BCA\)
          • \(\nabla_A {\rm tr}ABA^TC=CAB+C^TAB^T\)
          • \(\nabla_{A^T} f(A)=(\nabla_A f(A))^T\)
          • \(\nabla_A|A|=|A|(A^{-1})^T\)
        • \(X=((x^{(i)})^T)_m \Rightarrow \theta X=(h_\theta(x^{(i)}))_m\)
        • \(y=(y^{(i)})_m\)
          • \(J(\theta)=\frac 12(\theta X-y)(\theta X-y)^T\)
      • proof:

    \[\begin{align*}
    \nabla_\theta J(\theta)&=\nabla_\theta \frac 12(\theta X-y)(\theta X-y)^T\\
    &=\frac 12\nabla_\theta(\theta X-y)(X^T\theta^T-y^T)\\
    &=\frac 12\nabla_\theta(\theta XX^T\theta^T-yX^T\theta^T-\theta Xy^T+yy^T)\\
    &=\frac 12\nabla_\theta{\rm tr}(\theta XX^T\theta^T-yX^T\theta^T-\theta Xy^T+yy^T)\\
    &=\frac 12(\nabla_\theta{\rm tr}\,\theta XX^T\theta^T-2\nabla_\theta{\rm tr}\,\theta Xy^T+\nabla_\theta{\rm tr}\,yy^T)\\
    &=\frac 12(\theta XX^T+\theta XX^T-2yX^T)\\
    &=\theta XX^T-yX^T=\vec 0\\
    \Rightarrow \theta&=yX^T(XX^T)^{-1}
    \end{align*} \]
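As a rough check of the closed form \(\theta=yX^T(XX^T)^{-1}\), the sketch below solves the equivalent linear system \((XX^T)\theta^T=Xy^T\) in plain Python. The helper `solve` (a small Gauss-Jordan elimination) and the toy data are assumptions added for illustration; in practice a linear-algebra library would do this step.

```python
# Minimal sketch of the normal equation, assuming X X^T is invertible.

def solve(a, b):
    """Solve a @ t = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def normal_equation(xs, ys):
    d = len(xs[0])
    # In the notes' convention: X X^T = sum_i x^(i) x^(i)^T and X y^T = sum_i x^(i) y^(i).
    a = [[sum(x[r] * x[c] for x in xs) for c in range(d)] for r in range(d)]
    b = [sum(x[r] * y for x, y in zip(xs, ys)) for r in range(d)]
    return solve(a, b)

# Toy data from y = 1 + 2x, with x_0 = 1 prepended to every input.
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [1.0, 3.0, 5.0, 7.0]
theta = normal_equation(xs, ys)
```

No learning rate and no iteration are involved, which is the "one-step jump" mentioned above.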

  • Non-linear Regression: still linear in \(\theta\), but the features are non-linear functions of \(x\)
    • representation: \(h_\theta(x)=\theta_0+\theta_1x+\theta_2\sqrt x+\theta_3\log x+...\)
  • Locally Weighted Regression
    • terminology:
      • parametric learning algorithm: fit a fixed set of parameters (\(\theta\)) to the data
      • non-parametric learning algorithm: the amount of data/parameters you need to keep grows (linearly) with the size of the training set (not great for very large datasets)
    • implementation: use the training examples near the query point \(x\) to fit the regression and make the prediction
    • formalize: fit \(\theta\) to minimize \(\sum\limits_{i=1}^m w^{(i)}(y^{(i)}-\theta^Tx^{(i)})^2\) where \(w^{(i)}=e^{-\frac{(x^{(i)}-x)^2}{2\tau^2}}\) (a Gaussian-shaped weight function; \(\tau\) is the bandwidth)
      • \(|x^{(i)}-x| \rightarrow 0\): \(w^{(i)} \rightarrow 1\)
      • \(|x^{(i)}-x| \rightarrow \infty\): \(w^{(i)} \rightarrow 0\)
      • \(\tau \rightarrow 0\): jagged fit
      • \(\tau \rightarrow \infty\): over smoothing
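A minimal sketch of locally weighted regression for a scalar input, assuming a straight line \(\theta_0+\theta_1x\) is fitted around each query point. The function name `lwr_predict`, the sine-shaped toy data, and \(\tau=0.5\) are assumptions made for illustration.

```python
import math

def lwr_predict(x_query, xs, ys, tau=0.5):
    """Predict y at x_query with locally weighted linear regression (1-D input)."""
    # w_i = exp(-(x_i - x)^2 / (2 tau^2)): points near the query get weight close to 1.
    w = [math.exp(-((xi - x_query) ** 2) / (2 * tau ** 2)) for xi in xs]
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, xs)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, ys)) / sw
    # Closed-form weighted least squares for the line theta_0 + theta_1 * x.
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, xs))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, xs, ys))
    theta1 = sxy / sxx
    theta0 = y_bar - theta1 * x_bar
    return theta0 + theta1 * x_query

xs = [i * 0.5 for i in range(13)]   # 0.0, 0.5, ..., 6.0
ys = [math.sin(x) for x in xs]      # a clearly non-linear target
pred = lwr_predict(1.5, xs, ys, tau=0.5)
```

Note that \(\theta\) is refitted for every query point and the whole training set has to be kept around, which is what makes the method non-parametric.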
  • why least squares? (the assumptions below may not hold exactly, but they are accurate enough in practice)
    • \(y^{(i)}=\theta^Tx^{(i)}+\varepsilon^{(i)}\) (thing assumed)
      • \(\varepsilon^{(i)}\): error(unmodeled features, random noise, ...)
    • \(\varepsilon^{(i)} \sim N(0,\sigma^2)\) (thing assumed)
    • \(P(\varepsilon^{(i)})=\frac 1{\sqrt{2\pi}\sigma}e^{-{(\varepsilon^{(i)})^2 \over 2\sigma^2}}\)
    • \(P(y^{(i)}|x^{(i)};\theta)=\frac 1{\sqrt{2\pi}\sigma}e^{-{(y^{(i)}-\theta^Tx^{(i)})^2 \over 2\sigma^2}}\)
      • "\(;\)": parametrized by
        • \(y|x,\theta\): \(y\) is conditioned on \(x\) and \(\theta\)
        • \(y|x;\theta\): \(y\) is conditioned on \(x\) and parametrized by \(\theta\)
      • representation in another way: \((y^{(i)}|x^{(i)};\theta) \sim N(\theta^Tx^{(i)},\sigma^2)\)
    • IID (independent and identically distributed): the error terms \(\varepsilon^{(i)}\) of different examples are independent of each other and follow the same distribution (thing assumed)
    • Likelihood of parameters: \(L(\theta)=P(y|x;\theta)=\prod\limits_{i=1}^m P(y^{(i)}|x^{(i)};\theta)=\prod\limits_{i=1}^m \frac 1{\sqrt{2\pi}\sigma}e^{-{(y^{(i)}-\theta^Tx^{(i)})^2 \over 2\sigma^2}}\)
      • \(P(y|x;\theta)=\prod\limits_{i=1}^m P(y^{(i)}|x^{(i)};\theta)\): this factorization needs the IID assumption
      • \(l(\theta)=\log L(\theta)=\log\prod\limits_{i=1}^m \frac 1{\sqrt{2\pi}\sigma}e^{-{(y^{(i)}-\theta^Tx^{(i)})^2 \over 2\sigma^2}}=\sum\limits_{i=1}^m (\log\frac 1{\sqrt{2\pi}\sigma}+\log e^{-{(y^{(i)}-\theta^Tx^{(i)})^2 \over 2\sigma^2}})=m\log\frac 1{\sqrt{2\pi}\sigma}-\sum\limits_{i=1}^m {(y^{(i)}-\theta^Tx^{(i)})^2 \over 2\sigma^2}\)
    • MLE: maximum likelihood estimation
    • target: choose \(\theta\) to maximize \(l(\theta) \Rightarrow\) choose \(\theta\) to minimize \(\frac 12\sum\limits_{i=1}^m(y^{(i)}-\theta^Tx^{(i)})^2=J(\theta)\)
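The link between the log-likelihood and least squares can be checked numerically: under the Gaussian noise assumption, \(l(\theta)=m\log\frac 1{\sqrt{2\pi}\sigma}-\frac{J(\theta)}{\sigma^2}\), so whatever maximizes \(l\) also minimizes \(J\). The data, \(\sigma\), and the candidate \(\theta\) values below are arbitrary assumptions used only for the check.

```python
import math

def predict(theta, x):
    return sum(t * xj for t, xj in zip(theta, x))

def log_likelihood(theta, xs, ys, sigma):
    """l(theta) = sum_i log N(y_i; theta^T x_i, sigma^2)."""
    return sum(-math.log(math.sqrt(2 * math.pi) * sigma)
               - (y - predict(theta, x)) ** 2 / (2 * sigma ** 2)
               for x, y in zip(xs, ys))

def cost(theta, xs, ys):
    """J(theta) = 1/2 * sum_i (y_i - theta^T x_i)^2."""
    return 0.5 * sum((y - predict(theta, x)) ** 2 for x, y in zip(xs, ys))

xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
ys = [0.9, 3.2, 4.8]
sigma = 0.7
const = len(xs) * math.log(1 / (math.sqrt(2 * math.pi) * sigma))

# For any theta, l(theta) and const - J(theta) / sigma^2 should agree.
gaps = [abs(log_likelihood(t, xs, ys, sigma) - (const - cost(t, xs, ys) / sigma ** 2))
        for t in ([0.0, 0.0], [1.0, 2.0], [0.5, 2.5])]
```

Every entry of `gaps` should be at floating-point noise level, whatever value of \(\sigma\) is chosen.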
  • classification:
    • binary classification: dataset whose \(y \in \{0,1\}\)
      • bad to fit with linear regression
    • logistic regression:
      • want: \(h_\theta(x) \in [0,1]\)
      • define: \(g(z)=\frac 1{1+e^{-z}}\) (sigmoid function/logistic function)
        • monotonically increasing
        • values lie in \((0,1)\)
      • \(h_\theta(x)=g(\theta^Tx)=\frac 1{1+e^{-\theta^Tx}}\)
      • assume:
        • \(P(y=1|x;\theta)=h_\theta(x)\)
        • \(P(y=0|x;\theta)=1-h_\theta(x)\)
      • combination: \(P(y|x;\theta)=h_\theta(x)^y(1-h_\theta(x))^{1-y}\)
      • MLE:
        • \(L(\theta)=P(y|x;\theta)=\prod\limits_{i=1}^m P(y^{(i)}|x^{(i)};\theta)=\prod\limits_{i=1}^m h_\theta(x^{(i)})^{y^{(i)}}(1-h_\theta(x^{(i)}))^{1-y^{(i)}}\)
        • \(l(\theta)=\log L(\theta)=\sum\limits_{i=1}^m (y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log(1-h_\theta(x^{(i)})))\)
      • batch gradient ascent: \(\theta_j := \theta_j+\alpha{\partial \over \partial\theta_j}l(\theta)\)
        • no local maxima: \(l(\theta)\) is concave, so the only maximum is the global one
        • difference from linear regression: maximize \(l(\theta)\) instead of minimizing \(J(\theta)\) (hence \(+\) instead of \(-\))
        • result: \(\theta_j :=\theta_j+\alpha \sum\limits_{i=1}^m(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)}\)
          • why it looks the same as linear regression: the definition of \(x^{(i)}\) does not change, but \(h_\theta(x)\) is now the sigmoid of \(\theta^Tx\), which does not show on the surface
          • there is no closed-form solution like the normal equation
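A minimal sketch of batch gradient ascent for logistic regression, following the update rule above. The toy data (deliberately not linearly separable, so that the maximum-likelihood \(\theta\) stays finite), \(\alpha=0.05\), and the step count are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    """h_theta(x) = g(theta^T x), read as P(y = 1 | x; theta)."""
    return sigmoid(sum(t * xj for t, xj in zip(theta, x)))

def log_likelihood(theta, xs, ys):
    return sum(y * math.log(h(theta, x)) + (1 - y) * math.log(1 - h(theta, x))
               for x, y in zip(xs, ys))

def fit_logistic(xs, ys, alpha=0.05, steps=2000):
    theta = [0.0] * len(xs[0])
    for _ in range(steps):
        errs = [y - h(theta, x) for x, y in zip(xs, ys)]
        # theta_j := theta_j + alpha * sum_i (y^(i) - h(x^(i))) * x_j^(i)
        theta = [t + alpha * sum(e * x[j] for e, x in zip(errs, xs))
                 for j, t in enumerate(theta)]
    return theta

# x_0 = 1 is the intercept feature; the labels overlap around x = 2..3.
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
ys = [0, 0, 1, 0, 1, 1]
theta = fit_logistic(xs, ys)
```

The code is line-for-line the linear regression update with the sign flipped; only `h` differs, which is the point made in the notes.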
  • Newton's Method
    • advantage: sometimes much faster than gradient descent
    • given: a function \(f\)
    • want: find \(\theta\ {\sf s.t.} f(\theta)=0\)
      • also applies to finding a maximum: solve \(l'(\theta)=0\), i.e. take \(f=l'\)
    • update: \(\theta^{(k+1)} := \theta^{(k)}-{f(\theta^{(k)}) \over f'(\theta^{(k)})}\)
    • property: quadratic convergence (the error is roughly squared at each step)
      • \(0.01\) error \(\rightarrow 0.0001\) error \(\rightarrow 0.00000001\) error (each arrow is one step)
    • update when \(\theta \in \R^{n+1}: \theta^{(k+1)} := \theta^{(k)}-H^{-1}\nabla_\theta l(\theta^{(k)})\)
      • \(H\): Hessian matrix of \(l\) (\(\R^{(n+1) \times (n+1)}\))
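The scalar update can be tried on a simple root-finding problem. The function \(f(\theta)=\theta^2-2\) (whose positive root is \(\sqrt 2\)), the starting point, and the helper name `newton` are hypothetical choices made for this sketch.

```python
def newton(f, f_prime, theta, steps=6):
    """Iterate theta := theta - f(theta) / f'(theta), recording |f(theta)| along the way."""
    history = [abs(f(theta))]
    for _ in range(steps):
        theta = theta - f(theta) / f_prime(theta)
        history.append(abs(f(theta)))
    return theta, history

# Solve f(theta) = theta^2 - 2 = 0 starting from theta = 1.
root, history = newton(lambda t: t * t - 2.0, lambda t: 2.0 * t, theta=1.0)
```

Printing `history` should show the residual shrinking much faster than linearly, with the number of correct digits roughly doubling per step until floating-point precision is reached.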

Perceptron & Generalized Linear Models

  • Perceptron algorithm (same update form as logistic regression, but with a hard threshold):
    • \(g(z)=\begin{cases}1, z \geq 0\\0, z<0\end{cases}\)
    • \(h_\theta(x)=g(\theta^Tx)\)
    • update rule: \(\theta_j := \theta_j+\alpha(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)}\)
      • \(y^{(i)}-h_\theta(x^{(i)})\): a scalar
        • \(0\): prediction is right
        • \(\pm 1\): prediction is wrong
          • \(1: y^{(i)}=1\)
          • \(-1: y^{(i)}=0\)
      • \(\theta\): the normal vector of the decision boundary (points on one side are in class 0, points on the other side are in class 1)
        • \(\Delta \theta_j: \alpha' x_j\), where \(\alpha'=\alpha(y^{(i)}-h_\theta(x^{(i)}))\)
    • cannot solve classification problems in which the classes cannot be separated by a line through the origin
    • when to stop is usually decided by the user (the algorithm may never find an answer on its own)
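A minimal sketch of the perceptron update rule. The toy data is an assumption and is linearly separable on purpose; `max_epochs` is a hypothetical user-chosen cap, reflecting the note above that the stopping point is decided by the user.

```python
def predict(theta, x):
    """h_theta(x) = g(theta^T x) with the hard threshold g(z) = 1 if z >= 0 else 0."""
    return 1 if sum(t * xj for t, xj in zip(theta, x)) >= 0 else 0

def train_perceptron(xs, ys, alpha=1.0, max_epochs=100):
    theta = [0.0] * len(xs[0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(xs, ys):
            err = y - predict(theta, x)  # 0 if correct, +1 or -1 if wrong
            if err != 0:
                mistakes += 1
                # theta_j := theta_j + alpha * (y - h(x)) * x_j
                theta = [t + alpha * err * xj for t, xj in zip(theta, x)]
        if mistakes == 0:  # a full pass without mistakes: nothing left to update
            break
    return theta

# x_0 = 1 is the intercept feature.
xs = [[1.0, 2.0, 1.0], [1.0, 3.0, 2.0], [1.0, -1.0, -1.0], [1.0, -2.0, 0.0]]
ys = [1, 1, 0, 0]
theta = train_perceptron(xs, ys)
```

Each mistake adds or subtracts a multiple of the misclassified \(x\) to \(\theta\), which rotates the decision boundary toward classifying that point correctly.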
  • exponential family: a class of probability distributions
    • PDF (probability density function): \(P(y;\eta)=b(y)e^{\eta^TT(y)-a(\eta)}={b(y)e^{\eta^TT(y)} \over e^{a(\eta)}}\)
      • \(y\): data
      • \(\eta\): natural parameter
      • \(T(y)\): sufficient statistic (\(T(y)=y\) in this lecture)
      • \(b(y)\): base measure
      • \(a(\eta)\): log-partition function
    • some example distributions:
      • Bernoulli Distribution (over binary data)
        • \(\phi\): probability that the event happens (\(P(y=1)\))
        • \(P(y;\phi)=\phi^y(1-\phi)^{1-y}=e^{\log(\phi^y(1-\phi)^{1-y})}=e^{y\log\frac\phi{1-\phi}+\log(1-\phi)}\)
          • \(b(y)=1\)
          • \(\eta=\log\frac\phi{1-\phi} \Rightarrow \phi=\frac 1{1+e^{-\eta}}\)
          • \(T(y)=y\)
          • \(a(\eta)=-\log(1-\phi)=-\log(1-\frac 1{1+e^{-\eta}})=\log(1+e^\eta)\)
      • Gaussian (with fixed variance) (over real data)
        • Assume \(\sigma^2=1\)
        • \(P(y;\mu)=\frac 1{\sqrt{2\pi}}e^{-\frac{(y-\mu)^2}2}=\frac 1{\sqrt{2\pi}}e^{-\frac{y^2}2}e^{\mu y-\frac 12\mu^2}\)
          • \(b(y)=\frac 1{\sqrt{2\pi}}e^{-\frac{y^2}2}\)
          • \(\eta=\mu\)
          • \(T(y)=y\)
          • \(a(\eta)=\frac{\mu^2}2=\frac{\eta^2}2\)
      • Poisson (over counts, i.e. non-negative integers)
      • Gamma, Exponential (over positive real data, \(\R_+\))
      • Beta, Dirichlet (over probability distributions)
    • the nice mathematical properties:
      • the log-likelihood with respect to \(\eta\) is concave, so the NLL (negative log-likelihood) is convex
      • \(E(y;\eta)=\frac\partial{\partial\eta}a(\eta), {\rm Var}(y;\eta)={\partial^2 \over \partial\eta^2}a(\eta)\)
        • why good: computing a mean or a variance usually needs integration, but here it only needs differentiation
        • if \(\eta\) is a vector, the second derivative becomes the Hessian
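The mean and variance property can be checked numerically for the Bernoulli case, where \(a(\eta)=\log(1+e^\eta)\), the mean is \(\phi\), and the variance is \(\phi(1-\phi)\). The finite-difference helpers and the value \(\phi=0.3\) are assumptions made for this sketch.

```python
import math

def a(eta):
    """Log-partition function of the Bernoulli distribution."""
    return math.log(1.0 + math.exp(eta))

def derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

phi = 0.3
eta = math.log(phi / (1 - phi))           # natural parameter for this phi
mean_from_a = derivative(a, eta)          # should be close to phi
var_from_a = second_derivative(a, eta)    # should be close to phi * (1 - phi)
```

No sum or integral over \(y\) is needed here: two derivatives of \(a\) are enough.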
  • Generalized Linear Models: (GLM)
    • Assumptions/Design Choices:
      • \(y|x;\theta \sim F(\eta)\), where \(F(\eta)\) is in exponential family
      • \(\eta=\theta^Tx,\theta \in \R^n,x \in \R^n\)
      • at test time: output is \(E(y|x;\theta)\) \(\big(h_\theta(x)=E(y|x;\theta)\big)\)
    • use: choose \(b,a,T\) based on the data
      • train: find \(\argmax\limits_\theta \sum\limits_{i=1}^m \log P(y^{(i)};\theta^Tx^{(i)})\)
      • test: \(E(y;\eta)=E(y;\theta^Tx)\)
    • learning update rule: \(\theta_j := \theta_j+\alpha(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)}\)
      • batch gradient ascent: \(\theta_j := \theta_j+\alpha\sum\limits_{i=1}^m(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)}\)
      • Newton's method: usable when the data size is below about \(1000\)
    • some terminologies:
      • canonical response function(CRF): \(\mu=g(\eta)=E(y;\eta)\)
      • canonical link function(CLF): \(\eta=g^{-1}(\mu)\)
    • parameterizations:
      • model param: \(\theta\)
      • canonical param:
        • \(\phi\) for Bernoulli
        • \(\mu,\sigma^2\) for Gaussian
        • \(\lambda\) for Poisson
      • natural param: \(\eta\)
        • link with model param: \(\eta=\theta^Tx\)
        • link with canonical param: \(g/g^{-1}\) (CRF/CLF)
    • the distribution of regressions:
      • linear regression: Gaussian
      • logistic regression: Bernoulli
  • visualization of GLM:
    • data generation: data was generated over distributions over \(\eta Oy\)
      • Gaussian: \(\eta\) axis is the position of \(\mu\) of distribution
      • Bernoulli: \(O\) is the cross point of \(x\) axis and \(\eta\)
  • Softmax Regression (a member of exponential family) (Cross Entropy)
    • defs:
      • \(K\): the number of classes
      • data: \(x^{(i)} \in \R^n\)
      • labels: \(y^{(i)} \in \{0,1\}^K\) (one-hot: exactly one \(1\), all other entries \(0\))
        • \(c\): the position \(j\) where \(y_j=1\)
      • param: \(\theta_{class} \in \R^n (class \in classes)\)
        • \(classes\): the set of all possible classes
      • \(\theta=(\theta_i^T)_K \in \R^{K \times n}\)
    • representation: a set of lines (one line for each class) (one side \(\Leftrightarrow\) in the class, the other side \(\Leftrightarrow\) not in the class)
    • predicted distribution(hypothesis function): \(\hat p(y)={e^{\theta_y^Tx} \over \sum\limits_{i \in classes}e^{\theta_i^Tx}}\) (exp+normalization) (a distribution over \(K\) classes)
      • why exp: \(\R\) to \(\R_+\)
      • why normalization: \(\R_+\) to \([0,1]\)
    • target distribution: \(p(y)=\begin{cases}1,y=c\\0,y \neq c\end{cases}\)
    • cross entropy: the distance between \(\hat p(y)\) and \(p(y)\) (it plays the role that \(J(\theta)\) plays in linear regression)

    \[\begin{align*} CrossEnt(p,\hat p)&=-\sum\limits_{y \in classes}p(y)\log\hat p(y)\\ &=-\log\hat p(c)\\ &=-\log{e^{\theta_c^Tx} \over \sum\limits_{i \in classes}e^{\theta_i^Tx}}\\ &=-\theta_c^Tx+\log\sum\limits_{i \in classes}e^{\theta_i^Tx} \end{align*} \]

    • update: gradient descent on the cross entropy
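A minimal sketch of the softmax hypothesis and one gradient step on the cross entropy of a single example. The gradient used here, \(\nabla_{\theta_k}CrossEnt=(\hat p(k)-[k=c])\,x\), follows from differentiating \(-\theta_c^Tx+\log\sum_i e^{\theta_i^Tx}\); the function names, the example \((x,c)\), and \(\alpha=0.1\) are assumptions for illustration.

```python
import math

def softmax(scores):
    """exp + normalization; subtracting the max only improves numerical stability."""
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_scores(theta, x):
    """theta[k] is the parameter vector of class k; returns theta_k^T x for every k."""
    return [sum(t * xj for t, xj in zip(theta_k, x)) for theta_k in theta]

def cross_entropy(theta, x, c):
    """-log p_hat(c) for an example whose true class index is c."""
    return -math.log(softmax(class_scores(theta, x))[c])

def gradient_step(theta, x, c, alpha=0.1):
    p_hat = softmax(class_scores(theta, x))
    # d CrossEnt / d theta_k = (p_hat(k) - [k == c]) * x
    return [[t - alpha * (p_hat[k] - (1.0 if k == c else 0.0)) * xj
             for t, xj in zip(theta_k, x)]
            for k, theta_k in enumerate(theta)]

theta = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # K = 3 classes, n = 2 features
x, c = [1.0, 2.0], 2
loss_before = cross_entropy(theta, x, c)       # uniform prediction, so log(3)
for _ in range(50):
    theta = gradient_step(theta, x, c)
loss_after = cross_entropy(theta, x, c)
```

With all-zero parameters every class gets probability \(1/3\); each step then raises the score of the true class and lowers the others.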

GDA & Naive Bayes

  • Generative Learning Algorithm:
    • basic principle: build a model for each class and assign the input to the class whose model gives it the highest likelihood
    • formalize:
      • discriminative: learn \(P(y|x)\) (or learn \(h_\theta(x)=0/1\))
      • generative: learn \(P(x|y)\)(\(P(x|y=0)\) and \(P(x|y=1)\)) and \(P(y)\)(class prior)
        • Bayes rule: \(P(y|x)={P(x|y)P(y) \over P(x)}\)
          • \(P(x)=P(x|y=1)P(y=1)+P(x|y=0)P(y=0)\)
  • Gaussian Discriminant Analysis (GDA): (a generative learning algorithm)
    • suppose \(x \in \R^n\) (drop \(x_0=1\) convention)
    • Assume \(P(x|y)\) is Gaussian
    • some prerequisites:
      • Multivariate Gaussian Distribution: \(z \sim N(\mu,\Sigma)\)
        • \(z \in \R^n\)
        • \(\mu \in \R^n\)
        • \(\Sigma \in \R^{n \times n}\)
        • \(E(z)=\mu\)
        • \(Cov(z)=E((z-\mu)(z-\mu)^T)=E(zz^T)-E(z)E(z)^T\)
        • \(P(z)=\frac 1{(2\pi)^\frac n2|\Sigma|^\frac 12}e^{-\frac 12(z-\mu)^T\Sigma^{-1}(z-\mu)}\)
      • indicator function: \([true]=1,[false]=0\)
    • GDA model:
      • \(P(x|y=0)=\frac 1{(2\pi)^\frac n2|\Sigma|^\frac 12}e^{-\frac 12(x-\mu_0)^T\Sigma^{-1}(x-\mu_0)}\)
      • \(P(x|y=1)=\frac 1{(2\pi)^\frac n2|\Sigma|^\frac 12}e^{-\frac 12(x-\mu_1)^T\Sigma^{-1}(x-\mu_1)}\)
      • \(P(y)=\phi^y(1-\phi)^{1-y}\) (\(P(y=1)=\phi\))
    • parameters: \(\mu_0,\mu_1,\Sigma,\phi\)
    • how to fit parameters:
      • maximize joint likelihood
        • likelihood: \(L(\phi,\mu_0,\mu_1,\Sigma)=\prod\limits_{i=1}^m P(x^{(i)},y^{(i)};\phi,\mu_0,\mu_1,\Sigma)=\prod\limits_{i=1}^m P(x^{(i)}|y^{(i)})P(y^{(i)})\)
        • \(l(\phi,\mu_0,\mu_1,\Sigma)=\log L(\phi,\mu_0,\mu_1,\Sigma)\)
      • in discriminative learning: maximize the conditional likelihood:
        • likelihood: \(L(\theta)=\prod\limits_{i=1}^m P(y^{(i)}|x^{(i)};\theta)\)
    • fit result:
      • \(\phi=\frac{\sum\limits_{i=1}^m y^{(i)}}m=\frac{\sum\limits_{i=1}^m [y^{(i)}=1]}m\)
      • \(\mu_0={\sum\limits_{i=1}^m[y^{(i)}=0]x^{(i)} \over \sum\limits_{i=1}^m [y^{(i)}=0]}\)
      • \(\mu_1={\sum\limits_{i=1}^m[y^{(i)}=1]x^{(i)} \over \sum\limits_{i=1}^m [y^{(i)}=1]}\)
      • \(\Sigma=\frac 1m\sum\limits_{i=1}^m(x^{(i)}-\mu_{y^{(i)}})(x^{(i)}-\mu_{y^{(i)}})^T\)
    • prediction: \(\argmax\limits_y P(y|x)=\argmax\limits_y {P(x|y)P(y) \over P(x)}=\argmax\limits_y P(x|y)P(y)\)
    • pros: quick to fit, and works well when the dataset is small
    • why one \(\Sigma\): it reduces the number of parameters and makes the decision boundary linear
    • Comparison with Logistic Regression:
      • formal comparison:
        • GDA: (generative)
          • \(x|y=0 \sim N(\mu_0,\Sigma)\)
          • \(x|y=1 \sim N(\mu_1,\Sigma)\)
          • \(y \sim Ber(\phi)\)
        • logistic regression:
          • \(P(y=1|x)=\frac 1{1+e^{-\theta^Tx}}\)
      • for fixed \(\phi,\mu_0,\mu_1,\Sigma\), plot \(P(y=1|x;\phi,\mu_0,\mu_1,\Sigma)\) as a function of \(x\)
      • \(P(y=1|x;\phi,\mu_0,\mu_1,\Sigma)={P(x|y=1;\mu_1,\Sigma)P(y=1;\phi) \over P(x;\phi,\mu_0,\mu_1,\Sigma)}\)
      • \(P(y=1|x;\phi,\mu_0,\mu_1,\Sigma)\) when \(x \in \R\): the curve is a sigmoid, the same form as logistic regression
        • the GDA assumptions imply the logistic form of \(P(y=1|x)\) (the converse does not hold)
        • GDA makes stronger assumptions than logistic regression
        • GDA does better than logistic regression if its assumptions are correct
          • if \(P(x|y)\) is Poisson, \(P(y=1|x)\) can also be shown to be logistic
    • how to choose: (general)
      • large amount of data \(\Rightarrow\) logistic regression
      • why still use GDA: it is computationally efficient
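A minimal sketch of GDA for a scalar feature (\(n=1\)), so that \(\Sigma\) reduces to a single shared variance `var`. It fits the closed-form estimates above and then checks that the posterior has the logistic form \(\frac 1{1+e^{-(\theta_0+\theta_1x)}}\), with \(\theta_1=\frac{\mu_1-\mu_0}{\sigma^2}\) and \(\theta_0=\frac{\mu_0^2-\mu_1^2}{2\sigma^2}+\log\frac\phi{1-\phi}\) obtained by expanding the log-odds. The toy data and all names are assumptions for illustration.

```python
import math

def fit_gda(xs, ys):
    """Closed-form maximum-likelihood estimates for 1-D GDA with a shared variance."""
    m = len(xs)
    n1 = sum(ys)
    n0 = m - n1
    phi = n1 / m
    mu0 = sum(x for x, y in zip(xs, ys) if y == 0) / n0
    mu1 = sum(x for x, y in zip(xs, ys) if y == 1) / n1
    var = sum((x - (mu1 if y == 1 else mu0)) ** 2 for x, y in zip(xs, ys)) / m
    return phi, mu0, mu1, var

def gaussian(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior(x, phi, mu0, mu1, var):
    """P(y = 1 | x) by Bayes rule."""
    p1 = gaussian(x, mu1, var) * phi
    p0 = gaussian(x, mu0, var) * (1 - phi)
    return p1 / (p0 + p1)

xs = [1.0, 1.5, 2.0, 2.5, 4.0, 4.5, 5.0, 5.5]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
phi, mu0, mu1, var = fit_gda(xs, ys)

# Equivalent logistic-regression parameters implied by the fitted GDA model.
theta1 = (mu1 - mu0) / var
theta0 = (mu0 ** 2 - mu1 ** 2) / (2 * var) + math.log(phi / (1 - phi))
logistic_at_3 = 1.0 / (1.0 + math.exp(-(theta0 + theta1 * 3.0)))
```

Fitting needs only counts and averages, with no iterative optimization, which is where the computational efficiency comes from.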
  • Naive Bayes:
    • application field: text classification
    • feature vector \(x \in \{0,1\}^n\)
      • \(x_i=[\) word \(i\) appears in email \(]\)
    • if \(P(x|y)\) is modeled directly over all \(2^n\) possible values of \(x\): about \(2^n\) parameters are needed (too many)
    • Assume the \(x_i\) are conditionally independent given \(y\)
      • \(P(x_1,x_2,...,x_n|y)=\prod\limits_{i=1}^n P(x_i|x_1,...,x_{i-1},y)=\prod\limits_{i=1}^n P(x_i|y)\)
      • not strictly true mathematically, but not so wrong that the model has to be given up
    • parameters:
      • \(\phi_{j|y=1}=P(x_j=1|y=1)\)
      • \(\phi_{j|y=0}=P(x_j=1|y=0)\)
      • \(\phi_y=P(y=1)\)
    • joint likelihood: \(L(\phi_y,\phi_{j|y})=\prod\limits_{i=1}^m P(x^{(i)},y^{(i)};\phi_y,\phi_{j|y})\)
      • MLE:
        • \(\phi_y=\frac{\sum\limits_{i=1}^m [y^{(i)}=1]}m\)
        • \(\phi_{j|y=1}={\sum\limits_{i=1}^m [x_j^{(i)}=1,y^{(i)}=1] \over \sum\limits_{i=1}^m[y^{(i)}=1]}\)
        • \(\phi_{j|y=0}={\sum\limits_{i=1}^m [x_j^{(i)}=1,y^{(i)}=0] \over \sum\limits_{i=1}^m[y^{(i)}=0]}\)
    • actually not so bad (update while testing)
    • problem: an estimate can be \(0\) (when a word has never been seen in training)
      • solution: Laplace smoothing
        • for \(X \in \{1,...,k\}\), estimate \(P(X=j)={\sum\limits_{i=1}^m[x^{(i)}=j]+1 \over m+k}\)
        • in naive Bayes (\(\phi_{j|y=0}\) for example): \(\phi_{j|y=0}={\sum\limits_{i=1}^m[x_j^{(i)}=1,y^{(i)}=0]+1 \over \sum\limits_{i=1}^m [y^{(i)}=0]+2}\)
    • applied in multinomial: \(P(x|y)=\prod\limits_{j=1}^n P(x_j|y)\)
      • a new representation for text feature: \(x=(x_i)_n\) (multinomial event model)
        • previous representation: multivariate Bernoulli event model
        • \(n\): the length of the text
        • \(x_i\): the i-th word's index in the dictionary
      • parameters:
        • \(\phi_y=P(y=1)\)
        • \(\phi_{k|y=0}=P(x_j=k|y=0)\) (assumed to be the same for every position \(j\))
        • \(\phi_{k|y=1}=P(x_j=k|y=1)\) (assumed to be the same for every position \(j\))
      • MLE parameters:
        • \(\phi_{k|y=0}={\sum\limits_{i=1}^m \left([y^{(i)}=0]\sum\limits_{j=1}^{n_i}[x_j^{(i)}=k]\right) \over \sum\limits_{i=1}^m[y^{(i)}=0]n_i}\)
        • \(\phi_{k|y=1}={\sum\limits_{i=1}^m \left([y^{(i)}=1]\sum\limits_{j=1}^{n_i}[x_j^{(i)}=k]\right) \over \sum\limits_{i=1}^m[y^{(i)}=1]n_i}\)
        • \(\phi_y={\sum\limits_{i=1}^m [y^{(i)}=1] \over m}\)
      • MLE after smoothing:
        • \(\phi_{k|y=0}={\sum\limits_{i=1}^m \left([y^{(i)}=0]\sum\limits_{j=1}^{n_i}[x_j^{(i)}=k]\right)+1 \over \sum\limits_{i=1}^m[y^{(i)}=0]n_i+10000}\) (10000 is the dictionary size, i.e. the number of possible values of each \(x_j\))
        • \(\phi_{k|y=1}={\sum\limits_{i=1}^m \left([y^{(i)}=1]\sum\limits_{j=1}^{n_i}[x_j^{(i)}=k]\right)+1 \over \sum\limits_{i=1}^m[y^{(i)}=1]n_i+10000}\)
        • \(\phi_y={\sum\limits_{i=1}^m [y^{(i)}=1] +1\over m+2}\)
    • extension: another word-expressing technique: word embedding
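A minimal sketch of naive Bayes with the multivariate Bernoulli event model and Laplace smoothing on \(\phi_{j|y}\) (numerator \(+1\), denominator \(+2\)). The three-word vocabulary, the four toy emails, and the function names are hypothetical and only serve the illustration.

```python
def fit_naive_bayes(xs, ys):
    """Return (phi_y, phi_j|y=1, phi_j|y=0), with Laplace smoothing on the phi_j|y."""
    m, n = len(xs), len(xs[0])
    n1 = sum(ys)
    n0 = m - n1
    phi_y = n1 / m
    phi1 = [(sum(x[j] for x, y in zip(xs, ys) if y == 1) + 1) / (n1 + 2) for j in range(n)]
    phi0 = [(sum(x[j] for x, y in zip(xs, ys) if y == 0) + 1) / (n0 + 2) for j in range(n)]
    return phi_y, phi1, phi0

def posterior(x, phi_y, phi1, phi0):
    """P(y = 1 | x) by Bayes rule under the conditional-independence assumption."""
    p1, p0 = phi_y, 1 - phi_y
    for xj, a, b in zip(x, phi1, phi0):
        p1 *= a if xj == 1 else 1 - a
        p0 *= b if xj == 1 else 1 - b
    return p1 / (p1 + p0)

# Hypothetical vocabulary: ["buy", "meeting", "free"]; y = 1 means spam.
xs = [[1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 1]]
ys = [1, 1, 0, 0]
phi_y, phi1, phi0 = fit_naive_bayes(xs, ys)
p_spam = posterior([1, 0, 1], phi_y, phi1, phi0)
```

Without smoothing, "meeting" would get probability \(0\) under spam and any email containing it could never be classified as spam; with smoothing the estimate is small but non-zero.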

Support Vector Machine

  • optimal margin classifier
    • functional margin
      • Assume \(h_\theta(x)=g(\theta^Tx),g(z)=\frac 1{1+e^{-z}}\)
      • Predict \(1\) if \(\theta^Tx \geq 0\) and \(0\) otherwise
      • If \(y^{(i)}=1\), hope that \(\theta^Tx \gg 0\)
      • If \(y^{(i)}=0\), hope that \(\theta^Tx \ll 0\)
    • Geometric margin: a separating line that is farther from the data points is better
      • what the optimal margin classifier does: find the separating line whose distance to the closest data points is as large as possible
    • notations:
      • \(y \in \{-1,+1\}\)
      • have \(h\) output values in \(\{-1,1\}\)
      • \(g(z)=[z \geq 0]-[z<0]\)
      • \(h_\theta(x)=g(w^Tx+b)\)
        • \(x \in \R^n\) (drop \(x_0=1\) convention)
        • \(b \in \R\)
      • \(\theta=(\theta_{i-1})_{n+1}^T\)
        • \(b=\theta_0\)
        • \(w=(\theta_i)^T_n\)
    • functional margin of the hyperplane defined by \((w,b)\): \(\hat\gamma\)
      • \(\hat\gamma^{(i)}=y^{(i)}(w^Tx^{(i)}+b)\)
        • If \(y^{(i)}=1\), then want \(w^Tx^{(i)}+b \gg 0\)
        • If \(y^{(i)}=-1\), then want \(w^Tx^{(i)}+b \ll 0\)
      • summary: hope \(\hat\gamma^{(i)} \gg 0\)
        • If \(\hat\gamma^{(i)}>0\), that means \(h(x^{(i)})=y^{(i)}\)
      • functional margin with respect to the training set: \(\hat\gamma=\min\limits_{i=1,...,m} \hat\gamma^{(i)}\)
    • a way to cheat the functional margin: multiply \(w\) and \(b\) by the same positive constant; \(\hat\gamma\) scales by that constant while the decision boundary stays the same
      • solution: normalize, \((w,b) \rightarrow (\frac w{\parallel w\parallel},\frac b{\parallel w\parallel})\)
    • geometric margin: the distance between \((x^{(i)},y^{(i)})\) and the line \(w^Tx+b=0\)
      • formalize: geometric margin of hyperplane \((w,b)\) with respect to \((x^{(i)},y^{(i)})\)
        • \(\gamma^{(i)}={w^Tx^{(i)}+b \over \parallel w\parallel}\) (for a positive example, \(y^{(i)}=1\))
        • more generally: \(\gamma^{(i)}={y^{(i)}(w^Tx^{(i)}+b) \over \parallel w\parallel}\)
          • \(\gamma^{(i)}={\hat\gamma^{(i)} \over \parallel w\parallel}\)
      • geometric margin with training set: \(\gamma=\min\limits_i \gamma^{(i)}\)
    • optimal margin classifier: choose \(w,b\) to maximize \(\gamma\)
      • one way to implement it: \(\max\limits_{\gamma,w,b}\gamma\ {\sf s.t.}\ {y^{(i)}(w^Tx^{(i)}+b) \over \parallel w\parallel} \geq \gamma\) for all \(i\) \(\Rightarrow \min\limits_{w,b}\parallel w\parallel^2\ {\sf s.t.}\ y^{(i)}(w^Tx^{(i)}+b) \geq 1\) for all \(i\)
        (a convex optimization problem)
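A minimal sketch that computes the functional and geometric margins of a fixed hyperplane \((w,b)\) and shows that rescaling \((w,b)\) changes the former but not the latter. The data points and the hyperplane are hypothetical, and the sketch only evaluates margins; it does not solve the optimization problem.

```python
import math

def functional_margin(w, b, x, y):
    """gamma_hat^(i) = y^(i) * (w^T x^(i) + b), with y in {-1, +1}."""
    return y * (sum(wj * xj for wj, xj in zip(w, x)) + b)

def geometric_margin(w, b, xs, ys):
    """gamma = min_i y^(i) * (w^T x^(i) + b) / ||w||."""
    norm = math.sqrt(sum(wj * wj for wj in w))
    return min(functional_margin(w, b, x, y) for x, y in zip(xs, ys)) / norm

xs = [[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]]
ys = [1, 1, -1, -1]
w, b = [1.0, 1.0], -2.0

gamma = geometric_margin(w, b, xs, ys)
# Scaling (w, b) by 10 multiplies every functional margin by 10 ...
w10, b10 = [10.0 * wj for wj in w], 10.0 * b
# ... but leaves the geometric margin unchanged.
gamma_scaled = geometric_margin(w10, b10, xs, ys)
```

This invariance is what allows the constraint to be normalized to \(y^{(i)}(w^Tx^{(i)}+b) \geq 1\) in the optimization problem above.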
        (to be updated...)
posted @ 2023-04-17 15:45  Sherlocked_hzoi