CS229: GDA & Naive Bayes
Introduction
All of the learning algorithms covered so far are discriminative learning algorithms; this time we look at generative learning algorithms.
Given a data set with two classes, a discriminative algorithm such as logistic regression (fit by gradient descent) searches directly for a boundary that separates the classes. A generative algorithm does not try to find the separation. Instead, it looks at one class at a time, models each class in isolation, and then checks which model a new test case matches better.
Formalization
- Discriminative: learn a mapping from the features \(x\) directly to the label \(y \in \{0, 1\}\).
- Generative: learn a model that, given the label \(y\), describes its features \(x\), i.e. \(p(x \mid y)\).
- Also learn the probability of each label \(y\) without seeing any features: the class prior \(p(y)\).
- Use Bayes' rule to express the probability that a test case belongs to a certain label:
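\[
p(y \mid x)=\frac{p(x \mid y)\, p(y)}{p(x)}.
\]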
GDA (Gaussian Discriminant Analysis)
Assumptions
Assume that the distribution of the features within each class is Gaussian.
- A multivariate Gaussian distribution \(\mathcal{N}(\mu, \Sigma)\) has a mean vector \(\mu \in \mathbb{R}^{d}\) and a covariance matrix \(\Sigma \in \mathbb{R}^{d \times d}\), which is symmetric and positive semi-definite.
- Its density is
\[
p(x ; \mu, \Sigma)=\frac{1}{(2 \pi)^{d / 2}|\Sigma|^{1 / 2}} \exp \left(-\frac{1}{2}(x-\mu)^{T} \Sigma^{-1}(x-\mu)\right),
\]
where \(\mathrm{E}[X]=\int_{x} x\, p(x ; \mu, \Sigma)\, d x=\mu\) and \(\operatorname{Cov}(X)=\mathrm{E}\left[X X^{T}\right]-(\mathrm{E}[X])(\mathrm{E}[X])^{T}=\Sigma\).
- The entries of \(\mu\) describe where the central point of the distribution is located. The diagonal entries of \(\Sigma\) control how spread out the density is along each coordinate axis, while the off-diagonal entries "compress" or tilt the contours in different directions and to different degrees, depending on their sign and magnitude (noting that the matrix is always symmetric).
Model
Writing out the distributions, this is:
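With \(\phi = p(y=1)\) as the class-prior parameter:
\[
\begin{aligned}
y & \sim \operatorname{Bernoulli}(\phi), \\
x \mid y=0 & \sim \mathcal{N}\left(\mu_{0}, \Sigma\right), \\
x \mid y=1 & \sim \mathcal{N}\left(\mu_{1}, \Sigma\right),
\end{aligned}
\]
i.e. \(p(y)=\phi^{y}(1-\phi)^{1-y}\), and \(p(x \mid y=0)\), \(p(x \mid y=1)\) are the Gaussian densities above with means \(\mu_{0}\), \(\mu_{1}\) and shared covariance \(\Sigma\).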
(only one covariance matrix \(\Sigma\) is shared by both classes, meaning the two class-conditional Gaussians have the same shape and differ only in their means)
The log likelihood of the data is given by
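\[
\ell\left(\phi, \mu_{0}, \mu_{1}, \Sigma\right)=\log \prod_{i=1}^{m} p\left(x^{(i)}, y^{(i)} ; \phi, \mu_{0}, \mu_{1}, \Sigma\right)=\log \prod_{i=1}^{m} p\left(x^{(i)} \mid y^{(i)} ; \mu_{0}, \mu_{1}, \Sigma\right)\, p\left(y^{(i)} ; \phi\right),
\]
where \(m\) is the number of training examples \(\left(x^{(i)}, y^{(i)}\right)\); note that this is the joint likelihood \(p(x, y)\), not the conditional likelihood \(p(y \mid x)\) maximized in logistic regression.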
By maximizing it, the parameters are found to be:
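\[
\begin{aligned}
\phi &=\frac{1}{m} \sum_{i=1}^{m} 1\left\{y^{(i)}=1\right\} \\
\mu_{0} &=\frac{\sum_{i=1}^{m} 1\left\{y^{(i)}=0\right\} x^{(i)}}{\sum_{i=1}^{m} 1\left\{y^{(i)}=0\right\}} \\
\mu_{1} &=\frac{\sum_{i=1}^{m} 1\left\{y^{(i)}=1\right\} x^{(i)}}{\sum_{i=1}^{m} 1\left\{y^{(i)}=1\right\}} \\
\Sigma &=\frac{1}{m} \sum_{i=1}^{m}\left(x^{(i)}-\mu_{y^{(i)}}\right)\left(x^{(i)}-\mu_{y^{(i)}}\right)^{T}
\end{aligned}
\]
where \(1\{\cdot\}\) is the indicator function; each parameter is just the corresponding empirical average over the training set.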
The decision boundary is the set of points where a new test case is equally probable under the two classes, i.e. where \(P(y=1 \mid x)=P(y=0 \mid x)\).
To predict the probability that a new case belongs to a certain class, we use Bayes' rule to compute \(P(y \mid x)\): \(P(x \mid y)\) comes from the fitted Gaussian densities, \(P(y)\) from the class prior \(\phi\) (the proportion of each label in the training set), and \(P(x)\) from the law of total probability, \(P(x)=\sum_{y} P(x \mid y) P(y)\). Since \(P(x)\) is the same for both classes, it can be dropped when we only need the most likely class.
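A minimal sketch of the fitting and prediction steps in Python (NumPy only; gda_fit, gda_predict_proba and gaussian_density are illustrative names, not from the notes):

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Multivariate Gaussian density N(mu, sigma) evaluated at a single point x."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

def gda_fit(X, y):
    """Closed-form maximum-likelihood estimates for GDA with a shared covariance."""
    m = len(y)
    phi = np.mean(y == 1)                      # class prior p(y = 1)
    mu0 = X[y == 0].mean(axis=0)               # mean of class 0
    mu1 = X[y == 1].mean(axis=0)               # mean of class 1
    centered = X - np.where(y[:, None] == 1, mu1, mu0)
    sigma = centered.T @ centered / m          # shared covariance matrix
    return phi, mu0, mu1, sigma

def gda_predict_proba(x, phi, mu0, mu1, sigma):
    """P(y = 1 | x) by Bayes' rule; p(x) appears as the normalizing denominator."""
    p1 = gaussian_density(x, mu1, sigma) * phi
    p0 = gaussian_density(x, mu0, sigma) * (1 - phi)
    return p1 / (p0 + p1)
```

Thresholding gda_predict_proba at 0.5 gives exactly the decision boundary described above.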
Compare GDA and logistic regression
If we view the quantity \(p\left(y=1 \mid x ; \phi, \mu_{0}, \mu_{1}, \Sigma\right)\) as a function of x, we’ll find that it can be expressed in the form
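\[
p\left(y=1 \mid x ; \phi, \mu_{0}, \mu_{1}, \Sigma\right)=\frac{1}{1+\exp \left(-\theta^{T} x\right)},
\]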
where \(\theta\) is some appropriate function of \(\phi, \Sigma, \mu_{0}, \mu_{1}\) (using the convention of adding an intercept coordinate \(x_{0}=1\) to \(x\)). This is exactly the functional form of logistic regression.
Note that the converse does not hold: from the logistic form of \(p(y=1 \mid x)\) we cannot infer that the class-conditional distributions of the features are Gaussian. GDA therefore makes a strictly stronger assumption than logistic regression. If the assumption is correct, GDA will do better than the weaker logistic regression; if it is badly wrong, using GDA can be a disaster. (For example, if \(x \mid y\) is actually Poisson, \(p(y \mid x)\) is still logistic, so logistic regression keeps working, while GDA may do poorly.)
Choosing a model therefore depends on the amount of data available (the smaller the data set, the more we must rely on modeling assumptions) and on the balance between a model's generality and its effectiveness. In other words, an algorithm's strength comes from both the data and the knowledge built into it: with abundant data we can afford weaker assumptions in exchange for robustness, while with limited data the stronger assumptions of GDA pay off, since it is more data-efficient when those assumptions hold. GDA is also easy to fit: its parameters have closed-form estimates, so no iterative optimization is needed.
Naive Bayes
In GDA, the feature vectors \(x\) were continuous, real-valued vectors. Let's now talk about a different learning algorithm in which the \(x_{i}\)'s are discrete-valued.
For example, when building a classifier to filter spam emails, we encode each email as a binary feature vector whose length is the size of a chosen vocabulary, with a 1 wherever the corresponding word appears in the email. Modeling \(p(x \mid y)\) explicitly over all possible vectors would require on the order of \(2^{|V|}\) parameters, far too many to estimate.
So we make a strong assumption: the \(x_{i}\)'s are conditionally independent given \(y\). This assumption is called the Naive Bayes (NB) assumption, and we now have:
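\[
p\left(x_{1}, \ldots, x_{n} \mid y\right)=\prod_{j=1}^{n} p\left(x_{j} \mid y\right),
\]
writing \(n\) for the vocabulary size and \(x_{j}\) for the \(j\)-th coordinate of \(x\).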
We can write down the joint likelihood of the data and maximize it; with parameters \(\phi_{j \mid y=1}=p\left(x_{j}=1 \mid y=1\right)\), \(\phi_{j \mid y=0}=p\left(x_{j}=1 \mid y=0\right)\) and \(\phi_{y}=p(y=1)\), the maximum-likelihood estimates are:
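\[
\begin{aligned}
\phi_{j \mid y=1} &=\frac{\sum_{i=1}^{m} 1\left\{x_{j}^{(i)}=1 \wedge y^{(i)}=1\right\}}{\sum_{i=1}^{m} 1\left\{y^{(i)}=1\right\}} \\
\phi_{j \mid y=0} &=\frac{\sum_{i=1}^{m} 1\left\{x_{j}^{(i)}=1 \wedge y^{(i)}=0\right\}}{\sum_{i=1}^{m} 1\left\{y^{(i)}=0\right\}} \\
\phi_{y} &=\frac{\sum_{i=1}^{m} 1\left\{y^{(i)}=1\right\}}{m}
\end{aligned}
\]
These are just the natural frequencies: for example, \(\phi_{j \mid y=1}\) is the fraction of spam (\(y=1\)) emails in which word \(j\) appears.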
Having fit these parameters, we simply make a prediction on a new test case by calculating
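\[
p(y=1 \mid x)=\frac{\left(\prod_{j=1}^{n} p\left(x_{j} \mid y=1\right)\right) p(y=1)}{\left(\prod_{j=1}^{n} p\left(x_{j} \mid y=1\right)\right) p(y=1)+\left(\prod_{j=1}^{n} p\left(x_{j} \mid y=0\right)\right) p(y=0)}
\]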
and picking the class with the higher probability.
For wider applicability:
- If a feature can take values in \(\{1,2,3,\cdots,k\}\), simply model \(p(x_{j} \mid y)\) as multinomial rather than Bernoulli.
- If some original input attribute is continuous-valued, it is quite common to discretize it, that is, turn it into a small set of discrete values, and then apply Naive Bayes.
Laplace smoothing
When a test email contains a word that never appeared in the training set, the prediction formula runs into a \(\dfrac{0}{0}\) problem: because the algorithm has never seen the word before, its estimated probability is zero in both classes, so both the numerator and the denominator become zero.
To avoid this, we can use Laplace smoothing, which replaces the above estimate with
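\[
\phi_{j}=\frac{1+\sum_{i=1}^{m} 1\left\{z^{(i)}=j\right\}}{k+m}
\]
for a multinomial random variable \(z\) taking values in \(\{1, \ldots, k\}\), estimated from \(m\) observations \(z^{(1)}, \ldots, z^{(m)}\): add 1 to every count in the numerator and \(k\) to the denominator, so that no estimate is exactly zero while the \(\phi_{j}\) still sum to one,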
and we therefore obtain the following estimates of the parameters:
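\[
\begin{aligned}
\phi_{j \mid y=1} &=\frac{\sum_{i=1}^{m} 1\left\{x_{j}^{(i)}=1 \wedge y^{(i)}=1\right\}+1}{\sum_{i=1}^{m} 1\left\{y^{(i)}=1\right\}+2} \\
\phi_{j \mid y=0} &=\frac{\sum_{i=1}^{m} 1\left\{x_{j}^{(i)}=1 \wedge y^{(i)}=0\right\}+1}{\sum_{i=1}^{m} 1\left\{y^{(i)}=0\right\}+2}
\end{aligned}
\]
(each \(x_{j}\) is binary, so \(k=2\) in the denominator; \(\phi_{y}\) is usually left unsmoothed, since in practice the training set contains a reasonable fraction of each class).

A minimal end-to-end sketch of the classifier with Laplace smoothing in Python (NumPy only; nb_fit and nb_predict are illustrative names, and computing in log space is a standard practical detail not spelled out above):

```python
import numpy as np

def nb_fit(X, y):
    """Bernoulli Naive Bayes with Laplace (+1/+2) smoothing.
    X: (m, n) binary matrix of word occurrences, y: (m,) labels in {0, 1}."""
    phi_y = np.mean(y == 1)                                        # class prior, left unsmoothed
    phi_j_1 = (X[y == 1].sum(axis=0) + 1) / ((y == 1).sum() + 2)   # p(x_j = 1 | y = 1)
    phi_j_0 = (X[y == 0].sum(axis=0) + 1) / ((y == 0).sum() + 2)   # p(x_j = 1 | y = 0)
    return phi_y, phi_j_0, phi_j_1

def nb_predict(x, phi_y, phi_j_0, phi_j_1):
    """Return 1 if p(y=1 | x) > p(y=0 | x), comparing log-probabilities to avoid underflow."""
    log_p1 = np.log(phi_y) + np.sum(x * np.log(phi_j_1) + (1 - x) * np.log(1 - phi_j_1))
    log_p0 = np.log(1 - phi_y) + np.sum(x * np.log(phi_j_0) + (1 - x) * np.log(1 - phi_j_0))
    return int(log_p1 > log_p0)
```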
