Formulation of PLSA

There are two ways to formulate PLSA. They are equivalent but may lead to different inference process.

$P(d,w) = P(d) \sum_{z} P(w|z)P(z|d)$
$P(d,w) = \sum_{z} P(w|z)P(d|z)P(z)$

Let’s see why these two equations are equivalent by using Bayes rule.

$P(z|d) = \frac{P(d|z)P(z)}{P(d)}$ $P(z|d)P(d) =P(d|z)P(z)$ $P(w|z)P(z|d)P(d) =P(w|z)P(d|z)P(z)$ $P(d) \sum_{z} P(w|z)P(z|d) = \sum_{z} P(w|z)P(d|z)P(z)$

The whole data set is generated as (we assume that all words are generated independently):

$D = \prod_{d} \prod_{w} P(d,w)^{n(d,w)}$

The Log-likelihood of the whole data set for (1) and (2) are:

$L_{1} = \sum_{d} \sum_{w} n(d,w) \log [ P(d) \sum_{z} P(w|z)P(z|d) ]$

$L_{2} = \sum_{d} \sum_{w} n(d,w) \log [ \sum_{z} P(w|z)P(d|z)P(z) ]$

EM

For $L_{1}$ or $L_{2}$ , the optimization is hard due to the log of sum. Therefore, an algorithm called Expectation-Maximization is usually employed. Before we introduce anything about EM, please note that EM is only guarantee to find a local optimum (although it may be a global one).

First, we see how EM works in general. As we shown for PLSA, we usually want to estimate the likelihood of data, namely $P(X|\theta)$ , given the paramter $\theta$ . The easiest way is to obtain a maximum likelihood estimator by maximizing $P(X|\theta)$ . However, sometimes, we also want to include some hidden variables which are usually useful for our task. Therefore, what we really want to maximize is $P(X|\theta)=\sum_{z}P(X|z,\theta)P(z|\theta)$ , the complete likelihood. Now, our attention becomes to this complete likelihood. Again, directly maximizing this likelihood is usually difficult. What we would like to show here is to obtain a lower bound of the likelihood and maximize this lower bound.

We need Jensen’s Inequality to help us obtain this lower bound. For any convex function $f(x)$ , Jensen’s Inequality states that :

$\lambda f(x) + (1-\lambda) f(y) \geq f(\lambda x + (1-\lambda) y)$

Thus, it is not difficult to show that :

$E[f(x)] = \sum_{x} P(x) f(x) \geq f(\sum_{x} P(x) x) = f(E[x])$

and for concave functions (like logarithm), it is :

$E[f(x)] \leq f(E[x])$

Back to our complete likelihood, we can obtain the following conclusion by using concave version of Jensen’s Inequality :

$\log \sum_{z}P(X|z,\theta)P(z|\theta)= \log \sum_{z}P(X|z,\theta)P(z|\theta)\frac{q(z)}{q(z)}$ $= \log E[\frac{P(X|z,\theta)P(z|\theta)}{q(z)}]$ $\geq E[\log \frac{P(X|z,\theta)P(z|\theta)}{q(z)}]$

Therefore, we obtained a lower bound of complete likelihood and we want to maximize it as tight as possible. EM is an algorithm that maximize this lower bound through a iterative fashion. Usually, EM first would fix current $\theta$ value and maximize $q(z)$ and then use the new $q(z)$ value to obtain a new guess on $\theta$ , which is essentially a two stage maximization process. The first step can be shown as follows:

$E[\log \frac{P(X|z,\theta)P(z|\theta)}{q(z)}] = \sum_{z} q(z) \log \frac{P(X|z,\theta)P(z|\theta)}{q(z)}$ $= \sum_{z} q(z) \log \frac{P(z|X,\theta)P(X,\theta)}{q(z)}$ $= \sum_{z} q(z) \log P(x,\theta) + \sum_{z} q(z) \log \frac{P(z|X,\theta)}{q(z)}$ $= \log P(x,\theta) - \sum_{z} q(z) \log \frac{q(z)}{P(z|X,\theta)}$ $= \log P(x,\theta) - KL(q(z) || P(z|X,\theta))$

The first term is the same for all $z$ . Therefore, in order to maximize the whole equation, we need to minimize KL divergence between $q(z)$ and $P(z|X,\theta)$ , which eventually leads to the optimum solution of $q(z) = P(z|X,\theta)$ . So, usually for E-step, we use current guess of $\theta$ to calculate the posterior distribution of hidden variable as the new update score. For M-step, it is problem-dependent. We will see how to do that in later discussions.

Another explanation of EM is in terms of optimizing a so-called Q function. We devise the data generation process as $P(X|\theta)=P(X,H|\theta)=P(H|X,\theta)P(X|\theta)$ . Therefore, the complete likelihood is modified as:

$L_{c}(\theta) = \log P(X,H|\theta) = \log P(X|\theta) + \log P(H|X,\theta) = L(\theta) + \log P(H|X,\theta)$

Think about how to maximize $L_{c}(\theta)$ . Instead of directly maximizing it, we can iteratively maximize $L_{c}(\theta^{(n+1)})-L_{c}(\theta^{(n)})$ as :

$L(\theta) - L(\theta^{(n)}) = L_{c}(\theta) - \log P(H|X,\theta) - L_{c}(\theta^{(n)}) + \log P(H|X,\theta^{(n)})$ $= L_{c}(\theta) - L_{c}(\theta^{(n)}) + \log \frac{P(H|X,\theta^{(n)})}{P(H|X,\theta)}$

Now take the expectation of this equation, we have:

$L(\theta) - L(\theta^{(n)}) = \sum_{H} L_{c}(\theta)P(H|X,\theta^{(n)}) - \sum_{H} L_{c}(\theta^{(n)})P(H|X,\theta^{(n)}) + \sum_{H} P(H|X,\theta^{(n)})\log \frac{P(H|X,\theta^{(n)})}{P(H|X,\theta)}$

The last term is always non-negative since it can be recognized as the KL-divergence of $P(H|X,\theta^{(n)}$ and $P(H|X,\theta)$ . Therefore, we obtain a lower bound of Likelihood :

$L(\theta) \geq \sum_{H} L_{c}(\theta)P(H|X,\theta^{(n)}) + L(\theta^{(n)}) - \sum_{H} L_{c}(\theta^{(n)})P(H|X,\theta^{(n)})$

The last two terms can be treated as constants as they do not contain the variable $\theta$ , so the lower bound is essentially the first term, which is also sometimes called as “Q-function”. $Q(\theta;\theta^{(n)}) = E(L_{c}(\theta)) = \sum_{H} L_{c}(\theta) P (H|X,\theta^{(n)})$

EM of Formulation 1

In case of Formulation 1, let us introduce hidden variables $R(z,w,d)$ to indicate which hidden topic $z$ is selected to generated $w$ in $d$ ( $\sum_{z} R(z,w,d) = 1$ ). Therefore, the complete likelihood can be formulated as :

$L_{c1} = \sum_{d} \sum_{w} n(d,w) \sum_{z} R(z,w,d) \log [ P(d) P(w|z)P(z|d) ]$ $= \sum_{d} \sum_{w} n(d,w) \sum_{z} R(z,w,d) [ \log P(d) + \log P(w|z) + \log P(z|d) ]$

From the equation above, we can write our Q-function for the complete likelihood $E[L_{c1}]$ :

$E[L_{c1}] = \sum_{d} \sum_{w} n(d,w) \sum_{z} P(z|w,d) [ \log P(d) + \log P(w|z) + \log P(z|d) ]$

For E-step, simply using Bayes Rule, we can obtain:

$P(z|w,d) = \frac{P(w|z,d)}{P(w,d)}$ $= \frac{P(w|z)P(z|d)P(d)}{\sum_{z} P(w|z)P(z|d)P(d)}$ $= \frac{P(w|z)P(z|d)}{\sum_{z} P(w|z)P(z|d)}$

For M-step, we need to maximize Q-function, which needs to be incorporated with other constraints:

$H = E[L_{c1}]+ \alpha [1-\sum_{d} P(d) ]+ \beta \sum_{z}[1- \sum_{w} P(w|z)]$ $+\gamma \sum_{d}[1- \sum_{z} P(z|d)]$

and take all derivatives:

$\frac{\partial H}{\partial P(d)} = \sum_{w} \sum_{z} n(d,w) \frac{P(z|w,d)}{P(d)} - \alpha = 0$ $\rightarrow \sum_{w} \sum_{z} n(d,w) P(z|w,d) - \alpha P(d) = 0$ $\frac{\partial H}{\partial P(w|z)} = \sum_{d} n(d,w) \frac{P(z|w,d)}{P(w|z)} - \beta = 0$ $\rightarrow \sum_{d} n(d,w) P(z|w,d) - \beta P(w|z) = 0$ $\frac{\partial H}{\partial P(z|d)} = \sum_{w} n(d,w) \frac{P(z|w,d)}{P(z|d)} - \gamma = 0$ $\rightarrow \sum_{w} n(d,w) P(z|w,d) - \gamma P(z|d) = 0$

Therefore, we can easily obtain:

$P(d) = \frac{\sum_{w} \sum_{z} n(d,w) P(z|w,d)}{\sum_{d} \sum_{w} \sum_{z} n(d,w) P(z|w,d)}$ $= \frac{n(d)}{\sum_{d} n(d)}$ $P(w|z) = \frac{\sum_{d} n(d,w) P(z|w,d)}{\sum_{w} \sum_{d} n(d,w) P(z|w,d) }$ $P(z|d) = \frac{\sum_{w} n(d,w) P(z|w,d)}{\sum_{z} \sum_{w} n(d,w) P(z|w,d) }$ $= \frac{\sum_{w} n(d,w) P(z|w,d)}{n(d)}$

EM of Formulation 2

Use similar method to introduce hidden variables to indicate which $z$ is selected to generated $w$ and $d$ and we can have the following complete likelihood :

$L_{c2} = \sum_{d} \sum_{w} n(d,w) \sum_{z} R(z,w,d) \log [ P(z) P(w|z)P(d|z) ]$ $= \sum_{d} \sum_{w} n(d,w) \sum_{z} R(z,w,d) [ \log P(z) + \log P(w|z) + \log P(d|z) ]$

Therefore, the Q-function $E[L_{c2}]$ would be :

$E[L_{c2}] = \sum_{d} \sum_{w} n(d,w) \sum_{z} P(z|w,d) [ \log P(z) + \log P(w|z) + \log P(d|z) ]$

For E-step, again, simply using Bayes Rule, we can obtain:

$P(z|w,d) = \frac{P(w|z,d)}{P(w,d)}$ $= \frac{P(w|z)P(d|z)P(z)}{\sum_{z} P(w|z)P(d|z)P(z)}$

For M-step, we maximize the constraint version of Q-function:

$H = E[L_{c2}] + \alpha [1-\sum_{z} P(z) ] + \beta \sum_{z}[1- \sum_{w} P(w|z)]+$ $+\gamma \sum_{z} [1- \sum_{d} P(d|z)]$

and take all derivatives:

$\frac{\partial H}{\partial P(z)}= \sum_{d} \sum_{w} n(d,w) \frac{P(z|w,d)}{P(z)} - \alpha = 0$ $\rightarrow \sum_{d} \sum_{w} n(d,w) P(z|w,d) - \alpha P(z)= 0$ $[latex]\rightarrow \sum_{d} n(d,w) P(z|w,d) - \beta P(w|z) = 0$ $\frac{\partial H}{\partial P(d|z)} = \sum_{w} n(d,w) \frac{P(z|w,d)}{P(d|z)} - \gamma = 0$ $\rightarrow \sum_{w} n(d,w) P(z|w,d) - \gamma P(d|z) = 0$

Therefore, we can easily obtain:

$P(z) = \frac{\sum_{d} \sum_{w} n(d,w) P(z|w,d)}{\sum_{d} \sum_{w} \sum_{z} n(d,w) P(z|w,d)}$ $= \frac{\sum_{d} \sum_{w} n(d,w) P(z|w,d)}{\sum_{d} \sum_{w} n(d,w)}$ $P(w|z)= \frac{\sum_{d} n(d,w) P(z|w,d)}{\sum_{w} \sum_{d} n(d,w) P(z|w,d) }$ $P(d|z) = \frac{\sum_{w} n(d,w) P(z|w,d)}{\sum_{d} \sum_{w} n(d,w) P(z|w,d) }$

posted on 2013-09-05 15:58 kalor 阅读(299) 评论(0) 编辑收藏举报

会员力量，点亮园子希望

刷新页面返回顶部

导航

转自：http://www.hongliangjie.com/2010/01/04/notes-on-probabilistic-latent-semantic-analysis-plsa/

Formulation of PLSA

EM

EM of Formulation 1

EM of Formulation 2