Markov Chain
\(\mathbf p(t)\): the probability distribution at time \(t\); \(P\): the transition matrix.
\(\mathbf p(t)P=\mathbf p(t+1)\), and \(\mathbf a(t)=\frac{1}{t}(\mathbf p(0)+\mathbf p(1)+\cdots+\mathbf p(t-1))\) is the running average distribution.
Lemma 4.1 Let \(P\) be the transition probability matrix for a connected Markov chain. The \(n\times (n+1)\) matrix \(A=[P-I,\mathbf 1]\) obtained by augmenting the matrix \(P-I\) with an additional column of ones has rank \(n\).
Proof. Let \(\mathbf z=(\mathbf x,\alpha)\) be a solution to \(A\mathbf z=\mathbf 0\). Then \((P-I)\mathbf x+\alpha\mathbf 1=\mathbf 0\), so, for each \(i\), \(x_i=\sum_jp_{ij}x_j+\alpha\) (note that \(\sum_jp_{ij}=1\) for all \(i\)). Taking \(i\) with \(x_i=\max_j x_j\) gives \(\alpha\ge 0\); taking \(i\) with \(x_i=\min_j x_j\) gives \(\alpha\le 0\). Hence \(\alpha=0\), and then at a state \(i\) attaining the maximum, \(x_i=\sum_jp_{ij}x_j\) forces \(x_j=x_i\) for every \(j\) with \(p_{ij}>0\); since the chain is connected, \(\mathbf x\) must be a constant multiple of \(\mathbf 1\). That is, the solution space of \(A\mathbf z=\mathbf 0\) has dimension \(1\), so \(A\) has rank \(n\). \(\square\)
Theorem 4.2 (Fundamental Theorem of Markov Chains) For a connected Markov chain, there is a unique probability vector \(\pi\) satisfying \(\pi P=\pi\). Moreover, for any starting distribution, \(\lim\limits_{t\to\infty}\mathbf a(t)\) exists and equals \(\pi\).
Proof. Note that \(\mathbf a(t)\) is itself a probability vector, since its components are nonnegative and sum to \(1\). Running one step of the Markov chain starting from the distribution \(\mathbf a(t)\) yields \(\mathbf a(t)P\). The change is
\[\mathbf a(t)P-\mathbf a(t)=\frac{1}{t}\left(\mathbf p(1)+\cdots+\mathbf p(t)\right)-\frac{1}{t}\left(\mathbf p(0)+\cdots+\mathbf p(t-1)\right)=\frac{1}{t}\left(\mathbf p(t)-\mathbf p(0)\right).\]
Thus \(\mathbf b(t)=\mathbf a(t)P-\mathbf a(t)\) satisfies \(\|\mathbf b(t)\|_1\le \frac2t\to 0\) as \(t\to\infty\).
By Lemma 4.1, \(A=[P-I,\mathbf 1]\) has rank \(n\). Since each row of \(P-I\) sums to zero, the columns of \(P-I\) sum to \(\mathbf 0\), so dropping the first column does not reduce the rank; hence the \(n\times n\) submatrix \(B\) of \(A\) consisting of all its columns except the first is invertible. Since \(\mathbf a(t)A=[\mathbf b(t),1]\), we have \(\mathbf a(t)B=[\mathbf c(t),1]\), where \(\mathbf c(t)\) is \(\mathbf b(t)\) with its first entry removed. Now \(\mathbf a(t)=[\mathbf c(t),1]B^{-1}\to [\mathbf 0,1]B^{-1}\), which proves the theorem with \(\pi=[\mathbf 0,1]B^{-1}\): it is a probability vector as a limit of probability vectors, it satisfies \(\pi P=\pi\), and it is unique because \(\pi B=[\mathbf 0,1]\) determines \(\pi\). \(\square\)
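As a quick sanity check of Theorem 4.2 (the chain and tolerance below are my own choices, not from the text), here is a pure-Python simulation of a simple random walk on the path \(0\)-\(1\)-\(2\). This chain is periodic, so \(\mathbf p(t)\) oscillates forever, yet the running average \(\mathbf a(t)\) still converges to \(\pi=(\frac14,\frac12,\frac14)\):

```python
# Sanity check of Theorem 4.2: simple random walk on the path 0-1-2.
# The chain is periodic, so p(t) oscillates forever, but the running
# average a(t) still converges to pi = degree/(2m) = (1/4, 1/2, 1/4).
P = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]

def step(p, P):
    """One step of the chain: p(t+1) = p(t) P."""
    n = len(p)
    return [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]

def running_average(p0, P, T):
    """a(T) = (p(0) + p(1) + ... + p(T-1)) / T."""
    p, total = list(p0), [0.0] * len(p0)
    for _ in range(T):
        total = [s + x for s, x in zip(total, p)]
        p = step(p, P)
    return [s / T for s in total]

a = running_average([1.0, 0.0, 0.0], P, 20000)
print(a)  # close to [0.25, 0.5, 0.25]
```

This illustrates why the theorem is stated for \(\mathbf a(t)\) rather than \(\mathbf p(t)\): here \(\mathbf p(t)\) alternates between two distributions and never converges, while the average does.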
Lemma 4.3 For a random walk on a strongly connected graph with probabilities on the edges, if the vector \(\pi\) satisfies \(\pi_xp_{xy}=\pi_yp_{yx}\) for all \(x\) and \(y\), and \(\sum_x\pi_x=1\), then \(\pi\) is the stationary distribution of the walk.
Proof. Summing the detailed-balance condition over \(x\) gives \((\pi P)_y=\sum_x\pi_xp_{xy}=\sum_x\pi_yp_{yx}=\pi_y\), so \(\pi P=\pi\). \(\square\)
The MCMC (Markov Chain Monte Carlo) method is used to estimate the expected value of a function \(f(\mathbf x)\), where \(\mathbf x=(x_1,\cdots,x_d)\) follows a multivariate probability distribution \(p(\mathbf x)\). That is, estimating \(E(f)=\sum_{\mathbf x}f(\mathbf x)p(\mathbf x)\).
To sample according to \(p(\mathbf x)\), design a Markov chain whose states correspond to the possible values of \(\mathbf x\) and whose stationary distribution is \(p(\mathbf x)\). By Theorem 4.2, the average of \(f\) over the states seen in a sufficiently long run is a good estimate of \(E(f)\).
There are two general techniques for designing such a Markov chain: the Metropolis-Hastings algorithm and Gibbs sampling.
Metropolis-Hastings algorithm
The algorithm first connects all the states into a graph; let \(r\) be the maximum degree of the graph. Then, for adjacent states \(i,j\), take
\[p_{ij}=\frac{1}{r}\min\left(1,\frac{p_j}{p_i}\right)\quad(j\ne i),\qquad p_{ii}=1-\sum_{j\ne i}p_{ij},\]
so that \(p_ip_{ij}=\frac{1}{r}\min(p_i,p_j)=p_jp_{ji}\), i.e. detailed balance holds with respect to the target distribution.
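A minimal sketch of this scheme (the 4-state cycle, the weight vector \(w\), and the step count are illustrative assumptions, not from the text). On a cycle every state has degree \(r=2\), so proposing a uniformly random neighbor and accepting with probability \(\min(1,w_j/w_i)\) realizes exactly the transition probabilities above:

```python
import random

# Metropolis filter on a 4-cycle: every state has degree r = 2, and the
# target weights w (so p_i = w_i / sum(w)) are an arbitrary example.
w = [1.0, 2.0, 3.0, 4.0]
n = len(w)
neighbors = [[(i - 1) % n, (i + 1) % n] for i in range(n)]

def metropolis_step(i, rng):
    # Propose each neighbor with probability 1/r = 1/2, accept with
    # probability min(1, w_j / w_i); otherwise stay at i (self-loop).
    j = rng.choice(neighbors[i])
    return j if rng.random() < min(1.0, w[j] / w[i]) else i

rng = random.Random(0)
state, counts, steps = 0, [0] * n, 200_000
for _ in range(steps):
    state = metropolis_step(state, rng)
    counts[state] += 1
freq = [c / steps for c in counts]
target = [x / sum(w) for x in w]
print(freq)  # close to target = [0.1, 0.2, 0.3, 0.4]
```

Note that only ratios \(w_j/w_i\) are needed, which is why Metropolis-Hastings works even when the normalizing constant of \(p\) is unknown.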
Gibbs sampling
The algorithm first connects the states into a \(d\)-dimensional hypercube (each move changes a single coordinate), then repeats the following step: pick a variable \(x_i\) (say \(x_1\)) and resample it according to the conditional probability
\[p(x_1\mid x_2,\cdots,x_d)=\frac{p(x_1,x_2,\cdots,x_d)}{\sum_{x_1'}p(x_1',x_2,\cdots,x_d)},\]
leaving the other variables unchanged.
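A minimal random-scan Gibbs sampler for two binary variables (the \(2\times 2\) joint table, seed, and step count are illustrative assumptions). Each step picks a coordinate uniformly and redraws it from its conditional distribution given the other coordinate:

```python
import random

# Random-scan Gibbs sampler for two binary variables x = (x1, x2);
# the 2x2 joint table p is an arbitrary example distribution.
p = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def gibbs_step(x, rng):
    x = list(x)
    i = rng.randrange(2)        # pick which coordinate to resample
    x[i] = 0
    w0 = p[tuple(x)]            # unnormalized conditional weight of x_i = 0
    x[i] = 1
    w1 = p[tuple(x)]            # unnormalized conditional weight of x_i = 1
    x[i] = 0 if rng.random() < w0 / (w0 + w1) else 1
    return tuple(x)

rng = random.Random(1)
state, steps = (0, 0), 200_000
counts = {k: 0 for k in p}
for _ in range(steps):
    state = gibbs_step(state, rng)
    counts[state] += 1
freq = {k: c / steps for k, c in counts.items()}
print(freq)  # each freq[k] close to p[k]
```

As with Metropolis-Hastings, only ratios of \(p\) along one coordinate are needed, so the conditional is computed by local renormalization.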
The correctness of both algorithms is guaranteed by Lemma 4.3, since both transition rules satisfy detailed balance with respect to the target distribution.
The next question is how many iterations it takes for the distribution to converge.
Definition 4.1 Fix \(\varepsilon>0\). The \(\varepsilon\)-mixing time of a Markov chain is the minimum integer \(t\) such that for any starting distribution \(\mathbf p\), the \(1\)-norm difference between the \(t\)-step running average probability distribution and the stationary distribution is at most \(\varepsilon\).
Definition 4.2 For a subset \(S\) of vertices, let \(\pi(S)=\sum\limits_{x\in S}\pi_x\). The normalized conductance \(\Phi(S)\) of \(S\) is
\[\Phi(S)=\frac{\sum\limits_{x\in S,\,y\in\bar S}\pi_xp_{xy}}{\min(\pi(S),\pi(\bar S))}.\]
\(\Phi(S)\) is the probability of moving from \(S\) to \(\bar S\) in one step, starting from the stationary distribution restricted to \(S\) (when \(\pi(S)\le\pi(\bar S)\)). If one starts in \(S\), the expected number of steps before stepping into \(\bar S\) is roughly \(1/\Phi(S)\).
Definition 4.3 The normalized conductance of the Markov chain, denoted \(\Phi\), is defined by
\[\Phi=\min_{\substack{S\subseteq V\\ S\ne\emptyset,\,S\ne V}}\Phi(S).\]
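For a small state space, \(\Phi\) can be computed by brute force over all proper nonempty subsets. A sketch (the 4-state path chain is an illustrative assumption; its bottleneck is the middle edge):

```python
from itertools import combinations

# Brute-force normalized conductance of a small reversible chain:
# simple random walk on the path 0-1-2-3, with pi = degree / (2 * #edges).
P = [[0.0, 1.0, 0.0, 0.0],
     [0.5, 0.0, 0.5, 0.0],
     [0.0, 0.5, 0.0, 0.5],
     [0.0, 0.0, 1.0, 0.0]]
pi = [1 / 6, 2 / 6, 2 / 6, 1 / 6]
n = len(P)

def phi(S):
    """Normalized conductance Phi(S) of the subset S."""
    Sbar = [y for y in range(n) if y not in S]
    flow = sum(pi[x] * P[x][y] for x in S for y in Sbar)
    return flow / min(sum(pi[x] for x in S), sum(pi[y] for y in Sbar))

def conductance():
    """Phi = min of phi(S) over all proper nonempty subsets S."""
    return min(phi(S) for r in range(1, n)
               for S in combinations(range(n), r))

print(conductance())  # 1/3, attained by the middle cut S = {0, 1}
```

The minimizing cut \(S=\{0,1\}\) is the natural bottleneck of the path, matching the intuition that small conductance means a set that is hard to escape.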
Theorem 4.5 The \(\varepsilon\)-mixing time of a random walk on an undirected graph is \(O(\dfrac{\ln(1/\pi_{\min})}{\Phi^2\varepsilon^3})\).
Proof. Let \(T=\frac{c\ln(1/\pi_{\min})}{\Phi^2\varepsilon^3}\) for a suitable constant \(c\), and let \(\mathbf a=\mathbf a(T)\); we need to show \(\|\mathbf a-\pi\|_1\le \varepsilon\).
Let \(v_i=\frac{a_i}{\pi_i}\) and renumber the states so that \(v_1\ge v_2\ge\cdots\ge v_n\). Let \(i_0\) be the largest \(i\) such that \(v_i>1\). Then, since \(\sum_i(a_i-\pi_i)=0\),
\[\frac12\|\mathbf a-\pi\|_1=\sum_{i=1}^{i_0}(v_i-1)\pi_i=\sum_{i=i_0+1}^{n}(1-v_i)\pi_i,\]
so it suffices to show \(\sum_{i=1}^{i_0}(v_i-1)\pi_i\le\varepsilon/2\).
Let \(\gamma_i=\pi_1+\pi_2+\cdots+\pi_i\). Define a function \(f\) by \(f(x)=v_i-1\) for \(x\in[\gamma_{i-1},\gamma_i)\). We divide \(\{1,2,\cdots,i_0\}\) into groups \(G_1,G_2,\cdots,G_r\) of contiguous subsets (to be specified below). Let \(u_t=\max_{i\in G_t}v_i\). Define a new function \(g\) by \(g(x)=u_t-1\) for \(x\in\bigcup_{i\in G_t}[\gamma_{i-1},\gamma_i)\). Then, with the convention \(u_{r+1}=1\),
\[\sum_{i=1}^{i_0}(v_i-1)\pi_i=\int_0^{\gamma_{i_0}}f(x)\,dx\le\int_0^{\gamma_{i_0}}g(x)\,dx=\sum_{t=1}^{r}(u_t-1)\pi(G_t)=\sum_{t=1}^{r}\pi(G_1\cup G_2\cup\cdots\cup G_t)(u_t-u_{t+1}).\]
A picture helps here: \(f\) is a step function; \(g\) is obtained by flattening some of its steps; the last equality trades the vertical slabs of area for horizontal slabs.
We now focus on proving
\[\sum_{t=1}^{r}\pi(G_1\cup G_2\cup\cdots\cup G_t)(u_t-u_{t+1})\le \frac{\varepsilon}{2}.\tag{1}\]
If \(\sum_{i\ge i_0+1}(1-v_i)\pi_i\le \varepsilon/2\) we are already done, so assume not. Since \(1-v_i\le 1\), it follows that \(\sum_{i\ge i_0+1}\pi_i\ge \varepsilon/2\); hence \(\pi(\bar A)\ge\varepsilon/2\) for any subset \(A\subseteq\{1,2,\cdots,i_0\}\), and so \(\min(\pi(A),\pi(\bar A))\ge \frac{\varepsilon}{2}\pi(A)\).
We now define the groups. \(G_1\) will be just \(\{1\}\). In general, suppose \(G_1,G_2,\cdots,G_{t-1}\) have already been defined; \(G_t\) starts at \(i_t=1+(\text{end of }G_{t-1})\). Write \(k=i_t\). We define \(l\), the last element of \(G_t\), to be the largest integer with \(k\le l\le i_0\) such that
\[\sum_{i=k+1}^{l}\pi_i\le\frac{\varepsilon}{4}\Phi\gamma_k.\]
We claim (to be proved later) that for each group \(G_t=\{k,k+1,\cdots,l\}\),
\[u_t-u_{t+1}\le v_k-v_{l+1}\le\frac{8}{T\varepsilon\Phi\gamma_k}.\tag{2}\]
Then we only need an upper bound on \(r\). If \(G_t=\{k,k+1,\cdots,l\}\) with \(l<i_0\), then by the maximality of \(l\),
\[\sum_{i=k+1}^{l+1}\pi_i>\frac{\varepsilon}{4}\Phi\gamma_k,\qquad\text{so}\qquad \gamma_{i_{t+1}}=\gamma_{l+1}>\left(1+\frac{\varepsilon\Phi}{4}\right)\gamma_{i_t}.\]
Thus \(\gamma_{i_t}\) grows geometrically in \(t\), and since \(\gamma_{i_1}\ge\pi_{\min}\), we get \(r\le O\left(\frac{\ln(1/\pi_{\min})}{\varepsilon\Phi}\right)\). Moreover \(\pi(G_1\cup\cdots\cup G_t)=\gamma_l\le(1+\frac{\varepsilon\Phi}{4})\gamma_k\le 2\gamma_k\), so by \((2)\) each term of the sum in \((1)\) is at most \(\frac{16}{T\varepsilon\Phi}\). Summing over the \(r\) groups gives at most \(O\left(\frac{\ln(1/\pi_{\min})}{T\varepsilon^2\Phi^2}\right)\le\frac{\varepsilon}{2}\) for a suitable constant \(c\) (recall the definition of \(T\)), which completes the proof of \((1)\).
We now return to \((2)\). Consider a particular group \(G_t=\{k,k+1,\cdots,l\}\), and let \(A=\{1,2,\cdots,k\}\). The net loss of probability from the set \(A\) in one step is \(\sum_{i=1}^{k}(a_i-(\mathbf aP)_i)\), which is at most \(\frac{2}{T}\) in absolute value by the proof of Theorem 4.2.
Another way to reckon the net loss of probability from \(A\) is as the difference between the probability flow from \(A\) to \(\bar A\) and the flow from \(\bar A\) to \(A\). For any \(i<j\), by Lemma 4.3 (detailed balance) we have
\[a_ip_{ij}-a_jp_{ji}=v_i\pi_ip_{ij}-v_j\pi_jp_{ji}=(v_i-v_j)\pi_jp_{ji}\ge 0.\]
Thus for any two states \(i<j\), there is a nonnegative net flow from \(i\) to \(j\). The net loss from \(A\) is therefore at least
\[\sum_{i\le k<j}(v_i-v_j)\pi_jp_{ji}\ge(v_k-v_{l+1})\sum_{i\le k,\,j\ge l+1}\pi_jp_{ji},\]
where we dropped the (nonnegative) terms with \(k<j\le l\) and used \(v_i-v_j\ge v_k-v_{l+1}\) for \(i\le k\), \(j\ge l+1\). Thus
\[\frac{2}{T}\ge(v_k-v_{l+1})\sum_{i\le k,\,j\ge l+1}\pi_jp_{ji}.\tag{3}\]
Since, by the definition of \(l\),
\[\sum_{i\le k,\,k<j\le l}\pi_jp_{ji}\le\sum_{j=k+1}^{l}\pi_j\le\frac{\varepsilon}{4}\Phi\gamma_k,\]
and, by the definition of \(\Phi\), detailed balance, and \(\min(\pi(A),\pi(\bar A))\ge \frac{\varepsilon}{2}\pi(A)=\frac{\varepsilon}{2}\gamma_k\),
\[\sum_{i\le k,\,j>k}\pi_jp_{ji}=\sum_{i\le k,\,j>k}\pi_ip_{ij}\ge\Phi\cdot\frac{\varepsilon}{2}\gamma_k,\]
we obtain
\[\sum_{i\le k,\,j\ge l+1}\pi_jp_{ji}\ge\frac{\varepsilon}{2}\Phi\gamma_k-\frac{\varepsilon}{4}\Phi\gamma_k=\frac{\varepsilon}{4}\Phi\gamma_k.\tag{4}\]
Combining \((3)\) and \((4)\), we have \(v_k-v_{l+1}\le \frac{8}{T\varepsilon\Phi\gamma_k}\), proving \((2)\) when \(k<i_0\). If \(k=i_0\) the proof is similar but simpler. \(\square\)