Markov Models
State set: \(\mathcal{S} = \{s_1,\cdots,s_N\}\)
Observed state sequence: \(x = x_1,\cdots,x_t,\cdots,x_T\), where \(x_t \in \mathcal{S}\)
Initial state probabilities: \(\pi_i = p(x_1 = s_i),1 \leq i \leq N\)
State transition probabilities: \(a_{ij} = p(x_t = s_j|x_{t - 1} = s_i),1 \leq i,j \leq N\)
Computing the probability of an observed state sequence (assuming each state \(x_t\) is generated depending only on the previous state \(x_{t-1}\), which is known as a first-order Markov model):
\[\begin{aligned}
P(x;\theta)
&= \prod_{t = 1}^Tp(x_t|x_1,\cdots,x_{t - 1}) \\
&\approx p(x_1) \times \prod_{t = 2}^Tp(x_t|x_{t - 1})
\end{aligned}
\]
where \(\theta = \{p(x)|x \in \mathcal{S}\} \bigcup \{p(x'|x)|x,x' \in \mathcal{S}\}\)
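As a concrete illustration of the factorization above, here is a minimal Python sketch that evaluates \(P(x;\theta)\) for a first-order Markov chain. The two-state model (`pi`, `A`) is a made-up example, not from the text.

```python
# Evaluate P(x) = p(x_1) * prod_{t>=2} p(x_t | x_{t-1})
# for a first-order Markov chain with state indices 0..N-1.

def markov_sequence_prob(seq, pi, A):
    """seq: list of state indices; pi[i] = p(x_1 = s_i); A[i][j] = p(s_j | s_i)."""
    prob = pi[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= A[prev][cur]
    return prob

# Hypothetical two-state chain: 0 = "sunny", 1 = "rainy".
pi = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
p = markov_sequence_prob([0, 0, 1], pi, A)  # 0.6 * 0.7 * 0.3
```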
Model Learning
Goal: learn the model parameters \(\theta\), i.e., the initial state probabilities and the state transition probabilities, via maximum likelihood estimation
Given a training set of \(D\) samples \(\mathscr{D} = \{x^{(d)}\}^D_{d = 1}\), maximum likelihood estimation obtains the optimal model parameters from the training data:
\[\hat{\theta} = \mathop{\arg\max}\limits_{\theta}\{L(\theta)\}
\]
The log-likelihood function:
\[\begin{aligned}
L(\theta)
&= \sum_{d = 1}^D \log P(x^{(d)};\theta) \\
&= \sum_{d = 1}^D
\left(
\log p(x_1^{(d)}) + \sum_{t = 2}^{T^{(d)}} \log p(x_t^{(d)}|x_{t - 1}^{(d)})
\right)
\end{aligned}
\]
where \(T^{(d)}\) is the sequence length of the \(d\)-th training sample. The model parameters must also satisfy the following two constraints:
\[\sum_{x \in \mathcal{S}} p(x) = 1 \\
\forall x:\sum_{x' \in \mathcal{S}}p(x'|x) = 1
\]
Lagrange multipliers are introduced to solve this constrained maximization:
\[J(\theta,\lambda,\gamma) = L(\theta) - \lambda\left(\sum_{x \in \mathcal{S}} p(x) - 1 \right) - \sum_{x \in \mathcal{S}} \gamma_x \left( \sum_{x' \in \mathcal{S}} p(x'|x) - 1\right)
\]
where \(\lambda\) is the Lagrange multiplier for the initial-probability constraint and \(\gamma = \{\gamma_x|x \in \mathcal{S}\}\) is the set of Lagrange multipliers for the transition-probability constraints
\[\begin{aligned}
\frac{\partial J(\theta,\lambda,\gamma)}{\partial p(x)}
&= \left(
\sum_{d = 1}^D
\left(
\log p(x_1^{(d)}) + \sum_{t = 2}^{T^{(d)}} \log p(x_t^{(d)}|x_{t - 1}^{(d)})
\right) - \lambda\left(\sum_{x \in \mathcal{S}} p(x) - 1 \right) - \sum_{x \in \mathcal{S}} \gamma_x \left( \sum_{x' \in \mathcal{S}} p(x'|x) - 1\right)
\right)'_{p(x)} \\
&= \frac{\partial}{\partial p(x)} \sum_{d = 1}^D \log p(x_1^{(d)}) - \lambda \\
&= \frac{1}{p(x)} \sum_{d = 1}^D \delta(x_1^{(d)},x) - \lambda
\end{aligned}
\]
where \(\delta(a,b)\) equals 1 when \(a = b\) and 0 otherwise
Let \(c(x,\mathscr{D})\) denote the number of times \(x\) occurs as the first state in the training data:
\[c(x,\mathscr{D}) = \sum_{d = 1}^D \delta(x_1^{(d)},x)
\]
Next, take the partial derivative with respect to the Lagrange multiplier \(\lambda\):
\[\frac{\partial J(\theta,\lambda,\gamma)}{\partial \lambda} = \sum_{x \in \mathcal{S}}p(x) - 1
\]
Setting the partial derivative with respect to \(p(x)\) to zero gives:
\[p(x) = \frac{c(x,\mathscr{D})}{\lambda}
\]
Substituting into the constraint yields \(\lambda = \sum_{x \in \mathcal{S}}c(x,\mathscr{D})\), so the estimate of the initial state probabilities is:
\[p(x) = \frac{c(x,\mathscr{D})}{\sum_{x' \in \mathcal{S}}c(x',\mathscr{D})}
\]
\[\begin{aligned}
\frac{\partial J(\theta,\lambda,\gamma)}{\partial p(x'|x)}
&= \left(
\sum_{d = 1}^D
\left(
\log p(x_1^{(d)}) + \sum_{t = 2}^{T^{(d)}} \log p(x_t^{(d)}|x_{t - 1}^{(d)})
\right) - \lambda\left(\sum_{x \in \mathcal{S}} p(x) - 1 \right) - \sum_{x \in \mathcal{S}} \gamma_x \left( \sum_{x' \in \mathcal{S}} p(x'|x) - 1\right)
\right)'_{p(x'|x)} \\
&= \frac{\partial}{\partial p(x'|x)} \sum_{d = 1}^D \sum_{t = 2}^{T^{(d)}} \log p\left(x_t^{(d)}|x_{t - 1}^{(d)}
\right) - \gamma_x \\
&= \frac{1}{p(x'|x)} \sum_{d = 1}^D \sum_{t = 2}^{T^{(d)}} \delta(x_{t - 1}^{(d)},x) \delta(x_t^{(d)},x') - \gamma_x
\end{aligned}
\]
Let \(c(x,x',\mathscr{D})\) denote the number of times \(x'\) immediately follows \(x\) in the training data:
\[c(x,x',\mathscr{D}) = \sum_{d = 1}^D \sum_{t = 2}^{T^{(d)}} \delta(x_{t - 1}^{(d)},x) \delta(x_t^{(d)},x')
\]
Next, take the partial derivative with respect to the Lagrange multiplier \(\gamma_x\):
\[\frac{\partial J(\theta,\lambda,\gamma)}{\partial \gamma_x} = \sum_{x' \in \mathcal{S}}p(x'|x) - 1
\]
Setting the partial derivative with respect to \(p(x'|x)\) to zero gives:
\[p(x'|x) = \frac{c(x,x',\mathscr{D})}{\gamma_x}
\]
Substituting into the constraint yields \(\gamma_x = \sum_{x'' \in \mathcal{S}}c(x,x'',\mathscr{D})\), hence:
\[p(x'|x) = \frac{c(x,x',\mathscr{D})}{\sum_{x'' \in \mathcal{S}}c(x,x'',\mathscr{D})}
\]
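The closed-form estimates above are just normalized counts, which the following Python sketch implements directly. The toy training sequences are hypothetical.

```python
# Maximum-likelihood estimation of a first-order Markov model:
# p(x) = c(x, D) / sum_x' c(x', D)
# p(x'|x) = c(x, x', D) / sum_x'' c(x, x'', D)
from collections import Counter

def estimate_markov(sequences, states):
    # c(x, D): how often x is the first state of a sequence.
    init = Counter(seq[0] for seq in sequences)
    # c(x, x', D): how often x' immediately follows x.
    trans = Counter((a, b) for seq in sequences for a, b in zip(seq, seq[1:]))
    pi = {s: init[s] / len(sequences) for s in states}
    A = {}
    for s in states:
        total = sum(trans[(s, t)] for t in states)
        A[s] = {t: (trans[(s, t)] / total if total else 0.0) for t in states}
    return pi, A

# Hypothetical training data over two states "a" and "b".
data = [["a", "a", "b"], ["a", "b", "b"], ["b", "a"]]
pi, A = estimate_markov(data, ["a", "b"])
```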
Computing the Observation Probability
Goal: compute the probability of an observation sequence
An example: inside a dark room you cannot see the weather outside, but you can infer it by feeling how damp the floor is; here the dampness of the floor is the observed state and the weather outside is the hidden state
Observation set: \(\mathscr{O} = \{o_1,\cdots,o_M\}\)
Hidden state set: \(\mathcal{S} = \{s_1,\cdots,s_N\}\)
Observed state sequence: \(x = x_1,\cdots,x_t,\cdots,x_T\)
Hidden state sequence: \(z = z_1,\cdots,z_t,\cdots,z_T\)
Initial hidden-state probabilities: \(\pi_i = p(z_1 = s_i),1 \leq i \leq N\)
Hidden-state transition probabilities: \(a_{ij} = p(z_t = s_j|z_{t - 1} = s_i),1 \leq i,j \leq N\)
Observation emission probabilities: \(b_j(k) = p(x_t = o_k|z_t = s_j),1 \leq j \leq N \bigwedge 1 \leq k \leq M\)
The hidden Markov model:
\[\begin{aligned}
P(x;\theta)
&= \sum_zP(x,z;\theta) \\
&= \sum_zp(z_1) \times p(x_1|z_1) \times \prod_{t = 2}^Tp(z_t|z_{t - 1}) \times p(x_t|z_t)
\end{aligned}
\]
Model parameters: \(\theta = \{p(z)|z \in \mathcal{S}\} \bigcup \{p(z'|z)|z,z' \in \mathcal{S}\} \bigcup \{p(x|z)|x \in \mathscr{O} \bigwedge z \in \mathcal{S}\}\)
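The sum over \(z\) ranges over \(N^T\) hidden sequences, so for small \(T\) it can be evaluated by brute force, as in this Python sketch; the 2-state model below is illustrative, not from the text.

```python
# Brute-force evaluation of P(x) = sum_z p(z_1) b_{z_1}(x_1)
# * prod_{t>=2} a_{z_{t-1} z_t} b_{z_t}(x_t), enumerating all N^T
# hidden sequences (exponential cost; for illustration only).
from itertools import product

def hmm_prob_bruteforce(obs, pi, A, B):
    N, T = len(pi), len(obs)
    total = 0.0
    for z in product(range(N), repeat=T):
        p = pi[z[0]] * B[z[0]][obs[0]]
        for t in range(1, T):
            p *= A[z[t - 1]][z[t]] * B[z[t]][obs[t]]
        total += p
    return total

# Hypothetical 2-state, 2-observation HMM.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
p = hmm_prob_bruteforce([0, 1], pi, A, B)
```

The forward algorithm below computes the same quantity in \(O(N^2T)\) time.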
Forward Probability
The joint probability of the partial observation sequence \(x_1,\cdots,x_t\) and the \(t\)-th hidden state being \(s_i\) is called the forward probability:
\[\alpha_t(i) = P(x_1,\cdots,x_t,z_t = s_i;\theta)
\]
Computed recursively by dynamic programming:
- Initialization:
\[\alpha_1(i) = \pi_ib_i(x_1),1 \leq i \leq N
\]
- Recursion: \(t = 2,\cdots,T\)
\[\alpha_t(j) = \left( \sum_{i = 1}^N\alpha_{t - 1}(i)a_{ij}\right)b_j(x_t),1 \leq j \leq N
\]
- Termination:
\[P(x;\theta) = \sum_{i = 1}^N\alpha_T(i)
\]
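A direct Python transcription of the forward recursion; the example HMM parameters are hypothetical, and \(P(x;\theta)\) is obtained by summing \(\alpha_T(i)\) over states.

```python
def forward(obs, pi, A, B):
    """Forward algorithm: alpha[t][i] = P(x_1..x_{t+1}, z_{t+1} = s_i),
    with 0-based indexing; obs is a list of observation indices."""
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]  # initialization
    for x in obs[1:]:                                   # recursion
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][x]
                      for j in range(N)])
    return alpha

# Hypothetical 2-state, 2-observation HMM.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
alpha = forward([0, 1], pi, A, B)
prob = sum(alpha[-1])  # termination: P(x) = sum_i alpha_T(i)
```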
Backward Probability
The conditional probability of generating the partial observation sequence \(x_{t + 1},\cdots,x_T\) given that the \(t\)-th hidden state is \(s_i\) is called the backward probability, defined as:
\[\beta_t(i) = P(x_{t + 1},\cdots,x_T|z_t = s_i;\theta)
\]
Computed recursively by dynamic programming as follows:
- Initialization:
\[\beta_T(i) = 1,1 \leq i \leq N
\]
- Recursion: \(t = T - 1,\cdots,1\)
\[\beta_t(i) = \sum_{j = 1}^Na_{ij}b_j(x_{t + 1})\beta_{t + 1}(j),1 \leq i \leq N
\]
- Termination:
\[P(x;\theta) = \sum_{i = 1}^N\pi_ib_i(x_1)\beta_1(i)
\]
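The backward recursion can be sketched the same way; as a sanity check, its termination formula yields the same likelihood as the forward pass for the same hypothetical model.

```python
def backward(obs, pi, A, B):
    """Backward algorithm: beta[t][i] = P(x_{t+2}..x_T | z_{t+1} = s_i),
    with 0-based indexing; obs is a list of observation indices."""
    N = len(pi)
    beta = [[1.0] * N]                 # initialization: beta_T(i) = 1
    for x in reversed(obs[1:]):        # recursion, t = T-1, ..., 1
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][x] * nxt[j] for j in range(N))
                        for i in range(N)])
    return beta

# Hypothetical 2-state, 2-observation HMM (same as the forward example).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1]
beta = backward(obs, pi, A, B)
# Termination: P(x) = sum_i pi_i b_i(x_1) beta_1(i)
prob = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(2))
```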
Computing the Optimal Hidden-State Sequence: The Viterbi Algorithm
Goal: given an observation sequence \(x = x_1,\cdots,x_t,\cdots,x_T\) and model parameters \(\theta\), find the optimal hidden-state sequence
\[\begin{aligned}
\hat{z}
&= \mathop{\arg\max}\limits_z\left\{P(z|x;\theta)\right\} \\
&= \mathop{\arg\max}\limits_z\left\{\frac{P(x,z;\theta)}{P(x;\theta)}\right\} \\
&= \mathop{\arg\max}\limits_z\left\{P(x,z;\theta)\right\} \\
&= \mathop{\arg\max}\limits_z\left\{p(z_1) \times p(x_1|z_1) \times \prod_{t = 2}^Tp(z_t|z_{t - 1}) \times p(x_t|z_t)\right\} \\
\end{aligned}
\]
Viewing the search as a longest-path problem on a graph, let \(\delta_i = \mathop{max}\limits_{j \in heads(i)}\{\omega_{ji}\delta_j\}\) be the maximum path score from node 1 to node \(i\), and let \(\psi_i = \mathop{\arg\max}\limits_{j \in heads(i)}\{\omega_{ji}\delta_j\}\) record the corresponding best predecessor
The Viterbi algorithm:
- Initialization:
\[\delta_1(i) = \pi_ib_i(x_1),\psi_1(i) = 0
\]
- Recursion: \(t = 2,\cdots,T\)
\[\delta_t(j) = \mathop{max}\limits_{1 \leq i \leq N}\{\delta_{t - 1}(i)a_{ij}\}b_j(x_t) \\
\psi_t(j) = \mathop{\arg\max}\limits_{1 \leq i \leq N}\{\delta_{t - 1}(i)a_{ij}\}
\]
- Termination:
\[\hat{P} = \mathop{max}\limits_{1 \leq i \leq N}\{\delta_{T}(i)\} \\
\hat{z}_T = \mathop{\arg\max}\limits_{1 \leq i \leq N}\{\delta_{T}(i)\}
\]
- Backtracking: \(t = T - 1,\cdots,1\)
\[\hat{z}_t = \psi_{t + 1}(\hat{z}_{t + 1})
\]
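A compact Python sketch of the recursion and backtracking above, using a hypothetical 2-state model; `psi` stores the best predecessor at each step.

```python
def viterbi(obs, pi, A, B):
    """Return the most likely hidden-state index sequence for obs."""
    N = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]  # initialization
    psi = []
    for x in obs[1:]:                                 # recursion
        # best[j] = argmax_i delta_{t-1}(i) * a_{ij}
        best = [max(range(N), key=lambda i: delta[i] * A[i][j])
                for j in range(N)]
        delta = [delta[best[j]] * A[best[j]][j] * B[j][x] for j in range(N)]
        psi.append(best)
    z = [max(range(N), key=lambda i: delta[i])]       # termination
    for back in reversed(psi):                        # backtracking
        z.insert(0, back[z[0]])
    return z

# Hypothetical 2-state, 2-observation HMM.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
path = viterbi([0, 1, 1], pi, A, B)
```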
Model Learning: The Forward-Backward Algorithm
Goal: estimate the model parameters. Only the observation sequences are known and the hidden-state sequences are uncertain, so the main challenge is that the likelihood requires summing over an exponential number of hidden-state sequences
Given a training set \(\mathscr{D} = \{x^{(d)}\}^D_{d = 1}\), maximum likelihood estimation obtains the optimal model parameters:
\[\hat{\theta} = \mathop{\arg\max}\limits_{\theta}\{L(\theta)\}
\]
The Expectation-Maximization (EM) algorithm is widely used to estimate the parameters of latent-state models. Let \(\mathbf{X}\) denote the observed data and \(\mathbf{Z}\) the unobserved data, i.e., the hidden-state sequences. EM iterates between the following two steps:
- E-step:
\[\mathbf{Q}(\theta|\theta^{old}) = \mathbb{E}_{\mathbf{Z}|\mathbf{X};\theta^{old}}\left[\log P(\mathbf{X},\mathbf{Z};\theta)\right]
\]
- M-step:
\[\theta^{new} = \mathop{\arg\max}\limits_{\theta}\left\{\mathbf{Q}(\theta|\theta^{old})\right\}
\]
When EM is used to train a hidden Markov model, the objective function actually used in the E-step is defined as follows:
\[\begin{aligned}
J(\theta,\lambda,\gamma,\phi)
&= \sum_{d = 1}^D\mathbb{E}_{\mathbf{Z}|\mathbf{X}^{(d)};\theta^{old}}\left[\log P(\mathbf{x}^{(d)},\mathbf{Z};\theta)\right] \\
&- \lambda\left(\sum_{z \in \mathcal{S}}p(z) - 1\right) \\
&- \sum_{z \in \mathcal{S}}\gamma_z\left(\sum_{z' \in \mathcal{S}}p(z'|z) - 1\right) \\
&- \sum_{z \in \mathcal{S}}\phi_z\left(\sum_{x \in \mathscr{O}}p(x|z) - 1\right)
\end{aligned}
\]
Setting the partial derivatives to zero yields the formulas:
\[p(z) = \frac{c(z,\mathscr{D})}{\sum_{z' \in \mathcal{S}}c(z',\mathscr{D})} \\
p(z'|z) = \frac{c(z,z',\mathscr{D})}{\sum_{z'' \in \mathcal{S}}c(z,z'',\mathscr{D})} \\
p(x|z) = \frac{c(z,x,\mathscr{D})}{\sum_{x' \in \mathscr{O}}c(z,x',\mathscr{D})}
\]
where \(c(\cdot)\) denotes a count function: \(c(z,\mathscr{D})\) is the expected number of times \(z\) is the first hidden state in the training set \(\mathscr{D}\), \(c(z,z',\mathscr{D})\) is the expected number of times hidden state \(z'\) immediately follows hidden state \(z\) in the training set, and \(c(z,x,\mathscr{D})\) is the expected number of times hidden state \(z\) emits observation \(x\) in the training set.
These expected counts are defined as follows:
\[\begin{aligned}
c(z,\mathscr{D}) &\equiv \sum_{d = 1}^D\mathbb{E}_{\mathbf{Z}|\mathbf{X};\theta^{old}}\left[\delta(z_1,z)\right] \\
c(z,z',\mathscr{D}) &\equiv \sum_{d = 1}^D\mathbb{E}_{\mathbf{Z}|\mathbf{X};\theta^{old}}\left[\sum_{t = 2}^{T^{(d)}}\delta(z_{t - 1},z)\delta(z_t,z')\right] \\
c(z,x,\mathscr{D}) &\equiv \sum_{d = 1}^D\mathbb{E}_{\mathbf{Z}|\mathbf{X};\theta^{old}}\left[\sum_{t = 1}^{T^{(d)}}\delta(z_t,z)\delta(x_t^{(d)},x)\right]
\end{aligned}
\]
The expectations are taken with respect to the posterior distribution of the hidden states, \(P(\mathbf{z}|\mathbf{x}^{(d)};\theta^{old})\). Computed naively, these expectations involve an exponential number of terms; the expected number of hidden-state transitions is worked out below as an example:
\[\begin{aligned}
\mathbb{E}_{\mathbf{Z}|\mathbf{X};\theta^{old}}\left[\delta(z_{t - 1},z)\delta(z_t,z')\right]
&= \sum_{\mathbf{z}}P(\mathbf{z}|\mathbf{x}^{(d)};\theta^{old})\delta(z_{t - 1},z)\delta(z_t,z') \\
&= \sum_{\mathbf{z}}\frac{P(\mathbf{x}^{(d)},\mathbf{z};\theta^{old})}{P(\mathbf{x}^{(d)};\theta^{old})}\delta(z_{t - 1},z)\delta(z_t,z') \\
&= \frac{1}{P(\mathbf{x}^{(d)};\theta^{old})}\sum_{\mathbf{z}}P(\mathbf{x}^{(d)},\mathbf{z};\theta^{old})\delta(z_{t - 1},z)\delta(z_t,z') \\
&= \frac{P(\mathbf{x}^{(d)},z_{t - 1} = z,z_t = z';\theta^{old})}{P(\mathbf{x}^{(d)};\theta^{old})}
\end{aligned}
\]
The denominator above can be computed with the forward algorithm as \(P(\mathbf{x};\theta) = \sum_{i = 1}^N\alpha_T(i)\); the numerator can be computed with the forward and backward probabilities, as follows:
\[\begin{aligned}
P(\mathbf{x},z_{t - 1} = s_i,z_t = s_j;\theta)
= &P(x_1,\cdots,x_{t - 1},z_{t - 1} = s_i;\theta) \times \\
&P(z_t = s_j|z_{t - 1} = s_i;\theta) \times \\
&P(x_t|z_t = s_j;\theta) \times \\
&P(x_{t + 1},\cdots,x_T|z_t = s_j;\theta) \\
= &\alpha_{t - 1}(i)a_{ij}b_j(x_t)\beta_t(j)
\end{aligned}
\]
Estimating the Initial Hidden-State Probabilities
\[\begin{aligned}
p(z) &= \frac{c(z,\mathscr{D})}{\sum_{z' \in \mathcal{S}}c(z',\mathscr{D})} \\
&= \frac{\sum_{d = 1}^DP(\mathbf{x}^{(d)},z_1 = z;\theta^{old})}{\sum_{d = 1}^DP(\mathbf{x}^{(d)};\theta^{old})} \\
\overline{\pi}_i &= \frac{\sum_{d = 1}^D\alpha_1(i)\beta_1(i)}{\sum_{d = 1}^D\sum_{j = 1}^N\alpha_{T^{(d)}}(j)}
\end{aligned}
\]
Estimating the Hidden-State Transition Probabilities
\[\begin{aligned}
p(z'|z) &= \frac{c(z,z',\mathscr{D})}{\sum_{z'' \in \mathcal{S}}c(z,z'',\mathscr{D})} \\
&= \frac{\sum_{d = 1}^D\sum_{t = 2}^{T^{(d)}}P(\mathbf{x}^{(d)},z_{t - 1} = z,z_t = z';\theta^{old}) }{\sum_{d = 1}^D\sum_{t = 2}^{T^{(d)}}P(\mathbf{x}^{(d)},z_{t - 1} = z;\theta^{old}) }\\
\overline{a}_{ij} &= \frac{\sum_{d = 1}^D\sum_{t = 2}^{T^{(d)}}\alpha_{t - 1}(i)a_{ij}b_j(x^{(d)}_t)\beta_t(j)}{\sum_{d = 1}^D\sum_{t = 2}^{T^{(d)}}\alpha_{t - 1}(i)\beta_{t - 1}(i)}
\end{aligned}
\]
Estimating the Observation Emission Probabilities
\[\begin{aligned}
p(x|z) &= \frac{c(z,x,\mathscr{D})}{\sum_{x' \in \mathscr{O}}c(z,x',\mathscr{D})} \\
&= \frac{\sum_{d = 1}^D\sum_{t = 1}^{T^{(d)}}\delta(x_t^{(d)},x)P(\mathbf{x}^{(d)},z_t = z;\theta^{old})}{\sum_{d = 1}^D\sum_{t = 1}^{T^{(d)}}P(\mathbf{x}^{(d)},z_t = z;\theta^{old})} \\
\overline{b}_i(k) &= \frac{\sum_{d = 1}^D\sum_{t = 1}^{T^{(d)}}\delta(x^{(d)}_t,o_k)\alpha_t(i)\beta_t(i)}{\sum_{d = 1}^D\sum_{t = 1}^{T^{(d)}}\alpha_t(i)\beta_t(i)}
\end{aligned}
\]
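Putting the three re-estimation formulas together, here is a single-sequence Python sketch of one EM (Baum-Welch) iteration; the model and observation sequence are illustrative, and for multiple sequences the sums over \(d\) would simply accumulate the same per-sequence quantities.

```python
# One Baum-Welch re-estimation step for a single observation sequence.
# gamma and xi are the posterior state and transition probabilities
# computed from the forward and backward passes.

def forward(obs, pi, A, B):
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for x in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][x]
                      for j in range(N)])
    return alpha

def backward(obs, A, B):
    N = len(A)
    beta = [[1.0] * N]
    for x in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][x] * nxt[j] for j in range(N))
                        for i in range(N)])
    return beta

def baum_welch_step(obs, pi, A, B):
    N, M, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
    px = sum(alpha[-1])
    # gamma[t][i] = P(z_t = s_i | x); xi[t][i][j] = P(z_t=s_i, z_{t+1}=s_j | x)
    gamma = [[alpha[t][i] * beta[t][i] / px for i in range(N)]
             for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / px
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(M)] for i in range(N)]
    return new_pi, new_A, new_B

# Hypothetical 2-state, 2-observation HMM and training sequence.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 0, 0, 1]
pi2, A2, B2 = baum_welch_step(obs, pi, A, B)
```

Each EM step is guaranteed not to decrease the likelihood \(P(\mathbf{x};\theta)\), which can be checked by running the forward pass with the updated parameters.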