1. Introduction

n-gram language models

An n-gram language model predicts the probability of the next word given the preceding $n-1$ words $w_1, \cdots, w_{n-1}$:

$P(w_n | w_1, \cdots, w_{n-1})$

\begin{align}
\text{Unigram:} \quad & \hat{P} (w_3) = \frac{f(w_3)}{N} \cr
\text{Bigram:} \quad & \hat{P} (w_3|w_2) = \frac{f(w_2, w_3)}{f(w_2)} \cr
\text{Trigram:} \quad &\hat{P} (w_3|w_1,w_2) = \frac{f(w_1, w_2, w_3)}{f(w_1,w_2)}
\end{align}
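As a toy sketch of these relative-frequency (MLE) estimates (the corpus here is hypothetical, and `N` is the total token count):

```python
# Toy relative-frequency (MLE) estimates for the formulas above.
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
N = len(corpus)

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_unigram(w3):
    # P(w3) = f(w3) / N
    return unigrams[w3] / N

def p_bigram(w3, w2):
    # P(w3 | w2) = f(w2, w3) / f(w2)
    return bigrams[(w2, w3)] / unigrams[w2]

print(p_unigram("the"))        # f(the)/N = 3/9
print(p_bigram("cat", "the"))  # f(the, cat)/f(the) = 2/3
```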

Two CWS models

The Word-Based Generative Model treats segmentation as a search for the word sequence $w_1^m$ that maximizes the sentence probability:

\arg \mathop{\max}\limits_{w_1^m} P(w_1^m)

By the chain rule, truncated to a trigram context:

P(w_1^m) = \prod_{i=1}^{m}P(w_i|w_1^{i-1}) \approx \prod_{i=1}^{m}P(w_i|w_{i-2}^{i-1})
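The argmax over candidate segmentations can be sketched as follows; the model here is a word bigram, and the probability table is made up purely for illustration:

```python
# Sketch: word-based generative CWS picks the segmentation w_1^m with the
# highest P(w_1^m), approximated here by a bigram model over words.
# All probabilities below are illustrative, not trained values.
import math

logp = {  # log P(w_i | w_{i-1}); '<s>' marks the sentence start
    ("<s>", "研究"): math.log(0.20), ("研究", "生命"): math.log(0.10),
    ("<s>", "研究生"): math.log(0.05), ("研究生", "命"): math.log(0.001),
}

def score(words, unk=math.log(1e-6)):
    # Sum log P(w_i | w_{i-1}); unseen bigrams get a small floor.
    prev, total = "<s>", 0.0
    for w in words:
        total += logp.get((prev, w), unk)
        prev = w
    return total

candidates = [["研究", "生命"], ["研究生", "命"]]
best = max(candidates, key=score)
print(best)  # ['研究', '生命']
```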

The Character-Based Discriminative Model instead segments by sequence labeling, in the same way as POS (Part-of-Speech) tagging:

\arg \mathop{\max}\limits_{t_1^n} P(t_1^n | c_1^n)

Here $t_i$ is the B/M/E/S tag assigned to character $c_i$ (word Begin, Middle, End, or Single-character word).
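The mapping between a segmentation and its B/M/E/S tag sequence can be sketched as:

```python
# Convert a segmentation to B/M/E/S character tags, and back.
def words_to_tags(words):
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")                                  # single-char word
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"]) # begin/middle/end
    return tags

def tags_to_words(chars, tags):
    # A word ends at every 'E' or 'S' tag.
    words, buf = [], ""
    for c, t in zip(chars, tags):
        buf += c
        if t in ("E", "S"):
            words.append(buf)
            buf = ""
    return words

tags = words_to_tags(["小明", "毕业", "于", "计算所"])
print(tags)  # ['B', 'E', 'B', 'E', 'S', 'B', 'M', 'E']
print(tags_to_words("小明毕业于计算所", tags))  # ['小明', '毕业', '于', '计算所']
```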

HMM segmentation

Applying Bayes' rule to the labeling objective, and dropping the denominator $P(c_1^n)$, which does not depend on $t_1^n$:

\begin{aligned} \arg \mathop{\max}\limits_{t_1^n} P(t_1^n | c_1^n) & = \arg \mathop{\max}\limits_{t_1^n} \frac{P(c_1^n | t_1^n) P(t_1^n)}{P(c_1^n)} \\ & = \arg \mathop{\max}\limits_{t_1^n} P(c_1^n | t_1^n) P(t_1^n)\\ \end{aligned}

The HMM makes two basic assumptions, the homogeneous Markov assumption and the observation-independence assumption:

• the state (tag) depends only on the previous state:

$P(t_{i}|t_{1}^{i-1}) = P(t_i| t_{i-1})$

• the observations are conditionally independent of one another given the tags:

$P(c_1^n|t_1^n) = \prod_{i=1}^{n} P(c_i|t_1^n)$

• each observation depends only on the state at its own position, i.e. a character's emission depends only on its tag:

$P(c_i|t_1^n) = P(c_i | t_i)$

\begin{aligned} P(c_1^n | t_1^n) P(t_1^n) & = \prod_{i=1}^{n} P(c_i|t_1^n) \times [P(t_n|t_{1}^{n-1}) \cdots P(t_i|t_{1}^{i-1}) \cdots P(t_2|t_1) P(t_1)] \\ & = \prod_{i=1}^{n} [P(c_i|t_i) \times P(t_i|t_{i-1})] \\ \end{aligned}

where $P(t_1|t_0)$ is taken to mean $P(t_1)$. The decoding objective therefore becomes:

\arg \mathop{\max}\limits_{t_1^n} \prod_{i=1}^{n} [P(t_i|t_{i-1}) \times P(c_i|t_i)]
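This first-order objective is decoded with the Viterbi algorithm. A minimal sketch in log space follows; the transition and emission tables are illustrative stand-ins, not trained values:

```python
# Minimal Viterbi sketch for the first-order HMM objective above:
# argmax over t_1^n of prod_i P(t_i|t_{i-1}) P(c_i|t_i), in log space.
import math

NEG = -1e9   # stand-in for log 0 (impossible transition)
TAGS = "BMES"

# Structural constraints of B/M/E/S tagging; allowed pairs share mass equally.
ALLOWED = {("B", "M"), ("B", "E"), ("M", "M"), ("M", "E"),
           ("E", "B"), ("E", "S"), ("S", "B"), ("S", "S")}
log_trans = {(a, b): (math.log(0.5) if (a, b) in ALLOWED else NEG)
             for a in TAGS for b in TAGS}
log_init = {"B": math.log(0.5), "S": math.log(0.5), "M": NEG, "E": NEG}

# Hypothetical emissions P(c|t); unseen pairs get a small floor.
EMIT = {("B", "中"): 0.6, ("E", "国"): 0.6, ("S", "人"): 0.6}
def log_emit(c, t):
    return math.log(EMIT.get((t, c), 0.01))

def viterbi(chars):
    # delta[t]: best log score of any tag path for the prefix ending in tag t.
    delta = {t: log_init[t] + log_emit(chars[0], t) for t in TAGS}
    back = []                      # back-pointers, one dict per position
    for c in chars[1:]:
        ptr, nxt = {}, {}
        for t in TAGS:
            best = max(TAGS, key=lambda s: delta[s] + log_trans[(s, t)])
            ptr[t] = best
            nxt[t] = delta[best] + log_trans[(best, t)] + log_emit(c, t)
        delta = nxt
        back.append(ptr)
    t = max(TAGS, key=delta.get)   # best final tag, then follow pointers back
    path = [t]
    for ptr in reversed(back):
        t = ptr[t]
        path.append(t)
    return path[::-1]

print(viterbi("中国人"))  # ['B', 'E', 'S']
```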

Extending the transitions to a trigram over tags, with $t_{n+1}$ marking the end of the sequence, gives the second-order (HMM2) objective used by TnT:

\arg \mathop{\max}\limits_{t_1^n} \left[ \prod_{i=1}^{n} P(t_i|t_{i-1},t_{i-2}) P(c_i|t_i) \right] \times P(t_{n+1}|t_n)

2. TnT

Trigram estimates over tags are sparse, so TnT smooths them by linearly interpolating the unigram, bigram, and trigram relative frequencies, with $\lambda_1 + \lambda_2 + \lambda_3 = 1$:

$P(t_3|t_2, t_1)=\lambda_1 \hat{P}(t_3) + \lambda_2 \hat{P}(t_3|t_2) + \lambda_3 \hat{P}(t_3|t_2, t_1)$
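A minimal sketch of the interpolation; the lambda values here are illustrative, while TnT itself estimates them from the training data by deleted interpolation:

```python
# Sketch of TnT-style linear interpolation of tag n-gram estimates.
def interp(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9  # lambdas must sum to 1
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

p = interp(0.2, 0.4, 0.5)  # 0.1*0.2 + 0.3*0.4 + 0.6*0.5 = 0.44
```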

3. Character-Based Generative Model

• the Word-Based Generative Model has high recall on in-vocabulary (IV) words but low recall on out-of-vocabulary (OOV) words;
• the Character-Based Discriminative Model has high OOV recall but low IV recall.

The Character-Based Generative Model aims to combine the strengths of both by modeling the joint sequence of character-tag pairs $[c,t]$:

$\arg \mathop{\max}\limits_{t_1^n} P([c,t]_1^n)= \arg \mathop{\max}\limits_{t_1^n} \prod_{i=1}^n P([c,t]_i | [c,t]_{i-k}^{i-1})$
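Scoring a candidate tagging under this objective can be sketched as follows, here with $k = 1$ and a made-up probability table:

```python
# Sketch: score a tagging by the product of P([c,t]_i | [c,t]_{i-k}^{i-1}).
import math

def joint_score(pairs, logp, k=1, unk=math.log(1e-6)):
    # pairs: list of (character, tag) tuples; unk backs off unseen contexts.
    total = 0.0
    for i, ct in enumerate(pairs):
        ctx = tuple(pairs[max(0, i - k):i])  # previous k joint pairs
        total += logp.get((ctx, ct), unk)
    return total

# Illustrative conditional table, keyed by (context pairs, current pair).
logp = {((), ("研", "B")): math.log(0.3),
        ((("研", "B"),), ("究", "E")): math.log(0.5),
        ((("研", "B"),), ("究", "M")): math.log(0.05)}

better = joint_score([("研", "B"), ("究", "E")], logp)
worse = joint_score([("研", "B"), ("究", "M")], logp)
```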

4. Open-source implementation: Snownlp

isnowfy implemented both TnT and the Character-Based Generative Model in the Snownlp project, and compared them in a blog post against maximum matching and a word-based unigram model; the generative model achieves the highest accuracy. Snownlp's default segmenter is CharacterBasedGenerativeModel:

from snownlp import SnowNLP

s = SnowNLP('小明硕士毕业于中国科学院计算所，后在日本京都大学深造')
print('/'.join(s.words))
# 小明/硕士/毕业/于/中国/科学院/计算/所/，/后/在/日本/京都/大学/深造
# Jieba HMM: 小明/硕士/毕业于/中国/科学院/计算/所/，/后/在/日/本京/都/大学/深造

5. 参考资料

[1] Manning, Christopher D., and Hinrich Schütze. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press, 1999.
[2] Brants, Thorsten. "TnT: a statistical part-of-speech tagger." Proceedings of the sixth conference on Applied natural language processing. Association for Computational Linguistics, 2000.
[3] Wang, Kun, Chengqing Zong, and Keh-Yih Su. "Which is More Suitable for Chinese Word Segmentation, the Generative Model or the Discriminative One?." PACLIC. 2009.
[4] isnowfy, "A comparison of several Chinese word segmentation algorithms" (blog post).
[5] hankcs, "A Chinese word segmenter based on HMM2-trigram character sequence labeling, implemented in Java" (blog post).

posted @ 2016-12-15 15:43 Treant