How Does the N-gram Model Work?

A language model (LM) is a probability distribution over token sequences. For example, given a sentence $S = w_1, ..., w_t$, we estimate its probability by repeatedly applying the chain rule:

\[
\begin{aligned}
P(S) &= P(w_1, ..., w_t) \\
&= P(w_1, ..., w_{t-1}) \cdot P(w_t \mid w_1, ..., w_{t-1}) \\
&= P(w_1, ..., w_{t-2}) \cdot P(w_{t-1} \mid w_1, ..., w_{t-2}) \cdot P(w_t \mid w_1, ..., w_{t-1}) \\
&= P(w_1, w_2) \cdot P(w_3 \mid w_1, w_2) \cdots P(w_{t-1} \mid w_1, ..., w_{t-2}) \cdot P(w_t \mid w_1, ..., w_{t-1}) \\
&= P(w_1) \cdot P(w_2 \mid w_1) \cdot P(w_3 \mid w_1, w_2) \cdots P(w_{t-1} \mid w_1, ..., w_{t-2}) \cdot P(w_t \mid w_1, ..., w_{t-1})
\end{aligned}
\]

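The N-gram model builds on this factorization [1][2]. This article mainly follows the lecture notes from Prof. Sun Maosong's course at Tsinghua University and discusses how N-gram models work.

N-gram model. In an n-gram model, the prediction of $x_{i}$ depends only on the last $n-1$ tokens $x_{i-(n-1):i-1}$ rather than on the entire history.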

\[p(x_i \mid x_{1:i-1}) = p(x_i \mid x_{i-(n-1):i-1}). \]
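A minimal sketch of this truncation, assuming the history is held as a Python list of tokens. The helper name `ngram_context` is hypothetical and is not taken from the lecture notes:

```python
def ngram_context(history, n):
    """Return the last n-1 tokens of `history`, the only part an n-gram model conditions on."""
    if n < 1:
        raise ValueError("n must be >= 1")
    if n == 1:
        # A unigram model ignores the history entirely.
        return ()
    return tuple(history[-(n - 1):])


history = ["prepare", "for", "leap", "in", "the"]
print(ngram_context(history, 1))  # ()
print(ngram_context(history, 2))  # ('the',)
print(ngram_context(history, 3))  # ('in', 'the')
```

The `n == 1` case is handled separately because `history[-0:]` would return the whole list rather than an empty context.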

Specifically, the unigram, bigram, and trigram models are:

unigram: zeroth-order Markov model, $P(w_i)$

bigram: first-order Markov model, $P(w_i \mid w_{i-1})$

trigram: second-order Markov model, $P(w_i \mid w_{i-2}, w_{i-1})$

A bigram model, for example, is simply a first-order Markov model: each word depends only on the word immediately before it.
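Substituting these conditionals into the chain rule gives the sentence probability under each model. The following is a sketch that assumes no sentence-start padding symbol, so the first word (and, for the trigram, the second) falls back to a lower-order estimate:

\[
\begin{aligned}
\text{unigram:}\quad & P(S) \approx \prod_{i=1}^{t} P(w_i) \\
\text{bigram:}\quad & P(S) \approx P(w_1) \prod_{i=2}^{t} P(w_i \mid w_{i-1}) \\
\text{trigram:}\quad & P(S) \approx P(w_1) \cdot P(w_2 \mid w_1) \prod_{i=3}^{t} P(w_i \mid w_{i-2}, w_{i-1})
\end{aligned}
\]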

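Take an example from speech recognition. We want to estimate the probabilities of two candidate sentences, "prepare for leap in the dark" and "prepare for lip in the dark", using the BNS corpus, which contains 100,106,008 words.
The number of times each word or word pair occurs in the corpus is: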
COUNT(prepare)=3023
COUNT(for)=899331
COUNT(leap)=1045
COUNT(in)=1970532
COUNT(the)=2165569
COUNT(dark)=13489
COUNT(lip)=1592

COUNT(prepare for)=528
COUNT(for leap)=1
COUNT(leap in)=100
COUNT(in the)=535036
COUNT(the dark)=3668
COUNT(for lip)=2
COUNT(lip in)=25
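Counts like these come from a single pass over the corpus. A minimal sketch, assuming whitespace tokenization and a made-up toy corpus (not the corpus used above); `count_ngrams` is a hypothetical helper:

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Toy corpus invented for illustration only.
toy_corpus = "prepare for a leap in the dark and a leap in the light".split()

unigram_counts = count_ngrams(toy_corpus, 1)
bigram_counts = count_ngrams(toy_corpus, 2)

print(unigram_counts[("leap",)])      # 2
print(bigram_counts[("leap", "in")])  # 2
print(bigram_counts[("the", "dark")]) # 1
```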

Dividing each word count by the corpus size gives the single-word probabilities:
P(prepare)= 0.000030
P(for)=0.0090
P(leap)=0.00001
P(in)=0.020
P(the)=0.022
P(dark)=0.00013
P(lip)=0.000016

Dividing each word-pair count by the corpus size gives the word-pair (joint) probabilities:
P(prepare for)= 0.0000053
P(for leap)=0.00000001
P(leap in)=0.000001
P(in the)= 0.0053
P(the dark)=0.000037
P(for lip)= 0.00000002
P(lip in)= 0.00000025

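Word pairs are generally far less frequent than single words, which matches intuition.

A bigram model conditions the probability of the current word on the previous word. We therefore start by using the bigram model to compute the probability that "for" follows "prepare":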

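\[ P(\text{for} \mid \text{prepare}) = \frac{P(\text{prepare}, \text{for})}{P(\text{prepare})} = \frac{\text{Count}(\text{prepare}, \text{for})/N}{\text{Count}(\text{prepare})/N} = \frac{\text{Count}(\text{prepare}, \text{for})}{\text{Count}(\text{prepare})} = \frac{528}{3023} \approx 0.17 \]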
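The same ratio of counts can be written as a small function. This is a sketch of the maximum-likelihood estimate only, using a few of the counts listed above; the names `unigram_count`, `bigram_count`, and `bigram_prob` are hypothetical:

```python
# Counts copied from the lists above.
unigram_count = {"prepare": 3023, "for": 899331, "leap": 1045, "lip": 1592}
bigram_count = {
    ("prepare", "for"): 528,
    ("for", "leap"): 1,
    ("for", "lip"): 2,
}


def bigram_prob(prev, word):
    """Estimate P(word | prev) as Count(prev, word) / Count(prev)."""
    return bigram_count.get((prev, word), 0) / unigram_count[prev]


print(round(bigram_prob("prepare", "for"), 2))  # 0.17
```

Note that a word pair never seen in the corpus gets probability 0 under this estimate, since no smoothing is applied.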

Finally, we compute the probability of each sentence with the unigram and bigram models:
S1 = "prepare for leap in the dark"
S2 = "prepare for lip in the dark"

unigram:

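\[
\begin{aligned}
P(S1) &= P(\text{prepare}) \cdot P(\text{for}) \cdot P(\text{leap}) \cdot P(\text{in}) \cdot P(\text{the}) \cdot P(\text{dark}) \\
&= 0.000030 \times 0.0090 \times 0.00001 \times 0.02 \times 0.022 \times 0.00013 \\
&\approx 1.54 \times 10^{-19}
\end{aligned}
\]

\[
\begin{aligned}
P(S2) &= P(\text{prepare}) \cdot P(\text{for}) \cdot P(\text{lip}) \cdot P(\text{in}) \cdot P(\text{the}) \cdot P(\text{dark}) \\
&= 0.000030 \times 0.0090 \times 0.000016 \times 0.02 \times 0.022 \times 0.00013 \\
&\approx 2.47 \times 10^{-19}
\end{aligned}
\]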

bigram:
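The bigram estimate multiplies the unigram probability of the first word by conditional probabilities of the form Count(prev, word) / Count(prev). The following is a worked sketch computed directly from the raw counts listed earlier, with $N = 100{,}106{,}008$. It assumes no rounding of intermediate values, so the last digits can differ from results obtained with rounded probabilities:

\[
\begin{aligned}
P(S1) &= P(\text{prepare}) \cdot P(\text{for} \mid \text{prepare}) \cdot P(\text{leap} \mid \text{for}) \cdot P(\text{in} \mid \text{leap}) \cdot P(\text{the} \mid \text{in}) \cdot P(\text{dark} \mid \text{the}) \\
&= \frac{3023}{N} \cdot \frac{528}{3023} \cdot \frac{1}{899331} \cdot \frac{100}{1045} \cdot \frac{535036}{1970532} \cdot \frac{3668}{2165569} \\
&\approx 2.58 \times 10^{-16}
\end{aligned}
\]

\[
\begin{aligned}
P(S2) &= P(\text{prepare}) \cdot P(\text{for} \mid \text{prepare}) \cdot P(\text{lip} \mid \text{for}) \cdot P(\text{in} \mid \text{lip}) \cdot P(\text{the} \mid \text{in}) \cdot P(\text{dark} \mid \text{the}) \\
&= \frac{3023}{N} \cdot \frac{528}{3023} \cdot \frac{2}{899331} \cdot \frac{25}{1592} \cdot \frac{535036}{1970532} \cdot \frac{3668}{2165569} \\
&\approx 8.47 \times 10^{-17}
\end{aligned}
\]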

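We can see that:

Under the unigram estimate, $P(S1) < P(S2)$, simply because "lip" is more frequent than "leap" in the corpus. Under the bigram estimate, $P(S1) > P(S2)$: the two products differ only in $P(\text{leap} \mid \text{for}) \cdot P(\text{in} \mid \text{leap}) = \frac{1}{899331} \cdot \frac{100}{1045}$ versus $P(\text{lip} \mid \text{for}) \cdot P(\text{in} \mid \text{lip}) = \frac{2}{899331} \cdot \frac{25}{1592}$, and the first is about three times larger. The bigram model therefore prefers the idiomatic "leap in the dark", which suggests it is more effective than the unigram model here.
$P_{\text{bigram}}(S1)/P_{\text{unigram}}(S1) = 1604$
$P_{\text{bigram}}(S2)/P_{\text{unigram}}(S2) = 340$
The bigram model also assigns both sentences probabilities hundreds to thousands of times higher than the unigram model does, so its estimates are sharper.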
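The whole comparison can be reproduced from the raw counts. This is a minimal sketch, assuming unsmoothed maximum-likelihood estimates and the first-word convention used above; the function and variable names are hypothetical:

```python
import math

N = 100_106_008  # corpus size in words, from the text

# Counts copied from the lists above.
unigram_count = {
    "prepare": 3023, "for": 899331, "leap": 1045, "in": 1970532,
    "the": 2165569, "dark": 13489, "lip": 1592,
}
bigram_count = {
    ("prepare", "for"): 528, ("for", "leap"): 1, ("leap", "in"): 100,
    ("in", "the"): 535036, ("the", "dark"): 3668,
    ("for", "lip"): 2, ("lip", "in"): 25,
}


def unigram_sentence_prob(words):
    """Product of independent word probabilities Count(w) / N."""
    return math.prod(unigram_count[w] / N for w in words)


def bigram_sentence_prob(words):
    """P(w_1) times the product of Count(prev, w) / Count(prev)."""
    p = unigram_count[words[0]] / N
    for prev, word in zip(words, words[1:]):
        p *= bigram_count[(prev, word)] / unigram_count[prev]
    return p


s1 = "prepare for leap in the dark".split()
s2 = "prepare for lip in the dark".split()

for name, s in (("S1", s1), ("S2", s2)):
    u, b = unigram_sentence_prob(s), bigram_sentence_prob(s)
    print(f"{name}: unigram={u:.3g} bigram={b:.3g} ratio={b / u:.0f}")
```

Because this sketch uses unrounded counts, the printed ratios land close to, but not exactly on, the 1604 and 340 reported above. The ordering of the two sentences under each model is unaffected by rounding.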

Ref:
[1] https://stanford-cs324.github.io/winter2022/
[2] https://github.com/datawhalechina/so-large-lm

Posted 2024-01-15 20:28 by Teddyonthebench
