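Study Diary: How to Write a Simple English Spelling Checker
Peter Norvig, co-author of the classic textbook "Artificial Intelligence: A Modern Approach", wrote an article about spelling correction (there is the original article and a Chinese translation).
In it he explains, clearly and step by step, how to build, test, and improve a spelling checker. If you are an AI beginner like me,
follow along in the master's footsteps.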
#!/usr/bin/env python
import re, collections

def words(text):
    return re.findall('[a-z]+', text.lower())

def train(features):
    # Every word starts at 1, so a word never seen in training is not treated as impossible.
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    # All strings one edit away from word.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    # Known words two edits away from word.
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words):
    return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
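To test the correct function, pass it a word and the program returns the corrected word.
How it works:
1 Probability Theory
How does the code above work? Given a word, we try to choose its most likely correct spelling (the word may of course already be spelled correctly). We cannot know for certain which correction was intended: "lates", for example, could be meant as "late" or as "latest". So we treat this as a probability problem: among all candidate corrections c of a given word w, we pick the one with the highest probability. In mathematical terms: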
argmax_c P(c|w)    (1)
By Bayes' theorem this is equivalent to:
argmax_c P(w|c) P(c) / P(w)    (2)
Since P(w) is the same for every candidate c, we can ignore it, which simplifies the expression to:
argmax_c P(w|c) P(c)    (3)
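Expression (3) has three parts. Reading from right to left:
1. P(c), the probability that the candidate correction c occurs at all. This is the language model: it answers how likely c is to appear in English text. So P("the") is relatively high, while P("zxzxzx") is close to zero.
2. P(w|c), the probability that w is typed when the author meant c. This is the error model: it answers how likely it is that someone who wanted to type c typed w by mistake.
3. argmax_c, the control mechanism: enumerate all feasible values of c and choose the one with the highest combined probability.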
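As a small illustration of expression (3), here is a sketch that picks the best correction for "lates" from hand-made probabilities. All the numbers are invented for this example (they are not estimated from big.txt), and best_correction is a hypothetical helper name, not part of the original program.

```python
# Toy probabilities, invented for illustration only.
# Language model P(c): how common each candidate is in English text.
P_c = {'late': 0.0006, 'latest': 0.0002, 'lates': 0.000001}
# Error model P(w|c) for the typed word w = 'lates'.
P_w_given_c = {'late': 0.01, 'latest': 0.02, 'lates': 0.95}

def best_correction(candidates):
    # argmax over c of P(w|c) * P(c); P(w) is identical for all c, so it is dropped.
    return max(candidates, key=lambda c: P_w_given_c[c] * P_c[c])

print(best_correction(['late', 'latest', 'lates']))
```

With these made-up numbers the product is largest for "late", even though "latest" has the higher error-model probability, because the language model favors "late" more strongly.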
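Now let us see how the code does this.
First we read in a large text file, big.txt, which contains about a million words. The file gathers text from a variety of domains, so its vocabulary is reasonably complete and representative. We use it as training data to build a dictionary of how often each word occurs. This is the train function: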
def words(text):
    return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))
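The result is the NWORDS dictionary: each key is a word, and its value is the number of times the word occurs in big.txt plus 1. Any word that never occurs gets the default value 1.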
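To make the "+1" concrete, the sketch below trains on a one-line sample string instead of big.txt. The sample text and the name counts are made up for the example; words and train are copied from above.

```python
import re
import collections

def words(text):
    return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

# A tiny stand-in for big.txt (hypothetical sample text).
sample = "The cat saw the other cat. THE END"
counts = train(words(sample))

print(counts['the'])      # occurs 3 times, stored as 3 + 1
print(counts['cat'])      # occurs 2 times, stored as 2 + 1
print('bird' in counts)   # membership test does not add the word
print(counts['dog'])      # never seen: the default value
```

One detail worth knowing: indexing a defaultdict with a missing key inserts that key, while `in` and `.get` do not. That is presumably why the corrector tests membership with `in` and ranks candidates with `NWORDS.get`.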
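Next we enumerate all possible corrections c of a given word w. We start with edit distance 1: c can be obtained from w by one of four operations, namely deletion (remove one letter), transposition (swap two adjacent letters), alteration (change one letter into another), or insertion (add one letter). The function is: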
def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)
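To get a feel for how many candidates edits1 generates, the sketch below counts each kind of edit separately. By construction a word of length n gives n deletions, n-1 transpositions, 26n alterations, and 26(n+1) insertions, that is 54n+25 strings before duplicates are removed. The helper edit_counts is a hypothetical name introduced only for this check.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edit_counts(word):
    # Same list comprehensions as edits1, but returning the size of each group.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    distinct = set(deletes + transposes + replaces + inserts)
    return len(deletes), len(transposes), len(replaces), len(inserts), distinct

n = len('cat')
d, t, r, i, distinct = edit_counts('cat')
print(d, t, r, i)                      # n, n-1, 26n, 26(n+1)
print(d + t + r + i == 54 * n + 25)    # total before removing duplicates
print(len(distinct) <= 54 * n + 25)    # the set is no larger than the raw total
print('act' in distinct, 'at' in distinct, 'cart' in distinct)
```

Note that the alterations include replacing a letter with itself, so the original word is always a member of its own edits1 set.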
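For edit distance 2, we simply apply edits1 again to every result of edits1:
def edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1))
This produces a great many strings that are not words at all, so we keep only the candidates c that are real words, by looking them up in NWORDS:
def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)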
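The filtering step can be tried out with a tiny hand-written vocabulary in place of the real NWORDS. The four words and their counts below are assumptions made up for this sketch; edits1 and known_edits2 are the functions from the text.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

# Hypothetical miniature vocabulary standing in for NWORDS.
NWORDS = {'something': 10, 'soothing': 3, 'seething': 2, 'anything': 8}

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

# 'something' and 'soothing' are one edit away, 'seething' is two edits away,
# and 'anything' needs three edits, so it is filtered out.
print(sorted(known_edits2('somthing')))
```

Words at distance 1 also appear in the result, because a "replace a letter with itself" edit counts as one of the two steps.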
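That takes care of P(c). Next comes the error model, P(w|c). Training a real error model would need a corpus of spelling mistakes, so the author uses a simpler, though less accurate, shortcut: assume that a known word with no edits is the most probable, followed by known words at edit distance 1, then known words at edit distance 2. If none of these exist, return the given word unchanged:
def known(words):
    return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)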
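The priority order built into the chain of `or` expressions can be seen with a two-word vocabulary. The counts below are invented for this sketch; the functions are the ones from the text.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

# Hypothetical counts: 'the' is far more frequent than 'than'.
NWORDS = {'the': 1000, 'than': 5}

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words):
    return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

print(correct('the'))     # already a known word, returned as is
print(correct('teh'))     # one transposition away from 'the'
print(correct('tham'))    # 'than' (distance 1) beats 'the' (distance 2)
print(correct('qqqqq'))   # nothing known within two edits, returned unchanged
```

The third call shows the cost of the shortcut: a rare word one edit away always wins over a very common word two edits away, whatever their counts.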
Evaluation:
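Next we evaluate how accurate the model is. The author first tried a few words by hand and found that the system worked reasonably well.
He then decided to use misspelling data downloaded from the web. We can get such data from http://www.dcs.bbk.ac.uk/~ROGER/corpora.html (the file wikipedia.dat). Peter's test code is as follows: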
tests1 = { 'access': 'acess', 'accessing': 'accesing',
    'accommodation': 'accomodation acommodation acomodation',
    'account': 'acount', ...}

tests2 = {'forbidden': 'forbiden', 'decisions': 'deciscions descisions',
    'supposedly': 'supposidly', 'embellishing': 'embelishing', ...}

def spelltest(tests, bias=None, verbose=False):
    import time
    # time.time() replaces time.clock(), which newer Python versions no longer provide.
    n, bad, unknown, start = 0, 0, 0, time.time()
    if bias:
        for target in tests: NWORDS[target] += bias
    for target, wrongs in tests.items():
        for wrong in wrongs.split():
            n += 1
            w = correct(wrong)
            if w != target:
                bad += 1
                unknown += (target not in NWORDS)
                if verbose:
                    print('%r => %r (%d); expected %r (%d)' % (
                        wrong, w, NWORDS[w], target, NWORDS[target]))
    return dict(bad=bad, n=n, bias=bias, pct=int(100. - 100.*bad/n),
                unknown=unknown, secs=int(time.time() - start))

print(spelltest(tests1))
print(spelltest(tests2))  ## only do this after everything is debugged
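Let us look at the test program.
The test data is a dictionary: each key is the correct spelling of a word (that is, c), and the value is its possible misspellings, as a single string separated by spaces.
We need to count the number of wrong corrections (bad), the number of those failures whose target word is missing from the training model (unknown), and the hit rate (pct); see the code above for the details. I downloaded the data file wikipedia.dat directly and split it into words myself. If you are interested, my test code is at the bottom of this post.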
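The bookkeeping for n, bad, unknown, and pct can be sketched independently of the real corrector. In the example below, accuracy, toy_tests, and toy_fixes are hypothetical names, and the "corrector" is just a lookup table, so the numbers say nothing about the real model.

```python
def accuracy(tests, correct_fn, known_words):
    # tests maps each correct spelling to a space-separated string of misspellings.
    n = bad = unknown = 0
    for target, wrongs in tests.items():
        for wrong in wrongs.split():
            n += 1
            if correct_fn(wrong) != target:
                bad += 1
                # A failure is "unknown" when the target is not in the vocabulary at all.
                unknown += (target not in known_words)
    return dict(n=n, bad=bad, unknown=unknown, pct=100.0 * (n - bad) / n)

toy_tests = {'access': 'acess', 'accommodation': 'accomodation acommodation'}
toy_fixes = {'acess': 'access', 'accomodation': 'accommodation'}

# A stand-in corrector: fix what the table knows, leave everything else unchanged.
result = accuracy(toy_tests, lambda w: toy_fixes.get(w, w), known_words={'access'})
print(result)
```

Using floating-point division for pct avoids the rounding that integer division would introduce when only a few of many words are wrong.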
Debug output:
wrong: similiarity
correct: similarity
target: similarity
wrong: consolodated
correct: consolidated
target: consolidated
wrong: surrepetitious
correct: surrepetitious
target: surreptitious
***********
wrong: surreptious
correct: surreptious
target: surreptitious
***********
wrong: desireable
correct: desirable
target: desirable
wrong: contravercial
correct: controversial
target: controversial
wrong: controvercial
correct: controversial
target: controversial
wrong: comandos
correct: commands
target: commandos
***********
wrong: commandoes
correct: commander
target: commandos
***********
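{'unknown': 2, 'secs': 0, 'bad': 4, 'pct': 100, 'n': 1852}
Note that this was a debugging run: only the nine misspellings listed above were actually corrected, so the 4 failures are out of 9 attempts, not out of 1852, and pct shows 100 only because of integer division.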
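My own test code follows. release_1_corrector is the module containing the spelling code above; adjust the path and name for your own setup: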
#!/usr/bin/env python
import re, collections
import sys
import time
sys.path.append('/home/wenwt/machingLearning/source/')
from release_1_corrector import NWORDS
from release_1_corrector import correct

def words_test():
    # Read wikipedia.dat and return its lines.
    with open('wikipedia.dat', 'r') as data:
        text = data.read()
    return text.split('\n')

def dict_test(words):
    # A line starting with '$' is a correct spelling; the lines that follow
    # are misspellings of that word.
    pattern = '^[$][a-z]+'
    spell_dict = collections.defaultdict(list)
    target = None
    for word in words:
        if re.findall(pattern, word):
            target = word[1:]
        elif word and target is not None:
            spell_dict[target].append(word)
    return spell_dict

def evaluation(tests, show=10):
    n, bad, unknown, start = 0, 0, 0, time.time()
    for target, wrongs in tests.items():
        for wrong in wrongs:
            n += 1
            w = correct(wrong)
            # Print only the first few results, but score every misspelling.
            if n <= show:
                print('wrong: %s' % wrong)
                print('correct: %s' % w)
                print('target: %s' % target)
            if target != w:
                if n <= show:
                    print('***********')
                bad += 1
                unknown += (target not in NWORDS)
    return dict(bad=bad, n=n, pct=int(100. - 100.*bad/n),
                unknown=unknown, secs=int(time.time() - start))

spell_dict = dict_test(words_test())
spell_test = evaluation(spell_dict)
print(spell_test)