Study Diary: How to Write a Simple English Spelling Checker

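Peter Norvig, co-author of the classic textbook "Artificial Intelligence: A Modern Approach", wrote an essay on spelling correction (see the original essay and its Chinese translation). In it he explains, in an accessible way, how to build, test, and improve a spelling corrector step by step. If you are an AI beginner like me, follow along with him.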

#!/usr/bin/env python3
import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    # Counts start at 1, so a word never seen in training still gets weight 1.
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

with open('big.txt') as f:
    NWORDS = train(words(f.read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    # Every string that is exactly one edit away from word.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    # Known words that are two edits away from word.
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    # Prefer the word itself, then known words at distance 1, then distance 2.
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

To test the correct function, pass it a word and the program returns the corrected spelling.

How it works:

1 Probability Theory

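  How does the code above work? Given a word, we try to choose its most likely correct spelling (the word may of course already be spelled correctly). We cannot know for certain which spelling was intended: "lates", for example, could be a misspelling of "late" or of "latest". So we turn to probability. Among all possible corrections c of a given word w, we look for the one with the highest probability:

  argmax_c P(c|w)   (1)

By Bayes' theorem, this is equivalent to:

  argmax_c P(w|c) P(c) / P(w)  (2)

Since P(w) is the same for every candidate c, we can ignore it, and the expression simplifies to:

  argmax_c P(w|c) P(c)  (3)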

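Expression (3) has three parts. Reading from right to left:

  1. P(c), the probability that the candidate correction c occurs on its own. This is called the language model: it answers how likely c is to appear in English text. So P("the") is relatively high, while P("zxzxzx") is very low.

  2. P(w|c), the probability that w is typed when the author meant c. This is the error model: it answers how likely someone who meant to type c is to type w by mistake.

  3. argmax_c, the control mechanism: enumerate all feasible values of c and choose the one with the highest combined probability.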
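As a rough illustration of expression (3), the sketch below scores two candidate corrections for "lates" and picks the larger product. All of the probabilities are made-up numbers for illustration only, and the names p_c, p_w_given_c, and best_correction are hypothetical; they do not appear in the original code.

```python
# Hypothetical language model P(c): how common each candidate word is.
p_c = {'late': 0.0005, 'latest': 0.0002}

# Hypothetical error model P(w|c): chance of typing 'lates' when c was meant.
p_w_given_c = {'late': 0.01, 'latest': 0.03}

def best_correction(candidates):
    # argmax over c of P(w|c) * P(c). P(w) is identical for every c,
    # so leaving it out does not change which candidate wins.
    return max(candidates, key=lambda c: p_w_given_c[c] * p_c[c])

print(best_correction(['late', 'latest']))  # prints latest with these numbers
```

With these invented numbers the rarer word wins because its error probability is higher; different numbers could easily reverse the result.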

Next, let's look at how the code works:

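First, we read in a large text file, big.txt, which contains about a million words. The file draws on text from several sources, which makes its vocabulary reasonably broad and representative. We use it as training data to build a dictionary of how often each word occurs. This is the job of the train function: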

def words(text): return re.findall('[a-z]+', text.lower())
def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model
with open('big.txt') as f:
    NWORDS = train(words(f.read()))

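The result is the NWORDS dictionary: each key is a word, and its value is the number of times the word occurs in big.txt plus 1. Any word that never occurs gets the default value 1.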
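To make the "count plus 1" behaviour concrete, here is a small sketch that runs the same words and train functions on a one-line sample instead of big.txt. The sample sentence and the variable names sample and counts are assumptions made for this example.

```python
import re
import collections

def words(text):
    return re.findall('[a-z]+', text.lower())

def train(features):
    # Every word starts at 1, then gains 1 per occurrence.
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

sample = "The cat saw the other cat. The end."  # invented sample text
counts = train(words(sample))

print(counts['the'])     # 3 occurrences + 1 = 4
print(counts['cat'])     # 2 occurrences + 1 = 3
print('dog' in counts)   # False: membership tests do not create entries
print(counts['dog'])     # 1: the default for an unseen word
```

Note that indexing an unseen word, as in the last line, also stores it in the defaultdict, which is why the corrector tests membership with `in` rather than indexing.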

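Next, we enumerate all possible corrections c of a given word w. We start with edit distance 1: c can be reached from w by one of four operations: deletion (remove one letter), transposition (swap two adjacent letters), alteration (change one letter to another), or insertion (add a letter). The function is: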

def edits1(word):
   splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   deletes    = [a + b[1:] for a, b in splits if b]
   transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
   replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
   inserts    = [a + c + b     for a, b in splits for c in alphabet]
   return set(deletes + transposes + replaces + inserts)
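To get a feel for how many candidates edits1 generates, the sketch below counts each of the four lists separately. For a word of length n, and assuming a full 26-letter alphabet, there are n deletions, n-1 transpositions, 26n alterations, and 26(n+1) insertions, which adds up to 54n+25 strings before duplicates are removed. The helper name edit_counts is hypothetical.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edit_counts(word):
    # Same four lists as edits1, returned as sizes instead of a merged set.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return len(deletes), len(transposes), len(replaces), len(inserts)

word = 'something'
d, t, r, i = edit_counts(word)
print(d, t, r, i)                              # 9 8 234 260
print(d + t + r + i, 54 * len(word) + 25)      # 511 511
```

The set returned by edits1 is somewhat smaller than this total, because some edits produce the same string (for example, replacing a letter with itself returns the original word).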

For edit distance 2, we simply apply edits1 to every result of edits1:

def edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1))

 

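This can generate many strings that are not words, so we need to discard every c that is not a real word. We do this by looking each one up in NWORDS: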

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

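That takes care of P(c). Next comes the error model, P(w|c). Estimating it properly would require training data on spelling errors, so the author uses a simpler, though possibly less accurate, shortcut: assume that a known word with no edits is the most likely, followed by known words at edit distance 1, then known words at edit distance 2. If none of these exist, return the given word unchanged: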

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
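The priority order can be seen on a toy dictionary. The sketch below repeats the functions above but replaces big.txt with a tiny hand-written frequency table; the words and counts in it are invented for this example.

```python
import collections

alphabet = 'abcdefghijklmnopqrstuvwxyz'

# Toy stand-in for NWORDS; the words and counts are invented.
NWORDS = collections.defaultdict(lambda: 1, {
    'late': 50, 'latest': 20, 'lattes': 2, 'there': 1000, 'three': 2})

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words):
    return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

print(correct('lattes'))  # lattes: a known word is kept, even if rare
print(correct('lates'))   # late: the most frequent of late, latest, lattes
print(correct('threee'))  # three: distance 1 beats the far more common 'there' at distance 2
print(correct('zzzzz'))   # zzzzz: nothing known within two edits, so unchanged
```

The third case shows the cost of the shortcut: frequency is only compared within one distance tier, so a very common word two edits away never beats a rare word one edit away.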

Evaluation:

Next we evaluate how accurate this model is. The author tried some words by hand and found that it worked quite well.

He then decided to test against misspelling data taken from the web. The data, including wikipedia.dat, can be downloaded from http://www.dcs.bbk.ac.uk/~ROGER/corpora.html. His test code is as follows:

tests1 = { 'access': 'acess', 'accessing': 'accesing', 'accommodation':
    'accomodation acommodation acomodation', 'account': 'acount', ...}

tests2 = {'forbidden': 'forbiden', 'decisions': 'deciscions descisions',
    'supposedly': 'supposidly', 'embellishing': 'embelishing', ...}

def spelltest(tests, bias=None, verbose=False):
    import time
    n, bad, unknown, start = 0, 0, 0, time.perf_counter()
    if bias:
        for target in tests: NWORDS[target] += bias
    for target, wrongs in tests.items():
        for wrong in wrongs.split():
            n += 1
            w = correct(wrong)
            if w != target:
                bad += 1
                # Count failures whose target never appeared in training.
                unknown += (target not in NWORDS)
                if verbose:
                    print('%r => %r (%d); expected %r (%d)' % (
                        wrong, w, NWORDS[w], target, NWORDS[target]))
    return dict(bad=bad, n=n, bias=bias, pct=int(100. - 100.*bad/n),
                unknown=unknown, secs=int(time.perf_counter()-start))

print(spelltest(tests1))
print(spelltest(tests2)) ## only do this after everything is debugged

Let's walk through the test program:

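The test data is a dictionary: each key is a correctly spelled word (that is, c), and each value is a single string holding its possible misspellings, separated by spaces.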
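The sketch below shows how such a dictionary expands into individual test cases. The two entries are copied from tests1 above; the name pairs is hypothetical.

```python
tests = {'access': 'acess',
         'accommodation': 'accomodation acommodation acomodation'}

# One (misspelling, expected word) pair per space-separated misspelling.
pairs = [(wrong, target)
         for target, wrongs in tests.items()
         for wrong in wrongs.split()]

for wrong, target in pairs:
    print(wrong, '->', target)
print(len(pairs))  # 4 test cases from 2 dictionary entries
```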

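We need to count the number of wrong corrections (bad), the number of those whose target word is missing from the training model (unknown), and the hit rate (pct); see the code above for how each is computed. I downloaded the data file wikipedia.dat directly and split it into words. If you are interested, my test code is at the bottom of this post.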
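As a sketch of the file layout my parser expects, assume each correct word sits on a line starting with '$' and its misspellings follow one per line. The sample text below is hand-written in that layout rather than copied from the real file, and parse_misspellings is a hypothetical helper name.

```python
import collections

# Hand-written sample in the assumed layout: '$' marks a correct spelling.
sample = """$similarity
similiarity
$surreptitious
surrepetitious
surreptious
"""

def parse_misspellings(text):
    spell_dict = collections.defaultdict(list)
    target = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith('$'):
            target = line[1:]             # a new correct spelling
        elif target is not None:
            spell_dict[target].append(line)
    return dict(spell_dict)

print(parse_misspellings(sample))
```

Unlike the space-separated strings in tests1, this gives a list of misspellings per word, so the evaluation loop can iterate over it without calling split().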

Debug output:

wrong:  similiarity
correct:  similarity
target:  similarity
wrong:  consolodated
correct:  consolidated
target:  consolidated
wrong:  surrepetitious
correct:  surrepetitious
target:  surreptitious
***********
wrong:  surreptious
correct:  surreptious
target:  surreptitious
***********
wrong:  desireable
correct:  desirable
target:  desirable
wrong:  contravercial
correct:  controversial
target:  controversial
wrong:  controvercial
correct:  controversial
target:  controversial
wrong:  comandos
correct:  commands
target:  commandos
***********
wrong:  commandoes
correct:  commander
target:  commandos
***********
{'unknown': 2, 'secs': 0, 'bad': 4, 'pct': 100, 'n': 1852}
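The summary reports pct as 100 even though bad is 4. One likely explanation, assuming the run used Python 2-style integer division (where dividing two integers floors the result), is that 100*bad/n rounds down to 0. The sketch below compares the two calculations using the bad and n values from the summary line; the names pct_floor and pct_float are hypothetical.

```python
bad, n = 4, 1852  # values taken from the summary line above

# Floor division, which is what int / int does in Python 2.
pct_floor = int(100 - 100 * bad // n)

# Float division, which keeps the fractional part.
pct_float = 100.0 - 100.0 * bad / n

print(pct_floor)            # 100
print(round(pct_float, 2))  # 99.78
```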

 

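My test code. release_1_corrector is the module containing the spelling-corrector code above; adjust the path and module name to suit your own setup: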

#!/usr/bin/env python3
import collections
import sys
import time
sys.path.append('/home/wenwt/machingLearning/source/')
from release_1_corrector import NWORDS
from release_1_corrector import correct

def words_test():
    # Read the corpus file and return its lines.
    with open('wikipedia.dat', 'r') as data:
        return data.read().split('\n')

def dict_test(words):
    # A line starting with '$' is a correct spelling; the lines after it,
    # up to the next '$' line, are its misspellings.
    spell_dict = collections.defaultdict(list)
    target = None
    for word in words:
        word = word.strip()
        if word.startswith('$'):
            target = word[1:]
        elif word and target is not None:
            spell_dict[target].append(word)
    return spell_dict

def evaluation(tests, max_print=9):
    # Correct every misspelling; print details for the first max_print only.
    n, bad, unknown, start = 0, 0, 0, time.perf_counter()
    for target, wrongs in tests.items():
        for wrong in wrongs:
            n += 1
            w = correct(wrong)
            if n <= max_print:
                print('wrong: ', wrong)
                print('correct: ', w)
                print('target: ', target)
            if target != w:
                if n <= max_print:
                    print('***********')
                bad += 1
                # Count failures whose target is missing from the model.
                unknown += target not in NWORDS
    return dict(bad=bad, n=n, pct=int(100 - 100*bad/n), unknown=unknown,
                secs=int(time.perf_counter()-start))

spell_dict = dict_test(words_test())
spell_test = evaluation(spell_dict)
print(spell_test)

 

posted @ 2015-04-30 19:15  快乐的小土狗