Python Natural Language Processing Study Notes (47): 5.8 Summary

5.8 Summary

Words can be grouped into classes, such as nouns, verbs, adjectives, and adverbs. These classes are known as lexical categories or parts-of-speech. Parts-of-speech are assigned short labels, or tags, such as NN and VB.


The process of automatically assigning parts-of-speech to words in text is called part-of-speech tagging, POS tagging, or just tagging.


Automatic tagging is an important step in the NLP pipeline, and is useful in a variety of situations, including predicting the behavior of previously unseen words, analyzing word usage in corpora, and text-to-speech systems.


Some linguistic corpora, such as the Brown Corpus, have been POS tagged.


A variety of tagging methods are possible, e.g., default tagger, regular expression tagger, unigram tagger, and n-gram taggers. These can be combined using a technique known as backoff.

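As a rough sketch of how these methods fit together (simplified hand-rolled functions, not NLTK's actual `DefaultTagger`/`RegexpTagger`/`UnigramTagger` classes; the training data and patterns here are invented):

```python
import re
from collections import Counter, defaultdict

def default_tag(word):
    return 'NN'  # default tagger: always guess the most common tag

def regexp_tag(word):
    # regular expression tagger: tag by word shape
    if re.fullmatch(r'.*ing', word):
        return 'VBG'
    if re.fullmatch(r'\d+', word):
        return 'CD'
    return None

def make_unigram_tagger(tagged_words):
    # unigram tagger: map each word to its most frequent training tag
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    table = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return lambda word: table.get(word)

def tag_with_backoff(word, taggers):
    # try each tagger in turn, backing off when one returns None
    for tagger in taggers:
        tag = tagger(word)
        if tag is not None:
            return tag
    return None

train = [('the', 'AT'), ('cat', 'NN'), ('sat', 'VBD'), ('the', 'AT')]
taggers = [make_unigram_tagger(train), regexp_tag, default_tag]
print([(w, tag_with_backoff(w, taggers)) for w in ['the', 'running', '42', 'dog']])
# [('the', 'AT'), ('running', 'VBG'), ('42', 'CD'), ('dog', 'NN')]
```

Each word falls through the chain until some tagger can handle it: known words use the unigram table, unknown words with a telling shape use the regular expressions, and everything else gets the default.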

Taggers can be trained and evaluated using tagged corpora.
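A minimal illustration of training and evaluation (the tiny hand-made corpus below stands in for a real tagged corpus such as Brown):

```python
from collections import Counter, defaultdict

# Toy tagged corpus; in practice this would come from a corpus like Brown.
train_data = [('the', 'AT'), ('dog', 'NN'), ('barks', 'VBZ'),
              ('the', 'AT'), ('cat', 'NN'), ('sleeps', 'VBZ')]
test_data  = [('the', 'AT'), ('cat', 'NN'), ('barks', 'VBZ'), ('loudly', 'RB')]

# "Train": record each word's most frequent tag.
counts = defaultdict(Counter)
for word, tag in train_data:
    counts[word][tag] += 1
model = {w: c.most_common(1)[0][0] for w, c in counts.items()}

# "Evaluate": fraction of test tokens tagged correctly.
correct = sum(1 for w, t in test_data if model.get(w) == t)
accuracy = correct / len(test_data)
print(f'accuracy = {accuracy:.2f}')  # 3 of 4 correct; 'loudly' is unseen
```

Holding the test data out of training, as here, is what makes the accuracy figure meaningful.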


Backoff is a method for combining models: when a more specialized model (such as a bigram tagger) cannot assign a tag in a given context, we back off to a more general model (such as a unigram tagger).

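The idea in miniature (lookup tables invented for illustration): consult the specialized bigram context first, and only fall back to the general unigram model when that context was never seen in training.

```python
# Sketch of bigram-to-unigram backoff.
bigram_table  = {('AT', 'duck'): 'NN'}   # 'duck' after a determiner is a noun
unigram_table = {'duck': 'VB', 'the': 'AT'}

def tag_word(prev_tag, word):
    tag = bigram_table.get((prev_tag, word))   # specialized model first
    if tag is None:
        tag = unigram_table.get(word)          # back off to general model
    return tag

print(tag_word('AT', 'duck'))   # 'NN': bigram context seen in training
print(tag_word('TO', 'duck'))   # 'VB': unseen context, unigram backoff
```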

Part-of-speech tagging is an important, early example of a sequence classification task in NLP: a classification decision at any one point in the sequence makes use of words and tags in the local context.


A dictionary is used to map between arbitrary types of information, such as a string and a number: freq['cat'] = 12. We create dictionaries using the brace notation: pos = {}, pos = {'furiously': 'adv', 'ideas': 'n', 'colorless': 'adj'}.

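The dictionary operations mentioned above, in a runnable form:

```python
# Dictionaries map arbitrary keys to values.
freq = {}
freq['cat'] = 12                  # assignment creates the entry
pos = {'furiously': 'adv', 'ideas': 'n', 'colorless': 'adj'}

print(freq['cat'])                # 12
print(pos['ideas'])               # n
print('colorless' in pos)         # True
print(pos.get('cat', 'unknown'))  # default value when the key is absent
```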

N-gram taggers can be defined for large values of n, but once n is larger than 3, we usually encounter the sparse data problem; even with a large quantity of training data, we see only a tiny fraction of possible contexts.

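A back-of-the-envelope illustration of why the contexts become sparse (the 5-tag inventory is invented; real tagsets such as Brown's are far larger):

```python
# Number of possible tag contexts grows exponentially with n.
tags = ['AT', 'NN', 'VB', 'JJ', 'IN']          # tiny 5-tag inventory
counts = {n: len(tags) ** n for n in (1, 2, 3, 4)}
print(counts)  # {1: 5, 2: 25, 3: 125, 4: 625}

# With a realistic tagset of ~50 tags, n=3 already yields 50**3 = 125000
# possible contexts -- far more than any training corpus can cover.
```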

Transformation-based tagging involves learning a series of repair rules of the form "change tag s to tag t in context c," where each rule fixes mistakes and possibly introduces a (smaller) number of errors.

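One such repair rule in action (a simplified sketch, not the Brill tagger's actual machinery; the rule "change NN to VB when the previous tag is TO" and the example sentence are invented):

```python
# Apply a single transformation rule of the form
# "change tag from_tag to to_tag when the previous tag is prev_tag".
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    out = list(tagged)
    for i in range(1, len(out)):
        if out[i][1] == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (out[i][0], to_tag)
    return out

sentence = [('to', 'TO'), ('increase', 'NN'), ('grants', 'NNS')]
print(apply_rule(sentence, 'NN', 'VB', 'TO'))
# [('to', 'TO'), ('increase', 'VB'), ('grants', 'NNS')]
```

A transformation-based tagger learns an ordered list of such rules, each chosen because it repairs more errors than it introduces.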

posted @ 2011-08-30 22:46 by 牛皮糖NewPtone