Summary: 3.8 Segmentation. This section discusses more advanced concepts, which you may prefer to skip on a first reading of this chapter. Tokenization is an instance of a more general problem of segmentation. In this section, we will look at two other instances of this problem, which use radically ... Read more
posted @ 2011-08-06 22:46 牛皮糖NewPtone Views (1716) Comments (0) Recommendations (0)
Summary: 3.7 Regular Expressions for Tokenizing Text. Tokenization is the task of cutting a string into identifiable linguistic units that constitute a piece of language data. Although it is a fundamental task, we have been able to delay it until now because many corpora are already tokenized, and ... Read more
posted @ 2011-08-06 22:36 牛皮糖NewPtone Views (3735) Comments (0) Recommendations (0)
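As a rough illustration of the tokenization idea summarized in 3.7, here is a minimal sketch that uses only Python's standard `re` module. The function name `simple_tokenize` and the pattern are assumptions made for this example, not the tokenizer developed in the full post; real text needs a more careful pattern (contractions, hyphens, abbreviations).

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    Hypothetical helper for illustration only.
    \\w+     matches a run of letters, digits, or underscores
    [^\\w\\s] matches one character that is neither a word character nor whitespace
    """
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Hello, world!")
print(tokens)
```

Note that this pattern splits a contraction such as "It's" into three tokens, which is one reason tokenization is harder than it first appears.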
Summary: 3.6 Normalizing Text. In earlier program examples we have often converted text to lowercase before doing anything with its words, e.g., set(w.lower() for w in text). By using lower(), we have normalized the text to lowercase so that the distinction between The and the is ignored. Often we want t... Read more
posted @ 2011-08-06 22:27 牛皮糖NewPtone Views (2176) Comments (0) Recommendations (0)
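The lowercasing step mentioned in the 3.6 summary can be sketched as follows. The word list is made-up sample data, assumed here only to show the effect of `set(w.lower() for w in text)`.

```python
# Made-up sample tokens; in practice `text` would be a tokenized corpus.
text = ["The", "the", "THE", "cat", "Cat"]

# Without normalization, case variants count as distinct vocabulary items.
raw_vocab = set(text)

# Lowercasing first collapses "The", "the", and "THE" into one item.
vocab = set(w.lower() for w in text)

print(len(raw_vocab), len(vocab))
```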
Summary: 3.5 Useful Applications of Regular Expressions. The previous examples all involved searching for words w that match some regular expression regexp using re.search(regexp, w). Apart from checking whether a regular expression matches a word, we can use regular expressions to extract material ... Read more
posted @ 2011-08-06 16:08 牛皮糖NewPtone Views (2041) Comments (0) Recommendations (0)
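To make the distinction in the 3.5 summary concrete, the sketch below contrasts checking for a match with `re.search` and extracting matched material with `re.findall`. The sample word and the vowel pattern are assumptions chosen for illustration.

```python
import re

word = "segmentation"  # sample word, chosen only for this example

# re.search answers a yes/no question: does the pattern occur anywhere in the word?
has_vowel = re.search(r"[aeiou]", word) is not None

# re.findall extracts every non-overlapping match as a list of strings.
vowels = re.findall(r"[aeiou]", word)

print(has_vowel, vowels)
```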
Summary: When reposting, please credit the source, "一块努力的牛皮糖": http://www.cnblogs.com/yuxc/ I am new to this; if any of the translation is inaccurate, please point it out. Many thanks. Updated log. 3.4 Regular Expressions for Detecting Word Patterns. Many linguistic processing tasks involve pattern matching. For example, we can find words ending with ed using endswith('ed'). We saw a variety o... Read more
posted @ 2011-08-06 15:32 牛皮糖NewPtone Views (2247) Comments (0) Recommendations (0)
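The `endswith('ed')` test mentioned in the 3.4 summary has a direct regular-expression counterpart. The sketch below shows both on a made-up word list; the list and variable names are assumptions for illustration.

```python
import re

# Made-up sample words; a real run would iterate over a corpus word list.
words = ["walked", "red", "editor", "jumping", "played"]

# String-method version: keep words whose final two characters are "ed".
by_method = [w for w in words if w.endswith("ed")]

# Regex version: "$" anchors the pattern to the end of the string.
by_regex = [w for w in words if re.search(r"ed$", w)]

print(by_method == by_regex, by_regex)
```

The regex form becomes more useful once the pattern is something a string method cannot express, such as a character class or an alternation.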
Summary: 3.3 Text Processing with Unicode. Our programs will often need to deal with different languages, and different character sets. The concept of “plain text” is a fiction. If you live in the English-speaking world you probably use ASCII, possibly without realizing it. If you live in Eu... Read more
posted @ 2011-08-06 14:39 牛皮糖NewPtone Views (2861) Comments (0) Recommendations (0)