Post category - Python Natural Language Processing

Not limited to Python NLTK
Summary: 4.3 Questions of Style. Programming is as much an art as a science. The undisputed "bible" of programming, a 2,500 page multi-volume work by Donald Knuth, is called The Art of Computer Programming. Many books have been written on Literate Programming, recognizing that huma…
posted @ 2011-08-12 23:12 牛皮糖NewPtone Views (840) Comments (0) Recommended (0)
Summary: 4.2 Sequences. So far, we have seen two kinds of sequence object: strings and lists. Another kind of sequence is called a tuple. Tuples are formed with the comma operator, and typically enclosed using parentheses. We've actually seen them in the previous chapters, and sometimes referred to them a…
posted @ 2011-08-12 23:07 牛皮糖NewPtone Views (1299) Comments (0) Recommended (0)
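As a rough illustration of the tuple description in the 4.2 teaser above, here is a minimal sketch. The variable names and sample values are hypothetical, chosen only for this example, and are not taken from the post.

```python
# A tuple is formed by the comma operator; the parentheses are conventional.
t = 'walk', 'fem', 3
same = ('walk', 'fem', 3)

# Like strings and lists, tuples support indexing, slicing, and len().
first = t[0]
rest = t[1:]
size = len(t)

# A one-element tuple needs a trailing comma.
single = ('snark',)

print(t == same, first, rest, size, single)
```

The parentheses alone do not make a tuple: without the trailing comma, `('snark')` is just a string.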
Summary: Chapter 4 Writing Structured Programs. By now you will have a sense of the capabilities of the Python programming language for processing natural language. However, if you're new to Python or to programming, you may still be wrestling with Python and not feel like you are in full con…
posted @ 2011-08-11 22:34 牛皮糖NewPtone Views (904) Comments (0) Recommended (0)
Summary: 3.12 Exercises. ☼ Define a string s = 'colorless'. Write a Python statement that changes this to "colourless" using only the slice and concatenation operations. ☼ We can use the slice notation to remove morphological endings on words. For example, 'dogs'[:-1] removes the l…
posted @ 2011-08-11 22:25 牛皮糖NewPtone Views (2475) Comments (0) Recommended (0)
Summary: 3.11 Further Reading. Extra materials for this chapter are posted at http://www.nltk.org/, including links to freely available resources on the Web. Remember to consult the Python reference materials at http://docs.python.org/. (For example, this documentation covers “universal newline support,”…
posted @ 2011-08-11 22:21 牛皮糖NewPtone Views (601) Comments (0) Recommended (0)
Summary: 3.10 Summary. • In this book we view a text as a list of words. A “raw text” is a potentially long string containing words and whitespace formatting, and is how we typically store and visualize a text. • A string is specified in Python using single or double quotes: 'Monty Python', "Mont…
posted @ 2011-08-11 22:20 牛皮糖NewPtone Views (719) Comments (0) Recommended (0)
Summary: 3.9 Formatting: From Lists to Strings. Often we write a program to report a single data item, such as a particular element in a corpus that meets some complicated criterion, or a single summary statistic such as a word-count or the performance of a tagger. More often, we write a program to…
posted @ 2011-08-07 20:19 牛皮糖NewPtone Views (2601) Comments (0) Recommended (1)
Summary: 3.8 Segmentation. This section discusses more advanced concepts, which you may prefer to skip on the first time through this chapter. Tokenization is an instance of a more general problem of segmentation. In this section, we will look at two other instances of this problem, which use radically…
posted @ 2011-08-06 22:46 牛皮糖NewPtone Views (1680) Comments (0) Recommended (0)
Summary: 3.7 Regular Expressions for Tokenizing Text. Tokenization is the task of cutting a string into identifiable linguistic units that constitute a piece of language data. Although it is a fundamental task, we have been able to delay it until now because many corpora are already tokenized, and…
posted @ 2011-08-06 22:36 牛皮糖NewPtone Views (3689) Comments (0) Recommended (0)
Summary: 3.6 Normalizing Text. In earlier program examples we have often converted text to lowercase before doing anything with its words, e.g., set(w.lower() for w in text). By using lower(), we have normalized the text to lowercase so that the distinction between The and the is ignored. Often we want t…
posted @ 2011-08-06 22:27 牛皮糖NewPtone Views (2150) Comments (0) Recommended (0)
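The lowercase normalization mentioned in the 3.6 teaser can be tried with a small sketch. The token list below is a made-up example rather than data from the post; only the expression `set(w.lower() for w in text)` comes from the excerpt.

```python
# Hypothetical token list; a real text would come from a corpus or tokenizer.
text = ['The', 'quick', 'fox', 'saw', 'the', 'Fox']

# Without normalization, 'The'/'the' and 'fox'/'Fox' count as distinct words.
raw_vocab = set(text)

# Lowercasing first collapses those case variants into one entry each.
vocab = set(w.lower() for w in text)

print(len(raw_vocab), len(vocab))
print(sorted(vocab))
```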
Summary: 3.5 Useful Applications of Regular Expressions. The previous examples all involved searching for words w that match some regular expression regexp using re.search(regexp, w). Apart from checking whether a regular expression matches a word, we can use regular expressions to extract material…
posted @ 2011-08-06 16:08 牛皮糖NewPtone Views (1983) Comments (0) Recommended (0)
Summary: When reposting, please credit the source, “一块努力的牛皮糖”: http://www.cnblogs.com/yuxc/ . I am new to this; if anything in the translation is off, please point it out, with my thanks. Updated log. 3.4 Regular Expressions for Detecting Word Patterns. Many linguistic processing tasks involve pattern matching. For example, we can find words ending with ed using endswith('ed'). We saw a variety o…
posted @ 2011-08-06 15:32 牛皮糖NewPtone Views (2189) Comments (0) Recommended (0)
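The two ways of detecting word patterns named in the 3.4 and 3.5 teasers, the string method `endswith('ed')` and `re.search(regexp, w)`, can be compared in a short sketch. The word list and the pattern `ed$` are assumptions made for illustration; the excerpts do not show them.

```python
import re

# Hypothetical word list; the post itself works with NLTK corpora.
words = ['walked', 'edit', 'red', 'jumping', 'needed']

# String method: keep words whose final two letters are 'ed'.
by_method = [w for w in words if w.endswith('ed')]

# Regular expression: '$' anchors the match at the end of the word.
by_regex = [w for w in words if re.search(r'ed$', w)]

print(by_method)
print(by_method == by_regex)
```

The string method is enough for a fixed suffix; the regular expression form becomes useful once the pattern needs alternatives or wildcards.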
Summary: 3.3 Text Processing with Unicode. Our programs will often need to deal with different languages, and different character sets. The concept of “plain text” is a fiction. If you live in the English-speaking world you probably use ASCII, possibly without realizing it. If you live in Eu…
posted @ 2011-08-06 14:39 牛皮糖NewPtone Views (2802) Comments (0) Recommended (0)
Summary: Updated log: 1st 2011.8.6. 3.2 Strings: Text Processing at the Lowest Level. PS: I personally think this part is important; string processing is the most basic part of NLP, so newcomers should read it carefully and veterans can skip it. It’s time to study a fundamental data type that we’ve been studiously avoiding so far. In earlier…
posted @ 2011-08-05 23:13 牛皮糖NewPtone Views (2537) Comments (0) Recommended (0)
Summary: Chapter 3 Processing Raw Text. The most important source of texts is undoubtedly the Web. It’s convenient to have existing text collections to explore, such as the corpora we saw in the previous chapters. However, you probably have your own text sources in mind, and need to learn how to access t…
posted @ 2011-08-05 21:51 牛皮糖NewPtone Views (2658) Comments (1) Recommended (0)
Summary: The blogger is lazy... the exercises are unfinished and will be completed next time... 2.8 Exercises. 1. ○ Create a variable phrase containing a list of words. Experiment with the operations described in this chapter, including addition, multiplication, indexing, slicing, and sorting. List_practice=['Hello','World!'] List_practice+['Pythoner'…
posted @ 2011-08-05 21:36 牛皮糖NewPtone Views (2394) Comments (0) Recommended (0)
Summary: 2.7 Further Reading. Extra materials for this chapter are posted at http://www.nltk.org/, including links to freely available resources on the Web. The corpus methods are summarized in the Corpus HOWTO, at http://www.nltk.org/howt…
posted @ 2011-08-05 21:26 牛皮糖NewPtone Views (704) Comments (0) Recommended (0)
Summary: 2.6 Summary. • A text corpus is a large, structured collection of texts. NLTK comes with many corpora, e.g., the Brown Corpus, nltk.corpus.brown. • Some text corp…
posted @ 2011-08-05 21:24 牛皮糖NewPtone Views (2012) Comments (1) Recommended (0)
Summary: Updated log: 1st: 2011/8/6; 2nd: replaced the icon, since I really disliked the original one, and I expect many readers will like the new one. 1.2 A Closer Look at Python: Texts as Lists of Words. You’ve seen some important elements of the Python programming language. Let’s take a few moments to review them systematically. Lists. What is a text?…
posted @ 2011-08-05 16:40 牛皮糖NewPtone Views (1760) Comments (0) Recommended (0)
Summary: Original translation; to repost, please contact the blogger: yuxcer@126.com. 2.5 WordNet. WordNet is a semantically oriented dictionary of English, similar to a traditional thesaurus but with a richer structure. NLTK includes the English WordNet, with 155,287 words and 117,659 synonym sets. We’ll begin by looking at s…
posted @ 2011-07-27 17:33 牛皮糖NewPtone Views (4700) Comments (2) Recommended (0)