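Summary: Most real-world data is either 'unstructured', such as ordinary txt documents, or 'semi-structured', such as HTML, and extracting useful information from such data takes dedicated techniques. If all data were 'structured', such as XML or a relational database, no special extraction would be needed: you could use the metadata to retrieve whatever information you want. So let's discuss how to implement text information extraction with NLTK. First, the raw text of the document is split into sentences using a sentence segmenter, and each sentence is further subdivided into word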
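The first two steps described above, sentence segmentation followed by word tokenization, can be sketched as follows. This is a minimal illustration, not the post's own code: it assumes NLTK is installed and deliberately uses `PunktSentenceTokenizer` with default (untrained) parameters and `WordPunctTokenizer`, because neither needs a downloaded model. The function name `segment_and_tokenize` and the sample text are made up for this example.

```python
from nltk.tokenize import PunktSentenceTokenizer, WordPunctTokenizer


def segment_and_tokenize(raw_text):
    """Split raw text into sentences, then split each sentence into tokens.

    Assumption: an untrained Punkt segmenter and a regex-based word
    tokenizer are good enough for a demo; real pipelines usually load
    trained models instead.
    """
    sentence_segmenter = PunktSentenceTokenizer()  # default parameters, no training data
    word_tokenizer = WordPunctTokenizer()          # splits on word/punctuation boundaries
    sentences = sentence_segmenter.tokenize(raw_text)
    return [word_tokenizer.tokenize(sentence) for sentence in sentences]


if __name__ == "__main__":
    sample = "NLTK splits text into sentences. Each sentence is then split into words."
    for tokens in segment_and_tokenize(sample):
        print(tokens)
```

With the trained models installed, `nltk.sent_tokenize` and `nltk.word_tokenize` are the usual drop-in replacements for the two tokenizers used here.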
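Summary: Classification is the task of choosing the correct class label for a given input. A classifier is called supervised if it is built based on training corpora containing the correct label for each input. Here is an example showing how to use nltk to train a classifier and apply it to a simple classification task: given a name, determine its gender, that is, classify it as male or female. First comes training, and training requires a corpus, that is, already-classified names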
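A minimal sketch of such a name-gender classifier is shown below. It is an illustration under stated assumptions rather than the post's actual code: the tiny `LABELED_NAMES` list is invented for the demo (a real run would use a proper labeled corpus of names), and using the last letter of the name as the only feature is just one simple choice. `nltk.NaiveBayesClassifier.train` is the NLTK call; everything else is a hypothetical helper.

```python
import nltk

# Assumption: a tiny hand-made training set, for illustration only.
LABELED_NAMES = [
    ("Anna", "female"), ("Maria", "female"), ("Julia", "female"),
    ("Sophia", "female"), ("Emma", "female"), ("Olivia", "female"),
    ("John", "male"), ("David", "male"), ("Kevin", "male"),
    ("Mark", "male"), ("Peter", "male"), ("Brian", "male"),
]


def gender_features(name):
    """Map a name to a feature dict; here only its last letter is used."""
    return {"last_letter": name[-1].lower()}


def train_gender_classifier(labeled_names):
    """Train a Naive Bayes classifier on (name, label) pairs."""
    train_set = [(gender_features(name), label) for name, label in labeled_names]
    return nltk.NaiveBayesClassifier.train(train_set)


if __name__ == "__main__":
    classifier = train_gender_classifier(LABELED_NAMES)
    for name in ["Laura", "Steven"]:
        print(name, classifier.classify(gender_features(name)))
```

With so little training data the classifier only picks up the crudest pattern (names ending in "a" versus the rest), so it should be read as a demonstration of the train-then-classify workflow, not as a usable model.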
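Summary: POS tagging: part-of-speech tagging; parts of speech are also known as word classes or lexical categories. Whatever the name, it simply means labeling each word with its part of speech. The off-the-shelf tools in the nltk toolkit make it easy to POS-tag a text:
>>> import nltk
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', '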