Processing Text with NLTK
NLTK (the Natural Language Toolkit) is a powerful Python library for text processing.
Installation
pip install nltk
Then start a Python interpreter, import nltk, and call nltk.download() to fetch the corpora and models that NLTK needs.
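In scripts it is often more convenient to fetch only the data packages you actually use, and only when they are missing. A minimal sketch (the helper name ensure_resource is ours, not part of NLTK's API):

```python
import nltk

def ensure_resource(path, name):
    """Download an NLTK data package only if it is not already installed."""
    try:
        nltk.data.find(path)   # raises LookupError when the resource is absent
    except LookupError:
        nltk.download(name, quiet=True)

# e.g. before calling nltk.word_tokenize:
# ensure_resource('tokenizers/punkt', 'punkt')
```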
Tokenization
Splitting into words (pass in a single sentence):
sentence = 'hello,world!'
tokens = nltk.word_tokenize(sentence)
tokens is the resulting list of tokens:
['hello', ',', 'world', '!']
Splitting into sentences (pass in a document made up of several sentences):
text = 'This is a text. I want to split it.'
sens = nltk.sent_tokenize(text)
sens is the resulting list of sentences:
['This is a text.', 'I want to split it.']
Part-of-speech tagging
Given words, a list of token lists (one per sentence, as produced by the tokenizers above), tag each sentence:
tags = [nltk.pos_tag(tokens) for tokens in words]
For the sentences 'This is a text for test.' and 'And I want to learn how to use nltk.', tags looks like:
[[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('text', 'NN'), ('for', 'IN'), ('test', 'NN'), ('.', '.')], [('And', 'CC'), ('I', 'PRP'), ('want', 'VBP'), ('to', 'TO'), ('learn', 'VB'), ('how', 'WRB'), ('to', 'TO'), ('use', 'VB'), ('nltk', 'NN'), ('.', '.')]]
Appendix: NLTK (Penn Treebank) POS tags:
CC   Coordinating conjunction
CD   Cardinal number
DT   Determiner (e.g. this, that, these, those, such; indefinite determiners: no, some, any, each, every, enough, either, neither, all, both, half, several, many, much, (a) few, (a) little, other, another)
EX   Existential there
FW   Foreign word
IN   Preposition or subordinating conjunction
JJ   Adjective (including ordinal numbers)
JJR  Adjective, comparative
JJS  Adjective, superlative
LS   List item marker
MD   Modal
NN   Noun, singular or mass
NNS  Noun, plural
NNP  Proper noun, singular
NNPS Proper noun, plural
PDT  Predeterminer
POS  Possessive ending
PRP  Personal pronoun
PRP$ Possessive pronoun
RB   Adverb
RBR  Adverb, comparative
RBS  Adverb, superlative
RP   Particle
SYM  Symbol
TO   to (as preposition or infinitive marker)
UH   Interjection
VB   Verb, base form
VBD  Verb, past tense
VBG  Verb, gerund or present participle
VBN  Verb, past participle
VBP  Verb, non-3rd person singular present
VBZ  Verb, 3rd person singular present
WDT  Wh-determiner (relative: whose, which; interrogative: what, which, whose)
WP   Wh-pronoun (who, whose, which)
WP$  Possessive wh-pronoun
WRB  Wh-adverb (how, where, when)
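As a quick illustration of reading these tags: all Penn Treebank noun tags start with NN (NN, NNS, NNP, NNPS), so noun-like tokens can be filtered with a simple prefix check. A sketch in plain Python, using the tagged output shown above as literal data:

```python
tagged = [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('text', 'NN'),
          ('for', 'IN'), ('test', 'NN'), ('.', '.')]

# Keep only noun-like tokens (tags beginning with 'NN')
nouns = [word for word, tag in tagged if tag.startswith('NN')]
print(nouns)  # ['text', 'test']
```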
Keyword extraction
How do we extract keywords from a passage? The basic idea: tokenize first, then tag parts of speech, and keep the noun phrases.
# -*- coding: utf-8 -*-
import nltk
from nltk.corpus import brown
from nltk.stem import SnowballStemmer

# This is our fast Part-of-Speech tagger
#############################################################################
brown_train = brown.tagged_sents(categories='news')
regexp_tagger = nltk.RegexpTagger(
    [(r'^-?[0-9]+(.[0-9]+)?$', 'CD'),
     (r'(-|:|;)$', ':'),
     (r'\'*$', 'MD'),
     (r'(The|the|A|a|An|an)$', 'AT'),
     (r'.*able$', 'JJ'),
     (r'^[A-Z].*$', 'NNP'),
     (r'.*ness$', 'NN'),
     (r'.*ly$', 'RB'),
     (r'.*s$', 'NNS'),
     (r'.*ing$', 'VBG'),
     (r'.*ed$', 'VBD'),
     (r'.*', 'NN')
     ])
unigram_tagger = nltk.UnigramTagger(brown_train, backoff=regexp_tagger)
bigram_tagger = nltk.BigramTagger(brown_train, backoff=unigram_tagger)

#############################################################################
# This is our semi-CFG; extend it according to your own needs
#############################################################################
cfg = {}
cfg["NNP+NNP"] = "NNP"
cfg["NN+NN"] = "NNI"
cfg["NNI+NN"] = "NNI"
cfg["JJ+JJ"] = "JJ"
cfg["JJ+NN"] = "NNI"
#############################################################################


class NPExtractor(object):

    # Split the sentence into single words/tokens
    def tokenize_sentence(self, sentence):
        tokens = nltk.word_tokenize(sentence)
        # Drop punctuation, numbers, and words shorter than 2 characters.
        # (With tf-idf weighting applied later, stop words need not be removed here.)
        # Note: lowercasing means the capitalized-word (NNP) regexp rule above
        # will rarely fire on these tokens.
        tokens = [w.lower() for w in tokens if w.isalpha() and len(w) > 1]
        # Stemming
        stemmer = SnowballStemmer('english')
        tokens = [stemmer.stem(w) for w in tokens]
        return tokens

    # Normalize brown corpus' tags ("NN", "NN-PL", "NNS" > "NN")
    def normalize_tags(self, tagged):
        n_tagged = []
        for t in tagged:
            if t[1] == "NP-TL" or t[1] == "NP":
                n_tagged.append((t[0], "NNP"))
                continue
            if t[1].endswith("-TL"):
                n_tagged.append((t[0], t[1][:-3]))
                continue
            if t[1].endswith("S"):
                n_tagged.append((t[0], t[1][:-1]))
                continue
            n_tagged.append((t[0], t[1]))
        return n_tagged

    # Extract the main topics from the sentence
    def extract(self, sentence):
        tokens = self.tokenize_sentence(sentence)
        tags = self.normalize_tags(bigram_tagger.tag(tokens))
        merge = True
        while merge:
            merge = False
            for x in range(0, len(tags) - 1):
                t1 = tags[x]
                t2 = tags[x + 1]
                key = "%s+%s" % (t1[1], t2[1])
                value = cfg.get(key, '')
                if value:
                    merge = True
                    tags.pop(x)
                    tags.pop(x)
                    match = "%s %s" % (t1[0], t2[0])
                    tags.insert(x, (match, value))
                    break
        matches = []
        for t in tags:
            if t[1] in ("NNP", "NNI", "NN"):
                matches.append(t[0])
        return matches
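The merging loop at the heart of extract() can be isolated and tried without the Brown corpus or a tagger: repeatedly merge adjacent tag pairs that appear in cfg until no rule applies. A self-contained sketch using the same cfg rules on a hand-tagged sentence:

```python
cfg = {"NNP+NNP": "NNP", "NN+NN": "NNI", "NNI+NN": "NNI",
       "JJ+JJ": "JJ", "JJ+NN": "NNI"}

def merge_phrases(tags):
    """Repeatedly merge adjacent (word, tag) pairs whose tag pair is in cfg."""
    tags = list(tags)
    merged = True
    while merged:
        merged = False
        for i in range(len(tags) - 1):
            key = tags[i][1] + '+' + tags[i + 1][1]
            if key in cfg:
                # Join the two words, relabel with the merged tag, restart scan
                tags[i:i + 2] = [(tags[i][0] + ' ' + tags[i + 1][0], cfg[key])]
                merged = True
                break
    return tags

result = merge_phrases([('machine', 'NN'), ('learning', 'NN'),
                        ('is', 'VBZ'), ('fun', 'JJ')])
print(result)  # [('machine learning', 'NNI'), ('is', 'VBZ'), ('fun', 'JJ')]
```

Here NN+NN collapses into NNI ("noun, intermediate"), which is how multi-word noun phrases like "machine learning" survive as single keywords.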
Calling the extract function here on a piece of text returns its keywords.
For more details, see the official NLTK documentation: nltk
