Processing Text with NLTK
NLTK (the Natural Language Toolkit) is a powerful Python library for text processing.
Installation
pip install nltk
Then start a Python interpreter, import nltk, and call nltk.download() to fetch the corpora and models that NLTK needs.
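In scripts it is often more convenient to fetch only the data packages you actually use, and only when they are missing. A minimal sketch (the helper name ensure_resource is ours, not part of NLTK's API):

```python
import nltk

def ensure_resource(path, name):
    """Download an NLTK data package only if it is not already installed."""
    try:
        nltk.data.find(path)   # raises LookupError when the resource is absent
    except LookupError:
        nltk.download(name, quiet=True)

# e.g. before calling nltk.word_tokenize:
# ensure_resource('tokenizers/punkt', 'punkt')
```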
Tokenization
Splitting into words (pass in a single sentence):
sentence = 'hello,world!'
tokens = nltk.word_tokenize(sentence)
tokens is the resulting list of tokens:
['hello', ',', 'world', '!']
Splitting into sentences (pass in a document made up of several sentences):
text = 'This is a text. I want to split it.'
sens = nltk.sent_tokenize(text)
sens is the resulting list of sentences:
['This is a text.', 'I want to split it.']
Part-of-speech tagging
Given words, a list of token lists (one per sentence, as produced by the tokenizers above), tag each sentence:
tags = [nltk.pos_tag(tokens) for tokens in words]
For the sentences 'This is a text for test.' and 'And I want to learn how to use nltk.', tags looks like:
[[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('text', 'NN'), ('for', 'IN'), ('test', 'NN'), ('.', '.')], [('And', 'CC'), ('I', 'PRP'), ('want', 'VBP'), ('to', 'TO'), ('learn', 'VB'), ('how', 'WRB'), ('to', 'TO'), ('use', 'VB'), ('nltk', 'NN'), ('.', '.')]]
Appendix: NLTK (Penn Treebank) POS tags:
CC   Coordinating conjunction
CD   Cardinal number
DT   Determiner (e.g. this, that, these, those, such; indefinite determiners: no, some, any, each, every, enough, either, neither, all, both, half, several, many, much, (a) few, (a) little, other, another)
EX   Existential there
FW   Foreign word
IN   Preposition or subordinating conjunction
JJ   Adjective (including ordinal numbers)
JJR  Adjective, comparative
JJS  Adjective, superlative
LS   List item marker
MD   Modal
NN   Noun, singular or mass
NNS  Noun, plural
NNP  Proper noun, singular
NNPS Proper noun, plural
PDT  Predeterminer
POS  Possessive ending
PRP  Personal pronoun
PRP$ Possessive pronoun
RB   Adverb
RBR  Adverb, comparative
RBS  Adverb, superlative
RP   Particle
SYM  Symbol
TO   to (as preposition or infinitive marker)
UH   Interjection
VB   Verb, base form
VBD  Verb, past tense
VBG  Verb, gerund or present participle
VBN  Verb, past participle
VBP  Verb, non-3rd person singular present
VBZ  Verb, 3rd person singular present
WDT  Wh-determiner (relative: whose, which; interrogative: what, which, whose)
WP   Wh-pronoun (who, whose, which)
WP$  Possessive wh-pronoun
WRB  Wh-adverb (how, where, when)
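As a quick illustration of reading these tags: all Penn Treebank noun tags start with NN (NN, NNS, NNP, NNPS), so noun-like tokens can be filtered with a simple prefix check. A sketch in plain Python, using the tagged output shown above as literal data:

```python
tagged = [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('text', 'NN'),
          ('for', 'IN'), ('test', 'NN'), ('.', '.')]

# Keep only noun-like tokens (tags beginning with 'NN')
nouns = [word for word, tag in tagged if tag.startswith('NN')]
print(nouns)  # ['text', 'test']
```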
Keyword extraction
How do we extract keywords from a passage? The basic idea: tokenize first, then tag parts of speech, and keep the noun phrases.
# -*- coding: utf-8 -*-
import nltk
from nltk.corpus import brown
from nltk.stem import SnowballStemmer

# This is our fast Part-of-Speech tagger
#############################################################################
brown_train = brown.tagged_sents(categories='news')
regexp_tagger = nltk.RegexpTagger(
    [(r'^-?[0-9]+(.[0-9]+)?$', 'CD'),
     (r'(-|:|;)$', ':'),
     (r'\'*$', 'MD'),
     (r'(The|the|A|a|An|an)$', 'AT'),
     (r'.*able$', 'JJ'),
     (r'^[A-Z].*$', 'NNP'),
     (r'.*ness$', 'NN'),
     (r'.*ly$', 'RB'),
     (r'.*s$', 'NNS'),
     (r'.*ing$', 'VBG'),
     (r'.*ed$', 'VBD'),
     (r'.*', 'NN')
     ])
unigram_tagger = nltk.UnigramTagger(brown_train, backoff=regexp_tagger)
bigram_tagger = nltk.BigramTagger(brown_train, backoff=unigram_tagger)

#############################################################################
# This is our semi-CFG; extend it according to your own needs
#############################################################################
cfg = {}
cfg["NNP+NNP"] = "NNP"
cfg["NN+NN"] = "NNI"
cfg["NNI+NN"] = "NNI"
cfg["JJ+JJ"] = "JJ"
cfg["JJ+NN"] = "NNI"
#############################################################################


class NPExtractor(object):

    # Split the sentence into single words/tokens
    def tokenize_sentence(self, sentence):
        tokens = nltk.word_tokenize(sentence)
        # Drop punctuation, numbers, and words shorter than 2 characters.
        # (With tf-idf weighting applied later, stop words need not be removed here.)
        # Note: lowercasing means the capitalized-word (NNP) regexp rule above
        # will rarely fire on these tokens.
        tokens = [w.lower() for w in tokens if w.isalpha() and len(w) > 1]
        # Stemming
        stemmer = SnowballStemmer('english')
        tokens = [stemmer.stem(w) for w in tokens]
        return tokens

    # Normalize brown corpus' tags ("NN", "NN-PL", "NNS" > "NN")
    def normalize_tags(self, tagged):
        n_tagged = []
        for t in tagged:
            if t[1] == "NP-TL" or t[1] == "NP":
                n_tagged.append((t[0], "NNP"))
                continue
            if t[1].endswith("-TL"):
                n_tagged.append((t[0], t[1][:-3]))
                continue
            if t[1].endswith("S"):
                n_tagged.append((t[0], t[1][:-1]))
                continue
            n_tagged.append((t[0], t[1]))
        return n_tagged

    # Extract the main topics from the sentence
    def extract(self, sentence):
        tokens = self.tokenize_sentence(sentence)
        tags = self.normalize_tags(bigram_tagger.tag(tokens))
        merge = True
        while merge:
            merge = False
            for x in range(0, len(tags) - 1):
                t1 = tags[x]
                t2 = tags[x + 1]
                key = "%s+%s" % (t1[1], t2[1])
                value = cfg.get(key, '')
                if value:
                    merge = True
                    tags.pop(x)
                    tags.pop(x)
                    match = "%s %s" % (t1[0], t2[0])
                    tags.insert(x, (match, value))
                    break
        matches = []
        for t in tags:
            if t[1] in ("NNP", "NNI", "NN"):
                matches.append(t[0])
        return matches
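The merging loop at the heart of extract() can be isolated and tried without the Brown corpus or a tagger: repeatedly merge adjacent tag pairs that appear in cfg until no rule applies. A self-contained sketch using the same cfg rules on a hand-tagged sentence:

```python
cfg = {"NNP+NNP": "NNP", "NN+NN": "NNI", "NNI+NN": "NNI",
       "JJ+JJ": "JJ", "JJ+NN": "NNI"}

def merge_phrases(tags):
    """Repeatedly merge adjacent (word, tag) pairs whose tag pair is in cfg."""
    tags = list(tags)
    merged = True
    while merged:
        merged = False
        for i in range(len(tags) - 1):
            key = tags[i][1] + '+' + tags[i + 1][1]
            if key in cfg:
                # Join the two words, relabel with the merged tag, restart scan
                tags[i:i + 2] = [(tags[i][0] + ' ' + tags[i + 1][0], cfg[key])]
                merged = True
                break
    return tags

result = merge_phrases([('machine', 'NN'), ('learning', 'NN'),
                        ('is', 'VBZ'), ('fun', 'JJ')])
print(result)  # [('machine learning', 'NNI'), ('is', 'VBZ'), ('fun', 'JJ')]
```

Here NN+NN collapses into NNI ("noun, intermediate"), which is how multi-word noun phrases like "machine learning" survive as single keywords.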
Calling the extract function here on a piece of text returns its keywords.
For more details, see the official NLTK documentation: nltk
