Python and Natural Language Processing: A Hands-On Introduction to NLTK
Get started with the classic NLP library from scratch: tokenization, POS tagging, sentiment analysis, and text classification in roughly 10 lines of code!
1. Why NLTK?
NLTK (Natural Language Toolkit) is one of the oldest and most comprehensive natural language processing libraries in the Python ecosystem:
- Textbook-grade corpora: the Brown corpus, movie reviews, Reuters news, and more are built in, so there is no need to hunt for data
- Transparent algorithms: the source code is thoroughly commented, and every step from tokenization to parse trees is explainable
- Large community: it is the official companion library to the book *Natural Language Processing with Python*, with plenty of answers on Stack Overflow
- Beginner friendly: a pure-Python API that works right after installation, letting you try the whole NLP pipeline with zero configuration
2. Quick Installation and Environment Setup
# 1. Create a virtual environment (recommended)
python -m venv nltk_env
source nltk_env/bin/activate  # on Windows: nltk_env\Scripts\activate
# 2. Install NLTK
pip install -U nltk
# 3. Download the common corpora and models in one step
python -c "import nltk; nltk.download('popular')"
If your network is slow, you can download the data manually and unzip it into ~/nltk_data; the official site provides ZIP mirrors.
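If you unpack the data somewhere other than the default location, you have to tell NLTK where to look. A minimal sketch, assuming a hypothetical directory `/opt/nltk_data` (replace it with wherever you unzipped the files):

```python
import nltk

# Hypothetical directory where the corpus ZIPs were unpacked
custom_dir = "/opt/nltk_data"

# NLTK searches every directory in nltk.data.path when loading corpora
# and models, so appending the unpack location makes manual downloads visible.
if custom_dir not in nltk.data.path:
    nltk.data.path.append(custom_dir)

print(nltk.data.path)  # the custom directory is now searched too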
3. Four Core Tasks in About 10 Lines of Code
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk
text = "NLTK is a leading platform for building Python programs to work with human language data."
# 1. Sentence segmentation
sents = sent_tokenize(text)
print("Sentences:", sents)
# 2. Word tokenization
words = word_tokenize(text)
print("Tokens:", words)
# 3. Part-of-speech tagging
pos = pos_tag(words)
print("POS tags:", pos)
# 4. Named entity recognition (NER)
tree = ne_chunk(pos)
print("Entity tree:", tree)
Sample output (the exact entity label may vary with the chunker model version):
Sentences: ['NLTK is a leading platform for building Python programs to work with human language data.']
Tokens: ['NLTK', 'is', 'a', 'leading', 'platform', ...]
POS tags: [('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ...]
Entity tree: (S (PERSON NLTK/NNP) is/VBZ ...)
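The result of `ne_chunk` is an `nltk.Tree` in which entity spans are nested subtrees labeled PERSON, ORGANIZATION, and so on. As a minimal sketch of walking that tree to pull out (entity, label) pairs, here the tree is hand-built in the same shape as the output above, so it runs without the NE model downloads (the `extract_entities` helper is made up for illustration):

```python
from nltk.tree import Tree

def extract_entities(tree):
    """Collect (entity text, label) pairs from an ne_chunk-style tree."""
    entities = []
    for node in tree:
        if isinstance(node, Tree):  # entity spans are nested Trees
            text = " ".join(word for word, tag in node.leaves())
            entities.append((text, node.label()))
    return entities

# Hand-built stand-in for the ne_chunk output shown above
tree = Tree("S", [Tree("PERSON", [("NLTK", "NNP")]), ("is", "VBZ"), ("a", "DT")])
print(extract_entities(tree))  # [('NLTK', 'PERSON')]
```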
4. Hands-On 1: A Movie Review Sentiment Classifier (Naive Bayes)
1. Load a built-in corpus
import random
from nltk.corpus import movie_reviews
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)
2. Feature extraction (bag-of-words model)
def word_feats(words):
    return {w: True for w in words}
feature_sets = [(word_feats(d), c) for (d, c) in documents]
train_set, test_set = feature_sets[200:], feature_sets[:200]
3. Train & evaluate
classifier = NaiveBayesClassifier.train(train_set)
print("Accuracy:", accuracy(classifier, test_set))
classifier.show_most_informative_features(10)
Sample results:
Accuracy: 0.845
Most Informative Features
outstanding = True pos : neg = 11.2 : 1.0
awful = True neg : pos = 9.4 : 1.0
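Once trained, the classifier can label any new text run through the same `word_feats` function. A self-contained sketch with a tiny hand-made training set, so it runs without the corpus download (the example words and labels are made up for illustration, not taken from movie_reviews):

```python
from nltk.classify import NaiveBayesClassifier

def word_feats(words):
    return {w: True for w in words}

# Tiny hand-made training set (hypothetical examples, not the movie_reviews corpus)
train_set = [
    (word_feats(["outstanding", "brilliant"]), "pos"),
    (word_feats(["great", "outstanding"]), "pos"),
    (word_feats(["awful", "boring"]), "neg"),
    (word_feats(["terrible", "awful"]), "neg"),
]
classifier = NaiveBayesClassifier.train(train_set)

# Classify unseen documents with the same feature function;
# feature names never seen in training are simply ignored.
print(classifier.classify(word_feats(["an", "outstanding", "film"])))  # pos
print(classifier.classify(word_feats(["truly", "awful"])))             # neg
```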
5. Hands-On 2: Word Frequency Statistics and Plotting
from nltk.corpus import brown
from nltk.probability import FreqDist
import matplotlib.pyplot as plt
# 1. Get the words in the news category
words = brown.words(categories='news')
# 2. Build a frequency distribution
fdist = FreqDist(w.lower() for w in words if w.isalpha())
# 3. Plot the 30 most frequent words
fdist.plot(30, cumulative=False)
plt.show()
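FreqDist is also a Counter-like object you can query directly, no matplotlib required. A minimal sketch on a toy token list standing in for `brown.words(...)`:

```python
from nltk.probability import FreqDist

# Toy token list standing in for brown.words(categories='news')
tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "end"]
fdist = FreqDist(tokens)

print(fdist.most_common(1))     # [('the', 3)]
print(fdist["the"], fdist.N())  # count of 'the' and the total token count
print(fdist.hapaxes())          # words that occur exactly once
```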
6. Hands-On 3: Lexicon-Based Sentiment Analysis (VADER)
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
text = "The movie is super boring, but the acting is amazing!"
scores = sia.polarity_scores(text)
print(scores)
# e.g. {'neg': 0.312, 'neu': 0.438, 'pos': 0.250, 'compound': -0.102}
# (exact scores depend on your VADER lexicon version)
Interpreting the compound score:
- compound ≥ 0.05 → positive
- compound ≤ -0.05 → negative
- otherwise → neutral
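These thresholds map directly to a small helper. A sketch using the conventional VADER cutoffs stated above (the function name `vader_label` is made up):

```python
def vader_label(compound, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

print(vader_label(0.6), vader_label(-0.3), vader_label(0.01))
# positive negative neutral
```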
7. Chinese Support: A Custom jieba Tokenizer
import jieba  # requires: pip install jieba
from nltk.tokenize import TokenizerI
class JiebaTokenizer(TokenizerI):
    def tokenize(self, text):
        return list(jieba.cut(text))
tokenizer = JiebaTokenizer()
print(tokenizer.tokenize("我爱自然语言处理"))
# ['我', '爱', '自然语言', '处理']
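The same TokenizerI subclassing pattern works for any backend, not just jieba. As a dependency-free sketch of the identical structure, here is a regex-based variant (the class name `RegexWordTokenizer` is made up for illustration):

```python
import re

from nltk.tokenize.api import TokenizerI

class RegexWordTokenizer(TokenizerI):
    """Toy tokenizer: runs of word characters count as tokens."""
    def tokenize(self, text):
        return re.findall(r"\w+", text)

tok = RegexWordTokenizer()
print(tok.tokenize("NLTK loves plug-in tokenizers"))
# ['NLTK', 'loves', 'plug', 'in', 'tokenizers']
```

Because the class implements the TokenizerI interface, it can be dropped into any NLTK code that expects a tokenizer object.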
8. Learning Path & Further Resources
Tokenization → stopword removal → lemmatization → POS tagging
        ↓
Text classification / sentiment analysis / topic modeling (LDA)
        ↓
Named entity recognition → syntactic parsing → semantic role labeling
        ↓
Deep learning (BERT, GPT, Transformers)
- Book: *Natural Language Processing with Python* (written by the NLTK authors)
- Official docs: https://www.nltk.org/
- Online course: Coursera, "Natural Language Processing with Python"
9. Summary
NLTK is a Swiss Army knife for getting started with NLP:
- A handful of lines of code covers tokenization, POS tagging, entities, sentiment, and classification
- Built-in corpora and models spare you the data hunt
- Thoroughly commented source code keeps the algorithms transparent and easy to learn from
Master the classic algorithms with NLTK first, and the move to spaCy or Hugging Face will be twice as easy. Go try it!