The jieba natural language processing library

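Sentiment analysis of review data is essentially text mining, and the first preprocessing step is word segmentation. English words are separated by spaces, so they are easy to split. Chinese has no spaces between words, so it needs dedicated segmentation.
For Chinese word segmentation, use the jieba library. Install the third-party package first with pip install jieba, then look at the demo below: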

import jieba

s = '中华人民共和国是一个伟大的国家。'
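# Accurate mode (tries to cut the sentence as precisely as possible, with no redundant tokens; suited to text analysis)
print(jieba.lcut(s))
# Full mode (emits every substring that could be a word; very fast, but the output contains redundant tokens)
print(jieba.lcut(s, cut_all=True))
# Search engine mode (starts from accurate mode, then cuts long words again into shorter ones)
print(jieba.lcut_for_search(s))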

['中华人民共和国', '是', '一个', '伟大', '的', '国家', '。']
['中华', '中华人民', '中华人民共和国', '华人', '人民', '人民共和国', '共和', '共和国', '国是', '一个', '伟大', '的', '国家', '。']
['中华', '华人', '人民', '共和', '共和国', '中华人民共和国', '是', '一个', '伟大', '的', '国家', '。']
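
For contrast with the jieba output above, here is a minimal sketch of what plain whitespace splitting does to the same sentence. It uses only the standard library; the English sentence is a made-up translation added for illustration, not part of the original demo.

```python
# Whitespace splitting works for English but not for Chinese.
english = "China is a great country."        # illustrative sentence (assumption)
chinese = "中华人民共和国是一个伟大的国家。"  # the sentence from the demo above

english_tokens = english.split()
chinese_tokens = chinese.split()

print(english_tokens)  # five tokens, one per space-separated word
print(chinese_tokens)  # a single token: the whole sentence, unsplit
```

Because the Chinese sentence contains no spaces, split() hands back the whole sentence as one item, which is why a dedicated segmenter such as jieba is needed before any counting or analysis.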

Counting the most frequent words, using Romance of the Three Kingdoms (三国演义) as the example:


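import jieba

# Read the whole novel; the with block closes the file afterwards
with open("D:\\三国演义.txt", "r", encoding='utf-8') as f:
    txt = f.read()
words = jieba.lcut(txt)     # segment the text in accurate mode
counts = {}     # maps each word to the number of times it appears

for word in words:
    if len(word) == 1:    # skip single-character tokens
        continue
    counts[word] = counts.get(word, 0) + 1    # add 1 each time the word appears

items = list(counts.items())    # convert the key-value pairs to a list
items.sort(key=lambda x: x[1], reverse=True)    # sort by count, largest first

# Print the top 10; slicing avoids an IndexError when there are fewer than 10 words
for word, count in items[:10]:
    print("{0:<5}{1:>5}".format(word, count))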
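
The dict-and-sort counting in the script above can also be written with collections.Counter from the standard library. The sketch below is one possible variant, not the author's method; the short token list is hypothetical and stands in for the result of jieba.lcut(txt), so it runs without the novel file.

```python
from collections import Counter

# Hypothetical token list standing in for the output of jieba.lcut(txt)
words = ["曹操", "曰", "孔明", "曹操", "的", "玄德", "孔明", "曹操", "。"]

# Drop single-character tokens, as the script above does
counts = Counter(w for w in words if len(w) > 1)

# most_common(n) returns at most n (word, count) pairs, highest count first
top = counts.most_common(2)
for word, count in top:
    print("{0:<5}{1:>5}".format(word, count))
```

most_common returns fewer pairs when the text has fewer distinct words than requested, so no index check is needed.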

Counting words in English text:


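def get_text():
    # Read the file and normalize it to lowercase
    with open("1.txt", "r", encoding='UTF-8') as f:
        txt = f.read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, " ")      # replace special characters with spaces
    return txt

file_txt = get_text()
words = file_txt.split()    # split the string on whitespace to get the word list
counts = {}

for word in words:
    if len(word) == 1:    # skip one-letter words
        continue
    counts[word] = counts.get(word, 0) + 1    # add 1 each time the word appears

items = list(counts.items())    # convert the key-value pairs to a list
items.sort(key=lambda x: x[1], reverse=True)    # sort by count, largest first

# Print the top 5; slicing avoids an IndexError when there are fewer than 5 words
for word, count in items[:5]:
    print("{0:<5}->{1:>5}".format(word, count))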
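
As a possible alternative to replacing punctuation characters one by one, the sketch below extracts words with a regular expression and counts them with collections.Counter. It is a variant under stated assumptions rather than the script above rewritten: the sample sentence is hypothetical and stands in for the contents of 1.txt, and the pattern keeps apostrophes and drops digits, which differs slightly from the character list used above.

```python
import re
from collections import Counter

# Hypothetical sample text standing in for the contents of 1.txt
sample = "To be, or not to be: that is the question."

# Lowercase, then keep runs of letters and apostrophes as words
tokens = re.findall(r"[a-z']+", sample.lower())

# Skip one-letter words, matching the script above
word_counts = Counter(t for t in tokens if len(t) > 1)

# Words with equal counts keep the order in which they first appeared
top_words = word_counts.most_common(2)
for word, count in top_words:
    print("{0:<5}->{1:>5}".format(word, count))
```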

posted @ 2021-05-05 11:49 爱时尚疯了的朱
