Chinese Word Frequency Statistics - 189黄思慧

Chinese Word Frequency Statistics

Download a long Chinese article.

Read the text to be analyzed from a file.

news = open('gzccnews.txt', 'r', encoding='utf-8').read()
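A minimal sketch of the same read step wrapped in a `with` block, so the file is closed even if reading fails. The helper name `read_text` is made up for illustration and is not part of the assignment.

```python
def read_text(path, encoding="utf-8"):
    """Return the whole file as one string and close the file afterwards."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()

# Usage (assumes gzccnews.txt sits in the working directory):
# news = read_text("gzccnews.txt")
```

The result is a plain `str`, which is what a tokenizer such as `jieba.lcut` expects as input.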

Install jieba and use it for Chinese word segmentation.

pip install jieba

import jieba

jieba.lcut(news)  # news must be a str; lcut already returns a list

Generate the word frequency counts
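One way to build the counts in a single pass is `collections.Counter` from the standard library. This is a sketch: the token list is a made-up sample standing in for the output of `jieba.lcut`, and `count_words` is a hypothetical helper name.

```python
from collections import Counter

def count_words(tokens):
    """Map each distinct token to the number of times it occurs."""
    return Counter(tokens)

# Made-up sample tokens, standing in for a real segmentation result.
tokens = ["我们", "仨", "我们", "失散", "了", "我们"]
freq = count_words(tokens)
print(freq["我们"])  # 3
```

Compared with calling `list.count` once per distinct word, this walks the token list only once.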

Sort
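A small sketch of the sorting step, assuming the counts live in an ordinary dict. The name `sort_by_count` is hypothetical. Ties are broken by the word itself so the order is the same on every run.

```python
def sort_by_count(freq):
    """Return (word, count) pairs, highest count first; ties ordered by word."""
    return sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))

# Made-up counts for illustration.
freq = {"仨": 1, "我们": 3, "失散": 1}
for word, count in sort_by_count(freq):
    print(word, count)
```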

Exclude grammatical words: pronouns, articles, conjunctions
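The exclusion can be sketched as a filter over the token list before counting. The `STOPWORDS` set below is a tiny illustrative sample, not a complete list, and `remove_stopwords` is a hypothetical helper name.

```python
# Tiny illustrative stop list; a real one would be much longer.
STOPWORDS = {"的", "了", "是", "在", "和", "，", "。"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop whitespace-only tokens and any token found in the stop list."""
    return [t for t in tokens if t.strip() and t not in stopwords]

tokens = ["我们", "的", "家", "，", "\u3000", "\n"]
print(remove_stopwords(tokens))
```

Checking `t.strip()` removes spaces, newlines, and the full-width space `\u3000` without listing each one in the set.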

Output the 20 most frequent words
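For the top-20 output, `Counter.most_common(n)` returns at most `n` pairs, so it does not fail when the text has fewer than 20 distinct words. A sketch, with `top_n` as a hypothetical helper name and a made-up token list:

```python
from collections import Counter

def top_n(tokens, n=20):
    """Return up to n (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common(n)

# Made-up tokens; in the assignment these would come from jieba.
tokens = ["钱钟书", "阿圆", "钱钟书", "我们", "阿圆", "钱钟书"]
for word, count in top_n(tokens, n=20):
    print(word, count)
```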

Post the code and screenshots of the results on the blog.

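# -*- coding: utf-8 -*-
import jieba

# Read the whole novel into memory; the with-block closes the file.
with open('我们仨.txt', 'r', encoding='utf-8') as fo:
    novel = fo.read()
# jieba.lcut already returns a list of tokens.
novelList = jieba.lcut(novel)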

# Punctuation, whitespace, and function words to leave out of the count.
exclude = {'，',',', '。','？', '“', '”',' ','\u3000','\n','：',
           '这', '走', '能', '好', '给', '来', '为', '等',
           '在', '也', '就', '不', '着', '到', '和', '很',
           '吃', '（', '）', '；', '后', '一', '叫', '已',
           '住', '呢', '》', '《', '但','—', '小', '没', '并',
           '有', '又', '一个', '、', '去', '得', '把', '还',
           '上', '只', '人', '地', '做', '对', '你', '没有',
           '都','说','了','的','是','里','要','从','什么','因为'}

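# Count every token in one pass, skipping the excluded ones.
novelDict = {}
for s in novelList:
    if s not in exclude:
        novelDict[s] = novelDict.get(s, 0) + 1

# Sort (word, count) pairs by count, highest first.
dictList = list(novelDict.items())
dictList.sort(key=lambda x: x[1], reverse=True)

# Print the 20 most frequent words (fewer if the text is short).
for i in range(min(20, len(dictList))):
    print(dictList[i])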

Screenshot:

posted on 2018-03-28 16:48 189黄思慧