Stage Assignment 1: Complete Chinese and English Word Frequency Statistics

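Steps:

1. Prepare a UTF-8 encoded text file (file)

2. Read the file into a string (str)

3. Preprocess the text

4. Split the text into words (list)

5. Build a word-count dictionary (set, dict)

6. Sort by word frequency with list.sort(key=)

7. Exclude function words such as pronouns, articles, and conjunctions that carry little meaning

8. Output the TOP 20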
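The steps above can be sketched compactly with the standard library. This is only a minimal illustration, not the assignment's own solution: the function name `top_words`, the `STOP_WORDS` set, and the sample sentence are assumptions made for the example, and `collections.Counter` stands in for the hand-built dictionary and the explicit sort.

```python
import re
from collections import Counter

# Hypothetical stop-word set for illustration; extend it for real texts.
STOP_WORDS = {'a', 'the', 'and', 'i', 'you', 'it'}

def top_words(text, n=20, stop_words=STOP_WORDS):
    """Return the n most frequent words in text as (word, count) pairs."""
    # Step 3: lowercase, then replace every run of characters that is not
    # a lowercase letter, a digit, or an apostrophe with a single space.
    cleaned = re.sub(r"[^a-z0-9']+", ' ', text.lower())
    # Step 4: split on whitespace to get the word list.
    words = cleaned.split()
    # Steps 5 and 7: count the words, skipping the stop words.
    counts = Counter(w for w in words if w not in stop_words)
    # Steps 6 and 8: most_common sorts by count, highest first.
    return counts.most_common(n)

sample = "The cat and the dog. The dog barks!"
print(top_words(sample, n=2))  # [('dog', 2), ('cat', 1)]
```

For a real file, the string would come from reading it with `open(..., encoding='utf-8')` as in step 2.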

1. Word frequency statistics for an English novel

# Prepare a UTF-8 encoded text file
# Read the file into a string and preprocess it by converting everything to lowercase
with open('perfect.txt','r',encoding='utf-8') as fo:
    perfectstr=fo.read().lower()
print(perfectstr)

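# Preprocess the string: replace punctuation (' : " , . ? !) and other special symbols with spaces using str.replace()
sep=',.!?:\'"‘’“”'
for ch in sep:
    perfectstr=perfectstr.replace(ch,' ')  # replace each character in turn, not the whole sep string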
# Split the string into a list of words
perfectstrlist=perfectstr.split()
print(len(perfectstrlist),perfectstrlist)
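# Build the word-count dictionary: a set gives the distinct words, a dict stores their counts
perfectstrset=set(perfectstr.split())
perfectstrdict={}
for i in perfectstrset:
    # Count whole words in the list; str.count would also match substrings inside other words
    perfectstrdict[i]=perfectstrlist.count(i)
for key in perfectstrdict:
    print(key,perfectstrdict[key])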

wclist=list(perfectstrdict.items())
print(wclist)
# Sort by word frequency with list.sort(key=)
wclist.sort(key=lambda x:x[1],reverse=True)  # most frequent words first
print(wclist)

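# Exclude function words: pronouns, articles, conjunctions, and other words that carry little meaning
exclude={'a','the','and','i','you','it'}
print(set(perfectstrlist)-exclude)
# Also drop them from the sorted list so they do not appear in the top 20
wclist=[(word,count) for word,count in wclist if word not in exclude]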

# Output the TOP 20 (slicing avoids an IndexError when there are fewer than 20 words)
for item in wclist[:20]:
    print(item)
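One detail in the counting step is worth a small demonstration: `str.count` counts substring matches, so calling it on the whole text with a short word such as `a` also counts every `a` inside longer words. Counting against the split word list avoids this. The sample sentence below is made up for the illustration.

```python
text = "a cat sat at a table"
words = text.split()

# str.count matches the letter inside "cat", "sat", "at", and "table" too.
substring_count = text.count('a')
# list.count only matches list elements that are exactly 'a'.
word_count = words.count('a')

print(substring_count)  # 6
print(word_count)       # 2
```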

2. Word frequency statistics for a Chinese novel

import jieba
# Prepare a UTF-8 encoded text file
# Read the file into a string
with open('doupo.txt','r',encoding='utf-8') as f:
    fo=f.read()
print(fo)

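# Use a dictionary to count how many times each word occurs
doupols = jieba.lcut(fo)
doupodict = {}
for word in doupols:
    if len(word)==1:
        continue  # skip single-character tokens (punctuation and one-character words)
    else:
        doupodict[word]=doupodict.get(word,0)+1
print(doupodict)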
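# The three jieba segmentation modes
print(list(jieba.cut(fo)))      # Accurate mode: splits the sentence as precisely as possible; suited to text analysis
print(list(jieba.cut(fo,cut_all=True)))  # Full mode: scans out every word that can be formed from the sentence
print(list(jieba.cut_for_search(fo)))    # Search engine mode: builds on accurate mode and splits long words again to improve recall; suited to search engine tokenization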

# Convert the dictionary to a list of (key, value) tuples
wcList = list(doupodict.items())
wcList.sort(key = lambda x:x[1],reverse=True)   # most frequent words first
print(wcList)

# Output the 5 most frequent words
for item in wcList[:5]:
    print(item)
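The counting and sorting part of the Chinese example can also be sketched with `collections.Counter`. This is a hedged variant, not the post's own code: the helper name `count_tokens` and the hand-written token list are assumptions for illustration. In practice the tokens would come from `jieba.lcut(fo)`, which is left out here so the sketch runs without jieba installed.

```python
from collections import Counter

def count_tokens(tokens, min_len=2, n=5):
    """Count tokens of at least min_len characters and return the top n."""
    # Dropping one-character tokens removes punctuation and most particles.
    counts = Counter(t for t in tokens if len(t) >= min_len)
    # most_common returns (token, count) pairs, highest count first.
    return counts.most_common(n)

# Hypothetical token list standing in for the output of jieba.lcut(text).
tokens = ['萧炎', '的', '斗气', '萧炎', '，', '斗气', '萧炎']
print(count_tokens(tokens))  # [('萧炎', 3), ('斗气', 2)]
```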

posted @ 2018-10-15 10:15 李健殷
