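2018.09.27 Assignment 4
Complete English and Chinese word frequency count
Steps:
1. Prepare a UTF-8 encoded text file (file)
2. Read the file into a string (str)
3. Preprocess the text
4. Split the text into words (list)
5. Count the words (set, dict)
6. Sort by frequency with list.sort(key=)
7. Exclude function words such as pronouns, articles, and conjunctions, which carry little meaning
8. Print the top 20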
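The steps above can be sketched as one small function. This is only an illustration, not the assignment's required solution: it swaps in `collections.Counter` and `re.findall` from the standard library, and the names `top_words` and `STOP_WORDS` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; extend it as needed.
STOP_WORDS = {'a', 'and', 'the', 'i', 'you'}

def top_words(text, stop_words=STOP_WORDS, n=20):
    """Return the n most frequent words in text, skipping stop words."""
    # Steps 3-4: lowercase, then extract runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # Steps 5 and 7: count every word that is not a stop word.
    counts = Counter(w for w in words if w not in stop_words)
    # Steps 6 and 8: most_common returns (word, count) pairs, highest count first.
    return counts.most_common(n)

sample = "I'm a big big girl in a big big world"
print(top_words(sample, n=3))
```

Filtering stop words before counting means the top-n list never contains them, which matches the intent of step 7.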
1. Complete English word frequency count:
strBig = '''Big Big World
I'm a big big girl
In a big big world
It's not a big big thing if you leave me
But I do do feel
that I too too will miss you much
Miss you much.
I can see the first leaf falling
It's all yellow and nice
It's so very cold outside
Like the way I'm feeling inside
I'm a big big girl
In a big big world
It's not a big big thing if you leave me
But I do do feel
that I too too will miss you much
Miss you much.
Outside it's now raining
And tears are falling from my eyes
Why did it have to happen
Why did it all have to end
I'm a big big girl
In a big big world
It's not a big big thing if you leave me
But I do do feel
that I too too will miss you much
Miss you much.
I have your arms around me ooooh like fire
But when I open my eyes
You're gone.
I'm a big big girl
In a big big world
It's not a big big thing if you leave me
But I do do feel
that I too too will miss you much
Miss you much.
I'm a big big girl
In a big big world
It's not a big big thing if you leave me
But I do feel that will miss you much
Miss you much.'''.lower()
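# Open and read the text file (this replaces the literal above with the file contents)
# Preprocessing: convert uppercase to lowercase
# The with statement closes the file automatically
# Print the text
with open('bigbigworld.txt', 'r', encoding='utf-8') as fo:
    strBig = fo.read().lower()
print(strBig)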
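# String preprocessing
# Case conversion (already done with lower())
# Replace punctuation marks
# and special characters with spaces
sep = '.,:;?!-_'
for ch in sep:
    strBig = strBig.replace(ch, ' ')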
# Split the text into a list of words
strList = strBig.split()
print(len(strList), strList)
# Unique words (set)
strSet = set(strList)
print(len(strSet), strSet)
# Word counts (dict)
strDict = {}
for word in strSet:
    strDict[word] = strList.count(word)
print(len(strDict), strDict)
# Sort by frequency, highest first
wcList = list(strDict.items())
print(wcList)
wcList.sort(key=lambda x: x[1], reverse=True)
print(wcList)
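# Exclude function words (pronouns, articles, conjunctions, and other words with little meaning)
# Filter the sorted list itself so that the excluded words cannot appear in the top 20
exclude = {'a', 'and', 'the', 'i', 'you'}
wcList = [(word, count) for word, count in wcList if word not in exclude]
print(len(wcList), wcList)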
# Print the top 20 (slicing avoids an IndexError when there are fewer than 20 words)
for item in wcList[:20]:
    print(item)
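The script above calls `strList.count(word)` once per unique word, so the word list is rescanned for every entry in the dictionary. A minimal sketch of a single-pass alternative follows, assuming the same separator characters; the function name `count_words` is hypothetical, and `str.maketrans` with `str.translate` stands in for the `replace` loop.

```python
def count_words(text, sep='.,:;?!-_'):
    """Count words in text after replacing separator characters with spaces."""
    # Map every separator character to a space in one translation pass.
    table = str.maketrans(sep, ' ' * len(sep))
    words = text.lower().translate(table).split()
    counts = {}
    for word in words:
        # dict.get returns 0 the first time a word is seen.
        counts[word] = counts.get(word, 0) + 1
    return counts

counts = count_words("Miss you much. Miss you much.")
ranked = sorted(counts.items(), key=lambda x: x[1], reverse=True)
print(ranked)
```

For a short lyric the difference is negligible, but on a full novel the single pass avoids rescanning the whole list for each distinct word.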
Output: screenshots (1), (2), and (3) are not preserved in this copy.
2. Chinese word frequency count
# Chinese word frequency count
import jieba
# Read the file into a string
with open('hongloumeng.txt', 'r', encoding='utf-8') as so:
    strhong = so.read()
print(strhong)
# Remove punctuation (both ASCII and full-width forms)
sep = ',。?!;:‘’“”,?!;:'
for ch in sep:
    strhong = strhong.replace(ch, '')
print(strhong)
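# Segment the text into words with jieba, dropping whitespace-only tokens
# (calling set() on the string itself would count single characters, not words)
wordList = [word for word in jieba.lcut(strhong) if word.strip()]
# Unique words (set)
strSet = set(wordList)
print(len(strSet), strSet)
# Word counts (dict), built in a single pass over the word list
strDict = {}
for word in wordList:
    strDict[word] = strDict.get(word, 0) + 1
print(len(strDict), strDict)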
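# Convert the dictionary to a list
wcList = list(strDict.items())
print(wcList)
# Sort by frequency, highest first
wcList.sort(key=lambda x: x[1], reverse=True)
print(wcList)
# Print the top 20 (slicing avoids an IndexError when there are fewer than 20 entries)
for item in wcList[:20]:
    print(item)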
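Chinese text has no spaces between words, so a word-level count depends on a tokenizer. The sketch below keeps the tokenizer as a parameter so that the counting logic can be tried without any third-party package. The function name `top_tokens` and the `min_len` filter are assumptions added for illustration: dropping single-character tokens is one simple way to skip particles and leftover punctuation, not something the assignment prescribes.

```python
from collections import Counter

def top_tokens(text, cut, n=20, min_len=2):
    """Count the tokens produced by cut(text) and return the n most frequent."""
    # Strip surrounding whitespace so newline tokens collapse to empty strings.
    tokens = (t.strip() for t in cut(text))
    # Keep only tokens of at least min_len characters (assumed filter).
    counts = Counter(t for t in tokens if len(t) >= min_len)
    return counts.most_common(n)

# Usage with jieba (third-party; install with `pip install jieba`):
#   import jieba
#   print(top_tokens(strhong, jieba.lcut))

# Demo with a whitespace tokenizer so the block runs without jieba.
demo = top_tokens("宝玉 黛玉 宝玉 的", str.split, n=1)
print(demo)
```

Passing `jieba.lcut` as `cut` gives the word-level count for the novel; the whitespace tokenizer in the demo is only a stand-in.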
Output: screenshots (1) and (2) are not preserved in this copy.