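Assignment 4
Stage assignment 1: complete Chinese and English word frequency statistics
Steps:
1. Prepare a UTF-8 encoded text file (file)
2. Read the file into a string (str)
3. Preprocess the text
4. Split the text into words (list)
5. Count the words in a dictionary (set, dict)
6. Sort by frequency with list.sort(key=)
7. Exclude function words that carry no meaning of their own, such as pronouns, articles, and conjunctions
8. Output the top 20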
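The steps above can be sketched compactly with collections.Counter from the standard library. This is only an illustrative sketch, not the assignment's required solution: the names read_text, top_words, SEPARATORS, and STOPWORDS are hypothetical, and the stop-word set is a small placeholder that a real text would need to extend.

```python
from collections import Counter

# Hypothetical values for illustration; adjust both for a real text.
SEPARATORS = ".,:;?!-_"
STOPWORDS = {"the", "a", "an", "and", "or", "i", "you", "to", "in", "of"}


def read_text(path):
    """Steps 1-2: read a UTF-8 text file into a string."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


def top_words(text, stopwords=STOPWORDS, n=20):
    """Steps 3-8: return the n most frequent words that are not stop words."""
    # Step 3: lowercase and replace each separator with a space.
    table = str.maketrans(SEPARATORS, " " * len(SEPARATORS))
    cleaned = text.lower().translate(table)
    # Step 4: split on whitespace.
    words = cleaned.split()
    # Steps 5 and 7: count the words, skipping stop words.
    counts = Counter(w for w in words if w not in stopwords)
    # Steps 6 and 8: most_common returns pairs sorted by count, descending.
    return counts.most_common(n)


sample = "The cat and the hat. The cat sat!"
print(top_words(sample, n=3))
```

Counter does the counting in one pass over the word list, whereas calling list.count once per distinct word rescans the whole list each time. For a full novel the difference is noticeable.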
To complete:
1. Word frequency statistics for an English novel
2. Word frequency statistics for a Chinese novel
Input code:
strBig = '''Loving strangers loving strangers Loving strangers Loving strangers loving strangers Loving strangers I’ve got a hole in my pocket where all the money has gone I’ve got a whole lot of work to do with your heart 'Cause it’s so busy mine’s not Loving strangers loving strangers Loving strangers Loving strangers loving strangers Loving strangers It’s just the start of the winter and I’m all alone but I’ve got my eye right on you Give me a coin and I'll take you to the moon Give me a beer and I’ll kiss you so foolishly Like you do when you lie when you’re not in my thoughts Like you do when you lie and I know it’s my imagination Loving strangers loving strangers Loving strangers Loving strangers loving strangers Loving strangers Loving strangers loving strangers Loving strangers Loving strangers loving strangers Loving strangers'''

# Read the text file (UTF-8) and print it; the counting below uses strBig
with open('1.txt', 'r', encoding='utf-8') as fo:
    big = fo.read()
print(big)

# Preprocess the string: case, punctuation, special symbols
strBig = strBig.lower()
sep = '''.,:;?!-_'''
for ch in sep:
    strBig = strBig.replace(ch, ' ')

# Split the text into words (list)
strList = strBig.split()
print(len(strList), strList)

# Exclude function words: pronouns, articles, conjunctions, etc.
# (all lowercase, because the text was lowercased above)
strSet = set(strList)
exclude = {'my', 'the', 'when', 'has', 'you', 'i', 'where', 'to', 'me', 'and', 'in'}
strSet = strSet - exclude
print(len(strSet), strSet)

# Count the words in a dictionary (set, dict)
strDict = {}
for word in strSet:
    strDict[word] = strList.count(word)
print(len(strDict), strDict)

# Sort by frequency with list.sort(key=)
wcList = list(strDict.items())
print(wcList)
wcList.sort(key=lambda x: x[1], reverse=True)
print(wcList)

# Output the top 20 (fewer if the text has fewer distinct words)
for i in range(min(20, len(wcList))):
    print(wcList[i])
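The lyrics mix straight and typographic apostrophes (I'll and I’ll), so splitting on whitespace alone counts the two spellings as different words. One possible way around this, shown here as a sketch rather than as part of the assignment, is to normalize the apostrophe first and extract words with a regular expression. The function name tokenize is hypothetical.

```python
import re

# A word is a run of letters, optionally followed by apostrophe-joined
# parts, so "i'll" stays whole and the leading quote of "'cause" is dropped.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def tokenize(text):
    """Lowercase text, unify apostrophes, and return the list of words."""
    normalized = text.lower().replace("’", "'")
    return WORD_RE.findall(normalized)


print(tokenize("Give me a coin and I'll take you; I’ll kiss you"))
```

Because the pattern only matches ASCII letters and apostrophes, every other character acts as a separator, so no explicit separator list is needed for English text.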
Output:
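import jieba

# Read the novel (UTF-8)
with open('hlm.txt', 'r', encoding='utf-8') as f:
    enstr = f.read()
print(enstr)

# Remove Chinese punctuation
sep = ',。?!;:“”‘’'
for ch in sep:
    enstr = enstr.replace(ch, '')
print(enstr)

# Segment the text into words with jieba and drop whitespace-only tokens
# (set(enstr) would count single characters, not words)
wordList = [w for w in jieba.lcut(enstr) if w.strip()]
strSet = set(wordList)
print(len(strSet), strSet)

# Count each word in one pass over the word list
strDict = dict()
for word in wordList:
    strDict[word] = strDict.get(word, 0) + 1
print(len(strDict), strDict)

# Sort by frequency
List = list(strDict.items())
print(List)
List.sort(key=lambda x: x[1], reverse=True)
print(List)

# Output the top 20 (fewer if there are fewer distinct words)
for i in range(min(20, len(List))):
    print(List[i])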
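Step 7 (excluding function words) also applies to Chinese text. A segmenter such as jieba typically returns particles, punctuation, and whitespace as tokens of their own, and these tend to dominate the top of the list. The sketch below assumes the tokens have already been produced, for example by jieba.lcut, and only shows the filtering and counting. The name top_tokens and the stop-word set are hypothetical, and dropping every single-character token is a simplification that also discards some meaningful words.

```python
from collections import Counter

# Hypothetical stop-word set for illustration only.
CN_STOPWORDS = {"一个", "我们", "什么", "这里", "那里"}


def top_tokens(tokens, stopwords=CN_STOPWORDS, n=20):
    """Count tokens, skipping stop words and tokens shorter than 2 characters."""
    kept = (
        tok.strip()
        for tok in tokens
        if len(tok.strip()) >= 2 and tok.strip() not in stopwords
    )
    return Counter(kept).most_common(n)


# Tokens as a segmenter might return them (written by hand for this example).
tokens = ["宝玉", "的", "黛玉", "宝玉", ",", "了", "贾母", "\n", "黛玉", "宝玉"]
print(top_tokens(tokens, n=2))
```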
Chinese novel word frequency statistics: