Python-Text Word Frequency Counting

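This semester I have been learning Python from Song Tian's course on MOOC. Quite a few of the examples worked when I typed along with the videos but broke when I ran them myself in VS Code (the file kept failing to open, and so on), so I am writing them down here so that people searching on Baidu can find the fix faster.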

#Word frequency count for English text

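 #CalaliceV1.py
 import string
 def getText():
     # Read the whole file; the with-block closes it afterwards
     with open("11.txt","r",encoding='utf-8') as f:
         txt = f.read()
     txt = txt.lower() # convert all uppercase letters to lowercase
     # string.punctuation covers the ASCII symbols (the hand-typed list missed the comma); curly quotes are added
     for ch in string.punctuation + "“”‘’":
         txt = txt.replace(ch," ") # replace each symbol with a space
     return txt
 # aliceTxt is now lowercase text with no symbols and words separated by whitespace
 aliceTxt = getText()
 words = aliceTxt.split() # split() breaks on whitespace and returns a list
 counts = {}
 for word in words:
     counts[word] = counts.get(word,0)+1 # a word not seen before starts at 0
 items = list(counts.items())
 items.sort(key=lambda x:x[1],reverse=True) # sort by count, highest first
 for word,count in items[:10]: # slicing is safe even with fewer than 10 distinct words
     print("{0:<10}{1:>5}".format(word,count))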
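For comparison, here is a minimal sketch of the same counting step using the standard library's `collections.Counter` instead of a hand-built dict. The function name `top_words` and the sample sentence are made up for illustration, and the regex `[a-z]+` is a simplification that splits words differently from the replace-and-split approach (digits are dropped, for example).

```python
import re
from collections import Counter

def top_words(text, n=10):
    """Return the n most frequent words in text as (word, count) pairs."""
    # Lowercase first, then keep only runs of ASCII letters as words
    words = re.findall(r"[a-z]+", text.lower())
    # most_common(n) sorts by count, highest first
    return Counter(words).most_common(n)

# Small made-up sample instead of a file, so the sketch runs anywhere
sample = "The cat and the hat. The end!"
for word, count in top_words(sample, 3):
    print("{0:<10}{1:>5}".format(word, count))
```

To use it on a real file, pass the file's contents as `text`, for example the string read from "11.txt".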

#Word frequency count for Chinese text

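 import jieba
 # Binary mode avoids decode errors while reading; jieba decodes the bytes itself
 with open("sangou.txt","rb") as f:
     txt = f.read()
 # Frequent words that are not character names
 excludes={"将军","却说","荆州","二人","不可","如此","不能","商议","如何","军马","引兵","次日","大喜","天下","于是","东吴","今日","不敢","陛下","人马","左右","军士","主公","魏兵","都督","一人","不知","汉中","众将","只见","后主","蜀兵","大叫","上马","此人","先主","城中","太守","天子","背后","后人"}
 words=jieba.lcut(txt) # segment the text into a list of words
 counts={}
 for word in words:
     if len(word)==1: # skip single characters
         continue
     # Merge different names for the same person into one key
     elif word=='诸葛亮' or word=='孔明曰':
         rword='孔明'
     elif word=='关公' or word=='云长':
         rword='关羽'
     elif word=='玄德' or word=='玄德曰':
         rword='刘备'
     elif word=='孟德' or word=='丞相':
         rword='曹操'
     else:
         rword=word
     counts[rword]=counts.get(rword,0)+1
 for word in excludes:
     counts.pop(word,None) # unlike del, pop does not raise KeyError if the word never appeared
 items=list(counts.items())
 items.sort(key=lambda x:x[1],reverse=True) # sort by count, highest first
 for word,count in items[:15]: # slicing is safe even with fewer than 15 entries
     print("{0:<10}{1:>5}".format(word,count))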
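The if/elif chain above gets longer with every alias. One possible alternative is a lookup table, sketched below. The names `ALIASES` and `count_names` are hypothetical and not part of the course code; the function takes an already segmented word list (such as the result of `jieba.lcut`), so the sketch itself does not need jieba.

```python
# Hypothetical alias table built from the names in the if/elif chain
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}

def count_names(words, aliases=ALIASES, excludes=frozenset()):
    """Count words of length > 1, merging aliases and skipping excluded words."""
    counts = {}
    for word in words:
        if len(word) == 1:  # skip single characters
            continue
        name = aliases.get(word, word)  # fall back to the word itself
        if name in excludes:
            continue
        counts[name] = counts.get(name, 0) + 1
    return counts

# Tiny hand-made word list standing in for jieba's output
print(count_names(["玄德", "曰", "玄德曰", "将军", "孔明"], excludes={"将军"}))
```

Adding a new alias then means adding one entry to the table rather than another branch.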

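Note: in my VS Code setup, the file to be read had to go in the parent directory rather than next to the code. A relative path such as "11.txt" is resolved against the current working directory, which in VS Code is usually the folder you opened, not the folder that contains the script.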
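If moving the file around feels fragile, one option is to build the path from the script's own location. This is a minimal sketch; `data_path` is a hypothetical helper name, not something from the course.

```python
from pathlib import Path

def data_path(filename, script_file):
    """Build the absolute path of filename in the same folder as script_file."""
    return Path(script_file).resolve().parent / filename

# Shows where relative paths such as "11.txt" are currently resolved from
print("working directory:", Path.cwd())
```

Inside a script, `open(data_path("11.txt", __file__), "r", encoding="utf-8")` would then look for the file next to the script, whichever folder the script is launched from.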

posted @ 2020-04-18 18:35 Nicky_啦啦啦是阿落啊
