Comprehensive Exercise: English Word Frequency Count

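  1. Preprocessing for word-frequency counting
  2. Download the lyrics of an English song or an English article
  3. Replace all separators such as , . ? ! ’ : with spaces
  4. Convert all uppercase letters to lowercase
  5. Generate the word list
  6. Generate the word-frequency counts
  7. Sort
  8. Exclude grammatical words: pronouns, articles, conjunctions
  9. Output the top 10 words by frequency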
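
Steps 3 to 5 above can be sketched as a small helper. This is only an illustration: the name `tokenize` and the sample sentence are assumptions made for the example and are not part of the exercise code, and `str.translate` is used here in place of repeated `replace` calls.

```python
# Separator characters to turn into spaces (same set as in the exercise code)
SEPARATORS = ''',.?!’:"“”-%$'''


def tokenize(text, separators=SEPARATORS):
    """Replace each separator with a space, lowercase the text, split into words."""
    # Map every separator character to a single space
    table = str.maketrans({ch: ' ' for ch in separators})
    return text.translate(table).lower().split()


# Hypothetical sample text, only for demonstration
words = tokenize("Hello, world! Hello again: it’s 100% fine.")
print(words)
# ['hello', 'world', 'hello', 'again', 'it', 's', '100', 'fine']
```

Note that splitting on ’ leaves a stray `s` token, which is why `s` appears in the exclusion list of the exercise code.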

Code:

# -*- coding:utf-8 -*-

# Read the whole file (Python 3); assumes song.txt is saved as UTF-8
with open('song.txt', 'r', encoding='utf-8') as f:
    song = f.read()

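# Separator characters that will be replaced with spaces
symbol = ''',.?!’:"“”-%$'''

# Grammatical words (articles, pronouns, conjunctions, ...) to leave out of the count
exclude = '''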
a an the in on to at and of is was are were i he she you your they us their our it or for be too do no 
that s so as but it's
'''

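# Replace every separator character with a space
for i in symbol:
    song = song.replace(i, ' ')

# Lowercase the text and split it into a word list
songList = song.lower().split()

# Build the set of words to exclude
prep = exclude.split()
excludeSet = set(prep)

# Count every remaining distinct word
songDict = {}
songSet = set(songList) - excludeSet

for i in songSet:
    songDict[i] = songList.count(i)

# Sort by count, highest first, and print at most the top 10
# (slicing avoids an IndexError when there are fewer than 10 distinct words)
dictList = list(songDict.items())
dictList.sort(key=lambda item: item[1], reverse=True)
for item in dictList[:10]:
    print(item)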
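
The loop above calls `songList.count(i)` once per distinct word, so the list is rescanned each time. A possible alternative, sketched here with the standard library's `collections.Counter`, counts everything in one pass. The function name `top_words` and the sample data are assumptions made for illustration only.

```python
from collections import Counter


def top_words(words, exclude, n=10):
    """Return the n most frequent words that are not in the exclude set."""
    # Count in a single pass, skipping excluded words
    counts = Counter(w for w in words if w not in exclude)
    # most_common returns (word, count) pairs, highest count first
    return counts.most_common(n)


# Hypothetical sample data, only for demonstration
sample = "the bank said the central bank said bank".split()
print(top_words(sample, exclude={"the", "and"}, n=2))
# [('bank', 3), ('said', 2)]
```

With the variables from the code above, the call would look like `top_words(songList, excludeSet)`. Words with equal counts may come out in a different order than in the set-based version, since iteration order over a set is arbitrary.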

Output:

('regulatory', 7)
('commission', 6)
('insurance', 5)
('financial', 5)
('bank', 5)
('banking', 5)
('china', 5)
('newly', 4)
('said', 4)
('central', 4)

 

posted @ 2018-03-21 20:33  171蓝海鹏