Comprehensive exercise: word frequency statistics

1. English word frequency statistics

Download the lyrics of an English song or an English article, and replace all separators such as , . ? ! ' : with spaces.

news='''
Guo Shuqing, head of the newly established China banking and insurance regulatory commission, was appointed Party secretary and vice-governor of the central bank on Monday, according to an announcement published on the People's Bank of China website.

Guo, 61, former chairman of the China Banking Regulatory Commission, became Party secretary as well as chairman last week of the new banking and insurance regulatory commission, which combines the role of CBRC and the China Insurance Regulatory Commission.

Yi Gang, 60, the newly elected central bank governor, was also appointed the Party's deputy chief of the central bank.

Experts said former governors of the central bank also have held the title of Party chief, but the unusual arrangement will improve coordination between regulators of different sectors.

Experts said the PBOC leadership adjustment could be in line with the country's newly restructured financial regulatory framework, on top of which is the cabinet-level financial stability and development committee established in November.

It coordinates with the PBOC and two specialized supervision bodies－the newly merged banking and insurance regulatory commission, and the China Securities Regulatory Commission.

As part of the State institutional reform plan approved by the first session of the 13th National People's Congress last week, the new watchdog for banking and insurance will be directly led by the State Council, China's Cabinet, which aims to strengthen regulation and prevent systemic financial risks, experts have said.

Under the reform plan, functions and duties, including drafting key financial regulations and supervision of the basic financial system, will belong to the PBOC.

Ming Ming, an analyst with CITIC Securities, said Guo's appointment is expected to solve existing problems with the goal of forestalling and defusing major risks.
'''

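# '!' and the apostrophe are included to match the separators listed above;
# the full-width dash '－' also appears in the sample text
sep = ''',.?!'":;()－'''
for c in sep:
    news = news.replace(c,' ')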
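
The replacement loop above can also be done in a single pass with a translation table. The following is a minimal sketch, not the method used in this post; `SEPARATORS`, `strip_separators` and `sample` are illustrative names introduced here.

```python
# Sketch: replace every separator character with a space in one pass.
# SEPARATORS, strip_separators and sample are hypothetical names for illustration.
SEPARATORS = ",.?!':;()\""

def strip_separators(text, separators=SEPARATORS):
    """Return text with each separator character replaced by a space."""
    table = str.maketrans({c: " " for c in separators})
    return text.translate(table)

sample = "People's Bank of China (PBOC): what's new?"
cleaned = strip_separators(sample)
print(cleaned.lower().split())
```

Because the apostrophe is treated as a separator, "People's" splits into "people" and "s", which is why a stray "s" may need to be excluded later.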

Convert all uppercase letters to lowercase and generate the word list

wordList = news.lower().split()
for w in wordList:
    print(w)

Generate the word frequency counts

wordDist = {}
wordSet = set(wordList)
for w in wordSet:
    wordDist[w] = wordList.count(w)

for w in wordDist:
    print(w, wordDist[w])
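
Calling `list.count` once per distinct word rescans the whole list each time. As a possible alternative, the standard library's `collections.Counter` builds the same mapping in one pass. This is a sketch on a made-up word list; `count_words` and `words` are names assumed for illustration.

```python
from collections import Counter

def count_words(words):
    """Count occurrences of each word in a single pass over the list."""
    return Counter(words)

# A small made-up word list standing in for wordList
words = "the bank and the commission and the council".split()
counts = count_words(words)
for w, n in counts.items():
    print(w, n)
```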

Sort

dictList = list(wordDist.items())
dictList.sort(key = lambda x: x[1], reverse=True)

Exclude grammatical words: pronouns, articles, conjunctions

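exclude = {'the','of','and','s','to','which','will','as','on','is','by'}
wordSet = set(wordList) - exclude
wordDist = {}  # start from an empty dict so the excluded words are dropped
for w in wordSet:
    wordDist[w] = wordList.count(w)
# re-sort: dictList was built before the exclusion
dictList = list(wordDist.items())
dictList.sort(key = lambda x: x[1], reverse=True)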

Output the top 20 words by frequency

for i in range(20):
    print(dictList[i])
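
Indexing `dictList[i]` for a fixed range raises an `IndexError` when the text has fewer distinct words than requested. One way to combine exclusion, sorting and the top-N cut is `Counter.most_common`, which simply returns fewer items in that case. A minimal sketch, with `top_words` and the sample data assumed for illustration:

```python
from collections import Counter

def top_words(words, exclude=frozenset(), n=20):
    """Return up to n (word, count) pairs, most frequent first, skipping excluded words."""
    counts = Counter(w for w in words if w not in exclude)
    return counts.most_common(n)

words = "the bank of the bank the council".split()
result = top_words(words, exclude={"the", "of"}, n=20)
for pair in result:
    print(pair)
```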

Save the text to be analyzed as a UTF-8 encoded file, and load the content for word frequency analysis by reading that file.

Read the news.txt file:

f=open('news.txt','r',encoding='utf-8')
news=f.read()
f.close()
print(news)
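
A `with` block closes the file automatically, even if reading fails partway. The sketch below writes a temporary file first so that it runs on its own; `read_text` and the temporary path are assumptions for illustration, not part of the exercise.

```python
import os
import tempfile

def read_text(path, encoding="utf-8"):
    """Read a whole text file; the with block closes it automatically."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()

# Demo: create a throwaway news.txt in a temporary directory, then read it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "news.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("Guo Shuqing was appointed Party secretary.")
    text = read_text(path)
print(text)
```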

Write the sorted results to the newscount.txt file:

f=open('newscount.txt','a')
for i in range(25):
    f.write(dictList[i][0]+' '+str(dictList[i][1])+'\n')
f.close()
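
Opening the output file in mode 'a' appends, so running the script repeatedly accumulates duplicate lines. If a fresh file per run is preferred, mode 'w' with an explicit encoding is one option. A sketch under that assumption; `save_counts`, `pairs` and the temporary path are illustrative names.

```python
import os
import tempfile

def save_counts(pairs, path, limit=20):
    """Write at most `limit` (word, count) pairs, one 'word count' per line."""
    with open(path, "w", encoding="utf-8") as f:  # 'w' overwrites on each run
        for word, count in pairs[:limit]:
            f.write(f"{word} {count}\n")

# Made-up, already sorted pairs standing in for dictList
pairs = [("bank", 5), ("central", 4), ("commission", 3)]
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "newscount.txt")
    save_counts(pairs, out_path, limit=2)
    with open(out_path, encoding="utf-8") as f:
        saved = f.read()
print(saved)
```

Slicing with `pairs[:limit]` also avoids an `IndexError` when there are fewer entries than the limit.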

2. Chinese word frequency statistics

Download a long Chinese article. Read the text to be analyzed from a file.

news = open('gzccnews.txt','r',encoding = 'utf-8').read()  # read the text: jieba.cut expects a string, not a file object

Install jieba and use it for Chinese word segmentation.

pip install jieba

import jieba

list(jieba.cut(news))

import jieba

file=open('hong.txt','r',encoding='utf-8')
word=file.read()
file.close()

Generate the word frequency counts

wordList=list(jieba.cut_for_search(word))

wordDist={}
for w in set(wordList):  # count each distinct word once rather than once per occurrence
    wordDist[w] = wordList.count(w)

for w in wordDist:
    print(w, wordDist[w])

Sort

dictList = list(wordDist.items())
dictList.sort(key = lambda x: x[1], reverse=True)

Exclude grammatical words: pronouns, articles, conjunctions

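sep='''，。？“”：、?；!！'''

exclude ={' ','\n','了','的','\u3000','他','我','也','又','是','你','着','这','就','都','呢','只'}

# punctuation is filtered as tokens here: replacing it in `word` after segmentation would not change wordList
wordSet=set(wordList)-exclude-set(sep)

# rebuild the counts and re-sort so the excluded tokens drop out of dictList
wordDist={w: wordList.count(w) for w in wordSet}
dictList=sorted(wordDist.items(), key=lambda x: x[1], reverse=True)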
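
The filtering step can be tried without a real corpus. The sketch below runs on a hand-written token list that stands in for segmenter output (it does not call jieba); `PUNCTUATION`, `STOP_WORDS`, `count_tokens` and the sample tokens are assumptions for illustration.

```python
from collections import Counter

# Punctuation and stop words mirror the sets used in this post
PUNCTUATION = set("，。？“”：、?；!！")
STOP_WORDS = {"了", "的", "他", "我", "也", "又", "是", "你", "着", "这", "就", "都", "呢", "只"}

def count_tokens(tokens):
    """Count tokens, skipping whitespace-only tokens, punctuation and stop words."""
    kept = [t for t in tokens
            if t.strip() and t not in PUNCTUATION and t not in STOP_WORDS]
    return Counter(kept)

# Hand-written tokens standing in for segmenter output
tokens = ["宝玉", "的", "话", "，", "黛玉", "也", "听", "了", "。", "\u3000", "宝玉", "\n"]
counts = count_tokens(tokens)
print(counts.most_common(1))
```

Using `t.strip()` drops spaces, newlines and the full-width space '\u3000' in one check, so they do not have to be listed individually.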

Output the top 20 words by frequency (or save the results to a file)

f=open('hongcount.txt','a',encoding='utf-8')  # write with the same encoding used to read the input
for i in range(20):
    f.write(dictList[i][0]+' '+str(dictList[i][1])+'\n')
f.close()

posted @ 2018-03-27 23:58 246王芷玲
