2018.09.27 Assignment 4

Complete Chinese and English word frequency count

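Steps:

1. Prepare a UTF-8 encoded text file (file)

2. Read the file into a string (str)

3. Preprocess the text

4. Split the text into words (list)

5. Count the words (set, dict)

6. Sort by frequency: list.sort(key=)

7. Exclude grammatical words with no semantic content, such as pronouns, articles, and conjunctions

8. Output the TOP 20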
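
The steps above can be sketched as one small function. This is a minimal sketch, not the assignment's reference solution: the name `top_words`, the regular expression, and the `STOP_WORDS` set are assumptions made for illustration, and `collections.Counter` stands in for the hand-built dict used later in this post.

```python
import re
from collections import Counter

# Hypothetical stop-word set for illustration; extend it as needed.
STOP_WORDS = {'a', 'and', 'the', 'i', 'you'}

def top_words(text, n=20, stop_words=STOP_WORDS):
    """Return the n most frequent non-stop words in text."""
    # Steps 3-4: lowercase, then pull out runs of letters and apostrophes
    words = re.findall(r"[a-z']+", text.lower())
    # Steps 5 and 7: count the words, skipping stop words
    counts = Counter(w for w in words if w not in stop_words)
    # Steps 6 and 8: most_common sorts by count, highest first
    return counts.most_common(n)

sample = "I'm a big big girl in a big big world"
print(top_words(sample, n=3))
```

Passing the stop words as a parameter keeps the function reusable for texts that need a different exclusion list.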

 

1. Complete English word frequency count:

strBig = '''Big Big World

I'm a big big girl

In a big big world

It's not a big big thing if you leave me

But I do do feel

that I too too will miss you much

Miss you much.

I can see the first leaf falling

It's all yellow and nice

It's so very cold outside

Like the way I'm feeling inside

I'm a big big girl

In a big big world

It's not a big big thing if you leave me

But I do do feel

that I too too will miss you much

Miss you much.

Outside it's now raining

And tears are falling from my eyes

Why did it have to happen

Why did it all have to end

I'm a big big girl

In a big big world

It's not a big big thing if you leave me

But I do do feel

that I too too will miss you much

Miss you much.

I have your arms around me ooooh like fire

But when I open my eyes

You're gone.

I'm a big big girl

In a big big world

It's not a big big thing if you leave me

But I do do feel

that I too too will miss you much

Miss you much.

I'm a big big girl

In a big big world

It's not a big big thing if you leave me

But I do feel that will miss you much

Miss you much.'''.lower()


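# Open and read the text file (this replaces the string literal above)
# Preprocessing: convert uppercase to lowercase
# Close the file
# Print the text
fo = open('bigbigworld.txt','r',encoding='utf-8')
strBig = fo.read().lower()
fo.close()
print(strBig)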
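
As an alternative to calling close() by hand, the file can be read inside a with block. The sketch below is only an illustration: the helper name `read_lower` and the temporary sample file are assumptions, used so the example runs without bigbigworld.txt.

```python
import os
import tempfile

# Write a small sample file so the sketch is self-contained;
# the path and contents are placeholders, not the assignment's bigbigworld.txt.
path = os.path.join(tempfile.gettempdir(), 'sample_lyrics.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write("I'm a big big girl\nIn a big big world\n")

def read_lower(path):
    """Read a UTF-8 text file and return its contents in lowercase."""
    # The with block closes the file even if read() raises an exception
    with open(path, 'r', encoding='utf-8') as f:
        return f.read().lower()

text = read_lower(path)
print(text)
```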


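# String preprocessing
# Case was already converted when the file was read
# Replace punctuation marks
# and special characters with spaces
sep = '.,:;?!-_'
for ch in sep:
    strBig = strBig.replace(ch,' ')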
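
The replace loop above handles a fixed list of marks. One possible generalisation, assuming all ASCII punctuation except the apostrophe should become a space, is a translation table built with str.maketrans. The helper name `strip_punctuation` is hypothetical.

```python
import string

def strip_punctuation(text):
    """Replace every ASCII punctuation mark except the apostrophe with a space."""
    # Keep the apostrophe so contractions such as "it's" stay in one piece
    marks = string.punctuation.replace("'", "")
    table = str.maketrans(marks, ' ' * len(marks))
    return text.translate(table)

print(strip_punctuation("it's all yellow, and nice. you're gone!").split())
```

A single translate call scans the text once, whereas the loop scans it once per punctuation mark.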


# Split the text into a list of words
strList = strBig.split()
print(len(strList), strList)

# Unique words (set)
strSet = set(strList)
print(len(strSet), strSet)

# Word counts (dict)
strDict = {}
for word in strSet:
    strDict[word] = strList.count(word)

print(len(strDict), strDict)
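
Calling strList.count(word) for every unique word rescans the whole list each time. For longer texts, a single pass is likely to be faster. The sketch below shows one way to do that; the function name `count_words` is hypothetical, and the last line only checks that the standard library's Counter builds the same mapping.

```python
from collections import Counter

def count_words(words):
    """Count occurrences of each word in one pass over the list."""
    counts = {}
    for word in words:
        # get() returns 0 the first time a word is seen
        counts[word] = counts.get(word, 0) + 1
    return counts

words = "miss you much miss you much".split()
manual = count_words(words)
print(manual)
# Counter from the standard library produces the same counts
print(dict(Counter(words)) == manual)
```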


# Sort by frequency
wcList = list(strDict.items())
print(wcList)
wcList.sort(key=lambda x:x[1],reverse=True)
print(wcList)
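
Because the dict above is filled by iterating over a set, words with equal counts can come out in a different order from run to run. If a stable order is wanted, a secondary sort key is one option. The sample counts below are made up for illustration.

```python
# Made-up counts, only to illustrate the tie-breaking key
counts = {'big': 4, 'world': 1, 'girl': 1, 'miss': 2}

# Sort by descending count, then alphabetically so ties come out in a fixed order
ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked)
# → [('big', 4), ('miss', 2), ('girl', 1), ('world', 1)]
```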


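# Exclude grammatical words with no semantic content (pronouns, articles, conjunctions)
# Filter the sorted list itself so the TOP 20 below no longer contains them
exclude = {'a','and','the','i','you'}
wcList = [(word, count) for word, count in wcList if word not in exclude]
print(len(wcList), wcList)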

# Output the TOP 20 (slicing avoids an IndexError when there are fewer than 20 words)
for item in wcList[:20]:
    print(item)
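
The tuples printed above can also be laid out as a ranked table. This is an optional presentation sketch; the helper `format_top` and the column widths are assumptions, and the sample data is made up.

```python
def format_top(ranked, n=20):
    """Return one formatted line per (word, count) pair, for at most n pairs."""
    lines = []
    for rank, (word, count) in enumerate(ranked[:n], start=1):
        # Right-align the rank, left-align the word in a 10-character column
        lines.append(f"{rank:>2}. {word:<10} {count}")
    return lines

# Made-up sample data for illustration
for line in format_top([('big', 4), ('miss', 2)], n=5):
    print(line)
```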

  Output:

(1)

(2)

 (3)

 

2. Chinese word frequency count

# Chinese word frequency count

import jieba

so = open('hongloumeng.txt', 'r', encoding='utf-8')
strhong = so.read()  # read the file into a string
so.close()
print(strhong)

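# Strip punctuation (the list now covers both the ASCII and the full-width comma)
sep = ',,。?!;:‘’“”'
for ch in sep:
    strhong = strhong.replace(ch,'')
# Print once after the loop instead of once per punctuation mark
print(strhong)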

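# Segment the text into words with jieba; set(strhong) would yield single characters
# Whitespace-only tokens such as newlines are dropped
wordList = [w for w in jieba.lcut(strhong) if w.strip()]

# Unique words (set)
strSet = set(wordList)
print(len(strSet),strSet)

# Word counts (dict), built in one pass over the word list
strDict={}
for word in wordList:
    strDict[word]=strDict.get(word, 0) + 1
print(len(strDict),strDict)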

# Convert the dict to a list
wcList = list(strDict.items())
print(wcList)

# Sort by frequency
wcList.sort(key=lambda x: x[1], reverse=True)
print(wcList)

# Output the TOP 20
for item in wcList[:20]:
    print(item)
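
For Chinese text the counts depend on how the string is split into tokens. The sketch below counts an already segmented token list and skips single-character tokens, which are often function words. The helper `count_tokens`, the `min_len` threshold, and the hand-written token list are all assumptions; the list merely stands in for what a segmenter such as jieba.lcut might return, and a real segmentation may differ.

```python
from collections import Counter

def count_tokens(tokens, min_len=2):
    """Count tokens, skipping whitespace and tokens shorter than min_len."""
    kept = (t for t in tokens if t.strip() and len(t) >= min_len)
    return Counter(kept)

# Hand-written tokens standing in for the output of a segmenter;
# they are illustrative only and not real jieba output.
tokens = ['宝玉', '笑', '道', '\n', '黛玉', '听', '了', '宝玉', '的', '话']
print(count_tokens(tokens).most_common(2))
# → [('宝玉', 2), ('黛玉', 1)]
```

Keeping the tokenizer outside the function means the same counting code works for English word lists as well.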

  Output:

(1)

(2)

 

posted on 2018-09-29 09:56  LinYxin