Analyzing Trump's Inaugural Address with Python
To start, here is an article in which IBM Watson text-analyzes the inaugural addresses of past US presidents: http://36kr.com/p/5062661.html
At noon on January 20, 2017, Donald Trump was sworn in in Washington, D.C., officially becoming the 45th President of the United States and completing a striking leap straight from real-estate boss to president. Today we'll run a text analysis on his inaugural address and see whether we can dig up anything interesting.
First, I found the transcript and video of the address on CNN: https://edition.cnn.com/2017/01/20/politics/trump-inaugural-address/index.html

The code is as follows:
speech_text = '''(full speech transcript goes here)'''
speech = speech_text.lower().split()  # lowercase the transcript and split it into individual words
dic = {}  # empty dictionary to store the words in the speech
for word in speech:
    if word in dic:
        dic[word] += 1  # seen before: increment its count
    else:
        dic[word] = 1   # first occurrence: start the count at 1
dic  # display the dictionary; next we process it. It looks like this (only part is copied):
{'"how': 1,
'--': 4,
'17:17': 1,
'2017,': 1,
'20th': 1,
'a': 15,
'about': 2,
'accept': 1,
'across': 5,
'action': 1,
'action.': 1,
'address': 1,
'administration': 1,
'affairs,': 1,
'again,': 1}
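As an aside, the counting above can be done in one line with collections.Counter from the standard library; a minimal sketch that yields the same counts:

from collections import Counter
dic = Counter(speech_text.lower().split())  # same word counts as the manual loop above

Counter also provides most_common(), which returns the (word, count) pairs already sorted by count, descending; that is exactly what the next step does by hand.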
import operator
swd = sorted(dic.items(), key=operator.itemgetter(1), reverse=True)  # pull out the (word, count) pairs and sort them by value, descending
swd  # display swd, sorted from largest to smallest count (only part is copied, for space):
[('and', 73),
('the', 71),
('of', 48),
('our', 48),
('we', 45),
('will', 40),
('to', 37),
('is', 21),
('a', 15),
('for', 15),
('are', 14),
('in', 14),
('but', 13),
('all', 12),
('from', 12),
('be', 12),
('their', 11),
('american', 11),
('your', 11),
('not', 10),
('america', 9),
('this', 9),
('it', 9),
('that', 8),
('again.', 8),
('with', 8),
('every', 7),
('one', 7),
('you', 7),
('people', 6),
('great', 6),
('country', 6),
('on', 6),
('has', 6),
('back', 6),
('while', 6),
('by', 6),
('no', 6),
('new', 6),
('same', 6),
('president', 5),
('they', 5),
('have', 5),
('across', 5),
('right', 5),
('never', 5),
('at', 5),
('make', 5),
('you.', 4),
('america,', 4),
('world', 4),
('been', 4),
('today', 4),
('or', 4),
('--', 4),
('everyone', 4),
('which', 4),
('as', 4),
('nation', 4),
('other', 4),
('bring', 4),
('now', 3),
('its', 3),
('people.', 3),
('together,', 3),
('these', 3),
('too', 3),
("nation's", 3),
('factories', 3),
('protected', 3),
('there', 3),
('here', 3),
('america.', 3),
('whether', 3),
('millions', 3),
('many', 3),
('an', 3),
('so', 3),
('i', 3),
("we've", 3),
('foreign', 3),
('countries', 3),
('must', 3),
('let', 3),
('do', 3),
('when', 3),
('heart', 3),
('entire', 2),
('americans,', 2),
('thank', 2),
('citizens', 2),
('national', 2),
('face', 2),
('get', 2),
('done.', 2),
('obama', 2),
('very', 2),
('because', 2),
('transferring', 2),
('power', 2),
('party', 2),
('small', 2),
('government', 2),
('share', 2),
('wealth.', 2),
('politicians', 2),
('jobs', 2),
('country.', 2),
('capital,', 2),
('land.', 2),
('moment', 2),
('belongs', 2),
('united', 2),
('states', 2),
('day', 2),
('forgotten', 2),
('men', 2),
('women', 2),
('now.', 2),
('movement', 2),
('before.', 2),
('safe', 2),
('good', 2),
('like', 2),
]
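Notice that split() leaves punctuation attached, so 'again.' and 'america,' are counted separately from 'again' and 'america'. A minimal sketch of a regex tokenizer that merges those variants (this pattern is my own addition, not part of the original code):

import re
# \w+(?:'\w+)? matches runs of word characters plus contractions such as "we've"
# and "nation's", so trailing periods and commas are dropped
words = re.findall(r"\w+(?:'\w+)?", speech_text.lower())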
#Beyond the punctuation issue, the list is clearly dominated by function words that carry little information, so we filter them out by importing a stop word list.
import nltk
nltk.download('stopwords')  # fetch the stop word corpus on first use
from nltk.corpus import stopwords
stop_words = stopwords.words('english')  # note: the corpus name is lowercase 'english'
stop_words  # inspect the imported stop words (only part is pasted, for space):
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their']
for k, v in swd:  # filter the stop words out of swd
    if k not in stop_words:
        print(k, v)
This prints the word-frequency results with the stop words removed.
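Putting the pieces together, here is a minimal end-to-end sketch (assuming the regex tokenizer above and that the NLTK stop word corpus has been downloaded) that counts, filters, and prints the most frequent content words:

import re
from collections import Counter
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))  # a set makes membership tests O(1)
words = re.findall(r"\w+(?:'\w+)?", speech_text.lower())
counts = Counter(w for w in words if w not in stop_words)
for word, n in counts.most_common(20):  # the 20 most frequent content words
    print(word, n)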
