袁家琳
一个完整的大作业
1.选一个自己感兴趣的主题。
2.网络上爬取相关的数据。
3.进行文本分析,生成词云。
4.对文本分析结果解释说明。
5.写一篇完整的博客,附上源代码、数据爬取及分析结果,形成一个可展示的成果。
1.一直都有关注搜狐新闻,这次选取的也是搜狐新闻:http://news.sohu.com

2.网络上爬取的相关数据
import requests from bs4 import BeautifulSoup url = 'http://news.sohu.com/' res = requests.get(url) res.encoding = 'UTF-8' soup = BeautifulSoup(res.text, 'html.parser') for news in soup.select('.list16'): li = news.select('li') if len(li) > 0: title = li[0].text href = li[0].select('a')[0]['href'] print(title, href)
输出后获取相关信息

3.进行文本分析,生成词云
from os import path
from scipy.misc import imread
import jieba
import sys
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
text = open('D:\\sohu.txt').read()
wordlist = jieba.cut(text)
wl_space_split = " ".join(wordlist)
d = path.dirname(__file__)
nana_coloring = imread(path.join(d, "D:\\0.jpg"))
my_wordcloud = WordCloud( background_color = 'white',
mask = nana_coloring,
max_words = 3000,
stopwords = STOPWORDS,
max_font_size = 70,
random_state = 50, )
text_dict = { 'you': 2993, 'and': 6625, 'in': 2767, 'was': 2525, 'the': 7845,}
my_wordcloud = WordCloud().generate_from_frequencies(text_dict)
image_colors = ImageColorGenerator(nana_coloring)
my_wordcloud.recolor(color_func=image_colors)
plt.imshow(my_wordcloud)
plt.axis("off")
plt.show()
my_wordcloud.to_file(path.join(d, "cloudimg.png"))
生成云图

由此可见,搜狐新闻最近的热点词语是万圣节,摇号,选房等
浙公网安备 33010602011771号