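A complete course project
1. Pick a topic that interests you.
2. Crawl related data from the web.
3. Run text analysis on the data and generate a word cloud.
4. Explain and interpret the analysis results.
5. Write a complete blog post that includes the source code, the crawled data, and the analysis results, so that it forms a presentable piece of work.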
The site chosen is "http://www.4399.com/flash/".
Open the page source and find the relevant CSS classes and the fields to extract.
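One way to narrow down candidate classes without reading the whole page source by eye is to count how often each class name appears. The sketch below uses only the standard library; `ClassCounter` and the sample HTML are hypothetical and stand in for the real page, whose markup may differ.

```python
from collections import Counter
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count how often each CSS class name appears in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names
                self.counts.update(value.split())


# Hypothetical sample markup; the real page source will differ
sample_html = """
<ul class="n-game cf">
  <li><a href="/flash/1.htm">Game A</a><em>Puzzle</em></li>
  <li><a href="/flash/2.htm">Game B</a><em>Action</em></li>
</ul>
<div class="footer cf"></div>
"""

counter = ClassCounter()
counter.feed(sample_html)
class_counts = dict(counter.counts)
print(class_counts)
```

On a real page, `sample_html` would be replaced by the downloaded source, and the class names that appear once per list of interest are the ones worth inspecting by hand.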
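Crawling the data
import requests
from bs4 import BeautifulSoup

def get(url):
    # Download the page and decode it as GB2312
    res = requests.get(url)
    res.encoding = 'gb2312'
    soup = BeautifulSoup(res.text, 'html.parser')
    # Take the first element with class "n-game" as the game list
    zx = soup.select('.n-game')[0]
    for games in zx:
        try:
            title = games.select('a')[0].text
            href = games.select('a')[0]['href']
            game_type = games.select('em')[0].text
            print(title, game_type)
        except Exception:
            # Skip children that are plain text or lack the expected tags
            pass

gameurl = 'http://www.4399.com/flash/'
get(gameurl)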
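The scraper above only prints its results, while the analysis step reads a text file. A minimal sketch of how the scraped (title, type) pairs might be written to disk follows; `save_games`, the sample records, and the temporary output path are assumptions for illustration, not the author's code.

```python
import os
import tempfile


def save_games(records, filepath):
    """Write one "title type" line per game and return the number of lines written."""
    with open(filepath, "w", encoding="utf-8") as f:
        for title, game_type in records:
            f.write(f"{title} {game_type}\n")
    return len(records)


# Hypothetical records, standing in for what the scraper would collect
records = [("Game A", "Puzzle"), ("Game B", "Action")]
out_path = os.path.join(tempfile.gettempdir(), "zx_demo.txt")
written = save_games(records, out_path)

# Read the file back to confirm what was stored
with open(out_path, encoding="utf-8") as f:
    saved_text = f.read()
print(written, repr(saved_text))
```

In the scraper, each pair would be appended to a list inside the loop instead of being printed, and the analysis step would need to open the file with the same encoding.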
Analysis and word cloud generation
from os import path
import jieba
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Read the scraped text and segment it into words with jieba
text = open('D:\\zx.txt').read()
wordlist = jieba.cut(text)
wl_space_split = " ".join(wordlist)

d = path.dirname(__file__)
# Load the mask image as an array (scipy.misc.imread is no longer available in SciPy)
nana_coloring = np.array(Image.open("D:\\04.jpg"))

# Build the cloud from the segmented text, shaped by the mask image.
# To render Chinese words, also pass font_path pointing to a font with CJK glyphs.
my_wordcloud = WordCloud(
    background_color='white',
    mask=nana_coloring,
    max_words=4000,
    stopwords=STOPWORDS,
    max_font_size=90,
    random_state=20,
).generate(wl_space_split)

# Recolor the words using the colors of the mask image
image_colors = ImageColorGenerator(nana_coloring)
my_wordcloud.recolor(color_func=image_colors)

plt.imshow(my_wordcloud)
plt.axis("off")
plt.show()
my_wordcloud.to_file(path.join(d, "cloudimg.png"))
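The word cloud shows that Halloween-themed games have been heavily promoted recently, and that among game genres, puzzle games are the most popular.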
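A reading like this could be cross-checked against raw counts. The sketch below assumes the text has already been segmented into a list of tokens (for example by `jieba.cut`); `top_words` and the sample tokens are hypothetical, so the numbers illustrate the method only, not the real data.

```python
from collections import Counter


def top_words(tokens, n=3, min_len=2):
    """Return the n most frequent tokens, ignoring whitespace and very short tokens."""
    cleaned = [t.strip() for t in tokens]
    counts = Counter(t for t in cleaned if len(t) >= min_len)
    return counts.most_common(n)


# Hypothetical tokens, standing in for the output of jieba.cut on the scraped text
# (万圣节 = Halloween, 益智 = puzzle, 动作 = action)
tokens = ["万圣节", "益智", "万圣节", "的", "益智", "万圣节", " ", "动作"]
ranking = top_words(tokens)
print(ranking)
```

A dictionary of such counts is also the kind of input that `WordCloud.generate_from_frequencies` accepts, which makes it possible to filter or reweight words before drawing the cloud.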