A Complete Final Project



 

1. Pick a topic you are interested in.

2. Crawl the related data from the web.

3. Perform text analysis and generate a word cloud.

4. Explain and interpret the text analysis results.

5. Write a complete blog post, attach the source code, the crawled data, and the analysis results, and turn it into a presentable deliverable.


 

1. Pick content of interest: http://bbs.hh010.com/forum-162-1.html

2. Crawl the related data

import requests
from bs4 import BeautifulSoup
import jieba
from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt
from os import path
from scipy.misc import imread  # removed in SciPy 1.2+; imageio.imread is the usual replacement

url = "http://bbs.hh010.com/forum-162-1.html"
result = requests.get(url)
sp = BeautifulSoup(result.text, 'html.parser')
tag = sp.find_all('a', class_="s xst")  # thread-title links on the forum list page
f = open("C:\\Users\\LiuJiXuan\\Desktop\\test.txt", "w+", encoding="utf-8")
for tmp in tag:
    f.write(tmp.string + "\n")          # thread title
    f.write(tmp['href'] + "\n")         # thread link
    f.write("---------------------------------------------------------------" + "\n\n")
f.close()
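
The script above only grabs the first list page. The URL ends in forum-162-1.html, which looks like a page number, so later pages can presumably be reached by changing that trailing digit. A minimal sketch under that assumption (the output file name test_all_pages.txt is made up for illustration):

import requests
from bs4 import BeautifulSoup

# Crawl several list pages, assuming they follow the pattern forum-162-N.html.
titles = []
for page in range(1, 4):  # pages 1..3 as an example
    url = "http://bbs.hh010.com/forum-162-%d.html" % page
    sp = BeautifulSoup(requests.get(url).text, 'html.parser')
    for a in sp.find_all('a', class_="s xst"):  # same selector as in the main script
        if a.string:
            titles.append(a.string)

with open("test_all_pages.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(titles))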

 


3. Perform text analysis and generate the word cloud.

d = path.dirname(__file__)
jieba.load_userdict("jieba_dict.txt")  # load the custom dictionary
file = open("C:\\Users\\LiuJiXuan\\Desktop\\test.txt", "r", encoding="utf-8").readlines()
text = ""
for s in file:
    text += " ".join(jieba.cut(s)) + " "  # segment each line and join the tokens with spaces

background = imread(path.join(d, "./1.png"))  # mask image that gives the cloud its shape
wc = WordCloud(font_path='./kaiti.ttf',       # font (needed for Chinese characters)
               background_color="white",      # background color
               max_words=2000,                # maximum number of words shown
               mask=background,               # shape mask
               max_font_size=720,             # maximum font size
               random_state=42,
               )
wc.generate(text)
image_colors = ImageColorGenerator(background)  # color scheme sampled from the mask image
plt.imshow(wc)
plt.axis("off")
plt.show()
wc.to_file(path.join(d, "wordcloud.png"))
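
Note that image_colors is created from the mask but never applied, so the cloud keeps WordCloud's default colors. If you want the words to take their colors from the mask image, wordcloud's recolor method does that; a minimal sketch reusing the wc and image_colors objects from above (the output file name is made up):

wc.recolor(color_func=image_colors)  # repaint the words with colors sampled from the mask
plt.imshow(wc)
plt.axis("off")
plt.show()
wc.to_file(path.join(d, "wordcloud_colored.png"))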

 

4. Text analysis (the crawler first writes the article titles and links into test.txt; the jieba module then analyzes the text, loading both the default and a custom dictionary to segment it; a short segmentation sketch follows below)
test.txt:
 
Custom dictionary (jieba_dict.txt):
 
 
 
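To make the dictionary's effect concrete, here is a minimal sketch of how jieba segments a title before and after loading the user dictionary; the sample title is made up for illustration, and the effect depends on what entries jieba_dict.txt actually contains:

import jieba

sample = "呼伦贝尔大草原自驾游攻略"      # hypothetical thread title
print("/".join(jieba.cut(sample)))        # segmentation with the default dictionary only

# jieba_dict.txt holds one word per line (optionally "word freq pos");
# words listed there are kept as single tokens after loading.
jieba.load_userdict("jieba_dict.txt")
print("/".join(jieba.cut(sample)))        # segmentation with the custom entries applied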

5. Generate the word cloud

 
 
Complete source code:
import requests
from bs4 import BeautifulSoup
import jieba
from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt
from os import path
from scipy.misc import imread  # removed in SciPy 1.2+; imageio.imread is the usual replacement

url = "http://bbs.hh010.com/forum-162-1.html"
result = requests.get(url)
sp = BeautifulSoup(result.text, 'html.parser')
tag = sp.find_all('a', class_="s xst")
f = open("C:\\Users\\LiuJiXuan\\Desktop\\test.txt", "w+", encoding="utf-8")
for tmp in tag:
    f.write(tmp.string + "\n")
#    f.write(tmp['href'] + "\n")
#    f.write("---------------------------------------------------------------" + "\n\n")
f.close()
d = path.dirname(__file__)
jieba.load_userdict("jieba_dict.txt")
file = open("C:\\Users\\LiuJiXuan\\Desktop\\test.txt", "r", encoding="utf-8").readlines()
text = ""
for s in file:
    text += " ".join(jieba.cut(s)) + " "

background = imread(path.join(d, "./1.png"))
wc = WordCloud(font_path='./kaiti.ttf',
               background_color="white",
               max_words=2000,
               mask=background,
               max_font_size=720,
               random_state=42,
               )
wc.generate(text)
image_colors = ImageColorGenerator(background)
plt.imshow(wc)
plt.axis("off")
plt.show()
wc.to_file(path.join(d, "wordcloud.png"))
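
To reproduce the result, the script expects jieba_dict.txt, the mask image 1.png, and the font file kaiti.ttf to sit next to it, and it needs the third-party packages it imports, e.g. pip install requests beautifulsoup4 jieba wordcloud matplotlib, plus an older SciPy (before 1.2) if you keep scipy.misc.imread.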

 

 
posted @ 2017-10-24 17:32  刘己铉