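A Complete Course Project

1. Pick a topic that interests you.

2. Crawl related data from the web.

3. Run a text analysis and generate a word cloud.

4. Explain and interpret the results of the text analysis.

5. Write a complete blog post that includes the source code, the crawled data, and the analysis results, so that it forms a presentable piece of work.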

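1. Choose a topic of interest.

I crawl Sina News. Site: http://news.sina.com.cn/china/

The goal is to scrape the headline, source, and publication time of a news article from this site.

2. Crawl the data from the web. Full code: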

from bs4 import BeautifulSoup
import requests
from datetime import datetime
import json
import re

news_url = 'http://news.sina.com.cn/c/nd/2017-05-08/doc-ifyeycfp9368908.shtml'
web_data = requests.get(news_url)
web_data.encoding = 'utf-8'
soup = BeautifulSoup(web_data.text, 'lxml')

# Headline
title = soup.select('#artibodyTitle')[0].text
print(title)

# Publication time: the first text node inside .time-source
time_str = soup.select('.time-source')[0].contents[0].strip()
dt = datetime.strptime(time_str, '%Y年%m月%d日%H:%M')
print(dt)

# News source
source = soup.select('.time-source span span a')[0].text
print(source)

# Article body: every paragraph except the last one
print('\n'.join([p.text.strip() for p in soup.select('#artibody p')[:-1]]))

# Editor: remove the literal label; str.lstrip() strips a set of characters, not a prefix
editor = soup.select('.article-editor')[0].text.strip().replace('责任编辑：', '', 1)
print(editor)

# Comment count: the response is expected to look like 'var data={...}'
comments = requests.get('http://comment5.news.sina.com.cn/page/info?version=1&format=js&channel=gn&newsid=comos-fyeycfp9368908&group=&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=20')
comments_json = comments.text.strip()
if comments_json.startswith('var data='):
    comments_json = comments_json[len('var data='):]
comments_total = json.loads(comments_json)
print(comments_total['result']['count']['total'])

# News ID: the part of the URL between 'doc-i' and '.shtml'
news_id = re.search(r'doc-i(.+)\.shtml', news_url)
print(news_id.group(1))

Output:
国土部:10月到11月实行汛期地质灾害日报告制度
2017-10-08 17:21:00
央视新闻
原标题：国土资源部：地质灾害高发期实行日报告制度
国土资源部消息，10月份将逐渐进入地质灾害的高发期，防灾减灾形势更加严峻。据中国气象局预计，10月份我国江南大部、华南东部、西北地区大部降水较常年同期偏多，应加强防范极端气象事件诱发的滑坡、泥石流等地质灾害。对此从10月起至11月，国土资源部应急办实行汛期地质灾害日报告制度，各地必须将每天发生的灾情险情及其重大工作部署于当天下午3点前报告国土资源部应急办。
李伟山

3. Text analysis.

(1) First, scrape the headline:

from bs4 import BeautifulSoup
import requests
from datetime import datetime
import json
import re

news_url = 'http://news.sina.com.cn/c/nd/2017-05-08/doc-ifyeycfp9368908.shtml'
web_data = requests.get(news_url)
web_data.encoding = 'utf-8'
soup = BeautifulSoup(web_data.text,'lxml')
title = soup.select('#artibodyTitle')[0].text
print(title)

(2) Scrape the publication time and convert the original date format into a standard one:

# time = soup.select('.time-source')[0]
# print(time)
time = soup.select('.time-source')[0].contents[0].strip()
dt = datetime.strptime(time, '%Y年%m月%d日%H:%M')
print(dt)
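
To see the conversion in isolation, here is a minimal sketch that parses a timestamp in the same pattern without touching the network. The sample string is an assumption reconstructed from the format string and the printed result, not text copied from the page.

```python
from datetime import datetime

# Assumed sample: a timestamp in the pattern the scraper expects.
raw = '2017年10月08日17:21'

# The literal characters 年, 月 and 日 in the format must match the input exactly.
dt = datetime.strptime(raw, '%Y年%m月%d日%H:%M')

print(dt)                       # 2017-10-08 17:21:00
print(dt.strftime('%Y-%m-%d'))  # 2017-10-08
print(dt.isoformat())           # 2017-10-08T17:21:00
```

Once the value is a datetime object, strftime or isoformat can produce whichever standard format the later analysis needs.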

(3) Scrape the news source:

# source = soup.select('#navtimeSource > span > span > a')[0].text
source = soup.select('.time-source span span a')[0].text
print(source)

(4) Scrape the article body:

article = []
for p in soup.select('#artibody p')[:-1]:
    article.append(p.text.strip())
# print(article)
print('\n'.join(article))

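(5) Scrape the editor:

# Remove the literal label; str.lstrip() strips a set of characters, not a prefix
editor = soup.select('.article-editor')[0].text.strip().replace('责任编辑：', '', 1)
print(editor)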
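
A short sketch of why removing the label as a whole prefix is safer than calling lstrip with the label text. The helper name strip_label and the editor name in the sample are hypothetical, chosen only to show the failure case.

```python
def strip_label(text, label='责任编辑：'):
    """Remove a leading label only when it is present as a whole prefix."""
    text = text.strip()
    if text.startswith(label):
        return text[len(label):].strip()
    return text

# Hypothetical editor name that begins with a character also found in the label.
sample = '责任编辑：任小明'

print(sample.lstrip('责任编辑：'))  # 小明 (the surname 任 is stripped as well)
print(strip_label(sample))          # 任小明
```

lstrip treats its argument as a set of characters and keeps removing leading characters while they belong to that set, so any name starting with 责, 任, 编 or 辑 would be truncated.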

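(6) Scrape the comment count:

# comments = soup.select('#commentCount1')
# print(comments)
comments = requests.get('http://comment5.news.sina.com.cn/page/info?version=1&format=js&channel=gn&newsid=comos-fyeycfp9368908&group=&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=20')
# print(comments.text)
# The response is expected to look like 'var data={...}'; drop the prefix before parsing.
# str.strip('var data=') would strip a set of characters rather than this prefix.
comments_json = comments.text.strip()
if comments_json.startswith('var data='):
    comments_json = comments_json[len('var data='):]
comments_total = json.loads(comments_json)
# print(comments_total)
print(comments_total['result']['count']['total'])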
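
The parsing step can be tried offline with a small sketch. The payload below is made up and mimics only the keys read above; the count of 12 is invented, and parse_js_assignment is a hypothetical helper name.

```python
import json

def parse_js_assignment(text, prefix='var data='):
    """Turn a 'var data={...}' JavaScript snippet into a Python object."""
    text = text.strip()
    if text.startswith(prefix):
        text = text[len(prefix):]
    # Tolerate a trailing semicolon in case the snippet ends with one.
    return json.loads(text.rstrip(';'))

# Invented payload with the same nesting as the fields used above.
sample = 'var data={"result": {"count": {"total": 12}}}'
data = parse_js_assignment(sample)
print(data['result']['count']['total'])  # 12

# A defensive lookup that falls back to 0 when a key is missing.
total = data.get('result', {}).get('count', {}).get('total', 0)
print(total)  # 12
```

Chaining get calls avoids a KeyError when an article has no comment data, at the cost of hiding a changed response layout behind a silent 0.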

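(7) Extract the news ID:

# Slice off the fixed 'doc-i' prefix and '.shtml' suffix from the file name.
# rstrip/lstrip would strip sets of characters and could eat part of the ID.
file_name = news_url.split('/')[-1]
print(file_name[len('doc-i'):-len('.shtml')])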
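
The news ID also appears in the comment URL used in step (6), after the comos- prefix. As a sketch, the two steps can be linked so that the comment URL is derived from the article URL. The function names are hypothetical, and it is an assumption that other articles follow the same URL pattern and use the same channel=gn parameter.

```python
import re

# Template copied from the comment URL above, with the news ID left as a placeholder.
COMMENT_URL = ('http://comment5.news.sina.com.cn/page/info?version=1&format=js'
               '&channel=gn&newsid=comos-{}&group=&compress=0'
               '&ie=utf-8&oe=utf-8&page=1&page_size=20')

def get_news_id(news_url):
    """Return the part of the URL between 'doc-i' and '.shtml', or None."""
    match = re.search(r'doc-i(.+)\.shtml', news_url)
    return match.group(1) if match else None

def build_comment_url(news_url):
    """Build the comment-count URL for an article URL (assumed pattern)."""
    news_id = get_news_id(news_url)
    if news_id is None:
        raise ValueError('no news ID found in ' + news_url)
    return COMMENT_URL.format(news_id)

news_url = 'http://news.sina.com.cn/c/nd/2017-05-08/doc-ifyeycfp9368908.shtml'
print(get_news_id(news_url))  # fyeycfp9368908
print(build_comment_url(news_url))
```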

4. Generate the word cloud.
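
One possible starting point for this step is a term-frequency table, since a word cloud is essentially a picture of such a table. The sketch below uses only the standard library and counts overlapping two-character windows as a crude stand-in for a real Chinese tokenizer such as jieba. The function names are hypothetical, and the sample text is one sentence from the article body printed above.

```python
import re
from collections import Counter

def cjk_bigrams(text):
    """Split text into overlapping 2-character windows over each run of CJK characters."""
    tokens = []
    for run in re.findall(r'[\u4e00-\u9fff]+', text):
        tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens

def term_frequencies(text):
    """Count how often each bigram occurs in the text."""
    return Counter(cjk_bigrams(text))

sample = ('国土资源部应急办实行汛期地质灾害日报告制度，'
          '各地必须将每天发生的灾情险情及其重大工作部署于'
          '当天下午3点前报告国土资源部应急办。')

freqs = term_frequencies(sample)
# Show the five most frequent bigrams with their counts.
for term, count in freqs.most_common(5):
    print(term, count)
```

With the third-party wordcloud package installed, a frequency dictionary like this can be passed to WordCloud.generate_from_frequencies. For Chinese text, a font_path pointing at a font that contains Chinese glyphs is needed for the characters to render.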

Posted on 2017-10-29 11:46 by 俊礼
