
Web Crawler Final Assignment

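1. Choose a topic that interests you.

2. Write a crawler in Python to collect data on that topic from the web.

3. Run text analysis on the crawled data and generate a word cloud.

4. Explain the results of the text analysis.

5. Write a complete blog post describing the implementation, the problems encountered and how they were solved, and the reasoning and conclusions of the data analysis.

6. Submit all of the crawled data together with the source code for the crawler and the data analysis.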

Topic: legal news on People's Daily Online (人民网, legal.people.com.cn)

import requests
import jieba
from bs4 import BeautifulSoup

# Shared output file; the text of every crawled article is appended to it
f = open('content.txt', 'a+', encoding='UTF-8')

def getKeyWords():  # Print the 20 most frequent words in content.txt
    with open('content.txt', 'r', encoding='UTF-8') as f1:
        text = f1.read()
    # Use the public jieba.lcut API and count segmented tokens,
    # not substring matches in the raw text
    worddict = {}
    for word in jieba.lcut(text):
        word = word.strip()
        if len(word) >= 2:  # Skip single characters, punctuation and whitespace
            worddict[word] = worddict.get(word, 0) + 1
    dictlist = list(worddict.items())
    dictlist.sort(key=lambda x: x[1], reverse=True)
    # Slicing avoids an IndexError when fewer than 20 words are found
    for item in dictlist[:20]:
        print(item)


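def getNewDetail(newsUrl):  # Fetch the body text of one news article
    resd = requests.get(newsUrl)
    resd.encoding = 'gbk'
    soupd = BeautifulSoup(resd.text, 'html.parser')
    content = soupd.select('.box_con')[0].text
    f.write(content)  # Append the article text to the file
    f.flush()  # Flush so that getKeyWords() can read the text just written
    getKeyWords()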

def getLiUrl(ListPageUrl):  # Collect the news links on one list page
    res = requests.get(ListPageUrl)
    res.encoding = 'gbk'
    soupn = BeautifulSoup(res.text, 'html.parser')
    for news in soupn.select('.on'):
        atail = news.a.attrs['href']
        a = 'http://legal.people.com.cn/' + atail
        getNewDetail(a)
        break  # Stops after the first link; remove this line to crawl every link on the page

Url = 'http://legal.people.com.cn/'
print('Page 1:')
getLiUrl(Url)
# for i in range(2,6):
#     pageUrl = 'http://legal.people.com.cn/index{}.html#fy01'.format(i)
#     print('Page {}:'.format(i))
#     getLiUrl(pageUrl)
#     break
f.close()  # Close the shared output file once crawling is done
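
The link-building and paging steps in the script can also be sketched with only the standard library. This is a minimal sketch rather than the method used above: `absolute_link` and `page_urls` are hypothetical helper names, the sample article path is made up for illustration, and the `index{}.html` pattern is taken from the commented-out loop and assumed to hold for the later pages. The `#fy01` fragment is left out because a URL fragment is not sent to the server.

```python
from urllib.parse import urljoin

BASE_URL = 'http://legal.people.com.cn/'


def absolute_link(href, base=BASE_URL):
    """Resolve an href found on a list page against the site root.

    urljoin handles relative paths, root-relative paths and
    already-absolute URLs, so no manual string concatenation is needed.
    """
    return urljoin(base, href)


def page_urls(first=2, last=5):
    """Build list-page URLs for pages first..last (pattern is an assumption)."""
    return [urljoin(BASE_URL, 'index{}.html'.format(i))
            for i in range(first, last + 1)]


if __name__ == '__main__':
    # Hypothetical article path, used only to show the join behaviour
    print(absolute_link('/n1/2018/0430/example.html'))
    for url in page_urls():
        print(url)
```

Using `urljoin` instead of `+` avoids a doubled slash when the href starts with `/` and leaves absolute links to other people.com.cn subdomains unchanged.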

Word cloud generated from the content of all the news articles on the first page:

The word cloud shows that the legal news on People's Daily Online reports on and analyzes a large number of cases.

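Problem: installing the word cloud package failed, so I used an online word cloud generator instead. I first saved the content of all the crawled news articles to a single file, then copied the contents of that file into the online tool to generate the word cloud.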
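
When working with an online generator like this, it may help to paste pre-segmented text rather than the raw file, since such a tool may not segment Chinese text on its own. The following is a minimal sketch under that assumption: `keyword_counts` and `paste_ready_text` are hypothetical helper names, the token list stands in for the output of a segmenter such as `jieba.lcut`, and the space-separated format is a guess at what a given online tool accepts.

```python
from collections import Counter


def keyword_counts(tokens, min_len=2, top_n=20):
    """Return the top_n most frequent tokens that have at least min_len characters."""
    words = (t.strip() for t in tokens)
    counts = Counter(w for w in words if len(w) >= min_len)
    return counts.most_common(top_n)


def paste_ready_text(tokens, min_len=2):
    """Join the kept tokens with spaces so a whitespace-based tool can split them."""
    words = (t.strip() for t in tokens)
    return ' '.join(w for w in words if len(w) >= min_len)


if __name__ == '__main__':
    # Stand-in for segmenter output; real input would come from content.txt
    sample = ['法院', '审理', '的', '案件', '，', '法院', '判决', '案件', '法院']
    print(keyword_counts(sample, top_n=3))
    print(paste_ready_text(sample))
```

The frequency list also gives a quick check on the word cloud: the largest words in the image should match the top entries returned by `keyword_counts`.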

posted on 2018-04-30 21:13 187司徒春燕
