Web Crawler Final Project

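1. Pick a topic you are interested in.

2. Write a crawler in Python to collect data on that topic from the web.

3. Run text analysis on the crawled data and generate a word cloud.

4. Explain and interpret the results of the text analysis.

5. Write a complete blog post describing the implementation, the problems encountered and how they were solved, and the data analysis approach and conclusions.

6. Finally, submit all of the crawled data together with the crawler and data analysis source code.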

I enjoy reading romance novels, so I chose the novel site 红袖添香 (hongxiu.com) as the site to crawl.

# -*- coding: UTF-8 -*-
# -*- author: yjw -*-
from bs4 import BeautifulSoup
import requests

def save():
    # Append every book in the global hongxiulist to a text file.
    file_name = '红袖网'
    # Explicit UTF-8 so the Chinese text is written correctly on any platform.
    with open(file_name + '.txt', 'a', encoding='utf-8') as file:
        for num, book in enumerate(hongxiulist, start=1):
            file.write('\n')
            file.write('#' + str(num) + '. ' + book.title)
            file.write('\n')
            file.write('Title: {0}\nAuthor: {1}\nGenre: {2}\nStatus: {3}\nWord count: {4}\nDescription: {5}\n\nCover image: {6}\nBook URL: {7}\n'.format(book.title, book.author, book.style, book.state, book.wordcount, book.abstract, book.imgurl, book.bookurl))
            file.write('-*' * 100)
            file.write('\n')
# Define a class that stores one book's data. Python calls __init__
# automatically whenever an instance of the class is created.
class Info(object):
    def __init__(self, title, author, style, state, wordcount, abstract, imgurl, bookurl):
        self.title = title
        self.author = author
        self.style = style
        self.state = state
        self.wordcount = wordcount
        self.abstract = abstract
        self.imgurl = imgurl
        self.bookurl = bookurl

# Collect the books from every page in one list (created once, not per page).
hongxiulist = []
for i in range(1, 11):
    res = requests.get('https://www.hongxiu.com/all?pageSize=10&gender=2&catId=-1&isFinish=-1&isVip=-1&size=-1&updT=-1&orderBy=0&pageNum=' + str(i))
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    hongxiu = soup.find('div', class_='right-book-list')
    # Named 'item' rather than 'list' so the built-in list type is not shadowed.
    for item in hongxiu.find_all('li'):
        listinfo = item.find('div', class_='book-info')
        listinfo_href = listinfo.find('a')
        title = listinfo_href.text
        # get_text(strip=True) also works when a tag contains nested markup,
        # where .string would return None.
        author = listinfo.find(class_='default').get_text(strip=True)
        style = listinfo.find(class_='org').get_text(strip=True)
        state = listinfo.find(class_='pink').get_text(strip=True)
        wordcount = listinfo.find(class_='blue').get_text(strip=True)
        abstract = listinfo.find(class_='intro').get_text(strip=True)
        img = item.find('div', class_='book-img')
        # The page gives relative paths, so prepend the scheme and host.
        imgurl = 'https:' + img.find('img')['src'].strip()
        bookurl = 'https://www.hongxiu.com' + listinfo_href['href'].strip()
        booklist = Info(title, author, style, state, wordcount, abstract, imgurl, bookurl)
        hongxiulist.append(booklist)
        # Print each book once, right after it is parsed.
        print('-*' * 100)
        print('Title: {0}\nAuthor: {1}\nGenre: {2}\nStatus: {3}\nWord count: {4}\nDescription: {5}\n\nCover image: {6}\nBook URL: {7}\n'.format(
            booklist.title, booklist.author, booklist.style, booklist.state,
            booklist.wordcount, booklist.abstract, booklist.imgurl, booklist.bookurl))

# Write the file once, after all ten pages have been crawled.
save()

The script crawls the first ten pages of the all-categories book listing.
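As a side note, the ten page URLs could also be built from a parameter dictionary instead of string concatenation. The sketch below is only an illustration: `build_page_url` and `LISTING_PARAMS` are names I made up, and the parameter values are simply copied from the query string used in the crawler.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.hongxiu.com/all'

# Same filter values as the query string in the crawler (copied, not verified
# against the site's current behaviour).
LISTING_PARAMS = {
    'pageSize': 10, 'gender': 2, 'catId': -1, 'isFinish': -1,
    'isVip': -1, 'size': -1, 'updT': -1, 'orderBy': 0,
}

def build_page_url(page_num):
    """Return the listing URL for one page (hypothetical helper)."""
    params = dict(LISTING_PARAMS, pageNum=page_num)
    return BASE_URL + '?' + urlencode(params)

# range(1, 11) yields 1..10, i.e. exactly ten pages.
page_urls = [build_page_url(n) for n in range(1, 11)]
print(len(page_urls))
print(page_urls[0])
```

Keeping the filters in one dictionary makes it easier to change a single filter, such as the gender or category, without editing a long URL by hand.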

Screenshot of the results:

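I ran into a small problem at first: the links seen in the browser's developer tools, both the href of the a tag (the book URL) and the image URL, were relative paths. I looked up the real addresses in the Network panel and then prepended the missing prefix in the code.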
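Another way to handle relative links, shown here only as a sketch, is to let the standard library's `urljoin` resolve them against the page URL instead of hard-coding the prefix. The helper name `absolutize` and the sample links are made up for illustration and are not taken from the site.

```python
from urllib.parse import urljoin

# The page the links were found on; any listing page URL works as the base.
PAGE_URL = 'https://www.hongxiu.com/all?pageNum=1'

def absolutize(link, page_url=PAGE_URL):
    """Resolve a link found on the page against the page URL (hypothetical helper)."""
    return urljoin(page_url, link.strip())

# Sample links are invented for illustration only.
book_url = absolutize('/book/12345')               # root-relative path
img_url = absolutize('//img.example.com/c.jpg')    # protocol-relative URL
same_url = absolutize('https://www.hongxiu.com/rank')  # already absolute

print(book_url)
print(img_url)
print(same_url)
```

`urljoin` covers root-relative paths, protocol-relative URLs that start with `//`, and links that are already absolute, so the same call works for both the book link and the cover image.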

At first I also did not know exactly how the __init__ method and find_all work. I asked classmates and searched online to sort it out.
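For anyone stuck on the same point, here is a minimal sketch of what `__init__` does, using a cut-down class with only two fields. `Book` and `BookRecord` are illustrative names, not part of the crawler. (As for the other half of the question: in BeautifulSoup, `find` returns the first matching tag or `None`, while `find_all` returns a list of all matches.)

```python
from dataclasses import dataclass

class Book:
    def __init__(self, title, author):
        # Python calls this automatically when Book(...) is evaluated;
        # self is the instance that was just created.
        self.title = title
        self.author = author

# A dataclass generates an equivalent __init__ from the field list.
@dataclass
class BookRecord:
    title: str
    author: str

b = Book('Sample Title', 'Sample Author')
r = BookRecord('Sample Title', 'Sample Author')
print(b.title, b.author)
print(r)
```

The dataclass version is just an alternative way to write the same kind of data-holder; it also gives a readable `repr` for free, which is handy when printing crawled records.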

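The data suggests that many romance novels use the domineering CEO (霸道总裁) as their main theme, paired with motifs such as deep devotion and marriage to draw readers in. Perhaps for that very reason, the market for domineering CEO stories now looks close to saturated, and readers' tastes are gradually shifting, so this theme may no longer dominate in the future.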
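A rough sketch of how such a conclusion could be backed by numbers is to count how often each term appears in the crawled titles and descriptions. The code below is an assumption-laden illustration, not the analysis actually run for this post: `top_terms` is a made-up helper, the sample strings are invented and already split by spaces, and a real run would pass a Chinese tokenizer such as `jieba.lcut` together with the crawled descriptions.

```python
from collections import Counter

def top_terms(texts, tokenize, stopwords=frozenset(), n=10):
    """Count tokens across all texts and return the n most common (hypothetical helper)."""
    counts = Counter()
    for text in texts:
        for token in tokenize(text):
            token = token.strip()
            # Skip blanks, stopwords and single characters (mostly particles).
            if len(token) < 2 or token in stopwords:
                continue
            counts[token] += 1
    return counts.most_common(n)

# Invented, pre-segmented sample descriptions for illustration only.
sample = ['霸道 总裁 的 契约 婚姻', '总裁 情深 似 海', '重生 之 霸道 总裁']
result = top_terms(sample, tokenize=str.split, n=2)
print(result)
```

The resulting frequencies, converted to a dictionary, are the kind of input that the `wordcloud` package's `WordCloud.generate_from_frequencies` accepts; for Chinese text a `font_path` pointing at a font with CJK glyphs is needed so the words render.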

posted @ 2018-04-30 20:26 150颜杰文
