（小说目录）为例

需求分析

目标网站：https://www.tadu.com/book/catalogue/891710/
目标内容：小说《没有爸爸也能活》第一章到第十三一章的正文内容。
任务要求：编写两个爬虫，爬虫1从https://www.tadu.com/book/catalogue/891710/获取小说《没有爸爸也能活》第一章到第十三一章的网址，并将网址添加到Redis里名为url_queue的列表中。爬虫2从Redis里名为url_queue的列表中读出网址，进入网址爬取每一章的具体内容，再将内容保存到MongoDB中。
具体步骤：

1.使用正则表达式或者Xpath获取每一章网址，将它们添加到Redis中。

def get_source(self, url, headers):#获取url源代码
        return requests.get(url, headers).content.decode()

def url2redis(self):#将url源代码中的章节url存储在redis数据库中
    source = self.get_source(self.url,self.HEADERS)
    selector = lxml.html.fromstring(source)  # 创建树对象
    url_lst = selector.xpath('//div[@class="chapter clearfix"]/a/@href')
    for url in url_lst:
        url = 'https://www.tadu.com' + url.strip()
        self.client1.lpush('url_queue', url)

　　　　　2.其次从url_queue中一个个弹出章节url，由于章节内容采取异步加载的方式，刚开始使用了selenium，通过webdriver来驱动Chrome浏览器，来解析JS接收数据，获取元素源代码。

def wait(url):#用于获取
    driver  = webdriver.Chrome(r'.\chromedriver.exe')#用chromedriver.exe驱动Chrome浏览器解析源代码中JS部分
    driver.get(url) #连接网页
    try:
        WebDriverWait(driver,30).until(EC.presence_of_element_located((By.CLASS_NAME,"read_details"))) #等待页面加载
    except Exception as _:
        print('网页加载太慢。')
    return driver.page_source #返回网页源代码

　　3.最后把这些章节内容源代码，通过Xpath来获取小说标题和内容，将其加入字典列表里，用于插入MongoDB数据库

 def article2mongodb(self):#将各个url中文章内容传入MongoDB数据库中
        while self.client1.llen('url_queue')>0:
            url = self.client1.lpop('url_queue').decode()
            html = wait(url)
            selector = lxml.html.fromstring(html)
            chapter_name = selector.xpath('//div[@class="clearfix"]/h4/text()')[0]
            content = selector.xpath('//div[@id="partContent"]/p/text()')
            self.content_lst.append({'title':chapter_name,'content':content})
        self.handler.insert_many(self.content_lst)

　　5.源码：

# Author:CK
# -*- coding = utf-8 -*-
# @Time :2022/3/24 17:21
# @Author:ck
# @File :get_article.py
# @Software: PyCharm
from selenium import webdriver#浏览器驱动模块
from selenium.webdriver.support.ui import WebDriverWait#浏览器请求等待模块
from selenium.webdriver.common.by import By#锁定元素模块
from selenium.webdriver.support import expected_conditions as EC#期待模块
import requests#请求模块
import lxml.html#源代码解析模块
import redis #redis数据库
from  pymongo import MongoClient #MongoDB数据库

def wait(url):#用于获取
    driver  = webdriver.Chrome(r'.\chromedriver.exe')#用chromedriver.exe驱动Chrome浏览器解析源代码中JS部分
    driver.get(url) #连接网页
    try:
        WebDriverWait(driver,30).until(EC.presence_of_element_located((By.CLASS_NAME,"read_details"))) #等待页面加载
    except Exception as _:
        print('网页加载太慢。')
    return driver.page_source #返回网页源代码

class get_article(object):#获取文章内容类
    HEADERS = {
        'User - Agent': 'Mozilla / 5.0(WindowsNT10.0;WOW64) AppleWebKit / 537.36(KHTML, likeGecko) Chrome / 99.0.4844.51Safari / 537.36'
    }#头部
    def __init__(self, url):
        self.url = url
        self.content_lst = []
        self.client1 = redis.StrictRedis()#连接redis数据库
        self.handler = db['article']#连接数据库集合
        self.url2redis()
        self.article2mongodb()

    def get_source(self, url, headers):#获取url源代码
        return requests.get(url, headers).content.decode()


    def url2redis(self):#将url源代码中的章节url存储在redis数据库中
        source = self.get_source(self.url,self.HEADERS)
        selector = lxml.html.fromstring(source)  # 创建树对象
        url_lst = selector.xpath('//div[@class="chapter clearfix"]/a/@href')
        for url in url_lst:
            url = 'https://www.tadu.com' + url.strip()
            self.client1.lpush('url_queue', url)

    def article2mongodb(self):#将各个url中文章内容传入MongoDB数据库中
        while self.client1.llen('url_queue')>0:
            url = self.client1.lpop('url_queue').decode()
            html = wait(url)
            selector = lxml.html.fromstring(html)
            chapter_name = selector.xpath('//div[@class="clearfix"]/h4/text()')[0]
            content = selector.xpath('//div[@id="partContent"]/p/text()')
            self.content_lst.append({'title':chapter_name,'content':content})
        self.handler.insert_many(self.content_lst)
if __name__ =='__main__':
    client0 =MongoClient()
    db = client0['spicer']#创建数据库
    article = get_article('https://www.tadu.com/book/catalogue/891710')

　　6.但实际运行，由于每次爬取一个章节内容都会运行一次Chrome解析器，所耗用内存过大，要想解决这个问题，得想个其它思路。由于异步加载，我们可以在浏览器开发者工具中找到每个章节的数据文件。通过对每个数据文件内容、url、请求方式的解析，从而只需要简单爬虫模式即可爬取。思路仅供参考，具体源码实现自考。

posted @ 2022-03-25 14:18 chanxe 阅读(819) 评论(0) 收藏举报

刷新页面返回顶部

chanxe

小说爬虫——以https://www.tadu.com/book/catalogue/891710/（小说目录）为例

公告