Python Crawler #019: Scrapy's CrawlSpider

CrawlSpider lets you define link rules: any link that satisfies a rule is requested automatically. This makes CrawlSpider well suited to crawling an entire site, and it is a very powerful tool.


1. Introduction to CrawlSpider

  • With a plain Spider, scraping multiple pages means writing and chaining callbacks by hand. A CrawlSpider instead uses a regular expression to define a URL rule: any link on a page that satisfies the rule gets crawled. For example:

    https://www.bilibili.com/video/BV1kx411h7jv?p=1
    https://www.bilibili.com/video/BV1kx411h7jv?p=2
    https://www.bilibili.com/video/BV1kx411h7jv?p=3
    ......
    Rule: https://www.bilibili.com/video/BV1kx411h7jv?p=\d

  • CrawlSpider is a subclass of Spider. Spider's design principle is to crawl only the pages in its start_urls list, while CrawlSpider defines rules (Rule objects) that provide a convenient mechanism for following links: it extracts links from the pages it crawls and keeps crawling them.
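The page-number rule above is just a regular expression, so its behavior can be checked with Python's standard re module before handing it to a spider (plain re here, not Scrapy itself):

```python
import re

# The page URLs differ only in the value of the "p" query parameter,
# so one pattern with \d+ covers every page.
pattern = re.compile(r"https://www\.bilibili\.com/video/BV1kx411h7jv\?p=\d+")

urls = [
    "https://www.bilibili.com/video/BV1kx411h7jv?p=1",
    "https://www.bilibili.com/video/BV1kx411h7jv?p=2",
    "https://www.bilibili.com/video/BV1kx411h7jv?p=30",
    "https://www.bilibili.com/video/BV1kx411h7jv",       # no page number: no match
]

matched = [u for u in urls if pattern.fullmatch(u)]
print(matched)  # the first three URLs match
```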



2. Creating a CrawlSpider

  • Create the project (in cmd): change into the directory where the project should live, then run scrapy startproject [project name]

  • Create the spider (in cmd): change into the project directory, then run scrapy genspider -t crawl [spider name] [spider domain]

    -t crawl: the generated spider file is based on the CrawlSpider class rather than the plain Spider base class

  • Example:
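For this chapter's case study, the two commands might look like the following (the project name wxapp and spider name wxapp_spider are taken from the code later in this post):

```shell
# create the project in the current directory
scrapy startproject wxapp

# generate a CrawlSpider-based spider inside the project
cd wxapp
scrapy genspider -t crawl wxapp_spider wxapp-union.com
```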



3. CrawlSpider Parameters

  • LinkExtractor (link extractor): extracts links

    • allow: URLs matching this regular expression are extracted; if empty, every URL matches.

    • deny: URLs matching this regular expression (or list of regular expressions) are never extracted — the opposite of allow.

    • allow_domains: only links under these domains are extracted.

    • deny_domains: links under these domains are never extracted.

    • restrict_xpaths: an XPath expression that works together with allow to narrow down which links are extracted.

  • rules (rule parser): a collection of Rule objects used to match the target pages and filter out noise

    • LinkExtractor(...): the link extractor for this rule

    • callback="parse_item": the callback function that parses matched pages

    • follow=True: whether to keep following, i.e. whether the link extractor is applied again to the pages reached through the links it extracted.

      When callback is None, follow defaults to True.
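The allow/deny semantics can be sketched in plain Python (this is only an illustration of the filtering logic, not Scrapy's actual implementation; the helper name filter_links is made up):

```python
import re

def filter_links(urls, allow=None, deny=None):
    """Keep URLs that match at least one allow pattern and no deny pattern.
    An empty/None allow matches everything, mirroring LinkExtractor."""
    kept = []
    for url in urls:
        if allow and not any(re.search(p, url) for p in allow):
            continue  # fails the allow list
        if deny and any(re.search(p, url) for p in deny):
            continue  # hits the deny list
        kept.append(url)
    return kept

urls = [
    "http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=3",
    "http://www.wxapp-union.com/article-5822-1.html",
    "http://www.wxapp-union.com/member.php?mod=logging",
]
print(filter_links(urls, allow=[r"article-.+\.html"]))  # only the article URL
print(filter_links(urls, deny=[r"member\.php"]))        # everything but member.php
```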



4. The Overall CrawlSpider Crawl Flow

  • The spider starts from the start URLs and fetches page content a for each of them.

  • The link extractor applies its extraction rules to page content a and pulls out link b, link c, link d, ...

    these links in turn correspond to page content b, page content c, page content d.

  • The rule parser then applies the configured parsing rules to the pages (page content b, page content c,

    page content d, ...) reached through the links the link extractor produced.

  • The parsed data is wrapped into an item and handed to the pipeline for persistent storage.
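The loop described above can be sketched in plain Python against a toy in-memory "site" (an illustration of the control flow only, not Scrapy's real scheduler; the URLs and contents are invented):

```python
import re
from collections import deque

# A toy "site": URL -> (page content, links found on that page).
site = {
    "list?page=1": ("listing 1", ["list?page=2", "article-1.html"]),
    "list?page=2": ("listing 2", ["article-2.html"]),
    "article-1.html": ("content A", []),
    "article-2.html": ("content B", []),
}

def crawl(start):
    """Sketch of the CrawlSpider loop: fetch each page, run the callback on
    detail pages, and keep following newly extracted links."""
    items, queue, seen = [], deque([start]), {start}
    while queue:
        url = queue.popleft()
        content, links = site[url]
        if re.search(r"article-.+\.html", url):
            items.append(content)          # "callback": parse the detail page
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)         # "follow": keep extracting links
    return items

print(crawl("list?page=1"))  # ['content A', 'content B']
```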



5. A CrawlSpider Case Study

  • Target site: http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1

  • Create the project: remember to write the launcher script start.py
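A minimal start.py could look like the following; it simply invokes the crawl command through scrapy.cmdline, assuming the spider name wxapp_spider used later in this case study:

```python
# start.py -- place it at the project root, next to scrapy.cfg
from scrapy import cmdline

# Equivalent to running "scrapy crawl wxapp_spider" in the shell,
# but runnable directly from an IDE for easier debugging.
cmdline.execute("scrapy crawl wxapp_spider".split())
```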

  • Edit settings.py:

    # -*- coding: utf-8 -*-
    
    # Scrapy settings for wxapp project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://docs.scrapy.org/en/latest/topics/settings.html
    #     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'wxapp'
    
    SPIDER_MODULES = ['wxapp.spiders']
    NEWSPIDER_MODULE = 'wxapp.spiders'
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'wxapp (+http://www.yourdomain.com)'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    DEFAULT_REQUEST_HEADERS = {
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Language': 'en',
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3754.400 QQBrowser/10.5.3991.400'
    }
    
    # Enable or disable spider middlewares
    # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'wxapp.middlewares.WxappSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'wxapp.middlewares.WxappDownloaderMiddleware': 543,
    #}
    
    # Enable or disable extensions
    # See https://docs.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
       'wxapp.pipelines.WxappPipeline': 300,
    }
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
    
    
  • Define the targets in items.py: we want the title, time, content, and author

    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://docs.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class WxappItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title = scrapy.Field()
        author = scrapy.Field()
        time = scrapy.Field()
        article_content = scrapy.Field()
    
  • Write the spider, wxapp_spider.py:

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    
    # import the item class defined in items.py
    from wxapp.items import WxappItem
    
    class WxappSpiderSpider(CrawlSpider):
        name = 'wxapp_spider'
        allowed_domains = ['wxapp-union.com']
        start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']
    
        rules = (
            # .+ and \d in the patterns below are regular expressions describing the qualifying URLs
            # follow=True keeps following: after pagination, the newly reached page is searched again
            # for URLs matching the allow pattern; follow=False keeps only matches from the first page
            ## sample listing-page URL: http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1
            Rule(LinkExtractor(allow=r'.+mod=list&catid=2&page=\d'), follow=True),

            # detail pages need a callback (to parse and save the content) and no following
            ## sample detail-page URL: http://www.wxapp-union.com/article-5822-1.html
            ### \. escapes the dot
            Rule(LinkExtractor(allow=r'.+article-.+\.html'), follow=False, callback="parse_detail"),
        )
    
        def parse_detail(self, response):
            title = response.xpath('//h1[@class="ph"]/text()').get()
            author = response.xpath('//*[@id="ct"]/div[1]/div/div[1]/div/div[2]/div[3]/div[1]/p/a/text()').get()
            time = response.xpath('//*[@id="ct"]/div[1]/div/div[1]/div/div[2]/div[3]/div[1]/p/span/text()').get()
            # print(title + '\n' + author + ' ' +  time)
            # print('='*30)
            article_content = response.xpath('//*[@id="article_content"]//text()').getall()
            article_content = "".join(article_content).strip()
            # print(article_content)
            item = WxappItem()
            item['title'] = title
            item['author'] = author
            item['time'] = time
            item['article_content'] = article_content
            yield item
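    Both Rule objects hinge on their allow regexes, so it is worth sanity-checking them with plain re against the sample URLs from the comments before running the spider:

```python
import re

list_pat = r".+mod=list&catid=2&page=\d"
detail_pat = r".+article-.+\.html"

list_url = "http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1"
detail_url = "http://www.wxapp-union.com/article-5822-1.html"

# Each pattern should match its own kind of URL and not the other.
print(bool(re.match(list_pat, list_url)))      # True
print(bool(re.match(detail_pat, detail_url)))  # True
print(bool(re.match(detail_pat, list_url)))    # False
```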
    
  • Save the data in pipelines.py:

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    
    from scrapy.exporters import JsonLinesItemExporter
    
    class WxappPipeline(object):
        def __init__(self):
            self.fp = open(r'C:\Users\Administrator\Desktop\project.json', 'wb')
            self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')
    
        def process_item(self, item, spider):
            self.exporter.export_item(item)
            return item
    
        def close_spider(self, spider):
            self.fp.close()
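    JsonLinesItemExporter writes one compact JSON object per line; the format can be reproduced with the json stdlib module (this is only an illustration of the output format, not the exporter itself):

```python
import io
import json

items = [
    {"title": "Sample article", "author": "someone"},
    {"title": "另一篇文章", "author": "作者"},  # non-ASCII survives with ensure_ascii=False
]

buf = io.StringIO()
for item in items:
    # one JSON object per line -- the JSON-lines format
    buf.write(json.dumps(item, ensure_ascii=False) + "\n")

lines = buf.getvalue().splitlines()
print(len(lines))   # 2
print(lines[0])     # first exported item as a JSON object
```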
    
posted @ 2023-06-28 23:02  枫_Null