Python Crawler #019: Scrapy's CrawlSpider

CrawlSpider lets you define link rules: any link that satisfies a rule is requested automatically. This makes CrawlSpider well suited to crawling an entire site, and it is a very powerful tool.


1. Introduction to CrawlSpider

  • With a plain Spider, scraping multiple pages means writing and chaining callbacks by hand. A CrawlSpider instead uses a regular expression to define a URL rule: any link on a page that satisfies the rule gets crawled. For example:

    https://www.bilibili.com/video/BV1kx411h7jv?p=1
    https://www.bilibili.com/video/BV1kx411h7jv?p=2
    https://www.bilibili.com/video/BV1kx411h7jv?p=3
    ......
    Rule: https://www.bilibili.com/video/BV1kx411h7jv?p=\d

  • CrawlSpider is a subclass of Spider. Spider's design principle is to crawl only the pages in its start_urls list, while CrawlSpider defines rules (Rule objects) that provide a convenient mechanism for following links: it extracts links from the pages it crawls and keeps crawling them.
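The page-number rule above is just a regular expression, so its behavior can be checked with Python's standard re module before handing it to a spider (plain re here, not Scrapy itself):

```python
import re

# The page URLs differ only in the value of the "p" query parameter,
# so one pattern with \d+ covers every page.
pattern = re.compile(r"https://www\.bilibili\.com/video/BV1kx411h7jv\?p=\d+")

urls = [
    "https://www.bilibili.com/video/BV1kx411h7jv?p=1",
    "https://www.bilibili.com/video/BV1kx411h7jv?p=2",
    "https://www.bilibili.com/video/BV1kx411h7jv?p=30",
    "https://www.bilibili.com/video/BV1kx411h7jv",       # no page number: no match
]

matched = [u for u in urls if pattern.fullmatch(u)]
print(matched)  # the first three URLs match
```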



2. Creating a CrawlSpider

  • Create the project (in cmd): change into the directory where the project should live, then run scrapy startproject [project name]

  • Create the spider (in cmd): change into the project directory, then run scrapy genspider -t crawl [spider name] [spider domain]

    -t crawl: the generated spider file is based on the CrawlSpider class rather than the plain Spider base class

  • Example:
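For this chapter's case study, the two commands might look like the following (the project name wxapp and spider name wxapp_spider are taken from the code later in this post):

```shell
# create the project in the current directory
scrapy startproject wxapp

# generate a CrawlSpider-based spider inside the project
cd wxapp
scrapy genspider -t crawl wxapp_spider wxapp-union.com
```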



3. CrawlSpider Parameters

  • LinkExtractor (link extractor): extracts links

    • allow: URLs matching this regular expression are extracted; if empty, every URL matches.

    • deny: URLs matching this regular expression (or list of regular expressions) are never extracted — the opposite of allow.

    • allow_domains: only links under these domains are extracted.

    • deny_domains: links under these domains are never extracted.

    • restrict_xpaths: an XPath expression that works together with allow to narrow down which links are extracted.

  • rules (rule parser): a collection of Rule objects used to match the target pages and filter out noise

    • LinkExtractor(...): the link extractor for this rule

    • callback="parse_item": the callback function that parses matched pages

    • follow=True: whether to keep following, i.e. whether the link extractor is applied again to the pages reached through the links it extracted.

      When callback is None, follow defaults to True.
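The allow/deny semantics can be sketched in plain Python (this is only an illustration of the filtering logic, not Scrapy's actual implementation; the helper name filter_links is made up):

```python
import re

def filter_links(urls, allow=None, deny=None):
    """Keep URLs that match at least one allow pattern and no deny pattern.
    An empty/None allow matches everything, mirroring LinkExtractor."""
    kept = []
    for url in urls:
        if allow and not any(re.search(p, url) for p in allow):
            continue  # fails the allow list
        if deny and any(re.search(p, url) for p in deny):
            continue  # hits the deny list
        kept.append(url)
    return kept

urls = [
    "http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=3",
    "http://www.wxapp-union.com/article-5822-1.html",
    "http://www.wxapp-union.com/member.php?mod=logging",
]
print(filter_links(urls, allow=[r"article-.+\.html"]))  # only the article URL
print(filter_links(urls, deny=[r"member\.php"]))        # everything but member.php
```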



4. The Overall CrawlSpider Crawl Flow

  • The spider starts from the start URLs and fetches page content a for each of them.

  • The link extractor applies its extraction rules to page content a and pulls out link b, link c, link d, ...

    these links in turn correspond to page content b, page content c, page content d.

  • The rule parser then applies the configured parsing rules to the pages (page content b, page content c,

    page content d, ...) reached through the links the link extractor produced.

  • The parsed data is wrapped into an item and handed to the pipeline for persistent storage.
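The loop described above can be sketched in plain Python against a toy in-memory "site" (an illustration of the control flow only, not Scrapy's real scheduler; the URLs and contents are invented):

```python
import re
from collections import deque

# A toy "site": URL -> (page content, links found on that page).
site = {
    "list?page=1": ("listing 1", ["list?page=2", "article-1.html"]),
    "list?page=2": ("listing 2", ["article-2.html"]),
    "article-1.html": ("content A", []),
    "article-2.html": ("content B", []),
}

def crawl(start):
    """Sketch of the CrawlSpider loop: fetch each page, run the callback on
    detail pages, and keep following newly extracted links."""
    items, queue, seen = [], deque([start]), {start}
    while queue:
        url = queue.popleft()
        content, links = site[url]
        if re.search(r"article-.+\.html", url):
            items.append(content)          # "callback": parse the detail page
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)         # "follow": keep extracting links
    return items

print(crawl("list?page=1"))  # ['content A', 'content B']
```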



5. A CrawlSpider Case Study

  • Target site: http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1

  • Create the project: remember to write the launcher script start.py
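A minimal start.py could look like the following; it simply invokes the crawl command through scrapy.cmdline, assuming the spider name wxapp_spider used later in this case study:

```python
# start.py -- place it at the project root, next to scrapy.cfg
from scrapy import cmdline

# Equivalent to running "scrapy crawl wxapp_spider" in the shell,
# but runnable directly from an IDE for easier debugging.
cmdline.execute("scrapy crawl wxapp_spider".split())
```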

  • Edit settings.py:

    # -*- coding: utf-8 -*-
    
    # Scrapy settings for wxapp project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://docs.scrapy.org/en/latest/topics/settings.html
    #     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'wxapp'
    
    SPIDER_MODULES = ['wxapp.spiders']
    NEWSPIDER_MODULE = 'wxapp.spiders'
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'wxapp (+http://www.yourdomain.com)'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    DEFAULT_REQUEST_HEADERS = {
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Language': 'en',
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3754.400 QQBrowser/10.5.3991.400'
    }
    
    # Enable or disable spider middlewares
    # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'wxapp.middlewares.WxappSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'wxapp.middlewares.WxappDownloaderMiddleware': 543,
    #}
    
    # Enable or disable extensions
    # See https://docs.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
       'wxapp.pipelines.WxappPipeline': 300,
    }
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
    
    
  • Define the targets in items.py: we want the title, time, content, and author

    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://docs.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class WxappItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title = scrapy.Field()
        author = scrapy.Field()
        time = scrapy.Field()
        article_content = scrapy.Field()
    
  • Write the spider, wxapp_spider.py:

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    
    # import the item class defined in items.py
    from wxapp.items import WxappItem
    
    class WxappSpiderSpider(CrawlSpider):
        name = 'wxapp_spider'
        allowed_domains = ['wxapp-union.com']
        start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']
    
        rules = (
            # .+ and \d in the patterns below are regular expressions describing the qualifying URLs
            # follow=True keeps following: after pagination, the newly reached page is searched again
            # for URLs matching the allow pattern; follow=False keeps only matches from the first page
            ## sample listing-page URL: http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1
            Rule(LinkExtractor(allow=r'.+mod=list&catid=2&page=\d'), follow=True),

            # detail pages need a callback (to parse and save the content) and no following
            ## sample detail-page URL: http://www.wxapp-union.com/article-5822-1.html
            ### \. escapes the dot
            Rule(LinkExtractor(allow=r'.+article-.+\.html'), follow=False, callback="parse_detail"),
        )
    
        def parse_detail(self, response):
            title = response.xpath('//h1[@class="ph"]/text()').get()
            author = response.xpath('//*[@id="ct"]/div[1]/div/div[1]/div/div[2]/div[3]/div[1]/p/a/text()').get()
            time = response.xpath('//*[@id="ct"]/div[1]/div/div[1]/div/div[2]/div[3]/div[1]/p/span/text()').get()
            # print(title + '\n' + author + ' ' +  time)
            # print('='*30)
            article_content = response.xpath('//*[@id="article_content"]//text()').getall()
            article_content = "".join(article_content).strip()
            # print(article_content)
            item = WxappItem()
            item['title'] = title
            item['author'] = author
            item['time'] = time
            item['article_content'] = article_content
            yield item
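    Both Rule objects hinge on their allow regexes, so it is worth sanity-checking them with plain re against the sample URLs from the comments before running the spider:

```python
import re

list_pat = r".+mod=list&catid=2&page=\d"
detail_pat = r".+article-.+\.html"

list_url = "http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1"
detail_url = "http://www.wxapp-union.com/article-5822-1.html"

# Each pattern should match its own kind of URL and not the other.
print(bool(re.match(list_pat, list_url)))      # True
print(bool(re.match(detail_pat, detail_url)))  # True
print(bool(re.match(detail_pat, list_url)))    # False
```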
    
  • Save the data in pipelines.py:

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    
    from scrapy.exporters import JsonLinesItemExporter
    
    class WxappPipeline(object):
        def __init__(self):
            self.fp = open(r'C:\Users\Administrator\Desktop\project.json', 'wb')
            self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')
    
        def process_item(self, item, spider):
            self.exporter.export_item(item)
            return item
    
        def close_spider(self, spider):
            self.fp.close()
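    JsonLinesItemExporter writes one compact JSON object per line; the format can be reproduced with the json stdlib module (this is only an illustration of the output format, not the exporter itself):

```python
import io
import json

items = [
    {"title": "Sample article", "author": "someone"},
    {"title": "另一篇文章", "author": "作者"},  # non-ASCII survives with ensure_ascii=False
]

buf = io.StringIO()
for item in items:
    # one JSON object per line -- the JSON-lines format
    buf.write(json.dumps(item, ensure_ascii=False) + "\n")

lines = buf.getvalue().splitlines()
print(len(lines))   # 2
print(lines[0])     # first exported item as a JSON object
```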
    
posted @ 2023-06-28 23:02  枫_Null