CrawlSpider in the Python Scrapy Framework

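The text and images in this article come from the internet and are for learning and discussion only, with no commercial purpose. Copyright belongs to the original author; if there is a problem, please contact us promptly so we can address it.

This article is from Tencent Cloud. Author: Python知识大全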


Create a CrawlSpider from the crawl template:

scrapy genspider -t crawl [spider name] [domain]

The LinkExtractor class:

class scrapy.linkextractors.LinkExtractor(
    allow = (),
    deny = (),
    allow_domains = (),
    deny_domains = (),
    deny_extensions = None,
    restrict_xpaths = (),
    tags = ('a','area'),
    attrs = ('href',),
    canonicalize = False,
    unique = True,
    process_value = None
)

The Rule class:

Rule defines the crawling rules for the spider. A brief introduction to this class follows:

class scrapy.spiders.Rule(
    link_extractor,
    callback = None,
    cb_kwargs = None,
    follow = None,
    process_links = None,
    process_request = None
)

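Main parameters:

link_extractor: a LinkExtractor object that defines which links to extract.
callback: the callback to run on the response of each extracted link. CrawlSpider uses parse internally, so do not use parse as the name of your own callback.
follow: whether the links found in responses matched by this rule should themselves be followed. When omitted, it defaults to True if no callback is given and to False otherwise.
process_links: a function that receives the list of links produced by link_extractor and can filter out the ones that should not be crawled.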

Example spider file (the commented lines are the key points):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ChoutiSpider(CrawlSpider):
    name = 'chouti'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://dig.chouti.com/1']

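    # Link extractor: extracts every link matching the rule from the pages of the start URLs; allow is a regular expression
    # If the regex is empty, every link on the page is extracted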
    link = LinkExtractor(allow=r'\d+')
    rules = (
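        # Rule: parses the page behind each link found by the link extractor using the given callback
        # Rule sends the requests for the extracted links automatically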
        Rule(link, callback='parse_item', follow=True),
        # follow=True: apply the link extractor again to the pages reached through the extracted links
    )

    def parse_item(self, response):
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        return item
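One thing worth checking in the spider above is how broad allow=r'\d+' is. LinkExtractor applies allow patterns as a regular-expression search against the absolute URL, so any URL that contains a digit anywhere will match. The sketch below reproduces that check with the standard re module only; the sample URLs and the narrower pattern are assumptions made up for illustration, not paths taken from the real site.

```python
import re

# Hypothetical URLs, invented for illustration.
candidates = [
    "https://example.com/all/hot/recent/2",
    "https://example.com/about",
    "https://example.com/link/12345",
]

# Pattern used in the spider above: matches any URL containing a digit.
loose = re.compile(r"\d+")
# Assumed narrower pattern: only numbered listing pages.
strict = re.compile(r"/all/hot/recent/\d+$")

loose_hits = [url for url in candidates if loose.search(url)]
strict_hits = [url for url in candidates if strict.search(url)]

print(loose_hits)
print(strict_hits)
```

With follow=True, a loose pattern can pull in far more pages than intended, so anchoring the pattern to the expected path is usually the safer choice.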

posted @ 2021-01-25 15:37 锦麟
