Python Web Scraping #019: Scrapy's CrawlSpider
CrawlSpider lets you define link rules: any link that matches a rule is requested automatically. This makes CrawlSpider well suited to crawling an entire site, and it is a very powerful tool.
1. Introduction to CrawlSpider

- With a traditional Spider, fetching data across multiple pages requires chaining callbacks by hand. A CrawlSpider instead uses regular expressions to define URL rules: any URL on a crawled page that matches a rule gets crawled too. For example:

  https://www.bilibili.com/video/BV1kx411h7jv?p=1
  https://www.bilibili.com/video/BV1kx411h7jv?p=2
  https://www.bilibili.com/video/BV1kx411h7jv?p=3
  ......

  Rule: https://www.bilibili.com/video/BV1kx411h7jv?p=\d

- CrawlSpider is a subclass of Spider. The Spider class is designed to crawl only the pages listed in start_urls, whereas CrawlSpider defines rules (Rule objects) that provide a convenient mechanism for following links: it extracts links from crawled pages and keeps crawling them.
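The page rule above can be written as an ordinary regular expression. A quick stdlib check (no Scrapy needed; the pattern is built with `re.escape` so the `?` and `.` in the base URL are matched literally, with `\d+` standing in for the page number):

```python
import re

# The rule from the example above, written as a regular expression.
# re.escape handles the '?' and '.' in the base URL; \d+ matches the page number.
pattern = re.escape("https://www.bilibili.com/video/BV1kx411h7jv?p=") + r"\d+"

urls = [
    "https://www.bilibili.com/video/BV1kx411h7jv?p=1",
    "https://www.bilibili.com/video/BV1kx411h7jv?p=23",
    "https://www.bilibili.com/video/BV1kx411h7jv",      # no page number
]
print([bool(re.fullmatch(pattern, u)) for u in urls])  # [True, True, False]
```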
2. Creating a CrawlSpider

- Create the project (run in cmd): change into the directory where the project should live, then run scrapy startproject [project name]
- Create the spider (run in cmd): change into the project directory, then run scrapy genspider -t crawl [spider name] [domain to crawl]
  - -t crawl: the generated spider file is based on the CrawlSpider class rather than the base Spider class
- Example: (screenshot not preserved in the text version)
3. CrawlSpider Parameters

- LinkExtractor (link extractor): extracts links
  - allow: URLs matching this regular expression (or list of regular expressions) are extracted; if empty, every URL matches.
  - deny: URLs matching this regular expression (or list of regular expressions) are never extracted.
  - allow_domains: only links in these domains will be extracted.
  - deny_domains: links in these domains will never be extracted.
  - restrict_xpaths: XPath expressions that work together with allow to filter the links.
- rules (rule parser): a collection of Rule objects, used to match the target pages and filter out noise
  - link_extractor: the LinkExtractor to use
  - callback="parse_item": the callback function (here parse_item) used to parse matched pages
  - follow=True: whether to keep following, i.e. whether the link extractor is applied again to the pages reached through the links it extracted. When callback is None, follow defaults to True.
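The allow/deny/allow_domains filtering described above can be sketched in plain Python. This is not Scrapy's implementation, just an illustration of the semantics; the function name is made up, and the substring domain check is a deliberate simplification:

```python
import re

def extract_links(urls, allow=None, deny=None, allow_domains=None):
    """A plain-Python sketch of LinkExtractor's filtering rules."""
    result = []
    for url in urls:
        if allow and not re.search(allow, url):
            continue  # allow set but not matched: skip
        if deny and re.search(deny, url):
            continue  # deny matched: never extract
        if allow_domains and not any(d in url for d in allow_domains):
            continue  # crude substring domain check, for illustration only
        result.append(url)
    return result

urls = [
    "http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=2",
    "http://www.wxapp-union.com/article-5822-1.html",
    "http://other.com/portal.php?mod=list&catid=2&page=3",
]
print(extract_links(urls, allow=r"page=\d+"))                               # first and third
print(extract_links(urls, allow=r"page=\d+", allow_domains=["wxapp-union.com"]))  # first only
print(extract_links(urls, deny=r"article"))                                 # first and third
```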
4. The Overall CrawlSpider Crawling Flow

- The spider first fetches the start URL and obtains its page content (page A).
- The link extractor applies its extraction rules to page A and extracts link B, link C, link D, ..., which correspond to page contents B, C, D, ...
- The rule parser then parses the pages behind those extracted links (page contents B, C, D, ...) according to the specified parsing rules.
- The parsed data is packed into items, which are handed to the pipeline for persistent storage.
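The four steps above can be sketched as a small self-contained loop. Everything here is invented for illustration: the in-memory "site" dict stands in for HTTP responses, and the two regexes play the role of a follow rule and a callback rule; a real CrawlSpider does this with LinkExtractor and Rule objects over the network:

```python
import re
from collections import deque

# A fake site: page name -> page content (instead of URL -> HTTP response).
SITE = {
    "start":  '<a href="list-1">next</a> <a href="item-a">a</a>',
    "list-1": '<a href="list-2">next</a> <a href="item-b">b</a>',
    "list-2": '<a href="item-c">c</a>',
    "item-a": "DATA a", "item-b": "DATA b", "item-c": "DATA c",
}

def crawl(start, follow_rule, item_rule):
    items, seen, queue = [], {start}, deque([start])
    while queue:
        page = SITE[queue.popleft()]                      # 1. fetch page content
        for link in re.findall(r'href="([^"]+)"', page):  # 2. extract links
            if link in seen:
                continue
            seen.add(link)
            if re.match(item_rule, link):                 # 3. rule with a callback:
                items.append(SITE[link])                  #    parse page into an item
            elif re.match(follow_rule, link):             #    rule with follow=True:
                queue.append(link)                        #    keep crawling from here
    return items                                          # 4. hand items onward

print(crawl("start", r"list-\d", r"item-\w"))  # ['DATA a', 'DATA b', 'DATA c']
```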
5. CrawlSpider in Practice

- Target site: http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1
- Create the project (remember to write the launcher script start.py)
- Edit settings.py:

```python
# -*- coding: utf-8 -*-
# Scrapy settings for wxapp project
# (the commented-out defaults from the generated template are omitted here)

BOT_NAME = 'wxapp'

SPIDER_MODULES = ['wxapp.spiders']
NEWSPIDER_MODULE = 'wxapp.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure a delay for requests for the same website (default: 0)
DOWNLOAD_DELAY = 3

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 '
                  'Core/1.70.3754.400 QQBrowser/10.5.3991.400'
}

# Configure item pipelines
ITEM_PIPELINES = {
    'wxapp.pipelines.WxappPipeline': 300,
}
```
- Define the target fields (title, time, content, author) in items.py:

```python
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class WxappItem(scrapy.Item):
    # define the fields for your item here
    title = scrapy.Field()
    author = scrapy.Field()
    time = scrapy.Field()
    article_content = scrapy.Field()
```
- Write the spider, wxapp_spider.py:

```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
# import the item class defined in items.py
from wxapp.items import WxappItem


class WxappSpiderSpider(CrawlSpider):
    name = 'wxapp_spider'
    allowed_domains = ['wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

    rules = (
        # List pages: follow=True keeps applying the rules to each newly found
        # page after pagination; follow=False would stop at the first page.
        # Example list-page URL:
        # http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1
        Rule(LinkExtractor(allow=r'.+mod=list&catid=2&page=\d'), follow=True),
        # Detail pages need a callback (parse the content, then save) and do
        # not need to be followed further.
        # Example detail-page URL: http://www.wxapp-union.com/article-5822-1.html
        # \. escapes the literal dot in the regex
        Rule(LinkExtractor(allow=r'.+article-.+\.html'), follow=False,
             callback="parse_detail"),
    )

    def parse_detail(self, response):
        title = response.xpath('//h1[@class="ph"]/text()').get()
        author = response.xpath('//*[@id="ct"]/div[1]/div/div[1]/div/div[2]/div[3]/div[1]/p/a/text()').get()
        time = response.xpath('//*[@id="ct"]/div[1]/div/div[1]/div/div[2]/div[3]/div[1]/p/span/text()').get()
        article_content = response.xpath('//*[@id="article_content"]//text()').getall()
        article_content = "".join(article_content).strip()
        item = WxappItem()
        item['title'] = title
        item['author'] = author
        item['time'] = time
        item['article_content'] = article_content
        yield item
```
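A quick stdlib check (no Scrapy required) that the two `allow` patterns in the rules really do separate list pages from detail pages; the sample URLs follow the formats shown in the spider's comments:

```python
import re

# The two `allow` patterns from the spider's rules.
LIST_RE = r'.+mod=list&catid=2&page=\d'
DETAIL_RE = r'.+article-.+\.html'

urls = [
    "http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=2",
    "http://www.wxapp-union.com/article-5822-1.html",
]

print([bool(re.match(LIST_RE, u)) for u in urls])    # [True, False]
print([bool(re.match(DETAIL_RE, u)) for u in urls])  # [False, True]
```

Note that the original post wrote the page pattern as `page=/d`; `/d` matches a literal slash followed by "d", so the rule would never match and pagination would silently stop. The escape must be `\d`.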
- Save the data in pipelines.py:

```python
# -*- coding: utf-8 -*-
# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exporters import JsonLinesItemExporter


class WxappPipeline(object):
    def __init__(self):
        # JsonLinesItemExporter writes bytes, so open the file in binary mode
        self.fp = open(r'C:\Users\Administrator\Desktop\project.json', 'wb')
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False,
                                              encoding='utf-8')

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()
```
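JsonLinesItemExporter writes one JSON object per line ("JSON Lines"). What it does can be sketched with the stdlib json module alone; the helper name and the sample item below are made up for illustration:

```python
import io
import json

def export_jsonlines(items, fp):
    """Sketch of JSON Lines export: one JSON object per line, written as bytes.

    ensure_ascii=False keeps Chinese text readable instead of \\uXXXX escapes,
    which is why the pipeline above passes the same option to the exporter.
    """
    for item in items:
        line = json.dumps(item, ensure_ascii=False) + "\n"
        fp.write(line.encode("utf-8"))  # file is opened in binary mode

buf = io.BytesIO()  # stands in for the 'wb' file handle in the pipeline
export_jsonlines([{"title": "示例标题", "author": "枫_Null"}], buf)
print(buf.getvalue().decode("utf-8"))
```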
This article is from cnblogs (博客园), author: 枫_Null. When reposting, please cite the original link: https://www.cnblogs.com/fengNull/articles/16662601.html
