scrapy - 3

Crawlers x6 - day 117

Review:
1. requests + BeautifulSoup
2. Web WeChat
3. High-performance (async) crawling
4. The Scrapy framework
	- Components
	- Writing a spider
		a. scrapy startproject sp3
		b.
			cd sp3
			scrapy genspider xx xx.com

		c. Starting URLs and callbacks (complete spider sketch below)
			start_urls = ['http://chouti.com/']

			def start_requests(self):
				# yield one Request per starting URL; dont_filter=True skips the dupe filter
				for url in self.start_urls:
					yield Request(url, dont_filter=True, callback=self.parse1)

			def parse1(self, response):
				# extract data with a selector
				hxs = Selector(response)
				hxs.xpath(...)

				yield Item()        # hand an item to the pipelines
				yield Request(...)  # schedule a follow-up request
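
			A minimal complete spider file for context, assuming the project and spider created in steps a and b; the spider name, domain, and XPath expression are illustrative only:

				# sp3/spiders/chouti.py -- illustrative sketch, names are assumptions
				import scrapy
				from scrapy.http import Request
				from scrapy.selector import Selector

				class ChoutiSpider(scrapy.Spider):
					name = 'chouti'
					allowed_domains = ['chouti.com']
					start_urls = ['http://chouti.com/']

					def start_requests(self):
						# control how the initial requests are built
						for url in self.start_urls:
							yield Request(url, dont_filter=True, callback=self.parse1)

					def parse1(self, response):
						hxs = Selector(response)
						# the XPath below is only an example expression
						for text in hxs.xpath('//a/text()').extract():
							yield {'text': text}   # dict items go to the pipelines
						# follow-up requests go back to the scheduler:
						# yield Request('http://chouti.com/', callback=self.parse1)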
				
		d. pipeline (fuller sketch below)

			class xxx:

				@classmethod
				def from_crawler(cls, crawler):
					# build the pipeline instance from crawler.settings
					...
				
	- Advanced customization

		Dedup rules (request fingerprint filtering):
			# default filter; the module is 'scrapy.dupefilters' (plural) in Scrapy 1.0+
			DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'

			class xx:

				@classmethod
				def from_settings(cls, settings):
					...
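
			A custom dupe filter sketch (class and module names are assumptions), implementing the interface the scheduler calls:

				# sp3/dupefilters.py -- illustrative sketch
				from scrapy.utils.request import request_fingerprint

				class MyDupeFilter(object):

					def __init__(self):
						self.visited = set()

					@classmethod
					def from_settings(cls, settings):
						# Scrapy builds the filter through this hook
						return cls()

					def request_seen(self, request):
						# return True to drop the request as a duplicate
						fp = request_fingerprint(request)
						if fp in self.visited:
							return True
						self.visited.add(fp)
						return False

					def open(self):
						pass    # called when the spider opens

					def close(self, reason):
						pass    # called when the spider closes

					def log(self, request, spider):
						pass    # called for each filtered request

			Point the setting at it: DUPEFILTER_CLASS = 'sp3.dupefilters.MyDupeFilter'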
				
		Signal-based custom extensions (signals)
			# EXTENSIONS = {
			#     # 'step8_king.extensions.MyExtension': 500,
			# }

			@classmethod
			def from_crawler(cls, crawler):
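
			An extension sketch wired to Scrapy signals (the class name and module path are assumptions); from_crawler connects handlers to the signals of interest:

				# sp3/extensions.py -- illustrative sketch
				from scrapy import signals

				class MyExtension(object):

					@classmethod
					def from_crawler(cls, crawler):
						ext = cls()
						# connect handlers to the signals we care about
						crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
						crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
						return ext

					def spider_opened(self, spider):
						spider.logger.info('spider opened: %s', spider.name)

					def spider_closed(self, spider):
						spider.logger.info('spider closed: %s', spider.name)

			Enable it by uncommenting EXTENSIONS and pointing it at this class, e.g. {'sp3.extensions.MyExtension': 500}.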

今日内容:

1. Middleware
	Spider middleware (sketch below)
		#SPIDER_MIDDLEWARES = {
		#    'sp3.middlewares.Sp3SpiderMiddleware': 543,
		#}
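
		A spider middleware sketch using the Sp3SpiderMiddleware name from the commented setting above; these hooks run around the spider's callbacks:

			# sp3/middlewares.py -- illustrative sketch
			class Sp3SpiderMiddleware(object):

				def process_spider_input(self, response, spider):
					# called for each response before it reaches the spider callback
					return None

				def process_spider_output(self, response, result, spider):
					# called with whatever the callback yields; must return an iterable
					for item_or_request in result:
						yield item_or_request

				def process_start_requests(self, start_requests, spider):
					# called with the spider's start requests
					for request in start_requests:
						yield request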
		
	Downloader middleware
		DOWNLOADER_MIDDLEWARES = {
		   'sp3.middlewares.DownMiddleware1': 543,
		}
		
		PS: this is the usual place to set a proxy (sketch below)
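
		A downloader middleware sketch using the DownMiddleware1 name from the setting above; the proxy address is a placeholder:

			# sp3/middlewares.py -- illustrative sketch
			class DownMiddleware1(object):

				def process_request(self, request, spider):
					# attach a proxy to every outgoing request
					request.meta['proxy'] = 'http://127.0.0.1:8888'   # placeholder address
					return None   # None -> continue with the normal download

				def process_response(self, request, response, spider):
					# must return a Response (or a Request to reschedule)
					return response

				def process_exception(self, request, exception, spider):
					# return None to let the default error handling continue
					return None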

2. Custom commands
	- Only the essentials: how to extend them and how to use them
	- How they work
	
	from scrapy.commands import ScrapyCommand

	class Command(ScrapyCommand):

		requires_project = True

		def syntax(self):
			return '[options]'

		def short_desc(self):
			return 'Runs all of the spiders'

		def run(self, args, opts):
			# self.crawler_process is provided by ScrapyCommand;
			# spider_loader.list() returns every spider name in the project
			spider_list = self.crawler_process.spider_loader.list()
			for name in spider_list:
				# schedule each spider
				self.crawler_process.crawl(name, **opts.__dict__)
			# start the reactor and run all of the spiders
			self.crawler_process.start()
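
	To make Scrapy discover the command, put this class in a module inside a commands package of the project and register that package via the COMMANDS_MODULE setting; the module's file name becomes the command name. A sketch, assuming the file is sp3/commands/crawlall.py:

		# sp3/settings.py
		COMMANDS_MODULE = 'sp3.commands'

		# the commands package also needs an (empty) __init__.py; then run:
		#   scrapy crawlall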
			
3. Other topics (the Scrapy settings file; see the sketch below)
	- Proxies
	- HTTPS
	- Throttling / rate limiting
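
	A settings.py sketch covering the three items above; the values are placeholders and the setting names are standard Scrapy settings:

		# settings.py -- illustrative values

		# Proxies: simplest is request.meta['proxy'] in a downloader middleware
		# (see DownMiddleware1 above); the built-in HttpProxyMiddleware also
		# honors the http_proxy / https_proxy environment variables.

		# HTTPS: swap in a custom client context factory, e.g. for client certs
		# DOWNLOADER_CLIENTCONTEXTFACTORY = 'sp3.https.MyContextFactory'   # assumed path

		# Throttling / rate limiting
		DOWNLOAD_DELAY = 2                     # seconds between requests to the same site
		CONCURRENT_REQUESTS = 16               # global concurrency
		CONCURRENT_REQUESTS_PER_DOMAIN = 8     # per-domain concurrency
		AUTOTHROTTLE_ENABLED = True            # adapt the delay to server latency
		AUTOTHROTTLE_START_DELAY = 5
		AUTOTHROTTLE_MAX_DELAY = 60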
	
	
	
Content recap
	- Basics:
		starting URLs
		start_requests
		parse:
			selectors
			yield Item
			yield Request(...)
		cookies

		item / pipeline

	- Advanced:
		- dedup rules
		- signal-based custom extensions
		- spider middleware
		- downloader middleware (proxies)
		- custom commands
		- HTTPS

		- scheduler
			- write one entirely by hand
			- the scrapy-redis plugin (customizable)
4. Write your own TinyScrapy framework (*)

Next session:
redis
scrapy-redis distributed crawling
