scrapy - 3
Crawlers x6 - day117
Review:
1. requests + BeautifulSoup
2. Web WeChat
3. High-performance related
4. The Scrapy framework
   - Components
   - Writing a spider
a. scrapy startproject sp3
     b. cd sp3
        scrapy genspider xx xx.com
     c. Start URLs: start_urls = ['http://chouti.com/']
        def start_requests(self):
            # entry point: turn every start URL into a Request
            for url in self.start_urls:
                yield Request(url, dont_filter=True, callback=self.parse1)
        def parse1(self, response):
            # callback: parse with a selector, then yield items / follow-up requests
            hxs = Selector(response)
            hxs.xpath(...)
            yield Item()
            yield Request()
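        A minimal, self-contained version of that spider might look like the sketch below; chouti.com and the xpath are placeholders carried over from the notes, and the class/file layout assumes the sp3 project from step a.

            # Sketch of sp3/spiders/chouti.py; URL and xpath are placeholders.
            from scrapy import Request, Spider
            from scrapy.selector import Selector

            class ChoutiSpider(Spider):
                name = 'chouti'
                start_urls = ['http://chouti.com/']

                def start_requests(self):
                    # dont_filter=True bypasses the duplicate filter for start URLs
                    for url in self.start_urls:
                        yield Request(url, dont_filter=True, callback=self.parse1)

                def parse1(self, response):
                    # extract with a selector, yield items (plain dicts work) and follow-ups
                    hxs = Selector(response)
                    for href in hxs.xpath('//a/@href').extract():
                        yield {'href': href}
                        yield Request(response.urljoin(href), callback=self.parse1)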
     d. pipeline
        class XxxPipeline:
            @classmethod
            def from_crawler(cls, crawler):
                # read settings here and return a configured pipeline instance
                ...
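        A fuller, hedged sketch of an item pipeline: XxxPipeline and the XXX_FILE_PATH setting are made-up names, while from_crawler, open_spider, process_item, and close_spider are the hooks Scrapy actually calls.

            # Hedged sketch: writes every item to a file; XXX_FILE_PATH is a hypothetical setting.
            class XxxPipeline(object):
                def __init__(self, path):
                    self.path = path
                    self.f = None

                @classmethod
                def from_crawler(cls, crawler):
                    # read configuration from settings.py when the crawler builds the pipeline
                    return cls(path=crawler.settings.get('XXX_FILE_PATH', 'items.txt'))

                def open_spider(self, spider):
                    self.f = open(self.path, 'a', encoding='utf-8')

                def process_item(self, item, spider):
                    self.f.write(str(item) + '\n')
                    return item  # pass the item on to any later pipelines

                def close_spider(self, spider):
                    self.f.close()

        Register it in settings.py, e.g. ITEM_PIPELINES = {'sp3.pipelines.XxxPipeline': 300}; the number is only an ordering priority (lower runs first).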
   - Advanced customization
     Dedup rule:
         DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
         class XxDupeFilter:
             @classmethod
             def from_settings(cls, settings):
                 ...
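     A hedged sketch of a custom dupefilter kept in an in-memory set; the class name and module path are illustrative, while from_settings, request_seen, open, and close are the methods Scrapy's dupefilter interface expects.

         # Hedged sketch; enable with DUPEFILTER_CLASS = 'sp3.dupefilters.XxDupeFilter'
         # (the module path is an assumption about where the file lives).
         from scrapy.utils.request import request_fingerprint

         class XxDupeFilter(object):
             def __init__(self):
                 self.visited = set()

             @classmethod
             def from_settings(cls, settings):
                 # settings are available here if the filter needs configuration
                 return cls()

             def request_seen(self, request):
                 # return True to drop the request as a duplicate
                 fp = request_fingerprint(request)
                 if fp in self.visited:
                     return True
                 self.visited.add(fp)
                 return False

             def open(self):              # called when the spider starts
                 pass

             def close(self, reason):     # called when the spider ends
                 pass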
     Signal-based custom extensions (signals):
         # EXTENSIONS = {
         #     'step8_king.extensions.MyExtension': 500,
         # }
         @classmethod
         def from_crawler(cls, crawler):
             ...
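     A hedged sketch of the extension behind that EXTENSIONS entry; the log messages are illustrative, while crawler.signals.connect and the spider_opened / spider_closed signals are Scrapy's real signal API.

         # Hedged sketch of step8_king/extensions.py (path taken from the notes).
         from scrapy import signals

         class MyExtension(object):
             @classmethod
             def from_crawler(cls, crawler):
                 ext = cls()
                 # hook the extension's methods onto engine signals
                 crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
                 crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
                 return ext

             def spider_opened(self, spider):
                 spider.logger.info('MyExtension: spider %s opened', spider.name)

             def spider_closed(self, spider):
                 spider.logger.info('MyExtension: spider %s closed', spider.name)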
Today's content:
1. Middleware
   Spider middleware
       # SPIDER_MIDDLEWARES = {
       #     'sp3.middlewares.Sp3SpiderMiddleware': 543,
       # }
   Downloader middleware
       DOWNLOADER_MIDDLEWARES = {
           'sp3.middlewares.DownMiddleware1': 543,
       }
   PS: proxies (see the middleware sketch below)
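   A hedged sketch of the DownMiddleware1 registered in DOWNLOADER_MIDDLEWARES above, attaching a proxy to every outgoing request; the proxy address is a placeholder.

       # Hedged sketch of sp3/middlewares.py; the proxy URL is a placeholder.
       class DownMiddleware1(object):
           def process_request(self, request, spider):
               # returning None lets the request continue down the middleware chain
               request.meta['proxy'] = 'http://127.0.0.1:8888'

           def process_response(self, request, response, spider):
               # must return a Response (or a Request to re-schedule)
               return response

           def process_exception(self, request, exception, spider):
               # None falls through to the next middleware / default error handling
               return None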
2. Custom commands
   - Only the essentials: the extension point and how to use it
   - How it works
         from scrapy.commands import ScrapyCommand

         # under the hood this drives scrapy.crawler.CrawlerProcess
         class Command(ScrapyCommand):
             requires_project = True

             def syntax(self):
                 return '[options]'

             def short_desc(self):
                 return 'Runs all of the spiders'

             def run(self, args, opts):
                 # list of all spider names in the project
                 # (newer Scrapy exposes this as self.crawler_process.spider_loader.list())
                 spider_list = self.crawler_process.spiders.list()
                 for name in spider_list:
                     # schedule a crawler for each spider
                     self.crawler_process.crawl(name, **opts.__dict__)
                 # start all the crawlers
                 self.crawler_process.start()
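   To wire the command in, the class goes in its own module and COMMANDS_MODULE (a real Scrapy setting) points at the package; the file name becomes the command name. A sketch assuming the sp3 project and a crawlall.py file name:

       # sp3/commands/__init__.py     (empty, makes it a package)
       # sp3/commands/crawlall.py     (holds the Command class above)

       # settings.py
       COMMANDS_MODULE = 'sp3.commands'

       # then run:  scrapy crawlall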
3. Other (the Scrapy settings file; see the sketch below)
   - proxies
   - HTTPS
   - throttling / rate limiting
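   A hedged sketch of the settings.py knobs behind these three items; the values are illustrative, the setting names are standard Scrapy settings.

       # settings.py -- illustrative values

       # throttling / rate limiting
       DOWNLOAD_DELAY = 2                 # seconds between requests to the same site
       CONCURRENT_REQUESTS = 16
       AUTOTHROTTLE_ENABLED = True
       AUTOTHROTTLE_START_DELAY = 5
       AUTOTHROTTLE_MAX_DELAY = 60

       # HTTPS: the TLS context factory can be swapped for custom certificate handling
       DOWNLOADER_CLIENTCONTEXTFACTORY = 'scrapy.core.downloader.contextfactory.ScrapyClientContextFactory'

       # proxies: set request.meta['proxy'] in a downloader middleware (see above),
       # or let the built-in HttpProxyMiddleware read the http_proxy / https_proxy env vars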
Recap
- Basics:
    start URLs
    start_requests
    parse:
        selectors
        yield Item
        yield Request(...)
    cookies
    item / pipeline
- Advanced:
    - dedup rules
    - signal-based custom extensions
    - spider middleware
    - downloader middleware (proxies)
    - custom commands
    - HTTPS
    - scheduler
        - write one entirely yourself
        - the scrapy-redis plugin (customizable)
4. Write your own TinyScrapy framework (*)
Next lesson:
    redis
    scrapy-redis for distributed crawling