Learning Python: Python crawling with the Scrapy framework (3): Scrapy extensions and the settings file explained
The Scrapy Framework
This post covers the extension hooks Scrapy provides: we can run our own code at any point in Scrapy's execution. When an extension runs is controlled through the signals Scrapy emits.
8 Scrapy Extensions
Modify settings.py and uncomment the EXTENSIONS setting (note that a value of None disables the component); we can then study the extension functionality that scrapy.extensions.telnet.TelnetConsole provides:
EXTENSIONS = {
    'scrapy.extensions.telnet.TelnetConsole': None,
}
The corresponding package and its source code (from scrapy/extensions/telnet.py):
from scrapy.extensions.telnet import TelnetConsole

class TelnetConsole(protocol.ServerFactory):

    def __init__(self, crawler):
        if not crawler.settings.getbool('TELNETCONSOLE_ENABLED'):
            raise NotConfigured
        if not TWISTED_CONCH_AVAILABLE:
            raise NotConfigured(
                'TELNETCONSOLE_ENABLED setting is True but required twisted '
                'modules failed to import:\n' + _TWISTED_CONCH_TRACEBACK)
        self.crawler = crawler
        self.noisy = False
        self.portrange = [int(x) for x in crawler.settings.getlist('TELNETCONSOLE_PORT')]
        self.host = crawler.settings['TELNETCONSOLE_HOST']
        self.username = crawler.settings['TELNETCONSOLE_USERNAME']
        self.password = crawler.settings['TELNETCONSOLE_PASSWORD']
        if not self.password:
            self.password = binascii.hexlify(os.urandom(8)).decode('utf8')
            logger.info('Telnet Password: %s', self.password)
        self.crawler.signals.connect(self.start_listening, signals.engine_started)
        self.crawler.signals.connect(self.stop_listening, signals.engine_stopped)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def start_listening(self):
        self.port = listen_tcp(self.portrange, self.host, self)
        h = self.port.getHost()
        logger.info("Telnet console listening on %(host)s:%(port)d",
                    {'host': h.host, 'port': h.port},
                    extra={'crawler': self.crawler})

    def stop_listening(self):
        self.port.stopListening()

    def protocol(self):
        class Portal:
            """An implementation of IPortal"""
            @defers
            def login(self_, credentials, mind, *interfaces):
                if not (credentials.username == self.username.encode('utf8') and
                        credentials.checkPassword(self.password.encode('utf8'))):
                    raise ValueError("Invalid credentials")

                protocol = telnet.TelnetBootstrapProtocol(
                    insults.ServerProtocol,
                    manhole.Manhole,
                    self._get_telnet_vars()
                )
                return (interfaces[0], protocol, lambda: None)

        return telnet.TelnetTransport(
            telnet.AuthenticatingTelnetProtocol,
            Portal()
        )

    def _get_telnet_vars(self):
        # Note: if you add entries here also update topics/telnetconsole.rst
        telnet_vars = {
            'engine': self.crawler.engine,
            'spider': self.crawler.engine.spider,
            'slot': self.crawler.engine.slot,
            'crawler': self.crawler,
            'extensions': self.crawler.extensions,
            'stats': self.crawler.stats,
            'settings': self.crawler.settings,
            'est': lambda: print_engine_status(self.crawler.engine),
            'p': pprint.pprint,
            'prefs': print_live_refs,
            'hpy': hpy,
            'help': "This is Scrapy telnet console. For more info see: "
                    "https://docs.scrapy.org/en/latest/topics/telnetconsole.html",
        }
        self.crawler.signals.send_catch_log(update_telnet_vars, telnet_vars=telnet_vars)
        return telnet_vars
Using Scrapy's built-in extension as a model, we can define a custom extension.
Signals: there are many signals, one for each stage of Scrapy's processing, and each can trigger a corresponding callback.
Modify settings.py to register the custom extension class:
EXTENSIONS = {
    # 'scrapy.extensions.telnet.TelnetConsole': None,
    'Pro_scrapy.extensions.MyExtend': 300,
}
Create extensions.py and define the corresponding extension class:
# -*- coding: utf-8 -*-
# Extension methods
from scrapy import signals

class MyExtend:
    def __init__(self, crawler):
        self.crawler = crawler
        # Register the signal handler
        self.crawler.signals.connect(self.start_engine, signals.engine_started)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def start_engine(self):
        print("start_engine")
At runtime we can observe when each registered component executes: our custom filter runs first; the pipeline runs next, when the spider opens; then our registered extension runs when the engine starts; page responses come back; then the pipeline's close method runs when the spider closes; and finally the filter closes.
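The ordering above is driven by Scrapy's signal dispatching: each component connects a callback to a signal, and the callbacks fire when that stage is reached. As a rough, dependency-free sketch of that pattern (this SignalManager is a simplified stand-in for scrapy.signalmanager.SignalManager, not the real class, so it can run without Scrapy installed):

```python
# Simplified stand-in for Scrapy's signal manager: maps a signal to a
# list of connected callbacks and invokes them, catching errors the way
# send_catch_log does (errors are collected, not raised).
class SignalManager:
    def __init__(self):
        self._receivers = {}  # signal -> list of callbacks

    def connect(self, receiver, signal):
        self._receivers.setdefault(signal, []).append(receiver)

    def send_catch_log(self, signal, **kwargs):
        results = []
        for receiver in self._receivers.get(signal, []):
            try:
                results.append(receiver(**kwargs))
            except Exception as exc:
                results.append(exc)
        return results


# An extension in the same shape as MyExtend above, connecting to two
# (string-named, illustrative) signals instead of one.
class MyExtend:
    def __init__(self, signals):
        self.events = []
        signals.connect(self.engine_started, 'engine_started')
        signals.connect(self.spider_closed, 'spider_closed')

    def engine_started(self):
        self.events.append('engine_started')

    def spider_closed(self, spider=None):
        self.events.append('spider_closed')


signals = SignalManager()
ext = MyExtend(signals)
signals.send_catch_log('engine_started')
signals.send_catch_log('spider_closed', spider='demo')
print(ext.events)  # ['engine_started', 'spider_closed']
```

In real Scrapy the signals are objects from scrapy.signals and the crawler calls send_catch_log itself; the extension only needs from_crawler plus the connect calls.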
9 Settings File Explained
Walking through the settings.py file:
BOT_NAME: the crawler's name;
USER_AGENT: defaults to the bot name plus a domain; it can be set to a browser string to disguise the crawler, e.g. User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0;
ROBOTSTXT_OBEY: whether to obey the robots.txt protocol, True or False; if you obey it, many sites will restrict what you can crawl;
CONCURRENT_REQUESTS: number of concurrent requests;
DOWNLOAD_DELAY: download delay; the value is in seconds;
CONCURRENT_REQUESTS_PER_DOMAIN: number of concurrent requests per domain;
CONCURRENT_REQUESTS_PER_IP: number of concurrent requests per IP;
COOKIES_ENABLED: whether cookies are enabled;
COOKIES_DEBUG: log cookie debug information;
TELNETCONSOLE_ENABLED: whether the crawler's state can be inspected over telnet;
DEFAULT_REQUEST_HEADERS: default request headers;
SPIDER_MIDDLEWARES: spider middlewares;
DOWNLOADER_MIDDLEWARES: downloader middlewares;
DEPTH_LIMIT: maximum recursion depth;
DEPTH_PRIORITY: 0 or 1, depth-first or breadth-first crawling;
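Putting the options above together, a hypothetical settings.py fragment might look like this (the values are illustrative, not recommendations):

```python
# Illustrative settings.py fragment combining the options described above.
BOT_NAME = 'Pro_scrapy'
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) '
              'Gecko/20100101 Firefox/69.0')   # disguise as a browser
ROBOTSTXT_OBEY = False            # ignore robots.txt restrictions
CONCURRENT_REQUESTS = 16          # global concurrent request limit
DOWNLOAD_DELAY = 1                # seconds between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8
COOKIES_ENABLED = True
DEPTH_LIMIT = 3                   # stop recursing past 3 levels of links
DEPTH_PRIORITY = 1                # bias the crawl toward breadth-first
```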
10 Dynamic Request Delay (AutoThrottle)
Parameters in the settings file:
AUTOTHROTTLE_ENABLED: enables the dynamic delay;
DOWNLOAD_DELAY: download delay in seconds; acts as the minimum delay;
AUTOTHROTTLE_MAX_DELAY: maximum delay;
AUTOTHROTTLE_START_DELAY: initial delay;
AUTOTHROTTLE_TARGET_CONCURRENCY: target number of concurrent requests, used when computing the dynamic delay;
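The Scrapy documentation describes the throttling algorithm roughly as: the target delay is the latency of the last response divided by AUTOTHROTTLE_TARGET_CONCURRENCY, and the next delay is the average of the previous delay and that target, clamped between DOWNLOAD_DELAY and AUTOTHROTTLE_MAX_DELAY. A sketch of that calculation (the function name next_delay is ours, not Scrapy's):

```python
# Sketch of the AutoThrottle delay update, per the algorithm described
# in the Scrapy docs. Not Scrapy's actual code.
def next_delay(prev_delay, latency, target_concurrency, min_delay, max_delay):
    target = latency / target_concurrency        # delay aiming at the target concurrency
    new = (prev_delay + target) / 2.0            # smooth toward the target
    return min(max(min_delay, new), max_delay)   # clamp to [min_delay, max_delay]

# Example: previous delay 5s, last response took 2s, target concurrency 1.0,
# DOWNLOAD_DELAY 0.5s, AUTOTHROTTLE_MAX_DELAY 60s.
print(next_delay(5.0, 2.0, 1.0, 0.5, 60.0))  # 3.5
```

Because each step only averages toward the target, the delay adapts gradually rather than reacting to a single fast or slow response.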
11 Request Caching
Parameters in the settings file, as they appear (commented out) in the default template:
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
HTTPCACHE_ENABLED: whether caching is enabled;
HTTPCACHE_DIR: the cache directory;
HTTPCACHE_POLICY: the cache policy; "scrapy.extensions.httpcache.DummyPolicy" caches every request;
from scrapy.extensions.httpcache import DummyPolicy, RFC2616Policy
HTTPCACHE_EXPIRATION_SECS: cache expiration time in seconds;
HTTPCACHE_IGNORE_HTTP_CODES: responses with these status codes are not cached;
HTTPCACHE_STORAGE: the class that stores the cache; we can supply our own class to customize how caching is done;
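Uncommented and filled in, a hypothetical cache configuration might look like this (the expiration and ignored status codes are illustrative choices):

```python
# Illustrative HTTP cache configuration for settings.py.
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 3600              # entries older than 1 hour are refetched (0 = never expire)
HTTPCACHE_DIR = 'httpcache'                   # relative to the project's data dir
HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503] # do not cache server errors
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```

With DummyPolicy every request is served from the cache once stored, which is handy when re-running a spider against pages you have already downloaded.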
This post is from cnblogs (博客园), author: 渔歌晚唱. When reposting, please credit the original link: https://www.cnblogs.com/tangToms/articles/14248637.html
