Python Learning: Python Crawlers with Scrapy (3): Scrapy Extensions and the Settings File Explained

The Scrapy Framework

  This post mainly covers the extension interface Scrapy provides: with it we can run our own code at any point of a Scrapy crawl. When an extension's code fires is controlled by the signals that Scrapy exposes.

8. Scrapy Extensions

Edit settings.py and uncomment the EXTENSIONS setting; we can then study the extension functionality that scrapy.extensions.telnet.TelnetConsole provides. (Note that extensions ignore the order value, and a value of None actually disables an extension; the entry is shown here mainly to locate the class.)

EXTENSIONS = {
    'scrapy.extensions.telnet.TelnetConsole': None,
}

The corresponding module and source code:

from scrapy.extensions.telnet import TelnetConsole

class TelnetConsole(protocol.ServerFactory):

    def __init__(self, crawler):
        # An extension disables itself by raising NotConfigured
        if not crawler.settings.getbool('TELNETCONSOLE_ENABLED'):
            raise NotConfigured
        if not TWISTED_CONCH_AVAILABLE:
            raise NotConfigured(
                'TELNETCONSOLE_ENABLED setting is True but required twisted '
                'modules failed to import:\n' + _TWISTED_CONCH_TRACEBACK)
        self.crawler = crawler
        self.noisy = False
        self.portrange = [int(x) for x in crawler.settings.getlist('TELNETCONSOLE_PORT')]
        self.host = crawler.settings['TELNETCONSOLE_HOST']
        self.username = crawler.settings['TELNETCONSOLE_USERNAME']
        self.password = crawler.settings['TELNETCONSOLE_PASSWORD']

        if not self.password:
            self.password = binascii.hexlify(os.urandom(8)).decode('utf8')
            logger.info('Telnet Password: %s', self.password)

        # Hook the extension's methods to the engine start/stop signals
        self.crawler.signals.connect(self.start_listening, signals.engine_started)
        self.crawler.signals.connect(self.stop_listening, signals.engine_stopped)

    @classmethod
    def from_crawler(cls, crawler):
        # Entry point: Scrapy instantiates every extension through from_crawler()
        return cls(crawler)

    def start_listening(self):
        self.port = listen_tcp(self.portrange, self.host, self)
        h = self.port.getHost()
        logger.info("Telnet console listening on %(host)s:%(port)d",
                    {'host': h.host, 'port': h.port},
                    extra={'crawler': self.crawler})

    def stop_listening(self):
        self.port.stopListening()

    def protocol(self):
        class Portal:
            """An implementation of IPortal"""
            @defers
            # the name 'self_' keeps the enclosing TelnetConsole's 'self' reachable
            def login(self_, credentials, mind, *interfaces):
                if not (credentials.username == self.username.encode('utf8') and
                        credentials.checkPassword(self.password.encode('utf8'))):
                    raise ValueError("Invalid credentials")

                protocol = telnet.TelnetBootstrapProtocol(
                    insults.ServerProtocol,
                    manhole.Manhole,
                    self._get_telnet_vars()
                )
                return (interfaces[0], protocol, lambda: None)

        return telnet.TelnetTransport(
            telnet.AuthenticatingTelnetProtocol,
            Portal()
        )

    def _get_telnet_vars(self):
        # Note: if you add entries here also update topics/telnetconsole.rst
        telnet_vars = {
            'engine': self.crawler.engine,
            'spider': self.crawler.engine.spider,
            'slot': self.crawler.engine.slot,
            'crawler': self.crawler,
            'extensions': self.crawler.extensions,
            'stats': self.crawler.stats,
            'settings': self.crawler.settings,
            'est': lambda: print_engine_status(self.crawler.engine),
            'p': pprint.pprint,
            'prefs': print_live_refs,
            'hpy': hpy,
            'help': "This is Scrapy telnet console. For more info see: "
                    "https://docs.scrapy.org/en/latest/topics/telnetconsole.html",
        }
        self.crawler.signals.send_catch_log(update_telnet_vars, telnet_vars=telnet_vars)
        return telnet_vars
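With the extension enabled, the host and port printed in the log (by default Scrapy tries the first free port in the 6023-6073 range) can be reached with any telnet client, logging in with the username from TELNETCONSOLE_USERNAME and the password printed above. The _get_telnet_vars() dictionary is exactly what becomes available in that session: the engine, the spider, stats, the est() status helper, and so on.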

 

Modeled on Scrapy's built-in extension, we can now write a custom one.

Signals: Scrapy defines a signal for every stage of the crawl, and an extension can connect a handler method to any of them.

Edit settings.py and register the custom extension class:

EXTENSIONS = {
    # 'scrapy.extensions.telnet.TelnetConsole': None,
    'Pro_scrapy.extensions.MyExtend': 300,
}

 

Create extensions.py with the corresponding extension class:

# -*- coding: utf-8 -*-
# Custom extension
from scrapy import signals

class MyExtend:
    def __init__(self, crawler):
        self.crawler = crawler
        # register a handler for the engine_started signal
        self.crawler.signals.connect(self.start_engine, signals.engine_started)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def start_engine(self):
        print("start_engine")

  At runtime we can see when each registered hook fires: our custom filter (the dupefilter) opens first; the pipeline runs next, when the spider opens; then our registered extension fires, when the engine starts; the pages are fetched and returned; the pipeline's close runs when the spider closes; and finally the filter closes. A custom extension can hook several of these stages at once, as the sketch below shows.
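For reference, here is a sketch of an extension that connects handlers to several built-in signals at once (the class name and printed messages are illustrative, not from the original project; each handler's argument list matches what its signal sends):

from scrapy import signals

class LifecycleLogger:
    """Sketch: print a line at several stages of a crawl."""

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # one connect() call per stage we want to observe
        crawler.signals.connect(ext.engine_started, signals.engine_started)
        crawler.signals.connect(ext.spider_opened, signals.spider_opened)
        crawler.signals.connect(ext.item_scraped, signals.item_scraped)
        crawler.signals.connect(ext.spider_closed, signals.spider_closed)
        return ext

    def engine_started(self):
        print("engine started")

    def spider_opened(self, spider):
        print("spider opened:", spider.name)

    def item_scraped(self, item, response, spider):
        print("item scraped from", response.url)

    def spider_closed(self, spider, reason):
        print("spider closed:", reason)

It would be registered in EXTENSIONS exactly like MyExtend above.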

 

9. The Settings File Explained

    Walking through settings.py:

    BOT_NAME: the crawler (bot) name;

    USER_AGENT: by default the bot name plus a project URL; it can be set to disguise the crawler as a browser, e.g. User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0;

    ROBOTSTXT_OBEY: whether to obey the robots.txt protocol, True or False; obeying it means many sites will restrict the crawl;

    CONCURRENT_REQUESTS: the number of concurrent requests;

    DOWNLOAD_DELAY: the download delay, a number in seconds;

    CONCURRENT_REQUESTS_PER_DOMAIN: the number of concurrent requests per domain;

    CONCURRENT_REQUESTS_PER_IP: the number of concurrent requests per IP;

    COOKIES_ENABLED: whether cookies are enabled;

    COOKIES_DEBUG: log debug information about cookies;

    TELNETCONSOLE_ENABLED: whether the crawler can be monitored over telnet;

    DEFAULT_REQUEST_HEADERS: the default request headers;

    SPIDER_MIDDLEWARES: spider middlewares;

    DOWNLOADER_MIDDLEWARES: downloader middlewares;

    DEPTH_LIMIT: the maximum recursion (crawl) depth;

    DEPTH_PRIORITY: 0 or 1, depth-first or breadth-first crawling; a sample configuration follows this list.
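As a rough illustration, these settings might be combined in settings.py as follows (the concrete values are placeholders, not recommendations from this post):

# settings.py -- illustrative values only
BOT_NAME = 'Pro_scrapy'
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) '
              'Gecko/20100101 Firefox/69.0')   # pose as Firefox
ROBOTSTXT_OBEY = False              # ignore robots.txt restrictions
CONCURRENT_REQUESTS = 16            # global concurrency cap
DOWNLOAD_DELAY = 1                  # one second between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8
COOKIES_ENABLED = True
TELNETCONSOLE_ENABLED = True
DEPTH_LIMIT = 3                     # do not follow links deeper than 3 hops
DEPTH_PRIORITY = 1                  # a positive value leans toward breadth-first

Note that for a genuinely breadth-first crawl the Scrapy documentation also recommends switching the scheduler queues to FIFO (SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue' and SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue').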

 

10. Dynamic Request Delay (AutoThrottle)

The relevant settings:

AUTOTHROTTLE_ENABLED: enables dynamic (auto-throttled) delays;

DOWNLOAD_DELAY: the download delay in seconds; with AutoThrottle enabled it acts as the minimum delay;

AUTOTHROTTLE_MAX_DELAY: the maximum delay;

AUTOTHROTTLE_START_DELAY: the initial delay;

AUTOTHROTTLE_TARGET_CONCURRENCY: the average number of concurrent requests per remote site to aim for, used to compute the dynamic delay; a sample configuration follows.
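A sketch of such a configuration (values are illustrative). Per the Scrapy documentation, the throttling algorithm aims the per-site delay at roughly latency / AUTOTHROTTLE_TARGET_CONCURRENCY, averages it with the previous delay, and clamps the result between DOWNLOAD_DELAY and AUTOTHROTTLE_MAX_DELAY:

# settings.py -- illustrative AutoThrottle configuration
AUTOTHROTTLE_ENABLED = True
DOWNLOAD_DELAY = 0.5                    # floor for the computed delay (seconds)
AUTOTHROTTLE_START_DELAY = 5            # start conservatively
AUTOTHROTTLE_MAX_DELAY = 60             # ceiling for the computed delay
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # aim for ~1 request in flight per site
AUTOTHROTTLE_DEBUG = True               # log throttling stats for each response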

 

11. Request Caching

The relevant settings, shown here as they appear (commented out) in the default settings.py:

#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

HTTPCACHE_ENABLED: whether caching is enabled;

HTTPCACHE_DIR: the cache directory;

HTTPCACHE_POLICY: the cache policy; "scrapy.extensions.httpcache.DummyPolicy" caches every request:

from scrapy.extensions.httpcache import DummyPolicy, RFC2616Policy

HTTPCACHE_EXPIRATION_SECS: the cache expiration time in seconds (0 means cached pages never expire);

HTTPCACHE_IGNORE_HTTP_CODES: HTTP status codes whose responses are never cached;

HTTPCACHE_STORAGE: the class that stores the cache; we can supply our own class to customize how responses are cached, as the sketch below shows.
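A minimal sketch of that last point, assuming the storage interface of the built-in FilesystemCacheStorage (the class name and log messages here are illustrative): subclass it and override retrieve_response()/store_response():

# extensions.py -- illustrative custom cache storage
import logging
from scrapy.extensions.httpcache import FilesystemCacheStorage

logger = logging.getLogger(__name__)

class LoggingCacheStorage(FilesystemCacheStorage):
    """Filesystem cache that logs every cache hit and store."""

    def retrieve_response(self, spider, request):
        response = super().retrieve_response(spider, request)
        if response is not None:
            logger.debug('cache hit: %s', request.url)
        return response

    def store_response(self, spider, request, response):
        logger.debug('caching: %s -> %s', request.url, response.status)
        super().store_response(spider, request, response)

It would then be registered with HTTPCACHE_STORAGE = 'Pro_scrapy.extensions.LoggingCacheStorage'.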
