The Scrapy framework

Reference: http://www.cnblogs.com/wupeiqi/articles/6229292.html (the blog post these notes follow)

What is Scrapy?
  - It gives us an extensible, full-featured crawling framework.

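Scrapy consists mainly of the following components:

Engine (Scrapy Engine)
Controls the data flow of the whole system and triggers events (the core of the framework).
Scheduler
Accepts requests from the engine, pushes them onto a queue, and hands them back when the engine asks for the next one. Think of it as a priority queue of URLs to crawl: it decides which URL is fetched next and also drops duplicate URLs.
Downloader
Downloads page content and returns it to the spiders (the downloader is built on Twisted, an efficient asynchronous networking model).
Spiders
Spiders do the main work: they extract the information you need from specific pages, the so-called items (Item). They can also extract links so that Scrapy goes on to crawl the next page.
Item Pipeline
Processes the items that spiders extract from pages. Its main jobs are persisting items, validating them, and removing unwanted data. After a page is parsed by a spider, its items are sent to the item pipeline and processed by several components in a defined order.
Downloader Middlewares
A layer of hooks between the Scrapy engine and the downloader; it mainly processes the requests and responses passing between the two.
Spider Middlewares
A layer of hooks between the Scrapy engine and the spiders; it mainly processes the spiders' response input and request output.
Scheduler Middlewares
Middleware between the Scrapy engine and the scheduler; it handles the requests and responses sent from the engine to the scheduler.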

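The Scrapy run loop looks roughly like this:

The engine takes a link (URL) from the scheduler for the next fetch
The engine wraps the URL in a request (Request) and passes it to the downloader
The downloader fetches the resource and wraps it in a response (Response)
The spider parses the Response
If an item (Item) is parsed out, it is handed to the item pipeline for further processing
If a link (URL) is parsed out, the URL is handed to the scheduler to wait for crawling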
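
The run loop above can be imitated with a few lines of plain Python. This is only a toy sketch, not Scrapy code: `FAKE_PAGES`, `download`, `parse` and `crawl` are hypothetical names, and the "web" is an in-memory dict.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (items on the page, links on the page).
FAKE_PAGES = {
    "http://example.com/": (["item-a"], ["http://example.com/p2"]),
    "http://example.com/p2": (["item-b"], ["http://example.com/"]),
}


def download(url):
    """Stand-in for the downloader: wraps the page in a 'response'."""
    return {"url": url, "page": FAKE_PAGES[url]}


def parse(response):
    """Stand-in for a spider callback: yields items and follow-up URLs."""
    items, links = response["page"]
    for item in items:
        yield ("item", item)
    for link in links:
        yield ("url", link)


def crawl(start_url):
    scheduler = deque([start_url])  # URLs waiting to be fetched
    seen = {start_url}              # the scheduler also drops duplicates
    pipeline_output = []            # stand-in for the item pipeline
    while scheduler:
        url = scheduler.popleft()            # engine takes a URL from the scheduler
        response = download(url)             # downloader returns a response
        for kind, value in parse(response):  # spider parses the response
            if kind == "item":
                pipeline_output.append(value)  # items go to the pipeline
            elif value not in seen:
                seen.add(value)
                scheduler.append(value)        # new URLs go back to the scheduler
    return pipeline_output


collected = crawl("http://example.com/")
print(collected)
```

The second page links back to the first, and the `seen` set is what stops the loop from running forever.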

1. Installation:

        Installation:
            Linux/mac
                - pip3 install scrapy 
            Windows:
            
                - Install Twisted
                    a. pip3 install wheel
                    b. Download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
                    c. Change into the download directory and run pip3 install Twisted-xxxxx.whl
                - Install scrapy 
                    d. pip3 install scrapy  -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
                - Install pywin32
                    e. pip3 install pywin32  -i http://pypi.douban.com/simple --trusted-host pypi.douban.com


2. Quick start:

        Quick start:
            Django: ## the Django workflow, shown for comparison
                django-admin startproject mysite
                cd mysite 
                python manage.py startapp app01 
                
                # write code
                
                python manage.py runserver
            Scrapy: # the corresponding Scrapy workflow
                
                Create a project:
                    scrapy startproject xianglong
                    cd xianglong
                    scrapy genspider chouti chouti.com 
                    
                    # write code
                    
                    scrapy crawl chouti --nolog


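1. scrapy startproject <project name>
   - Creates a project directory inside the current directory (similar to Django)
2. cd <project name>    # switch into the project
3. scrapy genspider xxx xxx.com 
    e.g. scrapy genspider chouti chouti.com creates a spider named chouti; chouti.com is the domain whose content will be crawled
4. # write code
5. scrapy crawl chouti --nolog 
    ## runs the spider; --nolog suppresses log output, so only what the code itself prints (for example response.text) is shown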
    
    


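ps:
    If a spider yields an Item object, it is passed to pipelines.py for processing
    items.py mainly defines the structure of the data (the Item fields)
    pipelines.py is the persistence component
    
    https://www.cnblogs.com/Stay-J/p/9021444.html   a classmate's notes, for quick reference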


3. Scrapy essentials: writing the spider that parses responses and issues follow-up requests, plus item/pipeline settings

        Scrapy essentials:
            1. spider: write the spider that parses responses and issues follow-up requests.
                
                def parse():
                    - Selector / response.xpath (HtmlXPathSelector in older releases)
                    - yield item 
                    - yield request 
            2. item/pipelines
                Settings:
                    ITEM_PIPELINES = {
                       'xianglong.pipelines.XianglongPipeline': 300,
                    }
                
                Usage:    
                    class XianglongPipeline(object):

                        def process_item(self, item, spider):
                            self.f.write(item['href']+'\n')
                            self.f.flush()

                            return item

                        def open_spider(self, spider):
                            """
                            Called when the spider starts running
                            :param spider:
                            :return:
                            """
                            self.f = open('url.log','w')

                        def close_spider(self, spider):
                            """
                            Called when the spider closes
                            :param spider:
                            :return:
                            """
                            self.f.close()


4. start_requests

    -- start_requests
    
        def start_requests(self):
            for url in self.start_urls:
                yield Request(url=url,callback=self.parse2)
    
        def start_requests(self):
            req_list = []
            for url in self.start_urls:
                req_list.append(Request(url=url,callback=self.parse2))
            return req_list
        
        Both forms work, because Scrapy internally converts the return value into an iterator.
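
Why a generator and a list are interchangeable here can be checked in plain Python. A small sketch, with tuples standing in for Request objects (`as_generator` and `as_list` are hypothetical helper names):

```python
def as_generator(urls):
    # Mirrors the first start_requests variant: yield one request at a time.
    for url in urls:
        yield ("GET", url)


def as_list(urls):
    # Mirrors the second variant: build the whole list, then return it.
    return [("GET", url) for url in urls]


urls = ["http://example.com/a", "http://example.com/b"]

# iter() accepts both a generator and a list, so a consumer that wraps the
# return value in iter() sees the same sequence either way.
from_generator = list(iter(as_generator(urls)))
from_list = list(iter(as_list(urls)))
print(from_generator == from_list)
```

The generator form has the practical advantage that requests are produced lazily instead of all being created up front.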


5. Parsing data: the selector

Selector 
        Turn the response text into selector objects:
            - Option 1:
                response.xpath("//div[@id='content-list']/div[@class='item']")
            - Option 2 (older code used HtmlXPathSelector, which is deprecated in favor of Selector):
                hxs = Selector(response=response)
                items = hxs.xpath("//div[@id='content-list']/div[@class='item']")
        Lookup rules:
            //a
            //div/a
            //a[re:test(@id, "i\d+")]            
            
            items = hxs.xpath("//div[@id='content-list']/div[@class='item']")
            for item in items:
                item.xpath('.//div')
        
        Extracting:
            Selector objects: xpath('/html/body/ul/li/a/@href')
            List of strings:  xpath('/html/body/ul/li/a/@href').extract()
            Single value:     xpath('//body/ul/li/a/@href').extract_first()
        
    
        PS: 
            Standalone use (outside a spider)
                from scrapy.selector import Selector
                from scrapy.http import HtmlResponse
                html = """<!DOCTYPE html>
                <html>
                    <head lang="en">
                        <meta charset="UTF-8">
                        <title></title>
                    </head>
                    <body>
                        <ul>
                            <li class="item-"><a id='i1' href="link.html">first item</a></li>
                            <li class="item-0"><a id='i2' href="llink.html">first item</a></li>
                            <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
                        </ul>
                        <div><a href="llink2.html">second item</a></div>
                    </body>
                </html>
                """
                response = HtmlResponse(url='http://example.com', body=html,encoding='utf-8')
                obj = response.xpath('//a[@id="i1"]/text()').extract_first()
                print(obj)
            
            Chrome
                DevTools can copy an element's XPath (right-click the node in the Elements panel, Copy > Copy XPath)


6. pipelines

    -- pipelines 
    
        - pipeline basics 
            
            class FilePipeline(object):

                def process_item(self, item, spider):
                    print('write to file', item['href'])

                    return item

                def open_spider(self, spider):
                    """
                    Called when the spider starts running
                    :param spider:
                    :return:
                    """
                    print('open file')

                def close_spider(self, spider):
                    """
                    Called when the spider closes
                    :param spider:
                    :return:
                    """
                    print('close file')
        
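        - Multiple pipelines (a lower value means higher priority and runs first)
    
    
        - Multiple pipelines: the return value of process_item is passed to the next pipeline's process_item

        
            PS: to discard an item so that later pipelines never see it:
                from scrapy.exceptions import DropItem
                class FilePipeline(object):

                    def process_item(self, item, spider):
                        print('write to file', item['href'])

                        # return item
                        raise DropItem()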
        - Read values from the settings file, then use them in the pipeline
            
            class FilePipeline(object):
                def __init__(self,path):
                    self.path = path
                    self.f = None

                @classmethod
                def from_crawler(cls, crawler):
                    """
                    Called at initialization to create the pipeline object
                    :param crawler:
                    :return:
                    """
                    path = crawler.settings.get('XL_FILE_PATH')
                    return cls(path)

                def process_item(self, item, spider):
                    self.f.write(item['href']+'\n')
                    return item

                def open_spider(self, spider):
                    """
                    Called when the spider starts running
                    :param spider:
                    :return:
                    """
                    self.f = open(self.path,'w')

                def close_spider(self, spider):
                    """
                    Called when the spider closes
                    :param spider:
                    :return:
                    """
                    self.f.close()
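
The three pipeline rules in this section (lower value runs first, the return value feeds the next pipeline, DropItem stops the chain) can be simulated without Scrapy. A rough sketch under those assumptions: `DropItem` here is a local stand-in for scrapy.exceptions.DropItem, and `NormalizePipeline`, `FilterPipeline` and `run_pipelines` are hypothetical names.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""


class NormalizePipeline:
    def process_item(self, item, spider):
        # Return a cleaned copy; the next pipeline receives this value.
        return dict(item, href=item["href"].strip())


class FilterPipeline:
    def process_item(self, item, spider):
        # Raising DropItem means later pipelines never see the item.
        if not item["href"].startswith("http"):
            raise DropItem("not an absolute URL")
        return item


def run_pipelines(item, pipelines, spider=None):
    """pipelines maps instance -> order value, in the spirit of ITEM_PIPELINES."""
    for pipeline in sorted(pipelines, key=pipelines.get):  # lower value first
        try:
            item = pipeline.process_item(item, spider)
        except DropItem:
            return None
    return item


pipelines = {FilterPipeline(): 500, NormalizePipeline(): 300}
kept = run_pipelines({"href": "  http://example.com/x "}, pipelines)
dropped = run_pipelines({"href": "/relative"}, pipelines)
print(kept, dropped)
```

Although FilterPipeline is listed first in the dict, NormalizePipeline runs first because 300 is lower than 500.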


7. POST / request headers / cookies

    -- POST / request headers / cookies
        Log in to Chouti automatically, then upvote
        
        POST + request headers: 
            from scrapy.http import Request 
            req = Request(
                url='http://dig.chouti.com/login',
                method='POST',
                body='phone=8613121758648&password=woshiniba&oneMonth=1',
                headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
                cookies={},
                callback=self.parse_check_login,
            )
        
        cookies:
            Manual:
                from scrapy.http.cookies import CookieJar

                cookie_dict = {}
                cookie_jar = CookieJar()
                cookie_jar.extract_cookies(response, response.request)
                for k, v in cookie_jar._cookies.items():
                    for i, j in v.items():
                        for m, n in j.items():
                            cookie_dict[m] = n.value
                            
                req = Request(
                    url='http://dig.chouti.com/login',
                    method='POST',
                    headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
                    body='phone=8615131255089&password=pppppppp&oneMonth=1',
                    cookies=cookie_dict, # attach the cookies manually
                    callback=self.check_login
                )
                yield req
            
            Automatic (Scrapy tracks cookies through the cookiejar meta key):
                class ChoutiSpider(scrapy.Spider):
                    name = 'chouti'
                    allowed_domains = ['chouti.com']
                    start_urls = ['http://dig.chouti.com/',]

                    def start_requests(self):
                        for url in self.start_urls:
                            yield Request(url=url,callback=self.parse_index,meta={'cookiejar':True})

                    def parse_index(self,response):
                        req = Request(
                            url='http://dig.chouti.com/login',
                            method='POST',
                            headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
                            body='phone=8613121758648&password=woshiniba&oneMonth=1',
                            callback=self.parse_check_login,
                            meta={'cookiejar': True}
                        )
                        yield req

                    def parse_check_login(self,response):
                        # print(response.text)
                        yield Request(
                            url='https://dig.chouti.com/link/vote?linksId=19440976',
                            method='POST',
                            callback=self.parse_show_result,
                            meta={'cookiejar': True}
                        )

                    def parse_show_result(self,response):
                        print(response.text)
    
        The settings file controls whether cookie handling is enabled:
            # Disable cookies (enabled by default)
            # COOKIES_ENABLED = False


8. Deduplication rules

    -- Deduplication rules
    
        Settings:
            DUPEFILTER_CLASS = 'xianglong.dupe.MyDupeFilter'
        
        Write the class:
            from scrapy.dupefilters import BaseDupeFilter

            class MyDupeFilter(BaseDupeFilter):

                def __init__(self):
                    self.record = set()

                @classmethod
                def from_settings(cls, settings):    # same signature as in the Scrapy source
                    return cls()

                def request_seen(self, request):    # same signature as in the Scrapy source
                    if request.url in self.record:
                        print('already visited', request.url)
                        return True
                    self.record.add(request.url)
                    return False

                def open(self):  # can return deferred    # same signature as in the Scrapy source
                    pass

                def close(self, reason):  # can return a deferred    # same signature as in the Scrapy source
                    pass
        
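        Question: creating a unique identifier for a request
            ## Why create a unique identifier? The two URLs below differ only in the order of their query parameters, so they point to the same resource.
            Comparing raw URL strings (as MyDupeFilter above does) treats them as different and crawls the page twice, wasting resources;
            request_fingerprint normalizes the URL first, so both requests get the same fingerprint and the second one is skipped.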
            
            http://www.oldboyedu.com?id=1&age=2
            http://www.oldboyedu.com?age=2&id=1
            
            from scrapy.utils.request import request_fingerprint
            from scrapy.http import Request


            u1 = Request(url='http://www.oldboyedu.com?id=1&age=2')
            u2 = Request(url='http://www.oldboyedu.com?age=2&id=1')

            result1 = request_fingerprint(u1)
            result2 = request_fingerprint(u2)
            print(result1,result2)
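
The idea behind the fingerprint can be sketched with the standard library alone. This is a simplified stand-in, not Scrapy's actual algorithm: `simple_fingerprint` is a hypothetical helper that sorts the query parameters and hashes the method plus the normalized URL.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def simple_fingerprint(method, url):
    """Hash the method plus a URL whose query parameters are sorted."""
    parts = urlsplit(url)
    # Sorting the (key, value) pairs makes parameter order irrelevant.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    canonical = urlunsplit((parts.scheme, parts.netloc, parts.path or "/", query, ""))
    return hashlib.sha1(f"{method} {canonical}".encode("utf-8")).hexdigest()


fp1 = simple_fingerprint("GET", "http://www.oldboyedu.com?id=1&age=2")
fp2 = simple_fingerprint("GET", "http://www.oldboyedu.com?age=2&id=1")
fp3 = simple_fingerprint("GET", "http://www.oldboyedu.com?id=2&age=2")
print(fp1 == fp2, fp1 == fp3)
```

A dedup filter that stores these fingerprints instead of raw URLs treats the first two requests as the same page while still telling the third one apart.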
        
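        Question: should the visit records be kept in a database? [Store them in a Redis set]
            The visit records can be stored in Redis.
    
        
        Addendum: where is dont_filter actually checked?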
            from scrapy.core.scheduler import Scheduler
            
            def enqueue_request(self, request):
                # request.dont_filter=False: consult the dupefilter
                    # self.df.request_seen(request):
                    #   - True: already visited
                    #   - False: not visited yet
                # request.dont_filter=True: always add the request to the scheduler
                if not request.dont_filter and self.df.request_seen(request):
                    self.df.log(request, self.spider)
                    return False
                # reaching this point means the request is added to the scheduler
                dqok = self._dqpush(request)
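
The short-circuit in that `if` is the whole trick: when dont_filter is true, the dupefilter is never even consulted. A toy sketch of the same decision, with a hypothetical `ToyScheduler` that uses plain URL strings instead of Request objects:

```python
class ToyScheduler:
    def __init__(self):
        self.seen = set()  # stand-in for the dupefilter's records
        self.queue = []    # stand-in for the scheduler queue

    def request_seen(self, url):
        """Return True for a repeat; record the URL on first sight."""
        if url in self.seen:
            return True
        self.seen.add(url)
        return False

    def enqueue_request(self, url, dont_filter=False):
        # `and` short-circuits: with dont_filter=True request_seen is skipped.
        if not dont_filter and self.request_seen(url):
            return False
        self.queue.append(url)
        return True


scheduler = ToyScheduler()
first = scheduler.enqueue_request("http://example.com/")
repeat = scheduler.enqueue_request("http://example.com/")
forced = scheduler.enqueue_request("http://example.com/", dont_filter=True)
print(first, repeat, forced, scheduler.queue)
```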


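9. Downloader middleware / spider middleware

    -- Middleware 
        Question: how do you attach a request header to every request the spider sends?
            Option 1: add the header to each Request object
            
            Option 2: a downloader middleware
                Settings: 
                
                    DOWNLOADER_MIDDLEWARES = {
                       'xianglong.middlewares.UserAgentDownloaderMiddleware': 543,
                    }
                Write the class: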
                    
                    class UserAgentDownloaderMiddleware(object):

                        @classmethod
                        def from_crawler(cls, crawler):
                            # This method is used by Scrapy to create your spiders.
                            s = cls()
                            return s

                        def process_request(self, request, spider):
                            # Called for each request that goes through the downloader
                            # middleware.

                            # Must either:
                            # - return None: continue processing this request
                            # - or return a Response object
                            # - or return a Request object
                            # - or raise IgnoreRequest: process_exception() methods of
                            #   installed downloader middleware will be called

                            request.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"

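                            # return None # continue with the process_request of the remaining middlewares

                            # from scrapy.http import Request
                            # return Request(url='http://www.baidu.com') # goes back to the scheduler; the current request is not processed further

                            # from scrapy.http import HtmlResponse # skips the download; every process_response runs, starting from the last middleware
                            # return HtmlResponse(url='http://www.baidu.com',body=b'asdfuowjelrjaspdoifualskdjf;lajsdf')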

                        def process_response(self, request, response, spider):
                            # Called with the response returned from the downloader.

                            # Must either;
                            # - return a Response object
                            # - return a Request object
                            # - or raise IgnoreRequest
                            return response

                        def process_exception(self, request, exception, spider):
                            # Called when a download handler or a process_request()
                            # (from other downloader middleware) raises an exception.

                            # Must either:
                            # - return None: continue processing this exception
                            # - return a Response object: stops process_exception() chain
                            # - return a Request object: stops process_exception() chain
                            pass

            
            Option 3: the built-in downloader middleware
                Settings file:
                    USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
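
The return-value rules noted in the process_request comments can be modelled in a few lines. This is a simplified sketch, not Scrapy's implementation: requests and responses are plain dicts, Request return values and exceptions are left out, and `AddHeader`, `CacheHit`, `fake_download` and `fetch` are hypothetical names.

```python
class AddHeader:
    def process_request(self, request):
        request["headers"]["User-Agent"] = "toy-agent"
        return None  # None: continue with the next middleware

    def process_response(self, request, response):
        response["trace"].append("AddHeader")
        return response


class CacheHit:
    def process_request(self, request):
        if request["url"] == "http://example.com/cached":
            # Returning a response skips the real download.
            return {"body": "from cache", "trace": []}
        return None

    def process_response(self, request, response):
        response["trace"].append("CacheHit")
        return response


def fake_download(request):
    return {"body": "from network", "trace": []}


def fetch(request, middlewares):
    response = None
    for mw in middlewares:            # process_request runs in order
        response = mw.process_request(request)
        if response is not None:
            break
    if response is None:
        response = fake_download(request)
    for mw in reversed(middlewares):  # process_response runs in reverse order
        response = mw.process_response(request, response)
    return response


chain = [AddHeader(), CacheHit()]
normal_request = {"url": "http://example.com/", "headers": {}}
normal = fetch(normal_request, chain)
cached = fetch({"url": "http://example.com/cached", "headers": {}}, chain)
print(normal, cached)
```

In both cases the response passes back through every middleware starting from the last one, which is why the trace lists CacheHit before AddHeader.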


    1. Downloader middleware
        Question: how do you add a proxy in Scrapy?
        Solutions:
            Option 1: the built-in proxy support
                # -*- coding: utf-8 -*-
                import os
                import scrapy
                from scrapy.http import Request

                class ChoutiSpider(scrapy.Spider):
                    name = 'chouti'
                    allowed_domains = ['chouti.com']
                    start_urls = ['https://dig.chouti.com/']

                    def start_requests(self):
                        os.environ['HTTP_PROXY'] = "http://192.168.11.11"

                        for url in self.start_urls:
                            yield Request(url=url,callback=self.parse)

                    def parse(self, response):
                        print(response)

            Option 2: a custom downloader middleware
                import random
                import base64

                def to_bytes(text, encoding=None, errors='strict'):
                    """Return the binary representation of `text`. If `text`
                    is already a bytes object, return it as-is."""
                    if isinstance(text, bytes):
                        return text
                    if not isinstance(text, str):
                        raise TypeError('to_bytes must receive a str or bytes '
                                        'object, got %s' % type(text).__name__)
                    if encoding is None:
                        encoding = 'utf-8'
                    return text.encode(encoding, errors)
                    
                class MyProxyDownloaderMiddleware(object):
                    def process_request(self, request, spider):
                        proxy_list = [
                            {'ip_port': '111.11.228.75:80', 'user_pass': 'xxx:123'},
                            {'ip_port': '120.198.243.22:80', 'user_pass': ''},
                            {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
                            {'ip_port': '101.71.27.120:80', 'user_pass': ''},
                            {'ip_port': '122.96.59.104:80', 'user_pass': ''},
                            {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
                        ]
                        proxy = random.choice(proxy_list)
                        # the proxy URL is set for every entry
                        request.meta['proxy'] = "http://%s" % proxy['ip_port']
                        # an empty user_pass means the proxy needs no authentication
                        if proxy['user_pass']:
                            # b64encode replaces encodestring, which was removed in Python 3.9
                            encoded_user_pass = base64.b64encode(to_bytes(proxy['user_pass']))
                            request.headers['Proxy-Authorization'] = b'Basic ' + encoded_user_pass
    
    
    
                Settings:
                    DOWNLOADER_MIDDLEWARES = {
                       # 'xiaohan.middlewares.MyProxyDownloaderMiddleware': 543,
                    }
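
The header-building step of the proxy middleware can be tried on its own, without Scrapy. A minimal sketch, assuming the same `ip_port` / `user_pass` entry layout as above; `proxy_settings` is a hypothetical helper and the addresses are just the sample values from these notes.

```python
import base64
import random

PROXIES = [
    {"ip_port": "111.11.228.75:80", "user_pass": "xxx:123"},
    {"ip_port": "120.198.243.22:80", "user_pass": ""},
]


def proxy_settings(proxy):
    """Return (proxy_url, auth_header or None) for one proxy entry."""
    proxy_url = "http://%s" % proxy["ip_port"]
    if proxy["user_pass"]:
        # Basic auth is "Basic " followed by base64("user:password").
        token = base64.b64encode(proxy["user_pass"].encode("utf-8"))
        return proxy_url, b"Basic " + token
    return proxy_url, None


with_auth = proxy_settings(PROXIES[0])
without_auth = proxy_settings(PROXIES[1])
picked_url, picked_auth = proxy_settings(random.choice(PROXIES))
print(with_auth, without_auth, picked_url)
```

Inside a middleware, the first value would go into request.meta['proxy'] and the second, when present, into the Proxy-Authorization header.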
                        
                        
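        Question: how does Scrapy handle https?
            Paid certificate (issued by a trusted CA):
                pass  # supported by default, nothing to configure
            Free certificate (custom or self-signed):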
                from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory
                from twisted.internet.ssl import (optionsForClientTLS, CertificateOptions, PrivateCertificate)

                class MySSLFactory(ScrapyClientContextFactory):
                    def getCertificateOptions(self):
                        from OpenSSL import crypto
                        v1 = crypto.load_privatekey(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.key.unsecure', mode='r').read())
                        v2 = crypto.load_certificate(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.pem', mode='r').read())
                        return CertificateOptions(
                            privateKey=v1,  # PKey object
                            certificate=v2,  # X509 object
                            verify=False,
                            method=getattr(self, 'method', getattr(self, '_ssl_method', None))
                        )
    
                Settings: 
                    DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
                    DOWNLOADER_CLIENTCONTEXTFACTORY = "xiaohan.middlewares.MySSLFactory"
            
    
        Summary:
            Q: What is downloader middleware for?
            A: It lets you customize requests before each download and responses after it, e.g. user-agent / proxy / cookies


2. Spider middleware 
        Write:
            middlewares.py
                class XiaohanSpiderMiddleware(object):
                    # Not all methods need to be defined. If a method is not defined,
                    # scrapy acts as if the spider middleware does not modify the
                    # passed objects.
                    def __init__(self):
                        pass
                    @classmethod
                    def from_crawler(cls, crawler):
                        # This method is used by Scrapy to create your spiders.
                        s = cls()
                        return s

                    # Runs after each download finishes and before the parse callback is called.
                    def process_spider_input(self, response, spider):
                        # Called for each response that goes through the spider
                        # middleware and into the spider.

                        # Should return None or raise an exception.
                        print('process_spider_input',response)
                        return None

                    def process_spider_output(self, response, result, spider):
                        # Called with the results returned from the Spider, after
                        # it has processed the response.

                        # Must return an iterable of Request, dict or Item objects.
                        print('process_spider_output',response)
                        for i in result:
                            yield i

                    def process_spider_exception(self, response, exception, spider):
                        # Called when a spider or process_spider_input() method
                        # (from other spider middleware) raises an exception.

                        # Should return either None or an iterable of Response, dict
                        # or Item objects.
                        pass

                    # Triggered when the spider starts and start_requests is consumed for the first time (runs only once).
                    def process_start_requests(self, start_requests, spider):
                        # Called with the start requests of the spider, and works
                        # similarly to the process_spider_output() method, except
                        # that it doesn’t have a response associated.

                        # Must return only requests (not items).

                        print('process_start_requests')
                        for r in start_requests:
                            yield r

        Enable it:
            SPIDER_MIDDLEWARES = {
               'xiaohan.middlewares.XiaohanSpiderMiddleware': 543,
            }



10. Extensions: signals

    -- Extensions: signals 
        A bare extension:
            extends.py 
                class MyExtension(object):
                    def __init__(self):
                        pass

                    @classmethod
                    def from_crawler(cls, crawler):
                        obj = cls()
                        return obj
            Settings:
                EXTENSIONS = {
                    'xiaohan.extends.MyExtension':500,
                }
        
        Extension + signals:
            extends.py 
                from scrapy import signals


                class MyExtension(object):
                    def __init__(self):
                        pass

                    @classmethod
                    def from_crawler(cls, crawler):
                        obj = cls()
                        # when the spider opens, every handler connected to spider_opened is called: here xxxxxxxxxxx1
                        crawler.signals.connect(obj.xxxxxxxxxxx1, signal=signals.spider_opened)
                        # when the spider closes, every handler connected to spider_closed is called: here uuuuuuuuuu
                        crawler.signals.connect(obj.uuuuuuuuuu, signal=signals.spider_closed)
                        return obj

                    def xxxxxxxxxxx1(self, spider):
                        print('open')

                    def uuuuuuuuuu(self, spider):
                        print('close')
        
            Settings:
                EXTENSIONS = {
                    'xiaohan.extends.MyExtension':500,
                }
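
How connect-then-fire works can be seen in a toy dispatcher. This is only an illustration of the pattern, not Scrapy's signal machinery: `ToySignals`, `from_signals` and the two string constants are hypothetical stand-ins for crawler.signals, from_crawler and scrapy.signals.

```python
SPIDER_OPENED = "spider_opened"
SPIDER_CLOSED = "spider_closed"


class ToySignals:
    """Tiny stand-in for a signal manager."""

    def __init__(self):
        self._handlers = {}

    def connect(self, handler, signal):
        self._handlers.setdefault(signal, []).append(handler)

    def send(self, signal, **kwargs):
        # Call every handler registered for this signal.
        for handler in self._handlers.get(signal, []):
            handler(**kwargs)


class MyExtension:
    def __init__(self):
        self.events = []

    @classmethod
    def from_signals(cls, signals):
        # Same shape as from_crawler: build the object, then wire handlers.
        obj = cls()
        signals.connect(obj.on_open, signal=SPIDER_OPENED)
        signals.connect(obj.on_close, signal=SPIDER_CLOSED)
        return obj

    def on_open(self, spider):
        self.events.append(("open", spider))

    def on_close(self, spider):
        self.events.append(("close", spider))


signals = ToySignals()
ext = MyExtension.from_signals(signals)
signals.send(SPIDER_OPENED, spider="chouti")
signals.send(SPIDER_CLOSED, spider="chouti")
print(ext.events)
```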


11. Miscellaneous: the settings file

# -*- coding: utf-8 -*-

# Scrapy settings for step8_king project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

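1. Bot (project) name
BOT_NAME = 'xiaohan'  #BOT_NAME = 'XXX' 

2. Modules where Scrapy looks for (and creates) spiders
SPIDER_MODULES = ['xiaohan.spiders']
NEWSPIDER_MODULE = 'xiaohan.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent

3. Client User-Agent request header
# USER_AGENT = 'xiaohan (+http://www.yourdomain.com)'
# USER_AGENT = 'XXX'  ## copy this header from the browser's developer tools and paste it here
# Obey robots.txt rules

4. robots.txt handling            # the default here is True, and True is what you normally use
    # whether to follow the robots exclusion protocol
    # True    fetch the site's robots.txt first and skip URLs it disallows
    # False   crawl regardless
# ROBOTSTXT_OBEY = True
# ROBOTSTXT_OBEY = False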

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# 5. Number of concurrent requests
# CONCURRENT_REQUESTS = 4    # number of concurrent requests, all handled in a single thread

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# 6. Download delay in seconds
# DOWNLOAD_DELAY = 2


# The download delay setting will honor only one of:
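# 7. Concurrent requests per domain; the download delay is also applied per domain
# CONCURRENT_REQUESTS_PER_DOMAIN = 2    # per domain: how many requests are in flight at once (finer-grained than the setting above)
# Concurrent requests per IP; if set, CONCURRENT_REQUESTS_PER_DOMAIN is ignored and the download delay is applied per IP
# CONCURRENT_REQUESTS_PER_IP = 3    # finer still: how many requests go to a single IP behind the domain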

# Disable cookies (enabled by default)
# 8. Whether cookies are enabled; cookies are managed through a cookiejar
# COOKIES_ENABLED = True
# COOKIES_DEBUG = True

# Disable Telnet Console (enabled by default)
# 9. The Telnet console lets you inspect and control the running crawler
#    Connect with telnet ip port, then issue commands        ## for example, a command that stops the crawl
# TELNETCONSOLE_ENABLED = True
# TELNETCONSOLE_HOST = '127.0.0.1'
# TELNETCONSOLE_PORT = [6023,]


# 10. Default request headers
# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#     'Accept-Language': 'en',
# }


# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
# 11. Pipelines that process items
# ITEM_PIPELINES = {
#    'step8_king.pipelines.JsonPipeline': 700,
#    'step8_king.pipelines.FilePipeline': 500,
# }



# 12. Custom extensions, invoked through signals
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     # 'step8_king.extensions.MyExtension': 500,
# }


# 13. Maximum crawl depth; the current depth is available in meta; 0 means no limit
# DEPTH_LIMIT = 3

# 14. Crawl order: 0 means depth-first, LIFO (the default); 1 means breadth-first, FIFO    # rarely changed

# Last in, first out: depth-first
# DEPTH_PRIORITY = 0
# SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'
# First in, first out: breadth-first

# DEPTH_PRIORITY = 1
# SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
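
Why the queue type decides the crawl order is easy to see on a toy link graph. A small sketch, with a hypothetical `TREE` of pages and a `crawl_order` helper; it ignores priorities and only models the LIFO versus FIFO choice.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
TREE = {
    "root": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [],
    "a2": [],
    "b1": [],
}


def crawl_order(start, lifo):
    pending = deque([start])
    order = []
    while pending:
        # LIFO takes the newest entry (goes deep first),
        # FIFO takes the oldest entry (finishes each level first).
        node = pending.pop() if lifo else pending.popleft()
        order.append(node)
        pending.extend(TREE[node])
    return order


depth_first = crawl_order("root", lifo=True)
breadth_first = crawl_order("root", lifo=False)
print(depth_first)
print(breadth_first)
```

With the LIFO queue the crawl finishes the whole "b" branch before touching "a"; with the FIFO queue it visits both first-level pages before any second-level page.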

# 15. Scheduler    # queues pending requests; in real deployments a Redis-backed scheduler is used instead
# SCHEDULER = 'scrapy.core.scheduler.Scheduler'
# from scrapy.core.scheduler import Scheduler


# 16. URL deduplication
# DUPEFILTER_CLASS = 'step8_king.duplication.RepeatUrl'


# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html

"""
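17. The auto-throttle algorithm        ## these settings can be added as well
    from scrapy.extensions.throttle import AutoThrottle
    Auto-throttle settings
    1. Read the minimum delay: DOWNLOAD_DELAY
    2. Read the maximum delay: AUTOTHROTTLE_MAX_DELAY
    3. Set the initial download delay: AUTOTHROTTLE_START_DELAY
    4. When a download finishes, take its latency, i.e. the time between opening the connection and receiving the response headers
    5. Used in the calculation below: AUTOTHROTTLE_TARGET_CONCURRENCY
    target_delay = latency / self.target_concurrency
    new_delay = (slot.delay + target_delay) / 2.0 # slot.delay is the previous delay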
    new_delay = max(target_delay, new_delay)
    new_delay = min(max(self.mindelay, new_delay), self.maxdelay)
    slot.delay = new_delay
"""

# Enable auto-throttling
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 10
# The average number of requests Scrapy should be sending in parallel to each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = True

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings


"""
18. Enabling the cache
    Caches requests that have been sent and their responses so they can be reused later
    
    from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware
    from scrapy.extensions.httpcache import DummyPolicy
    from scrapy.extensions.httpcache import FilesystemCacheStorage
"""
# Whether to enable the HTTP cache
# HTTPCACHE_ENABLED = True

# Cache policy: cache every request; later requests are served straight from the cache
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"
# Cache policy: follow HTTP response headers such as Cache-Control and Last-Modified
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"

# Cache expiration time in seconds
# HTTPCACHE_EXPIRATION_SECS = 0

# Directory where the cache is stored
# HTTPCACHE_DIR = 'httpcache'

# HTTP status codes that are not cached
# HTTPCACHE_IGNORE_HTTP_CODES = []

# Cache storage backend
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


"""
19. Proxies, configured through environment variables
    from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware
    
    Option 1: use the default
        os.environ
        {
            http_proxy:http://root:woshiniba@192.168.11.11:9999/
            https_proxy:http://192.168.11.11:9999/
        }
    Option 2: use a custom downloader middleware
    
    import base64
    import random

    def to_bytes(text, encoding=None, errors='strict'):
        if isinstance(text, bytes):
            return text
        if not isinstance(text, str):
            raise TypeError('to_bytes must receive a str or bytes '
                            'object, got %s' % type(text).__name__)
        if encoding is None:
            encoding = 'utf-8'
        return text.encode(encoding, errors)
        
    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            PROXIES = [
                {'ip_port': '111.11.228.75:80', 'user_pass': ''},
                {'ip_port': '120.198.243.22:80', 'user_pass': ''},
                {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
                {'ip_port': '101.71.27.120:80', 'user_pass': ''},
                {'ip_port': '122.96.59.104:80', 'user_pass': ''},
                {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
            ]
            proxy = random.choice(PROXIES)
            # the proxy URL is set for every entry
            request.meta['proxy'] = "http://%s" % proxy['ip_port']
            # an empty user_pass means the proxy needs no authentication
            if proxy['user_pass']:
                # b64encode replaces encodestring, which was removed in Python 3.9
                encoded_user_pass = base64.b64encode(to_bytes(proxy['user_pass']))
                request.headers['Proxy-Authorization'] = b'Basic ' + encoded_user_pass
                print("**************ProxyMiddleware have pass************" + proxy['ip_port'])
            else:
                print("**************ProxyMiddleware no pass************" + proxy['ip_port'])
    
    DOWNLOADER_MIDDLEWARES = {
       'step8_king.middlewares.ProxyMiddleware': 500,
    }
    
"""

"""
20. HTTPS access
    There are two cases when crawling over HTTPS:
    1. The target site uses a trusted certificate (supported by default)
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "scrapy.core.downloader.contextfactory.ScrapyClientContextFactory"
        
    2. The target site uses a custom certificate
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "step8_king.https.MySSLFactory"
        
        # https.py
        from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory
        from twisted.internet.ssl import (optionsForClientTLS, CertificateOptions, PrivateCertificate)
        
        class MySSLFactory(ScrapyClientContextFactory):
            def getCertificateOptions(self):
                from OpenSSL import crypto
                # Read the client key and certificate, closing the files afterwards
                with open('/Users/wupeiqi/client.key.unsecure', mode='r') as f:
                    v1 = crypto.load_privatekey(crypto.FILETYPE_PEM, f.read())
                with open('/Users/wupeiqi/client.pem', mode='r') as f:
                    v2 = crypto.load_certificate(crypto.FILETYPE_PEM, f.read())
                return CertificateOptions(
                    privateKey=v1,  # PKey object
                    certificate=v2,  # X509 object
                    verify=False,
                    method=getattr(self, 'method', getattr(self, '_ssl_method', None))
                )
    Other:
        Related classes
            scrapy.core.downloader.handlers.http.HttpDownloadHandler
            scrapy.core.downloader.webclient.ScrapyHTTPClientFactory
            scrapy.core.downloader.contextfactory.ScrapyClientContextFactory
        Related settings
            DOWNLOADER_HTTPCLIENTFACTORY
            DOWNLOADER_CLIENTCONTEXTFACTORY

"""

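
Returning to Option 1 of the proxy setup (item 19): the sketch below uses only the standard library to show how proxy URLs set in `os.environ` can be read back, and how the `user:password` part of such a URL turns into a `Proxy-Authorization` value. The helper name `proxy_auth_header` is made up for this example, and the addresses are the sample values from item 19. This illustrates the mechanism only; it is not a copy of what Scrapy does internally.

```python
import base64
import os
from urllib.parse import urlsplit
from urllib.request import getproxies


def proxy_auth_header(proxy_url):
    """Return a Basic Proxy-Authorization value, or None if the URL has no credentials."""
    parts = urlsplit(proxy_url)
    if not parts.username:
        return None
    user_pass = "%s:%s" % (parts.username, parts.password or "")
    return b"Basic " + base64.b64encode(user_pass.encode("utf-8"))


# Sample values from item 19; set them before the crawl starts.
os.environ["http_proxy"] = "http://root:woshiniba@192.168.11.11:9999/"
os.environ["https_proxy"] = "http://192.168.11.11:9999/"

proxies = getproxies()  # maps a scheme such as "http" to its proxy URL
http_header = proxy_auth_header(proxies["http"])    # credentials present
https_header = proxy_auth_header(proxies["https"])  # no credentials, so None
```

Keeping credentials in the proxy URL means nothing else has to be configured, at the cost of storing the password in plain text in the environment.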

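12. Custom commands

Create a directory with any name at the same level as spiders, e.g. commands
Inside it, create a crawlall.py file (the file name becomes the name of the custom command)

In settings.py, add the setting COMMANDS_MODULE = '<project name>.<directory name>'
Run the command from the project directory:
  scrapy crawlall  # run in a terminal
  or:
  scrapy crawlall --nolog  # the custom command, which starts every spider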

    -- Custom command (crawlall.py)
        from scrapy.commands import ScrapyCommand
        from scrapy.utils.project import get_project_settings


        class Command(ScrapyCommand):
            requires_project = True

            def syntax(self):
                return '[options]'

            def short_desc(self):
                return 'Runs all of the spiders'

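            def run(self, args, opts):
                # spider_loader replaces the older crawler_process.spiders attribute
                spider_list = self.crawler_process.spider_loader.list()
                for name in spider_list:
                    # Schedule every spider; nothing runs until start() is called
                    self.crawler_process.crawl(name, **opts.__dict__)
                self.crawler_process.start()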
                
                
            PS: source code
                def run(self, args, opts):
                    from scrapy.crawler import CrawlerProcess
                    CrawlerProcess.crawl
                    CrawlerProcess.start
                    """
                    The self.crawler_process object holds: _active = {d,}
                    """
                    self.crawler_process.crawl('chouti',**opts.__dict__)
                    self.crawler_process.crawl('cnblogs',**opts.__dict__)
                    #
                    self.crawler_process.start()
            Share: source code
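
Putting the setup steps for the custom command together, the layout and the one required setting might look like the sketch below. The project name `xianglong` is the one created in the quick-start section and `commands` is the example directory name from the steps above; both are placeholders for your own names. The empty `__init__.py` is the usual way to make the directory importable as a package.

```python
# Assumed project layout:
#
# xianglong/
#     scrapy.cfg
#     xianglong/
#         __init__.py
#         settings.py
#         spiders/
#         commands/
#             __init__.py      # empty file so that "commands" is a package
#             crawlall.py      # the Command class shown above
#
# settings.py: tell Scrapy which module holds the custom commands
COMMANDS_MODULE = "xianglong.commands"
```

With this in place, `scrapy crawlall` is run from the directory that contains `scrapy.cfg`.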


Summary: importance ratings
  Downloader middleware (*****)
  Spider middleware (***)
  Extensions: signals (***)
  Settings (*****)
  Custom commands (*****)

posted @ 2018-05-14 17:08 Justin壮志凌云
