scrapy实用操作

下载中间件实现代理：

加代理的本质实际就是在请求体重加一点数据；

使用pycharm，可以在scrapy中看到一个downloadmiddleware包，会自动调用httpproxy

内置添加代理功能

import os
import scrapy
from scrapy.http import Request

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']

    def start_requests(self):
        os.environ['HTTP_PROXY'] = "http://192.168.11.11" # 只要加这么一句就可以了

        for url in self.start_urls:
            yield Request(url=url,callback=self.parse)

    def parse(self, response):
        print(response)

自定义下载中间件实现代理功能

从源码的基础上更改

import random
import base64
import six
def to_bytes(text, encoding=None, errors='strict'):
    """Return the binary representation of `text`. If `text`
    is already a bytes object, return it as-is."""
    if isinstance(text, bytes):
        return text
    if not isinstance(text, six.string_types):
        raise TypeError('to_bytes must receive a unicode, str or bytes '
                        'object, got %s' % type(text).__name__)
    if encoding is None:
        encoding = 'utf-8'
    return text.encode(encoding, errors)
    
class MyProxyDownloaderMiddleware(object):
    def process_request(self, request, spider):
        proxy_list = [
            {'ip_port': '111.11.228.75:80', 'user_pass': 'xxx:123'},
            {'ip_port': '120.198.243.22:80', 'user_pass': ''},
            {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
            {'ip_port': '101.71.27.120:80', 'user_pass': ''},
            {'ip_port': '122.96.59.104:80', 'user_pass': ''},
            {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
        ]
        proxy = random.choice(proxy_list)
        if proxy['user_pass'] is not None:
　　　　　　　# 有值要加密
　　　　　　　# 如果拿到用户名和密码那么我们就把他写到请求头里
            request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
            encoded_user_pass = base64.encodestring(to_bytes(proxy['user_pass']))
            request.headers['Proxy-Authorization'] = to_bytes('Basic ' + encoded_user_pass)
        else:
　　　　　　　# 没值直接设置
            request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])



配置：
    DOWNLOADER_MIDDLEWARES = {
       # 'xiaohan.middlewares.MyProxyDownloaderMiddleware': 543,
    }

对于那些不掏钱实用https的

在中间件中设置：

from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory
from twisted.internet.ssl import (optionsForClientTLS, CertificateOptions, PrivateCertificate)

class MySSLFactory(ScrapyClientContextFactory):
    def getCertificateOptions(self):
        from OpenSSL import crypto
        v1 = crypto.load_privatekey(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.key.unsecure', mode='r').read())
        v2 = crypto.load_certificate(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.pem', mode='r').read())
        return CertificateOptions(
            privateKey=v1,  # pKey对象
            certificate=v2,  # X509对象
            verify=False,
            method=getattr(self, 'method', getattr(self, '_ssl_method', None))
        )

配置： 
    DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
    DOWNLOADER_CLIENTCONTEXTFACTORY = "xiaohan.middlewares.MySSLFactory"

View Code

posted @ 2018-05-14 15:49 沈俊杰阅读(87) 评论(0) 收藏举报

刷新页面返回顶部

Shenjunjie

scrapy实用操作

下载中间件实现代理：

公告