Scrapy Proxy

Scrapy supports two ways to configure a proxy:
1. In the spider code
2. Via the project configuration (a downloader middleware)

1. In the spider code

import base64

import scrapy

class muhe(scrapy.Spider):
    name = "mutian"

    def start_requests(self):
        url = 'http://www.baidu.com'
        # Base64-encode the proxy credentials for HTTP Basic auth
        proxy_user_pass = b"username:password"
        encoded_user_pass = base64.b64encode(proxy_user_pass)
        headers = {
            'Proxy-Authorization': b'Basic ' + encoded_user_pass
        }

        # The proxy endpoint goes into request.meta['proxy']
        yield scrapy.Request(url, self.parse, headers=headers,
                             meta={'proxy': 'http://openproxy.xxxxxx.com:8080'})

    def parse(self, response):
        with open('test.html', 'wb') as f:
            f.write(response.body)
        self.log('---------save file------------')
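As an aside (not from the original post), the same Proxy-Authorization value can be built with basic_auth_header from w3lib, a library Scrapy already depends on; a minimal sketch with the same placeholder credentials:

from w3lib.http import basic_auth_header

# Produces b'Basic <base64(username:password)>', same as the manual version above
headers = {'Proxy-Authorization': basic_auth_header('username', 'password')}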

2. Via the configuration file

1) The spider code

Create a project first, then add the spider below under tutorial/spiders/:

D:\work\pycode>scrapy startproject tutorial

import scrapy

class muhe(scrapy.Spider):
    name = "mutian"
    start_urls = [
        'http://www.baidu.com',
    ]

    def parse(self, response):
        # Save the downloaded page so we can confirm the request succeeded
        with open('test.html', 'wb') as f:
            f.write(response.body)
        self.log('---------save file------------')

2) Enable the proxy middleware

Here we use the downloader middleware that scrapy startproject generates. Edit settings.py and enable the following lines:

# Obey robots.txt rules
ROBOTSTXT_OBEY = False   # set to False so the crawler ignores robots.txt on sites that publish one

DOWNLOADER_MIDDLEWARES = {
    'tutorial.middlewares.TutorialDownloaderMiddleware': 543,
}

This registers tutorial.middlewares.TutorialDownloaderMiddleware as the proxy middleware. ROBOTSTXT_OBEY must also be set to False, as shown above.
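For reference, the value 543 is the middleware's priority. Scrapy's built-in HttpProxyMiddleware runs at priority 750; if the custom middleware should be the only one handling proxies, the built-in one can be disabled by mapping it to None, as in this sketch:

DOWNLOADER_MIDDLEWARES = {
    'tutorial.middlewares.TutorialDownloaderMiddleware': 543,
    # Optional: disable Scrapy's built-in proxy middleware (default priority 750)
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
}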

3) Edit the TutorialDownloaderMiddleware class in middlewares.py (only process_request needs changing; keep the other generated methods as they are)

import base64

class TutorialDownloaderMiddleware:

    def process_request(self, request, spider):
        # Send every request through the proxy
        request.meta['proxy'] = "http://openproxy.xxxxxx.com:8080"
        proxy_user_pass = b"username:password"
        # Do not use base64.encodestring() here; it is deprecated and appends a newline
        encoded_user_pass = base64.b64encode(proxy_user_pass)
        request.headers['Proxy-Authorization'] = b'Basic ' + encoded_user_pass
        return None
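A common variation, not from the original post, is to keep the proxy endpoint and credentials in settings.py rather than hardcoding them; PROXY_URL, PROXY_USER and PROXY_PASS below are hypothetical custom setting names, read through Scrapy's standard from_crawler hook:

import base64

class TutorialDownloaderMiddleware:

    def __init__(self, proxy_url, user, password):
        self.proxy_url = proxy_url
        # Precompute the Basic auth header once instead of per request
        self.auth = b'Basic ' + base64.b64encode(f'{user}:{password}'.encode())

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook when building the middleware, exposing settings.py
        s = crawler.settings
        return cls(s.get('PROXY_URL'), s.get('PROXY_USER'), s.get('PROXY_PASS'))

    def process_request(self, request, spider):
        request.meta['proxy'] = self.proxy_url
        request.headers['Proxy-Authorization'] = self.auth
        return None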

Run the spider:

scrapy crawl mutian
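To confirm the proxy is actually in use, one quick check (an addition, assuming httpbin.org is reachable through the proxy) is to fetch an IP-echo endpoint; the reported address should be the proxy's, not the local machine's:

import json

import scrapy

class ProxyCheckSpider(scrapy.Spider):
    name = "proxycheck"
    start_urls = ['https://httpbin.org/ip']

    def parse(self, response):
        # httpbin echoes the caller's public IP address
        self.log(json.loads(response.text)['origin'])

Run it with scrapy crawl proxycheck.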

 
