Scrapy Proxy
Scrapy can be configured to use a proxy in two ways:
1. In the spider code
2. By modifying the configuration file
1. In-code implementation
import scrapy
import base64

class muhe(scrapy.Spider):
    name = "mutian"

    def start_requests(self):
        url = 'http://www.baidu.com'
        proxy_user_pass = b"username:password"
        # Base64-encode the credentials for HTTP Basic proxy auth
        encoded_user_pass = base64.b64encode(proxy_user_pass)
        headers = {
            'Proxy-Authorization': b'Basic ' + encoded_user_pass
        }
        yield scrapy.Request(url, self.parse, headers=headers,
                             meta={'proxy': 'http://openproxy.xxxxxx.com:8080'})

    def parse(self, response):
        with open('test.html', 'wb') as f:
            f.write(response.body)
        self.log('---------save file------------')
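As an aside, recent Scrapy versions ship a built-in HttpProxyMiddleware (enabled by default) that can take credentials embedded directly in the proxy URL and set the Proxy-Authorization header for you, which makes the manual base64 step unnecessary. A minimal sketch of that variant, using the same placeholder proxy and credentials as above (the spider name mutian_alt is illustrative):

import scrapy

class muhe_alt(scrapy.Spider):
    name = "mutian_alt"

    def start_requests(self):
        # HttpProxyMiddleware extracts user:password from the proxy URL
        # and adds the Proxy-Authorization header itself.
        yield scrapy.Request(
            'http://www.baidu.com',
            self.parse,
            meta={'proxy': 'http://username:password@openproxy.xxxxxx.com:8080'},
        )

    def parse(self, response):
        with open('test.html', 'wb') as f:
            f.write(response.body)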
2. Via the configuration file
(1) Spider code
D:\work\pycode>scrapy startproject tutorial
import scrapy

class muhe(scrapy.Spider):
    name = "mutian"
    start_urls = [
        'http://www.baidu.com',
    ]

    def parse(self, response):
        self.log('---------save file------------')
        with open('test.html', 'wb') as f:
            f.write(response.body)
        self.log('---------save file------------')
(2) Configure the proxy
Here we use the downloader middleware that scrapy startproject auto-generated as the proxy middleware. Edit settings.py and enable the following lines:

# Obey robots.txt rules
ROBOTSTXT_OBEY = False  # set to False: for sites that publish robots.txt, our crawler simply does not obey it

DOWNLOADER_MIDDLEWARES = {
    'tutorial.middlewares.TutorialDownloaderMiddleware': 543,
}

This registers tutorial.middlewares.TutorialDownloaderMiddleware as the proxy middleware. Note that ROBOTSTXT_OBEY must also be set to False.
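The number 543 is the middleware's order: process_request hooks run in ascending order of this value, and Scrapy's built-in HttpProxyMiddleware sits at 750 by default, so ours runs first. A sketch of the same settings.py fragment with that spelled out (the commented-out line is optional):

DOWNLOADER_MIDDLEWARES = {
    # process_request hooks run in ascending order of this number;
    # 543 places ours before Scrapy's built-ins (HttpProxyMiddleware is 750).
    'tutorial.middlewares.TutorialDownloaderMiddleware': 543,
    # Optional: setting a middleware to None disables it, e.g. if you want
    # to be sure only your own middleware touches the proxy headers.
    # 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
}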
(3) Modify TutorialDownloaderMiddleware in middlewares.py
import base64

# In tutorial/middlewares.py, inside the generated TutorialDownloaderMiddleware class:
def process_request(self, request, spider):
    request.meta['proxy'] = "http://openproxy.xxxxxx.com:8080"
    proxy_user_pass = b"username:password"
    # Do not use base64.encodestring() here: it inserts newlines,
    # which would corrupt the header
    encoded_user_pass = base64.b64encode(proxy_user_pass)
    request.headers['Proxy-Authorization'] = b'Basic ' + encoded_user_pass
    return None
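The same hook can do more than attach a single fixed proxy. A common extension is to rotate through a pool on each request; a minimal sketch, where PROXY_POOL and RotatingProxyMiddleware are hypothetical names (register it in DOWNLOADER_MIDDLEWARES just like the middleware above):

import base64
import random

# Hypothetical pool; in practice this might come from settings or a file.
PROXY_POOL = [
    ("http://openproxy1.xxxxxx.com:8080", b"username:password"),
    ("http://openproxy2.xxxxxx.com:8080", b"username:password"),
]

class RotatingProxyMiddleware:
    def process_request(self, request, spider):
        # Pick a proxy at random for each outgoing request
        proxy_url, user_pass = random.choice(PROXY_POOL)
        request.meta['proxy'] = proxy_url
        encoded = base64.b64encode(user_pass)
        request.headers['Proxy-Authorization'] = b'Basic ' + encoded
        return None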
Finally, run the spider:

scrapy crawl mutian
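To confirm that traffic is actually going through the proxy, it helps to fetch a page that echoes the caller's IP address; a sketch using the public httpbin.org/ip endpoint (the spider name proxycheck is illustrative, and the proxy middleware above is assumed to be active):

import scrapy

class proxycheck(scrapy.Spider):
    name = "proxycheck"
    start_urls = [
        'http://httpbin.org/ip',
    ]

    def parse(self, response):
        # With the proxy middleware active, this should log the proxy's
        # IP address rather than your own.
        self.log(response.text)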
