Understanding the Scrapy Framework
The foundation of Scrapy: Twisted
Internally, Scrapy achieves concurrent crawling through an event-loop mechanism. Originally:
import requests

url_list = ['http://www.baidu.com', 'http://www.baidu.com', 'http://www.baidu.com']
for item in url_list:
    response = requests.get(item)
    print(response.text)
Now:
from twisted.web.client import getPage, defer
from twisted.internet import reactor

# Part 1: the agent starts accepting tasks
def callback(contents):
    print(contents)

deferred_list = []
url_list = ['http://www.bing.com', 'https://segmentfault.com/', 'https://stackoverflow.com/']
for url in url_list:
    deferred = getPage(bytes(url, encoding='utf8'))
    deferred.addCallback(callback)
    deferred_list.append(deferred)

# Part 2: once the agent has finished every task, stop
dlist = defer.DeferredList(deferred_list)
def all_done(arg):
    reactor.stop()
dlist.addBoth(all_done)

# Part 3: let the agent get to work
reactor.run()
What is Twisted?
- Official definition: an asynchronous, non-blocking module built on an event loop.
- In plain words: a single thread can issue HTTP requests to multiple targets at the same time.
Non-blocking: no waiting; all requests go out together. When connections for request A, request B, and request C are initiated, we do not wait for one connection to return a result before starting the next; we send one and immediately send the next.
import socket

sk = socket.socket()
sk.setblocking(False)
sk.connect(('1.1.1.1', 80))   # non-blocking connect returns immediately instead of waiting

sk = socket.socket()
sk.setblocking(False)
sk.connect(('1.1.1.2', 80))

sk = socket.socket()
sk.setblocking(False)
sk.connect(('1.1.1.3', 80))
Asynchronous: callbacks. As soon as the A, B, or C that callback_A, callback_B, and callback_C are waiting for is found, they are notified proactively.
def callback(contents):
    print(contents)
Event loop: keep looping over the three socket tasks (request A, request B, request C) and checking their states: has the connection succeeded, and has a result come back?
How does Twisted differ from requests?
requests is a Python module that sends HTTP requests while impersonating a browser:
- it wraps sockets to send requests
twisted is an asynchronous, non-blocking network framework built on an event loop:
- it wraps sockets to send requests
- it completes concurrent requests in a single thread
PS: the three keywords
- non-blocking: no waiting
- asynchronous: callbacks
- event loop: keep looping to check states
Scrapy
Scrapy is an application framework written for crawling web sites and extracting structured data. It can be used in a wide range of programs, from data mining and information processing to storing historical data. It was originally designed for page scraping (more precisely, web scraping), but it can also fetch data returned by APIs (such as Amazon Associates Web Services) or act as a general-purpose web crawler. Scrapy is broadly applicable: data mining, monitoring, and automated testing.
Scrapy uses the Twisted asynchronous networking library to handle network communication. The overall architecture looks roughly like this:

Scrapy consists of the following components:
- Engine (Scrapy)
Handles the data flow of the whole system and triggers events (the core of the framework).
- Scheduler
Accepts requests from the engine, pushes them onto a queue, and returns them when the engine asks again. Think of it as a priority queue of URLs (the addresses of the pages to crawl); it decides which URL to crawl next and removes duplicate URLs.
- Downloader
Downloads page content and hands it back to the spiders (the downloader is built on Twisted's efficient asynchronous model).
- Spiders
Spiders do the main work: they extract the information they need, the so-called items, from specific pages. They can also extract links so that Scrapy keeps crawling the next pages.
- Item Pipeline
Processes the items extracted by the spiders. Its main jobs are persisting items, validating them, and cleaning out unneeded data. After a page is parsed by a spider, its items are sent to the pipeline and processed through several stages in a specific order.
- Downloader Middlewares
A hook framework between the Scrapy engine and the downloader; it processes the requests and responses passing between them.
- Spider Middlewares
A hook framework between the Scrapy engine and the spiders; it processes spider input (responses) and spider output (requests and items).
- Scheduler Middlewares
Middleware between the Scrapy engine and the scheduler, processing the requests and responses sent from the engine to the scheduler.
The Scrapy run flow, roughly (a minimal spider illustrating it follows the list):
- The engine finds the spider to run, calls its start_requests method, and gets back an iterator.
- Iterating yields Request objects; each Request wraps the URL to visit and a callback. All Request objects (tasks) are handed to the scheduler, which puts them into a request queue and deduplicates them.
- The downloader asks the engine for download tasks (Request objects); the engine asks the scheduler, the scheduler pops a Request off the queue and returns it to the engine, and the engine passes it to the downloader.
- When the download finishes, the downloader returns a Response object to the engine, which runs the request's callback.
- Back in the spider's callback, the spider parses the Response:
- yield Item(): a parsed item is handed to the item pipeline for further processing.
- yield Request(): a parsed link (URL) is handed to the scheduler to wait for crawling.
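To make the flow concrete, here is a minimal sketch of a spider whose callback does both kinds of yield (it targets the quotes.toscrape.com practice site, used only for illustration):

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # yield items -> they travel to the item pipeline
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').extract_first()}
        # yield a Request -> it goes back to the scheduler and is crawled later
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)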
I. Basic commands and project structure
Basic commands
Create a project: scrapy startproject <project_name>
    Creates a project skeleton in the current directory (similar to Django).
Create a spider:
    cd <project_name>
    scrapy genspider [-t template] <name> <domain>
    scrapy genspider -t basic oldboy oldboy.com
    scrapy genspider -t crawl weisuen sohu.com
    PS:
    List spider templates: scrapy genspider -l
    Show a template: scrapy genspider -d <template_name>
scrapy list
    Show the list of spiders in the project.
Run a spider:
    scrapy crawl <spider_name>
    scrapy crawl quotes
    scrapy runspider <spider_file>.py
    scrapy crawl lagou -s JOBDIR=job_info/001   # pause and resume
    Save output to a file: scrapy crawl quotes -o quotes.json
Test in an interactive shell: scrapy shell 'http://scrapy.org' --nolog
Project structure
project_name/
    scrapy.cfg
    project_name/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            spider1.py
            spider2.py
            spider3.py
File descriptions:
- scrapy.cfg: the project's main configuration entry. (The real crawler-related settings live in settings.py.)
- items.py: defines the data model used to structure scraped data, similar to Django's Model.
- pipelines.py: data-processing behavior, e.g. persisting structured data.
- settings.py: configuration, e.g. recursion depth, concurrency, download delay.
- spiders/: the spider directory; create files here and write the crawling rules.
Note: spider files are usually named after the site's domain.
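A few settings commonly adjusted in settings.py (the setting names are standard Scrapy settings; the values are only examples):

# settings.py -- example values
ROBOTSTXT_OBEY = False       # whether to respect robots.txt
CONCURRENT_REQUESTS = 16     # number of concurrent requests
DOWNLOAD_DELAY = 1           # delay between downloads, in seconds
DEPTH_LIMIT = 2              # maximum recursion depth (0 = unlimited)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...'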
II. Writing spiders
1. start_urls
How it works internally
The Scrapy engine fetches the start URLs from the spider:
1. call start_requests and take its return value
2. v = iter(return value)
3.
   req1 = v.__next__()
   req2 = v.__next__()
   req3 = v.__next__()
   ...
4. all the requests are handed to the scheduler
import scrapy
from scrapy.http import Request


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    cookie_dict = {}

    def start_requests(self):
        # Option 1:
        for url in self.start_urls:
            yield Request(url=url)
        # Option 2:
        # req_list = []
        # for url in self.start_urls:
        #     req_list.append(Request(url=url))
        # return req_list
- Customization: start URLs can also be fetched from elsewhere, e.g. redis.
2. The response:
# the response object wraps everything related to the response:
- response.text
- response.encoding
- response.body
- response.meta (e.g. response.meta['depth'] for the crawl depth)
- response.request  # the request that produced this response; the request wraps the URL to visit and the callback to run once the download finishes
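A short sketch of reading these attributes inside a callback (all of the printed fields are standard Response attributes):

def parse(self, response):
    print(response.url, response.status, response.encoding)
    print(response.meta.get('depth', 0))   # depth is filled in by the built-in DepthMiddleware
    print(response.request.url)            # the Request that produced this response
    print(response.text[:100])             # decoded text; response.body is the raw bytes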
3. Selectors
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

html = """<!DOCTYPE html>
<html>
<head lang="en">
    <meta charset="UTF-8">
    <title></title>
</head>
<body>
    <ul>
        <li class="item-"><a id='i1' href="link.html">first item</a></li>
        <li class="item-0"><a id='i2' href="llink.html">first item</a></li>
        <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
    </ul>
    <div><a href="llink2.html">second item</a></div>
</body>
</html>
"""
response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')
# hxs = Selector(response=response).xpath('//a')
# print(hxs)
# hxs = Selector(text=html).xpath('//a')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[2]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id="i1"]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@href="link.html"][@id="i1"]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[contains(@href, "link")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[starts-with(@href, "link")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/text()').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/@href').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('/html/body/ul/li/a/@href').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('//body/ul/li/a/@href').extract_first()
# print(hxs)

# ul_list = Selector(response=response).xpath('//body/ul/li')
# for item in ul_list:
#     v = item.xpath('./a/span')
#     # or
#     # v = item.xpath('a/span')
#     # or
#     # v = item.xpath('*/a/span')
#     print(v)
response.css('...') returns a SelectorList; response.css('...').extract() returns a list of strings; response.css('...').extract_first() returns the first element of that list.
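A few CSS equivalents of the XPath queries above, run against the same response object (::text and ::attr() are Scrapy's CSS extensions):

print(response.css('a'))                                    # SelectorList of all <a> nodes
print(response.css('a::attr(href)').extract())              # all href values as strings
print(response.css('a#i1::text').extract_first())           # text of the <a> with id="i1"
print(response.css('ul li a::attr(href)').extract_first())  # first href inside the <ul>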
def parse_detail(self, response):
    # items = JobboleArticleItem()
    # title = response.xpath('//div[@class="entry-header"]/h1/text()')[0].extract()
    # create_date = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/text()').extract()[0].strip().replace('·','').strip()
    # praise_nums = int(response.xpath("//span[contains(@class,'vote-post-up')]/h10/text()").extract_first())
    # fav_nums = response.xpath("//span[contains(@class,'bookmark-btn')]/text()").extract_first()
    # try:
    #     if re.match('.*?(\d+).*', fav_nums).group(1):
    #         fav_nums = int(re.match('.*?(\d+).*', fav_nums).group(1))
    #     else:
    #         fav_nums = 0
    # except:
    #     fav_nums = 0
    # comment_nums = response.xpath('//a[contains(@href,"#article-comment")]/span/text()').extract()[0]
    # try:
    #     if re.match('.*?(\d+).*',comment_nums).group(1):
    #         comment_nums = int(re.match('.*?(\d+).*',comment_nums).group(1))
    #     else:
    #         comment_nums = 0
    # except:
    #     comment_nums = 0
    # contente = response.xpath('//div[@class="entry"]').extract()[0]
    # tag_list = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/a/text()').extract()
    # tag_list = [tag for tag in tag_list if not tag.strip().endswith('评论')]
    # tags = ",".join(tag_list)
    # items['title'] = title
    # try:
    #     create_date = datetime.datetime.strptime(create_date,'%Y/%m/%d').date()
    # except:
    #     create_date = datetime.datetime.now()
    # items['date'] = create_date
    # items['url'] = response.url
    # items['url_object_id'] = get_md5(response.url)
    # items['img_url'] = [img_url]
    # items['praise_nums'] = praise_nums
    # items['fav_nums'] = fav_nums
    # items['comment_nums'] = comment_nums
    # items['content'] = contente
    # items['tags'] = tags

    # the same extraction with CSS selectors:
    # title = response.css('.entry-header h1::text')[0].extract()
    # create_date = response.css('p.entry-meta-hide-on-mobile::text').extract()[0].strip().replace('·','').strip()
    # praise_nums = int(response.css(".vote-post-up h10::text").extract_first())
    # fav_nums = response.css(".bookmark-btn::text").extract_first()
    # if re.match('.*?(\d+).*', fav_nums).group(1):
    #     fav_nums = int(re.match('.*?(\d+).*', fav_nums).group(1))
    # else:
    #     fav_nums = 0
    # comment_nums = response.css('a[href="#article-comment"] span::text').extract()[0]
    # if re.match('.*?(\d+).*', comment_nums).group(1):
    #     comment_nums = int(re.match('.*?(\d+).*', comment_nums).group(1))
    # else:
    #     comment_nums = 0
    # content = response.css('.entry').extract()[0]
    # tag_list = response.css('p.entry-meta-hide-on-mobile a::text')
    # tag_list = [tag for tag in tag_list if not tag.strip().endswith('评论')]
    # tags = ",".join(tag_list)
    # XPath selectors: /@href extracts an attribute, /text() extracts text
    pass
def parse_detail(self, response):
    img_url = response.meta.get('img_url', '')
    item_loader = ArticleItemLoader(item=JobboleArticleItem(), response=response)
    item_loader.add_css("title", ".entry-header h1::text")
    item_loader.add_value('url', response.url)
    item_loader.add_value('url_object_id', get_md5(response.url))
    item_loader.add_css('date', 'p.entry-meta-hide-on-mobile::text')
    item_loader.add_value("img_url", [img_url])
    item_loader.add_css("praise_nums", ".vote-post-up h10::text")
    item_loader.add_css("fav_nums", ".bookmark-btn::text")
    item_loader.add_css("comment_nums", "a[href='#article-comment'] span::text")
    item_loader.add_css("tags", "p.entry-meta-hide-on-mobile a::text")
    item_loader.add_css("content", "div.entry")
    items = item_loader.load_item()
    yield items
4. Issuing further requests
yield Request(url='xxxx', callback=self.parse)
yield Request(url=parse.urljoin(response.url, post_url), meta={'img_url': img_url}, callback=self.parse_detail)
5. Carrying cookies
Option 1: pass the cookies explicitly
    cookie_dict = {}
    # pull the cookies out of the response headers into a CookieJar
    cookie_jar = CookieJar()
    cookie_jar.extract_cookies(response, response.request)
    # unpack the CookieJar into a plain dict
    for k, v in cookie_jar._cookies.items():
        for i, j in v.items():
            for m, n in j.items():
                cookie_dict[m] = n.value
    yield Request(
        url='https://dig.chouti.com/login',
        method='POST',
        body='phone=8615735177116&password=zyf123&oneMonth=1',
        headers={'content-type': 'application/x-www-form-urlencoded; charset=UTF-8'},
        # cookies=cookie_jar._cookies,
        cookies=cookie_dict,
        callback=self.check_login,
    )
Option 2: meta
yield Request(url=url, callback=self.login, meta={'cookiejar': True})
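A minimal sketch of how Option 2 is usually wired up: the built-in CookiesMiddleware keeps one cookie jar per meta['cookiejar'] key, so the same key has to be forwarded on every follow-up request (the URLs and form body follow the chouti example; the credentials are placeholders):

def start_requests(self):
    yield Request(url='https://dig.chouti.com/', callback=self.login,
                  meta={'cookiejar': 1})

def login(self, response):
    # reuse jar 1 so the cookies collected above are sent with the login request
    yield Request(
        url='https://dig.chouti.com/login',
        method='POST',
        body='phone=86xxxxxxxxxxx&password=xxx&oneMonth=1',
        headers={'content-type': 'application/x-www-form-urlencoded; charset=UTF-8'},
        meta={'cookiejar': response.meta['cookiejar']},
        callback=self.check_login,
    )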
6. Passing values to the callback: meta
def parse(self, response):
    yield scrapy.Request(url=parse.urljoin(response.url, post_url), meta={'img_url': img_url}, callback=self.parse_detail)

def parse_detail(self, response):
    img_url = response.meta.get('img_url', '')
from urllib.parse import urljoin
import scrapy
from scrapy import Request
from scrapy.http.cookies import CookieJar


class SpiderchoutiSpider(scrapy.Spider):
    name = 'choutilike'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['https://dig.chouti.com/']
    cookies_dict = {}

    def parse(self, response):
        # pull the cookies out of the response headers; they are stored in the CookieJar object
        cookie_obj = CookieJar()
        cookie_obj.extract_cookies(response, response.request)
        # unpack the CookieJar into a plain dict
        for k, v in cookie_obj._cookies.items():
            for i, j in v.items():
                for m, n in j.items():
                    self.cookies_dict[m] = n.value
        # self.cookies_dict = cookie_obj._cookies
        yield Request(
            url='https://dig.chouti.com/login',
            method='POST',
            body='phone=8615735177116&password=zyf123&oneMonth=1',
            headers={'content-type': 'application/x-www-form-urlencoded; charset=UTF-8'},
            # cookies=cookie_obj._cookies,
            cookies=self.cookies_dict,
            callback=self.check_login,
        )

    def check_login(self, response):
        # print(response.text)
        yield Request(
            url='https://dig.chouti.com/all/hot/recent/1',
            cookies=self.cookies_dict,
            callback=self.good,
        )

    def good(self, response):
        id_list = response.css('div.part2::attr(share-linkid)').extract()
        for id in id_list:
            url = 'https://dig.chouti.com/link/vote?linksId={}'.format(id)
            yield Request(
                url=url,
                method='POST',
                cookies=self.cookies_dict,
                callback=self.show,
            )
        pages = response.css('#dig_lcpage a::attr(href)').extract()
        for page in pages:
            url = urljoin('https://dig.chouti.com/', page)
            yield Request(url=url, callback=self.good)

    def show(self, response):
        print(response.text)
III. Persistence
1. Order of work
- a. Write the pipeline class first
- b. Write the Item class
import scrapy


class ChoutiItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    href = scrapy.Field()
- c. Configure settings
ITEM_PIPELINES = {
    # 'chouti.pipelines.XiaohuaImagesPipeline': 300,
    # 'scrapy.pipelines.images.ImagesPipeline': 1,
    'chouti.pipelines.ChoutiPipeline': 300,
    # 'chouti.pipelines.Chouti2Pipeline': 301,
}
- d. In the spider, yield Item objects; every time such a yield executes, process_item is called once (see the sketch below).
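A minimal sketch of step d, reusing the ChoutiItem above and the selectors from the chouti spider shown later in this post:

def parse(self, response):
    for new in response.css('.content-list .item'):
        title = new.css('.show-content::text').extract_first(default='').strip()
        href = new.css('.show-content::attr(href)').extract_first()
        # each yielded Item triggers one process_item call in every enabled pipeline
        yield ChoutiItem(title=title, href=href)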
2. Writing a pipeline
How the framework drives it (source-code flow):
1. Check whether the pipeline class (XdbPipeline here) defines from_crawler
   yes: obj = XdbPipeline.from_crawler(....)
   no:  obj = XdbPipeline()
2. obj.open_spider()
3. obj.process_item() — called once for every yielded item, over and over
4. obj.close_spider()
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem


class ChoutiPipeline(object):
    def __init__(self, conn_str):
        self.conn_str = conn_str

    @classmethod
    def from_crawler(cls, crawler):
        """
        Called at initialization time to create the pipeline object.
        :param crawler:
        :return:
        """
        conn_str = crawler.settings.get('DB')
        return cls(conn_str)

    def open_spider(self, spider):
        """
        Called when the spider starts.
        :param spider:
        :return:
        """
        self.conn = open(self.conn_str, 'a', encoding='utf-8')

    def process_item(self, item, spider):
        if spider.name == 'spiderchouti':
            self.conn.write('{}\n{}\n'.format(item['title'], item['href']))
        # hand the item on to the next pipeline
        return item
        # drop the item so later pipelines never see it
        # raise DropItem()

    def close_spider(self, spider):
        """
        Called when the spider closes.
        :param spider:
        :return:
        """
        self.conn.close()
Note: pipelines are shared by all spiders. If you need behavior specific to one spider, branch on the spider argument yourself (as ChoutiPipeline does with spider.name above).
JSON files
from scrapy.exporters import JsonItemExporter


class JsonExporterPipeline(object):
    # use Scrapy's JsonItemExporter to write items out to a JSON file
    def __init__(self):
        self.file = open('articleexpoter.json', 'wb')
        self.exporter = JsonItemExporter(self.file, encoding='utf-8', ensure_ascii=False)
        self.exporter.start_exporting()   # start exporting

    def close_spider(self, spider):
        self.exporter.finish_exporting()  # finish exporting
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item


import codecs
import json


class JsonWithEncodingPipeline(object):
    # hand-rolled JSON export
    def __init__(self):
        self.file = codecs.open('article.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        lines = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(lines)
        return item

    def close_spider(self, spider):
        self.file.close()
Storing images
# -*- coding: utf-8 -*-
from urllib.parse import urljoin

import scrapy

from ..items import XiaohuaItem


class XiaohuaSpider(scrapy.Spider):
    name = 'xiaohua'
    allowed_domains = ['www.xiaohuar.com']
    start_urls = ['http://www.xiaohuar.com/list-1-{}.html'.format(i) for i in range(11)]

    def parse(self, response):
        items = response.css('.item_list .item')
        for item in items:
            url = item.css('.img img::attr(src)').extract()[0]
            url = urljoin('http://www.xiaohuar.com', url)
            title = item.css('.title span a::text').extract()[0]
            obj = XiaohuaItem(img_url=[url], title=title)
            yield obj
class XiaohuaItem(scrapy.Item):
    img_url = scrapy.Field()
    title = scrapy.Field()
    img_path = scrapy.Field()
class XiaohuaImagesPipeline(ImagesPipeline):
    # use Scrapy's ImagesPipeline to download the images

    def item_completed(self, results, item, info):
        if "img_url" in item:
            for ok, value in results:
                print(ok, value)
                img_path = value['path']
                item['img_path'] = img_path
        return item

    def get_media_requests(self, item, info):  # download the images
        if "img_url" in item:
            for img_url in item['img_url']:
                # meta carries the item and the index so file_path() below can rename the file
                yield scrapy.Request(img_url, meta={'item': item, 'index': item['img_url'].index(img_url)})

    def file_path(self, request, response=None, info=None):
        item = request.meta['item']        # the item passed through meta above
        if "img_url" in item:
            index = request.meta['index']  # index of the image currently being downloaded
            # file name: the item title plus the original extension (jpg/png) taken from the URL;
            # note: if the title contains non-ASCII text, normalize it first, otherwise the saved
            # name ends up escaped like \u97e9\u56fd...
            image_guid = item['title'] + '.' + request.url.split('/')[-1].split('.')[-1]
            filename = u'full/{0}'.format(image_guid)
            return filename
ITEM_PIPELINES = {
    # 'chouti.pipelines.XiaohuaImagesPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
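ImagesPipeline also needs a target directory, configured in settings.py (the path is an example; the other two settings are optional size filters):

IMAGES_STORE = 'images'     # where ImagesPipeline writes the downloaded files
# IMAGES_MIN_HEIGHT = 100   # optionally skip images smaller than this
# IMAGES_MIN_WIDTH = 100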
MySQL database
import pymysql


class MysqlPipeline(object):
    def __init__(self):
        self.conn = pymysql.connect(host='localhost', user='root', password='0000',
                                    database='crawed', charset='utf8', use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        insert_sql = """insert into article(title,url,create_date,fav_nums) values (%s,%s,%s,%s)"""
        self.cursor.execute(insert_sql, (item['title'], item['url'], item['date'], item['fav_nums']))
        self.conn.commit()
        return item
from twisted.enterprise import adbapi


class MysqlTwistePipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparms = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DB'],
            user=settings['MYSQL_USER'],
            password=settings['MYSQL_PASSWORD'],
            charset='utf8',
            cursorclass=pymysql.cursors.DictCursor,
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool('pymysql', **dbparms)
        return cls(dbpool)

    def process_item(self, item, spider):
        # use twisted to run the mysql insert asynchronously
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error)  # handle exceptions
        return item

    def handle_error(self, failure):
        # handle exceptions raised by the asynchronous insert
        print(failure)

    def do_insert(self, cursor, item):
        insert_sql, params = item.get_insert_sql()
        try:
            cursor.execute(insert_sql, params)
            print('insert succeeded')
        except Exception as e:
            print('insert failed:', e)
MYSQL_HOST = 'localhost'
MYSQL_USER = 'root'
MYSQL_PASSWORD = '0000'
MYSQL_DB = 'crawed'
SQL_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"
SQL_DATE_FORMAT = "%Y-%m-%d"
RANDOM_UA_TYPE = "random"
ES_HOST = "127.0.0.1"
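MysqlTwistePipeline calls item.get_insert_sql(), which is not shown in this post; a hedged sketch of what such a method might look like on the Item class (the field names follow the jobbole example above):

import scrapy


class JobboleArticleItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    date = scrapy.Field()
    fav_nums = scrapy.Field()
    # ... remaining fields omitted

    def get_insert_sql(self):
        # return the SQL template and the parameter tuple consumed by do_insert()
        insert_sql = """insert into article(title, url, create_date, fav_nums)
                        values (%s, %s, %s, %s)"""
        params = (self['title'], self['url'], self['date'], self['fav_nums'])
        return insert_sql, params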
IV. Deduplication rules
Scrapy's default dedup rule:
from scrapy.dupefilters import RFPDupeFilter
from __future__ import print_function
import os
import logging

from scrapy.utils.job import job_dir
from scrapy.utils.request import request_fingerprint


class BaseDupeFilter(object):

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        return False

    def open(self):  # can return deferred
        pass

    def close(self, reason):  # can return a deferred
        pass

    def log(self, request, spider):  # log that a request has been filtered
        pass


class RFPDupeFilter(BaseDupeFilter):
    """Request Fingerprint duplicates filter"""

    def __init__(self, path=None, debug=False):
        self.file = None
        self.fingerprints = set()
        self.logdupes = True
        self.debug = debug
        self.logger = logging.getLogger(__name__)
        if path:
            self.file = open(os.path.join(path, 'requests.seen'), 'a+')
            self.file.seek(0)
            self.fingerprints.update(x.rstrip() for x in self.file)

    @classmethod
    def from_settings(cls, settings):
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(job_dir(settings), debug)

    def request_seen(self, request):
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)

    def request_fingerprint(self, request):
        return request_fingerprint(request)

    def close(self, reason):
        if self.file:
            self.file.close()

    def log(self, request, spider):
        if self.debug:
            msg = "Filtered duplicate request: %(request)s"
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
        elif self.logdupes:
            msg = ("Filtered duplicate request: %(request)s"
                   " - no more duplicates will be shown"
                   " (see DUPEFILTER_DEBUG to show all duplicates)")
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
            self.logdupes = False

        spider.crawler.stats.inc_value('dupefilter/filtered', spider=spider)
Custom dedup rule — 1. Write the class
# -*- coding: utf-8 -*-
"""
@Datetime: 2018/8/31
@Author: Zhang Yafei
"""
from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint


class RepeatFilter(BaseDupeFilter):
    def __init__(self):
        self.visited_fd = set()

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        fd = request_fingerprint(request=request)
        if fd in self.visited_fd:
            return True
        self.visited_fd.add(fd)

    def open(self):  # can return deferred
        print('open')

    def close(self, reason):  # can return a deferred
        print('close')

    def log(self, request, spider):  # log that a request has been filtered
        pass
2. Configuration
# replace the default dedup rule (scrapy.dupefilters.RFPDupeFilter)
# DUPEFILTER_CLASS = "chouti.duplication.RepeatFilter"
DUPEFILTER_CLASS = "chouti.dupeFilter.RepeatFilter"
3. Using it in the spider
from urllib.parse import urljoin
from ..items import ChoutiItem
import scrapy
from scrapy.http import Request


class SpiderchoutiSpider(scrapy.Spider):
    name = 'spiderchouti'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['https://dig.chouti.com/']

    def parse(self, response):
        # titles on the current page
        print(response.request.url)
        # news = response.css('.content-list .item')
        # for new in news:
        #     title = new.css('.show-content::text').extract()[0].strip()
        #     href = new.css('.show-content::attr(href)').extract()[0]
        #     item = ChoutiItem(title=title, href=href)
        #     yield item

        # collect all page links
        pages = response.css('#dig_lcpage a::attr(href)').extract()
        for page in pages:
            url = urljoin(self.start_urls[0], page)
            # hand the new URL to the scheduler
            yield Request(url=url, callback=self.parse)
Notes:
- write the correct logic in request_seen
- dont_filter=False (the filter only applies to requests whose dont_filter is False; see the sketch below)
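A quick sketch of the dont_filter flag, which decides whether a request ever reaches request_seen:

# deduplicated: dont_filter defaults to False, so request_seen() is consulted
yield Request(url=url, callback=self.parse)
# never filtered: the scheduler skips the dupe filter for this request
yield Request(url=url, callback=self.parse, dont_filter=True)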
V. Middleware
Downloader middleware
from scrapy.http import HtmlResponse
from scrapy.http import Request


class Md1(object):
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        print('md1.process_request', request)
        # 1. return a Response
        # import requests
        # result = requests.get(request.url)
        # return HtmlResponse(url=request.url, status=200, headers=None, body=result.content)
        # 2. return a Request
        # return Request('https://dig.chouti.com/r/tec/hot/1')
        # 3. raise an exception
        # from scrapy.exceptions import IgnoreRequest
        # raise IgnoreRequest
        # 4. modify the request in place (the usual case)
        # request.headers['user-agent'] = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
        pass

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        print('m1.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass
DOWNLOADER_MIDDLEWARES = {
    # 'xdb.middlewares.XdbDownloaderMiddleware': 543,
    # 'xdb.proxy.XdbProxyMiddleware': 751,
    'xdb.md.Md1': 666,
    'xdb.md.Md2': 667,
}
Typical uses:
- user-agent rotation (see the sketch below)
- proxies
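A hedged sketch of the user-agent case: a small downloader middleware that swaps the header in process_request (the UA list is illustrative, and the class still has to be enabled in DOWNLOADER_MIDDLEWARES):

import random


class RandomUserAgentMiddleware(object):
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13) AppleWebKit/605.1.15 (KHTML, like Gecko) Safari/605.1.15',
    ]

    def process_request(self, request, spider):
        # rewrite the header before the request reaches the downloader
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
        return None   # None -> continue with the normal download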
Spider middleware
class Sd1(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    # Runs only once, when the spider starts.
    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r
SPIDER_MIDDLEWARES = {
    # 'xdb.middlewares.XdbSpiderMiddleware': 543,
    'xdb.sd.Sd1': 666,
    'xdb.sd.Sd2': 667,
}
Typical uses:
- depth
- priority (see the sketch below)
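A hedged sketch of the depth/priority case: a spider middleware that lowers the priority of requests coming from deeper pages, similar in spirit to what the built-in DepthMiddleware does with DEPTH_PRIORITY:

from scrapy.http import Request


class DepthPriorityMiddleware(object):
    def process_spider_output(self, response, result, spider):
        depth = response.meta.get('depth', 0)
        for obj in result:
            if isinstance(obj, Request):
                # deeper pages -> lower priority, so shallow pages get crawled first
                obj.priority = obj.priority - depth
            yield obj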
class SpiderMiddleware(object):

    def process_spider_input(self, response, spider):
        """
        Called once the download finishes, before the response is handed to parse().
        :param response:
        :param spider:
        :return:
        """
        pass

    def process_spider_output(self, response, result, spider):
        """
        Called when the spider has finished processing and returns its results.
        :param response:
        :param result:
        :param spider:
        :return: must return an iterable containing Request or Item objects
        """
        return result

    def process_spider_exception(self, response, exception, spider):
        """
        Called on exceptions.
        :param response:
        :param exception:
        :param spider:
        :return: None to let later middleware keep handling the exception, or an iterable
                 of Response or Item objects to hand to the scheduler or the pipeline
        """
        return None

    def process_start_requests(self, start_requests, spider):
        """
        Called when the spider starts.
        :param start_requests:
        :param spider:
        :return: an iterable of Request objects
        """
        return start_requests


class DownMiddleware1(object):
    def process_request(self, request, spider):
        """
        Called on every downloader middleware's process_request when a request is about to be downloaded.
        :param request:
        :param spider:
        :return:
            None: continue to the later middleware and then download
            Response object: stop the process_request chain and start running process_response
            Request object: stop the middleware chain and put the Request back into the scheduler
            raise IgnoreRequest: stop the process_request chain and start running process_exception
        """
        pass

    def process_response(self, request, response, spider):
        """
        Called on the way back, once the download has produced a response.
        :param request:
        :param response:
        :param spider:
        :return:
            Response object: passed on to the other middlewares' process_response
            Request object: stop the middleware chain; the request is rescheduled for download
            raise IgnoreRequest: Request.errback is called
        """
        print('response1')
        return response

    def process_exception(self, request, exception, spider):
        """
        Called when the download handler or a process_request() (downloader middleware) raises an exception.
        :param request:
        :param exception:
        :param spider:
        :return:
            None: let later middleware keep handling the exception
            Response object: stop the remaining process_exception methods
            Request object: stop the middleware chain; the request is rescheduled for download
        """
        return None
Setting a proxy
When the spider starts, set the proxy in os.environ ahead of time.
class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    cookie_dict = {}

    def start_requests(self):
        import os
        os.environ['HTTPS_PROXY'] = "http://root:woshiniba@192.168.11.11:9999/"
        os.environ['HTTP_PROXY'] = 'http://19.11.2.32'
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse)
Via meta:
class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    cookie_dict = {}

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse,
                          meta={'proxy': 'http://root:woshiniba@192.168.11.11:9999/'})
import base64
import random
from six.moves.urllib.parse import unquote
try:
    from urllib2 import _parse_proxy
except ImportError:
    from urllib.request import _parse_proxy
from six.moves.urllib.parse import urlunparse
from scrapy.utils.python import to_bytes


class XdbProxyMiddleware(object):

    def _basic_auth_header(self, username, password):
        user_pass = to_bytes(
            '%s:%s' % (unquote(username), unquote(password)),
            encoding='latin-1')
        return base64.b64encode(user_pass).strip()

    def process_request(self, request, spider):
        PROXIES = [
            "http://root:woshiniba@192.168.11.11:9999/",
            "http://root:woshiniba@192.168.11.12:9999/",
            "http://root:woshiniba@192.168.11.13:9999/",
            "http://root:woshiniba@192.168.11.14:9999/",
            "http://root:woshiniba@192.168.11.15:9999/",
            "http://root:woshiniba@192.168.11.16:9999/",
        ]
        url = random.choice(PROXIES)

        orig_type = ""
        proxy_type, user, password, hostport = _parse_proxy(url)
        proxy_url = urlunparse((proxy_type or orig_type, hostport, '', '', '', ''))

        if user:
            creds = self._basic_auth_header(user, password)
        else:
            creds = None
        request.meta['proxy'] = proxy_url
        if creds:
            request.headers['Proxy-Authorization'] = b'Basic ' + creds
class DdbProxyMiddleware(object):
    def process_request(self, request, spider):
        PROXIES = [
            {'ip_port': '111.11.228.75:80', 'user_pass': ''},
            {'ip_port': '120.198.243.22:80', 'user_pass': ''},
            {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
            {'ip_port': '101.71.27.120:80', 'user_pass': ''},
            {'ip_port': '122.96.59.104:80', 'user_pass': ''},
            {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
        ]
        proxy = random.choice(PROXIES)
        if proxy['user_pass']:
            # proxy requires authentication
            request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
            encoded_user_pass = base64.b64encode(to_bytes(proxy['user_pass']))
            request.headers['Proxy-Authorization'] = b'Basic ' + encoded_user_pass
        else:
            request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
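Whichever proxy middleware is used, it still has to be enabled in settings (the module path matches the commented DOWNLOADER_MIDDLEWARES example earlier; the priority puts it near the built-in HttpProxyMiddleware at 750):

DOWNLOADER_MIDDLEWARES = {
    'xdb.proxy.XdbProxyMiddleware': 751,
}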
VI. Custom commands
Running a single spider: main.py
from scrapy.cmdline import execute
import sys
import os

sys.path.append(os.path.dirname(__file__))

# execute(['scrapy', 'crawl', 'spiderchouti', '--nolog'])
# os.system('scrapy crawl spiderchouti')
# os.system('scrapy crawl xiaohua')
os.system('scrapy crawl choutilike --nolog')
Running all spiders:
- Create a directory (any name, e.g. commands) at the same level as spiders
- Create crawlall.py inside it (the file name becomes the command name)
- Add COMMANDS_MODULE = '<project_name>.<directory_name>' to settings.py
- Run scrapy crawlall from the project directory
# -*- coding: utf-8 -*-
"""
@Datetime: 2018/9/1
@Author: Zhang Yafei
"""
from scrapy.commands import ScrapyCommand
from scrapy.utils.project import get_project_settings


class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        print(type(self.crawler_process))
        from scrapy.crawler import CrawlerProcess
        # 1. CrawlerProcess.__init__ runs
        # 2. the CrawlerProcess object (holding the settings) exposes .spiders
        #    2.1 a Crawler is created for each spider
        #    2.2 d = Crawler.crawl(...) is executed
        #        d.addBoth(_done)
        #    2.3 CrawlerProcess._active = {d, ...}
        # 3. dd = defer.DeferredList(self._active)
        #    dd.addBoth(self._stop_reactor)   # self._stop_reactor ==> reactor.stop()
        #    reactor.run()

        # find all spider names
        spider_list = self.crawler_process.spiders.list()
        # spider_list = ['choutilike', 'xiaohua']   # or crawl an explicit subset
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)
        self.crawler_process.start()
VII. Signals
Signals are hook points the framework leaves open so you can attach custom functionality. Built-in signals:
# engine started / stopped
engine_started = object()
engine_stopped = object()
# spider opened
spider_opened = object()
# spider idle (no pending requests)
spider_idle = object()
# spider closed
spider_closed = object()
# spider raised an exception
spider_error = object()
# request handed to the scheduler
request_scheduled = object()
# request dropped
request_dropped = object()
# response received
response_received = object()
# response downloaded
response_downloaded = object()
# item scraped
item_scraped = object()
# item dropped
item_dropped = object()
Custom extensions
from scrapy import signals


class MyExtend(object):
    def __init__(self, crawler):
        self.crawler = crawler
        # register handlers on the chosen signals
        crawler.signals.connect(self.start, signals.engine_started)
        crawler.signals.connect(self.close, signals.engine_stopped)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def start(self):
        print('signals.engine_started start')

    def close(self):
        print('signals.engine_stopped close')
from scrapy import signals


class MyExtend(object):
    def __init__(self):
        pass

    @classmethod
    def from_crawler(cls, crawler):
        self = cls()
        crawler.signals.connect(self.x1, signal=signals.spider_opened)
        crawler.signals.connect(self.x2, signal=signals.spider_closed)
        return self

    def x1(self, spider):
        print('open')

    def x2(self, spider):
        print('close')
Configuration
EXTENSIONS = {
    # 'scrapy.extensions.telnet.TelnetConsole': None,
    'chouti.extensions.MyExtend': 200,
}
VIII. Configuration file
Scrapy's default settings file:
"""
This module contains the default values for all settings used by Scrapy.
For more information about these settings you can read the settings
documentation in docs/topics/settings.rst
Scrapy developers, if you add a setting here remember to:
* add it in alphabetical order
* group similar settings without leaving blank lines
* add its documentation to the available settings documentation
(docs/topics/settings.rst)
"""
import sys
from importlib import import_module
from os.path import join, abspath, dirname
import six
AJAXCRAWL_ENABLED = False
AUTOTHROTTLE_ENABLED = False
AUTOTHROTTLE_DEBUG = False
AUTOTHROTTLE_MAX_DELAY = 60.0
AUTOTHROTTLE_START_DELAY = 5.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
BOT_NAME = 'scrapybot'
CLOSESPIDER_TIMEOUT = 0
CLOSESPIDER_PAGECOUNT = 0
CLOSESPIDER_ITEMCOUNT = 0
CLOSESPIDER_ERRORCOUNT = 0
COMMANDS_MODULE = ''
COMPRESSION_ENABLED = True
CONCURRENT_ITEMS = 100
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8
CONCURRENT_REQUESTS_PER_IP = 0
COOKIES_ENABLED = True
COOKIES_DEBUG = False
DEFAULT_ITEM_CLASS = 'scrapy.item.Item'
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
DEPTH_LIMIT = 0
DEPTH_STATS = True
DEPTH_PRIORITY = 0
DNSCACHE_ENABLED = True
DNSCACHE_SIZE = 10000
DNS_TIMEOUT = 60
DOWNLOAD_DELAY = 0
DOWNLOAD_HANDLERS = {}
DOWNLOAD_HANDLERS_BASE = {
    'data': 'scrapy.core.downloader.handlers.datauri.DataURIDownloadHandler',
    'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler',
    'http': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
    'https': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
    's3': 'scrapy.core.downloader.handlers.s3.S3DownloadHandler',
    'ftp': 'scrapy.core.downloader.handlers.ftp.FTPDownloadHandler',
}
DOWNLOAD_TIMEOUT = 180 # 3mins
DOWNLOAD_MAXSIZE = 1024*1024*1024 # 1024m
DOWNLOAD_WARNSIZE = 32*1024*1024 # 32m
DOWNLOAD_FAIL_ON_DATALOSS = True
DOWNLOADER = 'scrapy.core.downloader.Downloader'
DOWNLOADER_HTTPCLIENTFACTORY = 'scrapy.core.downloader.webclient.ScrapyHTTPClientFactory'
DOWNLOADER_CLIENTCONTEXTFACTORY = 'scrapy.core.downloader.contextfactory.ScrapyClientContextFactory'
DOWNLOADER_CLIENT_TLS_METHOD = 'TLS' # Use highest TLS/SSL protocol version supported by the platform,
# also allowing negotiation
DOWNLOADER_MIDDLEWARES = {}
DOWNLOADER_MIDDLEWARES_BASE = {
    # Engine side
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
    # Downloader side
}
DOWNLOADER_STATS = True
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
EDITOR = 'vi'
if sys.platform == 'win32':
    EDITOR = '%s -m idlelib.idle'
EXTENSIONS = {}
EXTENSIONS_BASE = {
    'scrapy.extensions.corestats.CoreStats': 0,
    'scrapy.extensions.telnet.TelnetConsole': 0,
    'scrapy.extensions.memusage.MemoryUsage': 0,
    'scrapy.extensions.memdebug.MemoryDebugger': 0,
    'scrapy.extensions.closespider.CloseSpider': 0,
    'scrapy.extensions.feedexport.FeedExporter': 0,
    'scrapy.extensions.logstats.LogStats': 0,
    'scrapy.extensions.spiderstate.SpiderState': 0,
