爬虫之Scrapy
wus点我
Scrapy
Scrapy是一个为了爬取网站数据,提取结构性数据而编写的应用框架。 其可以应用在数据挖掘,信息处理或存储历史数据等一系列的程序中。
其最初是为了页面抓取 (更确切来说, 网络抓取 )所设计的, 也可以应用在获取API所返回的数据(例如 Amazon Associates Web Services ) 或者通用的网络爬虫。Scrapy用途广泛,可以用于数据挖掘、监测和自动化测试。
Scrapy 使用了 Twisted异步网络库来处理网络通讯。整体架构大致如下5


深度


Scrapy主要包括了以下组件:
- 引擎(Scrapy)
用来处理整个系统的数据流处理, 触发事务(框架核心) - 调度器(Scheduler)
用来接受引擎发过来的请求, 压入队列中, 并在引擎再次请求的时候返回. 可以想像成一个URL(抓取网页的网址或者说是链接)的优先队列, 由它来决定下一个要抓取的网址是什么, 同时去除重复的网址 - 下载器(Downloader)
用于下载网页内容, 并将网页内容返回给蜘蛛(Scrapy下载器是建立在twisted这个高效的异步模型上的) - 爬虫(Spiders)
爬虫是主要干活的, 用于从特定的网页中提取自己需要的信息, 即所谓的实体(Item)。用户也可以从中提取出链接,让Scrapy继续抓取下一个页面 - 项目管道(Pipeline)
负责处理爬虫从网页中抽取的实体,主要的功能是持久化实体、验证实体的有效性、清除不需要的信息。当页面被爬虫解析后,将被发送到项目管道,并经过几个特定的次序处理数据。 - 下载器中间件(Downloader Middlewares)
位于Scrapy引擎和下载器之间的框架,主要是处理Scrapy引擎与下载器之间的请求及响应。 - 爬虫中间件(Spider Middlewares)
介于Scrapy引擎和爬虫之间的框架,主要工作是处理蜘蛛的响应输入和请求输出。 - 调度中间件(Scheduler Middewares)
介于Scrapy引擎和调度之间的中间件,从Scrapy引擎发送到调度的请求和响应。
Scrapy运行流程大概如下:
引擎从调度器中取出一个链接(URL)用于接下来的抓取
引擎把URL封装成一个请求(Request)传给下载器
下载器把资源下载下来,并封装成应答包(Response)
爬虫解析Response
解析出实体(Item),则交给实体管道进行进一步的处理
解析出的是链接(URL),则把URL交给调度器等待抓取
一、安装
Linux
pip3 install scrapyWindows a. pip3 install wheel b. 下载twisted http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted (版本号依据自己电脑和系统定) c. 进入下载目录,执行 pip3 install Twisted‑17.1.0‑cp35‑cp35m‑win_amd64.whl (版本号依据自己电脑和系统定) d. pip3 install scrapy e. 下载并安装pywin32:https://sourceforge.net/projects/pywin32/files/g. pip3 install pywin32 -i
二、基本使用
1. 基本命令
1. scrapy startproject 项目名称 如 scrapy startproject xianglong
![]()
- 在当前目录中创建中创建一个项目文件(类似于Django)
cd xianglong #进入目录
scrapy genspider chouti chouti.com #创建一个爬虫项目 genspider 蜘蛛 创建之后可以看懂这些内容
![]()
2. scrapy genspider [-t template] <name> <domain> - 创建爬虫应用 如: scrapy gensipider -t basic oldboy oldboy.com scrapy gensipider -t xmlfeed autohome autohome.com.cn PS: 查看所有命令:scrapy gensipider -l 查看模板命令:scrapy gensipider -d 模板名称 3. scrapy list - 展示爬虫应用列表 4. scrapy crawl 爬虫应用名称 - 运行单独爬虫应用
scrapy crawl chouti --nolog 运行爬虫项目
运行爬虫
在cmd中先进入到项目目录中

创建完项目之后的目录结构
![]()
文件说明:
scrapy.cfg 项目的主配置信息。(真正爬虫相关的配置信息在settings.py文件中)
items.py 设置数据存储模板,用于结构化数据,如:Django的Model
pipelines 数据处理行为,如:一般结构化的数据持久化
settings.py 配置文件,如:递归的层数、并发数,延迟下载等
spiders 爬虫目录,如:创建文件,编写爬虫规则
选择器
5.scrapy查询语法:
from scrapy.selector import HtmlXPathSelector 此解析器的解析查找标签的方法
当我们爬取大量的网页,如果自己写正则匹配,会很麻烦,也很浪费时间,令人欣慰的是,scrapy内部支持更简单的查询语法,帮助我们去html中查询我们需要的标签和标签内容以及标签属性。下面逐一进行介绍:
-
查询子子孙孙中的某个标签(以div标签为例)://div 查询儿子中的某个标签(以div标签为例):/div 查询标签中带有某个class属性的标签://div[@class=’c1′]即子子孙孙中标签是div且class=‘c1’的标签 查询标签中带有某个class=‘c1’并且自定义属性name=‘alex’的标签://div[@class=’c1′][@name=’alex’] 查询某个标签的文本内容://div/span/text() 即查询子子孙孙中div下面的span标签中的文本内容 查询某个属性的值(例如查询a标签的href属性)://a/@href
实例
# -*- coding: utf-8 -*- import scrapy from scrapy.selector import HtmlXPathSelector from bs4 import BeautifulSoup class ChoutiSpider(scrapy.Spider): name = 'chouti' allowed_domains = ['chouti.com'] start_urls = ['http://dig.chouti.com/'] def parse(self, response):
#HtmlXPathSelector 是一个解析器, 把response穿进去实例化一个HtmlXPathSelector对象就能够对这个对象使用响应的规则进行解析找 标签等内容 hxs = HtmlXPathSelector(response=response) # 去下载的页面中:找新闻,去子子孙孙中去找div id='content-list'的标签, # 并去他的儿子中去找 div class='item' 的表桥 items = hxs.xpath("//div[@id='content-list']/div[@class='item']") for item in items: #在这儿加了一个 . 表示从当前位置(不加 . 表示从根目录开始找)的相对位置去找, # 索引为1的a标签, href1 = item.xpath('.//div[@class="part1"]//a[1]') #[<HtmlXPathSelector xpath='.//div[@class="part1"]//a[1]' data='<a href="https://www.weibo.com/394967112'>] #这是找到的内容,为标签对象
print(item.xpath('.//div[]#class="part1"]//a[1]/text()')) #想要直接打印标签内部的文本
href2 = item.xpath('.//div[@class="part1"]//a[1]/text()').extract_first() #/text()能够取出a不标签内的文本,但是还是标签对象,如果想要取出具体的内容,需要在加上.extract_first() #这样就能把里边的文本取出来了 # # print(href2.strip())
href3 = item.xpath('.//div[@class="part1"]//a[1]/@href').extract_first()#获取标签的属性 print(href3)
在浏览器上复制查询路径的方法


代码 解析器使用方法详解
#!/usr/bin/env python # -*- coding:utf-8 -*- from scrapy.selector import Selector, HtmlXPathSelector from scrapy.http import HtmlResponse html = """<!DOCTYPE html> <html> <head lang="en"> <meta charset="UTF-8"> <title></title> </head> <body> <ul> <li class="item-"><a id='i1' href="link.html">first item</a></li> <li class="item-0"><a id='i2' href="llink.html">first item</a></li> <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li> </ul> <div><a href="llink2.html">second item</a></div> </body> </html> """ response = HtmlResponse(url='http://example.com', body=html,encoding='utf-8') # hxs = HtmlXPathSelector(response) # print(hxs) # hxs = Selector(response=response).xpath('//a') # print(hxs) # hxs = Selector(response=response).xpath('//a[2]') # print(hxs) # hxs = Selector(response=response).xpath('//a[@id]') # print(hxs) # hxs = Selector(response=response).xpath('//a[@id="i1"]') # print(hxs) # hxs = Selector(response=response).xpath('//a[@href="link.html"][@id="i1"]') # print(hxs) # hxs = Selector(response=response).xpath('//a[contains(@href, "link")]') # print(hxs) # hxs = Selector(response=response).xpath('//a[starts-with(@href, "link")]') # print(hxs) # hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]') # print(hxs) # hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/text()').extract() # print(hxs) # hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/@href').extract() # print(hxs) # hxs = Selector(response=response).xpath('/html/body/ul/li/a/@href').extract() # print(hxs) # hxs = Selector(response=response).xpath('//body/ul/li/a/@href').extract_first() # print(hxs) # ul_list = Selector(response=response).xpath('//body/ul/li') # for item in ul_list: # v = item.xpath('./a/span') # # 或 # # v = item.xpath('a/span') # # 或 # # v = item.xpath('*/a/span') # print(v)
如何实现数据的持久化
一: items.py 的文件和功能 用于与pipeline配合使用
items.py的功能就是对爬虫爬取的数据进行格式化 返回一实例化对象 item 如果 yeild tem对象, 就会自动把这个item对象自动交给pipeline
代码
# -*- coding: utf-8 -*- import scrapy from scrapy.selector import HtmlXPathSelector from ..items import XianglongItem class ChoutiSpider(scrapy.Spider): name = 'chouti' allowed_domains = ['chouti.com'] start_urls = ['http://dig.chouti.com/'] def parse(self, response): hxs = HtmlXPathSelector(response=response) # 去下载的页面中:找新闻,去子子孙孙中去找div id='content-list'的标签, # 并去他的儿子中去找 div class='item' 的表桥 # items = hxs.xpath("//div[@id='content-list']/div[@class='item']") items = hxs.xpath("//div[@id='content-list']/div[@class='item']") for item in items: href = item.xpath('.//div[@class="part1"]//a[1]/@href').extract_first() text = item.xpath('.//div[@class="part1"]//a[1]/text()').extract_first() item = XianglongItem(title=text, href=href) yield item
items.py中的文件用于对数据的处理,按相应的格式传给 pipelines.py

代码
# -*- coding: utf-8 -*- # Define here the models for your scraped items # # See documentation in: # https://doc.scrapy.org/en/latest/topics/items.html import scrapy class XianglongItem(scrapy.Item): # define the fields for your item here like: # name = scrapy.Field() title = scrapy.Field() href = scrapy.Field()
去配置文件中注册上这用户持久话的对戏那个

pipelines.py文件中的代码
此类是用于数据持久化操作的
class XianglongPipeline(object): def process_item(self, item, spider): self.f.write(item['href']+'\n') #flush一下,将文件更新的数据库中去 self.f.flush() return item def open_spider(self, spider): """ 爬虫开始执行时,调用,然后上边就可以按照顺序一致去往 文件内部写入内容,而且不会覆盖刚刚写入的,因为文件在这里一直打开 的状态,所以可以将爬到的每一条数据都写入进去 :param spider: :return: """ self.f = open('url.log','w') def close_spider(self, spider): """ 爬虫关闭时,被调用 :param spider: :return: """ self.f.close()
都要去settings进行配置
当想要将数据一份存入文件,一份写入数据库的时候就可以定义两个pipliens
当有多个pipelins的时候, 配置中的值的大小对执行顺序他的影响
定义两个piplines 和执行顺序

当先执行的pipeliens中的 process_item return None的时候,后执行的 接收的值为空

DropIrem
PS:如果想要丢弃item,不给后续pipeline使用:就需要使用 DropItem

效果 只写入文件了 不写入数据库
def from_crawler(cls,) 类的使用

代码
from scrapy.exceptions import DropItem class CustomPipeline(object): def __init__(self,v): self.value = v def process_item(self, item, spider): # 操作并进行持久化 # return表示会被后续的pipeline继续处理 return item # 表示将item丢弃,不会被后续pipeline处理 # raise DropItem() @classmethod def from_crawler(cls, crawler): """ 初始化时候,用于创建pipeline对象 :param crawler: :return: """ val = crawler.settings.getint('MMMM') return cls(val) def open_spider(self,spider): """ 爬虫开始执行时,调用 :param spider: :return: """ print('000000') def close_spider(self,spider): """ 爬虫关闭时,被调用 :param spider: :return: """ print('111111')
完整的piplines
# -*- coding: utf-8 -*- # Define your item pipelines here # # Don't forget to add your pipeline to the ITEM_PIPELINES setting # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html """ 当根据配置文件: ITEM_PIPELINES = { 'xianglong.pipelines.FilePipeline': 300, 'xianglong.pipelines.DBPipeline': 301, } 找到相关的类:FilePipeline之后,会优先判断类中是否含有 from_crawler 如果有: obj = FilePipeline.from_crawler() #实例化的方式 #最终目的都是实例化一个对象,所以这样看写这个方法与不写都一样 #但是详细看方法的定义 如下 没有则: obj = FilePipeline() #实例化的方式 obj.open_spider(..) ob.process_item(..) obj.close_spider(..) """ from scrapy.exceptions import DropItem class FilePipeline(object): def __init__(self,path): self.path = path self.f = None @classmethod def from_crawler(cls, crawler): """ 初始化时候,用于创建pipeline对象 :param crawler: :return: """ # return cls() # path = crawler.settings.get('XL_FILE_PATH') #但是可以在实例化对象之前,去配置文件中取出文件要保存的路径等信息,
#把路径信息初始化到此类的 __init__的 属性中去, 便于文件写入时调用路径信息 #方便 return cls(path) def process_item(self, item, spider): self.f.write(item['href']+'\n') return item def open_spider(self, spider): """ 爬虫开始执行时,调用 :param spider: :return: """ self.f = open(self.path,'w') def close_spider(self, spider): """ 爬虫关闭时,被调用 :param spider: :return: """ self.f.close() class DBPipeline(object): def process_item(self, item, spider): print('数据库',item) return item def open_spider(self, spider): """ 爬虫开始执行时,调用 :param spider: :return: """ print('打开数据') def close_spider(self, spider): """ 爬虫关闭时,被调用 :param spider: :return: """ print('关闭数据库')
4. POST/请求头/Cookie 自定义请求头
方案一:
post请求头
from scrapy.http import Request req = Request( url='http://dig.chouti.com/login', method='POST', body='phone=8613121758648&password=woshiniba&oneMonth=1', headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'}, cookies={}, callback=self.parse_check_login, )
手动加cookie
cookie_dict = {} cookie_jar = CookieJar() cookie_jar.extract_cookies(response, response.request) for k, v in cookie_jar._cookies.items(): for i, j in v.items(): for m, n in j.items(): cookie_dict[m] = n.value req = Request( url='http://dig.chouti.com/login', method='POST', headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'}, body='phone=8615131255089&password=pppppppp&oneMonth=1', cookies=cookie_dict, # 手动携带 callback=self.check_login ) yield req
6.中间件
爬虫中间件
class SpiderMiddleware(object): def process_spider_input(self,response, spider): """ 下载完成,执行,然后交给parse处理 :param response: :param spider: :return: """ pass def process_spider_output(self,response, result, spider): """ spider处理完成,返回时调用 :param response: :param result: :param spider: :return: 必须返回包含 Request 或 Item 对象的可迭代对象(iterable) """ return result def process_spider_exception(self,response, exception, spider): """ 异常调用 :param response: :param exception: :param spider: :return: None,继续交给后续中间件处理异常;含 Response 或 Item 的可迭代对象(iterable),交给调度器或pipeline """ return None def process_start_requests(self,start_requests, spider): """ 爬虫启动时调用 :param start_requests: :param spider: :return: 包含 Request 对象的可迭代对象 """ return start_requests
下载器中间件 现在settings中做好配置

代码

from scrapy import signals class UserAgentDownloaderMiddleware(object): @classmethod def from_crawler(cls, crawler): # This method is used by Scrapy to create your spiders. s = cls() return s def process_request(self, request, spider): request.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36" #相当于给每个请求过来的请求头都加上这个请求头 return None # 继续执行后续的中间件的process_request # from scrapy.http import Request # return Request(url='www.baidu.com') #这个代表重新放入调度器中,当前请求不再继续处理 # from scrapy.http import HtmlResponse # 执行从最后一个开始执行所有的process_response # return HtmlResponse(url='www.baidu.com',body=b'asdfuowjelrjaspdoifualskdjf;lajsdf') #如果出现这个return的话就会执行从最后一个开始执行所有的process_response
def process_response(self, request, response, spider): return response def process_exception(self, request, exception, spider): pass
class DownMiddleware1(object): def process_request(self, request, spider): """ 请求需要被下载时,经过所有下载器中间件的process_request调用 :param request: :param spider: :return: None,继续后续中间件去下载; Response对象,停止process_request的执行,开始执行process_response Request对象,停止中间件的执行,将Request重新调度器 raise IgnoreRequest异常,停止process_request的执行,开始执行process_exception """ pass def process_response(self, request, response, spider): """ spider处理完成,返回时调用 :param response: :param result: :param spider: :return: Response 对象:转交给其他中间件process_response Request 对象:停止中间件,request会被重新调度下载 raise IgnoreRequest 异常:调用Request.errback """ print('response1') return response def process_exception(self, request, exception, spider): """ 当下载处理器(download handler)或 process_request() (下载中间件)抛出异常 :param response: :param exception: :param spider: :return: None:继续交给后续中间件处理异常; Response对象:停止后续process_exception方法 Request对象:停止中间件,request将会被重新调用下载 """ return None
cookie操作:
去重
在爬取数据的过程中,为了避免爬取重复的url 可以加上去重规则
scrapy默认使用 scrapy.dupefilter.RFPDupeFilter 进行去重,相关配置有:
DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter' DUPEFILTER_DEBUG = False JOBDIR = "保存范文记录的日志路径,如:/root/" # 最终路径为 /root/requests.seen
使用自定义去重规则
在setting中进行配置
DUPEFILTER_CLASS = 'xianglong.dupe.MyDupeFilter'
""" 1. 根据配置文件找到 DUPEFILTER_CLASS = 'xianglong.dupe.MyDupeFilter' 2. 判断是否存在from_settings 如果有: obj = MyDupeFilter.from_settings() 否则: obj = MyDupeFilter() 3. 在request_seen中取出request.url 判断是否在self.record(访问记录)中 如果在的话就,就return True 表示已经访问过了,不在重复访问 如果没有访问过,就访问,并把url放到访问记录中去, 下次再访问的话就会报以访问过,并不再重复访问 """
自定义去重代碼 dupe.py
from scrapy.dupefilter import BaseDupeFilter from scrapy.utils.request import request_fingerprint class MyDupeFilter(BaseDupeFilter): def __init__(self): self.record = set() @classmethod def from_settings(cls, settings): return cls() def request_seen(self, request): ident = request_fingerprint(request) if ident in self.record: print('已经访问过了', request.url) return True self.record.add(ident) def open(self): # can return deferred pass def close(self, reason): # can return a deferred pass


一个新的类 用于处理url



代碼
from scrapy.utils.request import request_fingerprint from scrapy.http import Request u1 = Request(url='http://www.oldboyedu.com?id=1&age=2') u2 = Request(url='http://www.oldboyedu.com?age=2&id=1') result1 = request_fingerprint(u1) result2 = request_fingerprint(u2) print(result1,result2)
don'tfilter到底在哪兒
补充:dont_filter到低在哪里? from scrapy.core.scheduler import Scheduler def enqueue_request(self, request): # request.dont_filter=False # self.df.request_seen(request): # - True,已经访问 # - False,未访问 # request.dont_filter=True,全部加入到调度器 if not request.dont_filter and self.df.request_seen(request): self.df.log(request, self.spider) return False # 如果往下走,把请求加入调度器 dqok = self._dqpush(request)
访问下一页的内容 此时yeild的Request对象


start_requests
可用点:可以写多个回调函数parse对response进行不同解析,然后start_requests来选择先执行那个回调函数
通过他来yiel Request(url, callback=sdf) 来调用执行下边定义的回调函数来执行爬取数据
这个类中自带star_requests 这个方法,也可以自定义此方法

这里的start_request可以通过yeild返回,也可以return 一个list

代码
# -*- coding: utf-8 -*- import scrapy from bs4 import BeautifulSoup from scrapy.selector import HtmlXPathSelector from scrapy.http import Request from ..items import XianglongItem from scrapy.http import HtmlResponse from scrapy.http.response.html import HtmlResponse """ obj = ChoutiSpider() obj.start_requests() """ class ChoutiSpider(scrapy.Spider): name = 'chouti' allowed_domains = ['chouti.com'] start_urls = ['https://dig.chouti.com/',] def start_requests(self): for url in self.start_urls: # 通过他来yiel # Request(url, callback=sdf) # 来调用执行下边定义的回调函数来执行爬取数据 yield Request( url=url, callback=self.parse, headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) ' 'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'} ) def parse(self, response): """ 当起始URL下载完毕后,自动执行parse函数:response封装了响应相关的所有内容。 :param response: :return: """ pages = response.xpath('//div[@id="page-area"]//a[@class="ct_pagepa"]/@href').extract() for page_url in pages: page_url = "https://dig.chouti.com" + page_url yield Request(url=page_url,callback=self.parse,headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'})
返回给parse的response对象
实际上是from scrapy.http.response.html import HtmlResponse 的对象


这个类中就封这response中的所有内容
如何处理cookie
手动处理cookie
# -*- coding: utf-8 -*- import scrapy from bs4 import BeautifulSoup from scrapy.selector import HtmlXPathSelector from scrapy.http import Request from ..items import XianglongItem from scrapy.http import HtmlResponse from scrapy.http.response.html import HtmlResponse """ obj = ChoutiSpider() obj.start_requests() """ class ChoutiSpider(scrapy.Spider): name = 'chouti' allowed_domains = ['chouti.com'] start_urls = ['http://dig.chouti.com/',] cookie_dict = {} def start_requests(self): for url in self.start_urls: yield Request(url=url,callback=self.parse_index) def parse_index(self,response): # 原始cookie # print(response.headers.getlist('Set-Cookie')) # 解析后的cookie from scrapy.http.cookies import CookieJar cookie_jar = CookieJar() cookie_jar.extract_cookies(response, response.request) for k, v in cookie_jar._cookies.items(): for i, j in v.items(): for m, n in j.items(): self.cookie_dict[m] = n.value print(cookie_dict)
进阶:g给抽屉完成走动登录并点赞
# -*- coding: utf-8 -*- import scrapy from bs4 import BeautifulSoup from scrapy.selector import HtmlXPathSelector from scrapy.http import Request from ..items import XianglongItem from scrapy.http import HtmlResponse from scrapy.http.response.html import HtmlResponse """ obj = ChoutiSpider() obj.start_requests() """ class ChoutiSpider(scrapy.Spider): name = 'chouti' allowed_domains = ['chouti.com'] start_urls = ['http://dig.chouti.com/',] cookie_dict = {}#声明一个全局的空的字典, def start_requests(self): for url in self.start_urls: yield Request(url=url,callback=self.parse_index) #先完成登录 def parse_index(self,response): # 原始cookie # print(response.headers.getlist('Set-Cookie')) # 解析后的cookie from scrapy.http.cookies import CookieJar
cookie_jar = CookieJar()
#将cookie信息封装进去,,然后将内容写入字典 cookie_jar.extract_cookies(response, response.request) for k, v in cookie_jar._cookies.items(): for i, j in v.items(): for m, n in j.items(): self.cookie_dict[m] = n.value req = Request( url='http://dig.chouti.com/login', method='POST', headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'}, body='phone=8613121758648&password=woshiniba&oneMonth=1', cookies=self.cookie_dict, #设置回调函数,当这个登录执行完毕之后,去走下边定义的点赞功能 callback=self.parse_check_login )
#执行登录 因为抽屉网的验证时使用的登录前的cookie所以这里登录的过程中要cookie信息才能顺利完成登录,并且这条cookie信息可以使用 yield req #完成登录之后,完成对单条数据的点赞 def parse_check_login(self,response): print(response.text) yield Request( url='https://dig.chouti.com/link/vote?linksId=19440976', method='POST', cookies=self.cookie_dict, callback=self.parse_show_result ) #看看是否点赞成功 def parse_show_result(self,response): print(response.text)
# -*- coding: utf-8 -*- import scrapy from bs4 import BeautifulSoup from scrapy.selector import HtmlXPathSelector from scrapy.http import Request from ..items import XianglongItem from scrapy.http import HtmlResponse from scrapy.http.response.html import HtmlResponse """ obj = ChoutiSpider() obj.start_requests() """ class ChoutiSpider(scrapy.Spider): name = 'chouti' allowed_domains = ['chouti.com'] start_urls = ['http://dig.chouti.com/',] cookie_dict = {} def start_requests(self): for url in self.start_urls: yield Request(url=url,callback=self.parse_index) #先完成登录 def parse_index(self,response): # 原始cookie # print(response.headers.getlist('Set-Cookie')) # 解析后的cookie from scrapy.http.cookies import CookieJar cookie_jar = CookieJar() cookie_jar.extract_cookies(response, response.request) for k, v in cookie_jar._cookies.items(): for i, j in v.items(): for m, n in j.items(): self.cookie_dict[m] = n.value req = Request( url='http://dig.chouti.com/login', method='POST', headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'}, body='phone=8613121758648&password=woshiniba&oneMonth=1', cookies=self.cookie_dict, #设置回调函数,当这个登录执行完毕之后,去走下边定义的点赞功能 callback=self.parse_check_login ) yield req #完成登录之后,完成对单条数据的点赞 def parse_check_login(self,response): print(response.text) yield Request( url='https://dig.chouti.com/link/vote?linksId=19440976', method='POST', cookies=self.cookie_dict, callback=self.parse_show_result ) #看看是否点赞成功 def parse_show_result(self,response): print(response.text)
自动处理cookie
# -*- coding: utf-8 -*- import scrapy from bs4 import BeautifulSoup from scrapy.selector import HtmlXPathSelector from scrapy.http import Request from ..items import XianglongItem from scrapy.http import HtmlResponse from scrapy.http.response.html import HtmlResponse """ obj = ChoutiSpider() obj.start_requests() """ class ChoutiSpider(scrapy.Spider): name = 'chouti' allowed_domains = ['chouti.com'] start_urls = ['http://dig.chouti.com/',] def start_requests(self): for url in self.start_urls: #加上一个这个就可以自动把cookie带上, yield Request(url=url,callback=self.parse_index,meta={'cookiejar':True}) def parse_index(self,response): req = Request( url='http://dig.chouti.com/login', method='POST', headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'}, body='phone=8613121758648&password=woshiniba&oneMonth=1', callback=self.parse_check_login, #使用的时候用到cookie的时候,要加上这个参数 meta={'cookiejar': True} ) yield req def parse_check_login(self,response): # print(response.text) yield Request( url='https://dig.chouti.com/link/vote?linksId=19440976', method='POST', callback=self.parse_show_result, meta={'cookiejar': True} ) def parse_show_result(self,response): print(response.text)

其他
settings.py
# -*- coding: utf-8 -*- # Scrapy settings for step8_king project # # For simplicity, this file contains only settings considered important or # commonly used. You can find more settings consulting the documentation: # # http://doc.scrapy.org/en/latest/topics/settings.html # http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html # http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html # 1. 爬虫名称 BOT_NAME = 'step8_king' # 2. 爬虫应用路径 SPIDER_MODULES = ['step8_king.spiders'] NEWSPIDER_MODULE = 'step8_king.spiders' # Crawl responsibly by identifying yourself (and your website) on the user-agent # 3. 客户端 user-agent请求头 # USER_AGENT = 'step8_king (+http://www.yourdomain.com)' # Obey robots.txt rules # 4. 禁止爬虫配置 # ROBOTSTXT_OBEY = False # Configure maximum concurrent requests performed by Scrapy (default: 16) # 5. 并发请求数 # CONCURRENT_REQUESTS = 4 # Configure a delay for requests for the same website (default: 0) # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay # See also autothrottle settings and docs # 6. 延迟下载秒数 # DOWNLOAD_DELAY = 2 # The download delay setting will honor only one of: # 7. 单域名访问并发数,并且延迟下次秒数也应用在每个域名 # CONCURRENT_REQUESTS_PER_DOMAIN = 2 # 单IP访问并发数,如果有值则忽略:CONCURRENT_REQUESTS_PER_DOMAIN,并且延迟下次秒数也应用在每个IP # CONCURRENT_REQUESTS_PER_IP = 3 # Disable cookies (enabled by default) # 8. 是否支持cookie,cookiejar进行操作cookie # COOKIES_ENABLED = True # COOKIES_DEBUG = True # Disable Telnet Console (enabled by default) # 9. Telnet用于查看当前爬虫的信息,操作爬虫等... # 使用telnet ip port ,然后通过命令操作 # TELNETCONSOLE_ENABLED = True # TELNETCONSOLE_HOST = '127.0.0.1' # TELNETCONSOLE_PORT = [6023,] # 10. 默认请求头 # Override the default request headers: # DEFAULT_REQUEST_HEADERS = { # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', # 'Accept-Language': 'en', # } # Configure item pipelines # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html # 11. 定义pipeline处理请求 # ITEM_PIPELINES = { # 'step8_king.pipelines.JsonPipeline': 700, # 'step8_king.pipelines.FilePipeline': 500, # } # 12. 自定义扩展,基于信号进行调用 # Enable or disable extensions # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html # EXTENSIONS = { # # 'step8_king.extensions.MyExtension': 500, # } # 13. 爬虫允许的最大深度,可以通过meta查看当前深度;0表示无深度 # DEPTH_LIMIT = 3 # 14. 爬取时,0表示深度优先Lifo(默认);1表示广度优先FiFo # 后进先出,深度优先 # DEPTH_PRIORITY = 0 # SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleLifoDiskQueue' # SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.LifoMemoryQueue' # 先进先出,广度优先 # DEPTH_PRIORITY = 1 # SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue' # SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue' # 15. 调度器队列 # SCHEDULER = 'scrapy.core.scheduler.Scheduler' # from scrapy.core.scheduler import Scheduler # 16. 访问URL去重 # DUPEFILTER_CLASS = 'step8_king.duplication.RepeatUrl' # Enable and configure the AutoThrottle extension (disabled by default) # See http://doc.scrapy.org/en/latest/topics/autothrottle.html """ 17. 自动限速算法 from scrapy.contrib.throttle import AutoThrottle 自动限速设置 1. 获取最小延迟 DOWNLOAD_DELAY 2. 获取最大延迟 AUTOTHROTTLE_MAX_DELAY 3. 设置初始下载延迟 AUTOTHROTTLE_START_DELAY 4. 当请求下载完成后,获取其"连接"时间 latency,即:请求连接到接受到响应头之间的时间 5. 用于计算的... AUTOTHROTTLE_TARGET_CONCURRENCY target_delay = latency / self.target_concurrency new_delay = (slot.delay + target_delay) / 2.0 # 表示上一次的延迟时间 new_delay = max(target_delay, new_delay) new_delay = min(max(self.mindelay, new_delay), self.maxdelay) slot.delay = new_delay """ # 开始自动限速 # AUTOTHROTTLE_ENABLED = True # The initial download delay # 初始下载延迟 # AUTOTHROTTLE_START_DELAY = 5 # The maximum download delay to be set in case of high latencies # 最大下载延迟 # AUTOTHROTTLE_MAX_DELAY = 10 # The average number of requests Scrapy should be sending in parallel to each remote server # 平均每秒并发数 # AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 # Enable showing throttling stats for every response received: # 是否显示 # AUTOTHROTTLE_DEBUG = True # Enable and configure HTTP caching (disabled by default) # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings """ 18. 启用缓存 目的用于将已经发送的请求或相应缓存下来,以便以后使用 from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware from scrapy.extensions.httpcache import DummyPolicy from scrapy.extensions.httpcache import FilesystemCacheStorage """ # 是否启用缓存策略 # HTTPCACHE_ENABLED = True # 缓存策略:所有请求均缓存,下次在请求直接访问原来的缓存即可 # HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy" # 缓存策略:根据Http响应头:Cache-Control、Last-Modified 等进行缓存的策略 # HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy" # 缓存超时时间 # HTTPCACHE_EXPIRATION_SECS = 0 # 缓存保存路径 # HTTPCACHE_DIR = 'httpcache' # 缓存忽略的Http状态码 # HTTPCACHE_IGNORE_HTTP_CODES = [] # 缓存存储的插件 # HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage' """ 19. 代理,需要在环境变量中设置 from scrapy.contrib.downloadermiddleware.httpproxy import HttpProxyMiddleware 方式一:使用默认 os.environ { http_proxy:http://root:woshiniba@192.168.11.11:9999/ https_proxy:http://192.168.11.11:9999/ } 方式二:使用自定义下载中间件 def to_bytes(text, encoding=None, errors='strict'): if isinstance(text, bytes): return text if not isinstance(text, six.string_types): raise TypeError('to_bytes must receive a unicode, str or bytes ' 'object, got %s' % type(text).__name__) if encoding is None: encoding = 'utf-8' return text.encode(encoding, errors) class ProxyMiddleware(object): def process_request(self, request, spider): PROXIES = [ {'ip_port': '111.11.228.75:80', 'user_pass': ''}, {'ip_port': '120.198.243.22:80', 'user_pass': ''}, {'ip_port': '111.8.60.9:8123', 'user_pass': ''}, {'ip_port': '101.71.27.120:80', 'user_pass': ''}, {'ip_port': '122.96.59.104:80', 'user_pass': ''}, {'ip_port': '122.224.249.122:8088', 'user_pass': ''}, ] proxy = random.choice(PROXIES) if proxy['user_pass'] is not None: request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port']) encoded_user_pass = base64.encodestring(to_bytes(proxy['user_pass'])) request.headers['Proxy-Authorization'] = to_bytes('Basic ' + encoded_user_pass) print "**************ProxyMiddleware have pass************" + proxy['ip_port'] else: print "**************ProxyMiddleware no pass************" + proxy['ip_port'] request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port']) DOWNLOADER_MIDDLEWARES = { 'step8_king.middlewares.ProxyMiddleware': 500, } """ """ 20. Https访问 Https访问时有两种情况: 1. 要爬取网站使用的可信任证书(默认支持) DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory" DOWNLOADER_CLIENTCONTEXTFACTORY = "scrapy.core.downloader.contextfactory.ScrapyClientContextFactory" 2. 要爬取网站使用的自定义证书 DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory" DOWNLOADER_CLIENTCONTEXTFACTORY = "step8_king.https.MySSLFactory" # https.py from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory from twisted.internet.ssl import (optionsForClientTLS, CertificateOptions, PrivateCertificate) class MySSLFactory(ScrapyClientContextFactory): def getCertificateOptions(self): from OpenSSL import crypto v1 = crypto.load_privatekey(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.key.unsecure', mode='r').read()) v2 = crypto.load_certificate(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.pem', mode='r').read()) return CertificateOptions( privateKey=v1, # pKey对象 certificate=v2, # X509对象 verify=False, method=getattr(self, 'method', getattr(self, '_ssl_method', None)) ) 其他: 相关类 scrapy.core.downloader.handlers.http.HttpDownloadHandler scrapy.core.downloader.webclient.ScrapyHTTPClientFactory scrapy.core.downloader.contextfactory.ScrapyClientContextFactory 相关配置 DOWNLOADER_HTTPCLIENTFACTORY DOWNLOADER_CLIENTCONTEXTFACTORY """ """ 21. 爬虫中间件 class SpiderMiddleware(object): def process_spider_input(self,response, spider): ''' 下载完成,执行,然后交给parse处理 :param response: :param spider: :return: ''' pass def process_spider_output(self,response, result, spider): ''' spider处理完成,返回时调用 :param response: :param result: :param spider: :return: 必须返回包含 Request 或 Item 对象的可迭代对象(iterable) ''' return result def process_spider_exception(self,response, exception, spider): ''' 异常调用 :param response: :param exception: :param spider: :return: None,继续交给后续中间件处理异常;含 Response 或 Item 的可迭代对象(iterable),交给调度器或pipeline ''' return None def process_start_requests(self,start_requests, spider): ''' 爬虫启动时调用 :param start_requests: :param spider: :return: 包含 Request 对象的可迭代对象 ''' return start_requests 内置爬虫中间件: 'scrapy.contrib.spidermiddleware.httperror.HttpErrorMiddleware': 50, 'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': 500, 'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': 700, 'scrapy.contrib.spidermiddleware.urllength.UrlLengthMiddleware': 800, 'scrapy.contrib.spidermiddleware.depth.DepthMiddleware': 900, """ # from scrapy.contrib.spidermiddleware.referer import RefererMiddleware # Enable or disable spider middlewares # See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html SPIDER_MIDDLEWARES = { # 'step8_king.middlewares.SpiderMiddleware': 543, } """ 22. 下载中间件 class DownMiddleware1(object): def process_request(self, request, spider): ''' 请求需要被下载时,经过所有下载器中间件的process_request调用 :param request: :param spider: :return: None,继续后续中间件去下载; Response对象,停止process_request的执行,开始执行process_response Request对象,停止中间件的执行,将Request重新调度器 raise IgnoreRequest异常,停止process_request的执行,开始执行process_exception ''' pass def process_response(self, request, response, spider): ''' spider处理完成,返回时调用 :param response: :param result: :param spider: :return: Response 对象:转交给其他中间件process_response Request 对象:停止中间件,request会被重新调度下载 raise IgnoreRequest 异常:调用Request.errback ''' print('response1') return response def process_exception(self, request, exception, spider): ''' 当下载处理器(download handler)或 process_request() (下载中间件)抛出异常 :param response: :param exception: :param spider: :return: None:继续交给后续中间件处理异常; Response对象:停止后续process_exception方法 Request对象:停止中间件,request将会被重新调用下载 ''' return None 默认下载中间件 { 'scrapy.contrib.downloadermiddleware.robotstxt.RobotsTxtMiddleware': 100, 'scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware': 300, 'scrapy.contrib.downloadermiddleware.downloadtimeout.DownloadTimeoutMiddleware': 350, 'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400, 'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 500, 'scrapy.contrib.downloadermiddleware.defaultheaders.DefaultHeadersMiddleware': 550, 'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': 580, 'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': 590, 'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600, 'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700, 'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 750, 'scrapy.contrib.downloadermiddleware.chunked.ChunkedTransferMiddleware': 830, 'scrapy.contrib.downloadermiddleware.stats.DownloaderStats': 850, 'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 900, } """ # from scrapy.contrib.downloadermiddleware.httpauth import HttpAuthMiddleware # Enable or disable downloader middlewares # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html # DOWNLOADER_MIDDLEWARES = { # 'step8_king.middlewares.DownMiddleware1': 100, # 'step8_king.middlewares.DownMiddleware2': 500, # }











浙公网安备 33010602011771号