scrapy - 2

Crawlers x5

Review of last lesson:
- async, non-blocking
- callbacks
- no waiting (don't block on I/O)
- the Scrapy framework
- creating a spider
	scrapy startproject sp2
	cd sp2
	scrapy genspider chouti chouti.com

		scrapy crawl chouti
	- writing the spider code in chouti.py
		- the name attribute
		- allowed domains
		- the start URLs
			- parse(self, response)
			- selectors (XPath)
				//
				/
				/@attribute-name
				/text()
			- yield Request(url='xxx', callback=self.parse)

Today's topics:

1. Start URLs: start_requests and parse
	import scrapy
	from scrapy.http import Request

	class ChoutiSpider(scrapy.Spider):
		name = 'chouti'
		allowed_domains = ['chouti.com']
		start_urls = ['http://chouti.com/']

		def start_requests(self):
			# called once at startup; override it to control how the start
			# URLs are requested (dont_filter=True skips the dupe filter)
			for url in self.start_urls:
				yield Request(url, dont_filter=True, callback=self.parse1)

		def parse1(self, response):
			pass
	
2. POST requests and request headers
	For comparison, the requests library:
		requests.get(params={}, headers={}, cookies={})
		requests.post(params={}, headers={}, cookies={}, data={}, json={})

	Scrapy's Request takes:
		url,
		method='GET',
		headers=None,
		body=None,
		cookies=None,
	
	
	GET request:
		url,
		method='GET',
		headers={},
		cookies={}  (or a cookiejar)

	POST request:
		url,
		method='POST',
		headers={},
		cookies={}  (or a cookiejar)
		body=None,
			Content-Type: application/x-www-form-urlencoded; charset=UTF-8
				form_data = {
					'user': 'alex',
					'pwd': '123'
				}
				import urllib.parse
				data = urllib.parse.urlencode({'k1': 'v1', 'k2': 'v2'})

				"phone=86155fa&password=asdf&oneMonth=1"

			Content-Type: application/json; charset=UTF-8
				json.dumps()

				'{"k1": "v1", "k2": "v2"}'
	
	Example:
		 Request(
			url='http://dig.chouti.com/login',
			method='POST',
			headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
			body='phone=8615131255089&password=pppppppp&oneMonth=1',
			callback=self.check_login
		)
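
	For a JSON body, only the Content-Type header and the body serialization change;
	a minimal sketch (the login URL is reused purely for illustration — the notes'
	actual chouti login uses the form-encoded version above):

		import json

		yield Request(
			url='http://dig.chouti.com/login',
			method='POST',
			headers={'Content-Type': 'application/json; charset=UTF-8'},
			body=json.dumps({'phone': '86xxxxxxxxxxx', 'password': 'xxxx', 'oneMonth': 1}),
			callback=self.check_login
		)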

2.5 Cookies
		Request(
			url='http://dig.chouti.com/login',
			method='POST',
			headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
			body='phone=8615131255089&password=pppppppp&oneMonth=1',
			cookies=cookie_dict,  # dict of cookies collected from an earlier response (see the exercise below)
			callback=self.check_login
		)
	
	Exercise: log in to chouti automatically (a sketch follows below)

			1. Send a GET request to chouti
			   and collect the cookies it sets

			2. POST the username and password to log in, carrying the cookies from step 1
			   a successful login returns code 9999

			3. Do whatever you like: carry the cookies and upvote a post
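
	A minimal sketch of this flow (phone/password are placeholders; the endpoints and
	the 9999 success code come from the notes, and CookieJar._cookies is Scrapy's
	internal cookie store, used here only to copy the values into a plain dict):

		import scrapy
		from scrapy.http import Request
		from scrapy.http.cookies import CookieJar

		class ChoutiLoginSpider(scrapy.Spider):
			name = 'chouti'
			allowed_domains = ['chouti.com']
			start_urls = ['http://dig.chouti.com/']
			cookie_dict = {}

			def parse(self, response):
				# step 1: the initial GET response carries Set-Cookie headers; collect them
				cookie_jar = CookieJar()
				cookie_jar.extract_cookies(response, response.request)
				for domain in cookie_jar._cookies.values():   # {domain: {path: {name: Cookie}}}
					for path in domain.values():
						for name, cookie in path.items():
							self.cookie_dict[name] = cookie.value
				# step 2: POST the login form, carrying the cookies from step 1
				yield Request(
					url='http://dig.chouti.com/login',
					method='POST',
					headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
					body='phone=86xxxxxxxxxxx&password=xxxx&oneMonth=1',  # fill in real credentials
					cookies=self.cookie_dict,
					callback=self.check_login,
				)

			def check_login(self, response):
				# a body containing code 9999 means the login succeeded
				print(response.text)
				# step 3 (upvoting) would be another POST that carries the same cookies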
			
	
3. Persistence: Item and pipeline
	A pipeline only runs if:
		- the spider yields Item objects (see the Item sketch right after this list)
		- the pipeline is registered in settings:
			ITEM_PIPELINES = {
			   # lower value = higher priority (that pipeline's process_item runs first)
			   'sp2.pipelines.Sp2Pipeline': 300,
			}
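
	A minimal Item sketch (the Sp2Item name and its fields are assumptions for
	illustration; in this project layout it would live in sp2/items.py):

		# sp2/items.py
		import scrapy

		class Sp2Item(scrapy.Item):
			title = scrapy.Field()
			href = scrapy.Field()

		# in the spider's parse(), yield one item per scraped record, e.g.
		#   yield Sp2Item(title=..., href=...)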
		
	Writing the pipeline:
		
		class Sp2Pipeline(object):
			def __init__(self):
				self.f = None

			def process_item(self, item, spider):
				"""
				:param item:  the Item object yielded back from the spider
				:param spider: the spider instance, e.g. obj = JianDanSpider()
				:return:
				"""
				print(item)
				self.f.write('....')
				return item
				# from scrapy.exceptions import DropItem
				# raise DropItem()  # the next pipeline's process_item will not run for this item

			@classmethod
			def from_crawler(cls, crawler):
				"""
				Called at startup to create the pipeline instance.
				:param crawler:
				:return:
				"""
				# val = crawler.settings.get('MMMM')
				print('from_crawler: creating the pipeline instance')
				return cls()

			def open_spider(self, spider):
				"""
				Called when the spider starts.
				:param spider:
				:return:
				"""
				print('spider opened')
				self.f = open('a.log', 'a+')

			def close_spider(self, spider):
				"""
				Called when the spider is closed.
				:param spider:
				:return:
				"""
				self.f.close()

	
	
	Pipelines are global: every spider in the project goes through them. To special-case an individual spider, check spider.name inside the pipeline methods (a short sketch follows).
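
	A sketch of that check inside process_item, assuming only the 'chouti'
	spider should be written to the file:

		def process_item(self, item, spider):
			if spider.name == 'chouti':
				self.f.write(str(item) + '\n')
			return item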
	
4. Custom dedup rules (duplicate request filtering)
	- write a class
	- point to it in the settings file
	
	class RepeatUrl:
		def __init__(self):
			self.visited_url = set()  # kept in this process's memory

		@classmethod
		def from_settings(cls, settings):
			"""
			Called at startup to create the dupe-filter instance.
			:param settings:
			:return:
			"""
			return cls()

		def request_seen(self, request):
			"""
			Check whether this request has already been seen.
			:param request:
			:return: True if the URL was visited before; False if it has not been visited yet
			"""
			if request.url in self.visited_url:
				return True
			self.visited_url.add(request.url)
			return False

		def open(self):
			"""
			Called when crawling starts.
			:return:
			"""
			print('dupe filter opened')

		def close(self, reason):
			"""
			Called when crawling ends.
			:param reason:
			:return:
			"""
			print('dupe filter closed')

		def log(self, request, spider):
			"""
			Called to log a filtered (duplicate) request.
			"""
			pass
	
	DUPEFILTER_CLASS = 'sp2.rep.RepeatUrl'

5. Custom extensions (signal-based)
		from scrapy import signals


		class MyExtension(object):
			def __init__(self, value):
				self.value = value

			@classmethod
			def from_crawler(cls, crawler):
				val = crawler.settings.getint('MMMM')
				ext = cls(val)

				# register a handler for the spider_opened signal
				crawler.signals.connect(ext.opened, signal=signals.spider_opened)
				# register a handler for the spider_closed signal
				crawler.signals.connect(ext.closed, signal=signals.spider_closed)

				return ext

			def opened(self, spider):
				print('open')

			def closed(self, spider):
				print('close')
				

		EXTENSIONS = {
		   # 'scrapy.extensions.telnet.TelnetConsole': None,
		   # register the custom extension here, e.g. 'sp2.extensions.MyExtension': 500
		   #   (that module path is an assumption about where MyExtension is saved)
		}
		
6. Middleware
	- spider middleware
	- downloader middleware (a minimal sketch follows below)
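
	A minimal downloader-middleware sketch as a preview (the class name, module path
	and proxy address are assumptions; it would be registered in settings via
	DOWNLOADER_MIDDLEWARES = {'sp2.middlewares.ProxyMiddleware': 543}):

		class ProxyMiddleware(object):
			def process_request(self, request, spider):
				# set a proxy for every outgoing request (placeholder address)
				request.meta['proxy'] = 'http://127.0.0.1:8888'
				return None  # None means: continue normal downloading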
	
7. Other topics
	the settings file
	proxies
	HTTPS certificates
	
8. Custom commands (a good entry point for reading the source)
	- start all of the project's spiders at once (see the sketch below)
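
	A sketch of such a command (assumed layout: sp2/commands/crawlall.py plus
	COMMANDS_MODULE = 'sp2.commands' in settings.py; then run `scrapy crawlall`):

		from scrapy.commands import ScrapyCommand

		class Command(ScrapyCommand):
			requires_project = True

			def short_desc(self):
				return 'Run all spiders in the project'

			def run(self, args, opts):
				# queue every spider found by the spider loader, then start once
				for name in self.crawler_process.spider_loader.list():
					self.crawler_process.crawl(name)
				self.crawler_process.start()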


Homework: read the source code

Tasks:
POST requests, cookies, headers
pipelines
* dedup, signals
1. Make sure your extension is written and actually runs
2. Preview: middleware, custom commands (an entry point for reading the source)

Algorithms:
	 - can you write the code?
	 - can you write it in pseudocode?