Web Crawlers -- 1


1. Crawler workflow, summarized:

Fetch ---> Parse ---> Store

2. Tools for crawling (a minimal end-to-end sketch follows this list):

Request libraries: requests, selenium
Parsing libraries: regular expressions, beautifulsoup, pyquery
Storage: files, MySQL, MongoDB, Redis
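A minimal sketch of the whole fetch -> parse -> store flow, assuming a placeholder URL and the simplest storage option (a plain file):

    import requests
    from bs4 import BeautifulSoup

    # Fetch: the URL is a placeholder
    response = requests.get('https://example.com')

    # Parse: pull every link target out of the page
    soup = BeautifulSoup(response.text, 'html.parser')
    links = [a.get('href') for a in soup.find_all('a')]

    # Store: write the results to a text file
    with open('links.txt', 'w') as f:
        f.write('\n'.join(link for link in links if link))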

3. Common crawler framework:

scrapy

Review:
requests
beautifulsoup
Web knowledge:
- request headers and request body
- cookies
- csrf
- common request headers (see the requests sketch after this list):
- user-agent: client/device info
- content-type: format of the request body
- referer: the page that linked here
- host: ...
Web WeChat:
- inspect the network requests it makes
- experiment
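A quick sketch of setting the common headers above with requests (all header and cookie values here are illustrative):

    import requests

    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',  # device info
        'referer': 'https://dig.chouti.com/',                       # page that linked here
    }
    # cookies ride along with the request; the value is a placeholder
    response = requests.get('https://dig.chouti.com/', headers=headers, cookies={'session': 'xxx'})
    print(response.status_code)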

High-performance approaches (a thread-pool sketch follows this list):
	- serial
	- thread pools, process pools
	- async non-blocking
		- asyncio
		- gevent
		- twisted
		- tornado
	  PS: coroutines
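The thread-pool version, as a sketch (URLs are placeholders): each fetch blocks its own worker thread, which is exactly what the async non-blocking approaches below avoid.

	import requests
	from concurrent.futures import ThreadPoolExecutor

	urls = ['https://www.baidu.com', 'https://www.cnblogs.com', 'https://www.bing.com']

	def fetch(url):
		return url, requests.get(url).status_code

	# the pool runs the blocking fetches concurrently across threads
	with ThreadPoolExecutor(max_workers=3) as pool:
		for url, status in pool.map(fetch, urls):
			print(url, status)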

Today's topics:
- "the most awesome async IO module ever"
- the scrapy framework

  1. Possibly the most awesome async IO module ever
    Prerequisites:
    a. socket client
    obj = socket.socket()
    # obj.connect(('198.1.1.1', 80))
    obj.connect(('dig.chouti.com', 80)) # blocking

     	obj.send(b'GET /index HTTP/1.1\r\nHost: ...\r\nContent-Type: xxxxx\r\n\r\n')

     	obj.recv(1024) # receive at most 1024 bytes # blocking


     	obj.close()
     	
     	
     	
     	Example:
     		import socket
    
     		client = socket.socket()
     		# connect
     		client.connect(("43.226.160.17",80)) # blocking
    
     		# send the request
     		data = b"GET / HTTP/1.0\r\nHost: dig.chouti.com\r\n\r\n"
     		client.sendall(data)
    
    
     		response = client.recv(8096) # blocking
     		print(response)
    
     		client.close()
     	
     	Summary:
     		this is how an HTTP request is sent over a raw socket
     		with a non-blocking socket, connect raises immediately, so wrap it in try
     		then define what to do once the connection or data is ready
     		
     b. IO multiplexing: detect whether [multiple] socket objects have changed state
     	
     	socket_list = []
     		
     	for host in ['www.baidu.com', .....]:
     	
     		client = socket.socket()
     		client.setblocking(False)
     		# connect
     		try:
     			client.connect((host,80)) # the connect request has been fired off; don't wait
     		except BlockingIOError as e:
     			print(e)
     		socket_list.append(client)
     
     	
     	# event loop
     	while True:
     		r,w,e = select.select(socket_list,socket_list,[],0.05)
     		# what is w? [sk2,sk3] -- the sockets whose connection succeeded
     		for obj in w:
     			obj.send(b"GET / HTTP/1.0\r\n....")
     		# what is r? [sk2,sk3] -- the sockets with data ready to receive
     		for obj in r:
     			response = obj.recv(...)
     			print(response)
     					   
     		
     	Key points:
     		client.setblocking(False)
     		select.select detects both events: connection established, data has come back
     		
     		
     		
     		
     The module:
     	import socket
     	import select
    
     	class Request(object):
     		def __init__(self,sock,info):
     			self.sock = sock
     			self.info = info
    
     		def fileno(self):
     			return self.sock.fileno()
    
    
     	class QinBing(object):
     		def __init__(self):
     			self.sock_list = []
     			self.conns = []
    
     		def add_request(self,req_info):
     			"""
     			Create a request.
     			 req_info: {'host': 'www.baidu.com', 'port': 80, 'path': '/'},
     			:return:
     			"""
     			sock = socket.socket()
     			sock.setblocking(False)
     			try:
     				sock.connect((req_info['host'],req_info['port']))
     			except BlockingIOError as e:
     				pass
    
     			obj = Request(sock,req_info)
     			self.sock_list.append(obj)
     			self.conns.append(obj)
    
     		def run(self):
     			"""
     			Start the event loop. Detect: did a connection succeed? did data come back?
     			:return:
     			"""
     			while True:
     				# select.select([socket object, ...])
     				# can actually take any object, as long as it has a fileno() method;
     				# select calls obj.fileno() internally,
     				# so we can pass [Request object, ...]
     				r,w,e = select.select(self.sock_list,self.conns,[],0.05)
     				# w: connections that succeeded
     				for obj in w:
     					# obj is a Request object wrapping:
     					# socket, {'host': 'www.baidu.com', 'port': 80, 'path': '/'},
     					data = "GET %s HTTP/1.1\r\nHost: %s\r\n\r\n" %(obj.info['path'],obj.info['host'])
     					obj.sock.send(data.encode('utf-8'))
     					self.conns.remove(obj)
     				# r: data has come back, receive it
     				for obj in r:
     					response = obj.sock.recv(8096)
     					obj.info['callback'](response)
     					self.sock_list.remove(obj)
    
    
     				# all requests have returned
     				if not self.sock_list:
     					break
    
     		
     	
     		Usage:
     			from .qb import QinBing
    
    
     			def done1(response):
     				print(response)
    
     			def done2(response):
     				print(response)
    
     			url_list = [
     				{'host': 'www.baidu.com', 'port': 80, 'path': '/','callback': done1},
     				{'host': 'www.cnblogs.com', 'port': 80, 'path': '/index.html','callback': done2},
     				{'host': 'www.bing.com', 'port': 80, 'path': '/','callback': done2},
     			]
    
     			qinbing = QinBing()
     			for item in url_list:
     				qinbing.add_request(item)
    
     			qinbing.run()
     		
     		
     	Twisted and Tornado implement this same pattern: an event loop, non-blocking sockets, and callbacks.
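     	For comparison, a minimal sketch of the same multi-request flow on top of asyncio (Python 3.7+; hosts and paths are the ones from the usage example above):

     		import asyncio

     		async def fetch(host, path, callback):
     			# open_connection does the non-blocking connect for us
     			reader, writer = await asyncio.open_connection(host, 80)
     			writer.write(("GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path, host)).encode('utf-8'))
     			await writer.drain()
     			callback(await reader.read(8096))
     			writer.close()

     		def done(response):
     			print(response)

     		async def main():
     			await asyncio.gather(
     				fetch('www.baidu.com', '/', done),
     				fetch('www.cnblogs.com', '/index.html', done),
     				fetch('www.bing.com', '/', done),
     			)

     		asyncio.run(main())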
     		
       
     **********Must know**********
    
  2. The Scrapy framework
    Features:
    - uses twisted to download pages
    - HTML parsing via selector objects
    - proxy support
    - download delays
    - URL deduplication
    - depth-first (e.g. depth 5) and breadth-first crawling
    ...
    Intuition for how it works (sketched as a loop below):
    start with seed URLs
    download each page
    - persist the results
    - keep downloading new links
    a scheduler coordinates it all
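    That intuition as a plain loop (a sketch of the idea only, not Scrapy's actual code):

        from collections import deque

        def crawl(start_urls, download, parse, persist):
            scheduler = deque(start_urls)      # queue of pending URLs
            seen = set()                       # deduplication
            while scheduler:
                url = scheduler.popleft()
                if url in seen:
                    continue
                seen.add(url)
                page = download(url)           # download the page
                items, new_urls = parse(page)  # extract data + follow-up links
                persist(items)                 # persist the results
                scheduler.extend(new_urls)     # keep downloading via the scheduler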

    Install:
    Linux:
    pip3 install scrapy

     Windows:
     	pip3 install wheel
     	# download the Twisted .whl that matches your Python, e.g. into D:\
     	pip3 install D:\Twisted-xxx.whl
     	
     	pip3 install scrapy   # without the wheel step this fails with a twisted build error
     	
     	pip3 install pywin32
    
     
     PS: 
     	- at the time, twisted did not fully support python3
     	- python2 was the safe choice
     
     import scrapy   # verify the install
    

    Usage:
    Django (for comparison):
    django-admin startproject mysite
    cd mysite
    python manage.py startapp app01

     Scrapy
     	# create a project
     	scrapy startproject sp1
     	
     	sp1
     		- sp1
     			- spiders/			spider code lives here
     			- middlewares.py	middleware
     			- items.py			item definitions (structured data)
     			- pipelines.py		persistence
     			- settings.py		settings
     		- scrapy.cfg 			project config
     	
     	# create a spider inside the project
     	cd sp1
     	scrapy genspider baidu baidu.com
     
     	# run the spider (from inside the project); the generated file is sketched below
     	scrapy crawl baidu
     	scrapy crawl baidu --nolog
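     	The generated sp1/spiders/baidu.py looks roughly like this (exact contents depend on your scrapy version):

     		# -*- coding: utf-8 -*-
     		import scrapy

     		class BaiduSpider(scrapy.Spider):
     			name = 'baidu'                      # the name `scrapy crawl baidu` refers to
     			allowed_domains = ['baidu.com']
     			start_urls = ['http://baidu.com/']

     			def parse(self, response):
     				# every downloaded start URL lands here as `response`
     				pass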
    

    Basic operations:
    from scrapy.selector import HtmlXPathSelector, Selector
    from scrapy.http import Request

     1. selector
     
     		hxs = Selector(response=response)
     		# print(hxs)
     		user_list = hxs.xpath('//div[@class="item masonry_brick"]')
     		for item in user_list:
     			price = item.xpath('./span[@class="price"]/text()').extract_first()
     			url = item.xpath('div[@class="item_t"]/div[@class="class"]//a/@href').extract_first()
     			print(price,url)
    
     		result = hxs.xpath(r'//a[re:test(@href, "http://www.xiaohuar.com/list-1-\d+.html")]/@href').extract()
     		print(result)
     		# e.g. result = ['http://www.xiaohuar.com/list-1-1.html', 'http://www.xiaohuar.com/list-1-2.html']
    
                 
             Special cases:
                 >>> response.xpath('//body/a') # the leading // searches the whole document; the / after body selects body's direct children
    
                 >>> response.xpath('//body//a') # the leading // searches the whole document; the // after body selects all descendants of body
    
                 hxs.xpath('//div[@class="item masonry_brick"]/xxx') # / searches among children; // searches among all descendants
     		
     2. yield Request(url=url,callback=self.parse)   # schedule another URL to be crawled; a full spider sketch follows
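     	Putting 1 and 2 together: a sketch of a spider whose parse method extracts data and yields new Requests for the list pages matched above (the site layout is the one assumed in the selector example):

     		import scrapy
     		from scrapy.http import Request
     		from scrapy.selector import Selector

     		class XiaohuarSpider(scrapy.Spider):
     			name = 'xiaohuar'
     			start_urls = ['http://www.xiaohuar.com/list-1-1.html']

     			def parse(self, response):
     				hxs = Selector(response=response)
     				# extract data from the current page
     				for item in hxs.xpath('//div[@class="item masonry_brick"]'):
     					print(item.xpath('./span[@class="price"]/text()').extract_first())
     				# yield a Request per pagination link; each response
     				# comes back through this same parse method
     				for url in hxs.xpath(r'//a[re:test(@href, "http://www.xiaohuar.com/list-1-\d+.html")]/@href').extract():
     					yield Request(url=url, callback=self.parse)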
    

Exercises: Chouti (dig.chouti.com), Jandan (jandan.net)
