Scrapy - Part 2
Crawlers, session 5
Review of last session:
- asynchronous, non-blocking
    - callbacks
    - no waiting
- the Scrapy framework
- creating a crawler:
    scrapy startproject sp2
    cd sp2
    scrapy genspider chouti chouti.com
    scrapy crawl chouti
- writing the code (chouti.py):
    - the name attribute
    - allowed domains
    - start URLs
    - parse(self, response)
    - selectors (XPath):
        //          nodes anywhere in the document
        /           direct children
        /@attr      an attribute's value
        /text()     text content
    - yield Request(url='xxx', callback=self.parse)
  (a sketch tying these pieces together follows)
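A minimal sketch pulling the review together; the spider name and the XPath expression are illustrative assumptions, not chouti.com's actual markup:

    import scrapy
    from scrapy.http import Request

    class ReviewSpider(scrapy.Spider):
        name = 'review'                    # unique spider name
        allowed_domains = ['chouti.com']   # off-site requests are dropped
        start_urls = ['http://chouti.com/']

        def parse(self, response):
            # //a/@href: every <a> href anywhere in the page (assumed markup)
            for href in response.xpath('//a/@href').extract():
                # follow each link back into this same callback
                yield Request(url=response.urljoin(href), callback=self.parse)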
Today's topics:
1. Start URLs - overriding start_requests
    import scrapy
    from scrapy.http import Request

    class ChoutiSpider(scrapy.Spider):
        name = 'chouti'
        allowed_domains = ['chouti.com']
        start_urls = ['http://chouti.com/']

        def start_requests(self):
            # Scrapy calls this to build the initial requests; override it to
            # change the callback or skip the duplicate filter (dont_filter=True)
            for url in self.start_urls:
                yield Request(url, dont_filter=True, callback=self.parse1)

        def parse1(self, response):
            pass
2. POST requests and request headers
For comparison, the requests library:
    requests.get(url, params={}, headers={}, cookies={})
    requests.post(url, params={}, headers={}, cookies={}, data={}, json={})
Scrapy's Request covers the same ground:
    url,
    method='GET',
    headers=None,
    body=None,
    cookies=None,
GET request:
    url,
    method='GET',
    headers={},
    cookies={},   # a dict, or a cookiejar
POST request:
    url,
    method='POST',
    headers={},
    cookies={},   # a dict, or a cookiejar
    body=None,    # must already be an encoded string
    - Content-Type: application/x-www-form-urlencoded; charset=UTF-8
        form_data = {
            'user': 'alex',
            'pwd': 123
        }
        import urllib.parse
        data = urllib.parse.urlencode({'k1': 'v1', 'k2': 'v2'})
        # e.g. "phone=86155fa&password=asdf&oneMonth=1"
    - Content-Type: application/json; charset=UTF-8
        json.dumps({'k1': 'v1', 'k2': 'v2'})
        # '{"k1": "v1", "k2": "v2"}'
Example:
    Request(
        url='http://dig.chouti.com/login',
        method='POST',
        headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
        body='phone=8615131255089&password=pppppppp&oneMonth=1',
        callback=self.check_login
    )
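If the endpoint expected JSON instead, only the body encoding and Content-Type would change; a sketch, assuming a hypothetical JSON endpoint (chouti's real login takes form encoding):

    import json
    import scrapy
    from scrapy.http import Request

    class JsonLoginSpider(scrapy.Spider):
        # Hypothetical spider contrasting a JSON body with the form example above
        name = 'json_login'
        start_urls = ['http://dig.chouti.com/']

        def parse(self, response):
            yield Request(
                url='http://dig.chouti.com/api/login',  # assumed endpoint
                method='POST',
                headers={'Content-Type': 'application/json; charset=UTF-8'},
                body=json.dumps({'phone': '86xxxxxxxxxxx', 'password': '...'}),
                callback=self.check_login,
            )

        def check_login(self, response):
            print(response.text)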
2.5 Cookies
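The login request above only works if the session cookie from an earlier response is carried along. A minimal sketch of harvesting it with scrapy's CookieJar (note _cookies is a private attribute, acceptable for inspection in a tutorial but not for production code):

    import scrapy
    from scrapy.http.cookies import CookieJar

    class CookieSpider(scrapy.Spider):
        name = 'cookie_demo'
        start_urls = ['http://dig.chouti.com/']

        def parse(self, response):
            # collect the cookies the server set on this response
            cookie_jar = CookieJar()
            cookie_jar.extract_cookies(response, response.request)
            cookie_dict = {}
            for domain, paths in cookie_jar._cookies.items():
                for path, names in paths.items():
                    for name, cookie in names.items():
                        cookie_dict[name] = cookie.value
            # pass along later via Request(..., cookies=cookie_dict)
            print(cookie_dict)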
Exercise: log in to chouti.com automatically
    1. Send a GET request to chouti and capture the cookie it sets
    2. POST the username and password, carrying the cookie from step 1
       (a successful login returns code 9999)
    3. Do whatever you like: carry the cookie and upvote a post
(a sketch of the full flow follows)
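A sketch of the whole exercise, under assumptions: the login and vote URLs follow the old dig.chouti.com API, the credentials are placeholders, and the upvoted link id is hypothetical:

    import scrapy
    from scrapy.http import Request
    from scrapy.http.cookies import CookieJar

    class LoginSpider(scrapy.Spider):
        name = 'chouti_login'
        allowed_domains = ['chouti.com']
        start_urls = ['http://dig.chouti.com/']
        cookie_dict = {}

        def parse(self, response):
            # Step 1: harvest the anonymous session cookie from the first GET
            cookie_jar = CookieJar()
            cookie_jar.extract_cookies(response, response.request)
            for domain, paths in cookie_jar._cookies.items():
                for path, names in paths.items():
                    for name, cookie in names.items():
                        self.cookie_dict[name] = cookie.value
            # Step 2: POST the credentials (placeholders), carrying that cookie
            yield Request(
                url='http://dig.chouti.com/login',
                method='POST',
                headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
                body='phone=86xxxxxxxxxxx&password=xxxx&oneMonth=1',
                cookies=self.cookie_dict,
                callback=self.check_login,
            )

        def check_login(self, response):
            # a successful login response contains code 9999
            print(response.text)
            # Step 3: carry the cookie and upvote (the link id is hypothetical)
            yield Request(
                url='http://dig.chouti.com/link/vote?linksId=123456',
                method='POST',
                cookies=self.cookie_dict,
                callback=self.show_result,
            )

        def show_result(self, response):
            print(response.text)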
3. Persistence: Item and pipelines
A pipeline runs only if:
- the spider yields Item objects (an Item sketch follows the settings below)
- the pipeline is registered in settings:
    ITEM_PIPELINES = {
        # lower numbers run earlier (range 0-1000); dict keys must be unique,
        # so a second entry needs its own class, e.g. the hypothetical
        # 'sp2.pipelines.OtherPipeline': 100,
        'sp2.pipelines.Sp2Pipeline': 300,
    }
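The Item side of the contract, as a sketch (the field names are assumptions):

    # items.py -- declares the fields each scraped record carries
    import scrapy

    class Sp2Item(scrapy.Item):
        title = scrapy.Field()  # assumed field name
        href = scrapy.Field()   # assumed field name

    # in a spider callback:
    #     yield Sp2Item(title='...', href='...')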
Writing the pipeline:

    class Sp2Pipeline(object):
        def __init__(self):
            self.f = None

        def process_item(self, item, spider):
            """
            :param item: the object yielded from the spider
            :param spider: the spider instance, e.g. obj = JianDanSpider()
            :return:
            """
            print(item)
            self.f.write('....')
            return item
            # from scrapy.exceptions import DropItem
            # raise DropItem()  # the remaining pipelines' process_item will no longer run

        @classmethod
        def from_crawler(cls, crawler):
            """
            Called at startup to create the pipeline object.
            :param crawler:
            :return:
            """
            # val = crawler.settings.get('MMMM')
            print("pipeline from_crawler: creating the instance")
            return cls()

        def open_spider(self, spider):
            """
            Called when the spider starts.
            :param spider:
            :return:
            """
            print('spider opened')
            self.f = open('a.log', 'a+')

        def close_spider(self, spider):
            """
            Called when the spider closes.
            :param spider:
            :return:
            """
            self.f.close()
Pipelines are global: every spider in the project goes through them. To special-case one spider, branch on spider.name (see the snippet below).
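A minimal sketch of that branching (hypothetical pipeline class):

    class SelectivePipeline(object):
        def process_item(self, item, spider):
            # runs for every spider; branch on the name to special-case one
            if spider.name == 'chouti':
                print('chouti-specific handling', item)
            return item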
4. Custom deduplication rules
- write a class
- point to it in the settings file
    class RepeatUrl:
        def __init__(self):
            self.visited_url = set()  # held in this process's memory

        @classmethod
        def from_settings(cls, settings):
            """
            Called at initialization.
            :param settings:
            :return:
            """
            return cls()

        def request_seen(self, request):
            """
            Check whether this request has already been visited.
            :param request:
            :return: True if already visited; False if not
            """
            if request.url in self.visited_url:
                return True
            self.visited_url.add(request.url)
            return False

        def open(self):
            """
            Called when crawling starts.
            :return:
            """
            print('dupefilter opened')

        def close(self, reason):
            """
            Called when crawling ends.
            :param reason:
            :return:
            """
            print('dupefilter closed')

        def log(self, request, spider):
            """
            Called to log a filtered duplicate request.
            """
            pass

Point the settings at it:

    DUPEFILTER_CLASS = 'sp2.rep.RepeatUrl'
5. Custom extensions (signal-based)
    from scrapy import signals

    class MyExtension(object):
        def __init__(self, value):
            self.value = value

        @classmethod
        def from_crawler(cls, crawler):
            val = crawler.settings.getint('MMMM')
            ext = cls(val)
            # register for scrapy's spider_opened signal
            crawler.signals.connect(ext.opened, signal=signals.spider_opened)
            # register for scrapy's spider_closed signal
            crawler.signals.connect(ext.closed, signal=signals.spider_closed)
            return ext

        def opened(self, spider):
            print('open')

        def closed(self, spider):
            print('close')
    EXTENSIONS = {
        # 'scrapy.extensions.telnet.TelnetConsole': None,
    }
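As written, the dict enables nothing extra. A sketch of registering the extension above, assuming it lives in sp2/extensions.py:

    # settings.py -- assumed module path for MyExtension
    EXTENSIONS = {
        'sp2.extensions.MyExtension': 500,
    }
    MMMM = 10  # the setting MyExtension.from_crawler reads with getint()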
6. Middleware
- spider middleware
- downloader middleware
(a minimal downloader-middleware sketch follows)
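A preview sketch of a downloader middleware; the class name and module path are assumptions:

    # sp2/middlewares.py (assumed path)
    class Sp2DownloaderMiddleware(object):
        def process_request(self, request, spider):
            # runs before each download; returning None lets scrapy proceed
            request.headers.setdefault('User-Agent', 'Mozilla/5.0')
            return None

        def process_response(self, request, response, spider):
            # runs after each download; must return a Response or a new Request
            return response

    # settings.py:
    # DOWNLOADER_MIDDLEWARES = {'sp2.middlewares.Sp2DownloaderMiddleware': 543}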
7. Other topics
- the settings file
- proxies (see the sketch below)
- HTTPS certificates
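For proxies, one common pattern is setting request.meta['proxy'], which scrapy's built-in HttpProxyMiddleware also honours; a sketch with a placeholder address:

    # sp2/middlewares.py (assumed path)
    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            # route every request through a proxy; placeholder address
            request.meta['proxy'] = 'http://127.0.0.1:8888'
            return None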
8. Custom commands (an entry point for reading the source)
- start all of the project's spiders at once (see the sketch below)
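A sketch of such a command, assuming it is saved as sp2/commands/crawlall.py and COMMANDS_MODULE = 'sp2.commands' is added to settings; then `scrapy crawlall` runs every spider:

    from scrapy.commands import ScrapyCommand

    class Command(ScrapyCommand):
        requires_project = True

        def short_desc(self):
            return 'Run every spider in the project'

        def run(self, args, opts):
            # schedule all spiders, then start the reactor once
            for name in self.crawler_process.spider_loader.list():
                self.crawler_process.crawl(name)
            self.crawler_process.start()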
Homework: read the source code
Tasks:
- POST requests, cookies, headers
- pipelines
* deduplication, signals
1. The custom extension must be written and actually run successfully
2. Preview: middleware, custom commands (an entry point for reading the source)
Algorithms:
- Can you write the code?
- Can you write the pseudocode?
