06 A Deeper Look at Scrapy
1. Sending requests manually in Scrapy
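- yield scrapy.Request(url, callback): sends a GET request
- callback names the parse method that will process the response
- yield scrapy.FormRequest(url, callback, formdata=formdata): sends a POST request
- formdata: a dict holding the form fields to send as request parameters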
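To make the GET/POST distinction concrete without running a crawl, here is a minimal sketch that uses only Python's standard library, not Scrapy. The endpoint https://example.com/search and the form fields kw and page are assumptions made up for illustration. FormRequest URL-encodes its formdata dict into the request body in a comparable way.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and form fields, used only for illustration.
url = "https://example.com/search"
formdata = {"kw": "scrapy", "page": "1"}

# No body: the request is a GET, comparable to scrapy.Request(url, callback).
get_req = Request(url)

# URL-encoded body: the request becomes a POST,
# comparable to scrapy.FormRequest(url, callback, formdata=formdata).
body = urlencode(formdata).encode("utf-8")
post_req = Request(url, data=body)

# Nothing is sent over the network; we only inspect the request objects.
print(get_req.get_method())
print(post_req.get_method())
print(body)
```

Note that the values in formdata are strings here. Passing integers to Scrapy's formdata is a common source of errors, so converting values with str() first is the safer habit.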
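Why the URLs in start_urls are requested automatically
- Why is a GET request sent automatically for every URL in the start_urls list?
- Because the start_requests method inherited from the parent class sends a GET request for each URL in the list: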
def start_requests(self):
    for u in self.start_urls:
        yield scrapy.Request(url=u, callback=self.parse)
- How can the URLs in start_urls be sent as POST requests by default?
- Override the start_requests method:
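def start_requests(self):
    # Scrapy calls start_requests() with no arguments, so the form fields
    # are defined inside the method; {'key': 'value'} is only a placeholder
    formdata = {'key': 'value'}
    for u in self.start_urls:
        yield scrapy.FormRequest(url=u, callback=self.parse, formdata=formdata)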
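The override mechanism can be sketched with a toy model. The classes below are stand-ins written for this note, not Scrapy's real classes, and the URL and form fields are hypothetical. The point is only that the caller invokes start_requests() with no arguments, so a subclass changes the request type by overriding the method while keeping the same signature.

```python
class ToySpider:
    """Stand-in for scrapy.Spider: the default start_requests issues GETs."""
    start_urls = []

    def start_requests(self):
        for u in self.start_urls:
            yield ("GET", u, None)


class PostSpider(ToySpider):
    start_urls = ["https://example.com/api"]  # hypothetical URL
    formdata = {"key": "value"}               # hypothetical form fields

    def start_requests(self):
        # Same signature as the parent: the caller passes no extra arguments.
        for u in self.start_urls:
            yield ("POST", u, self.formdata)


def engine_start(spider):
    """Stand-in for the engine, which calls start_requests() without arguments."""
    return list(spider.start_requests())


queued = engine_start(PostSpider())
print(queued)
```

Giving start_requests an extra required parameter would make this no-argument call fail with a TypeError, which is why the form fields live on the class or inside the method.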
Manual requests and paginated crawling in Scrapy
import scrapy
from ..items import HandreqproItem  # item class from items.py; assumes the default project layout

class DuanziSpider(scrapy.Spider):
    name = 'duanzi'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://duanzixing.com/段子/']
    # URL template for the following pages
    url = 'https://duanzixing.com/段子/%d/'
    page_num = 2

    def parse(self, response):
        article_list = response.xpath('/html/body/section/div/div/article')
        for article in article_list:
            title = article.xpath('./header/h2/a/text()').get()
            content = article.xpath('./p[2]/text()').get()
            item = HandreqproItem()  # instantiate the item object
            item['title'] = title  # fill in the fields
            item['content'] = content
            yield item  # submit the item to the pipeline
        if self.page_num < 5:  # stop condition for the recursion
            new_url = self.url % self.page_num  # build the URL of the next page
            self.page_num += 1
            # send the request manually; the callback runs parse again on the new page
            yield scrapy.Request(url=new_url, callback=self.parse)
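To see which pages this counter logic actually visits, the following sketch replays it outside Scrapy. The helper name page_urls is hypothetical. It mirrors the spider's page_num counter and its page_num < 5 check, using the same URL template.

```python
def page_urls(template, next_page, stop):
    """Return the URLs requested while next_page < stop, as in the spider."""
    urls = []
    while next_page < stop:
        urls.append(template % next_page)  # same %-formatting as self.url % self.page_num
        next_page += 1
    return urls


template = "https://duanzixing.com/段子/%d/"
urls = page_urls(template, 2, 5)
for u in urls:
    print(u)
```

Starting from page_num = 2, the spider requests pages 2, 3 and 4, so four pages are crawled in total once the start URL is counted.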
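2. Scraping detail-page data in Scrapy (passing data between requests)
- Use a callback that points to a method able to parse the detail page
- Pass the item along with the request: yield scrapy.Request(url=url_detail, callback=self.parse_detail, meta={'item': item})
- Submit the item to the pipeline from the detail-page callback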
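One detail worth spelling out: meta carries a reference to the item, not a copy, and the detail callbacks run later than the loop that queued the requests. The sketch below is a toy model using plain dicts and a queue, not Scrapy itself, and the function name crawl is hypothetical. It shows why a fresh item should be created for every request.

```python
from collections import deque


def crawl(titles, share_item):
    """Queue one fake detail request per title, then run the callbacks afterwards."""
    queue = deque()
    shared = {}
    for title in titles:
        # A fresh dict per request, unless we deliberately reuse one object.
        item = shared if share_item else {}
        item["title"] = title
        queue.append({"meta": {"item": item}})  # stand-in for Request(meta={'item': item})

    results = []
    while queue:
        request = queue.popleft()
        item = request["meta"]["item"]  # stand-in for response.meta['item']
        results.append(dict(item))
    return results


fresh = crawl(["A", "B"], share_item=False)
reused = crawl(["A", "B"], share_item=True)
print(fresh)
print(reused)
```

With a fresh item per request each callback sees its own title. With one reused object, every callback sees the title written last.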
def parse(self, response):
    movie_list = response.xpath('//*[@id="contents"]/ul/li')
    for movie in movie_list:
        title = movie.xpath('./div/a/text()').get()
        # example detail URL: https://www.heiyingcn.com/video/101157.html
        url_detail = 'https://www.heiyingcn.com' + movie.xpath('./a/@href').get()
        item = MovieproItem()
        item['title'] = title
        # pass the item to parse_detail through meta
        yield scrapy.Request(url=url_detail, callback=self.parse_detail, meta={'item': item})
    # for paginated crawling, send the next-page request here
    # yield scrapy.Request()

def parse_detail(self, response):
    item = response.meta['item']
    desc = response.xpath('//*[@class="synopsis-article"]/text()').get()
    item['desc'] = desc
    yield item
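Building the detail URL by string concatenation assumes that every href is a root-relative path and is never missing. A slightly more defensive variant, sketched here with the standard library, uses urljoin. The helper name detail_url is hypothetical. Inside a spider, response.urljoin(href) resolves the href against the response URL in the same spirit.

```python
from urllib.parse import urljoin

BASE = "https://www.heiyingcn.com/"


def detail_url(base, href):
    """Return an absolute detail URL, or None when the href is missing."""
    if href is None:
        return None  # .get() returns None when the XPath matches nothing
    return urljoin(base, href)


# Root-relative href, as in the example URL from the spider's comment.
print(detail_url(BASE, "/video/101157.html"))
# An href that is already absolute is returned unchanged.
print(detail_url(BASE, "https://www.heiyingcn.com/video/101157.html"))
# A missing href no longer raises a TypeError from str + None.
print(detail_url(BASE, None))
```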