Scrapy crawlers -- passing data between requests, POST requests, and cookie handling
1. Passing data between Scrapy requests (meta)
Use case: when the data to be scraped is not all on one page, you must pass data along with the follow-up request.
Usage: yield scrapy.Request(url, callback, meta)
  - callback: the callback function used to parse the response
  - meta: a dict used to pass data to that callback
Steps in the spider file (see the full spider below):
1. Import the item class: from moviepro.items import MovieproItem
2. In the first parse, instantiate an item: item = MovieproItem()
3. Pass it along manually: yield scrapy.Request(url=datile_url, callback=self.datile_parse, meta={'item': item})
4. In the detail-page parse, first retrieve the item: item = response.meta['item']
5. Submit it to the pipeline: yield item
import scrapy
from moviepro.items import MovieproItem


class MovieSpider(scrapy.Spider):
    name = 'movie'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.doubiyang.cc/frim/index1.html']

    def parse(self, response):
        # Parse the list page: one <li> per movie
        li_list = response.xpath('/html/body/div[1]/div/div[2]/ul[2]/li')
        for li in li_list:
            title = li.xpath('./a/@title').extract_first()
            datile_url = 'https://www.doubiyang.cc/' + li.xpath('./a/@href').extract_first() + '#desc'
            item = MovieproItem()
            item['title'] = title
            item['datile_url'] = datile_url
            # Pass the partially filled item to the detail-page callback via meta
            yield scrapy.Request(url=datile_url, callback=self.datile_parse, meta={'item': item})

    def datile_parse(self, response):
        # Retrieve the item that was passed in via meta
        item = response.meta['item']
        # desc = response.xpath('//div[@class="stui-content__detail"]/p[4]/text()').extract()
        desc = response.xpath('//div[@class="stui-content__desc"]/text()').extract()
        item['desc'] = desc
        # Hand the completed item to the pipeline
        yield item
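As a side note, newer Scrapy versions (1.7 and later) also provide Request.cb_kwargs for handing data to a callback; meta works as shown above but is also used internally by Scrapy's middlewares. The sketch below reworks the same hand-off with cb_kwargs; the spider name and the parse_detail method name are placeholders of my own, everything else mirrors the spider above.

import scrapy
from moviepro.items import MovieproItem


class MovieCbKwargsSpider(scrapy.Spider):
    # Sketch only: same crawl as above, but the item travels via cb_kwargs (Scrapy 1.7+)
    name = 'movie_cb_kwargs'
    start_urls = ['https://www.doubiyang.cc/frim/index1.html']

    def parse(self, response):
        for li in response.xpath('/html/body/div[1]/div/div[2]/ul[2]/li'):
            item = MovieproItem()
            item['title'] = li.xpath('./a/@title').extract_first()
            item['datile_url'] = 'https://www.doubiyang.cc/' + li.xpath('./a/@href').extract_first() + '#desc'
            # Entries in cb_kwargs become keyword arguments of the callback
            yield scrapy.Request(url=item['datile_url'], callback=self.parse_detail, cb_kwargs={'item': item})

    def parse_detail(self, response, item):
        # The item arrives directly as a parameter instead of via response.meta['item']
        item['desc'] = response.xpath('//div[@class="stui-content__desc"]/text()').extract()
        yield item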
items.py:
import scrapy


class MovieproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    datile_url = scrapy.Field()
    desc = scrapy.Field()
settings.py:
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
LOG_LEVEL = 'ERROR'
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
    'moviepro.pipelines.MovieproPipeline': 300,
}
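pipelines.py is referenced in ITEM_PIPELINES but not shown in this post; below is a minimal sketch of what it might contain, with the print call standing in for real persistence logic.

# pipelines.py -- minimal sketch, not shown in the original post
class MovieproPipeline:
    def process_item(self, item, spider):
        # Replace the print with writing the item to a file or database as needed
        print(item['title'], item['desc'])
        return item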
2. POST requests and cookie handling in Scrapy
Sending a POST request:
1. Override the parent class's start_requests(self) method.
2. Inside that method, simply yield scrapy.FormRequest(url, callback, formdata):
  - Use FormRequest to issue the POST request: yield scrapy.FormRequest(url=url, callback=self.parse, formdata=data)
  - callback: the function that will parse the response
  - formdata: a dict of POST form parameters
Cookie handling: by default Scrapy handles cookies automatically (see the settings sketch after the code below).
To summarize (code below):
1. Override the start_requests method.
2. Send the POST request with FormRequest:
yield scrapy.FormRequest(url=url, callback=self.parse, formdata=data)
import scrapy

class PostSpider(scrapy.Spider):
    # Class shell added for context; the URL is only an assumed example endpoint that accepts a 'kw' form field
    name = 'post'
    start_urls = ['https://fanyi.baidu.com/sug']

    def start_requests(self):
        for url in self.start_urls:
            data = {
                'kw': 'cat'
            }
            # Manual POST requests are sent with FormRequest
            yield scrapy.FormRequest(url=url, callback=self.parse, formdata=data)

    def parse(self, response):
        print(response.text)
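The statement above that Scrapy handles cookies automatically corresponds to the settings below; the values shown are Scrapy's documented defaults, listed only for reference.

# settings.py -- cookie-related options (values are Scrapy's defaults)
COOKIES_ENABLED = True   # the built-in cookies middleware keeps session cookies across requests
COOKIES_DEBUG = False    # set to True to log Cookie / Set-Cookie headers while debugging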
