06 A Deeper Look at Scrapy

1. Sending requests manually in Scrapy

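  • yield scrapy.Request(url, callback): sends a GET request

    • callback names the parse function that will handle the response
  • yield scrapy.FormRequest(url, callback, formdata): sends a POST request

    • formdata: a dict holding the request parameters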
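Why the URLs in start_urls are requested automatically
- Why are the URLs in the start_urls list sent as GET requests automatically?
    - Because the parent class, scrapy.Spider, implements a start_requests method that sends a GET request for every URL in the list: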
    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(url=u, callback=self.parse)
- How can the URLs in start_urls be sent as POST requests by default?
    - Override the start_requests method in your own spider:
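    def start_requests(self):
        # Scrapy calls start_requests() with no arguments, so formdata is built here
        formdata = {'key': 'value'}   # placeholder; replace with the real POST parameters
        for u in self.start_urls:
            yield scrapy.FormRequest(url=u, callback=self.parse, formdata=formdata)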
Manual requests & crawling multiple pages
import scrapy
# HandreqproItem is defined in this project's items.py; import it from there.


class DuanziSpider(scrapy.Spider):
    name = 'duanzi'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://duanzixing.com/段子/']
    # URL template for page 2 onward
    url = 'https://duanzixing.com/段子/%d/'
    page_num = 2

    def parse(self, response):
        article_list = response.xpath('/html/body/section/div/div/article')
        for article in article_list:
            title = article.xpath('./header/h2/a/text()').get()
            content = article.xpath('./p[2]/text()').get()
            item = HandreqproItem()   # create the item object
            item['title'] = title     # fill in its fields
            item['content'] = content
            yield item                # submit the item to the pipeline

        if self.page_num < 5:         # stop condition: only pages 2 to 4 are requested
            new_url = self.url % self.page_num   # build the URL of the next page
            self.page_num += 1
            # Send the request manually; callback=self.parse means the next page
            # is handled by this same parse method.
            yield scrapy.Request(url=new_url, callback=self.parse)
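To check which pages the counter in the spider above actually ends up requesting, the same arithmetic can be replayed in plain Python without Scrapy. The helper name next_page_urls and its parameters are made up for this sketch; the URL template and the bounds 2 and 5 come from the spider.

```python
URL_TEMPLATE = 'https://duanzixing.com/段子/%d/'


def next_page_urls(first_page=2, stop_before=5):
    """Return the URLs requested after the start URL.

    Mirrors the spider: page_num starts at first_page and a new request
    is made only while page_num < stop_before.
    """
    urls = []
    page_num = first_page
    while page_num < stop_before:
        urls.append(URL_TEMPLATE % page_num)
        page_num += 1
    return urls


urls = next_page_urls()
print(urls)
```

So with page_num = 2 and the condition page_num < 5, the spider crawls the start page plus pages 2, 3 and 4; page 5 is never requested.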

2. Scraping detail-page data in Scrapy (passing data along with a request)

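  • Use a callback that points to a method able to parse the detail page
  • Pass the item along with the request through meta: yield scrapy.Request(url=url_detail, callback=self.parse_detail, meta={'item': item})
  • Submit the item to the pipeline from the detail-page callback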
def parse(self, response):
    movie_list = response.xpath('//*[@id="contents"]/ul/li')
    for movie in movie_list:
        title = movie.xpath('./div/a/text()').get()
        # https://www.heiyingcn.com/video/101157.html
        url_detail = 'https://www.heiyingcn.com' + movie.xpath('./a/@href').get()
        item = MovieproItem()
        item['title'] = title

        # Pass the item to the detail-page callback through meta
        yield scrapy.Request(url=url_detail, callback=self.parse_detail, meta={'item': item})

    # To crawl further list pages, send the next-page request here
    # yield scrapy.Request()


def parse_detail(self, response):
    item = response.meta['item']
    desc = response.xpath('//*[@class="synopsis-article"]/text()').get()
    item['desc'] = desc

    yield item
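
One fragile spot in parse above is the detail URL: it is built by string concatenation, which raises a TypeError if .get() returns None and produces a wrong address if the href is already absolute. A slightly more defensive variant, sketched here with the standard library, could look like this. The helper name build_detail_url is hypothetical; inside a spider, response.urljoin(href) does the same joining relative to the current page.

```python
from urllib.parse import urljoin

BASE = 'https://www.heiyingcn.com/'


def build_detail_url(href):
    """Return an absolute detail URL, or None when the link is missing."""
    if href is None:
        return None
    # urljoin handles both relative paths and already-absolute URLs.
    return urljoin(BASE, href)


relative = build_detail_url('/video/101157.html')
missing = build_detail_url(None)
print(relative, missing)
```

With a helper like this, parse can simply skip a list entry whose link came back as None instead of crashing.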
posted @ 2021-11-09 15:19  超暖