Scrapy Official Tutorial: runspider

Original (English): https://docs.scrapy.org/en/latest/intro/overview.html

Crawling

The following spider scrapes the quotes from http://quotes.toscrape.com:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {  # each yielded dict is handed to Scrapy as a scraped item (item pipelines, feed export)
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

        # Follow the link to the next page
        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Save this as quotes_spider.py, then run the file from the same directory with Scrapy's runspider command:

scrapy runspider quotes_spider.py -o quotes.json

When the run finishes, a new file named quotes.json appears in the current directory.
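
To check the result, the feed can be loaded back with the standard json module. This is a minimal sketch: it assumes the file is a JSON array of objects with the text and author keys that the spider yields. load_quotes is a hypothetical helper name, and the sample items are placeholders written to a temporary file so the example runs without crawling anything.

```python
import json
import tempfile
from pathlib import Path


def load_quotes(path):
    """Return the list of scraped items stored in a Scrapy JSON feed."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Placeholder items standing in for real spider output (not scraped data).
sample_items = [
    {"text": "placeholder quote one", "author": "Author A"},
    {"text": "placeholder quote two", "author": "Author B"},
]

with tempfile.TemporaryDirectory() as tmp:
    feed_path = Path(tmp) / "quotes.json"
    feed_path.write_text(json.dumps(sample_items), encoding="utf-8")
    quotes = load_quotes(feed_path)

# Collect the author of every item in the feed.
authors = [item["author"] for item in quotes]
print(authors)
```

With a real crawl, calling load_quotes("quotes.json") in the directory where the spider ran should return the scraped items in the same shape.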

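Following links: summary

When writing a spider's parse method, some links need to be extracted and crawled in turn. Scrapy provides several convenient ways to do this, summarized below.

Assume the a tag we want to follow is held in target_a.

Method 1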

next_page = target_a.css('::attr(href)').extract_first()
if next_page is not None:
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)

Method 2

next_page = target_a.css('::attr(href)').extract_first()
if next_page is not None:
    yield response.follow(next_page, callback=self.parse)

Method 3

next_page = target_a.css('::attr(href)')
if next_page:  # css() returns a SelectorList: never None, but empty when nothing matched
    yield response.follow(next_page[0], callback=self.parse)

Method 4

if target_a is not None:
    yield response.follow(target_a, callback=self.parse)

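Method 1: build the absolute URL of the next page yourself and yield a new Request object.
Method 2: no need to build the absolute URL; response.follow resolves the relative URL for you.
Method 3: no need to extract the URL string; pass the selector for the href attribute directly.
Method 4: no need to select the href attribute; pass the selector for the a element and response.follow extracts its href automatically.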
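
Methods 1 and 2 differ only in who resolves the relative href against the page URL. As a rough illustration using only the standard library, the sketch below shows that kind of resolution with urllib.parse.urljoin. It assumes the page has no base tag, and the href values are hypothetical examples rather than values taken from the site.

```python
from urllib.parse import urljoin

# URL of the page being parsed (the spider's start URL).
page_url = "http://quotes.toscrape.com/tag/humor/"

# Hypothetical href values, as they might appear in an <a> tag.
root_relative = urljoin(page_url, "/tag/humor/page/2/")
path_relative = urljoin(page_url, "page/2/")
already_absolute = urljoin(page_url, "http://quotes.toscrape.com/page/2/")

print(root_relative)
print(path_relative)
print(already_absolute)
```

Both relative forms resolve to http://quotes.toscrape.com/tag/humor/page/2/, while an href that is already absolute is returned unchanged.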

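Note that the object passed to response.follow must be a str or a single Selector; it cannot be a SelectorList. This is why Method 3 indexes the result of css() with [0].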

Posted 2018-04-30 23:07 by wbinbin
