初识Scrapy

简述

Scrapy (/ˈskreɪpaɪ/)

Scrapy是一个被广泛应用于爬取网站和提取结构化数据的应用框架,例如数据挖掘、信息处理等等。其设计之处就是为了网站爬虫,发展到现在已经可以使用APIs来提取数据,是一个通用的网站爬取工具。

安装

当然,前提是已经安装 Python,通过 pip 安装 Scrapy:

pip install Scrapy

实例

通过一个小 Demo 认识 Scrapy,实践出真知

演示

这里使用的是官方文档提供的例子,使用的网站链接依旧有效,是一个很适合爬取做数据分析的演示站点。

直接上代码:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'https://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

先不用看代码具体什么意思,直接在命令运行:

scrapy runspider quotes_spider.py -o quotes.jl

之后会进行跑码过程,并将提取的结构数据输出到文件 quotes.jl中。
控制台输出:

2022-09-12 20:09:23 [scrapy.utils.log] INFO: Scrapy 2.6.2 started (bot: scrapybot)
2022-09-12 20:09:23 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 2.0.1, Twisted 22.8.0, Python 3.9.12 (tags/v3.9.12:b28265d, Mar 23 2022, 23:52:46) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 22.0.0 (OpenSSL 3.0.5 5 Jul 2022), cryptography 38.0.1, Platform Windows-10-10.0.25197-SP0
2022-09-12 20:09:23 [scrapy.crawler] INFO: Overridden settings:
{'SPIDER_LOADER_WARN_ONLY': True}
2022-09-12 20:09:23 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-09-12 20:09:23 [scrapy.extensions.telnet] INFO: Telnet Password: a89bf20948a48863
2022-09-12 20:09:23 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2022-09-12 20:09:24 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-09-12 20:09:24 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-09-12 20:09:24 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-09-12 20:09:24 [scrapy.core.engine] INFO: Spider opened
2022-09-12 20:09:24 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-09-12 20:09:24 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-09-12 20:09:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/tag/humor/> (referer: None)
2022-09-12 20:09:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/tag/humor/>
{'author': 'Jane Austen', 'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”'}
2022-09-12 20:09:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/tag/humor/>
{'author': 'Steve Martin', 'text': '“A day without sunshine is like, you know, night.”'}
2022-09-12 20:09:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/tag/humor/>
{'author': 'Garrison Keillor', 'text': '“Anyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.”'}
2022-09-12 20:09:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/tag/humor/>
{'author': 'Jim Henson', 'text': '“Beauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.”'}
2022-09-12 20:09:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/tag/humor/>
{'author': 'Charles M. Schulz', 'text': "“All you need is love. But a little chocolate now and then doesn't hurt.”"}
2022-09-12 20:09:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/tag/humor/>
{'author': 'Suzanne Collins', 'text': "“Remember, we're madly in love, so it's all right to kiss me anytime you feel like it.”"}
2022-09-12 20:09:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/tag/humor/>
{'author': 'Charles Bukowski', 'text': '“Some people never go crazy. What truly horrible lives they must lead.”'}
2022-09-12 20:09:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/tag/humor/>
{'author': 'Terry Pratchett', 'text': '“The trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.”'}
2022-09-12 20:09:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/tag/humor/>
{'author': 'Dr. Seuss', 'text': '“Think left and think right and think low and think high. Oh, the thinks you can think up if only you try!”'}
2022-09-12 20:09:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/tag/humor/>
{'author': 'George Carlin', 'text': '“The reason I talk to myself is because I’m the only one whose answers I accept.”'}
2022-09-12 20:09:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/tag/humor/page/2/> (referer: https://quotes.toscrape.com/tag/humor/)
2022-09-12 20:09:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/tag/humor/page/2/>
{'author': 'W.C. Fields', 'text': '“I am free of all prejudice. I hate everyone equally. ”'}
2022-09-12 20:09:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/tag/humor/page/2/>
{'author': 'Jane Austen', 'text': "“A lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.”"}
2022-09-12 20:09:26 [scrapy.core.engine] INFO: Closing spider (finished)
2022-09-12 20:09:26 [scrapy.extensions.feedexport] INFO: Stored jl feed (12 items) in: quotes.jl
2022-09-12 20:09:26 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 514,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 15665,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 2.077645,
 'feedexport/success_count/FileFeedStorage': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 9, 12, 12, 9, 26, 430224),
 'item_scraped_count': 12,
 'log_count/DEBUG': 15,
 'log_count/INFO': 11,
 'request_depth_max': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2022, 9, 12, 12, 9, 24, 352579)}
2022-09-12 20:09:26 [scrapy.core.engine] INFO: Spider closed (finished)

输出到文件:

{"author": "Jane Austen", "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"}
{"author": "Steve Martin", "text": "\u201cA day without sunshine is like, you know, night.\u201d"}
{"author": "Garrison Keillor", "text": "\u201cAnyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.\u201d"}
{"author": "Jim Henson", "text": "\u201cBeauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.\u201d"}
{"author": "Charles M. Schulz", "text": "\u201cAll you need is love. But a little chocolate now and then doesn't hurt.\u201d"}
{"author": "Suzanne Collins", "text": "\u201cRemember, we're madly in love, so it's all right to kiss me anytime you feel like it.\u201d"}
{"author": "Charles Bukowski", "text": "\u201cSome people never go crazy. What truly horrible lives they must lead.\u201d"}
{"author": "Terry Pratchett", "text": "\u201cThe trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.\u201d"}
{"author": "Dr. Seuss", "text": "\u201cThink left and think right and think low and think high. Oh, the thinks you can think up if only you try!\u201d"}
{"author": "George Carlin", "text": "\u201cThe reason I talk to myself is because I\u2019m the only one whose answers I accept.\u201d"}
{"author": "W.C. Fields", "text": "\u201cI am free of all prejudice. I hate everyone equally. \u201d"}
{"author": "Jane Austen", "text": "\u201cA lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.\u201d"}

分析

首先导入 scrapy 包:

import scrapy

import scrapy之后指定scrapy.Spider,提取数据的函数名称为parse,这是缺省的约定。这样就定义了一个爬虫,现在来完善这个爬虫。首先是name,其次是start_urls,值得注意的是这是一个列表。

定义缺省函数提取数据:

# 通过向start_URLs属性中定义的URL发出请求
# 接收请求后的响应,通过向start_URLs属性中定义的URL发出请求
def parse(self, response):
    # 使用CSS选择器遍历quote元素
    for quote in response.css('div.quote'):
        # 生成包含提取的quote文本和作者的字典
        yield {
             'author': quote.xpath('span/small/text()').get(),
             'text': quote.css('span.text::text').get(),
        }
    # 查找指向下一页的链接
    next_page = response.css('li.next a::attr("href")').get()
    if next_page is not None:
        # 使用与回调相同的解析方法调度下一个请求
        yield response.follow(next_page, self.parse)

其中的 CSS 选择器,可以看出与网页中的对应,CSS class为quotediv标签,返回一个选择器,其data为整个class为quotediv标签。

下面再精确定位一波,quote.xpath('span/small/text()')深度遍历获取目标 div 下的 span 标签,span 标签下的 small 标签,并传入 text()。使用 get() 函数获取其文本值,具体 div 如下:

<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>
        <span>by <small class="author" itemprop="author">Jane Austen</small>
        <a href="/author/Jane-Austen">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="aliteracy,books,classic,humor"> 

            <a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>

            <a class="tag" href="/tag/books/page/1/">books</a>

            <a class="tag" href="/tag/classic/page/1/">classic</a>

            <a class="tag" href="/tag/humor/page/1/">humor</a>

        </div>
    </div>

我们可以通过修改代码来获取tags

yield {
    'author': quote.xpath('span/small/text()').get(),
    'text': quote.css('span.text::text').get(),
    'tags': quote.css('a.tag::text').getall()
}

总结

通过一个小案例成功了解了Scrapy爬虫框架。

posted @ 2022-09-15 09:38  Flipped1121  阅读(64)  评论(0编辑  收藏  举报