Web Crawling Framework - Scrapy

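Installation

Installing dependencies:

On macOS, building Scrapy's dependencies requires a C compiler and development headers, which are normally provided by Xcode. Install them with: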

xcode-select --install

Then install Scrapy itself with pip:

pip3 install Scrapy

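Verification

After installation, type scrapy in a terminal. If the installation succeeded, it prints the Scrapy version and the list of available commands.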

Crawling workflow

Creating a project

scrapy startproject PROJECTNAME

cd PROJECTNAME

scrapy genspider APPNAME DOMAINNAME

Example

Scrape the quotes from quotes.toscrape.com:

scrapy startproject quotes

cd quotes

scrapy genspider quote quotes.toscrape.com

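After these commands, the generated project directory contains scrapy.cfg and a quotes/ package with items.py, middlewares.py, pipelines.py, settings.py, and a spiders/ folder that holds quote.py.

Code: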

# quote.py
import scrapy
from quotes.items import QuoteItem

class QuoteSpider(scrapy.Spider):
    name = 'quote'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # print(response.text)
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            # print(author,tags,text)
            yield item

        # Follow the "next" link only if one exists (the last page has none)
        next_page = response.css('.pager .next a::attr(href)').extract_first()
        if next_page:
            url = response.urljoin(next_page)
            yield scrapy.Request(url=url, callback=self.parse)

# items.py
import scrapy

class QuoteItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

# pipelines.py

# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exceptions import DropItem
import pymongo

class TextPipeline(object):  # limit the text length
    def __init__(self):
        self.limit = 50

    def process_item(self, item, spider):
        if item['text']:
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
            return item
        else:
            raise DropItem('Missing Text')  # DropItem is an exception and must be raised, not returned

class MongoPipeline(object):
    def __init__(self,mongo_url,mongo_db):
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db
    @classmethod
    def from_crawler(cls,crawler):
        return cls(
            mongo_url=crawler.settings.get('MONGO_URL'),
            mongo_db = crawler.settings.get('MONGO_DB'),
        )

    def open_spider(self,spider):
        self.client = pymongo.MongoClient(self.mongo_url)
        self.db = self.client[self.mongo_db]

    def process_item(self,item,spider):
        name = item.__class__.__name__
        self.db[name].insert_one(dict(item))  # Collection.insert() was removed in PyMongo 4
        return item

    def close_spider(self,spider):
        self.client.close()

# settings.py

...

MONGO_URL = 'localhost'
MONGO_DB = 'quotes'

....
....

ITEM_PIPELINES = {
   # 'quotes.pipelines.TextPipeline': 300,
    'quotes.pipelines.MongoPipeline': 400,
}
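
The numbers in ITEM_PIPELINES set the order in which pipelines run: lower values run first, and an item dropped by one pipeline never reaches the later ones. The following is a minimal stdlib-only sketch of that idea, not Scrapy's implementation. The names DropItem (a local stand-in for scrapy.exceptions.DropItem), truncate_text, save, and run_pipelines are hypothetical and exist only for this illustration.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem (assumption for this sketch)."""


def truncate_text(item, limit=50):
    """Mimics TextPipeline: drop items without text, shorten long text."""
    if not item.get("text"):
        raise DropItem("Missing text")
    if len(item["text"]) > limit:
        item["text"] = item["text"][:limit].rstrip() + "..."
    return item


saved = []  # plays the role of the MongoDB collection


def save(item):
    """Mimics MongoPipeline: store a copy of the item and pass it on."""
    saved.append(dict(item))
    return item


def run_pipelines(item, pipelines):
    """Run the pipelines in ascending priority order, like ITEM_PIPELINES."""
    for step in sorted(pipelines, key=pipelines.get):
        try:
            item = step(item)
        except DropItem:
            return None  # a dropped item never reaches the later pipelines
    return item


PIPELINES = {save: 400, truncate_text: 300}  # truncate_text (300) runs before save (400)

kept = run_pipelines({"text": "x" * 60}, PIPELINES)
dropped = run_pipelines({"text": ""}, PIPELINES)
print(kept, dropped, len(saved))
```

Here the item with empty text is dropped at priority 300, so nothing is stored for it at priority 400.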

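Running the spider:

scrapy crawl quote

scrapy crawl quote --nolog   # suppress log output

Saving output:

The scraped items can be exported to a file; the format is inferred from the file extension (for example .json, .jl, .csv, or .xml):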

scrapy crawl quote -o quotes.json
...

Saving to an FTP server:

scrapy crawl quote -o ftp://user:pass@ftp.example.com/path/quotes.csv

Debugging the parsing logic in an interactive shell:

scrapy shell quotes.toscrape.com

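Introduction

Scrapy is an open-source, collaborative framework. It was originally designed for page scraping (more precisely, web scraping), and it extracts the data you need from websites in a fast, simple, and extensible way. Today Scrapy is used far more broadly: for data mining, monitoring, and automated testing, for retrieving data returned by APIs (for example Amazon Associates Web Services), and as a general-purpose web crawler.

Scrapy is built on Twisted, a popular event-driven networking framework for Python. As a result, Scrapy uses non-blocking (asynchronous) code to achieve concurrency. Its architecture is organized around the following data flow.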

The data flow in Scrapy is controlled by the execution engine, and goes like this:

1. The Engine gets the initial Requests to crawl from the Spider.
2. The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl.
3. The Scheduler returns the next Requests to the Engine.
4. The Engine sends the Requests to the Downloader, passing through the Downloader Middlewares (see process_request()).
5. Once the page finishes downloading the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middlewares (see process_response()).
6. The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (see process_spider_input()).
7. The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine, passing through the Spider Middleware (see process_spider_output()).
8. The Engine sends processed items to Item Pipelines, then sends processed Requests to the Scheduler and asks for possible next Requests to crawl.
9. The process repeats (from step 1) until there are no more requests from the Scheduler.

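Data flow - a restatement of the above

The data flow in Scrapy is controlled by the execution engine and proceeds as follows:

1. The engine opens a website (open a domain), finds the Spider that handles it, and asks that Spider for the first URL(s) to crawl.
2. The engine gets the first URL(s) to crawl from the Spider and schedules them in the Scheduler as Requests.
3. The engine asks the Scheduler for the next URL to crawl.
4. The Scheduler returns the next URL to the engine, and the engine forwards it to the Downloader through the Downloader Middlewares (request direction).
5. Once the page has been downloaded, the Downloader generates a Response for it and sends it to the engine through the Downloader Middlewares (response direction).
6. The engine receives the Response from the Downloader and sends it to the Spider for processing through the Spider Middlewares (input direction).
7. The Spider processes the Response and returns the scraped Items and the new Requests (to follow) to the engine.
8. The engine passes the scraped Items (returned by the Spider) to the Item Pipeline and the Requests (returned by the Spider) to the Scheduler.
9. The process repeats (from step 2) until there are no more requests in the Scheduler, and the engine closes the website.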
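
The loop described above can be sketched as a toy, single-threaded model in plain Python. This is only an illustration of how requests and items move between the components; it is not how Scrapy is implemented (Scrapy is asynchronous and built on Twisted). FAKE_SITE, download, parse, and run_engine are hypothetical names invented for the sketch.

```python
from collections import deque

# Hypothetical stand-in for real pages: URL -> (quotes on the page, next page URL or None)
FAKE_SITE = {
    "page1": (["quote A", "quote B"], "page2"),
    "page2": (["quote C"], None),
}


def download(url):
    """Downloader: turn a request (here just a URL) into a response."""
    return {"url": url, "body": FAKE_SITE[url]}


def parse(response):
    """Spider callback: yield scraped items and follow-up requests."""
    quotes, next_url = response["body"]
    for text in quotes:
        yield {"type": "item", "text": text}
    if next_url is not None:
        yield {"type": "request", "url": next_url}


def run_engine(start_urls):
    """Engine: move requests and items between the other components."""
    scheduler = deque(start_urls)  # Scheduler: queue of pending requests
    seen = set(start_urls)         # duplicate filter kept next to the queue
    stored = []                    # Item Pipeline: here it only collects items
    while scheduler:               # repeat until the scheduler is empty
        response = download(scheduler.popleft())
        for result in parse(response):
            if result["type"] == "item":
                stored.append(result["text"])
            elif result["url"] not in seen:
                seen.add(result["url"])
                scheduler.append(result["url"])
    return stored


items = run_engine(["page1"])
print(items)
```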

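Lifecycle

Crawling starts from start_urls, and requests are sent as Request objects. A typical Request looks like this:
from urllib import parse
from scrapy import Request

# inside a spider callback, where url was extracted from the page
final_url = parse.urljoin(response.url, url)  # resolve a relative path against the page URL
yield Request(url=final_url, callback=self.parse_detail, meta={})  # send a new request
url: parse.urljoin turns relative paths that lack a domain into absolute URLs
callback: names another method of the spider class that will process the response
meta: a dict of arbitrary data that travels with the request and is available on the response (response.meta)
yield Request(...) sends a new request
A custom parse function has the signature (self, response)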
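
To make the relative-path handling concrete, here is a small stdlib-only example of urllib.parse.urljoin. The base URL mirrors the quotes site used earlier; the relative paths are made up for illustration.

```python
from urllib.parse import urljoin

base = "http://quotes.toscrape.com/page/1/"

# A path starting with "/" replaces the whole path of the base URL
absolute_path = urljoin(base, "/page/2/")

# A relative path is resolved against the directory of the base URL
relative_path = urljoin(base, "../2/")

# An already absolute URL is returned unchanged
other_site = urljoin(base, "http://example.com/about")

print(absolute_path, relative_path, other_site)
```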

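Page content can be extracted manually with response.xpath(), response.css(), and response.re(). The recommended approach is Scrapy's ItemLoader, which collects and processes the extracted values automatically, so there is no need to call extract()[0] by hand.

ItemLoader works together with Item. When defining an Item, each field can be given an input_processor and an output_processor; the fields themselves are declared simply with Field().

# Built-in processors (in recent Scrapy versions they live in itemloaders.processors)
import scrapy
from scrapy.loader.processors import Join, MapCompose, Compose, TakeFirst, Identity, SelectJmes

class JobboleItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field(output_processor=TakeFirst())  # TakeFirst() returns the first non-empty value
    post_time = scrapy.Field(output_processor=Compose(TakeFirst(), lambda x: x.strip()))
    tags = scrapy.Field()
    source_url = scrapy.Field(output_processor=TakeFirst())
    upvote_count = scrapy.Field(output_processor=TakeFirst())

Finally, the spider class hands each item to the Item Pipeline with yield item. Before an item enters a pipeline for further processing, Scrapy reads the pipeline configuration from settings.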
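
What an output processor does can be illustrated with simplified pure-Python equivalents. These are sketches written for this explanation, not the library's code: the real TakeFirst, Compose, and MapCompose have extra behaviour (for example, MapCompose flattens results and skips None). The lowercase names take_first, compose, and map_compose are hypothetical.

```python
def take_first(values):
    """Return the first value that is neither None nor an empty string."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


def compose(*functions):
    """Chain functions: the output of each one is fed to the next."""
    def processor(value):
        for function in functions:
            value = function(value)
        return value
    return processor


def map_compose(*functions):
    """Apply the chained functions to every element of the input list."""
    def processor(values):
        for function in functions:
            values = [function(value) for value in values]
        return values
    return processor


# Values as a selector might return them: a list of raw strings
extracted = ["", "  2018-06-08  ", "ignored"]
post_time = compose(take_first, str.strip)(extracted)
tags = map_compose(str.strip, str.lower)([" Python ", " Scrapy"])
print(post_time, tags)
```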


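Components:

Engine (ENGINE)
The engine controls the data flow between all components of the system and triggers events when certain actions occur. See the data flow section above for details.
Scheduler (SCHEDULER)
Receives requests from the engine, enqueues them, and returns them when the engine asks again. Think of it as a priority queue of URLs: it decides which URL is crawled next and also removes duplicate URLs.
Downloader (DOWNLOADER)
Fetches pages and feeds them to the engine, which in turn feeds them to the spiders. The downloader is built on Twisted's efficient asynchronous model.
Spiders (SPIDERS)
Spiders are classes written by the developer to parse responses and extract items (the scraped data) or to issue new requests. Each spider handles one specific website (or a few).
Item Pipelines (ITEM PIPELINES)
The Item Pipeline processes the items extracted by the spiders. Typical tasks are cleaning, validation, and persistence (for example storing the item in a database).
Downloader Middlewares
Sit between the engine and the downloader. They process requests travelling from the ENGINE to the DOWNLOADER and responses travelling from the DOWNLOADER back to the ENGINE. Use a downloader middleware to: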

process a request just before it is sent to the Downloader (i.e. right before Scrapy sends the request to the website);
change received response before passing it to a spider;
send a new Request instead of passing received response to a spider;
pass response to a spider without fetching a web page;
silently drop some requests.

Spider Middlewares
Sit between the ENGINE and the SPIDERS. They process spider input (responses) and output (items and requests). Use a spider middleware to:

post-process output of spider callbacks - change/add/remove requests or items;
post-process start_requests;
handle spider exceptions;
call errback instead of callback for some of the requests based on response content.
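
The way a downloader middleware can change a request or "pass a response without fetching a web page" can be sketched with a toy chain in plain Python. This is a simplified model under stated assumptions: requests and responses are plain dicts, the method signatures are shortened, and the classes UserAgentMiddleware and CacheMiddleware as well as the function fetch are hypothetical, not Scrapy classes.

```python
class UserAgentMiddleware:
    """Hypothetical middleware: adds a header to every outgoing request."""

    def process_request(self, request):
        request["headers"]["User-Agent"] = "toy-bot"
        return None  # None means: continue with the next middleware

    def process_response(self, request, response):
        return response


class CacheMiddleware:
    """Hypothetical middleware: answers known URLs without downloading."""

    def __init__(self):
        self.cache = {}

    def process_request(self, request):
        # Returning a response here short-circuits the download
        return self.cache.get(request["url"])

    def process_response(self, request, response):
        self.cache[request["url"]] = response
        return response


def fetch(request, middlewares, downloader):
    """Pass the request down the chain, then the response back up in reverse."""
    for middleware in middlewares:
        response = middleware.process_request(request)
        if response is not None:
            break
    else:
        response = downloader(request)
    for middleware in reversed(middlewares):
        response = middleware.process_response(request, response)
    return response


downloads = []  # records which URLs really hit the "network"


def downloader(request):
    downloads.append(request["url"])
    return {"url": request["url"], "status": 200}


chain = [UserAgentMiddleware(), CacheMiddleware()]
first_request = {"url": "http://example.com/", "headers": {}}
first = fetch(first_request, chain, downloader)
second = fetch({"url": "http://example.com/", "headers": {}}, chain, downloader)
print(len(downloads), first_request["headers"])
```

The second call is answered from the cache, so the downloader is invoked only once.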

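Command-line tool

#1 Getting help
    scrapy -h
    scrapy <command> -h

#2 There are two kinds of commands: project-only commands must be run from inside a project directory, global commands work anywhere
    Global commands:
        startproject # create a project
        genspider    # create a spider
        settings     # show settings; inside a project directory it shows that project's settings
        runspider    # run a spider from a standalone Python file, no project required
        shell        # scrapy shell <url>: interactive debugging, e.g. checking whether a selector rule is correct
        fetch        # download a single page outside any project; with --headers it prints the response headers
        view         # download a page and open it in the browser, which reveals what is loaded via AJAX
        version      # scrapy version shows the Scrapy version; scrapy version -v also shows the versions of its dependencies
    Project-only commands:
        crawl        # run a spider; requires a project. If robots.txt rules should be ignored, set ROBOTSTXT_OBEY = False in the settings
        check        # run the contract checks defined for the spiders
        list         # list the names of the spiders in the project
        edit         # open a spider in an editor; rarely used
        parse        # scrapy parse <url> --callback <callback>: verify that a callback behaves as expected
        bench        # scrapy bench: run a quick benchmark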

#1. Running global commands: make sure you are not inside a project directory, so that no project settings interfere
scrapy startproject MyProject

cd MyProject
scrapy genspider baidu www.baidu.com

scrapy settings --get XXX # inside a project directory this shows that project's value

scrapy runspider baidu.py

scrapy shell https://www.baidu.com
    response
    response.status
    response.body
    view(response)

scrapy view https://www.taobao.com # parts missing from the displayed page are loaded via AJAX, which helps locate such content quickly

scrapy fetch --nolog --headers https://www.taobao.com

scrapy version # the Scrapy version

scrapy version -v # the versions of the dependencies as well


#2. Running project commands: switch to the project directory first
scrapy crawl baidu
scrapy check
scrapy list
scrapy parse http://quotes.toscrape.com/ --callback parse
scrapy bench


Project structure and a brief introduction to spider applications

scrapy.cfg
myproject/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...

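File descriptions:

scrapy.cfg: the project's main configuration, used when deploying Scrapy. Crawler-related settings live in settings.py.
items.py: data storage templates for structured data, similar to a Django Model
pipelines.py: data processing behavior, typically persisting the structured data
settings.py: the configuration file, e.g. recursion depth, concurrency, download delay. Note: setting names must be uppercase or they are ignored; the correct form is USER_AGENT='xxxx'
spiders/: the spider directory, where spider files are created and crawling rules are written

Note: a spider file is usually named after the website's domain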
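
The uppercase rule for settings.py can be illustrated with a small stdlib-only sketch: when settings are read from a module, only names written entirely in uppercase are picked up. The in-memory module and the function load_settings below are hypothetical stand-ins written for this example, not Scrapy code.

```python
import types

# A stand-in for settings.py, built in memory so the sketch is self-contained
settings_module = types.ModuleType("settings")
settings_module.USER_AGENT = "my-crawler"
settings_module.DOWNLOAD_DELAY = 2
settings_module.user_agent = "ignored because the name is lowercase"


def load_settings(module):
    """Collect only the all-uppercase names defined in the module."""
    return {name: getattr(module, name) for name in dir(module) if name.isupper()}


settings = load_settings(settings_module)
print(settings)
```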

Spiders

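Spiders are classes that define how a site (or a group of sites) is crawled, including how to perform the crawl and how to extract structured data from the pages. In other words, a Spider is where the developer customizes the crawling and parsing behavior for a particular site or group of sites.

The spider crawl loop:

1. Generate the initial Requests for the first URLs and specify a callback for them. The first requests are produced by start_requests(), which by default builds a Request for each URL in start_urls with parse as the callback. The callback is invoked automatically once the download finishes and a response is returned.

2. In the callback, parse the response and return a result. The result can be one of four things: a dict of extracted data, an Item object, a new Request object (which also needs a callback), or an iterable of Items and Requests.

3. In the callback, parse the page content. This is usually done with Scrapy's own Selectors, but BeautifulSoup, lxml, or other libraries work as well.

4. Finally, the returned Items are typically persisted, either to a database through an Item Pipeline (https://docs.scrapy.org/en/latest/topics/item-pipeline.html#topics-item-pipeline) or to a file through Feed exports (https://docs.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-exports).

Scrapy provides five spider classes:

1. scrapy.spiders.Spider (scrapy.Spider is the same as scrapy.spiders.Spider)

2. scrapy.spiders.CrawlSpider

3. scrapy.spiders.XMLFeedSpider

4. scrapy.spiders.CSVFeedSpider

5. scrapy.spiders.SitemapSpider

Importing and using them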

import scrapy
from scrapy.spiders import Spider,CrawlSpider,XMLFeedSpider,CSVFeedSpider,SitemapSpider

class AmazonSpider(scrapy.Spider): # custom class that inherits from a base class provided by scrapy.spiders
    name = 'amazon'
    allowed_domains = ['www.amazon.cn']
    start_urls = ['http://www.amazon.cn/']
    
    def parse(self, response):
        pass

class scrapy.spiders.Spider

This is the simplest spider class; every other spider class inherits from it (including the ones you write yourself).

It provides no special functionality. It only supplies a default start_requests() method that reads the URLs in start_urls, sends a request for each, and uses parse as the default callback.

#1. name = 'amazon'
The spider's name, which Scrapy uses to locate the spider.
It is therefore required and must be unique (In Python 2 this must be ASCII only.)

#2. allowed_domains = ['www.amazon.cn']
The domains this spider is allowed to crawl. If OffsiteMiddleware is enabled (it is by default),
requests to domains that are neither in this list nor subdomains of a listed domain are not followed.
If the target URL is https://www.example.com/1.html, add 'example.com' to the list.

#3. start_urls = ['http://www.amazon.cn/']
If no particular URLs are specified, the first requests are generated from this list.

#4. custom_settings
A dict of settings that override the project-level settings when this spider runs.
It must be defined as a class attribute, because the settings are loaded before the class is instantiated.
    custom_settings = {
        'BOT_NAME' : 'Spider_Amazon',
        'DEFAULT_REQUEST_HEADERS' : {
          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
          'Accept-Language': 'en',
        }
    }

#5. settings
self.settings['SETTING_NAME'] gives access to the values in settings.py; if custom_settings is defined, its values take precedence.

#6. logger
The logger is named after the spider by default.
self.logger.debug('=============>%s' %self.settings['BOT_NAME'])

#7. crawler
This attribute is set by the from_crawler() class method.

#8. from_crawler(crawler, *args, **kwargs): for reference
You probably won't need to override this directly because the default implementation acts as a proxy to the __init__() method, calling it with the given arguments args and named arguments kwargs.

#9. start_requests()
Issues the first Requests and must return an iterable. Scrapy calls it once, when the spider is opened.
By default it creates Request(url, dont_filter=True) for each URL in start_urls    # for dont_filter, see the deduplication rules below
To change the initial Requests, override this method. For example, to start with a POST request:
class MySpider(scrapy.Spider):
    name = 'myspider'
    def start_requests(self):
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        pass

#10. parse(response)
The default callback. Every callback must return an iterable of Request and/or dicts or Item objects.

#11. log(message[, level, component]): for reference
Wrapper that sends a log message through the Spider's logger, kept for backwards compatibility. For more information see Logging from Spiders.

#12. closed(reason)
Called automatically when the spider finishes.
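
The allowed_domains rule from #2 (a host is accepted if it equals a listed domain or is a subdomain of one) can be sketched with the standard library. This is an approximation written for illustration, not the OffsiteMiddleware source; is_allowed is a hypothetical helper.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """True if the URL's host equals an allowed domain or is a subdomain of it."""
    host = urlparse(url).hostname or ""
    return any(host == domain or host.endswith("." + domain)
               for domain in allowed_domains)


print(is_allowed("https://www.example.com/1.html", ["example.com"]))
print(is_allowed("https://notexample.com/", ["example.com"]))
```

This also shows why listing 'example.com' is more permissive than listing 'www.example.com': the former covers every subdomain, the latter only that host and its own subdomains.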

Removing duplicate URLs

Deduplication rules should be shared by all spiders: once one spider has crawled a URL, the others should not crawl it again. It can be implemented as follows.

# Method 1:
1. Add a class attribute
visited = set()  # class attribute

2. In the parse callback:
def parse(self, response):
    if response.url in self.visited:
        return None
    .......

    self.visited.add(response.url)

# Method 1, improved: URLs can be long, so store a hash of the URL instead
    def parse(self, response):
        url = self.md5(response.request.url)
        if url in self.visited:
            return None
        .......
        self.visited.add(url)
  -----------------------
    @staticmethod
    def md5(val):
        import hashlib
        ha = hashlib.md5()
        ha.update(bytes(val, encoding='utf-8'))
        key = ha.hexdigest()
        return key

# Method 2: Scrapy's built-in deduplication
In the settings file:
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'  # the default filter; seen requests are kept in memory
DUPEFILTER_DEBUG = False
JOBDIR = "directory for the record of visited requests, e.g. /root/"  # the final path is /root/requests.seen, so the record is kept in a file

Scrapy's default filter is RFPDupeFilter. Requests are deduplicated as long as
Request(..., dont_filter=False) is used (the default); dont_filter=True tells Scrapy to exclude this URL from deduplication.

# Method 3:
Write a custom filter modeled on RFPDupeFilter.
from scrapy.dupefilters import RFPDupeFilter, read its source, and follow the BaseDupeFilter interface.

# Step 1: create a custom filter file dup.py in the project directory
class UrlFilter(object):
    def __init__(self):
        self.visited = set()  # or keep the record in a database

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        if request.url in self.visited:
            return True
        self.visited.add(request.url)
        return False

    def open(self):  # can return deferred
        pass

    def close(self, reason):  # can return a deferred
        pass

    def log(self, request, spider):  # log that a request has been filtered
        pass

# Step 2: in settings.py:
DUPEFILTER_CLASS = 'project_name.dup.UrlFilter'
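
Comparing raw URLs misses duplicates that differ only in query-parameter order or in the fragment. A fingerprint-style filter avoids this by hashing a normalised form of the request. The sketch below is a simplified, stdlib-only illustration of that idea and not the algorithm RFPDupeFilter actually uses; fingerprint and FingerprintFilter are hypothetical names, and request_seen here takes a method and URL instead of a Request object.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def fingerprint(method, url):
    """Hash the method plus a normalised URL (sorted query, no fragment)."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    canonical = urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))
    return hashlib.sha1(f"{method.upper()} {canonical}".encode("utf-8")).hexdigest()


class FingerprintFilter:
    """Remembers fingerprints instead of full URLs."""

    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, method, url):
        fp = fingerprint(method, url)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False


url_filter = FingerprintFilter()
first = url_filter.request_seen("GET", "http://example.com/?a=1&b=2")
reordered = url_filter.request_seen("get", "http://example.com/?b=2&a=1#top")
other_method = url_filter.request_seen("POST", "http://example.com/?a=1&b=2")
print(first, reordered, other_method)
```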


Examples

# Example 1:
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]
    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
        
    
# Example 2: a single callback returning multiple Requests and Items
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield {"title": h3}

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
            
            
# Example 3: specify the initial URLs directly in start_requests(); start_urls is then unused.
import scrapy
from myproject.items import MyItem  # a custom data structure, see Items

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/1.html', self.parse)
        yield scrapy.Request('http://www.example.com/2.html', self.parse)
        yield scrapy.Request('http://www.example.com/3.html', self.parse)

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)

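(These examples come from the official documentation.)

Passing arguments

You may need to pass arguments to a spider from the command line, for example the initial URL
# run from the command line
scrapy crawl myspider -a category=electronics

# arguments passed from outside are received in __init__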
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.com/categories/%s' % category]
        #...

# Note: all received arguments are strings. To get structured data such as a dict, parse them with something like json.loads
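
Because every -a value arrives as a string, structured or numeric arguments have to be converted explicitly. The following sketch shows the conversion step on a plain class; ToySpider and the filters and max_pages arguments are hypothetical and only stand in for a real spider's __init__.

```python
import json


class ToySpider:
    """Hypothetical stand-in for a spider; only __init__ matters here."""

    def __init__(self, category=None, filters="{}", max_pages="1"):
        self.category = category
        self.filters = json.loads(filters)  # JSON text -> dict
        self.max_pages = int(max_pages)     # numeric text -> int


# Roughly what the spider would receive from:
#   scrapy crawl myspider -a category=electronics -a filters='{"lang": "en"}' -a max_pages=3
spider = ToySpider(category="electronics", filters='{"lang": "en"}', max_pages="3")
print(spider.category, spider.filters, spider.max_pages)
```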


Selectors

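Several ready-made libraries exist for parsing data out of HTML source, such as BeautifulSoup and lxml. Scrapy has its own data extraction mechanism, Selectors, which "select" parts of an HTML document with XPath or CSS expressions. XPath is a language for selecting nodes in XML documents and can also be used with HTML. CSS is a language for applying styles to HTML documents. Scrapy selectors are built on top of lxml, so they are very similar to it in speed and parsing accuracy.

Topics covered below:
// versus /
text
extract and extract_first: getting the content out of selector objects
attributes: in XPath, attributes take the @ prefix
nested lookups
default values
lookup by attribute
fuzzy lookup by attribute
regular expressions
relative XPath paths
XPath with variables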

'''
<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>
'''

Try it in the shell: scrapy shell https://doc.scrapy.org/en/latest/_static/selectors-sample1.html

response.selector.css()
response.selector.xpath()
can be shortened to
response.css()
response.xpath()

#1. // versus /
response.xpath('//body/a')
response.css('div a::text')
>>> response.xpath('//body/a')  # the leading // searches the whole document; the / after body selects only direct children of body
[]
>>> response.xpath('//body//a') # the leading // searches the whole document; the // after body selects all descendants of body
[<Selector xpath='//body//a' data='<a href="image1.html">Name: My image 1 <'>,
 <Selector xpath='//body//a' data='<a href="image2.html">Name: My image 2 <'>,
 <Selector xpath='//body//a' data='<a href="image3.html">Name: My image 3 <'>,
 <Selector xpath='//body//a' data='<a href="image4.html">Name: My image 4 <'>,
 <Selector xpath='//body//a' data='<a href="image5.html">Name: My image 5 <'>]

#2. text
>>> response.xpath('//body//a/text()')
>>> response.css('body a::text')

#3. extract() and extract_first(): getting the content out of selector objects
>>> response.xpath('//div/a/text()').extract()
['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']
>>> response.css('div a::text').extract()
['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']

>>> response.xpath('//div/a/text()').extract_first()
'Name: My image 1 '
>>> response.css('div a::text').extract_first()
'Name: My image 1 '

#4. Attributes: in XPath, attributes take the @ prefix
>>> response.xpath('//div/a/@href').extract_first()
'image1.html'
>>> response.css('div a::attr(href)').extract_first()
'image1.html'

#5. Nested lookups
>>> response.xpath('//div').css('a').xpath('@href').extract_first()
'image1.html'

#6. Default values
>>> response.xpath('//div[@id="xxx"]').extract_first(default="not found")
'not found'

#7. Lookup by attribute
response.xpath('//div[@id="images"]/a[@href="image3.html"]/text()').extract()
response.css('a[href="image3.html"]::text').extract()
['Name: My image 3 ']

#8. Fuzzy lookup by attribute
response.xpath('//a[contains(@href,"image")]/@href').extract()
response.css('a[href*="image"]::attr(href)').extract()
['image1.html',
 'image2.html',
 'image3.html',
 'image4.html',
 'image5.html']

response.xpath('//a[contains(@href,"image")]/img/@src').extract()
response.css('a[href*="imag"] img::attr(src)').extract()
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

response.xpath('//*[@href="image1.html"]')
response.css('*[href="image1.html"]')

#9. Regular expressions
response.xpath('//a/text()').re(r'Name:\s* (.*)')
['My image 1 ',
 'My image 2 ',
 'My image 3 ',
 'My image 4 ',
 'My image 5 ']
response.xpath('//a/text()').re_first(r'Name:\s* (.*)')  # only the first match
'My image 1 '

#10. Relative XPath paths
>>> res=response.xpath('//a[contains(@href,"3")]')[0]
>>> res.xpath('img')
[<Selector xpath='img' data='<img src="image3_thumb.jpg">'>]
>>> res.xpath('./img')
[<Selector xpath='./img' data='<img src="image3_thumb.jpg">'>]
>>> res.xpath('.//img')
[<Selector xpath='.//img' data='<img src="image3_thumb.jpg">'>]
>>> res.xpath('//img') # this scans from the document root again
[<Selector xpath='//img' data='<img src="image1_thumb.jpg">'>,
 <Selector xpath='//img' data='<img src="image2_thumb.jpg">'>,
 <Selector xpath='//img' data='<img src="image3_thumb.jpg">'>,
 <Selector xpath='//img' data='<img src="image4_thumb.jpg">'>,
 <Selector xpath='//img' data='<img src="image5_thumb.jpg">'>]

#11. XPath with variables
>>> response.xpath('//div[@id=$xxx]/a/text()',xxx='images').extract_first()
'Name: My image 1 '
>>> response.xpath('//div[count(a)=$yyy]/@id',yyy=5).extract_first() # the id of the div that contains five a tags
'images'

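The difference between / and //, attribute lookups, and relative paths can also be tried without Scrapy, using the limited XPath subset in Python's xml.etree.ElementTree. This is only an analogy to make the path rules tangible: ElementTree is not Scrapy's Selector API, it needs well-formed XML, and it supports far fewer XPath features. The PAGE string is a shortened copy of the sample document above.

```python
import xml.etree.ElementTree as ET

PAGE = """<html>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
  </div>
 </body>
</html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# "/" selects direct children only: <a> is not a direct child of <body>
direct_children = root.findall("./body/a")

# "//" selects descendants at any depth below <body>
descendants = root.findall("./body//a")

# Attribute values and the text that precedes the first child element
hrefs = [a.get("href") for a in descendants]
texts = [a.text for a in descendants]

# Lookup by attribute, then a path relative to the element that was found
second = root.find(".//a[@href='image2.html']")
thumb = second.find("./img").get("src")

print(len(direct_children), hrefs, texts, thumb)
```
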
Items

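The main goal of scraping is to extract structured data from unstructured sources such as web pages. Scrapy provides the Item class for this purpose.

An Item object is a simple container for the scraped data. It offers a dict-like API plus a simple syntax for declaring the available fields.

Declaring Items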

import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)

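Scrapy Items are declared much like Django Models, but they are simpler because there are not as many different field types.

Item Fields

Field objects specify the metadata for each field.

Any kind of metadata can be specified for a field; Field objects place no restriction on the values they accept. Each key stored in a Field object can be used by several components, and only those components know about it. The main purpose of Field objects is to define all field metadata in one place.

Note: the Field objects used to declare an item are not assigned as class attributes. They can be accessed through the Item.fields attribute instead, for example: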

# items.py
class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

# quotes.py
    ...
    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()

            for i in item.fields:  # !!!
                print('fields',i)    # loops over the three fields: text, author, tags

            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item

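An Item behaves almost exactly like a dict. The difference is the extra fields attribute: a dict of all declared fields, whose keys are the field names and whose values are the Field objects used in the Item declaration.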
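
The idea of a dict-like container with declared fields can be sketched in a few lines of plain Python. This is a toy model for illustration, not Scrapy's Item implementation (which collects the fields through a metaclass); Field and ToyItem are local, hypothetical classes, and Product reuses the field names from the declaration shown earlier.

```python
class Field(dict):
    """Metadata container for one declared field (just a dict)."""


class ToyItem(dict):
    """Dict-like container that only accepts declared field names."""

    fields = {}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        super().__setitem__(key, value)


class Product(ToyItem):
    # In this toy the fields dict is written out by hand
    fields = {
        "name": Field(),
        "price": Field(),
        "stock": Field(),
        "last_updated": Field(serializer=str),
    }


item = Product()
item["name"] = "Desktop PC"
item["price"] = 1000

try:
    item["colour"] = "black"  # not declared, so it is rejected
    rejected = False
except KeyError:
    rejected = True

print(dict(item), list(Product.fields), rejected)
```

Rejecting undeclared keys is what protects against typos in field names, which a plain dict would silently accept.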

Item Pipeline

posted @ 2018-06-08 19:08 Charonnnnn
