Scraping images with Scrapy

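With an ordinary pipeline class we can already save string data.

Images could be saved the same way, by writing an ordinary pipeline class that calls open() and fp.write(img) itself, and that is not hard.

However, Scrapy already ships a class for handling images, ImagesPipeline, so we only need to subclass it and override a few of its methods.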
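
For comparison, the hand-written approach could look roughly like the sketch below. The class name ManualImagePipeline, the store_dir default and the injectable fetch argument are illustrative assumptions, not code from this post. Note that urlopen blocks while it downloads, which is one practical reason to prefer the built-in image pipeline.

```python
import os
from urllib.request import urlopen


class ManualImagePipeline:
    """Hypothetical plain pipeline that downloads and writes each image itself."""

    def __init__(self, store_dir='./imgs', fetch=None):
        self.store_dir = store_dir
        # fetch is injectable so the class can be tried without network access;
        # by default it performs a blocking HTTP GET with urllib
        self.fetch = fetch or (lambda url: urlopen(url).read())

    def open_spider(self, spider):
        # Make sure the target directory exists before the first item arrives
        os.makedirs(self.store_dir, exist_ok=True)

    def process_item(self, item, spider):
        data = self.fetch(item['src'])
        # Use the last URL segment as the file name
        file_name = item['src'].split('/')[-1]
        with open(os.path.join(self.store_dir, file_name), 'wb') as fp:
            fp.write(data)
        # Hand the item on to the next pipeline
        return item
```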

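Create a new Scrapy project. The code below imports from a package named imagePro, so the project is created under that name:

scrapy startproject imagePro

Generate the spider inside the project (the domain given here is only a placeholder; the real start URL is set in the code):

cd imagePro
scrapy genspider imgspider www.hhhh.com

Write the crawling logic in the generated spider file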

# -*- coding: utf-8 -*-
import scrapy
from imagePro.items import ImageproItem


class ImgspiderSpider(scrapy.Spider):
    name = 'imgspider'
    # allowed_domains = ['www.hello.com']
    start_urls = ['http://sc.chinaz.com/tupian/']

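    def parse(self, response):
        # Each child div of the container holds one thumbnail
        list_img = response.xpath('//div[@id="container"]/div')
        for img in list_img:
            # Read the image URL from the src2 attribute
            src = img.xpath('./div/a/img/@src2').extract_first()
            if not src:
                # Skip entries without an image URL
                continue

            item = ImageproItem()
            # urljoin turns relative or protocol-relative URLs into absolute ones
            item['src'] = response.urljoin(src)

            yield item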
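
scrapy.Request needs an absolute URL, while attribute values taken from a page can be empty, relative or protocol-relative. The sketch below shows how such values resolve against the page URL using the standard library; response.urljoin in Scrapy resolves against the page URL in a comparable way. The helper name normalize_src and the example URLs are made up for illustration, and whether this particular site serves relative URLs is an assumption.

```python
from urllib.parse import urljoin


def normalize_src(page_url, src):
    """Return an absolute image URL, or None when src is missing or blank."""
    if not src or not src.strip():
        return None
    # urljoin handles absolute, root-relative, relative and //host/path forms
    return urljoin(page_url, src.strip())


page = 'http://sc.chinaz.com/tupian/'
print(normalize_src(page, '//pic.example.com/a/b.jpg'))
print(normalize_src(page, '/img/x.jpg'))
print(normalize_src(page, None))
```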

Define the field in items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ImageproItem(scrapy.Item):
    # URL of the image to download
    src = scrapy.Field()

Process the images in pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.pipelines.images import ImagesPipeline 
import scrapy


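# Subclass ImagesPipeline and override get_media_requests, file_path and item_completed
class SaveImg(ImagesPipeline):

    # Issue a download request for the src stored in the item
    def get_media_requests(self, item, info):
        yield scrapy.Request(item['src'])

    # Customize the name of the saved image file
    # (item is passed as a keyword argument by Scrapy 2.4 and later)
    def file_path(self, request, response=None, info=None, *, item=None):
        url = request.url
        file_name = url.split('/')[-1]
        return file_name

    # The return value is passed to the next pipeline class to be executed
    def item_completed(self, results, item, info):
        return item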
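
The results argument of item_completed is a list of (success, info) tuples, one per request yielded by get_media_requests. For a successful download, info is a dict with keys such as url, path and checksum; for a failed one it is a failure object. The sketch below shows one way to pull the saved paths out of that list. The helper name downloaded_paths is hypothetical, and a plain Exception stands in for the failure object so the snippet runs without Scrapy.

```python
def downloaded_paths(results):
    """Return the storage paths of the images that downloaded successfully."""
    return [info['path'] for ok, info in results if ok]


# Sample data shaped like the results list; the values are invented
sample_results = [
    (True, {'url': 'http://example.com/a.jpg', 'path': 'a.jpg', 'checksum': 'abc123'}),
    (False, Exception('download failed')),  # stand-in for a failure object
]

print(downloaded_paths(sample_results))
```

Inside item_completed, a check like this could be used to raise scrapy.exceptions.DropItem when the list comes back empty, so that items without a saved image never reach later pipelines.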

Set the directory where images are saved in settings.py

IMAGES_STORE = './imgs'

Register the custom pipeline in settings.py

ITEM_PIPELINES = { 'imagePro.pipelines.SaveImg': 300, }
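
Put together, the relevant part of settings.py might look like the sketch below. The first two settings come from this post; the remaining ones are optional ImagesPipeline settings added here only as an illustration, and their values are assumptions.

```python
# Directory where ImagesPipeline stores the downloaded files
IMAGES_STORE = './imgs'

# Enable the custom pipeline; lower numbers run earlier (conventional range 0-1000)
ITEM_PIPELINES = {
    'imagePro.pipelines.SaveImg': 300,
}

# Optional, assumed values (not from the original post):
# skip images smaller than this size in pixels
IMAGES_MIN_WIDTH = 100
IMAGES_MIN_HEIGHT = 100
# do not re-download images fetched within the last 30 days
IMAGES_EXPIRES = 30
```

ImagesPipeline depends on the Pillow library, so it has to be installed before running the spider with scrapy crawl imgspider.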

posted @ 2020-06-27 21:59  bibicode