Scrapy's built-in image downloader: ImagesPipeline

1. Configuration (settings.py)
1.1 Obey robots.txt

ROBOTSTXT_OBEY = True

1.2 User-Agent spoofing

# Use a real User-Agent: in a browser open any page, open DevTools -> Network, refresh,
# pick an XHR request and copy the value of its user-agent request header, e.g.:
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"

1.3 Pipeline priority

ITEM_PIPELINES = {
    'qiubaiPro.pipelines.ImgsPipeline': 300,
}
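When several pipelines are enabled, the number is the priority: lower values run earlier (the usual range is 0-1000). A minimal sketch of the ordering, with a hypothetical second pipeline class SaveToDbPipeline:

ITEM_PIPELINES = {
    'qiubaiPro.pipelines.ImgsPipeline': 300,      # runs first: downloads the images
    'qiubaiPro.pipelines.SaveToDbPipeline': 400,  # hypothetical: runs after the image pipeline
}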

1.4 Log level

LOG_LEVEL = "ERROR"

1.5 Image storage location

import os

# save downloaded images into an "images" folder at the project root
IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
print(IMAGES_STORE)
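If file_path is not overridden (see 2.3 below), ImagesPipeline stores files under IMAGES_STORE/full/ and names them with the SHA-1 hash of the image URL. The pipeline also supports a few optional settings that this project does not use; the values below are only illustrative:

IMAGES_EXPIRES = 90                   # skip re-downloading images fetched within the last 90 days
IMAGES_THUMBS = {"small": (50, 50)}   # additionally generate 50x50 thumbnails under thumbs/small/
IMAGES_MIN_HEIGHT = 110               # silently drop images smaller than 110x110 pixels
IMAGES_MIN_WIDTH = 110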

2. Building the Scrapy project
2.1 Spider: extract the image URL, store it in an item and yield it

from qiubaiPro.items import QiubaiproItem
.......
.......
.......
title = res.xpath("./div[1]/div[1]/h1/text()").extract_first()
image_banner = response.xpath("/html/body/article/div[1]/section/div/img/@src").extract_first()
item = QiubaiproItem()
item["title"] = title
item["image_banner"] = image_banner
print(item)

yield item
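For orientation, here is a minimal spider sketch showing where the snippet above lives; the spider name, start URL and xpaths are placeholders and have to be adapted to the actual target site:

import scrapy
from qiubaiPro.items import QiubaiproItem


class ImgSpider(scrapy.Spider):
    name = "img"                                     # placeholder spider name
    start_urls = ["https://example.com/articles/1"]  # placeholder start URL

    def parse(self, response):
        # placeholder xpaths -- adjust them to the real page structure
        title = response.xpath("//article//h1/text()").extract_first()
        image_banner = response.xpath("//article//section//img/@src").extract_first()

        item = QiubaiproItem()
        item["title"] = title
        item["image_banner"] = image_banner
        yield item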

2.2 items.py

import scrapy


class QiubaiproItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    image_banner = scrapy.Field()

2.3 pipelines.py


from scrapy.pipelines.images import ImagesPipeline
import scrapy

class ImgsPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # issue a download request for the image URL stored in the item
        yield scrapy.Request(item["image_banner"])

    def file_path(self, request, response=None, info=None, *, item=None):
        # decide the file name (relative to IMAGES_STORE) for this image

        imgName = request.url.split("/")[-2]
        # split the image URL and pick a suitable segment as the file name; for this site the
        # second-to-last segment is the name, e.g. 04_14_05_afc7d23d524bce96d554279a8530c1d2_1522821911852.jpg
        return imgName

    def item_completed(self, results, item, info):
        # return the item so it is passed on to the next enabled pipeline
        return item
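results is a list with one (success, info) tuple per image requested in get_media_requests; on success, info is a dict holding the image's url, path (relative to IMAGES_STORE) and checksum. If you would rather drop items whose image failed to download instead of passing them along, a small sketch of a variant of the class above:

from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class ImgsPipeline(ImagesPipeline):
    # get_media_requests and file_path stay as shown above

    def item_completed(self, results, item, info):
        # keep the item only if at least one image was downloaded successfully
        if not any(ok for ok, _ in results):
            raise DropItem("image download failed: %s" % item["image_banner"])
        return item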

3. Troubleshooting
3.1 Symptom: images are not downloaded or written to disk, and the methods of ImgsPipeline are never executed
Fix
3.1.1 Check the log

# If no such output appears, comment out LOG_LEVEL in settings.py so WARNING messages are visible
[scrapy.middleware] WARNING: Disabled ImgsPipeline: ImagesPipeline requires installing Pillow 4.0.0 or later
INFO: Enabled item pipelines:
[]

The Pillow module is not installed; install it with pip:

pip install Pillow
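After installing, a quick way to confirm the version Scrapy will see (PIL.__version__ is available in recent Pillow releases):

import PIL
print(PIL.__version__)   # should be 4.0.0 or later for ImagesPipeline to be enabled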