ImagesPipeline, the image downloader built into the Scrapy framework
1, Configuration
1.1, ROBOTSTXT
ROBOTSTXT_OBEY = True
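1.2, User-Agent spoofing
# In a browser, send any request, open the developer tools, switch to the Network tab, refresh the page, pick an XHR request, and copy the value of its user-agent header.
USER_AGENT = "<the user-agent value copied from your browser>"
1.3, Pipeline priority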
ITEM_PIPELINES = {
    'qiubaiPro.pipelines.imgsPileine': 300,
}
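The integer assigned to each pipeline sets its execution order: Scrapy runs item pipelines from the lowest value to the highest, and the values are conventionally kept in the 0-1000 range. A minimal sketch of that ordering in plain Python (not Scrapy itself); the second pipeline name is hypothetical and only there for illustration:

```python
# Hypothetical settings: only 'imgsPileine' comes from this project;
# 'SaveToDbPipeline' is an invented name used to show the ordering.
ITEM_PIPELINES = {
    'qiubaiPro.pipelines.SaveToDbPipeline': 400,
    'qiubaiPro.pipelines.imgsPileine': 300,
}

def execution_order(pipelines):
    """Return pipeline paths sorted the way Scrapy runs them: lowest value first."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]

order = execution_order(ITEM_PIPELINES)
```

Here imgsPileine (300) would run before the hypothetical SaveToDbPipeline (400).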
1.4, Set the log level
LOG_LEVEL = "ERROR"
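Note that ERROR hides every message below that severity, including warnings such as the Pillow one shown in section 3.1.1. A small sketch with the standard logging module, whose level names Scrapy's LOG_LEVEL setting uses, shows the comparison; the helper name is_shown is made up for this example:

```python
import logging

def is_shown(message_level, log_level):
    """Return True if a message at message_level passes a threshold of log_level."""
    return getattr(logging, message_level) >= getattr(logging, log_level)

# WARNING (30) is below ERROR (40), so the warning is filtered out
warning_visible_at_error = is_shown("WARNING", "ERROR")
# With the threshold lowered to WARNING, the same message gets through
warning_visible_at_warning = is_shown("WARNING", "WARNING")
```

While debugging, a lower threshold such as "WARNING" or "INFO" keeps these messages visible.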
1.5, Image storage location
import os
IMAGES_STORE=os.path.join(os.path.dirname(os.path.dirname(__file__)),'images')
print(IMAGES_STORE)
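The two nested os.path.dirname calls climb from settings.py to the package directory and then to the project root, so in the default project layout the images folder ends up next to scrapy.cfg. A sketch of that computation on an example path; the path under /home/user is an assumption for illustration, and posixpath is used so the result is the same on any platform:

```python
import posixpath

def images_store_for(settings_file):
    """Mirror the IMAGES_STORE expression for a given settings.py path."""
    package_dir = posixpath.dirname(settings_file)   # .../qiubaiPro/qiubaiPro
    project_root = posixpath.dirname(package_dir)    # .../qiubaiPro
    return posixpath.join(project_root, "images")

store = images_store_for("/home/user/qiubaiPro/qiubaiPro/settings.py")
```

If __file__ could be a relative path in your setup, wrapping it in os.path.abspath first is a safe precaution.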
2, Create a Scrapy project
2.1, The spider extracts the image URL, stores it in an item, and yields it
from qiubaiPro.items import QiubaiproItem
.......
.......
.......
title = res.xpath("./div[1]/div[1]/h1/text()").extract_first()
image_banner = response.xpath("/html/body/article/div[1]/section/div/img/@src").extract_first()
item = QiubaiproItem()
item["title"] = title
item["image_banner"] = image_banner
print(item)
yield item
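If the src attribute holds a relative path, scrapy.Request will reject it because the URL has no scheme, so it is worth making the URL absolute before storing it in the item. Inside a spider callback this is usually done with response.urljoin(...); the sketch below shows the same resolution with the standard library, using made-up example.com URLs:

```python
from urllib.parse import urljoin

def absolute_image_url(page_url, src):
    """Resolve an <img src> value against the URL of the page it came from."""
    return urljoin(page_url, src)

# A root-relative src is resolved against the page's scheme and host
relative = absolute_image_url("https://example.com/article/1", "/static/img/a.jpg")
# An already absolute src is returned unchanged
unchanged = absolute_image_url("https://example.com/article/1", "https://cdn.example.com/b.jpg")
```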
2.2, items.py
import scrapy
class QiubaiproItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    image_banner = scrapy.Field()
2.3, pipelines.py
from scrapy.pipelines.images import ImagesPipeline
import scrapy
class imgsPileine(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Request the image at the URL stored in the item
        yield scrapy.Request(item["image_banner"])
    def file_path(self, request, response=None, info=None, *, item=None):
        # Set the path (relative to IMAGES_STORE) under which the image is saved
        # Split the image URL and pick a suitable segment as the file name, e.g. 04_14_05_afc7d23d524bce96d554279a8530c1d2_1522821911852.jpg
        imgName = request.url.split("/")[-2]
        return imgName
    def item_completed(self, results, item, info):
        # Return the item so it is passed to the next pipeline class to be executed
        return item
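Which segment of the split URL is the right file name depends on how the site builds its image URLs, so a fixed index such as [-2] should be checked against a real URL. As an alternative sketch (not the method used above), the name can be taken from the last path segment with the query string ignored, falling back to a SHA-1 hash of the URL, which is similar to what ImagesPipeline does by default. The function name and the example URLs are assumptions:

```python
import hashlib
import posixpath
from urllib.parse import urlparse

def image_name_from_url(url):
    """Derive a file name from an image URL."""
    path = urlparse(url).path          # drops the query string and fragment
    name = posixpath.basename(path)    # last path segment, may be empty
    if not name:
        # No usable segment (e.g. the URL ends with "/"): hash the URL instead
        name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".jpg"
    return name

name = image_name_from_url("https://example.com/img/04_14_05_photo.jpg?w=200")
fallback = image_name_from_url("https://example.com/img/")
```

Inside file_path, such a helper would be called as image_name_from_url(request.url).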
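3, When a bug appears
3.1, Example: the images are not downloaded or written to disk, and the methods of the imgsPileine class never run
Solution
3.1.1, Check the log
# If no log output appears, comment out LOG_LEVEL in settings.py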
[scrapy.middleware] WARNING: Disabled PianImgPipeline: ImagesPipeline requires installing Pillow 4.0.0 or later
INFO: Enabled item pipelines:
[]
The Pillow module is not installed. Install it with:
pip install Pillow
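To confirm the fix before rerunning the spider, one can check whether Pillow is importable in the same Python environment that runs Scrapy. Pillow is imported under the name PIL; the helper name below is made up for this sketch:

```python
import importlib.util

def module_available(name):
    """Return True if a top-level module with this name can be imported."""
    return importlib.util.find_spec(name) is not None

# Expected to be True once `pip install Pillow` has succeeded in this environment
pillow_ok = module_available("PIL")
```

If pillow_ok is still False after installing, pip most likely installed into a different interpreter than the one running Scrapy.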