Scrapy in practice (1): saving crawled images into different directories
Some readers who use Scrapy to download images are not satisfied with a plain download: they also want to rename the files and group them (putting all images from the same URL into the same folder). So how should Scrapy image downloads be handled?

It is actually quite simple. If you look at the implementation of the Scrapy class we inherit from, ImagesPipeline, you will find this method: def file_path(self, request, response=None, info=None). It is the method that decides an image's file name and the directory it is saved to; by overriding parts of it we can easily rename images and save them into different directories.
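For context, the stock ImagesPipeline names every image after a SHA1 hash of its URL and stores all of them flat under a single full/ directory, roughly as in this simplified stand-alone sketch (not the pipeline's exact code; the sample URL is arbitrary):

```python
import hashlib

def default_style_file_path(url):
    # Simplified version of ImagesPipeline's default naming:
    # one flat full/ directory, file named by the SHA1 of the image URL.
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "full/{0}.jpg".format(image_guid)

print(default_style_file_path("http://lab.scrapyd.cn/img/a.jpg"))
```

This flat, hash-named layout is why we override file_path to get readable names grouped by page.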
1. Create the project
scrapy startproject imageRename
2. Write the item (items.py)
```python
# -*- coding: utf-8 -*-
import scrapy


class ImagerenameItem(scrapy.Item):
    imgurl = scrapy.Field()   # list of image URLs found on the page
    imgname = scrapy.Field()  # page title, later used as the folder name
```
3. Create the spider file ImgRename.py under the spiders directory and write the spider
```python
# -*- coding: utf-8 -*-
import scrapy

from imageRename.items import ImagerenameItem


class ImgRenameSpider(scrapy.Spider):
    name = "imgRename"
    allowed_domains = ['lab.scrapyd.cn']
    start_urls = [
        'http://lab.scrapyd.cn/archives/55.html',
        'http://lab.scrapyd.cn/archives/57.html',
    ]

    def parse(self, response):
        item = ImagerenameItem()
        # all image URLs inside the post body
        img_url_list = response.xpath("//div[@class='post-content']/p/img/@src").extract()
        item['imgurl'] = img_url_list
        # the post title, later used as the folder name
        item['imgname'] = response.xpath("//h1[@class='post-title']/a/text()").extract()[0]
        yield item
```
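To illustrate what the two XPath expressions collect, here is a stand-alone sketch against a made-up HTML fragment; it uses the stdlib xml.etree instead of Scrapy's selectors, and the image URLs are invented for the example:

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for a post page (must be well-formed XML for xml.etree).
html = (
    '<div class="post-content">'
    '<p><img src="http://lab.scrapyd.cn/img/a1.jpg"/></p>'
    '<p><img src="http://lab.scrapyd.cn/img/a2.jpg"/></p>'
    '</div>'
)

root = ET.fromstring(html)
# Same data the spider's //div[@class='post-content']/p/img/@src collects.
img_url_list = [img.get("src") for img in root.iter("img")]
print(img_url_list)
```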
4. Write the pipeline (pipelines.py)
```python
import re

import scrapy
from scrapy.pipelines.images import ImagesPipeline


class ImagerenamePipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # pass the page title along with each image request via meta
        for img_url in item['imgurl']:
            yield scrapy.Request(img_url, meta={"name": item['imgname']})

    def file_path(self, request, response=None, info=None):
        # keep the original file name: the last segment of the URL
        img_guid = request.url.split("/")[-1]
        name = request.meta['name']
        # strip characters that are illegal in file names
        name = re.sub(r'[?\\*|"<>:/]', '', name)
        filename = u'{0}/{1}'.format(name, img_guid)
        print("file........", filename)
        return filename
```
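The renaming logic in file_path can be exercised on its own. Below is a stand-alone sketch of the same sanitize-and-join steps (the sample URL and page title are made up for the demo):

```python
import re

def sanitize(name):
    # Same character class as the pipeline: strip characters
    # that are illegal in (Windows) file names.
    return re.sub(r'[?\\*|"<>:/]', '', name)

def file_path_for(url, page_title):
    img_guid = url.split("/")[-1]  # keep the original file name
    return "{0}/{1}".format(sanitize(page_title), img_guid)

print(file_path_for("http://lab.scrapyd.cn/img/a1.jpg", "title: one/two?"))
# -> title onetwo/a1.jpg
```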
5. Settings (settings.py)
```python
NEWSPIDER_MODULE = 'imageRename.spiders'

ITEM_PIPELINES = {
    "imageRename.pipelines.ImagerenamePipeline": 300,
}

# root directory where downloaded images are stored
IMAGES_STORE = "/home/liuxh/Desktop/tmpimg/"
```
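The string returned by file_path is interpreted relative to IMAGES_STORE, so the final location of a file is the join of the two. A small sketch (the page title "some-title" and file name "a1.jpg" are hypothetical; posixpath is used just to keep the join deterministic for the illustration):

```python
import posixpath

IMAGES_STORE = "/home/liuxh/Desktop/tmpimg/"  # from settings.py above

# file_path() returns "some-title/a1.jpg"; the pipeline joins it onto
# IMAGES_STORE, so the image lands here:
final_path = posixpath.join(IMAGES_STORE, "some-title/a1.jpg")
print(final_path)
# -> /home/liuxh/Desktop/tmpimg/some-title/a1.jpg
```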
6. Run
scrapy list
scrapy crawl imgRename
Adapted from: http://www.scrapyd.cn/example/175.html
posted on 2018-11-15 17:33 by myworldworld