Downloading Images with Scrapy

Scrapy's image pipeline is implemented by the built-in ImagesPipeline class. Its source lives in site-packages\scrapy\pipelines\images.py, where you can read the available functions and methods.

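ImagesPipeline provides the following features:

Converts all downloaded images to a common format (JPG) and mode (RGB)
Avoids re-downloading images that were downloaded recently
Generates thumbnails
Checks image width/height to make sure they meet a minimum size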
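
These features are controlled through settings. A minimal sketch of what that might look like in settings.py, using Scrapy's IMAGES_EXPIRES, IMAGES_THUMBS, IMAGES_MIN_WIDTH and IMAGES_MIN_HEIGHT settings. The concrete values and the thumbnail names "small" and "big" are illustrative assumptions, not values taken from this project:

```python
# settings.py (sketch; all values below are illustrative assumptions)

# Skip images that were already downloaded within the last 30 days
IMAGES_EXPIRES = 30

# Generate two thumbnails per image; keys are arbitrary names, values are (width, height)
IMAGES_THUMBS = {
    "small": (50, 50),
    "big": (270, 270),
}

# Drop images smaller than 110x110 pixels
IMAGES_MIN_WIDTH = 110
IMAGES_MIN_HEIGHT = 110
```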

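A typical workflow when using this class looks like this:

1. In a spider, you scrape an item and put the image URLs, along with any other fields you want to collect, into the item (a dict-like object);

2. Scrapy passes the scraped item into the item pipelines (processing classes), which process it one after another.

The number assigned to each pipeline in ITEM_PIPELINES in settings.py determines the order in which the item flows through them: lower numbers run first.

3. When the item reaches ImagesPipeline, the image URLs stored in it are handed to the downloader, and the item waits at that stage until the downloads finish.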
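
The ordering rule in step 2 can be illustrated without running Scrapy at all. The sketch below only sorts a settings dict by its numbers; pipeline_order is a hypothetical helper written for this illustration, not a Scrapy function:

```python
# Pipeline paths mirror the ones used in this post; the numbers are their priorities.
ITEM_PIPELINES = {
    "ArticleSpider.pipelines.ArticlespiderPipeline": 300,
    "scrapy.pipelines.images.ImagesPipeline": 1,
}


def pipeline_order(pipelines):
    """Hypothetical helper: list pipeline paths from lowest to highest number."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


order = pipeline_order(ITEM_PIPELINES)
print(order)  # ImagesPipeline (1) comes before ArticlespiderPipeline (300)
```

With these numbers the images are downloaded first, so a later pipeline can rely on whatever the image pipeline wrote into the item.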

To save the images referenced by an item to local disk, add the following settings:

import os

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'ArticleSpider.pipelines.ArticlespiderPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1,
    # 'ArticleSpider.pipelines.ArticleimagePipeline': 1,
}
# Tells ImagesPipeline which item field holds the image URLs
IMAGES_URLS_FIELD = "front_page_url"
# Directory that contains this settings file
project_path = os.path.abspath(os.path.dirname(__file__))
# Where downloaded images are saved: an images/ folder next to this settings file
IMAGES_STORE = os.path.join(project_path, 'images')
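
On the spider side, the field named by IMAGES_URLS_FIELD has to be filled with a list of absolute URLs. A minimal sketch of that step in plain Python; build_item, the "url" key and the example.com addresses are hypothetical and only serve the illustration:

```python
from urllib.parse import urljoin


def build_item(page_url, image_src):
    """Hypothetical helper: build an item dict whose image field is a list of absolute URLs."""
    return {
        "url": page_url,
        # Wrapped in a list because the pipeline iterates over this field
        "front_page_url": [urljoin(page_url, image_src)],
    }


item = build_item("https://example.com/articles/1/", "/static/cover.jpg")
print(item["front_page_url"])
```

Resolving the src against the page URL first matters because pages often use relative image paths, and the downloader needs a full URL.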

Custom pipeline

If we want to attach the local path of each downloaded image to the item it came from, we can write a custom pipeline that inherits from ImagesPipeline and overrides one of its methods:

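from scrapy.pipelines.images import ImagesPipeline


class ArticleimagePipeline(ImagesPipeline):
    def item_completed(self, results, item, info):
        # results is a list of (ok, value) tuples: ok is a bool, and on success
        # value is a dict that includes the image's url and its saved path
        for ok, value in results:
            if ok:  # on failure, value is not a dict and has no "path" key
                item["front_image_path"] = value["path"]

        return item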
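
For the subclass to take effect it has to be registered in ITEM_PIPELINES. A sketch of that change, assuming the class lives in ArticleSpider/pipelines.py as the commented-out entry in the settings listing suggests:

```python
# settings.py (sketch): use the custom subclass in place of the built-in ImagesPipeline
ITEM_PIPELINES = {
    "ArticleSpider.pipelines.ArticlespiderPipeline": 300,
    # "scrapy.pipelines.images.ImagesPipeline": 1,
    "ArticleSpider.pipelines.ArticleimagePipeline": 1,
}
```

Keeping the number at 1 means the image pipeline still runs before ArticlespiderPipeline, so front_image_path is already set when the item arrives there.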

Important methods of the ImagesPipeline class:

1. get_media_requests()

def get_media_requests(self, item, info):
    return [Request(x) for x in item.get(self.images_urls_field, [])]

So the item field named by IMAGES_URLS_FIELD in settings must hold an iterable of URLs, normally a list.

When it is a list, each URL in it is handed to the downloader.
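
A common mistake is to store a single URL string in that field. Because a string is also iterable, the list comprehension would then walk over it character by character. The sketch below mimics only the iteration, without creating real Request objects; urls_to_request is a hypothetical stand-in and the URL is made up:

```python
def urls_to_request(item, field="front_page_url"):
    """Mimic the iteration in get_media_requests without creating real Requests."""
    return [x for x in item.get(field, [])]


good = urls_to_request({"front_page_url": ["https://example.com/a.jpg"]})
bad = urls_to_request({"front_page_url": "https://example.com/a.jpg"})
missing = urls_to_request({})

print(good)     # one entry: the full URL
print(bad[:5])  # individual characters, none of which is a valid URL
print(missing)  # empty list, so nothing is downloaded
```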

2. item_completed()

def item_completed(self, results, item, info):
    if isinstance(item, dict) or self.images_result_field in item.fields:
        item[self.images_result_field] = [x for ok, x in results if ok]
    return item

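results is a list in which each element is a tuple: the first entry is a bool saying whether the download succeeded, and on success the second entry is a dict that includes the path of the saved file;

Remember to return the item, because the next pipeline may need it.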
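
To make the shape of results concrete, here is a standalone sketch that applies the same filtering to hand-written data. collect_paths is a hypothetical helper, the URLs, paths and checksum are made up, and a plain Exception stands in for the Twisted Failure that Scrapy puts in a failed entry:

```python
def collect_paths(results):
    """Hypothetical helper: keep the saved paths of the downloads that succeeded."""
    return [value["path"] for ok, value in results if ok]


# Hand-written stand-in for what the pipeline passes to item_completed
results = [
    (True, {"url": "https://example.com/a.jpg", "path": "full/a.jpg", "checksum": "abc"}),
    (False, Exception("download failed")),
]

item = {"front_page_url": ["https://example.com/a.jpg", "https://example.com/b.jpg"]}
paths = collect_paths(results)
if paths:
    item["front_image_path"] = paths[0]
print(item["front_image_path"])
```

Filtering on the bool first is what keeps the failed entry from being indexed like a dict.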

Posted 2018-01-31 16:42 by 小虾饺
