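Scraping images with Scrapy
With an ordinary pipeline class we can already save strings.
Images can be handled with an ordinary pipeline class too, by writing open() and fp.write(img) ourselves, which is not hard.
But Scrapy already ships a handy class for processing images, ImagesPipeline, so we only need to override a few of its methods.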
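For comparison, the do-it-yourself approach mentioned above might look like the minimal sketch below. The function name save_image and its parameters are made up for illustration, and it assumes the image bytes have already been downloaded.

```python
import os


def save_image(data: bytes, directory: str, file_name: str) -> str:
    """Write raw image bytes to directory/file_name and return the path."""
    # Create the target folder if it does not exist yet
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, file_name)
    # Binary mode is required: image data is bytes, not text
    with open(path, "wb") as fp:
        fp.write(data)
    return path
```

This works, but the naming, the storage directory and the download requests all have to be managed by hand, which is the part ImagesPipeline takes over.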
Create a new Scrapy project (named imagePro here, to match the imports used in the code that follows)
scrapy startproject imagePro
Generate a spider file
scrapy genspider imgspiderarse www.hhhh.com
Write the crawling logic in the generated spider file (the listing below uses the spider name imgspider)
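# -*- coding: utf-8 -*-
import scrapy
from imagePro.items import ImageproItem


class ImgspiderSpider(scrapy.Spider):
    name = 'imgspider'
    # allowed_domains = ['www.hello.com']
    start_urls = ['http://sc.chinaz.com/tupian/']

    def parse(self, response):
        # Each child div of the container holds one image entry
        list_img = response.xpath('//div[@id="container"]/div')
        for img in list_img:
            # The image URL is read from the src2 attribute rather than src
            src = img.xpath('./div/a/img/@src2').extract_first()
            if not src:
                continue  # skip entries that have no image URL
            item = ImageproItem()
            # urljoin leaves absolute URLs unchanged and completes relative ones
            item['src'] = response.urljoin(src)
            yield item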
Define the field in items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class ImageproItem(scrapy.Item):
    # URL of the image to download
    src = scrapy.Field()
Process the images in pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class SaveImg(ImagesPipeline):
    """Subclass ImagesPipeline and override get_media_requests, file_path and item_completed."""

    # Issue a download request for the src stored in the item
    def get_media_requests(self, item, info):
        yield scrapy.Request(item['src'])

    # Customize the name of the saved image
    # (the keyword-only item=None matches the signature used by newer Scrapy versions)
    def file_path(self, request, response=None, info=None, *, item=None):
        url = request.url
        file_name = url.split('/')[-1]
        return file_name

    # The return value is passed on to the next pipeline class to be executed
    def item_completed(self, results, item, info):
        return item
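Two details of the pipeline can be tried out without running a crawl. The sketch below is only an illustration: the helper names image_name_from_url and downloaded_paths are hypothetical and not part of Scrapy. The first shows a variant of the file_path naming rule that ignores a query string (the plain url.split('/')[-1] would keep it). The second shows one way to read the results argument of item_completed, which Scrapy passes as a list of (success, info) tuples where info is a dict with url, path and checksum keys for a successful download.

```python
import posixpath
from urllib.parse import urlparse


def image_name_from_url(url: str) -> str:
    """Return the last path segment of url, ignoring query string and fragment."""
    path = urlparse(url).path  # e.g. "/a/b/cat.jpg" for ".../a/b/cat.jpg?v=2"
    return posixpath.basename(path)


def downloaded_paths(results):
    """Collect the stored paths of the downloads that succeeded.

    results has the shape Scrapy passes to item_completed:
    a list of (success, info) tuples.
    """
    return [info["path"] for ok, info in results if ok]
```

Inside item_completed, something like downloaded_paths(results) could be used to decide whether to drop an item whose image failed to download; the original code simply returns the item either way.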
Set the directory where images are saved in settings.py
IMAGES_STORE = './imgs'
Register the custom pipeline in settings.py
ITEM_PIPELINES = { 'imagePro.pipelines.SaveImg': 300, }
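Put together, the relevant part of settings.py might look like the sketch below. IMAGES_STORE and ITEM_PIPELINES are taken from above; the remaining settings are optional ImagesPipeline settings, and the values chosen for them are assumptions for illustration only. Note that ImagesPipeline needs the Pillow library to be installed.

```python
# settings.py (sketch; only IMAGES_STORE and ITEM_PIPELINES come from the article)

# Directory where downloaded images are stored; the name returned by
# file_path is resolved relative to this directory.
IMAGES_STORE = './imgs'

# The number is the pipeline's order: lower values run earlier.
ITEM_PIPELINES = {
    'imagePro.pipelines.SaveImg': 300,
}

# Optional (illustrative values): skip images smaller than this, in pixels.
IMAGES_MIN_WIDTH = 100
IMAGES_MIN_HEIGHT = 100

# Optional (illustrative value): do not re-download images fetched
# within the last 30 days.
IMAGES_EXPIRES = 30
```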
