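Crawling Images with the Scrapy Framework
I. Task
Crawl the wallpapers on this site (https://desk.zol.com.cn/bizhi/9506_115438_2.html) and save them to disk.
II. Project code
1. Create the project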
scrapy startproject zol
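2. Edit the settings (settings.py):
Set USER_AGENT to a browser-like string
Change ROBOTSTXT_OBEY to False
Enable ITEM_PIPELINES and register the ImagePipeline class from pipelines.py
Set the directory where images are saved:
IMAGES_STORE = "d:/pics"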
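Put together, the settings listed above might look like the excerpt below. This is a sketch, not the author's actual file: the USER_AGENT value and the priority 300 are assumptions, and the dotted pipeline path assumes the project package is named zol and the class is the ImagePipeline shown in step 4. ImagesPipeline also needs the Pillow package installed.

```python
# settings.py (excerpt): a sketch; values marked "assumed" are not from the original post

# Assumed value: any realistic browser User-Agent string should do.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

# Do not fetch and obey robots.txt before crawling.
ROBOTSTXT_OBEY = False

# Register the custom pipeline from pipelines.py.
# The path assumes the package name "zol"; 300 is an assumed priority.
ITEM_PIPELINES = {
    "zol.pipelines.ImagePipeline": 300,
}

# Directory where ImagesPipeline stores the downloaded files (from the post).
IMAGES_STORE = "d:/pics"
```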
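3. Spider file: zol.py
scrapy genspider zol zol.com.cn
Note: genspider may refuse to create a spider whose name equals the project name. If that happens, create zol/spiders/zol.py by hand with the content below.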
import scrapy


class ZolSpider(scrapy.Spider):
    name = 'zol'
    allowed_domains = ['zol.com.cn']
    start_urls = ['http://desk.zol.com.cn/bizhi/9506_115438_2.html']

    def parse(self, response):
        # Full-size wallpaper URL and its title on the current page
        image_url = response.xpath('//img[@id="bigImg"]/@src').get()
        image_name = response.xpath('string(//h3)').get()
        if image_url:
            yield {
                # Make the URL absolute in case the page uses a relative src
                'image_url': response.urljoin(image_url),
                'image_name': image_name,
            }
        # Follow the "next" link; stop when the page has none
        next_url = response.xpath('//a[@id="pageNext"]/@href').get()
        if next_url:
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)
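The spider turns the href of the "next" link into an absolute URL before requesting it. Scrapy's response.urljoin builds on the standard library's urllib.parse.urljoin, so the behaviour can be sketched without Scrapy. The href values below are hypothetical and only illustrate the relative, absolute and missing cases; resolve_next is an illustrative helper, not part of Scrapy.

```python
from urllib.parse import urljoin

# The page the spider is currently parsing (taken from start_urls).
current_page = "http://desk.zol.com.cn/bizhi/9506_115438_2.html"

# Hypothetical href values, for illustration only; the real site may differ.
relative_href = "/bizhi/9506_115439_2.html"
absolute_href = "http://desk.zol.com.cn/bizhi/9506_115440_2.html"


def resolve_next(base_url, href):
    """Return an absolute URL for href, or None when there is no next link."""
    if not href:
        return None
    return urljoin(base_url, href)


next_from_relative = resolve_next(current_page, relative_href)
next_from_absolute = resolve_next(current_page, absolute_href)
next_from_missing = resolve_next(current_page, None)

print(next_from_relative)
print(next_from_absolute)
print(next_from_missing)
```

Returning None for a missing link lets the caller skip the request instead of asking for the current page again.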
4. pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline


class ImagePipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Request the image and pass the title along in meta
        # so that file_path() can use it as the file name
        yield Request(item['image_url'], meta={'name': item['image_name']})

    def file_path(self, request, response=None, info=None, *, item=None):
        # item is passed as a keyword argument by newer Scrapy versions
        # Remove the whitespace around and inside the scraped title
        name = request.meta['name'].strip().replace('\r\n\t\t', '')
        # '/' would otherwise be treated as a directory separator
        name = name.replace('/', '_')
        return name + '.jpg'
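The file name is built from the page title, so it may contain line breaks or characters that a file system rejects. The helper below is one possible, more defensive cleanup. It is a sketch rather than part of the original pipeline: safe_image_name, its parameters and the sample title are all hypothetical.

```python
import re

# Characters that Windows does not allow in file names, plus control characters.
_INVALID = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_image_name(title, default="untitled", ext=".jpg"):
    """Turn a page title into a file name that is safe on common file systems."""
    # Collapse every run of whitespace (including \r, \n, \t) into one space.
    cleaned = " ".join(title.split())
    # Replace each forbidden character with an underscore.
    cleaned = _INVALID.sub("_", cleaned)
    # Fall back to a placeholder when nothing usable is left.
    return (cleaned or default) + ext


print(safe_image_name("\r\n\t\tSunset / Lake (3/12)\r\n"))
```

Inside file_path, such a helper could replace the strip and replace calls, for example return safe_image_name(request.meta['name']).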
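III. Result
Running the spider with scrapy crawl zol saves the wallpapers under d:/pics, each named after its page title.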

