Scraping Images with the Scrapy Framework

I. Task

Scrape the wallpapers from this site (https://desk.zol.com.cn/bizhi/9506_115438_2.html) and save them locally.

II. Project code

1. Create the project

 scrapy startproject zol 

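2. Edit the settings in settings.py:

  Set USER_AGENT.

  Change ROBOTSTXT_OBEY to False.

  Enable ITEM_PIPELINES and register the image pipeline defined in step 4.

  Set the directory where images are saved:

IMAGES_STORE = "d:/pics"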
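
Put together, these settings might look roughly like the excerpt below. It is a sketch rather than the author's actual file: the USER_AGENT string and the priority 300 are placeholder assumptions, and the pipeline path assumes the project is named zol and the pipeline class is called ImagePipeline. Scrapy's image pipeline also needs the Pillow package installed.

```python
# settings.py (excerpt): a sketch, values marked "assumed" are placeholders.

# Assumed value: any realistic browser user agent string can be used here.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

# Do not consult robots.txt before requesting pages.
ROBOTSTXT_OBEY = False

# Register the custom pipeline from pipelines.py. 300 is an assumed priority;
# pipelines with lower numbers run earlier (customary range 0-1000).
ITEM_PIPELINES = {
    "zol.pipelines.ImagePipeline": 300,
}

# Directory where the image pipeline stores downloaded files.
IMAGES_STORE = "d:/pics"
```

If ITEM_PIPELINES is left commented out, the spider still yields items, but nothing is downloaded.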

 

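3. Spider file: zol.py

scrapy genspider zol zol.com.cn

Note that genspider rejects a spider name identical to the project name. If that happens, create zol/spiders/zol.py by hand with the following code.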

 

import scrapy


class ZolSpider(scrapy.Spider):
    name = 'zol'
    allowed_domains = ['zol.com.cn']
    start_urls = ['http://desk.zol.com.cn/bizhi/9506_115438_2.html']

    def parse(self, response):
        # The full-size wallpaper is the <img id="bigImg"> element.
        image_url = response.xpath('//img[@id="bigImg"]/@src').get()
        # string(//h3) flattens the heading, nested tags included, into one string.
        image_name = response.xpath('string(//h3)').get()
        if image_url:
            yield {
                # 'image_urls': [image_url],  # key expected by the stock ImagesPipeline
                'image_url': response.urljoin(image_url),
                'image_name': image_name,
            }
        # Follow the "next" link; stop when the page has none.
        next_url = response.xpath('//a[@id="pageNext"]/@href').get()
        if next_url:
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)
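
The pagination step depends on response.urljoin, which resolves the link's href against the current page URL and, when the page has no <base> tag, behaves like the standard library's urllib.parse.urljoin. The sketch below illustrates that resolution without Scrapy. The helper next_page_url is hypothetical, and the href values are made-up examples, not links taken from the site.

```python
from typing import Optional
from urllib.parse import urljoin

PAGE_URL = "http://desk.zol.com.cn/bizhi/9506_115438_2.html"


def next_page_url(page_url: str, href: Optional[str]) -> Optional[str]:
    """Resolve a 'next' link against the current page, or return None if absent."""
    if not href:
        # No link found: the caller should stop paginating.
        return None
    return urljoin(page_url, href)


# Hypothetical href values, for illustration only.
print(next_page_url(PAGE_URL, "/bizhi/9506_115439_2.html"))  # root-relative link
print(next_page_url(PAGE_URL, "9506_115440_2.html"))         # page-relative link
print(next_page_url(PAGE_URL, None))                         # no next link
```

Returning None for a missing link makes the stop condition explicit, so the crawl does not re-request the current page when the last wallpaper is reached.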

 

4. pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline


class ImagePipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Pass the wallpaper title along so file_path() can use it as the file name.
        yield Request(item['image_url'], meta={'name': item['image_name']})

    def file_path(self, request, response=None, info=None, *, item=None):
        # Drop the line break and tabs embedded in the <h3> text.
        name = request.meta['name'].strip().replace('\r\n\t\t', '')
        # '/' would otherwise be treated as a directory separator.
        name = name.replace('/', '_')
        return name + '.jpg'
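
The file name cleanup in file_path can be tried on its own, outside Scrapy. The sketch below is a slightly more general variant, not the pipeline's exact logic: it collapses any run of whitespace and replaces the characters that Windows does not allow in file names. The function name safe_filename and the sample title are hypothetical.

```python
import re


def safe_filename(title: str, ext: str = ".jpg") -> str:
    """Turn a scraped title into a file name usable on Windows and POSIX."""
    # Collapse runs of whitespace (spaces, tabs, line breaks) into one space.
    name = re.sub(r"\s+", " ", title).strip()
    # Replace characters that are not allowed in Windows file names.
    name = re.sub(r'[\\/:*?"<>|]', "_", name)
    return name + ext


# Hypothetical title with an embedded line break and tabs.
print(safe_filename("  Sample wallpaper\r\n\t\t(2/10) "))
```

Since IMAGES_STORE points at a Windows path, covering characters such as ':' and '?' as well as '/' avoids write failures for titles that contain them.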

III. Result

 

 

posted @ 2021-03-10 00:55  简单de人