Scrapy example: downloading images from Douyu
I. Create the project (scrapy startproject)
scrapy startproject douyuSpider
II. Define the target (douyuSpider/items.py)
import scrapy


class DouyuspiderItem(scrapy.Item):
    # define the fields for your item here like:
    room_name = scrapy.Field()   # name of the live room
    imagelink = scrapy.Field()   # URL of the room's cover image
    imagepath = scrapy.Field()   # local path of the downloaded image
III. Write the spider (spiders/douyu.py)
1. scrapy genspider douyu "capi.douyucdn.cn"
Debug with scrapy shell


2. Open douyu.py in the douyuSpider/spiders directory; the code is as follows:
# -*- coding: utf-8 -*-
import scrapy
import json
from ..items import DouyuspiderItem


class DouyuSpider(scrapy.Spider):
    name = 'douyu'
    allowed_domains = ['capi.douyucdn.cn']

    url = "http://capi.douyucdn.cn/api/v1/getVerticalRoom?limit=20&offset="
    offset = 0

    start_urls = [url + str(offset)]

    def parse(self, response):
        # Convert the JSON response into Python objects; the "data" field is a list
        data = json.loads(response.text)['data']
        for each in data:
            item = DouyuspiderItem()
            item['room_name'] = each['room_name']
            item['imagelink'] = each['vertical_src']

            yield item

        # Request the next page until the offset reaches 100
        if self.offset < 100:
            self.offset += 20

            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
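The parsing and paging logic in parse can be tried outside Scrapy. The following is a minimal sketch using only the standard library. The sample payload is made up for illustration and only mirrors the two fields the spider reads (room_name and vertical_src), and the helper names extract_rooms and page_urls are hypothetical, not part of the original project.

```python
import json

BASE_URL = "http://capi.douyucdn.cn/api/v1/getVerticalRoom?limit=20&offset="


def extract_rooms(body):
    """Return (room_name, image_link) pairs from a JSON response body."""
    data = json.loads(body)["data"]
    return [(each["room_name"], each["vertical_src"]) for each in data]


def page_urls(step=20, last_offset=100):
    """Return the URLs the spider ends up requesting: offsets 0, 20, ..., 100."""
    return [BASE_URL + str(offset) for offset in range(0, last_offset + 1, step)]


# Hypothetical payload, shaped like the fields the spider reads.
sample_body = json.dumps({
    "data": [
        {"room_name": "room-a", "vertical_src": "http://example.com/a.jpg"},
        {"room_name": "room-b", "vertical_src": "http://example.com/b.jpg"},
    ],
})

rooms = extract_rooms(sample_body)
urls = page_urls()
print(rooms)
print(len(urls), urls[0], urls[-1])
```

Starting at offset 0 and adding 20 while the offset is below 100 means six pages are requested in total, the last one with offset=100.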
IV. Store the content (pipelines.py)
Change the following settings in settings.py:
ROBOTSTXT_OBEY = False

DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "DYZB/1 CFNetwork/808.2.16 Darwin/16.3.0"
}

ITEM_PIPELINES = {
    'douyuSpider.pipelines.ImagesPipeline': 300,
}

IMAGES_STORE = r"E:\python_practice_ku\pachong\douyuSpider\douyuSpider\images"
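The IMAGES_STORE value above is an absolute path on the author's machine. One possible alternative, sketched here under the assumption that the images directory should sit next to settings.py, is to derive the path from the settings file location. The helper name images_store_for is hypothetical.

```python
from pathlib import Path


def images_store_for(settings_file):
    """Return an "images" directory next to the given settings.py path."""
    return str(Path(settings_file).resolve().parent / "images")


# In settings.py this could be used as: IMAGES_STORE = images_store_for(__file__)
example = images_store_for("douyuSpider/settings.py")
print(example)
```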
Write pipelines.py (for image downloading, see: https://blog.csdn.net/kuangshp128/article/details/80321099)
import os

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.project import get_project_settings


# Images can be downloaded this way; the method names and parameters are fixed
class ImagesPipeline(ImagesPipeline):
    # Read the value configured in settings.py
    IMAGES_STORE = get_project_settings().get('IMAGES_STORE')

    def get_media_requests(self, item, info):
        image_url = item['imagelink']  # get the image link
        yield scrapy.Request(image_url)  # request the image link itself

    def item_completed(self, result, item, info):
        # result is a list, e.g. [(True, {'url': 'https://p0.ssl.qhimgs1.com/t01a098025e4214bacc.jpg', 'path': 't01a098025e4214bacc.jpg', 'checksum': '7adb29c836cde7a422c740aac3f86234'})]
        image_path = [x["path"] for ok, x in result if ok]
        if not image_path:
            # No image was downloaded, so there is nothing to rename
            raise DropItem("image download failed")
        # Rename the image after the room
        new_path = self.IMAGES_STORE + '/' + item['room_name'] + '.jpg'
        os.rename(self.IMAGES_STORE + '/' + image_path[0], new_path)
        item['imagepath'] = new_path
        return item
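The renaming step in item_completed can be exercised on its own. This is a minimal sketch, assuming room names may contain characters that are not valid in file names (the pipeline above uses room_name unchanged). The helpers safe_name and rename_download are hypothetical, not Scrapy APIs, and the results list is hand-made in the same shape as the example in the comment above.

```python
import os
import re
import tempfile


def safe_name(room_name):
    """Replace characters that are invalid in file names on common systems."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", room_name).strip()
    return cleaned or "unnamed"


def rename_download(store, results, room_name):
    """Rename the first successfully downloaded file to <room_name>.jpg.

    Returns the new path, or None when no download succeeded.
    """
    paths = [info["path"] for ok, info in results if ok]
    if not paths:
        return None
    new_path = os.path.join(store, safe_name(room_name) + ".jpg")
    os.replace(os.path.join(store, paths[0]), new_path)
    return new_path


# Demonstration with a temporary directory standing in for IMAGES_STORE.
with tempfile.TemporaryDirectory() as store:
    with open(os.path.join(store, "abc123.jpg"), "wb") as f:
        f.write(b"fake image bytes")
    results = [(True, {"url": "http://example.com/a.jpg", "path": "abc123.jpg"})]
    new_path = rename_download(store, results, "room/a?")
    renamed_exists = os.path.exists(new_path)
    new_basename = os.path.basename(new_path)
    nothing_downloaded = rename_download(store, [(False, None)], "room-b")

print(new_basename, renamed_exists, nothing_downloaded)
```

Two rooms with the same name would still overwrite each other's file in this sketch, as they would in the pipeline above.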
Run the spider from the project root:
scrapy crawl douyu


posted on 2020-03-10 23:41 cherry_ning