Scrapy example: downloading images from Douyu

1. Create the project (scrapy startproject)

scrapy startproject douyuSpider

 

2. Define the target (douyuSpider/items.py)

import scrapy

class DouyuspiderItem(scrapy.Item):
    # Fields collected for each live room
    room_name = scrapy.Field()   # room title
    imagelink = scrapy.Field()   # URL of the room's cover image
    imagepath = scrapy.Field()   # local path of the downloaded image

 

3. Write the spider (spiders/douyu.py)

 (1) Generate the spider: scrapy genspider douyu "capi.douyucdn.cn"

Debug with scrapy shell

 

   (2) Open douyu.py in the douyuSpider/spiders directory; the code is as follows:

# -*- coding: utf-8 -*-
import json

import scrapy

from ..items import DouyuspiderItem


class DouyuSpider(scrapy.Spider):
    name = 'douyu'
    allowed_domains = ['capi.douyucdn.cn']

    url = "http://capi.douyucdn.cn/api/v1/getVerticalRoom?limit=20&offset="
    offset = 0

    start_urls = [url + str(offset)]

    def parse(self, response):
        # Parse the JSON body into Python objects; the 'data' field is a list
        data = json.loads(response.text)['data']
        for each in data:
            item = DouyuspiderItem()
            item['room_name'] = each['room_name']
            item['imagelink'] = each['vertical_src']

            yield item

        # Request the next page only while there is one left; the last offset is 100
        if self.offset < 100:
            self.offset += 20
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
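The paging and JSON handling done in parse can be tried without a network connection. The following is a minimal sketch using only the standard library. The helper names page_urls and extract_rooms and the sample payload are hypothetical, made up for illustration; the real API response may contain more fields.

```python
import json

# Same base URL as in the spider
BASE_URL = "http://capi.douyucdn.cn/api/v1/getVerticalRoom?limit=20&offset="


def page_urls(base_url, step=20, max_offset=100):
    """Return the URLs the spider visits: offset 0, 20, ..., max_offset."""
    return [base_url + str(offset) for offset in range(0, max_offset + 1, step)]


def extract_rooms(body_text):
    """Pick room_name and vertical_src out of a JSON body shaped like the API's."""
    data = json.loads(body_text)["data"]
    return [
        {"room_name": each["room_name"], "imagelink": each["vertical_src"]}
        for each in data
    ]


# Hypothetical payload containing only the two fields the spider reads
sample = json.dumps({
    "data": [
        {"room_name": "room-a", "vertical_src": "http://example.com/a.jpg"},
        {"room_name": "room-b", "vertical_src": "http://example.com/b.jpg"},
    ]
})

urls = page_urls(BASE_URL)
rooms = extract_rooms(sample)
print(len(urls), urls[-1])
print(rooms[0])
```

With limit=20 and offsets from 0 to 100, the spider requests six pages in total.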

 

4. Store the content (pipelines.py)

Change the following settings in settings.py:

ROBOTSTXT_OBEY = False

DEFAULT_REQUEST_HEADERS = {
    "User-Agent" : "DYZB/1 CFNetwork/808.2.16 Darwin/16.3.0"
}

ITEM_PIPELINES = {
   'douyuSpider.pipelines.ImagesPipeline': 300,
}

IMAGES_STORE=r"E:\python_practice_ku\pachong\douyuSpider\douyuSpider\images"
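A hard-coded absolute path like the one above only works on one machine. As a possible alternative, settings.py could derive the directory from its own location. This is a sketch and not part of the original project; the helper name images_dir is hypothetical.

```python
import os


def images_dir(settings_file):
    """Return an 'images' directory located next to the given settings file."""
    project_dir = os.path.dirname(os.path.abspath(settings_file))
    return os.path.join(project_dir, "images")


# In settings.py this could be written as:
# IMAGES_STORE = images_dir(__file__)
example = images_dir(os.path.join("douyuSpider", "settings.py"))
print(example)
```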

Write pipelines.py (for image downloading, see https://blog.csdn.net/kuangshp128/article/details/80321099):

import os

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.project import get_project_settings


# Pipeline for downloading images; the overridden method names and parameters are fixed by Scrapy.
# The class reuses the name ImagesPipeline so that the ITEM_PIPELINES entry in settings.py resolves to it.
class ImagesPipeline(ImagesPipeline):
    # Read the value configured in settings.py
    IMAGES_STORE = get_project_settings().get('IMAGES_STORE')

    def get_media_requests(self, item, info):
        image_url = item['imagelink']    # get the image URL
        yield scrapy.Request(image_url)  # issue a second request for the image itself

    def item_completed(self, result, item, info):
        # result is a list, e.g.: [(True, {'url': 'https://p0.ssl.qhimgs1.com/t01a098025e4214bacc.jpg', 'path': 't01a098025e4214bacc.jpg', 'checksum': '7adb29c836cde7a422c740aac3f86234'})]
        image_path = [x["path"] for ok, x in result if ok]
        if not image_path:
            # The download failed, so there is no file to rename
            raise DropItem("image download failed: %s" % item['imagelink'])
        # Rename the image after the room name
        new_path = os.path.join(self.IMAGES_STORE, item['room_name'] + '.jpg')
        os.rename(os.path.join(self.IMAGES_STORE, image_path[0]), new_path)
        item['imagepath'] = new_path
        return item
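Room names come from the API and may contain characters that are not valid in file names, and a failed download leaves the results list without any path. The sketch below isolates the rename step so that it can be run on its own in a temporary directory. safe_filename and rename_image are hypothetical helpers, and the set of replaced characters is an assumption, not something Scrapy prescribes.

```python
import os
import re
import tempfile


def safe_filename(name):
    """Replace characters that are commonly invalid in file names with '_'."""
    return re.sub(r'[\\/:*?"<>|]', "_", name).strip() or "unnamed"


def rename_image(store, results, room_name):
    """Rename the first successfully downloaded file to <room_name>.jpg.

    `results` has the shape passed to item_completed: a list of
    (success, info) tuples. Returns the new path, or None if nothing
    was downloaded.
    """
    paths = [info["path"] for ok, info in results if ok]
    if not paths:
        return None
    new_path = os.path.join(store, safe_filename(room_name) + ".jpg")
    # os.replace overwrites an existing target on every platform
    os.replace(os.path.join(store, paths[0]), new_path)
    return new_path


# Simulate a finished download inside a temporary directory
store = tempfile.mkdtemp()
with open(os.path.join(store, "abc123.jpg"), "wb") as f:
    f.write(b"fake image bytes")

results = [(True, {"url": "http://example.com/a.jpg", "path": "abc123.jpg"})]
new_path = rename_image(store, results, "room/a")
print(os.path.basename(new_path))
```

Using os.replace rather than os.rename is a design choice here: two rooms with the same name would otherwise raise an error on Windows when the target file already exists.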

Run the spider from the command line:

scrapy crawl douyu

 

 

 

posted on 2020-03-10 23:41 by cherry_ning
