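The five core components of the Scrapy framework

The five core components of a crawler

 What each component does:
        Engine (Scrapy Engine)
            Controls the data flow between all the other components and triggers events as work moves through the system (the core of the framework).
        Scheduler
            Accepts requests sent by the engine, pushes them onto a queue, and hands them back when the engine asks for the next one. Think of it as a priority queue of URLs (the addresses of the pages to crawl): it decides which URL is crawled next and, at the same time, drops duplicate URLs.
        Downloader
            Downloads page content and returns it, through the engine, to the spider (the Scrapy downloader is built on Twisted, an efficient asynchronous networking framework).
        Spiders
            The spiders do the main work: they extract the information you need from specific pages, the so-called items (Item). They can also extract links so that Scrapy goes on to crawl the next page.
        Item Pipeline
            Processes the items that spiders extract from pages. Its main jobs are persisting items, validating them, and cleaning out unwanted data. After a page has been parsed by a spider, its items are sent to the item pipeline and pass through several processing steps in a defined order.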
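
To make the data flow between the five components concrete, here is a toy model in plain Python. It is only a sketch and not how Scrapy is implemented: `ToyScheduler`, `toy_engine`, and the `FAKE_SITE` data are names invented for this illustration, the queue is FIFO rather than a priority queue, and everything runs synchronously instead of on Twisted.

```python
from collections import deque


class ToyScheduler:
    """Hypothetical stand-in for the scheduler: a FIFO queue that drops duplicates."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def enqueue(self, url):
        if url in self._seen:      # duplicate URL: drop it
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def next_url(self):
        return self._queue.popleft() if self._queue else None


def toy_engine(start_urls, download, parse, pipeline):
    """Hypothetical engine loop: moves data between the other four roles."""
    scheduler = ToyScheduler()
    for url in start_urls:
        scheduler.enqueue(url)
    stored = []
    while (url := scheduler.next_url()) is not None:
        page = download(url)                 # downloader fetches the page
        for result in parse(url, page):      # spider extracts items and links
            if isinstance(result, str):      # a link goes back to the scheduler
                scheduler.enqueue(result)
            else:                            # an item goes to the pipeline
                stored.append(pipeline(result))
    return stored


# Made-up "site": each page maps to the links found on it.
FAKE_SITE = {"/a": ["/b", "/a"], "/b": ["/a"]}


def download(url):
    return FAKE_SITE.get(url, [])


def parse(url, links):
    yield {"url": url, "links": len(links)}  # the item
    yield from links                         # follow-up URLs


items = toy_engine(["/a"], download, parse, pipeline=lambda item: item)
print(items)
```

Each page is visited once even though the two pages link to each other, because the scheduler remembers which URLs it has already queued.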

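Deep crawling with request parameter passing

Deep crawling implemented through request parameter passing
- Deep crawling: the data to be scraped is not all on one page (list-page data + detail-page data)
- In Scrapy, without passing data along with the request, fields parsed in different callbacks cannot be combined into one item, so the complete record cannot be persisted
- How to do it:
    - scrapy.Request(url, callback, meta)
        - meta is a dictionary; it is passed along to the callback
    - Reading meta in the callback:
        - response.meta

Code implementation: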

# -*- coding: utf-8 -*-
import scrapy

from movie.items import MovieItem
class MovieTestSpider(scrapy.Spider):
    name = 'movie_test'
    # allowed_domains = ['www.xxxxx.com']
    start_urls = ['https://www.4567kan.com/index.php/vod/show/id/5.html']

    url ="https://www.4567kan.com/index.php/vod/show/id/5/page/{}.html"
    page = 2
    def parse(self, response):
        li_list = response.xpath("/html/body/div[1]/div/div/div/div[2]/ul/li")
        for li in li_list:
            title = li.xpath("./div/a/@title").extract_first()
            src = "https://www.4567kan.com"+li.xpath("./div/a/@href").extract_first()

            item = MovieItem()
            item["title"] = title


            yield scrapy.Request(url = src,callback=self.parse_movie,meta={"item":item})

        if self.page<5:
            new_url = self.url.format(self.page)
            self.page+=1
            yield scrapy.Request(url = new_url,callback=self.parse)




    def parse_movie(self,response):
        print(response)
        item = response.meta["item"]
        desc = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[5]/span[2]/text()').extract_first()
        item["desc"] = desc
        yield item

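Middleware

Purpose: intercept requests and responses in bulk
Spider middleware (not covered yet)
Downloader middleware (recommended)
- Intercepting requests:
  - Rewrite the request URL
  - Disguise the request headers
    - UA
    - cookie
  - Set a proxy for the request (important)
- Intercepting responses
  - Modify the response data
- Proxies are normally set in a downloader middleware
  - process_exception:
    - request.meta["proxy"] = "http://ip:port"

middlewares.py code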

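class MiddleproDownloaderMiddleware(object):
    # Intercepts every request (normal ones and ones that later fail)
    def process_request(self, request, spider):
        print("process_request()")
        request.headers["User-Agent"] = "xxxx"
        # Usually unnecessary: with the cookie setting in settings.py, Scrapy sends the cookie on every request
        request.headers["Cookie"] = "xxxxx"
        return None  # None lets the request continue on to the downloader

    # Intercepts every response
    # Parameters: response is the intercepted response, request is the request that produced it
    def process_response(self, request, response, spider):
        print("process_response()")
        return response

    # Intercepts requests that raised an exception
    # Parameter: request is the request that failed
    # Purpose: repair the failed request so that it becomes a normal one, then send it again
    def process_exception(self, request, exception, spider):
        # A request fails when, for example, its IP has been banned
        # This meta is the same Request.meta used for request parameter passing
        request.meta["proxy"] = "http://ip:port"  # set a proxy
        print("process_exception()")
        return request  # resend the repaired request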
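
As a variation on the idea above, the sketch below rotates the User-Agent and picks a proxy from a pool when a request fails. It is an illustration under assumptions, not the original author's code: `RotatingProxyMiddleware`, the proxy addresses, the User-Agent strings, and the `proxy_retries` meta key are all hypothetical. It also sets `dont_filter = True` before returning the request, on the understanding that a rescheduled request is otherwise subject to the scheduler's duplicate filter. The class needs no Scrapy import, so a `SimpleNamespace` stands in for a request in the demo.

```python
import random
from types import SimpleNamespace

# Hypothetical values for illustration; replace with real proxies and UAs.
PROXY_POOL = ["http://192.0.2.10:8080", "http://192.0.2.11:8080"]
USER_AGENTS = ["demo-agent/1.0", "demo-agent/2.0"]


class RotatingProxyMiddleware:
    """Sketch of a downloader middleware with UA rotation and proxy retry."""
    max_proxy_retries = 3  # assumed limit so a dead URL is not retried forever

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # continue to the downloader

    def process_exception(self, request, exception, spider):
        retries = request.meta.get("proxy_retries", 0)
        if retries >= self.max_proxy_retries:
            return None  # give up and let other handlers deal with the error
        request.meta["proxy_retries"] = retries + 1
        request.meta["proxy"] = random.choice(PROXY_POOL)
        request.dont_filter = True  # do not let the duplicate filter drop the retry
        return request  # returning a request asks Scrapy to schedule it again


# Offline demo with a stand-in object instead of a real scrapy.Request.
mw = RotatingProxyMiddleware()
fake_request = SimpleNamespace(headers={}, meta={}, dont_filter=False)
mw.process_request(fake_request, spider=None)
retried = mw.process_exception(fake_request, TimeoutError("blocked"), spider=None)
print(fake_request.headers, fake_request.meta)
```

In a real project the middleware only runs once it is listed in `DOWNLOADER_MIDDLEWARES` in settings.py.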

In settings

Turn on the cookie setting there and every request will carry the cookie; we do not need to add it ourselves!

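Crawling and downloading images

Large-file download: the file data is requested inside the pipeline

The pipeline class below is already provided by Scrapy, so we can use it directly:
from scrapy.pipelines.images import ImagesPipeline  # provides the download functionality

Override three methods of this pipeline class:
- get_media_requests
  - Sends a request to the image URL
- file_path
  - Only needs to return the image file name
- item_completed
  - Returns the item, handing it to the next pipeline class to be executed
- Add to the settings file:
  - IMAGES_STORE = 'dirName'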

img.py

import scrapy

from imgPro.items import ImgproItem
class ImgSpider(scrapy.Spider):
    name = 'img'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.521609.com/daxuexiaohua/']

    def parse(self, response):
        # Parse the image URL and the image name
        li_list = response.xpath('//*[@id="content"]/div[2]/div[2]/ul/li')
        for li in li_list:
            img_src = 'http://www.521609.com'+li.xpath('./a[1]/img/@src').extract_first()
            img_name = li.xpath('./a[1]/img/@alt').extract_first()+'.jpg'

            item = ImgproItem()
            item['name'] = img_name
            item['src'] = img_src

            yield item

items.py

import scrapy


class ImgproItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    src = scrapy.Field()

pipelines.py

# The default pipeline cannot request the image data for us, so we do not use it
# class ImgproPipeline(object):
#     def process_item(self, item, spider):
#         return item

# The pipeline takes the image URL and name from the item, requests the image data inside the pipeline, and persists it
from scrapy.pipelines.images import ImagesPipeline  # provides the download functionality
import scrapy
class ImgsPipiLine(ImagesPipeline):
    # Send a request for the image URL
    def get_media_requests(self, item, info):
        # print(item)
        yield scrapy.Request(url=item['src'],meta={'item':item})
    # Only needs to return the image file name
    # (the keyword-only item argument is passed by newer Scrapy versions)
    def file_path(self, request, response=None, info=None, *, item=None):
        # Read the item back from the request's meta
        item = request.meta['item']
        filePath = item['name']
        return filePath  # the file name, stored under IMAGES_STORE
    # Pass the item on to the next pipeline class to be executed
    def item_completed(self, results, item, info):
        return item
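
Since `file_path` returns `item['name']` unchanged, an `alt` text that is missing or contains characters such as `/` or `:` could produce a broken path. A small helper like the one below could build the name instead. This is a sketch under assumptions: `safe_image_name` is a hypothetical function that is not part of Scrapy, and the hash fallback is just one possible choice.

```python
import hashlib
import re


def safe_image_name(alt_text, src, ext=".jpg"):
    """Hypothetical helper: turn an alt text into a file name safe for disk.

    Falls back to a SHA-1 hash of the image URL when the alt text is
    missing or contains nothing usable.
    """
    # Replace path separators, reserved characters, and whitespace runs.
    stem = re.sub(r'[\\/:*?"<>|\s]+', "_", alt_text or "").strip("_.")
    if not stem:
        stem = hashlib.sha1(src.encode("utf-8")).hexdigest()
    return stem + ext


print(safe_image_name("campus/photo: 01", "http://example.com/a.png"))
print(safe_image_name(None, "http://example.com/b.png"))
```

In the spider, `item['name'] = safe_image_name(alt, img_src)` would then replace the direct string concatenation, which raises a `TypeError` when `extract_first()` returns `None`.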

settings.py

ITEM_PIPELINES = {
   'imgPro.pipelines.ImgsPipiLine': 300,
}
IMAGES_STORE = './imgLibs'

posted @ 2020-04-13 23:05 zz洲神在此
