scrapy

Scrapy installation and basic usage

  Scrapy is a comprehensive, batteries-included crawling framework. It depends on Twisted and achieves concurrent crawling internally through an event-loop mechanism.

  Download and install:

 - Win:
    Download the Twisted wheel: http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
                
    pip3 install wheel   
    pip3 install Twisted-18.4.0-cp36-cp36m-win_amd64.whl   # if the 64-bit wheel won't install, try the 32-bit one
                
    pip3 install pywin32
                
    pip3 install scrapy 

 - Linux:
   pip3 install scrapy

  

    What is Twisted, and how does it differ from requests?
    requests is a Python module that sends HTTP requests and can impersonate a browser while doing so.
        - wraps sockets to send requests

    Twisted is an asynchronous, non-blocking networking framework built on an event loop.
        - wraps sockets to send requests
        - handles concurrent requests in a single thread
        PS: three related terms
            - non-blocking: don't wait
            - asynchronous: callbacks
            - event loop: keep looping and checking state
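    A minimal sketch of that event-loop style using Twisted itself (assumes Twisted 18.x, where the now-deprecated getPage helper is still available; the URLs are placeholders):

from twisted.web.client import getPage   # deprecated helper, still present in Twisted 18.x
from twisted.internet import reactor, defer

def on_page(body, url):
    # callback: fires when one download has finished
    print(url, len(body))

def all_done(results):
    # fires once every Deferred in the list has a result; stop the event loop
    reactor.stop()

urls = ['http://www.baidu.com', 'http://www.cnblogs.com']   # placeholder URLs
deferreds = []
for url in urls:
    d = getPage(url.encode('utf-8'))   # non-blocking: returns a Deferred immediately
    d.addCallback(on_page, url)
    deferreds.append(d)

defer.DeferredList(deferreds).addBoth(all_done)
reactor.run()   # the event loop keeps checking whether any download has completed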

  

    Components and execution flow?
    - The engine finds the spider to run, calls its start_requests method, and gets back an iterator.
    - Iterating over it produces Request objects; each Request wraps the URL to visit and a callback function.
    - All the Request objects (tasks) are put into the scheduler, where they wait to be downloaded by the downloader.
    - The downloader takes download tasks (Request objects) from the scheduler; when a download finishes, its callback is executed.
    - Back in the spider's callback you can:
        yield Request()
        yield Item()


  基础命令

    # create a project
    scrapy  startproject xdb
    
    cd xdb
    
    # create spiders
    scrapy genspider chouti chouti.com
    scrapy genspider cnblogs cnblogs.com
    
    # run a spider
    scrapy crawl chouti
    scrapy crawl chouti --nolog 

  HTML parsing: XPath

	- response.text 
	- response.encoding
	- response.body 
	- response.request
	# response.xpath('//div[@href="x1"]/a').extract_first()
	# response.xpath('//div[@href="x1"]/a').extract()
	# response.xpath('//div[@href="x1"]/a/text()').extract()
	# response.xpath('//div[@href="x1"]/a/@href').extract()

   Issuing further requests: yield a Request object

import scrapy
from scrapy.http import Request


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['http://chouti.com/']

    def parse(self, response):
        # print(response, type(response))  # Response object
        # print(response.text)
        """
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(response.text,'html.parser')
        content_list = soup.find('div',attrs={'id':'content-list'})
        """
        # search the descendants for the div with id=content-list
        f = open('news.log', mode='a+')
        item_list = response.xpath('//div[@id="content-list"]/div[@class="item"]')
        for item in item_list:
            text = item.xpath('.//a/text()').extract_first()
            href = item.xpath('.//a/@href').extract_first()
            print(href,text.strip())
            f.write(href+'\n')
        f.close()

        page_list = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        for page in page_list:
            page = "https://dig.chouti.com" + page
            yield Request(url=page, callback=self.parse)  # e.g. https://dig.chouti.com/all/hot/recent/2

  Note: if you run into encoding errors while crawling, try adding the following lines

import sys, io
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')

  If the spider never runs parse, change this setting in the config file

ROBOTSTXT_OBEY = False

  

  The flow above has two drawbacks: 1. every time a request is handled, a file/connection is opened and closed again; 2. responsibilities are mixed: the same code both parses and stores.

  To solve these two problems, Scrapy provides a persistence mechanism

Persistence: pipelines/items

  1. Define a pipeline class; this is where your storage logic lives

class XXXPipeline(object):
    def process_item(self, item, spider):
         return item

  2. Define an Item class; this declares the data fields you want to receive

class XdbItem(scrapy.Item):
     href = scrapy.Field()
     title = scrapy.Field()

  3. Configure it in settings

ITEM_PIPELINES = {
    'xdb.pipelines.XdbPipeline': 300,
}

  Every time the spider yields an Item object, process_item is called once
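  For example, a sketch of the chouti spider yielding items (assuming the XdbItem class from step 2 lives in xdb/items.py):

import scrapy
from xdb.items import XdbItem   # assumes the Item class from step 2 is defined in xdb/items.py

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    start_urls = ['https://dig.chouti.com/']

    def parse(self, response):
        item_list = response.xpath('//div[@id="content-list"]/div[@class="item"]')
        for item in item_list:
            text = item.xpath('.//a/text()').extract_first()
            href = item.xpath('.//a/@href').extract_first()
            # each yielded Item passes through every enabled pipeline's process_item once
            yield XdbItem(title=text, href=href)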

        Writing a pipeline:

	'''
	How Scrapy drives a pipeline (from the source):
	1. check whether the XdbPipeline class defines from_crawler
		yes:
			obj = XdbPipeline.from_crawler(....)
		no:
			obj = XdbPipeline()
	2. obj.open_spider()

	3. obj.process_item() is called once per yielded item, over and over

	4. obj.close_spider()
	'''
from scrapy.exceptions import DropItem

class FilePipeline(object):

	def __init__(self,path):
		self.f = None
		self.path = path

	@classmethod
	def from_crawler(cls, crawler):
		"""
		Called once at startup to create the pipeline object
		:param crawler:
		:return:
		"""
		print('File.from_crawler')
		path = crawler.settings.get('HREF_FILE_PATH')
		return cls(path)

	def open_spider(self,spider):
		"""
		Called when the spider starts running
		:param spider:
		:return:
		"""
		print('File.open_spider')
		self.f = open(self.path,'a+')

	def process_item(self, item, spider):
		# f = open('xx.log','a+')
		# f.write(item['href']+'\n')
		# f.close()
		print('File',item['href'])
		self.f.write(item['href']+'\n')
		
		# return item  	# hand the item to the next pipeline's process_item
		raise DropItem()  # stop: later pipelines' process_item will not run

	def close_spider(self,spider):
		"""
		Called when the spider is closed
		:param spider:
		:return:
		"""
		print('File.close_spider')
		self.f.close()

   Note: pipelines are shared by all spiders; to customize behaviour for a particular spider, branch on the spider argument yourself (see the sketch below).
        In this persistence pipeline, from_crawler reads the output path from settings, open_spider opens the file/connection, close_spider closes it,
        and process_item performs the actual write; return item hands the item to the next pipeline's process_item, whereas
        raise DropItem() stops any later pipeline's process_item from running.
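   A sketch of per-spider handling via the spider argument (a trimmed-down variant of the FilePipeline above; the branching condition is just an illustration):

class FilePipeline(object):
    def open_spider(self, spider):
        self.f = open('xdb.log', 'a+')

    def process_item(self, item, spider):
        # the same pipeline instance serves every spider, so branch on the spider here
        if spider.name == 'chouti':
            self.f.write(item['href'] + '\n')
        return item

    def close_spider(self, spider):
        self.f.close()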

 

Deduplication rules

        Write the filter class

from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint

class XdbDupeFilter(BaseDupeFilter):

	def __init__(self):
		self.visited_fd = set()

	@classmethod
	def from_settings(cls, settings):
		return cls()

	def request_seen(self, request):
		fd = request_fingerprint(request=request)
		if fd in self.visited_fd:
			return True
		self.visited_fd.add(fd)

	def open(self):  # can return deferred
		print('dupefilter opened')

	def close(self, reason):  # can return a deferred
		print('dupefilter closed')

	# def log(self, request, spider):  # log that a request has been filtered
	#     print('log')

  Configuration

        # override the default dedup filter
        # DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
        DUPEFILTER_CLASS = 'xdb.dupefilters.XdbDupeFilter'

   Using it in a spider

import scrapy
from scrapy.http import Request


class ChoutiSpider(scrapy.Spider):
	name = 'chouti'
	allowed_domains = ['chouti.com']
	start_urls = ['https://dig.chouti.com/']

	def parse(self, response):
		print(response.request.url)
		# item_list = response.xpath('//div[@id="content-list"]/div[@class="item"]')
		# for item in item_list:
		#     text = item.xpath('.//a/text()').extract_first()
		#     href = item.xpath('.//a/@href').extract_first()

		page_list = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
		for page in page_list:
			page = "https://dig.chouti.com" + page
			# yield Request(url=page, callback=self.parse, dont_filter=False)  # dedup applied (default)
			yield Request(url=page, callback=self.parse, dont_filter=True)  # dedup bypassed, e.g. https://dig.chouti.com/all/hot/recent/2

         Note:
            - implement the check correctly in request_seen
            - dont_filter=False

            To deduplicate, define a custom dupefilter class and do the check in request_seen;
            alternatively, leave dont_filter=False when yielding the Request (False is already the default).

Depth and priority

        - depth
            - starts at 0
            - each yielded Request gets the parent request's depth + 1
            setting: DEPTH_LIMIT caps the crawl depth
        - priority
            - a request's download priority is adjusted by: priority -= depth * DEPTH_PRIORITY
            setting: DEPTH_PRIORITY
            (a settings sketch follows below)

   Getting the current depth: response.meta.get("depth", 0)
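   A settings sketch (the values are examples only):

# settings.py
DEPTH_LIMIT = 3       # do not follow links deeper than 3 levels (0 = unlimited)
DEPTH_PRIORITY = 1    # positive: shallower requests are downloaded first (breadth-first tendency)
                      # negative: deeper requests are downloaded first (depth-first tendency)

# inside any spider callback
def parse(self, response):
    print('current depth:', response.meta.get('depth', 0))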

 

Cookie handling

  Option 1: extract and carry cookies manually

    cookie_dict = {}
    def parse(self, response):

        # extract-and-carry approach
        # pull the cookies out of the response headers; they are kept in the cookie_jar object
        from scrapy.http.cookies import CookieJar
        from urllib.parse import urlencode
        cookie_jar = CookieJar()
        cookie_jar.extract_cookies(response, response.request)
        # unpack the cookies from the jar into a plain dict
        for k, v in cookie_jar._cookies.items():
            for i, j in v.items():
                for m, n in j.items():
                    self.cookie_dict[m] = n.value

        yield Request(
            url="https://dig.chouti.com/login",
            method="POST",
            # body can be built by hand or with urlencode
            body="phone=8613121758648&password=woshiniba&oneMonth=1",
            cookies=self.cookie_dict,
            headers={
                "Content-Type":'application/x-www-form-urlencoded; charset=UTF-8'
            },
            callback=self.check_login
        )

    def check_login(self, response):
        print(response.text)
        yield Request(
            url="https://dig.chouti.com/all/hot/recent/1",
            cookies=self.cookie_dict,
            callback=self.index
        )

    def index(self, response):
        news_list = response.xpath("//div[@id='content-list']/div[@class='item']")
        for new in news_list:
            link_id = new.xpath(".//div[@class='part2']/@share-linkid").extract_first()
            yield Request(
                url="http://dig.chouti.com/link/vote?linksId=%s"%(link_id, ),
                method="POST",
                cookies=self.cookie_dict,
                callback=self.check_result
            )

        page_list = response.xpath("//div[@id='dig_lcpage']//a/@href").extract()
        for page in page_list:
            page = "https://dig.chouti.com" + page
            yield Request(url=page, callback=self.index)

    def check_result(self, response):
        print(response.text)

  Option 2: meta

meta={'cookiejar': True}
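  With this meta key, Scrapy's built-in CookiesMiddleware tracks the session for you (COOKIES_ENABLED must remain True, the default). A sketch that reuses the login request from option 1:

    def start_requests(self):
        for url in self.start_urls:
            # tag the request with a cookiejar key; CookiesMiddleware stores cookies under it
            yield Request(url=url, callback=self.login, meta={'cookiejar': True})

    def login(self, response):
        yield Request(
            url="https://dig.chouti.com/login",
            method="POST",
            body="phone=8613121758648&password=woshiniba&oneMonth=1",
            headers={"Content-Type": 'application/x-www-form-urlencoded; charset=UTF-8'},
            # reuse the same cookiejar so the session cookies are sent automatically
            meta={'cookiejar': response.meta['cookiejar']},
            callback=self.check_login,
        )

    def check_login(self, response):
        print(response.text)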

  

start_urls

        The Scrapy engine takes the result returned by start_requests (a list/generator of Requests), wraps it in an iterator,
        pulls Request objects out of it via __next__, and places them into the scheduler for the downloader to pick up

 

         - Customization: start_requests can pull its start URLs from Redis (see the sketch after the code below), or set a proxy via os.environ

        - How it works internally:
        """
        The Scrapy engine asks the spider for its start URLs:
            1. call start_requests and take the return value
            2. v = iter(return value)
            3.
                req1 = v.__next__()
                req2 = v.__next__()
                req3 = v.__next__()
                ...
            4. all of the requests are put into the scheduler

        """

        - Example

import scrapy
from scrapy.http import Request


class ChoutiSpider(scrapy.Spider):
	name = 'chouti'
	allowed_domains = ['chouti.com']
	start_urls = ['https://dig.chouti.com/']
	cookie_dict = {}
	
	def start_requests(self):
		# option 1: yield the requests one by one
		for url in self.start_urls:
			yield Request(url=url)
		# option 2: return a list of Request objects
		# req_list = []
		# for url in self.start_urls:
		#     req_list.append(Request(url=url))
		# return req_list
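        A sketch of the Redis customization mentioned above (assumes a local Redis instance, the third-party redis-py package, and a hypothetical list key named start_urls:chouti):

import redis            # third-party redis-py package
import scrapy
from scrapy.http import Request

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']

    def start_requests(self):
        conn = redis.Redis(host='127.0.0.1', port=6379)   # assumed local Redis
        while True:
            url = conn.lpop('start_urls:chouti')          # hypothetical key name
            if not url:
                break
            yield Request(url=url.decode('utf-8'), callback=self.parse)

    def parse(self, response):
        print(response.request.url)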

        

Proxies

        Question: how do you add proxies in Scrapy?
            - environment variables: in start_requests, set the proxy in os.environ before the spider starts sending requests
            - meta: set the proxy in the meta attribute when yielding a Request
            - custom downloader middleware: add the proxy in process_request; this also makes random proxy rotation possible

  Built-in proxy support: simply set the proxy in os.environ when the spider starts.

import scrapy
from scrapy.http import Request


class ChoutiSpider(scrapy.Spider):
	name = 'chouti'
	allowed_domains = ['chouti.com']
	start_urls = ['https://dig.chouti.com/']
	cookie_dict = {}

	def start_requests(self):
		import os
		os.environ['HTTPS_PROXY'] = "http://root:woshiniba@192.168.11.11:9999/"
		os.environ['HTTP_PROXY'] = 'http://19.11.2.32'
		for url in self.start_urls:
			yield Request(url=url,callback=self.parse)

  Proxy via meta: set the meta attribute when yielding the Request

import scrapy
from scrapy.http import Request


class ChoutiSpider(scrapy.Spider):
	name = 'chouti'
	allowed_domains = ['chouti.com']
	start_urls = ['https://dig.chouti.com/']
	cookie_dict = {}

	def start_requests(self):
		for url in self.start_urls:
			yield Request(url=url, callback=self.parse, meta={'proxy': 'http://root:woshiniba@192.168.11.11:9999/'})

  Custom downloader middleware: add the proxy in process_request; this is where random proxy rotation can be implemented

import base64
import random
from six.moves.urllib.parse import unquote

try:
    from urllib2 import _parse_proxy
except ImportError:
    from urllib.request import _parse_proxy
from six.moves.urllib.parse import urlunparse
from scrapy.utils.python import to_bytes


class XdbProxyMiddleware(object):

    def _basic_auth_header(self, username, password):
        user_pass = to_bytes(
            '%s:%s' % (unquote(username), unquote(password)),
            encoding='latin-1')
        return base64.b64encode(user_pass).strip()

    def process_request(self, request, spider):
        PROXIES = [
            "http://root:woshiniba@192.168.11.11:9999/",
            "http://root:woshiniba@192.168.11.12:9999/",
            "http://root:woshiniba@192.168.11.13:9999/",
            "http://root:woshiniba@192.168.11.14:9999/",
            "http://root:woshiniba@192.168.11.15:9999/",
            "http://root:woshiniba@192.168.11.16:9999/",
        ]
        url = random.choice(PROXIES)

        orig_type = ""
        proxy_type, user, password, hostport = _parse_proxy(url)
        proxy_url = urlunparse((proxy_type or orig_type, hostport, '', '', '', ''))

        if user:
            creds = self._basic_auth_header(user, password)
        else:
            creds = None
        request.meta['proxy'] = proxy_url
        if creds:
            request.headers['Proxy-Authorization'] = b'Basic ' + creds
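  Don't forget to enable the middleware in settings; the priority value below matches the example configuration shown later in this post:

DOWNLOADER_MIDDLEWARES = {
    'xdb.proxy.XdbProxyMiddleware': 751,
}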

  

Selectors and parsing

html = """<!DOCTYPE html>
<html>
    <head lang="en">
        <meta charset="UTF-8">
        <title></title>
    </head>
    <body>
        <ul>
            <li class="item-"><a id='i1' href="link.html">first item</a></li>
            <li class="item-0"><a id='i2' href="llink.html">first item</a></li>
            <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
        </ul>
        <div><a href="llink2.html">second item</a></div>
    </body>
</html>
"""

from scrapy.http import HtmlResponse
from scrapy.selector import Selector

response = HtmlResponse(url='http://example.com', body=html,encoding='utf-8')


# hxs = Selector(response=response)
# hxs.xpath('//a/@href').extract()
print(response.xpath('//a/@href').extract())

 

Downloader middleware

  Inside process_request you can:

  • return an HtmlResponse object: the download is skipped, but process_response still runs
  • return a Request object: that request is scheduled for download instead
  • raise IgnoreRequest: the current request is dropped and process_exception runs
  • modify the request, e.g. set the User-Agent

  Writing the middleware:

from scrapy.http import HtmlResponse
from scrapy.http import Request

class Md1(object):
	@classmethod
	def from_crawler(cls, crawler):
		# This method is used by Scrapy to create your spiders.
		s = cls()
		return s

	def process_request(self, request, spider):
		# Called for each request that goes through the downloader
		# middleware.

		# Must either:
		# - return None: continue processing this request
		# - or return a Response object
		# - or return a Request object
		# - or raise IgnoreRequest: process_exception() methods of
		#   installed downloader middleware will be called
		print('md1.process_request',request)
		# 1. Return a Response
		# import requests
		# result = requests.get(request.url)
		# return HtmlResponse(url=request.url, status=200, headers=None, body=result.content)
		# 2. Return a Request
		# return Request('https://dig.chouti.com/r/tec/hot/1')

		# 3. Raise an exception
		# from scrapy.exceptions import IgnoreRequest
		# raise IgnoreRequest

		# 4. Modify the request (most common use)
		# request.headers['user-agent'] = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"

		pass

	def process_response(self, request, response, spider):
		# Called with the response returned from the downloader.

		# Must either;
		# - return a Response object
		# - return a Request object
		# - or raise IgnoreRequest
		print('m1.process_response',request,response)
		return response

	def process_exception(self, request, exception, spider):
		# Called when a download handler or a process_request()
		# (from other downloader middleware) raises an exception.

		# Must either:
		# - return None: continue processing this exception
		# - return a Response object: stops process_exception() chain
		# - return a Request object: stops process_exception() chain
		pass

   Configuration

DOWNLOADER_MIDDLEWARES = {
   #'xdb.middlewares.XdbDownloaderMiddleware': 543,
	# 'xdb.proxy.XdbProxyMiddleware':751,
	'xdb.md.Md1':666,
	'xdb.md.Md2':667,
}

  Typical uses:
     - setting the User-Agent (a random User-Agent sketch follows below)
     - proxies
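  For example, a minimal random User-Agent middleware (the UA strings are placeholders; register it in DOWNLOADER_MIDDLEWARES just like Md1/Md2 above):

import random

class RandomUserAgentMiddleware(object):
    USER_AGENTS = [
        # placeholder User-Agent strings
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1.2 Safari/605.1.15",
    ]

    def process_request(self, request, spider):
        # pick a random UA for every outgoing request
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)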

 

 

Spider middleware

  • process_start_requests runs only once, when the spider starts, before the downloader middleware
  • process_spider_input runs after the downloader middleware has finished, just before the spider callback is invoked
  • process_spider_output runs after the callback has produced its results

  Writing one:

class Sd1(object):
	# Not all methods need to be defined. If a method is not defined,
	# scrapy acts as if the spider middleware does not modify the
	# passed objects.

	@classmethod
	def from_crawler(cls, crawler):
		# This method is used by Scrapy to create your spiders.
		s = cls()
		return s

	def process_spider_input(self, response, spider):
		# Called for each response that goes through the spider
		# middleware and into the spider.

		# Should return None or raise an exception.
		return None

	def process_spider_output(self, response, result, spider):
		# Called with the results returned from the Spider, after
		# it has processed the response.

		# Must return an iterable of Request, dict or Item objects.
		for i in result:
			yield i

	def process_spider_exception(self, response, exception, spider):
		# Called when a spider or process_spider_input() method
		# (from other spider middleware) raises an exception.

		# Should return either None or an iterable of Response, dict
		# or Item objects.
		pass

	# Runs only once, when the spider starts.
	def process_start_requests(self, start_requests, spider):
		# Called with the start requests of the spider, and works
		# similarly to the process_spider_output() method, except
		# that it doesn’t have a response associated.

		# Must return only requests (not items).
		for r in start_requests:
			yield r

  Configuration

SPIDER_MIDDLEWARES = {
   # 'xdb.middlewares.XdbSpiderMiddleware': 543,
	'xdb.sd.Sd1': 666,
	'xdb.sd.Sd2': 667,
}

  Typical uses (see the sketch below):
    - depth control
    - priority adjustment
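  A sketch of how a spider middleware can enforce a depth limit, roughly what Scrapy's built-in DepthMiddleware does (MY_DEPTH_LIMIT is a hypothetical setting name; the depth read from response.meta is the value that the built-in DepthMiddleware maintains):

from scrapy.http import Request

class DepthLimitSpiderMiddleware(object):
    def __init__(self, limit):
        self.limit = limit

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getint('MY_DEPTH_LIMIT', 3))  # hypothetical setting

    def process_spider_output(self, response, result, spider):
        depth = response.meta.get('depth', 0)
        for obj in result:
            # drop Requests that would exceed the limit; Items and dicts pass through
            if isinstance(obj, Request) and depth + 1 > self.limit:
                continue
            yield obj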

 

Custom commands

  Running a single spider from a script:

import sys
from scrapy.cmdline import execute

if __name__ == '__main__':
    execute(["scrapy","crawl","chouti","--nolog"])

        - Running all spiders:
            - create a directory (any name, e.g. commands) at the same level as spiders
            - inside it, create a crawlall.py file (the file name becomes the command name); a sketch follows below
            - add COMMANDS_MODULE = '<project name>.<directory name>' to settings.py
            - run the command from the project directory: scrapy crawlall
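        A sketch of crawlall.py (based on the common ScrapyCommand pattern; double-check the attribute names against your Scrapy version):

from scrapy.commands import ScrapyCommand

class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Run all spiders in the project'

    def run(self, args, opts):
        # iterate over every registered spider and schedule it, then start crawling
        spider_list = self.crawler_process.spider_loader.list()
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)
        self.crawler_process.start()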

Signals

  Hooks the framework reserves so you can plug in custom behaviour

from scrapy import signals


class MyExtend(object):
    def __init__(self):
        pass

    @classmethod
    def from_crawler(cls, crawler):
        self = cls()

        crawler.signals.connect(self.x1, signal=signals.spider_opened)
        crawler.signals.connect(self.x2, signal=signals.spider_closed)

        return self

    def x1(self, spider):
        print('open')

    def x2(self, spider):
        print('close')

    Configuration

EXTENSIONS = {
    'xdb.ext.MyExtend':666,
}

  

 
