scrapy

The essence of crawling

The HTTP protocol

At its core, a crawler is just a socket client talking to a socket server over the HTTP protocol.

HTTP: a stateless, short-lived-connection data exchange on top of TCP (data is delimited by \r\n; the request headers and the request body are separated by \r\n\r\n).

When a browser communicates with a server, both the connection phase and the response phase block on IO.

import socket

sk = socket.socket()
# Establish the connection (blocks on IO)
sk.connect(('www.cnblogs.com', 80))

# Request line and headers end with \r\n\r\n; a POST carries its body after that blank line
sk.sendall(b"GET /wupeiqi HTTP/1.1\r\n.....\r\n\r\n")
sk.sendall(b"POST /wupeiqi HTTP/1.1\r\n.....\r\n\r\nuser=alex&pwd=123")

# Receive the response (blocks on IO)
data = sk.recv(8096)

sk.close()

High-performance crawling

For multi-task crawling, the goal is to conserve resources and spend as little time as possible blocked on IO. In order of increasing efficiency: multiple processes -> multiple threads -> single-thread concurrency built on "asynchronous non-blocking IO".
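For reference, the thread-based middle option is the easiest to write; a minimal sketch using concurrent.futures and requests (the URLs are just placeholders):

import requests
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Each call still blocks on IO, but the pool keeps several requests in flight at once
    response = requests.get(url)
    return url, len(response.content)

urls = ['http://www.cnblogs.com', 'http://www.baidu.com']
with ThreadPoolExecutor(max_workers=5) as pool:
    for url, size in pool.map(fetch, urls):
        print(url, size)

Each thread still sits idle while it waits on the network; the single-threaded approach below avoids that per-thread overhead entirely.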

Asynchronous non-blocking IO = IO multiplexing + non-blocking sockets.

IO multiplexing:

 - IO multiplexing ---> watch multiple socket objects and report when any of them changes state


import select

while True:
    # Ask select to check whether sk1/sk2 have "changed"
    r, w, e = select.select([sk1, sk2], [sk1, sk2], [], 0.5)

    # r: sockets with response data ready to read
    #   r == [sk1]      -> sk1 has received its response
    #   r == [sk1, sk2] -> both sk1 and sk2 have received their responses
    # w: sockets whose connection has completed
    #   w == [sk1]      -> sk1 is connected
    #   w == [sk1, sk2] -> both sk1 and sk2 are connected

Non-blocking sockets

sk = socket.socket()
sk.setblocking(False)   # turn off IO blocking
try:
    # connect() now returns immediately; the TCP handshake finishes in the background
    sk.connect(('www.cnblogs.com', 80))
except BlockingIOError as e:
    pass

A hand-rolled asynchronous non-blocking client (event loop + callbacks)

import socket
import select

class Request(object):
    def __init__(self, sk, callback):
        self.sk = sk
        self.callback = callback

    def fileno(self):
        # select() works with any object that exposes fileno()
        return self.sk.fileno()

class AsyncHttp(object):

    def __init__(self):
        self.fds = []
        self.conn = []

    def add(self, url, callback):
        sk = socket.socket()
        # Turn off IO blocking
        sk.setblocking(False)
        try:
            sk.connect((url, 80))
        except BlockingIOError as e:
            pass
        req = Request(sk, callback)
        self.fds.append(req)
        self.conn.append(req)

    def run(self):
        """
        Watch the sockets for changes.
        :return:
        """
        while True:
            """
            fds  = [req(sk, callback), req, req]   # still waiting for a response
            conn = [req, req, req]                 # still waiting to connect
            """
            r, w, e = select.select(self.fds, self.conn, [], 0.05)  # select calls req.fileno(), i.e. sk.fileno()

            # w = sockets whose connection has completed, e.g. w = [sk1, sk2]
            for req in w:
                req.sk.sendall(b'GET /wupeiqi HTTP/1.1\r\nUser-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36\r\n\r\n')
                # Once connected, no need to keep watching for writability
                self.conn.remove(req)

            # r = sockets for which the server has returned data, e.g. r = [sk1]
            for req in r:
                data = req.sk.recv(8096)
                req.callback(data)

                req.sk.close()        # disconnect: short-lived, stateless
                self.fds.remove(req)  # stop watching

            if not self.fds:
                break

ah = AsyncHttp()

def callback1(data):
    print(11111,data)

def callback2(data):
    print(22222,data)

def callback3(data):
    print(333333,data)

ah.add('www.cnblogs.com',callback1) # sk1
ah.add('www.baidu.com',callback2)   # sk2
ah.add('www.luffycity.com',callback3) # sk3

ah.run()

Using third-party modules

import asyncio
import requests


@asyncio.coroutine
def fetch_async(func, *args):
    loop = asyncio.get_event_loop()
    future = loop.run_in_executor(None, func, *args)
    response = yield from future
    print(response.url, response.content)


tasks = [
    fetch_async(requests.get, 'http://www.cnblogs.com/wupeiqi/'),
    fetch_async(requests.get, 'http://dig.chouti.com/pic/show?nid=4073644713430508&lid=10273091')
]

loop = asyncio.get_event_loop()
results = loop.run_until_complete(asyncio.gather(*tasks))
loop.close()
asyncio + requests
import gevent

from gevent import monkey

monkey.patch_all()
import requests



def fetch_async(method, url, req_kwargs):
    print(method, url, req_kwargs)
    response = requests.request(method=method, url=url, **req_kwargs)
    print(response.url, response.content)

# ##### send requests #####
gevent.joinall([
    gevent.spawn(fetch_async, method='get', url='https://www.python.org/', req_kwargs={}),
    gevent.spawn(fetch_async, method='get', url='https://www.yahoo.com/', req_kwargs={}),
    gevent.spawn(fetch_async, method='get', url='https://github.com/', req_kwargs={}),
])

# ##### send requests (a pool caps the number of greenlets) #####
# from gevent.pool import Pool
# pool = Pool(None)
# gevent.joinall([
#     pool.spawn(fetch_async, method='get', url='https://www.python.org/', req_kwargs={}),
#     pool.spawn(fetch_async, method='get', url='https://www.yahoo.com/', req_kwargs={}),
#     pool.spawn(fetch_async, method='get', url='https://www.github.com/', req_kwargs={}),
# ])
gevent + requests
from twisted.web.client import getPage, defer
from twisted.internet import reactor


def all_done(arg):
    reactor.stop()


def callback(contents):
    print(contents)


deferred_list = []

url_list = ['http://www.bing.com', 'http://www.baidu.com', ]
for url in url_list:
    deferred = getPage(bytes(url, encoding='utf8'))
    deferred.addCallback(callback)
    deferred_list.append(deferred)

dlist = defer.DeferredList(deferred_list)
dlist.addBoth(all_done)

reactor.run()
Twisted

Core principles (Q&A)

1. What is an HTTP request, essentially?

  A stateless, short-lived data exchange on top of TCP (data delimited by \r\n; request headers and body separated by \r\n\r\n).

2. Asynchronous, non-blocking

  - Non-blocking: the program does not wait when it hits IO
  - Code:
      sk = socket.socket()
      sk.setblocking(False)   # connect() now raises an exception immediately; catch it
  - Asynchronous:
    - Implemented with callbacks: when a given state is reached, a specific function is invoked automatically.

3. IO multiplexing

  Watch multiple sockets and detect when any of them changes state.

    - select: loops over the sockets internally to check for changes; capped at 1024 descriptors
    - poll:   loops over the sockets internally to check for changes; no fixed cap
    - epoll:  callback-based notification (see the selectors sketch below)
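The standard-library selectors module wraps whichever of these mechanisms is available (epoll on Linux, kqueue on BSD/macOS, plain select elsewhere); a small sketch, assuming sk1 and sk2 are non-blocking sockets created as above:

import selectors

sel = selectors.DefaultSelector()   # picks epoll / kqueue / select automatically
sel.register(sk1, selectors.EVENT_READ | selectors.EVENT_WRITE)
sel.register(sk2, selectors.EVENT_READ | selectors.EVENT_WRITE)

while True:
    for key, events in sel.select(timeout=0.5):
        if events & selectors.EVENT_WRITE:
            print(key.fileobj, 'connected (writable)')
        if events & selectors.EVENT_READ:
            print(key.fileobj, 'response data ready (readable)')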

4. What is a coroutine?

  A coroutine is a "micro-thread". It does not exist at the OS level; it is a man-made way of controlling execution order: when the program hits IO, switch to another task (a toy sketch follows).

    -- If the program never actually hits IO, the constant switching only adds overhead and hurts performance.

    -- When it does hit IO, switching to other work raises throughput and gives single-threaded concurrency.
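A toy illustration with plain generators: each task yields wherever it would otherwise block, and a small loop switches between them (no real IO here, only the switching):

def task(name, steps):
    for i in range(steps):
        print(name, 'step', i)
        yield              # pretend we just hit IO: hand control back to the loop

def run(tasks):
    # Round-robin: resume each generator in turn until all are exhausted
    while tasks:
        for t in list(tasks):
            try:
                next(t)
            except StopIteration:
                tasks.remove(t)

run([task('a', 3), task('b', 2)])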


5. How would you build a custom asynchronous non-blocking module?
    - Based on an event loop (callback functions)
    - Based on coroutines (generators driven with send())
    Either way the essence is: sockets + IO multiplexing (sketch below).
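A rough sketch of the coroutine flavor: one generator per request, driven with send() on top of select (hostnames are placeholders, error handling omitted):

import socket
import select

def fetch(host, path='/'):
    # One crawl task as a generator: it yields whenever it has to wait on IO
    sk = socket.socket()
    sk.setblocking(False)
    try:
        sk.connect((host, 80))
    except BlockingIOError:
        pass
    yield ('write', sk)    # wait until the connection is writable
    sk.sendall(('GET %s HTTP/1.0\r\nHost: %s\r\n\r\n' % (path, host)).encode())
    yield ('read', sk)     # wait until response data is readable
    print(host, sk.recv(8096)[:50])
    sk.close()

def loop(tasks):
    # Map each waiting socket to the event it waits for and the generator to resume
    waiting = {}
    for g in tasks:
        event, sk = next(g)
        waiting[sk] = (event, g)
    while waiting:
        r_list = [sk for sk, (ev, g) in waiting.items() if ev == 'read']
        w_list = [sk for sk, (ev, g) in waiting.items() if ev == 'write']
        r, w, e = select.select(r_list, w_list, [], 0.05)
        for sk in r + w:
            event, g = waiting.pop(sk)
            try:
                next_event, next_sk = g.send(None)   # resume the coroutine where it yielded
                waiting[next_sk] = (next_event, g)
            except StopIteration:
                pass

loop([fetch('www.cnblogs.com'), fetch('www.baidu.com')])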

scrapy

Scrapy is an application framework for crawling websites and extracting structured data. It can be used in data mining, information processing, storing historical data, and a range of similar programs.

Download and install:

Installing Scrapy:
        a. pip3 install wheel
        b. Download a Twisted wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
        c. cd into the download directory and run: pip3 install Twisted-xxxxx.whl
        d. pip3 install scrapy  -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
        e. pip3 install pywin32  -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

Creating a Scrapy project

scrapy startproject <project_name>
cd <project_name>
scrapy genspider <spider_name> <domain>.com

Running the project
scrapy crawl chouti --nolog

Fixing garbled Chinese output in the console:
Add this at the top of the file:
  import sys, io
  sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')

(Diagram omitted.)

Scrapy's main components:

  • Engine (Scrapy)
    Handles the data flow of the whole system and triggers events (the core of the framework).
  • Scheduler
    Accepts requests from the engine, pushes them onto a queue, and hands them back when the engine asks again. Think of it as a priority queue of URLs: it decides which URL to crawl next and removes duplicates.
  • Downloader
    Downloads page content and returns it to the spiders (the downloader is built on Twisted's efficient asynchronous model).
  • Spiders
    Do the main work: extract the information you need (the Items) from specific pages. They can also extract links for Scrapy to crawl next.
  • Item Pipeline
    Processes the entities the spiders extract: persists them, validates them, and strips out unwanted data. After a page is parsed, its items are sent to the pipeline and processed in a fixed order.
  • Downloader Middlewares
    Sit between the engine and the downloader and process the requests and responses passing between them.
  • Spider Middlewares
    Sit between the engine and the spiders and process the spiders' input (responses) and output (requests and items).
  • Scheduler Middlewares
    Sit between the engine and the scheduler and process the requests and responses passing between them.

The Scrapy run flow, roughly (a minimal spider illustrating it follows the list):

  1. The engine takes a URL from the scheduler for the next crawl.
  2. The engine wraps the URL in a Request and hands it to the downloader.
  3. The downloader fetches the resource and wraps it in a Response.
  4. The spider parses the Response.
  5. If it yields an Item, the item goes to the item pipeline for further processing.
  6. If it yields a link (URL), the URL goes back to the scheduler to wait for crawling.
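A minimal spider sketch that walks this loop (the start URL, the XPath expressions, and the item fields are placeholders, not a real target): the dicts it yields go to the item pipeline (step 5), and the Requests it yields go back to the scheduler (step 6).

import scrapy
from scrapy.http import Request

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://www.cnblogs.com']

    def parse(self, response):
        # Step 4: parse the Response
        for row in response.xpath('//div[@class="post"]'):
            # Step 5: an item -> handed to the item pipeline
            yield {'title': row.xpath('.//a/text()').extract_first()}

        # Step 6: a new Request -> handed back to the scheduler
        next_page = response.xpath('//a[@rel="next"]/@href').extract_first()
        if next_page:
            yield Request(url=response.urljoin(next_page), callback=self.parse)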

start_requests

Crawling several sites in a row; either of the following forms works:

def start_requests(self):
    for url in self.start_urls:
        yield Request(url=url,callback=self.parse)

def start_requests(self):
    req_list = []
    for url in self.start_urls:
        req_list.append(Request(url=url,callback=self.parse))
    return req_list

Internally, Scrapy converts the returned req_list into an iterator.

Parsing responses

Turning the response string into selectable objects:
Option 1:
  response.xpath('//div[@id="content-list"]/div[@class="item"]')    # the expression can be copied from the browser ("Copy XPath")
Option 2:
  hxs = HtmlXPathSelector(response=response)
  items = hxs.xpath("//div[@id='content-list']/div[@class='item']")
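A hedged example of pulling values out of the selected nodes inside parse() (the XPath expressions and dict keys are illustrative):

def parse(self, response):
    items = response.xpath('//div[@id="content-list"]/div[@class="item"]')
    for item in items:
        title = item.xpath('.//a/text()').extract_first()   # first matching text node
        href = item.xpath('.//a/@href').extract_first()     # first matching attribute
        links = item.xpath('.//a/@href').extract()          # every match, as a list of strings
        yield {'title': title, 'href': href, 'links': links}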

pipelines

Pipelines process the items returned by spiders.

settings.py

ITEM_PIPELINES = {
    'pachong.pipelines.FilePipeline': 300,
    'pachong.pipelines.xxxxxx': 400,
}
When several pipelines are registered, lower values run first (higher priority).

pipelines.py

class FilePipeline(object):
    def __init__(self, path):
        self.path = path
        self.f = None

    @classmethod
    def from_crawler(cls, crawler):
        """
        Called at initialization time to create the pipeline object.
        :param crawler:
        :return:
        """
        path = crawler.settings.get('XL_FILE_PATH')
        return cls(path)

    def process_item(self, item, spider):
        self.f.write(item['href'] + '\n')
        return item

    def open_spider(self, spider):
        """
        Called when the spider starts.
        :param spider:
        :return:
        """
        self.f = open(self.path, 'w')

    def close_spider(self, spider):
        """
        Called when the spider is closed.
        :param spider:
        :return:
        """
        self.f.close()

class OtherPipeline(object):
    # a second pipeline stub (the 'pachong.pipelines.xxxxxx' entry registered above)
    pass


- With multiple pipelines, "return item" passes the item on to the next pipeline's process_item.

To discard an item so that later pipelines never see it, raise DropItem:

from scrapy.exceptions import DropItem

class FilePipeline(object):

    def process_item(self, item, spider):
        raise DropItem()


Handling POST requests, request headers, and cookies

Adding cookies manually:

    cookie_dict = {}

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse_index)

    def parse_index(self, response):
        # Raw Set-Cookie headers
        # print(response.headers.getlist('Set-Cookie'))

        # Parsed cookies
        from scrapy.http.cookies import CookieJar
        cookie_jar = CookieJar()
        # Load the cookies from the response into the cookie jar
        cookie_jar.extract_cookies(response, response.request)
        for k, v in cookie_jar._cookies.items():
            for i, j in v.items():
                for m, n in j.items():
                    self.cookie_dict[m] = n.value

        req = Request(
            url='http://dig.chouti.com/login',
            method='POST',
            headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
            body='phone=8613121758648&password=woshiniba&oneMonth=1',
            cookies=self.cookie_dict,
            callback=self.parse_check_login
        )
        yield req

    def parse_check_login(self, response):
        print(response.text)
        yield Request(
            url='https://dig.chouti.com/link/vote?linksId=19440976',
            method='POST',
            cookies=self.cookie_dict,
            callback=self.parse_show_result
        )

    def parse_show_result(self, response):
        print(response.text)

Adding cookies automatically (meta={'cookiejar': True}):

    def start_requests(self):
        for url in self.start_urls:
            # Carry cookies automatically
            yield Request(url=url, callback=self.parse_index, meta={'cookiejar': True})

    def parse_index(self, response):
        req = Request(
            url='http://dig.chouti.com/login',
            method='POST',
            headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
            body='phone=8613121758648&password=woshiniba&oneMonth=1',
            callback=self.parse_check_login,
            meta={'cookiejar': True}
        )
        yield req

    def parse_check_login(self, response):
        # print(response.text)
        yield Request(
            url='https://dig.chouti.com/link/vote?linksId=19440976',
            method='POST',
            callback=self.parse_show_result,
            meta={'cookiejar': True}
        )

    def parse_show_result(self, response):
        print(response.text)

Deduplication

settings.py

DUPEFILTER_CLASS = 'pachong.dupe.MyDupeFilter'

dupe.py (a custom dupe filter)

class MyDupeFilter:
    def __init__(self):
        self.visited_url = set()

    @classmethod
    def from_settings(cls, settings):
        """
        Called at initialization time.
        :param settings:
        :return:
        """
        return cls()

    def request_seen(self, request):
        """
        Check whether the current request has already been visited.
        :param request:
        :return: True if already visited; False if not
        """
        if request.url in self.visited_url:
            return True
        self.visited_url.add(request.url)
        return False

    def open(self):
        """
        Called when crawling starts.
        :return:
        """
        print('open replication')

    def close(self, reason):
        """
        Called when the crawl finishes.
        :param reason:
        :return:
        """
        print('close replication')

    def log(self, request, spider):
        """
        Log duplicate requests.
        :param request:
        :param spider:
        :return:
        """
        print('repeat', request.url)

Middleware

Middleware hooks run before every request goes out and after every response comes back; the flow is similar to Flask's middleware.

Downloader middleware is typically used for cookies, proxies, and the User-Agent.

class DownMiddleware1(object):
    def process_request(self, request, spider):
        """
        Called, through every downloader middleware, for each request that needs to be downloaded.
        :param request:
        :param spider:
        :return:
            None: continue to the next middleware and then download;
            Response object: stop calling process_request and start calling process_response;
            Request object: stop the middleware chain and put the Request back on the scheduler;
            raise IgnoreRequest: stop process_request and start calling process_exception.
        """
        pass

    def process_response(self, request, response, spider):
        """
        Called on the way back, after the download completes.
        :param request:
        :param response:
        :param spider:
        :return:
            Response object: handed on to the other middlewares' process_response;
            Request object: stop the middleware chain; the request is rescheduled for download;
            raise IgnoreRequest: Request.errback is called.
        """
        print('response1')
        return response

    def process_exception(self, request, exception, spider):
        """
        Called when the download handler or a process_request() (downloader middleware) raises an exception.
        :param request:
        :param exception:
        :param spider:
        :return:
            None: pass the exception on to the next middleware;
            Response object: stop calling further process_exception methods;
            Request object: stop the middleware chain; the request is rescheduled for download.
        """
        return None

Spider middleware

class SpiderMiddleware(object):

    def process_spider_input(self, response, spider):
        """
        Called after the download finishes, before the response is handed to parse().
        :param response:
        :param spider:
        :return:
        """
        pass

    def process_spider_output(self, response, result, spider):
        """
        Called with whatever the spider returned.
        :param response:
        :param result:
        :param spider:
        :return: must return an iterable of Request and/or Item objects
        """
        return result

    def process_spider_exception(self, response, exception, spider):
        """
        Called when the spider raises an exception.
        :param response:
        :param exception:
        :param spider:
        :return: None to pass the exception to the next middleware; or an iterable of Response/Item objects, handed to the scheduler or the pipelines
        """
        return None

    def process_start_requests(self, start_requests, spider):
        """
        Called when the spider starts.
        :param start_requests:
        :param spider:
        :return: an iterable of Request objects
        """
        return start_requests

Both kinds of middleware must be enabled in settings.py; a sketch follows.
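For example, assuming the two classes above live in a middlewares.py module of a project named pachong (the module path and the numbers are illustrative; lower numbers sit closer to the engine):

DOWNLOADER_MIDDLEWARES = {
    'pachong.middlewares.DownMiddleware1': 543,
}

SPIDER_MIDDLEWARES = {
    'pachong.middlewares.SpiderMiddleware': 543,
}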

Adding proxies

1. Built-in proxy support (via environment variables)

import os
import scrapy
from scrapy.http import Request

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']

    def start_requests(self):
        os.environ['HTTP_PROXY'] = "http://192.168.11.11"    # os.environ only affects the current process

        for url in self.start_urls:
            yield Request(url=url,callback=self.parse)

    def parse(self, response):
        print(response)

2. A custom downloader middleware for proxies (lets you rotate through several proxies)

import random
import base64
import six

# Convert str to bytes
def to_bytes(text, encoding=None, errors='strict'):
    """Return the binary representation of `text`. If `text`
    is already a bytes object, return it as-is."""
    if isinstance(text, bytes):
        return text
    if not isinstance(text, six.string_types):
        raise TypeError('to_bytes must receive a unicode, str or bytes '
                        'object, got %s' % type(text).__name__)
    if encoding is None:
        encoding = 'utf-8'
    return text.encode(encoding, errors)

class MyProxyDownloaderMiddleware(object):
    def process_request(self, request, spider):
        proxy_list = [
            {'ip_port': '111.11.228.75:80', 'user_pass': 'xxx:123'},
            {'ip_port': '120.198.243.22:80', 'user_pass': ''},
            {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
            {'ip_port': '101.71.27.120:80', 'user_pass': ''},
            {'ip_port': '122.96.59.104:80', 'user_pass': ''},
            {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
        ]
        proxy = random.choice(proxy_list)
        request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
        if proxy['user_pass']:
            # Only add the auth header when the proxy actually requires credentials
            encoded_user_pass = base64.b64encode(to_bytes(proxy['user_pass']))
            request.headers['Proxy-Authorization'] = b'Basic ' + encoded_user_pass

https

 There are two cases when crawling HTTPS sites:
    1. The target site uses a trusted certificate (supported by default)
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "scrapy.core.downloader.contextfactory.ScrapyClientContextFactory"

    2. The target site uses a custom (e.g. self-signed) certificate
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "step8_king.https.MySSLFactory"

        # https.py
        from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory
        from twisted.internet.ssl import (optionsForClientTLS, CertificateOptions, PrivateCertificate)

        class MySSLFactory(ScrapyClientContextFactory):
            def getCertificateOptions(self):
                from OpenSSL import crypto
                v1 = crypto.load_privatekey(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.key.unsecure', mode='r').read())
                v2 = crypto.load_certificate(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.pem', mode='r').read())
                return CertificateOptions(
                    privateKey=v1,   # a PKey object
                    certificate=v2,  # an X509 object
                    verify=False,
                    method=getattr(self, 'method', getattr(self, '_ssl_method', None))
                )
    Related classes:
        scrapy.core.downloader.handlers.http.HttpDownloadHandler
        scrapy.core.downloader.webclient.ScrapyHTTPClientFactory
        scrapy.core.downloader.contextfactory.ScrapyClientContextFactory
    Related settings:
        DOWNLOADER_HTTPCLIENTFACTORY
        DOWNLOADER_CLIENTCONTEXTFACTORY

Signals

Signals are predefined by the framework; hook into them to run extra code before or after the corresponding events.

extends.py

from scrapy import signals

class MyExtension(object):
    def __init__(self):
        pass

    @classmethod
    def from_crawler(cls, crawler):
        obj = cls()
        # When the spider opens, every function bound to the spider_opened signal fires: func1
        crawler.signals.connect(obj.func1, signal=signals.spider_opened)
        # When the spider closes, every function bound to the spider_closed signal fires: func2
        crawler.signals.connect(obj.func2, signal=signals.spider_closed)
        return obj

    def func1(self, spider):
        print('open')

    def func2(self, spider):
        print('close')

settings.py

EXTENSIONS = {
    'pachong.extends.MyExtension': 500,
}

Custom commands

1. Create a directory (a Python package) at the same level as spiders, e.g. commands

2. Inside commands, create crawlall.py (the filename crawlall becomes the command name)

3. Register it in settings.py:

  COMMANDS_MODULE = '<project_name>.commands'

crawlall.py
from scrapy.commands import ScrapyCommand
from scrapy.utils.project import get_project_settings


class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        # Collect every spider registered in the project
        spider_list = self.crawler_process.spiders.list()
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)

        # Start the crawl
        self.crawler_process.start()
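With COMMANDS_MODULE registered, running "scrapy crawlall" from the project directory starts every spider in a single process.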

scrapy-redis

scrapy-redis is a Redis-backed set of Scrapy components for building distributed crawlers. Its main features are URL deduplication, a shared scheduler, and data persistence.
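A minimal settings.py sketch for wiring scrapy-redis in (a local Redis instance is assumed; the class paths are scrapy-redis defaults):

# Share the request queue and the dupefilter fingerprint set through Redis,
# so several crawler processes or machines can cooperate on one crawl
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue between runs instead of flushing it when the spider closes
SCHEDULER_PERSIST = True

# Optionally persist scraped items into Redis as well
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

# Where Redis lives
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379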

