Scrapy: Simulating Login and Middleware

The server issues cookies through the Set-Cookie response header; in requests, a session object stores these cookies automatically and carries them on later requests.

I. Handling cookies in Scrapy

Back in the requests chapter we covered two main ways of handling cookies. Option one: grab the cookie straight out of the browser and paste it into the headers. Simple and crude. Option two: go through the normal login flow and let a session record the cookies collected along the way. So how do we handle cookies in Scrapy? The same two options apply.
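As a quick refresher, here is a minimal sketch of the session approach in requests (the login URL and form field names below are placeholders, not the real site's):

import requests

# A Session stores whatever the server sends in Set-Cookie response headers
# and sends those cookies back automatically on every following request.
session = requests.Session()

# hypothetical login endpoint and form fields, for illustration only
session.post("http://www.example.com/login.php", data={"username": "xxx", "password": "xxx"})

# the cookies obtained during login are carried automatically here
resp = session.get("http://www.example.com/bookcase.php")
print(resp.text)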

First, as usual, let's pin down the target. It's our old friend again: http://www.woaidu.cc/bookcase.php
This URL (the user's bookshelf) can only be accessed after logging in, so cookies are required for this page. Create the project, generate the spider, and fill in the blanks.

import scrapy
from scrapy import Request, FormRequest


class LoginSpider(scrapy.Spider):
    name = 'login'
    allowed_domains = ["woaige.net"]
    start_urls = ["http://www.woaige.net/bookcase.php"]

    def parse(self, response):
        print(response.text)

Run it now and the page shows that the user is not logged in. Whichever option we choose, the cookie has to be obtained before the request to the URL in start_urls is sent. By default, though, Scrapy builds the initial request for us automatically, so we need to assemble that first request ourselves by overriding the start_requests() method in our spider; it is responsible for building the start requests. Let's first look at how the original start_requests() works.

# The following is Scrapy source code

def start_requests(self):
    cls = self.__class__
    if not self.start_urls and hasattr(self, 'start_url'):
        raise AttributeError(
            "Crawling could not start: 'start_urls' not found "
            "or empty (but found 'start_url' attribute instead, "
            "did you miss an 's'?)")
    if method_is_overridden(cls, Spider, 'make_requests_from_url'):
        warnings.warn(
            "Spider.make_requests_from_url method is deprecated; it "
            "won't be called in future Scrapy releases. Please "
            "override Spider.start_requests method instead (see %s.%s)." % (
                cls.__module__, cls.__name__
            ),
        )
        for url in self.start_urls:
            yield self.make_requests_from_url(url)
    else:
        for url in self.start_urls:
            # The core is just this one line: build a Request object. We can do the same ourselves.
            yield Request(url, dont_filter=True)

Let's write our own start_requests() and see:

def start_requests(self):
    print("我是万恶之源")
    yield Request(
        url=LoginSpider.start_urls[0],
        callback=self.parse
    )

1. Option 1: Copy the cookie straight from the browser

import scrapy


class DengSpider(scrapy.Spider):
    name = "deng"
    allowed_domains = ["woaidu.cc"]
    start_urls = ["http://www.woaidu.cc/bookcase.php"]

    def start_requests(self):
        cookies = "Hm_lvt_155d53bb19b3d8127ebcd71ae20d55b1=1725014283,1725263893,1725264973; HMACCOUNT=0BFAD8D83E97B549; username=User; t=727289967466d574c47bb09; Hm_lpvt_155d53bb19b3d8127ebcd71ae20d55b1=1725265123"
        cookie_dic = {}
        for cook in cookies.split("; "):
            k, v = cook.split("=", 1)  # split only on the first '=' in case the value contains '='
            cookie_dic[k] = v
        yield scrapy.Request(url=self.start_urls[0], cookies=cookie_dic)

    def parse(self, resp, **kwargs):
        # check whether the cookie carries over to subsequent requests
        yield scrapy.Request(url=self.start_urls[0], callback=self.chi, dont_filter=True)

    def chi(self, resp):
        # we can see the logged-in content here, so the cookie persisted. all good
        print(resp.text)

This option is almost identical to what we did with requests. The key point: the cookie must be passed via the cookies parameter!

2. Option 2: Go through the full login flow

import scrapy
import ddddocr
from urllib.parse import urlencode


class DengSpider(scrapy.Spider):
    name = "deng"
    allowed_domains = ["woaige.net"]
    start_urls = ["http://www.woaige.net/bookcase.php"]

    # walk through the complete login flow
    def start_requests(self):
        # 1. First request the captcha image. The cookie state must be preserved (Scrapy keeps it by default).
        yield scrapy.Request(
            url="http://www.woaige.net/code.php?0.827640446552238",
            callback=self.verify_code
        )

    def verify_code(self, response):
        # Get the image bytes: requests -> resp.content, aiohttp -> resp.content.read(), scrapy -> resp.body
        print(f"Response status code: {response.status}")
        # print(response.body)
        with open("code.jpg", "wb") as f:
            f.write(response.body)
        # 2. Recognize the captcha with ddddocr
        result = ddddocr.DdddOcr(show_ad=False).classification(response.body)
        # print(result)

        form_data = {
            "LoginForm[username]": "k1003451503",
            "LoginForm[password]": "vitamin0824",
            "LoginForm[captcha]": result,
            "action": "login",
            "submit": "登  录"
        }
        # 3. Submit the form
        # Method 1: plain Request. The body must be a urlencoded string, and you would
        # normally also have to set the Content-Type header yourself.
        # yield scrapy.Request(
        #     url="http://www.woaige.net/login.php",
        #     method="POST",
        #     body=urlencode(form_data),
        #     callback=self.login
        # )

        # Method 2 (recommended): FormRequest. No headers or encoding to worry about,
        # just hand it the dict.
        yield scrapy.FormRequest(
            url="http://www.woaige.net/login.php",
            formdata=form_data,  # pass the dict directly
            callback=self.login
        )

    def login(self, response):
        # login done; now request the bookshelf page (the login cookie is carried automatically)
        yield scrapy.Request(url=DengSpider.start_urls[0], callback=self.parse)
        # print(response.text)

    def parse(self, response, **kwargs):
        print(response.text)

Note: there are two ways to send a POST request:
1. scrapy.Request(url=url, method='post', body=data)
2. scrapy.FormRequest(url=url, formdata=data) -> recommended
The difference: with method 1 the data can only be a string, which is awkward to build, so the second method is recommended.
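A minimal sketch comparing the two (the URL and form data are placeholders): with a plain Request you must urlencode the body yourself and, typically, set the Content-Type header, while FormRequest handles both.

import scrapy
from urllib.parse import urlencode

form_data = {"username": "xxx", "password": "xxx"}  # placeholder form data

# Method 1: plain Request. body must already be a string.
req1 = scrapy.Request(
    url="http://www.example.com/login.php",
    method="POST",
    body=urlencode(form_data),
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

# Method 2: FormRequest. Hand it the dict; encoding and headers are handled internally.
req2 = scrapy.FormRequest(
    url="http://www.example.com/login.php",
    formdata=form_data,
)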

3. Option 3: Provide the cookie in the settings file

settings has a configuration item, DEFAULT_REQUEST_HEADERS, where default request headers can be supplied. Note that COOKIES_ENABLED must be set to False in settings; otherwise the Cookie header will be discarded by Scrapy's cookie handling in the downloader middleware.

COOKIES_ENABLED = False

DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  'Cookie': 'xxxxxx',
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36"
}

II. Scrapy middleware

What middleware is for: it handles the requests and responses that travel between the engine and the spider, and between the engine and the downloader. Its main job is to pre-process request and response objects and set everything up for the steps that follow. Scrapy provides two kinds of middleware: downloader middleware and spider middleware.

1. DownloaderMiddleware

Downloader middleware sits between the engine and the downloader: after the engine gets a request object it hands it to the downloader, and downloader middleware can be slotted in between. Its execution flow:

engine gets request  -> middleware1(process_request)  -> middleware2(process_request)  ... -> downloader
engine gets response <- middleware1(process_response) <- middleware2(process_response) ... <- downloader

class MidDownloaderMiddleware1:

    def process_request(self, request, spider):
        print("process_request", "ware1")
        return None

    def process_response(self, request, response, spider):
        print("process_response", "ware1")
        return response

    def process_exception(self, request, exception, spider):
        print("process_exception", "ware1")
        pass


class MidDownloaderMiddleware2:

    def process_request(self, request, spider):
        print("process_request", "ware2")
        return None

    def process_response(self, request, response, spider):
        print("process_response", "ware2")
        return response

    def process_exception(self, request, exception, spider):
        print("process_exception", "ware2")
        pass

Register the middleware in settings:

DOWNLOADER_MIDDLEWARES = {
   # 'mid.middlewares.MidDownloaderMiddleware': 542,
   'mid.middlewares.MidDownloaderMiddleware1': 543,
   'mid.middlewares.MidDownloaderMiddleware2': 544,
}

The priority numbers work the same way as for pipelines.
Next, let's talk about the return values of these methods (the tricky part); a sketch illustrating them follows the list below.

1. process_request(request, spider): called before each request reaches the downloader.

   a. return None: do not intercept; the request is passed on to the next middleware in the chain and then to the downloader.

   b. return a Request: the request is intercepted and the new request is returned (to the scheduler, via the engine); the remaining middleware and the downloader never see the original request.

   c. return a Response: the request is intercepted; the downloader never gets it, but the engine still receives a response, i.e. the response content has already been produced inside this method.

2. process_response(request, response, spider): called for each response on its way back from the downloader.

   a. return a Response: the response is passed on (via the engine) to the other components, i.e. to the next process_response() in the chain and eventually to the spider.

   b. return a Request: the response is intercepted; the returned request is handed back to the scheduler (via the engine), and the remaining process_response() methods never see this response.
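To make these return values concrete, here is a hedged sketch (the meta key and the non-200 retry condition are just illustrative choices, not part of any project above): process_request returns a ready-made response so the downloader is skipped entirely, and process_response sends failed responses back as requests.

from scrapy.http import HtmlResponse


class ReturnValueDemoMiddleware:
    def process_request(self, request, spider):
        if request.meta.get("use_fake_response"):
            # Returning a Response here intercepts the request: the downloader
            # and the remaining process_request() methods never see it, and this
            # response travels back toward the engine/spider.
            return HtmlResponse(
                url=request.url,
                body=b"<html><body>faked content</body></html>",
                encoding="utf-8",
                request=request,
            )
        # Returning None lets the request continue down the chain as usual.
        return None

    def process_response(self, request, response, spider):
        if response.status != 200:
            # Returning a Request intercepts the response: it goes back to the
            # scheduler (via the engine) and later process_response() methods
            # never receive this response.
            request.dont_filter = True
            return request
        # Returning a Response passes it on to the next middleware / the spider.
        return response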

1.1. Setting a random User-Agent dynamically

Setting one fixed UA is easy: just set it in settings.

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'

But that is not ideal; I want a random UA on every request. To do that, first define a pool of User-Agents in settings (see http://useragentstring.com/pages/useragentstring.php?name=Chrome for more):

USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    'Mozilla/5.0 (X11; Ubuntu; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2919.83 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2866.71 Safari/537.36',
    'Mozilla/5.0 (X11; Ubuntu; Linux i686 on x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2820.59 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2762.73 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2656.18 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML like Gecko) Chrome/44.0.2403.155 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2226.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.4; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2225.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2225.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2224.3 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.93 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 4.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36',
]

The middleware:

# middleware.py

from random import choice

class MyRandomUserAgentMiddleware:

    def __init__(self, settings):
        # fetch the user-agent list from settings at init time
        self.user_agent_list = settings.get('USER_AGENT_LIST')

    @classmethod
    def from_crawler(cls, crawler):
        # from_crawler is the recommended way for Scrapy to instantiate middleware;
        # it gives access to the crawler's settings and other components
        s = cls(crawler.settings)
        return s

    def process_request(self, request, spider):
        # pick a random UA from the list held on the instance
        UA = choice(self.user_agent_list)
        request.headers['User-Agent'] = UA

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass

Then make sure the middleware is enabled in settings.py:

DOWNLOADER_MIDDLEWARES = {
   'mid.middlewares.MyRandomUserAgentMiddleware': 543,
}

1.2. Handling proxies

Proxies have always been a headache for crawler engineers: without one you are easily detected, with one the throughput drops, and usable free IPs are few and far between. There is no way around it, though, so here are two ways to add a proxy in Scrapy.

1.2.1. Free proxies

from random import choice


class ProxyMiddleware:
    def __init__(self, settings):
        # fetch the proxy list from settings at init time
        self.proxy_list = settings.get('PROXY_LIST')

    @classmethod
    def from_crawler(cls, crawler):
        # from_crawler is how Scrapy instantiates the middleware;
        # it gives access to the crawler's settings and other components
        s = cls(crawler.settings)
        return s

    def process_request(self, request, spider):
        proxy = choice(self.proxy_list)
        request.meta['proxy'] = "https://" + proxy  # set the proxy for this request
        return None

    def process_response(self, request, response, spider):
        if response.status != 200:
            print("尝试失败")
            request.dont_filter = True  # 丢回调度器重新请求
            return request
        return response

    def process_exception(self, request, exception, spider):
        print("出错了!")
        pass
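The middleware above reads PROXY_LIST from settings, which is not something Scrapy defines for you. A sketch of what the corresponding settings.py entries might look like (the proxy addresses are placeholders):

# settings.py
PROXY_LIST = [
    "111.111.111.111:8888",   # placeholder addresses; fill in with proxies you actually have
    "222.222.222.222:8080",
]

DOWNLOADER_MIDDLEWARES = {
   'mid.middlewares.ProxyMiddleware': 543,
}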

1.2.2. Paid proxies

Free proxies are really hard to work with, so here we go straight to a paid one. I'm using Kuaidaili (快代理) again; adjust this to whichever provider you prefer.

from w3lib.http import basic_auth_header  # builds the Proxy-Authorization header value


class MoneyProxyMiddleware:
    def _get_proxy(self):
        """
        Tunnel proxy details from the Kuaidaili dashboard (the IP rotates on every request):
        endpoint: tps138.kdlapi.com:15818, limit 5 req/s, 5 Mb/s, requires real-name verification
        tunnel username: t12831993520578  password: t72a13xu
        :return: (proxy_url, auth_header)
        """
        url = "http://tps138.kdlapi.com:15818"
        auth = basic_auth_header(username="t12831993520578", password="t72a13xu")
        return url, auth

    def process_request(self, request, spider):
        print("......")
        url, auth = self._get_proxy()
        request.meta['proxy'] = url
        request.headers['Proxy-Authorization'] = auth
        request.headers['Connection'] = 'close'
        return None

    def process_response(self, request, response, spider):
        print(response.status, type(response.status))
        if response.status != 200:
            request.dont_filter = True
            return request
        return response

    def process_exception(self, request, exception, spider):
        pass

2. SpiderMiddleware (good to know)

Spider middleware sits between the engine and the spider. Its commonly used methods are shown below.
middlewares.py

from scrapy import signals
from cuowu.items import ErrorItem  # used in process_spider_exception below


class CuowuSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # called when a response has come back and is about to enter the spider
        # must either return None or raise an exception
        print("this is process_spider_input")
        return None

    def process_spider_output(self, response, result, spider):
        # runs after the spider has processed the response and returned its results
        # the output must be items or requests
        print("this is process_spider_output")
        for i in result:
            yield i
        print("我是process_spider_output")

    def process_spider_exception(self, response, exception, spider):
        print("process_spider_exception")
        # called when the spider raises an exception, or when process_spider_input() raises
        # may return None, or yield Requests or items
        it = ErrorItem()
        it['name'] = "exception"
        it['url'] = response.url
        yield it

    def process_start_requests(self, start_requests, spider):
        print("process_start_requests")
        # called once, when the spider's start requests are generated

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        pass

items.py

class ErrorItem(scrapy.Item):
    name = scrapy.Field()
    url = scrapy.Field()


class CuowuItem(scrapy.Item):  # the regular item yielded by the spider in baocuo.py
    name = scrapy.Field()

baocuo.py

import scrapy
from cuowu.items import CuowuItem


class BaocuoSpider(scrapy.Spider):
    name = 'baocuo'
    allowed_domains = ['baidu.com']
    start_urls = ['http://www.baidu.com/']

    def parse(self, resp, **kwargs):
        name = resp.xpath('//title/text()').extract_first()
        # print(1/0)  # uncomment this to trigger an exception and watch the error path; play around with it
        it = CuowuItem()
        it['name'] = name
        print(name)
        yield it

pipelines.py

from cuowu.items import ErrorItem

class CuowuPipeline:
    def process_item(self, item, spider):
        if isinstance(item, ErrorItem):
            print("error item:", item)
        else:
            print("normal item:", item)
        return item

Project layout:

cuowu
├── cuowu
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── baocuo.py
└── scrapy.cfg