Handling Account Logins

Preface

Some sites only show their content to logged-in users, so a crawler needs to supply account credentials, pass in cookies, or take similar steps to "convince" the site that it is a logged-in browser.

Crawling with requests

  • Passing cookies
import requests

def get_response(url):
    # Cookie values copied from a logged-in browser session
    cookie = {
        '__cfduid': 'd069b9d13ff5c13bdb8375bd4d984f7181604580012',
        # '_ga': 'GA1.2.880489818.1571991338',
        # '_gid': 'GA1.2.308639808.1572249864',
        # ... and so on
    }
    response = requests.get(url, cookies=cookie)
    return response

# The cookie can also be put into the headers and sent together with the other header fields
def get_response(url):
    headers = {
        'User-Agent': '',
        'Cookie': '',
        # ... and so on
    }
    response = requests.get(url, headers=headers)
    return response
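
In practice the cookie usually comes straight out of the browser's developer tools as one long "Cookie: k1=v1; k2=v2; ..." request header. A minimal sketch that turns such a string into the dict used above (the helper name parse_cookie_string is our own, not part of requests):

def parse_cookie_string(raw):
    # Split "k1=v1; k2=v2" into {'k1': 'v1', 'k2': 'v2'}
    cookie = {}
    for pair in raw.split(';'):
        if '=' in pair:
            key, _, value = pair.strip().partition('=')
            cookie[key] = value
    return cookie

# Example: paste the value of the browser's Cookie header here
cookie = parse_cookie_string('__cfduid=d069b9d13ff5c13bdb8375bd4d984f7181604580012; _ga=GA1.2.880489818.1571991338')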
  • Sending the account and password with a POST request
import requests

def get_response(url):
    formdata = {
        "email": "example@qq.com",
        "password": "123456"
        # ... plus any other form fields the site expects
    }
    headers = {
        # User-Agent and other header fields
    }
    response = requests.post(url, data=formdata, headers=headers)
    return response
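
A 200 response alone does not prove the login worked, since many sites answer 200 with an error page. A rough sanity check, assuming the site shows some marker text (a logout link, the account name) only to logged-in users; the URL and marker string below are placeholders:

response = get_response('http://www.example.com/login')
# Look for a marker that only appears after a successful login (assumed here)
if response.ok and 'Log out' in response.text:
    print('login succeeded')
else:
    print('login failed:', response.status_code)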
  • Using a session

In requests, the Session object is one of the most commonly used objects. It represents a single user session: from the moment the client browser connects to the server until the moment it disconnects.

A session lets us persist certain parameters across requests; for example, all requests issued through the same Session instance share the same cookies.

import requests

# Create a Session object; it holds on to any cookie values it receives
session = requests.Session()

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}

# Username and password needed for the login
data = {"email": "example@qq.com", "password": "123456"}

# Send the request with the username and password; the cookies returned
# after logging in are stored in the session
session.post("http://www.renren.com/PLogin.do", data=data, headers=headers)

# The session now holds the logged-in cookies, so pages that require a
# login can be fetched directly
response = session.get("http://www.renren.com/410043129/profile")
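
Once logged in, the session's cookies can be saved to disk and restored on later runs, so the password does not have to be sent every time. A small sketch using requests' own dict_from_cookiejar / cookiejar_from_dict helpers (the file name is arbitrary):

import json
import requests

# Save the logged-in cookies right after the session.post(...) above
with open("cookies.json", "w") as f:
    json.dump(requests.utils.dict_from_cookiejar(session.cookies), f)

# On a later run, restore them into a fresh session
session = requests.Session()
with open("cookies.json") as f:
    session.cookies = requests.utils.cookiejar_from_dict(json.load(f))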

Crawling with Scrapy

  • POSTing the data directly (e.g. the account credentials required to log in)
import scrapy

class DoubanSpider(scrapy.Spider):
    name = "douban"
    allowed_domains = ["douban.com"]

    def start_requests(self):
        url = 'https://movie.douban.com'
        # FormRequest is how Scrapy sends a POST request
        yield scrapy.FormRequest(
                url=url,
                formdata={"email": "example@qq.com", "password": "123456"},
                callback=self.parse_page)

    def parse_page(self, response):
        # response.body is bytes, so open the file in binary mode
        with open("douban.html", "wb") as f:
            f.write(response.body)
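
Note that FormRequest switches the HTTP method to POST automatically as soon as formdata is supplied. Inside a Scrapy project, the spider above can be tried out with scrapy crawl douban.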
  • The standard simulated-login procedure

    1. First send a GET request for the login page and extract the parameters the login requires from it (for example the _xsrf token on zhihu's login page).
    2. Then POST those parameters to the server together with the account and password to complete the login.
    import scrapy
    
    class RenrenSpider(scrapy.Spider):
        name = "renren"
        allowed_domains = ["renren.com"]
        start_urls = (
            "http://www.renren.com/PLogin.do",
        )
    
        # Handle the response for the login URL in start_urls and extract
        # the parameters the login needs (if any)
        def parse(self, response):
            # Extract the parameters the login needs, e.g.:
            # _xsrf = response.xpath("//input[@name='_xsrf']/@value").extract_first()

            # Send the form parameters and process the result in the given callback
            yield scrapy.FormRequest.from_response(
                    response,
                    formdata={
                        "email": "example@qq.com",
                        "password": "123456"
                        # plus "_xsrf": _xsrf if the site requires it
                    },
                    callback=self.parse_page
                )

        # Logged in successfully; now visit a page that requires a login
        def parse_page(self, response):
            url = "http://www.renren.com/422167102/profile"
            yield scrapy.Request(url, callback=self.parse_newpage)

        # Handle the response content; response.body is bytes, so write in binary mode
        def parse_newpage(self, response):
            with open("renren.html", "wb") as f:
                f.write(response.body)
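
FormRequest.from_response reads the first <form> on the login page and pre-fills its hidden fields, which is why _xsrf usually does not need to be extracted by hand. When a request has to be built manually instead, the token can be pulled out explicitly. A minimal sketch, assuming the token sits in a hidden <input name="_xsrf"> element (the spider name is ours, and the zhihu endpoint shown is the old tutorial-era one, which may no longer work):

import scrapy

class ZhihuLoginSpider(scrapy.Spider):
    name = "zhihu_login"
    start_urls = ["https://www.zhihu.com/#signin"]

    def parse(self, response):
        # Assumption: the login form carries its token in a hidden
        # <input name="_xsrf" value="..."> element
        _xsrf = response.xpath("//input[@name='_xsrf']/@value").extract_first()
        yield scrapy.FormRequest(
                "https://www.zhihu.com/login/email",
                formdata={
                    "email": "example@qq.com",
                    "password": "123456",
                    "_xsrf": _xsrf
                },
                callback=self.after_login)

    def after_login(self, response):
        self.logger.info("login response status: %s", response.status)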
    
  • Simulating a login with cookies saved from a logged-in session

import scrapy

class DoubanSpider(scrapy.Spider):
    name = 'douban'
    # allowed_domains = ['douban.com']
    base_url = 'https://movie.douban.com/top250?start={}&filter='
    offset = 0
    start_urls = [base_url.format(str(offset))]
    # Cookie values copied from a logged-in browser session. A Python dict
    # cannot hold duplicate keys, so each of __utma/__utmb/__utmc/__utmz
    # appears only once here, even though the browser sends two GA trackers.
    cookies = {
        "bid": "8LnDIRxQaG8",
        "__utmc": "223695111",
        "__utmz": "223695111.1611728609.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)",
        "ll": "118263",
        "_vwo_uuid_v2": "D3CFA16601DFDFD413F3A2CFF049C7D42|e9f36c04757a1c3f6f33b5c028565cc9",
        "_pk_ses.100001.4cf6": "*",
        "__utma": "223695111.26793853.1611728609.1611728609.1611735527.2",
        "__utmb": "223695111.0.10.1611735527",
        "ap_v": "0,6.0",
        "_pk_id.100001.4cf6": "28e712f08d4e9328.1611728608.2.1611737459.1611728695."
    }

    # Override the Spider's start_requests method so that the initial
    # requests carry the saved cookies (a plain GET is all that is needed)
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, cookies=self.cookies, callback=self.parse)

    def parse(self, response):
        # Each movie on the Top 250 list page sits in an <li> under <ol class="grid_view">
        li_list = response.xpath("//ol[@class='grid_view']/li")
        for li in li_list:
            item = {}
            item["title"] = li.xpath(".//span[@class='title']/text()").extract_first()
            item["rating"] = li.xpath(".//span[@class='rating_num']/text()").extract_first()
            yield item
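
After the first responses arrive, Scrapy's cookies middleware stores the received cookies in its cookie jar and attaches them to the spider's subsequent requests automatically, so repeating cookies= on every follow-up Request is usually unnecessary.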
  • Note: make sure COOKIES_ENABLED (the cookies middleware) is enabled in settings.py
COOKIES_ENABLED = True  # or simply leave the default "# COOKIES_ENABLED = False" line commented out
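
If a login refuses to stick, the standard COOKIES_DEBUG setting makes Scrapy log the Cookie and Set-Cookie headers of every request and response, which quickly shows whether the saved values actually reach the site:

# settings.py
COOKIES_ENABLED = True   # the default: the cookies middleware is on
COOKIES_DEBUG = True     # log cookies sent and received for each request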