Scrapy example: several strategies for simulating login

scrapy startproject loginSpider

scrapy genspider imooc "imooc.com"

There is no need to touch items.py, settings.py, or pipelines.py; only the files below need to be written:

 

Login strategy 1:

imooc.py

# -*- coding: utf-8 -*-
import scrapy

# When nothing else works, you can simulate login this way. It is a bit
# more tedious, but the success rate is 100%.
class ImoocSpider(scrapy.Spider):
    name = 'imooc'
    allowed_domains = ['imooc.com']
    start_urls = ['https://www.imooc.com/']

    # Taking the lazy shortcut of passing the whole header like this does NOT work:
    # cookies = {"Cookie": "*****"}

    # Each cookie has to be its own key/value pair, like this:
    cookies = {
        "zg_did": "%7B6980ed700d12e%22%7D",
        "UM_distinctid": "1701587b-7711b3e-144000-17015931db84bf",
        "Hm_lvt_f0cfcccd7b1393990c78efdeebff3968": ",1582976265,1583336006",
        "redrainTime": "2020-3-4; IMCDNS=0",
        "imooc_uuid": "9f3b12bb-5a90-",
        "imooc_isnew": "1",
        "imooc_isnew_ct": "158423",
        "last_login_username": "834",
        "Hm_lpvt_f0cfcccd7b1393990c78efdeebff3968": "150",
        "cvde": "5e402b1",
        "loginstate": "1",
        "apsid": "M",
        "zg_f375fe2f71e542a4b890d9a620f9fb32": "%7B"
    }

    def start_requests(self):
        for url in self.start_urls:
            # No form data is posted here, so a plain scrapy.Request with
            # cookies=self.cookies would work just as well.
            yield scrapy.FormRequest(url, cookies=self.cookies, callback=self.parse_page)

    def parse_page(self, response):
        print('=========' + response.url)
        with open('deng.html', 'wb') as f:
            f.write(response.body)
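Since the raw Cookie header copied from the browser's dev tools cannot be passed as a single string, it has to be split into a dict first. A small helper like the one below can do that conversion; it is a hypothetical utility for illustration, not part of the original spider.

```python
# Hypothetical helper: turn the raw "Cookie" header string copied from the
# browser (e.g. "a=1; b=2") into the dict form Scrapy's cookies= argument expects.
def cookie_header_to_dict(raw: str) -> dict:
    cookies = {}
    for pair in raw.split(';'):
        if '=' in pair:
            # partition splits on the FIRST '=', so values containing '=' survive
            name, _, value = pair.strip().partition('=')
            cookies[name] = value
    return cookies

print(cookie_header_to_dict("imooc_isnew=1; loginstate=1"))
# -> {'imooc_isnew': '1', 'loginstate': '1'}
```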

 

Login strategy 2 (in my opinion this one is quite good: first get the endpoint working in Postman, work out the request parameters, then write them into the spider):

imooc1.py

# -*- coding: utf-8 -*-
import scrapy
import codecs

# If you want the spider to send a POST request right at the start, override
# the Spider class's start_requests(self) method and stop using the URLs in start_urls.
# This approach works whenever the login only needs POST data.
# In the example below, the POST data is the account name and password.
class Imooc1Spider(scrapy.Spider):
    name = 'imooc1'
    allowed_domains = ['imooc.com']
    # start_urls = ['http://imooc.com/']

    def start_requests(self):
        url = "https://www.imooc.com/passport/user/login"
        yield scrapy.FormRequest(
            url=url,
            formdata={"username": "**", "password": "**", "referer": "https://www.imooc.com"},  # the referer field is required here, otherwise the server returns an error
            callback=self.parse_page
        )

    def parse_page(self, response):
        print('=========' + response.url)
        with codecs.open('deng1.json', 'wb', encoding="utf-8") as f:
            f.write(response.body.decode('unicode_escape'))  # together with the codecs module, this turns \u6210\u529f-style escapes back into Chinese characters
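As an aside (not from the original post): if the login endpoint returns JSON, `json.loads` already decodes `\u6210\u529f`-style escapes while parsing, which is usually cleaner and safer than `unicode_escape` (the latter can corrupt non-ASCII bytes). A minimal sketch with a sample response body:

```python
import json

# Sample bytes standing in for response.body; the real payload depends on the site.
body = b'{"msg": "\\u6210\\u529f"}'

data = json.loads(body.decode('utf-8'))  # \uXXXX escapes are decoded by the JSON parser
print(data["msg"])  # -> 成功 ("success")
```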

 

Login strategy 3 (if the login page is not valid HTML and relies heavily on Javascript, this approach has pitfalls; don't use it, strategy 2 works fine):

imooc2.py

# -*- coding: utf-8 -*-
import scrapy

# The "orthodox" way to simulate login:
# First send a GET request for the login page and extract the parameters the
# login form requires, e.g. zhihu's _xsrf token.
# Then POST them to the server together with the account name and password.
class Imooc2Spider(scrapy.Spider):
    name = 'imooc2'
    allowed_domains = ['imooc.com']
    start_urls = ['https://www.imooc.com/user/newlogin']

    def parse(self, response):
        # _xsrf = response.xpath("//_xsrf").extract()[0]
        yield scrapy.FormRequest.from_response(
            response,
            # formdata={"username": "", "password": "", "referer": "https://www.imooc.com", "_xsrf": _xsrf},
            formdata={"email": "", "password": "", "referer": "https://www.imooc.com"},  # the referer field is required here, otherwise the server returns an error
            callback=self.page_parse    # called once the login succeeds
        )

    def page_parse(self, response):
        print('====1=====' + response.url)
        url = "https://www.imooc.com/"
        yield scrapy.Request(url, callback=self.new_page)

    def new_page(self, response):
        print('====2=====' + response.url)
        with open('deng2.html', 'wb') as f:
            f.write(response.body)

'''
This raised an error: ValueError: No <form> element found in <200 https://www.imooc.com/user/newlogin>
I looked into the cause; see: https://stackoverflow.com/questions/22707184/scrapy-request-form-for-scraping-data
Summarised in English:
The page you mention is not valid HTML. It looks like an SSI block that should be part of another page.
The error occurs because the page makes heavy use of Javascript. It has no form element; from_response tries to build a request from an existing form, but the form does not exist.
You should either process the full page, or fill in the form request's attributes manually and submit them to the URL taken from the full page.
'''

 

Run (from the project directory): scrapy crawl imooc, and likewise scrapy crawl imooc1 or scrapy crawl imooc2 for the other two spiders.

 

posted on 2020-03-15 13:25 cherry_ning