Scrapy example: several strategies for simulating login
scrapy startproject loginSpider
scrapy genspider imooc "imooc.com"
There is no need to touch items.py, settings.py, or pipelines.py; only the spider files below are needed:

Login strategy 1 (send the browser's cookies with the request):
imooc.py
# -*- coding: utf-8 -*-
import scrapy

# Last-resort approach: simulate login by reusing cookies copied from the
# browser. A bit tedious, but the success rate is essentially 100%.
class ImoocSpider(scrapy.Spider):
    name = 'imooc'
    allowed_domains = ['imooc.com']
    start_urls = ['https://www.imooc.com/']

    # Taking the shortcut of passing the whole header string does NOT work:
    # cookies = {"Cookie": "*****"}

    # The cookies must be written as a dict of individual name/value pairs:
    cookies = {
        "zg_did": "%7B6980ed700d12e%22%7D",
        "UM_distinctid": "1701587b-7711b3e-144000-17015931db84bf",
        "Hm_lvt_f0cfcccd7b1393990c78efdeebff3968": ",1582976265,1583336006",
        "redrainTime": "2020-3-4; IMCDNS=0",
        "imooc_uuid": "9f3b12bb-5a90-",
        "imooc_isnew": "1",
        "imooc_isnew_ct": "158423",
        "last_login_username": "834",
        "Hm_lpvt_f0cfcccd7b1393990c78efdeebff3968": "150",
        "cvde": "5e402b1",
        "loginstate": "1",
        "apsid": "M",
        "zg_f375fe2f71e542a4b890d9a620f9fb32": "%7B"
    }

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.FormRequest(url, cookies=self.cookies, callback=self.parse_page)

    def parse_page(self, response):
        print('=========' + response.url)
        with open('deng.html', 'wb') as f:
            f.write(response.body)
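The cookie dict above is usually assembled by hand from the browser's Cookie header. That conversion can be automated; a minimal sketch using only the standard library (the header value below is a made-up example, and the helper name is my own):

```python
# Convert a raw "Cookie" request-header value, as copied from browser dev
# tools, into the name -> value dict that Scrapy's `cookies=` argument expects.
def cookie_header_to_dict(header_value):
    cookies = {}
    for pair in header_value.split(';'):
        pair = pair.strip()
        if not pair:
            continue
        # Split only on the first '=', since cookie values may contain '='.
        name, _, value = pair.partition('=')
        cookies[name.strip()] = value.strip()
    return cookies

# Hypothetical header value:
raw = "imooc_isnew=1; loginstate=1; apsid=M"
print(cookie_header_to_dict(raw))
# {'imooc_isnew': '1', 'loginstate': '1', 'apsid': 'M'}
```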

Login strategy 2 (my preferred one: get the request working in Postman first, then copy the parameters over here):
imooc1.py
# -*- coding: utf-8 -*-
import scrapy
import codecs

# To send a POST request as soon as the spider starts, override the Spider's
# start_requests(self) method; the URLs in start_urls are then never used.
# This approach works for any login that only requires POSTing form data.
# In this example, the POST data is the username and password.
class Imooc1Spider(scrapy.Spider):
    name = 'imooc1'
    allowed_domains = ['imooc.com']
    # start_urls = ['http://imooc.com/']

    def start_requests(self):
        url = "https://www.imooc.com/passport/user/login"
        yield scrapy.FormRequest(
            url=url,
            # "referer" is required here; without it the server returns an error
            formdata={"username": "**", "password": "**", "referer": "https://www.imooc.com"},
            callback=self.parse_page
        )

    def parse_page(self, response):
        print('=========' + response.url)
        with codecs.open('deng1.json', 'w', encoding="utf-8") as f:
            # decode \u6210\u529f-style escapes back into Chinese characters
            f.write(response.body.decode('unicode_escape'))
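The unicode_escape trick in parse_page can be checked in isolation. A minimal sketch of the same conversion (the response body below is a made-up login response, not imooc's actual payload):

```python
# A server response whose Chinese text arrives as \uXXXX escape sequences.
body = b'{"msg": "\\u6210\\u529f"}'

# bytes.decode('unicode_escape') turns the \uXXXX escapes into real characters.
text = body.decode('unicode_escape')
print(text)  # {"msg": "成功"}
```

For a response that really is JSON, json.loads(body) decodes these escapes as part of normal parsing and handles non-ASCII bytes more safely; unicode_escape is a quick hack for dumping the raw body to a readable file.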

Login strategy 3 (if the login page is not valid HTML because it relies heavily on Javascript, this approach has pitfalls; don't use it, strategy 2 is good enough):
imooc2.py
# -*- coding: utf-8 -*-
import scrapy

# The "textbook" login method:
# First send a GET request for the login page and extract the parameters the
# login form requires (e.g. zhihu's _xsrf token),
# then POST them to the server together with the username and password.
class Imooc2Spider(scrapy.Spider):
    name = 'imooc2'
    allowed_domains = ['imooc.com']
    start_urls = ['https://www.imooc.com/user/newlogin']

    def parse(self, response):
        # _xsrf = response.xpath("//_xsrf").extract()[0]
        yield scrapy.FormRequest.from_response(
            response,
            # formdata={"username": "", "password": "", "referer": "https://www.imooc.com", "_xsrf": _xsrf},
            # "referer" is required here; without it the server returns an error
            formdata={"email": "", "password": "", "referer": "https://www.imooc.com"},
            callback=self.page_parse  # called after a successful login
        )

    def page_parse(self, response):
        print('====1=====' + response.url)
        url = "https://www.imooc.com/"
        yield scrapy.Request(url, callback=self.new_page)

    def new_page(self, response):
        print('====2=====' + response.url)
        with open('deng2.html', 'wb') as f:
            f.write(response.body)

'''
This raised: ValueError: No <form> element found in <200 https://www.imooc.com/user/newlogin>
I looked up the cause; see: https://stackoverflow.com/questions/22707184/scrapy-request-form-for-scraping-data
In short: the page in question is not valid HTML. It looks like an SSI block that is meant to be part of another page.
The error occurs because the page relies heavily on Javascript and contains no form;
from_response tries to build the request from an existing form, but no form exists.
You should either process the full page, or fill in the form request's fields manually and submit them to the URL yourself.
'''
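When from_response fails because the page has no form, the quoted advice is to pull the hidden parameters out of the page by hand and build the FormRequest yourself. A minimal sketch of the extraction step using only the standard library (the HTML fragment and the _xsrf field name are illustrative, not imooc's real markup):

```python
import re

# A made-up login-page fragment carrying a hidden anti-CSRF token.
html = '<input type="hidden" name="_xsrf" value="abc123"/>'

# Extract the token manually instead of relying on FormRequest.from_response.
match = re.search(r'name="_xsrf"\s+value="([^"]+)"', html)
token = match.group(1) if match else None
print(token)  # abc123

# The token would then be POSTed along with the credentials, e.g.:
# yield scrapy.FormRequest(url, formdata={"email": "...", "password": "...",
#                                         "_xsrf": token}, callback=...)
```

In a real spider, response.xpath('//input[@name="_xsrf"]/@value').get() does the same extraction more robustly than a regex.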
Run each spider with scrapy crawl <name>, e.g. scrapy crawl imooc.

posted on 2020-03-15 13:25 by cherry_ning