Handling Account Logins
Preface
Some websites only show their content to logged-in users, so our crawler has to supply account credentials, pass in cookies, or take similar steps to "fool" the site.
Crawling with requests
- Passing in a Cookie
import requests

def get_response(url):
    cookie = {
        '__cfduid': 'd069b9d13ff5c13bdb8375bd4d984f7181604580012',
        # '_ga': 'GA1.2.880489818.1571991338',
        # '_gid': 'GA1.2.308639808.1572249864',
        # etc.
    }
    response = requests.get(url, cookies=cookie)
    return response

# Alternatively, put the cookie into the headers and send it together with the other fields
def get_response(url):
    headers = {
        'User-Agent': '',
        'Cookie': '',
        # etc.
    }
    response = requests.get(url, headers=headers)
    return response
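Cookies copied out of the browser's developer tools usually arrive as one long `Cookie` header string. A small standard-library helper can turn that string into the dict that `requests` expects (a sketch; the function name `cookie_str_to_dict` is made up here):

```python
from http.cookies import SimpleCookie

def cookie_str_to_dict(cookie_str):
    """Parse a raw 'Cookie' header string into a plain dict.

    cookie_str is the value copied from the browser, e.g.
    '__cfduid=d069b9d1...; _ga=GA1.2.880...'.
    """
    parsed = SimpleCookie()
    parsed.load(cookie_str)
    return {name: morsel.value for name, morsel in parsed.items()}

# The resulting dict can be passed as requests.get(url, cookies=...)
cookies = cookie_str_to_dict("__cfduid=d069b9d13ff5; _ga=GA1.2.880489818.1571991338")
```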
- Logging in by POSTing the account and password
import requests

def get_response(url):
    formdata = {
        "email": "example@qq.com",
        "password": "123456",
        # other form fields
    }
    headers = {
        # request headers
    }
    response = requests.post(url, data=formdata, headers=headers)
    return response
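For reference, `data=formdata` makes requests send the form as an `application/x-www-form-urlencoded` body; what actually goes over the wire can be previewed with the standard library:

```python
from urllib.parse import urlencode

formdata = {"email": "example@qq.com", "password": "123456"}

# This is the request body that requests builds from data=formdata
body = urlencode(formdata)
print(body)  # email=example%40qq.com&password=123456
```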
- Using a session
In requests, the Session object is a very commonly used object. It represents one user session: from the moment the client browser connects to the server until the client browser disconnects from it.
A session lets us persist certain state across requests; for example, cookies are kept across all requests made from the same Session instance.
import requests

# Create a Session object; it keeps cookie values between requests
session = requests.session()
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}

# Username and password needed for login
data = {"email": "example@qq.com", "password": "123456"}

# Send the login request; the cookies returned after login are stored in the session
session.post("http://www.renren.com/PLogin.do", data=data, headers=headers)

# The session now carries the logged-in cookies, so pages that require login can be fetched directly
response = session.get("http://www.renren.com/410043129/profile")
Crawling with scrapy
- POSTing data directly (e.g. the account details needed to log in)
import scrapy

class DoubanSpider(scrapy.Spider):
    name = "douban"
    allowed_domains = ["douban.com"]

    def start_requests(self):
        url = 'https://movie.douban.com'
        # FormRequest is how Scrapy sends a POST request
        yield scrapy.FormRequest(
            url=url,
            formdata={"email": "example@qq.com", "password": "123456"},
            callback=self.parse_page)

    def parse_page(self, response):
        # response.body is bytes, so open the file in binary mode
        with open("douban.html", "wb") as f:
            f.write(response.body)
- The standard simulated-login procedure
- First, send a GET request for the login page and extract the parameters the login form requires (for example, the _xsrf token on zhihu's login page)
- Then POST those parameters to the server together with the account and password to complete the login
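The first step, pulling a hidden token out of the login form, can be sketched with the standard library alone; the `_xsrf` field name and the HTML snippet below are made-up examples standing in for the real login page:

```python
from html.parser import HTMLParser

class HiddenInputParser(HTMLParser):
    """Collect the name/value pairs of <input type="hidden"> fields."""
    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("type") == "hidden":
            self.hidden[a.get("name")] = a.get("value", "")

# A made-up login page; a real one would come from the GET request
login_page = '<form><input type="hidden" name="_xsrf" value="abc123"/></form>'
parser = HiddenInputParser()
parser.feed(login_page)

# Merge the extracted token into the POST payload
formdata = {"email": "example@qq.com", "password": "123456"}
formdata["_xsrf"] = parser.hidden["_xsrf"]
```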
import scrapy

class RenrenSpider(scrapy.Spider):
    name = "renren"
    allowed_domains = ["renren.com"]
    start_urls = (
        "http://www.renren.com/PLogin.do",
    )

    # Handle the response of the login URL in start_urls and
    # extract the parameters the login needs (if any)
    def parse(self, response):
        # Extract the parameters required for login
        # _xsrf = response.xpath("//_xsrf").extract()[0]
        # Submit the form data and handle the result in the given callback
        yield scrapy.FormRequest.from_response(
            response,
            formdata={"email": "example@qq.com", "password": "123456"},  # , "_xsrf": _xsrf},
            callback=self.parse_page
        )

    # Logged in; now request a page that is only available after login
    def parse_page(self, response):
        url = "http://www.renren.com/422167102/profile"
        yield scrapy.Request(url, callback=self.parse_newpage)

    # Handle the response content
    def parse_newpage(self, response):
        # response.body is bytes, so open the file in binary mode
        with open("renren.html", "wb") as f:
            f.write(response.body)
- Simulating login directly with cookies saved from a logged-in session
class DoubanSpider(scrapy.Spider):
    name = 'douban'
    # allowed_domains = ['https://movie.douban.com/']
    base_url = 'https://movie.douban.com/top250?start={}&filter='
    offset = 0
    start_urls = [base_url.format(str(offset))]
    # Cookies copied from a logged-in browser session.
    # Duplicate names (__utmc/__utmz/__utma/__utmb appeared twice in the
    # browser) collapse in a Python dict, so only one value of each is kept.
    cookies = {
        "bid": "8LnDIRxQaG8",
        "__utmc": "223695111",
        "__utmz": "223695111.1611728609.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)",
        "ll": "118263",
        "_vwo_uuid_v2": "D3CFA16601DFDFD413F3A2CFF049C7D42|e9f36c04757a1c3f6f33b5c028565cc9",
        "_pk_ses.100001.4cf6": "*",
        "__utma": "223695111.26793853.1611728609.1611728609.1611735527.2",
        "__utmb": "223695111.0.10.1611735527",
        "ap_v": "0,6.0",
        "_pk_id.100001.4cf6": "28e712f08d4e9328.1611728608.2.1611737459.1611728695."
    }

    # Override the Spider's start_requests method so every request carries the cookies
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, cookies=self.cookies, callback=self.parse)
    def parse(self, response):
        # Extract every movie entry on the Top 250 list page
        # (the selectors here are assumptions about douban's markup)
        li_list = response.xpath("//ol[@class='grid_view']/li")
        for li in li_list:
            item = DoubanItem()  # assumed Item class defined in items.py
            # Detail-page URL of this movie
            detail_url = li.xpath(".//div[@class='hd']/a/@href").extract()[0]
            item["detail_url"] = detail_url
            # Follow-up requests must carry the saved cookies as well
            yield scrapy.Request(detail_url, callback=self.parse_info,
                                 cookies=self.cookies, meta={'item': item})
- Note: make sure that in settings.py
COOKIES_ENABLED (the cookies middleware) is enabled.
Either set COOKIES_ENABLED = True explicitly or leave the # COOKIES_ENABLED = False line commented out (it is enabled by default).
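A minimal settings.py excerpt for this (COOKIES_DEBUG is optional and shown only as a debugging aid):

```python
# settings.py -- relevant excerpt
# The cookies middleware is enabled by default; this line just makes it explicit
COOKIES_ENABLED = True

# Optional: log the cookies of every request/response while debugging logins
COOKIES_DEBUG = True
```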