Python Web Scraping: Simulated Login
Simulated Login to Gushiwen (古诗文网)
Official site: 古诗文网-古诗文经典传承 (gushiwen.cn)
Environment Setup
The requests library
pip install requests
CAPTCHA recognition library: ddddocr
pip install ddddocr
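Basic usage:
import ddddocr

ocr = ddddocr.DdddOcr()
# Read the CAPTCHA image as raw bytes
with open("test.jpg", 'rb') as f:
    image = f.read()
# classification() returns the recognized text as a string
res = ocr.classification(image)
print(res)
Detailed official usage: https://pypi.org/project/ddddocr/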
Random User-Agent
pip install fake_useragent
Usage:
from fake_useragent import UserAgent

ua = UserAgent()
# ua.random returns a different browser User-Agent string on each access
headers = {
    'user-agent': ua.random
}
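If fake_useragent is not installed, or fails to load its browser data, the script stops before sending any request. The sketch below is one possible way to guard against that. `build_headers` and `FALLBACK_UA` are hypothetical names that do not appear in the original post, and the fallback string is just an example browser User-Agent.

```python
# Fixed fallback string; any realistic browser User-Agent would do here.
FALLBACK_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0.0.0 Safari/537.36"
)


def build_headers(fallback: str = FALLBACK_UA) -> dict:
    """Return request headers with a random User-Agent when possible."""
    try:
        from fake_useragent import UserAgent
        user_agent = UserAgent().random
    except Exception:
        # Package missing or its data failed to load: use the fixed string.
        user_agent = fallback
    return {"user-agent": user_agent}


headers = build_headers()
```

The resulting `headers` dict can be passed to `requests` in the same way as the one built above.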
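Data Parsing
The login form needs three values from the login page, extracted with XPath (tree is the lxml tree parsed from the login page):
code (the CAPTCHA image src): tree.xpath("//img[@id='imgCode']/@src")[0]
__VIEWSTATE: tree.xpath("//*[@id='__VIEWSTATE']/@value")[0]
__VIEWSTATEGENERATOR: tree.xpath("//*[@id='__VIEWSTATEGENERATOR']/@value")[0]
Complete Code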
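The src attribute of the CAPTCHA image is a relative path, so it has to be turned into a full URL before it can be downloaded. A minimal sketch using the standard library's `urljoin` is shown below as an alternative to plain string concatenation. `absolute_captcha_url` is a hypothetical helper name, and the `/RandCode.ashx` path is an illustrative value only, not one taken from the original post.

```python
from urllib.parse import urljoin

LOGIN_URL = "https://so.gushiwen.cn/user/login.aspx"


def absolute_captcha_url(src: str, page_url: str = LOGIN_URL) -> str:
    """Resolve the <img> src against the page it was found on."""
    return urljoin(page_url, src)


# Illustrative src value; the real one comes from the XPath query.
print(absolute_captcha_url("/RandCode.ashx"))
```

`urljoin` also leaves an already absolute src unchanged, which string concatenation would not.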
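# Simulated login to Gushiwen (古诗文网)
import ddddocr  # CAPTCHA recognition library
import requests
from fake_useragent import UserAgent  # random User-Agent library
from lxml import etree  # HTML parsing; install with: pip install lxml

ocr = ddddocr.DdddOcr(old=True)
ua = UserAgent()
headers = {
    'user-agent': ua.random
}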
# Fetch the CAPTCHA image
# Login page (contains the CAPTCHA)
yanzhen_url = 'https://so.gushiwen.cn/user/login.aspx'
# Use one session so the CAPTCHA request and the login request share cookies
session = requests.Session()
response = session.get(url=yanzhen_url, headers=headers)
# Parse the login page
yanzhen_page_text = response.content
yanzhen_tree = etree.HTML(yanzhen_page_text)
# Build the CAPTCHA image URL
code_url = 'https://so.gushiwen.cn' + \
    yanzhen_tree.xpath("//img[@id='imgCode']/@src")[0]
# Extract the hidden form fields
view_state = yanzhen_tree.xpath("//*[@id='__VIEWSTATE']/@value")[0]
view_state_generator = yanzhen_tree.xpath(
    "//*[@id='__VIEWSTATEGENERATOR']/@value")[0]
code_img_data = session.get(url=code_url, headers=headers).content
# Save the CAPTCHA image so you can check whether the OCR result is correct.
# This step is optional and can be removed.
with open('验证码.jpg', 'wb') as fp:
    fp.write(code_img_data)
# Recognize the CAPTCHA
code = ocr.classification(code_img_data)
login_url = 'https://so.gushiwen.cn/user/login.aspx'
# Simulate the login
data = {
    '__VIEWSTATE': view_state,
    '__VIEWSTATEGENERATOR': view_state_generator,
    'from': 'http://so.gushiwen.cn/user/collect.aspx',
    'email': input('Email: '),
    'pwd': input('Password: '),
    'code': code,
    'denglu': '登录'
}
response = session.post(url=login_url, data=data, headers=headers)
# Save the returned page for inspection
with open('登录后页面.html', 'w', encoding='utf-8') as fp:
    fp.write(response.text)
print(data)  # note: this prints the password as well
print('Login request status:', response.status_code)
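If the saved post-login page (登录后页面.html) does not look right when opened, check:
- whether the email and password are correct
- whether the recognized CAPTCHA value (code) matches the image saved as 验证码.jpg
- whether the login request status is normal (200)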
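Because a misread CAPTCHA is a likely cause of failure, the checks above can be partly automated with a retry loop. The following is a minimal sketch under stated assumptions, not the original author's method. It assumes that a failed login re-renders login.aspx while a successful one redirects elsewhere, which the original post does not confirm. `looks_logged_in`, `login_with_retries`, and the `login_once` callable (standing in for one full attempt: fetch a fresh CAPTCHA, recognize it, post the form) are all hypothetical names.

```python
from typing import Callable, Tuple

LOGIN_PATH = "/user/login.aspx"


def looks_logged_in(status_code: int, final_url: str) -> bool:
    """Heuristic: HTTP 200 and no longer on the login page.

    Assumption: a failed login stays on login.aspx, while a successful
    one redirects elsewhere. Verify this against the real site.
    """
    return status_code == 200 and LOGIN_PATH not in final_url.lower()


def login_with_retries(login_once: Callable[[], Tuple[int, str]],
                       max_attempts: int = 3) -> bool:
    """Call login_once until the heuristic passes or attempts run out.

    Each attempt should fetch and recognize a fresh CAPTCHA.
    """
    for _ in range(max_attempts):
        status_code, final_url = login_once()
        if looks_logged_in(status_code, final_url):
            return True
    return False
```

With requests, `response.status_code` and `response.url` (the URL after any redirects) supply the two values that `login_once` should return.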
Copyright belongs to 瞌学家. Please credit the source when reposting.