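SPIDER-DAY04: requests.post requests and proxies
1. Proxy parameters
【1】Definition
An IP address that connects to the network in place of your own IP address
【2】Purpose
Hide your real IP address so that it does not get banned
【3】Sites that provide proxy IPs
快代理 (Kuaidaili), 全网代理, 代理精灵, ...
【4】Parameter type
proxies
proxies = { 'protocol':'protocol://IP:port' }
proxies = { 'protocol':'protocol://username:password@IP:port' }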
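Both formats above can be produced by one small helper. The following is only a minimal sketch: the name build_proxies and the sample address are made up for illustration, and it assumes a single proxy URL is reused for both the http and https keys, which suits plain HTTP proxies.

```python
def build_proxies(ip, port, username=None, password=None, scheme="http"):
    """Return a requests-style proxies dict for one proxy server (hypothetical helper)."""
    # Credentials are optional: ordinary proxies have none, private ones need both.
    auth = "{}:{}@".format(username, password) if username and password else ""
    proxy_url = "{}://{}{}:{}".format(scheme, auth, ip, port)
    # Assumption: the same proxy URL serves both HTTP and HTTPS target sites.
    return {"http": proxy_url, "https": proxy_url}


if __name__ == "__main__":
    print(build_proxies("1.1.1.1", 8888))
    print(build_proxies("1.1.1.1", 8888, "user", "secret"))
```

The returned dict can be passed directly as the proxies argument of requests.get() or requests.post().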
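1.2 Proxy categories
1.2.1 Ordinary proxies
【1】Proxy format
proxies = { 'protocol':'protocol://IP:port' }
【2】Use a free ordinary proxy IP to access the test site: http://httpbin.org/get
import requests

url = 'http://httpbin.org/get'
headers = {'User-Agent': 'Mozilla/5.0'}
# Define the proxy; look up a free proxy IP on a proxy-listing site.
# Note: with recent urllib3 versions an 'https://' proxy URL means the connection to the
# proxy itself uses TLS; for a plain HTTP proxy, 'http://IP:port' under the 'https' key may be needed.
proxies = {
    'http': 'http://112.85.164.220:9999',
    'https': 'https://112.85.164.220:9999'
}
html = requests.get(url, proxies=proxies, headers=headers, timeout=5).text
print(html)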
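httpbin.org/get echoes the caller's address in the origin field of its JSON reply, so one way to confirm that the proxy was actually used is to compare that field with the proxy IP. A minimal sketch follows; the helper name and the sample payload are illustrative only.

```python
def went_through_proxy(payload, proxy_ip):
    """Check whether httpbin's 'origin' field reports the proxy's IP (hypothetical helper).

    payload is the parsed JSON from http://httpbin.org/get, e.g. resp.json().
    """
    origin = payload.get("origin", "")
    # origin may hold several comma-separated addresses when the request was forwarded.
    seen = [part.strip() for part in origin.split(",") if part.strip()]
    return proxy_ip in seen


if __name__ == "__main__":
    sample = {"origin": "112.85.164.220", "url": "http://httpbin.org/get"}
    print(went_through_proxy(sample, "112.85.164.220"))
```

A False result only says that the reported address differs from the one configured; some proxies exit through a different IP than the one you connect to.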
1.2.2 Private proxies and dedicated proxies
【1】Proxy format
proxies = { 'protocol':'protocol://username:password@IP:port' }
【2】Use a private or dedicated proxy IP to access the test site: http://httpbin.org/get
import requests

url = 'http://httpbin.org/get'
proxies = {
    'http': 'http://309435365:szayclhp@106.75.71.140:16816',
    'https': 'https://309435365:szayclhp@106.75.71.140:16816',
}
headers = {
    'User-Agent': 'Mozilla/5.0',
}
html = requests.get(url, proxies=proxies, headers=headers, timeout=5).text
print(html)
1.3 Building a proxy IP pool
"""
Build a proxy IP pool - open proxies
"""
import requests
from fake_useragent import UserAgent


class ProxyPool:
    def __init__(self):
        self.api_url = 'your 快代理 (Kuaidaili) API link'
        self.test_url = 'http://httpbin.org/get'
        self.headers = {'User-Agent': UserAgent().random}

    def get_proxy(self):
        """Fetch proxy IPs from the API and test each one"""
        html = requests.get(url=self.api_url,
                            headers=self.headers).text
        proxy_list = html.split('\r\n')
        for proxy in proxy_list:
            # proxy: 1.1.1.1:8888
            if proxy:  # skip empty entries left over from the split
                self.test_proxy(proxy)

    def test_proxy(self, proxy):
        """Test whether one proxy IP is usable"""
        proxies = {
            'http': 'http://{}'.format(proxy),
            'https': 'https://{}'.format(proxy)
        }
        try:
            resp = requests.get(url=self.test_url,
                                proxies=proxies,
                                headers=self.headers,
                                timeout=3)
            # Treat HTTP error statuses as failures too, not only connection errors
            resp.raise_for_status()
            print(proxy, '\033[31musable\033[0m')
        except requests.RequestException:
            print(proxy, 'unusable')


if __name__ == '__main__':
    spider = ProxyPool()
    spider.get_proxy()
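The pool above splits the API response on '\r\n'. If the line endings vary or the response contains blank or malformed lines, a slightly more defensive parse may help. This is a sketch under the assumption that the API returns one ip:port entry per line, as in the comment above; parse_proxy_list is a hypothetical name.

```python
import re

# Matches "IP:port" such as 1.1.1.1:8888 (format check only; octet ranges are not validated).
PROXY_RE = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}:\d{1,5}$")


def parse_proxy_list(text):
    """Split an API response into a de-duplicated list of 'ip:port' strings."""
    proxies = []
    for line in text.splitlines():  # handles both '\r\n' and '\n'
        line = line.strip()
        if PROXY_RE.match(line) and line not in proxies:
            proxies.append(line)
    return proxies


if __name__ == "__main__":
    print(parse_proxy_list("1.1.1.1:8888\r\n2.2.2.2:80\n\nbad line\r\n"))
```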
2. requests.post()
2.1 POST requests
【1】Use case: sites that take POST requests
【2】Parameter: data={}
2.1) Form data: a dict
2.2) res = requests.post(url=url,data=data,headers=headers)
【3】Characteristic of POST requests: the data is submitted as a form
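When data is a dict, requests sends it as an application/x-www-form-urlencoded body. The standard library's urlencode produces the same kind of string, which makes it easy to see what a form body looks like without sending anything. A small sketch with made-up field values:

```python
from urllib.parse import urlencode, parse_qs


def form_body(data):
    """Encode a dict the way a form submission does (key=value pairs joined by &)."""
    return urlencode(data)


if __name__ == "__main__":
    data = {"i": "tiger", "from": "AUTO", "to": "AUTO"}
    body = form_body(data)
    print(body)
    # Decoding the body gives the original fields back (each value wrapped in a list).
    print(parse_qs(body))
```

This is the string that shows up under Form Data (view source) in the F12 panel for a POST request.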
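2.2 Capturing packets in the browser console
- How to open it and commonly used options
【1】Open the browser, press F12 to open the console, and find the Network tab
【2】Commonly used console options
2.1) Network: captures network packets
a> ALL: captures all network packets
b> XHR: captures asynchronously loaded network packets
c> JS: captures all JS files
2.2) Sources: pretty-prints JavaScript code and lets you debug it with breakpoints, which helps when analyzing some of a spider's parameters
2.3) Console: interactive mode for testing JavaScript code
【3】After capturing a specific network packet
3.1) Click the packet's address on the left to open its details, then look at the right-hand pane
3.2) Right-hand pane:
a> Headers: the complete request information
General, Response Headers, Request Headers, Query String, Form Data
b> Preview: a preview of the response content
c> Response: the response content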
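3. Youdao Translate spider
3.1 Project requirements
Crack the Youdao Translate interface and scrape the translation result
# Sample output
Enter the word to translate: elephant
Translation: 大象
*************************
Enter the word to translate: 喵喵叫
Translation: mews
3.2 Project analysis workflow
【1】Prepare to capture: press F12 to open the console and refresh the page
【2】Find the address
2.1) Enter a word to translate on the page, capture the network packets in the console, then find and analyze the address that returns the translation data
F12-Network-XHR-Headers-General-Request URL
【3】Find the pattern
3.1) Once you have the address that returns the data, enter a few more words on the page and find the corresponding URLs
3.2) Compare Network - All (or XHR) - Form Data across requests to find the pattern
【4】Find the JS encryption file
Top-right corner of the console ...->Search->search for a keyword->click the result->jump to Sources, then click the pretty-print symbol {} at the bottom left
【5】Read the JS code
Search for the keyword, find the relevant encryption method, and implement the algorithm in Python
【6】Debug with breakpoints
If some parameters in the JS code are unclear, inspect them by debugging with breakpoints
【7】Implement the JS encryption algorithm in Python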
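3.3 Project steps
1) Start F12 packet capture and find the form data, as follows:
i: 喵喵叫
from: AUTO
to: AUTO
smartresult: dict
client: fanyideskweb
salt: 15614112641250
sign: 94008208919faa19bd531acde36aac5d
ts: 1561411264125
bv: f4d62a2579ebb44874d7ef93ba47e822
doctype: json
version: 2.1
keyfrom: fanyi.web
action: FY_BY_REALTlME
2) Translate a few more words on the page and watch how the form data changes
salt: 15614112641250
sign: 94008208919faa19bd531acde36aac5d
ts: 1561411264125
bv: f4d62a2579ebb44874d7ef93ba47e822 # the value of bv, however, does not change
3) The encryption is usually done by a local JS file: refresh the page, find the JS file, and analyze its code
Top-right corner of the console - Search - search for salt - open the file - pretty-print
【Result】: the relevant JS file turns out to be fanyi.min.js
4) Open the JS file, analyze the encryption algorithm, and implement it in Python
【ts】Analysis shows it is a 13-digit timestamp, as a string
JS) "" + (new Date).getTime()
Python) str(int(time.time() * 1000))
【salt】ts + a random digit from 0 to 9 (as a string)
JS) ts + parseInt(10 * Math.random(), 10);
Python) ts + str(random.randint(0, 9))
【sign】(set a breakpoint to inspect the value of e; e turns out to be the word being translated)
JS) n.md5("fanyideskweb" + e + salt + "Tbh5E8=q6U3EXe+&L[4c@")
Python)
from hashlib import md5
# word is the word being translated (e in the JS code)
string = "fanyideskweb" + word + salt + "Tbh5E8=q6U3EXe+&L[4c@"
m = md5()
m.update(string.encode())
sign = m.hexdigest()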
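The values captured in step 1 can be checked against these rules: salt should be ts plus one digit, and sign should look like an MD5 hex digest. The sketch below is illustrative only; make_params and matches_rules are hypothetical names, the key string is the one copied from the JS snippet above, and the now and digit arguments exist only to make runs repeatable.

```python
import random
import time
from hashlib import md5

SECRET = "Tbh5E8=q6U3EXe+&L[4c@"  # key string copied from the JS snippet in these notes


def make_params(word, now=None, digit=None):
    """Build ts, salt and sign for one word; now/digit can be fixed for repeatable checks."""
    ts = str(int((time.time() if now is None else now) * 1000))
    salt = ts + str(random.randint(0, 9) if digit is None else digit)
    sign = md5(("fanyideskweb" + word + salt + SECRET).encode()).hexdigest()
    return ts, salt, sign


def matches_rules(ts, salt, sign):
    """True when the three values have the shapes described above."""
    return (len(ts) == 13 and ts.isdigit()
            and len(salt) == 14 and salt[:13] == ts and salt[13].isdigit()
            and len(sign) == 32 and all(c in "0123456789abcdef" for c in sign))


if __name__ == "__main__":
    # Values copied from the captured form data in step 1.
    print(matches_rules("1561411264125", "15614112641250",
                        "94008208919faa19bd531acde36aac5d"))
    print(matches_rules(*make_params("tiger")))
```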
5) Use a regex in PyCharm to convert the headers and form data
【1】How to open it in PyCharm: press Ctrl + R and tick Regex
【2】Convert the headers and form data
Find: (.*): (.*)
Replace: "$1": "$2",
【3】Click Replace All
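The same conversion can also be done in Python instead of the editor. This is a minimal sketch: raw_to_dict is a hypothetical name, and unlike the greedy regex above it splits each line at the first ': ' only.

```python
def raw_to_dict(raw):
    """Turn 'Name: value' lines copied from the F12 panel into a dict."""
    result = {}
    for line in raw.splitlines():
        line = line.strip()
        if ": " not in line:
            continue  # skip blank or malformed lines
        key, value = line.split(": ", 1)  # split at the first ': ' only
        result[key] = value
    return result


if __name__ == "__main__":
    raw = """
    Referer: http://fanyi.youdao.com/
    User-Agent: Mozilla/5.0
    """
    print(raw_to_dict(raw))
```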
3.4 Implementation
"""
Enter the word to translate: tiger
Translation: 老虎
"""
import requests
import time
from hashlib import md5
import random


class YdSpider:
    def __init__(self):
        # The URL must be the POST address captured with F12
        self.post_url = 'http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule'
        self.headers = {
            # The three headers that sites check most often: Cookie, Referer, User-Agent
            "Cookie": "OUTFOX_SEARCH_USER_ID=1391264118@10.108.160.105; OUTFOX_SEARCH_USER_ID_NCOO=2105417985.4787014; JSESSIONID=aaasSeD7PiY4G_nO8cWDx; SESSION_FROM_COOKIE=unknown; ___rl__test__cookies=1612506057171",
            "Referer": "http://fanyi.youdao.com/",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.146 Safari/537.36",
        }
        # Word to translate
        self.word = input('Enter the word to translate: ')

    def md5_string(self, string):
        """Helper: return the MD5 hex digest of a string"""
        m = md5()
        m.update(string.encode())
        return m.hexdigest()

    def get_ts_salt_sign(self):
        """Compute ts, salt and sign"""
        ts = str(int(time.time() * 1000))
        salt = ts + str(random.randint(0, 9))
        string = "fanyideskweb" + self.word + salt + "Tbh5E8=q6U3EXe+&L[4c@"
        sign = self.md5_string(string)
        return ts, salt, sign

    def attack_yd(self):
        """Main logic"""
        ts, salt, sign = self.get_ts_salt_sign()
        data = {
            "i": self.word,
            "from": "AUTO",
            "to": "AUTO",
            "smartresult": "dict",
            "client": "fanyideskweb",
            "salt": salt,
            "sign": sign,
            # Sent as lts here; the form data captured in 3.3 shows this field as ts
            "lts": ts,
            "bv": "6a1ac4a5cc37a3de2c535a36eda9e149",
            "doctype": "json",
            "version": "2.1",
            "keyfrom": "fanyi.web",
            "action": "FY_BY_REALTlME",
        }
        # .json(): parses the JSON response body into Python data types
        # .json() is equivalent to json.loads(response.text)
        html = requests.post(url=self.post_url,
                             data=data,
                             headers=self.headers).json()
        return html['translateResult'][0][0]['tgt']


if __name__ == '__main__':
    spider = YdSpider()
    print(spider.attack_yd())
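The code above reads only html['translateResult'][0][0]['tgt'] and raises if the key is missing, for example when the server rejects the request. A more forgiving reader might look like the sketch below. It assumes the nesting implied by that index expression (a list of lists of dicts with a tgt field); the function name and the sample payload are made up.

```python
def extract_translation(payload):
    """Join every translated segment in a response dict; return None if the key is missing."""
    result = payload.get("translateResult")
    if not result:
        return None
    # Assumed shape: a list of groups, each group a list of dicts carrying a 'tgt' field.
    return "\n".join("".join(seg.get("tgt", "") for seg in group) for group in result)


if __name__ == "__main__":
    sample = {"translateResult": [[{"src": "tiger", "tgt": "老虎"}]]}  # hypothetical payload
    print(extract_translation(sample))
    print(extract_translation({}))
```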
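4. Baidu Translate spider with JS reverse engineering
4.1 JS reverse engineering in detail
【1】Use case
When the JS encryption code is too complex to reimplement in Python, consider running the JS itself from Python
【2】Module
2.1) Module name: execjs
2.2) Install: sudo pip3 install pyexecjs
2.3) Usage
import execjs
with open('xxx.js', 'r') as f:
    js_code = f.read()
js_obj = execjs.compile(js_code)
js_obj.eval('function_name("argument")')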
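eval() takes a string of JS source, so an argument that contains a quote or a backslash can break an expression built with plain string formatting. One possible way around this is to encode each argument as a JSON literal first. A sketch, with js_call_expr as a hypothetical helper name:

```python
import json


def js_call_expr(func_name, *args):
    """Build a JS call expression with every argument encoded as a JSON literal."""
    # json.dumps escapes quotes and backslashes; the resulting literals are valid JS literals.
    encoded = ", ".join(json.dumps(arg) for arg in args)
    return "{}({})".format(func_name, encoded)


if __name__ == "__main__":
    print(js_call_expr("e", "hello", "320305.131321201"))
    print(js_call_expr("e", 'say "hi"', "320305.131321201"))
```

The resulting string can then be passed to eval, for example js_obj.eval(js_call_expr('function_name', 'argument')).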
4.2 Debugging the JS code
- Save the captured JS encryption code to translate.js
// e(r, gtk): the gtk parameter was added
// i = window[l] was changed to i = gtk
function a(r) {
    if (Array.isArray(r)) {
        for (var o = 0, t = Array(r.length); o < r.length; o++)
            t[o] = r[o];
        return t
    }
    return Array.from(r)
}
function n(r, o) {
    for (var t = 0; t < o.length - 2; t += 3) {
        var a = o.charAt(t + 2);
        a = a >= "a" ? a.charCodeAt(0) - 87 : Number(a),
        a = "+" === o.charAt(t + 1) ? r >>> a : r << a,
        r = "+" === o.charAt(t) ? r + a & 4294967295 : r ^ a
    }
    return r
}
function e(r, gtk) {
    var o = r.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g);
    if (null === o) {
        var t = r.length;
        t > 30 && (r = "" + r.substr(0, 10) + r.substr(Math.floor(t / 2) - 5, 10) + r.substr(-10, 10))
    } else {
        for (var e = r.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/), C = 0, h = e.length, f = []; h > C; C++)
            "" !== e[C] && f.push.apply(f, a(e[C].split(""))),
            C !== h - 1 && f.push(o[C]);
        var g = f.length;
        g > 30 && (r = f.slice(0, 10).join("") + f.slice(Math.floor(g / 2) - 5, Math.floor(g / 2) + 5).join("") + f.slice(-10).join(""))
    }
    var u = void 0
      , l = "" + String.fromCharCode(103) + String.fromCharCode(116) + String.fromCharCode(107);
    u = null !== i ? i : (i = gtk || "") || "";
    for (var d = u.split("."), m = Number(d[0]) || 0, s = Number(d[1]) || 0, S = [], c = 0, v = 0; v < r.length; v++) {
        var A = r.charCodeAt(v);
        128 > A ? S[c++] = A : (2048 > A ? S[c++] = A >> 6 | 192 : (55296 === (64512 & A) && v + 1 < r.length && 56320 === (64512 & r.charCodeAt(v + 1)) ? (A = 65536 + ((1023 & A) << 10) + (1023 & r.charCodeAt(++v)),
        S[c++] = A >> 18 | 240,
        S[c++] = A >> 12 & 63 | 128) : S[c++] = A >> 12 | 224,
        S[c++] = A >> 6 & 63 | 128),
        S[c++] = 63 & A | 128)
    }
    for (var p = m, F = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(97) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(54)), D = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(51) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(98)) + ("" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(102)), b = 0; b < S.length; b++)
        p += S[b],
        p = n(p, F);
    return p = n(p, D),
    p ^= s,
    0 > p && (p = (2147483647 & p) + 2147483648),
    p %= 1e6,
    p.toString() + "." + (p ^ m)
}
var i = null;
- test_translate.py: debug the JS file
import execjs

with open('translate.js', 'r', encoding='utf-8') as f:
    jscode = f.read()
jsobj = execjs.compile(jscode)
sign = jsobj.eval('e("hello","320305.131321201")')
print(sign)
4.3 Baidu Translate implementation
"""
Baidu Translate case study - JS reverse engineering (execjs module)
"""
import requests
import execjs
import re


class BdSpider:
    def __init__(self):
        # url: the POST URL captured with F12
        self.post_url = 'https://fanyi.baidu.com/v2transapi?from=en&to=zh'
        self.post_headers = {
'''Accept''': '''*/*''',
'''Accept-Encoding''': '''gzip, deflate, br''',
'''Accept-Language''': '''zh-CN,zh;q=0.9''',
'''Cache-Control''': '''no-cache''',
'''Connection''': '''keep-alive''',
'''Content-Length''': '''135''',
'''Content-Type''': '''application/x-www-form-urlencoded; charset=UTF-8''',
'''Cookie''': '''BIDUPSID=F29F6E28B0FBCFC1A7E211DE92583124; PSTM=1608988098; BAIDUID=F29F6E28B0FBCFC153ABCCC1188CE960:FG=1; FANYI_WORD_SWITCH=1; REALTIME_TRANS_SWITCH=1; HISTORY_SWITCH=1; SOUND_SPD_SWITCH=1; SOUND_PREFER_SWITCH=1; __yjs_duid=1_40d64069dbd8841dbfa1e003a7c7dd8a1611644719974; BDSFRCVID_BFESS=WauOJeC627FxfbveyBYtbHBDYpJWrfTTH6aoVfeprTJ6BRthQWFTEG0P8M8g0KubvwPHogKKBmOTHgLF_2uxOjjg8UtVJeC6EG0Ptf8g0M5; H_BDCLCKID_SF_BFESS=tJI8oK0XJD-3fP36qR6sMJI0hU5054RB2C6yX4K8Kb7VbU3p5fnkbfJBDGJlhP5abDT85fT2KqnIsMTxylJKbl07yajK2h5h56RH-pFK0foMjMj5QlOpQT8re-FOK5OibCrqLMJdab3vOpozXpO1bUAzBN5thURB2DkO-4bCWJ5TMl5jDh3Mb6ksD-FtqtJHKbDOVC_K3D; BDUSS=U5hRU50YVFPZmdLbnZ3QTlZYWJDdXpkZ2VRSUtDZzl3V0o4TEY1Z3d5YVhlMEJnRVFBQUFBJCQAAAAAAAAAAAEAAAADZykQv9u~2zMwOTQzNTM2NQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAJfuGGCX7hhge; BDUSS_BFESS=U5hRU50YVFPZmdLbnZ3QTlZYWJDdXpkZ2VRSUtDZzl3V0o4TEY1Z3d5YVhlMEJnRVFBQUFBJCQAAAAAAAAAAAEAAAADZykQv9u~2zMwOTQzNTM2NQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAJfuGGCX7hhge; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; Hm_lvt_64ecd82404c51e03dc91cb9e8c025574=1610182098,1610437054,1610501127,1612504926; H_PS_PSSID=33423_33439_33273_31660_33595_33540_33590_26350; BDRCVFR[feWj1Vr5u3D]=I67x6TjHwwYf0; delPer=0; PSINO=2; BA_HECTOR=8l2kag0h21052l24g91g1ps820r; Hm_lpvt_64ecd82404c51e03dc91cb9e8c025574=1612509484; __yjsv5_shitong=1.0_7_7b5b5c8481cf67b7c1053df9dfdc2105d925_300_1612509484292_101.30.65.196_f2e2fdf6; ab_sr=1.0.0_MGIyNDA3Mjk2NGU4NjAxZjkzYzU5YjQ4Mjg3YjJmMTFjMzRjY2E0Y2EwYWE5YTllZGE2Yjk5NmM2M2RjZmViMjUwMjIyZGJlODNhZDJkOTk0YjNkMjRiNTE0NjM4YzEx''',
'''Host''': '''fanyi.baidu.com''',
'''Origin''': '''https://fanyi.baidu.com''',
'''Pragma''': '''no-cache''',
'''Referer''': '''https://fanyi.baidu.com/''',
'''sec-ch-ua''': '''"Chromium";v="88", "Google Chrome";v="88", ";Not A Brand";v="99"''',
'''sec-ch-ua-mobile''': '''?0''',
'''Sec-Fetch-Dest''': '''empty''',
'''Sec-Fetch-Mode''': '''cors''',
'''Sec-Fetch-Site''': '''same-origin''',
'''User-Agent''': '''Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.146 Safari/537.36''',
'''X-Requested-With''': '''XMLHttpRequest''',
        }
        self.word = input('Enter the word to translate: ')
        # Used to fetch gtk and token
        self.get_url = 'https://fanyi.baidu.com/'
        self.get_headers = {
'''Accept''': '''text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9''',
'''Accept-Encoding''': '''gzip, deflate, br''',
'''Accept-Language''': '''zh-CN,zh;q=0.9''',
'''Cache-Control''': '''no-cache''',
'''Connection''': '''keep-alive''',
'''Cookie''': '''BIDUPSID=F29F6E28B0FBCFC1A7E211DE92583124; PSTM=1608988098; BAIDUID=F29F6E28B0FBCFC153ABCCC1188CE960:FG=1; FANYI_WORD_SWITCH=1; REALTIME_TRANS_SWITCH=1; HISTORY_SWITCH=1; SOUND_SPD_SWITCH=1; SOUND_PREFER_SWITCH=1; __yjs_duid=1_40d64069dbd8841dbfa1e003a7c7dd8a1611644719974; BDSFRCVID_BFESS=WauOJeC627FxfbveyBYtbHBDYpJWrfTTH6aoVfeprTJ6BRthQWFTEG0P8M8g0KubvwPHogKKBmOTHgLF_2uxOjjg8UtVJeC6EG0Ptf8g0M5; H_BDCLCKID_SF_BFESS=tJI8oK0XJD-3fP36qR6sMJI0hU5054RB2C6yX4K8Kb7VbU3p5fnkbfJBDGJlhP5abDT85fT2KqnIsMTxylJKbl07yajK2h5h56RH-pFK0foMjMj5QlOpQT8re-FOK5OibCrqLMJdab3vOpozXpO1bUAzBN5thURB2DkO-4bCWJ5TMl5jDh3Mb6ksD-FtqtJHKbDOVC_K3D; BDUSS=U5hRU50YVFPZmdLbnZ3QTlZYWJDdXpkZ2VRSUtDZzl3V0o4TEY1Z3d5YVhlMEJnRVFBQUFBJCQAAAAAAAAAAAEAAAADZykQv9u~2zMwOTQzNTM2NQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAJfuGGCX7hhge; BDUSS_BFESS=U5hRU50YVFPZmdLbnZ3QTlZYWJDdXpkZ2VRSUtDZzl3V0o4TEY1Z3d5YVhlMEJnRVFBQUFBJCQAAAAAAAAAAAEAAAADZykQv9u~2zMwOTQzNTM2NQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAJfuGGCX7hhge; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; Hm_lvt_64ecd82404c51e03dc91cb9e8c025574=1610182098,1610437054,1610501127,1612504926; H_PS_PSSID=33423_33439_33273_31660_33595_33540_33590_26350; BDRCVFR[feWj1Vr5u3D]=I67x6TjHwwYf0; delPer=0; PSINO=2; Hm_lpvt_64ecd82404c51e03dc91cb9e8c025574=1612514819; __yjsv5_shitong=1.0_7_7b5b5c8481cf67b7c1053df9dfdc2105d925_300_1612514819989_101.30.65.196_430628f1; ab_sr=1.0.0_NjNjMjA3YzcyNDdiMzE4Njk5MGRkNjY1ZTY2YmFiNTI4MzE2ODQ3ZDIwYjBmNGRlZWFjODgyOGFjMGY0ZTQ3ODVlM2MxNDYxMjQ2ZWYzZGFkM2EzYWFjZjYyM2RkY2Vi''',
'''Host''': '''fanyi.baidu.com''',
'''Pragma''': '''no-cache''',
'''sec-ch-ua''': '''"Chromium";v="88", "Google Chrome";v="88", ";Not A Brand";v="99"''',
'''sec-ch-ua-mobile''': '''?0''',
'''Sec-Fetch-Dest''': '''document''',
'''Sec-Fetch-Mode''': '''navigate''',
'''Sec-Fetch-Site''': '''none''',
'''Sec-Fetch-User''': '''?1''',
'''Upgrade-Insecure-Requests''': '''1''',
'''User-Agent''': '''Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.146 Safari/537.36''',
        }

    def get_gtk_token(self):
        """Fetch gtk and token from the home page"""
        html = requests.get(url=self.get_url,
                            headers=self.get_headers).text
        gtk = re.findall(r"window\.gtk = '(.*?)'", html, re.S)[0]
        token = re.findall(r"token: '(.*?)'", html, re.S)[0]
        return gtk, token

    def get_sign(self, gtk):
        """Compute sign by running e(word, gtk) from translate.js"""
        with open('translate.js', 'r', encoding='utf-8') as f:
            jscode = f.read()
        jsobj = execjs.compile(jscode)
        # call() passes the arguments as values, so quotes in the word cannot break the JS expression
        sign = jsobj.call('e', self.word, gtk)
        return sign

    def attack_bd(self):
        """Main logic"""
        # Fetch gtk and token once and reuse them for both sign and the form data
        gtk, token = self.get_gtk_token()
        sign = self.get_sign(gtk)
        data = {
            "from": "en",
            "to": "zh",
            "query": self.word,
            "transtype": "realtime",
            "simple_means_flag": "3",
            "sign": sign,
            "token": token,
            "domain": "common",
        }
        html = requests.post(url=self.post_url,
                             data=data,
                             headers=self.post_headers).json()
        return html['trans_result']['data'][0]['dst']


if __name__ == '__main__':
    spider = BdSpider()
    print(spider.attack_bd())
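The gtk and token extraction can be tried offline against a saved copy of the page source. The sketch below uses the same two patterns as get_gtk_token but returns (None, None) instead of raising IndexError when the page layout has changed. The function name and the sample HTML are made up; the gtk value is the one used in test_translate.py.

```python
import re

GTK_RE = re.compile(r"window\.gtk = '(.*?)'")
TOKEN_RE = re.compile(r"token: '(.*?)'")


def extract_gtk_token(html):
    """Return (gtk, token) found in the page source, or (None, None) if either is missing."""
    gtk = GTK_RE.search(html)
    token = TOKEN_RE.search(html)
    if not gtk or not token:
        return None, None
    return gtk.group(1), token.group(1)


if __name__ == "__main__":
    # Hypothetical fragment shaped like the patterns expect.
    sample = "<script>window.gtk = '320305.131321201'; var common = { token: 'abc123', };</script>"
    print(extract_gtk_token(sample))
    print(extract_gtk_token("<html></html>"))
```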
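5. Homework
【1】Scrape the free high-anonymity proxies from the 快代理 (Kuaidaili) site and test which ones are usable, to build your own proxy IP pool
https://www.kuaidaili.com/free/
【2】Scrape KFC restaurant information (POST request practice)
2.1) URL: http://www.kfc.com.cn/kfccda/storelist/index.aspx
2.2) Data to scrape: restaurant number, restaurant name, restaurant address, city
2.3) Storage: save to a database
2.4) Expected behaviour:
Enter a city name: 北京
All KFC restaurants in 北京 are saved to the database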
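For the storage step in 【2】, one possible shape is sketched below. Everything in it is an assumption made for illustration: SQLite as the database, the table name kfc_store, the column names, and the sample rows. The scraping part of the exercise is deliberately left out.

```python
import sqlite3


def save_stores(conn, rows):
    """Insert (store_no, name, address, city) tuples, skipping numbers already saved."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS kfc_store ("
        "store_no TEXT PRIMARY KEY, name TEXT, address TEXT, city TEXT)"
    )
    # INSERT OR IGNORE keeps the first row for each store_no and drops duplicates.
    conn.executemany("INSERT OR IGNORE INTO kfc_store VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM kfc_store").fetchone()[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # in-memory database, for demonstration only
    rows = [("1", "Store A", "1 Example Road", "北京"),
            ("2", "Store B", "2 Example Road", "北京")]
    print(save_stores(conn, rows))
```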
