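Cookie, Session, Anti-Hotlinking, and Proxies
Cookie introduction
A cookie is a string stored on the client (the browser). On every request, the browser automatically sends the cookie information along to the server.
This matters most after a user logs in: so that the server can identify the logged-in user, the cookie is normally submitted together with the request, as part of the request headers.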
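To make this round trip concrete, here is a minimal sketch that uses only the standard library. It parses a couple of hypothetical `Set-Cookie` values (the names `sessionid` and `theme` are made up for illustration) and builds the `Cookie` header a client would send back. Real browsers also apply domain, path, and expiry rules, which this sketch deliberately ignores.

```python
from http.cookies import SimpleCookie


def cookie_header_from_set_cookie(set_cookie_values):
    """Build a Cookie request header from a list of Set-Cookie values."""
    pairs = []
    for raw in set_cookie_values:
        parsed = SimpleCookie()
        # Attributes such as Path or HttpOnly end up on the morsel, not as cookies
        parsed.load(raw)
        for name, morsel in parsed.items():
            pairs.append(f"{name}={morsel.value}")
    return "; ".join(pairs)


# Hypothetical values a server might send after a login
server_headers = [
    "sessionid=abc123; Path=/; HttpOnly",
    "theme=dark; Max-Age=3600",
]
print(cookie_header_from_set_cookie(server_headers))
```

Only the name=value pairs travel back to the server; attributes like `Path` and `HttpOnly` tell the client how to store and scope the cookie.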
Cookie example
For a one-off need, we can simply copy the cookie from the browser and drop it straight into headers.
# Fetch the novels saved to your own bookshelf
import requests
url = 'https://user.17k.com/ck/author/shelf?page=1&appKey=2406394919'
# A cookie obtained after authentication is required
header = {
    "Cookie": "GUID=aab25315-7a37-447a-9d8b-b6dfd2401c67; sajssdk_2015_cross_new_user=1; Hm_lvt_9793f42b498361373512340937deb2a0=1653114452; c_channel=0; c_csc=web; accessToken=avatarUrl%3Dhttps%253A%252F%252Fcdn.static.17k.com%252Fuser%252Favatar%252F14%252F34%252F44%252F96654434.jpg-88x88%253Fv%253D1653114636000%26id%3D96654434%26nickname%3D%25E4%25B9%25A6%25E5%258F%258Ba7R4heIe1%26e%3D1668667187%26s%3D2b4e71b1e45eb9e5; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2296654434%22%2C%22%24device_id%22%3A%22180e54cb8c8a1d-05ea0d1680cb16-1734337f-1327104-180e54cb8c9b32%22%2C%22props%22%3A%7B%7D%2C%22first_id%22%3A%22aab25315-7a37-447a-9d8b-b6dfd2401c67%22%7D; Hm_lpvt_9793f42b498361373512340937deb2a0=1653115895",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36"
}
resp = requests.get(url, verify=False, headers=header)
print(resp.text)
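As an alternative to pasting the whole string into headers, the copied cookie can be split into a dict and handed to the `cookies=` parameter of requests. The helper below is a small sketch; `cookie_str_to_dict` is a name chosen here for illustration, and the sample string is a shortened, made-up cookie.

```python
def cookie_str_to_dict(cookie_str):
    """Split a browser-copied Cookie header value into a name -> value dict."""
    cookies = {}
    for item in cookie_str.split(";"):
        item = item.strip()
        if not item:
            continue
        # Split on the first "=" only, because values may contain "=" themselves
        name, _, value = item.partition("=")
        cookies[name] = value
    return cookies


raw = "GUID=aab25315; c_channel=0; c_csc=web"  # shortened, hypothetical cookie string
print(cookie_str_to_dict(raw))
```

The resulting dict can then be passed as `requests.get(url, cookies=cookie_str_to_dict(raw), headers=header)`, which keeps the long cookie string out of the headers dict.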
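Session introduction
If you really do need to log in, you can use a session to keep the conversation alive.
What is a session? We do not need to care about the session on the server side. We only care about the session provided by the requests module in our crawler.
The requests module provides a session feature that keeps state across requests. That sounds like a mouthful, but it simply means that it manages and maintains cookies for us automatically. Note, however, that it can only maintain cookies returned in response headers (Set-Cookie). Cookies added dynamically by JavaScript are outside its control.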
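When a site sets a cookie from JavaScript, one workaround is to add it to the session by hand. The sketch below assumes a hypothetical cookie named `js_token` on the placeholder domain `example.com`. It prepares a request without sending it, so the attached `Cookie` header can be inspected offline.

```python
import requests

session = requests.Session()

# Hypothetical cookie that a page's JavaScript would normally set in the browser
session.cookies.set("js_token", "xyz", domain="example.com", path="/")

# Prepare (but do not send) a request to see which headers the session attaches
request = requests.Request("GET", "https://example.com/profile")
prepared = session.prepare_request(request)
print(prepared.headers.get("Cookie"))
```

How the value of such a cookie is computed depends entirely on the site's script, so in practice it has to be worked out from the page's JavaScript first.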
Logging in to obtain a cookie temporarily
import requests
url = 'https://passport.17k.com/ck/user/login'
dic = {
    "loginName": "13093740825",
    "password": "wjr996"
}
# Simulate the login
resp = requests.post(url, data=dic)
# print(resp.text)
# Get the cookie
# Option 1: dict-style access to the response headers, which yields a string
# print(resp.headers['Set-Cookie'])
# Option 2: take the cookies set by the login response
# print(resp.cookies)
# Example: reuse the cookies on a later request
urls = 'https://user.17k.com/ck/author/shelf?page=1&appKey=2406394919'
# Pass the login cookies along as a parameter
cookie_jar = resp.cookies  # a RequestsCookieJar
resp_num = requests.get(urls, cookies=cookie_jar)
print(resp_num.text)
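If logging in on every run is undesirable, the cookies from the login response could be saved and reloaded later. This is a rough sketch using `dict_from_cookiejar` and `cookiejar_from_dict` from `requests.utils`. The token value and the file name `cookies_demo.json` are placeholders, and a jar built from a dict stands in for `resp.cookies`. Converting to a dict drops domain, path, and expiry information, so this only suits simple cases.

```python
import json
import os
import tempfile

from requests.utils import cookiejar_from_dict, dict_from_cookiejar


def save_cookies(jar, path):
    """Write the name/value pairs of a cookie jar to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(dict_from_cookiejar(jar), f)


def load_cookies(path):
    """Rebuild a cookie jar from a JSON file written by save_cookies."""
    with open(path, encoding="utf-8") as f:
        return cookiejar_from_dict(json.load(f))


# Stand-in for resp.cookies from a real login response (hypothetical value)
login_cookies = cookiejar_from_dict({"accessToken": "hypothetical-token"})
cookie_file = os.path.join(tempfile.gettempdir(), "cookies_demo.json")

save_cookies(login_cookies, cookie_file)
restored = load_cookies(cookie_file)
print(dict_from_cookiejar(restored))
```

The restored jar can be passed to `requests.get(..., cookies=restored)` in the same way as `resp.cookies`.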
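Keeping state with a session
import requests
# Create a session object
session = requests.Session()
# Headers or cookies can be set on the session in advance
# (update() keeps the default headers that requests adds)
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36"
})
# Custom cookies, option 1 (needs: from requests.utils import cookiejar_from_dict)
# session.cookies = cookiejar_from_dict({
#     # Put cookie contents here; a dict is expected
# })
# Option 2: set the Cookie header directly
# session.headers["Cookie"] = "key1=value1; key2=value2"
# Simulate the login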
url = 'https://passport.17k.com/ck/user/login'
data = {
    "loginName": "13093740825",
    "password": "wjr996"
}
# After login the server returns Set-Cookie. Cookies returned this way are handled by the session automatically
session.post(url, data=data)  # Set-Cookie is processed automatically
# Every subsequent request carries the cookie
url = 'https://user.17k.com/ck/author/shelf?page=1&appKey=2406394919'
ses = session.get(url)
print(ses.text)
Referer anti-hotlinking
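Anti-hotlinking generally means the server inspects the `Referer` request header and refuses requests that do not appear to come from its own pages. The sketch below illustrates that kind of check from the server's point of view. The `ALLOWED_HOSTS` set and the `referer_allowed` function are assumptions for illustration only, and the actual logic used by any particular site is not known.

```python
from urllib.parse import urlparse

# Assumption: the hosts this hypothetical server accepts as referrers
ALLOWED_HOSTS = {"www.pearvideo.com"}


def referer_allowed(headers):
    """Return True when the Referer header points at an allowed host."""
    referer = headers.get("Referer", "")
    host = urlparse(referer).hostname  # None when the header is missing or empty
    return host in ALLOWED_HOSTS


print(referer_allowed({"Referer": "https://www.pearvideo.com/video_1756814"}))
print(referer_allowed({}))
```

From the crawler's side, the countermeasure is simply to send a `Referer` header that such a check would accept.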
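Referer example: downloading a video
# Anti-hotlinking: send the video page URL as the Referer header
import requests
while True:
    main_url = input("Enter the URL of the Pear Video page to crawl: ")  # e.g. "https://www.pearvideo.com/video_1756814"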
    contId = main_url.split("_")[-1]
    url = f'https://www.pearvideo.com/videoStatus.jsp?contId={contId}'
    header = {
        "Referer": main_url,
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36"
    }
    resp = requests.get(url, headers=header)
    # print(resp.text)
    dic = resp.json()
    # srcUrl contains the systemTime value; replace it with cont-<contId> to get the URL that is downloaded
    src_url = dic['videoInfo']['videos']['srcUrl']
    systemtime = dic['systemTime']
    src = src_url.replace(systemtime, f'cont-{contId}')
    print(src)
    # Download the video
    resp = requests.get(src, headers=header)
    with open(f'{contId}.mp4', 'wb') as w:
        w.write(resp.content)
    print("Download complete")
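The key step in the download example is the string replacement that turns `srcUrl` into the URL that is actually fetched. It can be isolated and tried offline, as in the sketch below. The function name `real_video_url` and the sample URL are hypothetical and only shaped like the JSON fields used in the example. The real response format may differ or change over time.

```python
def real_video_url(src_url, system_time, cont_id):
    """Swap the timestamp segment in srcUrl for the cont-<id> segment."""
    return src_url.replace(system_time, f"cont-{cont_id}")


# Hypothetical values shaped like the 'srcUrl' and 'systemTime' fields
fake_src = "https://video.example.com/mp4/1653115895000-12345-hd.mp4"
print(real_video_url(fake_src, "1653115895000", "1756814"))
```

Keeping this step in its own function makes it easy to adjust if the site changes how the URL is disguised.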
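Proxies
When we crawl the same website repeatedly, the high request rate may well lead the server to block our IP as an anti-crawling measure. The countermeasure is to disguise the requests by routing them through a network proxy.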
import requests
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36",
}
# Prepare the proxies
dic = {
    "http": "http://223.96.90.216:8085",
    "https": "https://223.96.90.216:8085",
}
# proxies = the proxy mapping
resp = requests.get("http://www.baidu.com/s?ie=UTF-8&wd=ip", proxies=dic, headers=headers)
print(resp.text)
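A single proxy can itself be blocked or go offline, so one possible extension is to rotate through several. The sketch below is illustrative only: `proxy_dict` and `fetch_via_rotating_proxies` are names invented here, and the addresses in the usage comment are placeholders. It also assumes plain HTTP proxies, which is why the `https` entry uses an `http://` proxy URL, and that may not match every proxy provider.

```python
import itertools

import requests


def proxy_dict(address):
    """Build a requests-style proxies mapping for one host:port address."""
    # Assumption: a plain HTTP proxy that also tunnels HTTPS traffic
    return {"http": f"http://{address}", "https": f"http://{address}"}


def fetch_via_rotating_proxies(url, addresses, max_attempts=3, timeout=5):
    """Try the URL through each proxy in turn until one request succeeds."""
    if not addresses:
        raise ValueError("at least one proxy address is required")
    pool = itertools.cycle(addresses)
    last_error = None
    for _ in range(max_attempts):
        address = next(pool)
        try:
            resp = requests.get(url, proxies=proxy_dict(address), timeout=timeout)
            resp.raise_for_status()
            return resp
        except requests.RequestException as exc:
            last_error = exc  # remember the failure and move on to the next proxy
    raise RuntimeError(f"all {max_attempts} attempts failed") from last_error


# Usage with placeholder addresses (not real proxies):
# resp = fetch_via_rotating_proxies("http://www.baidu.com/s?ie=UTF-8&wd=ip",
#                                   ["203.0.113.10:8080", "203.0.113.11:8080"])
```

Setting a timeout matters here, because free proxies often hang rather than fail quickly.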

 
                
             