Cookies, Sessions, Referer Anti-Hotlinking, and Proxies

Introduction to Cookies

A cookie is simply a string stored on the client side (the browser). On every request, the browser automatically attaches the cookie information and sends it to the server.
This matters especially after a user logs in: so that the server can correctly identify the logged-in user, the cookie is normally submitted together with the request headers on each request.

If we only need it temporarily, we can also copy the cookie straight out of the browser and drop it into headers.

# Scrape info about the novels I have saved to my bookshelf
import requests

url = 'https://user.17k.com/ck/author/shelf?page=1&appKey=2406394919'

# We need the cookie obtained after logging in (copied from the browser)
header = {
    "Cookie": "GUID=aab25315-7a37-447a-9d8b-b6dfd2401c67; sajssdk_2015_cross_new_user=1; Hm_lvt_9793f42b498361373512340937deb2a0=1653114452; c_channel=0; c_csc=web; accessToken=avatarUrl%3Dhttps%253A%252F%252Fcdn.static.17k.com%252Fuser%252Favatar%252F14%252F34%252F44%252F96654434.jpg-88x88%253Fv%253D1653114636000%26id%3D96654434%26nickname%3D%25E4%25B9%25A6%25E5%258F%258Ba7R4heIe1%26e%3D1668667187%26s%3D2b4e71b1e45eb9e5; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2296654434%22%2C%22%24device_id%22%3A%22180e54cb8c8a1d-05ea0d1680cb16-1734337f-1327104-180e54cb8c9b32%22%2C%22props%22%3A%7B%7D%2C%22first_id%22%3A%22aab25315-7a37-447a-9d8b-b6dfd2401c67%22%7D; Hm_lpvt_9793f42b498361373512340937deb2a0=1653115895",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36"
}


resp = requests.get(url, verify=False, headers=header)  # verify=False skips SSL certificate verification
print(resp.text)
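
If you would rather not stuff the whole string into the headers, the same copied cookie can be broken into a dict and passed through the cookies= parameter instead. A minimal sketch, using placeholder cookie values rather than a real login cookie:

import requests

# Placeholder string copied from the browser's developer tools (hypothetical values)
raw_cookie = "GUID=xxxx; accessToken=yyyy"

# Turn "k1=v1; k2=v2" into the dict that requests expects
cookie_dict = dict(pair.split("=", 1) for pair in raw_cookie.split("; "))

url = 'https://user.17k.com/ck/author/shelf?page=1&appKey=2406394919'
resp = requests.get(url, cookies=cookie_dict, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36"
})
print(resp.text)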

Introduction to Sessions

If we genuinely need to log in, we can use a session to keep the conversation going.

So what exactly is a session? We are not concerned with the server-side session here; we only care about the session that the requests module provides.
requests offers a Session object that "maintains the conversation", which sounds convoluted but really just means it automatically manages and maintains cookies for us. Note, however, that it can only maintain cookies returned in the response headers (Set-Cookie); cookies added dynamically by JavaScript are beyond its control.
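
If a site sets an extra cookie through front-end JavaScript, that cookie never shows up in Set-Cookie, so the session will not pick it up on its own; you have to put it in yourself. A minimal sketch, with a made-up cookie name and value:

import requests

session = requests.session()
# Cookies generated by JS must be injected manually (hypothetical name/value)
session.cookies.set("js_token", "abc123")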

Logging in to obtain a cookie temporarily

import requests

url = 'https://passport.17k.com/ck/user/login'

dic = {
    "loginName": "13093740825",
    "password": "wjr996"
}

# Simulate the login
resp = requests.post(url, data=dic)
# print(resp.text)

# Get the cookie
# Option 1: read the raw Set-Cookie string from the response headers
# print(resp.headers['Set-Cookie'])

# Option 2: grab the login cookies (a RequestsCookieJar)
# print(resp.cookies)
# Example: reuse the cookies we just obtained
urls = 'https://user.17k.com/ck/author/shelf?page=1&appKey=2406394919'
# Pass the cookies we just got along with the next request
cookies = resp.cookies  # RequestsCookieJar
resp_num = requests.get(urls, cookies=cookies)
print(resp_num.text)
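
resp.cookies above is a RequestsCookieJar rather than a plain string. If you want to see what it actually contains, it can be flattened into an ordinary dict; a small sketch, continuing from the login response resp above:

# Flatten the cookie jar into a plain dict so the contents are easy to read
cookie_dict = requests.utils.dict_from_cookiejar(resp.cookies)
for name, value in cookie_dict.items():
    print(name, "=", value)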

Keeping the conversation alive with a session

import requests

# Create a session object
session = requests.session()

# You can set headers or cookies on the session in advance
session.headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36"
}


# Option 1: set custom cookies through a cookie jar
# session.cookies = requests.utils.cookiejar_from_dict({
#     # put the cookie key/value pairs in here; this expects a dict
# })

# Option 2: write the Cookie header directly
# session.headers["Cookie"] = "key1=value1; key2=value2"


# Simulate the login
url = 'https://passport.17k.com/ck/user/login'

data = {
    "loginName": "13093740825",
    "password": "wjr996"
}

# After logging in, the server returns Set-Cookie. Cookies returned this way are handled by the session automatically
session.post(url, data=data)  # Set-Cookie is processed automatically


# Every subsequent request made through the session carries the cookie
url = 'https://user.17k.com/ck/author/shelf?page=1&appKey=2406394919'
ses = session.get(url)
print(ses.text)
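
The same flow can also be written with the session as a context manager, so the underlying connections are closed automatically when the block ends. A sketch of the identical login-then-fetch sequence under that pattern (the credentials are placeholders to fill in):

import requests

login_url = 'https://passport.17k.com/ck/user/login'
shelf_url = 'https://user.17k.com/ck/author/shelf?page=1&appKey=2406394919'

# The with-block closes the session's connection pool automatically
with requests.Session() as s:
    s.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36"
    })
    s.post(login_url, data={"loginName": "xxx", "password": "xxx"})  # placeholder credentials
    print(s.get(shelf_url).text)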

Referer and anti-hotlinking

Some sites check the Referer request header to confirm that a resource is being requested from one of their own pages rather than hotlinked from elsewhere; if the header is missing or wrong, the request is rejected. The workaround is simply to fill in the Referer ourselves, normally with the URL of the page that embeds the resource.

Referer example: downloading a video

# Anti-hotlinking (Referer check)
import requests

while True:
    main_url = input("Enter the pearvideo page URL to scrape: ")  # e.g. "https://www.pearvideo.com/video_1756814"

    contId = main_url.split("_")[-1]  # the video id is the number after the underscore in the page URL

    url = f'https://www.pearvideo.com/videoStatus.jsp?contId={contId}'

    header = {
        "Referer": main_url,
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36"
    }

    resp = requests.get(url, headers=header)
    # print(resp.text)

    dic = resp.json()
    # The returned srcUrl is a fake address in which the systemTime value stands in for the real
    # path segment; swapping it for "cont-<contId>" produces the real video URL
    src_url = dic['videoInfo']['videos']['srcUrl']
    systemtime = dic['systemTime']
    src = src_url.replace(systemtime, f'cont-{contId}')
    print(src)

    # Download the video
    resp = requests.get(src, headers=header)
    with open(f'{contId}.mp4', 'wb') as w:
        w.write(resp.content)
    print("Download finished")

Proxies

When we scrape the same site over and over, the server is likely to decide the requests are too frequent and block our IP as an anti-scraping measure. The way around this is to disguise ourselves behind a network proxy.

import requests


headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36",
}

# Prepare the proxy (this address is just an example)
dic = {
    "http": "http://223.96.90.216:8085",
    "https": "https://223.96.90.216:8085",
}
# Pass the proxy mapping through the proxies parameter
resp = requests.get("http://www.baidu.com/s?ie=UTF-8&wd=ip", proxies=dic, headers=headers)

print(resp.text)
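
Free proxies tend to die quickly, so in practice people keep a small pool and fall back to the next address when one fails. A rough sketch of that idea; the proxy addresses below are placeholders, not known working proxies:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36",
}

# Placeholder proxy pool; swap in addresses that are actually alive
proxy_pool = [
    "http://223.96.90.216:8085",
    "http://111.111.111.111:8080",
]

url = "http://www.baidu.com/s?ie=UTF-8&wd=ip"

for proxy in proxy_pool:
    try:
        resp = requests.get(url, proxies={"http": proxy, "https": proxy},
                            headers=headers, timeout=5)
        print(resp.text)
        break  # stop at the first proxy that answers
    except requests.RequestException:
        print(f"proxy {proxy} failed, trying the next one")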