Spider (3): cookies & requests
I. Cookie-based simulated login
1. What are cookies and sessions?
HTTP is a stateless protocol: the interaction between client and server is limited to a single request/response cycle, after which the connection ends. On the next request the server treats the client as a brand-new one. To keep track of the client across requests, so that the server knows a request comes from the same user as before, client information has to be stored somewhere.
cookie: identifies the user by information stored on the client side
session: identifies the user by information stored on the server side
2. Example: use a cookie to simulate logging in to renren.com
Steps:
1. Get the cookie with a packet-capture tool or the F12 developer tools (log in to the site once first)
2. Send the request as usual
URL: http://www.renren.com/967469305/profile
Note: the Cookie value is found in the captured request under Request Headers
Remove the Accept-Encoding header from the copied headers
import requests

url = "http://www.renren.com/967469305/profile"
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9",
    # "Accept-Encoding": "gzip, deflate",  # delete this header
    "Connection": "keep-alive",
    "Cookie": "anonymid=joxn2zjd4wbtlr; depovince=BJ; _r01_=1; ln_uact=13603263409; ln_hurl=http://hdn.xnimg.cn/photos/hdn221/20181101/1550/h_main_qz3H_61ec0009c3901986.jpg; jebe_key=7fd23b61-42cf-4105-ab4f-8eb28565c128%7C2012cb2155debcd0710a4bf5a73220e8%7C1543196119249%7C1%7C1543196116786; JSESSIONID=abcvVRQNFidTP4Ot1UnDw; ick_login=eb316897-ab3e-47ce-86ba-d08b843d32ad; first_login_flag=1; loginfrom=syshome; ch_id=10016; wp_fold=0; jebecookies=f6f9cac4-a174-4839-9c19-3c6fd75e3331|||||; _de=4DBCFCC17D9E50C8C92BCDC45CC5C3B7; p=0b73085b1f59a0c8f08c11c37a7d59615; t=ca209d505ca3c93ad19921a5e8b53c015; societyguester=ca209d505ca3c93ad19921a5e8b53c015; id=967469305; xnsid=b20d9e56",
    "Host": "www.renren.com",
    "Referer": "http://www.renren.com/SysHome.do",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36",
}

res = requests.get(url, headers=headers)
res.encoding = "utf-8"
print(res.text)
II. The requests module
1. get() usage
params: query parameters passed as a dict; no manual URL-encoding or URL concatenation is needed
1. Without query parameters
res = requests.get(url, headers=headers)
2. With query parameters: params={}
res = requests.get(url, params=params, headers=headers)
Note: params must be a dict and is URL-encoded automatically (see the sketch below)
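A minimal sketch of both cases, using httpbin.org/get (not part of the original notes) so the echoed response shows how requests builds the query string; the parameter names and values are placeholders:

import requests

url = "http://httpbin.org/get"
headers = {"User-Agent": "Mozilla/5.0"}

# without query parameters
res = requests.get(url, headers=headers)

# with query parameters: requests URL-encodes the dict and appends it to the URL
params = {"q": "python spider", "page": 1}   # placeholder query values
res = requests.get(url, params=params, headers=headers)
print(res.url)   # the encoded query string is visible in the final URL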
2. Attributes of the response object res (quick check below)
1. encoding : response character encoding, e.g. res.encoding = "utf-8"
2. text : response body as a string
3. content : response body as bytes
4. status_code : HTTP status code
5. url : the URL that actually returned the data
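A quick check of these attributes, again using httpbin.org/get as an assumed test URL:

import requests

res = requests.get("http://httpbin.org/get",
                   headers={"User-Agent": "Mozilla/5.0"})
res.encoding = "utf-8"       # encoding used when decoding res.text
print(res.status_code)       # e.g. 200
print(res.url)               # URL that actually returned the data
print(type(res.text))        # <class 'str'>
print(type(res.content))     # <class 'bytes'>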
3. Storing unstructured data
html = res.content
with open("XXX", "wb") as f:
    f.write(html)
4. post(url, data=data, headers=headers)
1. data is the form data, passed as a dict; no manual encoding or conversion is needed
2. Example: Youdao Translate
# here data is the form-data dict
res = requests.post(url,data=data,headers=headers)
res.encoding="utf-8"
html = res.text
import requests
import json

# take user input and build the form-data dict
key = input("Enter the text to translate: ")
data = {
    "i": key,
    "from": "AUTO",
    "to": "AUTO",
    "smartresult": "dict",
    "client": "fanyideskweb",
    "salt": "1543198916297",
    "sign": "21753ee815cabd98fb1c29635ba8e1d3",
    "doctype": "json",
    "version": "2.1",
    "keyfrom": "fanyi.web",
    "action": "FY_BY_REALTIME",
    "typoResult": "false"
}

# send the request and get the response
url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36"}

# send a POST request; data can be passed directly as a dict
res = requests.post(url, data=data, headers=headers)
res.encoding = "utf-8"
html = res.text

# html is a JSON-formatted string
r_dict = json.loads(html)
print(r_dict["translateResult"][0][0]["tgt"])
III. Other parameters of the get() method
1. Proxy IPs (parameter name: proxies)
1. Sites that provide proxy IPs
Xici proxy: www.xicidaili.com/
Kuaidaili: www.kuaidaili.com/
Quanwang proxy
2. Ordinary proxies
1. Format: proxies = {"protocol": "protocol://IP:port"}
e.g. 182.88.190.3:8123
proxies = {"http": "http://61.152.248.147:80"}
2. Check whether traffic really goes out through the proxy IP
http://httpbin.org/get : shows the client's headers and IP
3. Setting a connection timeout on get() (see the sketch after the code below):
timeout=3 -- give up after 3 seconds
import requests

# url = "http://www.baidu.com/"
url = "http://httpbin.org/get"
headers = {"User-Agent": "Mozilla/5.0"}
proxies = {"http": "http://61.152.248.147:80"}

res = requests.get(url, proxies=proxies, headers=headers)
res.encoding = "utf-8"
print(res.text)
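A minimal sketch of the timeout parameter mentioned above; the URL reuses httpbin.org/get and the fallback message is just illustrative:

import requests

url = "http://httpbin.org/get"
headers = {"User-Agent": "Mozilla/5.0"}

try:
    # raises requests.exceptions.Timeout if no response arrives within 3 seconds
    res = requests.get(url, headers=headers, timeout=3)
    print(res.status_code)
except requests.exceptions.Timeout:
    print("request timed out, retry or switch proxy")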
3. Private (authenticated) proxies
Format:
proxies = {"http": "http://user:password@IP:port"}
e.g. proxies = {"http": "http://309435365:szayclhp@116.255.162.107:16816"}
Username: 309435365
Password: szayclhp
IP: 116.255.162.107
Port: 16816
import requests url = "http://httpbin.org/get" headers = {"User-Agent":"Mozilla/5.0"} proxies = {"http":"http://309435365:szayclhp@116.255.162.107:16816"} res = requests.get(url,proxies=proxies,headers=headers) #print(res.status_code) res.encoding = "utf-8" print(res.text)
4. Case 1: scrape Lianjia second-hand housing listings --> save to a MySQL database
See: 05_链家tomysql.py
1. Find the URL
Page 1: https://bj.lianjia.com/ershoufang/pg1/
Page 2: https://bj.lianjia.com/ershoufang/pg2/
2. Regular expression
<div class="houseInfo">.*?data-el="region">(.*?)</a>.*?<div class="totalPrice">.*?<span>(.*?)</span>
3. Write the code
import requests
import re
import pymysql

class LianjiaSpider:
    def __init__(self):
        self.baseurl = "https://bj.lianjia.com/ershoufang/pg"
        self.headers = {"User-Agent": "Mozilla/5.0"}
        self.proxies = {"http": "http://309435365:szayclhp@116.255.162.107:16816"}
        self.page = 1
        self.db = pymysql.connect(host="localhost", user="root", password="123456",
                                  database="lianjia", charset="utf8")
        self.cursor = self.db.cursor()

    def getPage(self, url):
        res = requests.get(url, proxies=self.proxies, headers=self.headers)
        res.encoding = "utf-8"
        html = res.text
        print("Page fetched, parsing...")
        self.parsePage(html)

    def parsePage(self, html):
        p = re.compile('<div class="houseInfo">.*?data-el="region">(.*?)</a>.*?<div class="totalPrice">.*?<span>(.*?)</span>', re.S)
        r_list = p.findall(html)
        # r_list: [("富力城", "500"), (), ...]
        print("Parsed, writing to database...")
        self.writeTomysql(r_list)

    def writeTomysql(self, r_list):
        ins = 'insert into house(name,price) values(%s,%s)'
        for r in r_list:
            L = [r[0].strip(), float(r[1].strip()) * 10000]
            self.cursor.execute(ins, L)
            self.db.commit()
        print("Saved to database")

    def workOn(self):
        while True:
            c = input("Press y to crawl, q to quit: ")
            if c == "y":
                url = self.baseurl + str(self.page) + "/"
                self.getPage(url)
                self.page += 1
            else:
                self.cursor.close()
                self.db.close()
                break

if __name__ == "__main__":
    spider = LianjiaSpider()
    spider.workOn()
5. Lianjia second-hand housing case (MongoDB database)
See: 06_链家tomongo.py
>>>show dbs
>>>use <database name>
>>>show collections
>>>db.<collection>.find().pretty()
>>>db.<collection>.count()
import requests
import re
import pymongo

class LianjiaSpider:
    def __init__(self):
        self.baseurl = "https://bj.lianjia.com/ershoufang/pg"
        self.headers = {"User-Agent": "Mozilla/5.0"}
        self.proxies = {"http": "http://309435365:szayclhp@116.255.162.107:16816"}
        self.page = 1
        # connection object
        self.conn = pymongo.MongoClient("localhost", 27017)
        # database object
        self.db = self.conn["lianjia"]
        # collection object
        self.myset = self.db["house"]

    def getPage(self, url):
        res = requests.get(url, proxies=self.proxies, headers=self.headers)
        res.encoding = "utf-8"
        html = res.text
        print("Page fetched, parsing...")
        self.parsePage(html)

    def parsePage(self, html):
        p = re.compile('<div class="houseInfo">.*?data-el="region">(.*?)</a>.*?<div class="totalPrice">.*?<span>(.*?)</span>', re.S)
        r_list = p.findall(html)
        # r_list: [("富力城", "500"), (), ...]
        print("Parsed, writing to database...")
        self.writeTomongo(r_list)

    def writeTomongo(self, r_list):
        for r in r_list:
            D = {
                "name": r[0].strip(),
                "price": float(r[1].strip()) * 10000
            }
            # insert_one() in current pymongo (insert() is deprecated)
            self.myset.insert_one(D)
        print("Saved to database")

    def workOn(self):
        while True:
            c = input("Press y to crawl, q to quit: ")
            if c == "y":
                url = self.baseurl + str(self.page) + "/"
                self.getPage(url)
                self.page += 1
            else:
                break

if __name__ == "__main__":
    spider = LianjiaSpider()
    spider.workOn()
2. Web client authentication (parameter name: auth=(tuple))
1. auth = ("username", "password")
auth = ("tarenacode", "code_2013")
e.g. res = requests.get(url, proxies=self.proxies, headers=self.headers, auth=("tarenacode", "code_2013"))
2. Example: crawl the code.tarena directory listing
See: 07_Web客户端验证.py
1. Steps
1. URL: http://code.tarena.com.cn
2. Regular expression
<a href=".*?">(.*?)</a>
3. Code
import requests
import re
import pymysql
import warnings

class DaneiSpider:
    def __init__(self):
        self.headers = {"User-Agent": "Mozilla/5.0"}
        # self.proxies = {"http": "http://309435365:szayclhp@116.255.162.107:16816"}
        self.url = "http://code.tarena.com.cn/"
        # connection object
        self.db = pymysql.connect(host="localhost", user="root", password="123456",
                                  database="lianjia", charset="utf8")
        # cursor object
        self.cursor = self.db.cursor()

    def getParsePage(self, url):
        res = requests.get(url, headers=self.headers,
                           auth=("tarenacode", "code_2013"))
        res.encoding = "utf-8"
        html = res.text
        p = re.compile('<a href=".*?">(.*?)</a>', re.S)
        r_list = p.findall(html)
        # r_list: ["AIDCODE/", "BIGCODE/", ...]
        self.mysql(r_list)

    def mysql(self, r_list):
        ctab = 'create table code(\
                id int primary key auto_increment,\
                course varchar(30)\
                )'
        ins = 'insert into code(course) values(%s)'
        # suppress warnings (e.g. table already exists)
        warnings.filterwarnings("ignore")
        try:
            self.cursor.execute(ctab)
        except Exception:
            pass
        for r in r_list:
            L = [r.strip()[0:-1]]
            self.cursor.execute(ins, L)
            self.db.commit()
        print("Tarena course list saved to database")

    def workOn(self):
        self.getParsePage(self.url)

if __name__ == "__main__":
    spider = DaneiSpider()
    spider.workOn()
3. SSL certificate verification (parameter name: verify=True | False)
1. verify = True : the default; the SSL certificate is verified
2. verify = False : skip certificate verification
import requests url = "https://www.12306.cn/mormhweb/" headers = {"User-Agent":"Mozilla/5.0"} res = requests.get(url,headers=headers,verify=False) res.encoding = "utf-8" print(res.text)
IV. Handler processors in urllib.request
1. Definition
A way to customize urlopen(): the module's built-in urlopen() is a special opener already defined by the module and does not support features such as proxies, so a Handler processor object is used to build a custom opener.
2. Common methods
1. opener = build_opener(Handler object for the desired feature) : creates the opener object
2. opener.open(url, ...)
3. Usage workflow
1. Create the relevant Handler processor object
http_handler = urllib.request.HTTPHandler()
2. Create a custom opener object
opener = urllib.request.build_opener(http_handler)
3. Use the opener's open() method to send the request and get the response
req = urllib.request.Request(url,headers=headers)
res = opener.open(req)
import urllib.request

url = "http://www.baidu.com/"
headers = {"User-Agent": "Mozilla/5.0"}

# 1. create the relevant Handler processor object
http_handler = urllib.request.HTTPHandler()
# 2. create a custom opener object
opener = urllib.request.build_opener(http_handler)
# 3. use the opener's open() method to send the request and get the response
req = urllib.request.Request(url, headers=headers)
res = opener.open(req)
print(res.getcode())
4. Handler processor types
1. HTTPHandler() : no special features
2. ProxyHandler({ordinary proxy})
Proxy format: {"protocol": "IP:port"}
3. ProxyBasicAuthHandler(password-manager object) : private (authenticated) proxy
4. HTTPBasicAuthHandler(password-manager object) : web client authentication
5. Uses of the password manager
1. Private proxies
2. Web client authentication
3. Implementation workflow (a complete sketch follows after this list)
1. Create the password-manager object
pwdmg = urllib.request.HTTPPasswordMgrWithDefaultRealm()
2. Add the authentication info to the password manager
pwdmg.add_password(None,webserver,user,passwd)
webserver: the private proxy's IP:port (or the web server's address for client authentication)
user: account name
passwd: password
3. Create the Handler processor object
1. Private proxy
proxy_handler = urllib.request.ProxyBasicAuthHandler(pwdmg)
2. Web client
webbasic_handler = urllib.request.HTTPBasicAuthHandler(pwdmg)
4. Create a custom opener object
opener = urllib.request.build_opener(proxy_handler)
5. Use the opener's open() method to send the request and get the response
req = urllib.request.Request(url,headers=headers)
res = opener.open(req)
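Putting these steps together, a minimal sketch of the private-proxy case; the proxy address and credentials are the same placeholder values used in the earlier requests examples, and a ProxyHandler pointing at the proxy is combined with the auth handler so requests are actually routed through it:

import urllib.request

url = "http://httpbin.org/get"
headers = {"User-Agent": "Mozilla/5.0"}

# assumed placeholder proxy address and credentials (same as the earlier examples)
proxy_address = "116.255.162.107:16816"
user = "309435365"
passwd = "szayclhp"

# 1. create the password-manager object
pwdmg = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# 2. add the authentication info (realm=None matches any realm)
pwdmg.add_password(None, proxy_address, user, passwd)
# 3. create the Handler processor objects: a ProxyHandler to route traffic through
#    the proxy, plus ProxyBasicAuthHandler to answer its authentication challenge
proxy_handler = urllib.request.ProxyHandler({"http": "http://" + proxy_address})
proxy_auth_handler = urllib.request.ProxyBasicAuthHandler(pwdmg)
# 4. create a custom opener object
opener = urllib.request.build_opener(proxy_handler, proxy_auth_handler)
# 5. send the request and get the response through the opener
req = urllib.request.Request(url, headers=headers)
res = opener.open(req)
print(res.getcode())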