Spider (3): Cookies & requests

I. Simulating login with cookies

1. What are cookies and sessions?

       HTTP is a stateless protocol: each client/server interaction is limited to a single request/response cycle, after which the connection ends, and on the next request the server treats the client as brand new. To maintain state between requests, so the server knows a request comes from the same user as before, client information has to be recorded somewhere.
       cookie  : identifies the user through information recorded on the client
       session : identifies the user through information recorded on the server

2. Example: logging in to Renren (renren.com) with a cookie

       Steps:
         1. Grab the Cookie with a packet-capture tool or the browser's F12 DevTools (log in to the site once first)
         2. Send a normal request carrying that Cookie
        url: http://www.renren.com/967469305/profile

  Note: the Cookie is found under the request's Request Headers (DevTools -> Network -> select the request -> Request Headers);
               delete the Accept-Encoding entry from the copied headers

import requests

url = "http://www.renren.com/967469305/profile"
headers = {
        "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language":"zh-CN,zh;q=0.9",
        #"Accept-Encoding":"gzip, deflate",  # remove this header so the response comes back uncompressed
        "Connection":"keep-alive",
        # Cookie copied from DevTools after logging in once
        "Cookie":"anonymid=joxn2zjd4wbtlr; depovince=BJ; _r01_=1; ln_uact=13603263409; ln_hurl=http://hdn.xnimg.cn/photos/hdn221/20181101/1550/h_main_qz3H_61ec0009c3901986.jpg; jebe_key=7fd23b61-42cf-4105-ab4f-8eb28565c128%7C2012cb2155debcd0710a4bf5a73220e8%7C1543196119249%7C1%7C1543196116786; JSESSIONID=abcvVRQNFidTP4Ot1UnDw; ick_login=eb316897-ab3e-47ce-86ba-d08b843d32ad; first_login_flag=1; loginfrom=syshome; ch_id=10016; wp_fold=0; jebecookies=f6f9cac4-a174-4839-9c19-3c6fd75e3331|||||; _de=4DBCFCC17D9E50C8C92BCDC45CC5C3B7; p=0b73085b1f59a0c8f08c11c37a7d59615; t=ca209d505ca3c93ad19921a5e8b53c015; societyguester=ca209d505ca3c93ad19921a5e8b53c015; id=967469305; xnsid=b20d9e56",
        "Host":"www.renren.com",
        "Referer":"http://www.renren.com/SysHome.do",
        "Upgrade-Insecure-Requests":"1",
        "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36",
    }
res = requests.get(url,headers=headers)
res.encoding = "utf-8"
print(res.text)
01_人人网Cookie模拟登陆.py
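
A related sketch (my own addition, not part of the original notes): instead of pasting the entire Cookie header, requests can carry cookies through the cookies= parameter or a Session object, which also keeps any cookies the server sets across later requests. The cookie names and values below are placeholders, not working credentials.

import requests

url = "http://www.renren.com/967469305/profile"
headers = {"User-Agent":"Mozilla/5.0"}
# placeholder values -- copy the real ones from DevTools after logging in
cookies = {"JSESSIONID":"xxx","ick_login":"xxx","id":"967469305"}

# a Session keeps these cookies (and any set by the server) for later requests
s = requests.Session()
s.headers.update(headers)
s.cookies.update(cookies)

res = s.get(url)
res.encoding = "utf-8"
print(res.status_code)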

II. The requests module

   1. get() usage
       params: query parameters as a dict; no manual encoding and no URL concatenation needed
        1. Without query parameters:
           res = requests.get(url,headers=headers)
        2. With query parameters: params={}
        res = requests.get(url,params=params,headers=headers)
           Note: params must be a dict; requests encodes it automatically
   2. Attributes of the response object res
      1. encoding    : response character encoding, e.g. res.encoding="utf-8"
      2. text        : response body as a string
      3. content     : response body as raw bytes
      4. status_code : HTTP status code
      5. url         : the URL that actually returned the data
   3. Storing unstructured data (a combined GET sketch follows below)
      html = res.content
      with open("XXX","wb") as f:
              f.write(html)
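
A combined sketch of points 1-3 above (my own example, not from the original notes; httpbin.org simply echoes the request, and the parameter name "wd" and the output file name are arbitrary):

import requests

url = "http://httpbin.org/get"
headers = {"User-Agent":"Mozilla/5.0"}
params = {"wd":"hello"}               # dict; requests URL-encodes it automatically

res = requests.get(url,params=params,headers=headers)
res.encoding = "utf-8"
print(res.status_code)                # response code, e.g. 200
print(res.url)                        # actual URL with the query string appended
print(res.text[:200])                 # body as a string

# save the raw bytes (content) -- the usual way to store unstructured data
with open("get_demo.json","wb") as f:
    f.write(res.content)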
   4. post(url,data=data,headers=headers)
       1. data is the form data as a dict; no manual encoding or conversion needed
       2. Example: Youdao Translate
       # here data is the form data
       res = requests.post(url,data=data,headers=headers)
       res.encoding="utf-8"
       html = res.text

import requests
import json

# read the text to translate; requests will encode the form dict itself
key = input("Enter the text to translate: ")
data = {
        "i":key,
        "from":"AUTO",
        "to":"AUTO",
        "smartresult":"dict",
        "client":"fanyideskweb",
        "salt":"1543198916297",
        "sign":"21753ee815cabd98fb1c29635ba8e1d3",
        "doctype":"json",
        "version":"2.1",
        "keyfrom":"fanyi.web",
        "action":"FY_BY_REALTIME",
        "typoResult":"false"
    }

# send the request and get the response
url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36"}

# send a POST request; data can be passed directly as a dict
res = requests.post(url,data=data,headers=headers)
res.encoding = "utf-8"
html = res.text

# html is a JSON-formatted string
r_dict = json.loads(html)
print(r_dict["translateResult"][0][0]["tgt"])
02_有道翻译requests.post.py
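
A small side note (my own addition): since the response body is JSON, requests can also decode it directly, which is equivalent to the json.loads() call above:

r_dict = res.json()     # same as json.loads(res.text) for a JSON response
print(r_dict["translateResult"][0][0]["tgt"])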

III. Other parameters of the get() method

1. Proxy IPs (parameter: proxies)

     1. Sites that publish proxy IPs
            Xici proxies    www.xicidaili.com/
            Kuaidaili       www.kuaidaili.com/
            Quanwang proxies
     2. Plain (public) proxies
         1. Format: proxies = {"protocol":"protocol://IP:port"}
              e.g. a harvested entry 182.88.190.3 8123, or
              proxies = {"http":"http://61.152.248.147:80"}
         2. To check whether traffic really goes through the proxy:
              http://httpbin.org/get echoes the client's headers and IP
         3. get() can also set a connection timeout (see the sketch after the plain-proxy example):
              timeout=3  -- give up after 3 seconds

import requests

#url = "http://www.baidu.com/"
url = "http://httpbin.org/get"
headers = {"User-Agent":"Mozilla/5.0"}
proxies = {"http":"http://61.152.248.147:80"}

res = requests.get(url,proxies=proxies,headers=headers)
res.encoding = "utf-8"
print(res.text)
03_普通代理示例.py
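
A short sketch of the timeout point above (my own example; the 3-second limit and the exception handling are illustrative):

import requests

url = "http://httpbin.org/get"
headers = {"User-Agent":"Mozilla/5.0"}

try:
    # give up if the server has not responded within 3 seconds
    res = requests.get(url,headers=headers,timeout=3)
    print(res.status_code)
except requests.exceptions.Timeout:
    print("request timed out")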

     3. Private (authenticated) proxies
         Format:
         proxies = {"http":"http://user:password@IP:port"}
         e.g. proxies = {"http":"http://309435365:szayclhp@116.255.162.107:16816"}
             username: 309435365
             password: szayclhp
             IP      : 116.255.162.107
             port    : 16816

import requests

url = "http://httpbin.org/get"
headers = {"User-Agent":"Mozilla/5.0"}
proxies = {"http":"http://309435365:szayclhp@116.255.162.107:16816"}

res = requests.get(url,proxies=proxies,headers=headers)
#print(res.status_code)
res.encoding = "utf-8"
print(res.text)
04_私密代理示例.py

     4. Case study 1: scraping Lianjia second-hand housing listings into MySQL
         See: 05_链家tomysql.py
         1. Find the URL pattern
           page 1: https://bj.lianjia.com/ershoufang/pg1/
           page 2: https://bj.lianjia.com/ershoufang/pg2/
         2. Regex
         <div class="houseInfo">.*?data-el="region">(.*?)</a>.*?<div class="totalPrice">.*?<span>(.*?)</span>
         3. Write the code

import requests
import re
import pymysql

class LianjiaSpider:
    def __init__(self):
        self.baseurl = "https://bj.lianjia.com/ershoufang/pg"
        self.headers = {"User-Agent":"Mozilla/5.0"}
        self.proxies = {"http":"http://309435365:szayclhp@116.255.162.107:16816"}
        self.page = 1
        self.db = pymysql.connect(host="localhost",
                          user="root",password="123456",
                          database="lianjia",charset="utf8")
        self.cursor = self.db.cursor()

    def getPage(self,url):
        res = requests.get(url,proxies=self.proxies,headers=self.headers)
        res.encoding = "utf-8"
        html = res.text
        print("页面获取成功,正在解析...")
        self.parsePage(html)

    def parsePage(self,html):
        p = re.compile('<div class="houseInfo">.*?data-el="region">(.*?)</a>.*?<div class="totalPrice">.*?<span>(.*?)</span>',re.S)
        r_list = p.findall(html)
        # r_list : [("富力城","500"),(),()]
        print("解析成功,正在存入数据库...")
        self.writeTomysql(r_list)

    def writeTomysql(self,r_list):
        ins = 'insert into house(name,price) \
               values(%s,%s)'
        for r in r_list:
            L = [r[0].strip(),
                 float(r[1].strip())*10000]
            self.cursor.execute(ins,L)
            self.db.commit()
        print("存入数据库成功")

    def workOn(self):
        while True:
            c = input("爬按y,退出q:")
            if c == "y":
                url = self.baseurl + \
                      str(self.page) + "/"
                self.getPage(url)
                self.page += 1
            else:
                self.cursor.close()
                self.db.close()
                break

if __name__ == "__main__":
    spider = LianjiaSpider()
    spider.workOn()
05_链家tomysql.py

     5. Lianjia case study (MongoDB)
         See: 06_链家tomongo.py
          >>>show dbs
          >>>use <db name>
          >>>show collections
          >>>db.<collection>.find().pretty()
          >>>db.<collection>.count()
import requests
import re
import pymongo

class LianjiaSpider:
    def __init__(self):
        self.baseurl = "https://bj.lianjia.com/ershoufang/pg"
        self.headers = {"User-Agent":"Mozilla/5.0"}
        self.proxies = {"http":"http://309435365:szayclhp@116.255.162.107:16816"}
        self.page = 1
        # connection object
        self.conn = pymongo.MongoClient("localhost",27017)
        # database object
        self.db = self.conn["lianjia"]
        # collection object
        self.myset = self.db["house"]

    def getPage(self,url):
        res = requests.get(url,proxies=self.proxies,headers=self.headers)
        res.encoding = "utf-8"
        html = res.text
        print("页面获取成功,正在解析...")
        self.parsePage(html)

    def parsePage(self,html):
        p = re.compile('<div class="houseInfo">.*?data-el="region">(.*?)</a>.*?<div class="totalPrice">.*?<span>(.*?)</span>',re.S)
        r_list = p.findall(html)
        # r_list : [("富力城","500"),(),()]
        print("解析成功,正在存入数据库...")
        self.writeTomongo(r_list)

    def writeTomongo(self,r_list):
        for r in r_list:
            D = {
                "name":r[0].strip(),
                "price":float(r[1].strip())*10000
                }
            self.myset.insert_one(D)   # insert_one() replaces the deprecated insert()
        print("saved to MongoDB")

    def workOn(self):
        while True:
            c = input("爬按y,退出q:")
            if c == "y":
                url = self.baseurl + \
                      str(self.page) + "/"
                self.getPage(url)
                self.page += 1
            else:
                break

if __name__ == "__main__":
    spider = LianjiaSpider()
    spider.workOn()
06_链家tomongo.py

2. Web client authentication (parameter: auth=(tuple))

     1. auth = ("username","password")
         auth = ("tarenacode","code_2013")
         e.g. res = requests.get(url,proxies=self.proxies,headers=self.headers,auth=("tarenacode","code_2013"))
     2. Case study: crawling the code.tarena.com.cn directory listing
         See: 07_Web客户端验证.py
         1. Steps
         1. URL: http://code.tarena.com.cn
         2. Regex
             <a href=".*?">(.*?)</a>
         3. Code

import requests
import re
import pymysql
import warnings

class DaneiSpider:
    def __init__(self):
        self.headers = {"User-Agent":"Mozilla/5.0"}
        # self.proxies = {"http":"http://309435365:szayclhp@116.255.162.107:16816"}
        self.url = "http://code.tarena.com.cn/"
        # connection object
        self.db = pymysql.connect(host="localhost",
                   user="root",password="123456",
                   database="lianjia",charset="utf8")
        # cursor object
        self.cursor = self.db.cursor()

    def getParsePage(self,url):
        res = requests.get(url,headers=self.headers,
                               auth=("tarenacode","code_2013"))
        res.encoding = "utf-8"
        html = res.text
        p = re.compile('<a href=".*?">(.*?)</a>',re.S)
        r_list = p.findall(html)
        # r_list : ["AIDCODE/","BIGCODE"]
        self.mysql(r_list)

    def mysql(self,r_list):
        ctab = 'create table code(\
                id int primary key auto_increment,\
                course varchar(30)\
                )'
        ins = 'insert into code(course) values(%s)'
        # suppress warnings (e.g. from CREATE TABLE when the table already exists)
        warnings.filterwarnings("ignore")
        try:
            self.cursor.execute(ctab)
        except:
            # table already exists
            pass

        for r in r_list:
            L = [r.strip()[0:-1]]
            self.cursor.execute(ins,L)
            self.db.commit()
        print("达内code已入库")

    def workOn(self):
        self.getParsePage(self.url)

if __name__ == "__main__":
    spider = DaneiSpider()
    spider.workOn()
07_Web客户端验证.py

   3. SSL certificate verification (parameter: verify=True | False)
     1. verify = True : the default; verify the SSL certificate
     2. verify = False: skip verification (needed for sites with untrusted or self-signed certificates)

import requests

url = "https://www.12306.cn/mormhweb/"
headers = {"User-Agent":"Mozilla/5.0"}

res = requests.get(url,headers=headers,verify=False)
res.encoding = "utf-8"
print(res.text)
08_SSL证书认证.py
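
One extra note (my own addition, not in the original notes): with verify=False, urllib3 prints an InsecureRequestWarning on every request; it can be silenced like this:

import urllib3

# suppress the warning triggered by verify=False
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)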

IV. Handler processors in urllib.request

1. Definition
     A way to customize urlopen(): the module's built-in urlopen() is a pre-built opener with no support for proxies and similar features, so a Handler object is used to build a custom opener instead.
2. Common calls
     1. opener = build_opener(Handler object with the desired feature) : create the opener
     2. opener.open(url, ...)
3. Usage flow
     1. Create the relevant Handler object
         http_handler = urllib.request.HTTPHandler()
     2. Build a custom opener
         opener = urllib.request.build_opener(http_handler)
     3. Send the request and get the response with the opener's open() method
         req = urllib.request.Request(url,headers=headers)
         res = opener.open(req)

import urllib.request

url = "http://www.baidu.com/"
headers = {"User-Agent":"Mozilla/5.0"}

# 1. create the relevant Handler object
http_handler = urllib.request.HTTPHandler()
# 2. build a custom opener
opener = urllib.request.build_opener(http_handler)
# 3. send the request with opener.open() and get the response
req = urllib.request.Request(url,headers=headers)
res = opener.open(req)
print(res.getcode())
09_Handler示例.py

4. Handler types
     1. HTTPHandler(): no special features
     2. ProxyHandler({plain proxy})
          proxy dict: {"protocol":"protocol://IP:port"} (see the ProxyHandler sketch after this list)
     3. ProxyBasicAuthHandler(password manager object) : private (authenticated) proxy
     4. HTTPBasicAuthHandler(password manager object)  : web client authentication
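
A minimal ProxyHandler sketch (my own example; the proxy address is the public one used earlier and may no longer be reachable):

import urllib.request

url = "http://httpbin.org/get"
headers = {"User-Agent":"Mozilla/5.0"}

# plain proxy, no authentication
proxy_handler = urllib.request.ProxyHandler({"http":"http://61.152.248.147:80"})
opener = urllib.request.build_opener(proxy_handler)

req = urllib.request.Request(url,headers=headers)
res = opener.open(req)
print(res.read().decode("utf-8"))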
5. The password manager
     1. Used for private (authenticated) proxies
     2. Used for web client authentication
     3. Flow (a full sketch follows after these steps)
         1. Create the password manager object
             pwdmg = urllib.request.HTTPPasswordMgrWithDefaultRealm()
         2. Add the credentials to the password manager
             pwdmg.add_password(None,webserver,user,passwd)
             webserver: the private proxy's IP:port, or the web server's address
             user: account name
             passwd: password
         3. Create the Handler object
             1. Private proxy
              proxy_handler = urllib.request.ProxyBasicAuthHandler(pwdmg)
             2. Web client
              webbasic_handler = urllib.request.HTTPBasicAuthHandler(pwdmg)
         4. Build a custom opener
             opener = urllib.request.build_opener(proxy_handler)
         5. Send the request and get the response with opener.open()
             req = urllib.request.Request(url,headers=headers)
             res = opener.open(req)
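
Putting the five steps together for the web client case (my own assembly of the snippets above, reusing the tarenacode credentials from the auth=() example; the private-proxy case is analogous, but combines a ProxyHandler carrying the proxy address with ProxyBasicAuthHandler for the credentials):

import urllib.request

url = "http://code.tarena.com.cn/"
headers = {"User-Agent":"Mozilla/5.0"}

# 1. create the password manager object
pwdmg = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# 2. add the credentials: realm=None, server, user, password
pwdmg.add_password(None,url,"tarenacode","code_2013")
# 3. create the Handler object for web client authentication
webbasic_handler = urllib.request.HTTPBasicAuthHandler(pwdmg)
# 4. build a custom opener
opener = urllib.request.build_opener(webbasic_handler)
# 5. send the request and read the response
req = urllib.request.Request(url,headers=headers)
res = opener.open(req)
print(res.read().decode("utf-8")[:300])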
