Advanced Web Scraping (3): IP Proxies, CAPTCHA Recognition, and Simulated Login

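  • HTTPConnectionPool error:
    • Causes:
      • 1. Sending requests at high frequency within a short time gets the IP banned.
      • 2. The connections in the HTTP connection pool are exhausted.
    • Fixes:
      • 1. Use a proxy.
      • 2. Add Connection: "close" to the headers.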
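The second fix can be sketched as a small helper that merges `Connection: close` into whatever headers are already in use. The function name `with_connection_close` is hypothetical (it does not come from the original notes); the result is meant to be passed as the `headers=` argument of `requests.get`.

```python
def with_connection_close(headers):
    """Return a copy of headers that asks the server to close the
    connection after each response instead of keeping it alive."""
    merged = dict(headers)          # leave the caller's dict untouched
    merged['Connection'] = 'close'
    return merged


base_headers = {'User-Agent': 'Mozilla/5.0'}
request_headers = with_connection_close(base_headers)
print(request_headers)
```

Closing the connection after every response gives up connection reuse, so it is probably best kept for scripts that actually hit the pool error.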
 
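  • Proxy: a proxy server accepts a request and forwards it to the target.

  • Anonymity levels

    • Elite (high anonymity): the target server learns nothing; it can neither tell that a proxy is in use nor see your real IP.
    • Anonymous: the server knows you are using a proxy but does not know your real IP.
    • Transparent: the server knows you are using a proxy and also knows your real IP.
  • Types:

    • http
    • https
  • Free proxies: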

  • Example: sending a request through a proxy

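    import random
    import requests

    # Browser-like User-Agent so the request is not rejected as an obvious script
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'}
    url = 'https://www.baidu.com/s?wd=ip'
    # requests uses the entry whose key matches the URL scheme ('https' here)
    proxy_list=[{'https':'https://222.174.213.193:8888'},
             {'http':'http://222.174.213.193:8888'}]
    page_text = requests.get(url,headers=headers,proxies=random.choice(proxy_list)).text
    with open('ip.html','w',encoding='utf-8') as fp:
        fp.write(page_text)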
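Because the keys of the `proxies` dict are URL schemes, only the entry matching the scheme of the requested URL is used. The sketch below is a simplified illustration of that lookup, not the code inside `requests` (which also honors more specific keys and environment settings); `pick_proxy` is a hypothetical name.

```python
from urllib.parse import urlparse


def pick_proxy(url, proxies):
    """Simplified illustration: return the proxy registered for the URL's
    scheme, or None when no entry matches."""
    scheme = urlparse(url).scheme
    return proxies.get(scheme)


url = 'https://www.baidu.com/s?wd=ip'
https_choice = pick_proxy(url, {'https': 'https://222.174.213.193:8888'})
http_choice = pick_proxy(url, {'http': 'http://222.174.213.193:8888'})
print(https_choice)
print(http_choice)
```

Under this simplified model, an `https://` URL paired with a dict that only has an `'http'` key finds no proxy, so such a request would likely go out directly from your own IP.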

    Proxy pool:

    # Proxy pool: a list of proxy dicts
    import random
    import requests
    from lxml import etree

    proxy_list = [
        {'https':'https://121.231.94.44:8888'},
        {'https':'https://131.231.94.44:8888'},
        {'https':'https://141.231.94.44:8888'}
    ]
    url = 'https://www.baidu.com/s?wd=ip'
    # headers is assumed to be a dict holding a browser User-Agent
    page_text = requests.get(url,headers=headers,proxies=random.choice(proxy_list)).text
    with open('ip.html','w',encoding='utf-8') as fp:
        fp.write(page_text)

    # Extract proxy IPs from a free proxy listing page
    ip_url = 'http://www.xiladaili.com/gaoni/'
    page_text = requests.get(ip_url,headers=headers).text
    tree = etree.HTML(page_text)
    ip_list = tree.xpath('//tr//td[1]/text()')    # column 1: proxy address
    ip_xieyi = tree.xpath('//tr//td[2]/text()')   # column 2: protocol
    print(ip_list + ip_xieyi)
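The two scraped columns still have to be paired up before they can serve as a proxy pool. A minimal sketch follows. It assumes each address looks like `ip:port` and each protocol cell contains the word HTTP or HTTPS; the real page layout may differ. `build_proxy_pool` and the sample rows are hypothetical.

```python
def build_proxy_pool(addresses, protocols):
    """Pair each scraped address with its protocol and build requests-style
    proxy dicts. Rows whose protocol is neither HTTP nor HTTPS are skipped."""
    pool = []
    for address, protocol in zip(addresses, protocols):
        address = address.strip()
        protocol = protocol.upper()
        if 'HTTPS' in protocol:
            pool.append({'https': 'https://' + address})
        elif 'HTTP' in protocol:
            pool.append({'http': 'http://' + address})
    return pool


# Made-up sample rows standing in for ip_list and ip_xieyi
sample_ips = ['10.0.0.1:8888', ' 10.0.0.2:3128 ', '10.0.0.3:80']
sample_protocols = ['HTTPS', 'http', 'SOCKS5']
pool = build_proxy_pool(sample_ips, sample_protocols)
print(pool)
```

An entry can then be drawn from the result with `random.choice(pool)`, the same way the hand-written `proxy_list` is used.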

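    Checking which proxies work:

    # proxy_list_http: list of proxy address strings collected earlier
    for ip in proxy_list_http:
        response = requests.get('https://www.sogou.com',headers=headers,proxies={'https':ip})
        # status_code is an int, so compare against 200 rather than '200'
        if response.status_code == 200:
            print('Found a usable proxy IP:', ip)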
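Free proxies are often dead or slow, so a check loop usually needs a timeout and exception handling, or a single bad proxy stops the whole run. The sketch below uses the standard library's `urllib` in place of `requests` so that it has no third-party dependency. The names `probe_with_urllib`, `filter_usable`, and `fake_probe`, and the 5-second timeout, are assumptions made for illustration.

```python
import urllib.request


def probe_with_urllib(proxy_address, test_url='https://www.sogou.com', timeout=5):
    """Fetch test_url through the proxy and return the HTTP status code.
    Raises an exception (URLError, timeout, ...) if the proxy is unusable."""
    handler = urllib.request.ProxyHandler({'https': proxy_address})
    opener = urllib.request.build_opener(handler)
    with opener.open(test_url, timeout=timeout) as response:
        return response.status


def filter_usable(proxy_addresses, probe=probe_with_urllib):
    """Keep only the proxies for which probe() returns 200."""
    usable = []
    for address in proxy_addresses:
        try:
            if probe(address) == 200:
                usable.append(address)
        except Exception:
            # Dead or slow proxies raise; treat them as unusable and move on
            continue
    return usable


# Offline demonstration with a stand-in probe (no network needed)
def fake_probe(address):
    if address == 'dead:1':
        raise OSError('connection refused')
    return 200


usable = filter_usable(['alive:1', 'dead:1'], probe=fake_probe)
print(usable)
```

Passing the probe in as a parameter keeps the filtering logic testable without network access; calling `filter_usable(addresses)` with no `probe` argument performs the real check.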

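    Cookie handling

    • Manual: put the cookie into the headers.
    • Automatic: use a session object. Create a session object and send requests through it just as you would with requests. The difference is that any cookies produced while sending requests through the session are stored in the session object automatically.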
      # Scrape news data from Xueqiu: https://xueqiu.com/
      headers = {
          'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
      #     'Cookie':'aliyungf_tc=AQAAAAl2aA+kKgkAtxdwe3JmsY226Y+n; acw_tc=2760822915681668126047128e605abf3a5518432dc7f074b2c9cb26d0aa94; xq_a_token=75661393f1556aa7f900df4dc91059df49b83145; xq_r_token=29fe5e93ec0b24974bdd382ffb61d026d8350d7d; u=121568166816578; device_id=24700f9f1986800ab4fcc880530dd0ed'
      }
      url = 'https://xueqiu.com/v4/statuses/public_timeline_by_category.json?since_id=-1&max_id=20349203&count=15&category=-1'
      page_text = requests.get(url=url,headers=headers).json()
      page_text
      # Create a session object; requesting the home page first stores its cookies in the session
      session = requests.Session()
      session.get('https://xueqiu.com',headers=headers)
      
      url = 'https://xueqiu.com/v4/statuses/public_timeline_by_category.json?since_id=-1&max_id=20349203&count=15&category=-1'
      page_text = session.get(url=url,headers=headers).json()
      page_text
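For the manual approach, an alternative to pasting the cookie string into `headers` is to turn it into a dict and pass it through the `cookies=` parameter of `requests.get`. This is a minimal sketch of the parsing step only. `cookie_string_to_dict` is a hypothetical helper, and the sample values are placeholders rather than real cookies.

```python
from http.cookies import SimpleCookie


def cookie_string_to_dict(raw_cookie):
    """Parse a 'name=value; name2=value2' Cookie header string into a dict."""
    jar = SimpleCookie()
    jar.load(raw_cookie)
    return {name: morsel.value for name, morsel in jar.items()}


# Placeholder values, not real credentials
raw = 'xq_a_token=abc123; u=42; device_id=deadbeef'
cookies = cookie_string_to_dict(raw)
print(cookies)
```

Manually copied cookies expire, which is the main reason the session-based approach tends to be more convenient.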

      CAPTCHA recognition

      • Chaojiying: http://www.chaojiying.com/about.html
        • Register (as a User Center account)
        • Log in:
          • Create a software entry (ID: 899370)
          • Download the sample code
      • Damatu
      • Yundama
        import requests
        from hashlib import md5
        
        class Chaojiying_Client(object):
        
            def __init__(self, username, password, soft_id):
                self.username = username
                password =  password.encode('utf8')
                self.password = md5(password).hexdigest()
                self.soft_id = soft_id
                self.base_params = {
                    'user': self.username,
                    'pass2': self.password,
                    'softid': self.soft_id,
                }
                self.headers = {
                    'Connection': 'Keep-Alive',
                    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
                }
        
            def PostPic(self, im, codetype):
                """
                im: image bytes
                codetype: CAPTCHA type, see http://www.chaojiying.com/price.html
                """
                params = {
                    'codetype': codetype,
                }
                params.update(self.base_params)
                files = {'userfile': ('ccc.jpg', im)}
                r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files, headers=self.headers)
                return r.json()
        
            def ReportError(self, im_id):
                """
                im_id: image ID of the CAPTCHA being reported as wrongly recognized
                """
                params = {
                    'id': im_id,
                }
                params.update(self.base_params)
                r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
                return r.json()

        Example:

        # Recognize the CAPTCHA on gushiwen.org
        def tranformImgData(imgPath,t_type):
            chaojiying = Chaojiying_Client('bobo328410948', 'bobo328410948', '899370')
            # Read the image bytes and close the file promptly
            with open(imgPath, 'rb') as f:
                im = f.read()
            return chaojiying.PostPic(im, t_type)['pic_str']
        
        url = 'https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx'
        page_text = requests.get(url,headers=headers).text
        tree = etree.HTML(page_text)
        img_src = 'https://so.gushiwen.org'+tree.xpath('//*[@id="imgCode"]/@src')[0]
        img_data = requests.get(img_src,headers=headers).content
        with open('./code.jpg','wb') as fp:
            fp.write(img_data)
            
        tranformImgData('./code.jpg',1004)
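Indexing the result with `['pic_str']` directly gives little information when recognition fails. A more defensive sketch follows. It assumes that a failed call reports a non-zero `err_no` together with an `err_str` message; only `pic_str` is confirmed by the example above, so the other two field names are assumptions, as is the helper name `extract_captcha_text`.

```python
def extract_captcha_text(result):
    """Return the recognized text from a PostPic() result dict.

    Assumption: failures carry a non-zero 'err_no' and an 'err_str' message.
    """
    if result.get('err_no', 0) != 0:
        raise RuntimeError('CAPTCHA recognition failed: %s' % result.get('err_str'))
    text = result.get('pic_str', '')
    if not text:
        raise RuntimeError('CAPTCHA recognition returned no text')
    return text


# Hand-written sample dicts standing in for PostPic() results
ok_text = extract_captcha_text({'err_no': 0, 'err_str': 'OK', 'pic_str': '7261'})
print(ok_text)
try:
    extract_captcha_text({'err_no': 1, 'err_str': 'sample error', 'pic_str': ''})
except RuntimeError as exc:
    failure_message = str(exc)
    print(failure_message)
```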

        Simulated login:

        # Simulated login
        s = requests.Session()
        url = 'https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx'
        page_text = s.get(url,headers=headers).text
        tree = etree.HTML(page_text)
        img_src = 'https://so.gushiwen.org'+tree.xpath('//*[@id="imgCode"]/@src')[0]
        img_data = s.get(img_src,headers=headers).content
        with open('./code.jpg','wb') as fp:
            fp.write(img_data)
            
        # Dynamically extract the request parameters that change on every page load
        __VIEWSTATE = tree.xpath('//*[@id="__VIEWSTATE"]/@value')[0]
        __VIEWSTATEGENERATOR = tree.xpath('//*[@id="__VIEWSTATEGENERATOR"]/@value')[0]
            
        code_text = tranformImgData('./code.jpg',1004)
        print(code_text)
        login_url = 'https://so.gushiwen.org/user/login.aspx?from=http%3a%2f%2fso.gushiwen.org%2fuser%2fcollect.aspx'
        data = {
            '__VIEWSTATE': __VIEWSTATE,
            '__VIEWSTATEGENERATOR': __VIEWSTATEGENERATOR,
            'from':'http://so.gushiwen.org/user/collect.aspx',
            'email': 'www.zhangbowudi@qq.com',
            'pwd': 'bobo328410948',
            'code': code_text,
            'denglu': '登录',
        }
        page_text = s.post(url=login_url,headers=headers,data=data).text
        with open('login.html','w',encoding='utf-8') as fp:
            fp.write(page_text)

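        Dynamic request parameters: request parameters that change between requests are usually hidden in the page's HTML source, so they can be extracted from the login page before the form is submitted.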
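Rather than writing one XPath per parameter, every hidden form field can be collected in a single pass. The sketch below uses the standard library's `html.parser` instead of `lxml` to stay dependency-free. `HiddenInputCollector`, `hidden_fields`, and the sample HTML are hypothetical and only shaped like an ASP.NET login form.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get('type') or '').lower() == 'hidden'
        if is_hidden and attributes.get('name'):
            self.fields[attributes['name']] = attributes.get('value') or ''


def hidden_fields(page_text):
    """Return a dict of all hidden input fields found in page_text."""
    collector = HiddenInputCollector()
    collector.feed(page_text)
    return collector.fields


# Made-up HTML fragment, not the real login page
sample_page = '''
<form method="post">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="def456" />
  <input type="text" name="email" />
</form>
'''
fields = hidden_fields(sample_page)
print(fields)
```

The resulting dict could be merged into the login form data with `data.update(hidden_fields(page_text))` before posting.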


posted @ 2021-05-09 11:22  Full-Stack~