The urllib Library

The urllib library is Python's built-in HTTP request library. It contains the following modules:

  • urllib.request: the request module
  • urllib.error: the exception handling module
  • urllib.parse: the URL parsing module
  • urllib.robotparser: the robots.txt parsing module

1. urllib.request

1.1 The urlopen function

1) The return type of urlopen

  • urlopen returns an http.client.HTTPResponse object; calling read() on it yields bytes, which must be decoded before the content is readable.
# The urlopen function
import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
# read() returns bytes, so decode as utf-8 to get readable text
print(response.read().decode('utf-8'))

2) The parameters of urlopen

  • urlopen's first parameter is the URL, the second is the data to send with the request, and the third is a timeout in seconds.
  • If the second parameter (data) is supplied, the request is sent as a POST, and the data must be passed as bytes.
import urllib.parse
import urllib.request
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
# Passing data makes this a POST request; the data must be bytes
# timeout is the number of seconds to wait for a response before an exception is raised
response = urllib.request.urlopen('http://httpbin.org/post', data=data, timeout=2)
print(response.read().decode('utf-8'))
  • Handling timeouts:
import socket
import urllib.request
import urllib.error
try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print("Timed out")

1.2 The response

# The response object
import urllib.request
response = urllib.request.urlopen('http://cnblogs.com/hgzero')
print(type(response))
print(response.status)                     # status code
print(response.getheaders())               # response headers, as a list of tuples
print(response.getheader('Content-Type'))  # note: getheader here, without the trailing 's'

1.3 The request

# The Request object
import urllib.request
request = urllib.request.Request('http://cnblogs.com/hgzero')
response = urllib.request.urlopen(request)   # a Request object can be passed directly to urlopen
print(response.read().decode('utf-8'))
  • Setting custom HTTP headers and data on a request:
# Method 1: build a Request object carrying custom headers and data, then pass it to urlopen
from urllib import request, parse
url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
params = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(params), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

# Method 2: call the Request object's add_header method to add HTTP headers
from urllib import request, parse
url = 'http://httpbin.org/post'
params = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(params), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')  # add a single HTTP header
response = request.urlopen(req)
print(response.read().decode('utf-8'))

1.4 Proxies

import urllib.request
# ProxyHandler routes matching requests through the given proxy
proxy_handler = urllib.request.ProxyHandler(
    {
        'http': 'http://127.0.0.1:25379',
        # 'https': 'https://127.0.0.1:25379'
    }
)
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://www.youtobe.com/')
print(response.read().decode('utf-8'))
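
If every urlopen call in the process should go through the proxy, the opener can also be installed globally with install_opener. A small sketch, reusing the opener built above (httpbin.org is just an example target):

# Install the opener globally so plain urlopen() calls use the proxy too
urllib.request.install_opener(opener)
response = urllib.request.urlopen('http://httpbin.org/get')
print(response.status)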

1.5 Cookie

import http.cookiejar, urllib.request
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)   # collects cookies set by server responses
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name+"="+item.value)
  • Saving and loading cookies:
import http.cookiejar, urllib.request
# MozillaCookieJar saves cookies in the Mozilla/Netscape format
filename = "cookie_first.txt"
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

# LWPCookieJar saves cookies in the LWP format; either format works, just pick one
filename = "cookie_second.txt"
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

# On a later request, load the previously saved cookies
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie_second.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

2. urllib.error

from urllib import request, error
try:
    response = request.urlopen('http://hgzerowzhpray.com')
except error.HTTPError as e:  # the narrower error; HTTPError is a subclass of URLError, so catch it first
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:   # the broader error
    print(e.reason)
else:
    print('Request Successfully')

3. urllib.parse
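
urllib.parse provides the standard tools for splitting, joining, and encoding URLs. A minimal sketch of its most commonly used functions (the URLs are arbitrary examples):

# urllib.parse: split, join, and encode URLs
from urllib.parse import urlparse, urlunparse, urljoin, urlencode, quote, unquote

# urlparse splits a URL into six components
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(result.scheme, result.netloc, result.path, result.params, result.query, result.fragment)

# urlunparse rebuilds a URL from the six components
print(urlunparse(['http', 'www.baidu.com', 'index.html', 'user', 'id=5', 'comment']))

# urljoin resolves a relative link against a base URL
print(urljoin('http://www.baidu.com', 'FAQ.html'))

# urlencode serializes a dict into a query string
print(urlencode({'name': 'Germey', 'age': 22}))

# quote/unquote percent-encode and decode non-ASCII text
print(quote('你好'))
print(unquote('%E4%BD%A0%E5%A5%BD'))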

4. urllib.robotparser

  • Rarely used in practice; a minimal sketch is given below for reference.
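
RobotFileParser fetches a site's robots.txt and answers whether a given user agent may crawl a given URL (the site here is just an example):

# Check robots.txt rules with urllib.robotparser
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://www.baidu.com/robots.txt')
rp.read()   # download and parse robots.txt
print(rp.can_fetch('*', 'http://www.baidu.com/'))   # may user agent '*' fetch this URL?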

 
