The urllib Library

urllib is Python's built-in HTTP request library; it needs no extra installation and contains four modules, the first three of which are the most commonly used:

  1. request: the HTTP request module. It simulates sending a request; just pass in a URL plus any extra parameters to carry out the whole process.
  2. error: the exception-handling module.
  3. parse: utilities for encoding, parsing, and joining URLs and their parameters.
  4. robotparser: interprets the Robots protocol (also called the crawler protocol, the robots exclusion standard, or the Robots Exclusion Protocol); a minimal usage sketch follows this list.
    A robots.txt file is usually placed in the site root and tells crawlers and search engines which pages may be crawled and which may not.
    # rough robots.txt format
    User-agent: *
    Disallow: /
    Allow: /public/
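The standard library ships urllib.robotparser for reading such a file; a minimal sketch (python.org is used purely as an illustration):

import urllib.robotparser

# Fetch and parse the site's robots.txt, then ask whether a URL may be crawled
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.python.org/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://www.python.org/about/"))  # True or False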

 

Request

I. Two common ways to issue a request with urllib.request.urlopen

1. Call urllib.request.urlopen(url, data, timeout, ...), where data must be in bytes format

import urllib.parse
import urllib.request

# GET request, no parameters
response = urllib.request.urlopen("https://www.python.org")

# POST a byte stream, with a 3-second timeout
data = bytes(urllib.parse.urlencode({"word": "test"}), encoding="utf8")
response2 = urllib.request.urlopen("http://httpbin.org/post", data=data, timeout=3)
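The object urlopen() returns is an http.client.HTTPResponse, so the status code and headers can be inspected as well as the body; for example:

import urllib.request

response = urllib.request.urlopen("https://www.python.org")
print(response.status)               # 200
print(response.getheader('Server')) # value of a single header, e.g. nginx
print(response.getheaders())        # full list of (name, value) pairs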

 

2. Build the request with the urllib.request.Request(url, data, headers, method) class, where data is a byte stream and headers is a dict

import urllib.parse
import urllib.request

url = "http://httpbin.org/post"
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
params = {
    'name': 'tester'
}
data = bytes(urllib.parse.urlencode(params), encoding="utf8")
request = urllib.request.Request(url=url, data=data, headers=headers, method="POST")
response = urllib.request.urlopen(request)
print(response.read().decode("utf-8"))
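Headers need not all be passed to the constructor; the Request class also exposes add_header(key, value) for attaching them one at a time, as in this equivalent sketch:

import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'name': 'tester'}), encoding="utf8")
request = urllib.request.Request(url="http://httpbin.org/post", data=data, method="POST")
# Attach headers individually instead of passing a headers dict
request.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
response = urllib.request.urlopen(request)
print(response.read().decode("utf-8"))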

 

II. Handlers: logins, Cookies, proxies, and more

The urllib.request.BaseHandler class is the parent class of all Handlers.
HTTPDefaultErrorHandler handles the HTTPError exceptions raised for HTTP error responses.
HTTPRedirectHandler handles redirects.
HTTPCookieProcessor handles Cookies.
ProxyHandler sets a proxy; the default is no proxy.
HTTPPasswordMgr manages passwords; it maintains a table of usernames and passwords.
HTTPBasicAuthHandler manages authentication; use it when opening a connection that requires authentication.

The request flow through a Handler: build an Opener from the appropriate Handler(s) in urllib.request, then call opener.open(url) to make the request.

# Build process:
1. Create a handler
2. opener = urllib.request.build_opener(handler)
3. opener.open(url)

 Opener: an instance of the OpenerDirector class. urlopen() is itself a simple Opener provided by urllib; building an Opener yourself unlocks deeper, more advanced configuration.
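For instance, the ProxyHandler listed above slots straight into this pattern. A minimal sketch, assuming a local proxy; the address 127.0.0.1:9743 is a placeholder, so substitute one you actually run:

from urllib.request import ProxyHandler, build_opener
from urllib.error import URLError

# Placeholder proxy address; replace with a real proxy before running
proxy_handler = ProxyHandler({
    "http": "http://127.0.0.1:9743",
    "https": "http://127.0.0.1:9743",
})
opener = build_opener(proxy_handler)
try:
    response = opener.open("https://www.baidu.com")
    print(response.read().decode("utf-8"))
except URLError as e:
    print(e.reason)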

# Example: a test page that requires a password to enter, handled with HTTPBasicAuthHandler

from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = "tester"
password = "testerpw"
url = "http://km******.test.mararun.com/"

pwMsg = HTTPPasswordMgrWithDefaultRealm()
pwMsg.add_password(None, url, username, password)
handlerAuth = HTTPBasicAuthHandler(pwMsg)
opener = build_opener(handlerAuth)

try:
    response = opener.open(url)
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)
'''
Cookie handling: declare an http.cookiejar.CookieJar object, build a Handler from it with HTTPCookieProcessor, build an Opener with build_opener(), then make the request with open().
'''

import http.cookiejar, urllib.request

# Loop over the cookies and print each name=value pair
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
for item in cookie:
    print(item.name+"="+item.value)

# Save cookie data to a text file
filename = "cookies.txt"
# Two file formats are available
# cookie = http.cookiejar.MozillaCookieJar(filename)
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
cookie.save(ignore_discard=True, ignore_expires=True)

# Read the saved cookies back and use them, taking the LWPCookieJar format as the example
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
print(response.status)

 

Error

The URLError class inherits from OSError and is the base class of the error module; any exception raised by the request module can be caught with it, and its reason attribute returns the cause of the error.
HTTPError is a subclass of URLError that specifically handles HTTP request errors. It has three attributes: code (the status code), reason (the cause of the error), and headers (the response headers).

# Example 1: reason is a string; note that HTTPError is caught before its parent class URLError
from urllib import request, error
try:
    response = request.urlopen("http://testerror.com/index.html")
except error.HTTPError as e:
    print(e.code, e.reason, e.headers)
except error.URLError as e:
    print(e.reason)
else:
    print("Request Successfully")


# Example 2: reason may be an object; on a timeout, for instance, it is a socket.timeout instance, so isinstance() can be used to check its type
import socket
import urllib.request
import urllib.error
try:
    response = urllib.request.urlopen("https://www.baidu.com", timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print("Time Out!")

 

Parse: urllib.parse

1. urlparse() parses a URL into 6 parts, returning a <class 'urllib.parse.ParseResult'>: ParseResult(scheme, netloc, path, params, query, fragment)

2. urlunparse() assembles a URL from an iterable, which must have length 6

3. urlsplit() parses a URL into 5 parts, returning a <class 'urllib.parse.SplitResult'>: SplitResult(scheme, netloc, path, query, fragment); params is not parsed separately but folded into path

4. urlunsplit() assembles a URL from an iterable, which must have length 5

5. urljoin() joins URLs: it parses the scheme, netloc, and path of the base URL and fills in whatever parts the second link is missing

6. urlencode() serializes parameters: dict -> URL query string

7. parse_qs() deserializes: URL query string -> dict

8. parse_qsl() deserializes: URL query string -> list of tuples

9. quote() converts content to URL-encoded form, preventing garbled output when the URL contains Chinese characters

10. unquote() performs URL decoding

# 1. urlparse()
from urllib.parse import urlparse
 
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)

'''
<class 'urllib.parse.ParseResult'>
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
'''
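ParseResult is a named tuple, so its components can be read by attribute or by index interchangeably:

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
# Attribute access and index access return the same values
print(result.scheme, result[0])  # http http
print(result.netloc, result[1])  # www.baidu.com www.baidu.com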

# urlparse: the scheme argument supplies a default protocol; with allow_fragments=False the fragment is folded into the nearest preceding component (query, params, or path). Note that without a leading '//' the host ends up in path, not netloc:
result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https', allow_fragments=False)

'''
ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5#comment', fragment='')
'''
# 2. urlunparse()

from urllib.parse import urlunparse
 
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))

'''
http://www.baidu.com/index.html;user?a=6#comment
'''
# 3. urlsplit()
from urllib.parse import urlsplit
 
result = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
print(result)

'''
SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')
'''
# 4. urlunsplit()

from urllib.parse import urlunsplit
 
data = ['http', 'www.baidu.com', 'index.html', 'a=6', 'comment']
print(urlunsplit(data))

'''
http://www.baidu.com/index.html?a=6#comment
'''
# 5. urljoin()

from urllib.parse import urljoin

print(urljoin("http://www.baidu.com/about.html?wd=abc", "http://test/index.php"))

'''
http://test/index.php
'''
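By contrast, when the second argument carries no scheme or netloc of its own, urljoin() completes it from the base URL:

from urllib.parse import urljoin

# The base supplies scheme and netloc; the relative path replaces about.html
print(urljoin("http://www.baidu.com/about.html", "FAQ.html"))

'''
http://www.baidu.com/FAQ.html
'''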
# 6. urlencode()

from urllib.parse import urlencode
params = {
    "name": "test",
    "age": 30
}
base_url = "http://www.baidu.com?"
url = base_url + urlencode(params)
print(url)

'''
http://www.baidu.com?name=test&age=30
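urlencode() can also expand list values into repeated parameters when called with doseq=True; a quick illustration:

from urllib.parse import urlencode

# doseq=True turns each element of the list value into its own key=value pair
print(urlencode({"tag": ["a", "b"], "page": 1}, doseq=True))

'''
tag=a&tag=b&page=1
'''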
'''
# 7. parse_qs()

from urllib.parse import parse_qs

query = "name = test&age=22"
print(parse_qs(query))

'''
 {'name ': [' test'], 'age': ['22']
'''
# 8. parse_qsl()

from urllib.parse import parse_qsl

query = "name = test&age=22"
print(parse_qsl(query))

'''
[('name ', ' test'), ('age', '22')]
'''
# 9. quote()

from urllib.parse import quote

keyword = "测试"
url = "https://www.baidu.com/s?wd=" + quote(keyword)
print(url)

'''
https://www.baidu.com/s?wd=%E6%B5%8B%E8%AF%95
'''
# 10. unquote()

from urllib.parse import unquote

url = "https://www.baidu.com/s?wd=%E6%B5%8B%E8%AF%95"
print(unquote(url))

'''
https://www.baidu.com/s?wd=测试
'''

 

 Reference: 静觅 » Python3网络爬虫开发实战
