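A Detailed Guide to urllib in Python 3

In Python network programming, urllib is a long-standing, full-featured standard library package for sending URL requests, parsing URLs, and handling network errors. Unlike third-party libraries such as requests, urllib ships with Python and needs no extra installation, which makes it a good fit for lightweight network tasks and for environments with strict dependency constraints. This article walks through urllib in Python 3, from its core modules and basic usage to more advanced techniques.

1. The Core Components of urllib

urllib in Python 3 is a package made up of several submodules, each focused on a different job. The core modules are: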

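Submodule	Purpose	Typical use
`urllib.request`	Sends HTTP requests (GET, POST, etc.) and retrieves responses	Fetching web pages, calling APIs
`urllib.parse`	Parses URLs, handles query parameters, encodes data	Building URLs, preparing request parameters
`urllib.error`	Defines the exception classes raised by network requests, for error handling	Catching 404/500 errors, network timeouts, and similar failures
`urllib.robotparser`	Parses a site's `robots.txt` file to determine crawl permissions	Building compliant crawlers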

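2. urllib.request: The Core Tool for Sending HTTP Requests

urllib.request is the most commonly used module in urllib. It issues HTTP/HTTPS requests and retrieves the server's response. Its central function is urlopen(), which opens a URL directly and returns a response object.

2.1 Basic usage: sending a GET request

GET is the most common HTTP method and is used to retrieve a resource from the server. urlopen() uses GET by default: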

import urllib.request

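# 1. Send a GET request (the default method)
url = "https://httpbin.org/get"  # test API that echoes the request back
response = urllib.request.urlopen(url)

# 2. Handle the response
print("Status code:", response.status)  # 200 on success
print("Response headers:", response.getheaders())  # all response headers
body = response.read()  # read the body once; a second read() would return b""
print("Body (bytes):", body[:100])  # first 100 bytes
print("Body (string):", body.decode("utf-8"))  # decode to a string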

 

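Key points:

response is an HTTPResponse object that holds everything the server returned;
read() returns the response body as bytes, which you convert to a string with decode("utf-8");
the body can be read only once: after read() reaches the end, further calls return b"". HTTPResponse does not support seek(), so either keep the result of the first read() in a variable or send the request again.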
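
The read-once behaviour can be checked without touching the network. The following is a minimal sketch, assuming a throwaway local server: HelloHandler and the payload text are hypothetical names made up for this illustration and are not part of urllib. It also shows one way to pick the charset from the Content-Type header instead of hard-coding "utf-8".

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that always returns a short UTF-8 text body."""

    def do_GET(self):
        payload = "你好, urllib".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep the demo output quiet


# Bind to port 0 so the OS picks a free port, then serve in a background thread.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

with urllib.request.urlopen(url, timeout=5) as response:
    first = response.read()   # the whole body, as bytes
    second = response.read()  # nothing left to read
    # Use the charset declared in Content-Type, falling back to UTF-8.
    charset = response.headers.get_content_charset() or "utf-8"

text = first.decode(charset)
print(repr(second), text)

server.shutdown()
server.server_close()
```

Opening the response in a with block also closes the connection automatically once the body has been read.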

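2.2 Sending a POST request

POST requests submit data to the server (form submissions, API parameters, and so on). Pass the data through the data parameter, which must be of type bytes.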

import urllib.request
import urllib.parse

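# 1. Define the URL and the data to submit
url = "https://httpbin.org/post"
data = {
    "name": "张三",
    "age": 25,
    "hobby": ["coding", "reading"]
}

# 2. Encode the data (dict -> query string -> bytes)
# urlencode() turns the dict into a percent-encoded string such as "name=%E5%BC%A0%E4%B8%89&age=25..."
# doseq=True expands the list into repeated hobby=... parameters
# encode() converts the string to bytes (POST data must be bytes)
encoded_data = urllib.parse.urlencode(data, doseq=True).encode("utf-8")

# 3. Send the POST request (passing data switches the method to POST)
response = urllib.request.urlopen(url, data=encoded_data)

# 4. Print the response (the server echoes the submitted data for verification)
print(response.read().decode("utf-8"))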

 

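Core steps:

Use urllib.parse.urlencode() to turn the dict into a URL-encoded string (for example name=%E5%BC%A0%E4%B8%89; non-ASCII characters are percent-encoded automatically);
Call encode("utf-8") to convert the string to bytes, as urlopen() requires for the data parameter;
If data is not given, urlopen() uses GET by default.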
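
The link between data and the request method can be inspected offline, because building a Request object does not send anything. This is a small sketch under that assumption; the variable names (form, get_request, post_request) are illustrative only.

```python
import urllib.parse
import urllib.request

url = "https://httpbin.org/post"
form = {"name": "张三", "age": 25, "hobby": ["coding", "reading"]}

# dict -> percent-encoded query string -> bytes
query_string = urllib.parse.urlencode(form, doseq=True)
body = query_string.encode("utf-8")

# Constructing a Request does not touch the network,
# so the method it would use can be checked safely.
get_request = urllib.request.Request(url)
post_request = urllib.request.Request(url, data=body)

print(query_string)
print(get_request.get_method())   # GET
print(post_request.get_method())  # POST
```

When no Content-Type header is set explicitly, urllib sends such a body as application/x-www-form-urlencoded, which matches the output of urlencode().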

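2.3 Custom requests: setting headers, methods, and more

urlopen() on its own is limited. To set request headers (for example, to imitate a browser) or to choose the request method (such as PUT or DELETE), build a custom request with the urllib.request.Request class.

Example: imitating a browser (setting User-Agent)

Many sites reject requests that carry no browser-like identification (urllib identifies itself as Python-urllib/3.x by default). Setting User-Agent makes the request look like it comes from a browser: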

import urllib.request

# 1. Define the URL and the request headers
url = "https://httpbin.org/get"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
    "Referer": "https://www.baidu.com"  # optional, imitates the referring page
}

# 2. Build the Request object (URL plus headers)
req = urllib.request.Request(url=url, headers=headers)

# 3. Send the request
response = urllib.request.urlopen(req)

# 4. Check that the headers took effect (the echoed request info should include User-Agent)
print(response.read().decode("utf-8"))

 

Example: sending PUT/DELETE requests

import urllib.request

# Send a PUT request
url = "https://httpbin.org/put"
req = urllib.request.Request(url=url, method="PUT")  # set the method to PUT
response = urllib.request.urlopen(req)
print("PUT response:", response.status)  # 200

# Send a DELETE request
url = "https://httpbin.org/delete"
req = urllib.request.Request(url=url, method="DELETE")
response = urllib.request.urlopen(req)
print("DELETE response:", response.status)  # 200

 

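3. urllib.parse: URL Parsing and Data Encoding

The urllib.parse module provides the core URL-handling functions: parsing URLs, joining URLs, encoding request parameters, and so on. It is an essential tool for building valid requests.

3.1 Parsing URLs: urlparse() and urlunparse()

urlparse() splits a URL string into six components (scheme, netloc, path, params, query, fragment); urlunparse() does the reverse and reassembles the components into a URL.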

 

from urllib.parse import urlparse, urlunparse

# Parse the URL
url = "https://www.example.com:8080/path/index.html?name=张三&age=25#anchor"
parsed = urlparse(url)

print("Scheme:", parsed.scheme)      # https
print("Netloc:", parsed.netloc)      # www.example.com:8080
print("Path:", parsed.path)          # /path/index.html
print("Query:", parsed.query)        # name=张三&age=25
print("Fragment:", parsed.fragment)  # anchor

# Reassemble the URL
components = (
    parsed.scheme,
    parsed.netloc,
    "/new/path",  # replace the path
    parsed.params,
    parsed.query,
    parsed.fragment
)
new_url = urlunparse(components)
print("New URL:", new_url)  # https://www.example.com:8080/new/path?name=张三&age=25#anchor
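
The section intro also mentions joining URLs, which urlparse() and urlunparse() do not cover directly. A short sketch of urljoin(), which resolves a relative link against a base URL, together with the _replace() shortcut for changing a single component; the URLs below are placeholders chosen for illustration.

```python
from urllib.parse import urljoin, urlparse

base = "https://www.example.com:8080/path/index.html?name=test#anchor"

# urljoin() resolves a link relative to the base URL, much as a browser would.
sibling = urljoin(base, "about.html")
absolute_path = urljoin(base, "/root.html")
parent = urljoin(base, "../up.html")
other_site = urljoin(base, "https://docs.python.org/3/")

# ParseResult is a named tuple, so _replace() returns a modified copy
# and geturl() reassembles it into a string.
new_url = urlparse(base)._replace(path="/new/path").geturl()

for result in (sibling, absolute_path, parent, other_site, new_url):
    print(result)
```

Note that urljoin() drops the base URL's query and fragment, while _replace() keeps every component that is not explicitly changed.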

 

3.2 Handling query parameters: urlencode() and parse_qs()

urlencode() turns a dict into a URL query string (such as name=%E5%BC%A0%E4%B8%89&age=25), and parse_qs() parses a query string back into a dict.

from urllib.parse import urlencode, parse_qs

# Dict -> query string (for GET parameters or POST data)
params = {
    "name": "张三",
    "age": 25,
    "hobby": ["coding", "reading"]  # with doseq=True, a list becomes repeated parameters
}
query_string = urlencode(params, doseq=True)
print("Query string:", query_string)  # name=%E5%BC%A0%E4%B8%89&age=25&hobby=coding&hobby=reading

# Query string -> dict (for parsing parameters found in a URL or response)
parsed_params = parse_qs(query_string)
print("Parsed dict:", parsed_params)
# {'name': ['张三'], 'age': ['25'], 'hobby': ['coding', 'reading']}
# Note: every value is a list, because a parameter name may appear more than once
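
For GET requests the encoded parameters are appended to the URL rather than sent as a body. A minimal sketch of that step, assuming a hypothetical helper named build_url (not part of urllib) and ignoring URLs that carry a fragment:

```python
from urllib.parse import parse_qs, parse_qsl, urlencode, urlparse


def build_url(base_url, params):
    """Append params to base_url as a query string (hypothetical helper)."""
    separator = "&" if urlparse(base_url).query else "?"
    return base_url + separator + urlencode(params, doseq=True)


url = build_url(
    "https://httpbin.org/get",
    {"name": "张三", "hobby": ["coding", "reading"]},
)
query = urlparse(url).query

as_dict = parse_qs(query)    # values grouped into lists per name
as_pairs = parse_qsl(query)  # flat list of (name, value) pairs, order preserved

print(url)
print(as_dict)
print(as_pairs)
```

parse_qsl() is convenient when the order of repeated parameters matters or when list-valued results are not wanted.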

 

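3.3 Encoding and decoding special characters: quote() and unquote()

URLs cannot contain spaces, Chinese characters, or other special characters directly; encode them with quote() and decode them with unquote().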

from urllib.parse import quote, unquote

# Encode special characters (Chinese, spaces, and so on)
original = "张三 & 李四"
encoded = quote(original)
print("Encoded:", encoded)  # %E5%BC%A0%E4%B8%89%20%26%20%E6%9D%8E%E5%9B%9B

# Decode
decoded = unquote(encoded)
print("Decoded:", decoded)  # 张三 & 李四
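
Two details of quote() are easy to miss: it leaves "/" unescaped by default, and it encodes a space as %20 rather than "+". The sketch below contrasts it with quote_plus(), the variant used for form-style query values; the sample strings are illustrative only.

```python
from urllib.parse import quote, quote_plus, unquote_plus

path = "docs/张三 notes.txt"

default_quoted = quote(path)          # "/" is treated as safe and kept
fully_quoted = quote(path, safe="")   # safe="" encodes "/" as well
form_quoted = quote_plus("张三 & 李四")  # spaces become "+"
form_decoded = unquote_plus("a+b%26c")  # "+" turns back into a space

print(default_quoted)
print(fully_quoted)
print(form_quoted)
print(form_decoded)
```

As a rule of thumb, quote() suits path segments, while quote_plus() matches what urlencode() produces for query strings.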

 

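4. urllib.error: Handling Network Exceptions

A network request can fail for many reasons (the page does not exist, the network times out, and so on). urllib.error defines two main exception classes for catching and handling these failures.

4.1 HTTPError: HTTP protocol errors (such as 404 and 500)

HTTPError is a subclass of URLError. It corresponds to an error status code in the HTTP response (4xx client errors, 5xx server errors) and carries the status code, response headers, and related information.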

import urllib.request
from urllib.error import HTTPError

url = "https://httpbin.org/status/404"  # simulates a 404 error

try:
    response = urllib.request.urlopen(url)
except HTTPError as e:
    print("HTTP status code:", e.code)  # 404
    print("Error response headers:", e.headers)
    print("Error response body:", e.read().decode("utf-8"))

 

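4.2 URLError: general network errors (no network, nonexistent domain, and so on)

URLError covers errors outside the HTTP protocol, such as a dropped connection or a failed DNS lookup. The cause is available through the reason attribute.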

import urllib.request
from urllib.error import URLError

url = "https://invalid.example.invalid"  # invalid domain

try:
    response = urllib.request.urlopen(url, timeout=5)
except URLError as e:
    print("Network error reason:", e.reason)  # e.g. [Errno -2] Name or service not known (wording varies by platform)

 

4.3 Order of exception handlers

Because HTTPError is a subclass of URLError, handle HTTPError first and URLError second; otherwise the URLError clause catches HTTPError as well:

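import urllib.request
from urllib.error import HTTPError, URLError

url = "https://httpbin.org/status/404"  # same test URL as in 4.1

try:
    response = urllib.request.urlopen(url)
except HTTPError as e:
    print("HTTP error:", e.code)
except URLError as e:
    print("Network error:", e.reason)
else:
    print("Request succeeded, status code:", response.status)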
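
The same ordering can be wrapped in a small helper and exercised without internet access. This is only a sketch under stated assumptions: fetch and StatusHandler are hypothetical names for this illustration, the server is a throwaway local one, and the third URL uses a made-up scheme purely to provoke a URLError.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError, URLError


class StatusHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: /ok returns 200, every other path returns 404."""

    def do_GET(self):
        status = 200 if self.path == "/ok" else 404
        body = b"ok" if status == 200 else b"not found"
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch(url, timeout=5):
    """Return (kind, detail) instead of raising; HTTPError is checked first."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return ("ok", response.status)
    except HTTPError as e:   # subclass first
        return ("http_error", e.code)
    except URLError as e:    # then the general case
        return ("url_error", str(e.reason))


server = HTTPServer(("127.0.0.1", 0), StatusHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

results = [
    fetch(base + "/ok"),
    fetch(base + "/missing"),
    fetch("nosuchscheme://example"),  # no handler for this scheme -> URLError
]
print(results)

server.shutdown()
server.server_close()
```

Returning a tagged tuple is just one possible design; re-raising a custom exception or logging and returning None would work equally well.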

 

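5. Advanced Usage: Proxies, Cookies, and Timeouts

urllib is basic, but it also supports advanced features such as proxies and cookie management, which are implemented by combining Handler and Opener objects.

5.1 Using a proxy server

Set a proxy with ProxyHandler to hide your real IP or to reach restricted resources: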

 

import urllib.request

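# 1. Define the proxies (key: protocol of the target URL, value: proxy address)
# 123.45.67.89:8080 is a placeholder; replace it with a real proxy
proxies = {
    "http": "http://123.45.67.89:8080",
    "https": "http://123.45.67.89:8080"  # HTTPS traffic is tunnelled through the proxy with CONNECT
}

# 2. Create the proxy handler
proxy_handler = urllib.request.ProxyHandler(proxies)

# 3. Build a custom opener
opener = urllib.request.build_opener(proxy_handler)

# 4. Send the request through the opener (instead of urlopen())
url = "https://httpbin.org/get"
response = opener.open(url)
print(response.read().decode("utf-8"))  # the "origin" field should show the proxy's IP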

 

5.2 Handling cookies

The http.cookiejar module works together with urllib to manage cookies, for tasks such as keeping a login session alive:

import urllib.request
import http.cookiejar

# 1. Create a CookieJar object (stores the cookies)
cookie_jar = http.cookiejar.CookieJar()

# 2. Create the cookie handler
cookie_handler = urllib.request.HTTPCookieProcessor(cookie_jar)

# 3. Build an opener (handles cookies automatically)
opener = urllib.request.build_opener(cookie_handler)

# 4. Send a request (stands in for a login; this URL sets a cookie)
login_url = "https://httpbin.org/cookies/set?session=123456"
opener.open(login_url)

# 5. Visit a page that requires the login (the cookie is sent automatically)
profile_url = "https://httpbin.org/cookies"
response = opener.open(profile_url)
print("Current cookies:", response.read().decode("utf-8"))  # should include session=123456

 

5.3 Setting a timeout

To keep a request from waiting indefinitely on a network problem, set a timeout (in seconds) with the timeout parameter:

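import socket
import urllib.request
from urllib.error import URLError

url = "https://httpbin.org/delay/10"  # test URL that responds after a 10-second delay

try:
    # Give up after 5 seconds
    response = urllib.request.urlopen(url, timeout=5)
except socket.timeout as e:
    # A timeout while waiting for the response is raised directly
    # (socket.timeout is an alias of TimeoutError since Python 3.10)
    print("Timed out:", e)
except URLError as e:
    # Failures while connecting or sending, including connect timeouts, are wrapped in URLError
    print("Network error:", e.reason)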
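
Timeouts can surface in two shapes: as a URLError whose reason is a timeout (typically during connection setup) or as a bare socket.timeout while waiting for the response. The following sketch, assuming a deliberately slow local server, handles both; SlowHandler and fetch_with_timeout are hypothetical names used only for this illustration.

```python
import socket
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.error import URLError


class SlowHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that waits 2 seconds before answering."""

    def do_GET(self):
        time.sleep(2)
        try:
            self.send_response(200)
            self.send_header("Content-Length", "2")
            self.end_headers()
            self.wfile.write(b"ok")
        except OSError:
            pass  # the client has already given up

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_with_timeout(url, timeout):
    """Return "ok", "timeout", or "url_error" instead of raising."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
            return "ok"
    except socket.timeout:
        return "timeout"  # timed out while waiting for the response
    except URLError as e:
        # Timeouts during connection setup arrive wrapped in URLError.
        return "timeout" if isinstance(e.reason, socket.timeout) else "url_error"


server = ThreadingHTTPServer(("127.0.0.1", 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

quick = fetch_with_timeout(url, timeout=0.5)  # shorter than the 2-second delay
patient = fetch_with_timeout(url, timeout=5)  # long enough to get the answer
print(quick, patient)

server.shutdown()
server.server_close()
```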

 

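6. urllib.robotparser: A Tool for Compliant Crawling

urllib.robotparser parses a site's robots.txt file and tells you whether a crawler may access a given URL. It is a basic building block of a compliant crawler.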

from urllib.robotparser import RobotFileParser

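# 1. Create the parser and load robots.txt
rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")  # URL of the robots.txt file
rp.read()  # fetch and parse the file

# 2. Check whether the crawler may access a URL
user_agent = "my_crawler"  # crawler identifier
url1 = "https://www.example.com/public/page"  # assumed to be allowed
url2 = "https://www.example.com/private/page"  # assumed to be disallowed

# The results depend on the rules in the site's actual robots.txt
print(f"Can fetch {url1}:", rp.can_fetch(user_agent, url1))  # True if allowed
print(f"Can fetch {url2}:", rp.can_fetch(user_agent, url2))  # False if disallowed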
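
To see how the rules are evaluated without depending on a live site, RobotFileParser can also parse rules held in memory through its parse() method. A minimal sketch, where the robots.txt content and the agent names are hypothetical and written only for this illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory instead of fetched.
robots_txt = """\
User-agent: *
Disallow: /private/

User-agent: bad_bot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())  # parse() takes an iterable of lines

public_ok = rp.can_fetch("my_crawler", "https://www.example.com/public/page")
private_ok = rp.can_fetch("my_crawler", "https://www.example.com/private/page")
bad_bot_ok = rp.can_fetch("bad_bot", "https://www.example.com/public/page")

print(public_ok, private_ok, bad_bot_ok)
```

The "*" group applies to any crawler without a more specific entry, which is why my_crawler is only blocked from /private/ while bad_bot is blocked everywhere.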

 

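7. Limitations of urllib and Alternatives

As a standard library package, urllib is stable but relatively verbose to use. Its limitations include:

No built-in session persistence (cookies must be managed manually);
Complex forms (such as file uploads) take a lot of code;
No built-in connection pool, so high-concurrency workloads are inefficient.

If you need a simpler API or more advanced features, consider the third-party library requests (installed with pip install requests). Its syntax is more direct (requests.get(), requests.post()), and it has built-in session management and connection pooling.

8. Summary

urllib is Python's basic standard library package for URL requests: urllib.request sends requests, urllib.parse handles URLs, and urllib.error covers exceptions, which together meet most lightweight networking needs. Its main advantage is having zero dependencies, which suits constrained environments (for example, a server without the internet access needed to install third-party packages).

Learning urllib not only lets you write simple crawlers and API calls, it also helps you understand the underlying request/response mechanics of HTTP.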

Posted on 2025-11-10 09:10 by 小陶coding
