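Learning urllib

1. Using urllib

request: the most basic HTTP request module, used to simulate sending a request, much like typing a URL into the browser and pressing Enter
error: the exception-handling module. When a request fails, these exceptions can be caught and the request retried or otherwise handled, so that the program does not terminate unexpectedly
parse: a utility module that provides many functions for working with URLs
robotparser: mainly used to parse a site's robots.txt file and decide which pages may be crawled and which may not

1.1 Sending requests

Use the request module of the urllib library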

urlopen

import urllib.request
import ssl


ssl._create_default_https_context = ssl._create_unverified_context  # disable certificate verification globally (insecure, for experiments only)
response = urllib.request.urlopen('https://www.python.org')
print(type(response))
print(response.read().decode('utf-8'))

--------------------------
Output:
<class 'http.client.HTTPResponse'>
<!doctype html>
<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->
<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->
<!--[if IE 8]>      <html class="no-js ie8 lt-ie9">                 <![endif]-->
<!--[if gt IE 8]><!--><html class="no-js" lang="en" dir="ltr">  <!--<![endif]-->
······

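As the output shows, the response is an object of type HTTPResponse. It mainly provides the methods read, readinto, getheader, getheaders and fileno, and the attributes msg, version, status, reason, debuglevel and closed

For example:

Calling read returns the content of the page

Reading the status attribute returns the status code of the response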

import urllib.request
import ssl


ssl._create_default_https_context = ssl._create_unverified_context  # disable certificate verification globally (insecure, for experiments only)
response = urllib.request.urlopen('https://www.python.org')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))
--------------------------------
Output:
200
[('Connection', 'close'), ('Content-Length', '49624'), ('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'DENY'), ('Via', '1.1 vegur, 1.1 varnish, 1.1 varnish'), ('Accept-Ranges', 'bytes'), ('Date', 'Tue, 14 Dec 2021 03:01:12 GMT'), ('Age', '2104'), ('X-Served-By', 'cache-bwi5135-BWI, cache-hkg17923-HKG'), ('X-Cache', 'HIT, HIT'), ('X-Cache-Hits', '1, 6534'), ('X-Timer', 'S1639450872.405611,VS0,VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]
nginx

How do we send parameters along with a request? First look at the signature of urlopen:

urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
*, cafile=None, capath=None, cadefault=False, context=None)

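The data parameter

The data parameter is optional and defaults to None. When supplied, it must be a bytes object, so the parameters have to be converted with bytes() first. Also, once data is passed, the request is sent as POST rather than GET.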

import urllib.request
import urllib.parse
import ssl

ssl._create_default_https_context = ssl._create_unverified_context  # disable certificate verification globally (insecure, for experiments only)
data = bytes(urllib.parse.urlencode({'name': 'jack'}), encoding='utf-8')
response = urllib.request.urlopen('https://www.httpbin.org/post', data=data)
print(response.read().decode('utf-8'))
---------------------
Output:
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "name": "jack"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Content-Length": "9", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "www.httpbin.org", 
    "User-Agent": "Python-urllib/3.10", 
    "X-Amzn-Trace-Id": "Root=1-61b80c81-3954b4385516dd6830732700"
  }, 
  "json": null, 
  "origin": "223.104.151.89", 
  "url": "https://www.httpbin.org/post"
}

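The parameters must be sent as bytes, so bytes() is used for the conversion. Its first argument must be a str, which is why urlencode from urllib.parse is used to turn the dict into a string; the second argument specifies the encoding

The output shows that the parameters we sent appear in the form field, which means the request simulated a form submission and transferred the data via POST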
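
The switch from GET to POST can also be observed without sending anything. The following is a minimal sketch that builds two Request objects (the class covered later in this section) and asks each which method it would use. The httpbin URLs are the ones from the examples above; the variable names are only illustrative.

```python
import urllib.parse
import urllib.request

# urlencode turns the dict into a query string; encode() turns it into bytes
encoded = urllib.parse.urlencode({'name': 'jack'})
body = encoded.encode('utf-8')

# No request is sent here; we only inspect which method would be used
get_req = urllib.request.Request('https://www.httpbin.org/get')
post_req = urllib.request.Request('https://www.httpbin.org/post', data=body)

print(encoded)                # name=jack
print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
```

Calling encode('utf-8') on the string is equivalent to the bytes(..., encoding='utf-8') form used above.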

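The timeout parameter

The timeout parameter sets the timeout in seconds: if no response arrives within that time, an exception is raised. If it is not set, the global default timeout is used. It works for HTTP, HTTPS and FTP requests.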

import urllib.request
import ssl


ssl._create_default_https_context = ssl._create_unverified_context  # disable certificate verification globally (insecure, for experiments only)
response = urllib.request.urlopen('https://www.httpbin.org/post', timeout=0.1)
print(response.read().decode('utf-8'))
----------------
Output:
urllib.error.URLError: <urlopen error timed out>

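With timeout set to 0.1 s the server cannot respond quickly enough, so a URLError is raised. This exception belongs to the urllib.error module, and the cause is a timeout.

When crawling, timeout can be used to skip pages that do not respond in time; this is done with a try/except statement.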

import socket
import urllib.request
import urllib.error
import ssl

ssl._create_default_https_context = ssl._create_unverified_context  # disable certificate verification globally (insecure, for experiments only)
try:
    response = urllib.request.urlopen('https://www.httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('time out')
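
The same idea can be wrapped in a small helper for use in a crawl loop. This is only a sketch: fetch_or_skip and its opener parameter are hypothetical names, and opener exists purely so the function can be tried without network access. It also covers timeouts that occur while reading the body, which are raised directly rather than wrapped in URLError.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_skip(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return the response body, or None if the request timed out.

    fetch_or_skip and opener are hypothetical names used for this sketch.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.read()
    except urllib.error.URLError as e:
        # Timeouts while connecting arrive wrapped in URLError
        if isinstance(e.reason, socket.timeout):
            return None
        raise
    except socket.timeout:
        # Timeouts while reading the body are raised directly
        return None


def always_times_out(url, timeout=None):
    # Stand-in opener that simulates a connection timeout
    raise urllib.error.URLError(socket.timeout('timed out'))


print(fetch_or_skip('https://www.httpbin.org/get', opener=always_times_out))  # None
```

Errors other than timeouts are re-raised, so the caller can still decide how to handle them.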

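Other parameters
- context: must be an ssl.SSLContext object and specifies the SSL settings
- cafile and capath: specify the CA certificate file and the directory of CA certificates, respectively; they are relevant when requesting HTTPS URLs (both have been deprecated since Python 3.6 in favor of context)
- cadefault: deprecated, defaults to False

Request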

urlopen can send basic requests, but its few simple parameters are not enough to build a complete request. To add information such as headers, use the more powerful Request class

Request(url, data=None, headers={},origin_req_host=None,unverifiable=False,
method=None)

Only url is required; all other parameters are optional

from urllib import parse, request
import ssl


ssl._create_default_https_context = ssl._create_unverified_context
url = 'https://www.httpbin.org/post'
data = bytes(parse.urlencode({'name': 'jack'}), encoding='utf-8')
hd = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:95.0) Gecko/20100101 Firefox/95.0',

}
req = request.Request(url, data=data, headers=hd, method='POST')
resp = request.urlopen(req)
print(resp.read().decode('utf-8'))


---------------------------------------------------
Output:
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "name": "jack"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Content-Length": "9", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "www.httpbin.org", 
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:95.0) Gecko/20100101 Firefox/95.0", 
    "X-Amzn-Trace-Id": "Root=1-61b83f4d-5ff759be0bebd9b0577cf774"
  }, 
  "json": null, 
  "origin": "223.104.151.81", 
  "url": "https://www.httpbin.org/post"
}

The output shows that data, headers and method were all set successfully

Headers can also be added with the add_header method:

req = request.Request(url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:95.0) Gecko/20100101 Firefox/95.0')
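
Whether a header was actually set can be checked on the Request object before anything is sent. A minimal sketch follows; the User-Agent value demo-agent/0.1 is made up for illustration.

```python
from urllib import parse, request

data = parse.urlencode({'name': 'jack'}).encode('utf-8')
req = request.Request('https://www.httpbin.org/post', data=data, method='POST')
# 'demo-agent/0.1' is a made-up value for illustration
req.add_header('User-Agent', 'demo-agent/0.1')

# Request normalizes header names with str.capitalize(), hence 'User-agent'
print(req.has_header('User-agent'))   # True
print(req.get_header('User-agent'))   # demo-agent/0.1
print(req.get_method())               # POST
```

Note the lookup key: the header is stored as User-agent, so looking it up as User-Agent finds nothing.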

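Advanced usage
- More advanced operations, such as handling cookies or setting a proxy, require Handlers. A Handler can be thought of as a processor for one specific concern: there are handlers for login authentication, for cookies, and so on. With them, almost everything an HTTP request can do is achievable
- BaseHandler in urllib.request is the parent class of all other Handler classes and provides the most basic methods, such as default_open and protocol_request

Some examples of related classes: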

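HTTPDefaultErrorHandler	Handles HTTP error responses; every error is raised as an HTTPError
HTTPRedirectHandler	Handles redirects
HTTPCookieProcessor	Handles cookies
ProxyHandler	Sets proxies; when no proxies are given, they are read from the environment variables
HTTPPasswordMgr	Manages passwords; maintains the mapping between users and passwords (a helper class used by the authentication handlers, not itself a Handler)
HTTPBasicAuthHandler	Manages authentication; if opening a URL requires authentication, this class can handle it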

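Another important class is OpenerDirector, which we can simply call an Opener. The urlopen function used earlier is in fact a ready-made Opener that urllib provides for us

The Request class and the urlopen function are the library's pre-packaged shortcuts for the most common requests. They cover basic requests, but more advanced features require an Opener.

An Opener provides the open method, whose return value is of the same type as that of urlopen.

How are Openers and Handlers related? In short, an Opener is built from Handlers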
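
That relationship can be inspected directly. The sketch below builds an opener from one handler and looks at the opener's handlers list; nothing is sent over the network. It relies on build_opener adding its default handlers next to the ones passed in.

```python
import http.cookiejar
from urllib.request import (HTTPCookieProcessor, HTTPRedirectHandler,
                            OpenerDirector, build_opener)

cookie_handler = HTTPCookieProcessor(http.cookiejar.CookieJar())
opener = build_opener(cookie_handler)

print(isinstance(opener, OpenerDirector))   # True
# Our handler sits alongside the defaults that build_opener adds
print(cookie_handler in opener.handlers)    # True
print(any(isinstance(h, HTTPRedirectHandler) for h in opener.handlers))  # True
```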

from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError
import ssl

ssl._create_default_https_context = ssl._create_unverified_context
username = 'admin'
passwd = 'admin'
url = 'https://ssr3.scrape.center/'

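# Create a password manager that stores the user names and passwords to use
p = HTTPPasswordMgrWithDefaultRealm()
# Register the credentials. The first argument, realm, is usually left as None;
# the remaining three are the URL, the user name and the password
p.add_password(None, url, username, passwd)
# Create an HTTPBasicAuthHandler for HTTP Basic authentication, backed by the password manager
auth_handler = HTTPBasicAuthHandler(p)
# Pass auth_handler to build_opener; requests sent through the resulting opener
# supply the credentials when the server asks for them
opener = build_opener(auth_handler)

try:
    # Open the URL with the opener's open method; the result is the page source after successful authentication
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)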
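
For reference, HTTP Basic authentication boils down to an Authorization header containing the base64 encoding of user:password. The sketch below sets that header by hand on a Request, which sends the credentials up front instead of waiting for the server's challenge. This is a different technique from HTTPBasicAuthHandler and is shown only to make the header format visible; nothing is sent here.

```python
import base64
from urllib.request import Request

username = 'admin'
passwd = 'admin'

# Basic authentication sends base64("user:password") in the Authorization header
token = base64.b64encode(f'{username}:{passwd}'.encode('utf-8')).decode('ascii')
req = Request('https://ssr3.scrape.center/')
req.add_header('Authorization', f'Basic {token}')

print(req.get_header('Authorization'))  # Basic YWRtaW46YWRtaW4=
```

Because base64 is an encoding and not encryption, such a header should only travel over HTTPS.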

Proxy

from urllib.request import ProxyHandler, build_opener
import ssl
from urllib.error import URLError

ssl._create_default_https_context = ssl._create_unverified_context
proxy = ProxyHandler({
    'http': 'http://127.0.0.1:8080',
    'https': 'https://127.0.0.1:8080'
})
opener = build_opener(proxy)

try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

This requires an HTTP proxy running locally on port 8080

ProxyHandler takes a dict whose keys are protocol names and whose values are proxy URLs; several proxies can be added
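
A related case is switching proxies off. As a small sketch: passing an empty dict to ProxyHandler disables proxies altogether, including any picked up from environment variables. The local proxy address is the same assumed one as above, and no request is sent.

```python
from urllib.request import ProxyHandler, build_opener

# Assumed local proxy, as in the example above
via_proxy = ProxyHandler({'http': 'http://127.0.0.1:8080'})
# An empty dict disables proxies, including ones from the environment
no_proxy = ProxyHandler({})

proxy_opener = build_opener(via_proxy)
direct_opener = build_opener(no_proxy)

print(via_proxy.proxies)  # {'http': 'http://127.0.0.1:8080'}
print(no_proxy.proxies)   # {}
```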

import http.cookiejar
import urllib.request
import ssl

ssl._create_default_https_context = ssl._create_unverified_context
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('https://www.baidu.com')
for item in cookie:
    print(item.name + "=" + item.value)
    
-------------------------------------
Output:
BAIDUID=94A74AAEB81732293073685636991C16:FG=1
BIDUPSID=94A74AAEB81732293C7FE402E7A4E7E0
PSTM=1639470242
BD_NOT_HTTPS=1

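First create a CookieJar object, then use HTTPCookieProcessor to create a handler, build an opener with build_opener, and finally call open. The loop prints the name and value of each cookie.

Since the cookies can be printed, they can also be written to a file. For that, replace CookieJar with MozillaCookieJar, a subclass of CookieJar that handles cookie-file operations such as saving and loading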

import http.cookiejar
import urllib.request
import ssl

ssl._create_default_https_context = ssl._create_unverified_context
filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('https://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

This saves the cookies to a file named cookie.txt

LWPCookieJar can likewise save and load cookies; it stores them in LWP format

import http.cookiejar
import urllib.request
import ssl

ssl._create_default_https_context = ssl._create_unverified_context
filename = 'cookie1.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('https://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

Once the cookie file exists, it can be loaded and its contents reused

import http.cookiejar
import urllib.request
import ssl

ssl._create_default_https_context = ssl._create_unverified_context

cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie1.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('https://www.baidu.com')
print(response.read().decode('utf-8'))
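
The save and load steps can also be tried without visiting any site. The sketch below writes a hand-made cookie to a temporary file and reads it back; every cookie value in it is invented for the demo, and the file name cookie_demo.txt is arbitrary.

```python
import os
import tempfile
from http.cookiejar import Cookie, LWPCookieJar

# A hand-made session cookie; every value here is invented for the demo
demo = Cookie(
    version=0, name='session', value='abc123',
    port=None, port_specified=False,
    domain='example.com', domain_specified=False, domain_initial_dot=False,
    path='/', path_specified=True,
    secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={},
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cookie_demo.txt')

    jar = LWPCookieJar(path)
    jar.set_cookie(demo)
    # ignore_discard=True keeps session cookies, which are skipped otherwise
    jar.save(ignore_discard=True, ignore_expires=True)

    loaded = LWPCookieJar()
    loaded.load(path, ignore_discard=True, ignore_expires=True)

restored = {c.name: c.value for c in loaded}
print(restored)  # {'session': 'abc123'}
```

The file must be loaded with the same jar class that wrote it, since the Mozilla and LWP formats differ.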

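1.2 Handling exceptions

The error module of urllib defines the exceptions raised by the request module; when something goes wrong, request raises one of them

URLError

It has one attribute, reason, which holds the cause of the error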

from urllib import request,error
import ssl


ssl._create_default_https_context = ssl._create_unverified_context
try:
    response = request.urlopen('https://www.cnblogs.com/xiaohuicode/404')
except error.URLError as e:
    print(e.reason)
-------------------------------
Output:
Not Found

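When a non-existent page is requested, the program does not crash; it prints the cause of the error instead. This prevents abnormal termination and handles the exception properly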

HTTPError

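HTTPError is a subclass of URLError dedicated to HTTP request errors. It has three attributes:

code: the HTTP status code, e.g. 404 means the page does not exist

reason: the cause of the error

headers: the response headers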

from urllib import request,error
import ssl


ssl._create_default_https_context = ssl._create_unverified_context
try:
    response = request.urlopen('https://www.cnblogs.com/xiaohuicode/404')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')

Because HTTPError is a subclass of URLError, catch the subclass first and the parent class afterwards

from urllib import request,error
import ssl


ssl._create_default_https_context = ssl._create_unverified_context
try:
    response = request.urlopen('https://www.cnblogs.com/xiaohuicode/404')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
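
The subclass relationship, and why the order matters, can be checked without a network connection. In this sketch both exceptions are constructed by hand; describe is a hypothetical helper and the example.com URL is a placeholder.

```python
from email.message import Message
from urllib.error import HTTPError, URLError


def describe(exc):
    """Summarize a urllib exception; describe is a hypothetical helper."""
    # Check the subclass first, mirroring the order of the except clauses
    if isinstance(exc, HTTPError):
        return f'HTTP {exc.code}: {exc.reason}'
    if isinstance(exc, URLError):
        return f'URL error: {exc.reason}'
    return f'other: {exc}'


# Both exceptions are constructed by hand so nothing is sent over the network
not_found = HTTPError('https://example.com/missing', 404, 'Not Found', Message(), None)
unreachable = URLError('no route to host')

print(describe(not_found))              # HTTP 404: Not Found
print(describe(unreachable))            # URL error: no route to host
print(isinstance(not_found, URLError))  # True
```

If the URLError check came first, the HTTPError would match it too and the status code would never be reported.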

Sometimes reason is not a string but an object

from urllib import request,error
import ssl
import socket


ssl._create_default_https_context = ssl._create_unverified_context
try:
    response = request.urlopen('https://www.cnblogs.com/xiaohuicode/', timeout=0.1)
except error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('Time Out')
----------------------------------
Output:
<class 'TimeoutError'>
Time Out

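1.3 Parsing URLs

urllib also provides the parse module, which defines a standard interface for working with URLs

urlparse

This function identifies a URL and splits it into its components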

from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/index.html;user?id=5#comment')
print(type(result))
print(result)
print(result.scheme, result[0], result.netloc, result[1], sep='\n')


------------------------------
Output:
<class 'urllib.parse.ParseResult'>

ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

https
https
www.baidu.com
www.baidu.com

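1. scheme is the protocol, netloc the domain name, path the access path, params the parameters, query the query string, and fragment the anchor (used to jump to a position within the page)

2. The returned ParseResult is a tuple; its fields can be accessed by attribute name or by index

3. urlparse(url, scheme='', allow_fragments=True) takes three parameters. url is the URL to parse and is required. scheme is the default protocol, used only when the URL itself does not specify one. allow_fragments controls whether the fragment is recognized: if it is set to False, the fragment is not split off and stays part of path, params or query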

from urllib.parse import urlparse

result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='http')
print(result)

--------------
ParseResult(scheme='http', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')
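
Note that netloc is empty in the result above: urlparse only recognizes a netloc when it is introduced by //. A short sketch comparing the two forms:

```python
from urllib.parse import urlparse

without_slashes = urlparse('www.baidu.com/index.html', scheme='http')
with_slashes = urlparse('//www.baidu.com/index.html', scheme='http')

# Without '//' the host name ends up in path
print(repr(without_slashes.netloc), without_slashes.path)
# With '//' the host name is recognized as netloc
print(repr(with_slashes.netloc), with_slashes.path)
```

In both cases the default scheme http is applied, because neither URL names one itself.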

from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
print(result)

------------
ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html', params='user', query='id=5#comment', fragment='')

urlunparse

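urlunparse is the counterpart of urlparse and builds a URL from its components. Exactly 6 components must be passed; otherwise an error about too few or too many values is raised. The argument can be a list, a tuple or another suitable sequence.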

from urllib.parse import urlunparse

data = ('https', 'www.baidu.com', 'index.html', 'user', 'id=5', 'comment')
print(urlunparse(data))
-------------------
https://www.baidu.com/index.html;user?id=5#comment
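
Since the result of urlparse already has exactly six fields, it can be fed straight back into urlunparse. The sketch below shows that round trip and how to change a single component first; it assumes only the standard namedtuple behaviour of ParseResult.

```python
from urllib.parse import urlparse, urlunparse

parts = urlparse('https://www.baidu.com/index.html;user?id=5#comment')
# ParseResult is a namedtuple, so _replace returns a modified copy
changed = parts._replace(query='id=6', fragment='')

print(urlunparse(parts))    # https://www.baidu.com/index.html;user?id=5#comment
print(urlunparse(changed))  # https://www.baidu.com/index.html;user?id=6
```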

urlsplit

urlsplit is similar to urlparse, except that it does not parse params separately; params stays part of path

from urllib.parse import urlsplit

result = urlsplit('https://www.baidu.com/index.html;user?id=5#comment')
print(result)
-----------------
SplitResult(scheme='https', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')

SplitResult is also a tuple whose fields can be accessed by attribute name or by index

urlunsplit

Similar to urlunparse, it joins the components into a complete URL, but exactly 5 components must be passed

from urllib.parse import urlunsplit

data = ('https', 'www.baidu.com', 'index.html;user', 'id=5', 'comment')
print(urlunsplit(data))
--------------------
https://www.baidu.com/index.html;user?id=5#comment

urljoin

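urlunparse and urlunsplit can both assemble a URL, but they need a fixed number of components.

urljoin is another way to build a URL. The first argument is base_url and the second is the new URL. urljoin analyses the scheme, netloc and path of base_url, fills in whatever the new URL lacks, and returns the result.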

from urllib.parse import urljoin

print(urljoin('https://www.baidu.com', 'index.html'))
print(urljoin('https://www.baidu.com', 'https://www.cnblogs.com/xiaohuicode'))
---------------
https://www.baidu.com/index.html
https://www.cnblogs.com/xiaohuicode
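
How much of base_url survives depends on the form of the second argument. A few more cases, sketched with an illustrative base URL that has a deeper path:

```python
from urllib.parse import urljoin

base = 'https://www.baidu.com/a/b.html'

print(urljoin(base, 'c.html'))               # https://www.baidu.com/a/c.html
print(urljoin(base, '/c.html'))              # https://www.baidu.com/c.html
print(urljoin(base, '../c.html'))            # https://www.baidu.com/c.html
print(urljoin(base, '?id=5'))                # https://www.baidu.com/a/b.html?id=5
print(urljoin(base, '//www.cnblogs.com/x'))  # https://www.cnblogs.com/x
```

This is the behaviour needed when resolving relative links found in a crawled page against the page's own URL.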

urlencode

urlencode is very useful for building GET request parameters

from urllib.parse import urlencode

params = {
    'name': 'jack',
    'age': '23'
}
base_url = 'https://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

-----------------------------------
https://www.baidu.com?name=jack&age=23
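
urlencode also escapes characters that would otherwise break the query string, and it can expand list values. A brief sketch with made-up parameter names:

```python
from urllib.parse import urlencode

# Spaces become '+', and '&' inside a value is escaped as %26
escaped = urlencode({'q': 'a b&c'})
# doseq=True expands a list into repeated keys
repeated = urlencode({'tag': ['x', 'y']}, doseq=True)

print(escaped)   # q=a+b%26c
print(repeated)  # tag=x&tag=y
```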

parse_qs

parse_qs converts a GET query string back into a dict

from urllib.parse import urlparse
from urllib.parse import parse_qs

result = urlparse('https://www.baidu.com?name=jack&age=23')
print(parse_qs(result.query))
------------------------
{'name': ['jack'], 'age': ['23']}

parse_qsl

parse_qsl converts the parameters into a list of tuples

from urllib.parse import urlparse
from urllib.parse import parse_qsl

result = urlparse('https://www.baidu.com?name=jack&age=23')
print(parse_qsl(result.query))
-------------------------
[('name', 'jack'), ('age', '23')]

The output is a list of two tuples; in each tuple the first item is the parameter name and the second is its value
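
The reason parse_qs returns lists as values is that a key may appear more than once in a query string. The sketch below uses a made-up query with a repeated key and shows that both results can be turned back into the original string with urlencode.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode

query = 'tag=x&tag=y&name=jack'
as_dict = parse_qs(query)
as_pairs = parse_qsl(query)

print(as_dict)   # {'tag': ['x', 'y'], 'name': ['jack']}
print(as_pairs)  # [('tag', 'x'), ('tag', 'y'), ('name', 'jack')]
# Both forms can be turned back into the original query string
print(urlencode(as_pairs))             # tag=x&tag=y&name=jack
print(urlencode(as_dict, doseq=True))  # tag=x&tag=y&name=jack
```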

quote

Converts text into URL-encoded (percent-encoded) form

from urllib.parse import quote

kwd = '周杰伦'
url = 'https://www.baidu.com/s?wd=' + quote(kwd)
print(url)
------------------
https://www.baidu.com/s?wd=%E5%91%A8%E6%9D%B0%E4%BC%A6

unquote

Decodes a URL-encoded string

from urllib.parse import unquote

url = 'https://www.baidu.com/s?wd=%E5%91%A8%E6%9D%B0%E4%BC%A6'
print(unquote(url))
-----------------------
https://www.baidu.com/s?wd=周杰伦
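
Two details of quote are worth knowing: by default it leaves / unescaped (controlled by the safe parameter), and the related quote_plus encodes spaces as + the way form data does. A short sketch:

```python
from urllib.parse import quote, quote_plus, unquote

print(quote('/a b/'))            # /a%20b/
print(quote('/a b/', safe=''))   # %2Fa%20b%2F
print(quote_plus('a b'))         # a+b
print(unquote(quote('周杰伦')))   # 周杰伦
```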

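1.4 Analysing the robots protocol

The robots protocol

The robots protocol, formally the Robots Exclusion Standard, tells crawlers which pages may be crawled and which may not. It usually takes the form of a robots.txt file placed in the root directory of the website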

User-agent: *
Disallow: /

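User-agent names the crawler the rules apply to

Disallow specifies paths the crawler must not fetch; Disallow: / forbids crawling every page

There is also Allow; it is normally not used on its own but together with Disallow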

robotparser

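The robotparser module parses robots.txt and uses it to decide whether a crawler may fetch a given page

Common methods of RobotFileParser:

Method	Purpose
set_url	Sets the URL of the robots.txt file
read	Fetches robots.txt and parses it. This method must be called; otherwise every subsequent check returns False
parse	Parses robots.txt content; the argument is a list of lines from the file, which are analysed according to the robots.txt syntax
can_fetch	Takes a User-Agent as the first argument and the URL to fetch as the second; returns whether that User-Agent may fetch the URL
mtime	Returns the time robots.txt was last fetched and parsed; useful for long-running crawlers that need to re-fetch robots.txt periodically
modified	Sets the time robots.txt was last fetched and parsed to the current time; also useful for long-running crawlers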

from urllib.robotparser import RobotFileParser
import ssl

ssl._create_default_https_context = ssl._create_unverified_context
rp = RobotFileParser()
rp.set_url('https://www.baidu.com/robots.txt')
# The URL can also be passed directly: rp = RobotFileParser('https://www.baidu.com/robots.txt')
rp.read()
print(rp.can_fetch('Baiduspider', 'https://www.baidu.com'))
--------------
True
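
The parse method makes it possible to test rules without fetching anything. The sketch below feeds a few made-up rules for a hypothetical site, example.com, directly to the parser; MyCrawler is an invented crawler name.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for a hypothetical site, example.com
rules = [
    'User-agent: *',
    'Disallow: /private/',
    'Allow: /public/',
]

rp = RobotFileParser()
rp.parse(rules)   # parse() takes the lines directly, so read() is not needed

print(rp.can_fetch('*', 'https://example.com/public/page.html'))    # True
print(rp.can_fetch('*', 'https://example.com/private/page.html'))   # False
print(rp.can_fetch('MyCrawler', 'https://example.com/other.html'))  # True
```

A path that matches no rule is allowed, which is why the last call returns True.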

posted @ 2021-12-15 10:58 写代码的小灰