爬虫自学之路（二）about Urlib库的使用

官方文档地址：https://docs.python.org/3/library/urllib.html

urllib 包括以下模块：

urllib.request 请求模块
urllib.error 异常处理模块
urllib.parse url解析模块
urllib.robotparser robots.txt解析模块

urlopen中data参数的使用

这里用到urllib.parse，通过bytes(urllib.parse.urlencode())可以将post数据进行转换放到urllib.request.urlopen的data参数中。这样就完成了一次post请求。所以如果我们添加data参数的时候就是以post请求方式请求，如果没有data参数就是get请求方式

import urllib.parse
import urllib.request
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
print(data)
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())
print(type(response.read()))

设置Headers:

有很多网站为了防止程序爬虫爬网站造成网站瘫痪，会需要携带一些headers头部信息才能访问，最长见的有user-agent参数写一个简单的例子：

#第一种方式
from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
dict = {
    'name': 'zhaofan'
}
data = bytes(parse.urlencode(dict), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

#第二种方式，这种添加方式有个好处是自己可以定义一个请求头字典，然后循环进行添加
from urllib import request, parse

url = 'http://httpbin.org/post'
dict = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(dict), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

urllib高级用法：代理，ProxyHandle　

　　通过urllib.request.ProxyHandler()可以设置代理,网站它会检测某一段时间某个IP 的访问次数，如果访问次数过多，它会禁止你的访问, 所以这个时候需要通过设置代理来爬取数据

"""
通过urllib.request.ProxyHandler()可以设置代理,网站它会检测某一段时间某个IP 的访问次数，如果访问次数过多，它会禁止你的访问,
所以这个时候需要通过设置代理来爬取数据
"""
import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:8058',
    'https': 'https://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://httpbin.org/')
print(response.read())

Cookie的处理　

"""
cookie,HTTPCookiProcessor
cookie中保存中我们常见的登录信息，有时候爬取网站需要携带cookie信息访问,这里用到了http.cookijar，用于获取cookie以及存储cookie
"""

import http.cookiejar, urllib.request
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name+"="+item.value)

"""
同时cookie可以写入到文件中保存，有两种方式http.cookiejar.MozillaCookieJar和http.cookiejar.LWPCookieJar()，当然你自己用哪种方式都可以
具体代码例子如下：
"""

http.cookiejar.MozillaCookieJar()方式

import http.cookiejar, urllib.request
filename = "cookie.txt"
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

http.cookiejar.LWPCookieJar()方式

import http.cookiejar, urllib.request
filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

# 同样的如果想要通过获取文件中的cookie获取的话可以通过load方式，当然用哪种方式写入的，就用哪种方式读取。
import http.cookiejar, urllib.resquest
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf8'))

异常处理与URL解析

"""
    异常处理：
        这里我们需要知道的是在urllb异常这里有两个个异常错误：
        URLError,HTTPError，HTTPError是URLError的子类
        URLError里只有一个属性：reason,即抓异常的时候只能打印错误信息，类似上面的例子
        HTTPError里有三个属性：code,reason,headers，即抓异常的时候可以获得code,reson，headers三个信息，例子如下：
"""

from urllib import request, error
try:
    response = request.urlopen("http://pythonsite.com/1111.html")
except error.HTTPError as e:
    print(e.reason)
    print(e.code)
    print(e.headers)
except error.URLError as e:
    print(e.reason)

else:
    print("reqeust successfully")

"""
    url解析：
        urlparse：这里就是可以对你传入的url地址进行拆分
        urlunpars： 其实功能和urlparse的功能相反，它是用于拼接
        urljoin：这个的功能其实是做拼接的
        urlencode：这个方法可以将字典转换为url参数
"""

posted @ 2018-04-21 16:39 木已成木炭阅读(76) 评论(0) 收藏举报

刷新页面返回顶部

木已成木炭

爬虫自学之路（二）about Urlib库的使用

官方文档地址：https://docs.python.org/3/library/urllib.html

urllib 包括以下模块：

公告