Getting Started with Web Scraping

In Python 3, the old urllib and urllib2 modules were reorganized into urllib.request. A fetched page comes back as bytes, so you must decode it with an explicit charset after reading.

The old urllib.quote became urllib.parse.quote() for encoding, with urllib.parse.unquote() for decoding; both take and return str, and are handy for percent-encoding Chinese characters in a single string.

To encode a whole dict of query parameters into a ready-to-use query string, use urllib.parse.urlencode() instead.
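A quick sketch of the difference between the two, runnable without any network access:

```python
from urllib import parse

# quote() percent-encodes a single string (str in, str out)
encoded = parse.quote("百度")
print(encoded)                  # %E7%99%BE%E5%BA%A6
print(parse.unquote(encoded))   # 百度

# urlencode() turns a dict of parameters into a full query string
query = parse.urlencode({"wd": "百度"})
print(query)                    # wd=%E7%99%BE%E5%BA%A6
```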

To get past server-side checks, crawlers usually modify part of the request headers so the request looks like it came from a browser.

Setting User-Agent to a browser value is the most important step. Also avoid advertising gzip support (Accept-Encoding: gzip): if the server compresses the body, response.read() returns gzipped bytes that a plain decode() cannot handle.
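A minimal sketch of this idea: build a Request with a browser-style User-Agent and no Accept-Encoding header at all. The request object can be inspected without touching the network (note that urllib stores header names in capitalized form, e.g. "User-agent"):

```python
import urllib.request

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/72.0.3626.109 Safari/537.36",
}
request = urllib.request.Request("http://www.hao123.com/", headers=headers)

# urllib capitalizes header names internally, hence "User-agent"
print(request.get_header("User-agent"))
# No Accept-Encoding was set, so gzip is not requested
print(request.has_header("Accept-encoding"))
```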

print(response.getcode())    # status code
print(response.geturl())     # the URL that was actually retrieved
print(response.info())       # the server's response headers
import urllib.request

url = "http://www.hao123.com/"
ua_headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"}

# Build a request object for the target URL, with custom headers
request = urllib.request.Request(url=url, headers=ua_headers)
# Send the request and receive a response object
response = urllib.request.urlopen(request)

html = response.read()

print(html.decode("utf-8"))

print(response.getcode(), response.geturl())

 ******************************************************************************************

About the User-Agent string:

"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"

Mozilla/5.0: general browser token

Windows NT 10.0; Win64; x64: operating system and version

AppleWebKit/537.36 (KHTML, like Gecko): the browser's rendering engine

Chrome/72.0.3626.109 Safari/537.36: the browser's actual version information

Browser families commonly seen in User-Agent strings include Firefox, Opera, and Chrome:

ua_list = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"
]

Note that the list holds only the header values: the "User-Agent: " name must not be part of the string, since it is supplied separately as the header name when the header is set.

Set the request headers before contacting the server:

import random
import urllib.request

url = "http://www.baidu.com/"
ua_list = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"
]

# Pick a random User-Agent from the list
user_agent = random.choice(ua_list)
# Build a request object
request = urllib.request.Request(url)
# Set (or add) an HTTP request header
request.add_header("User-Agent", user_agent)

# urllib stores header names capitalized, hence "User-agent"
head = request.get_header("User-agent")
print(head)

Implementing a simple keyword-query page:

import urllib.request
import urllib.parse
import random

# Target address
url = "http://www.baidu.com/s"
# Browser-style HTTP request headers to disguise the client
ua_list = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"
]
# Pick one at random for the request header
user_agent = random.choice(ua_list)
# Read the search keyword
select = input("Enter the keyword to search for: ")

# URL-encode the query parameter
wd = {"wd": select}
wd = urllib.parse.urlencode(wd)
# Assemble the full URL
url = url + "?" + wd

# Create a request object
request = urllib.request.Request(url)
# Set the User-Agent request header
request.add_header("User-Agent", user_agent)
# Contact the target server
response = urllib.request.urlopen(request)
# Read the body and decode it as UTF-8
html = response.read().decode("utf-8")

# print(html)
print(">>>>>>" + url)
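Since the example above needs interactive input and network access, the URL-assembly step can be checked offline on its own; "爬虫" here is just a hypothetical sample keyword:

```python
import urllib.parse

# Assemble the search URL the same way as in the example above
base = "http://www.baidu.com/s"
params = urllib.parse.urlencode({"wd": "爬虫"})
full_url = base + "?" + params
print(full_url)   # http://www.baidu.com/s?wd=%E7%88%AC%E8%99%AB
```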


posted @ 2019-02-22 12:08  青红*皂了个白