Basic Crawler Libraries: Requests
The requests module
Installation
pip install requests
Basic usage
import requests

response = requests.get("https://movie.douban.com/cinema/nowplaying/beijing/")
print(response.content)      # raw bytes
print(response.text)         # decoded text (str)
print(type(response))        # <class 'requests.models.Response'>
print(response.status_code)  # 200
print(response.encoding)     # utf-8
print(response.cookies)      # <RequestsCookieJar[<Cookie bid=YwWqpRG7Z_E for .douban.com/>]>
GET requests
The most basic GET request can be made directly with the get method:
requests.get("https://movie.douban.com/cinema/nowplaying/beijing/")
To add query parameters, pass them through the params argument:
param = {"a": 1, "b": 2}
response = requests.get("https://movie.douban.com/cinema/nowplaying/beijing/", params=param)
print(response.url)  # https://movie.douban.com/cinema/nowplaying/beijing/?a=1&b=2
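To see how params becomes a query string without sending anything over the network, one option is to build the request and inspect its prepared URL. This is a minimal sketch; the key c with a list value is an assumption added for illustration and is not part of the original example.

```python
import requests

url = "https://movie.douban.com/cinema/nowplaying/beijing/"

# Request(...).prepare() builds the request without sending it.
prepared = requests.Request("GET", url, params={"a": 1, "b": 2}).prepare()
print(prepared.url)

# A list value is expanded into one key=value pair per element.
prepared_multi = requests.Request("GET", url, params={"c": ["x", "y"]}).prepare()
print(prepared_multi.url)
```

Inspecting the prepared URL this way is handy when debugging a crawler, because it shows exactly what would be requested before any traffic is generated.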
If the response body is JSON, the json() method parses it for you:
response = requests.get("https://github.com/timeline.json")
print(response.text)
print(response.json().get("message"))
Raw response content
If you want the raw socket response from the server, you can access r.raw. To do so, you must set stream=True in the initial request.
>>> r = requests.get('https://github.com/timeline.json', stream=True)
>>> r.raw
<requests.packages.urllib3.response.HTTPResponse object at 0x101194810>
>>> r.raw.read(10)
'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'
In general, though, you should save the streamed content to a file with a pattern like this:
with open(filename, 'wb') as fd:
    for chunk in r.iter_content(chunk_size):
        fd.write(chunk)
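The pattern above leaves filename and chunk_size undefined. A minimal sketch that wraps it in reusable functions might look like the following; the names save_stream and download, the 8192-byte default chunk size, and the 10-second timeout are all assumptions chosen for illustration.

```python
import requests


def save_stream(response, path, chunk_size=8192):
    """Write a streamed response body to path chunk by chunk; return bytes written."""
    written = 0
    with open(path, "wb") as fd:
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty chunks
                fd.write(chunk)
                written += len(chunk)
    return written


def download(url, path):
    """Hypothetical wrapper: stream url to path and release the connection afterwards."""
    with requests.get(url, stream=True, timeout=10) as r:
        r.raise_for_status()  # raise on 4xx/5xx instead of saving an error page
        return save_stream(r, path)
```

Calling download(url, "page.html") would then stream the body to disk without loading it into memory all at once.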
Custom request headers
To add HTTP headers to a request, simply pass a dict to the headers parameter.
headers = {"Content-Type": "application/json"}
data = {"username": "yuan"}
response = requests.get("https://movie.douban.com/cinema/nowplaying/beijing/", params=data, headers=headers)
print(response.url)
print(response.headers)
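Note that response.headers shows what the server sent back, not what you sent. To check the outgoing headers, one approach is to inspect a prepared request, as in this small sketch. The User-Agent value is a hypothetical placeholder; crawlers often customise this header, but which headers a given site expects is an assumption you would need to verify.

```python
import requests

url = "https://movie.douban.com/cinema/nowplaying/beijing/"
headers = {"User-Agent": "my-crawler/0.1"}  # hypothetical value

# Build the request without sending it, then look at the outgoing headers.
prepared = requests.Request("GET", url, headers=headers).prepare()

# Header names are case-insensitive on lookup.
print(prepared.headers["user-agent"])
```

For a request that has already been sent, the same information is available as response.request.headers.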
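POST requests
Typically you want to send form-encoded data, much like an HTML form. To do this, simply pass a dictionary to the data parameter. The dictionary is form-encoded automatically when the request is made:
payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.post("http://httpbin.org/post", data=payload)
print(response.text)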
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "key1": "value1",
    "key2": "value2"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Content-Length": "23",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.12.4"
  },
  "json": null,
  "origin": "114.242.248.107",
  "url": "http://httpbin.org/post"
}
You can also pass a list of tuples to the data parameter. This is especially useful when several form fields share the same key:
>>> payload = (('key1', 'value1'), ('key1', 'value2'))
>>> r = requests.post('http://httpbin.org/post', data=payload)
>>> print(r.text)
{
  ...
  "form": {
    "key1": [
      "value1",
      "value2"
    ]
  },
  ...
}
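The two forms of data above can be compared offline by preparing the requests and looking at the encoded bodies. This is a minimal sketch that reuses the payloads from the examples; nothing is sent to httpbin.org.

```python
import requests

url = "http://httpbin.org/post"

# A dict is form-encoded and the matching Content-Type is set automatically.
from_dict = requests.Request(
    "POST", url, data={"key1": "value1", "key2": "value2"}
).prepare()
print(from_dict.headers["Content-Type"])
print(from_dict.body)

# A list of tuples allows the same key to appear more than once.
from_pairs = requests.Request(
    "POST", url, data=[("key1", "value1"), ("key1", "value2")]
).prepare()
print(from_pairs.body)
```

The body built from the dict is 23 characters long, which matches the Content-Length reported in the httpbin output above.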
Often the data you want to send is not form-encoded. If you pass a string instead of a dict, the data is posted as-is.
For example, the GitHub API v3 accepts JSON-encoded POST/PATCH data:
>>> import json
>>> url = 'https://api.github.com/some/endpoint'
>>> payload = {'some': 'data'}
>>> r = requests.post(url, data=json.dumps(payload))
Instead of encoding the dict yourself, you can pass it directly through the json parameter and it will be encoded automatically. This was added in version 2.4.2:
>>> url = 'https://api.github.com/some/endpoint'
>>> payload = {'some': 'data'}
>>> r = requests.post(url, json=payload)
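The two approaches are not quite identical: besides encoding the body, the json parameter also sets the Content-Type header. The sketch below prepares both requests without sending them so the difference can be inspected; the endpoint is the placeholder URL from the example above.

```python
import json

import requests

url = "https://api.github.com/some/endpoint"
payload = {"some": "data"}

# Encoding by hand: the string is sent as-is and no Content-Type is added.
manual = requests.Request("POST", url, data=json.dumps(payload)).prepare()

# Using json=: the body is encoded and Content-Type is set for you.
auto = requests.Request("POST", url, json=payload).prepare()

print(manual.headers.get("Content-Type"))
print(auto.headers.get("Content-Type"))
print(json.loads(manual.body) == json.loads(auto.body))
```

If you encode the body yourself with data=json.dumps(...), you may therefore want to add a Content-Type header explicitly through the headers parameter.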
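POSTing a file
Requests makes uploading files simple:
url = 'http://httpbin.org/post'
# Open in binary mode; the with block closes the file once the request is done.
with open('test', 'rb') as f:
    r = requests.post(url, files={'file': f})
print(r.text)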
Result:
{
  "args": {},
  "data": "",
  "files": {
    "file": "I am Yuan"
  },
  "form": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Content-Length": "149",
    "Content-Type": "multipart/form-data; boundary=461b1f78ee7049238c2fc3ec738fa275",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.12.4"
  },
  "json": null,
  "origin": "114.242.249.23",
  "url": "http://httpbin.org/post"
}
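The multipart encoding seen in the result can also be inspected without a file on disk or a network call. This is a minimal sketch that uses an in-memory file object; the filename test.txt is a hypothetical choice, and the content mirrors the output shown above.

```python
import io

import requests

url = "http://httpbin.org/post"

# A (filename, file-like object) tuple sets the filename explicitly.
files = {"file": ("test.txt", io.BytesIO(b"I am Yuan"))}
prepared = requests.Request("POST", url, files=files).prepare()

# The boundary is generated per request, so only the prefix is stable.
print(prepared.headers["Content-Type"])
print(b"I am Yuan" in prepared.body)
```

The tuple form is useful when the data comes from memory rather than from a file opened by name.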
Cookies
If a response contains cookies, you can access them quickly:
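As a minimal sketch of what that access looks like: response.cookies is a RequestsCookieJar, which behaves much like a dictionary. To keep the example runnable offline, the jar below is built by hand with the bid cookie from the output shown in the basic usage section; with a real response you would use response.cookies in place of jar.

```python
import requests

# Stand-in for response.cookies, built by hand so no request is needed.
jar = requests.cookies.RequestsCookieJar()
jar.set("bid", "YwWqpRG7Z_E", domain=".douban.com", path="/")

# Dictionary-style access by cookie name.
print(jar["bid"])

# Plain dict of name/value pairs.
print(jar.get_dict())

# Iterating yields Cookie objects with name, value, domain and other attributes.
for cookie in jar:
    print(cookie.name, cookie.value, cookie.domain)
```

To send cookies with a request, pass a dict or a jar through the cookies parameter of requests.get or requests.post.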
Application
import json
import re

import requests


def getPage(url):
    response = requests.get(url)
    return response.text


def parsePage(s):
    # Raw strings keep \d from being read as a string escape; re.S lets . match newlines.
    com = re.compile(
        r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?<span class="title">(?P<title>.*?)</span>'
        r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)评价</span>',
        re.S,
    )
    for i in com.finditer(s):
        yield {
            "id": i.group("id"),
            "title": i.group("title"),
            "rating_num": i.group("rating_num"),
            "comment_num": i.group("comment_num"),
        }


def main(num):
    url = 'https://movie.douban.com/top250?start=%s&filter=' % num
    response_html = getPage(url)
    ret = parsePage(response_html)
    # Append one JSON object per line; the with block closes the file.
    with open("move_info7", "a", encoding="utf8") as f:
        for obj in ret:
            print(obj)
            data = json.dumps(obj, ensure_ascii=False)
            f.write(data + "\n")


if __name__ == '__main__':
    # 10 pages of 25 entries each: start = 0, 25, ..., 225
    for i in range(10):
        main(i * 25)
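To see what the regular expression in parsePage extracts, it can be run against a small fragment offline. The HTML below is a hypothetical sample shaped like the markup the pattern expects, not a copy of the real page, and the movie names and numbers are made up.

```python
import re

# Hypothetical fragment shaped like the markup the pattern expects.
SAMPLE_HTML = """
<div class="item">
  <div class="pic"><em class="">1</em></div>
  <span class="title">Movie A</span>
  <span class="rating_num" property="v:average">9.7</span>
  <span>1000人评价</span>
</div>
<div class="item">
  <div class="pic"><em class="">2</em></div>
  <span class="title">Movie B</span>
  <span class="rating_num" property="v:average">9.6</span>
  <span>2000人评价</span>
</div>
"""

# Same pattern as in parsePage, split across lines for readability.
PATTERN = re.compile(
    r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+)'
    r'.*?<span class="title">(?P<title>.*?)</span>'
    r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>'
    r'.*?<span>(?P<comment_num>.*?)评价</span>',
    re.S,
)

# groupdict() returns the named groups of each match as a dict.
movies = [m.groupdict() for m in PATTERN.finditer(SAMPLE_HTML)]
for movie in movies:
    print(movie)
```

Note that comment_num captures everything before 评价, so it keeps the trailing 人 (for example "1000人"); strip it if you need a plain number.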
For more, see the official documentation.