Crawler Basics: The Requests Library

The requests module

Installation

pip install requests

Basic usage

import requests

response=requests.get("https://movie.douban.com/cinema/nowplaying/beijing/")
print(response.content)   # raw bytes
print(response.text)      # decoded text (str)
print(type(response))       # <class 'requests.models.Response'>
print(response.status_code) # 200
print(response.encoding)    # utf-8
print(response.cookies)     # <RequestsCookieJar[<Cookie bid=YwWqpRG7Z_E for .douban.com/>]>

GET requests

The most basic GET request just calls the get method:

requests.get("https://movie.douban.com/cinema/nowplaying/beijing/")

To add query-string parameters, pass a dict as the params argument:

param={"a":1,"b":2}
response=requests.get("https://movie.douban.com/cinema/nowplaying/beijing/",params=param)

print(response.url) # https://movie.douban.com/cinema/nowplaying/beijing/?a=1&b=2
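
To see how params is turned into a query string without touching the network, one option is to build the request and prepare it locally. The sketch below assumes a placeholder URL (https://example.com/search) and made-up parameter names, and the build_url helper is hypothetical. It relies on requests.Request(...).prepare(), which computes the final URL but sends nothing.

```python
import requests

def build_url(base_url, params):
    """Return the URL requests would fetch, without sending anything."""
    prepared = requests.Request("GET", base_url, params=params).prepare()
    return prepared.url

# Placeholder URL and parameter names, chosen only for illustration.
simple = build_url("https://example.com/search", {"a": 1, "b": 2})

# A list value is expanded into one key=value pair per item,
# and a key whose value is None is left out of the query string.
multi = build_url("https://example.com/search", {"tag": ["x", "y"], "page": None})

print(simple)
print(multi)
```

Checking the prepared URL this way is handy when a site returns unexpected results and you want to rule out a badly encoded query string first.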

To request a JSON resource, parse the response body with the json() method:

response=requests.get("https://github.com/timeline.json")

print(response.text)
print(response.json().get("message"))
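
Not every response body is valid JSON (an error page is often HTML), and json() raises an exception in that case. A minimal sketch of a defensive wrapper follows; json_or_none and fetch_message are hypothetical helper names, and the 10-second timeout is an arbitrary choice.

```python
import requests

def json_or_none(response):
    """Return the decoded JSON body, or None if the body is not valid JSON."""
    try:
        return response.json()
    except ValueError:
        # The JSON decoding errors raised by json() are ValueError subclasses.
        return None

def fetch_message(url):
    """Hypothetical helper: fetch url and read the "message" key if present."""
    response = requests.get(url, timeout=10)
    body = json_or_none(response)
    return body.get("message") if isinstance(body, dict) else None
```

Calling fetch_message("https://github.com/timeline.json") would mirror the example above while tolerating a non-JSON reply.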

Raw response content

If you need the raw socket response from the server, you can access r.raw. To do so, you must set stream=True in the initial request.

>>> r = requests.get('https://github.com/timeline.json', stream=True)
>>> r.raw
<requests.packages.urllib3.response.HTTPResponse object at 0x101194810>
>>> r.raw.read(10)
'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'

In general, however, you should use a pattern like the following to save the streamed content to a file:

with open(filename, 'wb') as fd:
    for chunk in r.iter_content(chunk_size):
        fd.write(chunk)
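
The pattern above leaves filename, chunk_size, and r undefined. One way to wrap it into a complete function is sketched below. The helper names save_chunks and download, the 8192-byte default chunk size, and the 10-second timeout are all assumptions made for illustration.

```python
import requests

def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path and return the byte count."""
    written = 0
    with open(path, "wb") as fd:
        for chunk in chunks:
            fd.write(chunk)
            written += len(chunk)
    return written

def download(url, path, chunk_size=8192):
    """Stream url to path without loading the whole body into memory."""
    with requests.get(url, stream=True, timeout=10) as r:
        r.raise_for_status()  # fail early on 4xx/5xx responses
        return save_chunks(r.iter_content(chunk_size=chunk_size), path)
```

Splitting the file writing from the network call keeps the writing logic easy to check with plain byte strings.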

Custom headers

To add HTTP headers to a request, simply pass a dict to the headers parameter.

headers={"Content-Type":"application/json"}
data={"username":"yuan"}
response=requests.get("https://movie.douban.com/cinema/nowplaying/beijing/",params=data,headers=headers)

print(response.url)
print(response.headers)
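
For crawling, the header most often customized is probably User-Agent. The sketch below shows how per-request headers combine with a session's defaults, again without sending anything. The URL is a placeholder and the client name my-crawler/0.1 is made up.

```python
import requests

session = requests.Session()
request = requests.Request(
    "GET",
    "https://example.com/",                    # placeholder URL
    headers={"User-Agent": "my-crawler/0.1"},  # made-up client name
)
# prepare_request merges the per-request headers with the session defaults.
prepared = session.prepare_request(request)

# Header lookup is case-insensitive.
user_agent = prepared.headers["user-agent"]
has_default_accept = "Accept" in prepared.headers
print(user_agent, has_default_accept)
```

The per-request value replaces the default python-requests User-Agent, while defaults that were not overridden, such as Accept, remain in place.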

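POST requests

Typically, you want to send some form-encoded data, much like an HTML form. To do this, simply pass a dictionary to the data argument. The dictionary is form-encoded automatically when the request is made: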

payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.post("http://httpbin.org/post", data=payload)

print(response.text)
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "key1": "value1", 
    "key2": "value2"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Content-Length": "23", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.12.4"
  }, 
  "json": null, 
  "origin": "114.242.248.107", 
  "url": "http://httpbin.org/post"
}

You can also pass a list of tuples to the data argument. This is particularly useful when the form has multiple elements that use the same key:

>>> payload = (('key1', 'value1'), ('key1', 'value2'))
>>> r = requests.post('http://httpbin.org/post', data=payload)
>>> print(r.text)
{
  ...
  "form": {
    "key1": [
      "value1",
      "value2"
    ]
  },
  ...
}
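
Both forms of data shown above can be inspected locally before anything is sent. This sketch prepares the two requests and looks at the encoded body and the headers requests would attach. It reuses the httpbin.org URL from the examples but never contacts it.

```python
import requests

url = "http://httpbin.org/post"

# A dict is form-encoded into key=value pairs joined by "&".
from_dict = requests.Request(
    "POST", url, data={"key1": "value1", "key2": "value2"}
).prepare()

# A sequence of tuples keeps repeated keys.
from_tuples = requests.Request(
    "POST", url, data=[("key1", "value1"), ("key1", "value2")]
).prepare()

dict_body = from_dict.body
dict_type = from_dict.headers["Content-Type"]
dict_length = from_dict.headers["Content-Length"]
tuple_body = from_tuples.body
print(dict_body, dict_type, dict_length)
print(tuple_body)
```

The Content-Length computed here for the dict payload is 23, the same value that appears in the echoed headers of the first POST example.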

Often the data you want to send is not form-encoded. If you pass a string instead of a dict, the data is posted as-is.

For example, the GitHub API v3 accepts JSON-encoded POST/PATCH data:

>>> import json

>>> url = 'https://api.github.com/some/endpoint'
>>> payload = {'some': 'data'}

>>> r = requests.post(url, data=json.dumps(payload))

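Instead of encoding the dict yourself, you can pass it directly with the json parameter and it will be encoded automatically. This feature was added in version 2.4.2: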

>>> url = 'https://api.github.com/some/endpoint'
>>> payload = {'some': 'data'}

>>> r = requests.post(url, json=payload)
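
The two approaches are not quite interchangeable: they produce the same body, but differ in the headers. A small sketch comparing them locally follows, using the placeholder endpoint from the text. Nothing is sent.

```python
import json
import requests

url = "https://api.github.com/some/endpoint"  # placeholder endpoint from the text
payload = {"some": "data"}

as_string = requests.Request("POST", url, data=json.dumps(payload)).prepare()
as_json = requests.Request("POST", url, json=payload).prepare()

# Both bodies decode to the same object...
same_body = json.loads(as_string.body) == json.loads(as_json.body) == payload

# ...but only json= sets the Content-Type header for you.
string_type = as_string.headers.get("Content-Type")
json_type = as_json.headers.get("Content-Type")
print(same_body, string_type, json_type)
```

When passing a pre-encoded string through data, you would therefore typically add a Content-Type header yourself via the headers parameter.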

POSTing a file

Requests makes uploading files simple:

url = 'http://httpbin.org/post'

# Open in binary mode; the with block closes the file once the upload is done
with open('test', 'rb') as f:
    r = requests.post(url, files={'file': f})
print(r.text)

Result:

{
  "args": {}, 
  "data": "", 
  "files": {
    "file": "I am Yuan"
  }, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Content-Length": "149", 
    "Content-Type": "multipart/form-data; boundary=461b1f78ee7049238c2fc3ec738fa275", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.12.4"
  }, 
  "json": null, 
  "origin": "114.242.249.23", 
  "url": "http://httpbin.org/post"
}
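
The files parameter also accepts a tuple per field, which lets you set the filename and content type explicitly and upload in-memory bytes instead of an open file. A minimal sketch follows, with a made-up filename and content type. The request is only prepared, not sent.

```python
import requests

# (filename, content, content type): all three values are made up for this sketch.
files = {"file": ("report.txt", b"I am Yuan", "text/plain")}
prepared = requests.Request("POST", "http://httpbin.org/post", files=files).prepare()

content_type = prepared.headers["Content-Type"]
body = prepared.body  # bytes containing the multipart payload
print(content_type)
```

As in the result above, the Content-Type is multipart/form-data with a generated boundary, so it should not be set by hand when files is used.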

Cookies

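If a response contains cookies, you can access them quickly through response.cookies (the basic usage example above prints this cookie jar).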
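
A short sketch of both directions, sending and reading cookies, is given below. To keep it runnable offline, the request is only prepared and the cookie jar is built by hand rather than taken from a live response. The jar reuses the sample bid value printed in the basic usage section, and the cookies_are name is just an illustrative key.

```python
import requests

# Sending cookies: a plain dict passed as cookies= becomes a Cookie header.
prepared = requests.Request(
    "GET", "http://httpbin.org/cookies", cookies={"cookies_are": "working"}
).prepare()
cookie_header = prepared.headers["Cookie"]

# Reading cookies: response.cookies is a RequestsCookieJar, which can be
# indexed like a dict. This jar is built by hand so that no network call
# is needed.
jar = requests.cookies.RequestsCookieJar()
jar.set("bid", "YwWqpRG7Z_E", domain=".douban.com", path="/")
bid = jar["bid"]
as_dict = jar.get_dict()
print(cookie_header, bid, as_dict)
```

With a real response, the same lookups would be written as response.cookies["bid"] and response.cookies.get_dict().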

Application

import json
import re

import requests

def getPage(url):
    # Fetch the page and return the decoded HTML text
    response=requests.get(url)
    return response.text

def parsePage(s):
    # Raw strings keep \d from being read as a string escape; re.S lets . match newlines
    com=re.compile(r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?<span class="title">(?P<title>.*?)</span>'
                   r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)评价</span>',re.S)

    # Lazily yield one dict per matched movie entry
    ret=com.finditer(s)
    for i in ret:
        yield {
            "id":i.group("id"),
            "title":i.group("title"),
            "rating_num":i.group("rating_num"),
            "comment_num":i.group("comment_num"),
        }

def main(num):
    # num is the offset of the first entry on the requested page (0, 25, 50, ...)
    url='https://movie.douban.com/top250?start=%s&filter='%num
    response_html=getPage(url)
    ret=parsePage(response_html)

    # Append one JSON object per line; the with block closes the file
    with open("move_info7","a",encoding="utf8") as f:
        for obj in ret:
            print(obj)
            data=json.dumps(obj,ensure_ascii=False)
            f.write(data+"\n")

if __name__ == '__main__':
    # Walk the list in 10 steps of 25 entries
    count=0
    for i in range(10):
        main(count)
        count+=25
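
Because the parsing step depends entirely on the page markup, it can help to check the pattern against a small fixed snippet before crawling. The sketch below does that with the same regular expression. The HTML is a made-up fragment that only imitates the tags the pattern looks for, not real Douban output, and parse_items is a hypothetical helper name.

```python
import re

# Same pattern as in parsePage, written with raw strings.
ITEM_RE = re.compile(
    r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?'
    r'<span class="title">(?P<title>.*?)</span>'
    r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>'
    r'.*?<span>(?P<comment_num>.*?)评价</span>',
    re.S,
)

# Made-up snippet that only imitates the tags the pattern looks for.
SAMPLE_HTML = """
<div class="item">
  <div class="pic"><em class="">1</em></div>
  <span class="title">Movie A</span>
  <span class="rating_num" property="v:average">9.7</span>
  <span>1000人评价</span>
</div>
<div class="item">
  <div class="pic"><em class="">2</em></div>
  <span class="title">Movie B</span>
  <span class="rating_num" property="v:average">9.6</span>
  <span>2000人评价</span>
</div>
"""

def parse_items(html):
    """Return one dict of named groups per matched movie entry."""
    return [m.groupdict() for m in ITEM_RE.finditer(html)]

items = parse_items(SAMPLE_HTML)
for item in items:
    print(item)
```

Note that comment_num captures everything between the span tag and 评价, so on this snippet it keeps the trailing 人 character. Strip it if you need a plain number.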

For more, see the official documentation.

posted @ 2017-12-04 19:22  Yuan先生