Python web scraping: fetching web pages with requests

1. Install requests

pip install requests

2. Request a web page

PhantomJS download page: http://phantomjs.org/download.html

>>> import requests
>>> r = requests.get('https://www.baidu.com')
>>> print(r.text)
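
The call above assumes the request succeeds. A minimal sketch of a more defensive version follows; the helper name `fetch_text` and the 10-second default timeout are assumptions made for illustration and are not part of the original example.

```python
import requests


def fetch_text(url, timeout=10):
    """Return the decoded body of `url`, or None if the request fails.

    `fetch_text` and its default timeout are hypothetical choices.
    """
    try:
        r = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx responses into requests.HTTPError.
        r.raise_for_status()
    except requests.RequestException as e:
        # Covers connection errors, timeouts, invalid URLs and HTTP errors.
        print(f"request failed: {e}")
        return None
    return r.text
```

Calling `fetch_text('https://www.baidu.com')` would then return the page text on success and `None` on failure, instead of raising in the middle of a crawl.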

3. Retry failed requests

If a request fails, you can use urllib3's Retry to retry it automatically.

import requests
from requests.adapters import HTTPAdapter
from urllib3.util import Retry

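try:
    # Allow up to 5 retries; responses with these status codes also trigger a retry
    retry = Retry(
        total=5,
        backoff_factor=2,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    # Apply the retry policy to every https:// URL requested through this session
    session.mount('https://', adapter)
    r = session.get('https://httpbin.org/status/502', timeout=180)
    print(r.status_code)
except Exception as e:
    # Reached when the retries are exhausted or the request fails in another way
    print(e)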

Reference: https://oxylabs.io/blog/python-requests-retry
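
The session above mounts the adapter only for `https://`, so plain `http://` URLs would not pick up the retry policy. A sketch that wraps the same configuration in a reusable helper and mounts it for both schemes is shown here; the function name `make_retry_session` and its default arguments are assumptions for illustration.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util import Retry


def make_retry_session(total=5, backoff_factor=2,
                       status_forcelist=(429, 500, 502, 503, 504)):
    """Build a Session that retries on the listed status codes.

    `make_retry_session` is a hypothetical helper, not a requests API.
    """
    retry = Retry(
        total=total,
        backoff_factor=backoff_factor,
        status_forcelist=status_forcelist,
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    # Mount for both schemes so http:// and https:// URLs share the policy.
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```

A crawler could then create one session with `session = make_retry_session()` and reuse it for every `session.get(url, timeout=...)` call, which also lets requests reuse the underlying connections.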

4. Parse the page with lxml

You can install the XPath Helper browser extension (https://chromewebstore.google.com/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl) to test XPath expressions. Usage is as follows:

from lxml import etree

tree = etree.HTML(r.content)
text = tree.xpath('/html/body/div/header/div/nav/ul/li')
print(text)
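
The `xpath` call above returns a list of element objects, so printing it shows entries such as `<Element li at 0x...>` rather than the page text. A small sketch of pulling the text and attributes out of the matches follows; the sample HTML and the helper name `extract_links` are made up for illustration and stand in for `r.content`.

```python
from lxml import etree

# Hypothetical sample page standing in for r.content.
SAMPLE_HTML = b"""
<html><body>
  <nav><ul>
    <li><a href="/news">News</a></li>
    <li><a href="/maps">Maps</a></li>
  </ul></nav>
</body></html>
"""


def extract_links(html):
    """Return (text, href) pairs for every link directly inside a list item."""
    tree = etree.HTML(html)
    links = []
    for a in tree.xpath("//li/a"):
        # .text is the element's text; .get() reads an attribute.
        links.append((a.text, a.get("href")))
    return links


print(extract_links(SAMPLE_HTML))
```

A relative expression such as `//li/a` tends to survive layout changes better than a full path from `/html`, which breaks as soon as a wrapper element is added.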

 

posted @ 2017-05-01 15:06  tonglin0325