Python Crawler: Fetching Web Pages with requests
1. Install requests
pip install requests
2. Request a web page
>>> import requests
>>> r = requests.get('https://www.baidu.com')
>>> print(r.text)
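Beyond fetching the body, it can help to see exactly what requests will send. A minimal sketch using `requests.Request` and `prepare()` to inspect the final URL and headers before anything goes over the network (the Baidu search URL, `wd` parameter, and User-Agent string here are illustrative assumptions, not from the original post):

```python
import requests

# Build a GET request with query parameters and a custom User-Agent,
# then prepare it (without sending) to inspect the final URL and headers.
req = requests.Request(
    'GET',
    'https://www.baidu.com/s',          # hypothetical search endpoint
    params={'wd': 'python'},            # query string is encoded for you
    headers={'User-Agent': 'Mozilla/5.0'},
)
prepared = req.prepare()
print(prepared.url)                      # https://www.baidu.com/s?wd=python
print(prepared.headers['User-Agent'])    # Mozilla/5.0
```

In everyday use `requests.get(url, params=..., headers=...)` does the same preparation and sends the request in one call.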
3. Retrying failed requests
If a request fails, you can use urllib3's Retry class to retry it automatically:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util import Retry

try:
    # Retry up to 5 times with exponential backoff on common transient
    # server errors and rate-limit responses
    retry = Retry(
        total=5,
        backoff_factor=2,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount('https://', adapter)
    r = session.get('https://httpbin.org/status/502', timeout=180)
    print(r.status_code)
except Exception as e:
    print(e)
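If several parts of a crawler need the same retry behavior, the setup above can be wrapped in a small factory. A sketch (the function name and defaults are my own, not from the original post); it only builds the session, so nothing touches the network:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util import Retry

def make_retry_session(total=5, backoff_factor=2):
    """Return a requests.Session that retries transient failures."""
    retry = Retry(
        total=total,
        backoff_factor=backoff_factor,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    # Mount for both schemes so every request goes through the adapter
    session.mount('https://', adapter)
    session.mount('http://', adapter)
    return session

session = make_retry_session()
# The adapter attached for HTTPS carries the configured Retry object
print(session.get_adapter('https://example.com').max_retries.total)  # 5
```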
Reference: https://oxylabs.io/blog/python-requests-retry
4. Parse the page with lxml
You can install the XPath Helper extension to test XPath expressions in the browser: https://chromewebstore.google.com/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl. Usage:

from lxml import etree
tree = etree.HTML(r.content)
text = tree.xpath('/html/body/div/header/div/nav/ul/li')
print(text)
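The absolute path above returns element objects; in practice you usually want their text or attributes. A sketch using an inline HTML snippet as a stand-in for `r.content` (the snippet and the relative `//` paths are illustrative, not from the original page):

```python
from lxml import etree

# Inline HTML standing in for a downloaded page (r.content)
html = ('<html><body><ul>'
        '<li><a href="/a">First</a></li>'
        '<li><a href="/b">Second</a></li>'
        '</ul></body></html>')
tree = etree.HTML(html)

# text() selects the text content of matched nodes, @href their attribute
titles = tree.xpath('//ul/li/a/text()')
hrefs = tree.xpath('//ul/li/a/@href')
print(titles)  # ['First', 'Second']
print(hrefs)   # ['/a', '/b']
```

Relative `//` paths are usually more robust than absolute `/html/body/...` paths, which break whenever the page layout shifts.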
This article is published only on cnblogs and tonglin0325's blog. Author: tonglin0325. Please credit the original link when reposting: https://www.cnblogs.com/tonglin0325/p/6791956.html
