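Using Splash

Splash is a JavaScript rendering service that makes it possible to scrape dynamically rendered pages.

I. Introduction

Features
- Renders multiple web pages asynchronously
- Returns the rendered page's source, a screenshot, and page-load information (HAR, similar to the Network panel in browser developer tools)
- Executes custom JavaScript in the page
- Lets Lua scripts control the rendering process
Preparation
- Deploy the Splash service with Docker
  - Pull the image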
```
docker pull scrapinghub/splash
```
  - Run the Splash service
```
docker run --name splash -d -p 8050:8050 scrapinghub/splash --max-timeout 3600
```
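    The --max-timeout option raises the maximum allowed render timeout; it is added here to avoid HTTP 504 (timeout) responses.
- Test
  - Open http://127.0.0.1:8050/ in a local browser to see the Splash web UI
  - Enter a URL in the input box to the left of "Render me!", for example this dynamically rendered page: https://www.nmpa.gov.cn/zwfw/sdxx/sdxxyp/yppjfb/20230111151558195.html
  - Change the Lua script on the page to the following: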
```
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(5))
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end
```
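    The wait time is raised to 5 seconds to give the page a good chance to finish loading.
  - Click "Render me!" to view the rendered page's source, a screenshot, and the page-load information
Official reference documentation:
- https://splash.readthedocs.io/en/stable/api.html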
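Besides opening the web UI, the service can be checked from code. The sketch below assumes the `/_ping` endpoint described in the official Splash documentation, which is expected to answer with a JSON object whose `status` is `"ok"`. The function name `splash_is_up` is made up for this illustration, and the standard library is used so the check has no extra dependencies.

```python
import json
from urllib.error import URLError
from urllib.request import urlopen


def splash_is_up(base_url="http://127.0.0.1:8050", timeout=3.0):
    """Return True if Splash answers GET /_ping with status "ok" (hypothetical helper)."""
    try:
        with urlopen(f"{base_url}/_ping", timeout=timeout) as resp:
            data = json.load(resp)
    except (URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON reply
        return False
    return isinstance(data, dict) and data.get("status") == "ok"
```

Calling `splash_is_up()` right after `docker run` may return False for a moment while the container is still starting, so retrying a few times is reasonable.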

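II. Common APIs

Overview
- Splash exposes a set of HTTP APIs. A Python program only needs to request them with the appropriate parameters to get the rendered result of a page.
render.html
- Returns the HTML of the rendered page
- Common parameters:
  - url: the URL to render
  - wait: time in seconds to wait after the page has loaded (default 0). To make sure the page is fully rendered, set it explicitly, e.g. 5. The value must not exceed timeout.
  - timeout: render timeout in seconds, default 30; the maximum is 90 unless a larger limit was set with --max-timeout when starting Splash
  - resource_timeout: timeout for an individual network request
  - http_method: HTTP method used to request the page, default GET; POST is also supported
  - proxy: proxy to use, in the form [protocol://][user:password@]proxyhost[:port]; the protocol must be http or socks5
  - images: whether to load images, 1 (load, the default) or 0 (do not load)
  - headers: request headers, as a JSON array or a JSON object; in array form each element must be a (header_name, header_value) pair
  - body: request body when http_method is POST, a string such as name=laowang&age=30; the default Content-Type is application/x-www-form-urlencoded
  - Other parameters: see the official documentation at https://splash.readthedocs.io/en/stable/api.html#render-html
- Example: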
```
import requests

api_url = 'http://127.0.0.1:8050/render.html'
args = {
    'url': 'http://www.httpbin.org/post',
    'wait': 5,
    'http_method': 'POST',
    'body': 'name=laowang&age=30'
}
response = requests.get(url=api_url, params=args)
print(response.text)
```
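  requests sends a GET request to the HTTP API, but Splash requests the target page with POST, so the output is the result of a POST request.
render.png
- Returns a screenshot of the page as binary PNG data
- Parameters:
  - width: resizes the screenshot to this width in pixels, keeping the aspect ratio
  - height: crops the screenshot to this height in pixels
  - render_all: whether to render and capture the whole page, 1 (yes; the image may be very tall) or 0 (no, the default). With render_all=1 a non-zero wait must also be set.
  - Others: same as render.html
- Example: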
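The parameters can also be sent as a JSON body in a POST request instead of as a query string. According to the official documentation, headers is only supported for such application/json POST requests, so that case is sketched here. The helper names `build_render_payload` and `render_html` and the User-Agent value are made up for this illustration, and the check on wait simply mirrors the rule that wait must not exceed timeout.

```python
def build_render_payload(url, wait=0.0, timeout=30.0, headers=None, **extra):
    """Build the JSON body for a POST to a Splash render.* endpoint (hypothetical helper)."""
    if wait > timeout:
        raise ValueError("wait must not exceed timeout")
    payload = {"url": url, "wait": wait, "timeout": timeout}
    if headers:
        # Either a dict or a list of [header_name, header_value] pairs
        payload["headers"] = headers
    payload.update(extra)  # e.g. images=0, resource_timeout=10
    return payload


def render_html(payload, api_url="http://127.0.0.1:8050/render.html"):
    """POST the payload as JSON and return the rendered HTML."""
    import requests  # third-party; imported here so the builder works without it

    response = requests.post(api_url, json=payload, timeout=payload["timeout"] + 5)
    response.raise_for_status()
    return response.text
```

For example, `render_html(build_render_payload('http://www.httpbin.org/get', wait=5, headers={'User-Agent': 'demo-agent'}))` should return a page in which httpbin echoes the custom User-Agent.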
```
import requests

api_url = 'http://127.0.0.1:8050/render.png'
args = {
    'url': 'https://www.cnblogs.com/eliwang/p/17004910.html',
    'wait': 5,
    'images': 0,
    # 'width': 1000,
    # 'height': 700,
    'render_all': 1
}
response = requests.get(url=api_url, params=args)
with open('test.png', 'wb') as f:
    f.write(response.content)
```
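  Renders and captures the whole page without loading the images in it.
render.jpeg
- Returns a screenshot of the page as binary JPEG data
- Parameters:
  - quality: JPEG quality, range 0-100, default 75; values above 95 should be avoided
  - Others: same as render.png
render.json
- Returns the requested data as JSON
  - Covers the functionality of all the APIs above
  - Parameters control what the result contains
- Parameters:
  - All parameters of render.jpeg
  - html: whether to include the page HTML, 0 or 1, default 0
  - png: whether to include a PNG screenshot of the page (base64-encoded), 0 or 1, default 0
  - iframes: whether to include information about child frames, 0 or 1, default 0
  - Others: see the official documentation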
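No example is given for render.jpeg, so here is a minimal sketch of what such a request could look like. The function name `jpeg_render_url` is hypothetical, https://example.com is a placeholder page, and the range check only restates the 0-100 limit on quality.

```python
from urllib.parse import urlencode


def jpeg_render_url(page_url, quality=75, wait=5, base="http://127.0.0.1:8050"):
    """Build a GET URL for render.jpeg (hypothetical helper)."""
    if not 0 <= quality <= 100:
        raise ValueError("quality must be between 0 and 100")
    params = {"url": page_url, "wait": wait, "quality": quality}
    # urlencode percent-escapes the page URL so it survives as a single parameter
    return f"{base}/render.jpeg?{urlencode(params)}"
```

The response body of a GET to this URL is the binary JPEG, so `requests.get(jpeg_render_url('https://example.com')).content` can be written to a file opened in 'wb' mode, just like in the render.png example.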
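Because render.json returns the screenshot as base64 text rather than raw bytes, it has to be decoded before it is saved. The sketch below assumes the result of a request with html=1 and png=1 has already been parsed with `response.json()`. The helper name `extract_png` is made up, and the signature check is an extra sanity test, not something Splash requires.

```python
import base64

# Every PNG file starts with these eight bytes
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def extract_png(result):
    """Decode the base64 'png' field of a parsed render.json result (hypothetical helper)."""
    encoded = result.get("png")
    if encoded is None:
        raise KeyError("no 'png' field in the result; was png=1 passed?")
    data = base64.b64decode(encoded)
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data does not look like a PNG image")
    return data
```

The returned bytes can be written to a file opened in 'wb' mode, while `result.get('html')` holds the page source as an ordinary string.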

execute
- Runs a Lua script, giving full control over how the page is fetched; the most powerful of these endpoints
- Parameters:
  - lua_source: the automation script, as a string
  - Others: see the official documentation
- Example:
```
import requests
import base64

lua_source = '''
function main(splash, args)
  assert(splash:go("https://www.baidu.com"))
  assert(splash:wait(5))
  return {
    html = splash:html(),
    png = splash:png()
  }
end
'''

api_url = 'http://127.0.0.1:8050/execute'
args = {
    'lua_source': lua_source
}

response = requests.get(url=api_url, params=args)
result = response.json()

# Print the HTML
print(result.get('html'))

# Save the PNG screenshot (base64-encoded in the JSON result)
with open('test.png', 'wb') as f:
    f.write(base64.b64decode(result.get('png')))
```
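The script above hard-codes the target address. As in the Lua script used in the web UI test, the script can read the address from args.url instead, so one script serves many pages. This is a sketch under that assumption; the names `LUA_SOURCE` and `execute_params` are made up for illustration.

```python
# Lua script that takes the page address from the request parameters
LUA_SOURCE = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(5))
  return {html = splash:html()}
end
"""


def execute_params(url):
    """Build the parameters for the execute endpoint (hypothetical helper)."""
    # Parameters sent next to lua_source are what the script sees as args
    return {"lua_source": LUA_SOURCE, "url": url}
```

The returned dict can be passed to `requests.get(url=api_url, params=...)` in the same way as args in the example above.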

Posted 2023-02-04 20:51 by eliwang
