Python web scraping: extracting span text from in-memory HTML with re.findall

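Preface

This post uses re.findall and a regular expression to extract text from HTML that is already held in memory as a Python string.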

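1. Baidu homepage hot searches

(The markup below has been modified from Baidu's original page source.)
Goal: extract the text inside each span.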

<ul class="s-hotsearch-content" id="hotsearch-content-wrapper">
  <li class="hotsearch-item odd" data-index="0">
    <span class="title-content-title">必须坚持人民至上</span>
    <span class="title-content-title">因平凡的你们熠熠闪光</span>
    <span class="title-content-title">已婚男子找王婆说媒 妻子：将离婚</span>
    <span class="title-content-title">凯迪拉克：泼天的流量轮到我了</span>
    <span class="title-content-title">爸爸穿得太显眼竟把女儿气哭</span>
    <span class="title-content-title">女子辅导儿子作业情绪崩溃踹断脚趾</span>
  </li>
</ul>

Implementation:
baidu_hot.py

import re

html_hot = """<ul class="s-hotsearch-content" id="hotsearch-content-wrapper">
  <li class="hotsearch-item odd" data-index="0">
    <span class="title-content-title">必须坚持人民至上</span>
    <span class="title-content-title">因平凡的你们熠熠闪光</span>
    <span class="title-content-title">已婚男子找王婆说媒 妻子：将离婚</span>
    <span class="title-content-title">凯迪拉克：泼天的流量轮到我了</span>
    <span class="title-content-title">爸爸穿得太显眼竟把女儿气哭</span>
    <span class="title-content-title">女子辅导儿子作业情绪崩溃踹断脚趾</span>
  </li>
</ul>"""

# Capture the text between each opening and closing span tag (non-greedy match).
res = re.findall(r'<span class="title-content-title">(.*?)</span>', html_hot)

print("html_hot=", html_hot)
print("res=", res)

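Notes:
re.findall(<regex pattern>, <string to search>) returns a list of all non-overlapping matches.
.*? matches any run of characters except newlines, non-greedily (as few characters as possible).
() is a capture group: when the pattern contains one group, re.findall returns only the text matched inside it, which is the content we want.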
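
The difference between these pieces is easier to see side by side. The following is a minimal sketch on a made-up one-line sample (the class name "t" and the variable names are assumptions for illustration, not part of the Baidu markup):

```python
import re

# Hypothetical sample: two spans on one line.
sample = '<span class="t">A</span><span class="t">B</span>'

# Non-greedy: .*? stops at the first </span> it can reach.
lazy = re.findall(r'<span class="t">(.*?)</span>', sample)

# Greedy: .* runs on to the last </span> on the line.
greedy = re.findall(r'<span class="t">(.*)</span>', sample)

# No capture group: findall returns each whole match, tags included.
whole = re.findall(r'<span class="t">.*?</span>', sample)

print(lazy)    # ['A', 'B']
print(greedy)  # ['A</span><span class="t">B']
print(whole)   # ['<span class="t">A</span>', '<span class="t">B</span>']
```

Only the non-greedy pattern with a capture group yields one clean string per span, which is why the script above uses (.*?).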

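Caveat:

html_hot spans multiple lines, so its value must be wrapped in triple quotes: three double quotes to open and three to close, six in total. Triple single quotes work as well.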
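
A related point that follows from using a multi-line string: by default . does not match a newline, so a span whose text itself wraps onto a second line would be missed. The sketch below assumes such a wrapped span (the sample text is invented) and shows the standard re.DOTALL flag as one possible way to handle it:

```python
import re

# Triple quotes let the string literal run across several lines.
multi_line_html = """<span class="title-content-title">first
second</span>"""

pattern = r'<span class="title-content-title">(.*?)</span>'

# Without a flag, . cannot cross the newline inside the span.
no_flag = re.findall(pattern, multi_line_html)

# re.DOTALL makes . match newlines too.
with_dotall = re.findall(pattern, multi_line_html, re.DOTALL)

print(no_flag)      # []
print(with_dotall)  # ['first\nsecond']
```

In the hot-search sample each span sits on a single line, so the original pattern works there without any flag.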

Result:
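
As a rough illustration of what the extraction produces, here is a shortened sketch that uses only the first two items from the sample and prints one title per line. The names short_html and titles are assumptions for this example; baidu_hot.py itself prints the full HTML and then the whole list.

```python
import re

# Shortened copy of the sample markup: first two hot-search items only.
short_html = """<ul class="s-hotsearch-content" id="hotsearch-content-wrapper">
  <li class="hotsearch-item odd" data-index="0">
    <span class="title-content-title">必须坚持人民至上</span>
    <span class="title-content-title">因平凡的你们熠熠闪光</span>
  </li>
</ul>"""

titles = re.findall(r'<span class="title-content-title">(.*?)</span>', short_html)

# Print a numbered list of the extracted titles.
for index, title in enumerate(titles, start=1):
    print(index, title)
# 1 必须坚持人民至上
# 2 因平凡的你们熠熠闪光
```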

Disclaimer: The content of this account is intended for security research and teaching only. Any other use is at your own risk.

References and sources:
https://www.luffycity.com/ Luffycity (路飞学城)
2024-03-24_路飞3天/Day01/converter_1-1740_.ts.mp4 01:36:34

posted @ 2024-03-31 23:09 悟透
