Python Web Scraping: Parsing with bs4

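bs4 must be installed before use: pip install beautifulsoup4 (the bs4 package on PyPI is only a thin wrapper that pulls in beautifulsoup4). The lxml parser used below is a separate package: pip install lxml

The sample data used for practice is as follows: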

html_doc = """
<html><head><title>The Dormouse's story</title></head>
    <body>
        <p class="title"><b>The Dormouse's story</b></p>
        <p class="story">Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
            and they lived at the bottom of a well.
        </p>

<p class="story">...</p>
"""

First, parse and pretty-print the sample data with bs4. The HTML above is incomplete: the html and body tags clearly have no closing tags:

from bs4 import BeautifulSoup
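soup = BeautifulSoup(html_doc, 'lxml')      # Pass in the HTML and choose lxml as the parser; the resulting BeautifulSoup object holds a tree in which the missing tags have been filled in
print(soup.prettify())      # Print the HTML with one tag per line and indentation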
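
As a rough illustration of how the parser repairs incomplete markup, here is a minimal sketch. It uses Python's built-in html.parser instead of lxml so that it runs without extra packages (an assumption made for this example; lxml would additionally wrap the fragment in html and body tags). The fragment string is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: neither <p> nor <b> is ever closed.
broken = "<p>unclosed <b>bold"
repaired = BeautifulSoup(broken, "html.parser")

# Both tags exist in the tree even though the source never closed them.
print(repaired.b.string)
print(repaired.p.get_text())

# Serializing the tree writes out the closing tags that were missing.
print(str(repaired))
```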

Comparison of common parsers:

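| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | `BeautifulSoup(markup, "html.parser")` | Built into Python; decent speed; tolerant of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2 |
| lxml HTML parser | `BeautifulSoup(markup, "lxml")` | Fast; tolerant of malformed documents | Requires an external C library |
| lxml XML parser | `BeautifulSoup(markup, "lxml-xml")` or `BeautifulSoup(markup, "xml")` | Fast; the only parser that supports XML | Requires an external C library |
| html5lib | `BeautifulSoup(markup, "html5lib")` | Best tolerance of malformed documents; parses pages the way a browser does; produces valid HTML5 | Slow; requires an external Python package |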

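lxml is the recommended parser because it is faster. On Python 2 versions earlier than 2.7.3 and Python 3 versions earlier than 3.2.2, you must install lxml or html5lib, because the HTML parser built into the standard library of those versions is not reliable enough.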
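
One way to act on this recommendation while still running on machines where lxml is not installed is to fall back to the built-in parser. The sketch below assumes a hypothetical helper named make_soup (not part of bs4); it relies on bs4 raising FeatureNotFound when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: prefer lxml, fall back to html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; use the parser shipped with Python.
        return BeautifulSoup(markup, "html.parser")


demo = make_soup('<p class="title"><b>Hello</b>')
print(demo.b.string)
print(demo.p["class"])
```

Note that the two parsers can build slightly different trees for broken markup, so a fallback like this is best suited to lookups that do not depend on the exact repair strategy.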

Ways to extract structured data:

print(soup.title)  # Get the title element; if there are several, the first one is returned
# <title>The Dormouse's story</title>

print(soup.title.name)  # Get the tag name of the title element, i.e. "title"
# title

print(soup.title.string)  # Get the text of the title element
# The Dormouse's story

print(soup.p)  # As in the first example: get the p element, the first one if there are several
# <p class="title"><b>The Dormouse's story</b></p>

print(soup.p['class'])  # Get the value of the p element's class attribute (a list, because class can hold several values)
# ['title']

print(soup.a)  # Get the first a element
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>


print(soup.find(id="link3"))  # Get the element whose id attribute is link3
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

print(soup.find_all('a'))  # Get all a elements
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.find_all(["a", "b"]))    # Pass a list to search for several tag names at once; here, all a tags and b tags
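
A detail worth knowing when using find() and find_all() is what they return when nothing matches. The following minimal sketch uses a made-up fragment and the built-in html.parser (both assumptions for this example) to show the difference.

```python
from bs4 import BeautifulSoup

fragment = '<p><a id="link1">Elsie</a> <b>bold</b></p>'
doc = BeautifulSoup(fragment, "html.parser")

# A list of names returns matches for any of them, in document order.
names = [tag.name for tag in doc.find_all(["a", "b"])]

# find() returns None when nothing matches; find_all() returns an empty list.
missing = doc.find(id="no-such-id")
empty = doc.find_all("table")

print(names, missing, empty)
```

Because find() can return None, check the result before accessing attributes on it.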

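Finding tags by attribute value:

Searching for tags by CSS class name is very useful, but class, the keyword that identifies a CSS class, is a reserved word in Python, so using class as a keyword argument causes a syntax error.

Since Beautiful Soup 4.1.2, you can search on the class attribute with the class_ keyword argument.

soup.find_all("a", class_="sister")  # Find all a tags whose class attribute includes "sister"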
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
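
The sketch below illustrates two related points under stated assumptions (a made-up fragment, the built-in html.parser): class_ matches any one of a tag's classes, and passing a dictionary through the attrs parameter is an alternative way to search on class.

```python
from bs4 import BeautifulSoup

fragment = (
    '<a class="sister big" id="x">A</a>'
    '<a class="brother" id="y">B</a>'
)
doc = BeautifulSoup(fragment, "html.parser")

# class_ matches tags that have "sister" among their classes.
by_keyword = [tag["id"] for tag in doc.find_all("a", class_="sister")]

# The same search expressed with an attrs dictionary.
by_attrs = [tag["id"] for tag in doc.find_all("a", attrs={"class": "sister"})]

print(by_keyword, by_attrs)
```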

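The limit parameter

Suppose three tags in the document tree match the search criteria but we only want the first two. The limit parameter caps the number of results returned.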

soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Getting attribute values from all matching tags:

# Get the id attribute of every a tag
for a in soup.find_all('a'):
    print(a.get('id'))
# link1
# link2
# link3
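
The loop above uses .get() rather than square brackets. The following minimal sketch, using a made-up fragment and the built-in html.parser, shows why that matters when an attribute may be absent.

```python
from bs4 import BeautifulSoup

fragment = (
    '<a href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a id="link2">Lacie</a>'
)
doc = BeautifulSoup(fragment, "html.parser")

# .get() returns None for a missing attribute instead of raising.
hrefs = [a.get("href") for a in doc.find_all("a")]
print(hrefs)

# Square-bracket access raises KeyError when the attribute is missing.
try:
    doc.find(id="link2")["href"]
    raised = False
except KeyError:
    raised = True
print(raised)
```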

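Getting all text content:

print(soup.get_text())  # Note the parentheses: without them, print shows the bound method object (its repr embeds the repaired HTML) instead of the text
# Prints the text content of every tag in the document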
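
get_text() also accepts separator and strip arguments, which help when the raw text is full of stray whitespace. A minimal sketch, assuming a made-up fragment and the built-in html.parser:

```python
from bs4 import BeautifulSoup

fragment = "<p> Once upon a time <b>three</b> sisters </p>"
doc = BeautifulSoup(fragment, "html.parser")

# Default: text pieces are concatenated exactly as they appear.
raw = doc.get_text()

# strip=True trims each piece; separator is placed between pieces.
clean = doc.get_text(separator=" ", strip=True)

print(repr(raw))
print(repr(clean))
```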

Getting the first tag under a tag:

print(soup.body.p)  # Get the first p tag under the body tag
# <p class="title"><b>The Dormouse's story</b></p>

Getting a tag's parent node:

# Get the parent of the first title tag
print(soup.title.parent)    
# <head><title>The Dormouse's story</title></head>

Getting sibling (same-level) nodes:

The sample data is as follows:

soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'lxml')
print(soup.prettify())
# <html>
#  <body>
#   <a>
#    <b>
#     text1
#    </b>
#    <c>
#     text2
#    </c>
#   </a>
#  </body>
# </html>

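In the sample data above, the <b> and <c> tags are at the same level: both are children of the same element, so <b> and <c> can be called siblings. When a document is pretty-printed, siblings have the same indentation level. This relationship can also be used in code.

print(soup.b.next_sibling)    # Get the next sibling node
# <c>text2</c>

print(soup.c.previous_sibling)    # Get the previous sibling node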
# <b>text1</b>
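
In real pages, tags are usually separated by newlines and indentation, so .next_sibling often returns a whitespace string rather than a tag. The sketch below assumes a made-up list fragment and the built-in html.parser, and shows find_next_sibling() as one way to skip over such text nodes.

```python
from bs4 import BeautifulSoup

fragment = "<ul>\n<li>one</li>\n<li>two</li>\n</ul>"
doc = BeautifulSoup(fragment, "html.parser")

first = doc.li

# The node right after the first <li> is the newline between the tags.
text_node = first.next_sibling
print(repr(text_node))

# find_next_sibling() only considers tags, so it lands on the second <li>.
second = first.find_next_sibling("li")
print(second.string)
```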

CSS selectors

Beautiful Soup supports most CSS selectors. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax.

soup.select("title")
# [<title>The Dormouse's story</title>]

soup.select("p:nth-of-type(3)")  # Find the third p tag
# [<p class="story">...</p>]
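
.select() always returns a list. When only the first match is needed, select_one() can be used instead. A minimal sketch, assuming a made-up document and the built-in html.parser:

```python
from bs4 import BeautifulSoup

page = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p>a</p><p>b</p></body></html>"
)
doc = BeautifulSoup(page, "html.parser")

# select_one() returns the first matching tag rather than a list.
title = doc.select_one("title")
second_p = doc.select_one("p:nth-of-type(2)")

# It returns None when the selector matches nothing.
no_table = doc.select_one("table")

print(title.string, second_p.string, no_table)
```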

Finding tags level by level (descendant selectors):

soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("html head title")
# [<title>The Dormouse's story</title>]

Finding direct children of a tag:

soup.select("head > title")
# [<title>The Dormouse's story</title>]

soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("p > a:nth-of-type(2)")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("body > a")
# []

Finding by CSS class name (that is, by the class attribute):

soup.select(".sister")  # Find tags whose class attribute includes "sister"
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("[class~=sister]")  # Find tags whose class attribute contains the word "sister"
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
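
The [attr~=value] form above is one of several CSS attribute selectors. As a further sketch (made-up fragment, built-in html.parser, both assumptions for this example), prefix and suffix matching on href look like this:

```python
from bs4 import BeautifulSoup

fragment = (
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
    '<a href="http://example.org/lacie" class="sister" id="link2">Lacie</a>'
)
doc = BeautifulSoup(fragment, "html.parser")

# ^= matches attribute values that start with the given string.
on_com = [a["id"] for a in doc.select('a[href^="http://example.com/"]')]

# $= matches attribute values that end with the given string.
ends_lacie = [a["id"] for a in doc.select('a[href$="lacie"]')]

print(on_com, ends_lacie)
```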

Finding by id attribute value:

soup.select("#link1")    # Find the tag whose id attribute is link1
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("a#link2")    # Find the a tag whose id attribute is link2
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
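
To tie the selector forms together, here is a minimal sketch that collects each link's text and URL into a dictionary keyed by id. It assumes a shortened copy of the sample data and the built-in html.parser; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

fragment = """
<p class="story">
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
doc = BeautifulSoup(fragment, "html.parser")

# Direct-child selector combined with class selectors on both levels.
sisters = {
    a["id"]: (a.get_text(), a["href"])
    for a in doc.select("p.story > a.sister")
}

for link_id, (name, url) in sisters.items():
    print(link_id, name, url)
```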

Posted on 2023-04-11 18:27 by 木去
