1. Install the Beautiful Soup library (plus lxml, the parser used in the examples below):
pip install beautifulsoup4 lxml
2. Import BeautifulSoup from the bs4 package:
from bs4 import BeautifulSoup
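3. Create a BeautifulSoup object:
① From an HTML string:
soup = BeautifulSoup(html, 'lxml')  # name the parser explicitly to avoid a parser-guessing warning
② From an HTML file:
with open('index.html', encoding='utf-8') as f:  # assumes a UTF-8 file; the with block closes it
    soup = BeautifulSoup(f, 'lxml')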
4. Pretty-print the HTML:
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
5. Get Tag objects:
from bs4 import BeautifulSoup
html = '''<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.a)  # the first <a> tag in the document
print(soup.p)  # the first <p> tag in the document
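
Attribute-style access such as soup.a only ever returns the first matching tag. The sketch below shows one way to collect every match with find_all(). It assumes a shortened snippet instead of the full document above, and it uses the built-in 'html.parser' so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative markup (not the full document from the notes)
snippet = '<p><a id="link1">Elsie</a> <a id="link2">Lacie</a> <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(snippet, 'html.parser')

first_link = soup.a                 # only the first <a>
all_links = soup.find_all('a')      # every <a>, in document order
link_ids = [a['id'] for a in all_links]
no_such_tag = soup.table            # a tag that does not exist yields None

print(first_link)
print(link_ids)
print(no_such_tag)
```

Because a missing tag yields None rather than an error, it is worth checking the result before chaining further attribute access onto it.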
6. Get a tag's attributes:
soup = BeautifulSoup(html, 'lxml')
print(soup.a.attrs)  # returned as a dictionary
# {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
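
Besides reading the whole .attrs dictionary, a single attribute can be read directly from the tag. The following is a minimal sketch on a one-tag snippet (an assumption made to keep the example self-contained), again using the built-in 'html.parser'.

```python
from bs4 import BeautifulSoup

# Illustrative one-tag snippet
snippet = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
soup = BeautifulSoup(snippet, 'html.parser')
link = soup.a

href = link['href']          # dictionary-style access; raises KeyError if the attribute is missing
classes = link['class']      # class is multi-valued, so it comes back as a list
missing = link.get('title')  # .get() returns None instead of raising

print(href)
print(classes)
print(missing)
```

Using .get() is usually the safer choice when scraping pages where an attribute may or may not be present.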
7. Get a tag's text:
# <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
soup = BeautifulSoup(html, 'lxml')
print(soup.p.string)
# The Dormouse's story
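
The .string attribute works above because the first <p> contains a single chain of children ending in one string. As a hedged illustration, the sketch below contrasts that case with a tag that has several children, where .string is None and get_text() is one way to collect the text instead. The snippet is a shortened assumption, parsed with the built-in 'html.parser'.

```python
from bs4 import BeautifulSoup

# Illustrative markup: one simple paragraph, one with mixed children
snippet = ('<p class="title"><b>The Dormouse\'s story</b></p>'
           '<p class="story">Once upon a time <a>Lacie</a> lived.</p>')
soup = BeautifulSoup(snippet, 'html.parser')
first_p, second_p = soup.find_all('p')

single = first_p.string          # one child (<b>) holding one string
mixed = second_p.string          # several children, so there is no single string
full_text = second_p.get_text()  # concatenates every string inside the tag

print(single)
print(mixed)
print(full_text)
```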
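8. Traverse nodes:
(1) Direct children:
Key points:
.contents returns a list of the tag's direct children
.children returns an iterator over the tag's direct children
(2) All descendants:
Key points:
.descendants returns an iterable over all of the tag's descendants, at every depth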
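
A small sketch of the difference between the two: .contents can be indexed and measured like any list, while .children is meant to be looped over. The snippet and variable names are illustrative assumptions, and the built-in 'html.parser' is used so that no extra parser is required.

```python
from bs4 import BeautifulSoup, Tag

# Illustrative markup: two tags separated by a text node
snippet = '<p class="story"><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(snippet, 'html.parser')
p = soup.p

contents = p.contents  # a real list: supports len() and indexing
# Text nodes are children too, so filter for Tag objects before reading attributes
tag_ids = [child['id'] for child in p.children if isinstance(child, Tag)]

print(len(contents))
print(repr(contents[1]))
print(tag_ids)
```

Note that the ", " between the two links counts as a child of its own, which is why the list has three entries rather than two.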
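
To illustrate how .descendants differs from .contents, the sketch below walks a small nested snippet (an assumption for the sake of the example, parsed with the built-in 'html.parser'). Descendants are visited in document order and include both tags and text nodes.

```python
from bs4 import BeautifulSoup

# Illustrative nested markup: <p> contains <b>, which contains text and <i>
snippet = '<p><b>The <i>Dormouse</i></b></p>'
soup = BeautifulSoup(snippet, 'html.parser')
p = soup.p

direct = len(p.contents)         # only <b> is a direct child
all_nodes = list(p.descendants)  # <b>, 'The ', <i>, 'Dormouse'

print(direct)
for node in all_nodes:
    print(repr(node))
```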