The BeautifulSoup module

1. The BeautifulSoup module takes an HTML or XML string and parses it into a structured tree; it then provides methods for quickly locating specific elements, which makes searching an HTML or XML document straightforward.
2. Install BeautifulSoup
pip3 install beautifulsoup4
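The examples below use html.parser, which ships with Python. As an optional side note (not required for anything that follows), the third-party lxml parser can also be installed with `pip3 install lxml` and selected through the features argument; a minimal sketch, assuming lxml is available in the environment:

from bs4 import BeautifulSoup
soup = BeautifulSoup("<p>demo</p>", features="lxml")   # requires the optional lxml package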
3. Usage
Create the example HTML

html_doc = """
            <html>
                <head>
                    <title>BeautifulSoup示例</title>
                </head>
            <body>
                <div>
                    <a href='http://www.dongdong.com'>东东<p>东东内容</p></a>
                </div>
                <a id='xixi'>西西</a>
                <div>
                    <p>南南内容</p>
                </div>
                <p>北北内容</p>
            </body>
            </html>
        """

Create a BeautifulSoup object

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")    # soup is the whole document
print(soup.prettify())                                     # pretty-print the parsed content

name: the tag's name
(1) Use the soup object to find the first a tag

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")    # soup is the whole document
tag = soup.find('a')                                      # find the first a tag
print(tag)

Output:
<a href="http://www.dongdong.com">东东<p>东东内容</p></a>
(2) Get the tag's name from the a tag

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")    # soup is the whole document
tag = soup.find('a')                                      # find the first a tag
name = tag.name                                           # get the tag's name
print(name)

Output:
a
(3) Rename the a tag by assigning to its name

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")      # soup is the whole document
tag = soup.find('a')                                      # find the first a tag
name = tag.name                                           # get the tag's name
tag.name = 'span'                                         # rename the a tag to span
print(tag)

Output:
<span href="http://www.dongdong.com">东东<p>东东内容</p></span>
attrs: tag attributes
(1) Get the a tag's attributes via attrs

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('a')
attrs = tag.attrs              # get the attributes
print(attrs)

Output:
{'href': 'http://www.dongdong.com'}
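Besides tag.attrs, individual attributes can be read dictionary-style or with .get(); a minimal sketch (the 'no-id' default below is just an illustrative placeholder, not part of the original example):

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('a')
print(tag['href'])                # dictionary-style access -> http://www.dongdong.com
print(tag.get('id', 'no-id'))     # .get() falls back to a default when the attribute is missing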
(2) Replace the a tag's attributes via attrs

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('a')
attrs = tag.attrs                                   # get the attributes
tag.attrs = {'href':'http://www.nannan.com'}        # replace the attributes
print(tag)

Output:
<a href="http://www.nannan.com">东东<p>东东内容</p></a>
(3) Add an attribute love="石头" to the tag via attrs

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")      
tag = soup.find('a')                                      
tag.attrs['love'] = '石头'
print(tag)

Output:
<a href="http://www.dongdong.com" love="石头">东东<p>东东内容</p></a>
(4) Delete the href attribute from the a tag via attrs

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('a')
attrs = tag.attrs                                   # get the attributes
del tag.attrs['href']
print(tag)

Output:
<a>东东<p>东东内容</p></a>
Tags and contents
(1) Use children to get all direct children of the body tag

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tags = soup.find('body').children
print(list(tags))

Output:
['\n', <div>
<a href="http://www.dongdong.com">东东<p>东东内容</p></a>
</div>, '\n', <a id="xixi">西西</a>, '\n', <div>
<p>南南内容</p>
</div>, '\n', <p>北北内容</p>, '\n']
(2) Iterate over the children of body and use type(tag) to separate Tag objects from text nodes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tags = soup.find('body').children      # iterate over each child and use type(tag) to tell tags from text
from bs4.element import Tag
for tag in tags:
    if type(tag) == Tag:         # a Tag instance means this child is a tag
        print('我是标签:',tag, type(tag))
    else:                        # otherwise it is a text node
        print('文本....')

Output:
文本....
我是标签: <div>
<a href="http://www.dongdong.com">东东<p>东东内容</p></a>
</div> <class 'bs4.element.Tag'>
文本....
我是标签: <a id="xixi">西西</a> <class 'bs4.element.Tag'>
文本....
我是标签: <div>
<p>南南内容</p>
</div> <class 'bs4.element.Tag'>
文本....
我是标签: <p>北北内容</p> <class 'bs4.element.Tag'>
文本....
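As a side note to the loop above, isinstance checks against Tag and NavigableString are a common alternative to comparing type(tag) == Tag; a minimal sketch of the same idea:

from bs4 import BeautifulSoup
from bs4.element import Tag, NavigableString
soup = BeautifulSoup(html_doc, features="html.parser")
for child in soup.find('body').children:
    if isinstance(child, Tag):
        print('tag:', child.name)
    elif isinstance(child, NavigableString):
        print('text node')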
(3) Use descendants to get all descendants of the body tag (walks the tree recursively, one node at a time)

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tags = soup.find('body').descendants
print(list(tags))

Output:
['\n', <div>
<a href="http://www.dongdong.com">东东<p>东东内容</p></a>
</div>, '\n', <a href="http://www.dongdong.com">东东<p>东东内容</p></a>, '东东', <p>东东内容</p>, '东东内容', '\n', '\n', <a id="xixi">西西</a>, '西西', '\n', <div>
<p>南南内容</p>
</div>, '\n', <p>南南内容</p>, '南南内容', '\n', '\n', <p>北北内容</p>, '北北内容', '\n']
(4) Use clear to empty everything inside the body tag (the tag itself is kept)

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('body')
tag.clear()
print(soup)

Output:
<html>
<head>
<title>BeautifulSoup示例</title>
</head>
<body></body>
</html>
(5) decompose: recursively remove the tag and everything inside it (the tag itself is deleted too)

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
body = soup.find('body')
body.decompose()
print(soup)

Output:
<html>
<head>
<title>BeautifulSoup示例</title>
</head>
</html>
(6) extract: recursively remove the tag and everything inside it, and return the removed tag

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
body = soup.find('body')
v = body.extract()
print(v)

Output:
<body>
<div>
<a href="http://www.dongdong.com">东东<p>东东内容</p></a>
</div>
<a id="xixi">西西</a>
<div>
<p>南南内容</p>
</div>
<p>北北内容</p>
</body>
(7) decode: convert the tag to a string (including the current tag)

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
body = soup.find('body')
v = body.decode()
print(v)

Output:
<body>
<div>
<a href="http://www.dongdong.com">东东<p>东东内容</p></a>
</div>
<a id="xixi">西西</a>
<div>
<p>南南内容</p>
</div>
<p>北北内容</p>
</body>
(8) decode_contents: convert the tag's contents to a string (excluding the current tag)

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
body = soup.find('body')
v = body.decode_contents()
print(v)

Output:
<div>
<a href="http://www.dongdong.com">东东<p>东东内容</p></a>
</div>
<a id="xixi">西西</a>
<div>
<p>南南内容</p>
</div>
<p>北北内容</p>
(9) find: get the first matching tag

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('body').find('p',recursive=False)                            # recursive=False searches only direct children, not the whole subtree
print(tag)

Output:
<p>北北内容</p>
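find only returns the first match. For completeness, a small sketch of find_all, which returns every matching tag as a list and also accepts attribute filters (not used in the original examples):

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
print(soup.find_all('p'))            # every <p> tag in the document
print(soup.find('a', id='xixi'))     # filter by attribute -> <a id="xixi">西西</a>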
(10) get_text: get the text content inside the tag

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('a')
print(tag)
v = tag.get_text()
print(v)

Output:
<a href="http://www.dongdong.com">东东<p>东东内容</p></a>
东东东东内容
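Closely related to get_text: tag.string yields the text only when the tag contains a single text child and is None otherwise, while tag.text behaves like get_text(). A short sketch against the same document:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('a')
print(tag.string)                   # None - this <a> has both text and a nested <p>
print(soup.find('title').string)    # BeautifulSoup示例 - a single text child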
(11) index: find a tag's index position inside another tag

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('body')
v = tag.index(tag.find('div'))
print(v)

Output:
1
(12) Iterate over a tag's children with enumerate to see each child's index position

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('body')
for i,v in enumerate(tag):
    print(i,v)

Output:
0
1 <div>
<a href="http://www.dongdong.com">东东<p>东东内容</p></a>
</div>
2
3 <a id="xixi">西西</a>
4
5 <div>
<p>南南内容</p>
</div>
6
7 <p>北北内容</p>
8
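The even indexes above are whitespace text nodes between the tags. To walk only the direct child tags, find_all with recursive=False and no tag name can be used; a brief sketch:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
for child in soup.find('body').find_all(recursive=False):
    print(child.name)        # div, a, div, p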
(13) append: append a new tag at the end of the current tag's contents

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
from bs4.element import Tag
obj = Tag(name='i',attrs={'id': 'it'})
obj.string = '我是一个新来的'
tag = soup.find('body')
tag.append(obj)
print(soup)

Output:
<html>
<head>
<title>BeautifulSoup示例</title>
</head>
<body>
<div>
<a href="http://www.dongdong.com">东东<p>东东内容</p></a>
</div>
<a id="xixi">西西</a>
<div>
<p>南南内容</p>
</div>
<p>北北内容</p>
<i id="it">我是一个新来的</i></body>
</html>
(14) insert: insert a new tag at a given position inside the current tag

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
from bs4.element import Tag
obj = Tag(name='i', attrs={'id': 'it'})
obj.string = '我是一个新来的'
tag = soup.find('body')
tag.insert(2, obj)
print(soup)

Output:
<html>
<head>
<title>BeautifulSoup示例</title>
</head>
<body>
<div>
<a href="http://www.dongdong.com">东东<p>东东内容</p></a>
</div><i id="it">我是一个新来的</i>
<a id="xixi">西西</a>
<div>
<p>南南内容</p>
</div>
<p>北北内容</p>
</body>
</html>
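Besides the position-based insert, elements also provide insert_before and insert_after for placing a new node relative to an existing one; a minimal sketch reusing the same <i> element as above:

from bs4 import BeautifulSoup
from bs4.element import Tag
soup = BeautifulSoup(html_doc, features="html.parser")
obj = Tag(name='i', attrs={'id': 'it'})
obj.string = '我是一个新来的'
soup.find('a', id='xixi').insert_before(obj)   # the new <i> now sits right before <a id="xixi">
print(soup.find('body'))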
(15) replace_with: replace the current tag with the given tag

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")

from bs4.element import Tag
obj = Tag(name='i', attrs={'id': 'it'})
obj.string = '我是一个新来的'
tag = soup.find('div')
tag.replace_with(obj)
print(soup)

Output:
<html>
<head>
<title>BeautifulSoup示例</title>
</head>
<body>
<i id="it">我是一个新来的</i>
<a id="xixi">西西</a>
<div>
<p>南南内容</p>
</div>
<p>北北内容</p>
</body>
</html>
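The examples above construct new elements with the Tag class directly; soup.new_tag is the factory method the library documents for this purpose and avoids importing bs4.element. A short sketch of the same replace_with example using it:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
obj = soup.new_tag('i', id='it')      # create an <i id="it"> element via the soup factory
obj.string = '我是一个新来的'
soup.find('div').replace_with(obj)
print(soup.find('body'))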
