BeautifulSoup
Introduction
Beautiful Soup is a Python library for extracting data from HTML and XML files.
Installation
Install Beautiful Soup 4 (the bs4 package on PyPI is a wrapper that pulls in beautifulsoup4):
pip install bs4
Install lxml, the parser used in the examples below:
pip install lxml
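To confirm that both installs worked, a quick sanity check is to import the packages and print their versions:

```python
# sanity check: both packages import and report a version
import bs4
from lxml import etree

print(bs4.__version__)       # e.g. 4.x.x
print(etree.LXML_VERSION)    # version tuple of the installed lxml
```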
The BeautifulSoup Object
Represents the entire document tree to be parsed.
It supports most of the methods described for traversing and searching the document tree.
Creating a BeautifulSoup object
1. Import the module
from bs4 import BeautifulSoup
2. Create the BeautifulSoup object
soup = BeautifulSoup('<html>data</html>', 'lxml')
Example
# import the module
from bs4 import BeautifulSoup
# create a BeautifulSoup object; 'lxml' is the parser to use
soup = BeautifulSoup('<html>data</html>', 'lxml')
print(soup)
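The second argument names the parser library that should process the markup. As a minimal sketch, when lxml is not installed, the stdlib 'html.parser' also works with no extra install:

```python
from bs4 import BeautifulSoup

# 'html.parser' ships with Python, so no extra install is needed
soup = BeautifulSoup('<html>data</html>', 'html.parser')
print(soup)            # <html>data</html>
print(soup.html.text)  # data
```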
The find Method of the BeautifulSoup Object
The find method searches the document tree.
find(self, name=None, attrs={}, recursive=True, text=None, **kwargs)
Parameters
name: the tag name to search for
attrs: a dict of attribute names and values to match
recursive: whether to search descendants recursively
text: search by text content
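Each parameter can be exercised on a tiny made-up document (the markup below is just for illustration):

```python
from bs4 import BeautifulSoup

html = '<div><p class="a">x</p><span>y</span></div>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.find('span'))                 # name: first <span> tag
print(soup.find(attrs={'class': 'a'}))   # attrs: the tag with class="a"
print(soup.find('p', recursive=False))   # None: <p> is not a direct child of soup
print(soup.find(text='y'))               # text: matches the string 'y' itself
```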
Steps
1. Import the module
2. Prepare the document string
# sample document
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title">
<b>The Dormouse's story</b>
</p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
3. Create a BeautifulSoup object
4. Find the title tag in the document
5. Find the a tags in the document
Tag Objects
A Tag object corresponds to an XML or HTML tag in the original document.
Tag has many methods and attributes for traversing the document tree, searching the document tree, and getting tag content.
Common Tag attributes
name: the tag's name
attrs: a dict of all the tag's attributes and their values
text: the tag's text as a string
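Besides the three attributes above, a Tag also supports dict-style access to individual attributes. A small sketch (the one-line markup is made up for illustration):

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>',
    'html.parser').a

print(tag.name)    # a
print(tag.attrs)   # note: class is a multi-valued attribute, so its value is a list
print(tag.text)    # Elsie
print(tag['href'])          # dict-style access to one attribute
print(tag.get('missing'))   # None: get() avoids a KeyError for absent attributes
```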
Example 1: using the sample document
# import the module
from bs4 import BeautifulSoup
# prepare the document string
html = '''<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title">
<b>The Dormouse's story</b>
</p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
'''
# create a BeautifulSoup object
soup = BeautifulSoup(html, 'lxml')
# 1. search by tag name
# find the title tag
title = soup.find('title')
print(title)
# find the first a tag
a = soup.find('a')
print(a)
# find all a tags
a_s = soup.find_all('a')
print(a_s)
# 2. search by attribute
# find the tag whose id is link1
# method 1: pass the attribute as a keyword argument
a = soup.find(id='link1')
print(a)
# method 2: use attrs to specify the attribute
a = soup.find(attrs={'id': 'link1'})
print(a)
# 3. search by text
text = soup.find(text='Elsie')
print(text)
# Tag object
print(type(a))  # <class 'bs4.element.Tag'>
print('tag name', a.name)
print('tag attributes', a.attrs)
print('tag text', a.text)
Output
<title>The Dormouse's story</title>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
Elsie
<class 'bs4.element.Tag'>
tag name a
tag attributes {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
tag text Elsie
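A common next step after find_all is to pull one attribute out of every match. A sketch on a trimmed copy of the sample document:

```python
from bs4 import BeautifulSoup

html = '''<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>'''
soup = BeautifulSoup(html, 'html.parser')

# collect the href of every <a> returned by find_all
links = [a['href'] for a in soup.find_all('a')]
print(links)  # ['http://example.com/elsie', 'http://example.com/lacie']
```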
Example 2: scraping a live page
# 1. import the modules
import requests
from bs4 import BeautifulSoup
# 2. send the request and fetch the epidemic-data home page
response = requests.get('https://ncov.dxy.cn/ncovh5/view/pneumonia')
home_page = response.content.decode()
# print(home_page)  # uncomment to inspect the raw page
# 3. extract the epidemic data with BeautifulSoup
soup = BeautifulSoup(home_page, 'lxml')
script = soup.find(id='getListByCountryTypeService2true')
print(script)
Output
<script id="getListByCountryTypeService2true">try { window.getListByCountryTypeService2true = [{"id":10400460,"createTime":1629720585000,"modifyTime":1629720585000,"tags":"","countryType":2,"continents":"北美洲","provinceId":"8","provinceName":"美国","provinceShortName":"","cityName":"","currentConfirmedCount":6609851,"confirmedCount":37711159,"confirmedCountRank":1,"suspectedCount":0,"curedCount":30472804,"deadCount":628504,"deadCountRank":1,"deadRate":"1.66","deadRateRank":96,"comment":"","sort":0,"operator":"fengxin","locationId":971002,"countryShortCode":"USA","countryFullName":"United States of America","statisticsData":"https://file1.dxycdn.com/2020/0315/553/3402160512808052518-135.json","incrVo":{"currentConfirmedIncr":33294,"confirmedIncr":43270,"curedIncr":9748,"deadIncr":228},"showRank":true,"yesterdayConfirmedCount":2147383647,"yesterdayLocalConfirmedCount":2147383647,"yesterdayOtherConfirmedCount":2147383647,"highDanger":"","midDanger":"","highInDesc":"","lowInDesc":"","outDesc":""},{"id":10400566,"createTime":1629720586000,"modifyTime":1629720586000,"tags":"","countryType":2,"continents":"欧洲","provinceId":"10","provinceName":"英国","provinceShortName":"","cityName":"","currentConfirmedCount":6360727,"confirmedCount":6492906,"confirmedCountRank":6,"suspectedCount":0,"curedCount":539,"deadCount":131640,"deadCountRank":7,"deadRate":"2.02","deadRateRank":70,"comment":"","sort":0,"operator":"fengxin","locationId":961007,"countryShortCode":"GBR",......
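The script text above embeds a JSON array inside a JavaScript assignment, so one way to get at the data is to slice out the array and parse it with the json module. The script_text below is a trimmed, made-up stand-in with the same shape, not the live response:

```python
import json
import re

# trimmed stand-in for the script body shown above (assumed shape)
script_text = ('try { window.getListByCountryTypeService2true = '
               '[{"provinceName":"United States","confirmedCount":37711159}] '
               '}catch(e){}')

# everything between the first '[' and the last ']' is the JSON array
match = re.search(r'\[.*\]', script_text, re.DOTALL)
data = json.loads(match.group())
print(data[0]['provinceName'], data[0]['confirmedCount'])
```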
