BeautifulSoup

Introduction

Beautiful Soup is a Python library for extracting data from HTML or XML documents.

Installation

Install Beautiful Soup 4

pip install beautifulsoup4   # "pip install bs4" also works and pulls in the same library

Install lxml

pip install lxml   # the parser used in the examples below
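
To verify the installation, a quick sanity check such as the following can be run (a minimal sketch; the printed version numbers depend on what pip installed):

  # Both imports should succeed without errors after installation
  import bs4
  import lxml
  print(bs4.__version__)   # e.g. 4.x
  print(lxml.__version__)  # e.g. 4.x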

The BeautifulSoup object

Represents the entire document tree to be parsed.

It supports most of the methods for navigating and searching the document tree.

Creating a BeautifulSoup object

1. Import the module

from bs4 import BeautifulSoup

2. Create the BeautifulSoup object

soup = BeautifulSoup("<html>data</html>", "lxml")

Example

  # Import the module
  from bs4 import BeautifulSoup
  # Create the BeautifulSoup object ('lxml' is the parser)
  soup = BeautifulSoup('<html>data</html>', 'lxml')
  print(soup)
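
A note on the parser argument: 'lxml' requires the lxml package installed above. If it is not available, Python's built-in 'html.parser' can be used instead; a minimal sketch:

  # Fallback: use the parser from the Python standard library (no extra install needed)
  from bs4 import BeautifulSoup
  soup = BeautifulSoup('<html>data</html>', 'html.parser')
  print(soup)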

The find method of the BeautifulSoup object

The find method searches the document tree.

find(self, name=None, attrs={}, recursive=True, text=None, **kwargs)

Parameters

name: the tag name to search for

attrs: a dict of attribute names and values to match

recursive: whether to search all descendants recursively or only direct children (see the sketch after this list)

text: search by the text content of a tag (see the sketch after this list)
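
The recursive and text parameters are not exercised in the example further below, so here is a minimal sketch of how they behave (the HTML fragment is made up for illustration):

  from bs4 import BeautifulSoup

  doc = '<div><p><a href="#">link</a></p></div>'
  soup = BeautifulSoup(doc, 'lxml')
  div = soup.find('div')

  print(div.find('a'))                   # <a href="#">link</a> -- searches all descendants
  print(div.find('a', recursive=False))  # None -- only direct children of div are checked
  print(soup.find(text='link'))          # link -- matches by text content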

Steps

1. Import the module

2. Prepare the document string

  # Sample document
  <html>
  <head>
  <title>The Dormouse's story</title>
  </head>
  <body>
  <p class="title">
  <b>The Dormouse's story</b>
  </p>
  <p class="story">Once upon a time there were there little sisters; and their names were
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
  and they lived at the bottom of a well.
  </p>
  <p class="story">...</p>
   

3. Create the BeautifulSoup object

4. Find the title tag in the document

5. Find the a tags in the document

Introduction to the Tag object

A Tag object corresponds to an XML or HTML tag in the original document.

Tag has many methods and attributes for navigating the document tree, searching it, and getting the tag's content.

Common Tag object attributes

name: the tag's name

attrs: a dict of all the tag's attribute names and values

text: the tag's text content as a string

Example 1: using the sample document above

  # Import the module
  from bs4 import BeautifulSoup
  # Prepare the document string
  html = '''<html>
  <head>
  <title>The Dormouse's story</title>
  </head>
  <body>
  <p class="title">
  <b>The Dormouse's story</b>
  </p>
  <p class="story">Once upon a time there were three little sisters; and their names were
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
  and they lived at the bottom of a well.
  </p>
  <p class="story">...</p>
  '''
  # Create the BeautifulSoup object
  soup = BeautifulSoup(html, 'lxml')

  # 1. Search by tag name
  # Find the title tag
  title = soup.find('title')
  print(title)
  # Find the first a tag
  a = soup.find('a')
  print(a)
  # Find all a tags
  a_s = soup.find_all('a')
  print(a_s)

  # 2. Search by attribute
  # Find the tag whose id is link1
  # Option 1: pass the attribute as a keyword argument
  a = soup.find(id='link1')
  print(a)
  # Option 2: use attrs to specify the attribute
  a = soup.find(attrs={'id': 'link1'})
  print(a)

  # 3. Search by text
  text = soup.find(text='Elsie')
  print(text)

  # Tag object
  print(type(a))  # <class 'bs4.element.Tag'>
  print('Tag name', a.name)
  print('Tag attributes', a.attrs)
  print('Tag text', a.text)
   

Output

  <title>The Dormouse's story</title>
  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
  [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
  Elsie
  <class 'bs4.element.Tag'>
  Tag name a
  Tag attributes {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
  Tag text Elsie
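
Besides name, attrs and text, a single attribute value can be read from a Tag by indexing it like a dictionary, or with get(). A small sketch that reuses the a tag found in Example 1:

  # Read individual attribute values from the Tag stored in a
  print(a['href'])         # http://example.com/elsie
  print(a.get('id'))       # link1
  print(a.get('missing'))  # None -- get() returns None instead of raising KeyError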

Example 2: using a live website

  # 1. Import the required modules
  import requests
  from bs4 import BeautifulSoup

  # 2. Send the request and fetch the epidemic overview page
  response = requests.get('https://ncov.dxy.cn/ncovh5/view/pneumonia')
  home_page = response.content.decode()
  # print(home_page)  # quick check of the raw HTML

  # 3. Use BeautifulSoup to extract the epidemic data
  soup = BeautifulSoup(home_page,'lxml')
  script = soup.find(id='getListByCountryTypeService2true')
  print(script)

Output

  <script id="getListByCountryTypeService2true">try { window.getListByCountryTypeService2true = [{"id":10400460,"createTime":1629720585000,"modifyTime":1629720585000,"tags":"","countryType":2,"continents":"北美洲","provinceId":"8","provinceName":"美国","provinceShortName":"","cityName":"","currentConfirmedCount":6609851,"confirmedCount":37711159,"confirmedCountRank":1,"suspectedCount":0,"curedCount":30472804,"deadCount":628504,"deadCountRank":1,"deadRate":"1.66","deadRateRank":96,"comment":"","sort":0,"operator":"fengxin","locationId":971002,"countryShortCode":"USA","countryFullName":"United States of America","statisticsData":"https://file1.dxycdn.com/2020/0315/553/3402160512808052518-135.json","incrVo":{"currentConfirmedIncr":33294,"confirmedIncr":43270,"curedIncr":9748,"deadIncr":228},"showRank":true,"yesterdayConfirmedCount":2147383647,"yesterdayLocalConfirmedCount":2147383647,"yesterdayOtherConfirmedCount":2147383647,"highDanger":"","midDanger":"","highInDesc":"","lowInDesc":"","outDesc":""},{"id":10400566,"createTime":1629720586000,"modifyTime":1629720586000,"tags":"","countryType":2,"continents":"欧洲","provinceId":"10","provinceName":"英国","provinceShortName":"","cityName":"","currentConfirmedCount":6360727,"confirmedCount":6492906,"confirmedCountRank":6,"suspectedCount":0,"curedCount":539,"deadCount":131640,"deadCountRank":7,"deadRate":"2.02","deadRateRank":70,"comment":"","sort":0,"operator":"fengxin","locationId":961007,"countryShortCode":"GBR",......
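
To turn that result into Python data, the JSON array inside the script text can be sliced out and parsed. This is only a rough sketch that assumes the page still embeds the data in the structure shown above (the site's markup may have changed since this post was written, and script may be None if the id is no longer present):

  import json

  # The script body is assumed to look like:
  #   try { window.getListByCountryTypeService2true = [ ... ] }catch(e){}
  text = script.text
  start = text.find('[')               # first '[' opens the JSON array
  end = text.rfind(']') + 1            # last ']' closes it
  data = json.loads(text[start:end])   # a list of dicts, one per country
  print(len(data), data[0]['provinceName'])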
 