Python Web Scraping Tutorial from Scratch (Day 05)

 

 

Introduction to BeautifulSoup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree, and it can save you hours or even days of work. (From the BeautifulSoup documentation)

Like lxml, BeautifulSoup is a powerful parsing library that makes it easy to extract elements from a page, which makes it a great tool for web scraping.

Install the BeautifulSoup library:

pip install beautifulsoup4  # the PyPI package name; it is imported as bs4

BeautifulSoup needs a parser to parse a page. Here are some of the parsers it supports:

Parser                     Usage
Python standard library    BeautifulSoup(html, "html.parser")
lxml HTML parser           BeautifulSoup(html, "lxml")
lxml XML parser            BeautifulSoup(html, "xml")
html5lib                   BeautifulSoup(html, "html5lib")
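
html.parser ships with Python, while the lxml and html5lib parsers are separate packages (pip install lxml html5lib). A minimal sketch, just to show that the parser name is the second argument to BeautifulSoup (the snippet markup below is made up for the demo):

from bs4 import BeautifulSoup

snippet = "<p class='title'><b>hello</b></p>"  # throwaway markup for the demo
print(BeautifulSoup(snippet, 'html.parser').b.string)  # hello
print(BeautifulSoup(snippet, 'lxml').b.string)         # hello (requires lxml)
print(BeautifulSoup(snippet, 'html5lib').b.string)     # hello (requires html5lib)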

BeautifulSoup Usage

Import the BeautifulSoup library:

from bs4 import BeautifulSoup
import re

Sample HTML page used in the examples:

html = """
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
"""

Basic usage

soup = BeautifulSoup(html, 'lxml')
print(soup.title.string.strip()) # .string returns the text inside the tag
print(soup.p.name) # .name returns the tag name
print(soup.p.attrs) # .attrs returns the tag's attributes as a dict
The Dormouse's story
p
{'class': ['title'], 'name': 'dromouse'}
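
Individual attribute values can also be read with dictionary-style access on a tag; a short sketch using the same soup object:

print(soup.p['class'])     # ['title'] - multi-valued attributes come back as a list
print(soup.p.get('name'))  # dromouse - .get() returns None instead of raising if the attribute is missing
print(soup.a['href'])      # http://example.com/elsie - soup.a is the first <a> tag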

Nested selection

print(soup.head.title.string.strip()) # tag1.tag2 selects a node one level below tag1
The Dormouse's story
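
The chain can go as deep as the document nests; for example, with the same soup:

print(soup.body.p.b.string.strip()) # three levels: body -> p -> b
The Dormouse's story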

Associated selection (directly getting the children of the selected element; see the note after the output below)

print(soup.p.contents) # .contents returns a list of the direct child nodes
['\n', <b>
    The Dormouse's story
   </b>, '\n']
print(soup.body.children) # .children returns an iterator over the direct child nodes
for child in soup.body.children:
    print(child)
    print('---------')
<list_iterator object at 0xaddcbe70>



---------
<p class="title" name="dromouse">
<b>
    The Dormouse's story
   </b>
</p>
---------


---------
<p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
<!-- Elsie -->
</a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
---------



---------
<p class="story">
   ...
  </p>
---------



---------
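
.contents returns a list, while .children returns an iterator, which is why printing it directly only shows <list_iterator object ...>. A quick sketch (same soup) to make the difference concrete:

print(type(soup.p.contents))     # <class 'list'>
print(type(soup.body.children))  # <class 'list_iterator'>
children = list(soup.body.children)  # materialize the iterator when len() or indexing is needed
print(len(children))             # whitespace text nodes count as children too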

find_all: find_all(name, attrs, recursive, text, **kwargs)

(1)name

print(soup.find_all(name='b')) # name selects tags by tag name
[<b>
    The Dormouse's story
   </b>]
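
find() takes the same arguments as find_all() but returns only the first match (or None) instead of a list:

print(soup.find(name='b'))  # the first <b> tag, not wrapped in a list
print(soup.find('a'))       # the first <a> tag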

(2)attrs

print(soup.find_all(attrs={'class': 'sister'})) # select tags by their attributes
[<a class="sister" href="http://example.com/elsie" id="link1">
<!-- Elsie -->
</a>, <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>, <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>]
print(soup.find_all(class_='sister')) # same result; "class" is a Python keyword, so the keyword-argument form is class_
[<a class="sister" href="http://example.com/elsie" id="link1">
<!-- Elsie -->
</a>, <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>, <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>]
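
For attribute names that are valid Python identifiers, the attribute can also be passed directly as a keyword argument; only class needs the trailing underscore because it is a reserved word:

print(soup.find_all(id='link2'))  # same as attrs={'id': 'link2'}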

(3)text

print(soup.find_all('a', {'href': re.compile(r'http://')})) # match an attribute value with a regular expression
[<a class="sister" href="http://example.com/elsie" id="link1">
<!-- Elsie -->
</a>, <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>, <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>]
print(soup.find_all(text=re.compile('Dormouse'))) # match tag text content
["\n   The Dormouse's story\n  ", "\n    The Dormouse's story\n   "]

CSS Selectors

print(soup.select('.sister')) # select all tags with class "sister"
[<a class="sister" href="http://example.com/elsie" id="link1">
<!-- Elsie -->
</a>, <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>, <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>]
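
select() accepts ordinary CSS selector syntax, so id selectors, nesting and per-result attribute access all work; a short sketch with the same soup (select_one returns only the first match):

print(soup.select('#link1'))     # select by id
print(soup.select('p.story a'))  # nested: <a> tags inside <p class="story">
for a in soup.select('a.sister'):
    print(a['href'])             # dictionary-style attribute access on each selected tag
print(soup.select_one('p.title').get_text().strip())  # The Dormouse's story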