A Detailed Guide to the BeautifulSoup Parsing Library

BeautifulSoup is a flexible, convenient, and efficient web-page parsing library that supports multiple parsers.

With it you can extract information from a web page without having to write regular expressions.

Installation: pip3 install beautifulsoup4

Usage:

Parsers supported by BeautifulSoup:

- Python standard library: BeautifulSoup(markup, "html.parser"). Advantages: built into Python, decent speed, tolerant of malformed documents. Disadvantages: poor error tolerance in versions before Python 2.7.3 / 3.2.2.
- lxml HTML parser: BeautifulSoup(markup, "lxml"). Advantages: very fast, tolerant of malformed documents. Disadvantages: requires the lxml C library.
- lxml XML parser: BeautifulSoup(markup, "xml"). Advantages: very fast, the only parser that supports XML. Disadvantages: requires the lxml C library.
- html5lib: BeautifulSoup(markup, "html5lib"). Advantages: best error tolerance, parses documents the way a browser does, produces valid HTML5. Disadvantages: very slow, depends on an external Python library.

Basic usage:

import bs4
from bs4 import BeautifulSoup

# below is a fragment of deliberately incomplete HTML
html = '''
<html><head><title>The Demouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Domouse's story</b></p>
<p class="story">Once upon a time there were three little sisters,and their name were
<a href="http://examlpe.com/elele" class="sister" ld="link1"><!--Elsle--></a>
<a href="http://examlpe.com/lacie" class="sister" ld="link2"><!--Elsle--></a>
<a href="http://examlpe.com/title" class="sister" ld="link3"><title></a>
and they lived the bottom of a wall</p>
<p clas="stuy">..</p>
'''

soup = BeautifulSoup(html,'lxml')

# prettify() prints the document with the missing tags filled in (error-tolerant parsing)
print(soup.prettify())

# select the title tag and print its text
print(soup.title.string)
Output:

<html>
 <head>
  <title>
   The Demouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Domouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters,and their name were
   <a class="sister" href="http://examlpe.com/elele" ld="link1">
    <!--Elsle-->
   </a>
   <a class="sister" href="http://examlpe.com/lacie" ld="link2">
    <!--Elsle-->
   </a>
   <a class="sister" href="http://examlpe.com/title" ld="link3">
    <title>
    </title>
   </a>
   and they lived the bottom of a wall
  </p>
  <p clas="stuy">
   ..
  </p>
 </body>
</html>
The Demouse's story

Tag selectors

In the example above, soup.title.string selects the title tag and reads its text.

Selecting elements:

import bs4
from bs4 import BeautifulSoup

# below is a fragment of deliberately incomplete HTML
html = '''
<html><head><title>The Demouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Domouse's story</b></p>
<p class="story">Once upon a time there were three little sisters,and their name were
<a href="http://examlpe.com/elele" class="sister" ld="link1"><!--Elsle--></a>
<a href="http://examlpe.com/lacie" class="sister" ld="link2"><!--Elsle--></a>
<a href="http://examlpe.com/title" class="sister" ld="link3"><title></a>
and they lived the bottom of a wall</p>
<p clas="stuy">..</p>
'''

soup = BeautifulSoup(html,'lxml')
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)
Output:
<title>The Demouse's story</title>
<class 'bs4.element.Tag'>
<head><title>The Demouse's story</title></head>
<p class="title" name="dromouse"><b>The Domouse's story</b></p>
# only the first match is returned

Getting the tag name:

import bs4
from bs4 import BeautifulSoup

# below is a fragment of deliberately incomplete HTML
html = '''
<html><head><title>The Demouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Domouse's story</b></p>
<p class="story">Once upon a time there were three little sisters,and their name were
<a href="http://examlpe.com/elele" class="sister" ld="link1"><!--Elsle--></a>
<a href="http://examlpe.com/lacie" class="sister" ld="link2"><!--Elsle--></a>
<a href="http://examlpe.com/title" class="sister" ld="link3"><title></a>
and they lived the bottom of a wall</p>
<p clas="stuy">..</p>
'''

soup = BeautifulSoup(html,'lxml')
print(soup.title.name)
Output: title

Getting attributes:

import bs4
from bs4 import BeautifulSoup

# below is a fragment of deliberately incomplete HTML
html = '''
<html><head><title>The Demouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Domouse's story</b></p>
<p class="story">Once upon a time there were three little sisters,and their name were
<a href="http://examlpe.com/elele" class="sister" ld="link1"><!--Elsle--></a>
<a href="http://examlpe.com/lacie" class="sister" ld="link2"><!--Elsle--></a>
<a href="http://examlpe.com/title" class="sister" ld="link3"><title></a>
and they lived the bottom of a wall</p>
<p clas="stuy">..</p>
'''

soup = BeautifulSoup(html,'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])
#both soup.p.attrs['name'] and soup.p['name'] work for reading an attribute
#note the square brackets!

Getting text content:

As shown in the examples, the string attribute returns a tag's text, e.g. soup.title.string.

Nested selection:

e.g. print(soup.head.title.string)
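Both ideas can be sketched on a tiny made-up document (html.parser is used here so no extra install is needed):

```python
from bs4 import BeautifulSoup

html = "<html><head><title>The Demouse's story</title></head></html>"
soup = BeautifulSoup(html, "html.parser")

# .string returns the text inside the tag
print(soup.title.string)           # The Demouse's story

# nested selection: walk head -> title, then read its text
print(soup.head.title.string)      # The Demouse's story
```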

Child and descendant nodes:

e.g. print(soup.p.contents): the contents attribute returns all direct children of the p tag as a list.

You can also use children. Unlike contents, children is an iterator over the direct children, so you need a loop to read its items:

print(soup.p.children)

for i, child in enumerate(soup.p.children):

  print(i,child)

There is also a descendants attribute, which yields all descendant nodes; it is likewise an iterator:

print(soup.p.descendants)

for i, child in enumerate(soup.p.descendants):

  print(i,child)

Note: in the child/descendant sections here and the parent/ancestor sections below, syntax like soup.p selects the first matching p tag, so the nodes shown all belong to that first match.
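A runnable sketch of the three attributes on a made-up paragraph:

```python
from bs4 import BeautifulSoup

html = "<p>Hello <b>little</b> world</p>"
soup = BeautifulSoup(html, "html.parser")

# contents: the direct children as a list
print(soup.p.contents)             # ['Hello ', <b>little</b>, ' world']

# children: an iterator over the same direct children
for i, child in enumerate(soup.p.children):
    print(i, child)

# descendants: walks the whole subtree, so <b> and its inner string both appear
for i, node in enumerate(soup.p.descendants):
    print(i, node)
```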

Parent and ancestor nodes:

The parent attribute: returns the direct parent node.

The parents attribute: returns all ancestor nodes, as an iterator.
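A minimal sketch of the difference, on made-up markup:

```python
from bs4 import BeautifulSoup

html = "<html><body><p><b>text</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# parent: the single direct parent of the tag
print(soup.b.parent.name)          # p

# parents: an iterator over every ancestor, up to the document itself
print([anc.name for anc in soup.b.parents])
```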

Sibling nodes:

The next_siblings attribute: the siblings after the tag (an iterator).

The previous_siblings attribute: the siblings before the tag (an iterator).
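A small sketch on a made-up list (with no whitespace between tags, the only siblings are the li elements):

```python
from bs4 import BeautifulSoup

html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")

first = soup.li
# next_siblings: everything after the tag at the same level
print([s.string for s in first.next_siblings])      # ['two', 'three']

last = soup.find_all("li")[-1]
# previous_siblings: everything before the tag, in reverse order
print([s.string for s in last.previous_siblings])   # ['two', 'one']
```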

--------------------------------------------------------------------------------------------------------------------

Standard selectors

The tag selectors described above are fast, but they are not flexible enough for real HTML parsing needs.

The find_all method:

find_all(name, attrs, recursive, text, **kwargs)

It can search the document by tag name, attributes, or text content.

Searching by name:

import bs4
from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>hello</h4>
    </div>
    <div class="panel-body">
        <url class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">jay</li>
        </url>
        <url class="list list-small" id="list-2">
            <li lass="element">Foo</li>
            <li lass="element">Bar</li>
        </url>
    </div>
    </div>
'''

soup = BeautifulSoup(html,'lxml')
print(soup.find_all('url'))
Output:
[<url class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">jay</li>
</url>, <url class="list list-small" id="list-2">
<li lass="element">Foo</li>
<li lass="element">Bar</li>
</url>]

As you can see, the result is a list. You can loop over it and search within each element, e.g.:

import bs4
from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>hello</h4>
    </div>
    <div class="panel-body">
        <url class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">jay</li>
        </url>
        <url class="list list-small" id="list-2">
            <li lass="element">Foo</li>
            <li lass="element">Bar</li>
        </url>
    </div>
    </div>
'''

soup = BeautifulSoup(html,'lxml')
for url in soup.find_all('url'):
    print(url.find_all('li'))

Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">jay</li>]
[<li lass="element">Foo</li>, <li lass="element">Bar</li>]  

Searching by attrs:

attrs takes its parameters as a dictionary, e.g.:

import bs4
from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>hello</h4>
    </div>
    <div class="panel-body">
        <url class="list" id="list-1" name='elements'>
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">jay</li>
        </url>
        <url class="list list-small" id="list-2">
            <li lass="element">Foo</li>
            <li lass="element">Bar</li>
        </url>
    </div>
    </div>
'''

soup = BeautifulSoup(html,'lxml')

print(soup.find_all(attrs={'id':'list-1'}))#soup.find_all(id='list-1') also works
print(soup.find_all(attrs={'name':'elements'}))
Output:
[<url class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">jay</li>
</url>]
[<url class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">jay</li>
</url>]

Note: you can search with soup.find_all(id='list-1'), but for the class attribute you must write class_='...'. Because class is a keyword in Python, BeautifulSoup uses class_ for the keyword-argument form.
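A short sketch of the two equivalent spellings, on made-up markup:

```python
from bs4 import BeautifulSoup

html = ('<ul><li class="element">Foo</li>'
        '<li class="element">Bar</li>'
        '<li class="other">Baz</li></ul>')
soup = BeautifulSoup(html, "html.parser")

# class is a Python keyword, so the keyword argument is spelled class_
print(soup.find_all(class_="element"))

# the attrs dict form needs no trailing underscore
print(soup.find_all(attrs={"class": "element"}))
```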

Searching by text:

import bs4
from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>hello</h4>
    </div>
    <div class="panel-body">
        <url class="list" id="list-1" name='elements'>
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">jay</li>
        </url>
        <url class="list list-small" id="list-2">
            <li lass="element">Foo</li>
            <li lass="element">Bar</li>
        </url>
    </div>
    </div>
'''

soup = BeautifulSoup(html,'lxml')

print(soup.find_all(text='Foo'))
Output:
['Foo', 'Foo'] 

The find method takes exactly the same arguments as find_all. The difference is that find_all returns every match as a list, while find returns only a single element: the first match.

find(name, attrs, recursive, text, **kwargs)
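The difference can be sketched on made-up markup:

```python
from bs4 import BeautifulSoup

html = '<ul><li class="element">Foo</li><li class="element">Bar</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# find_all: a list of every match
print(soup.find_all("li"))

# find: only the first match, or None when nothing matches
print(soup.find("li"))             # <li class="element">Foo</li>
print(soup.find("span"))           # None
```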

find_parents()

find_parent()

find_next_siblings()

find_next_sibling()

find_previous_siblings()

find_previous_sibling()

find_all_next()

find_next()

find_all_previous()

find_previous()

These methods all take the same arguments as find() and find_all(); they differ only in which part of the tree they search (ancestors, siblings, or the nodes before/after the current one).
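A few of these variants sketched on a made-up snippet (the id values are only for illustration):

```python
from bs4 import BeautifulSoup

html = '<div><p id="a">one</p><p id="b">two</p><p id="c">three</p></div>'
soup = BeautifulSoup(html, "html.parser")

b = soup.find(id="b")
# find_next_sibling searches forward among siblings only
print(b.find_next_sibling("p"))        # <p id="c">three</p>
# find_previous_sibling searches backward among siblings
print(b.find_previous_sibling("p"))    # <p id="a">one</p>
# find_parent walks up the tree instead
print(b.find_parent("div").name)       # div
```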

CSS selectors

Passing a CSS selector directly to select() performs the selection:

import bs4
from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>hello</h4>
    </div>
    <div class="panel-body">
        <url class="list" id="list-1" name='elements'>
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">jay</li>
        </url>
        <url class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </url>
    </div>
    </div>
'''

soup = BeautifulSoup(html,'lxml')

#class selectors need a leading dot: .panel .panel-heading
print(soup.select('.panel .panel-heading'))
#tag names are used directly
print(soup.select('url li'))
#id selectors use a leading #
print(soup.select('#list-2 .element'))
Output:
[<div class="panel-heading">
<h4>hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

Nested selection, level by level:

import bs4
from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>hello</h4>
    </div>
    <div class="panel-body">
        <url class="list" id="list-1" name='elements'>
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">jay</li>
        </url>
        <url class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </url>
    </div>
    </div>
'''

soup = BeautifulSoup(html,'lxml')

for url in soup.select('url'):
    print(url.select('li'))
Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

Getting attributes:

import bs4
from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>hello</h4>
    </div>
    <div class="panel-body">
        <url class="list" id="list-1" name='elements'>
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">jay</li>
        </url>
        <url class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </url>
    </div>
    </div>
'''

soup = BeautifulSoup(html,'lxml')

for url in soup.select('url'):
    print(url['id'])
    #print(url.attrs['id']) works as well
Output:
list-1
list-2

Getting text content:

import bs4
from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>hello</h4>
    </div>
    <div class="panel-body">
        <url class="list" id="list-1" name='elements'>
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">jay</li>
        </url>
        <url class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </url>
    </div>
    </div>
'''

soup = BeautifulSoup(html,'lxml')

for l in soup.select('li'):
    print(l.get_text())
Output:
Foo
Bar
jay
Foo
Bar

  

Summary:

Use the lxml parser by default; fall back to html.parser when necessary.

Tag selectors are fast but offer only weak filtering.

Use find() / find_all() to match a single result or multiple results.

If you are comfortable with CSS selectors, select() is recommended.

Remember the common ways of getting attributes and text values.

 
Posted 2018-06-14 12:53 by RongHe