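Beautiful Soup documentation study notes

Beautiful Soup

bs4 documentation (Chinese): https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#beautiful-soup-4-4-0
bs4 documentation (English): https://www.crummy.com/software/BeautifulSoup/bs4/doc/
How it works: Beautiful Soup parses HTML markup into a tree of tags.
Pretty-printing the tag tree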

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())

Reaching tags with dot notation

soup.title  # the first <title> tag
# <title>The Dormouse's story</title>

soup.title.name  # the name of that tag
# 'title'

soup.title.string  # the string inside the <title> tag
# "The Dormouse's story"

soup.title.parent.name  # the name of the parent tag
# 'head'

soup.p  # the first <p> tag
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']  # the class attribute of that <p> tag; class is multi-valued, so this is a list
# ['title']

Get one attribute from every <a> tag, for example the link target

for link in soup.find_all('a'):
    print(link.get('href'))

Get all of the text content

soup.get_text()
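
The snippets above all assume an already parsed soup object. A minimal self-contained sketch, assuming a shortened version of the "three sisters" sample from the bs4 documentation as html_doc and the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Shortened sample document (an assumption; any HTML string works).
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string        # string inside the first <title>
parent_name = soup.title.parent.name  # name of the enclosing tag
p_classes = soup.p["class"]           # multi-valued attribute, returned as a list
links = [a.get("href") for a in soup.find_all("a")]

print(title_text, parent_name, p_classes, links)
```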

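Parsers

Differences between parsers
The second argument to the BeautifulSoup constructor names the parser to use. If it is omitted, bs4 picks the best parser installed in the environment, in this order of preference: lxml, html5lib, then the Python standard library's html.parser.

Selecting a parser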

Python standard library
BeautifulSoup(markup, "html.parser")
lxml HTML parser
BeautifulSoup(markup, "lxml")
lxml XML parser
BeautifulSoup(markup, "lxml-xml")
BeautifulSoup(markup, "xml")
html5lib
BeautifulSoup(markup, "html5lib")
html5lib is the most lenient of these: it parses a page the same way a web browser does.

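Take this fragment of markup:
BeautifulSoup("<a><b /></a>")

Parsed as HTML
The empty-element <b /> is not valid HTML, so the parser turns it into a <b></b> pair:
<html><head></head><body><a><b></b></a></body></html>

Parsed as XML
Nothing is added to the markup, <b/> stays an empty-element tag, and an XML declaration is prepended:
<?xml version="1.0" encoding="utf-8"?>
<a><b/></a>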

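The HTML parsers also differ from one another. A well-formed HTML document parses the same way in all of them; invalid markup produces slightly different trees.

lxml
Drops the stray </p> end tag and closes the unclosed <a> tag:
BeautifulSoup("<a></p>", "lxml")
# <html><body><a></a></body></html>

html5lib
Keeps the stray </p> by pairing it with an opening <p>, and adds the <head> tag as well:
BeautifulSoup("<a></p>", "html5lib")
# <html><head></head><body><a><p></p></a></body></html>

Python built-in parser (html.parser)
Drops the stray </p> and does not add <html> or <body>:
BeautifulSoup("<a></p>", "html.parser")
# <a></a>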
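
To compare the parsers on one machine, a small loop is enough. This is only a sketch: lxml and html5lib are optional third-party packages, so the code assumes they may be missing and records None for any parser that is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<a></p>"  # invalid markup: a stray </p> and an unclosed <a>
results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # Raised by bs4 when the requested parser is not installed.
        results[parser] = None

for parser, output in results.items():
    print(f"{parser:12} {output}")
```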

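Encodings

Whatever the original encoding, the parsed document is held as Unicode.
The .original_encoding attribute of the BeautifulSoup object records the encoding that was detected.

Automatic detection can be slow or can guess wrong, so if you already know the encoding, pass it as from_encoding:

soup = BeautifulSoup(markup, from_encoding="iso-8859-8")

If you only know which encodings the document is not in, rule them out with exclude_encodings:

soup = BeautifulSoup(markup, exclude_encodings=["ISO-8859-7"])

If bs4 has to substitute the Unicode replacement character � for bytes it cannot decode, it sets .contains_replacement_characters to True. If the parsed document contains � but .contains_replacement_characters is False, the � was already in the original document.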
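
A minimal sketch of from_encoding, assuming a short Latin-1 byte string as the input (the sample text is made up for illustration) and the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Bytes encoded as Latin-1; bs4 only needs to detect encodings for bytes input.
markup = "<h1>Sacré bleu!</h1>".encode("latin-1")

soup = BeautifulSoup(markup, "html.parser", from_encoding="latin-1")
heading = soup.h1.string

print(heading)                 # the text, decoded to Unicode
print(soup.original_encoding)  # the encoding bs4 settled on
```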

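Output encoding

Documents written out by bs4 are UTF-8 unless you ask for another encoding, whatever the input encoding was, and the charset in the <meta> tag is rewritten to match.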

print(soup.prettify())
# <html>
#  <head>
#   <meta content="text/html; charset=utf-8" http-equiv="Content-type" />
#  </head>
#  <body>
#   <p>
#    Sacré bleu!
#   </p>
#  </body>
# </html>

Pass an encoding to prettify() to choose a different one:
print(soup.prettify("latin-1"))

encode() works on any node:
soup.p.encode("latin-1")
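
A short sketch of what encode() returns for different target encodings. The sample markup is an assumption; the snowman character is there to show what happens to a character the target encoding cannot represent.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Sacré bleu!</p><b>\N{SNOWMAN}</b>", "html.parser")

as_utf8 = soup.p.encode("utf-8")      # b'<p>Sacr\xc3\xa9 bleu!</p>'
as_latin1 = soup.p.encode("latin-1")  # b'<p>Sacr\xe9 bleu!</p>'
# The snowman has no ASCII form, so it becomes a numeric character reference.
as_ascii = soup.b.encode("ascii")     # b'<b>&#9731;</b>'

print(as_utf8, as_latin1, as_ascii)
```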

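Automatic encoding detection

The encoding detector that ships with bs4 can also be used on its own:

from bs4 import UnicodeDammit
dammit = UnicodeDammit(b"Sacr\xc3\xa9 bleu!")  # pass bytes; a str is already decoded
print(dammit.unicode_markup)
# Sacré bleu!
dammit.original_encoding
# 'utf-8'

Detection is much more accurate when chardet or cchardet is installed.
You can also pass a list of likely encodings; these are tried first:

dammit = UnicodeDammit(b"Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
print(dammit.unicode_markup)
# Sacré bleu!
dammit.original_encoding
# 'latin-1'

UnicodeDammit.detwingle()

UnicodeDammit.detwingle() handles exactly one case: a document that mixes UTF-8 with Windows-1252. It converts the Windows-1252 parts so that the whole document is UTF-8.
new_doc = UnicodeDammit.detwingle(doc)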
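
A sketch of detwingle() on a deliberately mixed byte string, modeled on the example in the bs4 documentation. The variable names snowmen and quote are only for illustration.

```python
from bs4 import UnicodeDammit

snowmen = "\N{SNOWMAN}" * 3
quote = "\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}"

# One part is UTF-8, the other Windows-1252, so no single codec decodes it all.
doc = snowmen.encode("utf8") + quote.encode("windows_1252")

new_doc = UnicodeDammit.detwingle(doc)  # still bytes, now pure UTF-8
print(new_doc.decode("utf8"))
```

Call detwingle() before handing the bytes to BeautifulSoup or UnicodeDammit; otherwise the mixed document is decoded with a single guessed encoding.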

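Kinds of objects

bs4 has four kinds of objects:
Tag
NavigableString
BeautifulSoup
Comment

NavigableString

bs4 uses NavigableString for the text inside a tag. A string cannot be edited in place, but it can be replaced with another one:

tag.string.replace_with("No longer bold")
tag
# <blockquote>No longer bold</blockquote>

To use the text outside bs4, convert it with str(tag.string) (unicode(tag.string) on Python 2). The plain string no longer holds a reference to the parse tree, so the tree's memory can be released.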
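
A small sketch of both points, assuming a one-tag document: the string is replaced rather than edited, and str() yields a plain Python string detached from the tree.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<blockquote>Extremely bold</blockquote>", "html.parser")
tag = soup.blockquote
was_navigable = isinstance(tag.string, NavigableString)

tag.string.replace_with("No longer bold")  # swap the whole string
plain = str(tag.string)                    # plain str, no link back to the tree

print(tag, type(plain).__name__)
```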

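BeautifulSoup

The BeautifulSoup object represents the whole document. It behaves much like a Tag, but it has no attributes and no real tag name; its .name is the special value "[document]".

Comment

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "html.parser")
comment = soup.b.string

comment
# 'Hey, buddy. Want to buy a used parser?'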
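
To see all four object types side by side, the sketch below parses a tag that holds a comment followed by ordinary text (made-up markup) and records the class of each object.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

soup = BeautifulSoup("<b><!--a comment-->plain text</b>", "html.parser")
comment, text = soup.b.contents

kinds = [type(obj).__name__ for obj in (soup, soup.b, comment, text)]
# Comment is a special kind of NavigableString.
comment_is_string = isinstance(comment, NavigableString)

print(kinds, soup.name, comment_is_string)
```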

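Navigating the tree

Dot notation reaches a tag by name, but only the first tag with that name.
soup.find_all('a') returns every matching tag.

.contents

A tag's children are available as a list through .contents:
head_tag.contents
title_tag = head_tag.contents[0]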

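.children

The .children attribute is an iterator over the same children, for use in a loop.
.contents and .children cover direct children only.

.descendants

The .descendants attribute is a generator over all descendants: the children, their children, and so on.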
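
The difference between direct children and descendants is easy to check on a two-level fragment. A sketch, assuming a <head> that contains only a <title>:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>", "html.parser"
)
head_tag = soup.head

direct = head_tag.contents                            # list of direct children
child_names = [child.name for child in head_tag.children]
everything = list(head_tag.descendants)               # the <title> tag, then its string

print(len(direct), child_names, len(everything))
```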

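.string

When a tag has a single child that is a string, .string returns it; when the single child is another tag, .string returns that tag's .string.
When a tag has more than one child, .string is None.

.strings and .stripped_strings

When a tag contains several strings, iterate over them with .strings.
.stripped_strings strips leading and trailing whitespace from each string and skips strings that are only whitespace: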

for string in soup.stripped_strings:
    print(repr(string))
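
A sketch contrasting the three attributes on one paragraph with mixed content (the markup is made up for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>  Once <b>upon</b> a time\n</p>", "html.parser")
p_tag = soup.p

single = p_tag.string                   # None: <p> has three children
raw = list(p_tag.strings)               # strings exactly as they appear
clean = list(p_tag.stripped_strings)    # surrounding whitespace removed

print(single, raw, clean)
```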

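Parents

.parent returns the enclosing tag.
.parents is a generator over all ancestors, from the closest outward.

Siblings

.next_sibling
.previous_sibling
As in any tree, the first sibling has no previous_sibling and the last sibling has no next_sibling.
In real documents these attributes often return a newline or whitespace string, because that is what sits between tags, so they are awkward to use directly.
.next_siblings
.previous_siblings
These iterate over all following or preceding siblings.

.next_element and .previous_element

.next_element points to whatever was parsed immediately afterwards, which is not necessarily the next sibling: if the current tag contains a string, that string is its next element.
.next_elements
.previous_elements
These two iterators replay the document in the order it was parsed, forwards or backwards.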
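
The sketch below, on a made-up paragraph, shows the whitespace problem with .next_sibling and how .next_element differs from it.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p><b>bold</b>\n<i>italic</i></p>", "html.parser")
b_tag = soup.b

parent_name = b_tag.parent.name   # the enclosing <p>
before = b_tag.previous_sibling   # None: <b> is the first child
after = b_tag.next_sibling        # the newline between the tags, not <i>
# Skip whitespace strings by keeping only Tag siblings.
tag_siblings = [s.name for s in b_tag.next_siblings if isinstance(s, Tag)]
inside = b_tag.next_element       # the string inside <b>, parsed right after it

print(parent_name, before, repr(after), tag_siblings, inside)
```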

Searching the tree

Kinds of filters

1. A string

soup.find_all('b')

2. A regular expression

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

3. A list

soup.find_all(["a", "b"])

4. True
True matches every tag, but no text strings.

for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p

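5. A function
find() and find_all() accept a function, either a named function or a lambda, that decides which tags or values match.
Matching tags with a function

The search visits every tag and calls the function on it. If the function returns True the tag is a match; if it returns False the search moves on to the next tag.
For example, this function selects tags that have a class attribute but no id: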
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)

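Matching on an attribute

When the function is passed as the value of an attribute keyword, it receives that attribute's value (None when the tag lacks the attribute) instead of the tag:
def not_lacie(href):
    return href and not re.compile("lacie").search(href)
soup.find_all(href=not_lacie)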
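
All five filter kinds can be tried against one small document. A sketch, assuming a cut-down version of the "three sisters" sample; names() is a hypothetical helper that just collects tag names.

```python
import re
from bs4 import BeautifulSoup

html_doc = """<html><head><title>Story</title></head><body>
<p class="title"><b>Story</b></p>
<p class="story">
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
</p></body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

def names(tags):
    """Hypothetical helper: list the tag names of a result set."""
    return [tag.name for tag in tags]

def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")

def not_lacie(href):
    return href and not re.compile("lacie").search(href)

by_string = names(soup.find_all("b"))
by_regex = names(soup.find_all(re.compile("^b")))   # names starting with "b"
by_list = names(soup.find_all(["a", "b"]))
by_true = names(soup.find_all(True))                # every tag, no strings
by_function = names(soup.find_all(has_class_but_no_id))
by_attr_function = [a["id"] for a in soup.find_all(href=not_lacie)]

print(by_regex, by_list, by_function, by_attr_function)
```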

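find_all()

find_all() searches all the descendants of the current tag and returns those that match the filters.

name argument
The name argument restricts the search to tags with that name:
soup.find_all("title")
Keyword arguments
A keyword argument filters on the attribute of that name; its value can be a string, a regular expression, a list, a function or True.

String: soup.find_all(id='link2') finds every tag whose id attribute is link2.
Regular expression: soup.find_all(href=re.compile("elsie"))
Several keywords at once: soup.find_all(href=re.compile("elsie"), id='link1')
attrs argument: attribute names that are not valid Python keyword names, such as HTML5 data-* attributes, go in the attrs dictionary: data_soup.find_all(attrs={"data-foo": "value"})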

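Searching by CSS class

Because class is a reserved word in Python, the keyword is class_. It also accepts a regular expression or a function: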
soup.find_all(class_=re.compile("itl"))
# [<p class="title"><b>The Dormouse's story</b></p>]

def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

soup.find_all(class_=has_six_characters)

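string argument
With string, the search matches text strings instead of tags, and the results are strings.

def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    # s is each NavigableString in the document, passed in one at a time
    return (s == s.parent.string)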

soup.find_all(string=is_the_only_string_within_a_tag)

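string can also help search for tags
Although string is meant for matching text, it can be combined with arguments that select tags. Beautiful Soup then returns the tags whose .string matches the string value.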

soup.find_all("a", string="Elsie")

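limit argument
limit caps the number of results; the search stops once that many have been found.

soup.find_all("a", limit=2)

recursive argument: pass recursive=False to search direct children only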

soup.html.find_all("title", recursive=False)
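
The find_all() arguments above can be exercised together. A sketch on a made-up document; the data-foo attribute and the ids are only sample values.

```python
import re
from bs4 import BeautifulSoup

html_doc = """<html><head><title>Story</title></head><body>
<p class="title"><b>Story</b></p>
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<div data-foo="value">foo!</div>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

by_id = soup.find_all(id="link2")
by_href = soup.find_all(href=re.compile("elsie"))
by_data = soup.find_all(attrs={"data-foo": "value"})  # not a valid keyword name
by_class = soup.find_all(class_=re.compile("itl"))    # matches class="title"
strings = soup.find_all(string="Elsie")               # returns strings
tags = soup.find_all("a", string="Elsie")             # returns tags
limited = soup.find_all("a", limit=1)
# <title> is a grandchild of <html>, so a non-recursive search misses it.
shallow = soup.html.find_all("title", recursive=False)

print(by_id, by_data, strings, tags, shallow)
```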

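Shorthand for find_all()
Calling a BeautifulSoup object or a Tag is the same as calling find_all() on it, so each pair below is equivalent: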

soup.find_all("a")
soup("a")

soup.title.find_all(string=True)
soup.title(string=True)

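find() versus find_all()
find() returns the first match, or None when nothing matches.
find_all() returns a list of all matches, or an empty list when nothing matches.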
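
A brief sketch of the two return conventions and of the call shorthand, on a made-up fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><a>one</a><a>two</a></p>", "html.parser")

first = soup.find("a")              # first match only
all_links = soup.find_all("a")      # every match, as a list
missing_one = soup.find("table")    # None when nothing matches
missing_all = soup.find_all("table")  # empty list when nothing matches
same = soup("a") == soup.find_all("a")  # calling the object is find_all()

print(first, len(all_links), missing_one, missing_all, same)
```

Because find() can return None, chaining attribute access onto it (for example soup.find("table").name) raises AttributeError when the tag is absent, so check the result first.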

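find_parents() and find_parent()

find_next_siblings() and find_next_sibling()

find_next_siblings() returns all following siblings that match; find_next_sibling() returns only the first following sibling that matches.
find_previous_siblings() returns all preceding siblings that match; find_previous_sibling() returns only the first preceding sibling that matches.

find_next() and find_all_next()

find_next() walks the elements that come after the current one in the document and returns the first that matches.
find_all_next() returns all that match.
find_previous() and find_all_previous() do the same in the backward direction.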
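
These directional searches are easiest to tell apart on a document where siblings and "later elements" differ. A sketch on made-up markup, in which the third <a> comes later in the document but is not a sibling of the first two:

```python
from bs4 import BeautifulSoup

markup = (
    '<div id="outer"><p><a id="x">1</a><a id="y">2</a></p></div>'
    '<a id="z">3</a>'
)
soup = BeautifulSoup(markup, "html.parser")
x = soup.find(id="x")
z = soup.find(id="z")

parent_div = x.find_parent("div")["id"]
ancestors = [t.name for t in x.find_parents(["p", "div"])]  # closest first
next_sib = x.find_next_sibling("a")["id"]
next_sibs = [t["id"] for t in x.find_next_siblings("a")]    # siblings only
all_next = [t["id"] for t in x.find_all_next("a")]          # anything later
prev = z.find_previous("a")["id"]

print(parent_div, ancestors, next_sibs, all_next, prev)
```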

CSS selectors

Find tags beneath other tags, at any depth

soup.select("body a")

Find direct children

soup.select("head > title")

Find tags by whether an attribute exists, or by its value

soup.select('a[href]')
soup.select('a[href$="tillie"]')
soup.select('a[href*=".com/el"]')
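
A runnable sketch of these selectors, assuming a cut-down "three sisters" document. select() relies on the soupsieve package, which is normally installed together with bs4.

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>Story</title></head><body>
<p class="story">
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
<a href="http://example.com/tillie" id="link3">Tillie</a>
</p></body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

def ids(tags):
    """Hypothetical helper: collect the id attribute of each tag."""
    return [tag["id"] for tag in tags]

descendants = ids(soup.select("body a"))             # any depth below <body>
direct = [t.name for t in soup.select("head > title")]
has_href = ids(soup.select("a[href]"))               # attribute exists
ends_with = ids(soup.select('a[href$="tillie"]'))    # value ends with
contains = ids(soup.select('a[href*=".com/el"]'))    # value contains
first_only = soup.select_one("p > a")["id"]          # first match, like find()

print(descendants, direct, ends_with, contains, first_only)
```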

Modifying the tree

Changing .string

tag = soup.a
tag.string = "New link text."

append()

soup = BeautifulSoup("<a>Foo</a>")
soup.a.append("Bar")

soup
# <html><head></head><body><a>FooBar</a></body></html>
soup.a.contents
# ['Foo', 'Bar']

Creating a tag and adding it with append()

soup = BeautifulSoup("<b></b>")
original_tag = soup.b

# Add a new tag inside the existing tag
new_tag = soup.new_tag("a", href="http://www.example.com")
original_tag.append(new_tag)
original_tag
# <b><a href="http://www.example.com"></a></b>

# Give the new tag a string
new_tag.string = "Link text."
original_tag
# <b><a href="http://www.example.com">Link text.</a></b>

clear(): remove a tag's contents

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
tag = soup.a

tag.clear()

extract(): remove a tag or string from the tree and return it

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a

i_tag = soup.i.extract()

decompose(): remove a tag from the tree and destroy it along with its contents
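
clear(), extract() and decompose() all remove something, so a side-by-side sketch may help. It reparses the same markup before each call and uses html.parser so that no <html> or <body> wrapper is added.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'

# clear(): the tag stays, its contents go.
soup = BeautifulSoup(markup, "html.parser")
soup.a.clear()
after_clear = str(soup)

# extract(): the tag leaves the tree and is handed back, still usable.
soup = BeautifulSoup(markup, "html.parser")
i_tag = soup.i.extract()
after_extract = str(soup)

# decompose(): the tag leaves the tree and is destroyed; nothing is returned.
soup = BeautifulSoup(markup, "html.parser")
returned = soup.i.decompose()
after_decompose = str(soup)

print(after_clear, after_extract, i_tag, sep="\n")
```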

replace_with(): replace a tag or string in the tree

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a

new_tag = soup.new_tag("b")
new_tag.string = "example.net"
a_tag.i.replace_with(new_tag)

a_tag
# <a href="http://example.com/">I linked to <b>example.net</b></a>

wrap(): wrap an element in a tag

soup = BeautifulSoup("<p>I wish I was bold.</p>")
soup.p.string.wrap(soup.new_tag("b"))
# <b>I wish I was bold.</b>

soup.p.wrap(soup.new_tag("div"))
# <div><p><b>I wish I was bold.</b></p></div>

unwrap(): replace a tag with its contents

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a

a_tag.i.unwrap()
a_tag
# <a href="http://example.com/">I linked to example.com</a>

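Output

prettify()

Returns the document as a formatted Unicode string, with each tag and each string on its own line. Passing an encoding returns bytes in that encoding instead.

Output formatting

str()

soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.")
str(soup)  # unicode(soup) on Python 2
# '<html><head></head><body>“Dammit!” he said.</body></html>'

When parsing, bs4 converts HTML entities such as &ldquo; into the corresponding Unicode characters.

Converting the document back to a string keeps those Unicode characters; the original entities are not restored.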

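get_text()

get_text() returns all the text in a document or beneath a tag as a single string.
Specify a separator to place between the strings:

soup.get_text("|")
# '\nI linked to |example.com|\n'

Strip whitespace from the beginning and end of each string:

soup.get_text("|", strip=True)
# 'I linked to|example.com'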
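
The outputs above come from a link whose text is wrapped in newlines. A self-contained sketch that reproduces them, assuming that markup and the built-in html.parser:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

plain = soup.get_text()                    # strings joined with nothing between
separated = soup.get_text("|")             # "|" between adjacent strings
stripped = soup.get_text("|", strip=True)  # each string stripped, empties dropped

print(repr(plain), repr(separated), repr(stripped))
```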
