Python Web Crawling and Information Extraction: Introduction to the Beautiful Soup Library, Part 2

I. Basic Elements of the BeautifulSoup Class

1. Tag name

Every <tag> has a name, obtained with <tag>.name; the value is a string.

>>> import requests
>>> from bs4 import BeautifulSoup
>>> demo=requests.get("http://python123.io/ws/demo.html").text
>>> soup=BeautifulSoup(demo,"html.parser")

>>> soup.a.name
'a'
>>> soup.a.parent.name
'p'
>>> soup.a.parent.parent.name
'body'
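
The same lookups can be tried offline. The following is a minimal sketch that assumes a small inline HTML string (a hypothetical stand-in for the demo page, not the page itself) and collects the names of a tag and its two nearest ancestors.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the demo page so the example runs without a network.
html = '<html><body><p><a href="#" id="link1">Basic Python</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

link = soup.a  # first <a> tag in the document

# .name is a plain string; .parent moves one level up the tag tree.
names = [link.name, link.parent.name, link.parent.parent.name]
print(names)
```

Each `.name` is an ordinary `str`, so it can be compared or used as a dictionary key directly.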

2. Tag attrs (attributes)

A <tag> can have zero or more attributes, stored as a dictionary.

>>> tag=soup.a
>>> tag.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> tag.attrs["class"]
['py1']
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'

>>> type(tag.attrs)
<class 'dict'>
>>> type(tag)
<class 'bs4.element.Tag'>
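
Because attrs is a regular dictionary, a missing key raises KeyError. A minimal sketch, assuming a hypothetical two-tag snippet, shows the safer `Tag.get()` lookup alongside direct indexing.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one tag with attributes, one without.
html = (
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" '
    'id="link1">Basic Python</a><b>no attributes</b>'
)
soup = BeautifulSoup(html, "html.parser")
tag = soup.a

href = tag.attrs["href"]      # direct dictionary access
classes = tag.get("class")    # class is multi-valued, so bs4 returns a list
missing = tag.get("title")    # absent attribute: None instead of KeyError
bold_attrs = soup.b.attrs     # a tag with no attributes has an empty dict

print(href, classes, missing, bold_attrs)
```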

3. Tag NavigableString

The text inside a tag is a NavigableString, obtained with <tag>.string; it can reach across multiple levels of nesting.

>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.string
'Basic Python'
>>> type(soup.a.string)
<class 'bs4.element.NavigableString'>
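
To illustrate what reaching across levels means, here is a small sketch under the assumption of a hypothetical two-paragraph snippet: .string follows a chain of single children down to the text, but returns None when a tag has more than one child.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the first <p> wraps a single <b>, the second has mixed content.
html = "<p><b>The demo python introduces several python courses.</b></p><p>one <i>two</i></p>"
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("p")

nested = first.string        # reaches through <b> to the text
ambiguous = second.string    # more than one child, so .string is None
all_text = second.get_text() # get_text() concatenates every text node instead

print(repr(nested), ambiguous, repr(all_text))
```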

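4. Tag Comment

Comment is a special type: it holds the text of an HTML comment, and .string returns it without the <!-- --> markers.

>>> newsoup=BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>",'html.parser')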
>>> newsoup.b.string
'This is a comment'
>>> type(newsoup.b.string)
<class 'bs4.element.Comment'>
>>> newsoup.p.string
'This is not a comment'
>>> type(newsoup.p.string)
<class 'bs4.element.NavigableString'>
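
Since .string prints a comment and ordinary text the same way, code usually has to check the type. A minimal sketch, assuming markup with one comment and one plain paragraph, and a hypothetical helper named is_comment:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Assumed markup: a comment inside <b>, ordinary text inside <p>.
html = "<b><!--This is a comment--></b><p>This is not a comment</p>"
soup = BeautifulSoup(html, "html.parser")


def is_comment(node):
    """Hypothetical helper: True when the node is an HTML comment."""
    return isinstance(node, Comment)


b_is_comment = is_comment(soup.b.string)
p_is_comment = is_comment(soup.p.string)
print(b_is_comment, p_is_comment)
```

Comment is a subclass of NavigableString, so the check must test for Comment specifically rather than for NavigableString.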

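II. Traversing HTML Content with the bs4 Library

1. Basic HTML structure

<>...</> pairs express containment, which gives the document its tag-tree structure.

2. Downward traversal of the tag tree

Attribute	Description
.contents	List of child nodes; stores all direct children of <tag> in a list
.children	Iterator over child nodes; like .contents, used to loop over direct children
.descendants	Iterator over descendant nodes; covers all children, grandchildren and so on, used for looping

The BeautifulSoup object is the root node of the tag tree.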

>>> soup=BeautifulSoup(demo,'html.parser')
>>> soup.head
<head><title>This is a python demo page</title></head>
>>> soup.head.contents
[<title>This is a python demo page</title>]
>>> soup.body.contents
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
>>> len(soup.body.contents)
5
>>> soup.body.contents[1]
<p class="title"><b>The demo python introduces several python courses.</b></p>

>>> for child in soup.body.children:
...     print(child)  # iterate over the direct children

>>> for child in soup.body.descendants:
...     print(child)  # iterate over all descendants
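
The difference between the three attributes is easier to see on a small tree. This sketch assumes a hypothetical body with two paragraphs and filters out the newline strings, which also count as nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup; the "\n" strings between tags are nodes too.
html = "<body>\n<p><b>bold</b></p>\n<p>plain</p>\n</body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

child_count = len(body.contents)  # direct children, newlines included

# Keep only Tag objects so text nodes are skipped.
child_tags = [c.name for c in body.children if isinstance(c, Tag)]
descendant_tags = [d.name for d in body.descendants if isinstance(d, Tag)]

print(child_count, child_tags, descendant_tags)
```

.children stops at the two <p> tags, while .descendants also reaches the <b> nested inside the first one.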

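3. Upward traversal of the tag tree

Attribute	Description
.parent	The node's parent tag
.parents	Iterator over the node's ancestor tags, used to loop over ancestors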

>>> soup=BeautifulSoup(demo,"html.parser")
>>> soup.title.parent
<head><title>This is a python demo page</title></head>
>>> soup.html.parent
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.parent
>>>

>>> soup=BeautifulSoup(demo,"html.parser")
>>> for parent in soup.a.parents:
...     if parent is None:
...         print(parent)
...     else:
...         print(parent.name)

p
body
html
[document]

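Iterating .parents visits every ancestor, including the soup object itself (reported as [document]), hence the separate check in the loop.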
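
The ancestor chain can also be collected into a list. A minimal sketch, assuming a hypothetical one-link page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a single link nested three levels deep.
html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .parents walks upward until it reaches the BeautifulSoup object.
ancestor_names = [parent.name for parent in soup.a.parents]
root_parent = soup.parent  # the root of the tree has no parent

print(ancestor_names, root_parent)
```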

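4. Sibling traversal of the tag tree

Attribute	Description
.next_sibling	The next sibling node in HTML document order
.previous_sibling	The previous sibling node in HTML document order
.next_siblings	Iterator over all following sibling nodes in HTML document order
.previous_siblings	Iterator over all preceding sibling nodes in HTML document order

Sibling traversal only happens among nodes that share the same parent.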

>>> soup=BeautifulSoup(demo,"html.parser")
>>> soup.a.next_sibling
' and '
>>> soup.a.next_sibling.next_sibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
>>> soup.a.previous_sibling
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'
>>> soup.a.previous_sibling.previous_sibling
>>> soup.a.parent
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>

>>> for sibling in soup.a.next_siblings:
...     print(sibling)  # iterate over the following siblings

>>> for sibling in soup.a.previous_siblings:
...     print(sibling)  # iterate over the preceding siblings
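
Siblings are nodes, not only tags, so text such as ' and ' shows up in the iteration. The sketch below assumes a hypothetical paragraph with two links and shows one way to keep only the sibling tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical paragraph: text, a link, text, a second link, text.
html = '<p>intro <a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

following = list(first.next_siblings)      # strings and tags, in document order
preceding = list(first.previous_siblings)  # only the leading text here

# Keep only Tag siblings and read their id attribute.
following_ids = [s["id"] for s in following if isinstance(s, Tag)]

# find_next_sibling() skips text nodes and returns the next matching tag.
next_link = first.find_next_sibling("a")

print(len(following), following_ids, next_link["id"])
```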

III. Formatted HTML Output with the bs4 Library

1. The prettify() method of the bs4 library

>>> import requests
>>> r=requests.get("http://python123.io/ws/demo.html")
>>> demo=r.text
>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(demo,"html.parser")
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

.prettify() adds a '\n' after each tag and each piece of content in the HTML text.

.prettify() also works on a single tag: <tag>.prettify()

>>> print(soup.a.prettify())
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
 Basic Python
</a>
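
For experimenting without a network connection, here is a minimal sketch that assumes a hypothetical one-line page and compares it with its prettified form.

```python
from bs4 import BeautifulSoup

# Hypothetical single-line page; the real demo page is not needed here.
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

pretty_page = soup.prettify()  # whole document, one tag or text per line
pretty_tag = soup.p.prettify() # the same method on a single tag

print(pretty_page)
print(pretty_tag)
```

prettify() returns an ordinary string, so it is meant for display; parsing should continue to use the soup object.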

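2. Encoding in the bs4 library

The bs4 library converts any HTML input to Unicode and encodes its output as UTF-8 by default.

Python 3.x supports UTF-8 by default, so parsing non-ASCII pages works without trouble.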
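
As an illustration, the sketch below assumes a hypothetical page stored as GBK-encoded bytes; the from_encoding argument tells bs4 how to decode it, and the tree is then written back out as UTF-8.

```python
from bs4 import BeautifulSoup

# Hypothetical input: a small page stored as GBK-encoded bytes.
raw = "<p>中文</p>".encode("gbk")

# from_encoding names the input encoding; without it bs4 tries to detect one.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string               # Unicode text inside the parsed tree
utf8_bytes = soup.encode("utf-8")  # serialize the tree back to UTF-8 bytes

print(text)
print(utf8_bytes.decode("utf-8"))
```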

posted @ 2018-01-19 20:16 扬帆_一点零
