Installing BeautifulSoup
In a terminal, run: pip install beautifulsoup4
Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers. One of them is lxml:
pip install lxml
Another option is html5lib, a parser written in pure Python that parses pages the same way a web browser does. It can be installed with:
pip install html5lib
| Parser | Usage |
| --- | --- |
| Python standard library | BeautifulSoup(markup, "html.parser") |
| lxml HTML parser | BeautifulSoup(markup, "lxml") |
| lxml XML parser | BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml") |
| html5lib | BeautifulSoup(markup, "html5lib") |
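Since lxml and html5lib are optional installs, it can be handy to fall back to the built-in parser when they are missing. The following is a minimal sketch under that assumption; the helper name `make_soup` and its `preferred` parameter are hypothetical and not part of Beautiful Soup. It relies on `FeatureNotFound`, the exception bs4 raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse markup with the first installed parser."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # The requested parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>Hello, <b>world</b></p>")
print(used)           # name of the parser that was actually used
print(soup.b.string)  # world
```

Because "html.parser" ships with Python, the last entry in `preferred` should always succeed, so the `RuntimeError` is only a safety net.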
=====================================================
Importing the module:
from bs4 import BeautifulSoup
======================================
Using the module
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.a.name)                # a
print(soup.a.parent.name)         # p
print(soup.a.parent.parent.name)  # body
print(soup.title)                 # <title>The Dormouse's story</title>
print(soup.title.string)          # The Dormouse's story
print(soup.p.string)              # The Dormouse's story
print(soup.p['class'])            # ['title']
print(soup.find_all('a'))         # prints a list of all <a> tags
print(soup.find(id='link2'))      # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
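One detail worth illustrating about `.string` as used above: it only returns text when a tag has a single string inside it (directly or through a single nested tag). For a tag with several children it returns None, and `get_text()` is the usual way to get the combined text. The snippet below is a small sketch using a shortened, made-up version of the sample document.

```python
from bs4 import BeautifulSoup

# Shortened sample markup, written for this illustration only.
snippet = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story">Once upon a time '
    '<a href="http://example.com/elsie" id="link1">Elsie</a>'
    ' lived in a well.</p>'
)

soup = BeautifulSoup(snippet, "html.parser")
title_p, story_p = soup.find_all("p")

title_string = title_p.string    # single nested string, so text is returned
story_string = story_p.string    # several children, so this is None
story_text = story_p.get_text()  # concatenates all text inside the tag

print(title_string)  # The Dormouse's story
print(story_string)  # None
print(story_text)    # Once upon a time Elsie lived in a well.
```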
Find the links of all <a> tags in the document:
for i in soup.find_all('a'):
    print(i.get('href'))
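Building on the loop above, here is a hedged sketch that collects each link's text together with its URL. It passes `href=True` to `find_all` so that anchors without an href attribute are skipped. The sample markup, including the extra anchor without an href, is made up for this illustration.

```python
from bs4 import BeautifulSoup

# Sample markup for illustration; the first anchor has no href on purpose.
html_doc = """
<p class="story">
<a id="top">no link here</a>
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Map link text to URL, considering only <a> tags that carry an href.
links = {
    a.get_text(strip=True): a["href"]
    for a in soup.find_all("a", href=True)
}

for text, url in links.items():
    print(text, url)
```

Using `a["href"]` is safe here only because of the `href=True` filter; without it, `a.get("href")` as in the loop above avoids a KeyError on anchors that lack the attribute.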
