Python Crawlers, Part 2: An HTML-Based Crawler Built with the BeautifulSoup Package

Installing BeautifulSoup

    On Python 3, the package to install is BeautifulSoup4.

    It can be installed with pip:
        pip3 install beautifulsoup4

Loading a web page in Jupyter

    Run the following import to check that BeautifulSoup is installed correctly (no error message means it is):
        from bs4 import BeautifulSoup
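As a slightly more informative check than a bare import, the sketch below also prints the installed version. It assumes only that the package installed by pip is importable as bs4; the printed version depends on your environment.

```python
import bs4
from bs4 import BeautifulSoup

# bs4 exposes its version as a string attribute on the module
installed_version = bs4.__version__
print("BeautifulSoup version:", installed_version)
```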

    Enter the HTML of the sample page:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

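    Parse the HTML document with BeautifulSoup:
        soup = BeautifulSoup(html_doc, 'html.parser')
    Here html_doc is the variable that holds the document, defined in the code above, and 'html.parser' names the parser used to parse the page (the HTML parser built into Python). The general form for parsing an HTML document with BeautifulSoup is therefore:

        soup = BeautifulSoup(page_variable, 'html.parser')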
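The general form can be tried on any HTML string. The following is a minimal sketch; sample_html is a hypothetical stand-in document, not the article's html_doc.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in page, kept tiny for illustration
sample_html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# First argument: the markup; second argument: the parser name.
# 'html.parser' is Python's built-in parser, so no extra install is needed.
soup = BeautifulSoup(sample_html, "html.parser")

parsed_type = type(soup).__name__
first_paragraph = soup.p.string
print(parsed_type)
print(first_paragraph)
```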

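    Print the page with soup.prettify():
        print(soup.prettify())
    The soup.prettify() method returns the document as an indented string with each tag on its own line, which makes the page much easier to read when printed.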
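To see the effect of prettify() in isolation, here is a small sketch that compares a one-line document with its prettified form. compact_html is a made-up example string.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document
compact_html = "<html><body><p>Hi</p></body></html>"
soup = BeautifulSoup(compact_html, "html.parser")

# prettify() returns a str; it does not print anything by itself
pretty = soup.prettify()
print(pretty)

# The prettified version spans several lines, the original only one
line_count = len(pretty.splitlines())
print("lines:", line_count)
```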

Basic operations for parsing a page with BeautifulSoup

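    In BeautifulSoup, you navigate the document and call methods through the soup object, in the form soup.…
    soup.title: returns the entire title element, <title>The Dormouse's story</title>

    soup.title.name: returns the name of the title tag, 'title'

    soup.title.string: returns the text inside the tag, "The Dormouse's story"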
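The three title lookups can be run together as below. This is a sketch on a shortened document that keeps only the title from the sample page; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened document: only the <title> from the sample page
html = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title          # the whole <title> element (a Tag object)
title_name = title_tag.name     # the tag's name
title_text = title_tag.string   # the text inside the tag

print(title_tag)
print(title_name)
print(title_text)

# Accessing a tag that does not exist returns None instead of raising
print(soup.h1)
```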

    soup.find_all('a'): returns a list of all hyperlink (<a>) elements, as follows:
        [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
         <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
         <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
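In a crawler, the list returned by find_all is usually iterated to pull out each link's text and URL. A minimal sketch, using only the three links from the sample page:

```python
from bs4 import BeautifulSoup

# The three sister links from the sample page
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag; get() reads an attribute
# and returns None if the attribute is missing
links = [(a.get_text(), a.get("href")) for a in soup.find_all("a")]

for text, href in links:
    print(text, href)
```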

    soup.find(id="link3"): returns the first element whose id is "link3", as follows:
        <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
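Unlike find_all, find returns a single element, or None when nothing matches, so it is worth checking the result before using it. A short sketch; the id "link9" is a deliberately nonexistent id chosen for illustration.

```python
from bs4 import BeautifulSoup

html = '<p><a href="http://example.com/tillie" class="sister" id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")

link = soup.find(id="link3")      # matches the single <a> tag
missing = soup.find(id="link9")   # no such id in this document

if link is not None:
    # Attributes of a tag can be read like dictionary keys
    print(link["href"], link.get_text())

print(missing)
```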

posted @ 2017-10-24 11:12 LittleGirl_MyBaby
