Python Web Scraping --- BeautifulSoup (1)

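BeautifulSoup is a library for parsing scraped HTML; its find and find_all methods are particularly useful. Parsing turns the page into a tree of tag objects, and each tag's attributes can be read much like the key-value pairs of a JSON object, which makes the page content easier and more convenient to work with.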

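Installing the library needs little explanation: with Python's pip available, run pip install beautifulsoup4 in cmd. (The package is published as beautifulsoup4 and imported as bs4.)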
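As a quick sanity check after installing, the snippet below simply imports the package and prints its version. It is a minimal sketch and assumes the install succeeded in the same Python environment you run it from.

```python
# Quick check that Beautiful Soup is importable after installation.
import bs4

# The package exposes its version as a plain string.
print(bs4.__version__)
```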

First, following the official documentation, copy its sample code, as shown below.

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

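# Parse the document with Python's built-in html.parser
soup = BeautifulSoup(html_doc, 'html.parser')

# Print every <a> tag found in the document
print(soup.find_all('a'))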

html_doc stands in for the content we scraped; for convenience, the sample from the documentation is used directly.

We parse html_doc directly, using html.parser, the parser that ships with Python's standard library.
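Note that the sample document never closes its body and html tags, yet parsing still works. The sketch below illustrates the same tolerance on a smaller, made-up fragment (the broken_html string and the note class are hypothetical, not part of the original sample).

```python
from bs4 import BeautifulSoup

# Deliberately incomplete markup: the <p> tag is never closed.
# (Hypothetical fragment used only for illustration.)
broken_html = "<p class='note'>unclosed paragraph"

soup = BeautifulSoup(broken_html, "html.parser")

# A tree is still built, so the tag can be looked up as usual.
paragraph = soup.p
print(paragraph.get_text())
print(paragraph["class"])
```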

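After typing this into Sublime, press Ctrl+B to run it. (I recommend installing the SublimePythonIDE plugin package for Python, which lets you run the script directly without opening cmd.)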

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[Finished in 0.2s]

The output is shown above: every <a> tag in the document was printed.
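To make the difference between find and find_all concrete, here is a small sketch on a shortened copy of the sample markup. The variable names are illustrative only; the calls themselves (find, find_all, and the class_ keyword) are part of Beautiful Soup's search API.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# find returns only the first matching tag, or None when nothing matches.
first_link = soup.find("a")
missing = soup.find("table")

# find_all returns a list-like collection of every match (empty if none).
all_links = soup.find_all("a")
no_tables = soup.find_all("table")

# Matches can also be filtered by CSS class via the class_ keyword.
sisters = soup.find_all("a", class_="sister")

print(first_link["id"], len(all_links), missing, len(no_tables), len(sisters))
```

Because find can return None, it is worth checking the result before indexing into it when a tag may be absent from the page.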

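Following the documentation, we now query soup in other ways and print the results. (The code is pasted directly from the official site and is not explained again here.)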

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# 'title'

soup.title.string
# "The Dormouse's story"

soup.title.parent.name
# 'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# ['title']

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

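As the examples above make clear, the parsed soup can be navigated by tag name: expressions such as soup.title return the matching element, and a tag's attributes can be read by key, much like key-value data. (Lines starting with # show the output.)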
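The key-value style of access can be sketched on a single tag from the sample. This is a minimal illustration; the link variable is just a name chosen here.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
soup = BeautifulSoup(markup, "html.parser")
link = soup.a

# Attributes are read by key, like dictionary entries.
print(link["href"])

# class can hold several values, so it comes back as a list.
print(link["class"])

# get() returns None instead of raising KeyError for a missing attribute.
print(link.get("title"))

# attrs exposes all of the tag's attributes as a dictionary.
print(link.attrs)
```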

There are some other methods as well.

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

A for loop makes it easy to process each child element of a complex parent container. (Lines starting with # show the output.)
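In practice the loop often collects results instead of printing them. The sketch below gathers each link's text and URL into a list; the extra anchor without an href is a hypothetical addition to show how href=True skips such tags.

```python
from bs4 import BeautifulSoup

html = """
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
<a id="no-target">Anchor without href</a>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute.
links = [(a.get_text(), a["href"]) for a in soup.find_all("a", href=True)]

for text, url in links:
    print(text, url)
```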

The last item in the official quick start strips all the markup from the page and shows only the text, as follows.

print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...

This conveniently prints just the text content.
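The raw get_text() output keeps the original line breaks and spacing. If tidier text is wanted, get_text also accepts a separator and a strip flag, and stripped_strings yields the cleaned pieces one by one. A minimal sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with stray whitespace around the text.
html = "<p> Elsie </p><p> Lacie </p>"
soup = BeautifulSoup(html, "html.parser")

# Join the text pieces with a separator and trim whitespace from each one.
joined = soup.get_text(separator="|", strip=True)
print(joined)

# stripped_strings yields the same trimmed pieces individually.
pieces = list(soup.stripped_strings)
print(pieces)
```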

That covers the basic usage of BeautifulSoup.

posted @ 2017-04-04 13:37 Sample1994
