Beautiful Soup Documentation (Chinese Translation)

Full documentation: http://www.pylong.com/bsoup/index.php
Thanks 小龙
Also available at: http://www.crummy.com/software/BeautifulSoup/documentation.zh.html
Thanks Leonard.

Original by Leonard Richardson (leonardr@segfault.org)
Translated by Richie Yan (richieyan@gmail.com)

Click here for the English original.

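Beautiful Soup is an HTML/XML parser written in Python. It copes well with malformed markup and builds a parse tree from it. It provides simple, commonly used operations for navigating, searching, and modifying the parse tree, which can save you a great deal of programming time. For Ruby, use Rubyful Soup.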

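This document describes the main features of Beautiful Soup 3.0, with examples. It shows what the library is good for, how it works, how to make it do what you want, and what to do when it does not behave as you expect.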

findNextSiblings(name, attrs, text, limit, **kwargs) and findNextSibling(name, attrs, text, **kwargs)
findPreviousSiblings(name, attrs, text, limit, **kwargs) and findPreviousSibling(name, attrs, text, **kwargs)
findAllNext(name, attrs, text, limit, **kwargs) and findNext(name, attrs, text, **kwargs)
findAllPrevious(name, attrs, text, limit, **kwargs) and findPrevious(name, attrs, text, **kwargs)

Modifying the Parse Tree

Changing attribute values
Removing elements
Replacing elements
Adding new elements

Troubleshooting

Why can't Beautiful Soup print my non-ASCII characters?
Beautiful Soup loses the data I fed it! Why? WHY?????
Beautiful Soup is too slow!

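Advanced Topics

Generators
Other built-in parsers
Customizing the parser
Entity conversion
Sanitizing bad data with regular expressions
Fun with SoupStrainers
Improving performance by parsing only part of the document
Improving memory usage with extract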

See Also

Applications that use Beautiful Soup
Similar libraries

Conclusion

Quick Start

Get Beautiful Soup here. The changelog describes the differences between version 3.0 and earlier versions.

Import the Beautiful Soup library into your program:

from BeautifulSoup import BeautifulSoup          # For processing HTML
from BeautifulSoup import BeautifulStoneSoup     # For processing XML
import BeautifulSoup                             # To get everything

The code below demonstrates the basic features of Beautiful Soup. You can copy and paste it into a Python file and run it yourself.

from BeautifulSoup import BeautifulSoup
import re 
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc)) 
print soup.prettify()
#<html>
#<head>
#<title>
#Page title
#</title>
#</head>
#<body>
#<p id="firstpara" align="center">
#This is paragraph
#<b>
#one
#</b>
#.
#</p>
#<p id="secondpara" align="blah">
#This is paragraph
#<b>
#two
#</b>
#.
#</p>
#</body>
#</html>

Some ways to navigate the soup:

soup.contents[0].name
#u'html' 
soup.contents[0].contents[0].name
#u'head' 
head = soup.contents[0].contents[0]
head.parent.name
#u'html' 
head.next
#<title>Page title</title> 
head.nextSibling.name
#u'body' 
head.nextSibling.contents[0]
#<p id="firstpara" align="center">This is paragraph <b>one</b>.</p> 
head.nextSibling.contents[0].nextSibling
#<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
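
The difference between .next and .nextSibling in the listing above is easy to miss: .next follows the order in which the parser processed the document, so from head it reaches the title tag, while .nextSibling stays on the same level and reaches body. The toy model below is only a sketch of that idea, written for Python 3. The Node class and its attribute names are hypothetical and are not Beautiful Soup's implementation.

```python
class Node:
    """Toy stand-in for a parse-tree element (hypothetical, not the library's class)."""

    def __init__(self, name, children=()):
        self.name = name
        self.parent = None
        self.contents = list(children)
        for child in self.contents:
            child.parent = self

    @property
    def next_sibling(self):
        # The element that follows this one under the same parent, if any.
        if self.parent is None:
            return None
        siblings = self.parent.contents
        index = siblings.index(self)
        return siblings[index + 1] if index + 1 < len(siblings) else None

    @property
    def next_element(self):
        # Next node in document order: the first child if there is one,
        # otherwise the nearest following sibling found while walking up.
        if self.contents:
            return self.contents[0]
        node = self
        while node is not None:
            if node.next_sibling is not None:
                return node.next_sibling
            node = node.parent
        return None


# Build the same shape as the quick-start document.
title = Node("title")
head = Node("head", [title])
body = Node("body", [Node("p"), Node("p")])
html = Node("html", [head, body])

print(head.parent.name)         # html
print(head.next_element.name)   # title
print(head.next_sibling.name)   # body
print(title.next_element.name)  # body
```

In this model a leaf such as title has no sibling, so its next element is found by climbing to head and taking head's sibling.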

Here are some ways to search the soup for particular tags, or for tags with particular attributes:

titleTag = soup.html.head.title
titleTag
#<title>Page title</title> 
titleTag.string
#u'Page title' 
len(soup('p'))
#2 
soup.findAll('p', align="center")
#[<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>] 
soup.find('p', align="center")
#<p id="firstpara" align="center">This is paragraph <b>one</b>.</p> 
soup('p', align="center")[0]['id']
#u'firstpara' 
soup.find('p', align=re.compile('^b.*'))['id']
#u'secondpara' 
soup.find('p').b.string
#u'one' 
soup('p')[1].b.string
#u'two'
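
The listings in this section use the Beautiful Soup 3 API under Python 2. As a rough sketch only, the same searches might look like this under Python 3, assuming the successor package bs4 is installed (an assumption; the original text does not cover it) and using its find_all spelling. The paragraphs are closed explicitly here so that the result does not depend on how a given parser repairs the markup, and the block skips the searches if bs4 cannot be imported.

```python
import re

try:
    from bs4 import BeautifulSoup  # assumption: the bs4 package is installed
except ImportError:
    BeautifulSoup = None

DOC = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    '</body></html>'
)


def search_examples(markup):
    """Run the quick-start searches and return the results as a dict."""
    soup = BeautifulSoup(markup, "html.parser")
    return {
        "title": str(soup.title.string),
        "paragraph_count": len(soup.find_all("p")),
        "centered_id": soup.find("p", align="center")["id"],
        # A compiled regular expression is tested against the attribute value.
        "b_aligned_id": soup.find("p", align=re.compile("^b"))["id"],
        "second_bold": str(soup.find_all("p")[1].b.string),
    }


# Only run the searches when bs4 could be imported.
results = search_examples(DOC) if BeautifulSoup is not None else None
print(results)
```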

Modifying the soup is also easy:

titleTag['id'] = 'theTitle'
titleTag.contents[0].replaceWith("New title")
soup.html.head
#<head><title id="theTitle">New title</title></head> 
soup.p.extract()
soup.prettify()
#<html>
#<head>
#<title id="theTitle">
#New title
#</title>
#</head>
#<body>
#<p id="secondpara" align="blah">
#This is paragraph
#<b>
#two
#</b>
#.
#</p>
#</body>
#</html> 
soup.p.replaceWith(soup.b)
#<html>
#<head>
#<title id="theTitle">
#New title
#</title>
#</head>
#<body>
#<b>
#two
#</b>
#</body>
#</html> 
soup.body.insert(0, "This page used to have ")
soup.body.insert(2, " &lt;p&gt; tags!")
soup.body
#<body>This page used to have <b>two</b> &lt;p&gt; tags!</body>
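
The two insert calls above address positions among the children of body: index 0 is before the existing b tag, and index 2 is just past the two children present at that point, so the second string ends up last. As an analogy only, a plain Python list behaves the same way. The list below is a hypothetical stand-in for the tag's children, not a Beautiful Soup object.

```python
# Hypothetical stand-in for the children of <body> after the replaceWith call.
children = ["<b>two</b>"]

children.insert(0, "This page used to have ")  # goes before the <b> tag
children.insert(2, " &lt;p&gt; tags!")         # index 2 is the end, so this appends

print("".join(children))
# This page used to have <b>two</b> &lt;p&gt; tags!
```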

Here is a real-world example. It fetches the ICC Commercial Crime Services weekly piracy report page, parses it with Beautiful Soup, and pulls out the piracy incidents:

import urllib2
from BeautifulSoup import BeautifulSoup 
page = urllib2.urlopen("http://www.icc-ccs.org/prc/piracyreport.php")
soup = BeautifulSoup(page)
for incident in soup('td', width="90%"):
    where, linebreak, what = incident.contents[:3]
    print where.strip()
    print what.strip()
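
The loop above relies on each matching cell having at least three children which, as the variable names suggest, are a location string, a line break, and a description string. Below is a minimal sketch of that unpacking step using plain Python 3 lists. The split_incident helper and the cell contents are made up for illustration and are not data from the report.

```python
def split_incident(contents):
    """Return (where, what) from a cell's children, or None if the cell is too short."""
    if len(contents) < 3:
        # Unpacking three names from fewer items would raise ValueError.
        return None
    where, _linebreak, what = contents[:3]
    return where.strip(), what.strip()


# Placeholder children: text, a stand-in for the line-break tag, more text.
cell = ["  Example location \n", "<br />", "\n Example description of the incident. "]

print(split_incident(cell))
print(split_incident(["only one child"]))
```

Checking the length first is one way to skip cells that do not follow the expected layout instead of stopping the whole loop.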

posted @ 2011-07-14 20:08 westfly
