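Beautiful Soup is a commonly used companion tool that helps Python programs make sense of the messy, "mostly HTML" content found on websites. It is an HTML/XML parser written in Python that copes well with malformed markup and builds a parse tree from it.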
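As a small illustration of what "copes with malformed markup and builds a parse tree" means, here is a minimal sketch. It assumes Python 3 with the `beautifulsoup4` package installed (imported as `bs4`) and the standard-library `html.parser` backend; the `sloppy` input string is made up for the example.

```python
from bs4 import BeautifulSoup  # assumption: the beautifulsoup4 package is installed

# Deliberately sloppy input: unquoted attribute value, unclosed <b> and <p> tags
sloppy = '<p class=note>Hello <b>world'

soup = BeautifulSoup(sloppy, "html.parser")

p = soup.find("p")      # first <p> node in the parse tree
print(p["class"])       # class is a multi-valued attribute, so this is a list
print(p.b.string)       # text of the <b> tag nested inside <p>
print(p.get_text())     # all text found under <p>
```

Even though the input never closes its tags, the result can still be navigated as a tree of nested nodes.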
Using Beautiful Soup to produce tidy data from messy content
    # Python 2 code using the BeautifulSoup 3 package
    from glob import glob
    from BeautifulSoup import BeautifulSoup

    def process():
        print "!MOVIE,DIRECTOR,KEY_GRIP,THE_MOOSE"
        for fname in glob('result_*'):
            # Put that sloppy HTML into the soup
            soup = BeautifulSoup(open(fname))

            # Try to find the fields we want, but default to unknown values
            try:
                # [1] selects the second matching element
                movie = soup.findAll('span', {'class':'movie_title'})[1].contents[0]
            except IndexError:
                movie = "UNKNOWN"

            try:
                director = soup.findAll('div', {'class':'director'})[1].contents[0]
            except IndexError:
                director = "UNKNOWN"

            try:
                # Maybe multiple grips listed, key one should be in there
                grips = soup.findAll('p', {'id':'grip'})[0].contents[0]
                grips = " ".join(grips.split())   # Normalize extra spaces
            except IndexError:
                grips = "UNKNOWN"

            try:
                # Some values are hidden in the HTML <meta> tags
                moose = soup.findAll('meta', {'name':'shibboleth'})[0]['content']
            except IndexError:
                moose = "UNKNOWN"

            print '"%s","%s","%s","%s"' % (movie, director, grips, moose)
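The listing above targets Python 2 and BeautifulSoup 3 (the `print` statement and `from BeautifulSoup import ...`). For readers on Python 3, the following is a rough sketch of the same idea, assuming the `beautifulsoup4` package is installed. The names `extract_fields`, `to_csv`, and `SAMPLE` are hypothetical and exist only for this example. Unlike the listing, it takes the first match for every field and works on HTML strings rather than `result_*` files. It also uses the standard `csv` module so that double quotes inside a field are escaped properly.

```python
import csv
import io

from bs4 import BeautifulSoup  # assumption: the beautifulsoup4 package is installed

UNKNOWN = "UNKNOWN"


def extract_fields(html):
    """Return (movie, director, grip, moose) from one sloppy HTML page."""
    soup = BeautifulSoup(html, "html.parser")

    def text_of(tag):
        # Collapse runs of whitespace; fall back when the tag is missing
        return " ".join(tag.get_text().split()) if tag else UNKNOWN

    movie = text_of(soup.find("span", class_="movie_title"))
    director = text_of(soup.find("div", class_="director"))
    grip = text_of(soup.find("p", id="grip"))
    # "name" is also find()'s own first parameter, so pass it through attrs
    meta = soup.find("meta", attrs={"name": "shibboleth"})
    moose = meta.get("content", UNKNOWN) if meta else UNKNOWN
    return movie, director, grip, moose


def to_csv(pages):
    """Render an iterable of HTML strings as CSV text, header line first."""
    out = io.StringIO()
    out.write("!MOVIE,DIRECTOR,KEY_GRIP,THE_MOOSE\n")
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for html in pages:
        writer.writerow(extract_fields(html))
    return out.getvalue()


# Made-up sample page: unquoted attributes, an unclosed <p>, no closing </html>
SAMPLE = """<html><head><meta name=shibboleth content="bullwinkle"></head>
<body>
<span class="movie_title">Sample Movie</span>
<div class=director>A.   Director</div>
<p id="grip">  Jane
   "JD"  Doe
</body>
"""

print(to_csv([SAMPLE]), end="")
```

To process files as in the listing, one could read each path returned by `glob('result_*')` and pass the contents to `to_csv`.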
For details, see: http://www.crummy.com/software/BeautifulSoup/documentation.zh.html
The PyQuery library is a similar tool; see http://packages.python.org/pyquery/
