Problems parsing non-standard HTML with BeautifulSoup

The problem:

BeautifulSoup version: 4.3.2

While searching an HTML page with BeautifulSoup.find_all(), I ran into the following markup:

<a href="/shipin/donghuapian/2012-07-25/23404.html"title="谦谦君子" target="_blank">温润如玉</a>

Note that there is no space between the href attribute and the title attribute of the a tag.

Analysis:

I ran the page through diagnose, BeautifulSoup's diagnostic tool (available only in version 4.2 and later):

from bs4.diagnose import diagnose

# Read the page under test and print a diagnostic report for each installed parser.
with open('test.html') as f:
    html_doc = f.read()
diagnose(html_doc)

The report showed that line being parsed as:

<a href="/shipin/donghuapian/2012-07-25/23404.html"> title="谦谦君子" target="_blank"&gt;温润如玉</a>

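See it? The a tag is malformed: title and target have ended up in the text content instead of being parsed as attributes. So when BeautifulSoup.find_all() reaches this line, any match on title fails.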

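The cause is the parser. When no parser is specified and neither lxml nor html5lib is installed, BeautifulSoup falls back to Python's built-in HTML parser, which (at least in older Python versions) is not very tolerant of malformed pages.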
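To see how the built-in parser on a given Python installation handles this tag, it can be fed the snippet directly, without BeautifulSoup. The following is a minimal sketch assuming Python 3 (where the module is html.parser). AttrCollector and SNIPPET are illustrative names, not part of any library. The result depends on the Python version: if title is missing from the printed attribute names, the built-in parser is the culprit.

```python
# Sketch only: AttrCollector and SNIPPET are illustrative names, not from the post.
from html.parser import HTMLParser  # Python 3 module name (assumption)

# The problematic line: no space between the href and title attributes.
SNIPPET = ('<a href="/shipin/donghuapian/2012-07-25/23404.html"'
           'title="谦谦君子" target="_blank">温润如玉</a>')


class AttrCollector(HTMLParser):
    """Record every start tag together with the attributes the parser saw."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.start_tags.append((tag, dict(attrs)))


collector = AttrCollector()
collector.feed(SNIPPET)
collector.close()

for tag, attrs in collector.start_tags:
    # If "title" is absent here, the built-in parser mishandled the tag.
    print(tag, sorted(attrs))
```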

The fix:

Tell BeautifulSoup to use a different HTML parser (the BeautifulSoup documentation describes the available parsers). I chose lxml:

sudo pip install lxml

Then pass one extra argument when creating the BeautifulSoup object:

#coding=utf-8
import re
from bs4 import BeautifulSoup

with open('test.html') as f:
    html_doc = f.read()
soup = BeautifulSoup(html_doc, 'lxml')  # use lxml as the HTML parser
# Match a tags whose title attribute contains "君子"
tags = soup.find_all('a', {'title': re.compile(u'君子')})

And that fixed it.
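If the script has to run on machines where lxml may not be installed, one option is to try several parsers in order of preference. The sketch below is only an illustration under assumptions: it targets Python 3, and make_soup and PREFERRED_PARSERS are hypothetical names, not part of BeautifulSoup. It relies on BeautifulSoup raising FeatureNotFound when the requested parser is unavailable.

```python
# Sketch only: make_soup and PREFERRED_PARSERS are hypothetical helper names.
import re
from bs4 import BeautifulSoup, FeatureNotFound

# Order of preference; 'html.parser' ships with Python, so it is the last resort.
PREFERRED_PARSERS = ('lxml', 'html5lib', 'html.parser')


def make_soup(markup, parsers=PREFERRED_PARSERS):
    """Return (soup, parser_name) using the first parser that is installed."""
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError('no usable HTML parser found')


# The problematic line from the page, with no space before title.
html_doc = ('<a href="/shipin/donghuapian/2012-07-25/23404.html"'
            'title="谦谦君子" target="_blank">温润如玉</a>')

soup, parser_name = make_soup(html_doc)
tags = soup.find_all('a', {'title': re.compile('君子')})

# Report which parser was picked and how many tags matched on title.
print(parser_name, len(tags))
```

Printing the parser name makes it easy to tell which parser produced a given result when the same script behaves differently on two machines.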

posted @ 2013-11-15 11:46 勤劳的天蓬
