
lmth1: a handy web page information extraction tool written in Python - _Luc_ - cnblogs (博客园)

lmth1: a handy web page information extraction tool
0, Why lmth1?
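Nine out of ten people who play with Python have used urllib, and nine out of ten who scrape data have used BeautifulSoup. I am no exception: I use BeautifulSoup for almost all of my scraping.
BeautifulSoup is quite capable, but its API is clumsy and awkward to use. Compared with that conventional API, I prefer jQuery's fluent API. So I spent two evenings building two libraries on top of BeautifulSoup, lmth and lmth1: lmth provides the basic functionality and handles Hpath parsing; lmth1 provides the fluent API used for scraping.
lmth1's interface is very simple, and its implementation is simpler still: under 300 lines of code. Yet it is powerful. You will soon see how lmth1 does in one line what takes ten lines with BeautifulSoup, and more readably.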
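1, Introduction
As the title says.
Before use, put lmth.py, lmth1.py and beautifulsoup.py into Python's environment directory (a directory on the module search path).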
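
To confirm the three files ended up somewhere Python can see them, a quick check like the one below may help. This is only a sketch: the module names lmth, lmth1 and beautifulsoup are assumed from the file names above, and check_modules is a hypothetical helper, not part of either library.

```python
import importlib.util

# Module names assumed from the file names mentioned in the post.
REQUIRED = ("lmth", "lmth1", "beautifulsoup")


def check_modules(names=REQUIRED):
    """Return {module name: True if Python can locate it on the search path}."""
    return {name: importlib.util.find_spec(name) is not None for name in names}
```

Calling check_modules() returns a dict mapping each name to True or False; any False entry means the corresponding file is not on sys.path yet.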
2, Hpath
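Hpath is an HTML path query expression that I defined, similar to XPath. Its syntax is very simple; a few examples are enough to explain it. If you need a strict definition, see the BNF definition in 2.2.
2.1 Examples
Note that in the examples here, "getting elements" always means the elements found under the target node.
The sample HTML used: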

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" >
<head>
    <title>Untitled Page</title>
</head>
<body>
<h1 id="title">Page list</h1>
<div id="content" class="sites">
    <a href="http://www.google.com/" class="good">Google</a>
    <a href="http://www.yahoo.com/" class="good">Yahoo</a>
    <a href="http://www.baidu.com/" class="asshole">Baidu</a>
    <a href="http://www.bing.com/" class="excellent">Bing</a>
</div>
<div id="tbl">
    <ul>
    <li class="odd">1</li>
    <li class="even">2</li>
    <li class="odd">3</li>
    <li class="even">4</li>
    <li class="odd">5</li>
    <li class="even">6</li>
    </ul>
</div>
</body>
</html>

2.1.1 Basic expressions

li
Purpose: get all li elements
Result:
[
     <li class="odd">1</li>,
     <li class="even">2</li>,
     <li class="odd">3</li>,
     <li class="even">4</li>,
     <li class="odd">5</li>,
     <li class="even">6</li>
]

div[id=tbl]
Purpose: get all div elements whose id attribute is tbl
Tip: use attribute filters to make the lookup more precise
Result:
<div id="tbl">
<ul>
<li class="odd">1</li>
<li class="even">2</li>
<li class="odd">3</li>
<li class="even">4</li>
<li class="odd">5</li>
<li class="even">6</li>
</ul>
</div>

div[id=content, class=sites]
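Purpose: get all div elements whose id attribute is content and whose class attribute is sites
Tip: you can set several attribute values at once; separate the attribute pairs with commas
Result: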
<div id="content" class="sites">
<a href="http://www.google.com/" class="good">Google</a>
<a href="http://www.yahoo.com/" class="good">Yahoo</a>
<a href="http://www.baidu.com/" class="asshole">Baidu</a>
<a href="http://www.bing.com/" class="excellent">Bing</a>
</div>

div[@id]
Purpose: get the id attribute value of every div element
Tip: put an @ sign in front of the attribute whose value you want
Result:
[
     'content',
     'tbl'
]
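
The bracket forms seen so far (no brackets, name=value filters, and @name) can be pulled apart with a few lines of Python. The following is a minimal sketch based only on the examples above, not lmth's actual parser and not the BNF from 2.2. The function name parse_step is hypothetical, and treating a filter and an @name in the same bracket as legal is my assumption.

```python
import re

# One step: a tag name, optionally followed by a bracketed, comma-separated list.
STEP_RE = re.compile(r"^\s*([A-Za-z][A-Za-z0-9]*)\s*(?:\[(.*)\])?\s*$")


def parse_step(step):
    """Split one Hpath step into (tag, filters, wanted_attr)."""
    match = STEP_RE.match(step)
    if match is None:
        raise ValueError(f"not a valid step: {step!r}")
    tag, body = match.group(1), match.group(2)
    filters, wanted = {}, None
    for part in (body or "").split(","):
        part = part.strip()
        if not part:
            continue
        if part.startswith("@"):
            wanted = part[1:].strip()  # attribute whose value should be returned
        elif "=" in part:
            name, value = part.split("=", 1)
            filters[name.strip()] = value.strip()  # attribute must equal value
        else:
            raise ValueError(f"unrecognised bracket item: {part!r}")
    return tag, filters, wanted
```

Because the bracket body is split on commas, this sketch cannot handle filter values that themselves contain a comma.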

div[id=content]/a[@href]
Purpose: get the href attribute value of every a element under the div elements whose id attribute is content
Result:
[
     'http://www.google.com/',
     'http://www.yahoo.com/',
     'http://www.baidu.com/',
     'http://www.bing.com/'
]
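
To make the semantics of these expressions concrete, here is a rough evaluator built only on the standard library's html.parser. It is a sketch under stated assumptions, not how lmth works (lmth sits on top of BeautifulSoup). Node, TreeBuilder, split_step and query are all hypothetical names. Each step is assumed to search all descendants, since the post only says "under". The tree builder assumes properly nested HTML with no unclosed tags such as a bare br.

```python
from html.parser import HTMLParser


class Node:
    def __init__(self, tag, attrs, parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs), parent
        self.children = []

    def descendants(self):
        for child in self.children:
            yield child
            yield from child.descendants()


class TreeBuilder(HTMLParser):
    """Builds a minimal element tree; assumes well-formed, properly nested HTML."""

    def __init__(self):
        super().__init__()
        self.root = self.current = Node("#root", [])

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent


def split_step(step):
    """'a[x=1, @href]' -> ('a', {'x': '1'}, 'href'); compact, no validation."""
    tag, _, body = step.strip().partition("[")
    filters, wanted = {}, None
    for part in body.rstrip("]").split(","):
        part = part.strip()
        if part.startswith("@"):
            wanted = part[1:]
        elif "=" in part:
            name, value = part.split("=", 1)
            filters[name.strip()] = value.strip()
    return tag.strip(), filters, wanted


def query(html, hpath):
    """Evaluate an Hpath-like expression; returns nodes or attribute values."""
    builder = TreeBuilder()
    builder.feed(html)
    nodes, wanted = [builder.root], None
    for step in hpath.split("/"):  # assumes no '/' inside filter values
        tag, filters, wanted = split_step(step)
        matched, seen = [], set()
        for node in nodes:
            for d in node.descendants():
                ok = d.tag == tag and all(d.attrs.get(k) == v for k, v in filters.items())
                if ok and id(d) not in seen:
                    seen.add(id(d))
                    matched.append(d)
        nodes = matched
    if wanted is None:
        return nodes
    return [n.attrs[wanted] for n in nodes if wanted in n.attrs]


SAMPLE = """
<div id="content" class="sites">
    <a href="http://www.google.com/" class="good">Google</a>
    <a href="http://www.yahoo.com/" class="good">Yahoo</a>
</div>
<div id="tbl"><ul><li class="odd">1</li><li class="even">2</li></ul></div>
"""

hrefs = query(SAMPLE, "div[id=content]/a[@href]")
```

With the shortened SAMPLE above, hrefs holds the two link addresses from the content div, while query(SAMPLE, "li") returns Node objects rather than strings.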

posted on 2012-02-16 00:16 lexus
