飞龙再生


PHP Simple HTML DOM Parser

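I have been using this one since its first beta, several years now. It is lightweight and works well; the single-file source is 1,393 lines.
Project page: http://simplehtmldom.sourceforge.net/
Manual: http://simplehtmldom.sourceforge.net/manual.htm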

  • An HTML DOM parser written in PHP 5+ that lets you manipulate HTML in a very easy way.
  • Requires PHP 5+.
  • Supports invalid HTML.
  • Find tags on an HTML page with selectors just like jQuery.
  • Extract contents from HTML in a single line.

PHP Simple HTML DOM Parser usage examples
Reading

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
 
// Find all images
foreach($html->find('img') as $element)
       echo $element->src . '<br>';
 
// Find all links
foreach($html->find('a') as $element)
       echo $element->href . '<br>';
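
For comparison, here is a rough sketch of the same "collect every src/href" idea using only PHP's bundled DOM extension, which can be handy when adding a third-party file is not an option. This is not part of Simple HTML DOM; the helper name collect_attribute and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper (not from any library): collect one attribute
// from every element with the given tag name.
function collect_attribute(string $markup, string $tag, string $attr): array
{
    $doc = new DOMDocument();
    // libxml reports malformed markup as warnings; keep them internal
    // while loading, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $values = [];
    foreach ($doc->getElementsByTagName($tag) as $element) {
        if ($element->hasAttribute($attr)) {
            $values[] = $element->getAttribute($attr);
        }
    }
    return $values;
}

// Sample markup, invented for this sketch.
$markup = '<p><a href="/home">Home</a> <img src="logo.png"> <a href="/about">About</a></p>';
$links  = collect_attribute($markup, 'a', 'href');
$images = collect_attribute($markup, 'img', 'src');
```

The built-in extension has no CSS selector support, so anything beyond tag-name lookups needs XPath; that convenience gap is the main reason to reach for the libraries covered here.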

Modifying

// Create DOM from string
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');
 
$html->find('div', 1)->class = 'bar';
 
$html->find('div[id=hello]', 0)->innertext = 'foo';
 
echo $html; // Output: <div id="hello">foo</div><div id="world" class="bar">World</div>

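Converting relative URLs to absolute (uses the PEAR Net_URL2 package)

require_once 'Net/URL2.php'; // PEAR Net_URL2
$uri = new Net_URL2('http://example.com/foo/bar'); // URI of the resource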
$baseURI = $uri;
foreach ($html->find('base[href]') as $elem) {
    $baseURI = $uri->resolve($elem->href);
}
 
foreach ($html->find('*[src]') as $elem) {
    $elem->src = $baseURI->resolve($elem->src)->__toString();
}
foreach ($html->find('*[href]') as $elem) {
    if (strtoupper($elem->tag) === 'BASE') continue;
    $elem->href = $baseURI->resolve($elem->href)->__toString();
}
foreach ($html->find('form[action]') as $elem) {
    $elem->action = $baseURI->resolve($elem->action)->__toString();
}
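
To make it clearer what resolve() is doing in the snippet above, here is a simplified, self-contained sketch of relative-reference resolution. It is not Net_URL2's implementation and does not cover every case in RFC 3986 (for example, user info in the base URL is ignored); the function name resolve_url is hypothetical.

```php
<?php
// Hypothetical helper, not part of Net_URL2 or Simple HTML DOM.
// Simplified: handles absolute, protocol-relative, root-relative and
// path-relative references, including "." and ".." segments.
function resolve_url(string $base, string $rel): string
{
    // A reference that already carries a scheme is returned unchanged.
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $rel)) {
        return $rel;
    }
    $b = parse_url($base);
    $scheme = $b['scheme'] ?? 'http';
    // Protocol-relative reference: reuse the base scheme.
    if (strpos($rel, '//') === 0) {
        return $scheme . ':' . $rel;
    }
    $authority = $scheme . '://' . ($b['host'] ?? '')
        . (isset($b['port']) ? ':' . $b['port'] : '');
    $basePath = $b['path'] ?? '/';

    // Empty or fragment-only reference: keep the base, swap the fragment.
    if ($rel === '' || $rel[0] === '#') {
        return explode('#', $base, 2)[0] . $rel;
    }
    // Query-only reference: keep the base path, replace the query.
    if ($rel[0] === '?') {
        return $authority . $basePath . $rel;
    }

    // Split the path off from any query string or fragment.
    $cut     = strcspn($rel, '?#');
    $relPath = substr($rel, 0, $cut);
    $suffix  = (string) substr($rel, $cut);

    if ($relPath[0] !== '/') {
        // Path-relative: replace everything after the base path's last "/".
        $relPath = substr($basePath, 0, strrpos($basePath, '/') + 1) . $relPath;
    }

    // Remove "." and ".." segments, never climbing above the root.
    $segments = explode('/', $relPath);
    $out = [];
    foreach ($segments as $seg) {
        if ($seg === '..') {
            if (count($out) > 1) {
                array_pop($out);
            }
        } elseif ($seg !== '.') {
            $out[] = $seg;
        }
    }
    // A trailing "." or ".." leaves a directory, so keep the final slash.
    $last = end($segments);
    if ($last === '.' || $last === '..') {
        $out[] = '';
    }
    return $authority . implode('/', $out) . $suffix;
}
```

With the base http://example.com/foo/bar used above, a reference such as img/a.png would come out as http://example.com/foo/img/a.png. For production use, the Net_URL2 approach shown above is the safer choice.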

Ganon

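Project page: http://code.google.com/p/ganon/
Documentation: http://code.google.com/p/ganon/w/list

This one is very powerful. I only discovered it recently and have added it to my regular toolkit; the single-file source is 2,856 lines.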

The Ganon library gives access to HTML/XML documents in a very simple object oriented way. It eases modifying the DOM and makes finding elements easy with CSS3-like queries.

  • A universal tokenizer
  • A HTML/XML/RSS DOM Parser
      • Ability to manipulate elements and their attributes
      • Supports invalid HTML
      • Supports UTF8
      • Can perform advanced CSS3-like queries on elements (like jQuery -- namespaces supported)
  • A HTML beautifier (like HTML Tidy)
      • Minify CSS and Javascript
      • Sort attributes, change character case, correct indentation, etc.
  • Extensible
      • Parsing documents using callbacks based on current character/token
      • Operations separated in smaller functions for easy overriding
  • Fast
  • Easy

Ganon usage examples:

// Parse the google code website into a DOM
$html = file_get_dom('http://code.google.com/');

Access
Accessing elements is made easy through the CSS3-like selectors and the object model.

// Find all the paragraph tags with a class attribute and print the
// value of the class attribute
foreach($html('p[class]') as $element) {
  echo $element->class, "<br>\n";
}

// Find the first div with ID "gc-header" and print the plain text of
// the parent element (plain text means no HTML tags, just the text)
echo $html('div#gc-header', 0)->parent->getPlainText();

// Find out how many tags there are which are "ns:tag" or "div", but not
// "a" and do not have a class attribute
echo count($html('(ns|tag, div + !a)[!class]'));

Modification
Elements can be easily modified after you've found them.

// Find all paragraph tags which are nested inside a div tag, change
// their ID attribute and print the new HTML code
foreach($html('div p') as $index => $element) {
  $element->id = "id$index";
}
echo $html;


// Center all the links inside a document which start with "http://"
// and print out the new HTML
foreach($html('a[href ^= "http://"]') as $element) {
  $element->wrap('center');
}
echo $html;


// Find all odd indexed "td" elements and change the HTML to make them links
foreach($html('table td:odd') as $element) {
  $element->setInnerText('<a href="#">'.$element->getPlainText().'</a>');
}
echo $html;

Beautify
Ganon can also help you beautify your code and format it properly.

// Beautify the old HTML code and print out the new, formatted code
dom_format($html, array('attributes_case' => CASE_LOWER));
echo $html;

phpQuery

This one is heavyweight and fairly resource-hungry; the single-file source is 5,702 lines.

phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on the jQuery JavaScript library.
The library is written in PHP 5 and provides an additional command-line interface (CLI).

Project page: http://code.google.com/p/phpquery/
Documentation: http://code.google.com/p/phpquery/wiki/Manual

phpQuery Examples

CLI
Fetch number of downloads of all release packages

phpquery 'http://code.google.com/p/phpquery/downloads/list?can=1' \
  --find '.vt.col_4 a' --contents \
  --getString null array_sum

PHP
Examples from demo.php

require('phpQuery/phpQuery.php');
// for PEAR installation use this
// require('phpQuery.php');

INITIALIZE IT

// $doc = phpQuery::newDocumentHTML($markup);
// $doc = phpQuery::newDocumentXML();
// $doc = phpQuery::newDocumentFileXHTML('test.html');
// $doc = phpQuery::newDocumentFilePHP('test.php');
// $doc = phpQuery::newDocument('test.xml', 'application/rss+xml');
// this one defaults to text/html in utf8
$doc = phpQuery::newDocument('<div/>');

FILL IT

// array syntax works like ->find() here
$doc['div']->append('<ul></ul>');
// array set changes inner html
$doc['div ul'] = '<li>1</li><li>2</li><li>3</li>';

MANIPULATE IT

// almost everything can be a chain
$li = null;
$doc['ul > li']
        ->addClass('my-new-class')
        ->filter(':last')
                ->addClass('last-li')
// save it anywhere in the chain
                ->toReference($li);

SELECT DOCUMENT

// pq(); is using selected document as default
phpQuery::selectDocument($doc);
// documents are selected when created or by above method
// query all unordered lists in last selected document
pq('ul')->insertAfter('div');

ITERATE IT

// all LIs from last selected DOM
foreach(pq('li') as $li) {
        // iteration returns PLAIN dom nodes, NOT phpQuery objects
        $tagName = $li->tagName;
        $childNodes = $li->childNodes;
        // so you NEED to wrap it within phpQuery, using pq();
        pq($li)->addClass('my-second-new-class');
}
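
The comments in the loop above are worth a second look: what you get per iteration is a plain DOM node. The sketch below uses only PHP's bundled DOM extension (which, as far as I know, phpQuery is built on; treat that as an assumption) to show what such a node offers without a pq() wrapper. The add_class helper is hypothetical and only approximates what addClass does.

```php
<?php
// Hypothetical helper: append a class name to a plain DOMElement,
// roughly what pq($node)->addClass($class) does for a wrapped node.
function add_class(DOMElement $node, string $class): void
{
    $classes = preg_split('/\s+/', $node->getAttribute('class'), -1, PREG_SPLIT_NO_EMPTY);
    if (!in_array($class, $classes, true)) {
        $classes[] = $class;
    }
    $node->setAttribute('class', implode(' ', $classes));
}

// Sample markup, invented for this sketch.
$doc = new DOMDocument();
$doc->loadHTML('<ul><li class="my-new-class">1</li><li>2</li></ul>');

$tagNames = [];
foreach ($doc->getElementsByTagName('li') as $li) {
    // Plain DOM properties and methods are available directly.
    $tagNames[] = $li->tagName;
    add_class($li, 'my-second-new-class');
}

$items       = $doc->getElementsByTagName('li');
$firstClass  = $items->item(0)->getAttribute('class');
$secondClass = $items->item(1)->getAttribute('class');
```

Plain nodes have no chainable jQuery-style methods, which is why wrapping each one with pq() is the convenient route once phpQuery is loaded.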

PRINT OUTPUT

// 1st way
print phpQuery::getDocument($doc->getDocumentID());
// 2nd way
print phpQuery::getDocument(pq('div')->getDocumentID());
// 3rd way
print pq('div')->getDocument();
// 4th way
print $doc->htmlOuter();
// 5th way
print $doc;
// another...
print $doc['ul'];

Tags: DOM, Ganon, HTML, PHP, phpQuery, Simple HTML DOM

Original article: http://www.21andy.com/new/20120716/2071.html

posted on 2015-09-16 16:13 by 飞龙再生