
Treating HTML like XML using HtmlAgilityPack, and doing it inside of an XSLT too [Repost]

I was not able to post this on Simon Mourier's blog due to the HTML and XSLT tags, so here it is on mine:

Maybe someone has done this already, but I don't see it in the comments.

I created an XSLT extension object based on HtmlAgilityPack. The class is tiny:

using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;
using System.Xml;
using System.Xml.XPath;
using System.IO;

namespace HtmlAgilityPack
{
    public class XslExtension
    {
        public XmlDocument loadhtmlasxml(string url)
        {
            // Create an instance of the HtmlWeb object
            HtmlWeb web = new HtmlWeb();
            // Declare necessary stream and writer objects
            MemoryStream m = new MemoryStream();           
            XmlTextWriter xtw = new XmlTextWriter(m,null);           
            // Load the content into the writer
            web.LoadHtmlAsXml(url, xtw);
            // Flush any buffered output, then rewind the memory stream
            xtw.Flush();
            m.Position = 0;
            // Create, fill, and return the xml document
            XmlDocument xdoc = new XmlDocument();
            xdoc.LoadXml((new StreamReader(m)).ReadToEnd());
            return xdoc;
        }
    }
}


Then I used NXSLT from http://www.xmllab.net to load the custom extension function from the command line, so that the following XSL style sheet can call it directly:

<xsl:stylesheet
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:hap="http://smourier.blogspot.com"
 xmlns:msxsl="urn:schemas-microsoft-com:xslt"
      version="1.0">

 <xsl:output method="html" omit-xml-declaration="yes" indent="no"/>

 <xsl:template match="/">

  <h1>BEGIN TEST OF HtmlAgilityPack.XslExtension</h1>

  <h2>First, connect to http://www.cnn.com and load its node set into a local variable</h2>   

  <xsl:variable name="cnn"><xsl:copy-of select="hap:loadhtmlasxml('http://www.cnn.com')" /></xsl:variable>

  <h3>CNN.com has this many nodes:</h3>

  <xsl:value-of select="count(msxsl:node-set($cnn)//*)" />
  <h2>Now, process all the A tags within the "Special Coverage" stories inside the div with class="cnnLSSpecialCovBoxContent" that have an HREF that starts with /2005/.</h2>
   <h3>Special Coverage</h3>
    <xsl:for-each select="msxsl:node-set($cnn)//div[@class='cnnLSSpecialCovBoxContent']//a[starts-with(@href, '/2005/')]">
   <div>
    <h3><xsl:copy-of select="." /></h3>
    <!-- Now get the images from each story if they exist -->
    <h5>Connecting to: <xsl:value-of select="concat('http://www.cnn.com', @href)" /> to retrieve image if it exists</h5>
    <xsl:copy-of select="hap:loadhtmlasxml(concat('http://www.cnn.com', @href))//img[@height = '168']" />
   <br /><br />
   </div>
   </xsl:for-each>
  <h1>END TEST OF HtmlAgilityPack.XslExtension</h1>
 </xsl:template>

</xsl:stylesheet>


The command for NXSLT to perform this is:


nxslt2.exe source.xml source.xsl -ext hap:HtmlAgilityPack.XslExtension xmlns:hap="http://smourier.blogspot.com" -af .\HtmlAgilityPackXslExtension.dll
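
For readers without NXSLT, the binding that the command above sets up can be approximated in plain .NET with XsltArgumentList.AddExtensionObject. The following is only a minimal sketch, not NXSLT's actual implementation. StubExtension, Program, and Run are hypothetical names, and the stub returns a fixed document instead of fetching a URL so that the example runs offline. The inline style sheet is a cut-down stand-in for source.xsl.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.XPath;
using System.Xml.Xsl;

// Hypothetical stand-in for HtmlAgilityPack.XslExtension so the sketch runs
// offline: it returns a fixed document instead of fetching the URL.
public class StubExtension
{
    public XPathNavigator loadhtmlasxml(string url)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<html><body><a href='" + url + "'>link</a></body></html>");
        return doc.CreateNavigator();
    }
}

public static class Program
{
    // Same namespace URI that the style sheet binds to the "hap" prefix.
    const string HapNamespace = "http://smourier.blogspot.com";

    // Cut-down style sheet: counts the A elements in the loaded document.
    const string StyleSheet =
        "<xsl:stylesheet version='1.0'" +
        " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'" +
        " xmlns:hap='http://smourier.blogspot.com'" +
        " exclude-result-prefixes='hap'>" +
        "<xsl:output method='text'/>" +
        "<xsl:template match='/'>" +
        "<xsl:value-of select=\"count(hap:loadhtmlasxml('http://example.com')//a)\"/>" +
        "</xsl:template>" +
        "</xsl:stylesheet>";

    public static string Run()
    {
        XslCompiledTransform xslt = new XslCompiledTransform();
        using (XmlReader sheet = XmlReader.Create(new StringReader(StyleSheet)))
        {
            xslt.Load(sheet);
        }

        // Register the extension object under the namespace URI, so that
        // hap:loadhtmlasxml(...) resolves to the method on this instance.
        XsltArgumentList args = new XsltArgumentList();
        args.AddExtensionObject(HapNamespace, new StubExtension());

        StringWriter output = new StringWriter();
        using (XmlReader input = XmlReader.Create(new StringReader("<dummy/>")))
        {
            xslt.Transform(input, args, output);
        }
        return output.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Run());
    }
}
```

Passing an instance of the XslExtension class from this post in place of StubExtension, with HtmlAgilityPack referenced, should bind the real function in the same way, although that variant needs network access and has not been verified here.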

The style sheet connects to CNN.com using the syntax:

select="hap:loadhtmlasxml('http://www.cnn.com')"

Then, further down, after it processes each of the selected links, it connects to each of the linked stories, retrieves any images with a height of 168, and copies them into the HTML result tree.

The same approach could be extended to follow links to any depth. I haven't worked out an automatic form processor yet, but I think that could be an XSLT extension too.

Let me know what you think...
http://blogs.wdevs.com/ultravioletconsulting/archive/2005/09/10/10506.aspx

posted @ 2006-08-27 20:09 by 张善友