Some experiments with RSS

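/**
     * Parse out every node in the RSS file, including element nodes, text nodes, etc.
     * A regular expression would of course work too, but string functions are
     * arguably the better showcase for performance.
     * The implementation is messy and took quite a while, mostly because of
     * trivial distractions that kept eating into the time.
     */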
    public function _token_parse()
    {
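        // Trim once so that $str and $str_len describe the same string; measuring
        // the trimmed length against the untrimmed string would shift every index.
        $str = trim($this->rss_str);
        $str_len = strlen($str);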
        /**
         * $depth records the depth of a node; <xx> and </xx> share the same depth
         */
        $depth = 0;
        //$_LEFT_ANGLE_BRACKET_ASCII = ord('<');
        /**
         * $state is used to pair <xx> with </xx>
         */
        $state = 0;
        /**
         * $token holds one node at a time (including its attributes)
         */
        $token = '';
        $_L_A = ord('<');
        $_R_A = ord('>');
        $_SLASH = ord('/');
        for ($i = 0; $i < $str_len; $i++) {
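            // Skip whitespace between nodes without running past the end of the string.
            while ($i < $str_len && trim($str[$i]) == '') $i++;
            if ($i >= $str_len) break;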
            if (ord($str[$i]) == $_L_A) {
                if (ord($str[$i+1]) == $_SLASH) {
                    $depth--;
                    while ($i < $str_len && ord($str[$i]) != $_R_A) {
                        $token .= $str[$i];
                        $i++;
                    }
                    $state = 1;
                } else if (ord($str[$i+1]) == ord('?')) {
                    /**
                     * XML declaration <?xml ... ?>
                     */
                    while ($i < $str_len && substr($str, $i, 2) != '?>') {
                        $token .= $str[$i];
                        $i++;
                    }
                    $token .= '?';
                    $i = $i + 1;
                    /**
                     * the XML declaration has depth 0
                     */
                    $depth = 0;
                } else if (substr($str, $i, 4) == '<rss') {
                    while ($i < $str_len && $str[$i] != '>') {
                        $token .= $str[$i];
                        $i++;
                    }
                    /**
                     * the <rss ...> version tag has depth 0
                     */
                    $depth = 0;
                } else if (substr($str, $i, 9) == '<![CDATA[') {
                    /**
                     * Handle content-namespace data wrapped in <![CDATA[ ........ ]]>
                     */
                    $depth++;
                    while ($i < $str_len && substr($str, $i, 3) != ']]>') {
                        $token .= $str[$i];
                        $i++;
                    }
                    $token .= ']]';
                    $i = $i + 2;
                } else if (substr($str, $i, 4) == '<!--'){
                    /**
                     * Handle comments <!--......-->
                     */
                    while ($i < $str_len && substr($str, $i, 3) != '-->') {
                        $i++;
                    }
                    $i = $i + 2;
                } else {
                    if ($state == 0)
                        $depth++;
                    while ($i < $str_len && ord($str[$i]) != $_R_A) {
                        $token .= $str[$i];
                        $i++;
                    }
                    $state = 0;
                }
                if ($state == 0 && ord($str[$i-1]) == $_SLASH) {
                    $state = 1;
                }
                if (trim($token) != '')
                    $token .= $str[$i];
            } else {
                $depth++;
                while ($i < $str_len && ord($str[$i]) != $_L_A) {
                    $token .= $str[$i];
                    $i++;
                }
                $i = $i - 1;
            }
            if (trim($token) != '') {
                $this->_tokens[] = array($depth, $token);
                $token = '';
            }
            //echo $state . '>' . $depth . "\n";
        }
        return $this->_tokens;
    }

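The code above has no great significance. All it does is scan out every node in an RSS document: tags of the form <xx> or </xx>, or pieces of text. It really only handles standard documents, that is, input that is already well-formed XML and follows the RSS structure. The number stored with each token is the node's depth. There is no special purpose behind it; I wrote it for fun in my spare time, and the code is admittedly messy.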
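
The doc comment notes that a regular expression could do the same job. As a rough sketch of that alternative, the function below splits a well-formed feed into the same array(depth, token) pairs. The name sketch_tokenize and the pattern are illustrative assumptions, not part of the original class, and the sketch assumes there is no DOCTYPE and no ">" inside attribute values.

```php
<?php
/**
 * Hypothetical regex-based counterpart to _token_parse(); not the author's code.
 * Returns a list of array(depth, token) pairs for a well-formed RSS string.
 */
function sketch_tokenize($xml)
{
    // Alternatives are tried in order:
    // comment, CDATA, processing instruction, any other tag, text.
    $pattern = '/<!--.*?-->|<!\[CDATA\[.*?\]\]>|<\?.*?\?>|<[^>]+>|[^<]+/s';
    preg_match_all($pattern, trim($xml), $matches);

    $tokens = array();
    $depth  = 0; // depth of the innermost element that is currently open
    foreach ($matches[0] as $raw) {
        $tok = trim($raw);
        if ($tok === '' || strpos($tok, '<!--') === 0) {
            continue; // whitespace between tags and comments produce no token
        }
        if (strpos($tok, '<?') === 0 || preg_match('#^</?rss[\s>]#', $tok)) {
            $tokens[] = array(0, $tok);          // declaration and rss tags stay at depth 0
        } elseif (strpos($tok, '<![CDATA[') === 0 || $tok[0] !== '<') {
            $tokens[] = array($depth + 1, $tok); // text and CDATA sit one level below their tag
        } elseif (strpos($tok, '</') === 0) {
            $tokens[] = array($depth, $tok);     // closing tag shares its opening tag's depth
            $depth--;
        } elseif (substr($tok, -2) === '/>') {
            $tokens[] = array($depth + 1, $tok); // self-closing tag opens and closes at once
        } else {
            $depth++;
            $tokens[] = array($depth, $tok);     // opening tag
        }
    }
    return $tokens;
}
```

Calling sketch_tokenize() on a feed string gives a flat list that can be walked exactly like $this->_tokens. Unlike the hand-written loop, an empty element such as <title></title> keeps the same depth on both tags here.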

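With the tokens obtained above, we can go one step further. Because the depth of every node was already computed during tokenizing, it is easy to pull out the content we want. An RSS document has its own prescribed tags (a subset of XML), such as title or description or …… The implementation is as follows: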

public function _construct_parse($cdts = array('title', 'description'))
    {
        if (empty($cdts))
            $cdts = array('title');
        $state  = 0;
        $depth  = 0;
        $key    = '';
        $res    = array(); // initialised so that a feed with no matches returns an empty array
        foreach ($this->_tokens as $item) {
            $tmp = trim(str_replace(array('<', '>'), '', $item[1]));
            if (in_array($tmp, $cdts)) {
                $key = $tmp;
                $depth = $item[0];
                $state = 1;
                continue;
            }
            if ($state == 1 && $depth == $item[0] - 1) {
                $res[][$key] = preg_replace("/<!\[CDATA\[|\]\]>/", "", $item[1]);
                $state = 0;
            } 
        }
        return $res;
    }

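This function has shortcomings, leaving its practicality aside for now. As the method is written, the results it returns are not grouped: each title or description becomes its own single-key entry instead of being collected per item. That makes displaying the data awkward, so the whole process cannot yet be fully wrapped in the class. For now, here is a usage example of sorts (this is only part of the code; I will add the rest once it is finished).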

<?php
/**
 * 20/08/10 14:55:05
 * Xiang Shouding <fansekey@gmail.com> 
 */
include_once("./rss.php");
//$rss = new Rss("./localhost.rss");
//$rss = new Rss("http://hi.baidu.com/xiangshouding/rss");
//$rss = new Rss("http://www.phpweblog.net/rss.aspx");
$rss = new Rss("http://www.laruence.com/feed/rss");
//$rss = new Rss("http://static.userland.com/gems/backend/rssTwoExample2.xml");
$nodes = $rss->rss_fetch();
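foreach ($nodes as $node) {
    // A node may lack either key, so test with empty() to avoid undefined-index notices.
    if (!empty($node['title']))
        echo "TITLE: " . $node['title'] . "<br />";
    if (!empty($node['description']))
        echo "    Description: " . $node['description'] . "<br />";
}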
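
The grouping problem described above could be handled by collecting fields per <item> instead of appending each match separately. The following is a minimal sketch under that assumption. sketch_group_items is a hypothetical helper, not part of the Rss class; it takes the array(depth, token) list produced by _token_parse() and relies on <item> boundaries rather than on depth. Like _construct_parse(), it assumes the tags are written without attributes.

```php
<?php
/**
 * Hypothetical helper: group the wanted fields of each <item> into one array.
 * $tokens is a list of array(depth, token) pairs; $fields are the tag names to keep.
 */
function sketch_group_items(array $tokens, array $fields = array('title', 'description'))
{
    $items   = array();
    $current = null; // fields of the <item> being read, or null when outside an item
    $key     = '';   // name of the wanted tag that was just opened, if any
    foreach ($tokens as $pair) {
        $token = trim($pair[1]);
        if ($token === '') {
            continue;
        }
        if ($token === '<item>') {
            $current = array();
            $key = '';
            continue;
        }
        if ($token === '</item>') {
            if ($current !== null) {
                $items[] = $current; // one array per item, holding all of its fields
            }
            $current = null;
            continue;
        }
        if ($current === null) {
            continue; // channel-level tags such as the feed title are ignored
        }
        $name = trim($token, '<>');
        if (in_array($name, $fields, true)) {
            $key = $name; // the next text or CDATA token is this field's value
            continue;
        }
        if ($key !== '') {
            $isCdata = strpos($token, '<![CDATA[') === 0;
            if ($isCdata || $token[0] !== '<') {
                // Strip the CDATA wrapper, if present, and store the value.
                $current[$key] = preg_replace('/^<!\[CDATA\[|\]\]>$/', '', $token);
            }
            $key = '';
        }
    }
    return $items;
}
```

With results shaped like this, the display loop can print an item's title and description together, since both live in the same array.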

posted @ 2010-08-23 15:51 那天
