Crawler Development Process - Developing the Collector Main Program

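The main program's job is to parse and execute rules, so the first step is to design the structure of the rule file.

I designed a small script format to serve as the rules. The script follows a simple language specification: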

engine
    command param1 param2 ....
    ....

A simple collection rule:

engine
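    # The file must start with engine; no other characters are allowed
    # '#' marks a comment; the rest of the line after it is a comment
    # Each line represents one operation, for example:
    # data url=http://www.baidu.com # set the variable url
    # data is a command provided by the main program; the parameter list follows the space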

    data url=http://www.baidu.com
    range 1 20
    select {url}?page={$0}
    foreach
        httper utf-8
        xpath title=//title content=//article
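
The sample rule uses placeholders such as `{url}` and `{$0}` in the `select` line. The post does not spell out how they are resolved, so the following is only a minimal sketch under my own reading: `{name}` is looked up in the variables set by `data`, and `{$0}` stands for the current pipeline item. The class name `Template` and the method `Expand` are hypothetical and are not taken from the SimpleSpider source.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original project.
public static class Template
{
    // Matches {name} and {$0}-style placeholders.
    private static readonly Regex Placeholder = new Regex(@"\{(\$?\w+)\}");

    // Assumption: {$0} is the current pipeline item, {name} is a variable set by `data`.
    public static string Expand(string template, IDictionary<string, string> data, string item)
    {
        return Placeholder.Replace(template, m =>
        {
            string key = m.Groups[1].Value;
            if (key == "$0") return item;
            string value;
            // Unknown placeholders are left untouched instead of being dropped.
            return data.TryGetValue(key, out value) ? value : m.Value;
        });
    }
}
```

Under these assumptions, expanding `{url}?page={$0}` with `url` set to `http://www.baidu.com` and the item `3` would give `http://www.baidu.com?page=3`.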

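As the sample rule shows, a rule file is a tree structure (nesting is expressed through indentation), and every operation occupies exactly one line, so parsing the file line by line is enough to build the tree.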

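    using System.Collections.Generic;

    // One parsed line (operation) of the rule file.
    public class Node
    {
        public Node()
        {
            Childs = new List<Node>();
            Args = new List<string>();
        }
        public int Line { get; set; }            // line number in the rule file
        public int Indent { get; set; }          // indentation width, used to determine nesting
        public string Name { get; set; }         // command name, e.g. data or xpath
        public List<string> Args { get; set; }   // parameters that follow the command name
        public List<Node> Childs { get; set; }   // operations nested under this one
        public Node Parent { get; set; }         // enclosing operation
    }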
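
To make the line-by-line parsing concrete, here is a minimal sketch of how the indentation could be turned into a `Node` tree with a stack of open ancestors. It is not the parser from the SimpleSpider repository: `RuleParser` and `Parse` are hypothetical names, the root node's indent of -1 is my own convention, and trailing `#` comments on a command line are deliberately not handled (a `#` can also appear inside a URL).

```csharp
using System;
using System.Collections.Generic;

public class Node
{
    public int Line { get; set; }
    public int Indent { get; set; }
    public string Name { get; set; }
    public List<string> Args { get; set; } = new List<string>();
    public List<Node> Childs { get; set; } = new List<Node>();
    public Node Parent { get; set; }
}

// Hypothetical parser sketch, not the original implementation.
public static class RuleParser
{
    public static Node Parse(string text)
    {
        string[] lines = text.Replace("\r\n", "\n").Split('\n');
        if (lines[0].TrimEnd() != "engine")
            throw new FormatException("A rule file must start with 'engine'.");

        // Indent -1 guarantees the root is never popped from the stack.
        var root = new Node { Line = 1, Indent = -1, Name = "engine" };
        var stack = new Stack<Node>();
        stack.Push(root);

        for (int i = 1; i < lines.Length; i++)
        {
            string raw = lines[i];
            string trimmed = raw.Trim();
            // Skip blank lines and whole-line comments.
            if (trimmed.Length == 0 || trimmed[0] == '#') continue;

            int indent = raw.Length - raw.TrimStart().Length;
            string[] parts = trimmed.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
            var node = new Node { Line = i + 1, Indent = indent, Name = parts[0] };
            for (int p = 1; p < parts.Length; p++) node.Args.Add(parts[p]);

            // Pop until the top of the stack is less indented than this line;
            // that node is the parent.
            while (stack.Peek().Indent >= indent) stack.Pop();
            node.Parent = stack.Peek();
            node.Parent.Childs.Add(node);
            stack.Push(node);
        }
        return root;
    }
}
```

Applied to the sample rule, this sketch would place `data`, `range`, `select` and `foreach` directly under the root, with `httper` and `xpath` as children of `foreach`.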

Next comes the definition of a command:

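    using System.Collections.Generic;

    // Result returned by every command.
    public class CommandResult
    {
        public bool Success { get; set; }
        public Dictionary<string, string> Data { get; set; }   // named variables
        public object PipelineOutput { get; set; }             // value passed to the next command
    }

    // Typed variant. Note that this property hides CommandResult.PipelineOutput;
    // it is a separate property, not an override.
    public class CommandResult<T> : CommandResult
    {
        public new T PipelineOutput { get; set; }
    }

    public interface ICommand
    {
        string Name { get; }   // command name as written in the rule file
        CommandResult Excute(object pipelineInput, Dictionary<string, string> data, string[] args);
    }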

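Each parsed Node is handed to the matching ICommand for execution, and the result is then passed on to the next command. With that, the collector main program is implemented.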
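
The chaining described above might look roughly like the sketch below. It is an illustration only: `PipelineRunner` and `AppendCommand` are hypothetical names introduced here, nested blocks such as `foreach` are left out, and carrying the `Data` dictionary forward when a command returns none is my own assumption rather than documented behavior.

```csharp
using System;
using System.Collections.Generic;

public class Node
{
    public string Name { get; set; }
    public List<string> Args { get; set; } = new List<string>();
}

public class CommandResult
{
    public bool Success { get; set; }
    public Dictionary<string, string> Data { get; set; }
    public object PipelineOutput { get; set; }
}

public interface ICommand
{
    string Name { get; }
    CommandResult Excute(object pipelineInput, Dictionary<string, string> data, string[] args);
}

// Hypothetical runner: looks up each node's command by name and chains the results.
public class PipelineRunner
{
    private readonly Dictionary<string, ICommand> _commands = new Dictionary<string, ICommand>();

    public void Register(ICommand command) { _commands[command.Name] = command; }

    public CommandResult Run(IEnumerable<Node> nodes)
    {
        var result = new CommandResult { Success = true, Data = new Dictionary<string, string>() };
        foreach (Node node in nodes)
        {
            ICommand command;
            if (!_commands.TryGetValue(node.Name, out command))
                throw new InvalidOperationException("Unknown command: " + node.Name);

            // The previous output becomes this command's input.
            CommandResult next = command.Excute(result.PipelineOutput, result.Data, node.Args.ToArray());
            // Assumption: keep the variables if the command did not return any.
            if (next.Data == null) next.Data = result.Data;
            result = next;
            if (!result.Success) break;   // stop at the first failing command
        }
        return result;
    }
}

// Hypothetical demo command: appends its first argument to the string input.
public class AppendCommand : ICommand
{
    public string Name { get { return "append"; } }

    public CommandResult Excute(object pipelineInput, Dictionary<string, string> data, string[] args)
    {
        string input = pipelineInput as string ?? "";
        return new CommandResult { Success = true, Data = data, PipelineOutput = input + args[0] };
    }
}
```

Keeping the commands in a dictionary keyed by `Name` means a new command only has to be registered; the runner itself does not change.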

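For example, if a rule needs HTTP requests, write an HTTP request command; if it needs JSON parsing, write a JSON command; if it needs to save results, write a command that exports to a database.

Source code for the crawler main program: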
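
As an illustration of what writing such a command could involve, here is a sketch of the `data` command from the sample rule, written against the `ICommand` interface shown earlier. The real implementation lives in the repository and may differ; `DataCommand` is a hypothetical name, and the choices to split each argument at the first `=` and to pass the pipeline input through unchanged are assumptions.

```csharp
using System;
using System.Collections.Generic;

public class CommandResult
{
    public bool Success { get; set; }
    public Dictionary<string, string> Data { get; set; }
    public object PipelineOutput { get; set; }
}

public interface ICommand
{
    string Name { get; }
    CommandResult Excute(object pipelineInput, Dictionary<string, string> data, string[] args);
}

// Hypothetical implementation of `data name=value ...`; not the original code.
public class DataCommand : ICommand
{
    public string Name { get { return "data"; } }

    public CommandResult Excute(object pipelineInput, Dictionary<string, string> data, string[] args)
    {
        foreach (string arg in args)
        {
            // Split at the first '=' only, so values such as URLs with query strings survive.
            int eq = arg.IndexOf('=');
            if (eq <= 0)
                return new CommandResult { Success = false, Data = data, PipelineOutput = pipelineInput };
            data[arg.Substring(0, eq)] = arg.Substring(eq + 1);
        }
        // Assumption: setting variables does not alter the pipeline value.
        return new CommandResult { Success = true, Data = data, PipelineOutput = pipelineInput };
    }
}
```

An HTTP, JSON or database-export command would follow the same shape: read `pipelineInput` and `args`, do the work, and return a `CommandResult` for the next command.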

https://github.com/ss22219/SimpleSpider

posted @ 2018-07-04 01:31 幻影gool

