『开源』50行代码扒取博客园文章

今天在博客园看到一篇文章：《网络爬虫+HtmlAgilityPack+windows服务从博客园爬取20万博文》

于是心血来潮，立即动手用 50 行代码，完成博客园文章扒取。

并非哗众取宠，有图有真相 —— 直接上图。

并非恶意攻击博客园 —— 所以只扒取 10页数据，望博客园管理员见谅。

数据准备（浏览器F12拦截监视）：

　　文章列表链接 : http://www.cnblogs.com/mvc/AggSite/PostList.aspx?CategoryId=808&CategoryType=SiteHome&ItemListActionName=PostList&PageIndex=3&ParentCategoryId=0

　　文章列表HTML : <a class="titlelnk" href="http://www.cnblogs.com/2010wuhao/p/4707154.html" target="_blank">Android中的Intent详解</a> —— 其中 class="titlelnk" 为重点

　　文章正文HTML : <div id="cnblogs_post_body"> ...... </div><div id="MySignature"> —— 其中 id="cnblogs_post_body" 和 id="MySignature" 为重点

匹配引擎：

　　配置文章列表的扒取规则：

　　得到格式化之后的 HTML：

　　添加正式字段的匹配规则：

　　所见即所得，实时查看匹配结果：

　　配置文章正文的扒取规则：

　　调试文章正文的扒取结果：

　　新建控制台项目：

　　引入核心框架 Laura.MatchCore：

　　完成 50行代码，调试运行，扒取开始：

程序源码（50行）：

using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Runtime.Serialization.Formatters.Binary;
using System.Text;
using Laura.MatchCore.Entity;

namespace Temp_20150806_1838
{
    class Program
    {
        static void Main(string[] args)
        {
            const string 文章列表URL_模板 = "http://www.cnblogs.com/mvc/AggSite/PostList.aspx?CategoryId=808&CategoryType=SiteHome&ItemListActionName=PostList&PageIndex={PageIndex}&ParentCategoryId=0";

            MatchSchema matchSchema_List = (MatchSchema) ReadStream(@"Data\扒取博客园_文章列表.data");
            MatchSchema matchSchema_Content = (MatchSchema) ReadStream(@"Data\扒取博客园_文章内容.data");

            for (int i = 1; i <= 10; i++) //只 扒取 10 页
            {
                Console.WriteLine("正在 扒取 第 {0} 页", i);
                Console.WriteLine("----------------------------------------");


                string urlList = 文章列表URL_模板.Replace("{PageIndex}", i.ToString());
                string htmlList = ReadHtml(urlList, Encoding.UTF8);

                MatchObject matchObject_List = matchSchema_List.CalculateFieldValues(htmlList);
                List<string> listTitle = matchObject_List.GetValues("文章标题");
                List<string> listUrl = matchObject_List.GetValues("文章URL");

                for (int j = 0; j < listUrl.Count; j++)
                {
                    string urlContent = listUrl[j];
                    string htmlContent = ReadHtml(urlContent, Encoding.UTF8);

                    MatchObject matchObject_Content = matchSchema_Content.CalculateFieldValues(htmlContent);
                    string content = matchObject_Content.GetValue("文章正文");

                    Console.WriteLine("标题: {0} \r\n正文: {1}", listTitle[j], (content.Length >= 100 ? content.Substring(0, 100) : content)); //控制台输出 正文截取 100位
                    Console.WriteLine("----------------------------------------");
                }
            }


            Console.WriteLine("扒取 博客园 10页 完成");
        }




        //两个 工具类, 不算入 正式代码
        public static object ReadStream(string file)
        {
            if (!File.Exists(file)) return null;

            try
            {
                BinaryFormatter myBf = new BinaryFormatter();
                using (FileStream myFs = new FileStream(file, FileMode.Open))
                {
                    object record = myBf.Deserialize(myFs);
                    return record;
                }
            }
            catch { return null; }
        }
        public static string ReadHtml(string url, Encoding encoding)
        {
            string result = string.Empty;
            try
            {
                using (WebClient webClient = new WebClient { Headers = { { "user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)" } } })
                {
                    using (Stream stream = webClient.OpenRead(url))
                    {
                        if (stream != null)
                        {
                            using (StreamReader streamReader = new StreamReader(stream, encoding))
                            {
                                result = streamReader.ReadToEnd();
                                stream.Close();
                                streamReader.Close();
                            }
                        }
                    }
                }
            }
            catch (Exception exp)
            {
                string logMsg = string.Format("BaseUtil.WebTools.ReadHtml(url) 通过 Url: |{0}| 获取网页Html 时发生错误:{1}", url, exp);
                Console.WriteLine(logMsg);
            }
            return result;
        }
    }
}

源码点击下载（包括核心框架源码） —— 如果您觉得本文不错，麻烦点击一下右下角的推荐。

posted on 2015-08-07 09:55 InkFx 阅读(7333) 评论(85) 编辑收藏举报

『开源』50行代码 扒取 博客园文章

『开源』50行代码扒取博客园文章