Smart programmers use Delphi, real programmers use C++, lazy programmers use PowerShell

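Introduction

Thanks to everyone's support and the recognition of the Microsoft Community Elite Program team, I was invited to set up a personal homepage on Microsoft's MSDN network. Creating the homepage for the first time requires submitting information about my related blog posts, so I used PowerShell to collect it. This article describes how to use PowerShell to collect post information from cnblogs (博客园).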

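Requirements

The submitted posts have to be organized into a table showing the publication date, title, link, technical category, and content type, as in the table below.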

Publication date	Title	Link	Technical category	Content type
2010年07月22日	Windows Phone 7书托	http://www.cnblogs.com/procoder/archive/2010/07/22/Windows-Phone-7-Books.html	Windows Phone	博客

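Although generating the article list is a one-off job, copy and paste is still annoying and error-prone, so once again I am using PowerShell to simplify the work. I admit I am a lazy programmer. My previous article, "How to use PowerShell to improve development efficiency (using Windows Embedded CE as an example)", describes how to use PowerShell to streamline the Windows Mobile and Windows Embedded CE development workflow.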

Source code

Code first, explanation afterwards.

#Global variables
$blogName = "procoder";
$articles = New-Object System.Collections.Generic.List``1[System.Object]

$OutputEncoding = New-Object -typename System.Text.UTF8Encoding;

$webClient = New-Object System.Net.WebClient;
$webClient.Encoding = [System.Text.Encoding]::UTF8;

$regex = New-Object System.Text.RegularExpressions.Regex('<a\s+id="homepage1_HomePageDays_ctl00_DayList_ctl\d+_TitleUrl" class="postTitle2" href="(?<url>[^"]+)">(?<title>.+?)</a>');
$regexDate = New-Object System.Text.RegularExpressions.Regex('http://www\.cnblogs\.com/\w+/archive/(?<year>\d+)/(?<month>\d+)/(?<day>\d+)/.+\.html');

# Analyse the pages
# The page limit is hardcoded; ideally the loop would simply run until no more posts are found.
for($i=1; $i -lt 100; ++$i)
{
    echo "Analysing Page $i ...";
    $html = $webClient.DownloadString("http://www.cnblogs.com/" + $blogName +"/default.html?page=" + $i);
    
    $matches = $regex.Matches($html);
    if($matches.Count -eq 0)
    {
        #No more pages
        $j = $i - 1;
        $count = $articles.Count;
        echo "Finished analysing, total $j pages and $count articles.";
        break;
    }

    foreach ($match in $matches)
    {
        $article = "" | select title, url, date, catalog, type;
        $article.title = $match.Groups["title"].Value;
        $article.url = $match.Groups["url"].Value;
        $article.catalog = "Windows Mobile`r`n Windows Embedded CE";
        $article.type = "博客";
        $date = $regexDate.Matches($article.url);
        if($date.Count -gt 0)
        {
            $article.date = $date[0].Groups["year"].Value + "年" + $date[0].Groups["month"].Value + "月" + $date[0].Groups["day"].Value+ "日";
        }
        else
        {
            echo "Cannot find the date."
        }
        $articles.Add($article);
    }
}

# Generate the report
$head = '<style>
 BODY{font-family:Verdana; background-color:lightblue;}
 TABLE{border-width: 1px;border-style: solid;border-color: black;border-collapse: collapse;}
 TH{font-size:1.3em; border-width: 1px;padding: 2px;border-style: solid;border-color: black;background-color:#FFCCCC}
 TD{border-width: 1px;padding: 2px;border-style: solid;border-color: black;background-color:white}
</style>'
$header = "<H1>博客文章列表</H1>"
$title = "博客文章列表"

$path = Get-Location;
$path = $path.Path + "/report.html";

$articles | 
  Select-Object date, title, url, catalog, type | 
  ConvertTo-HTML -head $head -body $header -title $title | 
  Out-File -FilePath $path -encoding "unicode";

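Below is a screenshot of the script running in PowerShell. For how to set up the PowerShell environment, see the previous article.

Below is the generated article list.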

Code walkthrough

$blogName = "procoder";

This is the name of the blog to collect from. Change it to the name of your own blog if needed; it could also be passed in as a parameter.
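As a minimal sketch of passing the blog name in as a parameter: the function name Get-BlogPageUrl and its parameter names are hypothetical, and only the URL pattern is taken from the script.

```powershell
# Hypothetical helper: builds the listing-page URL for a given blog and page number.
# The URL pattern is the one used in the script; the function and parameter names are made up.
function Get-BlogPageUrl {
    param(
        [string]$BlogName = "procoder",  # default taken from the script
        [int]$Page = 1
    )
    return "http://www.cnblogs.com/" + $BlogName + "/default.html?page=" + $Page
}

# Usage: pass a different blog name instead of editing the script.
$url = Get-BlogPageUrl -BlogName "someblog" -Page 2
$url
```

The same param block could sit at the top of the script file itself, so that the blog name is supplied on the command line.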

$articles = New-Object System.Collections.Generic.List``1[System.Object]

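$articles is the container that holds the collected article information. Note that the type name looks a little odd: the generic arity suffix `1 has to be written with a doubled backtick (``1), because the backtick is PowerShell's escape character.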
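If the doubled backtick feels awkward, quoting the type name and using the bracket form is an alternative that should work in reasonably recent PowerShell versions. A small sketch with made-up sample items:

```powershell
# Quoting the type name avoids the backtick escape entirely.
$list = New-Object 'System.Collections.Generic.List[System.Object]'

# Add returns nothing, so these lines produce no output.
$list.Add("first")
$list.Add(42)

$list.Count
```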

$OutputEncoding = New-Object -typename System.Text.UTF8Encoding;

Because I am running an English-language operating system, the $OutputEncoding preference variable has to be changed to UTF-8.

$webClient = New-Object System.Net.WebClient;
$webClient.Encoding = [System.Text.Encoding]::UTF8;

WebClient does the downloading. Because the downloaded content contains Chinese, its encoding is set to UTF-8.

$regex = New-Object System.Text.RegularExpressions.Regex('<a\s+id="homepage1_HomePageDays_ctl00_DayList_ctl\d+_TitleUrl" class="postTitle2" href="(?<url>[^"]+)">(?<title>.+?)</a>');
$regexDate = New-Object System.Text.RegularExpressions.Regex('http://www\.cnblogs\.com/\w+/archive/(?<year>\d+)/(?<month>\d+)/(?<day>\d+)/.+\.html');

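Regular expressions pick the data out of the downloaded HTML. They are written to fit the content being collected; for example, given the following line of HTML source, the title, the link, and the date can be extracted according to its format.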

<a id="homepage1_HomePageDays_ctl00_DayList_ctl00_TitleUrl" class="postTitle2" href="http://www.cnblogs.com/procoder/archive/2010/05/17/Microsoft_Word_Save_As_PDF.html">[Office 2010 易宝典]怎样直接将Office文档保存为PDF格式？</a>
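To see the named groups at work, here is a small sketch that runs both patterns against a single anchor tag. The URL is the one from the sample line above, while the ASCII title "Sample post title" is made up; the patterns follow the script's, with the dots escaped and the url group restricted to non-quote characters.

```powershell
# One anchor tag in the format of the sample line; the title is a made-up ASCII stand-in.
$sample = '<a id="homepage1_HomePageDays_ctl00_DayList_ctl00_TitleUrl" class="postTitle2" href="http://www.cnblogs.com/procoder/archive/2010/05/17/Microsoft_Word_Save_As_PDF.html">Sample post title</a>'

$regex = New-Object System.Text.RegularExpressions.Regex('<a\s+id="homepage1_HomePageDays_ctl00_DayList_ctl\d+_TitleUrl" class="postTitle2" href="(?<url>[^"]+)">(?<title>.+?)</a>')
$regexDate = New-Object System.Text.RegularExpressions.Regex('http://www\.cnblogs\.com/\w+/archive/(?<year>\d+)/(?<month>\d+)/(?<day>\d+)/.+\.html')

# The first pattern yields the link and the title.
$m = $regex.Match($sample)
$url = $m.Groups["url"].Value
$title = $m.Groups["title"].Value

# The second pattern reads year, month, and day out of the link.
$d = $regexDate.Match($url)
$dateText = $d.Groups["year"].Value + "-" + $d.Groups["month"].Value + "-" + $d.Groups["day"].Value

"$title | $url | $dateText"
```

The date works only because cnblogs encodes it in the post URL; a blog with a different URL scheme would need a different second pattern.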

echo "Analysing Page $i ...";
$html = $webClient.DownloadString("http://www.cnblogs.com/" + $blogName +"/default.html?page=" + $i);

$matches = $regex.Matches($html);
if($matches.Count -eq 0)
{
    #No more pages
    $j = $i - 1;
    $count = $articles.Count;
    echo "Finished analysing, total $j pages and $count articles.";
    break;
}

$webClient.DownloadString downloads the page and stores the HTML source in a string. $regex.Matches($html) then matches the titles, links, and so on. If nothing matches, collection is complete.
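The stop-when-nothing-matches idea can also drive an open-ended loop instead of a fixed page limit. The sketch below is only an illustration: the canned $fakePages table stands in for $webClient.DownloadString so that it runs offline, and the HTML snippets and simplified pattern are made up.

```powershell
# Canned pages stand in for $webClient.DownloadString so the sketch runs offline.
# The HTML snippets and the simplified pattern are made up for illustration.
$fakePages = @{
    1 = '<a class="postTitle2" href="http://example.com/a.html">Post A</a>'
    2 = '<a class="postTitle2" href="http://example.com/b.html">Post B</a>'
}
$linkRegex = New-Object System.Text.RegularExpressions.Regex('<a class="postTitle2" href="(?<url>[^"]+)">(?<title>[^<]+)</a>')
$titles = New-Object 'System.Collections.Generic.List[System.Object]'

$page = 1
while ($true) {
    # A page that does not exist in the table yields an empty string here.
    $html = ""
    if ($fakePages.ContainsKey($page)) { $html = $fakePages[$page] }

    $found = $linkRegex.Matches($html)
    if ($found.Count -eq 0) { break }   # no posts on this page: stop

    foreach ($m in $found) { $titles.Add($m.Groups["title"].Value) }
    $page++
}

$pagesRead = $page - 1
"Read $pagesRead pages, $($titles.Count) titles"
```

Against a live site, an unbounded loop like this relies on the server eventually returning a page without posts, so a generous upper limit may still be a sensible safety net.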

foreach ($match in $matches)
{
    $article = "" | select title, url, date, catalog, type;
    $article.title = $match.Groups["title"].Value;
    $article.url = $match.Groups["url"].Value;
    $article.catalog = "Windows Mobile`r`n Windows Embedded CE";
    $article.type = "博客";
    $date = $regexDate.Matches($article.url);
    if($date.Count -gt 0)
    {
        $article.date = $date[0].Groups["year"].Value + "年" + $date[0].Groups["month"].Value + "月" + $date[0].Groups["day"].Value+ "日";
    }
    else
    {
        echo "Cannot find the date."
    }
    $articles.Add($article);
}

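The year, month, and day are matched from the URL, all the matched information is stored in the $article object, and the object is finally added to the container.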
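The '"" | select ...' line is a compact trick for getting an object with empty properties. A rough equivalent, assuming PowerShell 2.0 or later, is New-Object PSObject -Property. In this sketch the title and the category and type values are made-up samples, and the date pattern is shortened to the /archive/ part of the URL.

```powershell
# Sample link; in the script it comes from the "url" match group.
$url = "http://www.cnblogs.com/procoder/archive/2010/05/17/Microsoft_Word_Save_As_PDF.html"
$regexDate = New-Object System.Text.RegularExpressions.Regex('/archive/(?<year>\d+)/(?<month>\d+)/(?<day>\d+)/')
$d = $regexDate.Match($url)

# Build the record in one step; the property names match the script's.
$article = New-Object PSObject -Property @{
    title   = "Sample post title"   # made-up title
    url     = $url
    date    = $null
    catalog = "Windows Mobile"
    type    = "Blog"
}

# Fill in the date only when the URL actually contains one.
if ($d.Success) {
    $article.date = $d.Groups["year"].Value + "年" + $d.Groups["month"].Value + "月" + $d.Groups["day"].Value + "日"
}

$article.date
```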

# Generate the report
$head = '<style>
 BODY{font-family:Verdana; background-color:lightblue;}
 TABLE{border-width: 1px;border-style: solid;border-color: black;border-collapse: collapse;}
 TH{font-size:1.3em; border-width: 1px;padding: 2px;border-style: solid;border-color: black;background-color:#FFCCCC}
 TD{border-width: 1px;padding: 2px;border-style: solid;border-color: black;background-color:white}
</style>'
$header = "<H1>博客文章列表</H1>"
$title = "博客文章列表"

$path = Get-Location;
$path = $path.Path + "/report.html";

$articles | 
  Select-Object date, title, url, catalog, type | 
  ConvertTo-HTML -head $head -body $header -title $title | 
  Out-File -FilePath $path -encoding "unicode";

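Finally, ConvertTo-HTML converts the contents of the container to HTML and Out-File writes it to a file. Because the content contains Chinese, the encoding is specified as "unicode".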
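A scaled-down sketch of the report step, with two made-up rows in place of the collected articles and a temp-directory output path chosen only for the demo. As far as I understand the ConvertTo-Html documentation, -Title is ignored when -Head is supplied, so the sketch puts the title element into the head fragment instead; "unicode" in Out-File means UTF-16 little-endian.

```powershell
# Two made-up rows standing in for the collected $articles.
$rows = @(
    (New-Object PSObject -Property @{ date = "2010-07-22"; title = "Post A" }),
    (New-Object PSObject -Property @{ date = "2010-05-17"; title = "Post B" })
)

# With -Head supplied, the page title goes into the head fragment itself.
$head = '<title>Article list</title><style>TD{border: 1px solid black;}</style>'
$html = $rows |
    Select-Object date, title |
    ConvertTo-Html -Head $head -Body "<h1>Article list</h1>"

# Write into the temp directory instead of the current location (an assumption for the demo).
$reportPath = Join-Path ([System.IO.Path]::GetTempPath()) "report-sketch.html"
$html | Out-File -FilePath $reportPath -Encoding unicode

$reportPath
```

Select-Object fixes the column order, which matters here because a hashtable passed to -Property does not guarantee one.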

Posted 2010-07-28 06:51 by Jake Lin
