cnsnet

A journey of a thousand miles begins with a single step

A Simple Program for Scraping Posts from Qiushibaike

Posted on 2012-05-25 15:09  cnsnet

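I have been reading Qiushibaike since 2008, and ever since I bought a smartphone I have read it on my phone. The site carries ads at both the top and the bottom of every page. I only want to read the posts, not the ads, and skipping them would also save mobile data. So I wondered whether I could write a program that fetches the posts and strips out everything else, and the code below is the result. I hope the folks at Qiushibaike won't hold it against me; this was just a small experiment.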

Front-end file:

<%@ Page Language="C#" AutoEventWireup="true" CodeBehind="Default.aspx.cs" Inherits="WebTest._Default" EnableViewState="false" %>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  <title>糗事百科</title>
  <style type="text/css">
    body{margin:5px;font:12px arial,simsun;background:#fff;}
    img{border:none;}
    a{text-decoration:none;}
    .qiushi{margin:5px 0;padding:10px;border-bottom:1px solid #ece5d8;}
  </style>
</head>
<body><form id="bodyForm" runat="server"></form></body></html>

Code-behind:

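// The code-behind needs these namespaces:
// System.IO, System.Net, System.Text, System.Text.RegularExpressions
protected void Page_Load(object sender, EventArgs e)
{
      // Base address of the WAP version of the site
      string URI = "http://wap3.qiushibaike.com";
      // "param" carries the site-relative path of the page to fetch
      // (previous page, next page, or a tag); it is appended without validation
      string pageInfo = (Request.QueryString["param"] ?? string.Empty).Trim();
      URI = URI + pageInfo;

      // Fetch the page, keep only the posts, and render them inside the form
      bodyForm.InnerHtml = Server.HtmlDecode(getQiushi(URI));
}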
private string getQiushi(string URI)
{
      string resultstring;
      WebRequest request = WebRequest.Create(URI);
      // Dispose the response and the reader as soon as the page has been read
      using (WebResponse result = request.GetResponse())
      using (StreamReader sr = new StreamReader(result.GetResponseStream()))
      {
        resultstring = sr.ReadToEnd();
      }
      StringBuilder responseString = new StringBuilder();

      Regex regContent = new Regex("<div class=\"qiushi\">(?<content>[\\s\\S]+?)</div>");                            // matches one post
      Regex regComment = new Regex("<p class=\"vote\">(?<content>[\\s\\S]+?)</p>", RegexOptions.IgnoreCase);         // matches the comment line
      Regex regUserInfo = new Regex("<p class=\"user\">(?<content>[\\s\\S]+?)</p>", RegexOptions.IgnoreCase);        // matches the poster info
      Regex regLinks = new Regex("(href=\")(/[^\"\\s]*)(\")", RegexOptions.IgnoreCase);                              // matches site-relative links
      Regex regPrevPage = new Regex("<a href=\"([^\"]*)\">上一页</a>");                                              // matches the "previous page" link
      Regex regNextPage = new Regex("<a href=\"([^\"]*)\">下一页</a>");                                              // matches the "next page" link
      Regex regBlankLine = new Regex(@"[\r\n]");                                                                     // matches line breaks (the old class [\n|\r|\r\n] also removed '|')
      MatchCollection mcContent = regContent.Matches(resultstring);
      Match mcPrevPage = regPrevPage.Match(resultstring);
      Match mcNextPage = regNextPage.Match(resultstring);
      // Emit a paging link only when the source page actually has one
      string prevPage = mcPrevPage.Success ? "<a href=\"?param=" + mcPrevPage.Groups[1].Value + "\">上一页</a>&nbsp;&nbsp;" : string.Empty;
      string nextPage = mcNextPage.Success ? "<a href=\"?param=" + mcNextPage.Groups[1].Value + "\">下一页</a>" : string.Empty;

      for (int i = 0; i < mcContent.Count; i++)
      {
        string content = mcContent[i].Value;
        content = regComment.Replace(content, "");                   // drop the comment line
        content = regUserInfo.Replace(content, "");                  // drop the poster info
        content = regLinks.Replace(content, "href=\"?param=$2\"");   // route site links back through this page
        content = regBlankLine.Replace(content, "");                 // collapse the markup onto one line

        responseString.Append(content);
      }

      // Paging links go at the bottom, centered
      responseString.Append("<div style=\"text-align:center\">" + prevPage);
      responseString.Append(nextPage + "</div>");

      return responseString.ToString();
}
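
The link-rewriting step can be tried on its own. The sketch below is only an illustration: the `LinkRewriter` class and its `Rewrite` method are hypothetical names that do not appear in the original code, and the URL-escaping of the path is an addition the original does not perform. It assumes the site's internal links are always site-relative paths beginning with `/`.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original post.
public static class LinkRewriter
{
    // Matches href="/..." and stops at the closing quote or at whitespace.
    private static readonly Regex SiteRelativeHref =
        new Regex("href=\"(/[^\"\\s]*)\"", RegexOptions.IgnoreCase);

    // Rewrites every site-relative link so that it points back at the
    // current page, passing the original path in the "param" query value.
    public static string Rewrite(string html)
    {
        return SiteRelativeHref.Replace(html, delegate(Match m)
        {
            // Escaping is an assumption added here so that characters such
            // as '?' or '&' in the path survive the round trip.
            return "href=\"?param=" + Uri.EscapeDataString(m.Groups[1].Value) + "\"";
        });
    }
}
```

Links that do not start with `/`, such as absolute links to other sites, are left untouched. Because `Request.QueryString` decodes values automatically, the escaped path arrives in `Page_Load` in its original form.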

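The param parameter in Page_Load carries the path for the previous-page, next-page, and tag links. The basic features all work now and the ads are gone, although comments cannot be viewed.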
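
Because `param` is concatenated straight onto the base address, a value that does not start with `/` could point the request at a different host. The following is a minimal sketch of one way to guard against that. The `ParamGuard` class and its `BuildTargetUri` method are hypothetical and not part of the original code, and the sketch assumes that every link the site emits is a site-relative path.

```csharp
using System;

// Hypothetical helper, not part of the original post.
public static class ParamGuard
{
    // Returns baseUri combined with param when param is a site-relative
    // path on the same host; otherwise falls back to baseUri itself.
    public static string BuildTargetUri(string baseUri, string param)
    {
        if (string.IsNullOrWhiteSpace(param))
            return baseUri;

        param = param.Trim();

        // Accept only paths such as "/hot/page/2"; reject values such as
        // "@other.host/x" or "//other.host/x" that would change the host.
        if (!param.StartsWith("/", StringComparison.Ordinal) ||
            param.StartsWith("//", StringComparison.Ordinal))
            return baseUri;

        Uri root = new Uri(baseUri);
        Uri target;
        // Resolve the path against the base and double-check the host.
        if (!Uri.TryCreate(root, param, out target) || target.Host != root.Host)
            return baseUri;

        return target.AbsoluteUri;
    }
}
```

In `Page_Load`, a call such as `URI = ParamGuard.BuildTargetUri(URI, Request.QueryString["param"]);` could take the place of the plain string concatenation.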