Regex in Practice: Extracting Article Information from the Official Home Page

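The source text below is a snippet of the official home page's current HTML. As an exercise, the code uses lookahead assertions and named groups.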
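As a warm-up, here is a minimal sketch of the two features used throughout this exercise: a named group and a lookahead assertion. The class `LookaheadDemo` and its methods are hypothetical names made up for this illustration; they are not part of the listing further down.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // (?<count>\d+) is a named group; (?=</span>) is a zero-width lookahead that
    // requires "</span>" to follow the digits without making it part of the match.
    private static readonly Regex CountRegex =
        new Regex("<span[^>]*>(?<count>\\d+)(?=</span>)");

    // Returns only the digits captured by the named group, or "" if nothing matches.
    public static string ExtractCount(string html)
    {
        Match m = CountRegex.Match(html);
        return m.Success ? m.Groups["count"].Value : string.Empty;
    }

    // Returns the whole matched text, which stops right before "</span>".
    public static string ExtractMatchedText(string html)
    {
        Match m = CountRegex.Match(html);
        return m.Success ? m.Value : string.Empty;
    }
}
```

For the input `<span id="digg_count_1">52</span>`, `ExtractCount` yields `52`, while `ExtractMatchedText` yields `<span id="digg_count_1">52`: the closing tag is checked by the lookahead but not consumed.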

<article class="post-item" data-post-id="15243338">
                  <section class="post-item-body">
                    <div class="post-item-text">
                      <a class="post-item-title" href="https://www.cnblogs.com/yixing-tuotuo/p/15243338.html" target="_blank">找论坛大神想想办法</a>
                      <p class="post-item-summary">
                        <a href="https://www.cnblogs.com/yixing-tuotuo/">
                          <img src="https://pic.cnblogs.com/face/2144536/20200928112626.png" class="avatar" alt="博主头像" />
                        </a>
                        记一个activiti工作流问题,话不多说,上 问题详情:工作流并行网关执行完流转到下一个节点审批,下一个节点审批驳回到了并行网关中的指定一个节点,然后当这个指定的节点审批通过后,流程却结束了,并没有再次到达部门经理这个节点,这肯定是有问题,不能这么糊里糊涂结束了 以上问题代码补充: 1.驳回是我自 ...
                      </p>
                    </div>
                    <footer class="post-item-foot">
                      <a href="https://www.cnblogs.com/yixing-tuotuo/" class="post-item-author">
                        <span>意行</span>
                      </a>
                      <span class="post-meta-item">
                        <span>2021-09-08 16:30</span>
                      </span>
                      <a id="digg_control_15243338" class="post-meta-item btn " href="javascript:void(0)" onclick="DiggPost('yixing-tuotuo', 15243338, 635050, 1);return false;">
                        <svg width="16" height="16" viewBox="0 0 16 16" xmlns="http://www.w3.org/2000/svg">
                          <use xlink:href="#icon_digg"></use>
                        </svg>
                        <span id="digg_count_15243338">0</span>
                      </a>
                      <a class="post-meta-item btn" href="https://www.cnblogs.com/yixing-tuotuo/p/15243338.html#commentform">
                        <svg width="16" height="16" xmlns="http://www.w3.org/2000/svg">
                          <use xlink:href="#icon_comment"></use>
                        </svg>
                        <span>0</span>
                      </a>
                      <a class="post-meta-item btn" href="https://www.cnblogs.com/yixing-tuotuo/p/15243338.html">
                        <svg width="16" height="16" viewBox="0 0 16 16" xmlns="http://www.w3.org/2000/svg">
                          <use xlink:href="#icon_views"></use>
                        </svg>
                        <span>0</span>
                      </a>
                      <span id="digg_tip_15243338" class="digg-tip" style="color: red"></span>
                    </footer>
                  </section>
                  <figure>
                  </figure>
                </article>
                <article class="post-item" data-post-id="15236427">
                  <section class="post-item-body">
                    <div class="post-item-text">
                      <a class="post-item-title" href="https://www.cnblogs.com/IsThis/p/15236427.html" target="_blank">窗口函数至排序——SQLServer2012可高用</a>
                      <p class="post-item-summary">
                        <a href="https://www.cnblogs.com/IsThis/">
                          <img src="https://pic.cnblogs.com/face/2017598/20200423122151.png" class="avatar" alt="博主头像" />
                        </a>
                        常用到的窗口函数 工作中要常对数据进行分析,分析前要对原始数据中找到想要的格式,数据原本存储的格式不一定时我们想要的,要在基础上进行一定的处理,下面介绍的几种方式是常用的数据排序的集中方式,包含 排名函数(row_number())、排序函数(rank(),dense_rank())、聚合函数(常用 ...
                      </p>
                    </div>
                    <footer class="post-item-foot">
                      <a href="https://www.cnblogs.com/IsThis/" class="post-item-author">
                        <span>就着</span>
                      </a>
                      <span class="post-meta-item">
                        <span>2021-09-08 15:57</span>
                      </span>
                      <a id="digg_control_15236427" class="post-meta-item btn " href="javascript:void(0)" onclick="DiggPost('IsThis', 15236427, 598666, 1);return false;">
                        <svg width="16" height="16" viewBox="0 0 16 16" xmlns="http://www.w3.org/2000/svg">
                          <use xlink:href="#icon_digg"></use>
                        </svg>
                        <span id="digg_count_15236427">0</span>
                      </a>
                      <a class="post-meta-item btn" href="https://www.cnblogs.com/IsThis/p/15236427.html#commentform">
                        <svg width="16" height="16" xmlns="http://www.w3.org/2000/svg">
                          <use xlink:href="#icon_comment"></use>
                        </svg>
                        <span>1</span>
                      </a>
                      <a class="post-meta-item btn" href="https://www.cnblogs.com/IsThis/p/15236427.html">
                        <svg width="16" height="16" viewBox="0 0 16 16" xmlns="http://www.w3.org/2000/svg">
                          <use xlink:href="#icon_views"></use>
                        </svg>
                        <span>52</span>
                      </a>
                      <span id="digg_tip_15236427" class="digg-tip" style="color: red"></span>
                    </footer>
                  </section>
                  <figure>
                  </figure>
                </article>
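Before the full listing, a small sketch of how markup like the sample above could be cut into one chunk per article. It assumes every article body sits in a `<section class="post-item-body">` element, as in the sample; `SectionSplitDemo` and `SplitSections` are hypothetical names used only here.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SectionSplitDemo
{
    // Returns the inner content of every post-item-body section, in document order.
    public static List<string> SplitSections(string html)
    {
        var sections = new List<string>();

        // Singleline lets . match line breaks, so one section may span many lines.
        // The lazy .*? ends at the first point where the lookahead sees </section>,
        // and the \s* on both sides keeps surrounding whitespace out of the capture.
        var regex = new Regex(
            "<section class=\"post-item-body\">\\s*(?<content>.*?)(?=\\s*</section>)",
            RegexOptions.Singleline);

        foreach (Match m in regex.Matches(html))
        {
            sections.Add(m.Groups["content"].Value);
        }
        return sections;
    }
}
```

Without `RegexOptions.Singleline` the pattern would find nothing in the sample, because `.` would stop at the first line break inside each section.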

Code:

using System;
using System.Collections.Generic;
using System.Text;
using System.IO;
using System.Text.RegularExpressions;

namespace PracticeRegex
{
  

    internal class HttpUtility
    {
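        // Gets the first page of data by default. In this exercise the page is read
        // from a local copy saved as Content.xml next to the executable.
        public static string HttpGetHtml()
        {
            return File.ReadAllText(Path.Combine(AppContext.BaseDirectory, "Content.xml"));
        }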

        public static List<Article> GetArticles(string htmlString)
        {
            List<Article> articleList = new List<Article>();
            Regex regex = null;
            Article article = null;

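            // Each article body sits in <section class="post-item-body">...</section>.
            // The lazy .*? stops at the first position where the lookahead
            // (?=\s*</section>) succeeds; Singleline lets . match line breaks.
            regex = new Regex("<section class=\"post-item-body\">(?<content>.*?)(?=\\s*</section>)",
                            RegexOptions.Singleline);

            if (regex.IsMatch(htmlString))
            {
                MatchCollection articles = regex.Matches(htmlString);

                foreach (Match item in articles)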
                {
                    article = new Article();
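                    // Get the recommendation (digg) count. The lookahead (?=</span>) requires
                    // the closing tag to follow the digits without consuming it.
                    regex =
                        new Regex(
                            "<span id=\"digg_count_\\d+\">(?<digNum>\\d+(?=</span>))", RegexOptions.Singleline);
                    article.DiggNum = regex.Match(item.Value).Groups["digNum"].Value;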
                   
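                    // Get the article title and URL; escape characters still need to be removed
                    // from the title. [^\"]* keeps the url group inside the href attribute,
                    // whereas a greedy .* under Singleline could run across the rest of the section.
                    regex = new Regex("<a class=\"post-item-title\" href=\"(?<url>[^\"]*)\" target=\"_blank\">(?<Title>.*?)</a>", RegexOptions.Singleline);
                    Match titleMatch = regex.Match(item.Value);
                    article.AritcleTitle = titleMatch.Groups["Title"].Value;
                    article.AritcleUrl = titleMatch.Groups["url"].Value;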

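                    // Get the author's profile URL and avatar image. \s* matches the line breaks
                    // and runs of spaces between HTML tags; [^\"]* and [^>]* keep each capture
                    // inside its own attribute or tag.
                    regex = new Regex("<a\\s+href=\"(?<href>[^\"]*)\">\\s*<img\\s+src=\"(?<img>[^\"]*)\"[^>]*/>\\s*</a>", RegexOptions.Singleline);
                    Match avatarMatch = regex.Match(item.Value);
                    article.AuthorImg = avatarMatch.Groups["img"].Value;
                    article.AuthorUrl = avatarMatch.Groups["href"].Value;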


                    // Get the article summary: the text inside the summary <p> that follows the
                    // avatar link. The <a ...>...</a> part is skipped with a lazy .*? instead of
                    // being spelled out. The original lookahead (?=[\s\n]*) could match the empty
                    // string, so it did nothing and the capture kept the trailing whitespace;
                    // a lazy group followed by \s*</p> drops that whitespace.
                    regex = new Regex("<p class=\"post-item-summary\">\\s*<a\\s.*?</a>\\s*(?<summary>.*?)\\s*</p>", RegexOptions.Singleline);
                    string summary = regex.Match(item.Value).Groups["summary"].Value;
                    article.AritcleInto = summary;


                    // Get the author name and the publish time. The pattern string itself is split
                    // across two lines with +; \s* matches the line breaks and spaces between tags.
                    // Lazy groups stop at the first closing </span>.
                    regex =
                        new Regex(
                            "<footer class=\"post-item-foot\">\\s*<a[^>]*>\\s*<span[^>]*>(?<publishName>.*?)</span>\\s*</a>\\s*<span class=\"post-meta-item\">" +
                            "\\s*<span[^>]*>(?<publishTime>.*?)</span>\\s*</span>",
                            RegexOptions.Singleline);
                    Match footMatch = regex.Match(item.Value);
                    article.Author = footMatch.Groups["publishName"].Value;
                    article.PublishTime = footMatch.Groups["publishTime"].Value.Trim();

                    // Get the comment count. The comment and view markup look almost the same;
                    // the only difference is the icon id in front: icon_comment or icon_views.
                    regex =
                        new Regex(
                            "<use xlink:href=\"#icon_comment\"></use>\\s*</svg>\\s*<span>(?<comment>\\d+)</span>",
                            RegexOptions.Singleline);
                    article.CommentNum = regex.Match(item.Value).Groups["comment"].Value;

                    // Get the view count. Note the \d+ (not a single \d); otherwise counts with
                    // two or more digits would not be captured.
                    regex = new Regex("<use xlink:href=\"#icon_views\"></use>\\s*</svg>\\s*<span>(?<readNum>\\d+)</span>", RegexOptions.Singleline);
                    article.ReadNum = regex.Match(item.Value).Groups["readNum"].Value;
                    articleList.Add(article);
                }

            }
            return articleList;
        }



        // Removes line feeds, tabs, and carriage returns, and replaces double quotes
        // with single quotes.
        public static string ClearSpecialTag(string htmlString)
        {
            // The patterns are literal characters, so plain string replacement is enough;
            // RegexOptions.IgnoreCase had no effect on them.
            return htmlString
                .Replace("\n", "")
                .Replace("\t", "")
                .Replace("\r", "")
                .Replace("\"", "'");
        }
    }

    public class Article
    {
        /// <summary>
        /// Article title
        /// </summary>
        public string AritcleTitle { get; set; }
        /// <summary>
        /// Article URL
        /// </summary>
        public string AritcleUrl { get; set; }
        /// <summary>
        /// Article summary
        /// </summary>
        public string AritcleInto { get; set; }
        /// <summary>
        /// Author name
        /// </summary>
        public string Author { get; set; }
        /// <summary>
        /// Author profile URL
        /// </summary>
        public string AuthorUrl { get; set; }
        /// <summary>
        /// Author avatar image
        /// </summary>
        public string AuthorImg { get; set; }
        /// <summary>
        /// Publish time
        /// </summary>
        public string PublishTime { get; set; }
        /// <summary>
        /// Recommendation (digg) count
        /// </summary>
        public string DiggNum { get; set; }

        /// <summary>
        /// Comment count
        /// </summary>
        public string CommentNum { get; set; }
        /// <summary>
        /// View count
        /// </summary>
        public string ReadNum { get; set; }

    }
}
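The title comment in `GetArticles` notes that escape characters still need to be removed, and every field of `Article` is stored as a raw string. One possible follow-up step is sketched below, under two assumptions: that the escapes in question are HTML entities, and that the publish time always has the `yyyy-MM-dd HH:mm` shape seen in the sample. The class `ArticlePostProcessing` and its methods are hypothetical and not part of the listing above.

```csharp
using System;
using System.Globalization;
using System.Net;

public static class ArticlePostProcessing
{
    // Decodes HTML entities such as &amp; or &lt; in a captured title.
    public static string DecodeTitle(string rawTitle)
    {
        return WebUtility.HtmlDecode(rawTitle).Trim();
    }

    // Parses a captured count; returns 0 when the capture is empty or not all digits.
    public static int ParseCount(string rawCount)
    {
        return int.TryParse(rawCount, NumberStyles.None, CultureInfo.InvariantCulture, out int n)
            ? n
            : 0;
    }

    // Parses a publish time such as "2021-09-08 16:30"; returns null on any other shape.
    public static DateTime? ParsePublishTime(string rawTime)
    {
        return DateTime.TryParseExact(rawTime, "yyyy-MM-dd HH:mm", CultureInfo.InvariantCulture,
                                      DateTimeStyles.None, out DateTime t)
            ? t
            : (DateTime?)null;
    }
}
```

Keeping these conversions separate from the regexes means a failed match (an empty capture) degrades to `0` or `null` rather than throwing inside the extraction loop.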

 

posted @ 2021-09-09 11:27  LearningAlbum