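Regex in Practice: Extracting Article Information from the Official Home Page
The source below is an excerpt of the current official home page HTML. As an exercise, the extraction uses lookahead assertions and capture groups.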
<article class="post-item" data-post-id="15243338"> <section class="post-item-body"> <div class="post-item-text"> <a class="post-item-title" href="https://www.cnblogs.com/yixing-tuotuo/p/15243338.html" target="_blank">找论坛大神想想办法</a> <p class="post-item-summary"> <a href="https://www.cnblogs.com/yixing-tuotuo/"> <img src="https://pic.cnblogs.com/face/2144536/20200928112626.png" class="avatar" alt="博主头像" /> </a> 记一个activiti工作流问题,话不多说,上 问题详情:工作流并行网关执行完流转到下一个节点审批,下一个节点审批驳回到了并行网关中的指定一个节点,然后当这个指定的节点审批通过后,流程却结束了,并没有再次到达部门经理这个节点,这肯定是有问题,不能这么糊里糊涂结束了 以上问题代码补充: 1.驳回是我自 ... </p> </div> <footer class="post-item-foot"> <a href="https://www.cnblogs.com/yixing-tuotuo/" class="post-item-author"> <span>意行</span> </a> <span class="post-meta-item"> <span>2021-09-08 16:30</span> </span> <a id="digg_control_15243338" class="post-meta-item btn " href="javascript:void(0)" onclick="DiggPost('yixing-tuotuo', 15243338, 635050, 1);return false;"> <svg width="16" height="16" viewBox="0 0 16 16" xmlns="http://www.w3.org/2000/svg"> <use xlink:href="#icon_digg"></use> </svg> <span id="digg_count_15243338">0</span> </a> <a class="post-meta-item btn" href="https://www.cnblogs.com/yixing-tuotuo/p/15243338.html#commentform"> <svg width="16" height="16" xmlns="http://www.w3.org/2000/svg"> <use xlink:href="#icon_comment"></use> </svg> <span>0</span> </a> <a class="post-meta-item btn" href="https://www.cnblogs.com/yixing-tuotuo/p/15243338.html"> <svg width="16" height="16" viewBox="0 0 16 16" xmlns="http://www.w3.org/2000/svg"> <use xlink:href="#icon_views"></use> </svg> <span>0</span> </a> <span id="digg_tip_15243338" class="digg-tip" style="color: red"></span> </footer> </section> <figure> </figure> </article> <article class="post-item" data-post-id="15236427"> <section class="post-item-body"> <div class="post-item-text"> <a class="post-item-title" href="https://www.cnblogs.com/IsThis/p/15236427.html" target="_blank">窗口函数至排序——SQLServer2012可高用</a> <p class="post-item-summary"> <a 
href="https://www.cnblogs.com/IsThis/"> <img src="https://pic.cnblogs.com/face/2017598/20200423122151.png" class="avatar" alt="博主头像" /> </a> 常用到的窗口函数 工作中要常对数据进行分析,分析前要对原始数据中找到想要的格式,数据原本存储的格式不一定时我们想要的,要在基础上进行一定的处理,下面介绍的几种方式是常用的数据排序的集中方式,包含 排名函数(row_number())、排序函数(rank(),dense_rank())、聚合函数(常用 ... </p> </div> <footer class="post-item-foot"> <a href="https://www.cnblogs.com/IsThis/" class="post-item-author"> <span>就着</span> </a> <span class="post-meta-item"> <span>2021-09-08 15:57</span> </span> <a id="digg_control_15236427" class="post-meta-item btn " href="javascript:void(0)" onclick="DiggPost('IsThis', 15236427, 598666, 1);return false;"> <svg width="16" height="16" viewBox="0 0 16 16" xmlns="http://www.w3.org/2000/svg"> <use xlink:href="#icon_digg"></use> </svg> <span id="digg_count_15236427">0</span> </a> <a class="post-meta-item btn" href="https://www.cnblogs.com/IsThis/p/15236427.html#commentform"> <svg width="16" height="16" xmlns="http://www.w3.org/2000/svg"> <use xlink:href="#icon_comment"></use> </svg> <span>1</span> </a> <a class="post-meta-item btn" href="https://www.cnblogs.com/IsThis/p/15236427.html"> <svg width="16" height="16" viewBox="0 0 16 16" xmlns="http://www.w3.org/2000/svg"> <use xlink:href="#icon_views"></use> </svg> <span>52</span> </a> <span id="digg_tip_15236427" class="digg-tip" style="color: red"></span> </footer> </section> <figure> </figure> </article>
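As a warm-up for the two features this exercise relies on, here is a minimal sketch of a named group combined with a lookahead, run against a fragment shaped like the digg counter in the HTML above. The class name `LookaheadDemo` and its two methods are hypothetical names chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
internal static class LookaheadDemo
{
    // (?<digNum>\d+) is a named group.
    // (?=</span>) is a zero-width lookahead: "</span>" must follow the digits,
    // but it is not consumed and is not part of the match.
    private const string Pattern = "<span id=\"digg_count_\\d+\">(?<digNum>\\d+)(?=</span>)";

    // Returns the captured count, or an empty string when nothing matches.
    public static string ExtractDiggCount(string html)
    {
        Match match = Regex.Match(html, Pattern);
        return match.Success ? match.Groups["digNum"].Value : string.Empty;
    }

    // Returns the whole matched text, which stops before "</span>".
    public static string MatchedText(string html)
    {
        return Regex.Match(html, Pattern).Value;
    }
}
```

For the input `<span id="digg_count_15243338">0</span>`, `ExtractDiggCount` returns `0` and `MatchedText` returns `<span id="digg_count_15243338">0`: the closing tag is checked but left out of the match.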
Code:
using System;
using System.Collections.Generic;
using System.Text;
using System.IO;
using System.Text.RegularExpressions;

namespace PracticeRegex
{
    internal class HttpUtility
    {
        // Returns the first page of data, read from the saved Content.xml file.
        public static string HttpGetHtml()
        {
            return File.ReadAllText(Path.Combine(AppContext.BaseDirectory, "Content.xml"));
        }

        public static List<Article> GetArticles(string htmlString)
        {
            List<Article> articleList = new List<Article>();

            // One match per article body. The lookahead stops the lazy group
            // before the closing </section> without consuming it.
            Regex sectionRegex = new Regex(
                "<section class=\"post-item-body\">(?<content>.*?)(?=\\s*</section>)",
                RegexOptions.Singleline);

            foreach (Match item in sectionRegex.Matches(htmlString))
            {
                string body = item.Value;
                Article article = new Article();
                Match match;

                // Recommendation (digg) count.
                match = Regex.Match(body,
                    "<span id=\"digg_count_\\d+\">(?<digNum>\\d+(?=</span>))",
                    RegexOptions.Singleline);
                article.DiggNum = match.Groups["digNum"].Value;

                // Article title and URL. Escape characters in the title still need to be removed.
                // [^\"]* keeps the URL inside its own attribute; the original greedy .* only
                // worked because of backtracking.
                match = Regex.Match(body,
                    "<a class=\"post-item-title\" href=\"(?<url>[^\"]*)\" target=\"_blank\">(?<Title>.*?)</a>",
                    RegexOptions.Singleline);
                article.AritcleTitle = match.Groups["Title"].Value;
                article.AritcleUrl = match.Groups["url"].Value;

                // Author avatar and profile URL. \s matches the line breaks and
                // repeated spaces between HTML tags and attributes.
                match = Regex.Match(body,
                    "<a\\s+href=\"(?<href>[^\"]*)\">\\s*<img\\s+src=\"(?<img>[^\"]*)\"[^>]*/>\\s*</a>",
                    RegexOptions.Singleline);
                article.AuthorImg = match.Groups["img"].Value;
                article.AuthorUrl = match.Groups["href"].Value;

                // Article summary: skip the avatar link inside <p class="post-item-summary">
                // with a lazy .*? and capture the text that follows it. The trailing \s*
                // drops the whitespace before </p>; the original lookahead (?=[\s\n]*)
                // always succeeded and therefore trimmed nothing.
                match = Regex.Match(body,
                    "<p class=\"post-item-summary\">\\s*<a\\s.*?</a>\\s*(?<summary>.*?)\\s*</p>",
                    RegexOptions.Singleline);
                article.AritcleInto = match.Groups["summary"].Value;

                // Author name and publish time. The pattern string is split across
                // source lines with +; \s* also matches line breaks and spaces in the HTML.
                match = Regex.Match(body,
                    "<footer class=\"post-item-foot\">\\s*<a.*?>\\s*<span.*?>(?<publishName>.*?)</span>\\s*</a>\\s*<span class=\"post-meta-item\">"
                    + "\\s*<span.*?>(?<publishTime>.*?)</span>\\s*</span>",
                    RegexOptions.Singleline);
                article.Author = match.Groups["publishName"].Value;
                article.PublishTime = match.Groups["publishTime"].Value.Trim();

                // Comment count. The comment and view blocks look almost identical;
                // the only difference is the icon id before them (icon_comment or icon_views).
                match = Regex.Match(body,
                    "<use xlink:href=\"#icon_comment\"></use>\\s*</svg>\\s*<span>(?<comment>\\d+)</span>",
                    RegexOptions.Singleline);
                article.CommentNum = match.Groups["comment"].Value;

                // View count. Use \d+ rather than \d, otherwise counts with two or more
                // digits are not captured.
                match = Regex.Match(body,
                    "<use xlink:href=\"#icon_views\"></use>\\s*</svg>\\s*<span>(?<readNum>\\d+)</span>",
                    RegexOptions.Singleline);
                article.ReadNum = match.Groups["readNum"].Value;

                articleList.Add(article);
            }

            return articleList;
        }

        // Removes line breaks and tabs and turns double quotes into single quotes.
        // Plain string replacement is enough here because every pattern is a literal character.
        public static string ClearSpecialTag(string htmlString)
        {
            return htmlString
                .Replace("\n", "")
                .Replace("\t", "")
                .Replace("\r", "")
                .Replace("\"", "'");
        }
    }

    public class Article
    {
        /// <summary>
        /// Article title
        /// </summary>
        public string AritcleTitle { get; set; }

        /// <summary>
        /// Article URL
        /// </summary>
        public string AritcleUrl { get; set; }

        /// <summary>
        /// Article summary
        /// </summary>
        public string AritcleInto { get; set; }

        /// <summary>
        /// Author name
        /// </summary>
        public string Author { get; set; }

        /// <summary>
        /// Author profile URL
        /// </summary>
        public string AuthorUrl { get; set; }

        /// <summary>
        /// Author avatar image
        /// </summary>
        public string AuthorImg { get; set; }

        /// <summary>
        /// Publish time
        /// </summary>
        public string PublishTime { get; set; }

        /// <summary>
        /// Recommendation (digg) count
        /// </summary>
        public string DiggNum { get; set; }

        /// <summary>
        /// Comment count
        /// </summary>
        public string CommentNum { get; set; }

        /// <summary>
        /// View count
        /// </summary>
        public string ReadNum { get; set; }
    }
}
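Several patterns in the listing capture attribute values with `.*` under `RegexOptions.Singleline`. The sketch below shows why that is fragile as soon as a fragment contains more than one link, and how a lazy quantifier or a negated character class avoids the problem. The class `GreedyVersusLazy`, its constants, and the sample markup are hypothetical and not taken from the page.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
internal static class GreedyVersusLazy
{
    // Greedy: .* runs to the end of the input and backtracks to the LAST "\">".
    public const string Greedy = "<a href=\"(?<url>.*)\">";

    // Lazy: .*? stops at the FIRST "\">".
    public const string Lazy = "<a href=\"(?<url>.*?)\">";

    // Negated class: the capture can never cross a double quote.
    public const string NegatedClass = "<a href=\"(?<url>[^\"]*)\">";

    // Returns the "url" group of the first match, or an empty string.
    public static string FirstHref(string html, string pattern)
    {
        Match match = Regex.Match(html, pattern, RegexOptions.Singleline);
        return match.Success ? match.Groups["url"].Value : string.Empty;
    }
}
```

With the input `<a href="/a">A</a> <a href="/b">B</a>`, `FirstHref(html, GreedyVersusLazy.Greedy)` returns `/a">A</a> <a href="/b`, while both `Lazy` and `NegatedClass` return `/a`. The greedy form happens to work in the article parser only when each matched section contains a single candidate for the text that follows the group.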