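Recently I have been displaying text that contains HTML code in a Grid, so I needed to strip out the HTML tags. I tried several approaches and none of them was quite ideal, so here is what I learned.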

1. Use regular expressions.


using System.Text.RegularExpressions;

public static string RemoveHtmlTags(string strHtml)
{
    // Patterns are applied in order; aryRep[i] is the replacement for aryReg[i].
    string[] aryReg = {
        @"<script[^>]*?>.*?</script>",
        @"<(\/\s*)?!?((\w+:)?\w+)(\w+-?\w+(\s*=?\s*(([""'])(\\[""'tbnr]|[^\7])*?\7|\w+)|.{0})|\s)*?(\/\s*)?>",
        "<\\s*/{0,1}\\s*(font)[^>]*>",
        @"([\r\n])[\s]+",
        @"&(quot|#34);",
        @"&(amp|#38);",
        @"&(lt|#60);",
        @"&(gt|#62);",
        @"&(nbsp|#160);",
        @"&(iexcl|#161);",
        @"&(cent|#162);",
        @"&(pound|#163);",
        @"&(copy|#169);",
        @"&#(\d+);",
        @"-->",
        @"<!--.*\n"
    };

    string[] aryRep = {
        "",
        "",
        "",
        "",
        "\"",
        "&",
        "<",
        ">",
        " ",
        "\xa1", // chr(161)
        "\xa2", // chr(162)
        "\xa3", // chr(163)
        "\xa9", // chr(169)
        "",
        "\r\n",
        ""
    };

    string strOutput = strHtml;
    for (int i = 0; i < aryReg.Length; i++)
    {
        Regex regex = new Regex(aryReg[i], RegexOptions.IgnoreCase);
        strOutput = regex.Replace(strOutput, aryRep[i]);
    }

    // string.Replace returns a new string, so the result must be assigned back.
    strOutput = strOutput.Replace("<", "");
    strOutput = strOutput.Replace(">", "");
    strOutput = strOutput.Replace("\r\n", "");
    return strOutput;
}
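The drawback of this approach is that the regular expressions run slowly. With a large amount of data they consume a lot of system resources and can make the program stop responding.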
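Part of that cost may come from building sixteen Regex objects on every call. The following is only a sketch of one possible mitigation, not the method used above: it assumes the patterns can be created once as static fields with RegexOptions.Compiled. The class name HtmlTagStripper, the simplified tag pattern `<[^>]+>`, and the use of WebUtility.HtmlDecode in place of the entity table are all assumptions made for illustration, and RegexOptions.Singleline is added so that script blocks spanning several lines are matched.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original post.
public static class HtmlTagStripper
{
    // Created once per process rather than once per call.
    // RegexOptions.Compiled trades a slower first use for faster matching.
    private static readonly Regex ScriptBlock = new Regex(
        @"<script[^>]*?>.*?</script>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Compiled);

    // Simplified stand-in for the long tag pattern: anything between '<' and '>'.
    private static readonly Regex AnyTag = new Regex(
        @"<[^>]+>", RegexOptions.Compiled);

    public static string Strip(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        string text = ScriptBlock.Replace(html, "");  // drop scripts together with their content
        text = AnyTag.Replace(text, "");              // drop the remaining tags
        return WebUtility.HtmlDecode(text);           // turn &amp;, &lt;, ... back into characters
    }
}
```

Calling HtmlTagStripper.Strip on each cell value before binding it to the Grid would reuse the same two Regex instances for every row. Whether this is fast enough depends on the data, so it is worth measuring before relying on it.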

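2. Use the InnerText property of WebBrowser.
I don't think this approach is a good one, and it should not be used heavily, because WebBrowser is ultimately a wrapper around a COM component.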

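3. Write an HTML parser and parse the HTML yourself, so you can strip tags however you like.
I did not write one myself. I took one from CodePlex and used it, and it feels more efficient than either of the two approaches above. Here is the link for anyone who needs it:
http://www.codeplex.com/htmlagilitypack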
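To give a feel for what "parsing it yourself" can mean, here is a minimal hand-written sketch of a single-pass scanner. It is not code from HtmlAgilityPack and is far less complete than a real parser: the class name SimpleTagScanner is hypothetical, entities are left undecoded, the contents of script and style elements are kept as text, and every '<' is assumed to start a tag.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of HtmlAgilityPack.
public static class SimpleTagScanner
{
    // Walks the input once and copies only the characters that lie outside tags.
    public static string StripTags(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        var output = new StringBuilder(html.Length);
        int i = 0;
        while (i < html.Length)
        {
            char c = html[i];
            if (c != '<')
            {
                output.Append(c);
                i++;
                continue;
            }

            // Comments are skipped as a unit so quotes inside them are not misread.
            if (string.CompareOrdinal(html, i, "<!--", 0, 4) == 0)
            {
                int end = html.IndexOf("-->", i + 4, StringComparison.Ordinal);
                i = end < 0 ? html.Length : end + 3;
                continue;
            }

            // Ordinary tag: advance past the closing '>',
            // ignoring any '>' that appears inside a quoted attribute value.
            char quote = '\0';
            i++;
            while (i < html.Length)
            {
                char t = html[i++];
                if (quote != '\0')
                {
                    if (t == quote) quote = '\0';   // end of the quoted value
                }
                else if (t == '"' || t == '\'')
                {
                    quote = t;                      // start of a quoted value
                }
                else if (t == '>')
                {
                    break;                          // end of the tag
                }
            }
        }
        return output.ToString();
    }
}
```

Because the input is read once and no Regex objects are involved, the work grows linearly with the length of the text. A full parser such as the one linked above handles many cases this sketch ignores.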

Here is some code from that project as well.
HtmlAgilityPack
posted on 2007-04-27 12:17 by fly fly