Design and Implementation of a .NET-Based Word Segmenter V7.0: Porting to B/S (Final Part)

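The previous article covered how my word segmenter performed once a database was introduced as the storage medium for its dictionary. To make the segmenter easier to demonstrate, this article describes how I ported it to a browser/server (B/S) front end.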

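In "Design and Implementation of a .NET-Based Word Segmenter V3.0: Comparative Testing and Changes" I surveyed a number of B/S and C/S segmentation programs. While porting to B/S I drew on their feature sets and distilled the four most important functions (tagging is not a focus of this segmenter, so it is excluded):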

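1. Segmentation with "\" as the delimiter;

2. Segmentation with " " as the delimiter (the so-called "Peking University standard");

3. Removal of punctuation from the text;

4. In an earlier version I experimented with stripping text formatting. I dropped it after the comparative tests, but it is offered here as an auxiliary option called "Remove text formatting", which removes superfluous line breaks, spaces, and similar characters from the text.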
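As a rough illustration of how these four options relate to one another, the sketch below applies them to a list of words that has already been segmented. The function name `renderSegments`, its option names, and the small punctuation sample are assumptions made for this example; they are not taken from the segmenter itself.

```javascript
// Hypothetical helper: render a list of segmented words according to the four options.
// The punctuation set is a small illustrative sample, not the program's full list.
var SAMPLE_PUNCTUATION = /^[，。！？、；：]+$/;

function renderSegments(words, options) {
  var out = words.slice();
  if (options.stripPunctuation) {
    // Option 3: drop tokens that consist only of punctuation.
    out = out.filter(function (w) { return !SAMPLE_PUNCTUATION.test(w); });
  }
  if (options.stripFormat) {
    // Option 4: drop whitespace-only tokens (spaces, full-width spaces, line breaks).
    out = out.filter(function (w) { return !/^[ \r\n\u3000]+$/.test(w); });
  }
  // Options 1 and 2: choose the delimiter written after every word.
  var delimiter = options.peking ? " " : "\\";
  return out.map(function (w) { return w + delimiter; }).join("");
}

console.log(renderSegments(["我", "爱", "北京", "。"], {}));
console.log(renderSegments(["我", "爱", "北京", "。"], { peking: true, stripPunctuation: true }));
```

The first call writes a "\" after every token, punctuation included; the second drops the full stop and separates the remaining words with spaces.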

Code listings:

1. HTML code:

<div id="header">
</div>
<div id="logo">
    <div id="imgLogo">
    </div>
</div>
<div id="container">
    <div id="bk_container">
        <label id="inputMessage">Enter the text to segment:</label>
        <textarea id="tbContent" cols="85" rows="20"></textarea>
        <div id="option">
            <input type="checkbox" id="ckPunc"/> Remove punctuation
            <input type="checkbox" id="ckFormat"/> Remove text formatting
            <input type="checkbox" id="ckPeking"/> Use Peking University standard output
            <img src="img/bt_splitter.gif" id="btnSubmit" alt="Segment" onclick="getResult();"/>
        </div>
        <label id="result">Segmentation result:</label>
        <textarea id="tbResult" cols="85" rows="20"></textarea>
    </div>
</div>
<div id="footer">
    <div id="footerImg">
    </div>
</div>
<script src="js/ajax.js" type="text/javascript"></script>
<script src="js/segment.js" type="text/javascript"></script>

Rendered result (the page's header and footer are omitted for confidentiality reasons):

2. ajax.js code:

var MyxAjax =
{
    // Requests issued through send(), kept so that stale references can be dropped.
    arrXmlHttp: [],

    // Create a request object: the native XMLHttpRequest where available,
    // otherwise the ActiveX objects used by old versions of Internet Explorer.
    getXmlHttp: function() {
        if (typeof XMLHttpRequest != 'undefined') {
            return new XMLHttpRequest();
        }
        try {
            return new ActiveXObject("Msxml2.XMLHTTP");
        }
        catch (e) {
            try {
                return new ActiveXObject("Microsoft.XMLHTTP");
            }
            catch (E) {
                return false;
            }
        }
    },

    // Drop the references to previously issued requests (they are not aborted).
    clearXmlHttp: function() {
        for (var i = 0, ln = this.arrXmlHttp.length; i < ln; i++) {
            if (this.arrXmlHttp[i]) {
                try {
                    delete this.arrXmlHttp[i];
                } catch (e) { }
            }
        }
        this.arrXmlHttp = [];
    },

    // Send an asynchronous request. The options object may override
    // method, data, successCallback and failCallback.
    send: function(url, options) {
        var _options = {
            method: "GET",
            data: null,
            successCallback: function() { },
            failCallback: function() { }
        };
        for (var property in options) {
            _options[property] = options[property];
        }
        var xmlHttp = this.getXmlHttp();
        if (!xmlHttp) {
            // No request object could be created in this browser.
            _options.failCallback(null, _options.data);
            return null;
        }
        xmlHttp.open(_options.method, url, true);
        if (_options.method == "POST") {
            xmlHttp.setRequestHeader('Content-Type', 'application/x-www-form-urlencoded');
        }
        xmlHttp.onreadystatechange = function() {
            if (xmlHttp.readyState == 4) {
                if (xmlHttp.status >= 200 && xmlHttp.status < 300) {
                    _options.successCallback(xmlHttp, _options.data);
                }
                else {
                    _options.failCallback(xmlHttp, _options.data);
                }
            }
        };
        this.clearXmlHttp();
        xmlHttp.send(_options.data);
        this.arrXmlHttp.push(xmlHttp);
        return xmlHttp;
    }
};

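During testing I found that every B/S-based segmenter I tried reloaded the whole page on each segmentation request. I felt this hurt the user experience, so the B/S port of my segmenter uses Ajax to return results without a page reload.

The ajax.js code above is my own wrapper object for Ajax operations. It covers the common cases: creating the request object, sending GET or POST requests, and dispatching success and failure callbacks.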
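To illustrate the request flow that such a wrapper hides, here is a minimal sketch of an asynchronous POST whose result is delivered to a callback rather than through a page reload. The function `postForm`, its `createXhr` parameter, and the stand-in `createFakeXhr` are hypothetical names introduced for this example; passing the request factory in is only a device for exercising the flow outside a browser.

```javascript
// Hypothetical sketch: asynchronous form POST with success and failure callbacks.
// createXhr is an assumed parameter; in a browser it would be
//   function () { return new XMLHttpRequest(); }
function postForm(createXhr, url, body, onSuccess, onFail) {
  var xhr = createXhr();
  xhr.open("POST", url, true); // true = asynchronous, so the page stays responsive
  xhr.setRequestHeader("Content-Type", "application/x-www-form-urlencoded");
  xhr.onreadystatechange = function () {
    if (xhr.readyState !== 4) return; // 4 = request finished
    if (xhr.status >= 200 && xhr.status < 300) {
      onSuccess(xhr.responseText);
    } else {
      onFail(xhr.status);
    }
  };
  xhr.send(body);
  return xhr;
}

// A stand-in that mimics the few XMLHttpRequest members used above,
// so the sketch can be run without a server. It completes immediately.
function createFakeXhr(status, responseText) {
  return {
    readyState: 0, status: 0, responseText: "", headers: {},
    open: function (method, url) { this.method = method; this.url = url; },
    setRequestHeader: function (name, value) { this.headers[name] = value; },
    send: function (body) {
      this.sentBody = body;
      this.readyState = 4;
      this.status = status;
      this.responseText = responseText;
      this.onreadystatechange();
    }
  };
}

postForm(function () { return createFakeXhr(200, "我\\爱\\北京\\"); },
         "Handler.ashx", "content=test",
         function (text) { console.log("segmented:", text); },
         function (status) { console.log("failed:", status); });
```

Only the text area that shows the result needs to change when the success callback fires; the rest of the page is left alone.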

3. segment.js code:

// Shorthand for document.getElementById.
function $$(id) {
    return document.getElementById(id);
}

// Read the input text and post it to the handler.
function getResult() {
    if ($$('tbContent').value == "") {
        alert("Please enter the text to segment!");
        return;
    }
    var url = "Handler.ashx";
    var params = "content=" + encodeURIComponent($$('tbContent').value);
    MyxAjax.send(url, { method: "POST", data: params, successCallback: showResult });
}

// Apply the selected options to the "\"-delimited result and display it.
function showResult(req, data) {
    var result = req.responseText;
    if ($$('ckPunc').checked)
        // A run of punctuation sits between two "\" delimiters; collapse it to a single "\".
        result = result.replace(/\\[·～！◎￥％…※×（）－—＝【『』】÷§：；”“‘’，《。》、？～｀！￥％＾＆＊（）＿－＝［｛］｝＼｜；：＂＇，．／＜＞？]+\\/g, '\\');
    if ($$('ckFormat').checked)
        // Remove spaces, full-width spaces and line breaks.
        result = result.replace(/[ \r\n\u3000]+/g, '');
    if ($$('ckPeking').checked)
        // Peking University standard: words are separated by spaces instead of "\".
        result = result.replace(/\\/g, ' ');
    $$('tbResult').value = result;
}

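Two lines of this code deserve attention:

encodeURIComponent($$('tbContent').value): percent-encodes the text to be segmented, so that special characters such as "&", "=" and "+" cannot corrupt the form-encoded request body.

MyxAjax.send(url, { method: "POST", data: params, successCallback: showResult }): the value must be sent with POST rather than GET, for the following three reasons: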

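(1) With POST the data travels in the request body and does not appear in the URL, whereas GET places it in the URL;

(2) POST can carry far more data, on the order of megabytes. The exact ceiling is a server setting (for example ASP.NET's maxRequestLength, which defaults to 4096 KB), whereas GET is bounded by the maximum URL length (see the official documentation);

(3) With regard to search engine indexing, pages that take their parameters via POST are indexed well in my experience, while pages that depend on GET parameters fare differently: Baidu indexes almost none of them and Google does better, although this of course also depends on the search engine itself.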

In summary, considering security, the unknown length of the text users will submit, and SEO, I chose POST for passing the parameter.
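The encoding step and the length argument can both be checked with a few lines of JavaScript. This is only a sketch: `buildBody` and `fitsInGetUrl` are hypothetical helper names, and the 2,083-character ceiling is an assumed figure (the limit historically documented for Internet Explorer) used purely for illustration.

```javascript
// Assumed figure for illustration: Internet Explorer's historical URL length cap.
var ASSUMED_MAX_URL_LENGTH = 2083;

// Build the form-encoded body; encodeURIComponent escapes "&", "=", spaces
// and non-ASCII text so that they cannot be mistaken for field separators.
function buildBody(text) {
  return "content=" + encodeURIComponent(text);
}

// Hypothetical helper: would this text still fit if it were sent as a GET query string?
function fitsInGetUrl(baseUrl, text, maxUrlLength) {
  var url = baseUrl + "?" + buildBody(text);
  return url.length <= maxUrlLength;
}

console.log(buildBody("a&b=c"));   // content=a%26b%3Dc
console.log(buildBody("中文"));     // content=%E4%B8%AD%E6%96%87

// Every Chinese character in this range becomes nine characters once encoded,
// so even a short paragraph outgrows the assumed URL limit.
var longText = "词".repeat(300);
console.log(buildBody(longText).length);                                      // 2708
console.log(fitsInGetUrl("Handler.ashx", longText, ASSUMED_MAX_URL_LENGTH));  // false
```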

4. Handler.ashx code:

public void ProcessRequest(HttpContext context)
{
    // The dictionary and the punctuation lists are text files under ~/Resources.
    WebSplitter s = new WebSplitter(
        context.Server.MapPath("~/Resources/Vocabulary.txt"),
        context.Server.MapPath("~/Resources/ChinesePunctuations.txt"),
        context.Server.MapPath("~/Resources/EnglishPunctuations.txt"));

    // Request["content"] is null when the field is missing; fall back to an empty string.
    string content = context.Request["content"] ?? string.Empty;
    s.InputStr = HttpUtility.HtmlDecode(content);

    string result = s.GetResult();
    context.Response.Write(result);
}

5. Splitter.cs code:

using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public class WebSplitter
{
    #region Fields and properties

    /// <summary>
    /// Text to be segmented
    /// </summary>
    public string InputStr { get; set; }
    /// <summary>
    /// Updated: segmentation dictionary (HashSet)
    /// </summary>
    HashSet<string> _dict = new HashSet<string>();
    /// <summary>
    /// Chinese punctuation delimiters
    /// </summary>
    string _chineseSpiltters;
    /// <summary>
    /// English punctuation delimiters
    /// </summary>
    string _englishSpiltters;
    /// <summary>
    /// Regular expression for non-Chinese characters (digits, Latin letters and English punctuation)
    /// </summary>
    Regex _nonChineseRegex;
    /// <summary>
    /// Regular expression for Chinese characters
    /// </summary>
    Regex _chineseRegex;
    /// <summary>
    /// Regular expression for Chinese punctuation
    /// </summary>
    Regex _chinesePunctuationRegex;
    /// <summary>
    /// Updated: whether to strip formatting (spaces and line breaks) from the text
    /// before segmentation; when false, the original formatting is kept
    /// </summary>
    public bool IsFormat { get; set; }
    /// <summary>
    /// Updated: whether to remove punctuation from the text
    /// </summary>
    public bool IsStripPunctuation { get; set; }
    /// <summary>
    /// Updated: whether to output in the Peking University standard (space-delimited);
    /// the default is the 973 standard (delimited by "\")
    /// </summary>
    public bool IsPekingStandard { get; set; }

    #endregion

    #region Constructors

    /// <summary>
    /// Load the dictionary and punctuation files from the current working directory
    /// </summary>
    public WebSplitter()
        : this("Vocabulary.txt", "ChinesePunctuations.txt", "EnglishPunctuations.txt")
    {
    }

    /// <summary>
    /// Load the dictionary and punctuation files from the given paths
    /// </summary>
    public WebSplitter(string vocabularyPath, string chinesePunctuationsPath, string englishPunctuations)
    {
        _dict = GetVocabularyDictionary(vocabularyPath);
        _chineseSpiltters = GetPunctuationDictionary(chinesePunctuationsPath);
        _englishSpiltters = GetPunctuationDictionary(englishPunctuations);

        // The punctuation files are assumed to contain only characters that are
        // valid inside a regular-expression character class.
        _nonChineseRegex = new Regex("[" + _englishSpiltters + "0-9A-Za-z]+");
        _chineseRegex = new Regex(@"[\u4e00-\u9fa5]+");
        _chinesePunctuationRegex = new Regex("[" + _chineseSpiltters + "]+");
    }

    #endregion

    #region Private methods

    /// <summary>
    /// Build the segmentation dictionary
    /// </summary>
    /// <param name="dictPath">Path of the dictionary file (one word per line)</param>
    HashSet<string> GetVocabularyDictionary(string dictPath)
    {
        var arrVocabulary = File.ReadAllLines(dictPath, Encoding.Default);
        foreach (string s in arrVocabulary)
        {
            _dict.Add(s);
        }
        return _dict;
    }

    /// <summary>
    /// Build the string made up of all delimiter characters
    /// </summary>
    /// <param name="dictPath">Path of the delimiter file</param>
    string GetPunctuationDictionary(string dictPath)
    {
        StringBuilder strBuilder = new StringBuilder();
        var arrPunctuation = File.ReadAllLines(dictPath);
        foreach (string s in arrPunctuation)
        {
            strBuilder.Append(s);
        }
        return strBuilder.ToString();
    }

    /// <summary>
    /// Updated: preprocess the text by removing spaces, full-width spaces and line breaks
    /// </summary>
    string PreProcess()
    {
        InputStr = Regex.Replace(InputStr, "[ \r\n\u3000]+", "");
        return InputStr.Trim();
    }

    /// <summary>
    /// Append a delimiter to every run of digits, Latin letters and English punctuation
    /// </summary>
    string SplitNonChinese()
    {
        //Updated
        if (IsFormat)
            InputStr = PreProcess();

        InputStr = _nonChineseRegex.Replace(InputStr, m => m.Value + "\\");
        return InputStr;
    }

    /// <summary>
    /// Updated: remove Chinese punctuation from the text
    /// </summary>
    string StripPunctuation()
    {
        InputStr = SplitNonChinese();
        // Remove each run as a whole; replacing the matched strings one by one
        // could leave part of a longer run behind.
        InputStr = _chinesePunctuationRegex.Replace(InputStr, "");
        return InputStr;
    }

    /// <summary>
    /// Append a delimiter to every run of Chinese punctuation
    /// </summary>
    string SplitChinesePunctuation()
    {
        InputStr = SplitNonChinese();
        InputStr = _chinesePunctuationRegex.Replace(InputStr, m => m.Value + "\\");
        return InputStr;
    }

    /// <summary>
    /// Append a delimiter to every run of Chinese characters
    /// </summary>
    string SplitChinese()
    {
        //Updated
        if (IsStripPunctuation)
            InputStr = StripPunctuation();
        else
            InputStr = SplitChinesePunctuation();

        return _chineseRegex.Replace(InputStr, m => m.Value + "\\");
    }

    /// <summary>
    /// Segment a string with reverse maximum matching
    /// </summary>
    /// <param name="str">String to segment</param>
    /// <returns>The words in reverse order (last word first), each followed by "\"</returns>
    List<string> Spiltter(string str)
    {
        List<string> result = new List<string>();

        // GetResult has already looked up the whole string, so start one character in.
        string initStr = str.Remove(0, 1);
        int l = 0; // number of characters already segmented, counted from the end

        while (!string.IsNullOrEmpty(initStr))
        {
            for (int i = 1; i < str.Length; i++)
            {
                if (initStr.Length == 1 || _dict.Contains(initStr))
                {
                    // Whitespace keeps its place but gets no delimiter.
                    if (Regex.IsMatch(initStr, "[\u3000 \n\r]+"))
                        result.Add(initStr);
                    else
                        result.Add(initStr + "\\");

                    l += initStr.Length;
                    // Continue with the part of the string that is still unsegmented.
                    initStr = str.Remove(str.Length - l, l);
                    break;
                }
                else
                {
                    // Not a word: drop the first character and try the shorter suffix.
                    initStr = initStr.Remove(0, 1);
                }
            }
        }
        return result;
    }

    /// <summary>
    /// Convert the result to Peking University standard output
    /// (only called when IsPekingStandard is true)
    /// </summary>
    string OutputResult(string str)
    {
        // Words are separated by spaces instead of "\".
        str = str.Replace("\\", " ");
        return str;
    }

    #endregion

    #region Public methods

    /// <summary>
    /// Get the segmentation result for InputStr
    /// </summary>
    public string GetResult()
    {
        StringBuilder strBuilder = new StringBuilder();
        List<string> splitterResult = new List<string>();
        List<string> strList = SplitChinese().Split('\\').ToList();

        for (int i = 0; i < strList.Count; i++)
        {
            if (!string.IsNullOrEmpty(strList[i]))
            {
                if (strList[i].Length == 1 || _nonChineseRegex.IsMatch(strList[i]) || _dict.Contains(strList[i]))
                {
                    // Already a single token: one character, non-Chinese text or a dictionary word.
                    strBuilder.Append(strList[i] + "\\");
                }
                else
                {
                    // Spiltter returns the words last-to-first, so append them in reverse.
                    splitterResult = Spiltter(strList[i]);
                    for (int p = splitterResult.Count - 1; p >= 0; p--)
                    {
                        strBuilder.Append(splitterResult[p]);
                    }
                }
            }
        }

        //Updated
        if (IsPekingStandard)
            return OutputResult(strBuilder.ToString());

        return strBuilder.ToString();
    }

    #endregion
}
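The reverse maximum matching step performed by Spiltter can be easier to follow in isolation. The sketch below restates the idea in JavaScript under a few assumptions: the function name `reverseMaxMatch` and the sample dictionary are invented for the example, the whole string is looked up as well (Spiltter skips that because GetResult has already done it), and the words are returned in reading order rather than last-to-first.

```javascript
// Reverse maximum matching over a single run of Chinese characters.
// dict is a Set of known words; characters that match nothing become one-character words.
function reverseMaxMatch(text, dict) {
  var words = [];
  var end = text.length; // exclusive end of the part that is still unsegmented
  while (end > 0) {
    var start = 0;
    // Shrink the candidate from the left until it is a dictionary word
    // or only one character is left.
    while (end - start > 1 && !dict.has(text.slice(start, end))) {
      start++;
    }
    words.unshift(text.slice(start, end)); // words are found right to left, so prepend
    end = start;
  }
  return words;
}

var sampleDict = new Set(["研究", "研究生", "生命", "起源"]); // illustrative entries only
console.log(reverseMaxMatch("研究生命起源", sampleDict).join("\\")); // 研究\生命\起源
```

Matching from the right is why the sample comes out as 研究\生命\起源; a left-to-right longest match over the same dictionary would take 研究生 first and leave 命 stranded as a single character.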