Adding SCWS Chinese Word Segmentation to lucene.net

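I recently built a small site (http://www.micro-sharing.com). It is not finished yet: I have no time to work on it during office hours and no motivation after work, so I only write a couple of lines when I am truly bored.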

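I want to add search to this site. I heard long ago that lucene.net can provide simple index-based search, and I also knew that its built-in analyzers do not meet the needs of Chinese text. Chinese word segmentation is too complex for me to solve myself, so I have to rely on the experts. There are plenty of Chinese segmentation components for lucene.net online, but as I mentioned in yesterday's post, I do not want to maintain a segmentation dictionary myself, and my performance requirements are modest, so I went with the SCWS Chinese word segmentation API. This post, based on articles found online, shows how to add SCWS Chinese word segmentation to lucene.net.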

As before, enough talk. Here is the code.

SCWSAnalyzer.cs

using System.IO;
using Lucene.Net.Analysis;

/// <summary>
/// Analyzer that delegates tokenization to SCWSTokenizer.
/// 1638988@gmail.com
/// </summary>
public class SCWSAnalyzer : Analyzer
{
    // Return a token stream that segments the field text with SCWS.
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        return new SCWSTokenizer(reader);
    }
}

SCWSTokenizer.cs

using System;
using System.IO;
using Lucene.Net.Analysis;

/// <summary>
/// Tokenizer that emits the words produced by the SCWS segmenter.
/// Written against the older Lucene.Net API in which Next() returns a Token.
/// 1638988@gmail.com
/// </summary>
public class SCWSTokenizer : Tokenizer
{
    private readonly string _segmented; // the input stream, segmented into space-separated words
    private readonly string[] _words;   // the segmented words
    private int _offset = 0;            // position from which to search for the next word
    private int _step = 0;              // index of the next word to return

    public SCWSTokenizer(TextReader reader) : base(reader)
    {
        // Common.Segment is not shown in this post; it is expected to return
        // the text with the words separated by spaces.
        _segmented = Common.Segment(reader.ReadToEnd()) ?? string.Empty;
        // Drop empty entries: an empty word would otherwise end the stream early.
        _words = _segmented.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }

    public override Token Next()
    {
        // Use >= here (the original used step <= Length, which indexed past the array).
        if (_step >= _words.Length)
        {
            return null; // null signals the end of the stream
        }

        string word = _words[_step++];
        // Find where the word starts; offsets are relative to the segmented text.
        int start = _segmented.IndexOf(word, _offset, StringComparison.Ordinal);
        // Resume after the whole word so the next search cannot match inside it.
        _offset = start + word.Length;
        // The Token holds the word plus its start and end positions.
        return new Token(word, start, start + word.Length);
    }
}
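
The offset bookkeeping in Next() is the part that is easiest to get wrong, so here is a minimal sketch that isolates it and compiles without Lucene.Net or SCWS. It assumes the segmenter returns words separated by one or more spaces. OffsetDemo, Span and Tokenize are hypothetical names used only for illustration, and the offsets are relative to the segmented string, not to the original input.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type for illustration; not part of Lucene.Net or SCWS.
public struct Span
{
    public string Text;
    public int Start; // inclusive
    public int End;   // exclusive
}

public static class OffsetDemo
{
    // Split space-separated text into words and record where each one sits.
    public static List<Span> Tokenize(string segmented)
    {
        var spans = new List<Span>();
        string[] words = segmented.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        int offset = 0;
        foreach (string word in words)
        {
            // Search from the end of the previous word, never from inside it.
            int start = segmented.IndexOf(word, offset, StringComparison.Ordinal);
            offset = start + word.Length;
            spans.Add(new Span { Text = word, Start = start, End = offset });
        }
        return spans;
    }
}
```

Advancing the search position by the full word length matters for input such as "aa a": moving on by only one character would find the second word inside the first one.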

posted on 2012-04-25 09:09 by 陈平
