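Lucene 2.9.2 + PanGu Segmenter 2.3.1 (Part 1) Getting Started: Building a Simple Index and Searching It (Original)
A screenshot as proof
P.S. The screenshot above shows that the Chinese word segmentation worked and that the search got a hit.
Note: if you want to learn Lucene properly, I recommend reading Lucene in Action, 2nd edition. Also, 2.9.2 deprecates many methods from earlier versions, so don't bother reading old code.
Here is the code: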
    public static void IndexFile(this IndexWriter writer, IO.FileInfo file)
    {
        var watch = new Stopwatch();
        var startTime = DateTime.Now; // wall-clock start, printed next to the Stopwatch reading
        watch.Start();
        Console.WriteLine("Indexing {0}", file.Name);
        writer.AddDocument(file.GetDocument());
        watch.Stop();
        var timeSpan = DateTime.Now - startTime;
        Console.WriteLine("Indexing Completed! Cost time {0}[{1}]", timeSpan.ToString("c"), watch.ElapsedMilliseconds);
    }
    public static Document GetDocument(this IO.FileInfo file)
    {
        var doc = new Document();
        doc.Add(new Field("contents", new IO.StreamReader(file.FullName))); // tokenized and indexed, not stored
        doc.Add(new Field("filename", file.Name,
            Field.Store.YES, Field.Index.ANALYZED));
        doc.Add(new Field("fullpath", file.FullName,
            Field.Store.YES, Field.Index.NOT_ANALYZED));
        return doc;
    }
Output
Indexing Scott.txt
Indexing Completed! Cost time 00:00:02.4231386[2423]
Indexing 黄金瞳.txt
Indexing Completed! Cost time 00:00:00.0860049[85]
There are 2 doc Indexed!
Index Exit!
Code walkthrough:
Line 14: GetDocument builds the Document for the file. Document is one of Lucene's core classes; here is its definition:
The Document class represents a collection of fields. Think of it as a virtual document—
a chunk of data, such as a web page, an email message, or a text file—that you
want to make retrievable at a later time. Fields of a document represent the document
or metadata associated with that document. The original source (such as a database
record, a Microsoft Word document, a chapter from a book, and so on) of
document data is irrelevant to Lucene. It’s the text that you extract from such binary
documents, and add as a Field instance, that Lucene processes. The metadata (such
as author, title, subject and date modified) is indexed and stored separately as fields
of a document.
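If you don't care about the details, think of a Document as one row in a database table. The results you get back from a query are also Document objects, that is, rows.
Building an index, then, means adding many such records to Lucene.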
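The "row in a table" analogy can be sketched with a toy model. This is not Lucene.Net code and says nothing about how Lucene stores its index; the ToyIndex class, its methods, and the sample file names and contents are all made up for illustration. It only shows the shape of the idea: adding a document is like inserting a row, and a search hands back whole documents.

```csharp
using System;
using System.Collections.Generic;

// Toy model only, not Lucene.Net code: a "document" is a bag of named fields,
// and the "index" maps each term to the ids of the documents that contain it.
public class ToyIndex
{
    private readonly List<Dictionary<string, string>> _docs = new List<Dictionary<string, string>>();
    private readonly Dictionary<string, List<int>> _postings = new Dictionary<string, List<int>>();

    // Adding a document is like inserting one row; returns the new document id.
    public int AddDocument(Dictionary<string, string> fields)
    {
        int id = _docs.Count;
        _docs.Add(fields);
        foreach (var field in fields)
        {
            // Naive tokenization: split on spaces and lower-case each word.
            foreach (var word in field.Value.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            {
                string term = field.Key + ":" + word.ToLowerInvariant();
                List<int> ids;
                if (!_postings.TryGetValue(term, out ids))
                {
                    ids = new List<int>();
                    _postings[term] = ids;
                }
                if (!ids.Contains(id)) ids.Add(id);
            }
        }
        return id;
    }

    // A search returns whole documents (rows), not just the matching text.
    public List<Dictionary<string, string>> Search(string field, string word)
    {
        var results = new List<Dictionary<string, string>>();
        List<int> ids;
        if (_postings.TryGetValue(field + ":" + word.ToLowerInvariant(), out ids))
        {
            foreach (int id in ids) results.Add(_docs[id]);
        }
        return results;
    }
}

public static class ToyIndexDemo
{
    public static void Main()
    {
        var index = new ToyIndex();
        // Hypothetical sample records.
        index.AddDocument(new Dictionary<string, string> { { "filename", "Scott.txt" }, { "contents", "hello lucene" } });
        index.AddDocument(new Dictionary<string, string> { { "filename", "notes.txt" }, { "contents", "hello world" } });

        Console.WriteLine(index.Search("contents", "hello").Count);           // 2
        Console.WriteLine(index.Search("contents", "lucene")[0]["filename"]); // Scott.txt
    }
}
```

In the real code the same two roles are played by writer.AddDocument(...) when indexing and search.Doc(hit.doc) when reading a hit back.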
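Line 19: the first argument, Field.Store.YES, needs little explanation: the original value is stored so it can be read back from a hit. The second, Field.Index.NOT_ANALYZED, does not mean the field cannot be searched. It means the field is indexed as one whole value without being segmented, so it only matches when you search for that whole value.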
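The difference between ANALYZED and NOT_ANALYZED is easier to see in terms of which index terms one field value produces. The sketch below is a toy model, not Lucene.Net code: the two helper methods and the sample path are hypothetical, and the "analyzer" here just splits on a few separator characters and lower-cases, whereas a real analyzer such as PanGuAnalyzer performs proper Chinese word segmentation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy model only, not Lucene.Net code: which terms does one field value produce?
public static class FieldIndexingSketch
{
    // ANALYZED: the value is split into tokens and every token becomes a searchable term.
    // Stand-in analyzer: split on a few separators and lower-case.
    public static List<string> AnalyzedTerms(string value)
    {
        return value
            .Split(new[] { ' ', '.', '\\', '/', ':' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(t => t.ToLowerInvariant())
            .ToList();
    }

    // NOT_ANALYZED: the whole value is kept as one single term.
    public static List<string> NotAnalyzedTerms(string value)
    {
        return new List<string> { value };
    }

    public static void Main()
    {
        var fullPath = @"D:\books\Scott.txt"; // hypothetical path
        var analyzed = AnalyzedTerms(fullPath);
        var notAnalyzed = NotAnalyzedTerms(fullPath);

        Console.WriteLine(analyzed.Contains("scott"));     // True: a single word matches
        Console.WriteLine(notAnalyzed.Contains("scott"));  // False: a part of the value does not match
        Console.WriteLine(notAnalyzed.Contains(fullPath)); // True: only the exact whole value matches
    }
}
```

This is why the fullpath field is a reasonable candidate for NOT_ANALYZED: a path is normally looked up as an exact value, while filename and contents are meant to be found by individual words.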
    public ActionResult Index(string keyWord)
    {
        var originalKeyWords = keyWord; // only used by the highlighting code commented out below
        ViewBag.TotalResult = 0;
        ViewBag.Results = new List<KeyValuePair<string, string>>();
        if (string.IsNullOrEmpty(keyWord))
        { ViewBag.Message = "Welcome Today!"; return View("Index"); }
        var q = keyWord;
        var search = new IndexSearcher(_indexDir, true); // true = open the index read-only
        try
        {
            // q = GetKeyWordsSplitBySpace(q, new PanGuTokenizer());
            var queryParser = new QueryParser(Lucene.Net.Util.Version.LUCENE_29, "contents", new PanGuAnalyzer(false));
            var query = queryParser.Parse(q);
            var hits = search.Search(query, 100); //search.Search(bq, 100);
            var recCount = hits.totalHits;
            ViewBag.TotalResult = recCount;
            // show explain: scoring details for every document in the index (debug output)
            for (int d = 0; d < search.MaxDoc(); d++)
            {
                ViewBag.Explain += search.Explain(query, d).ToHtml();
            }
            // list every term in the index once, so the segmentation result is visible
            var termReader = search.GetIndexReader().Terms();
            ViewBag.Explain += "<ul >";
            do
            {
                if (termReader.Term() != null)
                    ViewBag.Explain += string.Format("<li>{0}</li>", termReader.Term().Text());
            } while (termReader.Next());
            termReader.Close();
            ViewBag.Explain += "</ul>";
            foreach (var hit in hits.scoreDocs)
            {
                try
                {
                    var doc = search.Doc(hit.doc); // load the stored fields of the matching document
                    var fileName = doc.Get("filename");
                    // fileName = highlighter.GetBestFragment(originalKeyWords, fileName);
                    //var contents = GetBestFragment(originalKeyWords, new StreamReader(doc.Get("fullpath"), Encoding.GetEncoding("gb2312")));
                    (ViewBag.Results as List<KeyValuePair<string, string>>)
                        .Add(new KeyValuePair<string, string>(fileName, string.Empty));
                }
                catch (Exception exc)
                {
                    Response.Write(exc.Message);
                    throw;
                }
            }
        }
        finally
        {
            search.Close(); // release the index even if parsing or searching throws
        }
        ViewBag.Message = string.Format("????{0}", keyWord);
        return View("Index");
    }
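Later posts will continue with this code and put the comments inline. Explaining it outside the listing keeps the explanation too far from the code, and it is tiring to write.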