A First Look at Lucene

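I have always found full-text indexing rather mysterious, so I wanted to find out how it works. I read through material online and organized my notes here. (Everything below comes from the internet.)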

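I. Full-Text Indexing

Define a set of terms, match those terms against the files to be searched, and record the location of every match. The terms together with their match locations form the index. When a term is queried, the location of the target can be read straight from the index instead of scanning the files again, which is what makes the lookup fast.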
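To make the definition concrete, here is a minimal sketch of such an index in C#. It is only an illustration: the class name `TinyInvertedIndex` and its methods are made up for this post, tokenization is a naive whitespace split, and none of it reflects how Lucene stores its index internally.

```csharp
using System;
using System.Collections.Generic;

public static class TinyInvertedIndex
{
    // Build a map from each term to the sorted set of document ids that contain it.
    public static Dictionary<string, SortedSet<int>> Build(IList<string> docs)
    {
        var index = new Dictionary<string, SortedSet<int>>(StringComparer.OrdinalIgnoreCase);
        for (int docId = 0; docId < docs.Count; docId++)
        {
            // Naive tokenization (an assumption of this sketch): split on whitespace only.
            string[] terms = docs[docId].Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
            foreach (string term in terms)
            {
                SortedSet<int> postings;
                if (!index.TryGetValue(term, out postings))
                {
                    postings = new SortedSet<int>();
                    index[term] = postings;
                }
                postings.Add(docId);
            }
        }
        return index;
    }

    // Read the matching document ids straight from the index; no document is rescanned.
    public static IEnumerable<int> Lookup(Dictionary<string, SortedSet<int>> index, string term)
    {
        SortedSet<int> postings;
        return index.TryGetValue(term, out postings) ? (IEnumerable<int>)postings : new int[0];
    }

    public static void Main()
    {
        var docs = new List<string>
        {
            "lucene builds an index",
            "an index speeds up search",
            "search with lucene"
        };
        var index = Build(docs);
        Console.WriteLine(string.Join(",", Lookup(index, "lucene"))); // prints 0,2
    }
}
```

The cost of a lookup depends on the number of distinct terms, not on the total size of the documents, which is the whole point of building the index up front.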

MS SQL Server also provides a full-text indexing service; see MSDN for details. Compared with Lucene, MS SQL's full-text indexing is much simpler to use.

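II. What Is Lucene

Lucene is an open-source search engine framework. It was originally written in Java, and ports now exist for .NET, C++, Pascal, and other languages. I think of Lucene as two parts: an indexing component and a search component. A complete search engine consists of the following parts: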

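1. Acquire the raw content;

2. Build the corresponding documents from the raw content, for example by extracting text from binary files;

3. Create the documents;

4. Analyze the documents;

5. Index the documents and build the index files;

6. Provide a programmable query language and a user-facing query interface;

7. Present the query results.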

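Acquiring data:

The data to be searched comes in many forms: it may be txt files, or Word, PDF, and other documents. Before indexing, it has to be extracted into a recognizable form (usually plain text). Lucene does not provide this functionality. How the data is acquired is up to the search application, which can also rely on existing third-party tools such as web spiders (crawlers).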

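Building documents:

Once the data has been acquired, it has to be shaped into a "document" in the usual sense. A document typically contains a file name, a title, and the content.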
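As a rough sketch of what "building a document" can mean, the snippet below turns a file name and its already extracted text into a set of named fields. The `DocumentBuilder` class is hypothetical, and treating the first non-empty line as the title is purely an assumption made for this example.

```csharp
using System;
using System.Collections.Generic;

public static class DocumentBuilder
{
    // Build a field-name -> value map from a file name and its extracted text.
    // Assumption for illustration: the first non-empty line serves as the title.
    public static Dictionary<string, string> Build(string fileName, string text)
    {
        string title = "";
        foreach (string line in text.Split('\n'))
        {
            string trimmed = line.Trim();
            if (trimmed.Length > 0)
            {
                title = trimmed;
                break;
            }
        }
        return new Dictionary<string, string>
        {
            { "FileName", fileName },
            { "Title", title },
            { "Content", text }
        };
    }

    public static void Main()
    {
        var doc = Build("notes.txt", "Lucene notes\nFull-text indexing basics.");
        Console.WriteLine(doc["Title"]); // prints Lucene notes
    }
}
```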

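Analyzing documents:

Having a document does not mean it can be searched yet. The document still has to be broken down into a searchable form, which usually involves word segmentation (tokenization), and useless characters have to be removed: for an HTML file, for example, the HTML tags must be stripped. The result of analysis is typically a set of records that map each word to the documents it appears in.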
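The analysis step described above can be sketched in a few lines. This is a deliberately crude stand-in, not what Lucene's analyzers do: the `SimpleAnalyzer` class is hypothetical, tags are removed with a regular expression (which is not a robust way to handle real HTML), and tokens are simply lowercase runs of word characters.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SimpleAnalyzer
{
    // Remove anything that looks like an HTML tag, then split the rest into lowercase tokens.
    public static List<string> Analyze(string html)
    {
        // Replace tags with a space so that words on either side of a tag stay separate.
        string text = Regex.Replace(html, "<[^>]+>", " ");
        var tokens = new List<string>();
        foreach (Match m in Regex.Matches(text, @"\w+"))
        {
            tokens.Add(m.Value.ToLowerInvariant());
        }
        return tokens;
    }

    public static void Main()
    {
        List<string> tokens = Analyze("<p>Hello <b>Lucene</b> world</p>");
        Console.WriteLine(string.Join("|", tokens)); // prints hello|lucene|world
    }
}
```

Note that `\w+` treats a run of Chinese characters as a single token, so this sketch does no real word segmentation for Chinese text.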

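Indexing documents:

After analysis, the results can be written into the index database. The index database is a carefully designed data structure: it stores each keyword or term together with the documents it appears in and its positions within those documents, as key-value pairs.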
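The key-value layout described above (term as the key, documents and positions as the value) might look like the following in memory. This is a sketch under my own assumptions: `PositionalIndex` is a made-up name, the input is already tokenized, and Lucene's real on-disk format is far more compact than nested dictionaries.

```csharp
using System;
using System.Collections.Generic;

public static class PositionalIndex
{
    // term -> (document id -> list of token positions within that document)
    public static Dictionary<string, Dictionary<int, List<int>>> Build(IList<string[]> tokenizedDocs)
    {
        var index = new Dictionary<string, Dictionary<int, List<int>>>();
        for (int docId = 0; docId < tokenizedDocs.Count; docId++)
        {
            string[] tokens = tokenizedDocs[docId];
            for (int pos = 0; pos < tokens.Length; pos++)
            {
                Dictionary<int, List<int>> byDoc;
                if (!index.TryGetValue(tokens[pos], out byDoc))
                {
                    byDoc = new Dictionary<int, List<int>>();
                    index[tokens[pos]] = byDoc;
                }
                List<int> positions;
                if (!byDoc.TryGetValue(docId, out positions))
                {
                    positions = new List<int>();
                    byDoc[docId] = positions;
                }
                positions.Add(pos);
            }
        }
        return index;
    }

    public static void Main()
    {
        var docs = new List<string[]>
        {
            new[] { "to", "be", "or", "not", "to", "be" },
            new[] { "to", "search" }
        };
        var index = Build(docs);
        // Positions of "to" in document 0.
        Console.WriteLine(string.Join(",", index["to"][0])); // prints 0,4
    }
}
```

Keeping positions is what makes phrase queries possible: two terms form a phrase in a document only if their positions are adjacent.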

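Querying:

Once the index data exists, the query component can be used to query it. There are many query strategies, such as pure Boolean matching, the vector space model, and similarity matching.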
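To show how two of these strategies differ, the sketch below implements a pure Boolean AND match and a bare-bones vector space score (cosine similarity over raw term counts). The `QueryModels` class is hypothetical and this is not Lucene's scoring formula; it only illustrates that Boolean matching gives a yes/no answer while the vector space model gives a ranking.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class QueryModels
{
    // Boolean AND: a document matches only if it contains every query term.
    public static bool MatchesAll(string[] docTokens, string[] queryTerms)
    {
        var present = new HashSet<string>(docTokens);
        return queryTerms.All(t => present.Contains(t));
    }

    // Count how often each term occurs.
    private static Dictionary<string, int> TermFrequencies(string[] tokens)
    {
        var tf = new Dictionary<string, int>();
        foreach (string t in tokens)
        {
            int count;
            tf.TryGetValue(t, out count); // count stays 0 when the term is new
            tf[t] = count + 1;
        }
        return tf;
    }

    // Vector space model: cosine of the angle between raw term-frequency vectors.
    public static double Cosine(string[] docTokens, string[] queryTerms)
    {
        var d = TermFrequencies(docTokens);
        var q = TermFrequencies(queryTerms);
        double dot = 0;
        foreach (var pair in q)
        {
            int inDoc;
            if (d.TryGetValue(pair.Key, out inDoc)) dot += pair.Value * inDoc;
        }
        double normD = Math.Sqrt(d.Values.Sum(v => (double)v * v));
        double normQ = Math.Sqrt(q.Values.Sum(v => (double)v * v));
        return (normD == 0 || normQ == 0) ? 0 : dot / (normD * normQ);
    }

    public static void Main()
    {
        string[] doc = { "lucene", "index", "search", "lucene" };
        Console.WriteLine(MatchesAll(doc, new[] { "lucene", "search" })); // True
        Console.WriteLine(MatchesAll(doc, new[] { "lucene", "solr" }));   // False
        // "lucene" occurs twice, so it scores higher than "index", which occurs once.
        Console.WriteLine(Cosine(doc, new[] { "lucene" }) > Cosine(doc, new[] { "index" })); // True
    }
}
```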

III. Lucene Usage Example

The following demo indexes the .txt files in a directory and then queries them.

using System;
using System.Collections.Generic;
using System.Text;
using System.IO;

using Lucene.Net;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

namespace LuceneNet
{
    public class LuceneIndex
    {
        private const string FieldName = "industry";
        private const string FieldValue = "value";

        private string indexDir = @"D:\temp\"; // directory where Lucene stores the index
        private string dataDir = @"D:\temp\"; // directory containing the files to index

        private IndexWriter indexWriter;

        public LuceneIndex()
        {
            DirectoryInfo dirInfo = new DirectoryInfo(indexDir);
            Lucene.Net.Store.Directory dir = FSDirectory.Open(dirInfo);
            Analyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);

            indexWriter = new IndexWriter(dir, analyzer, true
                ,IndexWriter.MaxFieldLength.UNLIMITED
                );
        }

        public void Index()
        {
            string[] files = System.IO.Directory.GetFiles(dataDir,"*.txt"); // index only the .txt files in the directory

            foreach (string fileFullPath in files)
            {
                Document doc = new Document();
                using (StreamReader sr = new StreamReader(fileFullPath))
                {
                    StringReader stringReader = new StringReader(sr.ReadToEnd());   // read the whole file and wrap it in a StringReader

                    Fieldable field = new Field("Content", stringReader);   // analyze the file content and add it to the Content field
                    doc.Add(field);

                    // record the file's full path
                    Fieldable fieldPath = new Field("FullPath", fileFullPath, Field.Store.YES, Field.Index.NOT_ANALYZED);
                    doc.Add(fieldPath);

                    indexWriter.AddDocument(doc);   // add the document to the index
                }
            }
            indexWriter.Close();
        }
    }
}

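Analyzing the document content

Fieldable field = new Field("Content", stringReader);   // analyze the file content and add it to the Content field
doc.Add(field);

The new Field("Content", stringReader) constructor takes a reader and is intended for large content. A field created this way is analyzed and indexed, but its text is not stored in the index.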

Fieldable fieldPath = new Field("FullPath", fileFullPath, Field.Store.YES, Field.Index.NOT_ANALYZED);
doc.Add(fieldPath);

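The new Field("FullPath", fileFullPath, Field.Store.YES, Field.Index.NOT_ANALYZED) constructor is used to save the file's full path.

Field.Store.YES: store the value in the index so it can be read back from a search hit

Field.Index.NOT_ANALYZED: index the value as-is, without running it through the analyzer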

Searching files with Lucene

using Lucene.Net;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

using System;

namespace LuceneNet
{
    public class LuceneSearch
    {
        IndexSearcher search;

        private string indexDir = @"D:\temp\";

        public LuceneSearch()
        {
            System.IO.DirectoryInfo dirInfo = new System.IO.DirectoryInfo(indexDir);
            Directory directory =FSDirectory.Open(dirInfo);
            search = new IndexSearcher(directory,true);
        }

        public void Query(string s)
        {
            Analyzer anlyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
            QueryParser parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_29, "Content", anlyzer);
            Query q = parser.Parse(s);

            TopDocs topDos = search.Search(q, 10);
            for (int i=0; i < topDos.scoreDocs.Length;i++ )
            {
                // fetch each hit by its document id, not by its rank in the result list
                Document doc = search.Doc(topDos.scoreDocs[i].doc);
                Console.WriteLine(doc.Get("FullPath"));
            }
            Console.WriteLine(string.Format("{0} hits", topDos.totalHits));
            Console.ReadLine();
        }
    }
}

When querying, Query q = parser.Parse(s); converts the query keywords into a Query object that IndexSearcher understands.

Finally, test it in the Main function:

using System;
using System.Collections.Generic;
using System.Text;

namespace LuceneNet
{
    public class Program
    {
        static void Main(string[] args)
        {
            LuceneIndex test = new LuceneIndex();
            test.Index();

            LuceneSearch search = new LuceneSearch();
            search.Query("abcdefg");
        }
    }
}

For example, suppose a txt file contains the following:

abcdefg中国工产abcdefg

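Searching for abcdefg returns the correct result and prints the file's full path. Searching for abcd returns nothing, which shows that Lucene matches whole tokens produced by the analyzer (split here at the boundary between English and Chinese text) rather than arbitrary substrings.

However, searching for "中国共产" returns nothing either. Searching Chinese text by word requires a Chinese word dictionary. I will write that up once I have seen how it is done.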
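The token-matching behavior observed above can be imitated with a toy tokenizer. This is only my approximation of the observed results, not StandardAnalyzer's actual implementation: the hypothetical `ToyTokenizer` keeps each run of ASCII letters or digits as one token and emits every other letter, such as a Chinese character, as a token of its own.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ToyTokenizer
{
    // Runs of ASCII letters/digits become one lowercase token;
    // every other letter (for example a Chinese character) becomes its own token.
    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        var run = new StringBuilder();
        foreach (char c in text)
        {
            bool asciiAlnum = c < 128 && char.IsLetterOrDigit(c);
            if (asciiAlnum)
            {
                run.Append(char.ToLowerInvariant(c));
                continue;
            }
            // A non-ASCII or non-alphanumeric character ends the current run.
            if (run.Length > 0)
            {
                tokens.Add(run.ToString());
                run.Clear();
            }
            if (char.IsLetter(c)) tokens.Add(c.ToString());
        }
        if (run.Length > 0) tokens.Add(run.ToString());
        return tokens;
    }

    public static void Main()
    {
        List<string> tokens = Tokenize("abcdefg中国工产abcdefg");
        Console.WriteLine(tokens.Count);               // 6
        Console.WriteLine(tokens.Contains("abcdefg")); // True: the whole token is in the index
        Console.WriteLine(tokens.Contains("abcd"));    // False: a prefix is not a token
    }
}
```

Because only whole tokens end up in the index, a term query for abcd has nothing to match, even though the characters appear in the file.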

posted @ 2011-09-02 13:43 徐某人
