Lucene: Origins, Current Status, and First Steps

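1. Origins and Development

Lucene is a high-performance full-text search engine written entirely in Java, and it is free and open source. It suits almost any application that needs full-text search, especially cross-platform ones.

Lucene's author, Doug Cutting, is a veteran full-text search expert. He first published Lucene on his personal home page, moved it to SourceForge in March 2000, and donated it to Apache in October 2001, where it became a subproject of Jakarta.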

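2. Current Usage

After years of development, Lucene has many success stories in full-text search and has built a solid reputation.

Lucene itself is only a component, not a complete application, yet a great many Lucene-based search products and projects that use Lucene already exist around the world. Some of the better-known ones:

• Eclipse: the mainstream Java development tool; its help documentation uses Lucene as the search engine
• Jive: a well-known forum system whose search feature is built on Lucene
• Ifinder: a website search system from Germany, based on Lucene (http://ifinder.intrafind.org/)
• MIT DSpace Federation: a document management system (http://www.dspace.org/)

Many websites, in China and abroad, also use Lucene as their full-text search engine. Some of the better-known ones:

• http://www.blogchina.com/weblucene/
• http://www.ioffer.com/
• http://search.soufun.com/
• http://www.taminn.com/

(For more examples, see http://wiki.apache.org/jakarta-lucene/PoweredBy)

Open-source applications account for a large share of these cases, but commercial products and websites make up even more. It is no exaggeration to say that Lucene has greatly advanced the in-depth use of full-text search across industries and fields.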

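3. First Steps

As noted above, Lucene is only a component rather than a complete application, so getting it running requires some development work on top of it.

Download and Installation

First, download a copy from the official Lucene site, http://jakarta.apache.org/lucene/. The latest release at the time of writing is 1.4. The download is an archive named lucene-1.4-final.zip. Unpack it and you will find a file named lucene-1.4-final.jar, which is the Lucene library itself. To use Lucene in a project, just put lucene-1.4-final.jar on the classpath; the other unpacked files are for reference only.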
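A quick way to confirm that the JAR really is on the classpath is to try loading a Lucene class by name. The sketch below uses only the JDK. The class `ClasspathCheck` and its helper `isOnClasspath` are names I made up for illustration; `org.apache.lucene.index.IndexWriter` is the Lucene class used later in this article.

```java
// Hypothetical helper, not part of Lucene: reports whether a class can be loaded.
class ClasspathCheck {

    /** Returns true if the named class can be found by this class's class loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError ex) {
            return false;
        }
    }

    public static void main(String[] args) {
        String probe = "org.apache.lucene.index.IndexWriter";
        System.out.println(probe
                + (isOnClasspath(probe) ? " found" : " NOT found on the classpath"));
    }
}
```

If the probe reports that the class is missing, the usual cause is that lucene-1.4-final.jar was not added to the project's build path or to the `-cp` argument.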

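Next, I create a project in Eclipse that implements index creation, record loading, and record querying on top of Lucene.

As the figure above shows, this is the finished project. It contains three source files, CreateDataBase.java, InsertRecords.java, and QueryRecords.java, which create the index, load records into it, and search it, respectively.

The three source files are analyzed below.

Index Creation: Source Code and Notes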

CreateDataBase.java

package com.holen.part1;

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

/**
 * @author Holen Chen
 * Initializes the search index
 */
public class CreateDataBase {

    public CreateDataBase() {
    }

    /**
     * Creates an empty index in the given directory
     * @param file the index directory (created if it does not exist)
     * @return 1 on success, 0 on failure
     */
    public int createDataBase(File file) {
        int returnValue = 0;
        if (!file.isDirectory()) {
            file.mkdirs();
        }
        try {
            // true = create a new index, overwriting any existing one
            IndexWriter indexWriter = new IndexWriter(file, new StandardAnalyzer(), true);
            indexWriter.close();
            returnValue = 1;
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        return returnValue;
    }

    /**
     * Takes the index path and initializes the index
     * @param file path of the index directory
     * @return 1 on success, 0 on failure
     */
    public int createDataBase(String file) {
        return this.createDataBase(new File(file));
    }

    public static void main(String[] args) {
        CreateDataBase temp = new CreateDataBase();
        if (temp.createDataBase("e:\\lucene\\holendb") == 1) {
            System.out.println("db init succ");
        }
    }
}

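Notes: the key statement here is IndexWriter indexWriter = new IndexWriter(file, new StandardAnalyzer(), true).

The first argument is the index path, that is, where you want to keep the full-text index, such as the "e:\\lucene\\holendb" set in the main method. Lucene supports multiple indexes, and each one may live in a different location.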

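The second argument is the analyzer; here it is the standard analyzer that ships with Lucene. An analyzer breaks a whole document into terms. The standard analyzer handles English (or any text made up of letters with words separated by spaces), cutting the text into individual words at the spaces. In full-text search this is called tokenization, or word segmentation, and it is one of the core technologies of the field. By default Lucene can only tokenize English and other Latin-script text and does not support double-byte scripts such as Chinese, Japanese, and Korean. Chinese word segmentation will be covered in depth in later chapters.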
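To make tokenization concrete, here is a deliberately naive sketch that splits text on whitespace and lowercases each word. It uses only the JDK and is not how StandardAnalyzer is implemented; the class name `NaiveTokenizer` is hypothetical. It also shows why splitting on spaces breaks down for Chinese: a sentence with no spaces comes back as one single token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy illustration of whitespace tokenization; not Lucene code.
class NaiveTokenizer {

    /** Splits text on runs of whitespace and lowercases every token. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {          // an empty input yields one empty piece
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [lucene, is, a, full-text, search, engine]
        System.out.println(tokenize("Lucene is a full-text search engine"));
        // prints [全文检索引擎], one token, because the text contains no spaces
        System.out.println(tokenize("全文检索引擎"));
    }
}
```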

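The third argument says whether to initialize the index. I pass true here, which means create a new index or overwrite an existing one; false means append to an existing index. Since we are creating an index, it must be initialized. Afterwards the index directory contains a single file named segments, 1 KB in size. But if you initialize an index that already holds records, all of its contents are lost and the index is restored to its initial state, exactly as if it had just been created. In a real project, use this option with great care.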
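Because passing true wipes an existing index, one defensive pattern is to compute the flag instead of hard-coding it. The following is a minimal sketch that relies on the observation above, namely that an initialized index directory contains a file named segments. That holds for the 1.4 layout described here and is an assumption for any other version. `IndexGuard` and `needsCreate` are hypothetical names, and only the JDK is used.

```java
import java.io.File;

// Hypothetical guard, not part of Lucene.
class IndexGuard {

    /** True when dir holds no "segments" file, i.e. no index was initialized there yet. */
    static boolean needsCreate(File dir) {
        return !new File(dir, "segments").isFile();
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : "e:\\lucene\\holendb");
        boolean create = needsCreate(dir);
        // The result would be passed as the third IndexWriter argument:
        //   new IndexWriter(dir, new StandardAnalyzer(), create)
        System.out.println("create flag for " + dir + ": " + create);
    }
}
```

With a check like this, running the initialization code a second time appends to the existing index instead of silently discarding its records.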

InsertRecords.java

package com.holen.part1;

import java.io.File;
import java.io.FileReader;
import java.io.Reader;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

/**
 * @author Holen Chen
 * Loads records into the index
 */
public class InsertRecords {

    public InsertRecords() {
    }

    public int insertRecords(String dbpath, File file) {
        int returnValue = 0;
        try {
            // false = append to the existing index
            IndexWriter indexWriter
                = new IndexWriter(dbpath, new StandardAnalyzer(), false);
            this.addFiles(indexWriter, file);
            returnValue = 1;
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        return returnValue;
    }

    /**
     * Takes the name of the file to load
     * @param dbpath path of the index directory
     * @param file name of the file to load
     * @return 1 on success, 0 on failure
     */
    public int insertRecords(String dbpath, String file) {
        return this.insertRecords(dbpath, new File(file));
    }

    public void addFiles(IndexWriter indexWriter, File file) {
        Document doc = new Document();
        try {
            doc.add(Field.Keyword("filename", file.getName()));
            // Use only one of the two statements below:
            // the first indexes without storing, the second indexes and stores
            //doc.add(Field.Text("content", new FileReader(file)));
            doc.add(Field.Text("content", this.chgFileToString(file)));
            indexWriter.addDocument(doc);
            indexWriter.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }

    /**
     * Reads the content of a text file
     * @param file the file to read
     * @return the file content as a string
     */
    public String chgFileToString(File file) {
        String returnValue = null;
        StringBuffer sb = new StringBuffer();
        char[] c = new char[4096];
        try {
            Reader reader = new FileReader(file);
            int n = 0;
            while (true) {
                n = reader.read(c);
                if (n > 0) {
                    sb.append(c, 0, n);
                } else {
                    break;
                }
            }
            reader.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        returnValue = sb.toString();
        return returnValue;
    }

    public static void main(String[] args) {
        InsertRecords temp = new InsertRecords();
        String dbpath = "e:\\lucene\\holendb";
        // holen1.txt contains the keywords "holen" and "java"
        if (temp.insertRecords(dbpath, "e:\\lucene\\holen1.txt") == 1) {
            System.out.println("add file1 succ");
        }
        // holen2.txt contains the keywords "holen" and "chen"
        if (temp.insertRecords(dbpath, "e:\\lucene\\holen2.txt") == 1) {
            System.out.println("add file2 succ");
        }
    }
}
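The comment in addFiles distinguishes "indexed but not stored" from "indexed and stored". The toy model below tries to show what that difference means in practice: an indexed field can be searched, but only a stored field can be read back from a hit. It is a sketch built on plain JDK collections and says nothing about Lucene's real data structures; `ToyIndex` and all of its methods are hypothetical names.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index; an illustration only, not Lucene code.
class ToyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>(); // term -> document ids
    private final Map<Integer, String> stored = new HashMap<>();        // document id -> original text
    private int nextId = 0;

    /** Indexes the content; keeps the original text only when store is true. */
    int add(String content, boolean store) {
        int id = nextId++;
        for (String term : content.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
        if (store) {
            stored.put(id, content);
        }
        return id;
    }

    /** Returns the ids of all documents that contain the term. */
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    /** Returns the original text, or null if the document was indexed without storing. */
    String storedContent(int id) {
        return stored.get(id);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("holen java", true);   // indexed and stored
        index.add("holen chen", false);  // indexed only
        System.out.println(index.search("holen"));    // prints [0, 1]
        System.out.println(index.storedContent(0));   // prints holen java
        System.out.println(index.storedContent(1));   // prints null
    }
}
```

Both documents are found by a search for "holen", but only the first one can return its text. This is the trade-off behind choosing between the Reader and String variants of Field.Text in the listing above: storing costs index space, and not storing means the content must be fetched from the original file.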

Posted on 2004-10-20 00:18 by 信息时代的生存哲学
