[Lucene 4.8 Tutorial, Part 2] Indexing
I. Basics
0. What the official documentation says
(1)org.apache.lucene.index provides two primary classes: IndexWriter, which creates and adds documents to indices; and IndexReader, which accesses the data in the index.
(2) The two main packages involved are:
org.apache.lucene.index: Code to maintain and access indices.
org.apache.lucene.document: The logical representation of a Document for indexing and searching.
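1. Creating an index involves the following key classes:
(1) IndexWriter: the core component of the indexing process. It creates a new index or opens an existing one, and adds, deletes, and updates the documents in the index.
(2) Document: a collection of fields.
(3) Field and its subclasses: a single field, such as the document's creation time, author, or content.
(4) Analyzer: the analyzer, which turns field text into the terms that are indexed.
(5) Directory: describes where the Lucene index is stored.
2. The basic process of indexing documents is as follows:
(1) Create the IndexWriter for the index
(2) Create a Document from each file
(3) Write the document to the index
The basic program is as follows: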
package org.jediael.search.index;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.jediael.util.LoadProperties;
// 1. Create the IndexWriter for the index
// 2. Create a Document from each file
// 3. Write the document to the index
public class IndexFiles {
	
	private IndexWriter writer = null;
	public void indexAllFileinDirectory(String indexPath, String docsPath)
			throws IOException {
		// Location of the files to index. If the argument is null, fall back to the default set in search.properties.
		if (docsPath == null) {
			docsPath = LoadProperties.getProperties("docsDir");
		}
		final File docDir = new File(docsPath);
		if (!docDir.exists() || !docDir.canRead()) {
			System.out
					.println("Document directory '"
							+ docDir.getAbsolutePath()
							+ "' does not exist or is not readable, please check the path");
			System.exit(1);
		}
		// Location of the index files. If the argument is null, fall back to the default set in search.properties.
		if (indexPath == null) {
			indexPath = LoadProperties.getProperties("indexDir");
		}
		final File indexDir = new File(indexPath);
		if (!indexDir.exists() || !indexDir.canRead()) {
			System.out
					.println("Index directory '"
							+ indexDir.getAbsolutePath()
							+ "' does not exist or is not readable, please check the path");
			System.exit(1);
		}
		
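		try {
			// 1. Create the IndexWriter for the index
			if (writer == null) {
				initialIndexWriter(indexDir);
			}
			index(writer, docDir);
		} catch (IOException e) {
			e.printStackTrace();
		} finally {
			// Guard against a failed initialization, and reset the field so that
			// a later call opens a fresh writer instead of reusing a closed one.
			if (writer != null) {
				writer.close();
				writer = null;
			}
		}
	}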
	// A very simple singleton-style approach that provides a single IndexWriter. Note that it is not thread-safe and needs further work.
	private void initialIndexWriter(File indexDir) throws IOException {
		Directory returnIndexDir = FSDirectory.open(indexDir);
		IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48,new StandardAnalyzer(Version.LUCENE_48));
		writer = new IndexWriter(returnIndexDir, iwc);
	}
	private void index(IndexWriter writer, File filetoIndex) throws IOException {
		if (filetoIndex.isDirectory()) {
			String[] files = filetoIndex.list();
			if (files != null) {
				for (int i = 0; i < files.length; i++) {
					index(writer, new File(filetoIndex, files[i]));
				}
			}
		} else {
			// 2. Create a Document from the file. Consider whether a Document object could be reused instead of created each time.
			Document doc = new Document();
			Field pathField = new StringField("path", filetoIndex.getPath(),
					Field.Store.YES);
			doc.add(pathField);
			doc.add(new LongField("modified", filetoIndex.lastModified(),
					Field.Store.YES));
			doc.add(new StringField("title",filetoIndex.getName(),Field.Store.YES));
			doc.add(new TextField("contents", new FileReader(filetoIndex)));
			//System.out.println("Indexing " + filetoIndex.getName());
			// 3. Write the document to the index
			writer.addDocument(doc);
		}
	}
}
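Some notes:
(1) A very simple singleton-style approach is used to provide a single IndexWriter. Note that it is not thread-safe and needs further work.
(2) Creating instances of IndexWriter, IndexReader, and similar classes is expensive. So unless there is a reason not to, create one instance with the singleton pattern and reuse it.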
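Note (1) says that the writer is not created in a thread-safe way. A minimal sketch of one way to make lazy creation safe is shown below. It is plain Java and deliberately avoids Lucene types so that it runs on its own. SharedResource and its factory parameter are hypothetical names, not part of the program above or of Lucene; in the indexing program the factory would be the code that builds the IndexWriter.

```java
import java.util.concurrent.Callable;

// Hypothetical helper (not part of Lucene or of the program above):
// creates an expensive object at most once and hands the same instance to every caller.
final class SharedResource<T> {
    private final Callable<T> factory; // builds the resource, e.g. the code that opens the IndexWriter
    private T instance;                // guarded by "this"

    SharedResource(Callable<T> factory) {
        this.factory = factory;
    }

    // synchronized: two threads calling get() at the same time cannot both run the factory.
    synchronized T get() {
        if (instance == null) {
            try {
                instance = factory.call();
            } catch (Exception e) {
                throw new IllegalStateException("could not create shared resource", e);
            }
        }
        return instance;
    }

    // Forget the instance (for example after it has been closed) so the next get() rebuilds it.
    synchronized void reset() {
        instance = null;
    }
}
```

Synchronizing every call is the simplest correct choice here; the cost of the lock is small compared with the cost of opening an index.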
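3. The relationship between the index, Document, and Field
In short, multiple Fields make up a Document, and multiple Documents make up an index.
They are connected through the following calls: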
Document doc = new Document();
Field pathField = new StringField("path", filetoIndex.getPath(),Field.Store.YES);
doc.add(pathField);
writer.addDocument(doc);
II. About Field
(1) Basic ways to create a field
The older general-purpose constructors (the style used before 4.x):
Field field = new Field("filename", f.getName(),  Field.Store.YES, Field.Index.NOT_ANALYZED);
Field field = new Field("contents", new FileReader(f));
Field field = new Field("fullpath", f.getCanonicalPath(), Field.Store.YES, Field.Index.NOT_ANALYZED);
The four parameters of Field are, in order: the field name, the field value, whether the value is stored (Field.Store), and how the value is indexed (Field.Index).
The typed field classes in 4.x:
Field field = new StringField("path", filetoIndex.getPath(),Field.Store.YES);
Field field = new LongField("modified", filetoIndex.lastModified(),Field.Store.NO);
Field field = new TextField("contents", new FileReader(filetoIndex));
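From 4.x on, StringField is NOT_ANALYZED (the field content is not tokenized), whereas TextField is ANALYZED, so this attribute no longer needs to be specified when creating a Field. See http://stackoverflow.com/questions/19042587/how-to-prevent-a-field-from-not-analyzing-in-lucene
Content stored with Field.Store.YES can be returned in search results and shown to the user.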
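Field field = new TextField("contents", new FileReader(filetoIndex));
This works only for plain text. For rich formats such as Word, Excel, or PDF, what FileReader reads is mostly garbage characters, which does not produce a useful index. See:
http://stackoverflow.com/questions/16640292/lucene-4-2-0-index-pdf
For such files, extract the text first, for example with Tika, and index the extracted string. TikaBasicUtil is a small helper that is listed in full at the end of this subsection.
doc.add(new TextField("contents", TikaBasicUtil.extractContent(filetoIndex),Field.Store.NO));
Note that StringField cannot be used here. A StringField is indexed as one single term, and the UTF-8 encoding of a term may not exceed 32766 bytes; otherwise an exception is thrown:
IllegalArgumentException: Document contains at least one immense term in field="contents" (whose UTF8 encoding is longer than the max length 32766)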
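The limit in the exception message is measured in bytes of the UTF-8 encoding, not in characters, so non-ASCII text reaches it sooner than its character count suggests. Below is a small sketch of checking a value before choosing StringField. TermLengthCheck and its methods are hypothetical helper names, not Lucene API, and the constant 32766 is taken from the message quoted above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of Lucene): checks whether a value could be indexed as one term.
final class TermLengthCheck {
    // Limit quoted in the IllegalArgumentException message above, in bytes.
    static final int MAX_TERM_BYTES = 32766;

    // Number of bytes the value occupies once encoded as UTF-8.
    static int utf8Length(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if the whole value fits in a single term, as a StringField requires.
    static boolean fitsInSingleTerm(String value) {
        return utf8Length(value) <= MAX_TERM_BYTES;
    }
}
```

A short value such as a file path passes the check, while extracted document text usually should go into a TextField, where the analyzer splits it into many small terms.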
// TikaBasicUtil: extracts the text of a rich-text file with Tika so that it can be indexed.
package org.jediael.util;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
public class TikaBasicUtil {
	
	public static String extractContent(File f) {
		// 1. Create a parser
		Parser parser = new AutoDetectParser();
		InputStream is = null;
		try {
			Metadata metadata = new Metadata();
			metadata.set(Metadata.RESOURCE_NAME_KEY, f.getName());
			is = new FileInputStream(f);
			ContentHandler handler = new BodyContentHandler();
			ParseContext context = new ParseContext();
			context.set(Parser.class,parser);
			
			// 2. Run the parser's parse() method.
			parser.parse(is,handler, metadata,context);
				
			String returnString = handler.toString();
			
			System.out.println(returnString.length());
			return returnString;
		} catch (FileNotFoundException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		} catch (SAXException e) {
			e.printStackTrace();
		} catch (TikaException e) {
			e.printStackTrace();
		}finally {
			try {
				if(is!=null) is.close();
			} catch (IOException e) {
				e.printStackTrace();
			}
		}
		return "No Contents";
	}
}
III. About IndexWriter
Directory returnIndexDir = FSDirectory.open(indexDir);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48,new StandardAnalyzer(Version.LUCENE_48));
iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
writer = new IndexWriter(returnIndexDir, iwc);
System.out.println(writer.getConfig().getOpenMode()+"");
System.out.println(iwc.getOpenMode());
Creating an IndexWriter requires two arguments: a Directory object, which specifies where the index is written, and an IndexWriterConfig object, which holds the writer's configuration.
(1) Class hierarchy of IndexWriterConfig:
java.lang.Object
  org.apache.lucene.index.LiveIndexWriterConfig
    org.apache.lucene.index.IndexWriterConfig
All Implemented Interfaces: Cloneable
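(2) Holds all the configuration that is used to create an IndexWriter. Once IndexWriter has been created with this object, changes to this object will not affect the IndexWriter instance.
(3) IndexWriterConfig.OpenMode specifies how the index directory is opened. There are three modes:
 - APPEND: Opens an existing index. If an index already exists, the documents from this run are appended to it, whether or not they duplicate documents that are already indexed. So if the same documents are indexed twice, a search returns twice as many results.
 - CREATE: Creates a new index or overwrites an existing one. If an index already exists, it is removed first and a new one is created.
 - CREATE_OR_APPEND (the default): Creates a new index if one does not exist, otherwise it opens the index and documents will be appended.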
 
 
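writer.addDocument(doc);
writer.forceMerge(2);
Index optimization merges the index files into one segment or a limited number of segments (at most two in the call above). It increases the cost of indexing and reduces the cost of searching.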
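The analyzer can be set for the whole writer through IndexWriterConfig, or passed to addDocument for a single document:
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48,new StandardAnalyzer(Version.LUCENE_48));
writer = new IndexWriter(IndexDir, iwc);
writer.addDocument(doc, new SimpleAnalyzer(Version.LUCENE_48));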
                    
                
                
            
        