Lucene Usage Summary (Continued)

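The previous post summarized the basics of using Lucene. In a multithreaded environment, however, several threads will usually access the index at the same time, so synchronization issues are unavoidable, and we need to establish how thread-safe Lucene is. Let us first look at how the Lucene wiki FAQ (http://wiki.apache.org/lucene-java/LuceneFAQ) answers a few questions: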

Why am I getting an IOException that says "Too many open files"?

The number of files that can be opened simultaneously is a system-wide limitation of your operating system. Lucene might cause this problem as it can open quite some files depending on how you use it, but the problem might also be somewhere else.

• Always make sure that you explicitly close all file handles you open, especially in case of errors. Use a try/catch/finally block to open the files, i.e. open them in the try block, close them in the finally block. Remember that Java doesn't have destructors, so don't close file handles in a finalize method -- this method is not guaranteed to be executed.

• Use the compound file format (it's activated by default starting with Lucene 1.4) by calling IndexWriter's setUseCompoundFile(true)

• Don't set IndexWriter's mergeFactor to large values. Large values speed up indexing but increase the number of files that need to be opened simultaneously.

• Make sure you only open one IndexSearcher, and share it among all of the threads that are doing searches -- this is safe, and it will minimize the number of files that are open concurrently.

• Try to increase the number of files that can be opened simultaneously. On Linux using bash this can be done by calling ulimit -n <number>.
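The first point in the list above (open in the try block, close in the finally block) can be sketched as follows. This is only an illustration using the Java standard library: `HandleHygiene`, `Handle`, and `OPEN_HANDLES` are hypothetical names invented here, and `Handle` merely stands in for a file handle or an index reader.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

final class HandleHygiene {
    /** Counts handles that are currently open, so leaks are easy to spot. */
    static final AtomicInteger OPEN_HANDLES = new AtomicInteger();

    /** Hypothetical stand-in for a file handle or an index reader. */
    static final class Handle implements Closeable {
        Handle() {
            OPEN_HANDLES.incrementAndGet();
        }

        String read(boolean fail) throws IOException {
            if (fail) {
                throw new IOException("simulated read failure");
            }
            return "data";
        }

        @Override
        public void close() {
            OPEN_HANDLES.decrementAndGet();
        }
    }

    /** Opens in the try block and closes in the finally block, even on errors. */
    static String readOnce(boolean fail) throws IOException {
        Handle handle = null;
        try {
            handle = new Handle();
            return handle.read(fail);
        } finally {
            if (handle != null) {
                handle.close();
            }
        }
    }
}
```

On Java 7 and later, a try-with-resources statement gives the same guarantee with less code.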

Does Lucene allow searching and indexing simultaneously?

Yes. However, an IndexReader only searches the index as of the "point in time" that it was opened. Any updates to the index, either added or deleted documents, will not be visible until the IndexReader is re-opened. So your application must periodically re-open its IndexReaders to see the latest updates. The IndexReader.isCurrent() method allows you to test whether any updates have occurred to the index since your IndexReader was opened.

Is the QueryParser thread-safe?

No, it's not.

Is the IndexSearcher thread-safe?

Yes, IndexSearcher is thread-safe. Multiple search threads may use the same instance of IndexSearcher concurrently without any problems. It is recommended to use only one IndexSearcher from all threads in order to save memory.

Is the IndexWriter class, and especially the method addIndexes(Directory[]) thread safe?

Yes, IndexWriter.addIndexes(Directory[]) method is thread safe (it is a synchronized method). IndexWriter in general is thread safe, i.e. you should use the same IndexWriter object from all of your threads. Actually it's impossible to use more than one IndexWriter for the same index directory, as this will lead to an exception trying to create the lock file.

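From the above we can see that Lucene already handles thread safety internally; as long as a few rules are followed, it can be used safely in a multithreaded environment.

1. IndexWriter is thread-safe, so a single IndexWriter can be shared across threads.

2. IndexSearcher is thread-safe, so a single IndexSearcher can be shared across threads.

3. QueryParser is not thread-safe, so one instance should not be shared across threads.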
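One common way to live with a class that is not thread-safe is to give each thread its own instance. The sketch below shows that pattern with `ThreadLocal` from the Java standard library. `PerThreadParser` and `StubParser` are hypothetical names invented for this illustration; `StubParser` only imitates a parser with mutable internal state and is not Lucene's QueryParser.

```java
import java.util.Locale;

final class PerThreadParser {
    /** Hypothetical stand-in for a parser that keeps mutable state between calls. */
    static final class StubParser {
        // Shared scratch space like this is what makes an object unsafe to share.
        private final StringBuilder scratch = new StringBuilder();

        String parse(String text) {
            scratch.setLength(0);
            scratch.append("field:").append(text.trim().toLowerCase(Locale.ROOT));
            return scratch.toString();
        }
    }

    /** Each thread lazily gets its own parser instance. */
    private static final ThreadLocal<StubParser> PARSER =
            ThreadLocal.withInitial(StubParser::new);

    /** Returns the parser that belongs to the calling thread. */
    static StubParser current() {
        return PARSER.get();
    }

    static String parse(String text) {
        return current().parse(text);
    }
}
```

Creating a new parser per query is the simpler alternative when construction is cheap; the thread-local variant only avoids repeated allocation.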

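The official site also recommends:

Share a single IndexReader across threads, which removes the need for code that handles thread conflicts.

Use a single IndexSearcher instance across all queries and threads in the application. Reopen the IndexSearcher only when the latest committed updates need to show up in searches. Note that reopening a searcher has a cost (noticeable on large indexes and when sorting is enabled), and this cost should be minimized. Consider warming the caches before serving the first query.

Opening and closing an IndexSearcher or IndexWriter is expensive, so they are usually best managed as singletons. For searching, for example, one IndexReader per index directory is enough.

An IndexReader is itself thread-safe and corresponds one-to-one with an index directory. When the index directory receives incremental updates, however, the IndexReader may fail to find newly updated content. To solve this, call openIfChanged(), which loads only the index segments that have changed instead of reloading the whole index, and therefore saves resources.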

API： For performance reasons, if your index is unchanging, you should share a single IndexSearcher instance across multiple searches instead of creating a new one per-search. If your index has changed and you wish to see the changes reflected in searching, you should use IndexReader.openIfChanged(org.apache.lucene.index.IndexReader) to obtain a new reader and then create a new IndexSearcher from that. Also, for low-latency turnaround it's best to use a near-real-time reader (IndexReader.open(IndexWriter,boolean)). Once you have a new IndexReader, it's relatively cheap to create a new IndexSearcher from it.

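In practice, because a searcher is constructed from a reader, first obtain the current searcher's reader with searcher.getIndexReader(), then call reader.isCurrent() to check whether the index has changed. If it has, close the current searcher, obtain a new reader with IndexReader.openIfChanged(reader), and then create a new searcher from it.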
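The check-then-swap procedure described above can be modelled without Lucene as follows. This is a sketch under stated assumptions: `SearcherHolder`, `StubIndex`, `StubReader`, and `StubSearcher` are hypothetical stand-ins invented here, and only the method name `isCurrent()` mirrors the real API mentioned in the text.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

final class SearcherHolder {
    /** Hypothetical stand-in for an index directory with a commit version. */
    static final class StubIndex {
        final AtomicLong version = new AtomicLong();

        void commit() {
            version.incrementAndGet();
        }
    }

    /** Hypothetical stand-in for a reader: a snapshot taken when it was opened. */
    static final class StubReader {
        final StubIndex index;
        final long openedAt;
        volatile boolean closed;

        StubReader(StubIndex index) {
            this.index = index;
            this.openedAt = index.version.get();
        }

        boolean isCurrent() {
            return openedAt == index.version.get();
        }

        void close() {
            closed = true;
        }
    }

    /** Hypothetical stand-in for a searcher built on top of a reader. */
    static final class StubSearcher {
        final StubReader reader;

        StubSearcher(StubReader reader) {
            this.reader = reader;
        }
    }

    private final AtomicReference<StubSearcher> current;

    SearcherHolder(StubIndex index) {
        current = new AtomicReference<>(new StubSearcher(new StubReader(index)));
    }

    /** All search threads share whatever searcher is current. */
    StubSearcher get() {
        return current.get();
    }

    /** Swaps in a new searcher only if the index changed; returns true on a swap. */
    synchronized boolean maybeRefresh() {
        StubSearcher old = current.get();
        if (old.reader.isCurrent()) {
            return false;
        }
        current.set(new StubSearcher(new StubReader(old.reader.index)));
        // Simplification: real code must wait until in-flight searches are done.
        old.reader.close();
        return true;
    }
}
```

A background thread or a timer could call `maybeRefresh()` periodically, so that search threads never pay the reopening cost themselves.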

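If a path string or a Directory is passed to the searcher instead, the searcher maintains an internal reader, and that internal reader is closed once the search is finished.

When building an index, although IndexWriter is thread-safe, concurrent modification of the same index is still not allowed: multiple IndexWriter or IndexReader instances may not modify one index at the same time. In other words, only one modification of an index may run at any moment, while any number of search operations may run concurrently, including searches by multiple users against the same index. Lucene synchronizes calls to all methods that modify the index, so that modifications run one after another in order. That is:

1) Multiple read-only operations can run at the same time, even while the index is being modified.

2) At any moment, only one index-modifying operation may run. While an IndexReader is deleting a document from the index, an IndexWriter cannot add documents to it; while an IndexWriter is optimizing the index, an IndexReader cannot delete documents from it; and while an IndexWriter is merging the index, an IndexReader cannot delete documents from it either.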

API：Opening an IndexWriter creates a lock file for the directory in use. Trying to open another IndexWriter on the same directory will lead to a LockObtainFailedException. The LockObtainFailedException is also thrown if an IndexReader on the same directory is used to delete documents from the index.
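The lock-file idea in the quote above can be illustrated with plain `java.nio.file`. This is a simplified model, not Lucene's implementation: `WriteLockSketch` is a hypothetical class invented here, and it signals a conflict with `IllegalStateException` rather than Lucene's LockObtainFailedException.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

final class WriteLockSketch implements AutoCloseable {
    private final Path lockFile;

    private WriteLockSketch(Path lockFile) {
        this.lockFile = lockFile;
    }

    /** Tries to become the single writer for a directory. */
    static WriteLockSketch obtain(Path indexDir) throws IOException {
        Path lock = indexDir.resolve("write.lock");
        try {
            // Atomic check-and-create: fails if the lock file already exists.
            Files.createFile(lock);
        } catch (FileAlreadyExistsException e) {
            throw new IllegalStateException("index already has a writer: " + indexDir, e);
        }
        return new WriteLockSketch(lock);
    }

    /** Releases the lock so that another writer can be opened later. */
    @Override
    public void close() throws IOException {
        Files.deleteIfExists(lockFile);
    }
}
```

Because the lock is tied to the directory rather than to a thread, the practical consequence matches the FAQ: open one writer per index and share it among all threads.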

NOTE: IndexWriter instances are completely thread safe, meaning multiple threads can call any of its methods, concurrently. If your application requires external synchronization, you should not synchronize on the IndexWriter instance as this may cause deadlock; use your own (non-Lucene) objects instead.

NOTE: If you call Thread.interrupt() on a thread that's within IndexWriter, IndexWriter will try to catch this (eg, if it's in a wait() or Thread.sleep()), and will then throw the unchecked exception ThreadInterruptedException and clear the interrupt status on the thread.

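Because an IndexReader can delete documents, note the following:

Before deleting through an IndexReader, first close any IndexWriter instance operating on the same index.

Before adding documents to the index through an IndexWriter, first close the IndexReader instance that performed the deletions.

At any moment, no more than one IndexReader may be deleting documents.

Each such IndexReader should start only after the previous IndexReader has finished its close method.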

API：An IndexReader can be opened on a directory for which an IndexWriter is opened already, but it cannot be used to delete documents from the index then.

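Likewise, to speed up indexing, Lucene buffers a lot of data in memory, so the effect of this buffering has to be taken into account for add and delete operations.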

API：In either case, documents are added with addDocument and removed with deleteDocuments(Term) or deleteDocuments(Query). A document can be updated with updateDocument (which just deletes and then adds the entire document). When finished adding, deleting and updating documents, close should be called.

These changes are buffered in memory and periodically flushed to the Directory (during the above method calls). A flush is triggered when there are enough buffered deletes (see setMaxBufferedDeleteTerms(int)) or enough added documents since the last flush, whichever is sooner. For the added documents, flushing is triggered either by RAM usage of the documents (see setRAMBufferSizeMB(double)) or the number of added documents. The default is to flush when RAM usage hits 16 MB. For best indexing speed you should flush by RAM usage with a large RAM buffer. Note that flushing just moves the internal buffered state in IndexWriter into the index, but these changes are not visible to IndexReader until either commit() or close() is called. A flush may also trigger one or more segment merges which by default run with a background thread so as not to block the addDocument calls (see below for changing the MergeScheduler).

Expert: IndexWriter allows an optional IndexDeletionPolicy implementation to be specified. You can use this to control when prior commits are deleted from the index. The default policy is KeepOnlyLastCommitDeletionPolicy which removes all prior commits as soon as a new commit is done (this matches behavior before 2.2). Creating your own policy can allow you to explicitly keep previous "point in time" commits alive in the index for some time, to allow readers to refresh to the new commit without having the old commit deleted out from under them. This is necessary on filesystems like NFS that do not support "delete on last close" semantics, which Lucene's "point in time" search normally relies on.

Expert: IndexWriter allows you to separately change the MergePolicy and the MergeScheduler. The MergePolicy is invoked whenever there are changes to the segments in the index. Its role is to select which merges to do, if any, and return a MergePolicy.MergeSpecification describing the merges. The default is LogByteSizeMergePolicy. Then, the MergeScheduler is invoked with the requested merges and it decides when and how to run the merges. The default is ConcurrentMergeScheduler.

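Until the IndexWriter calls commit() or close(), its changes are not visible, even if the IndexReader is reopened. For the latest changes to become visible, the IndexWriter must call commit() or close(), and the IndexReader must then be reopened.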
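The two conditions for visibility (the writer commits, then the reader is reopened) can be shown with a toy model. `CommitVisibility` and `ToyIndex` are hypothetical names invented for this sketch; only the method names `addDocument` and `commit` echo the API quoted above, and the model ignores flushing, segments, and deletes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class CommitVisibility {
    /** Hypothetical stand-in for an index with a single writer-side buffer. */
    static final class ToyIndex {
        private final List<String> buffered = new ArrayList<>();
        private volatile List<String> committed = Collections.emptyList();

        /** Buffers a document; readers cannot see it yet. */
        synchronized void addDocument(String doc) {
            buffered.add(doc);
        }

        /** Publishes all buffered documents as a new immutable snapshot. */
        synchronized void commit() {
            List<String> next = new ArrayList<>(committed);
            next.addAll(buffered);
            buffered.clear();
            committed = Collections.unmodifiableList(next);
        }

        /** A reader is a frozen view of what was committed when it was opened. */
        List<String> openReader() {
            return committed;
        }
    }
}
```

In this model, a reader opened before `commit()` stays empty forever, and only a reader opened afterwards sees the new document.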

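Because Lucene buffers data, the size of the buffer has to be considered. The setMaxBufferedDocs method sets how many Documents the buffer can hold; the larger the value, the more Documents are held in memory before being flushed.

The setMaxFieldLength method sets the maximum length of a Field, that is, the maximum number of terms indexed for a single field.

Typically, one index store is managed by one IndexWriter, and a single index store should not exceed 2 GB (because of the 2GB file size limit of some 32-bit operating systems). Even at 2 GB, optimizing the index takes around 10 minutes on every index update. How to partition the indexes depends on the actual situation.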

API：NOTE: if you hit an OutOfMemoryError then IndexWriter will quietly record this fact and block all future segment commits. This is a defensive measure in case any internal state (buffered documents and deletions) were corrupted. Any subsequent calls to commit() will throw an IllegalStateException. The only course of action is to call close(), which internally will call rollback(), to undo any changes to the index since the last commit. You can also just call rollback() directly.

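Some articles online mention that if an application's architecture consists of multiple Lucene indexes, MultiSearcher can search across all of them, and ParallelMultiSearcher can do so with multiple threads. On a single core, MultiSearcher performs better than ParallelMultiSearcher.

MultiSearcher runs a query by looping over the IndexSearchers and querying each index in turn, which amounts to serial processing. If MultiSearcher has to query many indexes, query efficiency will clearly suffer.

Lucene provides another option, ParallelMultiSearcher, for multithreaded search. It is used the same way as MultiSearcher, but it processes the indexes in parallel: the search operation assigns one thread to each Searchable and waits until all threads have finished their searches. Both basic searches and filtered searches run in parallel. This is the biggest difference between MultiSearcher and ParallelMultiSearcher.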
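The difference between the serial loop and the one-thread-per-index approach can be sketched with the standard `java.util.concurrent` classes. `FanOutSearch` and `SubSearcher` are hypothetical names invented here; this shows the general pattern only and is not how MultiSearcher or ParallelMultiSearcher are implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class FanOutSearch {
    /** Hypothetical stand-in for one searchable index. */
    interface SubSearcher {
        List<String> search(String query);
    }

    /** Serial style: query each sub-index in a loop, one after another. */
    static List<String> searchSerial(List<SubSearcher> subs, String query) {
        List<String> hits = new ArrayList<>();
        for (SubSearcher sub : subs) {
            hits.addAll(sub.search(query));
        }
        return hits;
    }

    /** Parallel style: one task per sub-index, then wait for all of them. */
    static List<String> searchParallel(List<SubSearcher> subs, String query)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, subs.size()));
        try {
            List<Future<List<String>>> pending = new ArrayList<>();
            for (SubSearcher sub : subs) {
                Callable<List<String>> task = () -> sub.search(query);
                pending.add(pool.submit(task));
            }
            List<String> hits = new ArrayList<>();
            for (Future<List<String>> future : pending) {
                // Results are collected in submission order, so output is stable.
                hits.addAll(future.get());
            }
            return hits;
        } finally {
            pool.shutdown();
        }
    }
}
```

On a single core the parallel variant adds thread scheduling overhead without running anything concurrently, which is consistent with the observation above that the serial searcher is faster there.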

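For day-to-day Lucene use, the official wiki offers several articles of optimization advice (http://wiki.apache.org/lucene-java/BasicsOfPerformance). Most of them have Chinese translations, such as "How to improve and optimize Lucene search speed" (http://hi.baidu.com/lewutian/item/f148662219ab1bd4a417b682) and "How to improve and optimize Lucene indexing speed" (http://hi.baidu.com/lewutian/item/403e01d59e3473322b35c794).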

posted @ 2013-04-25 22:09 Jevo
