Speeding up Lucene search (ImproveSearchingSpeed) - Eric Yao

This post is a simple translation. The original is at:
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed

* Be sure you really need to speed things up.

Many of the ideas here are simple to try, but others will necessarily add some complexity to your application. So be sure your searching speed is indeed too slow and the slowness is indeed within Lucene.

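* Make sure you really need faster searching.

Many of the ideas here are very easy to try, but some will add extra complexity to your program. So make sure your search really is unbearably slow, and that Lucene really is the cause.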

* Make sure you are using the latest version of Lucene.

* Confirm that you are running the latest release of Lucene.

* Use a local filesystem.

Remote filesystems are typically quite a bit slower for searching. If the index must be remote, try to mount the remote filesystem as a "readonly" mount. In some cases this could improve performance.

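* Use a local filesystem whenever possible.

Remote filesystems generally slow searching down. If the index has to live on a remote server, try mounting the remote filesystem read-only. In some cases this improves performance.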

* Get faster hardware, especially a faster IO system.

Flash-based Solid State Drives work very well for Lucene searches. As seek times for SSDs are about 100 times faster than those of traditional platter-based hard drives, the usual penalty for seeking is virtually eliminated. This means that SSD-equipped machines need less RAM for file caching and that searchers require less warm-up time before they respond quickly.

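* Use faster hardware, especially a faster IO device.

Lucene searches work very well on flash-based solid-state drives. SSD seek times are roughly 100 times faster than those of traditional platter-based hard drives. This means an SSD-equipped machine needs less RAM for file caching, and searchers need less warm-up time before they respond quickly.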

* Open the IndexReader with readOnly=true.

This makes a big difference when multiple threads are sharing the same reader, as it removes certain sources of thread contention.

* Open the IndexReader in read-only mode.

When multiple threads share the same reader, this brings a big improvement because it removes some lock contention.

* On non-Windows platforms, use NIOFSDirectory instead of FSDirectory.

This also removes sources of contention when accessing the underlying files. Unfortunately, due to a longstanding bug on Windows in Sun's JRE (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734 -- feel particularly free to go vote for it), NIOFSDirectory gets poor performance on Windows.

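* On non-Windows platforms, use NIOFSDirectory in place of FSDirectory.

This likewise removes some lock contention when accessing the underlying files. Unfortunately, because of a bug in Sun's JRE on Windows (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734 ), NIOFSDirectory performs worse on Windows.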

* Add RAM to your hardware and/or increase the heap size for the JVM.

For a large index, searching can use a lot of RAM. If you don't have enough RAM, or your JVM is not running with a large enough heap size, then the JVM can hit swapping and thrashing, at which point everything will run slowly.

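* Add more RAM to your machine and give the Java virtual machine more memory.

The larger the index, the more memory searching needs. If the machine does not have enough RAM, or the JVM heap is not set large enough, frequent page swapping and heavy use of virtual memory will overload the disk, and at that point everything runs slowly.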
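
Before raising the heap it can help to see what the JVM is actually running with. The following is a minimal sketch using only the standard `Runtime` API; the class name `HeapReport` is made up for illustration and is not part of Lucene.

```java
class HeapReport {
    /** Converts a byte count to whole mebibytes. */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() is the most memory the JVM will attempt to use for the heap.
        System.out.println("max heap (MiB):  " + toMiB(rt.maxMemory()));
        // totalMemory() is what is currently reserved; freeMemory() is the unused part of it.
        System.out.println("used heap (MiB): " + toMiB(rt.totalMemory() - rt.freeMemory()));
    }
}
```

The maximum heap is normally set with the standard `-Xmx` JVM option, for example `java -Xmx2g HeapReport`. Keep the heap below physical RAM so the operating system still has room for its file cache.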

* Use one instance of IndexSearcher.

Share a single IndexSearcher across queries and across threads in your application.

* Use a single IndexSearcher instance in your program.

Share one IndexSearcher instance across all queries in your program; it is thread-safe.

* When measuring performance, disregard the first query.

The first query to a searcher pays the price of initializing caches (especially when sorting by fields) and thus will skew your results (assuming you re-use the searcher for many queries). On the other hand, if you re-run the same query again and again, results won't be realistic either, because the operating system will use its cache to speed up IO operations. On Linux (kernel 2.6.16 and later) you can clean the disk cache using sync ; echo 3 > /proc/sys/vm/drop_caches. See http://linux-mm.org/Drop_Caches for details.

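* When measuring search speed, ignore the time of the first query.

The first search spends part of its time filling caches (especially when sorting by a field), which can skew your measurements (assuming you reuse one IndexSearcher instance across many queries). On the other hand, repeating the same query over and over also gives unrealistic results, because the operating system uses its cache to speed up IO. On Linux you can clear the disk cache with: sync ; echo 3 > /proc/sys/vm/drop_caches.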
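
A rough sketch of a timing harness that follows this advice: it runs some warm-up calls whose timings are thrown away, then averages the rest. It uses only the JDK. `QueryTimer` and `meanNanos` are hypothetical names, and the lambda in `main` is a stand-in for a real search call.

```java
import java.util.function.IntConsumer;

class QueryTimer {
    /**
     * Runs the task warmupRuns + measuredRuns times and returns the mean
     * duration, in nanoseconds, of the measured runs only.
     * The task receives the run index so the caller can vary the query.
     */
    static double meanNanos(IntConsumer task, int warmupRuns, int measuredRuns) {
        if (measuredRuns <= 0) {
            throw new IllegalArgumentException("measuredRuns must be positive");
        }
        for (int i = 0; i < warmupRuns; i++) {
            task.accept(i); // warm-up: timing deliberately discarded
        }
        long total = 0;
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            task.accept(warmupRuns + i);
            total += System.nanoTime() - start;
        }
        return (double) total / measuredRuns;
    }

    public static void main(String[] args) {
        // Stand-in workload; a real test would run a search here.
        String[] queries = {"lucene", "index", "search", "segment"};
        double mean = meanNanos(i -> queries[i % queries.length].hashCode(), 1, 20);
        System.out.println("mean ns per query: " + mean);
    }
}
```

Passing the run index to the task makes it easy to rotate through different queries, which avoids the second trap described above, where one repeated query is served mostly from the operating system's cache.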

* Re-open the IndexSearcher only when necessary.

You must re-open the IndexSearcher in order to make newly committed changes visible to searching. However, re-opening the searcher has a certain overhead (noticeable mostly with large indexes and with sorting turned on) and should thus be minimized. Consider using a so-called warming technique, which allows the searcher to warm up its caches before the first query hits.

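* Re-open the IndexSearcher only when you have to.

To see newly committed index changes you must re-open the IndexSearcher. Re-opening a searcher carries some overhead, though (most noticeable with large indexes and with sorting), so do it as rarely as you can. Consider warming the new searcher with a search before it serves its first real query.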
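
One way to combine a single shared searcher with warmed re-opens is a small holder that publishes the replacement only after it has been warmed. This is a sketch under assumptions, not Lucene API: `SearcherHolder` is a hypothetical class, the type parameter `S` stands in for the searcher type, and the warmer is whatever warm-up search the application chooses to run.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

class SearcherHolder<S> {
    private final AtomicReference<S> current;

    SearcherHolder(S initial) {
        current = new AtomicReference<>(initial);
    }

    /** Returns the shared instance; safe to call from any thread. */
    S get() {
        return current.get();
    }

    /**
     * Warms the replacement first, then publishes it. Until the warmer
     * finishes, readers keep getting the old instance. Returns the old
     * instance so the caller can release it once it is no longer in use.
     */
    S swap(S replacement, Consumer<S> warmer) {
        warmer.accept(replacement);
        return current.getAndSet(replacement);
    }

    public static void main(String[] args) {
        // Strings stand in for searcher objects in this demo.
        SearcherHolder<String> holder = new SearcherHolder<>("searcher-v1");
        String old = holder.swap("searcher-v2", s -> System.out.println("warming " + s));
        System.out.println("retired " + old + ", now serving " + holder.get());
    }
}
```

The holder does not track in-flight queries, so a real application still has to decide when the retired instance can safely be closed.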

* Run optimize on your index before searching.

An optimized index has only 1 segment to search which can be much faster than the many segments that will normally be created, especially for a large index. If your application does not often update the index then it pays to build the index, optimize it, and use the optimized one for searching. If instead you are frequently updating the index and then refreshing searchers, then optimizing will likely be too costly and you should decrease mergeFactor instead.

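* Call optimize on your index before searching.

An optimized index contains only one segment (strictly speaking this is imprecise, since it also depends on the setting for the maximum number of documents per segment), which is faster to search than an otherwise identical index with many segments, especially when the index is large. If your program rarely updates the index, it is worth spending the time to optimize it and then searching the optimized index. If the index is updated very frequently, optimizing becomes too time-consuming, and you can lower the mergeFactor setting instead.
My own suggestion: when the index is updated frequently, use two indexes, a large optimized historical index and a small one that receives real-time additions (if the data is small, put it directly in a RAMDirectory and merge it into the large index on a schedule).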

* Decrease mergeFactor.

Smaller mergeFactors mean fewer segments and searching will be faster. However, this will slow down indexing speed, so you should test values to strike an appropriate balance for your application.

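* Decrease the mergeFactor value.

A smaller merge factor means fewer segments in the index, so searching is faster. It also slows indexing down, however, so test values yourself to find the right balance. (This item applies only to indexes that cannot be optimized often.)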

* Limit usage of stored fields and term vectors.

Retrieving these from the index is quite costly. Typically you should only retrieve these for the current "page" the user will see, not for all documents in the full result set. For each document retrieved, Lucene must seek to a different location in various files. Try sorting the documents you need to retrieve by docID order first.

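* Limit the use of stored fields and retrieve as little data as possible.

Retrieving data from the index is expensive. Retrieve it only for the documents the user will actually see, not for every document in the result set. For each document retrieved, Lucene has to seek to different locations, sometimes in different files. Try sorting the documents you need by document number before retrieving them.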
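
The two suggestions above (load only the current page, and load in docID order) can be sketched in plain Java as follows. `PageLoader` and `loadPage` are hypothetical names, and the `loader` function stands in for whatever call fetches a stored document by its ID.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntFunction;

class PageLoader {
    /**
     * Loads the documents for one page of hits. rankedDocIds is in score
     * order. Loads are issued in ascending docID order, and the result is
     * returned in the original score order.
     */
    static <D> List<D> loadPage(int[] rankedDocIds, int offset, int pageSize,
                                IntFunction<D> loader) {
        int from = Math.min(Math.max(offset, 0), rankedDocIds.length);
        int to = (int) Math.min((long) from + Math.max(pageSize, 0), rankedDocIds.length);
        int[] page = Arrays.copyOfRange(rankedDocIds, from, to);

        int[] byDocId = page.clone();
        Arrays.sort(byDocId); // ascending docID order for the actual reads

        Map<Integer, D> loaded = new HashMap<>();
        for (int docId : byDocId) {
            loaded.put(docId, loader.apply(docId));
        }

        List<D> result = new ArrayList<>(page.length);
        for (int docId : page) {
            result.add(loaded.get(docId)); // restore ranking order for display
        }
        return result;
    }

    public static void main(String[] args) {
        int[] ranked = {42, 7, 19, 3, 88}; // docIDs, best score first
        List<String> docs = loadPage(ranked, 0, 3, id -> "stored fields of doc " + id);
        System.out.println(docs);
    }
}
```

Documents outside the requested page are never passed to the loader, so the cost of retrieval scales with the page size rather than with the size of the result set.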

* Use FieldSelector to carefully pick which fields are loaded, and how they are loaded, when you retrieve a document.

* When you retrieve a document, use FieldSelector to choose carefully which fields are loaded and how they are loaded.

* Don't iterate over more hits than needed.

Iterating over all hits is slow for two reasons. Firstly, the search() method that returns a Hits object re-executes the search internally when you need more than 100 hits. Solution: use the search method that takes a HitCollector instead. Secondly, the hits will probably be spread over the disk so accessing them all requires much I/O activity. This cannot easily be avoided unless the index is small enough to be loaded into RAM. If you don't need the complete documents but only one (small) field you could also use the FieldCache class to cache that one field and have fast access to it.

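* Don't fetch more hits than you need.

Fetching more search results slows searching down, for two reasons. First, the search method that returns a Hits object re-runs the search internally once you ask for more than 100 hits. Solution: use the search method that takes a HitCollector. Second, the results may be scattered across the disk, so fetching them can take a lot of IO. This is hard to avoid unless the index is small enough to be cached in memory. If you do not need the whole document but only one small field, you can cache that field with the FieldCache class for fast access.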
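
To illustrate the collector idea, here is a plain-JDK sketch of a bounded collector that keeps only the best N hits in a single pass. It is an analogue of the approach, not Lucene's HitCollector: `TopNCollector`, `Hit`, `collect` and `topHits` are names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class TopNCollector {
    /** One (docId, score) pair. */
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    private final int limit;
    // Min-heap on score: the weakest retained hit is always at the head.
    private final PriorityQueue<Hit> heap =
            new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));

    TopNCollector(int limit) {
        if (limit <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        this.limit = limit;
    }

    /** Called once per matching document; keeps at most `limit` hits. */
    void collect(int docId, float score) {
        if (heap.size() < limit) {
            heap.add(new Hit(docId, score));
        } else if (score > heap.peek().score) {
            heap.poll(); // evict the weakest retained hit
            heap.add(new Hit(docId, score));
        }
    }

    /** Returns the retained hits, best score first. */
    List<Hit> topHits() {
        List<Hit> hits = new ArrayList<>(heap);
        hits.sort((a, b) -> Float.compare(b.score, a.score));
        return hits;
    }

    public static void main(String[] args) {
        TopNCollector collector = new TopNCollector(2);
        collector.collect(1, 0.2f);
        collector.collect(2, 0.9f);
        collector.collect(3, 0.5f);
        collector.collect(4, 0.7f);
        for (Hit hit : collector.topHits()) {
            System.out.println("doc " + hit.docId + " score " + hit.score);
        }
    }
}
```

Memory stays proportional to N no matter how many documents match, and no stored document is touched while collecting.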

* When using fuzzy queries use a minimum prefix length.

Fuzzy queries perform CPU-intensive string comparisons - avoid comparing all unique terms with the user input by only examining terms starting with the first "N" characters. This prefix length is a property on both QueryParser and FuzzyQuery - default is zero so ALL terms are compared.

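* When using fuzzy queries, set a minimum prefix length (prefixLength).

Fuzzy queries perform CPU-intensive string comparisons, so avoid comparing the user's term against every term in the index. You can restrict the comparison to terms that start with the same first N characters. prefixLength can be set on both QueryParser and FuzzyQuery. The default is 0, which compares all terms.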
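
Why a prefix helps can be shown with a small stand-alone sketch: terms that do not share the first N characters are rejected with a cheap `startsWith` check, and only the survivors pay for the edit-distance computation. This is not Lucene's implementation. `PrefixFuzzy`, `match` and the `maxEdits` threshold are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PrefixFuzzy {
    /** Classic Levenshtein edit distance using two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns terms within maxEdits of input that share its first prefixLength characters. */
    static List<String> match(List<String> terms, String input, int prefixLength, int maxEdits) {
        int n = Math.max(0, Math.min(prefixLength, input.length()));
        String prefix = input.substring(0, n);
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            if (!term.startsWith(prefix)) {
                continue; // cheap rejection, no edit distance computed
            }
            if (editDistance(term, input) <= maxEdits) {
                out.add(term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("lucene", "lucent", "lucid", "nucene", "search");
        System.out.println(match(terms, "lucene", 2, 1));
        System.out.println(match(terms, "lucene", 0, 1));
    }
}
```

The trade-off is visible in the demo: with a prefix length of 2, a term that differs from the input only in its first character is no longer found.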

* Consider using filters.

It can be much more efficient to restrict results to a part of the index using a cached bit set filter rather than using a query clause. This is especially true for restrictions that match a great number of documents of a large index. Filters are typically used to restrict the results to a category but could in many cases be used to replace any query clause. One difference between using a Query and a Filter is that the Query has an impact on the score while a Filter does not.

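* Consider filters.

When a query restricts results to part of the index, a cached bit set filter can be more efficient than a query clause, especially on a large index. Filters are often used to restrict results to a category, and they can stand in for a query clause in many cases. The difference is that a Query affects document scores while a Filter does not.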
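
The mechanics of a cached bit set filter can be sketched with `java.util.BitSet`: the restriction is computed once, kept under a key, and intersected with each query's matches. This is an illustration of the idea only. `CachedBitSetFilter` and its methods are hypothetical and do not correspond to Lucene's Filter classes.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

class CachedBitSetFilter {
    private final Map<String, BitSet> cache = new HashMap<>();
    private int builds = 0;

    /** Returns the bit set for a restriction, building it only on first use. */
    synchronized BitSet bitsFor(String key, Supplier<BitSet> builder) {
        BitSet bits = cache.get(key);
        if (bits == null) {
            bits = builder.get(); // the expensive step, done once per key
            cache.put(key, bits);
            builds++;
        }
        return bits;
    }

    /** Number of times a bit set had to be built (cache misses). */
    synchronized int builds() {
        return builds;
    }

    /** Intersects the query's matches with the cached restriction; inputs are not modified. */
    BitSet apply(BitSet queryMatches, String key, Supplier<BitSet> builder) {
        BitSet result = (BitSet) queryMatches.clone();
        result.and(bitsFor(key, builder));
        return result;
    }

    public static void main(String[] args) {
        CachedBitSetFilter filter = new CachedBitSetFilter();
        BitSet matches = new BitSet();
        matches.set(0, 4); // the query matched docs 0..3
        Supplier<BitSet> booksCategory = () -> {
            BitSet bits = new BitSet();
            bits.set(1);
            bits.set(3);
            bits.set(5);
            return bits;
        };
        System.out.println(filter.apply(matches, "category:books", booksCategory));
        System.out.println(filter.apply(matches, "category:books", booksCategory));
        System.out.println("bit sets built: " + filter.builds());
    }
}
```

Because the intersection is a pure set operation, it has no effect on scoring, which matches the Query versus Filter distinction described above.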

* Find the bottleneck.

Complex query analysis or heavy post-processing of results are examples of hidden bottlenecks for searches. Profiling with a tool such as VisualVM helps locate the problem.

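* Locate the bottleneck.

Complex query analysis or processing of large result sets can be the bottleneck in searching. A tool such as VisualVM helps pinpoint the problem.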

posted on 2009-08-13 16:08 Eric Yao