[Translation] How to make searching faster
Here are some things to try to speed up searching in your Lucene application. To speed up indexing, see ImproveIndexingSpeed.
-
Be sure you really need to speed things up. Many of the ideas here are simple to try, but others will add complexity to your application. Confirm that searching is actually too slow and that the slowness really comes from Lucene.
-
Make sure you are using the latest version of Lucene.
-
Use a local filesystem. Remote filesystems are typically quite a bit slower for searching. If the index must be remote, try mounting the remote filesystem read-only; in some cases this improves performance.
-
Get faster hardware, especially a faster IO system. Flash-based solid state drives work very well for Lucene searches. Because SSD seek times are about 100 times faster than those of traditional platter-based hard drives, the usual seek penalty is virtually eliminated. SSD-equipped machines therefore need less RAM for file caching, and searchers need less warm-up time before they respond quickly.
-
Tune the OS.
One tunable that stands out on Linux is swappiness (http://kerneltrap.org/node/3000), which controls how aggressively the OS swaps out RAM used by processes in favor of the IO cache. Most Linux distros default to a fairly high (aggressive) value, which can easily cause horrible search latency, especially if you search a large index at a low query rate. Experiment with turning swappiness down, or off entirely by setting it to 0. Windows has a checkbox under My Computer -> Properties -> Advanced -> Performance Settings -> Advanced -> Memory Usage that lets you favor Programs or System Cache, which is likely doing something similar.
(Translator's note, Clotho: for example, take a 1 GB index that is searched only once a month. With swappiness set too high, the memory the searcher relies on has been swapped out between searches, so every search first pays to bring it back in, which wastes time.)
-
Open the IndexReader with readOnly=true. This makes a big difference when multiple threads share the same reader, because it removes certain sources of thread contention.
-
On non-Windows platforms, use NIOFSDirectory instead of FSDirectory.
This also removes sources of contention when accessing the underlying files. Unfortunately, due to a longstanding bug on Windows in Sun's JRE (see http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734 and feel free to go vote for it), NIOFSDirectory performs poorly on Windows.
-
Add RAM to your hardware and/or increase the heap size for the JVM. Searching a large index can use a lot of RAM. If the machine does not have enough RAM, or the JVM is not running with a large enough heap, the JVM can start swapping and thrashing, at which point everything runs slowly.
-
Use one instance of IndexSearcher.
Share a single IndexSearcher across queries and across threads in your application.
-
When measuring performance, disregard the first query.
The first query to a searcher pays the price of initializing caches (especially when sorting by fields) and will therefore skew your results, assuming you re-use the searcher for many queries. On the other hand, re-running the same query again and again does not give realistic results either, because the operating system uses its cache to speed up IO operations. On Linux (kernel 2.6.16 and later) you can clear the disk cache, as root, with: sync; echo 3 > /proc/sys/vm/drop_caches. See http://linux-mm.org/Drop_Caches for details.
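The measurement advice above can be sketched as a small timing harness. This is only an illustration under stated assumptions: WarmQueryTimer and medianAfterWarmup are hypothetical names, the query is passed in as a plain Runnable rather than a real Lucene call, and the clock is injectable so the harness itself can be checked.

```java
import java.util.Arrays;
import java.util.function.LongSupplier;

/** Hypothetical timing harness that drops the first (cold) run before summarizing. */
final class WarmQueryTimer {

    /**
     * Runs the query `runs` times and returns the median latency (upper median
     * for even counts) of every run except the first. Units are whatever the
     * supplied clock returns, e.g. nanoseconds for System::nanoTime.
     */
    static long medianAfterWarmup(Runnable query, int runs, LongSupplier clock) {
        if (runs < 2) {
            throw new IllegalArgumentException("need at least 2 runs: 1 warm-up + 1 measured");
        }
        long[] measured = new long[runs - 1];
        for (int i = 0; i < runs; i++) {
            long start = clock.getAsLong();
            query.run();
            long elapsed = clock.getAsLong() - start;
            if (i > 0) {
                // Run 0 pays for cache initialization, so it is not recorded.
                measured[i - 1] = elapsed;
            }
        }
        Arrays.sort(measured);
        return measured[measured.length / 2];
    }

    public static void main(String[] args) {
        Runnable placeholderQuery = () -> { /* replace with a real search call */ };
        long median = medianAfterWarmup(placeholderQuery, 11, System::nanoTime);
        System.out.println("median latency after warm-up: " + median + " ns");
    }
}
```

To avoid the opposite distortion described above, vary the queries between runs instead of repeating one query, so the OS cache does not make every run look unrealistically fast.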
-
Re-open the IndexSearcher only when necessary.
You must re-open the IndexSearcher to make newly committed changes visible to searching. However, re-opening the searcher has a certain overhead (noticeable mostly with large indexes and with sorting turned on), so it should be kept to a minimum. Consider using a so-called warming technique, which lets the new searcher warm up its caches before the first query hits it.
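One way to combine the two searcher tips (share a single instance, and re-open rarely with warming) is a small holder object. The sketch below is not Lucene API: SharedSearcherHolder is a hypothetical class, the type parameter S stands in for IndexSearcher, and the opener and warmer callbacks are assumptions that you would implement with your own open and warm-up queries.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical holder for the one searcher shared by all query threads. */
final class SharedSearcherHolder<S> {
    private final AtomicReference<S> current;
    private final Supplier<S> opener;  // assumption: opens a searcher on the latest index
    private final Consumer<S> warmer;  // assumption: runs typical queries and sorts

    SharedSearcherHolder(Supplier<S> opener, Consumer<S> warmer) {
        this.opener = opener;
        this.warmer = warmer;
        S first = opener.get();
        warmer.accept(first);
        this.current = new AtomicReference<>(first);
    }

    /** Every query thread calls this; no searcher is opened per query. */
    S get() {
        return current.get();
    }

    /**
     * Call only after index changes have been committed. The new searcher is
     * warmed before it is published, so no user query hits cold caches.
     * Returns the previous searcher so the caller can close it once
     * in-flight queries have finished.
     */
    synchronized S refresh() {
        S fresh = opener.get();
        warmer.accept(fresh);
        return current.getAndSet(fresh);
    }
}
```

Queries that started on the old searcher keep using it until they finish, which is why refresh() hands the old instance back instead of closing it.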
-
Run optimize (the IndexWriter optimize method) on your index before searching. An optimized index has only one segment to search, which can be much faster than searching the many segments that are normally created, especially for a large index. If your application does not update the index often, it pays to build the index, optimize it, and search the optimized copy. If instead you update the index frequently and then refresh searchers, optimizing will likely be too costly, and you should decrease mergeFactor instead.
-
Decrease mergeFactor. A smaller mergeFactor means fewer segments, so searching will be faster. However, it also slows down indexing, so test several values to strike an appropriate balance for your application.
-
Limit usage of stored fields and term vectors. Retrieving these from the index is quite costly. Typically you should retrieve them only for the current "page" the user will see, not for all documents in the full result set. For each document retrieved, Lucene must seek to a different location in several files. Try sorting the documents you need to retrieve by docID first.
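As a rough illustration of the paging and docID-ordering advice, the sketch below loads only one page of results and visits that page in ascending docID order, while still returning the documents in ranking order. PageLoader and loadPage are hypothetical names, and the loader callback is an assumption standing in for whatever call retrieves a stored document by its ID.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.IntFunction;

/** Hypothetical helper: fetch stored documents for one result page only. */
final class PageLoader {

    /**
     * rankedDocIds holds doc IDs in ranking order. Returns the documents of
     * the requested page in ranking order, but calls the loader in ascending
     * docID order so reads move forward through the index files.
     */
    static <D> List<D> loadPage(int[] rankedDocIds, int page, int pageSize,
                                IntFunction<D> loader) {
        if (page < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("page must be >= 0 and pageSize > 0");
        }
        int from = (int) Math.min((long) page * pageSize, rankedDocIds.length);
        int to = Math.min(from + pageSize, rankedDocIds.length);
        int n = to - from;

        // Positions within the page, sorted by the doc ID each one points at.
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        Arrays.sort(order,
                (a, b) -> Integer.compare(rankedDocIds[from + a], rankedDocIds[from + b]));

        List<D> docs = new ArrayList<>(Collections.<D>nCopies(n, null));
        for (int pos : order) {
            docs.set(pos, loader.apply(rankedDocIds[from + pos]));
        }
        return docs;
    }
}
```

Documents outside the requested page are never passed to the loader, so their stored fields are never read.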
-
Use FieldSelector to carefully pick which fields are loaded, and how they are loaded, when you retrieve a document.
-
Don't iterate over more hits than needed.
Iterating over all hits is slow for two reasons. First, the search() method that returns a Hits object re-executes the search internally when you need more than 100 hits. Solution: use the search method that takes a HitCollector instead. Second, the hits will probably be spread over the disk, so accessing them all requires a lot of I/O. This cannot easily be avoided unless the index is small enough to be loaded into RAM. If you don't need the complete documents but only one (small) field, you can also use the FieldCache class to cache that one field and get fast access to it.
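The collector idea can be illustrated without Lucene: see each hit once as it is found, and keep only the best N in a bounded heap. The class below is a plain-Java sketch of that pattern, not the HitCollector API itself; TopNCollector, Hit, collect and topHits are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical collector that keeps only the n best-scoring hits. */
final class TopNCollector {

    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    private final int n;
    // Min-heap on score: the weakest retained hit is always at the head.
    private final PriorityQueue<Hit> heap =
            new PriorityQueue<Hit>((a, b) -> Float.compare(a.score, b.score));

    TopNCollector(int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        this.n = n;
    }

    /** Called once per matching document; never stores more than n hits. */
    void collect(int docId, float score) {
        if (heap.size() < n) {
            heap.add(new Hit(docId, score));
        } else if (score > heap.peek().score) {
            heap.poll();
            heap.add(new Hit(docId, score));
        }
    }

    /** Returns the retained hits, best score first. */
    List<Hit> topHits() {
        List<Hit> hits = new ArrayList<>(heap);
        hits.sort((a, b) -> Float.compare(b.score, a.score));
        return hits;
    }
}
```

Because only doc IDs and scores are kept, no stored document is read during collection; the documents for the retained hits can be fetched afterwards.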
-
When using fuzzy queries, use a minimum prefix length.
Fuzzy queries perform CPU-intensive string comparisons. Avoid comparing every unique term with the user input by examining only terms that start with the same first "N" characters. This prefix length is a property on both QueryParser and FuzzyQuery; the default is zero, so ALL terms are compared.
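To see why a prefix length saves work, here is a simplified model of fuzzy matching in plain Java. It is an assumption-laden sketch, not Lucene's implementation: PrefixFuzzyMatcher is a hypothetical class, matching uses a maximum Levenshtein edit distance, and the term dictionary is just a list of strings.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of fuzzy matching with a required common prefix. */
final class PrefixFuzzyMatcher {

    /** Terms that share the first prefixLength characters with the query. */
    static List<String> candidates(List<String> terms, String query, int prefixLength) {
        int len = Math.max(0, Math.min(prefixLength, query.length()));
        String prefix = query.substring(0, len);
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            if (term.startsWith(prefix)) {
                out.add(term);
            }
        }
        return out;
    }

    /** Runs the expensive edit-distance comparison on the candidates only. */
    static List<String> match(List<String> terms, String query,
                              int prefixLength, int maxEdits) {
        List<String> out = new ArrayList<>();
        for (String term : candidates(terms, query, prefixLength)) {
            if (editDistance(term, query) <= maxEdits) {
                out.add(term);
            }
        }
        return out;
    }

    /** Classic two-row Levenshtein distance. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

The trade-off is that a term differing from the query inside the prefix is never considered, so a longer prefix is faster but can miss misspellings in the first characters.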
-
Consider using filters. Restricting results to a part of the index with a cached bit set filter can be much more efficient than using a query clause. This is especially true for restrictions that match a great number of documents in a large index. Filters are typically used to restrict the results to a category, but in many cases they can replace any query clause. One difference between a Query and a Filter is that a Query affects the score while a Filter does not.
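The cached bit set idea can be sketched with java.util.BitSet. This is an illustration only, not Lucene's Filter API: CachedCategoryFilter is a hypothetical class, and the builder callback is an assumption standing in for the one-time scan that finds all documents in a category.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical filter that builds one bit set per category and reuses it. */
final class CachedCategoryFilter {
    private final Map<String, BitSet> cache = new ConcurrentHashMap<>();
    private final Function<String, BitSet> builder; // assumption: scans the index once

    CachedCategoryFilter(Function<String, BitSet> builder) {
        this.builder = builder;
    }

    /**
     * Returns the query matches that also belong to the category. The cached
     * bit set is only read, and the caller's bit set is cloned, so neither
     * input is modified and scores are untouched.
     */
    BitSet restrict(BitSet queryMatches, String category) {
        BitSet allowed = cache.computeIfAbsent(category, builder);
        BitSet result = (BitSet) queryMatches.clone();
        result.and(allowed);
        return result;
    }
}
```

After the first request for a category, every later restriction costs only a bitwise AND, which is why this pays off most for restrictions that match many documents.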
-
Find the bottleneck.
Complex query analysis and heavy post-processing of results are examples of hidden bottlenecks for searches. Profiling with a tool such as VisualVM helps locate the problem.
Original article: http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
Another translation, by an expert: http://hi.baidu.com/expertsearch/blog/item/2195a237bfe83d360a55a9fd.html
Reposted from: https://www.cnblogs.com/live41/archive/2009/12/31/1636900.html