Lucene Grouping Statistics Example
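Requirement
Our search system needed group statistics (Grouping/GroupBy): for example, classifying search results by catalog and counting how many results fall under each catalog. My earlier approach was naive: run one search to find out which groups exist, then run one more search inside each group. With N groups that adds up to N+1 searches, which is inefficient.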
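Improvement
I recently found that Lucene provides grouping, implemented through Collectors. It needs at most two searches to produce the result, and if enough memory is available, CachingCollector can save one of those two queries.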
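Conceptually, a grouping collector sees every matching document once and tallies it under its group, which is why per-group counts do not need one search per group. The following plain-Java sketch only illustrates that idea and does not use Lucene; the class `GroupCountSketch`, the method `countPerGroup` and the sample data are hypothetical names made up for this example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only: counts hits per group in a single pass. */
class GroupCountSketch {

    /** Returns group -> number of hits, visiting each hit exactly once. */
    static Map<String, Integer> countPerGroup(List<String> groupOfEachHit) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String group : groupOfEachHit) {
            // Add 1 to the running count of this hit's group.
            counts.merge(group, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Each entry stands for the "catalog" value of one matching document.
        List<String> hits = Arrays.asList("fruit", "drink", "fruit", "fruit", "drink");
        System.out.println(countPerGroup(hits)); // {fruit=3, drink=2}
    }
}
```

The work here grows with the number of hits rather than with the number of groups, which is the property the N+1 approach lacks.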
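Two-Pass Search
First pass
The first pass collects the groups that match the query: create a TermFirstPassGroupingCollector and pass it to the search call. Wrapping it in a CachingCollector at this point caches the hits, which can save one query later: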
    TermFirstPassGroupingCollector c1 = new TermFirstPassGroupingCollector("catalog", groupSort, topNGroups);
    boolean cacheScores = true;
    double maxCacheRAMMB = 16.0;
    CachingCollector cachedCollector = CachingCollector.create(c1, cacheScores, maxCacheRAMMB);
    searcher.search(query, cachedCollector);
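Second pass
The second pass collects the matching documents within each group: use the groups found by the first pass to create a TermSecondPassGroupingCollector, then either run the search again or replay the cached hits.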
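To show how the two passes relate, here is a small plain-Java model of the flow described above. It is a sketch under stated assumptions, not Lucene code: the `Hit` class and the `firstPass` and `secondPass` methods are hypothetical, groups are ranked by their best score, and an in-memory list stands in for the hits that a caching collector would hold for replay.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical plain-Java model of two-pass grouping; not Lucene code. */
class TwoPassGroupingSketch {

    /** One matching document: id, group key, relevance score. */
    static final class Hit {
        final int doc;
        final String group;
        final float score;

        Hit(int doc, String group, float score) {
            this.doc = doc;
            this.group = group;
            this.score = score;
        }
    }

    /** Pass 1: select the top N groups, ranked by the best score seen in each. */
    static List<String> firstPass(List<Hit> hits, int topNGroups) {
        Map<String, Float> best = new LinkedHashMap<>();
        for (Hit h : hits) {
            best.merge(h.group, h.score, Float::max);
        }
        List<String> groups = new ArrayList<>(best.keySet());
        groups.sort((a, b) -> Float.compare(best.get(b), best.get(a)));
        return new ArrayList<>(groups.subList(0, Math.min(topNGroups, groups.size())));
    }

    /** Pass 2: for each selected group, keep its best docsPerGroup hits. */
    static Map<String, List<Hit>> secondPass(List<Hit> hits, List<String> groups, int docsPerGroup) {
        Map<String, List<Hit>> result = new LinkedHashMap<>();
        for (String g : groups) {
            result.put(g, new ArrayList<>());
        }
        for (Hit h : hits) {
            List<Hit> bucket = result.get(h.group);
            if (bucket != null) {
                bucket.add(h); // hits of groups not selected in pass 1 are skipped
            }
        }
        for (Map.Entry<String, List<Hit>> e : result.entrySet()) {
            List<Hit> bucket = e.getValue();
            bucket.sort((a, b) -> Float.compare(b.score, a.score));
            e.setValue(new ArrayList<>(bucket.subList(0, Math.min(docsPerGroup, bucket.size()))));
        }
        return result;
    }

    public static void main(String[] args) {
        // This list plays the role of the cached hits from one query execution.
        List<Hit> cached = Arrays.asList(
                new Hit(0, "fruit", 0.9f), new Hit(1, "fruit", 0.5f), new Hit(2, "fruit", 0.7f),
                new Hit(3, "drink", 0.8f), new Hit(4, "person", 0.1f));
        List<String> topGroups = firstPass(cached, 2);
        // Replaying the cached hits replaces a second run of the query.
        Map<String, List<Hit>> docs = secondPass(cached, topGroups, 2);
        for (Map.Entry<String, List<Hit>> e : docs.entrySet()) {
            for (Hit h : e.getValue()) {
                System.out.println(e.getKey() + ": doc " + h.doc + " score " + h.score);
            }
        }
    }
}
```

The second pass depends on the group list produced by the first, so the two cannot be merged into one pass; caching the hits only removes the cost of executing the query a second time.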
Complete Example
    package com.hankcs;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.*;
    import org.apache.lucene.search.grouping.GroupDocs;
    import org.apache.lucene.search.grouping.SearchGroup;
    import org.apache.lucene.search.grouping.TopGroups;
    import org.apache.lucene.search.grouping.term.TermAllGroupsCollector;
    import org.apache.lucene.search.grouping.term.TermFirstPassGroupingCollector;
    import org.apache.lucene.search.grouping.term.TermSecondPassGroupingCollector;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.Version;

    import java.util.Collection;

    /**
     * Faceting (grouping) demo
     *
     * @author hankcs
     */
    public class FacetingDemo
    {
        public static void main(String[] args) throws Exception
        {
            // Name of the main field of the Lucene Document
            String mainFieldName = "text";
            // Lucene version
            Version ver = Version.LUCENE_48;
            // Instantiate the Analyzer (tokenizer)
            Analyzer analyzer = new StandardAnalyzer(ver);
            Directory directory;
            IndexWriter writer;
            IndexReader reader;
            IndexSearcher searcher;

            // Indexing **********************************
            // Create an in-memory index
            directory = new RAMDirectory();
            // Configure IndexWriterConfig
            IndexWriterConfig iwConfig = new IndexWriterConfig(ver, analyzer);
            iwConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
            writer = new IndexWriter(directory, iwConfig);
            for (int i = 0; i < 100; ++i)
            {
                Document doc = new Document();
                doc.add(new TextField(mainFieldName, "Banana is sweet " + i, Field.Store.YES));
                doc.add(new TextField("catalog", "fruit", Field.Store.YES));
                writer.addDocument(doc);
            }
            for (int i = 0; i < 50; ++i)
            {
                Document doc = new Document();
                doc.add(new TextField(mainFieldName, "Juice is sweet " + i, Field.Store.YES));
                doc.add(new TextField("catalog", "drink", Field.Store.YES));
                writer.addDocument(doc);
            }
            for (int i = 0; i < 25; ++i)
            {
                Document doc = new Document();
                doc.add(new TextField(mainFieldName, "Hankcs is here " + i, Field.Store.YES));
                doc.add(new TextField("catalog", "person", Field.Store.YES));
                writer.addDocument(doc);
            }
            writer.close();

            // Searching **********************************
            // Instantiate the searcher
            reader = DirectoryReader.open(directory);
            searcher = new IndexSearcher(reader);
            String keyword = "sweet";
            // Build the Query object with the QueryParser
            QueryParser qp = new QueryParser(ver, mainFieldName, analyzer);
            Query query = qp.parse(keyword);
            System.out.println("Query = " + query);

            // Search for the most relevant records and group them
            int topNGroups = 10; // number of groups per page
            int groupOffset = 0; // index of the first group
            boolean fillFields = true;
            // groupSort orders the groups, docSort orders the records within a group;
            // in most cases the two are the same, but they may differ
            Sort docSort = Sort.RELEVANCE;
            Sort groupSort = docSort;
            int docOffset = 0; // index of the first record, for paging within a group
            int docsPerGroup = 2; // number of results returned per group
            boolean requiredTotalGroupCount = true; // whether to compute the total number of groups

            // To adjust Lucene's scores, subclass TermFirstPassGroupingCollector
            TermFirstPassGroupingCollector c1 = new TermFirstPassGroupingCollector("catalog", groupSort, topNGroups);
            boolean cacheScores = true;
            double maxCacheRAMMB = 16.0;
            CachingCollector cachedCollector = CachingCollector.create(c1, cacheScores, maxCacheRAMMB);
            searcher.search(query, cachedCollector);

            Collection<SearchGroup<BytesRef>> topGroups = c1.getTopGroups(groupOffset, fillFields);
            if (topGroups == null)
            {
                // No groups matched
                return;
            }

            Collector secondPassCollector = null;
            boolean getScores = true;
            boolean getMaxScores = true;
            // To adjust Lucene's scores, subclass TermSecondPassGroupingCollector
            TermSecondPassGroupingCollector c2 = new TermSecondPassGroupingCollector("catalog", topGroups, groupSort, docSort, docsPerGroup, getScores, getMaxScores, fillFields);

            // Optional step: count how many groups there are in total
            TermAllGroupsCollector allGroupsCollector = null;
            if (requiredTotalGroupCount)
            {
                allGroupsCollector = new TermAllGroupsCollector("catalog");
                secondPassCollector = MultiCollector.wrap(c2, allGroupsCollector);
            }
            else
            {
                secondPassCollector = c2;
            }

            if (cachedCollector.isCached())
            {
                // The hits were cached, so replay them
                cachedCollector.replay(secondPassCollector);
            }
            else
            {
                // The cache limit was exceeded, so run the query again
                searcher.search(query, secondPassCollector);
            }

            int totalGroupCount = -1; // total number of groups
            int totalHitCount = -1; // total number of matching records
            int totalGroupedHitCount = -1; // number of matching records that belong to a group (usually equal to totalHitCount)
            if (requiredTotalGroupCount)
            {
                totalGroupCount = allGroupsCollector.getGroupCount();
            }
            System.out.println("Number of groups matched: " + totalGroupCount);

            TopGroups<BytesRef> groupsResult = c2.getTopGroups(docOffset);
            totalHitCount = groupsResult.totalHitCount;
            totalGroupedHitCount = groupsResult.totalGroupedHitCount;
            System.out.println("groupsResult.totalHitCount:" + totalHitCount);
            System.out.println("groupsResult.totalGroupedHitCount:" + totalGroupedHitCount);

            int groupIdx = 0;
            // Iterate over the groups
            for (GroupDocs<BytesRef> groupDocs : groupsResult.groups)
            {
                groupIdx++;
                System.out.println("group[" + groupIdx + "]:" + groupDocs.groupValue); // group identifier
                System.out.println("group[" + groupIdx + "]:" + groupDocs.totalHits); // number of records in the group
                int docIdx = 0;
                // Iterate over the records within the group
                for (ScoreDoc scoreDoc : groupDocs.scoreDocs)
                {
                    docIdx++;
                    System.out.println("group[" + groupIdx + "][" + docIdx + "]:" + scoreDoc.doc + "/" + scoreDoc.score);
                    Document doc = searcher.doc(scoreDoc.doc);
                    System.out.println("group[" + groupIdx + "][" + docIdx + "]:" + doc);
                }
            }
        }
    }

Output

    Query = text:sweet
    Number of groups matched: 2
    groupsResult.totalHitCount:150
    groupsResult.totalGroupedHitCount:150
    group[1]:[66 72 75 69 74]
    group[1]:100
    group[1][1]:0/0.573753
    group[1][1]:Document<stored,indexed,tokenized<text:Banana is sweet 0> stored,indexed,tokenized<catalog:fruit>>
    group[1][2]:1/0.573753
    group[1][2]:Document<stored,indexed,tokenized<text:Banana is sweet 1> stored,indexed,tokenized<catalog:fruit>>
    group[2]:[64 72 69 6e 6b]
    group[2]:50
    group[2][1]:100/0.573753
    group[2][1]:Document<stored,indexed,tokenized<text:Juice is sweet 0> stored,indexed,tokenized<catalog:drink>>
    group[2][2]:101/0.573753
    group[2][2]:Document<stored,indexed,tokenized<text:Juice is sweet 1> stored,indexed,tokenized<catalog:drink>>
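The group identifiers in the output appear as bracketed hex bytes because groupDocs.groupValue is a BytesRef, and printing it directly shows its raw bytes. Inside Lucene code, calling `utf8ToString()` on the BytesRef gives the readable term. The stand-alone sketch below decodes such a dump with plain Java, just to confirm what the two groups are; `GroupValueDecoder` and `decode` are hypothetical names for illustration, and the input format is assumed to be exactly the bracketed, space-separated hex shown in the output.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: turns a dump like "[66 72 75 69 74]" back into text. */
class GroupValueDecoder {

    static String decode(String hexDump) {
        String body = hexDump.trim();
        // Strip the surrounding square brackets.
        body = body.substring(1, body.length() - 1).trim();
        if (body.isEmpty()) {
            return "";
        }
        String[] parts = body.split("\\s+");
        byte[] bytes = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            // Each token is one byte written in hexadecimal.
            bytes[i] = (byte) Integer.parseInt(parts[i], 16);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(decode("[66 72 75 69 74]")); // fruit
        System.out.println(decode("[64 72 69 6e 6b]")); // drink
    }
}
```

So group[1] is the "fruit" catalog with 100 hits and group[2] is "drink" with 50; the "person" documents do not contain "sweet" and therefore form no group.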