Lucene Category Counting Example

Published: 2024/1/23

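Requirement

In a search system I ran into a need for grouped statistics (Grouping/GroupBy): for example, bucketing search results by category and counting how many results fall under each one. My earlier approach was clumsy: run one search to find out which groups exist, then run one more search inside each group. With N groups that adds up to N+1 searches, which is inefficient.

Improvement

I recently found that Lucene provides grouping support, implemented through Collectors. It needs at most two searches to produce the result, and if enough memory is available, CachingCollector can save one of those two.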
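To see why the per-category counts do not need one search per group, here is a minimal plain-Java sketch. It does not use Lucene: the `Hit` class, the `countByCatalog` method, and the sample data are hypothetical stand-ins, and the loop only models what a collector does as each matching document is handed to it once.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of per-category counting; this is not Lucene code. */
final class GroupCountSketch {

    /** Hypothetical stand-in for a search hit: a doc id plus its category. */
    static final class Hit {
        final int docId;
        final String catalog;

        Hit(int docId, String catalog) {
            this.docId = docId;
            this.catalog = catalog;
        }
    }

    /** Visits every hit exactly once and tallies the hits per category. */
    static Map<String, Integer> countByCatalog(List<Hit> hits) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Hit hit : hits) {
            counts.merge(hit.catalog, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit(0, "fruit"), new Hit(1, "fruit"), new Hit(100, "drink"));
        System.out.println(countByCatalog(hits)); // prints {fruit=2, drink=1}
    }
}
```

The cost is one pass over the hits regardless of how many categories exist, which is the property the N+1 approach lacks.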

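Two passes

First pass

The first pass collects the groups that match the query: create a FirstPassGroupingCollector and pass it to the search call. Wrapping it in a CachingCollector at this point caches the hits, which can save the second query: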

    TermFirstPassGroupingCollector c1 = new TermFirstPassGroupingCollector("catalog", groupSort, topNGroups);
    boolean cacheScores = true;
    double maxCacheRAMMB = 16.0;
    CachingCollector cachedCollector = CachingCollector.create(c1, cacheScores, maxCacheRAMMB);
    searcher.search(query, cachedCollector);
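The cache-then-replay idea can be modelled in a few lines of plain Java. This is only a sketch under stated assumptions, not Lucene's CachingCollector: the `ReplaySketch` class is hypothetical, it budgets by number of doc ids rather than by megabytes, and it ignores scores and index segments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;

/** Plain-Java model of cache-then-replay; this is not Lucene's CachingCollector. */
final class ReplaySketch {
    private final IntConsumer delegate;     // stands in for the first-pass collector
    private final int maxCachedDocs;        // hypothetical budget, counted in doc ids
    private List<Integer> cache = new ArrayList<>();

    ReplaySketch(IntConsumer delegate, int maxCachedDocs) {
        this.delegate = delegate;
        this.maxCachedDocs = maxCachedDocs;
    }

    /** Forwards the hit to the delegate and remembers it while the budget allows. */
    void collect(int docId) {
        delegate.accept(docId);
        if (cache != null) {
            if (cache.size() < maxCachedDocs) {
                cache.add(docId);
            } else {
                cache = null; // budget exceeded: drop the whole cache
            }
        }
    }

    /** True while every collected hit is still held in the cache. */
    boolean isCached() {
        return cache != null;
    }

    /** Feeds the remembered hits to another consumer without searching again. */
    void replay(IntConsumer other) {
        if (cache == null) {
            throw new IllegalStateException("cache was dropped; run the search again");
        }
        for (int docId : cache) {
            other.accept(docId);
        }
    }
}
```

A caller checks `isCached()` first and falls back to a second search when the budget was exceeded, which mirrors the branch in the complete example.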

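Second pass

The second pass collects the matching documents inside each group: build a TermSecondPassGroupingCollector from the groups found in the first pass, then either run the search again or replay the cached hits.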
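The division of labour between the two passes can be sketched in plain Java. This is a simplified model, not Lucene's collectors: `Hit`, `GroupResult`, `firstPass`, and `secondPass` are hypothetical names, groups are ranked by their best score only, and the second pass sorts and trims a list where a real collector would more likely keep a bounded queue.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of two-pass grouping; this is not Lucene code. */
final class TwoPassSketch {

    /** Hypothetical hit: doc id, group key, relevance score. */
    static final class Hit {
        final int docId;
        final String group;
        final float score;

        Hit(int docId, String group, float score) {
            this.docId = docId;
            this.group = group;
            this.score = score;
        }
    }

    /** Per-group result: total hit count plus the best few documents. */
    static final class GroupResult {
        int totalHits;
        final List<Hit> topDocs = new ArrayList<>();
    }

    /** Pass 1: track the best score per group and return the top N group keys. */
    static List<String> firstPass(List<Hit> hits, int topNGroups) {
        Map<String, Float> best = new LinkedHashMap<>();
        for (Hit h : hits) {
            best.merge(h.group, h.score, Float::max);
        }
        List<String> keys = new ArrayList<>(best.keySet());
        keys.sort((a, b) -> Float.compare(best.get(b), best.get(a))); // best score first
        return keys.subList(0, Math.min(topNGroups, keys.size()));
    }

    /** Pass 2: for the selected groups only, count hits and keep the best documents. */
    static Map<String, GroupResult> secondPass(List<Hit> hits, List<String> groups, int docsPerGroup) {
        Map<String, GroupResult> results = new LinkedHashMap<>();
        for (String g : groups) {
            results.put(g, new GroupResult());
        }
        for (Hit h : hits) {
            GroupResult r = results.get(h.group);
            if (r == null) {
                continue; // this group did not make the first-pass cut
            }
            r.totalHits++;
            r.topDocs.add(h);
        }
        for (GroupResult r : results.values()) {
            r.topDocs.sort((a, b) -> Float.compare(b.score, a.score));
            if (r.topDocs.size() > docsPerGroup) {
                r.topDocs.subList(docsPerGroup, r.topDocs.size()).clear();
            }
        }
        return results;
    }
}
```

The second pass needs the group keys before it can start, which is why the same hits have to be seen twice, either through a second search or through a replay.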

Complete example

    package com.hankcs;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.*;
    import org.apache.lucene.search.grouping.GroupDocs;
    import org.apache.lucene.search.grouping.SearchGroup;
    import org.apache.lucene.search.grouping.TopGroups;
    import org.apache.lucene.search.grouping.term.TermAllGroupsCollector;
    import org.apache.lucene.search.grouping.term.TermFirstPassGroupingCollector;
    import org.apache.lucene.search.grouping.term.TermSecondPassGroupingCollector;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.Version;

    import java.util.Collection;

    /**
     * Demonstrates faceting (grouped counting).
     *
     * @author hankcs
     */
    public class FacetingDemo
    {
        public static void main(String[] args) throws Exception
        {
            // Name of the main field in the Lucene Document
            String mainFieldName = "text";
            // Lucene version
            Version ver = Version.LUCENE_48;
            // Instantiate the Analyzer
            Analyzer analyzer = new StandardAnalyzer(ver);

            Directory directory;
            IndexWriter writer;
            IndexReader reader;
            IndexSearcher searcher;

            // Indexing **********************************
            // Create the in-memory index
            directory = new RAMDirectory();
            // Configure the IndexWriterConfig
            IndexWriterConfig iwConfig = new IndexWriterConfig(ver, analyzer);
            iwConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
            writer = new IndexWriter(directory, iwConfig);
            // The catalog values below are single words, so each document has exactly one catalog term
            for (int i = 0; i < 100; ++i)
            {
                Document doc = new Document();
                doc.add(new TextField(mainFieldName, "Banana is sweet " + i, Field.Store.YES));
                doc.add(new TextField("catalog", "fruit", Field.Store.YES));
                writer.addDocument(doc);
            }
            for (int i = 0; i < 50; ++i)
            {
                Document doc = new Document();
                doc.add(new TextField(mainFieldName, "Juice is sweet " + i, Field.Store.YES));
                doc.add(new TextField("catalog", "drink", Field.Store.YES));
                writer.addDocument(doc);
            }
            for (int i = 0; i < 25; ++i)
            {
                Document doc = new Document();
                doc.add(new TextField(mainFieldName, "Hankcs is here " + i, Field.Store.YES));
                doc.add(new TextField("catalog", "person", Field.Store.YES));
                writer.addDocument(doc);
            }
            writer.close();

            // Searching **********************************
            // Instantiate the searcher
            reader = DirectoryReader.open(directory);
            searcher = new IndexSearcher(reader);

            String keyword = "sweet";
            // Build the Query object with QueryParser
            QueryParser qp = new QueryParser(ver, mainFieldName, analyzer);
            Query query = qp.parse(keyword);
            System.out.println("Query = " + query);

            // Search by relevance and group the results
            int topNGroups = 10; // number of groups per page
            int groupOffset = 0; // offset of the first group
            boolean fillFields = true;
            // groupSort orders the groups, docSort orders the documents inside each group;
            // they are usually the same, but they may differ
            Sort docSort = Sort.RELEVANCE;
            Sort groupSort = docSort;
            int docOffset = 0;   // offset of the first document, for paging inside a group
            int docsPerGroup = 2; // number of results returned per group
            boolean requiredTotalGroupCount = true; // whether to compute the total number of groups

            // To adjust Lucene's scores, subclass TermFirstPassGroupingCollector
            TermFirstPassGroupingCollector c1 = new TermFirstPassGroupingCollector("catalog", groupSort, topNGroups);
            boolean cacheScores = true;
            double maxCacheRAMMB = 16.0;
            CachingCollector cachedCollector = CachingCollector.create(c1, cacheScores, maxCacheRAMMB);
            searcher.search(query, cachedCollector);

            Collection<SearchGroup<BytesRef>> topGroups = c1.getTopGroups(groupOffset, fillFields);
            if (topGroups == null)
            {
                // No groups matched
                return;
            }

            Collector secondPassCollector = null;
            boolean getScores = true;
            boolean getMaxScores = true;
            // To adjust Lucene's scores, subclass TermSecondPassGroupingCollector
            TermSecondPassGroupingCollector c2 = new TermSecondPassGroupingCollector("catalog", topGroups, groupSort, docSort, docsPerGroup, getScores, getMaxScores, fillFields);

            // Optionally count how many categories there are in total
            TermAllGroupsCollector allGroupsCollector = null;
            if (requiredTotalGroupCount)
            {
                allGroupsCollector = new TermAllGroupsCollector("catalog");
                secondPassCollector = MultiCollector.wrap(c2, allGroupsCollector);
            }
            else
            {
                secondPassCollector = c2;
            }

            if (cachedCollector.isCached())
            {
                // Cached: replay from the cache
                cachedCollector.replay(secondPassCollector);
            }
            else
            {
                // Cache size exceeded: run the query again
                searcher.search(query, secondPassCollector);
            }

            int totalGroupCount = -1;      // total number of groups
            int totalHitCount = -1;        // total number of matching documents
            int totalGroupedHitCount = -1; // matching documents that belong to a group (usually equal to totalHitCount)
            if (requiredTotalGroupCount)
            {
                totalGroupCount = allGroupsCollector.getGroupCount();
            }
            System.out.println("Number of categories matched: " + totalGroupCount);

            TopGroups<BytesRef> groupsResult = c2.getTopGroups(docOffset);
            totalHitCount = groupsResult.totalHitCount;
            totalGroupedHitCount = groupsResult.totalGroupedHitCount;
            System.out.println("groupsResult.totalHitCount:" + totalHitCount);
            System.out.println("groupsResult.totalGroupedHitCount:" + totalGroupedHitCount);

            int groupIdx = 0;
            // Iterate over the groups
            for (GroupDocs<BytesRef> groupDocs : groupsResult.groups)
            {
                groupIdx++;
                System.out.println("group[" + groupIdx + "]:" + groupDocs.groupValue); // group key
                System.out.println("group[" + groupIdx + "]:" + groupDocs.totalHits);  // number of documents in the group
                int docIdx = 0;
                // Iterate over the documents in the group
                for (ScoreDoc scoreDoc : groupDocs.scoreDocs)
                {
                    docIdx++;
                    System.out.println("group[" + groupIdx + "][" + docIdx + "]:" + scoreDoc.doc + "/" + scoreDoc.score);
                    Document doc = searcher.doc(scoreDoc.doc);
                    System.out.println("group[" + groupIdx + "][" + docIdx + "]:" + doc);
                }
            }
        }
    }

Output

    Query = text:sweet
    Number of categories matched: 2

    groupsResult.totalHitCount:150
    groupsResult.totalGroupedHitCount:150
    group[1]:[66 72 75 69 74]
    group[1]:100
    group[1][1]:0/0.573753
    group[1][1]:Document<stored,indexed,tokenized<text:Banana is sweet 0> stored,indexed,tokenized<catalog:fruit>>
    group[1][2]:1/0.573753
    group[1][2]:Document<stored,indexed,tokenized<text:Banana is sweet 1> stored,indexed,tokenized<catalog:fruit>>
    group[2]:[64 72 69 6e 6b]
    group[2]:50
    group[2][1]:100/0.573753
    group[2][1]:Document<stored,indexed,tokenized<text:Juice is sweet 0> stored,indexed,tokenized<catalog:drink>>
    group[2][2]:101/0.573753
    group[2][2]:Document<stored,indexed,tokenized<text:Juice is sweet 1> stored,indexed,tokenized<catalog:drink>>
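The group keys in the output appear as bracketed hex bytes because the group value is a BytesRef, and printing it directly shows the raw bytes. In the Lucene code itself, calling `groupDocs.groupValue.utf8ToString()` should give the readable term. To read the printed form after the fact, a small stand-alone decoder is enough; the `GroupValueDecoder` class below is a hypothetical helper that uses only the JDK.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: turns printed hex bytes such as "[66 72 75 69 74]" back into text. */
final class GroupValueDecoder {

    /** Parses space-separated hex bytes inside optional brackets and decodes them as UTF-8. */
    static String decode(String printed) {
        String inner = printed.trim();
        if (inner.startsWith("[") && inner.endsWith("]")) {
            inner = inner.substring(1, inner.length() - 1).trim();
        }
        if (inner.isEmpty()) {
            return "";
        }
        String[] parts = inner.split("\\s+");
        byte[] bytes = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            bytes[i] = (byte) Integer.parseInt(parts[i], 16);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(decode("[66 72 75 69 74]")); // prints fruit
        System.out.println(decode("[64 72 69 6e 6b]")); // prints drink
    }
}
```

Decoded this way, group 1 is `fruit` with 100 hits and group 2 is `drink` with 50 hits, matching the 150 total hits; the 25 `person` documents do not contain "sweet", so only two categories are reported.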
