A Deep Dive into How Solr Query Processing Works
1. What is Lucene?
As an open-source project, Lucene has drawn an enormous response from the open-source community since its release. Programmers use it not only to build full-text search applications, but also to embed it in system software and web applications; some commercial products have even adopted Lucene as the core of their internal full-text search subsystems. The Apache Software Foundation's own website uses Lucene as its full-text search engine; IBM's open-source Eclipse adopted Lucene as the full-text indexing engine for its help subsystem in version 2.1, and IBM's commercial WebSphere uses Lucene as well. Thanks to its open-source nature, excellent index structure, and clean system architecture, Lucene keeps finding wider adoption.
As a full-text search engine, Lucene has the following notable strengths:
(1) The index file format is independent of the application platform. Lucene defines a byte-oriented (8-bit) index file format, so compatible systems, or applications on different platforms, can share the index files they build.
(2) On top of the inverted index used by traditional full-text engines, Lucene implements segmented indexing: new documents can be indexed quickly into small segment files, which are then merged into the existing index for optimization.
(3) A well-designed object-oriented architecture lowers the learning curve for extending Lucene and makes it easy to add new functionality.
(4) A text-analysis interface independent of language and file format: the indexer builds index files by consuming a stream of Tokens, so supporting a new language or file format only requires implementing that interface.
(5) A powerful query engine is implemented by default, so users gain strong query capabilities without writing any engine code; Lucene's stock query implementation supports Boolean operators, fuzzy search (Fuzzy Search), grouped queries, and more (a usage sketch follows this list).
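To make point (5) concrete, here is a minimal sketch of running a query through Lucene's default query engine. It assumes a Lucene 6.x/7.x-era API; the index path and field names are hypothetical, chosen only for illustration.

import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class LuceneQuerySketch {
    public static void main(String[] args) throws Exception {
        // open an existing index (the path is hypothetical)
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // combine an exact term with a fuzzy term -- both supported out of the box
            BooleanQuery query = new BooleanQuery.Builder()
                    .add(new TermQuery(new Term("title", "lucene")), BooleanClause.Occur.MUST)
                    .add(new FuzzyQuery(new Term("body", "serch")), BooleanClause.Occur.SHOULD)
                    .build();
            TopDocs top = searcher.search(query, 10); // top-10 hits by score
            for (ScoreDoc sd : top.scoreDocs) {
                System.out.println("doc=" + sd.doc + " score=" + sd.score);
            }
        }
    }
}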
2. What is Solr?
為什么要solr:
Solr packages the whole indexing workflow into a ready-made search engine system (an enterprise-grade search product).
Solr can be deployed on a dedicated server as a web service; the business system only needs to send requests and receive responses, which reduces the load on the business system.
Because Solr runs on its own server, its index store is not constrained by the business system's disk space.
Solr supports distributed clustering, so index capacity and query throughput can scale out linearly.
How Solr works:
Solr is a wrapper built on top of the Lucene toolkit that exposes indexing functionality as a web service.
When the business system needs indexing features (building or querying an index), it simply issues an HTTP request and parses the returned data.
Solr is a top-level Apache open-source project, written in Java, that provides a full-text search server built on Lucene. Solr offers a richer query language than raw Lucene, is configurable and extensible, and optimizes both indexing and search performance.
Solr can run standalone inside a Servlet container such as Jetty or Tomcat. Indexing with Solr is simple: you POST an XML document describing Fields and their content to the Solr server, and Solr adds, deletes, or updates the index accordingly. Searching is just an HTTP GET; you parse the XML, JSON, or other formatted results Solr returns and lay out your page. Solr does not build UIs for you, but it ships an admin interface for inspecting Solr's configuration and runtime state. A rough sketch of this HTTP contract follows.
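A minimal sketch of the HTTP contract just described, using only the JDK. The core name collection1, host, and field names are assumptions for illustration; adjust them to your deployment.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class SolrHttpSketch {
    public static void main(String[] args) throws Exception {
        // Indexing: POST an XML document describing Fields to the /update handler
        String doc = "<add><doc>"
                + "<field name=\"id\">1</field>"
                + "<field name=\"title\">hello solr</field>"
                + "</doc></add>";
        URL update = new URL("http://localhost:8983/solr/collection1/update?commit=true");
        HttpURLConnection post = (HttpURLConnection) update.openConnection();
        post.setDoOutput(true);
        post.setRequestMethod("POST");
        post.setRequestProperty("Content-Type", "text/xml");
        try (OutputStream out = post.getOutputStream()) {
            out.write(doc.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("update status: " + post.getResponseCode());

        // Searching: a plain GET against /select; wt=json requests a JSON response
        URL select = new URL("http://localhost:8983/solr/collection1/select?q=title:solr&wt=json");
        HttpURLConnection get = (HttpURLConnection) select.openConnection();
        System.out.println("query status: " + get.getResponseCode());
        // the JSON response body would be read from get.getInputStream() and parsed
    }
}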
3. The relationship between Lucene and Solr
Solr is the front door and Lucene the underlying foundation; their relationship is like that of Hadoop to HDFS.
4. What is Jetty?
Jetty is an open-source servlet container that provides a runtime environment for Java-based web components such as JSPs and servlets. Jetty is written in Java, and its API ships as a set of JAR files. Developers can instantiate a Jetty container as an ordinary object, quickly giving a stand-alone Java application network and web connectivity.
5. Process overview
(Figure: the original post showed an overall request-processing flow diagram here; the image is not recoverable.)
6. Jetty receives and handles the request
To set up local debugging, see the earlier post <lucene-solr本地調試方法> (how to debug lucene-solr locally).
StartSolrJetty.java
public static void main(String[] args) {
    //System.setProperty("solr.solr.home", "../../../example/solr");

    Server server = new Server();
    ServerConnector connector = new ServerConnector(server, new HttpConnectionFactory());
    // Set some timeout options to make debugging easier.
    connector.setIdleTimeout(1000 * 60 * 60);
    connector.setSoLingerTime(-1);
    connector.setPort(8983);
    server.setConnectors(new Connector[] { connector });

    WebAppContext bb = new WebAppContext();
    bb.setServer(server);
    bb.setContextPath("/solr");
    bb.setWar("solr/webapp/web");

    // // START JMX SERVER
    // if (true) {
    //     MBeanServer mBeanServer = ManagementFactory.getPlatformMBeanServer();
    //     MBeanContainer mBeanContainer = new MBeanContainer(mBeanServer);
    //     server.getContainer().addEventListener(mBeanContainer);
    //     mBeanContainer.start();
    // }
    server.setHandler(bb);

    try {
        System.out.println(">>> STARTING EMBEDDED JETTY SERVER, PRESS ANY KEY TO STOP");
        server.start();
        while (System.in.available() == 0) {
            Thread.sleep(5000);
        }
        server.stop();
        server.join();
    } catch (Exception e) {
        e.printStackTrace();
        System.exit(100);
    }
}

Here, Server is the HTTP server: it aggregates Connectors (HTTP request receivers) and request Handlers. The Server is itself a Handler and a ThreadPool, and Connectors use the thread pool to run jobs that eventually call the handle method.
/** Jetty HTTP Servlet Server.
 * This class is the main class for the Jetty HTTP Servlet server.
 * It aggregates Connectors (HTTP request receivers) and request Handlers.
 * The server is itself a handler and a ThreadPool. Connectors use the ThreadPool methods
 * to run jobs that will eventually call the handle method.
 */

Jetty's workflow was illustrated by a diagram in the original post; since it is not the focus of this article, it is not covered further.
7. How Solr invokes Lucene
The previous post <solr調用lucene底層實現倒排索引源碼解析> (a source-level analysis of how Solr drives Lucene's inverted indexing) already covered this; it can be read against the overall flow diagram above, so it is not repeated here.
8. The Lucene call path
As the flow diagram above shows, the process splits into two phases.
8.1 Creating the Weight
8.1.1 Creating BooleanWeight
BooleanWeight.java
BooleanWeight(BooleanQuery query, IndexSearcher searcher, boolean needsScores, float boost) throws IOException {
    super(query);
    this.query = query;
    this.needsScores = needsScores;
    this.similarity = searcher.getSimilarity(needsScores);
    weights = new ArrayList<>();
    for (BooleanClause c : query) {
        Query q = c.getQuery();
        // create one child Weight per clause; only scoring clauses need scores
        Weight w = searcher.createWeight(q, needsScores && c.isScoring(), boost);
        weights.add(w);
    }
}
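For orientation, a rough sketch (not a quote from the Lucene source) of where this constructor is reached: IndexSearcher first rewrites the query, then asks it for its Weight; for a BooleanQuery that Weight is the BooleanWeight above, which recursively creates one child Weight per clause. Against a Lucene 7.x-era API the entry point looks roughly like:

// sketch only: the entry into the Weight tree (Lucene 7.x-era signature)
Query rewritten = searcher.rewrite(booleanQuery);           // e.g. a BooleanQuery over several clauses
Weight weight = searcher.createWeight(rewritten, true, 1f); // needsScores=true, boost=1.0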
8.1.2 Synonym weight analysis
SynonymQuery.java
@Override
public Weight createWeight(IndexSearcher searcher, boolean needsScores, float boost) throws IOException {
    if (needsScores) {
        return new SynonymWeight(this, searcher, boost);
    } else {
        // if scores are not needed, let BooleanWeight deal with optimizing that case.
        BooleanQuery.Builder bq = new BooleanQuery.Builder();
        for (Term term : terms) {
            bq.add(new TermQuery(term), BooleanClause.Occur.SHOULD);
        }
        return searcher.rewrite(bq.build()).createWeight(searcher, needsScores, boost);
    }
}

8.1.3 TermQuery.java
@Override
public Weight createWeight(IndexSearcher searcher, boolean needsScores, float boost) throws IOException {
    final IndexReaderContext context = searcher.getTopReaderContext();
    final TermContext termState;
    if (perReaderTermState == null
            || perReaderTermState.wasBuiltFor(context) == false) {
        if (needsScores) {
            // make TermQuery single-pass if we don't have a PRTS or if the context differs!
            termState = TermContext.build(context, term);
        } else {
            // do not compute the term state, this will help save seeks in the terms
            // dict on segments that have a cache entry for this query
            termState = null;
        }
    } else {
        // PRTS was pre-build for this IS
        termState = this.perReaderTermState;
    }
    return new TermWeight(searcher, needsScores, boost, termState);
}

This constructs a TermWeight, which computes the CollectionStatistics and TermStatistics:
public TermWeight(IndexSearcher searcher, boolean needsScores,
        float boost, TermContext termStates) throws IOException {
    super(TermQuery.this);
    if (needsScores && termStates == null) {
        throw new IllegalStateException("termStates are required when scores are needed");
    }
    this.needsScores = needsScores;
    this.termStates = termStates;
    this.similarity = searcher.getSimilarity(needsScores);

    final CollectionStatistics collectionStats;
    final TermStatistics termStats;
    if (needsScores) {
        termStates.setQuery(this.getQuery().getKeyword());
        collectionStats = searcher.collectionStatistics(term.field());
        termStats = searcher.termStatistics(term, termStates);
    } else {
        // we do not need the actual stats, use fake stats with docFreq=maxDoc and ttf=-1
        final int maxDoc = searcher.getIndexReader().maxDoc();
        collectionStats = new CollectionStatistics(term.field(), maxDoc, -1, -1, -1);
        termStats = new TermStatistics(term.bytes(), maxDoc, -1, term.bytes());
    }
    this.stats = similarity.computeWeight(boost, collectionStats, termStats);
}

The constructor finishes by calling the Similarity's computeWeight:
BM25Similarity.java
@Override
public final SimWeight computeWeight(float boost, CollectionStatistics collectionStats, TermStatistics... termStats) {
    Explanation idf = termStats.length == 1 ? idfExplain(collectionStats, termStats[0]) : idfExplain(collectionStats, termStats);
    float avgdl = avgFieldLength(collectionStats);

    // precompute the BM25 length-normalization factor for all 256 encoded field lengths
    float[] oldCache = new float[256];
    float[] cache = new float[256];
    for (int i = 0; i < cache.length; i++) {
        oldCache[i] = k1 * ((1 - b) + b * OLD_LENGTH_TABLE[i] / avgdl);
        cache[i] = k1 * ((1 - b) + b * LENGTH_TABLE[i] / avgdl);
    }
    return new BM25Stats(collectionStats.field(), boost, idf, avgdl, oldCache, cache);
}
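To see what those cache arrays are for: in standard BM25 notation (the textbook formula, which this implementation follows up to its 8-bit field-length encoding), the weight of term t in document d is

$$\mathrm{idf}(t) = \ln\!\left(1 + \frac{N - \mathrm{df}(t) + 0.5}{\mathrm{df}(t) + 0.5}\right), \qquad \mathrm{score}(t,d) = \mathrm{idf}(t) \cdot \frac{tf \cdot (k_1 + 1)}{tf + k_1\left(1 - b + b \cdot \frac{|d|}{avgdl}\right)}$$

where N is the number of documents, df(t) the term's document frequency, tf its frequency in d, |d| the field length, and avgdl the average field length. Because field lengths are encoded into a single byte, the loop above can precompute the denominator factor k1·(1 − b + b·|d|/avgdl) for all 256 possible encoded lengths, reducing per-document scoring to a table lookup.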
8.2 Query process
The complete flow is as follows: IndexSearcher invokes its search method.
protected void search(List<LeafReaderContext> leaves, Weight weight, Collector collector) throws IOException {
    // TODO: should we make this
    // threaded...? the Collector could be sync'd?
    // always use single thread:
    for (LeafReaderContext ctx : leaves) { // search each subreader
        final LeafCollector leafCollector;
        try {
            leafCollector = collector.getLeafCollector(ctx); // 1. obtain the per-segment collector
        } catch (CollectionTerminatedException e) {
            // there is no doc of interest in this reader context
            // continue with the following leaf
            continue;
        }
        BulkScorer scorer = weight.bulkScorer(ctx); // 2. build a BulkScorer from the Weight
        if (scorer != null) {
            try {
                scorer.score(leafCollector, ctx.reader().getLiveDocs()); // 3. score and collect matching docs
            } catch (CollectionTerminatedException e) {
                // collection was terminated prematurely
                // continue with the following leaf
            }
        }
    }
}
8.2.1 Obtaining the Collector
TopScoreDocCollector.java#SimpleTopScoreDocCollector
@Override
public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
    final int docBase = context.docBase;
    return new ScorerLeafCollector() {
        @Override
        public void collect(int doc) throws IOException {
            float score = scorer.score();
            /* Document document = context.reader().document(doc); */

            // This collector cannot handle these scores:
            assert score != Float.NEGATIVE_INFINITY;
            assert !Float.isNaN(score);

            totalHits++;
            if (score <= pqTop.score) {
                // Since docs are returned in-order (i.e., increasing doc Id), a document
                // with equal score to pqTop.score cannot compete since HitQueue favors
                // documents with lower doc Ids. Therefore reject those docs too.
                return;
            }
            pqTop.doc = doc + docBase;
            pqTop.score = score;
            pqTop = pq.updateTop();
        }
    };
}
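For orientation, a hedged sketch of how a caller typically drives this collector (variable names are illustrative):

// sketch: collect the top 10 hits using the collector shown above
TopScoreDocCollector collector = TopScoreDocCollector.create(10);
searcher.search(query, collector);  // invokes getLeafCollector/collect per segment
TopDocs hits = collector.topDocs(); // drains the priority queue in score order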
8.2.2 Invoking the scorer (score)
/**
 * Optional method, to return a {@link BulkScorer} to
 * score the query and send hits to a {@link Collector}.
 * Only queries that have a different top-level approach
 * need to override this; the default implementation
 * pulls a normal {@link Scorer} and iterates and
 * collects the resulting hits which are not marked as deleted.
 *
 * @param context
 *          the {@link org.apache.lucene.index.LeafReaderContext} for which to return the {@link Scorer}.
 *
 * @return a {@link BulkScorer} which scores documents and
 *         passes them to a collector.
 * @throws IOException if there is a low-level I/O error
 */
public BulkScorer bulkScorer(LeafReaderContext context) throws IOException {
    Scorer scorer = scorer(context);
    if (scorer == null) {
        // No docs match
        return null;
    }
    // This impl always scores docs in order, so we can
    // ignore scoreDocsInOrder:
    return new DefaultBulkScorer(scorer);
}

/** Just wraps a Scorer and performs top scoring using it.
 * @lucene.internal */
protected static class DefaultBulkScorer extends BulkScorer {
    private final Scorer scorer;
    private final DocIdSetIterator iterator;
    private final TwoPhaseIterator twoPhase;

    /** Sole constructor. */
    public DefaultBulkScorer(Scorer scorer) {
        if (scorer == null) {
            throw new NullPointerException();
        }
        this.scorer = scorer;
        this.iterator = scorer.iterator();
        this.twoPhase = scorer.twoPhaseIterator();
    }

    @Override
    public long cost() {
        return iterator.cost();
    }

    @Override
    public int score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {
        collector.setScorer(scorer);
        if (scorer.docID() == -1 && min == 0 && max == DocIdSetIterator.NO_MORE_DOCS) {
            scoreAll(collector, iterator, twoPhase, acceptDocs);
            return DocIdSetIterator.NO_MORE_DOCS;
        } else {
            int doc = scorer.docID();
            if (doc < min) {
                if (twoPhase == null) {
                    doc = iterator.advance(min);
                } else {
                    doc = twoPhase.approximation().advance(min);
                }
            }
            return scoreRange(collector, iterator, twoPhase, acceptDocs, doc, max);
        }
    }
}

In the common case this dispatches to scoreAll, which iterates over the matching documents and invokes SimpleTopScoreDocCollector's collect method, where the scoring logic lives (see the SimpleTopScoreDocCollector code above):
/** Specialized method to bulk-score all hits; we
 * separate this from {@link #scoreRange} to help out
 * hotspot.
 * See <a href="https://issues.apache.org/jira/browse/LUCENE-5487">LUCENE-5487</a> */
static void scoreAll(LeafCollector collector, DocIdSetIterator iterator, TwoPhaseIterator twoPhase, Bits acceptDocs) throws IOException {
    if (twoPhase == null) {
        for (int doc = iterator.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = iterator.nextDoc()) {
            if (acceptDocs == null || acceptDocs.get(doc)) {
                collector.collect(doc);
            }
        }
    } else {
        // The scorer has an approximation, so run the approximation first, then check acceptDocs, then confirm
        final DocIdSetIterator approximation = twoPhase.approximation();
        for (int doc = approximation.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = approximation.nextDoc()) {
            if ((acceptDocs == null || acceptDocs.get(doc)) && twoPhase.matches()) {
                collector.collect(doc);
            }
        }
    }
}
Summary:
Honestly, tracing and writing up this entire flow was exhausting work.
Reposted from: https://www.cnblogs.com/davidwang456/p/10570935.html