A Deep Dive into How Solr Query Processing Works
1. What is Lucene?
As an open-source project, Lucene has drawn an enormous response from the open-source community since its release. Programmers use it not only to build full-text search applications, but also to embed it in system software and web applications; some commercial products have even adopted Lucene as the core of their internal full-text search subsystems. The Apache Software Foundation's own website uses Lucene as its full-text search engine; IBM's open-source Eclipse adopted Lucene as the full-text indexing engine for its help subsystem in version 2.1, and IBM's commercial WebSphere uses Lucene as well. Thanks to its open-source nature, excellent index structure, and clean system architecture, Lucene keeps finding wider adoption.
As a full-text search engine, Lucene has the following notable strengths:
(1) The index file format is independent of the application platform. Lucene defines a byte-oriented (8-bit) index file format, so compatible systems, or applications on different platforms, can share the index files they build.
(2) On top of the inverted index used by traditional full-text engines, Lucene implements segmented indexing: new documents can be indexed quickly into small segment files, which are then merged into the existing index for optimization.
(3) A well-designed object-oriented architecture lowers the learning curve for extending Lucene and makes it easy to add new functionality.
(4) A text-analysis interface independent of language and file format: the indexer builds index files by consuming a stream of Tokens, so supporting a new language or file format only requires implementing that interface.
(5) A powerful query engine is implemented by default, so users gain strong query capabilities without writing any engine code; Lucene's stock query implementation supports Boolean operators, fuzzy search (Fuzzy Search), grouped queries, and more (a usage sketch follows this list).
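To make point (5) concrete, here is a minimal sketch of running a query through Lucene's default query engine. It assumes a Lucene 6.x/7.x-era API; the index path and field names are hypothetical, chosen only for illustration.

import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class LuceneQuerySketch {
    public static void main(String[] args) throws Exception {
        // open an existing index (the path is hypothetical)
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // combine an exact term with a fuzzy term -- both supported out of the box
            BooleanQuery query = new BooleanQuery.Builder()
                    .add(new TermQuery(new Term("title", "lucene")), BooleanClause.Occur.MUST)
                    .add(new FuzzyQuery(new Term("body", "serch")), BooleanClause.Occur.SHOULD)
                    .build();
            TopDocs top = searcher.search(query, 10); // top-10 hits by score
            for (ScoreDoc sd : top.scoreDocs) {
                System.out.println("doc=" + sd.doc + " score=" + sd.score);
            }
        }
    }
}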
2. What is Solr?
為什么要solr:
Solr packages the whole indexing workflow into a ready-made search engine system (an enterprise-grade search product).
Solr can be deployed on a dedicated server as a web service; the business system only needs to send requests and receive responses, which reduces the load on the business system.
Because Solr runs on its own server, its index store is not constrained by the business system's disk space.
Solr supports distributed clustering, so index capacity and query throughput can scale out linearly.
How Solr works:
Solr is a wrapper built on top of the Lucene toolkit that exposes indexing functionality as a web service.
When the business system needs indexing features (building or querying an index), it simply issues an HTTP request and parses the returned data.
Solr is a top-level Apache open-source project, written in Java, that provides a full-text search server built on Lucene. Solr offers a richer query language than raw Lucene, is configurable and extensible, and optimizes both indexing and search performance.
Solr can run standalone inside a Servlet container such as Jetty or Tomcat. Indexing with Solr is simple: you POST an XML document describing Fields and their content to the Solr server, and Solr adds, deletes, or updates the index accordingly. Searching is just an HTTP GET; you parse the XML, JSON, or other formatted results Solr returns and lay out your page. Solr does not build UIs for you, but it ships an admin interface for inspecting Solr's configuration and runtime state. A rough sketch of this HTTP contract follows.
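A minimal sketch of the HTTP contract just described, using only the JDK. The core name collection1, host, and field names are assumptions for illustration; adjust them to your deployment.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class SolrHttpSketch {
    public static void main(String[] args) throws Exception {
        // Indexing: POST an XML document describing Fields to the /update handler
        String doc = "<add><doc>"
                + "<field name=\"id\">1</field>"
                + "<field name=\"title\">hello solr</field>"
                + "</doc></add>";
        URL update = new URL("http://localhost:8983/solr/collection1/update?commit=true");
        HttpURLConnection post = (HttpURLConnection) update.openConnection();
        post.setDoOutput(true);
        post.setRequestMethod("POST");
        post.setRequestProperty("Content-Type", "text/xml");
        try (OutputStream out = post.getOutputStream()) {
            out.write(doc.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("update status: " + post.getResponseCode());

        // Searching: a plain GET against /select; wt=json requests a JSON response
        URL select = new URL("http://localhost:8983/solr/collection1/select?q=title:solr&wt=json");
        HttpURLConnection get = (HttpURLConnection) select.openConnection();
        System.out.println("query status: " + get.getResponseCode());
        // the JSON response body would be read from get.getInputStream() and parsed
    }
}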
3. The relationship between Lucene and Solr
Solr is the front door and Lucene the underlying foundation; their relationship is like that of Hadoop to HDFS.
4. What is Jetty?
Jetty is an open-source servlet container that provides a runtime environment for Java-based web components such as JSPs and servlets. Jetty is written in Java, and its API ships as a set of JAR files. Developers can instantiate a Jetty container as an ordinary object, quickly giving a stand-alone Java application network and web connectivity.
5. Process overview
(Figure: the original post showed an overall request-processing flow diagram here; the image is not recoverable.)
6. Jetty receives and handles the request
To set up local debugging, see the earlier post <lucene-solr本地調試方法> (how to debug lucene-solr locally).
StartSolrJetty.java
public static void main(String[] args) {
    //System.setProperty("solr.solr.home", "../../../example/solr");

    Server server = new Server();
    ServerConnector connector = new ServerConnector(server, new HttpConnectionFactory());
    // Set some timeout options to make debugging easier.
    connector.setIdleTimeout(1000 * 60 * 60);
    connector.setSoLingerTime(-1);
    connector.setPort(8983);
    server.setConnectors(new Connector[] { connector });

    WebAppContext bb = new WebAppContext();
    bb.setServer(server);
    bb.setContextPath("/solr");
    bb.setWar("solr/webapp/web");

    // // START JMX SERVER
    // if (true) {
    //     MBeanServer mBeanServer = ManagementFactory.getPlatformMBeanServer();
    //     MBeanContainer mBeanContainer = new MBeanContainer(mBeanServer);
    //     server.getContainer().addEventListener(mBeanContainer);
    //     mBeanContainer.start();
    // }
    server.setHandler(bb);

    try {
        System.out.println(">>> STARTING EMBEDDED JETTY SERVER, PRESS ANY KEY TO STOP");
        server.start();
        while (System.in.available() == 0) {
            Thread.sleep(5000);
        }
        server.stop();
        server.join();
    } catch (Exception e) {
        e.printStackTrace();
        System.exit(100);
    }
}

Here, Server is the HTTP server: it aggregates Connectors (HTTP request receivers) and request Handlers. The Server is itself a Handler and a ThreadPool, and Connectors use the thread pool to run jobs that eventually call the handle method.
/** Jetty HTTP Servlet Server.
 * This class is the main class for the Jetty HTTP Servlet server.
 * It aggregates Connectors (HTTP request receivers) and request Handlers.
 * The server is itself a handler and a ThreadPool. Connectors use the ThreadPool methods
 * to run jobs that will eventually call the handle method.
 */

Jetty's workflow was illustrated by a diagram in the original post; since it is not the focus of this article, it is not covered further.
7. How Solr invokes Lucene
The previous post <solr調用lucene底層實現倒排索引源碼解析> (a source-level analysis of how Solr drives Lucene's inverted indexing) already covered this; it can be read against the overall flow diagram above, so it is not repeated here.
8. The Lucene call path
As the flow diagram above shows, the process splits into two phases.
8.1 Creating the Weight
8.1.1 Creating BooleanWeight
BooleanWeight.java
BooleanWeight(BooleanQuery query, IndexSearcher searcher, boolean needsScores, float boost) throws IOException {
    super(query);
    this.query = query;
    this.needsScores = needsScores;
    this.similarity = searcher.getSimilarity(needsScores);
    weights = new ArrayList<>();
    for (BooleanClause c : query) {
        Query q = c.getQuery();
        // create one child Weight per clause; only scoring clauses need scores
        Weight w = searcher.createWeight(q, needsScores && c.isScoring(), boost);
        weights.add(w);
    }
}
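For orientation, a rough sketch (not a quote from the Lucene source) of where this constructor is reached: IndexSearcher first rewrites the query, then asks it for its Weight; for a BooleanQuery that Weight is the BooleanWeight above, which recursively creates one child Weight per clause. Against a Lucene 7.x-era API the entry point looks roughly like:

// sketch only: the entry into the Weight tree (Lucene 7.x-era signature)
Query rewritten = searcher.rewrite(booleanQuery);           // e.g. a BooleanQuery over several clauses
Weight weight = searcher.createWeight(rewritten, true, 1f); // needsScores=true, boost=1.0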
8.1.2 Synonym weight analysis
SynonymQuery.java
@Override
public Weight createWeight(IndexSearcher searcher, boolean needsScores, float boost) throws IOException {
    if (needsScores) {
        return new SynonymWeight(this, searcher, boost);
    } else {
        // if scores are not needed, let BooleanWeight deal with optimizing that case.
        BooleanQuery.Builder bq = new BooleanQuery.Builder();
        for (Term term : terms) {
            bq.add(new TermQuery(term), BooleanClause.Occur.SHOULD);
        }
        return searcher.rewrite(bq.build()).createWeight(searcher, needsScores, boost);
    }
}

8.1.3 TermQuery.java
@Override
public Weight createWeight(IndexSearcher searcher, boolean needsScores, float boost) throws IOException {
    final IndexReaderContext context = searcher.getTopReaderContext();
    final TermContext termState;
    if (perReaderTermState == null
            || perReaderTermState.wasBuiltFor(context) == false) {
        if (needsScores) {
            // make TermQuery single-pass if we don't have a PRTS or if the context differs!
            termState = TermContext.build(context, term);
        } else {
            // do not compute the term state, this will help save seeks in the terms
            // dict on segments that have a cache entry for this query
            termState = null;
        }
    } else {
        // PRTS was pre-build for this IS
        termState = this.perReaderTermState;
    }
    return new TermWeight(searcher, needsScores, boost, termState);
}

This constructs a TermWeight, which computes the CollectionStatistics and TermStatistics:
public TermWeight(IndexSearcher searcher, boolean needsScores,
        float boost, TermContext termStates) throws IOException {
    super(TermQuery.this);
    if (needsScores && termStates == null) {
        throw new IllegalStateException("termStates are required when scores are needed");
    }
    this.needsScores = needsScores;
    this.termStates = termStates;
    this.similarity = searcher.getSimilarity(needsScores);

    final CollectionStatistics collectionStats;
    final TermStatistics termStats;
    if (needsScores) {
        termStates.setQuery(this.getQuery().getKeyword());
        collectionStats = searcher.collectionStatistics(term.field());
        termStats = searcher.termStatistics(term, termStates);
    } else {
        // we do not need the actual stats, use fake stats with docFreq=maxDoc and ttf=-1
        final int maxDoc = searcher.getIndexReader().maxDoc();
        collectionStats = new CollectionStatistics(term.field(), maxDoc, -1, -1, -1);
        termStats = new TermStatistics(term.bytes(), maxDoc, -1, term.bytes());
    }
    this.stats = similarity.computeWeight(boost, collectionStats, termStats);
}

The constructor finishes by calling the Similarity's computeWeight:
BM25Similarity.java
@Override
public final SimWeight computeWeight(float boost, CollectionStatistics collectionStats, TermStatistics... termStats) {
    Explanation idf = termStats.length == 1 ? idfExplain(collectionStats, termStats[0]) : idfExplain(collectionStats, termStats);
    float avgdl = avgFieldLength(collectionStats);

    // precompute the BM25 length-normalization factor for all 256 encoded field lengths
    float[] oldCache = new float[256];
    float[] cache = new float[256];
    for (int i = 0; i < cache.length; i++) {
        oldCache[i] = k1 * ((1 - b) + b * OLD_LENGTH_TABLE[i] / avgdl);
        cache[i] = k1 * ((1 - b) + b * LENGTH_TABLE[i] / avgdl);
    }
    return new BM25Stats(collectionStats.field(), boost, idf, avgdl, oldCache, cache);
}
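To see what those cache arrays are for: in standard BM25 notation (the textbook formula, which this implementation follows up to its 8-bit field-length encoding), the weight of term t in document d is

$$\mathrm{idf}(t) = \ln\!\left(1 + \frac{N - \mathrm{df}(t) + 0.5}{\mathrm{df}(t) + 0.5}\right), \qquad \mathrm{score}(t,d) = \mathrm{idf}(t) \cdot \frac{tf \cdot (k_1 + 1)}{tf + k_1\left(1 - b + b \cdot \frac{|d|}{avgdl}\right)}$$

where N is the number of documents, df(t) the term's document frequency, tf its frequency in d, |d| the field length, and avgdl the average field length. Because field lengths are encoded into a single byte, the loop above can precompute the denominator factor k1·(1 − b + b·|d|/avgdl) for all 256 possible encoded lengths, reducing per-document scoring to a table lookup.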
8.2 Query process
The complete flow is as follows: IndexSearcher invokes its search method.
protected void search(List<LeafReaderContext> leaves, Weight weight, Collector collector) throws IOException {
    // TODO: should we make this
    // threaded...? the Collector could be sync'd?
    // always use single thread:
    for (LeafReaderContext ctx : leaves) { // search each subreader
        final LeafCollector leafCollector;
        try {
            leafCollector = collector.getLeafCollector(ctx); // 1. obtain the per-segment collector
        } catch (CollectionTerminatedException e) {
            // there is no doc of interest in this reader context
            // continue with the following leaf
            continue;
        }
        BulkScorer scorer = weight.bulkScorer(ctx); // 2. build a BulkScorer from the Weight
        if (scorer != null) {
            try {
                scorer.score(leafCollector, ctx.reader().getLiveDocs()); // 3. score and collect matching docs
            } catch (CollectionTerminatedException e) {
                // collection was terminated prematurely
                // continue with the following leaf
            }
        }
    }
}
8.2.1 Obtaining the Collector
TopScoreDocCollector.java#SimpleTopScoreDocCollector
@Override
public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
    final int docBase = context.docBase;
    return new ScorerLeafCollector() {
        @Override
        public void collect(int doc) throws IOException {
            float score = scorer.score();
            /* Document document = context.reader().document(doc); */

            // This collector cannot handle these scores:
            assert score != Float.NEGATIVE_INFINITY;
            assert !Float.isNaN(score);

            totalHits++;
            if (score <= pqTop.score) {
                // Since docs are returned in-order (i.e., increasing doc Id), a document
                // with equal score to pqTop.score cannot compete since HitQueue favors
                // documents with lower doc Ids. Therefore reject those docs too.
                return;
            }
            pqTop.doc = doc + docBase;
            pqTop.score = score;
            pqTop = pq.updateTop();
        }
    };
}
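For orientation, a hedged sketch of how a caller typically drives this collector (variable names are illustrative):

// sketch: collect the top 10 hits using the collector shown above
TopScoreDocCollector collector = TopScoreDocCollector.create(10);
searcher.search(query, collector);  // invokes getLeafCollector/collect per segment
TopDocs hits = collector.topDocs(); // drains the priority queue in score order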
8.2.2 Invoking the scorer (score)
/**
 * Optional method, to return a {@link BulkScorer} to
 * score the query and send hits to a {@link Collector}.
 * Only queries that have a different top-level approach
 * need to override this; the default implementation
 * pulls a normal {@link Scorer} and iterates and
 * collects the resulting hits which are not marked as deleted.
 *
 * @param context
 *          the {@link org.apache.lucene.index.LeafReaderContext} for which to return the {@link Scorer}.
 *
 * @return a {@link BulkScorer} which scores documents and
 *         passes them to a collector.
 * @throws IOException if there is a low-level I/O error
 */
public BulkScorer bulkScorer(LeafReaderContext context) throws IOException {
    Scorer scorer = scorer(context);
    if (scorer == null) {
        // No docs match
        return null;
    }
    // This impl always scores docs in order, so we can
    // ignore scoreDocsInOrder:
    return new DefaultBulkScorer(scorer);
}

/** Just wraps a Scorer and performs top scoring using it.
 * @lucene.internal */
protected static class DefaultBulkScorer extends BulkScorer {
    private final Scorer scorer;
    private final DocIdSetIterator iterator;
    private final TwoPhaseIterator twoPhase;

    /** Sole constructor. */
    public DefaultBulkScorer(Scorer scorer) {
        if (scorer == null) {
            throw new NullPointerException();
        }
        this.scorer = scorer;
        this.iterator = scorer.iterator();
        this.twoPhase = scorer.twoPhaseIterator();
    }

    @Override
    public long cost() {
        return iterator.cost();
    }

    @Override
    public int score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {
        collector.setScorer(scorer);
        if (scorer.docID() == -1 && min == 0 && max == DocIdSetIterator.NO_MORE_DOCS) {
            scoreAll(collector, iterator, twoPhase, acceptDocs);
            return DocIdSetIterator.NO_MORE_DOCS;
        } else {
            int doc = scorer.docID();
            if (doc < min) {
                if (twoPhase == null) {
                    doc = iterator.advance(min);
                } else {
                    doc = twoPhase.approximation().advance(min);
                }
            }
            return scoreRange(collector, iterator, twoPhase, acceptDocs, doc, max);
        }
    }
}

In the common case this dispatches to scoreAll, which iterates over the matching documents and invokes SimpleTopScoreDocCollector's collect method, where the scoring logic lives (see the SimpleTopScoreDocCollector code above):
/** Specialized method to bulk-score all hits; we
 * separate this from {@link #scoreRange} to help out
 * hotspot.
 * See <a href="https://issues.apache.org/jira/browse/LUCENE-5487">LUCENE-5487</a> */
static void scoreAll(LeafCollector collector, DocIdSetIterator iterator, TwoPhaseIterator twoPhase, Bits acceptDocs) throws IOException {
    if (twoPhase == null) {
        for (int doc = iterator.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = iterator.nextDoc()) {
            if (acceptDocs == null || acceptDocs.get(doc)) {
                collector.collect(doc);
            }
        }
    } else {
        // The scorer has an approximation, so run the approximation first, then check acceptDocs, then confirm
        final DocIdSetIterator approximation = twoPhase.approximation();
        for (int doc = approximation.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = approximation.nextDoc()) {
            if ((acceptDocs == null || acceptDocs.get(doc)) && twoPhase.matches()) {
                collector.collect(doc);
            }
        }
    }
}
Summary:
Honestly, tracing and writing up this entire flow was exhausting work.
Reposted from: https://www.cnblogs.com/davidwang456/p/10570935.html