Similarity search in vector space with Elasticsearch
Table of contents
- What are Word Embeddings
- Indexing `Word Embeddings`
- Cosine similarity for scoring
- Limitations
- Searching by abstract properties
Text retrieval in Elasticsearch is based on the similarity between texts. In Elasticsearch 5.0, the default algorithm used to score results by their relevance to a query was switched from TF/IDF to Okapi BM25. Readers interested in TF/IDF and Okapi BM25 in Elasticsearch can consult the official Elastic blog. (Put simply, the essential difference is that TF/IDF is a vector space model while BM25 is a probabilistic model; a brief comparison of the two can be found in this article.)
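TF/IDF is a vector space model that treats each term of the query as one dimension of the vector model. Under this model the query is defined as one vector and a document stored in Elasticsearch as another. The scalar product of the two vectors is then taken as the relevance of the document to the query: the larger the scalar product, the higher the relevance. BM25 belongs to the probabilistic models, yet its formula does not differ from TF/IDF as much as you might expect. Both models define the weight of each term as the product of some idf function and some tf function, and then sum these weights into the score (relevance) of the whole document for the given query.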
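The vector space idea can be sketched in a few lines of Python. This is a textbook-style simplification under assumed definitions (raw term counts as tf, a smoothed logarithmic idf, and a made-up three-document corpus), not the formula that Lucene or Elasticsearch actually uses.

```python
import math
from collections import Counter

# Made-up toy corpus, for illustration only.
docs = {
    "d1": "quick brown fox".split(),
    "d2": "quick quick dog".split(),
    "d3": "lazy brown dog".split(),
}

# One vector dimension per distinct term.
vocabulary = sorted({term for tokens in docs.values() for term in tokens})

# Assumed idf variant: log(1 + N / df). Real engines use other variants.
n_docs = len(docs)
idf = {
    term: math.log(1 + n_docs / sum(1 for tokens in docs.values() if term in tokens))
    for term in vocabulary
}


def tf_idf_vector(tokens):
    """Weight of each term = tf (raw count) * idf."""
    counts = Counter(tokens)
    return [counts[term] * idf[term] for term in vocabulary]


def scalar_product(a, b):
    return sum(x * y for x, y in zip(a, b))


query_vector = tf_idf_vector("quick fox".split())
scores = {
    doc_id: scalar_product(query_vector, tf_idf_vector(tokens))
    for doc_id, tokens in docs.items()
}

# Documents ordered from most to least relevant.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

The document that contains both query terms ranks first, and a document that shares no term with the query gets a scalar product of zero.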
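But what if we want to find similar documents based on something more abstract, such as the meaning of words or the writing style? This is where Elasticsearch's dense vector field data type (dense_vector) and script-score queries on vector fields come into play.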
What are Word Embeddings
Computers do not understand strings, so we need to convert text into numbers. Word embeddings do exactly that. The definition is: A Word Embedding format generally tries to map a word using a dictionary to a vector.
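As the figure below shows, a computer identifies bread and toast by their ASCII codes. Judged by ASCII codes alone, the two words have nothing to do with each other. But if we analyze their contexts in a large corpus and use those context relationships as vector dimensions, we see that both usually appear together with words such as butter and breakfast.
Our search space may not contain the word bread at all, but if a user searches for bread we can also try to return information related to toast, which may be useful to the user.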
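So, as in the figure above, we map bread and toast to:
The corresponding vector arrays are the word embeddings of bread and toast. Word embeddings can of course be generated with different algorithms, which are not covered in detail here.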
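The idea of using context as vector dimensions can be illustrated with raw co-occurrence counts. The sketch below assumes a made-up four-sentence corpus and a hand-picked list of context words; real embeddings such as GloVe are dense vectors learned from very large corpora, so this is only a rough analogy.

```python
from collections import Counter

# Made-up corpus, for illustration only.
corpus = [
    "bread with butter for breakfast",
    "toast with butter for breakfast",
    "the engine needs oil",
    "toast and bread with jam",
]

# Hand-picked context words; each one becomes a vector dimension.
CONTEXT_WORDS = ["butter", "breakfast", "jam", "oil"]


def cooccurrence_vector(word, sentences, context_words):
    """Count how often each context word shares a sentence with `word`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        if word in tokens:
            counts.update(token for token in tokens if token != word)
    return [counts[context] for context in context_words]


bread = cooccurrence_vector("bread", corpus, CONTEXT_WORDS)
toast = cooccurrence_vector("toast", corpus, CONTEXT_WORDS)
engine = cooccurrence_vector("engine", corpus, CONTEXT_WORDS)

print("bread ", bread)
print("toast ", toast)
print("engine", engine)
```

In this toy corpus bread and toast end up with identical context vectors, while engine shares no dimension with either of them, even though the spellings of bread and toast have nothing in common.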
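Indexing Word Embeddings
Word embeddings are vector representations of words, commonly used in natural language processing tasks such as text classification or sentiment analysis. Similar words tend to occur in similar contexts. Word embeddings map words that occur in similar contexts to vector representations with similar values. In this way the semantics of a word are preserved to some extent.
To demonstrate the use of vector fields, we import pre-trained GloVe word embeddings into Elasticsearch. The file glove.6B.50d.txt maps each of the 400000 words of the vocabulary to a 50-dimensional vector. An excerpt is shown below.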
public 0.034236 0.50591 -0.19488 -0.26424 -0.269 -0.0024169 -0.42642 -0.29695 0.21507 -0.0053071 -0.6861 -0.2125 0.24388 -0.45197 0.072675 -0.12681 -0.36037 0.12668 0.38054 -0.43214 1.1571 0.51524 -0.50795 -0.18806 -0.16628 -2.035 -0.023095 -0.043807 -0.33862 0.22944 3.4413 0.58809 0.15753 -1.7452 -0.81105 0.04273 0.19056 -0.28506 0.13358 -0.094805 -0.17632 0.076961 -0.19293 0.71098 -0.19331 0.019016 -1.2177 0.3962 0.52807 0.33352
early 0.35948 -0.16637 -0.30288 -0.55095 -0.49135 0.048866 -1.6003 0.19451 -0.80288 0.157 0.14782 -0.45813 -0.30852 0.03055 0.38079 0.16768 -0.74477 -0.88759 -1.1255 0.28654 0.37413 -0.053585 0.019005 -0.30474 0.30998 -1.3004 -0.56797 -0.50119 0.031763 0.58832 3.692 -0.56015 -0.043986 -0.4513 0.49902 -0.13698 0.033691 0.40458 -0.16825 0.033614 -0.66019 -0.070503 -0.39145 -0.11031 0.27384 0.25301 0.3471 -0.31089 -0.32557 -0.51921

To import these mappings, we create a words index and specify dense_vector as the type of the vector field in the index mapping. We then iterate over the lines of the GloVe file and insert the words and vectors into the index in bulk batches. Afterwards, the word "early", for example, can be retrieved with a GET request to /words/_doc/early.
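The import steps described above can be sketched as follows. The sketch only builds the request bodies (index creation and bulk NDJSON) with the standard library and leaves sending them to whichever HTTP client is in use. The keyword type for the word field, the use of the word as document id, and all function names are assumptions made for illustration; the article only states that the vector field is a dense_vector. The sample lines are the first three values of the excerpt above, shortened for brevity.

```python
import json

INDEX_NAME = "words"  # index name used in the article


def build_index_body(dims):
    """Body for creating the index, e.g. via PUT /words."""
    return {
        "mappings": {
            "properties": {
                "word": {"type": "keyword"},  # assumption: not stated in the article
                "vector": {"type": "dense_vector", "dims": dims},
            }
        }
    }


def parse_glove_line(line):
    """Split 'word v1 v2 ...' into the word and a list of floats."""
    parts = line.split()
    return parts[0], [float(value) for value in parts[1:]]


def in_batches(lines, batch_size):
    """Yield lists of at most batch_size non-empty lines."""
    batch = []
    for line in lines:
        if line.strip():
            batch.append(line)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def bulk_body(batch, index_name=INDEX_NAME):
    """NDJSON body for POST /_bulk: one action line plus one source line per word."""
    rows = []
    for line in batch:
        word, vector = parse_glove_line(line)
        rows.append(json.dumps({"index": {"_index": index_name, "_id": word}}))
        rows.append(json.dumps({"word": word, "vector": vector}))
    return "\n".join(rows) + "\n"


# Shortened sample lines; the real file has 50 values per word.
sample_lines = [
    "public 0.034236 0.50591 -0.19488",
    "early 0.35948 -0.16637 -0.30288",
]

index_body = build_index_body(dims=50)
bodies = [bulk_body(batch) for batch in in_batches(sample_lines, batch_size=2)]
print(json.dumps(index_body))
print(bodies[0], end="")
```

With the real file, `in_batches` would be fed the open file object and `dims` would match the 50 values per line; the batch size is a tuning choice and is not specified in the article.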
The response contains the word together with its stored vector:

{
  "_index": "words",
  "_type": "_doc",
  "_id": "early",
  "_version": 15,
  "_seq_no": 503319,
  "_primary_term": 2,
  "found": true,
  "_source": {
    "word": "early",
    "vector": [0.35948,-0.16637,-0.30288,-0.55095,-0.49135,0.048866,-1.6003,0.19451,-0.80288,0.157,0.14782,-0.45813,-0.30852,0.03055,0.38079,0.16768,-0.74477,-0.88759,-1.1255,0.28654,0.37413,-0.053585,0.019005,-0.30474,0.30998,-1.3004,-0.56797,-0.50119,0.031763,0.58832,3.692,-0.56015,-0.043986,-0.4513,0.49902,-0.13698,0.033691,0.40458,-0.16825,0.033614,-0.66019,-0.070503,-0.39145,-0.11031,0.27384,0.25301,0.3471,-0.31089,-0.32557,-0.51921]
  }
}

Cosine similarity for scoring
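For GloVe word embeddings, the cosine similarity between two word vectors can reveal the semantic similarity of the corresponding words. Starting with Elasticsearch 7.2, cosine similarity is available as a predefined function that can be used for document scoring.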
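As a reminder, the cosine similarity of two vectors a and b is their scalar product divided by the product of their lengths, so it depends only on the angle between them and always lies between -1 and 1. A minimal plain-Python sketch (independent of Elasticsearch, with made-up two-dimensional vectors) looks like this:

```python
import math


def cosine_similarity(a, b):
    """Scalar product of a and b divided by the product of their lengths."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors that show the extremes of the value range.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])       # 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # close to -1

print(same_direction, orthogonal, opposite)
```

Because only the direction matters, scaling a vector does not change the result; vectors pointing in opposite directions produce negative values.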
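To find words similar to the representation [0.1, 0.2, -0.3], we can send a POST request to /words/_search in which the predefined cosineSimilarity function takes the query vector and the vector value of the stored document as arguments and computes the document score. Note that because scores must not be negative, we need to add 1.0 to the result of the function.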
{
  "size": 1,
  "query": {
    "script_score": {
      "query": {"match_all": {}},
      "script": {
        "source": "cosineSimilarity(params.queryVector, doc['vector'])+1.0",
        "params": {"queryVector": [0.1, 0.2, -0.3]}
      }
    }
  }
}

As a result, we get the word "rites" with a score of about 1.5:
{
  "took": 103,
  "timed_out": false,
  "hits": {
    "total": {"value": 10000, "relation": "gte"},
    "max_score": 1.5047596,
    "hits": [
      {
        "_index": "words",
        "_type": "_doc",
        "_id": "rites",
        "_score": 1.5047596,
        "_source": {
          "word": "rites",
          "vector": [0.82594,0.55036,-2.5851,-0.52783,0.96654,0.55221,0.28173,0.15945,0.33305,0.41263,0.29314,0.1444,1.1311,0.0053411,0.35198,0.094642,-0.89222,-0.85773,0.044799,0.59454,0.26779,0.044897,-0.10393,-0.21768,-0.049958,-0.018437,-0.63575,-0.72981,-0.23786,-0.30149,1.2795,0.22047,-0.55406,0.0038758,-0.055598,0.41379,0.85904,-0.62734,-0.17855,1.7538,-0.78475,-0.52078,-0.88765,1.3897,0.58374,0.16477,-0.15503,-0.11889,-0.66121,-1.108]
        }
      }
    ]
  }
}

To make exploring word embeddings easier, we built a wrapper in Kotlin using Spring Shell. It accepts a word as input, retrieves the corresponding vector from the index, and then runs a script_score query to display the most relevant results:
shell:>similar --to "cat"
{"word":"dog","score":1.9218005}
{"word":"rabbit","score":1.8487821}
{"word":"monkey","score":1.8041081}
{"word":"rat","score":1.7891964}
{"word":"cats","score":1.786527}
{"word":"snake","score":1.779891}
{"word":"dogs","score":1.7795815}
{"word":"pet","score":1.7792249}
{"word":"mouse","score":1.7731668}
{"word":"bite","score":1.77288}
{"word":"shark","score":1.7655175}
{"word":"puppy","score":1.76256}
{"word":"monster","score":1.7619764}
{"word":"spider","score":1.7521701}
{"word":"beast","score":1.7520056}
{"word":"crocodile","score":1.7498653}
{"word":"baby","score":1.7463862}
{"word":"pig","score":1.7445586}
{"word":"frog","score":1.7426511}
{"word":"bug","score":1.7365949}

The project code can be found at github.com/schocco/es-vectorsearch.
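The two-step flow of such a wrapper (look up the vector of a word, then run a script_score query with it) might look roughly like the Python sketch below. It is not the project's Kotlin code. The functions `get_document` and `search` are hypothetical callables standing in for GET /words/_doc/{word} and POST /words/_search, and the in-memory stand-ins with made-up two-dimensional vectors exist only so that the sketch runs without a cluster. Dropping the query word from the results is also an assumption, made because the output above does not list "cat" itself.

```python
import math


def build_similarity_query(query_vector, size):
    """script_score query body in the shape used earlier in the article."""
    return {
        "size": size,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": "cosineSimilarity(params.queryVector, doc['vector'])+1.0",
                    "params": {"queryVector": query_vector},
                },
            }
        },
    }


def similar(word, get_document, search, size=3):
    """Return up to `size` neighbours of `word` as {'word', 'score'} dicts."""
    document = get_document(word)
    if not document.get("found"):
        return []
    vector = document["_source"]["vector"]
    # Ask for one extra hit because the word itself is expected to rank first.
    response = search(build_similarity_query(vector, size + 1))
    results = [
        {"word": hit["_source"]["word"], "score": hit["_score"]}
        for hit in response["hits"]["hits"]
    ]
    return [result for result in results if result["word"] != word][:size]


# In-memory stand-ins with made-up vectors, only to make the sketch runnable.
TOY_INDEX = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]}


def fake_get_document(word):
    if word not in TOY_INDEX:
        return {"found": False}
    return {"found": True, "_source": {"word": word, "vector": TOY_INDEX[word]}}


def fake_search(body):
    query = body["query"]["script_score"]["script"]["params"]["queryVector"]

    def score(vector):
        dot = sum(a * b for a, b in zip(query, vector))
        return dot / (math.hypot(*query) * math.hypot(*vector)) + 1.0

    hits = [
        {"_source": {"word": word, "vector": vector}, "_score": score(vector)}
        for word, vector in TOY_INDEX.items()
    ]
    hits.sort(key=lambda hit: hit["_score"], reverse=True)
    return {"hits": {"hits": hits[: body["size"]]}}


results = similar("cat", fake_get_document, fake_search, size=2)
for result in results:
    print(result)
```

Passing the two callables in keeps the lookup logic separate from the transport, so the same function could be wired to a real client or, as here, to a test double.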
Limitations
The dense vector data type is currently still experimental. Stored vectors should not exceed 1024 dimensions (the limit for Elasticsearch < 7.2 is 500).
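Computing document scores directly with cosine similarity is relatively expensive and should be combined with a filter to limit the number of documents for which a score has to be calculated. For large-scale vector similarity search you may want to look into more specialized projects, such as the FAISS library for "billion-scale similarity search with GPUs".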
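One way to combine the script with a filter is to replace the match_all inner query with a bool query that has a filter clause, so that the script only runs on documents that pass the filter. The sketch below builds such a request body; the prefix filter on the word field is a hypothetical example chosen for illustration, and the function name is an assumption.

```python
import json


def filtered_similarity_query(query_vector, filter_clause, size=10):
    """script_score body whose inner query restricts the scored documents."""
    return {
        "size": size,
        "query": {
            "script_score": {
                # Only documents matching the filter are passed to the script.
                "query": {"bool": {"filter": [filter_clause]}},
                "script": {
                    "source": "cosineSimilarity(params.queryVector, doc['vector'])+1.0",
                    "params": {"queryVector": query_vector},
                },
            }
        },
    }


# Hypothetical filter: only score words that start with "ca".
body = filtered_similarity_query(
    [0.1, 0.2, -0.3],
    {"prefix": {"word": "ca"}},
    size=5,
)
print(json.dumps(body, indent=2))
```

How much this helps depends on how selective the filter is; a filter that matches most of the index leaves the cost largely unchanged.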
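Searching by abstract properties
We used word embeddings to demonstrate how Elasticsearch can perform similarity search in vector space, but the same concept should apply to other domains as well. We could map pictures to a vector representation that encodes the picture's style, in order to search for pictures painted in a similar style:
Vector search via cosine similarity:
If we can map abstract properties such as taste or style into a vector space, we can also search for new recipes, clothing, or furniture that share that abstract property with a given query item.
Thanks to recent advances in machine learning, many easy-to-use open-source libraries support the creation of domain-specific embeddings, and the script score query support for vector fields in Elasticsearch brings us one step closer to easily implementing domain-specific similarity search.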