Elasticsearch: Search-as-you-type field type
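The search_as_you_type field type is a text-like field that is optimized to provide out-of-the-box support for queries that serve an as-you-type completion use case. It creates a series of subfields that are analyzed to index terms that can be efficiently matched by a query that partially matches the entire indexed text value. Both prefix completion (the match starts at the beginning of the input) and infix completion (the match can start at any position within the input) are supported.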
When a field of this type is added to a mapping:

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "my_field": {
        "type": "search_as_you_type"
      }
    }
  }
}

the following fields are created:
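| my_field | Analyzed as configured in the mapping. If no analyzer is configured, the index's default analyzer is used. |
| my_field._2gram | Wraps the analyzer of my_field with a shingle token filter of shingle size 2. |
| my_field._3gram | Wraps the analyzer of my_field with a shingle token filter of shingle size 3. |
| my_field._index_prefix | Wraps the analyzer of my_field._3gram with an edge ngram token filter. |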
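If edge ngrams and shingles are unfamiliar to you, see my earlier article "Elasticsearch: Ngrams, edge ngrams, and shingles".
The size of the shingles in the subfields can be configured with the max_shingle_size mapping parameter. The default is 3, and valid values are the integers 2 through 4, inclusive. A shingle subfield is created for every shingle size from 2 up to and including max_shingle_size. When building its own analyzer, the my_field._index_prefix subfield always uses the analyzer of the shingle subfield with the largest size, that is, max_shingle_size.
Increasing max_shingle_size improves matching for queries with more consecutive terms, at the cost of a larger index. The default max_shingle_size is usually sufficient.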
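To make the naming rule concrete, here is a minimal Python sketch that lists the fields produced for a given max_shingle_size. It is only a model of the rule described above, not Elasticsearch code; the function name subfield_names is made up for this illustration.

```python
from __future__ import annotations


def subfield_names(field: str, max_shingle_size: int = 3) -> list[str]:
    """List the fields created for one search_as_you_type mapping.

    Illustrative model only: the helper name is hypothetical and this
    is not how Elasticsearch itself builds the mapping.
    """
    # The article states that valid values are the integers 2..4.
    if not 2 <= max_shingle_size <= 4:
        raise ValueError("max_shingle_size must be between 2 and 4 inclusive")
    # One shingle subfield per size from 2 up to max_shingle_size.
    shingle_fields = [f"{field}._{n}gram" for n in range(2, max_shingle_size + 1)]
    return [field, *shingle_fields, f"{field}._index_prefix"]


print(subfield_names("my_field"))
print(subfield_names("my_field", max_shingle_size=4))
```

With max_shingle_size set to 4, one more subfield (my_field._4gram) appears, which is where the extra index size comes from.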
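When a document is indexed with a value for the root field my_field, the same input text is automatically indexed into each of these fields, each with its own analysis chain.
That description is a bit of a mouthful and not easy to follow, so let's walk through a simple example: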
PUT jobs
{
  "mappings": {
    "properties": {
      "title": {
        "type": "search_as_you_type"
      }
    }
  }
}

Above, we created an index named jobs. In this index we defined a field named title whose type is search_as_you_type.
Let's use the _analyze API to see how the field is analyzed:

POST jobs/_analyze
{
  "field": "title",
  "text": ["Senior Software Developer"]
}

The result is:
{
  "tokens" : [
    {"token" : "senior", "start_offset" : 0, "end_offset" : 6, "type" : "<ALPHANUM>", "position" : 0},
    {"token" : "software", "start_offset" : 7, "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 1},
    {"token" : "developer", "start_offset" : 16, "end_offset" : 25, "type" : "<ALPHANUM>", "position" : 2}
  ]
}

This looks just like what we have seen before. But as mentioned above, the following additional fields are generated as well:
- title._2gram
- title._3gram
- title._index_prefix
We can test them as follows:

POST jobs/_analyze
{
  "field": "title._2gram",
  "text": ["Senior Software Developer"]
}

The result is:
{
  "tokens" : [
    {"token" : "senior software", "start_offset" : 0, "end_offset" : 15, "type" : "shingle", "position" : 0},
    {"token" : "software developer", "start_offset" : 7, "end_offset" : 25, "type" : "shingle", "position" : 1}
  ]
}

In other words, when we search for senior software or software developer, this document will be found.
In the same way, let's analyze title._3gram:

POST jobs/_analyze
{
  "field": "title._3gram",
  "text": ["Senior Software Developer"]
}

The analysis result is:
{
  "tokens" : [
    {"token" : "senior software developer", "start_offset" : 0, "end_offset" : 25, "type" : "shingle", "position" : 0}
  ]
}

In other words, when we type senior software developer in full, this document will be found.
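The two shingle results above can be reproduced with a few lines of Python. This is a simplified sketch, assuming that lowercasing and splitting on whitespace is a good enough stand-in for the default analyzer on this particular input; the helper name shingles is made up here and is not an Elasticsearch API.

```python
from __future__ import annotations


def shingles(text: str, size: int) -> list[str]:
    """Join every run of `size` adjacent words into one term.

    Assumption: lowercase + whitespace split approximates the default
    analyzer for plain ASCII titles like the one used in this article.
    """
    words = text.lower().split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


title = "Senior Software Developer"
print(shingles(title, 2))  # ['senior software', 'software developer']
print(shingles(title, 3))  # ['senior software developer']
```

A title with fewer words than the shingle size produces no terms at all for that subfield in this model.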
Next, let's analyze title._index_prefix:

POST jobs/_analyze
{
  "field": "title._index_prefix",
  "text": ["Senior Software Developer"]
}

The response is very long:
{
  "tokens" : [
    {"token" : "s", "start_offset" : 0, "end_offset" : 25, "type" : "shingle", "position" : 0},
    {"token" : "se", "start_offset" : 0, "end_offset" : 25, "type" : "shingle", "position" : 0},
    {"token" : "sen", "start_offset" : 0, "end_offset" : 25, "type" : "shingle", "position" : 0},
    {"token" : "seni", "start_offset" : 0, "end_offset" : 25, "type" : "shingle", "position" : 0},
    {"token" : "senio", "start_offset" : 0, "end_offset" : 25, "type" : "shingle", "position" : 0},
    {"token" : "senior", "start_offset" : 0, "end_offset" : 25, "type" : "shingle", "position" : 0},
    {"token" : "senior ", "start_offset" : 0, "end_offset" : 25, "type" : "shingle", "position" : 0},
    {"token" : "senior s", "start_offset" : 0, "end_offset" : 25, "type" : "shingle", "position" : 0},
    {"token" : "senior so", "start_offset" : 0, "end_offset" : 25, "type" : "shingle", "position" : 0},
    {"token" : "senior sof", "start_offset" : 0, "end_offset" : 25, "type" : "shingle", "position" : 0},
    {"token" : "senior soft", "start_offset" : 0, "end_offset" : 25, "type" : "shingle", "position" : 0},
    {"token" : "senior softw", "start_offset" : 0, "end_offset" : 25, "type" : "shingle", "position" : 0},
    {"token" : "senior softwa", "start_offset" : 0, "end_offset" : 25, "type" : "shingle", "position" : 0},
    {"token" : "senior softwar", "start_offset" : 0, "end_offset" : 25, "type" : "shingle", "position" : 0},
    {"token" : "senior software", "start_offset" : 0, "end_offset" : 25, "type" : "shingle", "position" : 0},
    {"token" : "senior software ", "start_offset" : 0, "end_offset" : 25, "type" : "shingle", "position" : 0},
    {"token" : "senior software d", "start_offset" : 0, "end_offset" : 25, "type" : "shingle", "position" : 0},
    {"token" : "senior software de", "start_offset" : 0, "end_offset" : 25, "type" : "shingle", "position" : 0},
    {"token" : "senior software dev", "start_offset" : 0, "end_offset" : 25, "type" : "shingle", "position" : 0},
    {"token" : "senior software deve", "start_offset" : 0, "end_offset" : 25, "type" : "shingle", "position" : 0},
    {"token" : "senior software developer", "start_offset" : 0, "end_offset" : 25, "type" : "shingle", "position" : 0},
    {"token" : "s", "start_offset" : 7, "end_offset" : 25, "type" : "shingle", "position" : 1},
    {"token" : "so", "start_offset" : 7, "end_offset" : 25, "type" : "shingle", "position" : 1},
    {"token" : "sof", "start_offset" : 7, "end_offset" : 25, "type" : "shingle", "position" : 1},
    {"token" : "soft", "start_offset" : 7, "end_offset" : 25, "type" : "shingle", "position" : 1},
    {"token" : "softw", "start_offset" : 7, "end_offset" : 25, "type" : "shingle", "position" : 1},
    {"token" : "softwa", "start_offset" : 7, "end_offset" : 25, "type" : "shingle", "position" : 1},
    {"token" : "softwar", "start_offset" : 7, "end_offset" : 25, "type" : "shingle", "position" : 1},
    {"token" : "software", "start_offset" : 7, "end_offset" : 25, "type" : "shingle", "position" : 1},
    {"token" : "software ", "start_offset" : 7, "end_offset" : 25, "type" : "shingle", "position" : 1},
    {"token" : "software d", "start_offset" : 7, "end_offset" : 25, "type" : "shingle", "position" : 1},
    {"token" : "software de", "start_offset" : 7, "end_offset" : 25, "type" : "shingle", "position" : 1},
    {"token" : "software dev", "start_offset" : 7, "end_offset" : 25, "type" : "shingle", "position" : 1},
    {"token" : "software deve", "start_offset" : 7, "end_offset" : 25, "type" : "shingle", "position" : 1},
    {"token" : "software devel", "start_offset" : 7, "end_offset" : 25, "type" : "shingle", "position" : 1},
    {"token" : "software develo", "start_offset" : 7, "end_offset" : 25, "type" : "shingle", "position" : 1},
    {"token" : "software develop", "start_offset" : 7, "end_offset" : 25, "type" : "shingle", "position" : 1},
    {"token" : "software develope", "start_offset" : 7, "end_offset" : 25, "type" : "shingle", "position" : 1},
    {"token" : "software developer", "start_offset" : 7, "end_offset" : 25, "type" : "shingle", "position" : 1},
    {"token" : "software developer ", "start_offset" : 7, "end_offset" : 25, "type" : "shingle", "position" : 1},
    {"token" : "d", "start_offset" : 16, "end_offset" : 25, "type" : "shingle", "position" : 2},
    {"token" : "de", "start_offset" : 16, "end_offset" : 25, "type" : "shingle", "position" : 2},
    {"token" : "dev", "start_offset" : 16, "end_offset" : 25, "type" : "shingle", "position" : 2},
    {"token" : "deve", "start_offset" : 16, "end_offset" : 25, "type" : "shingle", "position" : 2},
    {"token" : "devel", "start_offset" : 16, "end_offset" : 25, "type" : "shingle", "position" : 2},
    {"token" : "develo", "start_offset" : 16, "end_offset" : 25, "type" : "shingle", "position" : 2},
    {"token" : "develop", "start_offset" : 16, "end_offset" : 25, "type" : "shingle", "position" : 2},
    {"token" : "develope", "start_offset" : 16, "end_offset" : 25, "type" : "shingle", "position" : 2},
    {"token" : "developer", "start_offset" : 16, "end_offset" : 25, "type" : "shingle", "position" : 2},
    {"token" : "developer ", "start_offset" : 16, "end_offset" : 25, "type" : "shingle", "position" : 2},
    {"token" : "developer ", "start_offset" : 16, "end_offset" : 25, "type" : "shingle", "position" : 2}
  ]
}

First of all, note that the edge ngram token filter is applied to the output of title._3gram (senior software developer). From the result above we can see that when we type any of the following:
s, se, sen, seni, senio, senior, senior , senior s, senior so, senior sof, senior soft, senior softw, senior softwa, senior softwar, senior software, senior software , senior software d, senior software de, senior software dev, senior software deve, senior software devel, senior software develo, senior software develop, senior software develope, senior software developer
s, so, sof, soft, softw, softwa, softwar, software, software , software d, software de, software dev, software deve, software devel, software develo, software develop, software develope, software developer
d, de, dev, deve, devel, develo, develop, develope, developer
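our document will be matched. The advantage is that we can match the documents we are looking for as broadly as possible; the drawback is that it greatly increases the storage space required.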
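The prefix expansion, and the storage cost that comes with it, can be sketched in Python as follows. This is a simplified model rather than the real filter: the helper name edge_ngrams is made up, and the limits min_gram=1 and max_gram=20 are assumptions inferred from the output above, where the prefixes of the full shingle stop at "senior software deve" (20 characters). The trailing-space filler tokens visible in the output are ignored here.

```python
from __future__ import annotations


def edge_ngrams(token: str, min_gram: int = 1, max_gram: int = 20) -> list[str]:
    """Return every leading prefix of `token` between min_gram and max_gram chars.

    Assumption: the 1..20 limits are inferred from the _analyze output in
    this article, not taken from a configuration we set ourselves.
    """
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


# The three starting points seen in the _index_prefix output above.
starts = ["senior software developer", "software developer", "developer"]
for start in starts:
    prefixes = edge_ngrams(start)
    print(f"{start!r}: {len(prefixes)} prefixes, longest is {prefixes[-1]!r}")

total = sum(len(edge_ngrams(start)) for start in starts)
print("prefix terms in this model:", total)
```

Three words in the source text turn into dozens of indexed prefix terms, which is the storage trade-off described above.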
Next, let's demonstrate with a concrete example:

PUT jobs/_bulk?refresh
{ "index": {} }
{ "title": "Software Developer" }
{ "index": {} }
{ "title": "Senior Software Developer" }
{ "index": {} }
{ "title": "Principal Software Developer" }
{ "index": {} }
{ "title": "Developer Advocate" }
{ "index": {} }
{ "title": "Developer 🇨🇳" }

Above, we created 5 documents. Next we will query them with match_phrase_prefix.
GET jobs/_search
{
  "query": {
    "match_phrase_prefix": {
      "title": "developer"
    }
  }
}

All 5 documents are returned, because every one of them contains the token developer. We can even run the following search:

GET jobs/_search
{
  "query": {
    "match_phrase_prefix": {
      "title": "dev"
    }
  }
}

It also returns all 5 documents.
Now let's run the following search:

GET jobs/_search
{
  "query": {
    "match_phrase_prefix": {
      "title": "software dev"
    }
  }
}

This time only 3 documents are returned:

"hits" : [
  {"_index" : "jobs", "_type" : "_doc", "_id" : "MXYGyXgBI6xucLpoKQ9y", "_score" : 0.67181337, "_source" : {"title" : "Software Developer"}},
  {"_index" : "jobs", "_type" : "_doc", "_id" : "MnYGyXgBI6xucLpoKQ9y", "_score" : 0.5679247, "_source" : {"title" : "Senior Software Developer"}},
  {"_index" : "jobs", "_type" : "_doc", "_id" : "M3YGyXgBI6xucLpoKQ9y", "_score" : 0.5679247, "_source" : {"title" : "Principal Software Developer"}}
]

If we run the following search:
GET jobs/_search
{
  "query": {
    "match_phrase_prefix": {
      "title": "senior dev"
    }
  }
}

we get no results at all. This is because the slop of match_phrase_prefix defaults to 0, which means senior must be immediately followed by a word that starts with dev for the document to match. We can of course change this:

GET jobs/_search
{
  "query": {
    "match_phrase_prefix": {
      "title": {
        "query": "senior dev",
        "slop": 1
      }
    }
  }
}

Above, we set slop to 1, which allows one word to sit between senior and dev and still match. The result of this search is:

"hits" : [
  {"_index" : "jobs", "_type" : "_doc", "_id" : "wHYQyXgBI6xucLpoWA-v", "_score" : 0.8418889, "_source" : {"title" : "Senior Software Developer"}}
]

If we want to find every document that contains senior or a word starting with dev, we can search like this:
GET jobs/_search
{
  "query": {
    "multi_match": {
      "query": "senior dev",
      "type": "bool_prefix",
      "operator": "or",
      "fields": ["title", "title._2gram", "title._3gram"]
    }
  }
}

This search returns all 5 documents.
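To see why these queries return the documents they do, here is a rough Python model of the matching rules, run against the same five titles. It is a deliberate simplification and not Elasticsearch's implementation: tokens are produced by lowercasing and splitting on whitespace, slop is modelled only as the total number of extra words allowed between in-order terms (real slop also permits some reordering), and only the root title field is considered. The function names are made up for this sketch.

```python
from __future__ import annotations

TITLES = [
    "Software Developer",
    "Senior Software Developer",
    "Principal Software Developer",
    "Developer Advocate",
    "Developer 🇨🇳",
]


def tokens(text: str) -> list[str]:
    # Assumption: lowercase + whitespace split stands in for the analyzer.
    return text.lower().split()


def phrase_prefix_match(title: str, query: str, slop: int = 0) -> bool:
    """Terms must appear in order; only the last query term is a prefix."""
    doc, terms = tokens(title), tokens(query)
    if not terms:
        return False

    def term_matches(word: str, i: int) -> bool:
        if i == len(terms) - 1:
            return word.startswith(terms[i])
        return word == terms[i]

    def match_from(pos: int, i: int, budget: int) -> bool:
        # budget = how many more in-between words we may still skip.
        if i == len(terms):
            return True
        for gap in range(budget + 1):
            j = pos + gap
            if j < len(doc) and term_matches(doc[j], i) and match_from(j + 1, i + 1, budget - gap):
                return True
        return False

    return any(term_matches(doc[s], 0) and match_from(s + 1, 1, slop) for s in range(len(doc)))


def bool_prefix_match(title: str, query: str, operator: str = "or") -> bool:
    """Every term but the last is an exact term; the last is a prefix."""
    doc, terms = tokens(title), tokens(query)
    if not terms:
        return False
    *exact, last = terms
    checks = [t in doc for t in exact] + [any(w.startswith(last) for w in doc)]
    return all(checks) if operator == "and" else any(checks)


for query, slop in [("software dev", 0), ("senior dev", 0), ("senior dev", 1)]:
    found = [t for t in TITLES if phrase_prefix_match(t, query, slop)]
    print(f"match_phrase_prefix {query!r} slop={slop}: {found}")

found = [t for t in TITLES if bool_prefix_match(t, "senior dev", "or")]
print(f"bool_prefix 'senior dev' (or): {len(found)} documents")
```

The difference between the two query types is visible in the model: the phrase variant cares about word order and distance, while bool_prefix only asks whether the terms are present.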
We can also search for the emoji 🇨🇳:

GET jobs/_search
{
  "query": {
    "multi_match": {
      "query": "🇨🇳",
      "type": "bool_prefix",
      "operator": "and",
      "fields": ["title", "title._2gram", "title._3gram"]
    }
  }
}

The result is:

"hits" : [
  {"_index" : "jobs", "_type" : "_doc", "_id" : "w3YQyXgBI6xucLpoWA-v", "_score" : 1.0, "_source" : {"title" : """Developer 🇨🇳"""}}
]

Alongside search_as_you_type there is a similar data type, completion. The completion suggester is optimized for speed. The data structures it uses enable fast lookups, but they are costly to build and are stored in memory, and it cannot perform infix queries.