Elasticsearch Custom Analyzers
2. Custom Analyzers
For example, the built-in simple analyzer can be rebuilt as a custom analyzer:

curl -X PUT "192.168.0.120:9200/simple_example" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_simple": {
          "tokenizer": "lowercase",
          "filter": []
        }
      }
    }
  }
}'

2.1. Whitespace Analyzer
The whitespace analyzer breaks text into terms whenever it encounters a whitespace character.
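This behavior can be approximated locally with a simple sketch (illustrative only, not how Elasticsearch implements it internally):

```python
# Whitespace-analyzer approximation: split on whitespace only;
# case and punctuation are left untouched.
text = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
tokens = text.split()
print(tokens)
# ['The', '2', 'QUICK', 'Brown-Foxes', 'jumped', 'over', 'the', 'lazy', "dog's", 'bone.']
```

Note that "Brown-Foxes" and "dog's" stay intact, and "bone." keeps its trailing period, because only whitespace acts as a separator.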
Example:
curl -X POST "192.168.0.120:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}'

The output is as follows:
[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]

2.2. Stop Analyzer
The stop analyzer is much like the simple analyzer, except that it adds support for removing stop words. It defaults to the _english_ stop words list.

(In other words: given the sentence "this is a apple", and assuming "this" and "is" are stop words, the simple analyzer outputs [ this, is, a, apple ], while the stop analyzer outputs [ a, apple ]. That is the difference between the two: the stop analyzer never emits a stop word as a term.)

(A stop word is a very common word, such as "the" or "is", that is filtered out during analysis because it carries little search value.)
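The simple-vs-stop comparison above can be sketched locally (an illustrative approximation; the real _english_ stop word list is larger than the two words assumed here):

```python
import re

STOPWORDS = {"this", "is"}  # illustrative subset; Elasticsearch defaults to the _english_ list

def simple_analyze(text):
    # simple analyzer: lowercase, then split whenever a non-letter is found
    return [t for t in re.split(r"[^a-zA-Z]+", text.lower()) if t]

def stop_analyze(text):
    # stop analyzer: same as simple, then drop the stop words
    return [t for t in simple_analyze(text) if t not in STOPWORDS]

print(simple_analyze("this is a apple"))  # ['this', 'is', 'a', 'apple']
print(stop_analyze("this is a apple"))    # ['a', 'apple']
```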
Example:
curl -X POST "192.168.0.120:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}'

Output:
[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]

2.3. Configuration
The stop analyzer accepts the following parameters:
- stopwords : a pre-defined stop words list (such as _english_) or an array containing the stop words. Defaults to _english_.
- stopwords_path : the path to a file containing stop words. The path is relative to the Elasticsearch config directory.
2.3.1. Example configuration
curl -X PUT "192.168.0.120:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["the", "over"]
        }
      }
    }
  }
}'

The configuration above defines a stop analyzer with two stop words: "the" and "over".

curl -X POST "192.168.0.120:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_stop_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}'

With this configuration, the request produces:
[ quick, brown, foxes, jumped, lazy, dog, s, bone ]

2.4. Pattern Analyzer
The pattern analyzer uses a Java regular expression to split text into terms. The default pattern is \W+ (any non-word characters).
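The default behavior (split on \W+, lowercase the terms) can be sketched locally. This uses Python's re module rather than Java regular expressions, which is close enough for the default pattern:

```python
import re

text = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
# pattern analyzer defaults: lowercase the text, split on \W+ (non-word characters)
tokens = [t for t in re.split(r"\W+", text.lower()) if t]
print(tokens)
# ['the', '2', 'quick', 'brown', 'foxes', 'jumped', 'over', 'the', 'lazy', 'dog', 's', 'bone']
```

Notice how "dog's" breaks into "dog" and "s": the apostrophe is a non-word character, so it acts as a split point.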
Example:
curl -X POST "192.168.0.120:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}'

Since the default pattern splits on non-word characters, the output is:
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

2.4.1. Configuration
The pattern analyzer accepts the following parameters:
- pattern : a Java regular expression. Defaults to \W+.
- flags : Java regular expression flags, such as CASE_INSENSITIVE or COMMENTS.
- lowercase : whether terms should be lowercased. Defaults to true.
- stopwords : a pre-defined stop words list, or an array containing stop words. Defaults to _none_.
- stopwords_path : the path to a file containing stop words.
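The interplay of these parameters can be sketched as a small local function (an approximation using Python regular expressions; the function name and defaults mirror the parameters above, not a real API):

```python
import re

def pattern_analyze(text, pattern=r"\W+", lowercase=True, stopwords=()):
    # mirrors the pattern analyzer parameters described above:
    # lowercase first (if enabled), split on the pattern, then drop stop words
    if lowercase:
        text = text.lower()
    tokens = [t for t in re.split(pattern, text) if t]
    return [t for t in tokens if t not in set(stopwords)]

# split on non-word characters or underscore, as in the email example that follows
print(pattern_analyze("John_Smith@foo-bar.com", pattern=r"\W|_"))
# ['john', 'smith', 'foo', 'bar', 'com']
```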
2.4.2. Example configuration
curl -X PUT "192.168.0.120:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer": {
          "type": "pattern",
          "pattern": "\\W|_",
          "lowercase": true
        }
      }
    }
  }
}'

The example above configures the analyzer to split on non-word characters or underscores, and to lowercase the resulting terms.
curl -X POST "192.168.0.120:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_email_analyzer",
  "text": "John_Smith@foo-bar.com"
}'

With this configuration, the example produces:
[ john, smith, foo, bar, com ]

2.5. Language Analyzers
These analyzers support text analysis for specific languages. The built-in (pre-defined) languages are: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai.
2.6. Custom Analyzer
As mentioned earlier, an analyzer consists of three parts:
- zero or more character filters
- a tokenizer
- zero or more token filters
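This three-stage pipeline can be sketched locally. The helpers below are naive stand-ins for the html_strip char filter, the standard tokenizer, and the lowercase and asciifolding token filters used in the configuration that follows (illustrative approximations, not the real implementations):

```python
import re
import unicodedata

def html_strip(text):
    # char filter: naive approximation of html_strip, removes HTML tags
    return re.sub(r"<[^>]*>", "", text)

def standard_tokenize(text):
    # rough stand-in for the standard tokenizer: runs of word
    # characters, keeping apostrophes inside tokens
    return re.findall(r"\w+(?:'\w+)?", text)

def lowercase(tokens):
    # token filter: lowercase each term
    return [t.lower() for t in tokens]

def asciifolding(tokens):
    # token filter: strip diacritics, e.g. "déjà" -> "deja"
    return [unicodedata.normalize("NFKD", t).encode("ascii", "ignore").decode() for t in tokens]

def my_custom_analyzer(text):
    # char filters -> tokenizer -> token filters, in that order
    return asciifolding(lowercase(standard_tokenize(html_strip(text))))

print(my_custom_analyzer("<p>Is this <b>déjà vu</b>?</p>"))
# ['is', 'this', 'deja', 'vu']
```

The ordering matters: char filters see the raw character stream before tokenization, while token filters operate on the terms the tokenizer produced.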
2.6.1. Example configuration
curl -X PUT "192.168.0.120:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": ["html_strip"],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  }
}'

3. Tokenizer
3.1. Standard Tokenizer
curl -X POST "192.168.0.120:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}'
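A rough local approximation of what the standard tokenizer does with this text (a sketch; the real tokenizer follows the Unicode text segmentation rules, which this regex only imitates):

```python
import re

text = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
# standard-tokenizer approximation: runs of word characters, keeping
# apostrophes inside tokens; note the tokenizer alone does NOT lowercase
tokens = re.findall(r"\w+(?:'\w+)?", text)
print(tokens)
# ['The', '2', 'QUICK', 'Brown', 'Foxes', 'jumped', 'over', 'the', 'lazy', "dog's", 'bone']
```

Compared with the whitespace analyzer earlier, "Brown-Foxes" is split at the hyphen and the trailing period of "bone." is dropped, but case is preserved because lowercasing is a token filter's job, not the tokenizer's.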