Python + Elasticsearch: Create, Delete, Update, and Query (CRUD)

GitHub: https://github.com/elastic/elasticsearch-py/blob/master/docs/index.rst
Official docs: https://elasticsearch-py.readthedocs.io/en/latest/index.html
Python-ElasticSearch, writing to, updating, deleting from and searching ES with Python:
https://blog.csdn.net/u013429010/article/details/81746179
Working with Elasticsearch from Python 3: https://blog.csdn.net/qq_41262248/article/details/100671930

Introduction to Elasticsearch

Looking up data means search, and search needs a search engine. Baidu and Google are enormous, complex search engines that index virtually every open page and dataset on the internet. For our own business data, nothing that heavy is required: if we want our own search capability with convenient storage and retrieval, Elasticsearch is the natural choice. It is a full-text search engine that can store, search and analyze massive amounts of data quickly.

Why Elasticsearch?

Elasticsearch is an open-source search engine built on top of Apache Lucene, a full-text search-engine library.

So what is Lucene? Lucene may be the most advanced, high-performance, full-featured search-engine library in existence, open source or proprietary, but it is just a library. To use it you have to write Java and call the Lucene packages directly, and you need some grounding in information retrieval to understand how it works; in short, it is not simple to use.

Elasticsearch was created to solve exactly that problem. It is also written in Java and uses Lucene internally for indexing and search, but its goal is to make full-text retrieval easy: it is effectively a layer over Lucene that exposes a simple, consistent RESTful API for storing and retrieving data.

Does that make Elasticsearch merely a simplified Lucene wrapper? Far from it. Elasticsearch is not just Lucene, and not just a full-text search engine. It can be accurately described as:

- a distributed real-time document store where every field is indexed and searchable
- a distributed real-time analytics search engine
- capable of scaling to hundreds of server nodes and handling petabytes of structured or unstructured data

In short, it is a formidable search engine; Wikipedia, Stack Overflow and GitHub all use it for search.

Elasticsearch Concepts

Node and Cluster

Elasticsearch is essentially a distributed database that allows multiple servers to work together, with each server able to run multiple Elasticsearch instances. A single Elasticsearch instance is called a node (Node); a group of nodes forms a cluster (Cluster).

Index

Elasticsearch indexes every field and, after processing, writes the result into an inverted index; searches are answered directly from that index. The top-level unit of data management in Elasticsearch is therefore called the Index, roughly the counterpart of a database in MySQL or MongoDB. Note that every index name must be lowercase.

Document

A single record inside an index is called a Document, and many documents together make up an index. A document is expressed as JSON, as in the example below. Documents within the same index are not required to share the same structure (schema), but keeping them uniform helps search efficiency.
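A representative document, written as a Python dict (the article's original snippet did not survive; the title, url and date field names are illustrative and mirror the news example used later in this article):

```python
doc = {
    "title": "Python 操作 Elasticsearch",  # a text field
    "url": "https://example.com/post/1",   # hypothetical URL
    "date": "2019-09-10"                   # a date field
}
```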
Type

Documents can be grouped. In a weather index, for instance, they could be grouped by city (Beijing, Shanghai) or by conditions (sunny, rainy). Such a group is called a Type: a virtual, logical grouping used to filter documents, similar to a table in MySQL or a collection in MongoDB. Different types should share a similar schema; for example, the id field cannot be a string in one type and a number in another. This is one way types differ from relational tables. Data of entirely different natures (say, products and logs) should be stored as two separate indices rather than as two types within one index (even though the latter is possible). Per the roadmap, Elasticsearch 6.x allows only one type per index, and 7.x removes types altogether.
Fields

A field is exactly that: every document is a JSON-like structure containing a number of fields, each with its own value, and together the fields make up the document, much like the columns of a MySQL table. In Elasticsearch, documents belong to a type, and types live inside an index, so a rough comparison with a traditional relational database looks like this:

Relational DB  ->  Databases  ->  Tables  ->  Rows       ->  Columns
Elasticsearch  ->  Indices    ->  Types   ->  Documents  ->  Fields

Those are the basic Elasticsearch concepts; the comparison with a relational database makes them easier to digest.

Python Elasticsearch Client

elasticsearch-py is the official low-level Elasticsearch client. Its goal is to provide a common foundation for all Elasticsearch-related code in Python; because of this it tries to be opinion-free and very extensible.

For a higher-level, more narrowly scoped client library, check out elasticsearch-dsl ( https://elasticsearch-dsl.readthedocs.io/en/latest/ ), a more Pythonic library that sits on top of elasticsearch-py.
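A small, hedged taste of elasticsearch-dsl's fluent style ("my_index" and the "title" field are made up for illustration; see its docs for the full API):

```python
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

client = Elasticsearch()

# build a match query on the title field, then execute it
s = Search(using=client, index="my_index").query("match", title="python")
response = s.execute()
for hit in response:
    print(hit.title)
```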

Compatibility

The library is compatible with all Elasticsearch versions since 0.90, but you must use a matching major version:
For Elasticsearch 6.0 and later, use the major version 6 (6.x.y) of the library.
For Elasticsearch 5.0 and later, use the major version 5 (5.x.y) of the library.
For Elasticsearch 2.0 and later, use the major version 2 (2.x.y) of the library, and so on.
The recommended way to set your requirements in your setup.py or requirements.txt is:
```
# Elasticsearch 6.x
elasticsearch>=6.0.0,<7.0.0

# Elasticsearch 5.x
elasticsearch>=5.0.0,<6.0.0

# Elasticsearch 2.x
elasticsearch>=2.0.0,<3.0.0
```

Installation

Install the elasticsearch package with pip:

```
pip install elasticsearch

# via the Douban mirror
pip install elasticsearch -i https://pypi.doubanio.com/simple/
```

Connecting to Elasticsearch from Python

There are several ways to connect to Elasticsearch from Python:
```python
from elasticsearch import Elasticsearch

# es = Elasticsearch()                    # connect to the local Elasticsearch by default
# es = Elasticsearch(['127.0.0.1:9200'])  # connect to local port 9200

# Connect to a cluster by passing the node addresses as a list
es = Elasticsearch(
    ["192.168.1.10", "192.168.1.11", "192.168.1.12"],
    sniff_on_start=True,            # sniff the cluster before doing anything
    sniff_on_connection_fail=True,  # refresh the node list when a node stops responding
    sniffer_timeout=60              # refresh interval in seconds
)
```

Connecting to a specific host:

```python
es = Elasticsearch(
    ['172.16.153.129:9200'],
    # credentials, if authentication is enabled
    # http_auth=('elastic', 'changeme')
)
```

Dynamic connection (sniffing):

```python
es = Elasticsearch(
    ['esnode1:port', 'esnode2:port'],
    sniff_on_start=True,            # sniff before any operation
    sniff_on_connection_fail=True,  # refresh and reconnect when a node stops responding
    sniffer_timeout=60              # refresh every 60 seconds
)
```

Different parameters for different nodes:

```python
es = Elasticsearch([
    {'host': 'localhost'},
    {'host': 'othernode', 'port': 443, 'url_prefix': 'es', 'use_ssl': True},
])
```

If SSL is in use:

```python
es = Elasticsearch(
    ['localhost:443', 'other_host:443'],
    use_ssl=True,                           # turn on SSL
    verify_certs=True,                      # verify SSL certificates (off by default)
    ca_certs='/path/to/CA_certs',           # path to the CA certificates
    client_cert='/path/to/clientcert.pem',  # PEM-format SSL client certificate
    client_key='/path/to/clientkey.pem'     # PEM-format SSL client key
)
```

Ignoring certain response status codes. Note that in elasticsearch-py, `ignore` is passed per request rather than to the constructor:

```python
es.indices.create(index='test-index', ignore=400)         # ignore HTTP 400
es.indices.delete(index='test-index', ignore=[400, 404])  # ignore several status codes at once
```

A quick example:
```python
from elasticsearch import Elasticsearch

es = Elasticsearch()  # connect to the local Elasticsearch by default
print(es.index(index='py2', doc_type='doc', id=1, body={'name': "張開", "age": 18}))
print(es.get(index='py2', doc_type='doc', id=1))
```

The first print creates the py2 index and inserts one document; the second print fetches that document. The output looks like this:

```python
{'_index': 'py2', '_type': 'doc', '_id': '1', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 0, '_primary_term': 1}
{'_index': 'py2', '_type': 'doc', '_id': '1', '_version': 1, 'found': True, '_source': {'name': '張開', 'age': 18}}
```

Example 1
```python
from datetime import datetime
from elasticsearch import Elasticsearch

es = Elasticsearch()

doc = {
    'author': 'kimchy',
    'text': 'Elasticsearch: cool. bonsai cool.',
    'timestamp': datetime.now(),
}
res = es.index(index="test-index", doc_type='tweet', id=1, body=doc)
print(res['result'])

res = es.get(index="test-index", doc_type='tweet', id=1)
print(res['_source'])

es.indices.refresh(index="test-index")

res = es.search(index="test-index", body={"query": {"match_all": {}}})
print("Got %d Hits:" % res['hits']['total']['value'])
for hit in res['hits']['hits']:
    print("%(timestamp)s %(author)s: %(text)s" % hit["_source"])
```

Example 2:
```python
# -*- coding: utf-8 -*-

from elasticsearch import Elasticsearch

# host defaults to localhost and port to 9200, but both can be specified
es = Elasticsearch()

# Insert data: index and doc_type are free-form names, id is up to you,
# body holds the document content
es.index(index="my_index", doc_type="test_type", id=0, body={"name": "python", "addr": "深圳"})
es.index(index="my_index", doc_type="test_type", id=1, body={"name": "python", "addr": "深圳"})

# create() also inserts data, but requires an explicit id to uniquely
# identify the document; index() generates an id when none is given
es.create(index="my_index", doc_type="test_type", id=1, body={"name": "python", "addr": "深圳"})

# delete the document with the given index, type and id
es.delete(index='indexName', doc_type='typeName', id=1)

# delete an entire index
es.indices.delete(index='news', ignore=[400, 404])

query = {'query': {'match_all': {}}}                # every document
query1 = {'query': {'match': {'sex': 'famale'}}}    # all documents whose sex is "famale"
query2 = {'query': {'range': {'age': {'lt': 11}}}}  # all documents with age < 11
query3 = {'query': {'term': {'name': 'jack'}}}      # all documents named "jack"

# delete every document matching the query (here: all of them)
es.delete_by_query(index="my_index", doc_type="test_type", body=query)

# get: fetch the document with the given index, type and id
es.get(index="my_index", doc_type="test_type", id=1)

# search: find all documents matching a query; there is no id argument,
# and index, doc_type and body may all be None
result = es.search(index="my_index", doc_type="test_type", body=query)
print(result['hits']['hits'][0])  # print the first matching document

# update: update the document with the given index, type and id.
# Key points: 1. the id is required  2. body={"doc": <fields>}, and the
# "doc" wrapper is mandatory
es.update(index="my_index", doc_type="test_type", id=1,
          body={"doc": {"name": "python1", "addr": "深圳1"}})
```

Getting cluster information
```python
# is the cluster up?
In [40]: es.ping()
Out[40]: True

# basic cluster information
In [39]: es.info()

# cluster health
In [41]: es.cluster.health()

# information about the node the client is currently connected to
In [43]: es.cluster.client.info()

# all indices currently in the cluster
In [55]: print(es.cat.indices())

# more detailed cluster statistics
es.cluster.stats()
```

The instance's cat attribute returns the same information in a more human-readable form:

```python
In [85]: es.cat.health()
Out[85]: '1510431262 04:14:22 sharkyun yellow 1 1 6 6 0 0 6 0 - 50.0%\n'

In [86]: es.cat.master()
Out[86]: 'VXgFbKAaTtGO5a1QAfdcLw 172.16.153.129 172.16.153.129 master\n'

In [87]: es.cat.nodes()
Out[87]: '172.16.153.129 27 49 0 0.02 0.01 0.00 mdi * master\n'

In [88]: es.cat.indices()

In [89]: es.cat.count()
Out[89]: '1510431323 04:15:23 301002\n'

In [90]: es.cat.plugins()
Out[90]: ''

In [91]: es.cat.templates()
Out[91]: 'logstash logstash-* 0 50001\nfilebeat filebeat-* 0 \n'
```

Tasks:

```python
es.tasks.get()
es.tasks.list()
```

Single-Document Operations

View the cluster state:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch([{"host": "localhost", "port": 9200}])
print(es.cluster.state())
```

View cluster health:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch([{"host": "localhost", "port": 9200}])
print(es.cluster.health())
```

Add a document:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch([{"host": "localhost", "port": 9200}])
print(es.cluster.state())
b = {"name": 'lu', 'sex': 'female', 'age': 10}
es.index(index='bank', doc_type='typeName', body=b, id=None)
print(es.cluster.state())
```

create() requires an explicit id to uniquely identify the document; index() does not, and generates an id automatically when none is given.

Internally, create() simply calls index(): it is a thin wrapper around it.

Delete a document:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch([{"host": "localhost", "port": 9200}])
es.delete(index='bank', doc_type='typeName', id='idValue')
```

Update a document:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch([{"host": "localhost", "port": 9200}])
es.update(index='bank', doc_type='typeName', id='idValue',
          body={'doc': {...}})  # put the fields to update inside the "doc" wrapper
```

The same effect can be achieved with index(): it inserts when the document does not exist and updates when it does.

Fetch a document:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch([{"host": "localhost", "port": 9200}])
find = es.get(index='bank', doc_type='typeName', id='idValue')
print(find)
```

Bulk Operations

Bulk-adding documents from a JSON file:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch([{"host": "localhost", "port": 9200}])
with open('./accounts.json', 'r', encoding='utf-8') as file:
    s = file.read()
    print(s)
    es.bulk(index='bank', doc_type='typeName', body=s)
```

Another bulk operation, indexing the path of every file under a folder:

```python
# -*- coding: utf-8 -*-

from elasticsearch import Elasticsearch
import os

# the folder to walk
path = r'C:\Users\Administrator\Desktop\files'
es = Elasticsearch()
doc = []
i = 1
# collect the absolute path of every file under the folder
for dirname, pathname, filenames in os.walk(path):
    for filename in filenames:
        doc.append({"index": {"_id": i}})
        doc.append({"filepath": os.path.join(dirname, filename)})
        i = i + 1
es.bulk(index="test", doc_type="text", body=doc)
```

Deleting documents by query:

```python
# all documents whose sex is "famale"
query = {'query': {'match': {'sex': 'famale'}}}
# all documents with age < 11
query = {'query': {'range': {'age': {'lt': 11}}}}

es.delete_by_query(index='indexName', body=query, doc_type='typeName')
```

Conditional updates
update_by_query updates every document that matches a query; the call shape mirrors delete_by_query and search above, as sketched below.
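A hedged sketch, reusing the index and fields from the earlier examples and a Painless script to set the new value:

```python
# set addr to "上海" on every document whose name is "python"
body = {
    "query": {"term": {"name": "python"}},
    "script": {"source": "ctx._source.addr = '上海'"}
}
es.update_by_query(index="my_index", doc_type="test_type", body=body)
```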
Searching documents by query:

```python
query = {'query': {'match_all': {}}}               # every document
query = {'query': {'term': {'name': 'jack'}}}      # all documents named "jack"
query = {'query': {'range': {'age': {'gt': 11}}}}  # all documents with age > 11

allDoc = es.search(index='indexName', doc_type='typeName', body=query)
print(allDoc['hits']['hits'][0])  # print the first matching document
```

Bulk writes, deletes and updates:

```python
doc = [
    {"index": {}},
    {'name': 'jackaaa', 'age': 2000, 'sex': 'female', 'address': u'北京'},
    {"index": {}},
    {'name': 'jackbbb', 'age': 3000, 'sex': 'male', 'address': u'上海'},
    {"index": {}},
    {'name': 'jackccc', 'age': 4000, 'sex': 'female', 'address': u'廣州'},
    {"index": {}},
    {'name': 'jackddd', 'age': 1000, 'sex': 'male', 'address': u'深圳'},
]

doc = [
    {'index': {'_index': 'indexName', '_type': 'typeName', '_id': 'idValue'}},
    {'name': 'jack', 'sex': 'male', 'age': 10},
    {'delete': {'_index': 'indexName', '_type': 'typeName', '_id': 'idValue'}},
    {'create': {'_index': 'indexName', '_type': 'typeName', '_id': 'idValue'}},
    {'name': 'lucy', 'sex': 'female', 'age': 20},
    {'update': {'_index': 'indexName', '_type': 'typeName', '_id': 'idValue'}},
    {'doc': {'age': '100'}},
]
es.bulk(index='indexName', doc_type='typeName', body=doc)
```

Bulk actions can also be assembled as dicts and handed to the bulk() helper. This fragment comes from inside a class, hence the self references; the helper import and the ACTIONS list, missing in the original, are filled in here:

```python
from elasticsearch.helpers import bulk

ACTIONS = []
i = 1
for line in records:  # `records` was named `list` in the original, which shadows the builtin
    action = {
        "_index": self.index_name,
        "_type": self.index_type,
        "_id": i,  # _id can also be left out and generated automatically
        "_source": {
            "date": line['date'],
            "source": line['source'].decode('utf8'),
            "link": line['link'],
            "keyword": line['keyword'].decode('utf8'),
            "title": line['title'].decode('utf8')
        }
    }
    i += 1
    ACTIONS.append(action)

success, _ = bulk(self.es, ACTIONS, index=self.index_name, raise_on_error=True)
```

The Python Elasticsearch client offers many more features.
Reference documentation:
        https://elasticsearch-py.readthedocs.io/en/master/api.html
        https://www.elastic.co/guide/en/elasticsearch/reference/current/docs.html

Searching All Documents

```python
es.search(index="my_index", doc_type="test_type")
# or equivalently:
body = {"query": {"match_all": {}}}
es.search(index="my_index", doc_type="test_type", body=body)

# term: all documents where name == "python"
body = {"query": {"term": {"name": "python"}}}
es.search(index="my_index", doc_type="test_type", body=body)

# terms: all documents where name is "python" or "android"
body = {"query": {"terms": {"name": ["python", "android"]}}}
es.search(index="my_index", doc_type="test_type", body=body)
```

match and multi_match
```python
# match: documents whose name contains the keyword "python"
body = {"query": {"match": {"name": "python"}}}
es.search(index="my_index", doc_type="test_type", body=body)

# multi_match: documents whose name or addr contains the keyword "深圳"
body = {"query": {"multi_match": {"query": "深圳", "fields": ["name", "addr"]}}}
es.search(index="my_index", doc_type="test_type", body=body)
```

ids
body = {"query":{"ids":{"type":"test_type","values":["1","2"]}} } # 搜索出id為1或2d的所有數據 es.search(index="my_index",doc_type="test_type",body=body)#復合查詢 bool
A bool query combines clauses in three ways: must (every clause must match), should (at least one clause must match) and must_not (no clause may match).

```python
# all documents where name == "python" and age == 18
body = {
    "query": {
        "bool": {
            "must": [
                {"term": {"name": "python"}},
                {"term": {"age": 18}}
            ]
        }
    }
}
es.search(index="my_index", doc_type="test_type", body=body)
```

Paging (from/size)
body = {"query":{"match_all":{}}"from":2 # 從第二條數據開始"size":4 # 獲取4條數據 } # 從第2條數據開始,獲取4條數據 es.search(index="my_index",doc_type="test_type",body=body)#范圍查詢
body = {"query":{"range":{"age":{"gte":18, # >=18"lte":30 # <=30}}} } # 查詢18<=age<=30的所有數據 es.search(index="my_index",doc_type="test_type",body=body)#前綴查詢
body = {"query":{"prefix":{"name":"p"}} } # 查詢前綴為"趙"的所有數據 es.search(index="my_index",doc_type="test_type",body=body)#通配符查詢
body = {"query":{"wildcard":{"name":"*id"}} } # 查詢name以id為后綴的所有數據 es.search(index="my_index",doc_type="test_type",body=body)#排序
body = {"query":{"match_all":{}}"sort":{"age":{ # 根據age字段升序排序"order":"asc" # asc升序,desc降序}} }#filter_path
Response filtering: trim the response down to only the fields you need.
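A short sketch, reusing the example index from above (filter_path is a standard request parameter in elasticsearch-py):

```python
# keep only each hit's _id and _type in the response
es.search(index="my_index", doc_type="test_type",
          filter_path=["hits.hits._id", "hits.hits._type"])

# wildcards work too: keep everything under hits
es.search(index="my_index", doc_type="test_type", filter_path=["hits.*"])
```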
count

Runs a query and returns only the number of matching documents.
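A short sketch against the same example index:

```python
# how many documents have name == "python"?
body = {"query": {"term": {"name": "python"}}}
res = es.count(index="my_index", doc_type="test_type", body=body)
print(res["count"])
```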
Metric Aggregations

Minimum
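A sketch mirroring the max example that follows, computing the smallest age:

```python
# search all documents and return the minimum age
body = {
    "query": {"match_all": {}},
    "aggs": {              # aggregation clause
        "min_age": {       # key under which the result is returned
            "min": {
                "field": "age"
            }
        }
    }
}
es.search(index="my_index", doc_type="test_type", body=body)
```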
Maximum:

```python
# search all documents and return the maximum age
body = {
    "query": {"match_all": {}},
    "aggs": {              # aggregation clause
        "max_age": {       # key under which the result is returned
            "max": {
                "field": "age"
            }
        }
    }
}
es.search(index="my_index", doc_type="test_type", body=body)
```

Sum:
body = {"query":{"match_all":{}},"aggs":{ # 聚合查詢"sum_age":{ # 和的key"sum":{ # 和"field":"age" # 獲取所有age的和}}} } # 搜索所有數據,并獲取所有age的和 es.search(index="my_index",doc_type="test_type",body=body)獲取平均值
body = {"query":{"match_all":{}},"aggs":{ # 聚合查詢"avg_age":{ # 平均值的key"sum":{ # 平均值"field":"age" # 獲取所有age的平均值}}} } # 搜索所有數據,獲取所有age的平均值 es.search(index="my_index",doc_type="test_type",body=body)更多的搜索用法:https://elasticsearch-py.readthedocs.io/en/master/api.html

Querying Data

The operations so far are all simple; an ordinary database such as MongoDB can do every one of them, so nothing looks remarkable yet. What sets Elasticsearch apart is its exceptionally powerful retrieval.
For Chinese text we need a tokenizer plugin; here we use elasticsearch-analysis-ik (GitHub: https://github.com/medcl/elasticsearch-analysis-ik), installed with Elasticsearch's companion command-line tool elasticsearch-plugin. The version installed here is 6.2.4; make sure it matches your Elasticsearch version:

```
elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.2.4/elasticsearch-analysis-ik-6.2.4.zip
```

Replace the version number with your own Elasticsearch version. After installing, restart Elasticsearch; it loads installed plugins automatically. First we create an index and declare which field needs tokenizing:
```python
from elasticsearch import Elasticsearch

es = Elasticsearch()
mapping = {
    'properties': {
        'title': {
            'type': 'text',
            'analyzer': 'ik_max_word',
            'search_analyzer': 'ik_max_word'
        }
    }
}
es.indices.delete(index='news', ignore=[400, 404])
es.indices.create(index='news', ignore=400)
result = es.indices.put_mapping(index='news', doc_type='politics', body=mapping)
print(result)
```

Here we delete any previous index, create a new one, and then update its mapping. The mapping declares the field to tokenize: its type is text, and both the analyzer and the search_analyzer are ik_max_word, i.e. the Chinese tokenizer plugin we just installed. Without this, the default English analyzer would be used.
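As a hedged aside, the analyze API can show what the tokenizer actually produces (this assumes the IK plugin is installed on the node):

```python
# show how ik_max_word splits a phrase into searchable tokens
result = es.indices.analyze(body={'analyzer': 'ik_max_word', 'text': '中國領事館'})
print([token['token'] for token in result['tokens']])
```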
Next we insert a few new documents:

```python
datas = [
    {
        'title': '美國留給伊拉克的是個爛攤子嗎',
        'url': 'http://view.news.qq.com/zt2011/usa_iraq/index.htm',
        'date': '2011-12-16'
    },
    {
        'title': '公安部:各地校車將享最高路權',
        'url': 'http://www.chinanews.com/gn/2011/12-16/3536077.shtml',
        'date': '2011-12-16'
    },
    {
        'title': '中韓漁警沖突調查:韓警平均每天扣1艘中國漁船',
        'url': 'https://news.qq.com/a/20111216/001044.htm',
        'date': '2011-12-17'
    },
    {
        'title': '中國駐洛杉磯領事館遭亞裔男子槍擊,嫌犯已自首',
        'url': 'http://news.ifeng.com/world/detail_2011_12/16/11372558_0.shtml',
        'date': '2011-12-18'
    }
]

for data in datas:
    es.index(index='news', doc_type='politics', body=data)
```

Four records, each with title, url and date fields, are inserted with index() into the news index under the politics type.
Now let's query:

```python
result = es.search(index='news', doc_type='politics')
print(result)
```

All four inserted records come back:

```json
{
  "took": 0,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
  "hits": {
    "total": 4,
    "max_score": 1.0,
    "hits": [
      {
        "_index": "news", "_type": "politics", "_id": "c05G9mQBD9BuE5fdHOUT", "_score": 1.0,
        "_source": {
          "title": "美國留給伊拉克的是個爛攤子嗎",
          "url": "http://view.news.qq.com/zt2011/usa_iraq/index.htm",
          "date": "2011-12-16"
        }
      },
      {
        "_index": "news", "_type": "politics", "_id": "dk5G9mQBD9BuE5fdHOUm", "_score": 1.0,
        "_source": {
          "title": "中國駐洛杉磯領事館遭亞裔男子槍擊,嫌犯已自首",
          "url": "http://news.ifeng.com/world/detail_2011_12/16/11372558_0.shtml",
          "date": "2011-12-18"
        }
      },
      {
        "_index": "news", "_type": "politics", "_id": "dU5G9mQBD9BuE5fdHOUj", "_score": 1.0,
        "_source": {
          "title": "中韓漁警沖突調查:韓警平均每天扣1艘中國漁船",
          "url": "https://news.qq.com/a/20111216/001044.htm",
          "date": "2011-12-17"
        }
      },
      {
        "_index": "news", "_type": "politics", "_id": "dE5G9mQBD9BuE5fdHOUf", "_score": 1.0,
        "_source": {
          "title": "公安部:各地校車將享最高路權",
          "url": "http://www.chinanews.com/gn/2011/12-16/3536077.shtml",
          "date": "2011-12-16"
        }
      }
    ]
  }
}
```

The results appear under the hits field; total gives the number of matches and max_score the highest match score.
We can also run a full-text query, which is where Elasticsearch really shows its search-engine character:

```python
import json

dsl = {
    'query': {
        'match': {
            'title': '中國 領事館'
        }
    }
}

es = Elasticsearch()
result = es.search(index='news', doc_type='politics', body=dsl)
print(json.dumps(result, indent=2, ensure_ascii=False))
```

Here we use an Elasticsearch DSL query: match requests a full-text search over the title field for the text "中國 領事館". The result:

```json
{
  "took": 1,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
  "hits": {
    "total": 2,
    "max_score": 2.546152,
    "hits": [
      {
        "_index": "news", "_type": "politics", "_id": "dk5G9mQBD9BuE5fdHOUm", "_score": 2.546152,
        "_source": {
          "title": "中國駐洛杉磯領事館遭亞裔男子槍擊,嫌犯已自首",
          "url": "http://news.ifeng.com/world/detail_2011_12/16/11372558_0.shtml",
          "date": "2011-12-18"
        }
      },
      {
        "_index": "news", "_type": "politics", "_id": "dU5G9mQBD9BuE5fdHOUj", "_score": 0.2876821,
        "_source": {
          "title": "中韓漁警沖突調查:韓警平均每天扣1艘中國漁船",
          "url": "https://news.qq.com/a/20111216/001044.htm",
          "date": "2011-12-17"
        }
      }
    ]
  }
}
```

Two results match. The first scores 2.55 because its title contains both "中國" and "領事館"; the second contains "中國" but not "領事館", so it is still returned, just with the much lower score of 0.29.
So a match query searches the field's full text and ranks results by relevance to the keywords, which is the skeleton of a real search engine.

Elasticsearch supports many more query types; see the official documentation for details: https://www.elastic.co/guide/en/elasticsearch/reference/6.3/query-dsl.html

Features

The client is designed as a very thin wrapper around Elasticsearch's REST API for maximum flexibility. That means the client imposes no opinions; it also means some APIs are slightly cumbersome to use from Python. We created some helpers to ease this, and built a higher-level library on top, elasticsearch-dsl ( https://elasticsearch-dsl.readthedocs.io/en/latest/ ), to provide a more convenient way of working with Elasticsearch.

Persistent Connections
elasticsearch-py uses persistent connections inside of individual connection pools (one per each configured or sniffed node). Out of the box you can choose between two http protocol implementations. See Transport classes for more information.
The transport layer will create an instance of the selected connection class per node and keep track of the health of individual nodes - if a node becomes unresponsive (throwing exceptions while connecting to it) it's put on a timeout by the ConnectionPool class and only returned to the circulation after the timeout is over (or when no live nodes are left). By default nodes are randomized before being passed into the pool and a round-robin strategy is used for load balancing.
You can customize this behavior by passing parameters to the Connection Layer API (all keyword arguments to the Elasticsearch class will be passed through). If what you want to accomplish is not supported you should be able to create a subclass of the relevant component and pass it in as a parameter to be used instead of the default implementation.
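Because keyword arguments are passed through, pool tuning can be done straight from the constructor. A hedged sketch (host names and values are illustrative; RoundRobinSelector is assumed to be exported at the package top level, as in elasticsearch-py 7.x):

```python
from elasticsearch import Elasticsearch, RoundRobinSelector

es = Elasticsearch(
    ["host1", "host2"],
    dead_timeout=60,                    # initial hold-off for a failed node, in seconds
    selector_class=RoundRobinSelector,  # pick connections round-robin instead of randomized order
)
```
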
Automatic Retries
If a connection to a node fails due to connection issues (raises ConnectionError) it is considered in faulty state. It will be placed on hold for dead_timeout seconds and the request will be retried on another node. If a connection fails multiple times in a row the timeout will get progressively larger to avoid hitting a node that's, by all indication, down. If no live connection is available, the connection that has the smallest timeout will be used.
By default retries are not triggered by a timeout (ConnectionTimeout); set retry_on_timeout to True to also retry on timeouts.
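A small sketch of the related knobs (host names and values are illustrative):

```python
from elasticsearch import Elasticsearch

# retry a failed request on another node up to 5 times, and treat
# timeouts as retryable as well (they are not by default)
es = Elasticsearch(
    ["host1", "host2"],
    max_retries=5,
    retry_on_timeout=True,
)
```
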
Sniffing
The client can be configured to inspect the cluster state to get a list of nodes upon startup, periodically and/or on failure. See Transport parameters for details.
Some example configurations:
```python
from elasticsearch import Elasticsearch

# by default we don't sniff, ever
es = Elasticsearch()

# you can specify to sniff on startup to inspect the cluster and load
# balance across all nodes
es = Elasticsearch(["seed1", "seed2"], sniff_on_start=True)

# you can also sniff periodically and/or after failure:
es = Elasticsearch(["seed1", "seed2"],
                   sniff_on_start=True,
                   sniff_on_connection_fail=True,
                   sniffer_timeout=60)
```

Thread Safety
The client is thread safe and can be used in a multi threaded environment. Best practice is to create a single global instance of the client and use it throughout your application. If your application is long-running consider turning on Sniffing to make sure the client is up to date on the cluster location.
By default we allow urllib3 to open up to 10 connections to each node; if your application calls for more parallelism, use the maxsize parameter to raise the limit:
```python
# allow up to 25 connections to each node
es = Elasticsearch(["host1", "host2"], maxsize=25)
```

Note: since we use persistent connections throughout the client, the client doesn't tolerate fork very well. If your application calls for multiple processes, make sure you create a fresh client after the call to fork. Note that Python's multiprocessing module uses fork to create new processes on POSIX systems.

SSL and Authentication
You can configure the client to use SSL for connecting to your elasticsearch cluster, including certificate verification and HTTP auth:
```python
from elasticsearch import Elasticsearch

# you can use RFC-1738 to specify the url
es = Elasticsearch(['https://user:secret@localhost:443'])

# ... or specify common parameters as kwargs
es = Elasticsearch(
    ['localhost', 'otherhost'],
    http_auth=('user', 'secret'),
    scheme="https",
    port=443,
)

# SSL client authentication using client_cert and client_key
from ssl import create_default_context

context = create_default_context(cafile="path/to/cert.pem")
es = Elasticsearch(
    ['localhost', 'otherhost'],
    http_auth=('user', 'secret'),
    scheme="https",
    port=443,
    ssl_context=context,
)
```

Warning: elasticsearch-py doesn't ship with a default set of root certificates. To have working SSL certificate validation you need to either specify your own as cafile or capath or cadata, or install certifi, which will be picked up automatically.
See class Urllib3HttpConnection for a detailed description of the options.
Logging
elasticsearch-py uses the standard logging library from Python to define two loggers: elasticsearch and elasticsearch.trace. elasticsearch is used by the client to log standard activity, depending on the log level. elasticsearch.trace can be used to log requests to the server in the form of curl commands using pretty-printed json that can then be executed from the command line. Because it is designed to be shared (for example to demonstrate an issue) it also just uses localhost:9200 as the address instead of the actual address of the host. If the trace logger has not been configured already it is set to propagate=False so it needs to be activated separately.
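A minimal activation sketch (the log file path is just an example):

```python
import logging

# surface standard client activity on the console
logging.basicConfig(level=logging.INFO)

# write the curl-style request trace to a file; the trace logger does not
# propagate, so it needs its own handler
tracer = logging.getLogger('elasticsearch.trace')
tracer.setLevel(logging.DEBUG)
tracer.addHandler(logging.FileHandler('/tmp/es_trace.log'))
```
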
Environment Considerations
When using the client there are several limitations of your environment that could come into play.
When using an HTTP load balancer you cannot use the Sniffing functionality: the cluster would supply the client with IP addresses to directly connect to the cluster, circumventing the load balancer. Depending on your configuration this might be something you don't want, or it might break completely.
In some environments (notably on Google App Engine) your HTTP requests might be restricted so that GET requests won't accept a body. In that case use the send_get_body_as parameter of Transport to send all bodies via POST:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(send_get_body_as='POST')
```

Compression

When using capacity-constrained networks (low throughput), it may be handy to enable compression. This is especially useful when doing bulk loads or inserting large documents:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(hosts, http_compress=True)
```

Running on AWS with IAM
If you want to use this client with IAM based authentication on AWS you can use the requests-aws4auth package:
```python
from elasticsearch import Elasticsearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth

host = 'YOURHOST.us-east-1.es.amazonaws.com'
awsauth = AWS4Auth(YOUR_ACCESS_KEY, YOUR_SECRET_KEY, REGION, 'es')

es = Elasticsearch(
    hosts=[{'host': host, 'port': 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection
)
print(es.info())
```

Customization
By default, JSONSerializer is used to encode all outgoing requests. However, you can implement your own custom serializer:
```python
from elasticsearch import Elasticsearch
from elasticsearch.serializer import JSONSerializer

class SetEncoder(JSONSerializer):
    def default(self, obj):
        if isinstance(obj, set):
            return list(obj)
        if isinstance(obj, Something):  # Something is a placeholder for your own type
            return 'CustomSomethingRepresentation'
        return JSONSerializer.default(self, obj)

es = Elasticsearch(serializer=SetEncoder())
```

Contents
- API Documentation
- Global options
- Elasticsearch
- Indices
- Ingest
- Cluster
- Nodes
- Cat
- Snapshot
- Tasks
- X-Pack APIs
- Info
- Graph Explore
- Licensing API
- Machine Learning APIs
- Security APIs
- Watcher APIs
- Migration APIs
- Exceptions
- Connection Layer API
- Transport
- Connection Pool
- Connection Selector
- Urllib3HttpConnection (default connection_class)
- Transport classes
- Connection
- Urllib3HttpConnection
- RequestsHttpConnection
- Helpers
- Bulk helpers
- Scan
- Reindex