RGW Bucket Shard优化
1. Bucket index background
The bucket index is a critical data structure in RGW: it stores the index data for a bucket. By default, the entire index of a bucket lives in a single shard object (shard count 0), stored mainly as omap keys in LevelDB. As the number of objects in the bucket grows, that single shard keeps growing, and once it becomes too large it triggers a variety of problems.
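You can see this layout on a live cluster by listing the omap keys of a bucket's index object directly. A sketch, assuming default Jewel-era pool names; the bucket name `mybucket` is a placeholder, and the marker id comes from `radosgw-admin bucket stats`:

```shell
# Find the bucket's marker id (bucket name "mybucket" is hypothetical)
radosgw-admin bucket stats --bucket=mybucket | grep '"marker"'

# With sharding disabled there is a single index object named .dir.<marker>;
# its omap keys are the names of the objects in the bucket
rados -p default.rgw.buckets.index listomapkeys .dir.<marker> | head

# Counting the index entries gives a direct measure of shard size
rados -p default.rgw.buckets.index listomapkeys .dir.<marker> | wc -l
```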
2. Problems and failures
2.1 Symptoms
RGW index data is stored as omap in the LevelDB on the node hosting the OSD. Once a single bucket holds objects on the order of millions, operations such as deep-scrub and bucket list consume enormous amounts of disk I/O and push the corresponding OSD into abnormal states. Without sharding the bucket index (sharding splits a single bucket index horizontally across LevelDB instances on multiple OSDs), a large bucket is an incident waiting to happen.
When RGW handles a large volume of DELETE requests, the underlying LevelDB compacts frequently (compaction is very hard on disk performance), and because compaction is single-threaded inside LevelDB, it easily exceeds the osd op thread timeout and makes the OSD commit suicide.
Common problems include:
- deep-scrub or bucket list against an oversized index shard saturating the disk and taking the OSD down;
- compaction storms triggered by bulk deletes ending in OSD suicide.
2.2 Root cause
When the omap on the OSD holding the bucket index grows too large, any fault that crashes the OSD process turns into live firefighting: the service has to be restored as quickly as possible.
First check the size of that OSD's omap. If it is too large, the OSD spends a huge amount of time and resources loading the LevelDB data at startup and may fail to start at all (suicide on timeout).
OSDs in this state also need a very large amount of memory to start, so be sure to reserve enough. (Around 40 GB of physical RAM; fall back to swap if that is not enough.)
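Before attempting a restart it helps to know how big the omap actually is. A sketch, assuming a filestore OSD with the default data path (osd id 12 is a placeholder); the LevelDB files live under `omap` in the OSD's data directory:

```shell
# Size of the LevelDB omap directory for osd.12 (path assumes the default filestore layout)
du -sh /var/lib/ceph/osd/ceph-12/current/omap

# The number of .sst files gives a feel for how fragmented the store is
ls /var/lib/ceph/osd/ceph-12/current/omap/*.sst | wc -l
```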
3. Interim mitigations
3.1 Disable scrub and deep-scrub to stabilize the cluster
```shell
$ ceph osd set noscrub
$ ceph osd set nodeep-scrub
```
3.2 Raise the timeout parameters to reduce the chance of OSD suicide
```
osd_op_thread_timeout = 90                   # default is 15
osd_op_thread_suicide_timeout = 2000         # default is 150

# If filestore op threads are hitting timeouts:
filestore_op_thread_timeout = 180            # default is 60
filestore_op_thread_suicide_timeout = 2000   # default is 180

# The same can be done for the recovery threads:
osd_recovery_thread_timeout = 120            # default is 30
osd_recovery_thread_suicide_timeout = 2000
```
3.3 Compact the omap manually
If the OSD can be taken offline, you can run a compact on it. Use ceph 0.94.6 or later; versions below that have a bug: https://github.com/ceph/ceph/pull/7645/files
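A sketch of how a compaction can be triggered; `leveldb_compact_on_mount` is an OSD config option that compacts the store during startup, which is the safer route when the daemon cannot stay up long enough to compact online (osd id 12 is a placeholder):

```shell
# Offline route: ask the OSD to compact its LevelDB while mounting.
# Add to the OSD's section in ceph.conf, then restart the OSD:
#   [osd]
#   leveldb_compact_on_mount = true
systemctl restart ceph-osd@12

# On recent releases an online compaction can also be requested through
# the admin socket (verify the command exists on your version first):
ceph daemon osd.12 compact
```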
4. Permanent solutions
4.1 Plan bucket shards ahead of time
The index pool must live on SSDs. This is the prerequisite for everything below; without that hardware, the rest of the tuning is wasted effort.
Set a sensible shard count per bucket.
More shards is not always better: too many shards make operations like bucket list fan out across many index objects, consuming large amounts of backend I/O and making some requests take far too long.
The shard count also has to respect your OSD failure domains and replica count. Say the index pool has size=2 and you have 2 racks with 24 OSD nodes in total: ideally the 2 replicas of each shard should land in different racks. With 8 shards, 8*2=16 shard objects need to be stored, and those 16 should spread evenly across the 2 racks. By the same logic, going beyond 24 shards on 24 OSDs is clearly inappropriate.
Keep the average size of each index shard under control. The current recommendation is 100k-150k object entries per shard; beyond that, the bucket needs a dedicated reshard (a high-risk operation, use with caution). For example, if a bucket is expected to hold at most 1,000,000 objects, then 1,000,000/8 = 125,000, so 8 shards is a reasonable choice. Each omap key entry in a shard takes roughly 200 bytes, so 150000*200/1024/1024 ≈ 28.61 MB; in other words, keep each shard file under about 28 MB.
At the application level, cap the number of objects per bucket so that each shard averages 100k-150k objects.
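The sizing arithmetic above is easy to turn into a quick check. A minimal sketch in plain Python, using the numbers from this section (200 bytes per omap entry, 100k-150k entries per shard); the function name is mine, not an RGW API:

```python
AVG_ENTRY_BYTES = 200          # rough omap footprint of one index entry
MAX_OBJS_PER_SHARD = 150000    # upper end of the 100k-150k recommendation


def shard_budget(expected_objects, num_shards):
    """Objects and approximate omap size (MB) per shard for a given shard count."""
    per_shard = expected_objects / float(num_shards)
    size_mb = per_shard * AVG_ENTRY_BYTES / 1024.0 / 1024.0
    return per_shard, size_mb


# The example from the text: 1,000,000 objects spread over 8 shards
per_shard, size_mb = shard_budget(1000000, 8)
print(per_shard)                         # 125000.0 entries per shard
print(per_shard <= MAX_OBJS_PER_SHARD)   # True: inside the recommended budget
print(round(size_mb, 2))                 # 23.84 MB of omap per shard
```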
4.1.1 Configuring bucket index sharding
To enable and configure bucket index sharding on all new buckets, use (see redhat-bucket_sharding):
- the rgw_override_bucket_index_max_shards setting for simple configurations,
- the bucket_index_max_shards setting for federated configurations.
Simple configurations:
```shell
#1. Set the parameter in the configuration file. Note that the maximum number of shards is 7877.
[global]
rgw_override_bucket_index_max_shards = 10

#2. Restart the rgw service for the change to take effect
systemctl restart ceph-radosgw.target

#3. Check the number of bucket shard objects
rados -p default.rgw.buckets.index ls | wc -l
1000
```
Federated configurations
In federated configurations, each zone can have a different index_pool setting to manage failover. To configure a consistent shard count for zones in one region, set the bucket_index_max_shards setting in the configuration for that region. To do so:
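The text stops short of the actual commands. A sketch of the usual pre-Jewel procedure under that assumption (the shard count 8 is an example; verify the radosgw-admin syntax on your release): pull the region configuration, set bucket_index_max_shards, push it back, and refresh the region map.

```shell
# Extract the current region configuration
radosgw-admin region get > region.json

# Edit region.json and set the desired shard count, e.g.:
#   "bucket_index_max_shards": 8,

# Inject the modified configuration and refresh the region map
radosgw-admin region set < region.json
radosgw-admin regionmap update
```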
File upload demo (python-boto, Python 2):
```python
#_*_coding:utf-8_*_
# yum install python-boto
import boto
import boto.s3.connection
# pip install filechunkio
from filechunkio import FileChunkIO
import math
import threading
import os
import Queue


class Chunk(object):
    num = 0
    offset = 0
    len = 0

    def __init__(self, n, o, l):
        self.num = n
        self.offset = o
        self.length = l


class CONNECTION(object):
    def __init__(self, access_key, secret_key, ip, port, is_secure=False, chrunksize=8 << 20):
        # chunk size must be at least 8M, otherwise multipart uploads fail
        self.conn = boto.connect_s3(
            aws_access_key_id=access_key,
            aws_secret_access_key=secret_key,
            host=ip,
            port=port,
            is_secure=is_secure,
            calling_format=boto.s3.connection.OrdinaryCallingFormat()
        )
        self.chrunksize = chrunksize
        self.port = port

    # listing helpers
    def list_all(self):
        all_buckets = self.conn.get_all_buckets()
        for bucket in all_buckets:
            print u'Bucket: %s' % (bucket.name)
            for key in bucket.list():
                print ' ' * 5, "%-20s%-20s%-20s%-40s%-20s" % (key.mode, key.owner.id, key.size, key.last_modified.split('.')[0], key.name)

    def list_single(self, bucket_name):
        try:
            single_bucket = self.conn.get_bucket(bucket_name)
        except Exception as e:
            print 'bucket %s is not exist' % bucket_name
            return
        print u'Bucket: %s' % (single_bucket.name)
        for key in single_bucket.list():
            print ' ' * 5, "%-20s%-20s%-20s%-40s%-20s" % (key.mode, key.owner.id, key.size, key.last_modified.split('.')[0], key.name)

    # simple download for files <= 8M
    def dowload_file(self, filepath, key_name, bucket_name):
        all_bucket_name_list = [i.name for i in self.conn.get_all_buckets()]
        if bucket_name not in all_bucket_name_list:
            print 'Bucket %s is not exist,please try again' % (bucket_name)
            return
        else:
            bucket = self.conn.get_bucket(bucket_name)
        all_key_name_list = [i.name for i in bucket.get_all_keys()]
        if key_name not in all_key_name_list:
            print 'File %s is not exist,please try again' % (key_name)
            return
        else:
            key = bucket.get_key(key_name)
        if not os.path.exists(os.path.dirname(filepath)):
            print 'Filepath %s is not exists, sure to create and try again' % (filepath)
            return
        if os.path.exists(filepath):
            while True:
                d_tag = raw_input('File %s already exists, sure you want to cover (Y/N)?' % (key_name)).strip()
                if d_tag not in ['Y', 'N'] or len(d_tag) == 0:
                    continue
                elif d_tag == 'Y':
                    os.remove(filepath)
                    break
                elif d_tag == 'N':
                    return
        os.mknod(filepath)
        try:
            key.get_contents_to_filename(filepath)
        except Exception:
            pass

    # simple upload for files <= 8M
    def upload_file(self, filepath, key_name, bucket_name):
        try:
            bucket = self.conn.get_bucket(bucket_name)
        except Exception as e:
            print 'bucket %s is not exist' % bucket_name
            tag = raw_input('Do you want to create the bucket %s: (Y/N)?' % bucket_name).strip()
            while tag not in ['Y', 'N']:
                tag = raw_input('Please input (Y/N)').strip()
            if tag == 'N':
                return
            elif tag == 'Y':
                self.conn.create_bucket(bucket_name)
                bucket = self.conn.get_bucket(bucket_name)
        all_key_name_list = [i.name for i in bucket.get_all_keys()]
        if key_name in all_key_name_list:
            while True:
                f_tag = raw_input(u'File already exists, sure you want to cover (Y/N)?: ').strip()
                if f_tag not in ['Y', 'N'] or len(f_tag) == 0:
                    continue
                elif f_tag == 'Y':
                    break
                elif f_tag == 'N':
                    return
        key = bucket.new_key(key_name)
        if not os.path.exists(filepath):
            print 'File %s does not exist, please make sure you want to upload file path and try again' % (key_name)
            return
        try:
            f = file(filepath, 'rb')
            data = f.read()
            key.set_contents_from_string(data)
        except Exception:
            pass

    def delete_file(self, key_name, bucket_name):
        all_bucket_name_list = [i.name for i in self.conn.get_all_buckets()]
        if bucket_name not in all_bucket_name_list:
            print 'Bucket %s is not exist,please try again' % (bucket_name)
            return
        else:
            bucket = self.conn.get_bucket(bucket_name)
        all_key_name_list = [i.name for i in bucket.get_all_keys()]
        if key_name not in all_key_name_list:
            print 'File %s is not exist,please try again' % (key_name)
            return
        else:
            key = bucket.get_key(key_name)
        try:
            bucket.delete_key(key.name)
        except Exception:
            pass

    def delete_bucket(self, bucket_name):
        all_bucket_name_list = [i.name for i in self.conn.get_all_buckets()]
        if bucket_name not in all_bucket_name_list:
            print 'Bucket %s is not exist,please try again' % (bucket_name)
            return
        else:
            bucket = self.conn.get_bucket(bucket_name)
        try:
            self.conn.delete_bucket(bucket.name)
        except Exception:
            pass

    # build the chunk queue
    def init_queue(self, filesize, chunksize):  # 8<<20 : 8*2**20
        chunkcnt = int(math.ceil(filesize * 1.0 / chunksize))
        q = Queue.Queue(maxsize=chunkcnt)
        for i in range(0, chunkcnt):
            offset = chunksize * i
            length = min(chunksize, filesize - offset)
            c = Chunk(i + 1, offset, length)
            q.put(c)
        return q

    # upload a single part
    def upload_trunk(self, filepath, mp, q, id):
        while not q.empty():
            chunk = q.get()
            fp = FileChunkIO(filepath, 'r', offset=chunk.offset, bytes=chunk.length)
            mp.upload_part_from_file(fp, part_num=chunk.num)
            fp.close()
            q.task_done()

    # get file size ----> create S3 multipart upload ----> build the queue ----> split the file and upload the parts
    def upload_file_multipart(self, filepath, key_name, bucket_name, threadcnt=8):
        filesize = os.stat(filepath).st_size
        try:
            bucket = self.conn.get_bucket(bucket_name)
        except Exception as e:
            print 'bucket %s is not exist' % bucket_name
            tag = raw_input('Do you want to create the bucket %s: (Y/N)?' % bucket_name).strip()
            while tag not in ['Y', 'N']:
                tag = raw_input('Please input (Y/N)').strip()
            if tag == 'N':
                return
            elif tag == 'Y':
                self.conn.create_bucket(bucket_name)
                bucket = self.conn.get_bucket(bucket_name)
        all_key_name_list = [i.name for i in bucket.get_all_keys()]
        if key_name in all_key_name_list:
            while True:
                f_tag = raw_input(u'File already exists, sure you want to cover (Y/N)?: ').strip()
                if f_tag not in ['Y', 'N'] or len(f_tag) == 0:
                    continue
                elif f_tag == 'Y':
                    break
                elif f_tag == 'N':
                    return
        mp = bucket.initiate_multipart_upload(key_name)
        q = self.init_queue(filesize, self.chrunksize)
        for i in range(0, threadcnt):
            t = threading.Thread(target=self.upload_trunk, args=(filepath, mp, q, i))
            t.setDaemon(True)
            t.start()
        q.join()
        mp.complete_upload()

    # multipart download
    def download_chrunk(self, filepath, key_name, bucket_name, q, id):
        while not q.empty():
            chrunk = q.get()
            offset = chrunk.offset
            length = chrunk.length
            bucket = self.conn.get_bucket(bucket_name)
            resp = bucket.connection.make_request('GET', bucket_name, key_name,
                                                  headers={'Range': "bytes=%d-%d" % (offset, offset + length)})
            data = resp.read(length)
            fp = FileChunkIO(filepath, 'r+', offset=chrunk.offset, bytes=chrunk.length)
            fp.write(data)
            fp.close()
            q.task_done()

    def download_file_multipart(self, filepath, key_name, bucket_name, threadcnt=8):
        all_bucket_name_list = [i.name for i in self.conn.get_all_buckets()]
        if bucket_name not in all_bucket_name_list:
            print 'Bucket %s is not exist,please try again' % (bucket_name)
            return
        else:
            bucket = self.conn.get_bucket(bucket_name)
        all_key_name_list = [i.name for i in bucket.get_all_keys()]
        if key_name not in all_key_name_list:
            print 'File %s is not exist,please try again' % (key_name)
            return
        else:
            key = bucket.get_key(key_name)
        if not os.path.exists(os.path.dirname(filepath)):
            print 'Filepath %s is not exists, sure to create and try again' % (filepath)
            return
        if os.path.exists(filepath):
            while True:
                d_tag = raw_input('File %s already exists, sure you want to cover (Y/N)?' % (key_name)).strip()
                if d_tag not in ['Y', 'N'] or len(d_tag) == 0:
                    continue
                elif d_tag == 'Y':
                    os.remove(filepath)
                    break
                elif d_tag == 'N':
                    return
        os.mknod(filepath)
        filesize = key.size
        q = self.init_queue(filesize, self.chrunksize)
        for i in range(0, threadcnt):
            t = threading.Thread(target=self.download_chrunk, args=(filepath, key_name, bucket_name, q, i))
            t.setDaemon(True)
            t.start()
        q.join()

    def generate_object_download_urls(self, key_name, bucket_name, valid_time=0):
        all_bucket_name_list = [i.name for i in self.conn.get_all_buckets()]
        if bucket_name not in all_bucket_name_list:
            print 'Bucket %s is not exist,please try again' % (bucket_name)
            return
        else:
            bucket = self.conn.get_bucket(bucket_name)
        all_key_name_list = [i.name for i in bucket.get_all_keys()]
        if key_name not in all_key_name_list:
            print 'File %s is not exist,please try again' % (key_name)
            return
        else:
            key = bucket.get_key(key_name)
        try:
            key.set_canned_acl('public-read')
            download_url = key.generate_url(valid_time, query_auth=False, force_http=True)
            if self.port != 80:
                x1 = download_url.split('/')[0:3]
                x2 = download_url.split('/')[3:]
                s1 = u'/'.join(x1)
                s2 = u'/'.join(x2)
                s3 = ':%s/' % (str(self.port))
                download_url = s1 + s3 + s2
            print download_url
        except Exception:
            pass


if __name__ == '__main__':
    # Conventions:
    # 1: filepath is the absolute local path (upload source or download target)
    # 2: bucket_name is the directory/index name the file has in object storage
    # 3: key_name is the file name (object key) in object storage
    access_key = "FYT71CYU3UQKVMC8YYVY"
    secret_key = "rVEASbWAytjVLv1G8Ta8060lY3yrcdPTsEL0rfwr"
    ip = '127.0.0.1'
    port = 7480
    conn = CONNECTION(access_key, secret_key, ip, port)

    # list all buckets and their contents
    #conn.list_all()

    # simple upload, for files <= 8M
    #conn.upload_file('/etc/passwd','passwd','test_bucket01')
    conn.upload_file('/tmp/test.log', 'test1', 'test_bucket12')

    # list the files in a single bucket
    conn.list_single('test_bucket12')

    # simple download, for files <= 8M
    # conn.dowload_file('/lhf_test/test01','passwd','test_bucket01')
    # conn.list_single('test_bucket01')

    # delete a file
    # conn.delete_file('passwd','test_bucket01')
    # conn.list_single('test_bucket01')

    # delete a bucket
    # conn.delete_bucket('test_bucket01')
    # conn.list_all()

    # multipart upload (multi-threaded), for files > 8M; the 8M chunk size is tunable
    # but must not go below 8M, otherwise the upload errors out with parts too small
    # conn.upload_file_multipart('/etc/passwd','passwd_multi_upload','test_bucket01')
    # conn.list_single('test_bucket01')

    # multipart download (multi-threaded), for files > 8M; same chunk-size constraint
    # conn.download_file_multipart('/lhf_test/passwd_multi_dowload','passwd_multi_upload','test_bucket01')

    # generate a download url
    #conn.generate_object_download_urls('passwd_multi_upload','test_bucket01')
    #conn.list_all()
```
4.2 Resharding an existing bucket
To reshard the bucket index pool (see redhat-bucket_sharding):
```shell
# Make sure all operations against the bucket have been fully stopped,
# then back up its index with:
$ radosgw-admin bi list --bucket=<bucket_name> > <bucket_name>.list.backup

# The index can be restored from the backup with:
$ radosgw-admin bi put --bucket=<bucket_name> < <bucket_name>.list.backup

# Look up the bucket's index id:
$ radosgw-admin bucket stats --bucket=bucket-maillist
{
    "bucket": "bucket-maillist",
    "pool": "default.rgw.buckets.data",
    "index_pool": "default.rgw.buckets.index",
    "id": "0a6967a5-2c76-427a-99c6-8a788ca25034.54133.1",   # note this id
    "marker": "0a6967a5-2c76-427a-99c6-8a788ca25034.54133.1",
    "owner": "user",
    "ver": "0#1,1#1",
    "master_ver": "0#0,1#0",
    "mtime": "2017-08-23 13:42:59.007081",
    "max_marker": "0#,1#",
    "usage": {},
    "bucket_quota": {
        "enabled": false,
        "max_size_kb": -1,
        "max_objects": -1
    }
}

# Reshard the bucket's index. This adjusts "bucket-maillist" to 4 shards;
# note that the command prints both the old and the new bucket instance id:
$ radosgw-admin bucket reshard --bucket="bucket-maillist" --num-shards=4
*** NOTICE: operation will not remove old bucket index objects ***
*** these will need to be removed manually ***
old bucket instance id: 0a6967a5-2c76-427a-99c6-8a788ca25034.54133.1
new bucket instance id: 0a6967a5-2c76-427a-99c6-8a788ca25034.54147.1
total entries: 3

# Then delete the old instance id with:
$ radosgw-admin bi purge --bucket="bucket-maillist" --bucket-id=0a6967a5-2c76-427a-99c6-8a788ca25034.54133.1

# Check the final result:
$ radosgw-admin bucket stats --bucket=bucket-maillist
{
    "bucket": "bucket-maillist",
    "pool": "default.rgw.buckets.data",
    "index_pool": "default.rgw.buckets.index",
    "id": "0a6967a5-2c76-427a-99c6-8a788ca25034.54147.1",   # the id has changed
    "marker": "0a6967a5-2c76-427a-99c6-8a788ca25034.54133.1",
    "owner": "user",
    "ver": "0#2,1#1,2#1,3#2",
    "master_ver": "0#0,1#0,2#0,3#0",
    "mtime": "2017-08-23 14:02:19.961205",
    "max_marker": "0#,1#,2#,3#",
    "usage": {
        "rgw.main": {
            "size_kb": 50,
            "size_kb_actual": 60,
            "num_objects": 3
        }
    },
    "bucket_quota": {
        "enabled": false,
        "max_size_kb": -1,
        "max_objects": -1
    }
}
```
Summary