RGW Bucket Shard优化
1. Bucket index background
The bucket index is a critical data structure in RGW: it stores the index data for a bucket. By default, the entire index of a bucket lives in a single shard object (shard count 0), stored mainly as omap keys in LevelDB. As the number of objects in the bucket grows, that single shard keeps growing, and once it becomes too large it triggers a variety of problems.
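You can see this layout on a live cluster by listing the omap keys of a bucket's index object directly. A sketch, assuming default Jewel-era pool names; the bucket name `mybucket` is a placeholder, and the marker id comes from `radosgw-admin bucket stats`:

```shell
# Find the bucket's marker id (bucket name "mybucket" is hypothetical)
radosgw-admin bucket stats --bucket=mybucket | grep '"marker"'

# With sharding disabled there is a single index object named .dir.<marker>;
# its omap keys are the names of the objects in the bucket
rados -p default.rgw.buckets.index listomapkeys .dir.<marker> | head

# Counting the index entries gives a direct measure of shard size
rados -p default.rgw.buckets.index listomapkeys .dir.<marker> | wc -l
```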
2. Problems and failures
2.1 Symptoms
RGW index data is stored as omap in the LevelDB on the node hosting the OSD. Once a single bucket holds objects on the order of millions, operations such as deep-scrub and bucket list consume enormous amounts of disk I/O and push the corresponding OSD into abnormal states. Without sharding the bucket index (sharding splits a single bucket index horizontally across LevelDB instances on multiple OSDs), a large bucket is an incident waiting to happen.
When RGW handles a large volume of DELETE requests, the underlying LevelDB compacts frequently (compaction is very hard on disk performance), and because compaction is single-threaded inside LevelDB, it easily exceeds the osd op thread timeout and makes the OSD commit suicide.
Common problems include:
- deep-scrub or bucket list against an oversized index shard saturating the disk and taking the OSD down;
- compaction storms triggered by bulk deletes ending in OSD suicide.
2.2 Root cause
When the omap on the OSD holding the bucket index grows too large, any fault that crashes the OSD process turns into live firefighting: the service has to be restored as quickly as possible.
First check the size of that OSD's omap. If it is too large, the OSD spends a huge amount of time and resources loading the LevelDB data at startup and may fail to start at all (suicide on timeout).
OSDs in this state also need a very large amount of memory to start, so be sure to reserve enough. (Around 40 GB of physical RAM; fall back to swap if that is not enough.)
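Before attempting a restart it helps to know how big the omap actually is. A sketch, assuming a filestore OSD with the default data path (osd id 12 is a placeholder); the LevelDB files live under `omap` in the OSD's data directory:

```shell
# Size of the LevelDB omap directory for osd.12 (path assumes the default filestore layout)
du -sh /var/lib/ceph/osd/ceph-12/current/omap

# The number of .sst files gives a feel for how fragmented the store is
ls /var/lib/ceph/osd/ceph-12/current/omap/*.sst | wc -l
```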
3. Interim mitigations
3.1 Disable scrub and deep-scrub to stabilize the cluster
```shell
$ ceph osd set noscrub
$ ceph osd set nodeep-scrub
```
3.2 Raise the timeout parameters to reduce the chance of OSD suicide
```
osd_op_thread_timeout = 90                   # default is 15
osd_op_thread_suicide_timeout = 2000         # default is 150

# If filestore op threads are hitting timeouts:
filestore_op_thread_timeout = 180            # default is 60
filestore_op_thread_suicide_timeout = 2000   # default is 180

# The same can be done for the recovery threads:
osd_recovery_thread_timeout = 120            # default is 30
osd_recovery_thread_suicide_timeout = 2000
```
3.3 Compact the omap manually
If the OSD can be taken offline, you can run a compact on it. Use ceph 0.94.6 or later; versions below that have a bug: https://github.com/ceph/ceph/pull/7645/files
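A sketch of how a compaction can be triggered; `leveldb_compact_on_mount` is an OSD config option that compacts the store during startup, which is the safer route when the daemon cannot stay up long enough to compact online (osd id 12 is a placeholder):

```shell
# Offline route: ask the OSD to compact its LevelDB while mounting.
# Add to the OSD's section in ceph.conf, then restart the OSD:
#   [osd]
#   leveldb_compact_on_mount = true
systemctl restart ceph-osd@12

# On recent releases an online compaction can also be requested through
# the admin socket (verify the command exists on your version first):
ceph daemon osd.12 compact
```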
4. Permanent solutions
4.1 Plan bucket shards ahead of time
The index pool must live on SSDs. This is the prerequisite for everything below; without that hardware, the rest of the tuning is wasted effort.
Set a sensible shard count per bucket.
More shards is not always better: too many shards make operations like bucket list fan out across many index objects, consuming large amounts of backend I/O and making some requests take far too long.
The shard count also has to respect your OSD failure domains and replica count. Say the index pool has size=2 and you have 2 racks with 24 OSD nodes in total: ideally the 2 replicas of each shard should land in different racks. With 8 shards, 8*2=16 shard objects need to be stored, and those 16 should spread evenly across the 2 racks. By the same logic, going beyond 24 shards on 24 OSDs is clearly inappropriate.
Keep the average size of each index shard under control. The current recommendation is 100k-150k object entries per shard; beyond that, the bucket needs a dedicated reshard (a high-risk operation, use with caution). For example, if a bucket is expected to hold at most 1,000,000 objects, then 1,000,000/8 = 125,000, so 8 shards is a reasonable choice. Each omap key entry in a shard takes roughly 200 bytes, so 150000*200/1024/1024 ≈ 28.61 MB; in other words, keep each shard file under about 28 MB.
At the application level, cap the number of objects per bucket so that each shard averages 100k-150k objects.
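The sizing arithmetic above is easy to turn into a quick check. A minimal sketch in plain Python, using the numbers from this section (200 bytes per omap entry, 100k-150k entries per shard); the function name is mine, not an RGW API:

```python
AVG_ENTRY_BYTES = 200          # rough omap footprint of one index entry
MAX_OBJS_PER_SHARD = 150000    # upper end of the 100k-150k recommendation


def shard_budget(expected_objects, num_shards):
    """Objects and approximate omap size (MB) per shard for a given shard count."""
    per_shard = expected_objects / float(num_shards)
    size_mb = per_shard * AVG_ENTRY_BYTES / 1024.0 / 1024.0
    return per_shard, size_mb


# The example from the text: 1,000,000 objects spread over 8 shards
per_shard, size_mb = shard_budget(1000000, 8)
print(per_shard)                         # 125000.0 entries per shard
print(per_shard <= MAX_OBJS_PER_SHARD)   # True: inside the recommended budget
print(round(size_mb, 2))                 # 23.84 MB of omap per shard
```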
4.1.1 Configuring bucket index sharding
To enable and configure bucket index sharding on all new buckets, use (see redhat-bucket_sharding):
- the rgw_override_bucket_index_max_shards setting for simple configurations,
- the bucket_index_max_shards setting for federated configurations.
Simple configurations:
```shell
#1. Set the parameter in the configuration file. Note that the maximum number of shards is 7877.
[global]
rgw_override_bucket_index_max_shards = 10

#2. Restart the rgw service for the change to take effect
systemctl restart ceph-radosgw.target

#3. Check the number of bucket shard objects
rados -p default.rgw.buckets.index ls | wc -l
1000
```
Federated configurations
In federated configurations, each zone can have a different index_pool setting to manage failover. To configure a consistent shard count for zones in one region, set the bucket_index_max_shards setting in the configuration for that region. To do so:
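The text stops short of the actual commands. A sketch of the usual pre-Jewel procedure under that assumption (the shard count 8 is an example; verify the radosgw-admin syntax on your release): pull the region configuration, set bucket_index_max_shards, push it back, and refresh the region map.

```shell
# Extract the current region configuration
radosgw-admin region get > region.json

# Edit region.json and set the desired shard count, e.g.:
#   "bucket_index_max_shards": 8,

# Inject the modified configuration and refresh the region map
radosgw-admin region set < region.json
radosgw-admin regionmap update
```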
File upload demo (python-boto, Python 2):
```python
#_*_coding:utf-8_*_
# yum install python-boto
import boto
import boto.s3.connection
# pip install filechunkio
from filechunkio import FileChunkIO
import math
import threading
import os
import Queue


class Chunk(object):
    num = 0
    offset = 0
    len = 0

    def __init__(self, n, o, l):
        self.num = n
        self.offset = o
        self.length = l


class CONNECTION(object):
    def __init__(self, access_key, secret_key, ip, port, is_secure=False, chrunksize=8 << 20):
        # chunk size must be at least 8M, otherwise multipart uploads fail
        self.conn = boto.connect_s3(
            aws_access_key_id=access_key,
            aws_secret_access_key=secret_key,
            host=ip,
            port=port,
            is_secure=is_secure,
            calling_format=boto.s3.connection.OrdinaryCallingFormat()
        )
        self.chrunksize = chrunksize
        self.port = port

    # listing helpers
    def list_all(self):
        all_buckets = self.conn.get_all_buckets()
        for bucket in all_buckets:
            print u'Bucket: %s' % (bucket.name)
            for key in bucket.list():
                print ' ' * 5, "%-20s%-20s%-20s%-40s%-20s" % (key.mode, key.owner.id, key.size, key.last_modified.split('.')[0], key.name)

    def list_single(self, bucket_name):
        try:
            single_bucket = self.conn.get_bucket(bucket_name)
        except Exception as e:
            print 'bucket %s is not exist' % bucket_name
            return
        print u'Bucket: %s' % (single_bucket.name)
        for key in single_bucket.list():
            print ' ' * 5, "%-20s%-20s%-20s%-40s%-20s" % (key.mode, key.owner.id, key.size, key.last_modified.split('.')[0], key.name)

    # simple download for files <= 8M
    def dowload_file(self, filepath, key_name, bucket_name):
        all_bucket_name_list = [i.name for i in self.conn.get_all_buckets()]
        if bucket_name not in all_bucket_name_list:
            print 'Bucket %s is not exist,please try again' % (bucket_name)
            return
        else:
            bucket = self.conn.get_bucket(bucket_name)
        all_key_name_list = [i.name for i in bucket.get_all_keys()]
        if key_name not in all_key_name_list:
            print 'File %s is not exist,please try again' % (key_name)
            return
        else:
            key = bucket.get_key(key_name)
        if not os.path.exists(os.path.dirname(filepath)):
            print 'Filepath %s is not exists, sure to create and try again' % (filepath)
            return
        if os.path.exists(filepath):
            while True:
                d_tag = raw_input('File %s already exists, sure you want to cover (Y/N)?' % (key_name)).strip()
                if d_tag not in ['Y', 'N'] or len(d_tag) == 0:
                    continue
                elif d_tag == 'Y':
                    os.remove(filepath)
                    break
                elif d_tag == 'N':
                    return
        os.mknod(filepath)
        try:
            key.get_contents_to_filename(filepath)
        except Exception:
            pass

    # simple upload for files <= 8M
    def upload_file(self, filepath, key_name, bucket_name):
        try:
            bucket = self.conn.get_bucket(bucket_name)
        except Exception as e:
            print 'bucket %s is not exist' % bucket_name
            tag = raw_input('Do you want to create the bucket %s: (Y/N)?' % bucket_name).strip()
            while tag not in ['Y', 'N']:
                tag = raw_input('Please input (Y/N)').strip()
            if tag == 'N':
                return
            elif tag == 'Y':
                self.conn.create_bucket(bucket_name)
                bucket = self.conn.get_bucket(bucket_name)
        all_key_name_list = [i.name for i in bucket.get_all_keys()]
        if key_name in all_key_name_list:
            while True:
                f_tag = raw_input(u'File already exists, sure you want to cover (Y/N)?: ').strip()
                if f_tag not in ['Y', 'N'] or len(f_tag) == 0:
                    continue
                elif f_tag == 'Y':
                    break
                elif f_tag == 'N':
                    return
        key = bucket.new_key(key_name)
        if not os.path.exists(filepath):
            print 'File %s does not exist, please make sure you want to upload file path and try again' % (key_name)
            return
        try:
            f = file(filepath, 'rb')
            data = f.read()
            key.set_contents_from_string(data)
        except Exception:
            pass

    def delete_file(self, key_name, bucket_name):
        all_bucket_name_list = [i.name for i in self.conn.get_all_buckets()]
        if bucket_name not in all_bucket_name_list:
            print 'Bucket %s is not exist,please try again' % (bucket_name)
            return
        else:
            bucket = self.conn.get_bucket(bucket_name)
        all_key_name_list = [i.name for i in bucket.get_all_keys()]
        if key_name not in all_key_name_list:
            print 'File %s is not exist,please try again' % (key_name)
            return
        else:
            key = bucket.get_key(key_name)
        try:
            bucket.delete_key(key.name)
        except Exception:
            pass

    def delete_bucket(self, bucket_name):
        all_bucket_name_list = [i.name for i in self.conn.get_all_buckets()]
        if bucket_name not in all_bucket_name_list:
            print 'Bucket %s is not exist,please try again' % (bucket_name)
            return
        else:
            bucket = self.conn.get_bucket(bucket_name)
        try:
            self.conn.delete_bucket(bucket.name)
        except Exception:
            pass

    # build the chunk queue
    def init_queue(self, filesize, chunksize):  # 8<<20 : 8*2**20
        chunkcnt = int(math.ceil(filesize * 1.0 / chunksize))
        q = Queue.Queue(maxsize=chunkcnt)
        for i in range(0, chunkcnt):
            offset = chunksize * i
            length = min(chunksize, filesize - offset)
            c = Chunk(i + 1, offset, length)
            q.put(c)
        return q

    # upload a single part
    def upload_trunk(self, filepath, mp, q, id):
        while not q.empty():
            chunk = q.get()
            fp = FileChunkIO(filepath, 'r', offset=chunk.offset, bytes=chunk.length)
            mp.upload_part_from_file(fp, part_num=chunk.num)
            fp.close()
            q.task_done()

    # get file size ----> create S3 multipart upload ----> build the queue ----> split the file and upload the parts
    def upload_file_multipart(self, filepath, key_name, bucket_name, threadcnt=8):
        filesize = os.stat(filepath).st_size
        try:
            bucket = self.conn.get_bucket(bucket_name)
        except Exception as e:
            print 'bucket %s is not exist' % bucket_name
            tag = raw_input('Do you want to create the bucket %s: (Y/N)?' % bucket_name).strip()
            while tag not in ['Y', 'N']:
                tag = raw_input('Please input (Y/N)').strip()
            if tag == 'N':
                return
            elif tag == 'Y':
                self.conn.create_bucket(bucket_name)
                bucket = self.conn.get_bucket(bucket_name)
        all_key_name_list = [i.name for i in bucket.get_all_keys()]
        if key_name in all_key_name_list:
            while True:
                f_tag = raw_input(u'File already exists, sure you want to cover (Y/N)?: ').strip()
                if f_tag not in ['Y', 'N'] or len(f_tag) == 0:
                    continue
                elif f_tag == 'Y':
                    break
                elif f_tag == 'N':
                    return
        mp = bucket.initiate_multipart_upload(key_name)
        q = self.init_queue(filesize, self.chrunksize)
        for i in range(0, threadcnt):
            t = threading.Thread(target=self.upload_trunk, args=(filepath, mp, q, i))
            t.setDaemon(True)
            t.start()
        q.join()
        mp.complete_upload()

    # multipart download
    def download_chrunk(self, filepath, key_name, bucket_name, q, id):
        while not q.empty():
            chrunk = q.get()
            offset = chrunk.offset
            length = chrunk.length
            bucket = self.conn.get_bucket(bucket_name)
            resp = bucket.connection.make_request('GET', bucket_name, key_name,
                                                  headers={'Range': "bytes=%d-%d" % (offset, offset + length)})
            data = resp.read(length)
            fp = FileChunkIO(filepath, 'r+', offset=chrunk.offset, bytes=chrunk.length)
            fp.write(data)
            fp.close()
            q.task_done()

    def download_file_multipart(self, filepath, key_name, bucket_name, threadcnt=8):
        all_bucket_name_list = [i.name for i in self.conn.get_all_buckets()]
        if bucket_name not in all_bucket_name_list:
            print 'Bucket %s is not exist,please try again' % (bucket_name)
            return
        else:
            bucket = self.conn.get_bucket(bucket_name)
        all_key_name_list = [i.name for i in bucket.get_all_keys()]
        if key_name not in all_key_name_list:
            print 'File %s is not exist,please try again' % (key_name)
            return
        else:
            key = bucket.get_key(key_name)
        if not os.path.exists(os.path.dirname(filepath)):
            print 'Filepath %s is not exists, sure to create and try again' % (filepath)
            return
        if os.path.exists(filepath):
            while True:
                d_tag = raw_input('File %s already exists, sure you want to cover (Y/N)?' % (key_name)).strip()
                if d_tag not in ['Y', 'N'] or len(d_tag) == 0:
                    continue
                elif d_tag == 'Y':
                    os.remove(filepath)
                    break
                elif d_tag == 'N':
                    return
        os.mknod(filepath)
        filesize = key.size
        q = self.init_queue(filesize, self.chrunksize)
        for i in range(0, threadcnt):
            t = threading.Thread(target=self.download_chrunk, args=(filepath, key_name, bucket_name, q, i))
            t.setDaemon(True)
            t.start()
        q.join()

    def generate_object_download_urls(self, key_name, bucket_name, valid_time=0):
        all_bucket_name_list = [i.name for i in self.conn.get_all_buckets()]
        if bucket_name not in all_bucket_name_list:
            print 'Bucket %s is not exist,please try again' % (bucket_name)
            return
        else:
            bucket = self.conn.get_bucket(bucket_name)
        all_key_name_list = [i.name for i in bucket.get_all_keys()]
        if key_name not in all_key_name_list:
            print 'File %s is not exist,please try again' % (key_name)
            return
        else:
            key = bucket.get_key(key_name)
        try:
            key.set_canned_acl('public-read')
            download_url = key.generate_url(valid_time, query_auth=False, force_http=True)
            if self.port != 80:
                x1 = download_url.split('/')[0:3]
                x2 = download_url.split('/')[3:]
                s1 = u'/'.join(x1)
                s2 = u'/'.join(x2)
                s3 = ':%s/' % (str(self.port))
                download_url = s1 + s3 + s2
            print download_url
        except Exception:
            pass


if __name__ == '__main__':
    # Conventions:
    # 1: filepath is the absolute local path (upload source or download target)
    # 2: bucket_name is the directory/index name the file has in object storage
    # 3: key_name is the file name (object key) in object storage
    access_key = "FYT71CYU3UQKVMC8YYVY"
    secret_key = "rVEASbWAytjVLv1G8Ta8060lY3yrcdPTsEL0rfwr"
    ip = '127.0.0.1'
    port = 7480
    conn = CONNECTION(access_key, secret_key, ip, port)

    # list all buckets and their contents
    #conn.list_all()

    # simple upload, for files <= 8M
    #conn.upload_file('/etc/passwd','passwd','test_bucket01')
    conn.upload_file('/tmp/test.log', 'test1', 'test_bucket12')

    # list the files in a single bucket
    conn.list_single('test_bucket12')

    # simple download, for files <= 8M
    # conn.dowload_file('/lhf_test/test01','passwd','test_bucket01')
    # conn.list_single('test_bucket01')

    # delete a file
    # conn.delete_file('passwd','test_bucket01')
    # conn.list_single('test_bucket01')

    # delete a bucket
    # conn.delete_bucket('test_bucket01')
    # conn.list_all()

    # multipart upload (multi-threaded), for files > 8M; the 8M chunk size is tunable
    # but must not go below 8M, otherwise the upload errors out with parts too small
    # conn.upload_file_multipart('/etc/passwd','passwd_multi_upload','test_bucket01')
    # conn.list_single('test_bucket01')

    # multipart download (multi-threaded), for files > 8M; same chunk-size constraint
    # conn.download_file_multipart('/lhf_test/passwd_multi_dowload','passwd_multi_upload','test_bucket01')

    # generate a download url
    #conn.generate_object_download_urls('passwd_multi_upload','test_bucket01')
    #conn.list_all()
```
4.2 Resharding an existing bucket
To reshard the bucket index pool (see redhat-bucket_sharding):
```shell
# Make sure all operations against the bucket have been fully stopped,
# then back up its index with:
$ radosgw-admin bi list --bucket=<bucket_name> > <bucket_name>.list.backup

# The index can be restored from the backup with:
$ radosgw-admin bi put --bucket=<bucket_name> < <bucket_name>.list.backup

# Look up the bucket's index id:
$ radosgw-admin bucket stats --bucket=bucket-maillist
{
    "bucket": "bucket-maillist",
    "pool": "default.rgw.buckets.data",
    "index_pool": "default.rgw.buckets.index",
    "id": "0a6967a5-2c76-427a-99c6-8a788ca25034.54133.1",   # note this id
    "marker": "0a6967a5-2c76-427a-99c6-8a788ca25034.54133.1",
    "owner": "user",
    "ver": "0#1,1#1",
    "master_ver": "0#0,1#0",
    "mtime": "2017-08-23 13:42:59.007081",
    "max_marker": "0#,1#",
    "usage": {},
    "bucket_quota": {
        "enabled": false,
        "max_size_kb": -1,
        "max_objects": -1
    }
}

# Reshard the bucket's index. This adjusts "bucket-maillist" to 4 shards;
# note that the command prints both the old and the new bucket instance id:
$ radosgw-admin bucket reshard --bucket="bucket-maillist" --num-shards=4
*** NOTICE: operation will not remove old bucket index objects ***
*** these will need to be removed manually ***
old bucket instance id: 0a6967a5-2c76-427a-99c6-8a788ca25034.54133.1
new bucket instance id: 0a6967a5-2c76-427a-99c6-8a788ca25034.54147.1
total entries: 3

# Then delete the old instance id with:
$ radosgw-admin bi purge --bucket="bucket-maillist" --bucket-id=0a6967a5-2c76-427a-99c6-8a788ca25034.54133.1

# Check the final result:
$ radosgw-admin bucket stats --bucket=bucket-maillist
{
    "bucket": "bucket-maillist",
    "pool": "default.rgw.buckets.data",
    "index_pool": "default.rgw.buckets.index",
    "id": "0a6967a5-2c76-427a-99c6-8a788ca25034.54147.1",   # the id has changed
    "marker": "0a6967a5-2c76-427a-99c6-8a788ca25034.54133.1",
    "owner": "user",
    "ver": "0#2,1#1,2#1,3#2",
    "master_ver": "0#0,1#0,2#0,3#0",
    "mtime": "2017-08-23 14:02:19.961205",
    "max_marker": "0#,1#,2#,3#",
    "usage": {
        "rgw.main": {
            "size_kb": 50,
            "size_kb_actual": 60,
            "num_objects": 3
        }
    },
    "bucket_quota": {
        "enabled": false,
        "max_size_kb": -1,
        "max_objects": -1
    }
}
```
Summary