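Column families and columns in HBase: why using too many column families is discouraged
An HBase table can be configured with one or more column families, so why is the usual advice to keep the number of column families as small as possible?
From the official documentation: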
HBase currently does not do well with anything above two or three column families so keep the number of column families in your schema low. Currently, flushing and compactions are done on a per Region basis so if one column family is carrying the bulk of the data bringing on flushes, the adjacent families will also be flushed even though the amount of data they carry is small. When many column families exist the flushing and compaction interaction can make for a bunch of needless i/o (To be addressed by changing flushing and compaction to work on a per column family basis).
A quick recap of HBase tables: each table is divided into multiple regions, each region holds a subset of the table's data, and the regions are distributed across the RegionServers of the HBase cluster.
Within a region, the data of each column family forms a Store. Each Store consists of one MemStore and multiple HFiles (one column family maps to one MemStore and N HFiles).
When a flush condition is met, each MemStore is flushed into a new HFile. As HFiles accumulate, a background minor compaction thread merges them.
Here is the key point: both flush and compaction operate at the region level.
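The region/Store/MemStore/HFile relationship described above can be sketched as a small toy model. This is only an illustration of the structure, not HBase code: the class names `Store` and `Region`, the `put` and `flush` methods, and the byte counts are all assumptions made up for the example.

```python
from dataclasses import dataclass, field

MB = 1024 * 1024


@dataclass
class Store:
    """One column family inside one region: one memstore plus N HFiles."""
    family: str
    memstore_bytes: int = 0
    hfiles: list = field(default_factory=list)  # sizes of flushed files, in bytes


@dataclass
class Region:
    """A toy region holding one Store per column family."""
    stores: dict = field(default_factory=dict)

    def put(self, family, nbytes):
        # Writes land in the memstore of the target column family.
        store = self.stores.setdefault(family, Store(family))
        store.memstore_bytes += nbytes

    def flush(self):
        # Region-wide flush: every non-empty memstore becomes a new HFile,
        # no matter how little data it holds.
        for store in self.stores.values():
            if store.memstore_bytes > 0:
                store.hfiles.append(store.memstore_bytes)
                store.memstore_bytes = 0


region = Region()
region.put("big", 128 * MB)   # the family carrying the bulk of the data
region.put("small", 1 * MB)   # an adjacent family with very little data
region.flush()
print({name: s.hfiles for name, s in region.stores.items()})
```

In this model the `small` family ends up with a 1 MB HFile purely because it shares a region with `big`, which is the effect the documentation quote warns about.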
For example, when a region has several MemStores (several column families), as soon as one MemStore meets a flush condition, the others are flushed along with it even if they hold very little data, which causes a lot of unnecessary I/O. The conditions that trigger a flush are:
MemStore-level limit: when the size of any MemStore in a region reaches the limit (hbase.hregion.memstore.flush.size, default 128 MB), a MemStore flush is triggered.
Region-level limit: when the combined size of all MemStores in a region reaches the limit (hbase.hregion.memstore.block.multiplier * hbase.hregion.memstore.flush.size, default 2 * 128 MB = 256 MB), a MemStore flush is triggered.
RegionServer-level limit: when the combined size of all MemStores on a RegionServer reaches the limit (hbase.regionserver.global.memstore.upperLimit * hbase_heapsize, default 40% of the JVM heap), some MemStores are flushed. Flushing proceeds from the largest MemStore to the smallest: the region with the largest MemStore is flushed first, then the next largest, until total MemStore usage drops below the threshold (hbase.regionserver.global.memstore.lowerLimit * hbase_heapsize, default 38% of the JVM heap).
HLog count limit: when the number of HLogs on a RegionServer reaches the limit (configurable via hbase.regionserver.maxlogs), the system picks the one or more regions referenced by the oldest HLog and flushes them.
Periodic flush: HBase flushes MemStores periodically, by default every hour, so that no MemStore goes unpersisted for too long. To avoid the problems caused by all MemStores flushing at the same moment, the periodic flush is given a random delay of about 20000 ms.
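The three size-based conditions in the list above can be written down as a small check. This is a sketch of the rules as the article states them, not the HBase implementation: `flush_trigger` and `regions_to_flush` are hypothetical helper names, and the default arguments simply mirror the values quoted in the text (128 MB, multiplier 2, 40% and 38% of the heap), which may differ between HBase versions.

```python
MB = 1024 * 1024


def flush_trigger(region_memstores, rs_memstore_total, heap_bytes,
                  flush_size=128 * MB, block_multiplier=2, upper_limit=0.40):
    """Return the first size-based condition that fires, or None.

    region_memstores: {column family: memstore bytes} for one region.
    rs_memstore_total: memstore bytes summed over the whole RegionServer.
    """
    if max(region_memstores.values()) >= flush_size:
        return "memstore"
    if sum(region_memstores.values()) >= block_multiplier * flush_size:
        return "region"
    if rs_memstore_total >= upper_limit * heap_bytes:
        return "regionserver"
    return None


def regions_to_flush(region_sizes, heap_bytes, lower_limit=0.38):
    """Pick regions largest-first until total usage is below the lower limit."""
    total = sum(region_sizes.values())
    chosen = []
    by_size = sorted(region_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in by_size:
        if total < lower_limit * heap_bytes:
            break
        chosen.append(name)
        total -= size
    return chosen


heap = 10240 * MB
print(flush_trigger({"a": 130 * MB, "b": 1 * MB}, 131 * MB, heap))
print(flush_trigger({"a": 90 * MB, "b": 90 * MB, "c": 90 * MB}, 270 * MB, heap))
print(flush_trigger({"a": 10 * MB}, 4200 * MB, heap))
print(regions_to_flush({"r1": 300 * MB, "r2": 80 * MB, "r3": 30 * MB}, 1000 * MB))
```

The second call is the interesting one for this article: no single MemStore is near 128 MB, but three column families at 90 MB each push the region past the 256 MB region-level limit, so adding families makes that limit easier to hit.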
Compaction is likewise carried out at the region level, so it incurs the same kind of unnecessary I/O. The condition that triggers a minor compaction is:
hbase.hstore.compactionThreshold
Description
If more than this number of HStoreFiles in any one HStore (one HStoreFile is written per flush of memstore) then a compaction is run to rewrite all HStoreFiles files as one. Larger numbers put off compaction but when it runs, it takes longer to complete.
Default: 3
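To see how region-wide flushes feed into compaction, the following sketch counts compactions per column family under a deliberately simplified rule taken from the description above: once a store holds more than `threshold` files, all of them are rewritten as one. The function `simulate` and its parameters are hypothetical, and real HBase file selection is more selective than this.

```python
MB = 1024 * 1024


def simulate(flushes, per_flush_bytes, threshold=3):
    """Count compactions and bytes rewritten per column family.

    per_flush_bytes: {column family: bytes written to its HFile on each flush}.
    """
    files = {cf: [] for cf in per_flush_bytes}
    stats = {cf: {"compactions": 0, "rewritten": 0} for cf in per_flush_bytes}
    for _ in range(flushes):
        for cf, nbytes in per_flush_bytes.items():
            # Region-wide flush: every family gets one new file.
            files[cf].append(nbytes)
            # "More than this number" of store files triggers a compaction.
            if len(files[cf]) > threshold:
                merged = sum(files[cf])
                stats[cf]["compactions"] += 1
                stats[cf]["rewritten"] += merged
                files[cf] = [merged]  # all files rewritten as one
    return stats


stats = simulate(flushes=12, per_flush_bytes={"big": 128 * MB, "small": 1 * MB})
for cf, s in stats.items():
    print(cf, s["compactions"], s["rewritten"] // MB, "MB rewritten")
```

Under these assumptions the small family is compacted exactly as often as the big one, because its file count grows in lockstep with the region's flushes rather than with its own data volume.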
Where multiple ColumnFamilies exist in a single table, be aware of the cardinality (i.e., number of rows). If ColumnFamilyA has 1 million rows and ColumnFamilyB has 1 billion rows, ColumnFamilyA’s data will likely be spread across many, many regions (and RegionServers). This makes mass scans for ColumnFamilyA less efficient.
The reason lies in how splits are triggered. Region splits are governed by hbase.hregion.max.filesize, but a split is not triggered when all HFiles in the region together reach this value; it is triggered when the store files of any one column family in the region reach it. So when ColumnFamilyB forces a split, ColumnFamilyA is split along with it even though it still holds very little data. This produces more small files on HDFS and scatters ColumnFamilyA's data across more regions and RegionServers.
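The split behaviour can be illustrated with a rough calculation. The sketch below keeps halving a region until no column family in it reaches the size limit, assuming data is spread evenly and each split divides every family in half. `split_until_small` is a hypothetical helper, and the sizes (1 GB, 1000 GB, a 10 GB limit) are example values chosen for the illustration.

```python
def split_until_small(cf_sizes, max_filesize):
    """Halve regions until no column family in any region reaches max_filesize.

    cf_sizes: {column family: size} for the initial single region.
    Returns the list of resulting regions, each a {column family: size} dict.
    """
    pending = [dict(cf_sizes)]
    done = []
    while pending:
        region = pending.pop()
        if max(region.values()) >= max_filesize:
            # A split cuts the whole region, so every family is halved,
            # including the ones that are nowhere near the limit.
            half = {cf: size / 2 for cf, size in region.items()}
            pending += [half, dict(half)]
        else:
            done.append(region)
    return done


# Sizes in MB: ColumnFamilyA holds 1 GB, ColumnFamilyB holds 1000 GB.
regions = split_until_small({"A": 1024, "B": 1024 * 1000}, max_filesize=10240)
print(len(regions), "regions,", regions[0]["A"], "MB of ColumnFamilyA in each")
```

With these numbers ColumnFamilyB drives the table to 128 regions, leaving only about 8 MB of ColumnFamilyA in each, so a full scan of ColumnFamilyA has to visit every one of them.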