Hive: Bucketing and Sampling Queries
1. Bucketed Table Data Storage
Partitioning targets the data's storage path (directories); bucketing targets the data files themselves.
Partitioning provides a convenient way to isolate data and speed up queries. However, not every dataset can be partitioned sensibly; in particular, as noted earlier, it can be hard to choose a partition granularity that yields reasonably sized partitions.
Bucketing is another technique for decomposing a dataset into more manageable parts.
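To make the contrast concrete, here is a minimal DDL sketch (the table names stu_part and stu_buck2 are hypothetical, not part of the session below). Note that a partition column (dt here) is a new pseudo-column added to the table, whereas a bucket column must be one of the table's own columns:

-- hypothetical partitioned table: each dt value gets its own HDFS directory
create table stu_part(id int, name string)
partitioned by (dt string)
row format delimited fields terminated by '\t';

-- hypothetical bucketed table: rows are hashed on id into exactly 4 files
create table stu_buck2(id int, name string)
clustered by (id) into 4 buckets
row format delimited fields terminated by '\t';

The session below creates and exercises such a bucketed table step by step.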
hive (default)> show databases;
OK
database_name
default
Time taken: 1.092 seconds, Fetched: 1 row(s)
hive (default)> create table stu_buck(id int, name string)
              > clustered by(id)
              > into 4 buckets
              > row format delimited fields terminated by '\t';
OK
Time taken: 0.443 seconds
hive (default)> desc formatted stu_buck;
OK
col_name	data_type	comment
# col_name            	data_type           	comment

id                  	int
name                	string

# Detailed Table Information
Database:           	default
Owner:              	root
CreateTime:         	Sun Nov 03 03:49:59 CST 2019
LastAccessTime:     	UNKNOWN
Protect Mode:       	None
Retention:          	0
Location:           	hdfs://mycluster/user/hive/warehouse/stu_buck
Table Type:         	MANAGED_TABLE
Table Parameters:
	transient_lastDdlTime	1572724199

# Storage Information
SerDe Library:      	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:        	org.apache.hadoop.mapred.TextInputFormat
OutputFormat:       	org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:         	No
Num Buckets:        	4
Bucket Columns:     	[id]
Sort Columns:       	[]
Storage Desc Params:
	field.delim         	\t
	serialization.format	\t
Time taken: 0.318 seconds, Fetched: 28 row(s)
hive (default)> load data local inpath "/root/student" into table stu_buck;
Loading data to table default.stu_buck
Table default.stu_buck stats: [numFiles=1, totalSize=54]
OK
Time taken: 1.82 seconds
hive (default)> create table stu(id int, name string)
              > row format delimited fields terminated by '\t';
OK
Time taken: 0.123 seconds
hive (default)> load data local inpath "/root/student" into table stu;
Loading data to table default.stu
Table default.stu stats: [numFiles=1, totalSize=54]
OK
Time taken: 0.742 seconds
hive (default)> truncate table stu_buck;
OK
Time taken: 0.276 seconds
hive (default)> select * from stu_buck;
OK
stu_buck.id	stu_buck.name
Time taken: 0.453 seconds
hive (default)> insert into table stu_buck select id,name from stu;
Query ID = root_20191103035354_d7ff026a-7592-48a1-ba5c-f0d2cced5d46
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1572895740509_0001, Tracking URL = http://henu3:8088/proxy/application_1572895740509_0001/
Kill Command = /opt/hadoop-2.6.5/bin/hadoop job  -kill job_1572895740509_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-11-03 03:54:21,088 Stage-1 map = 0%,  reduce = 0%
2019-11-03 03:54:42,550 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.63 sec
MapReduce Total cumulative CPU time: 1 seconds 630 msec
Ended Job = job_1572895740509_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://mycluster/user/hive/warehouse/stu_buck/.hive-staging_hive_2019-11-03_03-53-54_155_2767196214915907307-1/-ext-10000
Loading data to table default.stu_buck
Table default.stu_buck stats: [numFiles=1, numRows=7, totalSize=42, rawDataSize=35]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.63 sec   HDFS Read: 3233 HDFS Write: 114 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 630 msec
OK
id	name
Time taken: 51.336 seconds
The data was NOT bucketed! The job ran with zero reduce tasks, so all rows landed in a single output file (note numFiles=1 in the stats above).
Now set the key property hive.enforce.bucketing (it makes Hive plan one reducer per bucket; mapreduce.job.reduces=-1 lets Hive choose the reducer count itself) and re-run the insert:

hive (default)> set hive.enforce.bucketing=true;
hive (default)> set mapreduce.job.reduces=-1;
hive (default)> insert into table stu_buck
              > select id, name from stu;
Query ID = root_20191103035755_addfbe62-a68d-4e56-8f8b-bdc3fb44153b
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1572895740509_0002, Tracking URL = http://henu3:8088/proxy/application_1572895740509_0002/
Kill Command = /opt/hadoop-2.6.5/bin/hadoop job  -kill job_1572895740509_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
2019-11-03 03:58:21,998 Stage-1 map = 0%,  reduce = 0%
2019-11-03 03:58:31,546 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.08 sec
2019-11-03 03:58:49,596 Stage-1 map = 100%,  reduce = 17%, Cumulative CPU 4.45 sec
2019-11-03 03:58:51,903 Stage-1 map = 100%,  reduce = 25%, Cumulative CPU 5.71 sec
2019-11-03 03:58:53,050 Stage-1 map = 100%,  reduce = 50%, Cumulative CPU 9.12 sec
2019-11-03 03:58:54,155 Stage-1 map = 100%,  reduce = 75%, Cumulative CPU 12.65 sec
2019-11-03 03:58:56,572 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 15.17 sec
MapReduce Total cumulative CPU time: 15 seconds 170 msec
Ended Job = job_1572895740509_0002
Loading data to table default.stu_buck
Table default.stu_buck stats: [numFiles=5, numRows=14, totalSize=84, rawDataSize=70]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 4   Cumulative CPU: 15.17 sec   HDFS Read: 15296 HDFS Write: 240 SUCCESS
Total MapReduce CPU Time Spent: 15 seconds 170 msec
OK
id	name
Time taken: 64.242 seconds
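This time the insert ran with 4 reducers, one per bucket; each row goes to bucket hash(id) mod 4 (for an int column this is effectively id % 4). Note that the stats show numFiles=5 rather than 4: insert into appends, so the single file from the earlier unbucketed insert still sits in the table directory next to the 4 new bucket files, which is also why numRows doubled to 14. A quick way to verify on HDFS (a sketch, using the Location shown by desc formatted above; reducer output files are typically named 000000_0 through 000003_0):

hive (default)> dfs -ls /user/hive/warehouse/stu_buck;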
2. Sampling Queries on Bucketed Tables
For very large datasets, users sometimes need a representative subset of the results rather than the full result set. Hive can satisfy this need by sampling the table.
Query the data in table stu_buck:
hive (default)> select * from stu_buck tablesample(bucket 1 out of 4 on id);
Note: tablesample is the sampling clause; its syntax is TABLESAMPLE(BUCKET x OUT OF y).
y must be a multiple or a factor of the table's total bucket count; Hive derives the sampling fraction from y. For example, if the table has 4 buckets: when y=2, the sample covers 4/2 = 2 buckets; when y=8, it covers 4/8 = 1/2 of a bucket.
x indicates which bucket to start sampling from. When multiple buckets are sampled, each subsequent bucket number is the previous one plus y. For example, with 4 total buckets, tablesample(bucket 1 out of 2) samples 4/2 = 2 buckets: the 1st (x) and the 3rd (x+y).
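Applied to the 4-bucket stu_buck table, the two cases above correspond to queries like these (a sketch; the first reads buckets 1 and 3, the second reads half of bucket 1):

hive (default)> select * from stu_buck tablesample(bucket 1 out of 2 on id);
hive (default)> select * from stu_buck tablesample(bucket 1 out of 8 on id);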
Note: x must be less than or equal to y; otherwise the query fails with:
FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table stu_buck
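For instance, on the 4-bucket stu_buck table, a query with x greater than y, such as the following (hypothetical values), triggers the error above:

hive (default)> select * from stu_buck tablesample(bucket 5 out of 4 on id);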