# Hive dynamic partition implementation (hive-1.1.0)
The Hive version I am using is hive-1.1.0.
The default dynamic partition implementation in hive-1.1.0 is map-only, with no reduce phase, which its execution plan makes clear. (The full EXPLAIN output for the statement below is reproduced at the end of this post.)
```sql
insert overwrite table public_t_par partition(delivery_datekey)
select * from public_oi_fact_partition;
```

## Hive's default dynamic partition implementation (no shuffle needed)
So how does Hive implement dynamic partitioning with maps alone? Stage-1 decides the number of map tasks from the FileInputSplits; if the data volume is small, there may be only one map. Since this job has no reduce phase, there are no sort, merge, or combine steps at all.
Suppose the data this map reads contains 10 distinct values of delivery_datekey. Each time the map encounters a delivery_datekey it has not seen before, it opens a new file writer, so it ends up holding ten file writers open, which is quite wasteful of file handles. This is why Hive's strict mode restricts dynamic partitioning, and why, even with strict mode off, Hive caps the number of partitions a job may write, and even the number written per node: the fear is that a single dynamic-partition job exhausts the system's file handles and disrupts other running jobs.
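For reference, these are the guard rails the paragraph refers to, as they exist in Hive 1.x. The values shown are the stock defaults as I recall them, so treat them as illustrative rather than authoritative:

```sql
-- Dynamic-partition guard rails (defaults shown are illustrative; verify on your cluster)
SET hive.exec.dynamic.partition.mode=strict;       -- strict: at least one static partition required
SET hive.exec.max.dynamic.partitions=1000;         -- max partitions one job may create
SET hive.exec.max.dynamic.partitions.pernode=100;  -- max partitions per mapper/reducer node
SET hive.exec.max.created.files=100000;            -- max files one job may create in total
```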
After the Stage-1 maps have written data into their partitions, Stage-2 launches maps, at most one per generated partition, to merge the small files. In our case Stage-1 had only one map, so no partition contains more than one file, and Stage-2 (the per-partition small-file merge) is skipped.
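This merge is the conditional Stage-7/Stage-3/Stage-4/Stage-5 group visible in the plan at the end of the post, and it is governed by Hive's merge settings. A minimal sketch, with what I believe are the stock defaults:

```sql
SET hive.merge.mapfiles=true;                -- merge small files after map-only jobs
SET hive.merge.mapredfiles=false;            -- also merge small files after map-reduce jobs
SET hive.merge.smallfiles.avgsize=16000000;  -- trigger a merge when avg output file size < ~16 MB
SET hive.merge.size.per.task=256000000;      -- aim for ~256 MB per merged file
```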
## Hive's optimized dynamic partition implementation (reduce enabled, shuffle required)
```sql
-- This setting enables the dynamic partition optimization; in effect it forces a reduce phase.
SET hive.optimize.sort.dynamic.partition=true;
```

The idea is to use the partition column delivery_datekey as the shuffle key and the remaining columns as the value (the details may differ from the source code, which I have not read; however it is implemented, whether the sort also involves other columns does not change the overall picture). Rows are partitioned by delivery_datekey and shuffled to the reducers, so merging small files is now handled by the reduce phase. A single reducer may write several partitions, and the small-file problem does not arise.
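For intuition, a similar shuffle can be forced by hand with DISTRIBUTE BY. This is my own sketch of the idea, not what the optimization literally does, reusing the tables from this post:

```sql
INSERT OVERWRITE TABLE public_t_par PARTITION (delivery_datekey)
SELECT *
FROM public_oi_fact_partition
-- send every row of a given delivery_datekey to the same reducer,
-- so each partition is written by exactly one file writer
DISTRIBUTE BY delivery_datekey;
```

The price is the same in both cases: a full shuffle of the data, plus potential reducer skew if one partition value dominates.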
The two implementations each have pros and cons, which I'll summarize later.
## Summary

To be continued... One more note: if the target table is stored as ORC or Parquet, dynamic partitioning can sometimes fail with a Java heap OOM.
I'll explain the cause in more detail later (to be continued), but the short version is presumably this: under the default map-only implementation, each map holds one ORC/Parquet writer open per partition, and columnar writers buffer a sizeable batch of rows in memory before flushing, so a high-cardinality partition column multiplies that buffering until the map's heap runs out.
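A cheap way to anticipate the problem is to check the partition column's cardinality before running the insert, since under the default map-only plan a single map may end up holding roughly that many writers open at once:

```sql
-- if this number is large, expect many concurrent ORC/Parquet writers per map
SELECT COUNT(DISTINCT delivery_datekey) AS partitions_to_write
FROM public_oi_fact_partition;
```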
## Solutions
If the OOM appears, it means you are using Hive's default implementation with ORC or Parquet as the target table's format. Three fixes:

1. Force-enable the reduce phase, which resolves it: `SET hive.optimize.sort.dynamic.partition=true;`
2. Shrink the max split size, which raises the number of maps and spreads the partition cardinality across them, easing the memory pressure on any single map (how well this works also depends on the data distribution): set `mapred.max.split.size` to a value below 128 MB.
3. Enlarge the map heap via `mapreduce.map.memory.mb` and `mapreduce.map.java.opts`.

```sql
-- parameters Hive needs for dynamic partitions; tune the last two to your cluster
set hive.mapred.mode=nonstrict;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=1500;
SET hive.exec.max.dynamic.partitions=3000;
```

And here is the execution plan promised earlier, for the default implementation:

```
hive (public)> explain insert overwrite table public_t_par partition(delivery_datekey) select * from public_oi_fact_partition;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-7 depends on stages: Stage-1 , consists of Stage-4, Stage-3, Stage-5
  Stage-4
  Stage-0 depends on stages: Stage-4, Stage-3, Stage-6
  Stage-2 depends on stages: Stage-0
  Stage-3
  Stage-5
  Stage-6 depends on stages: Stage-5

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: public_oi_fact_partition
            Statistics: Num rows: 110000 Data size: 35162211 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: order_datekey (type: int), oiid (type: bigint), custom_order_id (type: bigint), ciid (type: int), bi_name (type: string), siid (type: int), si_name (type: string), classify (type: string), status (type: int), status_text (type: string), class1_id (type: int), class1_name (type: string), class2_id (type: int), class2_name (type: string), city_id (type: int), city_name (type: string), operate_area (type: int), company_id (type: int), standard_item_num (type: int), package_num (type: double), expect_num (type: decimal(30,6)), price (type: decimal(30,6)), order_weight (type: decimal(30,6)), order_amount (type: decimal(30,6)), order_money (type: decimal(30,6)), ci_weight (type: decimal(30,6)), c_t (type: string), u_t (type: string), real_num (type: decimal(30,6)), real_weight (type: decimal(30,6)), real_money (type: decimal(30,6)), cost_price (type: decimal(30,6)), cost_money (type: decimal(30,6)), price_unit (type: string), order_money_coupon (type: decimal(30,6)), real_money_coupon (type: decimal(30,6)), real_price (type: decimal(30,6)), f_activity (type: int), activity_type (type: tinyint), is_activity (type: tinyint), original_price (type: decimal(30,6)), car_group_id (type: bigint), driver_id (type: string), expect_pay_way (type: int), desc (type: string), coupon_score_amount (type: decimal(30,6)), sale_area_id (type: int), delivery_area_id (type: int), tag (type: int), promote_tag_id (type: bigint), promote_tag_name (type: string), pop_id (type: bigint), delivery_datekey (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27, _col28, _col29, _col30, _col31, _col32, _col33, _col34, _col35, _col36, _col37, _col38, _col39, _col40, _col41, _col42, _col43, _col44, _col45, _col46, _col47, _col48, _col49, _col50, _col51, _col52
              Statistics: Num rows: 110000 Data size: 35162211 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 110000 Data size: 35162211 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    name: public.public_t_par

  Stage: Stage-7
    Conditional Operator

  Stage: Stage-4
    Move Operator
      files:
          hdfs directory: true
          destination: hdfs://ns1/user/hive/warehouse/public.db/public_t_par/.hive-staging_hive_2018-06-08_15-41-18_222_4176438830382881060-1/-ext-10000

  Stage: Stage-0
    Move Operator
      tables:
          partition:
            delivery_datekey
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: public.public_t_par

  Stage: Stage-2
    Stats-Aggr Operator

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            File Output Operator
              compressed: false
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: public.public_t_par

  Stage: Stage-5
    Map Reduce
      Map Operator Tree:
          TableScan
            File Output Operator
              compressed: false
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: public.public_t_par

  Stage: Stage-6
    Move Operator
      files:
          hdfs directory: true
          destination: hdfs://ns1/user/hive/warehouse/public.db/public_t_par/.hive-staging_hive_2018-06-08_15-41-18_222_4176438830382881060-1/-ext-10000
```
Once all the partition directories are written, the temporary directory _tmp.-ext-10000 is renamed to -ext-10000; see Stage-6 (the final Move Operator) in the plan above.
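After that final Move, the table location holds one subdirectory per partition value, which can be confirmed with SHOW PARTITIONS (the values below are made up for illustration):

```sql
SHOW PARTITIONS public_t_par;
-- delivery_datekey=20180601
-- delivery_datekey=20180602
-- ...
```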
Posted on 2018-08-31 14:08 by 姜小嫌. Reposted from: https://www.cnblogs.com/jiangxiaoxian/p/9565500.html