當前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

oracle rollup分组没有数据时为0_Hive 入门数据分析基础 5

發布時間：2025/3/15 编程问答 22 豆豆

生活随笔收集整理的這篇文章主要介紹了 oracle rollup分组没有数据时为0_Hive 入门数据分析基础 5 小編覺得挺不錯的,現在分享給大家,幫大家做個參考.

六, 常用優化技巧---主要內容和思路來源:開課吧學習筆記

1, 用 group by 代替 distinct 去重

在數據量大, 重復值多時, 能先使用 group by 去重的, 使用 group by 去重后再計算, 比之后再去重計算效率更高.

例如前面的例題中, 查詢 2019 年購買后又退款的用戶, 可以這樣優化:

select a.user_name
from
(select user_name
from user_trade
where year(dt)=2019
group by user_name) a
join
(select user_name
from user_refund
where year(dt)=2019
group by user_name) b
on a.user_name=b.user_name;

2, 多維度聚合

需要對多個字段進行聚合運算, 一個字段一個字段的寫 SQL 計算, 不如使用多維度聚合計算語句: grouping sets, cube 和 rollup.

2.01, 分別查詢用性別, 城市, 等級分布:

select sex, city, level,
count(user_id)
from user_info
group by sex, city, level;

上述 sql 查詢出來的是同一性別, 不同城市, 不同等級的分布, 而我們需要的結果是三個字段的結果是分開的:

select sex, city, level,
count(user_id)
from user_info
group by sex, city, level
grouping sets (sex, city, level);

sex	city	level	_c3
NULL	NULL	1	33
NULL	NULL	2	26
NULL	NULL	3	30
NULL	NULL	4	44
NULL	NULL	5	30
NULL	NULL	6	32
NULL	NULL	7	35
NULL	NULL	8	34
NULL	NULL	9	30
NULL	NULL	10	33
NULL	beijing	NULL	53
NULL	changchun	NULL	53
NULL	guangzhou	NULL	55
NULL	hangzhou	NULL	57
NULL	shanghai	NULL	61
NULL	shenzhen	NULL	48
female	NULL	NULL	177
male	NULL	NULL	150

a, grouping sets() 相當于將不同 group by 聚合的結果進行 union all, 可以在括號中指明聚合規則.

2.02, 查詢性別分布, 以及不同性別的城市分布:

select sex, city,
count(user_id)
from user_info
group by sex, city
grouping sets (sex, (sex, city));

sex	city	_c2
female	NULL	177
female	beijing	26
female	changchun	32
female	guangzhou	30
female	hangzhou	30
female	shanghai	36
female	shenzhen	23
male	NULL	150
male	beijing	27
male	changchun	21
male	guangzhou	25
male	hangzhou	27
male	shanghai	25
male	shenzhen	25

城市一列為 null 的是性別分布, 其他的是每個性別的城市分布.

b, with cube將分組聚合的所有維度都進行計算:

2.03, 查詢性別, 城市, 等級各種組合的用戶分布情況:

-- 方法一
select sex, city, level,
count(user_id)
from user_info
group by sex, city, level
grouping sets (sex, city, level,
(sex, city), (sex, level), (city, level),
(sex, city, level));

-- 方法二
select sex, city, level,
count(user_id)
from user_info
group by sex, city, level
with cube;

方法二將所有用戶數進行了統計, 方法一沒有.

c, with rollup以最左側的字段為主, 進行層級聚合, 結果是 with cube 的子集.

2.04, 查詢每個月的支付金額及每年的支付金額:

-- 方法一
select a.dt,
sum(a.year_amount),
sum(a.month_amount)
from
(select substr(dt, 1, 4) dt,
sum(pay_amount) year_amount,
0 month_amount
from user_trade
where dt>'0'
group by substr(dt, 1, 4)
union all
select substr(dt, 1, 7) dt,
0 year_mount,
sum(pay_amount) month_amount
from user_trade
where dt>'0'
group by substr(dt, 1, 7)
) a
group by a.dt;

a.dt	_c1	_c2
2017	24333973.60	0.00
2017-01	0.00	241755.70
2017-02	0.00	2582410.60

第一列是年的總額.

-- 方法二
select year(dt) year,
month(dt) month,
sum(pay_amount)
from user_trade
where dt>'0'
group by year(dt), month(dt)
with rollup;

year	month	_c2
NULL	NULL	62348041.30
2017	NULL	24333973.60
2017	1	241755.70
2017	2	2582410.60

第一列是全部的總額, 第二列是年的總額. with rollup計算了全部的總額和以 year 字段為主的兩個維度 year, (year, month)的總額.

3, 轉換解題思路

前面的例題, 查詢 2017 和 2018 都購買的用戶, 可以如下優化:

select a.user_name
from (select user_name,
count(distinct year(dt)) year_num
from user_trade
where year(dt) in (2017, 2018)
group by user_name) a
where a.year_num=2;

-- 還可再優化
select user_name,
count(distinct year(dt)) year_num
from user_trade
where year(dt) in (2017, 2018)
group by user_name
having count(distinct year(dt))=2;

4, 有 union all 查詢時, 開啟并發執行

開啟參數設置: set hive.exec.parallel=true

查看是否設置成功: set hive.exec.parallel;

多臺服務器時開啟才有效.

5, 同一字段的數據展開, 或按分組歸類

有用戶購買的商品表 user_goods_category:

col_name	data_type
user_name	string
category_detail	string

部分數據:

Abby	clothes,food,electronics
Ailsa	book,clothes,food
Albert	clothes,electronics,computer

a, 同一字段的數據展開:

基表 lateral view explode(列表字段) 單列表 as 列名1

得到的結果是一個將基表的列表字段展開得到列名為列名1 的單列表與基表合并的特殊表.

5.01, 查詢每個商品品類的購買用戶數:

select b.category,
count(distinct a.user_name)
from user_goods_category a
lateral view explode(split(category_detail, ',')) b as category
group by b.category;

b.category	_c1
book	99
clothes	110
computer	99
electronics	99
food	105
shoes	91

split()將字符串分割并以列表形式返回.

explode()將值為列表形式的字段展開成多行.

from語句后面得到的特殊表如下:

a.user_name	a.category_detail	b.category
Abby	clothes,food,electronics	clothes
Abby	clothes,food,electronics	food
Abby	clothes,food,electronics	electronics
Ailsa	book,clothes,food	book
Ailsa	book,clothes,food	clothes
Ailsa	book,clothes,food	food

b, 同一字段按分組歸類:

5.02, 查看每個商品品類都有哪些用戶購買:

select b.category,
collect_set(distinct a.user_name)
from user_goods_category a
lateral view explode(split(category_detail, ',')) b as category
group by b.category;

b.category	_c1
book	["Ailsa","Alexander",…(后面省略)]
clothes	["Abby","Ailsa",...(后面省略)]

collect_set()將分組后分到同一組的值全部放到一個列表里顯示出來, 還可以再用 concat_ws('分隔符', ...) 將列表中的值連接成字符串.

6, 表連接優化

使用相同的連接鍵
對3個以上表進行 join 連接, on 條件使用相同的連接鍵, 只會產生一個 MapReduce job.
盡早進行數據過濾
例如數據只選擇需要的區段和字段, 分組去重等.
邏輯過于復雜時引入中間表

7, 防止數據傾斜

數據傾斜: 任務執行過程中, 大部分任務執行完成, 少數任務一直在執行中的情況.

空值產生的數據傾斜
表連接時, 連接字段有空值, 增加空值過濾條件, 例如:on a.id=b.id and a.id is not null
表連接時, 連接字段數據類型不一致
將數據類型轉換一致, 例如:on a.id=cast(b.id as string)

end

? ? ? ?點擊下方

總結

以上是生活随笔為你收集整理的oracle rollup分组没有数据时为0_Hive 入门数据分析基础 5的全部內容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網站內容還不錯，歡迎將生活随笔推薦給好友。

上一篇： Linux基础学习二：计算机组成以及Li
下一篇：自动化构建部署之Jenkins安装【Do