當(dāng)前位置：首頁(yè) > 运维知识 > 数据库 >内容正文

数据库

SQL优化（二）快速计算Distinct Count

發(fā)布時(shí)間：2025/3/20 数据库 49 豆豆

生活随笔收集整理的這篇文章主要介紹了 SQL优化（二）快速计算Distinct Count 小編覺(jué)得挺不錯(cuò)的,現(xiàn)在分享給大家,幫大家做個(gè)參考.

2019獨(dú)角獸企業(yè)重金招聘Python工程師標(biāo)準(zhǔn)>>>

原創(chuàng)文章，首發(fā)自個(gè)人站點(diǎn)?，轉(zhuǎn)載請(qǐng)務(wù)必注明出處?http://www.jasongj.com/2015/03/15/count_distinct/

歡迎關(guān)注微信公眾號(hào)【大數(shù)據(jù)架構(gòu)】

UV vs. PV

　　在互聯(lián)網(wǎng)中，經(jīng)常需要計(jì)算UV和PV。所謂PV即Page View，網(wǎng)頁(yè)被打開(kāi)多少次（YouTube等視頻網(wǎng)站非常重視視頻的點(diǎn)擊率，即被播放多少次，也即PV）。而UV即Unique Visitor（微信朋友圈或者微信公眾號(hào)中的文章則統(tǒng)計(jì)有多少人看過(guò)該文章，也即UV。雖然微信上顯示是指明該值是PV，但經(jīng)筆者測(cè)試，實(shí)為UV）。這兩個(gè)概念非常重要，比如淘寶賣(mài)家在做活動(dòng)時(shí)，他往往需要統(tǒng)計(jì)寶貝被看了多少次，有多少個(gè)不同的人看過(guò)該活動(dòng)介紹。至于如何在互聯(lián)網(wǎng)上唯一標(biāo)識(shí)一個(gè)自然人，也是一個(gè)難點(diǎn)，目前還沒(méi)有一個(gè)非常準(zhǔn)確的方法，常用的方法是用戶(hù)名加cookie，這里不作深究。

count distinct vs. count group by

　　很多情景下，尤其對(duì)于文本類(lèi)型的字段，直接使用count distinct的查詢(xún)效率是非常低的，而先做group by更c(diǎn)ount往往能提升查詢(xún)效率。但實(shí)驗(yàn)表明，對(duì)于不同的字段，count distinct與count group by的性能并不一樣，而且其效率也與目標(biāo)數(shù)據(jù)集的數(shù)據(jù)重復(fù)度相關(guān)。

　　本節(jié)通過(guò)幾組實(shí)驗(yàn)說(shuō)明了不同場(chǎng)景下不同query的不同效率，同時(shí)分析性能差異的原因。（本文所有實(shí)驗(yàn)皆基于PostgreSQL 9.3.5平臺(tái)）
分別使用count distinct 和 count group by對(duì) bigint, macaddr, text三種類(lèi)型的字段做查詢(xún)。
首先創(chuàng)建如下結(jié)構(gòu)的表

Column Type Modifiers

mac_bigint	bigint
mac_macaddr	macaddr
mac_text	text

并插入1000萬(wàn)條記錄，并保證mac_bigint為mac_macaddr去掉冒號(hào)后的16進(jìn)制轉(zhuǎn)換而成的10進(jìn)制bigint，而mac_text為mac_macaddr的文本形式，從而保證在這三個(gè)字段上查詢(xún)的結(jié)果，也及復(fù)雜度相同。

　　count distinct SQL如下

select count(distinct mac_macaddr) from testmac

　　count group by SQL如下

select count(*) from (select mac_macaddr from testmac group by 1) foo

　　對(duì)于不同記錄數(shù)較大的情景（1000萬(wàn)條記錄中，有300多萬(wàn)條不同記錄），查詢(xún)時(shí)間（單位毫秒）如下表所示。

query/字段類(lèi)型 macaddr bigint text

count distinct	24668.023	13890.051	149048.911
count group by	32152.808	25929.555	159212.700

　　對(duì)于不同記錄數(shù)較小的情景（1000萬(wàn)條記錄中，只有1萬(wàn)條不同記錄），查詢(xún)時(shí)間（單位毫秒）如下表所示。

query/字段類(lèi)型 macaddr bigint text

count distinct	20006.681	9984.763	225208.133
count group by	2529.420	2554.720	3701.869

　　從上面兩組實(shí)驗(yàn)可看出，在不同記錄數(shù)較小時(shí)，count group by性能普遍高于count distinct，尤其對(duì)于text類(lèi)型表現(xiàn)的更明顯。而對(duì)于不同記錄數(shù)較大的場(chǎng)景，count group by性能反而低于直接count distinct。為什么會(huì)造成這種差異呢，我們以macaddr類(lèi)型為例來(lái)對(duì)比不同結(jié)果集下count group by的query plan。
　　當(dāng)結(jié)果集較小時(shí)，planner會(huì)使用HashAggregation。

explain analyze select count(*) from (select mac_macaddr from testmac_small group by 1) foo;QUERY PLANAggregate (cost=668465.04..668465.05 rows=1 width=0) (actual time=9166.486..9166.486 rows=1 loops=1)-> HashAggregate (cost=668296.74..668371.54 rows=7480 width=6) (actual time=9161.796..9164.393 rows=10001 loops=1)-> Seq Scan on testmac_small (cost=0.00..572898.79 rows=38159179 width=6) (actual time=323.338..5091.112 rows=10000000 l oops=1)

　　而當(dāng)結(jié)果集較大時(shí)，無(wú)法通過(guò)在內(nèi)存中維護(hù)Hash表的方式使用HashAggregation，planner會(huì)使用GroupAggregation，并會(huì)用到排序，而且因?yàn)槟繕?biāo)數(shù)據(jù)集太大，無(wú)法在內(nèi)存中使用Quick Sort，而要在外存中使用Merge Sort，而這就極大的增加了I/O開(kāi)銷(xiāo)。

explain analyze select count(*) from (select mac_macaddr from testmac group by 1) foo;QUERY PLANAggregate (cost=1881542.62..1881542.63 rows=1 width=0) (actual time=34288.232..34288.232 rows=1 loops=1)-> Group (cost=1794262.09..1844329.41 rows=2977057 width=6) (actual time=25291.372..33481.228 rows=3671797 loops=1)-> Sort (cost=1794262.09..1819295.75 rows=10013464 width=6) (actual time=25291.366..29907.351 rows=10000000 loops=1)Sort Key: testmac.mac_macaddrSort Method: external merge Disk: 156440kB-> Seq Scan on testmac (cost=0.00..219206.64 rows=10013464 width=6) (actual time=0.082..4312.053 rows=10000000 loo ps=1)

dinstinct count高效近似算法

　　由于distinct count的需求非常普遍（如互聯(lián)網(wǎng)中計(jì)算UV），而該計(jì)算的代價(jià)又相比較高，很難適應(yīng)實(shí)時(shí)性要求較高的場(chǎng)景，如流計(jì)算，因此有很多相關(guān)研究試圖解決該問(wèn)題。比較著名的算法有daptive sampling Algorithm，Distinct Counting with a Self-Learning Bitmap，HyperLogLog，LogLog，Probabilistic Counting Algorithms。這些算法都不能精確計(jì)算distinct count，都是在保證誤差較小的情況下高效計(jì)算出結(jié)果。本文分別就這幾種算法做了兩組實(shí)驗(yàn)。

數(shù)據(jù)集100萬(wàn)條，每條記錄均不相同，幾種算法耗時(shí)及內(nèi)存使用如下。

algorithm result error time(ms) memory (B)

count(distinct)	1000000	0%	14026	？
Adaptive Sampling	1008128	0.8%	8653	57627
Self-learning Bitmap	991651	0.9%	1151	65571
Bloom filter	788052	22%	2400	1198164
Probalilistic Counting	1139925	14%	3613	95
PCSA	841735	16%	842	495

數(shù)據(jù)集100萬(wàn)條，只有100條不同記錄，幾種近似算法耗時(shí)及內(nèi)存使用如下。

algorithm result error time(ms) memory (B)

count(distinct)	100	0%	75306	？
Adaptive Sampling	100	0%	1491	57627
Self-learning Bitmap	101	1%	1031	65571
Bloom filter	100	0%	1675	1198164
Probalilistic Counting	95	5%	3613	95
PCSA	98	2%	852	495

　　
　　從上面兩組實(shí)驗(yàn)可看出，大部分的近似算法工作得都很好，其速度都比簡(jiǎn)單的count distinct要快很多，而且它們對(duì)內(nèi)存的使用并不多而結(jié)果去非常好，尤其是Adaptive Sampling和Self-learning Bitmap，誤差一般不超過(guò)1%，性能卻比簡(jiǎn)單的count distinct高十幾倍乃至幾十倍。　　

distinct count結(jié)果合并

　　如上幾種近似算法可以極大提高distinct count的效率，但對(duì)于data warehouse來(lái)說(shuō)，數(shù)據(jù)量非常大，可能存儲(chǔ)了幾年的數(shù)據(jù)，為了提高查詢(xún)速度，對(duì)于sum及avg這些aggregation一般會(huì)創(chuàng)建一些aggregation table。比如如果要算過(guò)去三年的總營(yíng)業(yè)額，那可以創(chuàng)建一張daily/monthly aggregation table，基于daily/monthly表去計(jì)算三年的營(yíng)業(yè)額。但對(duì)于distinct count，即使創(chuàng)建了daily/monthly aggregation table，也沒(méi)辦法通過(guò)其計(jì)算三年的數(shù)值。這里有種新的數(shù)據(jù)類(lèi)型hll，這是一種HyperLogLog數(shù)據(jù)結(jié)構(gòu)。一個(gè)1280字節(jié)的hll能計(jì)算幾百億的不同數(shù)值并且保證只有很小的誤差。
　　首先創(chuàng)建一張表(fact)，結(jié)構(gòu)如下

Column Type Modifiers

day	date
user_id	integer
sales	numeric

　插入三年的數(shù)據(jù)，并保證總共有10萬(wàn)個(gè)不同的user_id，總數(shù)據(jù)量為1億條（一天10萬(wàn)條左右）。

insert into fact select current_date - (random()*1095)::integer * '1 day'::interval,(random()*100000)::integer + 1,random() * 10000 + 500 from generate_series(1, 100000000, 1);

　　直接從fact表中查詢(xún)不同用戶(hù)的總數(shù)，耗時(shí)115143.217 ms。
利用hll，創(chuàng)建daily_unique_user_hll表，將每天的不同用戶(hù)信息存于hll類(lèi)型的字段中。

create table daily_unique_user_hll as select day, hll_add_agg(hll_hash_integer(user_id)) from fact group by 1;

　　通過(guò)上面的daily aggregation table可計(jì)算任意日期范圍內(nèi)的unique user count。如計(jì)算整個(gè)三年的不同用戶(hù)數(shù)，耗時(shí)17.485 ms，查詢(xún)結(jié)果為101044，誤差為(101044-100000)/100000=1.044%。

explain analyze select hll_cardinality(hll_union_agg(hll_add_agg)) from daily_unique_user_hll;QUERY PLANAggregate (cost=196.70..196.72 rows=1 width=32) (actual time=16.772..16.772 rows=1 loops=1)-> Seq Scan on daily_unique_user_hll (cost=0.00..193.96 rows=1096 width=32) (actual time=0.298..3.251 rows= 1096 loops=1)Planning time: 0.081 msExecution time: 16.851 msTime: 17.485 ms

　　而如果直接使用count distinct基于fact表計(jì)算該值，則耗時(shí)長(zhǎng)達(dá) 127807.105 ms。
　　
　　從上面的實(shí)驗(yàn)中可以看到，hll類(lèi)型實(shí)現(xiàn)了distinct count的合并，并可以通過(guò)hll存儲(chǔ)各個(gè)部分?jǐn)?shù)據(jù)集上的distinct count值，并可通過(guò)合并這些hll值來(lái)快速計(jì)算整個(gè)數(shù)據(jù)集上的distinct count值，耗時(shí)只有直接使用count distinct在原始數(shù)據(jù)上計(jì)算的1/7308，并且誤差非常小，1%左右。　　

總結(jié)

　　如果必須要計(jì)算精確的distinct count，可以針對(duì)不同的情況使用count distinct或者count group by來(lái)實(shí)現(xiàn)較好的效率，同時(shí)對(duì)于數(shù)據(jù)的存儲(chǔ)類(lèi)型，能使用macaddr/intger/bigint的，盡量不要使用text。
　　
　　另外不必要精確計(jì)算，只需要保證誤差在可接受的范圍之內(nèi)，或者計(jì)算效率更重要時(shí)，可以采用本文所介紹的daptive sampling Algorithm，Distinct Counting with a Self-Learning Bitmap，HyperLogLog，LogLog，Probabilistic Counting Algorithms等近似算法。另外，對(duì)于data warehouse這種存儲(chǔ)數(shù)據(jù)量隨著時(shí)間不斷超增加且最終數(shù)據(jù)總量非常巨大的應(yīng)用場(chǎng)景，可以使用hll這種支持合并dintinct count結(jié)果的數(shù)據(jù)類(lèi)型，并周期性的（比如daily/weekly/monthly）計(jì)算部分?jǐn)?shù)據(jù)的distinct值，然后通過(guò)合并部分結(jié)果的方式得到總結(jié)果的方式來(lái)快速響應(yīng)查詢(xún)請(qǐng)求。

轉(zhuǎn)載于:https://my.oschina.net/jasongj/blog/393281

與50位技術(shù)專(zhuān)家面對(duì)面20年技術(shù)見(jiàn)證，附贈(zèng)技術(shù)全景圖

總結(jié)

以上是生活随笔為你收集整理的SQL优化（二）快速计算Distinct Count的全部?jī)?nèi)容，希望文章能夠幫你解決所遇到的問(wèn)題。

如果覺(jué)得生活随笔網(wǎng)站內(nèi)容還不錯(cuò)，歡迎將生活随笔推薦給好友。

上一篇：【SICP练习】144 练习3.82
下一篇： Spring AOP源码分析（七）Pro

数据库

SQL优化（二） 快速计算Distinct Count

UV vs. PV

count distinct vs. count group by

dinstinct count高效近似算法

distinct count結(jié)果合并

總結(jié)

總結(jié)

SQL优化（二）快速计算Distinct Count