Website Traffic Log Analysis System (Module Development ---- Data Warehouse Design)

Published: 2024/1/8 | windows | 22 | 豆豆


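1. Basic concepts of dimensional modeling

Dimensional modeling is a modeling method designed specifically for analytical databases, data warehouses, and data marts. A data mart can be thought of as a "small data warehouse".
Dimension table
A dimension is a quantity by which you analyze the data. For example, when analyzing product sales you can analyze by category or by region; each such "analyze by ..." forms a dimension. As another example, take "Yesterday afternoon I spent 200 yuan on a cappuccino at Starbucks". With consumption as the subject of analysis, three dimensions can be extracted from this sentence: the time dimension (yesterday afternoon), the place dimension (Starbucks), and the product dimension (cappuccino). Dimension tables usually hold fairly stable information and are small.
Fact table
A fact table holds the measures for the subject being analyzed. It contains foreign keys that reference the dimension tables and is related to them through JOINs. Measures in a fact table are usually numeric, and because new records are added continuously, the table grows quickly. For the consumption example above, the fact table could be structured as follows:
Consumption fact table: Prod_id (references the product dimension table), TimeKey (references the time dimension table), Place_id (references the place dimension table), Unit (quantity sold).
Overall, a data warehouse does not need to follow normalization rules strictly, because its dominant purpose is analysis: the workload consists mainly of queries and does not involve updating existing data. Fact tables are designed so that historical information is recorded correctly; dimension tables are designed so that the subject can be aggregated from suitable angles.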
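As a rough illustration of the fact/dimension split described above, here is a minimal sketch using Python's built-in sqlite3 module. The table names (dim_product, dim_place, dim_time, fact_consumption) and the sample rows are made up for this example; only the column roles (Prod_id, TimeKey, Place_id, Unit) come from the text.

```python
import sqlite3

# In-memory database; all table names and rows here are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (prod_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_place   (place_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_time    (time_key INTEGER PRIMARY KEY, label TEXT);
-- The fact table stores only foreign keys plus the numeric measure.
CREATE TABLE fact_consumption (prod_id INTEGER, time_key INTEGER, place_id INTEGER, unit INTEGER);

INSERT INTO dim_product VALUES (1, 'cappuccino'), (2, 'latte');
INSERT INTO dim_place   VALUES (1, 'Starbucks');
INSERT INTO dim_time    VALUES (1, 'yesterday afternoon'), (2, 'this morning');
INSERT INTO fact_consumption VALUES (1, 1, 1, 1), (1, 2, 1, 2), (2, 2, 1, 1);
""")

# Aggregate the measure along two dimensions by joining the fact table to them.
units_by_product_and_place = conn.execute("""
SELECT p.name, pl.name, SUM(f.unit)
FROM fact_consumption f
JOIN dim_product p ON f.prod_id = p.prod_id
JOIN dim_place pl ON f.place_id = pl.place_id
GROUP BY p.name, pl.name
ORDER BY p.name
""").fetchall()

print(units_by_product_and_place)
```

The dimension tables stay small and stable, while new purchases only ever append rows to the fact table.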

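2. Three schemas for dimensional modeling

2.1. Star schema
The star schema is the most commonly used dimensional modeling approach. It is centered on the fact table, and every dimension table connects directly to the fact table, like the points of a star.
A star schema consists of one fact table and a set of dimension tables, with the following characteristics:
a. Dimension tables are related only to the fact table; there are no relationships between dimension tables.
b. Each dimension table has a single-column primary key, and that key is placed in the fact table as the foreign key that joins the two.
c. The fact table is the core, and the dimension tables are arranged around it in a star shape.

2.2. Snowflake schema
The snowflake schema is an extension of the star schema in which a dimension table may itself reference other dimension tables. Although this model is more normalized than the star schema, it is harder to understand and more expensive to maintain, and because queries have to join through several levels of dimension tables, its performance is lower than that of the star schema. For these reasons it is not used very often.

2.3. Constellation schema
The constellation schema is derived from the star schema. A star schema is built around a single fact table, whereas a constellation schema is built around several fact tables that share dimension information.
The two approaches described above both map multiple dimension tables to a single fact table. In practice, however, there is often more than one fact table in the dimensional space, and one dimension table may be used by several fact tables. In the later stages of a business, the vast majority of dimensional models use the constellation schema.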

3. Data warehouse design in this project

Note: a star schema is used.
3.1. Fact table design

Raw data table: ods_weblog_origin => the data produced by the MR cleaning step
Field	Type	Description
valid	string	whether the record is valid
remote_addr	string	visitor IP
remote_user	string	visitor user information
time_local	string	request time
request	string	request URL
status	string	response code
body_bytes_sent	string	response size in bytes
http_referer	string	referring URL
http_user_agent	string	visitor terminal information

Access log detail wide table: dw_weblog_detail
Field	Type	Description
valid	string	whether the record is valid
remote_addr	string	visitor IP
remote_user	string	visitor user information
time_local	string	full request time
daystr	string	visit date
timestr	string	visit time
month	string	visit month
day	string	visit day
hour	string	visit hour
request	string	full request URL string
status	string	response code
body_bytes_sent	string	response size in bytes
http_referer	string	referring URL
ref_host	string	host of the referrer
ref_path	string	path of the referrer
ref_query	string	query string of the referrer
ref_query_id	string	value of the id parameter in the referrer query
http_user_agent	string	client terminal identifier

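3.2. Dimension table design

Time dimension t_dim_time: date_Key, year, month, day, hour

Visitor region dimension t_dim_area: area_ID; example values: Beijing, Shanghai, Guangzhou, Shenzhen

Terminal type dimension t_dim_termination; example values: firefox, chrome, safari, ios, android

Site section dimension t_dim_section; example values: flea market, rental listings, leisure and entertainment, building materials and renovation, local services, job market

Note:
The data in dimension tables usually has to be generated by rule, either with a script you write to fit the business or with a tool, so that it is ready for later join-based analysis.
For example, the rows of the time dimension table are normally generated in advance, covering the range from the earliest date the business needs up to the current date. Depending on the granularity of your analysis, you can generate year, quarter, month, week, day, hour, and similar attributes.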
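Pre-generating the time dimension, as described above, could look roughly like the following sketch. It produces one row per hour with the date_Key, year, month, day, and hour columns; the date_Key format (yyyyMMddHH) and the function name are assumptions made for this example, not something the original project specifies.

```python
from datetime import datetime, timedelta


def generate_time_dimension(start, end):
    """Return one dimension row per hour in the half-open range [start, end)."""
    rows = []
    current = start.replace(minute=0, second=0, microsecond=0)
    while current < end:
        rows.append({
            "date_key": current.strftime("%Y%m%d%H"),  # assumed key format
            "year": current.strftime("%Y"),
            "month": current.strftime("%m"),
            "day": current.strftime("%d"),
            "hour": current.strftime("%H"),
        })
        current += timedelta(hours=1)
    return rows


# One day of hourly rows, e.g. to be written to a file and loaded into t_dim_time.
time_rows = generate_time_dimension(datetime(2013, 9, 18), datetime(2013, 9, 19))
print(len(time_rows), time_rows[0])
```

Coarser grains (day, month) follow the same pattern with a larger step or by de-duplicating the hourly rows.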

III. Module Development ---- ETL

In essence, ETL means extracting data from the various data sources, transforming it, and finally loading it into the tables that result from dimensional modeling of the data warehouse. ETL is complete only once these dimension and fact tables have been populated.
The data analysis in this project runs on a hadoop cluster and relies mainly on the hive data warehouse tool, so the collected and preprocessed data has to be loaded into the hive warehouse before the analysis can proceed.
1. Create the ODS-layer tables
1.1. Raw log table

drop table if exists ods_weblog_origin;
create table ods_weblog_origin(
  valid string,
  remote_addr string,
  remote_user string,
  time_local string,
  request string,
  status string,
  body_bytes_sent string,
  http_referer string,
  http_user_agent string)
partitioned by (datestr string)
row format delimited fields terminated by '\001';

1.2. Clickstream model pageviews table

drop table if exists ods_click_pageviews;
create table ods_click_pageviews(
  session string,
  remote_addr string,
  remote_user string,
  time_local string,
  request string,
  visit_step string,
  page_staylong string,
  http_referer string,
  http_user_agent string,
  body_bytes_sent string,
  status string)
partitioned by (datestr string)
row format delimited fields terminated by '\001';

1.3. Clickstream visit model table

drop table if exists ods_click_stream_visit;
create table ods_click_stream_visit(
  session string,
  remote_addr string,
  inTime string,
  outTime string,
  inPage string,
  outPage string,
  referal string,
  pageVisits int)
partitioned by (datestr string)
row format delimited fields terminated by '\001';

2. Load data into the ODS layer

load data inpath '/weblog/preprocessed/' overwrite into table ods_weblog_origin partition(datestr='20130918'); -- load the data
show partitions ods_weblog_origin; -- list the partitions
select count(*) from ods_weblog_origin; -- count the loaded rows

The two clickstream model tables are loaded in the same way. Note: in production, the load commands should be written into a script and scheduled in azkaban. Pay attention to the run time: the load must start only after preprocessing has finished.

3. Generate the ODS-layer detail wide table
3.1. Requirement
The whole analysis proceeds layer by layer through the data warehouse. Broadly, intermediate tables are first derived from the raw ODS data (for example, to simplify later analysis, unstructured values such as the time and URL in the raw data are parsed into structured fields, producing a detail table), and the various metrics are then computed on top of those intermediate tables.

3.2. ETL implementation

Create the detail table ods_weblog_detail:

drop table ods_weblog_detail;
create table ods_weblog_detail(
  valid string, -- validity flag
  remote_addr string, -- source IP
  remote_user string, -- user identifier
  time_local string, -- full access time
  daystr string, -- access date
  timestr string, -- access time
  month string, -- access month
  day string, -- access day
  hour string, -- access hour
  request string, -- requested url
  status string, -- response code
  body_bytes_sent string, -- bytes transferred
  http_referer string, -- referring url
  ref_host string, -- referrer host
  ref_path string, -- referrer path
  ref_query string, -- referrer query string
  ref_query_id string, -- value of the id parameter in the referrer query
  http_user_agent string -- client terminal identifier
)
partitioned by(datestr string);

Populate the detail wide table ods_weblog_detail by inserting from a query.

1. Extract the referring URL into the intermediate table t_ods_tmp_referurl, that is, split the referring URL into host, path, query, and query id:

drop table if exists t_ods_tmp_referurl;
create table t_ods_tmp_referurl as
SELECT a.*, b.*
FROM ods_weblog_origin a
LATERAL VIEW parse_url_tuple(regexp_replace(http_referer, "\"", ""), 'HOST', 'PATH', 'QUERY', 'QUERY:id') b as host, path, query, query_id;

Note: lateral view is used together with UDTFs such as explode and parse_url_tuple; it joins each input row with the rows the UDTF produces for it. A UDTF (User-Defined Table-Generating Function) addresses the need to output multiple rows for one input row (one-to-many mapping). explode is one such function: explode(ARRAY) generates one row for each element of the array.

2. Extract and transform the time_local field into the intermediate detail table t_ods_tmp_detail:

drop table if exists t_ods_tmp_detail;
create table t_ods_tmp_detail as
select b.*,
  substring(time_local,0,10) as daystr,
  substring(time_local,12) as tmstr,
  substring(time_local,6,2) as month,
  substring(time_local,9,2) as day,
  substring(time_local,12,2) as hour -- assumes time_local looks like '2013-09-18 06:49:18'
from t_ods_tmp_referurl b;

3. The statements above can be combined into a single statement:

insert into table shizhan.ods_weblog_detail partition(datestr='20130918')
select c.valid, c.remote_addr, c.remote_user, c.time_local,
  substring(c.time_local,0,10) as daystr,
  substring(c.time_local,12) as tmstr,
  substring(c.time_local,6,2) as month,
  substring(c.time_local,9,2) as day,
  substring(c.time_local,12,2) as hour, -- assumes time_local looks like '2013-09-18 06:49:18'
  c.request, c.status, c.body_bytes_sent, c.http_referer, c.ref_host, c.ref_path, c.ref_query, c.ref_query_id, c.http_user_agent
from (
  SELECT a.valid, a.remote_addr, a.remote_user, a.time_local,
    a.request, a.status, a.body_bytes_sent, a.http_referer, a.http_user_agent,
    b.ref_host, b.ref_path, b.ref_query, b.ref_query_id
  FROM shizhan.ods_weblog_origin a
  LATERAL VIEW parse_url_tuple(regexp_replace(http_referer, "\"", ""), 'HOST', 'PATH', 'QUERY', 'QUERY:id') b as ref_host, ref_path, ref_query, ref_query_id
) c;
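To make the two transformations in this ETL step easier to follow, here is a small Python sketch that mimics them outside Hive: splitting a referrer URL into host, path, query, and the id parameter, and splitting time_local into date parts. It is not how Hive implements parse_url_tuple; the function names and sample values are hypothetical, and the time format 'yyyy-MM-dd HH:mm:ss' is an assumption about the preprocessed data.

```python
from urllib.parse import urlsplit, parse_qs


def split_referer(http_referer):
    """Split a referrer URL into the four ref_* fields of the detail table."""
    url = http_referer.replace('"', "")  # same idea as regexp_replace(..., "\"", "")
    parts = urlsplit(url)
    ids = parse_qs(parts.query).get("id")
    return {
        "ref_host": parts.hostname,
        "ref_path": parts.path or None,
        "ref_query": parts.query or None,
        "ref_query_id": ids[0] if ids else None,
    }


def split_time(time_local):
    """Split 'yyyy-MM-dd HH:mm:ss' (assumed format) into the date/time columns."""
    daystr, timestr = time_local.split(" ")
    _year, month, day = daystr.split("-")
    return {"daystr": daystr, "timestr": timestr,
            "month": month, "day": day, "hour": timestr[:2]}


print(split_referer('"http://www.example.com/job/list?id=42&page=2"'))
print(split_time("2013-09-18 06:49:18"))
```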

IV. Module Development ---- Statistical Analysis

Once the data warehouse has been built, users can write Hive SQL statements to access it and analyze its data.
In real production, the metrics to compute are usually requested by the departments that need the data, and new requirements keep arriving. The following are some typical metrics for website traffic analysis.
Note: every metric can be drilled down along each dimension table.
1. Traffic analysis
1.1. Total PV by multiple dimensions
By time dimension

-- compute pvs per hour; note the group by syntax
select count(*) as pvs, month, day, hour from ods_weblog_detail group by month, day, hour;

Method 1: query the single table ods_weblog_detail directly

-- pvs per hour within this processing batch (one day)
drop table dw_pvs_everyhour_oneday;
create table dw_pvs_everyhour_oneday(month string, day string, hour string, pvs bigint) partitioned by(datestr string);
insert into table dw_pvs_everyhour_oneday partition(datestr='20130918')
select a.month as month, a.day as day, a.hour as hour, count(*) as pvs
from ods_weblog_detail a
where a.datestr='20130918'
group by a.month, a.day, a.hour;

-- pvs per day
drop table dw_pvs_everyday;
create table dw_pvs_everyday(pvs bigint, month string, day string);
insert into table dw_pvs_everyday
select count(*) as pvs, a.month as month, a.day as day
from ods_weblog_detail a
group by a.month, a.day;

Method 2: join with the time dimension table

-- dimension: day
drop table dw_pvs_everyday;
create table dw_pvs_everyday(pvs bigint, month string, day string);
insert into table dw_pvs_everyday
select count(*) as pvs, a.month as month, a.day as day
from (select distinct month, day from t_dim_time) a
join ods_weblog_detail b on a.month=b.month and a.day=b.day
group by a.month, a.day;

-- dimension: month
drop table dw_pvs_everymonth;
create table dw_pvs_everymonth (pvs bigint, month string);
insert into table dw_pvs_everymonth
select count(*) as pvs, a.month
from (select distinct month from t_dim_time) a
join ods_weblog_detail b on a.month=b.month
group by a.month;

-- Alternatively, reuse earlier results, e.g. derive each day's total from the hourly results already computed
insert into table dw_pvs_everyday
select sum(pvs) as pvs, month, day from dw_pvs_everyhour_oneday group by month, day having day='18';
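The relationship between the hourly and daily PV tables can be sketched in a few lines of Python: count records per (month, day, hour), then roll the hourly counts up to days, which is what the last query above does with sum(pvs). The record layout and function names are assumptions for illustration only.

```python
from collections import Counter


def pvs_per_hour(records):
    """Equivalent in spirit to: count(*) ... group by month, day, hour."""
    return Counter((r["month"], r["day"], r["hour"]) for r in records)


def pvs_per_day(hourly):
    """Roll hourly counts up to daily counts, like sum(pvs) group by month, day."""
    daily = Counter()
    for (month, day, _hour), pvs in hourly.items():
        daily[(month, day)] += pvs
    return daily


sample = (
    [{"month": "09", "day": "18", "hour": "06"}] * 3
    + [{"month": "09", "day": "18", "hour": "07"}] * 2
)
hourly = pvs_per_hour(sample)
daily = pvs_per_day(hourly)
print(hourly, daily)
```

Reusing the hourly result avoids scanning the detail table a second time.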

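By terminal dimension
The field that reflects the user's terminal is http_user_agent.
User Agent is often abbreviated as UA. It is a special header string that tells the visited site which browser type and version, operating system and version, browser engine, and so on the client is using. For example:

User-Agent,Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.276 Safari/537.36

The following information can be extracted from the UA string above:
Browser: chrome
Browser version: 58.0
Platform: windows
Browser engine: webkit

The topic is not expanded further here; if you are interested, see the reference materials on how to parse a UA.
The statement below can be used for an exploratory count, although its accuracy is not very high.

select distinct(http_user_agent) from ods_weblog_detail where http_user_agent like '%Chrome%' limit 200;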
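A slightly more structured alternative to the like '%Chrome%' probe is a small rule-based classifier. The sketch below is deliberately simplistic and only covers three browser families; the rule list and function name are assumptions for this example, and real UA parsing needs a maintained library.

```python
import re

# Order matters: Chrome user agents also contain "Safari", so Chrome is checked first.
UA_RULES = [
    ("chrome", re.compile(r"Chrome/(\d+\.\d+)")),
    ("firefox", re.compile(r"Firefox/(\d+\.\d+)")),
    ("safari", re.compile(r"Version/(\d+\.\d+).*Safari/")),
]


def classify_user_agent(ua):
    """Return (browser, major.minor version), or ("other", None) if no rule matches."""
    for name, pattern in UA_RULES:
        match = pattern.search(ua)
        if match:
            return name, match.group(1)
    return "other", None


ua = ("Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/58.0.3029.276 Safari/537.36")
print(classify_user_agent(ua))
```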

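By section dimension
A site section can be understood as a group of related content within the site. In the URL, different sections appear as different second-level directories. For example, if a site's address is www.xxxx.cn, its sections can be reached as follows:
Section dimension: ../job
Section dimension: ../news
Section dimension: ../sports
Section dimension: ../technology
The section being visited can therefore be parsed from the user's request URL and then used as the grouping key for statistics.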
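Deriving the section from the request URL can be sketched as taking the first path segment and counting by it. The function name, the "(root)" label, and the sample requests are assumptions for this example.

```python
from collections import Counter


def section_of(request):
    """Return the first path segment of a request URL, e.g. '/job/list?id=1' -> 'job'."""
    path = request.split("?", 1)[0]
    segments = [s for s in path.split("/") if s]
    return segments[0] if segments else "(root)"


requests = ["/job/list?id=1", "/job/detail/7", "/news/today", "/", "/sports/nba?page=2"]
pv_by_section = Counter(section_of(r) for r in requests)
print(pv_by_section)
```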

By referer dimension

-- PV generated by each referring url per hour
drop table dw_pvs_referer_everyhour;
create table dw_pvs_referer_everyhour(referer_url string, referer_host string, month string, day string, hour string, pv_referer_cnt bigint) partitioned by(datestr string);
insert into table dw_pvs_referer_everyhour partition(datestr='20130918')
select http_referer, ref_host, month, day, hour, count(1) as pv_referer_cnt
from ods_weblog_detail
group by http_referer, ref_host, month, day, hour
having ref_host is not null
order by hour asc, day asc, month asc, pv_referer_cnt desc;

-- PV generated by each referring host per hour, sorted
drop table dw_pvs_refererhost_everyhour;
create table dw_pvs_refererhost_everyhour(ref_host string, month string, day string, hour string, ref_host_cnts bigint) partitioned by(datestr string);
insert into table dw_pvs_refererhost_everyhour partition(datestr='20130918')
select ref_host, month, day, hour, count(1) as ref_host_cnts
from ods_weblog_detail
group by ref_host, month, day, hour
having ref_host is not null
order by hour asc, day asc, month asc, ref_host_cnts desc;

Note: the same can be computed by visitor region, visitor terminal, and other dimensions.

1.2. Average page views per visitor
Requirement: compute the average number of pages requested by all of today's visitors.
This metric, also called average pages viewed per visitor, indicates how sticky the site is for its users.
It represents the average number of page views per user within a given period.
Formula: total page requests / number of distinct visitors
remote_addr identifies the individual users. First count the PV for each remote_addr, then sum all the PVs to get the total number of page requests, and count the remote_addr values to get the number of distinct visitors.

-- total page requests / number of distinct visitors
drop table dw_avgpv_user_everyday;
create table dw_avgpv_user_everyday(day string, avgpv string);
insert into table dw_avgpv_user_everyday
select '20130918', sum(b.pvs)/count(b.remote_addr)
from (select remote_addr, count(1) as pvs from ods_weblog_detail where datestr='20130918' group by remote_addr) b;
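The same two-stage calculation (PV per visitor first, then total PV divided by the number of distinct visitors) can be written out in Python as a sanity check. The function name and sample IPs are made up for this sketch.

```python
from collections import Counter


def average_pv_per_visitor(remote_addrs):
    """Total page requests divided by the number of distinct remote_addr values."""
    pvs_by_visitor = Counter(remote_addrs)  # inner query: count(1) group by remote_addr
    if not pvs_by_visitor:
        return 0.0
    return sum(pvs_by_visitor.values()) / len(pvs_by_visitor)


hits = ["1.1.1.1", "1.1.1.1", "2.2.2.2", "3.3.3.3", "3.3.3.3", "3.3.3.3"]
print(average_pv_per_visitor(hits))
```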

1.3. Top-N sources by total PV (top N per group)

Requirement: for each hour, find the N referring hosts that generated the most pvs (topN).
The row_number() function
  Syntax: row_number() over (partition by xxx order by xxx) rank, where rank is the alias of the generated column; in effect a new field named rank is added.
  partition by defines the groups, for example grouping by the sex field.
  order by sorts within each group, for example grouping by sex and sorting by age within the group.
  After sorting, each record in a group receives a number, starting from 1 in every group.
  To select particular rows within a group, filter on the generated column with a condition such as where table_name.rank > x.
The following statement numbers the referring hosts within each hour in descending order of visit count:
select ref_host, ref_host_cnts, concat(month,day,hour),
row_number() over (partition by concat(month,day,hour) order by ref_host_cnts desc) as od from dw_pvs_refererhost_everyhour;
[Figure: result of the row_number() query]

Based on row_number as described above, the following hql selects the top N ref_host values by visit count for each hour:

drop table dw_pvs_refhost_topn_everyhour;
create table dw_pvs_refhost_topn_everyhour(
  hour string,
  toporder string,
  ref_host string,
  ref_host_cnts string
) partitioned by(datestr string);
insert into table dw_pvs_refhost_topn_everyhour partition(datestr='20130918')
select t.hour, t.od, t.ref_host, t.ref_host_cnts
from (
  select ref_host, ref_host_cnts, concat(month,day,hour) as hour,
    row_number() over (partition by concat(month,day,hour) order by ref_host_cnts desc) as od
  from dw_pvs_refererhost_everyhour
) t
where od<=3;

[Figure: contents of the top-N result table]
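For readers less familiar with window functions, the "number rows within each group, then keep rank <= N" pattern can be sketched in plain Python. This mirrors the idea of row_number() over (partition by ... order by ... desc); it is not Hive's implementation, and the row layout and sample counts are invented for the example.

```python
from itertools import groupby


def top_n_per_group(rows, n):
    """Return (hour, rank, ref_host, cnt) for the n highest counts in each hour."""
    # Sort by group key, then by count descending, as the window clause does.
    ordered = sorted(rows, key=lambda r: (r["hour"], -r["cnt"]))
    result = []
    for hour, group in groupby(ordered, key=lambda r: r["hour"]):
        for rank, row in enumerate(group, start=1):  # rank restarts at 1 per group
            if rank > n:
                break
            result.append((hour, rank, row["ref_host"], row["cnt"]))
    return result


rows = [
    {"hour": "091806", "ref_host": "a.example", "cnt": 5},
    {"hour": "091806", "ref_host": "b.example", "cnt": 9},
    {"hour": "091806", "ref_host": "c.example", "cnt": 1},
    {"hour": "091806", "ref_host": "d.example", "cnt": 7},
    {"hour": "091807", "ref_host": "a.example", "cnt": 2},
    {"hour": "091807", "ref_host": "b.example", "cnt": 3},
]
top2 = top_n_per_group(rows, 2)
print(top2)
```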

2. Visited-page analysis (from the page's perspective)

2.1. Per-page visit statistics
This mainly means statistics over the request field, such as PV per page and UV per page.
These metrics come down to a group by on the page field. For example:

-- PV per page
select request as request, count(request) as request_counts
from ods_weblog_detail
group by request
having request is not null
order by request_counts desc
limit 20;

2.2. Popular pages

-- top 10 most popular pages per day
drop table dw_hotpages_everyday;
create table dw_hotpages_everyday(day string, url string, pvs string);
insert into table dw_hotpages_everyday
select '20130918', a.request, a.request_counts
from (
  select request as request, count(request) as request_counts
  from ods_weblog_detail
  where datestr='20130918'
  group by request
  having request is not null
) a
order by a.request_counts desc
limit 10;

3. Visitor analysis

3.1. Unique visitors
Requirement: count unique visitors and the pv they generate along the time dimension, for example per hour.
If the raw log contains a user identifier, unique visitors are easy to identify from it. Here the raw log has no user identifier, so the visitor IP is used as a stand-in; the technique is the same, only less accurate.

-- time dimension: hour
drop table dw_user_dstc_ip_h;
create table dw_user_dstc_ip_h(
  remote_addr string,
  pvs bigint,
  hour string);
insert into table dw_user_dstc_ip_h
select remote_addr, count(1) as pvs, concat(month,day,hour) as hour
from ods_weblog_detail
where datestr='20130918'
group by concat(month,day,hour), remote_addr;

Further statistics can be built on this result table, such as the number of unique visitors per hour:
select count(1) as dstc_ip_cnts, hour from dw_user_dstc_ip_h group by hour;

-- time dimension: day
select remote_addr, count(1) as counts, concat(month,day) as day
from ods_weblog_detail
where datestr='20130918'
group by concat(month,day), remote_addr;

-- time dimension: month
select remote_addr, count(1) as counts, month
from ods_weblog_detail
group by month, remote_addr;

3.2. Daily new visitors
Requirement: identify the new visitors for each day.
Approach: maintain a cumulative table of distinct visitors, then compare each day's visitors against it.

-- cumulative table of distinct visitors over all days
drop table dw_user_dsct_history;
create table dw_user_dsct_history(
  day string,
  ip string
) partitioned by(datestr string);

-- daily new visitors table
drop table dw_user_new_d;
create table dw_user_new_d (
  day string,
  ip string
) partitioned by(datestr string);

-- insert the day's new visitors into the new visitors table
insert into table dw_user_new_d partition(datestr='20130918')
select tmp.day as day, tmp.today_addr as new_ip
from (
  select today.day as day, today.remote_addr as today_addr, old.ip as old_addr
  from (select distinct remote_addr as remote_addr, "20130918" as day from ods_weblog_detail where datestr="20130918") today
  left outer join dw_user_dsct_history old
  on today.remote_addr=old.ip
) tmp
where tmp.old_addr is null;

-- append the day's new visitors to the cumulative table
insert into table dw_user_dsct_history partition(datestr='20130918')
select day, ip from dw_user_new_d where datestr='20130918';

Verification:

select count(distinct remote_addr) from ods_weblog_detail;
select count(1) from dw_user_dsct_history where datestr='20130918';
select count(1) from dw_user_new_d where datestr='20130918';
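The left outer join with "old_addr is null" used for new visitors is an anti-join: keep today's visitors that have no match in the history. In Python the same idea is a set difference, sketched below with hypothetical names and sample IPs.

```python
def split_new_visitors(today_ips, history):
    """Return (new visitors today, updated cumulative history)."""
    new_ips = set(today_ips) - history   # anti-join: today's IPs not seen before
    return new_ips, history | new_ips    # append the new IPs to the cumulative set


history = {"1.1.1.1"}
today = ["1.1.1.1", "2.2.2.2", "2.2.2.2", "3.3.3.3"]
new_today, history_after = split_new_visitors(today, history)
print(sorted(new_today), sorted(history_after))

# Running the same day again finds no new visitors, because they are now in the history.
new_again, _ = split_new_visitors(today, history_after)
```

The second call shows why the order of the two inserts matters: the history must be updated only after the day's new visitors have been computed.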

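Note: the same can be computed by visitor region, visitor terminal, and other dimensions.

4. Visitor visit analysis (clickstream model)

4.1. Returning and single-visit visitors
Requirement: find all of today's returning visitors and their visit counts.

Approach: in the ods_click_stream_visit table, a visitor that appears more than once is a returning visitor; otherwise it is a single-visit visitor.

drop table dw_user_returning;
create table dw_user_returning(
  day string,
  remote_addr string,
  acc_cnt string)
partitioned by (datestr string);
insert overwrite table dw_user_returning partition(datestr='20130918')
select tmp.day, tmp.remote_addr, tmp.acc_cnt
from (
  select '20130918' as day, remote_addr, count(session) as acc_cnt
  from ods_click_stream_visit
  where datestr='20130918'
  group by remote_addr
) tmp
where tmp.acc_cnt>1;

4.2. Average visit frequency per visitor

Requirement: compute the average number of visits to the site per user for each day.
Formula: total number of visits / number of distinct users

select count(pagevisits)/count(distinct remote_addr) from ods_click_stream_visit where datestr='20130918';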

5. Key-path conversion rate analysis (funnel model)

5.1. Requirement analysis
Conversion: within a specified business flow, the number of users who complete each step and that number as a percentage of the previous step.

5.2. Model design

Define the page identifiers of the business flow. In the example below the steps are:
Step1: /item
Step2: /category
Step3: /index
Step4: /order

5.3. Implementation

  Query the total number of visitors for each step

-- store the visitor count of each step in dw_oute_numbs
create table dw_oute_numbs as
select 'step1' as step, count(distinct remote_addr) as numbs from ods_click_pageviews where datestr='20130920' and request like '/item%'
union
select 'step2' as step, count(distinct remote_addr) as numbs from ods_click_pageviews where datestr='20130920' and request like '/category%'
union
select 'step3' as step, count(distinct remote_addr) as numbs from ods_click_pageviews where datestr='20130920' and request like '/index%'
union
select 'step4' as step, count(distinct remote_addr) as numbs from ods_click_pageviews where datestr='20130920' and request like '/order%';

Note: UNION merges the result sets of several SELECT statements into a single result set.

  Query each step's visitor count as a proportion of the starting step
Approach: a cascading query using a self join

-- join dw_oute_numbs with itself
select rn.step as rnstep, rn.numbs as rnnumbs, rr.step as rrstep, rr.numbs as rrnumbs
from dw_oute_numbs rn
inner join dw_oute_numbs rr;

-- visitors at each step / visitors at step 1 == proportion relative to the starting step
select tmp.rnstep, tmp.rnnumbs/tmp.rrnumbs as ratio
from (
  select rn.step as rnstep, rn.numbs as rnnumbs, rr.step as rrstep, rr.numbs as rrnumbs
  from dw_oute_numbs rn
  inner join dw_oute_numbs rr
) tmp
where tmp.rrstep='step1';

  Query each step's rate relative to the previous step (named leakage_rate below; it is computed as this step's count divided by the previous step's count)

-- filter the self join down to pairs of (previous step, current step)
select rn.step as rnstep, rn.numbs as rnnumbs, rr.step as rrstep, rr.numbs as rrnumbs
from dw_oute_numbs rn
inner join dw_oute_numbs rr
where cast(substr(rn.step,5,1) as int)=cast(substr(rr.step,5,1) as int)-1;

select tmp.rrstep as step, tmp.rrnumbs/tmp.rnnumbs as leakage_rate
from (
  select rn.step as rnstep, rn.numbs as rnnumbs, rr.step as rrstep, rr.numbs as rrnumbs
  from dw_oute_numbs rn
  inner join dw_oute_numbs rr
) tmp
where cast(substr(tmp.rnstep,5,1) as int)=cast(substr(tmp.rrstep,5,1) as int)-1;

Combine the two metrics above

select abs.step, abs.numbs, abs.rate as abs_ratio, rel.rate as leakage_rate
from (
  select tmp.rnstep as step, tmp.rnnumbs as numbs, tmp.rnnumbs/tmp.rrnumbs as rate
  from (
    select rn.step as rnstep, rn.numbs as rnnumbs, rr.step as rrstep, rr.numbs as rrnumbs
    from dw_oute_numbs rn
    inner join dw_oute_numbs rr
  ) tmp
  where tmp.rrstep='step1'
) abs
left outer join (
  select tmp.rrstep as step, tmp.rrnumbs/tmp.rnnumbs as rate
  from (
    select rn.step as rnstep, rn.numbs as rnnumbs, rr.step as rrstep, rr.numbs as rrnumbs
    from dw_oute_numbs rn
    inner join dw_oute_numbs rr
  ) tmp
  where cast(substr(tmp.rnstep,5,1) as int)=cast(substr(tmp.rrstep,5,1) as int)-1
) rel
on abs.step=rel.step;
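The two funnel ratios are easier to see with concrete numbers. The sketch below computes, for each step, the ratio to the first step and the ratio to the previous step from a list of per-step visitor counts. The counts are invented for illustration, and the function assumes the first step has a non-zero count.

```python
def funnel_rates(step_counts):
    """Return (step, count, ratio_to_first_step, ratio_to_previous_step) per step."""
    first = step_counts[0][1]  # assumed to be non-zero
    rows = []
    previous = None
    for step, count in step_counts:
        to_first = count / first
        to_previous = None if previous is None else count / previous
        rows.append((step, count, to_first, to_previous))
        previous = count
    return rows


counts = [("step1", 1000), ("step2", 500), ("step3", 250), ("step4", 50)]
for row in funnel_rates(counts):
    print(row)
```

The first step has no previous step, which corresponds to the NULL that the left outer join produces for step1 in the combined query.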

V. Module Development ---- Exporting Results

To present the computed data as reports on a front-end page, we can use sqoop to export the results to the relational database mysql. (The result data is usually not very large, so a relational database is a reasonable choice for report display. If the aggregated data is still very large, a big-data technology should be considered for presenting it instead.)

A few hive tables are exported here as examples; all other exports work in essentially the same way.

1. Step 1: create the mysql database and the corresponding tables

/*
SQLyog Ultimate v8.32
MySQL - 5.6.22-log : Database - weblog
*********************************************************************
*/
/*!40101 SET NAMES utf8 */;
/*!40101 SET SQL_MODE=''*/;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/`weblog` /*!40100 DEFAULT CHARACTER SET utf8 */;
USE `weblog`;

/*Table structure for table `dw_pvs_everyday` */
DROP TABLE IF EXISTS `dw_pvs_everyday`;
CREATE TABLE `dw_pvs_everyday` (
  `pvs` varchar(32) DEFAULT NULL,
  `month` varchar(16) DEFAULT NULL,
  `day` varchar(16) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

/*Table structure for table `dw_pvs_everyhour_oneday` */
DROP TABLE IF EXISTS `dw_pvs_everyhour_oneday`;
CREATE TABLE `dw_pvs_everyhour_oneday` (
  `month` varchar(32) DEFAULT NULL,
  `day` varchar(32) DEFAULT NULL,
  `hour` varchar(32) DEFAULT NULL,
  `pvs` varchar(32) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

/*Table structure for table `dw_pvs_referer_everyhour` */
DROP TABLE IF EXISTS `dw_pvs_referer_everyhour`;
CREATE TABLE `dw_pvs_referer_everyhour` (
  `refer_url` varchar(2048) DEFAULT NULL,
  `referer_host` varchar(64) DEFAULT NULL,
  `month` varchar(32) DEFAULT NULL,
  `day` varchar(32) DEFAULT NULL,
  `hour` varchar(32) DEFAULT NULL,
  `pv_referer_cnt` varchar(32) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

/*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
/*!40014 SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS */;
/*!40111 SET SQL_NOTES=@OLD_SQL_NOTES */;

2. Step 2: export with sqoop commands

/export/servers/sqoop-1.4.6-cdh5.14.0/bin/sqoop export \
  --connect jdbc:mysql://192.168.1.106:3306/weblog \
  --username root --password admin -m 1 \
  --export-dir /user/hive/warehouse/weblog.db/dw_pvs_everyday \
  --table dw_pvs_everyday \
  --input-fields-terminated-by '\001'

/export/servers/sqoop-1.4.6-cdh5.14.0/bin/sqoop export \
  --connect jdbc:mysql://192.168.1.106:3306/weblog \
  --username root --password admin -m 1 \
  --export-dir /user/hive/warehouse/weblog.db/dw_pvs_everyhour_oneday/datestr=20130918 \
  --table dw_pvs_everyhour_oneday \
  --input-fields-terminated-by '\001'

/export/servers/sqoop-1.4.6-cdh5.14.0/bin/sqoop export \
  --connect jdbc:mysql://192.168.1.106:3306/weblog \
  --username root --password admin -m 1 \
  --export-dir /user/hive/warehouse/weblog.db/dw_pvs_referer_everyhour/datestr=20130918 \
  --table dw_pvs_referer_everyhour \
  --input-fields-terminated-by '\001'

VI. Module Development ---- Workflow Scheduling

Following the processing flow, from data collection through data analysis to exporting the results, the project's tasks can be split into a number of azkaban job units that the workflow scheduler then runs.

The hard part of writing scheduling scripts is the shell scripting, but it generally follows a fixed pattern. You can write yours by following the scripts in the reference materials.

Step 1: develop the DateUtil utility class

Develop a DateUtil utility class that returns the previous day's date.

import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;

public class DateUtil {
    /**
     * Get yesterday's date.
     * @return yesterday's date as yyyyMMdd, matching the /weblog/20180205/... directory layout
     */
    public static String getYestDate() {
        Calendar instance = Calendar.getInstance();
        instance.add(Calendar.DATE, -1);
        Date time = instance.getTime();
        String format = new SimpleDateFormat("yyyyMMdd").format(time);
        return format;
    }

    public static void main(String[] args) {
        System.out.println(getYestDate());
    }
}

Step 2: define the daily upload directory for the data

Define the directory to which files are uploaded each day, and upload the data to it:
hdfs dfs -mkdir -p /weblog/20180205/input
hdfs dfs -put access.log.fensi /weblog/20180205/input

Step 3: adapt the MR programs to the upload directory

Modify the WebLogProcessor program:
String inputPath = "hdfs://node01:8020/weblog/" + DateUtil.getYestDate() + "/input";
String outputPath = "hdfs://node01:8020/weblog/" + DateUtil.getYestDate() + "/weblogPreOut";
FileSystem fileSystem = FileSystem.get(new URI("hdfs://node01:8020"), conf);
// remove the output directory if it already exists, so the job can be rerun
if (fileSystem.exists(new Path(outputPath))) {
    fileSystem.delete(new Path(outputPath), true);
}
FileInputFormat.setInputPaths(job, new Path(inputPath));
FileOutputFormat.setOutputPath(job, new Path(outputPath));

Modify ClickStreamPageView:
String inputPath = "hdfs://node01:8020/weblog/" + DateUtil.getYestDate() + "/weblogPreOut";
String outputPath = "hdfs://node01:8020/weblog/" + DateUtil.getYestDate() + "/pageViewOut";
FileSystem fileSystem = FileSystem.get(new URI("hdfs://node01:8020"), conf);
if (fileSystem.exists(new Path(outputPath))) {
    fileSystem.delete(new Path(outputPath), true);
}
FileInputFormat.setInputPaths(job, new Path(inputPath));
FileOutputFormat.setOutputPath(job, new Path(outputPath));

Modify ClickStreamVisit:
String inputPath = "hdfs://node01:8020/weblog/" + DateUtil.getYestDate() + "/pageViewOut";
String outPutPath = "hdfs://node01:8020/weblog/" + DateUtil.getYestDate() + "/clickStreamVisit";
FileSystem fileSystem = FileSystem.get(new URI("hdfs://node01:8020"), conf);
if (fileSystem.exists(new Path(outPutPath))) {
    fileSystem.delete(new Path(outPutPath), true);
}
FileInputFormat.setInputPaths(job, new Path(inputPath));
FileOutputFormat.setOutputPath(job, new Path(outPutPath));
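The three MR programs form a chain in which each stage reads the directory the previous stage wrote. The following sketch lays out that chain for a given run date; it assumes yyyyMMdd directory names, as in the upload example /weblog/20180205/input, and the function name and dictionary layout are hypothetical.

```python
from datetime import date, timedelta


def daily_paths(today, base="hdfs://node01:8020/weblog"):
    """Return (input, output) paths per MR stage for the day before `today`."""
    day = (today - timedelta(days=1)).strftime("%Y%m%d")  # assumed directory format
    root = f"{base}/{day}"
    return {
        "WebLogProcessor": (f"{root}/input", f"{root}/weblogPreOut"),
        "ClickStreamPageView": (f"{root}/weblogPreOut", f"{root}/pageViewOut"),
        "ClickStreamVisit": (f"{root}/pageViewOut", f"{root}/clickStreamVisit"),
    }


paths = daily_paths(date(2018, 2, 6))
for stage, (src, dst) in paths.items():
    print(stage, src, "->", dst)
```

Laying the paths out this way makes it easy to check that the date format used by the upload step and by the MR programs agree.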

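Step 4: package the programs as a jar

Step 5: develop the azkaban scheduling scripts

The schedule consists of the following stages:
Stage 1: run the first MR program
Stage 2: run the second MR program
Stage 3: run the third MR program
Stage 4: load data into the hive tables
Stage 5: analyze the data in the hive tables
Stage 6: export the analysis results with sqoop

Step 6: scheduled execution

The job is scheduled to start at 2 a.m. every day:

0 2 ? * *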

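VII. Module Development ---- Data Visualization

1. Introduction to Echarts

ECharts is a JavaScript-based data visualization chart library developed by Baidu's front-end technology team. It provides intuitive, vivid, interactive, and highly customizable data visualization charts.

It offers a large number of commonly used chart types. It is built on ZRender (a lightweight canvas library), on top of which it creates basic components such as coordinate systems, legends, tooltips, and a toolbox, and from these builds line charts (area charts), bar charts, scatter charts (bubble charts), pie charts (ring charts), candlestick charts, maps, force-directed layout graphs, and chord diagrams. It also supports stacking along any dimension and mixing several chart types in one display.

2. Structure of the web project

This project is a plain JavaEE project built on an integrated ssm framework. It is started with the maven tomcat plugin.

3. Trying out Echarts: a quick start

3.1. Download Echarts

Choose the version you need on the download page of the official site. Different builds are provided to suit different needs for features and file size; if size is not a concern, you can simply download the full version. For development, the source version is recommended because it includes common error messages and warnings.

3.2. Include Echarts in the page
From ECharts 3 onward, it only needs to be included with a script tag like any ordinary JavaScript library.

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <script src="echarts.min.js"></script>
</head>
</html>

3.3. Draw a simple chart

Before drawing, we need to prepare a DOM container with a width and height for ECharts:

<div id="main" style="width: 600px;height:400px;"></div>

Then an echarts instance can be initialized with the echarts.init method, and a simple bar chart generated with the setOption method. The complete code follows.

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>ECharts</title>
  <script src="echarts.min.js"></script>
</head>
<body>
  <div id="main" style="width: 600px;height:400px;"></div>
  <script type="text/javascript">
    // initialize the echarts instance on the prepared dom element
    var myChart = echarts.init(document.getElementById('main'));
    // specify the chart's options and data
    var option = {
      title: {text: 'ECharts getting-started example'},
      tooltip: {},
      legend: {data: ['Sales']},
      xAxis: {data: ["Shirt", "Wool sweater", "Chiffon shirt", "Trousers", "High heels", "Socks"]},
      yAxis: {},
      series: [{name: 'Sales', type: 'bar', data: [5, 20, 36, 10, 10, 20]}]
    };
    // display the chart using the options and data just specified
    myChart.setOption(option);
  </script>
</body>
</html>

[Figure: the resulting bar chart]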

Setting up the three-framework (ssm) environment
Step 1: create the database and import the data

/*
SQLyog Ultimate v8.32
MySQL - 5.6.22-log : Database - web_log_view
*********************************************************************
*/
/*!40101 SET NAMES utf8 */;
/*!40101 SET SQL_MODE=''*/;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/`web_log_view` /*!40100 DEFAULT CHARACTER SET utf8 */;
USE `web_log_view`;

/*Table structure for table `t_avgpv_num` */
DROP TABLE IF EXISTS `t_avgpv_num`;
CREATE TABLE `t_avgpv_num` (
  `id` int(11) DEFAULT NULL,
  `dateStr` varchar(255) DEFAULT NULL,
  `avgPvNum` decimal(6,2) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

/*Data for the table `t_avgpv_num` */
insert into `t_avgpv_num`(`id`,`dateStr`,`avgPvNum`) values
(1,'20130919','13.40'),
(2,'20130920','17.60'),
(3,'20130921','15.20'),
(4,'20130922','21.10'),
(5,'20130923','16.90'),
(6,'20130924','18.10'),
(7,'20130925','18.60');

/*Table structure for table `t_flow_num` */
DROP TABLE IF EXISTS `t_flow_num`;
CREATE TABLE `t_flow_num` (
  `id` int(11) DEFAULT NULL,
  `dateStr` varchar(255) DEFAULT NULL,
  `pVNum` int(11) DEFAULT NULL,
  `uVNum` int(11) DEFAULT NULL,
  `iPNum` int(11) DEFAULT NULL,
  `newUvNum` int(11) DEFAULT NULL,
  `visitNum` int(11) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

/*Data for the table `t_flow_num` */
insert into `t_flow_num`(`id`,`dateStr`,`pVNum`,`uVNum`,`iPNum`,`newUvNum`,`visitNum`) values
(1,'20131001',4702,3096,2880,2506,3773),
(2,'20131002',7528,4860,4435,4209,5937),
(3,'20131003',7286,4741,4409,4026,5817),
(4,'20131004',6653,5102,4900,2305,4659),
(5,'20131005',5957,4943,4563,3134,3698),
(6,'20131006',7978,6567,6063,4417,4560),
(7,'20131007',6666,5555,4444,3333,3232);

/*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
/*!40014 SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS */;
/*!40111 SET SQL_NOTES=@OLD_SQL_NOTES */;
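To connect the sample data with the chart format, the sketch below shapes rows like those in t_avgpv_num into the structure that an ECharts bar chart option expects (title, xAxis.data, series[].data) and serializes it to JSON. The project itself does this in its Java web layer; this Python version, including the function name, is only an illustration of the data shape.

```python
import json


def build_bar_option(rows, title):
    """Turn (dateStr, value) rows into a dict shaped like an ECharts bar chart option."""
    return {
        "title": {"text": title},
        "tooltip": {},
        "xAxis": {"data": [date_str for date_str, _ in rows]},
        "yAxis": {},
        "series": [{"name": title, "type": "bar",
                    "data": [float(value) for _, value in rows]}],
    }


# First three sample rows of t_avgpv_num (dateStr, avgPvNum).
avgpv_rows = [("20130919", "13.40"), ("20130920", "17.60"), ("20130921", "15.20")]
option = build_bar_option(avgpv_rows, "avgPvNum")
option_json = json.dumps(option)
print(option_json)
```

On the page, the parsed JSON would be passed to setOption on an initialized chart instance.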

Step 2: create a maven web project and add the jar dependencies

<dependencies>
  <dependency><groupId>org.springframework</groupId><artifactId>spring-context</artifactId><version>4.2.4.RELEASE</version></dependency>
  <dependency><groupId>org.springframework</groupId><artifactId>spring-beans</artifactId><version>4.2.4.RELEASE</version></dependency>
  <dependency><groupId>org.springframework</groupId><artifactId>spring-webmvc</artifactId><version>4.2.4.RELEASE</version></dependency>
  <dependency><groupId>org.springframework</groupId><artifactId>spring-jdbc</artifactId><version>4.2.4.RELEASE</version></dependency>
  <dependency><groupId>org.springframework</groupId><artifactId>spring-aspects</artifactId><version>4.2.4.RELEASE</version></dependency>
  <dependency><groupId>org.springframework</groupId><artifactId>spring-jms</artifactId><version>4.2.4.RELEASE</version></dependency>
  <dependency><groupId>org.springframework</groupId><artifactId>spring-context-support</artifactId><version>4.2.4.RELEASE</version></dependency>
  <dependency><groupId>org.mybatis</groupId><artifactId>mybatis</artifactId><version>3.2.8</version></dependency>
  <dependency><groupId>org.mybatis</groupId><artifactId>mybatis-spring</artifactId><version>1.2.2</version></dependency>
  <dependency><groupId>com.github.miemiedev</groupId><artifactId>mybatis-paginator</artifactId><version>1.2.15</version></dependency>
  <dependency><groupId>mysql</groupId><artifactId>mysql-connector-java</artifactId><version>5.1.32</version></dependency>
  <dependency><groupId>com.alibaba</groupId><artifactId>druid</artifactId><version>1.0.9</version></dependency>
  <dependency><groupId>jstl</groupId><artifactId>jstl</artifactId><version>1.2</version></dependency>
  <dependency><groupId>javax.servlet</groupId><artifactId>servlet-api</artifactId><version>2.5</version><scope>provided</scope></dependency>
  <dependency><groupId>javax.servlet</groupId><artifactId>jsp-api</artifactId><version>2.0</version><scope>provided</scope></dependency>
  <dependency><groupId>junit</groupId><artifactId>junit</artifactId><version>4.12</version></dependency>
  <dependency><groupId>com.fasterxml.jackson.core</groupId><artifactId>jackson-databind</artifactId><version>2.4.2</version></dependency>
</dependencies>
<build>
  <finalName>${project.artifactId}</finalName>
  <resources>
    <resource>
      <directory>src/main/java</directory>
      <includes>
        <include>**/*.properties</include>
        <include>**/*.xml</include>
      </includes>
      <filtering>false</filtering>
    </resource>
    <resource>
      <directory>src/main/resources</directory>
      <includes>
        <include>**/*.properties</include>
        <include>**/*.xml</include>
      </includes>
      <filtering>false</filtering>
    </resource>
  </resources>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>3.2</version>
      <configuration>
        <source>1.7</source>
        <target>1.7</target>
        <encoding>UTF-8</encoding>
      </configuration>
    </plugin>
    <plugin>
      <groupId>org.apache.tomcat.maven</groupId>
      <artifactId>tomcat7-maven-plugin</artifactId>
      <version>2.2</version>
      <configuration>
        <path>/</path>
        <port>8080</port>
      </configuration>
    </plugin>
  </plugins>
</build>

第三步：配置SqlMapConfig.xml

<?xml version="1.0" encoding="UTF-8" ?> <!DOCTYPE configurationPUBLIC "-//mybatis.org//DTD Config 3.0//EN""http://mybatis.org/dtd/mybatis-3-config.dtd"> <configuration><settings><setting name="logImpl" value="STDOUT_LOGGING" /></settings> </configuration>

第四步：配置ApplicationContext.xml

<?xml version="1.0" encoding="UTF-8"?> <beans xmlns="http://www.springframework.org/schema/beans"xmlns:context="http://www.springframework.org/schema/context"xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"xmlns:aop="http://www.springframework.org/schema/aop"xmlns:tx="http://www.springframework.org/schema/tx"xmlns:p="http://www.springframework.org/schema/p"xmlns:c="http://www.springframework.org/schema/c"xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsdhttp://www.springframework.org/schema/aop http://www.springframework.org/schema/aop/spring-aop.xsdhttp://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsdhttp://www.springframework.org/schema/tx http://www.springframework.org/schema/tx/spring-tx.xsd"><context:component-scan base-package="cn.itcast.weblog.service"></context:component-scan><context:property-placeholder location="classpath:properties/jdbc.properties"></context:property-placeholder><bean id="dataSource" class="com.alibaba.druid.pool.DruidDataSource"><property name="driverClassName" value="${jdbc.driver}"></property><property name="url" value="${jdbc.url}"></property><property name="username" value="${jdbc.username}"></property><property name="password" value="${jdbc.password}"></property></bean><bean id="transactionManager" class="org.springframework.jdbc.datasource.DataSourceTransactionManager"><property name="dataSource" ref="dataSource"></property></bean><tx:annotation-driven transaction-manager="transactionManager"></tx:annotation-driven><bean id="sqlSessionFactory" class="org.mybatis.spring.SqlSessionFactoryBean"><property name="dataSource" ref="dataSource"></property><property name="configLocation" value="classpath:mybaits/SqlMapConfig.xml"></property></bean><bean class="org.mybatis.spring.mapper.MapperScannerConfigurer"><property name="basePackage" value="cn.itcast.weblog.mapper"></property></bean></beans>

Step 5: Configure springMVC.xml

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:mvc="http://www.springframework.org/schema/mvc"
       xmlns:context="http://www.springframework.org/schema/context"
       xmlns:task="http://www.springframework.org/schema/task"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-4.0.xsd
                           http://www.springframework.org/schema/mvc http://www.springframework.org/schema/mvc/spring-mvc-4.0.xsd
                           http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context-4.0.xsd
                           http://www.springframework.org/schema/task http://www.springframework.org/schema/task/spring-task-4.0.xsd">

    <!-- Scan the controller package -->
    <context:component-scan base-package="cn.itcast.weblog.controller"></context:component-scan>

    <!-- Enable annotation-driven request mapping -->
    <mvc:annotation-driven/>

    <!-- Map logical view names to JSP files under /WEB-INF/jsp/ -->
    <bean class="org.springframework.web.servlet.view.InternalResourceViewResolver">
        <property name="prefix" value="/WEB-INF/jsp/"></property>
        <property name="suffix" value=".jsp"></property>
    </bean>
</beans>
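The view resolver above turns a logical view name returned by a controller into a JSP path by joining prefix, view name and suffix. A minimal sketch of that mapping, using a hypothetical helper class `ViewNameSketch` rather than any Spring API:

```java
// Illustrative only: shows how prefix + view name + suffix form the JSP path.
class ViewNameSketch {
    static String toJspPath(String prefix, String viewName, String suffix) {
        return prefix + viewName + suffix;
    }

    public static void main(String[] args) {
        // Same prefix and suffix as configured in springMVC.xml.
        System.out.println(toJspPath("/WEB-INF/jsp/", "index", ".jsp"));
        // prints "/WEB-INF/jsp/index.jsp"
    }
}
```

So a controller method that returns "index" should end up rendering /WEB-INF/jsp/index.jsp, which is why the JSP files need to sit in that directory.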

Step 6: Configure web.xml

<?xml version="1.0" encoding="UTF-8"?>
<web-app xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xmlns="http://java.sun.com/xml/ns/javaee"
         xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd"
         version="2.5">
    <display-name>crm</display-name>

    <welcome-file-list>
        <welcome-file>index.html</welcome-file>
        <welcome-file>index.htm</welcome-file>
        <welcome-file>index.jsp</welcome-file>
        <welcome-file>default.html</welcome-file>
        <welcome-file>default.htm</welcome-file>
        <welcome-file>default.jsp</welcome-file>
        <welcome-file>customer/list.action</welcome-file>
    </welcome-file-list>

    <!-- Root Spring context -->
    <context-param>
        <param-name>contextConfigLocation</param-name>
        <param-value>classpath:spring/ApplicationContext.xml</param-value>
    </context-param>
    <listener>
        <listener-class>org.springframework.web.context.ContextLoaderListener</listener-class>
    </listener>

    <!-- Spring MVC front controller, handling *.action requests -->
    <servlet>
        <servlet-name>springDispatcherServlet</servlet-name>
        <servlet-class>org.springframework.web.servlet.DispatcherServlet</servlet-class>
        <init-param>
            <param-name>contextConfigLocation</param-name>
            <param-value>classpath:springMVC/springmvc.xml</param-value>
        </init-param>
        <load-on-startup>1</load-on-startup>
    </servlet>
    <servlet-mapping>
        <servlet-name>springDispatcherServlet</servlet-name>
        <url-pattern>*.action</url-pattern>
    </servlet-mapping>

    <!-- Force UTF-8 request encoding -->
    <filter>
        <filter-name>CharacterEncodingFilter</filter-name>
        <filter-class>org.springframework.web.filter.CharacterEncodingFilter</filter-class>
        <init-param>
            <param-name>encoding</param-name>
            <param-value>utf-8</param-value>
        </init-param>
    </filter>
    <filter-mapping>
        <filter-name>CharacterEncodingFilter</filter-name>
        <url-pattern>/*</url-pattern>
    </filter-mapping>
</web-app>

Step 7: Copy the prepared resource files into the project

Step 8: Configure IDEA to run the project through the tomcat plugin

Step 9: Develop the mapper-layer XML and interface

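Interface:

import java.util.List;
import cn.itcast.weblog.pojo.TAvgpvNum;

public interface TAvgpvNumMapper {
    // Returns at most seven rows whose dateStr lies strictly between s and s1, newest first.
    List<TAvgpvNum> selectLastSeven(String s, String s1);
}

XML definition:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN" "http://mybatis.org/dtd/mybatis-3-mapper.dtd" >
<mapper namespace="cn.itcast.weblog.mapper.TAvgpvNumMapper" >
    <!-- Positional #{0}/#{1} work on older MyBatis versions; newer versions expect #{param1}/#{param2}. -->
    <!-- The comparison operators are written as &gt; and &lt; so that the mapper stays well-formed XML. -->
    <select id="selectLastSeven" parameterType="string" resultType="cn.itcast.weblog.pojo.TAvgpvNum">
        select * from t_avgpv_num
        where dateStr &gt; #{0}
          and dateStr &lt; #{1}
        order by dateStr desc
        limit 7;
    </select>
</mapper>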
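To make the query's semantics concrete, here is a small in-memory sketch that mirrors the SQL (strict lower and upper bounds, descending order, at most seven rows). It assumes dateStr values are yyyyMMdd strings, as the arguments used in the service layer suggest; the class `DateRangeSketch` and the sample data are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: an in-memory mirror of the selectLastSeven query.
class DateRangeSketch {
    // where dateStr > start and dateStr < end order by dateStr desc limit 7
    static List<String> selectLastSeven(List<String> all, String start, String end) {
        return all.stream()
                .filter(d -> d.compareTo(start) > 0 && d.compareTo(end) < 0)
                .sorted(Comparator.reverseOrder())
                .limit(7)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> days = Arrays.asList(
                "20130918", "20130919", "20130920", "20130921", "20130922",
                "20130923", "20130924", "20130925", "20130926");
        System.out.println(selectLastSeven(days, "20130919", "20130925"));
        // prints "[20130924, 20130923, 20130922, 20130921, 20130920]"
    }
}
```

Because both bounds are exclusive, the pair "20130919" and "20130925" matches five days rather than seven; the bounds have to sit one day outside the intended window.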

Step 10: Develop the service layer

@Service
@Transactional
public class AvgPvServiceImpl implements AvgPvService {

    @Autowired
    private TAvgpvNumMapper tAvgpvNumMapper;

    @Override
    public String getAvgJson() {
        // Query the most recent days of data for a given start and end date.
        // Both bounds are exclusive, because the mapper's SQL compares with > and <.
        List<TAvgpvNum> tAvgpvNums = tAvgpvNumMapper.selectLastSeven("20130919", "20130925");

        // Split the rows into two parallel lists: dates and average-PV values.
        AvgToBean avgToBean = new AvgToBean();
        List<String> dateStrs = new ArrayList<String>();
        List<BigDecimal> datas = new ArrayList<BigDecimal>();
        for (TAvgpvNum tAvgpvNum : tAvgpvNums) {
            dateStrs.add(tAvgpvNum.getDatestr());
            datas.add(tAvgpvNum.getAvgpvnum());
        }
        avgToBean.setDates(dateStrs);
        avgToBean.setData(datas);

        // Serialize the bean to a JSON string for the front end.
        String jsonString = JSONObject.toJSONString(avgToBean);
        return jsonString;
    }
}
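The service hard-codes its two date arguments. One possible alternative, sketched here under the assumption that dateStr uses the yyyyMMdd format and that the mapper's bounds are exclusive, is to derive both bounds from the last day of the window. The class `DateWindowSketch` and its method names are hypothetical and not part of the project.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Illustrative only: computes exclusive bounds for an N-day window ending on lastDay.
class DateWindowSketch {
    private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyyMMdd");

    // Exclusive lower bound: the day before the first day of the window.
    static String exclusiveStart(LocalDate lastDay, int days) {
        return lastDay.minusDays(days).format(FMT);
    }

    // Exclusive upper bound: the day after the last day of the window.
    static String exclusiveEnd(LocalDate lastDay) {
        return lastDay.plusDays(1).format(FMT);
    }

    public static void main(String[] args) {
        LocalDate lastDay = LocalDate.of(2013, 9, 24);
        System.out.println(exclusiveStart(lastDay, 7)); // prints "20130917"
        System.out.println(exclusiveEnd(lastDay));      // prints "20130925"
    }
}
```

With these bounds, the days 20130918 through 20130924 fall strictly inside the range, which gives a full seven-day window.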

Step 11: Develop the controller layer

@Controller
public class IndexController {

    @Autowired
    private AvgPvService avgPvService;

    @Autowired
    private FlowService flowService;

    @RequestMapping("/index.action")
    public String skipToIndex() {
        return "index";
    }

    @RequestMapping("/avgPvNum.action")
    @ResponseBody
    public String getAvgPvJson() {
        String avgJson = avgPvService.getAvgJson();
        return avgJson;
    }
}
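The JSON returned by /avgPvNum.action is serialized from AvgToBean, which the article does not list. A minimal sketch of what it might look like, assuming it is a plain bean whose two properties match the setDates and setData calls in the service; the getter names and the demo class are assumptions.

```java
import java.math.BigDecimal;
import java.util.Arrays;
import java.util.List;

// Hypothetical reconstruction: only setDates/setData appear in the article.
class AvgToBean {
    private List<String> dates;     // x-axis labels, one per day
    private List<BigDecimal> data;  // average PV per visitor for each day

    public List<String> getDates() { return dates; }
    public void setDates(List<String> dates) { this.dates = dates; }

    public List<BigDecimal> getData() { return data; }
    public void setData(List<BigDecimal> data) { this.data = data; }
}

class AvgToBeanDemo {
    public static void main(String[] args) {
        AvgToBean bean = new AvgToBean();
        // Sample values for illustration only.
        bean.setDates(Arrays.asList("20130920", "20130921"));
        bean.setData(Arrays.asList(new BigDecimal("1.5"), new BigDecimal("2.0")));
        System.out.println(bean.getDates() + " " + bean.getData());
    }
}
```

A getter-based JSON serializer would then presumably emit an object with a `dates` array and a `data` array, which lines up with the category axis and the series data an ECharts chart expects.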
