Book of the Week: Big Data Analytics with Spark and Hadoop!
Big Data Analytics with Spark and Hadoop systematically explains how to perform big data analytics with Hadoop, Spark, and the tools in their ecosystems. It covers the fundamentals of Apache Spark and Hadoop and then explores every Spark component in depth (Spark Core, Spark SQL, DataFrames, Datasets, conventional streaming, Structured Streaming, MLlib, and GraphX) as well as the core components of Hadoop (HDFS, MapReduce, and YARN), all accompanied by detailed worked examples. It is a thorough reference for quickly mastering the infrastructure of big data analytics and how to implement it.
The book has ten chapters. Chapter 1 explains the concepts of big data analytics from a high-level view and introduces the tools and techniques used on the Hadoop and Spark platforms, along with some common use cases. Chapter 2 introduces the fundamentals of the Hadoop and Spark platforms. Chapter 3 takes a deep dive into Spark. Chapter 4 focuses on the Data Sources API, the DataFrame API, and the new Dataset API. Chapter 5 explains real-time analytics with Spark Streaming. Chapter 6 covers notebooks and dataflows used with Spark and Hadoop. Chapter 7 explains machine learning techniques on Spark and Hadoop. Chapter 8 shows how to build recommendation systems. Chapter 9 shows how to do graph analytics with GraphX. Chapter 10 shows how to use SparkR.
Table of Contents:
Chapter 1: A Big-Picture View of Big Data Analytics
1.1 Big data analytics and the roles of Hadoop and Spark
1.1.1 The lifecycle of a typical big data analytics project
1.1.2 The role of Spark in Hadoop
1.2 Big data science and the roles of Hadoop and Spark
1.2.1 The fundamental shift from data analytics to data science
1.2.2 The lifecycle of a typical data science project
1.2.3 The roles of Hadoop and Spark
1.3 Tools and techniques
1.4 Real-world use cases
1.5 Summary
Chapter 2: Getting Started with Apache Hadoop and Apache Spark
2.1 Introducing Apache Hadoop
2.1.1 The Hadoop Distributed File System (HDFS)
2.1.2 Features of HDFS
2.1.3 MapReduce
2.1.4 Features of MapReduce
2.1.5 MapReduce v1 vs. MapReduce v2
2.1.6 YARN
2.1.7 Storage options on Hadoop
2.2 Introducing Apache Spark
2.2.1 The history of Spark
2.2.2 What Apache Spark is
2.2.3 What Apache Spark is not
2.2.4 Problems with MapReduce
2.2.5 Spark's architecture
2.3 Why use Hadoop and Spark together
2.3.1 Features of Hadoop
2.3.2 Features of Spark
2.4 Installing Hadoop and Spark clusters
2.5 Summary
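The MapReduce sections above (2.1.3 through 2.1.5) describe the map/shuffle/reduce model. As a rough illustration of that model only (plain Python, not actual Hadoop code), word count, the canonical MapReduce example, can be sketched like this:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group pairs by key, as the framework does between map and reduce."""
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (key, [v for _, v in group])

def reduce_phase(grouped):
    """Reduce: sum the counts collected for each word."""
    for key, values in grouped:
        yield (key, sum(values))

lines = ["spark and hadoop", "spark streaming", "hadoop yarn"]
counts = dict(reduce_phase(shuffle_phase(map_phase(lines))))
```

In real Hadoop, the map and reduce functions run in parallel across the cluster and the shuffle moves data over the network; the phases themselves are the same.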
Chapter 3: Deep Dive into Apache Spark
3.1 Starting the Spark daemons
3.1.1 Working with CDH
3.1.2 Working with HDP, MapR, and prebuilt Spark packages
3.2 Learning Spark's core concepts
3.2.1 Ways of working with Spark
3.2.2 Resilient Distributed Datasets (RDDs)
3.2.3 The Spark context
3.2.4 Transformations and actions
3.2.5 Parallelism in RDDs
3.2.6 Lazy evaluation
3.2.7 The lineage graph
3.2.8 Serialization
3.2.9 Leveraging Hadoop file formats in Spark
3.2.10 Data locality
3.2.11 Shared variables
3.2.12 Key-value pair RDDs
3.3 The lifecycle of a Spark program
3.3.1 Pipelining
3.3.2 A summary of Spark execution
3.4 Spark applications
3.4.1 Spark Shell vs. Spark applications
3.4.2 Creating a Spark context
3.4.3 SparkConf
3.4.4 SparkSubmit
3.4.5 Precedence of Spark configuration options
3.4.6 Important application configurations
3.5 Caching and persistence
3.5.1 Storage levels
3.5.2 Which storage level to choose
3.6 Spark resource managers: Standalone, YARN, and Mesos
3.6.1 Local vs. cluster mode
3.6.2 Cluster resource managers
3.7 Summary
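Sections 3.2.4 and 3.2.6 above distinguish transformations, which only record a computation, from actions, which trigger it. As a loose analogy only (Python generators, not Spark code), the same deferred-execution pattern looks like this:

```python
log = []

def numbers(n):
    """A lazy source: records each element it actually produces."""
    for i in range(n):
        log.append(f"produce {i}")
        yield i

# "Transformations": build a lazy pipeline; no element is produced yet.
pipeline = (x * x for x in numbers(5) if x % 2 == 0)
assert log == []          # nothing has executed so far

# "Action": consuming the pipeline finally runs the whole chain.
result = list(pipeline)   # [0, 4, 16]
```

In Spark itself, `map` and `filter` are transformations recorded in the lineage graph (section 3.2.7), and actions such as `collect` or `count` are what force that lineage to run.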
Chapter 4: Big Data Analytics with Spark SQL, DataFrames, and Datasets
4.1 The history of Spark SQL
4.2 The architecture of Spark SQL
4.3 Introducing the four components of Spark SQL
4.4 The evolution of DataFrames and Datasets
4.4.1 What is wrong with RDDs
4.4.2 RDD transformations vs. Dataset and DataFrame transformations
4.5 Why use Datasets and DataFrames
4.5.1 Optimization
4.5.2 Speed
4.5.3 Automatic schema discovery
4.5.4 Multiple data sources, multiple programming languages
4.5.5 Interoperability between RDDs and the other APIs
4.5.6 Selecting and reading only the necessary data
4.6 When to use RDDs, Datasets, and DataFrames
4.7 Analytics with DataFrames
4.7.1 Creating a SparkSession
4.7.2 Creating DataFrames
4.7.3 Converting a DataFrame to an RDD
4.7.4 Common Dataset/DataFrame operations
4.7.5 Caching data
4.7.6 Performance optimizations
4.8 Analytics with the Dataset API
4.8.1 Creating Datasets
4.8.2 Converting a DataFrame to a Dataset
4.8.3 Accessing metadata with the catalog
4.9 The Data Sources API
4.9.1 Read and write functions
4.9.2 Built-in data sources
4.9.3 External data sources
4.10 Spark SQL as a distributed SQL engine
4.10.1 The Spark SQL Thrift server for JDBC/ODBC access
4.10.2 Querying data with the beeline client
4.10.3 Querying data from Hive with the spark-sql CLI
4.10.4 Integration with BI tools
4.11 Hive on Spark
4.12 Summary
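Section 4.7.4 covers common DataFrame operations such as selecting columns and filtering rows (`select` and `where`/`filter` in Spark's API). A toy stand-in using plain Python lists of dicts (the sample rows are invented for illustration, and this is not the Spark API) shows the shape of those operations without requiring a Spark installation:

```python
# Toy rows standing in for a DataFrame (hypothetical sample data).
rows = [
    {"name": "alice", "age": 34, "dept": "eng"},
    {"name": "bob",   "age": 29, "dept": "sales"},
    {"name": "carol", "age": 41, "dept": "eng"},
]

def select(rows, *cols):
    """Keep only the named columns of each row."""
    return [{c: r[c] for c in cols} for r in rows]

def where(rows, predicate):
    """Keep only the rows for which the predicate holds."""
    return [r for r in rows if predicate(r)]

engineers = select(where(rows, lambda r: r["dept"] == "eng"), "name", "age")
```

The equivalent PySpark expression would be along the lines of `df.where(df.dept == "eng").select("name", "age")`, with the optimizer free to push the filter and column pruning down to the data source (section 4.5.6).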
Chapter 5: Real-Time Analytics with Spark Streaming and Structured Streaming
5.1 Introducing real-time processing
5.1.1 Pros and cons of Spark Streaming
5.1.2 The history of Spark Streaming
5.2 The architecture of Spark Streaming
5.2.1 The flow of a Spark Streaming application
5.2.2 Stateless and stateful stream processing
5.3 Spark Streaming transformations and actions
5.3.1 union
5.3.2 join
5.3.3 The transform operation
5.3.4 updateStateByKey
5.3.5 mapWithState
5.3.6 Window operations
5.3.7 Output operations
5.4 Input data sources and output stores
5.4.1 Basic sources
5.4.2 Advanced sources
5.4.3 Custom sources
5.4.4 Receiver reliability
5.4.5 Output stores
5.5 Spark Streaming with Kafka and HBase
5.5.1 The receiver-based approach
5.5.2 The direct approach (no receivers)
5.5.3 Integration with HBase
5.6 Advanced concepts in Spark Streaming
5.6.1 Using DataFrames
5.6.2 MLlib operations
5.6.3 Caching/persistence
5.6.4 Fault tolerance in Spark Streaming
5.6.5 Performance tuning of Spark Streaming applications
5.7 Monitoring applications
5.8 Introducing Structured Streaming
5.8.1 The workflow of a Structured Streaming application
5.8.2 Streaming Datasets and streaming DataFrames
5.8.3 Operations on streaming Datasets and streaming DataFrames
5.9 Summary
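Sections 5.2.2 and 5.3.4 describe stateful processing, where `updateStateByKey` folds each micro-batch's new values into running per-key state. A minimal sketch of that idea in plain Python (not Spark Streaming code; the batches are invented sample data), keeping a running word count across micro-batches:

```python
def update_state(state, new_values):
    """Analog of updateStateByKey's update function: fold new counts into state."""
    return (state or 0) + sum(new_values)

def run_batches(batches):
    """Process a sequence of micro-batches, carrying state between them."""
    state = {}
    for batch in batches:                 # each batch is one micro-batch of words
        per_batch = {}
        for word in batch:                # group this batch's values by key
            per_batch.setdefault(word, []).append(1)
        for word, values in per_batch.items():
            state[word] = update_state(state.get(word), values)
    return state

totals = run_batches([["spark", "kafka"], ["spark"], ["hbase", "spark"]])
```

In Spark Streaming, checkpointing persists this state so it survives failures (section 5.6.4); this sketch keeps it in an ordinary dict.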
Chapter 6: Notebooks and Dataflows with Spark and Hadoop
6.1 Introducing web-based notebooks
6.2 Introducing Jupyter
6.2.1 Installing Jupyter
6.2.2 Analytics with Jupyter
6.3 Introducing Apache Zeppelin
6.3.1 Jupyter vs. Zeppelin
6.3.2 Installing Apache Zeppelin
6.3.3 Analytics with Zeppelin
6.4 The Livy REST job server and Hue notebooks
6.4.1 Installing and configuring the Livy server and Hue
6.4.2 Using the Livy server
6.4.3 Using Livy with Hue notebooks
6.4.4 Using Livy with Zeppelin
6.5 Introducing Apache NiFi for dataflows
6.5.1 Installing Apache NiFi
6.5.2 Using NiFi for dataflows and analytics
6.6 Summary
Chapter 7: Machine Learning with Spark and Hadoop
7.1 Introducing machine learning
7.2 Machine learning on Spark and Hadoop
7.3 Machine learning algorithms
7.3.1 Supervised learning
7.3.2 Unsupervised learning
7.3.3 Recommender systems
7.3.4 Feature extraction and transformation
7.3.5 Optimization
7.3.6 Spark MLlib data types
7.4 An example of a machine learning algorithm
7.5 Building machine learning pipelines
7.5.1 An example pipeline workflow
7.5.2 Building an ML pipeline
7.5.3 Saving and loading models
7.6 Machine learning with H2O and Spark
7.6.1 Why Sparkling Water
7.6.2 An application flow on YARN
7.6.3 Getting started with Sparkling Water
7.7 Introducing Hivemall
7.8 Introducing Hivemall for Spark
7.9 Summary
Chapter 8: Building Recommendation Systems with Spark and Mahout
8.1 Building recommendation systems
8.1.1 Content-based filtering
8.1.2 Collaborative filtering
8.2 Limitations of recommendation systems
8.3 Implementing a recommendation system with MLlib
8.3.1 Preparing the environment
8.3.2 Creating RDDs
8.3.3 Exploring the data with DataFrames
8.3.4 Creating training and testing datasets
8.3.5 Creating a model
8.3.6 Making predictions
8.3.7 Evaluating the model with testing data
8.3.8 Checking the accuracy of the model
8.3.9 Explicit vs. implicit feedback
8.4 The Mahout and Spark integration
8.4.1 Installing Mahout
8.4.2 Exploring the Mahout shell
8.4.3 Building a universal recommendation system with Mahout and a search tool
8.5 Summary
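Section 8.1.2 introduces collaborative filtering: recommend to a user what similar users liked. A minimal user-based sketch in plain Python, using cosine similarity over an invented ratings table (note that the book's MLlib implementation uses ALS matrix factorization, not this neighborhood method):

```python
from math import sqrt

# Hypothetical user -> {item: rating} data, invented for illustration.
ratings = {
    "ann": {"A": 5, "B": 3, "C": 4},
    "ben": {"A": 4, "B": 3, "C": 5, "D": 4},
    "cal": {"A": 1, "D": 5},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors."""
    common = set(u) & set(v)
    num = sum(u[i] * v[i] for i in common)
    den = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def recommend(user):
    """Suggest items the most similar other user rated that `user` has not seen."""
    others = [(cosine(ratings[user], ratings[o]), o) for o in ratings if o != user]
    _, nearest = max(others)
    return sorted(set(ratings[nearest]) - set(ratings[user]))

rec = recommend("ann")
```

Here "ann" rates most like "ben", so she is recommended the one item "ben" rated that she has not: item "D".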
Chapter 9: Graph Analytics with GraphX
9.1 Introducing graph processing
9.1.1 What is a graph
9.1.2 Graph databases vs. graph processing systems
9.1.3 Introducing GraphX
9.1.4 Graph algorithms
9.2 Getting started with GraphX
9.2.1 Basic operations of GraphX
9.2.2 Transforming graphs
9.2.3 GraphX algorithms
9.3 Analyzing flight data with GraphX
9.4 Introducing GraphFrames
9.4.1 Motif finding
9.4.2 Loading and saving GraphFrames
9.5 Summary
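Section 9.1.4 mentions graph algorithms such as PageRank, which GraphX ships out of the box. A small self-contained iterative PageRank in plain Python (a sketch assuming every node has at least one out-edge, so no dangling-mass handling is needed) shows what the algorithm computes:

```python
def pagerank(edges, n_iter=20, d=0.85):
    """Iterative PageRank over an adjacency dict {node: [out-neighbors]}."""
    nodes = set(edges) | {v for outs in edges.values() for v in outs}
    rank = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(n_iter):
        incoming = {v: 0.0 for v in nodes}
        for u, outs in edges.items():
            if outs:                       # split u's rank among its out-links
                share = rank[u] / len(outs)
                for v in outs:
                    incoming[v] += share
        # Damping: mix link-following mass with uniform teleportation.
        rank = {v: (1 - d) / len(nodes) + d * incoming[v] for v in nodes}
    return rank

# A tiny directed graph: "a" and "b" both link to "hub", which links back to "a".
edges = {"a": ["hub"], "b": ["hub"], "hub": ["a"]}
rank = pagerank(edges)
```

As expected, the heavily linked-to "hub" ranks highest, "a" (linked from "hub") next, and "b" (no in-links) lowest; GraphX computes the same quantity in parallel over RDD-backed graphs.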
Chapter 10: Interactive Analytics with SparkR
10.1 Introducing R and SparkR
10.1.1 What is R
10.1.2 Introducing SparkR
10.1.3 The architecture of SparkR
10.2 Getting started with SparkR
10.2.1 Installing and configuring R
10.2.2 Using the SparkR shell
10.2.3 Using SparkR scripts
10.3 Using DataFrames with SparkR
10.4 Using SparkR with RStudio
10.5 Machine learning with SparkR
10.5.1 Using the Naive Bayes model
10.5.2 Using the k-means model
10.6 Using SparkR with Zeppelin
10.7 Summary
To get the download link, visit the official website of the Training Center of the Institute of Computing Technology, Chinese Academy of Sciences at www.tcict.cn and add the WeChat support account listed there!
Reposted from: https://blog.51cto.com/14242083/2363478