Spark 1.3.1 Basic Usage Tutorial
Spark can be used in two ways: interactively from a command-line shell, or programmatically.
The interactive shell supports Scala and Python.
Standalone programs can be written in Scala, Python, and Java.
This article is based on https://spark.apache.org/docs/latest/quick-start.html and can serve as a quick start.
For more detailed material and usage, see https://spark.apache.org/docs/latest/programming-guide.html
Suggested learning path:
1. Set up a single-machine environment: http://blog.csdn.net/jediael_lu/article/details/45310321
2. Quick start, to get a first impression: this article, http://blog.csdn.net/jediael_lu/article/details/45333195
3. Learn Scala
4. Go deeper: https://spark.apache.org/docs/latest/programming-guide.html
5. Move on to more specialized material, or keep learning as you use Spark
I. Basic concepts
1. All operations in Spark are performed on RDDs (Resilient Distributed Datasets). "Resilient" refers to the fact that the data can move flexibly between memory and storage, and can be recomputed from its lineage if a partition is lost.
2. RDD operations fall into two categories: transformations and actions. A transformation derives a new RDD from an existing one (e.g. filter), while an action computes a result from an RDD (e.g. count).
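Transformations are also lazy: they only record how the new RDD would be derived, and nothing is computed until an action is called. A minimal sketch, assuming the spark-shell session started in the next section (which provides sc) and the same README.md path used below:

// Transformations are lazy: only the lineage is recorded here, nothing runs yet.
val lines = sc.textFile("/mnt/jediael/spark-1.3.1-bin-hadoop2.6/README.md")
val shortLines = lines.filter(line => line.length < 20)

// An action triggers the actual computation and returns a result to the driver.
val numShortLines = shortLines.count()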
II. The interactive shell
1. Quick start
$ ./bin/spark-shell
(1) Read a file into an RDD, then count the number of lines in the file and display the first line.
scala> var textFile = sc.textFile("/mnt/jediael/spark-1.3.1-bin-hadoop2.6/README.md")
textFile: org.apache.spark.rdd.RDD[String] = /mnt/jediael/spark-1.3.1-bin-hadoop2.6/README.md MapPartitionsRDD[1] at textFile at <console>:21
scala> textFile.count()
res0: Long = 98
scala> textFile.first();
res1: String = # Apache Spark
(2) Count the lines that contain "Spark".
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:23
scala> linesWithSpark.count()
res0: Long = 19
(3) The filter and count above can also be chained together:
scala> textFile.filter(line => line.contains("Spark")).count()
res1: Long = 19
2. Going a bit deeper
(1) Use map to count the words on each line, then use reduce to find the largest number of words on any single line.
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res2: Int = 14
(2) Java packages can be called directly from Scala:
scala> import java.lang.Math
import java.lang.Math
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
res2: Int = 14
(3) Implementing word count:
scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[8] at reduceByKey at <console>:24
scala> wordCounts.collect()
res4: Array[(String, Int)] = Array((package,1), (For,2), (processing.,1), (Programs,1), (Because,1), (The,1), (cluster.,1), (its,1), ([run,1), (APIs,1), (computation,1), (Try,1), (have,1), (through,1), (several,1), (This,2), ("yarn-cluster",1), (graph,1), (Hive,2), (storage,1), (["Specifying,1), (To,2), (page](http://spark.apache.org/documentation.html),1), (Once,1), (application,1), (prefer,1), (SparkPi,2), (engine,1), (version,1), (file,1), (documentation,,1), (processing,,2), (the,21), (are,1), (systems.,1), (params,1), (not,1), (different,1), (refer,2), (Interactive,2), (given.,1), (if,4), (build,3), (when,1), (be,2), (Tests,1), (Apache,1), (all,1), (./bin/run-example,2), (programs,,1), (including,3), (Spark.,1), (package.,1), (1000).count(),1), (HDFS,1), (Versions,1), (Data.,1), (>...
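As a small follow-on sketch (not from the quick-start guide), the same pair-RDD operations can also rank the words by frequency:

// Swap to (count, word) so sortByKey orders by count, then take the 10 most frequent words.
val top10 = wordCounts.map(_.swap).sortByKey(ascending = false).take(10)
top10.foreach { case (count, word) => println(word + "\t" + count) }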
3. Caching: persisting an RDD in memory can greatly speed up repeated processing.
scala> linesWithSpark.cache()
res5: linesWithSpark.type = MapPartitionsRDD[2] at filter at <console>:23
scala> linesWithSpark.count()
res8: Long = 19
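cache() is shorthand for persist(StorageLevel.MEMORY_ONLY); other storage levels let cached partitions spill to disk. A small sketch (the "Java" filter is only an illustrative example, not from the original article); note that a storage level can only be set once per RDD:

import org.apache.spark.storage.StorageLevel

val javaLines = textFile.filter(line => line.contains("Java"))
javaLines.persist(StorageLevel.MEMORY_AND_DISK) // spill to disk when memory is tight
javaLines.count()                               // the first action materializes the cache
javaLines.unpersist()                           // release the cached partitions when done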
III. Standalone programs
Scala code; I am not yet familiar with it, so I will run it later.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
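The quick-start guide builds this kind of program with sbt and launches it with spark-submit; a minimal sketch of the build file and launch command (the file name simple.sbt and the local[4] master are illustrative):

// simple.sbt — Spark 1.3.1 is built against Scala 2.10
name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1"

$ sbt package
$ ./bin/spark-submit --class "SimpleApp" --master local[4] target/scala-2.10/simple-project_2.10-1.0.jar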