Hadoop 2.0.0-alpha: First-Taste Installation and Hello World
This article is for testing and learning only. Running 2.0 in production is not recommended: 2.0 adopts YARN, so hive, hbase, mahout and anything else that depends on map/reduce V1 may not work on hadoop 2.0, or may misbehave in unexpected ways.
On May 23rd, Apache released the test (alpha) build of hadoop 2.0. Since I was stuck at home with nothing to do, I took map/reduce V2 for a small spin.
Environment: an Ubuntu Server 12.04 guest in VirtualBox, with openjdk-7.
A quick introduction: 2.0.0 evolved out of hadoop 0.23.x. The jobtracker and tasktracker are gone, or rather, they have been folded into containers, and YARN replaces the original map/reduce framework.
YARN is billed as second-generation map/reduce: faster than the first generation and able to scale to larger clusters. For hadoop 0.20.x and the 1.0.x line that grew out of it, the recommended cluster size is around 3,000 machines, with a maximum of about 4,000; hadoop 2.0 with YARN claims support for 6,000-10,000 machines and up to 200,000 CPU cores. In cluster size and raw compute that seems like a substantial improvement, and namenode HA (high availability) has been added. I say "seems" because I have not measured the speed in a real production environment, and since this is a VM test I did not exercise namenode HA either, only skimmed it.
The 2.0 directory layout has changed from 1.0 and is clearer: executables live under bin/, server start scripts moved to sbin/, and the map/red, streaming, and pipes jars sit under share/. Everything is easy to find.
After unpacking the tarball, go into etc/hadoop/ and fill in the usual configuration files as for a single-node install. core-site.xml and hdfs-site.xml are still there, but mapred-site.xml is gone, replaced by yarn-site.xml.
Assuming the single-node configuration is in place, go into $HADOOP_HOME/bin/ and run:
./hadoop namenode -format
# format the namenode first
cd ../sbin/
# sbin/ holds the server start scripts
./hadoop-daemon.sh start namenode
./hadoop-daemon.sh start datanode
./hadoop-daemon.sh start secondarynamenode
# the secondary namenode is optional and does not affect normal use, but it lets you try the HA feature
# the next part matters: 2.0 drops jobtracker and tasktracker in favor of YARN, so running "start jobtracker" and the like will fail
# hadoop, hdfs and map/reduce now each have their own scripts, so hadoop-daemon.sh no longer starts everything
./yarn-daemon.sh start resourcemanager
# the counterpart of the old jobtracker; it allocates compute resources and can live on the same box as the namenode
./yarn-daemon.sh start nodemanager
# the counterpart of the old tasktracker; it must run on every datanode (slave) server
Run ps aux; if you see the Java daemon processes (four, or five counting the secondarynamenode), startup succeeded. Visit http://localhost:50070 to check on HDFS. Since the jobtracker is gone, there is no longer a port 50030 page for watching jobs; more on that another time.
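A quick sanity check, assuming a stock JDK (jps ships with it) and the default HDFS web UI port:
jps
# expect NameNode, DataNode, ResourceManager and NodeManager (plus SecondaryNameNode if started)
curl -s http://localhost:50070/ | head
# the HDFS web UI should answer on the default port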
Now let's try writing a first map/reduce V2 program. In terms of how the program is written there is no difference at all from V1; only the final invocation changes. For compatibility, hadoop 2.0 keeps the user-facing interfaces the same as before.
Take a piece of data like this:
20120503        04      2012-05-03 04:49:22             222.139.35.72   Log_ASF ProductVer="5.12.0425.2111"
20120503        04      2012-05-03 04:49:21             113.232.38.239  Log_ASF ProductVer="5.09.0119.1112"
Assume just these two distinct records, 20 lines in total.
As before, the map/red scripts are written in Python:
#!/usr/bin/python
# -*- encoding: UTF-8 -*-
# map.py
import sys

debug = True
# LZO-compressed input carries an extra leading field, shifting every index by one
if debug:
    lzo = 0
else:
    lzo = 1

count = '0'  # placeholder value; the reducer simply counts lines per key
for line in sys.stdin:
    try:
        flags = line[:-1].split('\t')
        # skip malformed lines that lack the fields indexed below
        if len(flags) < 6 + lzo:
            continue
        stat_date = flags[2 + lzo].split(' ')[0]  # date part of the timestamp
        version = flags[5 + lzo].split('"')[1]    # value inside ProductVer="..."
        print stat_date + ',' + version + '\t' + count
    except Exception, e:
        # report parse errors on stderr so they don't pollute the map output
        print >> sys.stderr, e
------------------------------------------------------------------
#!/usr/bin/python
# -*- encoding: UTF-8 -*-
# reduce.py
import sys

res = {}  # dictionary: key -> occurrence count
for line in sys.stdin:
    try:
        flags = line[:-1].split('\t')
        if len(flags) != 2:
            continue
        field_key = flags[0]
        if field_key not in res:
            res[field_key] = 0
        res[field_key] += 1
    except Exception:
        pass

for key in res.keys():
    print '%s,%s' % (key, res[key])
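Before involving the cluster at all, the pipeline can be sanity-checked with a plain shell pipe, the usual streaming debugging trick (this assumes the sample file sits at /root/asf and the scripts are executable):
chmod +x /opt/hadoop/mrs/map.py /opt/hadoop/mrs/red.py
cat /root/asf | /opt/hadoop/mrs/map.py | sort | /opt/hadoop/mrs/red.py
# should print the same two date,version,count lines the job produces below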
Then copy the sample data onto HDFS:
./hadoop fs -mkdir /tmp
./hadoop fs -copyFromLocal /root/asf /tmp/asf
Test it: this works just like earlier hadoop releases, and both ways of launching a streaming job are accepted:
./hadoop jar /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.0.0-alpha.jar -mapper /opt/hadoop/mrs/map.py -reducer /opt/hadoop/mrs/red.py -input /tmp/asf -output /asf
or
./yarn jar /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.0.0-alpha.jar -mapper /opt/hadoop/mrs/map.py -reducer /opt/hadoop/mrs/red.py -input /tmp/asf -output /asf
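Both commands above reference the scripts by absolute path, which only works because everything runs on one node. On a real cluster the scripts would also need to be shipped with the job via streaming's -file option; a sketch of that variant, under the same assumed paths:
./yarn jar /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.0.0-alpha.jar -file /opt/hadoop/mrs/map.py -file /opt/hadoop/mrs/red.py -mapper map.py -reducer red.py -input /tmp/asf -output /asf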
Then:
./hadoop fs -cat /asf/part-00000
2012-05-03,5.09.0119.1112,2
2012-05-03,5.12.0425.2111,18
The result is correct.
Appendix: the map/reduce V2 job log:
root@localhost:/opt/hadoop/bin# ./yarn jar /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.0.0-alpha.jar -mapper /opt/hadoop/mrs/map.py -reducer /opt/hadoop/mrs/red.py -input /tmp/asf -output /asf
12/06/01 23:26:40 WARN util.KerberosName: Kerberos krb5 configuration not found, setting default realm to empty
12/06/01 23:26:41 WARN conf.Configuration: session.id is deprecated. Instead, use dfs.metrics.session-id
12/06/01 23:26:41 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
12/06/01 23:26:41 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
12/06/01 23:26:41 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/06/01 23:26:42 WARN snappy.LoadSnappy: Snappy native library not loaded
12/06/01 23:26:42 INFO mapred.FileInputFormat: Total input paths to process : 1
12/06/01 23:26:42 INFO mapreduce.JobSubmitter: number of splits:1
12/06/01 23:26:42 WARN conf.Configuration: mapred.jar is deprecated. Instead, use mapreduce.job.jar
12/06/01 23:26:42 WARN conf.Configuration: mapred.create.symlink is deprecated. Instead, use mapreduce.job.cache.symlink.create
12/06/01 23:26:42 WARN conf.Configuration: mapred.job.name is deprecated. Instead, use mapreduce.job.name
12/06/01 23:26:42 WARN conf.Configuration: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
12/06/01 23:26:42 WARN conf.Configuration: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
12/06/01 23:26:42 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
12/06/01 23:26:42 WARN conf.Configuration: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
12/06/01 23:26:42 WARN conf.Configuration: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
12/06/01 23:26:42 WARN conf.Configuration: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
12/06/01 23:26:42 WARN conf.Configuration: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
12/06/01 23:26:42 WARN conf.Configuration: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
12/06/01 23:26:42 WARN mapred.LocalDistributedCacheManager: LocalJobRunner does not support symlinking into current working dir.
12/06/01 23:26:42 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
12/06/01 23:26:42 INFO mapreduce.Job: Running job: job_local_0001
12/06/01 23:26:42 INFO mapred.LocalJobRunner: OutputCommitter set in config null
12/06/01 23:26:42 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
12/06/01 23:26:42 INFO mapred.LocalJobRunner: Waiting for map tasks
12/06/01 23:26:42 INFO mapred.LocalJobRunner: Starting task: attempt_local_0001_m_000000_0
12/06/01 23:26:42 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.yarn.util.LinuxResourceCalculatorPlugin@52b5ef94
12/06/01 23:26:42 INFO mapred.MapTask: numReduceTasks: 1
12/06/01 23:26:42 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
12/06/01 23:26:42 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
12/06/01 23:26:42 INFO mapred.MapTask: soft limit at 83886080
12/06/01 23:26:42 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
12/06/01 23:26:42 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
12/06/01 23:26:42 INFO streaming.PipeMapRed: PipeMapRed exec [/opt/hadoop/mrs/map.py]
12/06/01 23:26:42 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
12/06/01 23:26:42 WARN conf.Configuration: user.name is deprecated. Instead, use mapreduce.job.user.name
12/06/01 23:26:42 WARN conf.Configuration: map.input.start is deprecated. Instead, use mapreduce.map.input.start
12/06/01 23:26:42 WARN conf.Configuration: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
12/06/01 23:26:42 WARN conf.Configuration: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
12/06/01 23:26:42 WARN conf.Configuration: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
12/06/01 23:26:42 WARN conf.Configuration: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
12/06/01 23:26:42 WARN conf.Configuration: map.input.length is deprecated. Instead, use mapreduce.map.input.length
12/06/01 23:26:42 WARN conf.Configuration: mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir
12/06/01 23:26:42 WARN conf.Configuration: mapred.work.output.dir is deprecated. Instead, use mapreduce.task.output.dir
12/06/01 23:26:42 WARN conf.Configuration: map.input.file is deprecated. Instead, use mapreduce.map.input.file
12/06/01 23:26:42 WARN conf.Configuration: mapred.job.id is deprecated. Instead, use mapreduce.job.id
12/06/01 23:26:43 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
12/06/01 23:26:43 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
12/06/01 23:26:43 INFO streaming.PipeMapRed: MRErrorThread done
12/06/01 23:26:43 INFO streaming.PipeMapRed: Records R/W=20/1
12/06/01 23:26:43 INFO streaming.PipeMapRed: mapRedFinished
12/06/01 23:26:43 INFO mapred.LocalJobRunner:
12/06/01 23:26:43 INFO mapred.MapTask: Starting flush of map output
12/06/01 23:26:43 INFO mapred.MapTask: Spilling map output
12/06/01 23:26:43 INFO mapred.MapTask: bufstart = 0; bufend = 560; bufvoid = 104857600
12/06/01 23:26:43 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214320(104857280); length = 77/6553600
12/06/01 23:26:43 INFO mapred.MapTask: Finished spill 0
12/06/01 23:26:43 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of committing
12/06/01 23:26:43 INFO mapred.LocalJobRunner: Records R/W=20/1
12/06/01 23:26:43 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
12/06/01 23:26:43 INFO mapred.LocalJobRunner: Finishing task: attempt_local_0001_m_000000_0
12/06/01 23:26:43 INFO mapred.LocalJobRunner: Map task executor complete.
12/06/01 23:26:43 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.yarn.util.LinuxResourceCalculatorPlugin@25d71236
12/06/01 23:26:43 INFO mapred.Merger: Merging 1 sorted segments
12/06/01 23:26:43 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 574 bytes
12/06/01 23:26:43 INFO mapred.LocalJobRunner:
12/06/01 23:26:43 INFO streaming.PipeMapRed: PipeMapRed exec [/opt/hadoop/mrs/red.py]
12/06/01 23:26:43 WARN conf.Configuration: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
12/06/01 23:26:43 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
12/06/01 23:26:43 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
12/06/01 23:26:43 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
12/06/01 23:26:43 INFO streaming.PipeMapRed: Records R/W=20/1
12/06/01 23:26:43 INFO streaming.PipeMapRed: MRErrorThread done
12/06/01 23:26:43 INFO streaming.PipeMapRed: mapRedFinished
12/06/01 23:26:43 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of committing
12/06/01 23:26:43 INFO mapred.LocalJobRunner:
12/06/01 23:26:43 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now
12/06/01 23:26:43 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://localhost:9000/asf/_temporary/0/task_local_0001_r_000000
12/06/01 23:26:43 INFO mapred.LocalJobRunner: Records R/W=20/1 > reduce
12/06/01 23:26:43 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
12/06/01 23:26:43 INFO mapreduce.Job: Job job_local_0001 running in uber mode : false
12/06/01 23:26:43 INFO mapreduce.Job:  map 100% reduce 100%
12/06/01 23:26:43 INFO mapreduce.Job: Job job_local_0001 completed successfully
12/06/01 23:26:43 INFO mapreduce.Job: Counters: 32
        File System Counters
                FILE: Number of bytes read=205938
                FILE: Number of bytes written=452840
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=252230
                HDFS: Number of bytes written=59
                HDFS: Number of read operations=13
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=4
        Map-Reduce Framework
                Map input records=20
                Map output records=20
                Map output bytes=560
                Map output materialized bytes=606
                Input split bytes=81
                Combine input records=0
                Combine output records=0
                Reduce input groups=2
                Reduce shuffle bytes=0
                Reduce input records=20
                Reduce output records=2
                Spilled Records=40
                Shuffled Maps =0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=12
                CPU time spent (ms)=0
                Physical memory (bytes) snapshot=0
                Virtual memory (bytes) snapshot=0
                Total committed heap usage (bytes)=396361728
        File Input Format Counters
                Bytes Read=126115
        File Output Format Counters
                Bytes Written=59
12/06/01 23:26:43 INFO streaming.StreamJob: Output directory: /asf
Of course map/reduce V2 offers more than this, and it deserves deeper study. Although 2.0 grew out of 0.23, there are differences: 0.23 had an ApplicationManager, which 2.0 no longer seems to expose externally; perhaps it too has been folded into the container abstraction. The XML configuration options also look quite different from 0.20.x, though I have not examined them closely. The HA feature supports multiple namenodes, with each namenode managing its own set of datanodes, and allows manual failover from one namenode to another. That provides high availability, and automatic failure detection and failover is said to be coming.