Hadoop - MapReduce WordCount Word-Frequency Example
Table of Contents
- WordCount Example
- Requirement
- Environment Setup
- Local Testing
- Submitting to the Cluster
- Cluster Testing
- Source Code
- 1. WordCountMapper class
- 2. WordCountReducer class
- 3. WordCountDriver class
WordCount Example
Requirement
: Count the number of occurrences of each word in a set of files.
1. Input data
hello hello
hi hi
haha
map
reduce
2. Expected output data
hello 2
hi 2
haha 1
map 1
reduce 1
Requirement analysis: following the MapReduce programming conventions, write a Mapper, a Reducer, and a Driver.
3. Mapper
1) Convert the text that the MapTask hands us into a String:
hello hello
2) Split the line into words on spaces:
hello
hello
3) Output each word as <word, 1>:
hello, 1
hello, 1
4. Reducer
1) Aggregate the counts for each key:
hello, 1
hello, 1
2) Output the total count for that key:
hello, 2
5. Driver
1) Get the configuration info and a job instance;
2) Specify the local path of this program's jar;
3) Associate the Mapper/Reducer business classes;
4) Specify the KV types of the Mapper's output;
5) Specify the KV types of the final output;
6) Specify the directory of the job's input files;
7) Specify the directory of the job's output;
8) Submit the job.
Environment Setup
1. Create a maven project, MapReduceDemo;
2. Add the Hadoop dependencies to pom.xml:
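A minimal sketch of the dependency block, assuming hadoop-client matched to the 3.2.2 cluster used later, plus junit and slf4j-log4j12 to drive the log4j.properties config in step 3 (the junit/slf4j versions here are illustrative):

```xml
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.2.2</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>1.7.30</version>
    </dependency>
</dependencies>
```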
3. Under the project's src/main/resources directory, create a file named "log4j.properties" and fill it with:

```properties
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n
```

4. Create the package com.xiaobai.mapreduce.wordcount;
Write the Mapper, Reducer, and Driver classes in it.
Local Testing
The Driver's path settings for the local test:

```java
//6. Set the input and output paths
FileInputFormat.setInputPaths(job, new Path("/Users/jane/Desktop/test/"));
FileOutputFormat.setOutputPath(job, new Path("/Users/jane/Desktop/hadoop/output"));
```

Create a hello.txt under "/Users/jane/Desktop/test/" containing the input data from the requirement section above. (For the cluster run later, these hardcoded paths become args[0] and args[1]; see the Driver source below.)
Output result:
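Given the input above, the result file part-r-00000 should contain the following (keys come out of the reducer in sorted order, tab-separated by the default TextOutputFormat):

```text
haha	1
hello	2
hi	2
map	1
reduce	1
```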
Submitting to the Cluster
Cluster Testing
1. To build the jar with maven, add the following build configuration to pom.xml:

```xml
<build>
    <plugins>
        <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.6.1</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
```

2. Package the maven jar.
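Packaging can be done from the command line (or the IDE's Maven panel). With the assembly plugin above, a single command produces two jars under target/; the artifact names below assume the default artifactId MapReduceDemo and version:

```bash
mvn clean package
# target/MapReduceDemo-1.0-SNAPSHOT.jar                         <- classes only
# target/MapReduceDemo-1.0-SNAPSHOT-jar-with-dependencies.jar   <- dependencies bundled in
```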
3. Start the cluster:

```bash
[xiaobai@hadoop102 ~]$ myhadoop.sh start
```

4. Check the processes to make sure the cluster came up cleanly (myhadoop.sh and jpsall are custom helper scripts that start the cluster and run jps across the nodes):

```bash
[xiaobai@hadoop102 ~]$ jpsall
```

5. Copy the jar to the desktop, rename it wc.jar, and upload it to /opt/module/hadoop-3.2.2:
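A hypothetical way to do the upload from the local machine; the local path is illustrative, and wc.jar is the renamed jar-with-dependencies:

```bash
scp ~/Desktop/wc.jar xiaobai@hadoop102:/opt/module/hadoop-3.2.2/
```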
6. Right-click WordCountDriver -> Copy/Paste Special -> Copy Reference to get the fully qualified class name:
com.xiaobai.mapreduce.wordcount.WordCountDriver
7. In the /opt/module/hadoop-3.2.2 directory, create WordSum.txt and enter the text to be counted.
8. Create an input folder on HDFS:

```bash
[xiaobai@hadoop102 hadoop-3.2.2]$ hdfs dfs -mkdir /input
```

9. Upload the local file WordSum.txt to HDFS:

```bash
[xiaobai@hadoop102 hadoop-3.2.2]$ hdfs dfs -put /opt/module/hadoop-3.2.2/WordSum.txt /input
```

10. Run wc.jar:

```bash
[xiaobai@hadoop102 hadoop-3.2.2]$ hadoop jar wc.jar com.xiaobai.mapreduce.wordcount.WordCountDriver /input /output
```
Tip: the empty string shows up with a count of 1 because I typed an extra line without writing anything into it.
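To check the result from the shell, list the output directory and print the result file with the standard HDFS commands:

```bash
[xiaobai@hadoop102 hadoop-3.2.2]$ hdfs dfs -ls /output
[xiaobai@hadoop102 hadoop-3.2.2]$ hdfs dfs -cat /output/part-r-00000
```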
Source Code
1. WordCountMapper class
```java
package com.xiaobai.mapreduce.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/*
 KEYIN,   the map-stage input key type:    LongWritable
 VALUEIN, the map-stage input value type:  Text
 KEYOUT,  the map-stage output key type:   Text
 VALUEOUT, the map-stage output value type: IntWritable
*/
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private Text outK = new Text();
    private IntWritable outV = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        //1. Get one line
        //hello hello
        String line = value.toString();

        //2. Split it
        //hello
        //hello
        String[] words = line.split(" ");

        //3. Loop and write out
        for (String s : words) {
            //wrap the word in outK
            outK.set(s);
            //write out
            context.write(outK, outV);
        }
    }
}
```

2. WordCountReducer class
```java
package com.xiaobai.mapreduce.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/*
 KEYIN,   the reduce-stage input key type:    Text
 VALUEIN, the reduce-stage input value type:  IntWritable
 KEYOUT,  the reduce-stage output key type:   Text
 VALUEOUT, the reduce-stage output value type: IntWritable
*/
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable outV = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        //hello,(1,1)
        //accumulate
        for (IntWritable value : values) {
            sum += value.get();
        }
        outV.set(sum);
        //write out
        context.write(key, outV);
    }
}
```

3. WordCountDriver class
```java
package com.xiaobai.mapreduce.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordCountDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        //1. Get the job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        //2. Set the jar path
        job.setJarByClass(WordCountDriver.class);

        //3. Associate the mapper and reducer
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        //4. Set the KV types of the map output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        //5. Set the KV types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        //6. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        //7. Submit the job
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
```