Hadoop - MapReduce WordCount Word-Frequency Example
Table of Contents
- WordCount Example
- Requirement
- Environment Setup
- Local Testing
- Submitting to the Cluster
- Cluster Testing
- Source Code
- 1. WordCountMapper class
- 2. WordCountReducer class
- 3. WordCountDriver class
WordCount Example
Requirement
: Count the number of occurrences of each word in a set of files.
1. Input data
hello hello
hi hi
haha
map
reduce
2. Expected output data
hello 2
hi 2
haha 1
map 1
reduce 1
Requirement analysis: following the MapReduce programming conventions, write a Mapper, a Reducer, and a Driver.
3. Mapper
1) Convert the text that the MapTask hands us into a String:
hello hello
2) Split the line into words on spaces:
hello
hello
3) Output each word as <word, 1>:
hello, 1
hello, 1
4. Reducer
1) Aggregate the counts for each key:
hello, 1
hello, 1
2) Output the total count for that key:
hello, 2
5. Driver
1) Get the configuration info and a job instance;
2) Specify the local path of this program's jar;
3) Associate the Mapper/Reducer business classes;
4) Specify the KV types of the Mapper's output;
5) Specify the KV types of the final output;
6) Specify the directory of the job's input files;
7) Specify the directory of the job's output;
8) Submit the job.
Environment Setup
1. Create a maven project, MapReduceDemo;
2. Add the Hadoop dependencies to pom.xml:
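A minimal sketch of the dependency block, assuming hadoop-client matched to the 3.2.2 cluster used later, plus junit and slf4j-log4j12 to drive the log4j.properties config in step 3 (the junit/slf4j versions here are illustrative):

```xml
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.2.2</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>1.7.30</version>
    </dependency>
</dependencies>
```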
3. Under the project's src/main/resources directory, create a file named "log4j.properties" and fill it with:

```properties
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n
```

4. Create the package com.xiaobai.mapreduce.wordcount;
Write the Mapper, Reducer, and Driver classes in it.
Local Testing
The Driver's path settings for the local test:

```java
//6. Set the input and output paths
FileInputFormat.setInputPaths(job, new Path("/Users/jane/Desktop/test/"));
FileOutputFormat.setOutputPath(job, new Path("/Users/jane/Desktop/hadoop/output"));
```

Create a hello.txt under "/Users/jane/Desktop/test/" containing the input data from the requirement section above. (For the cluster run later, these hardcoded paths become args[0] and args[1]; see the Driver source below.)
Output result:
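Given the input above, the result file part-r-00000 should contain the following (keys come out of the reducer in sorted order, tab-separated by the default TextOutputFormat):

```text
haha	1
hello	2
hi	2
map	1
reduce	1
```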
Submitting to the Cluster
Cluster Testing
1. To build the jar with maven, add the following build configuration to pom.xml:

```xml
<build>
    <plugins>
        <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.6.1</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
```

2. Package the maven jar.
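Packaging can be done from the command line (or the IDE's Maven panel). With the assembly plugin above, a single command produces two jars under target/; the artifact names below assume the default artifactId MapReduceDemo and version:

```bash
mvn clean package
# target/MapReduceDemo-1.0-SNAPSHOT.jar                         <- classes only
# target/MapReduceDemo-1.0-SNAPSHOT-jar-with-dependencies.jar   <- dependencies bundled in
```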
3. Start the cluster:

```bash
[xiaobai@hadoop102 ~]$ myhadoop.sh start
```

4. Check the processes to make sure the cluster came up cleanly (myhadoop.sh and jpsall are custom helper scripts that start the cluster and run jps across the nodes):

```bash
[xiaobai@hadoop102 ~]$ jpsall
```

5. Copy the jar to the desktop, rename it wc.jar, and upload it to /opt/module/hadoop-3.2.2:
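A hypothetical way to do the upload from the local machine; the local path is illustrative, and wc.jar is the renamed jar-with-dependencies:

```bash
scp ~/Desktop/wc.jar xiaobai@hadoop102:/opt/module/hadoop-3.2.2/
```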
6. Right-click WordCountDriver -> Copy/Paste Special -> Copy Reference to get the fully qualified class name:
com.xiaobai.mapreduce.wordcount.WordCountDriver
7. In the /opt/module/hadoop-3.2.2 directory, create WordSum.txt and enter the text to be counted.
8. Create an input folder on HDFS:

```bash
[xiaobai@hadoop102 hadoop-3.2.2]$ hdfs dfs -mkdir /input
```

9. Upload the local file WordSum.txt to HDFS:

```bash
[xiaobai@hadoop102 hadoop-3.2.2]$ hdfs dfs -put /opt/module/hadoop-3.2.2/WordSum.txt /input
```

10. Run wc.jar:

```bash
[xiaobai@hadoop102 hadoop-3.2.2]$ hadoop jar wc.jar com.xiaobai.mapreduce.wordcount.WordCountDriver /input /output
```
Tip: the empty string shows up with a count of 1 because I typed an extra line without writing anything into it.
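To check the result from the shell, list the output directory and print the result file with the standard HDFS commands:

```bash
[xiaobai@hadoop102 hadoop-3.2.2]$ hdfs dfs -ls /output
[xiaobai@hadoop102 hadoop-3.2.2]$ hdfs dfs -cat /output/part-r-00000
```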
Source Code
1. WordCountMapper class
```java
package com.xiaobai.mapreduce.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/*
 KEYIN,   the map-stage input key type:    LongWritable
 VALUEIN, the map-stage input value type:  Text
 KEYOUT,  the map-stage output key type:   Text
 VALUEOUT, the map-stage output value type: IntWritable
*/
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private Text outK = new Text();
    private IntWritable outV = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        //1. Get one line
        //hello hello
        String line = value.toString();

        //2. Split it
        //hello
        //hello
        String[] words = line.split(" ");

        //3. Loop and write out
        for (String s : words) {
            //wrap the word in outK
            outK.set(s);
            //write out
            context.write(outK, outV);
        }
    }
}
```

2. WordCountReducer class
```java
package com.xiaobai.mapreduce.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/*
 KEYIN,   the reduce-stage input key type:    Text
 VALUEIN, the reduce-stage input value type:  IntWritable
 KEYOUT,  the reduce-stage output key type:   Text
 VALUEOUT, the reduce-stage output value type: IntWritable
*/
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable outV = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        //hello,(1,1)
        //accumulate
        for (IntWritable value : values) {
            sum += value.get();
        }
        outV.set(sum);
        //write out
        context.write(key, outV);
    }
}
```

3. WordCountDriver class
```java
package com.xiaobai.mapreduce.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordCountDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        //1. Get the job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        //2. Set the jar path
        job.setJarByClass(WordCountDriver.class);

        //3. Associate the mapper and reducer
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        //4. Set the KV types of the map output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        //5. Set the KV types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        //6. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        //7. Submit the job
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
```