MapReduce词频统计
1.1 文件準(zhǔn)備
Create a local directory and two text files, and type some words into each file; these will be the input for the word count.
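For example, a minimal sketch, assuming the Hadoop installation lives in /usr/local/hadoop (the directory and file names match the upload commands in step 1.2; the sample words themselves are arbitrary):

```bash
# Create a local working directory under the Hadoop installation
cd /usr/local/hadoop
mkdir WordFile
# Write a few sample words into each file (any content works)
echo "hello world hello hadoop" > ./WordFile/wordfile1.txt
echo "hadoop mapreduce word count" > ./WordFile/wordfile2.txt
```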
1.2 Create an HDFS directory (not visible on the local filesystem) and upload the local text files to it, using the following commands.
```bash
cd /usr/local/hadoop
./bin/hdfs dfs -mkdir wordfileinput
./bin/hdfs dfs -put ./WordFile/wordfile1.txt wordfileinput
./bin/hdfs dfs -put ./WordFile/wordfile2.txt wordfileinput
```
1.3 Make sure no output directory exists in HDFS. Run the following command; the output directory must be deleted before every run of the word count. Note that /user/hadoop/ is the HDFS user directory, not a local directory.
```bash
./bin/hdfs dfs -rm -r /user/hadoop/output
```
1.4 Write the Code in Eclipse
Create a Java project named MapReduceWordCount, right-click the project name, and import the required JAR packages.
1.5 Click Add External JARs, go to the /usr/local/hadoop/share/hadoop directory, and import the following packages:
- hadoop-common-3.1.3.jar and hadoop-nfs-3.1.3.jar from the “/usr/local/hadoop/share/hadoop/common” directory;
- all JAR packages in the “/usr/local/hadoop/share/hadoop/common/lib” directory;
- all JAR packages in the “/usr/local/hadoop/share/hadoop/mapreduce” directory, excluding the jdiff, lib, lib-examples, and sources subdirectories;
- all JAR packages in the “/usr/local/hadoop/share/hadoop/mapreduce/lib” directory.
1.6 Create the Class WordCount.java
```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: wordcount <in> [<in>...] <out>");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // The combiner runs the reducer logic on each mapper's local output,
        // shrinking the data shuffled to the reducers.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // All arguments except the last are input paths; the last is the output path.
        for (int i = 0; i < otherArgs.length - 1; ++i) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    // Mapper: splits each input line into tokens and emits a (word, 1) pair per token.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts for each word and emits (word, total).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
```
1.7 Compile and Package the Program
Package the program into the /usr/local/hadoop/myapp directory:
- Run the program once via Run As (this records the launch configuration that the export step below selects);
- Right-click the project name -> Export -> Java -> Runnable JAR file;
- “Launch configuration” sets the main class that runs when the generated JAR is deployed and launched: pick the class configured just now, “WordCount-MapReduceWordCount”, from the drop-down list. In “Export destination”, set the directory in which to save the JAR and its file name. Click Finish; a few dialogs will appear along the way, and you can simply click OK through them.
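As a quick sanity check, verify that the export landed in the target directory (the JAR file name is whatever you entered under “Export destination”):

```bash
# List the export directory; the exported JAR should appear here
ls /usr/local/hadoop/myapp
```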
1.8 Run the Program
Start Hadoop, then submit the packaged JAR.
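A minimal sketch of this step, assuming the JAR was exported as /usr/local/hadoop/myapp/WordCount.jar (the JAR name is an assumption; use the name you chose in step 1.7):

```bash
cd /usr/local/hadoop
# Start HDFS (NameNode and DataNodes)
./sbin/start-dfs.sh
# Submit the job: read from wordfileinput, write results to output
# (WordCount.jar is an assumption -- substitute your exported JAR name)
./bin/hadoop jar ./myapp/WordCount.jar wordfileinput output
```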
1.9 View the Results
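With the sample input files sketched in step 1.1, the cat command below should print each word with its tab-separated count, sorted by key, along these lines:

```
count	1
hadoop	2
hello	2
mapreduce	1
word	1
world	1
```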
```bash
cd /usr/local/hadoop
./bin/hdfs dfs -cat output/*
```
1.10 Browse the HDFS File System
Go to the /usr/local/hadoop/bin directory and run the relevant commands.
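For example, a few standard hdfs dfs commands for browsing the directories used above:

```bash
cd /usr/local/hadoop/bin
# List the HDFS user directory (/user/hadoop)
./hdfs dfs -ls
# Inspect the uploaded input files
./hdfs dfs -ls wordfileinput
# Inspect the job output; part-r-00000 holds the word counts
./hdfs dfs -ls output
```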
1.11 Source Document
http://dblab.xmu.edu.cn/blog/2481-2/#more-2481