MapReduce词频统计
1.1 文件準(zhǔn)備
Create a local directory and two text files, and type some words into each file; these will be the input for the word count.
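For example, a minimal sketch, assuming the Hadoop installation lives in /usr/local/hadoop (the directory and file names match the upload commands in step 1.2; the sample words themselves are arbitrary):

```bash
# Create a local working directory under the Hadoop installation
cd /usr/local/hadoop
mkdir WordFile
# Write a few sample words into each file (any content works)
echo "hello world hello hadoop" > ./WordFile/wordfile1.txt
echo "hadoop mapreduce word count" > ./WordFile/wordfile2.txt
```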
1.2 Create an HDFS directory (not visible on the local filesystem) and upload the local text files to it, using the following commands.
```bash
cd /usr/local/hadoop
./bin/hdfs dfs -mkdir wordfileinput
./bin/hdfs dfs -put ./WordFile/wordfile1.txt wordfileinput
./bin/hdfs dfs -put ./WordFile/wordfile2.txt wordfileinput
```
1.3 Make sure no output directory exists in HDFS. Run the following command; the output directory must be deleted before every run of the word count. Note that /user/hadoop/ is the HDFS user directory, not a local directory.
```bash
./bin/hdfs dfs -rm -r /user/hadoop/output
```
1.4 Write the Code in Eclipse
Create a Java project named MapReduceWordCount, right-click the project name, and import the required JAR packages.
1.5 Click Add External JARs, go to the /usr/local/hadoop/share/hadoop directory, and import the following packages:
- hadoop-common-3.1.3.jar and hadoop-nfs-3.1.3.jar from the “/usr/local/hadoop/share/hadoop/common” directory;
- all JAR packages in the “/usr/local/hadoop/share/hadoop/common/lib” directory;
- all JAR packages in the “/usr/local/hadoop/share/hadoop/mapreduce” directory, excluding the jdiff, lib, lib-examples, and sources subdirectories;
- all JAR packages in the “/usr/local/hadoop/share/hadoop/mapreduce/lib” directory.
1.6 Create the Class WordCount.java
```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: wordcount <in> [<in>...] <out>");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // The combiner runs the reducer logic on each mapper's local output,
        // shrinking the data shuffled to the reducers.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // All arguments except the last are input paths; the last is the output path.
        for (int i = 0; i < otherArgs.length - 1; ++i) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    // Mapper: splits each input line into tokens and emits a (word, 1) pair per token.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts for each word and emits (word, total).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
```
1.7 Compile and Package the Program
Package the program into the /usr/local/hadoop/myapp directory:
- Run the program once via Run As (this records the launch configuration that the export step below selects);
- Right-click the project name -> Export -> Java -> Runnable JAR file;
- “Launch configuration” sets the main class that runs when the generated JAR is deployed and launched: pick the class configured just now, “WordCount-MapReduceWordCount”, from the drop-down list. In “Export destination”, set the directory in which to save the JAR and its file name. Click Finish; a few dialogs will appear along the way, and you can simply click OK through them.
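As a quick sanity check, verify that the export landed in the target directory (the JAR file name is whatever you entered under “Export destination”):

```bash
# List the export directory; the exported JAR should appear here
ls /usr/local/hadoop/myapp
```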
1.8 Run the Program
Start Hadoop, then submit the packaged JAR.
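A minimal sketch of this step, assuming the JAR was exported as /usr/local/hadoop/myapp/WordCount.jar (the JAR name is an assumption; use the name you chose in step 1.7):

```bash
cd /usr/local/hadoop
# Start HDFS (NameNode and DataNodes)
./sbin/start-dfs.sh
# Submit the job: read from wordfileinput, write results to output
# (WordCount.jar is an assumption -- substitute your exported JAR name)
./bin/hadoop jar ./myapp/WordCount.jar wordfileinput output
```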
1.9 View the Results
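With the sample input files sketched in step 1.1, the cat command below should print each word with its tab-separated count, sorted by key, along these lines:

```
count	1
hadoop	2
hello	2
mapreduce	1
word	1
world	1
```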
```bash
cd /usr/local/hadoop
./bin/hdfs dfs -cat output/*
```
1.10 Browse the HDFS File System
Go to the /usr/local/hadoop/bin directory and run the relevant commands.
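For example, a few standard hdfs dfs commands for browsing the directories used above:

```bash
cd /usr/local/hadoop/bin
# List the HDFS user directory (/user/hadoop)
./hdfs dfs -ls
# Inspect the uploaded input files
./hdfs dfs -ls wordfileinput
# Inspect the job output; part-r-00000 holds the word counts
./hdfs dfs -ls output
```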
1.11 Source Document
http://dblab.xmu.edu.cn/blog/2481-2/#more-2481