Submitting MR Jobs from Eclipse on Windows to a Hadoop Cluster
Author: Syn良子. Source: http://www.cnblogs.com/cssdongl — reprints are welcome, please credit the source.
In the past, MapReduce projects written in Eclipse were usually packaged into a jar and uploaded to the Hadoop test cluster to run directly; when something went wrong I would dig through the logs and fix the code. I finally took the time to set up remote submission of MR jobs from Eclipse on Windows to the cluster, which makes debugging much easier. Below are the problems I ran into and how I solved them.
Environment: Windows 7 64-bit, Eclipse Mars, Maven 3.3.9, Hadoop 2.6.0-CDH5.4.0.
1. Configuring the MapReduce Maven project
Create a new Maven project and copy the CDH cluster's configuration files (mainly core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml) into src/main/java so they end up on the classpath. Since we are connecting to a CDH cluster, the main part of pom.xml looks like this:
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <repositories>
        <repository>
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>junit</groupId><artifactId>junit</artifactId><version>3.8.1</version><scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId><artifactId>hadoop-common</artifactId><version>2.6.0-cdh5.4.0</version><scope>provided</scope>
            <exclusions>
                <exclusion><groupId>net.sf.kosmosfs</groupId><artifactId>kfs</artifactId></exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId><artifactId>hadoop-hdfs</artifactId><version>2.6.0-cdh5.4.0</version><scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId><artifactId>hadoop-hdfs-nfs</artifactId><version>2.6.0-cdh5.4.0</version><scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId><artifactId>hadoop-yarn-api</artifactId><version>2.6.0-cdh5.4.0</version><scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId><artifactId>hadoop-yarn-applications-distributedshell</artifactId><version>2.6.0-cdh5.4.0</version><scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId><artifactId>hadoop-yarn-server-resourcemanager</artifactId><version>2.6.0-cdh5.4.0</version><scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId><artifactId>hadoop-yarn-applications-unmanaged-am-launcher</artifactId><version>2.6.0-cdh5.4.0</version><scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId><artifactId>hadoop-mapreduce-client-common</artifactId><version>2.6.0-cdh5.4.0</version><scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId><artifactId>hadoop-mapreduce-client-jobclient</artifactId><version>2.6.0-cdh5.4.0</version><scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId><artifactId>hadoop-mapreduce-client-app</artifactId><version>2.6.0-cdh5.4.0</version><scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId><artifactId>hadoop-mapreduce-client-hs</artifactId><version>2.6.0-cdh5.4.0</version><scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId><artifactId>hadoop-mapreduce-client-hs-plugins</artifactId><version>2.6.0-cdh5.4.0</version><scope>provided</scope>
        </dependency>
        <!-- HBase artifacts are published under the org.apache.hbase groupId -->
        <dependency>
            <groupId>org.apache.hbase</groupId><artifactId>hbase-client</artifactId><version>1.0.0-cdh5.4.0</version><scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId><artifactId>hbase-common</artifactId><version>1.0.0-cdh5.4.0</version><scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId><artifactId>hbase-server</artifactId><version>1.0.0-cdh5.4.0</version><scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId><artifactId>hbase-protocol</artifactId><version>1.0.0-cdh5.4.0</version><scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId><artifactId>hbase-prefix-tree</artifactId><version>1.0.0-cdh5.4.0</version><scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId><artifactId>hbase-hadoop-compat</artifactId><version>1.0.0-cdh5.4.0</version><scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId><artifactId>hbase-hadoop2-compat</artifactId><version>1.0.0-cdh5.4.0</version><scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>jdk.tools</groupId><artifactId>jdk.tools</artifactId><version>1.7</version><scope>system</scope><systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
        </dependency>
        <dependency>
            <groupId>org.apache.maven.surefire</groupId><artifactId>surefire-booter</artifactId><version>2.12.4</version>
        </dependency>
    </dependencies>

If you are on a different CDH release, check Cloudera's official list of CDH Maven artifacts and adjust the corresponding dependencies (mainly the version attributes). For vanilla Apache Hadoop, remove the Cloudera repository above and configure the matching dependencies instead.
Once that is done, save the pom file and wait for the jars to download.
2. Configuring Eclipse to submit MR jobs to the cluster
The simplest example is WordCount, so here is the code first.
    package org.ldong.test;

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class WordCount1 extends Configured implements Tool {

        public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            System.setProperty("hadoop.home.dir", "C:\\hadoop-2.6.0");
            ToolRunner.run(new WordCount1(), args);
        }

        public int run(String[] args) throws Exception {
            // adjust these paths (and the HDFS nameservice) to your own cluster
            String input = "hdfs://littleNameservice/test/input";
            String output = "hdfs://littleNameservice/test/output";

            Configuration conf = new YarnConfiguration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount1.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setNumReduceTasks(1);

            FileInputFormat.addInputPath(job, new Path(input));
            FileSystem fs = FileSystem.get(conf);
            if (fs.exists(new Path(output))) {
                fs.delete(new Path(output), true);
            }
            FileOutputFormat.setOutputPath(job, new Path(output));
            return job.waitForCompletion(true) ? 0 : 1;
        }
    }

OK. The main thing is to adjust the highlighted configuration — the HDFS input/output paths and the nameservice — so it matches your own cluster. Then run the program directly via Run As > Java Application. Unless you are very lucky, it will fail with the following error:
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
This is a very common error when running MR on Windows: some of the native libraries Hadoop needs cannot be found there. Being a bit obsessive, I could not just ignore it. The fix:
1. Download hadoop-2.6.0.tar.gz from the Apache site and unpack it somewhere on the Windows machine, then set the HADOOP_HOME environment variable to the unpacked Hadoop root so that its bin folder can be found;
2. Download and unpack the archive linked at the end of this post, and copy all of its files into the bin directory from step 1;
3. If this does not take effect right away, add a temporary line such as System.setProperty("hadoop.home.dir", "C:\\hadoop-2.6.0") as in the code above; once the environment variable kicks in after a restart, the line can be removed.
Run again and that error disappears — only to be replaced by another one, the "no job control" failure on the cluster side.
This exception is actually a serious pre-2.4 bug affecting MR jobs submitted from a Windows client to a Hadoop cluster; it is caused by classpath differences between operating systems and was fixed in Hadoop 2.4. See:
http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/releasenotes.html
in particular MAPREDUCE-4052: "Windows eclipse cannot submit job from Windows client to Linux/Unix Hadoop cluster."
If you are on a release older than Hadoop 2.4, refer to the links below (MAPREDUCE-4052 and MAPREDUCE-5655 are duplicates of each other):
https://issues.apache.org/jira/browse/MAPREDUCE-4052
https://issues.apache.org/jira/browse/MAPREDUCE-5655
https://issues.apache.org/jira/browse/YARN-1298
http://www.aboutyun.com/thread-8498-1-1.html
http://blog.csdn.net/fansy1990/article/details/22896249
My environment is Hadoop 2.6.0-CDH5.4.0, so the fix is much simpler: edit mapred-site.xml in the project and set the following property (add it if it is missing):
    <property>
        <name>mapreduce.app-submission.cross-platform</name>
        <value>true</value>
    </property>

Run again and the "no job control" error is gone — only for another error to appear.
Will it ever end? It is still a configuration problem; after some more digging, the answer is again to edit the project's configuration files.
In mapred-site.xml, add or modify the following property (keep it consistent with your cluster):
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,$MR2_CLASSPATH</value>
    </property>

In yarn-site.xml, add or modify the following property (again, keep it consistent with your cluster):
    <property>
        <name>yarn.application.classpath</name>
        <value>$HADOOP_CLIENT_CONF_DIR,$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*</value>
    </property>

Re-run the MR job: the previous error disappears, but a new one shows up — the MR classes cannot be found. At this point there are two ways forward. One is the highlighted part of my code, which automatically packages the MR classes into a jar and uploads it so the job can run distributed on the cluster:
String classDirToPackage = "D:\\workspace\\performance-statistics-mvn\\target\\classes";
Change this path to the classes folder of your own project (for a Maven project, temporarily delete the META-INF folder under classes, otherwise the jar cannot be built). The code that follows uses the EJob class to build the jar, which is then handed to the cluster to run; a sketch of how that wiring might look is given after the EJob class. EJob looks like this:
    package org.ldong.test;

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.net.URL;
    import java.net.URLClassLoader;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.jar.JarEntry;
    import java.util.jar.JarOutputStream;
    import java.util.jar.Manifest;

    public class EJob {

        private static List<URL> classPath = new ArrayList<URL>();

        // Packs everything under root into a temporary jar; the jar is deleted when the JVM exits.
        public static File createTempJar(String root) throws IOException {
            if (!new File(root).exists()) {
                return null;
            }
            Manifest manifest = new Manifest();
            manifest.getMainAttributes().putValue("Manifest-Version", "1.0");
            final File jarFile = File.createTempFile("EJob-", ".jar",
                    new File(System.getProperty("java.io.tmpdir")));
            Runtime.getRuntime().addShutdownHook(new Thread() {
                public void run() {
                    jarFile.delete();
                }
            });
            JarOutputStream out = new JarOutputStream(new FileOutputStream(jarFile), manifest);
            createTempJarInner(out, new File(root), "");
            out.flush();
            out.close();
            return jarFile;
        }

        // Recursively adds files under f to the jar, keeping their relative paths as entry names.
        private static void createTempJarInner(JarOutputStream out, File f, String base) throws IOException {
            if (f.isDirectory()) {
                File[] fl = f.listFiles();
                if (base.length() > 0) {
                    base = base + "/";
                }
                for (int i = 0; i < fl.length; i++) {
                    createTempJarInner(out, fl[i], base + fl[i].getName());
                }
            } else {
                out.putNextEntry(new JarEntry(base));
                FileInputStream in = new FileInputStream(f);
                byte[] buffer = new byte[1024];
                int n = in.read(buffer);
                while (n != -1) {
                    out.write(buffer, 0, n);
                    n = in.read(buffer);
                }
                in.close();
            }
        }

        // Returns a class loader that also sees whatever was registered via addClasspath().
        public static ClassLoader getClassLoader() {
            ClassLoader parent = Thread.currentThread().getContextClassLoader();
            if (parent == null) {
                parent = EJob.class.getClassLoader();
            }
            if (parent == null) {
                parent = ClassLoader.getSystemClassLoader();
            }
            return new URLClassLoader(classPath.toArray(new URL[0]), parent);
        }

        public static void addClasspath(String component) {
            if ((component != null) && (component.length() > 0)) {
                try {
                    File f = new File(component);
                    if (f.exists()) {
                        URL key = f.getCanonicalFile().toURI().toURL();
                        if (!classPath.contains(key)) {
                            classPath.add(key);
                        }
                    }
                } catch (IOException e) {
                    // ignore entries that cannot be resolved
                }
            }
        }
    }
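The part of the driver that actually calls EJob was highlighted in the original post and is easy to miss here, so below is a minimal sketch of how createTempJar() is typically hooked into WordCount1.run(). This is an assumption about the wiring rather than verbatim code from the post; the classes directory is the example path shown above, and mapreduce.job.jar is the standard Hadoop 2 property naming the jar to ship with the job.

    // Sketch only: pack the compiled classes into a temp jar and tell the job to ship it.
    // Replace the path below with your own project's target/classes directory.
    String classDirToPackage = "D:\\workspace\\performance-statistics-mvn\\target\\classes";
    File jarFile = EJob.createTempJar(classDirToPackage);   // temp jar, deleted when the JVM exits
    Configuration conf = new YarnConfiguration();
    conf.set("mapreduce.job.jar", jarFile.toString());      // standard Hadoop 2 key for the job jar
    Job job = Job.getInstance(conf, "word count");
    // ... mapper/reducer/combiner and input/output paths exactly as in WordCount1 above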
The other option is to package the MR project manually and place the jar on a classpath that every cluster node can load, then trigger the job from the program in Eclipse; that works as well.
OK, run it again — this time it finishes successfully.
Checking the job history and the HDFS output directory on the cluster confirms the results are correct.
Other errors you may run into include:
org.apache.hadoop.security.AccessControlException: Permission denied….
This simply means the user has no HDFS permissions. There are several ways to fix it:
1. Add HADOOP_USER_NAME to the system environment variables with the value hadoop, or set it from Java, e.g. System.setProperty("HADOOP_USER_NAME", "hadoop"), where hadoop is a user with read/write access to HDFS (see the sketch after this list for where the call has to go).
2. Or open up permissions on the directory you need to access, e.g. hadoop fs -chmod 777 /test.
3. Or edit Hadoop's hdfs-site.xml and add or change the dfs.permissions property to false.
4. Or rename the user on the Eclipse machine to hadoop, i.e. the same user that runs Hadoop on the servers (again, a user with read/write access to HDFS).
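For the first fix it matters where the property is set: it has to happen before any FileSystem or Job object is created, because the Hadoop client resolves the user only once. A minimal sketch, assuming hadoop is the HDFS user as above:

    // Sketch only: set the HDFS user before any FileSystem/Job object is created,
    // otherwise the client has already picked up the local Windows user.
    public static void main(String[] args) throws Exception {
        System.setProperty("HADOOP_USER_NAME", "hadoop");          // user with read/write access on HDFS
        System.setProperty("hadoop.home.dir", "C:\\hadoop-2.6.0");
        ToolRunner.run(new WordCount1(), args);
    }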
3. Configuring MR programs to run locally
If all you need is to debug the MR logic locally while reading and writing HDFS, without submitting the job to the cluster, the setup is much simpler than the above.
Again create a new Maven project; this time there is no need to copy the *-site.xml files. A simple pom is enough, with the following main content:
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <repositories>
        <repository>
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>junit</groupId><artifactId>junit</artifactId><version>3.8.1</version><scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId><artifactId>hadoop-common</artifactId><version>2.6.0-cdh5.4.0</version><scope>provided</scope>
            <exclusions>
                <exclusion><groupId>net.sf.kosmosfs</groupId><artifactId>kfs</artifactId></exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId><artifactId>hadoop-hdfs</artifactId><version>2.6.0-cdh5.4.0</version><scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId><artifactId>hadoop-core</artifactId><version>2.6.0-mr1-cdh5.4.0</version><scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId><artifactId>hadoop-yarn-api</artifactId><version>2.6.0-cdh5.4.0</version><scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>jdk.tools</groupId><artifactId>jdk.tools</artifactId><version>1.7</version><scope>system</scope><systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
        </dependency>
        <dependency>
            <groupId>org.apache.maven.surefire</groupId><artifactId>surefire-booter</artifactId><version>2.12.4</version>
        </dependency>
    </dependencies>

Then write WordCount as follows:
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class WordCount extends Configured implements Tool {

        public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            System.setProperty("hadoop.home.dir", "C:\\hadoop-2.6.0");
            ToolRunner.run(new WordCount(), args);
        }

        public int run(String[] args) throws Exception {
            // adjust these HDFS URLs to your own cluster
            String input = "hdfs://master01.jj.wl:8020/test/input";
            String output = "hdfs://master01.jj.wl:8020/test/output";

            Configuration conf = new YarnConfiguration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);

            FileInputFormat.addInputPath(job, new Path(input));
            FileSystem fs = FileSystem.get(conf);
            if (fs.exists(new Path(output))) {
                fs.delete(new Path(output), true);
            }
            FileOutputFormat.setOutputPath(job, new Path(output));
            return job.waitForCompletion(true) ? 0 : 1;
        }
    }

Set up the connection to your cluster (the highlighted HDFS URLs), then run it. It fails with:
Could not locate executable null\bin\winutils.exe in the Hadoop binaries
The fix is the same as before, so I will not repeat it. Apply it, run again, and it succeeds.
Sharp-eyed readers will notice that this time the job ran through LocalJobRunner. Reading the source shows it is simply a local JVM running the MR program — nothing is submitted to the cluster, but it still reads and writes the cluster's HDFS. That is why the output can be seen on the cluster, yet the job never shows up in the cluster's task list in the Web UI.
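Which of the two behaviours you get is governed by the mapreduce.framework.name setting: without the cluster's mapred-site.xml/yarn-site.xml on the classpath it keeps its default value local, so the LocalJobRunner is used; with the configuration files from section 2 it is yarn and the job is genuinely submitted. A minimal sketch of forcing either mode explicitly from the driver (standard Hadoop 2 property keys; the rest of the job setup stays unchanged):

    // Sketch only: choose the execution mode explicitly in run().
    Configuration conf = new YarnConfiguration();
    conf.set("mapreduce.framework.name", "local");   // LocalJobRunner in this JVM; HDFS can still be remote

    // To really submit to the cluster instead (needs the *-site.xml files from section 2 on the classpath):
    // conf.set("mapreduce.framework.name", "yarn");
    // conf.set("mapreduce.app-submission.cross-platform", "true");

    Job job = Job.getInstance(conf, "word count");
    // ... mapper/reducer and input/output setup as in the WordCount examples above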
Download link for the winutils files: http://files.cnblogs.com/files/cssdongl/hadoop2.6%28x64%29.zip
Reprinted from: https://www.cnblogs.com/cssdongl/p/6027116.html