Notes on some Hadoop Streaming problems
Source:
https://hadoop.apache.org/docs/r1.2.1/streaming.html#Generic+Command+Options
Read the documentation carefully: most of the problems I ran into are covered there. Nothing stood out on a first read, but once I hit a problem and came back, it all made sense.
============================================================
1. Options
Hadoop Streaming takes two kinds of options: generic command options and streaming command options.
Note: the generic options must be placed before the streaming options:
bin/hadoop command [genericOptions] [streamingOptions]
Generic Command Options
| Parameter | Optional/Required | Description |
| --- | --- | --- |
| -conf configuration_file | Optional | Specify an application configuration file |
| -D property=value | Optional | Use value for given property |
| -fs host:port or local | Optional | Specify a namenode |
| -jt host:port or local | Optional | Specify a job tracker |
| -files | Optional | Specify comma-separated files to be copied to the Map/Reduce cluster |
| -libjars | Optional | Specify comma-separated jar files to include in the classpath |
| -archives | Optional | Specify comma-separated archives to be unarchived on the compute machines |
Streaming Command Options
| Parameter | Optional/Required | Description |
| --- | --- | --- |
| -input directoryname or filename | Required | Input location for mapper |
| -output directoryname | Required | Output location for reducer |
| -mapper executable or JavaClassName | Required | Mapper executable |
| -reducer executable or JavaClassName | Required | Reducer executable |
| -file filename | Optional | Make the mapper, reducer, or combiner executable available locally on the compute nodes |
| -inputformat JavaClassName | Optional | Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default |
| -outputformat JavaClassName | Optional | Class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default |
| -partitioner JavaClassName | Optional | Class that determines which reduce a key is sent to |
| -combiner streamingCommand or JavaClassName | Optional | Combiner executable for map output |
| -cmdenv name=value | Optional | Pass environment variable to streaming commands |
| -inputreader | Optional | For backwards-compatibility: specifies a record reader class (instead of an input format class) |
| -verbose | Optional | Verbose output |
| -lazyOutput | Optional | Create output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect (or Context.write) |
| -numReduceTasks | Optional | Specify the number of reducers |
| -mapdebug | Optional | Script to call when map task fails |
| -reducedebug | Optional | Script to call when reduce task fails |
Note:
1. -files (a generic option) vs. -file (a streaming option):
The former accepts multiple entries, separated by commas, and the entries may be HDFS file paths; the latter uploads a file from the local machine to the job's working directory on the cluster.
When used, -files must be placed before the streaming options (-input, -output, etc.).
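For example, here is a sketch of a submission command illustrating the ordering rule (the streaming jar name and all paths are placeholders, not from the original post):

```shell
# -files is a generic option, so it must appear before any streaming option.
hadoop jar hadoop-streaming.jar \
    -files hdfs://host:fs_port/user/testfile.txt \
    -input /user/input \
    -output /user/output \
    -mapper mapper.py \
    -reducer reducer.py
```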
2. The -cmdenv option sets environment variables for the streaming commands, e.g.:
-cmdenv EXAMPLE_DIR=/home/example/dictionaries/
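A variable passed with -cmdenv shows up in the mapper's environment. A minimal sketch (the variable is simulated locally here for illustration; inside a real task Hadoop sets it):

```python
import os

# In a real streaming task this would be set by:
#   -cmdenv EXAMPLE_DIR=/home/example/dictionaries/
# Simulated here so the snippet is self-contained.
os.environ.setdefault("EXAMPLE_DIR", "/home/example/dictionaries/")

example_dir = os.getenv("EXAMPLE_DIR")
print(example_dir)
```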
2. Using HDFS file paths
Upload the data file to HDFS first; it can then be used directly.
Pass it with:
-files hdfs://host:fs_port/user/testfile.txt
Note:
1. host:fs_port can be found in /usr/local/hadoop/etc/hadoop/core-site.xml, as the value of fs.defaultFS.
2. In mapper.py, simply call open("testfile.txt"). Hadoop automatically creates a symlink named testfile.txt in the current working directory of the tasks, pointing to the local copy of testfile.txt.
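Because of that symlink, the mapper can load the shipped file by its bare name, with no HDFS path. A minimal sketch (the load_dict helper is illustrative, not part of the original post):

```python
def load_dict(path="testfile.txt"):
    """Load the file shipped via -files.

    "testfile.txt" is the symlink Hadoop creates in the task's current
    working directory, pointing at the local copy of the shipped file.
    """
    with open(path) as f:
        return {line.strip() for line in f}
```

In a real mapper you would call load_dict() once at startup and then stream records from stdin against it.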
3. "No space left on device" error
Hadoop Streaming packages the files given via -file into a jar and uploads it, so large files produce a large jar. The jar is written to /tmp by default; if /tmp runs out of space, this error is raised.
To change the directory:
-D stream.tmpdir=/export/bigspace/...
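As with -files, the -D flag is a generic option and must come before the streaming options. A sketch (the jar name, the tmp directory, and all paths are placeholders; point stream.tmpdir at any partition with enough free space):

```shell
hadoop jar hadoop-streaming.jar \
    -D stream.tmpdir=/export/bigspace/tmp \
    -input /user/input \
    -output /user/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py
```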
4. Reading job configuration in mapper.py
Suppose the job is submitted with:
-D mapred.reduce.tasks=1
To read this property in mapper.py, replace each "." in the property name with an underscore "_" and look it up as an environment variable:
reducer_tasks = os.getenv("mapred_reduce_tasks")
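Hadoop Streaming exports job configuration into the task environment with dots replaced by underscores, so a tiny helper can do the name mangling. A sketch (the environment value is simulated here for illustration; in a real task Hadoop sets it):

```python
import os

def get_job_conf(name, default=None):
    # Hadoop replaces '.' with '_' when exporting job properties
    # to the streaming task's environment.
    return os.getenv(name.replace(".", "_"), default)

# Simulate what Hadoop would set for: -D mapred.reduce.tasks=1
os.environ["mapred_reduce_tasks"] = "1"
print(get_job_conf("mapred.reduce.tasks"))  # -> 1
```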