Notes on some Hadoop Streaming problems
Source:
https://hadoop.apache.org/docs/r1.2.1/streaming.html#Generic+Command+Options
Read the documentation carefully: most of the problems I ran into are covered there. Nothing stood out on a first read, but once I hit a problem and came back, it all made sense.
============================================================
1. Options
Hadoop Streaming takes two kinds of options: generic command options and streaming command options.
Note: the generic options must be placed before the streaming options:
bin/hadoop command [genericOptions] [streamingOptions]
Generic Command Options
| Parameter | Optional/Required | Description |
| --- | --- | --- |
| -conf configuration_file | Optional | Specify an application configuration file |
| -D property=value | Optional | Use value for given property |
| -fs host:port or local | Optional | Specify a namenode |
| -jt host:port or local | Optional | Specify a job tracker |
| -files | Optional | Specify comma-separated files to be copied to the Map/Reduce cluster |
| -libjars | Optional | Specify comma-separated jar files to include in the classpath |
| -archives | Optional | Specify comma-separated archives to be unarchived on the compute machines |
Streaming Command Options
| Parameter | Optional/Required | Description |
| --- | --- | --- |
| -input directoryname or filename | Required | Input location for mapper |
| -output directoryname | Required | Output location for reducer |
| -mapper executable or JavaClassName | Required | Mapper executable |
| -reducer executable or JavaClassName | Required | Reducer executable |
| -file filename | Optional | Make the mapper, reducer, or combiner executable available locally on the compute nodes |
| -inputformat JavaClassName | Optional | Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default |
| -outputformat JavaClassName | Optional | Class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default |
| -partitioner JavaClassName | Optional | Class that determines which reduce a key is sent to |
| -combiner streamingCommand or JavaClassName | Optional | Combiner executable for map output |
| -cmdenv name=value | Optional | Pass environment variable to streaming commands |
| -inputreader | Optional | For backwards-compatibility: specifies a record reader class (instead of an input format class) |
| -verbose | Optional | Verbose output |
| -lazyOutput | Optional | Create output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect (or Context.write) |
| -numReduceTasks | Optional | Specify the number of reducers |
| -mapdebug | Optional | Script to call when map task fails |
| -reducedebug | Optional | Script to call when reduce task fails |
Note:
1. -files (a generic option) vs. -file (a streaming option):
The former accepts multiple entries, separated by commas, and the entries may be HDFS file paths; the latter uploads a file from the local machine to the job's working directory on the cluster.
When used, -files must be placed before the streaming options (-input, -output, etc.).
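For example, here is a sketch of a submission command illustrating the ordering rule (the streaming jar name and all paths are placeholders, not from the original post):

```shell
# -files is a generic option, so it must appear before any streaming option.
hadoop jar hadoop-streaming.jar \
    -files hdfs://host:fs_port/user/testfile.txt \
    -input /user/input \
    -output /user/output \
    -mapper mapper.py \
    -reducer reducer.py
```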
2. The -cmdenv option sets environment variables for the streaming commands, e.g.:
-cmdenv EXAMPLE_DIR=/home/example/dictionaries/
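A variable passed with -cmdenv shows up in the mapper's environment. A minimal sketch (the variable is simulated locally here for illustration; inside a real task Hadoop sets it):

```python
import os

# In a real streaming task this would be set by:
#   -cmdenv EXAMPLE_DIR=/home/example/dictionaries/
# Simulated here so the snippet is self-contained.
os.environ.setdefault("EXAMPLE_DIR", "/home/example/dictionaries/")

example_dir = os.getenv("EXAMPLE_DIR")
print(example_dir)
```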
2. Using HDFS file paths
Upload the data file to HDFS first; it can then be used directly.
Pass it with:
-files hdfs://host:fs_port/user/testfile.txt
Note:
1. host:fs_port can be found in /usr/local/hadoop/etc/hadoop/core-site.xml, as the value of fs.defaultFS.
2. In mapper.py, simply call open("testfile.txt"). Hadoop automatically creates a symlink named testfile.txt in the current working directory of the tasks, pointing to the local copy of testfile.txt.
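Because of that symlink, the mapper can load the shipped file by its bare name, with no HDFS path. A minimal sketch (the load_dict helper is illustrative, not part of the original post):

```python
def load_dict(path="testfile.txt"):
    """Load the file shipped via -files.

    "testfile.txt" is the symlink Hadoop creates in the task's current
    working directory, pointing at the local copy of the shipped file.
    """
    with open(path) as f:
        return {line.strip() for line in f}
```

In a real mapper you would call load_dict() once at startup and then stream records from stdin against it.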
3. "No space left on device" error
Hadoop Streaming packages the files given via -file into a jar and uploads it, so large files produce a large jar. The jar is written to /tmp by default; if /tmp runs out of space, this error is raised.
To change the directory:
-D stream.tmpdir=/export/bigspace/...
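As with -files, the -D flag is a generic option and must come before the streaming options. A sketch (the jar name, the tmp directory, and all paths are placeholders; point stream.tmpdir at any partition with enough free space):

```shell
hadoop jar hadoop-streaming.jar \
    -D stream.tmpdir=/export/bigspace/tmp \
    -input /user/input \
    -output /user/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py
```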
4. Reading job configuration in mapper.py
Suppose the job is submitted with:
-D mapred.reduce.tasks=1
To read this property in mapper.py, replace each "." in the property name with an underscore "_" and look it up as an environment variable:
reducer_tasks = os.getenv("mapred_reduce_tasks")
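Hadoop Streaming exports job configuration into the task environment with dots replaced by underscores, so a tiny helper can do the name mangling. A sketch (the environment value is simulated here for illustration; in a real task Hadoop sets it):

```python
import os

def get_job_conf(name, default=None):
    # Hadoop replaces '.' with '_' when exporting job properties
    # to the streaming task's environment.
    return os.getenv(name.replace(".", "_"), default)

# Simulate what Hadoop would set for: -D mapred.reduce.tasks=1
os.environ["mapred_reduce_tasks"] = "1"
print(get_job_conf("mapred.reduce.tasks"))  # -> 1
```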