Some Notes on MultipleOutputFormat
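These notes are based on Hadoop 0.19.2. I have heard that MultipleOutputFormat no longer works well from 0.20 onward, but I have not confirmed whether that is true.
The API reference is here:
http://hadoop.apache.org/common/docs/r0.19.2/api/
To be honest, though, the API docs alone can be confusing: what does each function actually affect?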
| Modifier and type | Method | Description |
| --- | --- | --- |
| protected K | generateActualKey(K key, V value) | Generate the actual key from the given key/value. |
| protected V | generateActualValue(K key, V value) | Generate the actual value from the given key and value. |
| protected String | generateFileNameForKeyValue(K key, V value, String name) | Generate the output file name based on the given key and the leaf file name. |
| protected String | generateLeafFileName(String name) | Generate the leaf name for the output file name. |
| protected abstract RecordWriter<K,V> | getBaseRecordWriter(FileSystem fs, JobConf job, String name, Progressable arg3) | |
| protected String | getInputFileBasedOutputFileName(JobConf job, String name) | Generate the outfile name based on a given name and the input file name. |
| RecordWriter<K,V> | getRecordWriter(FileSystem fs, JobConf job, String name, Progressable arg3) | Create a composite record writer that can write key/value data to different output files. |

Here is a brief walkthrough of the call sequence.
In ReduceTask.java:
public void run(JobConf job, final TaskUmbilicalProtocol umbilical) throws IOException
{
    ..........

    // getOutputName returns "part-" + NUMBER_FORMAT.format(partition),
    // i.e. a file name such as part-00000 derived from the task's partition number
    String finalName = getOutputName(getPartition());

    FileSystem fs = FileSystem.get(job);

    // finalName is e.g. part-00000
    final RecordWriter out = job.getOutputFormat().getRecordWriter(fs, job, finalName, reporter);

    .............
}
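
To make the naming step concrete, here is a minimal, Hadoop-free sketch of how a name such as part-00000 can be built from a partition number. Only the expression `"part-" + NUMBER_FORMAT.format(partition)` comes from the comment above; the class name `PartNameSketch` is hypothetical, and the five-digit, no-grouping formatter settings are an assumption inferred from the part-00000 shape.

```java
import java.text.NumberFormat;
import java.util.Locale;

/** Hypothetical stand-in for the naming helper used by ReduceTask. */
class PartNameSketch {
    // Assumption: zero-padded to five digits, no thousands separators.
    private static final NumberFormat NUMBER_FORMAT = NumberFormat.getInstance(Locale.ROOT);
    static {
        NUMBER_FORMAT.setMinimumIntegerDigits(5);
        NUMBER_FORMAT.setGroupingUsed(false);
    }

    // NumberFormat is not thread-safe, hence synchronized.
    static synchronized String getOutputName(int partition) {
        return "part-" + NUMBER_FORMAT.format(partition);
    }

    public static void main(String[] args) {
        System.out.println(getOutputName(0));  // part-00000
        System.out.println(getOutputName(12)); // part-00012
    }
}
```

This string is what arrives as the `name` argument of getRecordWriter.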

In MultipleOutputFormat.java, pay attention to the order in which these functions are called:

public RecordWriter<K, V> getRecordWriter(FileSystem fs, JobConf job, String name, Progressable arg3) throws IOException {
    final FileSystem myFS = fs;
    // override generateLeafFileName to hard-code the leaf file name
    final String myName = generateLeafFileName(name);
    final JobConf myJob = job;
    final Progressable myProgressable = arg3;
    return new RecordWriter<K, V>() {
        // a cache storing the record writers for different output files.
        TreeMap<String, RecordWriter<K, V>> recordWriters = new TreeMap<String, RecordWriter<K, V>>();

        public void write(K key, V value) throws IOException
        {
            // get the file name based on the key
            // (this is the function to override when the key should decide the file name)
            String keyBasedPath = generateFileNameForKeyValue(key, value, myName);
            // get the file name based on the input file name
            // (this is the function to override when the JobConf should decide the name;
            // finalPath is the final file name)
            String finalPath = getInputFileBasedOutputFileName(myJob, keyBasedPath);
            // get the actual key
            K actualKey = generateActualKey(key, value);
            V actualValue = generateActualValue(key, value);
            RecordWriter<K, V> rw = this.recordWriters.get(finalPath);
            if (rw == null)
            {
                // if we don't have the record writer yet for the final path,
                // create one and add it to the cache
                // (getBaseRecordWriter is abstract, so you must implement it yourself)
                rw = getBaseRecordWriter(myFS, myJob, finalPath, myProgressable);
                this.recordWriters.put(finalPath, rw);
            }
            rw.write(actualKey, actualValue);
        };
        .......
    };
}
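
The call order is easier to see in a small Hadoop-free model. It mirrors the structure of getRecordWriter above but is only a sketch, not Hadoop code. `MultiOutputSketch` and `MemoryWriter` are hypothetical names, the JobConf and FileSystem arguments are dropped, and the per-key directory naming in generateFileNameForKeyValue is an illustrative override, not the default.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Simplified model of the dispatch logic; none of these are Hadoop classes. */
class MultiOutputSketch {
    /** Stand-in for a RecordWriter: keeps the written records in memory. */
    static class MemoryWriter {
        final List<String> lines = new ArrayList<>();
        void write(String key, String value) { lines.add(key + "\t" + value); }
    }

    final List<String> calls = new ArrayList<>();                  // log of hook calls
    final TreeMap<String, MemoryWriter> writers = new TreeMap<>(); // cache keyed by final path
    private final String leafName;

    MultiOutputSketch(String name) {
        leafName = generateLeafFileName(name); // runs once, before any write
    }

    String generateLeafFileName(String name) { calls.add("generateLeafFileName"); return name; }

    String generateFileNameForKeyValue(String key, String value, String name) {
        calls.add("generateFileNameForKeyValue");
        return key + "/" + name; // illustrative override: one directory per key
    }

    String getInputFileBasedOutputFileName(String name) {
        calls.add("getInputFileBasedOutputFileName");
        return name; // simplification: no input-file logic in this model
    }

    String generateActualKey(String key, String value) { calls.add("generateActualKey"); return key; }

    String generateActualValue(String key, String value) { calls.add("generateActualValue"); return value; }

    MemoryWriter getBaseRecordWriter(String finalPath) {
        calls.add("getBaseRecordWriter");
        return new MemoryWriter();
    }

    void write(String key, String value) {
        String keyBasedPath = generateFileNameForKeyValue(key, value, leafName);
        String finalPath = getInputFileBasedOutputFileName(keyBasedPath);
        String actualKey = generateActualKey(key, value);
        String actualValue = generateActualValue(key, value);
        MemoryWriter rw = writers.get(finalPath);
        if (rw == null) { // first record for this path: create and cache a writer
            rw = getBaseRecordWriter(finalPath);
            writers.put(finalPath, rw);
        }
        rw.write(actualKey, actualValue);
    }

    public static void main(String[] args) {
        MultiOutputSketch out = new MultiOutputSketch("part-00000");
        out.write("fruit", "apple");
        out.write("veg", "leek");
        out.write("fruit", "pear");
        System.out.println(out.writers.keySet());     // [fruit/part-00000, veg/part-00000]
        System.out.println(out.calls.subList(0, 6));  // hook order for the first record
    }
}
```

In this model generateLeafFileName runs once per task, the naming and key/value hooks run for every record, and getBaseRecordWriter runs only the first time a given final path is seen, so three records across two keys create two writers.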
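
Apart from getInputFileBasedOutputFileName, the other overridable functions called above (highlighted in red in the original post) basically just return their input unchanged.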
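
Since getInputFileBasedOutputFileName is the one exception, a rough model of what its default may do could help. The sketch below reflects my recollection of the 0.19-era behaviour and is not taken from the post: the configuration keys `map.input.file` and `mapred.outputformat.numOfTrailingLegs`, and the rule of keeping the last N path components of the input file, are all assumptions to verify against the actual source. A plain `Map` stands in for JobConf, and `InputFileNameSketch` is a hypothetical name.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hadoop-free model of the assumed default; verify against the real source. */
class InputFileNameSketch {
    static String inputFileBasedName(Map<String, String> conf, String name) {
        // Assumed key: path of the file the current map task is reading.
        String inFile = conf.get("map.input.file");
        if (inFile == null) {
            return name; // no map input file known: keep the given name
        }
        // Assumed key: how many trailing path components ("legs") to keep.
        int legs = Integer.parseInt(
                conf.getOrDefault("mapred.outputformat.numOfTrailingLegs", "0"));
        if (legs <= 0) {
            return name; // feature switched off: keep the given name
        }
        // Collect the last `legs` components of the input path, right to left.
        String[] parts = inFile.split("/");
        Deque<String> kept = new ArrayDeque<>();
        for (int i = parts.length - 1; i >= 0 && kept.size() < legs; i--) {
            if (parts[i].isEmpty()) {
                break; // reached the root of the path
            }
            kept.addFirst(parts[i]);
        }
        return String.join("/", kept);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(inputFileBasedName(conf, "part-00000")); // part-00000

        conf.put("map.input.file", "/data/2012/02/log.txt");
        conf.put("mapred.outputformat.numOfTrailingLegs", "2");
        System.out.println(inputFileBasedName(conf, "part-00000")); // 02/log.txt
    }
}
```

Under these assumptions the function also behaves like an identity unless both settings are present, and once they are, the incoming name is replaced by the input-derived path rather than combined with it.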
Reposted from: https://www.cnblogs.com/xuxm2007/archive/2012/02/23/2365332.html