Hadoop Program Development --- Python
This article, collected and organized by 生活随笔, walks through developing a Hadoop program in Python, using word count as the example.
1. Create mapper.py
mkdir /usr/local/hadoop-python
cd /usr/local/hadoop-python
vim mapper.py

mapper.py:
#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py;
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
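To see exactly what the mapper emits, its loop can be simulated in plain Python on a sample line, without Hadoop. This is just a sketch; `sample` is a made-up test line, not part of the tutorial's data:

```python
# Simulate what mapper.py emits for one line of input (no Hadoop needed).
# `sample` is a made-up test line for illustration.
sample = "foo foo quux"

emitted = []
for word in sample.strip().split():
    # mapper.py prints one tab-separated "<word>\t1" pair per word
    emitted.append('%s\t%s' % (word, 1))

for pair in emitted:
    print(pair)
```

Each word produces its own `word\t1` pair; no aggregation happens in the map step.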
Save mapper.py, then make it executable:

chmod a+x /usr/local/hadoop-python/mapper.py

2. Create reducer.py
vim reducer.py

#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
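The IF-switch in reducer.py is a hand-rolled "group by key over sorted input". The same logic can be expressed with `itertools.groupby`, which makes the sorted-input requirement explicit. This is a sketch with made-up sample pairs, not part of the original tutorial:

```python
from itertools import groupby

# Sorted mapper output, as Hadoop delivers it to the reducer
# (made-up sample pairs for illustration).
pairs = [('bar', 1), ('foo', 1), ('foo', 1), ('foo', 1),
         ('labs', 1), ('quux', 1), ('quux', 1)]

# groupby only merges *adjacent* equal keys -- which is exactly why
# Hadoop sorts map output by key before the reduce step.
totals = [(word, sum(count for _, count in group))
          for word, group in groupby(pairs, key=lambda kv: kv[0])]

print(totals)  # [('bar', 1), ('foo', 3), ('labs', 1), ('quux', 2)]
```

If the pairs were not sorted, `groupby` (and reducer.py's IF-switch alike) would emit the same word several times with partial counts.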
Save reducer.py, then make it executable as well:

chmod a+x /usr/local/hadoop-python/reducer.py

The scripts can first be tested on the local machine, so any problems show up early:
echo "foo foo quux labs foo bar quux" | /usr/local/hadoop-python/mapper.py

Output:

foo 1
foo 1
quux 1
labs 1
foo 1
bar 1
quux 1

Then run the full pipeline, including reducer.py:
echo "foo foo quux labs foo bar quux" | /usr/local/hadoop-python/mapper.py | sort -k1,1 | /usr/local/hadoop-python/reducer.py

Output:

bar 1
foo 3
labs 1
quux 2
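The shell pipeline above can also be reproduced entirely in memory, which is a convenient way to sanity-check the map/sort/reduce logic before touching a cluster. A sketch mirroring mapper.py, `sort -k1,1`, and reducer.py:

```python
# Reproduce the echo | mapper.py | sort | reducer.py pipeline in memory.
text = "foo foo quux labs foo bar quux"

# map step: emit a (word, 1) pair for every word, as mapper.py does
mapped = [(word, 1) for word in text.split()]

# shuffle/sort step: sort pairs by key, as `sort -k1,1` does locally
mapped.sort(key=lambda kv: kv[0])

# reduce step: running totals over the sorted pairs, as in reducer.py
counts = {}
for word, count in mapped:
    counts[word] = counts.get(word, 0) + count

for word in sorted(counts):
    print('%s\t%s' % (word, counts[word]))
```

Running this prints the same four totals as the shell pipeline (bar 1, foo 3, labs 1, quux 2).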
3. Run the Python Code on Hadoop

Preparation: download the text files, then upload these books to the HDFS file system:
# create an input folder under this user's directory on HDFS
hdfs dfs -mkdir /input
# upload the document into the input folder on HDFS
hdfs dfs -put /usr/local/hadoop-python/input/pg20417.txt /input

Next, locate your Hadoop Streaming jar file. Note that since version 2.6 it lives under the share directory; you can search the Hadoop installation directory for it:
cd $HADOOP_HOME
find ./ -name "*streaming*.jar"

This finds the hadoop-streaming*.jar files in the share folder:
./share/hadoop/tools/lib/hadoop-streaming-2.8.4.jar
./share/hadoop/tools/sources/hadoop-streaming-2.8.4-test-sources.jar
./share/hadoop/tools/sources/hadoop-streaming-2.8.4-sources.jar

Since the path /usr/local/hadoop-2.8.4/share/hadoop/tools/lib is rather long, we can store it in an environment variable:
vim /etc/profile

export STREAM=/usr/local/hadoop-2.8.4/share/hadoop/tools/lib/hadoop-streaming-2.8.4.jar

Because the full streaming invocation is long, put it in a shell script named run.sh:
vim run.sh

hadoop jar $STREAM \
    -files /usr/local/hadoop-python/mapper.py,/usr/local/hadoop-python/reducer.py \
    -mapper /usr/local/hadoop-python/mapper.py \
    -reducer /usr/local/hadoop-python/reducer.py \
    -input /input/pg20417.txt \
    -output /output1

Summary
The above is the complete walkthrough of Hadoop program development in Python, as collected by 生活随笔; hopefully it helps you solve the problems you run into.