Running Python Scripts with Hadoop Streaming

If the job fails with the following error:

java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2

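insert #!/usr/bin/env python as the very first line of the Python script to fix it. Hadoop Streaming launches the script as an executable, so without this line the system does not know to run it with Python.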
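As a quick illustration of what "the very first line" means, here is a minimal sketch of a check that a script's text starts with an interpreter line. The helper name has_shebang and the two sample strings are made up for this example; they are not part of Hadoop.

```python
def has_shebang(script_text):
    """Return True if the very first line of the script is a '#!' interpreter line."""
    first_line = script_text.split("\n", 1)[0]
    return first_line.startswith("#!")


# Shebang on line 1: the system can find the interpreter.
good = "#!/usr/bin/env python\nimport sys\n"
# Shebang on line 2: it is treated as an ordinary comment and has no effect.
bad = "# mapper.py\n#!/usr/bin/env python\nimport sys\n"

print(has_shebang(good))
print(has_shebang(bad))
```

Even a filename comment placed above the interpreter line is enough to make the check fail, which is why the line has to come before anything else in the file.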

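#!/usr/bin/env python
# mapper.py: count the words in this mapper's input and emit "word<TAB>count" lines.
import sys

counts = {}
for line in sys.stdin:
    # Split each input line on whitespace.
    for word in line.strip().split():
        # dict.get replaces the Python 2-only has_key() check.
        counts[word] = counts.get(word, 0) + 1

for word, count in counts.items():
    # print(...) with a single argument works in both Python 2 and Python 3.
    print("%s\t%d" % (word, count))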

#!/usr/bin/env python
# reducer.py: sum the counts emitted by the mappers for each word.
import sys

wordcount = {}
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue  # skip blank lines instead of failing on them
    word, count = line.split("\t", 1)
    wordcount[word] = wordcount.get(word, 0) + int(count)

for word, count in wordcount.items():
    print("%s\t%d" % (word, count))
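Before submitting the job, it can help to check the map and reduce logic locally. The following is a rough sketch that mirrors the two scripts above as plain functions and uses sorted() as a stand-in for Hadoop's sort between the map and reduce phases. The function names run_mapper and run_reducer and the sample input splits are assumptions made for this illustration.

```python
def run_mapper(lines):
    """Mirror mapper.py: count words in the input, emit 'word<TAB>count' lines."""
    counts = {}
    for line in lines:
        for word in line.strip().split():
            counts[word] = counts.get(word, 0) + 1
    return ["%s\t%d" % (word, count) for word, count in counts.items()]


def run_reducer(lines):
    """Mirror reducer.py: sum the counts for each word."""
    totals = {}
    for line in lines:
        word, count = line.strip().split("\t", 1)
        totals[word] = totals.get(word, 0) + int(count)
    return totals


# Two hypothetical input splits, each processed by its own mapper.
split_a = ["hello world", "hello hadoop"]
split_b = ["hadoop streaming", "hello"]

# sorted() stands in for the sort Hadoop performs before the reduce phase.
shuffled = sorted(run_mapper(split_a) + run_mapper(split_b))
result = run_reducer(shuffled)

for word in sorted(result):
    print("%s\t%d" % (word, result[word]))
```

Because each mapper only sees its own split, the same word can arrive at the reducer several times with partial counts, which is why the reducer adds them up rather than taking a single value.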

Hadoop command (replace each * with your input and output paths):

hadoop jar /hadoop/hadoop-streaming-1.1.2.jar \
    -input * -output * \
    -file /home/mapper.py \
    -mapper mapper.py \
    -file /home/reducer.py \
    -reducer reducer.py

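Note: hadoop-streaming-1.1.2.jar is not located in the Hadoop root directory; look for it under /hadoop/contrib/streaming and point the jar argument at wherever it actually lives.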
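If the job is launched from a script rather than typed by hand, the command can be assembled as an argument list. This is only a sketch: the function name build_streaming_command, the HDFS paths, and the script locations are assumptions for illustration, and it uses only the options that appear in the command above.

```python
import os


def build_streaming_command(jar, input_path, output_path, mapper_path, reducer_path):
    """Return the hadoop streaming invocation as an argument list."""
    return [
        "hadoop", "jar", jar,
        "-input", input_path,
        "-output", output_path,
        # -file ships the local script to the cluster;
        # -mapper / -reducer then refer to it by its bare filename.
        "-file", mapper_path,
        "-mapper", os.path.basename(mapper_path),
        "-file", reducer_path,
        "-reducer", os.path.basename(reducer_path),
    ]


cmd = build_streaming_command(
    jar="/hadoop/contrib/streaming/hadoop-streaming-1.1.2.jar",
    input_path="/user/example/input",    # hypothetical HDFS path
    output_path="/user/example/output",  # hypothetical HDFS path
    mapper_path="/home/mapper.py",       # assumed script location
    reducer_path="/home/reducer.py",     # assumed script location
)
print(" ".join(cmd))
```

On a machine where hadoop is on the PATH, the list could be handed to subprocess.call(cmd); that step is left out here so the sketch runs anywhere.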

posted on 2013-12-12 18:06 zach_Emrys
