zhaohz


hadoop jar hadoop-streaming-2.6.4.jar \
-D mapreduce.job.name='test' \
-files /local/path/to/mapper.py,/local/path/to/reducer.py \
-input /test/data/* \
-output /test/output/ \
-mapper 'python /local/path/to/mapper.py' \
-reducer 'python /local/path/to/reducer.py'

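1. The Python files must be distributed to every node.
2. -mapper and -reducer must invoke the scripts through python; otherwise the job fails with the following error:
Caused by: java.io.IOException: Cannot run program "mapper.py": error=2, No such file or directory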
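
The error=2 in that message is the operating system's ENOENT code, meaning the program to be executed could not be found. Below is a minimal sketch, assuming a POSIX system, that reproduces the same class of failure locally. The helper try_run and the script name are hypothetical, and the script is deliberately one that does not exist:

```python
import errno
import subprocess


def try_run(argv):
    """Run argv and return the errno if the program cannot be started, else None."""
    try:
        subprocess.run(argv, check=False)
    except OSError as exc:
        return exc.errno
    return None


# Hypothetical script path that does not exist on this machine.
code = try_run(["./definitely_missing_mapper.py"])
print(code, errno.errorcode.get(code))
```

Prefixing the command with python makes the interpreter the program that is launched, so the script itself needs neither the executable bit nor a usable shebang line.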

mapper.py

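#!/usr/bin/python3
# -*- coding: utf-8 -*-

import re
import sys

# Read input lines from stdin and emit one "word<TAB>1" pair per word.
for line in sys.stdin:
    line = line.strip()
    # Split on commas, periods, question marks, whitespace and double quotes.
    words = re.split(r'[,.?\s"]', line)
    for word in words:
        # The split already drops the delimiters; skip the empty strings it leaves.
        if word:
            print("{0}\t{1}".format(word, 1))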
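
To check the tokenization before submitting a job, the mapper logic can be tried on a single line in a local interpreter. This is only a sketch: the function names tokenize and map_line are illustrative, not part of Hadoop Streaming. The last line shows a common pitfall: str.strip treats its argument as a set of characters rather than a pattern, so a pattern-like argument can also remove a leading or trailing letter s.

```python
import re


def tokenize(line):
    """Split a line on commas, periods, question marks, whitespace and double quotes."""
    return [w for w in re.split(r'[,.?\s"]', line.strip()) if w]


def map_line(line):
    """Return the "word<TAB>1" records a word-count mapper would emit for one line."""
    return ["{0}\t{1}".format(word, 1) for word in tokenize(line)]


sample = 'Hello, world. "Hello" again?'
print(tokenize(sample))
for record in map_line(sample):
    print(record)

# str.strip takes a set of characters, not a regular expression.
print("status".strip(",|.|?|\\s"))
```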

 


reducer.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys

current_word = None
current_count = 0

# Hadoop sorts the mapper output by key, so equal words arrive on adjacent lines.
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    word, count = line.split('\t', 1)
    count = int(count)
    if current_word == word:
        current_count += count
    else:
        # The key changed: emit the total for the previous word.
        if current_word is not None:
            print("{0}\t{1}".format(current_word, current_count))
        current_word = word
        current_count = count

# Emit the total for the last word.
if current_word is not None:
    print("{0}\t{1}".format(current_word, current_count))
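
The reducer relies on equal keys arriving on adjacent lines, which Hadoop ensures by sorting the mapper output before the reduce phase. Below is a rough local sketch of that map, sort, reduce flow, using itertools.groupby in place of the manual current_word bookkeeping. The function names are illustrative, and sorted() only approximates Hadoop's shuffle:

```python
import re
from itertools import groupby


def map_words(lines):
    """Yield (word, 1) for every word, mirroring the mapper's output."""
    for line in lines:
        for word in re.split(r'[,.?\s"]', line.strip()):
            if word:
                yield word, 1


def reduce_counts(pairs):
    """Sum the counts per word; pairs must already be sorted by word."""
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Run map, sort and reduce in memory on an iterable of lines."""
    shuffled = sorted(map_words(lines))  # stands in for Hadoop's sort phase
    return list(reduce_counts(shuffled))


print(word_count(["a b a", "b, c."]))
```

Running the two real scripts locally through a shell pipeline with sort between them exercises the same flow without a cluster.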

See the official documentation: https://hadoop.apache.org/docs/r2.7.7/hadoop-streaming/HadoopStreaming.html

posted on 2020-02-21 19:55  zzhaoh