Big Data: map
1. Write the map function and the reduce function
cd /home/hadoop
mkdir wc
cd /home/hadoop/wc
touch mapper.py
touch reducer.py
Write the two functions
mapper.py:
#!/usr/bin/env python
import sys

# Read lines from standard input and emit "word<TAB>1" for every word.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print('%s\t%s' % (word, 1))

reducer.py:
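#!/usr/bin/env python
import sys

current_word = None
current_count = 0

# Input arrives sorted by word, so equal words are on consecutive lines.
for line in sys.stdin:
    line = line.strip()
    # Each mapper output line has the form "word<TAB>count".
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # Skip lines whose count is not a number.
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # The word changed: emit the total for the previous word.
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Emit the total for the last word, if there was any input.
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))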
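
The reducer only works because its input is sorted by word, so all lines for one word are adjacent. As a rough sketch of the same idea, the summation can also be written with itertools.groupby. The helper names parse_pairs and reduce_counts and the sample list are made up for illustration and are not part of the original scripts.

```python
from itertools import groupby
from operator import itemgetter


def parse_pairs(lines):
    """Yield (word, count) tuples from tab-separated lines, skipping malformed ones."""
    for line in lines:
        word, sep, count = line.strip().partition('\t')
        if not sep:
            continue  # no tab: not a mapper output line
        try:
            yield word, int(count)
        except ValueError:
            continue  # count was not a number


def reduce_counts(lines):
    """Sum counts per word; the input must already be sorted by word."""
    for word, group in groupby(parse_pairs(lines), key=itemgetter(0)):
        yield word, sum(count for _, count in group)


# Hypothetical sample input, shaped like sorted mapper output.
sample = ["bar\t1", "foo\t1", "foo\t1", "foo\t1", "labs\t1", "quux\t1", "quux\t1"]
for word, total in reduce_counts(sample):
    print('%s\t%s' % (word, total))
```

Passing sys.stdin instead of the sample list would let the same function read real mapper output.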
2. Make both scripts executable
chmod a+x /home/hadoop/wc/mapper.py
chmod a+x /home/hadoop/wc/reducer.py
3. Test the code locally
echo "foo foo quux labs foo bar quux" | /home/hadoop/wc/mapper.py
echo "foo foo quux labs foo bar quux" | /home/hadoop/wc/mapper.py | sort -k1,1 | /home/hadoop/wc/reducer.py

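The local test chains three stages: map, sort, reduce. The sketch below imitates that pipeline inside a single Python process, which can help when checking the logic without a shell. The function names map_words, reduce_sorted and word_count are assumptions for this example. Python's sorted() stands in for sort -k1,1, and its ordering may differ from the shell's for non-ASCII text.

```python
def map_words(lines):
    """Map phase: emit (word, 1) for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_sorted(pairs):
    """Reduce phase: sum the counts of consecutive pairs that share a word."""
    current_word, current_count = None, 0
    for word, count in pairs:
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                yield current_word, current_count
            current_word, current_count = word, count
    if current_word is not None:
        yield current_word, current_count


def word_count(lines):
    """Chain map, sort (standing in for `sort -k1,1`), and reduce."""
    return list(reduce_sorted(sorted(map_words(lines))))


print(word_count(["foo foo quux labs foo bar quux"]))
```

For the test sentence this prints [('bar', 1), ('foo', 3), ('labs', 1), ('quux', 2)].
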
4. Run on HDFS
Download some text files, or save crawled web page content as text files:
cd /home/hadoop/wc
wget http://www.gutenberg.org/files/5000/5000-8.txt
wget http://www.gutenberg.org/cache/epub/20417/pg20417.txt
5. Upload the downloaded files to HDFS
hdfs dfs -put /home/hadoop/wc/*.txt /user/hadoop/input
6. Submit the job with Hadoop Streaming
Find where your streaming jar file is stored:
ls /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar
Open the environment variable configuration file:
gedit ~/.bashrc
Add the streaming jar path to it:
export STREAM=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar
Apply the environment variables:
source ~/.bashrc
echo $STREAM
Create a shell script named run.sh to run the job:
gedit run.sh
hadoop jar $STREAM \
-file /home/hadoop/wc/mapper.py \
-mapper /home/hadoop/wc/mapper.py \
-file /home/hadoop/wc/reducer.py \
-reducer /home/hadoop/wc/reducer.py \
-input /user/hadoop/input/*.txt \
-output /user/hadoop/wcoutput
source run.sh
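
Once the job finishes, the result files under /user/hadoop/wcoutput should contain the same word<TAB>count lines that the reducer prints. One way to sanity-check them is to list the most frequent words. The helper top_words below is a hypothetical sketch, and the sample lines are invented to match the local test sentence.

```python
import heapq


def top_words(lines, n=10):
    """Return the n most frequent (word, count) pairs from 'word<TAB>count' lines."""
    pairs = []
    for line in lines:
        word, sep, count = line.rstrip('\n').partition('\t')
        if sep and count.strip().isdigit():
            pairs.append((word, int(count)))
    # Highest count first; ties are broken alphabetically for a stable result.
    return heapq.nsmallest(n, pairs, key=lambda p: (-p[1], p[0]))


# Hypothetical lines, shaped like the reducer output for the local test sentence.
sample = ["bar\t1\n", "foo\t3\n", "labs\t1\n", "quux\t2\n"]
print(top_words(sample, n=2))
```

For the sample this prints [('foo', 3), ('quux', 2)]. On real output, the lines could come from the files fetched with hdfs dfs -cat.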
