Developing MapReduce Programs with the MRJob Library
```mermaid
---
config:
  theme: mc
  look: classic
  layout: dagre
---
flowchart LR
    Input["<h4>Input file:</h4><br><code>Hello World<br>Hello Python<br>Python Java<br>Java World</code>"] -- split --> M1["<h4>Mapper 1:</h4><br><code>Hello->1<br>World->1</code>"] & M2["<h4>Mapper 2:</h4><br><code>Hello->1<br>Python->1</code>"] & M3["<h4>Mapper 3:</h4><br><code>Python->1<br>Java->1</code>"] & M4["<h4>Mapper 4:</h4><br><code>Java->1<br>World->1</code>"]
    M1 -- shuffle --> S1["<h4>Shuffle group 1:</h4><br><code>Hello->[1,1]</code>"] & S2["<h4>Shuffle group 2:</h4><br><code>World->[1,1]</code>"]
    M2 -- shuffle --> S1 & S3["<h4>Shuffle group 3:</h4><br><code>Python->[1,1]</code>"]
    M3 -- shuffle --> S3 & S4["<h4>Shuffle group 4:</h4><br><code>Java->[1,1]</code>"]
    M4 -- shuffle --> S4 & S2
    S1 -- partition --> R1["<h4>Reducer 1:</h4><br><code>Hello->2<br>World->2</code>"]
    S2 -- partition --> R1
    S3 -- partition --> R2["<h4>Reducer 2:</h4><br><code>Python->2<br>Java->2</code>"]
    S4 -- partition --> R2
    R1 --> Output["<h4>Output file:</h4><br><code>Hello 2<br>World 2<br>Python 2<br>Java 2</code>"]
    R2 --> Output
    style Input fill:#f9f,stroke:#333
    style M1 fill:#bbf,stroke:#333
    style M2 fill:#bbf,stroke:#333
    style M3 fill:#bbf,stroke:#333
    style M4 fill:#bbf,stroke:#333
    style R1 fill:#bbf,stroke:#333
    style R2 fill:#bbf,stroke:#333
    style Output fill:#f9f,stroke:#333
```
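The flow in the diagram can be sketched in plain Python (a stdlib-only simulation, not the mrjob API; the `hash(word) % num_reducers` partitioner here is a simplified stand-in for Hadoop's default hash partitioner):

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the line
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    # Reduce phase: sum all the counts collected for one word
    yield word, sum(counts)

def run(lines, num_reducers=2):
    # Shuffle: group every (word, 1) pair emitted by the mappers by word
    groups = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            groups[word].append(count)
    # Partition: assign each group to one reducer (simplified stand-in
    # for Hadoop's default hash partitioner)
    partitions = defaultdict(dict)
    for word, counts in groups.items():
        partitions[hash(word) % num_reducers][word] = counts
    # Run each reducer over its own partition and merge the results
    result = {}
    for part in partitions.values():
        for word, counts in part.items():
            result.update(dict(reducer(word, counts)))
    return result

word_counts = run(["Hello World", "Hello Python", "Python Java", "Java World"])
assert word_counts == {"Hello": 2, "World": 2, "Python": 2, "Java": 2}
```

Each word in the sample input appears exactly twice, matching the output file in the diagram.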
Local Debugging
- Prepare a test input file, `input.txt`:

  ```text
  hello world hello hadoop welcome to the world of big data hadoop and big data
  ```

- Install the required libraries:

  ```shell
  pip install mrjob boto3
  ```

- Write the code:

  ```python
  from mrjob.job import MRJob

  class WordCount(MRJob):

      def mapper(self, _, line):
          # Emit (word, 1) for every word in the input line
          for word in line.split():
              yield word, 1

      def reducer(self, word, counts):
          # Sum all counts shuffled to this word
          yield word, sum(counts)

  if __name__ == '__main__':
      WordCount.run()
  ```
- `mapper(self, key, value)`: by default, `key` is `None` and `value` is the raw input line with the trailing newline stripped, so the `key` parameter is usually ignored when overriding `mapper`. The mapper must yield `(out_key, out_value)` tuples; these tuples become the input to the reducer.
- `reducer(self, key, value)`: here `key` is an `out_key` produced by the mapper, and `value` is a generator that yields every `out_value` the mappers emitted for that `out_key`. The reducer's output is the final result.
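Because the reducer's `value` argument is a generator, it can be iterated only once. A small plain-Python sketch (not mrjob itself) of why `sum(counts)` works but a second pass over the same generator sees nothing:

```python
def values():
    # Stand-in for the generator mrjob passes as the reducer's value argument
    yield 1
    yield 1

counts = values()
total = sum(counts)     # consumes the generator: 2
leftover = sum(counts)  # generator already exhausted: 0
assert (total, leftover) == (2, 0)

# If multiple passes are needed, materialize the values first:
counts = list(values())
assert sum(counts) == 2 and max(counts) == 1
```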
- Run a local simulation:

  ```shell
  python wordcount.py -r local input.txt -o output
  ```

  The output of all reducers is written to the `output` directory:

  ```text
  output
  ├── part-00000
  ├── part-00001
  ├── part-00002
  ├── part-00003
  ├── part-00004
  ├── part-00005
  ├── part-00006
  └── part-00007
  ```

  If you redirect stdout instead, all output is merged into a single file:

  ```shell
  python wordcount.py -r local input.txt > output.txt
  ```

  ```text
  "of"	1
  "the"	1
  "data"	2
  "hadoop"	2
  "hello"	2
  "big"	2
  "to"	1
  "welcome"	1
  "world"	2
  "and"	1
  ```
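Each output line is a JSON-encoded key, a tab, and a JSON-encoded value (mrjob's default output protocol), so the merged file can be parsed back with the stdlib alone. A minimal sketch, using a few sample lines rather than reading `output.txt` from disk:

```python
import json

# Sample lines in the shape the word count job writes:
# JSON-encoded key, a tab, JSON-encoded value
output_lines = [
    '"hello"\t2',
    '"world"\t2',
    '"welcome"\t1',
]

counts = {}
for line in output_lines:
    key, value = line.rsplit("\t", 1)
    counts[json.loads(key)] = json.loads(value)

assert counts == {"hello": 2, "world": 2, "welcome": 1}
```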
Running on Amazon EMR
The MRJob library makes it easy to submit a MapReduce job to run on Amazon EMR.
- Edit the configuration file:

  ```shell
  vim ~/.mrjob.conf
  ```

  ```yaml
  runners:
    emr:
      # AWS credentials
      aws_access_key_id: YOUR_ACCESS_KEY
      aws_secret_access_key: YOUR_SECRET_ACCESS_KEY
      region: ap-east-1
      # Cluster configuration
      instance_type: m5.xlarge
      num_core_instances: 3
      # EMR release
      release_label: emr-7.3.0
  ```
- Run the program:

  ```shell
  python wordcount.py -r emr input.txt > output.txt
  ```

  MRJob automatically uploads the input file to S3, creates the EMR cluster, adds the job as a step, and downloads the output from S3 to the local machine.
See also: Hadoop - mrjob Python Library For MapReduce With Example | GeeksforGeeks
