Developing MapReduce Programs with the MRJob Library

Word-count data flow (flowchart):

- Input file (four lines): Hello World / Hello Python / Python Java / Java World
- split → Mapper 1: Hello->1, World->1; Mapper 2: Hello->1, Python->1; Mapper 3: Python->1, Java->1; Mapper 4: Java->1, World->1
- shuffle → Group 1: Hello->[1,1]; Group 2: World->[1,1]; Group 3: Python->[1,1]; Group 4: Java->[1,1]
- partition → Reducer 1 (groups 1 and 2): Hello->2, World->2; Reducer 2 (groups 3 and 4): Python->2, Java->2
- Output file: Hello 2, World 2, Python 2, Java 2
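
The flow above can be imitated in plain Python to make each stage concrete. This is only an illustrative sketch, not how Hadoop or MRJob is implemented: the function names (`map_line`, `shuffle`, `partition`, `run`) are hypothetical, and using `crc32` to pick a reducer is an assumption made so that the assignment is deterministic.

```python
import zlib
from collections import defaultdict

LINES = ["Hello World", "Hello Python", "Python Java", "Java World"]
NUM_REDUCERS = 2  # the diagram uses two reducers


def map_line(line):
    """Map step: emit (word, 1) for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: collect all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def partition(key, num_reducers):
    """Partition step: choose a reducer for a key (crc32 is an illustrative choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run(lines, num_reducers=NUM_REDUCERS):
    # One "mapper" per line, as in the diagram.
    mapped = [pair for line in lines for pair in map_line(line)]
    groups = shuffle(mapped)

    # Hand each key group to exactly one reducer.
    reducer_inputs = [dict() for _ in range(num_reducers)]
    for key, values in groups.items():
        reducer_inputs[partition(key, num_reducers)][key] = values

    # Reduce step: each reducer sums the values of the keys it received.
    output = {}
    for assigned in reducer_inputs:
        for key, values in assigned.items():
            output[key] = sum(values)
    return output


result = run(LINES)
print(result)
```

Which reducer handles which word depends on the partition function, but every key always ends up at exactly one reducer, so the merged output is the same.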

Local debugging

Prepare a test input file:

input.txt:

hello world
hello hadoop
welcome to the world of big data
hadoop and big data

Install the required libraries:
```
pip install mrjob boto3
```
Write the code:
```
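from mrjob.job import MRJob

class WordCount(MRJob):
    def mapper(self, _, line):
        # The input key is unused; emit (word, 1) for each word in the line.
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        # counts yields every 1 emitted for this word; sum them up.
        yield word, sum(counts)

if __name__ == '__main__':
    WordCount.run()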
```
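- mapper(self, key, value): by default, key is None and value is the raw input line with its trailing newline removed, so an overridden mapper usually ignores the key argument. The mapper must yield tuples of the form (out_key, out_value); these tuples become the input to the reducer.
- reducer(self, key, values): key is an out_key produced by the mapper, and values is a generator that yields every out_value the mapper emitted for that out_key. The reducer's output is the final result.

See also: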
- mapper
- reducer
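
The contract described above can be tried out without MRJob by calling a mapper and a reducer of the same shape by hand. The sketch below is an assumption-laden stand-in, not MRJob's own machinery: `run_local` is a hypothetical helper that sorts the mapper output and groups it by key. It also shows why a reducer should consume its generator only once.

```python
from itertools import groupby
from operator import itemgetter


def mapper(_, line):
    # Same shape as WordCount.mapper, minus self: the key is ignored.
    for word in line.split():
        yield word, 1


def reducer(word, counts):
    # counts is consumed exactly once, by sum().
    yield word, sum(counts)


def run_local(lines):
    """Hypothetical stand-in for the framework: map, sort by key, reduce."""
    pairs = []
    for line in lines:
        # The mapper receives None as the key and the line without its newline.
        pairs.extend(mapper(None, line.rstrip("\n")))
    pairs.sort(key=itemgetter(0))

    results = {}
    for key, group in groupby(pairs, key=itemgetter(0)):
        values = (value for _, value in group)  # a generator, like in a real reducer
        for out_key, out_value in reducer(key, values):
            results[out_key] = out_value
    return results


word_counts = run_local(["hello world\n", "hello hadoop\n"])

# A generator can only be walked once: the second pass sees nothing.
one_pass = (n for n in [1, 1])
first_total = sum(one_pass)
second_total = sum(one_pass)
```

If a reducer needs its values more than once (for example, to compute both a sum and a maximum), one option is to copy them into a list first, at the cost of holding them all in memory.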

Simulate a run locally:

python wordcount.py -r local input.txt -o output

The output of all reducers is written to the output directory:

output
├── part-00000
├── part-00001
├── part-00002
├── part-00003
├── part-00004
├── part-00005
├── part-00006
└── part-00007

If you redirect standard output instead, all of the output is merged into a single file:

python wordcount.py -r local input.txt > output.txt

"of"	1
"the"	1
"data"	2
"hadoop"	2
"hello"	2
"big"	2
"to"	1
"welcome"	1
"world"	2
"and"	1
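
Each output line above is a JSON-encoded key and a JSON-encoded value separated by a tab, and the lines are not globally sorted. A minimal sketch for loading such lines back into Python and ranking the words follows; `parse_output` is a hypothetical helper name, and the sample lines are copied from the output shown above.

```python
import json


def parse_output(lines):
    """Turn 'json_key<TAB>json_value' lines into a dict of word -> count."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        raw_key, raw_value = line.split("\t", 1)
        counts[json.loads(raw_key)] = json.loads(raw_value)
    return counts


sample = ['"of"\t1\n', '"data"\t2\n', '"hello"\t2\n']
counts = parse_output(sample)

# Rank by count (descending), then alphabetically to break ties.
ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
print(ranked)
```

The same function should work on an open file object, for example `parse_output(open("output.txt", encoding="utf-8"))`, or on the lines of each part file in the output directory.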

Running on Amazon EMR

With the MRJob library, a MapReduce job can easily be submitted to Amazon EMR.

Edit the configuration file:

vim ~/.mrjob.conf

runners:
    emr:
        # AWS credentials
        aws_access_key_id: YOUR_ACCESS_KEY
        aws_secret_access_key: YOUR_SECRET_ACCESS_KEY
        region: ap-east-1

        # Cluster configuration
        instance_type: m5.xlarge
        num_core_instances: 3

        # EMR release
        release_label: emr-7.3.0

Run the program:
```
python wordcount.py -r emr input.txt > output.txt
```
MRJob automatically uploads the input file to S3, creates an EMR cluster, adds the job step, and downloads the output from S3 to the local machine.

See also: Hadoop - mrjob Python Library For MapReduce With Example | GeeksforGeeks

posted @ 2025-01-20 02:12 Undefined443
