Introductory MapReduce Programming Practice
(1) Master basic MapReduce programming methods through hands-on experiments;
(2) Learn to use MapReduce to solve common data-processing problems, including data deduplication, data sorting, and data mining.
Lab Report
Topic: Introductory MapReduce Programming Practice
Experimental environment: OS: Linux (CentOS 7); Hadoop 3.3.4; JDK 1.8; Hive 3.1.3; Sqoop 1.4.7; MySQL 5.7. Local machine: Java IDE: IntelliJ IDEA; HBase 2.4.17; Python IDE: PyCharm 2025.1.1.1; Python 3.13; MySQL 8.0.
Experiment content and completion status:
Preparation


(I) Implement file merging and deduplication
Input file A:

Input file B:

Verify the files and upload them to HDFS:

Verify the upload:

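The screenshots above document the input files and the HDFS upload. For reference, a minimal sketch of a merge-and-deduplication MapReduce job is given below; the class names are hypothetical and not necessarily the code used in the experiment. The idea is that the mapper emits each line as a key, so duplicate lines from file A and file B collapse onto one key in the shuffle, and the reducer writes each distinct key once.
DedupMapper.java (hypothetical):
package com.example;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class DedupMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private Text line = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String text = value.toString().trim();
        if (!text.isEmpty()) {
            // Emit the whole line as the key; identical lines from both
            // input files end up under the same key after the shuffle.
            line.set(text);
            context.write(line, NullWritable.get());
        }
    }
}
DedupReducer.java (hypothetical):
package com.example;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class DedupReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        // Write each distinct line exactly once, discarding the duplicate values.
        context.write(key, NullWritable.get());
    }
}
The driver is wired the same way as the SortDriver shown in part (II) below, with Text and NullWritable as the output key/value classes.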
(II) Write a program to sort the input files
Start Hadoop and YARN:

Create the experiment directory:

Prepare the input files:

Upload them to HDFS:

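The upload shown in the screenshot (typically done with hdfs dfs -put) can also be performed programmatically through the Hadoop FileSystem API. A minimal sketch for reference, with an assumed NameNode address and hypothetical file and directory names:
package com.example;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class HdfsUpload {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with the cluster's actual fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        Path target = new Path("/user/hadoop/sort_input"); // hypothetical HDFS directory
        fs.mkdirs(target);
        // Hypothetical local file names standing in for the files prepared above.
        fs.copyFromLocalFile(new Path("file1.txt"), target);
        fs.copyFromLocalFile(new Path("file2.txt"), target);

        // Verify the upload by listing the target directory.
        for (FileStatus status : fs.listStatus(target)) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}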
Write the MapReduce code:
Create a Maven project on the local machine:
Configure pom.xml:
<groupId>com.example</groupId>
<artifactId>day1225Sort</artifactId>
<version>1.0-SNAPSHOT</version>

<properties>
    <maven.compiler.source>21</maven.compiler.source>
    <maven.compiler.target>21</maven.compiler.target>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.3.4</version>
    </dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.11.0</version> <!-- newer plugin version -->
            <configuration>
                <!-- the explicit plugin configuration overrides the 21 set in
                     <properties>; 1.8 matches the JDK on the Hadoop cluster -->
                <source>1.8</source>
                <target>1.8</target>
                <encoding>UTF-8</encoding>
            </configuration>
        </plugin>
    </plugins>
</build>
Sorting code:
SortMapper.java:
package com.example;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class SortMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    private IntWritable data = new IntWritable();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString().trim();
        if (!line.isEmpty()) {
            // Each non-empty line is assumed to contain a single integer.
            int num = Integer.parseInt(line);
            data.set(num);
            // Emit the number as the key so the shuffle phase sorts it automatically.
            context.write(data, new IntWritable(1));
        }
    }
}
SortReducer.java
package com.example;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class SortReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
    // Global rank counter; this relies on the job using a single reduce task
    // (the default), otherwise each reducer would number its keys independently.
    private IntWritable index = new IntWritable(1);

    @Override
    protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Keys arrive in ascending order; emit one (rank, number) pair per occurrence.
        for (IntWritable val : values) {
            context.write(index, key);
            index.set(index.get() + 1); // increment the rank
        }
    }
}
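For intuition (the numbers here are hypothetical, not the actual input data): if the uploaded files together contained the lines 3, 1, and 2, the shuffle phase would deliver the keys to the reducer in ascending order 1, 2, 3, and the reducer would emit the pairs (1, 1), (2, 2), (3, 3), i.e. a rank followed by the sorted value.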
SortDriver.java
package com.example;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class SortDriver {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: SortDriver <input path> <output path>");
            System.exit(-1);
        }
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Multi File Sort");
        job.setJarByClass(SortDriver.class);
        job.setMapperClass(SortMapper.class);
        job.setReducerClass(SortReducer.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        // Set the input and output paths.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Delete the output directory if it already exists.
        Path outputPath = new Path(args[1]);
        outputPath.getFileSystem(conf).delete(outputPath, true);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Run mvn clean package to build the jar file, then copy it to the virtual machine.

Execution result:

Final output:

(III) Mine information from the given table
Prepare the input file and upload it to HDFS:

Write the MapReduce program:
RelationMapper.java
package com.example.relation;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class RelationMapper extends Mapper<Object, Text, Text, Text> {
    private Text keyOut = new Text();
    private Text valueOut = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString().trim();
        // Skip the header line and blank lines.
        if (line.equals("child parent") || line.isEmpty()) {
            return;
        }
        String[] parts = line.split("\\s+");
        if (parts.length >= 2) {
            String child = parts[0];
            String parent = parts[1];
            // Emit two relations: parent->child and child->parent.
            // First: parent as key, child as value (used to find grandchildren).
            keyOut.set(parent);
            valueOut.set("1:" + child); // "1" marks a child relation
            context.write(keyOut, valueOut);
            // Second: child as key, parent as value (used to find grandparents).
            keyOut.set(child);
            valueOut.set("2:" + parent); // "2" marks a parent relation
            context.write(keyOut, valueOut);
        }
    }
}
RelationReducer.java
package com.example.relation;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
public class RelationReducer extends Reducer<Text, Text, Text, Text> {
    private List<String> children = new ArrayList<>();
    private List<String> parents = new ArrayList<>();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Reset the lists for the current key (the current person).
        children.clear();
        parents.clear();
        // Separate this person's children from this person's parents.
        for (Text val : values) {
            String[] parts = val.toString().split(":", 2);
            if (parts.length == 2) {
                if ("1".equals(parts[0])) {
                    // This value is one of the person's children.
                    children.add(parts[1]);
                } else if ("2".equals(parts[0])) {
                    // This value is one of the person's parents.
                    parents.add(parts[1]);
                }
            }
        }
        // Join them: every child of this person paired with every parent of
        // this person forms a grandchild-grandparent pair.
        for (String child : children) {
            for (String grandParent : parents) {
                context.write(new Text(child), new Text(grandParent));
            }
        }
    }

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Write the header once before any keys are processed
        // (this assumes a single reduce task, which is the default).
        context.write(new Text("grandchild"), new Text("grandparent"));
    }
}
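To illustrate the join (the names are hypothetical, not the actual table contents): if the input contained the lines "Tom Lucy" and "Lucy Mary", the mapper would emit (Lucy, 1:Tom) and (Tom, 2:Lucy) for the first line, and (Mary, 1:Lucy) and (Lucy, 2:Mary) for the second. The reducer call for key Lucy then sees both a child (Tom) and a parent (Mary) and writes the grandchild-grandparent pair (Tom, Mary).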
RelationDriver.java
package com.example.relation;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class RelationDriver {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: RelationDriver <input path> <output path>");
            System.exit(-1);
        }
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Family Relation Mining");
        job.setJarByClass(RelationDriver.class);
        job.setMapperClass(RelationMapper.class);
        job.setReducerClass(RelationReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Delete the output directory if it already exists.
        Path outputPath = new Path(args[1]);
        outputPath.getFileSystem(conf).delete(outputPath, true);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Package the program, copy it to the virtual machine, and run it:

Results:

Problems encountered:
The Hadoop NameNode was in safe mode, a read-only state that prevents changes to the file system while the system is starting up. As a result, I could not create directories or write file contents.

Solutions (problems encountered and how they were solved; unresolved problems):
To exit safe mode, run hdfs dfsadmin -safemode forceExit (forces the NameNode out of safe mode). Safe mode normally ends on its own once the NameNode has received enough block reports; hdfs dfsadmin -safemode leave is the usual way to leave it manually.

