Introductory MapReduce Programming Practice
(1) Master basic MapReduce programming methods through hands-on experiments;
(2) Learn to use MapReduce to solve common data-processing problems, including data deduplication, data sorting, and data mining.
Lab Report
Topic: Introductory MapReduce Programming Practice
Experimental environment: OS: Linux (CentOS 7); Hadoop 3.3.4; JDK 1.8; Hive 3.1.3; Sqoop 1.4.7; MySQL 5.7. Local machine: Java IDE: IntelliJ IDEA; HBase 2.4.17; Python IDE: PyCharm 2025.1.1.1; Python 3.13; MySQL 8.0.
Experiment content and completion status:
Preparation


(I) Implement file merging and deduplication
Input file A:

Input file B:

Verify the files and upload them to HDFS:

Verify the upload:

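The screenshots above document the input files and the HDFS upload. For reference, a minimal sketch of a merge-and-deduplication MapReduce job is given below; the class names are hypothetical and not necessarily the code used in the experiment. The idea is that the mapper emits each line as a key, so duplicate lines from file A and file B collapse onto one key in the shuffle, and the reducer writes each distinct key once.
DedupMapper.java (hypothetical):
package com.example;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class DedupMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private Text line = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String text = value.toString().trim();
        if (!text.isEmpty()) {
            // Emit the whole line as the key; identical lines from both
            // input files end up under the same key after the shuffle.
            line.set(text);
            context.write(line, NullWritable.get());
        }
    }
}
DedupReducer.java (hypothetical):
package com.example;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class DedupReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        // Write each distinct line exactly once, discarding the duplicate values.
        context.write(key, NullWritable.get());
    }
}
The driver is wired the same way as the SortDriver shown in part (II) below, with Text and NullWritable as the output key/value classes.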
(II) Write a program to sort the input files
Start Hadoop and YARN:

Create the experiment directory:

Prepare the input files:

Upload them to HDFS:

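The upload shown in the screenshot (typically done with hdfs dfs -put) can also be performed programmatically through the Hadoop FileSystem API. A minimal sketch for reference, with an assumed NameNode address and hypothetical file and directory names:
package com.example;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class HdfsUpload {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with the cluster's actual fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        Path target = new Path("/user/hadoop/sort_input"); // hypothetical HDFS directory
        fs.mkdirs(target);
        // Hypothetical local file names standing in for the files prepared above.
        fs.copyFromLocalFile(new Path("file1.txt"), target);
        fs.copyFromLocalFile(new Path("file2.txt"), target);

        // Verify the upload by listing the target directory.
        for (FileStatus status : fs.listStatus(target)) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}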
Write the MapReduce code:
Create a Maven project on the local machine:
Configure pom.xml:
<groupId>com.example</groupId>
<artifactId>day1225Sort</artifactId>
<version>1.0-SNAPSHOT</version>

<properties>
    <maven.compiler.source>21</maven.compiler.source>
    <maven.compiler.target>21</maven.compiler.target>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.3.4</version>
    </dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.11.0</version> <!-- newer plugin version -->
            <configuration>
                <!-- the explicit plugin configuration overrides the 21 set in
                     <properties>; 1.8 matches the JDK on the Hadoop cluster -->
                <source>1.8</source>
                <target>1.8</target>
                <encoding>UTF-8</encoding>
            </configuration>
        </plugin>
    </plugins>
</build>
Sorting code:
SortMapper.java:
package com.example;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class SortMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    private IntWritable data = new IntWritable();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString().trim();
        if (!line.isEmpty()) {
            // Each non-empty line is assumed to contain a single integer.
            int num = Integer.parseInt(line);
            data.set(num);
            // Emit the number as the key so the shuffle phase sorts it automatically.
            context.write(data, new IntWritable(1));
        }
    }
}
SortReducer.java
package com.example;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class SortReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
    // Global rank counter; this relies on the job using a single reduce task
    // (the default), otherwise each reducer would number its keys independently.
    private IntWritable index = new IntWritable(1);

    @Override
    protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Keys arrive in ascending order; emit one (rank, number) pair per occurrence.
        for (IntWritable val : values) {
            context.write(index, key);
            index.set(index.get() + 1); // increment the rank
        }
    }
}
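For intuition (the numbers here are hypothetical, not the actual input data): if the uploaded files together contained the lines 3, 1, and 2, the shuffle phase would deliver the keys to the reducer in ascending order 1, 2, 3, and the reducer would emit the pairs (1, 1), (2, 2), (3, 3), i.e. a rank followed by the sorted value.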
SortDriver.java
package com.example;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class SortDriver {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: SortDriver <input path> <output path>");
            System.exit(-1);
        }
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Multi File Sort");
        job.setJarByClass(SortDriver.class);
        job.setMapperClass(SortMapper.class);
        job.setReducerClass(SortReducer.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        // Set the input and output paths.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Delete the output directory if it already exists.
        Path outputPath = new Path(args[1]);
        outputPath.getFileSystem(conf).delete(outputPath, true);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Run mvn clean package to build the jar file, then copy it to the virtual machine.

Execution result:

Final output:

(III) Mine information from the given table
Prepare the input file and upload it to HDFS:

Write the MapReduce program:
RelationMapper.java
package com.example.relation;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class RelationMapper extends Mapper<Object, Text, Text, Text> {
    private Text keyOut = new Text();
    private Text valueOut = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString().trim();
        // Skip the header line and blank lines.
        if (line.equals("child parent") || line.isEmpty()) {
            return;
        }
        String[] parts = line.split("\\s+");
        if (parts.length >= 2) {
            String child = parts[0];
            String parent = parts[1];
            // Emit two relations: parent->child and child->parent.
            // First: parent as key, child as value (used to find grandchildren).
            keyOut.set(parent);
            valueOut.set("1:" + child); // "1" marks a child relation
            context.write(keyOut, valueOut);
            // Second: child as key, parent as value (used to find grandparents).
            keyOut.set(child);
            valueOut.set("2:" + parent); // "2" marks a parent relation
            context.write(keyOut, valueOut);
        }
    }
}
RelationReducer.java
package com.example.relation;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
public class RelationReducer extends Reducer<Text, Text, Text, Text> {
    private List<String> children = new ArrayList<>();
    private List<String> parents = new ArrayList<>();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Reset the lists for the current key (the current person).
        children.clear();
        parents.clear();
        // Separate this person's children from this person's parents.
        for (Text val : values) {
            String[] parts = val.toString().split(":", 2);
            if (parts.length == 2) {
                if ("1".equals(parts[0])) {
                    // This value is one of the person's children.
                    children.add(parts[1]);
                } else if ("2".equals(parts[0])) {
                    // This value is one of the person's parents.
                    parents.add(parts[1]);
                }
            }
        }
        // Join them: every child of this person paired with every parent of
        // this person forms a grandchild-grandparent pair.
        for (String child : children) {
            for (String grandParent : parents) {
                context.write(new Text(child), new Text(grandParent));
            }
        }
    }

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Write the header once before any keys are processed
        // (this assumes a single reduce task, which is the default).
        context.write(new Text("grandchild"), new Text("grandparent"));
    }
}
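To illustrate the join (the names are hypothetical, not the actual table contents): if the input contained the lines "Tom Lucy" and "Lucy Mary", the mapper would emit (Lucy, 1:Tom) and (Tom, 2:Lucy) for the first line, and (Mary, 1:Lucy) and (Lucy, 2:Mary) for the second. The reducer call for key Lucy then sees both a child (Tom) and a parent (Mary) and writes the grandchild-grandparent pair (Tom, Mary).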
RelationDriver.java
package com.example.relation;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class RelationDriver {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: RelationDriver <input path> <output path>");
            System.exit(-1);
        }
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Family Relation Mining");
        job.setJarByClass(RelationDriver.class);
        job.setMapperClass(RelationMapper.class);
        job.setReducerClass(RelationReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Delete the output directory if it already exists.
        Path outputPath = new Path(args[1]);
        outputPath.getFileSystem(conf).delete(outputPath, true);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Package the program, copy it to the virtual machine, and run it:

Results:

Problems encountered:
The Hadoop NameNode was in safe mode, a read-only state that prevents changes to the file system while the system is starting up. As a result, I could not create directories or write file contents.

Solutions (problems encountered and how they were solved; unresolved problems):
To exit safe mode, run hdfs dfsadmin -safemode forceExit (forces the NameNode out of safe mode). Safe mode normally ends on its own once the NameNode has received enough block reports; hdfs dfsadmin -safemode leave is the usual way to leave it manually.

