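The MapReduce Workflow

A simple WordCount program

I. Preparation

1. MapReduce runs on YARN, so HDFS and YARN must both be started before a MapReduce job can run.

start-dfs.sh     # start HDFS
start-yarn.sh    # start YARN

2. YARN in turn depends on HDFS, so a MapReduce project needs the common, hdfs, yarn, and mapreduce jars. They are located under hadoop/share/hadoop: import every jar in the common, hdfs, yarn, and mapreduce folders, together with the jars they depend on, into the project. With Maven, declaring the Hadoop dependencies in pom.xml is enough.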
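For the Maven route, the dependency section might look like the fragment below. It is only a sketch: hadoop-client is the aggregate client artifact published by the Apache Hadoop project, and the version number shown is an assumption that should be replaced with the version the cluster actually runs.

```xml
<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <!-- Assumed version; use the version of your own cluster. -->
    <version>2.7.7</version>
  </dependency>
</dependencies>
```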

3. Sample input. This example uses Su Shi's poem 水调歌头·明月几时有 (Shuidiao Getou):

明月几时有，把酒问青天。 
不知天上宫阙，今夕是何年？ 
我欲乘风归去，又恐琼楼玉宇， 
高处不胜寒。 
起舞弄清影，何似在人间！ 
转朱阁，低绮户，照无眠。 
不应有恨，何事长向别时圆？ 
人有悲欢离合，月有阴晴圆缺， 
此事古难全。 
但愿人长久，千里共婵娟。

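4. MapReduce has its own programming model, which dictates how the code is organized. A program consists of three parts: a Map, a Reduce, and a Job that assembles the Map and the Reduce.

5. Write the Map class. It must extend Mapper and override the map method. Its output is written to a temporary intermediate file.

6. Write the Reduce class. It must extend Reducer and override the reduce method. It takes the output of the Map class as its input; once that output has been consumed, the temporary file is deleted.

7. Write the Driver class, which ties HDFS, the Mapper, and the Reducer together. The directory that receives the final output must not already exist.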
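The division of labor between the three parts can be sketched in plain Java without any Hadoop classes. Everything here is hypothetical and for illustration only: MiniMapReduce, MapFn, ReduceFn, and run are made-up names, and the whole "job" runs in memory rather than on a cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical in-memory model of the three parts; this is not the Hadoop API.
class MiniMapReduce {

    // "Map": turns one input line into zero or more (key, value) pairs via emit.
    interface MapFn {
        void map(String line, BiConsumer<String, Integer> emit);
    }

    // "Reduce": folds all values collected for one key into a single result.
    interface ReduceFn {
        int reduce(String key, List<Integer> values);
    }

    // "Job": wires a MapFn and a ReduceFn together and runs them over the input.
    static Map<String, Integer> run(List<String> lines, MapFn mapper, ReduceFn reducer) {
        // Intermediate data: every value emitted for a key, grouped under that key.
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (String line : lines) {
            mapper.map(line, (k, v) -> grouped.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
        }
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reducer.reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = run(
                Arrays.asList("a b a", "b c"),
                (line, emit) -> { for (String w : line.split(" ")) emit.accept(w, 1); },
                (key, values) -> values.stream().mapToInt(Integer::intValue).sum());
        System.out.println(counts); // {a=2, b=2, c=1}
    }
}
```

In real Hadoop the same roles are played by the Mapper subclass, the Reducer subclass, and the driver that configures a Job; the grouping in the middle is done by the framework.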

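II. Walkthrough

1. The map phase

package com.blb.core;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * The map phase reads the input one line at a time.
 * LongWritable: input key, the byte offset of the line within the file
 * Text: input value, the text of the line
 * Text: output key type
 * IntWritable: output value type
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable ikey, Text ivalue, Context context) throws IOException, InterruptedException {
        // Strip punctuation and whitespace so that only the characters of the poem are counted.
        char[] charArray = ivalue.toString().replaceAll("[，。？！\\s]", "").toCharArray();
        for (char c : charArray) {
            context.write(new Text(String.valueOf(c)), new IntWritable(1)); // emit (character, 1) as intermediate output
        }
    }
}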

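2. The intermediate data

Take the line 人有悲欢离合，月有阴晴圆缺， as an example. With the punctuation stripped, the mapper sees the characters below and emits one (character, 1) pair for each of them:

人有悲欢离合月有阴晴圆缺

Before the reduce phase, the pairs that share a key are grouped together, so the intermediate data becomes: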

人    [1]
有    [1,1]
悲    [1]
欢    [1]
离    [1]
合    [1]
月    [1]
阴    [1]
晴    [1]
圆    [1]
缺    [1]
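The grouping above can be reproduced locally with a few lines of plain Java. This is a sketch, not Hadoop code: ShuffleSketch and mapAndGroup are hypothetical names, the punctuation set in the regular expression is an assumption, and a LinkedHashMap is used so that keys appear in first-seen order as in the listing, whereas Hadoop itself hands keys to the reducer in sorted order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of map output being grouped by key; not Hadoop code.
class ShuffleSketch {

    // Emits a 1 for each remaining character and collects the 1s under that character.
    static Map<String, List<Integer>> mapAndGroup(String line) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        // Assumed punctuation set: full-width comma, period, question and exclamation marks.
        String cleaned = line.replaceAll("[，。？！\\s]", "");
        for (char c : cleaned.toCharArray()) {
            grouped.computeIfAbsent(String.valueOf(c), k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Prints one "character<TAB>list of 1s" line per distinct character.
        mapAndGroup("人有悲欢离合，月有阴晴圆缺，")
                .forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```

Because 有 occurs twice in the line, its list holds two entries while every other character has one.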

3. The reduce phase

package com.blb.core;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * Text: input key type of the reduce phase
 * IntWritable: input value type of the reduce phase
 * Text: output key type
 * IntWritable: output value type
 */
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text _key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Add up all the 1s that were emitted for this character.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }

        context.write(_key, new IntWritable(sum)); // write (character, count) to the final output
    }
}
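What the reducer does to each group can be checked with a small plain-Java sketch. ReduceSketch is a hypothetical name and no Hadoop classes are involved; the sample groups are taken from the listing in the previous step.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of the reduce step; not Hadoop code.
class ReduceSketch {

    // Mirrors the summation in WordCountReducer.reduce: adds up every value collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        grouped.put("人", Arrays.asList(1));
        grouped.put("有", Arrays.asList(1, 1));
        // Print one "key<TAB>count" line per key, the same layout as the job's text output.
        grouped.forEach((key, values) -> System.out.println(key + "\t" + reduce(values)));
    }
}
```

For the two sample groups this gives a count of 1 for 人 and 2 for 有.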

4. The job

package com.blb.core;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountJob {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://hdp01:9000");      // the HDFS file system to use
        Job job = Job.getInstance(conf, "WordCount");       // job name
        job.setJarByClass(WordCountJob.class);              // class used to locate the job jar
        job.setMapperClass(WordCountMapper.class);          // mapper class
        job.setReducerClass(WordCountReducer.class);        // reducer class
        job.setMapOutputKeyClass(Text.class);               // key type of the map output
        job.setMapOutputValueClass(IntWritable.class);      // value type of the map output
        job.setOutputKeyClass(Text.class);                  // key type of the reduce output
        job.setOutputValueClass(IntWritable.class);         // value type of the reduce output

        FileInputFormat.setInputPaths(job, new Path("/1.txt"));   // input file or directory
        FileOutputFormat.setOutputPath(job, new Path("/out"));    // output directory; it must not already exist

        // Block until the job finishes and report success or failure through the exit code.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

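III. Running the job

1. Write the program

2. Package it as a jar

3. Upload the input file to HDFS

4. Upload the jar to the Linux machine

5. Run the MapReduce job with the hadoop jar command

hadoop jar wordcount.jar com.blb.core.WordCountJob

6. Wait for the job to complete

7. View the results

hadoop fs -cat /out/part-r-00000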

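Developing MapReduce with Eclipse

I. Preparation

1. Set up Hadoop on Windows

Download the Eclipse plugin and the Hadoop bin directory files compiled for Windows, then extract the latter into Hadoop's bin directory. Download link: https://pan.baidu.com/s/1iXp3MeiE8pXS3QevDJ24kw (extraction code: mzye)

Copy the Hadoop plugin into eclipse/plugins.

2. Configure the system environment variables

3. Start Eclipse

Switch to the MapReduce perspective.

4. Open the view

Go to Window -> Show View -> MapReduce Tools -> MapReduce Location.

5. Set the Hadoop path in Eclipse

Window -> Preferences -> Hadoop

6. Connect to the HDFS file system from Eclipse

Right-click MapReduce Location.

Enter the IP address, port, login name, and the other connection details.

Once the connection succeeds, it shows up in the project view and you can start writing code.

II. Exercise: totaling proxy-shopping orders

1. Generate fake data

Create a new project.

Click Next.

Enter a project name to create a MapReduce project; there is no need to import the jars by hand.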

package com.blb.core;

import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/**
 * Generates fake shopping lists for 300 households, one list file per household.
 * Both the products and the quantities are random.
 * Toiletries (basin, cup, toothbrush, toothpaste, towel, laundry soap, soap dish,
 * shampoo, conditioner, body wash): quantity 1-5
 * Bedding (pillow, pillowcase, pillow towel, quilt, quilt cover, cotton quilt,
 * blanket, mattress, summer mat): quantity is always 1
 * Home appliances (induction cooker, rice cooker, hair dryer, electric kettle,
 * soy milk maker, desk lamp): quantity 1-3
 * Kitchenware (pot, bowl, ladle, basin, stove): quantity 1-2
 * Staples (firewood, rice, oil, salt, soy sauce, vinegar): quantity 1-6
 * The 300 files are named 1.txt to 300.txt.
 * @author Administrator
 */
public class BuildBill {
    private static final Random random = new Random();
    private static final List<String> washList = Arrays.asList(
            "脸盆", "杯子", "牙刷", "牙膏", "毛巾", "肥皂", "皂盒", "洗发水", "护发素", "沐浴液");
    private static final List<String> bedList = Arrays.asList(
            "枕头", "枕套", "枕巾", "被子", "被套", "棉被", "毯子", "床垫", "凉席");
    private static final List<String> homeList = Arrays.asList(
            "电磁炉", "电饭煲", "吹风机", "电水壶", "豆浆机", "台灯");
    private static final List<String> kitchenList = Arrays.asList("锅", "碗", "瓢", "盆", "灶");
    private static final List<String> useList = Arrays.asList("柴", "米", "油", "盐", "酱", "醋");

    // Decides whether the household wants anything from a category (probability 1/2).
    private static boolean iswant() {
        return random.nextBoolean();
    }

    /**
     * Decides how many lines to write for a category.
     * @param sum exclusive upper bound
     * @return a random number n with 0 <= n < sum
     */
    private static int wantNum(int sum) {
        return random.nextInt(sum);
    }

    // With probability 1/2, writes up to products.size() lines of "product<TAB>quantity",
    // where the quantity lies between 1 and maxQuantity. A product may be drawn more than once.
    private static void writeCategory(BufferedWriter writer, List<String> products, int maxQuantity)
            throws IOException {
        if (!iswant()) {
            return;
        }
        // The number of lines cannot exceed the number of products in the category.
        int wantNum = wantNum(products.size() + 1);
        for (int j = 0; j < wantNum; j++) {
            String product = products.get(random.nextInt(products.size()));
            writer.write(product + "\t" + (random.nextInt(maxQuantity) + 1));
            writer.newLine();
        }
    }

    // Generates the 300 list files. The files must be written as UTF-8, and the
    // directory E:\tmp must already exist. Example line:
    // 油	2
    public static void main(String[] args) {
        for (int i = 1; i <= 300; i++) {
            // Wrap the byte stream in a character stream with an explicit encoding;
            // try-with-resources flushes and closes the writer.
            try (BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(
                    new FileOutputStream("E:\\tmp\\" + i + ".txt"), StandardCharsets.UTF_8))) {
                writeCategory(writer, washList, 5);
                writeCategory(writer, bedList, 1);
                writeCategory(writer, homeList, 3);
                writeCategory(writer, kitchenList, 2);
                writeCategory(writer, useList, 6);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

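Data format: each line holds a product name and a quantity separated by a tab, for example 油, a tab, then 2.

2. mapper

Choose File -> New to create a new Mapper.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ShopCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable ikey, Text ivalue, Context context) throws IOException, InterruptedException {
        // Each line is "product<TAB>quantity"; emit (product, quantity).
        String[] word = ivalue.toString().split("\t");
        if (word.length < 2) return; // skip blank or malformed lines
        context.write(new Text(word[0]), new IntWritable(Integer.parseInt(word[1].trim())));
    }
}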
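The parsing step of the mapper can be tried out on its own before touching the cluster. The sketch below is plain Java with a hypothetical helper, BillLineParser.parse, that is not part of the original project; it assumes that lines without exactly two tab-separated fields, or with a non-numeric quantity, should simply be ignored.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper that isolates the line parsing done in the mapper; not Hadoop code.
class BillLineParser {

    // Parses "product<TAB>quantity"; returns empty for blank or malformed lines.
    static Optional<Map.Entry<String, Integer>> parse(String line) {
        String[] fields = line.split("\t");
        if (fields.length != 2) {
            return Optional.empty();
        }
        try {
            int quantity = Integer.parseInt(fields[1].trim());
            Map.Entry<String, Integer> entry = new SimpleEntry<>(fields[0].trim(), quantity);
            return Optional.of(entry);
        } catch (NumberFormatException e) {
            // The quantity field is not a number.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("油\t2"));        // Optional[油=2]
        System.out.println(parse("no tab here")); // Optional.empty
    }
}
```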

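3. reduce

Choose File -> New to create a new Reducer.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ShopCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text _key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Add up the quantities requested for this product across all households.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(_key, new IntWritable(sum));
    }

}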
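To spot-check the job's output on a small sample, the same per-product totals can be computed locally. The following is a sketch in plain Java under the assumption that the input lines follow the product, tab, quantity layout; LocalShopCount and total are hypothetical names, and the lines are held in memory instead of being read from HDFS.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Local stand-in for the ShopCount job, meant only for spot-checking small inputs.
class LocalShopCount {

    // Sums the quantity per product over lines of "product<TAB>quantity".
    static Map<String, Integer> total(List<String> lines) {
        Map<String, Integer> totals = new TreeMap<>(); // sorted keys give a stable print order
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (fields.length != 2) {
                continue; // skip blank or malformed lines
            }
            totals.merge(fields[0], Integer.parseInt(fields[1].trim()), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("油\t2", "米\t3", "油\t4");
        System.out.println(total(lines));
    }
}
```

With the three sample lines, 油 totals 6 and 米 totals 3, which is what the reducer would write for the same input.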

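4. job

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ShopCountJob {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://hdp01:9000");   // the HDFS file system to use
        Job job = Job.getInstance(conf, "ShopCountDriver");
        job.setJarByClass(ShopCountJob.class);
        job.setMapperClass(ShopCountMapper.class);
        job.setReducerClass(ShopCountReducer.class);

        // The map output types are not set separately, so they default to these output types.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output are directories, not files; the output directory must not already exist.
        FileInputFormat.setInputPaths(job, new Path("/upload"));
        FileOutputFormat.setOutputPath(job, new Path("/out2/"));

        // Block until the job finishes and report success or failure through the exit code.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

}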

Configure log4j.properties

log4j.rootLogger = debug,stdout


log4j.appender.stdout = org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target = System.out
log4j.appender.stdout.layout = org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern = [%-5p] %d{yyyy-MM-dd HH:mm:ss,SSS} method:%l%n%m%n

Right-click the job class and run it, then check the results in /out2.

Posted 2020-03-04 17:00 by phy2020
