MapReduce: how map reads an input file

Our input file, hello0, has the following content:

xiaowang 28 shanghai@_@zhangsan 38 beijing@_@someone 100 unknown

 

Logically it holds 3 records, separated by @_@.
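
To make the logical record boundaries concrete, here is a minimal plain-Java sketch (no Hadoop involved) that splits the file content on @_@. The class and method names are hypothetical and exist only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hadoop: shows the logical records in hello0.
class LogicalRecords {

    // Splits content on a literal delimiter (quoted so it is never read as a regex).
    static List<String> split(String content, String delimiter) {
        return Arrays.asList(content.split(Pattern.quote(delimiter)));
    }

    public static void main(String[] args) {
        String content = "xiaowang 28 shanghai@_@zhangsan 38 beijing@_@someone 100 unknown";
        // Prints one logical record per line.
        for (String record : split(content, "@_@")) {
            System.out.println(record);
        }
    }
}
```

Running it prints the three records, one per line. This is the view we would like each map call to receive.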

 

Let's see how map reads this data...

1. Default configuration

/*
 * New API.
 * Commented-out settings:
 *
 *   conf.set("textinputformat.record.delimiter", "@_@");
 *
 *   job.setInputFormatClass(Format0.class);
 *   job.setInputFormatClass(Format1.class);   // error here
 *   job.setInputFormatClass(Format3.class);   // or this one
 *   job.setInputFormatClass(Format4.class);   // error here
 *   job.setInputFormatClass(Format5.class);
 */

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Test0 {

    // Prints every record handed to map(), so we can see where the records are cut.
    public static class MyMapper extends Mapper<Object, Text, Text, IntWritable> {
        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            System.out.println(line);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf);
        job.setJarByClass(Test0.class);
        job.setJobName("myjob");

        job.setMapperClass(MyMapper.class);

        // args[0]: input path, args[1]: output directory (must not exist yet)
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Exit with a non-zero status if the job fails.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

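In the debugger we can see that value holds the entire file content as a single record: by default records are split at newlines, and the file contains no newline. map is therefore called only once.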
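
The same effect can be reproduced without Hadoop. The sketch below is only a stand-in for newline-delimited reading (the class and method names are hypothetical): it reads the hello0 content line by line with a plain BufferedReader and shows that a single "line" comes back.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for newline-delimited reading; it does not use Hadoop.
class NewlineRecords {

    // Returns the content cut at line terminators, as a newline-based reader sees it.
    static List<String> readLines(String content) {
        return new BufferedReader(new StringReader(content))
                .lines()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String content = "xiaowang 28 shanghai@_@zhangsan 38 beijing@_@someone 100 unknown";
        List<String> lines = readLines(content);
        // With no newline in the content, the whole file is one record.
        System.out.println(lines.size() + " record(s)");
        System.out.println(lines.get(0));
    }
}
```

Because @_@ means nothing to a newline-based reader, the three logical records arrive glued together in one value.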

 

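2. Setting textinputformat.record.delimiter

We set the textinputformat.record.delimiter parameter on the Configuration. This must happen before Job.getInstance(conf) is called, because the Job works on its own copy of the Configuration: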

conf.set("textinputformat.record.delimiter", "@_@");

Now map reads the records as we expect, and it is called 3 times.
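
As a rough illustration of what a delimiter-based record reader has to do, the plain-Java sketch below scans the bytes of the file and cuts a record every time it meets the delimiter bytes. It mirrors the idea only and is not Hadoop's LineRecordReader implementation; the class, its nested Record type and the method names are hypothetical. The offset stored with each record is its starting byte position, which I assume corresponds to the key passed to map in this setup.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; Hadoop's real reader works on streams and input splits.
class DelimitedRecords {

    // One record: the byte offset where it starts, plus its text.
    static final class Record {
        final long offset;
        final String text;

        Record(long offset, String text) {
            this.offset = offset;
            this.text = text;
        }
    }

    // Scans data and cuts a record each time the delimiter byte sequence is found.
    static List<Record> read(byte[] data, byte[] delimiter) {
        if (delimiter.length == 0) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<Record> records = new ArrayList<>();
        int start = 0;
        int i = 0;
        while (i <= data.length - delimiter.length) {
            if (matchesAt(data, i, delimiter)) {
                records.add(toRecord(data, start, i));
                i += delimiter.length;   // skip over the delimiter itself
                start = i;
            } else {
                i++;
            }
        }
        if (start < data.length) {
            // Trailing record that is not followed by a delimiter.
            records.add(toRecord(data, start, data.length));
        }
        return records;
    }

    private static boolean matchesAt(byte[] data, int pos, byte[] delimiter) {
        for (int j = 0; j < delimiter.length; j++) {
            if (data[pos + j] != delimiter[j]) {
                return false;
            }
        }
        return true;
    }

    private static Record toRecord(byte[] data, int from, int to) {
        return new Record(from, new String(data, from, to - from, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] data = "xiaowang 28 shanghai@_@zhangsan 38 beijing@_@someone 100 unknown"
                .getBytes(StandardCharsets.UTF_8);
        byte[] delimiter = "@_@".getBytes(StandardCharsets.UTF_8);
        for (Record r : read(data, delimiter)) {
            System.out.println(r.offset + "\t" + r.text);
        }
    }
}
```

For the hello0 content this yields three records starting at byte offsets 0, 23 and 45, which matches the three map calls observed above.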

 

3. Custom TextInputFormat

Subclass TextInputFormat and, in its createRecordReader method, pass the record delimiter we need to the LineRecordReader:

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class Format5 extends TextInputFormat {

    // Return a LineRecordReader that cuts records at "@_@" instead of at newlines.
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext tac) {
        byte[] recordDelimiterBytes = "@_@".getBytes(StandardCharsets.UTF_8);
        return new LineRecordReader(recordDelimiterBytes);
    }
}

Apply it to the job:

 job.setInputFormatClass(Format5.class);

 

This gives the same result as method 2.

 

posted @ 2015-05-09 15:43  Ready!