Introduction

All MapReduce jobs begin with a collection of records, the initial key-value pairs passed to the mapper. The Hadoop InputFormat interface describes this input specification. Simple implementations such as TextInputFormat assume records are text blobs stored in a file, one line per record. This is obviously insufficient for more complex record structures. One solution is to keep input records in some text-based structured format (e.g., XML) and write custom record readers. The downsides are that you need custom code for every record type, and that you have to reparse the text-based input every time you read it.
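Writing a parser for every text format is exactly what Hadoop's Writable interface avoids: a record type implements write(DataOutput) and readFields(DataInput) once, and from then on it serializes itself in binary. Here is a plain-JDK sketch of the pattern (the MyWritable interface and ScoredDoc type are stand-ins invented for this sketch so it compiles without Hadoop on the classpath; real code would implement org.apache.hadoop.io.Writable instead):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Stand-in for org.apache.hadoop.io.Writable, which declares
// exactly these two methods.
interface MyWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// A hypothetical record type: a (docno, score) pair.
class ScoredDoc implements MyWritable {
    long docno;
    float score;

    public void write(DataOutput out) throws IOException {
        out.writeLong(docno);   // serialize fields in a fixed order...
        out.writeFloat(score);
    }

    public void readFields(DataInput in) throws IOException {
        docno = in.readLong();  // ...and read them back in the same order
        score = in.readFloat();
    }
}

public class WritableDemo {
    // Serialize a record to bytes, then deserialize into a fresh object.
    static ScoredDoc roundTrip(ScoredDoc d) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        d.write(new DataOutputStream(bytes));
        ScoredDoc copy = new ScoredDoc();
        copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
        return copy;
    }

    public static void main(String[] args) throws IOException {
        ScoredDoc d = new ScoredDoc();
        d.docno = 7L;
        d.score = 0.5f;
        ScoredDoc copy = roundTrip(d);
        System.out.println(copy.docno + "\t" + copy.score);
    }
}
```

The point is that the binary representation is defined once, next to the record type, instead of in a separate parser for each job.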

Once you've become acquainted with the word count demo, the next thing to learn about is SequenceFiles. They are flat files containing a sequence of key-value pairs encoded in a binary format. When reading a SequenceFile, Hadoop deserializes the keys and values for you automatically, so the most straightforward way to prepare input for a MapReduce job is to preprocess your data into SequenceFiles. The typical sequence of actions might be:

  1. Create SequenceFiles from your input data on your local machine.
  2. Copy the SequenceFiles over to the cluster (via scp).
  3. Copy the SequenceFiles into HDFS (via hadoop dfs -put).
  4. Start MapReducing!
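To make "binary format" concrete: Writable keys and values serialize themselves through java.io.DataOutput and come back through DataInput, with no text parsing on the read side. Here's a plain-JDK sketch of that round trip (this illustrates the serialization idea only, not the actual SequenceFile on-disk layout, which also has headers, sync markers, and optional compression):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class BinaryPairDemo {
    // Serialize a (long, String) pair to bytes, the way a Writable's
    // write(DataOutput) would, then read it back with DataInput.
    static String roundTrip(long key, String value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeLong(key);   // like LongWritable.write(out)
        out.writeUTF(value);  // value payload, e.g. a small JSON blob
        out.close();

        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray()));
        long k = in.readLong();  // no text parsing needed on read
        String v = in.readUTF();
        return k + "\t" + v;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip(42L, "{\"term\":\"hadoop\"}"));
    }
}
```

Pay the parsing cost once, when you pack the data; every subsequent job reads the binary form directly.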

Creating a SequenceFile is pretty straightforward:

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, new Path(outfile),
    LongWritable.class, JSONObjectWritable.class);

In this case, outfile is the name of the output file. The key type in the SequenceFile is LongWritable, and the value type is JSONObjectWritable. Once you've created the writer, adding key-value pairs is easy:

LongWritable l = new LongWritable();
JSONObjectWritable json = new JSONObjectWritable();
...

writer.append(l, json);

Remember to close the file when you are done!

writer.close();

Reading SequenceFiles

Use the following code fragment for reading SequenceFiles:

Configuration config = new Configuration();
FileSystem fs = FileSystem.get(config);
SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, config);

WritableComparable key = (WritableComparable) reader.getKeyClass().newInstance();
Writable value = (Writable) reader.getValueClass().newInstance();

while (reader.next(key, value)) {
    // do something with key and value; next() deserializes the next
    // pair into the same two objects, so there is no need to create
    // fresh instances on each iteration
}
reader.close();

Or easier still, use SequenceFileUtils in Cloud9:

List<KeyValuePair<WritableComparable, Writable>> pairs = 
  SequenceFileUtils.readDirectory(path, Integer.MAX_VALUE);

The first argument is the path to read; the second is the maximum number of key-value pairs to read.

Windows users beware!

On Windows (Cygwin), Hadoop 0.17.0 may croak with the following error when writing SequenceFiles:

08/08/11 08:55:26 WARN fs.FileSystem: uri=file:///
javax.security.auth.login.LoginException: Login failed: Expect one token as the result of whoami: Jimmy Lin
        at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
        at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
        at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
        at org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
        at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1353)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1289)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:203)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:108)
        at edu.umd.cloud9.demo.DemoPackRecords.main(DemoPackRecords.java:69)

The problem is that Hadoop shells out to whoami and expects its output to be a single token, but Windows usernames may contain spaces. To fix this, edit the /etc/passwd file in Cygwin and change your username (the first field) to one without spaces.