Hadoop Object Serialization Mechanism 01

Reference: Hadoop: The Definitive Guide

In Hadoop, interprocess communication between nodes in the system is implemented using remote procedure calls (RPCs). The RPC protocol uses serialization to render the message into a binary stream to be sent to the remote node, which then deserializes the binary stream into the original message.

The Writable Interface

The Writable interface defines two methods: one for writing its state to a DataOutput binary stream, and one for reading its state from a DataInput binary stream:

package org.apache.hadoop.io;

import java.io.DataOutput;
import java.io.DataInput;
import java.io.IOException;

public interface Writable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

We can look at a particular Writable to see what we can do with it. We will use IntWritable, a wrapper for a Java int. We create an instance of it and set its value using the set() method:

        IntWritable writable = new IntWritable();
        writable.set(163);

To examine the serialized form of the IntWritable, we write a helper method that wraps a java.io.ByteArrayOutputStream in a java.io.DataOutputStream to capture the bytes in the serialized stream:

    public static byte[] serialize(Writable writable) throws IOException{
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        writable.write(dataOut);
        dataOut.close();
        return out.toByteArray();
    }

The bytes are written in big-endian order, as dictated by the java.io.DataOutput interface. We can see their hexadecimal representation by using a method on Hadoop’s StringUtils:

System.out.println(StringUtils.byteToHexString(bytes));

And the output:

"000000a3"

Then we try deserialization. We create a method that reads a Writable object from a byte array:

    public static byte[] deserialize(Writable writable, byte[] bytes) throws IOException{
        ByteArrayInputStream in = new ByteArrayInputStream(bytes);
        DataInputStream dataIn = new DataInputStream(in);
        writable.readFields(dataIn);
        dataIn.close();
        return bytes;
    }

We construct a new, value-less IntWritable, then call deserialize() to read from the output data that we just wrote. Then we check that its value, retrieved using the get() method, is the original value, 163.
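
For example (a minimal sketch of the round trip, reusing the serialize() and deserialize() helpers above; the variable names are just illustrative):

    byte[] bytes = serialize(writable);            // the serialized form of 163 from above
    IntWritable newWritable = new IntWritable();
    deserialize(newWritable, bytes);
    System.out.println(newWritable.get());         // prints 163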

WritableComparable and comparators

/**
 * A {@link Writable} which is also {@link Comparable}. 
 *
 * <p><code>WritableComparable</code>s can be compared to each other, typically 
 * via <code>Comparator</code>s. Any type which is to be used as a 
 * <code>key</code> in the Hadoop Map-Reduce framework should implement this
 * interface.</p>
 *
 * <p>Note that <code>hashCode()</code> is frequently used in Hadoop to partition
 * keys. It's important that your implementation of hashCode() returns the same 
 * result across different instances of the JVM. Note also that the default 
 * <code>hashCode()</code> implementation in <code>Object</code> does <b>not</b>
 * satisfy this property.</p>
 *  
 * <p>Example:</p>
 * <p><blockquote><pre>
 *     public class MyWritableComparable implements WritableComparable<MyWritableComparable> {
 *       // Some data
 *       private int counter;
 *       private long timestamp;
 *       
 *       public void write(DataOutput out) throws IOException {
 *         out.writeInt(counter);
 *         out.writeLong(timestamp);
 *       }
 *       
 *       public void readFields(DataInput in) throws IOException {
 *         counter = in.readInt();
 *         timestamp = in.readLong();
 *       }
 *       
 *       public int compareTo(MyWritableComparable o) {
 *         int thisValue = this.value;
 *         int thatValue = o.value;
 *         return (thisValue &lt; thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
 *       }
 *
 *       public int hashCode() {
 *         final int prime = 31;
 *         int result = 1;
 *         result = prime * result + counter;
 *         result = prime * result + (int) (timestamp ^ (timestamp &gt;&gt;&gt; 32));
 *         return result
 *       }
 *     }
 * </pre></blockquote></p>
 */
@InterfaceAudience.Public
@InterfaceStability.Stable
public interface WritableComparable<T> extends Writable, Comparable<T> {
}

Comparison of types is crucial for MapReduce, where there is a sorting phase during which keys are compared with one another. One optimization that Hadoop provides is the RawComparator extension of Java's Comparator:

/**
 * <p>
 * A {@link Comparator} that operates directly on byte representations of
 * objects.
 * </p>
 * @param <T>
 * @see DeserializerComparator
 */
@InterfaceAudience.Public
@InterfaceStability.Stable
public interface RawComparator<T> extends Comparator<T> {

  /**
   * Compare two objects in binary.
   * b1[s1:l1] is the first object, and b2[s2:l2] is the second object.
   * 
   * @param b1 The first byte array.
   * @param s1 The position index in b1. The object under comparison's starting index.
   * @param l1 The length of the object in b1.
   * @param b2 The second byte array.
   * @param s2 The position index in b2. The object under comparison's starting index.
   * @param l2 The length of the object under comparison in b2.
   * @return An integer result of the comparison.
   */
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);

}

This interface permits implementors to compare records read from a stream without deserializing them into objects, thereby avoiding any overhead of object creation. For example, the comparator for IntWritable implements the raw compare() method by reading an integer from each of the byte arrays b1 and b2 and comparing them directly, from the given start positions (s1 and s2) and lengths (l1 and l2). A sketch of this idea follows.
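
A minimal sketch of such a comparator (not Hadoop's actual IntWritable.Comparator source; readInt() is the static helper inherited from WritableComparator that reads a big-endian int at the given offset):

    class IntRawComparator extends WritableComparator {
        IntRawComparator() {
            super(IntWritable.class);
        }

        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            // decode the values directly from the byte arrays; no IntWritable objects are created
            int thisValue = readInt(b1, s1);
            int thatValue = readInt(b2, s2);
            return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
        }
    }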

WritableComparator is a general-purpose implementation of RawComparator for WritableComparable classes. It provides two main functions. First, it provides a default implementation of the raw compare() method that deserializes the objects to be compared from the stream and invokes the object compare() method. Second, it acts as a factory for RawComparator instances (that Writable implementations have registered). For example, to obtain a comparator for IntWritable, we just use:

 RawComparator<IntWritable> comparator = WritableComparator.get(IntWritable.class);

And the comparator can be used to compare two IntWritable objects:

IntWritable w1 = new IntWritable(163);
IntWritable w2 = new IntWritable(67);
RawComparator<IntWritable> comparator = WritableComparator.get(IntWritable.class);
System.out.println(comparator.compare(w1, w2));

 Or their serialized representations:

byte[] bytes1 = serialize(w1);
byte[] bytes2 = serialize(w2);
System.out.println(comparator.compare(bytes1, 0, bytes1.length, bytes2, 0, bytes2.length));

Writable wrappers for Java primitives

There are Writable wrappers for all the Java primitive types except char (which can be stored in an IntWritable). All have a get() and a set() method for retrieving and storing the wrapped value.
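
For example (a minimal sketch using a few of the wrappers from org.apache.hadoop.io):

    BooleanWritable bool = new BooleanWritable(true);
    LongWritable longWritable = new LongWritable();
    longWritable.set(163L);
    DoubleWritable doubleWritable = new DoubleWritable(1.5);
    System.out.println(bool.get() + " " + longWritable.get() + " " + doubleWritable.get());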

Text Class:

The Text class can be thought of as a replacement for the older UTF8 class, which did not support strings whose encoding was over 32,767 bytes and which used Java's modified UTF-8.

The Text class uses an int to store the number of bytes in the string encoding, so the maximum value is 2 GB.

  • Indexing:

Because of its emphasis on using standard UTF-8, there are some differences between Text and the Java String class. Indexing for the Text class is in terms of position in the encoded byte sequence, not the Unicode character in the string, nor the Java char code unit (as it is for String). For ASCII strings, these three concepts of index position coincide. Here is an example to demonstrate the use of the charAt() method:

        Text text = new Text("hadoop");
        assertThat(text.getLength(), is(6));
        assertThat(text.getBytes().length, is(6));

        assertThat(text.charAt(2), is((int) 'd'));
        assertThat("Out of bounds", text.charAt(100), is(-1));

Notice that charAt() returns an int representing a Unicode code point, unlike the String variant that returns a char. Text also has a find() method, analogous to indexOf() in the String class (a short sketch follows). When we start using characters that take more than a single byte to encode, the differences between Text and String become clear.
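
A short sketch of find() (the offsets in the comments hold for this ASCII string):

    Text text = new Text("hadoop");
    System.out.println(text.find("do"));      // 2: byte offset of the substring "do"
    System.out.println(text.find("o", 4));    // 4: search starting from byte offset 4
    System.out.println(text.find("pig"));     // -1: not found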

The length of a String is the number of char code units it contains, whereas the length of a Text object is the number of bytes in its UTF-8 encoding. Similarly, the indexOf() method in String returns an index in char code units, while find() for Text returns a byte offset. So iteration over the Unicode characters in Text is complicated by the use of byte offsets for indexing, since you can't just increment the index. The idiom for iteration is a little involved: wrap the byte array in a ByteBuffer, then repeatedly call bytesToCodePoint(), which extracts the next code point and advances the buffer position:

    public static void textIterator() throws IOException{
        Text t = new Text("\u0041\u00DF\u6771\uD801\uDC00");
        ByteBuffer buf = ByteBuffer.wrap(t.getBytes(), 0, t.getLength());
        int cp;
        while (buf.hasRemaining() && (cp = Text.bytesToCodePoint(buf)) != -1){
            System.out.println(Integer.toHexString(cp));
        }
    }

The output:

41
df
6771
10400

  • Mutability

Another difference from String is that Text is mutable; its value can be reused by calling one of the set() methods. Because of this, you should always call getLength() when calling getBytes(), so you know how much of the byte array is valid data.
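
A small sketch of why this matters (the exact size of the backing array after set() can depend on the Hadoop version, so only getLength() should be relied on):

    Text t = new Text("hadoop");
    t.set(new Text("pig"));
    System.out.println(t.getLength());         // 3: the length of the valid data
    System.out.println(t.getBytes().length);   // may still be 6: the backing array is not shortened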

BytesWritable

BytesWritable is a wrapper for an array of binary data. Its serialized format is an integer field (4 bytes) that specifies the number of bytes to follow, followed by the bytes themselves. In the example below, the byte array of length 3 with values 3, 4, and 5 is serialized as a 4-byte integer (00000003) followed by the 3 bytes from the array (03, 04, and 05):

    public static byte[] serialize(Writable writable) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        writable.write(dataOut);
        dataOut.close();
        return out.toByteArray();
    }

    static void serializeBytes() throws IOException{
        BytesWritable b = new BytesWritable(new byte[]{3, 4, 5});
        byte[] bytes = serialize(b);
        System.out.println("byte arrary length: " + bytes.length);
        assertThat(StringUtils.byteToHexString(bytes), is("00000003030405"));
        b.setCapacity(11);
        assertThat(b.getLength(), is(3));
        assertThat(b.getBytes().length, is(11));
    }

BytesWritable is mutable, and its value may be changed by calling its set() method. As with Text, the size of the byte array returned from the getBytes() method for BytesWritable (the capacity) may not reflect the actual size of the data stored. You can determine the actual size of the BytesWritable by calling getLength().

NullWritable

NullWritable is a special type of Writable, as it has a zero-length serialization. No bytes are written to, or read from, the stream. It is used as a placeholder; for example, a key or a value can be declared as NullWritable when you don't need to use that position. NullWritable is an immutable singleton: the instance can be retrieved by calling NullWritable.get().
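
A quick sketch, reusing the serialize() helper from earlier, to confirm the zero-length serialization:

    NullWritable nw = NullWritable.get();
    byte[] empty = serialize(nw);
    System.out.println(empty.length);   // 0: nothing is written to the stream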

ObjectWritable and GenericWritable

ObjectWritable is a general-purpose wrapper for the following: Java primitives, String, enum, Writable, null, or arrays of any of these types. It is used in Hadoop RPC to marshal and unmarshal method arguments and return types. ObjectWritable is useful when a field can be of more than one type, but it is wasteful of space because it writes the class name of the wrapped type every time it is serialized.

In cases where the number of types is small and known ahead of time, this can be improved by having a static array of types and using the index into the array as the serialized reference to the type. This is the approach that GenericWritable takes: you subclass it to specify the supported types, as in the sketch below.
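
A minimal sketch of a GenericWritable subclass (the class name MyGenericWritable and the chosen types are just illustrative):

    public class MyGenericWritable extends GenericWritable {

        // the fixed set of supported types; only the array index is written to the stream
        @SuppressWarnings("unchecked")
        private static final Class<? extends Writable>[] TYPES = new Class[] {
                Text.class,
                IntWritable.class
        };

        @Override
        protected Class<? extends Writable>[] getTypes() {
            return TYPES;
        }
    }

An instance can then wrap any of the listed types, for example new MyGenericWritable() followed by set(new Text("hadoop")).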

Writable collections

There are six Writable collection types in the org.apache.hadoop.io package: ArrayWritable, ArrayPrimitiveWritable, TwoDArrayWritable, MapWritable, SortedMapWritable, and EnumSetWritable.

ArrayWritable and TwoDArrayWritable are Writable implementations for arrays and two-dimensional arrays (arrays of arrays) of Writable instances. All the elements of an ArrayWritable or a TwoDArrayWritable must be instances of the same class, which is specified at construction. Here is an example of subclassing ArrayWritable:

class TextArrayWritable extends ArrayWritable {
    public TextArrayWritable() {
        super(Text.class);
    }
}

MapWritable and SortedMapWritable are implementations of java.util.Map<Writable, Writable> and java.util.SortedMap<WritableComparable, Writable>, respectively. Here is an example of using a MapWritable with different types for keys and values:
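
A minimal sketch (the key and value classes are the standard wrappers from org.apache.hadoop.io):

    MapWritable src = new MapWritable();
    src.put(new IntWritable(1), new Text("cat"));
    src.put(new VIntWritable(2), new LongWritable(163));
    System.out.println(src.get(new IntWritable(1)));    // cat
    System.out.println(src.get(new VIntWritable(2)));   // 163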

Conspicuous by their absence are Writable collection implementations for sets and lists. A general set can be emulated by using a MapWritable (or a SortedMapWritable for a sorted set) with NullWritable values, as sketched below.
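
A sketch of the set idiom (membership is just a containsKey() call):

    MapWritable set = new MapWritable();
    set.put(new Text("hadoop"), NullWritable.get());
    System.out.println(set.containsKey(new Text("hadoop")));   // true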

Implement a Custom Writable

We show how to implement a custom Writable through an example: TextPair, a pair of Text objects.

package serialization;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/**
 * Created by carl on 1/21/16.
 */
public class TextPair implements WritableComparable<TextPair> {
    private Text first;
    private Text second;

    public TextPair() {
        set(new Text(), new Text());
    }

    public TextPair(Text first, Text second) {
        set(first, second);
    }

    public TextPair(String first, String second){
        set(new Text(first), new Text(second));
    }

    public Text getFirst() {
        return first;
    }

    public Text getSecond() {
        return second;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        first.write(out);
        second.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        first.readFields(in);
        second.readFields(in);
    }

    @Override
    public int hashCode() {
        return first.hashCode() * 163 + second.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;

        TextPair textPair = (TextPair) o;

        if (!first.equals(textPair.first)) return false;
        return second.equals(textPair.second);

    }

    @Override
    public String toString() {
        return "TextPair{" +
                "first=" + first +
                ", second=" + second +
                '}';
    }

    @Override
    public int compareTo(TextPair o) {
        int cmp = first.compareTo(o.first);
        if (cmp != 0){
            return cmp;
        }
        return second.compareTo(o.second);
    }

    public void set(Text first, Text second) {
        this.first = first;
        this.second = second;
    }

    public static void main(String[] args){

    }
}

The hashCode() method is used by the HashPartitioner (the default partitioner in MapReduce) to choose a reduce partition.

TextPair is an implementation of WritableComparable, so it provides an implementation of the compareTo() method that imposes the ordering we would expect: it sorts by the first string and then by the second.
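
A quick round trip with the serialize() and deserialize() helpers from earlier shows the pieces fitting together (a sketch; the string values are arbitrary):

    TextPair tp = new TextPair("hadoop", "pig");
    byte[] bytes = serialize(tp);
    TextPair copy = new TextPair();
    deserialize(copy, bytes);
    System.out.println(copy.equals(tp));      // true
    System.out.println(copy.compareTo(tp));   // 0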

 

Implement a RawComparator for speed

When TextPair is used as a key in MapReduce, it has to be deserialized into an object for the compareTo() method to be invoked. But we can compare two TextPair objects just by looking at their serialized representations.

The binary representation of a Text object is a variable-length integer containing the number of bytes in the UTF-8 representation of the string, followed by the UTF-8 bytes themselves. The trick is to read the initial length, so we know how long the first Text field in each record is, and then delegate to Text's RawComparator with the appropriate offsets. Here is the raw comparator in detail:

public class Comparator extends WritableComparator {
    private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();

    public Comparator() {
        super(TextPair.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        try {
            int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
            int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
            int cmp = TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
            if (cmp != 0) {
                return cmp;
            }
            return TEXT_COMPARATOR.compare(b1, s1 + firstL1, l1 - firstL1,
                    b2, s2 + firstL2, l2 - firstL2);

        } catch (IOException e) {
            throw new IllegalArgumentException(e);
        }
    }

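    // Register this as the default comparator for TextPair. Note that in the book's version
    // this static block lives in TextPair itself (with Comparator as a nested class), so
    // registration happens as soon as TextPair is loaded.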
    static {
        WritableComparator.define(TextPair.class, new Comparator());
    }
}

We subclass WritableComparator rather than implement RawComparator directly, since it provides some convenience methods and default implementations. The subtle part of this code is calculating firstL1 and firstL2, the lengths of the first Text field in each byte stream. Each is made up of the length of the variable-length integer (returned by WritableUtils.decodeVIntSize()) plus the value it encodes (returned by readVInt()).

The static block registers the raw comparator so that whenever MapReduce sees the TextPair class, it knows to use the raw comparator as its default comparator.
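
Once registered, the factory method returns the raw comparator, and it can be used on serialized TextPair bytes directly (a sketch reusing the serialize() helper from earlier):

    RawComparator<TextPair> comparator = WritableComparator.get(TextPair.class);
    byte[] b1 = serialize(new TextPair("hadoop", "avro"));
    byte[] b2 = serialize(new TextPair("hadoop", "pig"));
    System.out.println(comparator.compare(b1, 0, b1.length, b2, 0, b2.length));   // negative: "avro" < "pig"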

Serialization Frameworks

Although most MapReduce programs use Writable key and value types, this isn't mandated by the MapReduce API. Any type can be used; the only requirement is a mechanism that translates to and from a binary representation of each type. To support this, Hadoop has an API for pluggable serialization frameworks. A serialization framework is represented by an implementation of Serialization (in the org.apache.hadoop.io.serializer package).

A Serialization defines a mapping from types to Serializer instances (for turning an object into a byte stream) and Deserializer instances (for turning a byte stream back into an object). We can register Serialization implementations by setting the io.serializations property to a comma-separated list of class names.
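
A sketch of setting the property programmatically on a Configuration (the class names listed are the serializations that ship with Hadoop; treat the exact list as illustrative):

    Configuration conf = new Configuration();
    conf.setStrings("io.serializations",
            "org.apache.hadoop.io.serializer.WritableSerialization",
            "org.apache.hadoop.io.serializer.JavaSerialization");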

Hadoop includes a class called JavaSerialization that uses Java Object Serialization. Although it makes it convenient to be able to use standard Java types in MapReduce programs, like Integer or String, Java Object Serialization is not as compact as Writables, so it is usually not worth the trade-off.

(In fact, in the serialized binary stream only a small portion of the bytes represents the actual field values; the rest is metadata, which makes this form of serialization wasteful of space.)
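
A rough way to see this for yourself (a sketch; the exact byte counts depend on the JVM, but Java Object Serialization of an Integer produces far more than the 4 bytes an IntWritable needs):

    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    ObjectOutputStream oos = new ObjectOutputStream(baos);
    oos.writeObject(Integer.valueOf(163));
    oos.close();
    System.out.println("Java serialization: " + baos.size() + " bytes");
    System.out.println("IntWritable:        " + serialize(new IntWritable(163)).length + " bytes");   // 4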

Apache Thrift and Google Protocol Buffers are both popular serialization frameworks. They have limited support as MapReduce formats, but they are used internally in parts of Hadoop for RPC and data exchange.

In the next blog post we will discuss Avro, a serialization framework designed to work well with large-scale data processing in Hadoop.

posted @ 2016-01-04 22:53  快乐的小土狗