Custom Sorting with WritableComparable

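Sorting is one of the most important operations in the MapReduce framework. Both MapTask and ReduceTask sort data by key, and Hadoop does this by default: the data in every application is sorted whether or not the logic requires it. A single MapReduce program involves several kinds of sorting, and the same kind of sort may run more than once. We can also define our own sort order so that the MapReduce output comes out the way we want. To do that, the key class must implement the WritableComparable interface, which looks like this: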

public interface WritableComparable<T> extends Writable, Comparable<T> {
}

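It extends both the Writable and Comparable interfaces. WritableComparable declares no methods of its own, so a class that implements it only has to implement the methods inherited from Writable and Comparable, which are the following: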

For the Writable interface, implement write and readFields: write serializes the object's fields to out, and readFields deserializes the object's fields from in.

public interface Writable {
  /** 
   * Serialize the fields of this object to <code>out</code>.
   * 
   * @param out <code>DataOutput</code> to serialize this object into.
   * @throws IOException
   */
  void write(DataOutput out) throws IOException;

  /** 
   * Deserialize the fields of this object from <code>in</code>.  
   * 
   * <p>For efficiency, implementations should attempt to re-use storage in the 
   * existing object where possible.</p>
   * 
   * @param in <code>DataInput</code> to deserialize this object from.
   * @throws IOException
   */
  void readFields(DataInput in) throws IOException;
}
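The write/readFields pair can be tried out without a cluster. The following is a minimal sketch that uses only java.io; the class names WritableRoundTrip and Point are hypothetical, and Point only mirrors the two Writable methods rather than implementing the Hadoop interface itself.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class; it mirrors the Writable contract without depending on Hadoop.
class WritableRoundTrip {

    // A plain record with the same two methods that Writable requires.
    static class Point {
        int x;
        long y;

        void write(DataOutput out) throws IOException {
            out.writeInt(x);   // 4 bytes
            out.writeLong(y);  // 8 bytes
        }

        void readFields(DataInput in) throws IOException {
            x = in.readInt();
            y = in.readLong();
        }
    }

    // Serialize a Point into a byte array.
    static byte[] toBytes(Point p) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            p.write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Create an empty Point and fill it from the byte array.
    static Point fromBytes(byte[] bytes) {
        try {
            Point p = new Point();
            p.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Point original = new Point();
        original.x = 7;
        original.y = 1234567890123L;
        byte[] bytes = toBytes(original);
        Point copy = fromBytes(bytes);
        // prints "12 bytes, x=7, y=1234567890123"
        System.out.println(bytes.length + " bytes, x=" + copy.x + ", y=" + copy.y);
    }
}
```

The empty-object-then-readFields pattern in fromBytes is also why a key class should keep a no-argument constructor.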

The Comparable interface declares one method, compareTo, which compares the object it is called on with the object passed as the argument:

public int compareTo(T o);

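The return value falls into three cases: a negative integer means the calling object is less than the specified object; 0 means the two are equal; a positive integer means the calling object is greater than the specified object.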
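To see how the sign of the return value drives the final order, here is a small sketch in plain Java. The names CompareToDemo, Score and sortedNames are made up for illustration; the key orders by score descending and breaks ties by name ascending.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class, not part of Hadoop.
class CompareToDemo {

    // Hypothetical key: higher scores first, ties broken by name in ascending order.
    static class Score implements Comparable<Score> {
        final String name;
        final int score;

        Score(String name, int score) {
            this.name = name;
            this.score = score;
        }

        @Override
        public int compareTo(Score other) {
            // Swapping the arguments reverses the natural order, giving a descending sort.
            int byScore = Integer.compare(other.score, this.score);
            if (byScore != 0) {
                return byScore;
            }
            return this.name.compareTo(other.name);
        }
    }

    // Sort a copy of the list with compareTo and return the names in sorted order.
    static List<String> sortedNames(List<Score> scores) {
        List<Score> copy = new ArrayList<>(scores);
        Collections.sort(copy);
        List<String> names = new ArrayList<>();
        for (Score s : copy) {
            names.add(s.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Score> scores = Arrays.asList(
                new Score("b", 90), new Score("a", 90), new Score("c", 95));
        System.out.println(sortedNames(scores)); // prints "[c, a, b]"
    }
}
```

Using Integer.compare instead of subtracting the two values avoids overflow when the operands are far apart.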

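Sorting is done on keys, so to customize how the map output is sorted, the key type of the map output must be a class we define that implements WritableComparable. How should such a class be written? Take a look at this example: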

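import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class MyWritableComparable implements WritableComparable<MyWritableComparable> {
   // Some data
   private int counter;
   private long timestamp;

   // Serialize the fields; the order here must match readFields.
   @Override
   public void write(DataOutput out) throws IOException {
     out.writeInt(counter);
     out.writeLong(timestamp);
   }

   // Deserialize the fields in the same order they were written.
   @Override
   public void readFields(DataInput in) throws IOException {
     counter = in.readInt();
     timestamp = in.readLong();
   }

   // Order keys by counter first, then by timestamp.
   @Override
   public int compareTo(MyWritableComparable o) {
     int byCounter = Integer.compare(counter, o.counter);
     return byCounter != 0 ? byCounter : Long.compare(timestamp, o.timestamp);
   }

   // Keep equals consistent with compareTo and hashCode.
   @Override
   public boolean equals(Object obj) {
     if (!(obj instanceof MyWritableComparable)) {
       return false;
     }
     MyWritableComparable o = (MyWritableComparable) obj;
     return counter == o.counter && timestamp == o.timestamp;
   }

   @Override
   public int hashCode() {
     final int prime = 31;
     int result = 1;
     result = prime * result + counter;
     result = prime * result + (int) (timestamp ^ (timestamp >>> 32));
     return result;
   }
}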

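Note that the serialization and deserialization methods must write and read the fields in the same order! Once the class is complete, it can be used as the key type of the input or output data of the Map phase. The core logic of a custom sort lives in the compareTo method.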
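Why the field order matters can be shown with a short experiment. This is a sketch using only java.io; FieldOrderDemo and its methods are hypothetical names. It writes an int followed by a long, then reads the bytes back once in the same order and once in swapped order.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class, not part of Hadoop.
class FieldOrderDemo {

    // Write counter (int) first, then timestamp (long).
    static byte[] write(int counter, long timestamp) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeInt(counter);
            out.writeLong(timestamp);
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read in the same order as written; returns {counter, timestamp}.
    static long[] readSameOrder(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            int counter = in.readInt();
            long timestamp = in.readLong();
            return new long[] {counter, timestamp};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read in the wrong order (long first, then int); returns {counter, timestamp}.
    static long[] readSwappedOrder(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            long timestamp = in.readLong(); // consumes the int plus half of the long
            int counter = in.readInt();     // consumes the remaining half of the long
            return new long[] {counter, timestamp};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = write(1, 2L);
        long[] good = readSameOrder(bytes);
        long[] bad = readSwappedOrder(bytes);
        System.out.println(good[0] + ", " + good[1]); // prints "1, 2"
        System.out.println(bad[0] + ", " + bad[1]);   // prints "2, 4294967296"
    }
}
```

The swapped read does not throw here because both orders consume 12 bytes; the values are simply wrong, which makes this kind of bug easy to miss.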

Original article: https://www.cnblogs.com/yxym2016/p/12993895.html