Wednesday, December 29, 2010

Using Java Collections Effectively by Implementing equals(), hashCode(), and compareTo()

Thanks to attending the local Java Users Group "Oracle® Certified Java™ Programmer Certification Boot Camp" on and off for several years, I've entered the 21st century and begun using Java Collections the way they were intended. Especially with regard to sorting and eliminating duplicate records, these classes are a god-send in terms of coding-time and run-time efficiency and arguably in terms of readability as well. But bug-free usage of these classes requires strict adherence to the contract of equals(), hashCode(), and often compareTo().

Even though popular IDEs automatically generate stubs of some of these methods for you, you should still understand how they work, particularly how the three methods work together because I don't see many IDE's writing meaningful compareTo() methods yet. Here's what I've learned in the past few years of implementing these methods. For much of what follows, I am endebted to Joshua Bloch and his book, Effective Java. Buy it, read it, live it.

In short, make the behavior of equals(), hashCode(), and compareTo() consistent. I'll cover each in order.

UPDATE 2011-08-31: As much as possible, you must base these methods on immutable fields. If you store an object in a collection (e.g. as a key in a HashMap) and it's hashcode changes, you probably won't be able to retrieve it from that hash table anymore! Thanks to "Programming in Scala" Section 30.2 p. 657. See also my later post on Object Mutability. END-UPDATE

equals()

a.equals(b) should return true when a and b represent the same object. Not that a and b are both pigs, or even that they are both pigs named Wilbur, but that they both represent the pig named Wilbur in the story, Charlotte's Web. Bloch (Item 8) says that the equals() method must be reflexive, symmetric, transitive and a few other things as well which I won't cover here. For any non-null value:

  • x.equals(x) must be true.
  • If x.equals(y) then y.equals(x) must be true.
  • If x.equals(y) and y.equals(z) then x.equals(z) must also be true.

The following should get you off to a good start in writing an equals method that is all of the above.

@Override
public boolean equals(Object other) {
    // First, the obvious...
    if (this == other) {
        return true;
    }
    if (other == null) {
         return false;
    }
    if ( !(other instanceof MyClass) ) {
        return false;
    }
    // Details...
    final MyClass that = (MyClass) other;

    // If this is a database object and both have the same surrogate key (id),
    // they are the same.
    if ( (this.getId() != 0) && (that.getId() != 0) ) {
        return (this.getId() == that.getId());
    }

    // If this is not a database object, compare "significant" field here.
    try {
        if ( this.item.equals(that.item) &&
             this.startDate.equals(that.startDate) &&
             this.endDate.equals(that.endDate) ) {
            return true;
        }
    } catch (Exception e) {
        // You may not want to catch exceptions here if that exception
        // indicates an invalid object.  See below.
    }

    // Not proven equal
    return false;
}

Be careful not to compare objects that may be invalid. Consider a rental system where no two people can have overlapping date values for the same rental item, but a user enters the wrong date values and the system builds an object with those values to compare it against the existing objects from the database (to check the validity). Is the new object the same as the database object whose dates it conflicts with? You can't make that comparison when one of the objects is invalid. The coding error here is not that the equals() method is flawed, but rather that you are trying to compare an object before it is valid.

Your equals() method should either compare "significant" fields (the ones that uniquely identify the object) OR surrogate keys - not both! The danger of providing a field-by-field equals comparison for a database object is that it will work in some cases with invalid objects, but it will not always work, thus I'd rather have it fail fast, than leave me scratching my head when an intermittent bug crops up in production. For database objects I always use surrogate keys because doing so acknowledges that everything about an object can change over time, yet it is still essentially the same object (The Artist Formerly Known as Prince). For non-database objects, you must compare the fields that uniquely identify a valid object.

hashCode()

Bloch's Item 9 states, "Always override hashCode() when you override equals()". Giving your hashCode() function an even distribution of values makes the Hashtable-based collections work better. The following are specifically required (see: Object.hashCode()):

  • x.hashCode() must always equal y.hashCode() when x.equals(y). If you break this nothing will work.
  • It's OK for x.hashCode() to equal y.hashCode() when x.equals(y) is false, but it's good to minimize this.

Once again, surrogate keys can come to your rescue here. Truncating the row number from the database from a long to an int is a great way to ensure equal distribution of hash buckets in a hash table. Until you have 2 billion records (2^31 for a signed 32-bit int), they each get their own hash code. Up to about 4 billion records, you only have 2 records in each hash bucket, and on and on as you add records. If you don't use surrogate keys, you need to construct an int from the same "significant" fields you used in the above equals method to keep them consistent. Good luck with that. Here's the simple solution:

@Override
public int hashCode() {
    if (this.getId() == 0) {
        return System.identityHashCode(this);
    } else {
        // return (possibly truncated) surrogate key
        return (int) this.getId();
    }
}
If you don't have a surrogate key, sometimes you can fit the other fields into a single int. For instance, in a date class, you might use the following which gives each day a unique, even human-readable integer representation:
return (this.year * 10000) + (this.month * 100) + this.day;
For a class with a lot of booleans you can/should represent them all losslessly in the hashcode:
int ret = (isHappy ? 1 : 0); // 2^0 = 1 = b0001
if (isHungry) {
    ret |= 2; // 2^1 = 2 = b0010
}
if (isSleepy) {
    ret |= 4; // 2^2 = 4 = b0100
}
if (isAntsy) {
    ret |= 8; // 2^3 = 8 = b1000
}
return ret;

compareTo()

Bloch's Item 12 states that if you are writing a class with an obvious natural ordering (alphabetical, numerical, chronological, etc.) consider implementing Comparable. I have personally gotten into the most trouble when my compareTo() method was not 100% compatible with my equals() method. So I added one rule (and it's converse) to Bloch's others:

  • If this.equals(that) == true, then this.compareTo(that) must return 0.
  • If this.equals(that) == false, then this.compareTo(that) must NOT return 0.

The example below should be fairly self-explanatory and sorts alphabetically by the description field.

/**
 Returns a negative integer, zero, or a positive integer as this object is
 less than, equal to, or greater than the specified object (respectively).

 @param that the object to compare to ("that")

 @return -1 means they are in order (this comes before that), 0 means they are equal
 (and Sets will eliminate one of them), +1 means to swap them (that comes
 before this)

 */
@Override
public int compareTo(MyClass that) {

    // ensures consistency with equals()
    if (this.equals(that)) {
        return 0;
    }

    // compare parents first in a hierarchical structure.
    // If this class can be it's own parent (e.g. a self-joining
    // table) this is much more complicated.  I'll have to
    // make another article on that.  You may also have to
    // check for a null parent, but here's the simplest case:
    int ret = this.getParent().compareTo(that.getParent());
    if (ret != 0) {
        return ret;
    }

    if (this.getDescription() == null) {
        if (that.getDescription() != null) {
            // sorts nulls first
            return -1;
        }
    } else if (that.getDescription() == null) {
        // sorts nulls first
        return 1;
    } else {
        // For each test, check and only return a non-zero result
        ret = this.getDescription().compareToIgnoreCase(that.getDescription());
        if (ret != 0) {
            return ret;
        }
        ret = this.getDescription().compareTo(that.getDescription());
    }

    if (ret != 0) {
        return ret;
    }

    // If you can't tell, return -1 which means that they are in the right
    // order but unequal - ensures consistency with equals()
    return -1;
}

Note: I have not verified this, but it stands to reason that if you change hashCode() and your object might be used in a Set or Hashtable, you probably need to change SerialVersionUID. Since a set only calls equals() for objects in the same bucket (ones that have similar hashcodes), you may end up with two of the same objects in the set (one with the old hashCode and one with the new). I'm not sure if this can happen in practice or not. Maybe someone will post test code in the comments that proves it one way or the other?

1 comments:

John Yeary said...

Glen this is very nicely done. Thanks for sharing. I tweeted the article since I thought it was very good.