Wednesday, December 29, 2010

Using Java Collections Effectively by Implementing equals() and hashCode()

IMPORTANT: The techniques in this post, while interesting, are outdated and sub-optimal. In short, follow standard equals() and hashCode() practice, but TEST your classes using something like TestUtils. I find a bug almost every time I use that.

This post is the first in a series on Comparing Objects.

These three methods must be implemented correctly in order for the Java collections to work properly.  Even though popular IDEs automatically generate stubs of some of these methods, you should still understand how they work, particularly how the three methods work together because I don't see many IDE's writing meaningful compareTo() methods yet. For much of what follows, I am endebted to Joshua Bloch and his book, Effective Java. Buy it, read it, live it.
  1. The behavior of equals(), hashCode(), and compareTo() must be consistent.
  2. You must base these methods on fields whose values do not change while they are in a collection.
If you store an object in a collection (e.g. as a key in a HashMap) and it's hashcode changes, you generally won't be able to retrieve it from that hashtable anymore! Thanks to "Programming in Scala" Section 30.2 p. 657. See also my later post on Object Mutability. You can use collections effectively with mutable objects so long as those objects use surrogate keys. In these examples I store my surrogate key in a private long id field with public getId() and setId() methods as many popular frameworks expect.

hashCode()

hashCode() is meant to provide a very cheap "can-equal" test.  It allows the put() and contains() methods on hashtables to run blazingly fast.  In small hashtables, the low bits from hashCode() determine which hash bucket an object belongs in.  In larger hashtables, all the bits are used.  The (presumably more expensive) equals() test is then applied against all the other objects already in that bucket.  If you had all your objects return some number, say, 31 for their hashCode(), this would completely destroy the performance of any hashtable based collections, since all objects would go in the same hash bucket and each object would have to be compared to all other objects using equals().

Bloch's Item 9 states, "Always override hashCode() when you override equals()". The following are specifically required (see: Object.hashCode()):
  1. x.hashCode() must always equal y.hashCode() when x.equals(y).
  2. It's OK for x.hashCode() to equal y.hashCode() when x.equals(y) is false, but it's good to minimize this.
Truncating the database row number from a long to an int is an ideal way to ensure efficient, equal distribution of values. If you don't use surrogate keys, you need to construct an int from the "significant" fields (the ones that uniquely identify this object):

@Override
public int hashCode() {
    if (id == 0) {
        return intField1 + intField2 + objField3.hashCode();
    }
    // return (possibly truncated) surrogate key
    return (int) id;
}

If your object does not have a surrogate key, then the field-by-field comparison in this solution is correct, though not quite as fast. If you like playing with bits, you can sometimes or and shift various fields into your hashcode in a way that is very efficient and not too hard to read.

equals()

a.equals(b) should return true only when a and b represent the same object. Bloch (Item 8) says that the equals() method must be reflexive, symmetric, transitive and a few other things as well which I won't cover here. For any non-null value:
  • x.equals(x) must be true.
  • If x.equals(y) then y.equals(x) must be true.
  • If x.equals(y) and y.equals(z) then x.equals(z) must also be true.
The following should get you off to a good start in writing an equals method that is all of the above.  Checking the hashCode() should be cheap and guarantees that two objects can't equal each other if their hashCodes are different.

@Override
public boolean equals(Object other) {
    // Cheapest operation first...
    if (this == other) { return true; }

    if ( (other == null) ||
         !(other instanceof MyClass) ||
         (this.hashCode() != other.hashCode()) ) {
        return false;
    }
    // Details...
    final MyClass that = (MyClass) other;

    // If this is a database object and both have the same surrogate key (id),
    // they are the same.
    if ( (id != 0) && (that.getId() != 0) ) {
        return (id == that.getId());
    }

    // If this is not a database object, compare significant fields here.
    // Return true only if they are all sufficiently the same.
    if (!this.getParent().equals(that.getParent())) {
        return false;
    }

    if (description == null) {
        if (that.getDescription() != null) {
            return false;
        }
    } else if (that.getDescription() == null) {
        return false;
    } else {
        // For each test, check and only return a non-zero result
        int ret = description.compareTo(that.getDescription());
        if (ret != 0) { return false; }
    }

    // Compare other fields
    // If all the same, return true
    return true;
}

Both objects must be valid before you compare them.  Your equals() method should either compare significant fields OR surrogate keys - not both! The danger of providing a field-by-field equals comparison for a database object is that it will work in some cases with invalid objects, but it not always.  This is a case where it's much better to fail fast, than to be scratching your head when an intermittent bug crops up in production. For database objects, using surrogate keys acknowledges that everything about an object can change over time, yet it is still essentially the same object (The Artist Formerly Known as Prince). For non-database objects, (including those that just haven't been given a surrogate key yet) you must compare individual fields.

With care, you can ensure consistency of equals() and compareTo() by defining one in terms of the other, but be careful not to create an infinite loop by defining them both in terms of each other!

Persistence/Hibernate

Persistence or communication frameworks create temporary surrogate objects in order to avoid fetching any extra objects from the database before they are needed.  Hibernate replaces a surrogate object with the actual object the first time a field other than id is accessed, or any methods other than persistent field accessors are accessed.  All of the above examples are designed to work with a persistence framework like Hibernate.

So your object can trust itself to be initialized inside equals(), hashcode(), and compareTo(). It should NOT trust that the other object being compared to is initialized! You can access the this.whatever fields directly, but always use that.getWhatever().

Scala's Case Classes

Declaring your class as a "case" class in Scala takes care of all the above items for you. It prevents inheritance, but for simple classes, it saves a ton of thought and typing! For non-case classes, you must do more work in Scala than in Java in order to support meaningful equals comparisons with inheritance. You have to implement a canEqual() method as well to support the idea that a parent class might think it was "close enough" to a child class, but the child might think they were different (because it defines extra fields relevant to the equals() method), so the child must implement canEqual() and the parent must check it in order to block the parent from thinking they are equal. I've never been bitten by this in Java, but I don't immediately see what prevents it.

Clojure

All Clojure's common built-in datatypes are immutable and implement the above methods for you, making them extremely easy to work with.

SerialVersionUID

I have not verified this, but it stands to reason that if you change hashCode() you probably need to update the SerialVersionUID just as you would if you changed any persistent field. Otherwise, you may end up with two of the same objects in a set (one with the old hashCode and one with the new). I'm not sure if this can happen in practice or not. Maybe someone will post test code in the comments that proves it one way or the other?

1 comment:

John Yeary said...

Glen this is very nicely done. Thanks for sharing. I tweeted the article since I thought it was very good.