Wednesday, December 29, 2010

Using Java Collections Effectively by Implementing equals() and hashCode()

IMPORTANT: The techniques in this post, while interesting, are outdated and sub-optimal. In short, follow standard equals() and hashCode() practice, but TEST your classes using something like TestUtils. I find a bug almost every time I use that.

This post is the first in a series on Comparing Objects.

These three methods must be implemented correctly in order for the Java collections to work properly.  Even though popular IDEs automatically generate stubs of some of these methods, you should still understand how they work, particularly how the three methods work together because I don't see many IDE's writing meaningful compareTo() methods yet. For much of what follows, I am endebted to Joshua Bloch and his book, Effective Java. Buy it, read it, live it.
  1. The behavior of equals(), hashCode(), and compareTo() must be consistent.
  2. You must base these methods on fields whose values do not change while they are in a collection.
If you store an object in a collection (e.g. as a key in a HashMap) and it's hashcode changes, you generally won't be able to retrieve it from that hashtable anymore! Thanks to "Programming in Scala" Section 30.2 p. 657. See also my later post on Object Mutability. You can use collections effectively with mutable objects so long as those objects use surrogate keys. In these examples I store my surrogate key in a private long id field with public getId() and setId() methods as many popular frameworks expect.

hashCode()

hashCode() is meant to provide a very cheap "can-equal" test.  It allows the put() and contains() methods on hashtables to run blazingly fast.  In small hashtables, the low bits from hashCode() determine which hash bucket an object belongs in.  In larger hashtables, all the bits are used.  The (presumably more expensive) equals() test is then applied against all the other objects already in that bucket.  If you had all your objects return some number, say, 31 for their hashCode(), this would completely destroy the performance of any hashtable based collections, since all objects would go in the same hash bucket and each object would have to be compared to all other objects using equals().

Bloch's Item 9 states, "Always override hashCode() when you override equals()". The following are specifically required (see: Object.hashCode()):
  1. x.hashCode() must always equal y.hashCode() when x.equals(y).
  2. It's OK for x.hashCode() to equal y.hashCode() when x.equals(y) is false, but it's good to minimize this.
Truncating the database row number from a long to an int is an ideal way to ensure efficient, equal distribution of values. If you don't use surrogate keys, you need to construct an int from the "significant" fields (the ones that uniquely identify this object):

@Override
public int hashCode() {
    if (id == 0) {
        return intField1 + intField2 + objField3.hashCode();
    }
    // return (possibly truncated) surrogate key
    return (int) id;
}

If your object does not have a surrogate key, then the field-by-field comparison in this solution is correct, though not quite as fast. If you like playing with bits, you can sometimes or and shift various fields into your hashcode in a way that is very efficient and not too hard to read.

equals()

a.equals(b) should return true only when a and b represent the same object. Bloch (Item 8) says that the equals() method must be reflexive, symmetric, transitive and a few other things as well which I won't cover here. For any non-null value:
  • x.equals(x) must be true.
  • If x.equals(y) then y.equals(x) must be true.
  • If x.equals(y) and y.equals(z) then x.equals(z) must also be true.
The following should get you off to a good start in writing an equals method that is all of the above.  Checking the hashCode() should be cheap and guarantees that two objects can't equal each other if their hashCodes are different.

@Override
public boolean equals(Object other) {
    // Cheapest operation first...
    if (this == other) { return true; }

    if ( (other == null) ||
         !(other instanceof MyClass) ||
         (this.hashCode() != other.hashCode()) ) {
        return false;
    }
    // Details...
    final MyClass that = (MyClass) other;

    // If this is a database object and both have the same surrogate key (id),
    // they are the same.
    if ( (id != 0) && (that.getId() != 0) ) {
        return (id == that.getId());
    }

    // If this is not a database object, compare significant fields here.
    // Return true only if they are all sufficiently the same.
    if (!this.getParent().equals(that.getParent())) {
        return false;
    }

    if (description == null) {
        if (that.getDescription() != null) {
            return false;
        }
    } else if (that.getDescription() == null) {
        return false;
    } else {
        // For each test, check and only return a non-zero result
        int ret = description.compareTo(that.getDescription());
        if (ret != 0) { return false; }
    }

    // Compare other fields
    // If all the same, return true
    return true;
}

Both objects must be valid before you compare them.  Your equals() method should either compare significant fields OR surrogate keys - not both! The danger of providing a field-by-field equals comparison for a database object is that it will work in some cases with invalid objects, but it not always.  This is a case where it's much better to fail fast, than to be scratching your head when an intermittent bug crops up in production. For database objects, using surrogate keys acknowledges that everything about an object can change over time, yet it is still essentially the same object (The Artist Formerly Known as Prince). For non-database objects, (including those that just haven't been given a surrogate key yet) you must compare individual fields.

With care, you can ensure consistency of equals() and compareTo() by defining one in terms of the other, but be careful not to create an infinite loop by defining them both in terms of each other!

Persistence/Hibernate

Persistence or communication frameworks create temporary surrogate objects in order to avoid fetching any extra objects from the database before they are needed.  Hibernate replaces a surrogate object with the actual object the first time a field other than id is accessed, or any methods other than persistent field accessors are accessed.  All of the above examples are designed to work with a persistence framework like Hibernate.

So your object can trust itself to be initialized inside equals(), hashcode(), and compareTo(). It should NOT trust that the other object being compared to is initialized! You can access the this.whatever fields directly, but always use that.getWhatever().

Scala's Case Classes

Declaring your class as a "case" class in Scala takes care of all the above items for you. It prevents inheritance, but for simple classes, it saves a ton of thought and typing! For non-case classes, you must do more work in Scala than in Java in order to support meaningful equals comparisons with inheritance. You have to implement a canEqual() method as well to support the idea that a parent class might think it was "close enough" to a child class, but the child might think they were different (because it defines extra fields relevant to the equals() method), so the child must implement canEqual() and the parent must check it in order to block the parent from thinking they are equal. I've never been bitten by this in Java, but I don't immediately see what prevents it.

Clojure

All Clojure's common built-in datatypes are immutable and implement the above methods for you, making them extremely easy to work with.

SerialVersionUID

I have not verified this, but it stands to reason that if you change hashCode() you probably need to update the SerialVersionUID just as you would if you changed any persistent field. Otherwise, you may end up with two of the same objects in a set (one with the old hashCode and one with the new). I'm not sure if this can happen in practice or not. Maybe someone will post test code in the comments that proves it one way or the other?

Sunday, December 12, 2010

Software Development has only One Metric that Matters

Having read and thoroughly enjoyed More Joel on Software I bought myself Joel on Software this week and find it to be similarly wonderful. Both books are basically just hard-copies of his blog and make for entertaining reading even though they are packed with knowledge from decades of successful software development.

Joel's Measurement article from 2002 is not his best article, but reading Joel helped me crystallize some vague notions that have been bumping around my head for years. The aspects of software that are easiest to measure are generally the least valuable measurements. For instance: lines of code. The more lines of code, generally the worse your software is; it's bloated and complicated. In general, the fewer lines of code for the same functionality, the better, though taken to an extreme, you can make something completely illegible and impossible to change without throwing it out and starting over. How many lines of code are appropriate for the problem you are trying to solve?

Similarly, increasing complexity usually makes a product buggy, unusable, or both. But decreasing complexity, taken to an extreme, can make a product useless (it doesn't do what it needs to). But how do you measure the level of complexity that is "just right" for the problem you are trying to solve? Bugs is an interesting metric because even though fewer bugs is better, the Heisenberg principle comes into play such that there's no way of measuring bugs without skewing your results. Scott Adams sums it up beautifully:
http://www.joeindie.com/images/dilbert-minivan.gif

But there is one metric that combines and trumps all others in just about every meaningful way: Customer satisfaction. Does the software solve the real-world problem it was intended to for the people who need it most? That's the only thing that matters. Not cyclomatic complexity, efferent coupling, or any other measurement that a computer can make on the code directly. It has meet someone's needs.

I recently saw Objectified which was an interesting film. But I didn't know whether to laugh out loud or stare in horror at one artist who made a robot that required human attention to do what it needed. The clip shows some woman dressed like a flight attendant leaning down so that this thing could whisper in her ear so that she would know to move it to the other side of the room. This is exactly how much of our software fails. The technology we create is supposed to make our lives easier, better, more enriching. Not make us it's slave.

How often is the new version of a product a step backward from the old? I remember one person I worked with actually advertising that with the new version of his architectural component, what used to take you one click now takes you several and makes you wait longer. How is that supposed to be a good thing? It's slower and more difficult... why?

Rackspace is on to something, realizing that what you get from a hosting company is not servERS, but servICE. Anyone can set up a few servers. But the first time your server goes down and your hosting company doesn't or can't respond, you realize that the service is what counts. Maybe I'm pushing this a little too far here, but I think software development has more in common with a hosting company than a discount store. That meeting the customer's needs, providing excellent SERVICE is more important than the implementation details of the product. The software is more like an extension of that service (it serves the customer instead of a human serving the customer) than like a shrink-wrapped product.

Providing an effective autonomous electronic servant means understanding the customer's need and designing something that meets that need, then communicating that understanding to the people who actually have to build the software. Get them excited about, or at least involved in solving the customer's actual problem instead of just thinking about some architectural detail or slavishly following a spec.

Obviously, there are pitfalls. In The Iceberg Secret Revealed, Joel says that "Customers don't know what they want." And it's true. In Make Users Happy by Ignoring Requirements I discuss what I should have called the "Excel Syndrome" where users describe the problem as if Excel were the solution. It's not. If it were, they would have made a spreadsheet instead of hiring you.

One last thing... When I say, "customer" I don't mean just the people your company serves. I mean the target audience for your software, which may be inside your company instead of outside it. When I worked for Fidelity, I worked for a little group called FMTC (now Pyramis) that handled retirement plans for large organizations. I think the minimum amount to open an account was over a million dollars. After years of working on the "customer-facing web-site" I learned that the primary users of the site were a handful of customer service people within Fidelity. Customers would call them up, ask a question, and the internal rep would use the web site to find the answer. Had we known this up front, we might have designed it very differently. That was years ago and most people are comfortable logging in and accessing their own account nowadays, but if you are in charge of billions of dollars, you may still have your secretary call the investment company and hand you the phone to get the answer to your question verbally. No password, no logging in, just "Yes Mr. Big-Wig. It's at 42 billion and change Mr. Big-Wig. I'd be happy to explain that for you..."

In short:
1.) Find out the real need.
2.) Meet it.
3.) Measure your success by asking your customers.
4.) Do it better next time (PDCA).