Wednesday, December 29, 2010

Using Java Collections Effectively by Implementing equals() and hashCode()

IMPORTANT: The techniques in this post, while interesting, are outdated and sub-optimal. In short, follow standard equals() and hashCode() practice, but TEST your classes using something like TestUtils. I find a bug almost every time I use that.

This post is the first in a series on Comparing Objects.

equals(), hashCode(), and compareTo() must be implemented correctly in order for the Java collections to work properly.  Even though popular IDEs automatically generate stubs of some of these methods, you should still understand how they work, particularly how the three methods work together, because I don't see many IDEs writing meaningful compareTo() methods yet. For much of what follows, I am indebted to Joshua Bloch and his book, Effective Java. Buy it, read it, live it.
  1. The behavior of equals(), hashCode(), and compareTo() must be consistent.
  2. You must base these methods on fields whose values do not change while they are in a collection.
If you store an object in a collection (e.g. as a key in a HashMap) and its hashCode() changes, you generally won't be able to retrieve it from that collection anymore! Thanks to "Programming in Scala" Section 30.2, p. 657. See also my later post on Object Mutability. You can use collections effectively with mutable objects so long as those objects use surrogate keys. In these examples I store my surrogate key in a private long id field with public getId() and setId() methods, as many popular frameworks expect.
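
For reference, here is a minimal sketch of the kind of class the later examples assume (the name MyClass and its fields are made up for illustration). An id of 0 means the object has not been saved yet and therefore has no surrogate key.

public class MyClass {
    // Surrogate key assigned by the database; 0 means "not saved yet".
    private long id = 0;

    // "Significant" fields used when there is no surrogate key yet.
    private int intField1;
    private int intField2;
    private Object objField3;
    private MyClass parent;
    private String description;

    public long getId() { return id; }
    public void setId(long id) { this.id = id; }

    public MyClass getParent() { return parent; }
    public String getDescription() { return description; }

    // equals(), hashCode(), and compareTo() are discussed below.
}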

hashCode()

hashCode() is meant to provide a very cheap "can-equal" test.  It allows the put() and contains() methods on hash-based collections to run blazingly fast.  In small hashtables, roughly speaking, the low bits of the hash determine which bucket an object belongs in; larger tables use more of the bits.  The (presumably more expensive) equals() test is then applied only against the other objects already in that bucket.  If all your objects returned some constant, say 31, for their hashCode(), it would completely destroy the performance of any hashtable-based collection, since every object would go in the same bucket and each object would have to be compared to all the others using equals().
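
If you want to see that effect for yourself, here is a quick, unscientific sketch (the Key class, the constant 31, and the element count are made up for illustration). With USE_CONSTANT_HASH set to true, every Key lands in the same bucket and the lookups slow to a crawl; set it to false and they should be dramatically faster.

import java.util.HashSet;
import java.util.Set;

public class ConstantHashDemo {
    // Flip this to compare a degenerate hashCode() with a distributed one.
    private static final boolean USE_CONSTANT_HASH = true;

    private static final class Key {
        private final int value;
        Key(int value) { this.value = value; }

        @Override
        public boolean equals(Object other) {
            return (other instanceof Key) && ((Key) other).value == value;
        }

        @Override
        public int hashCode() {
            // Returning a constant forces every Key into the same bucket.
            return USE_CONSTANT_HASH ? 31 : value;
        }
    }

    public static void main(String[] args) {
        Set<Key> set = new HashSet<Key>();
        for (int i = 0; i < 20000; i++) {
            set.add(new Key(i));
        }
        long start = System.nanoTime();
        for (int i = 0; i < 20000; i++) {
            set.contains(new Key(i));
        }
        System.out.println("Lookups took "
                + ((System.nanoTime() - start) / 1000000) + " ms");
    }
}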

Bloch's Item 9 states, "Always override hashCode() when you override equals()". The following are specifically required (see: Object.hashCode()):
  1. x.hashCode() must always equal y.hashCode() when x.equals(y).
  2. It's OK for x.hashCode() to equal y.hashCode() when x.equals(y) is false, but it's good to minimize this.
Truncating the database row number from a long to an int is a simple way to get an efficient, even distribution of hash values. If you don't use surrogate keys, you need to construct an int from the "significant" fields (the ones that uniquely identify this object):

@Override
public int hashCode() {
    if (id == 0) {
        // Not saved yet (no surrogate key): combine the significant fields.
        return intField1 + intField2 + objField3.hashCode();
    }
    // Return the (possibly truncated) surrogate key.
    return (int) id;
}

If your object does not have a surrogate key, then the field-by-field combination in this solution is correct, though not quite as fast. If you like playing with bits, you can sometimes shift and OR (or XOR) various fields into your hashCode() in a way that is very efficient and not too hard to read.
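
For example, here is one way such a bit-twiddling hashCode() might look for the same made-up fields as above; Bloch's standard recipe of multiplying a running result by 31 for each field works just as well and is easier to get right.

@Override
public int hashCode() {
    if (id != 0) {
        // Fold the high bits of the surrogate key into the low bits
        // before truncating, in case the low bits are poorly distributed.
        return (int) (id ^ (id >>> 32));
    }
    // No surrogate key yet: shift and XOR the significant fields together.
    int result = intField1;
    result = (result << 5) ^ intField2;
    result = (result << 5) ^ (objField3 == null ? 0 : objField3.hashCode());
    return result;
}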

equals()

a.equals(b) should return true only when a and b represent the same logical object. Bloch (Item 8) says that the equals() method must be reflexive, symmetric, and transitive, plus a few other things I won't cover here. For any non-null values x, y, and z:
  • x.equals(x) must be true.
  • If x.equals(y) then y.equals(x) must be true.
  • If x.equals(y) and y.equals(z) then x.equals(z) must also be true.
The following should get you off to a good start in writing an equals() method that satisfies all of the above.  Checking hashCode() is cheap and guarantees that two objects can't be equal if their hashCodes are different.

@Override
public boolean equals(Object other) {
    // Cheapest operation first...
    if (this == other) { return true; }

    // instanceof is false for null, so no separate null check is needed.
    if ( !(other instanceof MyClass) ||
         (this.hashCode() != other.hashCode()) ) {
        return false;
    }
    // Details...
    final MyClass that = (MyClass) other;

    // If this is a database object and both have the same surrogate key (id),
    // they are the same.
    if ( (id != 0) && (that.getId() != 0) ) {
        return (id == that.getId());
    }

    // If this is not a database object, compare significant fields here.
    // Return true only if they are all sufficiently the same.
    if (!this.getParent().equals(that.getParent())) {
        return false;
    }

    if (description == null) {
        if (that.getDescription() != null) {
            return false;
        }
    } else if (that.getDescription() == null) {
        return false;
    } else {
        // A non-zero comparison result means the descriptions differ.
        int ret = description.compareTo(that.getDescription());
        if (ret != 0) { return false; }
    }

    // Compare other fields
    // If all the same, return true
    return true;
}

Both objects must be valid before you compare them.  Your equals() method should compare either significant fields OR surrogate keys - not both! The danger of providing a field-by-field equals() comparison for a database object is that it will work with invalid objects in some cases, but not all.  This is a case where it's much better to fail fast than to be scratching your head when an intermittent bug crops up in production. For database objects, using surrogate keys acknowledges that everything about an object can change over time, yet it is still essentially the same object (The Artist Formerly Known as Prince). For non-database objects (including those that just haven't been given a surrogate key yet), you must compare individual fields.

With care, you can ensure consistency of equals() and compareTo() by defining one in terms of the other; just don't define each in terms of the other, or you'll create an infinite loop!
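
Here is a rough sketch of what that might look like for the same made-up MyClass: compareTo() does the real comparison and equals() simply delegates to it, so the two can never disagree. (The null checks and the hashCode() shortcut from the equals() example above are omitted to keep it short.)

public class MyClass implements Comparable<MyClass> {
    // Fields and accessors as above...

    @Override
    public int compareTo(MyClass that) {
        // Surrogate keys win when both objects have them.
        if ((id != 0) && (that.getId() != 0)) {
            return (id < that.getId()) ? -1 : ((id == that.getId()) ? 0 : 1);
        }
        // Otherwise order by the significant fields, most important first.
        int ret = this.getParent().compareTo(that.getParent());
        if (ret != 0) { return ret; }
        return description.compareTo(that.getDescription());
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) { return true; }
        if (!(other instanceof MyClass)) { return false; }
        // Defined in terms of compareTo(), never also the other way around.
        return this.compareTo((MyClass) other) == 0;
    }
}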

Persistence/Hibernate

Persistence and communication frameworks create temporary proxy (surrogate) objects in order to avoid fetching extra objects from the database before they are needed.  Hibernate replaces such a proxy with the actual object the first time you access a field other than id, or call any method other than a persistent field accessor.  All of the above examples are designed to work with a persistence framework like Hibernate.

So your object can trust itself to be initialized inside equals(), hashCode(), and compareTo(). It should NOT trust that the other object being compared is initialized! You can access your own fields directly (this.whatever), but always use that.getWhatever() on the other object.

Scala's Case Classes

Declaring your class as a "case" class in Scala takes care of all of the above for you. It effectively rules out inheritance, but for simple classes it saves a ton of thought and typing! For non-case classes, you must do more work in Scala than in Java to support meaningful equals() comparisons with inheritance. You also have to implement a canEqual() method: a parent class might think it is "close enough" to equal a child class, but the child might disagree (because it defines extra fields relevant to equals()), so the child overrides canEqual() and the parent checks it, which blocks the parent from treating them as equal. I've never been bitten by this in Java, but I don't immediately see what prevents it.
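
If you're curious what that problem looks like in Java, here is a contrived sketch using the classic Point/ColorPoint example (not classes from this post). Without the canEqual() check, an instanceof-based equals() lets a Point claim to equal a ColorPoint while the ColorPoint disagrees, breaking symmetry; with it, both sides agree they are different.

class Point {
    private final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }

    // Subclasses that add fields relevant to equality override this.
    public boolean canEqual(Object other) { return other instanceof Point; }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof Point)) { return false; }
        Point that = (Point) other;
        // Without that.canEqual(this), new Point(1, 2) would "equal"
        // new ColorPoint(1, 2, 0xFF0000) even though the reverse is false.
        return that.canEqual(this) && x == that.x && y == that.y;
    }

    @Override
    public int hashCode() { return 31 * x + y; }
}

class ColorPoint extends Point {
    private final int color;
    ColorPoint(int x, int y, int color) { super(x, y); this.color = color; }

    @Override
    public boolean canEqual(Object other) { return other instanceof ColorPoint; }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof ColorPoint)) { return false; }
        ColorPoint that = (ColorPoint) other;
        return that.canEqual(this) && super.equals(that) && color == that.color;
    }

    @Override
    public int hashCode() { return 31 * super.hashCode() + color; }
}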

Clojure

All Clojure's common built-in datatypes are immutable and implement the above methods for you, making them extremely easy to work with.

serialVersionUID

I have not verified this, but it stands to reason that if you change hashCode() you probably need to update the serialVersionUID, just as you would if you changed any persistent field. Otherwise, you might end up with two copies of the same object in a deserialized set (one with the old hashCode and one with the new). I'm not sure whether this can happen in practice. Maybe someone will post test code in the comments that proves it one way or the other?

Sunday, December 12, 2010

Software Development has only One Metric that Matters

Having read and thoroughly enjoyed More Joel on Software, I bought myself Joel on Software this week and find it to be similarly wonderful. Both books are basically just hard copies of his blog and make for entertaining reading, even though they are packed with knowledge from decades of successful software development.

Joel's Measurement article from 2002 is not his best, but reading Joel helped me crystallize some vague notions that have been bumping around in my head for years. The aspects of software that are easiest to measure are generally the least valuable measurements. For instance: lines of code. More lines of code generally means worse software; it's bloated and complicated. In general, the fewer lines of code for the same functionality, the better, though taken to an extreme, you can make something completely illegible and impossible to change without throwing it out and starting over. How many lines of code are appropriate for the problem you are trying to solve?

Similarly, increasing complexity usually makes a product buggy, unusable, or both. But decreasing complexity, taken to an extreme, can make a product useless (it doesn't do what it needs to do). So how do you measure the level of complexity that is "just right" for the problem you are trying to solve? Bug count is an interesting metric because even though fewer bugs is better, something like the Heisenberg uncertainty principle comes into play: there is no way to measure bugs without skewing your results. Scott Adams sums it up beautifully:
http://www.joeindie.com/images/dilbert-minivan.gif

But there is one metric that combines and trumps all others in just about every meaningful way: customer satisfaction. Does the software solve the real-world problem it was intended to solve for the people who need it most? That's the only thing that matters. Not cyclomatic complexity, efferent coupling, or any other measurement a computer can make on the code directly. It has to meet someone's needs.

I recently saw Objectified, which was an interesting film. But I didn't know whether to laugh out loud or stare in horror at one artist who made a robot that required human attention to do what it needed. The clip shows a woman dressed like a flight attendant leaning down so that this thing could whisper in her ear, so that she would know to move it to the other side of the room. This is exactly how much of our software fails. The technology we create is supposed to make our lives easier, better, more enriching. Not make us its slave.

How often is the new version of a product a step backward from the old? I remember one person I worked with actually advertising that with the new version of his architectural component, what used to take you one click now takes you several and makes you wait longer. How is that supposed to be a good thing? It's slower and more difficult... why?

Rackspace is on to something, realizing that what you get from a hosting company is not servERS, but servICE. Anyone can set up a few servers. But the first time your server goes down and your hosting company doesn't or can't respond, you realize that the service is what counts. Maybe I'm pushing this a little too far, but I think software development has more in common with a hosting company than with a discount store: meeting the customer's needs (providing excellent SERVICE) matters more than the implementation details of the product. The software is more like an extension of that service (it serves the customer instead of a human serving the customer) than like a shrink-wrapped product.

Providing an effective autonomous electronic servant means understanding the customer's need, designing something that meets that need, then communicating that understanding to the people who actually have to build the software. Get them excited about, or at least involved in, solving the customer's actual problem instead of just thinking about some architectural detail or slavishly following a spec.

Obviously, there are pitfalls. In The Iceberg Secret Revealed, Joel says that "Customers don't know what they want." And it's true. In Make Users Happy by Ignoring Requirements I discuss what I should have called the "Excel Syndrome" where users describe the problem as if Excel were the solution. It's not. If it were, they would have made a spreadsheet instead of hiring you.

One last thing... When I say, "customer" I don't mean just the people your company serves. I mean the target audience for your software, which may be inside your company instead of outside it. When I worked for Fidelity, I worked for a little group called FMTC (now Pyramis) that handled retirement plans for large organizations. I think the minimum amount to open an account was over a million dollars. After years of working on the "customer-facing web-site" I learned that the primary users of the site were a handful of customer service people within Fidelity. Customers would call them up, ask a question, and the internal rep would use the web site to find the answer. Had we known this up front, we might have designed it very differently. That was years ago and most people are comfortable logging in and accessing their own account nowadays, but if you are in charge of billions of dollars, you may still have your secretary call the investment company and hand you the phone to get the answer to your question verbally. No password, no logging in, just "Yes Mr. Big-Wig. It's at 42 billion and change Mr. Big-Wig. I'd be happy to explain that for you..."

In short:
1.) Find out the real need.
2.) Meet it.
3.) Measure your success by asking your customers.
4.) Do it better next time (PDCA).

Tuesday, April 20, 2010

Disposing of Computers Responsibly

You don't fall for con games; you use a router with WPA2 wireless, store your passwords securely, and keep your operating system and applications updated. You're safe, right?

That depends on how you dispose of old hard drives and computers. If you run Windows, Microsoft recommends cleaning the hard drive with the secure delete command-line application, sdelete. Whatever operating system you run, you may prefer to Use an Ubuntu Live CD to securely wipe your PC's hard drive.

The Frontline video, Ghana: Digital Dumping Ground not only exposes the human health and environmental costs of disposing of computers, but about 8 minutes in, shows data recovery on those same computers that should make the hair on the back of your neck stand up.

One of the reasons I got into computer programming is that it eliminates paper, saving trees and reducing waste. Electronic devices require electricity, but more importantly, they break or become obsolete very quickly and we replace them. The impact of our obsession with newer, faster computers and other electronics is astounding. But even if we think we are recycling, we might be creating new waste problems in far-away lands, as the 60 Minutes Electronic Wasteland video shows (it also shows data recovery).

Anyone who reads this blog knows that when my computer became unacceptably slow running Microsoft Windows XP about 9 months ago, I switched to Ubuntu Linux and have been delighted with that decision. File operations that took hours on Windows take minutes on Linux. Everything else seems to run about twice as fast. So if you care about the impact of disposing of your computer (or if you want to save money), install Linux and keep your hardware twice as long. My laptop was bought in 1999 and I run Xubuntu on it (a lightweight version of Ubuntu). It's great for email and web browsing, and with a little patience I can even run a database and a web server on it, even though it only has 500MB of memory, a 500MHz processor, and an incredibly slow disk drive.

Thanks to my occasional involvement with UCLUG, I learned of Free Linux PC. They are a fantastic organization that takes donations of old computers, installs Linux on them, and gives them to people who don't have a computer! I've given them a laptop and some memory, and I volunteered at a giveaway at the Greenville Public Library, which was a great time. The recipients were pinching themselves.

Here's an old-fashioned recipe for improving environmental, human, and security impacts of your computer usage:

Reduce

Buy new computers (cell phones, TVs, etc) less often. Use your old ones longer.

Reuse

Installing Linux can double the useful life of a Windows computer. If that's not for you, then donate any still-working equipment to FreeLinuxPC and they will do it for you.

Recycle

Recycling sometimes works and is probably better than burning your computer on your front lawn (though it might still be burned in Ghana, Taiwan, or China), and certainly better than throwing it in the trash. Even if you don't recycle, buying recycled gold watches or other products made from recycled computers supports an industry that badly needs encouragement.

In any case, make sure to wipe the hard drive before disposing of it!

Thursday, April 15, 2010

POST vs. GET for HTML Form Security (and The Back Button)

At the last GreenJUG meeting, we talked about how you should always use POST instead of GET for any secure web site. But when you use a POST form on an HTTPS site, navigate away, and click the Back button (in Internet Explorer), you get "The web page you requested is no longer available - try refreshing..." It essentially breaks the back button, which UI designers will tell you never to do.

So for me, the rule is: for any view forms (where the form sends selection criteria for what the user wants to view), make them GET. Such forms tend to submit only the ID number of whatever they are pulling from the database, usually do not involve entering sensitive information, and should still work when the user navigates away and uses the back button. For update forms (where the user submits new data or changes old data), make them POST, because it's good to prevent accidental resubmission and because people tend to enter private and proprietary information on such forms.

Tuesday, April 13, 2010

Passwords Don't Matter



Why Passwords Don't Matter

I set out to write an article about tools to easily manage passwords securely, but when I looked for data on computer crime to encourage people to use better passwords, I discovered a very different story. Most "computer crime" (according to the FBI) consists of various forms of scams and con games that used to be carried out in person, over the phone, or through the mail, but are now done through online auctions or email. Nothing to do with passwords. The FBI's 25-page 2008 Internet Crime Report uses the word "password" only twice.

At least for corporations, the big problem seems to be people using the access they were given to do bad things. That happens much more often than people hacking into other accounts.

Computer attacks tend to target applications and the operating system. If you don't keep up with patches, your password won't matter. Source: The Top Cyber Security Risks.

Contrary to the title of this posting, good password practices are important. But what's even more important is to:

1.) Keep your wits about you and cultivate a healthy skepticism before downloading a free game, clicking on an advertisement, or buying something from someone you don't know (e.g. eBay).

2.) Keep your operating system and applications updated. Always choose, "Yes, apply updates right now" and "Of course I'll reboot." Manually check for updates periodically just in case.

3.) Use a tool like Revo Uninstaller to remove applications you are no longer using. Especially anything by Adobe, RealPlayer, toolbars (e.g. Yahoo!), and the Microsoft .NET framework.



When Do Passwords Matter?

I got an email today saying that a web application I used a single time eight years ago had suffered a break-in and warning me that if I used that password for multiple accounts, I should change the passwords to all those accounts. I have over 120 personal accounts, and God knows how many at my various jobs over the last 8 years - how many of those applications have been compromised? Kudos to the organization who discovered the break-in AND alerted me. I think it's safe to assume this is not the only break-in among those 120 applications, nor the only one discovered.




Minimum Effort Password Management

I just read a wonderful article in the Boston Globe Online about the time-wasting, annoying, and mostly useless advice security experts have given us about passwords. So if you want to be secure with the minimum amount of effort, what is the most important thing?

I believe using a different unguessable password for every account is the most important password practice because doing so means that all of your other accounts are safe whenever one of them is compromised - and if you use a computer long enough, accounts WILL be compromised. Some have suggested using X9$bFacebook, X9$bTwitter, X9$bMySpace, but schemes that use the application name, even if it's altered in various ways, are still guessable.

To manage different passwords for every account, you need a password manager. Many people use the "remember passwords" feature of their favorite browser. This is a terrible idea because:
  • It means you are storing your most secure data (your passwords) in your least secure application (your browser)

  • You are going to need to enter activation keys or passwords into software installed on your machine at some point, and you cannot store that in your browser

  • When you go to another computer, or try to switch to another brand of browser, you don't have your passwords.

  • When your hard drive dies, so do all your passwords

  • When you die, so does access to your computer and all your passwords.


So the web browser is not such a good solution. The best I've found (thanks to a good friend for pointing me to it) is a free, open-source password manager called KeePass, which is available for Windows, with KeePassX available for Linux, Mac, and Windows. I use it with a strong master password and a tool called Dropbox to sync it across my computers. Lifehacker has an article on how to use them together. Once a year, I recommend printing out your KeePass database, writing your master password on it (your Dropbox password will be in your KeePass database), sealing the list in a tamper-evident security envelope, and putting it in your safe deposit box. Then burn last year's list (you know, with a match). When your hard drive dies, you have a backup immediately available on your other computers via Dropbox. When you die, there are legal proceedings that allow your next of kin to access your safe deposit box.

For a less secure, less robust, but easier to use password manager, look at LastPass.

Wednesday, February 3, 2010

Grep: Yet Another Reason I Love Bash (the GNU/Linux Bourne Again Shell)

I have to say how incredibly awesome it is to be able to issue a one-liner like the following:

egrep -w --color --exclude-dir='.svn' '[Bb]usiness [Uu]nits?|BUs?' WEB-INF/jsp/*

Everywhere the expression Business Unit, business unit, or BU appears in the JSP pages is now listed on my screen in color.

How does it work?


egrep

Run grep (the GNU pattern-matching utility) using Extended (that's where the E comes from) regular expressions.

-w

I was looking up the word-boundary expression in the man page for grep and found that the -w switch forces matches to fall on word boundaries (whole words only).

--color

Show the file names and matches in a different color (red) from the other text.

--exclude-dir='.svn'

Prevents grep from looking in Subversion (source control) directories.

'[Bb]usiness [Uu]nits?|BUs?'

Match "business unit" with or without initial caps and with or without an "s" at the end, or match "BU". These matches have to occur on word-boundaries (because of the -w above) so that ABUSE does not match.

WEB-INF/jsp/*

Run the match against all the files in the WEB-INF/jsp directory.




I pay about $150/year for an individual license for IntelliJ IDEA because of its refactoring, but it still can't do this. You can pry bash from my cold, dead hands.

Tuesday, January 5, 2010

My Favorite Free Windows Applications

I rely on a small set of free tools (suggested on Lifehacker) to keep my Windows (virtual) machines (and relative's Windows machines) running smoothly. They are:

  • Revo Uninstaller: To remove programs, but perhaps more importantly, to nuke junk that auto-starts (click Tools, then the traffic-light icon). Does a nice job of removing the extra junk that inconsiderate programs leave behind.

  • CCleaner: To clean up trash on the hard drive and registry. I make sure to uncheck cookies for the cleanup in all my browsers. Longer term, I will probably mark the cookies I care about as safe and let it nuke the rest. Then I run the registry cleanup repeatedly until it doesn't find anything anymore.

  • Microsoft Security Essentials (antivirus): Free, highly rated, low-resource, no hassle. Time will still tell, but so far, so good. I've always thought that Microsoft should be the one responsible for protecting their own operating system.

  • MyDefrag: The "Weekly" script rocks. I don't run it weekly, but it's amazing what a mess Windows makes of its drive. It's like an animal soiling its cage. MyDefrag really helps boot times. I'm very impressed. Note: I do not like the "Monthly" script at all. Sorting files in name and directory order seems like a total waste of time and disk life to me.


Honorable Mention:


  • sdelete: A command-line utility from Microsoft that zeros out unused disk space. This is nice for security reasons, but I use it to keep my virtual disk images really small. I run "sdelete -c C:" Then in Linux, "VBoxManage modifyhd --compact"

  • cygwin: A Linux command-line emulator for Windows. I couldn't live without it, but most people won't have a need for it.


The above programs have helped me tame the most clogged-with-crapware systems and make them secure and responsive again.

If someone told me that once I ran Linux, I'd think nothing of running several Microsoft operating systems, I would have laughed. But Windows works great in a VM. It's a fun toy when you don't have to rely on it to do anything useful. I use one VM for IE6 (for testing) and the other for IE8 (for programs that don't have Linux equivalents). Both are running Windows XP SP3. A Windows 7 VM is probably in my future.