Glen Peterson's Blog

Musings on programming and web applications...

View Glen Peterson's profile on LinkedIn

Friday, June 5, 2009

Free (GPL) Software Restricts Your Rights!

When I say Free Software in this article, I am referring specifically to software released under the GNU Public License (GPL), a license written by Richard Stallman that has revolutionized the software world.  Revolutionized in the sense of completely and indelibly changing it, but also in the sense of starting a movement of people.  I owe great thanks to RMS, his copyleft, and the ideas he set in motion.  I use emacs and cygwin so much every day that I don't even notice when I dream in emacs.  I have created or contributed to a small number of Free or Open-Source Software (F/OSS) projects, but like most professional programmers, I have mostly just helped myself of the bounty of fabulous F/OSS software that exists.

But for all it's talk of freedom, free as in "Free Speech" and free as in "Free Beer", Free (GPL) Software still restricts your rights as a programmer in one very important way that some other Open Source software does not; it cannot be bundled with "non-free" or proprietary software for distribution because the act of doing so turns the proprietary software into free software.  Most professionals get around this either by violating the terms of the GPL (intentionally or through ignorance) or by adopting the Software as a Service (SaaS) model of hosting web applications themselves so that the software does not need to be distributed in order to be used.  Other strategies for making money with Free Software include selling support and packaging (as RedHat does with Linux) or dual-licensing (as Oracle does with MySQL).

At my current job we pursued the SaaS model as far as it would take us, but in the end we found we had to distribute our product.  We were unwilling to license our company's only asset (its software) under the GPL, so we removed all GPL code instead.  That was a sad day indeed and it has since been a struggle to keep up with the sheer number of Licenses that F/OSS developers use for their products.  For some minor components we have had to choose tools based more on the license than the quality.

The saving grace for us has been the Apache Software Foundation and their license.  For F/OSS developers on the fence, software released under the Apache License v2.0 can be converted or combined with GPL 3.0 software at any time, but the reverse is not true.

The trouble with the GPL is a very specific, technical detail which requires some background in order to explain.  Modern computer programs tend to be composed of a number of proprietary and/or open-source components "linked" together to form a whole.  The definition of "linking" varies somewhat between programming languages and platforms and is somewhat disputed, but it basically means what it sounds like - components are developed separately then combined into a single whole.  The trick the GPL plays on programmers is that merely linking to GPL code (without changing it in any way) allows the GPL to extend to all linked code and effectively puts all of it in the public domain.  The Free Software Foundation will tell you that it's fine to link GPL code to proprietary code and it is, so long as you have the right and the desire to change that code into GPL code.

I have made my living writing software and charging for it since 1992.  I am happy to contribute bug fixes and non-core components of software I have written to the Open Source community, but I want the freedom to distribute critical components under the license of my choosing (generally a proprietary license, granting ownership of the code to my employer).  The Apache License (and other Open Source licenses) allow me to do that.  The GPL does not.

RMS wrote an article in the latest CACM entitled, "Why 'Open Source' Misses the Point of Free Software."  With all due respect to RMS, I have to say I'm glad that it does and I'd like to encourage anyone familiarizing themselves with F/OSS to read several licenses very carefully before choosing one.  The GPL is a classic, but I'm afraid history will show its application is limited to two scenarios:

  1. People who believe that proprietary software is fundamentally wrong and should be eliminated.
  2. Companies who write software which is much more useful when linked as a component of other software than it is as a stand-alone product.  These companies will pursue a strategy like MySQL where they dual-license their products under the GPL so that the world is free to try, examine, change, and distribute their software for non-commercial purposes.  But those who wish to use the software commercially will have to pay for another version of what might be the same code released under a different license.

I predict that the GPL will see the most usage by the second camp.  For the rest of us, there are a few hundred other F/OSS licenses to choose from.  A great source for learning about these licenses is the Open Source Initiative.  Their approval process is very instructive for anyone thinking they want to write their own license.  The proliferation of essentially duplicate licenses has run increasingly rampant as more programmers have gotten excited about F/OSS, mostly through the writer's ignorance of an existing license that already does exactly what they want.

Monday, March 9, 2009

Using @Deprecated more liberally

I recently created a class and an interface for managing dates as just a month and year (ignoring the day) and called them YearMonth and YearMonthInterface. I then added comparison methods to YearMonth: greaterThan(YearMonthInterface x), lessThan(YearMonthInterface x), and equals(YearMonthInterface x). Do you see the problem?

equals() overrides Object.equals() and has a special significance for the Java Collections framework. According to the Java SE 6 spec, equals() must be reflexive, symmetric, transitive, and consistent, and x.equals(null) should return false. I had no way of knowing that any class that implemented the YearMonthInterface could satisfy those conditions. In fact, I already had one implementing class that could never satisfy symmetry: myYearMonth.equals(myOtherClass) would often not equal myOtherClass.equals(myYearMonth) because my other class was dependent on another object and could only be equal when its associated objects were also equal.

I decided to keep the YearMonth.equals(Object o) method as overriding Object.equals() so that YearMonth would work correctly with collections, but create a new method, YearMonth.equalTo(YearMonthInterface x), for manual comparisons. My concern was that I never wanted to mistakenly use equals() instead of equalTo(). The solution? Deprecate YearMonth.equals() and provide a comment suggesting the use of equalTo() instead. The compiler will now catch any attempt to use the wrong method and the collections framework can still use the equals() method without problem.

/**
This is deprecated because it does not compare YearMonthInterface objects,
rather it overrides Object's equals method. It still works in
collections, but there should be no reason to ever call it directly.
Use equalTo() instead which really compares two YearMonthInterfaces as
you would expect, but doesn't need to be symmetric.

@param other another YearMonth (or it will return false).
@return true of the YearMonths are equal.
*/
@Deprecated
@Override
public boolean equals(Object other) {
// Check address and null first for speedy return
if (this == other) {
return true;
}
if (other == null) {
return false;
}
if ( !(other instanceof YearMonth) ) {
return false;
}
final YearMonth that = (YearMonth) other;
return (this.year == that.year) && (this.month == that.month);
}

public boolean equalTo(YearMonthInterface that) {
// Check address and null first for speedy return
if (this == that) {
return true;
}
if (that == null) {
return false;
}
return (this.year == that.getYear()) && (this.month == that.getMonth());
}

This technique would not work if I were to release this class to the public, but for a utility class with limited application, it can prevent me and my coworkers from making a simple but costly mistake.

For a full explanation of how to write a good equals() method, see Effective Java Second Edition by Joshua Bloch Chapter 3 (pp 35-36 is specifically about symmetry). I'm about half way through this book and really enjoying it.

Tuesday, February 3, 2009

New Favorite Find/Replace (bash) shell commands

These commands took a while for me to find and I now use them very, very often. One is to find files in all subdirectories excluding certain directories (in this case, the subversion ".svn" directories). The second expression replaces regular expressions in all the files that match the find in the first expression:

# To grep for things excluding subversion .svn directories
find . -path '*/.svn' -prune -o -type f -print | xargs -e fgrep -I PATTERN

# To replace the same (using sed), only in files that grep matched, only in non .svn files.
find . -path '*/.svn' -prune -o -type f -print | xargs -e fgrep -l -I 'OLD' | xargs sed -i -e 's/OLD/NEW/g'


UPDATE 2009-02-09
I was browsing at Barnes and Noble over the weekend and the O'Reilly Pocket Guide to Grep mentioned an --exclude-dir option. I looked at the man page again and it's there too. I even googled again and found it right away. How did I miss it? That makes things much easier.

# To grep for things excluding subversion .svn directories
fgrep --exclude-dir='.svn' -rI PATTERN *

# To replace the same (using sed), only in files that grep matched, only in non .svn files.
fgrep --exclude-dir='.svn' -rIl 'OLD' * | xargs sed -i -e 's/OLD/NEW/g'

Wednesday, July 30, 2008

Hibernate automatic POJO and mapping file generation from MySQL

Let's face it. Updating Hibernate .hbm.xml and .java files for every database change is phenomenally boring and a waste of time. It's the sort of thing that makes people switch to Ruby/ActiveRecord. I have recently undergone an effort to get the Hibernate reverse engineering tools to generate .hbm.xml and .java files as close as possible to what I want. In theory that should allow me to add new methods to the files in my source tree and still generally get hands-free updates by running the hibernate updater and diffing/patching. In practice, getting Hibernate to give me exactly what I want is very hard, so in the first week I moved from 3% of my changes being automatically applied to 80%. In the next month, I moved up to about 95% and it will probably slowly continue to improve.

The basic outline of the process is simple:
  • Save the generated files from the last time I ran the generator: mv generatedSchema generatedSchema.1

  • run the generator again and diff the files it generates this time against the ones it generated previously

  • patch those changes to my source tree.


I use the ant task, not the Eclipse integration because I'm in love with the command line. Assuming that you have read and absorbed all the documentation about hibernate reverse-engineering, I'll list a few key points from my hibernate.reveng.xml.

hibernate.reveng.xml


For some reason BIGINT was being reverse-engineered as Long (the object) instead of long (the primitive). I'm not so sure whether my VARCHAR mapping is a good idea...

I included an example of renaming a column, and a much more useful example of naming a parent-child relationship. The syntax of the foreign-key mapping was not obvious to me, so I hope you find this example helpful.
<hibernate-reverse-engineering>
<type-mapping>
<sql-type jdbc-type="BIGINT" hibernate-type="long" not-null="true"></sql-type>
<sql-type jdbc-type="VARCHAR" length="1" hibernate-type="char" not-null="true"></sql-type>
</type-mapping>
<table-filter match-catalog="mydb" match-name="temp_user" exclude="true"/>
<table catalog="mydb" name="company">
<column name="identifier_c" property="coIdentifierC" type="string" />
</table>
<table catalog="mydb" name="tree">
<foreign-key constraint-name="tree_ibfk_parent">
<many-to-one property="parent" />
<set property="children" />
</foreign-key>
</table>
</hibernate-reverse-engineering>


pojo/*.ftl Files


In addition to these customizations I've made extensive use of the Freemarker POJO templates. I made a number of very minor changes for personal preference, but one very important change for diff and patch to work. I added a two-line comment before and after the fields (in PojoFields.ftl) and the get/sets (in PojoPropertyAccessors.ftl):

// Begin Auto-generated Accessors (by myGenerator project)
// Manual editing is STRONGLY DISCOURAGED below this point!

...

// End Auto-generated Accessors (by myGenerator project)
// Manual editing is OK below this point


Having two lines of comments is very important because:
  • Hibernate reverse-engineering tools like to shuffle the field order even when you changed one of the related tables

  • Diff uses the previous 2-3 lines for comparison to confirm it's found the right place in the file and a two-line comment plus one blank line makes it very happy.


diff and patch


After running the hibernate ant task (and tweaking the output with some canned perl regular expression replacements), I run diff (telling it to ignore the timestamped Hibernate comment):
diff -I '^\(//\|<!--\) Generated ' -druN generatedSchema.1/ generatedSchema/ >patch.txt


Often I check what changes were made:
less patch.txt

From my source tree, I do a dry-run of patching the differences.
cd WEB-INF/src
patch --dry-run -l -p1 <../../../../../scorecardSchemaGen/patch.txt

If there are any failures:
  • patch those files manually performing any cleanup necessary to make patches more likely to succeed automatically next time

  • copy the new generated version of the manually patched file from generatedSchema to generatedSchema.1

  • Re-run the diff command so that it will find no differences in the generated version(s) of the manually patched file(s).

  • Perform another dry-run patch

  • Repeat until there are no errors in the dry-run


Now there will be no failures when you patch for real:
cd WEB-INF/src
patch -l -p1 <../../../../../scorecardSchemaGen/patch.txt


Finally, clean up the backup files made by patch:
rm company/project/section/*.orig


MySQL Conventions


When choosing optimal types and variable names for the columns in MySQL, one should take into account how Hibernate will reverse-engineer it so as to make sure the resulting files are as close as possible to what you want. In addition, MySQL has no "check" function and the MySQL BOOLEAN becomes TINYINT(1) which allows values from -128 to 127. Here are the subset of data types that are really easy to use with all tools and provide some data integrity by eliminating possibilities for meaningless values.

  • Boolean: Start each name with is_ whenever it makes sense - the code reads more like English that way. Use positive column/variable names. E.g. instead of preventUpdate, isNotDirty, or doesntHaveErrors, use allowUpdate, isClean, or isErrorFree. Declare in MySQL as:
    is_something bit(1) NOT NULL DEFAULT FALSE

    Note: Java/Hibernate like this but PHP and some other tools may not. Here are two work-arounds for other tools:
    select if(is_something, 1, 0) from my_table;
    or
    select if(is_something, true, false) from my_table;

  • Integers:Try to only use MySQL (Java) types: INT (int), BIGINT (long), and maybe TINYINT (byte). The size (the number in parenthesis after the type in MySQL) has no effect that I can see. The SMALLINT (short) data type requires a trade-off between saving space and limiting possible values and ease of use in Java. It requires too much type casting for my taste when working with it in Java. You even have to use casts for situations like:
    i = (short) (i + 1);
    myObject.setShort( (short) 0 );


    Unless you require a smaller data type for data integrity or for space considerations, It's easiest to stick with INT. Stay away from MySQL's MEDIUMINT: It's a 3-byte integer which is probably why Hibernate does not recognize it. 64-bit (8-byte) integers like BIGINT (long) didn't used to even be available on hardware. The manuals don't make any note of whether they are supported by hardware or not. Anyone know?

  • Strings: Use VARCHAR with the appropriate length. That way, it's easier to support utf-8 or other variable-width character encodings if you want to.

  • Naming: I started off naming my columns description_c and year_i, but ended up dropping the type-specific suffix because Java and MySQL are both type-safe anyway and the extra letter made the code slightly harder to read. I still use the word _date for dates because that seems to make the code easier to read (to my eye).

  • Foreign Keys: MySQL uses a separate constraint name and foreign key name. The constraint name has to be unique within the database so it's automatically assigned the tablename_ibfk_1 where the "1" is a sequence number incremented for each new constraint. If I drop and recreate a table in development, but merely add a new column in test, then the sequence numbers could be different in different environments and issuing a command to change a constraint in one environment will change a *different* constraint in another environment. This would be really bad.

    To avoid this, I use the following template to declare a foreign key with MySQL InnoDB (the default database, MYISAM ignores the "on-delete cascade" clause):

    CREATE TABLE process ( 
    ...
    company_id BIGINT UNSIGNED NOT NULL,
    constraint `process_fk_company` FOREIGN KEY (company_id)
    REFERENCES company(id) on delete cascade,
    ...
    ) ENGINE=InnoDB;


    That way I'm sure that the constraint name (which is what I have to reference in the Hibernate file) will not change. It's confusing because MySQL syntax mixes foreign key column names and constraint names. E.g. DROP FOREIGN KEY constraint-name. To find what constraint name is used for a given foreign key, use the SHOW CREATE TABLE command. Constraint names in mysql look like `tablename_ibfk_1`

Saturday, March 22, 2008

RESTful Form Beans with Struts 1

What Struts is Good At

Struts works great for view-only pages, particularly when you need to build those pages from the database. But there are a number of subtle issues with update pages which can probably best be solved with a RESTful solution.

REST and state

The database holds persistent state. My understanding is that one of the concepts of REST is that there is rarely a reason to cache that state on the server and/or the browser/client when it's already stored semi-permanently in the database. It seems that the session on the server only needs to hold state that doesn't belong in the database. That would be data specific to what the user is currently viewing or doing. Some examples are:
  • Logging in - that is usually (and probably best) accomplished by establishing a session on the server and storing a cookie on the client containing a token that the client's browser sends with every request instead of resending their user-id and password. The token tells the server which session is associated with that user. That way you aren't repeatedly sending the user ID and password over the wire. If someone eavesdrops on the session and steals the token, it expires after a certain period so that an attacker will (theoretically) not be able to log back in.

  • Other session-specific state. This is for global-controls analogous to the caps-lock key on the keyboard. All the other controls (keys) are still available to you, but their behavior is slightly different because the state (caps-lock) has changed.

Session Timeout

Sessions need to time out eventually (usually in 30 minutes or so for security reasons). End-users find this incredibly annoying. A little JavaScript can soften the blow:

setTimeout(alert('Your session will expire (silently) in 2 minutes...'), 1680000);

This is still annoying, and not very secure, since the point of having sessions expire is to prevent an unauthorized person from accessing a still-valid session. To a hacker, this is essentially a pop-up, advertising: "You can hijack someone's session if you act now!".

If you're application is mostly RESTful, storing only the login information on the server, you can pop up a login window to allow them to log-back in without loosing whatever they are working on. This is secure because the user ID and password are required. Also, the user does not have to respond within a certain period because they are effectively logging in and creating a new session.

if (confirm('Your session has expired.  Log back in to continue your work?')) {
loginWin = window.open("pre_login.do");
}


Since you do not rely on server side state or cookies for anything but the login, the user can get a form and fill it out in one session, then submit it in a completely different session, so the login pop-up works. If you have a little bit of state on the server and you are clever, you can enhance this solution to send a setting or two from your server-side state from the old session to the login page so that the new session will match the old.

Struts Problem 1: Clearing session-scoped Form Beans

Form beans are session scoped by default. A user GETting the form in preparation for POSTing a new record will see whatever data was left over in the form bean from the last record they edited. You need to work around this by figuring out when the user is going to add a new record and clearing the fields in the form.

Struts Problem 2: Session expiration with session-scoped form beans

Sooner or later the session will time out and the form bean will be completely lost. Writing an Action and pairing it with a log-back-in JavaScript function becomes prohibitively ugly from a maintenance and security standpoint if even a few different fields are required for a few different screens. As far as I can see, there is no reasonable work-around for this scenario, which is why REST is so useful.

Struts Problem 3: Request-scoped Form Beans and the Struts validator

When there is a failed validation on a form bean, the model isn't invoked, so only the fields that the HTML form posted are filled in from the request. Any extra data (breadcrumbs, related records from the database, or other variable text that provides context for the update taking place) is lost unless that data was also built into fields in the form <input type="hidden"... />. But REST is supposed to make things simpler and more lightweight. Building breadcrumbs into URLs requires escaping any special characters, increases both the request and response sizes, and is generally a very ugly solution. Storing these fields in client side cookies is possible too, but very complicated.

A RESTful solution

REST precludes any unnecessary server state and as mentioned above, the validator is not RESTful. So you'll need to declare actions:
<action ... validate="false" scope="request" />
All validation must take place on the action, not the form bean (see Problem 3, above).

Query the database for all the data to build the screen with each request. Any extra load on the database is small because this amounts to making extra queries only for failed validations. The catch is that you need to keep track of whether any data on the form bean is dirty (user has changed it) or clean. Remember, a user can clear a field, so just checking for null or 0 doesn't mean that the data is unchanged.

Fortunately, we already know whether the form data is dirty or not by the method the user used on the request.

GET without a database record number
Clean: Show the user a blank form so that they can add (POST) a new record.

GET with a database record number
Clean: Show the user the current values of the record from the database so that they can update (PUT) the existing record.

PUT
Dirty. If input passes validation, update existing record with changes. If it doesn't pass, return the dirty form plus any "extra" data from the DB for building the screen.

POST
Dirty. If input passes validation, check that the record doesn't exist already as the user can resubmit a post by pressing the back button. This resubmitted post should either be ignored or treated as a put (see above) since the record already exists. If it is a valid POST add a new record to the database. If the input is invalid, return the dirty form plus any "extra" data from the DB for building the screen.


Note: HTML only supports GET (all links) and POST (only in <form method="post"...>). Not surprisingly, most browsers don't support the other methods mentioned in the HTTP spec either. I'm sure people are signing petitions calling for an overhaul of HTML and all browsers, but it's really not an issue to send an extra parameter to designate the "method" of each request. Your application may require some methods that HTTP does not provide, in which case you'd have to send a "method" parameter anyway.

In order for the extra data from the DB to be refreshed on a request-scoped form, the action still needs to be invoked. Something like the following would probably be a good template for Actions:

  1. Check that the user is logged in. This might better be done in a servlet, but it needs to be done.

  2. Check for required parameters (like the unique identifier of the record you are editing). Send failures to a Not Authorized page.

  3. Check that the user is authorized to do what they are trying to do. Send failures to a Not Authorized page.

  4. Validate any other parameters that would make the request just invalid - essentially assertions that your UI is working properly go here. Send failures to a Not Authorized page.

  5. Create your ActionMessages object to hold user-friendly validation errors

  6. User input validation. Accumulate friendly error messages in your errors object.

  7. If there are no errors and the method is POST or PUT, perform update.

  8. If the form is not dirty - fill form fields with data from database. Build or rebuild other data for the screen to show the user and put it on the form bean. Or you might create a new form bean and forward to a different screen after a successful update.


That list is a starting point, but I always code Actions to minimize the number and extent of database queries, so that sometimes shuffles the later stages in that list.

Advantages:

  • You don't have to worry about clearing old data from the form bean (you get a new form bean with each request).

  • You don't need to encode any "extra" data (breadcrumbs, etc.) into the form to make it get included in the request (it's reread from the database).

  • You can log back in elegantly, without loosing what you were working on (all necessary data is reread from the database or carried as form data in the request).

  • You can use any objects you want for building the screen - you are not limited to Strings and other primitive types.

  • This requires less code than any of the other ways I tried solving these issues.


Disadvantages

You can't use the Struts validator. All your validation must be done in the action.

Conclusion

So far, this is the neatest solution I can come up with to all of the above issues. And it can be as RESTful as the design of your application allows (uses a minimum of server-side state). It fails to leverage the struts validator, but I was never eager to perform that much XML configuration anyway. Struts 1 has been around for a long time. People ask much more of a web application now than they used to. Maybe Struts 2 has some easier shortcuts for these same issues?

Friday, January 25, 2008

Converting a Web Application to UTF-8

Overview:



MySQL

MySQL supports only characters from the Basic Multilingual Plane (BMP or Plane 0) which is U+0000 - U+FFFF. The Plane 0 characters all take 2-bytes in UCS-2 or 1-3 bytes in UTF-8.

Source: MySQL 5.0 Reference Manual 8.1.8. Unicode Support.

IMPORTANT: MySQL's default collation for each characterset (not just utf8) ends with a "_ci" which stands for Case Insensitive. This is probably just what you want for everything *except* passwords digests (I'm assuming no-one stores plaintext passwords in databases any more). If a 43 character Base64 digest were stored in a case-insensitive column, it would be roughly 5 billion times more vulnerable to hash collisions. That's bad. Using utf8 for user IDs increases security by allowing tens of thousands of characters in the user ID instead of the mere 100 or so safe ASCII characters. Decide whether you want johnDoe and JohnDoe to log into the same account and choose your collation for the user ID column accordingly.

As of this writing, the only case-sensitive utf8 collation is utf8_bin. To find out what collations your installation supports issue:
show collation like 'utf8%';


Don't use utf8_bin for fields that you will sort for the user because letters A-Z come before a-z in this collation, but it is the only collation to use for utf8 case sensitivity. Specify character set and collation explicitly on your password hash column (and optionally the user ID column) so that you are free to change the character set of the database and tables in the future without accidentally locking any users out of your system:

user_identifier varchar(30) character set 'utf8' not null;
password_hash varchar(44) character set 'utf8' collate 'utf8_bin' not null;


More information on character sets and collations is available on the Character Sets and Collations in General page of the MySQL manual.

When you change the charset of a table in MySQL, every column is explicitly set to take the *old* charset - even if the columns in your tables all previously used whatever the default was! You will need to issue ALTER TABLE statements for every column, or dump the database to file and replace all the character set references. I did the latter with three text replacements:

mysqldump -p -u root --single-transaction my_db > ~/backup_$(date --iso-8601).sql

Change table-specific charsets (always present)
Replace:" DEFAULT CHARSET=latin1"
With:" DEFAULT CHARSET=utf8"

Remove any column-specific charsets (less common)
Replace:" character set latin1"
With:""

Change all char's to varchar's because char has to store 3 bytes/character in utf8 while varchar can store 1 for short characters
Replace:" char("
With:" varchar("

You might want to search for "latin1" and see if anything else needs to be changed. Then follow the instructions above and everything will be fine. Hopefully this will save someone else a day's worth of trouble.

Change configuration settings in the system-wide /etc/mysql/my.cnf file (or .my.cnf or my.ini file on Windows):

[client]
default-character-set=utf8
[mysql]
default-character-set=utf8
[mysqld]
default-character-set=utf8


Restart MySQL.

Login to MySQL and use the database in question. Enter the following:
alter database db_name character set utf8;


Test the settings:
mysql>show variables;

character_set_client | utf8
character_set_connection | utf8
character_set_database | utf8
character_set_filesystem | binary
character_set_results | utf8
character_set_server | utf8
character_set_system | utf8
character_sets_dir | C:\Program File
collation_connection | utf8_general_ci
collation_database | utf8_general_ci
collation_server | utf8_general_ci


Log out and re-load the database:
mysql -p -u root my_db <~/backup_$(date --iso-8601).sql

Cygwin

UTF-8 Cygwin seems to add support for Japanese characters, but I can't tell if it supports UTF-8 up to U-FFFF or higher. Maybe I just don't have the fonts on my system to see it!

Log4J

log4j.appender.appenderName.encoding=UTF-8

Source: a comment in email fromDon Dwoske

Hibernate:

In hibernate.cfg.xml, sethibernate.connection.url to:
jdbc:mysql://host:port/db?autoReconnect=true&amp;useUnicode=true&amp;characterEncoding=UTF-8

Source:More UTF-8 head-thumping with Hibernate 3

Struts:

http://ianpurton.com/struts-utf-8-and-form-submissions/

JSP:

Add the following attributes to the<%@page %> tag:
pageEncoding="UTF-8" contentType="text/html; charset=UTF-8"

Source:Tomcat/UTF-8

Tomcat:

In$CATALINA_HOME/conf/server.xml add the following attribute to the<Connector> tag:
URIEncoding="UTF-8"

Source:Tomcat/UTF-8

Java

In your Java code you need to replace:
s.getBytes()
with:
s.getBytes("UTF-8")

Source: a comment in email fromDon Dwoske

IMPORTANT: You *especially* need to convert bytes that are being sent to your SHA message digest algorithm because otherwise any non-latin characters will come out as "invalid", and they will all have the same hash!  Perform the UTF-8 conversion above, and you will have no problems.  All the ASCII characters keep their old encoding, so existing users' passwords still work.  Not sure what happens to high ASCII characters.  If that's an issue for you, you'll have to check.

Java seems to support the 16 Supplimentary Planes that take an extra 16-bit character to represent.
Source:JSR 204: Unicode Supplementary Character Support

MySQL only supports the BMP (Basic Multilingual Plane or "Plane 0") so I'm going to have to intercept these "high" UTF-8 characters in Java and put the replacement character in the database instead:

safeString = inString.replaceAll("[^\u0000-\uFFFF]", "\uFFFD");

Note: U+FFFD is � - the Replacement Character "used to represent an incoming character whose value is unknown or unrepresentable in Unicode". Source:Specials Range FFF0-FFFF (PDF). This shows up as a blank box in Internet Explorer (it's supposed to be a white question-mark in a black diamond). I'm sure IE will support it eventually.

I made a routine to clean all my string inputs from the user:

private static final Pattern highUnicodePattern = Pattern.compile("[^\u0000-\uFFFF]");

public static String cleanInputStr(String inStr) {
if (inStr == null) {
return null;
}
String trimmedStr = inStr.trim();
if (trimmedStr.length()< 1) {
return null;
}

Matcher highUnicodeMatcher = highUnicodePattern.matcher(trimmedStr);
return highUnicodeMatcher.replaceAll("\uFFFD");
}


As just shown, I filter any characters above U+FFFF, so I don't have to worry about characters that take more than 2 bytes of storage. I know what you are saying: "But, U+FFFF takes THREE bytes to store in UTF-8, wouldn't it take more than 2 bytes in UTF-16?" Maybe, but Java stores it as 2 bytes, so the Java String.length() still works. If you support characters above U+FFFF, you'll need to replace all your string-length checks:

int charCount = testString.length();
with this:
int oldCount = testString.length();
int charCount = testString.codePointCount(0, oldCount);


Source:Strings - Core Java Technologies Technical Tips

Notes:
There are also some characters that are affected by the neighboring characters (accents and such). I'm not thoroughly convinced that I can count characters by counting the number of times I press the arrow keys.

Even with the Java character-counting hack above, I thought for a while that JavaScript in FireFox and Safari counted more characters than Java in certain circumstances.  Then I discovered that I had line-feeds in my test data, and they were being counted by Java, but not by JavaScript (probably because they wouldn't render in the HTML) - so this "problem" wasn't related to UTF-8 after all!

Emacs

On Windows, GNU Emacs supports all the Unicode characters I could test on it of the box and XEmacs doesn't support unicode at all. For the past 8 years I've been using XEmacs because that's what I started with and I never saw a significant enough difference between GNU and X to bother switching. Ladies and Gentlemen, I am making the switch!

Note: Once you get your .emacs file switched over, make a backup or be careful not to open XEmacs, because it will *nuke* your old .emacs file and leave a new version converted to XEmacs in its place! That alone is so annoying that I don't ever want to switch back.

describe-current-coding-system is a handy function to see what emacs thinks it's doing! Emacs UTF-8 support is not 100% at the time of this writing (4-byte characters sometimes flicker and change), but it does pretty well. The GNU Emacs site says that even better UTF-8 support is planned for the next release.

Fixed-width Windows Fonts

I downloaded Everson Mono Unicode and it really supports every character, but I found it unacceptably hard to read the English/code in Emacs when I made it small enough for coding.

Courier New and Lucida Console seem the best of the ones that come with Windows XP English. They are both missing big chunks of UTF-8, but they have better support than any of the others I found (at least they show some Greek, Russian, Turkish, Arabic, Hebrew, Chinese, Japanese, and Korean). I picked Courier to use with Emacs (does anything else really matter?) because it's smaller and seems a little easier to read on the screen to me. Maybe I'm just used to it.

Humor

From xkcd: Emoticon (read the mouse-over)

Creative Commons License

Wednesday, December 12, 2007

Make Users Happy by Ignoring Requirements

People talk about the system they want you to build in the language of what they have now. It is natural to not want to give anything up. Writing web applications, I hear over and over that people want to be able to sort by every column, have column headings print out on every page, filter by any value... That's Microsoft Excel - and it is a truly incredible tool for doing all of those things.

On the other hand, most business systems (as opposed to software systems) are dependent on communication to make people with different skills work together. In general, if you can meet those needs, you can always add a CSV download later if people still want to "slice and dice" data in Excel. I have had consistently good results from ignoring the "Excel on the Web" requirements and asking my clients instead to describe the people, the tasks, and the communication involved in their processes.

I try to break a process into the sets of tasks that each person (or type of user) needs to perform. The usual goal is to show someone only the data they really need to perform their task(s), and give them the tools they need to accomplish that task on the same screen. Sometimes, similar tasks can share a screen or multiple screens might be required for a task, but the goal is to compartmentalize and specialize the business system into units that use common sets of data and functionality.

Example 1: Ignoring User Requirements

Below are the requirements I was given for a year-long project:
  1. A list of the documents
  2. Columns showing which one's are ready, which ones are late, and who is responsible for each
  3. Sort by any column
  4. Filter by any column
  5. Print out with column-headings on the top of every page...

Sounds like Excel, no? I only actually met requirement #1, yet the client was delighted. Why? Because the list of Excel features I was given tipped me off that the requirements were not well thought-through.

Here is a list of real requirements that a team of us had to dig for:
  1. A draft document needs to be written by an analyst
  2. The document must be reviewed and approved
  3. Financial reports need to be produced by a system
  4. Financial reports need to be approved by a person
  5. The documents and financials need to be combined into various client reports
  6. The client reports need to be printed
  7. The client reports need to be mailed
  8. A manager needs to keep track of all of the above

Developers understand systems and the benefits and limitations of your technology. Businesspeople know what they need to accomplish, but are often dimly aware of the business systems they use to achieve their goals. Bridging that communication gap is the hardest part of writing good software.

Example 2: Ignoring Systems Requirements

Stated requirements for a reporting web site that I worked on (on-and-off) for 6 years:
  1. high availability
  2. Load balancing
  3. portal
  4. pluggable, snappable, replaceable software components
  5. Thin, rich client
  6. Fully denormalized reusable data source
  7. Dashboarding
  8. 3-tiered communication layer

Today, you could add to those requirements:
  1. stateless (REST-ful)
  2. SOA (Service-oriented architecture)
  3. AJAX
  4. Web 2.0 (or 3.0, or whatever)

What the business actually needed was:
  1. Show quarterly reports as soon as the data was approved to show
  2. Secure
  3. Support a maximum of 20 simultaneous users
  4. Available 8AM-8PM Eastern (US) time
  5. Show some other marketing information

One web server would have been more than adequate, but we had 2. The slow part was actually some of the queries/reports in the database. The only part of the site we ever plugged/snapped/reused were some hand-coded HTML pages and the login code. If we had known enough to discover the actual needs of the users, we could have saved at least 50% of the effort and used some of that time to further optimize the long-running queries, or even redesign the database to make the queries simple enough that they wouldn't need optimization.

Well, 20/20 hindsight is easy. That's how you learn...

Conclusion

When requirements read like a winning "Buzzword Bingo" card, it's almost a sure sign they weren't thought out very carefully. Time spent digging for real requirements pays off in both medium and long-term system usefulness and cost savings.

License

Unless otherwise specified, all works on this blog are released under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License:

Creative Commons License