Wednesday, July 30, 2008

Hibernate automatic POJO and mapping file generation from MySQL

Let's face it. Updating Hibernate .hbm.xml and .java files for every database change is phenomenally boring and a waste of time. It's the sort of thing that makes people switch to Ruby/ActiveRecord. I have recently undergone an effort to get the Hibernate reverse engineering tools to generate .hbm.xml and .java files as close as possible to what I want. In theory that should allow me to add new methods to the files in my source tree and still generally get hands-free updates by running the hibernate updater and diffing/patching. In practice, getting Hibernate to give me exactly what I want is very hard, so in the first week I moved from 3% of my changes being automatically applied to 80%. In the next month, I moved up to about 95% and it will probably slowly continue to improve.

The basic outline of the process is simple:

  • Save the generated files from the last time I ran the generator:

    mv generatedSchema generatedSchema.1
  • Run the generator again and diff the files it generates this time against the ones it generated previously

  • Patch those changes to my source tree.

I use the ant task, not the Eclipse integration because I'm in love with the command line. Assuming that you have read and absorbed all the documentation about hibernate reverse-engineering, I'll list a few key points from my hibernate.reveng.xml.

hibernate.reveng.xml

For some reason BIGINT was being reverse-engineered as Long (the object) instead of long (the primitive). I'm not so sure whether my VARCHAR mapping is a good idea...

I included an example of renaming a column, and a much more useful example of naming a parent-child relationship. The syntax of the foreign-key mapping was not obvious to me, so I hope you find this example helpful:

<hibernate-reverse-engineering>
 <type-mapping>
  <sql-type jdbc-type="BIGINT" hibernate-type="long" not-null="true"></sql-type>
  <sql-type jdbc-type="VARCHAR" length="1" hibernate-type="char" not-null="true"></sql-type>
 </type-mapping>
 <table-filter match-catalog="mydb" match-name="temp_user" exclude="true"/>
 <table catalog="mydb" name="company"> 
  <column name="identifier_c" property="coIdentifierC" type="string" />
 </table>
 <table catalog="mydb" name="tree"> 
  <foreign-key constraint-name="tree_ibfk_parent">
   <many-to-one property="parent" />
   <set property="children" />
  </foreign-key>
 </table>
</hibernate-reverse-engineering>

pojo/*.ftl Files

In addition to these customizations I've made extensive use of the Freemarker POJO templates. I made a number of very minor changes for personal preference, but one very important change for diff and patch to work. I added a two-line comment before and after the fields (in PojoFields.ftl) and the get/sets (in PojoPropertyAccessors.ftl):

// Begin Auto-generated Accessors (by myGenerator project)
// Manual editing is STRONGLY DISCOURAGED below this point!

...

// End Auto-generated Accessors (by myGenerator project)
// Manual editing is OK below this point

Having two lines of comments is very important because:

  • Hibernate reverse-engineering tools like to shuffle the field order even when you changed one of the related tables.

  • Diff uses the previous 2-3 lines for comparison to confirm it's found the right place in the file and a two-line comment plus one blank line makes it very happy.

diff and patch

After running the hibernate ant task (and tweaking the output with some canned perl regular expression replacements), I run diff (telling it to ignore the timestamped Hibernate comment):

diff -I '^\(//\|<!--\) Generated ' -druN generatedSchema.1/ generatedSchema/ >patch.txt

Often I check what changes were made:

less patch.txt

From my source tree, I do a dry-run of patching the differences.

cd WEB-INF/src
patch --dry-run -l -p1 <../../../../../scorecardSchemaGen/patch.txt

If there are any failures:

  • Patch those files manually performing any cleanup necessary to make patches more likely to succeed automatically next time
  • Copy the new generated version of the manually patched file from generatedSchema to generatedSchema.1
  • Re-run the diff command so that it will find no differences in the generated version(s) of the manually patched file(s).
  • Perform another dry-run patch
  • Repeat until there are no errors in the dry-run

Now there will be no failures when you patch for real:

cd WEB-INF/src
patch -l -p1 <../../../../../scorecardSchemaGen/patch.txt

Finally, clean up the backup files made by patch:

rm company/project/section/*.orig

MySQL Conventions

When choosing optimal types and variable names for the columns in MySQL, one should take into account how Hibernate will reverse-engineer it so as to make sure the resulting files are as close as possible to what you want. In addition, MySQL has no "check" function and the MySQL BOOLEAN becomes TINYINT(1) which allows values from -128 to 127. Here are the subset of data types that are really easy to use with all tools and provide some data integrity by eliminating possibilities for meaningless values.

  • Boolean: Start each name with is_ whenever it makes sense - the code reads more like English that way. Use positive column/variable names. E.g. instead of preventUpdate, isNotDirty, or doesntHaveErrors, use allowUpdate, isClean, or isErrorFree. Declare in MySQL as:
    is_something bit(1) NOT NULL DEFAULT FALSE

    Note: Java/Hibernate like this but PHP and some other tools may not. Here are two work-arounds for other tools:

    select if(is_something, 1, 0) from my_table;
    or
    select if(is_something, true, false) from my_table;
  • Integers:Try to only use MySQL (Java) types: INT (int), BIGINT (long), and maybe TINYINT (byte). The size (the number in parenthesis after the type in MySQL) has no effect that I can see. The SMALLINT (short) data type requires a trade-off between saving space and limiting possible values and ease of use in Java. It requires too much type casting for my taste when working with it in Java. You even have to use casts for situations like:
    i = (short) (i + 1);
    myObject.setShort( (short) 0 );

    Unless you require a smaller data type for data integrity or for space considerations, It's easiest to stick with INT. Stay away from MySQL's MEDIUMINT: It's a 3-byte integer which is probably why Hibernate does not recognize it. 64-bit (8-byte) integers like BIGINT (long) didn't used to even be available on hardware. The manuals don't make any note of whether they are supported by hardware or not. Anyone know?

  • Strings: Use VARCHAR with the appropriate length. That way, it's easier to support utf-8 or other variable-width character encodings if you want to.

  • Naming: I started off naming my columns description_c and year_i, but ended up dropping the type-specific suffix because Java and MySQL are both type-safe anyway and the extra letter made the code slightly harder to read. I still use the word _date for dates because that seems to make the code easier to read (to my eye).

  • Foreign Keys: MySQL uses a separate constraint name and foreign key name. The constraint name has to be unique within the database so it's automatically assigned the tablename_ibfk_1 where the "1" is a sequence number incremented for each new constraint. If I drop and recreate a table in development, but merely add a new column in test, then the sequence numbers could be different in different environments and issuing a command to change a constraint in one environment will change a *different* constraint in another environment. This would be really bad.

    To avoid this, I use the following template to declare a foreign key with MySQL InnoDB (the default database, MYISAM ignores the "on-delete cascade" clause):

    CREATE TABLE process ( 
        ... 
        company_id BIGINT UNSIGNED NOT NULL, 
        constraint `process_fk_company` FOREIGN KEY (company_id) 
            REFERENCES company(id) on delete cascade, 
        ... 
    ) ENGINE=InnoDB;

    That way I'm sure that the constraint name (which is what I have to reference in the Hibernate file) will not change. It's confusing because MySQL syntax mixes foreign key column names and constraint names. E.g. DROP FOREIGN KEY constraint-name. To find what constraint name is used for a given foreign key, use the SHOW CREATE TABLE command. Constraint names in mysql look like `tablename_ibfk_1`

Saturday, March 22, 2008

RESTful Form Beans with Struts 1

What Struts is Good At

Struts works great for view-only pages, particularly when you need to build those pages from the database. But there are a number of subtle issues with update pages which can probably best be solved with a RESTful solution.

REST and state

The database holds persistent state. My understanding is that one of the concepts of REST is that there is rarely a reason to cache that state on the server and/or the browser/client when it's already stored semi-permanently in the database. It seems that the session on the server only needs to hold state that doesn't belong in the database. That would be data specific to what the user is currently viewing or doing. Some examples are:
  • Logging in - that is usually (and probably best) accomplished by establishing a session on the server and storing a cookie on the client containing a token that the client's browser sends with every request instead of resending their user-id and password. The token tells the server which session is associated with that user. That way you aren't repeatedly sending the user ID and password over the wire. If someone eavesdrops on the session and steals the token, it expires after a certain period so that an attacker will (theoretically) not be able to log back in.

  • Other session-specific state. This is for global-controls analogous to the caps-lock key on the keyboard. All the other controls (keys) are still available to you, but their behavior is slightly different because the state (caps-lock) has changed.

Session Timeout

Sessions need to time out eventually (usually in 30 minutes or so for security reasons). End-users find this incredibly annoying. A little JavaScript can soften the blow:

setTimeout(alert('Your session will expire (silently) in 2 minutes...'), 1680000);

This is still annoying, and not very secure, since the point of having sessions expire is to prevent an unauthorized person from accessing a still-valid session. To a hacker, this is essentially a pop-up, advertising: "You can hijack someone's session if you act now!".

If you're application is mostly RESTful, storing only the login information on the server, you can pop up a login window to allow them to log-back in without loosing whatever they are working on. This is secure because the user ID and password are required. Also, the user does not have to respond within a certain period because they are effectively logging in and creating a new session.

if (confirm('Your session has expired.  Log back in to continue your work?')) {

loginWin = window.open("pre_login.do");
}


Since you do not rely on server side state or cookies for anything but the login, the user can get a form and fill it out in one session, then submit it in a completely different session, so the login pop-up works. If you have a little bit of state on the server and you are clever, you can enhance this solution to send a setting or two from your server-side state from the old session to the login page so that the new session will match the old.

Struts Problem 1: Clearing session-scoped Form Beans

Form beans are session scoped by default. A user GETting the form in preparation for POSTing a new record will see whatever data was left over in the form bean from the last record they edited. You need to work around this by figuring out when the user is going to add a new record and clearing the fields in the form.

Struts Problem 2: Session expiration with session-scoped form beans

Sooner or later the session will time out and the form bean will be completely lost. Writing an Action and pairing it with a log-back-in JavaScript function becomes prohibitively ugly from a maintenance and security standpoint if even a few different fields are required for a few different screens. As far as I can see, there is no reasonable work-around for this scenario, which is why REST is so useful.

Struts Problem 3: Request-scoped Form Beans and the Struts validator

When there is a failed validation on a form bean, the model isn't invoked, so only the fields that the HTML form posted are filled in from the request. Any extra data (breadcrumbs, related records from the database, or other variable text that provides context for the update taking place) is lost unless that data was also built into fields in the form <input type="hidden"... />. But REST is supposed to make things simpler and more lightweight. Building breadcrumbs into URLs requires escaping any special characters, increases both the request and response sizes, and is generally a very ugly solution. Storing these fields in client side cookies is possible too, but very complicated.

A RESTful solution

REST precludes any unnecessary server state and as mentioned above, the validator is not RESTful. So you'll need to declare actions:
<action ... validate="false" scope="request" />
All validation must take place on the action, not the form bean (see Problem 3, above).

Query the database for all the data to build the screen with each request. Any extra load on the database is small because this amounts to making extra queries only for failed validations. The catch is that you need to keep track of whether any data on the form bean is dirty (user has changed it) or clean. Remember, a user can clear a field, so just checking for null or 0 doesn't mean that the data is unchanged.

Fortunately, we already know whether the form data is dirty or not by the method the user used on the request.

GET without a database record number
Clean: Show the user a blank form so that they can add (POST) a new record.

GET with a database record number
Clean: Show the user the current values of the record from the database so that they can update (PUT) the existing record.

PUT
Dirty. If input passes validation, update existing record with changes. If it doesn't pass, return the dirty form plus any "extra" data from the DB for building the screen.

POST
Dirty. If input passes validation, check that the record doesn't exist already as the user can resubmit a post by pressing the back button. This resubmitted post should either be ignored or treated as a put (see above) since the record already exists. If it is a valid POST add a new record to the database. If the input is invalid, return the dirty form plus any "extra" data from the DB for building the screen.


Note: HTML only supports GET (all links) and POST (only in <form method="post"...>). Not surprisingly, most browsers don't support the other methods mentioned in the HTTP spec either. I'm sure people are signing petitions calling for an overhaul of HTML and all browsers, but it's really not an issue to send an extra parameter to designate the "method" of each request. Your application may require some methods that HTTP does not provide, in which case you'd have to send a "method" parameter anyway.

In order for the extra data from the DB to be refreshed on a request-scoped form, the action still needs to be invoked. Something like the following would probably be a good template for Actions:

  1. Check that the user is logged in. This might better be done in a servlet, but it needs to be done.

  2. Check for required parameters (like the unique identifier of the record you are editing). Send failures to a Not Authorized page.

  3. Check that the user is authorized to do what they are trying to do. Send failures to a Not Authorized page.

  4. Validate any other parameters that would make the request just invalid - essentially assertions that your UI is working properly go here. Send failures to a Not Authorized page.

  5. Create your ActionMessages object to hold user-friendly validation errors

  6. User input validation. Accumulate friendly error messages in your errors object.

  7. If there are no errors and the method is POST or PUT, perform update.

  8. If the form is not dirty - fill form fields with data from database. Build or rebuild other data for the screen to show the user and put it on the form bean. Or you might create a new form bean and forward to a different screen after a successful update.


That list is a starting point, but I always code Actions to minimize the number and extent of database queries, so that sometimes shuffles the later stages in that list.

Advantages:

  • You don't have to worry about clearing old data from the form bean (you get a new form bean with each request).

  • You don't need to encode any "extra" data (breadcrumbs, etc.) into the form to make it get included in the request (it's reread from the database).

  • You can log back in elegantly, without loosing what you were working on (all necessary data is reread from the database or carried as form data in the request).

  • You can use any objects you want for building the screen - you are not limited to Strings and other primitive types.

  • This requires less code than any of the other ways I tried solving these issues.


Disadvantages

You can't use the Struts validator. All your validation must be done in the action.

Conclusion

So far, this is the neatest solution I can come up with to all of the above issues. And it can be as RESTful as the design of your application allows (uses a minimum of server-side state). It fails to leverage the struts validator, but I was never eager to perform that much XML configuration anyway. Struts 1 has been around for a long time. People ask much more of a web application now than they used to. Maybe Struts 2 has some easier shortcuts for these same issues?

Friday, January 25, 2008

Converting a Web Application to UTF-8

Overview:

MySQL

MySQL supports only characters from the Basic Multilingual Plane (BMP or Plane 0) which is U+0000 - U+FFFF. The Plane 0 characters all take 2-bytes in UCS-2 or 1-3 bytes in UTF-8.
Source: MySQL 5.0 Reference Manual 8.1.8. Unicode Support.

IMPORTANT: MySQL's default collation for each characterset (not just utf8) ends with a "_ci" which stands for Case Insensitive. This is probably just what you want for everything *except* Base64 encoded password digests (I'm assuming no-one stores plaintext passwords in databases any more). If a 43 character Base64 digest were stored in a case-insensitive column, it would be roughly 5 billion times more vulnerable to hash collisions. That's bad. Using utf8 for user IDs increases security by allowing tens of thousands of characters in the user ID instead of the mere 100 or so safe ASCII characters. Decide whether you want johnDoe and JohnDoe to log into the same account and choose your collation for the user ID column accordingly.

As of this writing, the only case-sensitive utf8 collation is utf8_bin. To find out what collations your installation supports issue:

show collation like 'utf8%';

Don't use utf8_bin for fields that you will sort for the user because letters A-Z come before a-z in this collation, but it is the only collation to use for utf8 case sensitivity. Specify character set and collation explicitly on your password hash column (and optionally the user ID column) so that you are free to change the character set of the database and tables in the future without accidentally locking any users out of your system:

user_identifier varchar(30) character set 'utf8' not null;
password_hash varchar(44) character set 'utf8' collate 'utf8_bin' not null;

A little off-topic, but it's simpler, faster, and smaller to store password digests as varbinary data:

`password_sha_digest` varbinary(64) DEFAULT NULL COMMENT
    'the users password repeatedly hashed with strong SHA and the salt';
`password_salt` varbinary(16) DEFAULT NULL COMMENT
    'the 128-bits of salt the users password was hashed with';

More information on character sets and collations is available on the Character Sets and Collations in General page of the MySQL manual.

When you change the character set of a table in MySQL, every column is explicitly set to take the *old* character set - even if the columns in your tables all previously used whatever the default was! You will need to issue ALTER TABLE statements for every column, or dump the database to file and replace all the character set references. I did the latter with three text replacements:

mysqldump -p -u root --single-transaction my_db > ~/backup_$(date +%Y-%m-%d_%H-%M-%S).sql

Change table-specific character sets (always present)
Replace: " DEFAULT CHARSET=latin1"
With: " DEFAULT CHARSET=utf8"

Remove any column-specific character sets (less common) Replace: " character set latin1"
With: ""

Change all char's to varchar's because char has to store 3 bytes/character in utf8 while varchar can store 1 for short characters:
Replace: " char("
With: " varchar("

You might want to search for "latin1" and see if anything else needs to be changed. Then follow the instructions above and everything will be fine. Hopefully this will save someone else a day's worth of trouble.

Change configuration settings in the system-wide /etc/mysql/my.cnf file (or .my.cnf or my.ini file on Windows):

[client]
default-character-set=utf8
[mysql]
default-character-set=utf8
[mysqld]
default-character-set=utf8

Restart MySQL.

Login to MySQL and use the database in question. Enter the following:

alter database db_name character set utf8;

Test the settings:

mysql>show variables;

character_set_client            | utf8
character_set_connection        | utf8
character_set_database          | utf8
character_set_filesystem        | binary
character_set_results           | utf8
character_set_server            | utf8
character_set_system            | utf8
character_sets_dir              | C:\Program File
collation_connection            | utf8_general_ci
collation_database              | utf8_general_ci
collation_server                | utf8_general_ci

Log out and re-load the database:

mysql -p -u root my_db <~/backup_$(date +%Y-%m-%d_%H-%M-%S).sql

Cygwin

UTF-8 Cygwin seems to add support for Japanese characters, but I can't tell if it supports UTF-8 up to U-FFFF or higher. Maybe I just don't have the fonts on my system to see it!

Log4J

log4j.appender.appenderName.encoding=UTF-8
Source: a comment in email from Don Dwoske

Hibernate:

In hibernate.cfg.xml, set
hibernate.connection.url
to:
jdbc:mysql://host:port/db?autoReconnect=true&amp;useUnicode=true&amp;characterEncoding=UTF-8
Source: More UTF-8 head-thumping with Hibernate 3

Struts:

http://ianpurton.com/struts-utf-8-and-form-submissions/

JSP:

Add the following attributes to the <%@page %> tag: pageEncoding="UTF-8" contentType="text/html; charset=UTF-8"
Source: Tomcat/UTF-8

Tomcat:

In $CATALINA_HOME/conf/server.xml add the following attribute to the <Connector> tag:
URIEncoding="UTF-8"
Source: Tomcat/UTF-8

Java

In your Java code you need to replace
s.getBytes()
with:
s.getBytes("UTF-8")
Source: a comment in email from Don Dwoske

IMPORTANT: You *especially* need to convert bytes that are being sent to your SHA message digest algorithm because otherwise any non-latin characters will come out as "invalid", and they will all have the same hash!  Perform the UTF-8 conversion above, and you will have no problems.  All the ASCII characters keep their old encoding, so existing users' passwords still work.  Not sure what happens to high ASCII characters.  If that's an issue for you, you'll have to check.

Java seems to support the 16 Supplimentary Planes that take an extra 16-bit character to represent.
Source: JSR 204: Unicode Supplementary Character Support

MySQL only supports the BMP (Basic Multilingual Plane or "Plane 0") so I'm going to have to intercept these "high" UTF-8 characters in Java and put the replacement character in the database instead:

safeString = inString.replaceAll("[^\u0000-\uFFFF]", "\uFFFD");

Note: U+FFFD is � - the Replacement Character "used to represent an incoming character whose value is unknown or unrepresentable in Unicode".
Source: Specials Range FFF0-FFFF (PDF). This shows up as a blank box in Internet Explorer (it's supposed to be a white question-mark in a black diamond). I'm sure IE will support it eventually.

I made a routine to clean all my string inputs from the user:

private static final Pattern highUnicodePattern = Pattern.compile("[^\u0000-\uFFFF]");

public static String cleanInputStr(String inStr) {
    if (inStr == null) {
        return null;
    }
    String trimmedStr = inStr.trim();
    if (trimmedStr.length()< 1) {
  return null;
 }

 Matcher highUnicodeMatcher = highUnicodePattern.matcher(trimmedStr);
 return highUnicodeMatcher.replaceAll("\uFFFD");
}

As just shown, I filter any characters above U+FFFF, so I don't have to worry about characters that take more than 2 bytes of storage. I know what you are saying: "But, U+FFFF takes THREE bytes to store in UTF-8, wouldn't it take more than 2 bytes in UTF-16?" Maybe, but Java stores it as 2 bytes, so the Java String.length() still works. If you support characters above U+FFFF, you'll need to replace all your string-length checks:

int charCount = testString.length();
with this:
int oldCount = testString.length();
int charCount = testString.codePointCount(0, oldCount);

Source: Strings - Core Java Technologies Technical Tips

Notes:

There are also some characters that are affected by the neighboring characters (accents and such). I'm not thoroughly convinced that I can count characters by counting the number of times I press the arrow keys.

Even with the Java character-counting hack above, I thought for a while that JavaScript in FireFox and Safari counted more characters than Java in certain circumstances.  Then I discovered that I had line-feeds in my test data, and they were being counted by Java, but not by JavaScript (probably because they wouldn't render in the HTML) - so this "problem" wasn't related to UTF-8 after all!

Emacs

On Windows, GNU Emacs supports all the Unicode characters I could test on it of the box and XEmacs doesn't support unicode at all. For the past 8 years I've been using XEmacs because that's what I started with and I never saw a significant enough difference between GNU and X to bother switching. Ladies and Gentlemen, I am making the switch!

Note: Once you get your .emacs file switched over, make a backup or be careful not to open XEmacs, because it will *nuke* your old .emacs file and leave a new version converted to XEmacs in its place! That alone is so annoying that I don't ever want to switch back.

describe-current-coding-system is a handy function to see what emacs thinks it's doing! Emacs UTF-8 support is not 100% at the time of this writing (4-byte characters sometimes flicker and change), but it does pretty well. The GNU Emacs site says that even better UTF-8 support is planned for the next release.

Fixed-width Windows Fonts

I downloaded Everson Mono Unicode and it really supports every character, but I found it unacceptably hard to read the English/code in Emacs when I made it small enough for coding.

Courier New and Lucida Console seem the best of the ones that come with Windows XP English. They are both missing big chunks of UTF-8, but they have better support than any of the others I found (at least they show some Greek, Russian, Turkish, Arabic, Hebrew, Chinese, Japanese, and Korean). I picked Courier to use with Emacs (does anything else really matter?) because it's smaller and seems a little easier to read on the screen to me. Maybe I'm just used to it.

Humor

From xkcd: Emoticon (read the mouse-over)

Creative Commons License