Friday, January 25, 2008

Converting a Web Application to UTF-8

Overview:

MySQL

MySQL supports only characters from the Basic Multilingual Plane (BMP, or Plane 0), which covers U+0000 through U+FFFF. Plane 0 characters all take 2 bytes in UCS-2 or 1-3 bytes in UTF-8.
Source: MySQL 5.0 Reference Manual 8.1.8. Unicode Support.
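
Those byte counts are easy to check from Java. A tiny sketch (UCS-2 isn't a standard Java charset name, so UTF-16BE, which encodes BMP characters identically, stands in for it):

public class Utf8SizeDemo {
    public static void main(String[] args) throws Exception {
        // One sample character each for the 1-, 2-, and 3-byte UTF-8 ranges, plus U+FFFF.
        String[] samples = { "A", "\u00E9", "\u20AC", "\uFFFF" };
        for (String s : samples) {
            System.out.println("U+" + Integer.toHexString(s.charAt(0))
                    + "  UTF-8: " + s.getBytes("UTF-8").length + " bytes"
                    + "  UTF-16BE: " + s.getBytes("UTF-16BE").length + " bytes");
        }
    }
}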

IMPORTANT: MySQL's default collation for each character set (not just utf8) ends with "_ci", which stands for Case Insensitive. This is probably just what you want for everything *except* Base64-encoded password digests (I'm assuming no one stores plaintext passwords in databases any more). If a 43-character Base64 digest were stored in a case-insensitive column, it would be roughly 5 billion times more vulnerable to hash collisions: most Base64 characters are letters, every letter has an upper/lower-case twin, so a case-insensitive comparison accepts billions of equivalent digests instead of exactly one. That's bad. Using utf8 for user IDs increases security by allowing tens of thousands of characters in the user ID instead of the mere 100 or so safe ASCII characters. Decide whether you want johnDoe and JohnDoe to log into the same account and choose the collation for the user ID column accordingly.

As of this writing, the only case-sensitive utf8 collation is utf8_bin. To find out which collations your installation supports, issue:

show collation like 'utf8%';

Don't use utf8_bin for fields that you will sort for the user, because in this collation all of A-Z sorts before any of a-z; it is, however, the only collation that gives you utf8 case sensitivity. Specify the character set and collation explicitly on your password-hash column (and optionally on the user ID column) so that you are free to change the character set of the database and tables in the future without accidentally locking any users out of your system:

user_identifier varchar(30) character set 'utf8' not null;
password_hash varchar(44) character set 'utf8' collate 'utf8_bin' not null;

A little off-topic, but it's simpler, faster, and smaller to store password digests as varbinary data:

`password_sha_digest` varbinary(64) DEFAULT NULL COMMENT
    'the user''s password repeatedly hashed with strong SHA and the salt';
`password_salt` varbinary(16) DEFAULT NULL COMMENT
    'the 128 bits of salt the user''s password was hashed with';
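
Here is a sketch of how those columns get filled from Java - the users table name, its id column, and the single digest pass are my own assumptions (the column comment above suggests hashing repeatedly):

import java.security.MessageDigest;
import java.security.SecureRandom;
import java.sql.Connection;
import java.sql.PreparedStatement;

public class PasswordStore {
    public static void storePassword(Connection conn, long userId, String password)
            throws Exception {
        // 128 bits of random salt, matching the varbinary(16) column.
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);

        // Hash the salt plus the UTF-8 bytes of the password
        // (the explicit "UTF-8" matters - see the Java section below).
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        sha.update(salt);
        byte[] digest = sha.digest(password.getBytes("UTF-8"));

        // Hypothetical 'users' table holding the two varbinary columns above.
        PreparedStatement ps = conn.prepareStatement(
                "UPDATE users SET password_sha_digest = ?, password_salt = ? WHERE id = ?");
        try {
            ps.setBytes(1, digest);   // raw bytes: no Base64, no collation worries
            ps.setBytes(2, salt);
            ps.setLong(3, userId);
            ps.executeUpdate();
        } finally {
            ps.close();
        }
    }
}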

More information on character sets and collations is available on the Character Sets and Collations in General page of the MySQL manual.

When you change the character set of a table in MySQL, every column is explicitly re-declared with the *old* character set - even if all the columns previously just used whatever the table default was! You will need to issue ALTER TABLE statements for every column, or dump the database to a file and replace all the character set references. I did the latter with three text replacements (there's a small Java sketch of the same substitutions after these steps):

mysqldump -p -u root --single-transaction my_db > ~/backup_$(date +%Y-%m-%d_%H-%M-%S).sql

Change table-specific character sets (always present)
Replace: " DEFAULT CHARSET=latin1"
With: " DEFAULT CHARSET=utf8"

Remove any column-specific character sets (less common)
Replace: " character set latin1"
With: ""

Change all chars to varchars, because a char column has to reserve 3 bytes per character in utf8 while a varchar stores only the bytes each character actually needs (1 byte for plain ASCII):
Replace: " char("
With: " varchar("

You might want to search for "latin1" and see if anything else needs to be changed. Then follow the instructions above and everything will be fine. Hopefully this will save someone else a day's worth of trouble.
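
If you'd rather script those replacements than do them in an editor, a throwaway Java sketch like this performs the same three substitutions (the file names are hypothetical, and it assumes the dump is UTF-8, which is mysqldump's default):

import java.io.*;

public class DumpCharsetFixer {
    public static void main(String[] args) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream("backup.sql"), "UTF-8"));
        Writer writer = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream("backup-utf8.sql"), "UTF-8"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // The same three text replacements described above.
                line = line.replace(" DEFAULT CHARSET=latin1", " DEFAULT CHARSET=utf8");
                line = line.replace(" character set latin1", "");
                line = line.replace(" char(", " varchar(");
                writer.write(line);
                writer.write('\n');
            }
        } finally {
            reader.close();
            writer.close();
        }
    }
}

As always with blind text replacement, skim the result (or diff it against the original) before loading it.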

Change the configuration settings in the system-wide /etc/mysql/my.cnf file (or in your ~/.my.cnf, or in my.ini on Windows):

[client]
default-character-set=utf8
[mysql]
default-character-set=utf8
[mysqld]
default-character-set=utf8

Restart MySQL.

Log in to MySQL and use the database in question. Enter the following:

alter database db_name character set utf8;

Test the settings:

mysql>show variables;

character_set_client            | utf8
character_set_connection        | utf8
character_set_database          | utf8
character_set_filesystem        | binary
character_set_results           | utf8
character_set_server            | utf8
character_set_system            | utf8
character_sets_dir              | C:\Program File
collation_connection            | utf8_general_ci
collation_database              | utf8_general_ci
collation_server                | utf8_general_ci

Log out and re-load the edited dump. Use the actual timestamped file name that the mysqldump command above created - re-running the $(date ...) expression now would point at a file that doesn't exist:

mysql -p -u root my_db < ~/backup_YYYY-MM-DD_HH-MM-SS.sql
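
Once it's reloaded, a quick JDBC check from the application side confirms that the Java connection really speaks utf8 (the URL and credentials are placeholders; the driver class name is Connector/J 5's):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CharsetCheck {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");   // pre-JDBC-4 drivers need explicit loading
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/my_db?useUnicode=true&characterEncoding=UTF-8",
                "root", "secret");
        try {
            Statement st = conn.createStatement();
            ResultSet rs = st.executeQuery("SHOW VARIABLES LIKE 'character_set_%'");
            while (rs.next()) {
                // Expect utf8 for client, connection, database, and results.
                System.out.println(rs.getString(1) + " = " + rs.getString(2));
            }
        } finally {
            conn.close();
        }
    }
}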

Cygwin

UTF-8 Cygwin seems to add support for Japanese characters, but I can't tell whether it supports UTF-8 all the way up to U+FFFF or higher. Maybe I just don't have the fonts on my system to see it!

Log4J

log4j.appender.appenderName.encoding=UTF-8
Source: a comment in email from Don Dwoske

Hibernate:

In hibernate.cfg.xml, set
hibernate.connection.url
to:
jdbc:mysql://host:port/db?autoReconnect=true&amp;useUnicode=true&amp;characterEncoding=UTF-8
Source: More UTF-8 head-thumping with Hibernate 3

Struts:

http://ianpurton.com/struts-utf-8-and-form-submissions/
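
A common fix for UTF-8 form submissions in Struts 1 is a servlet filter that forces the request encoding before any form parameter is read. A minimal sketch (this filter is my own illustration, not code from the linked article):

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class Utf8EncodingFilter implements Filter {
    public void init(FilterConfig config) throws ServletException {
        // nothing to configure
    }

    public void doFilter(ServletRequest request, ServletResponse response,
                         FilterChain chain) throws IOException, ServletException {
        // Must run before request.getParameter() is first called,
        // otherwise the container has already decoded the POST body.
        request.setCharacterEncoding("UTF-8");
        chain.doFilter(request, response);
    }

    public void destroy() {
        // nothing to clean up
    }
}

Map it to /* with a <filter> and <filter-mapping> entry in web.xml so it wraps every request ahead of the Struts ActionServlet.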

JSP:

Add the following attributes to the <%@ page %> directive: pageEncoding="UTF-8" contentType="text/html; charset=UTF-8", so it looks something like:

<%@ page pageEncoding="UTF-8" contentType="text/html; charset=UTF-8" %>
Source: Tomcat/UTF-8

Tomcat:

In $CATALINA_HOME/conf/server.xml add the following attribute to the <Connector> tag:
URIEncoding="UTF-8"
Source: Tomcat/UTF-8

Java

In your Java code you need to replace
s.getBytes()
with:
s.getBytes("UTF-8")
Source: a comment in email from Don Dwoske

IMPORTANT: You *especially* need to convert the bytes that are fed to your SHA message digest, because otherwise any non-Latin characters get mapped to the encoder's replacement character, so different passwords can produce identical bytes - and identical hashes!  Perform the UTF-8 conversion above, and you will have no problems.  All the ASCII characters keep their old encoding, so existing users' passwords still work.  I'm not sure what happens to high-ASCII (Latin-1) characters; if that's an issue for you, you'll have to check.
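
A small demonstration of that failure mode - ISO-8859-1 is named explicitly here so the result is deterministic, but the same thing happens with a bare getBytes() on a latin1 platform default:

import java.security.MessageDigest;
import java.util.Arrays;

public class DigestEncodingDemo {
    public static void main(String[] args) throws Exception {
        String p1 = "\u043f\u0430\u0440\u043e\u043b\u044c";   // two different Cyrillic passwords,
        String p2 = "\u044f\u0431\u043b\u043e\u043a\u043e";   // both six characters long

        // A charset that can't represent Cyrillic turns every character into '?',
        // so the two different passwords encode to identical byte arrays...
        System.out.println(Arrays.equals(p1.getBytes("ISO-8859-1"),
                                         p2.getBytes("ISO-8859-1")));        // true

        // ...and identical bytes mean identical hashes. With UTF-8 they differ.
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        System.out.println(Arrays.equals(sha.digest(p1.getBytes("UTF-8")),
                                         sha.digest(p2.getBytes("UTF-8"))));  // false
    }
}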

Java seems to support the 16 Supplementary Planes, whose characters take an extra 16-bit char (a surrogate pair) to represent.
Source: JSR 204: Unicode Supplementary Character Support

MySQL only supports the BMP (Basic Multilingual Plane, or "Plane 0"), so I'm going to have to intercept these supplementary ("high") characters in Java and put the replacement character in the database instead:

safeString = inString.replaceAll("[^\u0000-\uFFFF]", "\uFFFD");

Note: U+FFFD is � - the Replacement Character "used to represent an incoming character whose value is unknown or unrepresentable in Unicode".
Source: Specials Range FFF0-FFFF (PDF). This shows up as a blank box in Internet Explorer (it's supposed to be a white question-mark in a black diamond). I'm sure IE will support it eventually.

I made a routine to clean all my string inputs from the user:

// Matches any code point outside the BMP, i.e. above U+FFFF.
// Requires java.util.regex.Pattern and java.util.regex.Matcher.
private static final Pattern highUnicodePattern = Pattern.compile("[^\u0000-\uFFFF]");

/** Trims user input and replaces any supplementary characters with U+FFFD. */
public static String cleanInputStr(String inStr) {
    if (inStr == null) {
        return null;
    }
    String trimmedStr = inStr.trim();
    if (trimmedStr.length() < 1) {
        return null;
    }
    Matcher highUnicodeMatcher = highUnicodePattern.matcher(trimmedStr);
    return highUnicodeMatcher.replaceAll("\uFFFD");
}

As just shown, I filter out any characters above U+FFFF, so I don't have to worry about characters that take more than one Java char of storage. I know what you are saying: "But U+FFFF takes THREE bytes to store in UTF-8 - wouldn't it take more than 2 bytes in UTF-16?" No: every BMP character, including U+FFFF, is a single 16-bit char in Java, so String.length() still counts characters correctly. If you do support characters above U+FFFF (which Java stores as a two-char surrogate pair), you'll need to replace all your string-length checks:

int charCount = testString.length();
with this:
int oldCount = testString.length();
int charCount = testString.codePointCount(0, oldCount);

Source: Strings - Core Java Technologies Technical Tips
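
For example, with a character from outside the BMP (the choice of U+1D11E, the musical G clef symbol, is just for illustration):

public class CodePointDemo {
    public static void main(String[] args) {
        // U+1D11E lives outside the BMP, so Java stores it as a
        // surrogate pair of two chars.
        String s = "G\uD834\uDD1E";   // "G" followed by the clef

        System.out.println(s.length());                      // 3 chars
        System.out.println(s.codePointCount(0, s.length())); // 2 user-visible characters
    }
}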

Notes:

There are also combining characters (accents and such) that attach to a neighboring character, so I'm not thoroughly convinced that I can count characters by counting the number of times I press the arrow keys.

Even with the Java character-counting hack above, I thought for a while that JavaScript in Firefox and Safari counted more characters than Java in certain circumstances.  Then I discovered that I had line feeds in my test data, and they were being counted by Java but not by JavaScript (probably because they wouldn't render in the HTML) - so this "problem" wasn't related to UTF-8 after all!

Emacs

On Windows, GNU Emacs supports all the Unicode characters I could test on it out of the box, while XEmacs doesn't support Unicode at all. For the past 8 years I've been using XEmacs because that's what I started with, and I never saw a significant enough difference between GNU Emacs and XEmacs to bother switching. Ladies and gentlemen, I am making the switch!

Note: Once you get your .emacs file switched over, make a backup or be careful not to open XEmacs, because it will *nuke* your old .emacs file and leave a new version converted to XEmacs in its place! That alone is so annoying that I don't ever want to switch back.

describe-current-coding-system is a handy function to see what Emacs thinks it's doing! Emacs UTF-8 support is not 100% at the time of this writing (4-byte characters sometimes flicker and change), but it does pretty well. The GNU Emacs site says that even better UTF-8 support is planned for the next release.

Fixed-width Windows Fonts

I downloaded Everson Mono Unicode and it really does support every character, but I found it unacceptably hard to read English text and code in Emacs when I made it small enough for coding.

Courier New and Lucida Console seem to be the best of the fonts that come with English Windows XP. They are both missing big chunks of Unicode, but they have better coverage than any of the others I found (at least they show some Greek, Russian, Turkish, Arabic, Hebrew, Chinese, Japanese, and Korean). I picked Courier to use with Emacs (does anything else really matter?) because it's smaller and seems a little easier to read on screen. Maybe I'm just used to it.

Humor

From xkcd: Emoticon (read the mouse-over)


1 comment:

Anonymous said...

"IMPORTANT: MySQL's default collation for each characterset (not just utf8) ends with a "_ci" which stands for Case Insensitive. This is probably just what you want for everything *except* passwords digests (I'm assuming no-one stores plaintext passwords in databases any more). If a 43 character Base64 digest were stored in a case-insensitive column, it would be roughly 5 billion times more vulnerable to hash collisions. That's bad. Using utf8 for user IDs increases security by allowing tens of thousands of characters in the user ID instead of the mere 100 or so safe ASCII characters. Decide whether you want johnDoe and JohnDoe to log into the same account and choose your collation for the user ID column accordingly."

Collations are rules for comparing characters in a character set.

This warning is not correct.