UTF8 for Chinese, Japanese web apps

The motivation to use UTF8 character encoding in a web application is to be able to maintain a single development environment regardless of language content. I set out with the goal of creating a cheat sheet I could refer back to for UTF8 in the tools underlying a web application — MySQL database and Apache server configuration, plus PHP, Python, and Ruby programming. There’s also some discussion of Ubuntu Linux and Windows XP, and a side note on WordPress.

For a backgrounder on UTF8, see Joel Spolsky, “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).”

Part 1: MySQL

Note: I’m working with MySQL 5.0.45 on Ubuntu GNU/Linux 7.10:

/etc/mysql/my.cnf // Ubuntu and Debian; formerly /etc/my.cnf

[mysqld]
character_set_server=utf8
character_set_filesystem=utf8

You can check the model cnf files in /usr/local/mysql/support-files for other configuration information, but there’s nothing on UTF8 there. By default, MySQL has character_set_server=latin1 and collation_server=latin1_swedish_ci. These defaults can be changed by recompiling with ./configure --with-charset= and --with-collation=. Alternatively, mysqld can be started with --character-set-server and --collation-server, or with the corresponding settings in /etc/mysql/my.cnf, as shown above.

With those cnf settings in place, restart the MySQL server. Now three of the six responses to show variables like 'char%' are “utf8” instead of “latin1.” To get all six, start the client with --default-character-set=utf8, as in mysql -u root -p --default-character-set=utf8. If you forget this option, everything above the lower ASCII range is displayed mangled.
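To see why a mismatched client character set mangles only the non-ASCII text, here is a minimal Python sketch. It is an illustration under assumptions, not MySQL's actual code path: the sample strings are made up, and the "client" is simulated by decoding UTF-8 bytes as latin1.

```python
# Hypothetical sample data: one CJK string and one plain ASCII string.
text = "日本語"
ascii_text = "my_table"

# The bytes as they would be stored or sent when the data is UTF-8.
utf8_bytes = text.encode("utf-8")

# Simulate a client that wrongly interprets those bytes as latin1:
# every byte becomes its own character, so the display is garbage.
mangled = utf8_bytes.decode("latin-1")

# ASCII survives the same mistake, because ASCII bytes are identical
# in both encodings.
ascii_seen = ascii_text.encode("utf-8").decode("latin-1")

# The damage here is only in the interpretation; reversing the wrong
# decode step recovers the original text.
restored = mangled.encode("latin-1").decode("utf-8")

print(len(text), len(utf8_bytes), len(mangled))
print(ascii_seen == ascii_text, restored == text)
```

Each of the three characters occupies three bytes in UTF-8, so the simulated latin1 client sees nine unrelated characters, while the ASCII string comes through unchanged.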

MySQL uses “CHARACTER SET utf8” as a modifier to database and table definitions. So a model database definition for UTF8 would be:

create database my_database default character set utf8 default collate utf8_general_ci;

and a model table definition for UTF8 would be

create table my_table (
my_id int unsigned not null auto_increment primary key,
my_string varchar(128)
) engine=InnoDB CHARACTER SET utf8;

See 10.3.2 “Database Character Set and Collation” (http://dev.mysql.com/doc/refman/5.0/en/charset-database.html). If a character set is defined for the database, it is the default for its tables. Note that show create database my_database indicates that the database is UTF-8, but describe my_table does not show the table’s character set; use show create table my_table to see it. Also, when using regular expressions with REGEXP in queries, first be sure to issue set names 'utf8' or the results will be mangled. See 5.11.1 “The Character Set Used for Data and Sorting” (http://dev.mysql.com/doc/refman/5.0/en/character-sets.html).

Important note for using load data to put Chinese or Japanese text into a database: character_set_database affects data imports.
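Since the import depends on the file's bytes matching the character set MySQL expects, it can help to generate and check the import file explicitly. The following is a rough Python sketch, assuming the my_table layout from above; the sample rows and the file name my_table.tsv are hypothetical. It writes tab-separated, newline-terminated lines, which is the default format load data reads.

```python
import os
import tempfile

# Hypothetical sample rows for (my_id, my_string).
rows = [(1, "北京"), (2, "東京")]

# Write the file explicitly as UTF-8 with "\n" line endings,
# regardless of the platform's default encoding or newline style.
path = os.path.join(tempfile.mkdtemp(), "my_table.tsv")
with open(path, "w", encoding="utf-8", newline="\n") as f:
    for my_id, my_string in rows:
        f.write(f"{my_id}\t{my_string}\n")

# Read the raw bytes back and confirm they are valid UTF-8;
# decode() raises UnicodeDecodeError if they are not.
with open(path, "rb") as f:
    raw = f.read()
decoded = raw.decode("utf-8")

print(path, len(raw), "bytes")
```

A file produced this way should be imported into a database whose character_set_database is utf8; if the two disagree, the Chinese or Japanese text is likely to arrive mangled.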
