UTF8 for Chinese, Japanese web apps

The motivation to use UTF8 character encoding in a web application is to be able to maintain a single development environment regardless of language content. I set out with the goal of creating a cheat sheet I could refer back to for UTF8 in the tools underlying a web application — MySQL database and Apache server configuration, plus PHP, Python, and Ruby programming. There’s also some discussion of Ubuntu Linux and Windows XP, and a side note on WordPress.

For a backgrounder on UTF8, see Joel Spolsky, “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).”

Part 1: MySQL

Note: I’m working with MySQL 5.0.45 on Ubuntu GNU/Linux 7.10:

/etc/mysql/my.cnf // Ubuntu and Debian; formerly /etc/my.cnf


You can check the model cnf files in /usr/local/mysql/support-files for other configuration information, but there’s nothing on UTF8. MySQL by default has character_set_server=latin1 and collation_server=latin1_swedish_ci. These can be changed by recompiling using ./configure --with-charset= and --with-collation=. Or mysqld can be started with --character-set-server and --collation-server, or with the corresponding settings in /etc/mysql/my.cnf, as detailed in the previous section. With those cnf settings, restart the MySQL server; now three of the six responses to show variables like 'char%' are “utf8” instead of “latin1.” To get all six, start the client with --default-character-set=utf8, as in mysql -u root -p --default-character-set=utf8. If you forget --default-character-set=utf8, everything above the lower ASCII range is displayed mangled.
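The mangling is easy to reproduce outside MySQL. The Python sketch below is an illustration only: it does not connect to a MySQL server, and the sample string is an arbitrary choice. It encodes a short Japanese string as UTF8 and then decodes the same bytes as latin1, which is roughly what happens when a latin1 client connection displays UTF8 data.

```python
# Illustration only: simulates a latin1 reader looking at UTF-8 bytes.
# No MySQL connection is involved; the sample text is arbitrary.

text = "日本語"                          # three characters outside ASCII

utf8_bytes = text.encode("utf-8")       # the bytes a utf8 column would hold
mangled = utf8_bytes.decode("latin-1")  # the same bytes read as latin1

print(len(text))        # 3 characters
print(len(utf8_bytes))  # 9 bytes: each of these characters takes 3 bytes
print(mangled == text)  # False: the display is garbled

# Plain ASCII is identical in both encodings, so it survives.
ascii_text = "hello"
print(ascii_text.encode("utf-8").decode("latin-1") == ascii_text)  # True

# The damage is reversible as long as the bytes were not altered.
repaired = mangled.encode("latin-1").decode("utf-8")
print(repaired == text)  # True
```

This is why lower ASCII text looks fine under either setting while Chinese and Japanese text does not.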

MySQL uses “CHARACTER SET utf8” as a modifier to database and table definitions. So a model database definition for UTF8 would be:

create database my_database default character set utf8 default collate utf8_general_ci

and a model table definition for UTF8 would be

create table my_table (
  my_id int unsigned not null auto_increment primary key,
  my_string varchar(128)
) engine=InnoDB CHARACTER SET utf8;

See 10.3.2 “Database Character Set and Collation” (http://dev.mysql.com/doc/refman/5.0/en/charset-database.html). If a character set is defined for the database, it is the default for its tables. Note that show create database my_database indicates that it is UTF-8 but describe my_table does not; use show create table my_table to see a table’s character set. Also, when using regular expressions with REGEXP in queries, first be sure to issue set names 'utf8' or the results will be mangled. See 5.11.1 “The Character Set Used for Data and Sorting” (http://dev.mysql.com/doc/refman/5.0/en/character-sets.html).

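Important note for using load data to put Chinese or Japanese text into a database: the server interprets the contents of the file using character_set_database, so that variable affects data imports and should be utf8 before you load UTF8 text.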
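Before an import, it can help to confirm that the file really is UTF8. The Python sketch below is one possible pre-import check, not part of MySQL; the function name is_valid_utf8 and the sample rows are hypothetical. It tests whether a row of bytes decodes cleanly as UTF8, using the same Chinese text once in UTF8 and once in the legacy GBK encoding.

```python
# Hypothetical pre-import check; the function name is illustrative.
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A tab-separated row like one that might be fed to load data.
good_row = "1\t北京\n".encode("utf-8")
# The same text in a legacy encoding (GBK) for comparison.
legacy_row = "1\t北京\n".encode("gbk")

print(is_valid_utf8(good_row))    # True
print(is_valid_utf8(legacy_row))  # False
```

A file that fails this check should be converted to UTF8 first. A file that passes is not guaranteed to be UTF8, but legacy Chinese and Japanese encodings rarely pass by accident.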

Google vs. iPhone vs. Asia vs. U.S. cellular providers

Kevin Delaney’s Wall Street Journal Online blog entry from May 31, 2007, on “The iPhone Needn’t Fear Google, Yet” points out that Google’s cell phone strategy is not to have a phone product per se, like the iPhone, but rather to evolve a services platform.

With the impending iPhone launch, anyone who’s been using cell phones in Asia for the past several years has to wonder, as the Japanese technology business magazine ASCII did in February, what the big deal is. Browser phones without physical keyboards? That’s already been mainstream there for some time. Sliding screen content with your fingernail? Same deal. 2.5G connection speed using EDGE for wireless data? Are we missing something? 3G has been operational for some time in Japan. Why would you want Apple’s phone at three times the price of 2004 model web phones on eBay? Well, because it will have an Apple logo on it. Still, 3G is probably at the top of the desired improvement list, according to Ben Charny’s “Apple Changes the iPhone, But Critics Want More Still,” June 18, 2007 WSJ Online.

On the other hand, Apple’s entry is great because it will help loosen the silo death grip of most senior management in U.S. cellular service providers. They know their business model will have to change, but nobody wants to blink first. Managers who aren’t getting paid to put the current business model at risk are actually letting themselves be quoted, according to the Wall Street Journal’s lead story on June 14, 2007, to the effect that they don’t want to blow owning the silo this time like they did with the Internet. Hello? Some large companies can survive forever without realizing what business they’re in. Verizon thinks they’re going to make money as content providers instead of as service providers?

Meanwhile, the Chinese and Indian cellular markets are rapidly becoming 5 times the size of the U.S. market. I kind of think that the market-driven model over there, where the handset, the service provider, and the services platforms act, sell, and interact with the customer quasi-independently and quasi-cooperatively, is what will eventually take hold in the U.S. as well. So in the Asian context, Apple’s handset is nothing new. But in the U.S. market, it’s the break in the dike.

ハロウィーン: 携帯電話の写真


最近二歳半になったから、孫息子はどこにも小旅行をすることが好きだと思っています. また携帯電話の親指族人になりました.


Halloween: photo by cellphone

This cute pirate is my grandson, looking like he’s enjoying Halloween in Tokyo-Mitaka with my daughter-in-law and son.

Now that he’s turned two and a half, he’s been enjoying lots of excursions, but without turning into a cellphone “thumb tribesman” yet.

As for his Halloween experience, since the world-renowned Studio Ghibli is not far from Mitaka-Kichijouji station, you might think the area probably has a lot of spooks, witches, and pirates, but Mitaka is a great town. You can take a walk and go past cabbage stands.