Unicode nearing 50% of the web
This graph is from Google internal data, based on our indexing of web pages, and thus may vary somewhat from what other search engines find. However, the trends are pretty clear, and the continued rise in use of Unicode makes it even easier to do the processing for the many languages that we cover.
Searching for "nancials"?
Unicode is growing both in usage and in character coverage. We recently upgraded to the latest version of Unicode, version 5.2 (via ICU and CLDR). This adds over 6,600 new characters: some of mostly academic interest, such as Egyptian Hieroglyphs, but many others for living languages.
We're constantly improving our handling of existing characters. For example, the characters "fi" can be represented either as two characters ("f" and "i") or as a single special display form, the ligature "ﬁ" (U+FB01). A Google search for [financials] or [office] used to not see these as equivalent; to the software, the ligature versions would just look like *nancials and of*ce. There are thousands of characters like this, and they occur in surprisingly many pages on the web, especially generated PDF documents.
But no longer: after extensive testing, we recently turned on support for these and thousands of other characters, so your searches will now also find these documents. These are further steps in our mission to organize the world's information and make it universally accessible and useful.
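To make the ligature example concrete, here is a minimal sketch in Python of one way such characters can be folded to their plain equivalents. It uses Unicode compatibility normalization (NFKC) from the standard library. This is an assumption chosen for illustration, not a description of how Google's search actually does it, and the helper name `fold_compat` is hypothetical.

```python
import unicodedata


def fold_compat(text: str) -> str:
    """Fold compatibility characters (such as ligatures) and case.

    Hypothetical helper for illustration only: NFKC replaces display
    forms like U+FB01 with their plain equivalents ("f" + "i"), and
    casefold() removes case differences.
    """
    return unicodedata.normalize("NFKC", text).casefold()


# "\ufb01" is the "fi" ligature and "\ufb03" is the "ffi" ligature.
with_fi_ligature = "\ufb01nancials"
with_ffi_ligature = "o\ufb03ce"

print(fold_compat(with_fi_ligature))   # financials
print(fold_compat(with_ffi_ligature))  # office

# After folding, a plain query matches text that uses the ligature.
print(fold_compat("Financials") == fold_compat(with_fi_ligature))  # True
```

If both the indexed text and the query go through the same folding step, a page containing the ligature form can be found by a search typed with ordinary letters.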
And we're angling for a party when Unicode hits 50%!
Mark Davis, Senior International Software Architect
googleblog.blogspot.com
published @ January 29, 2010