Web means Unicode
Posted by jpluimers on 2010/02/12
Google published an interesting graph generated from their internal data based on their indexed web pages.
A quick summary of popular encodings based on the graph:
- Unicode – almost 50% and rapidly rising
- ASCII – 20% and falling
- Western European* – 20% and falling
- Rest – 10% and falling
Conclusion: if you do something with the web, make sure you support Unicode.
If you use Delphi and need help transitioning to Unicode, contact me.
–jeroen
* Western European encodings: Windows-1252, ISO-8859-1 and ISO-8859-15.
Reference: Official Google Blog: Unicode nearing 50% of the web.
Edit: 20100212T1500
Some people mentioned (either in the comments or otherwise) that some sites claim to emit Unicode, but in fact don’t.
This doesn’t relieve you from making sure you support Unicode: Don’t pretend you support Unicode, but do it properly!
Bad Unicode support is not limited to the visible web: it also shows up in applications that talk to the web and in web services. One of my own experiences is described in “StUF – receiving data from a provider where UTF-8 is in fact ISO-8859”, which shows a vendor getting Unicode support badly wrong.
So: when you support Unicode, support it properly.
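As an illustration of the "declared UTF-8, actually ISO-8859" problem, here is a minimal Python sketch that decodes strictly and reports which encoding really worked. The function name `decode_declared_utf8` and the choice of Windows-1252 as the fallback are assumptions made for this example, not something taken from the StUF case.

```python
from typing import Tuple


def decode_declared_utf8(raw: bytes, fallback: str = "cp1252") -> Tuple[str, str]:
    """Decode bytes that are labelled UTF-8 and report the encoding that worked."""
    try:
        # Strict decoding raises on byte sequences that are not valid UTF-8.
        return raw.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        # Assumption: mislabelled data is Western European (Windows-1252).
        return raw.decode(fallback, errors="replace"), fallback


honest = "Zoë".encode("utf-8")          # really UTF-8
mislabelled = "Zoë".encode("iso-8859-1")  # ISO-8859-1 bytes sent as "UTF-8"

print(decode_declared_utf8(honest))
print(decode_declared_utf8(mislabelled))
```

Reporting the encoding that actually worked, instead of silently repairing the text, makes it possible to log the mismatch and take it up with the sender.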
–jeroen
Jolyon Smith said
“Don’t pretend you support Unicode, but do it properly!”
That’s hilarious, because anyone that takes an existing pre-Delphi 2009 application and thinks that “doing Unicode properly” is simply a question of eliminating hints and warnings is only pretending to support Unicode.
jpluimers said
So true!
–jeroen
BarryOw said
The example above contained a metatag example.
It disappeared!!!
jpluimers said
Mail me the example and I’ll try to edit your comment (almost anything at pluimers.com gets to me eventually, but using my first name speeds things up considerably).
–jeroen
BarryOw said
Postscript:
Example:
But I wish Microsoft would finally fix Notepad/Edit to work with Unicode. Real Unicode pages cannot be copied and saved. :(
BarryOw said
This was discussed before.
Apparently UTF-8 is just the tag in the header. UTF-8 is byte-compatible with the ASCII range of a code page, so as long as a page uses only those characters, no one can say the tag is false.
What is happening is that people are beginning to use the tag in the header, or more probably web tools are including the tag for users, especially with the growth of web-based homepage tools and community sites.
But the contents are still code page based.
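A small Python sketch of where that compatibility stops: plain ASCII text produces the same bytes in UTF-8 and in a Western European code page, but a single accented character does not. The helper name `same_bytes` and the sample strings are made up for this illustration.

```python
def same_bytes(text: str, codepage: str = "cp1252") -> bool:
    """True when the text encodes to identical bytes in UTF-8 and the code page."""
    return text.encode("utf-8") == text.encode(codepage)


print(same_bytes("Hello"))  # ASCII only: the bytes are identical
print(same_bytes("café"))   # accented letter: the bytes differ

utf8_bytes = "café".encode("utf-8")      # é becomes two bytes: C3 A9
cp1252_bytes = "café".encode("cp1252")   # é becomes one byte: E9

# Reading real UTF-8 as if it were Windows-1252 produces the familiar garbage.
mojibake = utf8_bytes.decode("cp1252")
print(mojibake)
```

So a page tagged UTF-8 with code page contents only stays correct until the first character outside ASCII appears.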
Olaf Monien said
You should realize, though, that even then it’s quite a bit more than a “tag”. If any of those pages allow some sort of postback, then the postback parameters will be UTF-8 encoded. With customers from all around the world, you (your web site) will need to handle that properly. Unless you want scrambled customer names, cities, etc. in your database.
My point is that if you just see UTF-8 as a “tag”, you will run into trouble very soon.
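To make the postback point concrete, here is a minimal Python sketch using the standard library's `urllib.parse`. The form field names and values are invented for the example; the body is built the way a browser would submit it from a UTF-8 page, then decoded once correctly and once as if it were Windows-1252.

```python
from urllib.parse import parse_qs, urlencode

# A form body as submitted from a UTF-8 page (hypothetical fields).
body = urlencode({"name": "Zoë", "city": "Kraków"})
print(body)

# Decoding the percent-escaped bytes as UTF-8 gives back the original text.
right = parse_qs(body, encoding="utf-8")

# Treating the same bytes as a code page scrambles every non-ASCII character.
wrong = parse_qs(body, encoding="cp1252")

print(right)
print(wrong)
```

The second result is what ends up in the database when the server side still assumes a code page.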