The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests

Archive for the ‘ISO-8859’ Category

Some words on Unicode in Windows (Delphi, .NET, APIs, etc)

Posted by Jeroen Pluimers on 2012/04/05

O'Reilly book "Unicode Explained: Internationalize Documents, Programs, and Web Sites"

With the growing integration between systems, and the mismatch between those that support Unicode and those that do not, I find that a lot of organisations lack basic Unicode knowledge.

So let's put down a few things that help as a primer and get some confusion out of the way.

Please read the article on Unicode by Joel on Software, and the book Unicode Explained. The book is from 2006, and still very valid.

Unicode

Unicode started in the late 80s of last century as a 16-bit character model.

Somehow lots of people still think Unicode is a 16-bit double-byte character set. It is not. It uses a variable width encoding for storage.

All encodings except the 32-bit ones are variable width. The UTF-16 encoding is a variable width encoding where each code point (not character!, see below why) takes one or two 16-bit words.

This is because – as of Unicode version 2.0 in 1996 – a surrogate pair mechanism was introduced to be able to have more than 64k code points.
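A minimal Python sketch of how UTF-16 spends one or two 16-bit words per code point. The helper names `utf16_code_units` and `to_surrogate_pair` are made up for this illustration; only the standard `str.encode` call is used.

```python
def utf16_code_units(text):
    """Return the 16-bit code units of text when encoded as UTF-16LE."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need surrogates")
    offset = code_point - 0x10000
    # The top 10 bits go into the high surrogate, the low 10 bits into the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# U+0041 fits in one 16-bit word; U+1D11E (musical G clef) needs two.
print([hex(unit) for unit in utf16_code_units("A")])
print([hex(unit) for unit in utf16_code_units("\U0001D11E")])
print([hex(unit) for unit in to_surrogate_pair(0x1D11E)])
```

Counting 16-bit words is therefore not the same as counting code points, let alone counting characters.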

The architecture of Unicode is completely different from traditional single-byte character sets or double-byte character sets.

Unicode distinguishes between code points (the mapping of each character to a numeric ID) and storage/encoding (Windows now uses UTF-16LE, which includes the previously used UCS-2), and it leaves visual representation (glyphs/renderings) to fonts.

Unicode has over a million code points, logically divided into 17 planes, of which the Basic Multilingual Plane has the code points that can be encoded into one 16-bit word.
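The arithmetic behind the planes is simple enough to sketch in Python. The constant and function names below are my own, chosen for illustration.

```python
PLANE_COUNT = 17        # plane 0 is the Basic Multilingual Plane
PLANE_SIZE = 0x10000    # 65,536 code points per plane


def plane_of(code_point):
    """Return the plane number (0..16) a code point belongs to."""
    if not 0 <= code_point < PLANE_COUNT * PLANE_SIZE:
        raise ValueError("not a valid Unicode code point")
    return code_point // PLANE_SIZE


print(PLANE_COUNT * PLANE_SIZE)   # total size of the code space
print(plane_of(ord("A")))         # a BMP character
print(plane_of(0x1D11E))          # a character outside the BMP
```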

There is no font that can display all Unicode code points. By original aim, the first 256 Unicode code points are identical to the ISO 8859-1 character set (which is Windows-28591, not Windows-1252!) for which most fonts can display most characters.
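The difference between ISO 8859-1 and Windows-1252 is easy to see in a small Python sketch. The helper name `decode_both` is hypothetical; the codec names `iso-8859-1` and `cp1252` are the ones Python ships with.

```python
def decode_both(raw):
    """Decode the same bytes as ISO 8859-1 and as Windows-1252."""
    # A few bytes are undefined in Python's cp1252 codec, hence errors="replace".
    return raw.decode("iso-8859-1"), raw.decode("cp1252", errors="replace")


# Every byte value maps to the code point with the same number in ISO 8859-1.
latin1_text = bytes(range(256)).decode("iso-8859-1")
print([ord(ch) for ch in latin1_text] == list(range(256)))

# Byte 0x80 is a control character in ISO 8859-1 but the euro sign in Windows-1252.
print([hex(ord(ch)) for ch in decode_both(b"\x80")])

# Byte 0xE9 (e with acute accent) is the same in both.
print(decode_both(b"\xe9"))
```

The two encodings only differ in the range 0x80..0x9F, which is exactly why mixing them up goes unnoticed for so long.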

I entity Unicode (Windows version)

By now, you probably grasp that Unicode is not an easy thing to get right. And that can be hard, hence people love and hate Unicode at the same time. Maybe I should get the T-Shirt :).

One thing that complicates matters is that Unicode allows for both composed character sequences (a base character followed by combining marks) and precomposed characters. This is one way in which different sequences can be equivalent, so there is Unicode equivalence, for which you need some knowledge of Unicode Normalization (be sure to read this StackOverflow question and this article by Michael Kaplan on Unicode Normalization).
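To make the equivalence concrete, here is a short Python sketch using the standard `unicodedata` module. The helper `equivalent` is a name I made up, and comparing NFC forms is just one reasonable way to test canonical equivalence.

```python
import unicodedata

precomposed = "\u00e9"   # e with acute accent as a single code point
combining = "e\u0301"    # letter e followed by a combining acute accent


def equivalent(left, right):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", left) == unicodedata.normalize("NFC", right)


print(precomposed == combining)             # a plain comparison sees two different strings
print(len(precomposed), len(combining))     # one code point versus two
print(equivalent(precomposed, combining))   # the normalized comparison treats them as equal
```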

There are many Unicode encodings, of which UTF-8 and UTF-16 are the most widely used (and are variable length). UTF-32 is fixed length. All 16-bit and 32-bit encodings can have big-endian and little-endian storage and can use a Byte Order Mark (BOM) to indicate their endianness. Not all software uses BOMs, and there are BOMs for UTF-8 and other encodings as well (for UTF-8 it is not recommended to include a BOM).
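A rough Python sketch of BOM sniffing, using the BOM constants from the standard `codecs` module. The function name `sniff_bom` and the table layout are my own. Treat it as a heuristic: data without a BOM tells you nothing, and UTF-16LE text that starts with U+0000 would be mistaken for UTF-32LE.

```python
import codecs

# UTF-32 must be checked before UTF-16: the UTF-32LE BOM starts with the UTF-16LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding name, BOM length) or (None, 0) when there is no BOM."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


print(sniff_bom("hi".encode("utf-8-sig")))                    # UTF-8 with a BOM
print(sniff_bom(codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")))
print(sniff_bom("hi".encode("utf-8")))                        # no BOM at all
```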

When only parts of your development environment support Unicode strings, you need to be aware of which do and which don’t. For any interface boundary between those, you need to be aware of potential data loss, and need to decide how to cope with that.

For instance, does your database use Unicode or not for character storage? (For Microsoft SQL Server: do you use CHAR/VARCHAR or NCHAR/NVARCHAR? You should aim for NVARCHAR, yes you really should; do not use text, ntext and image.) What do you do while transferring Unicode and non-Unicode text to it? Ask the same questions for Web Services, configuration files, binary storage, message queueing and various other interfaces to the outside world.

The Windows API is almost exclusively Unicode (see this StackOverflow question for more details).
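One way to find out what you would lose at such a boundary is to check, before sending, which characters do not fit the target encoding. The sketch below is in Python; the helper name `lossy_characters` is hypothetical, and the target encodings are just examples of what a non-Unicode system might use.

```python
def lossy_characters(text, target_encoding):
    """List the characters in text that cannot be stored in target_encoding."""
    lost = []
    for ch in text:
        try:
            ch.encode(target_encoding)
        except UnicodeEncodeError:
            lost.append(ch)
    return lost


sample = "Price: \u20ac5 at \u03a9 Caf\u00e9"   # contains a euro sign, an omega and an e-acute

print(lossy_characters(sample, "cp1252"))       # Windows-1252 has the euro sign, not the omega
print(lossy_characters(sample, "iso-8859-1"))   # ISO 8859-1 has neither
print(lossy_characters(sample, "utf-8"))        # UTF-8 can store everything
```

What to do with the characters that do not fit (reject, replace, or escape them) is a decision you need to make per interface.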

Delphi and Unicode

Let’s focus a bit on Delphi now, as the migration towards Unicode at clients raised a few questions over the last couple of months.

One of the key questions is why there are no conversion tools that help you migrate your existing source code to fully embrace Unicode.

The short answer is: because you can’t automate the detection of intent in your codebase.

The longer answer starts with the fact that there are tools that detect parts of your Delphi source that potentially have problems: the compiler hints, warnings and errors that bring your attention to spots that are fishy, are likely to fail, or are plain wrong.

Delphi uses the standard Windows storage format for Unicode text: UTF-16LE.

Next to that, Delphi supports conversion to and from UTF-8 and UTF-32 (in their various endianness forms).

External storage of text is best done as UTF-8 because it doesn’t have endianness, and because of easier exchange of text in ISO-8859-1.

Marco Cantu wrote a very nice whitepaper about Delphi and Unicode, and I did a Delphi Unicode talk at CodeRage 4 and posted a lot of Delphi Unicode links at StackOverflow.

A few extra notes on Delphi and Unicode:

With Delphi string types, stick to the UnicodeString (default string as of Delphi 2009) and AnsiString (default string until Delphi 2007) as their memory management is done by Delphi. WideString management is done by COM, so only use that when you really need to. Also avoid ShortString.

For any interfaces to the external world, you need to decide which ones to keep as generic string, Char, PChar and which ones to fix to AnsiChar/PAnsiChar/AnsiString (plus the accompanying codepage) or to WideChar/PWideChar/UnicodeString.

Of course remnants from the past will catch up with you: if you have Technical Debt from the past, when characters were bytes, and you abused Char/PChar/array-of-char/etc for binary data, you need to fix that and use Byte/PByte/TByteArray/PByteArray instead. It can be costly to pay the accrued debt on that.

–jeroen

Posted in .NET, C#, Delphi, Development, EBCDIC, Encoding, ISO-8859, Software Development, Technical Debt, Unicode, UTF-8 | 2 Comments »

Why SizeOf for character arrays is evil: stackoverflow question “Why does this code fail in D2010, but not D7?”

Posted by Jeroen Pluimers on 2010/05/11

The stackoverflow question “Why does this code fail in D2010, but not D7?” once again shows that SizeOf on character arrays usually is evil.

My point in this posting is that you should always try to write code that is semantically correct.

By writing semantically correct code, you have a much better chance of surviving a major change like a Unicode port.

The code below is semantically wrong: it worked in Delphi 7 by accident, not by design:

Like many Windows API functions, GetTempPath expects the first parameter (called nBufferLength) to be the number of characters, not the number of bytes. Read the rest of this entry »
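The gap between a byte count and a character count can be sketched in Python with `ctypes`. This is not the Delphi code from the question, just an illustration of the same distinction. The variable names are my own, and the actual API call only runs on Windows.

```python
import ctypes
import sys

MAX_PATH = 260  # classic Windows path limit, counted in characters

# A buffer of MAX_PATH wide characters, comparable to a Delphi array of Char.
buffer = ctypes.create_unicode_buffer(MAX_PATH)

size_in_bytes = ctypes.sizeof(buffer)   # the kind of number SizeOf reports
length_in_chars = len(buffer)           # the kind of number Length reports

print(size_in_bytes, length_in_chars)

if sys.platform == "win32":
    # GetTempPathW wants the buffer length in characters, so pass
    # length_in_chars; size_in_bytes would overstate the buffer size.
    ctypes.windll.kernel32.GetTempPathW(length_in_chars, buffer)
    print(buffer.value)
```

With single-byte characters both numbers are the same, which is how the byte count could work by accident; as soon as a character takes more than one byte, the byte count claims a buffer that is larger than it really is.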

Posted in Delphi, Delphi 2005, Delphi 2006, Delphi 2007, Delphi 2009, Delphi 2010, Delphi 3, Delphi 4, Delphi 5, Delphi 6, Delphi 7, Delphi XE, Delphi XE2, Delphi XE3, Development, Encoding, ISO-8859, Software Development, Unicode | Leave a Comment »

Web means Unicode

Posted by Jeroen Pluimers on 2010/02/12

Google published an interesting graph generated from their internal data based on their indexed web pages.

Encodings on the web

A quick summary of popular encodings based on the graph:

  1. Unicode – almost 50% and rapidly rising
  2. ASCII – 20% and falling
  3. Western European* – 20% and falling
  4. Rest – 10% and falling

Conclusion: if you do something with the web, make sure you support Unicode.

When you are using Delphi, and need help with transitioning to Unicode: contact me.

–jeroen

* Western European encodings: Windows-1252, ISO-8859-1 and ISO-8859-15.

Reference: Official Google Blog: Unicode nearing 50% of the web.

Edit: 20100212T1500

Some people mentioned (either in the comments or otherwise) that some sites pretend they emit Unicode, but in fact they don’t.
This doesn’t relieve you from making sure you support Unicode: Don’t pretend you support Unicode, but do it properly!

Examples of bad support for Unicode are not limited to the visible web, but also applications talking to the web, and to webservices (one of my own experiences is explained in StUF – receiving data from a provider where UTF-8 is in fact ISO-8859: it shows an example where a vendor does Unicode support really wrong).
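When you receive data that claims to be UTF-8, decoding it strictly at least makes the problem visible. Below is a defensive Python sketch; the function name `decode_claimed_utf8` is hypothetical, and falling back to ISO-8859-1 is an assumption that only makes sense when you know what the sender really emits.

```python
def decode_claimed_utf8(raw, fallback="iso-8859-1"):
    """Decode bytes labelled as UTF-8; report which encoding actually worked."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # The label was wrong: the bytes are not valid UTF-8.
        return raw.decode(fallback), fallback


honest = "caf\u00e9".encode("utf-8")          # really UTF-8
mislabelled = "caf\u00e9".encode("iso-8859-1")  # ISO-8859-1 bytes labelled as UTF-8

print(decode_claimed_utf8(honest))
print(decode_claimed_utf8(mislabelled))
```

Log every fallback, so you can show the vendor exactly which messages were mislabelled.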

So: when you support Unicode, support it properly.

–jeroen

Posted in .NET, ASP.NET, C#, Database Development, Delphi, Development, Encoding, Firebird, IIS, InterBase, ISO-8859, ISO8859, Prism, SOAP/WebServices, Software Development, SQL Server, Unicode, UTF-8, UTF8, Visual Studio and tools, Web Development | 7 Comments »

CodeRage 4 session material download locations changed – CodeCentral messed up

Posted by Jeroen Pluimers on 2009/09/18

Somehow, CodeCentral managed to not only delete my uploaded CodeRage 4 session materials (the videos are fine!), but also newer uploads with other submissions.

Since I’m in crunch mode to get the BASTA! and DelphiLive 2009 Germany sessions done, and all Embarcadero sites having to do with their membership server perform like a dead horse, I have temporarily moved the downloads, and corrected my earlier post on the downloads.

I have notified Embarcadero of the problems, so I’m pretty sure they are being addressed on their side as well.
Edit 20090919: Embarcadero indicated that between 20090913 and 20090914 there has been a database restore on CodeCentral. Some entries therefore have been permanently lost.

When I get back from the conferences, and CodeCentral is more responsive, I’ll retry uploading it there, and correct the posts.
In the mean time, be sure not to save shortcuts to the current locations, as they are bound to change.

The new download locations can all be found in my xs4all temporary CodeRage 4 2009 download folder.

This is the full list of my CodeRage 4 sessions and places where you can download everything: Read the rest of this entry »

Posted in .NET, ASCII, CodeRage, CommandLine, Conferences, CP437/OEM 437/PC-8, Database Development, Debugging, Delphi, Development, Encoding, Event, Firebird, InterBase, ISO-8859, ISO8859, Prism, Software Development, SQL Server, Unicode, UTF-8, UTF8, Windows-1252 | 1 Comment »

CodeRage 4: session replays are online too!

Posted by Jeroen Pluimers on 2009/09/13

Embarcadero has made available the replays of the CodeRage 4 sessions.
You can find them in the CodeRage 4 sessions overview.

In order to download them from that overview, you must be logged into EDN: you can log in or sign up (which is free).

To make it easier to find all the relevant downloads, below is an overview of my sessions and their links.

Let me know what you use it for, I’m always interested!

Update 20090918: changed the download locations because CodeCentral messed up.
Read the rest of this entry »

Posted in .NET, ASCII, C#, C# 2.0, CodeRage, CommandLine, Conferences, CP437/OEM 437/PC-8, Database Development, Debugging, Delphi, Development, Encoding, Event, Firebird, InterBase, ISO-8859, ISO8859, Java, Prism, Software Development, Unicode, UTF-8, UTF8, Visual Studio and tools, XML, XML/XSD, XSD | 4 Comments »

 