The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests


StUF – receiving data from a provider where UTF-8 is in fact ISO-8859

Posted by jpluimers on 2009/05/08

Recently, when receiving data from a StUF web service created by a large Dutch provider of government IT systems, we ran into a problem with characters that have their high bit set.

Although the web service claimed to send its data as UTF-8, the data was in fact encoded using a form of ISO-8859.

The most likely character set they used is ISO-8859-1 (since that is the default character set for text content in the HTTP protocol), but it might also be ISO-8859-15, which is an adaptation of ISO-8859-1 that trades some typographic characters for the euro sign, some characters from French, and some characters used for Finnish, Estonian and the transliteration of Russian.
(Note that the printable characters of both ISO-8859-1 and ISO-8859-15 can be displayed by the Windows-1252 code page, although the characters that ISO-8859-15 added sit at different code points there.)
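
To make the difference between these code pages concrete, here is a small sketch. It is written in Python rather than .NET, only because Python's codec names map directly onto these character sets; the byte values are examples I picked, and `decode_all` is a hypothetical helper.

```python
# Python's codec names for ISO-8859-1, ISO-8859-15 and Windows-1252.
ENCODINGS = ("iso-8859-1", "iso-8859-15", "cp1252")

def decode_all(raw: bytes) -> dict:
    """Decode the same bytes with each encoding and return the results."""
    return {name: raw.decode(name) for name in ENCODINGS}

# E4 is "ä" in all three, so for this byte the choice does not matter.
print(decode_all(b"\xe4"))

# A4 is one of the places where ISO-8859-1 and ISO-8859-15 differ:
# a currency sign in ISO-8859-1, the euro sign in ISO-8859-15.
print(decode_all(b"\xa4"))

# Windows-1252 has the euro sign as well, but at code point 80.
print(b"\x80".decode("cp1252"))
```

For data that only contains characters such as ä, the two ISO-8859 variants are indistinguishable; the choice only starts to matter for the handful of positions where they differ.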

Since it is not possible to reliably “guess” the right encoding (there are way too many possibilities; even IsTextUnicode, which is used by Notepad, fails, see below), the only way is to use a fixed re-encoding that depends on the StUF data provider.

Links to posts that describe problems with IsTextUnicode:

Raymond Chen in The Old New Thing:

Michael Kaplan in Sorting it all Out:

If the XML specified the right encoding, it would be possible to reliably detect and use it (see http://stackoverflow.com/questions/637855/how-to-best-detect-encoding-in-xml-file). That is not the case here: the providing party lies.

So let's look at the actual data: how the StUF provider sends it over the line, and what is actually meant.

This is a snippet of the content we received (it is a SOAP response, but I cut down all the non-essential stuff):

000000: 3C 3F 78 6D 6C 20 76 65  72 73 69 6F 6E 3D 22 31 ..Aart Iza.k..

What they meant to pass is the name Aart Izaäk, in which ä has character code E4 in both ISO-8859-1 and ISO-8859-15.
Instead of encoding it as UTF-8, they passed the bare byte E4. A UTF-8 decoder sees E4 (binary 1110 0100) as the first byte of a three-byte character, so it reads the byte sequence E4 6B 3C.
That is an invalid sequence, because every non-first byte of a multi-byte sequence must have the high bit set and the second-highest bit clear (binary 10xxxxxx), and neither 6B nor 3C does (see the table of valid bytes in the UTF-8 Wikipedia article).
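
The bit test described above can be sketched as follows. This is a Python illustration, not .NET code; `is_continuation` and `expected_length` are hypothetical helpers, and the length check is deliberately simplified (it does not reject overlong or out-of-range lead bytes).

```python
def is_continuation(byte: int) -> bool:
    """True when the byte has the form 10xxxxxx (high bit set, next bit clear)."""
    return (byte & 0b1100_0000) == 0b1000_0000

def expected_length(lead: int) -> int:
    """Sequence length announced by a lead byte; 0 if it cannot start a sequence.

    Simplified: only the leading bit pattern is inspected.
    """
    if (lead & 0b1000_0000) == 0:
        return 1
    if (lead & 0b1110_0000) == 0b1100_0000:
        return 2
    if (lead & 0b1111_0000) == 0b1110_0000:
        return 3
    if (lead & 0b1111_1000) == 0b1111_0000:
        return 4
    return 0

received = bytes([0xE4, 0x6B, 0x3C])   # "ä" as ISO-8859-1, followed by "k" and "<"

print(expected_length(received[0]))                 # length announced by E4
print([is_continuation(b) for b in received[1:]])   # do 6B and 3C qualify?

try:
    received.decode("utf-8")           # strict decoding rejects the sequence
except UnicodeDecodeError as error:
    print("invalid UTF-8:", error.reason)
```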

What they should have done is pass the bytes as C3 A4, which is the valid UTF-8 encoding for ä:

000000: 3C 3F 78 6D 6C 20 76 65  72 73 69 6F 6E 3D 22 31 ..Aart Iza..k..
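
A fixed re-encoding of this kind can be sketched in a few lines. This is not the C#/.NET repair code; it is a minimal Python sketch that assumes the provider really uses ISO-8859-1, and `reencode_to_utf8` is a hypothetical name.

```python
def reencode_to_utf8(raw: bytes, actual_encoding: str = "iso-8859-1") -> bytes:
    """Reinterpret the bytes in the provider's real encoding and emit valid UTF-8.

    actual_encoding is an assumption that has to be fixed per data provider;
    it cannot be detected reliably from the bytes alone.
    """
    return raw.decode(actual_encoding).encode("utf-8")

received = b"Aart Iza\xe4k"             # what came over the line
repaired = reencode_to_utf8(received)

print(repaired.hex(" "))                # the single E4 byte has become C3 A4
print(repaired.decode("utf-8"))
```

Because the XML declaration already claims UTF-8, the repaired bytes then match what the declaration says.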

In .NET, when you write a stream with a StreamWriter using Encoding.UTF8, a BOM (Byte Order Mark) is prepended to indicate what kind of Unicode the file contains.
The BOM used here is EF BB BF, indicating the UTF-8 encoding.
You do not strictly need a BOM for XML files, as the encoding of the file should be the same as the encoding specified in the XML header. But it does no harm either:

000000: EF BB BF 3C 3F 78 6D 6C  20 76 65 72 73 69 6F 6E .....Aart Iza..
000040: 6B 3C 2F 76 6F 6F 72 6E  61 6D 65 6E 3E 0D 0A 00 k...
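
The effect of the UTF-8 BOM is easy to reproduce outside .NET. The sketch below uses Python's `utf-8-sig` codec as a stand-in; the XML declaration string is a made-up sample, not the provider's actual header.

```python
import codecs

xml = '<?xml version="1.0" encoding="UTF-8"?>'   # sample declaration (assumed)

with_bom = xml.encode("utf-8-sig")      # "utf-8-sig" writes the EF BB BF signature
without_bom = xml.encode("utf-8")       # plain "utf-8" writes no BOM

print(with_bom[:3].hex(" "))                    # ef bb bf
print(with_bom.startswith(codecs.BOM_UTF8))
print(with_bom[3:] == without_bom)              # identical apart from the BOM

# Decoding with "utf-8-sig" accepts both forms and drops the BOM if present.
print(with_bom.decode("utf-8-sig") == without_bom.decode("utf-8-sig"))
```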

An important thing to note is that the .NET StreamReader does not reject wrong UTF-8. Instead, it processes the data and replaces each invalid UTF-8 sequence with the U+FFFD code point.
This is the Unicode special character called “replacement character”. It marks a character that the Unicode decoder could not decode correctly.
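
The same lenient behaviour can be imitated in Python with `errors="replace"`. This is an analogy to the StreamReader behaviour, not a demonstration of the .NET API itself; the byte string is the tail of the fragment shown in the dumps.

```python
received = b"Aart Iza\xe4k</voornamen>"   # ISO-8859-1 bytes labelled as UTF-8

# errors="replace" raises no exception; the invalid byte becomes U+FFFD.
lenient = received.decode("utf-8", errors="replace")
print(lenient)
print("\ufffd" in lenient)

# The original E4 byte is gone after this step, so the text cannot be repaired
# from the decoded string: converting back only yields a "?" placeholder.
print(lenient.encode("iso-8859-1", errors="replace"))
```

This is why any repair has to start from the raw bytes, before a lenient decoder has touched them.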

A really great reference with Unicode code points is utf8-chartable.de having all the Unicode version 5.1.0 code points including information like UTF-8 byte sequence, HTML encodings, etc.
It shows the U+FFFD code point at this page.

I will go into more detail on how to work with these encoding issues in C#/.NET later, but the conversions below show what I mean.

First the conversion from UTF-8 to UTF-16:

Original: ISO-8859-1 in a UTF-8 disguise:
000000: 3C 3F 78 6D 6C 20 76 65  72 73 69 6F 6E 3D 22 31 ..Aart Iza.k</v
Converted from UTF-8 to UTF-16
Note the FD FF byte sequence at offset 000078 that marks the U+FFFD code point:
000000: FF FE 3C 00 3F 00 78 00  6D 00 6C 00 20 00 76 00 .....
000050: 0A 00 3C 00 76 00 6F 00  6F 00 72 00 6E 00 61 00 ...A.a.r.t.
000070: 20 00 49 00 7A 00 61 00  FD FF 6B 00 3C 00 2F 00  .I.z.a...k......

Then the conversion from UTF-8 to UTF-8:

Original: ISO-8859-1 in a UTF-8 disguise:
000000: 3C 3F 78 6D 6C 20 76 65  72 73 69 6F 6E 3D 22 31 ..Aart Iza.k..
Converted from UTF-8 to UTF-8
Note the EF BF BD byte sequence at offset 00003E that marks the U+FFFD code point:
000000: EF BB BF 3C 3F 78 6D 6C  20 76 65 72 73 69 6F 6E .....Aart Iza..
000040: BD 6B 3C 2F 76 6F 6F 72  6E 61 6D 65 6E 3E 0D 0A .k..
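
The UTF-8 to UTF-8 round trip can be reproduced on just the affected fragment. Again this is a Python sketch standing in for the .NET conversion, with lenient decoding assumed:

```python
received = b"Iza\xe4k"

# Decode leniently as UTF-8, then write the result back out as UTF-8.
round_tripped = received.decode("utf-8", errors="replace").encode("utf-8")

print(received.hex(" "))        # 49 7a 61 e4 6b
print(round_tripped.hex(" "))   # 49 7a 61 ef bf bd 6b
```

The single E4 byte has turned into the three bytes EF BF BD, so the output is valid UTF-8 but the ä is lost.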

Above you can see that the UTF-16 encoded XML is also prefixed with a BOM (Byte Order Mark) indicating what kind of Unicode the file contains.
In this case it is FF FE, indicating the little-endian byte order used on the Intel x86 instruction set architecture.
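
To see how the byte order changes the bytes on disk, here is a short Python sketch (not .NET) that encodes the decoded fragment both ways and compares it with the two possible UTF-16 BOMs:

```python
import codecs

text = "a\ufffdk"   # the decoded fragment, including the replacement character

little = text.encode("utf-16-le")   # explicit byte order, no BOM written
big = text.encode("utf-16-be")

print(codecs.BOM_UTF16_LE.hex(" "), "|", little.hex(" "))   # ff fe | 61 00 fd ff 6b 00
print(codecs.BOM_UTF16_BE.hex(" "), "|", big.hex(" "))      # fe ff | 00 61 ff fd 00 6b

# A reader uses the first two bytes to pick the byte order.
data = codecs.BOM_UTF16_LE + little
print(data.decode("utf-16") == text)   # the "utf-16" codec consumes the BOM
```

In little-endian order U+FFFD is stored as FD FF, which is the byte pair visible in the UTF-16 dump above.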

In a future blog post, I’ll show how to repair this wrong encoding in C#/.NET.

–jeroen

5 Responses to “StUF – receiving data from a provider where UTF-8 is in fact ISO-8859”

  1. […] It reminds me so much about handling StUF. […]

  2. […] also applications talking to the web, and to webservices (one of my own experiences is explained in StUF – receiving data from a provider where UTF-8 is in fact ISO-8859: it shows an example where a vendor does Unicode support really […]

  3. Guus Creuwels said

    Which StUF provider are you using to test this?

  4. ae said

    Where is the future blog post, “I’ll show how to repair this wrong encoding in C#/.NET”…

    How I detect BOM in a file ?

    I have files ANSI, UTF-8 and UTF-8 without BOM like Resource Binary in C#.

    I try get string of byte array of those resources. With UTF-8 I dont get the text right, always wrong, the first byte is 273 value, strange for me…

    any solution ?? thanks..

    • jpluimers said

      I just returned from speaking at the DelphiLive! conference in San Jose with a 5 hour flight delay, so I have to reschedule some planned work for this week.
      So the plan is to upload the ‘future blog post’ somewhere at the beginning of next week.

      Please mail me some of your files so I can have a look to see if my code already fits your needs: @pluimers.com
