StUF – receiving data from a provider where UTF-8 is in fact ISO-8859 « The Wiert Corner

StUF – receiving data from a provider where UTF-8 is in fact ISO-8859

Posted by jpluimers on 2009/05/08

Recently when receiving information from a StUF webservice created by a large Dutch provider of government IT systems, we had an issue with characters having their high bit set.

Although the web service claimed to send its information as UTF-8, it was in fact encoding it using a form of ISO-8859.

The most likely character set they used is ISO-8859-1 (since that is the default encoding for the HTTP protocol), but it might also be ISO-8859-15, which is an adaptation of ISO-8859-1 that trades some typographic characters for the euro sign, some characters from French, and some characters used for transliteration of Russian, Finnish and Estonian.
(Note that the printable characters of both ISO-8859-1 and ISO-8859-15 can be displayed by the Windows-1252 code page, although the characters that ISO-8859-15 added sit at different byte values there.)
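As a small illustration of how close these character sets are, the following Python sketch decodes two byte values with both encodings. The byte values are picked for illustration only; Python is used here purely as a convenient way to show the tables, it is not what the provider or we used.

```python
# 0xE4 means the same in both character sets; 0xA4 is one of the few bytes that differ.
same_latin1 = b"\xe4".decode("iso-8859-1")     # a with diaeresis
same_latin9 = b"\xe4".decode("iso-8859-15")    # a with diaeresis as well
differs_latin1 = b"\xa4".decode("iso-8859-1")  # generic currency sign
differs_latin9 = b"\xa4".decode("iso-8859-15") # euro sign

# Windows-1252 also contains the euro sign, but at another byte value.
euro_cp1252 = "\u20ac".encode("cp1252")
euro_latin9 = "\u20ac".encode("iso-8859-15")

print(same_latin1, same_latin9, differs_latin1, differs_latin9)
print(euro_cp1252.hex(), euro_latin9.hex())
```

For a name like Aart Izaäk the two ISO character sets give the same result, which is why the exact variant matters less here than the fact that it is not UTF-8.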

Since it is not possible to reliably “guess” the right encoding (there are way too many possibilities; even IsTextUnicode, which is used by Notepad, fails, see below), the only way is to use a fixed re-encoding that depends on the StUF data provider.

Links to posts that describe problems with IsTextUnicode:

Raymond Chen in The Old New Thing:

Michael Kaplan in Sorting it all Out:

If the XML specified the right encoding, then it is possible to reliably detect and use it: http://stackoverflow.com/questions/637855/how-to-best-detect-encoding-in-xml-file, however, that is not the case here: the providing party lies.
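Because the declared encoding cannot be trusted, the fixed re-encoding mentioned earlier boils down to decoding with the encoding the provider really uses and encoding again as UTF-8. Below is a minimal Python sketch of that idea; the function name `reencode_to_utf8` and the default of ISO-8859-1 are my assumptions for illustration, and this is not the C#/.NET repair code.

```python
def reencode_to_utf8(raw: bytes, actual_encoding: str = "iso-8859-1") -> bytes:
    """Decode with the encoding the provider really used, then encode as UTF-8.

    actual_encoding is a per-provider setting (an assumption here), not
    something that is detected from the data.
    """
    return raw.decode(actual_encoding).encode("utf-8")


# Fragment as it arrives on the wire: the a-diaeresis is the single byte E4.
received = b"<voornamen>Aart Iza\xe4k</voornamen>"
repaired = reencode_to_utf8(received)
print(repaired)
```

This only works when it is applied to the raw bytes, before any UTF-8 decoder has touched them.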

So let's look at the actual data, how the StUF provider sends it over the line, and what is actually meant.

This is a snippet of the content we received (it is a SOAP response, but I cut down all the non-essential stuff):

000000: 3C 3F 78 6D 6C 20 76 65  72 73 69 6F 6E 3D 22 31 ..Aart Iza.k..

What they meant to pass is the name Aart Izaäk, in which the ä has character code E4 in both ISO-8859-1 and ISO-8859-15.
Instead, they passed that single byte E4 unchanged. A UTF-8 decoder reads E4 as the first byte of a three-byte character, so it sees the byte sequence E4 6B 3C.
That is an invalid sequence, because the non-first bytes of a UTF-8 byte sequence must have the high bit set and the second-highest bit clear (see the table of valid bytes in this UTF-8 wikipedia article), and neither 6B nor 3C does.
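The bit rules above can be checked by hand. The following Python sketch does so for the bytes in question; the helper names `expected_length` and `is_continuation` are made up for this illustration.

```python
def expected_length(lead: int) -> int:
    """Length of a UTF-8 sequence as announced by its first byte (0 if not a valid lead)."""
    if (lead & 0x80) == 0x00:
        return 1  # 0xxxxxxx: plain ASCII
    if (lead & 0xE0) == 0xC0:
        return 2  # 110xxxxx
    if (lead & 0xF0) == 0xE0:
        return 3  # 1110xxxx
    if (lead & 0xF8) == 0xF0:
        return 4  # 11110xxx
    return 0      # a continuation byte or an invalid lead byte


def is_continuation(byte: int) -> bool:
    """Non-first bytes must look like 10xxxxxx."""
    return (byte & 0xC0) == 0x80


received = bytes([0xE4, 0x6B, 0x3C])  # what was sent
print(expected_length(received[0]))                 # E4 announces three bytes
print([is_continuation(b) for b in received[1:]])   # but 6B and 3C do not continue it
```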

What they should have done is pass the bytes as C3 A4, which is the valid UTF-8 encoding for ä:
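To see where C3 A4 comes from, the two-byte UTF-8 layout can be filled in by hand. This Python sketch is only a worked example for the code point U+00E4; the variable names are mine.

```python
code_point = 0xE4  # U+00E4, the a-diaeresis

# Two-byte UTF-8 layout: 110xxxxx 10xxxxxx.
# The first byte carries the upper bits, the second byte the lower six bits.
lead = 0xC0 | (code_point >> 6)
continuation = 0x80 | (code_point & 0x3F)
manual = bytes([lead, continuation])

print(manual.hex(" ").upper())
```

The result matches what Python's own encoder produces for `"\u00e4".encode("utf-8")`.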

000000: 3C 3F 78 6D 6C 20 76 65  72 73 69 6F 6E 3D 22 31 ..Aart Iza..k..

Actually, when you write a UTF-8 encoded stream in .NET, it will prepend a BOM (Byte Order Mark) indicating what kind of Unicode the file contains.
The BOM used here is EF BB BF, indicating the UTF-8 encoding.
You do not strictly need a BOM for XML files, as the encoding of the file should be the same as the encoding specified in the XML header. But it does no harm either:
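The effect of the BOM is easy to reproduce outside .NET. The Python sketch below uses the `utf-8-sig` codec, which writes and strips the same three bytes; it illustrates the byte layout only and says nothing about how .NET decides when to write a BOM.

```python
import codecs

text = '<?xml version="1.0" encoding="UTF-8"?>'

without_bom = text.encode("utf-8")
with_bom = text.encode("utf-8-sig")  # same bytes, prefixed with EF BB BF

print(with_bom[:3].hex(" ").upper())

# Decoding with utf-8-sig removes the BOM again; plain utf-8 keeps it as U+FEFF.
stripped = with_bom.decode("utf-8-sig")
kept = with_bom.decode("utf-8")
```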

000000: EF BB BF 3C 3F 78 6D 6C  20 76 65 72 73 69 6F 6E .....Aart Iza..
000040: 6B 3C 2F 76 6F 6F 72 6E  61 6D 65 6E 3E 0D 0A 00 k...

An important thing to note is that the .NET StreamReader does not reject wrong UTF-8; instead it processes the stream and replaces each wrong UTF-8 encoding with a U+FFFD code point.
This is the Unicode special character called “replacement character”. It marks a character that the Unicode decoder could not decode correctly.
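The difference between rejecting and replacing can be shown with a short Python sketch. Python's decoder is strict by default, so the lenient behaviour described above has to be requested with `errors="replace"`; the example illustrates the concept and is not the StreamReader itself.

```python
received = b"Aart Iza\xe4k"  # ISO-8859-1 bytes labelled as UTF-8

# Strict decoding refuses the data.
try:
    received.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lenient decoding substitutes the replacement character for the bad byte.
lenient = received.decode("utf-8", errors="replace")

print(strict_ok)
print([hex(ord(ch)) for ch in lenient[-2:]])
```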

A really great reference with Unicode code points is utf8-chartable.de having all the Unicode version 5.1.0 code points including information like UTF-8 byte sequence, HTML encodings, etc.
It shows the U+FFFD code point at this page.

I will go into more detail on how to work with these encoding issues in C#/.NET, but the below conversions will show what I mean.

First the conversion from UTF-8 to UTF-16:

Original: ISO-8859-1 in a UTF-8 disguise:
000000: 3C 3F 78 6D 6C 20 76 65  72 73 69 6F 6E 3D 22 31 ..Aart Iza.k</v
Converted from UTF-8 to UTF-16
Note the FD FF byte sequence at offset 000078 that marks the U+FFFD code point:
000000: FF FE 3C 00 3F 00 78 00  6D 00 6C 00 20 00 76 00 .....
000050: 0A 00 3C 00 76 00 6F 00  6F 00 72 00 6E 00 61 00 ...A.a.r.t.
000070: 20 00 49 00 7A 00 61 00  FD FF 6B 00 3C 00 2F 00  .I.z.a...k......
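The FD FF pair is simply U+FFFD written in little-endian order. The following Python sketch reproduces it for a short fragment rather than the full SOAP response, so the offset it finds is that of the fragment, not the 000078 from the dump above.

```python
import codecs

# Result of a lenient UTF-8 decode of the bytes "Iza" E4 "k".
decoded = "Iza\ufffdk"

# Little-endian UTF-16 preceded by its BOM.
utf16 = codecs.BOM_UTF16_LE + decoded.encode("utf-16-le")

print(utf16.hex(" ").upper())
offset = utf16.find(b"\xfd\xff")
print(offset)
```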

Then the conversion from UTF-8 to UTF-8:

Original: ISO-8859-1 in a UTF-8 disguise:
000000: 3C 3F 78 6D 6C 20 76 65  72 73 69 6F 6E 3D 22 31 ..Aart Iza.k..
Converted from UTF-8 to UTF-8
Note the EF BF BD byte sequence at offset 00003E that marks the U+FFFD code point:
000000: EF BB BF 3C 3F 78 6D 6C  20 76 65 72 73 69 6F 6E .....Aart Iza..
000040: BD 6B 3C 2F 76 6F 6F 72  6E 61 6D 65 6E 3E 0D 0A .k..
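The same round trip for UTF-8 shows why the repair has to happen before a lenient decoder sees the data: once E4 has been turned into U+FFFD, the original byte is gone. This is a minimal Python sketch on a fragment of the name, under the same assumptions as the earlier examples.

```python
received = b"Iza\xe4k"  # the single byte E4 in the middle

# Lenient decode followed by a clean UTF-8 encode.
round_trip = received.decode("utf-8", errors="replace").encode("utf-8")

print(round_trip.hex(" ").upper())

# The replacement character takes three bytes, and E4 can no longer be recovered.
lost = b"\xe4" not in round_trip
```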

Above you can see that the UTF-16 encoded XML is also prepended with a BOM (Byte Order Mark) indicating what kind of Unicode the file contains.
In this case it is FF FE, indicating the little-endian byte ordering used by the Intel x86 instruction set architecture.

In a future blog post, I’ll show how to repair this wrong encoding in C#/.NET

–jeroen

This entry was posted on 2009/05/08 at 19:10 and is filed under Development, Encoding, ISO-8859, ISO8859, Mojibake, Software Development, The Old New Thing, Unicode, UTF-8, UTF8, Windows Development, XML, XML/XSD.

5 Responses to “StUF – receiving data from a provider where UTF-8 is in fact ISO-8859”

inessential.com: Brian’s Stupid Feed Tricks « The Wiert Corner – irregular stream of stuff said

2013/03/20 at 00:51
[…] It reminds me so much about handling StUF. […]

Web means Unicode « The Wiert Corner – Jeroen Pluimers’ irregular stream of Wiert stuff said

2010/02/12 at 17:00
[…] also applications talking to the web, and to webservices (one of my own experiences is explained in StUF – receiving data from a provider where UTF-8 is in fact ISO-8859: it shows an example where a vendor does Unicode support really […]

Guus Creuwels said

2009/08/13 at 15:08
Which StUF provider are you using to test this?

ae said

2009/05/18 at 15:43
Where is the future blog post, “I’ll show how to repair this wrong encoding in C#/.NET”…

How I detect BOM in a file ?

I have files ANSI, UTF-8 and UTF-8 without BOM like Resource Binary in C#.

I try get string of byte array of those resources. With UTF-8 I dont get the text right, always wrong, the first byte is 273 value, strange for me…

any solution ?? thanks..

- jpluimers said
  
  2009/05/18 at 16:20
  I just returned from speaking at the DelphiLive! conference in San Jose with a 5 hour flight delay, so I have to reschedule some planned work for this week.
  So the plan is to upload the ‘future blog post’ somewhere at the beginning of next week.
  
  Please mail me some of your files so I can have a look to see if my code already fits your needs: …@pluimers.com
  
