Encoding is hard… so how did the single quote become a circumflexed a followed by Euro sign and trade mark?
Posted by jpluimers on 2016/10/04
A while ago (in fact more than a year), I posted Encoding is hard… to G+ with the picture below.
[Wayback] ftfy (“fixes text for you”, a parody on “fixed that for you”) [Wayback] fixes it, but:
How did the single quote become “’”?
Actually, because of a common “beautification” in many office suites (Microsoft and Open alike), the single quote was a special one: Unicode Character ‘RIGHT SINGLE QUOTATION MARK’ (U+2019), which in UTF-8 is encoded as 0xE2 0x80 0x99.
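To check those three bytes yourself, here is a minimal Python sketch using only the standard library. The variable names are mine, not from any tool mentioned in this post.

```python
import unicodedata

# The "curly" apostrophe that office suites substitute for a plain '
quote = "\u2019"

# Encode the single code point as UTF-8; this yields three bytes
utf8_bytes = quote.encode("utf-8")

print(unicodedata.name(quote))
print(" ".join(f"0x{b:02X}" for b in utf8_bytes))
```

One code point becoming three bytes is exactly what makes the later mix-up visible: a single-byte code page will show three characters for it.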
The “’” characters are these (a full text file with all Unicode code points is at [Wayback] http://unicode.org/Public/UNIDATA/UnicodeData.txt):
- â: [Wayback] Unicode Character ‘LATIN SMALL LETTER A WITH CIRCUMFLEX’ (U+00E2).
- €: [Wayback] Unicode Character ‘EURO SIGN’ (U+20AC).
- ™: [Wayback] Unicode Character ‘TRADE MARK SIGN’ (U+2122).
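If you do not want to search through UnicodeData.txt by hand, Python's standard `unicodedata` module can look the names up. This is just a small sketch; the `mangled` and `described` names are my own.

```python
import unicodedata

# The three characters shown instead of the single quote,
# written as escapes so this file's own encoding cannot interfere
mangled = "\u00e2\u20ac\u2122"

# Pair each character's code point (U+XXXX) with its official Unicode name
described = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in mangled]

for code_point, name in described:
    print(code_point, name)
```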
It becomes much clearer when you look at a different family of encodings: not the various ISO-8859 based ones, but the Windows code pages:
- â: 0xE2 in Windows-1250, Windows-1252, Windows-1254, Windows-1256 and Windows-1258.
- €: 0x80 in Windows-1250, Windows-1252, Windows-1253, Windows-1254, Windows-1255, Windows-1256, Windows-1257 and Windows-1258.
- ™: 0x99 in Windows-1250, Windows-1251, Windows-1252, Windows-1253, Windows-1254, Windows-1255, Windows-1256, Windows-1257 and Windows-1258.
What most likely happened is that the [Wayback] ‘RIGHT SINGLE QUOTATION MARK’ was encoded as UTF-8, the resulting bytes were interpreted as a Windows code page, and the result was then output (in this case on an Android screen, so most likely with another intermediate Unicode step).
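The suspected chain of events can be reproduced in a few lines of Python. This is a sketch of the hypothesis, not a reconstruction of the actual app: I assume Windows-1252 (`cp1252` in Python) as the code page, and the sample word is made up.

```python
# A word with the "beautified" apostrophe (U+2019) instead of a plain '
original = "don\u2019t"

# Step 1: the text is correctly encoded as UTF-8 ...
# Step 2: ... but the bytes are then wrongly decoded as Windows-1252
mangled = original.encode("utf-8").decode("cp1252")

# Undoing the mistake: re-encode with the wrong code page to get the
# original bytes back, then decode them as what they really are (UTF-8)
repaired = mangled.encode("cp1252").decode("utf-8")

print(mangled)
print(repaired)
```

The repair step only works when every mangled character maps back to a byte in the assumed code page. Python's `cp1252` codec leaves a few byte values (0x9D, for example) undefined, so other characters may raise an error instead of round-tripping.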
My conclusion is that someone with Windows configured for one of the regions below had a development infrastructure that did not fully support all the round trips between Unicode and single-byte character sets:
- Windows-1250: Central Europe or Eastern Europe.
- Windows-1252: Western Europe and other regions using the default Latin alphabet.
- Windows-1254: Turkish.
- Windows-1256: Arabic.
- Windows-1258: Vietnamese.
This shows how many regions can get into trouble when there is no proper end-to-end testing in place to catch these errors.
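The list of candidate code pages can be cross-checked by brute force: decode the UTF-8 bytes of U+2019 with each Windows code page and see which ones give “’”. The sketch below assumes Python's built-in `cp1250` through `cp1258` codecs match the Windows tables; the variable names are mine.

```python
# UTF-8 bytes of RIGHT SINGLE QUOTATION MARK: 0xE2 0x80 0x99
utf8_bytes = "\u2019".encode("utf-8")

# The three characters seen in the picture: a-circumflex, euro sign, trade mark
target = "\u00e2\u20ac\u2122"

# Python codec names for Windows-1250 up to and including Windows-1258
code_pages = [f"cp{number}" for number in range(1250, 1259)]

# Keep only the code pages that turn the three bytes into exactly the target;
# errors="replace" avoids an exception should a byte be undefined in a codec
matching = [
    code_page
    for code_page in code_pages
    if utf8_bytes.decode(code_page, errors="replace") == target
]

print(matching)
```

The result should be the same five code pages as in the list above, since those are the only ones where all three byte values map to â, € and ™ at once.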
The [Wayback] UTF-8 Character Debug Tool (later renamed into [Wayback/Archive] UTF-8 Encoding Debugging Chart) can help big time here, but is a bit cryptic and only covers Windows-1252, hence my explanation above.
Three kinds of issues are covered by the debug tool:
–jeroen
PS: these manglings are called Mojibake

A single quote becoming “’”