Encoding is hard… so how did the single quote become a circumflexed a followed by Euro sign and trade mark?

Posted by jpluimers on 2016/10/04

A while ago (in fact more than a year), I posted Encoding is hard… to G+ with the picture below.

[Wayback] ftfy (“fixes text for you”, a parody on “fixed that for you”) [Wayback] fixes it, but:

How did the single quote become “â€™”?

Actually, because of a common “beautification” in many Office suites (Microsoft and Open alike), the single quote was a special one: a Unicode Character ‘RIGHT SINGLE QUOTATION MARK’ (U+2019), which in UTF-8 is encoded as the three bytes 0xE2 0x80 0x99.
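To see where those three bytes come from, here is a minimal sketch of the UTF-8 three-byte layout (1110xxxx 10xxxxxx 10xxxxxx) applied to U+2019. The helper name `utf8_three_bytes` is made up for illustration; it only handles code points in the three-byte range and ignores the surrogate gap.

```python
def utf8_three_bytes(code_point):
    """Encode a code point from U+0800..U+FFFF by hand (illustrative helper)."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("only the three-byte UTF-8 range is handled here")
    return bytes([
        0xE0 | (code_point >> 12),          # 1110xxxx: top 4 bits
        0x80 | ((code_point >> 6) & 0x3F),  # 10xxxxxx: middle 6 bits
        0x80 | (code_point & 0x3F),         # 10xxxxxx: low 6 bits
    ])


by_hand = utf8_three_bytes(0x2019)
by_codec = "\u2019".encode("utf-8")

print(" ".join(f"0x{b:02X}" for b in by_hand))  # 0xE2 0x80 0x99
print(by_hand == by_codec)                      # True
```

The built-in codec gives the same result, so the hand-rolled version is only there to show the bit shuffling.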

The three characters in “â€™” are these (a full text file with all Unicode code points is at [Wayback] http://unicode.org/Public/UNIDATA/UnicodeData.txt):

â: [Wayback] Unicode Character ‘LATIN SMALL LETTER A WITH CIRCUMFLEX’ (U+00E2).
€: [Wayback] Unicode Character ‘EURO SIGN’ (U+20AC).
™: [Wayback] Unicode Character ‘TRADE MARK SIGN’ (U+2122).
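The same lookup can be done locally. This is a small sketch using Python's standard `unicodedata` module; the variable name `names` is just for illustration.

```python
import unicodedata

mangled = "\u00e2\u20ac\u2122"  # the three characters â, € and ™

# Map each character to its code point and official Unicode name.
names = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in mangled]

for code_point, name in names:
    print(code_point, name)
```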

But if you look at a different family of encodings, it becomes much clearer: not the various ISO-8859 based ones, but the Windows code pages:

â: 0xE2 in Windows-1250, Windows-1252, Windows-1254, Windows-1256 and Windows-1258.
€: 0x80 in Windows-1250, Windows-1252, Windows-1253, Windows-1254, Windows-1255, Windows-1256, Windows-1257 and Windows-1258.
™: 0x99 in Windows-1250, Windows-1251, Windows-1252, Windows-1253, Windows-1254, Windows-1255, Windows-1256, Windows-1257 and Windows-1258.
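The three lists above can be checked with a short script. This sketch assumes the `cp1250` to `cp1258` codecs that ship with Python match the Windows code pages; the names `expected`, `matches` and `all_three` are illustrative.

```python
# Each byte of the UTF-8 encoded quote, and the character it shows up as.
expected = {0xE2: "\u00e2", 0x80: "\u20ac", 0x99: "\u2122"}  # â, €, ™

code_pages = [f"cp{number}" for number in range(1250, 1259)]

# For every byte, collect the code pages that decode it to the expected character.
matches = {}
for byte, character in expected.items():
    matches[byte] = [
        page for page in code_pages
        if bytes([byte]).decode(page, errors="replace") == character
    ]
    print(f"0x{byte:02X} -> {character}: {', '.join(matches[byte])}")

# Code pages in which all three bytes decode to the mangled text.
all_three = [
    page for page in code_pages
    if all(page in pages for pages in matches.values())
]
print(all_three)
```

Only the code pages that appear in all three lists can produce the full “â€™” sequence.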

What most likely happened is that the [Wayback] ‘RIGHT SINGLE QUOTATION MARK’ got encoded as UTF-8, those bytes were then interpreted as a Windows Code Page, and the result was output (in this case on an Android screen, so most likely with another intermediate Unicode step).
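That chain of events is easy to reproduce. The sketch below assumes Windows-1252 as the wrongly chosen code page (any of the five matching ones would give the same result for this character) and uses a made-up sample string.

```python
original = "It\u2019s"  # sample text with a RIGHT SINGLE QUOTATION MARK

# Step 1: the text is correctly encoded as UTF-8.
utf8_bytes = original.encode("utf-8")

# Step 2: the bytes are wrongly interpreted as Windows-1252.
mangled = utf8_bytes.decode("cp1252")
print(mangled)  # Itâ€™s

# Undoing it: go back to the bytes, then decode them as what they really are.
repaired = mangled.encode("cp1252").decode("utf-8")
print(repaired == original)  # True
```

The repair only works as long as no byte was lost along the way; once a step replaces an unmappable character with “?”, the original cannot be recovered this way.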

My conclusion is that someone with a Windows system configured for one of the regions below had a development infrastructure that did not correctly handle every round trip between Unicode and single-byte character sets:

Windows-1250: Central Europe or Eastern Europe.
Windows-1252: region using the default Latin Alphabet.
Windows-1254: Turkish.
Windows-1256: Arabic.
Windows-1258: Vietnamese.

This shows how many regions can get into trouble when there is no proper end-to-end testing in place to catch these errors.

The [Wayback] UTF-8 Character Debug Tool (later renamed to [Wayback/Archive] UTF-8 Encoding Debugging Chart) can help big time here, but it is a bit cryptic and only covers Windows-1252, hence my explanation above.

Three kinds of issues are covered by the debug tool:

[Wayback] Encoding Problem 1: Treating UTF-8 Bytes as Windows-1252 or ISO-8859-1

[Wayback] Encoding Problem 2: Incorrect Double Mis-Conversion

[Wayback] Encoding Problem 3: ISO-8859-1 vs Windows-1252
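The three problems can be illustrated with the same quote character. This is a sketch of my reading of the problem titles, not a reproduction of the debug tool itself; the variable names are illustrative.

```python
quote = "\u2019"  # RIGHT SINGLE QUOTATION MARK
utf8_bytes = quote.encode("utf-8")

# Problem 1: UTF-8 bytes treated as Windows-1252.
once = utf8_bytes.decode("cp1252")
print(once)  # â€™

# Problem 2: the mangled text is encoded as UTF-8 and misread a second time.
twice = once.encode("utf-8").decode("cp1252")
print(twice)  # Ã¢â‚¬â„¢

# Problem 3: ISO-8859-1 maps 0x80 and 0x99 to invisible C1 control
# characters, where Windows-1252 has the Euro sign and trade mark sign.
as_latin1 = utf8_bytes.decode("latin-1")
print([f"U+{ord(ch):04X}" for ch in as_latin1])

# A double mis-conversion needs the reverse step applied twice.
repaired = twice.encode("cp1252").decode("utf-8").encode("cp1252").decode("utf-8")
print(repaired == quote)  # True
```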

–jeroen

PS: these manglings are called Mojibake

A single quote becoming “â€™”

This entry was posted on 2016/10/04 at 06:00 and is filed under Development, Encoding, ftfy, ISO-8859, ISO8859, Mojibake, Software Development, Unicode, UTF-8, UTF8, Windows-1252.
