Bad surprise of the day: SysUtils.TEncoding in XE2+ defaults to ANSI, while in XE it defaulted to UTF-8 .Among other things this means that TStringList… – Eric Grange – Google+
+Stefan Glienke Indeed, you’re right. The issue must be deeper somewhere. Don’t have time to investigate too much, I’m bypassing the RTL now (also have to work around the limitation that for utf-8 the TEncoding.GetString method returns an empty string if one character in the buffer isn’t utf-8)
From quite some time ago, but still very relevant as encoding issues keep occurring:
A while ago, I saw the text “v3/43/4r” in a document.I know it comes from “vóór” (the acute accent emphasises in Dutch), and wonder which encoding failure was applied to get this wrong.
ó: is U+00F3, and occupies the same codepoint (0xF3) in a lot of different encodings (most ISO-8859-* and most western Windows-*).
In CP850 the codepint 0xF3 is ¾ (U+00BE), that is the three-quarters character. It is the same in other, less used, codepages (CP775, CP856, CP857, CP858).
The ¾ is sometimes transliterated to 3/4 when the character is not directly available.
And there you are! “vóór” -> “v¾¾r” -> “v3/43/4r”.
The first part (ó -> ¾) is the usual corruption of ANSI vs. OEM codepages in the Western Windows versions (in my country ANSI=Windows-1252, OEM=CP850). You can see it easily creating a file with NOTEPAD, writing vóór and dumping it in a command prompt with type.
I’ve worked around this problem by using the UNICODE character UNICODE U+2261 (8801), ≡ IDENTICAL TO as illustrated below rather than the UNICODE U+2630 (9776) ☰ TRIGRAM FOR HEAVEN which
with open(filename, 'r', encoding='utf-16-le') as f:
for line in f:
pass
Delphi:
for Line in TLineReader.FromFile(filename, TEncoding.Unicode) do
;
This spurred some nice observations and unfounded statements on which encodings should be used, so I posted a bit of history that is included below.
Some tips and observations from the links:
Good old text files are not “good” with Unicode support, neither are TextFile Device Drivers; nobody has written a driver supporting a wide range of encodings as of yet.
Good old text files are slow as well, even with a changed SetTextBuffer
At its release in 1993, Windows NT was very early in supporting Unicode. Development of Windows NT started in 1990 where they opted for UCS-2 having 2 bytes per character and had a non-required annex on UTF-1.
UTF-1 – that later evolved into UTF-8 – did not even exist at that time. Even UCS-2 was still young: it got designed in 1989. UTF-8 was outlined late 1992 and became a standard in 1993
All these archives contain both the cwebp and dwebp precompiled executables, along with the libwebp.a library and C headers (the latter allowing you to add WebP encoding or decoding to your own programs).
Though formatting mangled the registry key to add, the article is interesting: since 2003 (C# Builder 1), you can force the IDE to always save files as UTF8 which should alleviate a lot of encoding problems.
It beats me why this isn’t the default setting, but below is an example .reg file for Delphi 8 which should be easily transformed to more recent Delphi versions:
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters
So basically (if formatting is kept), you browse to this key (replace Borland with the company for your specific Delphi version, and replace 2.0 by your IDE version):
Since various terminals have a different mapping from colours in the ANSI escape code colour table, I used the standard HTML colours using (which slightly differs from the Terminal.app screenshot on the right):