This list came out of a 2018 discussion with a “zorgkantoor” (Dutch for an office that arranges special long-term health care needs, successor of AWBZ) about their very low (10 megabyte) SMTP message size limit, which they keep even though they expect scanned PDF documents.
Their web-care team presented this limit as normal, so I made a list of limits in their peer group, common world-wide and well-ranked Dutch internet providers.
My plan is to check the progression of these limits over time.
Note these are the gross message sizes including encoded attachments. Since encoding in [WayBack] MIME Base64 – Wikipedia has an overhead of roughly 37% once line breaks are counted (encoded size is about 1.37 times the original size), the unencoded maximum size is at most about 73% of what is listed below.
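To see where the 37% and 73% figures come from, here is a minimal Python sketch. It assumes the attachment is Base64 encoded, wrapped at 76 characters per line with a CRLF after every line, and it ignores message headers and body text, which reduce the usable share a bit further. The function name `mime_base64_size` is made up for this illustration.

```python
import base64
import math

def mime_base64_size(raw_size, line_length=76, newline_len=2):
    """Bytes needed for raw_size bytes of data after Base64 encoding,
    wrapped at line_length characters with newline_len bytes per line break."""
    encoded = 4 * math.ceil(raw_size / 3)      # every 3 raw bytes become 4 characters
    lines = math.ceil(encoded / line_length)   # each line ends with a line break
    return encoded + newline_len * lines

raw = 3_000_000

# Cross-check against the standard library, which also wraps at 76 characters
# but uses a single "\n" instead of CRLF.
assert mime_base64_size(raw, newline_len=1) == len(base64.encodebytes(bytes(raw)))

ratio = mime_base64_size(raw) / raw
print(f"encoded/raw ratio: {ratio:.3f}")            # 1.368
print(f"usable share of a limit: {1 / ratio:.0%}")  # 73%
```

Under these assumptions a 10 megabyte limit leaves a little over 7 megabytes for the actual scanned PDF.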
Bad surprise of the day: SysUtils.TEncoding in XE2+ defaults to ANSI, while in XE it defaulted to UTF-8. Among other things this means that TStringList… – Eric Grange – Google+
+Stefan Glienke Indeed, you’re right. The issue must be deeper somewhere. Don’t have time to investigate too much, I’m bypassing the RTL now (also have to work around the limitation that for utf-8 the TEncoding.GetString method returns an empty string if one character in the buffer isn’t utf-8)
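The failure mode in that comment (one invalid byte spoiling the whole buffer) is easier to picture with a small example. The sketch below is Python, not the Delphi RTL, and it is not the workaround used by the commenter; it only contrasts strict decoding with lenient decoding that substitutes U+FFFD for the offending byte. The helper names are invented for this illustration.

```python
def is_valid_utf8(buffer: bytes) -> bool:
    """Return True when the whole buffer decodes as UTF-8."""
    try:
        buffer.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

def decode_utf8_lenient(buffer: bytes) -> str:
    """Decode as UTF-8, replacing each undecodable byte with U+FFFD."""
    return buffer.decode("utf-8", errors="replace")

# 'v', a valid UTF-8 'ó' (C3 B3), a stray single-byte 'ó' (F3), then 'r'
buffer = b"v\xc3\xb3\xf3r"

print(is_valid_utf8(buffer))        # False
print(decode_utf8_lenient(buffer))  # the stray byte shows up as U+FFFD
```

Lenient decoding keeps the rest of the text readable and makes the bad byte visible, which is usually more useful than silently getting nothing back.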
From quite some time ago, but still very relevant as encoding issues keep occurring:
A while ago, I saw the text “v3/43/4r” in a document. I know it comes from “vóór” (the acute accent emphasises in Dutch), and wonder which encoding failure was applied to get this wrong.
ó is U+00F3, and occupies the same code point (0xF3) in a lot of different encodings (most ISO-8859-* and most western Windows-*).
In CP850 the code point 0xF3 is ¾ (U+00BE), that is the three-quarters character. It is the same in other, less used, codepages (CP775, CP856, CP857, CP858).
The ¾ is sometimes transliterated to 3/4 when the character is not directly available.
And there you are! “vóór” -> “v¾¾r” -> “v3/43/4r”.
The first part (ó -> ¾) is the usual corruption of ANSI vs. OEM codepages in the Western Windows versions (in my country ANSI=Windows-1252, OEM=CP850). You can see it easily creating a file with NOTEPAD, writing vóór and dumping it in a command prompt with type.
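The same two steps can be reproduced without Notepad. This is a minimal Python sketch, assuming Windows-1252 as the ANSI code page and CP850 as the OEM code page as in the answer above. The second step uses Unicode compatibility decomposition (NFKD) as a stand-in for whatever tool did the actual transliteration, which is unknown; both function names are invented for the example.

```python
import unicodedata

def ansi_to_oem_mojibake(text, ansi="cp1252", oem="cp850"):
    """Encode text with the ANSI code page, then misread the bytes as OEM."""
    return text.encode(ansi).decode(oem)

def transliterate_fractions(text):
    """Expand characters such as ¾ into plain ASCII digits and a slash."""
    # NFKD turns ¾ into '3', U+2044 FRACTION SLASH, '4';
    # the fraction slash is then mapped to an ordinary '/'.
    return unicodedata.normalize("NFKD", text).replace("\u2044", "/")

garbled = ansi_to_oem_mojibake("vóór")
print(garbled)                           # v¾¾r
print(transliterate_fractions(garbled))  # v3/43/4r
```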
I’ve worked around this problem by using the Unicode character U+2261 (8801) ≡ IDENTICAL TO, as illustrated below, rather than U+2630 (9776) ☰ TRIGRAM FOR HEAVEN which
Python:
with open(filename, 'r', encoding='utf-16-le') as f:
    for line in f:
        pass
Delphi:
for Line in TLineReader.FromFile(filename, TEncoding.Unicode) do
  ;
This spurred some nice observations and unfounded statements on which encodings should be used, so I posted a bit of history that is included below.
Some tips and observations from the links:
Good old text files are not “good” at Unicode support, and neither are TextFile Device Drivers; nobody has yet written a driver supporting a wide range of encodings.
Good old text files are slow as well, even with a changed SetTextBuffer.
At its release in 1993, Windows NT was very early in supporting Unicode. Development of Windows NT started in 1990, when they opted for UCS-2, which has 2 bytes per character and had a non-required annex on UTF-1.
UTF-1 – that later evolved into UTF-8 – did not even exist at that time. Even UCS-2 was still young: it was designed in 1989. UTF-8 was outlined in late 1992 and became a standard in 1993.
All these archives contain both the cwebp and dwebp precompiled executables, along with the libwebp.a library and C headers (the latter allowing you to add WebP encoding or decoding to your own programs).
Though formatting mangled the registry key to add, the article is interesting: since 2003 (C# Builder 1), you can force the IDE to always save files as UTF-8, which should alleviate a lot of encoding problems.
It beats me why this isn’t the default setting, but below is an example .reg file for Delphi 8 which should be easily transformed to more recent Delphi versions:
So basically (if formatting is kept), you browse to this key (replace Borland with the company for your specific Delphi version, and replace 2.0 by your IDE version):