The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests

  • My badges

  • Twitter Updates

  • My Flickr Stream

  • Pages

  • All categories

  • Enter your email address to subscribe to this blog and receive notifications of new posts by email.

    Join 2,232 other followers

Archive for the ‘Encoding’ Category

Default XML encoding is UTF-8 (or better: utf-8). If it contains other byte sequences, this is an error.

Posted by jpluimers on 2021/01/21

I should have had the below answer when writing about StUF – receiving data from a provider where UTF-8 is in fact ISO-8859.

A while ago, a co-worker did not believe when I told that default XML encoding really is UTF-8 (and tried to force it to utf-8), and that if the content had byte sequences different from the (either specified or default) encoding, it was a problem.

I though I blogged about the default, and where to find it, but apparently, I did not.

My blog had (and has <g>) a truckload of articles mentioning UTF-8, less articles containing UTF-8, encoding and xml, but the ones having UTF-8, default, encoding and xml did not actually tell about a standard that really defines XML uses UTF-8 as default encoding when there is no other encoding information – like BOM (byte order mark), HTTP, or MIME encoding) available.

W3C indeed specifies it. [WayBack] utf 8 – How default is the default encoding (UTF-8) in the XML Declaration? – Stack Overflow has a summary (thanks James Holderness!):

The Short Answer

Under the very specific circumstances of a UTF-8 encoded document with no external encoding information (which I understand from the comments is what you’re interested in), there is no difference between the two declarations.

The long answer is far more interesting though.

and an elaboration:

Read the rest of this entry »

Posted in Development, Encoding, Software Development, UTF-8, UTF8, XML, XML/XSD | Leave a Comment »

Unicode is hard, also for the Delphi compiler and IDE

Posted by jpluimers on 2020/10/13

The Delphi compiler does not see a unicode non-breaking space (0x00A0 as whitespace, and the Delphi IDE does not warn you about it: [WayBack] Delphi revelations #2 – Space characters are not just space characters.

Given that this character was introduced in 1993, I wonder how the compiler tests look like.

These also will not be recognised as whitespace:

Related, as many other tools also do not properly support various whitespace characters:

Via: [WayBack] A Delphi “Aha” experience – Kim Madsen – Google+

–jeroen

Posted in Delphi, Development, Software Development, Unicode | Leave a Comment »

Which encoding failure did encode “vóór” into “v3/43/4r”? – Stack Overflow

Posted by jpluimers on 2020/02/24

From quite some time ago, but still very relevant as encoding issues keep occurring:

A while ago, I saw the text “v3/43/4r” in a document.I know it comes from “vóór” (the acute accent emphasises in Dutch), and wonder which encoding failure was applied to get this wrong.

Source: [WayBackWhich encoding failure did encode “vóór” into “v3/43/4r”? – Stack Overflow

From the [WayBack] answer by rodrigo:

  • ó: is U+00F3, and occupies the same codepoint (0xF3) in a lot of different encodings (most ISO-8859-* and most western Windows-*).
  • In CP850 the codepint 0xF3 is ¾ (U+00BE), that is the three-quarters character. It is the same in other, less used, codepages (CP775, CP856, CP857, CP858).
  • The ¾ is sometimes transliterated to 3/4 when the character is not directly available.

And there you are! “vóór” -> “v¾¾r” -> “v3/43/4r”.

The first part (ó -> ¾) is the usual corruption of ANSI vs. OEM codepages in the Western Windows versions (in my country ANSI=Windows-1252, OEM=CP850). You can see it easily creating a file with NOTEPAD, writing vóór and dumping it in a command prompt with type.

–jeroen

Posted in CP850, Development, Encoding, Software Development, UTF-8, UTF8, Windows-1252 | Leave a Comment »

Hamburger menu character on unicode: use U+2261 instead of U+2630

Posted by jpluimers on 2020/01/27

Not all fonts have Unicode character ☰ [WayBack] Unicode Character ‘TRIGRAM FOR HEAVEN’ (U+2630) as it is in a less common block.

More fonts have Unicode character ≡ [WayBack] Unicode Character ‘IDENTICAL TO’ (U+2261)

The latter is slightly shorter and slightly narrower than the former, but works in way more places.

Via [WayBack] html – Unicode ☰ hamburger not displaying in Android & Chrome – Stack Overflow

I’ve worked around this problem by using the UNICODE character UNICODE U+2261 (8801), ≡ IDENTICAL TO as illustrated below rather than the UNICODE U+2630 (9776) ☰ TRIGRAM FOR HEAVEN which

–jeroen

Posted in Development, Encoding, LifeHacker, Power User, Software Development, Unicode | Leave a Comment »

Delphi, decoding files to strings and finding line endings: some links, some history on Windows NT and UTF/UCS encodings

Posted by jpluimers on 2019/12/31

A while back there were a few G+ threads sprouted by David Heffernan on decoding big files into line-ending splitted strings:

Code comparison:

Python:

with open(filename, 'r', encoding='utf-16-le') as f:
  for line in f:
    pass

Delphi:

for Line in TLineReader.FromFile(filename, TEncoding.Unicode) do
  ;

This spurred some nice observations and unfounded statements on which encodings should be used, so I posted a bit of history that is included below.

Some tips and observations from the links:

  • Good old text files are not “good” with Unicode support, neither are TextFile Device Drivers; nobody has written a driver supporting a wide range of encodings as of yet.
  • Good old text files are slow as well, even with a changed SetTextBuffer
  • When using the TStreamReader, the decoding takes much more time than the actual reading, which means that [WayBack] Faster FileStream with TBufferedFileStream • DelphiABall does not help much
  • TStringList.LoadFromFile, though fast, is a memory allocation dork and has limits on string size
  • Delphi RTL code is not what it used to be: pre-Delphi Unicode RTL code is of far better quality than Delphi 2009 and up RTL code
  • Supporting various encodings is important
  • EBCDIC days: three kinds of spaces, two kinds of hyphens, multiple codepages
  • Strings are just that: strings. It’s about the encoding from/to the file that needs to be optimal.
  • When processing large files, caching only makes sense when the file fits in memory. Otherwise caching just adds overhead.
  • On Windows, if you read a big text file into memory, open the file in “sequential read” mode, to disable caching. Use the FILE_FLAG_SEQUENTIAL_SCAN flag under Windows, as stated at [WayBack] How do FILE_FLAG_SEQUENTIAL_SCAN and FILE_FLAG_RANDOM_ACCESS affect how the operating system treats my file? – The Old New Thing
  • Python string reading depends on the way you read files (ASCII or Unicode); see [WayBack] unicode – Python codecs line ending – Stack Overflow

Though TLineReader is not part of the RTL, I think it is from [WayBack] For-in Enumeration – ADUG.

Encodings in use

It doesn’t help that on the Windows Console, various encodings are used:

Good reading here is [WayBack] c++ – What unicode encoding (UTF-8, UTF-16, other) does Windows use for its Unicode data types? – Stack Overflow

Encoding history

+A. Bouchez I’m with +David Heffernan here:

At its release in 1993, Windows NT was very early in supporting Unicode. Development of Windows NT started in 1990 where they opted for UCS-2 having 2 bytes per character and had a non-required annex on UTF-1.

UTF-1 – that later evolved into UTF-8 – did not even exist at that time. Even UCS-2 was still young: it got designed in 1989. UTF-8 was outlined late 1992 and became a standard in 1993

Some references:

–jeroen

Read the rest of this entry »

Posted in Delphi, Development, Encoding, PowerShell, PowerShell, Python, Scripting, Software Development, The Old New Thing, Unicode, UTF-16, UTF-8, Windows Development | Leave a Comment »

 
%d bloggers like this: