The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests

  • My badges

  • Twitter Updates

  • My Flickr Stream

  • Pages

  • All categories

  • Enter your email address to subscribe to this blog and receive notifications of new posts by email.

    Join 2,428 other followers

Archive for the ‘UTF8’ Category

Default XML encoding is UTF-8 (or better: utf-8). If it contains other byte sequences, this is an error.

Posted by jpluimers on 2021/01/21

I should have had the below answer when writing about StUF – receiving data from a provider where UTF-8 is in fact ISO-8859.

A while ago, a co-worker did not believe when I told that default XML encoding really is UTF-8 (and tried to force it to utf-8), and that if the content had byte sequences different from the (either specified or default) encoding, it was a problem.

I though I blogged about the default, and where to find it, but apparently, I did not.

My blog had (and has <g>) a truckload of articles mentioning UTF-8, less articles containing UTF-8, encoding and xml, but the ones having UTF-8, default, encoding and xml did not actually tell about a standard that really defines XML uses UTF-8 as default encoding when there is no other encoding information – like BOM (byte order mark), HTTP, or MIME encoding) available.

W3C indeed specifies it. [WayBack] utf 8 – How default is the default encoding (UTF-8) in the XML Declaration? – Stack Overflow has a summary (thanks James Holderness!):

The Short Answer

Under the very specific circumstances of a UTF-8 encoded document with no external encoding information (which I understand from the comments is what you’re interested in), there is no difference between the two declarations.

The long answer is far more interesting though.

and an elaboration:

Read the rest of this entry »

Posted in Development, Encoding, Software Development, UTF-8, UTF8, XML, XML/XSD | Leave a Comment »

Bad surprise of the day: SysUtils.TEncoding in XE2+ defaults to ANSI, while in XE it defaulted to UTF-8.

Posted by jpluimers on 2020/03/11

Bad surprise of the day: SysUtils.TEncoding in XE2+ defaults to ANSI, while in XE it defaulted to UTF-8 .Among other things this means that TStringList… – Eric Grange – Google+

Source: Bad surprise of the day: SysUtils.TEncoding in XE2+ defaults to ANSI, while i…

Delphi

Eric Grange's profile photo

+Stefan Glienke Indeed, you’re right. The issue must be deeper somewhere. Don’t have time to investigate too much, I’m bypassing the RTL now (also have to work around the limitation that for utf-8 the TEncoding.GetString method returns an empty string if one character in the buffer isn’t utf-8)

Asbjørn Heid's profile photo

I wouldn’t trust the RTL at all with loading non-ascii text, we’ve had it hang on invalid UTF-8 codes more than once.

–jeroen

Posted in Ansi, Delphi, Development, Encoding, Software Development, UTF-8, UTF8 | Leave a Comment »

Which encoding failure did encode “vóór” into “v3/43/4r”? – Stack Overflow

Posted by jpluimers on 2020/02/24

From quite some time ago, but still very relevant as encoding issues keep occurring:

A while ago, I saw the text “v3/43/4r” in a document.I know it comes from “vóór” (the acute accent emphasises in Dutch), and wonder which encoding failure was applied to get this wrong.

Source: [WayBackWhich encoding failure did encode “vóór” into “v3/43/4r”? – Stack Overflow

From the [WayBack] answer by rodrigo:

  • ó: is U+00F3, and occupies the same codepoint (0xF3) in a lot of different encodings (most ISO-8859-* and most western Windows-*).
  • In CP850 the codepint 0xF3 is ¾ (U+00BE), that is the three-quarters character. It is the same in other, less used, codepages (CP775, CP856, CP857, CP858).
  • The ¾ is sometimes transliterated to 3/4 when the character is not directly available.

And there you are! “vóór” -> “v¾¾r” -> “v3/43/4r”.

The first part (ó -> ¾) is the usual corruption of ANSI vs. OEM codepages in the Western Windows versions (in my country ANSI=Windows-1252, OEM=CP850). You can see it easily creating a file with NOTEPAD, writing vóór and dumping it in a command prompt with type.

–jeroen

Posted in CP850, Development, Encoding, Software Development, UTF-8, UTF8, Windows-1252 | Leave a Comment »

Delphi Galileo IDE (version 8 and up): Force files to be saved as UTF8 – The Oracle at Delphi

Posted by jpluimers on 2019/07/04

Though formatting mangled the registry key to add, the article is interesting: since 2003 (C# Builder 1), you can force the IDE to always save files as UTF8 which should alleviate a lot of encoding problems.

It beats me why this isn’t the default setting, but below is an example .reg file for Delphi 8 which should be easily transformed to more recent Delphi versions:

Windows Registry Editor Version 5.00
[HKEY_CURRENT_USER\Software\Borland\BDS\2.0\Editor]
"DefaultFileFilter"="Borland.FileFilter.UTF8ToUTF8"

So basically (if formatting is kept), you browse to this key (replace Borland with the company for your specific Delphi version, and replace 2.0 by your IDE version):

HKEY_CURRENT_USER\Software\Borland\BDS\16.0\Editor

Then you add a new string value named DefaultFileFilter with value Borland.FileFilter.UTF8ToUTF8

More background [WayBack] The Oracle at Delphi: More IDE secrets – UTF8 and the Editor

The unmangled registry key (and more tips) was from [WayBackBSC Polska: Hidden possibilities of Delphi 8.

Get the list of HKEY_CURRENT_USER paths for your Delphi version at Update to List-Delphi-Installed-Packages.ps1 shows HKCU/HKLM keys and doesn’t truncated fields any more.

–jeroen

Via: [WayBack] Is there any way (IDE expert?) to automatic set encoding of each PAS file in UTF-8 instead of ANSI? – Jacek Laskowski – Google+

Posted in Delphi, Development, Encoding, Software Development, UTF-8, UTF8 | 1 Comment »

UTF-8 support for single byte character sets is beta in Windows and likely breaks a lot of applications not expecting this (via Unicode in Microsoft Windows: UTF-8 – Wikipedia)

Posted by jpluimers on 2018/12/04

Uh-oh: [WayBack] Unicode in Microsoft Windows: UTF-8 – Wikipedia:

Microsoft Windows has a code page designated for UTF-8code page 65001. Prior to Windows 10 insider build 17035 (November 2017),[7] it was impossible to set the locale code page to 65001, leaving this code page only available for:

  • Explicit conversion functions such as MultiByteToWideChar
  • The Win32 console command chcp 65001 to translate stdin/out between UTF-8 and UTF-16.

This means that “narrow” functions, in particular fopen, cannot be called with UTF-8 strings, and in fact there is no way to open all possible files using fopen no matter what the locale is set to and/or what bytes are put in the string, as none of the available locales can produce all possible UTF-16 characters.

On all modern non-Windows platforms, the string passed to fopen is effectively UTF-8. This produces an incompatibility between other platforms and Windows. The normal work-around is to add Windows-specific code to convert UTF-8 to UTF-16 using MultiByteToWideChar and call the “wide” function.[8] Conversion is also needed even for Windows-specific api such as SetWindowText since many applications inherently have to use UTF-8 due to its use in file formats, internet protocols, and its ability to interoperate with raw arrays of bytes.

There were proposals to add new API to portable libraries such as Boost to do the necessary conversion, by adding new functions for opening and renaming files. These functions would pass filenames through unchanged on Unix, but translate them to UTF-16 on Windows.[9] This would allow code to be “portable”, but required just as many code changes as calling the wide functions.

With insider build 17035 and the April 2018 update (nominal build 17134) for Windows 10, a “Beta: Use Unicode UTF-8 for worldwide language support” checkbox appeared for setting the locale code page to UTF-8.[a] This allows for calling “narrow” functions, including fopen and SetWindowTextA, with UTF-8 strings. Microsoft claims this option might break some functions (a possible example is _mbsrev[10]) as they were written to assume multibyte encodings used no more than 2 bytes per character, thus until now code pages with more bytes such as GB 18030 (cp54936) and UTF-8 could not be set as the locale.[11]


  1. Jump up^ [WayBack“UTF-8 in Windows”Stack Overflow. Retrieved July 1, 2011.
  2. Jump up^ [WayBack“Boost.Nowide”.
  3. Jump up^ [WayBackhttps://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/strrev-wcsrev-mbsrev-mbsrev-l
  4. Jump up^ [WayBack“Code Page Identifiers (Windows)”msdn.microsoft.com.

Via [WayBack] Microsoft Windows Beta UTF-8 support for Ansi API could break things. Wiki Article of the Change… – Tommi Prami – Google+

Related, as handling encoding is hard, especially if it is changed or not your default:

–jeroen

Posted in .NET, C, C++, Delphi, Development, Encoding, GB 18030, Power User, Software Development, UTF-16, UTF-32, UTF-8, UTF16, UTF32, UTF8, Windows, Windows 10 | 2 Comments »

 
%d bloggers like this: