
Archive for the ‘Encoding’ Category

Bad surprise of the day: SysUtils.TEncoding in XE2+ defaults to ANSI, while in XE it defaulted to UTF-8.

Posted by jpluimers on 2020/03/11

Bad surprise of the day: SysUtils.TEncoding in XE2+ defaults to ANSI, while in XE it defaulted to UTF-8. Among other things this means that TStringList… – Eric Grange – Google+

Source: Bad surprise of the day: SysUtils.TEncoding in XE2+ defaults to ANSI, while i…

Delphi

Eric Grange:

+Stefan Glienke Indeed, you’re right. The issue must be deeper somewhere. Don’t have time to investigate too much, I’m bypassing the RTL now (also have to work around the limitation that for utf-8 the TEncoding.GetString method returns an empty string if one character in the buffer isn’t utf-8)

Asbjørn Heid:

I wouldn’t trust the RTL at all with loading non-ascii text, we’ve had it hang on invalid UTF-8 codes more than once.
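Both comments come down to the same habit: name the encoding explicitly and decide up front what happens to invalid bytes. A minimal sketch of that idea in Python rather than Delphi; `decode_text` is a hypothetical helper and the sample bytes are made up for illustration:

```python
def decode_text(data: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes with an explicitly named encoding.

    Invalid sequences become U+FFFD (the replacement character) instead of
    making the whole buffer unusable.
    """
    return data.decode(encoding, errors="replace")


# Valid UTF-8 followed by one stray byte; 0xFF can never occur in UTF-8.
sample = "vóór".encode("utf-8") + b"\xff" + b" ok"

try:
    sample.decode("utf-8")  # strict mode raises on the invalid byte
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

print(strict_ok)
print(decode_text(sample))
```

Whether replacing, skipping or rejecting invalid bytes is right depends on the data; the point is only that the choice is made visibly instead of by a default that changes between versions.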

–jeroen

Posted in Ansi, Delphi, Development, Encoding, Software Development, UTF-8, UTF8 | Leave a Comment »

Which encoding failure did encode “vóór” into “v3/43/4r”? – Stack Overflow

Posted by jpluimers on 2020/02/24

From quite some time ago, but still very relevant as encoding issues keep occurring:

A while ago, I saw the text “v3/43/4r” in a document. I know it comes from “vóór” (the acute accent emphasises in Dutch), and wonder which encoding failure was applied to get this wrong.

Source: [WayBack] Which encoding failure did encode “vóór” into “v3/43/4r”? – Stack Overflow

From the [WayBack] answer by rodrigo:

ó: is U+00F3, and occupies the same codepoint (0xF3) in a lot of different encodings (most ISO-8859-* and most western Windows-*).

In CP850 the codepoint 0xF3 is ¾ (U+00BE), that is the three-quarters character. It is the same in other, less used, codepages (CP775, CP856, CP857, CP858).

The ¾ is sometimes transliterated to 3/4 when the character is not directly available.

And there you are! “vóór” -> “v¾¾r” -> “v3/43/4r”.

The first part (ó -> ¾) is the usual corruption of ANSI vs. OEM codepages in the Western Windows versions (in my country ANSI=Windows-1252, OEM=CP850). You can see it easily creating a file with NOTEPAD, writing vóór and dumping it in a command prompt with type.
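The chain in the answer can be replayed in a few lines of Python. This is a sketch, assuming Windows-1252 as the ANSI code page and CP850 as the OEM code page (as in the answer); the function names are made up for illustration:

```python
def ansi_to_oem_mixup(text: str) -> str:
    """Write text as Windows-1252 (ANSI) and read it back as CP850 (OEM)."""
    return text.encode("cp1252").decode("cp850")


def transliterate_fractions(text: str) -> str:
    """Replace the three-quarters character by its ASCII spelling."""
    return text.replace("\u00be", "3/4")


garbled = ansi_to_oem_mixup("vóór")
print(garbled)
print(transliterate_fractions(garbled))
```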

–jeroen

Posted in CP850, Development, Encoding, Software Development, UTF-8, UTF8, Windows-1252 | Leave a Comment »

Hamburger menu character on unicode: use U+2261 instead of U+2630

Posted by jpluimers on 2020/01/27

Not all fonts have Unicode character ☰ [WayBack] Unicode Character ‘TRIGRAM FOR HEAVEN’ (U+2630) as it is in a less common block.

More fonts have Unicode character ≡ [WayBack] Unicode Character ‘IDENTICAL TO’ (U+2261)

The latter is slightly shorter and slightly narrower than the former, but works in way more places.
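For reference, a small Python sketch that prints the code point, decimal value and official name of both characters, using the standard `unicodedata` module; `describe` is a hypothetical helper name:

```python
import unicodedata


def describe(ch: str) -> str:
    """Return 'U+XXXX (decimal) NAME' for a single character."""
    return f"U+{ord(ch):04X} ({ord(ch)}) {unicodedata.name(ch)}"


for ch in ("\u2630", "\u2261"):
    print(ch, describe(ch))
```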

Via [WayBack] html – Unicode ☰ hamburger not displaying in Android & Chrome – Stack Overflow

I’ve worked around this problem by using the UNICODE character UNICODE U+2261 (8801), ≡ IDENTICAL TO as illustrated below rather than the UNICODE U+2630 (9776) ☰ TRIGRAM FOR HEAVEN which

–jeroen

Posted in Development, Encoding, LifeHacker, Power User, Software Development, Unicode | Leave a Comment »

CSL Bearware 302658 DCF clock manual

Posted by jpluimers on 2020/01/13

The manual for the CSL Bearware 302658 clock that uses the DCF77 signal is at [WayBack] Bearware_Manual-302658-20161220FZ004.pdf.

I like the relatively large 3.3 inch display and the blue background.

You can get the clock here:

More on the signal, transmitter and encoding: DCF77 – Wikipedia.

–jeroen


Posted in DCF77, DCF77, Development, Encoding, Hardware, LifeHacker, Power User, Software Development | Leave a Comment »

Delphi, decoding files to strings and finding line endings: some links, some history on Windows NT and UTF/UCS encodings

Posted by jpluimers on 2019/12/31

A while back there were a few G+ threads started by David Heffernan on decoding big files into strings split on line endings:

[WayBack] Having been a little underwhelmed by the performance of TStreamReader when reading huge text files line by line, I attempted to roll my own. I managed … – David Heffernan – Google+ where he compares the speed with Python (which runs circles around equivalent Delphi code)
[WayBack] I just read this in TEncoding.GetBufferEncoding: function ContainsPreamble(const Buffer, Signature: array of Byte): Boolean; var I: Integer; … – David Heffernan – Google+ (worse: Delphi 2009 and up have 3 different implementations of the ContainsPreamble function)

Code comparison:

Python:

with open(filename, 'r', encoding='utf-16-le') as f:
  for line in f:
    pass

Delphi:

for Line in TLineReader.FromFile(filename, TEncoding.Unicode) do
  ;

This spurred some nice observations and unfounded statements on which encodings should be used, so I posted a bit of history that is included below.

Some tips and observations from the links:

Good old text files are not “good” with Unicode support, neither are TextFile Device Drivers; nobody has written a driver supporting a wide range of encodings as of yet.
Good old text files are slow as well, even with a changed SetTextBuffer
When using the TStreamReader, the decoding takes much more time than the actual reading, which means that [WayBack] Faster FileStream with TBufferedFileStream • DelphiABall does not help much
TStringList.LoadFromFile, though fast, is a memory allocation dork and has limits on string size
Delphi RTL code is not what it used to be: pre-Delphi Unicode RTL code is of far better quality than Delphi 2009 and up RTL code
Supporting various encodings is important
EBCDIC days: three kinds of spaces, two kinds of hyphens, multiple codepages
Strings are just that: strings. It’s about the encoding from/to the file that needs to be optimal.
When processing large files, caching only makes sense when the file fits in memory. Otherwise caching just adds overhead.
On Windows, if you read a big text file into memory, open the file in “sequential read” mode, to disable caching. Use the FILE_FLAG_SEQUENTIAL_SCAN flag under Windows, as stated at [WayBack] How do FILE_FLAG_SEQUENTIAL_SCAN and FILE_FLAG_RANDOM_ACCESS affect how the operating system treats my file? – The Old New Thing
Python string reading depends on the way you read files (ASCII or Unicode); see [WayBack] unicode – Python codecs line ending – Stack Overflow
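Since the threads are about separating the reading from the decoding, here is a rough Python sketch of that split: read raw chunks, decode them incrementally, then cut lines. It is not the code from the threads; `iter_lines` and the chunk size are assumptions made for illustration:

```python
import codecs
import io


def iter_lines(stream, encoding="utf-16-le", chunk_size=64 * 1024):
    """Yield decoded lines (line endings stripped) from a binary stream.

    Handles LF and CRLF; a lone CR is not treated as a line break.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    pending = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # The incremental decoder keeps bytes of a code unit that straddles
        # two chunks until the rest arrives.
        pending += decoder.decode(chunk)
        *complete, pending = pending.split("\n")
        for line in complete:
            yield line.rstrip("\r")
    pending += decoder.decode(b"", final=True)
    if pending:
        yield pending.rstrip("\r")


data = "first\r\nsecond é\nthird".encode("utf-16-le")
# A deliberately odd chunk size to show that split code units survive.
print(list(iter_lines(io.BytesIO(data), chunk_size=3)))
```

For a real file, pass `open(filename, "rb")` as the stream. Whether this beats the built-in text mode depends on the Python version and the data, so measure before relying on it.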

Though TLineReader is not part of the RTL, I think it is from [WayBack] For-in Enumeration – ADUG.

Encodings in use

It doesn’t help that on the Windows Console, various encodings are used:

Most tools still use Code Page 437 (from the good old DOS days, also known as CP437, OEM-US, OEM 437, PC-8, or DOS Latin US.)
Unicode aware applications often use UTF-8
A minority (but growing because of PowerShell: [WayBack] utf 8 – Changing PowerShell’s default output encoding to UTF-8 – Stack Overflow) of tools uses UTF-16 because Windows Unicode support started with UCS-2 with an in-memory little-endian representation (.REG files, MSXML, SQL Server Management Studio)
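To see why mixing these matters, a short Python sketch that encodes one character in each of the three encodings and then reads the UTF-8 bytes as if they were Code Page 437; the variable names are made up for illustration:

```python
text = "é"

cp437_bytes = text.encode("cp437")       # one byte in the DOS code page
utf8_bytes = text.encode("utf-8")        # two bytes
utf16_bytes = text.encode("utf-16-le")   # one 16-bit code unit, little-endian

# A CP437 console showing UTF-8 output renders box-drawing debris.
utf8_shown_as_cp437 = utf8_bytes.decode("cp437")

print(cp437_bytes.hex(), utf8_bytes.hex(), utf16_bytes.hex())
print(utf8_shown_as_cp437)
```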

Good reading here is [WayBack] c++ – What unicode encoding (UTF-8, UTF-16, other) does Windows use for its Unicode data types? – Stack Overflow

Encoding history

+A. Bouchez I’m with +David Heffernan here:

At its release in 1993, Windows NT was very early in supporting Unicode. Development of Windows NT started in 1990 where they opted for UCS-2 having 2 bytes per character and had a non-required annex on UTF-1.

UTF-1 – that later evolved into UTF-8 – did not even exist at that time. Even UCS-2 was still young: it got designed in 1989. UTF-8 was outlined late 1992 and became a standard in 1993

Some references:

[WayBack] Windows NT and VMS: The Rest of the Story | IT Pro

[WayBack] Windows NT – Wikipedia

[WayBack] History: UTF-8 – Wikipedia

[WayBack] UCS History: Universal Coded Character Set – Wikipedia

[WayBack] UTF-1 – Wikipedia

–jeroen


Posted in Delphi, Development, Encoding, PowerShell, PowerShell, Python, Scripting, Software Development, The Old New Thing, Unicode, UTF-16, UTF-8, Windows Development | Leave a Comment »

imagemagick – Command line convert webp to jpg? – Unix & Linux Stack Exchange

Posted by jpluimers on 2019/12/23

For my link archive: [WayBack] imagemagick – Command line convert webp to jpg? – Unix & Linux Stack Exchange

[WayBack] Precompiled Utilities | WebP | Google Developers

All these archives contain both the cwebp and dwebp precompiled executables, along with the libwebp.a library and C headers (the latter allowing you to add WebP encoding or decoding to your own programs).
[WayBack] cwebp | WebP | Google Developers cwebp — Compress an image file to a WebP file
[WayBack] dwebp | WebP | Google Developers dwebp — Decompress a WebP file to an image file
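A minimal sketch of calling `dwebp` from Python, assuming the precompiled tool is on the PATH. The `-o` option names the output file, and `dwebp` writes PNG unless told otherwise, so getting a JPEG still needs a further conversion step with another tool. The helper names and file names are hypothetical:

```python
import shutil
import subprocess


def dwebp_command(source, target):
    """Build the argument list for decoding a WebP file with dwebp."""
    return ["dwebp", source, "-o", target]


def decode_webp(source, target):
    """Run dwebp, failing clearly when the tool is not installed."""
    if shutil.which("dwebp") is None:
        raise FileNotFoundError("dwebp not found on PATH")
    subprocess.run(dwebp_command(source, target), check=True)


print(dwebp_command("picture.webp", "picture.png"))
```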

–jeroen

Posted in *nix, *nix-tools, Development, Encoding, Google, GoogleWebP, Image Editing, Power User, Software Development, The Gimp, WebP | Leave a Comment »

Delphi Galileo IDE (version 8 and up): Force files to be saved as UTF8 – The Oracle at Delphi

Posted by jpluimers on 2019/07/04

Though formatting mangled the registry key to add, the article is interesting: since 2003 (C# Builder 1), you can force the IDE to always save files as UTF8 which should alleviate a lot of encoding problems.

It beats me why this isn’t the default setting, but below is an example .reg file for Delphi 8 which should be easily transformed to more recent Delphi versions:

	Windows Registry Editor Version 5.00

	[HKEY_CURRENT_USER\Software\Borland\BDS\2.0\Editor]
	"DefaultFileFilter"="Borland.FileFilter.UTF8ToUTF8"


So basically (if formatting is kept), you browse to this key (replace Borland with the company for your specific Delphi version, and replace 2.0 by your IDE version):

HKEY_CURRENT_USER\Software\Borland\BDS\16.0\Editor

Then you add a new string value named DefaultFileFilter with value Borland.FileFilter.UTF8ToUTF8
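If you need this for several IDE versions, the .reg text can be generated. A small Python sketch, assuming the key layout stays the same across versions as described above; `utf8_filter_reg` is a hypothetical helper and the company and version are yours to fill in:

```python
def utf8_filter_reg(company="Borland", bds_version="2.0"):
    """Return .reg file text that sets DefaultFileFilter for one IDE version."""
    key = "\\".join(
        ["HKEY_CURRENT_USER", "Software", company, "BDS", bds_version, "Editor"]
    )
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key}]",
        '"DefaultFileFilter"="Borland.FileFilter.UTF8ToUTF8"',
        "",
    ]
    # .reg files use Windows line endings.
    return "\r\n".join(lines)


print(utf8_filter_reg(bds_version="16.0"))
```

Check the generated key against your own registry before importing it; the sketch only reproduces the pattern from the example.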

More background [WayBack] The Oracle at Delphi: More IDE secrets – UTF8 and the Editor

The unmangled registry key (and more tips) was from [WayBack] BSC Polska: Hidden possibilities of Delphi 8.

Get the list of HKEY_CURRENT_USER paths for your Delphi version at Update to List-Delphi-Installed-Packages.ps1 shows HKCU/HKLM keys and doesn’t truncated fields any more.

–jeroen

Via: [WayBack] Is there any way (IDE expert?) to automatic set encoding of each PAS file in UTF-8 instead of ANSI? – Jacek Laskowski – Google+

Posted in Delphi, Development, Encoding, Software Development, UTF-8, UTF8 | 1 Comment »

Unicode ligatures: not all software does normalised search forgetting ffi

Posted by jpluimers on 2019/06/26

Via a private share, I found out that some software forgets to perform a Unicode normalisation when doing a search.

That means that ligatures do not match the non-ligatures in for instance these words:

“ff” and “ﬀ”, as in “difference” versus “diﬀerence”
“fi” and “ﬁ” as in “notification” versus “notiﬁcation”.

For more information, read [WayBack] Unicode equivalence – Wikipedia and make sure you know about these normal forms:

NFD
Normalization Form Canonical Decomposition Characters are decomposed by canonical equivalence, and multiple combining characters are arranged in a specific order.

NFC
Normalization Form Canonical Composition Characters are decomposed and then recomposed by canonical equivalence.

NFKD
Normalization Form Compatibility Decomposition Characters are decomposed by compatibility, and multiple combining characters are arranged in a specific order.

NFKC
Normalization Form Compatibility Composition Characters are decomposed by compatibility, then recomposed by canonical equivalence.
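A minimal Python sketch of a search that does normalise, using the standard `unicodedata` module. The ligatures only have compatibility decompositions, so the canonical forms (NFC, NFD) leave them alone and one of the compatibility forms is needed; `search_key` and `contains` are hypothetical helper names:

```python
import unicodedata


def search_key(text: str) -> str:
    """Fold text for searching: compatibility-normalise, then ignore case."""
    return unicodedata.normalize("NFKC", text).casefold()


def contains(haystack: str, needle: str) -> bool:
    """Substring search that treats ligatures and their letters as equal."""
    return search_key(needle) in search_key(haystack)


with_ligature = "noti\ufb01cation sent"  # contains the single character U+FB01

print("notification" in with_ligature)          # plain search misses it
print(contains(with_ligature, "Notification"))  # normalised search finds it
```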


–jeroen

Posted in Development, Encoding, Software Development, Unicode | Leave a Comment »

I’ve given up on entering non-ASCII characters when entering data on-line

Posted by jpluimers on 2019/06/17

I live in a street that has a non-ASCII character in it: Pyreneeën.

I’ve reverted back to entering the street name as plain ASCII for a simple reason:

Too often the ë gets mangled into encoding gibberish, similar to the é example in [WayBack] When Good Characters Go Bad: A Guide to Diagnosing Character Display Problems as these characters are very near both in UTF-8 and in the [WayBack] Unicode Characters in the Latin-1 Supplement Block:

UTF-8 0xC3 0xA9: [WayBack] Unicode Character ‘LATIN SMALL LETTER E WITH ACUTE’ (U+00E9)
UTF-8 0xC3 0xAB: [WayBack] Unicode Character ‘LATIN SMALL LETTER E WITH DIAERESIS’ (U+00EB)

I’ve seen these encodings, where only the top encoding is correct; the degeneration gets worse moving downwards, a classic Mojibake:

#	encoded	UTF-8 (hex.)
0	ë	`0xC3 0xAB`
1	Ã«	`0xC3 0x83 0xC2 0xAB`
2	ÃÂ«	`0xC3 0x83 0xC2 0x83 0xC3 0x82 0xC2 0xAB`
3	ÃÂÃÂ«	`0xC3 0x83 0xC2 0x83 0xC3 0x82 0xC2 0x83 0xC3 0x83 0xC2 0x82 0xC3 0x82 0xC2 0xAB`
4	ÃÂÃÂÃÂÃÂ«	`0xC3 0x83 0xC2 0x83 0xC3 0x82 0xC2 0x83 0xC3 0x83 0xC2 0x82 0xC3 0x82 0xC2 0x83 0xC3 0x83 0xC2 0x83 0xC3 0x82 0xC2 0x82 0xC3 0x83 0xC2 0x82 0xC3 0x82 0xC2 0xAB`
5	&euml;	`0x26 0x65 0x75 0x6d 0x6c 0x3b`

The last one seldom happens, the first one relatively often, just like [Archive.is] fd.nl did for a while on their financial pages.
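Rows 0 to 4 of the table can be reproduced by repeatedly reading UTF-8 bytes as if they were Latin-1 and saving the result as UTF-8 again. A Python sketch, assuming ISO 8859-1 as the wrongly applied encoding (it matches the bytes in the table); the helper names are made up:

```python
import html


def misdecode(text: str, rounds: int) -> str:
    """Apply the 'UTF-8 bytes read as Latin-1' mistake a number of times."""
    for _ in range(rounds):
        text = text.encode("utf-8").decode("latin-1")
    return text


def hex_bytes(text: str) -> str:
    """Show the UTF-8 bytes of text in the notation of the table."""
    return " ".join(f"0x{b:02X}" for b in text.encode("utf-8"))


for n in range(5):
    print(n, hex_bytes(misdecode("ë", n)))

# Row 5 is a different mistake: an HTML entity that was never unescaped.
print(html.unescape("&euml;"))
```

Each round doubles the number of bytes, which is why the damage grows so quickly.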

These mistakes become sort of understandable (but not forgivable) when you look at the below table-fragment (the full table is at [WayBack] Unicode/UTF-8-character table – starting from code position 0080).


Posted in Development, Encoding, Mojibake, Power User, Software Development, Unicode, Web Browsers | Leave a Comment »

ls colour codes on OpenSuSE tumbleweed when accessed from Mac OS X ssh

Posted by jpluimers on 2019/06/07

`ls` colour codes

I got confused as I thought red text would mean an error.

But it does not: greenish yellow on a red background means error (a symbolic link to a place that’s no longer there).

It’s the output of https://github.com/gkotian/gautam_linux/blob/master/scripts/colours.sh (raw script: https://raw.githubusercontent.com/gkotian/gautam_linux/master/scripts/colours.sh), as the one at [WayBack] command line – What do the different colors mean in the terminal? – Ask Ubuntu failed with errors like this one:

-bash: *.xbm: bad substitution

The full script output is below.

Since various terminals map the colours in the ANSI escape code colour table differently, I used the standard HTML colours (which differ slightly from the Terminal.app screenshot on the right), based on:

HTML colour styles from HTML Color Names and HTML Color Values and
(since LS_COLORS uses dircolors which depends on the ISO 6429 color encoding) ANSI escape code: Colors.

References:

Note that the shell on Mac OS X uses a different way of configuring colours CLICOLOR as described in [WayBack] settings – CLICOLOR and LS_COLORS in bash – Unix & Linux Stack Exchange. I might cover that another day.
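To see which escape codes belong to which file type, the LS_COLORS value can be taken apart. A Python sketch: the `sample` string is made up for illustration (on a real system you would read `os.environ.get("LS_COLORS", "")`), and both helper names are hypothetical:

```python
def parse_ls_colors(value: str) -> dict:
    """Split an LS_COLORS string into {key: SGR parameter string}."""
    entries = {}
    for item in value.split(":"):
        if "=" in item:
            key, sgr = item.split("=", 1)
            entries[key] = sgr
    return entries


def colourise(text: str, sgr: str) -> str:
    """Wrap text in an ANSI SGR sequence and reset afterwards."""
    return f"\033[{sgr}m{text}\033[0m"


# Made-up sample: di = directory, ln = symbolic link, or = orphaned link.
sample = "di=01;34:ln=01;36:or=40;31;01:*.xbm=01;35"

for key, sgr in parse_ls_colors(sample).items():
    print(colourise(f"{key:8} {sgr}", sgr))
```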

Script output:


Posted in *nix, *nix-tools, ANSI escape code, bash, CSS, Development, Encoding, HTML, HTML5, Linux, openSuSE, Power User, Software Development, SuSE Linux, Tumbleweed, Web Development | Leave a Comment »


The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests
