The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests


Archive for the ‘Encoding’ Category

imagemagick – Command line convert webp to jpg? – Unix & Linux Stack Exchange

Posted by jpluimers on 2019/12/23

For my link archive: [WayBack] imagemagick – Command line convert webp to jpg? – Unix & Linux Stack Exchange

–jeroen

Posted in *nix, *nix-tools, Development, Encoding, Google, GoogleWebP, Image Editing, Power User, Software Development, The Gimp, WebP | Leave a Comment »

Delphi Galileo IDE (version 8 and up): Force files to be saved as UTF8 – The Oracle at Delphi

Posted by jpluimers on 2019/07/04

Though formatting mangled the registry key to add, the article is interesting: since 2003 (C# Builder 1), you can force the IDE to always save files as UTF8 which should alleviate a lot of encoding problems.

It beats me why this isn’t the default setting, but below is an example .reg file for Delphi 8 which should be easily transformed to more recent Delphi versions:


Windows Registry Editor Version 5.00
[HKEY_CURRENT_USER\Software\Borland\BDS\2.0\Editor]
"DefaultFileFilter"="Borland.FileFilter.UTF8ToUTF8"

So basically (if formatting is kept), you browse to this key (replace Borland with the company for your specific Delphi version, and replace 2.0 with your IDE version), for instance:

HKEY_CURRENT_USER\Software\Embarcadero\BDS\16.0\Editor

Then you add a new string value named DefaultFileFilter with value Borland.FileFilter.UTF8ToUTF8
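To avoid hand-editing the key for each IDE version, the .reg text can be generated. This is only a sketch: the function name `make_reg_file` is made up for illustration, and the company and version you pass in are assumptions you need to check against your own registry.

```python
def make_reg_file(company: str, bds_version: str) -> str:
    """Return .reg file text that sets DefaultFileFilter for one IDE version."""
    # Build the key path from its parts; company and version differ per Delphi release.
    key = "\\".join(
        ["HKEY_CURRENT_USER", "Software", company, "BDS", bds_version, "Editor"]
    )
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key}]",
        '"DefaultFileFilter"="Borland.FileFilter.UTF8ToUTF8"',
        "",
    ]
    return "\r\n".join(lines)


# Reproduces the Delphi 8 example from this post.
print(make_reg_file("Borland", "2.0"))
```

Save the output to a file with a .reg extension and import it, or just use it as a reminder of the exact key and value names.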

More background [WayBack] The Oracle at Delphi: More IDE secrets – UTF8 and the Editor

The unmangled registry key (and more tips) was from [WayBack] BSC Polska: Hidden possibilities of Delphi 8.

Get the list of HKEY_CURRENT_USER paths for your Delphi version at Update to List-Delphi-Installed-Packages.ps1 shows HKCU/HKLM keys and doesn’t truncated fields any more.

–jeroen

Via: [WayBack] Is there any way (IDE expert?) to automatic set encoding of each PAS file in UTF-8 instead of ANSI? – Jacek Laskowski – Google+

Posted in Delphi, Development, Encoding, Software Development, UTF-8, UTF8 | 1 Comment »

Unicode ligatures: not all software does normalised search forgetting ﬃ

Posted by jpluimers on 2019/06/26

Via a private share, I found out that some software forgets to perform a Unicode normalisation when doing a search.

That means that ligatures do not match the non-ligatures in for instance these words:

  • “ﬀ” (U+FB00) and “ff”, as in “diﬀerence” versus “difference”
  • “ﬁ” (U+FB01) and “fi”, as in “notiﬁcation” versus “notification”.

For more information, read [WayBack] Unicode equivalence – Wikipedia and make sure you know about these normal forms:

  • NFD (Normalization Form Canonical Decomposition): Characters are decomposed by canonical equivalence, and multiple combining characters are arranged in a specific order.
  • NFC (Normalization Form Canonical Composition): Characters are decomposed and then recomposed by canonical equivalence.
  • NFKD (Normalization Form Compatibility Decomposition): Characters are decomposed by compatibility, and multiple combining characters are arranged in a specific order.
  • NFKC (Normalization Form Compatibility Composition): Characters are decomposed by compatibility, then recomposed by canonical equivalence.
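A minimal sketch of a search that does normalise, using Python's standard `unicodedata` module. The helper names `search_key` and `contains` are made up for illustration. Ligatures are compatibility characters, so only the compatibility forms (NFKD/NFKC) fold them; the canonical forms leave them alone.

```python
import unicodedata


def search_key(text: str) -> str:
    """Fold compatibility characters (such as ligatures) and case for matching."""
    return unicodedata.normalize("NFKC", text).casefold()


def contains(haystack: str, needle: str) -> bool:
    """Substring search on the normalised forms of both strings."""
    return search_key(needle) in search_key(haystack)


ligature_word = "di\uFB00erence"  # contains U+FB00 LATIN SMALL LIGATURE FF

print("difference" in ligature_word)          # naive search misses the ligature
print(contains(ligature_word, "difference"))  # normalised search finds it
# Canonical normalisation (NFC) does not touch the ligature at all.
print(unicodedata.normalize("NFC", ligature_word) == ligature_word)
```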

–jeroen

Posted in Development, Encoding, Software Development, Unicode | Leave a Comment »

I’ve given up on entering non-ASCII characters when entering data on-line

Posted by jpluimers on 2019/06/17

I live in a street that has a non-ASCII character in it: Pyreneeën.

I’ve reverted to entering the street name as plain ASCII for a simple reason:

Too often the ë gets mangled into encoding gibberish, similar to the é example in [WayBack] When Good Characters Go Bad: A Guide to Diagnosing Character Display Problems, as these characters are very near each other both in UTF-8 and in the [WayBack] Unicode Characters in the Latin-1 Supplement Block.

I’ve seen these encodings, where only the top encoding is correct; the degeneration gets worse moving downwards, a classic Mojibake:

#  encoded    UTF-8 (hex.)
0  ë          0xC3 0xAB
1  Ã«         0xC3 0x83 0xC2 0xAB
2  ÃÂ«        0xC3 0x83 0xC2 0x83 0xC3 0x82 0xC2 0xAB
3  ÃÂÃÂ«      0xC3 0x83 0xC2 0x83 0xC3 0x82 0xC2 0x83 0xC3 0x83 0xC2 0x82 0xC3 0x82 0xC2 0xAB
4  ÃÂÃÂÃÂÃÂ«  0xC3 0x83 0xC2 0x83 0xC3 0x82 0xC2 0x83 0xC3 0x83 0xC2 0x82 0xC3 0x82 0xC2 0x83 0xC3 0x83 0xC2 0x83 0xC3 0x82 0xC2 0x82 0xC3 0x83 0xC2 0x82 0xC3 0x82 0xC2 0xAB
5  &euml;     0x26 0x65 0x75 0x6d 0x6c 0x3b

The last one seldom happens, the first mangled one relatively often, as [Archive.is] fd.nl did for a while on their financial pages.
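The byte sequences in the table above can be reproduced with a few lines of Python. This sketch assumes the mistake in rows 1 to 4 is the classic one: UTF-8 bytes are read as if they were Latin-1 (ISO-8859-1) and then saved as UTF-8 again. The function names are made up for illustration.

```python
from html.entities import codepoint2name


def mojibake_rounds(text: str, rounds: int) -> list[bytes]:
    """Return the UTF-8 bytes after 0..rounds wrong decode/encode cycles."""
    results = []
    for _ in range(rounds + 1):
        encoded = text.encode("utf-8")
        results.append(encoded)
        # The mistake: treat the UTF-8 bytes as if they were Latin-1 characters.
        text = encoded.decode("latin-1")
    return results


def as_hex(data: bytes) -> str:
    """Format bytes the way the table does."""
    return " ".join(f"0x{b:02X}" for b in data)


for number, data in enumerate(mojibake_rounds("ë", 4)):
    print(number, as_hex(data))

# Row 5 is a different mistake: the character was replaced by its HTML entity.
entity = f"&{codepoint2name[ord('ë')]};"
print(5, as_hex(entity.encode("ascii")))
```

Each round doubles the number of bytes, which is why the rows grow so quickly.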

These mistakes become sort of understandable (but not forgivable) when you look at the below table-fragment (the full table is at [WayBack] Unicode/UTF-8-character table – starting from code position 0080).

Read the rest of this entry »

Posted in Development, Encoding, Mojibake, Power User, Software Development, Unicode, Web Browsers | Leave a Comment »

ls colour codes on OpenSuSE tumbleweed when accessed from Mac OS X ssh

Posted by jpluimers on 2019/06/07

`ls` colour codes

I got confused as I thought red text would mean an error.

But it doesn’t: greenish yellow on a red background means error (a symbolic link to a place that’s no longer there).

It’s the output of https://raw.githubusercontent.com/gkotian/gautam_linux/master/scripts/colours.sh (browsable at https://github.com/gkotian/gautam_linux/blob/master/scripts/colours.sh), as the one at [WayBack] command line – What do the different colors mean in the terminal? – Ask Ubuntu failed with errors like this one:

-bash: *.xbm: bad substitution

The full script output is below.

Since various terminals map the colours in the ANSI escape code colour table differently, I used the standard HTML colours (which differ slightly from the Terminal.app screenshot on the right):

References:

Note that the shell on Mac OS X configures colours in a different way, through CLICOLOR, as described in [WayBack] settings – CLICOLOR and LS_COLORS in bash – Unix & Linux Stack Exchange. I might cover that another day.

Script output:

Read the rest of this entry »

Posted in *nix, *nix-tools, ANSI escape code, bash, CSS, Development, Encoding, HTML, HTML5, Linux, openSuSE, Power User, Software Development, SuSE Linux, Tumbleweed, Web Development | Leave a Comment »

including enumerations and JPEG compression examples for wPDF 4 Manual: Compression related properties

Posted by jpluimers on 2019/04/11

Since I was tracking down an issue having to do with generating a DIB in a compressed PDF: [Archive.is] wPDF 4 Manual: Compression related properties

Property CompressStreamMethod

By modifying this property you can let the PDF engine compress (deflate) text. By using compression the file will be reasonable smaller. On the other had compression will create binary data rather than ASCII data. While “deflate” produces the smallest files, “run-length” compression is compatible even to very old PDF reader programs.

Property JPEGQuality

wPDF can compress bitmaps using JPEG. This will work only for true color bitmaps (24 bits/pixel) and if you have set the desired quality in this property.

Property EncodeStreamMethod

If data in the PDF file is binary it can be encoded to be ASCII again. Binary data can be either compressed text or graphics. You can select HEX encoding or ASCII95 which is more effective then HEX.

Property ConvertJPEGData

Note: Only applies to TWPDFExport.

If this property is true JPEG data found in the TWPRichText editor will not be embedded as JPEG data. Instead the bitmap will be compressed using deflate or run length compression. It is necessary to set this property to TRUE if the PDF files must be compatible to older PDF reader programs which are incapable to read JPEG data.

Note that EncodeStreamMethod does not do compression, but it does belong here because the encodings result in different PDF sizes.
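To get a feel for how much the encoding choice matters, here is a rough sketch using only the Python standard library, not wPDF itself. It assumes that the “ASCII95” in the manual refers to the ASCII85 encoding that PDF uses; that is my assumption, and the sample data is arbitrary.

```python
import base64
import binascii
import zlib

data = bytes(range(256)) * 4  # 1024 bytes of arbitrary sample binary data

hex_encoded = binascii.hexlify(data)  # 2 ASCII characters per byte
a85_encoded = base64.a85encode(data)  # 5 ASCII characters per 4 bytes
deflated = zlib.compress(data)        # binary again, so a PDF may re-encode it

print("raw bytes:    ", len(data))
print("HEX encoded:  ", len(hex_encoded))
print("ASCII85:      ", len(a85_encoded))
print("deflated:     ", len(deflated))
```

HEX doubles the size, while ASCII85 adds a quarter, which matches the manual's remark that it is the more effective of the two.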

The settings are not documented in more detail, so here are the enumerations explaining them in a bit more depth:

–jeroen

Posted in ASCII95, Delphi, Development, Encoding, HEX encoding, Software Development | Leave a Comment »

UTF-8 support for single byte character sets is beta in Windows and likely breaks a lot of applications not expecting this (via Unicode in Microsoft Windows: UTF-8 – Wikipedia)

Posted by jpluimers on 2018/12/04

Uh-oh: [WayBack] Unicode in Microsoft Windows: UTF-8 – Wikipedia:

Microsoft Windows has a code page designated for UTF-8, code page 65001. Prior to Windows 10 insider build 17035 (November 2017),[7] it was impossible to set the locale code page to 65001, leaving this code page only available for:

  • Explicit conversion functions such as MultiByteToWideChar
  • The Win32 console command chcp 65001 to translate stdin/out between UTF-8 and UTF-16.

This means that “narrow” functions, in particular fopen, cannot be called with UTF-8 strings, and in fact there is no way to open all possible files using fopen no matter what the locale is set to and/or what bytes are put in the string, as none of the available locales can produce all possible UTF-16 characters.

On all modern non-Windows platforms, the string passed to fopen is effectively UTF-8. This produces an incompatibility between other platforms and Windows. The normal work-around is to add Windows-specific code to convert UTF-8 to UTF-16 using MultiByteToWideChar and call the “wide” function.[8] Conversion is also needed even for Windows-specific api such as SetWindowText since many applications inherently have to use UTF-8 due to its use in file formats, internet protocols, and its ability to interoperate with raw arrays of bytes.

There were proposals to add new API to portable libraries such as Boost to do the necessary conversion, by adding new functions for opening and renaming files. These functions would pass filenames through unchanged on Unix, but translate them to UTF-16 on Windows.[9] This would allow code to be “portable”, but required just as many code changes as calling the wide functions.

With insider build 17035 and the April 2018 update (nominal build 17134) for Windows 10, a “Beta: Use Unicode UTF-8 for worldwide language support” checkbox appeared for setting the locale code page to UTF-8.[a] This allows for calling “narrow” functions, including fopen and SetWindowTextA, with UTF-8 strings. Microsoft claims this option might break some functions (a possible example is _mbsrev[10]) as they were written to assume multibyte encodings used no more than 2 bytes per character, thus until now code pages with more bytes such as GB 18030 (cp54936) and UTF-8 could not be set as the locale.[11]


  8. [WayBack] “UTF-8 in Windows”. Stack Overflow. Retrieved July 1, 2011.
  9. [WayBack] “Boost.Nowide”.
  10. [WayBack] https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/strrev-wcsrev-mbsrev-mbsrev-l
  11. [WayBack] “Code Page Identifiers (Windows)”. msdn.microsoft.com.
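The “no more than 2 bytes per character” assumption mentioned in the quote is easy to see by counting bytes per character in a few encodings. This is a sketch with Python's built-in codecs, not with the Windows API; `cp936` stands in here for a classic double-byte code page, and the sample characters are my own choice.

```python
def width(char: str, encoding: str) -> int:
    """Number of bytes one character takes in the given encoding."""
    return len(char.encode(encoding))


for char in ("A", "中", "😀"):
    print(
        repr(char),
        "utf-8:", width(char, "utf-8"),
        "gb18030:", width(char, "gb18030"),
        "utf-16:", width(char, "utf-16-le"),
    )

# A double-byte code page never needs more than 2 bytes per character ...
print(width("中", "cp936"))
# ... but UTF-8 and GB 18030 can need up to 4, which code that assumes
# "lead byte plus one trail byte" does not expect.
print(width("😀", "utf-8"), width("😀", "gb18030"))
```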

Via [WayBack] Microsoft Windows Beta UTF-8 support for Ansi API could break things. Wiki Article of the Change… – Tommi Prami – Google+

Related, as handling encoding is hard, especially if it is changed or not your default:

–jeroen

Posted in .NET, C, C++, Delphi, Development, Encoding, GB 18030, Power User, Software Development, UTF-16, UTF-32, UTF-8, UTF16, UTF32, UTF8, Windows, Windows 10 | 2 Comments »

Getting rid of trailing line-endings in the draw.io web interface

Posted by jpluimers on 2018/12/03

One of the things that bugged me for a long time is that every now and then for some shapes, when editing their text, the draw.io web interface puts in trailing line feeds after the text, messing up layout.

The easiest way to work around it is by searching inside the diagram XML for
"
, then replacing that with a ".

(the above code got screwed by WordPress.com saving it, so the search is in this small gist below)

This behaviour is intermittent on the drawio MacOS desktop app.



"

–jeroen

 

Posted in Cloud Apps, Development, draw.io, Encoding, Internet, Power User, Software Development, Unicode | Leave a Comment »

It looks like gmail finally understands Outlook Calendar entries

Posted by jpluimers on 2018/11/12

For a very long time, Gmail did nothing with Outlook Calendar entries.

So I had to view the message source, then translate them to Google Calendar entries myself.

--_000_430b30b9ffd74d959b74ab7ba778b487ultrawarenl_
Content-Type: text/calendar; charset="utf-8"; method=REQUEST
Content-Transfer-Encoding: base64

...

As of late, they seem to be processed into Google Calendar compatible entries. Nice!
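Before that, getting at the calendar data by hand came down to finding the text/calendar part in the message source and undoing the base64 encoding. A minimal sketch with Python's standard `email` package; the sample message, its boundary and the calendar content are all made up for illustration.

```python
import base64
from email import message_from_string

# Hypothetical calendar content; a real invitation has many more fields.
ICS = "BEGIN:VCALENDAR\nMETHOD:REQUEST\nBEGIN:VEVENT\nSUMMARY:Demo meeting\nEND:VEVENT\nEND:VCALENDAR\n"

# Hypothetical message source with the calendar part encoded as base64.
RAW = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="demo_boundary"\n'
    "\n"
    "--demo_boundary\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "\n"
    "You are invited.\n"
    "--demo_boundary\n"
    'Content-Type: text/calendar; charset="utf-8"; method=REQUEST\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + base64.b64encode(ICS.encode("utf-8")).decode("ascii")
    + "\n--demo_boundary--\n"
)


def calendar_parts(raw: str) -> list[str]:
    """Return the decoded text of every text/calendar part in a message."""
    message = message_from_string(raw)
    parts = []
    for part in message.walk():
        if part.get_content_type() == "text/calendar":
            payload = part.get_payload(decode=True)  # undoes the base64 layer
            charset = part.get_content_charset() or "utf-8"
            parts.append(payload.decode(charset))
    return parts


for text in calendar_parts(RAW):
    print(text)
```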

–jeroen

Posted in base64, Development, Encoding, GMail, Google, GoogleCalendar, MIME, Office, Outlook, Power User, Software Development, UTF-8, UTF8 | Leave a Comment »

Unicode spaces

Posted by jpluimers on 2018/09/25

For my link archive:

Via: [WayBack] Are there blank characters in unicode that have the same widths as period, comma and digits? – Lars Fosdal – Google+

Answer: no, though better fonts give period, comma, colon, semicolon and other punctuation marks the same width as the punctuation space.

The use-case:

I wanted right justified text without having to do custom positioning/drawing – where the decimal zero is white space.

For example, here 12 instead of 12.0

9.5
11.6
12 <– #$2008 and #$2007
13.4

I.e. PunctuationSpace and FigureSpace

I don’t want to deal with positioning/rendering since it happens inside a third party component.
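The idea from the use-case can be sketched as follows: format with one decimal, then swap a trailing “.0” for a punctuation space plus a figure space so the column still lines up. The function name is made up, and whether the result really aligns depends on the font, as the answer above says.

```python
PUNCTUATION_SPACE = "\u2008"  # intended to be as wide as a period
FIGURE_SPACE = "\u2007"       # intended to be as wide as a digit


def blank_zero_decimal(value: float) -> str:
    """Format with one decimal, but show a zero decimal as white space."""
    text = f"{value:.1f}"
    if text.endswith(".0"):
        # Replace "." and "0" with spaces of (ideally) the same widths.
        return text[:-2] + PUNCTUATION_SPACE + FIGURE_SPACE
    return text


for value in (9.5, 11.6, 12.0, 13.4):
    # Pad on the left with figure spaces so all entries have the same length.
    print(blank_zero_decimal(value).rjust(4, FIGURE_SPACE))
```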

–jeroen

Posted in Development, Encoding, Font, Power User, Software Development, Unicode | Leave a Comment »