The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests

  • My badges

  • Twitter Updates

  • My Flickr Stream

  • Pages

  • All categories

  • Enter your email address to subscribe to this blog and receive notifications of new posts by email.

    Join 1,860 other subscribers

Archive for the ‘Unicode’ Category

PowerShell error in a script but not on the console: The string is missing the terminator: “.

Posted by jpluimers on 2021/09/29

The below one will fail in a script, both both work from the PowerShell prompt:

Success

Get-NetFirewallRule -DisplayGroup "File and Printer Sharing" | ForEach-Object { Write-Host $_.DisplayName ; Get-NetFirewallAddressFilter -AssociatedNetFirewallRule $_ }

Failure

Get-NetFirewallRule –DisplayGroup "File and Printer Sharing" | ForEach-Object { Write-Host $_.DisplayName ; Get-NetFirewallAddressFilter -AssociatedNetFirewallRule $_ }

The error you get this this:

At C:\bin\Show-File-and-Printer-Sharing-firewall-rules.ps1:5 char:52
+ ... -TCP-NoScope" | ForEach-Object { Write-Host $_.DisplayName ; Get-NetF ...
+                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The string is missing the terminator: ".
    + CategoryInfo          : ParserError: (:) [], ParentContainsErrorRecordException
    + FullyQualifiedErrorId : TerminatorExpectedAtEndOfString

Via [WayBack] script file ‘The string is missing the terminator: “.’ – Google Search, I quickly found these that stood out:

Cause and solution

Before DisplayGroup, the first line has a minus sign and the second an en-dash. You can see this via [WayBack] What Unicode character is this ?.

Apparently, when using Unicode on the console, it does not matter if you have a minus sign (-), en-dash (–), em-dash (—) or horizontal bar (―) as dash character. You can see this in [WayBack] tokenizer.cs at function [WayBack] NextToken and [WayBack] CharTraits.cs at function [WayBack] IsChar).

When saving to a non-Unicode file, it does matter, even though it does not display as garbage in the error message.

Similarly, PowerShell has support for these special characters:

    internal static class SpecialChars
    {
        // Uncommon whitespace
        internal const char NoBreakSpace = (char)0x00a0;
        internal const char NextLine = (char)0x0085;

        // Special dashes
        internal const char EnDash = (char)0x2013;
        internal const char EmDash = (char)0x2014;
        internal const char HorizontalBar = (char)0x2015;

        // Special quotes
        internal const char QuoteSingleLeft = (char)0x2018; // left single quotation mark
        internal const char QuoteSingleRight = (char)0x2019; // right single quotation mark
        internal const char QuoteSingleBase = (char)0x201a; // single low-9 quotation mark
        internal const char QuoteReversed = (char)0x201b; // single high-reversed-9 quotation mark
        internal const char QuoteDoubleLeft = (char)0x201c; // left double quotation mark
        internal const char QuoteDoubleRight = (char)0x201d; // right double quotation mark
        internal const char QuoteLowDoubleLeft = (char)0x201E; // low double left quote used in german.
    }

The easiest solution is to use minus signs everywhere.

Another solution is to save files as Unicode UTF-8 encoding (preferred) or UTF-16 encoding (which I dislike).

–jeroen

Posted in .NET, CommandLine, Development, Encoding, PowerShell, PowerShell, Scripting, Software Development, Unicode, UTF-16, UTF-8, UTF16, UTF8 | Leave a Comment »

Unicode superscript and subscript alphabetic letters

Posted by jpluimers on 2021/08/04

Not all letters have superscript or subscript counterparts. The counterparts are from different ranges, so might not look nice when next to each other.

I think 20th using Unicode lowercase superscript looks ugly 20ᵗʰ. With uppercase superscript it is somewhat OK: 20ᵀᴴ.

The list is from [WayBack] javascript – How to find the unicode of the subscript alphabet? – Stack Overflow:

Take a look at the wikipedia article Unicode subscripts and superscripts. It looks like these are spread out across different ranges, and not all characters are available.

Consolidated for cut-and-pasting purposes, the Unicode standard defines complete sub- and super-scripts for numbers and common mathematical symbols ( ⁰ ¹ ² ³ ⁴ ⁵ ⁶ ⁷ ⁸ ⁹ ⁺ ⁻ ⁼ ⁽ ⁾ ₀ ₁ ₂ ₃ ₄ ₅ ₆ ₇ ₈ ₉ ₊ ₋ ₌ ₍ ₎ ), a full superscript Latin lowercase alphabet except q ( ᵃ ᵇ ᶜ ᵈ ᵉ ᶠ ᵍ ʰ ⁱ ʲ ᵏ ˡ ᵐ ⁿ ᵒ ᵖ ʳ ˢ ᵗ ᵘ ᵛ ʷ ˣ ʸ ᶻ ), a limited uppercase Latin alphabet ( ᴬ ᴮ ᴰ ᴱ ᴳ ᴴ ᴵ ᴶ ᴷ ᴸ ᴹ ᴺ ᴼ ᴾ ᴿ ᵀ ᵁ ⱽ ᵂ ), a few subscripted lowercase letters ( ₐ ₑ ₕ ᵢ ⱼ ₖ ₗ ₘ ₙ ₒ ₚ ᵣ ₛ ₜ ᵤ ᵥ ₓ ), and some Greek letters ( ᵅ ᵝ ᵞ ᵟ ᵋ ᶿ ᶥ ᶲ ᵠ ᵡ ᵦ ᵧ ᵨ ᵩ ᵪ ). Note that since these glyphs come from different ranges, they may not be of the same size and position, depending on the typeface.

After a nice chat with my nephew EWD, I did some research and found the above via

–jeroen

Posted in Development, Encoding, internatiolanization (i18n) and localization (l10), Power User, Software Development, Unicode | Leave a Comment »

“No mapping for the Unicode character exists in the target multi-byte code page”

Posted by jpluimers on 2021/06/24

Usually when I see this error [Wayback] “No mapping for the Unicode character exists in the target multi-byte code page” – Google Search, it is in legacy code that uses string buffers where decoding or decompressing data into.

This is almost always wrong no matter what kind of data you use, as it will depend in your string encoding.

I have seen it happen especially in these cases:

  • base64 decoding from string to string (solution: decode from a string stream into a binary stream, then post-process from there)
  • zip or zlib decompress from binary stream to string stream, then reading the string stream (solution: decompress from binary stream to binary stream, then post-process from there)

Most cases I encountered were in Delphi and C code, but surprisingly I also bumped into C# exhibiting this behaviour.

I’m not alone, just see these examples from the above Google search:

–jeroen

Posted in .NET, base64, C, C#, C++, Delphi, Development, Encoding, Software Development, Unicode | Leave a Comment »

VMware ESXi 6 and 7: checking and setting/clearing maintenance mode from the console

Posted by jpluimers on 2021/04/21

Every now and then it is useful to be able to do maintenance work from the ESXi console addition to the ESXi web-user interface.

I know there are many sites having this information, but many of them forgot to format the statements with code markup, so parameters with two dashes -- (each a Wayback Unicode Character ‘HYPHEN-MINUS’ (U+002D)) now have become an [Wayback] Unicode Character ‘EN DASH’ (U+2013) which is incompatible with most console programs, especially the ESXi ones (as they are Busybox based to minimise footprint).

Note you can use this small site (which runs in-browser, so does not phone home) to get the unicode code points for any string: [Wayback] What Unicode character is this ?.

Links like below (most on the vmware.com domain) have this EN DASH and make me document things on my blog instead of trying code directly from blogs or forum posts:

So below are three commands I use that have to do with the maintenance mode (the mode that for instance you can use to update an ESXi host to the latest patch level).

    1. Check the maintenance mode (which returns Enabled or Disabled):
      esxcli system maintenanceMode get
    2. Enable maintenance mode (which returns nothing when succeeded, and Maintenance mode is already enabled. when failed):
      esxcli system maintenanceMode set --enable true
    3. Disable maintenance mode (which returns nothing when succeeded, and Maintenance mode is already disabled. when failed):
      esxcli system maintenanceMode get

Some examples, especially an the various output possibilities (commands in bold, output in italic):

# esxcli system maintenanceMode get
Disabled
# esxcli system maintenanceMode set --enable false
Maintenance mode is already disabled.
# esxcli system maintenanceMode set --enable true 
# esxcli system maintenanceMode get
Enabled
# esxcli system maintenanceMode set --enable true
Maintenance mode is already enabled.
# esxcli system maintenanceMode set --enable false
# esxcli system maintenanceMode get
Disabled

I made these scripts for this:

  • esxcli-maintenanceMode-show.sh:
    #!/bin/sh
    esxcli system maintenanceMode get
  • esxcli-maintenanceMode-enter.sh:
    #!/bin/sh
    esxcli system maintenanceMode set --enable true
  • esxcli-maintenanceMode-exit.sh:
    #!/bin/sh
    esxcli system maintenanceMode set --enable false

Note I have not checked the exit codes for these esxcli commands yet, but did blog about how to do that: Busybox sh (actually ash derivative dash): checking exit codes.

–jeroen

Posted in BusyBox, Development, Encoding, ESXi6, ESXi6.5, ESXi6.7, ESXi7, Power User, Software Development, Unicode, Virtualization, VMware, VMware ESXi | Leave a Comment »

(mostly ASCII) List of emoticons – Wikipedia

Posted by jpluimers on 2021/03/17

Most searches for “ASCII emoticons” get you Unicode ones:

Luckily most are ASCII in List of emoticons – Wikipedia.

There are also shortcodes, which do not visually represent an emoji, but usually get translated to the image or Unicode character.

A few lists on them:

–jeroen

Posted in ASCII, Development, Encoding, LifeHacker, Power User, Software Development, Unicode | Leave a Comment »

Unicode is hard, also for the Delphi compiler and IDE

Posted by jpluimers on 2020/10/13

The Delphi compiler does not see a unicode non-breaking space (0x00A0 as whitespace, and the Delphi IDE does not warn you about it: [WayBack] Delphi revelations #2 – Space characters are not just space characters.

Given that this character was introduced in 1993, I wonder how the compiler tests look like.

These also will not be recognised as whitespace:

Related, as many other tools also do not properly support various whitespace characters:

Via: [WayBack] A Delphi “Aha” experience – Kim Madsen – Google+

–jeroen

Posted in Delphi, Development, Software Development, Unicode | Leave a Comment »

Hamburger menu character on unicode: use U+2261 instead of U+2630

Posted by jpluimers on 2020/01/27

Not all fonts have Unicode character ☰ [WayBack] Unicode Character ‘TRIGRAM FOR HEAVEN’ (U+2630) as it is in a less common block.

More fonts have Unicode character ≡ [WayBack] Unicode Character ‘IDENTICAL TO’ (U+2261)

The latter is slightly shorter and slightly narrower than the former, but works in way more places.

Via [WayBack] html – Unicode ☰ hamburger not displaying in Android & Chrome – Stack Overflow

I’ve worked around this problem by using the UNICODE character UNICODE U+2261 (8801), ≡ IDENTICAL TO as illustrated below rather than the UNICODE U+2630 (9776) ☰ TRIGRAM FOR HEAVEN which

–jeroen

Posted in Development, Encoding, LifeHacker, Power User, Software Development, Unicode | Leave a Comment »

Delphi, decoding files to strings and finding line endings: some links, some history on Windows NT and UTF/UCS encodings

Posted by jpluimers on 2019/12/31

A while back there were a few G+ threads sprouted by David Heffernan on decoding big files into line-ending splitted strings:

Code comparison:

Python:

with open(filename, 'r', encoding='utf-16-le') as f:
  for line in f:
    pass

Delphi:

for Line in TLineReader.FromFile(filename, TEncoding.Unicode) do
  ;

This spurred some nice observations and unfounded statements on which encodings should be used, so I posted a bit of history that is included below.

Some tips and observations from the links:

  • Good old text files are not “good” with Unicode support, neither are TextFile Device Drivers; nobody has written a driver supporting a wide range of encodings as of yet.
  • Good old text files are slow as well, even with a changed SetTextBuffer
  • When using the TStreamReader, the decoding takes much more time than the actual reading, which means that [WayBack] Faster FileStream with TBufferedFileStream • DelphiABall does not help much
  • TStringList.LoadFromFile, though fast, is a memory allocation dork and has limits on string size
  • Delphi RTL code is not what it used to be: pre-Delphi Unicode RTL code is of far better quality than Delphi 2009 and up RTL code
  • Supporting various encodings is important
  • EBCDIC days: three kinds of spaces, two kinds of hyphens, multiple codepages
  • Strings are just that: strings. It’s about the encoding from/to the file that needs to be optimal.
  • When processing large files, caching only makes sense when the file fits in memory. Otherwise caching just adds overhead.
  • On Windows, if you read a big text file into memory, open the file in “sequential read” mode, to disable caching. Use the FILE_FLAG_SEQUENTIAL_SCAN flag under Windows, as stated at [WayBack] How do FILE_FLAG_SEQUENTIAL_SCAN and FILE_FLAG_RANDOM_ACCESS affect how the operating system treats my file? – The Old New Thing
  • Python string reading depends on the way you read files (ASCII or Unicode); see [WayBack] unicode – Python codecs line ending – Stack Overflow

Though TLineReader is not part of the RTL, I think it is from [WayBack] For-in Enumeration – ADUG.

Encodings in use

It doesn’t help that on the Windows Console, various encodings are used:

Good reading here is [WayBack] c++ – What unicode encoding (UTF-8, UTF-16, other) does Windows use for its Unicode data types? – Stack Overflow

Encoding history

+A. Bouchez I’m with +David Heffernan here:

At its release in 1993, Windows NT was very early in supporting Unicode. Development of Windows NT started in 1990 where they opted for UCS-2 having 2 bytes per character and had a non-required annex on UTF-1.

UTF-1 – that later evolved into UTF-8 – did not even exist at that time. Even UCS-2 was still young: it got designed in 1989. UTF-8 was outlined late 1992 and became a standard in 1993

Some references:

–jeroen

Read the rest of this entry »

Posted in Delphi, Development, Encoding, PowerShell, PowerShell, Python, Scripting, Software Development, The Old New Thing, Unicode, UTF-16, UTF-8, Windows Development | Leave a Comment »

Unicode ligatures: not all software does normalised search forgetting ffi 

Posted by jpluimers on 2019/06/26

Via a private share, I found out that some software forgets to perform a Unicode normalisation when doing a search.

That means that ligatures do not match the non-ligatures in for instance these words:

  • “ff” and “ff”, as in “difference” versus “difference”
  • “fi” and “fi” as in “notification” versus “notification”.

For more information, read [WayBackUnicode equivalence – Wikipedia and make sure you know about these normal forms:

NFD
Normalization Form Canonical Decomposition
Characters are decomposed by canonical equivalence, and multiple combining characters are arranged in a specific order.
NFC
Normalization Form Canonical Composition
Characters are decomposed and then recomposed by canonical equivalence.
NFKD
Normalization Form Compatibility Decomposition
Characters are decomposed by compatibility, and multiple combining characters are arranged in a specific order.
NFKC
Normalization Form Compatibility Composition
Characters are decomposed by compatibility, then recomposed by canonical equivalence.

–jeroen

Posted in Development, Encoding, Software Development, Unicode | Leave a Comment »

I’ve given up on entering non-ASCII characters when entering data on-line

Posted by jpluimers on 2019/06/17

I live in a street that has a non-ASCII character in it: Pyreneeën.

I’ve reverted back to entering the street name as plain ASCII for a simple reason:

Too often the ë gets mangled into encoding gibberish, similar to the é example in [WayBackWhen Good Characters Go Bad: A Guide to Diagnosing Character Display Problems as these characters are very near both in UTF-8 and in the [WayBackUnicode Characters in the Latin-1 Supplement Block:

I’ve seen these encodings, where only the top encoding is correct; the degeneration gets worse moving downwards, a classic Mojibake:

# encoded UTF-8 (hex.)
0 ë 0xC3 0xAB
1 ë 0xC3 0x83 0xC2 0xAB
2 ë 0xC3 0x83 0xC2 0x83 0xC3 0x82 0xC2 0xAB
3 ë 0xC3 0x83 0xC2 0x83 0xC3 0x82 0xC2 0x83 0xC3 0x83 0xC2 0x82 0xC3 0x82 0xC2 0xAB
4 ë 0xC3 0x83 0xC2 0x83 0xC3 0x82 0xC2 0x83 0xC3 0x83 0xC2 0x82 0xC3 0x82 0xC2 0x83 0xC3 0x83 0xC2 0x83 0xC3 0x82 0xC2 0x82 0xC3 0x83 0xC2 0x82 0xC3 0x82 0xC2 0xAB
5 ë 0x26 0x65 0x75 0x6d 0x6c 0x3b

The last one seldomly happens, the first one relatively often, just like [Archive.is] fd.nl did a while on their finanancial pages.

These mistakes become sort of understandable (but not forgivable) when you look at the below table-fragment (the full table is at[WayBack] Unicode/UTF-8-character table – starting from code position 0080).

Read the rest of this entry »

Posted in Development, Encoding, Mojibake, Power User, Software Development, Unicode, Web Browsers | Leave a Comment »