The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests

  • My badges

  • Twitter Updates

  • My Flickr Stream

  • Pages

  • All categories

  • Enter your email address to subscribe to this blog and receive notifications of new posts by email.

    Join 2,514 other followers

Archive for the ‘UTF8’ Category

PowerShell error in a script but not on the console: The string is missing the terminator: “.

Posted by jpluimers on 2021/09/29

The below one will fail in a script, both both work from the PowerShell prompt:

Success

Get-NetFirewallRule -DisplayGroup "File and Printer Sharing" | ForEach-Object { Write-Host $_.DisplayName ; Get-NetFirewallAddressFilter -AssociatedNetFirewallRule $_ }

Failure

Get-NetFirewallRule –DisplayGroup "File and Printer Sharing" | ForEach-Object { Write-Host $_.DisplayName ; Get-NetFirewallAddressFilter -AssociatedNetFirewallRule $_ }

The error you get this this:

At C:\bin\Show-File-and-Printer-Sharing-firewall-rules.ps1:5 char:52
+ ... -TCP-NoScope" | ForEach-Object { Write-Host $_.DisplayName ; Get-NetF ...
+                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The string is missing the terminator: ".
    + CategoryInfo          : ParserError: (:) [], ParentContainsErrorRecordException
    + FullyQualifiedErrorId : TerminatorExpectedAtEndOfString

Via [WayBack] script file ‘The string is missing the terminator: “.’ – Google Search, I quickly found these that stood out:

Cause and solution

Before DisplayGroup, the first line has a minus sign and the second an en-dash. You can see this via [WayBack] What Unicode character is this ?.

Apparently, when using Unicode on the console, it does not matter if you have a minus sign (-), en-dash (–), em-dash (—) or horizontal bar (―) as dash character. You can see this in [WayBack] tokenizer.cs at function [WayBack] NextToken and [WayBack] CharTraits.cs at function [WayBack] IsChar).

When saving to a non-Unicode file, it does matter, even though it does not display as garbage in the error message.

Similarly, PowerShell has support for these special characters:

    internal static class SpecialChars
    {
        // Uncommon whitespace
        internal const char NoBreakSpace = (char)0x00a0;
        internal const char NextLine = (char)0x0085;

        // Special dashes
        internal const char EnDash = (char)0x2013;
        internal const char EmDash = (char)0x2014;
        internal const char HorizontalBar = (char)0x2015;

        // Special quotes
        internal const char QuoteSingleLeft = (char)0x2018; // left single quotation mark
        internal const char QuoteSingleRight = (char)0x2019; // right single quotation mark
        internal const char QuoteSingleBase = (char)0x201a; // single low-9 quotation mark
        internal const char QuoteReversed = (char)0x201b; // single high-reversed-9 quotation mark
        internal const char QuoteDoubleLeft = (char)0x201c; // left double quotation mark
        internal const char QuoteDoubleRight = (char)0x201d; // right double quotation mark
        internal const char QuoteLowDoubleLeft = (char)0x201E; // low double left quote used in german.
    }

The easiest solution is to use minus signs everywhere.

Another solution is to save files as Unicode UTF-8 encoding (preferred) or UTF-16 encoding (which I dislike).

–jeroen

Posted in .NET, CommandLine, Development, Encoding, PowerShell, PowerShell, Scripting, Software Development, Unicode, UTF-16, UTF-8, UTF16, UTF8 | Leave a Comment »

Default XML encoding is UTF-8 (or better: utf-8). If it contains other byte sequences, this is an error.

Posted by jpluimers on 2021/01/21

I should have had the below answer when writing about StUF – receiving data from a provider where UTF-8 is in fact ISO-8859.

A while ago, a co-worker did not believe when I told that default XML encoding really is UTF-8 (and tried to force it to utf-8), and that if the content had byte sequences different from the (either specified or default) encoding, it was a problem.

I though I blogged about the default, and where to find it, but apparently, I did not.

My blog had (and has <g>) a truckload of articles mentioning UTF-8, less articles containing UTF-8, encoding and xml, but the ones having UTF-8, default, encoding and xml did not actually tell about a standard that really defines XML uses UTF-8 as default encoding when there is no other encoding information – like BOM (byte order mark), HTTP, or MIME encoding) available.

W3C indeed specifies it. [WayBack] utf 8 – How default is the default encoding (UTF-8) in the XML Declaration? – Stack Overflow has a summary (thanks James Holderness!):

The Short Answer

Under the very specific circumstances of a UTF-8 encoded document with no external encoding information (which I understand from the comments is what you’re interested in), there is no difference between the two declarations.

The long answer is far more interesting though.

and an elaboration:

Read the rest of this entry »

Posted in Development, Encoding, Software Development, UTF-8, UTF8, XML, XML/XSD | Leave a Comment »

Bad surprise of the day: SysUtils.TEncoding in XE2+ defaults to ANSI, while in XE it defaulted to UTF-8.

Posted by jpluimers on 2020/03/11

Bad surprise of the day: SysUtils.TEncoding in XE2+ defaults to ANSI, while in XE it defaulted to UTF-8 .Among other things this means that TStringList… – Eric Grange – Google+

Source: Bad surprise of the day: SysUtils.TEncoding in XE2+ defaults to ANSI, while i…

Delphi

Eric Grange's profile photo

+Stefan Glienke Indeed, you’re right. The issue must be deeper somewhere. Don’t have time to investigate too much, I’m bypassing the RTL now (also have to work around the limitation that for utf-8 the TEncoding.GetString method returns an empty string if one character in the buffer isn’t utf-8)

Asbjørn Heid's profile photo

I wouldn’t trust the RTL at all with loading non-ascii text, we’ve had it hang on invalid UTF-8 codes more than once.

–jeroen

Posted in Ansi, Delphi, Development, Encoding, Software Development, UTF-8, UTF8 | Leave a Comment »

Which encoding failure did encode “vóór” into “v3/43/4r”? – Stack Overflow

Posted by jpluimers on 2020/02/24

From quite some time ago, but still very relevant as encoding issues keep occurring:

A while ago, I saw the text “v3/43/4r” in a document.I know it comes from “vóór” (the acute accent emphasises in Dutch), and wonder which encoding failure was applied to get this wrong.

Source: [WayBackWhich encoding failure did encode “vóór” into “v3/43/4r”? – Stack Overflow

From the [WayBack] answer by rodrigo:

  • ó: is U+00F3, and occupies the same codepoint (0xF3) in a lot of different encodings (most ISO-8859-* and most western Windows-*).
  • In CP850 the codepint 0xF3 is ¾ (U+00BE), that is the three-quarters character. It is the same in other, less used, codepages (CP775, CP856, CP857, CP858).
  • The ¾ is sometimes transliterated to 3/4 when the character is not directly available.

And there you are! “vóór” -> “v¾¾r” -> “v3/43/4r”.

The first part (ó -> ¾) is the usual corruption of ANSI vs. OEM codepages in the Western Windows versions (in my country ANSI=Windows-1252, OEM=CP850). You can see it easily creating a file with NOTEPAD, writing vóór and dumping it in a command prompt with type.

–jeroen

Posted in CP850, Development, Encoding, Software Development, UTF-8, UTF8, Windows-1252 | Leave a Comment »

Delphi Galileo IDE (version 8 and up): Force files to be saved as UTF8 – The Oracle at Delphi

Posted by jpluimers on 2019/07/04

Though formatting mangled the registry key to add, the article is interesting: since 2003 (C# Builder 1), you can force the IDE to always save files as UTF8 which should alleviate a lot of encoding problems.

It beats me why this isn’t the default setting, but below is an example .reg file for Delphi 8 which should be easily transformed to more recent Delphi versions:


Windows Registry Editor Version 5.00
[HKEY_CURRENT_USER\Software\Borland\BDS\2.0\Editor]
"DefaultFileFilter"="Borland.FileFilter.UTF8ToUTF8"

So basically (if formatting is kept), you browse to this key (replace Borland with the company for your specific Delphi version, and replace 2.0 by your IDE version):

HKEY_CURRENT_USER\Software\Borland\BDS\16.0\Editor

Then you add a new string value named DefaultFileFilter with value Borland.FileFilter.UTF8ToUTF8

More background [WayBack] The Oracle at Delphi: More IDE secrets – UTF8 and the Editor

The unmangled registry key (and more tips) was from [WayBackBSC Polska: Hidden possibilities of Delphi 8.

Get the list of HKEY_CURRENT_USER paths for your Delphi version at Update to List-Delphi-Installed-Packages.ps1 shows HKCU/HKLM keys and doesn’t truncated fields any more.

–jeroen

Via: [WayBack] Is there any way (IDE expert?) to automatic set encoding of each PAS file in UTF-8 instead of ANSI? – Jacek Laskowski – Google+

Posted in Delphi, Development, Encoding, Software Development, UTF-8, UTF8 | 1 Comment »

 
%d bloggers like this: