The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests

  • My badges

  • Twitter Updates

  • My Flickr Stream

  • Pages

  • All categories

  • Enter your email address to subscribe to this blog and receive notifications of new posts by email.

    Join 1,862 other subscribers

Archive for the ‘Encoding’ Category

.NET: XML escaping a string

Posted by jpluimers on 2022/02/15

[Wayback] WILT: XML encode a string in .net « Benoit MARTIN’s Weblog:

Always wondered why I couldn’t find a method that would XML encode a string, effectively escaping the 5 illegal characters for XML. There is such a method but its location in the API is not intuitive at all. It’s in the System.Security namespace: [Wayback] SecurityElement.Escape(String) Method (System.Security) | Microsoft Docs

public static string? Escape (string? str);

Its usage is:

   tagText = System.Security.SecurityElement.Escape(tagText);

This will escape the 5 characters <, >, &, " and '

–jeroen

Posted in .NET, Development, Encoding, Software Development, XML, XML escapes, XML/XSD | Leave a Comment »

DELPHI : EEncodingError – Invalid code page on windows xp embedded – Stack Overflow

Posted by jpluimers on 2022/02/15

From my Windows XP days  (which are long gone), but historically relevant the answer to [Wayback] DELPHI : EEncodingError – Invalid code page on windows xp embedded – Stack Overflow by [Wayback] Remy Lebeau:

The TEncoding.ASCII property uses codepage 20127, which is not installed on XP Embedded by default. You have to install it manually. The TEncoding class does not exist in D2006.

Are you using Indy 10, by chance? It uses TEncoding.ASCII by default for its string encodings. This exact error has been known to occur when using Indy on XP Embedded.

–jeroen

Posted in ASCII, Delphi, Development, Encoding, Power User, Software Development, XP-embedded | Leave a Comment »

Character set reencoding link archive

Posted by jpluimers on 2022/02/10

I will likely need some of these links in the future:

–jeroen

Posted in Apple, Development, Encoding, Mac, Mac OS X / OS X / MacOS, Power User, Software Development, Unicode | Leave a Comment »

In this day and age, web sites with delivery back-ends still have Unicode issues: at least @Woonveilig, @Medireva and @PostNL still have trouble

Posted by jpluimers on 2022/02/09

Nowadays, some 35 years after the first Unicode ideas got drafted and 30+ years after the Unicode Consortium saw the light, UTF-8 is served my more than 95% of the web as shown in yesterday’s post UTF-8 web adoption is huge, closing 100%, but only soured up since around 2006..

I mentioned this:

It means that nowadays there is a very small chance you will see mangled characters (what Japanese call mojibake) when you’re surfing the web.

Serving UTF8 does not mean no unicode problems.

Below are some issues that happened not too long ago and still happen. I have reported them to all parties involved through web-care, but no response whatsoever, and this is bad: Unicode support beyond basic ASCII for the below systems are still broken even for relatively simple non-ASCII characters based in diacritics decorating a standard ASCII character.

Yes, I know the realm of encoding and code pages is a mess, especially when handling data in multiple layers of an application stack. That’s why I wrote this post in the first place, and have a whole encoding category of blog posts plus a Mojibake subset.

Read the rest of this entry »

Posted in Communications Development, CP850, Dark Pattern, Development, Encoding, ISO-8859, ISO8859, Mojibake, Software Development, Unicode, User Experience (ux), UTF-16, UTF-8, Windows-1252 | Leave a Comment »

C# Effective way to find any file’s Encoding – Stack Overflow

Posted by jpluimers on 2022/02/09

Note: notepad cannot correctly guess the encoding, see the “old new thing”: [Wayback] Some files come up strange in Notepad | The Old New Thing (talking about ANSI a.k.a. Windows-1252, UTF-16LE, UTF-16BE, UTF-8, UTF-7 somewith and some without BOM as Notepad does not understand all permutations)

David Cumps discovered that certain text files come up strange in Notepad. The reason is that Notepad has to edit files in a variety of encodings, and when its back against the wall, sometimes it’s forced to guess.

[Wayback] C# Effective way to find any file’s Encoding – Stack Overflow shows how to detect various byte order marks in C#.

–jeroen

Posted in ASCII, Development, Encoding, Software Development, Unicode, UTF-16, UTF-32, UTF-8, UTF16, UTF32, UTF8 | Leave a Comment »

UTF-8 web adoption is huge, closing 100%, but only soured up since around 2006.

Posted by jpluimers on 2022/02/08

As a precursor to a post tomorrow showing that serving UTF8 does not mean organisations go without unicode problems, first some statistics.

The first Unicode ideas got drafted some 30 years ago in 1987. In 1991, more than 30 years ago, the Unicode Consortium saw the light. Nowadays more than 95% percent of the web-pages (close to 100% when you include plain ASCII) is served using the UTF-8 encoding.

It means that nowadays there is a very small chance you

will see mangled characters (what Japanese call mojibake) when you’re surfing the web.

Some nice graphs of unicode growth are at these locations are at these locations:

I think especially important are 2008 (when UTF-8 had outgrown all other individual encodings) and slightly after 2010, when UTF-8 alone covered more than 50% of the pages served. These exclude ASCII-only pages. Adding those would make the figures even larger.

graph showing a steep rise in the use of UTF-8 and a steep decline in other major encodings

Historical yearly trends in the usage statistics of character encodings for websites, June 2021

Historical yearly trends in the usage statistics of character encodings for websites, June 2021

–jeroen

Posted in Development, Encoding, Software Development, UTF-8, UTF8, Web Development | Leave a Comment »

When MySQL characterset ‘utf’ does not allow you to enter some Unicode code points

Posted by jpluimers on 2022/01/06

Contrary to what many believe is that MySQL utf8 is not always full blown UTF-8 support, but actually utf8mb3, which has been deprecated for a while now.

Only utf8mb4 will give you full blown UTF-8 support.

This when someone reminded me of this in a Delphi application:

When I insert :joy: emoji into mysql varchar filed I got an error :
#22007 Incorrect string value: '\xF0\x9F\x98\x82' for column 'remarks' at row 1

database charset is utf8

Note that the :joy: emoji is 😂 and has Unicode code point U+1F602 which is outside the basic multilingual plane.

See:

–jeroen

Posted in Conference Topics, Conferences, Database Development, Delphi, Development, Encoding, Event, MySQL, Software Development, UTF-8, UTF8 | Leave a Comment »

LIDL Radio Controlled Wall Clock IAN 100489 English manual

Posted by jpluimers on 2022/01/06

Model 100489-14-01 wall clock

Model 100489-14-01 wall clock

Just in case I need it again.

The signal quality fluctuates during the day (it is a lot better at night when there is less inionisation in the atmosphere), and is worsened by concrete walls (like our home).

Best way to get prolonged reception is at night, on the top floor behind a window or outside.

The clock usually needs between 3 and 10 minutes to pick up the DCF77 signal from the transmitter.

Wall clock manual: [Wayback] 100489_EN.pdf of which this abstract:

DCF77 HD-1688 clock mechanism

DCF77 HD-1688 clock mechanism

Numbers:

  1. M.SET button
    • Press and keep pressed the M.SET button 1 at least 3 seconds. The wall clock switches into manual mode.
    • Press and keep pressed the M.SET button again until the hands reach the correct position for you to set the time.
    • Briefly pressing the M.SET button moves the hands forward in one minute steps to enable you to set the current time manually.
      Note: After 8 seconds without pressing the M.SET button, the wall clock switches out of manual mode and keeps the time as normal. The manually set value is overwritten as soon as reception of the DCF radio time signal is successful.
  2. RESET button
    • Press the RESET button 2 to reset the radio clock settings. Alternatively, remove the batteries from the device and insert them again.
    • The product now automatically starts to search for the DCF radio time signal.
  3. REC button
    • Press and keep pressed the REC button 3 at least 5 seconds. The wall clock attempts to receive the DCF radio time signal. This process takes a few minutes to complete.
  4. Battery compartment
    • Battery type: 1 x 1.5 V ⎓ AA, LR6

 

More on the signal, transmitter and encoding: DCF77 – Wikipedia, where the below images are from:

DCF77 reception area from Mainflingen

DCF77 signal strength over a 24-hour period measured in Nerja, on the south coast of Spain 1,801 km (1,119 mi) from the transmitter. Around 1 AM it peaks at ≈ 100 µV/m signal strength. During the day, the signal is weakened by ionization of the ionosphere due to solar activity.

Another DCF77 clock I have: CSL Bearware 302658 DCF clock manual

–jeroen

Posted in Development, Encoding, Hardware Development, LifeHacker, Power User, Software Development | 2 Comments »

PowerShell error in a script but not on the console: The string is missing the terminator: “.

Posted by jpluimers on 2021/09/29

The below one will fail in a script, both both work from the PowerShell prompt:

Success

Get-NetFirewallRule -DisplayGroup "File and Printer Sharing" | ForEach-Object { Write-Host $_.DisplayName ; Get-NetFirewallAddressFilter -AssociatedNetFirewallRule $_ }

Failure

Get-NetFirewallRule –DisplayGroup "File and Printer Sharing" | ForEach-Object { Write-Host $_.DisplayName ; Get-NetFirewallAddressFilter -AssociatedNetFirewallRule $_ }

The error you get this this:

At C:\bin\Show-File-and-Printer-Sharing-firewall-rules.ps1:5 char:52
+ ... -TCP-NoScope" | ForEach-Object { Write-Host $_.DisplayName ; Get-NetF ...
+                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The string is missing the terminator: ".
    + CategoryInfo          : ParserError: (:) [], ParentContainsErrorRecordException
    + FullyQualifiedErrorId : TerminatorExpectedAtEndOfString

Via [WayBack] script file ‘The string is missing the terminator: “.’ – Google Search, I quickly found these that stood out:

Cause and solution

Before DisplayGroup, the first line has a minus sign and the second an en-dash. You can see this via [WayBack] What Unicode character is this ?.

Apparently, when using Unicode on the console, it does not matter if you have a minus sign (-), en-dash (–), em-dash (—) or horizontal bar (―) as dash character. You can see this in [WayBack] tokenizer.cs at function [WayBack] NextToken and [WayBack] CharTraits.cs at function [WayBack] IsChar).

When saving to a non-Unicode file, it does matter, even though it does not display as garbage in the error message.

Similarly, PowerShell has support for these special characters:

    internal static class SpecialChars
    {
        // Uncommon whitespace
        internal const char NoBreakSpace = (char)0x00a0;
        internal const char NextLine = (char)0x0085;

        // Special dashes
        internal const char EnDash = (char)0x2013;
        internal const char EmDash = (char)0x2014;
        internal const char HorizontalBar = (char)0x2015;

        // Special quotes
        internal const char QuoteSingleLeft = (char)0x2018; // left single quotation mark
        internal const char QuoteSingleRight = (char)0x2019; // right single quotation mark
        internal const char QuoteSingleBase = (char)0x201a; // single low-9 quotation mark
        internal const char QuoteReversed = (char)0x201b; // single high-reversed-9 quotation mark
        internal const char QuoteDoubleLeft = (char)0x201c; // left double quotation mark
        internal const char QuoteDoubleRight = (char)0x201d; // right double quotation mark
        internal const char QuoteLowDoubleLeft = (char)0x201E; // low double left quote used in german.
    }

The easiest solution is to use minus signs everywhere.

Another solution is to save files as Unicode UTF-8 encoding (preferred) or UTF-16 encoding (which I dislike).

–jeroen

Posted in .NET, CommandLine, Development, Encoding, PowerShell, PowerShell, Scripting, Software Development, Unicode, UTF-16, UTF-8, UTF16, UTF8 | Leave a Comment »

Unicode superscript and subscript alphabetic letters

Posted by jpluimers on 2021/08/04

Not all letters have superscript or subscript counterparts. The counterparts are from different ranges, so might not look nice when next to each other.

I think 20th using Unicode lowercase superscript looks ugly 20ᵗʰ. With uppercase superscript it is somewhat OK: 20ᵀᴴ.

The list is from [WayBack] javascript – How to find the unicode of the subscript alphabet? – Stack Overflow:

Take a look at the wikipedia article Unicode subscripts and superscripts. It looks like these are spread out across different ranges, and not all characters are available.

Consolidated for cut-and-pasting purposes, the Unicode standard defines complete sub- and super-scripts for numbers and common mathematical symbols ( ⁰ ¹ ² ³ ⁴ ⁵ ⁶ ⁷ ⁸ ⁹ ⁺ ⁻ ⁼ ⁽ ⁾ ₀ ₁ ₂ ₃ ₄ ₅ ₆ ₇ ₈ ₉ ₊ ₋ ₌ ₍ ₎ ), a full superscript Latin lowercase alphabet except q ( ᵃ ᵇ ᶜ ᵈ ᵉ ᶠ ᵍ ʰ ⁱ ʲ ᵏ ˡ ᵐ ⁿ ᵒ ᵖ ʳ ˢ ᵗ ᵘ ᵛ ʷ ˣ ʸ ᶻ ), a limited uppercase Latin alphabet ( ᴬ ᴮ ᴰ ᴱ ᴳ ᴴ ᴵ ᴶ ᴷ ᴸ ᴹ ᴺ ᴼ ᴾ ᴿ ᵀ ᵁ ⱽ ᵂ ), a few subscripted lowercase letters ( ₐ ₑ ₕ ᵢ ⱼ ₖ ₗ ₘ ₙ ₒ ₚ ᵣ ₛ ₜ ᵤ ᵥ ₓ ), and some Greek letters ( ᵅ ᵝ ᵞ ᵟ ᵋ ᶿ ᶥ ᶲ ᵠ ᵡ ᵦ ᵧ ᵨ ᵩ ᵪ ). Note that since these glyphs come from different ranges, they may not be of the same size and position, depending on the typeface.

After a nice chat with my nephew EWD, I did some research and found the above via

–jeroen

Posted in Development, Encoding, internatiolanization (i18n) and localization (l10), Power User, Software Development, Unicode | Leave a Comment »