Archive for the ‘Encoding’ Category

Some links on sending SMS and the protocols/types involved

Posted by jpluimers on 2022/02/16

So I can find them back later:

SMS: Short Message Service. Messages limited to 140 octets (160 7-bit characters, 140 8-bit characters or 70 16-bit characters) sent mainly over the GSM or UMTS mobile networks.
Concatenated SMS or Multipart SMS: a way to send messages longer than 140 octets. Works on most devices and with most operators. Each part is billed separately.
MSISDN: a number uniquely identifying a subscription in a GSM or a UMTS mobile network. Always starts with the country code. Never includes a prefix (like 00 or +).
SMPP: Short Message Peer-to-Peer.
HLR: Home Location Register.
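
The three character limits in the SMS entry all follow from the same 140-octet payload: 1120 bits divided by the width of one character. A minimal Python sketch of that arithmetic (the name `sms_capacity` is made up for this illustration, and it ignores any per-part overhead that concatenated messages add):

```python
SMS_PAYLOAD_OCTETS = 140  # payload size of a single SMS, as stated above


def sms_capacity(bits_per_char: int) -> int:
    """Return how many characters of the given width fit in one SMS."""
    return (SMS_PAYLOAD_OCTETS * 8) // bits_per_char


for bits in (7, 8, 16):
    print(f"{bits}-bit characters: {sms_capacity(bits)} per message")
```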

An interesting party with some public SMS APIs is MessageBird. You can compare their old and new ones:

New REST based APIs: [Wayback] Developers – MessageBird.
Old APIs: [Wayback] The APIs of MessageBird – MessageBird.
Repository: [Wayback] https://github.com/mobiletulip

Read the rest of this entry »

Posted in Development, Encoding, Software Development | Leave a Comment »

.NET: XML escaping a string

Posted by jpluimers on 2022/02/15

[Wayback] WILT: XML encode a string in .net « Benoit MARTIN’s Weblog:

Always wondered why I couldn’t find a method that would XML encode a string, effectively escaping the 5 illegal characters for XML. There is such a method but its location in the API is not intuitive at all. It’s in the System.Security namespace: [Wayback] SecurityElement.Escape(String) Method (System.Security) | Microsoft Docs
public static string? Escape (string? str);
Its usage is:
   tagText = System.Security.SecurityElement.Escape(tagText);
This will escape the 5 characters <, >, &, " and '
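
For comparison outside .NET, here is a rough Python sketch of the same idea. It is not the .NET method, just an equivalent under the assumption that you want the same five characters escaped; `xml_escape` is a name invented here. The standard library's `xml.sax.saxutils.escape` only handles `&`, `<` and `>` by default, so the two quote characters are passed in as extra entities.

```python
from xml.sax.saxutils import escape

# escape() handles &, < and > by default; the two quote characters
# are passed as extra entities to cover all five.
_QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def xml_escape(text: str) -> str:
    """Escape the five XML special characters in text."""
    return escape(text, _QUOTE_ENTITIES)


print(xml_escape("""<a title="Tom & 'Jerry'">"""))
```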

–jeroen

Posted in .NET, Development, Encoding, Software Development, XML, XML escapes, XML/XSD | Leave a Comment »

DELPHI : EEncodingError – Invalid code page on windows xp embedded – Stack Overflow

Posted by jpluimers on 2022/02/15

From my Windows XP days (which are long gone), but historically relevant: the answer to [Wayback] DELPHI : EEncodingError – Invalid code page on windows xp embedded – Stack Overflow by [Wayback] Remy Lebeau:

The TEncoding.ASCII property uses codepage 20127, which is not installed on XP Embedded by default. You have to install it manually. The TEncoding class does not exist in D2006.

Are you using Indy 10, by chance? It uses TEncoding.ASCII by default for its string encodings. This exact error has been known to occur when using Indy on XP Embedded.

–jeroen

Posted in ASCII, Delphi, Development, Encoding, Power User, Software Development, XP-embedded | Leave a Comment »

Character set reencoding link archive

Posted by jpluimers on 2022/02/10

I will likely need some of these links in the future:

–jeroen

Posted in Apple, Development, Encoding, Mac, Mac OS X / OS X / MacOS, Power User, Software Development, Unicode | Leave a Comment »

In this day and age, web sites with delivery back-ends still have Unicode issues: at least @Woonveilig, @Medireva and @PostNL still have trouble

Posted by jpluimers on 2022/02/09

Nowadays, some 35 years after the first Unicode ideas got drafted and 30+ years after the Unicode Consortium saw the light, UTF-8 is served by more than 95% of the web, as shown in yesterday’s post UTF-8 web adoption is huge, closing 100%, but only soured up since around 2006.

I mentioned this:

It means that nowadays there is a very small chance you will see mangled characters (what Japanese call mojibake) when you’re surfing the web.

Serving UTF8 does not mean no unicode problems.

Below are some issues that happened not too long ago and still happen. I have reported them to all parties involved through web-care, but got no response whatsoever. This is bad: Unicode support beyond basic ASCII in the systems below is still broken, even for relatively simple non-ASCII characters that consist of a standard ASCII character decorated with a diacritic.

Yes, I know the realm of encoding and code pages is a mess, especially when handling data in multiple layers of an application stack. That’s why I wrote this post in the first place, and have a whole encoding category of blog posts plus a Mojibake subset.
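
To make the layering problem concrete, here is a small Python sketch of how one wrong assumption between two layers produces mojibake. It assumes the mismatch is UTF-8 bytes being read as Windows-1252, which is only one of many possible combinations; the helper names `mojibake` and `repair` are invented for this example.

```python
def mojibake(text: str, wrong_encoding: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


def repair(garbled: str, wrong_encoding: str = "cp1252") -> str:
    """Undo a single round of mojibake by reversing the two steps."""
    return garbled.encode(wrong_encoding).decode("utf-8")


name = "Renée"
garbled = mojibake(name)
print(garbled)
print(repair(garbled))
```

The repair only works when exactly one wrong round trip happened and no bytes were lost along the way; once a layer replaces unknown bytes with a question mark, the original is gone.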

Read the rest of this entry »

Posted in Communications Development, CP850, Dark Pattern, Development, Encoding, ISO-8859, ISO8859, Mojibake, Software Development, Unicode, User Experience (ux), UTF-16, UTF-8, Windows-1252 | Leave a Comment »

C# Effective way to find any file’s Encoding – Stack Overflow

Posted by jpluimers on 2022/02/09

Note: Notepad cannot always guess the encoding correctly; see “The Old New Thing”: [Wayback] Some files come up strange in Notepad | The Old New Thing (talking about ANSI a.k.a. Windows-1252, UTF-16LE, UTF-16BE, UTF-8 and UTF-7, some with and some without BOM, as Notepad does not understand all permutations).

David Cumps discovered that certain text files come up strange in Notepad. The reason is that Notepad has to edit files in a variety of encodings, and when its back against the wall, sometimes it’s forced to guess.

[Wayback] C# Effective way to find any file’s Encoding – Stack Overflow shows how to detect various byte order marks in C#.
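
The linked answer is in C#; as a language-neutral illustration, this is roughly what byte order mark detection looks like in Python using the constants from the standard `codecs` module. It is a sketch, not a port of that answer, and `detect_bom` is a name made up here. It only recognises files that actually start with a BOM; anything else still needs guessing.

```python
import codecs

# Longest BOMs first: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return the encoding name implied by a leading BOM, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


# The first four bytes of a file are enough for every BOM in the table.
print(detect_bom(b"\xef\xbb\xbfhello"))
print(detect_bom(b"hello"))
```

One ambiguity remains: a UTF-16 LE file whose first character is U+0000 starts with the same four bytes as a UTF-32 LE BOM, so this check would misreport it.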

–jeroen

Posted in ASCII, Development, Encoding, Software Development, Unicode, UTF-16, UTF-32, UTF-8, UTF16, UTF32, UTF8 | Leave a Comment »

UTF-8 web adoption is huge, closing 100%, but only soured up since around 2006.

Posted by jpluimers on 2022/02/08

As a precursor to a post tomorrow showing that serving UTF8 does not mean organisations go without unicode problems, first some statistics.

The first Unicode ideas got drafted some 35 years ago in 1987. In 1991, more than 30 years ago, the Unicode Consortium saw the light. Nowadays more than 95% of web pages (close to 100% when you include plain ASCII) are served using the UTF-8 encoding.

It means that nowadays there is a very small chance you will see mangled characters (what Japanese call mojibake) when you’re surfing the web.

Some nice graphs of Unicode growth are at these locations:

Popularity of text encodings – Wikipedia
[Wayback] W3C: Who uses Unicode?
[Archive.is] Web Technologies Statistics and Trends: W3Techs shows statistics and trends in the usage statistics of web technologies
2008: [Wayback] utf-8 Growth On The Web | W3C Blog
2012: [Wayback] Official Google Blog: Unicode over 60 percent of the web
2012: [Archive.is] Usage Statistics of Character Encodings for Websites, May 2012
2015: [Wayback] UTF-8 Unicode vs. other encodings over time | Pinyin News
2020: [Archive.is] Usage Statistics and Market Share of Character Encodings for Websites, August 2020
2010-2021: [Archive.is] Historical yearly trends in the usage statistics of character encodings for websites, June 2021: from 50% UTF-8 in 2010 to almost 97% mid 2021 (with second-place ISO-8859-1 at just 1.3%, leaving less than 1.5% for all other encodings, see [Archive.is] Usage Statistics and Market Share of Character Encodings for Websites, June 2021)

I think especially important are 2008 (when UTF-8 had outgrown all other individual encodings) and slightly after 2010, when UTF-8 alone covered more than 50% of the pages served. These exclude ASCII-only pages. Adding those would make the figures even larger.

Historical yearly trends in the usage statistics of character encodings for websites, June 2021

–jeroen

Posted in Development, Encoding, Software Development, UTF-8, UTF8, Web Development | Leave a Comment »

When MySQL characterset ‘utf’ does not allow you to enter some Unicode code points

Posted by jpluimers on 2022/01/06

Contrary to what many believe, MySQL utf8 is not always full-blown UTF-8 support; it is actually utf8mb3, which has been deprecated for a while now.

Only utf8mb4 will give you full blown UTF-8 support.

Someone reminded me of this when they ran into it in a Delphi application:

When I insert :joy: emoji into mysql varchar filed I got an error :
#22007 Incorrect string value: '\xF0\x9F\x98\x82' for column 'remarks' at row 1

database charset is utf8

Note that the :joy: emoji is 😂 and has Unicode code point U+1F602 which is outside the basic multilingual plane.
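
A quick way to see why this particular character is a problem: its UTF-8 encoding is the same four bytes that show up in the error message, one more than a three-byte encoding can hold. The Python sketch below checks that; `fits_in_bmp` is a hypothetical helper for this post, not anything MySQL provides.

```python
def fits_in_bmp(text: str) -> bool:
    """True if every code point is in the Basic Multilingual Plane (<= U+FFFF)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


joy = "\U0001F602"  # FACE WITH TEARS OF JOY

print(hex(ord(joy)))        # the code point
print(joy.encode("utf-8"))  # the bytes quoted in the MySQL error
print(fits_in_bmp(joy))
print(fits_in_bmp("café"))
```

A check like this could run before an insert to warn about text that a three-byte column cannot store, though switching the column to utf8mb4 is the real fix.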

See:

[Wayback] Unicode Character ‘FACE WITH TEARS OF JOY’ (U+1F602)
Plane (Unicode): Overview, Basic Multilingual Plane – Wikipedia
[Archive.is] Kristian Köhntopp on Twitter: “MySQL also, for quite some time now, no longer updates its own charsets and collations internally, for the same reason. So utf8 in MySQL is utf8mb3, the three byte variant of Unicode UTF-8 implementation that covers only the BMP (unicode up to U+FFFF).”
- Kristian Köhntopp:
  »Where does PostgreSQL’s collation logic come from?
  PostgreSQL relies on external libraries to order strings.
  – libc, meaning the operating system locale facility (POSIX or Windows)
  – icu, meaning the ICU project (if PostgreSQL was built with ICU support)«
- MySQL does things differently:
  MySQL binary data files are independent of the host operating system in byte order, number representation (as long as the host fulfils MySQLs basic requirements), collation and even time zone handling.
- So MySQL implements collations internally, also to guarantee stability across OS updates.
  If it didn’t, a libc update changing collations would mean you have to recreate a lot of indexes. Also, you would not be able to safely move data files from host to host.
- MySQL also, for quite some time now, no longer updates its own charsets and collations internally, for the same reason.
  So utf8 in MySQL is utf8mb3, the three byte variant of Unicode UTF-8 implementation that covers only the BMP (unicode up to U+FFFF).
- When moving to fuller (multiplane) UTF-8 support, a new name was needed, and utf8mb4 was chosen.
  So when you actually want modern utf8 in MySQL, you have to use utf8mb4, and now you know why.
- utf8 is deprecated and will be upgraded to utf8mb4 in some future MySQL release. This will be a breaking upgrade, and I wonder if it will require dropping and recreating all indexes affected by the change.
  That will be painful.
- https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8mb3.html …
  utf8mb3 page in the MySQL 8.0 manual, with deprecation notice.
  What will change is the meaning of the alias utf8 (currently an alias for utf8mb3).
[Wayback] MySQL: Some Character Set Basics | Die wunderbare Welt von Isotopp
[Wayback] MySQL :: MySQL 8.0 Reference Manual :: 10.9.2 The utf8mb3 Character Set (3-Byte UTF-8 Unicode Encoding)

utf8 is an alias for utf8mb3; the character limit is implicit, rather than explicit in the name.

Note

The utf8mb3 character set is deprecated and you should expect it to be removed in a future MySQL release. Please use utf8mb4 instead. Although utf8 is currently an alias for utf8mb3, at some point utf8 is expected to become a reference to utf8mb4. To avoid ambiguity about the meaning of utf8, consider specifying utf8mb4 explicitly for character set references instead of utf8.
[Wayback] MySQL :: MySQL 8.0 Reference Manual :: 10.9.1 The utf8mb4 Character Set (4-Byte UTF-8 Unicode Encoding)
utf8mb4 contrasts with the utf8mb3 character set, which supports only BMP characters and uses a maximum of three bytes per character:
- For a BMP character, utf8mb4 and utf8mb3 have identical storage characteristics: same code values, same encoding, same length.
- For a supplementary character, utf8mb4 requires four bytes to store it, whereas utf8mb3 cannot store the character at all. When converting utf8mb3 columns to utf8mb4, you need not worry about converting supplementary characters because there are none.

–jeroen

Posted in Conference Topics, Conferences, Database Development, Delphi, Development, Encoding, Event, MySQL, Software Development, UTF-8, UTF8 | Leave a Comment »

LIDL Radio Controlled Wall Clock IAN 100489 English manual

Posted by jpluimers on 2022/01/06

Model 100489-14-01 wall clock

Just in case I need it again.

The signal quality fluctuates during the day (it is a lot better at night when there is less ionisation in the atmosphere), and is worsened by concrete walls (like our home).

Best way to get prolonged reception is at night, on the top floor behind a window or outside.

The clock usually needs between 3 and 10 minutes to pick up the DCF77 signal from the transmitter.

Wall clock manual: [Wayback] 100489_EN.pdf, of which this is an excerpt:

DCF77 HD-1688 clock mechanism

Numbers:

M.SET button

Press and keep pressed the M.SET button 1 at least 3 seconds. The wall clock switches into manual mode.

Press and keep pressed the M.SET button again until the hands reach the correct position for you to set the time.

Briefly pressing the M.SET button moves the hands forward in one minute steps to enable you to set the current time manually.
Note: After 8 seconds without pressing the M.SET button, the wall clock switches out of manual mode and keeps the time as normal. The manually set value is overwritten as soon as reception of the DCF radio time signal is successful.

RESET button

Press the RESET button 2 to reset the radio clock settings. Alternatively, remove the batteries from the device and insert them again.

The product now automatically starts to search for the DCF radio time signal.

REC button

Press and keep pressed the REC button 3 at least 5 seconds. The wall clock attempts to receive the DCF radio time signal. This process takes a few minutes to complete.

Battery compartment

Battery type: 1 x 1.5 V ⎓ AA, LR6

More on the signal, transmitter and encoding: DCF77 – Wikipedia, where the below images are from:

DCF77 reception area from Mainflingen

DCF77 signal strength over a 24-hour period measured in Nerja, on the south coast of Spain 1,801 km (1,119 mi) from the transmitter. Around 1 AM it peaks at ≈ 100 µV/m signal strength. During the day, the signal is weakened by ionization of the ionosphere due to solar activity.

Another DCF77 clock I have: CSL Bearware 302658 DCF clock manual

–jeroen

Posted in Development, Encoding, Hardware Development, LifeHacker, Power User, Software Development | 2 Comments »

PowerShell error in a script but not on the console: The string is missing the terminator: “.

Posted by jpluimers on 2021/09/29

The second command below fails in a script, but both work from the PowerShell prompt:

Success

Get-NetFirewallRule -DisplayGroup "File and Printer Sharing" | ForEach-Object { Write-Host $_.DisplayName ; Get-NetFirewallAddressFilter -AssociatedNetFirewallRule $_ }

Failure

Get-NetFirewallRule –DisplayGroup "File and Printer Sharing" | ForEach-Object { Write-Host $_.DisplayName ; Get-NetFirewallAddressFilter -AssociatedNetFirewallRule $_ }

The error you get is this:

At C:\bin\Show-File-and-Printer-Sharing-firewall-rules.ps1:5 char:52
+ ... -TCP-NoScope" | ForEach-Object { Write-Host $_.DisplayName ; Get-NetF ...
+                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The string is missing the terminator: ".
    + CategoryInfo          : ParserError: (:) [], ParentContainsErrorRecordException
    + FullyQualifiedErrorId : TerminatorExpectedAtEndOfString

Via [WayBack] script file ‘The string is missing the terminator: “.’ – Google Search, I quickly found these that stood out:

[WayBack] Reddit: Getting error- the string is missing the terminator: “. : PowerShell
That hyphen character in your -Reset (noticed by /u/SeeminglyScience) is an En-Dash (Decimal unicode character 8211). (Copy paste it and do [int]([char]'–') to see).

PowerShell can handle en-dash as a hyphen (for the curious, see the case for it in the PowerShell tokenizer source code here where hyphen, en-dash, em-dash and horizontalBar all fall into the same code block), but because it’s a Unicode character your source code will have to be saved as a unicode encoded .ps1 file. (UCS2-LE, UTF8+BOM, UTF8-BOM seem to work for me).

If you don’t, and you save it as ASCII/ANSI, what gets saved is â€“ which has a string opening in it. So the code becomes:
```
Set-ADAccountPassword $uname -NewPassword $newpwd â€“Reset -confirm -PassThru
```
Which now has an open string that doesn’t finish, and throws errors about unterminated strings.
[WayBack] azure – PowerShell script error: the string is missing the terminator: – Stack Overflow
[WayBack] powershell is missing the terminator: ” – Stack Overflow
Look closely at the two dashes in
```
unzipRelease –Src '$ReleaseFile' -Dst '$Destination'
```
This first one is not a normal dash but an en-dash (&ndash; in HTML). Replace that with the dash found before Dst.

Cause and solution

Before DisplayGroup, the first line has a minus sign and the second an en-dash. You can see this via [WayBack] What Unicode character is this ?.

Apparently, when using Unicode on the console, it does not matter if you have a minus sign (-), en-dash (–), em-dash (—) or horizontal bar (―) as dash character. You can see this in [WayBack] tokenizer.cs at function [WayBack] NextToken and [WayBack] CharTraits.cs at function [WayBack] IsChar.

When saving to a non-Unicode file, it does matter, even though it does not display as garbage in the error message.
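
The mechanism can be reproduced outside PowerShell. The Python sketch below assumes the script was saved as UTF-8 without BOM and then read as Windows-1252, which is my reading of the situation rather than something the error message states. The en dash becomes three characters, and the last one is a left double quotation mark, which is what opens the unterminated string.

```python
EN_DASH = "\u2013"

utf8_bytes = EN_DASH.encode("utf-8")   # three bytes: E2 80 93
misread = utf8_bytes.decode("cp1252")  # what a Windows-1252 reader sees

print(ord(EN_DASH))                    # code point in decimal
print(utf8_bytes.hex(" "))
print([f"U+{ord(ch):04X}" for ch in misread])
```

The third code point, U+201C, is the same value as QuoteDoubleLeft in the PowerShell source listing in this post, so the tokenizer treats it as the start of a double-quoted string.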

Similarly, PowerShell has support for these special characters:

    internal static class SpecialChars
    {
        // Uncommon whitespace
        internal const char NoBreakSpace = (char)0x00a0;
        internal const char NextLine = (char)0x0085;

        // Special dashes
        internal const char EnDash = (char)0x2013;
        internal const char EmDash = (char)0x2014;
        internal const char HorizontalBar = (char)0x2015;

        // Special quotes
        internal const char QuoteSingleLeft = (char)0x2018; // left single quotation mark
        internal const char QuoteSingleRight = (char)0x2019; // right single quotation mark
        internal const char QuoteSingleBase = (char)0x201a; // single low-9 quotation mark
        internal const char QuoteReversed = (char)0x201b; // single high-reversed-9 quotation mark
        internal const char QuoteDoubleLeft = (char)0x201c; // left double quotation mark
        internal const char QuoteDoubleRight = (char)0x201d; // right double quotation mark
        internal const char QuoteLowDoubleLeft = (char)0x201E; // low double left quote used in german.
    }

The easiest solution is to use minus signs everywhere.

Another solution is to save files as Unicode UTF-8 encoding (preferred) or UTF-16 encoding (which I dislike).
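
For the first solution, a small clean-up pass over the script text is enough. This Python sketch maps the dash and quote code points from the SpecialChars listing above to their ASCII counterparts; the function name `to_ascii_punctuation` is invented here, and the mapping is my own choice, not something PowerShell ships.

```python
# Code points taken from the SpecialChars listing above.
DASHES = "\u2013\u2014\u2015"               # en dash, em dash, horizontal bar
SINGLE_QUOTES = "\u2018\u2019\u201a\u201b"
DOUBLE_QUOTES = "\u201c\u201d\u201e"

_TABLE = str.maketrans(
    DASHES + SINGLE_QUOTES + DOUBLE_QUOTES,
    "-" * len(DASHES) + "'" * len(SINGLE_QUOTES) + '"' * len(DOUBLE_QUOTES),
)


def to_ascii_punctuation(script: str) -> str:
    """Replace typographic dashes and quotes with their ASCII equivalents."""
    return script.translate(_TABLE)


line = "Get-NetFirewallRule \u2013DisplayGroup \u201cFile and Printer Sharing\u201d"
print(to_ascii_punctuation(line))
```

Note that this also rewrites typographic characters inside string literals, so review the result when a script intentionally contains them.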

–jeroen

Posted in .NET, CommandLine, Development, Encoding, PowerShell, PowerShell, Scripting, Software Development, Unicode, UTF-16, UTF-8, UTF16, UTF8 | Leave a Comment »


The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests
