The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests

  • My badges

  • Twitter Updates

  • My Flickr Stream

  • Pages

  • All categories

  • Enter your email address to subscribe to this blog and receive notifications of new posts by email.

    Join 2,726 other followers

Archive for the ‘UTF-8’ Category

Last year, a classic Mojibake was introduced when Waterschap Amstel, Gooi en Vecht redesigned their IT systems

Posted by jpluimers on 2022/03/16

Last year, Waterschap Amstel, Gooi en Vecht sent me a paper letter notifying the yearly water bill was going to be late as they were redesigning their IT systems.

Their letter introduced a classic Mojibake that had not been present in all their older paper letter communication.

  • Street name on a letter via the old IT systems is "Pyreneeën":

    Pyreneeën goed geprint.

  • Street name on a letter via the new IT systems is "Pyreneeën":

    Pyreneeën geprint met Mojibake vervormingen.

Read the rest of this entry »

Posted in Development, Encoding, ftfy, Mojibake, Python, Software Development, Unicode, UTF-8, UTF8 | Leave a Comment »

In this day and age, web sites with delivery back-ends still have Unicode issues: at least @Woonveilig, @Medireva and @PostNL still have trouble

Posted by jpluimers on 2022/02/09

Nowadays, some 35 years after the first Unicode ideas got drafted and 30+ years after the Unicode Consortium saw the light, UTF-8 is served my more than 95% of the web as shown in yesterday’s post UTF-8 web adoption is huge, closing 100%, but only soured up since around 2006..

I mentioned this:

It means that nowadays there is a very small chance you will see mangled characters (what Japanese call mojibake) when you’re surfing the web.

Serving UTF8 does not mean no unicode problems.

Below are some issues that happened not too long ago and still happen. I have reported them to all parties involved through web-care, but no response whatsoever, and this is bad: Unicode support beyond basic ASCII for the below systems are still broken even for relatively simple non-ASCII characters based in diacritics decorating a standard ASCII character.

Yes, I know the realm of encoding and code pages is a mess, especially when handling data in multiple layers of an application stack. That’s why I wrote this post in the first place, and have a whole encoding category of blog posts plus a Mojibake subset.

Read the rest of this entry »

Posted in Communications Development, CP850, Dark Pattern, Development, Encoding, ISO-8859, ISO8859, Mojibake, Software Development, Unicode, User Experience (ux), UTF-16, UTF-8, Windows-1252 | Leave a Comment »

C# Effective way to find any file’s Encoding – Stack Overflow

Posted by jpluimers on 2022/02/09

Note: notepad cannot correctly guess the encoding, see the “old new thing”: [Wayback] Some files come up strange in Notepad | The Old New Thing (talking about ANSI a.k.a. Windows-1252, UTF-16LE, UTF-16BE, UTF-8, UTF-7 somewith and some without BOM as Notepad does not understand all permutations)

David Cumps discovered that certain text files come up strange in Notepad. The reason is that Notepad has to edit files in a variety of encodings, and when its back against the wall, sometimes it’s forced to guess.

[Wayback] C# Effective way to find any file’s Encoding – Stack Overflow shows how to detect various byte order marks in C#.

–jeroen

Posted in ASCII, Development, Encoding, Software Development, Unicode, UTF-16, UTF-32, UTF-8, UTF16, UTF32, UTF8 | Leave a Comment »

UTF-8 web adoption is huge, closing 100%, but only soured up since around 2006.

Posted by jpluimers on 2022/02/08

As a precursor to a post tomorrow showing that serving UTF8 does not mean organisations go without unicode problems, first some statistics.

The first Unicode ideas got drafted some 30 years ago in 1987. In 1991, more than 30 years ago, the Unicode Consortium saw the light. Nowadays more than 95% percent of the web-pages (close to 100% when you include plain ASCII) is served using the UTF-8 encoding.

It means that nowadays there is a very small chance you

will see mangled characters (what Japanese call mojibake) when you’re surfing the web.

Some nice graphs of unicode growth are at these locations are at these locations:

I think especially important are 2008 (when UTF-8 had outgrown all other individual encodings) and slightly after 2010, when UTF-8 alone covered more than 50% of the pages served. These exclude ASCII-only pages. Adding those would make the figures even larger.

graph showing a steep rise in the use of UTF-8 and a steep decline in other major encodings

Historical yearly trends in the usage statistics of character encodings for websites, June 2021

Historical yearly trends in the usage statistics of character encodings for websites, June 2021

–jeroen

Posted in Development, Encoding, Software Development, UTF-8, UTF8, Web Development | Leave a Comment »

When MySQL characterset ‘utf’ does not allow you to enter some Unicode code points

Posted by jpluimers on 2022/01/06

Contrary to what many believe is that MySQL utf8 is not always full blown UTF-8 support, but actually utf8mb3, which has been deprecated for a while now.

Only utf8mb4 will give you full blown UTF-8 support.

This when someone reminded me of this in a Delphi application:

When I insert :joy: emoji into mysql varchar filed I got an error :
#22007 Incorrect string value: '\xF0\x9F\x98\x82' for column 'remarks' at row 1

database charset is utf8

Note that the :joy: emoji is 😂 and has Unicode code point U+1F602 which is outside the basic multilingual plane.

See:

–jeroen

Posted in Conference Topics, Conferences, Database Development, Delphi, Development, Encoding, Event, MySQL, Software Development, UTF-8, UTF8 | Leave a Comment »

 
%d bloggers like this: