Archive for the ‘UTF8’ Category

A while ago I bumped into some GPI Mojibake examples, but soon found out I should use the ftfy test cases

Posted by jpluimers on 2022/11/22

I have been bumping into more and more Mojibake example pages like [Wayback] Mojibake: Question Marks, Strange Characters and Other Issues | GPI:

Have you ever found strange characters like these �� when viewing content in applications or websites in other languages?

They made me realise that all these (including the Mojibake examples on my blog) are just artifacts, but the real list of examples is the set of ftfy test cases at [Wayback/Archive.is] python-ftfy/test_cases.json at master · LuminosoInsight/python-ftfy

I got reminded when Waternet moved from paper mail using “Pyreneeën” to email using “PyreneeÃ«n“. Not as bad as Waterschap AGV did earlier: they took it one level further and made “PyreneeÃÂ«n” out of it, see Last year, a classic Mojibake was introduced when Waterschap Amstel, Gooi en Vecht redesigned their IT systems.
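Both variants are easy to reproduce. The sketch below is not what either organisation actually runs; it assumes the bytes were written as UTF-8 and then read back as Latin-1 (ISO-8859-1), which is only one of several codec mix-ups that give this visible result. The helper name `misdecode` is made up for illustration.

```python
def misdecode(text: str, wrong_codec: str = "latin-1") -> str:
    """Encode text as UTF-8, then decode those bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_codec)


original = "Pyreneeën"
once = misdecode(original)   # one round of damage
twice = misdecode(once)      # the same mistake applied a second time

print(once)          # PyreneeÃ«n
print(ascii(twice))  # 'Pyrenee\xc3\x83\xc2\xabn'
```

The second result contains the invisible control character U+0083, which is why the doubly mangled name shows only four odd-looking characters on paper.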

This seems like a trend where newer systems perform worse than older systems. I wonder why that is.

BTW: the trick on the [Wayback/Archive] Python.org shell to run ftfy (which is not installed by default) is first dropping to the shell (see my post How do I drop a bash shell from within Python? – Stack Overflow), then starting python again:

Read the rest of this entry »

Posted in CP850, Development, Encoding, ftfy, ISO-8859, Mojibake, Python, Scripting, Software Development, Unicode, UTF-8, UTF8

Last year, a classic Mojibake was introduced when Waterschap Amstel, Gooi en Vecht redesigned their IT systems

Posted by jpluimers on 2022/03/16

Last year, Waterschap Amstel, Gooi en Vecht sent me a paper letter notifying me that the yearly water bill was going to be late, as they were redesigning their IT systems.

Their letter introduced a classic Mojibake that had not been present in any of their older paper letters.

Street name on a letter via the old IT systems is "Pyreneeën":
Street name on a letter via the new IT systems is "PyreneeÃÂ«n":
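A rough way to see how many times a string was mangled is to reverse the mistake until it stops working. This is a minimal sketch that assumes every layer was "UTF-8 bytes read as Latin-1"; the function name `unmangle` and the `max_layers` cap are hypothetical, and ftfy itself uses far more careful heuristics than this.

```python
def unmangle(text: str, wrong_codec: str = "latin-1", max_layers: int = 5):
    """Undo repeated 'UTF-8 read as wrong_codec' damage; return (text, layers)."""
    layers = 0
    while layers < max_layers:
        try:
            repaired = text.encode(wrong_codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # the text no longer looks like mis-decoded UTF-8
        if repaired == text:
            break  # pure ASCII: nothing left to undo
        text = repaired
        layers += 1
    return text, layers


mangled = "Pyrenee\u00c3\u0083\u00c2\u00abn"  # doubly mangled, written with escapes
print(unmangle(mangled))  # ('Pyreneeën', 2)
```

Blindly applying this to arbitrary text is risky: a string that legitimately contains a sequence like "Ã«" would be "repaired" into something wrong, which is the kind of false positive the ftfy test cases guard against.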

Read the rest of this entry »

Posted in Development, Encoding, ftfy, Mojibake, Python, Software Development, Unicode, UTF-8, UTF8

C# Effective way to find any file’s Encoding – Stack Overflow

Posted by jpluimers on 2022/02/09

Note: Notepad cannot reliably guess the encoding, see “The Old New Thing”: [Wayback] Some files come up strange in Notepad | The Old New Thing (talking about ANSI a.k.a. Windows-1252, UTF-16LE, UTF-16BE, UTF-8 and UTF-7, some with and some without BOM, as Notepad does not understand all permutations)

David Cumps discovered that certain text files come up strange in Notepad. The reason is that Notepad has to edit files in a variety of encodings, and when its back against the wall, sometimes it’s forced to guess.

[Wayback] C# Effective way to find any file’s Encoding – Stack Overflow shows how to detect various byte order marks in C#.
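The same byte order mark check can be sketched in Python. This is not a port of the Stack Overflow answer, just an illustration using the BOM constants from the standard `codecs` module; the function name `detect_bom` and the returned labels are my own choices, and UTF-7 is left out for brevity.

```python
import codecs

# UTF-32 first: the UTF-32 LE BOM (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(detect_bom(b"\xef\xbb\xbfhello"))  # utf-8
print(detect_bom(b"\xff\xfeh\x00i\x00"))  # utf-16-le
print(detect_bom(b"hello"))               # None
```

A missing BOM says nothing about the encoding, so `None` still means guessing, and UTF-16 LE text that starts with a BOM followed by a NUL character is indistinguishable from UTF-32 LE by this check alone.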

–jeroen

Posted in ASCII, Development, Encoding, Software Development, Unicode, UTF-16, UTF-32, UTF-8, UTF16, UTF32, UTF8

UTF-8 web adoption is huge, closing in on 100%, but only soared since around 2006.

Posted by jpluimers on 2022/02/08

As a precursor to a post tomorrow showing that serving UTF-8 does not mean organisations are free of Unicode problems, first some statistics.

The first Unicode ideas got drafted some 35 years ago, in 1987. In 1991, more than 30 years ago, the Unicode Consortium saw the light. Nowadays more than 95% of web pages (close to 100% when you include plain ASCII) are served using the UTF-8 encoding.

It means that nowadays there is a very small chance you will see mangled characters (what Japanese call mojibake) when you’re surfing the web.

Some nice graphs of Unicode growth are at these locations:

Popularity of text encodings – Wikipedia
[Wayback] W3C: Who uses Unicode?
[Archive.is] Web Technologies Statistics and Trends: W3Techs shows statistics and trends in the usage statistics of web technologies
2008: [Wayback] utf-8 Growth On The Web | W3C Blog
2012: [Wayback] Official Google Blog: Unicode over 60 percent of the web
2012: [Archive.is] Usage Statistics of Character Encodings for Websites, May 2012
2015: [Wayback] UTF-8 Unicode vs. other encodings over time | Pinyin News
2020: [Archive.is] Usage Statistics and Market Share of Character Encodings for Websites, August 2020
2010-2021: [Archive.is] Historical yearly trends in the usage statistics of character encodings for websites, June 2021: from 50% UTF-8 in 2010 to almost 97% mid 2021 (with second-place ISO-8859-1 at just 1.3%, leaving less than 1.5% for all other encodings, see [Archive.is] Usage Statistics and Market Share of Character Encodings for Websites, June 2021)

I think especially important are 2008 (when UTF-8 had outgrown all other individual encodings) and slightly after 2010, when UTF-8 alone covered more than 50% of the pages served. These exclude ASCII-only pages. Adding those would make the figures even larger.

Historical yearly trends in the usage statistics of character encodings for websites, June 2021

–jeroen

Posted in Development, Encoding, Software Development, UTF-8, UTF8, Web Development

When MySQL character set ‘utf8’ does not allow you to enter some Unicode code points

Posted by jpluimers on 2022/01/06

Contrary to what many believe, MySQL utf8 is not full-blown UTF-8 support but an alias for utf8mb3, which has been deprecated for a while now.

Only utf8mb4 will give you full blown UTF-8 support.

I was reminded of this when someone ran into the following in a Delphi application:

When I insert :joy: emoji into mysql varchar filed I got an error :
#22007 Incorrect string value: '\xF0\x9F\x98\x82' for column 'remarks' at row 1

database charset is utf8

Note that the :joy: emoji is 😂 and has Unicode code point U+1F602 which is outside the basic multilingual plane.
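The bytes in the error message are exactly the four-byte UTF-8 encoding of that code point. A small Python sketch makes this visible; `fits_utf8mb3` is a hypothetical helper that only checks whether every code point lies in the basic multilingual plane, not anything MySQL provides.

```python
def fits_utf8mb3(text: str) -> bool:
    """True if every code point is in the BMP (at most three UTF-8 bytes)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


joy = "\U0001F602"  # FACE WITH TEARS OF JOY

print(joy.encode("utf-8"))        # b'\xf0\x9f\x98\x82'
print(len(joy.encode("utf-8")))   # 4
print(fits_utf8mb3(joy))          # False
print(fits_utf8mb3("Pyreneeën"))  # True
```

A check like this could be used on the application side to spot values that a utf8mb3 column will reject before they reach the database.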

See:

[Wayback] Unicode Character ‘FACE WITH TEARS OF JOY’ (U+1F602)
Plane (Unicode): Overview, Basic Multilingual Plane – Wikipedia
[Archive.is] Kristian Köhntopp on Twitter: “MySQL also, for quite some time now, no longer updates its own charsets and collations internally, for the same reason. So utf8 in MySQL is utf8mb3, the three byte variant of Unicode UTF-8 implementation that covers only the BMP (unicode up to U+FFFF).”
- Kristian Köhntopp:
  »Where does PostgreSQL’s collation logic come from?
  PostgreSQL relies on external libraries to order strings.
  – libc, meaning the operating system locale facility (POSIX or Windows)
  – icu, meaning the ICU project (if PostgreSQL was built with ICU support)«
- MySQL does things differently:
  MySQL binary data files are independent of the host operating system in byte order, number representation (as long as the host fulfils MySQLs basic requirements), collation and even time zone handling.
- So MySQL implements collations internally, also to guarantee stability across OS updates.
  If it didn’t, a libc update changing collations would mean you have to recreate a lot of indexes. Also, you would not be able to safely move data files from host to host.
- MySQL also, for quite some time now, no longer updates its own charsets and collations internally, for the same reason.
  So utf8 in MySQL is utf8mb3, the three byte variant of Unicode UTF-8 implementation that covers only the BMP (unicode up to U+FFFF).
- When moving to fuller (multiplane) UTF-8 support, a new name was needed, and utf8mb4 was chosen.
  So when you actually want modern utf8 in MySQL, you have to use utf8mb4, and now you know why.
- utf8 is deprecated and will be upgraded to utf8mb4 in some future MySQL release. This will be a breaking upgrade, and I wonder if it will require dropping and recreating all indexes affected by the change.
  That will be painful.
- https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8mb3.html …
  utf8mb3 page in the MySQL 8.0 manual, with deprecation notice.
  What will change is the meaning of the alias utf8 (currently an alias for utf8mb3).
[Wayback] MySQL: Some Character Set Basics | Die wunderbare Welt von Isotopp
[Wayback] MySQL :: MySQL 8.0 Reference Manual :: 10.9.2 The utf8mb3 Character Set (3-Byte UTF-8 Unicode Encoding)

utf8 is an alias for utf8mb3; the character limit is implicit, rather than explicit in the name.

Note

The utf8mb3 character set is deprecated and you should expect it to be removed in a future MySQL release. Please use utf8mb4 instead. Although utf8 is currently an alias for utf8mb3, at some point utf8 is expected to become a reference to utf8mb4. To avoid ambiguity about the meaning of utf8, consider specifying utf8mb4 explicitly for character set references instead of utf8.
[Wayback] MySQL :: MySQL 8.0 Reference Manual :: 10.9.1 The utf8mb4 Character Set (4-Byte UTF-8 Unicode Encoding)
utf8mb4 contrasts with the utf8mb3 character set, which supports only BMP characters and uses a maximum of three bytes per character:
- For a BMP character, utf8mb4 and utf8mb3 have identical storage characteristics: same code values, same encoding, same length.
- For a supplementary character, utf8mb4 requires four bytes to store it, whereas utf8mb3 cannot store the character at all. When converting utf8mb3 columns to utf8mb4, you need not worry about converting supplementary characters because there are none.

–jeroen

Posted in Conference Topics, Conferences, Database Development, Delphi, Development, Encoding, Event, MySQL, Software Development, UTF-8, UTF8


The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests
