The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests

A while ago I bumped into some GPI Mojibake examples, but soon found out I should use the ftfy test cases

Posted by jpluimers on 2022/11/22

I have been bumping into more and more Mojibake example pages like [Wayback] Mojibake: Question Marks, Strange Characters and Other Issues | GPI:

Have you ever found strange characters like these ���  when viewing content in applications or websites in other languages?

They made me realise that all these (including the Mojibake examples on my blog) are just artifacts, but the real list of examples is the set of ftfy test cases at [Wayback/Archive.is] python-ftfy/test_cases.json at master · LuminosoInsight/python-ftfy

I got reminded when Waternet moved from paper mail using “Pyreneeën” to email using “PyreneeÃ«n”. That is not as bad as what Waterschap AGV did earlier: they took it one level further and made “PyreneeÃƒÂ«n” out of it, see my post “Last year, a classic Mojibake was introduced when Waterschap Amstel, Gooi en Vecht redesigned their IT systems”.

This seems like a trend where newer systems perform worse than older systems. I wonder why that is.

BTW: the trick on the [Wayback/Archive] Python.org shell to run ftfy (which is not installed by default) is first dropping to the shell (see my post How do I drop a bash shell from within Python? – Stack Overflow), then starting python again:

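import os
os.system('sh')   # at the >>> prompt: drop into a shell
pip install ftfy  # at the $ prompt: install ftfy for the current user
python            # at the $ prompt: start a fresh Python that can see ftfy
import ftfy
ftfy.fix_and_explain("PyreneeÃ«n")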

A longer log of the above trick is below the signature.

More ftfy examples

Note that ftfy did not help in finding the cause of the Medireva problem mangling “PYRENEEËN” into “PYRENEEÓN”: the result of ftfy.fix_and_explain("PYRENEEÓN") is ExplainedText(text='PYRENEEÓN', explanation=[]), so ftfy sees nothing to fix.

I traced that back to a DOS Codepage 850 or Codepage 858 versus Windows-1252 code page (or maybe the closely related ISO/IEC 8859-1 or ISO/IEC 8859-15) issue in my post “In this day and age, web sites with delivery back-ends still have Unicode issues: at least @Woonveilig, @Medireva and @PostNL still have trouble”.

>>> ftfy.fix_and_explain("PYRENEEÓN")
ExplainedText(text='PYRENEEÓN', explanation=[])
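
The code page mix-up itself can be reproduced with Python's built-in codecs. This is a minimal sketch, assuming the back-end wrote the name as Codepage 850 bytes and a later step read those bytes as Windows-1252; the variable names are mine and purely illustrative:

```python
# Assumption: the text was stored as DOS Codepage 850 bytes,
# then read back as if those bytes were Windows-1252.
original = "PYRENEEËN"

raw = original.encode("cp850")   # Ë is the single byte 0xD3 in Codepage 850
mangled = raw.decode("cp1252")   # 0xD3 is Ó in Windows-1252

print(mangled)  # PYRENEEÓN

# Knowing both code pages, the damage can be undone by hand:
restored = mangled.encode("cp1252").decode("cp850")
print(restored == original)  # True
```

One character turns into exactly one other valid character here, so the mangled text carries no hint that anything went wrong. That fits ftfy returning an empty explanation.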

It also did not help fixing [Wayback/Archive] Jörg Hoh on Twitter: “All of today’s developers have forgotten the UTF-8 wars. In the last weeks I was called “J?rg Hoh” twice and today even a deployment broke because of the “ö”. History repeating. /cc @isotopp”:

>>> ftfy.fix_and_explain("J?rg Hoh")
ExplainedText(text='J?rg Hoh', explanation=[])
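
Why this one cannot be repaired is easy to show with the standard library. This is a hedged sketch that assumes some system in the chain could only handle ASCII and replaced everything else with a question mark; I do not know what actually happened to Jörg's name:

```python
name = "Jörg Hoh"

# Assumption: an ASCII-only step substituted "?" for anything it
# could not represent; errors="replace" mimics that behaviour.
lossy = name.encode("ascii", errors="replace").decode("ascii")
print(lossy)  # J?rg Hoh

# A different name collapses into the very same output, so the
# original letter cannot be recovered from the "?" alone.
other = "Jürg Hoh".encode("ascii", errors="replace").decode("ascii")
print(other == lossy)  # True
```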

But it did help me find the cause for [Wayback/Archive] Jeroen Wiert Pluimers on Twitter: “@hoegenaamd “cafÃ©”???” (and therefore [Wayback/Archive] Hoegenaamd 🍊 on Twitter: “Familiepark Drievliet – Waar nu het attractiepark ligt, lag in 1611 een landhuis. Dat heette Drievliet vanwege de ligging aan een knooppunt van de Haagvliet, de Delftvliet en de Vliet naar Leiden. In 1923 kwam er een café bij, in 1938 een theetuin en in 1951 een speeltuin.”):

$ python
>>> import ftfy
>>> ftfy.fix_and_explain("cafÃ©")
ExplainedText(text='café', explanation=[('encode', 'latin-1'), ('decode', 'utf-8')])
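
The two steps in that explanation are ordinary codec calls, so the round trip can be replayed without ftfy. This is a small sketch, assuming the text was UTF-8 that got read as Latin-1 somewhere along the way:

```python
correct = "café"

# UTF-8 bytes misread as Latin-1: the two bytes of "é" become two characters.
mangled = correct.encode("utf-8").decode("latin-1")
print(mangled)  # cafÃ©

# Apply the steps from the explanation: encode as latin-1, decode as utf-8.
fixed = mangled.encode("latin-1").decode("utf-8")
print(fixed)  # café
```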

It also helped me find a cause for [Wayback/Archive] Jeroen Wiert Pluimers on Twitter: “1995 calls and wants their character set transliteration errors back. @webcare020 @KevlinHenney”:


Thanks to Google Lens, which did a great job getting the OCR right for the ftfy explanation below:

>>> ftfy.fix_and_explain("PyreneeÃƒÂ«n")
ExplainedText(text='Pyreneeën', explanation=[('encode', 'sloppy-windows-1252'), ('decode', 'utf-8'), ('encode', 'latin-1'), ('decode', 'utf-8')])
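
The four-step explanation describes a mix-up that happened twice. Below is a sketch that reproduces it with the standard library only. `misread` is a hypothetical helper of mine, and I use the stock `cp1252` codec instead of ftfy's own `sloppy-windows-1252`, on the assumption that none of the bytes involved falls in the gaps of Windows-1252:

```python
def misread(text, wrong_codec):
    """Hypothetical helper: encode as UTF-8, then decode with the wrong codec."""
    return text.encode("utf-8").decode(wrong_codec)

correct = "Pyreneeën"
once = misread(correct, "cp1252")
twice = misread(once, "cp1252")
print(once)   # PyreneeÃ«n
print(twice)  # PyreneeÃƒÂ«n

# Undo both rounds in the order ftfy listed them.
step = twice.encode("cp1252").decode("utf-8")
fixed = step.encode("latin-1").decode("utf-8")
print(fixed)  # Pyreneeën
```

Each round turns every non-ASCII character into two or more characters, which is the pattern ftfy needs in order to work its way back.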

The conclusion is that ftfy cannot fix character sequences that are either “normal” (like “J?rg Hoh”) or lack enough information to trace back (like “PYRENEEÓN”). Usually those are failures where one character is replaced by one other character, instead of being expanded into two or more characters.

Python.org live environment

The python.org live environment is provided by [Wayback/Archive] Host, run, and code Python in the cloud: PythonAnywhere, so the same trick likely works there as well.

More live Python environments I might try this on are at [Wayback/Archive] Python Interpreter – Choose the Best to Execute Python Online.

The answer from fix_and_explain (more usage examples at [Wayback/Archive] Fixing problems and getting explanations – ftfy: fixes text for you) is:

ExplainedText(text='Pyreneeën', explanation=[('encode', 'latin-1'), ('decode', 'utf-8')])

Note that there is (by now hopefully was) a bug in the web version at [Archive] ftfy – fix unicode that’s broken in various ways: PyreneeÃ«n, as it doesn’t show the above explanation but this one:

ftfy.fix_and_explain("PyreneeÃ«n")
ExplainedText(text='Pyreneeën', explanation=[('encode', 'latin-1'), ('decode', 'utf-8'), ('encode', 'latin-1'), ('decode', 'utf-8')])

The bug: [Wayback/Archive] https://ftfy.vercel.app/?s=Pyrenee%C3%83%C2%ABn is broken and explains the wrong source text · Issue #8 · simonw/ftfy-web
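
To check which text that URL carries, the query value can be percent-decoded with the standard library. This is a small sketch that only inspects the URL; it makes no claim about how ftfy-web parses it:

```python
from urllib.parse import unquote

# Query-string value copied from the URL in the issue title.
query_value = "Pyrenee%C3%83%C2%ABn"

# unquote() percent-decodes and reads the resulting bytes as UTF-8 by default.
decoded = unquote(query_value)
print(decoded)         # PyreneeÃ«n
print(ascii(decoded))  # 'Pyrenee\xc3\xabn'
```

So the URL holds the single-level Mojibake, for which the two-step explanation would be the expected answer.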

Related tweets and posts

Some of the tweets not mentioned above:

20211107 – Waternet – Mojibake – Unicode transliteration problem

Currently there are already about half a dozen posts on my blog in the ftfy category and more in the Mojibake category.

–jeroen



Python 3.9.5 (default, May 27 2021, 19:45:35)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.system('sh')
$ pip install ftfy
Defaulting to user installation because normal site-packages is not writeable
Looking in links: /usr/share/pip-wheels
Collecting ftfy
Downloading ftfy-6.0.3.tar.gz (64 kB)
|████████████████████████████████| 64 kB 4.9 MB/s
Stored in directory: /home/.anon-bca447f2e1574daaaa3f9740/.cache/pip/wheels/3d/ee/4b/03a4e2e591ea56687af
Requirement already satisfied: wcwidth in /usr/local/lib/python3.9/site-packages (from ftfy) (0.2.5)
Building wheels for collected packages: ftfy
Building wheel for ftfy (setup.py) … done
Created wheel for ftfy: filename=ftfy-6.0.3-py3-none-any.whl size=41913 sha256=18ce752a33f0abca0c843d5b7157ae9ea4ac56bdef179b5d1b8b79adbf682d0b
Stored in directory: /home/.anon-bca447f2e1574daaaa3f9740/.cache/pip/wheels/3d/ee/4b/03a4e2e591ea56687aff999edc83827a2ace523baab75b8e41
Successfully built ftfy
Installing collected packages: ftfy
Successfully installed ftfy-6.0.3
$ python
Python 3.9.5 (default, May 27 2021, 19:45:35)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ftfy
>>> ftfy.fix_text("Pyreneeën")
'Pyreneeën'
>>> ftfy.fix_and_explain("PyreneeÃ\x83Â«n")
ExplainedText(text='Pyreneeën', explanation=[('encode', 'latin-1'), ('decode', 'utf-8'), ('encode', 'latin-1'), ('decode', 'utf-8')])
>>> ftfy.fix_and_explain("PyreneeÃ«n")
ExplainedText(text='Pyreneeën', explanation=[('encode', 'latin-1'), ('decode', 'utf-8')])
>>>
