A while ago I bumped into some GPI Mojibake examples, but soon found out I should use the ftfy test cases
Posted by jpluimers on 2022/11/22
I have been bumping into more and more Mojibake example pages like [Wayback] Mojibake: Question Marks, Strange Characters and Other Issues | GPI
Have you ever found strange characters like these ��� when viewing content in applications or websites in other languages?
They made me realise that all these (including the Mojibake examples on my blog) are just artifacts, but the real list of examples is the set of ftfy test cases at [Wayback/Archive.is] python-ftfy/test_cases.json at master · LuminosoInsight/python-ftfy
I got reminded when Waternet moved from paper mail using “Pyreneeën” to email using “PyreneeÃ«n”. That is not as bad as what Waterschap AGV did earlier: they took it one level further and made “PyreneeÃƒÂ«n” out of it, see my post “Last year, a classic Mojibake was introduced when Waterschap Amstel, Gooi en Vecht redesigned their IT systems”.
This seems like a trend where newer systems perform worse than older systems. I wonder why that is.
BTW: the trick on the [Wayback/Archive] Python.org shell to run ftfy (which is not installed by default) is to first drop to the shell (see my post How do I drop a bash shell from within Python? – Stack Overflow), then start python again:
import os
os.system('sh')
pip install ftfy
python
import ftfy
ftfy.fix_and_explain("PyreneeÃ«n")
A longer log of the above trick is below the signature.
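As an alternative to dropping to a shell, pip can usually be started from inside the running interpreter. The sketch below assumes that `python -m pip` is available and that the environment allows network access; whether that holds on the Python.org shell is not verified here. The helper names `pip_install_command` and `install` are made up for this illustration.

```python
import subprocess
import sys


def pip_install_command(package):
    """Build the command line that installs `package` for the current interpreter."""
    # sys.executable is the interpreter running right now, so the package
    # ends up somewhere this Python can import it from.
    return [sys.executable, "-m", "pip", "install", package]


def install(package):
    """Run pip in a child process and return its exit code (0 means success)."""
    return subprocess.run(pip_install_command(package)).returncode


if __name__ == "__main__":
    # Only show the command; call install("ftfy") to actually run it.
    print(" ".join(pip_install_command("ftfy")))
```

After a successful `install("ftfy")`, a plain `import ftfy` in the same session should work without restarting Python, although a freshly created user site directory may need a restart before it is picked up.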
More ftfy examples
Note that ftfy did not help find the cause of the Medireva problem mangling “PYRENEEËN” into “PYRENEEÓN”, as the result of ftfy.fix_and_explain("PYRENEEÓN") is ExplainedText(text='PYRENEEÓN', explanation=[]).
I traced that back to a DOS Codepage 850 or Codepage 858 versus Windows 1252 Codepage (or maybe the closely related ISO/IEC_8859-1 or ISO/IEC_8859-15) issue in my post “In this day and age, web sites with delivery back-ends still have Unicode issues: at least @Woonveilig, @Medireva and @PostNL still have trouble”.
>>> ftfy.fix_and_explain("PYRENEEÓN")
ExplainedText(text='PYRENEEÓN', explanation=[])
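When both the intended and the mangled text are known, the code page mix-up can be checked with the codecs that ship with Python. This is a minimal sketch, not what ftfy does internally: the function name `find_codec_pairs` and the candidate list are assumptions chosen for this example.

```python
def find_codec_pairs(original, mangled, codecs):
    """Return (written_as, read_as) codec pairs that turn `original` into `mangled`."""
    pairs = []
    for written_as in codecs:
        try:
            raw = original.encode(written_as)
        except UnicodeEncodeError:
            continue  # this codec cannot represent the original text
        for read_as in codecs:
            if read_as == written_as:
                continue
            try:
                if raw.decode(read_as) == mangled:
                    pairs.append((written_as, read_as))
            except UnicodeDecodeError:
                continue  # the bytes are not valid in this codec
    return pairs


# Candidate code pages: an assumption, extend the list as needed.
CANDIDATES = ["cp437", "cp850", "cp858", "cp1252", "latin-1", "iso8859-15"]

pairs = find_codec_pairs("PYRENEEËN", "PYRENEEÓN", CANDIDATES)
for written_as, read_as in pairs:
    print(f"encoded as {written_as}, decoded as {read_as}")
```

The pairs found include cp850 read as cp1252: “Ë” is byte 0xD3 in Codepage 850, and byte 0xD3 is “Ó” in Windows 1252. Because this swaps one valid character for another valid character, nothing in the mangled text alone gives the mistake away.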
It also did not help fix [Wayback/Archive] Jörg Hoh on Twitter: “All of today’s developers have forgotten the UTF-8 wars. In the last weeks I was called “J?rg Hoh” twice and today even a deployment broke because of the “ö”. History repeating. /cc @isotopp”:
>>> ftfy.fix_and_explain("J?rg Hoh")
ExplainedText(text='J?rg Hoh', explanation=[])
But it did help me find the cause for [Wayback/Archive] Jeroen Wiert Pluimers on Twitter: “@hoegenaamd “cafÃ©”???” (and therefore [Wayback/Archive] Hoegenaamd 🍊 on Twitter: “Familiepark Drievliet – Where the amusement park is now, there was a country house in 1611. It was called Drievliet because of its location at a junction of the Haagvliet, the Delftvliet and the Vliet to Leiden. In 1923 a cafÃ© was added, in 1938 a tea garden and in 1951 a playground.”)
$ python
>>> import ftfy
>>> ftfy.fix_and_explain("cafÃ©")
ExplainedText(text='café', explanation=[('encode', 'latin-1'), ('decode', 'utf-8')])
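The explanation reads as the repair recipe; running it backwards shows how the damage arises in the first place. A small sketch using only the standard library, where the variable names are just for illustration:

```python
clean = "café"

# Step 1: the text is correctly stored as UTF-8; "é" takes two bytes.
utf8_bytes = clean.encode("utf-8")

# Step 2: a reader wrongly assumes Latin-1, so each byte becomes one character.
mangled = utf8_bytes.decode("latin-1")
print(mangled)  # cafÃ©

# The repair is the same two steps in reverse order.
repaired = mangled.encode("latin-1").decode("utf-8")
print(repaired)  # café
```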
It also helped me find a cause for [Wayback/Archive] Jeroen Wiert Pluimers on Twitter: “1995 calls and wants their character set transliteration errors back. @webcare020 @KevlinHenney”
Thanks to Google Lens, which did a great job getting the OCR right for the ftfy explanation below:
>>> ftfy.fix_and_explain("PyreneeÃƒÂ«n")
ExplainedText(text='Pyreneeën', explanation=[('encode', 'sloppy-windows-1252'), ('decode', 'utf-8'), ('encode', 'latin-1'), ('decode', 'utf-8')])
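An explanation like this is a list of encode and decode steps, so it can be replayed by hand. The sketch below is a hypothetical `replay` helper, not ftfy's own code, and it only understands 'encode' and 'decode' steps. It also swaps in Python's standard `cp1252` codec for ftfy's `sloppy-windows-1252`, on the assumption that the two agree for the bytes in this string. As far as I know ftfy ships its own `apply_plan` function for the same purpose.

```python
def replay(text, plan):
    """Apply a list of ('encode' | 'decode', codec) steps to `text`."""
    value = text
    for operation, codec in plan:
        if operation == "encode":
            value = value.encode(codec)   # str -> bytes
        elif operation == "decode":
            value = value.decode(codec)   # bytes -> str
        else:
            raise ValueError(f"unsupported step: {operation!r}")
    return value


# Recreate the damage: UTF-8 misread as Latin-1, then again as Windows 1252.
once = "Pyreneeën".encode("utf-8").decode("latin-1")
twice = once.encode("utf-8").decode("cp1252")

# The plan from the explanation, with cp1252 standing in for sloppy-windows-1252.
plan = [
    ("encode", "cp1252"),
    ("decode", "utf-8"),
    ("encode", "latin-1"),
    ("decode", "utf-8"),
]

print(once)                  # one level of mojibake
print(twice)                 # two levels of mojibake
print(replay(twice, plan))   # back to the original text
```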
The conclusion is that ftfy cannot fix character sequences that are either “normal” (like “J?rg Hoh”) or lack enough information to trace back (like “PYRENEEÓN”). Usually those are failures where each failing code point is replaced by a single other code point instead of being expanded into two or more.
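The “J?rg Hoh” case is a lossy conversion: once a character has been replaced by a question mark, the result is valid text and the original cannot be recovered from it. The sketch below assumes the name was forced into ASCII with replacement, which is only one plausible way to end up with that question mark; the function name is made up for this example.

```python
def to_ascii_with_question_marks(text):
    """Force text into ASCII, replacing every unsupported character by '?'."""
    return text.encode("ascii", errors="replace").decode("ascii")


# Three different names (two of them invented) collapse into the same result.
names = ["Jörg Hoh", "Jürg Hoh", "Jérg Hoh"]
flattened = {name: to_ascii_with_question_marks(name) for name in names}

for name, result in flattened.items():
    print(f"{name} -> {result}")
```

All three names end up as “J?rg Hoh”, so no tool can tell which one was meant.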
Python.org live environment
The python.org live environment is provided by [Wayback/Archive] Host, run, and code Python in the cloud: PythonAnywhere, so the same trick likely works there as well.
More live Python environments I might try this on are at [Wayback/Archive] Python Interpreter – Choose the Best to Execute Python Online.
The answer from fix_and_explain (more usage examples at [Wayback/Archive] Fixing problems and getting explanations – ftfy: fixes text for you) is:
ExplainedText(text='Pyreneeën', explanation=[('encode', 'latin-1'), ('decode', 'utf-8')])
Note that there is (by now hopefully was) a bug running at [Archive] ftfy – fix unicode that’s broken in various ways: Pyreneeën, as it doesn’t show the above explanation but this one:
ftfy.fix_and_explain("PyreneeÃ«n")
ExplainedText(text='Pyreneeën', explanation=[('encode', 'latin-1'), ('decode', 'utf-8'), ('encode', 'latin-1'), ('decode', 'utf-8')])
The bug: [Wayback/Archive] https://ftfy.vercel.app/?s=Pyrenee%C3%83%C2%ABn is broken and explains the wrong source text · Issue #8 · simonw/ftfy-web
Related tweets and posts
Some of the tweets not mentioned above:

20211107 – Waternet – Mojibake – Unicode transliteration problem
- [Archive] Jeroen Wiert Pluimers on Twitter: “Hi @Waternet, you have a nice mojibake bug in your mail system. Luckily you are not alone, but it would be nice if this gets fixed soon. Many went before you: … Instead of “Pyreneeën” it says “PyreneeÃ«n”. 1/”
- [Archive] Jeroen Wiert Pluimers on Twitter: “On paper this used to go well, but since the switch to email it does not. So progress sometimes comes with small steps back. Luckily this one is easier than @waterschapagv, because they turn it into “PyreneeÃƒÂ«n” after switching to a new IT system. Good luck fixing! 2/2…”
- [Archive] Jeroen Wiert Pluimers on Twitter: “Afterthought: use ftfy to find the cause and fix it. Example: …”
Currently there are already about half a dozen posts on my blog in the ftfy category and more in the Mojibake category.
–jeroen
Python 3.9.5 (default, May 27 2021, 19:45:35)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.system('sh')
$ pip install ftfy
pDefaulting to user installation because normal site-packages is not writeable
Looking in links: /usr/share/pip-wheels
Collecting ftfy
Downloading ftfy-6.0.3.tar.gz (64 kB)
|████████████████████████████████| 64 kB 4.9 MB/s
Stored in directory: /home/.anon-bca447f2e1574daaaa3f9740/.cache/pip/wheels/3d/ee/4b/03a4e2e591ea56687af
ythoRequirement already satisfied: wcwidth in /usr/local/lib/python3.9/site-packages (from ftfy) (0.2.5)
Building wheels for collected packages: ftfy
n Building wheel for ftfy (setup.py) … done
Created wheel for ftfy: filename=ftfy-6.0.3-py3-none-any.whl size=41913 sha256=18ce752a33f0abca0c843d5b7157ae9ea4ac56bdef179b5d1b8b79adbf682d0b
Stored in directory: /home/.anon-bca447f2e1574daaaa3f9740/.cache/pip/wheels/3d/ee/4b/03a4e2e591ea56687aff999edc83827a2ace523baab75b8e41
Successfully built ftfy
Installing collected packages: ftfy
Successfully installed ftfy-6.0.3
python
Python 3.9.5 (default, May 27 2021, 19:45:35)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ftfy
>>> ftfy.fix_text("PyreneeÃ\x83Â«n")
'Pyreneeën'
>>> ftfy.fix_and_explain("PyreneeÃ\x83Â«n")
ExplainedText(text='Pyreneeën', explanation=[('encode', 'latin-1'), ('decode', 'utf-8'), ('encode', 'latin-1'), ('decode', 'utf-8')])
>>> ftfy.fix_and_explain("PyreneeÃ«n")
ExplainedText(text='Pyreneeën', explanation=[('encode', 'latin-1'), ('decode', 'utf-8')])
>>>