Archive for the ‘Unicode’ Category
Posted by jpluimers on 2024/08/06
Please do not overdo Unicode outside the ASCII realm for identifiers and stay away from Emoji: [Wayback/Archive] Kris on Twitter: “Company chat: »Right, we need more languages with Emoji as variable type indicators and pointer symbols.«…”
Company chat: »Right, we need more languages with Emoji as variable type indicators and pointer symbols.«
»🎼initializer🎱«
»💦 mutable, 🧱 not.«
»🎁 on the heap, 🥞 on the stack«
»🍼 ctor, 🪦 dtor«
»� non-utf string result«
»any of 👩❤️💋👨 as a concat operator«
»📁📂 block delims«
Read the rest of this entry »
Posted in Conference Topics, Conferences, Development, Encoding, Event, Fun, Quotes, Software Development, Unicode | Leave a Comment »
Posted by jpluimers on 2023/05/25
A while ago, I learned that MacOS has had a Unicode Hex Input keyboard since ages.
It is not installed by default, so you have to manually add it:
- Start the
System Preferences.app
- Open the
Keyboard icon
- Choose the
Input Sources tab
- Click the plus (
+) icon
- Search for Unicode or Hex to get so
Unicode Hex Input is the only entry in the list
- Click the
Add button
- Choose the
Keyboard tab
- Enable Show keyboard and emoji viewers in menu bar
Now in the menu bar, you can select the Unicode Hex Input.
After that, when holding the Option key, any 4-digit Unicode sequence will get you a Unicode character.
Read the rest of this entry »
Posted in Apple, Development, Encoding, Mac OS X / OS X / MacOS, Power User, Software Development, Unicode | Leave a Comment »
Posted by jpluimers on 2022/11/22
I have been into more and more Mojibake example pages like [Wayback] Mojibake: Question Marks, Strange Characters and Other Issues | GPI
Have you ever found strange characters like these ��� when viewing content in applications or websites in other languages?
They made me realise that all these (including the Mojibake examples on my blog) are just artifacts, but the real list of examples is the set of ftfy test cases at [Wayback/Archive.is] python-ftfy/test_cases.json at master · LuminosoInsight/python-ftfy
I got reminded when Waternet moved from paper mail using “Pyreneeën” to email using “Pyreneeën“. Not as bad as Waterschap AGV did earlier: they took it one level further and made “Pyreneeën” out of it, see Last year, a classic Mojibake was introduced when Waterschap Amstel, Gooi en Vecht redesigned their IT systems.
This seems like a trend where newer systems perform worse than older systems. I wonder why that is.
BTW: the trick on the [Wayback/Archive] Python.org shell to run ftfy (which is not installed by default) is first dropping to the shell (see my post How do I drop a bash shell from within Python? – Stack Overflow), then starting python again:
Read the rest of this entry »
Posted in CP850, Development, Encoding, ftfy, ISO-8859, Mojibake, Python, Scripting, Software Development, Unicode, UTF-8, UTF8 | Leave a Comment »
Posted by jpluimers on 2022/06/30
Even with a batch file saved as UTF-8 (with or without BOM), by default it does not show most non-ASCII Unicode characters.
The reason is that the default codepage usually is an ANSI one like codepage 437.
Thanks [Wayback] niutech for answering [Wayback/Archive.is] Unicode symbols in a batch file – Stack Overflow:
You can manually set the codepage to UTF-8 by typing chcp 65001 at the top of your batch file.
Codepage 65001 is Windows speak for the UTF-8 code page. I have some more blog entries mentioning codepage 65001.
An example where I needed this was to show how to address the localghost from a batch file (see The spookback localghost address to resolve 👻). This was the resulting UTF-8 saved batch file:
chcp 65001
ping 👻
ping xn--9q8h
For single-byte non-ASCII characters, you can usually get away with setting the encoding of your batch file to your default code page as mentioned in [Wayback/Archive.is] cmd – Using box-drawing Unicode characters in batch files – Stack Overflow.
–jeroen
Posted in Batch-Files, Development, Encoding, Scripting, Software Development, Unicode, UTF-8, Windows Development | Leave a Comment »
Posted by jpluimers on 2022/03/16
Last year, Waterschap Amstel, Gooi en Vecht sent me a paper letter notifying the yearly water bill was going to be late as they were redesigning their IT systems.
Their letter introduced a classic Mojibake that had not been present in all their older paper letter communication.
- Street name on a letter via the old IT systems is
"Pyreneeën":

- Street name on a letter via the new IT systems is
"Pyreneeën":

Read the rest of this entry »
Posted in Development, Encoding, ftfy, Mojibake, Python, Software Development, Unicode, UTF-8, UTF8 | Leave a Comment »
Posted by jpluimers on 2022/03/10
When writing this, [Wayback/Archive.is] ftfy · PyPI:history indicates ftfy was already at 6.0.3.
It is still my goto tool for figuring out the cause of Mojibake. I remember writing about it the first time in 2016 (see the ftfy category) when it was already at version 3.0, discovering it after a few Mojibake posts.
By now it even understands right-to-left Mojibake garbage:
[Archive.is] Elia Robyn Speer on Twitter: “ftfy 5.8 is out! … A user reported that Hebrew text wasn’t being fixed, and this made me think about how to expand some of the trickier cases to non-Latin alphabets.”
Mojibake mishaps still happen a lot, so by now I hope I will have done a Mojibake themed Delphi talk at one or more conferences.
Read the rest of this entry »
Posted in !!con (bangbangcon), About, Autistic Spectrum/Autism, Cancer, Conference Topics, Conferences, Development, Encoding, Event, ftfy, Mojibake, Personal, Python, Rectum cancer, Scripting, Software Development, Unicode | Leave a Comment »
Posted by jpluimers on 2022/02/10
I will likely need some of these links in the future:
–jeroen
Posted in Apple, Development, Encoding, Mac, Mac OS X / OS X / MacOS, Power User, Software Development, Unicode | Leave a Comment »
Posted by jpluimers on 2022/02/09
Nowadays, some 35 years after the first Unicode ideas got drafted and 30+ years after the Unicode Consortium saw the light, UTF-8 is served my more than 95% of the web as shown in yesterday’s post UTF-8 web adoption is huge, closing 100%, but only soured up since around 2006..
I mentioned this:
It means that nowadays there is a very small chance you will see mangled characters (what Japanese call mojibake) when you’re surfing the web.
Serving UTF8 does not mean no unicode problems.
Below are some issues that happened not too long ago and still happen. I have reported them to all parties involved through web-care, but no response whatsoever, and this is bad: Unicode support beyond basic ASCII for the below systems are still broken even for relatively simple non-ASCII characters based in diacritics decorating a standard ASCII character.
Yes, I know the realm of encoding and code pages is a mess, especially when handling data in multiple layers of an application stack. That’s why I wrote this post in the first place, and have a whole encoding category of blog posts plus a Mojibake subset.
Read the rest of this entry »
Posted in Communications Development, CP850, Dark Pattern, Development, Encoding, ISO-8859, ISO8859, Mojibake, Software Development, Unicode, User Experience (ux), UTF-16, UTF-8, Windows-1252 | Leave a Comment »
Posted by jpluimers on 2022/02/09
Note: notepad cannot correctly guess the encoding, see the “old new thing”: [Wayback] Some files come up strange in Notepad | The Old New Thing (talking about ANSI a.k.a. Windows-1252, UTF-16LE, UTF-16BE, UTF-8, UTF-7 somewith and some without BOM as Notepad does not understand all permutations)
David Cumps discovered that certain text files come up strange in Notepad. The reason is that Notepad has to edit files in a variety of encodings, and when its back against the wall, sometimes it’s forced to guess.
[Wayback] C# Effective way to find any file’s Encoding – Stack Overflow shows how to detect various byte order marks in C#.
–jeroen
Posted in ASCII, Development, Encoding, Software Development, Unicode, UTF-16, UTF-32, UTF-8, UTF16, UTF32, UTF8 | Leave a Comment »