The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests

Archive for the ‘Encoding’ Category

As a tribute to their @isotopp handle history, Kris now changed its name to Köhntopp

Posted by jpluimers on 2024/12/17

[Wayback/Archive] Jeroen Wiert Pluimers: “LOL, just saw @isotopp changed…” – Mastodon

LOL, just saw @isotopp changed his name to Köhntopp

Well done, Kris. Well done.

ftfy.vercel.app/?s=ö

( the history of the iso isotopp handle is so great, that I was glad I captured it from Twitter before that content got deleted; it is now at wiert.me/2022/06/09/how-isotop )

This Vercel app cannot be archived properly in the Wayback Machine, as it then returns an HTTP 500. The Archive.is save succeeded though: [Wayback/Archive] https://ftfy.vercel.app/?s=ö:

Read the rest of this entry »

Posted in Development, Encoding, ISO-8859, ISO8859, Mojibake, Software Development, Unicode, UTF-8 | Leave a Comment »

Unicode: Keyboard Symbols ⌘ ↵ ⌫

Posted by jpluimers on 2024/12/11

I wish I had bumped into this page way sooner, as it contains most if not all of the keyboard symbols I ever looked for: [Wayback/Archive] Unicode: Keyboard Symbols ⌘ ↵ ⌫

The page contains a lot more than just this diagram (which already is a great start):

⎋
 ` 1 2 3 4 5   6 7 8 9 0  - = ⌫    ⎀ ⤒ ⇞
 ⇥ Q W E R T   Y U I O P  [ ] \    ⌦ ⤓ ⇟
 🄰  A S D F G   H J K L ;  ' ↵
 ⇧   Z X C V B   N M , . /  ⇧        ↑
 ⎈ ❖ ⎇    ␣    ⎇ ❖ ▤ ⎈           ← ↓ →

🌐 ⌃ ⌥ ⌘
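To look up the code point and official name of any of these symbols, a minimal Python sketch like the one below can help. The helper name `describe` is my own, and the names you get depend on the Unicode version bundled with your Python.

```python
import unicodedata

def describe(symbols: str) -> list[str]:
    """Return a 'U+XXXX NAME' string for each character in symbols."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in symbols
    ]

# A few of the keyboard symbols from the diagram above.
for line in describe("⌘↵⌫⌥"):
    print(line)
```

The 4-digit hex values it prints are also handy for input methods that accept code points directly.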

Some more symbols are at these pages:

Read the rest of this entry »

Posted in Development, Encoding, Hardware, Keyboards and Keyboard Shortcuts, KVM keyboard/video/mouse, Power User, Software Development, Unicode | Leave a Comment »

Unicode spaces (not just en and em, but also em fractions 1/2, 1/3, 1/4, 1/6, 1/5, 4/18 and remarks)

Posted by jpluimers on 2024/10/03

For my link archive: [Wayback/Archive] Unicode spaces. Please check the page itself, as by now the table might have changed from what I quote below, and the WordPress classic editor might have mangled it.

I like the table as it embeds the spaces between foo and bar, so it is easy to copy and paste them into code or documentation.
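If copy-pasting invisible characters feels fragile, the same foo/bar samples can be generated from code points. This is a minimal sketch: the selection in `SPACES` and the function name `space_samples` are my own choices, not the table from the linked page.

```python
import unicodedata

# A hand-picked selection of space code points (an assumption, not the page's table).
SPACES = [0x0020, 0x00A0, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006, 0x2009, 0x200A]

def space_samples(code_points=SPACES):
    """Return (code point, Unicode name, 'foo<space>bar' sample) per space."""
    rows = []
    for cp in code_points:
        ch = chr(cp)
        rows.append((f"U+{cp:04X}", unicodedata.name(ch), f"foo{ch}bar"))
    return rows

for code, name, sample in space_samples():
    print(f"{code}  {name:<22} {sample}")
```

Writing the spaces as `chr(0x2003)` or `"\u2003"` in source code keeps them visible in reviews and survives editors that normalize whitespace.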

Read the rest of this entry »

Posted in Development, Encoding, Software Development, Unicode | Leave a Comment »

The mojibake “creëer”

Posted by jpluimers on 2024/08/22

A while ago, I found the “creëer” mojibake in a Dutch page on the IKEA site.

They are not alone in making this mistake, which is easily explained using [Wayback/Archive] ftfy:

>>> import ftfy
>>> ftfy.fix_and_explain("creëer")
ExplainedText(text='creëer', explanation=[('encode', 'latin-1'), ('decode', 'utf-8')])

(you can run this on-line at [Wayback/Archive] Welcome to Python.org: interactive shell, see my post The things I didn’t notice during cancer survival: ftfy 6.0 and more versions got released during my recovery on how to do this)
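The two steps in that explanation can be reproduced with just the Python standard library. This is a minimal sketch with helper names of my own (`make_mojibake`, `fix_mojibake`). It only handles this one UTF-8-read-as-Latin-1 case, whereas ftfy detects many more.

```python
def make_mojibake(text: str) -> str:
    """Simulate the mistake: UTF-8 bytes wrongly decoded as Latin-1."""
    return text.encode("utf-8").decode("latin-1")

def fix_mojibake(garbled: str) -> str:
    """Undo it with the steps ftfy reported: encode latin-1, decode utf-8."""
    return garbled.encode("latin-1").decode("utf-8")

garbled = make_mojibake("creëer")
print(garbled)                # the mojibake form
print(fix_mojibake(garbled))  # back to the intended text
```

Note that `fix_mojibake` raises a `UnicodeDecodeError` or `UnicodeEncodeError` on text that was not garbled this way, so it is not safe to apply blindly.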

So the text is easily fixed:

Read the rest of this entry »

Posted in Development, Encoding, ftfy, ISO-8859, ISO8859, Software Development, Unicode, UTF-8, UTF8, Web Development | Leave a Comment »

The regexp for an emoticon ?

Posted by jpluimers on 2024/08/08

I responded to [Wayback/Archive] jilles.com on Twitter: “@0xD4ni @Twitter What is the regexp for an emoticon ?” with [Wayback/Archive] Jeroen Wiert Pluimers on Twitter: “@jilles_com @0xD4ni @Twitter \p{So}+ See …”.

I got the answer from [Wayback/Archive] java – What is the regex to extract all the emojis from a string? – Stack Overflow (thanks [Wayback/Archive] vishalaksh, and [Wayback/Archive] Desgard_Duan) which refers to the quoted section below.

Note that matching correctly depends heavily on the versions of the libraries you use: over the last years there have been many Unicode releases (since 2014 roughly every 12 months), each usually adding more Emoji.

In addition, many Emoji are not single Unicode code points: often they are several code points (with or without any of the variation selectors) joined together with zero-width joiners, as I described in Kris on Twitter: “Company chat: »Right, we need more languages with Emoji as variable type indicators and pointer symbols.«….
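To see both effects without a regex engine, here is a minimal Python sketch using only the standard library. Python's built-in `re` module does not support `\p{So}`, so the sketch checks the general category per code point instead. The names `KISS`, `categories` and `count_so` are my own.

```python
import unicodedata

# The "kiss: woman, man" sequence, written with escapes so every code point is visible:
# woman, ZWJ, heavy black heart, variation selector-16, ZWJ, kiss mark, ZWJ, man.
KISS = "\U0001F469\u200D\u2764\uFE0F\u200D\U0001F48B\u200D\U0001F468"

def categories(text: str) -> list[tuple[str, str]]:
    """Return (code point, general category) for each code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in text]

def count_so(text: str) -> int:
    """Count code points in category So (Symbol, other), like \\p{So} would match."""
    return sum(1 for ch in text if unicodedata.category(ch) == "So")

print(len(KISS), "code points for one visible Emoji")
print(categories(KISS))
print("Unicode version of this Python:", unicodedata.unidata_version)
```

The joiners and the variation selector are not in category So, so a `\p{So}+` match stops at each of them and splits this one Emoji into pieces. In the other direction, `count_so("©")` returns 1, so `\p{So}` also matches symbols that are not Emoji.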

I tried fiddling on [Wayback/Archive] regex101: build, test, and debug regex, but could not always get it to work as I hoped, and could not figure out how recent their libraries are either.

Read the rest of this entry »

Posted in Conference Topics, Conferences, Development, Emoticons, Encoding, Event, Geeky, RegEx, Software Development, Unicode | Leave a Comment »

Some notes on codepoints.net and beta.codepoints.net

Posted by jpluimers on 2024/08/07

By the time you read this, a lot of it might have been brought up to date, but for quite some time codepoints.net had not been updated with code point information from newer Unicode releases.

Basically it was stuck at Unicode version 8.0 with some 120k glyphs. At the time of writing Unicode version 15.0 is in beta and the difference between 15.0 and 8.0 is some 24k glyphs.

So I had a quick Twitter chat with the author and jotted down the links in this blog post so I won’t forget them.

There I learned it was open source (I think it is the only Unicode codepoint site that is).

Here it goes:

Read the rest of this entry »

Posted in *nix, *nix-tools, Apache2, codepoints.net, Conference Topics, Conferences, Database Development, Debian, Development, DVCS - Distributed Version Control, Encoding, Event, GitHub, Linux, MySQL, PHP, Power User, Scripting, Software Development, Source Code Management, Unicode, Web Development | Leave a Comment »

Kris on Twitter: “Company chat: »Right, we need more languages with Emoji as variable type indicators and pointer symbols.«…

Posted by jpluimers on 2024/08/06

Please do not overdo Unicode outside the ASCII realm for identifiers and stay away from Emoji: [Wayback/Archive] Kris on Twitter: “Company chat: »Right, we need more languages with Emoji as variable type indicators and pointer symbols.«…”

Company chat: »Right, we need more languages with Emoji as variable type indicators and pointer symbols.«
»🎼initializer🎱«
»💦 mutable, 🧱 not.«
»🎁 on the heap, 🥞 on the stack«
»🍼 ctor, 🪦 dtor«
»� non-utf string result«
»any of 👩‍❤️‍💋‍👨 as a concat operator«
»📁📂 block delims«

Read the rest of this entry »

Posted in Conference Topics, Conferences, Development, Encoding, Event, Fun, Quotes, Software Development, Unicode | Leave a Comment »

I learned: MacOS has a Unicode Hex Input keyboard

Posted by jpluimers on 2023/05/25

A while ago, I learned that MacOS has had a Unicode Hex Input keyboard for ages.

It is not installed by default, so you have to manually add it:

  1. Start the System Preferences.app
  2. Open the Keyboard icon
  3. Choose the Input Sources tab
  4. Click the plus (+) icon
  5. Search for Unicode or Hex so that Unicode Hex Input is the only entry in the list
  6. Click the Add button
  7. Choose the Keyboard tab
  8. Enable Show keyboard and emoji viewers in menu bar

Now in the menu bar, you can select the Unicode Hex Input.

After that, while holding the Option key, typing any 4-digit hexadecimal Unicode code gets you the corresponding Unicode character.
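To find the 4-digit codes to type, a minimal Python sketch like this one can help. The function name `hex_input_codes` is my own. It lists the UTF-16 code units of a character, on the assumption (my understanding, not something stated above) that characters outside the Basic Multilingual Plane have to be typed as two 4-digit surrogate codes.

```python
def hex_input_codes(char: str) -> list[str]:
    """Return the UTF-16 code units of char as 4-digit uppercase hex strings."""
    data = char.encode("utf-16-be")  # big-endian, no byte order mark
    # Each UTF-16 code unit is 2 bytes, so take the bytes in pairs.
    return [data[i:i + 2].hex().upper() for i in range(0, len(data), 2)]

print(hex_input_codes("⌘"))   # one code unit
print(hex_input_codes("😀"))  # two code units (a surrogate pair)
```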

Read the rest of this entry »

Posted in Apple, Development, Encoding, Mac OS X / OS X / MacOS, Power User, Software Development, Unicode | Leave a Comment »

Berlin Typography on Twitter: “The best of #TypeInBerlin: The tʒ and ſʒ ligatures, together at last.” / Güntʒelstraſʒe == Güntzelstraße

Posted by jpluimers on 2023/04/17

Learned a new thing a while ago: I knew about the ſʒ ligature (that nowadays usually is written as ß), but the tʒ ligature was new to me.

So: Güntʒelstraſʒe == Güntzelstraße.

References:

Source: [Archive.is] Berlin Typography on Twitter: “The best of #TypeInBerlin: The tʒ and ſʒ ligatures, together at last. …” / Twitter

Read the rest of this entry »

Posted in Development, Encoding, LifeHacker, Power User, Software Development, Unicode | Leave a Comment »

A while ago I bumped into some GPI Mojibake examples, but soon found out I should use the ftfy test cases

Posted by jpluimers on 2022/11/22

I have been bumping into more and more Mojibake example pages, like [Wayback] Mojibake: Question Marks, Strange Characters and Other Issues | GPI

Have you ever found strange characters like these ���  when viewing content in applications or websites in other languages?

They made me realise that all these (including the Mojibake examples on my blog) are just artifacts, but the real list of examples is the set of ftfy test cases at [Wayback/Archive.is] python-ftfy/test_cases.json at master · LuminosoInsight/python-ftfy

I got reminded when Waternet moved from paper mail using “Pyreneeën” to email using “Pyreneeën”. Not as bad as Waterschap AGV did earlier: they took it one level further and made “Pyreneeën” out of it, see Last year, a classic Mojibake was introduced when Waterschap Amstel, Gooi en Vecht redesigned their IT systems.

This seems like a trend where newer systems perform worse than older systems. I wonder why that is.

BTW: the trick on the [Wayback/Archive] Python.org shell to run ftfy (which is not installed by default) is first dropping to the shell (see my post How do I drop a bash shell from within Python? – Stack Overflow), then starting python again:

Read the rest of this entry »

Posted in CP850, Development, Encoding, ftfy, ISO-8859, Mojibake, Python, Scripting, Software Development, Unicode, UTF-8, UTF8 | Leave a Comment »