The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests

I’ve given up on entering non-ASCII characters when entering data on-line

Posted by jpluimers on 2019/06/17

I live in a street that has a non-ASCII character in it: Pyreneeën.

I’ve gone back to entering the street name as plain ASCII for a simple reason:

Too often the ë gets mangled into encoding gibberish, similar to the é example in [WayBack] When Good Characters Go Bad: A Guide to Diagnosing Character Display Problems, as these characters sit very close together both in UTF-8 and in the [WayBack] Unicode Characters in the Latin-1 Supplement Block.

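I’ve seen these encodings, where only the top one (row 0) is correct; the degeneration gets worse moving downwards, a classic Mojibake. In rows 1 to 4, the UTF-8 bytes of the row above were read as single-byte Latin-1 characters and then encoded as UTF-8 again, so the byte count doubles each time. Row 5 is the HTML entity &euml; spelled out in ASCII: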

# encoded UTF-8 (hex.)
0 ë 0xC3 0xAB
1 ë 0xC3 0x83 0xC2 0xAB
2 ë 0xC3 0x83 0xC2 0x83 0xC3 0x82 0xC2 0xAB
3 ë 0xC3 0x83 0xC2 0x83 0xC3 0x82 0xC2 0x83 0xC3 0x83 0xC2 0x82 0xC3 0x82 0xC2 0xAB
4 ë 0xC3 0x83 0xC2 0x83 0xC3 0x82 0xC2 0x83 0xC3 0x83 0xC2 0x82 0xC3 0x82 0xC2 0x83 0xC3 0x83 0xC2 0x83 0xC3 0x82 0xC2 0x82 0xC3 0x83 0xC2 0x82 0xC3 0x82 0xC2 0xAB
5 ë 0x26 0x65 0x75 0x6d 0x6c 0x3b
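The degeneration in rows 0 to 4 can be reproduced with a few lines of Python. This is only a sketch: it assumes the broken systems read UTF-8 bytes as ISO-8859-1 (Latin-1), which is what the byte values in the table are consistent with, and the names `degrade` and `hex_bytes` are made up for this illustration.

```python
import html.entities


def degrade(text: str, rounds: int) -> list[bytes]:
    """Return the UTF-8 bytes of text after 0..rounds mis-decoding steps."""
    stages = []
    for _ in range(rounds + 1):
        raw = text.encode("utf-8")
        stages.append(raw)
        # The mistake (assumed): treat the UTF-8 bytes as Latin-1 text,
        # so every byte becomes one character that gets encoded again.
        text = raw.decode("latin-1")
    return stages


def hex_bytes(raw: bytes) -> str:
    """Format bytes the way the table does, e.g. 0xC3 0xAB."""
    return " ".join(f"0x{b:02X}" for b in raw)


# Rows 0 to 4: start from ë (U+00EB) and mis-decode four times.
for number, raw in enumerate(degrade("\u00eb", 4)):
    print(number, hex_bytes(raw))

# Row 5 is a different failure: the character replaced by its HTML entity.
entity = "&" + html.entities.codepoint2name[0xEB] + ";"
print(5, hex_bytes(entity.encode("ascii")))
```

Each round doubles the length (2, 4, 8, 16, 32 bytes), because every byte in the sequence is 0x80 or higher and therefore needs two bytes when encoded again.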

The last one seldom happens; the first degeneration (row 1) happens relatively often, just like [Archive.is] fd.nl did for a while on their financial pages.

These mistakes become sort of understandable (but not forgivable) when you look at the table fragment below (the full table is at [WayBack] Unicode/UTF-8-character table – starting from code position 0080).
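Why one wrong step turns two bytes into four follows from how UTF-8 encodes code points in the range covered by the conversion table further down. A minimal sketch of that arithmetic, with `encode_two_byte` being a name invented for this example:

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte UTF-8 range is handled here")
    lead = 0xC0 | (code_point >> 6)            # 110xxxxx: upper 5 bits
    continuation = 0x80 | (code_point & 0x3F)  # 10xxxxxx: lower 6 bits
    return bytes([lead, continuation])


# ë is U+00EB; its UTF-8 form is the pair 0xC3 0xAB.
assert encode_two_byte(0xEB) == b"\xc3\xab"

# Misread as Latin-1, those two bytes are the characters U+00C3 and U+00AB,
# and each of them is then encoded as two bytes again.
for byte in b"\xc3\xab":
    encoded = " ".join(f"0x{b:02x}" for b in encode_two_byte(byte))
    print(f"U+{byte:04X} -> {encoded}")
```

The two printed lines correspond to the U+00C3 and U+00AB rows of the conversion table, which together form row 1 of the degeneration sequence.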

The sad thing: even with ASCII characters, systems goof up; an apostrophe can be enough:

[WayBack] … My old card did have my last name (Ts’o) correct. The new card… – Theodore Ts’o – Google+

Via [WayBack] Unicode on Credit Card Names is unthinkable, but currently even ASCII is too hard…. – Kristian Köhntopp – Google+

Even sadder: the “Via” post got goofed up by the Wayback Machine of the Internet Archive. Source: Encoding horror: Wayback Machine “Sorry.This snapshot cannot be displayed due to an internal error.”

My conclusion is that despite all the fuss about AI, we’re never going to reach the singularity, as way too many things keep being broken all the time.

Conversion table

Note: I used the [WayBack] UTF-8 encoder/decoder in addition to the table below to create the degeneration sequence above. Its source is at: mothereff.in/utf-8 at master · mathiasbynens/mothereff.in

Unicode code point   character   UTF-8 (hex.)
U+0080 0xc2 0x80
U+0081 0xc2 0x81
U+0082 0xc2 0x82
U+0083 0xc2 0x83
U+0084 0xc2 0x84
U+0085 0xc2 0x85
U+0086 0xc2 0x86
U+0087 0xc2 0x87
U+0088 0xc2 0x88
U+0089 0xc2 0x89
U+008A 0xc2 0x8a
U+008B 0xc2 0x8b
U+008C 0xc2 0x8c
U+008D 0xc2 0x8d
U+008E 0xc2 0x8e
U+008F 0xc2 0x8f
U+0090 0xc2 0x90
U+0091 0xc2 0x91
U+0092 0xc2 0x92
U+0093 0xc2 0x93
U+0094 0xc2 0x94
U+0095 0xc2 0x95
U+0096 0xc2 0x96
U+0097 0xc2 0x97
U+0098 0xc2 0x98
U+0099 0xc2 0x99
U+009A 0xc2 0x9a
U+009B 0xc2 0x9b
U+009C 0xc2 0x9c
U+009D 0xc2 0x9d
U+009E 0xc2 0x9e
U+009F 0xc2 0x9f
U+00A0 0xc2 0xa0
U+00A1 ¡ 0xc2 0xa1
U+00A2 ¢ 0xc2 0xa2
U+00A3 £ 0xc2 0xa3
U+00A4 ¤ 0xc2 0xa4
U+00A5 ¥ 0xc2 0xa5
U+00A6 ¦ 0xc2 0xa6
U+00A7 § 0xc2 0xa7
U+00A8 ¨ 0xc2 0xa8
U+00A9 © 0xc2 0xa9
U+00AA ª 0xc2 0xaa
U+00AB « 0xc2 0xab
U+00AC ¬ 0xc2 0xac
U+00AD 0xc2 0xad
U+00AE ® 0xc2 0xae
U+00AF ¯ 0xc2 0xaf
U+00B0 ° 0xc2 0xb0
U+00B1 ± 0xc2 0xb1
U+00B2 ² 0xc2 0xb2
U+00B3 ³ 0xc2 0xb3
U+00B4 ´ 0xc2 0xb4
U+00B5 µ 0xc2 0xb5
U+00B6 ¶ 0xc2 0xb6
U+00B7 · 0xc2 0xb7
U+00B8 ¸ 0xc2 0xb8
U+00B9 ¹ 0xc2 0xb9
U+00BA º 0xc2 0xba
U+00BB » 0xc2 0xbb
U+00BC ¼ 0xc2 0xbc
U+00BD ½ 0xc2 0xbd
U+00BE ¾ 0xc2 0xbe
U+00BF ¿ 0xc2 0xbf
U+00C0 À 0xc3 0x80
U+00C1 Á 0xc3 0x81
U+00C2 Â 0xc3 0x82
U+00C3 Ã 0xc3 0x83
U+00C4 Ä 0xc3 0x84
U+00C5 Å 0xc3 0x85
U+00C6 Æ 0xc3 0x86
U+00C7 Ç 0xc3 0x87
U+00C8 È 0xc3 0x88
U+00C9 É 0xc3 0x89
U+00CA Ê 0xc3 0x8a
U+00CB Ë 0xc3 0x8b
U+00CC Ì 0xc3 0x8c
U+00CD Í 0xc3 0x8d
U+00CE Î 0xc3 0x8e
U+00CF Ï 0xc3 0x8f
U+00D0 Ð 0xc3 0x90
U+00D1 Ñ 0xc3 0x91
U+00D2 Ò 0xc3 0x92
U+00D3 Ó 0xc3 0x93
U+00D4 Ô 0xc3 0x94
U+00D5 Õ 0xc3 0x95
U+00D6 Ö 0xc3 0x96
U+00D7 × 0xc3 0x97
U+00D8 Ø 0xc3 0x98
U+00D9 Ù 0xc3 0x99
U+00DA Ú 0xc3 0x9a
U+00DB Û 0xc3 0x9b
U+00DC Ü 0xc3 0x9c
U+00DD Ý 0xc3 0x9d
U+00DE Þ 0xc3 0x9e
U+00DF ß 0xc3 0x9f
U+00E0 à 0xc3 0xa0
U+00E1 á 0xc3 0xa1
U+00E2 â 0xc3 0xa2
U+00E3 ã 0xc3 0xa3
U+00E4 ä 0xc3 0xa4
U+00E5 å 0xc3 0xa5
U+00E6 æ 0xc3 0xa6
U+00E7 ç 0xc3 0xa7
U+00E8 è 0xc3 0xa8
U+00E9 é 0xc3 0xa9
U+00EA ê 0xc3 0xaa
U+00EB ë 0xc3 0xab
U+00EC ì 0xc3 0xac
U+00ED í 0xc3 0xad
U+00EE î 0xc3 0xae
U+00EF ï 0xc3 0xaf
U+00F0 ð 0xc3 0xb0
U+00F1 ñ 0xc3 0xb1
U+00F2 ò 0xc3 0xb2
U+00F3 ó 0xc3 0xb3
U+00F4 ô 0xc3 0xb4
U+00F5 õ 0xc3 0xb5
U+00F6 ö 0xc3 0xb6
U+00F7 ÷ 0xc3 0xb7
U+00F8 ø 0xc3 0xb8
U+00F9 ù 0xc3 0xb9
U+00FA ú 0xc3 0xba
U+00FB û 0xc3 0xbb
U+00FC ü 0xc3 0xbc
U+00FD ý 0xc3 0xbd
U+00FE þ 0xc3 0xbe
U+00FF ÿ 0xc3 0xbf

–jeroen

Reference: [WayBack] Unicode Block ‘Latin-1 Supplement’

Reminds me of [WayBack] Schei¥e – [WayBack] Geek And Poke: Coders Love Unicode. For non-Germans: Scheiße = shit, crap.

 

 
