The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests

In this day and age, web sites with delivery back-ends still have Unicode issues: at least @Woonveilig, @Medireva and @PostNL still have trouble

Posted by jpluimers on 2022/02/09

Nowadays, some 35 years after the first Unicode ideas were drafted and 30+ years after the Unicode Consortium saw the light, UTF-8 is served by more than 95% of the web, as shown in yesterday’s post UTF-8 web adoption is huge, closing 100%, but only soared up since around 2006.

I mentioned this:

It means that nowadays there is a very small chance you will see mangled characters (what Japanese call mojibake) when you’re surfing the web.

Serving UTF-8 does not mean there are no Unicode problems.

Below are some issues that happened not too long ago and still happen. I have reported them to all parties involved through their web-care teams, but got no response whatsoever. This is bad: Unicode support beyond basic ASCII in the systems below is still broken, even for relatively simple non-ASCII characters that consist of a diacritic decorating a standard ASCII character.

Yes, I know the realm of encoding and code pages is a mess, especially when handling data in multiple layers of an application stack. That’s why I wrote this post in the first place, and have a whole encoding category of blog posts plus a Mojibake subset.

Woonveilig back-end serving non-ASCII while the front-end limits to an ASCII-subset

This is the series of tweets:

I summarised what happened in Dutch at [Wayback/Archive.is] Woonveilig postcode check geeft straatnaam met diakriet, maar user-interface validatie keurt dat af.:

Postal code/house number URL:
Validation result:
{"street":"C\u00e9sar Franckrode","houseNumber":1,"houseNumberAddition":"","postcode":"2717BA","city":"Zoetermeer","municipality":"Zoetermeer","province":"Zuid-Holland","rdX":91967,"rdY":453852,"latitude":52.06936633,"longitude":4.46788106,"bagNumberDesignationId":"0637200000256226","bagAddressableObjectId":"0637010000256227","addressType":"building","purposes":["residency"],"surfaceArea":103,"houseNumberAdditions":[""]}
Validation script URL:
Validation script fragment:
   this.setStreetNameBlur = function(element_o)
    {
        var valid_b = this.setValidationByRegExp(element_o, '^[A-Za-z\\. ]+$', 'De Straatnaam bevat onjuiste tekens.');

        if (valid_b)
            element_o.value = element_o.value.charAt(0).toUpperCase() + element_o.value.toLowerCase().substr(1);

In other words: the back-end correctly returns Unicode, but the front-end limits input to uppercase A-Z, lowercase a-z, period and space. That is inconvenient because, technically speaking, Dutch street names are allowed to use any printable Unicode code point.

An example is again in Zoetermeer with the valid street César Franckrode (not my street: I don’t live in Zoetermeer), which the back-end correctly returns as escaped Unicode characters and also ends up with "De Straatnaam bevat onjuiste tekens." in the user interface. [Wayback/Archive.is] www.woonveilig.nl/engine/woonveilig__website__shop__order__address_c?building=1&zip=2717+BA

{"street":"C\u00e9sar Franckrode","houseNumber":1,"houseNumberAddition":"","postcode":"2717BA","city":"Zoetermeer","municipality":"Zoetermeer","province":"Zuid-Holland","rdX":91967,"rdY":453852,"latitude":52.06936633,"longitude":4.46788106,"bagNumberDesignationId":"0637200000256226","bagAddressableObjectId":"0637010000256227","addressType":"building","purposes":["residency"],"surfaceArea":103,"houseNumberAdditions":[""]}
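To make explicit that the response above is fine: `\u00e9` is an ordinary JSON escape for é (U+00E9). A minimal sketch, using a shortened copy of the response (the variable names and the trimmed field list are mine, not Woonveilig's):

```javascript
// Shortened copy of the JSON returned by the postcode check (fields trimmed for brevity).
const response = '{"street":"C\\u00e9sar Franckrode","houseNumber":1,"postcode":"2717BA"}';

// JSON.parse turns the \u00e9 escape into the actual character é (U+00E9).
const address = JSON.parse(response);

console.log(address.street);                              // César Franckrode
console.log(address.street.codePointAt(1).toString(16));  // e9
```

So any front-end that parses this response gets a proper Unicode string; the trouble only starts in the validation that runs afterwards.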

Failing not just with non-ASCII printables

The same result occurs for a different street, this time in the city of Utrecht. Back in the day, the city replaced spaces with apostrophes: [Wayback/Archive.is] www.woonveilig.nl/engine/woonveilig__website__shop__order__address_c?building=1&zip=3543+EL

{"street":"Nat'King'Colestraat","houseNumber":1,"houseNumberAddition":"","postcode":"3543EL","city":"Utrecht","municipality":"Utrecht","province":"Utrecht","rdX":131693,"rdY":457113,"latitude":52.1017777,"longitude":5.04704835,"bagNumberDesignationId":"0344200000183532","bagAddressableObjectId":"0344010000180443","addressType":"building","purposes":["residency"],"surfaceArea":142,"houseNumberAdditions":[""]}

You guessed it: "De Straatnaam bevat onjuiste tekens."
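Both rejections follow directly from the character class in the validation script. A sketch under assumptions: `originalPattern` is copied from the fragment quoted earlier, while `relaxedPattern` is my own suggestion (not Woonveilig's code) that uses Unicode property escapes to accept letters, combining marks, digits and some punctuation. It is still stricter than "any printable Unicode code point", so treat it as an illustration rather than a complete rule.

```javascript
// Pattern copied from the validation script fragment: ASCII letters, period and space only.
const originalPattern = new RegExp('^[A-Za-z\\. ]+$');

// Hypothetical replacement (my assumption): any Unicode letter, combining mark or digit,
// plus period, straight and curly apostrophe, hyphen and space. Needs the "u" flag.
const relaxedPattern = /^[\p{L}\p{M}\p{N}.'’\- ]+$/u;

const streets = ['César Franckrode', "Nat'King'Colestraat", 'Van Leeuwenhoeklaan'];

for (const street of streets) {
    // Prints the street, then whether the original and the relaxed pattern accept it.
    console.log(street, originalPattern.test(street), relaxedPattern.test(street));
}
```

Only "Van Leeuwenhoeklaan" passes the original pattern; all three names pass the relaxed one.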

Utrecht probably tried to avoid spaces in a street name, thereby offending the late Nat King Cole and his family by renaming him Nat’King’Cole (no high hopes that WordPress refrains from messing with the apostrophes here). The sign with the street name was correct though: the people making it did have sense. After a complaint that the registered street name was spelled incorrectly and didn’t match the signage, the city council indicated they would change the signage. [Archive.is] Google Street View indicated that in 2015 this had not happened yet:

Nat King Colestraat signage in Utrecht which on-line incorrectly is named Nat’King’Colestraat

Basically Utrecht traded a space problem for a quoting problem, waiting for a system that correctly works around the quoting problem. That will be a challenge, for instance with a street name like “d'Argtagnanstraat“. In that sense, Almere does better with [Wayback] Postcode Nat King Colestraat in Almere – Postcode bij adres.

Note that when writing this, I knew that there was a [Wayback] “d'Argtagnanstraat” in Wezet, Belgium, but missed that there was a [Wayback] “d'Artagnanlaan” in Maastricht, The Netherlands: [Wayback] Postcode d’Artagnanlaan in Maastricht – Postcode bij adres

Woonveilig mangles even more

Of course the above is not the only street name problem where the front-end disagrees with the back-end. The JavaScript front-end code forces the first letter of the street name to be uppercase and everything else to be lowercase, even if the back-end returned it correctly, or the user filled it in correctly.

In many cases, this does not hold, especially with street names containing spaces. An example with another street name in Zoetermeer is [Wayback/Archive.is] www.woonveilig.nl/engine/woonveilig__website__shop__order__address_c?building=150&zip=2713+RD:

{"street":"Van Leeuwenhoeklaan","houseNumber":150,"houseNumberAddition":"","postcode":"2713RD","city":"Zoetermeer","municipality":"Zoetermeer","province":"Zuid-Holland","rdX":93134,"rdY":452495,"latitude":52.05730863,"longitude":4.48514815,"bagNumberDesignationId":"0637200000209394","bagAddressableObjectId":"0637010000209395","addressType":"building","purposes":["residency"],"surfaceArea":79,"houseNumberAdditions":[""]}

The user interface incorrectly morphs this into “Van leeuwenhoeklaan” (note the lowercase l instead of the uppercase L): [Archive.is] Jeroen Wiert Pluimers on Twitter: “Inderdaad: “De Straatnaam bevat onjuiste tekens.”. Het kan nog gekker. … wat netjes “Van Leeuwenhoeklaan” en nummer “150” teruggeeft, dat wordt natuurlijk… Inderdaad, geen huisnummer en de L wordt lowercase: “Van Leeuwenhoeklaan”… “
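The case mangling can be reproduced in isolation. The expression below is the one from the validation script fragment; wrapping it in a function called `mangleStreetName` is my own choice for the sake of the example.

```javascript
// Same expression as in the validation script fragment, wrapped in a function for testing.
function mangleStreetName(value) {
    return value.charAt(0).toUpperCase() + value.toLowerCase().substr(1);
}

console.log(mangleStreetName('Van Leeuwenhoeklaan')); // Van leeuwenhoeklaan
console.log(mangleStreetName("Nat'King'Colestraat")); // Nat'king'colestraat
```

Since the back-end already returns the official spelling, the simplest fix is probably to leave that value untouched instead of re-casing it in the browser.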

Medireva -> PostNL export

I tried the various tables with encoding issues from Mojibake – Wikipedia, but had a hard time finding the cause for this one:

[Archive.is] Jeroen Wiert Pluimers on Twitter: “Still need to figure out what went wrong here between the @MediReva and @PostNL systems where “PYRENEEËN” on the medireva side (including package label) becomes “PYRENEEÓN” on the PostNL side. Luckily the postal code and house number are correct, so delivery is OK.… “

At first I thought the Ë (U+00CB) to Ó (U+00D3) mapping is odd as it does not match with any of the popular character sets:

From the above tables, I concluded I was basically looking for character sets that had either:

  • Ë (U+00CB) at code point 0xD3 instead of 0xCB
  • Ó (U+00D3) at code point 0xCB instead of 0xD3

Then I realised this might be based on some old software with DOS or Windows heritage dating back to DOS code pages.

In Western Europe, during the DOS age, both Code page 850 and Code page 858 were popular. Code page 858 basically is code page 850 with codepoint 0xD5 replaced by the € Euro currency sign.

On Windows, the most popular code page was ISO 8859-1 (before the introduction of the Euro currency), which was succeeded by ISO 8859-15 after the Euro currency was introduced on 2002-01-01; Windows itself has a kind of superset in the form of Windows 1252. The differences between these code pages are as follows:

Byte  ISO 8859-1  ISO 8859-15  Windows 1252
80    N/A         N/A          € 20AC
82    N/A         N/A          ‚ 201A
83    N/A         N/A          ƒ 0192
84    N/A         N/A          „ 201E
85    N/A         N/A          … 2026
86    N/A         N/A          † 2020
87    N/A         N/A          ‡ 2021
88    N/A         N/A          ˆ 02C6
89    N/A         N/A          ‰ 2030
8A    N/A         N/A          Š 0160
8B    N/A         N/A          ‹ 2039
8C    N/A         N/A          Œ 0152
8E    N/A         N/A          Ž 017D
91    N/A         N/A          ‘ 2018
92    N/A         N/A          ’ 2019
93    N/A         N/A          “ 201C
94    N/A         N/A          ” 201D
95    N/A         N/A          • 2022
96    N/A         N/A          – 2013
97    N/A         N/A          (em dash) 2014
98    N/A         N/A          ˜ 02DC
99    N/A         N/A          ™ 2122
9A    N/A         N/A          š 0161
9B    N/A         N/A          › 203A
9C    N/A         N/A          œ 0153
9E    N/A         N/A          ž 017E
9F    N/A         N/A          Ÿ 0178
A4    ¤ 00A4      € 20AC       ¤ 00A4
A6    ¦ 00A6      Š 0160       ¦ 00A6
A8    ¨ 00A8      š 0161       ¨ 00A8
B4    ´ 00B4      Ž 017D       ´ 00B4
B8    ¸ 00B8      ž 017E       ¸ 00B8
BC    ¼ 00BC      Œ 0152       ¼ 00BC
BD    ½ 00BD      œ 0153       ½ 00BD
BE    ¾ 00BE      Ÿ 0178       ¾ 00BE

Note that all other byte values are missing from this table, as they have the same codepoints in all three code pages.

For non-Unicode software on Windows, code page Windows 1252 still is the most popular one. But what if the underlying software was older and assumed a DOS code page, or was historically forced to use a DOS code page?

Bingo: both Code page 850 and Code page 858 have Ë at codepoint 0xD3, whereas Windows 1252, ISO 8859-1 and ISO 8859-15 have it at 0xCB. In the tables below, you see Ë (Unicode code point U+00CB) at position 0xD3:

Code page 850
        _0        _1        _2        _3        _4        _5        _6        _7        _8        _9        _A        _B        _C        _D        _E        _F
8_ 128  Ç 00C7    ü 00FC    é 00E9    â 00E2    ä 00E4    à 00E0    å 00E5    ç 00E7    ê 00EA    ë 00EB    è 00E8    ï 00EF    î 00EE    ì 00EC    Ä 00C4    Å 00C5
9_ 144  É 00C9    æ 00E6    Æ 00C6    ô 00F4    ö 00F6    ò 00F2    û 00FB    ù 00F9    ÿ 00FF    Ö 00D6    Ü 00DC    ø 00F8    £ 00A3    Ø 00D8    × 00D7    ƒ 0192
A_ 160  á 00E1    í 00ED    ó 00F3    ú 00FA    ñ 00F1    Ñ 00D1    ª 00AA    º 00BA    ¿ 00BF    ® 00AE    ¬ 00AC    ½ 00BD    ¼ 00BC    ¡ 00A1    « 00AB    » 00BB
B_ 176  ░ 2591    ▒ 2592    ▓ 2593    │ 2502    ┤ 2524    Á 00C1    Â 00C2    À 00C0    © 00A9    ╣ 2563    ║ 2551    ╗ 2557    ╝ 255D    ¢ 00A2    ¥ 00A5    ┐ 2510
C_ 192  └ 2514    ┴ 2534    ┬ 252C    ├ 251C    ─ 2500    ┼ 253C    ã 00E3    Ã 00C3    ╚ 255A    ╔ 2554    ╩ 2569    ╦ 2566    ╠ 2560    ═ 2550    ╬ 256C    ¤ 00A4
D_ 208  ð 00F0    Ð 00D0    Ê 00CA    Ë 00CB    È 00C8    ı 0131    Í 00CD    Î 00CE    Ï 00CF    ┘ 2518    ┌ 250C    █ 2588    ▄ 2584    ¦ 00A6    Ì 00CC    ▀ 2580
E_ 224  Ó 00D3    ß 00DF    Ô 00D4    Ò 00D2    õ 00F5    Õ 00D5    µ 00B5    þ 00FE    Þ 00DE    Ú 00DA    Û 00DB    Ù 00D9    ý 00FD    Ý 00DD    ¯ 00AF    ´ 00B4
F_ 240  SHY 00AD  ± 00B1    ‗ 2017    ¾ 00BE    ¶ 00B6    § 00A7    ÷ 00F7    ¸ 00B8    ° 00B0    ¨ 00A8    · 00B7    ¹ 00B9    ³ 00B3    ² 00B2    ■ 25A0    NBSP 00A0

Code page 858
        _0        _1        _2        _3        _4        _5        _6        _7        _8        _9        _A        _B        _C        _D        _E        _F
8_ 128  Ç 00C7    ü 00FC    é 00E9    â 00E2    ä 00E4    à 00E0    å 00E5    ç 00E7    ê 00EA    ë 00EB    è 00E8    ï 00EF    î 00EE    ì 00EC    Ä 00C4    Å 00C5
9_ 144  É 00C9    æ 00E6    Æ 00C6    ô 00F4    ö 00F6    ò 00F2    û 00FB    ù 00F9    ÿ 00FF    Ö 00D6    Ü 00DC    ø 00F8    £ 00A3    Ø 00D8    × 00D7    ƒ 0192
A_ 160  á 00E1    í 00ED    ó 00F3    ú 00FA    ñ 00F1    Ñ 00D1    ª 00AA    º 00BA    ¿ 00BF    ® 00AE    ¬ 00AC    ½ 00BD    ¼ 00BC    ¡ 00A1    « 00AB    » 00BB
B_ 176  ░ 2591    ▒ 2592    ▓ 2593    │ 2502    ┤ 2524    Á 00C1    Â 00C2    À 00C0    © 00A9    ╣ 2563    ║ 2551    ╗ 2557    ╝ 255D    ¢ 00A2    ¥ 00A5    ┐ 2510
C_ 192  └ 2514    ┴ 2534    ┬ 252C    ├ 251C    ─ 2500    ┼ 253C    ã 00E3    Ã 00C3    ╚ 255A    ╔ 2554    ╩ 2569    ╦ 2566    ╠ 2560    ═ 2550    ╬ 256C    ¤ 00A4
D_ 208  ð 00F0    Ð 00D0    Ê 00CA    Ë 00CB    È 00C8    € 20AC    Í 00CD    Î 00CE    Ï 00CF    ┘ 2518    ┌ 250C    █ 2588    ▄ 2584    ¦ 00A6    Ì 00CC    ▀ 2580
E_ 224  Ó 00D3    ß 00DF    Ô 00D4    Ò 00D2    õ 00F5    Õ 00D5    µ 00B5    þ 00FE    Þ 00DE    Ú 00DA    Û 00DB    Ù 00D9    ý 00FD    Ý 00DD    ¯ 00AF    ´ 00B4
F_ 240  SHY 00AD  ± 00B1    ‗ 2017    ¾ 00BE    ¶ 00B6    § 00A7    ÷ 00F7    ¸ 00B8    ° 00B0    ¨ 00A8    · 00B7    ¹ 00B9    ³ 00B3    ² 00B2    ■ 25A0    NBSP 00A0

So the likely problem is that the sending system encodes Ë using code page 850/858 as byte 0xD3, while the receiving system assumes ISO 8859-1 or ISO 8859-15 and interprets byte 0xD3 as Ó.
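This hypothesis can be replayed in a few lines. A minimal sketch, assuming the sender uses code page 850/858 and the receiver ISO 8859-1; the function and variable names are hypothetical, and the code page excerpt only contains the two characters relevant here (taken from the tables above), not the full code page.

```javascript
// Excerpt of code page 850/858: only the characters needed for this example.
const cp850Excerpt = new Map([
    ['Ë', 0xD3],
    ['Ó', 0xE0],
]);

// Hypothetical sender: ASCII passes through unchanged, other characters use the excerpt.
// Characters missing from the excerpt are not handled in this sketch.
function encodeCp850(text) {
    return Array.from(text, (ch) =>
        ch.charCodeAt(0) < 0x80 ? ch.charCodeAt(0) : cp850Excerpt.get(ch));
}

// ISO 8859-1 maps every byte value to the Unicode code point with the same number,
// so String.fromCharCode is enough to decode a single byte.
function decodeLatin1Byte(byte) {
    return String.fromCharCode(byte);
}

const sent = 'PYRENEEËN';
const bytes = encodeCp850(sent);
const received = bytes.map(decodeLatin1Byte).join('');

console.log(received); // PYRENEEÓN
```

The output matches what ended up on the PostNL side, which supports the code page mix-up theory without proving which of the systems involved gets it wrong.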

xs4all

As a side note: I never got response to [NL] encoding blijft moeilijk, waarom toch? (dit keer in een brief van @xs4all) either, but also did not have the opportunity to re-test that myself.

I tried figuring out if I could find the cause, but these are so distinct that I have not yet found it.

There is still a plan to switch my fibre and VoIP ISP from [Archive.is] @xs4all to [Archive.is] @FreedomNetNL, so maybe one day…

Late 1960s: the first diacritic in a Dutch street name

Back in the late 1960s, Tilburg wanted to name a street after the composer Georg Friedrich Händel (yup, the English Wikipedia page incorrectly rewrote it as George Frideric Handel).

This was the era where (then still 7-bit!) EBCDIC ruled, Unicode was not even in its infancy, and ASCII only became a standard in 1969 – the year I was born.

Luckily, back then the German language already had digraph replacements (because of the origins of the German umlaut) like ä → ae, ö → oe, ü → ue, so Händel became Haendel, and in 1969 the street was named Haendellaan, as Tilburg was afraid “the Computer” (note singular!) would not cope well with diacritics.

With the advent of the Dutch postal code system in 1977 (yes, in addition to a phone book, in 1978 everybody got a “postcodeboek” as well), the street ended up with just one postal code: [Wayback] Postcode Haendellaan in Tilburg – Postcode bij adres.
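The digraph replacement is simple enough to sketch. The function name `replaceUmlauts` and the uppercase entries in the lookup table are my additions; the lowercase mappings are the ones listed above. The sketch assumes precomposed characters, hence the NFC normalisation up front.

```javascript
// Digraph replacements named in the text, plus their uppercase counterparts (my addition).
const digraphs = { 'ä': 'ae', 'ö': 'oe', 'ü': 'ue', 'Ä': 'Ae', 'Ö': 'Oe', 'Ü': 'Ue' };

function replaceUmlauts(name) {
    // Normalise to NFC first so "a" + combining diaeresis becomes a single "ä".
    return name.normalize('NFC').replace(/[äöüÄÖÜ]/g, (ch) => digraphs[ch]);
}

console.log(replaceUmlauts('Händellaan')); // Haendellaan
```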

Buildings and address registration in The Netherlands: BAG

The BAG public register (Basisregistratie Adressen en Gebouwen – Wikipedia) has current information on all addresses and buildings.

–jeroen


 

 
