Contrary to what many believe, MySQL's utf8 is not full-blown UTF-8 support, but actually utf8mb3, which has been deprecated for a while now.
Only utf8mb4 gives you full-blown UTF-8 support.
This came up when someone was reminded of it by a Delphi application:
When I insert the :joy: emoji into a MySQL varchar field I get an error: #22007 Incorrect string value: '\xF0\x9F\x98\x82' for column 'remarks' at row 1
The database charset is utf8.
Note that the :joy: emoji is 😂 and has Unicode code point U+1F602 which is outside the basic multilingual plane.
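A quick check in Python 3 confirms both the byte sequence from that error message and the fact that the code point lies outside the BMP:

joy = '\U0001F602'                 # the :joy: emoji 😂
print(joy.encode('utf-8'))         # b'\xf0\x9f\x98\x82' - the exact bytes from the MySQL error
print(hex(ord(joy)))               # 0x1f602, above U+FFFF, so outside the BMP
print(len(joy.encode('utf-8')))    # 4 bytes: too many for utf8mb3, fine for utf8mb4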
»Where does PostgreSQL’s collation logic come from?
PostgreSQL relies on external libraries to order strings.
– libc, meaning the operating system locale facility (POSIX or Windows)
– icu, meaning the ICU project (if PostgreSQL was built with ICU support)«
MySQL does things differently:
MySQL binary data files are independent of the host operating system in byte order, number representation (as long as the host fulfils MySQL's basic requirements), collation and even time zone handling.
So MySQL implements collations internally, also to guarantee stability across OS updates.
If it didn’t, a libc update changing collations would mean you have to recreate a lot of indexes. Also, you would not be able to safely move data files from host to host.
For the same reason, MySQL itself has, for quite some time now, stopped changing its existing character sets and collations.
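To make that libc dependency concrete: in Python, plain sorting compares code points, while locale-aware sorting defers to the operating system's locale facility, exactly the kind of external dependency described above (an illustrative sketch, not MySQL or PostgreSQL internals):

import locale

words = ['Zebra', 'apple', 'Äpfel', 'zoo']

# Code-point order: capitals first, 'Äpfel' (U+00C4) sorts after 'zoo'.
print(sorted(words))                       # ['Zebra', 'apple', 'zoo', 'Äpfel']

# Locale-aware order comes from the OS (libc); it can change when the
# C library's collation tables change, which is why indexes built on it
# would have to be rebuilt after such an update.
locale.setlocale(locale.LC_COLLATE, '')    # use the locale configured in the OS
print(sorted(words, key=locale.strxfrm))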
So utf8 in MySQL is utf8mb3, the three-byte variant of its UTF-8 implementation, which covers only the Basic Multilingual Plane (Unicode up to U+FFFF).
When moving to full (multi-plane) UTF-8 support, a new name was needed, and utf8mb4 was chosen.
So when you actually want modern utf8 in MySQL, you have to use utf8mb4, and now you know why.
utf8 is deprecated and is expected to be remapped to utf8mb4 in some future MySQL release. This will be a breaking change, and I wonder if it will require dropping and recreating all indexes affected by it.
utf8 is an alias for utf8mb3; the character limit is implicit, rather than explicit in the name.
Note
The utf8mb3 character set is deprecated and you should expect it to be removed in a future MySQL release. Please use utf8mb4 instead. Although utf8 is currently an alias for utf8mb3, at some point utf8 is expected to become a reference to utf8mb4. To avoid ambiguity about the meaning of utf8, consider specifying utf8mb4 explicitly for character set references instead of utf8.
utf8mb4 contrasts with the utf8mb3 character set, which supports only BMP characters and uses a maximum of three bytes per character:
For a BMP character, utf8mb4 and utf8mb3 have identical storage characteristics: same code values, same encoding, same length.
For a supplementary character, utf8mb4 requires four bytes to store it, whereas utf8mb3 cannot store the character at all. When converting utf8mb3 columns to utf8mb4, you need not worry about converting supplementary characters because there are none.
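That conversion note can be checked programmatically; a small, hypothetical Python helper (fits_in_utf8mb3 is my name, not a MySQL API) decides whether a string stays within the BMP and therefore stores identically in utf8mb3 and utf8mb4:

def fits_in_utf8mb3(text: str) -> bool:
    # True when every character is in the BMP (at most three UTF-8 bytes),
    # so a MySQL utf8mb3 column could hold it unchanged.
    return all(ord(ch) <= 0xFFFF for ch in text)

print(fits_in_utf8mb3('vóór'))    # True:  BMP only, same storage in utf8mb3 and utf8mb4
print(fits_in_utf8mb3('ok 😂'))   # False: the emoji needs four bytes, so utf8mb4 is required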
That hyphen character in your –Reset (noticed by /u/SeeminglyScience) is an en-dash (decimal Unicode character 8211). (Copy-paste it and do [int]([char]'–') to see.)
PowerShell can handle en-dash as a hyphen (for the curious, see the case for it in the PowerShell tokenizer source code here where hyphen, en-dash, em-dash and horizontalBar all fall into the same code block), but because it’s a Unicode character your source code will have to be saved as a unicode encoded .ps1 file. (UCS2-LE, UTF8+BOM, UTF8-BOM seem to work for me).
If you don't, and you save it as ASCII/ANSI, what gets saved is â€“, which has a string-opening quote (“) in it, so the code breaks.
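You can model that byte-level confusion in Python (this only reproduces the mis-read, it is not the original PowerShell snippet):

endash = '–'                                     # U+2013 EN DASH
print(ord(endash))                               # 8211, matching [int]([char]'–') in PowerShell
print(endash.encode('utf-8'))                    # b'\xe2\x80\x93'
# Reading those UTF-8 bytes back as Windows-1252 (ANSI) yields the mangled text,
# including the string-opening “ quote:
print(endash.encode('utf-8').decode('cp1252'))   # 'â€“'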
A while ago, a co-worker did not believe me when I told him that the default XML encoding really is UTF-8 (he tried to force it to utf-8 anyway), and that content whose byte sequences do not match the (specified or default) encoding is a problem.
I thought I had blogged about that default, and where to find it, but apparently I had not.
Under the very specific circumstances of a UTF-8 encoded document with no external encoding information (which I understand from the comments is what you’re interested in), there is no difference between the two declarations.
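Python's xml.etree shows both points: with no declaration and no external information the parser assumes UTF-8, and bytes that do not match that encoding are rejected (a minimal sketch):

import xml.etree.ElementTree as ET

# No XML declaration, no external encoding info: the parser assumes UTF-8.
print(ET.fromstring(b'<r>caf\xc3\xa9</r>').text)   # café

# The same text as Windows-1252 bytes does not match the assumed encoding:
try:
    ET.fromstring(b'<r>caf\xe9</r>')
except ET.ParseError as e:
    print('problem:', e)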
Bad surprise of the day: SysUtils.TEncoding in XE2+ defaults to ANSI, while in XE it defaulted to UTF-8. Among other things this means that TStringList… – Eric Grange – Google+
+Stefan Glienke Indeed, you’re right. The issue must be deeper somewhere. Don’t have time to investigate too much, I’m bypassing the RTL now (also have to work around the limitation that for utf-8 the TEncoding.GetString method returns an empty string if one character in the buffer isn’t utf-8)
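For comparison, Python's decoder makes that all-or-nothing behaviour explicit: a strict decode fails on the first bad byte, while errors='replace' salvages the rest (a sketch of the same kind of problem, not of the Delphi RTL):

data = b'caf\xe9'                    # 'café' in Windows-1252, not valid UTF-8

try:
    data.decode('utf-8')             # strict: one bad byte fails the whole buffer
except UnicodeDecodeError as e:
    print('decode failed:', e)

print(data.decode('utf-8', errors='replace'))   # 'caf\ufffd': the rest is preserved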
From quite some time ago, but still very relevant as encoding issues keep occurring:
A while ago, I saw the text “v3/43/4r” in a document. I know it comes from “vóór” (the acute accents add emphasis in Dutch), and wondered which encoding failure had been applied to get this wrong.
ó is U+00F3, and occupies the same code point (0xF3) in a lot of different encodings (most ISO-8859-* and most western Windows-*).
In CP850 the code point 0xF3 is ¾ (U+00BE), the three-quarters character. It is the same in other, less used, code pages (CP775, CP856, CP857, CP858).
The ¾ is sometimes transliterated to 3/4 when the character is not directly available.
And there you are! “vóór” -> “v¾¾r” -> “v3/43/4r”.
The first part (ó -> ¾) is the usual corruption between ANSI and OEM code pages in Western Windows versions (in my country ANSI=Windows-1252, OEM=CP850). You can see it easily by creating a file with Notepad, writing vóór in it, and dumping it in a command prompt with type.
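The whole chain is easy to replay in Python (the final ¾ -> 3/4 step is modelled with a simple replace, standing in for whatever transliteration the real tool applied):

text = 'vóór'
ansi_bytes = text.encode('cp1252')         # b'v\xf3\xf3r', the ANSI (Windows-1252) file content
oem_view = ansi_bytes.decode('cp850')      # 'v¾¾r': the same bytes read as the OEM code page
ascii_view = oem_view.replace('¾', '3/4')  # 'v3/43/4r': ¾ transliterated where it is unavailable
print(oem_view, ascii_view)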
Python:
with open(filename, 'r', encoding='utf-16-le') as f:
    for line in f:
        pass
Delphi:
for Line in TLineReader.FromFile(filename, TEncoding.Unicode) do
;
This spurred some nice observations and unfounded statements on which encodings should be used, so I posted a bit of history that is included below.
Some tips and observations from the links:
Good old text files are not “good” with Unicode support, and neither are TextFile Device Drivers; nobody has written a driver supporting a wide range of encodings yet.
Good old text files are slow as well, even with a changed SetTextBuffer
At its release in 1993, Windows NT was very early in supporting Unicode. Development of Windows NT started in 1990, when they opted for UCS-2, with 2 bytes per character, and only a non-required annex on UTF-1.
UTF-8 – which UTF-1 later evolved into – did not even exist at that time. Even UCS-2 was still young: it had been designed in 1989. UTF-8 was outlined late 1992 and became a standard in 1993.
Though formatting mangled the registry key to add, the article is interesting: since 2003 (C# Builder 1), you can force the IDE to always save files as UTF-8, which should alleviate a lot of encoding problems.
It beats me why this isn’t the default setting, but below is an example .reg file for Delphi 8 which should be easily transformed to more recent Delphi versions:
So basically (if formatting is kept), you browse to this key (replace Borland with the company for your specific Delphi version, and replace 2.0 by your IDE version):
Microsoft Windows has a code page designated for UTF-8, code page 65001. Prior to Windows 10 insider build 17035 (November 2017),[7] it was impossible to set the locale code page to 65001, leaving this code page only available for:
Explicit conversion functions such as MultiByteToWideChar
The Win32 console command chcp 65001 to translate stdin/out between UTF-8 and UTF-16.
This means that “narrow” functions, in particular fopen, cannot be called with UTF-8 strings, and in fact there is no way to open all possible files using fopen no matter what the locale is set to and/or what bytes are put in the string, as none of the available locales can produce all possible UTF-16 characters.
On all modern non-Windows platforms, the string passed to fopen is effectively UTF-8. This produces an incompatibility between other platforms and Windows. The normal work-around is to add Windows-specific code to convert UTF-8 to UTF-16 using MultiByteToWideChar and call the “wide” function.[8] Conversion is also needed even for Windows-specific api such as SetWindowText since many applications inherently have to use UTF-8 due to its use in file formats, internet protocols, and its ability to interoperate with raw arrays of bytes.
There were proposals to add new API to portable libraries such as Boost to do the necessary conversion, by adding new functions for opening and renaming files. These functions would pass filenames through unchanged on Unix, but translate them to UTF-16 on Windows.[9] This would allow code to be “portable”, but required just as many code changes as calling the wide functions.
With insider build 17035 and the April 2018 update (nominal build 17134) for Windows 10, a “Beta: Use Unicode UTF-8 for worldwide language support” checkbox appeared for setting the locale code page to UTF-8.[a] This allows for calling “narrow” functions, including fopen and SetWindowTextA, with UTF-8 strings. Microsoft claims this option might break some functions (a possible example is _mbsrev[10]) as they were written to assume multibyte encodings used no more than 2 bytes per character, thus until now code pages with more bytes such as GB 18030 (cp54936) and UTF-8 could not be set as the locale.[11]
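The work-around mentioned above, converting UTF-8 to UTF-16 before calling a "wide" Win32 function, looks roughly like this from Python via ctypes (a Windows-only sketch; MultiByteToWideChar and CP_UTF8 are the real Win32 names, the file name is just an example):

import ctypes

CP_UTF8 = 65001
kernel32 = ctypes.windll.kernel32                 # only available on Windows

utf8_name = 'vóór.txt'.encode('utf-8')

# First call asks how many UTF-16 code units the conversion needs.
needed = kernel32.MultiByteToWideChar(CP_UTF8, 0, utf8_name, len(utf8_name), None, 0)
buffer = ctypes.create_unicode_buffer(needed)
# Second call performs the conversion into the buffer.
kernel32.MultiByteToWideChar(CP_UTF8, 0, utf8_name, len(utf8_name), buffer, needed)

wide_name = buffer[:needed]                       # ready for CreateFileW, SetWindowTextW, ...
print(wide_name)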
One of our customers had enabled that beta UTF-8 option, and we started to experience very weird problems; it took us some time to find out why things misbehaved.
None of the applications (ours or third-party) could connect to the Firebird SQL server successfully.
So it would be smart to go through all tooling and code with that setting enabled; we never know what M$oft will do with it, whether it will ever be fully released or will soon be the default for everyone.