When MySQL characterset ‘utf’ does not allow you to enter some Unicode code points
Posted by jpluimers on 2022/01/06
Contrary to what many believe is that MySQL utf8
is not always full blown UTF-8 support, but actually utf8mb3
, which has been deprecated for a while now.
Only utf8mb4
will give you full blown UTF-8 support.
This when someone reminded me of this in a Delphi application:
When I insert :joy: emoji into mysql varchar filed I got an error :
#22007 Incorrect string value: '\xF0\x9F\x98\x82' for column 'remarks' at row 1
database charset is
utf8
Note that the :joy: emoji is 😂 and has Unicode code point U+1F602
which is outside the basic multilingual plane.
See:
- [Wayback] Unicode Character ‘FACE WITH TEARS OF JOY’ (U+1F602)
- Plane (Unicode): Overview, Basic Multilingual Plane – Wikipedia
- [Archive.is] Kristian Köhntopp on Twitter: “MySQL also, for quite some time now, no longer updates its own charsets and collations internally, for the same reason. So utf8 in MySQL is utf8mb3, the three byte variant of Unicode UTF-8 implementation that covers only the BMP (unicode up to U+FFFF).”
- Kristian Köhntopp
»Where does PostgreSQL’s collation logic come from?
PostgreSQL relies on external libraries to order strings.
– libc, meaning the operating system locale facility (POSIX or Windows)
– icu, meaning the ICU project (if PostgreSQL was built with ICU support)« - MySQL does things differently:
MySQL binary data files are independent of the host operating system in byte order, number representation (as long as the host fulfils MySQLs basic requirements), collation and even time zone handling.
- So MySQL implements collations internally, also to guarantee stability across OS updates.
If it didn’t, a libc update changing collations would mean you have to recreate a lot of indexes. Also, you would not be able to safely move data files from host to host.
- MySQL also, for quite some time now, no longer updates its own charsets and collations internally, for the same reason.
So utf8 in MySQL is utf8mb3, the three byte variant of Unicode UTF-8 implementation that covers only the BMP (unicode up to U+FFFF).
- When moving to fuller (multiplane) UTF-8 support, a new name was needed, and utf8mb4 was chosen.
So when you actually want modern utf8 in MySQL, you have to use utf8mb4, and now you know why.
- utf8 is deprecated and will be upgraded to utf8mb4 in some future MySQL release. This will be a breaking upgrade, and I wonder if it will require dropping and recreating all indexes affected by the change.
That will be painful.
- https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8mb3.html …
utf8mb3 page in the MySQL 8.0 manual, with deprecation notice.
What will change is the meaning of the alias utf8 (currently an alias for utf8mb3).
- Kristian Köhntopp
- [Wayback] MySQL: Some Character Set Basics | Die wunderbare Welt von Isotopp
- [Wayback] MySQL :: MySQL 8.0 Reference Manual :: 10.9.2 The
utf8mb3
Character Set (3-Byte UTF-8 Unicode Encoding)
utf8
is an alias forutf8mb3
; the character limit is implicit, rather than explicit in the name.NoteThe
utf8mb3
character set is deprecated and you should expect it to be removed in a future MySQL release. Please useutf8mb4
instead. Althoughutf8
is currently an alias forutf8mb3
, at some pointutf8
is expected to become a reference toutf8mb4
. To avoid ambiguity about the meaning ofutf8
, consider specifyingutf8mb4
explicitly for character set references instead ofutf8
. - [Wayback] MySQL :: MySQL 8.0 Reference Manual :: 10.9.1 The utf8mb4 Character Set (4-Byte UTF-8 Unicode Encoding)
utf8mb4
contrasts with theutf8mb3
character set, which supports only BMP characters and uses a maximum of three bytes per character:- For a BMP character,
utf8mb4
andutf8mb3
have identical storage characteristics: same code values, same encoding, same length. - For a supplementary character,
utf8mb4
requires four bytes to store it, whereasutf8mb3
cannot store the character at all. When convertingutf8mb3
columns toutf8mb4
, you need not worry about converting supplementary characters because there are none.
- For a BMP character,
–jeroen
Leave a Reply