Contrary to what many believe is that MySQL utf8 is not always full blown UTF-8 support, but actually utf8mb3, which has been deprecated for a while now.
Only utf8mb4 will give you full blown UTF-8 support.
This when someone reminded me of this in a Delphi application:
When I insert :joy: emoji into mysql varchar filed I got an error :
#22007 Incorrect string value: '\xF0\x9F\x98\x82' for column 'remarks' at row 1database charset is
utf8
Note that the :joy: emoji is 😂 and has Unicode code point U+1F602 which is outside the basic multilingual plane.
See:
- [Wayback] Unicode Character ‘FACE WITH TEARS OF JOY’ (U+1F602)
- Plane (Unicode): Overview, Basic Multilingual Plane – Wikipedia
- [Archive.is] Kristian Köhntopp on Twitter: “MySQL also, for quite some time now, no longer updates its own charsets and collations internally, for the same reason. So utf8 in MySQL is utf8mb3, the three byte variant of Unicode UTF-8 implementation that covers only the BMP (unicode up to U+FFFF).”
- Kristian Köhntopp
»Where does PostgreSQL’s collation logic come from?
PostgreSQL relies on external libraries to order strings.
– libc, meaning the operating system locale facility (POSIX or Windows)
– icu, meaning the ICU project (if PostgreSQL was built with ICU support)« - MySQL does things differently:
MySQL binary data files are independent of the host operating system in byte order, number representation (as long as the host fulfils MySQLs basic requirements), collation and even time zone handling.
- So MySQL implements collations internally, also to guarantee stability across OS updates.
If it didn’t, a libc update changing collations would mean you have to recreate a lot of indexes. Also, you would not be able to safely move data files from host to host.
- MySQL also, for quite some time now, no longer updates its own charsets and collations internally, for the same reason.
So utf8 in MySQL is utf8mb3, the three byte variant of Unicode UTF-8 implementation that covers only the BMP (unicode up to U+FFFF).
- When moving to fuller (multiplane) UTF-8 support, a new name was needed, and utf8mb4 was chosen.
So when you actually want modern utf8 in MySQL, you have to use utf8mb4, and now you know why.
- utf8 is deprecated and will be upgraded to utf8mb4 in some future MySQL release. This will be a breaking upgrade, and I wonder if it will require dropping and recreating all indexes affected by the change.
That will be painful.
- https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8mb3.html …
utf8mb3 page in the MySQL 8.0 manual, with deprecation notice.
What will change is the meaning of the alias utf8 (currently an alias for utf8mb3).
- Kristian Köhntopp
- [Wayback] MySQL: Some Character Set Basics | Die wunderbare Welt von Isotopp
- [Wayback] MySQL :: MySQL 8.0 Reference Manual :: 10.9.2 The
utf8mb3Character Set (3-Byte UTF-8 Unicode Encoding)
utf8is an alias forutf8mb3; the character limit is implicit, rather than explicit in the name.NoteThe
utf8mb3character set is deprecated and you should expect it to be removed in a future MySQL release. Please useutf8mb4instead. Althoughutf8is currently an alias forutf8mb3, at some pointutf8is expected to become a reference toutf8mb4. To avoid ambiguity about the meaning ofutf8, consider specifyingutf8mb4explicitly for character set references instead ofutf8. - [Wayback] MySQL :: MySQL 8.0 Reference Manual :: 10.9.1 The utf8mb4 Character Set (4-Byte UTF-8 Unicode Encoding)
utf8mb4contrasts with theutf8mb3character set, which supports only BMP characters and uses a maximum of three bytes per character:- For a BMP character,
utf8mb4andutf8mb3have identical storage characteristics: same code values, same encoding, same length. - For a supplementary character,
utf8mb4requires four bytes to store it, whereasutf8mb3cannot store the character at all. When convertingutf8mb3columns toutf8mb4, you need not worry about converting supplementary characters because there are none.
- For a BMP character,
–jeroen












