Hacker News Re-Imagined

In MySQL, use “utf8mb4” instead of “utf8” (2016)

  • 643 points
  • 12 days ago

  • @pcr910303
  • Created a post

In MySQL, use “utf8mb4” instead of “utf8” (2016)


@btschaegg 11 days

Replying to @pcr910303 🎙

> And developers who wanted correctness were wrong to use “utf8”, because it can’t store “poo emoji”.

This sentence deserves an award :)

Edit: TIL: HN doesn't support emojis.

Reply


@donatj 12 days

Replying to @pcr910303 🎙

Is Medium secretly mining crypto in the background? I have no idea what it is but their site makes my CPU spike and my fans turn on.

https://jdon.at/JTHj2G

Update - here's a video, it's making hundreds of graphql requests a minute…

https://jdon.at/z05ImC

Reply


@throw0101a 12 days

Replying to @pcr910303 🎙

Note:

> The utf8mb3 character set is deprecated and you should expect it to be removed in a future MySQL release. Please use utf8mb4 instead. utf8 is currently an alias for utf8mb3, but it is now deprecated as such, and utf8 is expected subsequently to become a reference to utf8mb4. Beginning with MySQL 8.0.28, utf8mb3 is also displayed in place of utf8 in columns of Information Schema tables, and in the output of SQL SHOW statements.

* https://dev.mysql.com/doc/mysql-g11n-excerpt/8.0/en/charset-...

Reply


@capitainenemo 12 days

Replying to @pcr910303 🎙

This is also a security vulnerability because MySQL/MariaDB silently truncate strings at the first invalid character. This can result in data manipulation attacks where a higher level layer validates the complete input as UTF-8 for insertion into the DB, but the database only stores half the string.

https://bugzilla.mozilla.org/show_bug.cgi?id=1253201

Reply


@exabrial 11 days

Replying to @pcr910303 🎙

> Choose CHAR columns. (The CHAR format is a relic nowadays. Back then, MySQL was faster with CHAR columns. Ever since 2005, it’s not.)

Is this vs VARCHAR? To me this is fascinating if so, I'd love to see a link or benchmark. My understanding that reading a stripe off a drive, then chopping it up indiscriminately at known intervals is faster than conditionally branching for strings of other lengths

Reply


@PrimeDirective 12 days

Replying to @pcr910303 🎙

The author mentioned MariaDB in the intro and the went on about MySQL, not sure which he actually used.

I think the latest recommendation for MySQL was utf8mb4_0900_ai_ci and utf8mb4_general_ci for MariaDB.

While I agree about the Postgres recommendation, it's a non-starter in the PHP world, unless doing something from scratch. Even then, the support for MySQL/MariaDB is much better.

Reply


@iliketrains 12 days

Replying to @pcr910303 🎙

Fun story: When I was working at one of the FAANG companies, I have placed the (Night with stars, it does not seem to render here [0]) character in my preferred name. Some time later, I was notified by HR that I must change my preferred name and I can only use UTF-8 characters with maximum length of 3 bytes. I was quite confused why such specific demand, I am pretty sure it was exactly this issue. I replaced it with ᗧ···ᗣ···

[0] https://emojipedia.org/night-with-stars/

Reply


@brightball 12 days

Replying to @pcr910303 🎙

I just went through the process of converting a 2TB MySQL database over the summer. Quite a few gotchas along the way that make the process a huge pain.

Reply


@mwattsun 12 days

Replying to @pcr910303 🎙

Has anyone ever successfully converted a large legacy Apache/PHP/MySQL site to UTF-8? If so you deserve an award. utf8mb4 is just one issue and not the worst.

Reply


@OliverJones 12 days

Replying to @pcr910303 🎙

WordPress has, of course, the lion's share of MySQL deployments by server count if not by row count. Yes, WordPress isn't the latest and greatest stuff. Yes, it's <bad thing> and <another bad thing>. But such is the curse of the customer base.

They upgraded everybody from utf8 to utf8mb4 (excluding some users of ancient versions of MySQL) with their version 4.2. It went live on April 23, 2015.

They announced the upgrade at the beginning of April 2015 here. https://make.wordpress.org/core/2015/04/02/the-utf8mb4-upgra... There's some interesting stuff in there about the practical difficulties of the upgrade. The biggest issue was, and still is, the need for prefix indexes.

With respect to Adam Hooper, his article was stale when it appeared just over a year after the WordPress schema upgrade. Here's right of course. But the biggest user base was well on their way to abandoning utfmb3 by that time.

Reply


@davidjfelix 12 days

Replying to @pcr910303 🎙

Does this even fully fix the problem? It looks like utf8mb4 is limited to 4 byte sequences, but as far as I understand, utf-8 is variable width. Can utf8mb4 even encode the scottish flag https://www.iemoji.com/view/emoji/2476/flags/scotland ?

Reply


@urbandw311er 12 days

Replying to @pcr910303 🎙

I enjoyed the article right up to the point where it concluded at the very end that the solution is to “switch to postgresql” without actually providing any rationale for that sudden and subjective ending.

Reply


@talos2110 12 days

Replying to @pcr910303 🎙

So in mysql utf8 does not mean utf8. Reminds me of iso8601 in php, which does not mean iso8601.

https://www.php.net/manual/en/class.datetimeinterface.php

Reply


@77pt77 12 days

Replying to @pcr910303 🎙

MySQL is an awful piece of software.

Try diving into the Time zone insanity, collations, and of course this UTF-8 gem that has lasted since at least 2011.

Reply


@rubyist5eva 12 days

Replying to @pcr910303 🎙

god I hate mysql so much - long live postgres, one of the greatest pieces of software ever created

Reply


@fake-name 12 days

Replying to @pcr910303 🎙

In MySQL, use postgres instead.

Reply


@hn_throwaway_99 12 days

Replying to @pcr910303 🎙

Wow, never knew this, this is really bad, especially in this day and age where emojis are so prevalent.

MySQL should deprecate utf8 and give a warning if you try to use it.

Reply


@daneel_w 12 days

Replying to @pcr910303 🎙

This is an old article and it was incorrect when written, and it's incorrect today. The author may possibly have learned since writing it that UTF-8, by design, encodes up to 21 bits of character code point in at most 4 bytes of data. It works and behaves as it should.

Reply


@jamaicahest 12 days

Replying to @pcr910303 🎙

> Database systems have subtle bugs and oddities, and you can avoid a lot of bugs by avoiding database systems.

This sounds like an offended developer. Avoiding database systems is not possible if you want to do a lot of programming tasks.

Reply


@LinuxBender 12 days

Replying to @pcr910303 🎙

Would the DBA's on this thread consider committing code to the MySQL [1] and Postgres [2] tuner scripts that give new DBA's all your learned advise and battle hardening experience? This thread appears to be such an example.

[1] - https://github.com/major/MySQLTuner-perl

[2] - https://github.com/jfcoz/postgresqltuner

Reply


@pm90 12 days

Replying to @pcr910303 🎙

Another interesting “issue” is that Case sensitive character sets may lead to unexpected results with aggregations (eg https://blog.mallya.dev/2021/07/25/mysql-cases-sensitivity/)

MySQL has many such issues which have turned me off from recommending it. It’s a shame because it’s a solid technology that mostly works and has a really long operational history.

Reply


@ipaddr 12 days

Replying to @pcr910303 🎙

No one talks about the downsides of changing.

Moving from 255 max characters to 191 or 192 max means a lot of your data needs to be moved into a text fields which means things like this field can't be a primary key.

Reply


@jijji 12 days

Replying to @pcr910303 🎙

i'll just stick latin1 charset for now... its slower and increases the size of ur tables to use utf8, etc [0]

[0] https://stackoverflow.com/questions/12449336/utf-8-vs-latin1...

Reply


@clon 12 days

Replying to @pcr910303 🎙

The plain utf8 covers the "basic multilingual plane" (x0000-xFFFF), so it will get you very far, actually.

In our app, we finally went for utf8mb4 to allow people to enter emoji. As a side "bonus" you will also soon see some clever people entering their names to stand out, such as "𝙹𝚘𝚑𝚗". Note that this is not "John". It is actually a series of mathematical symbols:

  U+1D679   MATHEMATICAL MONOSPACE CAPITAL J
  U+1D698   MATHEMATICAL MONOSPACE SMALL O
  U+1D691   MATHEMATICAL MONOSPACE SMALL H
  U+1D697   MATHEMATICAL MONOSPACE SMALL N

Reply


@fredoralive 12 days

Replying to @pcr910303 🎙

According to the docs, MySQL is a bit more explicit and calls the old "utf8" "utf8mb3" nowadays, and notes that it may be removed at some point. Although "utf8" is still aliased to the footgun at the moment.

https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8...

Reply


@webkike 12 days

Replying to @pcr910303 🎙

Genuine question: why would anyone prefer MySQL over Postgres?

Reply


@dark-star 12 days

Replying to @pcr910303 🎙

In the past, Unicode was assumed to be 64k of codepoints, so a 3-byte UTF-8 sequence was considered "long enough", especially since there were surrogate pairs for the rare cases where you have to encode higher code points.

Only "recently" have longer UTF-8 sequences (aka. emojis) become widespread enough that this became a problem.

Yes, it could have been avoided if they had allowed arbitrary-length UTF-8 sequences from the beginning, but I can see that they probably just wanted to optimize a bit.

What I don't understand is why they had to create a different encoding (the utf8mb4) instead of just extending the existing utf8 encoding, since 4-byte UTF-8 is obviously backward-compatible with 3-byte UTF-8... (unless they always used 3 bytes for every character, which would be stupid as UTF-8 has been explicitly designed for variable-length encodings)

Bonus: Many filesystems also do not allow 4+ byte UTF 8 code points in filenames. Test your company's file server to see if it allows you to save a file as "(some random emoji).doc". A few very expensive storage systems also have problems with that (and they have the same workaround: convert your filesystem to a different encoding, instead of simply extending the existing encoding to allow 4+ bytes)

Reply


@billpg 12 days

Replying to @pcr910303 🎙

Dear databases, please don't get hung up about string lengths when dealing with UTF8.

If I ask for a UTF8 string with a max-length of 100, please don't apply the worse case scenario and allocate space for 100 emojis. Please give me a box of 100 bytes and allow me to write any UTF-8 string that can fit into 100 bytes in there.

100 ASCII characters. 20 emojis. Any mixture of the two.

If I ask for UTF-8, it'll be because I'd like to make advantage of UTF-8 and I accept the costs. If that means I can't quickly jump to the character at index 84 in a string, no problem, I've accepted the trade-off.

Reply


@karsinkk 12 days

Replying to @pcr910303 🎙

This might be a tad unrelated to the original post, but the following article is one of my favorite primers on Character sets/Unicode :

https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...

Reply


@cblconfederate 12 days

Replying to @pcr910303 🎙

will something terrible happen if i just convert my tables to utf8mb4? Will i have to upgrade each column?

Reply


@smarx007 12 days

Replying to @pcr910303 🎙

Of course, nobody "refused" to fix a "bug". Instead, a non-conformant behavior was already relied upon by legacy systems out in the wild and the "fix" was added in a backwards-compatible way.

Edit: Three bytes are enough to fit nearly any of the chars in use in any language, including Chinese and Japanese, so I can only assume someone "smart" in the MySQL dev team decided to "save space" (before emoji were a thing).

Reply


@NelsonMinar 12 days

Replying to @pcr910303 🎙

In 2022 with a new release of MySQL or MariaDB, is it still possible to create a database with "utf8" encoding? Does it print a warning?

Reply


@maxfurman 12 days

Replying to @pcr910303 🎙

Let's say for the sake of discussion that your MySQL db has a lot of tables encoded as "utf8." Are there any known drawbacks or gotchas to converting them en masse to "utf8mb4"? Is this a lengthy operation?

Reply


@Animats 12 days

Replying to @pcr910303 🎙

That was set up when Microsoft and Java had standardized on UTF-16. So this can represent the UTF-16 subset of Unicode, which is Unicode Plane 0, the Basic Multilingual Plane (BMP). The higher-numbered "astral planes" of UTF-8 were rarely used. All modern languages with a significant user base are covered in Plane 0. Plane 1, the Supplementary Multilingual Plane, with Cretan Linear B, Egyptian hieroglyphics, and such, was seldom needed. Few people had fonts for those, anyway.

Because of the way UTF-8 is encoded, it takes 3 bytes to represent the UTF-16 set. That's because it only takes one byte for ASCII characters. Hence, 3-byte MySQL data.

Emoji, though, were put in Unicode Plane 1. That's where mass demand for the astral planes came from. More four byte UTF-8 characters started showing up in data.

Reply


@ninju 12 days

Replying to @pcr910303 🎙

This HN posting should be marked with (2016)

Is the issue still an issue?

Reply


@oleg_antonyan 12 days

Replying to @pcr910303 🎙

Thanks to emoji (almost)everything now supports unicode

Reply


@leros 12 days

Replying to @pcr910303 🎙

In addition to emojis, I've also had issues with certain characters copied from Microsoft Word being 4 bytes.

Reply


@sshine 12 days

Replying to @pcr910303 🎙

In MySQL, if you want to make sure that your Unicode fields are mangled, never use ‘utf8’, use ‘utf8mb3’. Since ‘utf8’ is an alias to ‘utf8mb3’, that alias might eventually be updated to point to ‘utf8mb4’ which won’t mangle your characters.

The only safe choice here is to explicitly use ‘utf8mb3’.

Reply


@MailNerd 12 days

Replying to @pcr910303 🎙

Another fun fact is that string comparisons are case insensitive by default:

https://dev.mysql.com/doc/refman/8.0/en/case-sensitivity.htm...

This can definitely catch you by surprise and cause bugs that are only detected late.

Reply


@didip 12 days

Replying to @pcr910303 🎙

MySQL is very similar to PHP in this regard. The good function name is actually broken so a new function with similar name is needed. This creates confusion for new people when adopting the technology.

Reply


@fareesh 12 days

Replying to @pcr910303 🎙

I decided to build all my new projects on postgres a few years ago. I am often reminded of why that was a good decision.

Reply


@ShaneMcGowan 11 days

Replying to @pcr910303 🎙

This one is very fun to debug in production when you're on MySQL 5.8

Reply


@pkrumins 12 days

Replying to @pcr910303 🎙

Does anyone know if utf8mb4 is variable length or does each character take full 4 bytes?

Reply


@hashimotonomora 12 days

Replying to @pcr910303 🎙

Just use ASCII.

Reply


@kolaente 12 days

Replying to @pcr910303 🎙



@patrickcteng 12 days

Replying to @pcr910303 🎙

I've had to "upgrade" a couple to utf8, and varchar(255) bites me all the time.

Reply


@qwerasdf0987 12 days

Replying to @pcr910303 🎙

I wrote about supporting emojis in Ruby on Rails / MySQL here:

https://josephecombs.com/2018/05/06/how-to-support-emojis-wi...

Reply


@TedShiller 12 days

Replying to @pcr910303 🎙

use Postgres instead of MySQL

Reply


@bambax 12 days

Replying to @pcr910303 🎙

> If you need a database, don’t use MySQL or MariaDB. Use PostgreSQL.

Sure. But for many use cases, SQLite is enough.

Reply


@ashvardanian 12 days

Replying to @pcr910303 🎙

Wow! I was working on this issue in our DBMS product today!

Fun suggestion, try making a JSON string with a NULL character somewhere in the middle. It will be encoded as \u0000 and is a a valid UTF-8 code, but most C based systems will truncate the string by estimating its length via strlen.

Java community and some other software vendors designed the Modified UTF-8, which replaces the zero with a 2-byte code point. Sleek. Aside from the fact, that you are modifying the data that customer wants to stay consistent.

Postres explicitly bans such cases in the VARCHAR, not sure if it can fit in their JSON columns. Who tried?

Reply


@lsllc 12 days

Replying to @pcr910303 🎙

A "hello devops" article from Jan 3, 2022 discussing this topic was posted by /u/flokoe a week ago:

https://www.hellodevops.blog/posts/database-character-sets-a...

And the ensuing discussion:

https://news.ycombinator.com/item?id=29793916

Reply


@CountSessine 12 days

Replying to @pcr910303 🎙

That's kind of funny - Oracle has a similar issue. If you want industry-standard UTF-8, you have to specify "AL32UTF8" as your encoding. "UTF8" is kind of crazy - it's this monstrous abomination called CESU-8 (https://en.wikipedia.org/wiki/CESU-8), which isnt' UTF-8 at all - it's actually this weird "UTF-16 complete with surrogate pairs wrapped in a UTF-8 shell" thing.

Reply


@kyralis 12 days

Replying to @pcr910303 🎙

> Database systems have subtle bugs and oddities, and you can avoid a lot of bugs by avoiding database systems.

Wat? That is... not a great takeaway from this experience. "Oh, this library had bug reported in it! I know, that means I should just write my own, because clearly they're all buggy and my software is always perfect."

Reply


@wonder_er 12 days

Replying to @pcr910303 🎙

a while ago, in my first engineering job, I was tracking down a strange bug related to chinese character sets.

Turns out some of the db tables were using utf8 and latin1; changing them to `utf8mb4` fixed it right up!

https://josh.works/troubleshooting-chinese-character-sets-in...

It was a fun bug to work on. I learned a lot about character encoding, and enjoyed bringing a refined "process" to the table.

Reply


@Dylan16807 12 days

Replying to @pcr910303 🎙

Also make sure mysqldump is using utf8mb4 or you'll get character replacement in your backups. This often needs to be configured separately.

Reply


@TedDoesntTalk 12 days

Replying to @pcr910303 🎙

> Back in 2002, MySQL gave users a speed boost if users could guarantee that every row in a table had the same number of bytes. To do that, users would declare text columns as “CHAR”

This database type has existed in many databases since at least the 1980s

Reply


@mrcarruthers 12 days

Replying to @pcr910303 🎙

And here's the commit that changed it: https://github.com/mysql/mysql-server/commit/43a506c0ced0e6e.... It was originally set to max 6 bytes but for some reason someone dropped it to 3.

They also seem to be slowly phasing it out. Internally it's utf8mb3 and utf8 is just an alias. The idea is to eventually make utf8 an alias to utf8mb4.

Reply


@jmnicolas 12 days

Replying to @pcr910303 🎙

It's because of things like that it's still hard for me to respect MySQL in 2022.

I use Postgres since 2008. It has never bitten me even once.

Reply


@tapoxi 12 days

Replying to @pcr910303 🎙

Maybe this is a cynical take - but we used https://pgloader.io/ a few years ago to migrate to Postgres, and have never been happier. MySQL has a lot of stupid decisions like this.

Reply


@DonHopkins 12 days

Replying to @pcr910303 🎙

>Of course, they never advertised this (probably because the bug is so embarrassing).

Not as embarrassing as being owned by Oracle.

Reply


@cryptos 12 days

Replying to @pcr910303 🎙

Best to be used with mysqli_real_escape_string ;-) https://www.php.net/manual/en/mysqli.real-escape-string.php

Reply


@tomwojcik 12 days

Replying to @pcr910303 🎙

I need to share something as literally today I fixed a bug in our project that's somewhat related.

MS SQL encodes everything with UTF-16. Emojis (code points) require up to 4 bytes. If it's a grapheme (emoji constructed from emojis), it will be even more.

We are using Django. If you check length of an emoji, it will give you `1` and Django assumes utf8 everywhere. If you try to save it to PostgreSQL with char field `max_length=1` it will work just fine, but on MS SQL it will fail as it requires 2 characters (4 bytes) in the DB!

I tried it with MS SQL 2017, 2019 with different collations on nvarchar and I'm pretty sure there's no way around it.

> Because in CHAR(n) and VARCHAR(n) or in NCHAR(n) and NVARCHAR(n), the n defines the byte storage size, not the number of characters that can be stored, it's important to determine the data type size you must convert to, in order to avoid data truncation.

https://docs.microsoft.com/en-US/sql/relational-databases/co...

Reply


@riteshpatel 12 days

Replying to @pcr910303 🎙

We learned this a few years ago when clients starting using emojis when saving their content and it got truncated. Very quickly converted everything to utf8mb4!

Reply


@dehrmann 12 days

Replying to @pcr910303 🎙

Memories... When I was at Amazon, my starter project was modifying Redshift to support four-byte UTF-8 characters, not just three-byte.

Reply


About Us

site design / logo © 2022 Box Piper