Corrected UTF-8 (2022)

UTF-8 is decent and all, but it contains some design errors, partly because its original designers just messed up, and partly because of ISO and Unicode Consortium internal politics. We’re probably going to be using it forever, so it would be good to correct these design errors before they get any more entrenched than they already have.

Standard UTF-8 is limited to the code points that UTF-16 can represent. This was purely because of internal ISO and Unicode Consortium politics; they rejected the possibility of a future in which code points would exist that UTF-16 could not represent. Corrected UTF-8 reverts to the original definition of four-, five-, and six-byte sequences from RFC 2044; after taking the offsets into account, the highest encodable code point is U+8421 109F. U+10E7D is RUMI FRACTION ONE THIRD, and U+ED4E is the private use character assigned by the Under-ConScript Unicode Registry to NIJI CONSONANT CH; these choices are largely arbitrary.
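The figure U+8421 109F can be checked with a little arithmetic. The sketch below is not taken from the article. It assumes that "offsets" means each sequence length starts counting where the previous length left off, and that two ranges are skipped entirely (the 32 C1 controls and the 2048 surrogates). Neither skipped range is stated in this excerpt; they are assumptions chosen because they reproduce the quoted number. The function names are illustrative.

```python
# Payload bits carried by a 1- to 6-byte sequence in the original
# six-length UTF-8 layout (RFC 2044 style).
PAYLOAD_BITS = [7, 11, 16, 21, 26, 31]

# ASSUMPTION (not stated in the excerpt): code points skipped entirely,
# so they consume no encodings and push the top of the range upward.
SKIPPED_C1_CONTROLS = 0x20   # U+0080..U+009F
SKIPPED_SURROGATES = 0x800   # U+D800..U+DFFF


def highest_code_point(skipped=SKIPPED_C1_CONTROLS + SKIPPED_SURROGATES):
    """Highest code point when every length continues from the last one."""
    # With offsets, an n-byte sequence encodes 2**bits brand-new code points,
    # so the counts for all six lengths simply add up.
    encodable = sum(1 << bits for bits in PAYLOAD_BITS)
    return encodable - 1 + skipped


def highest_without_offsets():
    """Highest code point when every length restarts from zero."""
    return (1 << PAYLOAD_BITS[-1]) - 1


print(f"with offsets:    U+{highest_code_point():08X}")       # U+8421109F
print(f"without offsets: U+{highest_without_offsets():08X}")  # U+7FFFFFFF
```

Under these assumptions the result matches the U+8421 109F quoted above; with no skipped ranges the same sum gives U+8421 087F.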
