Get the latest tech news

Branchless UTF-8 Encoding


Can you encode UTF-8 without branches? Yes. The question In a Recurse chat, Nathan Goldbaum asked: I know how to decode UTF-8 using bitmath and some LUTs (see https://github.com/skeeto/branchless-utf8), but if I want to to go from a codepoint to UTF-8, is there a way to do it without branches? To start with, is there a way to write this C function, which returns the number of bytes needed to store the UTF-8 bytes for the codepoint without any branches? Or would I need a huge look-up-table? The C function int num_utf8_bytes_for_codepoint(uint32_t code) { if (code <= 0x7F) { return 1; } else if (code <= 0x07FF) { return 2; } else if (code <= 0xFFFF) { if ((code >= 0xD800) && (code <= 0xDFFF)) { // surrogates are invalid UCS4 code points return -1; } return 3; } else if (code <= 0x10FFFF) { return 4; } else { // codepoint is outside the valid unicode range return -1; } } I pondered this but didn’t immediately see a way to do it without a huge (2^32) lookup table.

Put differently, the “count leading zeros” intrinsic isn’t necessarily a branchless instruction. Rust’s semantics guarantee this conversion is safe, and it happens to be a representation the hardware can work with; it doesn’t appear to incur a conditional after compilation. Happily, this transformation also allowed the compiler to realize len <= 4 on all paths, and to statically eliminate the array bounds check.

Get the Android app

Or read this on Hacker News