
You probably don't need to validate UTF-8 strings


Written 2024-05-16

Strings are important to all programmers, but to us bioinformaticians, they are absolutely central. In memory, a string is just a slice of bytes, so you'd think the string data type is not an interesting design space when designing a programming language - but you'd be wrong! In this post, I'll compare and contrast the design of strings in Rust and Julia.

In fact, I'm struggling to come up with a single thing that you can correctly and consistently do with UTF-8 text that you can't do with a bunch of opaque bytes - not including self-justifying reasons like "you can count the number of UTF-8 codepoints in a UTF-8 string". If you have a string and don't know whether it's valid UTF-8, simple operations like lowercasing, iterating, and counting codepoints give meaningless, undefined results. Now you could say that that's not semantically meaningful - simply skipping ill-formed sequences may be correct according to some arbitrary ruleset, but if I write a program which tries to uppercase \xff\xff\x00\xa1, then nothing meaningful will come out of it - something clearly went wrong.
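To make the \xff\xff\x00\xa1 example concrete, here is a minimal Rust sketch of three ways a program might treat those bytes when asked to uppercase them. The function names are hypothetical and chosen only for illustration; the standard-library calls (`std::str::from_utf8`, `String::from_utf8_lossy`, `to_uppercase`, `to_ascii_uppercase`) are the only real API used. It is not meant as the recommended design, only as a way to see the options side by side.

```rust
/// Strict: refuse to do anything unless the whole input is well-formed UTF-8.
/// (Hypothetical helper name, for illustration only.)
fn uppercase_strict(bytes: &[u8]) -> Result<String, std::str::Utf8Error> {
    std::str::from_utf8(bytes).map(|s| s.to_uppercase())
}

/// Lossy: replace each ill-formed sequence with U+FFFD, then uppercase.
/// The call always succeeds, but the original bytes are gone.
fn uppercase_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).to_uppercase()
}

/// Opaque bytes: uppercase ASCII letters only and pass every other byte
/// through untouched, making no claim about the encoding at all.
fn uppercase_ascii_bytes(bytes: &[u8]) -> Vec<u8> {
    bytes.to_ascii_uppercase()
}
```

With the input `b"\xff\xff\x00\xa1"`, the strict version returns an error, the lossy version returns three replacement characters around a NUL, and the byte-wise version returns the input unchanged. Only the first of these reports that something went wrong.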

