
Unicode shenanigans: Martine écrit en UTF-8


A blog about functional programming

ftfy has been used as a data processing step in major NLP research, including OpenAI’s original GPT. Despite their differences, most encodings used in practice agree at least on the ASCII characters, in the range 0-127, which is enough to cover most English-language writing if you can compromise on details such as conflating the apostrophe with the single quotation marks. I first tried the naive thing: each character is canonically a Unicode code point, a number between 0 and 1114111 (0x10FFFF), and I just hoped that the ones that actually occurred would fit in the range 0-255.
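To make the "naive thing" concrete, here is a minimal sketch in Python. The language is an assumption, since the excerpt does not say what the original post uses, and `naive_bytes` is a hypothetical helper named for illustration only. It maps each character to a single byte via its code point, which only works when every code point fits in 0-255, and then shows what happens when UTF-8 bytes are misread as a one-byte-per-character encoding.

```python
def naive_bytes(text: str) -> bytes:
    """Hypothetical helper: one byte per character, taken from its code point.

    Raises ValueError if any code point falls outside 0-255.
    """
    code_points = [ord(ch) for ch in text]
    too_big = [cp for cp in code_points if cp > 255]
    if too_big:
        raise ValueError(f"code points outside 0-255: {too_big}")
    return bytes(code_points)


title = "Martine écrit en UTF-8"

# Every character here has a code point below 256, so the naive mapping
# succeeds. It coincides with Latin-1 (ISO-8859-1), which assigns byte N
# to code point N.
raw = naive_bytes(title)
assert raw == title.encode("latin-1")

# UTF-8 spends two bytes on "é" (U+00E9), so the byte strings differ in length.
utf8 = title.encode("utf-8")
assert len(raw) == 22 and len(utf8) == 23

# Reading UTF-8 bytes as if they were Latin-1 yields the familiar mojibake.
mojibake = utf8.decode("latin-1")
print(mojibake)  # Martine Ã©crit en UTF-8

# Pure ASCII text is encoded identically by all three encodings.
assert "UTF-8".encode("ascii") == "UTF-8".encode("latin-1") == "UTF-8".encode("utf-8")

# A typographic apostrophe (U+2019) does not fit in one byte this way.
try:
    naive_bytes("l’encodage")
except ValueError as err:
    print(err)  # code points outside 0-255: [8217]
```

The last case is the kind of compromise the paragraph alludes to: the ASCII apostrophe fits in the shared 0-127 range, while the curly one needs a code point far above 255.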


Read more on:

Unicode

UTF-8

Unicode shenanigans

Related news:

Text makeup – a tool to decode and explore Unicode strings

Unicode 16 now includes retro video game sprites [pdf]

Bitten by Unicode