Unicode shenanigans: Martine écrit en UTF-8

A blog about functional programming

ftfy has been used as a data processing step in major NLP research, including OpenAI’s original GPT. Despite their differences, most encodings used in practice agree at least on the ASCII characters, in the range 0-127, which is enough to cover most English-language writing if you can compromise on details such as using the same character for the apostrophe and the single quotation marks. I first tried the naive thing: each character is canonically a Unicode code point, a number between 0 and 1114111, and I just hoped that the ones that actually occurred would fit in the range 0-255.
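To make those two claims concrete, here is a minimal Python sketch. It is not the code from the original post: the helper `naive_bytes` and the sample strings are hypothetical, chosen only for illustration. It checks that several common encodings produce identical bytes for ASCII text, then tries the naive "one code point, one byte" mapping and shows where it stops working.

```python
import sys

# The largest Unicode code point is U+10FFFF, i.e. 1114111.
max_code_point = sys.maxunicode

# Encodings that agree on the ASCII range give identical bytes for ASCII text.
ascii_text = "Martine writes in UTF-8"
encodings = ["ascii", "utf-8", "latin-1", "cp1252"]
ascii_bytes = {enc: ascii_text.encode(enc) for enc in encodings}
all_agree = len(set(ascii_bytes.values())) == 1


def naive_bytes(text):
    """Hypothetical helper: map each code point directly to one byte.

    Raises ValueError if any code point does not fit in the range 0-255.
    """
    code_points = [ord(ch) for ch in text]
    too_big = [cp for cp in code_points if cp > 255]
    if too_big:
        raise ValueError(f"code points outside 0-255: {too_big}")
    return bytes(code_points)


# "é" is U+00E9 (233), so the naive mapping works here, and it coincides
# with Latin-1, whose bytes are exactly the code points 0-255.
accented = "Martine écrit en UTF-8"
same_as_latin1 = naive_bytes(accented) == accented.encode("latin-1")

# The curly apostrophe is U+2019 (8217), so the naive mapping fails.
try:
    naive_bytes("It’s")
    curly_fits = True
except ValueError:
    curly_fits = False

print(all_agree, same_as_latin1, curly_fits)
```

The naive approach amounts to assuming the text is Latin-1. It survives accented Western European letters but breaks as soon as typographic punctuation or any non-Latin script shows up.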
