Unicode shenanigans: Martine écrit en UTF-8
A blog about functional programming
ftfy has been used as a data processing step in major NLP research, including OpenAI’s original GPT. In spite of their differences, most encodings used in practice agree at least on the ASCII characters, in the range 0-127, which is enough to cover the majority of English-language writing if you can accept compromises such as using the same character for the apostrophe and for single quotation marks. I first tried the naive thing: each character is canonically a Unicode code point, which is a number between 0 and 1114111 (0x10FFFF), and I just hoped that the ones that actually occurred would fit in the range 0-255.
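As a rough illustration of that naive check, here is a minimal Python sketch using only the standard library. The helper name `fits_in_one_byte` and the sample strings are assumptions made for this example, not code from the post or from ftfy.

```python
def fits_in_one_byte(text: str) -> bool:
    """Return True if every code point in text lies in the range 0-255."""
    return all(ord(ch) <= 255 for ch in text)


# Hypothetical sample: "é" is U+00E9 (233), so the whole title fits in 0-255.
title = "Martine écrit en UTF-8"
title_fits = fits_in_one_byte(title)

# The curly apostrophe is U+2019 (8217), far outside the range.
curly_fits = fits_in_one_byte("ftfy’s")

# Pure ASCII text encodes to the same bytes in ASCII, Latin-1 and UTF-8.
ascii_agrees = (
    "Martine".encode("ascii")
    == "Martine".encode("latin-1")
    == "Martine".encode("utf-8")
)

# Decoding UTF-8 bytes as Latin-1 produces the familiar mojibake,
# and reversing the two steps recovers the original text.
mojibake = "écrit".encode("utf-8").decode("latin-1")
repaired = mojibake.encode("latin-1").decode("utf-8")

print(title_fits, curly_fits, ascii_agrees)
print(mojibake, repaired)
```

The check passes for text that stays within Latin-1, but a single typographic apostrophe is enough to break the hope that everything fits in 0-255.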