Searching for DeepSeek's glitch tokens
A first attempt at identifying and cataloging DeepSeek's glitched tokens
I'm sure there's a lot worth exploring in those tokens, but I pretty quickly decided that I'd rather not stare at Chinese and broken Unicode for hours on end. I first manually filtered out uninteresting samples (V3 adding escape backslashes, extra or removed spaces, refusals on slurs, etc.), and then clustered the rest into rough groupings based on their initial appearance. "Fragment tokens" aren't too surprising to find in a large vocabulary, but I suspect there's still enough interesting behavior to be worth examining more closely eventually.
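For illustration, part of that first filtering pass could be automated along these lines. This is only a rough sketch, not what I actually ran: it assumes each sample is a (token, response) pair where the model was asked to reproduce the token, and the names `normalize`, `is_uninteresting`, and `filter_samples` are made up for this example. It covers only the backslash and whitespace cases; refusals would still need a manual look.

```python
def normalize(text: str) -> str:
    """Drop backslashes and all whitespace so trivial differences vanish."""
    without_escapes = text.replace("\\", "")
    return "".join(without_escapes.split())


def is_uninteresting(token: str, response: str) -> bool:
    """True when the response differs from the token only by added escape
    backslashes or by extra/removed whitespace (assumed sample format)."""
    return normalize(token) == normalize(response)


def filter_samples(samples):
    """Keep only the (token, response) pairs that differ in a non-trivial way."""
    return [(t, r) for t, r in samples if not is_uninteresting(t, r)]
```

Anything that survives this pass still has to be read by hand before it can be grouped by appearance.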