Searching for DeepSeek's glitch tokens


A first attempt at identifying and cataloging DeepSeek's glitched tokens

I'm sure there's a lot worth exploring in those tokens, but I pretty quickly decided that I'd rather not stare at Chinese and broken Unicode for hours on end. I first manually filtered out uninteresting samples (V3 adding escape backslashes, extra or removed spaces, refusals on slurs, etc.), and then clustered the rest into rough groupings based on their initial appearance. "Fragment tokens" aren't too surprising to find in a large vocabulary, but I suspect there's still enough interesting behavior to be worth examining more closely eventually.
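As a rough illustration of the filtering step, here is a minimal Python sketch. It assumes each sample is a pair of the original token and the text the model echoed back. That format, the function names, and the example strings are hypothetical and do not come from the original write-up, where the filtering was done by hand. The sketch only covers the backslash and whitespace cases; spotting refusals would need a separate check.

```python
def is_uninteresting(token: str, echoed: str) -> bool:
    """Return True when the echo differs from the token only trivially."""
    def normalize(text: str) -> str:
        # Drop backslashes (added escapes) and all whitespace
        # (extra or removed spaces) before comparing.
        return "".join(ch for ch in text if ch != "\\" and not ch.isspace())

    return normalize(token) == normalize(echoed)


def candidate_glitch_tokens(samples):
    """Keep (token, echoed) pairs whose mismatch survives normalization."""
    return [(tok, out) for tok, out in samples if not is_uninteresting(tok, out)]


# Hypothetical samples, made up for illustration only.
samples = [
    ("foo_bar", "foo\\_bar"),  # echo only added an escape backslash
    (" baz", "baz"),           # echo only dropped a leading space
    ("qux", "quux"),           # echo actually changed the text
]
candidates = candidate_glitch_tokens(samples)
```

Only the last pair would be kept for closer inspection; the first two are the kind of mismatch the excerpt describes as uninteresting.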

Related news:

Tech leaders respond to the rapid rise of DeepSeek

Chinese AI startup DeepSeek unveils open-source model to rival #OpenAI o1. DeepSeek-R1 features 671 billion parameters and claims performance superiority to OpenAI’s o1 on key benchmarks. 👀

Scale AI CEO Says China Has Quickly Caught the US With DeepSeek