Tokenization for language modeling: BPE vs. Unigram Language Modeling (2020)
Tokenizers used by the best-performing language models (BERT, GPT-2, etc.) poorly reflect the morphology of English text. I had hoped to use some quarantine time to design one that aligns more closely with the relationships between wordforms, but Kaj Bostrom and Greg Durrett beat me to it, so this blog post materialized instead. I add some additional motivation, evaluate both methods against ‘gold standard’ tokenizations, and speculate about what might come next.
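As a concrete starting point, here is a minimal sketch, not taken from the post, of how the two segmentation methods can be produced side by side with SentencePiece (which implements both BPE and unigram LM); the corpus file, vocabulary size, and probe word are placeholder choices.

```python
import sentencepiece as spm

# Train two small tokenizers that differ only in the segmentation algorithm.
# "corpus.txt" stands in for any plain-text training file.
for model_type in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",
        model_prefix=f"tok_{model_type}",
        vocab_size=8000,
        model_type=model_type,
    )

bpe = spm.SentencePieceProcessor(model_file="tok_bpe.model")
uni = spm.SentencePieceProcessor(model_file="tok_unigram.model")

# Compare how each model segments the same word.
word = "unbelievably"
print("BPE:       ", bpe.encode(word, out_type=str))
print("Unigram LM:", uni.encode(word, out_type=str))
```

Trying a handful of morphologically complex words like this is enough to see where the two segmentations diverge.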
In the short term, I share Bostrom and Durrett's hope "that developers of future pretrained language models will consider adopting the unigram LM method over the more common BPE." This is to permit unambiguous reconstruction of strings from token sequences (originally discussed in "Neural Machine Translation of Rare Words with Subword Units"). There may be other ways to achieve that aim without doubling the vocabulary, for instance by adding a new_word mask as an additional input dimension (which a seq-to-seq model would predict as a secondary output), as sketched below.
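To make the new_word idea concrete, here is a toy sketch, not from the post: each subword token carries a binary flag marking whether it starts a new word, and that flag alone is enough to reconstruct the original string. The toy_subword_split helper is a hypothetical stand-in for a real BPE or unigram LM segmenter.

```python
from typing import List, Tuple

def toy_subword_split(word: str) -> List[str]:
    # Placeholder segmenter: chop into 3-character pieces.
    # A real system would use BPE or unigram LM subwords here.
    return [word[i:i + 3] for i in range(0, len(word), 3)]

def encode(text: str) -> Tuple[List[str], List[int]]:
    """Return subword tokens plus a parallel new_word mask."""
    tokens, new_word_mask = [], []
    for word in text.split():
        for j, piece in enumerate(toy_subword_split(word)):
            tokens.append(piece)
            new_word_mask.append(1 if j == 0 else 0)  # 1 marks a word boundary
    return tokens, new_word_mask

def decode(tokens: List[str], new_word_mask: List[int]) -> str:
    """Reconstruct the string: insert a space before each word-initial piece."""
    out = []
    for piece, is_new_word in zip(tokens, new_word_mask):
        if is_new_word and out:
            out.append(" ")
        out.append(piece)
    return "".join(out)

tokens, mask = encode("tokenization matters")
assert decode(tokens, mask) == "tokenization matters"
```

Because the boundary information lives in the mask rather than in marked-vs-unmarked copies of each subword, the vocabulary needs only one entry per piece.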