The Tradeoffs of SSMs and Transformers
(or - tokens are bs)
This explanation is also useful, and I think it actually points to the same underlying principle as mine.

On a related note, another researcher hypothesized that SSMs may be less prone to hallucination than Transformers; this hasn't been fleshed out, but if true it would make sense from this intuition.

Intuitively, commonly used tokenizers like BPE and Unigram are based in part on information-theoretic heuristics, and play a particular role in smoothing out the non-uniform information rate of raw data into a form that's more easily processed by a Transformer. And so all LLMs suffer from this sort of noise and redundancy.

More recent ideas like mixture-of-depths and other conditional-compute approaches may make some progress here, but I don't think they sufficiently address it yet, and I'd guess they would be brittle.
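To make the tokenizer point concrete, here is a minimal sketch of the BPE training loop: the most frequent adjacent symbol pair is repeatedly merged into a single token, so high-frequency (low-information) spans get compressed into fewer symbols. The toy corpus and merge count are illustrative, not from any real tokenizer's defaults.

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Sketch of BPE training: learn `num_merges` pair merges from a word list."""
    # Represent each word as a tuple of symbols, starting from characters.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Apply the merge everywhere it occurs.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

# Toy corpus: frequent substrings like "est" get merged first.
corpus = ["low", "low", "lower", "newest", "newest", "newest", "widest"]
merges, vocab = bpe_merges(corpus, 4)
```

The frequency-greedy merge rule is exactly the information-theoretic heuristic at issue: it equalizes bits-per-token on average, but the resulting segmentation is fixed at training time rather than adapted to the content being modeled.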