The Tradeoffs of SSMs and Transformers


(or - tokens are bs)

This explanation is also useful, and I think it actually points to the same underlying principle as mine.

On a related note, another researcher hypothesized that SSMs may be less prone to hallucination than Transformers; this hasn't been fleshed out, but if true it would make sense given this intuition.

Intuitively, commonly used tokenizers like BPE and Unigram are based in part on information-theoretic heuristics, and they play a particular role in smoothing out the non-uniform information rate of raw data into a form that is more easily processed by a Transformer. And so all LLMs suffer from this sort of noise and redundancy.

More recent ideas like mixture-of-depths and other conditional-compute approaches may make some progress here, but I think they don't sufficiently address it yet, and I'd guess they would be brittle.
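To make the tokenizer intuition concrete, here is a toy sketch of byte-pair-style merging (a minimal illustration of the idea, not the actual BPE or Unigram implementations used by real tokenizers): the most frequent adjacent pair is repeatedly collapsed into a single token, so highly redundant spans end up covered by few tokens while rare spans stay split, which roughly evens out the information carried per token.

```python
# Toy sketch of BPE-style merging -- an illustration only, not a production tokenizer.
from collections import Counter

def toy_bpe(text: str, num_merges: int = 20) -> list[str]:
    """Greedily merge the most frequent adjacent pair of symbols."""
    tokens = list(text)  # start from raw characters
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))  # count adjacent pairs
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats; further merges would not compress anything
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)  # collapse the redundant pair into one token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

if __name__ == "__main__":
    # The repeated clause compresses into a handful of multi-character tokens,
    # while the rare trailing word stays near character level.
    print(toy_bpe("the cat sat on the mat. the cat sat on the mat. qzxv"))
```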

Related news:

Understanding Transformers via N-gram Statistics

Beyond transformers: Nvidia’s MambaVision aims to unlock faster, cheaper enterprise computer vision

Transformers Without Normalization