How has DeepSeek improved the Transformer architecture?
This Gradient Updates issue goes over the major changes that went into DeepSeek’s most recent model.
In this issue, I’ll cover some of the important architectural improvements that DeepSeek highlights in their report and why we should expect them to result in better performance compared to a vanilla Transformer. The full technical report contains plenty of non-architectural details as well, and I strongly recommend reading it if you want to get a better idea of the engineering problems that have to be solved when orchestrating a moderate-sized training run.

One of those improvements is the mixture-of-experts layer, which sends each token to a small subset of experts chosen by a discrete top-k selection. The problem with this is that it introduces a rather ill-behaved discontinuous function with a discrete image at the heart of the model, in sharp contrast to vanilla Transformers which implement continuous input-output relations.
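To make the discontinuity concrete, here is a minimal sketch of generic top-k expert routing (a standard gating scheme, not DeepSeek’s actual code; all names are illustrative). The hard top-k selection means an arbitrarily small change in the router’s scores can flip which experts a token is sent to, so the layer’s output jumps rather than varying smoothly:

```python
import torch

def top_k_routing_mask(router_logits: torch.Tensor, k: int = 2) -> torch.Tensor:
    # router_logits: (num_tokens, num_experts) scores from a learned router.
    # torch.topk makes a discrete choice of the k highest-scoring experts,
    # so an infinitesimal perturbation of the logits can change which
    # experts fire — the token -> expert-set map is discontinuous.
    _, top_idx = torch.topk(router_logits, k, dim=-1)
    mask = torch.zeros_like(router_logits)
    mask.scatter_(-1, top_idx, 1.0)  # 1.0 for selected experts, 0.0 otherwise
    return mask

# Two almost identical score vectors can route to different experts:
logits_a = torch.tensor([[1.00, 0.99, 0.50]])
logits_b = torch.tensor([[0.99, 1.00, 0.50]])  # tiny perturbation
print(top_k_routing_mask(logits_a, k=1))  # selects expert 0
print(top_k_routing_mask(logits_b, k=1))  # selects expert 1
```

The usual way around this in practice is to make the contribution of each selected expert proportional to its (continuous) router score, so that gradients still flow through the gate even though the selection itself stays discrete.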