How has DeepSeek improved the Transformer architecture?


This Gradient Updates issue goes over the major changes that went into DeepSeek’s most recent model.

In this issue, I’ll cover some of the important architectural improvements that DeepSeek highlights in its report and why we should expect them to result in better performance compared to a vanilla Transformer. The full technical report contains plenty of non-architectural details as well, and I strongly recommend reading it if you want to get a better idea of the engineering problems that have to be solved when orchestrating a moderate-sized training run.

One of those changes is the mixture-of-experts layer, in which each token is routed to only a small, discrete subset of experts. The problem with this is that it introduces a rather ill-behaved discontinuous function with a discrete image at the heart of the model, in sharp contrast to vanilla Transformers, which implement continuous input-output relations.
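To make that discontinuity concrete, here is a minimal NumPy sketch of generic top-k expert routing, not DeepSeek’s actual gating code; the function name `topk_routing`, the choice of k = 2, and the example logits are illustrative assumptions.

```python
import numpy as np

def topk_routing(logits: np.ndarray, k: int = 2) -> np.ndarray:
    """Route a token to its k highest-scoring experts and renormalize the gates."""
    topk = np.argsort(logits)[-k:]          # indices of the k largest routing logits
    mask = np.zeros_like(logits, dtype=bool)
    mask[topk] = True
    # Softmax over the selected experts only; unselected experts get weight 0.
    exp = np.where(mask, np.exp(logits - logits.max()), 0.0)
    return exp / exp.sum()

# Two nearly identical logit vectors can activate different expert sets:
# the set of chosen experts is a discrete quantity, so the layer's
# input-output map jumps even though the logits moved only slightly.
print(topk_routing(np.array([1.00, 0.60, 0.59, 0.10])))  # experts {0, 1} active
print(topk_routing(np.array([1.00, 0.59, 0.60, 0.10])))  # experts {0, 2} active
```

An arbitrarily small change to the router’s inputs can swap which experts fire, which is what makes the function discontinuous with a discrete image.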
