Chinese AI company says breakthroughs enabled creating a leading-edge AI model with 11X less compute — DeepSeek's optimizations highlight limits of US sanctions


Thanks to extensive optimizations and low-level programming.

DeepSeek trained its DeepSeek-V3 Mixture-of-Experts (MoE) language model, which has 671 billion parameters, on a cluster of 2,048 Nvidia H800 GPUs in roughly two months, for a total of about 2.8 million GPU hours, according to its paper. For comparison, Meta needed 11 times more compute (30.8 million GPU hours) to train its 405-billion-parameter Llama 3 on a cluster of 16,384 H100 GPUs over the course of 54 days. DeepSeek credits its DualPipe algorithm with minimizing training bottlenecks, particularly for the cross-node expert parallelism that the MoE architecture requires; this optimization let the cluster process 14.8 trillion tokens during pre-training with near-zero communication overhead, the company says.
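
As a rough sanity check of those figures, here is some illustrative Python arithmetic using only the numbers quoted above (cluster sizes, the roughly two-month duration, and the two reported GPU-hour totals); it is not code from either company.

    HOURS_PER_DAY = 24

    # DeepSeek-V3: 2,048 H800 GPUs running for roughly two months (~58 days)
    deepseek_cluster_estimate = 2048 * 58 * HOURS_PER_DAY   # ~2.85M GPU hours

    # Reported totals used in the comparison above
    deepseek_reported = 2.8e6   # H800 GPU hours (DeepSeek-V3 paper)
    llama3_reported = 30.8e6    # H100 GPU hours (Llama 3, 405B parameters)

    print(f"DeepSeek cluster-time estimate: {deepseek_cluster_estimate / 1e6:.2f}M GPU hours")
    print(f"Compute ratio: {llama3_reported / deepseek_reported:.1f}x")   # ~11x

The reason MoE training is communication-heavy in the first place is that each token is routed to only a few "expert" sub-networks, and when those experts sit on different GPUs or nodes, the dispatch becomes an all-to-all transfer. The toy routing sketch below (NumPy, with hypothetical sizes; it is not DeepSeek's implementation and not DualPipe itself) shows where that cross-node traffic comes from.

    import numpy as np

    rng = np.random.default_rng(0)
    num_tokens, d_model, num_experts, top_k = 8, 16, 4, 2

    tokens = rng.standard_normal((num_tokens, d_model))
    gate_weights = rng.standard_normal((d_model, num_experts))

    # Router: score each token against every expert, keep the top-k experts per token.
    scores = tokens @ gate_weights
    chosen = np.argsort(scores, axis=1)[:, -top_k:]

    # In a real cluster each expert lives on a different GPU/node, so this dispatch
    # is an all-to-all communication step -- the overhead DeepSeek says DualPipe
    # drives to near zero.
    for expert_id in range(num_experts):
        routed = np.where((chosen == expert_id).any(axis=1))[0]
        print(f"expert {expert_id} receives tokens {routed.tolist()}")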

Related news:

Chinese Carmakers’ Profit Margins Squeezed Further in 2024

Chinese Firm Trains Massive AI Model for Just $5.5 Million

DeepSeek’s new AI model appears to be one of the best ‘open’ challengers yet