Nvidia’s new Llama-3.1 Nemotron Ultra outperforms DeepSeek R1 at half the size
Compared to DeepSeek R1, Llama-3.1-Nemotron-Ultra-253B delivers competitive results despite having fewer than half as many parameters.
Its architecture, customized through a Neural Architecture Search (NAS) process, introduces structural variations such as skipped attention layers, fused feedforward networks (FFNs), and variable FFN compression ratios. Post-training included supervised fine-tuning across domains such as math, code generation, chat, and tool use, followed by reinforcement learning with Group Relative Policy Optimization (GRPO) to further boost instruction-following and reasoning performance. These results suggest that, despite being a dense model, Nvidia's offering matches or exceeds mixture-of-experts (MoE) alternatives on reasoning and general instruction-alignment tasks, while trailing slightly in math-heavy categories.
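To make the NAS-derived structural variations concrete, here is a minimal, hypothetical PyTorch sketch, not Nvidia's actual Nemotron code: the `skip_attention` flag and `ffn_ratio` parameter are assumed stand-ins for the search's choices of dropping attention in some layers and compressing the FFN width in others.

```python
import torch
import torch.nn as nn

class VariableBlock(nn.Module):
    """Illustrative transformer block with NAS-style structural knobs:
    attention can be skipped entirely and the FFN width is adjustable.
    (Hypothetical sketch, not the Nemotron implementation.)"""

    def __init__(self, d_model: int, n_heads: int,
                 skip_attention: bool = False, ffn_ratio: float = 4.0):
        super().__init__()
        self.skip_attention = skip_attention
        if not skip_attention:
            self.attn_norm = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Variable FFN compression: hidden width = d_model * ffn_ratio
        hidden = int(d_model * ffn_ratio)
        self.ffn_norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, hidden),
            nn.GELU(),
            nn.Linear(hidden, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.skip_attention:
            h = self.attn_norm(x)
            a, _ = self.attn(h, h, h)
            x = x + a
        return x + self.ffn(self.ffn_norm(x))

# Example stack where the search has pruned attention in one layer and
# shrunk the FFN in another (ratios chosen arbitrarily for illustration).
layers = nn.Sequential(
    VariableBlock(512, 8, skip_attention=False, ffn_ratio=4.0),
    VariableBlock(512, 8, skip_attention=True,  ffn_ratio=2.0),  # attention skipped
    VariableBlock(512, 8, skip_attention=False, ffn_ratio=1.5),  # compressed FFN
)
x = torch.randn(1, 16, 512)
print(layers(x).shape)  # torch.Size([1, 16, 512])
```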
Or read this on VentureBeat