
Hacking Diffusion into Qwen3 for the Arc Challenge


A deep dive into ongoing efforts adapting language models for diffusion-based ARC solving.

Step 4 is the only part that can be reused as-is. To get this diffusion model to a competitive state, I would need to modify the test-time training to be an unmasking task instead, and consider alternatives to their depth-first sampling to generate a set of high-quality candidates for the ranker.

Training settings:

- 3e-5 learning rate, with 500 warmup steps followed by cosine decay to 3e-6
- Max sequence length of 6144 tokens (discarding a couple of longer tasks for simplicity)
- Standard diffusion loss with autoregressive position shifting
- Mixed precision (BF16) training
- An effective batch size of 64 (8 minibatches of 8)

This might seem like an odd choice, but here's my thinking: the ARChitects showed you need to combine the base model with test-time training, candidate generation, and clever re-ranking to get fully correct solutions.
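As a rough illustration of the learning-rate schedule quoted above (500 warmup steps up to 3e-5, then cosine decay to 3e-6), here is a minimal sketch in plain Python. The excerpt does not say how the warmup is shaped or how many optimizer steps the run lasts, so the linear warmup, the `total_steps` argument, and the function name `lr_at` are assumptions made for the example, not details from the article.

```python
import math

# Values quoted in the excerpt.
PEAK_LR = 3e-5
FINAL_LR = 3e-6
WARMUP_STEPS = 500
MINIBATCH_SIZE = 8      # "8 minibatches of 8"
ACCUM_STEPS = 8
EFFECTIVE_BATCH = MINIBATCH_SIZE * ACCUM_STEPS


def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at 0-indexed optimizer step `step`.

    `total_steps` is a hypothetical run length; the excerpt does not state it.
    """
    if total_steps <= WARMUP_STEPS:
        raise ValueError("total_steps must exceed WARMUP_STEPS")
    if step < WARMUP_STEPS:
        # Assumption: linear warmup that reaches the peak on the last warmup step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Cosine decay from PEAK_LR down to FINAL_LR over the remaining steps.
    progress = min(1.0, (step - WARMUP_STEPS) / (total_steps - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine
```

Calling `lr_at(step, total_steps)` once per optimizer step (that is, once per 8 accumulated minibatches) would give the value to set on the optimizer. In a PyTorch setup the same curve could presumably be wrapped in a `LambdaLR` scheduler, though the excerpt does not say which framework or scheduler was actually used.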

