Hacking Diffusion into Qwen3 for the Arc Challenge
A deep dive into ongoing efforts adapting language models for diffusion-based ARC solving.
This might seem like an odd choice, but here's my thinking: the ARChitects showed you need to combine the base model with test-time training, candidate generation, and clever re-ranking to get fully correct solutions. Step 4 is the only part that can be reused as-is. To get this diffusion model to a competitive state, I would need to modify the test-time training to instead be an unmasking task, and consider alternatives to their depth-first sampling to generate a batch of high-quality candidates for the ranker.

Training details:

- 3e-5 learning rate, 500 warmup steps, then cosine decay to 3e-6
- Max sequence length of 6144 tokens (discarding a couple of longer tasks for simplicity)
- Standard diffusion loss with autoregressive position shifting
- Mixed precision (BF16) training
- An effective batch size of 64 (8 minibatches of 8)
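The warmup-plus-cosine schedule above can be sketched as a small standalone function. This is a minimal sketch, not the author's actual training code: the function name `lr_at` and the `total_steps` parameter are my own, and only the constants (3e-5 peak, 500 warmup steps, 3e-6 floor) come from the post.

```python
import math

# Constants taken from the post; everything else is illustrative.
PEAK_LR = 3e-5
MIN_LR = 3e-6
WARMUP_STEPS = 500


def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at a given optimizer step.

    Linear warmup to PEAK_LR over WARMUP_STEPS, then cosine decay
    down to MIN_LR over the remaining steps.
    """
    if step < WARMUP_STEPS:
        # Linear ramp: reaches PEAK_LR on the last warmup step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine
```

In a training loop this would be evaluated once per optimizer step (i.e. once per effective batch of 64, after accumulating the 8 minibatches), not once per minibatch.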