Get the latest tech news

Xiaomi MiMo Reasoning Model

MiMo: Unlocking the Reasoning Potential of Language Model – From Pretraining to Posttraining - XiaomiMiMo/MiMo

Moreover, it was widely considered that achieving uniform and simultaneous improvements in both mathematical and code capabilities within a small model is challenging. By assigning fine-grained scores for test cases with varying difficulty levels, the policy can be more effectively optimized via dense reward signal. We implement a data re-sampling strategy for easy problems to enhance rollout sampling efficiency and stabilize policy updates, particularly in the later phases of RL training.

Get the Android app

Or read this on Hacker News