Omnivision-968M: Vision Language Model with 9x Tokens Reduction for Edge Devices

Pocket-size multimodal model with 9x token reduction for on-device deployment

The vision encoder first transforms input images into embeddings. The projection layer then maps these embeddings into the token space of Qwen2.5-0.5B-Instruct, enabling end-to-end visual-language understanding.

Pretraining: The initial stage establishes basic visual-linguistic alignment using image-caption pairs. Only the projection layer parameters are unfrozen during this stage, so the projection layer alone learns these fundamental relationships.

A teacher model then produces minimally edited corrections that maintain high semantic similarity with the original responses, focusing specifically on accuracy-critical elements.
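To make the encoder-to-projection step concrete, here is a minimal sketch in plain Python. The summary does not say how the 9x token reduction is achieved, so the sketch assumes one simple approach: concatenating every 9 consecutive image embeddings into one wider embedding before a linear projection. The token count (729), the toy dimensions, and the helper names `group_tokens` and `project` are illustrative assumptions, not details taken from the model.

```python
import random


def group_tokens(tokens, factor=9):
    """Concatenate every `factor` consecutive embeddings into one wider embedding."""
    if len(tokens) % factor != 0:
        raise ValueError("token count must be divisible by the reduction factor")
    grouped = []
    for start in range(0, len(tokens), factor):
        merged = []
        for tok in tokens[start:start + factor]:
            merged.extend(tok)
        grouped.append(merged)
    return grouped


def project(tokens, weight):
    """Linear projection: each token (length d_in) times weight (d_in x d_out)."""
    d_out = len(weight[0])
    return [
        [sum(x * weight[i][j] for i, x in enumerate(tok)) for j in range(d_out)]
        for tok in tokens
    ]


random.seed(0)

# Assumed toy sizes; a real encoder and language model use far larger widths.
NUM_TOKENS, ENC_DIM, LLM_DIM = 729, 4, 8

# Stand-in for the vision encoder output: one embedding per image patch.
image_tokens = [
    [random.gauss(0, 1) for _ in range(ENC_DIM)] for _ in range(NUM_TOKENS)
]

# Stand-in for the projection layer weights (the only trainable part in pretraining).
weight = [
    [random.gauss(0, 0.02) for _ in range(LLM_DIM)] for _ in range(ENC_DIM * 9)
]

reduced = group_tokens(image_tokens)    # 729 tokens -> 81 wider tokens
llm_inputs = project(reduced, weight)   # map into the (toy) language-model width

print(len(image_tokens), "->", len(llm_inputs), "tokens of width", len(llm_inputs[0]))
# prints: 729 -> 81 tokens of width 8
```

Under this assumption the language model sees 81 image tokens instead of 729, which is where the savings in on-device compute would come from; the projection weights carry the extra information packed into each wider token.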
