Omnivision-968M: Vision Language Model with 9x Tokens Reduction for Edge Devices


Pocket-size multimodal model with 9x token reduction for on-device deployment

The vision encoder first transforms input images into embeddings. The projection layer then maps these embeddings into the token space of Qwen2.5-0.5B-Instruct, enabling end-to-end visual-language understanding.

Pretraining: The initial stage establishes basic visual-linguistic alignment using image-caption pairs. During this stage only the projection layer parameters are unfrozen, so the projection layer alone learns these fundamental relationships.

A teacher model then produces minimally edited corrections that maintain high semantic similarity with the original responses, focusing specifically on accuracy-critical elements.
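The encoder-to-projection path and the 9x token reduction can be illustrated with a small sketch. The summary does not say how the reduction is performed, so the approach here is an assumption: every 9 consecutive patch embeddings are concatenated into one wider token, which a linear projection then maps to the language model's width. The helper names (`group_tokens`, `project`), the toy dimensions, and the patch count of 729 are all illustrative, not taken from the model's code.

```python
import random

def group_tokens(patches, factor=9):
    """Concatenate each run of `factor` patch embeddings into one wider token.

    Assumed reduction scheme: N tokens of width d become N/factor tokens
    of width d*factor, so no information is dropped before projection.
    """
    if len(patches) % factor != 0:
        raise ValueError("number of patches must be divisible by factor")
    grouped = []
    for start in range(0, len(patches), factor):
        merged = []
        for vec in patches[start:start + factor]:
            merged.extend(vec)
        grouped.append(merged)
    return grouped

def project(tokens, weight):
    """Linear projection: each token (length d_in) times weight (d_in x d_out)."""
    d_out = len(weight[0])
    return [
        [sum(tok[i] * weight[i][j] for i in range(len(tok))) for j in range(d_out)]
        for tok in tokens
    ]

random.seed(0)
num_patches, vision_dim, lm_dim = 729, 4, 8  # toy sizes, hypothetical

# Stand-in for the vision encoder output: one embedding per image patch.
patches = [[random.random() for _ in range(vision_dim)] for _ in range(num_patches)]

grouped = group_tokens(patches, factor=9)
weight = [[random.gauss(0, 0.02) for _ in range(lm_dim)] for _ in range(vision_dim * 9)]
image_tokens = project(grouped, weight)

# Pretraining stage as described: only the projection layer is trainable.
trainable = {"vision_encoder": False, "projection": True, "language_model": False}

print(len(patches), "->", len(image_tokens), "tokens of width", len(image_tokens[0]))
# prints: 729 -> 81 tokens of width 8
```

In this sketch the language model would receive 81 image tokens instead of 729, which is where the shorter context and lower on-device cost would come from under the stated assumption.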
