FastVLM: Efficient Vision Encoding for Vision Language Models
Vision Language Models (VLMs) enable visual understanding alongside textual inputs. They are typically built by passing visual tokens from a pretrained vision encoder to a pretrained large language model (LLM) through a projection layer.
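To make this construction concrete, here is a minimal sketch of that standard forward pass, assuming PyTorch. Every name here (ToyVLM, vision_dim, llm_dim) is an illustrative placeholder, not FastVLM's actual implementation.

```python
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    """Sketch of the typical VLM construction: visual tokens from a
    pretrained vision encoder are projected into the LLM's embedding
    space and prepended to the text tokens."""

    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g., a pretrained ViT backbone
        self.projector = nn.Linear(vision_dim, llm_dim)  # maps visual tokens to LLM space
        self.llm = llm                            # a pretrained decoder-only LLM

    def forward(self, image, text_embeddings):
        # (batch, num_visual_tokens, vision_dim); the number of visual
        # tokens drives LLM prefill cost, which is the latency lever a
        # more efficient vision encoder can pull
        visual_tokens = self.vision_encoder(image)
        projected = self.projector(visual_tokens)
        # Concatenate visual and text tokens along the sequence dimension
        inputs = torch.cat([projected, text_embeddings], dim=1)
        return self.llm(inputs_embeds=inputs)
```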
By leveraging the rich visual representations of the vision encoder and the world knowledge and reasoning capabilities of the LLM, VLMs can be useful for a wide range of applications, including accessibility assistants, UI navigation, robotics, and gaming. A key challenge in this design is the trade-off between accuracy and latency in vision encoding. In a paper accepted to CVPR 2025, Apple ML researchers shared a new technique to address this challenge: FastVLM, a VLM that significantly improves the accuracy-latency trade-off with a simple design.

[Figure: accuracy-latency comparison. Y-axis is the average performance of the model on the ChartQA, TextVQA, DocVQA, OCRBench, AI2D, MMMU, and ScienceQA benchmarks.]

To further demonstrate the on-device efficiency of FastVLM, we released an iOS/macOS demo app based on MLX.
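As a rough sketch of what on-device inference could look like outside the demo app, the snippet below uses the open-source mlx-vlm Python package on Apple silicon. The model identifier and the exact generate() arguments are assumptions, so check the package documentation and the FastVLM repository for the supported entry points.

```python
# Hedged sketch: run a FastVLM checkpoint via mlx-vlm (pip install mlx-vlm).
from mlx_vlm import load, generate

# Assumed Hugging Face model ID; substitute the checkpoint you actually have.
model, processor = load("apple/FastVLM-0.5B")

# Argument names below are assumptions about the mlx-vlm API.
output = generate(
    model,
    processor,
    prompt="Describe this image.",
    image="example.jpg",
    max_tokens=128,
)
print(output)
```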