FastVLM: Efficient Vision Encoding for Vision Language Models
Vision Language Models (VLMs) enable visual understanding alongside textual inputs. They are typically built by passing visual tokens from a pretrained vision encoder to a pretrained large language model (LLM) through a projection layer.
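To make this construction concrete, here is a minimal sketch of that standard forward pass, assuming PyTorch. Every name here (ToyVLM, vision_dim, llm_dim) is an illustrative placeholder, not FastVLM's actual implementation.

```python
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    """Sketch of the typical VLM construction: visual tokens from a
    pretrained vision encoder are projected into the LLM's embedding
    space and prepended to the text tokens."""

    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g., a pretrained ViT backbone
        self.projector = nn.Linear(vision_dim, llm_dim)  # maps visual tokens to LLM space
        self.llm = llm                            # a pretrained decoder-only LLM

    def forward(self, image, text_embeddings):
        # (batch, num_visual_tokens, vision_dim); the number of visual
        # tokens drives LLM prefill cost, which is the latency lever a
        # more efficient vision encoder can pull
        visual_tokens = self.vision_encoder(image)
        projected = self.projector(visual_tokens)
        # Concatenate visual and text tokens along the sequence dimension
        inputs = torch.cat([projected, text_embeddings], dim=1)
        return self.llm(inputs_embeds=inputs)
```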
By leveraging the rich visual representations of the vision encoder and the world knowledge and reasoning capabilities of the LLM, VLMs can be useful for a wide range of applications, including accessibility assistants, UI navigation, robotics, and gaming. A key challenge in this design is the trade-off between accuracy and latency in vision encoding. In a paper accepted to CVPR 2025, Apple ML researchers shared a new technique to address this challenge: FastVLM, a VLM that significantly improves the accuracy-latency trade-off with a simple design.

[Figure: accuracy-latency comparison. Y-axis is the average performance of the model on the ChartQA, TextVQA, DocVQA, OCRBench, AI2D, MMMU, and ScienceQA benchmarks.]

To further demonstrate the on-device efficiency of FastVLM, we released an iOS/macOS demo app based on MLX.
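As a rough sketch of what on-device inference could look like outside the demo app, the snippet below uses the open-source mlx-vlm Python package on Apple silicon. The model identifier and the exact generate() arguments are assumptions, so check the package documentation and the FastVLM repository for the supported entry points.

```python
# Hedged sketch: run a FastVLM checkpoint via mlx-vlm (pip install mlx-vlm).
from mlx_vlm import load, generate

# Assumed Hugging Face model ID; substitute the checkpoint you actually have.
model, processor = load("apple/FastVLM-0.5B")

# Argument names below are assumptions about the mlx-vlm API.
output = generate(
    model,
    processor,
    prompt="Describe this image.",
    image="example.jpg",
    max_tokens=128,
)
print(output)
```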