Vision language models are blind
Research showing that vision language models (VLMs) fail on simple visual tasks that are easy for humans.
Consistent with prior reports[2][3][4], we find that VLMs can identify a primitive shape (e.g., a red circle ⭕) with 100% accuracy[2] and can perfectly read an English word (e.g., Subdermatoglyphic) when it is presented alone. Following paths is important for VLMs in order to read maps or charts, interpret graphs, and understand user notations (e.g., arrows) in input images. To assess this path-following capability, the task asks models to count the number of unique-colored paths between two given stations in a simplified subway map.
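To make the setup concrete, the sketch below shows one way such a test image could be generated programmatically; it is not the authors' actual generator, and every function name, layout choice, and station label here is an illustrative assumption. The idea is simply that the number of distinctly colored polylines drawn between the two stations is the ground-truth answer the VLM is asked to produce.

```python
# Minimal sketch (assumed, not the paper's code): render k distinctly colored
# polyline "paths" between two labeled stations; ground truth is k.
import random
import matplotlib.pyplot as plt

def make_subway_map(num_paths: int, seed: int = 0, out_file: str = "map.png") -> int:
    """Draw num_paths colored polylines from station A to station B,
    save the image, and return the ground-truth path count."""
    random.seed(seed)
    colors = ["red", "green", "blue", "orange", "purple", "brown"][:num_paths]

    fig, ax = plt.subplots(figsize=(4, 4))
    ax.set_xlim(0, 10)
    ax.set_ylim(0, 10)
    ax.axis("off")

    start, end = (1.0, 5.0), (9.0, 5.0)
    for color in colors:
        # Each path is a jagged polyline with two random intermediate bends.
        xs = [start[0], 3 + 2 * random.random(), 6 + 2 * random.random(), end[0]]
        ys = [start[1], random.uniform(1, 9), random.uniform(1, 9), end[1]]
        ax.plot(xs, ys, color=color, linewidth=3)

    # Mark and label the two stations the model is asked about.
    for (x, y), name in [(start, "A"), (end, "B")]:
        ax.plot(x, y, "o", color="black", markersize=10)
        ax.annotate(name, (x, y), textcoords="offset points", xytext=(0, 12))

    fig.savefig(out_file, dpi=150, bbox_inches="tight")
    plt.close(fig)
    return num_paths  # ground truth: number of distinct-color paths drawn

# Example usage: an image with 3 paths, for which the expected answer is "3".
# truth = make_subway_map(num_paths=3)
```

Because the generator controls how many colored paths are drawn, scoring reduces to comparing the model's reported count against the known value, which is what makes this kind of synthetic task easy to grade at scale.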