Vision language models are blind


Research showing that vision language models (VLMs) fail on simple visual tasks that are easy for humans.

Consistent with prior reports [2][3][4], we find that VLMs can identify a primitive shape (e.g., a red circle ⭕) with 100% accuracy [2] and can perfectly read an English word (e.g., "Subdermatoglyphic") presented on its own. The ability to follow paths matters for VLMs because it underlies reading maps or charts, interpreting graphs, and understanding user annotations (e.g., arrows) in input images. To assess this path-following capability, the task asks models to count the number of unique-color paths between two given stations in a simplified subway map.
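The article does not reproduce the authors' stimulus-generation code, but the task is easy to picture programmatically. Below is a minimal, hypothetical sketch (assuming matplotlib; the function name, parameters, and layout are illustrative, not the authors' implementation) that draws a simplified "subway map" with a known number of distinctly colored paths between two labeled stations, so the ground-truth answer to "how many paths connect A and B?" is known by construction.

```python
# Hypothetical sketch, not the authors' code: generate a simplified subway-map
# stimulus with a known number of colored paths between stations A and B.
import random
import matplotlib.pyplot as plt


def draw_subway_map(n_paths: int, seed: int = 0, out_path: str = "subway_map.png") -> str:
    """Draw n_paths distinctly colored polyline paths between stations A and B."""
    rng = random.Random(seed)
    colors = ["red", "blue", "green", "orange", "purple", "brown"][:n_paths]

    fig, ax = plt.subplots(figsize=(4, 4))
    start, end = (0.1, 0.5), (0.9, 0.5)  # positions of station A and station B

    for color in colors:
        # Each path is a polyline from A to B with random intermediate bends.
        xs = [start[0], rng.uniform(0.3, 0.5), rng.uniform(0.5, 0.7), end[0]]
        ys = [start[1], rng.uniform(0.1, 0.9), rng.uniform(0.1, 0.9), end[1]]
        ax.plot(xs, ys, color=color, linewidth=3)

    # Mark and label the two stations the question refers to.
    for (x, y), label in [(start, "A"), (end, "B")]:
        ax.plot(x, y, "o", color="black", markersize=10)
        ax.annotate(label, (x, y), textcoords="offset points", xytext=(0, 10))

    ax.axis("off")
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)
    return out_path  # the ground-truth answer for the VLM is simply n_paths


if __name__ == "__main__":
    draw_subway_map(n_paths=3)
```

A model that truly follows paths should answer "3" for the image above; the research finds that current VLMs frequently miscount even in maps this simple.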
