Study shows that Vision Language Models can't perform rudimentary visual analysis

Large language models (LLMs) memorize a vast amount of prior knowledge from the Internet that helps them on downstream tasks, but that same knowledge can notoriously sway their outputs toward wrong or biased answers. In this work, we test how knowledge about popular subjects hurts the accuracy of vision language models (VLMs) on standard, objective visual tasks of counting and identification. We find that state-of-the-art VLMs are strongly biased (e.g., unable to recognize that a 4th stripe has been added to a 3-stripe Adidas logo), scoring an average of 17.05% accuracy on counting (e.g., counting the stripes in an Adidas-like logo) across 7 diverse domains spanning animals, logos, chess, board games, optical illusions, and patterned grids. Removing image backgrounds nearly doubles accuracy (a 21.09-percentage-point gain), revealing that contextual visual cues trigger these biased responses. Further analysis of VLMs' reasoning patterns shows that counting accuracy initially rises with the number of thinking tokens, reaching ~40%, before declining with excessive reasoning. Our work presents an interesting failure mode in VLMs and a human-supervised, automated framework for testing VLM biases. Code and data are available at: vlmsarebiased.github.io.
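
The counterfactual setup the abstract describes is simple to try informally: take a familiar image, alter one countable feature (e.g., add a 4th stripe to a 3-stripe Adidas-like logo), and ask an objective counting question with and without the background. Below is a minimal sketch, assuming the OpenAI Python client and gpt-4o as the VLM under test; the file names and prompt are illustrative placeholders, not the paper's actual framework (see vlmsarebiased.github.io for that).

```python
import base64

from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def count_features(image_path: str, question: str) -> str:
    """Ask a VLM an objective counting question about a local image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption: any chat model with image input works here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content


# Hypothetical counterfactual pair: a 4-stripe Adidas-like logo, with and
# without its background. A biased model tends to answer "3" (the real
# logo's stripe count); per the study, stripping the background weakens
# that contextual pull.
question = "How many stripes are in this logo? Answer with a number only."
print(count_features("adidas_4stripes.png", question))       # biased: often "3"
print(count_features("adidas_4stripes_nobg.png", question))  # more often "4"
```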

Related news:

Most people rarely use AI, and dark personality traits predict who uses it more | Study finds AI browsing makes up less than 1% of online activity

Over 5,000 gambling ads seen during Premier League match despite ban, study finds

Study of 1M-year-old skull points to earlier origins of modern humans