Dragonfly: A large vision-language model with multi-resolution zoom


We are excited to announce the launch of Dragonfly, a breakthrough instruction-tuned vision-language architecture that enhances fine-grained visual understanding and reasoning about image regions. Dragonfly uses multi-resolution zoom-and-select to strengthen multimodal reasoning while remaining context-efficient.

Dragonfly employs two key strategies: multi-resolution visual encoding and zoom-in patch selection, which enable the model to focus on fine-grained details in image regions and provide better commonsense reasoning. We evaluate Dragonfly, trained on LLaMA-8B, on five popular vision-language benchmarks that require strong commonsense reasoning and detailed image understanding: AI2D, ScienceQA, MMMU, MMVet, and POPE.
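The announcement does not detail the zoom-and-select mechanism, but the idea can be sketched roughly: encode the image at multiple zoom factors, score the high-resolution patches, and keep only the most informative ones so the visual token budget stays small. The sketch below is illustrative only, with simplified stand-ins (nearest-neighbor upsampling as the "zoom", patch variance as the saliency score); it is not Dragonfly's actual encoder.

```python
import numpy as np

def make_patches(image, patch=16):
    """Split an H x W x C image into non-overlapping patch x patch tiles."""
    H, W, C = image.shape
    tiles = []
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            tiles.append(image[y:y + patch, x:x + patch])
    return np.stack(tiles)

def zoom_and_select(image, scales=(1, 2), patch=16, k=8):
    """Illustrative multi-resolution zoom-and-select.

    Encodes the image at several zoom factors, scores each patch with a
    stand-in saliency measure (pixel variance, as a proxy for detail),
    and keeps only the top-k patches per scale, so the number of visual
    tokens stays bounded even at high zoom.
    """
    selected = []
    for s in scales:
        # Naive nearest-neighbor upsample stands in for the "zoom" step.
        zoomed = image.repeat(s, axis=0).repeat(s, axis=1)
        tiles = make_patches(zoomed, patch)
        # Score each patch and keep the k highest-variance ones.
        scores = tiles.reshape(len(tiles), -1).var(axis=1)
        top = np.argsort(scores)[::-1][:k]
        selected.append(tiles[top])
    # These selected patches would then be embedded as visual tokens.
    return np.concatenate(selected)

rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))
tokens = zoom_and_select(img, scales=(1, 2), patch=16, k=8)
print(tokens.shape)  # (16, 16, 16, 3): 8 patches kept per scale, 2 scales
```

Note the context efficiency: at scale 2 the zoomed image yields 64 candidate patches, but only 8 survive selection, so doubling the zoom does not multiply the token count.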
