Dragonfly: A large vision-language model with multi-resolution zoom
We are excited to announce the launch of Dragonfly, an instruction-tuned vision-language architecture that enhances fine-grained visual understanding and reasoning about image regions. We are releasing the Dragonfly architecture, which uses multi-resolution zoom-and-select to enhance multi-modal reasoning while remaining context-efficient.
Dragonfly employs two key strategies: multi-resolution visual encoding and zoom-in patch selection, which enable the model to focus on fine-grained details in image regions and improve commonsense reasoning. A minimal sketch of the idea follows below.

[Figure: example model response to a Mona Lisa-style demo image: "The painting is executed in the same style as the original Mona Lisa, with similar brushwork and color palette, which adds to the humor by juxtaposing the serious art-historical context with the light-hearted subject matter."]

We evaluate Dragonfly, trained on Llama3-8B, on five popular vision-language benchmarks that require strong commonsense reasoning and detailed image understanding: AI2D, ScienceQA, MMMU, MMVet, and POPE.
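To make the zoom-and-select idea concrete, here is a minimal sketch of multi-resolution encoding with patch selection. It illustrates the general technique, not Dragonfly's actual implementation: the resolutions, the patch size, and the saliency score (a simple patch norm standing in for a learned scorer) are all assumptions for the example.

```python
import torch
import torch.nn.functional as F

def encode_multi_resolution(image: torch.Tensor,
                            resolutions=(224, 448, 896),
                            patch=32):
    """Resize the image to several resolutions and split each into patch tokens.
    Resolutions and patch size are illustrative, not Dragonfly's real settings."""
    levels = []
    for r in resolutions:
        resized = F.interpolate(image.unsqueeze(0), size=(r, r),
                                mode="bilinear", align_corners=False)
        # Unfold into non-overlapping patches: (num_patches, C * patch * patch)
        patches = F.unfold(resized, kernel_size=patch, stride=patch).squeeze(0).T
        levels.append(patches)
    return levels

def select_zoom_patches(levels, keep_ratio=0.25):
    """Keep every coarse patch for global context, but only the top-scoring
    patches at higher resolutions, so the token budget stays small."""
    selected = [levels[0]]  # coarsest level: keep all patches
    for patches in levels[1:]:
        # Hypothetical saliency score: patch norm as a stand-in for a learned scorer.
        scores = patches.norm(dim=1)
        k = max(1, int(keep_ratio * patches.shape[0]))
        top = scores.topk(k).indices
        selected.append(patches[top])
    return torch.cat(selected, dim=0)

image = torch.rand(3, 896, 896)  # dummy RGB image
tokens = select_zoom_patches(encode_multi_resolution(image))
print(tokens.shape)  # far fewer tokens than encoding the full high-res grid
```

With these toy settings, the full 896x896 grid alone would produce 784 patch tokens, while the zoom-and-select output keeps only 294 across all three levels, which is the context-efficiency benefit the architecture is after.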