ScreenAI: A visual LLM for UI and visually-situated language understanding
March 19, 2024

Posted by Srinivas Sunkara and Gilles Baechler, Software Engineers, Google Research

We introduce ScreenAI, a vision-language model for user interfaces and infographics that achieves state-of-the-art results on UI- and infographics-based tasks. We are also releasing three new datasets: Screen Annotation, to evaluate the layout understanding capability of the model, as well as ScreenQA Short and Complex ScreenQA, for a more comprehensive evaluation of its QA capability.
These screen annotations provide large language models (LLMs) with textual descriptions of screens, enabling them to automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale. At only 5B parameters, ScreenAI achieves state-of-the-art results on UI- and infographic-based tasks (WebSRC and MoTIF), and best-in-class performance on ChartQA, DocVQA, and InfographicVQA compared to models of similar size.

This project is the result of joint work with Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Carbune, Jason Lin, Jindong Chen and Abhanshu Sharma.
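To make the data-generation idea concrete, here is a minimal sketch of how a textual screen annotation might be turned into an LLM prompt that yields synthetic QA examples. The element schema, prompt wording, and helper names are assumptions made for illustration, not the actual ScreenAI pipeline.

```python
# Illustrative sketch (not the authors' pipeline): feeding a screen
# annotation to an LLM to synthesize QA training examples at scale.
# The annotation format, prompt wording, and field names are assumptions.

from dataclasses import dataclass

@dataclass
class UIElement:
    kind: str    # e.g. "BUTTON", "TEXT", "IMAGE"
    text: str    # visible text, if any
    bbox: tuple  # (x0, y0, x1, y1) in normalized screen coordinates

def annotation_to_description(elements: list[UIElement]) -> str:
    """Render detected UI elements as a textual screen description."""
    return "\n".join(f"{e.kind} '{e.text}' at {e.bbox}" for e in elements)

def build_qa_prompt(screen_description: str, num_questions: int = 3) -> str:
    """Ask an LLM to produce QA pairs grounded in the screen description."""
    return (
        "Here is a description of a mobile screen:\n"
        f"{screen_description}\n\n"
        f"Write {num_questions} question-answer pairs that can be answered "
        "from this screen alone, one per line as 'Q: ... A: ...'."
    )

if __name__ == "__main__":
    elements = [
        UIElement("TEXT", "Flight to Zurich", (0.1, 0.05, 0.9, 0.1)),
        UIElement("BUTTON", "Book now", (0.3, 0.8, 0.7, 0.9)),
    ]
    prompt = build_qa_prompt(annotation_to_description(elements))
    print(prompt)  # send `prompt` to any LLM API to collect synthetic QA data
```

The same pattern extends to UI navigation and summarization data: the screen description stays fixed while the instruction in the prompt changes.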