OmniParser for Pure Vision Based GUI Agent
To further demonstrate that OmniParser is a plug-and-play choice for off-the-shelf vision language models, we show its performance when combined with two recently announced vision language models: Phi-3.5-V and Llama-3.2-V. As seen in the table, our finetuned interactable region detection (ID) model significantly improves task performance compared to the raw Grounding DINO model (w.o. ID). In addition, the local semantics (LS) of icon functionality help significantly for every vision language model. Here, w.o. ID means we use the Grounding DINO model instead of our finetuned detector, and w.o. LS means we further omit the icon descriptions from the text prompt.
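To illustrate the ablation, the sketch below shows one plausible way the parsed output could be assembled into a text prompt for a downstream vision language model, with the local-semantics descriptions toggled on or off. The `Region` structure, `build_prompt` helper, and prompt wording are hypothetical, not OmniParser's actual interface.

```python
from dataclasses import dataclass

@dataclass
class Region:
    """One detected interactable region (hypothetical structure)."""
    box: tuple          # (x1, y1, x2, y2) pixel coordinates
    description: str    # local-semantics caption of the icon's functionality

def build_prompt(task: str, regions: list, use_local_semantics: bool = True) -> str:
    """Assemble the text prompt handed to the vision language model.

    With use_local_semantics=False this mimics the w.o. LS ablation:
    the model sees only bounding boxes, not icon descriptions.
    """
    lines = [f"Task: {task}", "Interactable regions:"]
    for i, r in enumerate(regions):
        if use_local_semantics:
            lines.append(f"  [{i}] box={r.box} description={r.description}")
        else:
            lines.append(f"  [{i}] box={r.box}")
    lines.append("Answer with the index of the region to click.")
    return "\n".join(lines)
```

Dropping the `description=` field from each region line is what the w.o. LS condition removes, which is why the ablation isolates the contribution of icon semantics.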