Show HN: Tarsier – Vision utilities for web interaction agents
Vision utilities for web interaction agents 👀. Source: reworkd/tarsier on GitHub.
(e.g. HTML, Accessibility Tree, Screenshot) How do you map LLM responses back to web elements? Furthermore, we've developed an OCR algorithm that converts a page screenshot into a whitespace-structured string (almost like ASCII art) that even an LLM without vision can understand. This is critical because current vision-language models still lack the fine-grained representations needed for web interaction tasks.
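To make the idea of a whitespace-structured string concrete, here is a minimal sketch in Python. It is not Tarsier's actual algorithm: it assumes an OCR step has already produced words with pixel coordinates, and the names `Word` and `layout_as_text`, along with the `char_width` and `line_height` defaults, are hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR result (hypothetical shape): text plus its top-left pixel position."""
    text: str
    x: float
    y: float


def layout_as_text(words, char_width=10.0, line_height=20.0):
    """Place OCR words on a character grid that mirrors their page positions.

    char_width and line_height are assumed pixel sizes of one grid cell.
    """
    # Bucket words into text rows by vertical position.
    rows = {}
    for w in words:
        rows.setdefault(int(w.y // line_height), []).append(w)
    if not rows:
        return ""

    lines = []
    for row in range(max(rows) + 1):
        line = ""
        # Left-to-right within a row; empty rows become blank lines.
        for w in sorted(rows.get(row, []), key=lambda w: w.x):
            col = int(w.x // char_width)
            # If the target column is already taken, keep one space of separation.
            if line and col <= len(line):
                col = len(line) + 1
            line = line.ljust(col) + w.text
        lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    page = [Word("Home", 0, 0), Word("Login", 200, 0), Word("Search", 50, 40)]
    print(layout_as_text(page))
```

In this sketch, horizontal gaps become runs of spaces and vertical gaps become blank lines, so a text-only model sees roughly where each word sat on the page. A real implementation would also need to handle varying font sizes and overlapping boxes.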