Show HN: Tarsier – Vision utilities for web interaction agents

Vision utilities for web interaction agents 👀, from the reworkd/tarsier repository on GitHub.

How do you represent a webpage to an LLM (e.g. HTML, accessibility tree, screenshot)? And how do you map the LLM's responses back to web elements? Tarsier tackles both questions. Furthermore, we've developed an OCR algorithm that converts a page screenshot into a whitespace-structured string (almost like ASCII art) that even an LLM without vision can understand. This is critical because current vision-language models still lack the fine-grained representations needed for web interaction tasks.
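The post gives no implementation details, but the core "whitespace-structured string" idea can be sketched roughly: take OCR'd text fragments with their pixel coordinates and place them onto a character grid, so that whitespace preserves the page's visual layout. Everything below is an assumption for illustration, not Tarsier's actual code: the function name, the fixed character-cell size, and the `(text, x, y)` annotation format (a real pipeline would get these from an OCR engine).

```python
# Minimal sketch: map OCR text fragments onto a character grid by
# pixel position, so layout survives as whitespace in the output.
# Assumed inputs: a list of (text, x, y) annotations in pixels.

CHAR_W, CHAR_H = 8, 16  # assumed pixel size of one character cell

def annotations_to_text(annotations, page_w, page_h):
    cols = page_w // CHAR_W
    rows = page_h // CHAR_H
    grid = [[" "] * cols for _ in range(rows)]
    for text, x, y in annotations:
        row = min(y // CHAR_H, rows - 1)
        col = min(x // CHAR_W, cols - 1)
        for i, ch in enumerate(text):
            if col + i < cols:
                grid[row][col + i] = ch
    # Drop trailing whitespace so the LLM prompt stays compact.
    return "\n".join("".join(r).rstrip() for r in grid).strip("\n")

# Example: a login form rendered as whitespace-structured text.
page = [("Username:", 40, 100), ("[ @ 1 ]", 200, 100),
        ("Password:", 40, 132), ("[ # 2 ]", 200, 132)]
print(annotations_to_text(page, 400, 200))
```

Fragments that share a row of the grid end up on the same output line, so a text-only LLM can see that the "[ @ 1 ]" input field sits next to the "Username:" label, which is exactly the spatial signal a raw HTML dump loses.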


Read more on: Text, Vision, LLM

Related news:

- Google introduces Imagen 3, its highest-quality text-to-image model, available in private preview
- Linum (YC W23) is hiring a founding AI engineer to train text-to-video models
- OpenAI now has an AI model with vision, and everyone else should be scared