Show HN: Tarsier – Vision utilities for web interaction agents

Vision utilities for web interaction agents 👀, from the reworkd/tarsier repository on GitHub.

How do you represent a webpage to an LLM (e.g. HTML, accessibility tree, screenshot)? And how do you map the LLM's responses back to web elements? Tarsier tackles both questions. Furthermore, we've developed an OCR algorithm that converts a page screenshot into a whitespace-structured string (almost like ASCII art) that even an LLM without vision can understand. This is critical because current vision-language models still lack the fine-grained representations needed for web interaction tasks.
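The post gives no implementation details, but the core "whitespace-structured string" idea can be sketched roughly: take OCR'd text fragments with their pixel coordinates and place them onto a character grid, so that whitespace preserves the page's visual layout. Everything below is an assumption for illustration, not Tarsier's actual code: the function name, the fixed character-cell size, and the `(text, x, y)` annotation format (a real pipeline would get these from an OCR engine).

```python
# Minimal sketch: map OCR text fragments onto a character grid by
# pixel position, so layout survives as whitespace in the output.
# Assumed inputs: a list of (text, x, y) annotations in pixels.

CHAR_W, CHAR_H = 8, 16  # assumed pixel size of one character cell

def annotations_to_text(annotations, page_w, page_h):
    cols = page_w // CHAR_W
    rows = page_h // CHAR_H
    grid = [[" "] * cols for _ in range(rows)]
    for text, x, y in annotations:
        row = min(y // CHAR_H, rows - 1)
        col = min(x // CHAR_W, cols - 1)
        for i, ch in enumerate(text):
            if col + i < cols:
                grid[row][col + i] = ch
    # Drop trailing whitespace so the LLM prompt stays compact.
    return "\n".join("".join(r).rstrip() for r in grid).strip("\n")

# Example: a login form rendered as whitespace-structured text.
page = [("Username:", 40, 100), ("[ @ 1 ]", 200, 100),
        ("Password:", 40, 132), ("[ # 2 ]", 200, 132)]
print(annotations_to_text(page, 400, 200))
```

Fragments that share a row of the grid end up on the same output line, so a text-only LLM can see that the "[ @ 1 ]" input field sits next to the "Username:" label, which is exactly the spatial signal a raw HTML dump loses.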


Read more on: Text, Vision, LLM

Related news:

- Google introduces Imagen 3, its highest-quality text-to-image model, available in private preview
- Linum (YC W23) is hiring a founding AI engineer to train text-to-video models
- OpenAI now has an AI model with vision, and everyone else should be scared