Benchmarking vision-language models on OCR in dynamic video environments


This paper introduces an open-source benchmark for evaluating Vision-Language Models (VLMs) on Optical Character Recognition (OCR) tasks in dynamic video environments. We present a curated dataset containing 1,477 manually annotated frames spanning diverse domains, including code editors, news broadcasts, YouTube videos, and advertisements. Three state-of-the-art VLMs (Claude-3, Gemini-1.5, and GPT-4o) are benchmarked against traditional OCR systems such as EasyOCR and RapidOCR. Evaluation metrics include Word Error Rate (WER), Character Error Rate (CER), and Accuracy. Our results highlight the strengths and limitations of VLMs in video-based OCR tasks and demonstrate their potential to outperform conventional OCR models in many scenarios. However, challenges remain, such as hallucinations, content security policies, and sensitivity to occluded or stylized text. The dataset and benchmarking framework are publicly available to foster further research.
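For readers unfamiliar with the two error-rate metrics named above, the following is a minimal sketch of how WER and CER are commonly computed: the Levenshtein edit distance between a reference transcription and a model's output, divided by the reference length (in words for WER, in characters for CER). The function names `edit_distance`, `wer`, and `cer` are illustrative, and whitespace tokenization without case folding or punctuation stripping is an assumption; the abstract does not state the benchmark's normalization or how its Accuracy metric is defined, so Accuracy is not sketched here.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions, and
    substitutions (each cost 1) needed to turn `ref` into `hyp`."""
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from ref
                curr[j - 1] + 1,     # insert h into ref
                prev[j - 1] + cost,  # substitute r with h (or match)
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count.
    Assumption: words are split on whitespace with no normalization."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Both rates can exceed 1.0 when the hypothesis is much longer than the reference, which is one way heavy hallucination by a model would show up in these scores.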

Paper: Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments, by Sankalp Nagaonkar and 3 other authors.
