Why extracting data from PDFs is still a nightmare for data experts

Countless digital documents hold valuable info, and the AI industry is attempting to set it free.

Unlike traditional OCR methods, which follow a rigid sequence of identifying characters based on pixel patterns, multimodal LLMs that can read documents are trained on text and images that have been translated into chunks of data called tokens and fed into large neural networks. AI app developer Alexander Doria recently pointed out on X a flaw in Mistral OCR's ability to understand handwriting, writing, "Unfortunately Mistral-OCR has still the usual VLM curse: with challenging manuscripts, it hallucinates completely." As these technologies improve, they may unlock repositories of knowledge currently trapped in digital formats designed primarily for human consumption, whether that means training data for AI companies or a historical census for historians to analyze.

Read this on Ars Technica