Why extracting data from PDFs is still a nightmare for data experts


Countless digital documents hold valuable info, and the AI industry is attempting to set it free.

Unlike traditional OCR methods, which follow a rigid sequence of identifying characters based on pixel patterns, multimodal LLMs that can read documents are trained on text and images that have been translated into chunks of data called tokens and fed into large neural networks.

Writing on X, AI app developer Alexander Doria recently pointed out a flaw in Mistral OCR's ability to understand handwriting: "Unfortunately Mistral-OCR has still the usual VLM curse: with challenging manuscripts, it hallucinates completely."

Whether they supply AI companies with training data or help historians analyze a historical census, these technologies may, as they improve, unlock repositories of knowledge currently trapped in digital formats designed primarily for human consumption.

Read the full article on Ars Technica.
