Why Extracting Data from PDFs Remains a Nightmare for Data Experts

Businesses, governments, and researchers continue to struggle with extracting usable data from PDF files, despite AI advances.

These digital documents contain valuable information for everything from scientific research to government records, but their rigid formats make extraction difficult. "PDFs are a creature of a time when print layout was a big influence on publishing software," Derek Willis, a lecturer in Data and Computational Journalism at the University of Maryland, told Ars Technica. As for the tools available for extraction, "right now, the clear leader is Google's Gemini 2.0 Flash Pro Experimental," Willis said, while Mistral's recent OCR solution "performed poorly" in tests.
