
Why Extracting Data from PDFs Remains a Nightmare for Data Experts


Businesses, governments, and researchers continue to struggle with extracting usable data from PDF files, despite AI advances. These digital documents contain valuable information for everything from scientific research to government records, but their rigid formats make extraction difficult. "PDFs are a creature of a time when print layout was a big influence on publishing software," Derek Willis, a lecturer in Data and Computational Journalism at the University of Maryland, told Ars Technica. "Right now, the clear leader is Google's Gemini 2.0 Flash Pro Experimental," Willis notes, while Mistral's recent OCR solution "performed poorly" in tests.


