PDF to Text, a challenging problem

The search engine has recently gained the ability to index PDF files. The change will roll out over a few months. Extracting text from PDFs is a significantly bigger challenge than it might seem. The crux of the problem is that PDF isn’t a text format at all, but a graphical one. It doesn’t contain text in the way you might expect; it is closer to a mapping of glyphs to coordinates on “paper”.
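To make the glyphs-to-coordinates point concrete, here is a minimal sketch of what a text extractor has to do once it has a bag of positioned glyphs. The `Glyph` type, the `glyphs_to_lines` function, and the tolerance values are all hypothetical and chosen for illustration; this is not the search engine's actual code, and real PDFs add complications such as rotated text, multiple columns, and custom font encodings.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical positioned glyph, roughly what a PDF page boils down to."""
    char: str
    x: float      # left edge, in points
    y: float      # baseline, in points (PDF origin is bottom-left)
    width: float  # advance width, in points


def glyphs_to_lines(glyphs, y_tolerance=2.0, space_ratio=0.3):
    """Group glyphs into lines by baseline, then infer spaces from gaps."""
    lines = []  # list of (baseline_y, [glyphs on that line])
    # Larger y is higher on the page, so sort descending to read top-down.
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0] - g.y) <= y_tolerance:
            lines[-1][1].append(g)
        else:
            lines.append((g.y, [g]))

    out = []
    for _, row in lines:
        row.sort(key=lambda g: g.x)
        text = row[0].char
        for prev, cur in zip(row, row[1:]):
            # The file stores no space character here; guess one from the gap.
            gap = cur.x - (prev.x + prev.width)
            if gap > space_ratio * prev.width:
                text += " "
            text += cur.char
        out.append(text)
    return out


page = [
    Glyph("y", 14, 700, 5), Glyph("H", 0, 700, 6), Glyph("k", 5, 686, 5),
    Glyph("i", 6, 700, 3), Glyph("o", 0, 686, 5), Glyph("o", 19, 700, 5),
]
print(glyphs_to_lines(page))
```

Note that the glyphs arrive in arbitrary order and that word boundaries are inferred, not stored. Both thresholds are guesses that would need tuning per document.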

The best way of doing this these days is likely a vision-based machine learning model, but that approach is very far from scaling to hundreds of gigabytes of PDF files on a single server with no GPU. A search engine is primarily interested in relevance signals such as headings. It is very happy if it can identify an abstract and get a somewhat coherent picture of the remaining text.
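As an illustration of a cheap, GPU-free relevance signal, the sketch below flags lines as likely headings when their font size is noticeably larger than the document's median. This is an assumed heuristic for illustration only; the function name `find_headings` and the `ratio` threshold are hypothetical, and the text above does not say this is the method the search engine uses.

```python
from statistics import median


def find_headings(lines, ratio=1.2):
    """Return the text of lines whose font size is at least ratio * median size.

    lines: list of (text, font_size) pairs, e.g. produced by a line extractor.
    The median is used as a stand-in for the body text size (an assumption).
    """
    if not lines:
        return []
    body_size = median(size for _, size in lines)
    return [text for text, size in lines if size >= ratio * body_size]


doc = [
    ("Title", 18.0),
    ("body a", 10.0),
    ("body b", 10.0),
    ("Section", 14.0),
    ("body c", 10.0),
]
print(find_headings(doc))
```

A size-based rule like this misses headings set in bold at body size and can misfire on documents with little body text, so it is best treated as one weak signal among several.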
