PDF to Text, a challenging problem
The search engine has recently gained the ability to index the PDF file format. The change will deploy over a few months. Extracting text information from PDFs is a significantly bigger challenge than it might seem. The crux of the problem is that the file format isn’t a text format at all, but a graphical format. It doesn’t have text in the way you might think of it, but more of a mapping of glyphs to coordinates on “paper”.
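To make the "glyphs at coordinates" idea concrete, here is a minimal sketch of rebuilding lines of text from positioned glyphs. It is not the search engine's actual code: the `Glyph` record, the `glyphs_to_text` function, and both thresholds are hypothetical, chosen only for illustration. It also assumes a single column of horizontal text, which real PDFs often violate.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical record for one glyph placed on the page."""
    char: str
    x: float      # left edge, in points
    y: float      # baseline, in points (PDF origin is bottom-left)
    width: float  # advance width, in points


def glyphs_to_text(glyphs, line_tol=2.0, space_ratio=0.3):
    """Group glyphs into lines by baseline, then infer spaces from gaps.

    line_tol and space_ratio are assumed tuning values, not standards.
    """
    # Visit glyphs top-to-bottom (larger y first), then left-to-right.
    ordered = sorted(glyphs, key=lambda g: (-g.y, g.x))

    lines = []
    for g in ordered:
        # Same line if the baseline is close to the line's first glyph.
        if lines and abs(lines[-1][0].y - g.y) <= line_tol:
            lines[-1].append(g)
        else:
            lines.append([g])

    out = []
    for line in lines:
        line.sort(key=lambda g: g.x)
        parts = [line[0].char]
        for prev, cur in zip(line, line[1:]):
            # A horizontal gap wider than a fraction of the previous
            # glyph's width is treated as a word break.
            gap = cur.x - (prev.x + prev.width)
            if gap > space_ratio * prev.width:
                parts.append(" ")
            parts.append(cur.char)
        out.append("".join(parts))
    return "\n".join(out)


# Glyphs arrive in arbitrary order, with slight baseline jitter.
sample = [
    Glyph("o", 30, 700, 6), Glyph("k", 16, 688, 6),
    Glyph("i", 16, 700.5, 3), Glyph("H", 10, 700, 6),
    Glyph("y", 24, 700, 6), Glyph("o", 10, 688, 6),
]
print(glyphs_to_text(sample))
```

Note that the file never stores a space between the two words on the first line. The word break is guessed from the gap between glyphs, which is why spacing, columns, and reading order are all things an extractor has to infer.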
The best way of doing this these days is likely a vision-based machine learning model, but that approach is very far from scaling to hundreds of gigabytes of PDF files on a single server with no GPU. A search engine is primarily interested in relevance signals such as headings. It is very happy if it can identify an abstract and get a somewhat coherent picture of the remaining text.