Finding near-duplicates with Jaccard similarity and MinHash
Suppose we have a large collection of documents, and we wish to identify which documents are approximately duplicates of each other. For instance, we may have crawled the web over some period of time and expect to have fetched the “same page” several times, but with slight differences in metadata, or to have several revisions of a page following small edits.
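One common way to make “approximately the same” precise, and the one the title refers to, is to represent each document as a set of features (for example, the set of words or short character n-grams it contains) and compare those sets with Jaccard similarity:

\[ J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \]

i.e. the fraction of features the two documents share, out of all features appearing in either.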
MinHash is a commonly-used approach to this problem (e.g. the GPT-3 paper describes using it as part of their dataset preparation pipeline), but one I had not encountered until recently, and which I find to be pretty interesting. We’ll find an approximation to the Jaccard similarity that avoids examining the entire sets, and which only requires a small, fixed-size “signature” that we can precompute for each document independently. It turns out that, for any choice of the banding parameters \(r\) (rows per band) and \(b\) (number of bands) greater than 1, the resulting curve is S-shaped in a broadly similar fashion, and so varying those values gives us a rich tradeoff space over recall, precision, and performance costs.
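For reference, in the standard analysis of the banding scheme, a pair of documents with Jaccard similarity \(s\) becomes a candidate pair with probability \(1 - (1 - s^r)^b\); that expression is the S-shaped curve referred to above.

Here is a minimal sketch of the signature idea in Python. The shingle size, the number of hash functions, and the use of keyed blake2b as a cheap family of hash functions are all illustrative choices, not a specific pipeline from this post:

```python
import hashlib

def shingles(text: str, k: int = 5) -> set[str]:
    """Represent a document as its set of overlapping k-character shingles."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def minhash_signature(doc: set[str], num_hashes: int = 128) -> list[int]:
    """For each of num_hashes hash functions, keep the minimum hash value
    over all shingles; the vector of minima is the document's signature."""
    sig = []
    for seed in range(num_hashes):
        key = seed.to_bytes(8, "little")  # one keyed hash function per seed
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), key=key, digest_size=8).digest(),
                "little")
            for s in doc
        ))
    return sig

def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature entries estimates the Jaccard
    similarity of the underlying shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = shingles("The quick brown fox jumps over the lazy dog")
b = shingles("The quick brown fox jumped over the lazy dog")
print(estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

The useful property is that each signature entry matches between two documents with probability equal to the Jaccard similarity of their shingle sets, so averaging over many entries gives an estimate whose accuracy we can tune by choosing the signature length.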