Crawl Order and Disorder
A problem the search engine’s crawler has struggled with for some time is that it takes a fairly long time to finish, usually spending several days wrapping up the final few domains. This has become more noticeable recently: the migration to slop crawl data has cut the crawler’s memory requirements by something like 80%, which has allowed me to increase the number of concurrent crawling tasks. That in turn has led to a bizarre situation where 99% of the crawl completes fairly quickly, while the last few very large domains keep the crawler busy for days afterwards.
This happens for a few reasons: in part because the sizes of websites seem to follow a Pareto distribution and some sites are simply very large, but also because the crawler limits how many concurrent crawl tasks are allowed per common domain name. Since these large websites take a long time to crawl, it’s desirable to start them as early as possible rather than picking the order at random. This won’t completely fix the problem, which is in part a consequence of the batch-oriented crawling model the search engine uses, but it will at least make better use of the crawler’s runtime.
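As a rough illustration of the idea, here is a minimal sketch of size-aware scheduling: crawl specifications are sorted in descending order by the number of URLs known from the previous crawl, so the largest domains are handed to the worker pool first instead of whenever they happen to come up. The CrawlSpec record, its urlCount field, and the pool size are hypothetical stand-ins for this sketch, not the crawler’s actual classes.

```java
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical crawl specification: domain name plus the URL count
// observed in the previous crawl, used here as a size estimate.
record CrawlSpec(String domain, int urlCount) {}

class CrawlScheduler {
    private final ExecutorService pool = Executors.newFixedThreadPool(64);

    void schedule(List<CrawlSpec> specs) {
        // Sort descending by known URL count, so the handful of very large
        // domains start crawling immediately rather than being picked up
        // late by chance and dragging out the tail of the batch.
        specs.stream()
             .sorted(Comparator.comparingInt(CrawlSpec::urlCount).reversed())
             .forEach(spec -> pool.submit(() -> crawl(spec)));
    }

    private void crawl(CrawlSpec spec) {
        // ... fetch and process the domain ...
    }
}
```

Since per-domain concurrency is capped anyway, starting the big domains first doesn’t starve the small ones; it just overlaps the long-running work with the rest of the batch.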