Crawling More Politely Than Big Tech
Dennis Schubert, an engineer at Mozilla and a notable contributor to diaspora, a distributed, open-source social network, recently observed that 70% of the load on diaspora's servers was coming from poorly behaved bots feeding the LLMs of a few big outfits. The worst offenders, OpenAI and Amazon, accounted for 40% of total traffic combined.
If you limit yourself to one fetch context (worker, thread, or coroutine) per domain, then rate limiting is as simple as using a local variable to track how long it's been since you made your last request, as in the sketch below. The result is a clear division of logic and resource utilization that crawls with respect for rate limits, freshness, and necessity.
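A minimal sketch of that idea, assuming a hypothetical 10-second crawl delay and a caller-supplied list of paths. Because this worker is the only fetch context for its domain, a plain local variable is enough to enforce the delay; no shared state or locks are needed.

```python
import time
import urllib.request

# Assumed per-domain delay; a real crawler would read this from robots.txt
# or its own policy.
CRAWL_DELAY = 10.0  # seconds between requests to the same domain


def crawl_domain(domain, paths):
    """Fetch paths from one domain, one request at a time, politely spaced."""
    last_request = 0.0  # time of the previous request to this domain
    for path in paths:
        # Sleep off whatever remains of the crawl delay since the last fetch.
        elapsed = time.monotonic() - last_request
        if elapsed < CRAWL_DELAY:
            time.sleep(CRAWL_DELAY - elapsed)
        last_request = time.monotonic()
        with urllib.request.urlopen(f"https://{domain}{path}") as resp:
            yield path, resp.read()


# Example: fetch two pages, issuing at most one request every 10 seconds.
for path, body in crawl_domain("example.com", ["/", "/about"]):
    print(path, len(body))
```

Running one such worker per domain keeps the politeness logic local and trivially correct, while concurrency across domains is still available by launching more workers.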