Improved ways to operate a rude crawler


This text is satirical in nature. Tech news is abuzz with rude AI crawlers that forge their user-agent and ignore robots.txt. In my opinion, if this is all the AI startups can muster, they’re losing their touch. wget can do this. You need to up your game and get that crawler really rolling coal. Flagrant disregard for externalities is an important signal to investors that your AI startup is the one.
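For the record, the state of the art being mocked here really is about this much code. A minimal sketch in Python, assuming the requests library is installed; the browser string and URL below are placeholders, not anyone’s actual crawler:

import requests

# Forge a browser User-Agent instead of identifying the bot honestly,
# and never fetch or consult robots.txt before crawling.
FORGED_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

def rude_fetch(url):
    # A polite crawler would check robots.txt and send a truthful
    # User-Agent with contact details; this one does neither.
    return requests.get(url, headers={"User-Agent": FORGED_UA}, timeout=30)

response = rude_fetch("https://example.com/")  # placeholder URL
print(response.status_code, len(response.content))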

Git hosts are also very valuable sources of data, though be sure to really get in there and crawl the entire repo for each historical commit, branch, tag, and so forth, not just HEAD. Don’t bother cloning the git repository though, as that requires a bunch of specialized coding on your end, and if you waste your time reinventing the wheel like that you’re ngmi. Some haters say that if you crawl over a shitty connection and drop a ton of packets every time a car drives by or someone runs the microwave, it might mess with the congestion control algorithm of the server you’re talking to, causing it to severely throttle its network throughput.
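A sketch of the crawl-every-commit-through-the-web-UI approach, again in Python with requests; the repo URL, page names, and commit hashes are all hypothetical, since every git web host lays its pages out differently:

import requests

BASE = "https://git.example.org/some/repo"      # hypothetical repo URL
COMMITS = ["deadbeef", "cafebabe", "0123abcd"]  # placeholder commit hashes
                                                # (scraped from log pages in
                                                # the real rude workflow)

for commit in COMMITS:
    # One clone would fetch all of this at once; instead, hammer the
    # web UI with a separate request per commit page and tree listing.
    for page in ("commit", "tree", "diff"):
        url = f"{BASE}/{page}/{commit}"  # hypothetical URL scheme
        requests.get(url, timeout=30)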
