Can Robots.txt Files Really Stop AI Crawlers?
In the high-stakes world of AI, "The fundamental agreement behind robots.txt [files], and the web as a whole — which for so long amounted to 'everybody just be cool' — may not be able to keep up..." argues The Verge: For many publishers and platforms, having their data crawled for training...
A study by Ben Welsh, the news applications editor at Reuters, found that 606 of 1,156 surveyed publishers had blocked GPTBot in their robots.txt files. CCBot, run by the nonprofit Common Crawl, scours the web to build an open archive of crawl data, and that data is also used by OpenAI, Google, and others to train their models. And those are just the crawlers that identify themselves; many others operate in relative secrecy, making them hard to find, let alone stop, in a sea of ordinary web traffic.
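The mechanism itself is simple: robots.txt is a plain-text file served from a site's root, and each rule addresses a crawler by its self-reported user-agent token. A minimal sketch of the kind of blocking rules Welsh's study counted (GPTBot and CCBot are the crawlers' real agent names; the site address is a placeholder):

    # https://example.com/robots.txt (hypothetical publisher)
    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

Nothing enforces those rules; a crawler has to check them voluntarily, which is the "everybody just be cool" bargain The Verge describes. In Python, a well-behaved crawler performs that check with the standard library's urllib.robotparser (again using the placeholder site):

    import urllib.robotparser

    # Fetch and parse the site's robots.txt.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    # Under the rules above, this returns False: GPTBot may not fetch the page.
    print(rp.can_fetch("GPTBot", "https://example.com/any-page"))

A crawler that simply skips this check hits no error and no obstacle, which is why the unidentified ones are so hard to stop.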