Multiple AI companies bypassing web standard to scrape publisher sites, licensing firm says


Multiple artificial intelligence companies are circumventing a common web standard that publishers use to block the scraping of their content for use in generative AI systems, content licensing startup TollBit has told publishers. The letter, seen by Reuters on Friday, does not name the AI companies or the publishers affected. It comes amid a public dispute between AI search startup Perplexity and media outlet Forbes involving the same web standard, and a broader debate between tech and media firms over the value of content in the age of generative AI.

A Wired investigation published this week found that Perplexity was likely bypassing efforts to block its web crawler via the Robots Exclusion Protocol, or "robots.txt," a widely accepted standard that tells automated crawlers which parts of a site they may visit. "What this means in practical terms is that AI agents from multiple sources (not just one company) are opting to bypass the robots.txt protocol to retrieve content from sites," TollBit wrote. Created in the mid-1990s to keep crawlers from overwhelming websites, robots.txt has more recently become a key tool publishers use to block tech companies from ingesting their content free of charge for use in generative AI systems that can mimic human creativity and instantly summarize articles.
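The protocol itself is simple: a site publishes a plain-text robots.txt file listing which user agents may fetch which paths, and well-behaved crawlers check it before requesting pages. Below is a minimal sketch using Python's standard-library parser; the rules shown are an illustrative example of the kind publishers deploy against AI crawlers (GPTBot and PerplexityBot are the publicly documented crawler names of OpenAI and Perplexity, respectively), not any specific publisher's configuration.

```python
# Sketch: how robots.txt rules are expressed and checked.
# The rules and URL below are illustrative, not a real publisher's file.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A compliant crawler consults can_fetch() before downloading a page.
for agent in ("GPTBot", "PerplexityBot", "Googlebot"):
    verdict = "allowed" if rp.can_fetch(agent, "https://example.com/article") else "blocked"
    print(f"{agent}: {verdict}")
# GPTBot: blocked / PerplexityBot: blocked / Googlebot: allowed
```

Nothing in the protocol enforces these rules: a crawler that skips the check, or identifies itself under a different User-Agent string, can simply fetch the pages anyway. That voluntary-compliance gap is what TollBit's letter describes AI agents exploiting.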
