Web-scraping AI bots cause disruption for scientific databases and journals


Automated programs gathering training data for artificial-intelligence tools are overwhelming academic websites.

These automated programs, which attempt to ‘scrape’ large amounts of content from websites, are increasingly a headache for scholarly publishers and for researchers who run sites hosting journal papers, databases and other resources. “It’s the wild west at the moment,” says Andrew Pitts, chief executive of PSI, a company based in Oxford, UK, that provides a global repository of validated IP addresses for the scholarly-communications community. The problem intensified after the developers behind DeepSeek showed that a large language model (LLM) rivalling popular generative-AI tools could be built with far fewer resources, kick-starting an explosion of bots seeking to scrape the data needed to train such models.


Read more on: Web, disruption, journals

Related news:

Major US Grocery Distributor Warns of Disruption After Cyberattack

Agent-based computing is outgrowing the web as we know it