Web-scraping AI bots cause disruption for scientific databases and journals
Automated programs gathering training data for artificial-intelligence tools are overwhelming academic websites.
These automated programs, which attempt to ‘scrape’ large amounts of content from websites, are increasingly becoming a headache for scholarly publishers and researchers who run sites hosting journal papers, databases and other resources. “It’s the wild west at the moment,” says Andrew Pitts, chief executive of PSI, a company based in Oxford, UK, that provides a global repository of validated IP addresses for the scholarly-communications community.

But the developers behind DeepSeek showed that a large language model (LLM) rivalling popular generative-AI tools could be built with far fewer resources, kick-starting an explosion of bots seeking to scrape the data needed to train such models.