Wikipedia offers AI developers a training dataset to maybe get scraper bots off its back


The Wikimedia Foundation and Google's data science platform Kaggle are offering AI developers a dataset of information from Wikipedia they can freely use.

Wikipedia has been struggling with the load that AI crawlers place on its servers. These bots scrape text and multimedia from the encyclopedia to train generative artificial intelligence models, and in some cases the traffic has led to increased costs and slower load times for human users. To address this, the organization has teamed up with Kaggle to offer a beta release of a structured dataset in both English and French. Wikimedia Enterprise notes that the dataset includes "abstracts, short descriptions, infobox-style key-value data, image links and clearly segmented article sections."

Or read this on Engadget

Related news:

Russia-linked Pravda network cited on Wikipedia, LLMs, and X - The embedding of Pravda network websites into Wikipedia is particularly concerning given Wikipedia’s significant role as a primary source of knowledge for LLMs

Wikipedia's largest non-English version was created by a bot. Generative AI poses new problems

Wikipedia servers are struggling under pressure from AI scraping bots