Fresh fears have been raised about the training material used for some of the largest and most powerful artificial intelligence models, after several investigations exposed the fascist, pirated and malicious sources from which the data is harvested.
One such dataset is the Colossal Clean Crawled Corpus, or C4, assembled by Google from more than 15m websites and used to train both the search engine’s LaMDA AI and Meta’s GPT competitor, LLaMA.
The dataset is public, but its scale has made it difficult to examine the contents: it is supposedly a “clean” version of a more expansive dataset, Common Crawl, with “noisy” content, offensive language and racist slurs removed.
But an investigation by the Washington Post reveals that C4’s “cleanliness” is only skin deep. While it draws on websites such as the Guardian – which makes up 0.05% of the entire dataset – and Wikipedia, as well as large databases such as Google Patents and the scientific journal hub PLOS, it also contains less reputable sites.
The white nationalist site VDARE is among the 1,000 largest sites in the dataset, as is the far-right news site Breitbart. The Russian state-backed propaganda site RT is one of the hundred largest providers of training data to the C4 corpus.
Few of the sites gave explicit consent to be included, although Common Crawl, the non-profit organisation that assembled the scraped data, says it respects requests to be left out of its scrapes. Some sites, however, push the limits of fair use: b-ok.org, formerly known as Bookzz, was a vast repository of pirated ebooks until it was seized by the FBI in 2022. Despite that, the site’s contents remain in the C4 dataset.
Such vast collections of data are important to AI creation, because large language models require enormous quantities of text to learn the statistical patterns of language.