Unless you have been living under a rock for the past year or so, you have by now not only heard of the many wonders of generative AI, but have more than likely already experimented with its many manifestations. Few technologies in human memory have, over such a short span of time, demonstrated their potential to so radically transform the way we live and work.
But, as rapid as this progress has been, it is fast becoming apparent that there is a limit to how long this exponential improvement will continue. Large language models (and their image and video generation counterparts) need access to vast amounts of training data in order to improve.
This is what gives successive generations of AI the ability to compose prose in an increasing variety of literary formats, and poetry and songs in the style of more and more artists—and why even a novice like me can generate increasingly complex code for a range of different use cases. The trouble is that the availability of high-quality content needed for training these models is fast dwindling.
According to a recent paper by Pablo Villalobos, the size of training datasets has been increasing exponentially, at a rate greater than 50% per year. The volume of available language data needed to satisfy this appetite is, however, growing at a rate of just 7% per year and is expected to steadily slow to just 1% by 2100.
As a result, even though the total stock of language data available today is somewhere between 70 trillion and 700 quadrillion words, given the rate at which it’s being consumed, our supply of high-quality language data is likely to run out four years from now. Image data faces similar challenges and is estimated to run out at some point after 2030.
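To see why compounding demand overtakes slowly growing supply, consider a minimal back-of-the-envelope sketch in Python. The starting figures below (words consumed for training this year, current stock of high-quality words) are illustrative assumptions rather than numbers from the paper; only the two growth rates come from the figures cited above.

```python
# Back-of-the-envelope sketch: demand for training data compounds far faster
# than the usable stock of language data grows, so the stock is exhausted
# within a few years. Starting figures are illustrative assumptions; only the
# growth rates are taken from the column.

annual_demand = 10e12   # assumed: words consumed for training this year
stock = 70e12           # assumed: current stock of high-quality words
demand_growth = 0.50    # dataset sizes growing at more than 50% a year
stock_growth = 0.07     # usable language data growing at about 7% a year

years = 0
consumed = 0.0
while consumed < stock:
    consumed += annual_demand * (1 + demand_growth) ** years
    stock *= 1 + stock_growth
    years += 1

print(f"Under these assumptions, high-quality data runs out in about {years} years.")
```

Plugging in different starting figures shifts the horizon by a year or two either way, but the widening gap between a 50% and a 7% growth rate is what drives the result.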