Unintended consequences arise when AI is trained on AI-generated data


The prevalence of generative AI has surged within a short span since the launch of OpenAI’s ChatGPT. This technology has infiltrated classrooms, political advertisements, entertainment, journalism, and even search engines, shaping both digital and real-life experiences. Despite uncertainties about its long-term impact, generative AI is firmly establishing itself in various aspects of our lives. Paradoxically, the synthetic content it produces could also pose a significant threat to its own advancement.

At the core of the expanding generative AI ecosystem lies human-crafted data. These AI models don’t spontaneously generate human-like content; rather, they are trained on large datasets predominantly sourced from human-created material found on the internet. Yet a curious phenomenon emerges when synthetic content is fed back into these generative AI models: strange distortions appear in the output. The effect has been compared to data inbreeding, yielding increasingly distorted, lackluster, and subpar results. Dubbed “Habsburg AI” — a system so heavily reliant on the outputs of other generative AIs that its results echo the deformities seen in inbred lineages — the phenomenon takes its name from the famously inbred royal dynasty.

This issue carries significant implications. AI developers are eager to supply their models with more data, often drawn from an internet landscape rife with synthetic content. The looming question is: if this self-destructive inbreeding escalates, could the entire system unravel?

To delve deeper into this quandary, we spoke with machine learning researchers Sina Alemohammad and Josue Casco-Rodriguez, both PhD candidates in Rice University’s Electrical and Computer Engineering department. Their supervising professor, Richard G. Baraniuk, also shared insights. Collaborating with Stanford researchers, they recently published a paper titled “Self-Consuming Generative Models Go MAD.” The term MAD, which stands for Model Autophagy Disorder, captures the way a model degrades by consuming its own output. Their research revealed that just five cycles of training on synthetic data could trigger what Baraniuk describes as an output “explosion.”
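The self-consuming loop the researchers study can be sketched with a toy statistical model. This is not the paper’s experiment — their work uses deep generative models on images — but a minimal, hypothetical illustration of the same feedback structure: fit a simple model to data, sample synthetic data from it, then retrain on only those samples, over and over. All function names here are invented for the sketch.

```python
import random
import statistics

def fit_gaussian(data):
    """Estimate mean and stddev -- our stand-in 'generative model'."""
    return statistics.mean(data), statistics.stdev(data)

def sample(mu, sigma, n, rng):
    """Draw n synthetic data points from the fitted model."""
    return [rng.gauss(mu, sigma) for _ in range(n)]

rng = random.Random(0)
# Generation 0: "real" (human-made) data, here a standard normal distribution.
data = sample(0.0, 1.0, 500, rng)

sigmas = []
for generation in range(6):
    mu, sigma = fit_gaussian(data)
    sigmas.append(sigma)
    # Each new generation trains ONLY on synthetic samples from the
    # previous model -- the "autophagous" loop with no fresh real data.
    data = sample(mu, sigma, 500, rng)

# With no fresh real data entering the loop, the estimated spread drifts
# randomly from the true value at each generation, and the errors compound.
print(sigmas)
```

In this simplified setting each generation’s estimation error compounds on the last, so the model drifts away from the original distribution rather than staying anchored to it. The paper’s finding is the deep-learning analogue: without enough fresh real data in each cycle, quality or diversity progressively degrades.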

This phenomenon potentially exposes generative AI’s Achilles’ heel. The implications extend beyond the technology itself, touching on everyday individuals, the burgeoning AI sector, and the very fabric of the internet. The intricate interplay between generative AI and the data it’s trained on demands further exploration to ensure sustainable and effective use of this evolving technology.
