Yohann's notebook

2022 is to media as 1945 is to steel.

Media released before November 2022 is precious because, theoretically, it is uncontaminated by the AI slop that has taken over the internet in the years since.

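Shortly after the debut of ChatGPT, academics and technologists started to wonder whether the explosion in AI models had created a contamination of its own, much as atmospheric nuclear testing from 1945 onward left traces of radioactivity in newly made steel.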

Their concern is that AI models are being trained on synthetic data created by other AI models. Subsequent generations of models may therefore become less and less reliable, a state known as AI model collapse.

In March 2023, John Graham-Cumming, then CTO of Cloudflare and now a board member, registered the domain lowbackgroundsteel.ai and began posting about sources of data compiled before the 2022 AI explosion, such as the Arctic Code Vault (a snapshot of GitHub repos from February 2, 2020).

Maybe the guys over at r/datahoarder were onto something.

#ai #internet