What if we reach a point where there are so many AI-generated images on the 'Net that they feed back into the dataset and the output gets worse?

You'd have the researchers who are creating the model dedicate more and more time to manually pruning the dataset (or rather, getting grad students/cheap Mechanical Turk workers to do it), assuming they haven't already developed a way to automatically determine whether the art is human-made before scraping it. Any new dataset would likely just build on the old one, too (barring some shift in IP law, following all these lawsuits, that forces them to remove most of it). So it's less that you're photocopying a photocopy ad infinitum, and more that, if you wanted a larger dataset of images, you'd need to supply additional human-made art (which likely won't be going away any time soon, no matter who claims otherwise). It'll just take more labor.
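The "filter before scraping" idea boils down to running a detector over each candidate image and dropping the ones that look AI-generated. A minimal sketch, where `score_human_likelihood` is a hypothetical stand-in for whatever trained detector you'd actually plug in (here it just reads a precomputed score attached upstream):

```python
# Sketch of filtering scraped images with a detector before they enter the
# dataset. `score_human_likelihood` and `detector_score` are hypothetical;
# a real pipeline would run a trained classifier on the image bytes.

def score_human_likelihood(record):
    # Placeholder: return a score some upstream detector already attached.
    return record["detector_score"]

def filter_scrape(records, threshold=0.8):
    """Keep only records the detector is reasonably sure are human-made."""
    return [r for r in records if score_human_likelihood(r) >= threshold]

scraped = [
    {"url": "a.png", "detector_score": 0.95},  # likely human-made
    {"url": "b.png", "detector_score": 0.30},  # likely AI-generated
    {"url": "c.png", "detector_score": 0.85},
]
kept = filter_scrape(scraped)
```

The threshold is the obvious knob: set it high and you throw away some real art along with the synthetic stuff, set it low and more AI images leak through.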

Of course, it's also a question of how big a dataset really needs to be. If you're shooting for some absurd 1-trillion-parameter model, you'd need plenty of data. But at the same time, advances in LLMs are being made in efficiency, beyond just the standard method of "make it bigger." If you can get a model that produces similar-quality results with only 1/10th of the parameters, chances are you won't need to expand much beyond the open datasets you're already using, supplemented slowly over time.
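As rough back-of-the-envelope support for that: under the Chinchilla-style heuristic of roughly 20 training tokens per parameter (a rule of thumb from LLM scaling work, used here purely as an illustrative assumption), shrinking the parameter count by 10x shrinks the "compute-optimal" data requirement by the same factor:

```python
# Back-of-the-envelope: how much training data a compute-optimal LLM "wants"
# under the rough ~20-tokens-per-parameter heuristic. The heuristic and the
# parameter counts below are illustrative assumptions, not published figures.

TOKENS_PER_PARAM = 20  # rough Chinchilla-style rule of thumb

def optimal_tokens(n_params):
    return n_params * TOKENS_PER_PARAM

big = optimal_tokens(1_000_000_000_000)  # a 1-trillion-parameter model
small = optimal_tokens(100_000_000_000)  # same quality at 1/10th the params?

ratio = big // small  # data requirement shrinks by the same factor of 10
```

So a 10x efficiency win on parameters is also, roughly, a 10x smaller appetite for fresh data.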
 
Don't worry, they're lobotomizing the crap out of those AIs right now.
The only thing that they will render is "2+2= Vote Blue"

Blackpilling aside, the people who own OpenAI are doing everything they can to keep that power. They will never let normal people use their version of the AI. I have a feeling that's why Microsoft is testing and lobotomizing it on Bing.
 
They'll find workarounds. Maybe a tiered system where billions of images sit in a "dump" tier, and smaller, better, manually curated sets are given greater weight, similar to hypernetworks/LoRAs. They can also find automatic ways to filter what's coming in, discarding some of the total crap.

It's possible that adding significantly more training images is already unnecessary. Are Midjourney, SDXL, DALL-E, etc. really going to improve much by growing the training set another order of magnitude? If you look around, you can probably find papers explaining what they're doing to improve their models.
 