Slurred
kiwifarms.net
- Joined
- Jan 29, 2022
How does AI art generation work anyway? Like does "scraping" really copy/"steal" whole images by artists?

Training is basically a process of creating mathematical associations between images and captions. Modern AI works as well as it does because of the immense amounts of data used. Basically, the more captioned images your model is trained on, the better images it will produce. This is assuming that most of the images are high quality and have relevant and detailed captions.
The product of the training process is a model weights file. For Stable Diffusion 1.5, this is about 4 GB. For Stable Diffusion XL and its variants it's about 7 GB, and for bigger models like Flux and Qwen Image it's around 20 GB. SD1.5 was trained on some 600 million images, and the others on even more. Obviously these file sizes aren't big enough to contain copies of the training data. The files contain high-dimensional matrices which map relationships between text and image data, allowing generation. So when you enter a prompt, the model is basically doing some math on your words to come out with the resulting image.
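You can sanity-check the "no copies in the weights" point with back-of-the-envelope arithmetic, using the rough figures above:

```python
# How much weight data is there per training image? Figures are the
# approximate ones quoted above (SD1.5: ~4 GB file, ~600M images).
model_bytes = 4 * 1024**3        # ~4 GB weights file
training_images = 600_000_000    # ~600 million training images

bytes_per_image = model_bytes / training_images
print(f"{bytes_per_image:.1f} bytes of weights per training image")
# Roughly 7 bytes per image -- not even one pixel's worth of a typical
# JPEG, let alone a stored copy of the picture.
```

Even with perfect compression, a handful of bytes per image can only hold statistical relationships, not the images themselves.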
The key insight that's enabled the current AI boom is that when training runs exceed a certain size, surprising things happen. If you train a small model on a single picture of a bird and then prompt it for a bird, you will get more-or-less that same picture. Congratulations, you've invented the world's shittiest compression.
But if you train a model on millions of pictures of birds, all of different types, in different styles, different positions and so on (with detailed captions describing all that), then at some point it learns the essence of what a bird is, or at least how to draw one. A lot of antis will take issue with the word "learns" here because the AI is not human, it's not learning like humans do, but I don't think we really have a better name for it. Human artists learn by observation and some analogous process happens with image generation AIs.
The other part of the process is diffusion. The model takes two inputs: an initial canvas and a prompt. In the case of pure text to image, the canvas is a mass of randomly generated noise, something like this:

The model then iteratively denoises the image based on the prompt you gave it. You can think of this as kind of like trying to see pictures in the clouds. Over enough denoising steps, this random noise resolves into a picture containing the elements you prompted for, based on what the model has learned about the aspects of your prompt.
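The denoising loop above can be sketched in a few lines. This is a toy, not a real diffusion model: the "denoiser" here just nudges the canvas toward a fixed target array, whereas a real model has a neural network predict the noise to remove, conditioned on your prompt and the current step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for "the image the prompt describes"; a real model has no
# such target, only a learned noise predictor.
target = np.full((8, 8), 0.5)
canvas = rng.normal(size=(8, 8))   # start from pure random noise

steps = 20
for step in range(steps):
    # Real models: predicted_noise = unet(canvas, prompt_embedding, step)
    predicted_noise = canvas - target
    # Remove one slice of the remaining noise per step.
    canvas = canvas - predicted_noise / (steps - step)

# After enough steps the noise has fully resolved into the target.
print(np.abs(canvas - target).max())
```

The point of the sketch is the loop shape: each pass removes a fraction of the estimated noise, so the picture emerges gradually rather than in one shot.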
You can also start with your own image instead of random noise: this is the basis of image to image. The model will then blur your input and apply the same denoising process to it. More advanced interfaces allow you to control the amount of denoising to be done, and you can achieve various things with that, from smoothing out lines and blending disparate elements of an image to changing its art style. Here's a fun post (archive) about that from the early Stable Diffusion days.
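The "amount of denoising" slider usually works by running only the tail end of the same step schedule. Here's a minimal sketch of that mapping; the function name and behavior are illustrative, not any specific UI's API, though popular interfaces call the knob "denoising strength" and it behaves roughly like this:

```python
def img2img_schedule(total_steps: int, strength: float) -> list[int]:
    """Return the denoising steps actually run for a given strength.

    strength 1.0 runs the full schedule (pure text-to-image behaviour);
    strength 0.0 runs nothing, returning your input image untouched.
    """
    start = int(total_steps * (1 - strength))
    return list(range(start, total_steps))

print(img2img_schedule(20, 1.0))   # all 20 steps: full redraw
print(img2img_schedule(20, 0.3))   # only the final 6 steps: gentle restyle
```

Low strength keeps your composition and just cleans up details; high strength treats your image as little more than a color hint.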
I think this tweet from the AI Derangement Syndrome OP is really apropos:

In any case, however AI art generation works, the end results are usually not substantially similar to any particular human artist's work -- much less "pirated" copies of paid stuff. BTW, I'm not a lawyer, but I'm pretty sure copying a work to train an AI -- assuming that's what's done -- is not necessarily copyright infringement in and of itself, if the copy used for training isn't redistributed online.
A key part of the NYT lawsuit against OpenAI seems to be that if you prompt ChatGPT with a large portion of an NYT article, it will produce the rest of the article and therefore can be used to bypass the paywall. But that's maybe the lamest thing you can do with AI and not really what it's for. You can commit copyright infringement with these tools, just as you can with Photoshop or Word, but to argue that any output is de facto infringement because it is based on training data seems to me to be an insane expansion of copyright as a concept. Like, some people are basically proposing that every time anyone generates an image, $0.0000001 should be paid in royalties to every fanartist on Twitter for their infinitesimal contribution to the weights. You can call this stuff "high-tech collage" or plagiarism or art theft, but you're really stretching the definitions of those terms.