Stable Diffusion, NovelAI, Machine Learning Art - AI art generation discussion and image dump

That's not all that surprising. In the way these things work, "joe" and "Joe" might be conceptually different to the model, in ways where the relation between the two isn't even all that strong. The "deeper" a model is, the better it understands such relations: 4o (or, to give a more fitting example, DALL-E 3) will always understand that joe biden and Joe Biden are the same thing, but smaller, shallower models like this one can be tripped up by it relatively easily. They can feel extremely literal because of that, so I'd always pay close attention to the language I use.
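You can see where the case sensitivity comes from at the tokenizer level. Flux's main text encoder is T5, whose SentencePiece vocabulary is case-sensitive, so different casings literally produce different token IDs. A quick check (using the public google/t5-v1_1-xxl tokenizer as a stand-in for the one Flux ships):

```python
from transformers import AutoTokenizer

# T5's SentencePiece vocab is case-sensitive, so casing changes
# the token IDs the model actually sees.
tok = AutoTokenizer.from_pretrained("google/t5-v1_1-xxl")

print(tok.encode("joe biden"))  # one sequence of IDs
print(tok.encode("Joe Biden"))  # a different sequence of IDs
```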
I think this is actually a mistake they made while training it. "Convert the training dataset to lower case so the model will be case-insensitive" seems like an easy step to miss (it would have taken a few minutes to run, which is nothing). With SD I think they used human-labelled images for the dataset, which means a dataset-maker like Google was most likely turning everything into lower case (i.e. a photo depicting a bouquet of roses in a vase on a table would be tagged "photograph, rose, flowers, vase, table, red, green"), which is why SD worked best on word-salad prompts and struggled with composition. This time they seem to have just scraped a tonne of images with no fucks given about getting permission first (which I absolutely love, screw artists and copyright holders who try to restrict our creativity) and had an AI describe them, which would have produced results like "Sure, I can help with that! This is a photograph of a bouquet of roses in a vase, arranged tastefully on a table". Give it an SD-style prompt and it will struggle, but with a more descriptive prompt it has far fewer issues with composition, which is also why it's so much better at inserting text (text being just a composition of characters). SD never had issues with the actual shapes of letters; it just couldn't parse sentences properly, and with its poor understanding of composition it may never have made the connection that words are letters in sequence rather than individual glyphs.
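If lowercasing the captions really was the missed step, it's the kind of thing that's a couple of lines in preprocessing, which is exactly what makes it easy to overlook. A rough sketch of what such a pass might look like (the file layout and field names here are made up for illustration):

```python
import json

# Hypothetical caption file: one JSON record per line, e.g.
# {"image": "0001.jpg", "caption": "Sure, I can help with that! ..."}
def lowercase_captions(in_path: str, out_path: str) -> None:
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            record = json.loads(line)
            # Case-fold so "Joe Biden" and "joe biden" map to the
            # same token sequence at training time.
            record["caption"] = record["caption"].lower()
            dst.write(json.dumps(record) + "\n")
```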
 
Yeah, I agree with @Susanna about it being a mistake. Further evidence: I think they used token replacement to suppress known people. Say you have a dozen images labelled "brad pitt" in your training data; the model will obviously learn who Brad Pitt is. But having done all the training, you replace the token name "brad pitt" with "afe03da79a" or whatever. Now you keep the benefit of the trained model, but people can't just type in "brad pitt" and have it show his image. An educated guess is that they did their substitutions but missed the upper-case variants. Probably for the reasons Susanna gives: they weren't used to needing to.
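If that's what happened, the failure is trivial to reproduce: a plain string substitution over the captions only catches the exact casing you search for. A toy illustration (the replacement token is the one from above; everything else is made up):

```python
# Toy example: suppressing a name by swapping it for an opaque token.
# A naive, case-sensitive pass misses every other casing of the name.
SUBSTITUTIONS = {"brad pitt": "afe03da79a"}

def scrub(caption: str) -> str:
    for name, token in SUBSTITUTIONS.items():
        caption = caption.replace(name, token)  # case-sensitive!
    return caption

print(scrub("brad pitt at a premiere"))  # "afe03da79a at a premiere" -- suppressed
print(scrub("Brad Pitt at a premiere"))  # "Brad Pitt at a premiere"  -- slips through

# The fix: case-fold before matching, or use re.sub with re.IGNORECASE.
```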
 
who don't put effort into their orthography
I'm guilty of this sometimes, and even higher-end LLMs really suffer from it.

SD (1.5) is a tiny model with a screwy dataset and it's impressive it performs as well as it does tbh.

Well, I theorized about a model I'd barely used, so that's what I get. These explanations make sense, though. Now that there's a ControlNet floating around (from what I saw), I'll play around with it some more.
 
So for people who are running Flux locally, how long does it take to generate images on your rigs? I've got 16 GB worth of VRAM and I seem to recall generating one image in SDXL taking one minute.
 
For people looking for more optimized Flux inference and/or an escape from ComfyUI, https://github.com/lllyasviel/stable-diffusion-webui-forge has been updated to support Flux and offers "nf4" precision: smaller than FP8, faster than FP8, minimal quality fall-off, and it can apparently even perform better than FP8. https://huggingface.co/lllyasviel/flux1-dev-bnb-nf4/blob/main/flux1-dev-bnb-nf4.safetensors has the Dev weights already converted to nf4, or you can just check the nf4 precision type and have any checkpoint use nf4 on the fly. (It'll take a minute or two to convert when loading.) With nf4, I went from 3-5 s per iteration to 1.5 s per iteration.
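For anyone wondering what "nf4" actually is: it's the 4-bit NormalFloat format from the QLoRA paper. Each block of weights is scaled by its absmax and snapped to one of 16 code points placed at the quantiles of a normal distribution, which is why quality holds up so well on (roughly normally distributed) network weights. A minimal PyTorch sketch of the mechanics; real implementations like bitsandbytes pack two 4-bit indices per byte and use fused CUDA kernels, so this is just to show the idea:

```python
import torch

# The 16 NF4 code values from the QLoRA paper (quantiles of a standard
# normal distribution, rescaled to [-1, 1]).
NF4_CODE = torch.tensor([
    -1.0, -0.6961928010, -0.5250730515, -0.3949174881,
    -0.2844413817, -0.1847734302, -0.0910500363, 0.0,
    0.0795802996, 0.1609302014, 0.2461123019, 0.3379152417,
    0.4407098293, 0.5626170039, 0.7229568362, 1.0,
])

def nf4_quantize(w: torch.Tensor, blocksize: int = 64):
    """Blockwise NF4: scale each block by its absmax, then snap every
    value to the nearest of the 16 code points (a 4-bit index)."""
    flat = w.flatten().float()
    pad = (-len(flat)) % blocksize
    flat = torch.cat([flat, flat.new_zeros(pad)])
    blocks = flat.view(-1, blocksize)
    absmax = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    normed = blocks / absmax                       # now in [-1, 1]
    idx = (normed.unsqueeze(-1) - NF4_CODE).abs().argmin(dim=-1)
    return idx.to(torch.uint8), absmax, w.shape, pad

def nf4_dequantize(idx, absmax, shape, pad):
    blocks = NF4_CODE[idx.long()] * absmax
    flat = blocks.flatten()
    flat = flat[: len(flat) - pad] if pad else flat
    return flat.view(shape)

w = torch.randn(256, 256)
q = nf4_quantize(w)
err = (w - nf4_dequantize(*q)).abs().mean()
print(f"mean abs error: {err:.4f}")  # small relative to the weight scale
```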

https://github.com/lllyasviel/stable-diffusion-webui-forge/discussions/981 has more info; be aware that Schnell checkpoints are also supported.

I can comfortably gen on extremely low-end hardware with https://huggingface.co/drbaph/FLUX.1-schnell-dev-merged-fp8-4step (it's a block merge that takes the low-step convergence from Schnell while retaining the text ability and general quality of Dev). ComfyUI was much slower given my limited hardware.
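For the curious: a block merge like that is conceptually just a per-tensor weighted average of two checkpoints, with the blend ratio varied depending on which block a tensor belongs to. A stripped-down sketch with safetensors (the per-block alpha schedule here is invented; the linked merge's actual recipe isn't in the post):

```python
import torch
from safetensors.torch import load_file, save_file

def block_merge(path_a: str, path_b: str, out_path: str,
                default_alpha: float = 0.5) -> None:
    """Merge two checkpoints: out = alpha * A + (1 - alpha) * B per tensor.
    alpha can be chosen per block, e.g. favouring Schnell's blocks for fast
    convergence and Dev's for text/quality."""
    a, b = load_file(path_a), load_file(path_b)
    merged = {}
    for key, tensor_a in a.items():
        if key not in b:
            merged[key] = tensor_a  # keep tensors unique to A as-is
            continue
        alpha = default_alpha
        # Illustrative per-block override; real merges tune these by eye.
        if "single_blocks" in key:
            alpha = 0.7
        out = alpha * tensor_a.float() + (1 - alpha) * b[key].float()
        merged[key] = out.to(tensor_a.dtype)
    save_file(merged, out_path)
```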
 
@verymuchawful Interesting info; I may give it a go. But using what you wrote just as a jumping-off point and not as any kind of argument: I kind of feel that the days of running everything locally on a consumer GPU might be on their way out. Not completely, never completely. But for serious use I think the future is renting compute in the cloud. One of the things that has made Flux so good is that the creators decided not to overly constrain themselves to current consumer GPUs and said: "Yeah, let's use 24 GB of VRAM".

Maybe I'm wrong - haven't even tried the Schnell model yet, I've been doing everything on Runpod. But I see posts on Reddit where people are happy about getting it to run in 8GB of VRAM or something and my instinct is to think there's no way that can be comparable.

Unless gaming drastically increases the amount of VRAM games need - and I can't see that happening, as VRAM is already outpacing the GPU's ability to use it - I feel like AI piggy-backing on consumer GPUs is going to come to an end. Even in terms of cost: I'm renting an Nvidia A40 with 48 GB for approx $0.47 per hour. A 4090 with half that VRAM costs me about $2,300, which is equivalent to nearly 5,000 hours of full usage, i.e. over 200 days. And that's before counting any electricity costs I'd have locally; as a mild counterbalance, I haven't included cloud storage, though that's cheap.
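The break-even arithmetic, for anyone who wants to plug in their own numbers:

```python
gpu_price = 2300.00   # 4090, approx USD
cloud_rate = 0.47     # A40 48 GB rental, USD per hour

hours = gpu_price / cloud_rate
print(f"{hours:.0f} hours = {hours / 24:.0f} days of 24/7 usage")
# -> 4894 hours = 204 days, before local electricity costs
```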
 
But I see posts on Reddit where people are happy about getting it to run in 8GB of VRAM or something and my instinct is to think there's no way that can be comparable.
To give an idea performance-wise: FP8 on a 10 GB 3080 is about 3 s per iteration, i.e. 10-15 seconds for a 4-step generation on the merge linked. On NF4 the speed on that same GPU is 1.5 s per iteration, making 20-step gens on Dev take only 40-50 seconds. (Friend's system that has a lot of GPU downtime when they're busy with other stuff.) On my super-low-end system, which only has 16 GB RAM and 4 GB VRAM and is AMD, I get 55-85 s per iteration, making 4-step gens on Schnell/Schnell merges take about 5 minutes. And that's on FP8, because that card doesn't support NF4. (For Flux anyway; I can run SDXL checkpoints in NF4 on the AMD card for some reason.) Not being able to fit the entire model into VRAM really isn't that detrimental. If, of course, you want to train the model or even train a LoRA, you're gonna need 24 GB minimum, unless they can cut LoRA training down to NF4 precision as well.
[attached image: 4-step sample from the Dev-Schnell block merge]
Above is 4 steps on the Dev-Schnell block merge. Schnell on its own would struggle with the text way more.
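The reason low-VRAM setups stay usable is offloading: the runtime keeps weights in system RAM and moves each sub-model to the GPU only while it runs. In diffusers terms it's roughly this (assumes a recent diffusers build with Flux support; ComfyUI and Forge do something equivalent internally):

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
# Keep weights in system RAM; shuttle each component (text encoders,
# transformer, VAE) to the GPU only for its forward pass.
pipe.enable_model_cpu_offload()
# For even tighter VRAM budgets: pipe.enable_sequential_cpu_offload()

image = pipe(
    "a neon sign that says HELLO",
    num_inference_steps=4,   # Schnell converges in ~4 steps
    guidance_scale=0.0,      # Schnell is distilled; no CFG
).images[0]
image.save("out.png")
```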
 
To give an idea performance-wise: FP8 on a 10 GB 3080 is about 3 s per iteration [...] Not being able to fit the entire model into VRAM really isn't that detrimental.
That's pretty impressive. Well, I welcome happy surprises. I'll likely give it a go locally a little later. Cheers!
 
The Chinese are soldering 48 and 32 GB of VRAM onto their 4080D and 4090 Super. "Don't underestimate the Chinese", exhibit #415144.

You can do this at home if you're adventurous and skilled; there are also cheap $200 Chinese BGA rework stations now. I'm surprised the Nvidia firmware doesn't brick the card. It would seem like a very Nvidia thing to do.

But for serious use I think the future is renting compute in the cloud
I agree with this, though. People have a mental block about paying for such things online, but it's actually not that expensive. Some of the LLM rigs people build are so expensive to put together (= cost of the parts) and run (= cost of electricity) that you'd take forever to break even versus just renting some server time, and most of the rigs I've seen perform strictly worse than those cloud servers. Of course it's better not to be dependent on some cloud, but currently I feel it's just not practical. That might change if we get dedicated AI hardware.
 
@verymuchawful Well, you were right about what was possible. I was surprised at how well I was able to run Flux locally.

Flux Dev, fp16, 1024x1024, 20 steps -> 62 seconds, 2.85 s/it
Flux Dev, fp8, 1024x1024, 20 steps -> 64 seconds, 2.88 s/it

(No, I have no idea why fp8 took longer than fp16. It's not due to model loading; this was consistent across runs.)

Flux Dev, fp8, 512x512, 6 steps -> 12 seconds, 1.08 s/it

I tried out the slightly cut-down checkpoint Comfy recommended and it made no difference to times as far as I could tell, nor to how maxed out my VRAM was (I have 20 GB). I also tried Schnell and it gave me better output; I think something was giving out with Dev on my hardware, as I would sometimes get blurred images. (No, it wasn't anything NSFW.) And very, very weirdly, it would seemingly hold onto elements from a previous run. Example: I ask for a drawing of a person with various details. I then change it to "photo of" and add "detailed, realistic", and it still gives me drawings. Swap to a different model and back, and now it gives me realistic photos. I have no explanation for that at all. It shouldn't be possible, but that appeared to be the case.

The bulk of the time for a generation was loading the model, which it seemed to need to do every time; I guess VRAM was so tight that it freed the model the moment a run was over. I didn't try any of the ones you pointed at yet. And to be clear, my view on how things are going long-term is unchanged. Still, I was surprised I could run this (more or less) on my hardware.
 
Can someone who has played around with Flux tell me if it can copy artists' styles, and does it know specific people?
Could it create a drawing of George Floyd punching Elizabeth Olsen in the stomach in the style of Todd McFarlane?
 
Can someone who has played around with Flux tell me if it can copy artists' styles, and does it know specific people?
Could it create a drawing of George Floyd punching Elizabeth Olsen in the stomach in the style of Todd McFarlane?
I just typed it in and it gave me a white man punching a blond woman in the style of an American comic book. So, no.
 