Stable Diffusion, NovelAI, Machine Learning Art - AI art generation discussion and image dump

Okay, allowing for model loading (which took 17 seconds on the first run, so I ran it a second time), it took 64 seconds for the below:
comfy7900XT.png

Used the base model and started it up with the --directml flag. I did see a message as follows:
\ComfyUI\comfy\model_sampling.py:92: UserWarning: The operator 'aten::frac.out' is not currently supported on the DML backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at D:\a\_work\1\s\pytorch-directml-plugin\torch_directml\csrc\dml\dml_cpu_fallback.cpp:17.)

However, watching the performance monitor on my machine, it showed the GPU usage maxed out at 100% once it had passed the CLIP encoding stage (which was a small part of the overall time taken). So whatever was falling back to the CPU didn't seem to be the main part of it. I did this with the base model, which seemed the proper test. I hardly ever use that, so I wondered if maybe other models just take longer, so I tried with Juggernaut XL and that was 69 seconds (allowing for model loading), so I don't know what to tell you guys. It's quicker than I thought and I don't know why. I did do a git pull just before, but I don't know that there've been any radical improvements lately.
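For anyone wanting to reproduce this, the launch is just the stock ComfyUI entry point with the flag added, something like the following (assuming torch-directml is installed in the same Python environment and you're running from the ComfyUI folder):

python main.py --directml

Without an argument it should just pick the default DirectML adapter.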

This is still slower than what you both think I should be getting, though. :thinking:
 
Nah, that would take half an hour, probably.
Yeah, on second thought, yeah. Though even on an RX 570 I can get like 10s per iteration (basically 2.5 minutes for a low 15-step image) on a 1024x1024 SDXL image, so a higher-end AMD card with a lot more VRAM being that slow seems odd.

64 seconds seems more correct. 3 seconds per iteration still isn't the best, but that seems way more believable than being slower than a 4GB card. There may be some flags you can pass to try to speed things up. It might be the precision it's running at by default; usually FP16 should be fine, but in some cases it can be slower than FP32 on some cards. --dont-upcast-attention may help. It may also be worth trying the three attention options manually to see if one is faster than the others. https://github.com/comfyanonymous/C...6d87743445fcce1f0477ba9/comfy/cli_args.py#L91 --use-quad-cross-attention is usually a safe choice and tends to be the most efficient with VRAM consumption.
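To spell those out, the launch lines would look something like this (flag names taken from that cli_args.py, so double-check them against your checkout; the attention flags are meant to be mutually exclusive, so pass only one at a time):

python main.py --directml --dont-upcast-attention
python main.py --directml --use-split-cross-attention
python main.py --directml --use-quad-cross-attention
python main.py --directml --use-pytorch-cross-attention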
 
Last edited:
Well it's definitely been taking longer than two minutes recently so I don't know what has changed. But I'm getting about 3s/it, correct. For such an expensive card, it would be nice to get more out of it. What's a 570? Seven years old? And it wasn't even the top-end at the time.

If I had known that I would develop an interest in AI then I would have bought an Nvidia card. TBH, it did cross my mind, but I figured VRAM is really important for AI, and this has 20GB for far less than a 4090 with similar gaming performance. And probably the AMD side will catch up. Ach, well. Runpod is dirt cheap for someone who just plays around with it casually, so that will do until ROCm 6.x is released on Windows. When it is, I think that will be a significant leap forward for AMD on Windows; not quite parity with Nvidia, but a big step.

I should probably just free up some disk space and set up a dual boot.
 
Yeah, I dunno. I can't speak on how much faster ROCm will be for you; I just didn't think DirectML would be that far behind. In my own experience, ROCm on my RX 570 was only, like, a single-digit percentage faster than DirectML. I stopped using Linux for inference because I couldn't update my packages without also updating ROCm to a version that no longer supports my ancient card (RIP), so I just mess around from time to time on Windows, generating 704x704 images in 30 seconds or so on SD 1.5-based models.
 
I saw your edit after I'd replied and wanted to say I'd just tried your suggestions. --dont-upcast-attention made no difference either way so far as I could see. --use-quad-cross-attention may well be more conservative with VRAM and therefore safer, but for me it was unnecessary and it almost halved my speed, pushing me up to 6s/it. Forcing fp32 was similarly negative and took me up to just under 5s/it. I figured forcing fp16 would therefore be the same as not specifying anything at all, but in fact it, for reasons beyond me, delivered consistently under 3s/it, usually around 2.7 seconds per iteration. When the image emerged, though, it was just a black square. I hadn't altered the VAE, so maybe that's why. In any case, it looks like I'm already in a best-case scenario; the suggestions are very much appreciated though.
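If the VAE is the culprit, one thing I might try is keeping just the VAE in fp32 while forcing fp16 everywhere else; if I'm reading the same cli_args.py correctly, there's a flag for that, something like:

python main.py --directml --force-fp16 --fp32-vae

I haven't tried it yet, so take that as a guess rather than a fix.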

I get 75 seconds on a 7900 XTX running ComfyUI on Linux: SDXL, 50 steps at 832x1216, then 20 more at 1664x2432.
Well, you've got the next card up from me and another 4GB of VRAM, but that seems way faster than mine. 832x1216 is about the same number of pixels as 1024x1024 in my tests (if aspect isn't a factor). And that's 50 steps! I was doing 20 and it came in at 64 seconds, with no second pass for the refiner as I was just doing a quick performance test. If you happen to be on your system now, any chance you could do a quick 20-step generation with a simple prompt at 1024x1024? I wouldn't like to impose, but as your system is so blisteringly fast I feel this will probably only take ten seconds! ;) Jokes aside, only if you happen to be able to; I'd be curious to get a like-for-like between my card on Windows and your next one up on Linux.


Well, ROCm is still advancing, so maybe the gap is bigger now. I'll be disappointed if, when v6 arrives on Windows, it's only a percentage point or two better than DirectML, or if I give up waiting, set up a dual-boot Linux system and find the same. But that's good to know.

Yes, for SD 1.5 my system is pretty quick.
 
7900 XTX ComfyUI Linux sd_xl_base_1.0 only, no refiner, no upscale

100% 20/20 [00:06<00:00, 3.28it/s]
Requested to load AutoencoderKL
Loading 1 new model
Prompt executed in 9.70 seconds
As I had just copied the base model to the system it was probably still in RAM, so if it had to load it off disk it would have probably been nearly a second slower.
That's my Gaming system.

For reference, my desktop 4060 Ti 16GB, also Linux, same parameters.

100% 20/20 [00:08<00:00, 2.47it/s]
Requested to load AutoencoderKL
Loading 1 new model
Prompt executed in 12.65 seconds

Both (different bottle result, obviously):
2024-03-08_14-20.png
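Putting those in the same units as above: 3.28 it/s is roughly 0.30 s/it and 2.47 it/s is roughly 0.40 s/it, against the ~3.2 s/it (64 seconds over 20 steps) from the DirectML run, so call it about a 10x gap.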
 
Thanks for that. Well, I'm going to take it as a positive sign. Your GPU is the same generation as mine and not wildly different in specs (+4GB VRAM, a few hundred MHz more), so hopefully the wild difference in outcome is ROCm. I hope they don't take too long to update the Windows version, though you've got me curious now and I'm tempted to go and set up that dual-boot system. Shame this can't be done with WSL2.
 
I've been out of diffusers for a few months; is kohya-ss still the current package of choice for LoRA training, or have we moved on to something better?

I tried to make an @Null with the prompt "Unshowered Romanian forums admin." and I have to admit, the results are pretty accurate. (both from the first generation)
View attachment 5797368
The 'suffer horse', as seen on MATI.
 
Sigh, Bing is on to me, "Mario wearing red pants. The red nose of a proboscis monkey is in his lap. No monkey is present only the nose." only gives erroneous results, with the proper ones being censored. :(

edit: I GOT ONE I GOT ONE! Eat shit censors, I strike this blow for freedom!
mario15.jpg
 
Last edited:
Why Mario, of all people?
 