Stable Diffusion, NovelAI, Machine Learning Art - AI art generation discussion and image dump

Okay, allowing for model loading (which took 17 seconds on the first run, so I ran it a second time), it took 64 seconds for the below:
comfy7900XT.png

Used the base model and started it up with the --directml flag. I did see a message as follows:
\ComfyUI\comfy\model_sampling.py:92: UserWarning: The operator 'aten::frac.out' is not currently supported on the DML backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at D:\a\_work\1\s\pytorch-directml-plugin\torch_directml\csrc\dml\dml_cpu_fallback.cpp:17.)

However, watching the performance monitor on my machine, it showed the GPU usage maxed out at 100% once it had passed the CLIP encoding stage (which was a small part of the overall time taken). So whatever was falling back to the CPU didn't seem to be the main part of it. I did this with the base model, which seemed the proper test. I hardly ever use that, so I wondered if maybe other models just take longer, so I tried with Juggernaut XL and that was 69 seconds (allowing for model loading), so I don't know what to tell you guys. It's quicker than I thought and I don't know why. I did do a git pull just before, but I don't know that there've been any radical improvements lately.
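For anyone wanting to reproduce this, the launch is just the stock ComfyUI entry point with the flag added, something like the following (assuming torch-directml is installed in the same Python environment and you're running from the ComfyUI folder):

python main.py --directml

Without an argument it should just pick the default DirectML adapter.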

This is still slower than what you both think I should be getting, though. :thinking:
 
Nah, that would take half an hour, probably.
Yeah, on second thought, yeah. Though even on an RX 570 I can get like 10s per iteration (basically 2.5 minutes for a low 15-step image) on a 1024x1024 SDXL image, so a higher-end AMD card with a lot more VRAM being that slow seems odd.

64 seconds seems more correct. 3 seconds per iteration still isn't the best, but that seems way more believable than being slower than a 4GB card. There may be some flags you can pass to try to speed things up. It might be the precision it's running at by default; usually FP16 should be fine, but in some cases it can be slower than FP32 on some cards. --dont-upcast-attention may help. It may also be worth trying the three attention options manually to see if one is faster than the others. https://github.com/comfyanonymous/C...6d87743445fcce1f0477ba9/comfy/cli_args.py#L91 --use-quad-cross-attention is usually a safe choice and tends to be the most efficient with VRAM consumption.
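To spell those out, the launch lines would look something like this (flag names taken from that cli_args.py, so double-check them against your checkout; the attention flags are meant to be mutually exclusive, so pass only one at a time):

python main.py --directml --dont-upcast-attention
python main.py --directml --use-split-cross-attention
python main.py --directml --use-quad-cross-attention
python main.py --directml --use-pytorch-cross-attention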
 
Last edited:
Well it's definitely been taking longer than two minutes recently so I don't know what has changed. But I'm getting about 3s/it, correct. For such an expensive card, it would be nice to get more out of it. What's a 570? Seven years old? And it wasn't even the top-end at the time.

If I had known that I would develop an interest in AI then I would have bought an Nvidia card. TBH, it did cross my mind, but I figured VRAM is really important for AI, and this has 20GB for far less than a 4090 with similar gaming performance. And probably the AMD side will catch up. Ach, well. Runpod is dirt cheap for someone who just plays around with it casually, so that will do until ROCm 6.x is released on Windows. When it is, I think that will be a significant leap forward for AMD on Windows; not quite parity with Nvidia, but a big step.

I should probably just free up some disk space and set up a dual boot.
 
Yeah, I dunno. I can't speak on how much faster ROCm will be for you; I just didn't think DirectML would be that far behind. In my own experience, ROCm on my RX 570 was only, like, a single-digit percentage faster than DirectML. I stopped using Linux for inference because I couldn't update my packages without also updating ROCm to a version that no longer supports my ancient card (RIP), so I just mess around from time to time on Windows, generating 704x704 images in 30 seconds or so on SD 1.5-based models.
 
I saw your edit after I'd replied and wanted to say I'd just tried your suggestions. --dont-upcast-attention made no difference either way so far as I could see. --use-quad-cross-attention may well be more conservative with VRAM and therefore safer, but for me it was unnecessary and it almost halved my speed, pushing me up to 6s/it. Forcing fp32 was similarly negative and took me up to just under 5s/it. I figured forcing fp16 would therefore be the same as not specifying anything at all, but in fact it, for reasons beyond me, delivered consistently under 3s/it, usually around 2.7 seconds per iteration. When the image emerged, though, it was just a black square. I hadn't altered the VAE, so maybe that's why. In any case, it looks like I'm already in a best-case scenario; the suggestions are very much appreciated though.
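If the VAE is the culprit, one thing I might try is keeping just the VAE in fp32 while forcing fp16 everywhere else; if I'm reading the same cli_args.py correctly, there's a flag for that, something like:

python main.py --directml --force-fp16 --fp32-vae

I haven't tried it yet, so take that as a guess rather than a fix.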

I get 75 seconds on a 7900 XTX running ComfyUI on Linux: SDXL, 50 steps at 832x1216, then 20 more at 1664x2432.
Well, you've got the next card up from me and another 4GB of VRAM, but that seems way faster than mine. 832x1216 is about the same number of pixels as 1024x1024 in my tests (if aspect isn't a factor). And that's 50 steps! I was doing 20 and it came in at 64 seconds, with no second pass for the refiner as I was just doing a quick performance test. If you happen to be on your system now, any chance you could do a quick 20-step generation with a simple prompt at 1024x1024? I wouldn't like to impose, but as your system is so blisteringly fast I feel this will probably only take ten seconds! ;) Jokes aside, only if you happen to be able to; I'd be curious to get a like-for-like between my card on Windows and your next one up on Linux.


Well, ROCm is still advancing, so maybe the gap is bigger now. I'll be disappointed if, when v6 arrives on Windows, it's only a percentage point or two better than DirectML, or if I give up waiting, set up a dual-boot Linux system and find the same. But that's good to know.

Yes, for SD 1.5 my system is pretty quick.
 
7900 XTX ComfyUI Linux sd_xl_base_1.0 only, no refiner, no upscale

100% 20/20 [00:06<00:00, 3.28it/s]
Requested to load AutoencoderKL
Loading 1 new model
Prompt executed in 9.70 seconds
As I had just copied the base model to the system it was probably still in RAM, so if it had to load it off disk it would have probably been nearly a second slower.
That's my Gaming system.

For reference, my desktop 4060 Ti 16GB, also Linux, same parameters.

100% 20/20 [00:08<00:00, 2.47it/s]
Requested to load AutoencoderKL
Loading 1 new model
Prompt executed in 12.65 seconds

Both (different bottle result, obviously):
2024-03-08_14-20.png
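Putting those in the same units as above: 3.28 it/s is roughly 0.30 s/it and 2.47 it/s is roughly 0.40 s/it, against the ~3.2 s/it (64 seconds over 20 steps) from the DirectML run, so call it about a 10x gap.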
 
Thanks for that. Well, I'm going to take it as a positive sign. Your GPU is the same generation as mine and not wildly different in specs (+4GB VRAM, a few hundred MHz more), so hopefully the wild difference in outcome is ROCm. I hope they don't take too long to update the Windows version, though you've got me curious now and I'm tempted to go and set up that dual-boot system. Shame this can't be done with WSL2.
 
I've been out of diffusers for a few months; is kohya-ss still the current package of choice for LoRA training, or have we moved on to something better?

I tried to make an @Null with the prompt "Unshowered Romanian forums admin." and I have to admit, the results are pretty accurate. (both from the first generation)
View attachment 5797368
The 'suffer horse', as seen on MATI.
 
Sigh, Bing is on to me, "Mario wearing red pants. The red nose of a proboscis monkey is in his lap. No monkey is present only the nose." only gives erroneous results, with the proper ones being censored. :(

edit: I GOT ONE I GOT ONE! Eat shit censors, I strike this blow for freedom!
mario15.jpg
 
Last edited:
Why Mario, of all people?
 