Stable Diffusion, NovelAI, Machine Learning Art - AI art generation discussion and image dump

The 4060Ti is a reasonable budget option. It’s not super fast, but it does have 16GB of memory.
I've been thinking about a used 3090, but who knows, maybe when the 50 series comes out used 4090s will be a better deal. Kinda stupid to pair a top-end GPU with a mid-range CPU, but 24GB of VRAM is pretty strong, and the performance of such a high-end card will probably suit my needs for years to come. I can always frame limit my games, which I do already, so it doesn't work overtime when it doesn't have to. And of course I'd give it an undervolt anyway. The games I've been playing recently are still playable on the 1060, so I didn't feel a big push to aim for an upgrade yet. Investing in a new CPU was a much better idea.
 
I've been thinking about a used 3090, but who knows, maybe when the 50 series comes out used 4090s will be a better deal. ... 24GB of VRAM is pretty strong, and the performance of such a high-end card will probably suit my needs for years to come.
3060 12GB cards are nothing to scoff at and sell for around $250. P40s are about $220 after you buy the shroud; they're slower but give you 24GB of VRAM. They do require some rigging though.

Edit: This is obviously the used eBay market.
 
As you can see, something went very wrong here. But then given the subject matter, maybe I don't want it to be right.
I made that image with A1111, so maybe the metadata does not transfer cleanly to Comfy? That or you were missing the checkpoint and it defaulted back to vanilla SDXL or something.

I've been thinking about a used 3090, but who knows, maybe when the 50 series comes out used 4090s will be a better deal. ... The games I've been playing recently are still playable on the 1060, so I didn't feel a big push to aim for an upgrade yet.
In the interim you can always rent a cloud GPU if you are just messing around with image gen stuff for a short period of time. Eg. https://vast.ai/ has 4090s for $0.35 per hour. Getting all the software set up on something like that can be a bit of a hassle, but it's a cost-effective option for short-term use.
 
In the interim you can always rent a cloud GPU if you are just messing around with image gen stuff for a short period of time.
I enjoy the ability to run whatever I want whenever I want on my machine with zero censorship. I dislike the idea of renting someone else's machine to use the software where they can see all that I'm doing. The GPU investment would future-proof me for other things anyway, like gaming on a 1440p high-refresh-rate monitor with stable framerates, or AV1 encoding, though if I want to go with Team Green for that it's the 40 series and up.
 
Enterprise VPS providers will encrypt your memory contents, and they have a lot of incentive not to spy on their customers (because if word got out, their other customers would leave and governments worldwide would be on them for failing to secure consumer credentials). They'll still unlock your files for governments if asked to, but governments don't really care about one single creep using datacentre GPUs to train loli porn dungeon LLMs or whatever it is you want to do that would be so damaging to your reputation.
 
I made that image with A1111, so maybe the metadata does not transfer cleanly to Comfy? That or you were missing the checkpoint and it defaulted back to vanilla SDXL or something.
That shouldn't matter; defaulting to a different model should still resolve a normal image (in a different style) unless the parameters are super weird, which those weren't.

But Comfy doesn't work like WebUI; it's a node-based pipeline editor for automating workflows, bootstrapping shit like Stable Cascade, etc. You can't just drop prompts/metadata into it without setting up/loading the workflow blueprint too. So I think he meant he was just fucking around by shoving the prompt into whatever he currently had set up.


Edit:
btw 95% of people don't need ComfyUI but if that sounded interesting and you want to try it out, here's a tip: copy the environment loader from your WebUI setup before loading Comfy (so it can share all your models and shit from where they are). Eg. under Windows, this is my comfy.bat file:
Code:
@echo off
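REM The block below is essentially a copy of the venv's Scripts\activate.bat, so ComfyUI runs inside WebUI-Forge's Python environment.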
for /f "tokens=2 delims=:." %%a in ('"%SystemRoot%\System32\chcp.com"') do (set _OLD_CODEPAGE=%%a)
if defined _OLD_CODEPAGE ("%SystemRoot%\System32\chcp.com" 65001 > nul)

set VIRTUAL_ENV=(your:\path\here)\stable-diffusion-webui-forge\venv

if not defined PROMPT set PROMPT=$P$G
if defined _OLD_VIRTUAL_PROMPT set PROMPT=%_OLD_VIRTUAL_PROMPT%
if defined _OLD_VIRTUAL_PYTHONHOME set PYTHONHOME=%_OLD_VIRTUAL_PYTHONHOME%
set _OLD_VIRTUAL_PROMPT=%PROMPT%
set PROMPT=(venv) %PROMPT%
if defined PYTHONHOME set _OLD_VIRTUAL_PYTHONHOME=%PYTHONHOME%
set PYTHONHOME=
if defined _OLD_VIRTUAL_PATH set PATH=%_OLD_VIRTUAL_PATH%
if not defined _OLD_VIRTUAL_PATH set _OLD_VIRTUAL_PATH=%PATH%
set PATH=%VIRTUAL_ENV%\Scripts;%PATH%
set VIRTUAL_ENV_PROMPT=(venv)
:END
if defined _OLD_CODEPAGE (
    "%SystemRoot%\System32\chcp.com" %_OLD_CODEPAGE% > nul
    set _OLD_CODEPAGE=
)
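REM The venv is now active; launch ComfyUI (run this from the ComfyUI folder, or point python at its main.py).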

python main.py
I'm not sure all that shit is necessary but it works (and you can probably just copypaste it and change the VIRTUAL_ENV line).

Also btw WebUI-Forge rules and I recommend it over vanilla WebUI. It's basically the exact same thing (and they merge WebUI updates nightly so it literally kinda is) but with QoL tweaks to the interface, a bunch of optimisations, and faster support for new shit.
 
btw 95% of people don't need ComfyUI but if that sounded interesting and you want to try it out
ComfyUI is OK. I like the modular feel of it and how powerful it is, but I could not for the life of me get all the extensions to work. I tried literally everything to get ControlNet and that super simple 2-second video tool to function, but I just never could.

I know it and Auto 1111 are a lot better programs, but I usually find I can get most everything I want done with Easy Diffusion. It's the simplest and has most of the bells and whistles I want.
 
ComfyUI is OK... but I could not for the life of me get all the extensions to work. I tried literally everything to get ControlNet and that super simple 2-second video tool to function, but I just never could.

I usually find I can get most everything I want done with Easy Diffusion. It's the simplest and has most of the bells and whistles I want.
WebUI's pretty similar FYI. Thanks to community drama I think a lot of people might get tricked into trying Comfy first, but it's a power tool that kinda requires you to know how things fit together.

Video Diffusion and ControlNet (pretty sure) are supported by default in Forge, btw.
Although last I checked it can't batch SVD, which is the kind of thing someone might actually want Comfy for (in my case creating a gen->img2img with a second model->batch video diffusion setup that works in one click so I can walk away and come back to a set of variants instead of spending half an hour manually piping each gen).
 
In my experience ComfyUI gives me better results than any A1111 derivative. I set up the exact same parameters in A1111, SD.Next and ComfyUI, and ComfyUI gives me the best result. No idea why. Is it because A1111 derivatives have a shitton going on behind the scenes that I can't see or control, while ComfyUI lays it all out in front of me and gives me full control of it?
 
This meshy.ai site is still pretty new and shit, but it's got some uses. I tried to make a slobbermutt kiwi but it fused them. If you generate a picture of a 3D model with Bing or Stable Diffusion first, though, the picture-to-3D conversion works great with simple models.

1714163496785.png
 
So I think he meant he was just fucking around by shoving the prompt into whatever he currently had set up.
Actually, @inception_state was correct. I was trying to replicate the actual workflow exactly. The thing was, I didn't know whether he was using A1111 or not. So the first thing I did was download the PNG, drag it into ComfyUI and see a workflow appear. I was slightly surprised that it did, as I could think of two things that might have gone wrong. One would be that KF strips the metadata like Reddit does, so I was pleased to see it doesn't. The second one, and this is where I erred, was to think that if he hadn't made it in ComfyUI, I wouldn't be able to do this. But in fact the output of A1111 also has it which I didn't know. When the workflow appeared I nearly wrote a response asking about it because it looked kind of funky to me: it went through two rounds of sampling, with the second not looking like a simple refinement, and some odd-looking upscaling I didn't quite see the point of. Still, I tried it as-is and it kind of worked, only a bit mashed up. It did, however, let me get the exact prompts, and then I was able to try to replicate one on my end. I could share but I don't have it to hand right now.

Separate to all that, I've been playing with Stable Diffusion 3. It's not yet available to run locally, but I've been running it through the API with a bastardised Python script, and I have to say I'm very impressed so far. I'm slightly hesitant to share results here because, since it's via the API, they could perhaps link the pics I share back to my identity. That seems a long shot though, so I might be provoked into doing it. Even if I were identified one day, I've actually never written anything here I think is that bad. I'm not even racist. Maybe I should offer my services in the request thread. I can at least answer questions for people who want to know how I find it.
 
The second one, and this is where I erred, was to think that if he hadn't made it in ComfyUI, I wouldn't be able to do this. But in fact the output of A1111 also has it which I didn't know.
Oh, that's interesting. I didn't know Comfy would try that either, and would have assumed otherwise because I've only seen Comfy blueprints passed around separately.
WebUI doesn't embed that information, it's just the standard gen deets (prompt, seed, model, other parameters), but since the pipeline doesn't really change in WebUI I assume Comfy tries to load a standard template to recreate it when it recognises WebUI metadata.

Some stuff (like extension settings) doesn't go in there though, and some models need weird specific CFG mods, but that hash is for PonyXL AutismMix_Pony, which isn't one of those. So, for the same reason I said before, even though I was wrong about what you were doing, it should still have genned a regular image anyway. My guess is the issue is Comfy's default blueprint version of WebUI's pipe having some fault.
I just messed around with it a bit and couldn't see any problem with the metadata anyway (besides having a separate scheduler tag, rather than putting it in the sampler tag, which my version of Forge doesn't recognise).
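If anyone's curious what exactly WebUI writes into the file, it's a single text chunk you can dump yourself. A minimal sketch, assuming Pillow is installed; the filename is a placeholder and "parameters" is the key A1111-style UIs use:
Code:
# Dump the generation settings that A1111/Forge embed in their PNG outputs.
# Requires Pillow (pip install pillow); "gen.png" is a placeholder filename.
from PIL import Image

img = Image.open("gen.png")
params = img.info.get("parameters")  # A1111-style UIs put everything under this one key
if params:
    print(params)  # prompt, negative prompt, steps, sampler, CFG, seed, size, model hash...
else:
    print("No WebUI metadata found (stripped, or made with another tool).")

Comfy, by contrast, embeds its whole node graph as JSON under its own keys, which is how dragging a Comfy PNG back in restores the workflow.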

I'm sure I've seen that particular artefacting somewhere before, but I can't remember where, so it's still bugging me.

...And yeah, you're right about the upscaler being unnecessary. The original image is genned at 832x1216 (832 is smaller than XL's minimum resolution, which can cause issues sometimes), then R-ESRGAN upscaled to a size that actually is in XL's ballpark. So it could have just been genned at the target size and saved a step.

Edit: Just recreated what you did in Comfy and got the same confetti noise output. The Civit page for that model mix does recommend an extension to avoid noise errors in outputs that can apparently happen with XL models, but the examples look different.
Edit2: noticed it was happening pre-upscale and pre-VAE, seems to be the sampler. Specifically the scheduler, which is the tag Forge didn't recognise for me. Switching it to use karras (per the unread metadata tag) fixes it.
dpmpp_3m_sde should work basically the same as dpmpp_3m_sde_karras (the schedule curve is very similar, there's no reason massive noise artefacts oughtta show up at 40 steps as opposed to tiny variances) so yeah it might just be a bug in Comfy's implementation of that.
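If anyone wants to poke at that sampler-vs-schedule split outside Comfy, the same toggle exists in the diffusers library. Rough sketch with a placeholder model ID and prompt, and not a claim about how Comfy implements it internally:
Code:
# DPM++ SDE is the solver; "karras" is just the noise schedule layered on top of it.
# Assumes torch, diffusers and a CUDA GPU; the model ID and prompt are placeholders.
import torch
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config,
    algorithm_type="sde-dpmsolver++",
    use_karras_sigmas=True,  # flip to False to compare the two schedules, all else equal
)
image = pipe("test prompt", num_inference_steps=40).images[0]
image.save("schedule_test.png")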
 
Oh, that's interesting. I didn't know Comfy would try that either, and would have assumed otherwise because I've only seen Comfy blueprints passed around separately.
Yes, I was surprised as well. That's why I had almost presumed the original source must be from ComfyUI for me to get that far, though I wasn't certain. I don't have PonyXL, if that was the source. From the name I figure it's more anime or cartoon focused.

In separate news, my playing around with SD3 has given me a more nuanced view of things. I remain very impressed. It's notably better at following prompts, and I look forward to when I have it locally and can play around with it properly, and most especially to when we start to see more expert people building on it. It's an 8B model, and I understand why that's about as big as is feasible for regular mortals. Still, I'd deeply love to see something the size of DALL-E but public and open. Anyway, SD3 is interesting. It still has its limits. I need to really push it with a prompt to see just how specific I can get.

Does anybody know how to get partially obscured things in an output? Either by shadow or by objects. Let's say I choose some Universal Horror monster like a mummy. SD can give me that. I can adjust it, such as making the bandages super detailed or cartoonish, or make the mummy take a ballet pose or lunge towards the viewer. I can do all sorts of things. But what I cannot do is say "the mummy emerges from the shadows mostly obscured by the darkness". I cannot say "The mummy hides behind a trash can with only its head and shoulders visible." These models don't seem to understand parts of things. Is that possible at all? If I use image-to-image, might that get me closer? I struggle to get a detailed output from a simple input with image-to-image.
 
I decided to Heck with PL'ing, I'm going to upload a couple of Stable Diffusion 3 images. I wanted to show off how it handles lettering. It's certainly far, far better, though it does have an odd habit of introducing spelling errors that weren't there before. It did correctly place the letters on the can, and scaled and fitted them around the line breaks, which is good:
batman_v_superman.png

I noticed that it is a bit fixated on the Snyder versions of the characters. The above was a second attempt, after an earlier "Batman v Superman" request gave me a very movie-styled version. So I explicitly requested Adam West Batman and Christopher Reeve Superman. That moved me closer, as you can see, but still not quite there. So I isolated the characters and doubled down on asking for specific versions. It still wouldn't really give me Adam West, as you can see, though it sort of understood with the chest symbol:
batman_v_superman_2.png
It's a little better at Christopher Reeve. I also tested whether it could make Superman hold something in his hand, and it did an okay-ish job. That's certainly better than previous versions of SD, though not yet perfect.
batman_v_superman_3.png

(He's just been sprayed with the Bat Kryptonian Repellant if that wasn't obvious).

It's also somewhat better at interpreting multiple-person prompts, though it still struggles. This is supposed to be Batman, Supergirl, Superman and Batman standing in a row. I made a few attempts at that, and it has a tendency to repeat characters or, well, look what happened! Superman is having a bit of a Final Crisis, I think.
hero_gallery.png

All that said, I remain impressed. I think this is a significant step forward over its predecessor, and I think the community will do great things with it. (Also terrible things. Probably involving Kim Possible Foxes, but let's not re-open that.)
 
Does anybody know how to get partially obscured things in an output? Either by shadow or by objects. ... I cannot say "The mummy hides behind a trash can with only its head and shoulders visible." These models don't seem to understand parts of things. Is that possible at all? If I use image-to-image, might that get me closer?
I've had some success with shadows by really cranking up the weighting of (shadows) or (emerging from shadow), etc. Like 1.75 weight up to very high numbers you wouldn't ordinarily try. You can also try bumping the shadow part to the top of the prompt, raising CFG, and/or putting the shadow-related part on its own with a BREAK separator.
For partial figures you might need to stress whatever you can as a verb like (hiding) and try your luck rerolling without getting too specific, even lower CFG if it doesn't break the prompt.
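To make that concrete, in WebUI/Forge attention syntax it might look something like the prompt below; the phrases and weights are just illustrative starting points, not a known-good recipe:
Code:
cinematic photo of a mummy, (emerging from the shadows:1.6), (mostly hidden in darkness:1.4), dim moonlight
BREAK
abandoned museum hall at night, deep shadows, film grain

The (text:number) form is the standard WebUI weighting syntax, and BREAK starts a new prompt chunk so the shadow terms don't get diluted by everything else.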

Img2img should definitely work; for shadows I'd have the mummy emerge from something and just crudely add a semitransparent black layer or whatever in photoshop for where they'll go. Then regen that with the same/similar prompt and seed, probably lowish (0.3-0.5) denoising strength and maybe higher CFG.
Something like the trash can example would be more involved; more time photochopping means less time tweaking parameters. In the extreme example you'd hastily mspaint a bin in then crank the denoising (0.6 and up) and regen random seeds until something looks okay. If it's being stubborn you can always take the least-worst gen and use that as your new img2img base, and repeat while gradually lowering denoise strength until it clicks (also a good trick for when your img2imgs have lost the visual style you were originally going for).
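For what it's worth, if you'd rather script that img2img pass than click through a UI, the same denoising-strength idea maps onto the Hugging Face diffusers library. A rough sketch with placeholder model ID, prompt, filenames and values, not what anyone here actually used:
Code:
# Rough img2img sketch with diffusers; `strength` plays the same role as
# WebUI's denoising strength slider. Assumes torch, diffusers and a CUDA GPU.
import torch
from PIL import Image
from diffusers import AutoPipelineForImage2Image

pipe = AutoPipelineForImage2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# The paint-over you made by hand (crude shadow layer, mspaint trash can, etc.)
init = Image.open("mummy_paintover.png").convert("RGB").resize((1024, 1024))

result = pipe(
    prompt="a mummy emerging from deep shadow, horror movie still",
    image=init,
    strength=0.4,        # low-ish: keep the composition, rework the crude edits
    guidance_scale=9.0,  # slightly higher CFG to push the prompt
    generator=torch.Generator("cuda").manual_seed(1234),
).images[0]
result.save("mummy_img2img.png")

Crank strength towards 0.6+ when the paint-over is really rough, exactly as described above.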
 
@Involuntary Celebrity Thanks for that. A lot of helpful information. Even if it doesn't all work you've given me a lot of things to try and good starting points. I'll have a play around with your suggestions.

It feels like a theme of generative AI that it doesn't know how NOT to do something. Same with text - it can't just say "I don't know". If it can give you a right answer it will, but if it can't it will give you a wrong answer. In both cases, it only knows how to build out the pattern whether that be a mummy or a legal brief.
 
@Involuntary Celebrity Thanks for that. A lot of helpful information.

It feels like a theme of generative AI that it doesn't know how NOT to do something. Same with text - it can't just say "I don't know".
happy2help. Another tip for img2img if you ever go with the mspaint approach is to add multicoloured noise on that layer. The photoshop noise tool (grayscale unchecked) or anything really, it just gives it more to fuck with wherever you have regions of a single colour. Also helps if you're trying to change a style completely like getting something photorealistic out of an illustration--not necessarily a lot, but can cut down on how many times you need to re-img2img it.
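If you don't feel like opening Photoshop just for the noise, it's a few lines of numpy/Pillow. This sketch sprinkles it over the whole image, though you'd normally limit it to the flat painted-in region; the filenames and the ±25 range are arbitrary:
Code:
# Add light multicoloured noise to a paint-over before img2img, so flat
# single-colour regions give the sampler something to latch onto.
import numpy as np
from PIL import Image

img = np.asarray(Image.open("paintover.png").convert("RGB")).astype(np.int16)
noise = np.random.randint(-25, 26, img.shape, dtype=np.int16)  # per-channel colour noise
noisy = np.clip(img + noise, 0, 255).astype(np.uint8)
Image.fromarray(noisy).save("paintover_noisy.png")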

Regarding not doing stuff, they're kinda two separate phenomena. If you look at the literature the work with transformers that gave us GPT was really just focused on creating "question-answering" systems--actually a huge jump in natural language processing to just be able to do this. Getting away from not being able to answer was the whole focus there. Accuracy matters but an ideal system would never just shrug unless the only answer to a question is that it's unanswerable.
With image gen there's a similar element but it's more a matter of abstraction and specificity of language in training. I'd bet there's plenty of porn models on Civit that can happily do parts of girls obscured by the object they need a stepbrother's assistance with--and plenty of similar prompts that work without a specialised model/LoRA because specific enough tags for whatever fetish already existed on booru collections without any need for extra curation. Otherwise the capacity probably exists but it's just uncommon that anyone wanting a picture of the Hulk (from the system or historically in training art) would want a picture of the Hulk hiding in a bush, and if they do they might label that a dozen different ways or not at all, so you really gotta wrangle prompts or heavily stress it.
 
...And yeah, you're right about the upscaler being unnecessary. The original image is genned at 832x1216 (832 is smaller than XL's minimum resolution, which can cause issues sometimes), then R-ESRGAN upscaled to a size that actually is in XL's ballpark. So it could have just been genned at the target size and saved a step.
In my experience this is not the case. I get much better results by generating an image in common aspect ratios (eg. 832x1216, 1024x1024, 768x1344, etc) and then upscaling. Even base SDXL does not generate coherent images when you go significantly above those sizes. Just as an example, here's the Kim Possible example image, then the same parameters with no upscaling and a base image size of 1216x1792, then the same thing but with base SDXL. It brings back the classic stretched torsos, duplicated body parts, etc. Also, if you look at how LoRAs are trained, images are generally normalized to 1024x1024. I have done a few for fun, and the tooling will reduce a 2048x2048 image to 1024x1024, 1664x2432 to 832x1216, etc.
[Attached: the original upscaled gen, the same parameters genned directly at 1216x1792, and the 1216x1792 gen on base SDXL]
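For anyone wondering where sizes like 832x1216 come from: SDXL-era training tooling generally keeps the pixel count near 1024x1024 and snaps each side to a multiple of 64. A toy version of that bucketing, my own illustration rather than any particular trainer's exact code:
Code:
# Toy aspect-ratio bucketing: scale an image to roughly a 1024x1024 pixel budget,
# keeping its aspect ratio and snapping both sides to multiples of 64.
def bucket(width: int, height: int, target_area: int = 1024 * 1024, step: int = 64) -> tuple[int, int]:
    scale = (target_area / (width * height)) ** 0.5
    w = max(step, round(width * scale / step) * step)
    h = max(step, round(height * scale / step) * step)
    return w, h

print(bucket(1664, 2432))  # (832, 1216)
print(bucket(2048, 2048))  # (1024, 1024)

Which lines up with the 1664x2432 -> 832x1216 and 2048x2048 -> 1024x1024 reductions mentioned above.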
I decided to Heck with PL'ing, I'm going to upload a couple of Stable Diffusion 3 images.
Cool stuff, thanks for sharing. Multiple characters composed in a scene and proper lettering are two of the big edges Dalle-3 had, good to see open source closing the gap. Is it a closed beta or is it possible to sign up somewhere?
 
In my experience this is not the case. I get much better results by generating an image in common aspect ratios (eg. 832x1216, 1024x1024, 768x1344, etc) and then upscaling. ... here's the Kim Possible example image, then the same parameters with no upscaling and a base image size of 1216x1792, then the same thing but with base SDXL.
Got to be honest, I prefer the second one. Although I would need more hands.

Cool stuff, thanks for sharing. Multiple characters composed in a scene and proper lettering are two of the big edges Dalle-3 had, good to see open source closing the gap. Is it a closed beta or is it possible to sign up somewhere?
It was a closed beta, but their hosted version is available via the API now. You get a handful of credits when you sign up and then have to buy more at $10 per 1,000, which is good for around 150 standard generations. It's worth it if you just feel like playing around. You can use a web interface hosted on Google Colab, but I don't think you should; just use a scrap of Python or JavaScript or even curl on your local machine. They have example scripts in different languages here:

Authentication is just via the API key, so really all you need to do is paste it into one of their scripts and run it, then build on the script if you want to do batches or anything else. Not sure what metadata is in the returned images; I stripped it all out of mine.
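For reference, the "scrap of Python" really is about this much. This sketch assumes the requests library, an API key in a STABILITY_API_KEY environment variable, and that the endpoint and field names still match Stability's v2beta stable-image docs; the prompt is just a placeholder:
Code:
# Minimal SD3 text-to-image call against Stability's hosted API.
# Assumes `pip install requests` and STABILITY_API_KEY set in the environment.
import os
import requests

resp = requests.post(
    "https://api.stability.ai/v2beta/stable-image/generate/sd3",
    headers={
        "authorization": f"Bearer {os.environ['STABILITY_API_KEY']}",
        "accept": "image/*",
    },
    files={"none": ""},  # forces multipart/form-data, which this endpoint expects
    data={
        "prompt": "Adam West-era Batman arguing with Superman, comic book cover",
        "output_format": "png",
        # optional fields like model, aspect_ratio, seed, etc. per their docs
    },
    timeout=120,
)
resp.raise_for_status()
with open("sd3_output.png", "wb") as f:
    f.write(resp.content)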
 
Double-post but that's because nobody else has said anything - allowed!

So here's something which I noticed in the SD3 API documentation. Search and Replace! Probably old news to others(?) but new to me and furthermore always good to see how new models can handle something like this.

So search and replace - basically inpainting by prompt rather than a mask. First example blew me away - the image on the left was a very simple prompt to SD3: "Blond man in a field". Then I sent that image back to SD3 and told it the search term was "man" and the new prompt was "blonde woman in a field". A few seconds later the machine spat out the image on the right.

search_and_replace_1.png
Look how the cornstalks in the foreground are in the same place and the distant scenery is the same. And how neat a job it has done of replacing the dude with a lass. Dong-Gone wishes this worked in reality!
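For anyone wanting to try it, the call is nearly identical to a plain generation, just a different endpoint plus a search_prompt field alongside the source image. Same assumptions as the sketch a few posts up (requests, STABILITY_API_KEY, field names per the v2beta edit docs; filenames are placeholders):
Code:
# Stability v2beta search-and-replace: inpainting driven by a search prompt
# instead of a hand-drawn mask.
import os
import requests

resp = requests.post(
    "https://api.stability.ai/v2beta/stable-image/edit/search-and-replace",
    headers={
        "authorization": f"Bearer {os.environ['STABILITY_API_KEY']}",
        "accept": "image/*",
    },
    files={"image": open("blond_man_in_field.png", "rb")},
    data={
        "prompt": "blonde woman in a field",  # what to put there
        "search_prompt": "man",               # what to find and replace
        "output_format": "png",
    },
    timeout=120,
)
resp.raise_for_status()
with open("blonde_woman_in_field.png", "wb") as f:
    f.write(resp.content)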


I wanted to mix things around a little and this brought to mind the old Stalin meme of someone disappearing in a photograph. So to do something a little more idiosyncratic I asked SD3 for Stalin walking along with Batman. And then I gave its image back to it and told it to replace Batman with Superman.

Oo-er, there's something not quite right with our Red Son in this one.
search_and_replace_2.png
Still, interesting. First off this is quite different to regular in-painting by mask. It gave me an old style comic-book Superman and then extrapolated to make Stalin fit the same style! Not sure if that's good or bad. Depends what you're trying to do, really.
To be clear, I specified nothing about the style, background or anything else. And the reason I'm just trying this on SD3's own images to begin with is because I thought that might be giving it an easier time of things. Figured I'd begin simple.

But given the change of style I decided for the next attempt I would specify a style. So this time I asked for a photograph of Stalin walking by a blond man. And then much like the cornfield I asked it to replace that with a blonde woman, again specifying a photograph.
search_and_replace_3.png

Well, that's all kinds of disturbing, and I hope that @Susanna doesn't put me on a Soviet naughty list because of it. However, there's still a lot that's impressive about this. The background is unaltered, and it's seamless in its replacement of one subject with another. It has again adapted Stalin's style to fit the new subject.

I really don't know quite what to make of the way it re-interprets other elements of the image to match the new subject. What is notable is that it does this less the clearer I am about the style and particulars of the image: with the more detailed photograph prompt it doesn't change things so wholesale, because both before and after are explicitly set to similar styles. I imagine the more precise I am, the less it tends to alter things outside the subject of the search and replace. Still, I found this interesting and hoped someone else would. It has some very interesting potential.
 