Stable Diffusion, NovelAI, Machine Learning Art - AI art generation discussion and image dump

 
Flux is like the opposite of whatever Bing is called now: it can do real-world people but not fictional ones.
View attachment 6319293

edit: I take it back, this is "perfect"
View attachment 6319298
Give it a week or two and someone will train a Pepe LoRA. I have been playing around a bit with training LoRAs for Flux, and it's actually less bad than I was expecting. It seems like you can quantise to int8 and train LoRAs on a 3090/4090 with minimal quality loss. I'm getting around 3 it/sec training on a 4090, so it takes an hour or two, but it's definitely doable.
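For the curious, here's roughly what that setup looks like under the hood. This is just a toy sketch in plain PyTorch, not any particular trainer's code, and the layer size is made up: the big base weights sit frozen in int8, and only the small low-rank adapter matrices get trained.
Code:
# Toy sketch of the int8-base + LoRA-adapter idea; not any specific trainer's code.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=16, alpha=16):
        super().__init__()
        # quantise the frozen base weight to int8 with a crude per-tensor scale (for illustration)
        w = base.weight.detach()
        self.scale = w.abs().max() / 127.0
        self.register_buffer("w_int8", torch.round(w / self.scale).to(torch.int8))
        self.bias = None if base.bias is None else nn.Parameter(base.bias.detach(), requires_grad=False)
        # only these small low-rank matrices are trained, in full precision
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        w = self.w_int8.to(x.dtype) * self.scale          # dequantise the frozen base on the fly
        out = nn.functional.linear(x, w, self.bias)
        return out + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

# e.g. wrap one projection of a (hypothetical) transformer block:
layer = LoRALinear(nn.Linear(3072, 3072))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only the LoRA params train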

CivitAI is also offering LoRA training as a service, but I wasn't super impressed by the quality of the ones people generated with their service for SDXL. Maybe that was just people being clueless and using bad parameters or datasets though.

 
That nomad indie hacker-whatever Pieter Levels claims he's making about $100k a month from just two shitty SD apps, photoAI and interiorAI. He's so lazy that both have the same look and UX despite being for two completely different markets. But seriously, who is paying for this crap? It's not cheap either.

Tried some interior design app with nearly 5 stars on the App Store, and this shit just generated a random, overstylized picture full of artifacts that looked nothing like any room in my house. Who the fuck likes this crap? Are all the reviews fake now?
 
After spending two hours in the missing-DLL, version-downgrade and dependency hell known as Python, I finally got ComfyUI and Flux working.
Here's my thread tax:
View attachment 6328402
Congratulations! The good news is once you've got it working, you can usually just re-do it quickly if you ever need to. The most complex thing I have to do with my set-up now is occasional manual ROCm updates and that's only because I had the temerity to get an AMD card.

Also, prompt adherence in Flux is great. Trying to do an image like you just did in Stable Diffusion would be pretty tricky, imo.
 
  • Informative
Reactions: Egregore
The most complex thing I have to do with my set-up now is occasional manual ROCm updates and that's only because I had the temerity to get an AMD card.
I tried to wrangle ROCm for hours to get Flash Attention installed for my 6800 XT on Mint 21.1, but I eventually gave up. I even tried in Docker and the build still fails with an HTTP 404 error or something to that effect, and that was after 4 hours of waiting for Docker pulls, building, and reinstalling PyTorch+ROCm with pip. I also tried building llama.cpp for ROCm and it inexplicably just uses the CPU for inference.
 
  • Feels
Reactions: Post Reply
An interesting thing to attempt is letting an AI rewrite the prompt. I discounted automatic prompting with SD, largely because in my experiments it simply did not lead to good results unless you fed the AI all the right keywords, and at that point you might as well write the prompt yourself. It seems to work well with Flux, though. If you consider that the training images were probably captioned by an AI in a conversational way, as theorized earlier in this thread, it makes sense that another AI would find the "right language" (perhaps GPT-isms?) to get exactly what you asked for.

Prompting Flux is very different from prompting SD and all the other models that came before (and perhaps MJ and DALL-E too, I've never used those). For optimal results, instead of hunting for the right keywords (which often simply do not exist), it makes much more sense to just describe what you want. I know this has already been said on this very page; I'm repeating it for the sake of completeness.
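If anyone wants to automate the "let an AI rewrite the prompt" bit, the general shape of it is something like this. The model name and the system prompt wording are just my guesses, and any chat-capable LLM (local or API) would work the same way:
Code:
# Rough sketch of LLM prompt "upsampling": turn a short idea into a long, descriptive prompt.
# Model name and wording are placeholders, not a recommendation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def upsample_prompt(short_prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Rewrite the user's image idea as one long, highly descriptive caption: "
                "describe the subject, background, composition, lighting and style in plain prose."
            )},
            {"role": "user", "content": short_prompt},
        ],
    )
    return response.choices[0].message.content

# print(upsample_prompt("a frog wizard reading a newspaper in a diner"))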
Ok I wrote all of this out and it's so long and autistic I'm just going to spoil it. It's a few insights into the architecture of Flux and why I think this style of conditioning works.

In the AI space, you can usually get a lot of the answers if you read the papers the researchers put out before they release a model. Black Forest Labs was founded by three of the head researchers at Stability AI, and if you read the Stable Diffusion 3 paper they wrote right before leaving Stability, you'll find it likely describes a lot of how Flux works and why GPT captions are more effective. I think it boils down to two reasons, one of which is in that paper and the other isn't. The first is that it's a multimodal diffusion transformer, and it's big, as described in the paper. I do not think we'd be seeing the same prompt adherence or effectiveness of GPT prompts without adopting that architecture. Here's some autism that explains why I think so:
Just like SD3, Kolors, and AuraFlow, Flux is a multimodal diffusion transformer instead of a UNet. Transformers' attention mechanism allows them to focus on different parts of the input and capture more complicated dependencies (in a language model that means different words in a sequence of text; in an image model it can mean different patches of an image). With the multimodality, you get better internal representations of the images/text because the model has separately trained weights for images and text.
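To make the "joint attention over both modalities" idea concrete, here's a toy sketch in plain PyTorch. This is illustrative only, not Flux's actual block, and the dimensions are arbitrary: each modality gets its own projection weights, but the attention itself runs over the concatenated text and image tokens, so every patch can look at every word and vice versa.
Code:
# Toy sketch of a multimodal joint-attention layer; not Flux's real code.
import torch
import torch.nn as nn

class ToyJointAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.heads = heads
        # separate QKV weights per modality -- the "separately trained weights" idea
        self.qkv_text = nn.Linear(dim, dim * 3)
        self.qkv_img = nn.Linear(dim, dim * 3)
        self.out_text = nn.Linear(dim, dim)
        self.out_img = nn.Linear(dim, dim)

    def forward(self, text_tokens, img_tokens):
        b, t, d = text_tokens.shape
        _, i, _ = img_tokens.shape
        h = self.heads
        # project each modality with its own weights, then concatenate the sequences
        q_t, k_t, v_t = self.qkv_text(text_tokens).chunk(3, dim=-1)
        q_i, k_i, v_i = self.qkv_img(img_tokens).chunk(3, dim=-1)
        q = torch.cat([q_t, q_i], dim=1).view(b, t + i, h, d // h).transpose(1, 2)
        k = torch.cat([k_t, k_i], dim=1).view(b, t + i, h, d // h).transpose(1, 2)
        v = torch.cat([v_t, v_i], dim=1).view(b, t + i, h, d // h).transpose(1, 2)
        # one attention pass over the joint sequence of words and image patches
        attn = nn.functional.scaled_dot_product_attention(q, k, v)
        attn = attn.transpose(1, 2).reshape(b, t + i, d)
        return self.out_text(attn[:, :t]), self.out_img(attn[:, t:])

# text_out, img_out = ToyJointAttention()(torch.randn(1, 8, 64), torch.randn(1, 256, 64))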

All of this, plus the fact that they opted for a higher-dimensional latent space to get better detailing, means the models need to be a lot bigger. The associated expense means they need to extract a lot more information from the training data in the same or fewer steps, and training is already really expensive. So the actual advancement in AuraFlow, SD3, and Flux is that they are rectified flow transformers, which maximizes the efficiency and quality of diffusion model training by using an optimized path between the random noise distribution and coherent data (recall that text-to-image models are trained by adding noise to an image and making the model remove it, but HOW we add and remove this noise is somewhat arbitrary and can be improved).
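Here's the rectified flow objective in sketch form, in plain PyTorch with a toy model(x_t, t) that predicts velocity (obviously not BFL's actual training code): the noising path is just a straight line between data and noise, and the model learns the constant velocity along it.
Code:
# Minimal sketch of the rectified-flow training objective; `model` is any toy velocity predictor.
import torch

def rectified_flow_loss(model, x0):
    """x0: a batch of clean latents/images."""
    noise = torch.randn_like(x0)                            # the "pure noise" endpoint
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)))    # random time in [0, 1], broadcastable
    x_t = (1 - t) * x0 + t * noise                          # straight-line path between data and noise
    target_velocity = noise - x0                            # constant velocity along that straight line
    pred_velocity = model(x_t, t)
    return torch.nn.functional.mse_loss(pred_velocity, target_velocity)

# e.g. with a dummy model: loss = rectified_flow_loss(lambda x, t: x, torch.randn(4, 3, 64, 64))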

While a single mode diffusion transformer still might have been able to achieve this with enough scale, I think the multimodality made it possible at a scale that Stability and Black Forest could achieve.
The tl;dr is the architecture allows for better representations and finer control over patches of the generated image.

But generally Flux is still better than SD3 in terms of its prompt adherence, while sometimes really needing a detailed prompt to work best, whereas SD3 can be a bit more flexible in that regard. So this leads to the second reason for the GPT-style captions. Stability trained SD3 with 50% of the images captioned by a VLM (vision-language model) and 50% with their original captions. From the paper:
"Betker et al. (2023) demonstrated that synthetically generated captions can greatly improve text-to-image models trained at scale. This is due to the oftentimes simplistic nature of the human-generated captions that come with large-scale image datasets, which overly focus on the image subject and usually omit details describing the background or composition of the scene, or, if applicable, displayed text (Betker et al., 2023). We follow their approach and use an off-the-shelf, state-of-the-art vision-language model, CogVLM (Wang et al., 2023), to create synthetic annotations for our large-scale image dataset. As synthetic captions may cause a text-to-image model to forget about certain concepts not present in the VLM’s knowledge corpus, we use a ratio of 50 % original and 50 % synthetic captions."

The paper they are referring to here is a little treat from OpenAI in which they show how images for Dall-E 3 were captioned (but nothing else about how Dall-E actually works). SD3 prompt following is pretty good, but even in the paper you can see that the prompt following boost from their image captioning was not that impactful, and it's not even close to how far OpenAI went with the idea. From the Dall-E paper:
"To test our synthetic captions at scale, we train DALL-E 3, a new state of the art text to image generator. To
train this model, we use a mixture of 95% synthetic captions and 5% ground truth captions."
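Neither paper ships code for this, but the caption mixing is easy to picture. In this hypothetical sketch the transformers pipeline and the BLIP checkpoint are stand-ins I picked, not what OpenAI or Stability actually used: caption most images with a VLM, and keep a small share of the original alt-text captions so rare concepts don't get forgotten.
Code:
# Hypothetical sketch of a 95/5 synthetic/original caption mixture, as described in the quotes above.
# The VLM checkpoint is a stand-in; swap in whatever captioner you like.
import random
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-large")

def build_training_caption(image, original_caption, synthetic_ratio=0.95):
    if random.random() < synthetic_ratio:
        return captioner(image)[0]["generated_text"]   # long-ish, descriptive synthetic caption
    return original_caption                            # keep some ground truth so concepts aren't forgotten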

And they mention what may be a possible hint about why Flux specifically might need longer captions:
"The above experiments suggest that we can maximize the performance of our models by training on a very high percentage of synthetic captions. However, doing so causes the models to naturally adapt to the distribution of long, highly-descriptive captions emitted by our captioner. Generative models are known to produce poor results when sampled out of their training distribution. Thus [...] we will need to exclusively sample from them with highly descriptive captions. [...] Models like GPT-4 have become exceptionally good at tasks that require imagination [...] It stands to reason that they might also be good at coming up with plausible details in an image description.
Given a prompt [...] we found that GPT-4 will readily "upsample" any caption into a highly descriptive one."

Basically, aside from being much bigger than SD3, I think Black Forest Labs leaned way more into how OpenAI did it and relied extensively on VLM captioned images and GPT upsampled captions, though maybe not at the 95% mixture.

All of it can be summarized as: It's a transformer that was trained on GPT4 captions to a much much greater extent than SD3.
 
I tried to wrangle ROCm for hours to get Flash Attention installed for my 6800 XT on Mint 21.1, but I eventually gave up. I even tried in Docker and the build still fails with an HTTP 404 error or something to that effect, and that was after 4 hours of waiting for Docker pulls, building, and reinstalling PyTorch+ROCm with pip. I also tried building llama.cpp for ROCm and it inexplicably just uses the CPU for inference.
Hmmmm.

I mean part of me wants to offer to help because I might be able to, but the realities of trying to help debug something anonymously over forum messaging are tricky. I mostly just followed along with the instructions for installing it from their repos here:
 
  • Informative
Reactions: Jones McCann
I tried to wrangle ROCm for hours to get Flash Attention installed for my 6800 XT on Mint 21.1, but I eventually gave up. I even tried in Docker and the build still fails with an HTTP 404 error or something to that effect, and that was after 4 hours of waiting for Docker pulls, building, and reinstalling PyTorch+ROCm with pip. I also tried building llama.cpp for ROCm and it inexplicably just uses the CPU for inference.
I had some difficulty getting ROCm running on Mint 21.3 with a 7800 XT to use with llama.cpp/Stable Diffusion, but I figured out what I was doing wrong. Follow the instructions @Overly Serious posted first. Mint had a package that didn't install for me, so after getting ROCm installed, run:
Code:
sudo apt install rocm-hip-sdk
You should then be able to run the command 'rocminfo' in your console; if it runs, ROCm is installed correctly.
Then go to your .bashrc file and add these lines:
Code:
export HSA_OVERRIDE_GFX_VERSION=10.3.0
export HCC_AMDGPU_TARGET=gfx1030
export ROCM_PATH=/opt/rocm   # or wherever your ROCm install actually lives
This should make ROCm run on RX 6000 cards. If you have an AMD CPU with integrated graphics, you might need to add another command so it uses your discrete GPU instead of the integrated one. I hope this helps.
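Once those exports are in place, a quick sanity check I'd suggest (not part of the original instructions) is asking the ROCm build of PyTorch whether it actually sees the card:
Code:
# Quick check that the ROCm build of PyTorch sees the GPU after the exports above.
import torch

print(torch.version.hip)            # should print a ROCm/HIP version string, not None
print(torch.cuda.is_available())    # ROCm builds still report through the torch.cuda namespace
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # e.g. your RX 6800 XT / 7800 XT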
 
  • Like
Reactions: Overly Serious
That nomad indie hacker-whatever Pieter Levels claims he's making about $100k a month from just two shitty SD apps, photoAI and interiorAI. He's so lazy that both have the same look and UX despite being for two completely different markets. But seriously, who is paying for this crap? It's not cheap either.

Tried some interior design app with nearly 5 stars on the App Store, and this shit just generated a random, overstylized picture full of artifacts that looked nothing like any room in my house. Who the fuck likes this crap? Are all the reviews fake now?
His target market is all the dumb niggercattle who don't know what a filesystem is, let alone how to set up SD.
 
Hmmmm.

I mean part of me wants to offer to help because I might be able to, but the realities of trying to help debug something anonymously over forum messaging are tricky. I mostly just followed along with the instructions for installing it from their repos here:
I just wanted to vent mostly because I spent 8 hours pulling my hair out. I will try that, thank you.
This should make ROCm run on RX 6000 cards
I'll give that a try. I did get inference working in PyTorch, but the issue was specifically installing Flash Attention so I could speed up TTS inference; it wouldn't cooperate with building from source or with installing the wheel from the ROCm Flash Attention release on GitHub. My plan is to try again in the future, but if it's still a pain I'm going to buy a 3090 or something similar, which I assume is easier to get working reliably. I thought about renting a server specifically for ML, but it's expensive and I already have a capable system for what I want to do.
 
I just wanted to vent mostly because I spent 8 hours pulling my hair out.
Oh, we hear you! We've been there!

Good luck!

I'll give that a try. I did get inference working in PyTorch, but the issue was specifically installing Flash Attention so I could speed up TTS inference; it wouldn't cooperate with building from source or with installing the wheel from the ROCm Flash Attention release on GitHub. My plan is to try again in the future, but if it's still a pain I'm going to buy a 3090 or something similar, which I assume is easier to get working reliably. I thought about renting a server specifically for ML, but it's expensive and I already have a capable system for what I want to do.
I upgrade very rarely, and when I bought my 7900 XT I knew much less about this stuff, at least in hands-on terms (theory I was fine with). I thought to myself: "It's got 20GB of VRAM and anything Nvidia will cost much more for the same amount. I can accept slightly worse AI performance for the extra RAM." Woefully uninformed decision. I'd have saved myself many hours and an OS install (you need Linux for current ROCm) if I'd just gone Nvidia. And to rub a little salt in the wound, the price has plummeted from what I paid near release to around £620. It's probably one of the most ill-informed tech purchases I've ever made, and I'm usually pretty cautious and discriminating about this stuff.

It's not been wholly wrong - it's been decent for games (which I barely play) and the 20GB VRAM has allowed me to do things like run Flux Dev (barely). AMD are waaaaaay closer to Nvidia than they were when I got it, in terms of software support. But when you start that far behind "way closer" still doesn't mean close. Gaining but not catching.

I'd sell it and get something else, but the massive drop in retail price means it'd be such a loss at this point that it's better to just bear with it.

If you do want to rent a server online, though, I'm using Runpod.io, which I've sperged about before so I won't do it again. It was a lot cheaper than I expected and great for trying stuff out before buying actual hardware, even if you do end up buying. £10 of playing around with things for a week gives you a good idea of what you actually need, which can save you money in the long run. Plus it's great fun to have 48GB of VRAM at your fingertips, even if only temporarily. ;)
 
I feel the Radeon struggle too. I recently spent a whole day trying out different distros and tearing my hair out with ROCm. In the end, none of my Linux/SD setups worked right. At least I can still mess around with ZLUDA on Windows.

ZLUDA + HiDiffusion gives me about 0.3 iterations per second generating images at 1664x2432 resolution with my 6900XT. It feels glacial but it does give good results eventually.
 
  • Feels
Reactions: Stalphos Johnson
I recently spent a whole day trying out different distros and tearing my hair out with ROCm. In the end, none of my Linux/SD setups worked right.
I'm glad I've finally figured out how to get it working on my setup, but I'm real wary of updating from Mint 21.3 to 22. It took me a couple of weeks of on-and-off troubleshooting before I figured out what was going wrong. Even now, I had to work out what was causing random crashes in the Automatic1111 UI before I found this launch script by digging through the ROCm GitHub issues page. I'm glad it's working, and I can generate 2048x2048 and bigger images with Tiled VAE, but I want to make sure it never breaks again so I don't have to deal with it. My fault for going AMD, but I was using the card mostly for gaming before I got back into SD, and the 7800 XT is a good card for that, especially on Linux.
 