For what it's worth, on a 4GB GPU and only 16GB of RAM, I can load the Schnell version of Flux and get an image in 1 step in a little over 2 minutes. Had to load EVERYTHING as FP8. The 20 minutes on the first gen was my system dying as it pushed everything into swap for a good 10-15 minutes. 1 step vs 4 steps isn't very different composition-wise and seems to have diminishing returns on detail.
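A rough sketch of why that first generation went to swap, assuming approximate parameter counts of about 12 billion for the Flux transformer and about 4.7 billion for the T5-XXL text encoder (both figures are assumptions for illustration, not measurements from this setup). At FP8 each parameter takes one byte, so the weights alone come close to the whole 16GB of system RAM and are several times the 4GB of VRAM:

```python
# Back-of-the-envelope weight memory for Flux at a given precision.
# Parameter counts are approximate assumptions, not measured values.
APPROX_PARAMS_BILLIONS = {
    "flux_transformer": 12.0,
    "t5_xxl_text_encoder": 4.7,
}


def weights_gb(params_billions, bytes_per_param):
    """Weight size in GB (1 GB = 1e9 bytes) for a given parameter count."""
    return params_billions * bytes_per_param


def total_weights_gb(bytes_per_param):
    """Sum the weight size of every component listed above."""
    return sum(
        weights_gb(count, bytes_per_param)
        for count in APPROX_PARAMS_BILLIONS.values()
    )


if __name__ == "__main__":
    vram_gb, ram_gb = 4, 16  # the machine described in the post
    for label, nbytes in (("FP8", 1), ("FP16", 2)):
        total = total_weights_gb(nbytes)
        print(f"{label}: ~{total:.1f} GB of weights "
              f"vs {vram_gb} GB VRAM + {ram_gb} GB RAM")
```

This ignores activations, the VAE, the CLIP encoder and whatever the OS is using, so the real pressure is higher than the estimate.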
Here are some more of the Dall-E style images I was able to make. It still needs prompt finagling and some luck to get good results, but I assume Dall-E/Bing is doing a lot of stuff behind the scenes to your prompt anyway. It doesn't seem too censored, more like it's just missing a lot of information rather than purposely gimped. I feel like finetunes and LoRAs will be big for this once it gets more optimized and further refined, like how SDXL took a while to become the main model people use.
You guys should try PonyDiffusion, on anything RTX it runs great, probably the best model out there.
The prompting is weird (you only use rule34/danbooru tags) but the end result is great. Loras for it are great too.
There are a few people on the GitHub for it demanding to know why they would make a model that requires so much VRAM, and a reply that is basically "LOL - poor!". Honestly, I think Flux shows what happens when you stop trying to include everybody. The fact it eats up 20+ GB of VRAM is a factor in why it's so good. I haven't tried running it locally - only on Runpod, where I'm paying around $0.88 per hour for the privilege. Eh - in return I get 48GB of VRAM and the ability to generate images at fp16 in around 8 seconds per image. Here are my first impressions:
My quick random female assassin with a crossbow. No thought, no particular fanciness. And wow - impressive realism and atmosphere. Keep in mind this is a pure base model.
Okay, let's see how it does with multiple people. Asked for male and female runners standing side by side (incidentally, prompts should be in the image metadata if you want the full text).
Great - understands how to do two people. It made the man gay though. Let's see if we can restore his heterosexuality and test its ability to understand associating directives with one particular figure. I told it to put a white t-shirt on the man only.
Voila - birth rate restored and Flux proving it can isolate prompt details.
Let's try particular poses. I asked for the Archangel Michael. He was to be holding aloft a flaming sword, his wings were to be unfurled and the view was to be from a low angle. I'd asked for high detail and it pretty much delivered. I continue to be very impressed.
What about different aspect ratios? I couldn't find any guidelines from them on aspect ratios and resolutions but I took a stab at using the same ones as SDXL. Same prompt, different aspect. Wow - even better. And one for the artists - Flux brought in a little contrapposto. Flame on, Michael!
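For anyone wanting to try the same thing, here is a minimal sketch of picking SDXL-style dimensions for a given aspect ratio. It assumes a pixel budget of about 1024x1024 and sides snapped to multiples of 64, which is the usual SDXL convention; whether Flux prefers the same buckets is a guess, as said above. The function name `dims_for_aspect` is made up for this example.

```python
import math


def dims_for_aspect(aspect_w, aspect_h, target_pixels=1024 * 1024, multiple=64):
    """Return (width, height) close to target_pixels for the given aspect ratio.

    Both sides are snapped to the nearest multiple of `multiple` (assumed 64).
    """
    ratio = aspect_w / aspect_h
    ideal_h = math.sqrt(target_pixels / ratio)
    ideal_w = ideal_h * ratio

    def snap(value):
        # Round to the nearest multiple, never going below one multiple.
        return max(multiple, round(value / multiple) * multiple)

    return snap(ideal_w), snap(ideal_h)


if __name__ == "__main__":
    for w, h in ((1, 1), (16, 9), (9, 16)):
        print(f"{w}:{h} ->", dims_for_aspect(w, h))
```

Snapping means the final ratio is only approximately the one requested, so a slight crop may be needed if the exact ratio matters.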
So there's no reference pic for Archangel Michael, let's try for some known real-world people. Someone already posted a Trump above, but was their Trump riding a dinosaur? I think not!
Let's broaden the political figures out. Can we get a Kamala Harris in there?
What about a cyborg Putin with "half his face covered with metal and electronics"? Yep - followed the prompt very well understanding what I meant and creating this foreboding Phantom of the Kremlin. Z-Man beware!
A smattering of fictional and comic book characters. Does it know who Supergirl and Batgirl are with no Lora or fine-tuning? Yep - it's comics aware. (Nice detail btw, I asked for sunrise and sunset, iirc).
What about WH40K? Surely it won't know what a Space Marine is? By the Emperor, it does (more or less). Also, nice text insert. I haven't been posting experiments with that only because other people are focusing on it already. Got to remark on how good it is though. Word for word and exactly where I asked it to be.
Alright - situational awareness. I asked for her to be in the cockpit of a fighter plane, sky behind her and facing the viewer. For a quick off-the-cuff attempt, no img-to-img, just a casual text prompt, this is very impressive. Seriously, it knows what a cockpit is and gave me the background and positioning of her that I wanted.
Let's give it a little clothing test and see how it follows direction there. One 18th Century vampire for @Susanna coming up. Specifically asked for 18th Century dress.
No, no, no - more vampire! That doesn't scare me at all. I added details for red velvet and iirc gold embroidery and redder eyes. Ah - now that's someone I don't want to meet coming home at night. I was asking for soft-focus / blurred background and it seemed to understand that.
Final test for prompt following with clothes and style and background - expensive blue armour, braided hair, interior with stone columns, out of focus, posed with head turned towards viewer - nailed every aspect of the prompt.
Quick DEI check - ancient Egyptians and ancient Greek philosophers. Google thinks both of these are sub-Saharan Africans. What does Flux think? Both look pretty spot on to me. Also love its clothing choices for both of these as I didn't specify anything. More men should wear robes and have gold headdresses.
I've been doing all people. Quick sanity test for a landscape - Mountain view looking downwards with snow and rocks:
Oh, and give me a dragon, specifically flying, specifically in the distance:
Love it - can you make it a pencil or charcoal sketch please?
Nice. But what about oil paintings? And how about we test TASTEFUL nudity at the same time.
Wow - pretty much exactly what I asked for in the prompt - reclining on a bed, Renaissance style oil painting, long red hair, nude. It picked out an artistic pose all by itself - wouldn't want to have been the model having to hold that pose for Rembrandt for a couple months! Nipples a little odd but again - base model. And I really like the effect it places at the edge of the picture to show it's a canvas.
Okay, challenge mode - multiple figures in a particular pose interacting with varying emotions. I specifically wanted Batman to be annoyed / angry and Supergirl to be smiling / happy, I needed them seated and arm-wrestling so hands clasped. Two attempts at different realism levels:
Holy crap it did it! First off it accurately attributed different facial expressions to the requested characters. Historically that's been quite tricky. Batman looks really frustrated in the first one. So much so that he's cheating and using an extra finger, but he still can't win. I felt I wanted a greater discrepancy in their body sizes so I specified to make Batman more muscular and Supergirl smaller and skinnier. It worked (that is the second of the more realistic versions; in the first they were closer in size). Batman appears to be using two hands in the second one but I'll allow it. It took multiple attempts to get the two. It did keep wanting to put a Batman mask on Supergirl but not always. Facial expressions were variable. I picked out the best.
Coming to the end now and just a few odd experiments that I wanted to try. Something I could never properly get out of anything was spines or a crest. I made a lizardman and specified "a yellow crest". Nailed it first time: (all the other details, colour of scales, holding spear, savage clothing style, muscles, emerging from a swamp, all perfectly followed the prompt as well)
A few innocent little children's book illustrations. Mixed results but am sure could get closer to what I wanted with actual effort.
I am a guy so of course I did try my hand at making a beautiful woman. Thankfully I have wholesome tastes.
and interestingly you can contrast that with a previous one made with Stable Diffusion (can't remember which checkpoint exactly but was a fine-tuned SDXL)
Ehhh, okay - two slightly more cheesecake ones but mainly just to play around with some fantasy art and facial expressions. What actually got me though was how well it interpreted "blood covered". Look at the way it drips from the barbarian's axe:
Flux is a spectacular success, imo. And a testament to what you can achieve when you don't restrict yourself to the lowest common denominator. I hope people found this interesting.
Might come back to XL eventually when I make XL versions of LoRAs I've made, but 1.5 is still pretty decent. I hear AutismMix is a pretty nifty derivative of Pony Diffusion, though.
Flux is going very well for me, but my Mr. Popos are being stymied by it not really doing "DBZ style", I just get actual obese black men with red lipstick. Ahh well.
I would make a pithy comment but I've never seen Top Gun and know only that it is about fighter pilots. And there might be something to do with making cocktails. However, if Wes Anderson did make a Top Gun movie I suspect it would be extremely weird. Tonnes of slow dialogue, probably a child would accidentally end up flying the jet whilst her mum and dad tell her over the radio that it's okay and she can do this. The love interest would be played by Willem Dafoe.
Okay, this image is a little catty but I asked Flux to make me an image of a woman lying on the grass, just to poke fun at SD3.
There are a few people on the GitHub for it demanding to know why they would make a model that requires so much VRAM, and a reply that is basically "LOL - poor!". Honestly, I think Flux shows what happens when you stop trying to include everybody. The fact it eats up 20+ GB of VRAM is a factor in why it's so good.
I'm not sure if there's a thread for LLMs, but the same adage holds, and I think that's the reason Meta axed any Llama models other than the smallest and largest. There's simply no comparing a properly tuned 70B with a properly tuned 8B, and that's even with Llama3 pushing 8K context on the 8B. Quantized down to 4-bit, the 70B still needs more VRAM than any single consumer card can provide. You are looking at a dual card system at that point, or going all in and running an ESC4000 in your basement, which will make it sound like a jet hangar.
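The arithmetic behind that, as a quick sketch: weight size is roughly parameter count times bits per weight divided by 8. The 24 GB figure is the largest consumer card mentioned in this thread; the estimate covers weights only and leaves out the KV cache and runtime overhead, so real requirements are higher.

```python
import math

LARGEST_CONSUMER_CARD_GB = 24  # assumption: the 24 GB card referred to in this thread


def quantized_weights_gb(params_billions, bits_per_weight):
    """Approximate weight size in GB (1 GB = 1e9 bytes), weights only."""
    return params_billions * bits_per_weight / 8


def cards_needed(params_billions, bits_per_weight, card_gb=LARGEST_CONSUMER_CARD_GB):
    """Minimum number of cards needed just to hold the weights."""
    return math.ceil(quantized_weights_gb(params_billions, bits_per_weight) / card_gb)


if __name__ == "__main__":
    for size in (8, 70):
        gb = quantized_weights_gb(size, 4)
        print(f"{size}B at 4-bit: ~{gb:.0f} GB of weights, "
              f"{cards_needed(size, 4)} card(s) of {LARGEST_CONSUMER_CARD_GB} GB")
```

So a 4-bit 70B comes to about 35 GB of weights, which is where the dual card setup comes from, while a 4-bit 8B fits on almost anything.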
I hope we see 32 GB consoomer GPUs soon. Also, if it wasn't mentioned in this thread, the RTX 5090 is expected to bring 28 GB (448-bit) instead of 24 GB (384-bit).
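The link between bus width and capacity in those numbers can be sketched like this, assuming one memory chip per 32 bits of bus and 2 GB per chip. Both are assumptions for illustration; actual boards can use different chip densities or two chips per channel.

```python
def vram_gb(bus_width_bits, gb_per_chip=2, bits_per_chip=32):
    """VRAM capacity implied by bus width, assuming one chip per 32-bit channel."""
    chips = bus_width_bits // bits_per_chip
    return chips * gb_per_chip


if __name__ == "__main__":
    for bus in (384, 448, 512):
        print(f"{bus}-bit bus -> {vram_gb(bus)} GB")
```

Under the same assumptions, a 32 GB card would need a 512-bit bus or denser chips.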
I'm using the web version of Flux and I'm having trouble generating exactly what I want. Any pointers?
Prompt:
Night time moody lighting, a woman in all black, eyes blindfolded with a black scarf, using both hands holding an ancient broadsword pointing up, with both hands on the hilt holding it up, surrounded by darkness, there is no forest, there is no beach, she is skipping atop water in a pond with waves dispersing at each graceful ballerina step from left to right, the water is clear and beautiful, atmospheric dark shot, camera is looking down at her.
Result:
I can't get her to hold the sword exactly as described and the background to be nothingness. Any advice for lighting would be helpful. Thanks.