Google Imagen AI image generator

This and DallE scare the shit out of me and I want to play with them. I'm really having trouble believing it's getting images that coherent from text prompts and it makes me wonder if this is just some sort of elaborate smoke and mirrors using an enormous dataset and segregation trickery. I'm also curious if the training methods are applicable to audio and how fast the synthesis is.

Either way the next couple years are going to be interesting, especially when this winds up applied to music.
 
We need to see if it chokes or thrives when you feed it TMI prompts. Does going into autistic detail result in something more complex and still accurate? Can you say "divided into four quadrants, in the top left..." and create four separate and distinct scenes in one image?
 
Eh, if Google did it then I figure by the end of the decade we'll have something similar available to the masses. Can't hoard shit forever.
It will be accessible in the same way Google Search is right now. No, you can't host it off your own machine and you definitely can't have access to the underlying technology. The masses will be able to query it and those queries will be controlled and filtered down, just like searching is right now but even more strict.

When people say "the scientists will leak it", they are missing the point of neural networks. Plenty of them are open source already, but what is missing is the dataset and specific instructions on how a network was trained on this dataset to produce desirable results. With a system of this size that can produce such paintings, there's no fucking way it will ever see the light of day in a truly free (as in freedom) form.
 
I think it would be pretty fucking tough to host it yourself to begin with. I'd be content if I could use it to draw anime girls forever.
 
And the datasets themselves are usually many terabytes in size and require a good chunk of processing power. Self-hosting such a thing would be out of the realm of possibility for most people due to the hardware costs.
 
If the dataset is 10 terabytes, any idiot can store it with a $200 hard drive. 100 terabytes narrows it to data hoarders, 1000 terabytes is too hardcore.

Training and using a model are two different things. We can't even use this one.
 
There are several ethical challenges facing text-to-image research broadly. We offer a more detailed exploration of these challenges in our paper and offer a summarized version here.

First, downstream applications of text-to-image models are varied and may impact society in complex ways. The potential risks of misuse raise concerns regarding responsible open-sourcing of code and demos. At this time we have decided not to release code or a public demo. In future work we will explore a framework for responsible externalization that balances the value of external auditing with the risks of unrestricted open-access.

Second, the data requirements of text-to-image models have led researchers to rely heavily on large, mostly uncurated, web-scraped datasets. While this approach has enabled rapid algorithmic advances in recent years, datasets of this nature often reflect social stereotypes, oppressive viewpoints, and derogatory, or otherwise harmful, associations to marginalized identity groups. While a subset of our training data was filtered to remove noise and undesirable content, such as pornographic imagery and toxic language, we also utilized the LAION-400M dataset which is known to contain a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes. Imagen relies on text encoders trained on uncurated web-scale data, and thus inherits the social biases and limitations of large language models. As such, there is a risk that Imagen has encoded harmful stereotypes and representations, which guides our decision to not release Imagen for public use without further safeguards in place.
"You don't get this cool tool because SJWs will whine when it makes something they dislike."
 
Yeah, to be clear, you don't need the dataset or (generally) to know anything about the training strategy to use a pre-trained model. That's kind of the point: you end up with a magic box that'll take input in the intended form and that's it, unless you want to tune it further for your application, and even then you don't need most of that. The model might still be big and take a long-ass time to crunch on your shitty hardware, but it'll still be at least an order of magnitude less than what went into training.
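For a concrete sense of what that looks like in practice, here's a minimal sketch of running inference with nothing but a model's published weights. ResNet-50 from torchvision is just a stand-in (Imagen obviously has no public weights or API), and the input file name is a placeholder:

import torch
from torchvision import models
from PIL import Image

# Download the published weights; the ImageNet training set itself is never needed.
weights = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights)
model.eval()

preprocess = weights.transforms()           # the exact preprocessing these weights expect
img = preprocess(Image.open("some_image.jpg")).unsqueeze(0)   # placeholder file name

with torch.no_grad():
    logits = model(img)
top = logits.argmax(dim=1).item()
print(weights.meta["categories"][top])      # human-readable predicted class

No dataset, no training recipe, just the weights plus the expected input format.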

The methods themselves are public in a sense, yeah, that's true. But it's the specific, novel way you combine ML techniques to build your training system that matters, not the dataset so much, except in the strictly deterministic sense. You can think of it like this: the "personality" of your AI might be different, but if you trained it the same way on a large enough corpus that was selected and processed the same way, it'll basically be able to do the same things. And that's essentially beside the point in this and most high-functioning examples: the hard work has already been done to produce the magic. Even if you wanted to alter the flavour of the outputs, it'd still be easier to operate on the existing model than on a blank slate.

Anyway I'd take their explanation with a huge grain of salt: the "racial biases in AIs" thing is a known "problem" that's basically a canned excuse.
When you see this kind of powerful shit, just think about how profitable it could be if you never told anybody you had it, giving you a lead of years on unsuspecting markets that think your tech is literally impossible. So you gotta wonder if a super sweet demo is really just the crumbs, or something they've been sitting on that only got disclosed down the line in anticipation of someone else announcing a similar advance first.
(Or if they just have research divisions doing stuff outside their core sinister shit, I guess). That said, Google has released some pretty great and useful stuff in this field, and the meme potential of a funny picture maker is definitely more of a PR hazard than other stuff.
 
If the dataset is 10 terabytes, any idiot can store it with a $200 hard drive. 100 terabytes narrows it to data hoarders, 1000 terabytes is too hardcore.
It's likely pulling video sources from everything uploaded to YT too so we'd be talking about petabytes here.

Racism and misinformation are most likely convenient excuses for the fact that this starts melting down their server room for anything you request.
 
I doubt they used video, given the emphasis on composition, near focus and lens effects.

Fine, I'll actually skim the paper.
Looks like they filtered naughty things out of a combination of the LAION-400M dataset, which is 400 million image-description pairs (and this was already posted so fukka u), and a slightly larger collection of internal sets which they don't describe, but maybe comes from their image recognition stuff? I'd also guess they fudged it a bit, given the emphasis on human preference rating as a benchmark and everything looking like it has Instagram filters (or maybe not, looks like they optimised an HDR-like step).
LAION-400M comes in at 10TB for the 256x256 version, but you can't really guess how big their set was from that, not that it matters, because they generate base images from the text at 64x64 (so they probably train that part at 64x64) and upsize with a scaler that's also conditioned on the text (I tried reading the related papers but there was maths and I got sleepy).
And yeah the hardware is nuts.
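If it helps, here's a rough sketch of that cascade as a data flow. Every "model" below is a random stand-in (the real networks are text-conditioned diffusion models Google hasn't released), so this only shows the shapes and the ordering, not the actual method:

import torch

def text_encoder(prompt: str) -> torch.Tensor:
    # Stand-in for the frozen text encoder: one embedding per token.
    return torch.randn(len(prompt.split()), 1024)

def base_model(text_emb: torch.Tensor) -> torch.Tensor:
    # Stand-in for the 64x64 text-to-image base model.
    return torch.rand(3, 64, 64)

def upsampler(image: torch.Tensor, text_emb: torch.Tensor, size: int) -> torch.Tensor:
    # Stand-in for the text-conditioned super-resolution stages.
    return torch.nn.functional.interpolate(
        image.unsqueeze(0), size=(size, size), mode="bilinear", align_corners=False
    ).squeeze(0)

prompt = "a corgi playing a trumpet"
emb = text_encoder(prompt)
img64 = base_model(emb)                 # small, cheap base image from the text
img256 = upsampler(img64, emb, 256)     # upsize, still conditioned on the text
img1024 = upsampler(img256, emb, 1024)  # final full-resolution output
print(img1024.shape)                    # torch.Size([3, 1024, 1024])

Point being, the expensive full-resolution work only happens at the end of the chain, which is why the 64x64 stage is where most of the text-to-image magic lives.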
 
I find it amazing how fast things have evolved since next-generation convolutional neural networks became possible in 2012. In 10 years we went from a doubling in image recognition accuracy to this. All thanks to the work done by Japanese, French, and Canadian researchers (who were initially inspired by the neurophysiological research done by a Swede and a Canadian), which led to the first breakthrough in 2012.
 
It became so advanced that we can't be trusted to use Google Image Search anymore. "Dog", yeah I know it's a dog, for fuck's sake, I just wanted to find the same image in a higher res. "Woman", no, that's not a woman, jesus christ!
 
Honestly, all fears of this being abused for fake news (just as much as text generators, which are somewhat more effective) could be easily resolved if digitally signing pictures/videos/texts were more commonplace and established. The technology for digital signatures is literally decades old, pretty simple and pretty waterproof. A malicious actor could generate tons of fake videos and images, but as long as they're not from the source they claim to be (as in lacking a signature) it would not matter. Sadly this would also require a computer-literate populace who isn't borderline braindead and also gives a shit, which means we're all doomed.
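To be fair, the plumbing really is that old and simple. A minimal sketch of the sign-and-verify flow with Ed25519 from Python's cryptography package; the file name is a placeholder and a real scheme would also need to tie the public key to the publisher's identity:

from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# Publisher side: sign the raw bytes of the image being released.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()        # published alongside the outlet's identity
image_bytes = open("press_photo.jpg", "rb").read()   # placeholder file
signature = private_key.sign(image_bytes)    # distributed with the image

# Consumer side: verify the image really came from the claimed source.
try:
    public_key.verify(signature, image_bytes)
    print("Signature valid: bytes untouched and from the claimed key.")
except InvalidSignature:
    print("No valid signature: treat the image as unverified.")

The hard part isn't the crypto, it's getting publishers to sign things and getting anyone to check.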

EDIT: I'm still waiting for the clusterfuck that'll happen when rights holders and lawyers figure out a lot of these neural networks are fed with technically copyrighted material. I don't think anyone has ventured into that particular minefield yet.
 
You don't even need good deepfakes that can fool deepfake detector algorithms. Plenty of people will think video game footage is from the war in Ukraine.

I think that is already happening. The best defense is to make it very hard to conclusively link the output back to an original image, and not allow copyright trolls or the Business Software Alliance to ever access your datasets.
 
I bet their attempts will be enough to get them access to pictures from private citizens who don't have megacorp lawyer money. The EU debacle is close enough to that and it will only get worse.

The thing I worry about with future deepfakes and their watermarking is that there will be no way for the public or independent researchers to verify whether one is there or not. If malicious people know how it works they can defeat or circumvent it; that's why you'd have to trust the fact-checking experts and their secret software that decides whether something happened or not. No matter how real it looks.
 
What if this "top secret" AI is actually just a bunch of trained monkeys using Photoshop in a dark room somewhere in Google's HQ?
 