The same model? How can a text model generate images?
Yes. How exactly OpenAI did it for GPT-4o, who knows, the details aren't public. Imagine it like a brain that has different regions doing different, specific things, with different encoders and decoders depending on the type of input and output. Then imagine layers routing the information to the right parts, other layers specialized in paying attention to what is important about the specific input in question, and stitching the reply back together, using the relevant information they got from the specialized parts.
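To make that picture concrete, here's a minimal PyTorch sketch of the general idea. Everything in it (class name, dimensions, layer counts) is made up for illustration, GPT-4o's real architecture is not public: per-modality encoders map everything into one shared token space, a single shared core attends over the mixed sequence, and per-modality heads decode the result back out.

```python
import torch
import torch.nn as nn

D = 512  # shared embedding width; all modalities meet in this space

class ToyMultimodalModel(nn.Module):
    """Toy unified model: per-modality encoders/decoders around one shared core.
    Purely illustrative; not GPT-4o's actual design."""

    def __init__(self, vocab_size=32000, audio_dim=80, image_patch_dim=768):
        super().__init__()
        # Modality-specific encoders project raw features into the shared space.
        self.text_encoder = nn.Embedding(vocab_size, D)
        self.audio_encoder = nn.Linear(audio_dim, D)        # e.g. mel-spectrogram frames
        self.image_encoder = nn.Linear(image_patch_dim, D)  # e.g. flattened image patches

        # One shared transformer core attends over the mixed token sequence.
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
        self.core = nn.TransformerEncoder(layer, num_layers=6)

        # Modality-specific heads read the shared representation back out.
        self.text_head = nn.Linear(D, vocab_size)
        self.audio_head = nn.Linear(D, audio_dim)

    def forward(self, text_ids, audio_frames, image_patches):
        # Encode each modality, then concatenate into one sequence: the core
        # never "translates" between models, it just sees compatible tokens.
        tokens = torch.cat([
            self.text_encoder(text_ids),
            self.audio_encoder(audio_frames),
            self.image_encoder(image_patches),
        ], dim=1)
        h = self.core(tokens)
        # Route the output to whichever decoder the task needs.
        return self.text_head(h), self.audio_head(h)

# Usage: one forward pass over mixed text + audio + image input.
model = ToyMultimodalModel()
text = torch.randint(0, 32000, (1, 16))
audio = torch.randn(1, 50, 80)
image = torch.randn(1, 64, 768)
text_logits, audio_out = model(text, audio, image)
```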
This is very complex at the architecture level, but the advantage is that several specialized areas can be combined to solve one problem, for example answering a text question about a picture. It does not necessarily mean that text output on specific topics comes out more accurate just because that other information also exists, contrary to what I stated earlier.
The big advantage of this shows up with audio: low latency. You don't have to "translate" between separate models and lose valuable time; you end up with data that is intrinsically compatible with the model's internal representations of concepts, so it moves through the neural net really quickly. OpenAI claims an audio-in to audio-out latency as low as roughly 230 milliseconds, about 320 ms on average (!) (as opposed to anywhere from about 3 to 10 seconds if you have different models working with each other), which is about the same speed as a human in conversation. That's why the model's replies in the videos come across as so natural: there is no awkward pause in which the audio gets processed, because it doesn't need to be. This lets the model work as a near-real-time language translator.
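Back-of-the-envelope version of why chaining models is slow. All the pipeline numbers below are rough assumptions for illustration, not benchmarks; only the 320 ms figure is OpenAI's claimed average:

```python
# Rough latency budget for a chained ASR -> LLM -> TTS pipeline vs. a
# native end-to-end audio model. Pipeline numbers are illustrative guesses.

pipeline_ms = {
    "speech-to-text (ASR)": 800,     # transcribe the user's audio
    "text model inference": 1500,    # generate the reply as text
    "text-to-speech (TTS)": 700,     # synthesize the reply audio
    "glue / serialization": 300,     # shuttling data between the three models
}
native_ms = {"end-to-end audio model": 320}  # OpenAI's claimed GPT-4o average

def total(budget: dict) -> int:
    return sum(budget.values())

print(f"chained pipeline: ~{total(pipeline_ms)} ms")  # ~3300 ms, the low end of that 3-10 s range
print(f"end-to-end model: ~{total(native_ms)} ms")
```

The point isn't the exact numbers; it's that every hop in the chain adds its own inference time plus conversion overhead, while the unified model pays for exactly one pass.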
For image generation this could also mean consistency between generations. You come up with a design for a character (let's say a smoking hot redhead), describe her in detail, and ask the model to generate pictures of that character in five different situations. She'd look largely the same in all five pictures, essentially remaining the same recognizable character, very unlike e.g. Stable Diffusion. This would allow you to e.g. make a webcomic with GPT-4o where the character design is consistent between panels. (This is why I told the artfags in the AI art seething thread that their days are numbered; Stable Diffusion in particular is just a computationally very efficient way to do image generation, not the only one.)
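For contrast, here's what the Stable Diffusion side of that looks like with the real diffusers library (the model ID and prompt are just examples): every call denoises from fresh random noise with no memory of earlier generations, which is exactly why the character drifts between images.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a Stable Diffusion checkpoint (example model ID).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "portrait of a red-haired woman, green eyes, freckles, comic style"

# Five independent samples: each one starts from new random noise,
# so the "same" character comes out with a different face every time.
images = [pipe(prompt).images[0] for _ in range(5)]
for i, img in enumerate(images):
    img.save(f"redhead_{i}.png")
```

Fixing the seed doesn't fix this either: it just freezes one exact image, it doesn't give you the same character across five different situations. A unified model that keeps the character description in its context can.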