Real-Time Voice Cloning

It sounds like a novel concept now, but just wait until movies become even more characterless and boring, with reboots starring long-dead celebrities, or middle-aged and older ones happy to sell the likeness of their younger selves in their prime.

Just wait until after that, when users have control through microtransactions.

Hm, I don't like Ryan Reynolds. Make him Gary Oldman. Rosario Dawson is overrated; make her Gary Oldman. Tom Hardy? More like Gary Oldman.

Ok... Put that Gary Oldman in a Robin Hood costume. That one should be a '30s gangster, and... *whispering to the TV* put Christina Hendricks's body on that Gary Oldman. Now, instead of a submarine, it takes place in a castle.
 
That is a very real application we might see in our lifetimes. Visual applications of machine learning are much farther along than audio. But I imagine the first few commercial attempts will be met with lukewarm reception, like current AR/VR stuff.
 
Combine this with CGI people, deepfakes, and a bunch of other bullshit, and we won't even need to hire actors for movies at this rate.

Necro'ing an old thread, but this is actually happening with the current influx of Yanderedev memes being put out by people. One person of note is the YouTube channel Derewah, who has made Yanderedev deepfakes their entire schtick and is working on voice-cloning Yanderedev.
 
If you've ever watched the 9/11 conspiracy movie "Loose Change", it mentions that a phone call was likely deepfaked convincingly, and that claim goes way back. This tech has existed for decades, and we can only imagine what they've got now that we don't even know exists. There's a reason they call Area 51 "Dreamland". We didn't even know the stealth Black Hawk existed until one crashed during Operation Neptune's Spear; it couldn't be heard beyond a short distance, and that was nearly a decade ago.
 
The autism we deserve, not the autism we need.
 
Necrobumping this thread five years later. A lot has happened with ML voice cloning and generative audio since.
As of 2024, closed-source services reign supreme, producing almost-perfect speech. ElevenLabs and GPT-4o's voice mode (if the demos are to be believed), despite the dystopian corporate use cases they advertise, are still superior to the open-source options, and I'm sure others will follow. Companies are shifting focus to multimodal LLMs, which I suspect will also include multimodal output in the coming years (text, image, audio, etc.).
Despite this, local options have exploded since 2019: XTTS-v2, Bark, MARS5-TTS, ChatTTS, some of which are approaching ElevenLabs quality. If you're interested in any of this stuff, Hugging Face is a good place to check trending options. And of course, it's still a colossal bitch to get any of it working even if you have the appropriate hardware; I understand why not everyone wants to set up a conda environment and spend the next hour installing various dependencies and CUDA libraries (see the sketch below for what you get at the end of all that).
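
For a sense of what the happy path looks like once the environment pain is over, here's a minimal sketch using Coqui's TTS package with the XTTS-v2 checkpoint. The file paths are placeholders; the model name and arguments are from Coqui's documented API, so double-check against their docs before copying.

```python
# Minimal XTTS-v2 voice-cloning sketch using Coqui's TTS package
# (pip install TTS). File paths below are placeholders.
from TTS.api import TTS

# Download/load the XTTS-v2 checkpoint and move it to the GPU.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

# Clone a voice from a short reference clip and synthesize new speech.
tts.tts_to_file(
    text="Necrobumping this thread five years later.",
    speaker_wav="reference_voice.wav",  # a few seconds of the target speaker
    language="en",
    file_path="cloned_output.wav",
)
```

Getting to the point where those few lines actually run (matching CUDA, torch, and TTS versions) is the hard part, which is exactly why most people just pay for a hosted service.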
Synthesized natural speech is still a niche part of the ML ecosystem, trailing behind text and image generation, but it's come a long way in half a decade. There are so many options now that it's hard to keep up. Any thoughts or predictions? Are there any other speech synthesis options I missed that are worth noting?
 

I used a Tortoise derivative because I didn't want to pay like all the ElevenLabs newbs, and it worked well (roughly the workflow sketched below). The field moves lightning fast, though. There are a bunch of channels that clone David Attenborough, and it's like they resurrected him at his peak; fake Attenborough sounds better than the real one does now. In just a few months they managed to nail the pauses and affectations that were still problematic when I was doing my project, to the point where it's almost indistinguishable from a real person. I only got suspicious because it was weird that some dinky little Warhammer channel had such a distinguished old British man narrating their videos.
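
For anyone curious, the Tortoise workflow looked roughly like this. This is a sketch from memory of the tortoise-tts API (neonbjb/tortoise-tts); the voice folder name and output text are placeholders, so treat it as an outline rather than gospel.

```python
# Rough sketch of voice cloning with tortoise-tts (neonbjb/tortoise-tts).
# Assumes a folder of reference WAV clips at tortoise/voices/myvoice/.
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voices

tts = TextToSpeech()

# Load the reference clips and conditioning latents for the target voice.
voice_samples, conditioning_latents = load_voices(["myvoice"])

# Generate speech in the cloned voice; the "fast" preset trades quality for speed.
gen = tts.tts_with_preset(
    "The field moves lightning fast.",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    preset="fast",
)

# Tortoise outputs 24 kHz audio.
torchaudio.save("generated.wav", gen.squeeze(0).cpu(), 24000)
```

Most of the actual effort went into curating clean reference clips and regenerating candidates until the pauses sounded natural, not into the code itself.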

The thing that annoys me is that I worked really hard on my voice cloning project, and now it's completely obsolete since you can do even better with a few button presses. Sigh.
 