Real-Time Voice Cloning

It sounds like a novel concept now, but just wait until movies become even more characterless and boring, with reboots starring long-dead celebrities, or middle-aged and older ones happy to sell the likeness of their younger selves in their prime.

Just wait until after that, when users have control through microtransactions.

Hm, I don't like Ryan Reynolds. Make him Gary Oldman. Rosario Dawson is overrated; make her Gary Oldman. Tom Hardy? More like Gary Oldman.

Ok... Put that Gary Oldman in a Robin Hood costume. That one should be a '30s gangster, and... *whispering to the TV* put Christina Hendricks's body on that Gary Oldman. Now, instead of a submarine, it takes place in a castle.
 
That is a very real application we might see in our lifetimes. Visual applications of machine learning are much farther along than audio. But I imagine the first few commercial attempts will be met with lukewarm reception, like current AR/VR stuff.
 
Combine this with CGI people, deepfakes, and a bunch of other bullshit, and we won't even need to hire actors for movies at this rate.

Necro'ing an old thread, but this is actually happening with the current influx of Yanderedev memes being put out by people. One person of note is the YouTube channel Derewah, who has made Yanderedev deepfakes their entire schtick and is working on voice-cloning Yanderedev.
 
If you've ever watched the 9/11 conspiracy movie "Loose Change", it mentions that a phone call was likely deepfaked convincingly, and that claim goes way back. This tech has existed for decades, and we can only imagine what they've got now that we don't even know exists. There's a reason they call Area 51 "Dreamland". We didn't even know the stealth Black Hawk existed until one crashed during Operation Neptune's Spear; it couldn't be heard beyond a short distance, and that was nearly a decade ago.
 
The autism we deserve, not the autism we need.
 
Necrobumping this thread five years later. A lot has happened with ML voice cloning and generative audio since.
As of 2024, closed-source services reign supreme, producing almost-perfect speech. ElevenLabs and GPT-4o's voice mode (if the demos are to be believed), despite the dystopian corporate use cases they advertise, are still superior to the open-source options, and I'm sure others will follow. Companies are shifting focus to multimodal LLMs, which I suspect will also include multimodal output in the coming years (text, image, audio, etc.).
Despite this, local options have exploded since 2019: XTTS-v2, Bark, MARS5-TTS, ChatTTS, some of which are approaching ElevenLabs quality. If you're interested in any of this stuff, Hugging Face is a good place to check trending options. And of course, it's still a colossal bitch to get any of it working even if you have the appropriate hardware; I understand why not everyone wants to set up a conda environment and spend the next hour installing various dependencies and CUDA libraries (see the sketch below for what you get at the end of all that).
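
For a sense of what the happy path looks like once the environment pain is over, here's a minimal sketch using Coqui's TTS package with the XTTS-v2 checkpoint. The file paths are placeholders; the model name and arguments are from Coqui's documented API, so double-check against their docs before copying.

```python
# Minimal XTTS-v2 voice-cloning sketch using Coqui's TTS package
# (pip install TTS). File paths below are placeholders.
from TTS.api import TTS

# Download/load the XTTS-v2 checkpoint and move it to the GPU.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

# Clone a voice from a short reference clip and synthesize new speech.
tts.tts_to_file(
    text="Necrobumping this thread five years later.",
    speaker_wav="reference_voice.wav",  # a few seconds of the target speaker
    language="en",
    file_path="cloned_output.wav",
)
```

Getting to the point where those few lines actually run (matching CUDA, torch, and TTS versions) is the hard part, which is exactly why most people just pay for a hosted service.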
Synthesized natural speech is still a niche part of the ML ecosystem, trailing behind text and image generation, but it's come a long way in half a decade. There are so many options now that it's hard to keep up. Any thoughts or predictions? Are there any other speech synthesis options I missed that are worth noting?
 

I used a Tortoise derivative because I didn't want to pay like all the ElevenLabs newbs, and it worked well (roughly the workflow sketched below). The field moves lightning fast, though. There are a bunch of channels that clone David Attenborough, and it's like they resurrected him at his peak; fake Attenborough sounds better than the real one does now. In just a few months they managed to nail the pauses and affectations that were still problematic when I was doing my project, to the point where it's almost indistinguishable from a real person. I only got suspicious because it was weird that some dinky little Warhammer channel had such a distinguished old British man narrating their videos.
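
For anyone curious, the Tortoise workflow looked roughly like this. This is a sketch from memory of the tortoise-tts API (neonbjb/tortoise-tts); the voice folder name and output text are placeholders, so treat it as an outline rather than gospel.

```python
# Rough sketch of voice cloning with tortoise-tts (neonbjb/tortoise-tts).
# Assumes a folder of reference WAV clips at tortoise/voices/myvoice/.
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voices

tts = TextToSpeech()

# Load the reference clips and conditioning latents for the target voice.
voice_samples, conditioning_latents = load_voices(["myvoice"])

# Generate speech in the cloned voice; the "fast" preset trades quality for speed.
gen = tts.tts_with_preset(
    "The field moves lightning fast.",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    preset="fast",
)

# Tortoise outputs 24 kHz audio.
torchaudio.save("generated.wav", gen.squeeze(0).cpu(), 24000)
```

Most of the actual effort went into curating clean reference clips and regenerating candidates until the pauses sounded natural, not into the code itself.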

The thing that annoys me is that I worked really hard on my voice cloning project, and now it's completely obsolete since you can do even better with a few button presses. Sigh.
 