Researcher finds way to get audio from still images and silent videos

With video calls becoming more common in the age of remote and hybrid workplaces, "mute yourself" and "I think you're muted" have become part of our everyday vocabularies. But it turns out muting yourself might not be as safe as you think.

Kevin Fu, a professor of electrical and computer engineering and computer science at Northeastern University, has figured out a way to get audio from pictures and even muted videos. Using Side Eye, a machine learning-assisted tool that Fu and his research team created, Fu can determine the gender of someone speaking in the room where a photo was taken, and even the exact words they spoke.

"Imagine someone is doing a TikTok video and they mute it and dub music," Fu says. "Have you ever been curious about what they're really saying? Was it 'Watermelon watermelon' or 'Here's my password?' Was somebody speaking behind them? You can actually pick up what is being spoken off camera."

It sounds like the stuff of science fiction—and it is. The idea for Side Eye was inspired by an episode of the sci-fi show "Fringe" that saw the main characters, a team of fringe science investigators working for the FBI, extracting audio from a melted pane of glass.

When the episode aired, one critic for Den of Geek called it a "ridiculous pseudo science technique." Fu disagreed.

"I was like, 'I bet we can do that,'" Fu says. "My lab specializes in the impossible. We usually expect the first reaction to anything we do to be 'You can't do that,' and we say, 'Well, we already did.'"

Side Eye takes advantage of the image stabilization technology that is now virtually standard across most phone cameras. To ensure a shaky hand doesn't make for a blurry photo, cameras have small springs that hold the lens suspended in liquid. An electromagnet and sensors then push the lens in equal and opposite directions to reduce camera shake.

However, Fu says whenever someone speaks near a camera lens, it causes tiny vibrations in the springs and bends the light ever so slightly. The angle of the light changes almost imperceptibly—"unless you're looking for it," Fu says.

Normally, it would be hard to extract sonic frequency from those microscopic vibrations. But Fu says rolling shutter, a method of photography most phone cameras use today, actually makes it easier to achieve the impossible.

"The way cameras work today to reduce cost basically is they don't scan all pixels of an image simultaneously; they do it one row at a time," Fu says. "[That happens] hundreds of thousands of times in a single photo. What this basically means is you're able to amplify by over a thousand times how much frequency information you can get, basically the granularity of the audio."

As long as there is even a little bit of light, Side Eye will work, although the more imagery it has access to, the better. Fu says even a photo pointed at a ceiling would let Side Eye do its thing.
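
The row-by-row readout Fu describes can be put into rough numbers. The sketch below is only a back-of-the-envelope illustration, not Side Eye's actual pipeline: it assumes a hypothetical 30 fps camera with 1080 rows, treats every row as one independent time sample, and ignores exposure time and the idle gap between frames. The function names are made up for this example.

```python
def row_sampling_rate(frame_rate_hz: float, rows_per_frame: int) -> float:
    """Samples per second if each sensor row counts as one time sample."""
    return frame_rate_hz * rows_per_frame


def nyquist_limit(sample_rate_hz: float) -> float:
    """Highest frequency that can be represented at a given sample rate."""
    return sample_rate_hz / 2.0


# Assumed, illustrative camera parameters (not taken from the article).
frame_rate = 30.0   # frames per second
rows = 1080         # rows read out one after another in each frame

# Sampling once per whole frame: one sample per frame.
per_frame_rate = frame_rate
# Rolling shutter: one sample per row, so the rate scales with the row count.
per_row_rate = row_sampling_rate(frame_rate, rows)

print(f"per-frame sampling: {per_frame_rate:.0f} Hz, "
      f"Nyquist {nyquist_limit(per_frame_rate):.0f} Hz")
print(f"per-row sampling:   {per_row_rate:.0f} Hz, "
      f"Nyquist {nyquist_limit(per_row_rate):.0f} Hz")
print(f"gain from rolling shutter: {per_row_rate / per_frame_rate:.0f}x")
```

Under these assumptions the gain equals the number of rows, which is in line with the "over a thousand times" figure in the quote. A real sensor would fall short of this idealized number.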

The end result of this process is audio that, even at its best, sounds more like the muffled sound of adults in the Peanuts cartoons. But by using machine learning and training Side Eye on certain words and audio, Fu is able to extract a lot of information.

"If you want to know if I said yes or no, you can train [Side Eye] on people saying yes and no and then look at the patterns and with high confidence when I get an image later know if someone said yes or no," Fu says.

Side Eye can even identify the exact person who is speaking if it's been trained on that person's voice, although Fu says it's not as accurate when it comes to that just yet.
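
The train-then-classify idea Fu describes for "yes" and "no" can be sketched with a toy nearest-centroid classifier. This is an illustration of the general approach only. The article does not say which model Side Eye uses, and the three-number feature vectors below are invented stand-ins for whatever would actually be extracted from an image.

```python
from statistics import fmean


def centroid(vectors):
    """Average the feature vectors of one class, column by column."""
    return [fmean(column) for column in zip(*vectors)]


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train(examples):
    """examples maps each label to a list of feature vectors."""
    return {label: centroid(vectors) for label, vectors in examples.items()}


def classify(model, features):
    """Return the label whose centroid is closest to the features."""
    return min(model, key=lambda label: squared_distance(model[label], features))


# Made-up training data: two recordings per word, three features each.
training = {
    "yes": [[0.9, 0.1, 0.2], [0.8, 0.2, 0.1]],
    "no": [[0.1, 0.9, 0.7], [0.2, 0.8, 0.9]],
}
model = train(training)

# A new, unlabeled feature vector taken from a later image.
print(classify(model, [0.85, 0.15, 0.3]))  # prints "yes"
```

Restricting the vocabulary to a handful of known words is what makes this workable: the classifier only has to pick the closest of a few learned patterns, not reconstruct arbitrary speech from muffled audio.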

From a cybersecurity perspective, Side Eye opens up an entirely new world of threats that people and cybersecurity experts should be aware of. However, Fu says the most interesting application for Side Eye could be as a new form of digital evidence for lawyers and others working in the criminal legal system.

"Maybe there's an alibi and it's being admitted to court and somebody wants to prove somebody was or wasn't there," Fu says. "You might be able to use this technique if you have an authenticated video with a known timestamp to confirm one way or the other. If you hear the person's voice, they're more than likely there."
 
Why do you think comics and pictures are still around despite spoken media? People are surprisingly good at figuring out the situation from images.

Welp trannies and tranny enablers gonna get this shutdown now, aren't they?

Troons are a scourge of humanity and this decade has justified every single moment of history when ancient people would mercilessly oppress and prosecute eunuchs. They're prone to causing shit.
 
It can only work on phone videos then. When they say "still image" they don't mean a single frame photograph, they mean a video of nothing moving. That's pretty misleading.

I didn't think you'd be able to hear great grandma on 100+ year old photos.

But if this technology is legit it will only be used for evil. Surveillance cameras are already everywhere. Even if you aren't a terminally online instatard you are still being recorded many many times daily.
 
It can only work on phone videos then. When they say "still image" they don't mean a single frame photograph, they mean a video of nothing moving. That's pretty misleading.

Not quite, no. What they're saying is that because of rolling shutter a "still" photograph is actually captured over a long enough period of time to extract useful audio information from it.
 
So this relies on image stabilization to work?

Now that is somewhat interesting, but also ensures you have a workaround: disable image stabilization, no snooping.
 
It requires the speaker to be close to (a specific type of) camera. For surveillance cameras, it would be both easier and more effective to just add a microphone.

It's been possible to extract sound from pixel variance of a high frame rate video for quite some time, at any distance...but the frame rate needs to be like several thousand fps.

Also, lip reading has been a thing since forever. Seems like a perfect application for AI actually, wonder if there's a specific app for it out yet.
 
This is just sonification, not actually taking sounds from things. It's essentially extracting a data set and then producing sound from that. Any sound that is produced is going to sound garbled and have to be redone by someone else to sound good. There's a great video about sonification and how it's used all the time to claim that it's "producing the sounds of outer space" when in reality it's just people taking numbers, arbitrarily making sounds based on those numbers, and claiming it's the "sounds of outer space".
 