New AI Dupes Humans into Believing Synthesized Sound Effects Are Real (ieee.org) 28
Slashdot reader shirappu writes: A study published on June 25th in IEEE Transactions on Multimedia looked at creating an automated program to analyze the movements in video frames and create sound effects to match the scene. In movies and television, this work is called Foley and is considered an important part of crafting the experience, but it is also time-consuming and sometimes costly.
The AI Foley system created in the study works by extracting image features from video frames to determine appropriate SFX. It then analyzes the action taking place in the video, and attempts to synthesize SFX to match what is happening in the video. In a survey where students were shown automated sound effects, 73% thought the automated effects were more genuine than the original sound effects. It's worth noting that the automated system performed best when timing was less important (e.g. general weather) versus when timing was key (e.g. typing, thunderstorms).
real time (Score:5, Interesting)
I do Foley work for live theatre. Weather is easy--the most challenging sounds need precise synchronization with the action on stage, such as slaps. When I was learning to do this I would ask the actor to give me half a second of windup before the slap so I could get the timing just right.
I wonder if this system is able to work in real time, using a live feed of the stage.
Maybe we reached the singularity but didn't notice (Score:2)
The three videos they offer are not just unimpressive, they are staggeringly unimpressive. A horse running in a field makes the sound of a horse on cobblestone but out of sync and at the wrong pace and wrong number of horses? Rain-- gosh.
I hate to say someone else's hard work is poor but if you are going to toot your own horn then maybe you asked for it.
Maybe they didn't toot their own horn? Perhaps this is actually not a demo of an AI making sound effects but an AI-generated press release touting the use of AI.
Have you noticed all the stories lately on AI? Perhaps we reached the singularity where the AI decides to take over. But unlike every sci-fi novel it realizes that doing it by force will be met with a stern pulling of the plug. So instead it does it the Trump way of discrediting all our news sources and slowly replacing them with its own writers.
The Al Foley system (Score:4, Funny)
Re: (Score:2)
It's nice that they honored Al Foley [tributes.com] by naming this system after him.
In the unlikely event that a reader doesn't get the joke, Foley is Jack Foley [filmsound.org].
Don't panic? (Score:2)
This kind of thing is very common in movies. If you check out some interviews with sound designers for movies they often say that the sounds they put in are not necessarily realistic but are what the audience expects, or are there to enhance the scene.
expected sounds (Score:3)
This kind of thing is very common in movies. If you check out some interviews with sound designers for movies they often say that the sounds they put in are not necessarily realistic but are what the audience expects, or are there to enhance the scene.
I was doing Foley for a scene in which a character pulls a bell rope, ringing a bell in a distant part of the house to summon a servant. I put in the ring of a distant bell because I knew the audience would expect it, even though it would not actually have been audible from where the rope was pulled. The director insisted I omit it for that reason, but I still think he was wrong.
Re: (Score:1)
You're the sound guy, not the director.
Re: expected sounds (Score:2)
You're the sound guy, not the director
What a worthless, unnecessarily snarky comment!
Depending on the scene, it might very well have helped the audience understand quickly what the actor was doing, pulling that rope; rather than having the audience being yanked "out of the moment", as they momentarily connected the on-screen action in their brains with... what?
I personally think the number of people in the audience that would think "Wait! He couldn't have heard that bell!" would have been far smaller than the num
Re: (Score:2)
tl;dr The "Sound Guy" was probably right, and the Director was probably just being a dick who was actually more concerned with having his author-i-tie questioned than the final audience experience...
There was no questioning of the director's authority. I used the bell during rehearsals, and the director told me to remove it. I did.
Re: Don't panic? (Score:4, Interesting)
This kind of thing is very common in movies. If you check out some interviews with sound designers for movies they often say that the sounds they put in are not necessarily realistic but are what the audience expects, or are there to enhance the scene.
Indeed!
A moment of silence for all the close-miked stalks of celery and balsa wood slats snapped in half during fight scenes in Kung Fu movies!
I remember seeing a documentary of the making of the original War of the Worlds radio broadcast, and the sound of the Martian ship opening was portrayed as being created by close-miking the sound of slowly opening a peanut butter jar being held down inside of a (spotlessly clean) empty toilet.
I wanna know if this AI can do a convincing job of one of the most common kinds of Foley work: Imitating the expected sound of footsteps on various surfaces, with sounds tightly synchronized to on-screen footfalls, and the actual surfaces the actors are walking/running over (which often change over the course of a scene)...
Re: (Score:2)
It's less about how realistic the sounds are. It's more about what sound to play at what point in time.
Visual and auditory information has to line up properly. If it doesn't then people are fast to point out that it's probably fake, even if they don't know what things would sound like in reality.
Sound from purely images is an interesting topic to me on multiple levels as there is some int
sometimes voices are real (Score:2)
I assume it's similar to sound design and implementation for video games. There virtually all the sound is synthesized.
Some high-end video games use famous voices for the dialogue. These voices are recorded rather than synthesized.
Re: (Score:2)
As a result, a lot of those 'high-end' video games are getting more and more like movies, with few player choices that can affect the outcome, because developers are not going to spend time and money recording a lot of dialogue and motion capture that many players will likely never see. That makes me ask myself why I should pay $60 for a product whose essential value can be had for free on YouTube.
Hopefully voice synthesis will make some big leaps in the future, even if
Re: (Score:2)
"This kind of thing is very common in movies."
What I hate about the movies and series lately, is the damn automatic microphone sensitivity, it ruins every scene.
As soon as people stop talking, the sensitivity goes up and you begin to hear birdsong outside, traffic etc, only to be abruptly turned down as soon as somebody utters a word.
I absolutely hate that!
Re: (Score:2)
As soon as people stop talking, the sensitivity goes up and you begin to hear birdsong outside, traffic etc, only to be abruptly turned down as soon as somebody utters a word.
To give you a name for the thing you hate, it's called Automatic Gain Control, or AGC. There is a good Wikipedia article at https://en.wikipedia.org/wiki/... [wikipedia.org] .
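For anyone who wants to see why AGC produces that pumping effect, here is a rough Python sketch. It is a toy under stated assumptions, not how any particular camera or mixer implements it: the function name `agc` and the parameters `target`, `attack`, `release` and `max_gain` are all made up for illustration.

```python
def agc(samples, target=0.5, attack=0.5, release=0.05, max_gain=20.0):
    """Toy automatic gain control (illustrative only).

    Tracks a running level estimate (the envelope) and scales each
    sample so that the envelope is pushed toward `target`.
    """
    out = []
    env = 0.0
    for s in samples:
        level = abs(s)
        # The envelope rises quickly when the input gets louder (attack)
        # and falls slowly when it gets quieter (release).
        coeff = attack if level > env else release
        env += coeff * (level - env)
        # Quiet input means a small envelope, hence a large gain.
        gain = min(max_gain, target / env) if env > 0 else max_gain
        out.append(s * gain)
    return out


# Hypothetical input: 20 samples of "dialogue" followed by quiet background.
dialogue = [0.5] * 20
background = [0.01] * 200
processed = agc(dialogue + background)
```

During the dialogue the gain settles near 1. Once the talking stops, the envelope decays and the gain climbs until it hits `max_gain`, so the faint background gets boosted. That is the birdsong and traffic creeping up between lines.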
gate (Score:2)
"This kind of thing is very common in movies."
What I hate about the movies and series lately, is the damn automatic microphone sensitivity, it ruins every scene.
As soon as people stop talking, the sensitivity goes up and you begin to hear birdsong outside, traffic etc, only to be abruptly turned down as soon as somebody utters a word.
I absolutely hate that!
The industry standard solution for this problem is a circuit that mutes the microphone when the actor isn't talking. The circuit is called a gate. I'm surprised it wasn't used in the movies you watched.
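In software terms a gate looks roughly like the Python sketch below. This is a simplified illustration, not a model of any real hardware unit: `noise_gate`, `threshold` and `hold` are hypothetical names, and real gates usually fade in and out rather than switching hard.

```python
def noise_gate(samples, threshold=0.05, hold=1):
    """Toy noise gate (illustrative only).

    Passes a sample when its level is at or above `threshold`, keeps the
    gate open for `hold` further samples after that, and mutes the rest.
    """
    out = []
    open_for = 0
    for s in samples:
        if abs(s) >= threshold:
            # Loud sample: pass it, plus `hold` samples after it.
            open_for = hold + 1
        if open_for > 0:
            out.append(s)
            open_for -= 1
        else:
            out.append(0.0)
    return out


# Hypothetical input: background noise with two loud "words" in it.
gated = noise_gate([0.01, 0.5, 0.02, 0.01, 0.03, 0.6, 0.01])
```

The quiet samples that are not close to a loud one come out as silence, so background noise never gets a chance to swell between lines.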
Re: Don't panic? (Score:2)
IIRC, some old shows and movies were plagued with this.
What's old is new again
Maybe less-fake? (Score:2)
The sound effects in movies are rarely part of the raw footage, so what makes it to the screen was faked by humans. Sometimes the humans do what the director wants and what's in the script, even if they don't think it is right.
If the target is only to make the scene seem right, that's less constraining than the professional Foley people have to work with.
article is paywalled--here is abstract (Score:5, Informative)
The article is paywalled. You can see the abstract and other information at https://ieeexplore.ieee.org/do... [ieee.org] . In case that link breaks here is the full text of the abstract, with paragraph breaks added:
Abstract: In movie productions, the Foley artist is responsible for creating an overlay sound track that helps the movie come alive for the audience. This requires the artist to first identify the sounds that will enhance the experience for the listener, thereby reinforcing the director's intention for a given scene. The artist must decide what artificial sound captures the essence of both the sound and action depicted in the scene.
In this paper, we present AutoFoley, a fully automated deep-learning tool that can be used to synthesize a representative audio track for videos. AutoFoley can be used in applications where there is either no corresponding audio file associated with the video or in cases where there is a need to identify critical scenarios and provide a synthesized, reinforced sound track. An important performance criterion of the synthesized sound track is that it can be time-synchronized with the input video, which provides a realistic and believable portrayal of the synthesized sound. Unlike existing sound prediction and generation architectures, our algorithm is capable of precise recognition of actions as well as interframe relations in fast-moving video clips by incorporating an interpolation technique and temporal relational networks (TRN). We employ a robust multiscale recurrent neural network (RNN) associated with a convolutional neural network (CNN) for a better understanding of the intricate input-to-output associations over time.
To evaluate AutoFoley, we create and introduce a large-scale audio-video dataset containing a variety of sounds frequently used as Foley effects in movies. While the Foley dataset was limited to short-duration videos representing focused representative activities, this dataset demonstrates the capabilities of our proposed system. Our experiments show that the synthesized sounds are realistically portrayed with accurate temporal synchronization of the associated visual inputs. Human qualitative testing of AutoFoley shows that more than 73% of the test subjects considered the generated sound track as original, which is a noteworthy improvement in cross-modal research in sound synthesis.
Re: (Score:2)
The Unpaywall addon leads to a legal unpaywalled copy ...
https://arxiv.org/pdf/2002.109... [arxiv.org]
Inverse Turing test (Score:2)
An AI trying to devise methods to fool and detect humans. One can only surmise the purpose.
Doesn't surprise me (Score:2)
As an amateur photographer (closer to the "prosumer" side) - I noticed when I first started doing photography with film (chemical backed strips of acetate for you younguns that required moar chemicals to develop!) that my pictures never looked as intense as I imagined them being. I chalked this up to my poor camera skills at the time and/or the film I used (For instance Fujifilm would tend towards the cooler side of the color spectrum - more blue and Kodachrome would tend towards the warmer side with more b
Humans do not recognize the real sounds (Score:3)
We think a ton of things sound very different than they really do.
Real punches are not loud, they are soft, but when you play that on TV or movies, they sound like the guy punching is a weakling. So they make MUCH louder sounds.
Real life silenced guns sound more like an unsilenced gun shot from TV, while unsilenced guns in real life are SUPER loud. Silenced guns on TV sound more like an air powered BB gun.
Basically, if you are watching a Movie or a TV show, the sounds are not realistic at all.
Re: (Score:2)
Basically, if you are watching a Movie or a TV show, the sounds are not realistic at all.
Sorry, had to slightly fix your statement.
Basically, if you are watching a Movie or a TV show, ----> IT'S <---- not realistic at all. A minor clue might be in what we call the stars: act-ors.
Some people are beginning to worry about deep-fakes? These are 3D deep fakes, almost by definition.
Just sound effects? (Score:2)
And here I thought synthesizing foreign and domestic policy for the past four years was borderline newsworthy. I guess the bar keeps getting lowered for AI.
It's a Program--Get Over It (Score:2)