Meta's AI-Powered Audio Codec Promises 10x Compression Over MP3 (arstechnica.com) 98
Last week, Meta announced an AI-powered audio compression method called "EnCodec" that can reportedly compress audio 10 times smaller than the MP3 format at 64kbps with no loss in quality. Meta says this technique could dramatically improve the sound quality of speech on low-bandwidth connections, such as phone calls in areas with spotty service. The technique also works for music. Ars Technica reports: Meta debuted the technology on October 25 in a paper titled "High Fidelity Neural Audio Compression," authored by Meta AI researchers Alexandre Defossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. Meta also summarized the research on its blog devoted to EnCodec.
Meta describes its method as a three-part system trained to compress audio to a desired target size. First, the encoder transforms uncompressed data into a lower frame rate "latent space" representation. The "quantizer" then compresses the representation to the target size while keeping track of the most important information that will later be used to rebuild the original signal. (This compressed signal is what gets sent through a network or saved to disk.) Finally, the decoder turns the compressed data back into audio in real time using a neural network on a single CPU.
Meta's use of discriminators proves key to creating a method for compressing the audio as much as possible without losing key elements of a signal that make it distinctive and recognizable: "The key to lossy compression is to identify changes that will not be perceivable by humans, as perfect reconstruction is impossible at low bit rates. To do so, we use discriminators to improve the perceptual quality of the generated samples. This creates a cat-and-mouse game where the discriminator's job is to differentiate between real samples and reconstructed samples. The compression model attempts to generate samples to fool the discriminators by pushing the reconstructed samples to be more perceptually similar to the original samples."
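To make the encoder / quantizer / decoder split described in the summary concrete, here is a toy sketch. It is not Meta's model: the real system uses trained neural networks, while this stand-in uses block averaging and uniform quantization, and every name and parameter (`hop`, `levels`, the function names) is a hypothetical chosen for illustration.

```python
import math

def encode(samples, hop):
    """Toy 'encoder': average each block of `hop` samples into one lower-rate frame."""
    return [sum(samples[i:i + hop]) / len(samples[i:i + hop])
            for i in range(0, len(samples), hop)]

def quantize(latents, levels):
    """Toy 'quantizer': map each value in [-1, 1] to an integer code in [0, levels - 1]."""
    step = 2.0 / (levels - 1)
    return [min(levels - 1, max(0, round((x + 1.0) / step))) for x in latents]

def decode(codes, levels, hop):
    """Toy 'decoder': turn codes back into values and hold each one for `hop` samples."""
    step = 2.0 / (levels - 1)
    out = []
    for code in codes:
        out.extend([code * step - 1.0] * hop)
    return out

# 64 samples of a sine wave stand in for uncompressed audio.
signal = [math.sin(2 * math.pi * n / 64) for n in range(64)]
codes = quantize(encode(signal, hop=4), levels=16)   # this is what would be stored or sent
recon = decode(codes, levels=16, hop=4)

bits_sent = len(codes) * 4                            # 16 levels need 4 bits per code
print(len(signal), "samples ->", len(codes), "codes,", bits_sent, "bits")
```

Only `codes` crosses the network or hits the disk; the reconstruction is approximate, which is the sense in which such a scheme is lossy.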
Lossless by removing information? (Score:4, Insightful)
Seems that the marketing department is confusing things again. If I want to record birds, or rats singing that format would drop the ultrasound part of the information. So yes it works for day to day use with a claim that the missing audio is non perceivable by most humans but it is far from lossless.
Re: (Score:2)
Re: (Score:2)
Surely it's no loss in quality relative to MP3, not a lossless codec.
Re: (Score:3)
e.g. Not much good to have an AI codec if hypothetically it transpired you needed to load u
Re: (Score:2)
I'd be curious if the training data was mostly Western music with equal temperament tuning and if that affects what happens when you give it Pythagorean tuning or microtonal sounds.
Re:Lossless by removing information? (Score:5, Informative)
If you read to the end of the summary, you will see:
"The key to lossy compression is to identify changes that will not be perceivable by humans, as perfect reconstruction is impossible at low bit rates. To do so, we use discriminators to improve the perceptual quality of the generated samples. This creates a cat-and-mouse game where the discriminator's job is to differentiate between real samples and reconstructed samples. The compression model attempts to generate samples to fool the discriminators by pushing the reconstructed samples to be more perceptually similar to the original samples."
It seems like they understand compression and loss quite well.
Re:Lossless by removing information? (Score:4, Interesting)
The issue with this method is the discriminator. It has to model human hearing and perception of sound in order to measure how good a lossy codec is. Any error in that modelling gets translated to the compression algorithm.
My first thought was that it might be more interesting to do what FLAC does. A lossy compression of the audio, followed by a lossless compression of the difference between the lossy version and the original. That way you get some of the benefit of lossy compression, but the end result is still lossless. AI could optimise the lossy compression part.
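The hybrid idea in the comment above (a lossy layer plus a losslessly stored correction) can be sketched as follows. This is only an illustration: the "lossy" step here is plain bit truncation rather than a real codec, the function names are made up, and it is not a description of FLAC's actual internals, which are based on linear prediction with a coded residual.

```python
def lossy_layer(samples, shift):
    """Coarse version of the signal: drop the `shift` lowest bits of each sample."""
    return [(s >> shift) << shift for s in samples]

def residual(samples, coarse):
    """Correction signal: what the lossy layer threw away."""
    return [s - c for s, c in zip(samples, coarse)]

def reconstruct(coarse, correction):
    """Lossy layer plus correction gives back the original exactly."""
    return [c + r for c, r in zip(coarse, correction)]

pcm = [0, 1200, -873, 32767, -32768, 45]   # example 16-bit samples
coarse = lossy_layer(pcm, shift=8)
correction = residual(pcm, coarse)
restored = reconstruct(coarse, correction)

print(coarse)
print(correction)
print(restored == pcm)
```

A player that only wants the small file decodes the lossy layer; one that also has the correction stream recovers every bit.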
Re: (Score:2)
I assume they leverage previous work on lossy codecs. Even if much of it is human testing, they might be able to use those saved results to create a model.
Re:Lossless by removing information? (Score:4, Funny)
Re: Lossless by removing information? (Score:3)
If it's not in FLAC, I'm not coming back.
Re: (Score:2)
No they are not. The editor just mistakenly dropped 'virtually' from in front of 'loss-less'. ;-)
Re: Lossless by removing information? (Score:4)
Under that reasoning, there is no lossless codec. Sampling above 44.1khz is pointless for nearly anything other than scientific applications. Per Nyquist, that means you don't get any sound above 22khz, which is already above the best human hearing. Only idiots sample higher than that for anything a person is intended to hear.
But say you want to record unicorns singing for whatever reason, even 192khz sampling rate is nowhere near the theoretical limit of the highest possible audio frequency that can physically exist.
In other words, lossless doesn't mean what you think it means.
Re: (Score:1)
"Only idiots sample higher than that for anything a person is intended to hear."
Nice, insulting other people because you think you know better.
Except you don't.
1) Higher than 44.1 kHz sampling is used for mastering purposes, before downsampling back to 44.1 kHz
2) 44 kHz would be enough to perfectly sample frequencies up to 22 kHz, IF and ONLY IF you had infinite precision on the amplitude.
Which you clearly don't.
Thus, a higher sampling rate does bring benefits, as you are trying to approximate real values wit
Re: (Score:3)
2) 44 kHz would be enough to perfectly sample frequencies up to 22 kHz, IF and ONLY IF you had infinite precision on the amplitude.
Which you clearly don't.
With 24 bits you have more amplitude resolution than you can practically realize in the analog electronics that precede or follow it.
24 bits represents a dynamic range of over 144dB. What analog equipment has a signal-to-noise ratio close to that? Most tops out around 120dB.
Thus there are many 24 bits DACs that won't ever be able to display their excellent amplitude resolution because of the analog electronics that transport it to your speakers.
24 bits gets you way deep into analog background noise. How yo
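The 144dB figure quoted above follows from the usual rule that each bit doubles the number of amplitude steps, which adds about 6dB. A minimal check, using the ratio between full scale and one quantization step (the function name is just for illustration):

```python
import math

def dynamic_range_db(bits):
    """Ratio between full scale and one quantization step, in decibels."""
    return 20 * math.log10(2 ** bits)

for bits in (16, 24):
    print(bits, "bits ->", round(dynamic_range_db(bits), 1), "dB")
```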
Re: (Score:3)
Oversampling in audio processing is also very much a thing to mitigate aliasing from non-linear processing (ie. any sort of distortion). This way hopefully the most significant aliasing will fall into the excess bandwidth so it
Re: (Score:3)
Oversampling DACs are a thing, ...
Since most ADCs and DACs are delta-sigma most ADCs and DACs are also oversampling. Like i said, practically all converters work in the mega Hertz range internally.
but usually you'd be oversampling with something like 1-4 bits in order to produce a signal that's equivalent to the 24 bits signal that you're advertising,
Sure, that's how delta-sigma converters work. Super high sampling rates and super low bit depths. All in all capturing about 24 bits worth of information when converted back to normal sampling rates.
Oversampling in audio processing is also very much a thing to mitigate aliasing from non-linear processing (ie. any sort of distortion).
Sure. But you only need to do this if the plugin in question doesn't already have internal oversampling.
And as you say, it's also kindof strange to do
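For readers unfamiliar with the "high rate, low bit depth" trade-off discussed above, here is a minimal first-order, 1-bit delta-sigma modulator. It is a textbook-style sketch, not a model of any particular converter chip: real parts use higher-order loops and a decimation filter, and the function name and test input are assumptions.

```python
def delta_sigma_1bit(samples):
    """First-order 1-bit modulator: the running average of the output tracks the input."""
    integrator = 0.0
    feedback = 0.0
    bits = []
    for x in samples:                       # x is expected to lie in [-1, 1]
        integrator += x - feedback          # accumulate the error so far
        bit = 1 if integrator >= 0 else 0   # 1-bit quantizer
        feedback = 1.0 if bit else -1.0     # value the bit represents
        bits.append(bit)
    return bits

# A constant input of 0.5, heavily oversampled.
bits = delta_sigma_1bit([0.5] * 1000)
average = sum(1.0 if b else -1.0 for b in bits) / len(bits)
print(bits[:8], round(average, 3))
```

Each output sample carries one bit, yet averaging many of them recovers the input level to fine resolution, which is why the effective bit depth is bought with sample rate.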
Re: (Score:3)
Re: (Score:3)
Good points well made.
What's really sad about this 'hi-res' craze (well, it's been going on for quite some time with SACD etc) is that the recordings are terrible overall.
At one point a couple of years ago i decided to check them out from a technical perspective.
I downloaded a bunch of 'samplers'. Generally encoded at 96kHz or higher and 24 bits.
I think i examined 8 or 10 or so and none of them actually had musical content that came close to using the available space.
None of them had a noise floor below abo
Re: (Score:2)
It seems someone is confusing sampling with quantization?
Re: (Score:2)
Nice, insulting other people because you think you know better.
Because I do.
Re: (Score:2)
theoretical limit of the highest possible audio frequency that can physically exist.
Some posts on the internet claim that the highest sound frequency is in the gigahertz range.
Do you know if that is true? I would think that the highest possible sound frequency would depend on the speed of sound and the density of the medium?
Re: Lossless by removing information? (Score:2)
It does, the point is that even our highest sampling rates don't come anywhere near the limits of 1 ATM of air. So if you're going to count exclusion of ultrasonic sound as lossy, then literally everything we have is lossy.
Re: (Score:2)
Most lossless formats won't preserve your ultrasound either.
Audio recordings are typically 44.1kHz or 48kHz sample rate. Due to Nyquist that means the highest frequencies they can reproduce are 22.05kHz and 24kHz respectively.
And even if you sample at 96kHz, most sound cards cannot exceed 48kHz so will just remove frequencies higher than 24kHz anyway, or even worse convert them into distortion. Most sound recording hardware has a built in low pass filter to remove frequencies above 24kHz too, as does a lot
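The Nyquist limits quoted above, and the "convert them into distortion" case, can be checked with a few lines. The aliasing helper assumes an ideal sampler with no low-pass filter in front of it; both function names are illustrative.

```python
def nyquist_hz(sample_rate_hz):
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2

def alias_hz(tone_hz, sample_rate_hz):
    """Frequency at which an unfiltered tone shows up after sampling."""
    nearest_multiple = round(tone_hz / sample_rate_hz) * sample_rate_hz
    return abs(tone_hz - nearest_multiple)

for rate in (44_100, 48_000, 96_000):
    print(rate, "Hz sampling ->", nyquist_hz(rate), "Hz upper limit")

# A 30 kHz ultrasonic tone sampled at 48 kHz without filtering folds back to 18 kHz.
print(alias_hz(30_000, 48_000))  # 18000
```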
Re: (Score:3)
Audio recordings are typically 44.1kHz or 48kHz sample rate.
This would all be true 10~20 years ago.
These days most studios record at least at 96kHz.
The benefit is better reproduction of the hearable range due to more relaxed filter constraints.
Once recorded you can use it to master lower bandwidth versions.
most sound cards cannot exceed 48kHz so will just remove frequencies higher than 24kHz anyway, or even worse convert them into distortion.
That's nonsense if you're talking about audio interfaces. Most stuff is 24/96 or higher these days, even the cheap stuff. Only the very lowest and cheapest category of audio interfaces (Behringer etc.) is limited to 48kHz. Remember that 24/96 converters were alrea
Re: (Score:2)
I was under the impression that even when recording at 96kHz, there is a low pass filter at around 24 or 28kHz anyway. Sampling at that rate is done to improve the quality of digital equalization and mixing later.
As for audio interfaces, I'm talking about the DAC. While it might take 96kHz in, I bet that the output has a low pass filter that removes everything above about 24kHz. Perhaps in some very high end equipment there are adjustable low pass filters, perhaps done digitally too, but most gear will just
Re: (Score:3)
I was under the impression that even when recording at 96kHz, there is a low pass filter at around 24 or 28kHz anyway.
Not usually.
Sampling at that rate is done to improve the quality of digital equalization and mixing later.
This can be one of the reasons, tho mixing itself (adding channels) is not a problem.
But a lot of processing algorithms do benefit from a higher sample rate, especially non-linear algorithms like distortion. But usually this has nothing to do with the source material. The plugins could simply upsample the signal internally, process, and then downsample to the original sample rate.
Still, many plugins don't do this or don't do this correctly and a higher samplerate can make them sound better becau
Re: (Score:2)
If I want to record birds, or rats singing that format would drop the ultrasound part of the information.
There seems to be nothing specific about the frequency range that can be encoded by this format.
it is far from lossless.
They didn't say it's lossless. They say there is no loss from a 64kbps MP3 but the MP3 is already lossy. So you get the same loss as a 64kbps mp3 except it is 10x smaller.
Re: (Score:3)
So yes it works for day to day use with a claim that the missing audio is non perceivable by most humans but it is far from lossless.
You're being obtuse. "lossless" has nothing to do with frequency range. You don't call WAV format "lossy" simply because the sample rate is 44.1kHz and also doesn't contain irrelevant ultrasound information.
In any case you're the only one here making a claim that something is lossless. That term has a specific meaning and this codec is not lossless, nor was it claimed to be. The claim was a comparison between mp3, a format which already applies low-pass filtering as part of compression (unless someone very
Meta's "Some Foolish Geegaw" Promises (Score:2)
Lossy (Score:1)
The authors need to understand lossless vs. lossy. I also wonder if this will be free and open source?
Re: (Score:2)
The article never claims the codec is lossless.
The key to lossy compression is to identify changes that will not be perceivable by humans, as perfect reconstruction is impossible at low bit rates
Re: (Score:2)
computing power (Score:1)
Re: (Score:2)
Re: computing power (Score:2)
Probably a good amount for decompression. But it comes from so little data, 300k for a song, which raises the question: will I be able to plug in my talentless crap, adequately downsampled, and have something that sounds talented come out?
Re: (Score:2)
>1. It's the infinite monkey theorem.
I don't really understand the popular view of this theorem. If you have an infinite number of monkeys, an infinite number of them will produce the complete works of William Shakespeare in just as much time as needed to type them.
Start with an infinite number of monkeys. Have them type the first letter.
Eliminate all that do not type the first letter of the complete works of Shakespeare. You still are left with an infinite number of monkeys, who have all typed the fi
Re: (Score:1)
Twitter. "Interesting"? Maybe that's a stretch.
Pied Piper! (Score:2)
...but Thomas Middleditch instead plays a drugged out billionaire obsessed with getting humans to prefer living within his virtual reality
Re: (Score:1)
Neural network on a single CPU (Score:1, Offtopic)
Finally, the decoder turns the compressed data back into audio in real time using a neural network on a single CPU.
Sorry, a single CPU is not a neural network. A simulated one, perhaps.
Re: Neural network on a single CPU (Score:2)
Re: (Score:1)
An important characteristic of a neural network is massive parallelism, where you have thousands of limited processors working at the same time. Sure, you can simulate this on a single chip, but you lose the massively parallel architecture. You end up spending a whole lot of computing power on the simulation of the neural net. You'd be better off ditching the neural net simulation and optimizing your prediction algorithm to run like normal software on a normal processor.
Re: (Score:1)
Parallel execution, and the massiveness thereof, are not defining characteristics of neural networks. There isn't some processor number cutoff where a NN becomes "real" instead of "simulated." I take it you think neural nets were only simulated decades ago and became real since GPUs and TPUs were developed? Or is that not enough? Is a neural net just a simulation until each artificial neuron has its own dedicated processor? (Thus making all NNs with more neurons than processors mere simulations. Absur
Meta Owns Your Data (stream) (Score:1)
Re: (Score:2)
The paper contains a link to a GitHub repository containing code that (according to the documentation at least) can both encode and decode. So at the very least there is an implementation out there (even if it is not actually open source, since it's licensed CC-BY-NC).
Because AI (Score:2)
This codec can achieve such amazing compression "because AI."
The part about achieving the target bandwidth, I believe. The part about not losing quality in the process, not so much.
Re: (Score:3)
"Though I guess if that random song were 4'33" maybe the mp3 could still give ACcodec a run for its money."
The sound of the audience listening to the performance is an integral part of 4'33". Just saying...
Re: (Score:2)
This reminds me of Vernor Vinge's A Fire Upon the Deep, where the protagonist is trying to have a conversation with someone in a nice, crisp and realistic video stream and realizes she's just talking to the local AI, which is spouting bullshit because the actual data rate coming through is less than 10 bits per second.
Could be useful... (Score:3, Funny)
...if you had some lame audio player with no wireless and less space than a Nomad.
"First 48 kHz" but "lower frame rate" (Score:2)
First, the encoder transforms uncompressed data into a lower frame rate "latent space" representation.
Meta's researchers claim they are the first group to apply the technology to 48 kHz stereo audio
These seem contradictory to me.
If an Encodec file is played in the metaverse and. (Score:1)
If an Encodec file is played in the metaverse and there's no one around to hear it, does it sound lossless?
Since in the metaverse there is no air to breathe or vibrate can sounds be played back in the metaverse?
It was a joke (Score:2)
Don't actually make people have to use proprietary formats to post on Twitter as I "suggested" on the other news
Not the first with smaller compression (Score:3)
If Facebook cared about saving bandwidth they would dissolve their company; all Facebook does is waste the bandwidth and the time of people who use it.
Re: (Score:3)
Yes, but 6 kbps to make something that approaches 64 kbps MP3 in quality is very impressive. Its utility will be in places where bandwidth will always be at a premium such as low-earth orbit satellite networks. No one will care about its storage savings, however, for obvious reasons.
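To put the 6 kbps versus 64 kbps comparison in concrete terms, here is the size arithmetic for one track. The four-minute song length is an assumption picked for illustration, and container overhead is ignored.

```python
def stream_size_bytes(bitrate_kbps, seconds):
    """Payload size of a constant-bitrate stream (1 kbps = 1000 bits per second)."""
    return bitrate_kbps * 1000 * seconds / 8

song_seconds = 4 * 60  # assumed track length

size_6 = stream_size_bytes(6, song_seconds)
size_64 = stream_size_bytes(64, song_seconds)
print(size_6, size_64, round(size_64 / size_6, 2))
```

Under these assumptions the 6 kbps stream is about 180 kB against roughly 1.9 MB at 64 kbps, a ratio of a little over 10.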
Re: (Score:2)
Yes, but they would need to get it near 128kbps MP3 to be useful. 64kbps MP3 has lots of audible artefacts.
Re: (Score:2)
Agree on the audio artifacts, you can definitely hear them in music playback. Speech playback has much lower tolerances.
Re: (Score:2)
A lot of people can easily hear audio artifacts from 128kbps MP3, and even Apple increased the bitrate of AAC files from the iTunes Store to 256kbps, which probably requires at least 384kbps MP3 to match it.
Are most people using dollar store speakers and headphones to listen to music, or what?
Re: (Score:2)
Re: (Score:3)
This new codec is not going to be adopted. There were codecs that produced audio smaller than MP3 before. Most of them failed to be adopted because current codecs are good enough. Just look at this long list https://en.wikipedia.org/wiki/... [wikipedia.org]. MP3 became popular at the time of dial-up modem speeds and even at that time it was not dethroned. Today the size of an audio stream is tiny in comparison with the video, so the audio codec does not matter. It may matter to Facebook if they store huge amounts of audio on their servers. The average MP3 file has 1 minute of audio per 1MB, so a 1TB drive can store 1,000,000 minutes or roughly 2 years of audio. Today people use AAC, which has better compression and is supported by pretty much any device.
If Facebook cared about saving bandwidth they would dissolve their company; all Facebook does is waste the bandwidth and the time of people who use it.
What's the bandwidth bill for Spotify? They control the server and the client, if there's a smaller file format with equivalent quality I'm not sure why they wouldn't use it.
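The storage estimate quoted above (1 minute of MP3 per MB, so a 1TB drive holds about two years of audio) can be checked directly. This sketch assumes decimal units (1 TB = 1,000,000 MB) and also derives the bitrate that "1 minute per MB" implies.

```python
MB_PER_TB = 1_000_000          # decimal units, as drive vendors use
MINUTES_PER_MB = 1             # rule of thumb from the comment

minutes = MB_PER_TB * MINUTES_PER_MB
years = minutes / 60 / 24 / 365

# 1 MB per minute expressed as a bitrate: bytes -> bits, per second, in kbps.
implied_kbps = 1_000_000 * 8 / 60 / 1000

print(minutes, round(years, 2), round(implied_kbps, 1))
```

The result is about 1.9 years of continuous audio, with an implied bitrate near 133 kbps, so the "roughly 2 years" figure holds up.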
Re: (Score:2)
The Opus codec is mature and available today. It is transparent for most music by 128kbps, with great results even at 64kbps, but all the music stores still use inferior MP3, Vorbis, and AAC at higher bitrates.
They don't really seem to care.
Re: (Score:2)
What's the bandwidth bill for Spotify? They control the server and the client, if there's a smaller file format with equivalent quality I'm not sure why they wouldn't use it.
You are forgetting that there are two parties involved in this: Spotify and their listener. It is the listener that is the problem. A codec needs to be widely adopted for Spotify to move to it. Or for Facebook, for that matter. There is a huge variety of clients that need to support this codec. Some of them do not have the horsepower to run this decoder. This is an old debate which has been rehashed many times, yet MP3 still exists and is still supported. For example https://www.amazon.com/music/p... [amazon.com] . This guy born the same
Re: (Score:2)
With all the better codecs around, music is still sold in MP3.
With how cheap storage space has become, there's no good reason for paid music downloads to be in any sort of lossy format, period.
Re: (Score:2)
Significant savings may be realized through streaming in the new format to clients controlled by Spotify.
Someone is getting off this boat right now (Score:2)
> The key to lossy compression is to identify changes that will not be perceivable by humans, as perfect reconstruction is impossible at low bit rates.
I shot the Sheriff but I did not shoot the Deputy, because otherwise you would notice the police were missing.
do you know the story about the horse? (Score:2)
So that guy owned a horse. But he was frugal and figured that if he could get the horse to eat less, he could save a bunch. He almost succeeded. He got the horse to survive on little for a long time. Emboldened by the progress, he then pushed bravely further yet and resolved not to feed the horse at all. Though the horse then quit and died.
So did the audio compressed with this method. The zero-length compression works phenomenally well on ... silence.
The key to lossy compression... (Score:2)
Anecdotally, the early dynamic range compression used in telephony worked great for western languages but was impacting intelligibility for tonal ones. I can't find any references to this, anyone?
Re: (Score:3)
My recollection is that western POTS would carry about 8kHz and you need more like 11kHz for the full range of human speech, so people speaking tonal languages had to shout to be understood.
Why not evolve one? (Score:1)
A genetic algorithm could perhaps be used to evolve a highly optimized codec. Maybe combine such with neural net techniques.
Shiii... I got that beat (Score:2)
I got a compression algorithm that whips the pants off that. Every tune compresses down to a string that's 128 characters or less. It might take a while to download though.
What's the catch? (Score:2)
With companies like Facebook, Google, Amazon, Akamai, CloudFlare that live on exploiting other people's data, you always have to wonder why they invest in anything. Because no company invests in cool tech just for the beauty of it.
So naturally, the first question that comes to my mind is: WTF is Facebook up to with this one? Why do they invest in a codec that can compress audio for use on ultra-low bandwidth links?
The only two explanations I can think of are:
- Always-on audio surveillance
- Always-on music c
Re: (Score:2)
Look, I'm not buying into this ecosystem, but how can you be confused about the benefit of a more efficient audio codec? Facebook serves a lot of videos with audio, and there's obvious utility to streaming audio in VR.
Re: (Score:3)
The only two explanations I can think of are:
- Always-on audio surveillance
- Always-on music copyright infringement enforcement
You are asking about a company that not only moves media around its platform but also provides voice and video communication services why they would invest in improving the efficiency and reducing the bandwidth cost of said service, and the only two examples you can think of are something completely unrelated to codecs?
Seriously on a scale of "Sep/11 was an inside job" to "We are ruled by alien lizards", just how far down the rabbit hole of insanity have you fallen?
Can we apply it to the Metastasis itself? (Score:2)
I'm pretty sure that if we remove everything that isn't perceived and doesn't lead to the user noticing a loss in quality or content, we can save a couple dozen billion.
Plus, we'd get rid of that nuisance on top.
Not too bad actually (Score:4, Informative)
Ignoring all the marketing BS, I followed the links to the actual announcement ( https://ai.facebook.com/blog/a... [facebook.com] ) and listened to the samples they provided.
I have to say for 6 kbps, the last segment of the sample (made with their "Encodec") is pretty good quality.
Re: (Score:1)
Everyone is missing the point (Score:2)
Meta provides communication services. Even if this codec doesn't get used for music that doesn't mean it is useless. If it can be made more efficient than Opus both in terms of bandwidth and computation, then you'll find hundreds of millions of people around the world may use it without ever knowing about it.
No really, without looking it up what codec is used by WhatsApp, Messenger, standard 3GPP voice calls? There are many codecs you will use without ever having a clue what they are.
Relevancy of this (Score:2)
I think we are starting to see the same with audio. MP3 at 320kbit/s is as good as uncompressed audio for all people except very small exceptions. And we have better algorithms without patent problems in the form of Opus or
Re: (Score:2)
With modern encoders, listening tests indicate MP3 is essentially transparent at 192kbps. Thing is about whether it's needed: it's a massive straw man to compare against MP3 these days, since that's long been superseded. Opus beats or matches other codecs pretty much across the board except for very low bitrates that it doesn't support and there codec2 beats all comers.
Trained Ears (Score:2)
After so many years of enjoying the convenience of streaming music/audio, I was amazed at how used to the quality I've become. From music being streamed from my phone to listening to SiriusXM in the car to letting Alexa entertain me at home, etc. This past year I went old school and bought a record player, and nabbed some classic vinyl I'd owned back in the day. Listening to the depth and richness of it compared to digitized streaming media was like night and day. Obviously the bandwidth and convenience are
did they try it out first successfully on Kanyes (Score:2)
CC-BY-NC License (Score:2)
They expect to license this to *everyone* providing an additional revenue stream.
The problem will be getting anyone to pay them for it, instead of generating their own AI codec, if necessary.
Meta can't patent the process of generating an AI codec. Even if they end up quite similar, any clean room implementation such as training from scratch would eliminate any basis for a lawsuit.
They keep saying ... (Score:1)
Every time some fruitloop comes up with a lossy compressor that is more lossy (and thus gets better compression) that fruitloop claims that the improvement occurs without "loss" of "quality" of the audio or video or whatever the fruitloop is compressing. Usually the claim of "equal quality" is due to the defective fruitloop itself being either blind or deaf or both.
Every single "new fangled lossy compressor" that has come out over the past 40 years that has claimed "more efficient compression" has also, wit
Isn't MP3 a couple/three decades old? (Score:2)
Why the specious comparison?
I mean WMA audio and OGG Vorbis are marginally better than MP3. Even the creator of MP3, Fraunhofer Institute, has been working on later codecs (MPEG-H and xHE-AAC for example https://www.iis.fraunhofer.de/... [fraunhofer.de] ).
As far as I've seen... AAC is today's standard lossy method, and FLAC the lossless, with AAC having flavors/versions for higher or lower bitrates (and voice-specific vs. not). The HE above stands for High Efficiency, meaning more work but smaller files or better result
Uh, I want FLAC, DSD and Atmos mixes (Score:1)
"Improve the sound quality of speech"? (Score:1)
How much processing power? (Score:2)
Excellent! (Score:1)
Awesome-sauce!