Toward a Sparse Interpretable Audio Codec


Introduction

Most widely used modern audio codecs, such as Ogg Vorbis and MP3, as well as more recent "neural" codecs like Meta's Encodec or Descript's, are based on block coding: audio is divided into overlapping, fixed-size "frames", which are then compressed. While these codecs offer excellent reproduction quality and can be used for downstream tasks such as text-to-audio, they do not produce an intuitive, directly interpretable representation.

Rudimentary physics-based assumptions are used to model the attack and the physical resonance of both the instrument being played and the room in which a performance occurs, with the aim of encouraging a sparse, parsimonious, and easy-to-interpret representation. The encoder iteratively removes energy from the input spectrogram; at each step it produces an event vector and a one-hot (Dirac) impulse marking the event's time of occurrence. The decoder side of the model leaves the most room for exploration: a range of physical-modelling-style approaches could yield better, more realistic, and sparser renderings of the audio.
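To make the encode loop concrete, here is a minimal sketch in Python with NumPy. It is not the article's learned model: it swaps in a greedy, matching-pursuit-style search over a fixed dictionary of spectral templates. The names `encode`, `decode`, and `atoms` are hypothetical and chosen only for illustration. What it shares with the description above is the shape of the output: each iteration removes energy from the residual spectrogram and emits one event vector plus one one-hot impulse marking the frame where the event occurs.

```python
import numpy as np


def encode(spec, atoms, n_events):
    """Greedy stand-in for the iterative encoder (an assumption, not the article's network).

    spec:  (n_bins, n_frames) magnitude spectrogram
    atoms: (n_atoms, n_bins) unit-norm spectral templates (hypothetical dictionary)
    Returns a list of (event_vector, impulse) pairs and the final residual.
    """
    residual = spec.astype(float).copy()
    events = []
    for _ in range(n_events):
        # Correlate every template with every frame of the residual.
        scores = atoms @ residual  # shape (n_atoms, n_frames)
        atom_idx, frame = np.unravel_index(np.argmax(scores), scores.shape)
        gain = scores[atom_idx, frame]
        if gain <= 0:
            break  # nothing left that any template explains

        # Event vector: what happened. One-hot impulse: when it happened.
        vector = gain * atoms[atom_idx]
        impulse = np.zeros(spec.shape[1])
        impulse[frame] = 1.0

        # Remove the explained energy from the residual.
        residual -= np.outer(vector, impulse)
        events.append((vector, impulse))
    return events, residual


def decode(events, shape):
    """Rebuild a spectrogram by placing each event vector at its impulse."""
    out = np.zeros(shape)
    for vector, impulse in events:
        out += np.outer(vector, impulse)
    return out


# Toy input: two isolated events in a 4-bin, 6-frame spectrogram.
atoms = np.eye(4)
spec = np.zeros((4, 6))
spec[1, 2] = 3.0
spec[3, 5] = 2.0

events, residual = encode(spec, atoms, n_events=3)
reconstruction = decode(events, spec.shape)
```

In this toy case the loop stops after two events because the residual is empty, which is the sparse behaviour the representation is meant to encourage. A decoder closer to the one described above would replace the plain outer product with a model of attack and resonance, which this sketch does not attempt.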
