Authors: Seungu Han, Junhyeok Lee @MINDsLab Inc., SNU
Abstract: Conventional audio super-resolution models fix the initial and target sampling rates, so a separate model must be trained for each pair of sampling rates. We introduce NU-Wave 2, a diffusion model for neural audio upsampling that generates 48 kHz audio signals from inputs of various sampling rates with a single model. Based on the architecture of NU-Wave, NU-Wave 2 uses short-time Fourier convolution (STFC) to generate harmonics, which resolves the main failure modes of NU-Wave, and incorporates bandwidth spectral feature transform (BSFT) to condition on the bandwidths of inputs in the frequency domain. We experimentally demonstrate that NU-Wave 2 produces high-resolution audio regardless of the sampling rate of the input while requiring fewer parameters than other models. The official code and the audio samples are available at https://mindslab-ai.github.io/nuwave2.
This page contains a set of audio samples to support the paper; we suggest that the reader listen to the samples when reading the paper. All utterances were unseen during training, and the results are uncurated (NOT cherry-picked) unless specified.
The original low-resolution audio samples are resampled to 48 kHz.

| | Original low resolution (24 kHz -> 48 kHz) | Original high resolution (48 kHz) | WSRGlow (48 kHz) | NU-Wave (48 kHz) | NU-Wave 2 (48 kHz) |
|---|---|---|---|---|---|
| Audio | | | | | |
| Linear spectrogram | | | | | |
| Audio | | | | | |
| Linear spectrogram | | | | | |
| Audio | | | | | |
| Linear spectrogram | | | | | |
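
The low-resolution inputs on this page are presented at 48 kHz after resampling. The following is a minimal sketch of how such a band-limited input could be prepared from a 48 kHz recording. It assumes SciPy's polyphase resampler; the function name `simulate_low_resolution` is hypothetical, and the resampling filter used for the actual samples is not stated on this page.

```python
import math

import numpy as np
from scipy.signal import resample_poly


def simulate_low_resolution(wave_48k: np.ndarray, low_sr: int,
                            high_sr: int = 48000) -> np.ndarray:
    """Return a band-limited copy of `wave_48k` that is still sampled at `high_sr`.

    Assumption: polyphase resampling with SciPy's default anti-aliasing filter.
    This is an illustrative choice, not necessarily the one used for this page.
    """
    g = math.gcd(low_sr, high_sr)
    up, down = low_sr // g, high_sr // g
    # Step 1: high_sr -> low_sr (content above low_sr / 2 is filtered out).
    low = resample_poly(wave_48k, up, down)
    # Step 2: low_sr -> high_sr (no new high-frequency content is created).
    restored = resample_poly(low, down, up)
    # Trim any extra samples introduced by rounding in the two steps.
    return restored[: len(wave_48k)]


if __name__ == "__main__":
    sr = 48000
    t = np.arange(sr) / sr
    # One second of a 1 kHz tone plus a 20 kHz tone.
    x = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 20000 * t)
    y = simulate_low_resolution(x, low_sr=24000)
    print(len(x), len(y))
```

With a 24 kHz intermediate rate, content above 12 kHz is removed, which is the band an upsampling model is then expected to regenerate.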

| | Original low resolution (16 kHz -> 48 kHz) | Original high resolution (48 kHz) | WSRGlow (48 kHz) | NU-Wave (48 kHz) | NU-Wave 2 (48 kHz) |
|---|---|---|---|---|---|
| Audio | | | | | |
| Linear spectrogram | | | | | |
| Audio | | | | | |
| Linear spectrogram | | | | | |
| Audio | | | | | |
| Linear spectrogram | | | | | |

| | Original low resolution (12 kHz -> 48 kHz) | Original high resolution (48 kHz) | WSRGlow (48 kHz) | NU-Wave (48 kHz) | NU-Wave 2 (48 kHz) |
|---|---|---|---|---|---|
| Audio | | | | | |
| Linear spectrogram | | | | | |
| Audio | | | | | |
| Linear spectrogram | | | | | |
| Audio | | | | | |
| Linear spectrogram | | | | | |

| | Original low resolution (8 kHz -> 48 kHz) | Original high resolution (48 kHz) | WSRGlow (48 kHz) | NU-Wave (48 kHz) | NU-Wave 2 (48 kHz) |
|---|---|---|---|---|---|
| Audio | | | | | |
| Linear spectrogram | | | | | |
| Audio | | | | | |
| Linear spectrogram | | | | | |
| Audio | | | | | |
| Linear spectrogram | | | | | |

| | Original low resolution (22.05 kHz -> 48 kHz) | NU-Wave 2 (48 kHz) |
|---|---|---|
| LJ001-0001 | | |
| Linear spectrogram | | |
| LJ001-0002 | | |
| Linear spectrogram | | |
| LJ001-0003 | | |
| Linear spectrogram | | |

| | Original low resolution (22.05 kHz -> 48 kHz) | NU-Wave 2 (48 kHz) |
|---|---|---|
| Audio | | |
| Linear spectrogram | | |

| | Original high resolution (48 kHz) | NU-Wave 2 w/o BSFT (24 kHz input) | NU-Wave 2 w/o BSFT (16 kHz input) | NU-Wave 2 w/o BSFT (12 kHz input) | NU-Wave 2 w/o BSFT (8 kHz input) |
|---|---|---|---|---|---|
| Audio | | | | | |
| Linear spectrogram | | | | | |
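
The linear spectrogram rows on this page show how much of the 0 to 24 kHz band each output fills. The following is a minimal sketch of how a comparable log-magnitude spectrogram on a linear frequency axis could be computed with SciPy. The function name `linear_spectrogram_db` and the window and hop sizes are assumptions chosen for illustration; the settings used to render the figures on this page are not given here.

```python
import numpy as np
from scipy.signal import stft


def linear_spectrogram_db(wave: np.ndarray, sr: int = 48000,
                          n_fft: int = 1024, hop: int = 256):
    """Log-magnitude STFT (in dB) on a linear frequency axis.

    Assumption: Hann window, n_fft=1024, hop=256 (illustrative values only).
    """
    freqs, times, spec = stft(
        wave, fs=sr, window="hann", nperseg=n_fft, noverlap=n_fft - hop
    )
    # A small offset avoids log(0) in silent regions.
    mag_db = 20.0 * np.log10(np.abs(spec) + 1e-9)
    return freqs, times, mag_db


if __name__ == "__main__":
    sr = 48000
    t = np.arange(sr) / sr
    tone = np.sin(2 * np.pi * 1000 * t)  # one second of a 1 kHz test tone
    freqs, times, mag_db = linear_spectrogram_db(tone, sr=sr)
    # Rows are frequency bins from 0 Hz to sr / 2, columns are time frames.
    print(mag_db.shape, freqs[0], freqs[-1])
```

For a signal upsampled from an 8 kHz input without any generated content, such a plot would show energy only below 4 kHz, which makes missing or spurious high-frequency bands easy to spot.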