Abstract:
In this work, we introduce NU-Wave, the first neural audio upsampling model to produce waveforms with a sampling rate of 48kHz from coarse 16kHz or 24kHz inputs, whereas prior works could generate only up to 16kHz. NU-Wave is the first diffusion probabilistic model for audio super-resolution, and it is engineered based on neural vocoders. NU-Wave generates high-quality audio that achieves high performance in terms of signal-to-noise ratio (SNR), log-spectral distance (LSD), and accuracy of the ABX test. In all cases, NU-Wave outperforms the baseline models despite having a substantially smaller model capacity (3.0M parameters, which is 5.4-21% of the baselines' size). The audio samples of our model are available at https://mindslab-ai.github.io/nuwave, and the code will be made available soon.
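For readers unfamiliar with the two objective metrics named above, the following is a sketch of how SNR and LSD are commonly defined in the audio super-resolution literature. The symbols are assumptions introduced here, not taken from this page: $x$ is the reference 48kHz signal, $\hat{x}$ is the upsampled output, and $X(\tau,\kappa)$, $\hat{X}(\tau,\kappa)$ are their STFT magnitudes over $T$ frames and $K$ frequency bins. The exact STFT settings used in the paper are not stated on this page.

```latex
\mathrm{SNR}(\hat{x}, x) = 10 \log_{10} \frac{\lVert x \rVert_2^2}{\lVert \hat{x} - x \rVert_2^2}
\qquad
\mathrm{LSD}(\hat{x}, x) = \frac{1}{T} \sum_{\tau=1}^{T}
  \sqrt{ \frac{1}{K} \sum_{\kappa=1}^{K}
  \left( \log_{10} \frac{\hat{X}(\tau,\kappa)^2}{X(\tau,\kappa)^2} \right)^{2} }
```

Higher SNR and lower LSD indicate an output closer to the reference.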
Illustration of NU-Wave Sampling (8 iterations)
This page contains a set of audio samples to support the paper; we suggest that the reader listen to the samples when reading the paper.
All utterances were unseen during training, and the results are uncurated (NOT cherry-picked) unless specified.
Section Ⅰ: Examples for SingleSpeaker (seen speaker during training) upsampled from 24kHz to 48kHz.
This section contains examples for the speaker "p225" from the VCTK dataset. The upsampling rate is 2 (from 24kHz to 48kHz).
Original low resolution (24 kHz)
Original high resolution (48 kHz)
Linear Interpolation (48 kHz)
U-Net (48 kHz)
MU-GAN (48 kHz)
NU-Wave (48 kHz)
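For orientation, the Linear Interpolation baseline listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `upsample_linear` and the choice to hold the final sample at the end are assumptions made here, and only the standard library is used.

```python
def upsample_linear(samples, rate):
    """Upsample a 1-D list of samples by an integer factor via linear interpolation.

    Hypothetical helper for illustration; `rate` is 2 for 24kHz -> 48kHz
    and 3 for 16kHz -> 48kHz.
    """
    if rate < 1:
        raise ValueError("rate must be a positive integer")
    if not samples:
        return []
    out = []
    # Between each pair of neighbouring input samples, emit `rate` points
    # spaced evenly along the straight line joining them.
    for i in range(len(samples) - 1):
        a, b = samples[i], samples[i + 1]
        for k in range(rate):
            out.append(a + (b - a) * k / rate)
    # Assumption: hold the last sample so the output has len(samples) * rate points.
    out.extend([samples[-1]] * rate)
    return out


low_res = [0.0, 1.0, 0.0]           # toy "24kHz" signal
high_res = upsample_linear(low_res, 2)
print(high_res)                      # [0.0, 0.5, 1.0, 0.5, 0.0, 0.0]
```

Linear interpolation only draws straight lines between existing samples, so it cannot restore the high-frequency content that the learned models aim to generate.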
Section Ⅱ: Examples for MultiSpeaker (unseen speaker during training) upsampled from 24kHz to 48kHz.
This section contains examples for the unseen speakers. The model is trained on the first 100 speakers of the VCTK dataset. The following samples are generated for the remaining 8 speakers. The upsampling rate is 2 (from 24kHz to 48kHz).
Original low resolution (24 kHz)
Original high resolution (48 kHz)
Linear Interpolation (48 kHz)
U-Net (48 kHz)
MU-GAN (48 kHz)
NU-Wave (48 kHz)
Section Ⅲ: Examples for SingleSpeaker (seen speaker during training) upsampled from 16kHz to 48kHz.
This section contains examples for the speaker "p225" from the VCTK dataset. The upsampling rate is 3 (from 16kHz to 48kHz).
Original low resolution (16 kHz)
Original high resolution (48 kHz)
Linear Interpolation (48 kHz)
U-Net (48 kHz)
MU-GAN (48 kHz)
NU-Wave (48 kHz)
Section Ⅳ: Examples for MultiSpeaker (unseen speaker during training) upsampled from 16kHz to 48kHz.
This section contains examples for the unseen speakers. The model is trained on the first 100 speakers of the VCTK dataset. The following samples are generated for the remaining 8 speakers. The upsampling rate is 3 (from 16kHz to 48kHz).