Audio samples for "NU-Wave 2: A General Neural Audio Upsampling Model for Various Sampling Rates"

Authors: Seungu Han, Junhyeok Lee @MINDsLab Inc., SNU

Abstract: Conventionally, audio super-resolution models fixed the initial and the target sampling rates, which necessitates training a separate model for each pair of sampling rates. We introduce NU-Wave 2, a diffusion model for neural audio upsampling that enables the generation of 48 kHz audio signals from inputs of various sampling rates with a single model. Based on the architecture of NU-Wave, NU-Wave 2 uses short-time Fourier convolution (STFC) to generate harmonics to resolve the main failure modes of NU-Wave, and incorporates bandwidth spectral feature transform (BSFT) to condition the bandwidths of inputs in the frequency domain. We experimentally demonstrate that NU-Wave 2 produces high-resolution audio regardless of the sampling rate of the input while requiring fewer parameters than other models. The official code and the audio samples are available at https://mindslab-ai.github.io/nuwave2.

Illustration of NU-Wave 2 Sampling (8 iterations)

This page contains a set of audio samples to support the paper; we suggest that the reader listen to the samples while reading the paper.
All utterances were unseen during training, and the results are uncurated (NOT cherry-picked) unless specified.

The original low-resolution audio signals are resampled to 48 kHz.
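As a rough illustration of the resampling mentioned above, the sketch below computes the integer up/down factors a polyphase resampler would need to bring each input rate on this page to 48 kHz, together with the Nyquist frequency (half the input rate) above which the low-resolution signal carries no content. The function names `resample_ratio` and `nyquist_hz` are hypothetical helpers introduced for this example; they are not taken from the official code, and the actual resampling method used for this page is not specified here.

```python
from fractions import Fraction

TARGET_SR = 48_000  # target sampling rate used throughout this page


def resample_ratio(input_sr: int, target_sr: int = TARGET_SR) -> Fraction:
    """Return target_sr / input_sr in lowest terms (hypothetical helper).

    The numerator is the upsampling factor and the denominator is the
    downsampling factor a rational resampler would use.
    """
    return Fraction(target_sr, input_sr)


def nyquist_hz(input_sr: int) -> float:
    """Highest frequency representable at input_sr (half the sampling rate)."""
    return input_sr / 2


if __name__ == "__main__":
    # Input rates that appear in the sections of this page.
    for sr in (24_000, 16_000, 12_000, 8_000, 22_050):
        ratio = resample_ratio(sr)
        print(
            f"{sr} Hz -> {TARGET_SR} Hz: "
            f"up={ratio.numerator}, down={ratio.denominator}, "
            f"input content below {nyquist_hz(sr):.0f} Hz"
        )
```

For the VCTK input rates the ratio is a whole number (2, 3, 4, and 6), while the 22.05 kHz case reduces to 320/147, so it needs a rational rather than an integer resampling factor.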

Section Ⅰ: Samples upsampled from 24 kHz to 48 kHz.

This section contains examples for the unseen speakers. The model is trained on the first 100 speakers of the VCTK dataset. The following samples are generated for the remaining 8 speakers.
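The speaker split described above can be sketched as follows. This is a minimal sketch under stated assumptions: it assumes "first 100" means the first 100 speakers after sorting by speaker ID, and it uses placeholder IDs rather than real VCTK speaker names. The total of 108 follows from the 100 training speakers plus the 8 remaining ones. `split_speakers` is a hypothetical helper, not a function from the official code.

```python
def split_speakers(speaker_ids, n_train=100):
    """Split speakers into (train, test) by sorted ID order.

    Assumption: "first 100 speakers" refers to sorted ID order; the
    official preprocessing may define the order differently.
    """
    ordered = sorted(speaker_ids)
    return ordered[:n_train], ordered[n_train:]


# Placeholder IDs standing in for the 108 speakers (not real VCTK names).
speakers = [f"spk{i:03d}" for i in range(108)]
train, test = split_speakers(speakers)
print(len(train), len(test))
```

Keeping the two lists disjoint is what makes the samples on this page "unseen speaker" results: no utterance from a test speaker is available during training.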

	Original low resolution (24 kHz -> 48 kHz)	Original high resolution (48 kHz)	WSRGlow (48 kHz)	NU-Wave (48 kHz)	NU-Wave 2 (48 kHz)
Audio
Linear Spectrogram
Audio
Linear Spectrogram
Audio
Linear Spectrogram

Section Ⅱ: Samples upsampled from 16 kHz to 48 kHz.

This section contains examples for the unseen speakers. The model is trained on the first 100 speakers of the VCTK dataset. The following samples are generated for the remaining 8 speakers.

	Original low resolution (16 kHz -> 48 kHz)	Original high resolution (48 kHz)	WSRGlow (48 kHz)	NU-Wave (48 kHz)	NU-Wave 2 (48 kHz)
Audio
Linear Spectrogram
Audio
Linear Spectrogram
Audio
Linear Spectrogram

Section Ⅲ: Samples upsampled from 12 kHz to 48 kHz.

This section contains examples for the unseen speakers. The model is trained on the first 100 speakers of the VCTK dataset. The following samples are generated for the remaining 8 speakers.

	Original low resolution (12 kHz -> 48 kHz)	Original high resolution (48 kHz)	WSRGlow (48 kHz)	NU-Wave (48 kHz)	NU-Wave 2 (48 kHz)
Audio
Linear Spectrogram
Audio
Linear Spectrogram
Audio
Linear Spectrogram

Section Ⅳ: Samples upsampled from 8 kHz to 48 kHz.

This section contains examples for the unseen speakers. The model is trained on the first 100 speakers of the VCTK dataset. The following samples are generated for the remaining 8 speakers.

	Original low resolution (8 kHz -> 48 kHz)	Original high resolution (48 kHz)	WSRGlow (48 kHz)	NU-Wave (48 kHz)	NU-Wave 2 (48 kHz)
Audio
Linear Spectrogram
Audio
Linear Spectrogram
Audio
Linear Spectrogram

Section Ⅴ: Samples upsampled from 22.05 kHz to 48 kHz (LJSpeech).

This section contains examples for an unseen dataset. Our model can also generate high-quality audio for other datasets, such as LJSpeech (22.05 kHz).
The model is trained on the first 100 speakers of the VCTK dataset. The following samples are generated for the LJSpeech dataset.

	Original low resolution (22.05 kHz -> 48 kHz)	NU-Wave 2 (48 kHz)
LJ001-0001
Linear Spectrogram
LJ001-0002
Linear Spectrogram
LJ001-0003
Linear Spectrogram

Section Ⅵ: Samples of non-English speech (Korean) upsampled from 22.05 kHz to 48 kHz (KSS).

This section contains examples for a non-English dataset. Our model can also generate high-quality audio for non-English speech, such as KSS (Korean, 22.05 kHz).
The model is trained on the first 100 speakers of the VCTK dataset. The following samples are generated for the KSS dataset.

	Original low resolution (22.05 kHz -> 48 kHz)	NU-Wave 2 (48 kHz)
Audio
Linear Spectrogram

Section Ⅶ: Samples from the ablation study without BSFT.

This section contains examples for the unseen speakers. The model is trained on the first 100 speakers of the VCTK dataset. The following samples are generated for the remaining 8 speakers.
The samples are generated with 1000 iterations to check whether the failure is caused by too few iterations.
We observe that the boundaries are not clear, which suggests that the model cannot reliably distinguish the input bandwidth without BSFT. The results are also slightly noisy.

	Original high resolution (48 kHz)	NU-Wave 2 w/o BSFT (24 kHz input)	NU-Wave 2 w/o BSFT (16 kHz input)	NU-Wave 2 w/o BSFT (12 kHz input)	NU-Wave 2 w/o BSFT (8 kHz input)
Audio
Linear Spectrogram