Audio Samples from "Controllable and Interpretable Singing Voice Decomposition via Assem-VC"

Paper: arXiv:2110.12676

Repository: mindslab-ai/assem-vc @ GitHub

Authors: Kang-wook Kim, Junhyeok Lee @MINDsLab Inc., SNU

Abstract: We propose a singing decomposition system that encodes time-aligned linguistic content, pitch, and source speaker identity via Assem-VC. With decomposed speaker-independent information and the target speaker's embedding, we could synthesize the singing voice of the target speaker. In conclusion, we made a perfectly synced duet with the user's singing voice and the target singer's converted singing voice.


Contents

  0. Reconstruction
  1. Controllable Attributes
    1. Control Lyrics
    2. Control Rhythm
    3. Control Pitch
    4. Control Speaker Identity
  2. Demo with the User's Singing Voice
    1. Demo with the Author's Singing Voice
    2. Further Examples of Duet
  3. Artifacts of HiFi-GAN

0. Reconstruction

Reference 1 Reference 2 Reference 3
Reconstruction 1 Reconstruction 2 Reconstruction 3

1. Controllable Attributes

1.1. Control Lyrics

Text Content: {OW} {D IH R} {W AH T} {K AE N} {DH AH} {M AE T ER} {B IY}.

Reference

1.1.1 Text Editing

{UH} {DH EH R} {M AW S} {K AE N} {N AA} {B AA DH ER} {M IY}.
{AH} {AH AH AH} {AH AH AH} {AH AH AH} {AH AH} {AH AH AH AH} {AH AH}
{ER} {ER ER ER} {ER ER ER} {ER ER ER} {ER ER} {ER ER ER ER} {ER ER}
{IY} {IY IY IY} {IY IY IY} {IY IY IY} {IY IY} {IY IY IY IY} {IY IY}

1.1.2 Text Deletion

Desired phonemes can be deleted by replacing them with blank tokens and setting the corresponding pitches to 0.

{OW} {D IH R} {W AH T} {K AE N}
{DH AH} {M AE T ER} {B IY}.
{OW} {D IH R} {W AH T} {K AE N}
{DH AH} {M AE T ER} {B IY}.

Replacing phonemes with blank tokens while leaving the pitches unchanged produces the following result.

{ } { } { } { } { } { } { }.
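
The two deletion variants described in this subsection can be sketched as follows. This is a minimal illustration, not the repository's code: it assumes the decomposed representation is a pair of frame-aligned lists (one phoneme symbol and one pitch value per frame), and the names `BLANK` and `delete_phonemes` are hypothetical.

```python
BLANK = " "  # hypothetical symbol standing in for the blank token


def delete_phonemes(phonemes, pitches, start, end, zero_pitch=True):
    """Replace frames in [start, end) with blank tokens.

    If zero_pitch is True, the pitch of those frames is also set to 0;
    otherwise the pitch contour is left untouched.
    The inputs are not modified; new lists are returned.
    """
    if len(phonemes) != len(pitches):
        raise ValueError("phonemes and pitches must be frame-aligned")
    if not 0 <= start <= end <= len(phonemes):
        raise ValueError("invalid frame range")

    new_phonemes = list(phonemes)
    new_pitches = list(pitches)
    for i in range(start, end):
        new_phonemes[i] = BLANK
        if zero_pitch:
            new_pitches[i] = 0.0
    return new_phonemes, new_pitches


# Toy frame-level input (values are illustrative only).
phonemes = ["OW", "OW", "D", "IH", "R", "R"]
pitches = [220.0, 220.0, 246.9, 246.9, 261.6, 261.6]

# Variant 1: blank tokens and zeroed pitch.
deleted = delete_phonemes(phonemes, pitches, 2, 5)

# Variant 2: blank tokens only, pitch contour kept.
blank_only = delete_phonemes(phonemes, pitches, 2, 5, zero_pitch=False)
```

Keeping the phoneme and pitch edits behind one flag makes the difference between the two variants explicit: only the second one preserves the original pitch contour.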

1.2. Control Rhythm

Text Content: {IH T S} {F L IY S} {W AA Z} {W AY T} {AE Z} {S N OW}.

Reference
{IH T S} {F L IY S} {W AA Z}
{W AY T} {AE Z} {S N OW}. × 2.5
{IH T S} {F L IY S} {W AA Z}
{W AY T} {AE Z} {S N OW}. × 0.5
{IH T S} {F L IY S} {W AA Z}
{W AY T} {AE Z} {S N OW}. × 5.
{IH T S} {F L IY S} {W AA Z}
{W AY T} {AE Z} {S N OW}. × 5.

Text Content: {HH IY} {P R AA M IY S T} {T UW} {B R IH NG} {M IY} {AH} {B AH N CH} {AH V} {R EH D} {R OW Z IH Z}.

Reference
ALL × 0.6
ALL × 1.7
{HH IY} {P R AA M IY S T} {T UW} BLANK {B R IH NG} ... × 10.
{HH IY} {P R AA M IY S T} {T UW} {B R IH NG} {M IY} {AH} ... × 10.
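
One way to picture the rhythm edits above is as a resampling of the frame-level alignment. The sketch below is an assumption-laden illustration rather than the authors' implementation: it assumes frame-aligned lists and reads "× k" as multiplying the duration of the selected span by k. The helper name `stretch_span` is hypothetical.

```python
def stretch_span(frames, start, end, factor):
    """Resample frames[start:end] to about len * factor frames.

    Uses nearest-neighbour indexing, so each original frame is repeated
    (factor > 1) or skipped (factor < 1). Frames outside the span are kept.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    frames = list(frames)
    span = frames[start:end]
    if not span:
        return frames

    new_len = max(1, round(len(span) * factor))
    last = len(span) - 1
    # Map each output frame back to a source frame inside the span.
    stretched = [span[min(last, int(i / factor))] for i in range(new_len)]
    return frames[:start] + stretched + frames[end:]


# Stretch the first two frames by 2.5 (assumed meaning of "x 2.5").
longer = stretch_span(["IH", "T", "S", "F"], 0, 2, 2.5)

# Shrink the first four frames to half their length.
shorter = stretch_span(["a", "a", "b", "b", "c"], 0, 4, 0.5)
```

Applying the same call with the same arguments to both the phoneme list and the pitch list keeps the two sequences aligned after the edit.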

1.3. Control Pitch

1.3.1 Pitch Shift

Reference
+1 +2 +3
+4 +5 +6
-1 -2 -3
-4 -5 -6
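
The shifts listed above can be illustrated with a short sketch. It assumes that each "+n" or "-n" is a shift of n semitones in equal temperament, that pitch is stored as a frequency in Hz, and that a value of 0 marks frames without pitch; none of this is confirmed by the page, and `shift_pitch` is a hypothetical helper.

```python
def shift_pitch(pitches_hz, semitones):
    """Shift a pitch contour by a number of semitones.

    Each semitone multiplies the frequency by 2 ** (1 / 12).
    Frames with pitch 0 are treated as pitchless and left at 0.
    """
    ratio = 2.0 ** (semitones / 12.0)
    return [f * ratio if f > 0 else 0.0 for f in pitches_hz]


contour = [220.0, 0.0, 440.0]  # illustrative values in Hz

up_two = shift_pitch(contour, 2)          # "+2"
down_octave = shift_pitch(contour, -12)   # twelve semitones down
```

Under these assumptions, a shift of -12 halves every voiced frequency and a shift of +12 doubles it.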

1.3.2 Pitch Deletion

Setting the pitches to 0 while leaving the phonemes unchanged produces the following result.

Text Content: {OW} {D IH R} {W AH T} {K AE N} {DH AH} {M AE T ER} {B IY}.

Reference
{OW} {D IH R} {W AH T} {K AE N}
{DH AH} {M AE T ER} {B IY}.

1.4. Control Speaker Identity

1.4.1 CSD to Female

Text Content: {IH T S} {F L IY S} {W AA Z} {W AY T} {AE Z} {S N OW}.

Reference: CSD
Target: NJAT Target: PMAR
Converted Result: NJAT Converted Result: PMAR

1.4.2 CSD to Male

The pitches are multiplied by 1/2 to match the pitch range of the male target speakers.
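
As a worked relation, assuming the pitch is represented as a fundamental frequency f and using the equal-tempered convention of 12 semitones per octave, multiplying by 1/2 lowers the pitch by exactly one octave:

```latex
f' = \tfrac{1}{2}\, f = 2^{-12/12}\, f
```

Under that assumption, the scaling is equivalent to a shift of -12 semitones.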

Text Content: {HH IY} {P R AA M IY S T} {T UW} {B R IH NG} {M IY} {AH} {B AH N CH} {AH V} {R EH D} {R OW Z IH Z}.

Reference: CSD
Target: JTAN Target: KENN
Converted Result: JTAN Converted Result: KENN

2. Demo with the User's Singing Voice

2.1. Demo with the User's Singing Voice

Text Content: Till the end of the time

Reference: ADIZ Target 1: JTAN Target 2: KENN
Result: JTAN-7 keys Result: KENN-7 keys
Combined Result: ADIZ & JTAN-7 Combined Result: ADIZ & KENN-7

Text Content: Just a be with you

Reference: JLEE Target: MCUR
Result: MCUR+5 keys Result: MCUR+7 keys
Combined Result: JLEE & MCUR+5 Combined Result: JLEE & MCUR+7

Text Content: Do you hear the people sing, singing the song of angry men

Reference: Author Target: CSD
Result: CSD+5 keys Result: CSD+7 keys
Combined Result: Author & CSD+5 Combined Result: Author & CSD+7

Text Content: City of stars, are you shining just for me

Reference: Author Target: MPOL
Result: MPOL+5 keys Result: MPOL+7 keys
Combined Result: Author & MPOL+5 Combined Result: Author & MPOL+7

2.2. Further Examples of Duet

Text Content: Lucky I'm in love with my best friend, lucky to have been where I have been.

We mixed +3 and +4 keys to match the chord of the original song.

Reference: Author Result: CSD+3,4 Combined: Author & CSD+3,4

Text Content: There's a calm surrender

Reference: JLEE Result: CSD+7 Combined: JLEE & CSD+7

Text Content: Lift me up to touch the sky

Reference: ADIZ Result: CSD-7 Combined: ADIZ & CSD-7

Text Content: Show you the best of mine

Reference: ADIZ Result: KENN-7 Combined: ADIZ & KENN-7

3. Artifacts of HiFi-GAN

We observe audible artifacts in our model's synthesized results, and they are also visible in the spectrogram. These noisy artifacts degrade the quality of the synthesized output. We also found that similar audible artifacts are generated when the singing voice is reconstructed by HiFi-GAN. We will address this issue in future work.

Reference Reconstruction of HiFi-GAN Reconstruction of our model