Assem-Singer Demo

Audio Samples from "Controllable and Interpretable Singing Voice Decomposition via Assem-VC"

Repository: mindslab-ai/assem-vc @ GitHub

Authors: Kang-wook Kim, Junhyeok Lee @MINDsLab Inc., SNU

Abstract: We propose a singing decomposition system that encodes time-aligned linguistic content, pitch, and source speaker identity via Assem-VC. With decomposed speaker-independent information and the target speaker's embedding, we could synthesize the singing voice of the target speaker. In conclusion, we made a perfectly synced duet with the user's singing voice and the target singer's converted singing voice.

Contents

Reconstruction
Controllable Attributes
    Control Lyrics
    Control Rhythm
    Control Pitch
    Control Speaker Identity
Demo with the User's Singing Voice
    Demo with the Author's Singing Voice
    Further Examples of Duet
Artifacts of HiFi-GAN

0. Reconstruction

Reference 1	Reference 2	Reference 3

Reconstruction 1	Reconstruction 2	Reconstruction 3

1. Controllable Attributes

1.1. Control Lyrics

Text Content: {OW} {D IH R} {W AH T} {K AE N} {DH AH} {M AE T ER} {B IY}.

Reference

1.1.1 Text Editing

{UH} {DH EH R} {M AW S} {K AE N} {N AA} {B AA DH ER} {M IY}.	{AH} {AH AH AH} {AH AH AH} {AH AH AH} {AH AH} {AH AH AH AH} {AH AH}.	{ER} {ER ER ER} {ER ER ER} {ER ER ER} {ER ER} {ER ER ER ER} {ER ER}.	{IY} {IY IY IY} {IY IY IY} {IY IY IY} {IY IY} {IY IY IY IY} {IY IY}.

1.1.2 Text Deletion

Desired phonemes can be deleted by replacing them with blank tokens and setting the corresponding pitches to 0.

{OW} {D IH R} ~~{W AH T}~~ {K AE N} {DH AH} {M AE T ER} {B IY}.	~~{OW}~~ {D IH R} {W AH T} {K AE N} ~~{DH AH}~~ {M AE T ER} ~~{B IY}~~.
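The deletion described above can be illustrated with a minimal sketch. It assumes the phonemes and pitches are stored as two aligned sequences with one pitch value per phoneme; the `BLANK` symbol, the function name, and this data layout are hypothetical and are not taken from the Assem-VC repository.

```python
from typing import List, Sequence, Set, Tuple

BLANK = "<blank>"  # hypothetical blank-token symbol (assumption)


def delete_phonemes(
    phonemes: Sequence[str],
    pitches: Sequence[float],
    to_delete: Set[int],
) -> Tuple[List[str], List[float]]:
    """Blank out the phonemes at the given indices and zero their pitches."""
    if len(phonemes) != len(pitches):
        raise ValueError("phonemes and pitches must be aligned one-to-one")
    # Replace each selected phoneme with the blank token.
    new_phonemes = [BLANK if i in to_delete else p for i, p in enumerate(phonemes)]
    # Set the pitch of each selected position to 0.
    new_pitches = [0.0 if i in to_delete else f for i, f in enumerate(pitches)]
    return new_phonemes, new_pitches


# Delete {W AH T} from {OW} {D IH R} {W AH T}; the pitch values are made up.
phonemes = ["OW", "D", "IH", "R", "W", "AH", "T"]
pitches = [300.0] * len(phonemes)
edited_phonemes, edited_pitches = delete_phonemes(phonemes, pitches, {4, 5, 6})
```

Both sequences keep their original length, so the timing of the remaining phonemes is unchanged in this sketch.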

Replacing the phonemes with blank tokens while leaving the pitches unchanged generates the following voice.

{ } { } { } { } { } { } { }.

1.2. Control Rhythm

Text Content: {IH T S} {F L IY S} {W AA Z} {W AY T} {AE Z} {S N OW}.

Reference

{IH T S} {F L IY S} {W AA Z} {W AY T} {AE Z} {S N OW}. × 2.5	{IH T S} {F L IY S} {W AA Z} {W AY T} {AE Z} {S N OW}. × 0.5	{IH T S} {F L IY S} {W AA Z} {W AY T} {AE Z} {S N OW}. × 5.	{IH T S} {F L IY S} {W AA Z} {W AY T} {AE Z} {S N OW}. × 5.

Text Content: {HH IY} {P R AA M IY S T} {T UW} {B R IH NG} {M IY} {AH} {B AH N CH} {AH V} {R EH D} {R OW Z IH Z}.

Reference

ALL × 0.6	ALL × 1.7	{HH IY} {P R AA M IY S T} {T UW} BLANK {B R IH NG} ... × 10.	{HH IY} {P R AA M IY S T} {T UW} {B R IH NG} {M IY} {AH} ... × 10.

1.3. Control Pitch

1.3.1 Pitch Shift

Reference

+1	+2	+3

+4	+5	+6

-1	-2	-3

-4	-5	-6
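As a rough illustration of the shifts listed above, the sketch below assumes that each unit (+1, -1, and so on) is one semitone ("key") in equal temperament and that the pitch is stored as an F0 contour in Hz with 0 marking unvoiced frames. The function name and data layout are hypothetical, not the repository's API.

```python
from typing import List, Sequence


def shift_pitch(f0_hz: Sequence[float], semitones: float) -> List[float]:
    """Shift an F0 contour by a number of semitones (equal temperament).

    Frames with F0 == 0 are treated as unvoiced and stay at 0.
    """
    ratio = 2.0 ** (semitones / 12.0)  # frequency ratio for the shift
    return [f * ratio if f > 0 else 0.0 for f in f0_hz]


contour = [220.0, 0.0, 440.0]           # made-up contour: voiced, unvoiced, voiced
up_one = shift_pitch(contour, 1)        # voiced frames scaled by 2^(1/12)
octave_up = shift_pitch(contour, 12)    # voiced frames doubled
```

Under this assumption a shift of +12 doubles every voiced frequency and a shift of -12 halves it.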

1.3.2 Pitch Deletion

Replacing the pitches with 0 while leaving the phonemes unchanged generates the following voice.

Text Content: {OW} {D IH R} {W AH T} {K AE N} {DH AH} {M AE T ER} {B IY}.

Reference

{OW} {D IH R} {W AH T} {K AE N} {DH AH} {M AE T ER} {B IY}.

1.4. Control Speaker Identity

1.4.1 CSD to Female

Text Content: {IH T S} {F L IY S} {W AA Z} {W AY T} {AE Z} {S N OW}.

Reference: CSD

Target: NJAT	Target: PMAR

Converted Result: NJAT	Converted Result: PMAR

1.4.2 CSD to Male

The pitches are multiplied by 1/2 to match the pitch range of the male target speakers.
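Multiplying a frequency by 1/2 lowers it by exactly one octave, that is, 12 semitones in equal temperament. A minimal sketch of that arithmetic, assuming the pitch is an F0 contour in Hz (the contour values and function name are made up for illustration):

```python
import math
from typing import List, Sequence


def halve_pitch(f0_hz: Sequence[float]) -> List[float]:
    """Multiply every F0 value by 1/2, lowering the contour by one octave."""
    return [f * 0.5 for f in f0_hz]


female_contour = [440.0, 0.0, 330.0]        # made-up values; 0 marks an unvoiced frame
male_contour = halve_pitch(female_contour)  # unvoiced frames stay at 0
semitone_offset = 12 * math.log2(0.5)       # the factor 1/2 expressed in semitones
```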

Text Content: {HH IY} {P R AA M IY S T} {T UW} {B R IH NG} {M IY} {AH} {B AH N CH} {AH V} {R EH D} {R OW Z IH Z}.

Reference: CSD

Target: JTAN	Target: KENN

Converted Result: JTAN	Converted Result: KENN

2. Demo with the User's Singing Voice

2.1. Demo with the User's Singing Voice

Text Content: Till the end of the time

Reference: ADIZ	Target 1: JTAN	Target 2: KENN

Result: JTAN-7 keys	Result: KENN-7 keys

Combined Result: ADIZ & JTAN-7	Combined Result: ADIZ & KENN-7

Text Content: Just a be with you

Reference: JLEE	Target: MCUR

Result: MCUR+5 keys	Result: MCUR+7 keys

Combined Result: JLEE & MCUR+5	Combined Result: JLEE & MCUR+7

Text Content: Do you hear the people sing, singing the song of angry men

Reference: Author	Target: CSD

Result: CSD+5 keys	Result: CSD+7 keys

Combined Result: Author & CSD+5	Combined Result: Author & CSD+7

Text Content: City of stars, are you shining just for me

Reference: Author	Target: MPOL

Result: MPOL+5 keys	Result: MPOL+7 keys

Combined Result: Author & MPOL+5	Combined Result: Author & MPOL+7

2.2. Further Examples of Duet

Text Content: Lucky I'm in love with my best friend, lucky to have been where I have been.

We mixed +3 and +4 keys to match the chord of the original song.

Reference: Author	Result: CSD+3,4	Combined: Author & CSD+3,4
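One simple way to combine time-aligned tracks such as the reference and the +3 and +4 conversions is to average them sample by sample. The sketch below is an assumption about how such a mix could be made, not a description of how the demo audio was produced. It expects mono tracks of equal length at the same sample rate, and the function name is hypothetical.

```python
from typing import List, Sequence


def mix_tracks(tracks: Sequence[Sequence[float]]) -> List[float]:
    """Average equal-length mono tracks sample by sample."""
    if not tracks:
        raise ValueError("need at least one track")
    length = len(tracks[0])
    if any(len(t) != length for t in tracks):
        raise ValueError("all tracks must have the same number of samples")
    # Averaging keeps the mix within the amplitude range of the inputs.
    return [sum(samples) / len(tracks) for samples in zip(*tracks)]


# Made-up two-sample tracks standing in for the reference and two conversions.
reference = [0.5, -0.5]
plus_three = [0.25, 0.25]
plus_four = [0.75, -0.5]
duet = mix_tracks([reference, plus_three, plus_four])
```

Because the converted voices keep the timing of the source, a sample-wise mix like this needs no extra alignment step.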

Text Content: There's a calm surrender

Reference: JLEE	Result: CSD+7	Combined: JLEE & CSD+7

Text Content: Lift me up to touch the sky

Reference: ADIZ	Result: CSD-7	Combined: ADIZ & CSD-7

Text Content: Show you the best of mine

Reference: ADIZ	Result: KENN-7	Combined: ADIZ & KENN-7

3. Artifacts of HiFi-GAN

We observe audible artifacts in our model's synthesized results, and they are also visible in the spectrogram. These noisy artifacts degrade the quality of the synthesized output. We also found that similar audible artifacts are generated when the singing voice is reconstructed by HiFi-GAN. We will address this issue in future work.

Reference	Reconstruction of HiFi-GAN	Reconstruction of our model