Audio Samples from "Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques"

Paper: arXiv:2104.00931

Repository: mindslab-ai/assem-vc @ GitHub

Authors: Kang-wook Kim, Seung-won Park, Junhyeok Lee, Myun-chul Joe @MINDsLab Inc., SNU

Abstract: In this paper, we pose the current state-of-the-art voice conversion (VC) systems as two-encoder-one-decoder models. After comparing these models, we combine the best features and propose Assem-VC, a new state-of-the-art any-to-many non-parallel VC system. This paper also introduces the GTA finetuning in VC, which significantly improves the quality and the speaker similarity of the outputs. Assem-VC outperforms the previous state-of-the-art approaches in both the naturalness and the speaker similarity on the VCTK dataset. As an objective result, the degree of speaker disentanglement of features such as phonetic posteriorgrams (PPG) is also explored. Our investigation indicates that many-to-many VC results are no longer distinct from human speech and similar quality can be achieved with any-to-many models.


Contents

  0. System Description
  1. Many-to-Many Conversion
  2. Any-to-Many Conversion
  3. Bonus (curated)

0. System Description

Mellotron-VC: A VC system based on Mellotron, which we modified to normalize the pitch so that it suits the VC task.

PPG-VC: A VC system we built to compare Cotatron features with PPG. It uses PPG, normalized F0, and a causal decoder.

Cotatron-VC: The VC system proposed in the Cotatron paper.

Assem-VC (Proposed): Our proposed VC system, which uses Cotatron, normalized F0, and a causal decoder.

Assem-VC (adv): Assem-VC with an adversarially trained Cotatron. Adopting adversarial training on Cotatron degrades output quality.

GTA Finetuning (Proposed methodology): Finetuning the vocoder on ground-truth-aligned (GTA) mel spectrograms, that is, mel spectrograms reconstructed by the acoustic model.
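
As a rough illustration of the GTA finetuning idea described above, the sketch below pairs each ground-truth waveform with a mel spectrogram reconstructed by an acoustic model from that same utterance. A vocoder finetuned on such pairs sees the kind of imperfect input it will receive at conversion time. This is a minimal sketch under stated assumptions, not the authors' code: toy_acoustic_model, build_gta_pairs, fake_utterance, HOP_LENGTH, and N_MELS are hypothetical names and values chosen for illustration, and the "model" here merely adds noise to the ground-truth mel.

```python
import random

HOP_LENGTH = 256  # assumed samples per mel frame (not stated in the source)
N_MELS = 80       # assumed number of mel bins (not stated in the source)


def toy_acoustic_model(mel, rng):
    """Hypothetical stand-in for a trained acoustic model in reconstruction mode.

    Returns a noisy copy of `mel` with the same number of frames.
    """
    return [[v + rng.gauss(0.0, 0.1) for v in frame] for frame in mel]


def build_gta_pairs(dataset, acoustic_model, rng):
    """Pair each ground-truth waveform with the mel reconstructed from it."""
    pairs = []
    for wav, mel in dataset:
        mel_hat = acoustic_model(mel, rng)
        # The reconstruction has to stay frame-aligned with the waveform,
        # otherwise the vocoder would be trained on mismatched targets.
        if len(mel_hat) != len(mel) or len(wav) != len(mel) * HOP_LENGTH:
            raise ValueError("reconstruction is not aligned with the waveform")
        pairs.append((mel_hat, wav))
    return pairs


def fake_utterance(n_frames, rng):
    """Create a random (waveform, mel) pair with consistent lengths."""
    wav = [rng.uniform(-1.0, 1.0) for _ in range(n_frames * HOP_LENGTH)]
    mel = [[rng.gauss(0.0, 1.0) for _ in range(N_MELS)] for _ in range(n_frames)]
    return wav, mel


rng = random.Random(0)
dataset = [fake_utterance(n, rng) for n in (5, 7)]
gta_pairs = build_gta_pairs(dataset, toy_acoustic_model, rng)
print([(len(m), len(w)) for m, w in gta_pairs])  # [(5, 1280), (7, 1792)]
```

In a real setup the pairs would replace the usual (ground-truth mel, waveform) pairs in the vocoder's training data. The alignment check is the part worth keeping, since frame-aligned reconstructions are what make the waveform usable as a target.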

 


1. Many-to-Many Conversion

Source speakers: Utterances randomly selected from the test split of VCTK, which consists of 108 English speakers. All speakers are seen during training.

Target speakers: Utterances randomly selected from the test split of VCTK, which consists of 108 English speakers.

 

Text Content: "it was an empowering journey"

Source Speech (p307) | Target Speech (p294)

Converted Results

PPG-VC w/o GTA | Cotatron-VC w/o GTA | Mellotron-VC w/o GTA | Assem-VC w/o GTA | Assem-VC (adv) w/o GTA
PPG-VC | Cotatron-VC | Mellotron-VC | Assem-VC | Assem-VC (adv)

 

Text Content: "the clarity is vital"

Source Speech (p232) | Target Speech (p275)

Converted Results

PPG-VC w/o GTA | Cotatron-VC w/o GTA | Mellotron-VC w/o GTA | Assem-VC w/o GTA | Assem-VC (adv) w/o GTA
PPG-VC | Cotatron-VC | Mellotron-VC | Assem-VC | Assem-VC (adv)

 

Text Content: "there may be some resistance"

Source Speech (p340) | Target Speech (p304)

Converted Results

PPG-VC w/o GTA | Cotatron-VC w/o GTA | Mellotron-VC w/o GTA | Assem-VC w/o GTA | Assem-VC (adv) w/o GTA
PPG-VC | Cotatron-VC | Mellotron-VC | Assem-VC | Assem-VC (adv)

 

Text Content: "what do we do?"

Source Speech (p243) | Target Speech (p318)

Converted Results

PPG-VC w/o GTA | Cotatron-VC w/o GTA | Mellotron-VC w/o GTA | Assem-VC w/o GTA | Assem-VC (adv) w/o GTA
PPG-VC | Cotatron-VC | Mellotron-VC | Assem-VC | Assem-VC (adv)

 

Text Content: "england were ahead until two minutes into injury time"

Source Speech (p280) | Target Speech (p364)

Converted Results

PPG-VC w/o GTA | Cotatron-VC w/o GTA | Mellotron-VC w/o GTA | Assem-VC w/o GTA | Assem-VC (adv) w/o GTA
PPG-VC | Cotatron-VC | Mellotron-VC | Assem-VC | Assem-VC (adv)

 


2. Any-to-Many Conversion

Source speakers: Utterances randomly selected from the LibriTTS test-clean split. All speakers are unseen during training.

Target speakers: Utterances randomly selected from the test split of VCTK, which consists of 108 English speakers.

 

Text Content: "If the prosecution were withdrawn and the case settled with the victim of the forged check, then the young man would be allowed his freedom. But under the circumstances I doubt if such an arrangement could be made."

Source Speech (6829) | Target Speech (p274)
PPG-VC | Cotatron-VC | Mellotron-VC | Assem-VC | Assem-VC (adv)

 

Text Content: "My son," said his mamma, "just stop and think how badly you would feel, if you really couldn't see Percy."

Source Speech (237) | Target Speech (p313)
PPG-VC | Cotatron-VC | Mellotron-VC | Assem-VC | Assem-VC (adv)

 

Text Content: "Magic must be a very interesting study," said Ojo.

Source Speech (1284) | Target Speech (p360)
PPG-VC | Cotatron-VC | Mellotron-VC | Assem-VC | Assem-VC (adv)

 

Text Content: Soon another fire sparkled and snapped on the hearth, and there were songs and poems and choruses and Osh Popham's fiddle, to say nothing of the supreme event of the evening, his rendition of "Fly like a youthful hart or roe, over the hills where spices grow," to Mother Carey's accompaniment.

Source Speech (4992) | Target Speech (p236)
PPG-VC | Cotatron-VC | Mellotron-VC | Assem-VC | Assem-VC (adv)

 

Text Content: This is my schooling, major; and if one neglects the book, there is little chance of learning from the open land of Providence.

Source Speech (1320) | Target Speech (p273)
PPG-VC | Cotatron-VC | Mellotron-VC | Assem-VC | Assem-VC (adv)

 

Text Content: She makes effort after effort, trembling with eagerness, and when she fails to reproduce what she sees, she works herself into a frenzy of grief and disappointment."

Source Speech (4992) | Target Speech (p271)
PPG-VC | Cotatron-VC | Mellotron-VC | Assem-VC | Assem-VC (adv)

 

Text Content: I shall return to the consideration of familiarity shortly.

Source Speech (8230) | Target Speech (p360)
PPG-VC | Cotatron-VC | Mellotron-VC | Assem-VC | Assem-VC (adv)

 


3. Bonus (curated)

We show some entertaining conversion results that use celebrity speech as the source.

Steve Jobs' 2005 Stanford Commencement Address

Source speech

Target speech (p300)

Assem-VC

"Your time is limited, so don't waste it living someone else's life. Don't be trapped by dogma - which is living with the results of other people's thinking. Don't let the noise of others' opinions drown out your own inner voice. And most important, have the courage to follow your heart and intuition. They somehow already know what you truly want to become. Everything else is secondary."
YouTube Link

EA Sports

Source speech

Target speech (p244)

Assem-VC

"E A Sports. It's in the game."
YouTube Link

Korean Memes

The Korean Assem-VC model was trained on our proprietary multi-speaker Korean TTS dataset.

Target speaker: KSS (Korean Single Speaker Dataset)

Source speech

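Assem-VC

"너 때문에 흥이 다 깨져버렸으니까, 책임져." (English: "You completely killed the mood, so take responsibility.")
YouTube Link
"마포대굔 무너졌냐 이XX야?" (English: "Did Mapo Bridge collapse, you XX?")
YouTube Link
"무야호" (English: "Muyaho", an exclamation)
YouTube Link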