Audio samples from "Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion without Parallel Data"

Paper: arXiv:2005.03295 (To appear in INTERSPEECH 2020)

Code & Pre-trained weights: mindslab-ai/cotatron @ GitHub

Try with Google Colab

Authors: Seung-won Park, Doo-young Kim, Myun-chul Joe @ SNU, MINDsLab Inc.

Abstract: We propose Cotatron, a transcription-guided speech encoder for speaker-independent linguistic representation. Cotatron is based on the multispeaker TTS architecture and can be trained with conventional TTS datasets. We train a voice conversion system to reconstruct speech with Cotatron features, which is similar to the previous methods based on Phonetic Posteriorgram (PPG). By training and evaluating our system with 108 speakers from the VCTK dataset, we outperform the previous method in terms of both naturalness and speaker similarity. Our system can also convert speech from speakers that are unseen during training, and utilize ASR to automate the transcription with minimal reduction of the performance. Audio samples are available at https://mindslab-ai.github.io/cotatron, and the code with a pre-trained model will be made available soon.

This page contains a set of audio samples in support of the paper. We suggest listening to the samples while reading the paper.
All utterances were unseen during training, and the results are uncurated (NOT cherry-picked) unless specified.

Last updated on 2020.05.06

1. Many-to-Many Conversion

The following audio samples are conversions between randomly selected utterances from the VCTK test split, which covers 108 English speakers. Please keep in mind that:

Transcription fed:
sometimes you get them sometimes you dont
Source = p293_148.wav
Target Speaker = p234
Converted - Blow
Converted - Cotatron
Transcription fed:
there is no point in looking any further
Source = p288_175.wav
Target Speaker = p374
Converted - Blow
Converted - Cotatron
Transcription fed:
our task is to complete the picture
Source = p228_293.wav
Target Speaker = p301
Converted - Blow
Converted - Cotatron
Transcription fed:
how are you sir ?
Source = p281_071.wav
Target Speaker = p311
Converted - Blow
Converted - Cotatron
Transcription fed:
it is an industry failure
Source = p264_470.wav
Target Speaker = p255
Converted - Blow
Converted - Cotatron
Transcription fed:
he will go a long way
Source = p279_243.wav
Target Speaker = p258
Converted - Blow
Converted - Cotatron
Transcription fed:
in fact they are the future of investment
Source = p244_048.wav
Target Speaker = p294
Converted - Blow
Converted - Cotatron
Transcription fed:
she had every right to read the warrant
Source = p305_397.wav
Target Speaker = p374
Converted - Blow
Converted - Cotatron

2. Any-to-Many Conversion

Cotatron can convert speech from speakers that are unseen during training. Here we convert randomly selected utterances from LibriTTS test-clean to randomly chosen VCTK speakers, which is an any-to-many conversion.

Transcription fed:
I am six feet high, and I could do it with an effort. No one less than that would have a chance.
Source = 1580_141084_000077_000004-22k.wav
Target Speaker = p335
Converted - Cotatron
Transcription fed:
'Can I conjecture why he is gone?' murmured Rachel, still gazing with a wild kind of apathy into distance.
Source = 5683_32879_000034_000000-22k.wav
Target Speaker = p306
Converted - Cotatron
Transcription fed:
Among other instrumentalities for executing the bogus laws, the bogus Legislature had appointed one Samuel j Jones sheriff of Douglas county kansas Territory, although that individual was at the time of his appointment, and long afterwards, United States postmaster of the town of Westport, Missouri.
Source = 7729_102255_000005_000000-22k.wav
Target Speaker = p308
Converted - Cotatron
Transcription fed:
I could just see my uncle at full length on the raft, and Hans still at his helm and spitting fire under the action of the electricity which has saturated him.
Source = 260_123288_000043_000001-22k.wav
Target Speaker = p270
Converted - Cotatron
Transcription fed:
A fresh noise is heard!
Source = 260_123288_000046_000000-22k.wav
Target Speaker = p303
Converted - Cotatron
Transcription fed:
They said they 'were sorry'--that is, 'Wall Street sorry'--and refused to pay it.
Source = 2300_131720_000026_000007-22k.wav
Target Speaker = p313
Converted - Cotatron
Transcription fed:
It appeared that the narrative he had promised to read us really required for a proper intelligence a few words of prologue. Let me say here distinctly, to have done with it, that this narrative, from an exact transcript of my own made much later, is what I shall presently give.
Source = 121_127105_000042_000003-22k.wav
Target Speaker = p311
Converted - Cotatron
Transcription fed:
What did it mean?
Source = 1089_134691_000027_000005-22k.wav
Target Speaker = p314
Converted - Cotatron

3. Use of Automatic Speech Recognition

Cotatron is robust against ASR word errors. In this section, we curate examples of ASR errors and show their consequences on the conversion results.

Ground truth transcription:
shareholders will be asked to approve a new replacement scheme
Transcription fed by ASR:
shelters will be asked to prove a new replacement scheme
Source = p225_149.wav
Target Speaker = p294
Converted - Cotatron
Ground truth transcription:
the site has been fully recorded and surveyed
Transcription fed by ASR:
the sight has been fully recorded and surveyed
Source = p300_255.wav
Target Speaker = p249
Converted - Cotatron
Ground truth transcription:
the failings are serious
Transcription fed by ASR:
the feelings are serious
Source = p317_302.wav
Target Speaker = p314
Converted - Cotatron
Ground truth transcription:
the breakdown was much later in her life
Transcription fed by ASR:
the breakdown was much better in her life
Source = p241_112.wav
Target Speaker = p300
Converted - Cotatron
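
To make the size of these transcription errors concrete, the sketch below counts word-level edit errors between each ground-truth transcription and the ASR output listed above. It assumes a plain word-level Levenshtein distance over whitespace-separated tokens; the function names `word_errors` and `word_error_rate` are hypothetical helpers written for this illustration, not part of the Cotatron code. Because these four examples are curated, the resulting numbers describe only these samples and not the overall error rate of the ASR system.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance: substitutions + insertions + deletions."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the distance between the reference words consumed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word errors divided by the number of reference words."""
    return word_errors(reference, hypothesis) / len(reference.split())


# (ground truth, ASR output) pairs copied from the samples in this section.
PAIRS = [
    ("shareholders will be asked to approve a new replacement scheme",
     "shelters will be asked to prove a new replacement scheme"),
    ("the site has been fully recorded and surveyed",
     "the sight has been fully recorded and surveyed"),
    ("the failings are serious",
     "the feelings are serious"),
    ("the breakdown was much later in her life",
     "the breakdown was much better in her life"),
]

if __name__ == "__main__":
    for truth, asr in PAIRS:
        print(f"{word_errors(truth, asr)} error(s), "
              f"WER = {word_error_rate(truth, asr):.3f}: {asr}")
```

Under this measure the four pairs contain 2, 1, 1, and 1 substituted words respectively, so each converted sample was generated from a transcription with at least one wrong word.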

4. Bonus (curated)

We show some entertaining conversion results from the speech of celebrities.

Transcription fed:
We want to live by each other's happiness, not by each other's misery.
Source = Charlie Chaplin's speech from "The Great Dictator" (YouTube link)
Target Speaker = p225 (from VCTK)
Converted - Cotatron
Transcription fed (in Korean):
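온갖 음해에 시달렸습니다. 여러분 이거 다 거짓말인거 아시죠?
(English: I have been subjected to all sorts of slander. You all know that this is all lies, don't you?)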
Source = Lee Myung-bak's speech from 2007 (YouTube link)
Target Speaker = KSS (Korean Single Speaker Dataset)
Converted - Cotatron (Korean version)

This page uses a template from the project page of Battenberg et al., "Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis".