Audio samples from "Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion without Parallel Data"

Paper: arXiv:2005.03295 (To appear in INTERSPEECH 2020)

Code & Pre-trained weights: mindslab-ai/cotatron @ GitHub

Authors: Seung-won Park, Doo-young Kim, Myun-chul Joe @ SNU, MINDsLab Inc.

Abstract: We propose Cotatron, a transcription-guided speech encoder for speaker-independent linguistic representation. Cotatron is based on the multispeaker TTS architecture and can be trained with conventional TTS datasets. We train a voice conversion system to reconstruct speech with Cotatron features, which is similar to the previous methods based on Phonetic Posteriorgram (PPG). By training and evaluating our system with 108 speakers from the VCTK dataset, we outperform the previous method in terms of both naturalness and speaker similarity. Our system can also convert speech from speakers that are unseen during training, and utilize ASR to automate the transcription with minimal reduction of the performance. Audio samples are available at https://mindslab-ai.github.io/cotatron, and the code with a pre-trained model will be made available soon.

This page contains audio samples that accompany the paper; we suggest listening to them while reading it.
All utterances were unseen during training, and the results are uncurated (NOT cherry-picked) unless specified.

Last updated: 2020.05.06

Contents

1. Many-to-Many Conversion
2. Any-to-Many Conversion
3. Use of Automatic Speech Recognition
4. Bonus (curated)

1. Many-to-Many Conversion

The following audio samples are conversions of randomly selected speech from the VCTK test split, which consists of 108 English speakers. Please keep in mind that:

Cotatron uses transcription, whereas Blow does not.
The sampling rates of the audio from Blow (16 kHz) and Cotatron (22.05 kHz) differ.

Transcription fed:
sometimes you get them sometimes you dont


	Source = p293_148.wav
	Target Speaker = p234
	Converted - Blow
	Converted - Cotatron

Transcription fed:
there is no point in looking any further


	Source = p288_175.wav
	Target Speaker = p374
	Converted - Blow
	Converted - Cotatron

Transcription fed:
our task is to complete the picture


	Source = p228_293.wav
	Target Speaker = p301
	Converted - Blow
	Converted - Cotatron

Transcription fed:
how are you sir ?


	Source = p281_071.wav
	Target Speaker = p311
	Converted - Blow
	Converted - Cotatron

Transcription fed:
it is an industry failure


	Source = p264_470.wav
	Target Speaker = p255
	Converted - Blow
	Converted - Cotatron

Transcription fed:
he will go a long way


	Source = p279_243.wav
	Target Speaker = p258
	Converted - Blow
	Converted - Cotatron

Transcription fed:
in fact they are the future of investment


	Source = p244_048.wav
	Target Speaker = p294
	Converted - Blow
	Converted - Cotatron

Transcription fed:
she had every right to read the warrant


	Source = p305_397.wav
	Target Speaker = p374
	Converted - Blow
	Converted - Cotatron

2. Any-to-Many Conversion

Cotatron can convert speech from speakers that are unseen during training. Here we convert randomly selected speech from LibriTTS test-clean to the voices of randomly chosen VCTK speakers, which makes this an any-to-many conversion.

Transcription fed:
I am six feet high, and I could do it with an effort. No one less than that would have a chance.


	Source = 1580_141084_000077_000004-22k.wav
	Target Speaker = p335
	Converted - Cotatron

Transcription fed:
'Can I conjecture why he is gone?' murmured Rachel, still gazing with a wild kind of apathy into distance.


	Source = 5683_32879_000034_000000-22k.wav
	Target Speaker = p306
	Converted - Cotatron

Transcription fed:
Among other instrumentalities for executing the bogus laws, the bogus Legislature had appointed one Samuel j Jones sheriff of Douglas county kansas Territory, although that individual was at the time of his appointment, and long afterwards, United States postmaster of the town of Westport, Missouri.


	Source = 7729_102255_000005_000000-22k.wav
	Target Speaker = p308
	Converted - Cotatron

Transcription fed:
I could just see my uncle at full length on the raft, and Hans still at his helm and spitting fire under the action of the electricity which has saturated him.


	Source = 260_123288_000043_000001-22k.wav
	Target Speaker = p270
	Converted - Cotatron

Transcription fed:
A fresh noise is heard!


	Source = 260_123288_000046_000000-22k.wav
	Target Speaker = p303
	Converted - Cotatron

Transcription fed:
They said they 'were sorry'--that is, 'Wall Street sorry'--and refused to pay it.


	Source = 2300_131720_000026_000007-22k.wav
	Target Speaker = p313
	Converted - Cotatron

Transcription fed:
It appeared that the narrative he had promised to read us really required for a proper intelligence a few words of prologue. Let me say here distinctly, to have done with it, that this narrative, from an exact transcript of my own made much later, is what I shall presently give.


	Source = 121_127105_000042_000003-22k.wav
	Target Speaker = p311
	Converted - Cotatron

Transcription fed:
What did it mean?


	Source = 1089_134691_000027_000005-22k.wav
	Target Speaker = p314
	Converted - Cotatron

3. Use of Automatic Speech Recognition

Cotatron is robust to ASR word errors. In this section, we curate examples of ASR errors and show their effect on the conversion results.

Ground truth transcription:
shareholders will be asked to approve a new replacement scheme
Transcription fed by ASR:
shelters will be asked to prove a new replacement scheme


	Source = p225_149.wav
	Target Speaker = p294
	Converted - Cotatron

Ground truth transcription:
the site has been fully recorded and surveyed
Transcription fed by ASR:
the sight has been fully recorded and surveyed


	Source = p300_255.wav
	Target Speaker = p249
	Converted - Cotatron

Ground truth transcription:
the failings are serious
Transcription fed by ASR:
the feelings are serious


	Source = p317_302.wav
	Target Speaker = p314
	Converted - Cotatron

Ground truth transcription:
the breakdown was much later in her life
Transcription fed by ASR:
the breakdown was much better in her life


	Source = p241_112.wav
	Target Speaker = p300
	Converted - Cotatron
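
To give a rough sense of how far the ASR transcriptions above deviate from the ground truth, the sketch below computes a word error rate (WER) for each pair using a standard word-level edit distance. It is only an illustration: the function name `word_error_rate` and the `PAIRS` list are hypothetical and not part of the Cotatron code, and these per-utterance numbers are not results reported in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words.

    Hypothetical helper for illustration; not taken from the Cotatron repository.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# (ground truth, ASR output) pairs copied from the samples in this section.
PAIRS = [
    ("shareholders will be asked to approve a new replacement scheme",
     "shelters will be asked to prove a new replacement scheme"),
    ("the site has been fully recorded and surveyed",
     "the sight has been fully recorded and surveyed"),
    ("the failings are serious",
     "the feelings are serious"),
    ("the breakdown was much later in her life",
     "the breakdown was much better in her life"),
]

if __name__ == "__main__":
    for truth, asr in PAIRS:
        print(f"{word_error_rate(truth, asr):.3f}  {asr}")
```

For the first pair, two of the ten reference words are substituted, so the WER is 0.2; each of the other pairs has a single substituted word.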

4. Bonus (curated)

We show some entertaining conversion results from the speech of celebrities.

Transcription fed:
We want to live by each other's happiness, not by each other's misery.


	Source = Charlie Chaplin's speech from "The Great Dictator" (YouTube link)
	Target Speaker = p225 (from VCTK)
	Converted - Cotatron

Transcription fed (in Korean):
온갖 음해에 시달렸습니다. 여러분 이거 다 거짓말인거 아시죠?


	Source = Lee Myung-bak's speech from 2007 (YouTube link)
	Target Speaker = KSS (Korean Single Speaker Dataset)
	Converted - Cotatron (Korean version)

This page uses a template from the project page of Battenberg et al., "Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis".