Audio Samples from "SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech"

Paper: SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech

Authors: Hyunjae Cho, Wonbin Jung, Junhyeok Lee, Sang Hoon Woo @MINDsLab Inc.

Abstract: In this paper, we present SANE-TTS, a stable and natural end-to-end multilingual TTS model. Because a multilingual corpus for a given speaker is difficult to obtain, training a multilingual TTS model on monolingual corpora is unavoidable. We introduce a speaker regularization loss that improves speech naturalness during cross-lingual synthesis, alongside domain adversarial training, which is also applied in other multilingual TTS models. Furthermore, with the speaker regularization loss in place, replacing the speaker embedding with a zero vector in the duration predictor stabilizes cross-lingual inference. With this replacement, our model generates speech with moderate rhythm regardless of the source speaker in cross-lingual synthesis. In MOS evaluation, SANE-TTS achieves a naturalness score above 3.80 in both cross-lingual and intralingual synthesis, where the ground truth score is 3.99. SANE-TTS also maintains speaker similarity close to that of the ground truth, even in cross-lingual inference. Audio samples are available on our web page.
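
The zero-vector replacement mentioned in the abstract can be illustrated with a toy sketch. This is not the SANE-TTS implementation: the function `toy_duration_predictor`, its weights, and the two-dimensional embeddings are hypothetical, and the predictor is reduced to a single linear layer. The sketch only shows that once the speaker embedding is replaced by zeros at inference, the predicted durations depend on the text alone and no longer vary with the source speaker.

```python
from typing import List, Optional, Sequence


def toy_duration_predictor(
    text_hidden: Sequence[Sequence[float]],
    speaker_emb: Optional[Sequence[float]],
    w_text: Sequence[float],
    w_spk: Sequence[float],
) -> List[float]:
    """Hypothetical linear duration predictor: one value per input token.

    Passing speaker_emb=None mimics the zero-vector replacement, so the
    speaker term contributes nothing to the prediction.
    """
    if speaker_emb is None:
        # Zero vector used in place of the speaker embedding.
        speaker_emb = [0.0] * len(w_spk)
    spk_term = sum(w * s for w, s in zip(w_spk, speaker_emb))
    return [
        sum(w * h for w, h in zip(w_text, frame)) + spk_term
        for frame in text_hidden
    ]


# Made-up numbers: two tokens with 2-dimensional hidden states.
text_hidden = [[1.0, 2.0], [0.5, 0.5]]
w_text = [1.0, 0.5]
w_spk = [2.0, -1.0]

speaker_a = [0.5, 0.25]  # hypothetical speaker embeddings
speaker_b = [1.0, 0.5]

with_a = toy_duration_predictor(text_hidden, speaker_a, w_text, w_spk)
with_b = toy_duration_predictor(text_hidden, speaker_b, w_text, w_spk)
zeroed = toy_duration_predictor(text_hidden, None, w_text, w_spk)

print(with_a)  # varies with the speaker
print(with_b)  # varies with the speaker
print(zeroed)  # same for every speaker
```

In this toy setting the two speakers yield different durations for the same text, while the zeroed call returns one speaker-independent result. How the real model keeps that result natural is what the speaker regularization loss addresses, and that part is not modeled here.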

Abstract
spoken by LibriTTS4356

Abstract translated into Korean
spoken by LibriTTS4356


We use 203.4 hours of internal and external speech corpora in four languages. The external datasets include LJ Speech, LibriTTS, KSS, AISHELL-3, and JVS.

Talking Face Generation with Multilingual TTS (CVPR 2022 Demo), which is based on SANE-TTS, is available on HuggingFace Spaces.

Contents

Multilingual Speech Synthesis
Comparison with Another Model
Ablation Study
Long Text
Code Switching

Multilingual Speech Synthesis

Speaker

Reference Speech

English Target:
I saw him come from your window, and I saw all that passed between you in the balcony.

Korean Target (Romanization):
언젠가 들은 적이 있는 것 같거든요.
(Eonjenga deul-eun jeog-i issneun geos gatgeodeun-yo.)

Japanese Target (Romanization):
許可書がなければここへは入れない。
(Kyoka-sho ga nakereba koko e wa hairenai.)

Chinese Target (Hanyu Pinyin):
替我播放相思风雨中
(Tì wǒ bòfàng xiāngsī fēngyǔ zhōng)

LJ (from English dataset)
LibriTTS4356 (from English dataset)
KSS (from Korean dataset)
jvs001 (from Japanese dataset)
SSB0073 (from Chinese dataset)

Comparison with Another Model

Text: She saw him displace the bar and slip into the garden.

Columns: English (LibriTTS4356), Korean (KSS), Japanese (jvs001), Chinese (SSB0073)
Rows (Model): Ground Truth, Meta-Learning Model, SANE-TTS

Each column header gives the language of the speaker's utterances in the dataset.


Ablation Study

Text: When she was left by herself the poor girl began to feel afraid.

Columns: English (LibriTTS4356), Korean (KSS), Japanese (jvs001), Chinese (SSB0073)
Rows (Model): Ground Truth, SANE-TTS, without DANN, without Regularization, with SDP

Each column header gives the language of the speaker's utterances in the dataset.


Long Text

Text(497 words): In my younger and more vulnerable years my father gave me some advice that I’ve been turning over in my mind ever since. “Whenever you feel like criticizing any one,” he told me, “just remember that all the people in this world haven’t had the advantages that you’ve had.” He didn’t say any more, but we’ve always been unusually communicative in a reserved way, and I understood that he meant a great deal more than that. In consequence, I’m inclined to reserve all judgments, a habit that has opened up many curious natures to me and also made me the victim of not a few veteran bores. The abnormal mind is quick to detect and attach itself to this quality when it appears in a normal person, and so it came about that in college I was unjustly accused of being a politician, because I was privy to the secret griefs of wild, unknown men. Most of the confidences were unsought — frequently I have feigned sleep, preoccupation, or a hostile levity when I realized by some unmistakable sign that an intimate revelation was quivering on the horizon; for the intimate revelations of young men, or at least the terms in which they express them, are usually plagiaristic and marred by obvious suppressions. Reserving judgments is a matter of infinite hope. I am still a little afraid of missing something if I forget that, as my father snobbishly suggested, and I snobbishly repeat, a sense of the fundamental decencies is parcelled out unequally at birth. And, after boasting this way of my tolerance, I come to the admission that it has a limit. Conduct may be founded on the hard rock or the wet marshes, but after a certain point I don’t care what it’s founded on. When I came back from the East last autumn I felt that I wanted the world to be in uniform and at a sort of moral attention forever; I wanted no more riotous excursions with privileged glimpses into the human heart. 
Only Gatsby, the man who gives his name to this book, was exempt from my reaction — Gatsby, who represented everything for which I have an unaffected scorn. If personality is an unbroken series of successful gestures, then there was something gorgeous about him, some heightened sensitivity to the promises of life, as if he were related to one of those intricate machines that register earthquakes ten thousand miles away. This responsiveness had nothing to do with that flabby impressionability which is dignified under the name of the “creative temperament.”— it was an extraordinary gift for hope, a romantic readiness such as I have never found in any other person and which it is not likely I shall ever find again. No — Gatsby turned out all right at the end; it is what preyed on Gatsby, what foul dust floated in the wake of his dreams that temporarily closed out my interest in the abortive sorrows and shortwinded elations of men.

Columns: English (LibriTTS4356), Korean (KSS), Japanese (jvs001), Chinese (SSB0073)
Rows (Model): SANE-TTS

Each column header gives the language of the speaker's utterances in the dataset.


Code Switching

Text: 東京에 가서 麻辣烫 먹고싶다. Let's go.
(Tōkyōe gaseo Málà tàng meoggo sipda. Let's go.)

Columns: English (LibriTTS4356), Korean (KSS), Japanese (jvs001), Chinese (SSB0073)
Rows (Model): SANE-TTS

Text: いち, two, 삼, 四.
(Ichi, two, sam, sì.)

Columns: English (LibriTTS4356), Korean (KSS), Japanese (jvs001), Chinese (SSB0073)
Rows (Model): SANE-TTS

Each column header gives the language of the speaker's utterances in the dataset.