CrossSpeech++: Cross-lingual Speech Synthesis with Decoupled Language and Speaker Generation

Ji-Hoon Kim$^1$, Hong-Sun Yang$^2$, Yoon-Cheol Ju$^2$, Il-Hwan Kim$^2$, Byeong-Yeol Kim$^2$, Joon Son Chung$^1$

$^1$ Korea Advanced Institute of Science and Technology, Republic of Korea
$^2$ 42dot Inc., Republic of Korea

Abstract

The goal of this work is to generate natural speech in multiple languages while maintaining the same speaker identity, a task known as cross-lingual speech synthesis. A key challenge is the language-speaker entanglement problem, which causes the quality of cross-lingual systems to lag behind that of intra-lingual systems. In this paper, we propose CrossSpeech++, which effectively disentangles language and speaker information and significantly improves the quality of cross-lingual speech synthesis. To this end, we break the complex speech generation pipeline into two simple components: a language-dependent generator and a speaker-dependent generator. The language-dependent generator produces linguistic variations that are not biased by specific speaker attributes. The speaker-dependent generator models acoustic variations that characterize speaker identity. By handling each type of information in a separate module, our method can effectively disentangle language and speaker representations. We conduct extensive experiments using various metrics and demonstrate that CrossSpeech++ achieves significant improvements in cross-lingual speech synthesis, outperforming existing methods by a large margin.
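The two-stage decomposition described in the abstract can be pictured as a simple data flow. The sketch below is a toy illustration only and is not the CrossSpeech++ architecture: the names `Utterance`, `language_dependent_generator`, `speaker_dependent_generator`, and `synthesize` are hypothetical, and the arithmetic inside each function is a placeholder for a learned network. It only shows the structural idea that the first stage never sees speaker information, while the second stage adds it.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Utterance:
    """Hypothetical input container: text plus a language tag such as 'en' or 'ko'."""
    text: str
    language: str


def language_dependent_generator(utt: Utterance) -> List[float]:
    """Toy stand-in for the language-dependent generator.

    It receives only text and language, so its output cannot depend
    on the speaker. The arithmetic is a placeholder, not a real model.
    """
    lang_offset = sum(ord(c) for c in utt.language) % 7
    return [float((ord(ch) + lang_offset) % 11) for ch in utt.text]


def speaker_dependent_generator(
    linguistic: Sequence[float], speaker_embedding: Sequence[float]
) -> List[List[float]]:
    """Toy stand-in for the speaker-dependent generator.

    It adds speaker-specific variation to each linguistic frame.
    """
    return [[x + s for s in speaker_embedding] for x in linguistic]


def synthesize(utt: Utterance, speaker_embedding: Sequence[float]) -> List[List[float]]:
    """Chain the two stages: linguistic content first, speaker identity second."""
    linguistic = language_dependent_generator(utt)
    return speaker_dependent_generator(linguistic, speaker_embedding)
```

In this sketch, calling `synthesize` on the same `Utterance` with two different speaker embeddings reuses an identical linguistic representation and changes only the second stage, which mirrors the separation of roles the abstract describes.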

Audio Samples

Systems compared: Ground Truth, Vocoded, FP, FP + DAT, FP + DAT + $\mathcal{L}_{reg}$, CrossSpeech, CrossSpeech++ (Ours)

A dash ($-$) marks a sample that is not available for that system.

     * Reference English speaker
     - English Text: It was a bit of a shock.
     - Korean Text: 어떤 영상을 유튜브로 틀어드릴까요? (Ground Truth: $-$, Vocoded: $-$)
     - Chinese Text: 邓小平与撒切尔会晤。 (Ground Truth: $-$, Vocoded: $-$)
     - Japanese Text: 原告の主張は、いずれも失当である。 (Ground Truth: $-$, Vocoded: $-$)
     * Reference Korean speaker
     - English Text: Well, it did last time, he was reminded. (Ground Truth: $-$, Vocoded: $-$)
     - Korean Text: 교과서 몇 쪽까지 배웠나 알려드릴게요.
     - Chinese Text: 卡尔普陪外孙玩滑梯。 (Ground Truth: $-$, Vocoded: $-$)
     - Japanese Text: この前探った時は、途中に瘢痕の隆起があったので、ついそこが行きどまりだとばかり思って、ああ云ったんですが、 (Ground Truth: $-$, Vocoded: $-$)
     * Reference Chinese speaker
     - English Text: The highest rate is in Glasgow. (Ground Truth: $-$, Vocoded: $-$)
     - Korean Text: 보기 기능을 실행시켜 즐겨찾기를 찾아보겠습니다. (Ground Truth: $-$, Vocoded: $-$)
     - Chinese Text: 假语村言别再拥抱我。
     - Japanese Text: 治療期間を、三から六か月に限定して、認められるべきである。 (Ground Truth: $-$, Vocoded: $-$)
     * Reference Japanese speaker
     - English Text: I'm sure everyone will be delighted for them. (Ground Truth: $-$, Vocoded: $-$)
     - Korean Text: 설교 빼고는 없으십니다. (Ground Truth: $-$, Vocoded: $-$)
     - Chinese Text: 假语村言别再拥抱我。 (Ground Truth: $-$, Vocoded: $-$)
     - Japanese Text: 電気伝導度の測定方法は、例えば次のようにして行う。