Enhancing Polyglot Voices by Leveraging Cross-Lingual Fine-Tuning in Any-to-One Voice Conversion
Abstract:
The creation of artificial polyglot voices remains a challenging task, despite considerable progress in recent years. This paper investigates self-supervised learning for voice conversion to create native-sounding polyglot voices. We introduce a novel cross-lingual any-to-one voice conversion system that preserves the source accent without requiring multilingual data from the target speaker. In addition, we present a novel cross-lingual fine-tuning strategy that further improves the accent and reduces the training data requirements. Objective and subjective evaluations with English, Spanish, French and Mandarin Chinese confirm that our approach improves on state-of-the-art methods, enhancing the speech intelligibility and overall quality of the converted speech, especially in cross-lingual scenarios.
Intra-lingual - English
In this section, we present some speech samples used in the intra-lingual subjective evaluation.
We focus on any-to-one conversion using LJSpeech as the target and LibriSpeech test-clean as the source speech.
We compare Proposed and Proposed-F against two baselines: Soft-VC and kNN-VC.
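For context, the kNN-VC baseline converts voices by simple nearest-neighbour regression over self-supervised features: each source frame's feature (e.g. a WavLM layer-6 vector) is replaced by the mean of its k nearest neighbours among the target speaker's frames, and the result is vocoded. A minimal sketch of that matching step, with random arrays standing in for real WavLM features and the vocoder omitted:

```python
import numpy as np

def knn_vc_match(src_feats, tgt_feats, k=4):
    """Replace each source frame with the mean of its k nearest
    target frames under cosine similarity (kNN-VC-style matching)."""
    # L2-normalise so that a dot product equals cosine similarity
    src = src_feats / np.linalg.norm(src_feats, axis=1, keepdims=True)
    tgt = tgt_feats / np.linalg.norm(tgt_feats, axis=1, keepdims=True)
    sim = src @ tgt.T                      # (n_src, n_tgt) similarity matrix
    idx = np.argsort(-sim, axis=1)[:, :k]  # top-k neighbour indices per frame
    return tgt_feats[idx].mean(axis=1)     # (n_src, dim) converted features

# Toy example: 10 "source" frames matched against 200 "target" frames
rng = np.random.default_rng(0)
src = rng.normal(size=(10, 1024))
tgt = rng.normal(size=(200, 1024))
converted = knn_vc_match(src, tgt, k=4)
print(converted.shape)  # (10, 1024)
```

Because every output frame is a convex combination of target-speaker frames, the converted features stay on the target speaker's manifold while following the source's frame sequence.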
Target speaker:
Source
Soft-VC
kNN-VC
Proposed
Proposed-F
Cross-lingual - French
In this section, we present some cross-lingual speech samples for French.
We use LJSpeech as the target and Multilingual LibriSpeech (MLS) French dev + test as the source speech.
We compare Proposed and Proposed-F against two baselines: Soft-VC and kNN-VC.
Target speaker:
Source
Soft-VC
kNN-VC
Proposed
Proposed-F
Cross-lingual - Spanish
In this section, we present some cross-lingual speech samples for Spanish.
We use LJSpeech as the target and Multilingual LibriSpeech (MLS) Spanish test as the source speech.
We compare Proposed and Proposed-F against two baselines: Soft-VC and kNN-VC.
Target speaker:
Source
Soft-VC
kNN-VC
Proposed
Proposed-F
Cross-lingual - Mandarin Chinese
In this section, we present some cross-lingual speech samples for Mandarin Chinese.
We use LJSpeech as the target and AISHELL-1 dev + test as the source speech.
We compare Proposed and Proposed-F against two baselines: Soft-VC and kNN-VC.
Target speaker:
Source
Soft-VC
kNN-VC
Proposed
Proposed-F
Ablation - Emotion Preservation
In this section, we present some intra-lingual speech samples used in the ablation evaluation.
We use LJSpeech as the target and the Emotional Speech Dataset (ESD) as the source speech.
We compare Proposed and Proposed-F considering four emotions: angry, happy, sad, surprise.