Eta-WavLM: Efficient Speaker Identity Removal in Self-Supervised Speech Representations Using a Simple Linear Equation
Giuseppe Ruggiero, Matteo Testa, Jurgen Van de Walle, Luigi Di Caro
Abstract: Self-supervised learning (SSL) has reduced the reliance on expensive labeling in speech technologies by learning meaningful representations from unannotated data. Since most SSL-based downstream tasks prioritize content information in speech, ideal representations should disentangle content from unwanted variations such as speaker characteristics. However, removing speaker information often degrades other speech components, and existing methods either fail to fully disentangle speaker identity or require resource-intensive models. In this paper, we propose a novel disentanglement method that linearly decomposes SSL representations into speaker-specific and speaker-independent components, effectively generating speaker-disentangled representations. Comprehensive experiments show that our approach achieves speaker independence; when applied to content-driven tasks such as voice conversion, our representations yield significant improvements over state-of-the-art methods.
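To make the "simple linear equation" in the title concrete, the sketch below is a minimal NumPy illustration of one way such a linear decomposition can be realized: regress SSL features on per-utterance speaker embeddings (e.g., d-vectors) by ordinary least squares, and keep the residual as the speaker-independent component. All function names, dimensions, and the choice of speaker embedding here are illustrative assumptions, not the paper's exact implementation (see the paper for the actual method).

```python
import numpy as np

def fit_speaker_basis(ssl_feats: np.ndarray, spk_embs: np.ndarray):
    """Fit (A, b) such that ssl_feats ~= spk_embs @ A + b via least squares.

    ssl_feats: (N, D) frame-level SSL representations.
    spk_embs:  (N, K) speaker embedding of each frame's utterance.
    """
    # Append a bias column so the intercept b is solved jointly with A.
    X = np.hstack([spk_embs, np.ones((spk_embs.shape[0], 1))])
    W, *_ = np.linalg.lstsq(X, ssl_feats, rcond=None)  # closed-form LS fit
    return W[:-1], W[-1]  # A: (K, D), b: (D,)

def remove_speaker(ssl_feats: np.ndarray, spk_embs: np.ndarray, A, b):
    """Subtract the speaker-predictable part; the residual approximates
    a speaker-independent representation."""
    return ssl_feats - (spk_embs @ A + b)

# Toy usage with random stand-ins for WavLM features and speaker embeddings.
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 768))  # e.g., WavLM hidden states
embs = rng.normal(size=(1000, 192))   # e.g., d-vectors, repeated per frame
A, b = fit_speaker_basis(feats, embs)
content = remove_speaker(feats[:10], embs[:10], A, b)
```

In a setup like this, the linear map would presumably be fit once on a training corpus and then applied frame-wise to unseen utterances.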
Voice Conversion Demo - LJSpeech
In this section, we present speech samples generated by the voice conversion system (Section 3.2), using the cleaner LJSpeech dataset for the target speaker and the LibriSpeech test-clean set as the source speech.
We compare Proposed Eta-WavLM against five baselines: WavLM, Perturbation, Soft, Utterance Std, and Vector Quantization.
LJSpeech Target (F): *(reference audio sample)*

| Scenario | Source Speech | WavLM | Perturbation | Soft | Utterance Std | Vector Quantization | Proposed Eta-WavLM |
|---|---|---|---|---|---|---|---|
| M -> F | *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
| M -> F | *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
| F -> F | *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
| F -> F | *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
Voice Conversion Demo - Elliot Miller
In this section, we present speech samples generated by the voice conversion system (Section 3.2), using the more challenging Elliot Miller data for the target speaker and the LibriSpeech test-clean set as the source speech.
We compare Proposed Eta-WavLM against five baselines: WavLM, Perturbation, Soft, Utterance Std, and Vector Quantization.
Elliot Miller Target (M): *(reference audio sample)*

| Scenario | Source Speech | WavLM | Perturbation | Soft | Utterance Std | Vector Quantization | Proposed Eta-WavLM |
|---|---|---|---|---|---|---|---|
| M -> M | *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
| M -> M | *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
| F -> M | *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
| F -> M | *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* |