The goal of voice conversion is to transform the speech of a source speaker to sound like
that of a reference speaker while preserving the original content.
A key challenge is to extract disentangled linguistic content from the source and voice
style from the reference.
While existing approaches leverage various methods to isolate the two, generalization still requires further attention, particularly with regard to robustness in zero-shot scenarios.
In this paper, we achieve successful disentanglement of content and speaker features by
tuning self-supervised speech
features with adapters.
The adapters are trained to dynamically encode nuanced information from the rich self-supervised representations, and the decoder fuses these features to produce speech that closely resembles the reference with minimal loss of content.
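Since no implementation details are given here, the following is only a minimal PyTorch sketch of the general idea: lightweight residual adapters that reweight and refine the layer outputs of a frozen self-supervised encoder, with one instance trained for content and another for speaker style. The module names (`BottleneckAdapter`, `AdapterExtractor`), shapes, and sizes are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Residual bottleneck adapter applied on top of frozen SSL features."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the pretrained feature space intact.
        return x + self.up(self.act(self.down(x)))


class AdapterExtractor(nn.Module):
    """Mixes the layer outputs of a frozen SSL encoder with learned weights,
    then refines the mixture with a task-specific adapter. Training one
    instance for content and another for speaker style yields two
    disentangled feature streams from the same backbone (assumed setup)."""

    def __init__(self, num_layers: int, dim: int):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.adapter = BottleneckAdapter(dim)

    def forward(self, layer_feats: torch.Tensor) -> torch.Tensor:
        # layer_feats: (num_layers, batch, frames, dim) from the frozen backbone.
        w = torch.softmax(self.layer_weights, dim=0)
        mixed = torch.einsum("l,lbtd->btd", w, layer_feats)
        return self.adapter(mixed)
```

Because the backbone stays frozen, only the small adapter and the layer weights are updated, which is what lets each stream specialize without disturbing the shared self-supervised features.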
Moreover, we leverage a conditional flow matching decoder with cross-attention speaker conditioning to further boost synthesis quality and efficiency.
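As a rough illustration of this decoding stage, here is a minimal conditional flow matching sketch in the common optimal-transport formulation, with cross-attention to reference speaker tokens. `CrossAttnVelocityNet`, `cfm_loss`, and all dimensions are hypothetical stand-ins rather than the model described in the paper.

```python
import torch
import torch.nn as nn

class CrossAttnVelocityNet(nn.Module):
    """Velocity field v(x_t, t | spk) that attends to reference speaker
    tokens via cross-attention (all sizes are illustrative)."""

    def __init__(self, dim: int = 80, hidden: int = 256, heads: int = 4):
        super().__init__()
        self.in_proj = nn.Linear(dim, hidden)
        self.t_proj = nn.Linear(1, hidden)
        self.spk_proj = nn.Linear(dim, hidden)
        self.cross_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.out = nn.Sequential(nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x_t, t, spk):
        # x_t: (B, T, dim) noisy mel frames; t: (B,); spk: (B, S, dim).
        h = self.in_proj(x_t) + self.t_proj(t[:, None, None])
        s = self.spk_proj(spk)
        # Queries come from the speech frames, keys/values from the speaker.
        attn, _ = self.cross_attn(h, s, s)
        return self.out(h + attn)


def cfm_loss(net, x1, spk, sigma_min: float = 1e-4):
    """Optimal-transport CFM: regress the constant velocity along the
    straight path from noise x0 to a data sample x1."""
    t = torch.rand(x1.size(0), device=x1.device)
    x0 = torch.randn_like(x1)
    tt = t[:, None, None]
    x_t = (1 - (1 - sigma_min) * tt) * x0 + tt * x1
    target = x1 - (1 - sigma_min) * x0
    return ((net(x_t, t, spk) - target) ** 2).mean()
```

At inference, one integrates the learned velocity field from Gaussian noise toward the mel spectrogram with a few ODE solver steps (e.g. Euler), which is where flow matching gains its efficiency over many-step diffusion sampling.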
Subjective and objective evaluations in a zero-shot scenario demonstrate that the proposed
method outperforms existing
models in speech quality and similarity to the reference speech.
Conversion Samples for Unseen Speakers (zero-shot)