Disentangled Representation Learning for
Environment-agnostic Speaker Recognition

Kihyun Nam1, Hee-Soo Heo2, Jee-weon Jung3, Joon Son Chung1

1Korea Advanced Institute of Science and Technology / Republic of Korea.

2Naver Cloud Corporation / Republic of Korea.

3Carnegie Mellon University / USA.

Abstract

This work presents a framework based on feature disentanglement to learn speaker embeddings that are robust to environmental variations.

Our framework utilises an auto-encoder as a disentangler, dividing the input speaker embedding into a speaker-related component and residual information. We employ a set of objective functions to ensure that the auto-encoder's code representation, used as the refined embedding, condenses only the speaker characteristics.

We show the versatility of our framework: it is compatible with any existing speaker embedding extractor and requires no structural modifications or adaptations for integration. We validate its effectiveness by incorporating it into two widely used embedding extractors and conducting experiments across various benchmarks. The results show a performance improvement of up to 16%. Our code for this work is available here.

Overview

Our framework leverages disentangled representation learning (DRL) to make speaker recognition systems robust to environmental variations. The approach, rooted in DRL and built around an auto-encoder, isolates and removes environmental information from speaker embeddings without compromising vital speaker-specific information.
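To make the idea concrete, below is a minimal sketch of such an auto-encoder disentangler. It is not the released implementation: the class and variable names, layer sizes, and the way the code is split into a speaker part and a residual part are illustrative assumptions based on the description above.

```python
import torch
import torch.nn as nn


class Disentangler(nn.Module):
    """Auto-encoder that splits a pre-computed speaker embedding into a
    speaker code (the refined embedding) and a residual code intended to
    absorb environment-related information. Dimensions are illustrative."""

    def __init__(self, emb_dim=256, spk_dim=192, res_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(emb_dim, emb_dim), nn.ReLU(),
            nn.Linear(emb_dim, spk_dim + res_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(spk_dim + res_dim, emb_dim), nn.ReLU(),
            nn.Linear(emb_dim, emb_dim),
        )
        self.spk_dim = spk_dim

    def forward(self, emb):
        # Encode the input embedding into a joint code, then split it.
        code = self.encoder(emb)
        spk_code, res_code = code[:, :self.spk_dim], code[:, self.spk_dim:]
        # Reconstruct the original embedding from the full code.
        recon = self.decoder(code)
        return spk_code, res_code, recon


# Hypothetical usage with any frozen embedding extractor `extractor`:
#   emb = extractor(waveform)                      # (batch, emb_dim)
#   spk_code, res_code, recon = disentangler(emb)  # spk_code is the refined embedding
# Training would combine a reconstruction loss on `recon` with objectives that
# push speaker identity into `spk_code` and environment cues into `res_code`.
```

Because the disentangler operates on embeddings rather than waveforms, it can be attached to any extractor without modifying the extractor itself, which is what allows the framework to remain model-agnostic.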