Metric Learning for User-defined Keyword Spotting

Jaemin Jung1*, Youkyum Kim1*, Jihwan Park2,3, Youshin Lim2,3, Byeong-Yeol Kim2,3, Youngjoon Jang1, Joon Son Chung1

1 KAIST / Daejeon, Republic of Korea.

2 Hyundai Motor Company / Seoul, Republic of Korea.

3 42Dot Inc / Seoul, Republic of Korea.


The goal of this work is to detect new spoken terms defined by users. Most previous works address Keyword Spotting (KWS) as a closed-set classification problem, which limits their transferability to unseen terms. The ability to define custom keywords, in contrast, improves the user experience.

In this paper, we propose a metric learning-based training strategy for user-defined keyword spotting. In particular, we make the following contributions: (1) we construct a large-scale keyword dataset from an existing speech corpus and propose a filtering method to remove data that degrade model training; (2) we propose a metric learning-based two-stage training strategy, and demonstrate that the proposed method improves performance on the user-defined keyword spotting task by enriching keyword representations; (3) to facilitate fair comparison in the user-defined KWS field, we propose a unified evaluation protocol and metrics.

Our proposed system does not require incremental training on the user-defined keywords, and outperforms previous works by a significant margin on the Google Speech Commands dataset under both the proposed and the existing metrics.
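To illustrate how a metric learning-based system can accept new keywords without incremental training, the following is a minimal sketch rather than the released implementation. It assumes that each enrollment utterance has already been mapped to a fixed-size embedding, that a keyword is represented by the average of its enrollment embeddings, and that queries are scored by cosine similarity. The function names, the embedding size and the toy data are all hypothetical.

```python
import numpy as np


def l2_normalise(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def enroll(shot_embeddings):
    """Average the enrollment embeddings of one keyword into a unit-length prototype.

    Averaging is an assumption made for this sketch.
    """
    return l2_normalise(np.mean(l2_normalise(shot_embeddings), axis=0))


def score(query_embedding, prototypes):
    """Cosine similarity between one query and every keyword prototype."""
    return prototypes @ l2_normalise(query_embedding)


# Toy data standing in for the output of an embedding network (hypothetical).
rng = np.random.default_rng(0)
dim = 8
centres = np.eye(dim)[:2]  # two well-separated "keywords"
shots = centres[:, None, :] + 0.05 * rng.standard_normal((2, 10, dim))  # 10 shots each

# Enrollment only averages embeddings; no network weights are updated.
prototypes = np.stack([enroll(s) for s in shots])

query = centres[1] + 0.05 * rng.standard_normal(dim)
scores = score(query, prototypes)
predicted = int(np.argmax(scores))
```

In a detection setting, the highest score would additionally be compared against a threshold so that utterances containing none of the enrolled keywords can be rejected.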

Our paper is available here.

Our code and dataset are available on our GitHub repository.

LibriSpeech Keywords (LSK): Grad-CAM







The LSK dataset consists of 1-second keyword audio samples, which include the noise or utterances that may occur before or after the keywords in a real-world scenario. We utilise Grad-CAM to show visually whether the model can locate the exact part of the input audio that contains the user-defined keyword. As shown above, the model trained on our LSK dataset focuses on the region where the target keyword is spoken.
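For readers unfamiliar with Grad-CAM, the core computation can be sketched as follows. This is a generic illustration of the standard Grad-CAM formula, not the visualisation code used for this page. It assumes that the activations of a chosen convolutional layer, and the gradients of the keyword score with respect to them, have already been extracted as arrays of shape (channels, frequency, time). The toy arrays are hypothetical.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap from one layer's activations and gradients.

    Both inputs have shape (channels, freq, time). The result has shape
    (freq, time) and is scaled to [0, 1].
    """
    # One importance weight per channel: the spatially averaged gradient.
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum of the activation maps over the channel axis.
    cam = np.tensordot(weights, activations, axes=1)
    # Keep only the regions that increase the keyword score.
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam


# Toy example: channel 0 responds to the keyword, channel 1 to leading noise.
acts = np.zeros((2, 3, 10))
acts[0, :, 4:7] = 1.0
acts[1, :, 0:2] = 1.0
grads = np.zeros_like(acts)
grads[0] = 0.5    # raising channel 0 raises the keyword score
grads[1] = -0.5   # raising channel 1 lowers it

heatmap = grad_cam(acts, grads)
time_profile = heatmap.mean(axis=0)  # average over frequency to localise in time
```

In this toy case the heatmap is non-zero only in time frames 4 to 6, which mirrors the behaviour described above: the map highlights where the keyword occurs and ignores the surrounding audio.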

Effectiveness of Fixed-length Truncation

Table. 1 - Experimental results according to data trimming methods. All experiments are conducted with 10-shot enrollment.

At inference time, it is not possible to segment the input audio into keyword units, since we do not have temporal segmentation labels. Our aim is therefore to construct a KWS dataset that simulates a real-world scenario in which the model must detect target keywords from a continuous audio input. We perform ablations to verify the effectiveness of the fixed-length truncation process compared to the zero-padding operation. As shown in Table. 1, the model trained on the truncated data outperforms the model trained on the zero-padded data.
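The difference between the two trimming methods can be sketched as below. This is an illustrative sketch under stated assumptions, not the dataset construction script: it assumes a 16 kHz sampling rate, assumes that keyword boundaries (`start`, `end`, in samples) are available at dataset construction time, and assumes that the fixed-length window is centred on the keyword. The function names are hypothetical.

```python
import numpy as np

SAMPLE_RATE = 16000        # assumed sampling rate
TARGET_LEN = SAMPLE_RATE   # 1-second samples, as in the LSK dataset


def truncate_fixed_length(utterance, start, end, target_len=TARGET_LEN):
    """Cut a fixed-length window around the keyword, keeping the surrounding audio."""
    centre = (start + end) // 2
    left = centre - target_len // 2
    # Shift the window so that it stays inside the utterance.
    left = max(0, min(left, len(utterance) - target_len))
    window = utterance[left:left + target_len]
    # Only utterances shorter than the window need padding.
    if len(window) < target_len:
        window = np.pad(window, (0, target_len - len(window)))
    return window


def zero_pad_keyword(utterance, start, end, target_len=TARGET_LEN):
    """Crop exactly the keyword and centre it between zeros."""
    keyword = utterance[start:end][:target_len]
    total = target_len - len(keyword)
    return np.pad(keyword, (total // 2, total - total // 2))


# Toy 3-second utterance of non-zero samples, keyword from 1.2 s to 1.6 s.
utterance = np.ones(3 * SAMPLE_RATE)
start, end = 19200, 25600

truncated = truncate_fixed_length(utterance, start, end)
padded = zero_pad_keyword(utterance, start, end)
```

Here the truncated sample contains real audio throughout, including whatever precedes and follows the keyword, whereas the zero-padded sample contains 0.4 s of audio surrounded by silence that a model would not encounter in a continuous input stream.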

Additional Experiments on Other Datasets

Table. 2 - Results on MSWC dataset with 3 languages and Hey Snapdragon dataset. All experiments are conducted with 10-shot enrollment. # words: Number of words in each keyword.

To verify that the proposed model generalises well to diverse KWS datasets, we conduct additional experiments on two datasets: the Multilingual Spoken Words Corpus (MSWC) and the Hey Snapdragon Keyword Dataset (Hey Snapdragon). In both cases, we start from the model pre-trained on the LSK dataset with the AP loss. For MSWC, we choose 3 subsets (English, Spanish, and Polish) and fine-tune our model on them. For Hey Snapdragon, which consists of four 2-word keyword classes, three keywords are used for fine-tuning and the remaining one is used for testing. Since there is only one target keyword, the accuracy and the F1-score cannot be computed for this dataset. As shown in Table. 2, the proposed two-stage method shows a significant performance improvement on all datasets, regardless of the language or the number of words in each keyword.

Detection Error Tradeoff (DET) curves

Figure. 1 - Detection Error Tradeoff (DET) curves for various experimental setups, varying (a) the training strategy and objective function, and (b) the number of keyword classes and samples per class. PT and FT denote pre-training and fine-tuning, respectively.

To interpret the trade-off between False Alarm Rate (FAR) and False Rejection Rate (FRR) more clearly, we visualise Detection Error Tradeoff (DET) curves. Figure. 1(a) and Figure. 1(b) on this page correspond to Table. 2 and Table. 3 in the main paper, respectively. The point at which FRR equals FAR is the Equal Error Rate (EER). We also report the EERs in the legend of each figure.
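As a worked illustration of these quantities, the sketch below computes FAR and FRR over a sweep of thresholds and reads off an approximate EER. It is a generic sketch, not the evaluation script behind the figures. It assumes that a trial is accepted when its similarity score is greater than or equal to the threshold, and the scores are made up for the example.

```python
import numpy as np


def det_points(target_scores, nontarget_scores):
    """Return thresholds with the FAR and FRR obtained at each of them.

    A trial is accepted when its score is >= the threshold (an assumption).
    """
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    # FAR: fraction of non-target trials that are wrongly accepted.
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    # FRR: fraction of target trials that are wrongly rejected.
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    return thresholds, far, frr


def equal_error_rate(far, frr):
    """Approximate the EER at the threshold where FAR and FRR are closest."""
    i = int(np.argmin(np.abs(far - frr)))
    return (far[i] + frr[i]) / 2.0


# Made-up similarity scores for illustration only.
target_scores = np.array([0.9, 0.8, 0.7, 0.3])
nontarget_scores = np.array([0.6, 0.2, 0.1, 0.05])

thresholds, far, frr = det_points(target_scores, nontarget_scores)
eer = equal_error_rate(far, frr)
```

Plotting `frr` against `far` gives the DET curve for this toy system. With these scores, one of four target trials is rejected and one of four non-target trials is accepted at a threshold of 0.6, so the EER is 25%.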

Model (Res15) Specification

Table. 3 - The number of parameters, operations, and the inference time.

Table. 3 shows the number of model parameters, the number of operations, and the inference time. The inference time is measured on a single NVIDIA A5000 GPU with a single utterance as the model input.