The LSK dataset consists of 1-second keyword audio samples that include noises or utterances which may occur before or after the keyword in a real-world scenario. We utilise Grad-CAM to visually verify whether the model can detect the exact location of user-defined keywords in the input audio. As shown above, the model trained on our LSK dataset focuses on the region where the target keyword is spoken.
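As a reference for how such visualisations can be produced, the sketch below computes a 1D Grad-CAM over the time axis of a convolutional feature map. It assumes a generic PyTorch CNN-based KWS classifier; the function and argument names are hypothetical and not taken from our implementation.

```python
import torch
import torch.nn.functional as F

def grad_cam_1d(model, conv_layer, audio, class_idx):
    """Grad-CAM over the time axis of a 1D conv feature map.

    model:      a CNN-based KWS classifier (hypothetical)
    conv_layer: the conv module whose activations we visualise
    audio:      input waveform tensor of shape (1, 1, T)
    class_idx:  index of the target keyword class
    """
    activations, gradients = {}, {}

    def fwd_hook(_, __, output):
        activations["a"] = output            # feature map, shape (1, C, T')

    def bwd_hook(_, __, grad_output):
        gradients["g"] = grad_output[0]      # gradient w.r.t. the feature map

    h1 = conv_layer.register_forward_hook(fwd_hook)
    h2 = conv_layer.register_full_backward_hook(bwd_hook)

    logits = model(audio)
    model.zero_grad()
    logits[0, class_idx].backward()          # backprop the target keyword score
    h1.remove()
    h2.remove()

    # Grad-CAM: weight each channel by its time-pooled gradient, then ReLU.
    weights = gradients["g"].mean(dim=2, keepdim=True)       # (1, C, 1)
    cam = F.relu((weights * activations["a"]).sum(dim=1))    # (1, T')
    cam = cam / (cam.max() + 1e-8)                           # normalise to [0, 1]
    return cam  # high values mark where the keyword is spotted in time
```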
At inference time, it is not possible to segment the input audio into keyword units, since temporal segmentation labels are not available. Our goal is to construct a KWS dataset that simulates a real-world scenario in which the model must detect target keywords from a continuous audio input. We perform ablations to verify the effectiveness of the fixed-length truncation process compared to the zero-padding operation. As shown in Table 1, the model trained on the truncated data outperforms the model trained on the zero-padded data.
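The two preprocessing options can be sketched as follows. This is a minimal illustration assuming 16 kHz mono waveforms stored as NumPy arrays; the random-crop policy used for truncation here is our assumption for illustration, not necessarily the exact procedure used to build the dataset.

```python
import numpy as np

SAMPLE_RATE = 16000
FIXED_LEN = SAMPLE_RATE  # 1-second clips, as in the LSK dataset

def fix_length(wav, mode="truncate"):
    """Bring a waveform to exactly FIXED_LEN samples.

    mode="truncate": keep a FIXED_LEN crop of a longer recording
                     (random crop here; the cropping policy is an assumption).
    mode="pad":      keep the clip as-is and zero-pad short clips.
    """
    if len(wav) >= FIXED_LEN:
        if mode == "truncate":
            start = np.random.randint(0, len(wav) - FIXED_LEN + 1)
            return wav[start:start + FIXED_LEN]
        return wav[:FIXED_LEN]
    # Shorter than FIXED_LEN: zero-pad at the end.
    return np.pad(wav, (0, FIXED_LEN - len(wav)))
```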
To verify that the proposed model generalises well to diverse KWS datasets, we conduct additional experiments on two datasets: the Multilingual Spoken Words Corpus (MSWC) and the Hey Snapdragon Keyword Dataset (Hey Snapdragon). In both cases, we fine-tune the model pre-trained on the LSK dataset with the AP loss. For MSWC, we choose three subsets (English, Spanish, and Polish) and fine-tune our model on each of them. Hey Snapdragon consists of four 2-word keyword classes; three keywords are used for fine-tuning, and the remaining one is held out for the test. Since only one keyword is the target, the accuracy and the F1-score cannot be computed. As shown in Table 2, the proposed two-stage method shows a significant performance improvement on all datasets, regardless of the language or the number of words in each keyword.
To clearly interpret the trade-off between the False Alarm Rate (FAR) and the False Rejection Rate (FRR), we visualise Detection Error Tradeoff (DET) curves. Figures 1(a) and 1(b) on this page correspond to Tables 2 and 3 in the main paper, respectively. The point where the FRR equals the FAR represents the Equal Error Rate (EER). We also report the EERs in the legend of each figure.
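For reference, the EER can be read off the DET curve as the FAR/FRR crossing point. A minimal sketch using scikit-learn's `roc_curve` is given below; note that the FAR is the false positive rate and the FRR is one minus the true positive rate.

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    """Equal Error Rate: the point on the DET curve where FRR == FAR.

    labels: 1 for target-keyword trials, 0 for non-target trials
    scores: detection scores (higher means more likely a target)
    """
    far, tpr, thresholds = roc_curve(labels, scores)  # FAR = false positive rate
    frr = 1.0 - tpr                                   # FRR = miss rate
    idx = np.nanargmin(np.abs(frr - far))             # closest crossing point
    eer = (far[idx] + frr[idx]) / 2.0
    return eer, thresholds[idx]
```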
Table 3 shows the number of model parameters, the computational cost, and the inference time. The inference time is measured on a single NVIDIA A5000 GPU with only one utterance sample input to the model (i.e., batch size 1).
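A minimal sketch of such a measurement in PyTorch is shown below. The 1-second, 16 kHz input shape is an assumption about the model's input format; the warm-up iterations and synchronisation calls are needed so that asynchronous CUDA execution does not distort the timing.

```python
import time
import torch

@torch.no_grad()
def measure_latency(model, n_warmup=10, n_runs=100):
    """Average single-utterance inference time on one GPU."""
    device = torch.device("cuda")
    model = model.to(device).eval()
    x = torch.randn(1, 1, 16000, device=device)  # one utterance (batch size 1)

    for _ in range(n_warmup):                    # warm-up excludes CUDA init cost
        model(x)
    torch.cuda.synchronize()                     # wait for queued kernels

    start = time.perf_counter()
    for _ in range(n_runs):
        model(x)
    torch.cuda.synchronize()                     # ensure all runs have finished
    return (time.perf_counter() - start) / n_runs
```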