FreGrad: Lightweight and fast frequency-aware diffusion vocoder

Abstract

The goal of this paper is to generate realistic audio with a lightweight and fast diffusion-based vocoder, named FreGrad. Our framework consists of three key components: (1) we employ a discrete wavelet transform that decomposes a complicated waveform into sub-band wavelets, which helps FreGrad operate on a simple and concise feature space; (2) we design a frequency-aware dilated convolution that elevates frequency awareness, resulting in speech with accurate frequency information; and (3) we introduce a bag of tricks that boosts the generation quality of the proposed model. In our experiments, FreGrad achieves 3.7 times faster training and 2.2 times faster inference than our baseline while reducing the model size by 0.6 times (only 1.78M parameters), without sacrificing output quality.
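As a rough illustration of component (1), the sketch below applies a one-level discrete wavelet transform to a toy waveform and then inverts it. It is not FreGrad's implementation: the Haar wavelet, the function names `haar_dwt` and `haar_idwt`, and the sine-wave input are all assumptions chosen for brevity. It only shows that the transform splits a signal into two sub-bands of half the length and that the original samples can be recovered from them.

```python
import math

SCALE = 1.0 / math.sqrt(2.0)  # orthonormal Haar scaling factor


def haar_dwt(signal):
    """One-level Haar DWT (assumed wavelet): returns (low, high) sub-bands."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    # Low band: scaled sums of neighbouring samples (coarse shape).
    low = [(signal[i] + signal[i + 1]) * SCALE for i in range(0, len(signal), 2)]
    # High band: scaled differences of neighbouring samples (fine detail).
    high = [(signal[i] - signal[i + 1]) * SCALE for i in range(0, len(signal), 2)]
    return low, high


def haar_idwt(low, high):
    """Inverse of haar_dwt: interleave the reconstructed sample pairs."""
    out = []
    for l, h in zip(low, high):
        out.append((l + h) * SCALE)
        out.append((l - h) * SCALE)
    return out


# Toy waveform (hypothetical input): 64 samples of a sine wave.
waveform = [math.sin(2 * math.pi * 5 * n / 64) for n in range(64)]

low, high = haar_dwt(waveform)
restored = haar_idwt(low, high)
max_error = max(abs(a - b) for a, b in zip(waveform, restored))

print(len(waveform), len(low), len(high))
print(max_error < 1e-12)
```

Each sub-band is half as long as the input, so a model operating on the sub-bands processes shorter sequences, and the inverse transform maps its output back to a full-length waveform.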

Synthesized samples

Samples generated with 50 iterations.

Groundtruth	WaveGrad	DiffWave	PriorGrad	FreGrad
The crowd began to congregate in and about the Old Bailey.

The due relation of letter to pictures and other ornament was thoroughly understood by the old printers; so that

Heard noises that sounded like firecrackers and ran toward the President's limousine.

He concluded, quote, There is no doubt in my mind that these fibers could have come from this shirt.

Restricting the coverage in this way would avoid unnecessary controversy over the inclusion or exclusion of other officials who are in the order of succession

So that I know not where we can hope to find any absolute distinction between animals and plants, unless we return to their mode of nutrition

A measure which must greatly tend to discourage attempts to escape.

The privileges of the master's side also disappeared; fees were nominally abolished, and garnish was scotched, although not yet killed outright.

Soon after midnight on the Sunday night, for by this time the present practice of executing on Monday morning had been pretty generally introduced,

And also about Mrs. Paine.

Compare Training Time Between PriorGrad and FreGrad

Ablation studies for quality

Samples generated with 50 iterations.

Ours	w/o Freq-DConv	w/o separate prior	w/o zero SNR	w/o L_mag

BibTeX

@INPROCEEDINGS{fregrad,
      author={Tan Dat Nguyen and Ji-Hoon Kim and Youngjoon Jang and Jaehun Kim and Joon Son Chung},
      booktitle={ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
      title={FreGrad: lightweight and fast frequency-aware diffusion vocoder},
      year={2024},
}

FreGrad gradually removes noise from wavelet features.