InfiniteAudio: Infinite-Length Audio Generation with Consistency

Interspeech 2025

Overall pipeline for InfiniteAudio

Abstract

This paper presents InfiniteAudio, a simple yet effective strategy for generating infinite-length audio using diffusion-based text-to-audio methods. Current approaches face memory constraints because the output size increases with input length, making long-duration generation challenging. A common workaround is to concatenate short audio segments, but this often leads to inconsistencies due to the lack of shared temporal context. To address this, InfiniteAudio integrates seamlessly into existing pipelines without additional training. It introduces two key techniques: FIFO sampling, a first-in, first-out inference strategy with fixed-size inputs, and curved denoising, which selectively prioritizes key diffusion steps for efficiency. Experiments show that InfiniteAudio achieves comparable or superior performance across all metrics.
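To make the first-in, first-out idea concrete, here is a minimal toy sketch of a queue-based sampler. It is not the authors' implementation: `toy_denoise_step`, `fifo_generate`, the chunk layout, and the warm-up behavior are all assumptions made for illustration, and the "denoiser" is a placeholder that only rescales numbers. The point it tries to show is that the number of chunks held at once stays bounded by the number of denoising steps, no matter how many chunks are generated in total.

```python
import random
from collections import deque


def toy_denoise_step(chunk, steps_done):
    """Hypothetical stand-in for one diffusion denoising step on one chunk.

    A real pipeline would call the text-to-audio model here; this toy
    version just halves every value. `steps_done` marks where a real model
    would receive the chunk's current noise level.
    """
    return [x * 0.5 for x in chunk]


def fifo_generate(total_chunks, num_steps=4, chunk_len=8, seed=0):
    """Generate `total_chunks` chunks with a first-in, first-out queue.

    Each queue entry is [chunk, steps_done]. The head of the queue is the
    oldest (least noisy) chunk, the tail is the newest (pure noise).
    Returns the finished chunks and the peak queue length observed.
    """
    rng = random.Random(seed)
    queue = deque()
    finished = []
    enqueued = 0
    peak = 0
    while len(finished) < total_chunks:
        # Enqueue fresh noise at the tail while chunks are still needed.
        if enqueued < total_chunks:
            noise = [rng.gauss(0.0, 1.0) for _ in range(chunk_len)]
            queue.append([noise, 0])
            enqueued += 1
        peak = max(peak, len(queue))
        # One pass denoises every queued chunk by a single step,
        # each at its own position in the denoising trajectory.
        for entry in queue:
            entry[0] = toy_denoise_step(entry[0], entry[1])
            entry[1] += 1
        # Dequeue the head once it has received all of its steps.
        if queue[0][1] == num_steps:
            finished.append(queue.popleft()[0])
    return finished, peak


chunks, peak_queue = fifo_generate(total_chunks=10, num_steps=4)
print(len(chunks), peak_queue)
```

In this sketch the queue fills gradually during warm-up and drains at the end, which is a simplification; how the actual method initializes its fixed-size input is not described on this page.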

Note

• InfiniteAudio generates text-conditional sound effects.
• All audio samples are produced on a single 12GB GPU without any additional training.
• InfiniteAudio allows for extended audio generation while maintaining a fixed memory footprint.

Table of Contents

A. AudioLDM + InfiniteAudio


Description: "A hammer is hitting a wooden surface."


Description: "Motorcycle engines running and revving as a man talks in the background."


Description: "Speech and insects buzzing."

B. VoiceLDM + InfiniteAudio


Description: "A violin playing a heartfelt melody."


Description: "Birds chirping."


C. Ablation Study on VoiceLDM

Strategies for selecting sampling steps.

  • InfiniteAudio w/ equal timesteps.
    Spectrogram of equal timesteps
  • InfiniteAudio w/ middle-focused timesteps.
    Spectrogram of InfiniteAudio
  • InfiniteAudio w/ last-focused timesteps.
    Spectrogram of InfiniteAudio
  • InfiniteAudio w/ initial-focused timesteps.
    Spectrogram of InfiniteAudio

Description: "A violin playing a heartfelt melody."
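
The four step-selection strategies compared above can be illustrated with a small sketch. This is an assumption-laden toy, not the curved denoising formula from the paper: the function name `select_steps`, the warp functions, and the convention that index 0 is the start of the sampling trajectory are all hypothetical. It only shows one way a fixed budget of steps could be spread evenly or concentrated near the start, middle, or end.

```python
def select_steps(num_selected, total_steps=1000, strategy="equal"):
    """Pick `num_selected` step indices out of `total_steps`.

    Indices run from 0 (start of the trajectory) to total_steps - 1 (end).
    Each strategy warps evenly spaced points in [0, 1] so that the chosen
    steps cluster in a different region. Requires num_selected >= 2.
    The warps are illustrative choices, not taken from the paper.
    """
    warps = {
        "equal": lambda u: u,                            # uniform spacing
        "initial": lambda u: u ** 2,                     # dense near the start
        "middle": lambda u: 0.5 + 4.0 * (u - 0.5) ** 3,  # dense near the middle
        "last": lambda u: 1.0 - (1.0 - u) ** 2,          # dense near the end
    }
    warp = warps[strategy]
    positions = [warp(i / (num_selected - 1)) for i in range(num_selected)]
    return [round(p * (total_steps - 1)) for p in positions]


for name in ("equal", "initial", "middle", "last"):
    print(name, select_steps(9, strategy=name))
```

Every strategy keeps the first and last step, so the variants differ only in where the remaining budget is spent.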