[Chart: Audio-Text Alignment 10.8% · Audio Quality 10.7% · Generation Quality 16.5%]
The goal of this paper is to enhance Text-to-Audio generation at inference, focusing on generating realistic audio that precisely aligns with text prompts. Despite rapid advancements, existing models often fail to achieve a reliable balance between perceptual quality and textual alignment. To address this, we adopt Inference-Time Scaling, a training-free method that improves performance by increasing inference computation. We establish its previously unexplored application to audio generation and propose a novel multi-reward guidance that gives equal importance to each component essential to perception. By normalizing each reward value into a common scale and combining them through a weighted summation, the method not only enforces stable guidance but also enables explicit control over desired aspects. Moreover, we introduce a new audio-text alignment metric based on an audio language model for more robust evaluation. Empirically, our method improves both semantic alignment and perceptual quality, significantly outperforming naive generation and existing reward guidance techniques.
Note: All audio samples are generated using the EZAudio large model.
SCORE provides explicit control over the generation process through weight control.
The generation favors audio quality when the quality reward (PQ) weight is high, and favors audio-text alignment when the text alignment reward (CLAP) weight is high. Setting both weights equal results in a balanced generation.
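The normalize-then-weight scheme described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the candidate values, and the (mean, std) statistics are all hypothetical; only the reward names PQ and CLAP and the idea of z-score-style normalization followed by a weighted sum come from the text.

```python
def combine_rewards(reward_values: dict[str, float],
                    stats: dict[str, tuple[float, float]],
                    weights: dict[str, float]) -> float:
    """Normalize each reward to a common scale, then take a weighted sum.

    `stats` maps each reward name to hypothetical (mean, std) values,
    e.g. estimated over a batch of candidate generations.
    """
    score = 0.0
    for name, value in reward_values.items():
        mean, std = stats[name]
        z = (value - mean) / (std + 1e-8)  # z-score normalization to a common scale
        score += weights[name] * z          # weighted summation of normalized rewards
    return score

# Illustrative candidate generations, each scored by two rewards.
candidates = [
    {"PQ": 7.2, "CLAP": 0.31},  # higher perceptual quality
    {"PQ": 6.8, "CLAP": 0.52},  # higher text alignment
]
stats = {"PQ": (7.0, 0.2), "CLAP": (0.38, 0.07)}  # hypothetical batch statistics

# Raising the CLAP weight steers selection toward the better-aligned candidate.
alignment_weights = {"PQ": 0.5, "CLAP": 0.5}
best = max(candidates, key=lambda r: combine_rewards(r, stats, alignment_weights))

# Raising the PQ weight instead steers selection toward the higher-quality one.
quality_weights = {"PQ": 0.9, "CLAP": 0.1}
best_quality = max(candidates, key=lambda r: combine_rewards(r, stats, quality_weights))
```

Because both rewards live on very different raw scales (PQ scores vs. CLAP similarities), normalizing first keeps either one from dominating the sum, which is what makes the explicit weight control stable.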