SCORE: Scaling audio generation
using Standardized COmposite REwards

1 Korea Advanced Institute of Science and Technology (KAIST)
2 ByteDance Seed
*Equal contribution

Key Results

10.8% Audio-Text Alignment
10.7% Audio Quality
16.5% Generation Quality

Abstract

The goal of this paper is to enhance Text-to-Audio generation at inference time, focusing on generating realistic audio that precisely aligns with text prompts. Despite rapid advances, existing models often fail to achieve a reliable balance between perceptual quality and textual alignment. To address this, we adopt Inference-Time Scaling, a training-free method that improves performance by increasing inference computation. We establish its previously unexplored application to audio generation and propose a novel multi-reward guidance that gives equal significance to each component essential to perception. By normalizing each reward value to a common scale and combining them with a weighted summation, the method not only enforces stable guidance but also enables explicit control over the desired aspects. Moreover, we introduce a new audio-text alignment metric using an audio language model for more robust evaluation. Empirically, our method improves both semantic alignment and perceptual quality, significantly outperforming naive generation and existing reward guidance techniques.
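The normalize-then-sum idea from the abstract can be illustrated with a minimal sketch. It assumes z-score standardization across a batch of candidates and a simple best-of-N selection; the exact normalization and search procedure used by SCORE are not given on this page, so the function names (`standardize`, `composite_scores`) and all reward values below are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score a list of raw reward values (assumed normalization).

    A list with zero spread carries no ranking information, so it maps to zeros.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def composite_scores(rewards, weights):
    """Weighted sum of standardized rewards, one score per candidate.

    rewards: dict mapping reward name -> list of raw scores (one per candidate)
    weights: dict mapping reward name -> weight
    """
    n = len(next(iter(rewards.values())))
    z = {name: standardize(vals) for name, vals in rewards.items()}
    return [sum(weights[name] * z[name][i] for name in rewards) for i in range(n)]


def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)


# Toy raw rewards for three candidates (hypothetical values).
# The two rewards live on very different scales.
rewards = {
    "quality": [6.0, 7.0, 8.0],
    "alignment": [0.30, 0.50, 0.10],
}
weights = {"quality": 0.5, "alignment": 0.5}

# Without normalization, the larger-scale reward dominates the sum.
raw = [
    sum(weights[k] * rewards[k][i] for k in rewards)
    for i in range(3)
]
raw_best = argmax(raw)

# With standardization, both rewards contribute on a common scale.
scores = composite_scores(rewards, weights)
best = argmax(scores)

print("raw sum picks candidate", raw_best)
print("standardized sum picks candidate", best)
```

In this toy case the raw sum picks the candidate with the highest quality score even though its alignment is the worst, while the standardized sum picks the candidate that does reasonably well on both.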

Method Overview

[Figure: Method Overview 1]
[Figure: Method Overview 4]

Audio samples

Note: All audio samples are generated using the EZAudio large model.

AudioCaps Demo

Weight Control Examples

SCORE provides explicit control over the generation process through weight control.

The generation favors audio quality when the quality reward (PQ) weight is high, and favors audio-text alignment when the text alignment reward (CLAP) weight is high. Setting equal weights for both rewards results in a balanced generation.
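As a rough illustration of how the weights steer the outcome, the sketch below sweeps the three weight settings used in the samples on this page over a set of candidates. It assumes the PQ and CLAP rewards have already been brought to a common scale and that the final output is the highest-scoring candidate; the values and the `select` helper are hypothetical, and how SCORE applies the weights during generation is not described here.

```python
# Hypothetical rewards for four candidates, already on a common scale.
pq_z = [1.2, 0.4, -0.5, -1.1]    # quality reward (PQ)
clap_z = [-1.1, 0.5, 1.1, -0.5]  # text alignment reward (CLAP)


def select(w_quality):
    """Return the index of the candidate with the highest weighted score.

    The alignment weight is taken as 1 - w_quality so the two weights sum to 1.
    """
    w_align = 1.0 - w_quality
    scores = [w_quality * q + w_align * c for q, c in zip(pq_z, clap_z)]
    return max(range(len(scores)), key=scores.__getitem__)


for w in (0.75, 0.50, 0.25):
    pct_q = round(w * 100)
    print(f"{pct_q}% Quality + {100 - pct_q}% Text Alignment -> candidate {select(w)}")
```

With these toy numbers, the quality-heavy setting selects the candidate with the best PQ score, the alignment-heavy setting selects the one with the best CLAP score, and equal weights select a candidate that is moderately good on both.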

Sample 00208: Ducks quacking as birds chirp followed by a flock of ducks quacking.
75% Quality + 25% Text Alignment
00208 25% Mel Spectrogram
50% Quality + 50% Text Alignment
00208 50% Mel Spectrogram
25% Quality + 75% Text Alignment
00208 75% Mel Spectrogram
Sample 00575: Large church bells ring as rain falls on a hard surface and wind blows lightly into a microphone.
75% Quality + 25% Text Alignment
00575 25% Mel Spectrogram
50% Quality + 50% Text Alignment
00575 50% Mel Spectrogram
25% Quality + 75% Text Alignment
00575 75% Mel Spectrogram
Sample 00888: High frequency humming followed by wind blowing.
75% Quality + 25% Text Alignment
00888 25% Mel Spectrogram
50% Quality + 50% Text Alignment
00888 50% Mel Spectrogram
25% Quality + 75% Text Alignment
00888 75% Mel Spectrogram
