[Chart: Audio-Text Alignment 10.8% · Audio Quality 10.7% · Generation Quality 16.5%]
The goal of this paper is to enhance Text-to-Audio generation at inference, focusing on generating realistic audio that precisely aligns with text prompts. Despite rapid advancements, existing models often fail to achieve a reliable balance between perceptual quality and textual alignment. To address this, we adopt Inference-Time Scaling, a training-free method that improves performance by increasing inference computation. We establish its previously unexplored application to audio generation and propose a novel multi-reward guidance that gives equal importance to each component essential to perception. By normalizing each reward value into a common scale and combining them through a weighted summation, the method not only enforces stable guidance but also enables explicit control over desired aspects. Moreover, we introduce a new audio-text alignment metric based on an audio language model for more robust evaluation. Empirically, our method improves both semantic alignment and perceptual quality, significantly outperforming naive generation and existing reward guidance techniques.
Note: All audio samples are generated using the EZAudio large model.
SCORE provides explicit control over the generation process through weight control.
The generation favors audio quality when the quality reward (PQ) weight is high, and favors audio-text alignment when the text alignment reward (CLAP) weight is high. Setting both weights equal results in a balanced generation.
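The normalize-then-weight scheme described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the candidate values, and the (mean, std) statistics are all hypothetical; only the reward names PQ and CLAP and the idea of z-score-style normalization followed by a weighted sum come from the text.

```python
def combine_rewards(reward_values: dict[str, float],
                    stats: dict[str, tuple[float, float]],
                    weights: dict[str, float]) -> float:
    """Normalize each reward to a common scale, then take a weighted sum.

    `stats` maps each reward name to hypothetical (mean, std) values,
    e.g. estimated over a batch of candidate generations.
    """
    score = 0.0
    for name, value in reward_values.items():
        mean, std = stats[name]
        z = (value - mean) / (std + 1e-8)  # z-score normalization to a common scale
        score += weights[name] * z          # weighted summation of normalized rewards
    return score

# Illustrative candidate generations, each scored by two rewards.
candidates = [
    {"PQ": 7.2, "CLAP": 0.31},  # higher perceptual quality
    {"PQ": 6.8, "CLAP": 0.52},  # higher text alignment
]
stats = {"PQ": (7.0, 0.2), "CLAP": (0.38, 0.07)}  # hypothetical batch statistics

# Raising the CLAP weight steers selection toward the better-aligned candidate.
alignment_weights = {"PQ": 0.5, "CLAP": 0.5}
best = max(candidates, key=lambda r: combine_rewards(r, stats, alignment_weights))

# Raising the PQ weight instead steers selection toward the higher-quality one.
quality_weights = {"PQ": 0.9, "CLAP": 0.1}
best_quality = max(candidates, key=lambda r: combine_rewards(r, stats, quality_weights))
```

Because both rewards live on very different raw scales (PQ scores vs. CLAP similarities), normalizing first keeps either one from dominating the sum, which is what makes the explicit weight control stable.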