SPADE: Structured Pruning and Adaptive Distillation for Efficient LLM-TTS

¹Korea Advanced Institute of Science and Technology (KAIST), South Korea   ²42dot Inc., South Korea
†Corresponding author

Key Results

50% Model Depth Reduction

20% VRAM Usage Reduction

<5% Training Data Required

Abstract

The goal of this paper is to introduce SPADE, a framework for Structured Pruning and Adaptive Distillation for Efficient Large Language Model-based text-to-speech (LLM-TTS). Recent LLM-TTS systems achieve strong controllability and zero-shot generalization, but their large parameter counts and high latency limit real-world deployment. SPADE addresses this by combining (i) a pruning step guided by a word-error-rate-based layer importance index to remove non-essential Transformer layers, with (ii) multi-level knowledge distillation to restore autoregressive coherence. On zero-shot benchmarks, SPADE preserves near-parity perceptual quality while halving Transformer depth, reducing VRAM usage by up to 20%, and achieving up to 1.7× faster real-time factor with less than 5% of the original training data. These results show that compact LLM-TTS models can maintain naturalness and speaker similarity while enabling practical real-time speech generation.
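The abstract describes a pruning step that ranks Transformer layers by a word-error-rate-based importance index and removes the least essential ones. A minimal sketch of that idea is below; the function names, the exact scoring formula, and the toy numbers are assumptions for illustration, not the paper's implementation. The importance of a layer is taken as the WER degradation observed when that layer is ablated, and the layers whose removal barely affects WER are pruned first.

```python
# Hypothetical sketch of WER-based layer importance pruning.
# Assumption: importance(layer i) = WER with layer i ablated - baseline WER.
# SPADE's actual importance index may be defined differently.

def layer_importance(wer_with_layer_ablated, baseline_wer):
    """How much WER degrades when a single layer is removed."""
    return wer_with_layer_ablated - baseline_wer

def select_layers_to_prune(ablation_wers, baseline_wer, keep_ratio=0.5):
    """Rank layers by importance and return the indices of the least
    important layers to remove, keeping `keep_ratio` of the depth
    (keep_ratio=0.5 mirrors the 50% depth reduction reported above)."""
    importances = [
        (layer_idx, layer_importance(w, baseline_wer))
        for layer_idx, w in enumerate(ablation_wers)
    ]
    importances.sort(key=lambda pair: pair[1])  # least important first
    n_prune = len(ablation_wers) - int(len(ablation_wers) * keep_ratio)
    return sorted(idx for idx, _ in importances[:n_prune])

# Toy example: 8 layers, per-layer ablation WERs measured on a held-out set.
baseline = 0.05
ablation = [0.30, 0.06, 0.07, 0.25, 0.055, 0.20, 0.065, 0.28]
pruned = select_layers_to_prune(ablation, baseline, keep_ratio=0.5)
print(pruned)  # → [1, 2, 4, 6]
```

After pruning, the remaining layers would be fine-tuned with the multi-level knowledge distillation described above to restore autoregressive coherence.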

Method Overview


Audio Demos

Note: These audio samples are provided for demonstration purposes. All audio has been loudness-normalized for fair evaluation in the MOS survey.

Sample | Ground Truth (Seed TTS) / Reference Voice (LibriTTS) | CosyVoice 2 | CosyVoice 2 Lite (Ours) | LLaSA | LLaSA Lite (Ours)

BibTeX