The goal of this paper is to accelerate codec-based
speech synthesis systems without compromising speech quality.
We propose an enhanced inference method that enables flexible
trade-offs between speed and quality at inference time without
requiring additional training.
Our core idea is to predict multiple
tokens per inference step of the autoregressive (AR) module using
multiple prediction heads, yielding a linear reduction in synthesis
time as the number of heads increases. Furthermore, we introduce a
novel speculative decoding technique that utilises a Viterbi-based
algorithm to select the optimal sequence of generated tokens at
each decoding step.
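To illustrate the idea, the following is a minimal sketch of Viterbi selection over multi-head candidates. It assumes each of the H heads emits log-probabilities over a candidate vocabulary and that a pairwise transition score between consecutive tokens is available; the function names and score shapes are illustrative, not the paper's actual implementation.

```python
import numpy as np

def viterbi_select(head_logits, transition):
    """Select the highest-scoring token sequence across H prediction heads.

    head_logits: (H, V) per-head log-probabilities over V candidate tokens
    transition:  (V, V) log-score of token j following token i
    Returns the length-H index sequence maximizing the joint score.
    """
    H, V = head_logits.shape
    score = head_logits[0].copy()          # best score ending in each token at head 0
    back = np.zeros((H, V), dtype=int)     # backpointers for traceback
    for h in range(1, H):
        # score of extending each previous token i with token j at head h
        cand = score[:, None] + transition + head_logits[h][None, :]
        back[h] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    # backtrack from the best final token to recover the full sequence
    seq = [int(score.argmax())]
    for h in range(H - 1, 0, -1):
        seq.append(int(back[h, seq[-1]]))
    return seq[::-1]
```

With H heads this selects one token per head in a single pass, which is what allows H tokens to be committed per AR step while still scoring them jointly.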
In our experiments, we demonstrate that the
time required to predict each token is reduced by a factor of 4
to 5 compared to baseline models, with minimal quality trade-off,
and in some cases improved speech intelligibility.