ShortSpeech: Learning Short Discrete Speech Representations for High-Quality and Efficient LM-Based Zero-Shot Text-to-Speech Synthesis
Abstract
Language models (LMs) have demonstrated great potential in speech generation, especially for zero-shot text-to-speech (TTS) synthesis. However, because a speech sequence is longer than its corresponding text, auto-regressive sequence modeling suffers from poor training and inference efficiency, which hinders its development in the speech domain. This work aims to compress speech into a shorter discrete representation to achieve high-quality and efficient LM-based TTS. We first propose SoCodec, a semantic-ordered speech codec that compresses speech into a multi-stream discrete semantic sequence, i.e., each frame consists of multiple semantic tokens, and one utterance-level global acoustic embedding. This multi-stream representation is further constrained to be ordered, so that it can be better predicted recursively along the stream axis. Based on this representation, we apply a delay-prediction LLM to TTS, which predicts the proposed ordered multi-stream sequence using only one auto-regressive model. Finally, we implement such an LM-TTS system with a frameshift of only 240 ms, currently the shortest speech representation for TTS, which nevertheless significantly outperforms baselines in naturalness, speaker similarity, and efficiency. The code and checkpoint of our work are available at: https://github.com/hhguo/shortspeech
1 Introduction
Language models (LMs) have demonstrated great potential in speech generation, especially for zero-shot text-to-speech (TTS) synthesis. However, because a speech sequence is longer than its corresponding text, auto-regressive sequence modeling suffers from poor training and inference efficiency, which hinders its development in the speech domain. This work aims to compress speech into a shorter discrete representation to achieve high-quality and efficient LM-based TTS.
2 Related Work
3 Approach
In this work, the TTS framework is composed of two models: a speech codec and a language-model-based acoustic model, which are introduced in turn below.
3.1 Semantic-Ordered Speech Codec
The speech codec, an essential component of LM-TTS, is responsible for compressing speech signals into discrete speech tokens for the language model and decoding them back into signals with minimal reconstruction loss.
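To make the role of the codec concrete, the following is a minimal sketch, not the SoCodec architecture itself. It shows two things: how the 240 ms frameshift determines the number of frames the language model must predict, and how a generic nearest-neighbour codebook lookup maps continuous frames to discrete tokens and back. The function names, the toy two-dimensional codebook, and the use of a single plain vector quantizer are all assumptions made for illustration; they are not taken from the paper or its released code.

```python
import math


def num_frames(duration_s: float, frameshift_ms: float = 240.0) -> int:
    """Number of codec frames needed to cover an utterance.

    The 240 ms default follows the frameshift stated in the abstract.
    """
    return math.ceil(duration_s * 1000.0 / frameshift_ms)


def quantize(frame, codebook):
    """Index of the codebook vector closest to `frame` (squared L2 distance).

    A generic vector-quantization step, used here only as a stand-in
    for whatever quantizer the real codec employs.
    """
    def sq_dist(index):
        return sum((a - b) ** 2 for a, b in zip(frame, codebook[index]))

    return min(range(len(codebook)), key=sq_dist)


def encode(frames, codebook):
    """Compress continuous frames into a sequence of discrete tokens."""
    return [quantize(frame, codebook) for frame in frames]


def decode(tokens, codebook):
    """Map tokens back to codebook vectors (a lossy reconstruction)."""
    return [codebook[token] for token in tokens]


if __name__ == "__main__":
    # Hypothetical toy codebook and frames, chosen only for illustration.
    codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
    frames = [[0.1, -0.1], [0.9, 1.2], [1.8, 0.1]]

    tokens = encode(frames, codebook)
    print("tokens:", tokens)
    print("reconstruction:", decode(tokens, codebook))
    print("frames for a 10 s utterance:", num_frames(10.0))
```

Under these assumptions, a 10-second utterance is covered by 42 frames, which illustrates why a longer frameshift shortens the sequence the auto-regressive model has to generate. The gap between `frames` and the decoded vectors is the reconstruction loss that a real codec is trained to minimize.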