ShortSpeech: Learning Short Discrete Speech Representations for High-Quality and Efficient LM-Based Zero-Shot Text-to-Speech Synthesis

¹Haohan Guo ²Fenglong Xie ²Kun Xie ²Dake Guo
¹Dongchao Yang ¹Xixin Wu ¹Helen Meng

¹The Chinese University of Hong Kong ²Xiaohongshu Inc.

Abstract

Language models (LMs) have demonstrated great potential in speech generation, especially for zero-shot text-to-speech (TTS) synthesis. However, because a speech sequence is longer than the corresponding text, auto-regressive sequential modeling suffers in both training and inference efficiency, which hinders its development in the speech domain. This work aims to compress speech into a shorter discrete representation to achieve high-quality and efficient LM-based TTS. We first propose SoCodec, a semantic-ordered speech codec that compresses speech into a multi-stream discrete semantic sequence, i.e., each frame consists of multiple semantic tokens, together with one utterance-level global acoustic embedding. This multi-stream representation is further constrained to be ordered, so that it can be better predicted recursively along the stream axis. Based on this representation, we apply a delay-prediction LLM to TTS, which predicts the proposed ordered multi-stream sequence using only one auto-regressive model. Finally, we implement such an LM-TTS system with a frameshift of 240 ms, currently the shortest speech representation for TTS, which nevertheless significantly outperforms baselines in naturalness, speaker similarity, and efficiency.¹

¹The code and checkpoint of our work are available at: https://github.com/hhguo/shortspeech

1 Introduction

2 Related Work

3 Approach

In this work, the TTS framework is composed of two models: a speech codec and a language-model-based acoustic model. They are introduced in turn below.

3.1 Semantic-Ordered Speech Codec

The speech codec, as the essential component of LM-TTS, is responsible for compressing speech signals into discrete speech tokens for language models, and for decoding them back to signals with minimal reconstruction loss.
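To make the effect of the frameshift on sequence length concrete, the following is a minimal sketch of the arithmetic, not the released implementation. Only the 240 ms frameshift comes from the abstract. The function names, the example utterance duration, and the number of streams per frame (`num_streams`) are hypothetical values chosen for illustration.

```python
FRAME_SHIFT_MS = 240  # frameshift stated in the abstract


def num_frames(duration_ms: int, frame_shift_ms: int = FRAME_SHIFT_MS) -> int:
    """Number of codec frames needed to cover an utterance (ceiling division)."""
    return -(-duration_ms // frame_shift_ms)


def num_tokens(duration_ms: int, num_streams: int,
               frame_shift_ms: int = FRAME_SHIFT_MS) -> int:
    """Total semantic tokens when each frame carries `num_streams` tokens.

    `num_streams` is a hypothetical parameter; the actual stream count
    is not given in this section.
    """
    return num_frames(duration_ms, frame_shift_ms) * num_streams


if __name__ == "__main__":
    duration_ms = 12_000  # illustrative 12-second utterance
    print("frames:", num_frames(duration_ms))
    print("tokens with 4 assumed streams:", num_tokens(duration_ms, num_streams=4))
```

Under these assumptions, the number of auto-regressive steps along the time axis depends only on the frame count, while the stream count determines how many tokens must be predicted per frame.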

3.1.1 Model Architecture

3.1.2 Ordered Product Quantization

3.1.3 Loss Function

3.2 ShortLLM

3.2.1 Chain-of-Thought Generation

3.2.2 Delay Prediction

3.2.3 Loss Function

4 Experiments

4.1 Experimental Protocol

4.1.1 Datasets

4.1.2 Model Training and Inference

4.1.3 Evaluation Metrics

4.2 System Comparison

4.2.1 TTS Quality

4.2.2 Inference Efficiency

4.3 Speech Codecs

4.4 Multi-Codebook Vector Quantization

4.5 Multi-Stream LLM

5 Conclusions