Kangdi Mei, Zhaoci Liu, Huipeng Du, Hengyu Li, Yang Ai, Liping Chen, Zhenhua Ling

National Engineering Research Center of Speech and Language Information Processing,
University of Science and Technology of China, Hefei, P.R. China


Conversational speech synthesis aims to synthesize the speech of an individual speaker conditioned on the conversation history. However, most studies in conversational speech synthesis focus only on the synthesis performance of the current speaker’s turn and neglect the temporal relationship between the turns of interlocutors. We therefore model the temporal connection between turns, which is crucial for the naturalness and coherence of conversations. Specifically, we focus on the case where turns do not overlap and only one history turn is considered. For this task, we propose an acoustic model that leverages multi-modal information (text and speech) from the previous turn to predict the acoustic features of both the current turn and the inter-turn gap. The model is built on MQTTS and incorporates a global acoustic representation and a BERT-based local semantic representation of the previous turn when predicting the acoustic features of each frame. Experimental results demonstrate that, with the introduction of global acoustic information and local semantic information, our model achieves better performance in both the temporal connection between turns and the content synthesis of the current turn.
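
The per-frame conditioning described above can be illustrated with a toy sketch. This is not the model's actual implementation: the page does not say how the two representations are combined, so the dot-product attention over previous-turn token embeddings, the fusion by addition, and all names (`attend`, `condition_frame`, the toy vectors) are assumptions made purely for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, token_embeddings):
    """Scaled dot-product attention of one frame query over previous-turn tokens.

    Assumption: the local semantic representation is a sequence of token
    embeddings (e.g. from BERT) with the same dimension as the query.
    """
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, token)) / math.sqrt(dim)
        for token in token_embeddings
    ]
    weights = softmax(scores)
    context = [
        sum(w * token[i] for w, token in zip(weights, token_embeddings))
        for i in range(dim)
    ]
    return context, weights


def condition_frame(frame_state, global_acoustic, token_embeddings):
    """Fuse one frame state with both previous-turn representations.

    Assumption: fusion is a plain element-wise sum; the real model may differ.
    """
    local_context, weights = attend(frame_state, token_embeddings)
    fused = [h + g + c for h, g, c in zip(frame_state, global_acoustic, local_context)]
    return fused, weights


# Toy inputs (hypothetical values, 4-dimensional for readability).
prev_turn_tokens = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
prev_turn_global = [0.5, 0.5, 0.5, 0.5]  # one vector for the whole previous turn
frame_state = [2.0, 0.0, 0.0, 0.0]       # decoder state for a single frame

fused, weights = condition_frame(frame_state, prev_turn_global, prev_turn_tokens)
```

In this sketch the global vector is identical for every frame, while the attention weights change with each frame state, which is one way to read the "global" versus "local" distinction in the abstract.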


Model Architecture

Fig. 1 Overview of our proposed model.

Evaluation of different models

GT: reconstructed from ground truth.
M1: baseline model.
M2: baseline model with global acoustic information.
M3: baseline model with local semantic information.
M4: baseline model with global acoustic information and local semantic information.
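
The four systems above differ only in which previous-turn information they receive, so they can be summarized as two on/off switches. The sketch below is just a compact restatement of that list; `SystemConfig`, `SYSTEMS`, and `describe` are hypothetical names and do not come from the authors' code. GT is omitted because it is a reference reconstruction rather than a model variant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemConfig:
    """Which previous-turn information a system uses (illustrative only)."""
    use_global_acoustic: bool
    use_local_semantic: bool


# Mapping taken directly from the system list on this page.
SYSTEMS = {
    "M1": SystemConfig(use_global_acoustic=False, use_local_semantic=False),
    "M2": SystemConfig(use_global_acoustic=True, use_local_semantic=False),
    "M3": SystemConfig(use_global_acoustic=False, use_local_semantic=True),
    "M4": SystemConfig(use_global_acoustic=True, use_local_semantic=True),
}


def describe(name):
    """Return the textual description of a system from its switches."""
    config = SYSTEMS[name]
    extras = []
    if config.use_global_acoustic:
        extras.append("global acoustic information")
    if config.use_local_semantic:
        extras.append("local semantic information")
    if not extras:
        return "baseline model"
    return "baseline model with " + " and ".join(extras)


descriptions = {name: describe(name) for name in SYSTEMS}
```

Read this way, M2 and M3 each isolate one information source, and M4 combines both.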

Sample 1

A: w- they can be blocked they can be blocked.
B: well or they can be blocked indeed i i don't i don't know how that works if uh if if if parents can block computers from kids /p/ or or from porn sites then that's great.
*/p/ means intra-speaker pause.*

GT M1 M2 M3 M4

Sample 2

A: um so we're supposed to talk about sports is that correct /p/ okay tv sports um i don't spend many hours a week watching it but i like basketball.
B: well that's about like i am i only watch football and it's only when the kansas city chiefs are playing.
*/p/ means intra-speaker pause.*

GT M1 M2 M3 M4

Sample 3

A: yeah that's the same with me.
B: right yeah so you relax and then you sit and relaxes and /p/ the germs have uh /p/ a way to play with your body.
*/p/ means intra-speaker pause.*

GT M1 M2 M3 M4

Sample 4

A: but you definitely wouldn't miss the food.
B: no guess i guess not maybe um /p/ you know not the american food maybe like indian food and.
*/p/ means intra-speaker pause.*

GT M1 M2 M3 M4

Sample 5

A: um m- may i ask what ah class you teach.
B: most of my classes ah at this point have been geared towards mathematics.

GT M1 M2 M3 M4

Sample 6

A: um do you go jogging?
B: um actually i work out three times a week.

GT M1 M2 M3 M4

Sample 7

A: gives -em /p/ not a very good outlook on everything.
B: no /p/ kind of makes you think quite alone.
*/p/ means intra-speaker pause.*

GT M1 M2 M3 M4