Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resources

EURASIP Journal on Audio, Speech, and Music Processing

Table 4 Models and techniques applied in the literature for extracting features from the textual input of the TTS model, with references to the papers in which they are applied. The extracted features are used in ETTS models for three purposes: inference in reference-based ETTS models when reference audio is unavailable, input to ETTS models trained on text only, or additional features for the ETTS model

Model/method	Utilized for: inference without reference audio	Utilized for: ETTS based on text only	Utilized for: additional ETTS features
BERT language model	[35, 44, 46, 50, 59, 73, 129]	[29, 40, 54]	[62, 87, 100, 110, 127, 131, 136, 138]
ELECTRA language model		[125]
ELMo language model		[83]
RoBERTa language model			[21, 70]
XLNet language model	[17, 50]
GPT-3 language model		[64]
Parsing trees	[129]
Prosody boundaries in text
Constituency trees			[131]
Sentiment analysis model		[30]
Stanford Sentiment Parser		[135]
Syntax-related features (such as part-of-speech (POS) tags)			[127]
Word emotion lexicon		[40]
Term Frequency-Inverse Document Frequency (TF-IDF)			[99]
Character/phoneme embedding	[20, 33, 37, 44, 47, 48, 63, 71, 72, 91, 94, 95, 96, 103, 111]