Table 4 Models and techniques used in the literature to extract features from the textual input of the TTS model, with citations of the papers in which they are applied. The extracted features serve three purposes in ETTS models: inference in reference-based ETTS models when no reference audio is available, input to ETTS models trained on text only, and additional features for the ETTS model

From: Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resources


Utilized for:

Inference without reference audio

ETTS based on text only

Additional ETTS features

BERT language model

[35, 44, 46, 50, 59, 73, 129]

[29, 40, 54]

[62, 87, 100, 110, 127, 131, 136, 138]

ELECTRA language model




ELMo language model




RoBERTa language model


[21, 70]

XLNet language model

[17, 50]


GPT-3 language model




Parsing trees



Prosody boundaries in text


Constituency trees



Sentiment analysis model




Stanford Sentiment Parser




Syntax-related features (such as part of speech, POS)



Word emotion lexicon




Term Frequency-Inverse Document Frequency (TF-IDF)



Character/phoneme embedding

[20, 33, 37, 44, 47, 48, 63, 71, 72, 91, 94, 95, 96, 103, 111]