Table 4. Models and techniques used in the literature to extract features from the textual input of the TTS model, with references to the papers that apply them. The extracted features serve three purposes in ETTS models: inference in reference-based ETTS models when no reference audio is available, input to ETTS models trained on text only, or additional features for the ETTS model.

From: Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resources

| Model/method | Used for: inference without reference audio | Used for: ETTS based on text only | Used for: additional ETTS features |
| --- | --- | --- | --- |
| BERT language model | [35, 44, 46, 50, 59, 73, 129] | [29, 40, 54] | [62, 87, 100, 110, 127, 131, 136, 138] |
| ELECTRA language model | | [125] | |
| ELMo language model | | [83] | |
| RoBERTa language model | | | [21, 70] |
| XLNet language model | [17, 50] | | |
| GPT-3 language model | | [64] | |
| Parsing trees | [129] | | |
| Prosody boundaries in text | | | |
| Constituency trees | | | [131] |
| Sentiment analysis model | | [30] | |
| Stanford Sentiment Parser | | [135] | |
| Syntax-related features (such as POS: part of speech) | | | [127] |
| Word emotion lexicon | | [40] | |
| Term Frequency-Inverse Document Frequency (TF-IDF) | | | [99] |
| Character/phoneme embedding | [20, 33, 37, 44, 47, 48, 63, 71, 72, 91, 94, 95, 96, 103, 111] | | |
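
As a rough illustration of one of the simpler text features listed above, TF-IDF, the sketch below weights each word by its frequency within a sentence and its rarity across a small corpus. It is a minimal example and not the procedure of the cited paper: the function name `tfidf`, the toy `corpus`, and the choice of the unsmoothed variant (length-normalized term frequency times log(N / df)) are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: term frequency normalized by document length,
    multiplied by the unsmoothed idf log(N / df). Other variants exist.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, already tokenized into lowercase words.
corpus = [
    "what a wonderful day".split(),
    "what a terrible day".split(),
    "the day is over".split(),
]
scores = tfidf(corpus)
```

In this toy corpus a word that appears in every sentence ("day") receives zero weight, while a word unique to one sentence ("wonderful") receives the highest weight in that sentence, which is why such weights can hint at the words that carry a sentence's expressive content.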