Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resources

EURASIP Journal on Audio, Speech, and Music Processing

Table 9 Subjective evaluation metrics for expressive speech synthesis models

Metric	Description
Mean Opinion Score (MOS)	Listeners score the quality (naturalness and intelligibility) of synthesized speech on a five-point scale.
Comparison Mean Opinion Score (CMOS)	Compares the MOS values of the models under test with those of the baseline by comparing ground-truth and synthetic samples from each model.
Differential Mean Opinion Score (DMOS)	Listeners score samples from one to five based on their similarity to a specific emotion or style.
AB preference test	Listeners compare the same sentence synthesized by two models and select the one that better fulfills the given condition.
ABX preference test	Listeners hear three samples, A, B, and X, where X represents the target speech, and select whichever of A and B is closer to the target speech.
MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA)	Listeners are presented with a mix of samples: synthesized samples, natural speech samples (the reference), and a total-loss sample (the anchor). Listeners score each sample from 0 to 100 in a double-blind listening test.
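
As a rough illustration of how two of the metrics in Table 9 might be aggregated, the following minimal sketch computes a MOS from five-point ratings and the vote shares of an AB preference test. The function names, the ratings, and the votes are hypothetical. The 95% confidence interval (normal approximation) is an assumption about common reporting practice and is not part of the table's definitions.

```python
import math
import statistics
from collections import Counter


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) for five-point ratings.

    half_width is a normal-approximation confidence half-width
    (z = 1.96 corresponds to roughly 95%); this is an assumed
    reporting convention, not something Table 9 prescribes.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the five-point scale")
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


def preference_rates(choices):
    """Return the share of votes received by each option in an AB test."""
    if not choices:
        raise ValueError("need at least one vote")
    counts = Counter(choices)
    return {option: counts[option] / len(choices) for option in sorted(counts)}


# Hypothetical listener ratings for one synthesized system.
ratings = [4, 5, 3, 4, 4, 5, 4, 3]
mos, ci = mos_with_ci(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")

# Hypothetical AB votes; "no preference" is an optional third choice.
votes = ["A", "B", "A", "A", "no preference", "B", "A", "A"]
print(preference_rates(votes))
```

In practice, ratings are usually collected per listener and per utterance, so a fuller analysis would account for that grouping; the sketch pools all ratings for simplicity.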