Table 9 Subjective evaluation metrics for expressive speech synthesis models

From: Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resources

Metric

Description

Mean Opinion Score (MOS)

Listeners score the quality (naturalness and intelligibility) of synthesized speech on a five-point scale.

Comparison Mean Opinion Score (CMOS)

Compares MOS values between the models under test and the baseline by comparing ground-truth and synthetic samples from each model.

Differential Mean Opinion Score (DMOS)

Listeners score samples from one to five based on their similarity to a specific emotion or style.

AB preference test

Listeners hear the same sentence synthesized by two models and select the one that better fulfills the given condition.

ABX preference test

Listeners hear three samples, A, B, and X, where X represents the target speech, and select whichever of A and B is closer to the target speech.

MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA)

Listeners are presented with a mix of samples that includes synthesized samples, natural speech samples (the reference), and total-loss samples (the anchor). Listeners score each sample from 0 to 100 in a double-blind listening test.
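
As an illustration of how the listener responses described in this table might be aggregated, the following is a minimal Python sketch, not a procedure taken from the review. The function names `mean_opinion_score` and `ab_preference` are hypothetical, and the 95% confidence interval uses a normal approximation (z = 1.96), which is an assumption; small listener panels may call for a t-distribution instead.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mean, 95% CI half-width) for five-point listener ratings."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate spread")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the five-point scale (1 to 5)")
    mean = statistics.mean(ratings)
    # Assumption: normal approximation with z = 1.96 for a 95% interval.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


def ab_preference(choices):
    """Return the share of votes for 'A' and 'B' in an AB preference test."""
    if not choices:
        raise ValueError("no listener choices given")
    total = len(choices)
    return {label: choices.count(label) / total for label in ("A", "B")}


if __name__ == "__main__":
    # Made-up ratings and votes, for illustration only.
    mos, ci = mean_opinion_score([4, 5, 3, 4, 4])
    print(f"MOS = {mos:.2f} +/- {ci:.2f}")
    print(ab_preference(["A", "B", "A", "A"]))
```

The same mean-and-interval aggregation would apply to DMOS or MUSHRA scores after adjusting the range check to the relevant scale (for MUSHRA, 0 to 100).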