Metric | Description |
---|---|
Mel-Cepstral Distortion (MCD) | Sums the squared differences between the Mel-frequency cepstral coefficients (MFCCs) of the ground-truth and synthesized samples. |
Gross Pitch Error (GPE) | Calculates the percentage of voiced frames whose pitch deviates by more than 20% from the ground-truth samples. |
Voicing Decision Error (VDE) | Measures the mismatch in voiced/unvoiced decisions between the ground-truth and synthesized samples. |
F0 Frame Error (FFE) | Combines GPE and VDE by measuring the percentage of frames that contain either a pitch error of more than 20% (GPE) or a voicing decision error (VDE) between the ground-truth and synthesized samples. |
Word Error Rate (WER) | Measures the word error rate of the transcription of the synthesized speech with respect to the input text. Public automatic speech recognition (ASR) models are used to transcribe the synthesized speech. |
Band APeriodicity Distortion (BAPD) | Measures the distortion between the linearly spaced band aperiodicity coefficients of the ground-truth and synthesized samples. |
Root Mean Square Error (RMSE) | Measures the root mean square error of the F0 or energy of the synthesized samples with respect to the ground truth. |
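
As a rough illustration of how the three F0-based metrics in the table relate to one another, the sketch below computes GPE, VDE, and FFE from two frame-aligned F0 tracks. It is a minimal sketch under stated assumptions, not a reference implementation: the function name `f0_frame_metrics` is hypothetical, unvoiced frames are assumed to be encoded as `0`, GPE is assumed to be normalized by the frames voiced in both tracks, and VDE and FFE by the total number of frames. Actual evaluation toolkits may differ on these points.

```python
def f0_frame_metrics(f0_ref, f0_syn, tol=0.2):
    """Return GPE, VDE and FFE as fractions for two frame-aligned F0 tracks.

    Assumptions (not taken from the table): an F0 value of 0 marks an
    unvoiced frame, GPE is normalized by frames voiced in both tracks,
    and VDE/FFE are normalized by the total frame count.
    """
    if len(f0_ref) != len(f0_syn) or len(f0_ref) == 0:
        raise ValueError("F0 tracks must be non-empty and of equal length")

    both_voiced = pitch_errors = voicing_errors = 0
    for ref, syn in zip(f0_ref, f0_syn):
        ref_voiced, syn_voiced = ref > 0, syn > 0
        if ref_voiced != syn_voiced:
            # Voiced/unvoiced decision differs between the two tracks.
            voicing_errors += 1
        elif ref_voiced:
            both_voiced += 1
            # Gross pitch error: relative deviation above the tolerance.
            if abs(syn - ref) > tol * ref:
                pitch_errors += 1

    n = len(f0_ref)
    return {
        "gpe": pitch_errors / both_voiced if both_voiced else 0.0,
        "vde": voicing_errors / n,
        "ffe": (pitch_errors + voicing_errors) / n,
    }


# Toy example: one gross pitch error (frame 2) and two voicing errors.
metrics = f0_frame_metrics(
    [100.0, 100.0, 0.0, 200.0, 0.0],
    [100.0, 130.0, 0.0, 0.0, 150.0],
)
```

In this toy input, FFE counts all three erroneous frames over the five total frames, which shows why it is always at least as large as VDE under these normalization choices.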