Sparse coding of the modulation spectrum for noise-robust automatic speech recognition

EURASIP Journal on Audio, Speech, and Music Processing

Table 1 Accuracy for five systems on noise type 1 (subway noise) of test set A

System	Clean	20 dB	15 dB	10 dB	5 dB	0 dB	−5 dB
Sys1 (single frame)	90.51	91.00	89.53	87.69	83.76	76.76	65.31
Sys2 (single frame, LDA transformed)	89.19	89.62	87.57	83.54	76.51	62.57	36.91
Sys3 (29 frames, LDA transformed)	87.50	88.70	87.41	85.42	77.62	59.41	27.85
Sys4 (9 bands - GA)	89.71	90.57	89.28	87.41	84.13	77.71	63.83
Sparse coding [24], 5-frame exemplars	93.12	90.18	87.22	82.62	72.64	56.31	34.57
Sparse coding [24], 30-frame exemplars	93.21	91.86	91.53	89.62	87.47	80.01	61.61

Sys1, 135-D vectors; Sys2, LDA-transformed 135-D vectors of Sys1; Sys3, LDA-transformed 29 × 135-D vectors of 29 consecutive frames; Sys4, Sys1 plus nine recognizers operating on 15-D vectors, with weights obtained from a genetic algorithm (GA). The bottom two rows give recognition results for noise type 1 obtained with the sparse coding approach [20],[24] using 5-frame and 30-frame windows; they are included for comparison.
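
To make the dimensionality of the Sys3 input concrete, the following is a minimal sketch of stacking 29 consecutive 135-D frames into a single vector before a transform such as LDA is applied. The function name `stack_frames`, the toy utterance, and the choice to emit only windows that fit entirely inside the utterance (no padding at the edges) are assumptions made for illustration; the source does not state how utterance boundaries are handled.

```python
from typing import List, Sequence


def stack_frames(frames: Sequence[Sequence[float]], context: int = 29) -> List[List[float]]:
    """Concatenate each run of `context` consecutive frames into one long vector.

    Hypothetical helper: only windows lying entirely inside the utterance
    are produced, so no padding is applied at the edges (an assumption).
    """
    if context < 1:
        raise ValueError("context must be at least 1")
    stacked = []
    for start in range(len(frames) - context + 1):
        window = frames[start:start + context]
        # Flatten the window frame by frame into a single vector.
        stacked.append([value for frame in window for value in frame])
    return stacked


# Toy utterance: 40 frames of 135-D features (frame t is filled with the value t).
utterance = [[float(t)] * 135 for t in range(40)]
vectors = stack_frames(utterance, context=29)
print(len(vectors), len(vectors[0]))  # 12 3915
```

Each stacked vector has 29 × 135 = 3915 dimensions, which gives a sense of the input size that the LDA transform in Sys3 has to reduce.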