From: Multi-task deep cross-attention networks for far-field speaker verification and keyword spotting
| Condition | Model | Seen: EER (%) | Seen: minDCF | Seen: Acc (%) | Unseen: EER (%) | Unseen: minDCF | Unseen: Acc (%) |
|---|---|---|---|---|---|---|---|
| Without noise | Convmixer | - | - | 96.25 | - | - | 94.44 |
| Without noise | ECAPA-TDNN | 0.17 | 0.009 | - | 1.88 | 0.149 | - |
| Without noise | Baseline | 0.18 | 0.010 | 96.16 | 1.83 | 0.125 | 94.77 |
| Without noise | +SA | 0.18 | 0.007 | 96.45 | 1.69 | 0.113 | 95.15 |
| Without noise | +SE | 0.14 | 0.006 | 96.56 | 1.68 | 0.112 | 95.06 |
| Without noise | +DCA | 0.11 | 0.006 | 98.37 | 1.55 | 0.110 | 96.57 |
| With noise | Convmixer | - | - | 91.96 | - | - | 90.36 |
| With noise | ECAPA-TDNN | 1.62 | 0.077 | - | 4.18 | 0.223 | - |
| With noise | Baseline | 1.64 | 0.078 | 92.08 | 4.16 | 0.214 | 90.61 |
| With noise | +SA | 1.59 | 0.079 | 92.21 | 4.10 | 0.209 | 90.91 |
| With noise | +SE | 1.58 | 0.069 | 92.89 | 4.07 | 0.208 | 91.11 |
| With noise | +DCA | 1.27 | 0.059 | 95.61 | 3.98 | 0.200 | 93.08 |