From: Channel and temporal-frequency attention UNet for monaural speech enhancement
Model | #Params | WB-PESQ (with reverb) | NB-PESQ (with reverb) | STOI (%) (with reverb) | SI-SDR (dB) (with reverb) | WB-PESQ (without reverb) | NB-PESQ (without reverb) | STOI (%) (without reverb) | SI-SDR (dB) (without reverb) |
---|---|---|---|---|---|---|---|---|---|
Noisy | - | 1.822 | 2.753 | 86.62 | 9.03 | 1.582 | 2.454 | 91.52 | 9.07 |
+ISA | 6.5M | \(2.525\pm 0.089\) | \(3.290\pm 0.042\) | \(89.36\pm 0.19\) | \(13.38\pm 0.50\) | \(2.214\pm 0.076\) | \(3.023\pm 0.051\) | \(92.32\pm 0.15\) | \(13.84\pm 0.57\) |
+ASA | 5.4M | \(3.155\pm 0.012\) | \(3.658\pm 0.007\) | \(93.44\pm 0.03\) | \(16.14\pm 0.13\) | \(2.945\pm 0.017\) | \(3.520\pm 0.008\) | \(96.45\pm 0.05\) | \(17.35\pm 0.14\) |
CTFUNet | 6.1M | \(\mathbf{3.196}\pm 0.014\) | \(\mathbf{3.673}\pm 0.003\) | \(\mathbf{93.63}\pm 0.01\) | \(\mathbf{16.36}\pm 0.03\) | \(\mathbf{2.979}\pm 0.003\) | \(\mathbf{3.540}\pm 0.001\) | \(\mathbf{96.64}\pm 0.03\) | \(\mathbf{17.60}\pm 0.03\) |
−RCAM | 4.9M | \(3.157\pm 0.015\) | \(3.648\pm 0.006\) | \(93.51\pm 0.07\) | \(16.27\pm 0.08\) | \(2.951\pm 0.001\) | \(3.517\pm 0.002\) | \(96.53\pm 0.03\) | \(17.52\pm 0.05\) |
−MCHCA | 5.1M | \(3.143\pm 0.007\) | \(3.643\pm 0.003\) | \(93.38\pm 0.06\) | \(16.08\pm 0.02\) | \(2.951\pm 0.012\) | \(3.510\pm 0.001\) | \(96.54\pm 0.02\) | \(17.43\pm 0.16\) |
−CTFSC | 5.9M | \(2.996\pm 0.150\) | \(3.576\pm 0.054\) | \(92.93\pm 0.34\) | \(15.57\pm 0.60\) | \(2.820\pm 0.117\) | \(3.428\pm 0.072\) | \(96.08\pm 0.25\) | \(16.97\pm 0.48\) |
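
To make the comparison between rows easier to read, the following is a minimal sketch that computes how far each variant's mean WB-PESQ (with reverb) falls below the full CTFUNet. It assumes that the rows prefixed with "−" are variants with the named module removed, which the table itself does not state. The dictionary keys and the helper `ablation_drops` are hypothetical names chosen for illustration, and the sketch ignores the reported standard deviations.

```python
# Mean WB-PESQ (with reverb), copied from the table above.
# Keys use an ASCII hyphen in place of the table's minus sign.
wb_pesq_reverb = {
    "CTFUNet": 3.196,
    "-RCAM": 3.157,
    "-MCHCA": 3.143,
    "-CTFSC": 2.996,
}


def ablation_drops(scores, reference="CTFUNet"):
    """Return, for each non-reference row, reference score minus that row's score."""
    ref = scores[reference]
    return {
        name: round(ref - value, 3)  # round to the table's three decimals
        for name, value in scores.items()
        if name != reference
    }


drops = ablation_drops(wb_pesq_reverb)

# List variants from the largest drop to the smallest.
for name, drop in sorted(drops.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {drop:.3f}")
```

The same helper can be applied to any other column by swapping in that column's means. Differences smaller than the reported standard deviations should be read with caution.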