Sub-convolutional U-Net with transformer attention network for end-to-end single-channel speech enhancement

EURASIP Journal on Audio, Speech, and Music Processing

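Table 1 Ablation study of the proposed model in terms of PESQ, STOI, and SDR, reported per input SNR and averaged over the three SNR levels. The proposed model is indicated in bold italic text. N indicates the depth of the U-Net. Par. (M) denotes the number of parameters in millions; a check mark indicates that the corresponding block is included, and x that it is omitted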

Metrics	TAN model			Par. (M)	PESQ				STOI (%)				SDR (dB)
SNR (dB)	ATAB	AFAB	AHA	-	−5.00	0.00	5.00	Avg.	−5.00	0.00	5.00	Avg.	−5.00	0.00	5.00	Avg.
Raw speech	x	x	x	-	1.48	1.66	1.87	1.67	32.14	41.24	50.17	41.18	−2.98	0.14	3.15	0.10
SCUNet (N = 5)	x	x	x	13.20	2.18	2.41	2.68	2.42	62.01	69.46	76.04	69.17	5.78	8.03	10.56	8.12
TANSCUNet	✓	x	x	3.25	2.31	2.69	2.91	2.63	64.35	71.05	78.32	71.24	6.89	9.07	11.09	9.02
TANSCUNet	x	✓	x	3.25	2.53	2.78	3.02	2.78	66.16	73.26	80.72	73.38	7.83	10.18	11.57	9.86
TANSCUNet	✓	✓	x	3.51	2.66	2.91	3.16	2.90	68.33	75.54	82.26	75.38	8.55	10.91	12.13	10.53
TANSCUNet	✓	✓	✓	3.51	2.85	3.12	3.37	3.08	72.52	79.65	84.36	78.84	9.81	11.85	13.62	11.76
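
The Avg. columns in Table 1 can be checked against the per-SNR entries. The following is a minimal sketch, assuming that Avg. is the arithmetic mean over the three input SNR conditions (the SDR values in the table are consistent with this to two decimals). The dictionary keys and the function names average_sdr and sdr_gain are illustrative labels introduced here, not identifiers from the paper.

```python
# SDR (dB) at input SNRs of -5, 0, and 5 dB, copied from Table 1.
# The keys are illustrative labels for the table rows (hypothetical names).
SDR_BY_SNR = {
    "Raw speech": (-2.98, 0.14, 3.15),
    "SCUNet (N = 5)": (5.78, 8.03, 10.56),
    "TANSCUNet (ATAB)": (6.89, 9.07, 11.09),
    "TANSCUNet (AFAB)": (7.83, 10.18, 11.57),
    "TANSCUNet (ATAB + AFAB)": (8.55, 10.91, 12.13),
    "TANSCUNet (ATAB + AFAB + AHA)": (9.81, 11.85, 13.62),
}


def average_sdr(values):
    """Arithmetic mean over the SNR conditions (assumed definition of Avg.)."""
    return sum(values) / len(values)


def sdr_gain(model, baseline="SCUNet (N = 5)"):
    """Difference in average SDR (dB) between a model row and a baseline row."""
    return average_sdr(SDR_BY_SNR[model]) - average_sdr(SDR_BY_SNR[baseline])


if __name__ == "__main__":
    for name, values in SDR_BY_SNR.items():
        print(f"{name}: average SDR = {average_sdr(values):.2f} dB")
    full = "TANSCUNet (ATAB + AFAB + AHA)"
    print(f"Average SDR gain of the full model over SCUNet: {sdr_gain(full):.2f} dB")
```

The same helper can be applied to the STOI or PESQ columns by swapping in the corresponding per-SNR values. Small differences in the last digit are possible if the published averages were computed from unrounded scores.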