Dual input neural networks for positional sound source localization

EURASIP Journal on Audio, Speech, and Music Processing

Table 1 Hyperparameters

Parameter	Value
Num. parameters (DI-NN)	3.5M
Num. conv. kernels	64, 128, 256, 512
Conv. kernel size	2x2
Conv. layer pooling size	2x2
GRU output size	256
Metadata fusion net. layer out. sizes	\(512 + N_{\phi}\), 2
Metadata embedding layer out. sizes	\(2 N_{\phi}\), \(N_{\phi}\)
Activation func. last layer	None
Activation func. other layers	Rectified Linear Unit (ReLU)
Num. Discrete Fourier Transform (DFT) bins (for STFT)	1024
DFT hop length (for STFT)	512
Input duration	0.5 s
Sampling rate	16 kHz
Grid resolution of LS method	2 cm
Learning rate	0.0005
Batch size	32
Num. epochs	40
Batch normalization [44]	Only after conv. layers
Optimizer	Adam [45]
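
The signal and STFT entries of Table 1 jointly determine the size of the spectrogram fed to the network. The following is a minimal sketch of that arithmetic, not the authors' code. The names `Hyperparameters` and `stft_shape` are hypothetical. It assumes that "Num. DFT bins" denotes the FFT size (so a one-sided spectrum has 1024/2 + 1 bins) and that the frame count follows the common centered or non-centered STFT conventions; the table does not state which padding convention was used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyperparameters:
    """Subset of Table 1 values (field names are illustrative, not from the paper)."""
    conv_kernels: tuple = (64, 128, 256, 512)
    gru_output_size: int = 256
    n_fft: int = 1024              # "Num. DFT bins", assumed to be the FFT size
    hop_length: int = 512
    input_duration_s: float = 0.5
    sampling_rate_hz: int = 16_000
    learning_rate: float = 5e-4
    batch_size: int = 32
    num_epochs: int = 40


def stft_shape(hp: Hyperparameters, center: bool = True):
    """Return (n_samples, n_freq_bins, n_frames) for one input excerpt.

    center=True assumes the signal is padded by n_fft // 2 on both sides,
    center=False assumes no padding. Both are assumptions about the front end.
    """
    n_samples = int(round(hp.input_duration_s * hp.sampling_rate_hz))
    n_freq_bins = hp.n_fft // 2 + 1  # one-sided spectrum of a real signal
    if center:
        n_frames = 1 + n_samples // hp.hop_length
    else:
        n_frames = 1 + (n_samples - hp.n_fft) // hp.hop_length
    return n_samples, n_freq_bins, n_frames


if __name__ == "__main__":
    hp = Hyperparameters()
    print(stft_shape(hp, center=True))
    print(stft_shape(hp, center=False))
```

Under these assumptions, a 0.5 s excerpt at 16 kHz contains 8000 samples, so each channel yields only a handful of STFT frames, which is consistent with the use of small 2x2 kernels and pooling sizes in the convolutional layers.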