From: Dual input neural networks for positional sound source localization
Parameter | Value |
---|---|
Num. parameters (DI-NN) | 3.5M |
Num. conv. kernels | 64, 128, 256, 512 |
Conv. kernel size | 2x2 |
Conv. layer pooling size | 2x2 |
GRU output size | 256 |
Metadata fusion net. layer out. sizes | \(512 + N_{\phi }\), 2 |
Metadata embedding layer out. sizes | \(2 N_{\phi }\), \(N_{\phi }\) |
Activation func. last layer | None |
Activation func. other layers | Rectified Linear Unit (ReLU) |
Num. Discrete Fourier Transform (DFT) bins (for STFT) | 1024 |
DFT hop length (for STFT) | 512 |
Input duration | 0.5 s |
Sampling rate | 16 kHz |
Grid resolution of LS method | 2 cm |
Learning rate | 0.0005 |
Batch size | 32 |
Num. epochs | 40 |
Batch normalization [44] | Only after conv. layers |
Optimizer | Adam [45] |
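As a rough illustration of how the signal-processing entries fit together, the sketch below holds the table's values in a configuration object and derives the STFT input shape for one 0.5 s clip. It is a hypothetical sketch, not the authors' code. The field names are invented for illustration. It assumes a one-sided (real-input) DFT, so 1024 DFT bins give 1024 // 2 + 1 = 513 frequency bins; the table does not say whether the model keeps all bins. Frame counts are shown under two common framing conventions, because the table does not state which one was used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiNNHyperparams:
    """Values copied from the table above; field names are hypothetical."""
    conv_channels: tuple = (64, 128, 256, 512)
    conv_kernel_size: tuple = (2, 2)
    pool_size: tuple = (2, 2)
    gru_output_size: int = 256
    n_fft: int = 1024            # Num. DFT bins (for STFT)
    hop_length: int = 512        # DFT hop length (for STFT)
    input_duration_s: float = 0.5
    sample_rate_hz: int = 16_000
    ls_grid_resolution_m: float = 0.02
    learning_rate: float = 5e-4
    batch_size: int = 32
    num_epochs: int = 40


def stft_shape(hp: DiNNHyperparams, center: bool = True) -> tuple:
    """Return (frequency_bins, frames) for the STFT of one input clip.

    Assumption: a one-sided DFT, giving n_fft // 2 + 1 frequency bins.
    center=True assumes n_fft // 2 samples of padding on each side (the
    default in torch.stft and librosa.stft); center=False assumes no padding.
    """
    n_samples = int(round(hp.input_duration_s * hp.sample_rate_hz))
    n_bins = hp.n_fft // 2 + 1
    if center:
        # The padded length is n_samples + n_fft, so the n_fft terms cancel.
        n_frames = 1 + n_samples // hp.hop_length
    else:
        n_frames = 1 + (n_samples - hp.n_fft) // hp.hop_length
    return n_bins, n_frames


if __name__ == "__main__":
    hp = DiNNHyperparams()
    print(stft_shape(hp))                # (513, 16)
    print(stft_shape(hp, center=False))  # (513, 14)
```

Under these assumptions, each 0.5 s clip at 16 kHz has 8000 samples. With centered framing this gives a 513 × 16 time-frequency map per microphone channel, and without padding it gives 513 × 14.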