From: Channel and temporal-frequency attention UNet for monaural speech enhancement
Layer name | Input size | Hyperparameters | Output size |
---|---|---|---|
PE | \(2 \times 161 \times L\) | - | \(2 \times 161 \times L\) |
input conv2d | \(2 \times 161 \times L\) | (3,3),(1,1) | \(32 \times 160 \times L\) |
encoder-fd-1 | \(32 \times 160 \times L\) | (4,4),(2,1) | \(64 \times 80 \times L\) |
encoder-fd-2 | \(64 \times 80 \times L\) | (4,4),(2,1) | \(128 \times 40 \times L\) |
encoder-fd-3 | \(128 \times 40 \times L\) | (4,4),(2,1) | \(256 \times 20 \times L\) |
neck-1 | \(256 \times 20 \times L\) | - | \(256 \times 20 \times L\) |
neck-2 | \(256 \times 20 \times L\) | - | \(256 \times 20 \times L\) |
decoder-fu-1 | \(256 \times 20 \times L\) | (4,4),(2,1) | \(128 \times 40 \times L\) |
decoder-fu-2 | \(128 \times 40 \times L\) | (4,4),(2,1) | \(64 \times 80 \times L\) |
decoder-fu-3 | \(64 \times 80 \times L\) | (4,4),(2,1) | \(32 \times 160 \times L\) |
output conv2d | \(32 \times 160 \times L\) | (3,3),(1,1) | \(4 \times 161 \times L\) |
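
The table does not say what the two tuples in the Hyperparameters column mean, nor how each layer is padded. The sketch below is a plausibility check under stated assumptions, not the authors' implementation. It assumes the tuples are (kernel size), (stride), each ordered as (frequency, time), that sizes are written as channels × frequency × time, and that the total padding is 2 on the frequency axis and 3 on the time axis. The padding values and the function names `conv_out`, `deconv_out`, and `trace_shapes` are illustrative choices, not taken from the source. Under these assumptions, the standard convolution and transposed-convolution length formulas reproduce the sizes listed for the encoder-fd and decoder-fu rows.

```python
def conv_out(n, kernel, stride, pad_total):
    """Output length of a strided convolution along one axis."""
    return (n + pad_total - kernel) // stride + 1


def deconv_out(n, kernel, stride, pad_total):
    """Output length of a transposed convolution along one axis."""
    return (n - 1) * stride - pad_total + kernel


def trace_shapes(freq=160, channels=32, stages=3):
    """Follow (channels, frequency) through encoder-fd-1..3 and decoder-fu-1..3.

    Assumed, not from the source: kernel 4 and stride 2 act on the
    frequency axis with a total padding of 2.
    """
    shapes = [(channels, freq)]
    # Encoder: channels double, frequency bins halve.
    for _ in range(stages):
        channels *= 2
        freq = conv_out(freq, kernel=4, stride=2, pad_total=2)
        shapes.append((channels, freq))
    # Decoder: channels halve, frequency bins double.
    for _ in range(stages):
        channels //= 2
        freq = deconv_out(freq, kernel=4, stride=2, pad_total=2)
        shapes.append((channels, freq))
    return shapes


if __name__ == "__main__":
    for channels, freq in trace_shapes():
        print(f"{channels} x {freq} x L")
```

On the time axis, a kernel of 4 with stride 1 keeps the frame count L unchanged only if the total padding is 3, for example `conv_out(L, 4, 1, 3) == L`. Whether that padding is applied causally (all on the past side) is not stated in the table.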