
Table 6 Architecture of CTFUNet

From: Channel and temporal-frequency attention UNet for monaural speech enhancement

| Layer name | Input size | Hyperparameters | Output size |
| --- | --- | --- | --- |
| PE | \(2 \times 161 \times L\) | - | \(2 \times 161 \times L\) |
| input conv2d | \(2 \times 161 \times L\) | (3,3),(1,1) | \(32 \times 160 \times L\) |
| encoder-fd-1 | \(32 \times 160 \times L\) | (4,4),(2,1) | \(64 \times 80 \times L\) |
| encoder-fd-2 | \(64 \times 80 \times L\) | (4,4),(2,1) | \(128 \times 40 \times L\) |
| encoder-fd-3 | \(128 \times 40 \times L\) | (4,4),(2,1) | \(256 \times 20 \times L\) |
| neck-1 | \(256 \times 20 \times L\) | - | \(256 \times 20 \times L\) |
| neck-2 | \(256 \times 20 \times L\) | - | \(256 \times 20 \times L\) |
| decoder-fu-1 | \(256 \times 20 \times L\) | (4,4),(2,1) | \(128 \times 40 \times L\) |
| decoder-fu-2 | \(128 \times 40 \times L\) | (4,4),(2,1) | \(64 \times 80 \times L\) |
| decoder-fu-3 | \(64 \times 80 \times L\) | (4,4),(2,1) | \(32 \times 160 \times L\) |
| output conv2d | \(32 \times 160 \times L\) | (3,3),(1,1) | \(4 \times 161 \times L\) |
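
The frequency-axis sizes in the table can be checked with the usual convolution length formulas. The sketch below is illustrative only and rests on several assumptions that the table does not state: that each hyperparameter pair is (kernel size), (stride) ordered as (frequency, time); that the encoder layers are standard convolutions and the decoder layers are transposed convolutions; and that the total padding per layer is whatever value reproduces the listed sizes. The names `conv_out`, `deconv_out`, `FREQ_LAYERS`, and `trace_frequency` are hypothetical helpers, not part of the original work. PE and the neck layers are omitted because they leave the size unchanged.

```python
def conv_out(n, kernel, stride, pad_total):
    """Output length of a standard convolution along one axis."""
    return (n + pad_total - kernel) // stride + 1


def deconv_out(n, kernel, stride, pad_total):
    """Output length of a transposed convolution along one axis."""
    return (n - 1) * stride - pad_total + kernel


# (layer name, kind, kernel, stride, ASSUMED total padding, output channels)
# Kernel, stride, and channel counts come from the table; the layer kind and
# the padding values are assumptions chosen to match the listed output sizes.
FREQ_LAYERS = [
    ("input conv2d", "conv", 3, 1, 1, 32),
    ("encoder-fd-1", "conv", 4, 2, 2, 64),
    ("encoder-fd-2", "conv", 4, 2, 2, 128),
    ("encoder-fd-3", "conv", 4, 2, 2, 256),
    ("decoder-fu-1", "deconv", 4, 2, 2, 128),
    ("decoder-fu-2", "deconv", 4, 2, 2, 64),
    ("decoder-fu-3", "deconv", 4, 2, 2, 32),
    ("output conv2d", "conv", 3, 1, 3, 4),
]


def trace_frequency(n_freq=161):
    """Return (layer name, channels, frequency bins) after each layer."""
    sizes = []
    for name, kind, kernel, stride, pad_total, channels in FREQ_LAYERS:
        length_fn = conv_out if kind == "conv" else deconv_out
        n_freq = length_fn(n_freq, kernel, stride, pad_total)
        sizes.append((name, channels, n_freq))
    return sizes


if __name__ == "__main__":
    for name, channels, n_freq in trace_frequency():
        print(f"{name}: {channels} x {n_freq} x L")
```

Under these assumptions the first and last layers need an odd total padding along frequency (1 and 3 respectively), which implies asymmetric padding. The time axis keeps length L in every row, so with a kernel of 3 or 4 and stride 1 it would likewise need a total padding of 2 or 3.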