
Cascade algorithms for combined acoustic feedback cancelation and noise reduction

Abstract

This paper presents three cascade algorithms for combined acoustic feedback cancelation (AFC) and noise reduction (NR) in speech applications. A prediction error method (PEM)-based adaptive feedback cancelation (PEM-based AFC) algorithm is used for the AFC stage, while a multichannel Wiener filter (MWF) is applied for the NR stage. A scenario with M microphones and one loudspeaker is considered, without loss of generality. The first algorithm is the baseline algorithm, namely the cascade M-channel rank-1 MWF and PEM-AFC, where an NR stage is performed first using a rank-1 MWF, followed by a single-channel AFC stage using a PEM-based AFC algorithm. The second algorithm is the cascade \((M+1)\)-channel rank-2 MWF and PEM-AFC, where again an NR stage is applied first, followed by a single-channel AFC stage. The novelty of this algorithm is to consider an \((M+1)\)-channel data model in the MWF formulation with two different desired signals, i.e., the speech component in the reference microphone signal and in the loudspeaker signal, both defined by the speech source signal but not equal to each other. The two desired signal estimates are later used in a single-channel PEM-based AFC stage. The third algorithm is the cascade M-channel PEM-AFC and rank-1 MWF, where an M-channel AFC stage is performed first, followed by an M-channel NR stage. Although in cascade algorithms where NR is performed first and then AFC the estimation of the feedback path is usually affected by the NR stage, it is shown here that by performing a rank-2 approximation of the speech correlation matrix this issue can be avoided and the feedback path can be correctly estimated. The performance of the algorithms is assessed by means of closed-loop simulations, where it is shown that for the considered input signal-to-noise ratios (iSNRs) the cascade \((M+1)\)-channel rank-2 MWF and PEM-AFC and the cascade M-channel PEM-AFC and rank-1 MWF algorithms outperform the cascade M-channel rank-1 MWF and PEM-AFC algorithm in terms of the added stable gain (ASG) and misadjustment (Mis), as well as in terms of perceptual metrics such as the short-time objective intelligibility (STOI), perceptual evaluation of speech quality (PESQ), and signal distortion (SD).

1 Introduction

Acoustic feedback and noise are common problems that corrupt microphone signals and affect the performance of speech and audio signal processing applications and devices, such as hearing aids, public address (PA) systems, in-car communication, and teleconferencing systems. Acoustic feedback occurs whenever a signal is captured by a microphone, amplified and played back by a loudspeaker within the same acoustic environment. This acoustic coupling between the microphone (array) and loudspeaker may give rise to instabilities in the system, which translate into signal degradation and, in the worst case, acoustic howling. Different approaches can be found to tackle this problem, the two most popular being howling suppression and acoustic feedback cancelation (AFC) [1]. AFC solutions rely on a decorrelation of the microphone and loudspeaker signals to obtain an unbiased feedback path estimate [1, 2]. In the literature, many different solutions for AFC can be found using different decorrelation procedures such as probe-noise injection [3], time-varying or nonlinear processes in the forward path [4], null-steering (array) [5], subband implementations [6], and prewhitening [7]. The latter approach has been shown to provide limited perceptual distortion [8, 9]. Similarly, for multi-microphone noise reduction (NR), a wide range of solutions can be found in the literature, where one of the popular algorithms is the multi-channel Wiener filter (MWF) [10,11,12], and more recently deep learning-based methods have appeared [13].

Few solutions for combined multi-microphone AFC and NR have been reported in the literature [14, 15]. Similarly to combined acoustic echo cancelation (AEC) and NR, combined AFC and NR can be tackled with integrated and cascade approaches. An integrated approach combines the AFC and NR tasks in a single optimization criterion [14, 15]. A cascade approach consists of an AFC stage and an NR stage, which can be combined in two ways, i.e., a multichannel AFC stage followed by a multichannel NR stage, or a multichannel NR stage followed by a single-channel AFC stage. The order of the stages has performance implications for the combined system [14, 15].

Existing solutions to combined AFC and NR mainly cover single-microphone scenarios [16] and hearing aid applications [5, 14]. In [16], the prediction error method (PEM)-based adaptive filtering with row operations (PEM-AFROW) algorithm [17] is used in combination with an NR stage based on minimum mean squared error short-time log-spectral amplitude (MMSE-LSA) estimation, for a single-microphone scenario. In [14] and [15], multiple schemes are presented for combined AFC and NR using a generalized sidelobe canceler (GSC) for the NR stage and a PEM-based AFC stage. In [18], active feedback suppression for one microphone in a hearing aid is proposed using multiple loudspeakers, without considering the presence of noise in the microphone signal. A real-time implementation of a combined NR and feedback suppression method using spectral subtraction in a smartphone-based hearing aid is presented in [19]. In [20], the authors presented integrated and cascade approaches for combined AEC and NR in the context of wireless acoustic sensor and actuator networks. The algorithms in [20] did not consider the presence of a closed-loop system; therefore, they are not appropriate solutions for combined multi-microphone AFC and NR.

In [21], the authors presented two cascade algorithms for combined multi-microphone AFC and NR for speech applications using a PEM-based AFC algorithm and an MWF. The aim of these cascade algorithms is to estimate a desired speech signal without the feedback and noise components, as observed at a chosen reference microphone. A scenario with M microphones and one loudspeaker is considered, without loss of generality. The first algorithm in [21] is the baseline algorithm, namely the cascade M-channel rank-1 MWF and PEM-AFC, where an NR stage is performed first using a rank-1 MWF, followed by a single-channel AFC stage using the PEM-based AFC algorithm. It is shown by means of simulations that this algorithm does not improve the added stable gain (ASG) in the closed-loop system. The second algorithm is the cascade \((M+1)\)-channel rank-2 MWF and PEM-AFC, where again an NR stage is applied first, followed by a single-channel AFC stage. The novelty of this algorithm is to consider an \((M+1)\)-channel data model in the MWF formulation (i.e., by including the loudspeaker signal) with two different desired signals, i.e., the speech component in the reference microphone signal and in the loudspeaker signal, both defined by the speech source signal but not equal to each other [12]. The two desired signal estimates are later used in a single-channel PEM-based AFC stage [7, 22]. Although in cascade algorithms where NR is performed first and then AFC the estimation of the feedback path is usually affected by the NR stage, it is shown in [21] that by performing a rank-2 approximation of the speech correlation matrix this issue can be avoided and the feedback path can be correctly estimated.

The contributions of this paper in comparison to [21] are as follows. A third cascade algorithm for AFC and NR using the PEM-based AFC algorithm and MWF is presented, and then the three algorithms are further analyzed and compared. The third algorithm is the cascade M-channel PEM-AFC and rank-1 MWF, where an M-channel AFC stage is performed first followed by an M-channel rank-1 NR stage. A comparison of the performance of the three algorithms is provided based on closed-loop simulations using three different scenarios under three signal-to-noise ratios (SNRs). It is shown that for the considered input SNRs (iSNRs) both the cascade \((M+1)\)-channel rank-2 MWF and PEM-AFC and the cascade M-channel PEM-AFC and rank-1 MWF algorithms outperform the cascade M-channel rank-1 MWF and PEM-AFC algorithm in terms of ASG and misadjustment (Mis) as well as in terms of perceptual metrics such as the short-time objective intelligibility (STOI), perceptual evaluation of speech quality (PESQ), and signal distortion (SD). Additionally, the ASG definition is modified to account for the presence of the NR filters in the closed-loop system.

The algorithms in [14] and [15] are similar to the ones presented in this paper. However, there are several differences. The algorithms in this paper rely on a voice activity detector (VAD) to estimate statistics of the signals during noise-only and speech-plus-noise periods, while the GSC requires prior knowledge of the desired speech source and loudspeaker location to design the fixed beamformer and blocking matrix. The GSC in [14] and [15] is defined in the time domain, while the NR stage in this paper is performed in the frequency domain. In [15], the combined AFC and NR problem is tackled by using adaptive filters with prefiltering on the output signals of the blocking matrix (noise references), while in [14], one of the proposed schemes uses the loudspeaker signal as an extra input to the adaptive filters. In [14] and [15], the GSC schemes were tested in scenarios where the forward path gain does not increase over time, i.e., with a fixed gain, whereas in this paper a gain profile is used to gradually increase the gain in the closed-loop system.

The paper is organized as follows. The signal model is presented in Section 2. The formulation of the cascade M-channel rank-1 MWF and PEM-AFC algorithm is provided in Section 3. The cascade \((M+1)\)-channel rank-2 MWF and PEM-AFC algorithm is described in Section 4. The cascade M-channel PEM-AFC and rank-1 MWF is described in Section 5. The computational complexity of the three presented algorithms is analyzed in Section 6. Simulation results are given in Section 7, and finally Section 8 concludes the paper.

2 Signal model

Consider a room with M microphones and L loudspeakers, where the aim is to record a desired speech signal, amplify it, and play it back through the loudspeakers. The case \(L=1\) will be considered, without loss of generality, with the speech source signal denoted by s(t), the loudspeaker signal denoted by u(t), and the \(m^{\textrm{th}}\) microphone signal, with \(m =1,\dots , M\), modeled as

$$\begin{aligned} x^{(m)}(t) = H^{(m)}(q,t) s(t) + F^{(m)}(q,t) u(t) + n^{(m)}(t) \end{aligned}$$
(1)

where \(H^{(m)}(q,t)\) and \(F^{(m)}(q,t)\) are the transfer functions from the speech source position and from the loudspeaker to the \(m^{\textrm{th}}\) microphone, respectively. The latter is also known as the feedback path transfer function. The direct noise signal in the \(m^{\textrm{th}}\) microphone is denoted by \(n^{(m)}(t)\). The discrete time index is represented by t and \(q^{-1}\) is the delay operator, i.e., \(q^{-k}u(t) = u(t-k)\). The loudspeaker signal can be expressed as

$$\begin{aligned} u(t)&= \sum _{m=1}^M G^{(m)}(q,t)\, x^{(m)}(t),\end{aligned}$$
(2)
$$\begin{aligned} u(t)&= u_{s}(t) + u_n(t) \end{aligned}$$
(3)

where \(G^{(m)}(q,t)\) is the forward path transfer function for the \(m^{\textrm{th}}\) microphone signal, \(u_{s}(t)\) is the desired speech component, and \(u_n(t)\) is the noise component in the loudspeaker signal. The presence of the forward path creates a closed-loop system which introduces signal correlation between the loudspeaker and microphone signals. Figure 1 depicts a block diagram of the closed-loop system. It is assumed that the speech source signal can be modeled as

$$\begin{aligned} s(t) = \dfrac{1}{A(q,t)} e(t) \end{aligned}$$
(4)

where \(\frac{1}{A(q,t)}\) is a highly time-varying autoregressive (AR) model excited by the white noise signal e(t); this is a common assumption in PEM-based AFC [1, 7, 22]. A combined NR and AFC algorithm aims to estimate the desired speech signal without the feedback and noise components, as observed at a chosen reference microphone \((m=r)\), i.e.,

$$\begin{aligned} d(t) = H^{(r)}(q,t) s(t) \end{aligned}$$
(5)

where \(H^{(r)}(q,t)\) is the transfer function from the speech source position to the reference microphone. Additionally, the speech component including the feedback contribution in the reference microphone signal is expressed as

$$\begin{aligned} x_s^{(r)}(t) = H^{(r)}(q,t) s(t) + F^{(r)}(q,t) u_{s}(t). \end{aligned}$$
(6)
Fig. 1 Block diagram of the closed-loop system
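To make the closed-loop model in (1)–(3) and Fig. 1 concrete, the following minimal time-domain sketch generates the microphone signals inside the loop. It assumes random FIR impulse responses, a forward path consisting of a pure gain and delay applied to the first microphone, and no AFC or NR processing; all names, lengths, and gains are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, Lir = 4, 32000, 64                     # mics, samples, impulse response taps
s = rng.standard_normal(N)                   # stand-in for the speech source s(t)
h = 0.05 * rng.standard_normal((M, Lir)); h[:, 0] = 1.0  # H^{(m)}: source -> mics
f = 0.02 * rng.standard_normal((M, Lir))                 # F^{(m)}: loudspeaker -> mics
n = 0.01 * rng.standard_normal((M, N))                   # direct noise n^{(m)}(t)
K, delay = 0.5, 256                          # forward path G(q,t) = K q^{-delay}

# The speech contribution H^{(m)}(q,t)s(t) does not depend on u(t) and can be
# precomputed; the feedback term must be generated sample by sample.
xs = np.stack([np.convolve(s, h[m])[:N] for m in range(M)])

x = np.zeros((M, N))
u = np.zeros(N)
for t in range(N):
    if t >= delay:
        u[t] = K * x[0, t - delay]           # u(t) = G(q,t) x^{(1)}(t), cfr. (2)
    useg = u[max(t - Lir + 1, 0):t + 1][::-1]              # [u(t), u(t-1), ...]
    x[:, t] = xs[:, t] + f[:, :len(useg)] @ useg + n[:, t]  # cfr. (1)
```

With the small loop gain chosen here the system stays stable; increasing K drives it toward instability, which is the regime that the ASG measure of Section 7.2.2 quantifies.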

The STFT domain representation of the time-domain signals will be used here, which is obtained by means of an R samples long analysis window in a WOLA filterbank with \(50\%\) overlap [23]. Therefore, the STFT \(x^{(m)}(\kappa ,l)\) of the \(m^{\textrm{th}}\) microphone signal, \(x^{(m)}(t)\), at frame l can be defined as

$$\begin{aligned} \left[ \begin{array}{c}x^{(m)}(0,l) \\ \vdots \\ x^{(m)}(R-1,l)\end{array}\right] = \boldsymbol{\mathcal {F}}_{R} \left[ \begin{array}{c}x^{(m)}\left( l\dfrac{R}{2}\right) g_a(0) \\ \vdots \\ x^{(m)}\left( R-1+l\dfrac{R}{2}\right) g_a(R-1)\end{array}\right] \end{aligned}$$
(7)

with \(\kappa \in \{0,1,\dots , R-1 \}\) the frequency bin index, \(l \in \{0,1,\dots , L_f-1 \}\) with \(L_f\) being the number of frames, \(\boldsymbol{\mathcal {F}}_{R}\) being the discrete Fourier transform (DFT) matrix of size R and \(g_a(t)\) being an analysis window. Using the STFT representation of each microphone signal, the following \(M \times 1\) STFT-domain microphone vector is defined

$$\begin{aligned} \textbf{x}(\kappa ,l) = \left[ \begin{array}{ccc}x^{(1)}(\kappa ,l)&\cdots&x^{(M)}(\kappa ,l)\end{array}\right] ^T. \end{aligned}$$
(8)
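As an illustration of the WOLA analysis in (7), and of the overlap-add synthesis used later in (27)–(29), a minimal single-channel sketch is given below. The root-Hann window pair satisfies perfect reconstruction at 50% overlap and matches the root-squared Hann window used in Section 7; the function names are illustrative.

```python
import numpy as np

def wola_analysis(x, R):
    """STFT per (7): R-sample frames, 50% overlap, R-point DFT."""
    g_a = np.sin(np.pi * (np.arange(R) + 0.5) / R)   # root-Hann analysis window
    hop = R // 2
    Lf = (len(x) - R) // hop + 1
    X = np.empty((R, Lf), dtype=complex)
    for l in range(Lf):
        X[:, l] = np.fft.fft(g_a * x[l * hop:l * hop + R])
    return X

def wola_synthesis(X, R):
    """Overlap-add with the matching synthesis window, cfr. (27)-(29)."""
    g_s = np.sin(np.pi * (np.arange(R) + 0.5) / R)
    hop = R // 2
    x = np.zeros(hop * (X.shape[1] - 1) + R)
    for l in range(X.shape[1]):
        x[l * hop:l * hop + R] += np.fft.ifft(X[:, l]).real * g_s
    return x

# Round trip: x_rec equals x up to the first and last half frame.
x = np.random.default_rng(0).standard_normal(8192)
x_rec = wola_synthesis(wola_analysis(x, 512), 512)
```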

Furthermore, an \((M+1) \times 1\) signal vector, consisting of loudspeaker and microphone signals, can be expressed as

$$\begin{aligned} \textbf{y}(\kappa ,l) \triangleq \left[ \begin{array}{c}u(\kappa ,l) \\ \textbf{x}(\kappa ,l)\end{array}\right] = \underbrace{\begin{bmatrix}u_s(\kappa ,l) \\ \textbf{x}_s(\kappa ,l)\end{bmatrix}}_{\textbf{y}_s(\kappa ,l)} + \underbrace{\left[ \begin{array}{c}u_n(\kappa ,l) \\ \textbf{x}_n(\kappa ,l)\end{array}\right] }_{\textbf{y}_n(\kappa ,l)}\end{aligned}$$
(9)
$$\begin{aligned} = \underbrace{\left[ \begin{array}{c} 0 \\ \textbf{h}(\kappa ,l) \end{array}\right] s(\kappa ,l) + \left[ \begin{array}{c} 1 \\ \textbf{f}(\kappa ,l) \end{array}\right] u_s(\kappa ,l)}_{\textbf{y}_s(\kappa ,l)} + \textbf{y}_n(\kappa ,l) \end{aligned}$$
(10)

where \(s(\kappa ,l)\), \(u_s(\kappa ,l)\), \(u(\kappa ,l)\), and \(\textbf{y}_n(\kappa ,l)\) are the STFT-domain speech source signal, speech component in the loudspeaker signal, loudspeaker signal, and noise component in the microphone and loudspeaker signals, respectively. It is noted that \(\textbf{y}_n(\kappa ,l)\) includes the noise component in the loudspeaker signal (first vector component) as well as its coupling into the microphones, added to the direct noise components in the microphones (all other vector components). The STFT-domain transfer functions from the speech source position to the microphones and from the loudspeaker to the microphones are respectively denoted by \(\textbf{h}(\kappa ,l)\) and \(\textbf{f}(\kappa ,l)\). The time-frame and frequency-bin indices l and \(\kappa\) will mostly be omitted in the following for brevity.

The speech correlation matrix is defined as

$$\begin{aligned} \bar{\textbf{R}}_{\mathbf {yy|ss}} = E\{ \textbf{y}_s \textbf{y}_s^H\} = \begin{bmatrix} 1 & 0 \\ \textbf{f} & \textbf{h}\end{bmatrix} \begin{bmatrix} \Phi _{uu} & \Phi _{us} \\ \Phi _{su} & \Phi _{ss} \end{bmatrix} \begin{bmatrix} 1 & \textbf{f}^H \\ 0 & \textbf{h}^H \end{bmatrix} \end{aligned}$$
(11)

where \(\Phi _{ss}=E\{ss^{*}\}\), \(\Phi _{su}=E\{s u_s^{*}\}\), \(\Phi _{us}=E\{u_s s^{*}\}=\Phi _{su}^{*}\), \(\Phi _{uu}=E\{u_s u_s^{*}\}\), \(E\{ \cdot \}\) denotes statistical expectation, and \((\cdot )^{*}\) and \((\cdot )^H\) are the conjugate and conjugate transpose operators, respectively. Performing an LDL factorization on the matrix with the \(\Phi\)’s in (11), \(\bar{\textbf{R}}_{\mathbf {yy|ss}}\) can alternatively be expressed as

$$\begin{aligned} \bar{\textbf{R}}_{\mathbf {yy|ss}} = \begin{bmatrix} 1 & 0 \\ \textbf{f} + \epsilon \textbf{h} & \textbf{h}\end{bmatrix} \begin{bmatrix} \Phi _{uu} & 0 \\ 0 & \Gamma \end{bmatrix} \begin{bmatrix} 1 & \textbf{f}^H + \epsilon ^{*} \textbf{h}^H \\ 0 & \textbf{h}^H\end{bmatrix} \end{aligned}$$
(12)

where \(\epsilon = \dfrac{\Phi _{su}}{\Phi _{uu}}\) and \(\Gamma = \Phi _{ss} - \dfrac{ \Phi _{su} \Phi _{us}}{\Phi _{uu}}\). It is clear that from the knowledge of \(\bar{\textbf{R}}_{\mathbf {yy|ss}}\) in (12) alone, \(\textbf{f}\) and \(\textbf{h}\) cannot be uniquely defined whenever there is a non-zero correlation between s and \(u_s\). In Section 4.1, \(\bar{\textbf{R}}_{\mathbf {yy|ss}}\) is modeled using a rank-2 approximation by assuming that the forward path delay is at least one STFT frame. This delay allows the loudspeaker signal to be viewed as a second source, and hence a rank-2 approximation to be used for \(\bar{\textbf{R}}_{\mathbf {yy|ss}}\). An experimental validation of this assumption is presented in Section 7.4.
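The equivalence of (11) and (12), the rank-2 structure of \(\bar{\textbf{R}}_{\mathbf {yy|ss}}\), and the fact that \(\textbf{f}\) only appears through the mixture \(\textbf{f}+\epsilon \textbf{h}\) can be checked numerically. The following sketch uses arbitrary synthetic values for \(\textbf{h}\), \(\textbf{f}\), and the \(\Phi\)'s.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 4
h = rng.standard_normal(M) + 1j * rng.standard_normal(M)
f = rng.standard_normal(M) + 1j * rng.standard_normal(M)
P_ss, P_uu, P_su = 2.0, 1.5, 0.8 + 0.3j   # Phi_ss, Phi_uu, Phi_su = E{s u_s^*}

A = np.zeros((M + 1, 2), dtype=complex)   # [[1, 0], [f, h]], cfr. (11)
A[0, 0], A[1:, 0], A[1:, 1] = 1.0, f, h
Phi = np.array([[P_uu, np.conj(P_su)], [P_su, P_ss]])
R = A @ Phi @ A.conj().T                  # (11)

eps = P_su / P_uu
Gamma = P_ss - abs(P_su) ** 2 / P_uu
B = A.copy()
B[1:, 0] = f + eps * h                    # first column becomes f + eps*h
R_ldl = B @ np.diag([P_uu, Gamma]) @ B.conj().T   # (12)

print(np.allclose(R, R_ldl))              # True: (11) and (12) coincide
print(np.linalg.matrix_rank(R))           # 2
# Only the mixture f + eps*h is exposed by (12), so f cannot be uniquely
# identified from R alone whenever Phi_su != 0, as remarked above.
```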

Three different cascade algorithms are presented in the following sections for AFC and NR. The first algorithm performs an M-channel rank-1 MWF-based NR to estimate the contribution of \(s(\kappa ,l)\) and \(u_s(\kappa ,l)\) in the reference microphone, and then a single-channel AFC is performed on the resulting signals. The second algorithm performs an \((M+1)\)-channel rank-2 MWF-based NR stage first followed by a single-channel AFC stage, where the rank-2 MWF-based NR is used to estimate the contribution of \(s(\kappa ,l)\) and \(u_s(\kappa ,l)\) in the reference microphone as well as in the loudspeaker, and then a single-channel AFC is performed on the resulting signals. The third algorithm performs an M-channel AFC stage first followed by an M-channel rank-1 MWF-based NR stage. In this case, after the M-channel AFC stage removes the feedback component in each microphone, a rank-1 MWF-based NR is used to estimate the contribution of \(s(\kappa ,l)\) in the reference microphone.

3 Cascade M-channel rank-1 MWF and PEM-AFC

3.1 NR stage

The objective of the NR stage is to provide an estimate of the speech component in the reference microphone signal. The feedback component will still be present in the output of the NR stage; hence, a single-channel AFC stage is required to remove it.

In the STFT domain, the correlation matrix of the microphone signal vector \(\textbf{x}\) can be expressed as

$$\begin{aligned} \bar{\textbf{R}}_{\textbf{xx}} = E \{ \textbf{xx}^H \} = \bar{\textbf{R}}_{\mathbf {xx|ss}} + \bar{\textbf{R}}_{\mathbf {xx|nn}} \end{aligned}$$
(13)

where

$$\begin{aligned} \bar{\textbf{R}}_{\mathbf {xx|ss}}&= E \{ \textbf{x}_s \textbf{x}_s^H\} = E \{ (\textbf{h} s + \textbf{f} u_s)(\textbf{h} s + \textbf{f} u_s)^H \},\end{aligned}$$
(14)
$$\begin{aligned} \bar{\textbf{R}}_{\mathbf {xx|nn}}&=E \{ \textbf{x}_{\textbf{n}} \textbf{x}_{\textbf{n}}^H \} \end{aligned}$$
(15)

are the \(M \times M\) microphone-only speech and noise correlation matrices, respectively. The expressions in (13)–(15) are obtained based on the assumption that s and \(\textbf{x}_n\) are uncorrelated. The minimization of the mean squared error (MSE) between the desired signal and the filtered microphone signals defines an optimal filter

$$\begin{aligned} \bar{\textbf{w}} = \underset{\textbf{w}}{\arg \min }\;E \left\{ \left\| d_{\textrm{NR}} - \textbf{w}^{H}\textbf{x} \right\| ^2 \right\} \end{aligned}$$
(16)

with \(d_{\text {NR}} = x_s^{(r)}\) representing the speech component (total contribution of s together with \(u_s\)) in the reference microphone signal. The desired signal estimate \(\hat{d}_{\text {NR}}\) is obtained as

$$\begin{aligned} \hat{d}_{\text {NR}}&= \bar{\textbf{w}}^H\textbf{x}. \end{aligned}$$
(17)

The solution to (16) is the MWF [10, 12], given by

$$\begin{aligned} \bar{\textbf{w}} = \bar{\textbf{R}}_{\textbf{xx}}^{-1} \bar{\textbf{R}}_{\mathbf {xx|ss}} \textbf{e}_r \end{aligned}$$
(18)

where \(\textbf{e}_r\) selects the \(r^{\textrm{th}}\) column of a matrix.

In practice, by using a VAD, \(\bar{\textbf{R}}_{\textbf{xx}}\) and \(\bar{\textbf{R}}_{\mathbf {xx|nn}}\) are first estimated during speech-plus-noise periods, where the speech source signal and noise are active, and noise-only periods, where only the noise is active, i.e.,

$$\begin{aligned}&\text {if VAD}(\kappa ,l)=1: \nonumber \\&\hat{\textbf{R}}_{\textbf{xx}}(\kappa ,l) = \beta \hat{\textbf{R}}_{\textbf{xx}}(\kappa ,l-1) + (1-\beta ) \textbf{x}(\kappa ,l) \textbf{x}^H(\kappa ,l), \nonumber \\&\text {if VAD}(\kappa ,l)=0: \nonumber \\&\hat{\textbf{R}}_{\mathbf {xx|nn}}(\kappa ,l) = \beta \hat{\textbf{R}}_{\mathbf {xx|nn}}(\kappa ,l-1) + (1-\beta ) \textbf{x}(\kappa ,l) \textbf{x}^H(\kappa ,l), \end{aligned}$$
(19)

where \(\hat{\textbf{R}}_{\textbf{xx}}(\kappa ,l)\) and \(\hat{\textbf{R}}_{\mathbf {xx|nn}}(\kappa ,l)\) represent estimates of \(\bar{\textbf{R}}_{\textbf{xx}}\) and \(\bar{\textbf{R}}_{\mathbf {xx|nn}}\) at frame l and frequency bin \(\kappa\), respectively. The forgetting factor \(0<\beta <1\) can be chosen depending on the variation of the statistics of the signals, i.e., if the statistics change slowly then \(\beta\) should be chosen close to 1 to obtain long-term estimates that mainly capture the spatial coherence between the microphone signals. The following criterion will then be used to estimate \(\bar{\textbf{R}}_{\mathbf {xx|ss}}\) [12],

$$\begin{aligned} \hat{\textbf{R}}_{\mathbf {xx|ss}} = \underset{\textbf{R}_{\mathbf {xx|ss}}}{\arg \min } \Big \Vert \hat{\textbf{R}}_{\mathbf {xx|nn}}^{-1/2} \left( \hat{\textbf{R}}_{\textbf{xx}} - \hat{\textbf{R}}_{\mathbf {xx|nn}} - \textbf{R}_{\mathbf {xx|ss}} \right) \hat{\textbf{R}}_{\mathbf {xx|nn}}^{-H/2} \Big \Vert ^2_F \end{aligned}$$
(20)
$$\begin{aligned} \text {s.t.} \quad \textrm{rank}(\textbf{R}_{\mathbf {xx|ss}})=1, \qquad \textbf{R}_{\mathbf {xx|ss}} \succeq 0 \end{aligned}$$
(21)

where \(\Vert \cdot \Vert _F\) denotes the Frobenius norm. Spatial pre-whitening is applied by pre- and post-multiplying by \(\hat{\textbf{R}}^\mathrm {-1/2}_{\mathbf{xx|nn}}\) and \(\hat{\textbf{R}}^{-H/2}_{\mathbf {xx|nn}}\), respectively. The solution to (20), (21) is based on a generalized eigenvalue decomposition (GEVD) of the (\(M \times M\)) matrix pencil \(\{ \hat{\textbf{R}}_{\textbf{xx}}, \hat{\textbf{R}}_{\mathbf {xx|nn}} \}\) [12, 25]

$$\begin{aligned} \hat{\textbf{R}}_{\textbf{xx}}&= \hat{\textbf{Q}} \hat{\boldsymbol{\Sigma }}_{\textbf{xx}} \hat{\textbf{Q}}^H\end{aligned}$$
(22)
$$\begin{aligned} \hat{\textbf{R}}_{\mathbf {xx|nn}}&= \hat{\textbf{Q}} \hat{\boldsymbol{\Sigma }}_{\mathbf {xx|nn}} \hat{\textbf{Q}}^H \end{aligned}$$
(23)

where \(\hat{\boldsymbol{\Sigma }}_{\textbf{xx}}\) and \(\hat{\boldsymbol{\Sigma }}_{\mathbf {xx|nn}}\) are diagonal matrices and \(\hat{\textbf{Q}}\) is an invertible matrix. The rank-1 speech correlation matrix estimate \(\hat{\textbf{R}}_{\mathbf{xx|ss}}\) is then [12]

$$\hat{\textbf{R}}_{\mathbf {xx|ss}} = \hat{\textbf{Q}} {\textrm{diag}}\{ \hat{\sigma }_{xx,1} - \hat{\sigma }_{xx|nn,1},0,\ldots ,0 \} \hat{\textbf{Q}}^H$$
(24)

where \(\hat{\sigma }_{xx,i}\) and \(\hat{\sigma }_{xx|nn,i}\) are the ith diagonal element of \(\hat{\boldsymbol{\Sigma }}_{\textbf{xx}}\) and \(\hat{\boldsymbol{\Sigma }}_{\mathbf {xx|nn}}\), respectively, corresponding to the ith largest ratio \(\hat{\sigma }_{xx,i}/\hat{\sigma }_{xx|nn,i}\). Using (24) and \(\hat{\textbf{R}}_{\textbf{xx}}\) (cfr. (22)) in (18), the rank-1 MWF estimate \(\hat{\textbf{w}}\) can be expressed as

$$\hat{\textbf{w}} = \hat{\textbf{Q}}^{-H} {\textrm{diag}}\left\{ 1 - \dfrac{\hat{\sigma }_{xx|nn,1}}{\hat{\sigma }_{xx,1}},0,\ldots ,0 \right\} \hat{\textbf{Q}}^H \textbf{e}_r.$$
(25)

The estimate, \(\hat{x}_s^{(r)}\), is obtained as in (17) with \(\hat{\textbf{w}}\) replacing \(\bar{\textbf{w}}\)

$$\begin{aligned} \hat{x}_s^{(r)}&= \hat{\textbf{w}}^H\textbf{x}. \end{aligned}$$
(26)
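A compact per-frequency-bin sketch of this NR stage is given below: the recursive updates of (19) followed by the GEVD-based rank-1 MWF of (22)–(25). It exploits the fact that scipy's generalized eigendecomposition returns \(\hat{\textbf{R}}_{\mathbf {xx|nn}}\)-orthonormal eigenvectors, so that \(\hat{\boldsymbol{\Sigma }}_{\mathbf {xx|nn}} = \textbf{I}\) and \(\hat{\textbf{Q}} = \textbf{V}^{-H}\). The synthetic correlation matrices in the usage example are assumptions for illustration.

```python
import numpy as np
from scipy.linalg import eigh

def update_corr(R_prev, x_frame, beta=0.95):
    """One recursive update of (19): R <- beta*R + (1-beta)*x x^H."""
    return beta * R_prev + (1 - beta) * np.outer(x_frame, x_frame.conj())

def rank1_mwf(R_xx, R_nn, r=0):
    """GEVD-based rank-1 MWF of (22)-(25); r is the reference microphone."""
    # eigh solves R_xx V = R_nn V diag(lam) with V^H R_nn V = I, so with
    # Q = V^{-H}: R_xx = Q diag(lam) Q^H and R_nn = Q I Q^H (Sigma_nn = I).
    lam, V = eigh(R_xx, R_nn)
    gain = np.zeros(len(lam))
    i = np.argmax(lam)                     # largest ratio sigma_xx/sigma_xx|nn
    gain[i] = max(1 - 1 / lam[i], 0.0)     # 1 - sigma_xx|nn,1/sigma_xx,1
    Vi = np.linalg.inv(V)
    return V @ (gain * Vi[:, r])           # (25): w = V diag(gain) V^{-1} e_r

# Usage with synthetic statistics: a noise floor plus a rank-1 speech model.
M, rng = 4, np.random.default_rng(2)
h = rng.standard_normal(M) + 1j * rng.standard_normal(M)
R_nn = 0.1 * np.eye(M)
R_xx = R_nn + 2.0 * np.outer(h, h.conj())
w = rank1_mwf(R_xx, R_nn)                  # then x_s_hat = w^H x, cfr. (26)
print(abs(w.conj() @ h) / abs(h[0]))       # close to 1: little speech distortion
```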

The corresponding time-domain signals are obtained by adding the \(L_f\) overlapping windowed frames as

$$\begin{aligned} \hat{\textbf{x}}_{s,seg}^{(r)}(l)&= \boldsymbol{\mathcal {F}}_{R}^{-1} \left[ \begin{array}{ccc} {\hat{x}_s^{(r)}} \left( 0,l \right)&\cdots&{\hat{x}_s^{(r)}} \left( {R-1},l \right) \end{array}\right] ^{T},\end{aligned}$$
(27)
$$\begin{aligned} \hat{\textbf{x}}_{s,seg}^{(r)}(l)&= \left[ {\hat{x}_{s,seg}^{(r)}\left( l\dfrac{R}{2}\right) , \dots , \hat{x}_{s,seg}^{(r)}\left( R-1+l\dfrac{R}{2}\right) }\right] ^T,\end{aligned}$$
(28)
$$\begin{aligned} \hat{x}_s^{(r)}(t-\delta _{\textrm{NR}})&= \sum _{l=0}^{L_f-1} \hat{x}_{s,seg}^{(r)} \left( t-l\dfrac{R}{2} \right) g_s \left( t-l\dfrac{R}{2} \right) \end{aligned}$$
(29)

where \(g_s\) is a synthesis window with nonzero values in the interval \(0 \le t \le R-1\) and \(\delta _{\textrm{NR}}\) is the delay from the NR stage.

3.2 AFC stage

The NR stage provides an estimate for \(x_s^{(r)}(t)\) (cfr. (6)) from which the AFC stage will now estimate \(H^{(r)}(q,t) s(t)\). A single-channel PEM-based AFC algorithm is used. Algorithms of this kind were initially developed in [7, 17], and they provide estimates of both the feedback path and the speech source signal model. The PEM-based AFC algorithm used here is the frequency-domain version presented in [22] (the reader is referred to [22] for a detailed explanation of the AFC algorithm). The algorithm uses an overlap-save (OLS) filterbank to compute convolutions in the frequency domain, which requires a rectangular window. The input signals to the AFC algorithm are the (noisy) loudspeaker signal u and the estimate in (29). A short description of the single-channel PEM-based AFC algorithm is provided in Algorithm 1.

Algorithm 1 Single-channel PEM-based AFC [22]
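The paper uses the frequency-domain PEM-based AFC of [22]. As a compact stand-in, the sketch below implements a time-domain PEM-NLMS variant of the same idea: periodically re-fit the AR model \(A(q,t)\) of (4) to the feedback-compensated signal, prewhiten the loudspeaker and microphone signals with it, and adapt the feedback path estimate on the prewhitened pair. The Yule-Walker fit, the open-loop usage example, and all parameter values are illustrative assumptions, not the algorithm of [22].

```python
import numpy as np
from scipy.linalg import toeplitz
from scipy.signal import lfilter

def ar_fit(e, p):
    """Yule-Walker fit of the AR model in (4): returns [1, a_1, ..., a_p]."""
    r = np.correlate(e, e, "full")[len(e) - 1:len(e) + p]
    a = np.linalg.solve(toeplitz(r[:p]) + 1e-9 * np.eye(p), -r[1:p + 1])
    return np.concatenate(([1.0], a))

def pem_afc(u, x, L=64, p=20, frame=512, mu=0.5, delta=1e-6):
    """Return the feedback path estimate f_hat and the compensated signal d."""
    f_hat, d = np.zeros(L), np.zeros(len(x))
    a = np.array([1.0])                  # AR model, re-fitted every frame
    ubuf = np.zeros(L)                   # raw loudspeaker regressor
    uwbuf = np.zeros(L)                  # prewhitened loudspeaker regressor
    up, xp = np.zeros(p + 1), np.zeros(p + 1)   # histories for prewhitening
    for t in range(len(x)):
        ubuf = np.roll(ubuf, 1); ubuf[0] = u[t]
        up = np.roll(up, 1); up[0] = u[t]
        xp = np.roll(xp, 1); xp[0] = x[t]
        d[t] = x[t] - f_hat @ ubuf       # feedback-compensated signal
        uw = a @ up[:len(a)]             # prewhitened A(q)u(t)
        xw = a @ xp[:len(a)]             # prewhitened A(q)x(t)
        uwbuf = np.roll(uwbuf, 1); uwbuf[0] = uw
        ew = xw - f_hat @ uwbuf          # prewhitened prediction error
        f_hat = f_hat + mu * ew * uwbuf / (uwbuf @ uwbuf + delta)  # NLMS
        if (t + 1) % frame == 0:
            a = ar_fit(d[t + 1 - frame:t + 1], p)  # re-estimate A(q) on d
    return f_hat, d

# Open-loop toy example: identify a 64-tap path while an AR(1)-colored
# "speech" signal (what the PEM prewhitening must cope with) is present.
rng = np.random.default_rng(5)
N = 32000
u = rng.standard_normal(N)
f_true = 0.1 * rng.standard_normal(64)
speech = lfilter([1.0], [1.0, -0.9], rng.standard_normal(N))
f_hat, d = pem_afc(u, np.convolve(u, f_true)[:N] + speech)
print(np.linalg.norm(f_true - f_hat) / np.linalg.norm(f_true))
```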

A complete description of the cascade M-channel rank-1 MWF and PEM-AFC algorithm is provided in Algorithm 2, with a block diagram provided in Fig. 2(a).

Algorithm 2 Cascade M-channel rank-1 MWF and PEM-AFC

Fig. 2 Block diagrams for cascade algorithms

4 Cascade (\(M+1\))-channel rank-2 MWF and PEM-AFC

4.1 NR stage

The objective of the NR stage is to provide an estimate of the speech component in the reference microphone signal and in the loudspeaker signal. The feedback component will still be present in the former, hence a single-channel AFC stage is required to remove it.

In the STFT domain, the correlation matrix of the signal vector \(\textbf{y}\) in (9) can be expressed as

$$\begin{aligned} \bar{\textbf{R}}_{\textbf{yy}} = E \{ \textbf{yy}^H \} = \bar{\textbf{R}}_{\mathbf {yy|ss}} + \bar{\textbf{R}}_{\mathbf {yy|nn}} \end{aligned}$$
(30)

with \(\bar{\textbf{R}}_{\mathbf {yy|nn}} =E \{ \textbf{y}_n \textbf{y}_n^H \}\) the \((M+1) \times (M+1)\) noise correlation matrix. The final expression in (30) is obtained based on the assumption that s and \(\textbf{n}\) are uncorrelated. The minimization of the mean squared error (MSE) between the desired signals and the filtered microphone and loudspeaker signals defines an optimal filter

$$\begin{aligned} \underset{(M+1) \times 2}{\bar{\textbf{W}}} = \underset{\textbf{W}}{\arg \min }\;E \left\{ \left\| \textbf{d}_{\textrm{NR}} - \textbf{W}^{H}\textbf{y} \right\| ^2 \right\} \end{aligned}$$
(31)

with \(\textbf{d}_{\textrm{NR}} = \left[ {u_s\quad x_s^{(r)}}\right] ^T\). The desired signal estimates \(\hat{u}_s\) and \(\hat{x}_s^{(r)}\) are obtained as

$$\begin{aligned} \hat{u}_{s}&= \textbf{e}_1^{H}\bar{\textbf{W}}^H\textbf{y},\end{aligned}$$
(32)
$$\begin{aligned} \hat{x}_s^{(r)}&= \textbf{e}_2^{H}\bar{\textbf{W}}^H\textbf{y}. \end{aligned}$$
(33)

The solution to (31) is the MWF [10, 12], given by

$$\begin{aligned} \bar{\textbf{W}} = \bar{\textbf{R}}_{\textbf{yy}}^{-1} \bar{\textbf{R}}_{\mathbf {yy|ss}} [\textbf{e}_1 | \textbf{e}_{r+1}]. \end{aligned}$$
(34)

In practice, by using a VAD, \(\bar{\textbf{R}}_{\textbf{yy}}\) and \(\bar{\textbf{R}}_{\mathbf {yy|nn}}\) are first estimated during speech-plus-noise periods where the desired speech signal and noise are active, and noise-only periods where only the noise is active, i.e.,

$$\begin{aligned}&\text {if VAD}(\kappa ,l)=1: \nonumber \\&\hat{\textbf{R}}_{\textbf{yy}}(\kappa ,l) = \beta \hat{\textbf{R}}_{\textbf{yy}}(\kappa ,l-1) + (1-\beta ) \textbf{y}(\kappa ,l) \textbf{y}^H(\kappa ,l), \nonumber \\&\text {if VAD}(\kappa ,l)=0: \nonumber \\&\hat{\textbf{R}}_{\mathbf {yy|nn}}(\kappa ,l) = \beta \hat{\textbf{R}}_{\mathbf {yy|nn}}(\kappa ,l-1) + (1-\beta ) \textbf{y}(\kappa ,l) \textbf{y}^H(\kappa ,l) \end{aligned}$$
(35)

where \(\hat{\textbf{R}}_{\textbf{yy}}(\kappa ,l)\) and \(\hat{\textbf{R}}_{\mathbf {yy|nn}}(\kappa ,l)\) represent estimates of \(\bar{\textbf{R}}_{\textbf{yy}}\) and \(\bar{\textbf{R}}_{\mathbf {yy|nn}}\) at frame l and frequency bin \(\kappa\), respectively. The following criterion will then be used to estimate \(\bar{\textbf{R}}_{\mathbf {yy|ss}}\) [12],

$$\begin{aligned} \hat{\textbf{R}}_{\mathbf {yy|ss}} = \underset{\textbf{R}_{\mathbf {yy|ss}}}{\arg \min } \left\| \hat{\textbf{R}}_{\mathbf {yy|nn}}^{-1/2}\left( \hat{\textbf{R}}_{\textbf{yy}} - \hat{\textbf{R}}_{\mathbf {yy|nn}} - \textbf{R}_{\mathbf {yy|ss}} \right) \hat{\textbf{R}}_{\mathbf {yy|nn}}^{-H/2}\right\| ^2_F\end{aligned}$$
(36)
$$\begin{aligned} \text {s.t. } \quad \textrm{rank}(\textbf{R}_{\mathbf {yy|ss}})=2, \qquad \textbf{R}_{\mathbf {yy|ss}} \succeq 0. \end{aligned}$$
(37)

Spatial pre-whitening is applied by pre- and post-multiplying by \(\hat{\textbf{R}}^{-1/2}_{\mathbf{yy|nn}}\) and \(\hat{\textbf{R}}^{-H/2}_{\mathbf {yy|nn}}\), respectively. The solution to (36)–(37) is based on a GEVD of the \((M+1) \times (M+1)\) matrix pencil \(\{ \hat{\textbf{R}}_{\textbf{yy}}, \hat{\textbf{R}}_{\mathbf {yy|nn}} \}\) [12, 25]

$$\begin{aligned} \hat{\textbf{R}}_{\textbf{yy}}&= \hat{\textbf{Q}} \hat{\boldsymbol{\Sigma }}_{\textbf{yy}} \hat{\textbf{Q}}^H\end{aligned}$$
(38)
$$\begin{aligned} \hat{\textbf{R}}_{\mathbf {yy|nn}}&= \hat{\textbf{Q}} \hat{\boldsymbol{\Sigma }}_{\mathbf {yy|nn}} \hat{\textbf{Q}}^H \end{aligned}$$
(39)

where \(\hat{\boldsymbol{\Sigma }}_{\textbf{yy}}\) and \(\hat{\boldsymbol{\Sigma }}_{\mathbf {yy|nn}}\) are diagonal matrices and \(\hat{\textbf{Q}}\) is an invertible matrix. The rank-2 speech correlation matrix estimate \(\hat{\textbf{R}}_{\mathbf{yy|ss}}\) is then [12]

$$\hat{\textbf{R}}_{\mathbf {yy|ss}} = \hat{\textbf{Q}} {\textrm{diag}}\{ \hat{\sigma }_{yy,1} - \hat{\sigma }_{yy|nn,1}, \hat{\sigma }_{yy,2} - \hat{\sigma }_{yy|nn,2},0,\ldots ,0 \} \hat{\textbf{Q}}^H$$
(40)

where \(\hat{\sigma }_{yy,i}\) and \(\hat{\sigma }_{yy|nn,i}\) are the ith diagonal element of \(\hat{\boldsymbol{\Sigma }}_{\textbf{yy}}\) and \(\hat{\boldsymbol{\Sigma }}_{\mathbf {yy|nn}}\), respectively, corresponding to the ith largest ratio \(\hat{\sigma }_{yy, i}/\hat{\sigma }_{yy|nn,i}\). Using (40) and \(\hat{\textbf{R}}_{\textbf{yy}}\) (cfr. (38)) in (34), the rank-2 MWF estimate \(\hat{\textbf{W}}\) can be expressed as

$$\hat{\textbf{W}} = \hat{\textbf{Q}}^{-H} {\textrm{diag}}\left\{ 1 - \dfrac{\hat{\sigma }_{yy|nn,1}}{\hat{\sigma }_{yy,1}},1 - \dfrac{\hat{\sigma }_{yy|nn,2}}{\hat{\sigma }_{yy,2}}, 0,\ldots ,0 \right\} \hat{\textbf{Q}}^H [\textbf{e}_1 | \textbf{e}_{r+1}].$$
(41)

The estimates \(\hat{u}_{s}\) and \(\hat{x}_s^{(r)}\) are now obtained as in (32)-(33) with \(\hat{\textbf{W}}\) replacing \(\bar{\textbf{W}}\)

$$\begin{aligned} \hat{u}_{s}&= \textbf{e}_1^H \hat{\textbf{W}}^H\textbf{y},\end{aligned}$$
(42)
$$\begin{aligned} \hat{x}_s^{(r)}&= \textbf{e}_2^H \hat{\textbf{W}}^H\textbf{y}. \end{aligned}$$
(43)
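Mirroring the rank-1 sketch of Section 3.1, the following sketch keeps the two dominant generalized eigenvalues and extracts both desired signal estimates via \([\textbf{e}_1 | \textbf{e}_{r+1}]\), cfr. (41)–(43). The toy correlation model assumes uncorrelated \(s\) and \(u_s\), i.e., exactly the rank-2 model of Section 4.1; all values are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def rank2_mwf(R_yy, R_nn, r=0):
    """GEVD-based rank-2 MWF of (38)-(41) on y = [u; x]; returns W whose
    columns estimate u_s (42) and x_s^{(r)} (43)."""
    lam, V = eigh(R_yy, R_nn)               # ascending; V^H R_nn V = I
    gain = np.zeros(len(lam))
    top2 = np.argsort(lam)[-2:]             # two largest ratios
    gain[top2] = np.maximum(1 - 1 / lam[top2], 0.0)
    Vi = np.linalg.inv(V)
    # (41): W = Q^{-H} diag(gain) Q^H [e_1 | e_{r+1}], with Q = V^{-H}
    return V @ (gain[:, None] * Vi[:, [0, r + 1]])

# Toy statistics per (10): y_s = [0; h] s + [1; f] u_s, s and u_s uncorrelated.
M, rng = 4, np.random.default_rng(3)
h = rng.standard_normal(M) + 1j * rng.standard_normal(M)
f = rng.standard_normal(M) + 1j * rng.standard_normal(M)
a_s = np.concatenate(([0.0], h))
a_u = np.concatenate(([1.0], f))
R_ss = 2.0 * np.outer(a_s, a_s.conj()) + 1.0 * np.outer(a_u, a_u.conj())
R_nn = 0.1 * np.eye(M + 1)
W = rank2_mwf(R_ss + R_nn, R_nn, r=0)
y = 1.0 * a_s + 0.5 * a_u                   # a noiseless sample of y_s
u_s_hat, x_s_hat = W.conj().T @ y           # estimates per (42)-(43)
print(abs(u_s_hat - 0.5), abs(x_s_hat - (h[0] + 0.5 * f[0])))  # small residuals
```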

The corresponding time-domain signals are obtained by adding the \(L_f\) overlapping windowed frames as

$$\begin{aligned} \hat{\textbf{x}}_{s,seg}^{(r)}(l)&= \boldsymbol{\mathcal {F}}_{R}^{-1} \left[ \begin{array}{ccc} {\hat{x}_s^{(r)}} \left( 0,l \right)&\cdots&{\hat{x}_s^{(r)}} \left( {R-1},l \right) \end{array}\right] ^T,\end{aligned}$$
(44)
$$\begin{aligned} \hat{\textbf{u}}_{s,seg}(l)&= \boldsymbol{\mathcal {F}}_{R}^{-1} \left[ \begin{array}{ccc} \hat{u}_{s} \left( 0,l \right)&\cdots&\hat{u}_{s} \left( R-1,l \right) \end{array}\right] ^T,\end{aligned}$$
(45)
$$\begin{aligned} \hat{\textbf{x}}_{s,seg}^{(r)}(l)&= \left[ {\hat{x}_{s,seg}^{(r)}\left( l\dfrac{R}{2}\right) , \dots , \hat{x}_{s,seg}^{(r)}\left( R-1+l\dfrac{R}{2}\right) }\right] ^T,\end{aligned}$$
(46)
$$\begin{aligned} \hat{\textbf{u}}_{s,seg}(l)&= \left[ {\hat{u}_{s,seg}\left( l\dfrac{R}{2}\right) , \dots , \hat{u}_{s,seg}\left( R-1+l\dfrac{R}{2}\right) }\right] ^T,\end{aligned}$$
(47)
$$\begin{aligned} \hat{x}_s^{(r)}(t-\delta _{\textrm{NR}})&= \sum _{l=0}^{L_f-1} \hat{x}_{s,seg}^{(r)} \left( t-l\dfrac{R}{2} \right) g_s \left( t-l\dfrac{R}{2} \right) ,\end{aligned}$$
(48)
$$\begin{aligned} \hat{u}_s(t-\delta _{\textrm{NR}})&= \sum _{l=0}^{L_f-1} \hat{u}_{s,seg} \left( t-l\dfrac{R}{2} \right) g_s \left( t-l\dfrac{R}{2} \right) . \end{aligned}$$
(49)

4.2 AFC stage

In the AFC stage a single-channel PEM-based AFC algorithm is used. The PEM-based AFC algorithm used here is the frequency-domain version presented in [22]. The input signals to the AFC algorithm are \(\hat{u}_s\) and \(\hat{x}_s^{(r)}\). A short description of the PEM-based AFC algorithm is provided in Algorithm 1. Note that in this AFC stage, the estimates of the speech component in the loudspeaker signal (cfr. (49)) and in the reference microphone signal (cfr. (48)) are used to estimate the feedback path, unlike in Section 3.2 where the estimate of the speech component in the reference microphone signal (cfr. (29)) and the noisy loudspeaker signal are used.

A complete description of the cascade (\(M+1\))-channel rank-2 MWF and PEM-AFC algorithm is provided in Algorithm 3, with a block diagram provided in Fig. 2(b).

Algorithm 3 Cascade (\(M+1\))-channel rank-2 MWF and PEM-AFC

5 Cascade M-channel PEM-AFC and rank-1 MWF

Assuming an exact speech signal model \(A^{-1}(q,t)\) is available (see (4)), a prefilter A(q, t) can be applied, such that the time-domain prefiltered loudspeaker and \(m^{\textrm{th}}\) microphone signal can be expressed as

$$\begin{aligned} \tilde{u}(t)&= A(q,t)u(t),\end{aligned}$$
(50)
$$\begin{aligned} \tilde{x}^{(m)}(t)&= A(q,t)x^{(m)}(t). \end{aligned}$$
(51)

Similarly, the prefiltered version of the signal vector \(\textbf{y}\) in (9) can be expressed as

$$\begin{aligned} \tilde{\textbf{y}}(\kappa ,l)&= {\left[ \begin{array}{c}\tilde{u}(\kappa ,l) \\ \tilde{\textbf{x}}(\kappa ,l)\end{array}\right] }\end{aligned}$$
(52)
$$\begin{aligned}&= {\left[ \begin{array}{c}0 \\ \textbf{h}(\kappa ,l)\end{array}\right] } e(\kappa ,l) + {\left[ \begin{array}{c}1 \\ \textbf{f}(\kappa ,l)\end{array}\right] } \tilde{u}_s(\kappa ,l) + \tilde{\textbf{y}}_n(\kappa ,l) \end{aligned}$$
(53)

where \(\tilde{u}(\kappa ,l)\) and \(\tilde{\textbf{x}}(\kappa ,l)\) represent the STFT-domain prefiltered loudspeaker and microphone signals. Similarly, \(\tilde{u}_s(\kappa ,l)\) is the STFT-domain prefiltered desired speech component in the loudspeaker signal and \(\tilde{\textbf{y}}_n(\kappa ,l)\) is the STFT-domain prefiltered noise component in the loudspeaker and microphone signals. The speech correlation matrix can be rewritten as

$$\begin{aligned} \bar{\textbf{R}}_{\tilde{\textbf{y}}\tilde{\textbf{y}}\mathbf {|ss}}&= \left[ \begin{array}{cc} 1 & 0 \\ \textbf{f} & \textbf{h}\end{array}\right] \left[ \begin{array}{cc} \Phi _{\tilde{u}\tilde{u}} & 0 \\ 0 & \Phi _{ee} \end{array}\right] \left[ \begin{array}{cc} 1 & \textbf{f}^H \\ 0 & \textbf{h}^H \end{array}\right] \end{aligned}$$
(54)
$$\begin{aligned}&= \left[ \begin{array}{c}1 \\ \textbf{f}\end{array}\right] \Phi _{\tilde{u}\tilde{u}} \left[ {1\quad \textbf{f}^H}\right] + \left[ \begin{array}{c}0\\ \textbf{h}\end{array}\right] \Phi _{ee} \left[ {0\quad \textbf{h}^H} \right] \end{aligned}$$
(55)

where \(\Phi _{\tilde{u}\tilde{u}}=E \{\tilde{u} \tilde{u}^{*}\}\), \(\Phi _{ee} = E\{ ee^{*}\}\), \(\Phi _{e \tilde{u}}=E \{e\tilde{u}^{*}\}=0\) and \(\Phi _{\tilde{u}e}=E \{\tilde{u}e^{*} \}=0\). Since (54) is computed in the STFT domain, the cross-correlation terms would only be zero if there is a delay of at least one STFT-frame in the forward path. It can be observed that, after prefiltering, \(\textbf{h}\) and \(\textbf{f}\) can be readily computed from \(\bar{\textbf{R}}_{\tilde{\textbf{y}} \tilde{\textbf{y}}|ss}\). In this case, the order of the AFC and NR stages can be inverted so that an M-channel AFC stage is performed first, which will estimate the speech component (without its feedback contribution) together with the noise component, and then a multichannel NR stage can follow.

5.1 AFC stage

In the AFC stage, a single-channel PEM-based AFC algorithm is used for each microphone, i.e., M times. The AR model is estimated for each single-channel PEM-based AFC algorithm. The same step-size tuning is used for all adaptive algorithms. The PEM-based AFC algorithm used here is the frequency-domain version presented in [22]. The input signals to the AFC algorithm are u and \(x^{(m)}, \forall m\). A short description of the PEM-based AFC algorithm is provided in Algorithm 1.

5.2 NR stage

A rank-1 MWF is used for the NR stage which operates on the microphone signals after the AFC stage.

As in Section 2, the STFT-domain representation of the time-domain signals will be used, obtained by means of an R-samples-long analysis window in a WOLA filterbank with \(50\%\) overlap [23]. The STFT \(x_f^{(m)}(\kappa ,l)\) of the \(m^{\textrm{th}}\) microphone signal after the AFC stage, \(x_f^{(m)}(t)\), at frame l can then be defined as

$$\begin{aligned} {\left[ \begin{array}{c}x_f^{(m)}(0,l) \\ \vdots \\ x_f^{(m)}(R-1,l)\end{array}\right] } = \boldsymbol{\mathcal {F}}_{R} {\left[ \begin{array}{c}x_f^{(m)}\left( l\dfrac{R}{2}\right) g_a(0) \\ \vdots \\ x_f^{(m)}\left( R-1+l\dfrac{R}{2}\right) g_a(R-1)\end{array}\right] }. \end{aligned}$$
(56)

The STFT-domain multi-channel microphone signal after the AFC stage, assuming perfect feedback cancelation, is modeled as

$$\begin{aligned} \textbf{x}_{f}(\kappa ,l) = \textbf{h}(\kappa ,l) s(\kappa ,l) + \textbf{x}_{f|n}(\kappa ,l) \end{aligned}$$
(57)

where \(\textbf{x}_{f|n}(\kappa ,l)\) is the STFT-domain noise component in the microphone signal after feedback cancelation. The minimization of the mean squared error (MSE) between the desired signal and the filtered feedback-compensated microphone signals, \(\textbf{x}_{f}\), defines an optimal filter

$$\begin{aligned} \bar{\textbf{w}} = \underset{\textbf{w}}{\arg \min }\;E \left\{ \left\| {d}_{\textrm{NR}} - \textbf{w}^{H}\textbf{x}_{f} \right\| ^2 \right\} \end{aligned}$$
(58)

with \(d_{\text {NR}}= x_{f|s}^{(r)}\). The desired signal estimate is then obtained as \(\hat{d}_{\text {NR}} = \bar{\textbf{w}}^H \textbf{x}_{f}\). The solution to (58) is the well-known MWF [10, 12], given by

$$\begin{aligned} \bar{\textbf{w}} = \bar{\textbf{R}}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}}^{-1} \bar{\textbf{R}}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}\mathbf {|ss}} \textbf{e}_r \end{aligned}$$
(59)

where \(\bar{\textbf{R}}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}} = E\{ \textbf{x}_{f} \textbf{x}_{f}^H \}\), \(\bar{\textbf{R}}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}\mathbf {|ss}} = E \{ \textbf{h} s s^{*} \textbf{h}^H \}\), and, similarly, \(\bar{\textbf{R}}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}\mathbf {|nn}} = E\{ \textbf{x}_{f|n} \textbf{x}_{f|n}^H\}\). The final expression in (59) is obtained based on the assumption that s and \(\textbf{x}_{f|n}\) are uncorrelated.

In practice, by using a voice activity detector (VAD), \(\bar{\textbf{R}}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}}\) and \(\bar{\textbf{R}}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}\mathbf {|nn}}\) are first estimated during speech-plus-noise periods where the desired speech signal and background noise are active, and noise-only periods where only the noise is active [26], i.e.,

$$\begin{aligned}&\text {if VAD}(\kappa ,l)=1:\end{aligned}$$
(60)
$$\begin{aligned}&\hat{\textbf{R}}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}}(\kappa ,l) = \beta \hat{\textbf{R}}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}}(\kappa ,l-1) + (1-\beta ) \textbf{x}_{f}(\kappa ,l) \textbf{x}_{f}^H(\kappa ,l), \nonumber \\&\text {if VAD}(\kappa ,l)=0:\end{aligned}$$
(61)
$$\begin{aligned}&\hat{\textbf{R}}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}\mathbf {|nn}}(\kappa ,l) = \beta \hat{\textbf{R}}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}\mathbf {|nn}}(\kappa ,l-1) + (1-\beta ) \textbf{x}_{f}(\kappa ,l) \textbf{x}_{f}^H(\kappa ,l) \end{aligned}$$
(62)

where \(\hat{\textbf{R}}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}}(\kappa ,l)\) and \(\hat{\textbf{R}}_{\textbf{x}_{\textbf{f}} \textbf{x}_{{\textbf {f}}}\mathbf {|nn}}(\kappa ,l)\) represent estimates of \(\bar{\textbf{R}}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}}\) and \(\bar{\textbf{R}}_{\textbf{x}_{{\textbf {f}}} \textbf{x}_{{\textbf {f}}}\mathbf {|nn}}\) at frame l and frequency bin index \(\kappa\), respectively. The following criterion will then be used to estimate \(\bar{\textbf{R}}_{\textbf{x}_{\mathbf{f}} \textbf{x}_{\textbf{f}}\mathbf {|ss}}\) [12],

$$\begin{aligned} \hat{\textbf{R}}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}\mathbf {|ss}} = \underset{\textbf{R}_{\textbf{x}_{\textbf{f}}\textbf{x}_{\textbf{f}}\mathbf {|ss}}}{\arg \min } \left\| \hat{\textbf{R}}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}\mathbf {|nn}}^{-1/2} \left( \hat{\textbf{R}}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}} - \hat{\textbf{R}}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}\mathbf {|nn}} - \textbf{R}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}\mathbf {|ss}} \right) \hat{\textbf{R}}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}\mathbf {|nn}}^{-H/2} \right\| ^2_F \end{aligned}$$
(63)
$$\begin{aligned} \text {s.t. } \quad \textrm{rank}(\textbf{R}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}\mathbf {|ss}})=1, \qquad \textbf{R}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}\mathbf {|ss}} \succeq 0. \end{aligned}$$
(64)

Spatial pre-whitening is applied by pre- and post-multiplying by \(\hat{\textbf{R}}^\mathrm {-1/2}_{\textbf{x}_{f} \textbf{x}_{f}|\mathbf{nn}}\) and \(\hat{\textbf{R}}^{-H/2}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}\mathbf {|nn}}\), respectively. The solution to (63)–(64) is based on a GEVD of the (\(M \times M\)) matrix pencil \(\{ \hat{\textbf{R}}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}}, \hat{\textbf{R}}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}\mathbf {|nn}} \}\) [12, 25]

$$\begin{aligned} \hat{\textbf{R}}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}}&= \hat{\textbf{Q}} \hat{\boldsymbol{\Sigma }}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}} \hat{\textbf{Q}}^H,\end{aligned}$$
(65)
$$\begin{aligned} \hat{\textbf{R}}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}\mathbf {|nn}}&= \hat{\textbf{Q}} \hat{\boldsymbol{\Sigma }}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}\mathbf {|nn}} \hat{\textbf{Q}}^H \end{aligned}$$
(66)

where \(\hat{\boldsymbol{\Sigma }}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}}\) and \(\hat{\boldsymbol{\Sigma }}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}\mathbf {|nn}}\) are diagonal matrices and \(\hat{\textbf{Q}}\) is an invertible matrix. The speech correlation matrix estimate \(\hat{\textbf{R}}_{\textbf{x}_{f} \textbf{x}_{f}|\mathbf{ss}}\) is then [12]

$$\hat{\textbf{R}}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}\mathbf {|ss}} = \hat{\textbf{Q}} {\textrm{diag}}\{ \hat{\sigma }_{x_f x_f,1} - \hat{\sigma }_{x_f x_f|nn,1},0,\ldots ,0 \} \hat{\textbf{Q}}^H$$
(67)

where \(\hat{\sigma }_{x_f x_f,1}\) and \(\hat{\sigma }_{x_f x_f|nn,1}\) are the first diagonal element of \(\hat{\boldsymbol{\Sigma }}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}}\) and \(\hat{\boldsymbol{\Sigma }}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}\mathbf {|nn}}\), respectively, corresponding to the largest ratio \(\hat{\sigma }_{x_f x_f,i}/\hat{\sigma }_{x_f x_f|nn,i}\). Using (67) and \(\hat{\textbf{R}}_{\textbf{x}_{\textbf{f}} \textbf{x}_{\textbf{f}}}\) (cfr. (65)) in (59), the rank-1 MWF estimate \(\hat{\textbf{w}}\) can be expressed as

$$\begin{aligned} \hat{\textbf{w}} = \hat{\textbf{Q}}^{-H} {\textrm{diag}}\left\{ 1 - \dfrac{\hat{\sigma }_{x_f x_f|nn,1}}{\hat{\sigma }_{x_f x_f,1}},0,\ldots ,0 \right\} \hat{\textbf{Q}}^H \textbf{e}_r. \end{aligned}$$
(68)

The desired signal estimate is then obtained as \(\hat{d} = \hat{\textbf{w}}^H \textbf{x}_{f}\). The time-domain desired signal is obtained by adding the \(L_f\) overlapping windowed frames as

$$\begin{aligned} \hat{\textbf{d}}_{seg}(l)&= \boldsymbol{\mathcal {F}}_{R}^{-1} \left[ \begin{array}{ccc} \hat{d} \left( 0,l \right)&\cdots&\hat{d} \left( R-1,l \right) \end{array}\right] ^T,\end{aligned}$$
(69)
$$\begin{aligned} \hat{\textbf{d}}_{seg}(l)&= \left[ {\hat{d}_{seg}\left( l\dfrac{R}{2}\right) , \dots , \hat{d}_{seg}\left( R-1+l\dfrac{R}{2}\right) }\right] ^T,\end{aligned}$$
(70)
$$\begin{aligned} \hat{d}(t-\delta _t)&= \sum _{l=0}^{L_f-1} \hat{d}_{seg} \left( t-l\dfrac{R}{2} \right) g_s \left( t-l\dfrac{R}{2} \right) \end{aligned}$$
(71)

where \(g_s(t)\) is a synthesis window, \(\delta _t = \delta _{\textrm{AFC}}+\delta _{\textrm{NR}}\) is the total delay from both stages and \(\delta _{\textrm{AFC}}\) is the delay from the AFC stage. A complete description of the cascade M-channel PEM-AFC and rank-1 MWF algorithm is provided in Algorithm 4 and a block diagram is provided in Fig. 2(c).

Algorithm 4 Cascade M-channel PEM-AFC and rank-1 MWF

6 Computational complexity

In [22], the computational complexity of the single-channel PEM-based AFC algorithm has been derived as \(O \left( \frac{6R \log _2(R) + 22R + n_A^2+n_A(5+R)}{R/2-n_A}\right)\) in terms of real multiplications. To obtain this expression, equal complexity for a real multiplication and a real division is assumed, as well as a complexity of \(R\log _2{R}\) for the fast Fourier transform (FFT) and inverse FFT operations. The NR stage of the rank-1 MWF in Sections 3.1 and 5.2 has a computational complexity in terms of real multiplications of \(O((4M)^3)\) per frequency bin; hence, by considering \(B=\frac{R}{2}+1\) frequency bins, the total computational complexity of the NR stage is \(O(64BM^3)\). The NR stage of the rank-2 MWF in Section 4.1 has a total computational complexity in terms of real multiplications of \(O(64B(M+1)^3)\). Table 1 shows the computational complexity of the AFC stage and NR stage, in terms of real multiplications, for each of the presented algorithms. The following abbreviations are used in Table 1 and in the remainder of the paper: the cascade M-channel rank-1 MWF and PEM-AFC algorithm is abbreviated as Rank-1 NR-AFC, the cascade \((M+1)\)-channel rank-2 MWF and PEM-AFC algorithm as Rank-2 NR-AFC, and the cascade M-channel PEM-AFC and rank-1 MWF as AFC-NR.
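For concreteness, the complexity expressions above can be evaluated for an example configuration; the parameter values below are assumptions for illustration, not the settings of Table 2.

```python
import math

M, R, n_A = 4, 512, 20                    # mics, window length, AR model order
B = R // 2 + 1                            # number of frequency bins

afc = (6 * R * math.log2(R) + 22 * R + n_A ** 2 + n_A * (5 + R)) / (R / 2 - n_A)
nr_rank1 = 64 * B * M ** 3                # M-channel rank-1 MWF (Sections 3.1, 5.2)
nr_rank2 = 64 * B * (M + 1) ** 3          # (M+1)-channel rank-2 MWF (Section 4.1)

print(f"single-channel AFC stage : {afc:10.1f} real mult.")
print(f"rank-1 MWF NR stage      : {nr_rank1:10d} real mult.")
print(f"rank-2 MWF NR stage      : {nr_rank2:10d} real mult.")
print(f"AFC-NR (M AFC stages)    : {M * afc:10.1f} real mult.")
```

For M = 4, the rank-2 NR stage is roughly \((5/4)^3 \approx 2\) times as expensive as the rank-1 NR stage, while the AFC-NR algorithm multiplies the AFC cost by M.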

Table 1 Computational complexity of the presented algorithms

7 Simulation results

7.1 Scenario description

In order to assess the performance of the presented cascade algorithms, closed-loop simulations were performed using the following three scenarios.

  • Scenario 1 consists of a 4-microphone linear array with an inter-microphone distance of 10 cm and a loudspeaker which reproduces an amplified version of the desired speech source signal. The desired source is 25 cm away from the microphone array at an angle of \(0^{\circ }\). The loudspeaker is 1.4 m away from the microphone array at an angle of \(45^{\circ }\). Artificial impulse responses from the loudspeaker and the desired source to the microphones were generated using the randomized image method in [27], and the speech source signal was generated using a cascade of AR models (a sketch of this construction is given after this list). The signal generation using a cascade of AR models was performed by designing a 1024th-order low-pass filter with a cut-off frequency of \(0.9\pi\) rad/sample. Then, linear prediction of order 30 was applied to the low-pass filter coefficients to obtain the first stable AR model. The second model was designed by first choosing a central frequency \(f_{\text {cen}} = 689.1\,Hz\) and then obtaining the coefficients \(\textbf{a}_c\) as

    $$\begin{aligned} a_{\textrm{order}}&= \textrm{round}\left( \dfrac{F_s}{f_{\textrm{cen}}}\right) ,\end{aligned}$$
    (72)
    $$\begin{aligned} \textbf{a}_c&= \left[ {1\quad \textbf{0}_{(a_{\textrm{order}}-2) \times 1}\quad -0.1\quad -0.5\quad -0.1}\right] ^T \end{aligned}$$
    (73)

    where \(F_s\) is the sampling frequency and \(a_{\textrm{order}}\) is the order of the AR model. Results for different SNRs are shown.

  • Scenario 2 has the same set-up as Scenario 1; however, the source signal is replaced by a speech signal [28] and the reverberation time is set to 0.14 s. Results for different SNRs are shown.

  • Scenario 3 consists of a 4-microphone array with an inter-microphone distance of 10 cm and a loudspeaker located diagonally (at an angle of approximately \(-135^{\circ }\)) from it, which reproduces an amplified version of the desired signal. The desired source is in front of the microphone array, at an angle of approximately \(0^{\circ }\). Measured impulse responses [29] from the loudspeaker and the desired source to the microphones were used and the source signal was a speech signal [28]. The labels from [29] that represent the microphone positions are CMA20_90, CMA10_90, CMA10_-90, and CMA20_-90; similarly, the labels for the loudspeaker position and desired source position are SL5 and SL2, respectively. For exact coordinates and room description, the reader is referred to [29]. The results for different SNRs are shown. Although the reverberation time of these impulse responses is 0.5 s, they were truncated to 0.31 s which keeps most of the reverberant tail.
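The Scenario 1 source-signal construction (cfr. (72)–(73)) can be sketched as follows. The sampling frequency and the Yule-Walker implementation of the linear-prediction step are assumptions for illustration.

```python
import numpy as np
from scipy.linalg import toeplitz
from scipy.signal import firwin, lfilter

def lpc(x, p):
    """Yule-Walker linear prediction: returns [1, a_1, ..., a_p]."""
    r = np.correlate(x, x, "full")[len(x) - 1:len(x) + p]
    a = np.linalg.solve(toeplitz(r[:p]) + 1e-12 * np.eye(p), -r[1:p + 1])
    return np.concatenate(([1.0], a))

Fs, f_cen = 16000, 689.1                 # Fs is an assumed sampling frequency
lp = firwin(1025, 0.9)                   # 1024th-order low-pass, cutoff 0.9*pi
a1 = lpc(lp, 30)                         # first (stable) AR model

a_order = round(Fs / f_cen)              # (72)
a_c = np.zeros(a_order + 2)              # (73): [1, 0, ..., 0, -0.1, -0.5, -0.1]
a_c[0], a_c[-3:] = 1.0, [-0.1, -0.5, -0.1]

e = np.random.default_rng(4).standard_normal(5 * Fs)   # white excitation e(t)
s = lfilter([1.0], a_c, lfilter([1.0], a1, e))         # cascade of AR models
```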

The loudspeaker signal in all scenarios was obtained by multiplying the desired signal estimate \(\hat{d}(t)\) by the forward path gain and delaying it by the forward path delay. The noise added to the microphones in all scenarios was uncorrelated white noise. An oracle frequency-domain VAD, computed from the desired source signal, was used. This oracle VAD was obtained using the STFT representation of the desired speech signal: the average energy for each frequency bin was computed and used as a threshold for determining the speech activity in that frequency bin. For comparison, simulation results using the speech presence probability (SPP) function from [30] on the microphone signals are shown for scenario 2 and scenario 3. The original SPP function in [30] requires complete knowledge of the signal, which is not feasible in an AFC scenario due to the closed-loop system. Therefore, the SPP function was adapted to online processing by using as input signal the current frame and the previous 10 frames of the unprocessed microphone signal. The threshold for determining the presence of speech was set to 0.8. The window and impulse response lengths for each scenario are shown in Table 2. The forward path gain profile used for scenario 1 is shown in Fig. 3, with \(K_{\textrm{MSG}}\) defined in Section 7.2.2. Similar forward path gain profiles were used for scenario 2 and scenario 3; however, the duration of the signals is different. The gain profile was chosen such that the noise-only and speech-plus-noise correlation matrices in the three algorithms could be updated while the system is stable, after which the gain is gradually increased to test the proposed algorithms. The forward path delay in the simulations depends on the window size used for both the WOLA and OLS procedures. In all simulations, the forward path delay was set to \(\frac{3R}{2}\). An R-samples-long root-squared Hann window was used in the WOLA filterbank for the NR stage, and an R-samples-long rectangular window was used in the OLS filterbank for the AFC stage.

Fig. 3 Forward path gain profile for scenario 1

Table 2 Scenarios and simulation parameters

7.2 Feedback cancelation performance measures

7.2.1 Misadjustment (Mis)

The Mis measure is defined as the normalized distance in dB between the true and estimated feedback path in the time domain. Alternatively, by Parseval's theorem, the Mis can be expressed in the frequency domain as [9]

$${\textrm{Mis}}(l) = 20 \log _{10} \left[ \dfrac{\frac{1}{R}\sum _{\kappa =0}^{R-1}\left| f^{(r)}(\kappa ) - \hat{f}^{(r)}(\kappa ,l)\right| ^2}{\frac{1}{R}\sum _{\kappa =0}^{R-1}\left| f^{(r)}(\kappa )\right| ^2}\right] \quad {\textrm{dB}}$$
(74)

where \(f^{(r)}(\kappa )\) is the true STFT-domain transfer function from the loudspeaker to the reference microphone. To compute this metric, the impulse response was first truncated to the STFT length.
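A direct implementation of (74), including the truncation just mentioned, could read as follows; the function name is illustrative.

```python
import numpy as np

def misadjustment_db(f_ir, f_hat_ir, R):
    """Mis per (74) from time-domain impulse responses and an R-point DFT."""
    F = np.fft.fft(f_ir[:R], R)          # truncate to the STFT length
    F_hat = np.fft.fft(f_hat_ir[:R], R)
    num = np.mean(np.abs(F - F_hat) ** 2)
    den = np.mean(np.abs(F) ** 2)
    return 20 * np.log10(num / den)      # mirrors (74) literally
```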

7.2.2 Added stable gain (ASG)

The ASG measure is based on the so-called maximum stable gain (MSG) which is the maximum gain achievable in the system without it becoming unstable. In a single-channel scenario with a spectrally flat forward path, the MSG is given by [1]

$${\textrm{MSG}}(l) = -20 \log _{10}\left[ \underset{\kappa \in \mathcal{P}^{(r)}(l)}{\max } \left| f^{(r)}(\kappa ) - \hat{f}^{(r)}(\kappa ,l) \right| \right] \quad {\textrm{dB}}$$
(75)

where \(\mathcal {P}^{(r)}(l)\) is the set of frequencies that satisfy the phase condition of the Nyquist stability criterion [1] at the reference microphone. The ASG is then obtained as

$$\textrm{ASG}(l) = \textrm{MSG}(l) - K_{\textrm{MSG}} \quad \textrm{dB}$$
(76)

where \(K_{\textrm{MSG}}\) is the MSG of the system when no feedback canceler is included, i.e., \(\hat{f}^{(r)}(\kappa ,l)=0 \; \forall \kappa ,l,\) in (75).
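A sketch of (75)–(76) is given below. The phase-condition set \(\mathcal {P}^{(r)}(l)\) is approximated by the bins whose phase lies within a small tolerance of a multiple of \(2\pi\); this is an illustrative simplification, since the exact set follows from the Nyquist stability criterion [1].

```python
import numpy as np

def _msg_db(resid, tol=0.1):
    """MSG per (75) over an approximate phase-condition set."""
    phase = np.angle(resid)
    P = np.abs((phase + np.pi) % (2 * np.pi) - np.pi) < tol  # phase ~ 2*pi*n
    if not np.any(P):                    # degenerate case: fall back to all bins
        P = np.ones(resid.shape, dtype=bool)
    return -20 * np.log10(np.max(np.abs(resid[P])))

def asg_db(F_true, F_hat):
    """ASG per (76): MSG with the canceler minus K_MSG (canceler set to 0)."""
    return _msg_db(F_true - F_hat) - _msg_db(F_true)
```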

When a NR stage is included in the closed-loop system, the expression in (75) can be modified to account for the NR filters. For this, the MSG is defined at a reference microphone as

$$\begin{aligned} {\textrm{MSG}}(l) = -20 \log _{10}\left[ \underset{\kappa \in \mathcal {P}^{(r)}(l)}{\max } \left| f^{\star (r)}(\kappa ,l) - \hat{f}^{\star (r)}(\kappa ,l) \right| \right] \quad {\textrm{dB}} \end{aligned}$$
(77)

where for an M-channel NR stage \(\hat{f}^{\star (r)}(\kappa ,l)=\hat{f}^{(r)}(\kappa ,l)\) and \(f^{\star (r)}(\kappa ,l)\) is defined as

$$\begin{aligned} f^{\star (r)}(\kappa ,l) = \hat{\textbf{w}}^H(\kappa ,l) \textbf{f}(\kappa ,l), \end{aligned}$$
(78)

and for an \((M+1)\)-channel NR stage \(f^{\star (r)}(\kappa ,l)\) and \(\hat{f}^{\star (r)}(\kappa ,l)\) are

$$\begin{aligned} f^{\star (r)}(\kappa ,l)&= \textbf{e}_{r+1}^H \hat{\textbf{W}}^H(\kappa ,l) \left[ \begin{array}{c}1\\ \textbf{f}(\kappa ,l)\end{array}\right] ,\end{aligned}$$
(79)
$$\begin{aligned} \hat{f}^{\star (r)}(\kappa ,l)&= \hat{f}^{(r)}(\kappa ,l)\, \textbf{e}_{r}^H \hat{\textbf{W}}^H(\kappa ,l) \left[ \begin{array}{c}1\\ \textbf{f}(\kappa ,l)\end{array}\right] . \end{aligned}$$
(80)

Then, the ASG can be computed as in (76), noting that \(K_{\textrm{MSG}}\) should be computed similarly to (77) with the initial value of \(\hat{\textbf{W}}\). For the simulations presented here, \(\hat{\textbf{W}}\) was initialized with \(\left[ \begin{array}{cc}1 & 0 \\ 0 & 1 \\ \boldsymbol{0}_{(M-1)\times 1} & \boldsymbol{0}_{(M-1)\times 1}\end{array}\right]\). It should be noted that a random initialization is also possible.

7.2.3 Signal distortion (SD)

The SD gives an indication of the distortion of the processed signal. Unweighted and weighted SD measures have been used in the literature [8, 9, 31, 32] for different speech enhancement algorithms. The frequency-weighted SD is defined as in [8]

$$\begin{aligned} \textrm{SD}(l) = \left( \int _{f_l}^{f_h} w_{\textrm{ERB}}(f) \left( 10 \log _{10} \dfrac{\Phi _e(f,l)}{\Phi _r(f,l)} \right) ^2 df \right) ^{1/2} \end{aligned}$$
(81)

where \(\Phi _e(f,l)\) is the PSD of the estimated signal, \(\Phi _r(f,l)\) is the PSD of the reference signal, f is the frequency in Hz, which can be related to \(\kappa\) as \(f=\frac{f_s\kappa }{R}\), with \(f_s\) being the sampling rate, and \(w_{ \textrm{ERB}}(f)\) is a weighting function which gives equal weight to each auditory critical band between \(f_l= 300\,Hz\) and \(f_h= 6400\,Hz\). For this metric, the estimated signal is \(\hat{d}(t)\) and the reference signal is \(H^{(r)}(q,t) s(t)\) (cfr. (5)). The measure is computed only during speech-plus-noise periods, and the average over all frames is presented.

7.3 Perceptual performance measures

For the perceptual assessment of the cascade algorithms presented in this paper, two metrics have been selected, namely the PESQ and the STOI [9, 33, 34]. The PESQ measure is part of an International Telecommunication Union (ITU) standard and is widely used to objectively assess the perceptual quality of a speech signal. The STOI measure is a correlation-based speech intelligibility measure that operates on the temporal envelopes of short speech frames. We used a MATLAB implementation of the STOI measure from [34] and the PESQ implementation from [35]. These metrics were chosen based on the results presented in [9], where objective metrics were compared to subjective evaluation results for AFC algorithms.
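For readers working in Python rather than MATLAB, comparable open-source packages exist. A usage sketch, assuming the third-party pystoi and pesq packages (these are not the implementations from [34, 35] used in this paper):

```python
import numpy as np
from pystoi import stoi   # pip install pystoi
from pesq import pesq     # pip install pesq

fs = 16000
ref = np.random.randn(4 * fs)                    # placeholder clean reference speech
est = ref + 0.1 * np.random.randn(4 * fs)        # placeholder processed/estimated signal

stoi_score = stoi(ref, est, fs, extended=False)  # intelligibility score in [0, 1]
pesq_mos = pesq(fs, ref, est, 'wb')              # wideband PESQ MOS-LQO
print(f"STOI: {stoi_score:.3f}, PESQ: {pesq_mos:.2f}")
```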

7.4 Closed-loop simulations

Closed-loop simulation results are presented in this section. For comparison, simulation results using the GSC algorithm from [14] are shown for all scenarios. Two noise references were used, and the loudspeaker signal was included as an extra noise reference. A recursive least squares (RLS) algorithm with a forgetting factor of 0.9999 was used to adapt the noise-canceling filters. The fixed beamformer and blocking matrix were selected as in [15], where the source is assumed to be in front of the array. The algorithms are abbreviated in the figure legends and table descriptors as mentioned in Section 6. The GSC from [14] using an RLS adaptive filter for the noise references is abbreviated as GSC-RLS. The three proposed algorithms and the data for scenario 1 are available in [36].
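For reference, a minimal sketch of one complex-valued RLS update, as it could drive the adaptive noise-canceling path of the GSC, is shown below; the variable names and initialization are assumptions for illustration, not the implementation of [14].

```python
import numpy as np

def rls_update(w, P, x, d, lam=0.9999):
    """One RLS step with forgetting factor lam (0.9999 as in the simulations):
    x is the stacked regressor from the noise references (including the
    loudspeaker signal), d the fixed-beamformer output sample."""
    Px = P @ x
    k = Px / (lam + np.vdot(x, Px))        # gain vector k = P x / (lam + x^H P x)
    e = d - np.vdot(w, x)                  # a priori error e = d - w^H x
    w = w + k * np.conj(e)                 # filter update
    P = (P - np.outer(k, np.conj(x)) @ P) / lam   # inverse-correlation update
    return w, P, e

# Typical initialization (assumed): zero filter, scaled identity P
n = 3 * 32                                 # e.g., 3 references x 32 taps
w, P = np.zeros(n, complex), 1e3 * np.eye(n)
```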

First, the assumption that \(\hat{\textbf{R}}_{\mathbf {yy|ss}}\) in (12) can be modeled as a rank-2 matrix is validated experimentally. A closed-loop simulation was performed without the NR and AFC stages using scenario 2. A fixed beamformer with randomly chosen coefficients was used to combine the microphone signals. No noise was included in the microphone signals, \(\beta =0.9\), the forward path gain was set as in Fig. 3 with \(K_1=K_{\textrm{MSG}}-15\,\textrm{dB}\) and \(K_2=K_{\textrm{MSG}}-10\,\textrm{dB}\), the forward path delay was set to \(\frac{3R}{2}\), and \(R \in \{512,1024,2048\}\) samples. The speech correlation matrix \(\hat{\textbf{R}}_{\mathbf {yy|ss}}\) was computed in the closed-loop system, and its eigenvalues (cfr. (40)) are plotted in Fig. 4 over time. It can be seen that for all R there are two distinct dominant eigenvalues, which validates the assumption of modeling \(\hat{\textbf{R}}_{\mathbf {yy|ss}}\) as a rank-2 matrix. It is noted that as the forward path gain increases (after 6 s), these two eigenvalues move closer to each other. Similarly, as R decreases, the difference between these two eigenvalues and the remaining ones decreases, since the forward path delay, which is defined in terms of R, also decreases.
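The sketch below illustrates how such an eigenvalue check can be performed at a single STFT bin, assuming snapshot matrices from speech-plus-noise and noise-only periods are available; variable names are illustrative.

```python
import numpy as np

def speech_corr_eigenvalues(Y_sn, Y_n):
    """Eigenvalues of the speech correlation matrix estimate (cfr. (40)).
    Y_sn, Y_n: (M+1) x T matrices of STFT snapshots collected during
    speech-plus-noise and noise-only periods at one frequency bin."""
    R_yy = Y_sn @ Y_sn.conj().T / Y_sn.shape[1]     # speech-plus-noise correlation
    R_nn = Y_n @ Y_n.conj().T / Y_n.shape[1]        # noise-only correlation
    R_ss = R_yy - R_nn                              # speech correlation estimate
    R_ss = (R_ss + R_ss.conj().T) / 2               # enforce Hermitian symmetry
    return np.sort(np.linalg.eigvalsh(R_ss))[::-1]  # two dominant values expected
```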

Fig. 4 Eigenvalues of \(\hat{\textbf{R}}_{\textbf{yy|ss}}\) for different window lengths

Figure 5 shows the ASG and Mis for three iSNRs for all algorithms using scenario 1. The iSNR was computed in the reference microphone before any processing of the microphone signals. In addition, the STOI and SD scores for each algorithm are shown in Table 3. The forward path gain was set as in Fig. 3 with \(K_1=K_{\textrm{MSG}}-5\,\textrm{dB}\) and \(K_2=K_{\textrm{MSG}}+10\,\textrm{dB}\). For the GSC-RLS, the gain was fixed at \(K_1=K_{\textrm{MSG}}-5\,\textrm{dB}\) to avoid instability in the closed-loop system. It is observed that both the Rank-2 NR-AFC and the AFC-NR increase the ASG and reduce the Mis. Furthermore, their STOI and SD scores outperform those of the Rank-1 NR-AFC and the GSC-RLS for all iSNRs.

Fig. 5 ASG and Mis for the three cascade algorithms in scenario 1

Table 3 STOI and SD for the three cascade algorithms using scenario 1

Figure 6 shows the ASG and Mis for all algorithms using scenario 2. The STOI, PESQ-MOS, and SD scores are shown in Table 4. The forward path gain was set as in Fig. 3 with \(K_1=K_{\textrm{MSG}}-5\,\textrm{dB}\) and \(K_2=K_{\textrm{MSG}}+10\,\textrm{dB}\) for all algorithms. Results for the Rank-2 NR-AFC and the AFC-NR algorithms using the SPP function are also included. It can be seen that the Rank-2 NR-AFC and the AFC-NR outperform the Rank-1 NR-AFC in terms of ASG and Mis. The GSC-RLS also increases the ASG of the system, although when the forward path gain approaches its maximum value, \(K_2\), the Mis starts to diverge, which makes the closed-loop system unstable. As in scenario 1, the STOI, PESQ-MOS, and SD scores of both the Rank-2 NR-AFC and AFC-NR algorithms when using an oracle VAD outperform those of the Rank-1 NR-AFC and the GSC-RLS algorithms for all iSNRs. As expected, the inclusion of the SPP function degrades the performance of the Rank-2 NR-AFC and AFC-NR algorithms due to poorer estimates of the correlation matrices.

Fig. 6 ASG and Mis for the three cascade algorithms in scenario 2

Table 4 STOI, SD, and PESQ for the three cascade algorithms using scenario 2

Figure 7 shows the ASG and Mis for all algorithms using scenario 3. The forward path gain was set as in Fig. 3 with \(K_1=K_{\textrm{MSG}}-5\,\textrm{dB}\) and \(K_2=K_{\textrm{MSG}}+10\,\textrm{dB}\). For the GSC-RLS, \(K_1=K_{\textrm{MSG}}-10\,\textrm{dB}\) and \(K_2=K_{\textrm{MSG}}-5\,\textrm{dB}\). It can be seen that the ASG is increased for the Rank-2 NR-AFC and the AFC-NR algorithms for all iSNRs. It is also observed that the Rank-2 NR-AFC and AFC-NR decrease the Mis, although not as much as in scenarios 1 and 2. It is also noted that the GSC-RLS increases the ASG until the forward path gain starts to increase, after which the system becomes unstable. The STOI and SD scores are presented in Table 5. Both the Rank-2 NR-AFC and AFC-NR outperform the Rank-1 NR-AFC and the GSC-RLS algorithm for all iSNRs.

Fig. 7 ASG and Mis for the three cascade algorithms in scenario 3

Table 5 STOI and SD for the three cascade algorithms using scenario 3

The high ASG values observed for the Rank-2 NR-AFC and AFC-NR algorithms in scenarios 1 and 2 can be explained by the inclusion of the NR filters in the ASG computation (cfr. (78)–(80)), which means that the MWF also influences the stability of the system. The fluctuating ASG values for the Rank-1 NR-AFC algorithm indicate that system stability is not guaranteed. This has been confirmed both by the perceptual performance scores in Tables 3 and 4 and by the presence of howling in the resulting audio signals. Additionally, it should be noted that the SD scores in Tables 3, 4 and 5 for all algorithms are considerably higher than those reported in the literature [9]. The reason for this is the sensitivity of this metric to the presence of noise in the microphone signals, which distorts the signal. In the literature, most of the considered SNRs are around 30 dB, which is considerably higher than the ones in this paper. Similarly, the STOI and PESQ scores are low in all scenarios. This is because the metrics are computed using the estimate of the desired speech component in the closed-loop system, which means that all changes in the NR and AFC filters are reflected in the desired signal estimate. In scenario 3, the feedback path estimate is under-modeled (cfr. Table 2), which explains the low ASG values for all the algorithms. The estimated feedback path has a smoother frequency response than the true feedback path, which can cause a magnitude difference in the ASG computation, resulting in a slowly increasing ASG. As in scenarios 1 and 2, the system is not stable when using the Rank-1 NR-AFC algorithm. The GSC-RLS algorithm performs well whenever the forward path gain is not too close to the MSG; however, it should be noted that changes in the acoustic environment cannot be tracked by this algorithm due to the prior knowledge that is required.

8 Conclusions

Three cascade multi-channel NR and AFC algorithms have been presented, and three different scenarios have been used to compare their performance in simulations. It is shown that both the cascade \((M+1)\)-channel rank-2 MWF and PEM-AFC and the cascade M-channel PEM-AFC and rank-1 MWF algorithms outperform the cascade M-channel rank-1 MWF and PEM-AFC in terms of ASG and Mis. The simulation results in Section 7 further show that both algorithms are suitable for solving the combined AFC and NR problem in speech applications. It is also shown that, by performing a rank-2 approximation of the speech correlation matrix, the feedback path can be correctly estimated when an NR stage precedes the AFC stage.

Availability of data and materials

The algorithms are publicly available at [36].

Notes

  1. It is noted that u(t) may also introduce an additional noise component to \(x^{(m)}(t)\), cfr. (3).

  2. The STFT-domain multiplicative transfer function model in (10) is an approximation; the approximation is better when the STFT uses frequency-selective analysis filters and when the frame length matches that of the room impulse responses [24].

References

  1. T. van Waterschoot, M. Moonen, Fifty years of acoustic feedback control: state of the art and future challenges. Proc. IEEE 99(2), 288–327 (2011)

  2. M. Guo, S.H. Jensen, J. Jensen, Evaluation of state-of-the-art acoustic feedback cancellation systems for hearing aids. J. Audio Eng. Soc. 61(3), 125–137 (2013)

  3. M. Guo, S.H. Jensen, J. Jensen, Novel acoustic feedback cancellation approaches in hearing aid applications using probe noise and probe noise enhancement. IEEE Trans. Audio Speech Lang. Process. 20(9), 2549–2563 (2012). https://doi.org/10.1109/TASL.2012.2206025

  4. M. Guo, S.H. Jensen, J. Jensen, S.L. Grant, in Proc. 20th European Signal Process. Conf. (EUSIPCO ’12). On the use of a phase modulation method for decorrelation in acoustic feedback cancellation (2012). https://ieeexplore.ieee.org/abstract/document/6333787

  5. H. Schepker, S.E. Nordholm, L.T.T. Tran, S. Doclo, Null-steering beamformer-based feedback cancellation for multi-microphone hearing aids with incoming signal preservation. IEEE/ACM Trans. Audio Speech Lang. Process. 27(4), 679–691 (2019). https://doi.org/10.1109/TASLP.2019.2892234

  6. F. Strasser, H. Puder, Adaptive feedback cancellation for realistic hearing aid applications. IEEE/ACM Trans. Audio Speech Lang. Process. 23(12), 2322–2333 (2015). https://doi.org/10.1109/TASLP.2015.2479038

  7. A. Spriet, M. Moonen, I. Proudler, in Proc. 11th European Signal Process. Conf. Feedback cancellation in hearing aids: an unbiased modelling approach (2002), pp. 1–4

  8. A. Spriet, M. Moonen, J. Wouters, Evaluation of feedback reduction techniques in hearing aids based on physical performance measures. J. Acoust. Soc. Amer. 128(3), 1245–1261 (2010)

  9. G. Bernardi, T. van Waterschoot, J. Wouters, M. Moonen, Subjective and objective sound-quality evaluation of adaptive feedback cancellation algorithms. IEEE/ACM Trans. Audio Speech Lang. Process. 26(5), 1010–1024 (2018)

  10. J. Benesty, J. Chen, Y.A. Huang, S. Doclo, in Speech Enhancement. Study of the Wiener filter for noise reduction (Springer, Berlin Heidelberg, 2005), pp. 9–41

  11. J. Benesty, J.R. Jensen, M.G. Christensen, J. Chen, Speech enhancement: a signal subspace perspective (Elsevier, Oxford, 2014)

  12. R. Serizel, M. Moonen, B. Van Dijk, J. Wouters, Low-rank approximation based multichannel Wiener filter algorithms for noise reduction with application in cochlear implants. IEEE/ACM Trans. Audio Speech Lang. Process. 22(4), 785–799 (2014)

  13. D. Wang, J. Chen, Supervised speech separation based on deep learning: an overview. IEEE/ACM Trans. Audio Speech Lang. Process. 26(10), 1702–1726 (2018). https://doi.org/10.1109/TASLP.2018.2842159

  14. A. Spriet, G. Rombouts, M. Moonen, J. Wouters, Combined feedback and noise suppression in hearing aids. IEEE Trans. Audio Speech Lang. Process. 15(6), 1777–1790 (2007). https://doi.org/10.1109/TASL.2007.896670

  15. G. Rombouts, A. Spriet, M. Moonen, Generalized sidelobe canceller based combined acoustic feedback- and noise cancellation. Signal Process. 88(3), 571–581 (2008). https://doi.org/10.1016/j.sigpro.2007.08.018

  16. A. Bastari, S. Squartini, F. Piazza, in 2008 Hands-Free Speech Communication and Microphone Arrays. Joint acoustic feedback cancellation and noise reduction within the prediction error method framework (2008), pp. 228–231. https://doi.org/10.1109/HSCMA.2008.4538728

  17. G. Rombouts, T. van Waterschoot, K. Struyve, M. Moonen, Acoustic feedback cancellation for long acoustic paths using a nonstationary source model. IEEE Trans. Signal Process. 54(9), 3426–3434 (2006)

  18. H. Schepker, S. Doclo, in Proc. 2019 IEEE Workshop Appls. Signal Process. Audio Acoust. (WASPAA ’19). Active feedback suppression for hearing devices exploiting multiple loudspeakers (2019), pp. 60–64. https://doi.org/10.1109/WASPAA.2019.8937187

  19. M. Vashkevich, E. Azarov, N. Petrovsky, D. Likhachov, A. Petrovsky, in Proc. 2017 IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP ’17). Real-time implementation of hearing aid with combined noise and acoustic feedback reduction based on smartphone (2017), pp. 6570–6571. https://doi.org/10.1109/ICASSP.2017.8005301

  20. S. Ruiz, T. van Waterschoot, M. Moonen, Distributed combined acoustic echo cancellation and noise reduction in wireless acoustic sensor and actuator networks. IEEE/ACM Trans. Audio Speech Lang. Process. 30, 534–547 (2022)

  21. S. Ruiz, T. van Waterschoot, M. Moonen, in Proc. 2022 IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP ’22). Cascade multi-channel noise reduction and acoustic feedback cancellation (2022), pp. 676–680. https://doi.org/10.1109/ICASSP43922.2022.9747291

  22. G. Bernardi, T. van Waterschoot, J. Wouters, M. Moonen, in Proc. 2015 IEEE Workshop Appls. Signal Process. Audio Acoust. (WASPAA ’15). An all-frequency-domain adaptive filter with PEM-based decorrelation for acoustic feedback control (2015), pp. 1–5. https://doi.org/10.1109/WASPAA.2015.7336931

  23. R. Crochiere, A weighted overlap-add method of short-time Fourier analysis/synthesis. IEEE Trans. Acoust. Speech Signal Process. 28(1), 99–102 (1980)

  24. Y. Avargel, I. Cohen, On multiplicative transfer function approximation in the short-time Fourier transform domain. IEEE Signal Process. Lett. 14(5), 337–340 (2007)

  25. F. Jabloun, B. Champagne, in Speech Enhancement. Signal subspace techniques for speech enhancement (Springer, Berlin Heidelberg, 2005), pp. 135–159

  26. A. Bertrand, M. Moonen, Robust distributed noise reduction in hearing aids with external acoustic sensor nodes. EURASIP J. Adv. Signal Process. 2009, 1–14 (2009)

  27. E. De Sena, N. Antonello, M. Moonen, T. van Waterschoot, On the modeling of rectangular geometries in room acoustic simulations. IEEE/ACM Trans. Audio Speech Lang. Process. 23(4), 774–786 (2015)

  28. Bang & Olufsen, Music for Archimedes. Compact Disc B&O (1992)

  29. T. Dietzen, R. Ali, M. Taseska, T. van Waterschoot, MYRiAD: a multi-array room acoustic database. ESAT-STADIUS Tech. Rep. TR 22-118, KU Leuven, Belgium (2022, submitted for publication)

  30. T. Gerkmann, R.C. Hendriks, Unbiased MMSE-based noise power estimation with low complexity and low tracking delay. IEEE Trans. Audio Speech Lang. Process. 20(4), 1383–1393 (2012). https://doi.org/10.1109/TASL.2011.2180896

  31. S. Gannot, I. Cohen, Speech enhancement based on the general transfer function GSC and postfiltering. IEEE Trans. Speech Audio Process. 12(6), 561–571 (2004)

  32. R. Aichner, Acoustic blind source separation in reverberant and noisy environments. Ph.D. thesis, Friedrich-Alexander-Universität Erlangen-Nürnberg (2007)

  33. ITU-T Rec. P.862, Perceptual evaluation of speech quality (PESQ): an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs (International Telecommunication Union, Geneva, 2001)

  34. C.H. Taal, R.C. Hendriks, R. Heusdens, J. Jensen, in Proc. 2010 IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP ’10). A short-time objective intelligibility measure for time-frequency weighted noisy speech (2010), pp. 4214–4217. https://doi.org/10.1109/ICASSP.2010.5495701

  35. J. Donley, pesq-mex (2017). https://github.com/ludlows/pesq-mex.git. Accessed 12 Apr 2021

  36. S. Ruiz, AFC-NR (2022). https://github.com/rogaits/AFC-NR

Acknowledgements

Not applicable.

Funding

This research work was carried out at the ESAT Laboratory of KU Leuven, in the frame of Research Council KU Leuven Project C3-19-00221 “Cooperative Signal Processing Solutions for IoT-based Multi-User Speech Communication Systems,” Fonds de la Recherche Scientifique - FNRS, and the Fonds Wetenschappelijk Onderzoek - Vlaanderen under EOS Project no 30452698 ’(MUSE-WINET) MUlti-SErvice WIreless NETwork’ and the European Research Council under the European Union’s Horizon 2020 Research and Innovation Program/ERC Consolidator Grant: SONORA (no. 773268). This paper reflects only the authors’ views and the Union is not liable for any use that may be made of the contained information. The scientific responsibility is assumed by its authors.

Author information

Contributions

SR, TvW, and MM jointly developed the idea of using an \((M+1)\)-channel data model in the multichannel Wiener filter formulation for combined acoustic feedback cancelation and noise reduction. SR, TvW, and MM jointly developed the research methodology to turn this concept into a usable and effective algorithm. SR, TvW, and MM jointly designed and interpreted the computer simulations. SR implemented the computer simulations. All authors contributed to writing the manuscript and read and approved the final manuscript.

Corresponding author

Correspondence to Santiago Ruiz.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Cite this article

Ruiz, S., van Waterschoot, T. & Moonen, M. Cascade algorithms for combined acoustic feedback cancelation and noise reduction. J AUDIO SPEECH MUSIC PROC. 2023, 37 (2023). https://doi.org/10.1186/s13636-023-00296-5
