
Single-channel acoustic echo cancellation in noise based on gradient-based adaptive filtering

Abstract

In this paper, a two-stage scheme is proposed to deal with the difficult problem of acoustic echo cancellation (AEC) in the single-channel scenario in the presence of noise. To overcome the major challenge of obtaining a separate reference signal in adaptive filter-based AEC, a delayed version of the echo- and noise-suppressed signal is proposed as the reference. A modified objective function is thereby derived for a gradient-based adaptive filter algorithm, and a proof of its convergence to the optimum Wiener-Hopf solution is established. The output of the AEC block is fed to an acoustic noise cancellation (ANC) block, where a spectral subtraction-based algorithm with adaptive spectral floor estimation is employed. To obtain fast but smooth convergence with the maximum possible echo and noise suppression, a set of updating constraints is proposed based on various speech characteristics (e.g., energy and correlation) of the reference and current frames, considering whether they are voiced, unvoiced, or pause. Extensive experimentation is carried out on several echo- and noise-corrupted natural utterances taken from the TIMIT database, and it is found that the proposed scheme can significantly reduce the effect of both echo and noise in terms of objective and subjective quality measures.

1 Introduction

The phenomenon of acoustic echo occurs when the output speech signal from a loudspeaker gets reflected from different surfaces, like ceilings, walls, and floors and then fed back to the microphone. In its worst case, acoustic echo can cause howling of a significant portion of sound energy [1, 2]. In real life applications, such as a lecture in a large conference hall or in the public address system of a trade fair, the presence of acoustic echo along with the environmental noise is a very common phenomenon, which degrades the speech quality even leading to complete loss of intelligibility.

To deal with the problem of acoustic echo cancellation (AEC), echo suppressors, earphones, and directional microphones have conventionally been used, which generally place restrictions on the talkers’ movement [2]. As an alternative to such hardware-based solutions, adaptive filter algorithms are widely applied, where, apart from the input channel, a separate echo-free reference channel is required [3–13]. Among the different adaptive filter algorithms, the least mean squares (LMS) algorithm and its variants are very popular for their satisfactory performance and low computational burden [4, 10, 12–14]. Besides these algorithms, the recursive least squares (RLS) algorithm is well known for its fast convergence at the expense of computational complexity [13]. Adaptive filter algorithms have also been used for acoustic noise cancellation (ANC) [15].

Some methods deal with both acoustic echo and noise cancellation (AENC) [16–18]. The echo canceller used in [16] utilizes a sub-band noise cancellation scheme. In [17], echo cancellation is done by an adaptive LMS filter, while a linear prediction error filter removes the residual echo and noise. In [18], a single Wiener filter is employed to simultaneously suppress the echo and noise. It should be noted that all these AENC methods employ more than one microphone, whereas solutions using a single microphone are preferable in most real-life applications.

In this paper, an AENC scheme is proposed which can efficiently deal with the single-channel scenario. First, unlike the conventional LMS algorithm, a gradient-based adaptive LMS algorithm is developed for single-channel AEC by considering the delayed version of the previously echo- and noise-suppressed signal as the reference. Preliminary results obtained by using this idea are reported in [19]. In the current paper, however, an analytical proof of convergence towards the optimum Wiener-Hopf solution is presented. Next, a single-channel ANC algorithm based on spectral subtraction with adaptive spectral floor estimation is developed, which reduces not only the effect of noise but also some residual echo. Finally, by analyzing different speech characteristics of the reference and current frames, multiconditional updating constraints are proposed in order to obtain precise control over the convergence characteristics. For performance evaluation, extensive experimentation is conducted on several real-life echo- and noise-corrupted speech signals in different acoustic environments.

2 Problem formulation

To formulate the problem of single-channel AENC, a dual channel AENC scheme is first presented in Figure 1 (according to [17]) for better understanding. Here, $s_1(n)$ and $s_2(n)$ are the speech signals of the near-end and far-end speakers, while $v_1(n)$ and $v_2(n)$ are the respective additive noises. The noise-corrupted far-end signal $(s_2(n)+v_2(n))$ is played through a loudspeaker in the near-end acoustic room environment, and the echo signal $x_2(n)$ is generated. Thus, the input $y_1(n)$ to the near-end microphone is given by

$$y_1(n) = s_1(n) + v_1(n) + x_2(n). \tag{1}$$
Figure 1

Adaptive filter-based echo and noise cancellation in dual channel communication system.

The task of the adaptive filter-based AEC block placed at the near-end is to produce an estimate $\hat{x}_2(n)$ of the echo $x_2(n)$ by minimizing the error

$$e_1(n) = y_1(n) - \hat{x}_2(n). \tag{2}$$

Two key features of the dual channel system are (i) the availability of a separate reference signal required for the adaptive filter (here, for example, the delayed version of $(s_2(n)+v_2(n))$) and (ii) different speakers for the input and echo signals. Moreover, the use of a double talk detector (DTD) helps in controlling the update process. Unfortunately, these features are absent in the single-channel scenario shown in Figure 2. Instead of two speakers, in this case the microphone receives the input $s(n)$ corrupted by the noise $v(n)$ and by the echo generated from the same speaker.

Figure 2

Single-channel acoustic echo generation in noisy room environment.

In the presence of noise v(n), the sole microphone input signal in single-channel scenario is given by

$$y(n) = s(n) + v(n) + x_s(n) + x_v(n), \tag{3}$$

where $x_s(n)$ and $x_v(n)$ denote the echo of the input speech and noise, respectively. The echo signals can be expressed as

$$x_s(n) = \mathbf{a}_n^T \mathbf{s}(n-k_0), \tag{4}$$
$$x_v(n) = \mathbf{a}_n^T \mathbf{v}(n-k_0), \tag{5}$$

where $\mathbf{s}(n-k_0)=[s(n-k_0-1), s(n-k_0-2), \ldots, s(n-k_0-p)]^T$ and $\mathbf{v}(n-k_0)=[v(n-k_0-1), v(n-k_0-2), \ldots, v(n-k_0-p)]^T$, with $k_0$ being a predefined flat delay, and $\mathbf{a}_n=[a_n(1), a_n(2), \ldots, a_n(p)]^T$ consists of the coefficients corresponding to the acoustic room transfer function $A(z)$. The order $p$ and the coefficient values of $A(z)$ depend on the room characteristics. Note that in this case there is no scope for obtaining a separate echo-free reference or a separate noise-only reference, which makes the single-channel AENC problem extremely difficult to handle.
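The signal model in (3) to (5) can be illustrated with a short Python sketch. It is only a minimal simulation under stated assumptions: the room coefficients are held fixed over time (so $a_n(k)$ becomes `a[k - 1]`), samples before the start of the recording are taken as zero, and the function name `simulate_single_channel` and the demo values are hypothetical.

```python
import random


def simulate_single_channel(s, v, a, k0):
    """Build y(n) = s(n) + v(n) + x_s(n) + x_v(n) as in (3)-(5).

    s, v : equal-length lists of speech and noise samples
    a    : room coefficients, a[k - 1] standing in for a_n(k), k = 1..p
           (assumed time-invariant in this sketch)
    k0   : flat delay in samples
    """
    p = len(a)
    y = []
    for n in range(len(s)):
        echo = 0.0
        for k in range(1, p + 1):
            m = n - k0 - k  # index of the delayed sample s(n - k0 - k)
            if m >= 0:      # samples before the recording starts count as zero
                echo += a[k - 1] * (s[m] + v[m])  # x_s(n) + x_v(n)
        y.append(s[n] + v[n] + echo)
    return y


if __name__ == "__main__":
    random.seed(0)
    s = [random.gauss(0.0, 1.0) for _ in range(50)]  # stand-in for speech
    v = [random.gauss(0.0, 0.1) for _ in range(50)]  # white Gaussian noise
    y = simulate_single_channel(s, v, a=[0.5, -0.3], k0=10)
    # Up to and including n = k0 no echo has arrived, so y(n) = s(n) + v(n).
    print(all(abs(y[n] - (s[n] + v[n])) < 1e-12 for n in range(11)))
```

The sketch makes the difficulty visible: only `y` is observable at the microphone, while `s`, `v`, and the echo are mixed in the same channel.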

3 Proposed single-channel AENC scheme

3.1 Proposed two-stage setup

In Figure 3a, a simple block diagram showing the two stages of the proposed AENC scheme is presented, and in Figure 3b, the adaptive filter-based AEC algorithm involved in the first stage is shown in more detail. As in Figure 2, the input to the microphone $y(n)$ is described by (3). In single-channel AEC, for example while delivering a lecture in a large conference hall, the microphone in front of the speaker receives the input speech $s(n)$ corrupted by $v(n)$. Once this noise-corrupted speech is transmitted through the loudspeaker, an echo signal is generated, and thus, after some initial time delay, the microphone receives the noise-corrupted speech together with the echo of previously uttered speech. The task of AEC is to cancel the echo part from this input by using an adaptive filter algorithm. To obtain adaptively an estimate $\hat{x}_s(n)+\hat{x}_v(n)$ of the echo signal, we propose to utilize delayed versions of the previously echo-suppressed samples of the noisy speech as the reference signal [19]. A hat on a variable indicates an estimated value. The error signal $e(n)$ thus obtained is given by

$$e(n) = y(n) - [\hat{x}_s(n) + \hat{x}_v(n)]. \tag{6}$$
Figure 3

Block diagram of proposed single-channel AENC scheme. (a) Two stages and (b) details of proposed adaptive filter-based AEC algorithm.

The estimate of the echo signal can be expressed as

$$\hat{x}_s(n) + \hat{x}_v(n) = \hat{\mathbf{w}}_n^T [\hat{\mathbf{s}}(n-k_0) + \hat{\mathbf{v}}(n-k_0)], \tag{7}$$

where $\hat{\mathbf{w}}_n=[\hat{w}_n(1), \hat{w}_n(2), \ldots, \hat{w}_n(p)]^T$ is the estimated coefficient vector. The task of the adaptive filter is to obtain an optimum $\hat{\mathbf{w}}_n$ by minimizing the error in (6), i.e.,

$$e(n) = s(n) + \{(v(n) + \delta_s(n)) + \delta_v(n)\}, \tag{8}$$

where $\delta_s(n)=x_s(n)-\hat{x}_s(n)$ and $\delta_v(n)=x_v(n)-\hat{x}_v(n)$ are the residual echoes of the speech and noise portions of the input signal, respectively, and it is assumed that these signals exhibit the properties of white Gaussian noise. Next, $e(n)$ is passed through a spectral subtraction-based single-channel ANC block, which produces the output $\tilde{s}(n)\approx s(n)+\Psi(n)$ that closely resembles $s(n)$ provided that the residual echo-noise portion $\Psi(n)$ becomes very small.

It is to be noted that the task of noise reduction, unlike the proposed AENC scheme, may be carried out prior to the AEC block. However, because of possible nonlinearities introduced by the prior noise reduction block, no proper reference would be available for the single-channel AEC block [17]. Hence, the arrangement shown in Figure 3a is adopted, in which the noise reduction block also serves as a post-processor for attenuating the residual echo.

3.2 Development of proposed gradient-based single-channel LMS AEC scheme

A delayed version of the adaptive filter output $e(n)$ is proposed as the reference signal, and from (8), the filter output $e(n)$ can be written as

$$e(n) = \hat{s}(n) + \hat{v}(n), \tag{9}$$

where $\hat{s}(n)=s(n)+\delta_s(n)$ and $\hat{v}(n)=v(n)+\delta_v(n)$. The objective function of the adaptive filter involves minimization of the mean square of the error function, and using (6) it can be written as

$$E\{e^2(n)\} = E\{(s(n)+v(n))^2\} + E\{(x_s(n)+x_v(n)-\hat{x}_s(n)-\hat{x}_v(n))^2\} + 2E\{(s(n)+v(n))(x_s(n)+x_v(n)-\hat{x}_s(n)-\hat{x}_v(n))\}, \tag{10}$$

where E{.} denotes the expectation operator. In (10), it is intended to use the basic definition of cross-correlation operation, for example, the cross-correlation function between s(n) and v(n) is defined as

$$r_{sv}(m) = E\{s(n)\,v(n-m)\}, \tag{11}$$

where m denotes the lag. Using (4), (5), (7), and the above definition, the last term of (10) can be expressed as

$$2E\{[(s(n)+v(n))(x_s(n)+x_v(n)-\hat{x}_s(n)-\hat{x}_v(n))]\} = 2\sum_{k=1}^{p}\{(a_n(k)-\hat{w}_n(k))(r_{ss}(k_0+k)+r_{sv}(k_0+k)+r_{vs}(k_0+k)+r_{vv}(k_0+k)) - r_{s\delta_s}(k_0+k) - r_{s\delta_v}(k_0+k) - r_{v\delta_s}(k_0+k) - r_{v\delta_v}(k_0+k)\}. \tag{12}$$

Here, $r_{ss}(k_0+k)$ corresponds to the $(k_0+k)$th lag of the cross-correlation between $s(n)$ and its previous samples $s(n-k_0-k)$, and $r_{sv}(k_0+k)$ corresponds to the $(k_0+k)$th lag of the cross-correlation between $s(n)$ and $v(n-k_0-k)$. In a similar way, $r_{vs}(k_0+k)$, $r_{vv}(k_0+k)$, $r_{s\delta_s}(k_0+k)$, $r_{s\delta_v}(k_0+k)$, $r_{v\delta_s}(k_0+k)$, and $r_{v\delta_v}(k_0+k)$ can be defined. It is well known that the cross-correlation decreases rapidly with increasing lag when two signals are uncorrelated. In the ideal case, the cross-correlation function between two random noise signals would be nonzero only at zero lag. Since $v(n)$ is assumed to be white Gaussian noise and the value of $k_0$ is generally very large, the effect of the terms $r_{sv}(k_0+k)$, $r_{vs}(k_0+k)$, and $r_{vv}(k_0+k)$ in (12) can be neglected. Moreover, because of the noise-like characteristics of $\delta_s(n)$ and $\delta_v(n)$, one can also neglect $r_{s\delta_v}(k_0+k)$, $r_{v\delta_s}(k_0+k)$, and $r_{v\delta_v}(k_0+k)$ in (12). Hence, optimal filter performance occurs when $r_{ss}(n)$ is minimum, i.e., the least possible correlation between $s(n-k_0-k)$ and $s(n)$ is desired. As a result, (10) reduces to

$$E\{e^2(n)\} = E\{(s(n)+v(n))^2\} + E\{[x_s(n)+x_v(n)-\hat{x}_s(n)-\hat{x}_v(n)]^2\} + 2\sum_{k=1}^{p}(a_n(k)-\hat{w}_n(k))\,r_{ss}(k_0+k). \tag{13}$$

Here, the magnitude of $r_{ss}(k_0+k)$ strongly depends on the speech characteristics and the amount of flat delay $k_0$. For a reasonably large $k_0$, the effect of $r_{ss}(k_0+k)$ in (13) can be neglected, and minimization of (13) results in

$$\begin{aligned} &\frac{\partial E\{e^2(n)\}}{\partial \hat{\mathbf{w}}_n^T} = 0 \\ &E[\{x_s(n)+x_v(n)-\hat{x}_s(n)-\hat{x}_v(n)\}\{\hat{\mathbf{s}}(n-k_0)+\hat{\mathbf{v}}(n-k_0)\}] = 0. \end{aligned} \tag{14}$$

Hence, we obtain

$$E\{(x_s(n)+x_v(n))(\hat{\mathbf{s}}(n-k_0)+\hat{\mathbf{v}}(n-k_0))\} = \hat{\mathbf{w}}_n^T E[\{\hat{\mathbf{s}}(n-k_0)+\hat{\mathbf{v}}(n-k_0)\}\{\hat{\mathbf{s}}(n-k_0)+\hat{\mathbf{v}}(n-k_0)\}]. \tag{15}$$

The above equation is similar to Wiener-Hopf equation and its solution can be written as

$$\hat{\mathbf{w}}_n^T = \mathbf{R}^{-1}_{(s+v)(s+v)}(n-k_0)\,\mathbf{r}_{(x_s+x_v)(s+v)}(n-k_0), \tag{16}$$

where $\mathbf{r}_{(x_s+x_v)(s+v)}(n-k_0)$ consists of different lags of the cross-correlation between the echo signal $x_s(n)+x_v(n)$ and the noisy input signal $s(n)+v(n)$, while $\mathbf{R}_{(s+v)(s+v)}$ is the auto-correlation matrix of $s(n)+v(n)$. Under the assumptions stated earlier, $\hat{\mathbf{w}}_n$ given by (16) is the optimum solution in the mean square sense. Hence, it is shown that even for a single-channel noise-corrupted AEC problem, the optimum solution $\hat{\mathbf{w}}_n$ can be achieved.

For iterative estimation of optimal filter coefficients, the adaptive LMS algorithm is very popular. It is fast and efficient, and it does not require any correlation measurements or matrix inversion [13]. The update equation of the LMS adaptive algorithm is generally expressed as

$$\hat{\mathbf{w}}_{n+1}^T = \hat{\mathbf{w}}_n^T - \mu\nabla\xi(n), \tag{17}$$

where $\mu$ is the step factor controlling the stability and rate of convergence, $\xi(n)$ is the cost function, and $\nabla$ is the gradient operator. The LMS algorithm simply approximates the mean square error by the square of the instantaneous error, i.e., $\xi(n)=e^2(n)$, and therefore, from (6) and (7), the gradient of $\xi(n)$ can be expressed as

$$\nabla\xi(n) = \frac{\partial\xi(n)}{\partial\hat{\mathbf{w}}_n^T} = -2e(n)(\hat{\mathbf{s}}(n-k_0)+\hat{\mathbf{v}}(n-k_0)).$$

Thus, the update equation for the proposed single-channel LMS adaptive scheme can be written as

$$\hat{\mathbf{w}}_{n+1}^T = \hat{\mathbf{w}}_n^T + 2\mu e(n)(\hat{\mathbf{s}}(n-k_0)+\hat{\mathbf{v}}(n-k_0)). \tag{18}$$
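The update in (18), together with (6) and (7), can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the weights start at zero, the delayed filter output $e(n-k_0-k)$ is used directly as the reference (as (9) suggests), the update constraints of Section 4 are not applied, and the function name `single_channel_lms` is hypothetical.

```python
def single_channel_lms(y, p, k0, mu):
    """Adapt p echo-path weights from the microphone signal y alone.

    The reference at time n is the delayed filter output
    [e(n-k0-1), ..., e(n-k0-p)], which stands in for
    s_hat(n-k0) + v_hat(n-k0) in (7) and (18).
    Returns the final weights and the echo-suppressed signal e(n).
    """
    w = [0.0] * p
    e = [0.0] * len(y)
    for n in range(len(y)):
        # Delayed, previously echo-suppressed samples (zero before the start).
        ref = [e[n - k0 - k] if n - k0 - k >= 0 else 0.0
               for k in range(1, p + 1)]
        echo_hat = sum(wk * rk for wk, rk in zip(w, ref))  # estimate in (7)
        e[n] = y[n] - echo_hat                              # error in (6)
        w = [wk + 2.0 * mu * e[n] * rk                      # update in (18)
             for wk, rk in zip(w, ref)]
    return w, e
```

Because the reference is built from the filter's own past output, the loop needs no second channel. With a white input and a small step size, the weights would be expected to drift towards the true echo coefficients, in line with the negligible-correlation assumption behind (13).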

3.3 Convergence analysis of the proposed AEC scheme

Taking the expectation of both sides of the update equation (18), one can obtain

$$\underline{\hat{\mathbf{w}}}_{n+1}^T = \underline{\hat{\mathbf{w}}}_n^T + 2\mu E\{e(n)(\hat{\mathbf{s}}(n-k_0)+\hat{\mathbf{v}}(n-k_0))\}. \tag{19}$$

Here, an underline beneath $\hat{\mathbf{w}}_n$ is introduced to represent the expected value $E\{\hat{\mathbf{w}}_n\}$. For the $k$th unknown weight (where $k=1,2,\ldots,p$), using (6) and neglecting the effect of $r_{ss}(n)$, as already discussed in the previous subsection, the last term of (19) can be written as

$$2\mu E\{e(n)(\hat{\mathbf{s}}(n-k_0)+\hat{\mathbf{v}}(n-k_0))\} = 2\mu E\{[x_s(n)+x_v(n)-\hat{x}_s(n)-\hat{x}_v(n)](\hat{\mathbf{s}}(n-k_0)+\hat{\mathbf{v}}(n-k_0))\}. \tag{20}$$

Based on the assumptions on cross-correlation terms stated in the previous subsection, one can obtain

$$E\{e(n)(\hat{\mathbf{s}}(n-k_0)+\hat{\mathbf{v}}(n-k_0))\} = \mathbf{r}_{(x_s+x_v)(s+v)}(n-k_0) - \mathbf{R}_{(s+v)(s+v)}(n-k_0)\,\hat{\mathbf{w}}_n^T. \tag{21}$$

Using (21), the update equation (19) can be written as

$$\underline{\hat{\mathbf{w}}}_{n+1}^T = \underline{\hat{\mathbf{w}}}_n^T - 2\mu\,\mathbf{R}_{(s+v)(s+v)}(n-k_0)\,\underline{\hat{\mathbf{w}}}_n^T + 2\mu\,\mathbf{r}_{(x_s+x_v)(s+v)}(n-k_0). \tag{22}$$

Evaluating the homogeneous and particular solutions of (22), the total solution can be obtained as (see Appendix)

$$\underline{\hat{w}}^{U}_{n+1}(k) = C_k(1-2\mu\lambda(k))^{n} + \frac{1}{\lambda(k)}\,r^{U}(n-k_0-k), \tag{23}$$

where $\lambda(k)$ is the $k$th diagonal element of the eigenvalue matrix $\boldsymbol{\Lambda}$ obtained by eigenvalue decomposition of $\mathbf{R}_{(s+v)(s+v)}(n-k_0)$, and $r^{U}(n-k_0-k)$ is the $k$th element of $\mathbf{U}^T\mathbf{r}_{(x_s+x_v)(s+v)}(n-k_0)=\mathbf{r}^{U}_{(x_s+x_v)(s+v)}(n-k_0)$, with the matrix $\mathbf{U}$ consisting of the eigenvectors corresponding to the eigenvalues. Since the homogeneous part $(1-2\mu\lambda(k))^{n}$ diminishes with iterations in the iterative update procedure (provided $|1-2\mu\lambda(k)|<1$), (23) can be expressed in matrix form as

$$\underline{\hat{\mathbf{w}}}^T = \mathbf{U}\boldsymbol{\Lambda}^{-1}\mathbf{U}^T\mathbf{r}_{(x_s+x_v)(s+v)}(n-k_0) = \mathbf{R}^{-1}_{(s+v)(s+v)}(n-k_0)\,\mathbf{r}_{(x_s+x_v)(s+v)}(n-k_0). \tag{24}$$

Thus, the average value of the weight vector converges to the Wiener-Hopf solution, which is the optimum solution, as the number of iterations increases.
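The convergence statement can be checked numerically on a toy case. The sketch below iterates the mean-weight recursion (22) for a fixed 2×2 correlation matrix and compares the result with the closed-form solution of (24). The matrix `R`, the vector `r`, the step size, and both function names are hypothetical values chosen only for illustration, and the correlation quantities are assumed constant over time.

```python
def mean_weight_recursion(R, r, mu, iterations):
    """Iterate (22): w <- w - 2*mu*R*w + 2*mu*r, starting from zero weights."""
    p = len(r)
    w = [0.0] * p
    for _ in range(iterations):
        Rw = [sum(R[i][j] * w[j] for j in range(p)) for i in range(p)]
        w = [w[i] - 2.0 * mu * Rw[i] + 2.0 * mu * r[i] for i in range(p)]
    return w


def wiener_2x2(R, r):
    """Closed-form R^{-1} r for a 2x2 matrix, used only as a check of (24)."""
    det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
    return [(R[1][1] * r[0] - R[0][1] * r[1]) / det,
            (R[0][0] * r[1] - R[1][0] * r[0]) / det]


if __name__ == "__main__":
    R = [[2.0, 0.5], [0.5, 1.0]]  # illustrative auto-correlation matrix
    r = [1.0, 0.5]                # illustrative cross-correlation vector
    print(mean_weight_recursion(R, r, mu=0.1, iterations=500))
    print(wiener_2x2(R, r))
```

With these values both eigenvalues of `R` satisfy $|1-2\mu\lambda(k)|<1$, so the homogeneous part decays and the two printed vectors should agree to within rounding.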

3.4 Noise reduction in spectral domain

In the proposed AENC scheme, the ANC block operates frame by frame and performs noise reduction using a single-channel spectral subtraction algorithm [20–22]. According to (9), for the $i$th frame, the error signal over the duration of one frame can be written as

$$e_i(n) = \hat{s}_i(n) + \hat{v}_i(n). \tag{25}$$

Corresponding frequency domain representation is given by

$$E_i(\omega) = \hat{S}_i(\omega) + \hat{V}_i(\omega). \tag{26}$$

The magnitude squared spectrum of $\hat{s}_i(n)$ can be written as

$$|\hat{S}_i(\omega)|^2 = |E_i(\omega)|^2 - |\hat{V}_i(\omega)|^2 - \hat{V}_i(\omega)\hat{S}_i^{*}(\omega) - \hat{S}_i(\omega)\hat{V}_i^{*}(\omega). \tag{27}$$

It is desired to choose an estimate $\tilde{S}_i(\omega)$ that will minimize

$$\mathrm{Err}_i(\omega) = \left|\,|\tilde{S}_i(\omega)|^2 - |\hat{S}_i(\omega)|^2\,\right|. \tag{28}$$

Since the noise is assumed to be zero mean and uncorrelated with the signal, the expected values of the last two terms of (27) can be neglected. Thus, (28) can be expressed as

$$\mathrm{Err}_i(\omega) = |\tilde{S}_i(\omega)|^2 - |E_i(\omega)|^2 + E\{|\hat{V}_i(\omega)|^2\}. \tag{29}$$

This expression of $\mathrm{Err}_i(\omega)$ can be minimized by choosing

$$|\tilde{S}_i(\omega)|^2 = |E_i(\omega)|^2 - E\{|\hat{V}_i(\omega)|^2\}. \tag{30}$$

With an estimate of the noise spectrum $E\{|\hat{V}_i(\omega)|^2\}$, the signal spectrum $\tilde{S}_i(\omega)$ can be computed as

$$\tilde{S}_i(\omega) = |\tilde{S}_i(\omega)|\,e^{j\arg[E_i(\omega)]}, \tag{31}$$

where the phase $\arg[E_i(\omega)]$ is generally taken to be the phase of the noise-corrupted signal, which does not cause significant degradation in terms of loss of intelligibility of the speech signal [20]. It can be seen that an estimate of the magnitude spectrum $|\tilde{S}_i(\omega)|$ of the signal can be obtained provided an estimate of the noise spectrum $E\{|\hat{V}_i(\omega)|^2\}$ is available, which is generally computed during periods when speech is known a priori to be absent.

The final output of the AENC system is the speech frame $\tilde{s}_i(n)$, which consists of the original speech $s_i(n)$ and a negligible amount of noise-like signal $\Psi_i(n)$. The signal $\Psi_i(n)$, although very weak, may contain some signature of the input noise $v(n)$, the residual echo $\delta_s(n)$, and the residual noise $\delta_v(n)$. To overcome the problem of musical noise and to avoid the speech distortion caused by spectral subtraction, an overestimate of the noise power spectrum can be subtracted carefully such that the spectral floor is preserved [21]. Thus, (31) can be modified as

$$|\tilde{S}_i(\omega)|^2 = \begin{cases} |\hat{E}_i(\omega)|^2 - \alpha_{ss}E\{|\hat{V}_i(\omega)|^2\}, & \text{if } |\tilde{S}_i(\omega)|^2 > \beta_{ss}\{|\hat{V}_i(\omega)|^2\} \\ \beta_{ss}\{|\hat{V}_i(\omega)|^2\}, & \text{otherwise.} \end{cases} \tag{32}$$

Here, $\alpha_{ss}$ is the subtraction factor and $\beta_{ss}$ is the spectral floor parameter, with $\alpha_{ss}\ge 1$ and $0\le\beta_{ss}\le 1$. The noise power spectral density is estimated using the minimum statistics noise estimator proposed in [23], which can handle the time-varying nature of the noise.
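The over-subtraction rule with a spectral floor, combined with the noisy phase as in (31), can be sketched as follows. This is a simplified per-bin illustration, not the paper's implementation: the noise power per bin is assumed to be given (the minimum statistics estimator of [23] is not reproduced), the default values of `alpha_ss` and `beta_ss` are placeholders, and the function names are hypothetical.

```python
import cmath
import math


def subtract_with_floor(err_power, noise_power, alpha_ss=2.0, beta_ss=0.01):
    """Apply the rule of (32) to each frequency bin.

    err_power   : |E_i(w)|^2 per bin
    noise_power : estimated noise power per bin
    """
    out = []
    for e_pow, v_pow in zip(err_power, noise_power):
        candidate = e_pow - alpha_ss * v_pow  # over-subtracted power
        floor = beta_ss * v_pow               # spectral floor
        out.append(candidate if candidate > floor else floor)
    return out


def enhance_spectrum(E, noise_power, alpha_ss=2.0, beta_ss=0.01):
    """Rebuild complex bins from the cleaned magnitude and the noisy phase, as in (31)."""
    power = subtract_with_floor([abs(x) ** 2 for x in E],
                                noise_power, alpha_ss, beta_ss)
    return [cmath.rect(math.sqrt(pw), cmath.phase(x))
            for pw, x in zip(power, E)]
```

Bins where the over-subtracted power would fall below the floor are clamped rather than set to zero, which is the mechanism intended to limit musical noise.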

4 Development of adaptive update constraints

The AEC part of the proposed AENC scheme may suffer from some common problems of adaptive filter-based algorithms, such as slow convergence rate and fluctuation around the desired estimates, especially in practical cases where the assumption on negligibility of cross-correlation terms (stated in the previous section) may not strictly hold. In order to overcome such problems, some updating constraints are proposed based on the following speech characteristics:

(i) The level of cross-correlation
(ii) The amount of signal power
(iii) The mean square error (MSE) between consecutive estimates of the unknown filter coefficients.

Through extensive experimentation on different speech frames, it is found that the negligibility of the cross-correlation terms $r_{ss}(n)$, $r_{s\delta_v}(n)$, $r_{v\delta_s}(n)$, and $r_{v\delta_v}(n)$ (as described after (12)) strongly depends on the voicing characteristics of the speech frames and on the input noise. Because of the inherent periodicity of voiced speech, the degree of cross-correlation between two voiced speech frames of a person is higher than that between two unvoiced speech frames, which are random in nature. Regarding signal power, the ratio of the power of a voiced speech frame to that of an unvoiced speech frame is found to be higher than the ratio for two voiced speech frames. As white Gaussian noise is considered, the degree of cross-correlation between the speech and noise is found to be negligible, and the noise powers in two different frames may not differ significantly. As a result, the effect of input noise on the power ratio is found to be negligible.

For a flat delay of $k_0$ samples, the initial $k_0$ samples of the utterance $s(n)+v(n)$ can be treated as an echo-free reference signal responsible for generating the echo that corrupts the samples at or after $k_0$ samples. Considering a window of $M$ samples with $M\ll k_0$, the power of the reference signal $(\hat{s}(n-k_0)+\hat{v}(n-k_0))$ can be computed as

$$P_{\mathrm{ref}}(n) = \frac{1}{M}\sum_{i=-M/2}^{M/2-1}[\hat{s}(n-k_0+i)+\hat{v}(n-k_0+i)]^2. \tag{33}$$

For a window of the last $M$ samples of the echo-suppressed speech signal $\hat{s}(n)$, the average power $P_{\mathrm{sup}}(n)$ can be computed as

$$P_{\mathrm{sup}}(n) = \frac{1}{M}\sum_{j=0}^{M-1}[\hat{s}(n-j)+\hat{v}(n-j)]^2. \tag{34}$$

The ratio of $P_{\mathrm{ref}}(n)$ to $P_{\mathrm{sup}}(n)$ is denoted as the power ratio $P_{rs}(n)$ and is considered as one of the control characteristics.

Another important characteristic criterion is the correlation coefficient $C_{rs}(n)$ between a frame of the noisy reference signal $(\hat{s}(n-k_0)+\hat{v}(n-k_0))$ and a frame of the current noisy signal $(\hat{s}(n)+\hat{v}(n))$. For a frame length of $M$ samples, the correlation coefficient $C_{rs}(n)$ is defined as

$$C_{rs}(n) = \frac{1}{\sigma_{\hat{s}(n-k_0+i)+\hat{v}(n-k_0+i)}\,\sigma_{\hat{s}(n-j)+\hat{v}(n-j)}} \times \{\operatorname{cov}((\hat{s}(n-k_0+i)+\hat{v}(n-k_0+i)) \times (\hat{s}(n-j)+\hat{v}(n-j)))\}, \tag{35}$$

where −M/2≤i≤M/2−1 and 0≤j≤(M−1).

Finally, the parameter estimation accuracy is also considered for the purpose of analyzing the convergence property. In this regard, the mean square error $\mathrm{MSE}_{\mathrm{ideal}}(n)$ between the estimated coefficients $\hat{\mathbf{w}}_n$ and the true coefficients $\mathbf{a}_n$ is computed as

$$\mathrm{MSE}_{\mathrm{ideal}}(n) = \frac{1}{p}\sum_{k=1}^{p}[\hat{w}_n(k)-a_n(k)]^2. \tag{36}$$

In Figure 4, considering a real-life speech utterance of 250 ms corrupted by echo and noise, the behavior of the control parameters obtained by using (33), (34), (35), and (36) is shown. The speech utterance (/iy/−/ix/) contains a voiced phoneme followed by another voiced phoneme [24]. Here, $k_0=1{,}000$, $M=100$, $N_f=1{,}002$, a sampling frequency of 16 kHz, and SNR = 15 dB are used.

Figure 4

Characteristics of controlling factors - a voiced phoneme followed by another voiced frame.

In a similar fashion, Figure 5 considers a speech utterance consisting of a voiced phoneme /ih/ followed by an unvoiced phoneme /sh/, and Figure 6 considers a voiced phoneme /ih/ followed by a pause. It is observed that the characteristic parameters vary depending on the nature of the reference and current frames. When the current frame is a pause or weakly unvoiced, the power ratio becomes higher than when the current frame is voiced. In contrast, the correlation coefficient becomes smaller when measured between a voiced and an unvoiced frame, but considerably larger when measured between two voiced frames. It is also found that the presence of a voiced frame as the reference strongly governs the rate of convergence and the estimation error of the proposed LMS algorithm. In Figure 4, because a voiced frame is present throughout as both the reference and the current frame, the convergence performance is not very satisfactory and the estimation error is relatively high. On the other hand, in Figure 6, when the current frame is a pause, very fast convergence with little estimation error is obtained even in the presence of a voiced reference frame. In Figure 5, as the current frame is unvoiced instead of a pause, comparatively slower convergence is observed with higher estimation error.

Figure 5

Characteristics of controlling factors - a voiced phoneme followed by an unvoiced phoneme.

Figure 6

Characteristics of controlling factors - a voiced phoneme followed by pause frame.

Next, in Figures 7, 8, 9, the reference frame is unvoiced, and in Figures 10, 11, 12, it is a pause. When the reference frame is unvoiced, because there is little correlation between the current and reference frames, the convergence performance of the proposed LMS algorithm is found to be quite satisfactory irrespective of the power of the reference signal (strongly or weakly unvoiced). When the current frame is a pause, no matter whether the reference frame is voiced or unvoiced, fast convergence with high estimation accuracy is achieved using the proposed LMS algorithm. The reasons are (i) negligible cross-correlation between the reference frame and the current frame and (ii) a comparatively higher power ratio. In Figures 10, 11, 12, it is observed that even when the reference frame is a pause or stop, it may contain significant energy because of the presence of additive white noise. In these cases, a reasonable estimate of the room response can be obtained provided that the noise power is quite high. The findings in the above cases are summarized in Table 1.

Figure 7

Characteristics of controlling factors - an unvoiced frame followed by a voiced frame.

Figure 8

Characteristics of controlling factors - an unvoiced frame followed by another unvoiced frame.

Figure 9

Characteristics of controlling factors - an unvoiced frame followed by a pause.

Figure 10

Characteristics of controlling factors - pause followed by a voiced frame.

Figure 11

Characteristics of controlling factors - pause followed by an unvoiced frame.

Figure 12

Characteristics of controlling factors - pause followed by another pause.

Table 1 Variation of LMS updating performance due to various characteristics of reference and current speech frame

First of all, it is observed that a better convergence in terms of iterations and estimation error is obtained when the current frame is a pause (P) or stop and the reference frame is either voiced (V) or unvoiced (U), namely, V-P and U-P. This fact leads to a decision that the updating needs to be carried out at high level of power ratio, i.e.,

$$P_{rs}(n) = \frac{P_{\mathrm{ref}}(n)}{P_{\mathrm{sup}}(n)} \ge \zeta, \tag{37}$$

where $P_{\mathrm{ref}}(n)$ and $P_{\mathrm{sup}}(n)$ are defined in (33) and (34), respectively. If the lower bound $\zeta$ is chosen too large, the updating would be postponed in most instances, resulting in very slow convergence. On the other hand, a very small value of $\zeta$ may cause more frequent updates, where the possibility of wrong estimates of the filter coefficients would be higher, especially in the V-P, U-P, and P-P cases. Note that a lower bound on $P_{rs}(n)$ alone may not always be sufficient to ensure that the reference frame possesses significant energy. For example, Figure 13 shows that a high value of $P_{rs}(n)$ may arise (marked block in the figure) from an initial silence frame where only a very small amount of noise is present. To prevent updating in these situations, a lower bound $\beta$ on the power of the reference frame is employed, i.e., $P_{\mathrm{ref}}(n)\ge\beta$. The value of $\beta$ should exceed the power of speech pauses and ensure that the LMS update is postponed even if a frame of speech containing a partial pause is available as the reference. Hence, the first constraint for updating the algorithm is proposed as Condition I: $P_{rs}(n)\ge\zeta$ and $P_{\mathrm{ref}}(n)\ge\beta$.

Figure 13

Example of high power ratio during initial silence frame.

In some cases, it is observed that although the power ratio is very small, quite satisfactory updating is obtained, such as the U-V case shown in Figure 7. Another characteristic observed here is a lower value of the correlation coefficient $C_{rs}(n)$ together with a higher value of $P_{\mathrm{ref}}(n)$. Recall that the proposed AEC algorithm is developed on the assumption that the cross-correlation between the current frame and the reference frame is negligible. However, since both the reference and current frames may belong to the same person, in the case of a high degree of correlation, the adaptive algorithm would try to suppress a portion of the speech from the echo-corrupted signal, resulting in unusual degradation in convergence performance. Hence, introducing an upper bound on $C_{rs}(n)$, the second condition is proposed as Condition II: $C_{rs}(n)\le\Upsilon_1$ and $P_{\mathrm{ref}}(n)\ge\beta$.

The presence of a certain level of noise can be utilized as an advantage in pause instances, where the updating is generally not performed. Since white noise is uncorrelated with delayed versions of itself, updating at frames where only noise is present would be quite satisfactory. In this case, the value of $C_{rs}(n)$ must be very small, and thus another condition on updating is proposed as Condition III: $C_{rs}(n)\le\Upsilon_2\le\Upsilon_1$.

Another important factor is the MSE of the estimations of successive iterations, which is defined as

$$e_{\mathrm{coeff}}(n) = \frac{1}{p}\sum_{k=1}^{p}(\hat{w}_n(k)-\hat{w}_{n-1}(k))^2. \tag{38}$$

To continue the updating, an upper bound on the variation of successive estimates is set as the following condition: Condition IV: $e_{\mathrm{coeff}}(n)\le\aleph$.

Restricting $e_{\mathrm{coeff}}(n)$ to small values avoids updating at those instances where abrupt and significant changes occur in the estimated coefficients. In the proposed method, at least one of the above four conditions must be fulfilled in order to carry out the LMS update.
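The four conditions can be combined into a single gate placed in front of the LMS update, as in the following sketch. The thresholds are passed in explicitly, the conditions are applied exactly as written (so a negative $C_{rs}(n)$ also satisfies the upper bounds), a zero $P_{\mathrm{sup}}(n)$ is treated as an infinite power ratio, and the function names are hypothetical; all of these are choices of the sketch.

```python
def coefficient_change(w_now, w_prev):
    """e_coeff(n) of (38): mean squared change between successive estimates."""
    return sum((a - b) ** 2 for a, b in zip(w_now, w_prev)) / len(w_now)


def should_update(p_ref, p_sup, c_rs, e_coeff,
                  zeta, beta, upsilon1, upsilon2, aleph):
    """Return True when at least one of Conditions I-IV holds."""
    # A zero P_sup is treated as an infinite power ratio (choice of this sketch).
    p_rs = p_ref / p_sup if p_sup > 0 else float("inf")
    cond1 = p_rs >= zeta and p_ref >= beta      # Condition I
    cond2 = c_rs <= upsilon1 and p_ref >= beta  # Condition II
    cond3 = c_rs <= upsilon2                    # Condition III
    cond4 = e_coeff <= aleph                    # Condition IV
    return cond1 or cond2 or cond3 or cond4
```

In an adaptive loop, the weight update of (18) would simply be skipped for samples where `should_update` returns `False`, leaving the previous weights in place.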

5 Simulation results and comments

The performance of the proposed algorithm is investigated in different echo-generating environments at various input noise levels, considering several male and female utterances available in the TIMIT database [24]. An acoustic room environment is simulated using an FIR filter of length $N_f$, where, as per conventional approaches, the filter coefficients during the flat delay portion are assumed to be zero. The flat delay time ($k_0$) can be pre-calculated based on the distance between the microphone and the speaker [25]. Because of the implicit zeros corresponding to the flat delay, only a small number ($N_f-k_0$) of unknown coefficients has to be determined. In the proposed method, a small step size is used to obtain smooth convergence.

First, a subjective evaluation is carried out based on feedback about the quality of the echo- and noise-suppressed signal provided by five individual listeners in different noisy echo-generating environments. From the overall response of the listeners in terms of mean opinion score (MOS), a very satisfactory performance of the proposed method is obtained even under severe echo-generating conditions in noise.

Next, two objective measures, namely echo return loss enhancement (ERLE) and signal-to-distortion ratio (SDR), are employed. The ERLE is based on the ratio of the instantaneous power $\eta_\varsigma(n)$ of the residual echo signal to the instantaneous power $\eta_x(n)$ of the input echo signal and is expressed in dB as [1]

$$\mathrm{ERLE}(n) = -10\log\frac{\eta_\varsigma(n)}{\eta_x(n)}. \tag{39}$$

The average value of ERLE(n) over time is considered. The input and output SDRs in dB are respectively defined as

$$\mathrm{SDR}_{\mathrm{in}} = 10\log\frac{P_s}{P_{x+v}} \tag{40}$$
$$\mathrm{SDR}_{\mathrm{out}} = 10\log\frac{P_s}{P_{\hat{s}+\hat{v}-s}}, \tag{41}$$

where P s is the power of original signal s(n), Px+v is the power of microphone input, and P s ̂ + v ̂ − s (n) is the power of distortion present in the echo-suppressed output signal. The SDR improvement is given by

SDRI=SD R out −SD R in ,
(42)

which indicates the overall distortion removal.
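The SDR measures in (40) to (42) can be sketched in Python as follows. Two assumptions are made here: the subscript $x+v$ in (40) is interpreted as the echo-plus-noise component of the microphone input, and powers are estimated as mean squares over the whole signal. The function names and the synthetic signals are hypothetical.

```python
import numpy as np


def power(x):
    """Mean-square power of a signal."""
    return float(np.mean(np.asarray(x, dtype=float) ** 2))


def sdr_db(clean, distortion):
    """10 log10 of clean-signal power over distortion power."""
    return 10.0 * np.log10(power(clean) / power(distortion))


def sdr_improvement_db(clean, echo_plus_noise, enhanced):
    """SDRI = SDR_out - SDR_in, with the output distortion taken as enhanced - clean."""
    sdr_in = sdr_db(clean, echo_plus_noise)
    sdr_out = sdr_db(clean, np.asarray(enhanced, dtype=float) - np.asarray(clean, dtype=float))
    return sdr_out - sdr_in


rng = np.random.default_rng(1)
s = rng.standard_normal(8000)        # synthetic stand-in for the clean signal
d = 0.5 * rng.standard_normal(8000)  # synthetic echo plus noise
enhanced = s + 0.1 * d               # residual distortion scaled by 0.1
sdri = sdr_improvement_db(s, d, enhanced)
print(round(sdri, 3))
```

Scaling the distortion amplitude by 0.1 lowers its power by a factor of 100, so the sketch reports an SDR improvement of about 20 dB.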

The proposed algorithm has been tested on several different sentences taken from the TIMIT database [24]. In order to demonstrate the principle of selecting the different threshold values required in the proposed updating constraints, a sample utterance, ‘Good service should be rewarded by big tips’, is shown in Figure 14a as a typical example. Voicing decisions are marked in the figure as ‘P’ for pause, ‘V’ for voiced, and ‘U’ for unvoiced. Considering white Gaussian noise with SNR = 15 dB, $N_f = 1{,}002$, $k_0 = 1{,}000$, and $M = 100$, Figure 14b,c,d,e shows $P_{rs}(n)$, $P_{ref}(n)$, $C_{rs}(n)$, and $\mathrm{MSE}_{ideal}(n)$, respectively. Note that in this case the proposed algorithm is used without the update constraints, and thus $\mathrm{MSE}_{ideal}(n)$ exhibits some higher values. The comments provided in Table 1 can be better visualized from the different marked zones of this figure. From extensive experimentation, it is found that a better update requires $P_{ref}(n)$ to be at least twice $P_{supp}(n)$, and a small percentage (1% to 5%) of the power of a regular voiced frame can be chosen as the lower bound of $\beta$ for $P_{ref}(n)$. Analyzing $C_{rs}(n)$ in different speech frames, $\Upsilon_1$ in Condition II is chosen as 0.25 to ensure that no speech is suppressed during the update procedure by being confused with the echo, and $\Upsilon_2$ is kept very small, i.e., $\Upsilon_2 \approx 0.1$, to allow updating in cases where there is no correlation or extremely low correlation between the reference signal and the echo-suppressed signal. The value of the threshold $\aleph$ for $e_{coeff}(n)$ in Condition IV is chosen to be very small ($0.7\times10^{-4}$), so that the LMS algorithm is not updated when the magnitude of $e_{coeff}(n)$ is comparatively much larger.

Figure 14

Plots of (a) utterance $s(n)$ and update parameters (b) $P_{rs}(n)$, (c) $P_{ref}(n)$, (d) $C_{rs}(n)$, and (e) MSE (without using constraints).

Figure 15

MSEs for the utterance shown in Figure 14: (a) without conditions and (b) with conditions.

Figure 16

Spectrograms of (a) original signal, (b) echo- and noise-corrupted input, and (c) enhanced output.

In order to show the effectiveness of the proposed conditions, the $\mathrm{MSE}_{ideal}(n)$ obtained in Figure 14e is redrawn in Figure 15a, and the effect of incorporating the conditions is shown in Figure 15b. It is clearly observed from Figure 15 that employing the proposed conditions improves the convergence considerably. Moreover, in order to demonstrate the performance in the frequency domain, spectrograms of the original signal, the echo- and noise-corrupted signal, and the output of the proposed AENC block are depicted in Figure 16a,b,c, respectively. For convenience, some zones are marked on the spectrograms where a significant reduction in echo and noise can easily be observed. For a better understanding, another TIMIT utterance, ‘She had your dark suit in greasy wash water all year’, under an acoustic environment similar to that used in Figure 14, is considered, and the corresponding echo- and noise-corrupted speech signal is shown in Figure 17a. The MSEs obtained by using the proposed method without and with the conditions are presented in Figure 17b,c, respectively, which clearly demonstrate the performance improvement in the latter case.

Figure 17

(a) Another TIMIT utterance. MSEs of LMS estimations: (b) without conditions and (c) with conditions.

In Table 2, the performance of the proposed algorithm with and without the conditions is shown in terms of the SDR improvement (dB) and ERLE (dB) for utterance 1. In order to evaluate the performance under different room environments, the length ($N_f$) and parameter values of the room response filter are varied while keeping the input SNR constant at 15 dB. Considering $k_0 = 1{,}000$, $N_f - k_0$ is varied from 2 to 14. The results shown in the table clearly demonstrate the effectiveness of using the conditions; in all cases, higher values of SDR improvement and ERLE are obtained.

Table 2 Performance comparison with varying room acoustics

In Table 3, the performance of the proposed algorithm with and without the conditions is evaluated for different levels of input SNR ranging from 25 to −5 dB for the first utterance, considering white Gaussian noise and $N_f = 1{,}014$. It can be seen that the proposed method provides satisfactory performance at all SNR levels. In particular, the use of the proposed conditions yields comparatively better performance.

Table 3 Performance comparison with noise level variation

6 Conclusion

Echo cancellation in the presence of noise, especially in a single-channel environment, is a very challenging problem, which has been addressed in this paper. First, the single-channel AEC block is designed based on the gradient-based adaptive LMS filter, where, to overcome the problem of obtaining a separate reference signal, the delayed version of the echo-suppressed signal is proposed for use as the reference. This choice of reference signal is justified by a detailed mathematical proof that the estimated filter coefficients reach the optimum Wiener-Hopf solution, and a convergence analysis is carried out. Moreover, in order to achieve fast and smooth convergence, a set of updating constraints is proposed by analyzing the speech characteristics of different types of speech frames, such as voiced, unvoiced, and pause. In the ANC block, a modified single-channel spectral subtraction method is considered for its robust performance. It is shown that the proposed AENC scheme with updating constraints provides very satisfactory performance, in terms of SDR and ERLE, in different echo-generating conditions and at various SNR levels.

Appendix

Derivation of the solution of the LMS update

In order to obtain a homogeneous solution of the update Eq. 22, one may consider

$$\hat{\underline{w}}_{n+1}^{T} = \hat{\underline{w}}_{n}^{T} - 2\mu R_{(s+v)(s+v)}(n-k_0)\,\hat{\underline{w}}_{n}^{T}. \tag{43}$$

Eigenvalue decomposition of the correlation matrix $R_{(s+v)(s+v)}(n-k_0)$ results in

$$R_{(s+v)(s+v)}(n-k_0) = U\Lambda U^{T}, \tag{44}$$

where each column of the matrix $U$ is an eigenvector corresponding to an eigenvalue on the diagonal of the matrix $\Lambda$, and $U^{T}U = I$. Multiplying both sides of (43) from the left by $U^{T}$ results in

$$\hat{\underline{w}}_{n+1}^{TU} = \hat{\underline{w}}_{n}^{TU} - 2\mu\Lambda\,\hat{\underline{w}}_{n}^{TU}, \tag{45}$$

where $U^{T}\hat{\underline{w}}_{n}^{T} = \hat{\underline{w}}_{n}^{TU}$. The $k$th coefficient of the weight vector can be expressed as

$$\hat{\underline{w}}_{n+1}^{U}(k) = \bigl(1 - 2\mu\lambda(k)\bigr)\,\hat{\underline{w}}_{n}^{U}(k), \tag{46}$$

where $\lambda(k)$ is the $k$th diagonal element of the eigenvalue matrix obtained by eigenvalue decomposition of $R_{(s+v)(s+v)}(n-k_0)$. Hence, the homogeneous solution can be obtained as

$$\hat{w}_{h.s} = C_{k}\bigl(1 - 2\mu\lambda(k)\bigr)^{n}, \tag{47}$$

where $C_{k}$ is a constant. Next, in order to obtain the particular solution for the $k$th coefficient, based on (22) one can get

$$\hat{w}_{p.s} = \hat{w}_{p.s} - 2\mu\lambda(k)\,\hat{w}_{p.s} + 2\mu\, r^{U}(n-k_0-k). \tag{48}$$

Here, $r^{U}(n-k_0-k)$ is the $k$th element of $U^{T} r_{(x_s+x_v)(s+v)}(n-k_0) = r^{U}_{(x_s+x_v)(s+v)}(n-k_0)$. For a particular solution $\hat{w}_{p.s} = K_{p}\, r^{U}(n-k_0-k)$, (48) can be written as

$$K_{p}\, r^{U}(n-k_0-k) = K_{p}\, r^{U}(n-k_0-k) - 2\mu\lambda(k) K_{p}\, r^{U}(n-k_0-k) + 2\mu\, r^{U}(n-k_0-k), \tag{49}$$

which leads to $K_{p} = \frac{1}{\lambda(k)}$ and the particular solution

$$\hat{w}_{p.s} = \frac{1}{\lambda(k)}\, r^{U}(n-k_0-k). \tag{50}$$
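The homogeneous and particular solutions can be checked numerically with a short Python sketch. It assumes a time-invariant correlation matrix and cross-correlation vector, generated here from random data as stand-ins (the filter order, sample count, and step size are illustrative choices), and it verifies that the rotated coefficients decay as in (47) and that the driven recursion settles at the value given by (50), which coincides with the Wiener-Hopf solution.

```python
import numpy as np

rng = np.random.default_rng(0)
order = 4

# Stand-ins (assumptions): a sample correlation matrix R and a cross-correlation vector r.
A = rng.standard_normal((200, order))
R = A.T @ A / 200
r = rng.standard_normal(order)

lam, U = np.linalg.eigh(R)  # R = U diag(lam) U^T, with U^T U = I
mu = 0.4 / lam.max()        # ensures |1 - 2*mu*lam(k)| < 1 for every k

# Homogeneous recursion (43): each rotated coefficient decays as in (47).
w0 = rng.standard_normal(order)
n_iter = 25
w_h = w0.copy()
for _ in range(n_iter):
    w_h = w_h - 2 * mu * (R @ w_h)
predicted_h = (1 - 2 * mu * lam) ** n_iter * (U.T @ w0)  # C_k is the k-th element of U^T w0

# Driven recursion: the steady state matches the particular solution (50).
w = np.zeros(order)
for _ in range(500):
    w = w - 2 * mu * (R @ w) + 2 * mu * r
w_particular = U @ ((U.T @ r) / lam)  # r^U(k) / lam(k), rotated back by U
w_wiener = np.linalg.solve(R, r)      # Wiener-Hopf solution R^{-1} r

print(np.allclose(U.T @ w_h, predicted_h), np.allclose(w, w_particular))
```

The step size is chosen so that every factor $1 - 2\mu\lambda(k)$ has magnitude below one, which is what makes the homogeneous part vanish and leaves only the particular solution.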

References

  1. Vaseghi SV: Advanced Digital Signal Processing and Noise Reduction. Wiley, Chichester; 2000.
  2. Kuo SM, Lee BH: Real-Time Digital Signal Processing. Wiley; 2001.
  3. Breining C, Dreiseitel P, Hänsler E, Mader A, Nitsch B, Puder H, Schertler T, Schmidt G, Tilp J: Acoustic echo control - an application of very-high-order adaptive filters. IEEE Signal Process. Mag 1999, 16(4):42-69. 10.1109/79.774933
  4. Hänsler E: The hands-free telephone problem: an annotated bibliography. Signal Process 1992, 27(3):259-271. 10.1016/0165-1684(92)90074-7
  5. Khong AWH, Naylor PA: Stereophonic acoustic echo cancellation employing selective-tap adaptive algorithms. IEEE Trans. Audio, Speech, Lang. Process 2006, 14(3):785-796.
  6. Lindstrom F, Schuldt C, Claesson I: An improvement of the two-path algorithm transfer logic for acoustic echo cancellation. IEEE Trans. Audio, Speech, Lang. Process 2007, 15(4):1320-1326.
  7. Wu S, Qiu X, Wu M: Stereo acoustic echo cancellation employing frequency-domain preprocessing and adaptive filter. IEEE Trans. Audio, Speech, Lang. Process 2011, 19(3):614-623.
  8. Nath R: Adaptive echo cancellation based on a multipath model of acoustic channel. Circuits, Syst. Signal Process., Springer US 2013, 32(4):1673-1698. 10.1007/s00034-012-9529-4
  9. Yukawa M, de Lamare RC, Sampaio-Neto R: Efficient acoustic echo cancellation with reduced-rank adaptive filtering based on selective decimation and adaptive interpolation. IEEE Trans. Audio, Speech, Lang. Process 2008, 16(4):696-710.
  10. Hänsler E, Schmidt G: Acoustic Echo and Noise Control: a Practical Approach. Wiley, New York; 2004.
  11. Myllylä V: Residual echo filter for enhanced acoustic echo control. Signal Process 2006, 86(6):1193-1205. 10.1016/j.sigpro.2005.07.036
  12. Topa R, Muresan I, Kirei BS, Homana I: A digital adaptive echo-canceller for room acoustics improvement. Adv. Electrical Comput. Eng 2004, 10:450-453.
  13. Haykin S: Adaptive Filter Theory. Prentice-Hall, Inc., Upper Saddle River, NJ; 1996.
  14. Schmidt G: Applications of acoustic echo control: an overview. In Proc. Eur. Signal Process. Conf. (EUSIPCO), Vienna; 2004:9-16.
  15. Widrow B, Glover JRJ, McCool JM, Kaunitz J, Williams CS, Hearn RH, Zeidler JR, Dong JE, Goodlin RC: Adaptive noise cancelling: principles and applications. Proc. IEEE 1975, 63(12):1692-1716.
  16. Yasukawa H: An acoustic echo canceller with sub-band noise cancelling. IEICE Trans. Fundamentals Electron. Commun. Comput. Sci 1992, E75-A(11):1516-1523.
  17. Park SJ, Cho CG, Lee C, Youn DH: Integrated echo and noise canceller for hands-free applications. IEEE Trans. Circuits Syst.-II: Analog Digital Signal Process 2002, 49(3):
  18. Beaugeant C, Turbin V, Scalart P, Gilloire A: New optimal filtering approaches for hands-free telecommunication terminals. Signal Process 1998, 64(1):33-47. 10.1016/S0165-1684(97)00174-6
  19. Mahbub U, Fattah SA: Gradient based adaptive filter algorithm for single channel acoustic echo cancellation in noise. In Proc. 7th Int. Conf. Electrical Computer Engineering (ICECE). Dhaka, Bangladesh; 2012:880-883.
  20. Boll S: A spectral subtraction algorithm for suppression of acoustic noise in speech. Proc. IEEE Int. Conf. Acoust. Speech, Signal Process. (ICASSP) ’79 1979, 200-203.
  21. Berouti M, Schwartz R, Makhoul J: Enhancement of speech corrupted by acoustic noise. IEEE Conf. Acoust. Speech Signal Process. (ICASSP) 1979, 208-211.
  22. Lim JS: Evaluation of a correlation subtraction method for enhancing speech degraded by additive white noise. IEEE Trans. Acoust. Speech Signal Process 1978, 26(5):471-472. 10.1109/TASSP.1978.1163129
  23. Martin R: Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans. Speech Audio Process 2001, 9(5):504-512. 10.1109/89.928915
  24. Garofolo JS, Lamel LF, Fisher WM, Fiscus JG, Pallett DS, Dahlgren NL, Zue V: TIMIT acoustic-phonetic continuous speech corpus. Linguistic Data Consortium, Philadelphia; 1993.
  25. Guangzeng F, Feng L: A new echo canceller with the estimation of flat delay. In IEEE Region Ten Conf. TENCON 92. Melbourne, Australia; 1992, vol. 1, pp. 1-5. Print ISBN 0-7803-0849-2. 10.1109/TENCON.1992.271995

Author information

Corresponding author

Correspondence to Upal Mahbub.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Mahbub, U., Fattah, S.A., Zhu, WP. et al. Single-channel acoustic echo cancellation in noise based on gradient-based adaptive filtering. J AUDIO SPEECH MUSIC PROC. 2014, 20 (2014). https://doi.org/10.1186/1687-4722-2014-20

  • DOI: https://doi.org/10.1186/1687-4722-2014-20

Keywords