
Joint estimation of pitch and direction of arrival: improving robustness and accuracy for multi-speaker scenarios

Abstract

In many speech communication applications, robust localization and tracking of multiple speakers in noisy and reverberant environments are of major importance. Several algorithms to tackle this problem have been proposed in the last decades. In this paper, we propose several extensions to a recently presented joint direction of arrival (DOA) and pitch estimation method, increasing its robustness in multi-speaker scenarios, noise, and reverberation. First, a spectral comb filter is added to the original algorithm to better cope with concurrent speakers. Second, the well-known generalized cross-correlation with phase transform (GCC-PHAT) is used as an additional weighting function to improve the DOA estimation accuracy in terms of correct hits. Third, using multiple microphone pairs, the multi-channel cross-correlation approach is incorporated to improve the robustness against noise and reverberation. In order to improve tracking for moving and even intersecting speakers, a particle filter is used. Experiments with real-world recordings in realistic acoustic conditions show that the proposed extensions increase the DOA hit rate by about 33% compared to the original algorithm for two step-wise moving sources at a signal-to-noise ratio (SNR) of 15 dB and a reverberation time RT60 of 560 ms.

1 Introduction

Automatic detection, localization, and tracking of speakers are of high interest in several applications such as hands-free speech communication and video conferencing, as well as for computational auditory scene analysis and human-machine interfaces. For example, in current high-quality video-conferencing systems, the users are typically not located close to the microphones, and furthermore, several users may be talking simultaneously.

To distinguish between multiple concurrent speakers, it is desirable to be able to differentiate between their directions of arrival (DOAs) and their voice characteristics. This information can then be used to, e.g., enhance automatic speech recognition, indicate active speakers, steer the camera of a video-conferencing system, or to suppress undesired acoustic disturbances.

A common method for DOA estimation is to first estimate the time difference of arrival (TDOA) between different microphone pairs. An overview of these methods, as well as related references, can be found in [1],[2]. A well-known TDOA estimation method is the generalized cross-correlation with phase transform (GCC-PHAT), first introduced in [3] and intensively investigated for speech signals in, e.g., [4]–[7]. The dual delay line algorithm in [8] is another method to estimate the azimuths of sound sources by analyzing the coincidences along two-channel delay-line pairs. Other methods for DOA estimation are based on blind channel identification, such as the adaptive eigenvalue decomposition algorithm (AEDA) [9],[10], or subspace methods such as multiple signal classification (MUSIC) [11]. Another category of DOA estimation algorithms are energy-based methods which only use the measured signal energy at each microphone [12], or combined methods using both TDOA and energy information [13],[14].

The spectro-temporal characteristics of speech signals, e.g., the fundamental frequency (pitch), can also be analyzed to distinguish between concurrent speakers [15]. Traditional pitch estimation methods are based on, e.g., zero crossing rate analysis, detection of harmonics in the autocorrelation function, and cepstrum analysis [16]. Recently, a pitch estimation filter with amplitude compression (PEFAC) in the spectral domain has been proposed in [17], and methods for joint pitch and model order estimation have been proposed in [18]–[20]. Multi-pitch estimation has also become a topic of research, and several approaches are summarized in [21].

DOA and pitch estimation are typically treated separately, and only a couple of attempts have been made for joint DOA and pitch estimation. A possible solution is a two-step approach. In the first step, the position of a source is estimated from (multiple) microphone pairs. In the second step, the microphone signals are combined using a beamformer to obtain a single-channel output signal which is used to estimate the pitch using a single-channel pitch estimation method. In [22], a joint position and pitch (PoPi) estimation method has been proposed which is based on either cross-correlations or cross-power spectral densities (CPSDs). Several extensions have been proposed using cepstral weighting [23], gammatone-like weighting [24], time-domain GCC-PHAT replacement [25], particle filtering [26], and speaker-dependent subgrouping [27]. In [28], a different method based on a recurrent timing neural network is used for joint DOA and pitch estimation. Other methods make use of the 2D-Capon method [29],[30], a subspace approach termed as multi-channel multi-pitch harmonic MUSIC (MC-HMUSIC) [31], a minimum variance distortionless response (MVDR) beamformer [32] which additionally estimates the model order to determine the number of harmonics of the source signal, and a non-linear least squares (NLS)-based method [33], all using a harmonic signal model to jointly estimate the DOA and pitch. When jointly estimating DOA and pitch, the parameter estimation typically mutually benefits from each other. Although these joint estimation methods perform quite well for clean speech signals (i.e., without noise and no reverberation), their performance typically degrades considerably in adverse acoustic environments.

The focus of this paper is to improve joint DOA and pitch estimation for multiple speakers in terms of accuracy and robustness in realistic acoustic situations. We have taken the CPSD-based method proposed in [22] combined with cepstral weighting [23], gammatone-like weighting [24], and a subsequent particle filtering [26] as the core algorithm, and we propose several extensions to improve both accuracy and robustness in this paper. As a first extension, a frequency-domain comb filter is introduced to improve the performance for simultaneously active speakers. As a second extension, a GCC-PHAT weighting function is introduced, resulting in an improved DOA estimation accuracy. As a third extension, instead of simply averaging the multiple microphone pair results, the multi-channel cross-correlation (MCCC) method, presented in [34], is adapted to the joint DOA and pitch estimator, leading to a robustness improvement especially for noisy conditions.

This paper is structured as follows: In Section 2, we introduce the core algorithm for joint DOA and pitch estimation and describe each of the proposed extensions. In Section 3, the core algorithm and its extensions are evaluated for different amounts of reverberation and signal-to-noise ratios (SNRs). Finally, the paper concludes with the most relevant findings from the proposed extensions in Section 4.

2 Algorithm

Figure 1 gives an overview of the complete proposed algorithm, depicting the different processing steps which can be divided into three parts (pre-, main, and post-processing). The proposed extensions are highlighted by gray-shaded areas. Since we are interested in joint DOA and pitch estimation, the main feature of the algorithm is the computation of a two-dimensional (2D) pattern for DOA and pitch. As core algorithm, the CPSD-based method proposed in [22] combined with cepstral weighting [23] and gammatone-like weighting [24] is used. To enable speaker tracking, a subsequent particle filter [26] is also part of the core algorithm. In Section 2.1, we introduce the considered scenario and notation. The core algorithm is described in detail in Section 2.2, while each of our extensions is explained separately in Section 2.3.

Figure 1

Overview of the proposed method. Gray-shaded areas are the extensions proposed in this paper. The signal-flow path starts with the microphone signals and continues through the pre-processing followed by the main processing and is completed with the calculation of a two-dimensional pattern ρ(φ,f0) in the post-processing. A particle filter is used on this 2D pattern to determine the DOA and pitch estimate.

2.1 Scenario and notation

We consider an acoustic scenario where $Q$ speech sources are recorded using $M$ microphones in a noisy and reverberant environment. The $i$th microphone signal $y_i[k]$, with $k$ the discrete time index, is first transformed to the frequency domain using the short-time Fourier transform (STFT), i.e.,

$$y_i[n,\lambda] = \mathrm{STFT}\{y_i[k]\}, \quad i = 1 \ldots M, \tag{1}$$

with frequency index $n = 1 \ldots N$ and frame index $\lambda$. The STFT spectra can be modeled as

$$\begin{aligned}
y_i[n,\lambda] &= \mathbf{h}_i^T[n,\lambda]\,\mathbf{s}[n,\lambda] + v_i[n,\lambda] && (2)\\
&= x_i[n,\lambda] + v_i[n,\lambda], && (3)
\end{aligned}$$

where $\mathbf{h}_i[n,\lambda] = [h_{i1}[n,\lambda], \ldots, h_{iQ}[n,\lambda]]^T$ denotes the acoustic transfer functions between the speech sources $\mathbf{s}[n,\lambda] = [s_1[n,\lambda], \ldots, s_Q[n,\lambda]]^T$ and the $i$th microphone, and $x_i[n,\lambda]$ and $v_i[n,\lambda]$ represent the speech and noise components in the $i$th microphone signal, respectively. The superscript $T$ denotes the transpose operation. Each acoustic transfer function $h_{iq}[n,\lambda]$ can be expressed as

$$h_{iq}[n,\lambda] = A_{iq}[n,\lambda]\, e^{j\psi_{iq}[n,\lambda]}, \quad q = 1 \ldots Q, \tag{4}$$

where $A_{iq}[n,\lambda]$ and $\psi_{iq}[n,\lambda]$ represent the amplitude and the phase of the acoustic transfer function, respectively.

As proposed in [24], a subgrouping of the spectra $y_i[n,\lambda]$ is applied in order to improve multi-speaker detection, i.e.,

$$y_i^{(g)}[n,\lambda] = y_i[n,\lambda] \cdot g^{(g)}[n], \quad g = 1 \ldots G, \tag{5}$$

where $y_i^{(g)}[n,\lambda]$ denotes the weighted spectrum, and the superscript $g$ indicates the frequency group number, which results in $G$ (partially overlapping) spectra. We used a gammatone-like weighting function $g^{(g)}[n]$ as depicted in Figure 2.

Figure 2

Sixty-four gammatone-like weighting functions.

In addition, the CPSD

$$\Phi_{i\ell}^{(g)}[n,\lambda] = E\left\{ y_i^{(g)}[n,\lambda]\, y_\ell^{(g)*}[n,\lambda] \right\} \tag{6}$$

between the $i$th and the $\ell$th microphone is computed for each subgroup $g$, where $E\{\cdot\}$ denotes the expectation operator, and complex conjugation is denoted by $(\cdot)^*$. In practice, the CPSD is estimated using a recursive smoothing procedure corresponding to a first-order low-pass filter [35], i.e.,

$$\hat{\Phi}_{i\ell}^{(g)}[n,\lambda] = \alpha\, \hat{\Phi}_{i\ell}^{(g)}[n,\lambda-1] + (1-\alpha)\, y_i^{(g)}[n,\lambda]\, y_\ell^{(g)*}[n,\lambda], \tag{7}$$

where the symbol $\hat{(\cdot)}$ indicates an estimated value, and $0 \le \alpha < 1$ is a smoothing factor. Please note that in our case, the CPSD calculation in Equation 7 is performed in $G$ subspectra. Afterwards, the CPSDs are normalized by the maximum of each subspectrum and recombined, i.e.,

$$\hat{\Phi}_{i\ell}[n,\lambda] = \frac{1}{G} \sum_{g=1}^{G} \frac{\hat{\Phi}_{i\ell}^{(g)}[n,\lambda]}{\max_n \left\{ \left| \hat{\Phi}_{i\ell}^{(g)}[n,\lambda] \right| \right\}}, \tag{8}$$

where $\max_n\{\cdot\}$ denotes the maximum operator over index $n$. The normalization of each subspectrum in Equation 8 attempts to emphasize all speech source components in the recombined representation, as described in [24]. This is because in multi-speaker scenarios, harmonic speech sources have a different influence on the subspectra, and the narrowband CPSD $\hat{\Phi}_{i\ell}^{(g)}[n,\lambda]$ may be dominated by different signal components.
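The recursive smoothing and the recombination step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `smooth_cpsd` and `recombine` are hypothetical, spectra are plain lists of complex values (one per frequency bin), the normalization is assumed to use the peak magnitude of each subgroup, and skipping all-zero subgroups is a guard added here.

```python
def smooth_cpsd(prev, y_i, y_l, alpha):
    """One recursive smoothing step for a single subgroup (cf. Equation 7).

    prev is the CPSD estimate of the previous frame; y_i and y_l are the
    weighted spectra of the two microphones in the current frame.
    """
    return [alpha * p + (1.0 - alpha) * a * b.conjugate()
            for p, a, b in zip(prev, y_i, y_l)]


def recombine(cpsd_groups):
    """Normalize each subgroup CPSD by its peak magnitude and average the
    results over all subgroups (cf. Equation 8)."""
    num_bins = len(cpsd_groups[0])
    combined = [0j] * num_bins
    for group in cpsd_groups:
        peak = max(abs(value) for value in group)
        if peak == 0.0:
            # Assumption of this sketch: an all-zero subgroup adds nothing.
            continue
        for n, value in enumerate(group):
            combined[n] += value / peak
    return [value / len(cpsd_groups) for value in combined]
```

With a small α such as the value used later in the evaluation, the estimate follows the current frame closely, which matches the stated motivation of coping with simultaneous sources.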

A CPSD-based voice activity detection (VAD) [36] is used to determine speech segments. Only the time frames in which voice activity has been detected are considered in the following processing. Please note that in the remainder of this paper, we will omit the frame index λ for simplification where it is not needed.

2.2 Joint DOA and pitch estimation

Assuming free-field conditions, plane waves, and a single source signal $s_1[n]$ impinging with DOA $\varphi$ on a uniform linear array (ULA), as shown in Figure 3a,b, the relationship between the $i$th and the $\ell$th microphone signals is

$$\begin{aligned}
x_i[n] &= x_\ell[n]\, e^{j\psi_{i\ell}[n]} && (9)\\
\psi_{i\ell}[n] &= \frac{2\pi f_n\, d_{i\ell} \cos(\varphi)}{c}, && (10)
\end{aligned}$$

where $\psi_{i\ell}$ describes the phase, depending on the center frequency $f_n$ at frequency index $n$, the distance $d_{i\ell}$ between microphones $i$ and $\ell$, and the speed of sound $c$. Figure 3b depicts the definition of the DOA relative to the microphone array used throughout this paper. Without additional noise, the CPSD can then be understood as

$$\hat{\Phi}_{i\ell}[n,\lambda] = \alpha\, \hat{\Phi}_{i\ell}[n,\lambda-1] + (1-\alpha) \left| y_i[n] \right|^2 e^{j\psi_{i\ell}[n]}. \tag{11}$$

Figure 3

Direction of arrival. (a) Geometrical interpretation of the relation between direction of arrival $\varphi$ and distance $d_{i\ell}$ between microphones $i$ and $\ell$, assuming a single speech source and a plane sound wave in free-field conditions. (b) Definition of the direction of arrival $\varphi$ relative to the microphone array.

For the joint DOA and pitch estimation, a 2D DOA/pitch pattern will be computed using the CPSD [22]. Only voiced signals will be considered as relevant sources, where it is assumed that these speech signals consist of a fundamental frequency f0 (pitch) and multiple harmonics. We use a harmonic sieve in order to estimate the pitch of the speech signal. This is shown in Figure 4, where the underlying concept of a harmonic sieve is presented, assuming different pitch values up to the fourth harmonic. The indices of the analyzed frequency bins of the harmonic sieve are defined as

$$n_p = \left[ p \cdot \frac{f_0}{f_s} \cdot N + 0.5 \right]_{\mathrm{round}}, \quad p = 1 \ldots P, \tag{12}$$

where $p$ denotes the harmonic, $N$ is the frame size, and $f_s$ is the sampling frequency. Only signal components corresponding to the harmonic sieve will be considered for the estimation. The harmonic sieve is computed for all values in the considered pitch range. For the exemplary harmonic sieve in Figure 4, the third example ($f_0 = 200$ Hz) would result in the best estimate, since the pitch of the signal and the harmonic sieve match. In [22], two different types of harmonic sieves are proposed, either based on cross-correlation or based on the CPSD. In this paper, we will only consider the CPSD-based version.

Figure 4

Harmonic sieve with four different pitch values $f_0$ up to the fourth harmonic ($P = 4$; black lines). The exemplary harmonic signal consists of equally spaced triangles. Best estimation results would be achieved with example 3 ($f_0 = 200$ Hz), where the pitch of the signal and the harmonic sieve match.
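As a worked example of the bin selection in Equation 12, the following Python sketch computes the sieve indices for one pitch candidate. It assumes that the rounding operation amounts to `floor(x + 0.5)` and that bins are counted from zero; the function name `sieve_bins` is hypothetical.

```python
import math


def sieve_bins(f0, fs, frame_size, num_harmonics):
    """Frequency-bin indices of the harmonic sieve for pitch candidate f0.

    Assumption of this sketch: the rounding in Equation 12 is read as
    floor(x + 0.5), i.e., rounding to the nearest bin.
    """
    return [math.floor(p * f0 / fs * frame_size + 0.5)
            for p in range(1, num_harmonics + 1)]


# Frame size and sampling rate as used in the evaluation of this paper.
bins_200 = sieve_bins(200.0, 48000, 4096, 5)  # [17, 34, 51, 68, 85]
bins_201 = sieve_bins(201.0, 48000, 4096, 5)  # [17, 34, 51, 69, 86]
```

With a bin spacing of 48000/4096 ≈ 11.7 Hz, pitch candidates that are 1 Hz apart share the low-order bins and are only separated by the higher harmonics in this example.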

In addition to pitch estimation, the DOA estimation is performed by analyzing the phase $\psi_{i\ell}[n]$ of the CPSD. To this end, the harmonic sieve is applied to the recombined CPSD in Equation 8, where the amplitude $|\hat{\Phi}_{i\ell}[n]|$ and the phase $\psi_{i\ell}[n]$ are treated differently to obtain the 2D DOA/pitch pattern $\rho_{i\ell}(\varphi, f_0)$ as follows:

$$\begin{aligned}
\rho_{i\ell}(\varphi, f_0) &= \sum_{p=1}^{P} \left| \hat{\Phi}_{i\ell}[n_p] \right| \cdot T\left\{ \hat{\psi}_{i\ell}[n_p] - \psi_{i\ell}^{0}[n_p] \right\} && (13)\\
\psi_{i\ell}^{0}[n_p] &= p \cdot \frac{2\pi f_0\, d_{i\ell} \cos(\varphi)}{c} && (14)\\
\hat{\psi}_{i\ell}[n_p] &= \arg\left\{ \hat{\Phi}_{i\ell}[n_p] \right\}, && (15)
\end{aligned}$$

where $\hat{\psi}_{i\ell}[n_p]$ denotes the phase of the CPSD, and $\psi_{i\ell}^{0}[n_p]$ denotes the expected phase for a combination of pitch $f_0$ and DOA $\varphi$. The sum is taken over the $P$ discrete frequency bins $n_p$ belonging to the harmonic sieve. The amplitude $|\hat{\Phi}_{i\ell}[n_p]|$ encodes pitch information due to the harmonic multiples of $f_0$, whereas DOA information is encoded in the phase $\hat{\psi}_{i\ell}[n_p]$. The results for all considered combinations of pitch values $f_0$ and DOA values $\varphi$ are stored in the 2D pattern $\rho_{i\ell}(\varphi, f_0)$. For computational efficiency, the values $n_p$ and $\psi_{i\ell}^{0}[n_p]$ can be calculated beforehand and stored in look-up tables. Figure 5 shows the magnitude and phase spectrum of the harmonic sieve filter for a speech signal. The example depicts the case in which the harmonic sieve matches the pitch of the speaker.

Figure 5

CPSD of a speech signal with a pitch of $f_0 = 164$ Hz. (a) Amplitude. (b) Phase. The dotted lines represent a harmonic sieve filtering at the discrete frequency positions $n_p$ with $p = 1 \ldots P$ and $P = 9$. In this case, the pitch of the signal and the harmonic sieve match.

The operator $T\{\cdot\}$ in Equation 13 can be considered as an additional phase transform. Different phase transforms $T\{\cdot\}$ are possible in order to enhance the 2D pattern $\rho_{i\ell}(\varphi, f_0)$, all of which are real-valued, even, and $2\pi$-periodic functions [22]. These transforms increase the impact of the phase weighting on the harmonic sieve (cf. Equation 13). The transform used in this contribution is the one proposed in [22], i.e.,

$$T\{\chi\} = \frac{1}{1 + \beta - \cos(\chi)}. \tag{16}$$

For $\chi$, we use the mismatch between $\psi_{i\ell}^{0}[n_p]$ and $\hat{\psi}_{i\ell}[n_p]$ as stated in Equation 13, where the parameter $0 < \beta \le 1$ affects the width of the preferred direction. A mismatch close to 0 or to a multiple of $2\pi$ yields a large weighting factor. Accordingly, a large mismatch in $\chi$ leads to a small weighting factor. Hence, if the pair $(\varphi, f_0)$ corresponds to a source, the amplitude $|\hat{\Phi}_{i\ell}[n_p]|$ is weighted more strongly. Figure 6 depicts the case in which $\psi_{i\ell}^{0}[n_p]$ corresponds to the measured phase $\hat{\psi}_{i\ell}[n_p]$, and the transform $T\{\cdot\}$ is large for the analyzed frequency bins (marked by vertical dashed lines).
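To make the interplay of Equations 13 to 16 concrete, the sketch below evaluates one cell of the 2D pattern for a single microphone pair. It rests on several assumptions that are not taken from the paper: the phase transform is read as T{χ} = 1/(1 + β − cos χ), which is large for matching phases as described above; the speed of sound is set to 343 m/s; bins are rounded with `floor(x + 0.5)`; and the names `phase_transform` and `pattern_value` are hypothetical.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for this sketch


def phase_transform(chi, beta):
    """Phase weighting, read here as 1 / (1 + beta - cos(chi))."""
    return 1.0 / (1.0 + beta - math.cos(chi))


def pattern_value(cpsd, doa_deg, f0, mic_dist, fs, frame_size,
                  num_harmonics, beta):
    """One cell rho(doa, f0) of the 2D pattern for one microphone pair.

    cpsd is a list of complex CPSD values indexed by frequency bin.
    """
    cos_doa = math.cos(math.radians(doa_deg))
    total = 0.0
    for p in range(1, num_harmonics + 1):
        n_p = math.floor(p * f0 / fs * frame_size + 0.5)    # sieve bin
        expected = (p * 2.0 * math.pi * f0 * mic_dist * cos_doa
                    / SPEED_OF_SOUND)                       # expected phase
        measured = cmath.phase(cpsd[n_p])                   # measured phase
        total += abs(cpsd[n_p]) * phase_transform(measured - expected, beta)
    return total
```

For a synthetic CPSD with unit magnitude and exactly the expected phase at all sieve bins, each harmonic contributes 1/β, so the matching cell reaches P/β, while cells with a mismatched DOA stay below that value.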

Figure 6

Phase transform $T\{\chi\}$ for the case in which the measured (solid black) and expected (dashed black) phases are close. The gray dotted line is the difference $\chi = \hat{\psi}_{i\ell}[n] - \psi_{i\ell}^{0}[n]$ between both phases. The transform $T\{\chi\}$ (solid gray) produces a high value at frequency bins relevant for the harmonic sieve (assuming matching phases) and furthermore acts as an unwrapping function.

A cepstral weighting (cf. Figure 1) of the 2D pattern $\rho_{i\ell}(\varphi, f_0)$, based on the cepstrum of the cross-correlation, was proposed in [23] to further improve the pitch estimation for disturbed input signals. The cepstrum is computed as the inverse STFT of the logarithm of the amplitude of the spectrum $y_i[n]$. This transform leads to an additive rather than a multiplicative representation of the signal components in the superimposed spectrum [33]. Thus, the so-called quefrency [35] of a dominant peak can be interpreted as a pitch candidate, and the pitch-relevant part of the cepstrum can be used as a weighting function.

In the post-processing, a particle filter is applied to the 2D pattern $\rho(\varphi, f_0)$ to obtain the combined DOA and pitch estimate. The particle filter tries to represent an unknown probability function by using a sequential Monte Carlo simulation with a set of particles and respective probabilities. The particles $\nu^u[\lambda]$ for frame $\lambda$ incorporate the DOA $\varphi^u[\lambda]$, the angular velocity $\omega^u[\lambda]$, and the pitch $f_0^u[\lambda]$, i.e.,

$$\nu^u[\lambda] = \left[ \varphi^u[\lambda],\ \omega^u[\lambda],\ f_0^u[\lambda] \right], \quad u = 1 \ldots U, \tag{17}$$

where U denotes the total number of particles.

Each particle $\nu^u[\lambda]$ has a weight $\xi^u$ representing its probability. The evolution of the particles can be described in two stages. First, the state of a particle is predicted using the particle $\nu^u[\lambda-1]$ from the previous frame, taking into account possible physical restrictions. The pitch change is described using

$$f_0^u[\lambda] = f_0^u[\lambda-1] + \frac{N}{f_s} \cdot \beta_f \cdot r, \tag{18}$$

where $r$ is a Gaussian distributed random variable, and $\beta_f$ is the pitch shift prediction value. In addition, it is assumed that the pitch only changes within a certain range. The DOA change is described using the so-called Langevin model [37], i.e.,

$$\begin{aligned}
\omega^u[\lambda] &= a_\varphi\, \omega^u[\lambda-1] + b_\varphi\, r && (19)\\
\varphi^u[\lambda] &= \varphi^u[\lambda-1] + \frac{N}{f_s}\, \omega^u[\lambda] && (20)\\
a_\varphi &= e^{-\beta_\varphi \frac{N}{f_s}} && (21)\\
b_\varphi &= \bar{\omega} \sqrt{1 - a_\varphi^2} \cdot \frac{180^\circ}{\pi}, && (22)
\end{aligned}$$

where $\bar{\omega}$ is the steady-state angular velocity, and $\beta_\varphi$ is the DOA shift prediction value. In the second stage, we use the 2D pattern $\rho(\varphi, f_0)$ as a pseudo-likelihood function to determine the weights $\xi^u$, i.e.,

$$\xi^u = \rho\left( \varphi^u[\lambda],\ f_0^u[\lambda] \right), \tag{23}$$

where subsequently, the sum of all weights is normalized to unity such that $\sum_{u=1}^{U} \xi^u = 1$. The final DOA and pitch estimate for time frame $\lambda$ is obtained by summing all weighted particles, i.e.,

$$\begin{aligned}
\tilde{\varphi}[\lambda] &= \sum_{u=1}^{U} \xi^u \cdot \varphi^u[\lambda] && (24)\\
\tilde{f}_0[\lambda] &= \sum_{u=1}^{U} \xi^u \cdot f_0^u[\lambda]. && (25)
\end{aligned}$$

To avoid the so-called degeneracy problem, we use the systematic resampling approach, as proposed in [38], and an additional module for removal and addition of particles, as proposed in [26]. The advantages of a particle filter over a simple maximum search stem from its inherent tracking capabilities and its robustness against reverberation [37]. This is because in adverse conditions, the 2D pattern $\rho(\varphi, f_0)$ does not exhibit a clear main peak at the source positions; instead, a fuzzy, sometimes biased area with multiple peaks is observable. The particle filter can be considered a self-adapting smoothing function of the estimate due to the predicted source behavior and the imposed physical restrictions of this behavior.
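The two stages of the particle filter (Equations 18 to 25) and a systematic resampling step can be sketched as follows. This is an illustration under assumptions, not the tracker of [26]: the decay factor is taken as exp(−β_φ·N/f_s) as in the standard Langevin model, DOA and angular velocity of the particles are kept in degrees, `pattern` is any hypothetical callable returning ρ(φ, f0), negative pattern values are clipped as a guard added here, and the removal and addition of particles is omitted. All function names are hypothetical.

```python
import math
import random


def predict(particle, frame_time, beta_phi, omega_bar, beta_f, rng):
    """Prediction stage for one particle (doa_deg, omega_deg_per_s, f0).

    frame_time corresponds to N / f_s; rng must provide gauss(mu, sigma).
    """
    doa, omega, f0 = particle
    a_phi = math.exp(-beta_phi * frame_time)
    b_phi = omega_bar * math.sqrt(1.0 - a_phi ** 2) * 180.0 / math.pi
    omega = a_phi * omega + b_phi * rng.gauss(0.0, 1.0)
    doa = doa + frame_time * omega
    f0 = f0 + frame_time * beta_f * rng.gauss(0.0, 1.0)
    return (doa, omega, f0)


def weights_from_pattern(particles, pattern):
    """Update stage: pattern values as pseudo-likelihoods, normalized to 1."""
    raw = [max(pattern(doa, f0), 0.0) for doa, _, f0 in particles]
    total = sum(raw)
    if total == 0.0:
        # Guard of this sketch: fall back to uniform weights.
        return [1.0 / len(particles)] * len(particles)
    return [w / total for w in raw]


def estimate(particles, weights):
    """Weighted mean of DOA and pitch over all particles."""
    doa = sum(w * p[0] for p, w in zip(particles, weights))
    f0 = sum(w * p[2] for p, w in zip(particles, weights))
    return doa, f0


def systematic_resample(particles, weights, rng):
    """Systematic resampling: one random offset, equally spaced pointers."""
    count = len(particles)
    start = rng.random() / count
    resampled = []
    index = 0
    cumulative = weights[0]
    for k in range(count):
        target = start + k / count
        while target >= cumulative and index < count - 1:
            index += 1
            cumulative += weights[index]
        resampled.append(particles[index])
    return resampled
```

A frame update would call `predict` for every particle, then `weights_from_pattern`, `estimate`, and finally `systematic_resample`, using one generator such as `random.Random(0)` for reproducibility.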

2.3 Methods to increase the robustness

For a single-speaker scenario and clean speech recordings, the basic DOA and pitch estimation algorithm in [23] performs quite well. However, its performance decreases in noisy and reverberant conditions as well as in multi-speaker scenarios. Different extensions have been proposed in [23]–[27] to increase the robustness of the algorithm in various aspects. The above-stated subgrouping of the spectra (cf. Equation 5) as well as the already mentioned cepstral weighting [23], both part of the core algorithm and discussed in Section 2.2, are two of these extensions.

In the following sections, we will explain three novel extensions, namely, a spectral comb filter to better cope with concurrent speakers, a generalized cross-correlation (GCC)-phase transform (PHAT) weighting function to improve the DOA estimation accuracy, and a multi-channel cross-correlation approach to improve the robustness against noise and reverberation. The order of the extensions corresponds to their occurrence in the algorithm, cf. Figure 1.

2.3.1 Spectral comb filter

In [25], the authors observed that if more than one source is active simultaneously, a dominant source masks other concurrent sources in the CPSD $\hat{\Phi}_{i\ell}[n]$ and eventually in the 2D pattern $\rho_{i\ell}(\varphi, f_0)$. Assuming the number of sources $Q$ is known, we propose to introduce a comb filter $\gamma[n]$ in order to suppress components of the CPSD $\hat{\Phi}_{i\ell}[n]$ corresponding to already estimated sources, i.e.,

$$\begin{aligned}
\hat{\Phi}_{i\ell}[n] &= \hat{\Phi}_{i\ell}[n] \cdot \gamma[n] && (26)\\
\gamma[n] &= \begin{cases} 0, & \text{if } n \in [\,n_p - \beta_c,\ n_p + \beta_c\,], \text{ with } p = 1 \ldots P\\ 1, & \text{else.} \end{cases} && (27)
\end{aligned}$$

The parameter $\beta_c$ indicates the width of one tooth of the comb filter, and $P$ denotes the number of teeth in the comb filter, which is equal to the number of considered harmonics in Equation 12. The comb filter $\gamma[n]$ only depends on already estimated pitch values $\hat{f}_0$, i.e.,

$$n_p = \left[ p \cdot \frac{\hat{f}_0}{f_s} \cdot N + 0.5 \right]_{\mathrm{round}}, \quad p = 1 \ldots P. \tag{28}$$

Using estimated pitch values $\hat{f}_0$, the spectral comb filter is built to suppress the influence of the already estimated speech sources in the CPSD; this leads to a more robust estimation of the remaining speech sources. If the concurrent sources are not yet estimated in the current frame, the pitch estimate from the previous time frame $\lambda - 1$ is used. Accordingly, if there is no previous estimate available, the very first pitch estimate is determined using the unmodified $|\hat{\Phi}_{i\ell}[n]|$ in Equation 13.

For each time frame, the filtering is applied repeatedly to the original CPSD $\hat{\Phi}_{i\ell}[n]$, once for each estimated source. All successive processing steps, including the harmonic sieve, are repeated accordingly. Figure 7 illustrates the effect of the comb filter for two concurrent sources. It can be seen that the secondary source is suppressed, whereas the target source is highlighted.
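A minimal Python sketch of the comb filter in Equations 26 to 28 is given below. It assumes that each tooth covers the closed interval of bins from n_p − β_c to n_p + β_c, that bins are counted from zero, and that the rounding is `floor(x + 0.5)`; the names `comb_filter` and `apply_comb` are hypothetical.

```python
import math


def comb_filter(num_bins, f0_est, fs, frame_size, num_harmonics, tooth_width):
    """Binary comb gamma[n]: 0 around the harmonics of an already estimated
    pitch f0_est, 1 elsewhere."""
    gamma = [1.0] * num_bins
    for p in range(1, num_harmonics + 1):
        n_p = math.floor(p * f0_est / fs * frame_size + 0.5)
        first = max(n_p - tooth_width, 0)
        last = min(n_p + tooth_width, num_bins - 1)
        for n in range(first, last + 1):
            gamma[n] = 0.0
    return gamma


def apply_comb(cpsd, gamma):
    """Suppress the CPSD components of the already estimated source."""
    return [value * g for value, g in zip(cpsd, gamma)]
```

For an estimated pitch of 200 Hz with a frame size of 4,096 samples at 48 kHz, five harmonics, and a tooth width of 1, this sketch zeroes three bins around each of the bins 17, 34, 51, 68, and 85.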

Figure 7

2D pattern $\rho(\varphi, f_0)$ of joint DOA and pitch estimation. (a) Original pattern for two concurrent sources with $f_{0,1} = 132$ Hz and $f_{0,2} = 175$ Hz and DOAs $\varphi_1 = 63°$ and $\varphi_2 = 124°$ without spectral comb filtering. (b, c) The patterns after spectral comb filtering for source 1 and source 2, respectively. The suppressing influence of the comb filtering is clearly visible. Real recorded vowel utterances from two different speakers were used as sources. The dotted crosses indicate the true source positions.

2.3.2 GCC-PHAT weighting

When using the core algorithm discussed in Section 2.2, the 2D pattern $\rho_{i\ell}(\varphi, f_0)$ exhibits a wide spread of the peaks with regard to the DOA $\varphi$. Similar to the cepstral weighting, which aims to improve the pitch estimation, we propose a second weighting function $w_{i\ell}(\varphi)$ which aims to improve the DOA estimation accuracy. This extension is derived from the GCC-PHAT algorithm [3], which is not used as a DOA estimator itself, but only as a weighting function of the 2D pattern $\rho_{i\ell}(\varphi, f_0)$, i.e.,

$$\begin{aligned}
\rho_{i\ell}^{\mathrm{phat}}(\varphi, f_0) &= \rho_{i\ell}(\varphi, f_0) \cdot w_{i\ell}(\varphi) && (29)\\
w_{i\ell}(\varphi) &= r_{i\ell}^{\mathrm{phat}}\left[ \frac{d_{i\ell} \cos(\varphi) \cdot f_s}{c} \right], && (30)
\end{aligned}$$

where $r_{i\ell}^{\mathrm{phat}}[k]$ denotes the resampled generalized cross-correlation between the microphone signals $i$ and $\ell$ using the phase transform (PHAT) weighting [3]. The weighting function $w_{i\ell}(\varphi)$ can be interpreted as a warped extract of the cross-correlation with respect to the DOA $\varphi$ and the microphone distance. In Figure 8, the upper graph depicts an example of a complete GCC-PHAT, whereas the lower graph only shows the relevant part for the DOA estimation, which is used as a weighting function.
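The mapping from a DOA candidate to a lag of the cross-correlation in Equation 30 can be sketched as follows. The paper reads the weight from a resampled GCC-PHAT; this sketch substitutes linear interpolation between integer lags as a simple stand-in. The speed of sound of 343 m/s, the parameter `zero_lag_index`, and the function names are assumptions made for illustration.

```python
import math


def doa_to_lag(doa_deg, mic_dist, fs, speed_of_sound=343.0):
    """Lag in samples that corresponds to a DOA for one microphone pair."""
    return mic_dist * math.cos(math.radians(doa_deg)) * fs / speed_of_sound


def gcc_weight(gcc, zero_lag_index, doa_deg, mic_dist, fs,
               speed_of_sound=343.0):
    """Weight w(doa): GCC-PHAT value at the lag belonging to the DOA.

    gcc is a list of cross-correlation values; zero_lag_index is the
    position of lag 0 in that list. Linear interpolation replaces the
    resampling used in the paper (assumption of this sketch).
    """
    position = zero_lag_index + doa_to_lag(doa_deg, mic_dist, fs,
                                           speed_of_sound)
    lower = math.floor(position)
    upper = min(lower + 1, len(gcc) - 1)
    frac = position - lower
    return (1.0 - frac) * gcc[lower] + frac * gcc[upper]
```

For a microphone distance of 0.22 m at 48 kHz, the DOA range from 0° to 180° maps to lags between about +30.8 and −30.8 samples, so only this central part of the cross-correlation contributes to the weighting.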

Figure 8

GCC-PHAT weighting. The upper graph depicts a complete GCC-PHAT of two speech signals at different DOAs, recorded with a single microphone pair ($d_{i\ell} = 22$ cm). The lower graph depicts only the DOA-relevant information according to Equation 30, whereby the x-axis is transformed to DOA values. The speech sources were located at 63° and 124° relative to the microphones.

Please note that in [25], a different GCC-PHAT extension was proposed, in which the central part of the unweighted cross-correlation is replaced with the GCC-PHAT weighted cross-correlation. Afterwards, in contrast to the GCC-PHAT weighting proposed here, a time-domain-based harmonic sieve is applied to the cross-correlation to obtain the 2D pattern $\rho_{i\ell}(\varphi, f_0)$.

2.3.3 Multi-channel cross-correlation

Up to now, we have discussed methods and extensions to compute the 2D pattern $\rho_{i\ell}^{\mathrm{phat}}(\varphi, f_0)$ using one microphone pair $i$ and $\ell$. An intuitive approach to combine multiple microphone pairs is the arithmetic mean of all 2D patterns, as already performed in [22]. However, averaging is sensitive to microphone malfunctions and mutual cancelation of opposite erroneous estimates. Therefore, we introduce a more sophisticated method based on the multi-channel cross-correlation (MCCC) [34], which exploits the redundancy among multiple microphone pairs and can be understood as the generalized multi-channel extension of the cross-correlation. We adapted the MCCC to the joint DOA and pitch estimation problem in order to generate an overall 2D pattern using multiple microphone pairs. First, an $M \times M$ matrix $\mathbf{P}(\varphi, f_0)$ with the 2D patterns of all microphone pairs is constructed, i.e.,

$$\mathbf{P}(\varphi, f_0) = \begin{pmatrix}
\rho_{11}(\varphi, f_0) & \rho_{12}(\varphi, f_0) & \cdots & \rho_{1M}(\varphi, f_0)\\
\rho_{21}(\varphi, f_0) & \rho_{22}(\varphi, f_0) & \cdots & \rho_{2M}(\varphi, f_0)\\
\vdots & \vdots & \ddots & \vdots\\
\rho_{M1}(\varphi, f_0) & \rho_{M2}(\varphi, f_0) & \cdots & \rho_{MM}(\varphi, f_0)
\end{pmatrix}, \tag{31}$$

which is a symmetric matrix, since $\rho_{i\ell}(\varphi, f_0) = \rho_{\ell i}(\varphi, f_0)$. Similarly to [34], the determinant $\det(\mathbf{P}(\varphi, f_0))$ of this matrix is subsequently used to define the overall 2D pattern, i.e.,

$$\rho(\varphi, f_0) = 1 - \det\left( \mathbf{P}(\varphi, f_0) \right). \tag{32}$$

Although it has been shown in [34] that the MCCC always lies between 0 and 1, this does not hold anymore for ρ(φ,f0) defined in Equation 32.

The adaptation of the MCCC algorithm has two advantages compared to the arithmetic mean. Firstly, it is robust against malfunctions of single microphones. In case of defective microphones, only the remaining microphones are taken into account for the estimation. Secondly, if two microphone signals match closely or perfectly (with regard to the considered pitch and DOA combination), the overall result becomes 1, independent of the remaining microphones, and opposite erroneous estimates no longer cancel each other.
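The determinant-based combination can be illustrated for a single (φ, f0) cell with the sketch below. It assumes pair values that are normalized so that the diagonal of the matrix equals 1, which the text does not specify, and it reads the overall pattern as 1 − det(P). The helper names `determinant` and `combine_pairs` are hypothetical.

```python
def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]  # work on a copy
    size = len(a)
    det = 1.0
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0              # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det              # a row swap flips the sign
        det *= a[col][col]
        for row in range(col + 1, size):
            factor = a[row][col] / a[col][col]
            for k in range(col, size):
                a[row][k] -= factor * a[col][k]
    return det


def combine_pairs(pair_values):
    """Overall pattern value for one (doa, f0) cell from the M x M matrix
    of pair-wise pattern values, read here as 1 - det(P)."""
    return 1.0 - determinant(pair_values)


# Microphones 1 and 2 match perfectly, microphone 3 is unrelated:
# two identical rows make the determinant vanish, so the result is 1.
perfect_match = combine_pairs([[1.0, 1.0, 0.0],
                               [1.0, 1.0, 0.0],
                               [0.0, 0.0, 1.0]])
```

In this toy case the result does not depend on the third microphone, whereas an identity matrix (no pair matches) yields 0.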

3 Evaluation

We have conducted experimental evaluations for different acoustic conditions and scenarios. Three different scenarios with increasing complexity are evaluated. In Scenario 1, two simultaneous speakers at fixed positions are simulated. In Scenario 2, the two speakers move stepwise while speaking. In the most difficult Scenario 3, the two speakers move stepwise on intersecting pathways. Measured and simulated room impulse responses (RIRs) are used to generate the microphone signals. Reverberation times RT60 ranging from 0 to 560 ms, SNRs from 0 to 20 dB, and noise-free simulations (SNR=∞) are used. A performance comparison between the core algorithm discussed in Section 2.2 and the extensions proposed in Section 2.3 will be presented in terms of DOA estimation hit rate $A_\varphi$ and pitch estimation hit rate $A_f$, as well as root-mean-square error (RMSE) of the DOA estimates.

A comparison with other state-of-the-art joint DOA and pitch estimators (cf. Section 1) was not conducted, since those algorithms assume single-source scenarios and do not support estimation of multiple sources without further extensions, which is beyond the scope of this paper.

3.1 Setup and performance measures

The evaluation was carried out for a conference room (cf. Figure 9) using a microphone line array with M=6 microphones (inter-microphone distance 0.22 m), resulting in 15 microphone pairs. A loudspeaker was used as signal source at nine different positions with a distance of approximately 3.3 m to the microphones at a similar height of 1.21 m to the microphones. The distance between the loudspeaker positions was 0.5 m. The real RIRs were measured at a sampling rate of 48 kHz using the sine sweep method [39]. The reverberation time of the conference room is approximately 560 ms with a direct-to-reverberant ratio (DRR) of 6.2 dB.

Figure 9

Description of the measurement setup used in the conference room, which belongs to the Fraunhofer project group HSA in Oldenburg. Room impulse responses were recorded using this setup and convolved with clean speech recordings used for the evaluation. (a) Measurement setup. (b) Measuring circuit.

To investigate the performance for different reverberation times, we used simulated RIRs that were generated with the image method [40],[41]. The same relative microphone and loudspeaker positions were used, but inside a simulated rectangular room of size l=4.6 m × w=5.1 m × h=2.5 m. Reverberation times of RT60={0, 100, 250, and 560 ms} with direct-to-reverberant ratios of DRR={∞, 9.1, 2.5, and −3.6 dB} were simulated.

The clean microphone signals were generated by convolving the measured or simulated RIRs with clean speech recordings which consisted of male and female speech in German and English. Uncorrelated speech-shaped noise was used as interference signal, played back from all loudspeakers simultaneously. Speech and noise recordings were mixed at six different broadband SNR values (measured at the first microphone) ranging from −5 to 20 dB.

Only the time frames labeled to contain active speech, determined using a VAD [36], are considered for the joint DOA and pitch estimation. These time frames are not further distinguished into voiced or unvoiced speech, assuming the most dominant part of speech is voiced speech.

For the STFT processing, the frame size was set to 85 ms (4,096 samples at 48 kHz sampling rate) using a von Hann window with an overlap of 50%. The resulting spectrum was subdivided into G=64 partly overlapping gammatone-like weighted subspectra. It should be noted that the choice of the frame size is a trade-off between frequency resolution of the harmonic sieve (cf. Equation 12) and tracking capability of the particle filter (cf. Equations 18 and 20).

The smoothing parameter for the CPSD estimation in Equation 7 was set to α=0.1, which is chosen quite low to better deal with simultaneous speech sources. The number of considered harmonics in Equation 12 was set to P=5. In practice, the number of harmonics P is unknown or changes over time and has to be estimated [18]–[20],[32]. The parameter β for the phase transform in Equation 16 was set to β=0.2. The teeth width of the proposed spectral comb filter extension in Equation 27 was set to β c =1.

In order to generate the 2D pattern ρ(φ,f0) in Equation 13, the DOA values were analyzed from 0° to 180° with an interval of 1°, where 90° is perpendicular to the microphone axis (cf. Figure 3), and pitch frequencies were analyzed from 70 to 280 Hz with an interval of 1 Hz. Due to the test setup from Figure 9, only the estimates in front of the array are considered to be valid source positions.

For the particle filter, we used U=50 particles per source to simulate and track the source motion over time, where the steady-state angular velocity was set to ω̄=1 rad/s; for the DOA shift prediction value, we used β φ =10 s⁻¹, and for the pitch shift prediction, we used β f =5 Hz/s.

The resulting performance was measured in terms of the hit rates A φ (DOA) and A f (pitch) over all processed time frames, with a fault tolerance of Δ φ =±10° and Δ f =±10 Hz relative to the true source characteristics. Results with a smaller tolerance interval, i.e., Δ φ =±5° and Δ f =±5 Hz, have also been calculated; they lead to lower hit rates overall but show the same performance ranking of the algorithms under test. Although beamformers can be designed to have a narrower beam width, a tolerance interval of Δ φ =±10° and Δ f =±10 Hz allows for more robust beamformers in the case of erroneous DOA estimates.
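A hit rate of this kind can be computed as in the sketch below. The function name `hit_rate` and the toy estimates are assumptions for illustration; how estimates are assigned to sources when several speakers are active is not covered here, so the example treats a single source.

```python
def hit_rate(estimates, truths, tolerance):
    """Percentage of frames whose estimate lies within +/- tolerance of the truth."""
    if len(estimates) != len(truths) or not estimates:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(1 for e, t in zip(estimates, truths) if abs(e - t) <= tolerance)
    return 100.0 * hits / len(estimates)


# Toy per-frame DOA estimates (degrees) for a source fixed at 60 degrees.
doa_est = [62.0, 58.0, 90.0, 68.0, 60.0]
doa_true = [60.0] * len(doa_est)

# Toy per-frame pitch estimates (Hz) against a ground-truth pitch of 120 Hz.
f0_est = [118.0, 131.0, 125.0, 240.0]
f0_true = [120.0] * len(f0_est)

a_phi_10 = hit_rate(doa_est, doa_true, tolerance=10.0)
a_phi_5 = hit_rate(doa_est, doa_true, tolerance=5.0)
a_f_10 = hit_rate(f0_est, f0_true, tolerance=10.0)

print(a_phi_10, a_phi_5, a_f_10)
```

In this toy data the tighter tolerance lowers the DOA hit rate, in line with the overall reduction reported for the ±5° and ±5 Hz intervals.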

Please note that the exact pitch of real speech signals, required to calculate the hit rate A f , is unknown and can only be estimated. We used the overall mean of the PEFAC pitch estimate [17] of the clean speech signal as ground-truth pitch.

3.1.1 Scenario 1: two speech sources at fixed positions

In Scenario 1, we investigated the influence of each proposed extension on the hit rates A φ and A f separately. We chose a scenario in which two persons (male and female) were speaking simultaneously at fixed positions. The signals were about 5 s long, with each speaker uttering one sentence. These signals were processed with different extensions enabled, resulting in the five different setups shown in Table 1.

Table 1 Setup specification of used extensions

We performed simulations for an SNR of 15 dB and for the noiseless case (SNR=∞), and for reverberation times RT60=0 ms and 560 ms. The results in Figure 10 are shown separately for DOA and pitch and for each source.

Figure 10

Hit rates in terms of DOA ( A φ ) and pitch ( A f ). The rates are for two simultaneous speakers at fixed positions. Left panels show the results for the noiseless case (SNR=∞); right panels show the results for SNR=15 dB. The top panels show the results for the anechoic environment (RT60=0 ms); the bottom panels show the results for the reverberant environment (RT60=560 ms). The scenarios were processed with the core algorithm only (Setup I), with single extensions activated, and up to the proposed algorithm (Setup V) (cf. Table 1).

As seen from Figure 10, the core algorithm in Setup I performs moderately at SNR=∞ and in an anechoic environment, but deteriorates quickly in adverse conditions for both DOA and pitch estimation. Setup I seems to be particularly susceptible to reverberation.

With the spectral comb filter activated in Setup II, the two sources are estimated equally well for the scenarios without reverberation (top panels in Figure 10). Unfortunately, no improvement is observable for the scenarios with RT60=560 ms (bottom panels in Figure 10). Nevertheless, if the spectral comb filter is missing, as in Setup I, we can see that the algorithm preferably estimates the dominant source. Hence, in our implementation, this filter is beneficial for tracking two sources simultaneously. Considering that we are exploring multi-source localization, in all following scenarios the spectral comb filter is activated by using Setups II to V.

Using Setup III, which focuses on the GCC-PHAT weighting as our second extension, we can observe that the DOA estimate in particular improves considerably in all four SNR and RT60 combinations compared to Setup I. In comparison to Setups I and II, no substantial difference in the pitch estimation can be identified.

In the case of low reverberation, the MCCC extension (Setup IV) also improves the DOA estimation compared to Setup I, but its performance deteriorates strongly for larger reverberation times. The simulation results of Setup V, in which all proposed extensions are enabled, show a good overall performance for all acoustic conditions. It seems that the GCC-PHAT weighting has the largest influence on the DOA estimation, in that the DOA hit rates A φ of Setup III are equal to or better than those of Setup V. However, especially for RT60=560 ms and SNR=15 dB, Setup V shows the best hit rate compared to all other setups.

Figure 11 shows the pitch estimates for all processed time frames. It can be observed that the pitch estimates are less scattered using Setup V in Figure 11b, compared to Setup II used in Figure 11a. This leads to a more robust DOA estimation even if the estimated pitch does not correspond to the true value. Again, the pitch estimation is not very accurate, but it is still beneficial in case of multi-source scenarios to improve the source differentiation.

Figure 11

Pitch estimates f̃0[λ] (cf. Equation 25) for two concurrent speakers. (a) Setup II. (b) Setup V. Results for RT60=560 ms and SNR=15 dB. The dashed lines indicate the tolerance interval around the true pitch values.

At an SNR of 15 dB and a reverberation time RT60 of 560 ms, we obtained DOA hit rates of A φ =72% with Setup V, compared to DOA hit rates of 5% obtained with Setup II.

3.1.2 Scenario 2: two stepwise moving speech sources

Since we aim at real-world scenarios, moving sources are considered in the second scenario. Two concurrent speech sources move towards one another in a stepwise manner, heading to the middle of the monitored area (cf. Figure 9). We simply switched between adjacent loudspeaker positions to simulate the movements of the sources. At every new position, the speaker repeated the same sentence. The results shown in Figure 12 are calculated with Setup V (bottom panel of Figure 12) and with Setup II (top panel of Figure 12).

Figure 12

Hit rates A φ for two concurrent speakers moving towards each other. The top panels show the results for the core algorithm including spectral comb filter (Setup II); the bottom panels show the results for the proposed algorithm (Setup V). Columns (a) and (b) show the results for RT60=0 ms and RT60=560 ms (measured RIR) and varying SNR values. Column (c) shows the results for SNR=∞ and several simulated RIRs. The rightmost bar in each panel shows the mean hit rate Ā φ for the respective condition.

Figure 12(a) shows the results for several SNR conditions from 20 to −5 dB without reverberation. It is apparent that the proposed algorithm (Setup V) outperforms Setup II for all conditions, resulting in mean hit rates over all SNRs of Ā φ =87.1% for Setup V (bottom panel in column (a)) and 55.3% for Setup II (top panel in column (a)). The proposed algorithm (Setup V) achieves high hit rates even at low SNR.

Figure 12(b) shows the results for the measured RIRs with reverberation time RT60=560 ms and varying SNRs. For both setups, II and V, the hit rate decreases compared to RT60=0 ms. However, the proposed algorithm (Setup V) still outperforms Setup II by 25.7% on average over all conditions. Figure 12(c) shows the results for different reverberation times without noise (SNR=∞). Comparing the mean hit rates Ā φ , the proposed algorithm (Setup V) achieves a hit rate of 87.7% and surpasses the core algorithm with the spectral comb filter (Setup II) by 37%. Additionally, Figure 13 shows the RMSEs of the DOA estimates for the same setups (II and V) and conditions as in Figure 12.

Figure 13

RMSE (in degrees) for two concurrent speakers moving towards each other. The top panels show the results for the core algorithm including spectral comb filter (Setup II); the bottom panels show the results for the proposed algorithm (Setup V). Columns (a) and (b) show the results for RT60=0 ms and RT60=560 ms (measured RIR) and varying SNR values. Column (c) shows the results for SNR=∞ and several simulated RIRs. The rightmost bar in each panel shows the mean RMSE for the respective condition.

For RT60=0 ms (Figure 13(a)), Setup II achieves a mean RMSE of 15.7°, whereas the mean RMSE of Setup V decreases to 8.5°. For RT60=560 ms (Figure 13(b)), the mean RMSE for both setups increases to 19.6° for Setup II and to 15.9° for Setup V. For different reverberation times (Figure 13(c)), Setup V achieves a mean RMSE of 7.9°, which is 10° better than the mean RMSE of Setup II.
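For reference, the RMSE of a sequence of DOA estimates can be computed as in the following sketch, which uses the standard definition. The function name `rmse` and the toy values are assumptions for illustration; how frames and sources are pooled for the reported means is not specified in this section.

```python
import math


def rmse(estimates, truths):
    """Root-mean-square error between two equally long sequences."""
    if len(estimates) != len(truths) or not estimates:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(e - t) ** 2 for e, t in zip(estimates, truths)]
    return math.sqrt(sum(squared) / len(squared))


# Toy DOA estimates (degrees) for a source at 60 degrees; one frame is an
# outlier that points towards the broadside direction (90 degrees).
doa_est = [62.0, 58.0, 90.0]
doa_true = [60.0, 60.0, 60.0]

doa_rmse = rmse(doa_est, doa_true)
print(f"RMSE: {doa_rmse:.1f} degrees")
```

Unlike the hit rate, the RMSE grows with the size of each error, so a single large outlier such as the one above dominates the result. The two measures therefore complement each other.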

Figure 14 shows the DOA estimates for all processed time frames for the condition SNR=15 dB and RT60=560 ms. It can be observed that the proposed algorithm (Setup V) achieves less scattered estimates than Setup II.

Figure 14

DOA estimates φ̃[λ] for two speech sources moving towards each other. (a) Setup II. (b) Setup V. Results for SNR=15 dB and RT60=560 ms (measured RIR). Setup V shows a significantly better performance than Setup II. The dashed lines indicate the tolerance interval around the true DOA value.

The hit rates A φ in Scenario 2 are considerably higher than in Scenario 1, especially for the core algorithm with the spectral comb filter (Setup II). This is because wrong DOA estimates tend to be located in the frontal direction (around φ=90°), as can be seen in Figure 14a. Because the sources move towards φ=90° in this scenario, erroneous estimates are sometimes counted as correct hits, e.g., for time frames around λ=400; this does not indicate a more reliable estimation but still increases the A φ value. At an SNR of 15 dB and a reverberation time RT60 of 560 ms, we obtained DOA hit rates of A φ =73% with Setup V, compared to DOA hit rates of A φ =40% obtained with Setup II.

3.1.3 Scenario 3: two intersecting speech sources

The proposed algorithm is intended to be capable of tracking stepwise intersecting sources by using the particle filter. Therefore, in the third scenario, we considered two speakers on crossing paths while speaking. The intersecting source movement can be considered as the most ambitious, but also the most realistic scenario in this evaluation. The movement of the two concurrent speakers was, again, simulated with a stepwise switching between subsequent loudspeaker positions. Hence, at a certain position, the two speech signals were emitted by a single loudspeaker. Similar to Scenario 2, we performed three experiments in which either the reverberation time RT60 or the SNR was kept constant and the other value varied over the range of interest.

The results for the third scenario are shown in Figure 15. It can be observed that the mean hit rate decreases slightly in all conditions, compared to the results in Scenario 2. Again, the proposed algorithm (Setup V) outperforms the core algorithm with the spectral comb filter (Setup II) in all conditions, e.g., the mean hit rate Ā φ for RT60=560 ms (cf. Figure 15(b)) is 54.6% for Setup V and only 34.5% for Setup II.

Figure 15

Hit rates A φ for two concurrent speakers on crossing paths. The top panels show the results for the core algorithm including spectral comb filter (Setup II); the bottom panels show the results for the proposed algorithm (Setup V). Columns (a) and (b) show the results for RT60=0 ms and RT60=560 ms (measured RIR) and varying SNR values. Column (c) shows the results for SNR=∞ and several simulated RIRs. The rightmost bar in each panel shows the mean hit rate Ā φ for the respective condition.

The corresponding RMSE is shown in Figure 16. It can be observed that the proposed Setup V achieves a lower mean RMSE in all conditions: Setup II achieves mean RMSEs of 13.6° (RT60=0 ms), 17° (RT60=560 ms), and 17.2° (SNR=∞), whereas Setup V achieves mean RMSEs of 11.3°, 15.6°, and 10.4°, respectively.

Figure 16

RMSE (in degrees) for two concurrent speakers on crossing paths. The top panels show the results for the core algorithm including spectral comb filter (Setup II); the bottom panels show the results for the proposed algorithm (Setup V). Columns (a) and (b) show the results for RT60=0 ms and RT60=560 ms (measured RIR) and varying SNR values. Column (c) shows the results for SNR=∞ and several simulated RIRs. The rightmost bar in each panel shows the mean RMSE for the respective condition.

Figure 17 shows the DOA estimates for all processed time frames λ for an SNR of 15 dB and a reverberation time of 560 ms. The corresponding DOA hit rates are A φ =63.2% for Setup V and A φ =36.6% for Setup II.

Figure 17

DOA estimates φ̃[λ] for two concurrent speakers on crossing paths. (a) Setup II. (b) Setup V. Results for SNR=15 dB and RT60=560 ms (measured RIR). Cross-over takes place at time frames 400<λ<550. The proposed algorithm (Setup V) is able to properly track the movement.

Due to the spectral comb filter, it is impossible for the proposed algorithm to exactly estimate two sources coming from a single direction. Nevertheless, Figure 17 shows that the proposed algorithm is still capable of estimating the movement of intersecting speakers. In particular, when the speakers are crossing (at frames 400<λ<550), it can be seen that the proposed algorithm estimates the sources to be in close proximity to each other; however, the estimates never overlap.

4 Conclusion

In this paper, several extensions to the core joint DOA and pitch estimation algorithm were proposed, which were shown to increase the robustness and hit rate even in difficult acoustic situations. In particular, the GCC-PHAT weighting achieves a considerable improvement of the DOA estimation accuracy. To cope with multi-speaker situations, the spectral comb filter was proposed, which makes the proposed method less affected by dominant sources, so that the DOA and pitch of all sources are estimated to a more similar extent. Furthermore, the MCCC extension improves the robustness and accuracy and, in addition, makes the algorithm less sensitive to microphone malfunctions. Even intersecting sources can be tracked by using a particle filter.

At an SNR of 15 dB and a reverberation time RT60 of 560 ms, the proposed algorithm (Setup V) achieved DOA hit rates of A φ =72%, 73%, and 63.2% for two fixed, moving, and intersecting speech sources, respectively, compared to A φ =5%, 40%, and 36.6% achieved with the core algorithm including the spectral comb filter (Setup II).
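The differences between the two setups follow directly from these hit rates, as the short calculation below shows. The dictionary names are assumptions for this example; the values are the ones quoted above.

```python
# DOA hit rates in percent at SNR = 15 dB and RT60 = 560 ms, as quoted above.
setup_v = {"fixed": 72.0, "moving": 73.0, "intersecting": 63.2}
setup_ii = {"fixed": 5.0, "moving": 40.0, "intersecting": 36.6}

# Absolute improvement of Setup V over Setup II in percentage points.
gain_points = {name: round(setup_v[name] - setup_ii[name], 1) for name in setup_v}

for name, points in gain_points.items():
    print(f"{name}: +{points} percentage points")
```

The value for the moving sources, 33 percentage points, appears to correspond to the increase of about 33% stated in the abstract, which suggests that the figure is an absolute difference in hit rate rather than a relative one.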

References

  1. Chen J, Benesty J, Huang Y: Time delay estimation in room acoustic environments: an overview. EURASIP J. Appl. Signal Proces 2006, 26503(1):1-19.

  2. Madhu N, Martin R: Acoustic source localization with microphone arrays. In Advances in Digital Speech Transmission. Edited by: Martin R, Heute U, Antweiler C. Wiley, Chichester, UK; 2008, 135-170. http://dx.doi.org/10.1002/9780470727188.ch6

  3. Knapp C, Carter G: The generalized correlation method for estimation of time delay. IEEE Trans. Acoust. Speech Signal Processing 1976, 24: 320-327. 10.1109/TASSP.1976.1162830

  4. D Bechler, K Kroschel, in Proceedings of the International Workshop on Acoustic Echo and Noise Cancellation (IWAENC). Considering the second peak in the GCC function for multi-source TDOA estimation with a microphone array (Kyoto, Japan, Sept. 2003), pp. 315–318.

  5. A Brutti, M Omologo, P Svaizer, in Hands-Free Speech Communication and Microphone Arrays, HSCMA. Comparison between different sound source localization techniques based on a real data collection (Trento, Italy, 2008), pp. 69–72. doi:10.1109/HSCMA.2008.4538690.

  6. Scheuing J, Yang B: Correlation-based TDOA-estimation for multiple sources in reverberant environments. In Signals and Communication Technology: Speech and Audio Processing in Adverse Environments. Edited by: Hänsler E, Schmidt G. Springer, Berlin, Germany; 2008, 381-416. http://dx.doi.org/10.1007/978-3-540-70602-1_11

  7. B Kwon, Y Park, Y Park, in ICROS-SICE International Joint Conference. Multiple sound sources localization using the spatially mapped GCC functions (Fukuoka, Japan, 2009), pp. 1773–1776.

  8. Liu C, Wheeler BC, O’Brien WD, Bilger RC, Lansing CR, Feng AS: Localization of multiple sound sources with two microphones. J. Acoust. Soc. Am 2000, 108(4):1888-1905. 10.1121/1.1290516

  9. Benesty J: Adaptive eigenvalue decomposition algorithm for passive source localization. J. Acoust. Soc. Am 2000, 107(1):384-391. 10.1121/1.428310

  10. Doclo S, Moonen M: Robust adaptive time delay estimation for speaker localization in noisy and reverberant acoustic environments. EURASIP J. Appl. Signal Proces 2003, 11: 1110-1124. 10.1155/S111086570330602X

  11. Schmidt R: Multiple emitter location and signal parameter estimation. IEEE Trans. Antennas Propagation 1986, 34(3):276-280. doi:10.1109/TAP.1986.1143830

  12. Ampeliotis D, Berberidis K: Low complexity multiple acoustic source localization in sensor networks based on energy measurements. Signal Proces 2010, 90(4):1300-1312. doi:10.1016/j.sigpro.2009.10.015

  13. W Cui, Z Cao, J Wei, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4. Dual-microphone source location method in 2-D space (Toulouse, France, May 2006), pp. 845–848. doi:10.1109/ICASSP.2006.1661101.

  14. Ho KC, Sun M: Passive source localization using time differences of arrival and gain ratios of arrival. IEEE Trans. Signal Proces 2008, 56(2):464-477. doi:10.1109/TSP.2007.906728

  15. DP Morgan, EB George, LT Lee, SM Kay, in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1. Co-channel speaker separation (Detroit, USA, May 1995), pp. 828–831.

  16. D Sharma, PA Naylor, Evaluation of pitch estimation in noisy speech for application in non-intrusive speech quality assessment, (Glasgow, Scotland, Aug. 2009).

  17. Gonzalez S, Brookes M: PEFAC - a pitch estimation algorithm robust to high levels of noise. IEEE/ACM Trans. Audio, Speech Lang. Proces 2014, 22(2):518-530. doi:10.1109/TASLP.2013.2295918

  18. Christensen M, Højvang L, Jakobsson A, Jensen S: Joint fundamental frequency and order estimation using optimal filtering. EURASIP J. Adv. Signal Proces 2011, 2011(1):1-18. 10.1186/1687-6180-2011-13

  19. Nielsen J, Christensen M, Jensen S: Default Bayesian estimation of the fundamental frequency. IEEE Trans. Audio Speech Lang. Proces 2013, 21(3):598-610. doi:10.1109/TASL.2012.2229979

  20. Nielsen JK, Christensen MG, Cemgil AT, Jensen SH: Bayesian model comparison with the g-prior. IEEE Trans. Signal Proces 2014, 62(1):225-238. 10.1109/TSP.2013.2286776

  21. Christensen MG, Jakobsson A: Multi-pitch estimation. In Synthesis Lectures on Speech & Audio Processing. Edited by: Juang BH. Morgan & Claypool, San Rafael; 2009. http://dx.doi.org/10.2200/S00178ED1V01Y200903SAP005

  22. M Wohlmayr, M Képesi, in 8th Conference of the International Speech Communication Association, Interspeech. Joint position-pitch extraction from multichannel audio (Antwerp, Belgium, Aug. 2007), pp. 1629–1632.

  23. T Habib, M Képesi, L Ottowitz, in 5th IEEE Sensor Array and Multichannel Signal Processing Workshop (SAM). Experimental evaluation of the joint position-pitch estimation (POPI) algorithm in noisy environments (Darmstadt, Germany, July 2008), pp. 369–372.

  24. M Képesi, L Ottowitz, T Habib, in Hands-Free Speech Communication and Microphone Arrays (HSCMA). Joint position-pitch estimation for multiple speaker scenarios (Trento, Italy, May 2008), pp. 85–88. doi:10.1109/HSCMA.2008.4538694.

  25. T Habib, L Ottowitz, M Képesi, in 9th Conference of the International Speech Communication Association, Interspeech. Experimental evaluation of multi-band position-pitch estimation (M-PoPi) algorithm for multi-speaker localization (Brisbane, Australia, Sept. 2008), pp. 1317–1320.

  26. T Habib, H Romsdorfer, in 13th International Conference on Digital Audio Effects (DAFX). Comparison of SRP-PHAT and multiband-Popi algorithms for speaker localization using particle filters (Graz, Austria, Sept. 2010).

  27. Habib T, Romsdorfer H: Auditory inspired methods for localization of multiple concurrent speakers. Comput. Speech Lang. Spec. Issue Speech Sep. Recognit. Multisource Environ 2013, 27(3):634-659. doi:10.1016/j.csl.2012.09.003

  28. SN Wrigley, GJ Brown, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Recurrent timing neural networks for joint F0-localization based speech separation (Hawaii, USA, April 2007).

  29. JR Jensen, MG Christensen, SH Jensen, in European Signal Processing Conference, EUSIPCO. Joint DOA and fundamental frequency estimation methods based on 2-D filtering (Aalborg, Denmark, Aug. 2010), pp. 2091–2095.

  30. Z Zhou, MG Christensen, JR Jensen, HC So, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Joint DOA and fundamental frequency estimation based on relaxed iterative adaptive approach and optimal filtering (Vancouver, Canada, May 2013), pp. 6812–6816. doi:10.1109/ICASSP.2013.6638981.

  31. Zhang J, Christensen M, Jensen S, Moonen M: Joint DOA and multi-pitch estimation based on subspace techniques. EURASIP J. Adv. Signal Proces 2012, 2012: 1. 10.1186/1687-6180-2012-1

  32. S Karimian-Azari, JR Jensen, MG Christensen, in European Signal Processing Conference, EUSIPCO. Fast joint DOA and pitch estimation using a broadband MVDR beamformer (Marrakech, Morocco, Sept. 2013).

  33. Jensen JR, Christensen MG, Jensen SH: Nonlinear least squares methods for joint DOA and pitch estimation. IEEE Trans. Audio Speech Lang. Proces 2013, 21(5):923-933. doi:10.1109/TASL.2013.2239290

  34. Benesty J, Chen J, Huang Y: Time-delay estimation via linear interpolation and cross correlation. IEEE Trans. Speech Audio Proces 2004, 12(5):509-519. 10.1109/TSA.2004.833008

  35. Vary P, Martin R: Digital Speech Transmission: Enhancement, Coding and Error Concealment. Wiley, Chichester; 2006.

  36. I Shafran, R Rose, in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1. Robust speech detection and segmentation for real-time ASR applications (Hong Kong, China, April 2003), pp. 432–435.

  37. Ward DB, Lehmann EA, Williamson RC: Particle filtering algorithms for tracking an acoustic source in a reverberant environment. IEEE Trans. Speech Audio Proces 2003, 11(6):826-836. 10.1109/TSA.2003.818112

  38. Arulampalam MS, Maskell S, Gordon N, Clapp T: A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans. Signal Proces 2002, 50(2):174-188. doi:10.1109/78.978374

  39. Müller S, Massarani P: Transfer-function measurement with sweeps. J. Audio Eng. Soc. (AES) 2001, 49(6):443-471.

  40. Allen JB, Berkley DA: Image method for efficiently simulating small-room acoustics. J. Acoust. Soc. Am 1979, 65(4):943-950. 10.1121/1.382599

  41. E Habets, Room impulse response generator. Internal Report (2010). http://home.tiscali.nl/ehabets/rir_generator.html. Accessed 18 March 2014.

Acknowledgements

This work was partly supported by the Research Unit FOR 1732 ‘Individualized Hearing Acoustics’, funded by the German Research Foundation (DFG), and EcoShopping ‘Energy efficient & cost competitive retrofitting solutions for shopping buildings’, grant no. 609180, funded by the European Commission.

Author information

Corresponding author

Correspondence to Stephan Gerlach.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0), which permits use, duplication, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Gerlach, S., Bitzer, J., Goetze, S. et al. Joint estimation of pitch and direction of arrival: improving robustness and accuracy for multi-speaker scenarios. J AUDIO SPEECH MUSIC PROC. 2014, 31 (2014). https://doi.org/10.1186/s13636-014-0031-8

  • DOI: https://doi.org/10.1186/s13636-014-0031-8

Keywords