
Learning-based robust speaker counting and separation with the aid of spatial coherence

Abstract

A three-stage approach is proposed for speaker counting and speech separation in noisy and reverberant environments. In the spatial feature extraction stage, a spatial coherence matrix (SCM) is computed using whitened relative transfer functions (wRTFs) across time frames. The global activity functions of each speaker are estimated from a simplex constructed using the eigenvectors of the SCM, while the local coherence functions are computed from the coherence between the wRTFs of a time-frequency bin and the global activity function-weighted RTF of the target speaker. In the speaker counting stage, we use the eigenvalues of the SCM and the maximum similarity of the interframe global activity distributions between two speakers as the input features to the speaker counting network (SCnet). In the speaker separation stage, a global and local activity-driven network (GLADnet) is used to extract each independent speaker signal, which is particularly useful for highly overlapping speech signals. Experimental results obtained from real meeting recordings show that the proposed system achieves superior speaker counting and speaker separation performance compared to previous publications, without prior knowledge of the array configuration.

1 Introduction

Blind speech separation (BSS) involves the extraction of individual speech sources from a mixed signal without prior knowledge of the speakers and mixing systems [1]. BSS finds application in smart voice assistants, hands-free teleconferencing, automatic meeting transcription, etc., where only mixed signals from single or multiple microphones are available. Several BSS algorithms have been developed based on different assumptions about the characteristics of the speech sources and the mixing systems [2,3,4,5,6,7,8,9]. Learning-based BSS approaches have recently received increased research attention due to advances in deep learning hardware and software. Promising results have been obtained using single-channel neural networks (NNs) [10,11,12,13,14,15]. To further improve separation performance, techniques that exploit the spatial information embedded in microphone array signals have begun to emerge [16,17,18,19]. However, most of these BSS techniques assume that the number of speakers is known prior to separation. As a key step prior to speaker separation, speaker counting [20] is examined next.

Some studies have assumed the maximum number of speakers during speaker separation [15, 21,22,23]. Another approach is to extract speech signals in a recursive manner [24,25,26], where the BSS problem is tackled by a multi-pass source-extraction procedure based on a recurrent neural network (RNN). In contrast to the previous methods that use implicit speaker counting for separation, a multi-decoder DPRNN [27] uses a count-head to infer the number of speakers and multiple decoder heads to separate the signals. A speaker counting technique has been proposed using a scheme that alternates between speech enhancement and speaker separation [28]. Instead of exhaustive separation, one can selectively extract only the target speech signal, with the help of auxiliary information such as video images [29, 30], pre-enrolled utterances [31,32,33], and the location of the target speaker [34,35,36,37]. Although the target speaker extraction approach leads to significant performance improvements, the auxiliary information may not always be accessible. To overcome this problem, the speaker activity-driven speech extraction neural network [38] has been proposed to facilitate target speaker extraction by monitoring speaker activity. However, this network is susceptible to adverse acoustic conditions because it relies on speaker activity information alone. In such circumstances, multichannel approaches may be more advantageous than single-channel approaches. For example, deep clustering-based speaker counting and mask estimation have been incorporated into masking-based linear beamforming for speaker separation tasks [39]. Chazan et al. presented the use of a deep-neural network (DNN)-based single-microphone concurrent speaker detector for source counting, followed by beamformer coefficient estimation for speaker separation [40, 41].

Despite the promising results obtained with DNN-based approaches, most network models require a large amount of data for training. Another limitation is that identical array configurations are preferred in the training and test phases. Therefore, DSP-based approaches may have certain advantages [42]. Laufer-Goldshtein et al. proposed the global and local simplex separation algorithm by exploiting the correlation matrix of relative transfer functions (RTFs) across time frames [43]. The number of speakers is determined from the eigenvalue decay of the correlation matrix. The activity probabilities of each speaker are estimated from the simplex formed by the eigenvectors. In the separation stage, a spectral mask is computed for the identified dominant speakers, followed by spatial beamforming and post-filtering. Although the simplex-based approach is very effective in most cases, it does not work well for low-activity speakers [44].

In general, the DNN-based approaches show promise but require extensive training data and may not generalize well to unseen array configurations. The DSP-based approaches require no training and often allow for low-resource implementation, but their performance depends on the array configuration. While the deep clustering-based speaker counting and mask estimation methods [39,40,41] are also array-configuration-agnostic, their speaker counting relies on a single-channel input feature, which can degrade counting performance in adverse acoustic conditions. Furthermore, the separation performance of these methods depends on the array configuration used.

The goal of this study is twofold. First, we reformulate a spatial feature that significantly improves the performance and robustness of source counting and separation. Second, we seek to leverage the strengths of DSP-based and learning-based methods for improved speaker counting and speaker separation performance, with robustness to unseen room impulse responses (RIRs) and array configurations. Inspired by the work of Gannot et al. [43, 45], which is a purely DSP-based approach, we propose a robust speaker counting and activity-driven speaker separation algorithm that combines statistical preprocessing and a neural network back-end. We formulate a modified spatial coherence matrix based on whitened relative transfer functions (wRTFs) as a spatial signature of directional sources. The whitening procedure provides spectrally rich phase information that proves to be a robust spatial signature for dealing with mismatched array configurations. In the speaker counting stage, our approach attempts to reliably estimate the number of active speakers in low-SNR and low-activity scenarios by incorporating eigenvalues from the spatial coherence matrix and the maximum similarity between the global activity distributions. In the speaker separation stage, the local coherence functions of each speaker are computed using the coherence between the wRTFs of each time-frequency (TF) bin and that weighted by the corresponding global activity function. The target masks for each speaker are estimated using a global and local activity-driven network (GLADnet), which remains effective for “mismatched” RIRs and array configurations not included in the training data.

We train our DNN models with RIRs simulated using the image-source method [46], while the trained models are tested using the measured RIRs recorded at Bar-Ilan University [47]. Real-life recordings from the LibriCSS meeting corpus [48] are also used to validate the proposed separation networks. In this study, the proposed speaker counting and speaker separation algorithms are compared with the simplex-based methods developed by Laufer-Goldshtein et al. [43] in terms of F1 scores and confusion matrices. Perceptual evaluation of speech quality (PESQ) [49] and word error rate (WER) are adopted as the performance measures in speaker separation tasks.

While inspired by Ref. [43], this study presents three main contributions that differ from the previous work. First, a learning-based robust speaker counting and activity-driven speaker separation algorithm is developed. Second, a modified spatial coherence matrix is formulated to effectively capture the spatial information of independent speakers. A novel idea based on the maximum similarity between the global activity distribution of two speakers over time frames is explored as an input feature for speaker counting. Third, an array configuration-agnostic GLADnet informed by the global and local speaker activities is proposed.

The remainder of this paper is organized as follows. Section 2 presents the problem formulation and a brief review of the simplex-based approach, which is used as the baseline in this study. Section 3 presents the proposed speaker counting and speaker separation system. In Section 4, we compare the proposed system with several baselines through extensive experiments. Section 5 concludes the paper.

2 Problem formulation and the baseline approach

2.1 Problem formulation

Consider a scenario in which the utterances of J speakers are captured by M distant microphones in a reverberant room. We assume that there is no prior knowledge of the array configuration. The array signal model is described in the short-time Fourier transform (STFT) domain. The received signal at the mth microphone can be written as

$${X}^{m}\left(l, f\right)=\sum_{j=1}^{J}{A}_{j}^{m}\left(f\right){S}_{j}\left(l,f\right)+{V}^{m}\left(l, f\right)$$
(1)

where l and f denote the time frame index and frequency bin index, respectively; \({A}_{j}^{m}\left(f\right)\) denotes the acoustic transfer function (ATF) between the mth microphone and the jth speaker; \({S}_{j}\left(l, f\right)\) denotes the signal of the jth speaker; and \({V}^{m}\left(l,f\right)\) denotes the additive sensor noise. This study aims to estimate the number of speakers J (speaker counting) and extract independent speaker signals from the microphone mixture signals without information about the sources and the mixing process.
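For concreteness, the mixing model in (1) translates into a few lines of array code. The following minimal NumPy sketch (array shapes and the function name are our own, purely for illustration) forms the received STFT \({X}^{m}\left(l,f\right)\) from per-speaker spectrograms, ATFs, and sensor noise:

```python
import numpy as np

def mix_stft(S, A, V):
    """Sketch of Eq. (1): X^m(l,f) = sum_j A_j^m(f) S_j(l,f) + V^m(l,f).

    S: list of J speaker spectrograms, each of shape (L, F)
    A: list of J ATF arrays, each of shape (M, F)
    V: sensor-noise STFT of shape (M, L, F)
    Returns the M-channel mixture STFT of shape (M, L, F).
    """
    X = V.astype(complex).copy()
    for S_j, A_j in zip(S, A):
        X += A_j[:, None, :] * S_j[None, :, :]   # broadcast over time frames
    return X
```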

2.2 Baseline method: the simplex-based approach

In this section, we present the baseline by revisiting [43]. The simplex-based approach [43, 44] is based on the global and local simplex representations and relies on the assumption of speech sparsity in the STFT domain [50]. Under the sparsity assumption, each TF bin is dominated by either a single speaker or noise. The ideal indicator for each TF bin can be expressed as follows:

$${I}_{j}\left(l,f\right)=\left\{\begin{array}{ll}1& j\text{th speaker is dominant}\\ 0& \mathrm{otherwise}\end{array}\right.$$
(2)

If a TF bin is not dominated by any speakers, such a TF bin will be dominated by noise, i.e., \({\sum }_{j=1}^{J}{I}_{j}\left(l, f\right)=0\). Let \({p}_{j}^{G}\left(l\right)\) be the global activity of speaker j in frame l:

$${p}_{j}^{G}\left(l\right)=\frac{1}{F}\sum_{f=1}^{F}{I}_{j}\left(l,f\right)$$
(3)

which is the global activity associated with the jth speaker in the lth frame. Note that the global activities \({\left\{{p}_{j}^{G}\left(l\right)\right\}}_{j=1}^{J}\) depend only on the frame index, not on the frequency index.

2.2.1 Spatial feature extraction

Assuming speech sparsity in the TF domain, the relative transfer function (RTF) [51], which represents the ratio between the ATF of the mth microphone and the ATF of the first (reference) microphone, can be written as follows:

$$\begin{array}{l}{R}^{m}\left(l,f\right)=\frac{{X}^{m}\left(l,f\right)}{{X}^{1}\left(l,f\right)}\\= \left\{\begin{array}{cc}\frac{{A}_{j}^{m}\left(f\right)}{{A}_{j}^{1}\left(f\right)}& \text{for } {I}_{j}\left(l, f\right)=1,\ 1\le j\le J\\ \frac{{V}^{m}\left(l,f\right)}{{V}^{1}\left(l,f\right)}& \text{for } {\sum }_{j=1}^{J}{I}_{j}\left(l,f\right)=0\end{array}\right.\end{array}$$
(4)

In the following, a feature vector \(\mathbf{r}\left(l\right)\) is defined for each frame l, comprising \(D=2\times \left(M-1\right)\times K\) elements formed by the real and imaginary parts of the ratios in (4) over the \(K\) selected frequency bins and the \(M-1\) microphone signals:

$$\begin{array}{l}\mathbf r^m\left(l\right)=\left[R^m\left(l,f_1\right)R^m\left(l,f_2\right)\cdots R^m\left(l,f_K\right)\right]\\\mathbf r^c\left(l\right)=\left[\mathbf r^2\left(l\right)\mathbf r^3\left(l\right)\cdots\mathbf r^M\left(l\right)\right]\\\mathbf r\left(l\right)=\left[real\left\{\mathbf r^c\left(l\right)\right\}\text{ }\mathrm{imag}\left\{\mathbf r^c\left(l\right)\right\}\right]^{\;T}\end{array}$$
(5)

where \({\left\{{f}_{k}\right\}}_{k=1}^{K}\) are the selected frequencies. The correlation matrix \(\mathbf{W}\in {\mathbb{R}}^{L\times L}\) is computed, where \({\left[\mathbf{W}\right]}_{ln}=\frac{1}{D}{\mathbf{r}}^{T}\left(l\right)\mathbf{r}\left(n\right)\). W can be approximated as [45]

$$\mathbf{W}\approx {\mathbf{P}\mathbf{P}}^{T}$$
(6)

where \(\mathbf{P}=\left[{\mathbf{p}}_{1}^{G} \dots {\mathbf{p}}_{J}^{G}\right]\in {\mathbb{R}}^{L\times J}\) is composed of the global activity vectors \({\mathbf{p}}_{j}^{G}={\left[{p}_{j}^{G}\left(1\right)\dots {p}_{j}^{G}\left(L\right)\right]}^{T}\in {\mathbb{R}}^{L\times 1}\) associated with the jth speaker.
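As a rough illustration of (5) and (6), the following NumPy sketch (our own, with a multichannel STFT `X` of shape (M, L, F), microphone 1 as reference, `bins` holding the K selected frequency indices, and a small epsilon guarding against division by zero, which the equations above do not need) builds the feature vectors and the correlation matrix:

```python
import numpy as np

def rtf_features(X, bins, eps=1e-12):
    """Baseline feature r(l) of Eq. (5): real/imag parts of the RTFs."""
    Xs = X[:, :, bins]                                   # (M, L, K)
    R = Xs[1:] / (Xs[:1] + eps)                          # RTFs w.r.t. mic 1, (M-1, L, K)
    L = X.shape[1]
    Rflat = R.transpose(1, 0, 2).reshape(L, -1)          # (L, (M-1)K), complex
    return np.concatenate([Rflat.real, Rflat.imag], 1)   # (L, D), D = 2(M-1)K

def correlation_matrix(r):
    """W of Eq. (6): [W]_{ln} = r(l)^T r(n) / D."""
    return (r @ r.T) / r.shape[1]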

2.2.2 Speaker counting

For J independent speakers, the matrix P should have rank J. It follows that the number of speakers can be determined by counting the principal eigenvalues of the correlation matrix W. However, selecting an appropriate threshold is not straightforward due to complex acoustic conditions. To select an appropriate threshold, the speaker counting problem has been formulated as a classification problem [43], where each class corresponds to a different number of speakers. A feature vector consisting of the first \(J'\) principal eigenvalues of the correlation matrix is used as the input to the classifier

$${\mathbf{f}}_{\mathrm{baseline}\text{ } 1}={\left[{\lambda }_{1}\ {\lambda }_{2}\ \cdots\ {\lambda }_{J'}\right]}^{T}$$
(7)

where \(J'\) is the maximum possible number of speakers and is set to 4 in this study. The multiclass support vector machine (SVM) is used as the classifier in [43].

2.2.3 Speaker separation

Once the number of speakers J is available, the eigenvectors associated with the J largest eigenvalues are selected, and their lth elements form the global mapping vector for frame l

$$\mathbf{v}^{G}\left(l\right)={\left[{u}_{1}\left(l\right), {u}_{2}\left(l\right),\dots ,{u}_{J}\left(l\right)\right]}^{T}$$
(8)

where \({u}_{j}\left(l\right)\) denotes the lth element of the jth eigenvector.

According to [43, 45], the global mapping vector \(\mathbf{v}^{G}(l)\) can be expressed as a linear transformation of the global activity vector \(\mathbf{p}^{G}(l)\) :

$$\mathbf{v}^{G}(l) = \mathbf{Gp}^{G}(l)$$
(9)

with embedded information of speaker activities. The successive projection algorithm [52] can be applied to identify the simplex vertices and construct the transformation matrix \(\mathbf{G} = [\mathbf{v}^{G}(l_{1}), \mathbf{v}^{G}(l_{2}),\ldots,\mathbf{v}^{G}(l_{J})]\), where \({\left\{{l}_{j}\right\}}_{j=1}^{J}\) represents frame indices of the simplex vertices. Hence, the global activity can be computed.

$$\mathbf{p}^{G}\left(l\right)={\left[{p}_{1}^{G}\left(l\right),{p}_{2}^{G}\left(l\right),\dots ,{p}_{\widehat{J}}^{G}\left(l\right)\right]}^{T}=\mathbf{G}^{-1}\mathbf{v}^{G}\left(l\right)$$
(10)
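A compact sketch of this step is given below, assuming the rows of `V` are the global mapping vectors \(\mathbf{v}^{G}(l)\) of (8); the successive projection loop follows the standard greedy norm-maximization and deflation form of the algorithm in [52], which may differ in detail from the exact variant used by the authors:

```python
import numpy as np

def spa_vertices(V, J):
    """Pick J frame indices (simplex vertices) by successive projection."""
    Vres = V.astype(float).copy()
    idx = []
    for _ in range(J):
        l_star = int(np.argmax(np.linalg.norm(Vres, axis=1)))
        idx.append(l_star)
        u = Vres[l_star] / (np.linalg.norm(Vres[l_star]) + 1e-12)
        Vres -= np.outer(Vres @ u, u)        # deflate the selected direction
    return idx

def global_activities(V, J):
    """Eqs. (9)-(10): build G from the vertices, then p^G(l) = G^{-1} v^G(l)."""
    idx = spa_vertices(V, J)
    G = V[idx].T                             # columns are v^G(l_j)
    return np.linalg.solve(G, V.T).T         # (L, J), rows are p^G(l)
```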

For the local mapping, each TF bin is assigned to a dominant speaker or noise. The spectral mask can be obtained by using the weighted nearest-neighbor rule.

$$M\left(l,f\right)=\mathop{\arg\max}\limits_{j\in \left(1,\dots ,J+1\right)}\frac{1}{{\pi }_{j}}\sum_{n=1}^{L}{\omega }_{ln}\left(f\right){p}_{j}^{G}\left(n\right)$$
(11)

where \({\pi }_{j}={\sum }_{n=1}^{L}{p}_{j}^{G}\left(n\right)\) denotes the class normalization factor and \({\omega }_{ln}\left(f\right)\) is a Gaussian weighting function [33]:

$${\omega }_{ln}\left(f\right)=\mathrm{exp}\left\{-\Vert \mathbf{r}\left(l,f\right)-\mathbf{r}\left(n,f\right)\Vert \right\}$$
(12)

that is inversely related to the distance in the space defined by the local representation \({\left\{\mathbf{r}\left(l,f\right)\right\}}_{l=1}^{L}\) between frame n and frame l. The signal of the jth speaker can be estimated by applying the spectral mask in (11) to the reference microphone signal:

$${\widehat{S}}_{j}^{Mask}\left(l,f\right)=\left\{\begin{array}{cc}{X}^{1}\left(l,f\right)& if M\left(l,f\right)=j\\ {\beta X}^{1}\left(l,f\right)& \mathrm{otherwise},\end{array}\right.$$
(13)

where \(\beta\) is the attenuation factor to avoid musical noise. In this paper, \(\beta\) is set to 0.2 as in [43].
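The weighted nearest-neighbor mask in (11)–(12) and the masking rule in (13) can be sketched as follows (our own NumPy code; `r_lf` stacks the local representations r(l, f), `P` holds the J+1 global activity columns with the last column for noise, and the distance weighting is the plain exponential of (12)):

```python
import numpy as np

def nn_spectral_mask(r_lf, P):
    """Eqs. (11)-(12): assign each TF bin to a speaker (1..J) or noise (J+1)."""
    L, F, _ = r_lf.shape
    pi = P.sum(axis=0) + 1e-12                        # class normalization factors
    mask = np.zeros((L, F), dtype=int)
    for f in range(F):
        d = np.linalg.norm(r_lf[:, None, f, :] - r_lf[None, :, f, :], axis=-1)
        w = np.exp(-d)                                # omega_{ln}(f) of Eq. (12)
        score = (w @ P) / pi                          # (L, J+1)
        mask[:, f] = np.argmax(score, axis=1) + 1     # 1-based class labels
    return mask

def apply_mask(X1, mask, j, beta=0.2):
    """Eq. (13): keep bins of speaker j, attenuate the rest by beta."""
    return np.where(mask == j, 1.0, beta) * X1
```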

A linearly constrained minimum variance (LCMV) beamformer can be used to extract each independent speaker signal [43, 44], with the weights given by

$${\mathbf{w}}_{LCMV}={\mathbf{R}}_{nn}^{-1}\left(f\right)\widehat{\mathbf{A}}\left(f\right){\left({\widehat{\mathbf{A}}}^{H}\left(f\right){\mathbf{R}}_{nn}^{-1}\left(f\right)\widehat{\mathbf{A}}\left(f\right)\right)}^{-1}{\mathbf{g}}_{j},$$
(14)

where \(\widehat{\mathbf{A}}\left(f\right)=\left[{\widehat{\mathbf{a}}}_{1}\left(f\right),\dots ,{\widehat{\mathbf{a}}}_{J}\left(f\right)\right]\in {\mathbb{C}}^{M\times J}\) denotes the RTF matrix, whose jth column \({\widehat{\mathbf{a}}}_{j}\left(f\right)={\left[{\widehat{A}}_{j}^{1}\left(f\right), {\widehat{A}}_{j}^{2}\left(f\right),\dots ,{\widehat{A}}_{j}^{M}\left(f\right)\right]}^{T}\) contains the RTFs of the jth speaker, and \({\mathbf{R}}_{nn}\left(f\right)\) is the noise covariance matrix. In this study, only sensor noise is assumed, i.e., \({\mathbf{R}}_{nn}={\sigma }_{nn}\mathbf{I}\). As a result, (14) reduces to

$${\mathbf{w}}_{LCMV}={\widehat{\mathbf{A}}\left(f\right)\left({\widehat{\mathbf{A}}}^{H}\left(f\right)\widehat{\mathbf{A}}\left(f\right)\right)}^{-1}{\mathbf{g}}_{j}$$
(15)

where the RTF of the jth speaker can be estimated by

$${\widehat{A}}_{j}^{m}\left(f\right)=\frac{\sum_{l\in {\mathcal{L}}_{j}}{X}^{m}\left(l,f\right){X}^{1*}\left(l,f\right)}{\sum_{l\in {\mathcal{L}}_{j}}{X}^{1}\left(l,f\right){X}^{1*}\left(l,f\right)}$$
(16)

where \({\mathcal{L}}_{j}=\left\{l\left|{p}_{j}^{G}\left(l\right)>\varepsilon ,l\in \left\{1,\dots ,L\right\}\right.\right\}\) denotes the set of frames dominated by the jth speaker, and ε = 0.2 is an activity threshold.
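The RTF estimate in (16) and the simplified LCMV weights in (15) translate almost directly into code; the sketch below is our own, assumes a multichannel STFT `X` of shape (M, L, F) with microphone 1 as the reference, and adds small epsilons for numerical safety:

```python
import numpy as np

def estimate_rtf(X, pG_j, eps_act=0.2):
    """Eq. (16): RTFs of speaker j from its dominant frames (activity > eps_act)."""
    Xd = X[:, pG_j > eps_act, :]                      # (M, |L_j|, F)
    num = np.sum(Xd * np.conj(Xd[:1]), axis=1)        # cross-spectra with mic 1
    den = np.sum(np.abs(Xd[0]) ** 2, axis=1) + 1e-12
    return num / den[None, :]                         # (M, F)

def lcmv_weights(A_hat, j):
    """Eq. (15): w = A (A^H A)^{-1} g_j per frequency, white-noise case."""
    M, J, F = A_hat.shape
    g = np.zeros(J)
    g[j] = 1.0
    W = np.zeros((M, F), dtype=complex)
    for f in range(F):
        A = A_hat[:, :, f]
        W[:, f] = A @ np.linalg.solve(A.conj().T @ A, g)
    return W
```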

To further suppress the residual noise and interference, a single-channel mask is applied [43, 44], as given by

$$\begin{array}{c}{\widehat{S}}_{j}^{LCMV-Mask}\left(l,f\right)\\ =\left\{\begin{array}{cc}{\mathbf{w}}_{LCMV}^{H}\mathbf{x}\left(l,f\right){\mathbf{g}}_{j}& \text{if } M\left(l,f\right)=j\\ {\beta \mathbf{w}}_{LCMV}^{H}\mathbf{x}\left(l,f\right){\mathbf{g}}_{j}& \mathrm{otherwise},\end{array}\right.\end{array}$$
(17)

where the vector \(\mathbf{x}\left(l,f\right)={\left[{X}^{1}\left(l,f\right),\dots ,{X}^{M}\left(l,f\right)\right]}^{T}\) denotes the microphone signals, \({\mathbf{g}}_{j}\in {\mathbb{R}}^{J\times 1}\) is a one-hot vector with one in the jth entry and zeros elsewhere, and β = 0.2 is a small factor to prevent musical noise.

3 Proposed method

Inspired by the above simplex-based approach, we develop a robust speaker counting and separation system by exploiting spatial coherence features of array signals, as illustrated in Fig. 1. The system consists of three modules: the feature extraction module (Section 3.1), the speaker counting module (Section 3.2), and the speaker separation module (Section 3.3), as detailed in the sequel.

Fig. 1 Block diagram of the proposed speaker counting and separation system

3.1 Spatial feature extraction

The simplex-based method [43] exploits the spatial information provided by the microphone array. As a result, spatial feature extraction plays a critical role in the subsequent speaker counting and separation algorithms. Instead of the RTF used in [43], in this study we extract spatial information by whitening the RTFs while leaving the phase unchanged, thereby enhancing the spatial signature of the directional source, analogous to generalized cross-correlation with phase transformation (GCC-PHAT) [53]. In light of the uncertainty principle [54], this helps to improve the time-domain resolution in the computation of the spatial coherence matrix. Instead of the real feature vector used in the simplex-based approach, a “whitened” complex feature vector \(\tilde{\mathbf{r}}(l)\) is defined as follows:

$$\widetilde{\mathbf{r}}\left(l\right)={\left[\widetilde{\mathbf{r}}\left(l,{f}_{1}\right) \widetilde{\mathbf{r}}\left(l,{f}_{2}\right)\cdots \widetilde{\mathbf{r}}\left(l,{f}_{K}\right)\right]}^{T}\in {\mathbb{C}}^{\left(M-1\right)K\times 1}$$
(18)

where

$$\widetilde{\mathbf{r}}\left(l,f\right)=\left[\begin{array}{ccc}\frac{{R}^{2}\left(l,f\right)}{\left|{R}^{2}\left(l,f\right)\right|}& \cdots&\frac{{R}^{M}\left(l,f\right)}{\left|{R}^{M}\left(l,f\right)\right|}\end{array}\right]$$

\(R^{m}\left(l,f\right)\) is defined in (4), and \(\{f_{k}\}^{K}_{k=1}\) are the selected frequencies as in (5). Next, we construct a spatial coherence matrix \(\tilde{\mathbf{W}} \in \mathbb{R}^{L \times L}\) with the lnth entry defined as

$${\widetilde{W}}_{\mathrm{ln}}=\frac{Re\left\{{\widetilde{\mathbf{r}}}^{H}\left(l\right)\widetilde{\mathbf{r}}\left(n\right)\right\}}{\Vert \widetilde{\mathbf{r}}\left(l\right)\Vert \Vert \widetilde{\mathbf{r}}\left(n\right)\Vert }=\frac{1}{\widetilde{D}}Re\left\{{\widetilde{\mathbf{r}}}^{H}\left(l\right)\widetilde{\mathbf{r}}\left(n\right)\right\}$$
(19)

where Re{·} is the real-part operator, \(\|\!\cdot\!\|\) denotes the l2-norm, and \(\tilde{D} = \|\tilde{\mathbf{r}}(l)\|\, \|\tilde{\mathbf{r}}(n)\| = (M - 1)K\) because the feature vectors have been whitened. Note that the complex inner product of \(\tilde{\mathbf{r}}(l)\) and \(\tilde{\mathbf{r}}(n)\) is computed, which can also be regarded as a sign-sensitive cosine similarity based on the Euclidean angle [55]. Examples of the spatial correlation matrix computed using the method reported in [43,44,45] and the proposed spatial coherence matrix are compared in Fig. 2, generated from a 12-second clip of a three-speaker mixture captured by an eight-element uniform linear array (ULA) with an interelement spacing of 8 cm. The image in Fig. 2(b) is preferable to that in Fig. 2(a) because the time spans in the proposed spatial coherence matrix align better with the ground-truth speaker activity, especially at the overlaps, as shown by the activity bars at the top of the figure. This suggests that the proposed spatial coherence matrix is effective in capturing speaker activity, much like a voice activity detector. In addition, the entries of the proposed coherence matrix lie within [−1, 1], which is a desirable property for network training.

Fig. 2 Examples of a the spatial correlation matrix W and b the spatial coherence matrix \(\tilde{\mathbf{W}}\). The color bars at the top of each figure indicate the active span of each speaker
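A minimal sketch of the wRTF feature (18) and the spatial coherence matrix (19), using our own NumPy conventions (a multichannel STFT `X` of shape (M, L, F), microphone 1 as reference, and `bins` holding the K selected frequency indices):

```python
import numpy as np

def coherence_matrix(X, bins, eps=1e-12):
    """Spatial coherence matrix of Eq. (19) built from whitened RTFs, Eq. (18)."""
    Xs = X[:, :, bins]                                    # (M, L, K)
    R = Xs[1:] / (Xs[:1] + eps)                           # RTFs w.r.t. the reference mic
    wR = R / (np.abs(R) + eps)                            # whitening: keep the phase only
    L = X.shape[1]
    r_tilde = wR.transpose(1, 0, 2).reshape(L, -1)        # frame-wise wRTF vectors, (L, (M-1)K)
    D = r_tilde.shape[1]                                  # (M-1)K, product of whitened norms
    return np.real(r_tilde.conj() @ r_tilde.T) / D        # entries lie in [-1, 1]
```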

3.2 Speaker counting

The flowchart of the proposed speaker counting approach is detailed in Fig. 3. Two features related to the speaker count are extracted from the spatial coherence matrix \(\tilde{\mathbf{W}}\) and input to the speaker counting network (SCnet), as will be detailed next.

Fig. 3 Flowchart of the proposed speaker counting approach

In this study, we propose to use the eigenvalues \(\left\{\tilde{\lambda}_{n}\right\}^{L}_{n=1}\) of the spatial coherence matrix \(\tilde{\mathbf{W}}\) as the feature for the classifier. An example scatter pattern of the eigenvalues used to discriminate between the speaker count classes \(J \in \left\{1, 2, 3, 4\right\}\) is illustrated in Fig. 4. We generated 2000-sample speech mixtures for 1–4 speakers, with 0%, 10%, 20%, 30%, and 40% overlap ratios. Sensor noise was added at 10 dB SNR. Dry signals were convolved with measured RIRs selected from the Multi-Channel Impulse Responses Database [47], which was recorded using an eight-element ULA with an interelement spacing of 8 cm and T60 = 0.61 s. Each cross in the figure represents one observation with a particular number of speakers. Figure 4 shows the ability of the eigenvalues obtained from the correlation matrix and the coherence matrix to discriminate between different numbers of speakers. The eigenvalues of the coherence matrix \(\tilde{\mathbf{W}}\) discriminate between different numbers of speakers better than those of the correlation matrix \(\mathbf{W}\). However, some observations cannot be classified into the correct class based on the eigenvalues alone. In this study, we evaluate the similarity between global activities as auxiliary information to address the cases where the principal eigenvalue-based counting method does not work.

Fig. 4 Scatter plots of the eigenvalues corresponding to the observations with \(J \in \left\{1, 2, 3, 4\right\}\) speakers. Each cross with a different color represents an observation with a different number of speakers. The left plots show the results with the correlation matrix W and the right plots the results with the coherence matrix \(\tilde{\mathbf{W}}\)

Apart from the eigenvalues of the spatial coherence matrix, another feature that can aid speaker counting is introduced to deal with meeting scenarios, in which the overlap ratio of conversation is often less than 20% [56]. For such scenarios, we first calculate a similarity matrix \({\widetilde{\gamma }}^{j}\in {\mathbb{R}}^{j\times j}\) of the first \(j\) global activities, with the pqth entry defined as follows:

$${\widetilde{\gamma }}_{pq}^{j}=\frac{{\widetilde{\mathbf{p}}}_{p}^{G}\cdot {\widetilde{\mathbf{p}}}_{q}^{G}}{\Vert {\widetilde{\mathbf{p}}}_{p}^{G}\Vert \Vert {\widetilde{\mathbf{p}}}_{q}^{G}\Vert }$$
(20)

where “·” denotes the inner product, \({\tilde{\mathbf{p}}}_{p}^{G}\in {\mathbb{R}}^{L\times 1}\) and \({\tilde{\mathbf{p}}}_{q}^{G}\in {\mathbb{R}}^{L\times 1}\) denote the pth and qth global activities estimated from the spatial coherence matrix \(\tilde{\mathbf{W}}\), and \(1\le p,q\le j\). Next, we find the maximum similarity over all off-diagonal entries:

$${\widetilde{\gamma }}_{\mathrm{max}}^{j}=\mathop{\max}\limits_{p,q}{\left({\widetilde{\gamma }}^{j}-\mathbf{I}\right)}_{pq}$$
(21)

Similarly, \({\gamma }_{\mathrm{max}}^{j}\) denotes the maximum similarity calculated using the first j global activities obtained from the spatial correlation matrix W. An example scatter pattern of the maximum similarity used to discriminate between the speaker count classes \(J\in \left\{1, 2, 3, 4\right\}\) is illustrated in Fig. 5. The data generation is identical to that of Fig. 4. To visualize the separability provided by the proposed feature, we project the observations onto a two-dimensional feature space. Figure 5 suggests that the observations are separable by the maximum similarity, which helps to classify the number of speakers. In Fig. 5(a), the single-speaker observations and the two- to four-speaker observations are clearly separable along the \({\tilde{\gamma }}_{\mathrm{max}}^{2}\) coordinate, while the one- or two-speaker observations and the three- or four-speaker observations are clearly separable along the \({\tilde{\gamma }}_{\mathrm{max}}^{3}\) coordinate. In Fig. 5(b), the one- to three-speaker observations and the four-speaker observations are clearly separable along the \({\tilde{\gamma }}_{\mathrm{max}}^{4}\) coordinate.

Fig. 5 Scatter plots of the maximum similarity for the observations with \(J \in \left\{1, 2, 3, 4\right\}\) speakers. Each cross with a different color represents an observation with a different number of speakers
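The similarity matrix in (20) and the maximum off-diagonal similarity in (21) amount to a cosine-similarity computation over the estimated global activity vectors; a small sketch (our own naming, with `P` holding the estimated global activities as columns):

```python
import numpy as np

def max_similarity(P, j, eps=1e-12):
    """Eqs. (20)-(21): largest off-diagonal cosine similarity of the first j activities."""
    Pj = P[:, :j]                                   # (L, j)
    norms = np.linalg.norm(Pj, axis=0) + eps
    gamma = (Pj.T @ Pj) / np.outer(norms, norms)    # similarity matrix of Eq. (20)
    return np.max(gamma - np.eye(j))                # ignore the unit diagonal
```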

In this study, the speaker counting problem is formulated as a classification problem as in Ref. [43], with four classes corresponding to 1 to 4 speakers. For each observation (audio clip), the number of speakers is indicated by a one-hot vector \(\mathbf{z}\in {\mathbb{R}}^{4\times 1}\). At inference time, the predicted class is the one with the highest probability in the output distribution. Three different input feature vectors are defined for the assessment of speaker counting performance:

$$\begin{array}{l}{\mathbf{f}}_{\mathrm{baseline}\text{ }2}={\left[\frac{{\lambda }_{2}}{{\lambda }_{1}}\ \cdots\ \frac{{\lambda }_{J'}}{{\lambda }_{1}}\ \ {\gamma }_{\mathrm{max}}^{2}\ \cdots\ {\gamma }_{\mathrm{max}}^{J'}\right]}^{T}\in {\mathbb{R}}^{2\left(J'-1\right)}\\ {\mathbf{f}}_{\mathrm{proposal}\text{ }1}={\left[\frac{{\widetilde{\lambda }}_{2}}{{\widetilde{\lambda }}_{1}}\ \cdots\ \frac{{\widetilde{\lambda }}_{J'}}{{\widetilde{\lambda }}_{1}}\right]}^{T}\in {\mathbb{R}}^{J'-1}\\ {\mathbf{f}}_{\mathrm{proposal}\text{ }2}={\left[\frac{{\widetilde{\lambda }}_{2}}{{\widetilde{\lambda }}_{1}}\ \cdots\ \frac{{\widetilde{\lambda }}_{J'}}{{\widetilde{\lambda }}_{1}}\ \ {\widetilde{\gamma }}_{\mathrm{max}}^{2}\ \cdots\ {\widetilde{\gamma }}_{\mathrm{max}}^{J'}\right]}^{T}\in {\mathbb{R}}^{2\left(J'-1\right)},\end{array}$$
(22)

where \(J' = 4\) is the maximum possible number of speakers, and the eigenvalues are normalized by the maximum eigenvalue to improve convergence. Feature fbaseline 2 is obtained from the spatial correlation matrix W, whereas features fproposal 1 and fproposal 2 are obtained from the proposed spatial coherence matrix \(\tilde{\mathbf{W}}\).
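Putting the two features together, the proposed input vector fproposal 2 in (22) can be assembled as follows (a sketch reusing `max_similarity` from the previous listing; `W_tilde` is the spatial coherence matrix and `P_tilde` the matrix of estimated global activities):

```python
import numpy as np

def counting_features(W_tilde, P_tilde, J_max=4):
    """Feature f_proposal2 of Eq. (22): eigenvalue ratios plus maximum similarities."""
    eigvals = np.sort(np.linalg.eigvalsh(W_tilde))[::-1]       # descending order
    ratios = eigvals[1:J_max] / (eigvals[0] + 1e-12)           # lambda_n / lambda_1
    sims = [max_similarity(P_tilde, j) for j in range(2, J_max + 1)]
    return np.concatenate([ratios, np.array(sims)])            # length 2*(J_max - 1)
```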

A DNN model termed SCnet is used as the classifier for speaker counting. Figure 6 shows the SCnet, which consists of three dense layers, each followed by a rectified linear unit (ReLU) activation, with a softmax activation in the output layer. In the figure, (Fsize, 64) denotes a dense layer with input size Fsize and output size 64. The cross-entropy is used as the loss function in network training.

Fig. 6 Speaker counting network (SCnet)
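A PyTorch sketch of a network matching the description above (Fig. 6) is given below; the 64-unit hidden width follows the (Fsize, 64) annotation, while the remaining details (layer-count interpretation, returning logits and leaving the softmax to the loss or inference step) are our own assumptions:

```python
import torch.nn as nn

class SCnet(nn.Module):
    """Small MLP classifier for the speaker count (1-4 speakers)."""
    def __init__(self, feat_size, num_classes=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),   # logits; apply softmax at inference
        )

    def forward(self, x):
        return self.net(x)

# Training sketch: nn.CrossEntropyLoss()(SCnet(feat_size=6)(features), labels)
```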

3.3 Speaker separation

The simplex-based method relies solely on the spatial cue to perform the subsequent beamforming, which depends on the specific array configuration. In contrast, our learning-based approach uses global and local spatial activity features to train the model, as shown in Fig. 7. The proposed system consists of two main modules: (1) the local coherence estimation of independent speakers, which monitors the local activity of each speaker according to the global activity of the speaker, and (2) the global and local activity-driven network (GLADnet), which extracts the speaker signal with the auxiliary information about the global and local activities of the speaker.

Fig. 7 Block diagram of the proposed speaker separation module

In the local coherence estimation of a speaker, the local coherence is calculated between the wRTF of the target speaker and the wRTF of each TF bin. The wRTF of the jth speaker is calculated as follows:

$${\widetilde{\mathbf{a}}}_{j}\left(f\right)={\left[\frac{{\widehat{A}}_{j}^{2}\left(f\right)}{\left|{\widehat{A}}_{j}^{2}\left(f\right)\right|}\quad\cdots\quad \frac{{\widehat{A}}_{j}^{M}\left(f\right)}{\left|{\widehat{A}}_{j}^{M}\left(f\right)\right|}\right]}^{T}$$
(23)

where \({\widehat{A}}_{j}^{m}\left(f\right)\) is the estimated RTF. Thus, the local coherence of the jth speaker can be calculated as follows:

$$\begin{array}{c}{p}_{j}^{L}\left(l,f\right)=\frac{\mathrm{Re}\left\{{\widetilde{\mathbf{a}}}_{j}^{H}\left(f\right)\widetilde{\mathbf{r}}\left(l,f\right)\right\}}{\Vert {\widetilde{\mathbf{a}}}_{j}\left(f\right)\Vert \Vert \widetilde{\mathbf{r}}\left(l,f\right)\Vert }\\ =\frac{1}{M-1}\mathrm{Re}\left\{{\widetilde{\mathbf{a}}}_{j}^{H}\left(f\right)\widetilde{\mathbf{r}}\left(l,f\right)\right\},\end{array}$$
(24)

where \(\widetilde{\mathbf{r}}\left(l,f\right)\) is defined below (18). The local coherence serves to inform the DNN about the local activity of a speaker.
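In code, the local coherence (24) is an inner product between the speaker's wRTF (23) and the per-bin wRTFs; the sketch below is our own and assumes `A_hat_j` of shape (M, F) estimated as in (16) and per-bin whitened RTFs `wR` of shape (M−1, L, F) computed over all frequency bins:

```python
import numpy as np

def local_coherence(A_hat_j, wR, eps=1e-12):
    """Eqs. (23)-(24): local coherence p_j^L(l, f) of speaker j for every TF bin."""
    a = A_hat_j[1:] / (np.abs(A_hat_j[1:]) + eps)     # speaker wRTF, Eq. (23), (M-1, F)
    coh = np.einsum('mf,mlf->lf', a.conj(), wR).real  # Re{a_j(f)^H r~(l, f)}
    return coh / a.shape[0]                           # normalize by (M - 1)
```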

GLADnet is based on a convolutional recurrent network [57], as illustrated in Fig. 8. The network has three inputs: the magnitude spectrogram of the reference microphone signal, the global activity of the speaker, and the local activity of the speaker. GLADnet has six symmetric encoder and decoder layers with 8, 16, 32, 128, 128, and 128 filters, respectively. Each convolutional block consists of a separable convolution layer, followed by batch normalization and an exponential linear unit activation. The output layer terminates with a sigmoid activation. The convolution kernel and stride are set to (3, 2) and (2, 1), respectively. Note that 1 × 1 pathway convolutions (PConv) are used as skip connections, which leads to a considerable parameter reduction with little performance degradation. The global activity is concatenated to the output of the linear layer with 256 nodes in each time frame. The resulting vector is then fed to the following bidirectional long short-term memory layers with 256 nodes to sift out the latent features pertaining to each speaker. The soft mask estimated by the network is multiplied element-wise with the noisy magnitude spectrogram to yield an enhanced spectrogram. The complete complex spectrogram is obtained by combining the enhanced magnitude spectrogram with the phase of the noisy spectrogram. The network is trained to minimize the compressed mean square error between the masked magnitude \(\left(\widehat{\mathbf{S}}\right)\) and the ground-truth magnitude \(\left(\mathbf{S}\right)\)

$${J}_{CMSE}=\sum_{t,f}{\left\Vert {\left|\mathbf{S}\right|}^{c}-{\left|\widehat{\mathbf{S}}\right|}^{c}\right\Vert }_{F}^{2}$$
(25)

where c = 0.3 is the compression factor and \(\Vert \cdot \Vert _{F}\) denotes the Frobenius norm.
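The compressed loss (25) is straightforward to implement; a PyTorch sketch with our own function name, operating on (batched) magnitude spectrograms:

```python
import torch

def compressed_mse(S_hat_mag, S_mag, c=0.3):
    """Eq. (25): squared error between power-law compressed magnitudes."""
    return torch.sum((S_mag.clamp(min=0) ** c - S_hat_mag.clamp(min=0) ** c) ** 2)
```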

Fig. 8 The GLADnet

4 Experimental study

Experiments were performed to validate the proposed learning-based speaker counting and separation system. The networks were trained on the simulated RIRs and tested on the measured RIRs with different T60s and array configurations recorded at Bar-Ilan University [47]. For meeting scenarios, we also tested the proposed system on real meeting recordings from the LibriCSS meeting corpus [48].

4.1 Training and validation dataset

In total, 50,000 and 5000 samples were used for training and validation, respectively. Dry speech signals selected from the train-clean-360 subset of the LibriSpeech corpus [58] were used for training and validation. Noisy speech mixtures edited into 12-s clips were prepared with different numbers of speakers \(J\in \left\{1, 2, 3, 4\right\}\) under reverberant conditions and signal-to-noise ratios (SNRs) between −5 dB and 5 dB. The overlap ratio of the speech mixtures varied from 0 to 40%. Reverberant microphone signals were simulated by filtering the dry signals with RIRs simulated using the image-source method [46]. The reverberation time was within the range of [0.2, 0.6] s. Sensor noise was added at SNR = 15, 25, and 35 dB. In this study, simulated (Gaussian) noise was used to model the sensor noise. Two microphone array geometries were used for training and validation, as depicted in Fig. 9. The first microphone array is an eight-element ULA with an interelement spacing of 8 cm. The geometry of the second array is similar to that of the seven-element uniform circular array (UCA) used in the LibriCSS dataset [48], which has one microphone at the center and the other six uniformly distributed around a circle with a radius of 4.25 cm. The RIRs of rectangular rooms with randomly generated dimensions (length, width, and height) in the range of [3 × 3 × 2.5, 7 × 7 × 3] m were simulated. The ULA was placed 0.5 m from the wall, while the UCA was placed at the center of the room. Any two speakers were separated by at least 15°.
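The reverberant mixtures can be generated with any image-source RIR simulator; the sketch below uses the pyroomacoustics package as one possible implementation of the image-source method [46], with an eight-element ULA and illustrative (not the authors') positions and parameters:

```python
import numpy as np
import pyroomacoustics as pra

def simulate_mixture(dry_signals, src_positions, rt60=0.4, fs=16000):
    """Simulate an M-channel reverberant mixture for one training sample."""
    room_dim = np.random.uniform([3, 3, 2.5], [7, 7, 3])        # random shoebox room
    e_abs, max_order = pra.inverse_sabine(rt60, room_dim)
    room = pra.ShoeBox(room_dim, fs=fs,
                       materials=pra.Material(e_abs), max_order=max_order)
    mic_x = 0.5 + 0.08 * np.arange(8)                           # 8-mic ULA, 8 cm spacing
    mics = np.stack([mic_x, np.full(8, 0.5), np.full(8, 1.5)])  # 0.5 m from a wall
    room.add_microphone_array(pra.MicrophoneArray(mics, fs))
    for sig, pos in zip(dry_signals, src_positions):
        room.add_source(list(pos), signal=sig)
    room.simulate(snr=30)                                       # adds sensor noise
    return room.mic_array.signals                               # (M, num_samples)
```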

Fig. 9 Settings for network training with different microphone array geometries

4.2 Implementation and evaluation metrics

In this study, the signal frame was 128 ms long with a 32 ms stride. A 2048-point fast Fourier transform was used. The sample rate was 16 kHz. The feature vectors in (5) and (18) comprised \(K=257\) frequency bins in 1–3 kHz. We chose this frequency range because, as in Ref. [43], it performed well in all of the scenarios examined for different simulated and measured RIRs and array configurations. In the experiment, SCnet and GLADnet were trained using the Adam optimizer with a learning rate of 0.001 and a gradient norm clipping of 3. The learning rate was halved if the validation loss did not improve for three consecutive epochs.

The F1 score and the confusion matrix are used to evaluate the speaker counting performance. The F1 score is a measure of the accuracy of a test in classification problems. It is defined as the harmonic mean of precision and recall [59]. PESQ [49] is used as a metric for speech quality and is computed only in the period when the speech is present. In addition, we also evaluate the WER achieved by the proposed system compared to the baselines, by using a transformer-based pre-trained model from the SpeechBrain toolkit [60]. The pre-trained model was trained on the LibriSpeech dataset. The WER obtained with this model when tested on the test-clean subset is 1.9%.

4.3 Spatial feature robustness

In this section, we investigate the robustness of the spatial correlation matrix and the spatial coherence matrix with respect to measured RIRs and unseen array geometries. The proposed spatial coherence matrix based on wRTFs is used as a spatial signature for directional sources. The whitening process provides spectrally rich information that better accommodates unseen array configurations and measured RIRs. To see this, we compute the modal assurance criterion (MAC) value for the spatial correlation matrix and the spatial coherence matrix under various unseen array configurations and RIRs. First, we vectorize the spatial matrix as \(\psi ={\left[{\mathbf{w}}_{1}^{T}\quad {\mathbf{w}}_{2}^{T}\quad \cdots\quad {\mathbf{w}}_{L}^{T}\right]}^{T}\in {\mathbb{R}}^{{L}^{2}\times 1}\), where \({\mathbf{w}}_{l}={\left[{W}_{l1}\quad {W}_{l2}\quad \cdots\quad {W}_{lL}\right]}^{T} \in {\mathbb{R}}^{L\times 1}\) is the lth row of the spatial matrix. \(\psi\) and \({\psi}'\) represent the feature vectors associated with two spatial matrices. The MAC value between \(\psi\) and \(\psi'\) is defined as follows:

$$MAC\left(\psi ,{\psi }'\right)=\frac{{\left({\psi }^{T}{\psi }'\right)}^{2}}{\left({\psi }^{T}\psi \right)\left({{\psi }'}^{T}{\psi }'\right)}$$
(26)

To evaluate the robustness of the proposed spatial feature extraction method, we generated four different test datasets, each consisting of 500 samples. The first three datasets (G1, G2, and G3) were generated using measured RIRs from the Multi-Channel Impulse Responses Database [47], while the last dataset (sG1) was generated using simulated RIRs. As shown in Fig. 10, the first array configuration (G1) is included in the training set, while the second and third array configurations (G2 and G3) are considered “unseen” to the trained model. Note that sG1 has the same array configuration as G1, but with simulated RIRs. Tables 1 and 2 summarize the MAC values obtained using the spatial correlation matrix and the spatial coherence matrix. The off-diagonal MAC values of the spatial coherence matrix are consistently close to one and larger than those of the spatial correlation matrix. The MAC test demonstrates that the proposed spatial coherence matrix exhibits superior robustness to different array configurations and RIRs compared to the spatial correlation matrix. This property is desirable for the subsequent learning-based speaker counting and speaker separation approaches when dealing with unseen array configurations and measured RIRs.
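The MAC computation in (26) reduces to a normalized inner product of the vectorized matrices; a short NumPy sketch (our own naming):

```python
import numpy as np

def mac(W1, W2):
    """Eq. (26): modal assurance criterion between two vectorized spatial matrices."""
    p1, p2 = W1.reshape(-1), W2.reshape(-1)
    return (p1 @ p2) ** 2 / ((p1 @ p1) * (p2 @ p2))
```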

Fig. 10 Microphone array settings for experiments to investigate the effects of array configurations

Table 1 MAC values calculated using the spatial correlation matrix for various array configurations and RIRs
Table 2 MAC values calculated using the spatial coherence matrix for various array configurations and RIRs

4.4 Speaker counting performance

In the following, we examine several speaker counting methods for different levels of sensor noise and T60s. We generated 2000-sample speech mixtures for 1–4 speakers, with 0%, 10%, 20%, 30%, and 40% overlap ratios and dry speech signals from the test-clean subset of the LibriSpeech corpus. Sensor noise was added with SNR = 10, 20, and 30 dB. The measured RIRs were selected from the Multi-Channel Impulse Responses Database [47], recorded at Bar-Ilan University using an eight-element ULA with an interelement spacing of 8 cm and T60 = 0.36 and 0.61 s. The RIRs were measured in 15° intervals from −90° to 90° at distances of 1 and 2 m from the array center. Table 3 summarizes the speaker counting results in terms of F1 scores. We compare the proposed counting approaches with two baselines. Baseline 1 is the method proposed in [43]; the SVM classifier with fbaseline 1 in (7) as the input feature is used for training. Baseline 2 is the SCnet trained with fbaseline 2 in (22). For the proposed methods, proposals 1 and 2 represent the SCnet trained with fproposal 1 and fproposal 2 in (22). The speaker counting performance summarized in Table 3 suggests that baseline 1 performs comparably with baseline 2 in high-SNR conditions. However, the speaker counting performance of baseline 1 degrades significantly as the SNR decreases. The feature using the eigenvalues obtained from the spatial coherence matrix (proposal 1) significantly outperforms that obtained from the spatial correlation matrix (baseline 1), especially when the SNR is low. In addition, the method trained with the maximum similarity (proposal 2) further improves the speaker counting performance over the method trained with eigenvalues only (proposal 1). In this study, speaker counting is highly dependent on the quality of the spatial information extracted from the microphone array. However, it should be noted that spatial features tend to degrade as the SNR decreases. As a result, the counting performance is relatively lower at SNR = 10 dB.

Table 3 Comparison of speaker counting performance under different acoustical conditions in terms of F1 score

Next, we investigate speaker counting in low-activity scenarios using four-speaker mixtures, where the first speaker was active for only 5% of the time. In Table 4, we see a significant performance degradation in the SCnet trained on the eigenvalues of the spatial correlation matrix (baseline 1), even in high-SNR conditions. In contrast, the SCnet trained on the eigenvalues and the maximum similarities computed using the proposed spatial coherence matrix (proposal 2) performs quite satisfactorily despite the unbalanced speaker activity.

Table 4 Comparison of low-activity speaker counting performance under different acoustical conditions in terms of F1 score

Lastly, we investigate speaker counting using real-life recordings from the LibriCSS dataset [48]. There are ten one-hour sessions, each comprising six 10-min mini-sessions with different speaker overlap ratios (0S, 0L, 10%, 20%, 30%, and 40%). In the 0% case, 0S and 0L represent signals with short and long silence periods, where the inter-utterance silence lasts 0.1–0.5 s and 2.9–3.0 s, respectively. The test data were pre-segmented into 12-s clips containing 1 to 4 speakers in each session.

The speaker count of each audio clip was labeled using the ground-truth information. The dataset contains 511, 1119, 614, and 154 examples for one, two, three, and four speakers, respectively. The speaker counting results are summarized in the confusion matrices depicted in Fig. 11. The F1 scores for baselines 1 and 2 and proposals 1 and 2 were 88.37%, 92.44%, 96.48%, and 97.36%, respectively. From Fig. 11, we can see that the methods trained on the features from the spatial coherence matrix (proposals 1 and 2) outperform the methods trained on the features from the spatial correlation matrix (baselines 1 and 2). Figure 11(c) and (d) show that the method trained on the maximum similarities (proposal 2) yields significantly lower underestimation rates than the method trained on eigenvalues only (proposal 1). For BSS problems, underestimation can undermine the subsequent separation, while overestimation is less critical. In summary, we extract spatial information by whitening the RTFs without changing the phase to enhance the spatial signature of the directional source, analogous to generalized cross-correlation with phase transformation (GCC-PHAT) [53]. In light of the uncertainty principle [54], this helps to improve the time-domain resolution in the computation of the spatial coherence matrix, which in turn leads to a more accurate estimation of the spatial activity, especially in low-SNR cases. This enables a more accurate estimation of the maximum similarity between two global activities as independent activities, without overlooking scenarios with low-activity speakers.

Fig. 11 Confusion matrices for the speaker counting results obtained using a baseline 1, b baseline 2, c proposal 1, and d proposal 2

Furthermore, unlike most multichannel source counting methods, which typically require more microphones than sources, the simplex-based and the proposed methods are limited by the total number of frames used to compute the spatial correlation matrix and the spatial coherence matrix, not the number of microphones. This implies that, in theory, there is virtually no limit to the number of speakers that can be identified. In fact, the only limit on counting accuracy is the degree of time overlap. To see this, we give two examples with different speaker activity patterns to show the maximum number of independent speakers that can be identified using ULAs with 2–5 elements evenly spaced at 8 cm.

Case I represents a scenario where four speakers are active in moderately overlapping time periods, as shown in Fig. 12(a). Note that at 2–4 s, three speakers are active concurrently. Inspection of Fig. 13(a) indicates that the spatial coherence matrices associated with different numbers of microphones remain very similar. In this case, the eigenvalue distribution analysis reveals that the number of sources can be accurately estimated, even when the number of speakers (4) exceeds the number of microphones (5), as shown in Fig. 14(a).

Fig. 12 Ground truth speaker activities for a case I and b case II

Fig. 13 Spatial coherence matrices for different numbers of microphones in a case I and b case II

Fig. 14 Eigenvalue distribution in descending order of the spatial coherence matrix for a case I and b case II

Case II presents a scenario in which the proposed source counting method fails: four independent speakers are active with 100% overlap, as shown in Fig. 12(b). In this case, the spatial coherence matrices in Fig. 13(b) show no meaningful patterns of activity, regardless of the number of microphones. The eigenvalue distribution analysis in Fig. 14(b) yields an incorrect estimate of one speaker. In summary, methods based on simplex preprocessing are not limited by the number of microphones, but rather by the degree of overlap of the speaker activity time spans.

4.5 Speaker separation performance

In the following, we compare the proposed speaker separation approach (GLADnet) with three baselines. The first baseline (mask) uses only a spectral mask (13). The second baseline (LCMV-mask) is the simplex-based approach [43, 44] with beamforming and spectral masking (17). The third baseline is a variant of GLADnet trained only on the global activity, referred to as the global activity-driven network (GADnet). To evaluate the robustness of the proposed speaker separation approach when applied to unseen RIRs and array configurations, we created three 2000-sample test datasets for three different array configurations (G1, G2, and G3) using the measured RIRs from the Multi-Channel Impulse Responses Database [47]. The array configurations G1, G2, and G3 are shown in Fig. 10.

First, we examine the separation performance using the G1 configuration for different overlap ratios and T60s. The results in Fig. 15 show that the proposed GLADnet outperforms the three baselines in terms of speech quality. The performance of the GADnet, which is not trained with spatial features, degrades drastically as the overlap ratio increases. While the LCMV-mask method achieves comparable WER to GLADnet at moderate T60 = 360 ms, its separation performance drops sharply at high reverberation.

Fig. 15 Comparison of separation performance with array configuration (G1) in terms of a, c PESQ and b, d WER for different overlap ratios

Next, the effect of array configurations on separation performance is investigated. Figure 16 reveals that the speech quality (PESQ) and the ASR performance (WER) using the LCMV-mask method degrade as the array spacing and the array aperture decrease, even for moderate T60s. In contrast, the proposed GLADnet performs quite satisfactorily despite the unseen RIRs and array geometries.

Fig. 16 Comparison of separation performance in terms of a, c PESQ and b, d WER for the different array configurations (G1, G2, and G3)

We also evaluated the proposed network in speaker separation using the more realistic LibriCSS dataset. The dataset generation for network testing is identical to that for speaker counting. Figure 17 shows that the LCMV-mask method has a comparable performance to the proposed GLADnet when the overlap ratio is low. However, the performance of the LCMV-mask drops dramatically at high overlap ratios. In addition, GADnet performs satisfactorily only for non-overlapping speech mixtures. In summary, the separation performance of baselines such as mask and LCMV-mask, which rely solely on spatial information, can be significantly affected by the inter-element spacing and array aperture. On the other hand, the baseline GADnet, which relies solely on spectral information, can suffer performance degradation in adverse acoustic conditions such as large reverberation and high overlap ratios. In contrast to these baselines, the proposed GLADnet exploits both spatial and spectral information to achieve superior performance in terms of PESQ and WER metrics. In addition, GLADnet is trained using the global and local activities derived from the wRTFs, which makes it less sensitive to unseen RIRs and array configurations.

Fig. 17 Comparison of separation performance in terms of a PESQ and b WER for the LibriCSS dataset

5 Conclusions

In this paper, a learning-based robust speaker counting and separation system has been implemented by integrating array signal processing and deep neural networks. In feature extraction, the spatial coherence matrix computed with wRTFs across time frames shows superior robustness to different array configurations and RIRs compared to the spatial correlation matrix. In speaker counting, the SCnet trained on the eigenvalues and the maximum similarities obtained from the spatial coherence matrix is conducive to speaker counting in adverse acoustic conditions, especially in unbalanced voice activity scenarios. In speaker separation, the GLADnet based on global and local spatial activities proves to be capable of effective and robust enhancement with different overlap ratios for unseen RIRs and array configurations, which is highly desirable for real-world applications.

Availability of data and materials

N/a.

Abbreviations

SCM:

Spatial coherence matrix

wRTF:

Whitened relative transfer functions

SCnet:

Speaker counting network

GLADnet:

Global and local activity-driven network

BSS:

Blind speech separation

NN:

Neural network

RNN:

Recurrent neural network

DNN:

Deep neural network

RTF:

Relative transfer function

RIRs:

Room impulse responses

PESQ:

Perceptual evaluation of speech quality

WER:

Word error rate

STFT:

Short-time Fourier transform

ATF:

Acoustic transfer function

SVM:

Support vector machine

LCMV:

Linearly constrained minimum variance

GCC-PHAT:

Generalized cross-correlation with phase transformation

ULA:

Uniform linear array

SNR:

Signal-to-noise ratio

UCA:

Uniform circular array


References

  1. E. Vincent, T. Virtanen, S. Gannot, Audio source separation and speech enhancement (Wiley, USA, 2018)


  2. M. Kawamoto, K. Matsuoka, N. Ohnishi, A method of blind separation for convolved nonstationary signals. Neurocomputing 22, 157–171 (1998)


  3. H. Buchner, R. Aichner, W. Kellermann, A generalization of blind source separation algorithms for convolutive mixtures based on second-order statistics. IEEE Trans Audio Speech Lang Process 13(1), 120–134 (2005)


  4. Z. Koldovsky, P. Tichavsky, Time-domain blind separation of audio sources on the basis of a complete ICA decomposition of an observation space. IEEE Trans Audio Speech Lang Process 19(2), 406–416 (2011)


  5. T. Kim, T. Eltoft, T.W. Lee, Independent vector analysis: an extension of ICA to multivariate components, in International Conference on Independent Component Analysis and Signal Separation. (2006), pp.165–172


  6. T. Virtanen, Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Trans Audio Speech Lang Process 15(3), 1066–1074 (2007)


  7. O. Dikmen, A.T. Cemgil, Unsupervised single-channel source separation using Bayesian NMF, in Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). (2009), pp.93–96


  8. A. Ozerov, C. Févotte, Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation. IEEE Trans Audio Speech Lang Process 18(3), 550–563 (2010)


  9. Y. Mitsufuji, A. Roebel, Sound source separation based on non-negative tensor factorization incorporating spatial cue as prior knowledge, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2013), pp.71–75


  10. J.R. Hershey, Z. Chen, J. Le Roux, S. Watanabe, Deep clustering: discriminative embeddings for segmentation and separation, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2016), pp.31–35


  11. Z. Chen, Y. Luo, N. Mesgarani, Deep attractor network for single-microphone speaker separation, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2017), pp.246–250


  12. Y. Luo, N. Mesgarani, Conv-TasNet: surpassing ideal time-frequency magnitude masking for speech separation. IEEE Trans Audio Speech Lang Process 27(8), 1256–1266 (2019)


  13. Y. Luo, Z. Chen, T. Yoshioka, Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2020), pp.46–50


  14. D. Yu, M. Kolbæk, Z. Tan, J. Jensen, Permutation invariant training of deep models for speaker-independent multi-talker speech separation, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2017), pp.241–245


  15. M. Kolbæk, D. Yu, Z. Tan, J. Jensen, Multi-talker speech separation with utterance-level permutation invariant training of deep recurrent neural networks. IEEE/ACM Trans Audio Speech Lang Process 25(10), 1901–1913 (2017)


  16. L. Drude, R. Haeb-Umbach, Tight integration of spatial and spectral features for BSS with deep clustering embeddings, in Interspeech. (2017), pp.2650–2654


  17. Z.Q. Wang, J. Le Roux, J.R. Hershey, Multi-channel deep clustering: discriminative spectral and spatial embeddings for speaker-independent speech separation, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2018), pp.1–5


  18. Z. Wang, D. Wang, Combining spectral and spatial features for deep learning based blind speaker separation. IEEE/ACM Trans Audio Speech Lang Process 27(2), 457–468 (2019)


  19. Y. Luo, C. Han, N. Mesgarani, E. Ceolini, S. Liu, FaSNet: Low-latency adaptive beamforming for multi-microphone audio processing, in Proc. of IEEE Workshop Automatic Speech Recognition and Understanding. (2019), pp.260–267

  20. K. Kinoshita, M. Delcroix, S. Araki, T. Nakatani, Tackling real noisy reverberant meetings with all-neural source separation, counting, and diarization system, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2020), pp.381–385

  21. Y. Liu, D. Wang, Divide and conquer: a deep CASA approach to talker-independent monaural speaker separation. IEEE/ACM Trans Audio Speech Lang Process 27(12), 2092–2102 (2019)

  22. E. Nachmani, Y. Adi, L. Wolf, Voice separation with an unknown number of multiple speakers, in International Conference on Machine Learning (ICML). (2020), pp.2623–2634

  23. Y. Luo, N. Mesgarani, Separating varying numbers of sources with auxiliary autoencoding loss, in Interspeech. (2020)

  24. K. Kinoshita, L. Drude, M. Delcroix, T. Nakatani, Listening to each speaker one by one with recurrent selective hearing networks, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2018), pp.5064–5068

  25. T. von Neumann, K. Kinoshita, M. Delcroix, S. Araki, T. Nakatani, R. Haeb-Umbach, All-neural online source separation, counting, and diarization for meeting analysis, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2019), pp.91–95

  26. Z. Jin, X. Hao, X. Su, Coarse-to-fine recursive speech separation for unknown number of speakers. arXiv preprint arXiv:2203.16054 (2022)

  27. J. Zhu, R.A. Yeh, M. Hasegawa-Johnson, Multi-decoder DPRNN: source separation for variable number of speakers, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2021), pp.3420–3424

  28. Z.-Q. Wang, D. Wang, Count and separate: incorporating speaker counting for continuous speaker separation, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2021), pp.11–15

  29. A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W.T. Freeman, M. Rubinstein, Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation. ACM Trans Graph 37(4), 1–11 (2018)

  30. C. Li, Y. Qian, Listen, watch and understand at the cocktail party: audio-visual-contextual speech separation, in Interspeech. (2020), pp.1426–1430

  31. K. Žmolíková, M. Delcroix, K. Kinoshita, T. Ochiai, T. Nakatani, L. Burget, J. Černocký, Speakerbeam: speaker aware neural network for target speaker extraction in speech mixtures. IEEE J Sel Top Signal Process 13(4), 800–814 (2019)

  32. Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J.R. Hershey, R.A. Saurous, R.J. Weiss, Y. Jia, I.L. Moreno, VoiceFilter: targeted voice separation by speaker-conditioned spectrogram masking, in Interspeech. (2019), pp.2728–2732

  33. M. Ge, C. Xu, L. Wang, E.S. Chang, H. Li, Spex+: a complete time domain speaker extraction network, in Interspeech. (2020), pp.1406–1410

  34. R. Gu, L. Chen, S.X. Zhang, J. Zheng, Y. Xu, M. Yu, D. Su, Y. Zou, D. Yu, Neural spatial filter: target speaker speech separation assisted with directional information, in Interspeech. (2019), pp.4290–4294

  35. M. Delcroix, T. Ochiai, K. Zmolikova, K. Kinoshita, N. Tawara, T. Nakatani, S. Araki, Improving speaker discrimination of target speech extraction with time-domain speakerbeam, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2020), pp.691–695

  36. J. Han, W. Rao, Y. Wang, Y. Long, Improving channel decorrelation for multi-channel target speech extraction, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2021), pp.6094–6098

  37. Y. Hsu, Y. Lee, M.R. Bai, Learning-based personal speech enhancement for teleconferencing by exploiting spatial-spectral features, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2022), pp.8787–8791

  38. M. Delcroix, K. Zmolikova, T. Ochiai, K. Kinoshita, T. Nakatani, Speaker activity driven neural speech extraction, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2021), pp.6099–6103

  39. T. Higuchi, K. Kinoshita, M. Delcroix, K. Zmolkova, T. Nakatani, Deep clustering-based beamforming for separation with unknown number of sources, in Interspeech. (2017)

  40. S.E. Chazan, J. Goldberger, S. Gannot, DNN-based concurrent speakers detector and its application to speaker extraction with LCMV beamforming, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2018), pp.6712–6716

  41. S.E. Chazan, S. Gannot, J. Goldberger, Attention-based neural network for joint diarization and speaker extraction, in Proc. of IEEE International Workshop on Acoustic Signal Enhancement (IWAENC). (2018), pp.301–305

  42. C. Boeddeker, J. Heitkaemper, J. Schmalenstroeer, L. Drude, J. Heymann, R. Haeb-Umbach, Front-end processing for the CHiME-5 dinner party scenario, in Proc. of CHiME5 Workshop. (2018), pp.35–40

  43. B. Laufer-Goldshtein, R. Talmon, S. Gannot, Global and local simplex representations for multichannel source separation. IEEE/ACM Trans Audio Speech Lang Process 28(1), 914–928 (2020)

  44. B. Laufer-Goldshtein, R. Talmon, S. Gannot, Audio source separation by activity probability detection with maximum correlation and simplex geometry. EURASIP J Audio Speech Music Process 2021, 5 (2021)

  45. B. Laufer-Goldshtein, R. Talmon, S. Gannot, Source counting and separation based on simplex analysis. IEEE Trans Signal Process 66(24), 6458–6473 (2018)

  46. E. Lehmann, A. Johansson, Prediction of energy decay in room impulse responses simulated with an image-source model. J Acoust Soc Am 124(1), 269–277 (2008)

  47. E. Hadad, F. Heese, P. Vary, S. Gannot, Multichannel audio database in various acoustic environments, in Proc. of IEEE International Workshop on Acoustic Signal Enhancement (IWAENC). (2014), pp.313–317

  48. Z. Chen, T. Yoshioka, L. Lu, T. Zhou, Z. Meng, Y. Luo, J. Wu, J. Li, Continuous speech separation: dataset and analysis, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2020), pp.7284–7288

  49. A.W. Rix, J.G. Beerends, M.P. Hollier, A.P. Hekstra, Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2001), pp.749–752

  50. O. Yilmaz, S. Rickard, Blind separation of speech mixtures via time-frequency masking. IEEE Trans Signal Process 52(7), 1830–1847 (2004)

  51. S. Gannot, D. Burshtein, E. Weinstein, Signal enhancement using beamforming and nonstationarity with applications to speech. IEEE Trans Signal Process 49(8), 1614–1626 (2001)

  52. W.-K. Ma et al., A signal processing perspective on hyperspectral unmixing: Insights from remote sensing. IEEE Signal Process Mag 31(1), 67–81 (2014)

  53. C. Knapp, G. Carter, The generalized correlation method for estimation of time delay. IEEE Trans Acoust Speech Signal Process 24(4), 320–327 (1976)

  54. L. Cohen, The uncertainty principle in signal analysis, in Proc. of IEEE-SP International Symposium on Time-Frequency and Time-Scale Analysis. (1994), pp.182–185

  55. K. Scharnhorst, Angles in complex vector spaces. Acta Applicandae Mathematicae 69(1), 95–103 (2001)

  56. O. Çetin, E. Shriberg, Analysis of overlaps in meetings by dialog factors, hot spots, speakers, and collection site: insights for automatic speech recognition, in Interspeech. (2006), pp.293–296

  57. K. Tan, D. Wang, A convolutional recurrent neural network for real-time speech enhancement, in Interspeech. (2018), pp.3229–3233

  58. V. Panayotov, G. Chen, D. Povey, S. Khudanpur, Librispeech: an ASR corpus based on public domain audio books, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2015), pp.5206–5210

  59. D. Powers, Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. J Mach Learn Technol 2, 37–63 (2007)

  60. M. Ravanelli et al., SpeechBrain: a general-purpose speech toolkit. arXiv preprint arXiv:2106.04624 (2021)

Acknowledgements

Not applicable.

Funding

This work was supported by the National Science and Technology Council (NSTC), Taiwan, under project number 110-2221-E-007-027-MY3.

Author information

Contributions

Model development: Y. Hsu and M. R. Bai. Design of the dataset and test cases: Y. Hsu. Experimental testing: Y. Hsu. Writing paper: Y. Hsu and M. R. Bai. The authors read and approved the final manuscript.

Corresponding author

Correspondence to Mingsian R. Bai.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Hsu, Y., Bai, M.R. Learning-based robust speaker counting and separation with the aid of spatial coherence. J AUDIO SPEECH MUSIC PROC. 2023, 36 (2023). https://doi.org/10.1186/s13636-023-00298-3

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s13636-023-00298-3

Keywords