Musical Sound Separation Based on Binary Time-Frequency Masking
© Y. Li and D. Wang. 2009
Received: 15 November 2008
Accepted: 16 April 2009
Published: 19 July 2009
The problem of overlapping harmonics is particularly acute in musical sound separation and has not been addressed adequately. We propose a monaural system based on binary time-frequency masking with an emphasis on robust decisions in time-frequency regions, where harmonics from different sources overlap. Our computational auditory scene analysis system exploits the observation that sounds from the same source tend to have similar spectral envelopes. Quantitative results show that utilizing spectral similarity helps binary decision making in overlapped time-frequency regions and significantly improves separation performance.
Monaural musical sound separation has received significant attention recently. Analyzing a musical signal is difficult in general due to the polyphonic nature of music, but extracting useful information from monophonic music is considerably easier. Therefore a musical sound separation system would be a very useful processing step for many audio applications, such as automatic music transcription, automatic instrument identification, music information retrieval, and object-based coding. A particularly interesting application of such a system is signal manipulation. After a polyphonic signal is decomposed into individual sources, modifications such as pitch shifting and time stretching can be applied to each source independently. This offers a wide range of ways to alter the original signal and create new sound effects.
An emerging approach for general sound separation exploits knowledge of the human auditory system. In an influential book, Bregman proposed that the auditory system employs a process called auditory scene analysis (ASA) to organize an acoustic mixture into different perceptual streams, which correspond to different sound sources. The perceptual process is believed to involve two main stages: the segmentation stage and the grouping stage. In the segmentation stage, the acoustic input is decomposed into time-frequency (TF) segments, each of which mainly originates from a single source [3, Chapter 1]. In the grouping stage, segments from the same source are grouped according to a set of grouping principles. There are two types of grouping: primitive grouping and schema-based grouping. The principles employed in primitive grouping include proximity in frequency and time, harmonicity/pitch, synchronous onset and offset, common amplitude/frequency modulation, and common spatial information. Human ASA has inspired researchers to investigate computational auditory scene analysis (CASA) for sound separation. CASA exploits the intrinsic properties of sounds for separation and makes relatively minimal assumptions about specific sound sources. Therefore it shows considerable potential as a general approach to sound separation. Recent CASA-based speech separation systems have shown promising results in separating target speech from interference [3, Chapters 3 and 4]. However, building a successful CASA system for musical sound separation is challenging, and a main reason is the problem of overlapping harmonics.
As mentioned, overlapping harmonics are not as common in speech mixtures as in polyphonic music. This problem has not received much attention in the CASA community. Even those CASA systems specifically developed for musical sound separation [5, 6] do not address the problem explicitly.
In this paper, we present a monaural CASA system that explicitly addresses the problem of overlapping harmonics for 2-source separation. Our goal is to determine which harmonic is dominant in overlapped TF regions and to make binary pitch-based labeling decisions accordingly. We therefore follow a general strategy in CASA that allocates TF energy to individual sources exclusively. More specifically, our system attempts to estimate the ideal binary mask (IBM) [7, 8]. For a TF unit, the IBM takes the value 1 if the energy from the target source is greater than that from the interference and 0 otherwise. The IBM was originally proposed as a main goal of CASA, and it is optimal in terms of signal-to-noise ratio gain among all binary masks under certain conditions. Compared to nonoverlapped regions, making reliable binary decisions in overlapped regions is considerably more difficult. The key idea in the proposed system is to utilize contextual information available in a musical scene. Harmonics in nonoverlapped regions, called nonoverlapped harmonics, contain information that can be used to infer the properties of overlapped harmonics, that is, harmonics in overlapped regions. Contextual information is extracted temporally, that is, from notes played sequentially.
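The definition of the IBM given above can be written down in a few lines. The following is a minimal sketch, assuming the per-unit energies of the premixed target and interference are already available as arrays of shape (channels, frames); the function name and the toy energies are illustrative and do not come from the paper.

```python
import numpy as np

def ideal_binary_mask(target_energy, interference_energy):
    """Return 1 where the target energy exceeds the interference energy, else 0.

    Both inputs hold per-TF-unit energies with shape (channels, frames),
    computed from the premixed sources (an assumption of this sketch).
    """
    target_energy = np.asarray(target_energy, dtype=float)
    interference_energy = np.asarray(interference_energy, dtype=float)
    # Strict inequality: ties go to 0, matching "greater than ... 0 otherwise".
    return (target_energy > interference_energy).astype(np.uint8)

# Toy example: 2 channels x 3 frames of made-up energies.
target = np.array([[4.0, 1.0, 0.5], [0.2, 3.0, 3.0]])
interference = np.array([[1.0, 2.0, 0.5], [0.1, 5.0, 1.0]])
mask = ideal_binary_mask(target, interference)
```

Because the mask is binary, each TF unit is assigned to exactly one source, which is the exclusive allocation strategy described above.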
This paper is organized as follows. Section 2 provides the detailed description of the proposed system. Evaluation and comparison are presented in Section 3. Section 4 concludes the paper.
It can be seen from the above equation that the center frequencies of the filters are approximately linearly spaced in the low frequency range while logarithmically spaced in the high frequency range. Therefore more filters are placed in the low frequency range, where speech energy is concentrated.
where ERB(f) is the equivalent rectangular bandwidth of the filter with the center frequency f. This bandwidth is adequate when the intelligibility of separated speech is the main concern. However, for musical sound separation, the 1-ERB bandwidth appears too wide for analysis and resynthesis, especially in the high frequency range. We have found that using narrower bandwidths, which provide better frequency resolution, can significantly improve the quality of separated sounds. In this study we set the bandwidth to a quarter ERB. The center frequencies of channels are spaced from to Hz. Hu showed that a 128-channel gammatone filterbank with the bandwidth of 1 ERB per filter has a flat frequency response within the range of passband from to Hz. Similarly, it can be shown that a gammatone filterbank with the same number of channels but the bandwidth of 1/4 ERB per filter still provides a fairly flat frequency response over the same passband. By a flat response we mean that the summed responses of all the gammatone filters do not vary with frequency.
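To illustrate the filterbank layout, the sketch below spaces center frequencies uniformly on the ERB-rate scale and scales each bandwidth to a quarter ERB. It assumes the widely used Glasberg and Moore approximations, ERB(f) = 24.7(0.00437f + 1) and E(f) = 21.4 log10(0.00437f + 1), which may differ in detail from the equations used in this paper. The function names and the 100 to 4000 Hz range in the example are placeholders chosen for illustration, not values taken from the paper.

```python
import numpy as np

def hz_to_erb_rate(f_hz):
    """ERB-rate scale (Glasberg-Moore approximation, assumed here)."""
    return 21.4 * np.log10(0.00437 * np.asarray(f_hz, dtype=float) + 1.0)

def erb_rate_to_hz(e):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (np.asarray(e, dtype=float) / 21.4) - 1.0) / 0.00437

def erb_bandwidth_hz(f_hz):
    """Equivalent rectangular bandwidth in Hz at center frequency f_hz."""
    return 24.7 * (0.00437 * np.asarray(f_hz, dtype=float) + 1.0)

def filterbank_layout(num_channels, f_low_hz, f_high_hz, bandwidth_scale):
    """Center frequencies equally spaced in ERB-rate, with scaled bandwidths."""
    e = np.linspace(hz_to_erb_rate(f_low_hz), hz_to_erb_rate(f_high_hz), num_channels)
    centers = erb_rate_to_hz(e)
    return centers, bandwidth_scale * erb_bandwidth_hz(centers)

# Placeholder frequency range; bandwidth_scale=0.25 gives quarter-ERB filters.
centers, bandwidths = filterbank_layout(128, 100.0, 4000.0, 0.25)
```

With this layout the spacing between adjacent center frequencies grows with frequency, so more channels fall in the low frequency range.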
After auditory filtering, the output of each channel is divided into frames of milliseconds with a frame shift of milliseconds.
The normalization converts correlogram values to the range of [-1, 1], with a value of 1 at the zero time lag.
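The effect of this normalization can be checked with a small sketch. It assumes a plain autocorrelation of one frame of one channel's output, divided by its zero-lag value; the exact correlogram definition used in the system is not reproduced here, and the function name and test signal are illustrative.

```python
import numpy as np

def normalized_autocorrelation(frame, max_lag):
    """Autocorrelation of one frame for lags 0..max_lag, divided by the lag-0 value.

    Assumes max_lag < len(frame). By the Cauchy-Schwarz inequality the result
    lies in [-1, 1] and equals 1 at lag 0 (for a frame with nonzero energy).
    """
    frame = np.asarray(frame, dtype=float)
    acf = np.array([np.dot(frame[:len(frame) - lag], frame[lag:])
                    for lag in range(max_lag + 1)])
    if acf[0] == 0.0:
        return np.zeros(max_lag + 1)  # silent frame: no periodicity information
    return acf / acf[0]

# A sinusoid with a period of 20 samples, analyzed over a 200-sample frame.
t = np.arange(200)
frame = np.sin(2 * np.pi * t / 20)
r = normalized_autocorrelation(frame, max_lag=40)
best_lag = 10 + int(np.argmax(r[10:31]))  # strongest lag between 10 and 30
```

A peak of the normalized function at a lag equal to a pitch period is what makes pitch-based labeling of a TF unit possible.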
Although this pitch-based labeling (see (6)) works well, it has two problems. The first problem is that the decision is made locally: the labeling of each TF unit is independent of the labeling of its neighboring TF units. Studies have shown that labeling a larger auditory entity, such as a TF segment, can often improve performance. In fact, the emphasis on segmentation is considered a unique aspect of CASA systems [3, Chapter 1]. The second problem is overlapping harmonics. As mentioned before, in TF units where two harmonics from different sources overlap spectrally, unit labeling breaks down and the decision becomes unreliable. To address the first problem, we construct T-segments and make decisions based on T-segments instead of individual TF units. For the second problem, we exploit the observation that sounds from the same source tend to have similar spectral envelopes.
The concept of T-segment is introduced in  (see also ). A segment is a set of contiguous TF units that are supposed to mainly originate from the same source. A T-segment is a segment in which all the TF units have the same center frequency. Hu noted that T-segments give a better balance than TF segments between rejecting energy from a target source and accepting energy from the interference. In other words, compared to TF segments, T-segments achieve a good compromise between false rejection and false acceptance. Since musical sounds tend to be stable, a T-segment naturally corresponds to a frequency component from its onset to its offset. To obtain T-segments, we use pitch information to determine onset times. If the difference between two consecutive pitch points is more than one semitone, it is considered an offset for the first pitch point and an onset for the second pitch point. The set of all the TF units between an onset/offset pair in the same channel defines a T-segment.
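The onset/offset rule can be sketched as follows. This is an illustration only: it assumes a pitch contour sampled once per frame, with every frame voiced (all values positive), and the function name is hypothetical.

```python
import numpy as np

def split_notes_by_pitch(pitch_hz, max_jump_semitones=1.0):
    """Split a frame-level pitch contour into (start, end) frame ranges.

    A boundary is placed between two consecutive pitch points whose
    difference exceeds max_jump_semitones. The end index is exclusive.
    """
    pitch_hz = np.asarray(pitch_hz, dtype=float)
    if pitch_hz.size == 0:
        return []
    # Interval between consecutive pitch points, in semitones.
    jumps = np.abs(12.0 * np.log2(pitch_hz[1:] / pitch_hz[:-1]))
    boundaries = np.flatnonzero(jumps > max_jump_semitones) + 1
    starts = np.concatenate(([0], boundaries))
    ends = np.concatenate((boundaries, [pitch_hz.size]))
    return [(int(s), int(e)) for s, e in zip(starts, ends)]

# Three notes: around 440 Hz, around 494 Hz, then 554 Hz.
segments = split_notes_by_pitch([440.0, 441.0, 439.0, 494.0, 495.0, 554.0])
```

Each resulting frame range, taken within a single frequency channel, corresponds to one T-segment.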
For each T-segment, we first determine if it is overlapped or nonoverlapped. If harmonics from two sources overlap at channel , . A TF unit is considered overlapped if at that unit , where is chosen to be . If at least half of the TF units in a T-segment are overlapped, then the T-segment is considered overlapped; otherwise, the T-segment is considered nonoverlapped. With overlapped T-segments, we can also determine which harmonics of each source are overlapped. Given an overlapped T-segment at channel , the frequency of the overlapping harmonics can be roughly approximated by the center frequency of the channel. Using the pitch contour of each source, we can identify the harmonic number of each overlapped harmonic. All other harmonics are considered nonoverlapped.
where and are the sets of TF units previously labeled as 1 and 0 (see (6)), respectively, in the T-segment. The zero time lag of indicates the energy of . Equation (7) means that, in a T-segment, if the total energy of the TF units labeled as the first source is stronger than that of the TF units labeled as the second source, all the TF units in the T-segment are labeled as the first source; otherwise, they are labeled as the second source. Although this labeling scheme works for nonoverlapped T-segments, it cannot be extended to overlapped T-segments because the labeling of TF units in an overlapped T-segment is not reliable.
We summarize the above pitch-based labeling in the form of a pseudoalgorithm as Algorithm 1.
Algorithm 1: Pitch-based labeling.
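for each T-segment between an onset/offset pair and each frequency channel do
    for each TF unit in the T-segment do
        Increase TotalTFUnitCount by 1
        if the TF unit is overlapped then
            Increase OverlapTFUnitCount by 1
        else
            Increase NonOverlapTFUnitCount by 1
        end if
    end for
    if OverlapTFUnitCount ≥ TotalTFUnitCount / 2 then
        The T-segment is overlapped
    else
        The T-segment is nonoverlapped
    end if
    if the T-segment is nonoverlapped then
        for each TF unit in the T-segment do
            Label the TF unit as 1 or 0 according to (6) and add its energy to the total for that label
        end for
        if the total energy of the units labeled 1 exceeds that of the units labeled 0 (see (7)) then
            All the TF units in the T-segment are labeled as source 1
        else
            All the TF units in the T-segment are labeled as source 2
        end if
    end if
end for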
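The decision logic of Algorithm 1 for a single T-segment can be sketched as below. The sketch takes the per-unit overlap flags, the per-unit labels from (6), and the per-unit energies as inputs rather than computing them, since the corresponding formulas are not reproduced here. The function name and the convention of returning None for an overlapped T-segment are assumptions made for illustration.

```python
import numpy as np

def label_t_segment(unit_labels, unit_energies, unit_overlapped):
    """Label one T-segment from its TF units.

    unit_labels:     per-unit labels from pitch-based labeling (1 or 0)
    unit_energies:   per-unit energies (zero-lag autocorrelation values)
    unit_overlapped: per-unit flags marking overlapped TF units
    Returns 1 or 0 for a nonoverlapped T-segment, or None when the
    T-segment is overlapped and must be handled by a later stage.
    """
    unit_labels = np.asarray(unit_labels, dtype=int)
    unit_energies = np.asarray(unit_energies, dtype=float)
    unit_overlapped = np.asarray(unit_overlapped, dtype=bool)

    # The T-segment is overlapped when at least half of its units are.
    if 2 * np.count_nonzero(unit_overlapped) >= unit_overlapped.size:
        return None

    # Compare the total energy of units labeled 1 against units labeled 0.
    energy_label1 = unit_energies[unit_labels == 1].sum()
    energy_label0 = unit_energies[unit_labels == 0].sum()
    return 1 if energy_label1 > energy_label0 else 0

few_strong = label_t_segment([1, 1, 0, 1], [1.0, 1.0, 5.0, 1.0], [False] * 4)
mostly_one = label_t_segment([1, 0, 0], [4.0, 1.0, 1.0], [False] * 3)
overlapped = label_t_segment([1, 0, 1, 0], [1.0] * 4, [True, True, False, False])
```

Note that the decision follows total energy rather than a majority vote, so a single strong unit can outweigh several weak ones.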
To make binary decisions for an overlapped T-segment, it is helpful to know the energies of the two sources in that T-segment. One possibility is to use the spectral smoothness principle to estimate the amplitude of an overlapped harmonic by interpolating its neighboring nonoverlapped harmonics. However, the spectral smoothness principle does not hold well for many real instrument sounds. Another way to estimate the amplitude of an overlapped harmonic is to use an instrument model, which may consist of templates of spectral envelopes of an instrument. However, instrument models of this nature are unlikely to work well because of the enormous intrainstrument variations of musical sounds. When training and test conditions differ, instrument models would be ineffective.
In the above equation, we assume that the first harmonics of both notes are not overlapped. If the first harmonic of is also overlapped, then all the harmonics of will be overlapped. Currently our system is not able to handle this extreme situation. If the first harmonic of note is overlapped, we try to find some other note which has the first harmonic and harmonic reliable. Note from (9) that with an appropriate note, the overlapped harmonic can be recovered from the overlapped region without the knowledge of the other overlapped harmonic. In other words, using temporal contextual information, it is possible to extract the energy of only one source.
where is the common harmonic number of nonoverlapped harmonics of both notes. After this is done for each such note , we choose the note that has the highest correlation with note and whose th harmonic is nonoverlapped. The temporal window in general should be centered on the note being considered, and long enough to include multiple notes from the same source. However, in this study, since each test recording is 5 seconds long (see Section 3), the temporal window is set to be the same as the duration of a recording. Note that, for this procedure to work, we assume that the playing style within the search window does not change much.
After the appropriate note is identified, the amplitude of of note is estimated according to (9). Similarly, the amplitude of the other overlapped harmonic, (i.e., the dashed line in Figure 7), can be estimated. As mentioned before, the labeling of the overlapped T-segment depends on the relative overall energy of overlapping harmonics and . If the overall energy of harmonic in the T-segment is greater than that of harmonic , all the TF units in the T-segment will be labeled as source 1. Otherwise, they will be labeled as source 2. Since the amplitude of a harmonic is calculated as the square root of the harmonic's overall energy (see next paragraph), we label all the TF units in the T-segment based on the relative amplitudes of the two harmonics, that is, all the TF units are labeled as 1 if and 0 otherwise.
The above procedure requires the amplitude information of each nonoverlapped harmonic. This can be obtained by using single-source pitch points and the activation pattern of gammatone filters. For harmonic , we use the median pitch points of each note over the time period of a T-segment to determine the frequency of the harmonic. We then identify which frequency channel is most strongly activated. If the T-segment in that channel is not overlapped, then the harmonic amplitude is taken as the square root of the overall energy over the entire T-segment. Note that the harmonic amplitude refers to the strength of a harmonic over the entire duration of a note.
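The two quantities described in this paragraph can be sketched as follows, assuming the harmonic frequency is the harmonic number times the median pitch and that per-unit energies of the chosen T-segment are given. The function names are illustrative, and the step of finding the most strongly activated channel is left out.

```python
import math
import statistics

def harmonic_frequency_hz(pitch_points_hz, harmonic_number):
    """Frequency of a harmonic, from the median pitch over the T-segment."""
    return harmonic_number * statistics.median(pitch_points_hz)

def harmonic_amplitude(unit_energies):
    """Amplitude as the square root of the overall energy in the T-segment."""
    return math.sqrt(sum(unit_energies))

frequency = harmonic_frequency_hz([219.0, 220.0, 222.0], harmonic_number=3)
amplitude = harmonic_amplitude([4.0, 5.0, 7.0])
```

Using the median rather than the mean keeps a few outlying pitch points from shifting the estimated harmonic frequency.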
We summarize the above relabeling in Algorithm 2.
Algorithm 2: Relabeling.
for each overlapped T-segment do
    for each source overlapping at the T-segment do
        Get the harmonic number of the overlapped harmonic of the current note
        Get the set of nonoverlapped harmonics for the current note
        for each other note from the same source do
            Get the set of nonoverlapped harmonics for that note
            Get the correlation of the two notes using (10)
        end for
        Find the note with the highest correlation whose harmonic of the same number is nonoverlapped
        Find the amplitude of the overlapped harmonic based on (9)
    end for
    if the estimated amplitude from source 1 > the estimated amplitude from source 2 then
        All the TF units in the T-segment are labeled as source 1
    else
        All the TF units in the T-segment are labeled as source 2
    end if
end for
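A rough sketch of the relabeling step is given below. Equations (9) and (10) are not reproduced in this text, so the sketch makes two assumptions that should be checked against them: the similarity between notes is taken to be a cosine-style correlation over their common nonoverlapped harmonics, and the overlapped amplitude is obtained by scaling the reference note's harmonic by the ratio of first-harmonic amplitudes. Each note is represented as a hypothetical dictionary mapping harmonic numbers to the amplitudes of its nonoverlapped harmonics.

```python
import math

def spectral_similarity(amps_a, amps_b):
    """Cosine-style similarity over harmonics present in both notes (assumed form)."""
    common = sorted(set(amps_a) & set(amps_b))
    if not common:
        return 0.0
    dot = sum(amps_a[h] * amps_b[h] for h in common)
    norm_a = math.sqrt(sum(amps_a[h] ** 2 for h in common))
    norm_b = math.sqrt(sum(amps_b[h] ** 2 for h in common))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def estimate_overlapped_amplitude(note, other_notes, harmonic):
    """Estimate the amplitude of an overlapped harmonic of `note`.

    Picks the most similar note of the same source whose first harmonic and
    `harmonic` are both nonoverlapped, then rescales that note's harmonic by
    the ratio of first-harmonic amplitudes. Returns None if no note qualifies.
    """
    candidates = [n for n in other_notes
                  if harmonic in n and n.get(1, 0.0) > 0.0]
    if 1 not in note or not candidates:
        return None
    best = max(candidates, key=lambda n: spectral_similarity(note, n))
    return note[1] * best[harmonic] / best[1]

def relabel_overlapped_segment(amp_source1, amp_source2):
    """Return 1 if source 1 has the larger estimated amplitude, else 0."""
    return 1 if amp_source1 > amp_source2 else 0

# The third harmonic of `note` is overlapped and therefore missing.
note = {1: 2.0, 2: 1.0}
same_shape = {1: 1.0, 2: 0.5, 3: 0.25}
other_shape = {1: 1.0, 2: 3.0, 3: 2.0}
estimate = estimate_overlapped_amplitude(note, [other_shape, same_shape], harmonic=3)
```

In this toy case the note with the matching spectral shape is selected, so the estimate inherits its ratio between the third and first harmonics.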
The resynthesis is performed using a technique introduced by Weintraub  (see also [3, Chapter 1]). During the resynthesis, the output of each filter is first phase-corrected and then divided into time frames using a raised cosine with the same frame size used in TF decomposition. The responses of individual TF units are weighted according to the obtained binary mask and summed over all the frequency channels and time frames to produce a reconstructed audio signal. The resynthesis pathway allows the quality of separated lines to be assessed quantitatively.
To evaluate the proposed system, we construct a database consisting of pieces of quartet composed by J. S. Bach. Since it is difficult to obtain multitrack signals where different instruments are recorded in different tracks, we generate audio signals from MIDI files. For each MIDI file, we use the tenor and the alto line for synthesis since we focus on separating two concurrent instrument lines. Audio signals could be generated from MIDI data using MIDI synthesizers. But such signals tend to have stable spectral contents, which are very different from real music recordings. In this study, we use recorded note samples from the RWC music instrument database  to synthesize audio signals based on MIDI data. First, each line is randomly assigned to one of the four instruments: a clarinet, a flute, a violin, and a trumpet. After that, for each note in the line, a note sound sample with the closest average pitch points is selected from the samples of the assigned instrument and used for that note. Details about the synthesis procedure can be found in . Admittedly, the audio signals generated this way are a rough approximation of real recordings. But they show realistic spectral and temporal variations. Different instrument lines are mixed with equal energy. The first 5-second signal of each piece is used for testing. We detect the pitch contour of each instrument line using Praat .
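Mixing two lines with equal energy can be sketched as follows. This assumes both lines are sample arrays of the same length with nonzero energy, and that the second line is rescaled to match the first; which line is rescaled is a choice made for this illustration, as is the function name.

```python
import numpy as np

def mix_with_equal_energy(line_a, line_b):
    """Scale line_b to the energy of line_a and return (mixture, line_a, scaled_b)."""
    line_a = np.asarray(line_a, dtype=float)
    line_b = np.asarray(line_b, dtype=float)
    energy_a = np.sum(line_a ** 2)
    energy_b = np.sum(line_b ** 2)
    scaled_b = line_b * np.sqrt(energy_a / energy_b)
    return line_a + scaled_b, line_a, scaled_b

mixture, source_a, source_b = mix_with_equal_energy([1.0, 2.0, 3.0], [10.0, 0.0, -10.0])
```

Equal-energy mixing means the input signal-to-noise ratio is 0 dB whichever line is treated as the target.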
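Table 1: SNR gain (in decibels) of the proposed CASA system and related systems.
Separation method | SNR gain (dB)
Proposed system | 12.6
Hu and Wang (2004) | 9.1
Virtanen | 11.0
Parsons | 10.6
Ideal Binary Mask |
2-Pitch labeling | 11.3
2-Pitch labeling (ideal segmentation) | 13.1
Spectral smoothness | 9.8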
We compare the performance of our system with those of related systems. The second row in Table 1 gives the SNR gain by the Hu-Wang system, an effective CASA system designed for voiced speech separation. The Hu-Wang system has similar time-frequency decomposition to ours, implements the two stages of segmentation and grouping, and utilizes pitch and amplitude modulation as organizational cues for separation. The Hu-Wang system has a mechanism to detect the pitch contour of one voiced source. For comparison purposes, we supply the system with single-source pitch contours and adjust the filter bandwidths to be the same as ours. Although the Hu-Wang system performs well on voiced speech separation , our experiment shows that it is not very effective for musical sound separation. Our system outperforms theirs by 3.5 dB.
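For reference, an SNR gain of this kind can be computed as in the sketch below. The definition is not given in this text, so the sketch assumes the common one: the SNR of the separated output minus the SNR of the unprocessed mixture, both measured against the premixed target signal. The function names are illustrative.

```python
import numpy as np

def snr_db(reference, estimate):
    """SNR in dB of `estimate` with the premixed `reference` as the signal."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    noise = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

def snr_gain_db(target, mixture, separated):
    """Output SNR minus input SNR (assumed definition of SNR gain)."""
    return snr_db(target, separated) - snr_db(target, mixture)

target = np.array([1.0, -1.0, 1.0, -1.0])
interference = np.ones(4)
mixture = target + interference            # input SNR: 0 dB
separated = target + 0.1 * interference    # residual interference at -20 dB
gain = snr_gain_db(target, mixture, separated)
```

In this toy case the interference amplitude is reduced by a factor of ten, which corresponds to a gain of about 20 dB.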
We also compare with Virtanen's system which is based on sinusoidal modeling . At each frame, his system uses pitch information and least mean square estimation to simultaneously estimate the amplitudes and phases of the harmonics of all instruments. His system also uses a so-called adaptive frequency-band model to recover each individual harmonic from overlapping harmonics . To avoid inaccurate implementation of his system, we sent our test signals to him and he provided the output. Note that his results are also obtained using single-source pitch contours. The average SNR gain of his system is shown in the third row of Table 1. Our system's SNR gain is higher than Virtanen's system by 1.6 dB. In addition, we compare with a classic pitch-based separation system developed by Parsons . Parsons's system is one of the earliest that explicitly addresses the problem of overlapping harmonics in the context of separating cochannel speech. Harmonics of each speech signal are manifested as spectral peaks in the frequency domain. Parsons's system separates closely spaced spectral peaks and performs linear interpolation for completely overlapped spectral peaks. Note that for Parsons's system we also provide single-source pitch contours. As shown in Table 1 the Parsons system achieves an SNR gain of 10.6 dB, which is 2.0 dB smaller than the proposed system.
Since our system is based on binary masking, it is informative to compare with the SNR gain of the IBM, which is constructed from premixed instrument sounds. Although overlapping harmonics are not separated by ideal binary masking, the SNR gain is still very high, as shown in the fifth row of Table 1. There are several reasons for the performance gap between the proposed system and the ideal binary mask. First, pitch-based labeling is not error-free. Second, a T-segment can be mistaken, that is, it may contain significant energy from two different sources. Third, using contextual information may not always lead to the right labeling of a T-segment.
If we simply apply pitch-based labeling and ignore the problem of overlapping harmonics, the SNR gain is 11.3 dB as reported in . The 1.3 dB improvement of our system over the previous one shows the benefit of using contextual information to make binary decisions. We also consider the effect of segmentation on the performance. We supply the system with ideal segments, that is, segments from the IBM. After pitch-based labeling, a segment is labeled by comparing the overall energy from one source to that from the other source. In this case, the SNR gain is 13.1 dB. This shows that if we had access to ideal segments, the separation performance could be further improved. Note that the performance gap between ideal segmentation and the IBM exists mainly because ideal segmentation does not help in the labeling of the segments with overlapped harmonics.
As the last quantitative comparison, we apply the spectral smoothness principle  to estimate the amplitude of overlapped harmonics from concurrent nonoverlapped harmonics. We use linear interpolation for amplitude estimation and then compare the estimated amplitudes of overlapped harmonics to label T-segments. In this case, the SNR gain is 9.8 dB, which is considerably lower than that of the proposed system. This suggests that the spectral smoothness principle is not very effective in this case.
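The interpolation used in this comparison can be sketched as follows, assuming the nonoverlapped harmonics of a note are given as harmonic numbers in increasing order with their amplitudes. The function name and the sample amplitudes are illustrative.

```python
import numpy as np

def interpolate_overlapped_amplitude(harmonic_numbers, amplitudes, overlapped_harmonic):
    """Linearly interpolate an overlapped harmonic's amplitude from its neighbors.

    harmonic_numbers must be in increasing order and exclude the overlapped one.
    """
    return float(np.interp(overlapped_harmonic, harmonic_numbers, amplitudes))

# Harmonics 1, 2 and 4 are nonoverlapped; harmonic 3 is overlapped.
estimate = interpolate_overlapped_amplitude([1, 2, 4], [1.0, 0.8, 0.2], 3)
```

This estimate depends only on harmonics of the same note, which is why it degrades when an instrument's spectral envelope is not smooth.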
Finally, we mention two other related systems. Duan et al.  recently proposed an approach to estimate the amplitude of an overlapped harmonic. They introduced the concept of the average harmonic structure and built a model for the average relative amplitudes using nonoverlapped harmonics. The model is then used to estimate the amplitude of an overlapped harmonic of a note. Our approach can also be viewed as building a model of spectral shapes for estimation. However, in our approach, each note is a model and could be used in estimating overlapped harmonics, unlike their approach which uses an average model for each harmonic instrument. Because of the spectral variations among notes, our approach could potentially be more effective by taking inter-note variations into explicit consideration. In another recent study, we proposed a sinusoidal modeling based separation system . This system attempts to resolve overlapping harmonics by taking advantage of correlated amplitude envelopes and predictable phase changes of harmonics. The system described here utilizes the temporal context, whereas the system in  uses common amplitude modulation. Another important difference is that the present system aims at estimating the IBM, whereas the objective of the system in  is to recover the underlying sources. Although the sinusoidal modeling based system produces a higher SNR gain (14.4 dB), binary decisions are expected to be less sensitive to background noise and room reverberation.
In this paper, we have proposed a CASA system for monaural musical sound separation. We first label each TF unit based on the values of the autocorrelation function at time lags corresponding to the two underlying pitch periods. We adopt the concept of T-segments for more reliable estimation for nonoverlapped harmonics. For overlapped harmonics, we analyze the musical scene and utilize the contextual information from notes of the same source. Quantitative evaluation shows that the proposed system yields large SNR gain and performs better than related separation systems.
Our separation system assumes that ground truth pitches are available since our main goal is to address the problem of overlapping harmonics; in this case the idiosyncratic errors associated with a specific pitch estimation algorithm can be avoided. Obviously pitch has to be detected in real applications, and detected pitch contours from the same instrument also have to be grouped into the same source. The former problem is addressed in multipitch detection, and significant progress has been made recently [3, 15]. The latter problem is called the sequential grouping problem, which is one of the central problems in CASA . Although in general sequentially grouping sounds from the same source is difficult, in music, a good heuristic is to apply the "no-crossing" rule, which states that pitches of different instrument lines tend not to cross each other. This rule is strongly supported by musicological studies  and works particularly well in compositions by Bach . The pitch-labeling stage of our system should be relatively robust to fine pitch detection errors since it uses integer pitch periods instead of pitch frequencies. The stage of resolving overlapping harmonics, however, is likely more vulnerable to pitch detection errors since it relies on pitches to determine appropriate notes as well as to derive spectral envelopes. In this case, a pitch refinement technique introduced in  could be used to improve the pitch detection accuracy.
The authors would like to thank T. Virtanen for his assistance in sound separation and comparison, J. Woodruff for his help in figure preparation, and E. Fosler-Lussier for useful comments. They also wish to thank the three anonymous reviewers for their constructive suggestions/criticisms. This research was supported in part by an AFOSR Grant (FA9550-08-1-0155) and an NSF Grant (IIS-0534707).
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.