<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1687-4722-2010-252374</ui>
   <ji>1687-4722</ji>
   <fm>
      <dochead>Research Article</dochead>
      <bibl>
         <title>
            <p>Monaural Voiced Speech Segregation Based on Dynamic Harmonic Function</p>
         </title>
         <aug>
            <au id="A1"><snm>Zhang</snm><fnm>Xueliang</fnm><insr iid="I1"/><insr iid="I2"/><email>cszxl@imu.edu.cn</email></au>
            <au id="A2" ca="yes"><snm>Liu</snm><fnm>Wenju</fnm><insr iid="I1"/><email>lwj@nlpr.ia.ac.cn</email></au>
            <au id="A3"><snm>Xu</snm><fnm>Bo</fnm><insr iid="I1"/><email>xubo@hitic.ia.ac.cn</email></au>
         </aug>
         <insg>
            <ins id="I1"><p>National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China</p></ins>
            <ins id="I2"><p>Computer Science Department, Inner Mongolia University, Huhhot 010021, China</p></ins>
         </insg>
         <source>EURASIP Journal on Audio, Speech, and Music Processing</source>
         <issn>1687-4722</issn>
         <pubdate>2010</pubdate>
         <volume>2010</volume>
         <issue>1</issue>
         <fpage>252374</fpage>
         <url>http://asmp.eurasipjournals.com/content/2010/1/252374</url>
         <xrefbib><pubid idtype="doi">10.1155/2010/252374</pubid></xrefbib>
      </bibl>
      <history><rec><date><day>17</day><month>9</month><year>2010</year></date></rec><acc><date><day>2</day><month>12</month><year>2010</year></date></acc><pub><date><day>12</day><month>12</month><year>2010</year></date></pub></history>
      <cpyrt><year>2010</year><collab>Xueliang Zhang et al.</collab><note>This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note></cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <p>The correlogram is an important representation for periodic signals and is widely used in pitch estimation and source separation. For these applications, the major problems of the correlogram are its low resolution and redundant information. This paper proposes a voiced speech segregation system based on a newly introduced concept called the dynamic harmonic function (DHF). In the proposed system, conventional correlograms are further processed by replacing the autocorrelation function (ACF) with the DHF. The advantages of the DHF are that (1) the peak width is adjustable by controlling the variance of the Gaussian function and (2) the invalid peaks of the ACF, those not at the pitch period, tend to be suppressed. Based on the DHF, pitch detection and effective source segregation algorithms are proposed. Our system is systematically evaluated and compared with the correlogram-based system. Both the signal-to-noise ratio results and the perceptual evaluation of speech quality scores show that the proposed system yields substantially better performance.</p>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>1. Introduction</p>
         </st>
         <p>In realistic environments, speech is often corrupted by acoustic interference, and many applications perform poorly on noisy speech. Noise reduction or speech enhancement is therefore valuable for systems such as speech recognition and hearing aids. Numerous speech enhancement algorithms have been proposed in the literature [<abbr bid="B1">1</abbr>]. Methods such as independent component analysis [<abbr bid="B2">2</abbr>] or beamforming [<abbr bid="B3">3</abbr>] require multiple sensors, a requirement that cannot be met in many applications such as telecommunication. Spectral subtraction [<abbr bid="B4">4</abbr>] and subspace analysis [<abbr bid="B5">5</abbr>], proposed for monaural speech enhancement, usually make strong assumptions about the acoustic interference. These methods are therefore limited to certain special environments. Segregating speech from a single monaural recording has proven to be very challenging, and it is still an open problem in realistic environments.</p>
         <p>In contrast to the limited performance of speech enhancement algorithms, human listeners with normal hearing are capable of dealing with sound intrusions, even in monaural conditions. According to Bregman [<abbr bid="B6">6</abbr>], the human auditory system segregates a target sound from interference through a process called auditory scene analysis (ASA), which has two parts: (1) sound signal decomposition and (2) component grouping. Bregman considered that component organization includes sequential organization along time and simultaneous organization along frequency. Simulating ASA inspired a new field, computational auditory scene analysis (CASA) [<abbr bid="B7">7</abbr>], which has attracted increasing attention. Compared with other general methods, CASA can be applied to single-channel input, and it makes no strong assumptions about prior knowledge of the noise.</p>
         <p>A large proportion of sounds, such as vowels and musical tones, have a harmonic structure. Their most distinctive characteristic is that they consist of a fundamental (<inline-formula><graphic file="1687-4722-2010-252374-i1.gif"/></inline-formula>) and several overtones, which together are called a harmonic series. A good deal of evidence suggests that harmonics tend to be perceived as a single sound. This phenomenon is called the "<it>harmonicity</it>" principle in ASA. Pitch and harmonic structure provide an efficient mechanism for voiced speech segregation in CASA systems [<abbr bid="B8">8</abbr>, <abbr bid="B9">9</abbr>]. Continuous variation of pitch is useful for sequential grouping, and harmonic structure is suitable for simultaneous grouping. Licklider [<abbr bid="B10">10</abbr>] proposed that pitch could be extracted from nerve firing patterns by a running autocorrelation function performed on the activity of individual fibers. Licklider's theory was implemented by several researchers (e.g., [<abbr bid="B11">11</abbr>&#8211;<abbr bid="B14">14</abbr>]). Meddis and Hewitt [<abbr bid="B14">14</abbr>] implemented a similar computer model for harmonic perception. Specifically, their model first simulated the mechanical filtering of the basilar membrane to decompose the signal and then the mechanism of neural transduction at the hair cells. Their important innovation was to use autocorrelation to model the neural firing rate analysis of the human auditory system. These banks of autocorrelation functions (ACFs) were called correlograms, which provide a simple route to pitch estimation and source separation. For pitch estimation, previous research [<abbr bid="B14">14</abbr>] showed that peaks of summary correlograms indicate the pitch periods. 
According to the experimental results, Meddis and Hewitt argued that many phenomena of pitch perception could be explained with their model, including the missing fundamental, ambiguous pitch, the pitch of interrupted noise, inharmonic components, and the dominant region of pitch. For source separation, the method in [<abbr bid="B15">15</abbr>] directly checks whether the pitch period is close to a peak of the correlogram. Because of these advantages, the correlogram is widely used in pitch detection [<abbr bid="B16">16</abbr>] and speech separation algorithms [<abbr bid="B8">8</abbr>, <abbr bid="B9">9</abbr>, <abbr bid="B15">15</abbr>].</p>
         <p>However, the correlogram has some shortcomings. One is that the peak corresponding to the pitch period for a pure tone is rather wide [<abbr bid="B17">17</abbr>]. This leads to low resolution for pitch extraction, since mutual overlap between voices weakens their pitch cues. Some methods have been proposed to obtain narrower peaks, such as the "narrowed" ACF [<abbr bid="B18">18</abbr>] and the generalized correlation function [<abbr bid="B19">19</abbr>]. Another problem is the redundant information caused by the "invalid" peaks of the ACF. When using the correlogram to estimate pitch and separate sound sources, we care mostly about the peak of the ACF at the pitch period. For example, the algorithm in [<abbr bid="B14">14</abbr>] used the maximum peak of the summary correlogram to indicate the pitch period. However, competing peaks at multiples of the pitch period may lead to subharmonic errors. To overcome these drawbacks, the first step is to make the peaks narrower, and the second is to remove or suppress the peaks that are not at the pitch period. We propose a novel feature called the dynamic harmonic function (DHF) to solve these two problems. The basic idea of DHF is presented in the next section.</p>
         <p>The rest of the paper is organized as follows. We first present the basic idea behind DHF in Section 2. Section 3 gives an overview of our model and a detailed description of its stages. Our system is systematically evaluated and compared with the Hu and Wang model for speech segregation in Section 4, followed by the discussion in Section 5 and the conclusion in Section 6.</p>
      </sec>
      <sec>
         <st>
            <p>2. Basic Idea of DHF</p>
         </st>
         <p>DHF is defined as a Gaussian mixture function. The Gaussian means are set equal to the peak positions of the ACF, which carry periodic information about the original signal in a certain frequency range. The peak width can be narrowed by adjusting the Gaussian variance, while the Gaussian mixture coefficients control the peak heights of the DHF. The problem is how to estimate the mixture coefficients. The basic idea is as follows.</p>
         <p>Voiced speech generally has a harmonic structure consisting of continuously numbered harmonics. Therefore, one can verify a pitch hypothesis by checking whether there are continuously numbered harmonics corresponding to this pitch. For example, when its neighboring harmonics appear at 400&#8201;Hz or 800&#8201;Hz, the harmonic at 600&#8201;Hz is regarded as the third harmonic of a complex tone whose pitch is 200&#8201;Hz, as in case A in Figure <figr fid="F1">1</figr>. In this case, the pitch period is at the third peak position of the ACF of the frequency region around 600&#8201;Hz, whereas in case B the pitch period is at the second peak position. Based on this idea, the Gaussian mixture function tends to give a high peak at a pitch period hypothesis if the neighboring harmonics appear. This implies that the shape of the Gaussian mixture function of a harmonic depends not only on the frequency of the harmonic itself but also on the neighboring harmonics. Therefore, we call it the dynamic harmonic function.</p>
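         <p>The neighbor-checking idea above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the paper: the function name, the 10 Hz matching tolerance, the maximum harmonic order of 6, and the component frequencies used for case B (300 Hz and 900 Hz) are all hypothetical.</p>
         <p><![CDATA[
```python
def supported_harmonic_orders(freq_hz, observed_hz, max_order=6, tol_hz=10.0):
    """Return the harmonic orders h for which freq_hz / h is a supported pitch.

    A hypothesis "freq_hz is the h-th harmonic" is kept when a neighbouring
    harmonic, (h - 1) * f0 or (h + 1) * f0, is found among the observed
    components. tol_hz and max_order are illustrative assumptions.
    """
    def present(target_hz):
        return any(abs(target_hz - f) <= tol_hz for f in observed_hz)

    orders = []
    for h in range(1, max_order + 1):
        f0 = freq_hz / h                 # pitch implied by this hypothesis
        lower = (h - 1) * f0             # frequency of the lower neighbour
        upper = (h + 1) * f0             # frequency of the upper neighbour
        if (h > 1 and present(lower)) or present(upper):
            orders.append(h)
    return orders


# Case A: neighbours at 400 Hz and 800 Hz support "third harmonic of 200 Hz".
case_a = supported_harmonic_orders(600.0, [400.0, 600.0, 800.0])
# Case B (assumed components): neighbours at 300 Hz and 900 Hz.
case_b = supported_harmonic_orders(600.0, [300.0, 600.0, 900.0])
print(case_a, case_b)
```
]]></p>
         <p>With these inputs the sketch keeps only order 3 for case A and only order 2 for case B, which matches the peak positions named in the text.</p>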
         <fig id="F1"><title><p>Figure 1</p></title><caption><p>Frequency component perception.</p></caption><text>
   <p>
      <b>Frequency component perception.</b>
   </p>
</text><graphic file="1687-4722-2010-252374-1"/></fig>
      </sec>
      <sec>
         <st>
            <p>3. System Overview</p>
         </st>
         <p>The proposed model contains six modules, shown in Figure <figr fid="F2">2</figr>. In the front-end processing stage, the signal is decomposed into small units along time and frequency; each unit is called a T-F unit. After that, the features of each unit are extracted, such as the normalized ACF and normalized envelope ACF proposed in previous studies [<abbr bid="B16">16</abbr>], and the newly introduced carrier-to-envelope energy ratio. In the second stage, the DHF in each unit is computed. According to their characteristics, the units are first classified into two categories: (1) resolved T-F units, dominated by a single harmonic, and (2) unresolved T-F units, dominated by multiple harmonics. The computation of the DHF differs for resolved and unresolved T-F units; more details are given in Section 3.2. In the pitch estimation stage, the pitch of the target speech is extracted based on the DHFs. Before that, the resolved T-F units are merged into segments. Segmentation has been performed in previous CASA systems. A segment is a larger component of an auditory scene than a T-F unit and captures an acoustic component of a single source. An auditory segment is composed of a spatially continuous region of T-F units. Therefore, computational segments are formed according to time continuity and cross-channel correlation. It is reasonable to expect that high correlation indicates that adjacent channels are dominated by the same source. However, the frequencies of target and intrusion often overlap, which leads to computational segments being dominated by different sources. In our model, we expect a segment to be dominated by the same harmonic of the same source. Hence, we employ another unit feature, called harmonic order, to split the segments into relatively small ones. Its benefit is shown in the following subsection. Harmonic order indicates which harmonic of the sound dominates a unit. 
During the unit labeling stage, each T-F unit is labeled as target or intrusion according to the estimated pitch and DHF. In the fifth stage, T-F units are segregated into foreground and background based on segmentation. Finally, the separated speech is synthesized from the T-F units in the foreground.</p>
         <fig id="F2"><title><p>Figure 2</p></title><caption><p>Schematic diagram of the proposed multistage system.</p></caption><text>
   <p>
      <b>Schematic diagram of the proposed multistage system.</b>
   </p>
</text><graphic file="1687-4722-2010-252374-2"/></fig>
         <sec>
            <st>
               <p>3.1. Front-End Processing</p>
            </st>
            <sec>
               <st>
                  <p>3.1.1. Signal Decomposition</p>
               </st>
               <p>First, the input signal is decomposed by a 128-channel gammatone filterbank [<abbr bid="B20">20</abbr>] whose center frequencies are quasilogarithmically spaced from 80&#8201;Hz to 5&#8201;kHz and whose bandwidths are set according to the equivalent rectangular bandwidth (ERB). The gammatone filterbank simulates the characteristics of the basilar membrane of the cochlea. Then, the outputs of the filterbank are converted into neural firing rates by a hair cell model [<abbr bid="B21">21</abbr>]. The same processing is employed in [<abbr bid="B9">9</abbr>, <abbr bid="B15">15</abbr>]. Amplitude modulation (AM) is important for channels dominated by multiple harmonics. Psychoacoustic experiments have demonstrated that amplitude modulation, or beat rate, is perceived in a critical band within which harmonic partials are unresolved [<abbr bid="B6">6</abbr>]. The AM in each channel is obtained by performing a Hilbert transform on the gammatone filter output and then filtering the squared Hilbert envelope with a filter whose passband is (50&#8201;Hz, 550&#8201;Hz). In the following, the gammatone filter output, hair cell output, and amplitude modulation at channel <inline-formula><graphic file="1687-4722-2010-252374-i2.gif"/></inline-formula> are denoted by <inline-formula><graphic file="1687-4722-2010-252374-i3.gif"/></inline-formula>, <inline-formula><graphic file="1687-4722-2010-252374-i4.gif"/></inline-formula>, and <inline-formula><graphic file="1687-4722-2010-252374-i5.gif"/></inline-formula>, respectively.</p>
               <p>Then, time-frequency (T-F) units are formed with a 10&#8201;ms offset and a 20&#8201;ms window in each channel. Let <inline-formula><graphic file="1687-4722-2010-252374-i6.gif"/></inline-formula> denote a T-F unit for frequency channel <inline-formula><graphic file="1687-4722-2010-252374-i7.gif"/></inline-formula> and time frame <inline-formula><graphic file="1687-4722-2010-252374-i8.gif"/></inline-formula>. The T-F units will be segregated into foreground and background according to their features.</p>
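               <p>As a minimal sketch of the framing step, the following Python function lists the sample ranges covered by each frame of one channel for a 20 ms window and a 10 ms offset. The function name and the 16 kHz sampling rate in the example are assumptions made for illustration; the sampling rate is not stated in this section.</p>
               <p><![CDATA[
```python
def frame_bounds(num_samples, fs_hz, win_ms=20.0, hop_ms=10.0):
    """Return (start, end) sample indices of every complete frame.

    win_ms and hop_ms default to the 20 ms window and 10 ms offset used
    for the T-F units; fs_hz is supplied by the caller.
    """
    win = int(round(fs_hz * win_ms / 1000.0))   # window length in samples
    hop = int(round(fs_hz * hop_ms / 1000.0))   # frame offset in samples
    bounds = []
    start = 0
    while start + win <= num_samples:
        bounds.append((start, start + win))
        start += hop
    return bounds


# One second of signal at an assumed 16 kHz sampling rate.
frames = frame_bounds(16000, 16000)
print(len(frames), frames[0], frames[1])
```
]]></p>
               <p>Each returned range, taken in one filterbank channel, corresponds to one T-F unit; adjacent units overlap by half a window.</p>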
            </sec>
            <sec>
               <st>
                  <p>3.1.2. Feature Extraction</p>
               </st>
               <p>Previous research has shown that the correlogram is an effective mid-level auditory representation for pitch estimation and source segregation. Thus, the normalized correlogram and the normalized envelope correlogram are computed here. For T-F unit <inline-formula><graphic file="1687-4722-2010-252374-i9.gif"/></inline-formula>, they are computed by the following equations, which are the same as in [<abbr bid="B16">16</abbr>]:</p>
               <p>
                  <display-formula id="M1">
                     <graphic file="1687-4722-2010-252374-i10.gif"/>
                  </display-formula>
               </p>
               <p/>
               <p>
                  <display-formula id="M2">
                     <graphic file="1687-4722-2010-252374-i11.gif"/>
                  </display-formula>
               </p>
               <p>where lag <inline-formula><graphic file="1687-4722-2010-252374-i12.gif"/></inline-formula>, shift <inline-formula><graphic file="1687-4722-2010-252374-i13.gif"/></inline-formula> corresponds to 10&#8201;ms, and window length <inline-formula><graphic file="1687-4722-2010-252374-i14.gif"/></inline-formula>.</p>
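               <p>A normalized ACF of this kind can be sketched as follows. The sketch assumes one common normalization, in which the lagged product sum is divided by the geometric mean of the energies of the two windows; the exact form used in formulas (1) and (2) should be taken from the equations themselves and from [16]. The function name, the synthetic sinusoid, and the window and lag sizes are illustrative assumptions.</p>
               <p><![CDATA[
```python
import math


def normalized_acf(x, start, win, max_lag):
    """Normalized autocorrelation of one window of x for lags 0..max_lag.

    The reference window is x[start:start + win]; the lagged window is
    shifted forward by `lag` samples. Assumed normalization: divide by the
    geometric mean of the two window energies.
    """
    ref = x[start:start + win]
    ref_energy = sum(u * u for u in ref)
    acf = []
    for lag in range(max_lag + 1):
        shifted = x[start + lag:start + lag + win]
        num = sum(u * v for u, v in zip(ref, shifted))
        den = math.sqrt(ref_energy * sum(v * v for v in shifted))
        acf.append(num / den if den > 0.0 else 0.0)
    return acf


# Synthetic check: a sinusoid with a period of 40 samples.
period = 40
x = [math.sin(2.0 * math.pi * n / period) for n in range(400)]
acf = normalized_acf(x, start=0, win=200, max_lag=60)
best_lag = max(range(10, 61), key=lambda k: acf[k])
print(best_lag)
```
]]></p>
               <p>For the synthetic sinusoid, the largest value over lags 10 to 60 falls at lag 40, the period of the signal, and the value at zero lag is 1 by construction.</p>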
               <p>The peak positions of the ACF reflect the period of the signal or its multiples. <inline-formula><graphic file="1687-4722-2010-252374-i15.gif"/></inline-formula> is a suitable feature for segregating the T-F units dominated by a single harmonic. However, it is not suitable for the T-F units dominated by several harmonics because of the fluctuation of the peaks, as shown in Figure <figr fid="F3">3(b)</figr>. In this case, <inline-formula><graphic file="1687-4722-2010-252374-i16.gif"/></inline-formula>, whose first peak position usually corresponds to the pitch period, is employed for segregation. In order to remove the peaks at integer multiples of the pitch period, the normalized envelope ACF is further processed into an "enhanced" envelope ACF, as shown in Figure <figr fid="F3">3(d)</figr>. Specifically, <inline-formula><graphic file="1687-4722-2010-252374-i17.gif"/></inline-formula> is half rectified, expanded in time by a factor <inline-formula><graphic file="1687-4722-2010-252374-i18.gif"/></inline-formula>, and subtracted from the clipped <inline-formula><graphic file="1687-4722-2010-252374-i19.gif"/></inline-formula>, and the result is again half rectified. The iteration is performed for <inline-formula><graphic file="1687-4722-2010-252374-i20.gif"/></inline-formula> to cancel spurious peaks in the possible pitch range. The computation is similar to the one in [<abbr bid="B22">22</abbr>].</p>
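               <p>The enhancement procedure can be sketched as below. The sketch makes several assumptions that the text does not fix: the stretch factors (2, 3, 4), the use of linear interpolation when expanding in time, and stretching the current (already processed) curve at each iteration. The function name and the synthetic cosine input are also illustrative.</p>
               <p><![CDATA[
```python
import math


def enhance_acf(acf, factors=(2, 3, 4)):
    """Suppress ACF peaks at integer multiples of the first peak lag.

    Steps per factor n: stretch the current curve in lag by n, subtract it
    from the current curve, and half-wave rectify the result. The factor
    set and the linear interpolation are assumptions of this sketch.
    """
    enhanced = [max(v, 0.0) for v in acf]            # half-wave rectify
    last = len(enhanced) - 1
    for n in factors:
        stretched = []
        for lag in range(len(enhanced)):
            pos = lag / n                            # source position before stretching
            lo = int(pos)
            hi = min(lo + 1, last)
            frac = pos - lo
            stretched.append((1.0 - frac) * enhanced[lo] + frac * enhanced[hi])
        enhanced = [max(e - s, 0.0) for e, s in zip(enhanced, stretched)]
    return enhanced


# Synthetic envelope ACF with peaks at lags 0, 50, 100, and 150.
raw = [math.cos(2.0 * math.pi * lag / 50.0) for lag in range(200)]
enhanced = enhance_acf(raw)
peak_lag = max(range(len(enhanced)), key=lambda k: enhanced[k])
print(peak_lag)
```
]]></p>
               <p>For this synthetic input only the peak at lag 50 survives; the zero-lag peak and the peaks at lags 100 and 150 are cancelled by the stretched copies.</p>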
               <fig id="F3"><title><p>Figure 3</p></title><caption><p>(a) is channel response dominated by multiple harmonics; (b) is the ACF of the channel; (c) is the envelope ACF of the channel; (d) is the "enhanced" envelope ACF of the channel and the vertical line in (d) is the corresponding pitch period.</p></caption><text>
   <p>
      <b>(a) is channel response dominated by multiple harmonics; (b) is the ACF of the channel; (c) is the envelope ACF of the channel; (d) is the "enhanced" envelope ACF of the channel and the vertical line in (d) is the corresponding pitch period.</b>
   </p>
</text><graphic file="1687-4722-2010-252374-3"/></fig>
               <p>Since we use different features to segregate the T-F units dominated by a single harmonic and those dominated by several harmonics, it is important to classify the T-F units correctly according to their characteristics. For convenience, we define a resolved T-F unit as one dominated by a single harmonic and an unresolved T-F unit as one dominated by multiple harmonics. The fluctuation of the envelope is relatively severe in unresolved T-F units because of amplitude modulation. Figure <figr fid="F4">4</figr> shows the filter response and its envelope in a resolved T-F unit (Figure <figr fid="F4">4(a)</figr>) and in an unresolved T-F unit (Figure <figr fid="F4">4(b)</figr>). Here, the carrier-to-envelope energy ratio, a feature proposed in our previous work [<abbr bid="B23">23</abbr>], is employed to classify the units into resolved and unresolved ones. If <inline-formula><graphic file="1687-4722-2010-252374-i21.gif"/></inline-formula> is larger than a threshold, the T-F unit is regarded as resolved; otherwise, it is regarded as unresolved. For T-F unit <inline-formula><graphic file="1687-4722-2010-252374-i22.gif"/></inline-formula>, it is computed as</p>
               <p>
                  <display-formula id="M3">
                     <graphic file="1687-4722-2010-252374-i23.gif"/>
                  </display-formula>
               </p>
               <p/>
               <fig id="F4"><title><p>Figure 4</p></title><caption><p>Filter response (the solid line) and its envelope (the dashed line).</p></caption><text>
   <p><b>Filter response (the solid line) and its envelope (the dashed line).</b> (a) At channel 20 with center frequency 242&#8201;Hz. (b) At channel 100 with center frequency 2573&#8201;Hz.</p>
</text><graphic file="1687-4722-2010-252374-4"/></fig>
               <p>In a unit <inline-formula><graphic file="1687-4722-2010-252374-i24.gif"/></inline-formula>, severe fluctuation of the envelope leads to a small <inline-formula><graphic file="1687-4722-2010-252374-i25.gif"/></inline-formula>. Hence, we regard <inline-formula><graphic file="1687-4722-2010-252374-i26.gif"/></inline-formula> as unresolved if <inline-formula><graphic file="1687-4722-2010-252374-i27.gif"/></inline-formula> and as resolved otherwise. Here, <inline-formula><graphic file="1687-4722-2010-252374-i28.gif"/></inline-formula>, chosen according to our experiments.</p>
               <p>As demonstrated in [<abbr bid="B15">15</abbr>], cross-channel correlation measures the similarity between the responses of two adjacent filter channels and indicates whether the filters are responding to the same sound component or not. It is important for subsequent segmentation. Hence, the cross-channel correlation and cross-channel correlation of envelopes are calculated as</p>
               <p>
                  <display-formula id="M4">
                     <graphic file="1687-4722-2010-252374-i29.gif"/>
                  </display-formula>
               </p>
               <p/>
               <p>
                  <display-formula id="M5">
                     <graphic file="1687-4722-2010-252374-i30.gif"/>
                  </display-formula>
               </p>
               <p>where <inline-formula><graphic file="1687-4722-2010-252374-i31.gif"/></inline-formula> and <inline-formula><graphic file="1687-4722-2010-252374-i32.gif"/></inline-formula> are the zero-mean, unit-variance versions of <inline-formula><graphic file="1687-4722-2010-252374-i33.gif"/></inline-formula> and <inline-formula><graphic file="1687-4722-2010-252374-i34.gif"/></inline-formula>.</p>
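               <p>A cross-channel correlation of this kind can be sketched as the average product of the standardized ACFs of two adjacent channels. The sketch assumes that the average is taken over all lags of equal-length ACFs and that the population variance is used for standardization; the function names and the synthetic inputs are hypothetical.</p>
               <p><![CDATA[
```python
import math


def standardize(values):
    """Return a zero-mean, unit-variance copy of values (zeros if constant)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0.0:
        return [0.0] * n
    return [(v - mean) / std for v in values]


def cross_channel_correlation(acf_a, acf_b):
    """Average product of the standardized ACFs of two adjacent channels."""
    za = standardize(acf_a)
    zb = standardize(acf_b)
    return sum(p * q for p, q in zip(za, zb)) / len(za)


# Two channels responding to the same component give a value close to 1.
acf_low = [math.cos(2.0 * math.pi * lag / 50.0) for lag in range(200)]
acf_high = [0.9 * v + 0.05 for v in acf_low]     # same shape, different scale
same_source = cross_channel_correlation(acf_low, acf_high)
print(round(same_source, 6))
```
]]></p>
               <p>Because both inputs are standardized first, the measure ignores differences in scale and offset between channels and responds only to the similarity of the ACF shapes.</p>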
            </sec>
         </sec>
         <sec>
            <st>
               <p>3.2. Dynamic Harmonic Function</p>
            </st>
            <p>DHF is defined by a one-dimensional Gaussian mixture function, as in formula (6), which indicates the probability of lag <inline-formula><graphic file="1687-4722-2010-252374-i35.gif"/></inline-formula> being the pitch period. We use the variances of the Gaussian functions to narrow the peaks and the mixture coefficients to suppress the "invalid" peaks. In the following, we show how to calculate the parameters of the DHF. Although the form of the DHF is identical, the parameters are calculated differently for resolved and unresolved units.</p>
            <p>
               <display-formula id="M6">
                  <graphic file="1687-4722-2010-252374-i36.gif"/>
               </display-formula>
            </p>
            <p/>
            <p>
               <display-formula id="M7">
                  <graphic file="1687-4722-2010-252374-i37.gif"/>
               </display-formula>
            </p>
            <p>where lag <inline-formula><graphic file="1687-4722-2010-252374-i38.gif"/></inline-formula> (the same as in the ACF) and <inline-formula><graphic file="1687-4722-2010-252374-i39.gif"/></inline-formula> is the number of peaks of the ACF.</p>
            <p>In formula (6), four parameters must be computed: the component number, the Gaussian means, the Gaussian variances, and the Gaussian mixture coefficients. The component number equals the number of peaks of the ACF. The mean of the <inline-formula><graphic file="1687-4722-2010-252374-i40.gif"/></inline-formula>th Gaussian function is set to the position of the <inline-formula><graphic file="1687-4722-2010-252374-i41.gif"/></inline-formula>th peak of the ACF. The Gaussian variances, which control the peak width of the DHF, are determined later. The following part describes how the mixture coefficients are estimated.</p>
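            <p>To make the structure of the mixture concrete, the following sketch places one Gaussian bump at each ACF peak and evaluates the weighted sum over lag. It assumes an unnormalized Gaussian with unit peak height, so that each mixture coefficient directly sets a peak height; the exact form of formula (7) may differ. The function names, the mixture coefficients, and the width value 2.0 are illustrative assumptions.</p>
            <p><![CDATA[
```python
import math


def find_peaks(acf):
    """Lags of the local maxima of an ACF (the zero-lag peak is excluded)."""
    return [k for k in range(1, len(acf) - 1)
            if acf[k - 1] < acf[k] >= acf[k + 1]]


def dhf(num_lags, means, coeffs, sigma):
    """Weighted sum of unit-height Gaussian bumps centred at `means`."""
    values = []
    for lag in range(num_lags):
        total = 0.0
        for mu, weight in zip(means, coeffs):
            total += weight * math.exp(-((lag - mu) ** 2) / (2.0 * sigma ** 2))
        values.append(total)
    return values


# ACF of a channel whose peaks fall at lags 50, 100, and 150.
acf = [math.cos(2.0 * math.pi * lag / 50.0) for lag in range(200)]
means = find_peaks(acf)
coeffs = [1.0, 0.2, 0.1]          # hypothetical mixture coefficients
d = dhf(len(acf), means, coeffs, sigma=2.0)
print(means, round(d[50], 3), round(d[100], 3))
```
]]></p>
            <p>Compared with the cosine-shaped ACF, the resulting curve has narrow peaks whose width is set by the Gaussian parameter, and the peaks with small coefficients are strongly attenuated.</p>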
            <p>For the DHF of a T-F unit, we want a higher peak at the pitch period if the unit is dominated by voiced sound, which means a larger mixture coefficient for the corresponding Gaussian function. Therefore, our task is to estimate the pitch period at each T-F unit. Consider an example first. The input signal is a complex tone with <inline-formula><graphic file="1687-4722-2010-252374-i42.gif"/></inline-formula>&#8201;Hz, and all harmonics have equal amplitude. Figures <figr fid="F5">5(a)</figr>&#8211;<figr fid="F5">5(c)</figr> show the ACFs of the correlogram at channels 10, 30, and 45, with center frequencies 148&#8201;Hz, 360&#8201;Hz, and 612&#8201;Hz, respectively, and Figure <figr fid="F5">5(d)</figr> shows the enhanced envelope ACF at channel 100, with center frequency 2573&#8201;Hz. Channel 30 is evidently dominated by the second harmonic of the complex tone. However, this is not indicated by the ACF, because its peaks have equal amplitude. In fact, without information from the other channels, there are several interpretations of channel 30 according to the ACF. For example, the channel could be dominated by the second harmonic where <inline-formula><graphic file="1687-4722-2010-252374-i43.gif"/></inline-formula>&#8201;Hz or by the fourth harmonic where <inline-formula><graphic file="1687-4722-2010-252374-i44.gif"/></inline-formula>&#8201;Hz. In the DHF, we expect the second mixture coefficient to be larger than the others. The analysis above implies that the computation of a mixture coefficient has to combine information from other channels. Accordingly, the mixture coefficient of the DHF for resolved T-F unit <inline-formula><graphic file="1687-4722-2010-252374-i45.gif"/></inline-formula> is computed as follows:</p>
            <p>
               <display-formula id="M8">
                  <graphic file="1687-4722-2010-252374-i46.gif"/>
               </display-formula>
            </p>
            <p/>
            <p>
               <display-formula id="M9">
                  <graphic file="1687-4722-2010-252374-i47.gif"/>
               </display-formula>
            </p>
            <p>where <inline-formula><graphic file="1687-4722-2010-252374-i48.gif"/></inline-formula> is the mean of the <inline-formula><graphic file="1687-4722-2010-252374-i49.gif"/></inline-formula>th Gaussian function; <inline-formula><graphic file="1687-4722-2010-252374-i50.gif"/></inline-formula>. Formula (8) gives the pseudopossibility of <inline-formula><graphic file="1687-4722-2010-252374-i51.gif"/></inline-formula> being dominated by the <inline-formula><graphic file="1687-4722-2010-252374-i52.gif"/></inline-formula>th harmonic of a sound with pitch period <inline-formula><graphic file="1687-4722-2010-252374-i53.gif"/></inline-formula>, and (9) gives the possibility of the <inline-formula><graphic file="1687-4722-2010-252374-i54.gif"/></inline-formula>th harmonic with hypothesized pitch period <inline-formula><graphic file="1687-4722-2010-252374-i55.gif"/></inline-formula> appearing at frame <inline-formula><graphic file="1687-4722-2010-252374-i56.gif"/></inline-formula>.</p>
            <p>
               <display-formula id="M10">
                  <graphic file="1687-4722-2010-252374-i57.gif"/>
               </display-formula>
            </p>
            <p/>
            <fig id="F5"><title><p>Figure 5</p></title><caption><p>(a) ACF at channel 10 whose center frequency (cf) is 148 Hz; (b) ACF at channel 30 whose cf is 360 Hz; (c) ACF at channel 45 whose cf is 612 Hz; (d) enhanced envelope ACF at channel 100 whose cf is 2573 Hz. The input signal is a complex tone with <inline-formula><graphic file="1687-4722-2010-252374-i58.gif"/></inline-formula> Hz. The vertical dashed line shows the pitch period.</p></caption><text>
   <p>
      <b>(a) ACF at channel 10 whose center frequency (cf) is 148 Hz; (b) ACF at channel 30 whose cf is 360 Hz; (c) ACF at channel 45 whose cf is 612 Hz; (d) enhanced envelope ACF at channel 100 whose cf is 2573 Hz. The input signal is a complex tone with <inline-formula><graphic file="1687-4722-2010-252374-i58.gif"/></inline-formula> Hz. The vertical dashed line shows the pitch period.</b>
   </p>
</text><graphic file="1687-4722-2010-252374-5"/></fig>
            <p>Formula (10) shows that the <inline-formula><graphic file="1687-4722-2010-252374-i59.gif"/></inline-formula>th mixture coefficient depends on the appearance of the <inline-formula><graphic file="1687-4722-2010-252374-i60.gif"/></inline-formula>th or <inline-formula><graphic file="1687-4722-2010-252374-i61.gif"/></inline-formula>th harmonic. As seen in Figure <figr fid="F5">5</figr>, the second mixture coefficient of the DHF in (b) is large because channels (a) and (c) are dominated by the first and the third harmonics of the complex tone, whose pitch period is 5.0&#8201;ms. The fourth mixture coefficient, in contrast, is small because no channels are dominated by the third or the fifth harmonic, whose frequencies would be 300&#8201;Hz and 500&#8201;Hz, respectively.</p>
            <p>From formulas (8)&#8211;(10), it can be seen that a mixture coefficient of the DHF depends not on all of its related harmonics but only on its two neighbors. One reason is to simplify the algorithm. The other is that previous psychoacoustic experiments [<abbr bid="B6">6</abbr>] showed that the nearest related harmonics have the strongest effect on harmonic fusion. In those experiments, a rich tone with 10 harmonics was alternated with a pure tone, and the researchers checked whether a harmonic of the rich tone could be captured by the pure tone. It was found that a harmonic was easier to capture out of the complex tone when neighboring harmonics were removed. One of the conclusions was that "the greater the frequency separation between a harmonic and its nearest frequency neighbors, the easier it was to capture it out of the complex tone."</p>
            <p>For an unresolved T-F unit, the computation of the mixture coefficients differs from that for a resolved unit. One reason is that an unresolved T-F unit is dominated by several harmonics at the same time, so the peak order of its ACF does not reflect the harmonic order accurately. Another reason is that the resolution of the gammatone filters is relatively low in the high-frequency region, so the continuously numbered harmonic structure cannot be found in the correlogram. Fortunately, the peak of the enhanced envelope ACF tends to appear around the pitch period, as shown in Figure <figr fid="F5">5(d)</figr>. This implies that the mixture coefficient should be large if the mean of the Gaussian function is close to the peak of the enhanced envelope ACF. Therefore, the mixture coefficient is set equal to the amplitude of the enhanced envelope ACF at the mean of the Gaussian function, as in formula (11).</p>
            <p>
               <display-formula id="M11">
                  <graphic file="1687-4722-2010-252374-i62.gif"/>
               </display-formula>
            </p>
            <p>where <inline-formula><graphic file="1687-4722-2010-252374-i63.gif"/></inline-formula> is the enhanced envelope ACF and <inline-formula><graphic file="1687-4722-2010-252374-i64.gif"/></inline-formula> is the position of the <inline-formula><graphic file="1687-4722-2010-252374-i65.gif"/></inline-formula>th peak of the ACF.</p>
            <p>In order to estimate the pitch, we also define the summary DHF at frame <inline-formula><graphic file="1687-4722-2010-252374-i66.gif"/></inline-formula> as in formula (12).</p>
            <p>
               <display-formula id="M12">
                  <graphic file="1687-4722-2010-252374-i67.gif"/>
               </display-formula>
            </p>
            <p/>
            <p>Figure <figr fid="F6">6</figr> compares the correlogram with the DHFs. It can be seen that (1) there are fewer peaks in the DHFs than in the ACFs, (2) the peaks at the pitch period are properly preserved, and (3) the peaks in the summary DHF are narrower than those in the summary correlogram. Figure <figr fid="F7">7</figr> compares the periodograms (a periodogram is a time series of summary correlograms). The input signal is a male utterance, "where were you away a year, <it>Roy</it>", mixed with a female utterance. In the conventional periodogram (a), the pitch information of the two sources is mixed together and is hard to separate directly, whereas it is clear in the DHF periodogram (b).</p>
            <fig id="F6"><title><p>Figure 6</p></title><caption><p>Auditory features.</p></caption><text>
   <p><b>Auditory features.</b> The input signal is a complex tone with <inline-formula><graphic file="1687-4722-2010-252374-i68.gif"/></inline-formula>&#8201;Hz. (a) Correlogram at frame <inline-formula><graphic file="1687-4722-2010-252374-i69.gif"/></inline-formula> for the clean female speech (channels 1&#8211;80 are ACFs, channels 81&#8211;128 are envelope ACFs). The summary correlogram is shown in the bottom panel; (b) corresponding dynamic harmonic functions. The summary dynamic harmonic function is shown in the bottom panel. The variance of DHF <inline-formula><graphic file="1687-4722-2010-252374-i70.gif"/></inline-formula> is 2.0.</p>
</text><graphic file="1687-4722-2010-252374-6"/></fig>
            <fig id="F7"><title><p>Figure 7</p></title><caption><p><inline-formula><graphic file="1687-4722-2010-252374-i71.gif"/></inline-formula>-axis is frame, <inline-formula><graphic file="1687-4722-2010-252374-i72.gif"/></inline-formula>-axis is lag; (a) Conventional periodogram (channels 1&#8211;80 are ACFs, channels 81&#8211;128 are envelope ACFs); (b) Dynamic harmonic function periodogram.</p></caption><text>
   <p><b><inline-formula><graphic file="1687-4722-2010-252374-i71.gif"/></inline-formula>-axis is frame, <inline-formula><graphic file="1687-4722-2010-252374-i72.gif"/></inline-formula>-axis is lag; (a) Conventional periodogram (channels 1&#8211;80 are ACFs, channels 81&#8211;128 are envelope ACFs); (b) Dynamic harmonic function periodogram.</b> The input signal is male speech mixed with female speech.</p>
</text><graphic file="1687-4722-2010-252374-7"/></fig>
         </sec>
         <sec>
            <st>
               <p>3.3. Pitch Estimation</p>
            </st>
            <p>Pitch estimation in noisy environments is closely related to sound separation. On the one hand, if the mixed sound has been separated, the pitch of each sound can be obtained relatively easily. On the other hand, pitch is a very efficient grouping cue for sound separation and is widely used in previous systems [<abbr bid="B8">8</abbr>, <abbr bid="B9">9</abbr>, <abbr bid="B15">15</abbr>]. In the Hu and Wang model, a continuous pitch estimation method based on the correlogram is proposed, in which T-F units are merged into segments according to cross-channel correlation and time continuity. Each segment is expected to be dominated by a single voiced sound. First, the longest segment is used as a criterion to separate the segments initially into foreground and background. Then, the pitch contour is formed from the units in the foreground, followed by sequential linear interpolation; more details can be found in [<abbr bid="B9">9</abbr>]. </p>
            <p>It is obvious that the initial separation plays an important role in pitch estimation. Although the result of this simple decision can be adjusted in the following stage through iterative estimation and linear interpolation so as to give an acceptable pitch contour, it still does not satisfy the requirements of segregation and may also place some segments that are dominated by intrusions into the foreground. This will certainly affect the accuracy of the estimated pitch.</p>
            <p>As a matter of fact, the pitch period is reflected by the ACF of each harmonic. The problem is that the ACF has multiple peaks. Pitch estimation would be simple if we could find the longest segment that is dominated not only by the same source but also by the same harmonic, and if we also knew its harmonic order. It would then only be necessary to sum the corresponding peaks at each frame and take the position of the maximum peak as the pitch period. This process avoids source separation and pitch interpolation. Guided by this analysis, we try (1) to find the longest segment and (2) to estimate the harmonic order. In this subsection, we solve these two problems based on DHFs.</p>
            <p>In previous systems [<abbr bid="B9">9</abbr>, <abbr bid="B15">15</abbr>], segments are formed by cross-channel correlation and time continuity of T-F units. The motivation is that high cross-channel correlation indicates that adjacent channels are dominated by the same harmonic, and that voiced sections are continuous in time. However, some of the segments formed in this way are dominated by different sources or by multiple harmonics. Figure <figr fid="F8">8(a)</figr> shows the segments generated by cross-channel correlation and time continuity. The input signal is voiced speech mixed with click noise. The black region is dominated by speech and the gray region is dominated by click noise. Click noise has no harmonic structure, and units in the higher channels are dominated by multiple harmonics, whereas we expect each segment to be dominated by a single harmonic of the same source. Therefore, it is not appropriate to use these segments directly. Here, we add two other features of the T-F unit for segmentation. One is the carrier-to-envelope energy ratio, which is computed by formula (3), and the other is the unit harmonic order.</p>
            <fig id="F8"><title><p>Figure 8</p></title><caption><p>Segmentation comparison.</p></caption><text>
   <p><b>Segmentation comparison.</b> The input signal is voiced speech mixed with click noise. (a) Segments formed by cross-channel correlation and time continuity. The black region is dominated by speech and the gray region is dominated by click noise. (b) Segments formed by cross-channel correlation, time continuity, and carrier-to-envelope energy ratio.</p>
</text><graphic file="1687-4722-2010-252374-8"/></fig>
            <sec>
               <st>
                  <p>3.3.1. Initial Segmentation</p>
               </st>
               <p>As mentioned in Section 3.2, T-F units are classified as resolved or unresolved by the carrier-to-envelope energy ratio. Each resolved T-F unit is dominated by a single harmonic. In addition, because the passbands of adjacent channels overlap significantly, a resolved harmonic usually activates adjacent channels, which leads to high cross-channel correlations. Thus, only resolved T-F units with sufficiently high cross-channel correlations are considered. More specifically, a resolved unit <inline-formula><graphic file="1687-4722-2010-252374-i73.gif"/></inline-formula> is selected for consideration if <inline-formula><graphic file="1687-4722-2010-252374-i74.gif"/></inline-formula>, a threshold chosen to be a little lower than in [<abbr bid="B15">15</abbr>]. Selected neighboring units are iteratively merged into segments. Finally, segments shorter than 30&#8201;ms are removed, since they are unlikely to arise from target speech. Figure <figr fid="F8">8(b)</figr> shows the segmentation result for the same signal as in Figure <figr fid="F8">8(a)</figr>.</p>
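               <p>The selection, merging, and pruning steps of this subsection can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation evaluated in this paper: the function name initial_segments, the 4-connected neighbourhood used for merging, the 10&#8201;ms frame shift, and the correlation threshold supplied by the caller are all hypothetical choices made for the example.</p>
               <p><![CDATA[
```python
from collections import deque


def initial_segments(resolved, cross_corr, corr_threshold,
                     frame_shift_ms=10.0, min_duration_ms=30.0):
    """Group selected T-F units into segments.

    resolved[c][m]   -- True if unit (channel c, frame m) is resolved
    cross_corr[c][m] -- cross-channel correlation of that unit
    corr_threshold   -- hypothetical threshold, chosen by the caller
    Returns a list of segments, each a set of (channel, frame) pairs.
    """
    n_channels = len(resolved)
    n_frames = len(resolved[0]) if n_channels else 0
    # Step 1: keep resolved units with sufficiently high correlation.
    selected = [[bool(resolved[c][m]) and cross_corr[c][m] > corr_threshold
                 for m in range(n_frames)] for c in range(n_channels)]
    visited = [[False] * n_frames for _ in range(n_channels)]
    segments = []
    for c in range(n_channels):
        for m in range(n_frames):
            if not selected[c][m] or visited[c][m]:
                continue
            # Step 2: merge neighbouring units (assumed 4-connectivity)
            # with a breadth-first flood fill.
            queue = deque([(c, m)])
            visited[c][m] = True
            segment = set()
            while queue:
                ch, fr = queue.popleft()
                segment.add((ch, fr))
                for dc, dm in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nc, nm = ch + dc, fr + dm
                    if (0 <= nc < n_channels and 0 <= nm < n_frames
                            and selected[nc][nm] and not visited[nc][nm]):
                        visited[nc][nm] = True
                        queue.append((nc, nm))
            # Step 3: drop segments shorter than the minimum duration.
            frames = [fr for _, fr in segment]
            duration_ms = (max(frames) - min(frames) + 1) * frame_shift_ms
            if duration_ms >= min_duration_ms:
                segments.append(segment)
    return segments
```
]]></p>
               <p>In this sketch the duration of a segment is taken as the number of frames it spans multiplied by the assumed frame shift, so an isolated unit is discarded while a run of three or more frames is kept.</p>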
            </sec>
            <sec>
               <st>
                  <p>3.3.2. Harmonic Order Computation</p>
               </st>
               <p>For a resolved T-F unit <inline-formula><graphic file="1687-4722-2010-252374-i75.gif"/></inline-formula>, the harmonic order <inline-formula><graphic file="1687-4722-2010-252374-i76.gif"/></inline-formula> indicates which harmonic dominates the unit. Although the DHF suppresses some of the peaks compared with the ACF, there are still multiple invalid peaks, especially at fractions of the pitch period, as seen in Figure <figr fid="F6">6(b)</figr>. Therefore, the harmonic order still cannot be decided from the DHF alone. Fortunately, the peaks at fractional pitch periods are suppressed in the summary DHF. Hence, the computation combines the DHF and the summary DHF as </p>
               <p>
                  <display-formula id="M13">
                     <graphic file="1687-4722-2010-252374-i77.gif"/>
                  </display-formula>
               </p>
               <p/>
               <p>From the above algorithm, we can see that the harmonic order of a resolved unit depends on a single frame. Owing to noise interference, the estimated harmonic orders of some units are unreliable. Therefore, we extend the estimation using segmentation. Firstly, the initial segments are further split according to the harmonic order of the resolved T-F units. The newly formed segments include small segments (shorter than 50&#8201;ms) and large segments (longer than 50&#8201;ms). Secondly, connected small segments are merged together. The units in the remaining small segments are absorbed by neighboring segments. Finally, the harmonic order of each unit is recomputed by formula (14). For units in segment <inline-formula><graphic file="1687-4722-2010-252374-i78.gif"/></inline-formula>, the harmonic orders are set in accordance with the segment harmonic order</p>
               <p>
                  <display-formula id="M14">
                     <graphic file="1687-4722-2010-252374-i79.gif"/>
                  </display-formula>
               </p>
               <p/>
               <p>Here, all the variances of the DHFs are set to 2.0 for the computation of the summary DHF. The results are not significantly affected when the variances are in the range [2, 4]. Values that are too large cause mutual influence between peaks of different sources, whereas values that are too small cannot describe the variation of the peaks of units that are dominated by target speech.</p>
            </sec>
            <sec>
               <st>
                  <p>3.3.3. Pitch Contour Tracking</p>
               </st>
               <p>For voiced speech, the first several harmonics have more energy than the others and are relatively robust to noise. Here, we use only the longest segment to estimate the pitch contour. With the harmonic order known, it is quite easy to estimate the pitch from the longest segment alone. The algorithm is as follows:</p>
               <p indent="1">(1) sum the <inline-formula><graphic file="1687-4722-2010-252374-i80.gif"/></inline-formula>th peak of the DHF of the T-F units in the longest segment at each frame, where <inline-formula><graphic file="1687-4722-2010-252374-i81.gif"/></inline-formula> is the harmonic order of the T-F unit,</p>
               <p indent="1">(2) normalize the maximum value of the summation at each frame to 1,</p>
               <p indent="1">(3) find all the peaks of the summation as pitch period candidates at each frame,</p>
               <p indent="1">(4) track the pitch contour within the candidates by dynamic programming,</p>
               <p/>
               <p>
                  <display-formula id="M15">
                     <graphic file="1687-4722-2010-252374-i82.gif"/>
                  </display-formula>
               </p>
               <p>where <inline-formula><graphic file="1687-4722-2010-252374-i83.gif"/></inline-formula> is the summation at frame <inline-formula><graphic file="1687-4722-2010-252374-i84.gif"/></inline-formula>, <inline-formula><graphic file="1687-4722-2010-252374-i85.gif"/></inline-formula> is the <inline-formula><graphic file="1687-4722-2010-252374-i86.gif"/></inline-formula>th peak of <inline-formula><graphic file="1687-4722-2010-252374-i87.gif"/></inline-formula>, the weight <inline-formula><graphic file="1687-4722-2010-252374-i88.gif"/></inline-formula>.</p>
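               <p>Step (4) can be illustrated with a small dynamic programming sketch. Formula (15) is not reproduced here, so the score below is an assumption made only for the example: each path accumulates the normalized candidate amplitudes and pays a penalty proportional to the jump in lag between consecutive frames. The function name track_pitch and the default weight of 0.1 are likewise hypothetical.</p>
               <p><![CDATA[
```python
def track_pitch(candidates, weight=0.1):
    """Pick one pitch-period candidate per frame by dynamic programming.

    candidates -- list over frames; each entry is a non-empty list of
                  (lag, amplitude) pairs, amplitudes normalized to 1
    weight     -- assumed penalty per sample of lag change between frames
    Returns the list of selected lags, one per frame.
    """
    if not candidates or any(len(frame) == 0 for frame in candidates):
        raise ValueError("every frame needs at least one candidate")
    # best[m][i]: best accumulated score of a path ending at candidate i
    # of frame m; back[m][i]: index of its predecessor in frame m - 1.
    best = [[amp for _, amp in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for m in range(1, len(candidates)):
        scores, pointers = [], []
        for lag, amp in candidates[m]:
            options = [best[m - 1][j] - weight * abs(lag - prev_lag)
                       for j, (prev_lag, _) in enumerate(candidates[m - 1])]
            j_best = max(range(len(options)), key=options.__getitem__)
            scores.append(amp + options[j_best])
            pointers.append(j_best)
        best.append(scores)
        back.append(pointers)
    # Backtrack from the best final candidate.
    i = max(range(len(best[-1])), key=best[-1].__getitem__)
    path = []
    for m in range(len(candidates) - 1, -1, -1):
        path.append(candidates[m][i][0])
        i = back[m][i]
    path.reverse()
    return path
```
]]></p>
               <p>With this kind of score, a frame whose strongest candidate lies far from its neighbours (for example at half the pitch period) is overruled by a slightly weaker candidate that keeps the contour continuous.</p>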
               <p>Figures <figr fid="F9">9(a)</figr> and <figr fid="F9">9(b)</figr> illustrate the summary DHF (only with the peak corresponding to the harmonic order) in the longest segment and the pitch contour. As shown in the figure, the pitch contour is roughly given by the summary DHF, and the dynamic programming corrects some errors during pitch tracking. Figure <figr fid="F9">9(b)</figr> shows that the estimated pitch contour matches that of the clean speech very well at most frames.</p>
               <fig id="F9"><title><p>Figure 9</p></title><caption><p>Result of pitch for the mixture of speech and cocktail party.</p></caption><text>
   <p><b>Result of pitch for the mixture of speech and cocktail party.</b> (a) Summary of dynamic harmonic function (only with the peak corresponding to harmonic order) within longest segment. (b) Estimated pitch contour, marked by "o" and the solid line is the pitch contour obtained from clean speech before mixing.</p>
</text><graphic file="1687-4722-2010-252374-9"/></fig>
            </sec>
         </sec>
         <sec>
            <st>
               <p>3.4. Unit Labeling</p>
            </st>
            <p>The pitch computed above is used to label the T-F units according to whether or not target speech dominates the unit responses. The Hu and Wang model tests whether the pitch period is close to the maximum peak of the ACF, because for units dominated by target speech there should be a peak around the pitch period. The method employed here is similar but has some differences. </p>
            <p>For resolved T-F units, the maximum peak of the DHF tends to appear at the pitch period, as presented in the previous section. We could label a unit <inline-formula><graphic file="1687-4722-2010-252374-i89.gif"/></inline-formula> as target speech if <inline-formula><graphic file="1687-4722-2010-252374-i90.gif"/></inline-formula> is close to the maximum peak of the DHF. However, the computation of the DHF is influenced by noise, so the method is modified to obtain robust results. For a resolved T-F unit <inline-formula><graphic file="1687-4722-2010-252374-i91.gif"/></inline-formula> in a segment (generated in Section 3.3), if the index of its peak nearest to the pitch period equals the harmonic order <inline-formula><graphic file="1687-4722-2010-252374-i92.gif"/></inline-formula> and (16) is satisfied, the unit is labeled as target; otherwise it is labeled as intrusion</p>
            <p>
               <display-formula id="M16">
                  <graphic file="1687-4722-2010-252374-i93.gif"/>
               </display-formula>
            </p>
            <p>where <inline-formula><graphic file="1687-4722-2010-252374-i94.gif"/></inline-formula>; <inline-formula><graphic file="1687-4722-2010-252374-i95.gif"/></inline-formula> is the estimated pitch period at frame <inline-formula><graphic file="1687-4722-2010-252374-i96.gif"/></inline-formula>; the variance <inline-formula><graphic file="1687-4722-2010-252374-i97.gif"/></inline-formula> for <inline-formula><graphic file="1687-4722-2010-252374-i98.gif"/></inline-formula>. </p>
            <p>For an unresolved T-F unit, we cannot use the same labeling method as for a resolved T-F unit because it is dominated by multiple harmonics. As analyzed before, the peaks of the envelope ACF tend to appear at the pitch period. Thus, the DHF of an unresolved unit shows a large peak at the pitch period. The labeling method is changed to (17), which compares the pseudo-probability at <inline-formula><graphic file="1687-4722-2010-252374-i99.gif"/></inline-formula> with that at the most probable pitch period in the unit. If the ratio is larger than the threshold <inline-formula><graphic file="1687-4722-2010-252374-i100.gif"/></inline-formula>, the unresolved T-F unit is labeled as target; otherwise it is labeled as intrusion</p>
            <p>
               <display-formula id="M17">
                  <graphic file="1687-4722-2010-252374-i101.gif"/>
               </display-formula>
            </p>
            <p>where <inline-formula><graphic file="1687-4722-2010-252374-i102.gif"/></inline-formula>; the variance <inline-formula><graphic file="1687-4722-2010-252374-i103.gif"/></inline-formula>.</p>
            <p>The variance <inline-formula><graphic file="1687-4722-2010-252374-i104.gif"/></inline-formula> of the DHF in each unit depends on the position of the first peak, <inline-formula><graphic file="1687-4722-2010-252374-i105.gif"/></inline-formula>. This makes the peak width of the DHF close to that of the ACF. The threshold <inline-formula><graphic file="1687-4722-2010-252374-i106.gif"/></inline-formula> is chosen according to our experimental results.</p>
         </sec>
         <sec>
            <st>
               <p>3.5. Segregation Based on Segment</p>
            </st>
            <p>In this stage, units are segregated based on segmentation. Previous studies showed that segment-based segregation is more robust. Our method here is very similar to the Hu and Wang model [<abbr bid="B9">9</abbr>].</p>
            <sec>
               <st>
                  <p>3.5.1. Resolved Segment Grouping</p>
               </st>
               <p>A resolved segment generated in Section 3.3 is segregated into the foreground <inline-formula><graphic file="1687-4722-2010-252374-i107.gif"/></inline-formula> if more than half of its units are marked as target; otherwise it is segregated into the background <inline-formula><graphic file="1687-4722-2010-252374-i108.gif"/></inline-formula>. The spectra of target and intrusion often overlap, and as a result, some resolved segments contain units dominated by target as well as units dominated by intrusion. Therefore, <inline-formula><graphic file="1687-4722-2010-252374-i109.gif"/></inline-formula> is further divided according to the unit labels. The target units and intrusion units in <inline-formula><graphic file="1687-4722-2010-252374-i110.gif"/></inline-formula> are merged into segments according to frequency and time continuity. A segment is retained in <inline-formula><graphic file="1687-4722-2010-252374-i111.gif"/></inline-formula> if it is made up of target units and is longer than 50&#8201;ms. A segment is added to <inline-formula><graphic file="1687-4722-2010-252374-i112.gif"/></inline-formula> if it is made up of intrusion units and is longer than 50&#8201;ms. The remaining smaller segments are removed.</p>
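               <p>The first step of this grouping, the majority vote over unit labels, can be sketched as below. The sketch covers only that vote and not the subsequent division by unit label; the function name group_resolved_segments and the representation of units as (channel, frame) pairs are assumptions made for the example.</p>
               <p><![CDATA[
```python
def group_resolved_segments(segments, target_units):
    """Assign each resolved segment to foreground or background.

    segments     -- iterable of segments, each a set of (channel, frame)
    target_units -- set of (channel, frame) units labeled as target
    A segment goes to the foreground only if strictly more than half of
    its units are labeled as target; otherwise it goes to the background.
    """
    foreground, background = [], []
    for segment in segments:
        n_target = sum(1 for unit in segment if unit in target_units)
        if 2 * n_target > len(segment):
            foreground.append(segment)
        else:
            background.append(segment)
    return foreground, background
```
]]></p>
               <p>Note that a segment with exactly half of its units labeled as target falls into the background under this rule, since the condition requires more than half.</p>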
            </sec>
            <sec>
               <st>
                  <p>3.5.2. Unresolved Segment Grouping</p>
               </st>
               <p>Unresolved segments are formed from target unresolved T-F units according to frequency and time continuity. Segments longer than 30&#8201;ms are retained. The remaining units in small segments are merged into the large segments iteratively. Finally, the unresolved units in large segments are grouped into <inline-formula><graphic file="1687-4722-2010-252374-i113.gif"/></inline-formula>, and the rest are grouped into <inline-formula><graphic file="1687-4722-2010-252374-i114.gif"/></inline-formula>. This processing is similar to that of the Hu and Wang model.</p>
            </sec>
         </sec>
         <sec>
            <st>
               <p>3.6. Resynthesis</p>
            </st>
            <p>Finally, the units in the foreground <inline-formula><graphic file="1687-4722-2010-252374-i115.gif"/></inline-formula> are resynthesized into a waveform by the method in [<abbr bid="B12">12</abbr>]. Figure <figr fid="F10">10</figr> shows example waveforms: the clean speech in Figure <figr fid="F10">10(a)</figr>, the mixture (with cocktail party noise) in Figure <figr fid="F10">10(b)</figr>, and the speech segregated by the proposed system in Figure <figr fid="F10">10(c)</figr>. As can be seen, the segregated speech resembles the major parts of the clean speech.</p>
            <fig id="F10"><title><p>Figure 10</p></title><caption><p>Waveforms.</p></caption><text>
   <p><b>Waveforms.</b> (a) clean speech; (b) mixture of clean speech and cocktail party noise; (c) segregated speech by the proposed method.</p>
</text><graphic file="1687-4722-2010-252374-10"/></fig>
         </sec>
      </sec>
      <sec>
         <st>
            <p>4. Evaluation and Results</p>
         </st>
         <p>The proposed model is evaluated on a corpus of 100 mixtures composed of ten voiced utterances mixed with ten different kinds of intrusions collected by Cooke [<abbr bid="B8">8</abbr>]. In the dataset, the ten voiced utterances have continuous pitch throughout nearly their whole duration. The intrusions are ten different kinds of sounds: N0, 1&#8201;kHz pure tone; N1, white noise; N2, noise bursts; N3, "cocktail party" noise; N4, rock music; N5, siren; N6, trill telephone; N7, female speech; N8, male speech; and N9, another female speech. The ten voiced utterances are regarded as targets. The sampling rate of the corpus is 16&#8201;kHz.</p>
         <p>There are two main reasons for using this dataset. The first is that the proposed system focuses on primitive-driven [<abbr bid="B6">6</abbr>] separation, and on this dataset the system can obtain the pitch of a single source without schema-driven principles. The other reason is that the dataset has been widely used to evaluate CASA-based separation systems [<abbr bid="B8">8</abbr>, <abbr bid="B9">9</abbr>, <abbr bid="B15">15</abbr>], which facilitates comparison.</p>
         <p>One objective evaluation criterion is the signal-to-noise ratio (SNR) between the original signal and the distorted signal after segregation. Although SNR is a conventional measure for system evaluation, it is not always consistent with voice quality. Perceptual evaluation of speech quality (ITU-T P.862 PESQ, 2001) is therefore employed as another objective evaluation criterion. ITU-T P.862 is an intrusive objective speech quality assessment algorithm. Since the original speech before mixing is available, it is convenient to apply the ITU-T P.862 algorithm to obtain an intrusive speech quality evaluation of the separated speech.</p>
         <p>SNR is measured in decibels and computed by the following equation. The results are listed in Table <tblr tid="T1">1</tblr></p>
         <p>
            <display-formula id="M18">
               <graphic file="1687-4722-2010-252374-i116.gif"/>
            </display-formula>
         </p>
         <p>where <inline-formula><graphic file="1687-4722-2010-252374-i117.gif"/></inline-formula> is original voiced speech and <inline-formula><graphic file="1687-4722-2010-252374-i118.gif"/></inline-formula> is the synthesized waveform by segregation systems.</p>
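         <p>As a minimal sketch of this measure, the function below assumes that the equation takes the usual form, ten times the base-10 logarithm of the energy of the original speech divided by the energy of its difference from the resynthesized waveform. That form, the function name snr_db, and the handling of the degenerate zero-energy cases are assumptions made for the example.</p>
         <p><![CDATA[
```python
import math


def snr_db(original, resynthesized):
    """SNR in dB between original speech and a segregated waveform.

    Assumed form: 10 * log10(sum(s^2) / sum((s - s_hat)^2)), where s is
    the original voiced speech and s_hat the resynthesized waveform.
    """
    if len(original) != len(resynthesized):
        raise ValueError("signals must have the same length")
    signal_energy = sum(s * s for s in original)
    error_energy = sum((s - r) ** 2
                       for s, r in zip(original, resynthesized))
    if error_energy == 0.0:
        return math.inf   # perfect reconstruction
    if signal_energy == 0.0:
        return -math.inf  # silent reference, non-zero error
    return 10.0 * math.log10(signal_energy / error_energy)
```
]]></p>
         <p>For instance, a waveform that reproduces every sample of the original at 90% of its amplitude leaves an error of one hundredth of the signal energy, which corresponds to 20&#8201;dB.</p>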
         <tbl id="T1"><title><p>Table 1</p></title><caption><p>SNR Results. (Mixture: Original degraded speech; Hu-Wang: Hu and Wang model; Proposed: Proposed model; TP Hu-Wang: true pitch-based Hu and Wang model; TP proposed: true pitch-based proposed model; IBM: Ideal binary mask)</p></caption><tblbdy cols="12">
      <r>
         <c ca="left">
            <p>
               <b/>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>N0</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>N1</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>N2</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>N3</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>N4</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>N5</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>N6</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>N7</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>N8</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>N9</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Avg</b>
            </p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Mixture</p>
         </c>
         <c ca="center">
            <p>&#8722;7.42</p>
         </c>
         <c ca="center">
            <p>&#8722;8.27</p>
         </c>
         <c ca="center">
            <p>5.62</p>
         </c>
         <c ca="center">
            <p>0.80</p>
         </c>
         <c ca="center">
            <p>0.68</p>
         </c>
         <c ca="center">
            <p>&#8722;10.00</p>
         </c>
         <c ca="center">
            <p>&#8722;1.62</p>
         </c>
         <c ca="center">
            <p>3.85</p>
         </c>
         <c ca="center">
            <p>9.53</p>
         </c>
         <c ca="center">
            <p>2.75</p>
         </c>
         <c ca="center">
            <p>&#8722;0.41</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Hu-Wang</p>
         </c>
         <c ca="center">
            <p>16.01</p>
         </c>
         <c ca="center">
            <p>5.59</p>
         </c>
         <c ca="center">
            <p>14.27</p>
         </c>
         <c ca="center">
            <p>5.83</p>
         </c>
         <c ca="center">
            <p>8.25</p>
         </c>
         <c ca="center">
            <p>14.35</p>
         </c>
         <c ca="center">
            <p>15.53</p>
         </c>
         <c ca="center">
            <p>10.46</p>
         </c>
         <c ca="center">
            <p>14.06</p>
         </c>
         <c ca="center">
            <p>6.88</p>
         </c>
         <c ca="center">
            <p>11.12</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Proposed</p>
         </c>
         <c ca="center">
            <p>
               <b>17.95</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>6.32</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>17.76</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>6.51</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>9.44</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>14.99</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>17.45</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>11.97</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>15.27</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>8.33</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>12.60</b>
            </p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>TP Hu-Wang</p>
         </c>
         <c ca="center">
            <p>16.16</p>
         </c>
         <c ca="center">
            <p>5.64</p>
         </c>
         <c ca="center">
            <p>14.74</p>
         </c>
         <c ca="center">
            <p>6.43</p>
         </c>
         <c ca="center">
            <p>9.58</p>
         </c>
         <c ca="center">
            <p>14.44</p>
         </c>
         <c ca="center">
            <p>16.49</p>
         </c>
         <c ca="center">
            <p>11.14</p>
         </c>
         <c ca="center">
            <p>14.76</p>
         </c>
         <c ca="center">
            <p>7.39</p>
         </c>
         <c ca="center">
            <p>11.68</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>TP proposed</p>
         </c>
         <c ca="center">
            <p>17.95</p>
         </c>
         <c ca="center">
            <p>6.36</p>
         </c>
         <c ca="center">
            <p>17.79</p>
         </c>
         <c ca="center">
            <p>6.97</p>
         </c>
         <c ca="center">
            <p>9.60</p>
         </c>
         <c ca="center">
            <p>14.98</p>
         </c>
         <c ca="center">
            <p>17.43</p>
         </c>
         <c ca="center">
            <p>11.97</p>
         </c>
         <c ca="center">
            <p>15.30</p>
         </c>
         <c ca="center">
            <p>8.33</p>
         </c>
         <c ca="center">
            <p>12.67</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>IBM</p>
         </c>
         <c ca="center">
            <p>20.05</p>
         </c>
         <c ca="center">
            <p>6.84</p>
         </c>
         <c ca="center">
            <p>18.46</p>
         </c>
         <c ca="center">
            <p>7.97</p>
         </c>
         <c ca="center">
            <p>11.33</p>
         </c>
         <c ca="center">
            <p>15.75</p>
         </c>
         <c ca="center">
            <p>19.90</p>
         </c>
         <c ca="center">
            <p>13.86</p>
         </c>
         <c ca="center">
            <p>17.65</p>
         </c>
         <c ca="center">
            <p>11.21</p>
         </c>
         <c ca="center">
            <p>14.30</p>
         </c>
      </r>
   </tblbdy></tbl>
         <p>The proposed system is compared with the Hu and Wang model. Meanwhile, we also show the performance of ideal binary mask (IBM) which is obtained by calculating local SNR in each T-F unit and selecting units (SNR &gt; 0&#8201;dB) as the target. The SNR results of IBM are the upper limit of all CASA-based systems which employ "binary mask". Table <tblr tid="T1">1</tblr> gives the variety of SNR in which each value represents the average SNR of one kind intrusion mixed with ten target utterances and the last column shows the average SNR over all intrusions. As shown in Table <tblr tid="T1">1</tblr>, proposed system improves SNR for every intrusion and gets 13.01&#8201;dB improvement of overall mean against unprocessed mixture. Compared with results of the Hu and Wang model, proposed model enhances the separation results about 1.48&#8201;dB for overall mean. The highest enhancement of SNR happens on the mixtures of N2 and is about 3.50&#8201;dB higher than the Hu and Wang model. Other larger improvements (more than 1.0&#8201;dB) are obtained for harmonic sound (N4, N5, N7, N8, and N9) or tone-like sound (N0 and N6). While less improvements are obtained for broadband noises (e.g., N1 and N3).</p>
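         <p>The gains quoted from Table <tblr tid="T1">1</tblr> can be recomputed from its rows, as in the short sketch below. The per-intrusion values are copied from the Mixture, Hu-Wang, and Proposed rows of the table; the variable names and the helper mean are introduced only for this illustration.</p>
         <p><![CDATA[
```python
# SNR in dB for intrusions N0..N9, copied from Table 1.
mixture = [-7.42, -8.27, 5.62, 0.80, 0.68, -10.00, -1.62, 3.85, 9.53, 2.75]
hu_wang = [16.01, 5.59, 14.27, 5.83, 8.25, 14.35, 15.53, 10.46, 14.06, 6.88]
proposed = [17.95, 6.32, 17.76, 6.51, 9.44, 14.99, 17.45, 11.97, 15.27, 8.33]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Gains of the proposed system in the overall mean.
gain_over_mixture = mean(proposed) - mean(mixture)
gain_over_hu_wang = mean(proposed) - mean(hu_wang)

# Gain over the Hu and Wang model for each intrusion, rounded to 0.01 dB.
per_intrusion_gain = {
    f"N{i}": round(p - h, 2)
    for i, (p, h) in enumerate(zip(proposed, hu_wang))
}
```
]]></p>
         <p>Rounded to two decimals, the two overall gains come out as 13.01&#8201;dB and 1.48&#8201;dB. The per-intrusion gain is largest for N2 (3.49&#8201;dB) and stays below 1&#8201;dB for N1, N3, and N5.</p>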
         <p>To further compare the pitch detection algorithm and T-F unit grouping method separately, we replace the estimated pitch with true pitch (obtained on clean speech) for both the Hu and Wang model and proposed system. From Table <tblr tid="T1">1</tblr>, we can see that true pitch makes the Hu and Wang model enhance the SNR for 0.56 dB (from 11.12&#8201;dB to 11.68&#8201;dB). But the enhancement is tiny about 0.07 dB for the true pitch-based proposed system. And the only noticeable improvement is on N3 about 0.46&#8201;dB. The overall mean of SNR of the true pitch-based proposed system is about 1.00&#8201;dB higher than that of true pitch-based Hu and Wang model.</p>
         <p>Although conventional SNR is widely used, it does not reflect the related perceptual effects, such as auditory masking. As computational goal of CASA [<abbr bid="B24">24</abbr>], IBM directly corresponds to the auditory masking phenomenon. Recent psychoacoustic experiments have demonstrated that target speech reconstructed from the IBM can dramatically improve the intelligibility of speech masked by different types of noise, even in very noisy conditions [<abbr bid="B25">25</abbr>]. Li and Wang [<abbr bid="B26">26</abbr>] also systematically compared the performance of IBM and ideal ratio masks (IRM) and the results showed that IBM is optimal as computational goal in terms of SNR gain. Considering the advantages of IBM, we compute the SNR and PESQ score using the speeches reconstructed from IBM as the ground truth instead of clean speeches.</p>
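         <p>The ideal binary mask used here as the reference can be sketched as follows. The sketch assumes that the energies of the premixed target and intrusion are available for every T-F unit, so that a local SNR above 0&#8201;dB is equivalent to the target energy exceeding the intrusion energy. The function name ideal_binary_mask and the list-of-rows layout are assumptions made for the example.</p>
         <p><![CDATA[
```python
def ideal_binary_mask(target_energy, noise_energy):
    """Return a 0/1 mask with 1 where the local SNR exceeds 0 dB.

    Both inputs are lists of rows (one row per channel, one entry per
    frame) holding the energy of the premixed target and intrusion in
    each T-F unit. Units with equal energies (local SNR of exactly
    0 dB) are not selected.
    """
    return [[1 if t > n else 0 for t, n in zip(t_row, n_row)]
            for t_row, n_row in zip(target_energy, noise_energy)]
```
]]></p>
         <p>Speech resynthesized from the units marked 1 then serves as the ground truth in place of the clean speech.</p>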
         <p>Figure <figr fid="F11">11</figr> shows that the SNR of the proposed system is much higher than that of the unprocessed mixtures for all kinds of intrusions. Compared with the Hu and Wang model, the proposed system improves the SNR significantly for all kinds of intrusions except N3 and N4, where it drops slightly. To further assess the perceptual quality of the segregated speech, PESQ is employed as a measure. Figure <figr fid="F12">12</figr> shows the PESQ scores, computed with the IBM-reconstructed speech as the reference, of the unprocessed mixtures (white bars), the speech segregated by the proposed system (gray bars), and the speech segregated by the Hu and Wang model (black bars) for the ten kinds of intrusions. As Figure <figr fid="F12">12</figr> shows, the speech segregated by the proposed system obtains higher PESQ scores than both the unprocessed mixtures and the outputs of the Hu and Wang model on all ten kinds of intrusions (especially on N2, N7, N8, and N9).</p>
         <fig id="F11"><title><p>Figure 11</p></title><caption><p>SNR results using IBM as the ground truth.</p></caption><text>
   <p><b>SNR results using IBM as the ground truth.</b> White bars show the results from the unprocessed mixtures, black bars those from the Hu and Wang model, and gray bars those from the proposed system.</p>
</text><graphic file="1687-4722-2010-252374-11"/></fig>
         <fig id="F12"><title><p>Figure 12</p></title><caption><p>PESQ results using IBM as the ground truth.</p></caption><text>
   <p><b>PESQ results using IBM as the ground truth.</b> White bars show the results from the unprocessed mixtures, black bars those from the Hu and Wang model, and gray bars those from the proposed system.</p>
</text><graphic file="1687-4722-2010-252374-12"/></fig>
         <p>Compared with the Hu and Wang model, the largest SNR gain, about 4&#8201;dB, is obtained on N0 (pure tone). By analyzing the segregated speech, we found that the Hu and Wang model groups many target units into the background, mainly because some segments include both target units and interference units. In our system, such segments are divided into smaller ones by harmonic order, which leads to the significant SNR gain. For N2 (click noise), the SNR gain is also due to the segmentation (see Figure <figr fid="F8">8</figr>). The difference is that the Hu and Wang model groups many interference units into the foreground. It should be noted that the PESQ gains over the Hu and Wang model differ on these two noises: about 0.1 on N0 and 0.5 on N2. This implies that the second type of error, grouping intrusion units into the foreground, has a greater impact on perceptual speech quality.</p>
         <p>Figure <figr fid="F13">13</figr> shows the spectrograms of (a) a mixture of male and female speech, (b) the result of the IBM, (c) the result of the Hu and Wang model, and (d) the result of the proposed model. The result of the proposed model is closer to that of the IBM, whereas the result of the Hu and Wang model contains residual female speech.</p>
         <fig id="F13"><title><p>Figure 13</p></title><caption><p>Spectrogram comparison: (a) mixture; (b) results of IBM; (c) results of the Hu and Wang model; (d) results of the proposed model.</p></caption><text>
   <p><b>Spectrogram comparison: (a) mixture; (b) results of IBM; (c) results of the Hu and Wang model; (d) results of the proposed model.</b> The input signal is male speech mixed with female speech.</p>
</text><graphic file="1687-4722-2010-252374-13"/></fig>
      </sec>
      <sec>
         <st>
            <p>5. Discussion</p>
         </st>
         <p>In sound separation, it matters whether a unit is dominated by a resolved harmonic or by unresolved harmonics, and previous research has shown that this decision is very important. Resolved and unresolved harmonics are relative concepts that depend on the spacing between harmonics and on the resolution of the gammatone filterbank. Therefore, a unit cannot be classified by its channel frequency alone; a more reasonable approach is to examine the filter response within the unit. In previous research [<abbr bid="B15">15</abbr>], cross-channel correlation, which measures the similarity between the responses of two adjacent filters, is used to indicate whether the filters are responding to the same sound component. However, it is not reliable for some units, especially in the high-frequency region (as shown in Figure <figr fid="F8">8(a)</figr>). Hence, we use a more direct measurement, the carrier-to-envelope energy ratio, to help classify the units.</p>
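         <p>The cross-channel correlation mentioned above can be illustrated with a small sketch. It assumes the common formulation in which the ACFs of two adjacent channels are normalized to zero mean and unit variance and then averaged over their product; the helper names are hypothetical, and pure tones stand in for the filter responses.</p>
         <p>
```python
import numpy as np

def acf(x, max_lag):
    """Plain (unnormalized) autocorrelation for lags 0..max_lag-1."""
    n = len(x) - max_lag
    return np.array([np.dot(x[:n], x[lag:lag + n]) for lag in range(max_lag)])

def cross_channel_correlation(acf_a, acf_b):
    """Mean product of two ACFs after zero-mean, unit-variance normalization."""
    a = (acf_a - acf_a.mean()) / acf_a.std()
    b = (acf_b - acf_b.mean()) / acf_b.std()
    return float(np.mean(a * b))

fs = 16000.0
t = np.arange(520) / fs
# Two adjacent channels driven by the same 200 Hz component
# (different gain and phase), and a third driven by a 330 Hz component.
ch1 = np.sin(2 * np.pi * 200 * t)
ch2 = 0.7 * np.sin(2 * np.pi * 200 * t + 0.5)
ch3 = np.sin(2 * np.pi * 330 * t)

same = cross_channel_correlation(acf(ch1, 200), acf(ch2, 200))
different = cross_channel_correlation(acf(ch1, 200), acf(ch3, 200))
print(same, different)
```
         </p>
         <p>Channels that respond to the same component give a correlation close to 1, whereas channels that respond to different components give a much lower value.</p>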
         <p>The ACF reflects the period information of the signal in a unit. According to the "harmonicity" principle, each peak position could be a pitch period, but only one of them corresponds to the true pitch period. DHF reduces the number of candidate peaks by exploiting the fact that voiced speech has continuously numbered harmonics. In a noisy environment, this leads to errors when both neighbors of a harmonic are masked at the same time. However, we found that such cases are relatively rare.</p>
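         <p>The ambiguity among ACF peaks can be seen in a small synthetic example. The sketch below assumes a unit that responds only to the fourth harmonic (400&#8201;Hz) of a 100&#8201;Hz pitch at a 16&#8201;kHz sampling rate; these values and the helper names are chosen for illustration only, and the plain ACF stands in for the correlogram computation.</p>
         <p>
```python
import numpy as np

def acf(x, max_lag):
    """Plain (unnormalized) autocorrelation for lags 0..max_lag-1."""
    n = len(x) - max_lag
    return np.array([np.dot(x[:n], x[lag:lag + n]) for lag in range(max_lag)])

def peak_lags(acf_values):
    """Lags (excluding lag 0) at which the ACF has a local maximum."""
    peaks = []
    for i in range(1, len(acf_values) - 1):
        if acf_values[i] > acf_values[i - 1] and acf_values[i] >= acf_values[i + 1]:
            peaks.append(i)
    return peaks

fs = 16000
t = np.arange(520) / fs
unit_response = np.sin(2 * np.pi * 400 * t)  # fourth harmonic of a 100 Hz pitch
peaks = peak_lags(acf(unit_response, 200))
pitch_lag = fs // 100  # 160 samples, the true pitch period
print(peaks, pitch_lag in peaks)
```
         </p>
         <p>The ACF peaks at lags of 40, 80, 120, and 160 samples. All of them are candidate periods for this unit, but only the lag of 160 samples is the pitch period.</p>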
         <p>Pitch detection is another key stage of sound separation. Our algorithm uses only the longest resolved segment for pitch detection, which makes pitch tracking, otherwise a difficult problem, relatively easy. It should be pointed out that the robustness of the system may decrease when interfering sounds dominate the frequency regions of the resolved harmonics. However, resolved harmonics have larger energy than unresolved ones and are therefore more robust to noise. In addition, DHF is built on the idea of continuously numbered harmonics; for sounds without this feature, DHF is not appropriate.</p>
      </sec>
      <sec>
         <st>
            <p>6. Conclusions</p>
         </st>
         <p>In this paper, we propose the dynamic harmonic function (DHF), which is derived from the conventional correlogram. DHF provides a uniform representation for both resolved and unresolved units. Based on DHF, a pitch detection algorithm and a T-F unit grouping strategy are proposed. Results show that the proposed algorithm improves the SNR over the Hu and Wang model for a variety of noises.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgments</p>
            </st>
            <p>This work was supported in part by the China National Nature Science Foundation (no. 60675026, no. 60121302, and no. 90820011), the 863 China National High Technology Development Projects (no. 20060101Z4073, no. 2006AA01Z194), and the National Grand Fundamental Research 973 Program of China (no. 2004CB318105).</p>
         </sec>
      </ack>
      <refgrp><bibl id="B1"><aug><au><snm>Benesty</snm><fnm>J</fnm></au><au><snm>Makino</snm><fnm>S</fnm></au><au><snm>Chen</snm><fnm>J</fnm></au></aug><source>Speech Enhancement</source><publisher>Springer, Berlin, Germany</publisher><pubdate>2005</pubdate></bibl><bibl id="B2"><title><p>Estimation of speech embedded in a reverberant and noisy environment by independent component analysis and wavelets</p></title><aug><au><snm>Barros</snm><fnm>AK</fnm></au><au><snm>Rutkowski</snm><fnm>T</fnm></au><au><snm>Itakura</snm><fnm>F</fnm></au><au><snm>Ohnishi</snm><fnm>N</fnm></au></aug><source>IEEE Transactions on Neural Networks</source><pubdate>2002</pubdate><volume>13</volume><issue>4</issue><fpage>888</fpage><lpage>893</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1109/TNN.2002.1021889</pubid><pubid idtype="pmpid" link="fulltext">18244484</pubid></pubidlist></xrefbib></bibl><bibl id="B3"><aug><au><snm>Brandstein</snm><fnm>M</fnm></au><au><snm>Ward</snm><fnm>D</fnm></au></aug><source>Microphone Arrays: Signal Processing Techniques and Applications</source><publisher>Springer, Berlin, Germany</publisher><pubdate>2001</pubdate></bibl><bibl id="B4"><title><p>Suppression of acoustic noise in speech using spectral subtraction</p></title><aug><au><snm>Boll</snm><fnm>SF</fnm></au></aug><source>IEEE Trans Acoust Speech Signal Process</source><pubdate>1979</pubdate><volume>27</volume><issue>2</issue><fpage>113</fpage><lpage>120</lpage><xrefbib><pubid idtype="doi">10.1109/TASSP.1979.1163209</pubid></xrefbib></bibl><bibl id="B5"><title><p>Signal subspace approach for speech enhancement</p></title><aug><au><snm>Ephraim</snm><fnm>Y</fnm></au><au><snm>Van Trees</snm><fnm>HL</fnm></au></aug><source>IEEE Transactions on Speech and Audio Processing</source><pubdate>1995</pubdate><volume>3</volume><issue>4</issue><fpage>251</fpage><lpage>266</lpage><xrefbib><pubid idtype="doi">10.1109/89.397090</pubid></xrefbib></bibl><bibl 
id="B6"><aug><au><snm>Bregman</snm><fnm>AS</fnm></au></aug><source>Auditory Scene Analysis</source><publisher>MIT Press, Cambridge, Mass, USA</publisher><pubdate>1990</pubdate></bibl><bibl id="B7"><aug><au><snm>Wang</snm><fnm>DL</fnm></au><au><snm>Brown</snm><fnm>GJ</fnm></au></aug><source>Computational Auditory Scene Analysis: Principles, Algorithms and Applications</source><publisher>Wiley-IEEE Press, New York, NY, USA</publisher><pubdate>2006</pubdate></bibl><bibl id="B8"><aug><au><snm>Cooke</snm><fnm>MP</fnm></au></aug><source>Modeling Auditory Processing and Organization</source><publisher>Cambridge University Press, Cambridge, UK</publisher><pubdate>1993</pubdate></bibl><bibl id="B9"><title><p>Monaural speech segregation based on pitch tracking and amplitude modulation</p></title><aug><au><snm>Hu</snm><fnm>G</fnm></au><au><snm>Wang</snm><fnm>DL</fnm></au></aug><source>IEEE Transactions on Neural Networks</source><pubdate>2004</pubdate><volume>15</volume><issue>5</issue><fpage>1135</fpage><lpage>1150</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1109/TNN.2004.832812</pubid><pubid idtype="pmpid" link="fulltext">18238087</pubid></pubidlist></xrefbib></bibl><bibl id="B10"><title><p>A duplex theory of pitch perception</p></title><aug><au><snm>Licklider</snm><fnm>JCR</fnm></au></aug><source>Experientia</source><pubdate>1951</pubdate><volume>7</volume><issue>4</issue><fpage>128</fpage><lpage>134</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1007/BF02156143</pubid><pubid idtype="pmpid">14831572</pubid></pubidlist></xrefbib></bibl><bibl id="B11"><title><p>Computational models of neural auditory processing</p></title><aug><au><snm>Lyon</snm><fnm>RF</fnm></au></aug><source>Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP &apos;84)</source><fpage>41</fpage><lpage>44</lpage></bibl><bibl id="B12"><aug><au><snm>Weintraub</snm><fnm>M</fnm></au></aug><source>A theory and computational model of auditory monaural sound 
separation, Ph.D. dissertation</source><publisher>Dept. Elect. Eng., Stanford Univ., Stanford, Calif, USA</publisher><pubdate>1985</pubdate></bibl><bibl id="B13"><title><p>A perceptual pitch detector</p></title><aug><au><snm>Slaney</snm><fnm>M</fnm></au><au><snm>Lyon</snm><fnm>RF</fnm></au></aug><source>Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, April 1990</source><fpage>357</fpage><lpage>360</lpage></bibl><bibl id="B14"><title><p>Virtual pitch and phase sensitivity of a computer model of the auditory periphery. I: pitch identification</p></title><aug><au><snm>Meddis</snm><fnm>R</fnm></au><au><snm>Hewitt</snm><fnm>MJ</fnm></au></aug><source>Journal of the Acoustical Society of America</source><pubdate>1991</pubdate><volume>89</volume><issue>6</issue><fpage>2866</fpage><lpage>2882</lpage><xrefbib><pubid idtype="doi">10.1121/1.400725</pubid></xrefbib></bibl><bibl id="B15"><title><p>Separation of speech from interfering sounds based on oscillatory correlation</p></title><aug><au><snm>Wang</snm><fnm>DL</fnm></au><au><snm>Brown</snm><fnm>GJ</fnm></au></aug><source>IEEE Transactions on Neural Networks</source><pubdate>1999</pubdate><volume>10</volume><issue>3</issue><fpage>684</fpage><lpage>697</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1109/72.761727</pubid><pubid idtype="pmpid" link="fulltext">18252568</pubid></pubidlist></xrefbib></bibl><bibl id="B16"><title><p>A multipitch tracking algorithm for noisy speech</p></title><aug><au><snm>Wu</snm><fnm>M</fnm></au><au><snm>Wang</snm><fnm>DL</fnm></au><au><snm>Brown</snm><fnm>GJ</fnm></au></aug><source>IEEE Transactions on Speech and Audio Processing</source><pubdate>2003</pubdate><volume>11</volume><issue>3</issue><fpage>229</fpage><lpage>241</lpage><xrefbib><pubid idtype="doi">10.1109/TSA.2003.811539</pubid></xrefbib></bibl><bibl id="B17"><title><p>Pitch and the narrowed autocoincidence 
histogram</p></title><aug><au><snm>Cheveigne</snm><fnm>A</fnm></au></aug><source>Proceedings of the International Conference on Music Perception and Cognition, 1989, Kyoto, Japan</source><fpage>67</fpage><lpage>70</lpage></bibl><bibl id="B18"><title><p>Calculation of a "narrowed" autocorrelation function</p></title><aug><au><snm>Brown</snm><fnm>JC</fnm></au><au><snm>Puckette</snm><fnm>MS</fnm></au></aug><source>Journal of the Acoustical Society of America</source><pubdate>1989</pubdate><volume>85</volume><issue>4</issue><fpage>1595</fpage><lpage>1601</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1121/1.397363</pubid><pubid idtype="pmpid" link="fulltext">2708677</pubid></pubidlist></xrefbib></bibl><bibl id="B19"><title><p>A pitch detector based on a generalized correlation function</p></title><aug><au><snm>Xu</snm><fnm>JW</fnm></au><au><snm>Principe</snm><fnm>JC</fnm></au></aug><source>IEEE Transactions on Audio, Speech and Language Processing</source><pubdate>2008</pubdate><volume>16</volume><issue>8</issue><fpage>1420</fpage><lpage>1432</lpage></bibl><bibl id="B20"><title><p>On cochlear encoding: potentialities and limitations of the reverse-correlation technique</p></title><aug><au><snm>De Boer</snm><fnm>E</fnm></au><au><snm>De Jongh</snm><fnm>HR</fnm></au></aug><source>Journal of the Acoustical Society of America</source><pubdate>1978</pubdate><volume>63</volume><issue>1</issue><fpage>115</fpage><lpage>135</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1121/1.381704</pubid><pubid idtype="pmpid" link="fulltext">632404</pubid></pubidlist></xrefbib></bibl><bibl id="B21"><title><p>Simulation of auditory-neural transduction: further studies</p></title><aug><au><snm>Meddis</snm><fnm>R</fnm></au></aug><source>Journal of the Acoustical Society of America</source><pubdate>1988</pubdate><volume>83</volume><issue>3</issue><fpage>1056</fpage><lpage>1063</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1121/1.396050</pubid><pubid idtype="pmpid" 
link="fulltext">3356811</pubid></pubidlist></xrefbib></bibl><bibl id="B22"><title><p>A computationally efficient multipitch analysis model</p></title><aug><au><snm>Tolonen</snm><fnm>T</fnm></au><au><snm>Karjalainen</snm><fnm>M</fnm></au></aug><source>IEEE Transactions on Speech and Audio Processing</source><pubdate>2000</pubdate><volume>8</volume><issue>6</issue><fpage>708</fpage><lpage>716</lpage><xrefbib><pubid idtype="doi">10.1109/89.876309</pubid></xrefbib></bibl><bibl id="B23"><title><p>Monaural voiced speech segregation based on elaborate harmonic grouping strategy</p></title><aug><au><snm>Zhang</snm><fnm>X</fnm></au><au><snm>Liu</snm><fnm>W</fnm></au><au><snm>Li</snm><fnm>P</fnm></au><au><snm>Xu</snm><fnm>BO</fnm></au></aug><source>Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP &apos;09), April 2009</source><fpage>4661</fpage><lpage>4664</lpage></bibl><bibl id="B24"><title><p>On ideal binary masks as the computational goal of auditory scene analysis</p></title><aug><au><snm>Wang</snm><fnm>DL</fnm></au></aug><source>Speech Separation by Humans and Machines</source><publisher>Kluwer Academic Publishers, Boston, Mass, USA</publisher><editor>Divenyi P</editor><pubdate>2005</pubdate><fpage>181</fpage><lpage>197</lpage></bibl><bibl id="B25"><title><p>Factors influencing intelligibility of ideal binary-masked speech: implications for noise reduction</p></title><aug><au><snm>Li</snm><fnm>N</fnm></au><au><snm>Loizou</snm><fnm>PC</fnm></au></aug><source>Journal of the Acoustical Society of America</source><pubdate>2008</pubdate><volume>123</volume><issue>3</issue><fpage>1673</fpage><lpage>1682</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1121/1.2832617</pubid><pubid idtype="pmcid">2696360</pubid><pubid idtype="pmpid" link="fulltext">18345855</pubid></pubidlist></xrefbib></bibl><bibl id="B26"><title><p>On the optimality of ideal binary time-frequency 
masks</p></title><aug><au><snm>Li</snm><fnm>Y</fnm></au><au><snm>Wang</snm><fnm>D</fnm></au></aug><source>Speech Communication</source><pubdate>2009</pubdate><volume>51</volume><issue>3</issue><fpage>230</fpage><lpage>239</lpage><xrefbib><pubid idtype="doi">10.1016/j.specom.2008.09.001</pubid></xrefbib></bibl></refgrp>
   </bm>
</art>