Open Access Research

Biomimetic multi-resolution analysis for robust speaker recognition

Sridhar Krishna Nemala1, Dmitry N Zotkin2, Ramani Duraiswami2 and Mounya Elhilali1*

Author Affiliations

1 Department of Electrical and Computer Engineering, Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, USA

2 Institute for Advanced Computer Studies, University of Maryland, College Park, MD, USA

EURASIP Journal on Audio, Speech, and Music Processing 2012, 2012:22  doi:10.1186/1687-4722-2012-22


The electronic version of this article is the complete one and can be found online at: http://asmp.eurasipjournals.com/content/2012/1/22


Received: 26 July 2011
Accepted: 17 August 2012
Published: 7 September 2012

© 2012 Nemala et al.; licensee Springer.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Humans exhibit a remarkable ability to reliably classify sound sources in the environment even in the presence of high levels of noise. In contrast, most engineering systems suffer a drastic drop in performance when speech signals are corrupted with channel or background distortions. Our brains are equipped with elaborate machinery for speech analysis and feature extraction, which holds great lessons for improving the performance of automatic speech processing systems under adverse conditions. The work presented here explores a biologically-motivated multi-resolution speaker information representation obtained by performing an intricate yet computationally-efficient analysis of the information-rich spectro-temporal attributes of the speech signal. We evaluate the proposed features in a speaker verification task performed on NIST SRE 2010 data. The biomimetic approach yields significant robustness in the presence of non-stationary noise and reverberation, offering a new framework for deriving reliable features for speaker recognition and speech processing.

Introduction

In addition to the intended message, the human voice carries the unique imprint of a speaker. Just like fingerprints and faces, voice prints are biometric markers with tremendous potential for forensic, military, and commercial applications [1]. However, despite enormous advances in computing technology over the last few decades, automatic speaker verification (ASV) systems still rely heavily on training data collected in controlled environments, and most systems face a rapid degradation in performance when operating under previously unseen conditions (e.g. channel mismatch, environmental noise, or reverberation). In contrast, human perception of speech and the ability to identify sound sources (including voices) are quite remarkable even at relatively high distortion levels [2]. Consequently, the pursuit of human-like recognition capabilities has spurred great interest in understanding how humans perceive and process speech signals.

One of the intriguing processes taking place in the central auditory system involves ensembles of neurons with variable tuning to spectral profiles of acoustic signals. In addition to the frequency (tonotopic) organization emerging as early as the cochlea, neurons in the central auditory system (specifically in the midbrain and more prominently in the auditory cortex) exhibit tuning to a variety of filter bandwidths and shapes [3]. This elegant neural architecture provides a detailed multi-resolution analysis of the spectral sound profile, which is presumably relevant to speech and speaker recognition. Only a few studies so far have attempted to use this cortical representation in speech processing, yielding some improvements for automatic speech recognition at the expense of substantial computational complexity [4,5]. To the best of our knowledge, no similar work has been done in ASV.

In the present report, we explore the use of a multi-resolution analysis for robust speaker verification. Our representation is simple, effective, and computationally-efficient. The proposed scheme is carefully optimized to be particularly sensitive to the information-rich spectro-temporal attributes of the signal while maintaining robustness to unseen noise distortions. The choice of model parameters builds on our current knowledge of psychophysical principles of speech perception in noise [6,7] complemented with a statistical analysis of the dependencies between spectral details of the message and speaker information. We evaluate the proposed features in an ASV system and compare it against one of the best-performing systems in the NIST 2010 SRE evaluation [8] under detrimental conditions such as white noise, non-stationary additive noise, and reverberation.

The following section describes details of the proposed multi-resolution spectro-temporal model. It is followed by an analysis that motivates the choice of model parameters to maximize speaker information retention. Next, we describe the experimental setup and results. We finish with a discussion of these results and comment on potential extensions towards achieving further noise robustness.

The biomimetic multi-resolution analysis

An overview of the processing chain described in this section is presented in Figure 1.

Figure 1. An outline of the cortical features extraction algorithm. A schematic diagram of the algorithm that transforms a speech waveform into a sequence of cortical feature vectors.

Peripheral analysis

The speech signal is processed through a pre-emphasis stage (implemented as a first-order high pass filter with pre-emphasis coefficient 0.97), and a time-frequency auditory spectrogram is generated using a biomimetic sound processing model described in detail in [9] and briefly summarized here (Equation 1). First, the signal s(t) undergoes a cochlear frequency analysis modeled by a bank of 128 constant-Q (Q=4) highly asymmetric bandpass filters h(t;f) equally spaced over the span of 5 1/3 octaves on a logarithmic frequency axis. The filterbank output is a spatiotemporal pattern of cochlea basilar membrane displacements ycoch(t,f) over 128 channels. Next, a lateral inhibitory network detects discontinuities in the responses across the tonotopic (frequency) axis, resulting in further filterbank frequency selectivity enhancement. This step is modeled as a first-order differentiation operation across the channel array followed by a half-wave rectifier and a short-term integrator. The temporal integration window is given by μ(t;τ)=e^(−t/τ)u(t) with time constant τ=10 ms mimicking the further loss of phase-locking observed in the midbrain. This time constant controls the frame rate of the spectral vectors. Finally, a nonlinear cubic root compression of the spectrum is performed, resulting in an auditory spectrogram y(t,f):

[Equation (1) is not preserved in this copy; MathML source: http://asmp.eurasipjournals.com/content/2012/1/22/mathml/M1]

where ⊗t represents convolution with respect to time. The choice of the auditory spectrogram is motivated by its neurophysiological foundation as well as its proven self-normalization and robustness properties (see [10] for full details).
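The stages after the cochlear filterbank can be sketched in a few lines. This is only an illustration of the prose description, not the authors' implementation: the function names are hypothetical, the filterbank outputs are assumed to be given as an array (the shapes of the filters h(t;f) are not specified here), and the integrator is assumed to be a discrete one-pole approximation of e^(−t/τ)u(t).

```python
import numpy as np

def pre_emphasis(signal, coeff=0.97):
    """First-order high-pass filter: y[n] = x[n] - coeff * x[n-1]."""
    signal = np.asarray(signal, dtype=float)
    out = signal.copy()
    out[1:] -= coeff * signal[:-1]
    return out

def lateral_inhibition_and_compress(y_coch, fs, tau=0.010):
    """Turn filterbank outputs into an (undecimated) auditory spectrogram.

    y_coch: array (n_channels, n_samples) of cochlear filter outputs.
    fs:     sampling rate in Hz; tau: integration time constant in seconds.
    """
    y_coch = np.asarray(y_coch, dtype=float)
    # First-order difference across the tonotopic (channel) axis.
    # The first channel has no lower neighbour, so its difference is zero.
    diff = np.diff(y_coch, axis=0, prepend=y_coch[:1])
    # Half-wave rectification.
    rect = np.maximum(diff, 0.0)
    # Leaky integration: convolution with exp(-n / (fs * tau)) u[n],
    # implemented as a one-pole recursion (assumed discretisation).
    alpha = np.exp(-1.0 / (fs * tau))
    out = np.empty_like(rect)
    acc = np.zeros(rect.shape[0])
    for n in range(rect.shape[1]):
        acc = alpha * acc + rect[:, n]
        out[:, n] = acc
    # Cubic-root compression.
    return np.cbrt(out)
```

The sketch keeps the full sampling rate; reducing the output to one spectral vector per 10 ms frame would be a separate decimation step.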

Spectral cortical analysis

The auditory spectrogram is processed further in order to capture the spectral details present in each spectral slice. The processing is based on neurophysiological findings that neurons in the central auditory pathway are tuned not only to frequencies but also to spectral shapes, in particular to peaks of various widths on the log-frequency axis [3,11,12]. The spectral width is characterized by a parameter called scale and is measured in cycles per octave, or CPO. Physiological data indicates that auditory cortex neurons are highly scale-selective, thus expanding the cochlear one-dimensional tonotopic axis onto a two-dimensional sheet that explicitly encodes tonotopy as well as spectral shape details (see Figures 1 and 2).

Figure 2. Details of the speech spectral analysis. (a) The speech spectrogram is analyzed separately at each time instant. Each spectrogram slice is filtered through a bandpass filter HS(Ω;Ωc) parameterized by Ωc. The ∗ operator signifies the filtering operation. Four such filtering operations yield four views of the same spectral slice; each view highlights different details about the spectrum, notably formant peaks and harmonic structure. (b) Cortical features for clean and noisy versions of one phoneme /ow/. The plots show magnitude as a function of frequency and scale. For visualization, the discrete image points have been interpolated in MATLAB using a bicubic interpolation routine. Notice the consistency of formant peaks around 1 and 4 kHz and of harmonic energies at 2 CPO and 4 CPO despite the additive noise distortion. (c) Cortical features for different types of additive noise. Note that the patterns exhibited are quite different. Subtle peaks due to harmonicity and formant structure of human speech can be seen in the left panel (babble noise).

The cortical analysis is implemented using a bank of modulation filters operating in the Fourier domain. The algorithm processes each data frame individually. The Fourier transform of each spectral slice y(t0,f) is multiplied by a modulation filter HS(Ω;Ωc) that is tuned to spectral features of scale Ωc. The filtering operates on the magnitude of the signal. After filtering, the inverse Fourier transform is performed and the real part is taken as the new filtered slice. This process is then repeated with a number of different Ωc, yielding a number of filtered spectrograms y(t,f;Ωc), each with features of scale Ωc emphasized (see Figure 1). This set of spectrograms constitutes the spectral cortical representation of the sound.

The filter HS(Ω;Ωc) is defined as

[Equation (2) is not preserved in this copy; MathML source: http://asmp.eurasipjournals.com/content/2012/1/22/mathml/M2]

where Ωmax is the highest spectral modulation frequency (set at 12 CPO given our spectrogram resolution of 24 channels per octave).
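The Fourier-domain procedure (transform a slice, multiply by a scale-tuned gain, invert, keep the real part) might look roughly as follows. The gain function `placeholder_scale_filter` is a hypothetical stand-in, a Gaussian on a log2 scale axis, and is not the filter HS of Equation 2. The statement that filtering "operates on the magnitude" is read here as meaning a real-valued gain that leaves phases untouched, which is also an assumption.

```python
import numpy as np

CHANNELS_PER_OCTAVE = 24  # spectrogram resolution stated in the text

def scale_axis(n_channels, channels_per_octave=CHANNELS_PER_OCTAVE):
    """Spectral modulation frequency (cycles/octave) of each FFT bin."""
    # Channel spacing on the log-frequency axis is 1/24 octave,
    # so the highest representable scale is 12 CPO.
    return np.abs(np.fft.fftfreq(n_channels, d=1.0 / channels_per_octave))

def placeholder_scale_filter(omega, omega_c):
    """Hypothetical real-valued bandpass gain peaking at omega_c.

    Stand-in only: this is NOT the paper's H_S from Equation 2.
    """
    safe = np.maximum(omega, 1e-6)  # avoid log2(0) at the DC bin
    return np.exp(-0.5 * np.log2(safe / omega_c) ** 2)

def filter_slice(spectral_slice, omega_c, gain=placeholder_scale_filter):
    """Filter one spectrogram slice y(t0, f) at scale omega_c."""
    spectral_slice = np.asarray(spectral_slice, dtype=float)
    omega = scale_axis(spectral_slice.size)
    spectrum = np.fft.fft(spectral_slice)
    # A real gain rescales magnitudes and leaves phases unchanged.
    filtered = spectrum * gain(omega, omega_c)
    return np.fft.ifft(filtered).real

def cortical_slices(spectral_slice, scales=(0.5, 1.0, 2.0, 4.0)):
    """Stack the filtered views of one slice, one row per scale."""
    return np.stack([filter_slice(spectral_slice, s) for s in scales])
```

Passing a different `gain` callable swaps in another filter shape without touching the rest of the pipeline.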

Choice of spectral parameters

The set of scales Ωc is chosen by dividing the spectral modulation axis into equal energy regions using a training corpus (TIMIT database [13]) as described below. Define the average spectral modulation profile as the ensemble mean of the magnitude Fourier transform of the spectral slice y(t0,f) averaged over all times T and over the entire speech corpus Ψ. The resulting ensemble profile (shown in Figure 3a) is then divided into M equal energy regions Γk:

[Equation (3) is not preserved in this copy; MathML source: http://asmp.eurasipjournals.com/content/2012/1/22/mathml/M4]

where Ωk and Ωk+1 denote the lower and upper cutoffs for the kth band, Ω1=0, and ΩM=4 [a]. This sampling scheme ensures that the high energy regions are sampled more densely, which has the dual advantage of sampling the given modulation space with a relatively small set of scales and emphasizing high-energy signal components, which are presumably noise-robust. Setting M=5 results in cutoffs at {0.18,0.59,1.34,2.36,4}, which are approximated to the nearest log-scale as Ωc={0.25,0.5,1.0,2.0,4.0}. Finally, in order to put less emphasis on message-dominant regions of the spectrum, we drop the 0.25 CPO filter, which carries mostly articulatory and formant-specific information relevant to the speech message (analysis presented in the next section). The remaining set of Ωc={0.5,1.0,2.0,4.0} is found to be a good tradeoff between computational complexity and system performance.
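The equal-energy partition can be illustrated with a short sketch. It assumes the average modulation profile is available as samples on an increasing grid of scales and that "energy" is the area under that profile; the function name and the trapezoidal integration are choices made for this example only.

```python
import numpy as np

def equal_energy_cutoffs(omega, profile, n_bands):
    """Upper cutoffs that split `profile` into n_bands equal-area regions.

    omega:   increasing spectral modulation frequencies (CPO)
    profile: non-negative average modulation profile sampled at omega
    """
    omega = np.asarray(omega, dtype=float)
    profile = np.asarray(profile, dtype=float)
    # Cumulative area under the profile (trapezoidal rule), starting at 0.
    steps = 0.5 * (profile[1:] + profile[:-1]) * np.diff(omega)
    cumulative = np.concatenate(([0.0], np.cumsum(steps)))
    # Each band should hold 1/n_bands of the total area.
    targets = cumulative[-1] * np.arange(1, n_bands + 1) / n_bands
    # Invert the cumulative curve to find where each target is reached.
    return np.interp(targets, cumulative, omega)
```

For a profile that decays with scale, the returned cutoffs crowd toward low scales, which is the denser sampling of high-energy regions described above.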

Figure 3. Speech signal spectral analysis. (a) Average spectral modulation profile; (b) Top panel: MI between feature representation and speech message as a function of scale. Bottom panel: MI between feature representation and speaker identity as a function of scale.

Temporal filtration

In this stage, the spectral cortical features are processed through a bandpass temporal modulation filter to remove information that is believed to be mostly irrelevant. It was shown in [14] that the neurons in the auditory cortex are mostly sensitive to the modulation rates between 0.5 and 12 Hz and that the same modulation range represents the information crucial for speech comprehension [7]. Accordingly, the filtering is performed by multiplying the Fourier transform of the time sequence of each spectral feature by a bandpass filter HT(w;wl,wh):

[Equation (4) is not preserved in this copy; MathML source: http://asmp.eurasipjournals.com/content/2012/1/22/mathml/M6]

where wl=0.5 Hz, wh=12.0 Hz, wmax=1/(2tf), and tf=10 ms (the frame length). After filtering in Fourier domain, the inverse Fourier transform is performed and the real part of the output forms the temporally filtered spectral cortical representation of the sound yw(t,f;Ωc). This operation is performed on an utterance by utterance basis.
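A minimal sketch of utterance-level temporal filtering in the Fourier domain is given below. It assumes an ideal (brick-wall) passband between wl and wh as a stand-in for HT, whose actual shape is defined by Equation 4 and is not reproduced by this code; the function name is hypothetical.

```python
import numpy as np

def temporal_bandpass(features, frame_period=0.010, low_hz=0.5, high_hz=12.0):
    """Band-limit each feature trajectory of one utterance.

    features: array (n_frames, n_features), one row per 10 ms frame.
    The brick-wall gain is an assumed stand-in for the paper's H_T.
    """
    features = np.asarray(features, dtype=float)
    n_frames = features.shape[0]
    # Modulation rate (Hz) of each FFT bin; the maximum is 1 / (2 * frame_period).
    rates = np.abs(np.fft.fftfreq(n_frames, d=frame_period))
    gain = ((rates >= low_hz) & (rates <= high_hz)).astype(float)
    spectrum = np.fft.fft(features, axis=0)
    # Multiply in the Fourier domain, invert, and keep the real part.
    return np.fft.ifft(spectrum * gain[:, None], axis=0).real
```

Because the DC bin lies outside the passband, each trajectory also loses its utterance-level mean in this sketch.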

Cortical features

To reduce computational complexity and to allow use of state-of-the-art speaker verification machinery (which generally expects a relatively low-dimensional input), the spectral cortical representation is downsampled in frequency by a factor of 4 (Figure 1). The resulting feature representation has a dimensionality of 128 (32 auditory frequency channels multiplied by four scales used for analysis). The features are then normalized to zero mean and unit variance for each utterance, yielding the reduced set of spectrograms. Principal component analysis is used to further reduce the feature dimensionality to 19. This number is chosen for consistency with the dimensionality of the standard Mel-Frequency Cepstral Coefficients (MFCC) feature set used for speaker recognition. The reduced features, along with their first- and second-order derivatives, form the final 57-dimensional cortical feature vector used for the speaker verification task.
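The dimensionality bookkeeping (4 scales × 32 channels = 128, reduced to 19, then 19 × 3 = 57) can be traced in a sketch. Several details are assumptions made for illustration: downsampling is done by keeping every fourth channel, the PCA basis is obtained from an SVD of training data, and derivatives are approximated with `np.gradient`. The text does not specify any of these choices, and the function names are hypothetical.

```python
import numpy as np

def fit_pca(data, n_components=19):
    """Return (basis, mean) with basis of shape (n_dims, n_components)."""
    mean = data.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return vt[:n_components].T, mean

def assemble_features(cortical, pca_basis, pca_mean):
    """Build 57-dimensional vectors from one utterance.

    cortical: array (n_frames, n_scales, n_channels), e.g. (T, 4, 128).
    """
    decimated = cortical[:, :, ::4]                 # 128 -> 32 channels (assumed decimation)
    flat = decimated.reshape(decimated.shape[0], -1)  # (T, 128)
    # Per-utterance normalization to zero mean and unit variance.
    flat = (flat - flat.mean(axis=0)) / (flat.std(axis=0) + 1e-12)
    reduced = (flat - pca_mean) @ pca_basis         # (T, 19)
    delta = np.gradient(reduced, axis=0)            # first-order derivative (assumed estimator)
    delta2 = np.gradient(delta, axis=0)             # second-order derivative
    return np.hstack([reduced, delta, delta2])      # (T, 57)
```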

Speech information versus speaker information

The speech signal carries both speech message and speaker identity information in distinct yet overlapping components. Separation of these elements is a non-trivial task in general. In the multi-resolution framework presented above, the broadest filters (0.25 and 0.5 CPO) capture primarily the overall spectral profile and formant peaks, while the others (1, 2, and 4 CPO) reflect narrower spectral details such as harmonics and subharmonic structure. In order to select a set of scales (Ωc) that are most relevant for the speaker recognition task, we analyze the mutual information (MI) between the feature vector (X), the speech message (Y1), and the speaker identity (Y2). The MI is a measure of the statistical dependence between random variables [15] and is defined for two discrete random variables X and Y as

\[
I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} \tag{5}
\]

To estimate the MI, the continuous feature vector is quantized by dividing its support into cells of equal volume. To characterize the speech message, phoneme labels from the TIMIT corpus are first divided into four broad phoneme classes. The variable Y1 thus takes four discrete values representing the phoneme categories: vowels, stops, fricatives, and nasals. The average MI (taken as the mean MI across all the frequency bands for a given scale) between the feature vector and the speech message is shown in Figure 3b (top) as a function of scale. For the speaker identity MI test, the TIMIT “sa1” speech utterance (She had your dark suit in greasy wash water all year) spoken by 100 different subjects is used; thus, Y2 takes 100 discrete values representing the speaker. The average MI between the feature vector and the speaker identity is shown in Figure 3b (bottom), again as a function of scale [b].
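A plug-in estimate of the MI between a quantized feature and a discrete label can be sketched as follows. The equal-width quantizer, the choice of base-2 logarithm (bits), and the function names are assumptions for illustration; the text does not state the number of cells or the log base used.

```python
import math
from collections import Counter

def quantize(values, n_cells, lo, hi):
    """Map continuous values to equal-width cell indices 0..n_cells-1."""
    width = (hi - lo) / n_cells
    return [min(n_cells - 1, max(0, int((v - lo) / width))) for v in values]

def mutual_information(xs, ys):
    """Plug-in MI estimate (in bits) from paired discrete observations."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x,y) / (p(x) p(y)) expressed with raw counts.
        mi += (count / n) * math.log2(count * n / (px[x] * py[y]))
    return mi
```

With `xs` holding the quantized values of one frequency channel at one scale and `ys` holding phoneme-class or speaker labels, averaging the result over channels gives a per-scale figure comparable in spirit to the curves in Figure 3b.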

Notice that while the lower scale (0.25 CPO) clearly provides significantly more information about the underlying linguistic message, the MI peak in Figure 3b (bottom) is centered at 1 CPO, highlighting the significance of pitch and harmonically-related frequency channels in representing speaker-specific information. In order to put less emphasis on message-carrying features of the speech signal, we drop the 0.25 CPO filter at the feature encoding stage for our ASV system and choose Ωc={0.5,1.0,2.0,4.0} CPO [c].

Experiments and results

Recognition setup

Text-independent speaker verification experiments are conducted on the NIST 2010 speaker recognition evaluation (SRE) data set [8]. The extended core task of the evaluation involves 6.9 million trials broken down into nine common conditions reflecting a variety of channel mismatch scenarios [8] (see Table 1).

Table 1. List of conditions for NIST 2010 extended core task

The front end of the implemented ASV system uses either the 57-dimensional MFCC feature vector or the 57-dimensional cortical feature vector. The MFCC feature vector is computed by invoking RASTAMAT “melfcc” function with ‘numcep’ parameter set to 20, dropping the first (energy) component of the output, and appending first- and second-order derivatives of the resultant feature vector. The cortical feature vector is obtained as described in the previous sections. For fair comparison between MFCC and cortical features, MFCC was supplemented with mean subtraction, variance normalization, and RASTA filtering [16] applied at the utterance level. Such processing parallels the temporal filtering and normalization performed on cortical features. A combination of ASR output provided by NIST and an in-house energy-based VAD system is used to drop all non-speech frames from input data.

The back-end is a robust state-of-the-art UBM-GMM system [17,18]. In a UBM-GMM system, each speaker’s distribution of feature vectors is modeled as a mixture of Gaussians, forming a Gaussian mixture speaker model (GMSM). In addition, a universal background model (UBM) defines a “generic” speaker. The UBM typically has hundreds of thousands of parameters and is trained on a very large amount of data (hundreds of hours of speech), which should include speech produced by a large number of individual speakers (in our case, the 2048-center diagonal-covariance UBM is trained on NIST SRE 2004, 2005, 2006, and 2008; Fisher; Switchboard-2; and Switchboard-Cellular databases). As the amount of speech available per individual speaker is typically much less than required to train the speaker model from scratch, the GMSM is produced by adapting UBM means so that the resulting model best describes the available speaker data. Finally, given the UBM, the candidate GMSM, and the audio file, the system extracts the feature vectors from the audio file and computes the log-likelihoods of these feature vectors belonging to the GMSM and to the UBM. The difference between these log-likelihoods constitutes the output score for this particular trial.
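The scoring step (log-likelihood under the speaker model minus log-likelihood under the UBM) can be written as a small sketch. It assumes diagonal covariances, that the adapted speaker model shares the UBM weights and variances (only the means are adapted), and that log-likelihoods are averaged per frame. The averaging and the function names are assumptions, and the sketch omits the JFA compensation and score normalization used in the full system.

```python
import numpy as np

def gmm_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM.

    frames: (T, D); weights: (K,); means and variances: (K, D).
    """
    diff = frames[:, None, :] - means[None, :, :]                      # (T, K, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)  # (K,)
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)  # (T, K)
    log_mix = np.log(weights) + log_comp
    # Log-sum-exp over mixture components for numerical stability.
    peak = log_mix.max(axis=1, keepdims=True)
    per_frame = peak[:, 0] + np.log(np.exp(log_mix - peak).sum(axis=1))
    return per_frame.mean()

def verification_score(frames, speaker_means, ubm):
    """Log-likelihood ratio between a mean-adapted speaker model and the UBM."""
    weights, ubm_means, variances = ubm
    return (gmm_log_likelihood(frames, weights, speaker_means, variances)
            - gmm_log_likelihood(frames, weights, ubm_means, variances))
```

A positive score means the frames are better explained by the speaker model than by the generic background model.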

Our ASV system additionally employs the technique known as joint factor analysis (JFA) [19,20]. JFA enables channel variability compensation by offsetting the channel effects, and more robust speaker model estimation by using a more informative prior on the speaker model distribution. To use JFA in the described framework, an alternative representation of the speaker model, a single vector Z (“supervector”), is formed by concatenating all GMSM means. JFA is trained in advance on a large annotated collection of audio files to learn the channel subspace (the basis over which Z preferentially varies when the same speaker’s voice is presented over different channels) and the speaker subspace (the basis over which Z preferentially varies when different speakers are presented over the same channel). In our system, the dimensionalities of the speaker subspace and of the channel subspace are 300 and 150, respectively. Then, when processing previously unseen data, the components of inter-speaker differences attributable to the speaker are emphasized and those attributable to the channel are canceled. This is done by projecting the corresponding supervectors into the speaker and channel subspaces, using the speaker subspace projection of Z to modify the GMSM, using the channel subspace projection of Z to modify the UBM, and performing scoring with these modified GMSM and UBM. Also, as the log-likelihood calculation is expensive, our system uses an approximation to it based on an inner product [20].

Finally, the obtained scores are subject to ZT-normalization [21], and the decision threshold minimizing equal error rate (EER) is chosen (separately for each condition).
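For readers unfamiliar with the EER, the following sketch finds the operating point where the miss rate and the false alarm rate are closest and reports their average. It is written for clarity with a quadratic-time scan (a sorted sweep would be needed for millions of trials), takes already-normalized scores as input, and uses a hypothetical function name.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Return (eer, threshold) at the point where miss and false-alarm rates are closest."""
    best = None
    for thr in sorted(set(target_scores) | set(impostor_scores)):
        # A trial is accepted when its score is >= thr.
        p_miss = sum(s < thr for s in target_scores) / len(target_scores)
        p_fa = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        gap = abs(p_miss - p_fa)
        if best is None or gap < best[0]:
            best = (gap, (p_miss + p_fa) / 2.0, thr)
    return best[1], best[2]
```

For example, target scores `[0.9, 0.8, 0.7, 0.3]` and impostor scores `[0.6, 0.4, 0.2, 0.1]` give an EER of 0.25 at threshold 0.6, where one of four targets is rejected and one of four impostors is accepted.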

Noise conditions

Every trial in NIST SRE 2010 consists of computing the matching score between a speaker model and an audio file. To evaluate the noise robustness of the proposed cortical features, several distorted versions of these audio files are created by adding different types of noise reflecting a variety of real world scenarios:

• White noise at signal-to-noise ratio (SNR) levels from 24 to 0 dB in 6 dB steps;

• Babble noise (from Aurora database [22]), same SNR levels;

• Subway noise (from Aurora database [22]), same SNR levels;

• Simulated reverberation with RT60 from 200 to 1,200 ms in steps of 200 ms.
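The additive-noise conditions above amount to scaling a noise recording to a target SNR before adding it to the test file. A minimal sketch follows; it assumes the SNR is defined from average power over the whole file and that a short noise recording is looped to the length of the speech. Neither detail is specified in the text, the function name is hypothetical, and reverberation is not covered.

```python
import numpy as np

SNR_LEVELS_DB = list(range(24, -1, -6))  # 24, 18, 12, 6, 0 dB

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals snr_db, then add it."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    # Loop or trim the noise recording to match the speech length (assumed policy).
    reps = int(np.ceil(speech.size / noise.size))
    noise = np.tile(noise, reps)[: speech.size]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose the gain so that 10*log10(speech_power / (gain^2 * noise_power)) == snr_db.
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```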

It is important to mention that all training (UBM, JFA, and speaker model training) is done exclusively on clean data, and only the test audio files are corrupted. Note also that the train-test mismatch created by addition of noise/reverberation is superimposed on the train-test mismatch inherent to the SRE 2010 data.

Results

Figure 4 shows the speaker verification performance in terms of EER for the cortical features and for the MFCC features as a function of noise type/strength and trial condition. The results clearly demonstrate that the proposed cortical features provide substantially lower EER than the MFCC as noise level increases, indicating their robustness. The average performance for each noise type and trial condition is shown in Table 2. On average (across all conditions and all noise types), the cortical-features-based system yields 15.9% relative EER improvement over the robust state-of-the-art MFCC system. It is worth noting that the proposed approach is outperformed by the MFCC-based approach in only 4 out of the 36 cases. Because the proposed metric incorporates both a biomimetic auditory spectrogram previously shown to exhibit some noise-robustness characteristics [10] as well as multiresolution decomposition, we investigated further the contribution of both components in the reported improvements. We tested the system using the auditory spectrogram alone or an adaptation of the auditory spectrogram described here, coupled with a cepstral transformation. Neither system performed as well as the proposed multiresolution decomposition, hence strengthening the claim that our proposed multiresolution analysis is indeed responsible for the performance improvements shown in Table 2.

Figure 4. Evaluation results. Performance of the proposed cortical features (red filled squares) and enhanced MFCC features (black open circles) on NIST SRE 2010 “extended core” database as a function of noise level, noise type, and condition. In each subplot, the noise level is shown on X axis and the EER (in percent) is on Y axis. Columns and rows of subplots belong to the same noise type and to the same condition, respectively. Note the Y-axis ranges are not the same in the subplots.

Table 2. Average ASV performance (EER, %) as a function of noise type and condition

In some ASV applications, metrics other than EER may be more relevant. For example, in certain biometric speaker verification systems the key requirement is a low false alarm rate. We present our results here in terms of two additional metrics more suitable in such cases, namely the Miss-10 and quadratic DCF (decision cost function) metrics. These two metrics were used in the NIST 2011 IARPA BEST program SRE [23]. The Miss-10 metric is defined as the false alarm rate PFA obtained when the decision threshold is set such that the miss rate PMiss=10%, and the quadratic DCF is defined as

[Equation (6) is not preserved in this copy; MathML source: http://asmp.eurasipjournals.com/content/2012/1/22/mathml/M9]

with the parameter values CMiss=100, CFA=10, and Ptarget=0.01.
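The Miss-10 definition given in the prose can be turned into a short sketch. It assumes distinct target scores and accepts a trial when its score is at or above the threshold; the function name is hypothetical. Only Miss-10 is sketched here, since its definition is fully stated in the text.

```python
def miss10(target_scores, impostor_scores, miss_rate=0.10):
    """False-alarm rate at the threshold where the miss rate reaches `miss_rate`."""
    targets = sorted(target_scores)
    # Number of target trials that may be rejected at the requested miss rate.
    n_miss = int(miss_rate * len(targets) + 1e-9)
    # Accept scores >= thr; exactly n_miss targets fall below it (distinct scores assumed).
    thr = targets[n_miss]
    return sum(s >= thr for s in impostor_scores) / len(impostor_scores)
```

With ten target scores 1 to 10, the threshold lands on 2, so one target (10%) is missed, and impostor scores `[0, 0.5, 1.5, 2.5, 3]` give a false alarm rate of 0.4.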

The average verification performance for each noise type using the Miss-10 and quadratic DCF metrics is shown in Tables 3 and 4, respectively. As seen from the data, in the low false alarm region the proposed cortical features outperform the robust state-of-the-art MFCC system by an even larger margin: 28.8% relative using the Miss-10 metric and 22.6% relative using the quadratic DCF metric.

Table 3. Average ASV performance (Miss-10 metric, %) as a function of noise type and condition

Table 4. Average ASV performance (quadratic DCF metric) as a function of noise type and condition

Discussion and conclusions

In this report, we explore the applicability of a multi-resolution analysis of speech signals to ASV. This framework maps the speech signal onto a rich feature space, highlighting and separating information about the glottal excitation signal, glottal shape, vocal tract geometry, and articulatory configuration (as each of these elements is an underlying factor for features of different width located in different areas on the log-frequency axis; see e.g. [24]). The cortical representation can be viewed as a “local” variant (w.r.t. log-frequency axis) of the analysis provided by MFCC analysis. This analogy stems from the fact that MFCC roughly correspond to spectral features of different widths integrated over the whole frequency range. In this work, both the “global-integration” MFCC approach and the “local” cortical approach are tested in a state-of-the-art ASV system on the NIST SRE 2010 dataset. While both perform comparably in clean condition, the cortical features are substantially more robust on noisy data, including non-stationary distortions as well as reverberation.

One of the intuitions behind the robustness observed in the proposed features is the fact that speech and noise generally exhibit different spectral shapes while occupying an overlapping spectral range. The expansion of the spectral axis with the multi-resolution analysis allows the extrication of some speech components from the masking noise, suppressing the noise components and providing for increased robustness. Furthermore, by highlighting the range between 0.5 and 4 CPO, the model stresses the most speaker-informative regions in the speech spectrum, which in turn map onto a modulation space to which humans are highly sensitive [7]. Such a range is also commensurate with neurophysiological tuning observed in mammalian auditory cortex, with most neurons concentrated around a spectral tuning of the order of a few CPOs [3,14]. A similar emphasis is put on the temporal dynamics of the signal by underscoring the region between 0.5 and 12 Hz, which defines natural boundaries for speech perception in noise by human listeners [7,25-28] and mostly coincides with temporal tuning of mammalian cortical neurons [14]. Higher temporal modulation frequencies represent mostly the syllabic and segmental rate of speech [2].

Unlike comparable multi-resolution schemes recently developed [4,5], the proposed approach does not involve dimension-expanded representations (close to 30,000 dimensions, which inherently require computationally-expensive schemes and therefore have limited applicability). Instead, our model is constrained to lie in a perceptually-relevant spectral modulation space and further uses a careful sampling scheme to encode the information with only four spectral analysis filters. This has the dual advantage of producing a feature space that is both low-dimensional and highly robust. The careful optimization of model parameters is necessary to strike a balance between simple and efficient computation and noise robustness.

Importantly, in our approach no model components have been customized in any way to deal with a specific noise condition, making it suitable for a wide range of acoustic environments. In addition, the model has been minimally customized for the speaker recognition task and can in fact provide a general framework for a variety of speech processing tasks. Our preliminary results do indeed show great robustness of a similar scheme for automatic speech recognition. It is therefore essential to emphasize that the performance obtained with the cortical features is solely a property of the features themselves and is achieved without any noise compensation techniques. Our ongoing efforts are aimed at achieving further improvements by applying the described multi-resolution cortical analysis on enhanced spectral profiles obtained using speech enhancement techniques, which involve estimation of noise characteristics in various forms [29].

Endnotes

[a] We constrain the range of spectral modulations to 4 CPO, which covers more than 90% of the entire spectral modulation energy in speech and is most important for speech comprehension [7].

[b] The difference in MI levels between the speech message and speaker identity may be attributed to the observation that the speech signal encodes more information about the underlying linguistic message than about the speaker.

[c] In addition to the MI analysis, we performed an empirical test regarding use of the 0.25 CPO filter. An experiment was run on clean data with Ωc={0.25,0.5,1.0,2.0} CPO and yielded a 3.4% EER, a decrease in performance compared with 2.7% EER for the system that used Ωc={0.5,1.0,2.0,4.0} CPO.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgements

This research is partly supported by the IIS-0846112 (NSF), FA9550-09-1-0234 (AFOSR), 1R01AG036424-01 (NIH), N000141010278 (ONR), and by the Office of the Director of National Intelligence (ODNI), the Intelligence Advanced Research Projects Activity (IARPA), through the Army Research Laboratory (ARL). All statements of fact, opinion, or conclusions contained herein are those of the authors and should not be construed as representing the official views or policies of IARPA, the ODNI, or the U.S. Government.

References

  1. H Beigi, Fundamentals of Speaker Recognition (Springer, Berlin, 2011)

  2. S Greenberg, A Popper, W Ainsworth, Speech Processing in the Auditory System (Springer, Berlin, 2004)

  3. K O’Connor, P Yin, C Petkov, M Sutter, Complex spectral interactions encoded by auditory cortical neurons: relationship between bandwidth and pattern. Front Syst. Neurosci 4, 4–145 (2010)

  4. J Woojay, B Juang, Speech analysis in a model of the central auditory system. IEEE Trans. Speech Audio Process 15, 1802–1817 (2007)

  5. Q Wu, L Zhang, G Shi, Robust speech feature extraction based on Gabor filtering and tensor factorization. Proc. IEEE Intl. Conf. Acoust. Speech Signal Proc., Taipei, Taiwan, 4649–4652 (2009)

  6. M Elhilali, T Chi, SA Shamma, A spectro-temporal modulation index (STMI) for assessment of speech intelligibility. Speech Commun 41, 331–348 (2003)

  7. T Elliott, F Theunissen, The modulation transfer function for speech intelligibility. PLoS Comput. Biol 5, e1000302 (2009)

  8. NIST 2010 speaker recognition evaluation, http://www.nist.gov/speech/tests/sre/2010

  9. X Yang, K Wang, SA Shamma, Auditory representations of acoustic signals. IEEE Trans. Inf. Theory 38, 824–839 (1992)

  10. K Wang, SA Shamma, Self-normalization noise-robustness in early auditory representations. IEEE Trans. Speech Audio Process 2, 421–435 (1994)

  11. C Schreiner, B Calhoun, Spectral envelope coding in cat primary auditory cortex: properties of ripple transfer functions. J. Aud. Neurosc 1, 39–61 (1995)

  12. H Versnel, N Kowalski, SA Shamma, Ripple analysis in ferret primary auditory cortex. iii. topographic distribution of ripple response parameters. J. Aud. Neurosc 1, 271–286 (1995)

  13. JS Garofolo, LF Lamel, WM Fisher, JG Fiscus, DS Pallett, NL Dahlgren, DARPA TIMIT Acoustic Phonetic Continuous Speech Corpus (LDC93S1, Linguistic Data Consortium, Philadelphia, 1993)

  14. L Miller, M Escabi, H Read, C Schreiner, Spectrotemporal receptive fields in the lemniscal auditory thalamus and cortex. J. Neurophysiol 87(1), 516–527 (2002)

  15. T Cover, J Thomas, Elements of Information Theory, 2nd edn. (Wiley-Interscience, New York, 2006)

  16. H Hermansky, N Morgan, RASTA processing of speech. IEEE Trans. Speech Audio Process 2(4), 382–395 (1994)

  17. T Kinnunen, H Li, An overview of text-independent speaker recognition: from features to supervectors. Speech Commun 52, 12–40 (2010)

  18. D Garcia-Romero, et al., The UMD-JHU 2011 speaker recognition system. in Proc. IEEE Intl. Conf. Acoust. Speech Signal Proc. (Kyoto, Japan, 2012), pp. 4229–4232

  19. P Kenny, G Boulianne, P Ouellet, P Dumouchel, Speaker and session variability in gmm-based speaker verification. IEEE Trans. Audio Speech Lang. Process 15, 1448–1460 (2007)

  20. D Garcia-Romero, C Espy-Wilson, Joint factor analysis for speaker recognition reinterpreted as signal coding using overcomplete dictionaries. in Proc. Odyssey Speaker and Language Recognition Workshop (Brno, Czech Republic, 2010), pp. 117–124

  21. R Auckenthaler, M Carey, H Lloyd-Thomas, Score normalization for text-independent speaker verification systems. Digit. Signal Proc 10(1–3), 42–54 (2000)

  22. H Hirsch, D Pearce, The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. ISCA ITRW ASR2000 (vol. 4 Beijing, China, 2000), pp. 29–32

  23. NIST 2011 speaker recognition evaluation, http://www.nist.gov/itl/iad/mig/best.cfm

  24. D Zotkin, T Chi, SA Shamma, R Duraiswami, Neuromimetic sound representation for percept detection and manipulation. EURASIP J. App. Sig. Process 2005, 1350–1364 (2005)

  25. H Steeneken, T Houtgast, A physical method for measuring speech-transmission quality. J. Acoust. Soc. Am 67, 318–326 (1979)

  26. R Drullman, J Festen, R Plomp, Effect of temporal envelope smearing on speech reception. J. Acoust. Soc. Am 95, 1053–1064 (1994)

  27. T Arai, M Pavel, H Hermansky, C Avendano, Syllable intelligibility for temporally filtered LPC cepstral trajectories. J. Acoust. Soc. Am 105, 2783–2791 (1999)

  28. S Greenberg, T Arai, K Grant, The Role of Temporal Dynamics in Understanding Spoken Language. NATO Science Series: Life and Behavioural Sciences (IOS Press, Amsterdam, 2006), pp. 171–190

  29. P Loizou, Speech Enhancement: Theory and Practice (CRC Press, Boca Raton, 2007)