Open Access Research Article

Visual Contribution to Speech Perception: Measuring the Intelligibility of Animated Talking Heads

Slim Ouni1*, Michael M Cohen2, Hope Ishak2 and Dominic W Massaro2

Author Affiliations

1 LORIA, Campus Scientifique, BP 239, Vandœuvre-lès-Nancy Cedex 54506, France

2 Perceptual Science Laboratory, University of California, Santa Cruz, CA 95064, USA


EURASIP Journal on Audio, Speech, and Music Processing 2007, 2007:047891  doi:10.1155/2007/47891

The electronic version of this article is the complete one and can be found online at: http://asmp.eurasipjournals.com/content/2007/1/047891


Received: 7 January 2006
Revisions received: 21 July 2006
Accepted: 21 July 2006
Published: 23 October 2006

© 2007 Slim Ouni et al.

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Animated agents are becoming increasingly frequent in research and applications in speech science. An important challenge is to evaluate the effectiveness of an agent in terms of the intelligibility of its visible speech. In three experiments, we extend and test the Sumby and Pollack (1954) metric to allow the comparison of an agent relative to a standard or reference, and also propose a new metric based on the fuzzy logical model of perception (FLMP) to describe the benefit provided by a synthetic animated face relative to the benefit provided by a natural face. A valid metric would allow direct comparisons across different experiments and would give measures of the benefit of a synthetic animated face relative to a natural face (or indeed any two conditions), and of how this benefit varies as a function of the type of synthetic face, the test items (e.g., syllables versus sentences), different individuals, and applications.
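The Sumby and Pollack measure mentioned above is conventionally computed as the audiovisual gain normalized by the headroom left over the auditory-alone score. As a minimal illustrative sketch (the function name and the assumption that scores are proportions correct in [0, 1] are ours, not the paper's):

```python
def visual_contribution(auditory: float, audiovisual: float) -> float:
    """Relative visual gain in the Sumby-Pollack sense: (AV - A) / (1 - A).

    Normalizes the improvement from adding the face by the room left
    for improvement over the auditory-alone proportion correct.
    """
    if not 0.0 <= auditory < 1.0:
        raise ValueError("auditory score must be in [0, 1)")
    return (audiovisual - auditory) / (1.0 - auditory)

# Example: auditory-alone 40% correct, audiovisual 70% correct
print(visual_contribution(0.40, 0.70))  # 0.5
```

A score of 0.5 here means the face recovered half of the intelligibility that was lost in the auditory-alone condition; the paper's extension allows such scores to be compared across agents and experiments.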

References

  1. WH Sumby, I Pollack, Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America 26(2), 212–215 (1954)

  2. C Benoît, T Mohamadi, S Kandel, Effects of phonetic context on audio-visual intelligibility of French. Journal of Speech and Hearing Research 37(5), 1195–1203 (1994)

  3. A Jesse, N Vrignaud, MM Cohen, DW Massaro, The processing of information from multiple sources in simultaneous interpreting. Interpreting 5(2), 95–115 (2000)

  4. AQ Summerfield, Use of visual information for phonetic perception. Phonetica 36(4-5), 314–331 (1979)

  5. G Bailly, M Bérar, F Elisei, M Odisio, Audiovisual speech synthesis. International Journal of Speech Technology 6(4), 331–346 (2003)

  6. J Beskow, Talking heads - models and applications for multimodal speech synthesis, Ph.D. thesis (Department of Speech, Music and Hearing, KTH, Stockholm, Sweden, 2003)

  7. DW Massaro, Perceiving Talking Faces: From Speech Perception to a Behavioral Principle (MIT Press, Cambridge, Mass, USA, 1998)

  8. M Odisio, G Bailly, F Elisei, Tracking talking faces with shape and appearance models. Speech Communication 44(1–4), 63–82 (2004)

  9. C Pelachaud, NI Badler, M Steedman, Generating facial expressions for speech. Cognitive Science 20(1), 1–46 (1996)

  10. DW Massaro, J Beskow, MM Cohen, CL Fry, T Rodriguez, Picture my voice: audio to visual speech synthesis using artificial neural networks. in Proceedings of Auditory-Visual Speech Processing (AVSP '99), August 1999, Santa Cruz, Calif, USA, ed. by Massaro DW, pp. 133–138

  11. J Beskow, I Karlsson, J Kewley, G Salvi, SYNFACE - a talking head telephone for the hearing-impaired. in Proceedings of 9th International Conference on Computers Helping People with Special Needs (ICCHP '04), July 2004, Paris, France, ed. by Miesenberger K, Klaus J, Zagler W, Burger D, pp. 1178–1186

  12. A Bosseler, DW Massaro, Development and evaluation of a computer-animated tutor for vocabulary and language learning in children with autism. Journal of Autism and Developmental Disorders 33(6), 653–672 (2003)

  13. DW Massaro, J Light, Improving the vocabulary of children with hearing loss. Volta Review 104(3), 141–174 (2004)

  14. DW Massaro, J Light, Read my tongue movements: bimodal learning to perceive and produce non-native speech /r/ and /l/. Proceedings of the 8th European Conference on Speech Communication and Technology (EUROSPEECH '03), September 2003, Geneva, Switzerland, 2249–2252

  15. DW Massaro, J Light, Using visible speech for training perception and production of speech for hard of hearing individuals. Journal of Speech, Language, and Hearing Research 47(2), 304–320 (2004)

  16. C Nass, Wired for Speech: How Voice Activates and Advances the Human-Computer Relationship (MIT Press, Cambridge, Mass, USA, 2005)

  17. MM Cohen, RL Walker, DW Massaro, Perception of synthetic visual speech. in Speechreading by Humans and Machines: Models, Systems, and Applications, ed. by Stork DG, Hennecke ME (Springer, Berlin, Germany, 1996), pp. 153–168

  18. C Siciliano, G Williams, J Beskow, A Faulkner, Evaluation of a multilingual synthetic talking face as a communication aid for the hearing impaired. Proceedings of the 15th International Congress of Phonetic Science (ICPhS '03), August 2003, Barcelona, Spain, 131–134

  19. B LeGoff, T Guiard-Marigny, MM Cohen, C Benoît, Real-time analysis-synthesis and intelligibility of talking faces. Proceedings of the 2nd International Conference on Speech Synthesis, September 1994, Newark, NY, USA

  20. S Ouni, MM Cohen, DW Massaro, Training Baldi to be multilingual: a case study for an Arabic Badr. Speech Communication 45(2), 115–137 (2005)

  21. KW Grant, BE Walden, Evaluating the articulation index for auditory-visual consonant recognition. Journal of the Acoustical Society of America 100(4), 2415–2424 (1996)

  22. LE Bernstein, SP Eberhardt, Johns Hopkins Lipreading Corpus Videodisk Set (The Johns Hopkins University, Baltimore, Md, USA, 1986)

  23. KW Grant, PF Seitz, Measures of auditory-visual integration in nonsense syllables and sentences. Journal of the Acoustical Society of America 104(4), 2438–2450 (1998)

  24. KW Grant, BE Walden, PF Seitz, Auditory-visual speech recognition by hearing-impaired subjects: consonant recognition, sentence recognition, and auditory-visual integration. Journal of the Acoustical Society of America 103(5), 2677–2690 (1998)

  25. KW Grant, BE Walden, Predicting auditory-visual speech recognition in hearing-impaired listeners. Proceedings of the 13th International Congress of Phonetic Sciences, August 1995, Stockholm, Sweden 3, 122–129

  26. DW Massaro, MM Cohen, Tests of auditory-visual integration efficiency within the framework of the fuzzy logical model of perception. Journal of the Acoustical Society of America 108(2), 784–789 (2000)

  27. DW Massaro, MM Cohen, CS Campbell, T Rodriguez, Bayes factor of model selection validates FLMP. Psychonomic Bulletin and Review 8(1), 1–17 (2001)

  28. TH Chen, DW Massaro, Mandarin speech perception by ear and eye follows a universal principle. Perception and Psychophysics 66(5), 820–836 (2004)

  29. DW Massaro, From multisensory integration to talking heads and language learning. in Handbook of Multisensory Processes, ed. by Calvert G, Spence C, Stein BE (MIT Press, Cambridge, Mass, USA, 2004), pp. 153–176

  30. SA Lesner, The talker. Volta Review 90(5), 89–98 (1988)

  31. K Johnson, P Ladefoged, M Lindau, Individual differences in vowel production. Journal of the Acoustical Society of America 94(2), 701–714 (1993)

  32. PB Kricos, SA Lesner, Differences in visual intelligibility across talkers. Volta Review 84, 219–225 (1982)

  33. AT Gesi, DW Massaro, MM Cohen, Discovery and expository methods in teaching visual consonant and word identification. Journal of Speech and Hearing Research 35(5), 1180–1188 (1992)

  34. AA Montgomery, PL Jackson, Physical characteristics of the lips underlying vowel lipreading performance. Journal of the Acoustical Society of America 73(6), 2134–2144 (1983)

  35. JE Preminger, H-B Lin, M Payen, H Levitt, Selective visual masking in speechreading. Journal of Speech, Language, and Hearing Research 41(3), 564–575 (1998)