Open Access Research

Acoustic-visual synthesis technique using bimodal unit-selection

Slim Ouni1*, Vincent Colotte1, Utpala Musti2, Asterios Toutios3, Brigitte Wrobel-Dautcourt1, Marie-Odile Berger2 and Caroline Lavecchia1

Author Affiliations

1 Université de Lorraine, LORIA, UMR 7503, Villers-lès-Nancy, F-54600, France

2 INRIA, Villers-lès-Nancy, F-54600, France

3 Signal Analysis & Interpretation Laboratory (SAIL), University of Southern California, 3740 McClintock Ave., Los Angeles, CA 90089, USA


EURASIP Journal on Audio, Speech, and Music Processing 2013, 2013:16  doi:10.1186/1687-4722-2013-16

Published: 27 June 2013

Abstract

This paper presents a bimodal acoustic-visual synthesis technique that concurrently generates the acoustic speech signal and a 3D animation of the speaker’s outer face. This is done by concatenating bimodal diphone units that consist of both acoustic and visual information. In the visual domain, we mainly focus on the dynamics of the face rather than on rendering. The proposed technique overcomes the problems of asynchrony and incoherence inherent in classic approaches to audiovisual synthesis. The different synthesis steps are similar to those of typical concatenative speech synthesis but are generalized to the acoustic-visual domain. The bimodal synthesis was assessed through perceptual and subjective evaluations. The overall outcome indicates that the proposed bimodal acoustic-visual synthesis technique provides intelligible speech in both the acoustic and visual channels.

Keywords:
Audiovisual speech; Acoustic-visual synthesis; Unit-selection
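The core idea of the abstract — selecting and concatenating diphone units that carry acoustic and visual features jointly, so both channels stay synchronous by construction — can be sketched as a standard Viterbi-style unit selection over bimodal candidates. The sketch below is illustrative only: the unit representation, the `target_cost` and `join_cost` definitions, and the weighting are assumptions, not the cost functions used in the paper.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def target_cost(unit, target):
    # Illustrative: how well a candidate's acoustic features match the target.
    return euclidean(unit['acoustic'], target['acoustic'])

def join_cost(prev, cur):
    # Bimodal join cost: penalize discontinuities in BOTH channels at once,
    # which is what keeps the selected sequence coherent acoustically and visually.
    return (euclidean(prev['acoustic'], cur['acoustic'])
            + euclidean(prev['visual'], cur['visual']))

def select_units(candidates, targets, w_join=1.0):
    """Pick one candidate per diphone minimizing total target + join cost.

    candidates[i] -- list of candidate units for diphone i; each unit is a
                     dict with 'acoustic' and 'visual' feature vectors.
    targets[i]    -- dict with the desired 'acoustic' features for diphone i.
    Returns a list of chosen candidate indices, one per diphone.
    """
    n = len(candidates)
    cost = [[0.0] * len(c) for c in candidates]  # best cumulative cost
    back = [[0] * len(c) for c in candidates]    # backpointers
    for j, u in enumerate(candidates[0]):
        cost[0][j] = target_cost(u, targets[0])
    for i in range(1, n):
        for j, u in enumerate(candidates[i]):
            best, arg = min(
                (cost[i - 1][k] + w_join * join_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1]))
            cost[i][j] = target_cost(u, targets[i]) + best
            back[i][j] = arg
    # Backtrack from the cheapest final candidate.
    j = min(range(len(candidates[-1])), key=lambda k: cost[-1][k])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    return path[::-1]
```

Because each unit bundles both modalities, a single selection pass replaces the separate (and potentially asynchronous) acoustic and visual pipelines of classic audiovisual synthesis.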