A new paper authored by researchers from Disney Research and several universities describes a new approach to procedural speech animation based on deep learning. The system samples audio recordings of human speech and uses it to automatically generate matching mouth animation. The method has applications ranging from increased efficiency in animation pipelines to making social VR interactions more convincing by animating the speech of avatars in real-time in social VR settings.
Researchers from Disney Research, University of East Anglia, California Institute of Technology, and Carnegie Mellon University, have authored a paper titled A Deep Learning Approach for Generalized Speech Animation. The paper describes a system which has been trained with a ‘deep learning / neural network’ approach, using eight hours of reference footage (2,543 sentences) from a single speaker to teach the system the shape the mouth should make during various units of speech (called phonemes) and combinations thereof.
Below: The face on the right is the reference footage. The left face is overlaid with a mouth generated from the system based only on the audio input, after training with the video.