Lip movement to speech

The paper ‘Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis’ proposes a state-of-the-art deep learning model that generates natural speech from the lip movements of a speaker.

Lip-to-speech is a sequence-to-sequence problem. Given a video, the framework extracts the face in each frame using a face detector. The cropped faces are then passed through a face encoder followed by a speech decoder.

The speech decoder outputs a mel-spectrogram, which a vocoder then converts into a raw speech waveform.
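The stages above can be sketched end-to-end. Every function below is a hypothetical stand-in with toy logic, not the paper's actual API; in the real system each stage is a learned neural network, and the shapes here are illustrative only:

```python
# Toy sketch of the lip-to-speech pipeline: video frames -> face crops
# -> face encoder -> speech decoder (mel-spectrogram) -> vocoder (waveform).
import numpy as np

N_FRAMES, CROP, N_MELS, HOP = 25, 96, 80, 4  # toy sizes

def detect_and_crop_face(frame):
    """Stand-in for a face detector: here we just center-crop."""
    h, w = frame.shape[:2]
    top, left = (h - CROP) // 2, (w - CROP) // 2
    return frame[top:top + CROP, left:left + CROP]

def face_encoder(crops):
    """Stand-in encoder: one feature vector per video frame."""
    return crops.reshape(len(crops), -1).mean(axis=1, keepdims=True) * np.ones((1, 512))

def speech_decoder(features):
    """Stand-in decoder: maps frame features to a mel-spectrogram,
    with HOP spectrogram steps per video frame (audio runs at a
    finer time resolution than video)."""
    return np.repeat(features[:, :N_MELS], HOP, axis=0)

def vocoder(mel):
    """Stand-in vocoder: collapses each mel frame to one sample."""
    return mel.mean(axis=1)

video = np.random.rand(N_FRAMES, 128, 128, 3)            # fake video clip
crops = np.stack([detect_and_crop_face(f) for f in video])
mel = speech_decoder(face_encoder(crops))                # (N_FRAMES*HOP, N_MELS)
audio = vocoder(mel)                                     # 1-D waveform
print(crops.shape, mel.shape, audio.shape)
```

The point of the sketch is the data flow and the shape changes between stages, not the math inside each stub.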

Such a model could be used to reconstruct audio that has been corrupted in a video file, or to seamlessly generate synthetic audio when the speaker's audio signal is lost during a web session.
