The paper ‘Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis’ proposes a state-of-the-art deep learning model that generates natural speech from a speaker's lip movements.
Lip-to-speech synthesis is treated as a sequence-to-sequence problem. Given a video, the framework extracts the face from each frame using a face detector. The cropped faces are then passed through a face encoder, followed by a speech decoder.
The speech decoder generates a mel-spectrogram, which a vocoder then converts into raw speech.
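The data flow described above can be sketched roughly as follows. This is only an illustration of how the stages connect, not the paper's implementation: every function name, the fixed bounding box, the sizes (`embed_dim`, `mel_per_frame`, `n_mels`, `hop_length`), and the stub bodies are hypothetical placeholders standing in for the face detector, the learned encoder and decoder, and the vocoder.

```python
from typing import List, Tuple

Frame = List[List[float]]  # a grayscale image as rows of pixel values


def crop_face(frame: Frame, box: Tuple[int, int, int, int]) -> Frame:
    """Crop a face region; `box` = (top, left, height, width).
    A real system would get the box from a face detector."""
    top, left, height, width = box
    return [row[left:left + width] for row in frame[top:top + height]]


def encode_faces(faces: List[Frame], embed_dim: int = 4) -> List[List[float]]:
    """Placeholder face encoder: one embedding vector per video frame.
    Here the 'embedding' is just the mean pixel value repeated."""
    embeddings = []
    for face in faces:
        pixels = [p for row in face for p in row]
        mean = sum(pixels) / len(pixels)
        embeddings.append([mean] * embed_dim)
    return embeddings


def decode_to_mel(embeddings: List[List[float]],
                  mel_per_frame: int = 3,
                  n_mels: int = 5) -> List[List[float]]:
    """Placeholder speech decoder: emits several mel-spectrogram frames
    per video frame, since audio has a finer time resolution than video."""
    mel = []
    for emb in embeddings:
        for _ in range(mel_per_frame):
            mel.append([emb[0]] * n_mels)
    return mel


def vocode(mel: List[List[float]], hop_length: int = 2) -> List[float]:
    """Placeholder vocoder: turns each mel frame into `hop_length` samples."""
    waveform = []
    for mel_frame in mel:
        level = sum(mel_frame) / len(mel_frame)
        waveform.extend([level] * hop_length)
    return waveform


def lip_to_speech(video: List[Frame],
                  box: Tuple[int, int, int, int]) -> List[float]:
    """Chain the stages: crop faces -> encode -> decode to mel -> vocode."""
    faces = [crop_face(frame, box) for frame in video]
    embeddings = encode_faces(faces)
    mel = decode_to_mel(embeddings)
    return vocode(mel)
```

With these placeholder sizes, a 4-frame video yields 4 × 3 = 12 mel frames and 12 × 2 = 24 audio samples. The point of the sketch is the sequence-to-sequence shape of the problem: a short sequence of face crops goes in, and a longer sequence of audio samples comes out.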
This can be useful for regenerating the audio of a video file when it has been corrupted. Similarly, if the speaker's audio signal is lost during a web session, such a model could fill the gap with synthetic audio.