It’s already possible to use computers to automatically generate translations of videos into foreign languages: YouTube has offered translated captions with reasonable accuracy for some time. But we’ve never seen foreign-language videos automatically dubbed over with speech in a different language. For the most part, you still need good old human voice actors.
Researchers from Amazon have detailed a “speech-to-speech” AI technology that will translate speech into a new computer-generated voice of a different language, tune its speed and duration to match the original dialogue, and even add background noise so as to make the computer-generated voices sound as natural as possible.
Automated dubbing is complicated: More than simply translating a transcript, it requires 1) transcribing the original audio to text, 2) translating that text into a different language, and 3) generating new speech from that translation that carries the same speed and emotion as the original audio. And of course, you'd hope it will come out the other end accurate and human-like.
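As a rough illustration of how those three stages chain together, here is a minimal Python sketch. The class `DubbingPipeline` and the toy stage functions are hypothetical names invented for this example and are not Amazon's system; a real pipeline would plug in speech recognition, machine translation, and text-to-speech models where the stand-ins are.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DubbingPipeline:
    """Chains the three stages; each stage is a pluggable callable (hypothetical design)."""
    transcribe: Callable[[bytes], str]   # source audio -> source-language text
    translate: Callable[[str], str]      # source text -> target-language text
    synthesize: Callable[[str], bytes]   # target text -> target-language audio

    def dub(self, audio: bytes) -> bytes:
        source_text = self.transcribe(audio)
        target_text = self.translate(source_text)
        return self.synthesize(target_text)


# Toy stand-ins so the sketch runs; they only pretend that bytes are audio.
TOY_DICTIONARY = {"hello": "ciao", "world": "mondo"}


def toy_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def toy_translate(text: str) -> str:
    # Word-by-word lookup; unknown words pass through unchanged.
    return " ".join(TOY_DICTIONARY.get(word, word) for word in text.split())


def toy_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = DubbingPipeline(toy_transcribe, toy_translate, toy_synthesize)
dubbed = pipeline.dub(b"hello world")
```

Keeping each stage behind a plain callable means any one of them could be swapped out without touching the others, and an error in an early stage visibly propagates to everything after it.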
Amazon’s researchers say their speech-to-speech tech has been trained on over 150 million English-Italian phrase pairs in order to match the duration of speech segments (that is, how fast a particular phrase should be spoken in the translated audio). The tech also extracts background audio from the original dialogue and injects it into the new audio in order to make it seem as similar to the original as possible. There’s also a step that estimates the reverberation happening in the original audio and applies that to the dubbed audio.
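To make the duration-matching and background-injection ideas concrete, here is a simplified Python sketch. It is not the researchers' trained model: `rate_factor` is a hypothetical hand-written heuristic that just scales speaking speed so a synthesized phrase fits the original segment, and `mix_background` is a hypothetical sample-by-sample mixer. The clamp limits and the gain value are assumptions chosen for illustration.

```python
from typing import List


def rate_factor(original_s: float, synthesized_s: float,
                min_rate: float = 0.8, max_rate: float = 1.3) -> float:
    """Speed multiplier that makes the synthesized phrase fit the original duration.

    Values above 1.0 mean "speak faster". The result is clamped (assumed limits)
    so the voice is never sped up or slowed down to an unnatural degree.
    """
    if original_s <= 0 or synthesized_s <= 0:
        raise ValueError("durations must be positive")
    raw = synthesized_s / original_s
    return max(min_rate, min(max_rate, raw))


def mix_background(speech: List[float], background: List[float],
                   gain: float = 0.5) -> List[float]:
    """Add the extracted background track under the new speech.

    Samples are floats in [-1.0, 1.0]; the shorter track is padded with
    silence and the sum is clipped back into range.
    """
    length = max(len(speech), len(background))
    mixed = []
    for i in range(length):
        s = speech[i] if i < len(speech) else 0.0
        b = background[i] if i < len(background) else 0.0
        mixed.append(max(-1.0, min(1.0, s + gain * b)))
    return mixed
```

For example, if the original line lasts 2.0 seconds and the synthesized Italian line lasts 2.4 seconds, `rate_factor(2.0, 2.4)` asks for speech about 1.2 times faster, while a 3.0-second line against a 1.0-second original is capped at the assumed 1.3 limit.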
This could actually be useful: If it works well, that is, because dubbing films the old-fashioned way is expensive and time-consuming. You need actual voice actors. For less common languages, films sometimes just don’t get dubbed at all.
It still needs some work, though. From the feedback they gathered, the researchers found that the rhythm of the generated speech doesn't quite sound human just yet.