Across apps like TikTok and Instagram Reels, users can find videos of former U.S. president Barack Obama singing the lyrics of "Shape of You" by Ed Sheeran or Formula 1 driver Charles Leclerc belting a song by Taylor Swift. Neither cover ever happened; however, with the rapid advancement of AI technology, videos with voice cloning can now be easily created by the general public. While some of these videos are hilarious, others are more nefarious, such as a fake voice recording of a Progressive Party leader in Slovakia encouraging a plan to rig a vote, or alleged "leaked" recordings of Omar al-Bashir, a former leader of Sudan.

Meanwhile, the public interacts with text-to-speech (TTS) systems every day, whether asking Siri to research a topic or telling Alexa to play some music. But what exactly is the difference between a text-to-speech system and voice cloning?

Text-To-Speech Systems

Traditional TTS technology converts written text into spoken words. Earlier systems relied on three different synthesis techniques: articulatory synthesis, formant synthesis, and concatenative synthesis. Articulatory synthesis mimics how humans physically produce sounds by simulating movements of the lips, glottis, and other articulators, but it is difficult to implement because modeling articulatory behavior is complex and the data needed to model it is hard to obtain. Formant synthesis produces speech from a set of linguistic rules and the source-filter model, a simplified theory that describes speech production as a two-stage process. This method does not require extensive linguistic data or recordings, but as a result it can sound artificial. Finally, concatenative synthesis depends on a large database of speech fragments recorded by voice actors: it breaks these fragments into smaller units and stitches them together to produce natural-sounding speech. However, because of the need for such a large corpus, there is less variety in the speech styles that can be mimicked, and the output can sound flat and unemotional. A more evolved version of these techniques, statistical parametric speech synthesis (SPSS), later emerged, using statistical models to generate speech.
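The core idea of concatenative synthesis, selecting pre-recorded units and stitching them together with smooth joins, can be illustrated in a few lines. The sketch below is a toy model, not a real system: the "unit database" holds synthetic sine tones standing in for recorded speech fragments, and the phoneme-like labels are invented for illustration. Only the crossfaded stitching reflects the actual technique.

```python
import numpy as np

SAMPLE_RATE = 16000

def make_unit(freq, duration=0.1):
    """Stand-in for a recorded speech fragment: a short sine tone."""
    t = np.linspace(0, duration, int(SAMPLE_RATE * duration), endpoint=False)
    return np.sin(2 * np.pi * freq * t)

# Toy unit database keyed by hypothetical phoneme-like labels; a real
# system would store thousands of fragments cut from studio recordings.
unit_db = {
    "HH": make_unit(220),
    "EH": make_unit(330),
    "L": make_unit(440),
    "OW": make_unit(550),
}

def crossfade_concat(units, fade=0.01):
    """Stitch units together with a short linear crossfade so the joins
    are less audible -- the core move of concatenative synthesis."""
    n_fade = int(SAMPLE_RATE * fade)
    out = units[0]
    for u in units[1:]:
        ramp = np.linspace(0, 1, n_fade)
        out = np.concatenate([
            out[:-n_fade],                                  # body of what we have so far
            out[-n_fade:] * (1 - ramp) + u[:n_fade] * ramp, # blended join region
            u[n_fade:],                                     # rest of the next unit
        ])
    return out

speech = crossfade_concat([unit_db[p] for p in ["HH", "EH", "L", "OW"]])
```

Because each join overlaps the fade region rather than appending it, four 0.1-second units produce slightly less than 0.4 seconds of audio; the same bookkeeping applies when the units are real diphone recordings.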

Modern TTS systems, by contrast, leverage deep learning, specifically neural networks, to generate more natural, human-like voices. These systems, built on deep neural networks (DNNs) and often trained end to end, can model context, intonation, and emotional cues to produce speech that closely mimics human intonation and rhythm. For example, the development of WaveNet by DeepMind in 2016 marked a significant leap forward, producing speech that can be almost indistinguishable from a human speaker. WaveNet generates raw audio waveforms directly, one sample at a time, with each new sample conditioned on the samples before it, resulting in more natural and lifelike output. TTS systems are now integral to assistive devices, helping visually impaired individuals access written content, and to educational tools, where they facilitate language learning and literacy.
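The building block that lets WaveNet-style models look far back in the waveform is the dilated causal convolution: each output sample depends only on past samples, and stacking layers whose dilation doubles at each step grows the context window exponentially. The sketch below is a minimal NumPy illustration with toy, untrained weights; it shows the mechanism, not DeepMind's actual architecture.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """1-D causal convolution: output[t] mixes x[t], x[t-d], x[t-2d], ...
    Left-padding with zeros guarantees no 'future' sample leaks in."""
    k = len(w)
    pad = dilation * (k - 1)
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([
        sum(w[j] * xp[pad + t - j * dilation] for j in range(k))
        for t in range(len(x))
    ])

x = np.random.randn(32)       # stand-in for a short raw-audio snippet
w = np.array([0.5, 0.5])      # kernel size 2, as in WaveNet (toy weights)

# Stacking layers with dilations 1, 2, 4, 8 gives a receptive field of
# 16 past samples while each layer stays cheap to compute.
y = x
for d in [1, 2, 4, 8]:
    y = causal_dilated_conv(y, w, d)
```

The causality is what makes autoregressive generation possible: at synthesis time the model can emit sample t, feed it back in, and move on, because no output ever depended on samples that had not been produced yet.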

Voice Cloning: An even more robust version of TTS

Voice cloning is a more specialized application of speech generation technology that aims to replicate a specific individual's voice. It involves capturing the unique characteristics of a person's voice, such as pitch, tone, and accent, from a handful of speech samples, building a model of those characteristics, and then using that model to generate new speech in that person's voice. The model can produce speech that sounds like the original speaker, even saying words or sentences the person never actually recorded. Thus, voice cloning can be more extensive and complex in its implementation than many TTS systems.

Like modern TTS, voice cloning typically uses advanced machine learning techniques to achieve high levels of realism. Two common approaches are speaker adaptation and speaker encoding: the first fine-tunes an existing multi-speaker model on audio-text pairs from the target speaker, while the second directly estimates the speaker embedding, a representation unique to the speaker, from the audio itself. Both techniques improve with more capable multi-speaker models and larger sets of audio samples.
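The speaker-encoding idea can be made concrete with a deliberately simplified sketch: map variable-length audio features to one fixed-size vector (the speaker embedding), then compare embeddings with cosine similarity. Here the "encoder" is just a mean over frames and the "recordings" are synthetic random features; a real encoder is a trained neural network operating on, say, mel-spectrogram frames.

```python
import numpy as np

def speaker_embedding(frames):
    """frames: (num_frames, feat_dim) array of per-frame features.
    Averages frames into one vector and normalizes it to unit length,
    a stand-in for a trained neural speaker encoder."""
    emb = frames.mean(axis=0)
    return emb / np.linalg.norm(emb)

def similarity(emb_a, emb_b):
    """Cosine similarity between unit-length embeddings:
    values near 1.0 suggest the same speaker."""
    return float(np.dot(emb_a, emb_b))

# Toy data: two "recordings" of speaker A share an underlying voice
# vector plus noise; speaker B gets an independent voice vector.
rng = np.random.default_rng(0)
voice_a = rng.normal(size=40)
a1 = speaker_embedding(voice_a + 0.1 * rng.normal(size=(50, 40)))
a2 = speaker_embedding(voice_a + 0.1 * rng.normal(size=(60, 40)))
b = speaker_embedding(rng.normal(size=40) + 0.1 * rng.normal(size=(50, 40)))

same = similarity(a1, a2)      # high: same underlying voice
different = similarity(a1, b)  # low: different voices
```

In a cloning pipeline, an embedding like `a1` would then condition a multi-speaker synthesizer, steering its output toward the target voice without retraining the whole model.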

The video below provides a humorous example of voice cloning in action. It depicts former presidents Barack Obama and Donald Trump playing a video game together.

Voice Transfer: The oft-overlooked branch of TTS

The act of voice transfer is a bit more specific. The idea is this: given a pre-recorded audio of a particular person’s speech, the AI should be able to turn that person’s voice into someone else’s while maintaining the original pace, rhythm, and intonation of the initial recording.

One simple example of voice transfer can be found in this article, in which the author walks through a research project he conducted: building an AI that can transform children's voices into those of the adults they will grow up to become.