Voice Synthesis Software

The human voice, with all its subtlety and nuance, is proving to be an exceptionally difficult thing for computers to emulate. Using a powerful new algorithm, a Montreal-based AI startup has developed a voice generator that can mimic virtually any person’s voice, and even add an emotional punch when necessary. The system isn’t perfect, but it heralds a future when voices, like photos, can be easily faked.

When Siri, Alexa, or our GPS talk to us, it’s fairly obvious that we’re being spoken to by a machine. That’s because virtually every text-to-speech system on the market relies on a pre-recorded set of words, phrases, and utterances (recorded from voice actors), which are then strung together in Frankenstein-like fashion to produce complete words and sentences. The end result is a vocal delivery that sounds distinctly uninspiring, robotic, and at times laughable. This approach to voice synthesis also means that we’re stuck listening to the same pre-recorded, monotonous voice over and over again.
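As a rough illustration of the "strung together" approach described above, the following Python sketch joins pre-recorded units with a short linear crossfade. The unit names, the toy sample values, and the `crossfade_join` and `concatenate_units` helpers are all hypothetical and chosen only for illustration; commercial systems pick among many candidate recordings per unit and are far more elaborate.

```python
def crossfade_join(a, b, overlap):
    """Join two sample sequences, linearly crossfading over `overlap` samples."""
    a, b = list(a), list(b)
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b
    head = a[:len(a) - overlap]   # part of `a` left untouched
    tail = b[overlap:]            # part of `b` left untouched
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of the incoming unit
        blended.append((1 - w) * a[len(a) - overlap + i] + w * b[i])
    return head + blended + tail


def concatenate_units(unit_names, inventory, overlap=2):
    """Look up each named unit in the inventory and join them in order."""
    output = []
    for name in unit_names:
        output = crossfade_join(output, inventory[name], overlap)
    return output


# Toy inventory: each "recording" is just four samples (hypothetical data).
inventory = {"hel": [1.0, 1.0, 1.0, 1.0], "lo": [0.0, 0.0, 0.0, 0.0]}
word = concatenate_units(["hel", "lo"], inventory, overlap=2)
```

Because every unit is a fixed recording, the only freedom such a system has is in how it chooses and smooths the joins, which is one reason the result tends to sound monotonous.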

In an effort to inject some life in the automated voices that come out of our apps, AI startup Lyrebird has developed a voice-imitation algorithm that can mimic any person’s voice, and read any text with a predefined emotion or intonation. Incredibly, it can do this after analyzing just a few dozen seconds of pre-recorded audio. In an effort to promote its new tool, Lyrebird produced several audio samples using the voices of Barack Obama, Donald Trump, and Hillary Clinton.

Lyrebird’s demos also showcase the virtually unlimited catalog of voices, and the system’s ability to articulate the same sentence with different intonations.

This is all made possible through the use of artificial neural networks, which function in a manner similar to the biological neural networks in the human brain. Essentially, the algorithm learns to recognize patterns in a particular person’s speech, and then reproduces those patterns during simulated speech.

“We train our models on a huge dataset with thousands of speakers,” Jose Sotelo, a team member at Lyrebird and a speech synthesis expert, told Gizmodo. “Then, for a new speaker we compress their information in a small key that contains their voice DNA. We use this key to say new sentences.”
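Lyrebird has not published how its "key" is computed, so the following Python sketch is only an analogy: it swaps in a plain average of per-frame feature vectors where the company presumably uses a learned neural encoder. The function names `speaker_key` and `similarity` and the two-dimensional toy features are hypothetical. The point is simply that a variable-length recording can be reduced to one small fixed-size vector that characterizes the speaker.

```python
import math


def speaker_key(frames):
    """Reduce a list of per-frame feature vectors to one unit-length vector.

    Assumption: a simple mean stands in for a learned speaker encoder.
    """
    dim = len(frames[0])
    mean = [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0
    return [x / norm for x in mean]


def similarity(key_a, key_b):
    """Cosine similarity between two unit-length keys."""
    return sum(a * b for a, b in zip(key_a, key_b))


# Toy per-frame features for two speakers (hypothetical data).
frames_a = [[1.0, 0.0], [0.8, 0.2]]
frames_b = [[0.0, 1.0], [0.1, 0.9]]
key_a = speaker_key(frames_a)
key_b = speaker_key(frames_b)
```

In a full system, a key like this would be fed to the synthesis network alongside the text, so that the same model can speak new sentences in different voices.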

The end result is far from perfect—the samples still exhibit digital artifacts, clarity problems, and other weirdness—but there’s little doubt who is being imitated by the speech generator. Changes in intonation are also discernible. Unlike other systems, Lyrebird’s solution requires less data per speaker to produce a new voice, and it works in real time. The company plans to offer its tool to companies in need of speech synthesis solutions.

“We are currently raising funds and growing our engineering team,” said Sotelo. “We are working on improving the quality of the audio to make it less robotic, and we hope to start beta testing soon.”

Needless to say, this form of speech synthesis introduces a host of ethical problems and security concerns. Eventually, a refined version of this system could replicate a person’s voice with incredible accuracy, making it virtually impossible for a human listener to discern the original from the emulation. The day is coming when vocal speech, like an image processed in Photoshop, can be manipulated without our knowing. Unscrupulous individuals could fake a speech by a prominent politician, adding yet another layer to the emerging post-truth environment. Hackers could use speech synthesis for social engineering, fooling even the most careful security experts. The possibilities are almost endless.

These potentially adverse impacts are not lost on Lyrebird, which argues that the era in which we can trust audio recordings is on the verge of coming to an end.

“We take seriously the potential malicious applications of our technology,” Sotelo told Gizmodo. “We want this technology to be used for good purposes: giving back the voice to people who lost it to sickness, being able to record yourself at different stages in your life and hearing your voice later on, etc. Since this technology could be developed by other groups with malicious purposes, we believe that the right thing to do is to make it public and well-known so we stop relying on audio recordings [as evidence].”

No doubt, we’ll have to start second-guessing audio recordings of speech soon, but solutions could also be developed to ascertain the authenticity of vocal recordings. Humans may be fooled by such systems, but computers will not be, at least not for a while. A high-resolution recording of human speech yields a tremendous amount of waveform and frequency data for a computer to analyze. It will be a long, long time before a speech synthesis program can replicate every single aspect of a person’s distinctive speech, like the finer details of vocal timbre (i.e., the quality of speech) and mouth noises such as breathing, tongue sounds, and lip smacking, to the point where even a machine can’t detect the difference. There are other aspects of a recording to consider as well. For instance, the absence of background noise, the presence of a faked acoustic space, or artificially introduced ambient sounds should be easily detectable by a machine designed for the task.

Eventually, however, a speech synthesis program may be able to fake all of these things, at which point, our ability to discern truth from fabrication will be put to the test.

[Lyrebird via Scientific American]

Speech Synthesis Demo

xoxos.net has released Syng2, a voice synthesizer plugin for Windows, as a free download.

Syng2 is a voice synthesizer that uses four bandpass filters to impose formants on any spectrally rich input signal, or on its internal oscillator. The phoneme can be selected on a second MIDI channel, or preset words can be triggered.
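To give a feel for the technique, here is a minimal Python sketch of formant shaping with four bandpass filters. It is not Syng2's code. The assumptions: a sawtooth wave as the spectrally rich source, standard two-pole biquad bandpass filters (the widely used Audio EQ Cookbook form), the four filters run in parallel and averaged, and rough illustrative formant frequencies for an "ah"-like vowel. All function names and constants are hypothetical.

```python
import math


def bandpass(samples, center_hz, q, sample_rate):
    """Two-pole biquad bandpass with roughly unit gain at center_hz."""
    w0 = 2 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2 * q)
    a0 = 1 + alpha
    b0, b2 = alpha / a0, -alpha / a0          # the b1 coefficient is zero
    a1, a2 = -2 * math.cos(w0) / a0, (1 - alpha) / a0
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in samples:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


def sawtooth(freq_hz, n_samples, sample_rate):
    """Harmonically rich source signal in the range [-1, 1)."""
    return [2.0 * ((i * freq_hz / sample_rate) % 1.0) - 1.0
            for i in range(n_samples)]


def formant_voice(source, formants_hz, q=10.0, sample_rate=16000):
    """Run the source through one bandpass per formant and average the bands."""
    bands = [bandpass(source, f, q, sample_rate) for f in formants_hz]
    return [sum(values) / len(bands) for values in zip(*bands)]


# Illustrative formant frequencies in Hz (assumed, not taken from Syng2).
AH_FORMANTS = (700, 1200, 2600, 3300)
source = sawtooth(110, 1600, 16000)   # 0.1 s of a 110 Hz sawtooth
voice = formant_voice(source, AH_FORMANTS)
```

Changing the four center frequencies over time is what moves such a synthesizer from one vowel to another, which is presumably what selecting a phoneme does in the plugin.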

Syng2 was designed as an intelligible, low-CPU voice engine for Breathcube 2. The articulation is intended to be more natural than other ‘robot voices’ using this method, with plosives slightly emphasised so that they remain audible alongside music.