Text-to-speech is becoming more and more mainstream. However, it is still a completely unknown concept for some people.
Here is a short article answering the most frequently asked questions about Text-to-Speech.
Text-to-What?
Text-to-Speech (abbreviated TTS), also called “speech synthesis”, is the artificial production of human speech. The computer system that produces the speech is called a speech engine.
A Text-to-Speech engine transforms any text into speech in real time. It literally reads any written information out loud with a smooth, natural-sounding voice. The intonation is generated automatically and reflects the meaning of the text, taking into account pauses, breath groups, punctuation and context.
The most important qualities of a speech synthesis system are naturalness and intelligibility. Naturalness describes how closely the output sounds like human speech, while intelligibility is the ease with which the output is understood. The ideal speech synthesizer is both natural and intelligible.
Acapela Text To Speech maximizes both characteristics.
So, you speak into a microphone and it recognizes what is said?
No. Definitely not. It is the exact opposite.
Speech recognition (also known as automatic speech recognition (ASR) or dictation system) converts spoken words to text.
Text-to-Speech (TTS or Speech Synthesis) converts text to spoken words.
In a Human Machine Interface model, speech recognition is a way to transfer information FROM a human TO the computer (as does a keyboard, a mouse or a touchscreen).
Speech synthesis is a way to transfer information FROM the computer TO a human (as does a screen, a braille terminal, a beep sound …)
Yes, OK, I understand now … But how does it sound?
Here is an example of our female American voice, Heather, speaking this text: “Hi, I’m Heather, the female American English speech synthesis voice from Acapela. Efficient, fast and a very high quality, why not try me out with your own words?”
You can find more examples of our Text-to-Speech voices on the demo page of our corporate website.
You can also send free speech-powered e-cards from our sparkling laboratory, acapela.tv.
And how does it work?
Acapela TTS is mainly based on a technology called “Non Uniform Unit Selection”. This is sometimes called “third-generation text-to-speech” (the previous generations were formant synthesis and diphone concatenation).
Step 1) Creation of the voice
This step is done by the Acapela R&D and linguistics teams (the bottom part of the image).
In order to reproduce the natural sound of each language, a narrator records a series of texts (poetry, political news, sports results, stock exchange updates, etc.) which contain every possible sound in the chosen language.
These recordings are then sliced (automatically and manually) and organized into an acoustic database.
During database creation, all recorded speech is segmented into some or all of the following: diphones (most of the time), triphones, syllables, morphemes or, in some cases, even words, phrases and sentences.
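To make the idea of a diphone concrete, here is a minimal Python sketch. It assumes the common definition of a diphone (a unit running from the middle of one phoneme to the middle of the next) and uses an illustrative phoneme spelling and a "_" silence marker that are not taken from the Acapela tools.

```python
def to_diphones(phonemes):
    """Return the diphone labels that cover a phoneme sequence.

    The sequence is padded with silence ("_") on both sides, so there is
    one diphone per transition between neighbouring sounds.
    """
    padded = ["_"] + list(phonemes) + ["_"]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


# Illustrative phoneme spelling of "hello" (hypothetical notation).
print(to_diphones(["h", "@", "l", "oU"]))
```

A word of four phonemes is covered by five diphones, which is why the recordings have to contain every sound transition of the language and not just every sound.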
Step 2) Text-to-speech real-time process
The first step is done offline, by us, and its result is integrated into our product.
The voice and the linguistic data are then packaged into a product so that an application can use them.
It is inside this application that the Text-to-Speech process takes place (upper part of the picture).
The speech synthesis process is made up of two main parts: the linguistic analysis module and the synthesizer.
a) Linguistic analysis
The Linguistic analysis is done by the NLP (Natural Language Processing) module.
When a sentence is sent to the TTS, the NLP module begins by carrying out a sophisticated linguistic analysis that transposes written text into phonetic text.
- A text preprocessor transforms every date, currency amount, email or postal address, and phone number into normalized text.
For example, a sentence like “I have only $2.56 in my pocket, it is 12:45 AND I SHOULD EAT something in this 5 stars St. John restaurant” will become internally “I have only two dollars and fifty six cents in my pocket, it is twelve forty-five and i should eat something in this five stars saint john restaurant”.
- A grammatical and syntactic analysis then enables the system to define how to pronounce each word in order to reconstruct the meaning.
- A phonetizer (a set of rules and lexicons) gives the phonetic transcription of each word, based on the context and on the results of the preprocessor and of the grammatical and syntactic analysis.
- Finally, the system produces information associating the phonetic transcription with the tone and required length of the pronunciation. We call this the prosody: it gives the rhythm and intonation of a sentence.
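The preprocessing step in the list above can be illustrated with a small Python sketch. It is only a toy under stated assumptions, not the Acapela preprocessor: the function names are hypothetical, only numbers from 0 to 99 are handled, and “St.” is always read as “saint”, whereas a real system uses the context to choose between “saint” and “street”.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer between 0 and 99 (limit of this sketch)."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only covers 0..99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def normalize(text):
    """Expand amounts, times, one abbreviation and small numbers, then lowercase."""
    # Currency: "$2.56" becomes "<dollars> dollars and <cents> cents".
    text = re.sub(
        r"\$(\d{1,2})\.(\d{2})",
        lambda m: (f"{number_to_words(int(m.group(1)))} dollars and "
                   f"{number_to_words(int(m.group(2)))} cents"),
        text)
    # Time: "12:45" is read as two numbers, hours then minutes.
    text = re.sub(
        r"\b(\d{1,2}):(\d{2})\b",
        lambda m: (f"{number_to_words(int(m.group(1)))} "
                   f"{number_to_words(int(m.group(2)))}"),
        text)
    # Abbreviation: assumed to always mean "saint" here.
    text = re.sub(r"\bSt\.", "saint", text)
    # Any remaining one- or two-digit number.
    text = re.sub(r"\b\d{1,2}\b",
                  lambda m: number_to_words(int(m.group())), text)
    # Case carries no pronunciation information in this sketch.
    return text.lower()


print(normalize("I have only $2.56 in my pocket, it is 12:45 AND I SHOULD EAT "
                "something in this 5 stars St. John restaurant"))
```

The order of the rules matters: the currency and time patterns run first, so that the generic number rule does not break “$2.56” or “12:45” into separate digits.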
b) Synthesizer and sound output
The chain of analysis ends here, and the sound is generated by selecting the best units stored in the acoustic database.
The algorithm takes the results of the NLP module as input and selects, from the database, the chain of candidate units that matches the desired prosody as closely as possible: fundamental frequency (pitch), tone, length …
The units are extracted from the database, decoded, concatenated (without signal processing, in order to stay as natural as possible) and sent to the output.
The output may be loudspeakers, a headset, a file, a telephony board, an audio stream …
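The selection of the “best chain” described above is commonly solved with dynamic programming (a Viterbi search). The Python sketch below shows the idea under simplifying assumptions that are not taken from the Acapela engine: the Unit fields, the target cost (distance to the requested pitch and length) and the join cost (pitch jump between neighbours, free when two units were recorded back to back) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    label: str       # diphone label, e.g. "h-@"
    pitch: float     # mean fundamental frequency in Hz
    duration: float  # length in milliseconds
    position: int    # index of the unit in the original recording


def target_cost(unit, want_pitch, want_duration):
    """Distance between a candidate and the requested prosody."""
    return abs(unit.pitch - want_pitch) + abs(unit.duration - want_duration)


def join_cost(left, right):
    """Penalty for concatenating two units; zero if they were contiguous."""
    if right.position == left.position + 1:
        return 0.0
    return abs(left.pitch - right.pitch)


def select_units(targets, database):
    """Return the cheapest chain of units for the requested targets.

    targets:  list of (label, pitch, duration) tuples from the analysis step
    database: dict mapping a label to its list of candidate Unit objects
    """
    if not targets:
        return []
    # best[i][j] = (cheapest total cost ending in candidate j, back-pointer)
    best = []
    for i, (label, pitch, duration) in enumerate(targets):
        row = []
        for unit in database[label]:
            cost = target_cost(unit, pitch, duration)
            if i == 0:
                row.append((cost, None))
                continue
            previous = database[targets[i - 1][0]]
            total, back = min(
                (best[i - 1][k][0] + join_cost(prev, unit), k)
                for k, prev in enumerate(previous))
            row.append((cost + total, back))
        best.append(row)
    # Walk the back-pointers from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    chain = []
    for i in range(len(targets) - 1, -1, -1):
        chain.append(database[targets[i][0]][j])
        j = best[i][j][1]
    chain.reverse()
    return chain


database = {
    "_-h": [Unit("_-h", 110, 80, 0), Unit("_-h", 130, 90, 40)],
    "h-@": [Unit("h-@", 120, 100, 1), Unit("h-@", 128, 95, 77)],
}
chain = select_units([("_-h", 112, 80), ("h-@", 124, 98)], database)
print([unit.position for unit in chain])
```

Because contiguous units join for free, the search tends to pick long stretches of the original recording whenever they fit the requested prosody, which is what lets the units be concatenated without signal processing.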
OK, Thanks, now I understand.
You’re welcome

June 19th, 2010 at 3:11 pm
If you have questions about TTS technology, don’t hesitate to ask them in the comments. I will try to answer most of them in other blog posts.
Jean-Michel Reghem
Developer Solutions Product Manager – Acapela Group
October 30th, 2010 at 5:39 am
If I write an application for the iPhone, does text need to be sent from the phone to Acapela servers and then the resulting sound clip sent to the phone? Or is all the processing done on the phone itself?
November 3rd, 2010 at 10:12 am
With Acapela TTS for iPhone and iPad, the TTS conversion is done on the phone itself!
No connection is needed for the text-to-speech.
The engine and the data are included in your application.
We do of course have other products that let you use a TTS server ( http://www.acapela-vaas.com or Acapela TTS for Servers on http://www.acapela-for-developers.com ), but most of our iPhone/iPad customers use the SDK.
November 9th, 2010 at 6:39 am
I’d like to have an app where the text is read and each word is highlighted/emphasised/?? as it is read. Is that possible with your API? BTW, iPhone/iPad platform.
November 9th, 2010 at 9:42 pm
vbookz (see our gallery) does that.
With our SDK you can register a delegate that is called each time a word is pronounced.
In this delegate, your app can highlight the current word.
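As a rough illustration of the pattern (not the Acapela SDK API, whose delegate methods are documented in AcapelaSpeech.h), here is a Python sketch in which a stand-in engine calls a callback for each word. The names speak, on_word and make_highlighter are hypothetical.

```python
import re


def speak(text, on_word):
    """Stand-in for a TTS engine: report each word as it is 'pronounced'.

    on_word receives the (start, end) character offsets of the word in text.
    A real engine would fire the callback when the word's audio starts.
    """
    for match in re.finditer(r"\S+", text):
        on_word(match.start(), match.end())


def make_highlighter(text, frames):
    """Build a callback that records the text with the current word bracketed."""
    def on_word(start, end):
        frames.append(f"{text[:start]}[{text[start:end]}]{text[end:]}")
    return on_word


text = "Read this aloud"
frames = []
speak(text, make_highlighter(text, frames))
for frame in frames:
    print(frame)
```

In an app, the callback would update the selected range of a text view instead of building strings.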
January 27th, 2011 at 3:49 pm
For some reason, I cannot get the delegate set.
_TTS.delegate = self;
NSLog(@"self = %@", self);
NSLog(@"_TTS.delegate = %@", _TTS.delegate);
results in:
2011-01-27 07:50:55.333 eDocReader[964:307] self =
2011-01-27 07:50:55.336 eDocReader[964:307] _TTS.delegate = (null)
Seems strange. I have another app where it works fine.
January 27th, 2011 at 4:52 pm
That is expected (accessor mechanism in Objective-C):
delegate is a method, not a member variable. To set a delegate, you should use
[_TTS setDelegate:self];
–> check the AcapelaSpeech.h and the documentation for more explanation
BR
Jean-Michel
June 29th, 2011 at 3:01 pm
I have a question. What do you think about using HTS technology to create a synthesizer? And why does Acapela use unit selection technology? I would like to know the difference between HTS technology and unit selection. Do you think unit selection is the best technology for creating a synthesizer?
June 29th, 2011 at 3:54 pm
HTS is the original implementation of HMM-based TTS.
To quote Wikipedia on HMM:
“HMM-based synthesis is a synthesis method based on hidden Markov models, also called Statistical Parametric Synthesis. In this system, the frequency spectrum (vocal tract), fundamental frequency (vocal source), and duration (prosody) of speech are modeled simultaneously by HMMs. Speech waveforms are generated from HMMs themselves based on the maximum likelihood criterion”
So HMM synthesizers work fine and are already deployed (the default SVox voices on Android and the Nuance VoiceOver voices in iOS are based on HMM).
The big advantage of HMM is that all the voice data can be as small as 1 MB (versus a minimum of 20 MB for our LH unit selection voice).
With unit selection, we store a large collection of small units of real recorded voice in a database. HMM generates the voice from parameters.
So there is no “best” technology. But in terms of acoustic quality, unit selection algorithms are of course far better (HMM voices sometimes seem like a return to the past … if you don’t consider the footprint advantage).