Thanks to everyone who took part in the first Acapela TTS for iPhone and iPad webinars (30th of June 2010 for the French webinar, 1st of July 2010 for the English one).
These webinars covered the basics of Text-to-Speech and how to quickly integrate Acapela TTS for iPhone and iPad into your application.
The topics were:
Introduction to Acapela Group
Introduction to Text to Speech Technology
Description of Acapela TTS for iPhone and iPad SDK
API quick overview and live demo of a simple application
Q&A
The slides are now available in the documentation section:
Text-to-Speech is a technology that is becoming more and more mainstream. However, it is still a completely unknown concept for some people.
Here is a short article answering the most frequently asked questions about Text-to-Speech …
Text-to-What?
Text To Speech (abbreviation: TTS), also called “Speech synthesis”, is the artificial production of human speech. The computer system used to produce TTS is called a speech engine.
A Text to Speech engine transforms any text into speech in real time. It literally reads out loud any written information with a smooth and natural sounding voice. The automatic intonation reflects the meaning of the text, with respect to pauses, breath groups, punctuation and context.
The most important qualities of a speech synthesis system are naturalness and intelligibility. Naturalness describes how closely the output resembles human speech, while intelligibility is the ease with which the output is understood. The ideal speech synthesizer is both natural and intelligible.
Acapela Text To Speech maximizes both characteristics.
So, you speak into a microphone and it recognizes what is said?
No. Definitely not. It is exactly the opposite.
Speech recognition, also known as automatic speech recognition (ASR) or dictation, converts spoken words to text.
Text-to-Speech (TTS or Speech Synthesis) converts text to spoken words.
In a Human Machine Interface model, speech recognition is a way to transfer information FROM a human TO the computer (as does a keyboard, a mouse or a touchscreen).
Speech synthesis is a way to transfer information FROM the computer TO a human (as does a screen, a braille terminal, a beep sound …)
Yes, OK, I understand now … But how does it sound?
Here is an example of our female American voice Heather, who speaks this text: “Hi, I’m Heather, the female American English speech synthesis voice from Acapela. Efficient, fast and a very high quality, why not try me out with your own words?”
This step is done by the Acapela R&D and Linguistic team (the bottom part of the image).
In order to reproduce the natural sound of each language, a narrator records a series of texts (poetry, political news, sports results, stock exchange updates, etc.) which contain every possible sound in the chosen language.
These recordings are then sliced (automatically and manually) and organized into an acoustic database.
During database creation, all recorded speech is segmented into some or all of the following: diphones (most of the time), triphones, syllables, morphemes or even words, phrases, and sentences in some cases.
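To make the idea of a diphone more concrete, here is a minimal Python sketch. A diphone covers the transition from one phoneme to the next, so a sequence of phonemes yields a sequence of overlapping pairs. The phoneme symbols, the "_" silence marker and the function name are assumptions made for this illustration; they are not the Acapela phoneme inventory or database format.

```python
def to_diphones(phonemes):
    """Split a phoneme sequence into diphones (pairs of adjacent phonemes).

    A silence marker "_" (an assumption of this sketch) is added at both
    ends so that the first and last phonemes also get a transition unit.
    """
    padded = ["_"] + list(phonemes) + ["_"]
    # Pair each symbol with its right-hand neighbour.
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


# Illustrative phoneme string for a word like "hello".
diphones = to_diphones(["h", "e", "l", "o"])
print(diphones)
```

A sequence of n phonemes always gives n + 1 diphones with this padding, which is why a database built on diphones needs recordings that cover every phoneme-to-phoneme transition of the language.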
Step 2) Text-to-speech realtime process
The first step is done offline, by us, and the result is integrated into our product.
The voice and the linguistic data are then packaged into a product so that an application can use them.
It is within this application that the Text-to-Speech process takes place (upper part of the picture).
The speech synthesis process is composed of two main parts: the linguistic analysis module and the synthesizer.
When a sentence is sent to the TTS, the NLP module begins by carrying out a sophisticated linguistic analysis that transposes written text into phonetic text.
A text preprocessor transforms every date, currency amount, email or postal address and phone number into a normalized sentence.
For example, a sentence like “I have only $2.56 in my pocket, it is 12:45 AND I SHOULD EAT something in this 5 stars St. John restaurant” will become internally “I have only two dollars and fifty six cents in my pocket, it is twelve forty-five and i should eat something in this five stars saint john restaurant”.
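The normalization described above can be sketched in a few lines of Python. This is only a toy version written for this article: the regular expressions, the tiny abbreviation table and all function names are assumptions, it handles numbers from 0 to 99 only, and it does not fold upper case to lower case as the real preprocessor does in the example.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Hypothetical abbreviation table; a real system must also decide whether
# "St." means "saint" or "street" from the context.
ABBREVIATIONS = {"St.": "saint"}


def number_to_words(n):
    """Spell out an integer between 0 and 99."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only handles 0..99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def expand_currency(match):
    dollars, cents = int(match.group(1)), int(match.group(2))
    return f"{number_to_words(dollars)} dollars and {number_to_words(cents)} cents"


def expand_time(match):
    hours, minutes = int(match.group(1)), int(match.group(2))
    if minutes == 0:
        return f"{number_to_words(hours)} o'clock"
    if minutes < 10:
        return f"{number_to_words(hours)} oh {number_to_words(minutes)}"
    return f"{number_to_words(hours)} {number_to_words(minutes)}"


def normalize(text):
    """Expand amounts, times, abbreviations and bare numbers, in that order."""
    text = re.sub(r"\$(\d{1,2})\.(\d{2})", expand_currency, text)
    text = re.sub(r"\b(\d{1,2}):(\d{2})\b", expand_time, text)
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    # Whatever one- or two-digit numbers remain are read as plain numbers.
    return re.sub(r"\b\d{1,2}\b", lambda m: number_to_words(int(m.group())), text)


sentence = ("I have only $2.56 in my pocket, it is 12:45 AND I SHOULD EAT "
            "something in this 5 stars St. John restaurant")
print(normalize(sentence))
```

The order of the rules matters: the amount and the time are expanded before the generic number rule, otherwise “2.56” would be read as two separate numbers.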
A grammatical and syntactic analysis then enables the system to define how to pronounce each word in order to reconstruct the sense.
A phonetizer, which combines a set of rules and lexicons, gives the phonetic transcription of each word, based on the context and on the results of the preprocessor and of the grammatical and syntactic analysis.
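Here is a minimal Python sketch of why the phonetizer needs the grammatical analysis: the English word “record” is pronounced differently as a noun and as a verb. The lexicon content, the ARPAbet-style phonetic symbols, the part-of-speech tags and the letter-by-letter fallback are all assumptions made for this illustration, not the Acapela lexicon or rule set.

```python
# Hypothetical lexicon: word -> {part of speech: pronunciation}.
# "ANY" marks a pronunciation that does not depend on the part of speech.
LEXICON = {
    "i": {"ANY": "AY1"},
    "a": {"ANY": "AH0"},
    "record": {"NOUN": "R EH1 K ER0 D", "VERB": "R IH0 K AO1 R D"},
}


def phonetize(tagged_words):
    """Return one pronunciation per (word, part_of_speech) pair."""
    result = []
    for word, part_of_speech in tagged_words:
        entry = LEXICON.get(word.lower())
        if entry is None:
            # Placeholder for real letter-to-sound rules: spell the word out.
            result.append(" ".join(word.upper()))
        else:
            fallback = entry.get("ANY", next(iter(entry.values())))
            result.append(entry.get(part_of_speech, fallback))
    return result


# The tags would come from the grammatical and syntactic analysis.
phonetic = phonetize([("I", "PRON"), ("record", "VERB"),
                      ("a", "DET"), ("record", "NOUN")])
print(phonetic)
```

In a real engine the fallback is where the rule set comes in: words missing from the lexicon are transcribed by letter-to-sound rules instead of being spelled out.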
Finally, the system produces information associating the phonetic writing with the tone and required length of the pronunciation. We call this the prosody: it gives the rhythm and intonation of a sentence.
b) Synthesizer and sound output
The chain of analysis ends here and sound is generated by selecting the best units stored in the acoustic database.
The algorithm takes the results of the NLP module as input and selects from the database the best chain of candidate units, the one that matches the desired prosody as closely as possible: fundamental frequency (pitch), tone, length …
The units are extracted from the database, decoded, concatenated (without signal processing, in order to stay as natural as possible) and sent to the output.
This output may be loudspeakers, a headset, a file, a telephony board, an audio stream …
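The unit selection step described above can be illustrated with a small Python sketch. It uses a Viterbi-style dynamic programming search, a common way to find the cheapest chain of candidates; the text does not say which search the Acapela engine actually uses. The Unit fields, the two cost functions and their weights are assumptions invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    name: str        # diphone label, e.g. "h-e"
    pitch: float     # fundamental frequency in Hz
    duration: float  # length in milliseconds


def target_cost(unit, target):
    """How far a candidate is from the prosody requested by the NLP module."""
    return abs(unit.pitch - target.pitch) / 10 + abs(unit.duration - target.duration) / 10


def join_cost(left, right):
    """How audible the junction between two consecutive units would be."""
    return abs(left.pitch - right.pitch) / 10


def select_units(targets, database):
    """Return the cheapest chain of units, one per target (Viterbi search)."""
    if not targets:
        return []
    layers = [database[target.name] for target in targets]
    costs = [[target_cost(unit, targets[0]) for unit in layers[0]]]
    back = [[None] * len(layers[0])]
    for i in range(1, len(targets)):
        row_costs, row_back = [], []
        for unit in layers[i]:
            # Cheapest way to reach this candidate from the previous layer.
            totals = [costs[i - 1][k] + join_cost(previous, unit)
                      for k, previous in enumerate(layers[i - 1])]
            best = min(range(len(totals)), key=totals.__getitem__)
            row_costs.append(totals[best] + target_cost(unit, targets[i]))
            row_back.append(best)
        costs.append(row_costs)
        back.append(row_back)
    # Walk the back pointers from the cheapest final candidate.
    j = min(range(len(costs[-1])), key=costs[-1].__getitem__)
    chain = []
    for i in range(len(targets) - 1, -1, -1):
        chain.append(layers[i][j])
        j = back[i][j]
    return chain[::-1]


# Desired prosody (from the NLP module) and a toy acoustic database.
targets = [Unit("h-e", 120, 80), Unit("e-l", 125, 90)]
database = {
    "h-e": [Unit("h-e", 100, 80), Unit("h-e", 122, 85)],
    "e-l": [Unit("e-l", 124, 90), Unit("e-l", 180, 90)],
}
chosen = select_units(targets, database)
print([(unit.name, unit.pitch) for unit in chosen])
```

Because the selected units already come close to the desired pitch and length, they can be concatenated without heavy signal processing, which is the point made above about staying as natural as possible.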