Archive for the ‘TTS and Speech Technologies’ Category

Presentation slides of webinars

Friday, July 9th, 2010

Thanks to everyone who took part in the first Acapela TTS for iPhone and iPad webinars (30 June 2010 for the French webinar, 1 July 2010 for the English one).

These webinars covered the basics of Text-to-Speech and how to quickly integrate Acapela TTS for iPhone and iPad into your application.

The topics were:

  • Introduction to Acapela Group
  • Introduction to Text to Speech Technology
  • Description of Acapela TTS for iPhone and iPad SDK
  • API quick overview and live demo of a simple application
  • Q&A

The slides are now available in the documentation section.

The live demo description can be found here: Quick Start: How to add TTS in your app (“HelloWorld TTS app”)

Other webinars are in preparation, covering more advanced topics such as audio management, iOS 4 features and In App Purchase implementation.

If you have other suggestions, don’t hesitate to add a comment.

Good App!

Jean-Michel

Text-to-Speech? What is that?

Saturday, June 12th, 2010

Text-to-speech is becoming more and more mainstream. However, it is still a totally unknown concept for some people.

Here is a short article answering the most frequently asked questions about Text-to-Speech.

Text-to-What?

Text To Speech (abbreviation: TTS), also called “speech synthesis”, is the artificial production of human speech. The computer system used to produce TTS is called a speech engine.

A Text to Speech engine transforms any text into speech in real time. It literally reads out loud any written information with a smooth and natural sounding voice. The automatic intonation reflects the meaning of the text, with respect to pauses, breath groups, punctuation and context.

The most important qualities of a speech synthesis system are naturalness and intelligibility. Naturalness describes how closely the output sounds like human speech, while intelligibility is the ease with which the output is understood. The ideal speech synthesizer is both natural and intelligible.
Acapela Text To Speech maximizes both characteristics.

So, you speak into a microphone and it recognizes what is said?

No. Definitely not. It is the exact opposite.

Speech recognition (also known as automatic speech recognition, ASR, or a dictation system) converts spoken words to text.

Text-to-Speech (TTS or Speech Synthesis) converts text to spoken words.

In a Human Machine Interface model, speech recognition is a way to transfer information FROM a human TO the computer (as does a keyboard, a mouse or a touchscreen).
Speech synthesis is a way to transfer information FROM the computer TO a human (as does a screen, a braille terminal, a beep sound …).

Yes, OK, I understand now … But how does it sound?

Here is an example of our female American voice Heather, who speaks this text: “Hi, I’m Heather, the female American English speech synthesis voice from Acapela. Efficient, fast and a very high quality, why not try me out with your own words?”

You can find examples of our Text-to-Speech voices on the demo page of our corporate website.

You can also send free speech-powered e-cards from our sparkling laboratory, acapela.tv.

And how does it work?

Acapela TTS is mainly based on a technology called “Non Uniform Unit Selection”. This is sometimes called “third-generation text-to-speech” (the previous generations were formant synthesis and diphone concatenation).

This diagram presents the chain of processes behind Text to Speech.

Step 1) Creation of the voice

This step is done by the Acapela R&D and linguistic teams (the bottom part of the image).

In order to reproduce the natural sound of each language, a narrator records a series of texts (poetry, political news, sports results, stock exchange updates, etc.) which contain every possible sound in the chosen language.

These recordings are then sliced (automatically and manually) and organized into an acoustic database.

During database creation, all recorded speech is segmented into some or all of the following: diphones (most of the time), triphones, syllables, morphemes or even words, phrases and sentences in some cases.
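
As a rough illustration of what a diphone inventory looks like, the sketch below turns a phoneme sequence into the list of phoneme-to-phoneme transitions it contains. The phoneme labels and the "_" silence marker are assumptions made for this example, not Acapela's actual notation, and a real system cuts the audio itself rather than just pairing labels.

```python
def to_diphones(phonemes):
    """Return the diphone labels (adjacent phoneme pairs) for one utterance.

    A silence marker "_" is added at both ends so that the first and
    last phonemes also get a transition unit.
    """
    padded = ["_"] + list(phonemes) + ["_"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


# Hypothetical phoneme labels for the word "hello".
hello_diphones = to_diphones(["h", "@", "l", "oU"])
print(hello_diphones)
```

Four phonemes give five diphones here, because each unit covers the transition between two neighbouring sounds rather than a single sound.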

Step 2) Text-to-speech realtime process

The first step is done offline, by us.
The voice and the linguistic data are then packaged into a product so that they can be used by an application.

It is inside this application that the Text-to-Speech process takes place (the upper part of the picture).

The speech synthesis process has two main parts: the linguistic analysis module and the synthesizer.

a) Linguistic analysis

The Linguistic analysis is done by the NLP (Natural Language Processing) module.

When a sentence is sent to the TTS, the NLP module begins by carrying out a sophisticated linguistic analysis that transposes written text into phonetic text.

  • A text preprocessor transforms dates, currency amounts, e-mail and postal addresses, phone numbers and similar items into a normalized sentence.
    For example, a sentence like “I have only $2.56 in my pocket, it is 12:45 AND I SHOULD EAT something in this 5 stars St. John restaurant” becomes internally “I have only two dollars and fifty six cents in my pocket, it is twelve forty-five and i should eat something in this five stars saint john restaurant”.
  • A grammatical and syntactic analysis then enables the system to determine how each word should be pronounced so that the meaning is preserved.
  • A phonetizer, made up of a set of rules and lexicons, gives the phonetic transcription of each word, based on the context and on the results of the preprocessor and of the grammatical and syntactic analysis.
  • Finally, the system produces information associating the phonetic transcription with the tone and required length of the pronunciation. We call this the prosody: it gives the rhythm and intonation of a sentence.
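
To make the preprocessing step in the list above more concrete, here is a minimal sketch of a text normalizer. It is a toy built on regular expressions and is not Acapela's preprocessor: the function names, the one-entry abbreviation lexicon and the 0 to 99 number range are all assumptions chosen to handle the example sentence above.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
ABBREVIATIONS = {"St.": "saint"}  # hypothetical toy lexicon, one entry only


def number_to_words(n):
    """Spell out an integer from 0 to 99 (toy range)."""
    if not 0 <= n <= 99:
        raise ValueError("toy example: only 0..99 supported")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def normalize(text):
    """Expand abbreviations, currency, clock times and small integers."""
    # Expand known abbreviations first, while their dots are still present.
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Currency: "$2.56" -> "two dollars and fifty-six cents"
    text = re.sub(
        r"\$(\d{1,2})\.(\d{2})",
        lambda m: f"{number_to_words(int(m.group(1)))} dollars and "
                  f"{number_to_words(int(m.group(2)))} cents",
        text)
    # Clock time: "12:45" -> "twelve forty-five"
    text = re.sub(
        r"\b(\d{1,2}):(\d{2})\b",
        lambda m: f"{number_to_words(int(m.group(1)))} "
                  f"{number_to_words(int(m.group(2)))}",
        text)
    # Remaining one- or two-digit integers: "5" -> "five"
    text = re.sub(r"\b\d{1,2}\b",
                  lambda m: number_to_words(int(m.group())), text)
    # Case carries no pronunciation information in this toy model.
    return text.lower()


sentence = ("I have only $2.56 in my pocket, it is 12:45 AND I SHOULD EAT "
            "something in this 5 stars St. John restaurant")
normalized = normalize(sentence)
print(normalized)
```

The order of the rules matters: currency and times are rewritten before bare integers, otherwise “12:45” would be read as two unrelated numbers. A production preprocessor also has to resolve ambiguities this sketch ignores, such as “St.” meaning “street” rather than “saint”.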

b) Synthesizer and sound output

The chain of analysis ends here and sound is generated by selecting the best units stored in the acoustic database.

The algorithm takes the results of the NLP module as input and selects, from the database, the chain of candidate units that best matches the desired prosody: fundamental frequency (pitch), tone, length …

The units are extracted from the database, decoded, concatenated (without signal processing, in order to stay as natural as possible) and sent to the output.

This output may be loudspeakers, a headset, a file, a telephony board, an audio stream …
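
The selection of the best chain of units described above is commonly framed as a shortest-path search. The sketch below uses a Viterbi-style dynamic programme that balances a “target cost” (how far a unit is from the desired prosody) against a “join cost” (how badly two consecutive units fit together). The unit representation and both cost functions, which look at pitch only, are assumptions for illustration. The actual costs used by Acapela TTS are not described in this article.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Pick one candidate per position so that the total cost is minimal."""
    # best[i][j] = (cheapest total cost of a chain ending in
    #               candidates[i][j], index of its predecessor)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            cost, prev = min(
                (best[i - 1][k][0] + join_cost(p, c), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(targets[i], c), prev))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1]


# Hypothetical data: desired pitch in Hz for three consecutive units,
# and two recorded candidates (name, pitch) for each position.
targets = [120, 130, 125]
candidates = [
    [("a1", 100), ("a2", 118)],
    [("b1", 131), ("b2", 160)],
    [("c1", 90), ("c2", 126)],
]


def pitch_target_cost(target, unit):
    return abs(target - unit[1])


def pitch_join_cost(prev_unit, unit):
    return 0.5 * abs(prev_unit[1] - unit[1])


chosen = select_units(targets, candidates, pitch_target_cost, pitch_join_cost)
print([name for name, _ in chosen])
```

Keeping the join cost in the search is what lets the units be concatenated with little or no signal processing: the algorithm prefers units that already fit together, instead of correcting mismatches afterwards.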

OK, Thanks, now I understand.

You’re welcome :-)