#DemystifyingAI: Text-to-Speech

AI in customer service is evolving at breakneck speed. Phonon has been the customer communication automation partner to leading global enterprises for over a decade. We intend to use this #DemystifyingAI series to share practical information about implementing AI in customer service.

Text-to-Speech

Text-to-Speech (TTS) is the voice you hear whenever you call into a self-service IVR system or receive an automated call.

Why is TTS Important?

TTS handles the dynamic part of announcements: it converts the dynamic fields into a comprehensible spoken message. In the two example announcements below, the dynamic fields are listed after each announcement.

Announcement

Example 1: Your Air India flight 101 of 23rd May 2019 from New Delhi Indira Gandhi International Airport, Terminal 3 to New York, John F Kennedy International Airport is on time and en route. It is expected to arrive at JFK, Terminal 4 at 11:20 am.

Dynamic Fields in the above announcement are: Flight Number, Date of Flight (ordinal), Origin Airport, Origin Airport Terminal, Destination Airport, Flight Status, Destination Airport Acronym, Destination Airport Terminal, ETA

Example 2: We are calling regarding a recent transaction on your State Bank of India, Visa credit card, ending with 3781. A sum of $374 and 30 cents has been spent at merchant IndigoDeli. In case you have not made this transaction, please say ‘block’ to speak to an agent.

Dynamic fields here are: Bank Name, Card Type, Last 4 digits of Card, Amount (including currency), Merchant Name, CTA (Call-to-Action)
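
The split between fixed wording and dynamic fields can be sketched as a template with named placeholders. The snippet below is only an illustration under stated assumptions, not Phonon's implementation: the template text follows Example 2, and the field names (bank_name, card_type, last4, amount, merchant, cta) are hypothetical.

```python
# Minimal sketch: an announcement as fixed text plus named dynamic fields.
# Field names are hypothetical, chosen to mirror the list for Example 2.
TEMPLATE = (
    "We are calling regarding a recent transaction on your {bank_name}, "
    "{card_type} credit card, ending with {last4}. "
    "A sum of {amount} has been spent at merchant {merchant}. "
    "In case you have not made this transaction, please say '{cta}' "
    "to speak to an agent."
)


def render_announcement(fields: dict) -> str:
    """Fill the template; a missing field raises KeyError rather than
    producing a half-finished announcement."""
    return TEMPLATE.format(**fields)


fields = {
    "bank_name": "State Bank of India",
    "card_type": "Visa",
    "last4": "3781",
    "amount": "374 dollars and 30 cents",
    "merchant": "IndigoDeli",
    "cta": "block",
}
print(render_announcement(fields))
```

Keeping the fields separate from the wording means each field can later be handed to whichever TTS approach suits it, recorded snippets or a synthesised voice.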

Types Of TTS

TTS can be classified into two types: human voice and synthesised voice.

Human Voice

These are snippets recorded by a voice artist and played back in series, so that the announcement sounds almost like a person speaking. For example, 101 is announced by playing one.mp3 + zero.mp3 + one.mp3 in series.
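
As a rough sketch of that playback logic, the function below maps a digit string to the ordered list of recordings to play. The file names follow the one.mp3 pattern from the example; the rest of the snippet library is assumed for illustration.

```python
from typing import List

# Hypothetical snippet library: one recorded file per digit.
DIGIT_SNIPPETS = {
    "0": "zero.mp3", "1": "one.mp3", "2": "two.mp3", "3": "three.mp3",
    "4": "four.mp3", "5": "five.mp3", "6": "six.mp3", "7": "seven.mp3",
    "8": "eight.mp3", "9": "nine.mp3",
}


def digits_to_playlist(number: str) -> List[str]:
    """Return the snippet files to play, in order, for a digit string."""
    if not (number.isascii() and number.isdigit()):
        raise ValueError(f"expected ASCII digits only, got {number!r}")
    return [DIGIT_SNIPPETS[d] for d in number]


print(digits_to_playlist("101"))  # ['one.mp3', 'zero.mp3', 'one.mp3']
```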

Advantage of Human Voice: If edited well, the recordings sound relatable to end users.

Disadvantage of Human Voice: It needs linguists to create the proper grammar for each language, for example for announcing numbers and time. In Hindi, one needs to compose a time as सुबह + ग्यारह + बजकर + बीस + मिनट, whereas in English it is simply Eleven + Twenty + AM. The recordings therefore need to be accurate and well edited.
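
One way to picture such a grammar is as an ordered list of slots per language. The sketch below is an assumption-laden illustration rather than a real grammar: the table names and the "literal:" convention are invented here, and the vocabulary covers only the 11:20 am example above.

```python
from typing import List

# Hypothetical grammar templates: the order in which slots are spoken.
# Entries starting with "literal:" are fixed connecting words.
TIME_GRAMMAR = {
    "en": ["hour", "minute", "period"],
    "hi": ["period", "hour", "literal:बजकर", "minute", "literal:मिनट"],
}

# Only the vocabulary needed for the 11:20 am example. A fuller Hindi
# grammar would choose the time-of-day word from the hour, not from am/pm.
VOCABULARY = {
    "en": {"hour": {11: "Eleven"}, "minute": {20: "Twenty"}, "period": {"am": "AM"}},
    "hi": {"hour": {11: "ग्यारह"}, "minute": {20: "बीस"}, "period": {"am": "सुबह"}},
}


def compose_time(lang: str, hour: int, minute: int, period: str) -> List[str]:
    """Return the words (or snippets) to play, in the order the language needs."""
    values = {"hour": hour, "minute": minute, "period": period}
    parts = []
    for slot in TIME_GRAMMAR[lang]:
        if slot.startswith("literal:"):
            parts.append(slot[len("literal:"):])
        else:
            parts.append(VOCABULARY[lang][slot][values[slot]])
    return parts


print(" + ".join(compose_time("en", 11, 20, "am")))
print(" + ".join(compose_time("hi", 11, 20, "am")))
```

The same three values produce a different word order and extra connecting words in Hindi, which is why each language needs its own grammar and its own set of recordings.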

Synthesised Voice

These are machine-generated voices built by techno-linguists, who program sounds, consonants, vowels, punctuation and grammar to make the voice sound as close as possible to the accent of the desired language.

Advantage of Synthesised Voice: Synthesised voices work well when there is a need for a complex vocabulary, because any text can be spoken without recording new snippets.

Disadvantage of Synthesised Voice: It sounds robotic and still needs more development to sound human-like. Punctuation and pauses need to be scripted in a mark-up language.
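
To give a feel for what scripting pauses looks like, here is a small sketch that builds a Speech Synthesis Markup Language (SSML) string using only the Python standard library. The speak, s, break and say-as elements come from the SSML standard; the helper names, the 400 ms pause and the interpret-as value "characters" are assumptions for illustration, and supported interpret-as values differ between TTS providers.

```python
from xml.sax.saxutils import escape


def say_digits(digits: str) -> str:
    # "characters" is an assumed interpret-as value; check the provider's docs.
    return f'<say-as interpret-as="characters">{escape(digits)}</say-as>'


def build_ssml(parts, pause_ms=400):
    """Wrap already-escaped fragments as sentences with explicit pauses between."""
    pause = f'<break time="{pause_ms}ms"/>'
    body = pause.join(f"<s>{part}</s>" for part in parts)
    return f"<speak>{body}</speak>"


ssml = build_ssml([
    escape("We are calling regarding a recent transaction on your credit card, ending with ")
    + say_digits("3781") + ".",
    escape("In case you have not made this transaction, please say 'block' to speak to an agent."),
])
print(ssml)
```

The resulting string is what would be sent to a synthesised-voice engine in place of plain text, so that the card digits are read out one by one and the call-to-action follows after a deliberate pause.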

Phonon Skills in Text-to-Speech

Phonon handles human voice IVR services in over 12 Indian languages, including Indian English, Hindi, Marathi, Gujarati, Telugu, Kannada, Malayalam, Bengali, Assamese, Oriya and Punjabi. We have ready grammars for these languages, backed by over a decade of work experience.

Skills in Synthesised Voice: Phonon integrates with leading TTS AI providers, including Amazon Polly, Google Cloud TTS, Microsoft Azure TTS API and IBM Watson TTS. These are global platforms for a variety of languages, including Cantonese, Mandarin, Arabic and European and African languages, and the integration uses Speech Synthesis Markup Language (SSML).

About the Author

Ujwal Makhija
Managing Director

An Electronics Engineer and an MBA from IIM Calcutta, Ujwal is a technology visionary in service automation and a master of concept sales. He is a people person, ever ready for an operational challenge with his sleeves rolled up.