#DemystifyingAI: Understanding NLP & Speech Recognition

Here we are again to #DemystifyAI for you!

We started by explaining how Text-to-Speech (TTS) helps make customer service more dynamic. In case you missed the first article in our #DemystifyingAI series, read it to understand TTS better.

In this article, we will talk about how Natural Language Processing and Speech Recognition work together to form meaningful conversations.  

Natural Language Processing:

NLP (Natural Language Processing) is a branch of AI that helps computers understand human language.

We express ourselves in infinite ways, both verbally and in writing. Not only are there hundreds of languages and dialects, but within each language is a unique set of grammar and syntax rules, terms and slang.

The ultimate aim of NLP is to read, decode, understand and make sense of human language in a way that is valuable.

How NLP works:

NLP starts by breaking language down into shorter, elemental pieces, then deciphers the relationships between the pieces and explores how they work together to create meaning. In other words, NLP involves syntactic and semantic analysis of human language.
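The first step, breaking a sentence into elemental pieces, can be sketched in a few lines of Python. This is only an illustration: the `tokenize` function and its regular expression are assumptions made for this example, and production NLP engines typically rely on trained tokenizers rather than a single pattern.

```python
import re

def tokenize(sentence: str) -> list[str]:
    """Split a sentence into lowercase word tokens, keeping contractions whole."""
    # Letters/digits, optionally followed by an apostrophe part (e.g. "i'd").
    return re.findall(r"[a-z0-9]+(?:['’][a-z]+)?", sentence.lower())

tokens = tokenize("I'd like to fly from Mumbai to Delhi tomorrow")
print(tokens)
```

Once the sentence is split into tokens like these, later stages can look at how the pieces relate to each other (syntax) and what they mean together (semantics).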

In the contact centre automation industry, NLP is used to handle clients’ requests via Interactive Voice Response (IVR).

The journey of understanding voice input with the help of NLP starts with speech recognition:

Speech Recognition: 

Speech-to-Text is a type of speech recognition program that converts audio input from the user into text. Also known as automatic speech recognition (ASR), it returns text results for NLP, each with a certain confidence level.
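To make the idea of a confidence level concrete, here is a minimal sketch of how an application might decide what to do with a speech recognition result. The `AsrResult` structure, the `next_step` function and the 0.6 threshold are all hypothetical, chosen for illustration; they do not describe any particular ASR product.

```python
from dataclasses import dataclass

@dataclass
class AsrResult:
    """Hypothetical container for one speech-to-text hypothesis."""
    transcript: str
    confidence: float  # assumed to lie between 0.0 and 1.0

# Illustrative threshold only; a real deployment would tune this value.
MIN_CONFIDENCE = 0.6

def next_step(result: AsrResult) -> str:
    """Pass confident transcripts to NLP, otherwise ask the caller to repeat."""
    if result.confidence >= MIN_CONFIDENCE:
        return "send_to_nlp"
    return "ask_to_repeat"

print(next_step(AsrResult("I'd like to fly from Mumbai to Delhi tomorrow", 0.92)))
print(next_step(AsrResult("I'd like to fly from moon by to daily", 0.35)))
```

Under these assumptions, a clearly recognised sentence goes on to the NLP stage, while a low-confidence transcript prompts the caller to repeat the request.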

Next, NLP uses Utterances, Intents, and Entities to identify the meaning of the input:


Utterances:

Utterances are the inputs from the users. They are like clauses: anything the user says is an utterance.

We use utterances to train the NLP engine so that it can correctly identify the intent of the user.


Intents:

Intents are the intentions of the user or, simply put, the outcomes the user would like. They are like the verbs in a clause.

Let’s take an example:

“I’d like to fly from Mumbai to Delhi tomorrow”

Here the final goal of the user is to book a flight from Mumbai to Delhi. To identify this intent, we need to map the input from the user (the utterance) to a specific intent. In this case, the specific intent is “Book Flight”.

For example, training data for the “Book Flight” intent can include utterances like:

“I’d like to book a flight from Mumbai to Delhi tomorrow”

“I need to fly from Mumbai to Delhi tomorrow, can you find a flight for me?”

“Is there a flight that can take me from Paris to London next Monday?”

“Could you please book me on the next flight from Bangalore to Mumbai?”

“What’s the first flight from Brisbane to Perth on Friday morning?”

These (and many more) are sentences a user might say while booking a flight, so we need to train our NLP engine to map this set of utterances to the “Book Flight” intent.

Similarly, to help the intent engine identify the right intent, we need to define a rich, consistent pool of utterances for each intent.
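The mapping from utterances to intents can be illustrated with a deliberately simple sketch. It compares the words of a new utterance with the words of each training utterance (Jaccard word overlap) and picks the intent of the closest match. This is a stand-in technique chosen for readability, not how a production intent engine works; the “Cancel Booking” intent and its utterances are hypothetical and added only for contrast.

```python
import re

# Training utterances per intent. "Book Flight" examples come from the article;
# "Cancel Booking" and its utterances are hypothetical, added for contrast.
TRAINING = {
    "Book Flight": [
        "I'd like to book a flight from Mumbai to Delhi tomorrow",
        "Is there a flight that can take me from Paris to London next Monday?",
        "Could you please book me on the next flight from Bangalore to Mumbai?",
    ],
    "Cancel Booking": [
        "Please cancel my booking",
        "I want to cancel my ticket for tomorrow",
    ],
}

def words(text: str) -> set[str]:
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"[a-z']+", text.lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    """Share of words the two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def detect_intent(utterance: str) -> tuple[str, float]:
    """Return the intent of the most similar training utterance and its score."""
    query = words(utterance)
    best_intent, best_score = "Unknown", 0.0
    for intent, examples in TRAINING.items():
        for example in examples:
            score = jaccard(query, words(example))
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent, best_score

print(detect_intent("I need to fly from Mumbai to Delhi tomorrow, can you find a flight for me?"))
print(detect_intent("Cancel my ticket please"))
```

Even in this toy version, the value of a rich pool of utterances shows: the more varied the examples for an intent, the more likely a new phrasing overlaps with one of them.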


Entities:

An entity is something that the user is talking about. Entities are the metadata or variables of the intent; we can say they are the nouns associated with the verb in the clause.

“I’d like to fly from Mumbai to Delhi tomorrow”

In the above query, the metadata or variables are Mumbai, Delhi and tomorrow, which help our NLP engine find the relevant available flights for the user. To extract those three values from the input, the engine uses what in NLP we call entities. They add context to the intent and enable the NLP engine to complete a user request. For example, the intent “Book Flight” could involve the following entities:

  1. “Departure”: a custom entity, which is the list of all possible departure cities.
  2. “Destination”: a custom entity derived from “Departure” (the preposition “to” is used as a rule to extract the destination city and disambiguate it from the departure city).
  3. “Time”: the departure date, a built-in entity. Entities such as Number, Email, Phone Number and URL are generally predefined/built-in.
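The three entities above can be sketched with simple pattern rules. The city list, the day names and the `extract_entities` function are assumptions made for this illustration; real NLP engines usually combine trained models with much richer built-in date and time handling.

```python
import re

# Hypothetical custom entity: a small list of cities the service supports.
CITIES = ["Mumbai", "Delhi", "Bangalore", "Paris", "London", "Brisbane", "Perth"]
CITY_PATTERN = "|".join(CITIES)

# Simplified stand-in for a built-in date/time entity.
DAYS = "monday|tuesday|wednesday|thursday|friday|saturday|sunday"
TIME_PATTERN = rf"today|tomorrow|next (?:{DAYS})"

def extract_entities(utterance: str) -> dict[str, str]:
    """Pull Departure, Destination and Time out of a flight request."""
    entities = {}
    # The prepositions "from" and "to" tell the two cities apart.
    departure = re.search(rf"\bfrom ({CITY_PATTERN})\b", utterance, re.IGNORECASE)
    destination = re.search(rf"\bto ({CITY_PATTERN})\b", utterance, re.IGNORECASE)
    time = re.search(rf"\b({TIME_PATTERN})\b", utterance, re.IGNORECASE)
    if departure:
        entities["Departure"] = departure.group(1)
    if destination:
        entities["Destination"] = destination.group(1)
    if time:
        entities["Time"] = time.group(1)
    return entities

print(extract_entities("I'd like to fly from Mumbai to Delhi tomorrow"))
print(extract_entities("Is there a flight that can take me from Paris to London next Monday?"))
```

Note that “to fly” does not confuse the destination rule, because the word after “to” must be one of the known cities.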

NLP and Speech Recognition together are transforming the customer service domain, especially considering how IVRs enabled with voice recognition and NLP (Voice Bots) are already providing frictionless conversations to customers.

Phonon’s skill in NLP and Speech Recognition:

Phonon’s Intelligent IVR gives the caller multiple input options. It can process both DTMF and voice input, and it enables callers to use everyday language to solve their queries. Our IVR combines natural language processing and machine learning to detect the customer’s intent with higher accuracy and respond intelligently.

The seamless integration of our Intelligent IVR with different global AI platforms like Google Cloud, AWS, Azure and IBM Watson gives users the freedom to be AI/ML platform-agnostic.

As a value addition to our consultative exercise, we also provide industry-specific ML consultation to bring in flow-level automation in customer service flows.

Want to see a live demo? No worries, just give us a call at +91 997 974 6666 or email us at info@phonon.io.

About the Author

Marketing Manager

A B2B marketing enthusiast who loves to write on AI and ML.
