Lead
Voice-controlled engine
Speech recognition technology has come as a boon to users, enabling them
to access not only information and knowledge but also entertainment simply
by voicing their needs, writes Nivedan Prakash
Speech recognition has always been a subject of much interest in human-computer
interaction. Though the technology is not new, it has improved over the years.
Before getting into the details of the technology, let us understand what speech
recognition software is all about. Speech recognition software is the technology
that allows you to control a computer by speaking to it. When you talk, the
software decides whether to convert your voice into text in order to let you
dictate documents or e-mail messages, or interpret what you say as a voice command
and take an action.
While relatively simple to use, speech recognition software is a highly sophisticated
technology that leverages language modeling to recognize and differentiate
among the millions of human utterances that make up any language. Using statistical
probability, speech recognition software programs analyze an incoming stream
of sound and interpret those sounds as commands and dictation. This process
of interpretation is called speech recognition.
Companies like IBM, Microsoft, and Nuance Communications are pioneers on the
speech technology front. IBM has a desktop Hindi speech recognition technology
that provides a natural interface for human-computer interaction. This technology
helps transcribe Hindi speech into text and is suitable for a country like India,
where a large portion of the population is not familiar enough with the English
language or the keyboard to use computers.
This core technology developed by IRL (IBM Research Laboratory) can be used
for desktop applications such as data entry, letter-writing, sending e-mails,
as well as to speech-enable ATMs, kiosks and other devices. You can also use
the software for issuing commands to a computer and for IVR (interactive voice
response) applications in telephony. As the software supports Unicode, developers
can integrate it with word processing and e-mail applications. The software
produces word recognition rates in the range of 90-95% with speaker adaptation
and 80-90% without.
Dr Guruduth Banavar, Director-IBM India Research Laboratory, stated, "Our
researchers are also exploring areas where the social impact could be huge.
For instance, the Spoken Web, driven out of IBM India Research Lab, is also
being incubated in IBM's eight global labs in six countries. Spoken Web
technology aims to transform how people create, build, and interact with e-commerce
sites on the World Wide Web using the spoken word instead of the written word."
Similarly, Microsoft's Windows Speech Recognition in Windows Vista
empowers users to interact with their computer by voice. The Windows Vista speech
recognition experience places users in control and enables them to become
proficient at using speech recognition quickly. Using speech recognition in
the operating system, you can dictate e-mails and documents and use your voice
to control your favorite programs and navigate the Web. It enables you to create
documents quickly and helps you work with reduced risk of repetitive stress
injuries.
One of the most important aspects of Windows Speech Recognition is that it enables
wider accessibility for individuals who have difficulty typing. It also enables
individuals to interact with the computer using only their voice, thereby reducing
the use of a mouse and keyboard while maintaining overall productivity. Additionally,
an interactive training session guides you through the setup process and helps you
familiarize yourself with the voice commands.
Prasanna Meduri, Director-Windows Client, Microsoft India, added, "Windows
Vista Speech Recognition provides excellent recognition accuracy that improves
with each use as it adapts to the user's speaking style and vocabulary."
Speech Recognition is available in English (US), English (UK), German, French,
Spanish, Japanese, Chinese (Traditional), and Chinese (Simplified).
Following the same lines, Nuance Communications has the speech recognition software,
Dragon NaturallySpeaking, which uses the human voice as the primary communication
mechanism between the user and the computer. When Dragon NaturallySpeaking users
create and train their user profile, they start with a standard set of models
and then customize them for the way they speak (acoustic model) and the way
they use words (vocabulary and associated language model). The software employs
the customized user profile to guess the words spoken with accuracy rates of
up to 99%. As the individual uses the software and corrects recognition errors,
the software becomes increasingly accurate. In other words, the more one uses
Dragon, the better it performs, offering greater productivity benefits over
time.
Speech recognition process model
"The language and acoustic models used for speech recognition are created by
teams of linguists, linguistic engineers, and programmers. They analyze
millions of words in documents to gather language model statistics"
- Sunny Rao, Vice President and General Manager-India and Asian Operations,
Nuance Communications
"In Windows Vista, before using Windows Speech Recognition, you have to train
the computer to recognize your voice and spoken commands by creating a voice
profile"
- Prasanna Meduri, Director-Windows Client, Microsoft India
"Our Spoken Web technology aims to transform how people create, build, and
interact with e-commerce sites on the World Wide Web using the spoken
word instead of the written word"
- Dr Guruduth Banavar, Director-IBM India Research Laboratory
Speech recognition software programs are based on statistical
probability. The software analyses an incoming stream of sounds and interprets
those sounds as commands and dictation. This process of interpretation is called
speech recognition, and its success is measured by the percentage of correct
interpretations, or recognition accuracy.
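To make the idea of recognition accuracy concrete, here is a minimal Python sketch. The article does not say how vendors count a "correct interpretation", so the sketch assumes a common convention: compare the recognized text with a reference transcript word by word, count the substitutions, insertions, and deletions needed to turn one into the other, and report the share of reference words that survive. The function names and the sample sentences are illustrative only.

```python
def word_edit_distance(reference, hypothesis):
    """Minimum number of word substitutions, insertions, and deletions
    needed to turn the hypothesis into the reference (Levenshtein distance)."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word missing from hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def recognition_accuracy(reference, hypothesis):
    """Word-level accuracy as a percentage, floored at zero."""
    total = len(reference.split())
    if total == 0:
        raise ValueError("reference transcript is empty")
    errors = word_edit_distance(reference, hypothesis)
    return max(0.0, 100.0 * (total - errors) / total)


# Hypothetical example: one wrong word out of five.
print(recognition_accuracy("please open the new document",
                           "please open a new document"))
```

Under this convention, a program that gets one word wrong in a five-word sentence scores 80%, so a claim of 99% accuracy corresponds to roughly one error in every hundred words.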
Speech recognition is not a single technology; rather, it combines
several. You start building a voice recognition engine by recording
words, phrases, and sentences, and putting them in a database. Next, you
create a library of the specific pronunciations of the different words
to be recognized, and the sounds in the recordings are mapped to those
pronunciations. Finally, you build a large table of the most commonly
occurring patterns of words people are likely to speak. Algorithms are then created
that combine all these sources of information to come up with the right answer
in a specific situation.
"We have been continually exploring how to adapt our voice recognition
engines more quickly to a specific person or sound environment," added
Dr Banavar.
Software like Dragon NaturallySpeaking relies on three
sources of information to achieve high recognition accuracy: an acoustic
model (a mathematical model of the sound patterns used by the speaker's
language); a vocabulary (a list of words that the program can recognize,
each with a text representation and a pronunciation); and a language
model (statistical information associated with a vocabulary that
describes the likelihood of words and sequences of words occurring in the user's
speech).
Sunny Rao, Vice President and General Manager-India and Asian Operations, Nuance
Communications, said, "When you create and train a user profile, you start
with a standard set of models and then customize them for the way you speak
(acoustic model) and the way you use words (vocabulary and associated language
model). The software employs your customized user files to guess the words that
you spoke. The quality and type of noise-canceling microphone is a critical
success factor in implementing speech recognition." As the individual uses the
software and corrects recognition errors, the software becomes increasingly
accurate. Most programs enable users to add new words or customize the vocabulary
for their particular field, such as medicine or law.
Working of this engine
In the past, virtually any attempt at developing a practical
continuous speech dictation program was hampered by the slow speed and limited
processing power of most personal computers. However, with faster processing
speeds and larger random access memory capabilities of today's personal
computers, as well as major advances in the algorithms behind speech recognition
technology, continuous speech dictation programs are now ready to fuel the next
major wave of new applications in the marketplace.
Many of the more advanced algorithms that are driving today's speech recognition
products, particularly those capable of recognizing continuous speech, recognize
words built from phonemes. Phonemes are the basic sound units of a language,
so every word in English can be represented by a sequence of phonemes.
For each language, the greater the contrast between different phonemes or sounds
within the language, the easier it is for the computer to recognize the spoken
words.
True speech recognition is, therefore, the process of capturing
the sounds, identifying them, determining proper context, and stringing them
together to build words, phrases, and sentences. In addition to analyzing
the acoustic information of phonemes, today's programs use a language model
to determine the words in a segment of dictated speech. The language model,
which is different for each spoken language, describes how phrases and sentences
are built from individual words. The model typically includes statistics
on how frequently individual words are used and how often groups of two
words (bigrams) or three words (trigrams) are used together.
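The bigram and trigram statistics mentioned above can be illustrated with a short Python sketch. It is only a toy: the ten-word sample text and the function names are made up for illustration, whereas real language models are estimated from millions of words and use smoothing for word pairs that were never observed.

```python
from collections import Counter


def ngram_counts(words, n):
    """Count every run of n consecutive words in the list."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def bigram_probability(words, first, second):
    """Estimate how likely `second` is to follow `first`, as
    count(first, second) / count(first used as a preceding word)."""
    bigrams = ngram_counts(words, 2)
    contexts = Counter(words[:-1])  # the last word never precedes anything
    if contexts[first] == 0:
        return 0.0
    return bigrams[(first, second)] / contexts[first]


# Hypothetical sample text standing in for a large document collection.
sample = "send the mail to the manager and send the report".split()

print(ngram_counts(sample, 2).most_common(3))     # most frequent bigrams
print(ngram_counts(sample, 3)[("send", "the", "mail")])
print(bigram_probability(sample, "the", "mail"))  # one of three uses of "the"
```

In this sample "send" is always followed by "the", so a recognizer that has just heard "send" would strongly favor "the" as the next word.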
Rao emphasized, "The language and acoustic models used for speech recognition
are created by teams of linguists, linguistic engineers, and programmers. They
analyze millions of words in documents to gather language model statistics.
They sample thousands of bits of speech from different speakers to develop acoustic
models of the phonemes and words. They then develop a 'recipe' for
how to combine these two different bodies of information to determine the words
in a sample of speech."
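One simple way to picture such a recipe is sketched below in Python. It is not Nuance's method, which the article does not describe. It assumes the widely used approach of adding the logarithm of an acoustic score to a weighted logarithm of a language model probability and choosing the candidate word with the highest total. All of the probabilities, the `lm_weight` parameter, and the fallback `floor` value are invented for illustration.

```python
import math


def pick_word(acoustic_probs, lm_probs, lm_weight=1.0, floor=1e-6):
    """Return the candidate word with the best combined score.

    acoustic_probs: how well each candidate matches the sound that was heard.
    lm_probs: how likely each candidate is, given the preceding words.
    lm_weight: how much influence the language model gets (assumed knob).
    floor: probability used for words the language model has not seen.
    """
    def score(word):
        acoustic = math.log(acoustic_probs[word])
        language = math.log(lm_probs.get(word, floor))
        return acoustic + lm_weight * language

    return max(acoustic_probs, key=score)


# Hypothetical case: the user says "I am going to ..." and the three
# candidates sound almost identical.
acoustic = {"two": 0.40, "to": 0.35, "too": 0.25}
after_going = {"to": 0.60, "too": 0.05, "two": 0.01}

print(pick_word(acoustic, after_going))                 # language model breaks the tie
print(pick_word(acoustic, after_going, lm_weight=0.0))  # acoustics alone
```

With the language model switched on, the sketch chooses "to" even though "two" is the slightly better acoustic match, which is the kind of context-driven correction the combined models make possible.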
To build such algorithms, or sets of calculations and instructions, linguistic
engineers record thousands of bits of speech and identify the stresses, accents,
and tones of human speech. All this is done to represent the nearly infinite
variety in the way people speak, the patterns they use, and the pitch of their
voices. The task has been to create a recognition engine based on algorithms
that can recognize human speech regardless of who the speaker is. In other
words, the engine is speaker-independent. Over time, however, the programs can learn
by adapting to the user. They acquire more specific phrasing and usage,
and become more accustomed to the user's speech and tone.
Meduri further pointed out, "In Windows Vista, before using Windows
Speech Recognition, you have to train the computer to recognize your voice and
spoken commands by creating a voice profile." This ensures that the computer
recognizes the speaker.
Similarly, the IBM desktop Hindi speech recognition system is a continuous speech,
speaker-independent recognizer. The system has a vocabulary of 65,000 words
(to be increased to 125,000).
With speaker training, accuracy levels in excess of 90% can be achieved.
A novel technique of building phonetic spellings of Hindi words has been used
to generate the vocabulary in the system. The system displays output in Unicode
Hindi characters. Algorithms have been developed to perform speaker adaptation
of the system in order to improve accuracy with speaker training. The system
provides capabilities to add user-specific words to the existing vocabulary.
Dr Banavar added, "These technologies aid in adapting the IBM ViaVoice
recognition system for the Hindi language. The plan is to cover more Indian
languages and then to build a multilingual speech recognizer for Indian
languages based on a multilingual phone set."
Boosting productivity
Speech recognition software helps write documents and e-mails, search the Web
and even control PCs entirely by voice, saving time and boosting productivity.
To take one example, with Speech Recognition in Windows Vista, one can dramatically
reduce the use of the keyboard and mouse by performing tasks with speech. The
user can control the computer by voice, including
starting or switching applications and selecting menus and buttons.
Likewise, Dragon NaturallySpeaking allows for improved productivity almost immediately,
as it is easy to use, allowing most users to be up and running in less than
10 minutes. It also enables them to create documents and e-mails three times
faster than typing and with recognition accuracy rates of up to 99%. Dragon
allows users to quickly create detailed and accurate reports without any
spelling errors.
Other aspects
This software platform helps build several services that will
enable users to access information, knowledge, and even entertainment by just
voicing their needs. Dragon NaturallySpeaking is an example of speaker-dependent
speech recognition technology and is designed to work on the desktop as a productivity
tool for converting speech into text for document creation and command/control
of the PC.
Nuance also develops speaker-independent speech recognition technologies that
allow people to access information using their voice via landline or mobile
phones. This technology platform is vastly different when compared to speaker-dependent
technologies and is often deployed in a call center environment to automate
transactions.
This platform works well in noisy environments too. Virtually every desktop-dictation
program available today requires the use of a good-quality microphone. The better
microphones are typically headset-style, making it easy for the user to adjust
the unit so it is consistently the same distance from the speaker's mouth.
This captures voice input at a consistent volume and makes it easier for the
program to recognize the speaker's words.
The recognition task is further simplified by removing background noise. Dragon
NaturallySpeaking can screen out steady background noise and common sounds such
as a ringing phone or door being slammed. Sudden noises such as sneezes, coughs,
or alarms are more difficult to predict and can result in false words popping
into the document. Still, higher-quality microphones with noise-canceling capabilities
can reduce the noise and the errors significantly. Dragon NaturallySpeaking
comes with a high-quality, noise-canceling headset.
Rao concluded, "Speech recognition technology has reached a critical turning
point where it is now fundamentally changing the way that we interact with the
world around us."
Overall, the demand for speech recognition technology is expected to increase
significantly over the next few years as people use their mobile phones as all-purpose
lifestyle devices. In-car entertainment and navigation systems are increasingly
controlled by voice commands. This growth in adoption is being fuelled by steady
improvements in speech recognition accuracy.
nivedan.prakash@expressindia.com