www.expresscomputeronline.com WEEKLY INSIGHT FOR TECHNOLOGY PROFESSIONALS
22 December 2008  
Lead

Voice-controlled engine

Speech recognition has come as a boon to users, enabling them to access not only information and knowledge but also entertainment simply by voicing their needs, writes Nivedan Prakash

Speech recognition has always been a subject of much interest in human-computer interaction. Though the technology is not new, it has gotten better over the years.

Before getting into the details of the technology, let us understand what speech recognition software is all about. Speech recognition software is the technology that allows you to control a computer by speaking to it. When you talk, the software decides whether to convert your voice into text in order to let you dictate documents or e-mail messages, or interpret what you say as a voice command and take an action.

While relatively simple to use, speech recognition software is a highly sophisticated technology that leverages ‘language modeling’ to recognize and differentiate among the millions of human utterances that make up any language. Using statistical probability, speech recognition software programs analyze an incoming stream of sound and interpret those sounds as commands and dictation. This process of interpretation is called speech recognition.

Companies like IBM, Microsoft, and Nuance Communications are pioneers on the speech technology front. IBM has a desktop Hindi speech recognition technology that provides a natural interface for human-computer interaction. This technology transcribes Hindi speech into text and suits a country like India, where a large portion of the population is not familiar enough with the English language or the keyboard to use computers.

This core technology developed by IRL (IBM Research Laboratory) can be used for desktop applications such as data entry, letter-writing, sending e-mails, as well as to speech-enable ATMs, kiosks and other devices. You can also use the software for issuing commands to a computer and for IVR (interactive voice response) applications in telephony. As the software supports Unicode, developers can integrate it with word processing and e-mail applications. The software produces word recognition rates in the range of 90-95% with speaker adaptation and 80-90% without.

Dr Guruduth Banavar, Director-IBM India Research Laboratory, stated, “Our researchers are also exploring areas where the social impact could be huge. For instance, the Spoken Web, driven out of IBM India Research Lab, is also being incubated in IBM’s eight global labs in six countries. Spoken Web technology aims to transform how people create, build, and interact with e-Commerce sites on the World Wide Web using the spoken word instead of the written.”

Similarly, Microsoft’s Windows Speech Recognition in Windows Vista empowers users to interact with their computer by voice. The Windows Vista speech recognition experience places users in control and enables them to become proficient at using speech recognition quickly. Using speech recognition in the operating system, you can dictate e-mails and documents and use your voice to control your favorite programs and navigate the Web. It enables you to create documents quickly and helps you work with reduced risk of repetitive stress injuries.

One of the most important aspects of Windows Speech Recognition is that it makes computing more accessible to individuals who have difficulty typing. It also enables individuals to interact with the computer using only their voice, thereby reducing the use of a mouse and keyboard while maintaining overall productivity. Additionally, an interactive tutorial guides you through the setup process and helps you familiarize yourself with the voice commands.

Prasanna Meduri, Director–Windows Client, Microsoft India, added, “Windows Vista Speech Recognition provides excellent recognition accuracy that improves with each use as it adapts to the user’s speaking style and vocabulary.” Speech Recognition is available in English (US), English (UK), German, French, Spanish, Japanese, Chinese (Traditional), and Chinese (Simplified).

Following the same lines, Nuance Communications has the speech recognition software, Dragon NaturallySpeaking, which uses the human voice as the primary communication mechanism between the user and the computer. When Dragon NaturallySpeaking users create and train their user profile, they start with a standard set of models and then customize them for the way they speak (acoustic model) and the way they use words (vocabulary and associated language model). The software employs the customised user profile to guess the words spoken with accuracy rates of up to 99%. As the individual uses the software and corrects recognition errors, the software becomes increasingly accurate. In other words, the more one uses Dragon, the better it performs—offering greater productivity benefits over time.

Speech recognition process model

Speech recognition software programs are based on statistical probability. The software analyses an incoming stream of sounds and interprets those sounds as commands and dictation. This process of interpretation is called speech recognition, and its success is measured by the percentage of correct interpretations, or recognition accuracy.
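
One common way to put a number on recognition accuracy is to compare the recognizer's output with a reference transcript, word by word. The sketch below is an illustration only, not the method used by any product named in this article: it counts the minimum number of word substitutions, deletions, and insertions (a word-level edit distance) and reports word accuracy as a percentage of the reference length. The function names and the sample sentences are assumptions made for the example.

```python
def word_errors(reference, hypothesis):
    """Minimum number of word substitutions, deletions and insertions
    needed to turn the reference transcript into the recognizer output."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word was dropped
                            curr[j - 1] + 1,      # extra word was inserted
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(reference, hypothesis):
    """Word accuracy as a percentage of the reference length."""
    n = len(reference.split())
    if n == 0:
        raise ValueError("reference transcript is empty")
    return 100.0 * (n - word_errors(reference, hypothesis)) / n


# Hypothetical dictation: one word out of five is misrecognized.
print(word_accuracy("please open the new document",
                    "please open a new document"))
```

Because insertions also count as errors, this measure can fall below zero for very poor output; vendors may define their published accuracy figures differently.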

Speech recognition is not a single technology; rather, it combines many of them. You start building a voice recognition engine by recording words, phrases, and sentences, and then putting them in a database. Then you have to create a library of the specific pronunciations of the different words to be recognized. Next, the sounds in the recordings are mapped to the word pronunciations. Finally, you have to build a large table of the most commonly occurring patterns of words people are likely to speak. Algorithms are created that combine all these sources of information to come up with the right answer in a specific situation.

“We have been continually exploring how to adapt their voice recognition engines more quickly to a specific person or sound environment,” added Dr Banavar.

Software such as Dragon NaturallySpeaking relies on three sources of information to achieve high recognition accuracy: an acoustic model, which is a mathematical model of the sound patterns used in the speaker’s language; a vocabulary, which is a list of words that the program can recognize, each with a text representation and a pronunciation; and a language model, which is statistical information associated with a vocabulary that describes the likelihood of words and sequences of words occurring in the user’s speech.
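
To make the interplay of these three sources concrete, here is a toy sketch of how a recognizer might choose between two candidate transcriptions. It is not Dragon NaturallySpeaking's actual algorithm: the vocabulary, the scores, and the `lm_weight` parameter are all invented for illustration. Each candidate gets an acoustic score (how well it matches the sound) and a language model score (how likely the word sequence is), both as log-probabilities, and the candidate with the best combined score wins.

```python
# Hypothetical vocabulary: the words this toy recognizer knows.
VOCABULARY = {"recognize", "speech", "wreck", "a", "nice", "beach"}

# Hypothetical acoustic scores: log-likelihood of the audio given each text.
ACOUSTIC_LOGPROB = {
    "recognize speech": -12.1,
    "wreck a nice beach": -11.8,
}

# Hypothetical language model scores: log-probability of each word sequence.
LANGUAGE_LOGPROB = {
    "recognize speech": -4.0,
    "wreck a nice beach": -9.5,
}


def best_hypothesis(acoustic, language, vocabulary, lm_weight=1.0):
    """Return the in-vocabulary candidate with the highest combined score."""
    def in_vocabulary(text):
        return all(word in vocabulary for word in text.split())

    def score(text):
        return acoustic[text] + lm_weight * language[text]

    candidates = [text for text in acoustic if in_vocabulary(text)]
    return max(candidates, key=score)


# The acoustic model slightly prefers the wrong phrase, but the language
# model knows that "recognize speech" is a far more likely word sequence.
print(best_hypothesis(ACOUSTIC_LOGPROB, LANGUAGE_LOGPROB, VOCABULARY))
```

Setting `lm_weight` to 0 ignores the language model entirely, in which case the acoustically closer but less plausible phrase would be chosen.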

Sunny Rao, Vice President and General Manager-India and Asian Operations, Nuance Communications, said, “When you create and train a user profile, you start with a standard set of models and then customize them for the way you speak (acoustic model) and the way you use words (vocabulary and associated language model). The software employs your customized user files to guess the words that you spoke. The quality and type of noise-canceling microphone is a critical success factor in implementing speech recognition. As the individual uses the software and corrects recognition errors, the software becomes increasingly accurate. Most programs enable users to add new words or customize the vocabulary for their particular field e.g. medical, legal, etc.”

How the engine works

In the past, virtually any attempt at developing a practical continuous speech dictation program was hampered by the slow speed and limited processing power of most personal computers. However, with faster processing speeds and larger random access memory capabilities of today’s personal computers, as well as major advances in the algorithms behind speech recognition technology, continuous speech dictation programs are now ready to fuel the next major wave of new applications in the marketplace.

Many of the more advanced algorithms that are driving today’s speech recognition products, particularly those capable of recognizing continuous speech, recognize words built from phonemes. Phonemes are the basic units of sound in a language. Every word in English can therefore be represented by a sequence of these phonemes. For each language, the greater the contrast between different phonemes or sounds within the language, the easier it is for the computer to recognize the spoken words.
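
The idea of words as phoneme sequences can be illustrated with a tiny pronunciation dictionary. The sketch below is an assumption-laden example, not any vendor's lexicon: the four entries use ARPAbet-style phoneme symbols, and the helper names are made up. It indexes words by their phoneme sequence so that a recognized sequence can be looked up, and it shows why sound alone is not always enough, since "mail" and "male" share one pronunciation.

```python
from collections import defaultdict

# Hypothetical four-word lexicon using ARPAbet-style phoneme symbols.
LEXICON = {
    "speech": ("S", "P", "IY", "CH"),
    "voice": ("V", "OY", "S"),
    "mail": ("M", "EY", "L"),
    "male": ("M", "EY", "L"),
}


def words_by_pronunciation(lexicon):
    """Index the lexicon by phoneme sequence instead of by spelling."""
    index = defaultdict(list)
    for word, phonemes in lexicon.items():
        index[phonemes].append(word)
    return index


INDEX = words_by_pronunciation(LEXICON)


def candidates(phonemes):
    """Return every known word pronounced as the given phoneme sequence."""
    return sorted(INDEX.get(tuple(phonemes), []))


print(candidates(["V", "OY", "S"]))
print(candidates(["M", "EY", "L"]))
```

When a lookup returns more than one word, as it does for the second call, the recognizer needs context from the surrounding words to decide between them.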

True speech recognition is, therefore, the process of capturing the sounds, identifying them, determining proper context, and stringing them together to actually build words, phrases, and sentences. In addition to analyzing the acoustic information of phonemes, today’s programs use a language model to determine the words in a segment of dictated speech. The language model, which is different for each spoken language, describes how phrases and sentences are built from individual words. The model typically includes statistics on how frequently individual words are used and how often groups of two words (bigrams) or three words (trigrams) are used together.
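
The bigram and trigram statistics mentioned above can be gathered by simple counting. The following is a minimal sketch under stated assumptions: the three-sentence corpus is invented, the function names are hypothetical, and real language models are trained on millions of words and use smoothing for unseen word pairs. It counts bigrams and trigrams and estimates how likely one word is to follow another.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(sentences, n):
    """Count n-grams across a list of sentences (lowercased, split on spaces)."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.lower().split(), n))
    return counts


def bigram_probability(bigram_counts, previous, word):
    """Estimate P(word | previous) from raw bigram counts, without smoothing."""
    context_total = sum(count for (first, _), count in bigram_counts.items()
                        if first == previous)
    if context_total == 0:
        return 0.0
    return bigram_counts[(previous, word)] / context_total


# Hypothetical training text.
CORPUS = [
    "send an e-mail to the team",
    "send an invoice to the client",
    "open the e-mail",
]

BIGRAMS = count_ngrams(CORPUS, 2)
TRIGRAMS = count_ngrams(CORPUS, 3)

print(BIGRAMS[("send", "an")])
print(TRIGRAMS[("to", "the", "team")])
print(bigram_probability(BIGRAMS, "an", "e-mail"))
```

In this toy corpus, "an" is followed once by "e-mail" and once by "invoice", so each continuation is estimated as equally likely.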

Rao emphasized, “The language and acoustic models used for speech recognition are created by teams of linguists, linguistic engineers, and programmers. They analyze millions of words in documents to gather language model statistics. They sample thousands of bits of speech from different speakers to develop acoustic models of the phonemes and words. They then develop a ‘recipe’ for how to combine these two different bodies of information to determine the words in a sample of speech.”

To build such algorithms, or sets of calculations and instructions, linguistic engineers record thousands of bits of speech and identify the stresses, accents, and tones of human speech. All this is done to represent the nearly infinite variety in the way people speak, the patterns they use, and the pitch of their voices. The task has been to create a recognition engine based on algorithms that can recognize human speech regardless of who the speaker is. In other words, the engine is speaker independent. Over time, however, the program can learn by simply adapting to the user. It acquires more specific phrasing and usage, and becomes more accustomed to the user’s speech and tone.

Meduri added, “In Windows Vista, before using Windows Speech Recognition, you have to train the computer to recognize your voice and spoken commands by creating a voice profile. This ensures that the computer recognizes the speaker.”

Similarly, the IBM desktop Hindi speech recognition system is a continuous speech, speaker-independent recognizer. The system has a vocabulary of 65,000 words (to increase to 125,000). With speaker training, accuracy levels in excess of 90% can be achieved. A novel technique of building phonetic spellings of Hindi words has been used to generate the vocabulary in the system. The system displays output in Unicode Hindi characters. Algorithms have been developed to perform speaker adaptation of the system in order to improve accuracy with speaker training. The system provides capabilities to add user-specific words to the existing vocabulary.

Dr Banavar added, “These technologies aid in adapting the IBM ViaVoice recognition system for the Hindi language. The plan is to cover more Indian languages and then to build a multilingual speech recognizer for Indian languages based on a multilingual phone set.”

Boosting productivity

Speech recognition software helps users write documents and e-mails, search the Web, and even control PCs entirely by voice, saving time and boosting productivity. To take one example, with Speech Recognition in Windows Vista, one can dramatically reduce the use of the keyboard and mouse by performing tasks with speech. The user can effortlessly control the computer by voice, including starting or switching applications and selecting menus and buttons.

Likewise, Dragon NaturallySpeaking allows for improved productivity almost immediately, as it is easy to use, allowing most users to be up and running in less than 10 minutes. It also enables them to create documents and e-mails three times faster than typing and with recognition accuracy rates of up to 99%. Dragon allows users to quickly create detailed and accurate reports—without any spelling errors.

Other aspects

This software platform helps build several services that will enable users to access information, knowledge, and even entertainment by just voicing their needs. Dragon NaturallySpeaking is an example of speaker-dependent speech recognition technology and is designed to work on the desktop as a productivity tool for converting speech into text for document creation and command/control of the PC.

Nuance also develops speaker-independent speech recognition technologies that allow people to access information using their voice via landline or mobile phones. This technology platform is vastly different when compared to speaker-dependent technologies and is often deployed in a call center environment to automate transactions.

This platform works well in noisy environments too. Virtually every desktop-dictation program available today requires the use of a good-quality microphone. The better microphones are typically headset-style, making it easy for the user to adjust the unit so it is consistently the same distance from the speaker’s mouth. This captures voice input at a consistent volume and makes it easier for the program to recognize the speaker’s words.

The recognition task is further simplified by removing background noise. Dragon NaturallySpeaking can screen out steady background noise and common sounds such as a ringing phone or a door being slammed. Sudden noises such as sneezes, coughs, or alarms are more difficult to predict and can result in false words popping into the document. Still, higher-quality microphones with noise-canceling capabilities can reduce the noise and the errors significantly. Dragon NaturallySpeaking comes with a high-quality, noise-canceling headset.

Rao concluded, “Speech recognition technology has reached a critical turning point where it is now fundamentally changing the way that we interact with the world around us.”

Overall, the demand for speech recognition technology is expected to increase significantly over the next few years as people use their mobile phones as all-purpose lifestyle devices. In-car entertainment and navigation systems are increasingly controlled by voice commands. This growth in adoption is being fuelled by steady improvements in speech recognition accuracy.

nivedan.prakash@expressindia.com

 


© Copyright 2001: Indian Express Newspapers (Mumbai) Limited (Mumbai, India). All rights reserved throughout the world. This entire site is compiled in Mumbai by the Business Publications Division (BPD) of the Indian Express Newspapers (Mumbai) Limited. Site managed by BPD.