Welcome to your ultimate assistive technology resource


Speech recognition software guide

Speech recognition (in many contexts also known as automatic speech recognition, computer speech recognition or, erroneously, voice recognition) is the process of converting a speech signal to a sequence of words by means of an algorithm implemented as a computer program. Speech recognition applications that have emerged in recent years include voice dialing (e.g., "Call home"), call routing (e.g., "I would like to make a collect call"), simple data entry (e.g., entering a credit card number), and preparation of structured documents (e.g., a radiology report).

Voice or speaker recognition is a related process that attempts to identify the person speaking, as opposed to what is being said.

An isolated-word speech recognition system requires that the speaker pause briefly between words, whereas a continuous speech recognition system does not. Spontaneous, or extemporaneously generated, speech contains disfluencies and is much more difficult to recognize than speech read from a script. Some systems require speaker enrollment (a user must provide samples of his or her speech before using them), whereas other systems are said to be speaker-independent, in that no enrollment is necessary. Some of the other parameters depend on the specific task. Recognition is generally more difficult when vocabularies are large or have many similar-sounding words. When speech is produced in a sequence of words, language models or artificial grammars are used to restrict the combination of words. The simplest language model can be specified as a finite-state network, where the permissible words following each word are explicitly given. More general language models approximating natural language are specified in terms of a context-sensitive grammar. One popular measure of the difficulty of the task, combining the vocabulary size and the language model, is perplexity, loosely defined as the geometric mean of the number of words that can follow a word after the language model has been applied. In addition, there are some external parameters that can affect speech recognition system performance, including the characteristics of the environmental noise and the type and placement of the microphone.

Speech recognition is a difficult problem, largely because of the many sources of variability associated with the signal.
1. The acoustic realizations of phonemes, the smallest sound units of which words are composed, are highly dependent on the context in which they appear. These phonetic variabilities are exemplified by the acoustic differences of the phoneme /t/ in "two", "true", and "butter" in American English. At word boundaries, contextual variations can be quite dramatic, making "gas shortage" sound like "gash shortage" in American English, and "devo andare" sound like "devandare" in Italian.
2. Acoustic variabilities can result from changes in the environment as well as in the position and characteristics of the transducer.
3. Within-speaker variabilities can result from changes in the speaker's physical and emotional state, speaking rate, or voice quality.
4. Differences in sociolinguistic background, dialect, and vocal tract size and shape can contribute to across-speaker variabilities.

Speech Recognition Technology

In terms of technology, most technical textbooks nowadays emphasize the use of hidden Markov models as the underlying technology. The dynamic programming approach, the neural network-based approach and the knowledge-based learning approach were studied intensively in the 1980s and 1990s.

Performance of Speech Recognition Systems

Speech recognition systems can vary widely in performance, as measured by word error rate, depending on several factors. These factors include the environment, the speaking rate of the speaker, and the context (or the grammar) being used in recognition.

Most speech recognition users would tend to agree that dictation machines can achieve very high performance in controlled conditions. Part of the confusion comes mainly from the mixed usage of the terms speech recognition and dictation.

Speaker-dependent dictation systems requiring a short period of training can capture continuous speech with a large vocabulary at normal pace with very high accuracy. Most commercial companies claim that recognition software can achieve 98% to 99% accuracy (getting one to two words out of one hundred wrong) if operated under optimal conditions. These optimal conditions usually mean that the test subjects have 1) speaker characteristics matching the training data, 2) proper speaker adaptation, and 3) a clean environment (e.g. a quiet office space). (This explains why some users, especially those with accents, may find that the recognition rate is perceptibly much lower than the expected 98% to 99%.)

Other, limited-vocabulary systems that require no training can recognize a small number of words (for instance, the ten digits) from most speakers. Such systems are popular for routing incoming phone calls to their destinations in large organisations.

Use

Commercial systems for speech recognition have been available off-the-shelf since the 1990s. Despite the apparent success of the technology, few people use such speech recognition systems on their desktop computers. It appears that most computer users can create and edit documents and interact with their computer more quickly with conventional input devices, a keyboard and mouse, despite the fact that most people are able to speak considerably faster than they can type. Using both keyboard and speech recognition simultaneously, however, can in some cases be more efficient than using either of these inputs alone. A typical office environment, with a high amplitude of background speech, is one of the most adverse environments for current speech recognition technologies, and large-vocabulary, speaker-independent systems that are designed to operate within these adverse environments have significantly lower recognition accuracy. The typical achievable recognition rate as of 2005 for large-vocabulary speaker-independent systems is about 80%-90% for a clear environment, but can be as low as 50% for scenarios like a cellular phone with background noise. Additionally, heavy use of the speech organs can result in vocal loading.

Speech recognition systems have found use where the speed of text input is required to be extremely fast. They are used in legal and medical transcription and in the generation of subtitles for live sports and current affairs programs on television. Recognition here is not applied to the original audio directly but via an operator who re-speaks the dialog into software trained on the operator's voice. In such cases the operator also has special training: first, to speak clearly and consistently to maximize recognition accuracy; second, to indicate punctuation by various techniques; and often also domain-specific training (especially in medical or legal contexts, where the operator needs to know specialized vocabulary and procedures). In courtrooms and similar situations where the operator's voice would disturb the proceedings, he or she may sit in a soundproofed booth or wear a Stenomask or similar device.

Speech recognition is sometimes a necessity for people who have difficulty interacting with their computers through a keyboard, for example, those with serious carpal tunnel syndrome, impaired extremities, or other physical limitations.

Speech recognition technology is used more and more for telephone applications like travel booking and information, financial account information, customer service call routing, and directory assistance. Using constrained grammar recognition (described below), such applications can achieve remarkably high accuracy. Research and development in speech recognition technology has continued to grow as the cost of implementing such voice-activated systems has dropped and the usefulness and efficiency of these systems have improved.

For example, recognition systems optimized for telephone applications can often supply information about the confidence of a particular recognition, and if the confidence is low, it can trigger the application to prompt callers to confirm or repeat their request (for example "I heard you say 'billing', is that right?").

Furthermore, speech recognition has enabled the automation of certain applications that are not automatable using push-button interactive voice response (IVR) systems, like directory assistance and systems that allow callers to "dial" by speaking names listed in an electronic phone book. Nevertheless, speech-recognition-based systems remain the exception because push-button systems are still much cheaper to implement and operate.

Speech recognition is also used for speech fluency evaluation and language instruction.
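
The finite-state network language model and the loose definition of perplexity given earlier can be illustrated with a small sketch. The toy voice-dialing grammar, the start and end markers, and the function names below are all assumptions made for illustration; none of them come from a real recognizer. The sketch also treats every permitted successor as equally likely, which matches the loose "geometric mean of the number of words that can follow a word" wording; practical language models weight successors by probability.

```python
import math

# Hypothetical toy grammar: each word maps to the words permitted to follow it.
# "<s>" and "</s>" are assumed start and end markers.
NETWORK = {
    "<s>": ["call", "dial"],
    "call": ["home", "work", "mom"],
    "dial": ["home", "work"],
    "home": ["</s>"],
    "work": ["</s>"],
    "mom": ["</s>"],
}


def is_permitted(words):
    """Return True if the word sequence follows an allowed path in the network."""
    state = "<s>"
    for word in list(words) + ["</s>"]:
        if word not in NETWORK.get(state, []):
            return False
        state = word
    return True


def branching_perplexity(words):
    """Geometric mean of the number of permitted successors along the path."""
    if not is_permitted(words):
        raise ValueError("word sequence is not permitted by the network")
    path = ["<s>"] + list(words)
    log_sum = sum(math.log(len(NETWORK[word])) for word in path)
    return math.exp(log_sum / len(path))
```

For "call home" the recognizer chooses among 2, then 3, then 1 continuation, so the value is the cube root of 6, a little under 2. A larger vocabulary or a looser grammar raises this number, which is one way to see why such tasks are harder.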

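Word error rate, the performance measure named above, is commonly computed as the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch under that assumption; the function name is illustrative, and real evaluations usually normalize case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Under this measure, the claimed 98% to 99% accuracy corresponds to a word error rate of 0.01 to 0.02, and mishearing "gas shortage" as "gash shortage" gives 0.5.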

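The confidence-based confirmation described for telephone applications can be sketched as a simple threshold check. The Recognition type, the 0.6 threshold, and the wording of the routing message are hypothetical; real systems expose confidence in their own formats and tune the threshold per application.

```python
from dataclasses import dataclass


@dataclass
class Recognition:
    text: str          # what the recognizer believes the caller said
    confidence: float  # assumed to lie between 0.0 and 1.0


# Hypothetical cut-off below which the caller is asked to confirm.
CONFIRM_THRESHOLD = 0.6


def next_prompt(result: Recognition, threshold: float = CONFIRM_THRESHOLD) -> str:
    """Act on a confident result, otherwise ask the caller to confirm it."""
    if result.confidence >= threshold:
        return f"Routing your call to {result.text}."
    return f"I heard you say '{result.text}', is that right?"
```

A low-confidence result such as Recognition("billing", 0.35) produces the confirmation question, while a high-confidence one is acted on directly.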
