Speech recognition (in many contexts also known as automatic speech recognition or computer speech recognition, or erroneously as voice recognition) is the process of converting a speech signal to a sequence of words by means of an algorithm implemented as a computer program. Speech recognition applications that have emerged in recent years include voice dialing (e.g., "Call home"), call routing (e.g., "I would like to make a collect call"), simple data entry (e.g., entering a credit card number), and preparation of structured documents (e.g., a radiology report).

Voice or speaker recognition is a related process that attempts to identify the person speaking, as opposed to what is being said.

An isolated-word speech recognition system requires that the speaker pause briefly between words, whereas a continuous speech recognition system does not. Spontaneous, or extemporaneously generated, speech contains disfluencies and is much more difficult to recognize than speech read from a script. Some systems require speaker enrollment (a user must provide samples of his or her speech before using them), whereas other systems are said to be speaker-independent, in that no enrollment is necessary. Some of the other parameters depend on the specific task. Recognition is generally more difficult when vocabularies are large or contain many similar-sounding words. When speech is produced as a sequence of words, language models or artificial grammars are used to restrict the combination of words. The simplest language model can be specified as a finite-state network, in which the permissible words following each word are given explicitly. More general language models approximating natural language are specified in terms of a context-sensitive grammar. One popular measure of the difficulty of the task, combining the vocabulary size and the language model, is perplexity, loosely defined as the geometric mean of the number of words that can follow a word after the language model has been applied. In addition, some external parameters can affect speech recognition system performance, including the characteristics of the environmental noise and the type and placement of the microphone.

Speech recognition is a difficult problem, largely because of the many sources of variability associated with the signal.

1. The acoustic realizations of phonemes, the smallest sound units of which words are composed, are highly dependent on the context in which they appear. These phonetic variabilities are exemplified by the acoustic differences of the phoneme /t/ in "two", "true", and "butter" in American English. At word boundaries, contextual variations can be quite dramatic, making "gas shortage" sound like "gash shortage" in American English, and "devo andare" sound like "devandare" in Italian.
2. Acoustic variabilities can result from changes in the environment as well as in the position and characteristics of the transducer.
3. Within-speaker variabilities can result from changes in the speaker's physical and emotional state, speaking rate, or voice quality.
4. Differences in sociolinguistic background, dialect, and vocal tract size and shape can contribute to across-speaker variabilities.

Speech Recognition Technology

In terms of technology, most technical textbooks nowadays emphasize the Hidden Markov Model as the underlying technology. Dynamic algorithm approaches, neural-network-based approaches and knowledge-based learning approaches were studied intensively in the 1980s and 1990s.

Performance of Speech Recognition Systems

Depending on several factors, speech recognition systems can show a wide range of performance as measured by word error rate. These factors include the environment, the speaking rate of the speaker, and the context (or the grammar) being used in recognition.

Most speech recognition users would tend to agree that dictation machines can achieve very high performance in controlled conditions. Confusion about achievable performance comes partly from the mixed usage of the terms speech recognition and dictation.

Speaker-dependent dictation systems requiring a short period of training can capture continuous speech with a large vocabulary at a normal pace with very high accuracy. Most commercial companies claim that recognition software can achieve between 98% and 99% accuracy (getting one to two words out of one hundred wrong) if operated under optimal conditions. These optimal conditions usually mean that the test subjects have 1) speaker characteristics matching the training data, 2) proper speaker adaptation, and 3) a clean acoustic environment (e.g. a quiet office). This explains why some users, especially those with accents, might perceive a recognition rate much lower than the expected 98% to 99%.

Other, limited-vocabulary systems requiring no training can recognize a small number of words (for instance, the ten digits) from most speakers. Such systems are popular for routing incoming phone calls to their destinations in large organisations.

Use

Commercial systems for speech recognition have been available off-the-shelf since the 1990s. Despite the apparent success of the technology, few people use such speech recognition systems on their desktop computers. It appears that most computer users can create and edit documents and interact with their computer more quickly with conventional input devices, a keyboard and mouse, despite the fact that most people are able to speak considerably faster than they can type. Using both keyboard and speech recognition simultaneously, however, can in some cases be more efficient than using either of these inputs alone. A typical office environment, with a high amplitude of background speech, is one of the most adverse environments for current speech recognition technologies, and large-vocabulary, speaker-independent systems designed to operate within these adverse environments have significantly lower recognition accuracy. The typical achievable recognition rate as of 2005 for large-vocabulary speaker-independent systems is about 80% to 90% in a clear environment, but can be as low as 50% in scenarios such as a cellular phone with background noise. Additionally, heavy use of the speech organs can result in vocal loading.

Speech recognition systems have found use where text input is required to be extremely fast. They are used in legal and medical transcription and in the generation of subtitles for live sports and current affairs programs on television. This is done not directly but via an operator who re-speaks the dialog into software trained on the operator's voice. In such cases the operator also has special training: first, to speak clearly and consistently to maximize recognition accuracy; second, to indicate punctuation by various techniques; and often also domain-specific training (especially in medical or legal contexts, where the operator needs to know specialized vocabulary and procedures). In courtrooms and similar situations where the operator's voice would disturb the proceedings, he or she may sit in a soundproofed booth or wear a Stenomask or similar device.

Speech recognition is sometimes a necessity for people who have difficulty interacting with their computers through a keyboard, for example those with serious carpal tunnel syndrome, impaired extremities, or other physical limitations.

Speech recognition technology is used more and more for telephone applications such as travel booking and information, financial account information, customer service call routing, and directory assistance. Using constrained grammar recognition (described below), such applications can achieve remarkably high accuracy. Research and development in speech recognition technology has continued to grow as the cost of implementing such voice-activated systems has dropped and the usefulness and efficiency of these systems have improved.

For example, recognition systems optimized for telephone applications can often supply information about the confidence of a particular recognition. If the confidence is low, the application can prompt callers to confirm or repeat their request (for example, "I heard you say 'billing', is that right?").

Furthermore, speech recognition has enabled the automation of certain applications that cannot be automated using push-button interactive voice response (IVR) systems, such as directory assistance and systems that allow callers to "dial" by speaking names listed in an electronic phone book. Nevertheless, speech-recognition-based systems remain the exception because push-button systems are still much cheaper to implement and operate.

Speech recognition is also used for speech fluency evaluation and language instruction.
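The word error rate used above as the performance measure can be illustrated with a short sketch. It assumes the common definition (substitutions, deletions and insertions needed to turn the reference transcript into the recognizer output, divided by the number of reference words) and whitespace tokenization; the function name `word_error_rate` is chosen here for illustration and is not taken from any particular toolkit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of two reference words.
    print(word_error_rate("gas shortage", "gash shortage"))
```

Under this definition, a claimed accuracy of 98% to 99% corresponds roughly to a word error rate of 0.01 to 0.02. Because insertions are counted, the rate can exceed 1.0 for very poor output.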
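The finite-state network language model and the loose definition of perplexity given earlier can be sketched as follows. The toy call-routing grammar, the `<start>` and `<end>` markers and the function names are all hypothetical, invented for this illustration. The perplexity computed here is the geometric mean of the number of permitted successors along one sentence, which matches the standard probabilistic definition only if every permitted successor is assumed equally likely.

```python
import math

# Hypothetical toy grammar: each word maps to the words allowed to follow it.
NETWORK = {
    "<start>": {"call", "dial"},
    "call": {"home", "office"},
    "dial": {"home", "office", "operator"},
    "home": {"<end>"},
    "office": {"<end>"},
    "operator": {"<end>"},
}


def is_permitted(words, network=NETWORK):
    """Return True if the word sequence is a path through the network."""
    path = ["<start>", *words, "<end>"]
    return all(nxt in network.get(cur, set())
               for cur, nxt in zip(path, path[1:]))


def branching_perplexity(words, network=NETWORK):
    """Geometric mean of the number of permitted successors at each step."""
    if not is_permitted(words, network):
        raise ValueError("sentence is not permitted by the network")
    path = ["<start>", *words]
    counts = [len(network[word]) for word in path]
    return math.exp(sum(math.log(c) for c in counts) / len(counts))


if __name__ == "__main__":
    print(is_permitted(["call", "home"]))       # permitted path
    print(is_permitted(["call", "operator"]))   # "operator" cannot follow "call"
    print(branching_perplexity(["dial", "operator"]))
```

A network with fewer permitted successors per word gives a lower perplexity, which is one way to see why constrained grammars make the recognition task easier.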
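The confidence-based confirmation behaviour described for telephone applications might look like the sketch below. The thresholds, the function name `next_prompt` and the assumption that confidence is a score between 0 and 1 are illustrative choices, not the interface of any particular recognizer or IVR platform.

```python
def next_prompt(hypothesis: str, confidence: float,
                confirm_below: float = 0.8,
                reject_below: float = 0.4) -> str:
    """Choose the application's next prompt from a recognition result.

    Both thresholds are hypothetical; a real deployment would tune them.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie between 0 and 1")
    if confidence < reject_below:
        # Too uncertain to guess: ask the caller to repeat.
        return "Sorry, I didn't catch that. Please repeat your request."
    if confidence < confirm_below:
        # Plausible but uncertain: ask the caller to confirm.
        return f"I heard you say '{hypothesis}', is that right?"
    # Confident enough to act without confirmation.
    return f"Routing your call to {hypothesis}."


if __name__ == "__main__":
    print(next_prompt("billing", 0.55))
```

Using two thresholds rather than one keeps confirmation prompts for borderline results only, so that confident recognitions are not slowed down by an extra question.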