
About Speech Recognition

A speech signal is complex and encodes far more information than can be analyzed and processed in real time. Speech recognition systems therefore typically use a speech feature extractor as a front-end pre-processor: its input is a sequence of speech samples, and its output is a sequence of speech vectors that characterize the key features of the input speech. These vectors are then fed into a recognizer that determines which words were spoken.
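The first step of any such front end is to slice the signal into short, overlapping frames, one per output speech vector. A minimal sketch in NumPy (the frame and hop sizes are illustrative values for 16 kHz audio, not parameters from any particular system):

```python
import numpy as np

def frames(signal, frame_len=400, hop=160):
    """Split a 1-D speech signal into overlapping frames.

    With 16 kHz audio, frame_len=400 and hop=160 give the common
    25 ms window / 10 ms step; a feature extractor then maps each
    frame to one speech vector.
    """
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n)])

x = np.random.randn(16000)  # one second of 16 kHz "speech"
print(frames(x).shape)      # (98, 400): 98 frames of 400 samples each
```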

The most common front end extracts Mel-frequency Cepstral coefficients (MFCs), but recent research indicates that the Auditory Image Model (AIM) may also offer high performance. Hidden Markov models (HMMs) are the back-end recognizer in all commercial products, but neural network recognizers are also being studied; one neural network approach capable of recognizing continuous speech is Learning Vector Quantization (LVQ).

Tanner Labs has studied and enhanced all of the aforementioned front ends and back ends, with a focus on computationally efficient algorithms and implementations, and on systems and architectures that achieve high performance even under noisy, stressful, or otherwise adverse conditions. For example, we are currently developing lip-reading technology to exploit visual cues, which play a significant role in human speech perception even in listeners with normal hearing; the McGurk Effect, in which conflicting visual and auditory cues change the sound a listener perceives, demonstrates the strength of this visual influence.

Mel-Frequency Cepstral Coefficients (MFCs)

Mel-frequency Cepstral coefficients (MFCs) are obtained by taking a Fourier transform of short speech segments, mapping the resulting power spectrum onto the perceptually motivated mel frequency scale with a bank of overlapping triangular filters, taking the logarithm of the filter-bank energies, and applying an inverse transform (in practice, a discrete cosine transform) to produce the cepstral coefficients. MFCs have proved to be useful features for speech recognition, and they are widely used both commercially and in research environments.
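The computation can be sketched in a few dozen lines of NumPy. This is an illustrative implementation with typical parameter choices (26 mel filters, 13 coefficients, a Hamming window), not the extractor used in any particular product:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters evenly spaced on the mel scale."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):                 # rising edge
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                 # falling edge
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

def mfc(frame, sr=16000, n_filters=26, n_ceps=13):
    """One windowed speech frame -> one vector of cepstral coefficients."""
    n_fft = len(frame)
    spec = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2
    energies = mel_filterbank(n_filters, n_fft, sr) @ spec
    log_e = np.log(energies + 1e-10)
    # inverse transform: DCT-II of the log filter-bank energies
    k = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), k + 0.5) / n_filters)
    return dct @ log_e

frame = np.random.randn(512)
print(mfc(frame).shape)  # (13,)
```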

Tanner Labs has used MFC extractor front-ends in much of our research, and we have studied whether various additional parameters can improve performance.

Auditory Image Model (AIM)

Tanner Labs has studied various biologically inspired pre-processors, including the Slaney and Lyon EAR model, but the one we have focused our research on has been the Auditory Image Model (AIM) developed by Patterson and Holdsworth (see also Holdsworth).

AIM closely mimics what is known about the firing pattern of the mammalian auditory nerve. It consists of three stages: a filter bank, a logarithmic compressor/rectifier, and an adaptive threshold mechanism that provides lateral inhibition in time and frequency. Adaptive thresholding is the key feature that distinguishes AIM from an ordinary filter-bank front end, and it operates in two dimensions.

The temporal threshold provides a short-term memory of recent activity, so that high activity masks subsequent weak activity. This gives AIM the ability to discriminate between impulses (clicks) and formants (an impulse passed through a resonance). The spectral threshold is a smoothed version of the spectral envelope: strongly activated frequency channels suppress neighboring channels that are only weakly activated. Together, the two thresholds implement a form of lateral inhibition, a mechanism found throughout the mammalian nervous system for sharpening responses.
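The following toy NumPy sketch illustrates the idea of two-dimensional adaptive thresholding. It is our own simplified illustration of the mechanism described above, with arbitrarily chosen decay and smoothing parameters, not the actual AIM algorithm of Patterson and Holdsworth:

```python
import numpy as np

def adaptive_threshold(activity, t_decay=0.9, f_kernel=(0.25, 0.5, 0.25)):
    """Two-dimensional adaptive thresholding (illustrative, not AIM itself).

    activity: array of shape (frames, channels), the non-negative
    output of a compressing/rectifying filter bank.  Each frame passes
    only the activity that exceeds a threshold combining:
      * a temporal term: a leaky per-channel memory of recent activity,
        so strong events mask subsequent weak ones;
      * a spectral term: a smoothed copy of the current spectral
        envelope, so strong channels suppress weak neighbors.
    """
    out = np.zeros_like(activity)
    temporal = np.zeros(activity.shape[1])
    for t, frame in enumerate(activity):
        spectral = np.convolve(frame, f_kernel, mode="same")
        threshold = np.maximum(temporal, spectral)
        out[t] = np.maximum(frame - threshold, 0.0)
        temporal = np.maximum(frame, temporal * t_decay)  # leaky memory
    return out

# A sustained tone in one channel: only the onset survives thresholding.
sustained = np.zeros((10, 8))
sustained[:, 3] = 1.0
resp = adaptive_threshold(sustained)
print(resp[0, 3], resp[1, 3])  # 0.5 0.0
```

Note how the output emphasizes onsets and spectral peaks while suppressing sustained or spatially flat activity, which is the qualitative behavior the two thresholds are designed to produce.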

Tanner Labs has developed a computationally efficient hardware implementation of AIM that includes a custom analog IC as well as reprogrammable digital logic. As part of this development process, we performed extensive AIM simulations in order to refine the algorithm and to determine optimal parameter values.

Hidden Markov Models (HMMs)

The typical single-word HMM consists of some 4-6 states connected in a linear chain (see figure below), with each state roughly corresponding to a phoneme. An utterance is a sequence of speech vectors, and the HMM produces an utterance as follows. It begins in state 1. At each time step, it emits a speech vector drawn from a state-specific probability distribution, and then either remains in the current state or moves on to the next state according to a second probability distribution. These two steps are repeated until the HMM reaches the last state of the chain. During training, a set of n HMMs is constructed, one for each word, and a number of training utterances are used to estimate the probability distributions. During classification, a test utterance is compared against each of the n models, the probability that each model produced the sequence is calculated, and the word whose model assigns the highest probability is reported.

[Figure: schematic of a hidden Markov model (HMM)]
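The scoring step can be sketched with the forward algorithm in log space. This toy version assumes spherical unit-variance Gaussian emissions and a strict left-to-right chain; real recognizers use richer emission models, and the word models and transition probabilities below are invented for illustration:

```python
import numpy as np

def log_forward(obs, means, stay):
    """Log-likelihood of an utterance under a linear-chain HMM.

    obs:   (T, D) sequence of speech vectors
    means: (S, D) one Gaussian mean per state (unit variance assumed)
    stay:  (S,)   self-transition probabilities; 1 - stay[i] is the
                  probability of moving on from state i to state i+1.
    """
    T, D = obs.shape
    S = len(means)
    # log emission probabilities under spherical unit-variance Gaussians
    diff = obs[:, None, :] - means[None, :, :]
    log_b = -0.5 * np.sum(diff ** 2, axis=2) - 0.5 * D * np.log(2 * np.pi)
    alpha = np.full(S, -np.inf)
    alpha[0] = log_b[0, 0]              # the chain must start in state 1
    for t in range(1, T):
        prev = alpha.copy()
        stay_term = prev + np.log(stay)                 # remain in state
        move = np.full(S, -np.inf)
        move[1:] = prev[:-1] + np.log(1.0 - stay[:-1])  # advance one state
        alpha = np.logaddexp(stay_term, move) + log_b[t]
    return alpha[-1]                    # the chain must end in the last state

# Classification: score the utterance under each word model, pick the best.
rng = np.random.default_rng(0)
models = {w: rng.normal(size=(4, 2)) for w in ("yes", "no")}
stay = np.full(4, 0.6)
utterance = models["yes"][np.repeat(np.arange(4), 3)] + 0.1 * rng.normal(size=(12, 2))
best = max(models, key=lambda w: log_forward(utterance, models[w], stay))
print(best)
```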

HMMs have been used as a framework for two decades in the speech recognition community and are the core of many state-of-the-art systems and of all commercial systems. Kai-Fu Lee's original SPHINX system introduced triphones into HMM recognition and was a major breakthrough in large-vocabulary continuous-speech recognition. SPHINX-II built upon this successful architecture, in part by incorporating the noise and environmental robustness methods developed by Acero. Microsoft has now released Whisper (Windows Highly Intelligent Speech Recognizer), an implementation of SPHINX-II that is optimized to reduce processing and memory requirements (by a factor of five and of twenty, respectively). Whisper now represents the state-of-the-art in HMM-based large-vocabulary speaker-independent speech recognition.

Microsoft has reported that for a typical speaker-independent command-and-control application with 260 words, Whisper performs at a word error rate of 1.4% while running in real-time on a 486DX PC. However, this performance can be obtained only under idealized conditions that do not exist in practice; background noise or non-speech sounds from the speaker can degrade the performance of a speech recognition system. Whisper is reported to reject only between 20% and 76% of out-of-vocabulary words, ungrammatical phrases and non-speech sounds.

Tanner Labs has been developing innovative HMM training techniques and architectures, as well as lip-reading capabilities, to increase recognition accuracy in realistic environments where current recognizers perform poorly (e.g., in the presence of background noise or when the speaker is under stress).

Learning Vector Quantization (LVQ)

An LVQ recognizer is a neural network back-end for speech recognition. Kohonen introduced the fundamental algorithm, which McDermott and Katagiri then enhanced by incorporating Waibel's time delay neural network (TDNN). Public domain LVQ algorithms can now be obtained in LVQ_PAK. One advantage of LVQ is that it can be used to recognize continuous speech.
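The core LVQ1 update is simple to sketch: the nearest codebook vector is pulled toward a training sample when their labels match and pushed away when they differ. This illustrative version (not the LVQ_PAK or TDNN-enhanced code) classifies static vectors rather than speech, with made-up two-cluster data:

```python
import numpy as np

def lvq1_train(X, y, codebook, labels, lr=0.05, epochs=20):
    """LVQ1 training: move the winning prototype toward same-class
    samples and away from different-class samples."""
    cb = codebook.copy()
    for _ in range(epochs):
        for x, cls in zip(X, y):
            i = np.argmin(np.sum((cb - x) ** 2, axis=1))  # nearest prototype
            sign = 1.0 if labels[i] == cls else -1.0
            cb[i] += sign * lr * (x - cb[i])
    return cb

def classify(x, cb, labels):
    """Label of the nearest codebook vector."""
    return labels[np.argmin(np.sum((cb - x) ** 2, axis=1))]

# Two synthetic clusters, one prototype per class.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(2.0, 0.3, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
cb = lvq1_train(X, y, np.array([[0.5, 0.5], [1.5, 1.5]]), [0, 1])
print(classify(np.zeros(2), cb, [0, 1]))  # 0
```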

Tanner Labs has refined LVQ training algorithms and incorporated LVQ into sophisticated architectures as part of our strategy to develop new techniques for improving recognizer performance, especially in natural environments.



Copyright © 1999-2001 by Tanner Research, Inc. All Rights Reserved