
About Speech Recognition
A speech signal is complex and encodes far more information
than can be analyzed and processed in real time. Speech
recognition systems therefore typically use a speech feature
extractor as a front-end pre-processor. Its input is the sampled
speech signal, and its output is a sequence of speech
vectors that characterize key features of the input speech.
These vectors are then fed into a recognizer that determines
what words were spoken.
The most common front end extracts Mel-frequency Cepstral
coefficients (MFCs), but recent research
indicates that the Auditory Image Model (AIM)
might also offer high performance. Hidden Markov models (HMMs)
are used as the back-end recognizer in all commercial products,
but neural network recognizers are also being studied. One
neural network recognizer that can recognize continuous
speech is Learning Vector Quantization (LVQ).
Tanner Labs has studied and enhanced all the aforementioned
front ends and back ends, with a focus on computationally
efficient algorithms and implementations, and on systems and
architectures that achieve high performance even under noisy,
stressful, or otherwise adverse conditions. For example, we
are currently developing lip reading
technology to exploit visual cues, which can play a significant
role in human speech recognition even for listeners with typical
hearing, as the McGurk Effect demonstrates.
Mel-Frequency Cepstral Coefficients (MFCs)
Mel-frequency Cepstral coefficients (MFCs) are obtained by
a Fourier transform of short speech segments into the frequency
domain, a warping of the amplitude spectrum onto the perceptually
motivated mel frequency scale, a computation of the logarithm of
the warped spectrum, and an inverse Fourier transform (in practice
usually a discrete cosine transform) back to a time-like domain.
MFCs have proved useful as features for speech recognition, and
they are widely used both commercially and in research
environments.
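The steps above can be sketched for a single speech frame as follows.
This is a minimal illustration under common textbook assumptions, not
the extractor used in any particular product: the triangular mel filter
bank, the Hamming window, the use of a DCT in place of the inverse
Fourier transform, and the names mel_filterbank and
mel_cepstral_coefficients are all choices made here for the example.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale (assumed design)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):       # rising edge of the triangle
            fbank[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):      # falling edge of the triangle
            fbank[i - 1, k] = (right - k) / (right - center)
    return fbank

def mel_cepstral_coefficients(frame, sample_rate, n_filters=20, n_coeffs=12):
    """Cepstral feature vector for one short speech frame."""
    n_fft = len(frame)
    windowed = frame * np.hamming(n_fft)
    amplitude = np.abs(np.fft.rfft(windowed))            # frequency domain
    mel_energies = mel_filterbank(n_filters, n_fft, sample_rate) @ amplitude
    log_mel = np.log(mel_energies + 1e-10)               # avoid log(0)
    # DCT-II basis, standing in for the inverse Fourier transform
    n = np.arange(n_filters)
    k = np.arange(n_coeffs)[:, None]
    dct_basis = np.cos(np.pi * k * (n + 0.5) / n_filters)
    return dct_basis @ log_mel

# Usage: one 32 ms frame of a 440 Hz tone sampled at 8 kHz
t = np.arange(256) / 8000.0
features = mel_cepstral_coefficients(np.sin(2 * np.pi * 440.0 * t), 8000)
```

Applying the function to successive overlapping frames yields the
sequence of speech vectors that is passed to the recognizer.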
Tanner Labs has used MFC extractor front-ends in much of
our research, and we have studied whether various additional
parameters can improve performance.
Auditory Image Model (AIM)
Tanner Labs has studied various biologically inspired pre-processors,
including the Slaney
and Lyon EAR model, but our research has focused on
the Auditory Image Model (AIM) developed by Patterson
and Holdsworth (see also Holdsworth).
AIM closely mimics what is known about the firing pattern
in the mammalian auditory nerve. It consists of three stages:
a filter bank, a logarithmic compressor/rectifier, and an
adaptive threshold mechanism that provides lateral inhibition
in time and frequency. Adaptive
thresholding is the key feature that makes the AIM model
different from an ordinary filter-bank front end. It occurs
in two dimensions: time and frequency. The temporal
threshold provides a short-term memory of recent activity
so that high activity masks subsequent weak activity. This
gives the AIM model the ability to discriminate between impulses
(clicks) and formants (an impulse passed through a resonance).
The spectral threshold is a smoothed version of the
spectral envelope. Strongly activated frequency channels suppress
neighboring channels that are only weakly activated. Together, the two
thresholds implement a form of lateral inhibition, a
mechanism for sharpening responses that is common throughout
the mammalian nervous system.
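The effect of the two thresholds can be illustrated with a short
sketch. This is a loose illustration only, not the AIM algorithm as
published or as implemented by Tanner Labs: the exponential decay of
the temporal threshold, the moving-average spectral threshold, and the
parameter names decay and spectral_width are assumptions made for the
example. The input is taken to be the compressed, rectified filter-bank
output, arranged as one row per time step and one column per channel.

```python
import numpy as np

def adaptive_threshold(activity, decay=0.9, spectral_width=3):
    """Suppress activity that falls below a temporal or spectral threshold.

    activity: array of shape (n_frames, n_channels), non-negative.
    """
    n_frames, n_channels = activity.shape
    kernel = np.ones(spectral_width) / spectral_width
    temporal = np.zeros(n_channels)      # short-term memory per channel
    output = np.zeros((n_frames, n_channels))
    for t in range(n_frames):
        frame = activity[t]
        temporal = decay * temporal      # memory of past activity fades
        # Spectral threshold: smoothed version of the spectral envelope
        spectral = np.convolve(frame, kernel, mode="same")
        threshold = np.maximum(temporal, spectral)
        output[t] = np.maximum(frame - threshold, 0.0)
        # Strong activity raises the temporal threshold for later frames
        temporal = np.maximum(temporal, frame)
    return output

# Usage: a strong burst in channel 2 followed by weak activity
activity = np.zeros((3, 5))
activity[0, 2] = 10.0
activity[1, 2] = 2.0
sharpened = adaptive_threshold(activity)
```

In this sketch the burst in the first frame passes through, while the
weaker activity that follows in the same channel is masked until the
temporal threshold has decayed.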
Tanner Labs has developed a computationally efficient hardware
implementation of AIM that includes
a custom analog IC as well as reprogrammable digital logic.
As part of this development process, we performed extensive
AIM simulations in order to refine the algorithm and to determine
optimal parameter values.
Hidden Markov Models (HMMs)
The typical single-word HMM consists of four to six states
connected in a linear chain (see figure below). Each state
roughly corresponds to a phoneme. An utterance is a
sequence of speech vectors. The HMM produces an utterance
as follows. It begins in State 1. For each time step, it produces
a speech vector that is picked from a probability distribution.
It then either remains in that state or moves on to the next
state according to a second probability distribution. These
two steps are repeated until the HMM reaches the last state
of the chain. During training, a set of n HMMs
is constructed, one for each word in the vocabulary, and a number
of training utterances are used to derive the probability
distributions. During classification, a test utterance is compared
to each of the n models, the probability that each model produced
the sequence is calculated, and the word whose model gives the
highest probability is chosen.

Figure: Schematic of a hidden Markov model (HMM)
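The generation and classification procedure described above can be
sketched as follows. The sketch makes several simplifying assumptions
that are not taken from any particular recognizer: each state emits
vectors from a unit-variance Gaussian, the chain ends when the model
leaves its last state, the model parameters are given rather than
trained, and the names LeftToRightHMM and classify are hypothetical.
The probability that a model produced an utterance is computed with the
standard forward algorithm in the log domain.

```python
import numpy as np

class LeftToRightHMM:
    def __init__(self, means, stay_prob):
        self.means = np.asarray(means, dtype=float)     # (n_states, dim)
        self.stay = np.asarray(stay_prob, dtype=float)  # (n_states,), in (0, 1)

    def generate(self, rng):
        """Produce an utterance: emit a vector, then stay or move on."""
        state, vectors = 0, []
        while state < len(self.means):
            vectors.append(rng.normal(self.means[state], 1.0))
            if rng.random() >= self.stay[state]:
                state += 1
        return np.array(vectors)

    def log_likelihood(self, utterance):
        """Log probability that this model produced the utterance."""
        n_states, dim = self.means.shape
        diff = utterance[:, None, :] - self.means[None, :, :]
        log_b = -0.5 * np.sum(diff ** 2, axis=2) - 0.5 * dim * np.log(2 * np.pi)
        log_stay = np.log(self.stay)
        log_move = np.log(1.0 - self.stay)
        alpha = np.full(n_states, -np.inf)
        alpha[0] = log_b[0, 0]               # the model begins in State 1
        for t in range(1, len(utterance)):
            stayed = alpha + log_stay
            moved = np.concatenate(([-np.inf], alpha[:-1] + log_move[:-1]))
            alpha = np.logaddexp(stayed, moved) + log_b[t]
        return alpha[-1] + log_move[-1]      # must exit from the last state

def classify(models, utterance):
    """Return the word whose model gives the highest likelihood."""
    return max(models, key=lambda word: models[word].log_likelihood(utterance))

# Usage: two three-state word models with hand-picked parameters
rng = np.random.default_rng(0)
models = {
    "yes": LeftToRightHMM([[0, 0], [5, 5], [10, 10]], [0.7, 0.7, 0.7]),
    "no": LeftToRightHMM([[10, 10], [5, 5], [0, 0]], [0.7, 0.7, 0.7]),
}
utterance = models["yes"].generate(rng)
best_word = classify(models, utterance)
```

A practical recognizer would estimate the emission and transition
distributions from training utterances instead of fixing them by hand.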
HMMs have been used as a framework for two decades in the
speech recognition community and are the core of many state-of-the-art
systems and of all commercial systems. Kai-Fu Lee's original
SPHINX
system introduced triphones into HMM recognition and was a
major breakthrough in large-vocabulary continuous-speech recognition.
SPHINX-II
built upon this successful architecture, in part by incorporating
the noise and environmental robustness methods developed by
Acero.
Microsoft has now released Whisper
(Windows Highly Intelligent Speech Recognizer), an implementation
of SPHINX-II that is optimized to reduce processing and memory
requirements (by a factor of five and of twenty, respectively).
Whisper now represents the state of the art in HMM-based
large-vocabulary speaker-independent speech recognition.
Microsoft has reported that for a typical speaker-independent
command-and-control application with 260 words, Whisper achieves
a word error rate of 1.4% while running in real time on
a 486DX PC. However, this performance can be obtained only
under idealized conditions that do not exist in practice:
background noise or non-speech sounds
from the speaker can degrade the performance of a speech recognition
system. Whisper is also reported to reject only between 20% and
76% of out-of-vocabulary words, ungrammatical phrases, and
non-speech sounds.
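For readers unfamiliar with the metric, word error rate is
conventionally computed as the number of word substitutions, deletions,
and insertions needed to turn the reference transcript into the
recognizer output, divided by the number of words in the reference. The
sketch below assumes that conventional definition, which may differ in
detail from the scoring used in the figures reported above; the
function name word_error_rate is chosen here for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = fewest edits turning ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                      # delete every reference word
    for j in range(len(hyp) + 1):
        dist[0][j] = j                      # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match or substitution
    return dist[-1][-1] / len(ref)

# Usage: one substituted word out of three reference words
rate = word_error_rate("open the file", "open a file")
```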
Tanner Labs has been developing innovative HMM training techniques
and architectures, as well as lip-reading
capabilities, to increase recognition accuracy in realistic
environments where current recognizers perform poorly (e.g.,
in the presence of background noise or when the speaker is
under stress).
Learning Vector Quantization (LVQ)
An LVQ recognizer is a neural network back-end for speech
recognition. Kohonen
introduced the fundamental algorithm, which McDermott
and Katagiri then enhanced by incorporating Waibel's
time delay neural network (TDNN). Public-domain implementations
of the LVQ algorithms can now be obtained in the LVQ_PAK package.
One advantage of LVQ is that it can be used to recognize continuous
speech.
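As a rough illustration of the fundamental algorithm, the sketch below
implements the basic LVQ1 update rule on fixed-length vectors: the
codebook vector nearest to a training vector is moved toward it when
their class labels agree and away from it when they disagree. It is a
minimal sketch, not LVQ_PAK code, and it does not include the
time-delay extensions needed for continuous speech; the function names,
learning rate, and toy data are assumptions made for the example.

```python
import numpy as np

def train_lvq1(vectors, labels, codebook, codebook_labels, lr=0.05, epochs=20):
    """Refine labeled codebook vectors with the LVQ1 rule."""
    codebook = np.array(codebook, dtype=float)
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            nearest = np.argmin(np.sum((codebook - x) ** 2, axis=1))
            # Attract on a correct label, repel on an incorrect one
            direction = 1.0 if codebook_labels[nearest] == y else -1.0
            codebook[nearest] += direction * lr * (x - codebook[nearest])
    return codebook

def classify_lvq(x, codebook, codebook_labels):
    """Assign the label of the nearest codebook vector."""
    nearest = np.argmin(np.sum((codebook - x) ** 2, axis=1))
    return codebook_labels[nearest]

# Usage: two toy classes in two dimensions
vectors = np.array([[0.0, 0.0], [0.5, 0.0], [4.0, 4.0], [4.0, 4.5]])
labels = [0, 0, 1, 1]
codebook = train_lvq1(vectors, labels, [[1.0, 1.0], [3.0, 3.0]], [0, 1])
predicted = classify_lvq(np.array([0.2, 0.1]), codebook, [0, 1])
```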
Tanner Labs has refined LVQ training algorithms and incorporated
LVQ into sophisticated architectures as part of our strategy
to develop new techniques for improving recognizer performance,
especially in natural environments.