John-Paul Hosom, Ron Cole, and Mark Fanty
Center for Spoken Language Understanding (CSLU)
Oregon Graduate Institute of Science and Technology
July 6, 1999
Figure 1. Diagram of the Recognition Process
There are four basic steps to performing recognition. Each step will be explained briefly here and in more detail later in this document.
First, we digitize the speech that we want to recognize; for telephone speech the sampling rate is 8000 samples per second. Second, we compute features that represent the spectral-domain content of the speech (regions of strong energy at particular frequencies). These features are computed every 10 msec, with one 10-msec section called a frame. Third, a neural network (also called an ANN, multi-layer perceptron, or MLP) is used to classify a set of these features into phonetic-based categories at each frame. Fourth, a Viterbi search is used to match the neural-network output scores to the target words (the words that are assumed to be in the input speech), in order to determine the word that was most likely uttered.
It is also possible to analyze the results by looking at the confidence we
have in the top-scoring word. The word can be rejected as out-of-vocabulary if
the confidence falls below a pre-determined threshold. The confidence values are
always computed by the CSLU Toolkit during recognition, but we currently do not
use these values during general-purpose recognition.
Now we'll go into a bit more detail.
Figure 2. Graphical Overview of the Recognition Process
The digitized waveform is converted into a spectral-domain representation; one of two sets of features may be used, depending on the recognizer. For the current general-purpose recognizer, we use twelve mel-frequency cepstral coefficients (MFCCs), twelve MFCC delta features that indicate the degree of spectral change, one energy feature, and one delta-energy feature (for a total of 26 features per frame). Cepstral-mean subtraction (CMS) of the MFCC coefficients is done to remove some of the effects of noise. For the latest digit recognizer (to be released with version 2.0 of the Toolkit), we use twelve MFCC features with CMS, one energy feature from MFCC analysis, twelve perceptual linear prediction (PLP) features with noise reduction by RASTA processing, and one energy feature from PLP analysis (again a total of 26 features per frame).
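As a rough illustration, a 26-dimensional feature vector of this kind can be computed with standard signal-processing tools. The sketch below uses librosa (an assumption; any MFCC implementation would do) and a random signal as a stand-in for real telephone speech; it approximates, but is not identical to, the Toolkit's own front end.

import numpy as np
import librosa  # assumed available; any package with MFCC analysis would do

sr = 8000                                                 # telephone-band sampling rate
signal = np.random.randn(2 * sr)                          # stand-in for 2 sec of speech
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13, hop_length=80)  # 10-msec frame step
energy, cepstra = mfcc[:1], mfcc[1:]                      # here c0 approximates the energy term
cepstra = cepstra - cepstra.mean(axis=1, keepdims=True)   # cepstral-mean subtraction (CMS)
deltas = librosa.feature.delta(np.vstack([energy, cepstra]))
features = np.vstack([cepstra, deltas[1:], energy, deltas[:1]])  # 26 features x n_frames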
To provide some information about the acoustic context of each frame, we take a context window of features, as shown by the vertical lines in the upper-right diagram in Figure 2. This means simply taking the frame of interest as well as the frames that are -60, -30, +30, and +60 msec away from it. This is done to take into account the dynamic nature of speech: the identity of a phoneme often depends not only on the spectral features at one point in time, but also on how those features change over time.
We send the features in a context window to a neural network for classification (26 features per frame × 5 frames = 130 features). The output of the neural network is a classification of each input frame, expressed as the probabilities of phoneme-based categories. By sending context windows for all frames of speech to the neural network, we can build a matrix of the probabilities of phoneme-based categories over time. In this example of neural-network output, the word to be recognized is "two", and the darker regions in the t, t<u, and u categories indicate the greater probabilities of those classes at the indicated times.
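As a minimal sketch (using numpy, with random numbers standing in for real feature vectors), the context windows can be built by stacking time-shifted copies of the feature matrix; each column of the result is one 130-dimensional network input:

import numpy as np

def context_windows(features, offsets=(-6, -3, 0, 3, 6)):
    # Stack the frame of interest with the frames 3 and 6 frames (30 and 60
    # msec) before and after it; edge frames are repeated where the window
    # runs past the ends of the utterance.
    n_feat, n_frames = features.shape
    pad = max(abs(o) for o in offsets)
    padded = np.pad(features, ((0, 0), (pad, pad)), mode="edge")
    stacked = [padded[:, pad + off : pad + off + n_frames] for off in offsets]
    return np.concatenate(stacked, axis=0)                # 5 x 26 = 130 rows

features = np.random.rand(26, 200)                        # stand-in for 2 sec of features
inputs = context_windows(features)                        # 130 x 200, one column per frame
# Passing each column through the network then yields the matrix of
# phonetic-category probabilities (one column of category scores per frame).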
The target-word pronunciations are then expanded into strings of
phonetic-based categories, and a Viterbi search is used to find the best path
through the matrix of probabilities for each legal string. The output of
recognition is the word string that corresponds to this best path.
Figure 3. Context-Dependent Modeling
So what categories should the neural network classify? The obvious choice is one category for each phoneme, so a word such as "yes" would have three categories: /y/, /E/, and /s/. Early recognizers were built in this way, but it was found that phonemes have such a great influence on neighboring phonemes that the /E/ that follows an /s/ may look very different from the /E/ that follows, say, a /b/. In order to account for these coarticulatory effects, the current strategy is to split each phoneme into one, two, or three parts, depending on the typical duration of that phoneme as well as how much that phoneme is influenced by surrounding phonemes. The phoneme /E/, for example, can be influenced by the phonemes on both the right and left side, but it can also be long enough in duration that the middle part is not much influenced by surrounding phonemes. As a result, we split /E/ into three parts. The phoneme /n/, on the other hand, can also be long in duration, but it is not influenced much by surrounding phonemes; we may split /n/ into two parts, or keep it as a one-part phoneme. Such decisions are made using knowledge of acoustic phonetics as well as practical experience.
If we were to model every phoneme in the left and right contexts of every other phoneme, there would be on the order of 5000 different categories to classify. In order to simplify this situation, we group acoustically similar phonemes into one of eight broad contexts. For example, /y/ is spectrally similar to a front vowel, and so it is assigned (along with front vowels such as /iy/ and /ih/) to the broad context "$front". The phoneme /s/ is a fricative, and so it is assigned, along with /f/ and /sh/, to the broad context "$fric". The dollar-sign notation is used when we are referring to a broad context instead of a single phoneme. So, our model for the /E/ in "yes" becomes $front<E (/E/ in the context of a preceding front vowel), <E> (/E/ in the middle with no contextual effects), and E>$fric (/E/ in the context of a following fricative). For a vocabulary-independent system, /E/ will be split into 17 categories: 8 categories for the possible preceding broad contexts, 8 categories for the possible following broad contexts, and one context-independent category.
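To make the notation concrete, here is a small Python sketch that expands a phoneme string into category labels of this form. The broad-context assignments and the choice to split every phoneme into three parts are simplifications for illustration; as described above, the Toolkit splits phonemes into one, two, or three parts.

def expand_pronunciation(phones, broad):
    # Expand a phoneme string into context-dependent category labels,
    # splitting every phoneme into left-context, middle, and right-context
    # parts (a simplification of the actual one/two/three-part splits).
    labels = []
    for i, p in enumerate(phones):
        left = broad[phones[i - 1]] if i > 0 else "$sil"
        right = broad[phones[i + 1]] if i < len(phones) - 1 else "$sil"
        labels += [left + "<" + p, "<" + p + ">", p + ">" + right]
    return labels

broad = {"y": "$front", "E": "$mid", "s": "$fric"}        # illustrative assignments only
print(expand_pronunciation(["y", "E", "s"], broad))
# ['$sil<y', '<y>', 'y>$mid', '$front<E', '<E>', 'E>$fric', '$mid<s', '<s>', 's>$sil']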
The method of using broad contexts results in 576 categories for all American English phonemes, which is more computationally feasible. In addition, categories that do not occur often in the training corpora can be tied to acoustically similar categories that have more training data; in this way, the number of categories to train on can be reduced from 576 to 544. The current general-purpose network has 544 output categories, but the phonemes belonging to each broad context were determined using a data-driven method rather than acoustic-phonetic knowledge.
Figure 4. Neural-network-based probability computation.
The neural network has 130 input nodes, one for each feature in the 5-frame
context window. There are 200 hidden nodes, and 544 output nodes. The
general-purpose network used at OGI was trained using 1500 samples from each
category, with the data taken from the OGI Numbers, Yes/No, Apple,
and Stories corpora, as well as the NYNEX PhoneBook corpus. One important
thing to note is that the outputs of the neural network are used as estimates of
posterior probability; in other words, we don't take the category with the
highest score and say, "OK, at frame 42 we found the category y>$mid."
Instead, the network is used to estimate the probabilities of each category, so
that we might say, "OK, at frame 42 we found an 82% probability of y>$mid, a
7% probability of y>$front, a 7% probability of iy>$mid, a 3% probability
of <iy>, and a close to zero probability for all other categories." It has
been shown by Bourlard and Wellekens (NIPS 89) and Richard and Lippmann (Neural
Computation 1991) that neural networks can classify in this way given enough
training data and hidden nodes.
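A bare-bones numpy sketch of such a network's forward pass is shown below. The weights are random placeholders rather than the trained values, and a softmax output is used here so that the 544 scores form a proper probability distribution; the point is only to show the 130-200-544 structure and the interpretation of the outputs as posterior estimates.

import numpy as np

rng = np.random.default_rng(0)
W1 = 0.01 * rng.standard_normal((200, 130))   # input-to-hidden weights (placeholders)
b1 = np.zeros(200)
W2 = 0.01 * rng.standard_normal((544, 200))   # hidden-to-output weights (placeholders)
b2 = np.zeros(544)

def classify_frame(x):
    # Map one 130-dimensional context window to 544 category probabilities.
    hidden = np.tanh(W1 @ x + b1)             # 200 hidden nodes
    logits = W2 @ hidden + b2                 # 544 output nodes
    exp = np.exp(logits - logits.max())       # softmax: non-negative, sums to one
    return exp / exp.sum()

probs = classify_frame(rng.standard_normal(130))
assert abs(probs.sum() - 1.0) < 1e-9          # a distribution over the 544 categories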
Figure 5. Viterbi search
Once we have a matrix of how the phonetic probabilities change with time, we
want to search for the best word. Before doing that, we need to compute the set
of legal strings of phonetic categories. This set is dependent on the words we
want to recognize and the possible order of words, so we combine pronunciation
models for each of our words (for example, "yes" = $sil<y, y>$mid,
$front<E, <E>, E>$fric, $mid<s, s>$sil) with a grammar (which
specifies what words can occur in what order). In the example shown here, we
have a simple search path that can recognize only "yes" or "no", both of which
have to be preceded and followed by silence. In searching, when we look at a new
frame, we transition to a new state if the probability of the new category is
greater than the probability of the current category. At the end of the search,
we have a score for the most likely category sequence and the path through the
categories that was used to generate the best score. We can take this path and
easily determine the corresponding word (or word sequence). This word has the
best fit to the input data, and it is therefore the word that was most likely
uttered.
Figure 6. Viterbi search example
This figure traces the search paths for "yes" and "no" through a hypothetical matrix of probabilities. Even though the score for "no" is very low, it is still possible to find the most probable path for this word; that path would have been selected if "yes" had not been in our vocabulary. The Viterbi search can be understood by reading through the following pseudo-code algorithm (with notation borrowed from Rabiner's paper, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition):
/* initialization */
given N is the number of categories
given T is the number of frames
given matrix B[j][t]: category probabilities (neural network outputs),
    with 1 <= j <= N, 1 <= t <= T
given matrix A[i][j]: probability of transitioning from category i to
    category j, with 1 <= i <= N, 1 <= j <= N
initialize delta[i][1] to B[i][1], with 1 <= i <= N
initialize psi[i][1] to 0, with 1 <= i <= N

/* main loop: compute delta (scores) and psi (category values corresponding */
/* to best scores). max_score is probability of being in state i and        */
/* transitioning from state i to state j                                    */
for each frame t from 2 to T {
    for each category j from 1 to N {
        max_score = LOWEST_POSSIBLE_VALUE
        max_index = 0
        for each category i from 1 to N {
            if (delta[i][t-1] * A[i][j] > max_score) {
                max_score = delta[i][t-1] * A[i][j]
                max_index = i
            }
        }
        delta[j][t] = max_score * B[j][t]
        psi[j][t] = max_index
    }
}

/* backtracking to find state (category) sequence */
max_score = LOWEST_POSSIBLE_VALUE
max_index = 0
for each category i from 1 to N {
    if (delta[i][T] > max_score) {
        max_score = delta[i][T]
        max_index = i
    }
}
state[T] = max_index
for each frame t from T-1 to 1 {
    state[t] = psi[state[t+1]][t+1]
}
Note that this algorithm is usually implemented using log-domain numbers, to avoid the underflow caused by multiplying together many values less than one.
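For reference, here is a small runnable Python version of the same search, working in the log domain as just described. The three-category B and A matrices are toy values for illustration only; in the recognizer, B would come from the neural network and A from the expanded pronunciation models and grammar.

import numpy as np

def viterbi(log_B, log_A):
    # log_B[j, t]: log probability of category j at frame t (network outputs).
    # log_A[i, j]: log probability of transitioning from category i to j.
    # Returns the best-scoring category index for each frame.
    N, T = log_B.shape
    delta = np.full((N, T), -np.inf)
    psi = np.zeros((N, T), dtype=int)
    delta[:, 0] = log_B[:, 0]
    for t in range(1, T):
        scores = delta[:, t - 1][:, None] + log_A   # score of reaching j through each i
        psi[:, t] = scores.argmax(axis=0)
        delta[:, t] = scores.max(axis=0) + log_B[:, t]
    states = np.empty(T, dtype=int)
    states[T - 1] = delta[:, T - 1].argmax()        # best final category
    for t in range(T - 2, -1, -1):                  # backtrack through psi
        states[t] = psi[states[t + 1], t + 1]
    return states

B = np.array([[0.7, 0.6, 0.1, 0.1],                 # toy category probabilities
              [0.2, 0.3, 0.8, 0.2],                 # (3 categories, 4 frames)
              [0.1, 0.1, 0.1, 0.7]])
A = np.array([[0.6, 0.4, 0.0],                      # toy left-to-right transitions
              [0.0, 0.6, 0.4],
              [0.0, 0.0, 1.0]])
with np.errstate(divide="ignore"):                  # log(0) becomes -inf, which is fine
    print(viterbi(np.log(B), np.log(A)))            # prints [0 0 1 2]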
Figure 7. Word spotting
Word-spotting allows us to find our target word or phrase somewhere in the utterance, even if the person said, "Well, let me think, ummm.... <target word> is what I'd like."
The default grammar for the recognition of a single word or phrase is
ANY <target word or phrase>
ANY
which means that we recognize "anything," followed by one of the words from our vocabulary, followed by "anything." The score of "anything" at a given frame is defined as the score for silence at that frame or the score of the Nth-best phonetic category at that frame, whichever is greater. The default value of N for the general-purpose recognizer is 16. In effect, we are saying that we will recognize the target word when its score is better than the score obtained by multiplying together the 16th-best values at each frame (or when its score is better than the score for silence). This provides a measurement that is not dependent on a fixed threshold (if the signal is noisy, the scores for the target words and the 16th-best values will both decrease) or on a grammar (no assumptions are made about what the extraneous speech really is).
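A small numpy sketch of this per-frame "anything" score follows; the random matrix stands in for the network outputs, and the index of the silence category (0 here) is an assumption for illustration.

import numpy as np

def any_score(prob_matrix, silence_index, n=16):
    # Per-frame score of the "ANY" (garbage) model: the greater of the silence
    # probability and the n-th best category probability at that frame.
    nth_best = np.sort(prob_matrix, axis=0)[-n, :]    # n-th largest value per frame
    silence = prob_matrix[silence_index, :]
    return np.maximum(silence, nth_best)

probs = np.random.rand(544, 120)                      # stand-in for network outputs
garbage = any_score(probs, silence_index=0, n=16)     # one "anything" score per frame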
The value 16 was determined empirically and can be changed by the user in the
cslurp or cslush packages.
Questions or comments? Please send mail to the first author, John-Paul Hosom.