How do programmers teach a computer to recognize your voice?

It won’t be long before you can carry on lengthy conversations with a computer, schedule business appointments by negotiating with a machine rather than a secretary, even instruct your VCR by voice when to record a certain show. IBM already has a computer in the works that can recognize twenty thousand words.

How does it do it?

Basically, the computer “recognizes” your voice by converting its sound waves into digital signals. When you speak into a microphone, the computer compares your voice signals with a set of patterns it has stored in its system.
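To make "converting sound waves into digital signals" concrete, here is a minimal sketch of sampling and quantizing a wave in Python. The sample rate, the bit depth, the 440 Hz test tone, and the function names are all assumptions chosen for illustration; the article does not say what the real systems use.

```python
import math

def digitize(wave, duration_s, sample_rate_hz=8000, bits=8):
    """Sample a wave (a function of time in seconds, with values in -1..1)
    and quantize each sample to a signed integer of the given bit depth."""
    max_level = 2 ** (bits - 1) - 1                # 127 for 8 bits
    n_samples = round(duration_s * sample_rate_hz)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                     # time of the n-th sample
        amplitude = max(-1.0, min(1.0, wave(t)))   # clip to the valid range
        samples.append(round(amplitude * max_level))
    return samples

def tone(t):
    """Hypothetical stand-in for a voice: a pure 440 Hz tone."""
    return math.sin(2 * math.pi * 440 * t)

# Ten milliseconds of "sound" become a list of 80 small integers.
signal = digitize(tone, duration_s=0.01)
```

The resulting list of integers is the kind of digital signal that can then be compared with stored patterns.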

The computer will show you what it thinks you’re saying by instantly flashing up your words on a computer screen. You can then edit (by voice or keyboard), store, print, or transmit. In the labs at AT&T, for instance, researchers play video games, guiding a mouse through a maze simply by saying “left” or “right.”

But before a computer can perform for you, it has to get used to your voice. (This type of computer will respond to only one voice and is dubbed “speaker dependent.” Others, referred to as “speaker independent,” will respond to any voice.) A speaker-dependent computer first listens for twenty minutes as you read a special document into a microphone.

When you’ve finished, it will have stored two hundred sound patterns that characterize your voice, and when you use the system it will match them against your speech. A selection of words that roughly match those sound patterns is drawn from a twenty-thousand-word vocabulary.
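The step of drawing "a selection of words that roughly match" can be sketched as a nearest-pattern search. This is only an illustration under loud assumptions: each word is reduced to a made-up three-number pattern, the vocabulary holds five words rather than twenty thousand, and plain Euclidean distance stands in for whatever matching the real system performs.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length feature lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def shortlist(spoken, vocabulary, k=3):
    """Return the k vocabulary words whose stored pattern is closest
    to the spoken pattern, best match first."""
    ranked = sorted(vocabulary, key=lambda word: distance(spoken, vocabulary[word]))
    return ranked[:k]

# Hypothetical stored patterns. Words that sound alike share a pattern.
VOCABULARY = {
    "bows":   [0.9, 0.1, 0.4],
    "boughs": [0.9, 0.1, 0.4],
    "boss":   [0.8, 0.3, 0.5],
    "memo":   [0.1, 0.9, 0.2],
    "file":   [0.3, 0.2, 0.9],
}

candidates = shortlist([0.88, 0.12, 0.41], VOCABULARY, k=3)
```

Note that sound alone cannot separate "bows" from "boughs" here; both land at the top of the shortlist, which is why a further step based on context is needed.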

These words are then matched against a second model that has a vast database of words commonly used in the office. The computer can reduce the number of possible candidates by determining which words are most likely to follow the two previous words.

For example, in a phrase such as “the actor bows to the audience,” the computer knows that bows, and not boughs, would follow the word actor. The system makes its final selection of the best word only after it has determined that analysis of the subsequent words (here, to the audience) won’t affect its choice. Within a second or so, the word appears on the system’s display screen.
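Choosing a word by looking at the two words before it is commonly done by counting three-word sequences. The sketch below is one simple way to do that, not IBM's actual model: the three-sentence corpus and the function names are invented for illustration, and the step of waiting for later words before committing is left out.

```python
from collections import defaultdict

def train_trigrams(sentences):
    """Count how often each word follows each pair of preceding words."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = ["<s>", "<s>"] + sentence.lower().split()   # pad the start
        for i in range(2, len(words)):
            counts[(words[i - 2], words[i - 1])][words[i]] += 1
    return counts

def pick_word(counts, previous_two, candidates):
    """Pick the candidate seen most often after the two previous words.
    If none was ever seen in that context, the first candidate is returned."""
    following = counts.get(tuple(previous_two), {})
    return max(candidates, key=lambda word: following.get(word, 0))

# Hypothetical training text standing in for a large body of office prose.
CORPUS = [
    "the actor bows to the audience",
    "the actor bows again",
    "the tree boughs bend in the wind",
]

counts = train_trigrams(CORPUS)
best = pick_word(counts, ["the", "actor"], ["boughs", "bows"])
```

With this toy corpus, "bows" wins after "the actor" while "boughs" wins after "the tree", even though the two candidates sound identical.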

This contextual ability enables the system to distinguish between words that sound alike but are different, such as know and no, air and heir, and to, too, and two. Punctuation can be added verbally by simply saying “period,” “comma,” and so on.
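Spoken punctuation can be pictured as a small lookup applied to the recognized words. This is a minimal sketch under assumptions: only "period" and "comma" come from the text, the other entries and the function name are hypothetical, and a real system would need some way to dictate the literal word "period".

```python
# Spoken command -> punctuation mark. "colon" and "semicolon" are assumed extras.
SPOKEN_PUNCTUATION = {
    "period": ".",
    "comma": ",",
    "colon": ":",
    "semicolon": ";",
}

def render(words):
    """Join recognized words into text, replacing spoken punctuation
    commands with the marks themselves (attached to the previous word)."""
    text = ""
    for word in words:
        mark = SPOKEN_PUNCTUATION.get(word)
        if mark is not None:
            text += mark            # no space before a punctuation mark
        elif text:
            text += " " + word
        else:
            text = word
    return text

dictated = render(["send", "the", "memo", "comma", "then", "call", "me", "period"])
```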

When IBM first started working on this system in 1986, the system occupied an entire room. Now it has been reduced to desk size and IBM feels it won’t be long before it is used routinely in business offices. AT&T is experimenting with an automated directory system for the employees at Bell Labs.

The caller simply picks up the phone, spells out the last name and first initial of the person he or she is trying to reach, and the call automatically goes through. Who knows? Before long you may be able to pick up the phone and just say, “Get me Mom, please.”
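Once the letters have been recognized, the directory step itself is a simple lookup. The sketch below assumes the recognizer has already turned the caller's speech into letters; the names, extensions, and function name are made up for illustration and do not describe the Bell Labs system.

```python
# Hypothetical directory keyed by (last name, first initial).
DIRECTORY = {
    ("SMITH", "J"): "x101",
    ("SMITH", "A"): "x102",
    ("JONES", "M"): "x103",
}

def look_up(spoken_letters, first_initial):
    """Assemble the spelled-out last name and return the matching
    extension, or None if nobody in the directory matches."""
    last_name = "".join(letter.strip().upper() for letter in spoken_letters)
    return DIRECTORY.get((last_name, first_initial.strip().upper()))

# The caller spells "S-M-I-T-H" and then says "J".
extension = look_up(["s", "m", "i", "t", "h"], "j")
```

Spelling the name narrows the recognizer's job to twenty-six letters, which is far easier than recognizing every possible surname spoken aloud.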