Speech Recognition

Copyright May 1998 by Ken Fetterhoff    -    Last Update March 12, 1999

Introduction
Recent advances in PC horsepower have made it possible to run software that interprets the spoken human voice and translates the words or phrases into a desired action or text. These words can be commands or navigation directions to the computer, or they can be text and numbers dictated to a word processor or spreadsheet. The software is not yet advanced enough to always know from context alone whether you are issuing commands or dictating, but through the use of natural pauses and keywords it does surprisingly well.

There are three major software offerings on the market: IBM's ViaVoice Gold, Dragon Systems' NaturallySpeaking and Lernout & Hauspie Speech Products' Voice Xpress Plus. Each combines text formatting commands with continuous speech (natural speech) text recognition. Some of these vendors offer varying degrees of hands-free operation depending upon make and model, which is invaluable to pathologists and other professionals who are accustomed to tape recorder dictation while working with both hands.

State of the Art
Speech recognition technology can now accept continuous speech as opposed to discrete speech. Discrete speech required a distinct pause of 1/10 of a second between words, which forced the operator to learn an unnatural speech pattern. Under no circumstance should you consider an older discrete-speech version. Modern Pentium processors can acquire continuous speech patterns and perform the necessary wave-form comparative analysis in an acceptable time frame, a delay of about 5 seconds. This means that as you dictate in your normal voice, about 5 seconds go by before the words appear on the screen, already checked for spelling and grammar context. The newest 300 MHz and 400 MHz CPUs combined with 96 MB of RAM virtually eliminate the time lag, though this consideration is less important than you might think, since one should be dictating without looking at the computer's monitor. Time-pressured professionals, who can get the most benefit from this technology, already use dictation equipment while viewing X-ray charts, biopsies or client records.

How it Works
A microphone and PC sound card are used to produce a digital wave-form from analog human speech. Complex algorithms are used to isolate, identify and interpret the individual phonemic components of each spoken word. Each user enrolls by speaking a defined and known-in-advance text that creates a voice model. Enrollment produces a personalized collection of user files that statistically model how the phonemes of a word correspond to the data produced by the acoustic processor. The system uses a speech engine that updates or averages new speech data with existing voice model files. If the user indicates an error by clicking on an improperly recognized word or phrase, alternate statistical choices are presented to the user for correction by selection and substitution. This step is very important to the continued improvement of the recognition system, which learns from these mistakes.

Users must know the difference between editing dictated text and correcting dictated text. Every word has a spoken audio wave-form associated with it, and error correction reinforces and updates the enrolled user voice model files. One should never do editing (changing the word or phrase) while in text correction mode. Each speech recognition vendor has error correction capabilities, but not all can do it hands-free. Delegated correction capability (where someone other than the dictator does the correction) is a desirable feature for those environments where clerical staff performs duties of this nature. The integrity of the user's voice profile is not compromised in these circumstances, since the delegated corrector is matching the originator's intended spoken word, as played back, with the proper word. Again, some of the lower priced speech recognition products don't offer this feature. Most of the higher priced models will have it.

Where it Works
Speech recognition is especially well suited to professional environments where a large specialized vocabulary is employed. Time savings include eliminating staff training for the highly technical vocabulary, eliminating a shorthand dictation process where used, and instantaneous availability of a 98%+ accurate first draft. A high return on investment (ROI) is guaranteed for organizations using outside Transcription Service Bureaus. See our transcription cost analysis later in this report.

Motivated users will always reap the greatest benefit from new technology. Users with some handicap often show the most dedication to learn, use and make speech recognition part of their lives. The cost of speech recognition is not significant. Where an existing PC can be utilized, a professional version for use in a medical or technical environment, with custom vocabularies and the training needed to assure accuracy and therefore economic success, is less than $2,000. Work Flow Solutions normally charges about $1200 for the software and $800 for installation, setup, orientation, training and coaching.

Required Hardware
A PC with a minimum 166 MHz MMX Intel Pentium processor with Windows 95 or NT operating system is required. A 200 or 233 MHz processor is preferable. The minimum RAM varies with the operating system, but since memory is now inexpensive, start with 64 MB or spend $50 to upgrade to 96 MB or 128 MB. The amount of hard drive storage varies between the speech recognition vendors, with IBM's ViaVoice Gold and Dragon's Deluxe version requiring 125 MB and Dragon's Personal version needing only 60 MB. A CD-ROM, 3.5" floppy drive, mouse, monitor, keyboard, speakers and sound card make up the balance of the PC. Beyond the specifications of the PC, the sound card is one of the most important components of the speech recognition system. Generally, all the vendors recommend a 16 Bit SoundBlaster or Mwave compatible sound card. It has been our experience that other sound card chips/vendors can give equal or better sound quality. Ensoniq and Yamaha are two that give good performance. In the final analysis, testing is the only way to determine if the PC's circuitry is compatible with the installed sound card. This is especially true of clone notebook computers. Be cautious and test before you buy any notebook computer to be used for speech recognition. Even some IBM ThinkPad notebooks will not run IBM's ViaVoice software.

Required Software
The basic software is often referred to as the 'Engine'. It provides features and functions similar to a word processor. In addition, it permits multiple users to create and store voice profiles which are unique to each user. Each user dictates using their own profile for improved accuracy. Other capabilities are provided that permit navigation, text formatting, correction, building vocabularies, dictation playback and text-to-speech. Each software package includes a headset microphone with a single ear piece speaker. The lower cost software versions provide retail type headsets. The Dragon Deluxe version includes a professional, rugged headset. All these headsets use noise-cancelling mics which will permit use in a somewhat noisy exhibition hall.

Custom Vocabularies
Perhaps the single most important factor in achieving high recognition accuracy is the use of a custom vocabulary. This is a dictionary of words that are unique to a specific profession, e.g., Podiatry, Pathology, Patent Law or Environmental Engineering. These words are taken from the total vocabulary and moved to the active portion of the vocabulary. Work Flow Solutions provides custom vocabularies to the medical specialties, and can build custom vocabularies for other professional activities.

Work Flow Automation
Significant productivity benefits can be achieved by taking advantage of some of the advanced features of the speech recognition software. Each vendor offers the capability of creating sophisticated word or phrase macros and complex templates. Templates combine fixed and variable text to build simple reports or lengthy contracts in which the variable text is a series of 'fields' to be filled in verbally. Some analysis is required to determine what repetitive work is best suited for macros or templates. For an office, it may be desirable to create standards so that everyone using speech recognition uses the same macros and templates. This way they would be constructed once and replicated for all users.
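The idea of a template with verbally filled fields can be illustrated with a minimal Python sketch. It is not how any vendor's product implements templates; the report layout, the field names and the sample values are all hypothetical and serve only to show fixed text combined with variable fields.

```python
from string import Template

# Hypothetical report template: the fixed text is written once,
# and each $name marks a field the dictator fills in.
REPORT = Template(
    "Patient: $patient\n"
    "Specimen: $specimen\n"
    "Diagnosis: $diagnosis\n"
)


def fill_report(fields):
    """Return the report text with every field replaced by its value.

    Raises KeyError if a field is missing, so an incomplete
    report is never produced silently.
    """
    return REPORT.substitute(fields)


# Made-up sample values, for illustration only.
report = fill_report({
    "patient": "Jane Doe",
    "specimen": "skin biopsy",
    "diagnosis": "benign nevus",
})
print(report)
```

Building the template once and sharing it mirrors the office-wide standard described above: every user fills the same fields and gets the same layout.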

Training
It is necessary to know what commands and formatting words to use to achieve acceptable results. Training will accelerate this learning process, so that good accuracy can be reached early. These results reinforce the satisfaction factor and help one to continue the use of speech recognition. Initial successful results can thereby add to improved recognition and further use. The repetition and correction of misrecognized words improves the system and creates a cycle of success. Without training and coaching or significant practice the opposite can occur, requiring enrollment to be repeated from the beginning.

Mobile Speech Recognition
A variety of optional hardware exists that can accommodate different user preferences and situations. For users wanting more freedom while creating documents, a number of alternatives are available. A wireless microphone can be used by those who like to walk while they're thinking or don't want to be tethered to the PC for safety or other considerations. Shure Brothers Inc. makes a 3 component wireless microphone consisting of a headset/transmitter, battery pack, and receiver.

A slightly lower-tech solution is to use the familiar portable tape recorder, but to play the tape to the speech recognition software rather than sending it out for transcription. This type of equipment lets the user dictate anywhere, at any time and in nearly any situation. We have tested the Norcom 2500 tape recorder and found it to be excellent. Professionals who are experienced in using tape recorders for dictation will find an easy transition to speech recognition using a tape recorder. However, the dictator must still learn the speech recognition method of dictating. Periods must be placed at the end of each sentence. Commas and other punctuation must be verbally inserted. Numbers and acronyms must be handled in a special spoken manner. The best method is to learn initially by using a PC and then later convert over to the tape recorder.

Norcom also makes a coupler adapter which converts the output of the tape recorder for playback directly into the sound card jack. As with "live" speech recognition, there is a lag in translation time if you are using a 166 MHz PC, and considerably less with a 300 MHz PC. If a number of users in an office are using tape recorders, a single PC could act as an automatic transcription workstation. Some versions of the vendors' speech recognition software provide multi-user capability.

Hand-held Microphones
Some speech recognition software is not totally hands-free and requires certain functions to be selected and executed with a mouse. Philips Electronics has come out with a product called SpeechMike that integrates a track ball mouse, microphone and speaker in one hand-held wand. While it does tie up the use of one hand, some users may find it helpful for speech recognition software that is not totally hands-free. Users who object to repeated removal and replacement of the headset because of interruptions may also find this type of microphone more convenient.

Dictation/Telephone Switch Adapter
A number of companies, namely telephone headset manufacturers, are combining the speech dictation and telephone conversation functions into one system. This solves the problem of having to take off the headset to answer the telephone. The systems usually consist of an adapter that sits between the telephone and the headset and supplies a connection to the PC sound card as well as the headset. This type of setup is ideal for the user who dictates for extended periods of time, doesn't have secretarial staff to answer the phone and doesn't want to miss a phone call. Plantronics and VXI Corporation make such units.

Choosing a Desktop or Notebook
If you are planning to acquire a new PC for speech recognition, the main decision is whether to go with a desktop or a notebook. Desktop computers are appreciably less expensive (by at least $1000) than notebooks of the same general capabilities, are more flexible and reliable, and are cheaper and easier to maintain and upgrade. The main strengths of notebooks are that they are portable, can be used away from electrical sockets for several hours before recharging, and, with the proper warranty, last as long as you'd want to keep them anyway. Tape recorder users will usually want to go with the desktop.

Software Installation
Make sure you test your PC system's multimedia capability (CD-ROM, microphone and speaker operation) before you install any speech recognition software. Also, be aware that many PCs have pre-installed software that runs in the background and could interfere with the speech recognition software. After starting up your PC, press Ctrl-Alt-Del and then selectively stop this software from running (virus protection, calendar reminders, etc.) before installing the speech recognition software. Then go to Start - Programs - Accessories - Multimedia - Recording and Volume to set and test the audio system. If you can record from your headset mic and also play some of the WAV files, you are ready to begin installing the speech recognition software.

Microphone Setup
After the speech recognition software is installed, the software will walk you through the setup and testing of the microphone and speakers. Each software vendor's instructions, either in a booklet or on-line, provide adequate directions on how to do this.

User Enrollment
The enrollment process is the way the speech engine gets built for each user. Each user starts with a standard or generic voice model which then gets modified after the user enrolls. The enrollment refines the user's voice model based upon reading a preset series of sentences. Each speech recognition software package has both a quick and a lengthy enrollment process. Choose the former if you achieve good accuracy after reading about 50 sentences, or the latter if you have a heavy accent. Always be aware of which user voice model is running before you dictate, because a voice model can become corrupted if the wrong user is associated with your dictation or if you are correcting someone's dictation with your voice model.

Initial Enrollment Training (Processing)
Once you have gone through enrollment, the software processes or trains itself by updating the general voice model with the enrollment text for your way of speaking. Depending upon the length of the enrollment and the speed of your PC this processing may take 5 minutes or up to 130 minutes. Again, make sure that no other software is running prior to initiating this training or you could get an error and have to redo the enrollment.

The Art of Dictation
Correct dictation habits result in better recognition accuracy. Accuracy means user satisfaction, pride and continued use. The more the software is used, the higher the recognition accuracy. Conversely, if you are having problems, get help. Bad habits left uncorrected can produce poor accuracy, and that will result in the whole process taking more time than doing it the old way, especially if you are under a time constraint. It can very easily become a self-defeating cycle: less use, lower skill in using it, less accuracy and more bother. Ever since the industry went to $99 or less for "light" versions of speech recognition software, we have seen more failures or casual usage. One of the risks with inexpensive products is the lack of commitment to make them work.

Work Flow Solutions has staff that can achieve 99% accuracy with highly technical and specific vocabularies such as Podiatry and Pathology. The longer or the more unique the word, the better the accuracy is going to be. Misrecognition errors are going to be higher with general business correspondence (94 to 97% accuracy) than with Pathology (97 to 99% accuracy).

  • Have your reference material in front of you or in your hands.
  • If you're not used to dictating, outline what you want to say before you begin.
  • Don't look at the computer screen. Don't let it distract you - in fact turn away from it.
  • Go through the microphone setup prior to beginning a dictation session.
  • Concentrate on what you want to say and only pause between sentences if necessary to collect your thoughts.
  • Position the end of the mic at the corner of your mouth away from any airflow from your mouth.
  • Feel the mic end and make sure that the flat surface of the mic is facing your mouth.
  • Enunciate your words clearly and speak naturally. Remember this is composed speech not conversational speech with ah's and extra an's.
  • When dictating, don't over-enunciate your words or hesitate between words.
  • Say formatting or punctuation words quickly and multiple words as one-word.
  • Establish a voice profile for each dictation environment (equipment and place), e.g. KenOffice, KenCar, KenTape, KenHall (Convention hall).
  • Do not say your words too s-l-o-w-l-y, or so fast that you slur or mumble.

In the beginning, dictate about 80 to 90 words per minute and slowly increase it to 100 and then 120 words per minute. Maximum speed is approx. 130 to 140 words per minute but requires a great deal of practice to get to this speed while maintaining accuracy.

Using the Portable Tape Recorder
The portable tape recorder provides a significant benefit by eliminating the need to have a PC for each doctor, or to constrain the doctor to go to where the dictation PC is located. The tape recorder also provides a division of labor such that the doctor can delegate to the office staff the tasks of automatic tape transcribing, proofing and correction.

Cost Comparison - Traditional Transcribing
Some medical practices do a considerable amount of dictation annually, transcribed either by in-house staff or by an outside transcription service bureau. Some of these costs total $5,000 to $15,000 per doctor per year. A seven doctor practice could be spending $35,000 to $105,000 annually for transcription services. We will do a cost comparison between manual transcription and speech recognition with automatic transcription.

To determine costs of using conventional medical transcription the following rates and assumptions were used:

  1. The hourly rate of an in-house transcriptionist is $15 per hour.
  2. A doctor normally dictates at an average rate of 140 words per min.
  3. An average document consists of 500 words or 3.57 min. dictation per document.
  4. Each doctor dictates 16 documents per day, 250 days per yr. or 4,000 documents.
  5. The total transcription time per document is 1 min. setup + 4 times dictation time.

Based upon the above averages the traditional transcription service bureau labor cost for each doctor for the year would be:

(3.57 min. x 4) + 1 min. = 15.28 min. per document
15.28 min. x 4,000 documents / 60 = 1,018.7 hr
1,018.7 hr x $15/hr = $15,280, or roughly $15,000 per doctor per year
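The same arithmetic can be checked with a short Python sketch. The constants come from the numbered assumptions above; the function and variable names are illustrative only.

```python
# Assumptions taken from the numbered list in the text.
HOURLY_RATE = 15.0          # dollars per hour, in-house transcriptionist
WORDS_PER_DOC = 500
DICTATION_WPM = 140
DOCS_PER_YEAR = 16 * 250    # 16 documents per day, 250 days per year
SETUP_MIN = 1.0             # setup time per document
TYPING_FACTOR = 4           # transcription takes 4 times the dictation time


def traditional_cost_per_doctor(dictation_min=None):
    """Return (minutes per document, hours per year, dollars per year)."""
    if dictation_min is None:
        # Unrounded dictation time: 500 words at 140 words per minute.
        dictation_min = WORDS_PER_DOC / DICTATION_WPM
    minutes_per_doc = SETUP_MIN + TYPING_FACTOR * dictation_min
    hours_per_year = minutes_per_doc * DOCS_PER_YEAR / 60
    return minutes_per_doc, hours_per_year, hours_per_year * HOURLY_RATE


# Use the rounded 3.57 min figure so the result matches the text.
minutes, hours, dollars = traditional_cost_per_doctor(3.57)
print(f"{minutes:.2f} min/doc, {hours:.1f} hr/yr, ${dollars:,.0f}/yr")
```

Calling the function with no argument uses the unrounded dictation time and gives a figure a few dollars higher, which shows that the rounding in the text has no practical effect.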

Investment in Speech Recognition
Let's assume this organization adopted speech recognition technology. They would undertake the following:

  1. Each doctor receives the Norcom 2500 Tape Recorder & 6 Mini Tape Cassettes
  2. Each doctor receives adequate training - 8 to 10 hours per doctor
  3. The office sets up two PC based transcription workstations
  4. Each workstation has speech recognition software and a Customized Vocabulary
  5. Each workstation has the Norcom 2500 Recorder, Coupler and Power Adapter
  6. Each workstation has a Norcom Tape Player with foot pedal
  7. Each transcriptionist receives Norcom & speech recognition training

Cost Comparison - Speech Recognition
To determine the labor cost of speech recognition with automatic transcribing, the following assumptions are made:

  • Dictation speed will be less (100 wpm) but time per document is assumed to be the same because of the use of macros.
  • One mini tape cassette will hold four documents or 15 min of dictation per side.
  • Typing out the contents of a tape requires setup and 'playing' the tape (real time).
  • Setup requires selecting the proper user for the tape to be played and takes 1 min. per tape.
  • Tape 'playing' requires no staff labor - it is automatically typed out.
  • Each document can be proofread at a rate of 180 wpm or 2.8 min. for 500 word doc.
  • Correction time is assumed at 15 sec. per error. Therefore, for a 500 word doc:

    Accuracy Achieved    Total Number of Errors    Correction Time
    95%                  25                        6.25 min
    96%                  20                        5.0 min
    97%                  15                        3.75 min
    98%                  10                        2.5 min
    99%                  5                         1.25 min
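The correction-time figures follow directly from the 500 word document length and the 15 seconds assumed per error. A minimal Python sketch of that calculation, with illustrative names:

```python
WORDS_PER_DOC = 500
SECONDS_PER_ERROR = 15


def correction_minutes(accuracy_pct, words=WORDS_PER_DOC):
    """Return (number of errors, correction time in minutes) per document."""
    errors = round(words * (100 - accuracy_pct) / 100)
    return errors, errors * SECONDS_PER_ERROR / 60


for pct in (95, 96, 97, 98, 99):
    errors, minutes = correction_minutes(pct)
    print(f"{pct}%  {errors:>2} errors  {minutes:.2f} min")
```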

Based upon the above figures, staff labor to transcribe 16 documents, exclusive of document printout, would be:

    Setup time for 4 tapes (16 documents)     4.0 min.
    Proofing time (16 x 2.8)                 44.8 min.
    Correction time @ 98% (16 x 2.5)         40.0 min.
    Total                                    88.8 min.

Extending the above for a year, the cost is 250 days x 88.8 min. / 60 x $15/hr = $5,550, or roughly $5,500 per doctor per year.

This is 36.3% of the cost of the manual transcription method.
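As a check on the daily labor total, the annual cost and the percentage, here is a small Python sketch. It uses the rounded per-document times from the text (2.8 min. proofing, 2.5 min. correction at 98% accuracy); the variable names are illustrative.

```python
TAPES_PER_DAY = 4               # 16 documents at four per tape
DOCS_PER_DAY = 16
SETUP_MIN_PER_TAPE = 1.0
PROOF_MIN_PER_DOC = 2.8         # 500 words at 180 wpm, rounded as in the text
CORRECTION_MIN_PER_DOC = 2.5    # 10 errors at 15 seconds each (98% accuracy)
DAYS_PER_YEAR = 250
HOURLY_RATE = 15.0
TRADITIONAL_COST = 15280.0      # per doctor per year, from the earlier section

# Staff minutes per day: tape setup, proofing, then correction.
daily_minutes = (TAPES_PER_DAY * SETUP_MIN_PER_TAPE
                 + DOCS_PER_DAY * PROOF_MIN_PER_DOC
                 + DOCS_PER_DAY * CORRECTION_MIN_PER_DOC)
annual_cost = DAYS_PER_YEAR * daily_minutes / 60 * HOURLY_RATE
share_of_traditional = annual_cost / TRADITIONAL_COST

print(f"{daily_minutes:.1f} min/day, ${annual_cost:,.0f}/yr, "
      f"{share_of_traditional:.1%} of traditional cost")
```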

Capital Investment for a 7 Doctor Medical Practice

    Tape Recorders - 9 @ $395                    $ 3,555
    Tapes - 6 Pak 7 @ $35                        $   245
    Training 7 hr ea. (7+1) - 56 hr @ $150       $ 9,600
    PC Workstations 2 @ $1700                    $ 3,400
    Speech Recognition Software 2 @ $600         $ 1,200
    Custom Vocabulary 2 @ $900                   $ 1,990
    Transcription Player 2 @ $250                $   500
    Power Adapter 2 @ $50                        $   100
    Playback Coupler 2 @ $100                    $   200
    Total                                        $20,790
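The total can be verified by summing the line amounts. The sketch below takes the extended amounts exactly as listed rather than recomputing them from quantity and unit price; the dictionary keys are shortened labels chosen for illustration.

```python
# Extended amounts in dollars, copied from the capital investment list.
line_items = {
    "Tape recorders": 3555,
    "Tapes": 245,
    "Training": 9600,
    "PC workstations": 3400,
    "Speech recognition software": 1200,
    "Custom vocabulary": 1990,
    "Transcription players": 500,
    "Power adapters": 100,
    "Playback couplers": 200,
}

total = sum(line_items.values())

for name, amount in line_items.items():
    print(f"{name:<30} ${amount:>6,}")
print(f"{'Total':<30} ${total:>6,}")
```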

Savings & Return on Investment
The cost of the traditional transcription method is $15,280 per doctor per year, while the speech recognition with automatic transcription method is $5,550 per doctor per year, resulting in a savings of $9,730. Total practice savings per year are $9,730 x 7 = $68,110. The return on investment is $68,110 / $20,790 or 327%, and the payback period is approx. 15.8 weeks, or less than 4 months.
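The return and payback figures can be reproduced with a few lines of Python. This sketch assumes, as the text does, that return on investment means first-year savings divided by the capital investment, and that payback is measured in 52-week years; the function name is illustrative.

```python
def roi_and_payback(annual_savings, capital):
    """Return (ROI as a percentage, payback period in weeks)."""
    roi_pct = annual_savings / capital * 100
    payback_weeks = capital / annual_savings * 52
    return roi_pct, payback_weeks


DOCTORS = 7
savings_per_doctor = 15280 - 5550            # traditional minus speech recognition
practice_savings = savings_per_doctor * DOCTORS
roi_pct, payback_weeks = roi_and_payback(practice_savings, 20790)

print(f"Savings per doctor: ${savings_per_doctor:,}")
print(f"Practice savings:   ${practice_savings:,}")
print(f"ROI: {roi_pct:.1f}%, payback: {payback_weeks:.1f} weeks")
```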

Cost Comparison Comments
Two workstations are required because the office staff person alternates between them. A tape is first played on WS#1. When it has finished being typed out, and before proofing and correcting begin on WS#1, a tape is started on WS#2. When the WS#1 documents are finished, the staff person switches to WS#2 to proof and correct its documents.

The analysis also shows how important recognition accuracy is to transcription costs. A drop in accuracy from 98% to 95% raises the cost by about two-thirds, to $9,300 per doctor, and cuts the savings to $5,980. This still gives a good return on investment of over 200%. Further accuracy reductions lower the return and extend the payback. The other important consideration is that the professional people must buy into it by modifying their dictation habits. The financial incentive for good dictation can be a motivator.
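The sensitivity to accuracy can be explored with a short Python sketch that reruns the cost model at each accuracy level. It reuses the assumptions stated earlier (500 word documents, 16 documents and 4 tapes per day, 2.8 min. proofing, 15 sec. per error, 250 days, $15 per hour, $15,280 traditional cost, $20,790 capital, 7 doctors); the function and parameter names are illustrative.

```python
TRADITIONAL_COST = 15280    # dollars per doctor per year
CAPITAL = 20790             # dollars, whole practice
DOCTORS = 7


def annual_cost(accuracy_pct, words_per_doc=500, docs_per_day=16,
                tapes_per_day=4, setup_min_per_tape=1.0, proof_min=2.8,
                sec_per_error=15, days=250, rate=15.0):
    """Annual staff labor cost per doctor at a given recognition accuracy."""
    errors = words_per_doc * (100 - accuracy_pct) / 100
    correction_min = errors * sec_per_error / 60
    daily_min = (tapes_per_day * setup_min_per_tape
                 + docs_per_day * (proof_min + correction_min))
    return days * daily_min / 60 * rate


for pct in (99, 98, 97, 96, 95):
    cost = annual_cost(pct)
    savings = TRADITIONAL_COST - cost
    roi_pct = savings * DOCTORS / CAPITAL * 100
    print(f"{pct}%: cost ${cost:,.0f}, savings ${savings:,.0f}, ROI {roi_pct:.0f}%")
```

Each percentage point of accuracy lost adds 1.25 minutes of correction per document, which is why the cost climbs steadily as accuracy falls.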

Last Modified: Monday, August 07, 2000

WorkFlow Solutions Inc.