Introduction
Recent advances in PC horsepower have made it possible to run software that interprets the spoken human voice and translates the words or phrases into a desired action or text. These words can be commands or navigation directions to the computer, or they can be text and numbers dictated to a word processor or spreadsheet. The software is not yet advanced enough to always know from context alone whether you are issuing commands or dictating, but through the use of natural pauses and keywords it does surprisingly well.
There are three major software offerings on the market: IBM's ViaVoice Gold, Dragon Systems' NaturallySpeaking and Lernout & Hauspie Speech Products' Voice Xpress Plus. Each combines text formatting commands with continuous-speech (natural speech) text recognition. Some of these vendors offer varying degrees of hands-free operation depending upon make and model, which is invaluable to pathologists and other professionals who are accustomed to tape recorder dictation while working with both hands.
State of the Art
Speech recognition technology can now accept continuous speech as opposed to discrete speech. Discrete speech required a distinct pause of 1/10 of a second between words, which forced the operator to learn an unnatural speech pattern. Under no circumstance should you consider an older discrete-speech version. Modern Pentium processors now permit acquisition of continuous speech patterns and the necessary wave-form comparative analysis to be performed in an acceptable time frame, a delay of about 5 seconds. This means that as you dictate in your normal voice, about 5 seconds go by before the words appear on the screen, already checked for spelling and grammar context. The newest 300 MHz and 400 MHz CPUs combined with 96 MB of RAM virtually eliminate the time lag, though this consideration is less important than you might think, since one should be dictating without looking at the computer's monitor. Time-pressured professionals, who can get the most benefit from this technology, already use dictation equipment while viewing X-ray charts, biopsies or client records.
How it Works
A microphone and PC sound card are used to produce a digital wave-form from analog human speech. Complex algorithms are used to isolate, identify and interpret the
individual phonemic components of each spoken word. Each user enrolls by speaking a defined text, known in advance, that creates a voice model. Enrollment produces a personalized collection of user files that statistically model how the phonemes of a word correspond to the data produced by the acoustic processor. The system uses a speech engine that updates or averages new speech data with existing voice model files. If the user indicates an error by clicking on an improperly recognized word or phrase, alternate statistical choices are presented for correction by selection and substitution. This step is very important to the continued improvement of the recognition system, which learns from these mistakes.
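The vendors do not publish how their engines update a voice model, so the following Python sketch is purely illustrative. It assumes, hypothetically, that the model stores one averaged feature vector per phoneme and that each new sample is blended into that average. The function name, the feature values and the `weight` parameter are all invented for the example and do not describe any product's internals.

```python
from typing import Dict, List


def update_voice_model(model: Dict[str, List[float]], phoneme: str,
                       features: List[float], weight: float = 0.25) -> None:
    """Blend a new feature vector into the stored average for one phoneme.

    Hypothetical illustration only: `weight` is the share given to the new sample.
    """
    old = model.get(phoneme)
    if old is None:
        # First time this phoneme is seen: store the sample as the starting average.
        model[phoneme] = list(features)
        return
    model[phoneme] = [(1 - weight) * o + weight * f for o, f in zip(old, features)]


# Made-up numbers: an enrolled average, then one new sample for the same phoneme.
model = {"ah": [1.0, 2.0]}
update_voice_model(model, "ah", [2.0, 4.0], weight=0.5)
print(model["ah"])  # [1.5, 3.0]
```

The point of the sketch is only that each new sample nudges the stored profile, which is why feeding the model the wrong word, or the wrong user's voice, degrades it.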
Users must be cautioned to know the difference between editing dictated text and correcting dictated text. Every word has a spoken audio wave-form associated with it, and error correction reinforces and updates the enrolled user's voice model files. One should never do editing (changing the word or phrase) while in text correction mode. Each speech recognition vendor has error correction capabilities, but not all can do it hands-free. Delegated correction capability (where someone other than the dictator does the correction) is a desirable feature for those environments where clerical staff performs duties of this nature. The integrity of the user's voice profile is not compromised in these circumstances, since the delegated corrector is matching the originator's intended spoken word, as played back, with the proper word. Again, some of the lower priced speech recognition products don't offer this feature; most of the higher priced models have it.
Where it Works
Speech recognition is especially well suited to professional environments where a large specialized vocabulary is employed. Time savings include eliminating staff training for the highly technical vocabulary, eliminating a shorthand dictation process where used, and instantaneous availability of a 98%+ accurate first draft. A high return on investment (ROI) is guaranteed for organizations using outside Transcription Service Bureaus. See our transcription cost analysis later in this report.
Motivated users will always reap the greatest benefit from new technology. Users with some handicap often show the most dedication to learning and using speech recognition and making it part of their lives. The cost of speech recognition is not significant: a professional version for use in a medical or technical environment, with custom vocabularies and the necessary training to assure accuracy and therefore economic success, costs about $2,000 where an existing PC can be utilized. Work Flow Solutions normally charges about $1200 for the software and $800 for installation, setup, orientation, training and coaching.
Required Hardware
A PC with at least a 166 MHz Intel Pentium MMX processor and the Windows 95 or NT operating system is required.
More preferable is a 200 or 233 MHz processor. The minimum RAM varies with the operating system, but since memory is now inexpensive, start with 64 MB or spend $50 to upgrade to 96 MB or 128 MB. The amount of hard drive storage varies between the speech recognition vendors, with IBM's ViaVoice Gold and Dragon's Deluxe version requiring 125 MB and Dragon's Personal version needing only 60 MB. A CD-ROM drive, 3.5" floppy drive, mouse, monitor, keyboard, speakers and sound card make up the balance of the PC. Beyond the specifications of the PC, the quality of the sound card is one of the most important factors in the speech recognition system. Generally, all the vendors recommend a 16-bit SoundBlaster or Mwave compatible sound card. It has been our experience that other sound card chips and vendors can give equal or better sound quality; Ensoniq and Yamaha are two that give good performance. In the final analysis, testing is the only way to determine if the PC's circuitry is compatible with the installed sound card. This is especially true of clone notebook computers. Be wary and test any notebook computer before you buy it for speech recognition. Even some IBM ThinkPad notebooks will not run IBM's ViaVoice software.
Required Software
The basic software is often referred to as the `Engine'. It provides features and functions similar to a word processor. In addition, it permits multiple users to create and
store voice profiles which are unique to each user. Each user dictates using their own profile for improved accuracy. Other capabilities permit navigation, text formatting, correction, vocabulary building, dictation playback and text-to-speech. Each software package includes a headset microphone with a single earpiece speaker. The lower cost software versions provide retail-type headsets. The Dragon Deluxe version includes a rugged, professional headset. All these headsets use noise-cancelling mics, which permit use even in a somewhat noisy exhibition hall.
Custom Vocabularies
Perhaps the single most important factor in achieving high recognition accuracy is the use of a custom vocabulary. This is a dictionary of words that are unique to a specific profession, e.g. Podiatry, Pathology, Patent Law or Environmental Engineering. These words are taken from the total vocabulary and moved to the active portion of the vocabulary. Work Flow Solutions provides custom vocabularies to the medical specialties, and can build custom vocabularies for other professional activities.
Work Flow Automation
Significant productivity benefits can be achieved by taking advantage of some of the advanced features of the
speech recognition software. Each vendor offers the capability of creating sophisticated word or phrase macros and complex templates. The latter combine fixed and variable text to build simple reports or lengthy contracts in which the variable text is a series of `fields' to be filled in verbally. Some analysis is required to determine what repetitive work is best suited for macros or templates. For an office, it may be desirable to create standards so that everyone using speech recognition uses the same macros and templates. This way they would be constructed once and replicated for all users.
Training
It is necessary to know what commands and formatting words to use to achieve acceptable results. Training will accelerate this learning process, so that good accuracy can be reached early. These results reinforce the satisfaction factor and help one to continue using speech recognition. Initial successful results thereby lead to improved recognition and further use. The repetition and correction of misrecognized words improves the system and creates a cycle of success. Without training and coaching or significant practice the opposite can occur, requiring enrollment to be repeated from the beginning.
Mobile Speech Recognition
A variety of optional hardware exists that can accommodate different user preferences and situations. For those users wanting more freedom while creating documents, a number of alternatives are available. A wireless microphone can be used by those who like to walk while they're thinking or don't want to be tethered to the PC for safety or other considerations. Shure Brothers Inc. makes a three-component wireless microphone consisting of a headset/transmitter, battery pack, and receiver.
A slightly lower-tech solution is to use the familiar portable tape recorder, but to play the tape to the speech recognition software rather than sending it out for transcription. This type of equipment lets the user dictate anywhere, at any time and, in certain circumstances, under almost any conditions. We have tested the Norcom 2500 tape recorder and found it to be excellent. Professionals who are experienced in using tape recorders for dictation will find an easy transition to speech recognition using a tape recorder. However, the dictator must still learn the speech recognition method of dictating. Periods must be spoken at the end of each sentence. Commas and other punctuation must be verbally inserted. Numbers and acronyms must be spoken in a special manner. The best method is to learn initially by using a PC and then later convert over to the tape recorder. Norcom also makes a coupler adapter which converts the output of the tape recorder for playback directly into the sound card jack. As with "live" speech recognition, there is a lag in translation time if you are using a 166 MHz PC, and considerably less with a 300 MHz PC. If a number of users in an office are using tape recorders, a single PC could act as an automatic transcription workstation. Some versions of the vendors' software provide multi-user capability.
Hand-held Microphones
Some speech recognition software is not totally hands-free, requiring certain functions to be selected and executed with a mouse. Philips Electronics has come out with a product called SpeechMike that integrates a trackball mouse, microphone and speaker in one hand-held wand. While it does tie up one hand, some users may find it helpful for speech recognition software that is not totally hands-free. Users who object to repeatedly removing and replacing the headset because of interruptions may also find this type of microphone more convenient.
Dictation/Telephone Switch Adapter
A number of companies, namely telephone headset manufacturers, are combining the speech dictation and telephone conversation functions into one system. This solves the problem of having to take off the headset to answer the telephone. The systems usually consist of an adapter that sits between the telephone and the headset and supplies a connection to the PC sound card as well as the headset. This type of setup is ideal for the user who dictates for extended periods of time, doesn't have secretarial staff to answer the phone and doesn't want to miss a phone call. Plantronics and VXI Corporation make such units.
Choosing a Desktop or Notebook
If you are planning to acquire a new PC for speech recognition, the main decision is whether to go with a desktop or a notebook. Desktop computers are appreciably less expensive than notebooks of the same general capabilities (by at least $1000), are more flexible and reliable, and are cheaper and easier to maintain and upgrade. The main strengths of notebooks are that they are portable, can be used away from electrical sockets for several hours before recharging and, with the proper warranty, last as long as you'd want to keep them anyway. Tape recorder users will usually want to go with the desktop.
Software Installation
Make sure you test your PC system's multimedia capability (CD-ROM, microphone and speaker operation) before you install any speech recognition software. Also, be aware that many PCs have pre-installed software that runs in the background and could interfere with the speech recognition software. After starting up your PC, press Ctrl-Alt-Del and then selectively stop this software from running (virus protection, calendar reminders, etc.) before installing the speech recognition software. Then go to Start - Programs - Accessories - Multimedia - Recording and Volume to set and test the audio system. If you can record from your headset mic and also play some of the WAV files, you are ready to begin installing the speech recognition software.
Microphone Setup
After the speech recognition software is installed, it will walk you through the setup and testing of the microphone and speakers. Each software vendor's instructions, whether in a booklet or on-line, provide adequate directions on how to do this.
User Enrollment
The enrollment process is the way the speech engine gets built for each user. Each user starts with a standard or generic voice model, which is then modified as the user enrolls. The enrollment refines the user's voice model based upon reading a preset series of sentences. Each speech recognition software package has both a quick and a lengthy enrollment process. Choose the former if you achieve good accuracy after reading about 50 sentences, or the latter if you have a heavy accent. Always be aware of which user voice model is running before dictating, because a voice model can become corrupted if the wrong user is associated with your dictation or if you are correcting someone else's dictation with your voice model.
Initial Enrollment Training (Processing)
Once you have gone through enrollment, the software processes, or trains itself, by updating the general voice model with the enrollment text for your way of speaking. Depending upon the length of the enrollment and the speed of your PC, this processing may take anywhere from 5 minutes to 130 minutes. Again, make sure that no other software is running before initiating this training, or you could get an error and have to redo the enrollment.
The Art of Dictation
Correct dictation habits result in better recognition accuracy. Accuracy means user satisfaction, pride and
continued use. The more the software is used, the higher the recognition accuracy. Conversely, if you are having problems, get help. Bad habits left uncorrected produce poor accuracy, and the whole process then takes more time than doing it the old way, especially if you are under a time constraint. It can very easily become a self-defeating cycle: less use, lower skill in using it, less accuracy and more bother.
Ever since the industry went to $99 or less for "light" versions of speech recognition software, we have seen more failures and casual usage. One of the risks with inexpensive products is the lack of commitment to make them work. Work Flow Solutions has staff who can achieve 99% accuracy with highly technical and specific vocabularies such as Podiatry and Pathology. The longer or more unique the word, the better the accuracy is going to be. Misrecognition errors are therefore more frequent with general business correspondence (94 to 97% accuracy) than with Pathology (97 to 99% accuracy).
- Have your reference material in front of you or in your hands.
- If you're not used to dictating, outline what you want to say before you begin.
- Don't look at the computer screen. Don't let it distract you - in fact turn away from it.
- Go through the microphone setup prior to beginning a dictation session.
- Concentrate on what you want to say and only pause between sentences if necessary to collect your thoughts.
- Position the end of the mic at the corner of your mouth away from any airflow from your mouth.
- Feel the mic end and make sure that the flat surface of the mic is facing your mouth.
- Enunciate your words clearly and speak naturally. Remember this is composed speech not conversational speech with ah's and extra an's.
- When dictating, don't over-enunciate your words or hesitate between words.
- Say formatting or punctuation words quickly, and say multi-word commands as if they were one word.
- Establish a voice profile for each dictation environment (equipment and place), e.g. KenOffice, KenCar, KenTape, KenHall (convention hall).
- Do not say your words too s-l-o-w-l-y, or so fast that you slur or mumble.
In the beginning, dictate at about 80 to 90 words per minute and slowly increase to 100 and then 120 words per minute. Maximum speed is approximately 130 to 140 words per minute, but it takes a great deal of practice to reach this speed while maintaining accuracy.
Using the Portable Tape Recorder
The portable tape recorder provides a significant benefit by eliminating the need to have a PC for each doctor, or for the doctor to go to wherever the dictation PC is located. The tape recorder also provides a division of labor, in that the doctor can delegate to the office staff the tasks of automatic tape transcribing, proofing and correction.
Cost Comparison - Traditional Transcribing
Some medical practices do a considerable amount of dictation annually, transcribed either by in-house staff or by an outside transcription service bureau. These costs can total $5,000 to $15,000 per doctor per year. A seven-doctor practice could be spending $35,000 to $105,000 annually for transcription services. We will do a cost comparison between manual transcription and speech recognition with automatic transcription.
To determine the cost of conventional medical transcription, the following rates and assumptions were used:
- The hourly rate of an in-house transcriptionist is $15 per hour.
- A doctor normally dictates at an average rate of 140 words per min.
- An average document consists of 500 words, or 3.57 min. of dictation per document.
- Each doctor dictates 16 documents per day, 250 days per yr., or 4,000 documents per yr.
- The total transcription time per document is 1 min. setup + 4 times dictation time.
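As a cross-check, the assumptions listed above can be worked through in a few lines of Python. This is only a sketch of the arithmetic; the constant and function names are illustrative and not part of any vendor's software.

```python
# Sketch of the manual-transcription cost model, using the assumptions listed above.
HOURLY_RATE = 15.0          # in-house transcriptionist, dollars per hour
DICTATION_WPM = 140         # average dictation speed, words per minute
WORDS_PER_DOC = 500         # average document length
DOCS_PER_YEAR = 16 * 250    # 16 documents a day, 250 days a year
SETUP_MIN = 1.0             # setup time per document, minutes
TYPING_FACTOR = 4           # transcription takes 4 times the dictation time


def manual_cost_per_doctor() -> float:
    """Annual manual transcription labor cost for one doctor, in dollars."""
    dictation_min = WORDS_PER_DOC / DICTATION_WPM            # about 3.57 min
    minutes_per_doc = SETUP_MIN + TYPING_FACTOR * dictation_min
    hours_per_year = minutes_per_doc * DOCS_PER_YEAR / 60
    return hours_per_year * HOURLY_RATE


print(f"${manual_cost_per_doctor():,.0f} per doctor per year")
```

Carried at full precision the result is about $15,286; rounding the dictation time to 3.57 min. first, as the hand calculation does, gives $15,280.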
Based upon the above averages, the traditional transcription service bureau labor cost for each doctor for the year would be:
3.57 min. x 4 + 1 min. = 15.28 min. per document
15.28 min. x 4,000 documents / 60 = 1,018.7 hr
1,018.7 hr x $15/hr = $15,280, or about $15,000 per doctor per year
Investment in Speech Recognition
Let's assume this organization adopted speech recognition technology. They would undertake the following:
- Each doctor receives the Norcom 2500 Tape Recorder & 6 Mini Tape Cassettes
- Each doctor receives adequate training - 8 to 10 hours per doctor
- The office sets up two PC based transcription workstations
- Each workstation has speech recognition software and a Customized Vocabulary
- Each workstation has the Norcom 2500 Recorder, Coupler and Power Adapter
- Each workstation has a Norcom Tape Player with foot pedal
- Each transcriptionist receives Norcom & speech recognition training
Cost Comparison - Speech Recognition
To determine the labor cost of speech recognition with automatic transcribing, the following assumptions are made:
- Dictation speed will be less (100 wpm) but time per document is assumed to be the same because of the use of macros.
- One mini tape cassette will hold four documents or 15 min of dictation per side.
- Typing out the contents of a tape requires setup and `playing' the tape (real time).
- Setup requires selecting the proper user for the tape to be played: 1 min. per tape.
- Tape `playing' requires no staff labor - it is automatically typed out.
- Each document can be proofread at a rate of 180 wpm or 2.8 min. for 500 word doc.
- Correction time is assumed at 15 sec. per error therefore for a 500 word doc:
Accuracy Achieved | Total Number of Errors | Correction Time
95% | 25 | 6.25 min
96% | 20 | 5.0 min
97% | 15 | 3.75 min
98% | 10 | 2.5 min
99% | 5 | 1.25 min
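The correction-time figures follow directly from the 15 seconds per error assumption. A minimal Python sketch (names are illustrative) reproduces them for a 500-word document:

```python
# Sketch: correction time per 500-word document at a given recognition accuracy.
WORDS_PER_DOC = 500
SECONDS_PER_ERROR = 15


def correction_minutes(accuracy_pct: int) -> float:
    """Minutes needed to correct one document at a whole-percent accuracy."""
    errors = WORDS_PER_DOC * (100 - accuracy_pct) // 100
    return errors * SECONDS_PER_ERROR / 60


for pct in (95, 96, 97, 98, 99):
    print(f"{pct}%: {correction_minutes(pct):.2f} min")
```

Each percentage point of accuracy is worth 5 errors, or 1.25 minutes of staff time, per document.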
Based upon the above figures, staff labor to transcribe 16 documents, exclusive of document printout, would be:
Setup time for 4 tapes (16 documents) | 4.0 min.
Proofing time (16 x 2.8) | 44.8 min.
Correction time @ 98% (16 x 2.5) | 40.0 min.
Total | 88.8 min.
Extending the above for a year, the cost is 250 days x 88.8/60 hr @ $15/hr = $5,550, or about $5,500/doctor/year. This is 36.3% of the cost of the manual transcription method.
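The daily and annual labor figures can be reproduced with a short Python sketch. It uses the rounded per-document times from the text (2.8 min. proofing, 2.5 min. correction at 98%); the names are illustrative only.

```python
# Sketch of the speech recognition labor cost per doctor, from the figures above.
HOURLY_RATE = 15.0
DOCS_PER_DAY = 16
DAYS_PER_YEAR = 250
DOCS_PER_TAPE = 4
SETUP_MIN_PER_TAPE = 1.0
PROOF_MIN_PER_DOC = 2.8        # 500 words at 180 wpm, rounded as in the text
CORRECTION_MIN_PER_DOC = 2.5   # 10 errors at 15 sec each (98% accuracy)
MANUAL_COST = 15_280.0         # manual transcription cost per doctor per year


def daily_staff_minutes(correction_min: float = CORRECTION_MIN_PER_DOC) -> float:
    """Staff minutes per day for one doctor's 16 documents."""
    tapes = DOCS_PER_DAY / DOCS_PER_TAPE
    return (tapes * SETUP_MIN_PER_TAPE
            + DOCS_PER_DAY * PROOF_MIN_PER_DOC
            + DOCS_PER_DAY * correction_min)


def annual_cost(correction_min: float = CORRECTION_MIN_PER_DOC) -> float:
    """Annual staff labor cost per doctor, in dollars."""
    return daily_staff_minutes(correction_min) * DAYS_PER_YEAR / 60 * HOURLY_RATE


cost = annual_cost()
print(f"{daily_staff_minutes():.1f} min per day, ${cost:,.0f} per year, "
      f"{cost / MANUAL_COST:.1%} of the manual cost")
```

Passing a different correction time, for example 6.25 for 95% accuracy, shows how quickly the labor cost grows as accuracy falls.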
Capital Investment for a 7 Doctor Medical Practice
Tape Recorders - 9 @ $395 | $3555
Tapes - 6 Pak 7 @ $35 | $245
Training 7 hr ea. (7+1) @56 @150 | $9600
PC Workstations 2 @ $1700 | $3400
Speech Recognition Software 2 @ $600 | $1200
Custom Vocabulary 2 @ $900 | $1990
Transcription Player 2 @ $250 | $500
Power Adapter 2 @ $50 | $100
Playback Coupler 2 @ $100 | $200
Total | $20790
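The total can be checked by adding up the line amounts. The Python sketch below takes each amount exactly as printed in the table and does not recompute it from the quantities and unit prices; the dictionary keys are shortened labels chosen for the example.

```python
# Line amounts exactly as printed in the capital investment table (dollars).
CAPITAL_ITEMS = {
    "Tape recorders": 3555,
    "Tapes": 245,
    "Training": 9600,
    "PC workstations": 3400,
    "Speech recognition software": 1200,
    "Custom vocabulary": 1990,
    "Transcription players": 500,
    "Power adapters": 100,
    "Playback couplers": 200,
}

total = sum(CAPITAL_ITEMS.values())
print(f"Total capital investment: ${total:,}")  # Total capital investment: $20,790
```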
Savings & Return on Investment
The cost of the traditional transcription method is $15,280 per doctor per year, while the speech recognition with automatic transcription method costs $5,550 per doctor per year, resulting in a savings of $9,730 per doctor. Total practice savings per year are $9,730 x 7 = $68,110. The return on investment is $68,110 / $20,790, or 327%, and the payback period is approximately 15.8 weeks, or less than 4 months.
Cost Comparison Comments
Two workstations are required so that the office staff person can alternate between them. A tape is first played on WS#1; when it has finished being typed out, a tape is started on WS#2 before the proofing and correcting of the documents on WS#1 begins. When finished, the staff person switches to WS#2 to do the proofing and corrections on its documents.
The analysis also shows how important recognition accuracy is to transcription costs. A drop in accuracy from 98% to 95% raises the cost by about two-thirds, to $9,300 per doctor, and cuts the savings to $5,980. This still gives a good return on investment of over 200%. Further accuracy reductions lower the return and extend the payback. The other important consideration is that the professional people must buy into it by modifying their dictation habits. The financial incentive for good dictation can be a motivator.
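The sensitivity of savings, return and payback to accuracy can be explored with a small Python sketch built on the report's own figures (7 doctors, $20,790 capital, $15,280 manual cost per doctor). The function names are illustrative, and the model is the simplified one used in this report, not a general costing tool.

```python
# Sketch: savings, return on investment and payback at different accuracy levels.
HOURLY_RATE = 15.0
DOCTORS = 7
CAPITAL = 20_790.0       # total capital investment from the table
MANUAL_COST = 15_280.0   # manual transcription cost per doctor per year


def speech_cost(accuracy_pct: int) -> float:
    """Annual speech recognition labor cost per doctor at a whole-percent accuracy."""
    errors_per_doc = 500 * (100 - accuracy_pct) // 100
    correction_min = errors_per_doc * 15 / 60       # 15 sec per error
    daily_min = 4 * 1.0 + 16 * 2.8 + 16 * correction_min
    return daily_min * 250 / 60 * HOURLY_RATE


def roi_and_payback(accuracy_pct: int):
    """Return (practice savings per year, ROI as a fraction, payback in weeks)."""
    savings = (MANUAL_COST - speech_cost(accuracy_pct)) * DOCTORS
    return savings, savings / CAPITAL, CAPITAL / savings * 52


for pct in (98, 95):
    savings, roi, weeks = roi_and_payback(pct)
    print(f"{pct}%: savings ${savings:,.0f}, ROI {roi:.0%}, payback {weeks:.1f} weeks")
```

At 98% accuracy the sketch reproduces the $68,110 savings and a payback of just under 16 weeks; at 95% the savings fall to $41,860 and the payback stretches to roughly half a year.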