Guest article by Matthew Loxton, a Principal Healthcare Analyst and Professional MAXQDA Trainer.
For some people, such as myself, MAXQDA’s Transcription Mode tools are a wonderful help. The software is not, however, able to make us efficient in transcribing audio files. Even with the many functions available in MAXQDA’s Multimedia Browser, such as keyboard shortcuts and automatic speaker changes, I am still far too clumsy with F4, the pause button, and the various other controls to transcribe audio efficiently.
My typing is too slow, my working-memory too short, and I get confused between the function keys and buttons. As a result, transcribing a 30-minute audio file takes me several hours, and by the end, I am exhausted.
How MAXQDA’s Transcription Mode Feature Can Be Used in Conjunction with Machine Learning Voice-to-Text Services
The advent of machine learning represents a dramatic change in the effectiveness and efficiency of voice-to-text software. Where voice-activated systems and voice-to-text applications were previously stuck at low levels of accuracy and speed, machine learning has produced a variety of devices and services that boast accuracy of 95% or higher.
The same machine-learning technology can be used to transcribe interviews. I reviewed three different options and services, ultimately using one of them throughout a current project.
My aim was to speed up my transcription process, and to remove the frustration that normally accompanies this step in the research process for me. I recorded nine interviews, which ranged from 18-48 minutes, and in which there was one participant, the primary interviewer, and myself.
Three voice-to-text services
Based on some research into popular services, the three I looked at were:
| Service | Price per audio minute | Turnaround | Claimed accuracy | Technology |
|---|---|---|---|---|
| 1. TEMI.com | $0.10 | >5 min/hr | ~98% | Machine learning |
| 2. SPEXT.com (defunct as of May 2021) | $0.25 | >5 min/hr | ~99% | Machine learning |
| 3. REV.com | $1.00 | ~12 hr | ~100% | Machine + human |
I found that approximately 1-2% of the participant’s text from the two machine transcriptions needed edits, typically for similar-sounding words such as “IV” vs. “IB” or “wake” vs. “wait”. I also needed to change the speaker names manually. Error rates were higher in the interviewer text, but this was not significant for my needs.
The low cost and almost immediate turnaround of the machine transcriptions meant that for just over $20 in total, I could have very workable transcriptions, almost ready to code, within minutes for all nine half-hour interviews.
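The per-minute pricing makes the cost easy to estimate. A quick sketch, using the machine-only rate from the table above; the individual interview durations here are hypothetical examples within the 18-48 minute range, not the actual recordings:

```python
# Illustrative cost estimate for machine-only transcription at $0.10 per
# audio minute. The nine durations below are made-up examples within the
# 18-48 minute range mentioned in the article.
durations_min = [18, 20, 25, 28, 30, 32, 22, 24, 26]

rate_per_min = 0.10  # USD per audio minute (machine-only service)
total_cost = sum(durations_min) * rate_per_min
print(f"Total: ${total_cost:.2f} for {sum(durations_min)} minutes of audio")
```

At the human-assisted rate of $1.00 per minute, the same audio would cost ten times as much, which is the trade-off discussed at the end of this article.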
I found that making the few edits required was far less frustrating and time-consuming than manual transcription had been in the past. This is obviously a function of my typing speed and keyboard dexterity, but may be similar for many other researchers.
Recording and importing the interviews
I used Skype for Business to create the meeting requests for the interviews, provide Voice over IP (VoIP) services and dial-in numbers, and allow me to record the call as an MP4 file with good audio quality. I experimented with two alternative processes:
- In the first, I used the content of an MS Word document from the transcription provider;
- in the second, I used MAXQDA’s Import Transcripts with Timestamps function to import a SubRip (SRT) file with timestamps from the provider.
My transcription environment, captured in Figure 1, shows text from the machine learning service pasted into MAXQDA’s “Document Browser” window for the selected audio file imported into the Document System (highlighted in the “Document System” window).
At this point, I am about to edit the name of the second speaker using the Automatic Speaker Change and Autotext functions:
Fig. 1: MAXQDA’s Transcription Mode Feature
Method 1 – Microsoft Word
My process using the MS Word file was as follows:
- Download the transcription from the online service as an MS Word document,
- Import the MP4 into MAXQDA,
- Open the file for transcription,
- Paste the content of the Word file into the Transcription Mode “Document Browser” window,
- Run the audio in MAXQDA’s Multimedia Browser, and make any edits in the text as needed,
- Add timestamps at each speaker change or where desired.
Method 1 Findings
The method was effective, and I will continue to use this in production because it has greatly reduced the time and effort to transcribe audio tracks at very low cost. There were some caveats, however.
Firstly, the machine transcription did a reasonably good job of detecting speaker changes, but it sometimes got confused, thinking there was a different speaker when in reality the same person was continuing to speak (Figure 2).
This was not a major hurdle: the changeovers were reasonably obvious, and mistakes were easily corrected. Sometimes it switched to a new speaker a little late, several words into the next speaker’s text. In Figure 3, the highlighted text belongs to Speaker 1, not to the previous speaker. This too was reasonably easy to spot and edit.
Secondly, it did not handle multiple people talking at the same time. When there were interleaved or simultaneous speakers, it tended to treat them as a single speaker, and it also sometimes lost several words from each speaker.
This issue can be seen in the first paragraph in Figure 1 (above) where the participant and interviewer start the interview talking about whether the VoIP system had announced that it was recording. Both speakers were captured as a single person, and their phrases intermingled.
This was harder to identify and edit, but it tended to happen at specific points in the interview, such as during a transition in topic. However, there were at least one or two occurrences in each interview, especially when something startling or funny was said and multiple people interjected.
For example, when one participant commented that they had been left alone in the MRI room and staff had left for lunch, both interviewers interjected with comments, laughter, whistles, etc. The machine transcription missed some of the words, and mingled them together in small phrases. This was less easy to edit, but not any worse than was the case for me doing it manually.
Lastly, it often ignored audio that was significantly lower volume than the rest (it may have calculated an average volume and then filtered out quieter sounds as noise). One recurring situation brought this to light. At the end of each participant track, the interviewer typically thanked them in a much quieter voice and then used normal volume to initiate the next question or comment.
In many cases, these “asides” were not transcribed at all. This was not a problem for my situation, but it may be a significant issue if the participant varies in volume.
Method 2 – SRT File
My method for the SRT file used MAXQDA’s Import Transcripts with Timestamps function:
- Download the SRT file,
- Import the SRT into MAXQDA using the Import Transcripts with Timestamps function,
- Point to the associated MP4 file in the resulting dialogue box,
- Open the file for transcription,
- Run the audio in MAXQDA’s Multimedia Browser,
- Make edits in the text as needed.
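For readers unfamiliar with the format, an SRT file is plain text: a sequence of numbered cues, each with a start and end timestamp on a `-->` line, followed by the cue text. A minimal, made-up excerpt (the speaker labels and wording are illustrative, not from my interviews):

```
1
00:00:01,000 --> 00:00:04,200
Speaker 1: Thanks for taking the time to talk with us today.

2
00:00:04,400 --> 00:00:07,900
Speaker 2: Happy to help.
```

Each cue becomes its own timestamped fragment when imported, which is the source of the fragmented layout discussed in the findings.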
Method 2 Findings
The accuracy rate was the same, and the same cost and logistics considerations applied. The caveats were also similar, and although importing the SRT had an advantage in not requiring me to add timestamps manually, it resulted in a highly fragmented text layout inherited from the SRT file structure.
While the audio and text tracked well together, viewing the text as a vertical list of fragments was difficult to read, and coding was made significantly more difficult. As a result, I did not continue with the use of the SRT file method after the initial test.
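One could imagine post-processing the cues before import instead of abandoning the method. The sketch below is purely illustrative and was not part of my workflow: `merge_srt` is a hypothetical helper that assumes each cue’s text begins with a “Speaker N:” label, which real service output may not. It merges consecutive cues from the same speaker into single paragraphs, keeping only the first cue’s start time:

```python
import re

# Illustrative sketch: merge consecutive SRT cues from the same speaker
# into one paragraph, keeping the first cue's start time.
# Assumes cue text begins with a "Speaker N:" label (an assumption; real
# transcription-service SRT output may label speakers differently).

CUE_RE = re.compile(
    r"\d+\s*\n(\d{2}:\d{2}:\d{2},\d{3}) --> \d{2}:\d{2}:\d{2},\d{3}\s*\n"
    r"(.*?)(?:\n\n|\Z)",
    re.S,
)
SPEAKER_RE = re.compile(r"^(Speaker \d+):\s*(.*)", re.S)

def merge_srt(srt_text):
    """Return a list of (start_time, speaker, merged_text) paragraphs."""
    paragraphs = []
    for start, text in CUE_RE.findall(srt_text):
        m = SPEAKER_RE.match(text.strip())
        speaker, words = (m.group(1), m.group(2)) if m else (None, text.strip())
        words = " ".join(words.split())  # collapse line breaks inside a cue
        if paragraphs and paragraphs[-1][1] == speaker:
            # Same speaker as the previous cue: extend that paragraph.
            start0, spk, so_far = paragraphs[-1]
            paragraphs[-1] = (start0, spk, so_far + " " + words)
        else:
            paragraphs.append((start, speaker, words))
    return paragraphs

sample = """1
00:00:01,000 --> 00:00:03,500
Speaker 1: Thanks for joining

2
00:00:03,600 --> 00:00:06,000
Speaker 1: us today.

3
00:00:06,100 --> 00:00:08,000
Speaker 2: Happy to be here.
"""

for start, speaker, text in merge_srt(sample):
    print(f"[{start}] {speaker}: {text}")
```

In principle, the merged output could then be handled like the Word-document text in Method 1, pasting it into Transcription Mode and adding timestamps at speaker changes.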
The low cost and short turnaround time make machine-learning transcription services worthwhile for researchers whose keyboard dexterity makes manual transcription tedious. The services do not offer perfect transcription, however, and each method of using them comes with caveats.
I found the MS Word import preferable to the SRT method: what was lost in the more comprehensive timestamping was gained in the greater readability of the resulting text.
My overall finding was that both TEMI and SPEXT were good enough to continue to use, but that if high fidelity was needed and cost and time were less of an issue, the human-machine combination would be an attractive option.
Editor’s note: this post has been updated from its original version published in April 2019.