This is a product update announcement. I’m happy to say that our speech-to-text transcription (Automatic Speech Recognition, or ASR) AI model recently got an upgrade. Whenever a report gets created for a podcast or other audio file, part of our processing involves converting the speech to machine- and human-readable text.
The primary purpose for doing this is to enable two of our other checks: profanity (swearing) detection and restarted sentence detection. Detecting swear words is useful for shows that target a younger or family-friendly audience. Sentence restarts happen when the speaker makes a mistake and has to start again, and they often get missed by the person editing. Converting to text allows us to search and analyse the content more easily.
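To illustrate why text makes these checks easier, here is a minimal sketch of both ideas. It is not Audio Audit's actual implementation: the function names, the flagged word list and the "three words immediately repeated" rule are all assumptions made for the example.

```python
import re


def _words(transcript):
    """Lower-case the transcript and split it into plain words."""
    return re.findall(r"[a-z']+", transcript.lower())


def find_flagged_words(transcript, flagged):
    """Return (word_index, word) pairs for every word in the flagged set.

    `flagged` is a hypothetical word list supplied by the caller.
    """
    return [(i, w) for i, w in enumerate(_words(transcript)) if w in flagged]


def find_restarts(transcript, min_words=3):
    """Naive restart check: a run of `min_words` words said twice in a row.

    Returns (word_index, phrase) pairs. A real detector would need to be
    far more forgiving; this only catches exact, immediate repeats.
    """
    words = _words(transcript)
    hits = []
    i = 0
    while i + 2 * min_words <= len(words):
        first = words[i:i + min_words]
        second = words[i + min_words:i + 2 * min_words]
        if first == second:
            hits.append((i, " ".join(first)))
            i += min_words  # skip past the repeated run
        else:
            i += 1
    return hits
```

Neither check is practical on raw audio, but both become a few lines of string handling once a transcript exists.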
At the bottom of each report we also allow the user to copy or download our transcription to repurpose on their website or in show notes. Some audience members like to follow along with a transcription, which makes the content more accessible. A transcription also has a great SEO benefit, opening shows up to a wider organic audience.
Previous Tech
We used the VOSK speech recognition toolkit to do the conversion. It was reasonably state-of-the-art at the time (January 2022) and had a variety of pre-trained models of differing quality, allowing us to trade off speed against accuracy.
VOSK is naturally single-threaded, so to get good parallel performance from our multi-core worker processors we would split the audio into chunks of a few minutes each, cutting at the longest silence near each boundary, and run the chunks as separate processes via our task scheduling software.
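The splitting idea can be sketched roughly as follows. This is an illustration under stated assumptions, not our production code: it assumes the silences have already been detected as `(start, end)` pairs in seconds, and the `target` and `tolerance` values and the function name are made up for the example.

```python
def choose_cut_points(silences, duration, target=180.0, tolerance=60.0):
    """Pick cut points roughly every `target` seconds.

    For each chunk, look at the window target +/- tolerance after the
    previous cut and cut in the middle of the longest silence found there.
    If the window contains no silence, fall back to a hard cut at `target`.
    """
    cuts = []
    last = 0.0
    # Stop once the remaining audio fits in one chunk.
    while duration - last > target + tolerance:
        lo = last + target - tolerance
        hi = last + target + tolerance
        # (silence length, midpoint) for every silence centred in the window
        candidates = [
            (end - start, (start + end) / 2)
            for start, end in silences
            if lo <= (start + end) / 2 <= hi
        ]
        cut = max(candidates)[1] if candidates else last + target
        cuts.append(cut)
        last = cut
    return cuts


def cuts_to_chunks(cuts, duration):
    """Turn cut points into (start, end) ranges covering the whole file."""
    edges = [0.0, *cuts, duration]
    return list(zip(edges[:-1], edges[1:]))
```

Each resulting range can then be handed to its own worker process, and the partial transcripts joined back together in order.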
VOSK also doesn’t provide any punctuation or capitalisation, so we had to add that ourselves as a post-processing step.
New Tech
In September 2022 OpenAI released their Whisper speech recognition and transcription model as open source. It was trained on massive, partially-labelled datasets with a lot of compute power, and it beat almost all existing accuracy benchmarks by a wide margin.
Over the following months, libraries emerged around the model that improved its performance and refined its time-coding down to the individual word level.
By keeping up with this progress, running our own tests and evaluating performance, we’ve been able to roll out this technology to Audio Audit. It requires more computational power but allows us to remove quite a bit of our own code: the chunking pre-processing, the task management around it, and the punctuation post-processing. The end result is that you should notice reports generating slightly quicker than before, as well as being a lot more accurate.
Another brilliant side-effect is that, due to the way the model works, transcriptions are also available in about 10 different languages. The language is detected automatically, with no extra effort.
Extra Improvement
As one final bonus, we improved our transcription download options. Whereas before you could only download the transcription as plain text, you can now also download it in the SRT and WebVTT subtitle formats. These files can be attached to video versions of your podcast (YouTube etc.) or embedded within certain web audio players to improve accessibility.
Final Words
I hope this in-depth look into our audio transcription was interesting to you and that users of our service find the changes to be as vast an improvement as I think they are.
We will continue to keep up with further development in this area and improve our own technology along with it.
—
Photo courtesy of Conny Schneider and Nick Fewings
Whilst you’re here…
Audio Audit is an automatic benchmarking and proofing tool which checks the quality of your podcast MP3 files, giving you peace of mind before you publish.
It checks things like loudness, silences, restarted sentences, encoding, swearing and metadata.
Creating an account only takes a couple of minutes. You’ll soon be able to start uploading your own audio files and improving your shows.