Last week Sam Altman announced Whisper v3 on stage at OpenAI’s Dev Day. Like many in the community, I was eager to see how the model performed. After all, at Deepgram we love all things voice AI, so we decided to take it for a spin.

This post shows how I got the model running and the results of my testing. Getting the test setup working was relatively straightforward; the results, however, held some surprises.

I’ll show the peculiarities up front, then walk through the full analysis.
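Before diving in, it helps to have a yardstick for comparing a model’s output against a ground-truth transcript. The post doesn’t spell out its scoring method, but word error rate (WER) is the standard metric for this kind of evaluation; here is a minimal sketch (the function name `wer` and the normalization are my own, for illustration):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # d[i][j] = minimum edits to turn the first j hypothesis words
    # into the first i reference words (Levenshtein distance).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,      # deletion
                d[i][j - 1] + 1,      # insertion
                d[i - 1][j - 1] + sub # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("i have one strider", "i have one strider"))  # → 0.0
print(wer("a b c d", "a x c d"))                        # → 0.25
```

A WER of 0.25 means one word in four was wrong; production evaluations usually add punctuation stripping and number normalization on top of this.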

🔍 The Peculiarities We Found

Peculiarity #1:

Start at 4:06 in this audio clip (the same one embedded above). This is one of the files we used in our testing.

At that moment in the audio, the ground-truth transcription reads “Yeah, I have one Strider XS9. That one’s from 2020. I’ve got two of the Fidgets XSR7s from 2019. And the player tablet is a V2090 that’s dated 2015.”

However, the Whisper-v3 transcript says: