TL;DR:

  • Deepgram offers a fully managed Whisper API that’s faster, more reliable, and cheaper than OpenAI's. 

  • But even with these improvements, Deepgram Nova is faster, cheaper, and more accurate than any Whisper model on the market (including our own).

  • Beyond performance and cost, Whisper models lack critical features and functionality that can impede successful productization.

  • Deepgram Smart Formatting is now available and delivers entity formatting results superior to OpenAI Whisper's.

In September 2022, OpenAI released Whisper, its general-purpose, open source model for automatic speech recognition (ASR) and translation tasks. OpenAI researchers developed Whisper to study speech processing systems trained under large-scale weak supervision and, in OpenAI's own words, for "AI researchers studying robustness, generalization, capabilities, biases, and constraints of the current model." In contrast, Deepgram has developed a Language AI platform that includes speech-to-text APIs that enable software developers to quickly build scalable, production-quality products with voice data.

OpenAI offers Whisper in five model sizes, ranging from 39 million to over 1.5 billion parameters. Larger models tend to provide higher accuracy at the cost of increased processing time; keeping them fast requires additional computing resources.
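This size/accuracy tradeoff can be sketched as a simple selection problem. Only the endpoints (39M and 1.5B+ parameters) are stated above; the intermediate counts below come from the Whisper release and should be treated as assumptions, as is the helper function itself.

```python
# Approximate parameter counts (in millions) for the five Whisper sizes.
# tiny and large are stated in the text; the rest are assumed from the
# Whisper release and may differ across model revisions.
WHISPER_SIZES_M = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large": 1550,
}

def largest_model_within(param_budget_m: int) -> str:
    """Pick the largest (most accurate) model that fits a parameter budget."""
    fits = [name for name, p in WHISPER_SIZES_M.items() if p <= param_budget_m]
    # The dict is ordered smallest -> largest, so the last fit is the biggest.
    return fits[-1] if fits else "tiny"

print(largest_model_within(300))  # -> small
```

In practice the budget would be driven by latency targets and GPU memory rather than a raw parameter count, but the shape of the decision is the same: more accuracy costs more compute.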

Whisper is an open source software package and can be a great choice for hobbyists, researchers, and developers interested in creating product demos or prototypes, or conducting technical research on AI speech recognition and translation. However, when it comes to building production systems at scale involving real-time processing of streaming voice data, there are a number of considerations that may make Whisper less suitable than commercially available ASR solutions. Some of its notable limitations include:

  • Whisper is slow and expensive

  • Only Large-v2 is available via API (Tiny, Base, Small, and Medium models are excluded)

  • No built-in diarization, word-level timestamps, or keyword detection

  • 25MB file size cap and a low limit on concurrent requests per minute

  • No support for transcription via hosted URLs or callbacks

  • No out-of-the-box real-time transcription; no streaming support, only batch processing of pre-recorded audio

  • No model customization or ability to train Whisper on your own data to improve performance

  • Limited entity formatting
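Several of the missing features above (diarization, formatting, async callbacks, hosted-URL input) are request options in managed ASR APIs. As a sketch, here is how a Deepgram pre-recorded request URL might be assembled; the endpoint and parameter names follow Deepgram's public documentation at the time of writing, but treat them as assumptions and check the current API reference.

```python
from urllib.parse import urlencode

# Illustrative query parameters; names per Deepgram's docs (assumed current).
params = {
    "model": "nova",           # Deepgram's Nova model
    "diarize": "true",         # speaker diarization (no Whisper equivalent)
    "smart_format": "true",    # entity formatting
    "callback": "https://example.com/hook",  # async result delivery
}
url = "https://api.deepgram.com/v1/listen?" + urlencode(params)
print(url)

# For a hosted file, the request body would be JSON along the lines of:
# {"url": "https://example.com/audio.wav"}
```

The point is not the specific parameter names but that these capabilities arrive as one-line options rather than infrastructure you build and maintain around the open source model.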

OpenAI Whisper’s Performance and Cost

In April, we announced Nova—the fastest, cheapest, and most accurate speech recognition model on the market today. In addition, we released Deepgram Whisper Cloud, a fully managed Whisper API that supports all five open source models and is faster, more reliable, and cheaper than OpenAI's.

Since launch, we've helped customers transcribe over 400 million seconds of audio with our new Whisper Cloud. If you're going to use Whisper, we strongly believe you should use our managed offering. It's 20% more affordable (for Whisper Large), three times faster, and provides more accurate transcription results than what you're currently able to get with OpenAI's model.

But implementing Whisper, even a managed service offering like Deepgram’s, is not without its shortcomings. We conducted rigorous testing of Nova against its competitors on 60+ hours of human-annotated audio pulled from real-life situations, encompassing diverse audio lengths, speakers, accents, environments, and domains, to ensure a practical evaluation of its real-world performance.

Using these datasets, we calculated the Word Error Rate (WER)[1] of Nova and Deepgram's Whisper models and compared it to OpenAI's most accurate model (Whisper Large). The results show Deepgram's Whisper Large model beats OpenAI's in each domain, while Nova leads the pack by a considerable margin. Nova achieves a median WER of 7.4% across the files tested, a 45.2% relative improvement over OpenAI Whisper Large's 13.5% (see Figure 1). If you're going to use Whisper, use Deepgram Whisper Cloud. But if you need the most accurate model, use Deepgram Nova.
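For readers unfamiliar with the metric: WER is the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and a hypothesis, divided by the number of reference words. A minimal sketch follows; production evaluations like the one above also normalize casing, punctuation, and number formatting before scoring.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via Levenshtein distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One word dropped out of six reference words -> WER of 1/6
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

A lower WER is better: 7.4% means roughly one word-level error for every 13–14 reference words.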

Figure 1: The figure above compares the average Word Error Rate (WER) of our Nova and Whisper models with OpenAI's Whisper Large model across three audio domains: video/media/podcast, meeting, and phone call. It uses a boxplot, a chart type often used to visually show the distribution and skewness of numerical data. Each box displays the five-number summary of its dataset: the minimum value, first quartile (median of the lower half), median, third quartile (median of the upper half), and maximum value.