You may be wondering: Can a single deep learning model—granted, a large one—really achieve accurate and robust automatic speech recognition (ASR) across many languages? Sure, why not? In this post, we will discuss benchmarking OpenAI Whisper models for non-English ASR.

We'll first go over some basics. How does one measure the accuracy of an ASR model? We'll work through a simple example. We will then discuss some of the challenges of accurate benchmarks, especially for non-English languages. Finally, we'll choose a fun mix of languages we are familiar with here at Deepgram—Spanish, French, German, Hindi, and Turkish—and benchmark Whisper for those, using curated publicly available data we have labeled in-house.

Measuring the Accuracy of an ASR Model

Benchmarking ASR, for English or otherwise, is simple in concept but tricky in practice. Why? Well, to start: a lot comes down to how you normalize (standardize) the text, which otherwise will differ between your model, another model, and whatever labels you consider to be the ground truth.

Let's look at a simple example in English. Say your ground truth is:

My favorite city is Paris, France. I’ve been there 12 times.

And perhaps an ASR model predicts:

my favorite city is paris france ive been there twelve times

This might be the sort of output you would get if you never intended your model to predict capitalization or punctuation, just words.

Considering that, the model does really well, right? In fact, it's perfect! To verify this, we will compute the word error rate (WER), defined as the total number of mistakes (insertions, deletions, or replacements of words) divided by the total number of words in the ground truth. A value of zero means the prediction was spot on. Typically, the WER is less than 1, though you would find a value of exactly 1 if no words were predicted and an arbitrarily large value if many more words are predicted than are in the ground truth.
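To make that definition concrete, here is a minimal from-scratch sketch, using only the standard library and helper names of our own choosing, that computes the word-level edit distance with the classic dynamic-programming recurrence and divides by the reference length:

```python
# Minimal word-level WER sketch (helper name is ours, stdlib only).
# Edit distance via the classic Wagner-Fischer dynamic program.

def word_edit_distance(truth_tokens, pred_tokens):
    m, n = len(truth_tokens), len(pred_tokens)
    # dp[i][j] = edits to turn the first i truth tokens into the first j predicted tokens
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # i deletions
    for j in range(n + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if truth_tokens[i - 1] == pred_tokens[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution (or match)
    return dp[m][n]

truth = "the cat sat on the mat".split()
pred = "the cat sat on mat".split()  # one word dropped
print(word_edit_distance(truth, pred) / len(truth))  # 1 error / 6 words
```

With one word dropped from a six-word reference, the WER is 1/6. You can also see from the recurrence how a prediction padded with many extra words drives the insertion count, and hence the WER, above 1.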

Let's directly compute the WER for our example. To get set up, we make sure we have installed the editdistance package. It will enable us to compute the minimum number of insertions, deletions, and replacements needed to make any two sequences match.

pip install editdistance

import re
import editdistance

# Define a helper function to compute the WER, given an arbitrary function that converts the text to word tokens

def _wer(truth, pred, tokenizer):
    truth_tokens = tokenizer(truth)
    pred_tokens = tokenizer(pred)
    return editdistance.eval(truth_tokens, pred_tokens) / len(truth_tokens)

# Store a hypothetical ground truth and a model's prediction

truth = "My favorite city is Paris, France. I've been there 12 times."
pred = "my favorite city is paris france ive been there twelve times"

# Compute and display the WER obtained after simply splitting the text on whitespace

wer = _wer(truth, pred, str.split)
print(f'WER: {wer}')

You will find that the WER is 0.55! Hey! What happened? The word tokens must be identical strings to not count as an error: "I've" is not the same as "ive", and so on. Clearly, we need to ignore punctuation and capitalization when computing the WER. So we can normalize by removing both. Let's try it.
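To see exactly where those errors come from, we can line up the two token sequences (they happen to have the same length here) and print the pairs that differ. This is just an inspection trick, not part of the WER computation:

```python
truth = "My favorite city is Paris, France. I've been there 12 times."
pred = "my favorite city is paris france ive been there twelve times"

# Both sides split into 11 tokens, so a pairwise zip lines them up
mismatches = [(t, p) for t, p in zip(truth.split(), pred.split()) if t != p]
print(len(mismatches), mismatches)
```

Six of the eleven tokens differ, which is exactly where 6/11 ≈ 0.55 comes from.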

# Define a helper function to lowercase the text and strip common punctuation characters, using a regex substitution, before splitting on whitespace

def _normalize(text):
    text = text.lower()
    text = re.sub(r'[\.\?\!\,\']', '', text)
    return text.split()

wer = _wer(truth, pred, _normalize)
print(f'WER: {wer}')

Now we find a WER value of 0.09, which is much better! If you think through the example, you will notice one remaining issue: the numeral "12" vs. the word "twelve". In some applications, we may actually want to consider that an error (e.g. maybe you really need your model to produce a numeral over a word when it hears a spoken number). Here, assume we had no intention of penalizing the model, since it did get the right number. Let's install the handy num2words package, which converts numbers in digit form to their word form, and define a modified tokenizer.

pip install num2words

from num2words import num2words

# Define a helper function that takes as input a regex match object, assumed to be a (string) integer, and replaces it with the corresponding word(s) from num2words

def _to_words(match):
    return f' {num2words(int(match.group()))} '

# Same as before, but now we also normalize numbers

def _normalize_num(text):
    text = text.lower()
    text = re.sub(r'[\.\?\!\,\']', '', text)
    # \b also catches numbers at the very start or end of the text,
    # which surrounding-whitespace patterns would miss
    text = re.sub(r'\b([0-9]+)\b', _to_words, text)
    return text.split()

wer = _wer(truth, pred, _normalize_num)
print(f'WER: {wer}')

Finally! We have zero WER. 

Before getting into non-English languages, let’s recap:

It just so happens that our text normalization does not change the model output, but that need not be the case in general; a normalization function is typically applied to both the ground truth and the prediction. And of course, this is a simple example. There is a lot more one could do to handle numbers correctly: what about times, currencies, years, and addresses? And what about other English-specific conventions? Is "Dr." the same as "doctor"? The more flexibility you need from your normalization, the more complicated it gets. But even what we have done above may get us pretty far in some cases.