Speech recognition remains a challenging problem in AI and machine learning. In a step toward solving it, OpenAI today open-sourced Whisper, an automatic speech recognition system that the company claims enables "robust" transcription in multiple languages as well as translation from those languages into English.
Countless companies have developed highly capable speech recognition systems, which sit at the core of software and services from tech giants like Google, Amazon and Meta. But what sets Whisper apart, according to OpenAI, is that it was trained on 680,000 hours of multilingual and "multitask" data collected from the web, leading to improved recognition of unique accents, background noise and technical jargon.
"The primary intended users of [the Whisper] models are AI researchers studying the robustness, generalizability, capabilities, biases and limitations of the current model. However, Whisper is also potentially quite useful as an automatic speech recognition solution for developers, especially for English speech recognition," OpenAI writes in the GitHub repo for Whisper, from which several versions of the system can be downloaded. "[The models] show strong ASR results in ~10 languages. They may exhibit additional capabilities … if fine-tuned on certain tasks like voice activity detection, speaker classification or speaker diarization, but have not been robustly evaluated in these areas."
Whisper has its limitations, particularly in the area of text prediction. Because the system was trained on a large amount of "noisy" data, OpenAI cautions that Whisper may include words in its transcriptions that weren't actually spoken, possibly because it's both trying to predict the next word in the audio and trying to transcribe the audio itself. Moreover, Whisper doesn't perform equally well across languages, suffering from higher error rates when it comes to speakers of languages that aren't well represented in the training data.
Unfortunately, that last bit is nothing new in the world of speech recognition. Biases have long plagued even the best systems, with a 2020 Stanford study finding that systems from Amazon, Apple, Google, IBM and Microsoft made far fewer errors (about 35% fewer) with white users than with Black users.
Nevertheless, OpenAI sees Whisper’s transcription capabilities being used to enhance existing accessibility tools.
"While Whisper models cannot be used for real-time transcription out of the box, their speed and size suggest that others may be able to build applications on top of them that allow for near-real-time speech recognition and translation," the company continues on GitHub. "The real value of beneficial applications built on top of Whisper models suggests that the disparate performance of these models may have real economic implications … [W]hile we hope the technology will be used primarily for beneficial purposes, making automatic speech recognition technology more accessible could enable more actors to build capable surveillance technologies or scale up existing surveillance efforts, as the speed and accuracy allow for affordable automatic transcription and translation of large volumes of audio communication."
Whisper's release isn't necessarily indicative of OpenAI's future plans. While the company is increasingly focused on commercial endeavors like DALL-E 2 and GPT-3, it continues to pursue several purely theoretical research threads, including AI systems that learn by observing videos.