Google's SpecAugment achieved state-of-the-art speech recognition results without requiring a language model.
A blog post published on Monday revealed that Google's SpecAugment achieved a remarkably low word error rate (WER). What makes it more interesting is that the method reached these results without the aid of a language model, outperforming earlier systems that relied on one.
Daniel S. Park, a Google AI resident, said in a statement that even without a language model, models trained with SpecAugment exceeded the performance of previous methods. Research scientist William Chan, who co-authored the statement, added that the results are encouraging because they suggest networks can be trained to a practically useful level without the help of language models.
Both researchers noted, however, that adding a language model would still benefit their models.
The results Park and Chan describe as encouraging were disclosed to the public in a paper published on April 18 on arXiv, entitled "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition."
The paper shows that SpecAugment achieved a notably low word error rate (WER). The results were obtained by applying the new data augmentation method for automatic speech recognition to Listen, Attend and Spell networks.
Specifically, SpecAugment reached a 6.8 percent WER on Switchboard 300h, a collection of roughly 260 hours of English telephone conversations, and a 2.6 percent WER on LibriSpeech 960h, a collection of about 1,000 hours of spoken English.
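To put those percentages in context, WER is typically computed as the word-level edit distance (substitutions, deletions and insertions) between the system's transcript and a reference transcript, divided by the number of reference words. The sketch below is a minimal illustration of that standard metric, not Google's evaluation code.

```python
# Minimal sketch of word error rate (WER): word-level edit distance
# between hypothesis and reference, divided by reference length.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,          # deletion
                             dist[i][j - 1] + 1,          # insertion
                             dist[i - 1][j - 1] + cost)   # substitution
    return dist[len(ref)][len(hyp)] / len(ref)

# One substituted word out of four reference words -> 25 percent WER.
print(word_error_rate("turn on the lights", "turn on the light"))  # 0.25
```

A 2.6 percent WER therefore corresponds to roughly 2.6 word errors per 100 reference words.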
The paper also explains that instead of augmenting the input audio waveform, SpecAugment directly augments the audio spectrogram. A spectrogram is a visual representation of speech, essentially an image of the signal over time and frequency. This means data augmentation can be treated as an image problem rather than an audio one.
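In practice, the paper's augmentations include masking out random blocks of frequency channels and time steps in the spectrogram (along with a time-warping step omitted here). The sketch below is a minimal, illustrative version of such masking, assuming a log-mel spectrogram stored as a NumPy array of shape (time_steps, mel_bins); the mask sizes are arbitrary example values, not the paper's exact parameters.

```python
# Minimal sketch of SpecAugment-style frequency and time masking.
# Assumes a spectrogram array of shape (time_steps, mel_bins).
import numpy as np

def spec_augment(spectrogram, max_freq_mask=8, max_time_mask=20, rng=None):
    """Apply one random frequency mask and one random time mask."""
    rng = np.random.default_rng() if rng is None else rng
    augmented = spectrogram.copy()
    num_steps, num_bins = augmented.shape

    # Frequency masking: zero out a random band of consecutive mel bins.
    f = rng.integers(0, max_freq_mask + 1)
    f0 = rng.integers(0, num_bins - f + 1)
    augmented[:, f0:f0 + f] = 0.0

    # Time masking: zero out a random block of consecutive time frames.
    t = rng.integers(0, max_time_mask + 1)
    t0 = rng.integers(0, num_steps - t + 1)
    augmented[t0:t0 + t, :] = 0.0

    return augmented

# Example: augment a dummy 100-frame, 80-bin spectrogram.
dummy = np.random.rand(100, 80).astype(np.float32)
masked = spec_augment(dummy)
```

Because the masks are applied directly to the spectrogram "image," the augmentation needs no extra audio data and can be recomputed on the fly during training.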
The result marks a step toward more efficient automatic speech recognition (ASR) models. Because language models must be trained separately from the ASR system and take up a large amount of memory, SpecAugment's independence from a language model means less memory is needed.
Plenty of conversational AI products are already in public use, such as Gboard's dictation tool on Android smartphones and Google Assistant on smart home speakers. As the technology keeps improving, rising adoption of conversational AI could mean a big leap forward, and the encouraging results of Google's SpecAugment research could help kickstart it.