Learning from Past Mistakes: Improving Automatic Speech Recognition Output via Noisy-Clean Phrase Context Modeling.
Automatic speech recognition (ASR) systems lack joint optimization during decoding over the acoustic, lexical and language models; for instance, the ASR will often prune words due to acoustics using short-term context, prior to rescoring with long-term context. In this work we model the automated speech transcription process as a noisy transformation channel and propose an error correction system that can learn from the aggregate errors of all the independent modules constituting the ASR. The proposed system can exploit long-term context using a neural network language model and can better choose between existing ASR output possibilities as well as re-introduce previously pruned and unseen (out-of-vocabulary) phrases. The system provides significant corrections under poorly performing ASR conditions without degrading any accurate transcriptions. The proposed system can thus be independently optimized and post-process the output of even a highly optimized ASR. We show that the system consistently provides improvements over the baseline ASR. We also show that it performs better when used on out-of-domain and mismatched test data and under high-error ASR conditions. Finally, an extensive analysis of the type of errors corrected by our system is presented.
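The noisy-channel view mentioned in the abstract can be written in its standard textbook form, given here only as a sketch. The symbols $n$ (the noisy ASR output) and $c$ (a candidate clean transcript), and the split into a channel model and a language model, are assumptions for illustration; the abstract does not state the paper's exact formulation.

```latex
\hat{c} \;=\; \arg\max_{c} P(c \mid n) \;=\; \arg\max_{c} P(n \mid c)\, P(c)
```

Under this reading, $P(n \mid c)$ would describe how the ASR pipeline turns clean phrases into noisy ones, and $P(c)$ is the term where a long-context language model could enter.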