Authors : Yeona Hong, Miseul Kim, Woo-Jin Chung, Hong-Goo Kang
Year : 2024
Publisher / Conference : International Conference on Electronics, Information, and Communication (ICEIC)
Research area : Speech Signal Processing, Speech Recognition
Presentation/Publication date : 2024.01.29
Presentation : Poster
Abstract : In this paper, we present an automatic speech recognition (ASR) system that can decode complete transcriptions from speech even when segments of the audio are missing. To predict the full transcription, we use a contextual learning approach inspired by recent language model training methods, in which the model leverages the surrounding speech segments as cues for predicting the missing content. Our model consists of two modules: a contextual feature extractor built on the wav2vec 2.0 architecture, and a projection layer. We further explore various masking lengths during training to find the setting that benefits the ASR system most without compromising its performance. The proposed method achieves high-quality ASR performance on missing speech segments of various lengths, ranging from a word error rate (WER) of 4.7% on 0.25-second segments to 18.5% on 1-second segments.
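The evaluation setup described in the abstract can be illustrated with a small sketch: remove a contiguous span of a given duration from an utterance, then score a transcript against the reference with WER. This is not the authors' code. The helper names `mask_segment` and `word_error_rate` are hypothetical, and zeroing the raw samples is an assumption made for illustration, since the abstract does not say how missing segments are represented. WER is computed in the standard way, as the word-level edit distance divided by the number of reference words.

```python
import random


def mask_segment(samples, sample_rate, mask_seconds, start=None, rng=None):
    """Return a copy of `samples` with one contiguous span set to zero.

    Assumption: a "missing segment" is modelled here as zeroed samples.
    The abstract does not specify the representation used in the paper.
    """
    mask_len = min(int(round(mask_seconds * sample_rate)), len(samples))
    if start is None:
        # Pick a random start so that the span fits inside the utterance.
        rng = rng or random.Random()
        start = rng.randint(0, len(samples) - mask_len)
    if start < 0 or start + mask_len > len(samples):
        raise ValueError("masked span does not fit inside the signal")
    masked = list(samples)
    masked[start:start + mask_len] = [0.0] * mask_len
    return masked, (start, start + mask_len)


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For example, `mask_segment(samples, 16000, 0.25)` zeroes 4000 samples of a 16 kHz signal, which corresponds to the shortest segment length reported above. A WER of 4.7% then means that, on average, about 4.7 word edits are needed per 100 reference words.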