
Browsing by Author "Kumar, T.N.M."

Now showing 1 - 2 of 2
    Item
    End-to-End Speech Recognition for Low Resource Language Sanskrit using Self-Supervised Learning
    (Institute of Electrical and Electronics Engineers Inc., 2022) Holla, S.S.; Kumar, T.N.M.; Hiretanad, J.R.; Deepak, K.T.; Narasimhadhan, A.V.
    We present work on building a speaker-independent, continuous speech recognition system for Samskruta (also called Sanskrit) using self-supervised learning. We use a pre-trained model from the Vakyansh team, trained on 10,000 hours of data across 23 Indic languages, and fine-tune it on a dataset containing nearly 78 hours of Samskruta audio with transcriptions taken from Vaksancaya - Sanskrit Speech Corpus from IIT Bombay. Acoustic representations are learned end to end using the wav2vec 2.0 architecture from Fairseq. On top of this acoustic model, a language model is applied to improve overall performance. Our system achieves a word error rate (WER) of 5.1% on test data and 2.4% on training data. We also built a graphical user interface in the form of a web page using the Flask framework, which provides an interactive platform for the user to record audio and see the transcription in real time. To the best of our knowledge, our self-supervised learning approach gives better performance than state-of-the-art methods. © 2022 IEEE.
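    The word error rate reported in this abstract is the standard Levenshtein edit distance computed over word sequences, normalized by reference length. A minimal sketch (the `wer` function below is illustrative, not code from the paper):

    ```python
    def wer(reference: str, hypothesis: str) -> float:
        """Word error rate: word-level edit distance / reference length."""
        ref, hyp = reference.split(), hypothesis.split()
        # Dynamic-programming table for Levenshtein distance over words
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i  # deletions only
        for j in range(len(hyp) + 1):
            d[0][j] = j  # insertions only
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + cost) # substitution
        return d[len(ref)][len(hyp)] / len(ref)
    ```

    One substitution in a four-word reference yields a WER of 0.25, matching how the 5.1% test-set figure would be computed.
    
    
    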
    Item
    Monophone and Triphone Acoustic Phonetic Model for Kannada Speech Recognition System
    (Institute of Electrical and Electronics Engineers Inc., 2022) Kumar, T.N.M.; Jayan, A.; Bhat, S.; Anvith, M.; Narasimhadhan, A.V.
    Automatic Speech Recognition (ASR) is among the most widely used applications in the speech domain. ASR systems generate text from spoken utterances without manual intervention. In this work, we build an ASR system for the Kannada language. For the proposed system, we extract Mel Frequency Cepstral Coefficient (MFCC) features from the audio data, and a Kannada language model is developed from the corresponding labels. Dictionary generation and phonetic labeling are automated. Recognition performance is compared for the monophone and triphone models. A word error rate of 15.73% and a sentence error rate of 55.5% are achieved for the triphone model, which outperforms the monophone model. © 2022 IEEE.
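    The MFCC features mentioned in this abstract rest on the mel scale, which spaces filter-bank channels evenly in perceived pitch rather than in hertz. A minimal sketch of that spacing, assuming the common O'Shaughnessy mel formula (function names are ours, not from the paper):

    ```python
    import numpy as np

    def hz_to_mel(f_hz):
        # O'Shaughnessy formula used by most MFCC front ends
        return 2595.0 * np.log10(1.0 + f_hz / 700.0)

    def mel_to_hz(m):
        # Inverse of hz_to_mel
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mel_filter_edges(n_filters, f_min_hz, f_max_hz):
        """Edge frequencies (Hz) of a triangular mel filter bank.

        n_filters filters need n_filters + 2 edge points, spaced
        evenly on the mel scale between f_min_hz and f_max_hz.
        """
        mels = np.linspace(hz_to_mel(f_min_hz), hz_to_mel(f_max_hz),
                           n_filters + 2)
        return mel_to_hz(mels)
    ```

    In a full MFCC pipeline these filters are applied to the short-time power spectrum, followed by a log and a discrete cosine transform to obtain the cepstral coefficients.
    
    
    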

Maintained by Central Library NITK | DSpace software copyright © 2002-2026 LYRASIS
