Browsing by Author "Deepak, K.T."

An Improved Method for Speech Enhancement Using Convolutional Neural Network Approach
(Institute of Electrical and Electronics Engineers Inc., 2022) Mahesh Kumar, T.N.; Hegde, P.; Deepak, K.T.; Narasimhadhan, A.V.
Speech enhancement is one of the most widely used techniques in the speech processing domain. With the development of deep neural networks and the availability of powerful hardware, multiple deep learning-based speech enhancement models have emerged in recent years. In this work, a speech enhancement technique using a Convolutional Neural Network (CNN) as a Denoising Autoencoder (DAE) is investigated and compared with the conventional feed-forward topology. Further, the proposed model is analyzed at various SNR levels on corrupted English speech and is also tested on unseen speech data that includes additional SNR levels. Simulation results show that the proposed model outperforms the existing model in terms of Perceptual Evaluation of Speech Quality (PESQ) and Log Spectral Distance (LSD). The network achieved 3% higher scores than feed-forward neural networks, indicating that convolutional DAEs perform better than their feed-forward counterparts. © 2022 IEEE.
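For readers unfamiliar with the LSD metric named in this abstract, the following is a minimal sketch of one common definition: the per-frame root-mean-square difference of log power spectra in dB, averaged over frames. The abstract does not state the exact formula the authors used, so the function name, the `eps` floor, and the list-of-frames input layout are assumptions made for illustration only.

```python
import math


def log_spectral_distance(clean, enhanced, eps=1e-10):
    """Mean log spectral distance (dB) between two power spectrograms.

    Assumed layout: each argument is a list of frames, and each frame is a
    list of non-negative power values, one per frequency bin. `eps` is a
    small floor (an assumption) that avoids taking log10 of zero.
    """
    if not clean or len(clean) != len(enhanced):
        raise ValueError("spectrograms must be non-empty and have equal frame counts")

    total = 0.0
    for clean_frame, enhanced_frame in zip(clean, enhanced):
        if not clean_frame or len(clean_frame) != len(enhanced_frame):
            raise ValueError("frames must be non-empty and have equal bin counts")
        # Squared difference of the log power spectra, in dB, for each bin.
        squared = [
            (10.0 * math.log10((c + eps) / (e + eps))) ** 2
            for c, e in zip(clean_frame, enhanced_frame)
        ]
        # Root-mean-square over frequency bins gives this frame's distance.
        total += math.sqrt(sum(squared) / len(squared))

    # Average the per-frame distances over the whole utterance.
    return total / len(clean)
```

Under this definition a lower value means the enhanced spectrum is closer to the clean one, and identical spectrograms give a distance of zero.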
End-to-End Speech Recognition for Low Resource Language Sanskrit using Self-Supervised Learning
(Institute of Electrical and Electronics Engineers Inc., 2022) Holla, S.S.; Kumar, T.N.M.; Hiretanad, J.R.; Deepak, K.T.; Narasimhadhan, A.V.
We present work on building a speaker-independent, continuous speech recognition system for Samskruta (also called Sanskrit) using self-supervised learning. We used a pre-trained model from the Vakyansh team, trained on 10,000 hours of data covering 23 Indic languages, and fine-tuned it on a dataset containing nearly 78 hours of Samskruta audio with transcriptions, taken from the Vaksancaya - Sanskrit Speech Corpus from IIT Bombay. Acoustic representations are learned in an end-to-end deep learning approach using the wav2vec2.0 architecture from Fairseq. On top of this acoustic model, a language model is used to increase the overall performance. Our system achieves a word error rate (WER) of 5.1% on test data and 2.4% on train data. We also built a graphical user interface in the form of a web page using the Flask framework, which provides an interactive platform for the user to record audio and see the transcription in real time. To the best of our knowledge, our approach using self-supervised learning gives better performance than state-of-the-art methods. © 2022 IEEE.
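The WER figures quoted in this abstract follow the standard definition: the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the recognizer output, divided by the number of reference words. The sketch below illustrates that definition only. It is not the authors' evaluation script, and the function name and whitespace tokenization are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words.

    Tokenization by whitespace is an assumption; real evaluations usually
    normalize text (case, punctuation) before scoring.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current

    return previous[-1] / len(ref_words)
```

For example, scoring the hypothesis `"a x c"` against the reference `"a b c d"` counts one substitution and one deletion over four reference words, giving a WER of 0.5.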
Group Attack Dingo Optimizer for enhancing speech recognition in noisy environments
(Springer Science and Business Media Deutschland GmbH, 2023) Kumar, M.; Kumar, K.G.; Deepak, K.T.; Narasimhadhan, A.V.
Speech recognition has become a vital technology enabling seamless human–computer interaction, even in noisy public places. Speech enhancement (SE) techniques play a crucial role in improving the performance of applications such as machine translation, natural language processing, spoken language understanding, and text generation. In this study, we introduce a novel approach termed the Group Attack Dingo Optimizer (GA-DOA) for optimizing speech enhancement tasks. Our method combines an improved short-time Fourier transform (STFT) and an optimized deep U-Net, with GA-DOA used to fine-tune the parameters. Additionally, feature extraction employs Mel-frequency cepstral coefficients (MFCCs), spectral features, and one-dimensional convolutional neural networks (1D-CNN). To select the most effective features, we employ GA-DOA-assisted feature selection. These optimized features are then fed into our proposed hybrid model for speech recognition (HMSR), which integrates bidirectional long short-term memory (BiLSTM) with the gated recurrent unit (GRU). Experimental results reveal that our proposed model achieves superior recognition rates and significantly lowers the word error rate (WER), thereby demonstrating enhanced system performance, even in noisy environments. © 2023, The Author(s), under exclusive licence to Società Italiana di Fisica and Springer-Verlag GmbH Germany, part of Springer Nature.
The Second DISPLACE Challenge: DIarization of SPeaker and LAnguage in Conversational Environments
(International Speech Communication Association, 2024) Kalluri, S.B.; Singh, P.; Roy Chowdhuri, P.; Kulkarni, A.; Baghel, S.; Hegde, P.; Sontakke, S.; Deepak, K.T.; Mahadeva Prasanna, S.R.; Vijayasenan, D.; Ganapathy, S.
The DIarization of SPeaker and LAnguage in Conversational Environments (DISPLACE) 2024 challenge is the second in the series of DISPLACE challenges, which involves tasks of speaker diarization (SD) and language diarization (LD) on a challenging multilingual conversational speech dataset. In the DISPLACE 2024 challenge, we also introduced the task of automatic speech recognition (ASR) on this dataset. The dataset, containing 158 hours of speech and consisting of both supervised and unsupervised mono-channel far-field recordings, was released for the LD and SD tracks. Further, 12 hours of close-field mono-channel recordings were provided for the ASR track, conducted on 5 Indian languages. The details of the dataset, baseline systems, and the leaderboard results are highlighted in this paper. We have also compared the performance of our baseline models and the participating teams on the evaluation data of DISPLACE-2023 to emphasize the advancements made in this second version of the challenge. © 2024 International Speech Communication Association. All rights reserved.