
Browsing by Author "Spoorthy, V."

Now showing 1 - 9 of 9
  • A Transpose-SELDNet for Polyphonic Sound Event Localization and Detection
    (Institute of Electrical and Electronics Engineers Inc., 2023) Spoorthy, V.; Koolagudi, S.G.
    Human beings can identify events occurring in their surroundings from sound cues alone, even when no visual scene is presented. Sound events are the auditory cues present in an environment. Sound event detection (SED) is the process of determining the onset and offset of sound events as well as a textual label for each event. Sound source localization (SSL) refers to identifying the spatial location of a sound occurrence in addition to the SED. The integrated task of SED and SSL is known as Sound Event Localization and Detection (SELD). In this work, three deep learning architectures are explored to perform SELD: SELDNet, D-SELDNet (Depthwise Convolution), and T-SELDNet (Transpose Convolution). Two sets of features are used to perform the SED and Direction-of-Arrival (DOA) estimation tasks. D-SELDNet uses a depthwise convolution layer, which reduces the model's complexity in terms of computation time. T-SELDNet uses transpose convolution, which helps learn better discriminative features by retaining the input size and not losing necessary information from the input. The proposed method is evaluated on the First-Order Ambisonic (FOA) array format of the TAU-NIGENS Spatial Sound Events 2020 dataset. The proposed T-SELDNet shows an improvement over existing SELD systems. © 2023 IEEE.
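
As a rough illustration of the two layer types the variants are named for (not the paper's architecture), this PyTorch sketch contrasts a depthwise-separable convolution block with a size-preserving transpose convolution; channel counts and kernel sizes are assumptions.

```python
import torch
import torch.nn as nn

# Illustrative blocks only; sizes are assumptions, not the paper's model.
in_ch, out_ch = 64, 64

# Depthwise-separable convolution (the layer D-SELDNet is named for):
# a per-channel spatial filter followed by a 1x1 pointwise channel mix,
# which cuts computation relative to a full convolution.
depthwise_block = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),  # depthwise
    nn.Conv2d(in_ch, out_ch, kernel_size=1),                          # pointwise
)

# Transpose convolution (the layer T-SELDNet is named for); with stride 1
# and matching padding it retains the input's time-frequency size.
transpose_block = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=3, padding=1)

x = torch.randn(1, in_ch, 128, 64)  # (batch, channels, time frames, mel bins)
assert depthwise_block(x).shape == transpose_block(x).shape == x.shape
```
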
  • An Improved Transformer Transducer Architecture for Hindi-English Code Switched Speech Recognition
    (International Speech Communication Association, 2022) Antony, A.; Kota, S.R.; Lade, A.; Spoorthy, V.; Koolagudi, S.G.
    Due to the extensive use of technology across many of the world's languages, interest in Automatic Speech Recognition (ASR) systems for Code-Switching (CS) in speech has grown in recent years. Several studies have shown that End-to-End (E2E) ASR is easier to adopt and works much better in monolingual settings. E2E systems are likewise widely recognised for requiring massive quantities of labelled speech data. Since large amounts of CS speech are scarce, E2E ASR takes longer to train and does not offer promising results. In this work, an E2E ASR system using a transformer-transducer architecture is introduced for code-switched Hindi-English speech, and training data scarcity is addressed by leveraging the vastly available monolingual data. Specifically, the language-specific modules in the Transformer are pre-trained on the large available single-language speech datasets. The proposed method achieves a Word Error Rate (WER) of 29.63% and a Transliterated Word Error Rate (T-WER) of 27.42%, improving on the state of the art by 2.19%. © 2022 ISCA.
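
For readers unfamiliar with the transducer family, the sketch below shows a minimal joint network of the kind a transformer transducer uses to combine an audio encoder with a text-side prediction network; all dimensions are assumptions, and this is not the paper's model.

```python
import torch
import torch.nn as nn

class TransducerJoint(nn.Module):
    """Minimal transducer joint-network sketch (dimensions are assumptions).

    An audio encoder output and a prediction-network output are combined
    over every (time, label) pair; the language-specific encoder/prediction
    modules are what get pre-trained on monolingual data in such systems.
    """
    def __init__(self, enc_dim=256, pred_dim=256, joint_dim=512, vocab_size=1000):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab_size + 1)  # +1 for the blank label

    def forward(self, enc, pred):
        # enc: (B, T, enc_dim), pred: (B, U, pred_dim)
        joint = self.enc_proj(enc).unsqueeze(2) + self.pred_proj(pred).unsqueeze(1)
        return self.out(torch.tanh(joint))  # (B, T, U, vocab+1) logits

logits = TransducerJoint()(torch.randn(2, 50, 256), torch.randn(2, 12, 256))
print(logits.shape)  # torch.Size([2, 50, 12, 1001])
```
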
  • Bi-level Acoustic Scene Classification Using Lightweight Deep Learning Model
    (Birkhauser, 2024) Spoorthy, V.; Koolagudi, S.G.
    Identifying a scene based on the environment in which the related audio is recorded is known as acoustic scene classification (ASC). In this paper, a bi-level lightweight Convolutional Neural Network (CNN)-based model is presented to perform ASC. The proposed approach performs classification in two levels: at the first level, scenes are classified into three broad categories (indoor, outdoor, and transportation); at the second level, the three categories are further divided into individual scenes. The approach is implemented using three features: log Mel band energies, harmonic spectrograms, and percussive spectrograms. Three CNN classifiers are used: MobileNetV2, Squeeze-and-Excitation Net (SENet), and a combination of the two architectures known as SE-MobileNet, which draws on the advantages of both. Extensive experiments are conducted on the DCASE 2020 (IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events) Task 1B development and DCASE 2016 ASC datasets. The proposed SE-MobileNet model achieves classification accuracies of 96.9% and 86.6% for the first and second levels, respectively, on the DCASE 2020 dataset, and 97.6% and 88.4%, respectively, on the DCASE 2016 dataset. The model is better in terms of both complexity and accuracy than state-of-the-art low-complexity ASC systems. © 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
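
A minimal sketch of the bi-level decision flow follows; the predictors here are stand-in callables (the paper uses SE-MobileNet CNNs on spectrogram features), and the per-category scene lists are illustrative assumptions.

```python
import numpy as np

# Broad categories come from the abstract; the individual scenes per
# category are assumptions for illustration.
BROAD = ["indoor", "outdoor", "transportation"]
SCENES = {
    "indoor": ["airport", "shopping_mall", "metro_station"],
    "outdoor": ["park", "street_traffic", "public_square"],
    "transportation": ["bus", "metro", "tram"],
}

def classify_scene(features, level1_predict, level2_predict):
    """Level 1 picks a broad category; a category-specific level-2
    classifier then picks the individual scene within that category."""
    category = BROAD[level1_predict(features)]
    scene = SCENES[category][level2_predict[category](features)]
    return category, scene

# Dummy predictors so the sketch runs end to end.
demo = classify_scene(
    np.zeros(40),
    level1_predict=lambda f: 0,
    level2_predict={c: (lambda f: 1) for c in BROAD},
)
print(demo)  # ('indoor', 'shopping_mall')
```
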
  • Code-switching automatic speech recognition using modified ESPNet
    (American Institute of Physics Inc., 2023) Sinha, S.; Spoorthy, V.; Koolagudi, S.G.
    Multilingual Automatic Speech Recognition (ASR) has recently received drastically increased attention. A single speech recognition system can cater to multiple low-resource languages by taking advantage of the small amounts of labeled corpora available in each. The success of low-resource multilingual and code-switching ASR often depends on the variety of the languages in terms of linguistic characteristics as well as on the amount of data available. This work focuses on modifying a multilingual and code-switching ASR system through two subtasks covering a total of seven Indian languages. The model is provided with several hours of transcribed speech data, comprising train and test sets, in these languages, including two code-switched language pairs, Hindi-English and Bengali-English. A modified ESPNet architecture is proposed to perform multilingual ASR; it improves on the baseline system, achieving a Word Error Rate (WER) of 27.69%. © 2023 Author(s).
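
Word Error Rate, the metric reported above, is edit distance at the word level, normalized by reference length; a minimal reference implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits (substitutions + insertions + deletions) to turn the
    # first i reference words into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("speech ko text mein badlo", "speech to text mein badlo"))  # 0.2
```
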
  • Machine Learning-based Automated System for Subjective Answer Evaluation
    (Institute of Electrical and Electronics Engineers Inc., 2023) Dodia, S.; Spoorthy, V.; Chandak, T.
    An examination is a useful tool for assessing students' knowledge, but evaluating exams is a difficult and time-consuming process. Automatic examination of answer scripts makes this task easier for teachers, reducing the effort and time required. A number of machine learning methods have been proposed in the literature for evaluating responses to objective questions; evaluating answers to descriptive questions, however, needs more work. This study proposes a way to evaluate students' answers to descriptive questions without teachers using traditional paper and pencil: a computer acts as the teacher and grades the students' submissions. The primary objective is to score subjective responses using Bidirectional Encoder Representations from Transformers (BERT), cosine similarity, and Jaccard distance. The proposed model achieved an accuracy of 91%, an error of 9.01, a precision of 83%, and a recall of 79%, providing the best results in comparison with state-of-the-art systems. © 2023 IEEE.
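
A minimal sketch of scoring a student answer against a model answer with BERT embeddings plus word-level Jaccard overlap; the mean pooling, the equal weighting of the two scores, and the checkpoint name are assumptions, not the paper's exact setup.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> torch.Tensor:
    enc = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = bert(**enc).last_hidden_state  # (1, seq_len, 768)
    return out.mean(dim=1).squeeze(0)        # mean-pooled sentence vector

def score(model_answer: str, student_answer: str) -> float:
    cos = torch.cosine_similarity(embed(model_answer), embed(student_answer), dim=0)
    a, b = set(model_answer.lower().split()), set(student_answer.lower().split())
    jaccard = len(a & b) / len(a | b)
    return 0.5 * cos.item() + 0.5 * jaccard  # assumed equal weighting

print(score("Photosynthesis converts light energy into chemical energy",
            "Plants turn light energy into chemical energy"))
```
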
  • Noise Cancellation by Fast Fourier Transform for Wav2Vec2.0 based Speech-to-Text System
    (Institute of Electrical and Electronics Engineers Inc., 2023) Gupta, S.P.; Spoorthy, V.; Koolagudi, S.G.
    Speech-To-Text (STT) systems are a part of the speech recognition domain: speech is given as input, and the system generates a transcript. Background noise in the input speech sometimes disrupts an STT system and leads to incorrect transcripts. In this work, we discuss a Fast Fourier Transform (FFT)-based noise cancellation method for Hindi words with background noise and perform speech-to-text conversion using a fine-tuned, pre-trained Wav2Vec2.0 model. The background noise added to the audio samples is Gaussian white noise at three intensity levels, 0.01, 0.03, and 0.05 units, given by the standard deviation (STD) of the Gaussian distribution. The model has been trained on the OpenSLR Hindi dataset, and the proposed system is evaluated using the Character Error Rate (CER) metric. The model is tested on 20 Hindi words in both clean and noisy conditions. The results show that the noise cancellation is effective in terms of CER; at the first noise level, with an STD of 0.01, the CER after noise cancellation is better than that of the noisy counterpart. © 2023 IEEE.
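
A minimal sketch of FFT-based noise reduction by spectral magnitude thresholding, using NumPy; the thresholding rule is an assumption, not necessarily the paper's exact method.

```python
import numpy as np

def fft_denoise(signal: np.ndarray, threshold_ratio: float = 0.1) -> np.ndarray:
    spectrum = np.fft.rfft(signal)
    magnitude = np.abs(spectrum)
    # Zero out frequency bins whose magnitude falls below a fraction of the
    # peak: low-level wideband (e.g., Gaussian white) noise is spread thinly
    # across bins and gets suppressed, while strong speech components survive.
    spectrum[magnitude < threshold_ratio * magnitude.max()] = 0
    return np.fft.irfft(spectrum, n=len(signal))

# Demo: a 100 Hz tone at 8 kHz sampling plus Gaussian noise with STD 0.05.
t = np.arange(8000) / 8000.0
noisy = np.sin(2 * np.pi * 100 * t) + np.random.normal(0, 0.05, t.size)
denoised = fft_denoise(noisy)
```
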
  • Polyphonic Sound Event Detection Using Mel-Pseudo Constant Q-Transform and Deep Neural Network
    (Taylor and Francis Ltd., 2024) Spoorthy, V.; Koolagudi, S.G.
    The task of identifying sound events in a particular surrounding is known as Sound Event Detection (SED) or Acoustic Event Detection (AED). Sound events occur in an unstructured manner and display wide variations in both temporal structure and frequency content. They may be non-overlapping (monophonic) or overlapping (polyphonic); in real-world scenarios, polyphonic SED is far more common than monophonic SED. In this paper, a Mel-Pseudo Constant Q-Transform (MP-CQT) technique is introduced to perform polyphonic SED and to effectively learn both monophonic and polyphonic sound events. A pseudo-CQT technique is adapted to extract features from the audio files together with their Mel spectrograms; the Mel scale is believed to broadly simulate the human auditory system. A Convolutional Recurrent Neural Network (CRNN) is used as the classifier. The performance of the proposed MP-CQT technique with the CRNN is compared against existing approaches, and a considerable improvement is observed: the method achieves an average error rate of 0.684 and an average F1 score of 52.3%. The approach is also analyzed for robustness by adding noise to the audio files at different Signal-to-Noise Ratios (SNRs). The proposed SED method displays improved performance compared to state-of-the-art SED systems, and the new feature extraction technique shows promising improvement in the performance of the polyphonic SED system. © 2024 IETE.
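
Both ingredients named above are available in librosa; the sketch below extracts a log-Mel spectrogram and a pseudo-CQT and stacks them as a two-channel input. The stacking is an assumption for illustration; the paper's exact MP-CQT fusion is not reproduced here.

```python
import numpy as np
import librosa

y, sr = librosa.load(librosa.example("trumpet"))

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40, hop_length=512)
log_mel = librosa.power_to_db(mel)                      # (40, frames)

# pseudo_cqt computes CQT-spaced magnitudes from a single STFT, avoiding
# the full multirate CQT machinery.
pcqt = librosa.pseudo_cqt(y, sr=sr, hop_length=512, n_bins=40)
log_pcqt = librosa.amplitude_to_db(pcqt)                # (40, frames)

n = min(log_mel.shape[1], log_pcqt.shape[1])            # align frame counts
features = np.stack([log_mel[:, :n], log_pcqt[:, :n]])  # assumed 2-channel CRNN input
print(features.shape)
```
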
  • Recognition of Fricative Phoneme based Hindi Words in Speech-to-Text System using Wav2Vec2.0 Model
    (Institute of Electrical and Electronics Engineers Inc., 2022) Gupta, S.P.; Spoorthy, V.; Koolagudi, S.G.
    In this work, we discuss issues with Microsoft's state-of-the-art Speech-to-Text (STT) system. Two key issues have been identified: recognition of Hindi words starting with the fricative phoneme (/ha/), and the recognition power of the system in the presence of background noise. The first issue, correctly identifying the unrecognized Hindi fricative phoneme, is addressed by training the Wav2Vec2.0 model on the OpenSLR Hindi dataset. The proposed model is evaluated using the Character Error Rate (CER) performance metric: 20 fricative words in both clean and noisy conditions are fed to the trained model. The second issue, handling noisy speech samples, is resolved using an amplitude-based automatic noise detection method. The results achieved by the proposed model are better than those of the state-of-the-art STT model, trained both with and without a language model, in terms of CER in clean conditions. © 2022 IEEE.
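
A minimal sketch of an amplitude-based noise check of the kind named above: flag a recording as noisy when even its quietest frames carry appreciable amplitude. The frame size and threshold are assumed values, not the paper's calibration.

```python
import numpy as np

def is_noisy(signal: np.ndarray, frame_len: int = 400, threshold: float = 0.02) -> bool:
    frames = signal[: len(signal) // frame_len * frame_len].reshape(-1, frame_len)
    frame_amp = np.abs(frames).mean(axis=1)
    # In clean speech the quietest 10% of frames (pauses) sit near zero;
    # wideband background noise lifts them above the threshold.
    quiet = np.sort(frame_amp)[: max(1, len(frame_amp) // 10)]
    return quiet.mean() > threshold

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 120 * np.arange(16000) / 16000) * (np.arange(16000) > 8000)
print(is_noisy(clean), is_noisy(clean + rng.normal(0, 0.05, 16000)))  # False True
```
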
  • Spectral Features for Emotional Speaker Recognition
    (Institute of Electrical and Electronics Engineers Inc., 2020) Pasala, P.; Spoorthy, V.; Koolagudi, S.G.; Sobhana, N.V.
    Speaker recognition in an emotive environment is a challenging task because of the influence of emotions on speech. A speaker can be identified by analyzing the features of the speech signal. Under normal conditions, identifying a speaker is not tedious; identifying the speaker in an emotional state such as happiness, sadness, anger, surprise, sarcasm, or fear is far harder, since speech is altered by emotion and noise. The spectral features of a speech signal include Mel Frequency Cepstral Coefficients (MFCC), Shifted Delta Cepstral Coefficients (SDCC), spectral centroid, spectral roll-off, spectral flatness, spectral contrast, spectral bandwidth, chroma-stft, zero crossing rate, root mean square energy, Linear Prediction Cepstral Coefficients (LPCC), spectral subband centroid, Teager-energy-based MFCC, line spectral frequencies, single frequency cepstral coefficients, formant frequencies, Power Normalized Cepstral Coefficients (PNCC), etc. The features extracted from the speech signal are then classified: Support Vector Machine (SVM), Gaussian Mixture Model, Gaussian Naive Bayes, K-Nearest Neighbour, Random Forest, and a simple Keras neural network are used for classification. An important application is security systems, in which a person can be identified by a biometric, namely their voice. This work aims to identify the speaker in an emotional environment using spectral features, classify with any of the classification techniques, and achieve a high speaker recognition rate; feature combinations can also be used to improve accuracy. The proposed model performed better than most state-of-the-art methods. © 2020 IEEE.
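
A minimal sketch of the pipeline described above: extract a few of the listed spectral features with librosa and classify with an SVM. The toy random data exists only so the sketch runs; real use needs labelled emotional-speech recordings.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def spectral_features(y: np.ndarray, sr: int) -> np.ndarray:
    """A subset of the listed features, averaged over frames per utterance."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr).mean()
    zcr = librosa.feature.zero_crossing_rate(y).mean()
    rms = librosa.feature.rms(y=y).mean()
    return np.hstack([mfcc, centroid, rolloff, zcr, rms])

sr = 16000
rng = np.random.default_rng(0)
X = [spectral_features(rng.normal(0, 0.1, sr), sr) for _ in range(8)]
labels = [0, 1] * 4  # speaker IDs for the toy utterances
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict([X[0]]))
```
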
