Faculty Publications
Permanent URI for this community: https://idr.nitk.ac.in/handle/123456789/18736
Publications by NITK Faculty
Search Results
6 results
Item: Singer identification for Indian singers using convolutional neural networks (Springer, 2021) Vishnu Srinivasa Murthy, Y.V.S.; Koolagudi, S.G.; Jeshventh Raja, T.K.

Singer identification is one of the important aspects of music information retrieval (MIR). In this work, traditional feature-based and trending convolutional neural network (CNN)-based approaches are compared for identifying singers. Two datasets, artist20 and an Indian popular singers database of 20 singers, are used to evaluate the proposed approaches. Cepstral features such as Mel-frequency cepstral coefficients (MFCCs) and linear prediction cepstral coefficients (LPCCs) represent timbre information. Shifted delta cepstral (SDC) features are computed besides the cepstral coefficients to capture temporal information. In addition, chroma features are computed from the 12 semitones of a musical octave, overall forming a 46-dimensional feature vector. Experiments are conducted with different feature combinations, and suitable features are selected using the genetic algorithm-based feature selection (GAFS) approach. Two classification techniques, artificial neural networks (ANNs) and random forest (RF), are applied to the features mentioned above. Further, spectrograms and chromagrams of audio clips are fed directly to a CNN for classification. The singer identification results obtained using CNNs are better than those of the traditional isolated and ensemble classifiers. An average accuracy of around 75% is observed with the CNN on the Indian popular singers database, whereas on the artist20 dataset neither the proposed feature-based configuration nor the CNN could exceed 60% accuracy.
© 2021, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.

Item: Bi-level Acoustic Scene Classification Using Lightweight Deep Learning Model (Birkhauser, 2024) Spoorthy, V.; Koolagudi, S.G.

Identifying a scene from the environment in which an audio recording was made is known as acoustic scene classification (ASC). In this paper, a bi-level lightweight Convolutional Neural Network (CNN)-based model is presented for ASC. The proposed approach performs classification in two levels: in the first level, scenes are classified into three broad categories, namely indoor, outdoor, and transportation; in the second level, these three classes are further categorized into individual scenes. The approach is implemented using three features: log Mel band energies, harmonic spectrograms, and percussive spectrograms. Three CNN classifiers are used: MobileNetV2, Squeeze-and-Excitation Net (SENet), and a combination of the two, known as SE-MobileNet, which combines the advantages of both architectures. Extensive experiments are conducted on the DCASE 2020 (IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events) Task 1B development and DCASE 2016 ASC datasets. The proposed SE-MobileNet model achieved classification accuracies of 96.9% and 86.6% for the first and second levels, respectively, on the DCASE 2020 dataset, and 97.6% and 88.4%, respectively, on the DCASE 2016 dataset. The model is better in terms of both complexity and accuracy than state-of-the-art low-complexity ASC systems.
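The bi-level inference described in this ASC abstract can be sketched as a two-stage lookup: a first-level classifier picks the broad category, then a category-specific second-level classifier picks the fine scene. The fine-scene lists below are illustrative examples, not necessarily the exact DCASE class lists, and the classifier functions are stand-ins for trained models.

```python
import numpy as np

# Hypothetical scene taxonomy; the broad classes (indoor, outdoor,
# transportation) follow the abstract, the fine scenes are examples.
SCENES = {
    "indoor": ["airport", "shopping_mall", "metro_station"],
    "outdoor": ["park", "street_traffic", "public_square"],
    "transportation": ["bus", "metro", "tram"],
}

def bi_level_predict(features, level1, level2):
    """Two-stage ASC inference.

    level1: callable returning probabilities over the broad classes.
    level2: dict mapping each broad class to a callable returning
            probabilities over that class's fine scenes.
    """
    broad_probs = level1(features)
    broad = list(SCENES)[int(np.argmax(broad_probs))]
    fine_probs = level2[broad](features)
    scene = SCENES[broad][int(np.argmax(fine_probs))]
    return broad, scene
```

One design advantage of this cascade is that each second-level model only discriminates among a handful of acoustically similar scenes, which keeps the individual classifiers small.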
© 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.

Item: Polyphonic Sound Event Detection Using Mel-Pseudo Constant Q-Transform and Deep Neural Network (Taylor and Francis Ltd., 2024) Spoorthy, V.; Koolagudi, S.G.

The task of identifying sound events in a particular surrounding is known as Sound Event Detection (SED) or Acoustic Event Detection (AED). Sound events occur in an unstructured way and vary widely in both temporal structure and frequency content. They may be non-overlapping (monophonic) or overlapping (polyphonic); in real-world scenarios, polyphonic SED is far more common than monophonic SED. In this paper, a Mel-Pseudo Constant Q-Transform (MP-CQT) technique is introduced to perform polyphonic SED and to effectively learn both monophonic and polyphonic sound events. A pseudo-CQT technique is adapted to extract features from the audio files and their Mel spectrograms; the Mel scale broadly approximates the human auditory perception system. A Convolutional Recurrent Neural Network (CRNN) is used as the classifier. The proposed MP-CQT features with the CRNN yield a considerable performance improvement, achieving an average error rate of 0.684 and an average F1 score of 52.3%. Robustness is also analyzed by adding noise to the audio files at different Signal-to-Noise Ratios (SNRs). The proposed method outperforms state-of-the-art SED systems, and the new feature extraction technique shows promising improvement in polyphonic SED performance.
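The error rate and F1 figures quoted in this and the following abstracts are standard segment-based SED metrics. A minimal sketch of how they are usually computed (following the common DCASE convention of per-segment substitutions, deletions, and insertions) is shown below; the papers' exact evaluation settings may differ.

```python
def sed_segment_metrics(reference, system):
    """Segment-based error rate and F1 for polyphonic SED.

    reference, system: lists (one entry per segment) of sets of
    active event labels. Returns (error_rate, f1).
    """
    S = D = I = TP = FP = FN = N = 0
    for ref, sys in zip(reference, system):
        tp = len(ref & sys)          # correctly detected events
        fn = len(ref - sys)          # missed events
        fp = len(sys - ref)          # spurious events
        TP += tp; FP += fp; FN += fn; N += len(ref)
        s = min(fn, fp)              # substitutions pair up a miss
        S += s                       # with a false alarm
        D += fn - s                  # remaining misses = deletions
        I += fp - s                  # remaining alarms = insertions
    er = (S + D + I) / max(N, 1)
    f1 = 2 * TP / max(2 * TP + FP + FN, 1)
    return er, f1
```

Note that the error rate can exceed 1.0 when insertions dominate, which is why an ER of 0.684 together with a modest F1 is plausible for a hard polyphonic task.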
© 2024 IETE.

Item: MICAnet: A Deep Convolutional Neural Network for mineral identification on the Martian surface (Elsevier B.V., 2024) Kumari, P.; Soor, S.; Shetty, A.; Koolagudi, S.G.

Mineral identification plays a vital role in understanding the diversity and past habitability of the Martian surface. Mineral mapping by the traditional manual method is time-consuming, and the unavailability of ground-truth data has limited research on supervised learning models. To address this issue, an augmentation process has already been proposed in the literature that generates training data replicating the spectra in the MICA (Minerals Identified in CRISM Analysis) spectral library while preserving absorption signatures and introducing variability. This study introduces MICAnet, a specialized Deep Convolutional Neural Network (DCNN) architecture for mineral identification using CRISM (Compact Reconnaissance Imaging Spectrometer for Mars) hyperspectral data. MICAnet is inspired by the Inception-v3 and InceptionResNet-v1 architectures but is tailored with 1-dimensional convolutions to process spectra at the pixel level of a hyperspectral image. To the best of the authors' knowledge, this is the first DCNN architecture solely dedicated to mineral identification on the Martian surface. The model is evaluated by matching against a TRDR (Targeted Reduced Data Record) dataset obtained using a hierarchical Bayesian model. The results demonstrate an f-score of at least 0.77 across the mineral groups in the MICA library, on par with or better than the unsupervised models previously applied to this objective.
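The core idea in this abstract, Inception-style multi-scale 1-D convolutions over a single pixel's spectrum, can be sketched as parallel convolution branches with different receptive fields whose outputs are concatenated along the channel axis. The weights below are random and the branch sizes illustrative: this shows only the structural idea, not the trained MICAnet.

```python
import numpy as np

def inception1d_block(spectrum, kernel_sizes=(3, 5, 7), filters=4, seed=0):
    """Toy Inception-style 1-D block over one hyperspectral pixel.

    spectrum: (L,) reflectance values of a single pixel.
    Each branch convolves the spectrum with `filters` random kernels
    of one size ('same' padding), applies ReLU, and the branches are
    concatenated, giving (len(kernel_sizes) * filters, L).
    """
    rng = np.random.default_rng(seed)
    branches = []
    for k in kernel_sizes:
        w = rng.standard_normal((filters, k))
        maps = [np.convolve(spectrum, w[f], mode="same") for f in range(filters)]
        branches.append(np.maximum(np.stack(maps), 0.0))  # ReLU
    return np.concatenate(branches, axis=0)
```

Operating per pixel like this sidesteps the spatial dimension entirely, which is what makes a pure 1-D architecture viable for pixel-level mineral classification.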
© 2024

Item: Rare Sound Event Detection Using Multi-resolution Cochleagram Features and CRNN with Attention Mechanism (Birkhauser, 2025) Pandey, G.; Koolagudi, S.G.

Acoustic event detection (AED), or sound event detection (SED), focuses on automatically detecting acoustic events in an audio recording along with their onset and offset times. Rare AED, which aims to detect rare but significant sound events in an audio signal, is a challenging problem: traditional SED methods often struggle to detect rare events accurately because of their infrequent occurrence and diverse characteristics. This paper introduces novel features, named multi-resolution cochleagrams (MRCGs), for rare SED tasks. Cochleagrams at different resolutions are extracted from the audio recording and stacked to obtain the MRCG feature vector; the equivalent rectangular bandwidth (ERB) scale used in the cochleagram simulates the human auditory filter. The classifier is a convolutional recurrent neural network (CRNN) embedded with an attention module. Experiments on the DCASE 2017 Task 2 dataset for rare sound events show that the proposed combination of MRCG features and CRNN with attention improves performance, achieving an average error rate of 0.11 and an average F1 score of 94.3%.

© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2025.

Item: Rare sound event detection using superlets and a convolutional TDPANet (Springer Science and Business Media Deutschland GmbH, 2025) Pandey, G.; Koolagudi, S.G.

Rare Sound Event Detection (RSED) focuses on identifying infrequent but significant sound events in audio recordings with precise onset and offset times. It is crucial for applications such as surveillance, healthcare, and environmental monitoring. An essential component of RSED systems is an effective time-frequency representation of the input. Such features must capture short, transient acoustic events in a recording, even in noisy and complex environments. Most existing approaches to RSED rely on time-frequency representations such as the Mel spectrogram, Constant-Q Transform (CQT), and Continuous Wavelet Transform (CWT). However, these representations suffer from a trade-off between time and frequency resolution, which limits their ability to capture the fine-grained details needed to detect rare events in complex acoustic environments. To overcome these limitations, we introduce superlets, a novel time-frequency representation that offers super-resolution in both time and frequency. To process the high-resolution superlet features, we also propose a Convolutional Temporal Dilated Pyramid Attention Network (TDPANet), a novel neural architecture that incorporates convolutional feature extraction, dilated temporal modeling, multi-scale temporal pooling, and temporal attention mechanisms to enhance event detection accuracy. We evaluate our method on the DCASE 2017 Task 2 rare sound event dataset, which includes isolated sound events and real-world acoustic scenes. Experimental results show that the proposed method significantly outperforms state-of-the-art techniques, achieving an Error Rate (ER) of 0.15 and an F1-score of 92.3%, demonstrating its effectiveness in detecting rare sound events.

© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2025.
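The superlet representation used in the last item above is commonly defined as the geometric mean of responses from Morlet wavelets with increasing cycle counts at each frequency, which sharpens both time and frequency localization. The following NumPy sketch illustrates that idea; the cycle range, normalization, and envelope convention are illustrative choices, not the paper's exact configuration.

```python
import numpy as np

def morlet_response(x, fs, f, c):
    """Magnitude of x convolved with a complex Morlet wavelet of
    centre frequency f (Hz), c cycles, at sample rate fs."""
    sd = c / (2 * np.pi * f)              # envelope width in seconds
    t = np.arange(-3 * sd, 3 * sd, 1 / fs)
    psi = np.exp(2j * np.pi * f * t) * np.exp(-t**2 / (2 * sd**2))
    psi /= np.sum(np.abs(psi))            # amplitude normalisation
    return np.abs(np.convolve(x, psi, mode="same"))

def superlet(x, fs, freqs, base_cycles=3, order=4):
    """Superlet transform sketch: at each frequency, the geometric
    mean of Morlet responses with base_cycles .. base_cycles+order-1
    cycles. Returns a (len(freqs), len(x)) time-frequency matrix."""
    rows = []
    for f in freqs:
        resp = [morlet_response(x, fs, f, base_cycles + k)
                for k in range(order)]
        rows.append(np.prod(np.stack(resp), axis=0) ** (1.0 / order))
    return np.array(rows)
```

Because low-cycle wavelets localize well in time and high-cycle wavelets in frequency, their geometric mean keeps only energy that is consistent across both, which is the super-resolution effect the abstract refers to.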

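The multi-resolution cochleagram (MRCG) stacking described in the rare-SED item above can also be sketched. A common MRCG recipe stacks a fine-resolution cochleagram, a coarse one, and two box-smoothed copies of the fine one; the abstract does not state the paper's exact resolutions, so the 11x11 and 23x23 smoothing windows below are the typically used values, and the cochleagrams themselves are assumed to be precomputed.

```python
import numpy as np

def box_smooth(cg, size):
    """Mean-filter a (T, F) cochleagram with a size x size box,
    clamping the window at the borders."""
    T, F = cg.shape
    r = size // 2
    out = np.empty_like(cg)
    for t in range(T):
        for f in range(F):
            out[t, f] = cg[max(t - r, 0):t + r + 1,
                           max(f - r, 0):f + r + 1].mean()
    return out

def mrcg_stack(cg_fine, cg_coarse):
    """Stack a fine cochleagram, a time-aligned coarse one, and two
    box-smoothed copies of the fine one into a (T, 4F) MRCG matrix."""
    return np.concatenate(
        [cg_fine, cg_coarse, box_smooth(cg_fine, 11), box_smooth(cg_fine, 23)],
        axis=1)
```

Each channel summarizes the same ERB-scale spectro-temporal content at a different granularity, which is what lets the downstream CRNN see both transient detail and broader context.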