Journal Articles

Permanent URI for this collection: https://idr.nitk.ac.in/handle/123456789/19884

Extraction of MapReduce-based features from spectrograms for audio-based surveillance
(Elsevier Inc., 2019) Mulimani, M.; Koolagudi, S.G.
In this paper, we propose a novel parallel method for extracting significant information from spectrograms using the MapReduce programming model for an audio-based surveillance system that effectively recognizes critical acoustic events in the surrounding environment. Extracting reliable information as features from the spectrograms of a large, noisy audio event dataset demands high computational time. Parallelizing the feature extraction with the MapReduce programming model on Hadoop improves the efficiency of the overall system. Acoustic events with real-time background noise from the Mivia lab audio event dataset are used for surveillance applications. The proposed approach is time-efficient and recognizes critical acoustic events with an average recognition rate of 96.5% in different noisy conditions. © 2019 Elsevier Inc.
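The map and reduce roles mentioned in this abstract can be illustrated with a minimal single-process sketch. It is not the authors' feature set or a Hadoop job: the per-band mean energy, the function names (`map_frame`, `shuffle`, `reduce_band`, `band_energy_features`) and the toy spectrogram are all assumptions chosen only to show how per-frame work is emitted as key-value pairs and then combined per key.

```python
from collections import defaultdict
from itertools import chain


def map_frame(frame):
    """Map step: emit (band_index, energy) pairs for one spectrogram frame."""
    return [(band, value * value) for band, value in enumerate(frame)]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_band(band, energies):
    """Reduce step: average the energies collected for one frequency band."""
    return band, sum(energies) / len(energies)


def band_energy_features(spectrogram):
    """Run map, shuffle and reduce over a list of frames (hypothetical feature)."""
    mapped = chain.from_iterable(map_frame(frame) for frame in spectrogram)
    grouped = shuffle(mapped)
    return [reduce_band(band, grouped[band])[1] for band in sorted(grouped)]


# Toy spectrogram: 3 frames x 2 frequency bands (magnitudes), illustrative only.
toy = [[1.0, 2.0], [3.0, 0.0], [2.0, 2.0]]
features = band_energy_features(toy)
```

Because each frame is mapped independently, the map calls are the part a cluster could distribute across nodes; only the grouped reduce needs values from all frames.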
Segmentation and characterization of acoustic event spectrograms using singular value decomposition
(Elsevier Ltd, 2019) Mulimani, M.; Koolagudi, S.G.
Traditional frame-based speech features such as Mel-frequency cepstral coefficients (MFCCs) were developed specifically for speech and speaker recognition tasks. Speech differs from acoustic events in its phonetic structure. Hence, frame-based speech features may not be suitable for Acoustic Event Classification (AEC). In this paper, a novel method is proposed for extracting robust, acoustic-event-specific features from the spectrogram using a left singular vector for AEC. It consists of two main stages: segmentation and characterization of acoustic event spectrograms. In the first stage, the symmetric Laplacian matrix of an acoustic event spectrogram is decomposed into singular values and vectors. Then, the reliable region (spectral shape) of an acoustic event is segmented from the spectrogram using a left singular vector. The prominent values of the left singular vector, selected using the proposed threshold, automatically segment the reliable region of an acoustic event from the spectrogram. In the second stage, the segmented region of the spectrogram is used as a feature vector for AEC. The characteristics of the singular vector values belonging to the reliable (event) and unreliable (non-event) regions of the spectrogram are determined. To evaluate the proposed approach, different categories of ‘home’ acoustic events from the Freiburg-106 dataset are considered. The results show significantly improved performance in acoustic event segmentation and classification. A singular vector effectively segments the reliable region of the acoustic event from the spectrogram for a Support Vector Machine (SVM) based AEC system. The proposed AEC system is robust to noise and achieves a higher recognition rate in clean and noisy conditions than traditional speech-feature-based systems. © 2018 Elsevier Ltd
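The idea of thresholding a left singular vector to pick out a region of a spectrogram can be sketched as follows. This is a simplified stand-in, not the published method: it decomposes the magnitude spectrogram directly rather than its symmetric Laplacian, and it uses the mean of the vector's absolute values as the threshold because the abstract does not state the authors' threshold. The function name `segment_rows` and the toy data are hypothetical.

```python
import numpy as np


def segment_rows(spectrogram):
    """Return a boolean mask of frequency bins kept by thresholding |u1|."""
    u, _, _ = np.linalg.svd(spectrogram, full_matrices=False)
    leading = np.abs(u[:, 0])      # the sign of a singular vector is arbitrary
    threshold = leading.mean()     # assumed threshold, not the paper's
    return leading > threshold


# Toy spectrogram: 6 frequency bins x 8 frames of low-level background,
# with a strong "event" occupying bins 2 and 3.
spec = np.full((6, 8), 0.1)
spec[2:4, :] = 5.0

mask = segment_rows(spec)
event_region = spec[mask, :]       # rows that would feed a classifier
```

In this toy case the leading left singular vector is proportional to the per-bin magnitude profile, so the bins with high energy exceed the mean and are retained.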
Robust Acoustic Event Classification using Fusion Fisher Vector features
(Elsevier Ltd, 2019) Mulimani, M.; Koolagudi, S.G.
In this paper, novel Fusion Fisher Vector (FFV) features are proposed for Acoustic Event Classification (AEC) in meeting room environments. The monochrome images of a pseudo-color spectrogram of an acoustic event are represented as Fisher vectors. First, irrelevant feature dimensions of each Fisher vector are discarded using Principal Component Analysis (PCA); the resulting Fisher vectors are then fused to obtain the FFV features. The performance of the FFV features is evaluated on acoustic events from the UPC-TALP dataset in clean and different noisy conditions. Results show that the proposed FFV features are robust to noise and achieve an overall recognition accuracy of 94.32% in clean and different noisy conditions. © 2019 Elsevier Ltd
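The reduce-then-fuse step described in this abstract can be sketched under several assumptions: the Fisher vectors are replaced by random placeholders, three monochrome channels are assumed, PCA is computed via SVD of the centered data, and fusion is taken to be simple concatenation, which the abstract does not confirm. The names `pca_reduce` and `fuse` are hypothetical.

```python
import numpy as np


def pca_reduce(vectors, n_components):
    """Project row vectors onto their top principal components."""
    centered = vectors - vectors.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def fuse(channel_vectors, n_components):
    """Reduce each channel with PCA, then concatenate per sample (assumed fusion)."""
    reduced = [pca_reduce(v, n_components) for v in channel_vectors]
    return np.hstack(reduced)


rng = np.random.default_rng(0)
# Placeholder "Fisher vectors": 3 channels, 10 events, 16 dimensions each.
channels = [rng.normal(size=(10, 16)) for _ in range(3)]
ffv = fuse(channels, n_components=4)   # one fused row per event
```

Each event ends up with 3 x 4 = 12 fused dimensions here; in practice the number of retained components would be chosen from the explained variance.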
Bi-level Acoustic Scene Classification Using Lightweight Deep Learning Model
(Birkhauser, 2024) Spoorthy, V.; Koolagudi, S.G.
Identifying a scene based on the environment in which the related audio is recorded is known as acoustic scene classification (ASC). In this paper, a bi-level, lightweight Convolutional Neural Network (CNN)-based model is presented to perform ASC. The proposed approach performs classification at two levels. At the first level, the scenes are classified into three broad categories: indoor, outdoor, and transportation. At the second level, these three classes are further categorized into individual scenes. The proposed approach is implemented using three features: log Mel band energies, harmonic spectrograms, and percussive spectrograms. To perform the classification, three CNN classifiers are used: MobileNetV2, Squeeze-and-Excitation Net (SENet), and a combination of these two architectures, known as SE-MobileNet. The proposed combined model exploits the advantages of both the MobileNetV2 and SENet architectures. Extensive experiments are conducted on the DCASE 2020 (IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events) Task 1B development dataset and the DCASE 2016 ASC dataset. The proposed SE-MobileNet model achieves classification accuracies of 96.9% and 86.6% for the first and second levels, respectively, on the DCASE 2020 dataset, and 97.6% and 88.4%, respectively, on the DCASE 2016 dataset. The proposed model is reported to be better in terms of both complexity and accuracy than state-of-the-art low-complexity ASC systems. © 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
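The two-level routing described in this abstract can be sketched with stand-in classifiers. The nearest-centroid rule below replaces the CNNs purely for illustration, and the 2-D centroids, the individual scene names, and the function names (`nearest`, `classify`) are hypothetical; only the three broad categories come from the abstract.

```python
import math


def nearest(centroids, x):
    """Return the label whose centroid is closest to feature vector x."""
    return min(centroids, key=lambda label: math.dist(centroids[label], x))


# Hypothetical 2-D centroids standing in for trained classifiers.
COARSE = {
    "indoor": (0.0, 0.0),
    "outdoor": (10.0, 0.0),
    "transportation": (0.0, 10.0),
}
FINE = {
    "indoor": {"airport": (-1.0, 0.0), "shopping_mall": (1.0, 0.0)},
    "outdoor": {"park": (9.0, 0.0), "street": (11.0, 0.0)},
    "transportation": {"bus": (0.0, 9.0), "tram": (0.0, 11.0)},
}


def classify(x):
    """Level 1 picks the broad category; level 2 picks a scene within it."""
    coarse = nearest(COARSE, x)
    fine = nearest(FINE[coarse], x)
    return coarse, fine


result = classify((10.8, 0.3))
```

The second-level classifier only has to separate scenes within one broad category, which is the property a bi-level design relies on; a first-level error, however, cannot be recovered at the second level.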
Blended-emotional speech for Speaker Recognition by using the fusion of Mel-CQT spectrograms feature extraction
(Elsevier Ltd, 2025) Tomar, S.; Koolagudi, S.G.
Emotions are integral to human speech, adding depth and influencing the effectiveness of interactions. Single-emotion speech is speech in which the emotional state stays the same throughout the utterance. In contrast, blended-emotion speech involves a mix of emotions, such as happiness tinged with sadness or a shift from neutral to sadness within the same utterance. In real-life scenarios, people often experience and express mixed emotions. Most existing work on Speaker Recognition (SR), which identifies a person from their voice, has focused on either neutral speech or a few primary emotions. This study aims to develop Blended-Emotional Speaker Recognition (BESR). In the proposed work, we look for emotional information in speech signals by simulating a blended emotional speech dataset for Speaker Recognition. A fusion of Mel-spectrograms and Constant-Q Transform spectrograms (Mel-CQT spectrograms) is developed to extract features. Three datasets are considered for the proposed work: the National Institute of Technology Karnataka Kannada Language Emotional Speech Corpus (NITK-KLESC), the Crowd-sourced emotional multimodal actors dataset (CREMA-D), and the Indian Institute of Technology Kharagpur Simulated Emotion Hindi Speech Corpus (IITKGP-SEHSC). The experimental outcomes demonstrate that the BESR system using blended emotional speech improves the fairness of Speaker Recognition. © 2025 Elsevier Ltd
