Faculty Publications
Permanent URI for this community: https://idr.nitk.ac.in/handle/123456789/18736
Publications by NITK Faculty
9 results
Item: Classification of vocal and non-vocal regions from audio songs using spectral features and pitch variations (Institute of Electrical and Electronics Engineers Inc., 2015). Vishnu Srinivasa Murthy, Y.V.S.; Koolagudi, S.G.
In this work, an effort has been made to identify vocal and non-vocal regions in a given song using signal processing techniques and machine learning algorithms. Initially, spectral features such as mel-frequency cepstral coefficients (MFCCs) are used to develop the baseline system. Statistical values of pitch, jitter, and shimmer are then added to improve the performance of the system. Artificial neural networks (ANNs) are used to capture the characteristics of vocal and non-vocal segments of the songs. The experiment is conducted on 60 vocal and 60 non-vocal clips extracted from Telugu albums. An 11-point moving window is used to ensure the continuity of vocal and non-vocal segments, thus improving the accuracy of the system. With this approach, the system achieves 85.59% accuracy for vocal and 88.52% for non-vocal segment classification. © 2015 IEEE.

Item: Audio songs classification based on music patterns (Springer Verlag, 2016). Sharma, R.; Vishnu Srinivasa Murthy, Y.V.S.; Koolagudi, S.G.
In this work, an effort has been made to classify audio songs based on their music pattern, which helps retrieve music clips according to a listener's taste. This task is helpful in indexing and accessing music clips based on the listener's state. Seven main categories are considered in this work: devotional, energetic, folk, happy, pleasant, sad, and sleepy. Forty music clips of each category are used for the training phase and fifteen clips of each category for the testing phase. Vibrato-related features such as jitter and shimmer are computed along with the mel-frequency cepstral coefficients (MFCCs); statistical values of pitch such as minimum, maximum, mean, and standard deviation are added to the MFCCs, jitter, and shimmer, resulting in a 19-dimensional feature vector. A feedforward backpropagation neural network (BPNN) is used as a classifier due to its efficiency in mapping nonlinear relations. An average accuracy of 82% is achieved on 105 testing clips. © Springer India 2016.
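Both items above follow the same pipeline: frame-level MFCCs, statistics of pitch (with jitter and shimmer), a neural-network classifier, and, in the first item, an 11-point moving window to smooth frame decisions. A minimal Python sketch of that pipeline is given below; it is not the authors' code, and the libraries (librosa, scipy, scikit-learn), the YIN pitch tracker, and all parameter values are assumptions made for illustration.

```python
# Illustrative sketch only; feature sizes and model settings are assumed.
import numpy as np
import librosa
from scipy.ndimage import median_filter
from sklearn.neural_network import MLPClassifier

def clip_features(y, sr):
    """MFCC means plus pitch statistics (min, max, mean, std) for one clip."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # (13, frames)
    f0 = librosa.yin(y, fmin=80, fmax=1000, sr=sr)            # frame-level pitch track
    pitch_stats = [f0.min(), f0.max(), f0.mean(), f0.std()]
    return np.concatenate([mfcc.mean(axis=1), pitch_stats])   # 17-dim clip vector

# X: stacked clip_features of labelled clips; y01: 1 = vocal, 0 = non-vocal
# ann = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(X, y01)

def smooth_labels(pred):
    """11-point moving majority vote over binary frame labels."""
    return median_filter(pred.astype(float), size=11) > 0.5
```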
Item: Sound event detection in urban soundscape using two-level classification (Institute of Electrical and Electronics Engineers Inc., 2016). Luitel, B.; Vishnu Srinivasa Murthy, Y.V.S.; Koolagudi, S.G.
A huge increase in the automobile sector has led to the creation of a large volume of different sounds, especially in urban cities. An analysis of the increased quantity of automobiles gives information related to traffic and vehicles. It also provides scope to understand the scenario of a particular location using soundscape information. In this paper, a two-level classification is proposed to classify urban sound events such as bus engine (BE), bus horn (BH), car horn (CH), and whistle (W) sounds. These sounds are chosen as they play a major role in traffic scenarios. Real-time data is collected from live recordings at major locations of an urban city. Prior to the detection of events, the class of events is identified using signal processing techniques. Further, features such as mel-frequency cepstral coefficients (MFCCs) are extracted based on an analysis of the spectra of the above-mentioned events, and they remain discriminative even in complex scenarios. Classifiers such as artificial neural networks (ANN), naive Bayes (NB), decision tree (J48), and random forest (RF) are used at the two levels. The proposed approach outperforms existing approaches that usually perform direct feature extraction without signal-level analysis. © 2016 IEEE.

Item: Performance analysis of LPC and MFCC features in voice conversion using artificial neural networks (Springer Verlag, 2017). Koolagudi, S.G.; Vishwanath, B.K.; Akshatha, M.; Vishnu Srinivasa Murthy, Y.V.S.
Voice conversion is a technique in which a source speaker's voice is morphed into a target speaker's voice by learning the source–target relationship from a number of utterances of the source and the target. Many applications may benefit from this sort of technology, for example dubbing movies and TV shows, TTS systems, and so on. In this paper, the performance of an ANN-based voice conversion system is analysed using linear predictive coding (LPC) features and mel-frequency cepstral coefficients (MFCCs). Experimental results show that the voice conversion system based on LPC features performs better than the one based on MFCC features. © Springer Science+Business Media Singapore 2017.

Item: Choice of a classifier, based on properties of a dataset: case study-speech emotion recognition (Springer New York LLC, 2018). Koolagudi, S.G.; Vishnu Srinivasa Murthy, Y.V.S.; Bhaskar, S.P.
In this paper, a process for selecting a classifier based on the properties of a dataset is designed, since it is very difficult to experiment with the data on an arbitrary number of classifiers. Speech emotion recognition is considered as a case study. Different combinations of spectral and prosodic features relevant to emotions are explored, and the best subset of the chosen features is recommended for each classifier based on the properties of the chosen dataset. Various statistical tests have been used to estimate the properties of the dataset, and the nature of the dataset suggests which classifier is relevant. To make the comparison more precise, three other clustering and classification techniques, namely K-means clustering, vector quantization, and artificial neural networks, are used for experimentation, and their results are compared with those of the selected classifier. Prosodic features such as pitch, intensity, jitter, and shimmer, and spectral features such as mel-frequency cepstral coefficients (MFCCs) and formants are considered in this work. Statistical parameters of prosody, such as minimum, maximum, mean (μ), and standard deviation (σ), are extracted from speech and combined with the basic spectral (MFCC) features to obtain better performance. Five basic emotions, namely anger, fear, happiness, neutral, and sadness, are considered. For analysing the performance of different datasets on different classifiers, content- and speaker-independent emotional data collected from Telugu movies is used, and mean opinion scores from fifty users are collected to label the emotional data. To generalize the conclusions, the benchmark IIT-Kharagpur emotional database is also used. © 2018, Springer Science+Business Media, LLC, part of Springer Nature.
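The two-level scheme in the sound-event item above (a coarse decision followed by refinement within each group) can be sketched as below. The paper does not publish code or name the level-1 grouping, so the split into tonal sounds versus engine sounds, the random-forest choice at both levels, and all settings are assumptions for illustration.

```python
# Hedged sketch of a two-level (coarse-then-fine) sound-event classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

TONAL = {"BH", "CH", "W"}  # assumed level-1 grouping; "BE" forms the engine group

def fit_two_level(X, y):
    y = np.asarray(y)
    coarse = np.where(np.isin(y, list(TONAL)), "tonal", "engine")
    level1 = RandomForestClassifier(n_estimators=100).fit(X, coarse)
    level2 = {}
    for g in ("tonal", "engine"):
        labels = y[coarse == g]
        if len(np.unique(labels)) > 1:
            level2[g] = RandomForestClassifier(n_estimators=100).fit(X[coarse == g], labels)
        else:
            level2[g] = labels[0]        # degenerate group with a single event type
    return level1, level2

def predict_two_level(level1, level2, X):
    groups = level1.predict(X)
    out = np.empty(len(X), dtype=object)
    for g in np.unique(groups):
        mask, refiner = groups == g, level2[g]
        out[mask] = refiner.predict(X[mask]) if hasattr(refiner, "predict") else refiner
    return out
```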
Item: Classification of vocal and non-vocal segments in audio clips using genetic algorithm based feature selection (GAFS) (Elsevier Ltd, 2018). Vishnu Srinivasa Murthy, Y.V.S.; Koolagudi, S.G.
Music information retrieval (MIR) is an emerging field whose technology helps in tagging each portion of an audio clip, and a majority of MIR subtasks need an application that segments vocal and non-vocal portions. In this paper, an effort has been made to segment vocal and non-vocal regions using some novel features based on formant structure, on top of standard features. The features considered are mel-frequency cepstral coefficients (MFCCs), linear prediction cepstral coefficients (LPCCs), frequency domain linear prediction (FDLP) values, statistical values of pitch, jitter, shimmer, formant attack slope (FAS), formant heights from base-to-peak (FH1) and peak-to-base (FH2), formant angle values at peak (FA1) and valley (FA2), and F5. Classifiers such as artificial neural networks (ANN), support vector machines (SVM), and random forest (RF) have been considered for a comparative study, as they are powerful enough to discover highly non-linear patterns. Genetic algorithms, with the support of neural networks, have been used to select the relevant features rather than considering all dimensions; the method is named genetic algorithm based feature selection (GAFS). An accuracy of 89.23% before windowing and 95.16% after windowing is obtained with an optimal feature vector of length 32 using artificial neural networks. The developed system is capable of detecting singing-voice segments with an accuracy of 98%. © 2018 Elsevier Ltd

Item: Segmentation and characterization of acoustic event spectrograms using singular value decomposition (Elsevier Ltd, 2019). Mulimani, M.; Koolagudi, S.G.
Traditional frame-based speech features such as mel-frequency cepstral coefficients (MFCCs) were developed specifically for speech/speaker recognition tasks. Speech differs from acoustic events in its phonetic structure, so frame-based speech features may not be suitable for acoustic event classification (AEC). In this paper, a novel method is proposed for extracting robust, acoustic-event-specific features from the spectrogram using a left singular vector for AEC. It consists of two main stages: segmentation and characterization of acoustic event spectrograms. In the first stage, the symmetric Laplacian matrix of an acoustic event spectrogram is decomposed into singular values and vectors, and the reliable region (spectral shape) of the acoustic event is segmented from the spectrogram using a left singular vector: the prominent values of the left singular vector, selected using the proposed threshold, automatically segment the reliable region of the acoustic event. In the second stage, the segmented region of the spectrogram is used as a feature vector for AEC, and the characteristics of the singular-vector values belonging to reliable (event) and unreliable (non-event) regions of the spectrogram are determined. To evaluate the proposed approach, different categories of 'home' acoustic events are considered from the Freiburg-106 dataset. The results show significantly improved performance in acoustic event segmentation and classification: a singular vector effectively segments the reliable region of the acoustic event from the spectrogram for a support vector machine (SVM) based AEC system. The proposed AEC system is robust to noise and achieves a higher recognition rate in clean and noisy conditions compared to traditional speech-feature-based systems. © 2018 Elsevier Ltd
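GAFS, used in the vocal/non-vocal item above (and again in the singer-identification item below), wraps a classifier inside a genetic search over binary feature masks. A hedged sketch follows; the population size, the operators, and the ANN fitness function are illustrative choices, not the papers' published settings.

```python
# Hedged GAFS sketch: evolve boolean feature masks scored by ANN accuracy.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    """Cross-validated ANN accuracy on the selected feature subset."""
    if not mask.any():
        return 0.0
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300)
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()

def gafs(X, y, pop_size=20, generations=30, p_mut=0.05):
    n = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n)).astype(bool)     # random initial masks
    for _ in range(generations):
        scores = np.array([fitness(m, X, y) for m in pop])
        parents = pop[np.argsort(scores)[::-1][: pop_size // 2]]  # truncation selection
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n)                              # one-point crossover
            children.append(np.concatenate([a[:cut], b[cut:]]) ^ (rng.random(n) < p_mut))
        pop = np.vstack([parents] + children)
    return max(pop, key=lambda m: fitness(m, X, y))               # best boolean mask
```

The SVD-based segmentation in the acoustic-event item can be approximated as below. The affinity construction, the use of the dominant left singular vector, and the mean threshold are simplifications for illustration; the paper defines its own symmetric Laplacian and threshold, which the abstract does not specify.

```python
# Rough illustration of left-singular-vector segmentation of a spectrogram.
import numpy as np
import librosa

def segment_event_frames(y, sr, n_fft=1024, hop_length=512):
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))  # (freq, time)
    A = S.T @ S                                   # frame-to-frame affinity matrix
    L = np.diag(A.sum(axis=1)) - A                # an (unnormalized) graph Laplacian
    U, _, _ = np.linalg.svd(L)                    # left singular vectors
    u = np.abs(U[:, 0])                           # dominant left singular vector
    keep = u > u.mean()                           # illustrative threshold on prominent values
    return S[:, keep]                             # segmented 'reliable' region
```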
Item: Singer identification for Indian singers using convolutional neural networks (Springer, 2021). Vishnu Srinivasa Murthy, Y.V.S.; Koolagudi, S.G.; Jeshventh Raja, T.K.
Singer identification is one of the important aspects of music information retrieval (MIR). In this work, traditional feature-based approaches and trending convolutional neural network (CNN) based approaches are considered and compared for identifying singers. Two datasets, namely artist20 and an Indian popular singers' database with 20 singers, are used to evaluate the proposed approaches. Cepstral features such as mel-frequency cepstral coefficients (MFCCs) and linear prediction cepstral coefficients (LPCCs) are considered to represent timbre information, and shifted delta cepstral (SDC) features are computed besides the cepstral coefficients to capture temporal information. In addition, chroma features are computed from the 12 semitones of a musical octave, overall forming a 46-dimensional feature vector. Experiments are conducted with different feature combinations, and suitable features are selected using the genetic algorithm based feature selection (GAFS) approach. Two classification techniques, namely artificial neural networks (ANNs) and random forest (RF), are applied to the features mentioned above. Further, spectrograms and chromagrams of audio clips are fed directly to a CNN for classification. The singer identification results obtained using CNNs are better than those of the traditional isolated and ensemble classifiers: an average accuracy of around 75% is observed with the CNN on the Indian popular singers' database, whereas on the artist20 dataset the proposed feature-based and CNN configurations could not exceed 60% accuracy. © 2021, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
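A CNN of the kind described in the item above might look roughly like the following. The abstract does not give the architecture, input size, or training setup, so everything in this sketch (including the use of Keras) is an assumption; only the number of classes (20 singers) comes from the item.

```python
# Hedged sketch of a spectrogram CNN for 20-singer identification.
from tensorflow.keras import layers, models

NUM_SINGERS = 20  # both datasets in the item above contain 20 singers

def build_cnn(input_shape=(128, 128, 1)):   # (mel bins, frames, channels); illustrative
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.GlobalAveragePooling2D(),
        layers.Dense(64, activation="relu"),
        layers.Dense(NUM_SINGERS, activation="softmax"),
    ])

model = build_cnn()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_spectrograms, train_singer_ids, epochs=30, validation_split=0.1)
```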
Item: Fog-Based Intelligent Machine Malfunction Monitoring System for Industry 4.0 (IEEE Computer Society, 2021). Natesha, B.V.; Guddeti, R.M.R.
There is an exponential increase in the use of Industrial Internet of Things (IIoT) devices for controlling and monitoring machines in automated manufacturing industries. Temperature sensors, pressure sensors, audio sensors, and camera devices are used as IIoT devices for pipeline monitoring and machine operation control in the industrial environment. However, monitoring and identifying machine malfunctions in an industrial environment is a challenging task. In this article, we consider machine fault diagnosis based on operating sound, using a fog computing architecture in the industrial environment. Different computing units, such as industrial controller units or a micro data center, serve as the fog server to analyse and classify machine sounds as normal or abnormal. Linear prediction coefficients and mel-frequency cepstral coefficients are extracted from the machine sound to develop and deploy supervised machine learning (ML) models on the fog server, which monitor and identify malfunctioning machines from their operating sound. The experimental results show the performance of the ML models on machine sounds recorded at different signal-to-noise ratio levels for normal and abnormal operations. © 2021 IEEE.
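The classifier deployed on the fog server in the last item could be sketched as follows: LPC and MFCC features extracted from a machine-sound recording feed a supervised model that flags abnormal operation. The feature orders, the SVM choice, and the function names are illustrative assumptions, not details from the article.

```python
# Illustrative fog-server malfunction classifier; parameters are assumed.
import numpy as np
import librosa
from sklearn.svm import SVC

def machine_sound_features(y, sr, lpc_order=12, n_mfcc=13):
    """LPC and mean-MFCC features for one machine-sound recording."""
    lpc = librosa.lpc(y, order=lpc_order)                  # linear prediction coefficients
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)
    return np.concatenate([lpc[1:], mfcc])                 # drop the leading 1.0 of the LPC vector

# X: features of labelled recordings; y01: 0 = normal, 1 = abnormal operation
# model = SVC(kernel="rbf").fit(X, y01)  # trained and served on the fog node
```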