Faculty Publications
Permanent URI for this community: https://idr.nitk.ac.in/handle/123456789/18736
Publications by NITK Faculty
Search Results
Item: A framework for estimating geometric distortions in video copies based on visual-audio fingerprints (Springer-Verlag London Ltd, 2015). Roopalakshmi, R.; Guddeti, G.R.M.
Spatio-temporal alignment and estimation of the distortion model between pirate and master video contents are prerequisites for approximating the illegal capture location in a theater. State-of-the-art techniques exploit only visual features of videos for the alignment and distortion-model estimation of watermarked sequences, while few efforts address acoustic features and non-watermarked video contents. To address this, we propose a distortion-model estimation framework based on multimodal signatures, which integrates several components: compact representation of a video using visual-audio fingerprints derived from Speeded Up Robust Features and Mel-frequency cepstral coefficients; a segmentation-based bipartite matching scheme to obtain accurate temporal alignments; stable frame-pair extraction followed by filtering policies to achieve geometric alignments; and distortion-model estimation in terms of a homography matrix. Experiments on camcorded datasets demonstrate promising results for the proposed framework compared with the reference methods. © 2013, Springer-Verlag London.
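The final step of the framework above, recovering the geometric distortion as a homography between stable frame pairs, can be illustrated with OpenCV. The sketch below is a minimal, hypothetical reconstruction, not the authors' code: it matches keypoints between a master and a pirate frame and fits a homography with RANSAC. ORB stands in for the patented SURF detector named in the abstract, and the frame paths are placeholders.

```python
# Minimal sketch: estimate the master->pirate homography from one stable
# frame pair, assuming OpenCV (cv2) and placeholder frame images.
import cv2
import numpy as np

master = cv2.imread("master_frame.png", cv2.IMREAD_GRAYSCALE)
pirate = cv2.imread("pirate_frame.png", cv2.IMREAD_GRAYSCALE)

# ORB stands in for SURF (SURF lives in the non-free opencv-contrib build).
orb = cv2.ORB_create(nfeatures=2000)
kp_m, des_m = orb.detectAndCompute(master, None)
kp_p, des_p = orb.detectAndCompute(pirate, None)

# Brute-force Hamming matching with a ratio test to keep stable pairs only.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
good = [m for m, n in matcher.knnMatch(des_m, des_p, k=2)
        if m.distance < 0.75 * n.distance]

src = np.float32([kp_m[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp_p[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

# RANSAC filters the remaining outliers; H is the 3x3 distortion model.
H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
print("Estimated homography:\n", H)
```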
Item: Recognition of emotions from video using acoustic and facial features (Springer-Verlag London Ltd, 2015). Sreenivasa Rao, K.S.; Koolagudi, S.
In this paper, acoustic and facial features extracted from video are explored for recognizing emotions. The temporal variation of gray values of the pixels within the eye and mouth regions is used as a feature to capture emotion-specific knowledge from facial expressions. Acoustic features representing spectral and prosodic information are explored for recognizing emotions from the speech signal. Autoassociative neural network models are used to capture the emotion-specific information from acoustic and facial features. The basic objective of this work is to examine the capability of the proposed acoustic and facial features to capture emotion-specific information. Further, the correlations among the feature sets are analyzed by combining the evidences at different levels. The performance of the emotion recognition systems developed using acoustic and facial features is observed to be 85.71% and 88.14%, respectively, and combining the evidences of the two models improves the recognition performance to 93.62%. The performance of the neural network models is compared with hidden Markov models, Gaussian mixture models and support vector machine models. The proposed features and models are evaluated on a real-life emotional database, the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, collected at the University of Southern California. © 2013, Springer-Verlag London.

Item: A study and implementation of mapping and speech recognition techniques for an autonomous mobile robot based on ROS (Inderscience Enterprises Ltd., 2017). Srinivasa Rao, H.; Desai, V.; Bhat, R.; Jayaprakash, S.; Sampangi, Y.
Autonomous mobile robots work in close interaction with humans in environments such as homes, hospitals, public places and disaster areas. The main constraints on such robots are safety, autonomy and efficiency in assisting humans. Given these constraints, developing autonomous mobile robot technologies is a major challenge for both industry and the research community. This paper describes how an indoor autonomous mobile robot based on the Robot Operating System (ROS) can use Lidar and other sensors to build a map of an environment and perform autonomous navigation, with capabilities such as dynamic obstacle avoidance, speech recognition and video streaming. To achieve these features, algorithms such as SLAM, AMCL and the dynamic window approach are used, together with the CMU PocketSphinx speech recogniser. For video streaming, the ROS web video server is used, and the recorded video can be sent to a remote desktop system over the ROS network. © 2017 Inderscience Enterprises Ltd.
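A ROS 1 navigation stack of the kind described above is typically driven by sending goals to the move_base action server, which combines AMCL localization with a dynamic-window local planner. The rospy client below is a generic sketch, not code from the paper; it assumes a running navigation stack, a "map" frame, and placeholder goal coordinates.

```python
#!/usr/bin/env python
# Generic ROS 1 sketch: send one navigation goal to move_base and wait.
# Assumes a running navigation stack (map server, AMCL, move_base).
import rospy
import actionlib
from move_base_msgs.msg import MoveBaseAction, MoveBaseGoal

rospy.init_node("nav_goal_sender")
client = actionlib.SimpleActionClient("move_base", MoveBaseAction)
client.wait_for_server()

goal = MoveBaseGoal()
goal.target_pose.header.frame_id = "map"      # goal expressed in the map frame
goal.target_pose.header.stamp = rospy.Time.now()
goal.target_pose.pose.position.x = 2.0        # placeholder coordinates (meters)
goal.target_pose.pose.position.y = 1.0
goal.target_pose.pose.orientation.w = 1.0     # face along +x

client.send_goal(goal)
client.wait_for_result()
rospy.loginfo("Navigation result state: %d", client.get_state())
```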
Item: Dravidian language classification from speech signal using spectral and prosodic features (Springer New York LLC, 2017). Koolagudi, S.G.; Bharadwaj, A.; Vishnu Srinivasa Murthy, Y.V.; Reddy, N.; Rao, P.
An interesting aspect of the Dravidian languages is their commonality through a shared script, similar vocabulary and a common root language. In this work, an attempt has been made to classify four complex Dravidian languages using cepstral coefficients and prosodic features. Speech in the Dravidian languages was recorded in various environments and treated as a database. It is demonstrated that while cepstral coefficients alone can identify the language with a fair degree of accuracy, adding prosodic features to the cepstral coefficients further improves language identification performance. Legendre polynomial fitting and principal component analysis (PCA) are applied to the feature vectors to reduce dimensionality, which also reduces time complexity. In the experiments conducted, using both cepstral coefficients and prosodic features yields a language identification rate of around 87%, about 18% above the baseline system using Mel-frequency cepstral coefficients (MFCCs). The results indicate that temporal variations and prosody are important factors to consider in language identification tasks. © 2017, Springer Science+Business Media, LLC.

Item: Choice of a classifier, based on properties of a dataset: case study-speech emotion recognition (Springer New York LLC, 2018). Koolagudi, S.G.; Vishnu Srinivasa Murthy, Y.V.S.; Bhaskar, S.P.
In this paper, a process for selecting a classifier based on the properties of a dataset is designed, since it is very difficult to experiment with an arbitrary number of classifiers. Speech emotion recognition is considered as a case study. Different combinations of spectral and prosodic features relevant to emotions are explored, and the best subset of the chosen features is recommended for each classifier based on the properties of the chosen dataset. Various statistical tests are used to estimate the properties of the dataset, and the nature of the dataset guides selection of the relevant classifier. To make the comparison more precise, three other clustering and classification techniques, namely K-means clustering, vector quantization and artificial neural networks, are used for experimentation, and the results are compared with those of the selected classifier. Prosodic features such as pitch, intensity, jitter and shimmer, and spectral features such as Mel-frequency cepstral coefficients (MFCCs) and formants, are considered in this work. Statistical parameters of prosody such as minimum, maximum, mean (μ) and standard deviation (σ) are extracted from speech and combined with basic spectral (MFCC) features for better performance. Five basic emotions, namely anger, fear, happiness, neutral and sadness, are considered. For analysing the performance of different datasets on different classifiers, content- and speaker-independent emotional data collected from Telugu movies is used, with mean opinion scores from fifty users collected to label the data. To generalize the conclusions, a benchmark IIT-Kharagpur emotional database is also used. © 2018, Springer Science+Business Media, LLC, part of Springer Nature.
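Both items above build one feature vector per utterance by combining frame-level MFCCs with utterance-level prosody statistics (min, max, mean, std of pitch, plus energy). A minimal sketch of that front end, assuming librosa and a placeholder WAV file; the papers' exact feature sets and window settings may differ:

```python
# Sketch: build a combined spectral + prosodic feature vector for one clip.
# Assumes librosa >= 0.8; "clip.wav" is a placeholder path.
import numpy as np
import librosa

y, sr = librosa.load("clip.wav", sr=16000)

# Spectral part: 13 MFCCs, averaged over frames (papers may keep per-frame).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
spectral = mfcc.mean(axis=1)

# Prosodic part: pitch track via YIN, summarized by min/max/mean/std.
f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr)
f0 = f0[np.isfinite(f0)]
prosodic = np.array([f0.min(), f0.max(), f0.mean(), f0.std()])

# Intensity proxy: RMS energy statistics.
rms = librosa.feature.rms(y=y)[0]
energy = np.array([rms.mean(), rms.std()])

features = np.concatenate([spectral, prosodic, energy])
print(features.shape)  # one fixed-length vector per clip, ready for a classifier
```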
Item: Classification of vocal and non-vocal segments in audio clips using genetic algorithm based feature selection (GAFS) (Elsevier Ltd, 2018). Vishnu Srinivasa Murthy, Y.V.S.; Koolagudi, S.G.
Music information retrieval (MIR) is an emerging technology that helps in tagging each portion of an audio clip, and a majority of MIR subtasks need an application that segments vocal and non-vocal portions. In this paper, an effort has been made to segment vocal and non-vocal regions using some novel features based on formant structure on top of standard features. Features such as Mel-frequency cepstral coefficients (MFCCs), linear prediction cepstral coefficients (LPCCs), frequency domain linear prediction (FDLP) values, statistical values of pitch, jitter, shimmer, formant attack slope (FAS), formant heights from base-to-peak (FH1) and peak-to-base (FH2), formant angle values at peak (FA1) and valley (FA2), and F5 are considered. Classifiers such as artificial neural networks (ANN), support vector machines (SVM) and random forests (RF) are compared, as they are powerful enough to discover large non-linear patterns. Genetic algorithms supported by neural networks are used to select the relevant features rather than considering all dimensions, an approach named genetic algorithm based feature selection (GAFS). An accuracy of 89.23% before windowing and 95.16% after windowing is obtained with the optimal feature vector of length 32 using artificial neural networks. The system developed is capable of detecting singing-voice segments with an accuracy of 98%. © 2018 Elsevier Ltd.

Item: Phoneme boundary detection from speech: A rule based approach (Elsevier B.V., 2019). Ramteke, P.B.; Koolagudi, S.G.
In this paper, a novel approach is proposed for the automatic segmentation of a speech signal into phonemes. In a well-spoken word, phonemes can be characterized by the changes observed in the speech waveform. To obtain phoneme boundaries, the signal-level properties of the speech waveform, i.e. the changes in the waveform during the transition from one phoneme to another, are explored. The problem of phoneme-level segmentation is addressed from two aspects: (1) segmentation of phonemes between voiced and unvoiced portions, and (2) segmentation of phonemes within voiced and unvoiced regions. Pitch and the zero-frequency filtered signal are used to locate regions of change from voiced to unvoiced and vice versa. Phoneme boundaries within voiced and unvoiced regions are approximated using properties of the power spectrum of the correlation of adjacent frames of the signal, and a finite set of rules is proposed based on the variations observed in the power spectrum during phoneme transitions. The segmentation results of both approaches are combined to obtain the final phoneme boundaries. Three databases, namely the TIMIT corpus, the IIIT Hyderabad Marathi database and the IIIT Hyderabad Hindi database (IIIT-H Indic Speech Databases), are used to test the proposed approach; accuracies of 95.40%, 96.87% and 96.12%, respectively, are achieved within a tolerance of 10 ms. The proposed approach is observed to give precise phoneme boundaries. © 2019 Elsevier B.V.
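The GAFS idea in the first item above, evolving a binary mask over feature dimensions and scoring each mask by classifier accuracy, can be sketched in a few lines. The code below is a generic illustration with scikit-learn and synthetic data, not the authors' implementation; the population size, mutation rate and the small MLP fitness classifier are arbitrary choices.

```python
# Sketch: genetic-algorithm feature selection with a classifier as fitness.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=40, random_state=0)

def fitness(mask):
    """Cross-validated accuracy of a small neural net on the masked features."""
    if mask.sum() == 0:
        return 0.0
    clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=300, random_state=0)
    return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=3).mean()

pop = rng.integers(0, 2, size=(20, X.shape[1]))        # random binary masks
for gen in range(10):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]            # keep the fittest half
    cut = rng.integers(1, X.shape[1], size=10)
    kids = np.array([np.concatenate([parents[i][:c], parents[(i + 1) % 10][c:]])
                     for i, c in enumerate(cut)])      # one-point crossover
    flip = rng.random(kids.shape) < 0.02               # small mutation rate
    pop = np.vstack([parents, np.where(flip, 1 - kids, kids)])

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected dims:", best.sum(), "of", X.shape[1])
```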
Item: Segmentation and characterization of acoustic event spectrograms using singular value decomposition (Elsevier Ltd, 2019). Mulimani, M.; Koolagudi, S.G.
Traditional frame-based speech features such as Mel-frequency cepstral coefficients (MFCCs) were developed specifically for speech and speaker recognition tasks. Speech differs from acoustic events in its phonetic structure, so frame-based speech features may not be suitable for Acoustic Event Classification (AEC). In this paper, a novel method is proposed for extracting robust, acoustic-event-specific features from the spectrogram using a left singular vector. It consists of two main stages: segmentation and characterization of acoustic event spectrograms. In the first stage, the symmetric Laplacian matrix of an acoustic event spectrogram is decomposed into singular values and vectors, and the reliable region (spectral shape) of the acoustic event is segmented from the spectrogram using a left singular vector: the prominent values of the left singular vector, selected by the proposed threshold, automatically segment the reliable region. In the second stage, the segmented region of the spectrogram is used as a feature vector for AEC, and the characteristics of singular-vector values belonging to reliable (event) and unreliable (non-event) regions of the spectrogram are determined. To evaluate the proposed approach, different categories of 'home' acoustic events are considered from the Freiburg-106 dataset. The results show significantly improved performance in acoustic event segmentation and classification: the singular vector effectively segments the reliable region of the acoustic event from the spectrogram for a Support Vector Machine (SVM) based AEC system. The proposed AEC system is robust to noise and achieves a higher recognition rate in clean and noisy conditions than traditional speech-feature-based systems. © 2018 Elsevier Ltd.

Item: Fog-Based Intelligent Machine Malfunction Monitoring System for Industry 4.0 (IEEE Computer Society, 2021). Natesha, B.V.; Guddeti, R.M.R.
There is an exponential increase in the use of Industrial Internet of Things (IIoT) devices for controlling and monitoring machines in automated manufacturing industries. Temperature sensors, pressure sensors, audio sensors and camera devices are used as IIoT devices for pipeline monitoring and machine operation control in the industrial environment, but monitoring and identifying machine malfunctions remains a challenging task. In this article, we consider machine fault diagnosis based on operating sound, using a fog computing architecture in the industrial environment. Computing units such as industrial controller units or a micro data center serve as fog servers that analyze and classify machine sounds as normal or abnormal. Linear prediction coefficients and Mel-frequency cepstral coefficients are extracted from the machine sound to develop and deploy supervised machine learning (ML) models on the fog server, which monitor and identify malfunctioning machines from their operating sound. The experimental results show the performance of the ML models on machine sounds recorded at different signal-to-noise ratio levels for normal and abnormal operations. © 2021 IEEE.
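The segmentation stage of the SVD paper above can be approximated with NumPy. The sketch below is a loose, simplified reading of the method rather than the published algorithm: it builds a symmetric Laplacian from frame-to-frame similarities of the spectrogram, takes its SVD, and uses the sign pattern of a low-order left singular vector to separate event frames from background. The cosine-similarity kernel, the sign-based partition rule and the clip path are all assumptions.

```python
# Simplified sketch: segment event frames of a spectrogram via a singular
# vector of a graph Laplacian built over time frames. Kernel and partition
# rule are assumptions, not the published algorithm.
import numpy as np
import librosa

y, sr = librosa.load("event.wav", sr=16000)               # placeholder clip
S = np.abs(librosa.stft(y, n_fft=512, hop_length=256))    # (freq, time)

# Cosine similarity between time frames -> symmetric affinity matrix W.
F = S / (np.linalg.norm(S, axis=0, keepdims=True) + 1e-12)
W = F.T @ F

# Symmetric Laplacian L = D - W; for a symmetric PSD matrix the SVD agrees
# with the eigendecomposition, with singular values sorted descending.
L = np.diag(W.sum(axis=1)) - W
U, s, _ = np.linalg.svd(L)

# The next-to-last column of U is the Fiedler-like vector (second-smallest
# singular value); its sign pattern bipartitions frames into two regions.
fiedler = U[:, -2]
event_frames = fiedler > 0
if S[:, event_frames].sum() < S[:, ~event_frames].sum():
    event_frames = ~event_frames   # keep the higher-energy side as the event

print("event frames:", int(event_frames.sum()), "of", S.shape[1])
```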
Item: Contribution of frequency compressed temporal fine structure cues to the speech recognition in noise: An implication in cochlear implant signal processing (Elsevier Ltd, 2022). Poluboina, V.; Pulikala, A.; Pitchai Muthu, A.N.
The study investigated the effect of proportionally frequency-compressed encoding of temporal fine structure (TFS) information on speech perception in noise, using vocoder simulations of cochlear implant signal processing. The study proposed a pitch synchronous overlap-add (PSOLA) algorithm for downward frequency shifting of the TFS. Speech recognition scores (SRS) were measured at −10 dB, 0 dB and +10 dB SNR for eight signal processing conditions: a sine-wave vocoder without TFS (NO-TFS); four unshifted TFS conditions, comprising full-band TFS and TFS up to 2000, 1000 and 600 Hz; and three PSOLA conditions that shifted 2000, 1000 and 600 Hz TFS down to 1000, 500 and 300 Hz, respectively. The original envelope was unchanged across conditions. SRS at +10 dB and −10 dB SNR reached ceiling and floor, respectively, in most conditions; hence, SRS at 0 dB SNR were compared across conditions. The results showed that SRS was highest with full-band TFS and lowest in the NO-TFS condition, and that SRS for 600 Hz TFS shifted to 300 Hz through PSOLA was higher than in the NO-TFS condition. The findings suggest that encoding TFS by proportional frequency compression yields better speech perception in noise than NO-TFS. An important observation of this study is that speech recognition was better than with the sine-wave vocoder for all TFS conditions, including the frequency-compressed 600 Hz TFS. © 2021 Elsevier Ltd
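The NO-TFS reference condition in the study above is a classic sine-wave (envelope) vocoder: the signal is split into bands, each band's envelope is extracted, and the envelopes re-modulate tones at the band centers, discarding the fine structure. A minimal sketch, assuming SciPy and arbitrary band edges; the study's actual filter bank and channel count are not given here.

```python
# Minimal sine-wave (envelope) vocoder sketch: keeps band envelopes,
# discards temporal fine structure (the NO-TFS condition).
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def sine_vocoder(x, fs, edges=(100, 392, 1005, 2294, 5000)):
    """Re-synthesize x from band envelopes on tone carriers.

    `edges` are illustrative channel boundaries, not the study's values.
    """
    t = np.arange(len(x)) / fs
    out = np.zeros_like(x, dtype=float)
    for lo, hi in zip(edges[:-1], edges[1:]):
        b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        band = filtfilt(b, a, x)
        env = np.abs(hilbert(band))            # Hilbert envelope of the band
        b_lp, a_lp = butter(4, 50 / (fs / 2))  # smooth envelope below ~50 Hz
        env = filtfilt(b_lp, a_lp, env)
        fc = np.sqrt(lo * hi)                  # tone at the geometric center
        out += env * np.sin(2 * np.pi * fc * t)
    return out / (np.max(np.abs(out)) + 1e-12)

fs = 16000
x = np.random.randn(fs)       # placeholder: one second of noise as "speech"
y = sine_vocoder(x, fs)
print(y.shape)
```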