Faculty Publications
Permanent URI for this community: https://idr.nitk.ac.in/handle/123456789/18736
Publications by NITK Faculty
Search Results (10 results)
Item A novel approach to video copy detection using audio fingerprints and PCA (Elsevier B.V., 2011) Roopalakshmi, R.; Guddeti, G.R.M.
In the Content-Based Copy Detection (CBCD) literature, most state-of-the-art techniques focus primarily on the visual content of video. Exploiting audio fingerprints for the CBCD problem is necessary for the following reasons: audio content constitutes an indispensable information source, and transformations on audio content are limited compared to those on visual content. In this paper, a novel CBCD approach using audio features and PCA is proposed, comprising two stages: first, multiple feature vectors are computed using MFCC and four spectral descriptors; second, the features are further processed with PCA to provide a compact feature description. Experiments on the TRECVID-2007 dataset demonstrate the efficiency of the proposed method against various transformations. © 2011 Published by Elsevier Ltd.

Item Multiclass SVM-based language-independent emotion recognition using selective speech features (Institute of Electrical and Electronics Engineers Inc., 2014) Kokane Amol, T.; Guddeti, G.R.M.
In this paper, we focus on recognizing six basic emotions, viz. Anger, Disgust, Fear, Happiness, Neutral, and Sadness, using selective features of the speech signal in different languages such as German and Telugu. The feature set includes thirteen Mel-Frequency Cepstral Coefficients (MFCC) and four other speech-signal features: Energy, Short-Term Energy, Spectral Roll-Off, and Zero-Crossing Rate (ZCR). The Surrey Audio-Visual Expressed Emotion (SAVEE) database is used to train the Multiclass Support Vector Machine (SVM) classifier, and the German corpus EMO-DB (Berlin Database of Emotional Speech) and the Telugu corpus IITKGP-SESC are used for emotion recognition. The results are analyzed for each speech emotion separately, with accuracies of 98.3071% and 95.8166% obtained for the EMO-DB and IITKGP-SESC databases, respectively.
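Two of the hand-crafted features used above, short-term energy and zero-crossing rate, are simple to compute frame by frame. The sketch below is purely illustrative; the frame length, hop size, and function name are our own choices, not the authors' code:

```python
import numpy as np

def short_term_features(signal, frame_len=400, hop=160):
    """Frame a 1-D signal and compute short-term energy and
    zero-crossing rate per frame (illustrative sketch only)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    energies, zcrs = [], []
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        energies.append(np.sum(frame ** 2) / frame_len)   # short-term energy
        signs = np.sign(frame)
        zcrs.append(np.mean(np.abs(np.diff(signs)) > 0))  # zero-crossing rate
    return np.array(energies), np.array(zcrs)

# toy example: a 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
e, z = short_term_features(tone)
```

For a pure tone the energy is roughly constant near 0.5 and the ZCR tracks twice the tone frequency, which is why these cheap features carry useful prosodic information.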
© 2014 IEEE.

Item Robust features for automatic estimation of physical parameters from speech (Institute of Electrical and Electronics Engineers Inc., 2017) Kalluri, K.S.; Vijayasenan, D.
Estimating a speaker's physical parameters, such as height, weight, and shoulder size, can assist voice forensics by providing additional knowledge about the speaker. In this work, statistics of the components of a background GMM are employed as features for estimating the physical parameters. These features improved the performance of height and shoulder-size estimation compared to our earlier attempt based on a bag-of-words representation. The robustness of the features is validated using two different training subsets containing different languages. © 2017 IEEE.

Item Robust Dialect Identification System using Spectro-Temporal Gabor Features (Institute of Electrical and Electronics Engineers Inc., 2018) Chittaragi, N.B.; Mothukuri, S.P.; Hegde, P.; Koolagudi, S.G.
Automatic identification of the dialects of a language is gaining popularity in the field of automatic speech recognition (ASR). The present work proposes an automatic dialect identification (ADI) system using 2D Gabor and spectral features. A comprehensive study of five dialects of Kannada, a Dravidian language, has been taken up. Gabor filters, which represent spectro-temporal modulations, attempt to emulate the signal-processing strategies of the human auditory system; hence, they perceive human voices well and in turn recognize dialectal variations effectively. In addition, spectral Mel-Frequency Cepstral Coefficient (MFCC) features are derived. A single-classifier support vector machine (SVM) and an ensemble-based extreme random forest (ERF) classification method are employed for recognition. The effectiveness of the Gabor features for the ADI system is demonstrated on the proposed Kannada dialect dataset along with the standard Intonation Variation in English (IViE) dataset of British English dialects.
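A 2D Gabor kernel of the kind used for spectro-temporal filtering is a sinusoidal carrier under a Gaussian envelope. The following sketch is illustrative only; the size, frequency, and orientation defaults are hypothetical and are not the filter bank used in the paper:

```python
import numpy as np

def gabor_kernel_2d(size=15, freq=0.25, theta=0.0, sigma=3.0):
    """Real part of a 2D Gabor filter: a cosine carrier at spatial
    frequency `freq`, rotated by `theta`, under a Gaussian envelope."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)          # rotated coordinate
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))  # Gaussian window
    carrier = np.cos(2 * np.pi * freq * xr)             # sinusoidal carrier
    return envelope * carrier

k = gabor_kernel_2d()
# a spectro-temporal patch of a spectrogram would be filtered by
# correlation, e.g. response = np.sum(patch * k) per patch
```

Varying `freq` and `theta` yields a bank of filters tuned to different spectral and temporal modulation rates, which is the intuition behind spectro-temporal Gabor features.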
The Gabor features show better performance than the MFCC features on both datasets, with recognition performance of 88.75% and 99.16% achieved on the Kannada and IViE dialect datasets, respectively. The proposed Gabor features also perform well under noisy conditions. © 2018 IEEE.

Item Kannada Dialect Classification using Artificial Neural Networks (Institute of Electrical and Electronics Engineers Inc., 2020) Mothukuri, S.K.P.; Hegde, P.; Chittaragi, N.B.; Koolagudi, S.G.
In this paper, an Automatic Dialect Classification (ADC) system is proposed for the dialects of Kannada (the Dravidian language spoken in southern Karnataka). The ADC system extracts spectral Mel-Frequency Cepstral Coefficients (MFCCs) and log filter-bank features along with linear predictive coefficients. In addition, prosodic pitch and energy features are extracted to capture dialect-specific cues. A Kannada dialect speech corpus consisting of five prominent dialects of the Kannada language is used for designing the ADC system. Artificial Neural Networks (ANNs), which together with their variants have recently gained popularity in speech-processing applications, are used to classify the Kannada dialects. Hyperparameter tuning of the ANN resulted in an increase in performance. © 2020 IEEE.

Item Kannada Dialect Classification Using CNN (Springer Science and Business Media Deutschland GmbH, 2020) Hegde, P.; Chittaragi, N.B.; Mothukuri, S.K.P.; Koolagudi, S.G.
Kannada is one of the prominent languages spoken in southern India. Since Kannada is a lingua franca spoken by more than 70 million people, it naturally has dialects. In this paper, we identify five major dialectal regions in Karnataka state and attempt to classify these five dialects from sentence-level utterances. Sentences are segmented from continuous speech automatically by using spectral centroid and short-term energy features.
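The automatic sentence segmentation mentioned above can be approximated by thresholding short-term energy and keeping contiguous high-energy runs. This is a simplified sketch with made-up frame and threshold parameters (the paper also uses the spectral centroid, omitted here):

```python
import numpy as np

def segment_by_energy(signal, sr, frame_ms=25, hop_ms=10, thresh_ratio=0.1):
    """Mark frames as speech when short-term energy exceeds a fraction of
    the maximum; contiguous speech runs approximate sentence units."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n = 1 + (len(signal) - frame) // hop
    energy = np.array([np.mean(signal[i*hop:i*hop+frame] ** 2) for i in range(n)])
    speech = energy > thresh_ratio * energy.max()
    segments, start = [], None
    for i, s in enumerate(speech):          # collect runs of speech frames
        if s and start is None:
            start = i
        elif not s and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, n))
    return segments

# toy signal: 1 s silence, 1 s noise burst, 1 s silence -> one segment
sr = 8000
rng = np.random.default_rng(0)
sig = np.concatenate([np.zeros(sr), 0.5 * rng.standard_normal(sr), np.zeros(sr)])
segs = segment_by_energy(sig, sr)
```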
Mel-Frequency Cepstral Coefficient (MFCC) features are extracted from these sentence units and used to train convolutional neural networks (CNNs). Along with MFCCs, shifted delta and double-delta coefficients are also used to train the CNN model. The proposed CNN-based dialect recognition system is also tested on the internationally known standard Intonation Variation in English (IViE) dataset, where the CNN model again yields better performance. It is observed that using one convolutional layer and three fully connected layers balances computational complexity and gives better accuracy on both the Kannada and English datasets. © 2020, Springer Nature Switzerland AG.

Item Monophone and Triphone Acoustic Phonetic Model for Kannada Speech Recognition System (Institute of Electrical and Electronics Engineers Inc., 2022) Kumar, T.N.M.; Jayan, A.; Bhat, S.; Anvith, M.; Narasimhadhan, A.V.
Automatic Speech Recognition (ASR) is the most widely used application in the speech domain: ASR systems generate text from spoken utterances without manual intervention. In this work, we build an ASR system for the Kannada language. For the proposed system, we extract Mel-Frequency Cepstral Coefficient (MFCC) features from the audio data, and the Kannada language model is developed using the corresponding labels. Dictionary generation and phonetic labeling are automated. Recognition performance is compared for the monophone and triphone models: a word error rate of 15.73% and a sentence error rate of 55.5% are achieved for the triphone model, which performs better than the monophone model.
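The word error rate reported above is conventionally computed as the Levenshtein (edit) distance between the reference and hypothesis word sequences, divided by the reference length; a minimal sketch:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with the classic Levenshtein dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i-1] == hyp[j-1] else 1
            dp[i][j] = min(dp[i-1][j] + 1,        # deletion
                           dp[i][j-1] + 1,        # insertion
                           dp[i-1][j-1] + cost)   # substitution or match
    return dp[-1][-1] / len(ref)

# one substitution in a four-word reference -> WER = 0.25
print(word_error_rate("the cat sat down", "the cat sat up"))  # 0.25
```

Sentence error rate is simpler still: the fraction of utterances whose hypothesis is not an exact match of the reference.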
© 2022 IEEE.

Item Speaker Identification and Verification using Deep Learning (Institute of Electrical and Electronics Engineers Inc., 2022) Recharla, R.; Jeevan Reddy, C.; Tanguturu, R.; Anand Kumar, A.M.
Many voice assistants, such as Cortana, Siri, and "Ok Google", have gained importance across the globe in recent times and have become part of everyday life. The main motive behind the proposed system is to improve speaker recognition for such assistant systems. The speaker prediction model is trained using MFCC, Chroma, Tonnetz, Mel-spectrogram, and spectral-contrast features extracted from audio samples. The proposed system has numerous real-world applications, such as meeting transcription, unlocking smart devices by voice, and online viva voce verification, and it can replace existing biometric systems for faculty attendance and traditional fingerprint recognition. A dense neural network was created for each audio feature, and the per-feature networks were merged with a concatenation layer, which gave the best performance compared to an LSTM. The dense neural network predicted the speaker with an accuracy of more than 95% in most cases. With the LSTM, due to fewer samples, speaker-prediction accuracy is around 79%; with a CNN it is around 86%, behavior that can be attributed to the noisy environment. When an unknown speaker speaks, the dense neural network handles the case by assigning the speaker to an anonymous class. © 2022 IEEE.

Item A framework for estimating geometric distortions in video copies based on visual-audio fingerprints (Springer-Verlag London Ltd, 2015) Roopalakshmi, R.; Guddeti, G.R.M.
Spatio-temporal alignment and estimation of the distortion model between pirated and master video contents are prerequisites for approximating the location of an illegal capture in a theater.
State-of-the-art techniques exploit only the visual features of videos for alignment and distortion-model estimation of watermarked sequences, while few efforts address acoustic features and non-watermarked video contents. To address this, we propose a distortion-model estimation framework based on multimodal signatures that integrates several components: compact representation of a video using visual-audio fingerprints derived from Speeded Up Robust Features and Mel-Frequency Cepstral Coefficients; a segmentation-based bipartite matching scheme to obtain accurate temporal alignments; stable frame-pair extraction followed by filtering policies to achieve geometric alignments; and distortion-model estimation in terms of a homography matrix. Experiments on camcorder-captured datasets demonstrate the promising results of the proposed framework compared to the reference methods. © 2013, Springer-Verlag London.

Item Detection of Heart Abnormality with Stethoscope Sounds (Springer, 2025) Jat, T.; Bhat, P.; Patil, N.
Cardiac rhythm assessment is a critical step in the early diagnosis of cardiac arrhythmia, a cardiovascular disease (CVD) that affects millions of individuals worldwide. Although electrocardiography is the definitive test for confirming cardiovascular diseases, the time and cost involved call for an alternative such as heart-abnormality detection from stethoscope sounds. A stethoscope test is relatively easily available in rural parts of the country and can aid in the early diagnosis of heart abnormalities. The central aim of this work is to build a deep learning model for classifying heartbeat sounds captured with the iStethoscope Pro iPhone app or a digital stethoscope. The proposed methodology is implemented using the Heart Sounds Classification Challenge dataset from PASCAL and the 2016 PhysioNet Challenge dataset.
To extract features from the recorded heart sounds, we employ Mel spectrograms, Mel-Frequency Cepstral Coefficients (MFCC), and Chroma short-time Fourier transform (STFT) features. A key novelty of our approach lies in the use of stacked Bidirectional and Unidirectional Long Short-Term Memory (SBU-LSTM) and deep BiLSTM architectures, which, combined with the three feature types, enhance model performance on the classification task. Additionally, we introduce the use of Mel spectrograms and Chroma STFT with a 2D Convolutional Neural Network (CNN) architecture, which, as far as we know, has not been investigated in prior research. Experimental results show best accuracies of 72% on PASCAL's Dataset-A, 66% on Dataset-B, 83% on Dataset A + B, and 89% on PhysioNet's Dataset-C. © The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd. 2025.
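Mel-scale features (MFCCs, Mel spectrograms) recur throughout the works listed above. They all begin from a triangular Mel filter bank applied to an FFT power spectrum; the textbook construction can be sketched as follows (the parameter defaults here are arbitrary, and production systems typically rely on libraries such as librosa rather than hand-rolled code):

```python
import numpy as np

def mel_filterbank(n_filters=26, n_fft=512, sr=16000):
    """Triangular Mel filter bank: n_filters rows, one per filter,
    each a triangle over FFT bins spaced evenly on the Mel scale."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # n_filters + 2 equally spaced points on the Mel scale -> FFT bins
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):                     # rising slope
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                    # falling slope
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

fb = mel_filterbank()
```

Multiplying a frame's power spectrum by this matrix gives the Mel spectrum; taking logs yields log filter-bank features, and a discrete cosine transform of those yields MFCCs.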
