Conference Papers

Permanent URI for this collection: https://idr.nitk.ac.in/handle/123456789/28506

Search Results

Now showing 1 - 3 of 3
  • Item
    Age approximation from speech using Gaussian mixture models
    (IEEE Computer Society, 2013) Mittal, T.; Barthwal, A.; Koolagudi, S.G.
    In this work, spectral features are extracted from speech to classify speakers by age. Mel-frequency cepstral coefficients (MFCCs) are explored as features, and Gaussian mixture models (GMMs) are proposed as classifiers. The age groups considered in this study are 1-10, 11-20, 21-30, 31-40 and 41-50. The age-group database used in this work was recorded in Hindi from speakers of different ages and dialects, with each speaker reading five Hindi text prompts. The prompts are constructed from textually neutral Hindi words spoken in neutral emotion and are used to characterize the age group for both male and female speakers. Average age recognition performance on the multiple-speaker database is observed to be around 92.0%. © 2013 IEEE.
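    A minimal sketch of the classification scheme the abstract describes: one GMM is trained per age group on that group's MFCC vectors, and an utterance is assigned to the group whose GMM scores its frames highest. The features here are synthetic random vectors standing in for real MFCCs, and the group means, component counts and dimensions are illustrative assumptions, not values from the paper.

    ```python
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)

    # Synthetic 13-dim "MFCC" vectors for two age groups, drawn from
    # well-separated Gaussians so the classes are distinguishable.
    groups = {
        "1-10":  rng.normal(loc=0.0, scale=1.0, size=(500, 13)),
        "11-20": rng.normal(loc=3.0, scale=1.0, size=(500, 13)),
    }

    # Train one GMM per age group on that group's feature vectors.
    models = {g: GaussianMixture(n_components=4, random_state=0).fit(X)
              for g, X in groups.items()}

    def classify(frames):
        """Assign an utterance (a matrix of frame-level feature vectors)
        to the group whose GMM gives the highest average log-likelihood."""
        scores = {g: m.score(frames) for g, m in models.items()}
        return max(scores, key=scores.get)

    # Held-out "utterances" drawn from each group's distribution.
    print(classify(rng.normal(0.0, 1.0, size=(50, 13))))
    print(classify(rng.normal(3.0, 1.0, size=(50, 13))))
    ```

    The same structure extends to all five age groups by adding one trained GMM per group to the dictionary.
    
    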
  • Item
    Contribution of Telugu vowels in identifying emotions
    (Institute of Electrical and Electronics Engineers Inc., 2015) Koolagudi, S.G.; Shivakranthi, B.; Rao, K.S.; Ramteke, P.B.
    This work aims to identify the contribution of different Telugu vowels to emotion expression. Instead of processing the entire speech signal, we propose to focus only on the vowel segments of the utterance (/a/, /i/, /u/, /e/ and /o/); analysing the vowels alone is sufficient to discriminate the emotions. Spectral and prosodic features are used to study the effect of emotions on different vowels. Although prosodic features are the best discriminators of emotions at the utterance level, spectral features are more useful at the phoneme level. The same vowel exhibits different spectral behaviour when expressed in different emotions, and shimmer and jitter play a crucial role in classifying emotions from vowels. A semi-natural database collected from Telugu movies is used in this work. Gaussian mixture models (GMMs) are used as the classification models. The emotions considered are anger, fear, happiness, sadness and neutral. Average emotion recognition performance obtained by combining MFCCs, formants, intensity, shimmer and jitter is around 78%. © 2015 IEEE.
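    Shimmer and jitter, which the abstract highlights as crucial discriminators, are standard perturbation measures: jitter is the cycle-to-cycle variation of the pitch period, and shimmer the analogous variation of the per-cycle peak amplitude. A minimal sketch of the local (relative) variants, with made-up period and amplitude sequences for illustration:

    ```python
    import numpy as np

    def jitter(periods):
        """Local jitter: mean absolute difference between consecutive
        pitch periods, relative to the mean period."""
        periods = np.asarray(periods, dtype=float)
        return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

    def shimmer(amplitudes):
        """Local shimmer: the same ratio, computed on per-cycle
        peak amplitudes instead of periods."""
        amps = np.asarray(amplitudes, dtype=float)
        return np.mean(np.abs(np.diff(amps))) / np.mean(amps)

    # Perfectly regular phonation gives zero jitter and shimmer.
    print(jitter([100.0, 100.0, 100.0]))    # 0.0
    print(shimmer([1.0, 1.0, 1.0]))         # 0.0

    # Slightly irregular cycles, as one might see in an emotional vowel.
    print(round(jitter([100.0, 104.0, 98.0, 103.0]), 4))
    ```

    In an emotion-recognition pipeline these scalars would be appended to the MFCC, formant and intensity features for each vowel segment before GMM classification.
    
    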
  • Item
    Audio Replay Attack Detection for Speaker Verification System Using Convolutional Neural Networks
    (Springer, 2019) Kemanth, P.J.; Supanekar, S.; Koolagudi, S.G.
    An audio replay attack is one of the most popular spoofing attacks on speaker verification systems because it is inexpensive and requires little knowledge of signal processing. In this paper, we investigate the significance of non-voiced audio segments, together with deep learning models such as convolutional neural networks (CNNs), for audio replay attack detection. The non-voiced segments of the audio can be used to detect reverberation and channel noise. FFT spectrograms are generated and fed to a CNN to classify the audio as genuine or replayed. The advantage of the proposed approach is that removing the voiced speech reduces the feature vector size without discarding the necessary information, which significantly shortens the training time of the networks. The ASVspoof 2017 dataset is used to train and evaluate the model, with the equal error rate (EER) as the performance metric. The proposed system achieves an EER of 5.62% on the development set and 12.47% on the evaluation set. © 2019, Springer Nature Switzerland AG.
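    A rough sketch of the spectrogram-to-decision path the abstract outlines: a convolution over a (frequency x time) spectrogram, a ReLU, global average pooling, and a sigmoid head producing a genuine-vs-replay probability. This is a single hand-rolled forward pass in NumPy with random weights, purely to illustrate the data flow; the paper's actual CNN architecture, filter sizes and training procedure are not specified here and everything below is an assumption.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    def conv2d(x, k):
        """Valid 2-D convolution (cross-correlation, as in most deep
        learning libraries) of a single-channel image with one kernel."""
        H, W = x.shape
        kh, kw = k.shape
        out = np.empty((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
        return out

    def forward(spectrogram, kernel, w, b):
        """Conv -> ReLU -> global average pool -> sigmoid: returns a
        probability-like score for one FFT spectrogram."""
        h = np.maximum(conv2d(spectrogram, kernel), 0.0)   # conv + ReLU
        pooled = h.mean()                                  # global average pool
        return 1.0 / (1.0 + np.exp(-(w * pooled + b)))     # sigmoid head

    # Stand-in spectrogram of non-voiced frames: 64 freq bins x 40 frames.
    spec = rng.random((64, 40))
    kernel = rng.normal(size=(3, 3))
    score = forward(spec, kernel, w=0.5, b=0.0)
    print(0.0 < score < 1.0)   # a valid probability
    ```

    In practice the conv/pool/sigmoid stack would be built in a deep learning framework and trained on labelled genuine/replay spectrograms; the point here is only the shape of the computation.
    
    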