Conference Papers

Permanent URI for this collection: https://idr.nitk.ac.in/handle/123456789/28506

Now showing 1 - 7 of 7
  • Item
    Gender Identification from Children's Speech
    (Institute of Electrical and Electronics Engineers Inc., 2018) Ramteke, P.B.; Dixit, A.A.; Supanekar, S.; Dharwadkar, N.V.; Koolagudi, S.G.
    Children's speech is characterized by higher pitch and formant frequencies than adult speech. Gender identification from children's speech is difficult because there is no significant difference in the acoustic properties of male and female children. Here, an attempt has been made to explore features efficient in discriminating gender from children's speech. Different combinations of spectral features such as Mel-frequency cepstral coefficients (MFCCs), ΔMFCCs and ΔΔMFCCs, formants, and linear predictive cepstral coefficients (LPCCs); shimmer and jitter; and prosodic features such as pitch and its statistical variations, along with Δpitch-related features, are explored. Features are evaluated using nonlinear classifiers, namely Artificial Neural Networks (ANNs), Deep Neural Networks (DNNs) and Random Forests (RF). From the results, it is observed that RF achieves the highest accuracy of 84.79% among the classifiers. © 2018 IEEE.
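Several papers in this collection build on MFCC features. A minimal numpy sketch of the standard MFCC pipeline (frame, window, power spectrum, mel filterbank, log, DCT-II) is given below; the frame length, hop, filterbank size and mel formula are common textbook choices, not the authors' exact settings.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Small MFCC sketch: frame -> window -> |FFT|^2 -> mel filterbank -> log -> DCT-II."""
    # Frame the signal with a Hamming window
    frames = np.array([signal[s:s + n_fft] * np.hamming(n_fft)
                       for s in range(0, len(signal) - n_fft + 1, hop)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2         # (n_frames, n_fft//2+1)

    # Triangular mel filterbank: centers equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fbank[i - 1, k] = (k - l) / max(c - l, 1)        # rising slope
        for k in range(c, r):
            fbank[i - 1, k] = (r - k) / max(r - c, 1)        # falling slope

    logmel = np.log(power @ fbank.T + 1e-10)                 # (n_frames, n_mels)

    # DCT-II to decorrelate; keep the first n_ceps coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return logmel @ dct.T                                    # (n_frames, n_ceps)
```

ΔMFCCs and ΔΔMFCCs, used throughout these papers, are then just first and second finite differences of these coefficients across frames.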
  • Item
    Identification of Phonological Process: Final Consonant Deletion from Childrens' Speech
    (Institute of Electrical and Electronics Engineers Inc., 2018) Ramteke, P.B.; Supanekar, S.; Koolagudi, S.G.
    Children in the age range of 2 1/2 to 6 1/2 years face difficulties in pronunciation due to an underdeveloped vocal tract and neuromotor control. They tend to substitute a simpler class of sounds for sounds that are difficult for them to pronounce. These pronunciation error patterns are called phonological processes. Phonological processes disappear as the child advances in age, and their analysis gives a measure of children's language-learning ability over time. Persistence of these processes beyond the specified age (8 years) indicates a phonological disorder. In this paper, final consonant deletion, one of the phonological processes in the Kannada language, is considered for analysis. In final consonant deletion, the consonant, part of a syllable, a syllable, or part of a word appearing at the end of the word is deleted. Since part of the word is deleted, features efficient in speech recognition, namely MFCCs and LPCCs, are explored for the analysis. The dynamic time warping (DTW) algorithm is used to compare the correct and mispronounced words to identify the region of final consonant deletion. The DTW comparison path is observed to warp around the end of the mispronounced word where part of the word is deleted. A combination of 13 MFCCs and 13 LPCCs is observed to achieve the highest accuracy of 72.68% within a tolerance range of ±50 ms. The results show that features efficient in speech recognition are also efficient in identifying final consonant deletion. © 2018 IEEE.
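The DTW comparison used here and in the nasalization paper below can be sketched in a few lines of numpy. This is generic textbook DTW over per-frame feature vectors, not the authors' implementation; the test sequences are hypothetical.

```python
import numpy as np

def dtw(a, b):
    """Dynamic time warping between feature sequences a (n, d) and b (m, d).
    Returns the cumulative alignment cost and the warping path as (i, j) pairs."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])       # local frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from (n, m) to recover the alignment path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[n, m], path[::-1]
```

When the mispronounced word lacks its final consonant, the tail of the path piles up against the last frames of the shorter sequence (a long vertical or horizontal run instead of diagonal steps), which is the deviation the paper analyzes to localize the deletion.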
  • Item
    Audio Replay Attack Detection for Speaker Verification System Using Convolutional Neural Networks
    (Springer, 2019) Kemanth, P.J.; Supanekar, S.; Koolagudi, G.K.
    An audio replay attack is one of the most popular spoofing attacks on speaker verification systems because it is very economical and does not require much knowledge of signal processing. In this paper, we investigate the significance of non-voiced audio segments and deep learning models such as Convolutional Neural Networks (CNNs) for audio replay attack detection. The non-voiced segments of the audio can be used to detect reverberation and channel noise. FFT spectrograms are generated and given as input to the CNN to classify the audio as genuine or replayed. The advantage of the proposed approach is that, because the voiced speech is removed, the feature vector size is reduced without compromising the necessary features. This significantly reduces the training time of the networks. The ASVspoof 2017 dataset is used to train and evaluate the model. The Equal Error Rate (EER) is computed and used as the metric to evaluate model performance. The proposed system achieves an EER of 5.62% on the development dataset and 12.47% on the evaluation dataset. © 2019, Springer Nature Switzerland AG.
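A crude stand-in for the non-voiced segment selection step is short-time energy thresholding: voiced frames carry most of the signal energy, so low-energy frames approximate the non-voiced portions. The frame sizes and threshold ratio below are illustrative assumptions; the abstract does not specify the paper's voiced/non-voiced detector.

```python
import numpy as np

def nonvoiced_frames(signal, frame_len=400, hop=160, energy_ratio=0.1):
    """Keep only low-energy (approximately non-voiced) frames: frames whose
    short-time energy falls below energy_ratio * the maximum frame energy."""
    frames = np.array([signal[s:s + frame_len]
                       for s in range(0, len(signal) - frame_len + 1, hop)])
    energy = np.sum(frames ** 2, axis=1)
    return frames[energy < energy_ratio * energy.max()]
```

The FFT spectrogram of the kept frames (e.g. `np.abs(np.fft.rfft(kept * np.hanning(400), axis=1))`) would then be the reduced-size input handed to the CNN, which is where the training-time saving comes from.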
  • Item
    NITK Kids' Speech Corpus
    (International Speech Communication Association, 2019) Ramteke, P.B.; Supanekar, S.; Hegde, P.; Nelson, H.; Aithal, V.; Koolagudi, S.G.
    This paper introduces a speech database for analyzing children's speech. The proposed database is recorded in the Kannada language (one of the South Indian languages) from children between 2 1/2 and 6 1/2 years of age. The database is named the National Institute of Technology Karnataka Kids' Speech Corpus (NITK Kids' Speech Corpus). The relevant design considerations for the database collection are discussed in detail. It is divided into four age groups with an interval of 1 year between each group. The speech corpus includes nearly 10 hours of speech recordings from 160 children. For each age group, the data is recorded from 40 children (20 male and 20 female). Further, the effect of developmental changes on speech from 2 1/2 to 6 1/2 years is analyzed using pitch and formant analysis. Some potential applications of the NITK Kids' Speech Corpus, such as systematic study of the language-learning ability of children, phonological process analysis and children's speech recognition, are discussed. © 2019 ISCA
  • Item
    Gender Identification using Spectral Features and Glottal Closure Instants (GCIs)
    (Institute of Electrical and Electronics Engineers Inc., 2019) Ramteke, P.B.; Supanekar, S.; Koolagudi, S.G.
    Automatic identification of gender from speech may help to improve the performance of systems such as speaker and speech recognition, forensic analysis and authentication. The difference in the physiological parameters of male and female vocal folds results in significant changes in their vocal fold vibration patterns. These changes can be characterized by the differences in the duration of their glottal closure. In this paper, an attempt has been made at gender recognition from speech using spectral features such as MFCCs and LPCCs; pitch (F0); and excitation source features such as glottal closure instants (GCIs) and their statistical variations. Western Michigan University's Gender dataset is used for experimentation. The dataset is collected from 93 speakers, consisting of speech from 45 male and 48 female speakers. Random forests (RFs) and support vector machines (SVMs) are used to measure the performance of the proposed features. The random forest is observed to achieve an average frame-level accuracy of 96.908% using 13 MFCCs, 13 LPCCs, pitch (F0) and GCI stats (5). The SVM is observed to achieve an average accuracy of 98.607% using 13 MFCCs, 13 LPCCs and GCI stats (5). From the results, it is observed that the proposed features are efficient in discriminating gender from speech. © 2019 IEEE.
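One common way to obtain the pitch (F0) track that feeds such summary statistics is autocorrelation peak picking, sketched below. The choice of five summary statistics is an assumption for illustration only; the abstract does not list which five GCI statistics the paper uses, and a real GCI detector (e.g. zero-frequency filtering) is more involved than this F0 proxy.

```python
import numpy as np

def frame_f0(frame, sr, fmin=75, fmax=400):
    """Estimate F0 of one voiced frame by picking the autocorrelation peak
    inside the plausible pitch-lag range [sr/fmax, sr/fmin]."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

def pitch_stats(signal, sr, frame_len=640, hop=320):
    """Per-frame F0 track reduced to five summary statistics (an assumed set)."""
    f0 = np.array([frame_f0(signal[s:s + frame_len], sr)
                   for s in range(0, len(signal) - frame_len + 1, hop)])
    return np.array([f0.mean(), f0.std(), f0.min(), f0.max(), np.median(f0)])
```

Inter-GCI interval durations could be summarized with the same five statistics, which is one plausible reading of the "GCI Stats (5)" feature vector.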
  • Item
    Identification of Nasalization and Nasal Assimilation from Children’s Speech
    (Springer Science and Business Media Deutschland GmbH, 2020) Ramteke, P.B.; Supanekar, S.; Aithal, V.; Koolagudi, S.G.
    In children, nasalization is a commonly observed phonological process in which non-nasal sounds are substituted with nasal sounds. Here, an attempt has been made at the identification of nasalization and nasal assimilation. The properties of nasal sounds and nasalized voiced sounds are explored using MFCCs extracted from the Hilbert envelope of the numerator of group delay (HNGD) spectrum. The HNGD spectrum highlights the formants in speech and an extra nasal formant in the vicinity of the first formant in nasalized voiced sounds. Features extracted from correctly pronounced and mispronounced words are compared using the Dynamic Time Warping (DTW) algorithm. The nature of the deviation of the DTW comparison path from its diagonal behavior is analyzed for the identification of mispronunciation. The combination of FFT-based MFCCs and HNGD-spectrum-based MFCCs is observed to achieve the highest accuracy of 82.22% within the tolerance range of ±50 ms. © 2020, Springer Nature Switzerland AG.
  • Item
    Identification of Palatal Fricative Fronting Using Shannon Entropy of Spectrogram
    (Springer Science and Business Media Deutschland GmbH, 2020) Ramteke, P.B.; Supanekar, S.; Aithal, V.; Koolagudi, S.G.
    In this paper, an attempt has been made to identify palatal fricative fronting in children's speech, where the postalveolar /sh/ is mispronounced as the dental /s/. In children's speech, the concentration of energy (the darkest part) of the spectrogram for /s/ ranges from 4000 Hz to 8000 Hz, whereas it ranges from 3000 Hz to 8000 Hz for /sh/. The Gammatonegram follows the frequency subbands of the ear (wider for higher frequencies). Various spectral properties such as spectral centroid, spectral crest factor, spectral decrease, spectral flatness, spectral flux, spectral kurtosis, spectral spread, spectral skewness, spectral slope and Shannon entropy of the spectrogram (in intervals of 2000 Hz), extracted from the Gammatonegram, are proposed for the characterization of /sh/ and /s/. A dataset recorded from 60 native Kannada-speaking children aged between 3 1/2 and 6 1/2 years, drawn from the NITK Kids' Speech Corpus, is considered for the analysis. A support vector machine (SVM) is used for classification. Various combinations of the proposed features are considered for the evaluation, along with the MFCCs (39) and LPCCs (39). The combination of MFCCs (39), LPCCs (39) and Entropy (4) is observed to achieve the highest mispronunciation identification performance of 83.2983%. © 2020, Springer Nature Switzerland AG.
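The band-wise Shannon entropy feature can be approximated as below. An FFT spectrogram is used here as a stand-in for the paper's Gammatonegram, and a 16 kHz sampling rate is an assumption; with it, 2000 Hz intervals yield four values per utterance, matching the Entropy (4) feature count. A narrow spectral peak (as in /s/'s high-band energy concentration) gives low band entropy, while diffuse energy gives high entropy.

```python
import numpy as np

def band_entropy(signal, sr, n_fft=512, hop=160, band_hz=2000):
    """Shannon entropy of the magnitude spectrum inside consecutive band_hz-wide
    bands, averaged over frames: one entropy value per band."""
    frames = np.array([signal[s:s + n_fft] * np.hanning(n_fft)
                       for s in range(0, len(signal) - n_fft + 1, hop)])
    spec = np.abs(np.fft.rfft(frames, axis=1))               # (n_frames, n_fft//2+1)
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    entropies = []
    for lo in range(0, int(sr / 2), band_hz):
        band = spec[:, (freqs >= lo) & (freqs < lo + band_hz)]
        p = band / (band.sum(axis=1, keepdims=True) + 1e-12)  # per-frame distribution
        h = -np.sum(p * np.log2(p + 1e-12), axis=1)           # Shannon entropy per frame
        entropies.append(h.mean())
    return np.array(entropies)
```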