Conference Papers
Permanent URI for this collection: https://idr.nitk.ac.in/handle/123456789/28506
Search Results
7 results
Item: A robust speech rate estimation based on the activation profile from the selected acoustic unit dictionary (Institute of Electrical and Electronics Engineers Inc., 2016). Nagesh, S.; Yarra, C.; Deshmukh, O.D.; Ghosh, P.K.
A typical solution for speech rate estimation consists of two stages: first, a short-time feature contour is computed such that most of its peaks correspond to the syllable nuclei; then, the peaks of the contour corresponding to the syllable nuclei are detected. Temporal correlation selected subband correlation (TCSSBC) is often used as a feature contour for speech rate estimation, in which correlations within and across a few selected sub-band energies are computed. In this work, instead of a fixed set of sub-bands, we learn them in a data-driven manner using a dictionary learning approach. Similarly, instead of the energy contours, we use the activation profile from the learned dictionary elements. We find that the peaks detected with the data-driven approach significantly improve the speech rate estimation when combined with the traditional TCSSBC approach using a proposed peak-merging strategy. Experiments are performed separately on the Switchboard, TIMIT and CTIMIT corpora. Except for Switchboard, the correlation coefficient for the speech rate estimation using the proposed approach is found to be higher than that of the TCSSBC technique, with 3.1% and 5.2% relative improvements for TIMIT and CTIMIT, respectively. © 2016 IEEE.

Item: A high resolution ENF based multi-stage classifier for location forensics of media recordings (Institute of Electrical and Electronics Engineers Inc., 2017). Suresha, P.B.; Nagesh, S.; Roshan, P.S.; Gaonkar, P.A.; Nisha Meenakshi, G.N.; Ghosh, P.K.
Media recordings, when captured close to active power system components, are known to be influenced by the electromagnetic interference caused by those power grid components.
This electromagnetic interference manifests itself in such media recordings in the form of time-varying frequency components around the electric network frequency (ENF) of the power grid; for example, the ENF of the Indian power grid has a nominal value of 50 Hz. Classifying a given media signal into the grid or region of recording using the ENF is vital in location forensics. In this work, we use power recordings and audio recordings captured from 12 different grids around the globe. To use the variations in the ENF from the media signals for region-of-recording classification, we propose a high resolution ENF extraction technique. We also propose a multi-stage support vector machine (SVM) based classification system. We find that the proposed system outperforms the existing baseline scheme for region-of-recording classification, yielding a 17.33% improvement in overall accuracy. © 2017 IEEE.

Item: SPIRE-SST: An automatic web-based self-learning tool for syllable stress tutoring (SST) to the second language learners (International Speech Communication Association, 2018). Yarra, C.; Anand, P.A.; Kausthubha, N.K.; Ghosh, P.K.
Correct stress placement on the syllables in a word or word group is important in spoken communication. Thus, incorrect syllable stress, typically produced by second language (L2) learners, can result in miscommunication. In this demo, we present the SPIRE-SST tool, which teaches correct stress patterns in a self-learning manner. The proposed tool can therefore also benefit learners without access to effective training methods. For this, we design a front-end containing self-explanatory instructions that can be easily followed by the user. Using the front-end, learners can submit their audio to the back-end and view the corresponding feedback.
In the back-end, we divide the learner's audio into syllable segments and detect each syllable as stressed or unstressed. Using these stress markings, we compute a score representing the stress quality in comparison with the ground-truth stress markings and send it to the front-end as feedback. We also send, as additional feedback, a set of three features computed by comparing the expert's and the learner's audio, which we assume to be useful for correcting pronunciation errors. © 2018 International Speech Communication Association. All rights reserved.

Item: Intonation tutor by SPIRE (In-SPIRE): An online tool for an automatic feedback to the second language learners in learning intonation (International Speech Communication Association, 2018). Anand, P.A.; Yarra, C.; Kausthubha, N.K.; Ghosh, P.K.
In spoken communication, intonation often conveys the meaning of an utterance. Thus, incorrect intonation, typically produced by second language (L2) learners, can result in miscommunication. We demonstrate the In-SPIRE tool, which helps L2 learners learn intonation in a self-learning manner. For this, we design an interactive, self-explanatory front-end, which is also used to send the learner's audio and hand-shake signals to the back-end. At the back-end, we implement a system that takes the learner's audio for a specific stimulus and computes pitch patterns representing the intonation. For this, we apply pitch stylization on each syllable segment in the audio. Further, we compute a quality score using the learner's patterns and the respective ground-truth patterns. Finally, the score, the learner's patterns and the ground-truth patterns are sent to the front-end for display as feedback to the learner. Thus, the learner can correct any mismatch in his/her intonation with respect to the ground truth. The proposed tool benefits learners who do not have access to effective spoken language training.
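As a rough illustration of the scoring step shared by the two tutoring tools above, a per-syllable comparison against ground-truth markings might look like the following sketch (the function name and the simple 0–1 agreement score are illustrative assumptions, not the papers' actual metric):

```python
def stress_score(predicted, ground_truth):
    """Fraction of syllables whose stressed/unstressed label matches
    the ground-truth marking (illustrative metric, not the papers' own)."""
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("label sequences must be non-empty and equal length")
    matches = sum(p == g for p, g in zip(predicted, ground_truth))
    return matches / len(predicted)

# Learner's detected stress pattern vs. expert ground truth
# (1 = stressed, 0 = unstressed), e.g. a hypothetical 4-syllable word
learner = [1, 0, 0, 0]
expert = [0, 1, 0, 0]
print(stress_score(learner, expert))  # 0.5
```

A real back-end would derive the labels from syllable segmentation and acoustic features, but the feedback score returned to the front-end could be as simple as this agreement ratio.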
© 2018 International Speech Communication Association. All rights reserved.

Item: Speech enhancement using multiple deep neural networks (Institute of Electrical and Electronics Engineers Inc., 2018). Karjol, P.; Kumar, M.A.; Ghosh, P.K.
In this work, we present a variant of a multiple deep neural network (DNN) based speech enhancement method. We directly estimate the clean speech spectrum as a weighted average of the outputs from multiple DNNs, where the weights are provided by a gating network. The multiple DNNs and the gating network are trained jointly, with the objective function set as the mean square logarithmic error between the target clean spectrum and the estimated spectrum. We conduct experiments using two and four DNNs on the TIMIT corpus with nine noise types (four seen and five unseen noises) taken from the AURORA database at four different signal-to-noise ratios (SNRs). We also compare the proposed method with a single DNN based speech enhancement scheme and existing multiple DNN schemes using segmental SNR, perceptual evaluation of speech quality (PESQ) and short-term objective intelligibility (STOI) as the evaluation metrics. These comparisons show the superiority of the proposed method over the baseline schemes in both seen and unseen noises. Specifically, we observe absolute improvements of 0.07 and 0.04 in the PESQ measure over the single DNN, averaged over all noises and SNRs, for the seen and unseen noise cases respectively. © 2018 IEEE.

Item: An Improved Air Tissue Boundary Segmentation Technique for Real Time Magnetic Resonance Imaging Video Using SegNet (Institute of Electrical and Electronics Engineers Inc., 2019). Valliappan, C.A.; Kumar, A.; Mannem, R.; Karthik, G.R.; Ghosh, P.K.
This paper presents an improved methodology for the segmentation of the air-tissue boundaries (ATBs) in the upper airway of the human vocal tract using real-time magnetic resonance imaging (rtMRI) videos.
In the proposed approach, semantic segmentation is performed using a deep learning architecture called SegNet. The network processes an input image to produce a binary output image of the same dimensions, classifying each pixel as air cavity or tissue, following which the contours are predicted. A multi-dimensional least squares smoothing technique is applied to smooth the contours. To quantify the precision of the predicted contours, the dynamic time warping (DTW) distance is calculated between the predicted contours and the manually annotated ground-truth contours. Four-fold experiments are conducted with four subjects from the USC-TIMIT corpus, which demonstrate that the proposed approach achieves lower DTW distances of 1.02 and 1.09 for the upper and lower ATBs compared to the best baseline scheme. The proposed SegNet based approach has an average pixel classification accuracy of 99.3% across all subjects with only 2 rtMRI videos (~180 frames) per subject for training. © 2019 IEEE.

Item: SPIRE-SIES: A Spontaneous Indian English Speech Corpus (Institute of Electrical and Electronics Engineers Inc., 2023). Singh, A.; Shah, C.; Varadaraj, R.; Chauhan, S.; Ghosh, P.K.
In this paper, we present a 170.83-hour Indian English spontaneous speech dataset. The lack of Indian English speech data is one of the major hindrances in developing robust speech systems adapted to the Indian speech style; moreover, this scarcity is even more acute for spontaneous speech. The corpus is crowd-sourced over varied Indian nativities, genders and age groups. Traditional spontaneous speech collection strategies involve capturing speech during interviews or conversations; in this study, we use images as stimuli to induce spontaneity in speech. Transcripts for 23 hours are generated and validated, which can serve as a spontaneous speech ASR benchmark. The quality of the corpus is validated with voice activity detection based segmentation, gender verification and image semantic correlation.
The latter determines the relationship between the image stimulus and the recorded speech using caption keywords derived from an image-to-text model and frequently occurring words derived from Whisper ASR's generated transcripts. © 2023 IEEE.
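The image semantic correlation check described in the last item could be sketched as a simple keyword-overlap measure. The function below is an illustrative assumption, not the paper's actual procedure: it stands in for the comparison between caption keywords from an image-to-text model and frequent words from ASR transcripts.

```python
from collections import Counter

def semantic_overlap(caption_keywords, transcript, top_n=10):
    """Jaccard overlap between image-caption keywords and the most
    frequent words of an ASR transcript (illustrative check only)."""
    words = transcript.lower().split()
    frequent = {w for w, _ in Counter(words).most_common(top_n)}
    caption = {w.lower() for w in caption_keywords}
    if not caption or not frequent:
        return 0.0
    return len(caption & frequent) / len(caption | frequent)

# Hypothetical caption keywords for an image stimulus vs. what the
# speaker actually said about it
caption = ["market", "vegetables", "vendor"]
speech = "the market has many vendors selling vegetables the market is busy"
print(semantic_overlap(caption, speech))  # 0.2
```

A higher overlap suggests the recorded speech is indeed about the image stimulus; a real validation pipeline would add stop-word removal and stemming so that, for instance, "vendor" and "vendors" match.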
