Browsing by Author "Ghosh, P.K."

  • A high resolution ENF based multi-stage classifier for location forensics of media recordings
    (Institute of Electrical and Electronics Engineers Inc., 2017) Suresha, P.B.; Nagesh, S.; Roshan, P.S.; Gaonkar, P.A.; Nisha Meenakshi, G.N.; Ghosh, P.K.
     Media recordings captured close to active power system components are known to be influenced by the electromagnetic interference caused by those power grid components. This interference manifests itself in such recordings as time-varying frequency components around the electric network frequency (ENF) of the power grid; for example, the ENF of the Indian power grid has a nominal value of 50 Hz. Classifying a given media signal into the grid or region of recording using the ENF is vital in location forensics. In this work, we use power and audio recordings captured from 12 different grids around the globe. To use the variations in the ENF from the media signals for region-of-recording classification, we propose a high resolution ENF extraction technique. We also propose a multi-stage support vector machine (SVM) based classification system. We find that the proposed system outperforms the existing baseline scheme for region-of-recording classification, improving the overall accuracy by 17.33%. © 2017 IEEE.
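     The abstract does not give the extraction details; the sketch below is a minimal illustration of one common way to obtain a fine-grained ENF trajectory (long-window STFT plus parabolic interpolation of the spectral peak around the nominal 50 Hz). The function enf_track and its parameters are hypothetical, not the authors' implementation.

        import numpy as np
        from scipy.signal import stft

        def enf_track(x, fs, nominal=50.0, band=1.0):
            """Per-frame ENF estimate near `nominal` Hz via long-window STFT
            and parabolic peak interpolation (illustrative sketch only)."""
            nperseg = 8 * fs                      # long window -> fine frequency grid
            f, t, Z = stft(x, fs=fs, nperseg=nperseg, noverlap=nperseg // 2)
            mag = np.abs(Z)
            sel = np.where((f >= nominal - band) & (f <= nominal + band))[0]
            df = f[1] - f[0]
            enf = []
            for frame in mag[sel, :].T:
                k = int(np.argmax(frame))
                if 0 < k < len(frame) - 1:        # refine the peak between bins
                    a, b, c = frame[k - 1], frame[k], frame[k + 1]
                    k = k + 0.5 * (a - c) / (a - 2 * b + c)
                enf.append(f[sel[0]] + k * df)
            return t, np.array(enf)               # frame times, ENF in Hz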
  • A robust speech rate estimation based on the activation profile from the selected acoustic unit dictionary
    (Institute of Electrical and Electronics Engineers Inc., 2016) Nagesh, S.; Yarra, C.; Deshmukh, O.D.; Ghosh, P.K.
     A typical solution for speech rate estimation consists of two stages: first, a short-time feature contour is computed such that most of its peaks correspond to syllable nuclei; second, those peaks are detected. Temporal correlation selected subband correlation (TCSSBC) is often used as the feature contour, in which correlations within and across a few selected sub-band energies are computed. In this work, instead of a fixed set of sub-bands, we learn them in a data-driven manner using a dictionary learning approach. Similarly, instead of the energy contours, we use the activation profile of the learned dictionary elements. We find that the peaks detected by the data-driven approach significantly improve the speech rate estimation when combined with the traditional TCSSBC approach using a proposed peak-merging strategy. Experiments are performed separately on the Switchboard, TIMIT and CTIMIT corpora. For all corpora except Switchboard, the correlation coefficient for speech rate estimation using the proposed approach is higher than that of the TCSSBC technique, with 3.1% and 5.2% relative improvements for TIMIT and CTIMIT respectively. © 2016 IEEE.
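     As a hedged illustration of the activation-profile idea, the sketch below learns a non-negative dictionary from a mel spectrogram and counts peaks in the pooled activation profile. The function speech_rate, the atom count and the peak-picking thresholds are assumptions; the paper's TCSSBC combination and peak-merging strategy are not reproduced.

        import numpy as np
        from sklearn.decomposition import NMF
        from scipy.signal import find_peaks

        def speech_rate(mel_spec, hop_s, n_atoms=20, min_gap_s=0.1):
            """Estimate syllables/second from the activation profile of a
            learned dictionary; mel_spec is a non-negative (n_mels, n_frames)
            spectrogram and hop_s the frame hop in seconds."""
            model = NMF(n_components=n_atoms, init="nndsvd", max_iter=400)
            model.fit_transform(mel_spec)            # atoms: (n_mels, n_atoms)
            H = model.components_                    # activations: (n_atoms, n_frames)
            profile = H.sum(axis=0)                  # pooled activation profile
            peaks, _ = find_peaks(profile,
                                  distance=max(1, int(min_gap_s / hop_s)),
                                  prominence=0.1 * profile.max())
            duration = mel_spec.shape[1] * hop_s
            return len(peaks) / duration             # syllable nuclei per second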
  • An Improved Air Tissue Boundary Segmentation Technique for Real Time Magnetic Resonance Imaging Video Using SegNet
    (Institute of Electrical and Electronics Engineers Inc., 2019) Valliappan, C.A.; Kumar, A.; Mannem, R.; Karthik, G.R.; Ghosh, P.K.
     This paper presents an improved methodology for segmenting the air-tissue boundaries (ATBs) in the upper airway of the human vocal tract using real-time magnetic resonance imaging (rtMRI) videos. The proposed approach performs semantic segmentation with a deep learning architecture called SegNet. The network processes an input image to produce a binary output image of the same dimensions, classifying each pixel as air cavity or tissue, after which contours are predicted. A multi-dimensional least squares smoothing technique is applied to smooth the contours. To quantify the precision of the predicted contours, the Dynamic Time Warping (DTW) distance is calculated between the predicted contours and the manually annotated ground truth contours. Four-fold experiments conducted with four subjects from the USC-TIMIT corpus demonstrate that the proposed approach achieves lower DTW distances of 1.02 and 1.09 for the upper and lower ATB compared to the best baseline scheme. The proposed SegNet based approach has an average pixel classification accuracy of 99.3% across all subjects with only 2 rtMRI videos (~180 frames) per subject for training. © 2019 IEEE.
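     The DTW evaluation metric mentioned above can be sketched as follows. This is a generic O(NM) dynamic-programming implementation over point contours; the length normalization at the end is an assumption, not the paper's stated formula.

        import numpy as np

        def dtw_distance(c1, c2):
            """DTW distance between two contours, each an (N, 2) array of
            pixel coordinates (simple sketch of the evaluation metric)."""
            n, m = len(c1), len(c2)
            cost = np.linalg.norm(c1[:, None, :] - c2[None, :, :], axis=2)
            D = np.full((n + 1, m + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j],
                                                       D[i, j - 1],
                                                       D[i - 1, j - 1])
            return D[n, m] / (n + m)    # length-normalized alignment cost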
  • Improved subject-independent acoustic-to-articulatory inversion
    (Elsevier, 2015) Afshan, A.; Ghosh, P.K.
     In subject-independent acoustic-to-articulatory inversion, the articulatory kinematics of a test subject are estimated assuming that the training corpus does not include data from the test subject. The training corpus in subject-independent inversion (SII) is formed from acoustic and articulatory kinematics data, and the acoustic mismatch between training and test subjects is then estimated by an acoustic normalization using acoustic data drawn from a large pool of speakers called the generic acoustic space (GAS). In this work, we focus on improving SII performance through better acoustic normalization and adaptation. We propose unsupervised and several supervised ways of clustering GAS for acoustic normalization. We also adapt the acoustic models of GAS using the acoustic data of the training and test subjects in SII. We find that SII performance improves significantly (approximately 25% relative on average) over subject-dependent inversion when the acoustic clusters in GAS correspond to phonetic units (or states of 3-state phonetic HMMs) and when the acoustic model built on GAS is adapted to training and test subjects while optimizing the inversion criterion. © 2014 Elsevier B.V. All rights reserved.
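     A minimal sketch of the unsupervised variant of GAS clustering for acoustic normalization might look like the following. The cluster count and the per-cluster mean/variance normalization are illustrative assumptions; the paper's supervised phonetic-unit clustering and HMM adaptation are not shown.

        import numpy as np
        from sklearn.cluster import KMeans

        def cluster_normalize(gas_feats, subj_feats, n_clusters=40):
            """Cluster a generic acoustic space (GAS) and mean/variance-
            normalize a subject's frames per cluster; an unsupervised
            stand-in for the normalization described above (assumes every
            cluster is non-empty)."""
            km = KMeans(n_clusters=n_clusters, n_init=10).fit(gas_feats)
            mu = km.cluster_centers_
            sd = np.stack([gas_feats[km.labels_ == k].std(axis=0) + 1e-8
                           for k in range(n_clusters)])
            lab = km.predict(subj_feats)
            return (subj_feats - mu[lab]) / sd[lab]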
  • Intonation tutor by SPIRE (In-SPIRE): An online tool for an automatic feedback to the second language learners in learning intonation
     (International Speech Communication Association, 2018) Anand, P.A.; Yarra, C.; Kausthubha, N.K.; Ghosh, P.K.
     In spoken communication, intonation often conveys the meaning of an utterance; incorrect intonation, typical of second language (L2) learners, can therefore result in miscommunication. We demonstrate the In-SPIRE tool, which helps L2 learners learn intonation in a self-learning manner. For this, we design an interactive, self-explanatory front end, which is also used to send the learner's audio and handshake signals to the back end. At the back end, we implement a system that takes the learner's audio for a specific stimulus and computes pitch patterns representing the intonation. For this, we apply pitch stylization on each syllable segment in the audio. We then compute a quality score from the learner's patterns and the respective ground-truth patterns. Finally, the score, the learner's patterns and the ground truth are sent to the front end for display as feedback, so the learner can correct any mismatch in his/her intonation with respect to the ground truth. The proposed tool benefits learners who do not have access to effective spoken language training. © 2018 International Speech Communication Association. All rights reserved.
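     The pitch-stylization and scoring steps could be approximated as below, assuming a precomputed pitch contour and syllable segmentation. Linear per-syllable fits and a correlation-based score are assumptions, since the abstract does not specify the stylization model or the scoring formula.

        import numpy as np

        def stylize(pitch, segments):
            """Linear pitch stylization per syllable: replace the pitch
            samples in each (start, end) frame segment with a fitted line."""
            out = np.array(pitch, dtype=float)
            for s, e in segments:
                t = np.arange(s, e)
                if len(t) > 1:
                    a, b = np.polyfit(t, out[s:e], 1)
                    out[s:e] = a * t + b
            return out

        def intonation_score(learner, reference):
            """Similarity in [0, 1] from the Pearson correlation of two
            equal-length stylized pitch contours (assumed scoring rule)."""
            r = np.corrcoef(learner, reference)[0, 1]
            return float(max(0.0, r))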
  • Multiple Spectral Peak Tracking for Heart Rate Monitoring from Photoplethysmography Signal during Intensive Physical Exercise
    (Institute of Electrical and Electronics Engineers Inc., 2015) Lakshminarasimha Murthy, N.K.; Madhusudana, P.C.; Suresha, P.; Periyasamy, V.; Ghosh, P.K.
     We propose a multiple initialization based spectral peak tracking (MISPT) technique for heart rate monitoring from the photoplethysmography (PPG) signal. MISPT is applied to the PPG signal after removing motion artifacts using an adaptive noise cancellation filter. MISPT yields several estimates of the heart rate trajectory from the spectrogram of the denoised PPG signal, which are finally combined using a novel measure called trajectory strength. Multiple initializations help in correcting erroneous heart rate trajectories, unlike typical SPT, which uses only a single initialization. Experiments on PPG data from 12 subjects recorded during intensive physical exercise show that MISPT based heart rate monitoring indeed yields a better heart rate estimate than SPT with a single initialization. On the 12 datasets, MISPT results in an average absolute error of 1.11 BPM, which is lower than the 1.28 BPM obtained by the state-of-the-art online heart rate monitoring algorithm. © 2015 IEEE.
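     A simplified sketch of multiple-initialization spectral peak tracking follows. Greedy frame-to-frame tracking and mean-magnitude trajectory strength are assumptions standing in for the paper's MISPT details.

        import numpy as np

        def track_heart_rate(S, freqs_bpm, inits_bpm=(70, 100, 130, 160), max_jump=8):
            """Greedy peak tracking over a PPG spectrogram S (n_freqs x
            n_frames) from several initial rates; trajectory strength is
            the mean spectral magnitude along each path (simplified)."""
            best_traj, best_strength = None, -np.inf
            for f0 in inits_bpm:
                k = int(np.argmin(np.abs(freqs_bpm - f0)))
                traj, vals = [], []
                for t in range(S.shape[1]):
                    lo = max(0, k - max_jump)
                    hi = min(len(freqs_bpm), k + max_jump + 1)
                    k = lo + int(np.argmax(S[lo:hi, t]))   # stay near previous bin
                    traj.append(freqs_bpm[k])
                    vals.append(S[k, t])
                strength = float(np.mean(vals))
                if strength > best_strength:
                    best_traj, best_strength = np.array(traj), strength
            return best_traj    # heart-rate trajectory in BPM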
  • Spectrogram Enhancement Using Multiple Window Savitzky-Golay (MWSG) Filter for Robust Bird Sound Detection
    (Institute of Electrical and Electronics Engineers Inc., 2017) Koluguri, N.R.; Nisha Meenakshi, G.N.; Ghosh, P.K.
     Bird sound detection from real-field recordings is essential for identifying bird species in bioacoustic monitoring. Variations in recording devices, environmental conditions, and the presence of vocalizations from other animals make bird sound detection very challenging. To overcome these challenges, we propose an unsupervised algorithm comprising two main stages. In the first stage, a spectrogram enhancement technique is proposed using a multiple window Savitzky-Golay (MWSG) filter. We show that the spectrogram estimate using the MWSG filter is unbiased and has lower variance than its single window counterpart. Bird sounds are known to be highly structured in the time-frequency (T-F) plane. In the second stage of the proposed method, we exploit these cues of prominent T-F activity in specific directions in the enhanced spectrogram for bird sound detection. In this regard, we use a set of four moving average filters that, when applied to the enhanced spectrogram, yield directional spectrograms capturing direction specific information. We propose a thresholding scheme on the time-varying energy profile computed from each of these directional spectrograms to obtain frame-level binary decisions of bird sound activity. These individual decisions are then combined to obtain the final decision. Experiments are performed with three different datasets with varying recording and noise conditions. The frame-level F-score is used as the evaluation metric for bird sound detection. We find that the proposed method, on average, achieves a higher F-score (10.24% relative) than the best of the six baseline schemes considered in this work. © 2017 IEEE.
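     The first (enhancement) stage can be sketched with SciPy's Savitzky-Golay filter applied at several window lengths and averaged. The thresholding rule shown is an assumption, and the four directional filters of the second stage are omitted.

        import numpy as np
        from scipy.signal import savgol_filter

        def mwsg_detect(S, windows=(5, 9, 13, 17), polyorder=2, k=1.5):
            """Smooth a magnitude spectrogram S (n_freqs x n_frames) with
            several Savitzky-Golay window lengths along time, average the
            results, and threshold the frame energy; a simplified sketch
            of the two-stage method described above."""
            enhanced = np.mean([savgol_filter(S, w, polyorder, axis=1)
                                for w in windows], axis=0)
            energy = enhanced.sum(axis=0)
            thresh = energy.mean() + k * energy.std()   # assumed rule
            return energy > thresh                      # frame-level activity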
  • Speech enhancement using multiple deep neural networks
     (Institute of Electrical and Electronics Engineers Inc., 2018) Karjol, P.; Kumar, M.A.; Ghosh, P.K.
     In this work, we present a variant of multiple deep neural network (DNN) based speech enhancement, in which the clean speech spectrum is directly estimated as a weighted average of the outputs of multiple DNNs. The weights are provided by a gating network, and the multiple DNNs and the gating network are trained jointly. The objective function is the mean square logarithmic error between the target clean spectrum and the estimated spectrum. We conduct experiments with two and four DNNs on the TIMIT corpus with nine noise types (four seen and five unseen) taken from the AURORA database at four different signal-to-noise ratios (SNRs). We also compare the proposed method with a single DNN based speech enhancement scheme and existing multiple DNN schemes using segmental SNR, perceptual evaluation of speech quality (PESQ) and short-term objective intelligibility (STOI) as evaluation metrics. These comparisons show the superiority of the proposed method over the baseline schemes for both seen and unseen noises. Specifically, we observe absolute improvements of 0.07 and 0.04 in the PESQ measure over a single DNN, averaged over all noises and SNRs, for the seen and unseen noise cases respectively. © 2018 IEEE.
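     A minimal PyTorch sketch of the gated multiple-DNN architecture and the mean square logarithmic error objective follows; layer sizes and the expert topology are illustrative, not the paper's configuration.

        import torch
        import torch.nn as nn

        class GatedEnhancer(nn.Module):
            """Estimate the clean spectrum as a weighted average of expert
            DNN outputs, with weights from a jointly trained gating net."""
            def __init__(self, dim=257, n_experts=4, hidden=512):
                super().__init__()
                self.experts = nn.ModuleList([
                    nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, dim))
                    for _ in range(n_experts)])
                self.gate = nn.Sequential(nn.Linear(dim, n_experts),
                                          nn.Softmax(dim=-1))

            def forward(self, noisy):                    # noisy: (batch, dim)
                outs = torch.stack([e(noisy) for e in self.experts], dim=-1)
                w = self.gate(noisy).unsqueeze(1)        # (batch, 1, n_experts)
                return (outs * w).sum(dim=-1)            # weighted average

        def msle_loss(est, clean, eps=1e-8):
            """Mean square logarithmic error between non-negative spectra."""
            return torch.mean((torch.log(est.clamp_min(eps) + 1)
                               - torch.log(clean + 1)) ** 2)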
  • SPIRE-SIES: A Spontaneous Indian English Speech Corpus
    (Institute of Electrical and Electronics Engineers Inc., 2023) Singh, A.; Shah, C.; Varadaraj, R.; Chauhan, S.; Ghosh, P.K.
     In this paper, we present a 170.83 hour Indian English spontaneous speech dataset. The lack of Indian English speech data is one of the major hindrances in developing robust speech systems adapted to the Indian speech style, and this scarcity is even more acute for spontaneous speech. The corpus is crowd-sourced across varied Indian nativities, genders and age groups. Traditional spontaneous speech collection strategies capture speech during interviews or conversations; in this study, we instead use images as stimuli to induce spontaneity in speech. Transcripts for 23 hours are generated and validated, which can serve as a spontaneous speech ASR benchmark. The quality of the corpus is validated with voice activity detection based segmentation, gender verification and image semantic correlation, which measures the relationship between the image stimulus and the recorded speech using caption keywords derived from an image-to-text model and frequently occurring words in transcripts generated by the Whisper ASR. © 2023 IEEE.
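     The image-semantic-correlation check could be illustrated as a simple keyword-overlap measure, assuming caption keywords and transcript words are already extracted; the semantic_overlap function and its Jaccard form are assumptions, not the authors' metric.

        from collections import Counter

        def semantic_overlap(caption_keywords, transcript_words, top_n=50):
            """Jaccard-style overlap between image-caption keywords and the
            most frequent words of the ASR transcript (illustrative)."""
            top = {w for w, _ in Counter(transcript_words).most_common(top_n)}
            keys = set(caption_keywords)
            return len(keys & top) / max(1, len(keys | top))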
  • SPIRE-SST: An automatic web-based self-learning tool for syllable stress tutoring (SST) to the second language learners
     (International Speech Communication Association, 2018) Yarra, C.; Anand, P.A.; Kausthubha, N.K.; Ghosh, P.K.
     Correct stress placement on the syllables in a word or word group is important in spoken communication; incorrect syllable stress, typical of second language (L2) learners, can therefore result in miscommunication. In this demo, we present the SPIRE-SST tool, which teaches correct stress patterns in a self-learning manner and could therefore also benefit learners without access to effective training methods. For this, we design a front end containing self-explanatory instructions that can be easily followed by the user. Using the front end, learners can submit their audio to the back end and view the corresponding feedback. In the back end, we divide the learner's audio into syllable segments and detect each syllable as stressed or unstressed. Using these stress markings, we compute a score representing the stress quality in comparison with the ground-truth stress markings and send it to the front end as feedback. We also send as feedback a set of three features computed by comparing the expert's and the learner's audio, which we assume to be useful for correcting pronunciation errors. © 2018 International Speech Communication Association. All rights reserved.
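     A plausible (assumed) form of the stress-quality score is the fraction of syllables whose detected stressed/unstressed label agrees with the ground truth, as sketched below; the abstract does not give the exact formula.

        def stress_score(detected, reference):
            """Fraction of syllables whose binary stress label matches the
            ground truth, e.g. stress_score([1,0,0,1], [1,0,1,1]) -> 0.75."""
            assert len(detected) == len(reference)
            hits = sum(d == r for d, r in zip(detected, reference))
            return hits / len(reference)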
