Automatic Estimation of Personal Characteristics using Speech Data

Babu, Kalluri Shareef.

Please use this identifier to cite or link to this item: https://idr.nitk.ac.in/jspui/handle/123456789/17050

Title:	Automatic Estimation of Personal Characteristics using Speech Data
Authors:	Babu, Kalluri Shareef.
Supervisors:	Vijayasenan, Deepu.
Keywords:	Department of Electronics and Communication Engineering;Speaker Profiling;Multilingual data;AFDS;NISP;Short duration;Physical Parameters;Audio Forensics
Issue Date:	2021
Publisher:	National Institute of Technology Karnataka, Surathkal
Abstract:	Many paralinguistic speech applications demand the extraction of information about the speaker's characteristics from as little speech data as possible. In this work, we explore the estimation of multiple physical parameters of a speaker from short durations of speech in monolingual (English) and multilingual settings. This has applications in forensics as well as e-commerce. We explore different feature streams derived from the speech spectrum at different resolutions. Short-term log-mel spectrogram, formant features, and harmonic features are extracted for age and body build estimation (height, weight, shoulder size, and waist size) of the speaker. The statistics of these features accumulated over the speech utterance are used to learn a support vector regression model for speaker age and body build estimation. The experiments performed on the TIMIT dataset show that each of the individual features achieves results that outperform the default predictor (which predicts the mean of the training data for every test sample, without looking at the features) in height and age estimation. Furthermore, the estimation errors from these different feature streams are complementary, so combining the estimates from these feature streams improves the results further. The combined system, using short audio snippets, achieves a Mean Absolute Error (MAE) of 5.2 cm and 4.8 cm for male and female speakers, respectively, in height estimation. Similarly, in age estimation, the MAE is 5.2 years and 5.6 years for male and female speakers. We extend the same physical parameter estimation system to other body build parameters like shoulder width, waist size, weight, and height. We created two datasets for the speaker profiling task in a multilingual and multi-accent setting. Speech data is collected along with speaker parameter details (like height, age, shoulder size, waist size, and weight).
A pilot dataset, the Audio Forensic Dataset (AFDS), with 207 speakers across 12 different native Indian languages, has around 8 hours of native-language speech and around 9 hours of English speech data. Later, a bigger dataset, the NITK-IISc Multilingual Multi-accent Speaker Profiling (NISP) dataset, was collected; it has 345 speakers across five Indian languages as well as English. The NISP dataset has around 25 hours of native-language speech data and 32 hours of English speech data. The system can estimate all the physical parameters and improves over the default predictor in the multilingual and multi-accent setting. The duration analysis shows that state-of-the-art results can be achieved using short utterances (around 1-2 seconds) of speech data. To the best of our knowledge, this is the first attempt to use a common set of features for estimating the different physical traits of a speaker from short utterances. An integrated end-to-end deep neural network architecture is proposed for joint prediction of all the physical parameters. A novel initialization scheme for the deep neural architecture is introduced, which avoids the need for a large training dataset. On the TIMIT dataset, the system achieves a Root Mean Square Error (RMSE) of 6.85 cm and 6.29 cm for male and female height prediction. In the case of age estimation, the RMSE is 7.60 years and 8.63 years for male and female speakers, respectively. Analysis of shorter durations of speech reveals that the network's performance degrades by at most around 3% with only 1 second of speech input. Also, the performance saturates at around 3 seconds in predicting the height and age of a speaker using the TIMIT dataset. In the multilingual setting using the collected datasets, the prediction error metrics are lower than those of the default predictor, except for female age prediction in both the AFDS and NISP datasets. For male speakers, the system's performance is lower than that of the default predictor in height estimation on the NISP dataset.
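Note on the baseline and metrics: the abstract compares every system against a default predictor and reports MAE and RMSE. The following is a minimal sketch of that comparison, not the thesis code. The function names and the height values are hypothetical and made up purely for illustration.

```python
import math
from statistics import mean


def default_predictor(train_targets, n_test):
    """Ignore all features and predict the training mean for every test sample."""
    train_mean = mean(train_targets)
    return [train_mean] * n_test


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def rmse(y_true, y_pred):
    """Root Mean Square Error."""
    return math.sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


# Hypothetical heights in cm (illustrative values, not data from the thesis).
train_heights = [160.0, 170.0, 180.0]
test_heights = [165.0, 175.0]

baseline = default_predictor(train_heights, len(test_heights))
print("default predictor MAE :", mae(test_heights, baseline))
print("default predictor RMSE:", rmse(test_heights, baseline))
```

A feature-based model is useful only if its MAE or RMSE on the same test set is lower than these baseline values.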
URI:	http://idr.nitk.ac.in/jspui/handle/123456789/17050
Appears in Collections:	1. Ph.D Theses

Files in This Item:

File	Description	Size	Format
Shareef_PhD_thesis.pdf		4.16 MB	Adobe PDF
