Automatic Estimation of Personal Characteristics using Speech Data
Date
2021
Authors
Babu, Kalluri Shareef.
Publisher
National Institute of Technology Karnataka, Surathkal
Abstract
Many paralinguistic speech applications demand the extraction of information
about the speaker's characteristics from as little speech data as possible. In this
work, we explore the estimation of multiple physical parameters of a speaker from
short durations of speech in monolingual (English) and multilingual settings. This
has applications in forensics as well as e-commerce. We explore different feature
streams derived from the speech spectrum at different resolutions. Short-term log-mel
spectrogram, formant, and harmonic features are extracted for age and body
build estimation (height, weight, shoulder size, and waist size) of the speaker. The
statistics of these features accumulated over the speech utterance are used to learn a
support vector regression model for speaker age and body build estimation. The experiments
performed on the TIMIT dataset show that each of the individual features
can achieve results that outperform the default predictor (which ignores the
features and always predicts the mean of the training data) in height and age
estimation. Furthermore, the estimation errors from these
different feature streams are complementary, allowing the combination of estimates
from these feature streams to improve the results further. The combined system, using
short audio snippets, achieves a Mean Absolute Error (MAE) of 5.2 cm and 4.8 cm
for male and female speakers, respectively, in height estimation. Similarly, in
age estimation, the MAE is 5.2 years and 5.6 years for male and female speakers, respectively.
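The pieces described above (utterance-level statistics, the default predictor, MAE, and the combination of estimates from several feature streams) can be sketched in a few lines. This is a minimal illustration, not the thesis implementation: the function names are hypothetical, the numbers are toy values invented for the example, the support vector regression step is omitted, and simple averaging is only an assumed way of combining the streams, since the abstract does not state the combination rule.

```python
import numpy as np

def pool_statistics(frames):
    """Collapse a (num_frames, num_features) matrix into one utterance-level
    vector by concatenating the per-feature mean and standard deviation."""
    frames = np.asarray(frames, dtype=float)
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

def mean_absolute_error(y_true, y_pred):
    """Mean Absolute Error between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def default_predictor(y_train, num_test):
    """Ignore the features: predict the training-set mean for every test speaker."""
    return np.full(num_test, float(np.mean(y_train)))

def fuse_estimates(per_stream_predictions):
    """Average the estimates of several feature streams (one row per stream).
    Averaging is an assumption made for this sketch."""
    return np.mean(np.asarray(per_stream_predictions, dtype=float), axis=0)

# Toy heights in cm, invented purely for illustration.
train_heights = [170.0, 180.0, 175.0, 175.0]
test_heights = [172.0, 178.0]

baseline = default_predictor(train_heights, len(test_heights))
baseline_mae = mean_absolute_error(test_heights, baseline)

# Two hypothetical feature streams whose errors point in opposite directions.
stream_a = [171.0, 181.0]
stream_b = [175.0, 177.0]
fused_mae = mean_absolute_error(test_heights, fuse_estimates([stream_a, stream_b]))

print(baseline_mae, fused_mae)
```

In this toy case each stream alone has an MAE of 2 cm, while their average has a lower MAE because the individual errors partly cancel, which is the sense in which complementary errors make combination worthwhile.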
We extend the same physical parameter estimation system to other body build parameters
such as shoulder width, waist size, and weight, in addition to height. We created two datasets
for the speaker profiling task in a multilingual and multi-accent setting. Speech data
was collected along with speaker details (such as height, age, shoulder size, waist
size, and weight). A pilot dataset, the Audio Forensic Dataset (AFDS), covers 207 speakers
across 12 different native Indian languages and has around 8 hours of native-language
speech and around 9 hours of English speech data. Later, a larger dataset, the NITK-IISc
Multilingual Multi-accent Speaker Profiling (NISP) dataset, was collected; it has
345 speakers across five Indian languages as well as English. The NISP dataset has around
25 hours of native-language speech data and 32 hours of English speech data. The
system can estimate all the physical parameters and improves on
the default predictor in the multilingual and multi-accent setting.
The duration analysis shows that state-of-the-art results can be achieved using
short utterances (around 12 seconds) of speech data. To the best of our knowledge,
this is the first attempt to use a common set of features for estimating the different
physical traits of a speaker from short utterances.
An integrated end-to-end deep neural network architecture is proposed for joint
prediction of all the physical parameters. A novel initialization scheme for deep neural
architecture is introduced, which avoids a large training dataset requirement. On the
TIMIT dataset, the system achieves an RMSE of 6.85 cm and 6.29 cm for male
and female height prediction, respectively. In the case of age estimation, the RMSEs are 7.60
and 8.63 years for male and female speakers, respectively. Analysis of shorter durations of
speech reveals that the network's performance degrades by at most around 3% with only 1 second
of speech input. Also, the performance saturates at around 3 seconds in predicting
the height and age of a speaker using the TIMIT dataset. In the multilingual setting
using the collected datasets, the prediction errors are lower than those of the default predictor,
except for female age prediction in both the AFDS and NISP datasets. For male speakers,
the system performance is lower than that of the default predictor in height estimation on the
NISP dataset.
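For readers comparing the figures above, the following sketch shows how an RMSE and a relative degradation percentage can be computed. It is only an illustration under stated assumptions: the function names are hypothetical, the values are invented toy numbers rather than results from the thesis, and the abstract does not say how the "around 3%" degradation is defined, so the percentage increase in error relative to a reference error is assumed here.

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Squared Error between targets and predictions."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def relative_degradation(reference_error, observed_error):
    """Percentage increase of observed_error over reference_error
    (an assumed definition, chosen for this sketch)."""
    return 100.0 * (observed_error - reference_error) / reference_error

# Toy ages in years, invented purely for illustration.
true_ages = [20.0, 30.0, 40.0]
predicted_ages = [23.0, 30.0, 36.0]
age_rmse = rmse(true_ages, predicted_ages)

# Hypothetical errors for full-length input and for shortened input.
degradation = relative_degradation(10.0, 10.3)

print(round(age_rmse, 3), round(degradation, 2))
```

Unlike MAE, RMSE squares each error before averaging, so a few large misses weigh more heavily, which is worth keeping in mind when comparing the RMSE figures here with the MAE figures reported earlier in the abstract.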
Keywords
Department of Electronics and Communication Engineering, Speaker Profiling, Multilingual data, AFDS, NISP, Short duration, Physical Parameters, Audio Forensics