Acoustic Scene Classification Using Speech Features
Date
2020
Authors
Mulimani, Manjunath.
Publisher
National Institute of Technology Karnataka, Surathkal
Abstract
Currently, smart devices such as smartphones, laptops, and tablets need human intervention to deliver their services effectively. They are capable of recognizing speech, music, images, characters, and so on. To make smart systems behave intelligently, we need to build into them the capacity to understand the surrounding situation and respond to it accordingly, without human intervention.
Enabling devices to sense the environment in which they are present through analysis of sound is the main objective of Acoustic Scene Classification. The initial step in analyzing the surroundings is recognition of the acoustic events present in day-to-day environments. Such acoustic events are broadly categorized into two types: monophonic and polyphonic. Monophonic acoustic events correspond to non-overlapped events; in other words, at most one acoustic event is active at a given time. Polyphonic acoustic events correspond to overlapped events; in other words, multiple acoustic events occur at the same time instant. In this work, we aim to develop systems for the automatic recognition of monophonic and polyphonic acoustic events along with the corresponding acoustic scene. Applications of this research work include context-aware mobile devices, robots, intelligent monitoring systems, assistive technologies such as hearing aids, and so on.
Some of the important issues in this research area are: identifying acoustic-event-specific features for acoustic event characterization and recognition, optimizing the existing algorithms, developing robust mechanisms for acoustic event recognition in noisy environments, making state-of-the-art methods work on big data, developing a joint model that recognizes acoustic events followed by the corresponding scenes, and so on. Existing approaches towards solutions have major limitations: the use of traditional speech features, which are sensitive to noise; the use of features from two-dimensional Time-Frequency Representations (TFRs) for recognizing acoustic events, which demands high computational time; and the use of deep learning models, which require substantially large amounts of training data.
Several novel approaches are presented in this thesis for the recognition of monophonic acoustic events, polyphonic acoustic events, and acoustic scenes. Two main challenges associated with real-time Acoustic Event Classification (AEC) are addressed in this thesis. The first is the effective recognition of acoustic events in noisy environments, and the second is the use of the MapReduce programming model on a Hadoop distributed environment to reduce computational complexity.
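To illustrate how such a distributed feature-extraction step might be organized, the sketch below shows a Hadoop Streaming style mapper in Python that reads one audio file path per input line and emits per-file log-mel spectrogram statistics for a downstream reducer. The use of librosa, the file-path input format, and the parameter values are assumptions made for illustration; they are not details taken from the thesis.

#!/usr/bin/env python3
# Illustrative Hadoop Streaming mapper (sketch only, assumed setup):
# each input line is a path to an audio file; the mapper emits
# <path, serialized log-mel feature summary> pairs on standard output.
import sys
import json
import numpy as np
import librosa

SR = 44100      # assumed sampling rate
N_MELS = 40     # assumed number of mel bands

def extract_logmel(path):
    # Load the recording and compute a log-scaled mel spectrogram.
    y, sr = librosa.load(path, sr=SR, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                         hop_length=1024, n_mels=N_MELS)
    return librosa.power_to_db(mel, ref=np.max)

if __name__ == "__main__":
    for line in sys.stdin:
        path = line.strip()
        if not path:
            continue
        logmel = extract_logmel(path)
        # Emit per-band mean and standard deviation so a reducer can aggregate them.
        summary = {"mean": logmel.mean(axis=1).tolist(),
                   "std": logmel.std(axis=1).tolist()}
        print(f"{path}\t{json.dumps(summary)}")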
In this thesis, features are extracted from spectrograms, and these are more robust than the traditional speech features. Further, an improved Convolutional Recurrent Neural Network (CRNN) and Deep Neural Network-driven feature learning models are proposed for Polyphonic Acoustic Event Detection (AED) in real-life recordings. Finally, binaural features are explored to train a Kervolutional Recurrent Neural Network (KRNN), which recognizes both the acoustic events and the respective scene of an audio signal. A detailed experimental evaluation is carried out to compare the performance of each of the proposed approaches against baseline and state-of-the-art systems.
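The sketch below gives a minimal Keras CRNN of the kind commonly used for polyphonic AED on log-mel spectrogram input, with a convolutional front end, a bidirectional recurrent layer, and frame-wise sigmoid outputs so that overlapping events can be active simultaneously. The framework, layer sizes, and class count are illustrative assumptions and do not reproduce the improved CRNN or the KRNN proposed in the thesis.

# Minimal CRNN sketch for polyphonic acoustic event detection (assumed design).
# Input: log-mel spectrogram (frames x mel bands x 1); output: per-frame,
# multi-label event activity, so several events can be active at once.
import tensorflow as tf
from tensorflow.keras import layers, models

N_FRAMES, N_MELS, N_CLASSES = 500, 40, 6   # illustrative values

def build_crnn():
    inputs = layers.Input(shape=(N_FRAMES, N_MELS, 1))
    x = inputs
    # Convolutional front end: learns local time-frequency patterns,
    # pooling only along the frequency axis to preserve frame resolution.
    for filters in (64, 64):
        x = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D(pool_size=(1, 2))(x)
    # Collapse the frequency axis so each frame becomes a feature vector.
    x = layers.Reshape((N_FRAMES, -1))(x)
    # Recurrent part: models temporal context across frames.
    x = layers.Bidirectional(layers.GRU(64, return_sequences=True))(x)
    # Frame-wise multi-label output: one sigmoid per event class.
    outputs = layers.TimeDistributed(
        layers.Dense(N_CLASSES, activation="sigmoid"))(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

model = build_crnn()
model.summary()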
Keywords
Department of Computer Science & Engineering, Monophonic Acoustic Event Classification (AEC), Polyphonic Acoustic Event Detection (AED), Acoustic Scene Classification (ASC), Time-Frequency Representations (TFRs), MapReduce programming model, Convolutional Recurrent Neural Network (CRNN), Kervolutional Recurrent Neural Network (KRNN)