Auditory Scene Analysis using Deep Learning Approaches

Date

2024

Publisher

National Institute of Technology Karnataka, Surathkal.

Abstract

Auditory scene analysis (ASA) is a fundamental ability of the human auditory system that allows us to perceive and identify acoustic events in the environment around us. Automating ASA on computational devices such as hand-held smartphones or laptops is known as Computational Auditory Scene Analysis (CASA). This research is motivated by the many real-time applications of ASA; for example, a context-aware mobile device can switch to silent mode when its owner enters a meeting or a hospital ICU. The environment or location in which an audio recording is made is known as an acoustic scene, and the sounds occurring in that scene are called sound events; a combination of two or more acoustic events forms one acoustic scene. For example, in a meeting-room scene, the sound events present include keyboard typing, mouse clicks, somebody speaking, and so on. The task of identifying the events present in an acoustic scene is known as Sound Event Detection (SED), identifying the location of the sound source along with the event type is known as Sound Source Localization (SSL), and Acoustic Scene Classification (ASC) is the task of identifying a scene from sound cues and assigning it a label. In this thesis, three ASA tasks are investigated: SED, SSL, and ASC.

A major challenge in identifying sound events arises when they overlap at a given point in time; this type of event detection is known as Polyphonic Sound Event Detection (PSED). In existing works, the results obtained for the PSED task are modest, and there is considerable scope for improvement. In this thesis, two new methods are proposed to perform PSED using spectral features and deep learning techniques; in particular, a Mel-pseudo-based Constant Q-transform feature is proposed. The dataset used for this task is TUT Sound Event Detection (SED) 2016. The method achieved an F1 score of 54% and an error rate of 0.66.

Once an event is detected, it is necessary to identify its source. The presence of noise or the distance of the source can significantly affect the performance of SSL. Therefore, this work proposes Sound Event Localization and Detection (SELD) systems that estimate both the Direction-of-Arrival (DOA) of a sound event and its event type. A channel-wise ‘FusionNet’ deep learning network is designed to perform the SELD task; the proposed model performs SED and DOA estimation within a single neural network. The dataset used for this task is TAU-NIGENS Spatial Sound Events 2020. The method achieved an F-score of 81.2%, an error rate of 0.23, and a frame recall of 86.9%.

Accurate event detection and localization in a given surrounding makes identification of the scene more straightforward. However, a critical challenge in ASC arises when the recording devices differ, in which case device distortion is likely to be present in the audio recordings. Therefore, a device-robust ASC method is proposed to eliminate device distortion in the recordings and improve the performance of the ASC task. In addition, a deep learning approach named Deep Fisher Network is proposed for ASC; this method combines the working principles of traditional machine learning and deep learning algorithms. The dataset used for this task is DCASE (Detection and Classification of Acoustic Scenes and Events) 2019 ASC Task 1(a). The best average accuracy achieved is 91%. A detailed experimental evaluation is carried out to compare each of the proposed approaches against baseline and state-of-the-art systems.
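To make the spectral-feature idea concrete, the short sketch below shows how a constant-Q spectrogram can be extracted as a frame-level input for a PSED model. This is not the thesis code and not the exact Mel-pseudo-based Constant Q-transform; it is a minimal stand-in using librosa's standard CQT, and all parameter values (sample rate, hop length, number of bins) are assumptions:

# Minimal sketch (not the thesis code): extract a log-magnitude constant-Q
# spectrogram as a frame-level spectral feature for polyphonic sound event
# detection. A plain librosa CQT in dB stands in for the proposed
# Mel-pseudo-based Constant Q-transform; parameter values are assumptions.
import numpy as np
import librosa

def cqt_feature(path, sr=44100, hop_length=512, n_bins=84, bins_per_octave=12):
    """Load a mono audio file and return a log-magnitude CQT (n_bins x frames)."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    cqt = librosa.cqt(y, sr=sr, hop_length=hop_length,
                      n_bins=n_bins, bins_per_octave=bins_per_octave)
    return librosa.amplitude_to_db(np.abs(cqt), ref=np.max)

# Example (hypothetical file name):
# features = cqt_feature("meeting_room.wav")  # fed frame-wise to a deep classifier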
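A SELD system, in turn, maps multichannel spectral features to two parallel outputs: per-frame event activity and a DOA vector per class. The sketch below shows a generic CRNN with this two-head structure in PyTorch; it is not the proposed channel-wise ‘FusionNet’, and the channel count, class count, and layer sizes are assumptions loosely based on the TAU-NIGENS 2020 setup:

# Minimal sketch (not the proposed channel-wise 'FusionNet'): a generic CRNN that
# jointly predicts frame-wise sound event activity (SED head) and per-class
# Cartesian direction-of-arrival vectors (DOA head). All sizes are assumptions.
import torch
import torch.nn as nn

class SimpleSELD(nn.Module):
    def __init__(self, in_channels=7, n_freq=64, n_classes=14, hidden=128):
        super().__init__()
        # Shared convolutional front-end; each max-pool shrinks the frequency axis by 4
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((1, 4)),
            nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((1, 4)),
        )
        self.rnn = nn.GRU(64 * (n_freq // 16), hidden,
                          batch_first=True, bidirectional=True)
        self.sed_head = nn.Linear(2 * hidden, n_classes)      # event activity per frame
        self.doa_head = nn.Linear(2 * hidden, 3 * n_classes)  # (x, y, z) per class

    def forward(self, x):                       # x: (batch, channels, time, n_freq)
        z = self.cnn(x)                         # (batch, 64, time, n_freq // 16)
        b, c, t, f = z.shape
        z = z.permute(0, 2, 1, 3).reshape(b, t, c * f)
        z, _ = self.rnn(z)
        sed = torch.sigmoid(self.sed_head(z))   # per-frame event probabilities
        doa = torch.tanh(self.doa_head(z))      # per-frame DOA vectors in [-1, 1]
        return sed, doa

# Example: sed, doa = SimpleSELD()(torch.randn(1, 7, 100, 64))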

Keywords

Auditory Scene Analysis, Polyphonic Sound Event Detection, Sound Source Localization, Acoustic Scene Classification
