Rhevanth, M.; Ahmed, R.; Shah, V.; Mohan, B.R.
2026-02-06
2022
Lecture Notes in Electrical Engineering, 2022, Vol. 858, p. 229-243
18761100
https://doi.org/10.1007/978-981-19-0840-8_17
https://idr.nitk.ac.in/handle/123456789/29942

Video summarization (VS) techniques have garnered immense interest in recent years, leading to numerous applications in different computer vision domains, such as video extraction, image captioning, indexing, and browsing. Conventional VS studies often try to improve VS algorithms by adding high-quality features and clusters to pick representative visual elements. Many existing VS mechanisms consider only the visual aspect of the video input, thereby ignoring the influence of audio features on the generated summary. To address this, we propose an efficient video summarization technique that processes both visual and audio content while extracting key frames from the raw video input. The structural similarity index is used to check the similarity between frames, while mel-frequency cepstral coefficients (MFCCs) are used to extract features from the corresponding audio signals. By combining these two features, the redundant frames of the video are removed. The resulting key frames are refined using a deep convolutional neural network (CNN) model to retrieve a list of candidate key frames, which finally constitute the summary of the data. The proposed system is evaluated on video datasets from YouTube that contain events, which helps in better understanding the video summary. Experimental observations indicate that the inclusion of audio features and an efficient refinement technique, followed by an optimization function, provides better summary results than standard VS techniques.
© 2022, The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Computer vision; Deep convolutional neural network; Mel-frequency cepstral coefficient; Structural similarity index; Video summarization
Deep Learning Framework Based on Audio–Visual Features for Video Summarization
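
The redundancy-removal step described in the abstract (structural similarity between frames combined with audio features) might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract gives no thresholds or combination rule, so the rule used here (drop a frame only when it is both visually and acoustically similar to the last kept frame) and the threshold values are assumptions. The SSIM is a simplified single-window (global) variant rather than the usual sliding-window form, frames are flat lists of grayscale intensities, and `audio_feats` stands in for per-frame MFCC vectors computed elsewhere. All function and parameter names are hypothetical.

```python
from math import sqrt


def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM between two equal-length grayscale pixel lists."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Standard SSIM stabilizing constants (K1 = 0.01, K2 = 0.03).
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_key_frames(frames, audio_feats, visual_thresh=0.9, audio_thresh=0.9):
    """Return indices of frames kept after removing redundant ones.

    Assumed rule: a frame is redundant only if it is similar to the last
    kept frame in BOTH the visual (SSIM) and audio (cosine) sense.
    """
    if not frames:
        return []
    kept = [0]
    for i in range(1, len(frames)):
        last = kept[-1]
        visually_similar = global_ssim(frames[last], frames[i]) >= visual_thresh
        audio_similar = cosine_similarity(audio_feats[last], audio_feats[i]) >= audio_thresh
        if not (visually_similar and audio_similar):
            kept.append(i)
    return kept


# Toy usage: frame 1 duplicates frame 0; frame 2 differs in both modalities.
frames = [[10, 20, 30, 40], [10, 20, 30, 40], [200, 10, 220, 5]]
audio_feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
key_frames = select_key_frames(frames, audio_feats)
```

In a fuller pipeline, the surviving indices would be passed to the CNN-based refinement stage mentioned in the abstract; that stage is not sketched here because the abstract does not describe it in enough detail.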