A Region Based Semantic Composition Framework to Visual Image and Video Event Specification
Date
2023
Authors
Naik, Dinesh
Publisher
National Institute Of Technology Karnataka Surathkal
Abstract
A long-standing goal of artificial intelligence in computer vision has been to develop
models capable of perceiving and comprehending the complex visual environment around us
and communicating with us about it in natural language. Significant progress has been
made toward this goal over the last few years as a result of parallel advances in
computing systems, data collection, and algorithms. Visual recognition has advanced
rapidly: computers can now classify and recognise images and describe them in
increasingly long descriptions, surpassing humans in some categories. Despite this
progress, most improvements in visual recognition still occur in settings where an image
is assigned one or a few labels and briefly described in natural language. Most people
find it straightforward to watch a brief video and describe in words what occurred,
whereas machines have difficulty extracting meaning from video frames and generating a
sentence description. Computer vision research has long focused on comprehending visual
media such as images and videos, and within this area the newer problem of dynamic image
and video transcription has attracted considerable interest. This research presents
models and methods for associating visual data with semantic labels and with natural
language utterances, thereby simplifying translation between domain constituents.
Semantic segmentation is a fundamental component of object recognition models, as it
aims to classify objects on a pixel-by-pixel basis. First, this research aims to classify
individual objects within an image pixel by pixel; the input image is analysed to
determine the pixel-level properties that are present. Second, we propose an
encoder-decoder architecture with a hybrid loss function that employs a layered LSTM as
the encoder and an LSTM combined with an attention mechanism as the decoder. Third, we
propose a novel framework for video captioning that combines a bidirectional multi-layer
LSTM encoder and a unidirectional decoder with a temporal attention technique to produce
better global representations of videos. Finally, we propose an efficient method for
captioning videos that uses a CNN in conjunction with a short-connected LSTM-based
encoder-decoder model and a phrase context vector.
Keywords
Computer Vision, Object Detection, Semantic Segmentation, Object Recognition