A Region Based Semantic Composition Framework to Visual Image and Video Event Specification

Naik, Dinesh

Files

110652-IT11P01-DINESH NAIK.pdf (19.18 MB)

Date

2023

Authors

Naik, Dinesh

Publisher

National Institute Of Technology Karnataka Surathkal

Abstract

A long-standing goal of artificial intelligence in computer vision has been to develop models capable of perceiving and comprehending the complex visual environment around us and communicating with us about it in natural language. Significant progress has been made toward this goal over the last few years as a result of parallel advances in computing systems, data collection, and algorithms. Visual recognition has advanced rapidly: computers can now classify images, recognise their contents, and describe them at greater length, in some categories surpassing human performance. Despite this progress, most improvements in visual recognition still occur in settings where an image is assigned one or a few labels and briefly described in natural language. Most people find it straightforward to watch a brief video and describe in words what occurred, whereas machines have a difficult time extracting meaning from video frames and generating a sentence description.

Computer vision research has long focused on comprehending visual media such as images and videos. A newer problem within this area, dynamic image and video transcription, has attracted wide interest. This research presents models and methods for associating visual data with semantic labels and with natural language utterances, thereby simplifying translation between the constituents of these domains.

Semantic segmentation is a fundamental component of object recognition models, as it aims to classify objects on a pixel-by-pixel basis. The primary goal of this research is to classify each individual object within an image pixel by pixel; the given image is analysed to determine the pixel-level properties present. Second, we propose an encoder-decoder architecture with a hybrid loss function that employs a layered LSTM as the encoder and an LSTM combined with an attention mechanism as the decoder. Third, we propose a novel framework for video captioning that combines a bidirectional multi-layer LSTM encoder and a unidirectional decoder with a temporal attention mechanism to produce better global representations of videos. Finally, we propose an efficient method for captioning videos using a CNN in conjunction with a short-connected LSTM-based encoder-decoder model and a phrase context vector.

Keywords

Computer Vision, Object Detection, Semantic Segmentation, Object Recognition

URI

https://idr.nitk.ac.in/handle/123456789/17747

Collections

1. Ph.D Theses
