2. Theses and Dissertations

Permanent URI for this community: https://idr.nitk.ac.in/handle/1/10

Search Results

Now showing 1 - 10 of 22
  • Item
    An Efficient Framework for Information Retrieval from Linked Data
    (National Institute of Technology Karnataka, Surathkal, 2021) R, Sakthi Murugan.; V. S, Ananthanarayana
    Linked data is a method of publishing machine-processable data over the web. The Resource Description Framework (RDF) is a standard model for publishing linked data. Currently, distributed compendiums of linked data are available over multiple SPARQL endpoints that can be queried using SPARQL queries. The size of the linked data is steadily increasing as many companies and government organizations have begun adopting linked data technologies. Many frameworks have been proposed for information retrieval from linked data, and the volume of data is a common challenge. Due to this huge volume, many existing linked data search engines have not indexed the latest data. The major components of information retrieval from linked data include storing, partitioning, indexing and ranking. This thesis presents a novel framework for information retrieval from a distributed compendium of linked data called the ‘Linked Data Search Framework’, abbreviated as LDSF. The significant contributions of LDSF include the methods of storage, partitioning, indexing and ranking of linked data.
    The storage cost of RDF data is one of the primary concerns for searching linked data. The main objective of linked data is to represent the data as URIs in a format that is both human-understandable and machine-processable. This intermediate URI form of representation is difficult for humans to understand and consumes massive storage on machines. Humans read, think and speak text data as words and not as characters, but computers use character-based encodings such as Unicode to handle text data (including linked data). This thesis presents an approach named ‘WordCode’, a word-based encoding of text data (including linked data), that enables computers to store and process text data as words. A trie-based code page named ‘WordTrie’ is proposed to store words for rapid encoding and decoding. Experimental results from encoding text files from the standard corpus using WordCode show up to a 19.9% reduction in file size compared to that achieved with character-based encoding. The proposed WordCode method of encoding words in a machine-processable format uses less storage space, resulting in faster processing and communication of text data (including linked data).
    Query processing in massive linked data is performed by distributing the storage across multiple partitions. Considerable research has been conducted to partition linked data based on clusters, and substantial research on hash-based, cloud-based, and graph-based partitioning has been reported. However, these sophisticated partitioning algorithms are not based on the semantic relatedness of the data and suffer from high preprocessing costs. In this thesis, a semantic-based partitioning method using a novel nexus clustering algorithm is discussed. For every concept, the core properties of the linked data are identified, and bi-level, nexus-based hierarchical agglomerative clustering is used to partition the linked data. The proposed method is evaluated using gold standard test data sampled from DBpedia across eight closely related categories. The proposed clustering technique partitions the linked data with a precision of 98.7% on the gold standard dataset.
    Multiple indexing strategies have been proposed to search and access linked data easily at any given time. All these extensive indexing schemes involve substantial redundant data, which greatly increases the storage and computational resources needed to update the index of the dynamically growing linked data. This thesis introduces ‘trist’, a hybrid data structure combining a tree and a doubly linked list to index linked data. The linked data contain URIs and values, which are separately indexed using a ‘URI trist’ and a ‘Value trist’, respectively. Compared to the existing indexing strategies, this approach reduces storage consumption. The experimental results using the sampled DBpedia dataset demonstrate that trist-based indexing achieves a space saving of 60% compared to regular graph-based storage of linked data, and that accessing the linked data is 6000% faster with trist-based indexing than with the regular graph without indexing.
    The ultimate goal of an information retrieval system is to rank the linked data so that the results appeal to the end user. The existing approaches for ranking linked data are all atomistic. A common problem in ranking linked data is that the data are of various kinds from multiple sources. This thesis presents a holistic approach that ranks linked data from multiple SPARQL endpoints and presents the integrated results. The holistic rank is computed from four subranks: endpoint rank, concept rank, predicate rank and value rank. LDSF also provides an approach to present the URI form of linked data to the user in an easily understandable manner. The ordering of Wikipedia is widely agreed to be readable by its users over the web, so the proposed ranking is evaluated against it: on the sampled DBpedia dataset, the proposed ranking correlates up to 99% with the ordering of Wikipedia. Overall, the proposed LDSF is an efficient framework for storing, partitioning, indexing and ranking linked data that produces more satisfactory query results than existing systems.
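    A minimal sketch may make the word-based encoding idea concrete. The `WordTrie` class and method names below are assumptions for illustration, not the thesis's actual implementation; the point is only that each distinct word receives a compact integer code, so repeated words (common in URIs) cost one code each.
    ```python
    # Illustrative word-based encoding in the spirit of WordCode/WordTrie.
    # Names and design are assumptions, not the thesis's implementation.

    class WordTrie:
        """Trie that maps each distinct word to a compact integer code."""

        def __init__(self):
            self.root = {}
            self.words = []          # code -> word, for decoding

        def get_code(self, word):
            """Return the code for `word`, inserting it into the trie if new."""
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            if "#code" not in node:          # '#code' marks end-of-word
                node["#code"] = len(self.words)
                self.words.append(word)
            return node["#code"]

        def encode(self, text):
            """Encode text as a list of integer word codes."""
            return [self.get_code(w) for w in text.split()]

        def decode(self, codes):
            """Recover the original word sequence from the codes."""
            return " ".join(self.words[c] for c in codes)

    trie = WordTrie()
    codes = trie.encode("linked data is published as linked data")
    print(codes)                 # repeated words reuse the same code
    print(trie.decode(codes))
    ```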
  • Item
    Computational Analysis of Protein Structure and its Subcellular Localization using Amino Acid Sequences
    (National Institute of Technology Karnataka, Surathkal, 2021) Bankapur, Sanjay S.; Patil, Nagamma.
    A cell is the basic unit of all organisms. In a cellular life cycle, various complex metabolic activities are carried out in different cell compartments, and proteins play an important role in many of them. Proteins are generated in the post-transcriptional modification activity of a cell. Initially, a generated protein has a linear structure, called the protein primary structure. Within the cell, proteins tend to move from one compartment (subcellular location) to others, and based on the environment in which they reside, primary structured proteins transform into secondary and tertiary structures. Tertiary structured proteins interact with nearby structured proteins to form a quaternary structure. A protein performs its biological functions when it attains its respective tertiary structure.
    Identification of a protein's structure and its subcellular locations are challenging and important tasks in the field of medical science. Various health issues are identified and solved via novel drug discoveries, and prior, accurate knowledge of a protein's structure and its subcellular location helps in developing the respective drug. To identify protein structure and subcellular locations, various biological methods such as X-ray crystallography, nuclear magnetic resonance spectroscopy, cell fractionation, fluorescence microscopy, and electron microscopy are used. The main advantage of biological methods is that they are accurate in identifying protein structures and their subcellular locations; their disadvantages are that they are time-consuming and very expensive. In this post-genomic era, high volumes of protein primary structures are decoded by various research communities and added to protein data banks. Identification of protein structure and subcellular locations using biological methods is not a feasible option for such high volumes of proteins. Over the decades, various computational methods have been proposed to identify protein structure and its locations; however, the existing computational methods exhibit limited accuracy and hence are less effective.
    The main objective of this thesis is to propose effective computational models that contribute to the prediction of protein structure and its subcellular locations. In this regard, four important and specific problems have been addressed: (i) multiple sequence alignment, (ii) protein secondary structural class prediction, (iii) protein fold recognition, and (iv) protein subcellular localization prediction.
    The importance of multiple sequence alignment is that vital and consistent homologous patterns of proteins can be captured, and these patterns further help in solving protein structure and subcellular localization. The proposed alignment method includes three main modules: (a) an effective scoring system to score the quality of the aligned sequences, (b) a progressive alignment approach, adopted and modified to align multiple sequences, and (c) refinement of the aligned sequences using the proposed polynomial-time single-iterative optimization framework. The proposed method has been assessed on publicly available benchmark datasets and recorded a 17.7% improvement over the CLUSTAL X model on the BAliBASE dataset.
    Identification of the protein secondary structural class is one of the important tasks that further helps in the prediction of protein tertiary structure. Protein secondary structural class prediction is a supervised problem that falls under the multi-class category. The proposed prediction model contains a novel feature modelling strategy that extracts global and local features, followed by a novel ensemble of classifiers to predict the structural class. The model has been assessed on both publicly available benchmark datasets and the latest derived high-volume datasets, and recorded an improvement of 5.3% on the 25PDB dataset over one of the best predictors from the literature.
    Protein fold recognition is the categorization of the various folds a protein exhibits in its tertiary structure; it is a supervised problem that falls under the multi-class category. The proposed fold recognition model contains a novel and effective feature modelling approach that includes Convolutional and SkipXGram bi-gram techniques to extract global and local features, followed by an effective deep learning framework for fold recognition. The model has been assessed on both publicly available benchmark datasets and the latest derived high-volume datasets, and recorded a relative improvement of 5% on the DD dataset over one of the best predictors from the literature.
    Finally, an effective protein sub-chloroplast localization prediction model is proposed to solve a problem one level more microscopic than subcellular localization. Protein sub-chloroplast localization is a supervised problem that falls under the multi-class and multi-label category. The proposed model contains a novel feature extraction technique, the SkipXGram bi-gram, followed by a deep learning framework for multi-label classification. The model has been assessed on publicly available benchmark datasets and recorded an absolute improvement of 30.39% on the Novel dataset over the best predictor from the literature.
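    Since SkipXGram bi-grams drive two of the models above, a small sketch may help. The exact definition used in the thesis may differ; the skip-distance semantics, function name, and normalization below are assumptions for illustration.
    ```python
    # Illustrative skipped bi-gram features: for a skip distance x, count
    # residue pairs (s[i], s[i+x+1]). Definition details are assumptions.

    from collections import Counter

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard residues

    def skip_bigram_features(seq, skip):
        """Return a normalized 400-dim vector of skipped bi-gram counts."""
        counts = Counter()
        for i in range(len(seq) - skip - 1):
            counts[(seq[i], seq[i + skip + 1])] += 1
        total = max(sum(counts.values()), 1)
        return [counts[(a, b)] / total for a in AMINO_ACIDS for b in AMINO_ACIDS]

    # Concatenating several skip values mixes local and more global context.
    seq = "MKVLAAGITGLLAAGITG"
    features = skip_bigram_features(seq, 0) + skip_bigram_features(seq, 1)
    print(len(features))   # 800
    ```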
  • Item
    Windows Malware Detection Techniques Using Static and Behavioural-based Features
    (National Institute of Technology Karnataka, Surathkal, 2020) L, Shiva Darshan S.; D, Jaidhar C.
    The advancement in Internet-based communication technology has enabled malware to achieve its intent without the user's consent. Malware penetrates or harms a computer system's integrity, availability, and confidentiality. Moreover, modern malware is equipped with obfuscation techniques that maximize its capability to defeat anti-malware detection systems and evade detection. The conventional anti-malware detection techniques exhibit inherently delayed effectiveness due to their signature-based detection and are inadequate for ascertaining advanced malware. Therefore, there is a need for a proficient malware detection technique that can precisely identify such malware.
    Traditional Windows malware detection techniques can analyze malware without executing it. These techniques discern the malware by analyzing the static features of Portable Executable (PE) files. However, they are incompetent against emerging advanced malware attacks. To address this, behavioural-based malware detection has emerged as an essential complement to defend against such sophisticated malware. A behavioural-based detection technique monitors and captures the activities of the malware during its runtime: it executes the input PE file in an isolated environment and records its behaviours during execution. However, in real-life scenarios, it is tedious to examine all the recorded features. Hence, identifying significant features from the original feature set is the primary challenge in this technique.
    Several issues remain open in the development of an intricate malware detection system that can resist the attacks caused by malware. Many examinations illustrate that current malware detection systems are easily compromised by sophisticated malware, and although various solutions have been proposed in the literature to uncover malware, each detection approach has its own limitation(s). The present research work aims to propose an approach to detect and classify Windows malware by extracting static features, behavioural features, or a combination of both (hybrid features) from PE files. In this regard, initially, a Malware Detection System (MDS) was designed based on information extracted from the Portable Executable Optional Header Fields (PEOHF) as static features. In addition, to identify the malicious activities of the malware, behaviour analysis of the PE files was also performed by considering the Application Programming Interface (API) calls, APIs with their corresponding category (CAT-API), or system calls invoked by the input PE file during execution. Concurrently, for precise classification, preserving the informative features is highly necessary to detect and distinguish unknown PE files as malware or benign. With this in view, the performance of Feature Selection Techniques (FSTs) in recommending the best features for classifiers to discriminate between benign and malware PEs was evaluated. Subsequently, a malware detection technique based on visualization images was proposed, where the images were generated using the behavioural features suggested by the FST. Moreover, the effectiveness of the hybrid features in the detection of malware was examined based on the significant features recommended by the FSTs.
    Several sets of experiments were carried out to evaluate and demonstrate the potency of the proposed approaches. The efficiency of all the proposed approaches was assessed using real-world malware samples with 10-fold cross-validation tests. Different evaluation metrics such as True Positive Rate (TPR), False Positive Rate (FPR), Precision, Recall, F-Measure, and Accuracy were used to evaluate the proposed approaches. Based on the obtained experimental results, it was observed that the proposed approaches are effective in the detection and classification of Windows malware.
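    The listed metrics all follow directly from the confusion matrix; the small helper below makes the standard definitions concrete. The function name and the example counts are illustrative only.
    ```python
    # Standard detection metrics computed from confusion-matrix counts.

    def detection_metrics(tp, fp, tn, fn):
        """Metrics for a malware-vs-benign classifier."""
        tpr = tp / (tp + fn)                      # True Positive Rate (= Recall)
        fpr = fp / (fp + tn)                      # False Positive Rate
        precision = tp / (tp + fp)
        f_measure = 2 * precision * tpr / (precision + tpr)
        accuracy = (tp + tn) / (tp + fp + tn + fn)
        return {"TPR": tpr, "FPR": fpr, "Precision": precision,
                "Recall": tpr, "F-Measure": f_measure, "Accuracy": accuracy}

    # Hypothetical counts for illustration only.
    print(detection_metrics(tp=95, fp=4, tn=96, fn=5))
    ```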
  • Item
    An Efficient Mapreduce Scheduler for Cloud Environment
    (National Institute of Technology Karnataka, Surathkal, 2020) Jeyaraj, Rathinaraja.; S, Ananthanarayana V.
    Hadoop MapReduce is one of the most cost-effective ways to process a large volume of data for reliable and effective decision-making. As an on-premise Hadoop cluster is not affordable for short-term users, many public cloud service providers like Amazon, Google, and Microsoft offer Hadoop MapReduce and relevant applications as a service via a cluster of virtual machines over the Internet. In general, these Hadoop virtual machines are launched on different physical machines across the cloud data center and are co-located with non-Hadoop virtual machines. This introduces many challenges, most notably a layer of heterogeneities (hardware heterogeneity, virtual machine heterogeneity, performance heterogeneity, and workload heterogeneity) that impacts the performance of the MapReduce job and task schedulers. The presence of physical servers of different configurations and performance in cloud data centers is called hardware heterogeneity. The existence of virtual machines of different sizes in a Hadoop virtual cluster is called virtual machine heterogeneity. Hardware heterogeneity, virtual machine heterogeneity, and interference from co-located non-Hadoop virtual machines together cause varying performance for the same map/reduce task of a job; this is called performance heterogeneity. The latest MapReduce versions allow users to customize the resource capacity (container size) for the map/reduce tasks of different jobs, which causes a batch of MapReduce jobs to be heterogeneous (workload heterogeneity). These heterogeneities are inevitable and profoundly affect the performance of the MapReduce job and task schedulers with respect to job latency, makespan, and virtual resource utilization. Therefore, it is essential to exploit these heterogeneities while offering Hadoop MapReduce as a service to improve MapReduce scheduler performance in real time.
    Existing MapReduce job and task schedulers address some of these heterogeneities but fall short in improving performance. In order to improve these qualities of service further, we propose the following set of methods: a Dynamic Ranking-based MapReduce Job Scheduler (DRMJS) to exploit performance heterogeneity; a Multi-Level Per Node Combiner (MLPNC) to minimize the number of intermediate records in the shuffle phase; Roulette Wheel Scheme (RWS) based data block placement and a constrained 2-dimensional bin packing model to exploit virtual machine and workload level heterogeneities; and Fine-Grained Data Locality Aware (FGDLA) job scheduling, which extends MLPNC to a batch of jobs.
    Firstly, DRMJS is proposed to improve MapReduce job latency and resource utilization by exploiting heterogeneous performance. DRMJS calculates a performance score for each Hadoop virtual machine based on CPU and disk IO for map tasks, and on CPU and network IO for reduce tasks, separately. A rank list is then prepared for scheduling map tasks based on the map performance score, and reduce tasks based on the reduce performance score. Ultimately, DRMJS improved overall job latency, makespan, and resource utilization by up to 30%, 28%, and 60%, respectively, on average compared to existing MapReduce schedulers.
    To improve job latency further, MLPNC is introduced to minimize the number of intermediate records in the shuffle phase, which is responsible for a significant portion of MapReduce job latency. In general, each map task runs a dedicated combiner function to minimize the number of intermediate records. In MLPNC, we split the combiner function from the map task and run a single MLPNC in every Hadoop virtual machine for a set of map tasks of the same job. These map tasks write their output to the common MLPNC, which minimizes the number of intermediate records level by level. Ultimately, MLPNC improved job latency by up to 33% compared to existing MapReduce schedulers for a single job. However, in a production environment, batches of MapReduce jobs are executed periodically. Therefore, to extend MLPNC to a batch of jobs, we introduced the FGDLA job scheduler. Results showed that FGDLA minimized the amount of intermediate data and makespan by up to 62.1% and 32.4%, respectively, compared to existing schedulers.
    Secondly, virtual machine and workload level heterogeneities cause resource underutilization in the Hadoop virtual cluster and impact makespan for a batch of MapReduce jobs. Considering this, we propose RWS based data block placement and a constrained 2-dimensional bin packing model to place heterogeneous map/reduce tasks onto heterogeneous virtual machines. RWS places data blocks based on the processing capacity of each virtual machine, and the bin packing model helps to find the right combination of map/reduce tasks of different jobs for each bin to improve makespan and resource utilization. The experimental results showed that the proposed model improved makespan and resource utilization by up to 57.9% and 59.3% over the MapReduce fair scheduler.
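    A minimal sketch of the DRMJS ranking step described above, under assumed inputs: the metric fields, equal weighting, and normalization to [0, 1] are illustrative, not the thesis's exact scoring model.
    ```python
    # Score each Hadoop VM separately for map and reduce work, then schedule
    # tasks down the ranked lists. Weighting and fields are assumptions.

    def rank_vms(vm_stats):
        """vm_stats: {vm: {'cpu': ..., 'disk_io': ..., 'net_io': ...}},
        each metric normalized to [0, 1] (higher = better performance)."""
        map_score = {vm: s["cpu"] + s["disk_io"] for vm, s in vm_stats.items()}
        reduce_score = {vm: s["cpu"] + s["net_io"] for vm, s in vm_stats.items()}
        map_rank = sorted(map_score, key=map_score.get, reverse=True)
        reduce_rank = sorted(reduce_score, key=reduce_score.get, reverse=True)
        return map_rank, reduce_rank

    stats = {
        "vm1": {"cpu": 0.9, "disk_io": 0.4, "net_io": 0.8},
        "vm2": {"cpu": 0.6, "disk_io": 0.9, "net_io": 0.3},
        "vm3": {"cpu": 0.5, "disk_io": 0.5, "net_io": 0.9},
    }
    map_rank, reduce_rank = rank_vms(stats)
    print(map_rank)      # map tasks go to the best map-ranked VMs first
    print(reduce_rank)   # reduce tasks use the separate reduce ranking
    ```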
  • Item
    Energy Efficient Resource Management and Task Scheduling at the Cloud Data Center
    (National Institute of Technology Karnataka, Surathkal, 2020) Sharma, Neeraj Kumar.; Reddy, G Ram Mohana.
    Due to the growing demand for cloud services, allocation of energy-efficient resources (CPU, memory, storage, etc.) and utilization of these resources are major challenges for a large cloud data center. To meet the ever-increasing demand of customers, more servers are needed at the data center, which in turn requires more cooling devices to keep the data center at a specified temperature, resulting in more energy consumption and CO2 emissions. The user-requested, on-demand virtual machine (VM) allocation problem is widely known to be a combinatorial optimization problem. Due to the large number of physical machines (PMs) present in the data center, this VM allocation problem belongs to the NP-hard/NP-complete complexity class, so finding an optimal solution with a multi-objective approach in polynomial time poses many challenges. Further, the networking devices of a data center, such as switches, consume 10% to 20% of the total energy consumed by IT devices in the data center. Hence, a network-aware VM allocation algorithm is required to minimize the energy consumption of switches and PMs at the cloud data center. In addition, a policy is required for migrating VMs from underutilized PMs to energy-efficient PMs over a period of time without violating the service level agreement (SLA) between the cloud service provider and the customer.
    In order to minimize both energy consumption and resource wastage, this thesis presents multi-objective VM-to-PM allocation using hybrid bio-inspired algorithms (HGACSO, HGAPSO, and HGAPSOSA) based on the GA, CSO, PSO, and SA algorithms. Further, to save the energy consumed by networking switches in the cloud data center, a branch-and-bound based exact algorithm is proposed for the VM allocation problem; it saves the energy consumption of both PMs and networking switches. Further, the proposed VM migration technique and a task scheduling technique based on the First-Fit approximation algorithm not only reduce the energy consumption at the cloud data center but also avoid SLA violations. Experiments were carried out in both homogeneous and heterogeneous cloud data center environments, and the results demonstrated that the proposed VM allocation algorithms outperform the state-of-the-art benchmark and peer research algorithms.
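    Since the scheduling contribution builds on First-Fit, here is a minimal sketch of First-Fit placement under an assumed two-resource (CPU, memory) model; the function and variable names are illustrative, not the thesis's implementation.
    ```python
    # First-Fit for energy-aware allocation: each request goes to the first
    # already-active machine with enough residual capacity; a new machine is
    # powered on only when none fits.

    def first_fit(requests, capacity):
        """requests: list of (cpu, mem) demands; capacity: (cpu, mem) per PM.
        Returns a list of PMs, each a list of the requests placed on it."""
        pms = []          # residual capacity per active PM
        placement = []
        for cpu, mem in requests:
            for i, (rc, rm) in enumerate(pms):
                if cpu <= rc and mem <= rm:            # first PM that fits
                    pms[i] = (rc - cpu, rm - mem)
                    placement[i].append((cpu, mem))
                    break
            else:                                       # none fits: activate a PM
                pms.append((capacity[0] - cpu, capacity[1] - mem))
                placement.append([(cpu, mem)])
        return placement

    vms = [(4, 8), (2, 4), (8, 16), (2, 2), (4, 4)]
    print(first_fit(vms, capacity=(8, 16)))  # fewer active PMs -> less energy
    ```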
  • Item
    Efficient Mining of Frequent Colossal Itemsets from High Dimensional Data
    (National Institute of Technology Karnataka, Surathkal, 2020) Vanahalli, Manjunath K.; Patil, Nagamma.
    The basic and major step of Association Rule Mining (ARM) is itemset mining, and both have a vast range of applications. Conventional feature enumeration-based itemset mining algorithms focus on mining frequent itemsets, frequent closed itemsets, and frequent maximal itemsets from transactional datasets, which consist of a small number of attributes (features) and a large number of rows (samples). The abundance of data across a variety of domains, including bioinformatics, has led to a new form of dataset known as the high dimensional dataset, whose data characteristics differ from those of transactional datasets: high dimensional datasets consist of a large number of features and a small number of rows. The amount of information that can be extracted from high dimensional datasets is potentially huge, but extracting it is a non-trivial task. The results of Frequent Itemset Mining (FIM) and Frequent Closed Itemset Mining (FCIM) algorithms include small and mid-sized itemsets, which do not contain valuable and complete information for decision making. In applications dealing with high dimensional datasets, such as bioinformatics, ARM gives greater importance to large-sized itemsets, known as colossal itemsets. Recent research has therefore focused on mining frequent colossal itemsets and frequent colossal closed itemsets, which are more influential in decision making and significant for many applications, especially in the field of bioinformatics.
    The preprocessing technique of existing frequent colossal itemset mining and frequent colossal closed itemset mining algorithms fails to prune the complete set of insignificant features and rows. An Effective Improved Preprocessing (EIP) technique has been proposed to prune the complete set of insignificant features and rows, which limits the growth of the mining search space. The existing frequent colossal itemset mining algorithms mine a limited set of frequent colossal itemsets, leading to the generation of an incomplete set of association rules, which consequently affects decision making. A frequent colossal itemset mining algorithm has therefore been proposed that achieves better accuracy than existing algorithms in terms of the number of frequent colossal itemsets mined from a high dimensional dataset.
    The existing algorithms for mining Frequent Colossal Closed Itemsets (FCCI) from high dimensional datasets do not include an efficient pruning strategy and closeness checking method. To overcome these drawbacks, an algorithm equipped with an efficient Rowset Cardinality Table (RCT) based closeness checking method and pruning strategy has been proposed to efficiently mine FCCI from a high dimensional dataset. The existing algorithms are also inefficient at mining FCCI from datasets consisting of a large number of both features and rows, as they cannot handle the changing characteristics of the data subset during the mining process; a combination of different enumeration methods is required to efficiently handle the different characteristics possessed by different datasets. A dynamic switching algorithm has therefore been proposed to efficiently mine FCCI from datasets consisting of a large number of features and rows, handling the changing characteristics of the data subset during the mining process. The dynamic switching algorithm is equipped with an Itemset Support Table (IST) based closeness checking method and pruning strategy.
    Finally, the existing algorithms for mining FCCI from high dimensional datasets are sequential and computationally expensive. Distributed and parallel computing is a good strategy to overcome this inefficiency, and a parallel row enumerated algorithm has been proposed to efficiently mine FCCI from the high dimensional dataset. Traversing the row enumerated tree is the best solution for mining FCCI from the high dimensional dataset, but the row enumerated tree is intrinsically unbalanced, as the number of nodes in each branch varies. A distributed and parallel algorithm with load balancing has been designed to address this inefficiency of existing works.
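    The closedness test that the RCT/IST structures accelerate can be stated compactly: an itemset is closed iff it equals the intersection of all rows that contain it. The direct, unaccelerated computation below is a sketch; the function names and toy dataset are illustrative.
    ```python
    # Rowset-based closedness check on a tiny binary dataset.

    def rowset(itemset, rows):
        """Indices of rows that contain every item of `itemset`."""
        return {i for i, row in enumerate(rows) if itemset <= row}

    def closure(itemset, rows):
        """Intersection of all supporting rows = the smallest closed superset."""
        support = rowset(itemset, rows)
        return set.intersection(*(rows[i] for i in support)) if support else set()

    rows = [{"a", "b", "c", "d"},        # each row: items present in that sample
            {"a", "b", "c"},
            {"a", "b", "e"}]

    print(rowset({"a", "b"}, rows))      # {0, 1, 2}
    print(closure({"a", "b"}, rows))     # {'a', 'b'} -> closed
    print(closure({"c"}, rows))          # {'a', 'b', 'c'} -> {'c'} is not closed
    ```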
  • Item
    An Efficient Trusted Framework for Context Aware Sensor driven Pervasive Applications and their Integration using Ontologies
    (National Institute of Technology Karnataka, Surathkal, 2020) N, Karthik.; S, Ananthanarayana V.
    A pervasive computing application consists of various types of sensors, actuators, and sets of protocols and services for monitoring physical and environmental circumstances and happenings; it collects data and acts autonomously to serve the user. Pervasive computing is built on recent advancements in mobile computing, distributed computing, wireless communications, embedded systems and context-aware computing that make computing devices smaller and more capable of perception, communication and computation. Sensor nodes play an important role in a pervasive computing environment: they are expected to be installed in various pervasive applications to detect real-world events and respond accordingly. Tiny sensor nodes are embedded invisibly in everyday objects, providing ubiquitous access to information services. Due to recent advancements in sensors and wireless technologies, pervasive computing is bringing heterogeneous sensors into our everyday life to provide better services.
    A massive amount of data is generated from the sensor nodes of a pervasive environment and forwarded to the sink node through the gateway for data analysis and event detection. The sensed data from a pervasive computing application suffer from data faults and missing data due to the unfriendly, harsh environment and resource restrictions. In most cases, the generated data can be shared among different applications in the pervasive environment to increase user comfort and the reliability of the application, and to achieve the application's full potential; the shared data play a vital role in critical decision making. However, the data generated from various sensors conflict in types, formats, and representations, which makes them difficult for nodes to process and infer from. Various types of sensor nodes and other devices lead to the generation of heterogeneous data, which constrains a pervasive application's ability to understand and use the data effectively. A data interoperability problem occurs when different pervasive applications interact with each other. Furthermore, with the rise of several sensor node manufacturers, pervasive computing faces a problem in the data integration process: because of data heterogeneity, the data cannot be shared with other applications, which leads to an interoperability problem in the pervasive environment.
    The objective of this thesis is to share trustworthy data and offer interoperability across different trusted context-aware pervasive applications. To deal with data faults, data loss and event detection, Trust Management Schemes (TMS) are proposed; to solve the interoperability problem, a hybrid ontology matching technique is proposed. Sensor data modeling is the basis for all TMS in sensor networks, so an energy-efficient hybrid sensor data modeling approach for data fault detection, data reconstruction and event detection is proposed, along with an analysis of the energy consumption of data fault detection in various environments. This thesis introduces Trust-based Data Gathering (TDG) in sensor networks, which focuses on trust-based data collection, trust-based data aggregation, and trust-based data reconstruction, to show that the absence of trust in a sensor-driven harsh pervasive environment consumes more energy, introduces delay in handling untrustworthy data and untrustworthy nodes, and affects the normal functionality of the application.
    This thesis then presents the Hybrid Trust Management Scheme (HTMS) for sensor networks, which assigns trust scores to nodes and data based on an interdependency property. A correlation metric and provenance data are used to score the sensed data, and the data trust score is utilized for decision making; communication trust and provenance data are used to evaluate the trust scores of the intermediate nodes and the source node. The Context-Aware Trust Management Scheme (CATMS) is introduced for pervasive healthcare systems for data fault detection, data reconstruction and medical event detection. It employs heuristic functions, data correlation, and contextual-information based algorithms to identify data faults and events, and it reconstructs faulty and lost data to detect events reliably. This work aims to alert the caregiver and raise an alarm only when the patient enters a medical emergency. Finally, this thesis investigates hybrid ontology matching using an upper ontology to solve the semantic heterogeneity and interoperability problems. It combines direct and indirect matching techniques with an upper ontology to share and integrate data semantically, establishing semantic correspondences among the various entities of pervasive application ontologies. To assess the efficiency of the proposed framework, we carried out experiments with the Intel Berkeley lab dataset, the SensorScope dataset, and data samples collected by a medical sensor network prototype of a pervasive healthcare application. The experimental results show that the proposed framework shares trustworthy data and offers interoperability across different trusted context-aware pervasive applications.
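    A toy sketch of the interdependent data/node scoring idea follows; the weighting, the median-based consistency measure, and all names are assumptions for illustration, not the thesis's exact HTMS formulation.
    ```python
    # A reading's trust combines (a) its consistency with correlated
    # neighbour readings and (b) the trust of nodes on its provenance path.

    def data_trust(reading, neighbour_readings, provenance_trust, w=0.7):
        """Trust in [0, 1] for one sensed value."""
        # Consistency: closeness of the reading to the neighbourhood median.
        med = sorted(neighbour_readings)[len(neighbour_readings) // 2]
        spread = max(neighbour_readings) - min(neighbour_readings) or 1.0
        consistency = max(0.0, 1.0 - abs(reading - med) / spread)
        # Provenance: average trust of source and forwarding nodes.
        path_trust = sum(provenance_trust) / len(provenance_trust)
        return w * consistency + (1 - w) * path_trust

    # A plausible reading vs. an outlier, both relayed via two nodes.
    print(data_trust(21.5, [21.0, 21.3, 21.8, 22.0], provenance_trust=[0.9, 0.8]))
    print(data_trust(35.0, [21.0, 21.3, 21.8, 22.0], provenance_trust=[0.9, 0.8]))
    ```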
  • Item
    Predictive Analytics Based Integrated Framework for Intelligent Healthcare Applications
    (National Institute of Technology Karnataka, Surathkal, 2020) Krishnan, Gokul S.; S, Sowmya Kamath.
    Healthcare analytics is a field that deals with the examination of underlying patterns in healthcare data in order to determine ways in which clinical care can be improved, in terms of patient care, hospital management and cost optimization. Towards this end, health information technology systems such as Clinical Decision Support Systems (CDSSs) have received extensive research attention over the years. A CDSS is designed to provide physicians and other health professionals with assistance in clinical decision-making tasks, based on automated analysis of patient data and other knowledge sources. Recent advancements in Big Data and healthcare analytics have seen an emerging trend of applying Artificial Intelligence techniques to healthcare data to support essential applications like disease prediction, mortality prediction, symptom analysis, and epidemic prediction. Despite the major advantages offered by CDSSs, several issues need to be overcome to achieve their full potential: there is scope for significant improvement in patient data modeling strategies and prediction models, especially with respect to clinical data of an unstructured nature.
    In this research thesis, various approaches are presented for building decision support systems for patient-centric and population-centric predictive analytics on large healthcare data of both structured and unstructured nature. For structured data, an empirical study was performed to observe the effect of feature modeling on mortality prediction performance, which revealed the need for an extensive study of the relative relevance of features contributing to mortality risk prediction. Towards this, a Genetic Algorithm based wrapper feature selection method was proposed for determining the most relevant lab events that help in effective patient-specific mortality prediction.
    Clinical data in the form of unstructured text, being rich in patient-specific information, has remained largely unexplored and could potentially be used to leverage effective CDSS development. Towards this, an Extreme Learning Machine based patient-specific mortality prediction model built on ECG text reports of cardiac patients was proposed. The approach, which involved word embedding based feature modeling and an unsupervised data cleansing technique to filter out anomalous data, underscored the importance of effective word embeddings. Therefore, our next objective was to study word embedding models and their role in feature modeling for building effective CDSSs: a benchmarking study on the performance of word representation models for patient-specific mortality prediction using unstructured clinical notes was performed. Our next objective involved analyzing and utilizing unstructured clinical notes for building effective disease prediction models. An ontology-driven feature modeling approach was proposed for designing a disease group prediction model built on unstructured radiology reports. To solve the sparsity and high dimensionality problems of this approach, another feature modeling approach based on Particle Swarm Optimization (PSO) and neural networks was proposed to further enhance the performance of disease group prediction models using unstructured radiology reports. With the objective of analyzing physician notes, a hybrid feature modeling approach was proposed to leverage the latent information embedded in unstructured patient records for disease group prediction.
    Towards addressing the incremental and redundant nature of unstructured clinical notes, aggregation of nursing notes using the TAGS and FarSight approaches was also explored for effective disease group prediction, which demonstrated significant potential for enabling early disease diagnosis. For population health analysis (flu vaccine hesitancy, flu vaccine behaviour and depression detection), a generic model called the Multi-task Deep Social Health Analyzer (MDSHA) was proposed, which uses a PSO based topic modeling approach for effective feature representation and predictive modeling. All proposed approaches were compared to existing state-of-the-art approaches for the respective prediction tasks using standard datasets. The promising results achieved underscore the superior performance of the approaches designed in this research and reveal much scope for adoption in the healthcare field for improving the quality of healthcare.
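    A minimal sketch of the GA-based wrapper feature selection idea follows. In the real wrapper, fitness would be a classifier's cross-validated mortality-prediction score on the lab events selected by the mask; here a toy fitness keeps the sketch self-contained. All operator choices, parameters, and names are assumptions.
    ```python
    # Wrapper feature selection with a genetic algorithm: chromosomes are
    # binary feature masks; fitness rewards masks that keep useful features.

    import random

    def ga_select(n_features, fitness, pop=20, gens=30, p_mut=0.05):
        population = [[random.randint(0, 1) for _ in range(n_features)]
                      for _ in range(pop)]
        for _ in range(gens):
            scored = sorted(population, key=fitness, reverse=True)
            parents = scored[: pop // 2]                 # truncation selection
            children = []
            while len(children) < pop - len(parents):
                a, b = random.sample(parents, 2)
                cut = random.randrange(1, n_features)    # one-point crossover
                child = [g ^ (random.random() < p_mut)   # bit-flip mutation
                         for g in a[:cut] + b[cut:]]
                children.append(child)
            population = parents + children
        return max(population, key=fitness)

    # Toy fitness: pretend only features 2, 5, 7 are informative; the small
    # penalty on mask size stands in for preferring fewer lab events.
    def toy_fitness(mask):
        return sum(mask[i] for i in (2, 5, 7)) - 0.1 * sum(mask)

    print(ga_select(10, toy_fitness))
    ```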
  • Item
    Development of Unobtrusive Affective Computing Framework for Students’ Engagement Analysis in Classroom Environment
    (National Institute of Technology Karnataka, Surathkal, 2020) S, Ashwin T.; Reddy, G Ram Mohana.
    Pervasive intelligent learning environments can be made more personalized by adapting teaching strategies according to the students' emotional and behavioral engagement. Students' engagement analysis helps to foster those emotions and behavioral patterns that are beneficial to learning, thus improving the effectiveness of the teaching-learning process. The students' emotional and behavioral patterns are to be recognized unobtrusively using learning-centered emotions (engaged, confused, frustrated, and so on) and engagement levels (looking away from the tutor or board, eyes completely closed, and so on). Recognizing both behavioral and emotional engagement from students' image data in the wild (obtained from classrooms) is a challenging task. Using a multitude of modalities enhances the performance of affective state classification, but recognizing the facial expressions, hand gestures, and body posture of each student in a classroom environment is another challenge. Here, classification of affective states alone is not sufficient; object localization also plays a vital role. Both classification and object localization should be robust enough to perform well across image variants such as occlusion, background clutter, pose, illumination, cultural and regional background, intra-class variations, cropped images, multi-point views, and deformations. The most popular state-of-the-art classification and localization techniques are machine and deep learning techniques that depend on a database for the ground truth, so a standard database that contains data from different learning environments with a multitude of modalities is also required. Hence, in this research work, different deep learning architectures are proposed to classify the students' affective states with object localization, and a standard database with students' multimodal affective states is created and benchmarked. The students' affective states obtained from the proposed real-time affective state classification method are used as feedback to the teacher in order to enhance the teaching-learning process in four different learning environments, namely e-learning, classrooms, webinars and flipped classrooms. More details on the contributions of this thesis are as follows.
    A real-time students' emotional engagement analysis method is proposed for both individual students and groups of students based on their facial expressions, hand gestures, and body postures for the e-learning, flipped classroom, classroom, and webinar environments. Both basic and learning-centered emotions are used in the study, and various CNN based architectures are proposed to predict the students' emotional engagement. A students' behavioral engagement analysis method is also proposed and implemented in classrooms and computer-enabled teaching laboratories; the proposed scale-invariant, context-assisted single-shot CNN architecture performed well for multiple students in a single image frame. A single group engagement level score for each frame is obtained using the proposed feature fusion technique. The proposed model effectively classifies the students' affective states into teacher-centric attentive and inattentive affective states, and inquiry interventions are proposed to address the negative impact of inattentive affective states on the performance of students. Experimental results demonstrated a positive correlation between the students' learning rate and their attentive affective state engagement score for both individual students and groups of students. Further, an affective state transition diagram and visualizations are proposed to help the students and teachers improve the teaching-learning process.
    A multimodal database is created for both the e-learning (single student in a single image frame) and classroom (multiple students in a single image frame) environments using the students' facial expressions, hand gestures, and body postures. Both posed and spontaneous expressions are collected to make the training set more robust, and various image variants are considered during dataset creation. Annotations are performed using a gold standard study for eleven different affective states and four different engagement levels. Object localization is performed on each modality of every student, and the bounding box coordinates are stored along with the affective state/engagement level. This database is benchmarked with various popular classification algorithms and state-of-the-art deep learning architectures.
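    A toy sketch of how per-student, per-modality predictions could be fused into the single group engagement score mentioned above; the modality weights and all names are assumptions for illustration, not the thesis's fusion technique.
    ```python
    # Fuse each student's modality scores, then average over detected
    # students to get one engagement level per frame.

    WEIGHTS = {"face": 0.5, "gesture": 0.25, "posture": 0.25}

    def student_score(modality_scores):
        """Weighted fusion of one student's modality scores in [0, 1]."""
        return sum(WEIGHTS[m] * s for m, s in modality_scores.items())

    def group_engagement(frame):
        """frame: one modality-score dict per detected student."""
        return sum(student_score(s) for s in frame) / len(frame)

    frame = [
        {"face": 0.9, "gesture": 0.7, "posture": 0.8},   # attentive student
        {"face": 0.3, "gesture": 0.4, "posture": 0.2},   # inattentive student
    ]
    print(round(group_engagement(frame), 3))   # single score for the frame
    ```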
  • Item
    Mobility Management Protocols for Low power and Lossy Networks
    (National Institute of Technology Karnataka, Surathkal, 2019) Sanshi, Shridhar; C. D, Jaidhar
    The Internet of Things (IoT) is emerging as a new paradigm for information systems, as things are seamlessly integrated with computation and communication capabilities. The Wireless Sensor Network (WSN) is a key component of the IoT environment; it is typically composed of large numbers of resource-constrained devices that exploit multi-hop data delivery over wireless links. Recently, Internet Protocol (IP) based WSNs have gained popularity due to the many opportunities they provide for direct communication with the WSN as well as remote access to the sensor data. On the other hand, assigning IP addresses to sensor devices raises numerous challenges due to their resource constraints. Nevertheless, the Internet Engineering Task Force has developed the IPv6 over low power wireless personal area network (6LoWPAN) adaptation layer, which enables IPv6 communication over the IEEE 802.15.4 layer, and has standardized the IPv6 Routing Protocol for Low power and Lossy Networks (RPL) to route packets over the 6LoWPAN adaptation layer. The RPL is a gradient-based routing protocol with bidirectional links that aims to build a robust multi-hop mesh topology based on routing metrics and constraints. However, several issues remain open for improvement and specification, in particular with respect to the node mobility that arises in real-time scenarios. Several examinations have illustrated that the RPL is adversely affected under mobility, and the various solutions proposed in the literature to support mobility in the RPL have limitations.
    In order to address these issues, this thesis aims to support mobility in the RPL with enhanced performance. The effects of mobility in the RPL are evaluated with different Objective Functions (OFs), namely Objective Function Zero, the Energy-based Objective Function, the Delay-Efficient Objective Function, and the Minimum Rank with Hysteresis Objective Function, under different mobility models. Subsequently, the thesis proposes a Multimetrics based OF (MMOF) that accounts for the node type by considering node properties as well as link properties. It proposes new mechanisms to update the Preferred Parent Node (PPN) based on control messages so as to maintain connectivity to the DODAG root. Further, various timer modules are incorporated into the proposed techniques in order to maintain up-to-date neighbour node information.
    To evaluate the efficacy of the proposed protocols, simulations were carried out using the Contiki-based Cooja simulator, varying the system and traffic parameters. The simulations were repeated three times, and the average of the results was considered for evaluating the performance. Different evaluation metrics, namely the Packet Delivery Ratio (PDR), power consumption, end-to-end delay, and the number of control messages, were considered to evaluate the performance of the proposed protocols. Based on the obtained experimental results, it was observed that under mobility the OFs have a direct effect on the evaluation metrics. The proposed MMOF, along with a mechanism to update the PPN, showed improved performance in terms of PDR and power consumption compared to other protocols.
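    A minimal sketch of multi-metric preferred-parent selection in the spirit of MMOF; the chosen metrics, weights, and names are assumptions for illustration rather than the thesis's exact objective function.
    ```python
    # Score each candidate parent from link and node properties; the node
    # with the best (lowest) composite score becomes the preferred parent.

    def parent_score(c, w_etx=0.5, w_energy=0.3, w_delay=0.2):
        """Lower is better. `c` holds metrics normalized to [0, 1]."""
        return (w_etx * c["etx"]                # expected transmission count (link)
                + w_energy * (1 - c["energy"])  # residual energy (node)
                + w_delay * c["delay"])         # observed forwarding delay (link)

    def select_ppn(candidates):
        """Pick the preferred parent node among advertised candidates."""
        return min(candidates, key=lambda name: parent_score(candidates[name]))

    candidates = {
        "n1": {"etx": 0.2, "energy": 0.9, "delay": 0.3},
        "n2": {"etx": 0.1, "energy": 0.4, "delay": 0.2},
        "n3": {"etx": 0.5, "energy": 0.8, "delay": 0.1},
    }
    print(select_ppn(candidates))   # PPN re-evaluated as control messages arrive
    ```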