Efficient Mining of Frequent Colossal Itemsets from High Dimensional Data

Vanahalli, Manjunath K.

Please use this identifier to cite or link to this item: https://idr.nitk.ac.in/jspui/handle/123456789/16864

Title:	Efficient Mining of Frequent Colossal Itemsets from High Dimensional Data
Authors:	Vanahalli, Manjunath K.
Supervisors:	Patil, Nagamma.
Keywords:	Department of Information Technology;Bioinformatics;High Dimensional Dataset;Data Characteristics;Preprocessing;Frequent Colossal Itemsets;Frequent Colossal Closed Itemsets;Rowset Cardinality Table;Itemset Support Table;Dynamic Switching;Pruning Strategy;Closeness Checking;Parallel algorithm;Load Balancing
Issue Date:	2020
Publisher:	National Institute of Technology Karnataka, Surathkal
Abstract:	Itemset mining is the fundamental first step of Association Rule Mining (ARM), and both have a wide range of applications. Conventional feature enumeration-based itemset mining algorithms focus on mining frequent itemsets, frequent closed itemsets, and frequent maximal itemsets from transactional datasets, which consist of a small number of attributes (features) and a large number of rows (samples). The abundance of data across a variety of domains, including bioinformatics, has led to a new form of dataset known as the high dimensional dataset, whose data characteristics differ from those of transactional datasets: high dimensional datasets consist of a large number of features and a small number of rows. The amount of information that can be extracted from high dimensional datasets is potentially huge, but extracting it is a non-trivial task. The results of Frequent Itemset Mining (FIM) and Frequent Closed Itemset Mining (FCIM) algorithms include small and mid-sized itemsets, which do not carry valuable and complete information for decision making. In applications dealing with high dimensional datasets, such as bioinformatics, ARM gives greater importance to large-sized itemsets, known as colossal itemsets. Recent research has focused on mining frequent colossal itemsets and frequent colossal closed itemsets, which are more influential in decision making and are significant for many applications, especially in the field of bioinformatics. The preprocessing techniques of existing frequent colossal itemset mining and frequent colossal closed itemset mining algorithms fail to prune the complete set of insignificant features and rows. An Effective Improved Preprocessing (EIP) technique has been proposed to prune the complete set of insignificant features and rows, which limits the growth of the mining search space.
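The pruning of insignificant features and rows described above can be illustrated with a minimal sketch. This is not the EIP technique itself, whose details the abstract does not give; it is a generic fixpoint pruning loop under two assumed thresholds. The names `prune_dataset`, `min_support`, and `min_cardinality` are hypothetical, introduced only for illustration.

```python
from collections import Counter


def prune_dataset(rows, min_support, min_cardinality):
    """Generic sketch (not the thesis's EIP): repeatedly drop items that occur
    in fewer than min_support rows and rows holding fewer than min_cardinality
    items, until a full pass changes nothing."""
    rows = [set(row) for row in rows]
    while True:
        # Count in how many of the remaining rows each item occurs.
        support = Counter(item for row in rows for item in row)
        frequent = {item for item, count in support.items() if count >= min_support}
        # Strip infrequent items, then drop rows too short to hold a colossal itemset.
        pruned = [row & frequent for row in rows]
        pruned = [row for row in pruned if len(row) >= min_cardinality]
        if pruned == rows:
            return pruned
        rows = pruned
```

Removing a row can push an item below the support threshold, and removing an item can push a row below the cardinality threshold, which is why the sketch loops until nothing changes instead of making a single pass.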
The existing frequent colossal itemset mining algorithm mines a limited set of frequent colossal itemsets, leading to the generation of an incomplete set of association rules, which in turn affects decision making. A frequent colossal itemset mining algorithm has been proposed to achieve better accuracy than existing algorithms in terms of the number of frequent colossal itemsets mined from the high dimensional dataset. The existing algorithms for mining Frequent Colossal Closed Itemsets (FCCI) from the high dimensional dataset lack an efficient pruning strategy and closeness checking method. To overcome these drawbacks, an algorithm with an efficient Rowset Cardinality Table (RCT) based closeness checking method and pruning strategy has been proposed to efficiently mine FCCI from the high dimensional dataset. The existing algorithms are also inefficient in mining FCCI from datasets consisting of a large number of features and rows, as they cannot efficiently handle the changing characteristics of the data subset during the mining process. A combination of different enumeration methods is required to efficiently handle the different characteristics possessed by different datasets. A dynamic switching algorithm has been proposed to efficiently mine FCCI from datasets consisting of a large number of features and rows. It efficiently handles the changing characteristics of the data subset during the mining process and incorporates an Itemset Support Table (IST) based closeness checking method and pruning strategy. The existing algorithms for mining FCCI from high dimensional datasets are sequential and computationally expensive. Distributed and parallel computing is a good strategy to overcome the inefficiency of the existing sequential algorithms.
The inefficiency of the existing sequential algorithms has been overcome by proposing a parallel row enumerated algorithm to efficiently mine FCCI from the high dimensional dataset. Traversing the row enumerated tree is the best solution for mining FCCI from the high dimensional dataset. The row enumerated tree is intrinsically unbalanced, as the number of nodes in each of its branches varies. A distributed and parallel algorithm with load balancing has been designed to address the inefficiency of existing works.
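To make the idea of row enumeration concrete, the following is a brute-force sketch, not the thesis's algorithm: it omits the RCT and IST based pruning, dynamic switching, and parallelism, and it runs in time exponential in the number of rows. It relies on the fact that the set of items shared by any group of rows is a closed itemset. The function name and both thresholds are hypothetical.

```python
from itertools import combinations


def closed_colossal_itemsets(rows, min_support, min_cardinality):
    """Brute-force row enumeration (illustrative only): intersect every rowset
    of at least min_support rows and keep the intersections that contain at
    least min_cardinality items. Returns {itemset: support}."""
    rows = [frozenset(row) for row in rows]
    result = {}
    # Rowsets smaller than min_support cannot be the full support of a frequent itemset.
    for size in range(max(1, min_support), len(rows) + 1):
        for rowset in combinations(range(len(rows)), size):
            # Items common to every row of the rowset form a closed itemset.
            itemset = frozenset.intersection(*(rows[i] for i in rowset))
            if len(itemset) < min_cardinality:
                continue
            # Support is the number of rows that contain the whole itemset.
            result[itemset] = sum(1 for row in rows if itemset <= row)
    return result
```

Because the enumeration is over rows, not features, its cost is driven by the number of rows, which is why this direction suits datasets with many features and few rows. Different rowsets often yield the same itemset, so a practical algorithm needs pruning and a check against already-found itemsets to avoid the repeated work this sketch performs.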
URI:	http://idr.nitk.ac.in/jspui/handle/123456789/16864
Appears in Collections:	1. Ph.D Theses

Files in This Item:

File	Description	Size	Format
145063IT14F02.pdf		9.16 MB	Adobe PDF
