Weighted frequent pattern based agglomerative clustering for large unstructured text data

Kanimozhi, K.V.; Rajakumar, K.S.; Venkatesan, M.

dc.contributor.author	Kanimozhi, K.V.
dc.contributor.author	Rajakumar, K.S.
dc.contributor.author	Venkatesan, M.
dc.date.accessioned	2026-02-05T09:28:39Z
dc.date.issued	2020
dc.description.abstract	Processing large amounts of text with traditional clustering methods is a key challenge. Researchers have proposed various clustering approaches for analyzing unstructured data. Frequent-itemset-based clustering is one of the most widely used methods in text analytics. One of the most effective methods for text clustering mines Frequent Weighted Utility Itemsets (FWUIs) and then clusters with the Maximum Capturing (MC) algorithm. However, the MC clustering algorithm, which operates on the similarity matrix, produces many irrelevant clusters. In this work, Weighted Frequent Pattern based Agglomerative Clustering (WFUP_AC) is proposed for clustering large text data. First, the Term Frequency (TF) is calculated for each term in the documents to create a weight matrix for all documents. The weights of terms in documents are based on the Inverse Document Frequency (IDF). The WFUP algorithm is then applied to mine Weighted Frequent Utility Patterns (WFUPs) from a number matrix and the weights of terms in documents. Next, based on the frequent utility itemsets, a similarity matrix is built over the documents, in which each entry corresponds to the frequent itemsets common to a pair of documents. A distance matrix is then calculated from the similarity matrix, and finally Hierarchical Agglomerative Clustering with complete linkage is applied to the distance matrix, with the dendrogram cut as needed. The proposed method was evaluated on two text document data sets, Newsgroup and Reuters, at sizes of 100, 300, 500, and 1000 documents. The experimental results show that WFUP_AC improves text clustering accuracy compared with MC clustering using Frequent Itemsets (FIs) and FWUIs. © 2020 SERSC.
dc.identifier.citation	International Journal of Control and Automation, 2020, 13, 2 Special Issue, pp. 151-164
dc.identifier.issn	20054297
dc.identifier.uri	https://idr.nitk.ac.in/handle/123456789/23936
dc.publisher	Science and Engineering Research Support Society, ijbsbt@sersc.org, PO Box 5014, Sandy Bay, TAS 7005, Tasmania
dc.subject	Agglomerative Clustering
dc.subject	Frequent Pattern Mining
dc.subject	Minimum Support
dc.subject	Text Clustering
dc.subject	Unstructured data
dc.title	Weighted frequent pattern based agglomerative clustering for large unstructured text data
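
The pipeline outlined in the abstract (term weights, weighted frequent itemsets, similarity matrix, distance matrix, complete-linkage agglomerative clustering) can be illustrated with a minimal Python sketch. This is not the authors' WFUP_AC implementation: the record does not give the WFUP algorithm or its formulas, so the smoothed IDF weight, the weighted-support definition, the brute-force itemset enumeration, the count-based similarity, the distance normalization, and every function name below are assumptions chosen only for illustration.

```python
import math
from itertools import combinations

def idf_weights(docs):
    """Smoothed IDF per term, log(1 + N/df). The smoothing is an assumption."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return {term: math.log(1 + n / count) for term, count in df.items()}

def weighted_frequent_itemsets(docs, weights, min_wsup, max_size=2):
    """Brute-force stand-in for a weighted frequent pattern miner (assumed)."""
    sets = [set(d) for d in docs]  # documents are assumed to be non-empty
    # Assumed transaction weight: mean term weight of the document.
    tw = [sum(weights[t] for t in s) / len(s) for s in sets]
    total = sum(tw)
    vocab = sorted(set().union(*sets))
    frequent = []
    for size in range(1, max_size + 1):
        for items in combinations(vocab, size):
            cand = frozenset(items)
            # Weighted support: weight share of the documents containing cand.
            wsup = sum(w for s, w in zip(sets, tw) if cand <= s) / total
            if wsup >= min_wsup:
                frequent.append(cand)
    return frequent

def similarity_matrix(docs, itemsets):
    """Entry (i, j) counts the frequent itemsets contained in both documents."""
    covered = [{fs for fs in itemsets if fs <= set(d)} for d in docs]
    n = len(docs)
    return [[len(covered[i] & covered[j]) for j in range(n)] for i in range(n)]

def distance_matrix(sim):
    """Assumed conversion: 1 - sim / max(sim), with a zero diagonal."""
    top = max(max(row) for row in sim) or 1
    n = len(sim)
    return [[0.0 if i == j else 1.0 - sim[i][j] / top for j in range(n)]
            for i in range(n)]

def complete_linkage(dist, k):
    """Merge the closest clusters until k remain (dendrogram cut at k)."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Complete linkage: cluster distance is the largest pairwise distance.
            d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy corpus, invented for this example (not the Newsgroup or Reuters data).
docs = [
    ["stock", "market", "price"],
    ["stock", "market", "trade"],
    ["game", "team", "score"],
    ["game", "team", "coach"],
]
weights = idf_weights(docs)
itemsets = weighted_frequent_itemsets(docs, weights, min_wsup=0.4)
dist = distance_matrix(similarity_matrix(docs, itemsets))
clusters = complete_linkage(dist, k=2)
print(clusters)  # [[0, 1], [2, 3]]
```

On this toy corpus the two finance documents share three frequent itemsets, as do the two sports documents, so cutting at two clusters separates the topics. For corpora of the sizes mentioned in the abstract, a dedicated pattern miner and a library clustering routine would likely replace the brute-force loops shown here.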

Collections

Journal Articles