Title: An improved K-means algorithm using modified cosine distance measure for document clustering using Mahout with Hadoop
Authors: Sahu, L.
Mohan, B.R.
Issue Date: 2015
Citation: 9th International Conference on Industrial and Information Systems, ICIIS 2014, 2015
Abstract: In this paper, we propose a novel K-means algorithm with a modified Cosine Distance Measure for clustering large datasets such as the latest Wikipedia articles and the Reuters dataset. We customize the Cosine Distance Measure used to compute similarity between objects in order to improve cluster quality. Our method calculates the distance between objects with the Cosine Distance Measure and then brings close objects closer by squaring the distance if it lies between 0 and 0.5; otherwise it increases the distance. This minimizes Intra-cluster distance and maximizes Inter-cluster distance. We measure cluster quality in terms of Inter-cluster and Intra-cluster distances, good feature weighting such as TF-IDF, Cluster size, and the Top terms of the clusters. We compare the K-means algorithm under the Cosine and modified Cosine Distance Measures using performance metrics such as Inter-cluster and Intra-cluster distances, Cluster size, and Execution time. Our experimental results show a 0.016% reduction in Intra-cluster distance, a 0.012% increase in Inter-cluster distance, a 1.5% reduction in cluster size, and a 4% reduction in sequence file size, which indicates good cluster quality. © 2014 IEEE.
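
The distance adjustment described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' Mahout implementation. The abstract says only that distances above 0.5 are increased, without giving the formula, so the square root used here is an assumption chosen for illustration. The function names `cosine_distance` and `modified_cosine_distance` are likewise hypothetical.

```python
import math


def cosine_distance(a, b):
    """Standard cosine distance: 1 - cos(angle between vectors a and b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # ASSUMPTION: an all-zero vector is treated as maximally distant.
        return 1.0
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 1.0 - dot / (norm_a * norm_b))


def modified_cosine_distance(a, b):
    """Shrink small distances and enlarge large ones, per the abstract."""
    d = cosine_distance(a, b)
    if d <= 0.5:
        # Stated in the abstract: square the distance when it is in [0, 0.5].
        return d * d
    # ASSUMPTION: the abstract only says "increase it"; the square root is
    # one possible choice, since sqrt(d) > d for 0.5 < d < 1.
    return math.sqrt(d)


if __name__ == "__main__":
    doc = [1.0, 0.0]
    near = [1.0, 1.0]   # cosine distance below 0.5
    far = [1.0, 3.0]    # cosine distance above 0.5
    print(cosine_distance(doc, near), modified_cosine_distance(doc, near))
    print(cosine_distance(doc, far), modified_cosine_distance(doc, far))
```

With non-negative TF-IDF vectors the cosine distance lies in [0, 1], so under this assumed rule pairs that are already close move closer and distant pairs move further apart, while a distance of exactly 0 or 1 is left unchanged.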
Appears in Collections: 2. Conference Papers

