This paper presents the results of an experimental study of two main approaches to document clustering, agglomerative hierarchical clustering and K-means (standard K-means and bisecting K-means).
The two basic approaches to generating a hierarchical clustering are agglomerative and divisive. The paper evaluated agglomerative techniques in the comparison. It then described the agglomerative clustering algorithm, the K-means algorithm, and the bisecting K-means algorithm in details.
Visualization of the K-means algorithm
Three evaluation metrics are used in the experiments, and they include two external quality measure, entropy, F measure, and one internal quality measure, overall similarity. The paper described each measure in detail.
Eight data sets were used in the experiments: 5 from TREC, 2 from Reuters-21578, and 1 from WebACE. Performances of three agglomerative hierarchical techniques, Intra-Cluster Similarity Technique (IST), Centroid Similarity Technique (CST), and UPGMA were compared using F-measure and entropy. UPGMA is the best performing hierarchical technique overall, therefore, its performance is compared against standard K-means and bisecting K-means. The performances of bisecting K-means with refinement and hierarchical with refinement are also included in the comparison. In the experiments, the authors used many runs of the regular K-means algorithm and also used incremental updating of centroids.
Experimental results show that the bisecting K-means technique is better than the standard K-means approach and as good or better than the hierarchical approaches when using the three evaluation metrics mentioned. Also the time complexity of bisecting K-means is linear, which makes it very attractive.
The authors argued that the agglomerative hierarchical clustering didn’t do well because nearest neighbors of documents often belong to different classes. K-means and bisecting K-means algorithms do better because they rely on a more global approach. They also believe that bisecting K-means does better than standard K-means because it produces relatively uniformly sized clusters.
Video of the Day:
The Honest $10000 SPAM
Even though miracle happens, still, don't click on suspicious links or give out your bank information. The Nigerian connection at the end of the video is simply hilarious!!