This tech report compares three batch clustering algorithms: EM, CEM, and HAC and also investigate the effect of three different initialization schemes on the final solution produced by the EM algorithm.
The clustering algorithms first use the log posterior probability of model structure given the training data log P(K|Dtrain) to select model structure, and then learn the parameters of a given model structure. The EM algorithm is iterative, and in the E step, assigns the case fractionally to each cluster. The CEM algorithm assigns the case fully to the class k that has the highest posterior probability given the document and the current parameter values. The HAC algorithm merge smaller clusters by recursively merging the two clusters that are closest together until only K clusters are left. The authors derive the distance metric for mixtures of independent multinomially distributed variables.
The three initialization methods experimented are: the random approach where the parameters are independent of the data, the noisy-marginal method (data dependent), and the method of using HAC on a random sample of the data. Several performance criteria are used. For the model structure, the log-marginal-likelihood criterion and the difference between the number of clusters in the model and the true number of clusters are used. For the entire model, one criterion is the cross entropy between the true joint distribution for X and the joint distribution given by the model, and another criterion is the classification accuracy. Running time is also used to compare algorithms.
Experimental results from two datasets were reported. The synthetic dataset was constructed using the MSNBC news service, and the digits dataset consists of images of handwritten digits made available by the US Postal Service Office for Advanced Technology. For both datasets, EM outperformed CEM for all initialization methods and for all criteria (except that CEM is four times faster than EM). After constraining EM run time to match CEM, EM still performed better. When comparing EM versus HAC (initializing using Marginal method), EM is clearly better.
When comparing initialization methods, for the Synthetic dataset, Marginal and HAC are always better than Random (for all criteria) with no significant differences between themselves, but Marginal is much faster than HAC. For the digit6 dataset, there is no clear winner.
Picture of the Day:
First time ever a rescue helicopter landing in Bryce Canyon, Utah. The chopper landed right on the top of the ridge. Got to give the pilot some shout out for his awesome skills!