
Wednesday, February 04, 2009

Paper Review: Distributional Clustering of Words for Text Classification

This paper was written by Baker from Carnegie Mellon and McCallum from Justsystem Pittsburgh Research Center, and was published at the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998.

This paper applies Distributional Clustering to document classification. Distributional Clustering reduces the feature space by joining words that induce similar probability distributions over the target features that co-occur with them. Here, word similarity is measured by the distributions of class labels associated with the words in question.

The three benefits of using word clustering are: useful semantic word clusterings, higher classification accuracy, and smaller classification models.

Clustered Word Clouds by Jeff Clark, perfect in memory of Dr. King on this year's Martin Luther King Day!

The paper first goes over the probabilistic framework and naïve Bayes, and then suggests using the weighted average of the individual word distributions as the distribution of a merged cluster. Kullback-Leibler divergence, an information-theoretic measure of the difference between two probability distributions, is introduced, and the paper then uses "KL divergence to the mean": instead of compressing two distributions optimally with their own codes, it uses the code that would be optimal for their mean. Assuming a uniform class prior, choosing the most probable class under naïve Bayes is identical to choosing the class with minimal cross entropy with the test document. Therefore, when words are clustered according to this similarity metric, the increase in naïve Bayes error is minimized.
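As a rough sketch of this similarity metric (not the paper's code; the function names and the two-class toy distributions are my own), the weighted "KL divergence to the mean" of two words' class distributions could be computed like this:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_to_mean(p, q, w_p=0.5, w_q=0.5):
    """Weighted 'KL divergence to the mean' of two class distributions.

    p, q     : P(C|w_t) and P(C|w_s), class distributions for two words
    w_p, w_q : relative word priors P(w_t), P(w_s), normalized below
    """
    z = w_p + w_q
    w_p, w_q = w_p / z, w_q / z
    # Mean distribution: the weighted average the merged cluster would use
    mean = [w_p * pi + w_q * qi for pi, qi in zip(p, q)]
    return w_p * kl(p, mean) + w_q * kl(q, mean)

# Words with similar class distributions are cheap to merge...
print(kl_to_mean([0.9, 0.1], [0.8, 0.2]))  # small
# ...while words with very different class distributions are expensive
print(kl_to_mean([0.9, 0.1], [0.1, 0.9]))  # large
```

Merging two words whose divergence to the mean is near zero loses almost no information, which is why clustering by this metric keeps the naïve Bayes error increase small.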

The algorithm works as follows: the clusters are initialized with the M words that have the highest mutual information with the class variable. The two most similar clusters are merged, and then the next word is added as a singleton cluster to bring the total number of clusters back up to M. This repeats until every word has been placed into one of the M clusters.
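A minimal sketch of this greedy procedure, assuming the word list is already sorted by mutual information with the class and that a `similarity` function (e.g. the negative KL divergence to the mean) is supplied; the names here are my own, not the paper's:

```python
def cluster_words(words, similarity, M=3):
    """Greedy agglomerative clustering into M clusters.

    words      : word ids, assumed pre-sorted by mutual information
                 with the class variable (highest first)
    similarity : similarity(cluster_a, cluster_b) -> float,
                 higher means more similar
    M          : number of clusters to maintain
    """
    clusters = [{w} for w in words[:M]]  # seed with the top-M words
    for w in words[M:]:
        # Merge the most similar pair of clusters...
        i, j = max(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: similarity(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[i] |= clusters.pop(j)
        # ...then add the next word as a singleton, restoring M clusters
        clusters.append({w})
    return clusters

# Toy usage: cluster word ids 0..9 by closeness of their average id
clusters = cluster_words(
    list(range(10)),
    lambda a, b: -abs(sum(a) / len(a) - sum(b) / len(b)),
    M=3,
)
print(clusters)
```

Since only one pair is merged per step, the expensive part is finding the most similar pair; the paper's fixed M keeps that search small regardless of vocabulary size.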

The paper uses three real-world text corpora: newswire stories, UseNet articles, and Web pages. Results show that Distributional Clustering can reduce the feature dimensionality by three orders of magnitude while losing only 2% accuracy.

Distributional Clustering performs better than feature selection because merging preserves information instead of discarding it. Features that are infrequent, but informative when they do occur, get removed by a feature selector; feature merging keeps them.




I have a dream that one day when I get old, there will be intelligent robots to take care of people like me, so we can enjoy life freely, happily, and independently.