Paper Review: Using Maximum Entropy for Text Classification

Thursday, February 19, 2009

Paper Review: Using Maximum Entropy for Text Classification

This paper is written by Kamal Nigam, John Lafferty, and Andrew McCallum, all from Carnnegie Mellon University. It was presented at IJCAI-99 workshop on machine learning for information filtering.

This paper talks about the use of maximum entropy techniques for text classification and compares the performance to that of naïve Bayes.

Maximum entropy is a general technique for estimating probability distributions from data. The main principle in maximum entropy is that when nothing is known, the distribution should be as uniform as possible, that is, have maximal entropy. In text classification scenarios, maximum entropy estimates the conditional distribution of the class label given a document. The paper uses word counts as features.

Training data is used to set constraints on the conditional distribution. Maximum entropy first identifies a set of feature functions that will be useful for classification, then for each feature, measures its expected value over the training data and take this to be a constraint for the model distribution.

Improved Iterative Scaling (IIS) is a hillclimbing algorithm for calculating the parameters of a maximum entropy classifier given a set of constraints. It performs hillclimbing in parameter log likelihood space. At each step IIS finds an incrementally more likely set of parameters and converges to the globally optimal set of parameters.

Maximum entropy can suffer from overfitting and introducing a prior on the model can reduce overfitting and improve performance. To integrate a prior into maximum entropy, the paper proposes using maximum a posteriori estimation for the exponential model instead of maximum likelihood estimation. A Gaussian prior is used in all the experiments.

One good thing about maximum entropy is that it does not suffer from any independence assumptions.

The paper used three data sets to compare the performance of maximum entropy to naïve Bayes. The three data sets are WebKB, Industry Sector, and Newsgroups. In WebKB data set, the maximum entropy was able to reduce classification error by more than 40%. For the other two data sets, maximum entropy overfitted and performed worse than naïve Bayes.

Video of the Day:

Liu Qian performing magic tricks at the Chinese New Year Show. Can you figure out how he did the tricks?

Lannyland - A Land of Imagination

Search within Lanny's blog:

Leave me comments so I know people are actually reading my blogs! Thanks!

Thursday, February 19, 2009