Paper Review: A Comparison of Event Models for Naive Bayes Text Classification

Saturday, January 24, 2009

This paper was written by Andrew McCallum (Just Research) and Kamal Nigam (Carnegie Mellon University) and published at the AAAI-98 Workshop on Learning for Text Categorization. It is a seminal paper, cited by 1426 papers according to Google Scholar!

Two text classification approaches, the multi-variate Bernoulli model and the multinomial model, both use the naïve Bayes assumption. This paper differentiates the two models and compares their performance empirically on five text corpora.

In text classification, a naïve Bayes classifier assumes that, given the class, the probability of each word occurring in a document is independent of the occurrence of other words in the document and of the word’s context and position in the document. This assumption simplifies learning dramatically when the number of attributes (features) is large. Both approaches use training data to estimate the parameters of the generative model and then use Bayes’ rule to calculate the posterior probability of each class given the evidence of the test document. The most probable class is then selected.
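To make the train-then-apply-Bayes'-rule flow concrete, here is a minimal sketch of a naïve Bayes text classifier using the multinomial flavor. The function names, the toy training set, and the add-one (Laplace) smoothing are my own assumptions for illustration, not code from the paper.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (tokens, label). Returns (log_priors, log_word_probs, vocab)."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    # Class priors estimated as the fraction of training documents per class.
    log_priors = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_word_probs = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        # Add-one (Laplace) smoothing is an assumption of this sketch,
        # so words unseen in a class do not get probability zero.
        log_word_probs[c] = {
            w: math.log((1 + word_counts[c][w]) / (len(vocab) + total))
            for w in vocab
        }
    return log_priors, log_word_probs, vocab

def classify(tokens, log_priors, log_word_probs, vocab):
    """Return the class with the highest (unnormalized) log posterior."""
    scores = {}
    for c in log_priors:
        # Naive Bayes assumption: per-word log-probabilities simply add up.
        scores[c] = log_priors[c] + sum(
            log_word_probs[c][w] for w in tokens if w in vocab
        )
    return max(scores, key=scores.get)

# Toy training set, made up for illustration only.
TRAIN = [
    ("goal match team goal".split(), "sports"),
    ("team win match".split(), "sports"),
    ("stock market price".split(), "finance"),
    ("market price fall stock".split(), "finance"),
]
model = train(TRAIN)
print(classify("team goal".split(), *model))
print(classify("stock price".split(), *model))
```

The denominator P(document) in Bayes' rule is the same for every class, so the sketch skips it and just compares log prior plus log likelihood.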

One major difference between these two approaches is that the multi-variate Bernoulli model ignores word frequency and only records whether each word occurs in a document, while the multinomial model counts how many times each word occurs. Another difference is that the multi-variate Bernoulli model explicitly includes the non-occurrence probability of words that do not appear in the document.
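The two differences are easy to see side by side in code. This is a rough sketch of the per-class document log-likelihood under each event model; the function names and the probability tables are hypothetical values I picked for illustration. Note that the two tables mean different things: in the Bernoulli table each entry is the chance a word shows up at all in a document, while the multinomial table is a distribution over tokens and sums to 1.

```python
import math

def bernoulli_log_likelihood(tokens, word_probs):
    """word_probs[w] = P(w appears at least once in a document | class)."""
    present = set(tokens)  # word counts are discarded here
    total = 0.0
    for w, p in word_probs.items():
        # Every vocabulary word contributes: p if present, (1 - p) if absent.
        total += math.log(p if w in present else 1.0 - p)
    return total

def multinomial_log_likelihood(tokens, word_probs):
    """word_probs[w] = P(a token is w | class); the values sum to 1."""
    # Each occurrence contributes once; absent words contribute nothing.
    # The multinomial coefficient and document-length term are omitted
    # because they do not depend on the class.
    return sum(math.log(word_probs[w]) for w in tokens if w in word_probs)

# Hypothetical parameters for a single class.
BERNOULLI_PROBS = {"goal": 0.6, "team": 0.5, "stock": 0.1}
MULTINOMIAL_PROBS = {"goal": 0.5, "team": 0.4, "stock": 0.1}

once = ["goal"]
thrice = ["goal", "goal", "goal"]

print(bernoulli_log_likelihood(once, BERNOULLI_PROBS))
print(bernoulli_log_likelihood(thrice, BERNOULLI_PROBS))
print(multinomial_log_likelihood(once, MULTINOMIAL_PROBS))
print(multinomial_log_likelihood(thrice, MULTINOMIAL_PROBS))
```

Repeating "goal" changes nothing for the Bernoulli model, but the absent words "team" and "stock" still affect its score. The multinomial score changes with every repetition and ignores the absent words.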

To reduce vocabulary size, feature selection keeps only the words that have the highest average mutual information with the class variable. Average mutual information is the difference between the entropy of the class variable and the entropy of the class variable conditioned on the absence or presence of the word.
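The definition above, H(C) minus H(C | word present or absent), can be sketched in a few lines. This version assumes document-level presence counts and measures entropy in bits; the function names and the toy documents are mine, and the paper's exact estimation details may differ.

```python
import math
from collections import Counter

def entropy(counts):
    """Entropy in bits of the distribution given by raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)

def average_mutual_information(docs, word):
    """docs: list of (tokens, label). Returns H(C) - H(C | W) in bits."""
    h_class = entropy(Counter(label for _, label in docs).values())
    with_word = Counter(label for tokens, label in docs if word in tokens)
    without_word = Counter(label for tokens, label in docs if word not in tokens)
    h_conditional = 0.0
    for subset in (with_word, without_word):
        n = sum(subset.values())
        if n:
            # Weight each branch by how often the word is present / absent.
            h_conditional += (n / len(docs)) * entropy(subset.values())
    return h_class - h_conditional

def select_features(docs, k):
    """Keep the k words with the highest average mutual information."""
    vocab = {w for tokens, _ in docs for w in tokens}
    return sorted(
        vocab, key=lambda w: average_mutual_information(docs, w), reverse=True
    )[:k]

# Toy documents, made up for illustration only.
DOCS = [
    (["goal", "team"], "sports"),
    (["team", "win"], "sports"),
    (["stock", "team"], "finance"),
    (["stock", "price"], "finance"),
]
print(average_mutual_information(DOCS, "stock"))
print(average_mutual_information(DOCS, "team"))
print(select_features(DOCS, 1))
```

Here "stock" splits the two classes perfectly, so it carries the full one bit of class entropy, while "team" shows up in both classes and tells us much less.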

Five text corpora are used: Yahoo! ‘Science’ hierarchy, Industry Sector, Newsgroups, WebKB, and Reuters. Empirical results show that the multi-variate Bernoulli model works well with small vocabulary sizes, but the multinomial model usually performs better at larger vocabulary sizes. On average, the multinomial model produced a 27% reduction in error over the multi-variate Bernoulli model at any vocabulary size.

The boringness of a paper is inversely proportional to the amount of time it takes to put the reader to sleep.
