This paper evaluates and compares five feature selection methods: document frequency (DF), information gain (IG), mutual information (MI), the χ² statistic (CHI), and term strength (TS).
A major difficulty of text categorization is the high dimensionality of the feature space. Automatic feature selection methods include the removal of non-informative terms according to corpus statistics and the construction of new features that combine lower-level features into higher-level orthogonal dimensions.
Document frequency (DF) is the number of documents in which a term occurs. Terms with low document frequency (below a threshold) are removed from the feature space, on the assumption that rare terms are either non-informative for category prediction or not influential in global performance.

Information gain (IG) measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a document. Terms whose information gain falls below a threshold are removed from the feature space.

Mutual information (MI) is zero when the term t and category c are independent. A weakness of mutual information is that the score is strongly influenced by the marginal probabilities of terms, which inflates the scores of rare terms.

The χ² statistic (CHI) measures the lack of independence between t and c, and it is known to be unreliable for low-frequency terms.

The term strength (TS) method estimates term importance based on how likely a term is to appear in "closely-related" documents, and it relies on document clustering. The average number of related documents per document is used in threshold tuning.
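As a rough illustration of these criteria, here is a minimal Python sketch that computes the scores from a 2x2 contingency table for a (term, category) pair, following the paper's A/B/C/D document counts with N = A+B+C+D. The function names, the document-as-set representation for TS, and the use of natural logs are illustrative assumptions, not the paper's code:

    import math

    def chi_square(A, B, C, D):
        # chi^2(t, c) = N*(A*D - C*B)^2 / ((A+C)(B+D)(A+B)(C+D)), where
        # A: docs containing t and in c,  B: docs containing t not in c,
        # C: docs in c without t,         D: docs with neither t nor c.
        N = A + B + C + D
        denom = (A + C) * (B + D) * (A + B) * (C + D)
        return N * (A * D - C * B) ** 2 / denom if denom else 0.0

    def mutual_information(A, B, C, N):
        # I(t, c) ~ log(A*N / ((A+C)*(A+B))); zero under independence,
        # and inflated for rare terms -- the bias criticized above.
        if A == 0:
            return float("-inf")
        return math.log(A * N / ((A + C) * (A + B)))

    def information_gain(p_c, p_c_t, p_c_not_t, p_t):
        # G(t) = -sum_i P(c_i) log P(c_i)
        #        + P(t)  sum_i P(c_i|t)  log P(c_i|t)
        #        + P(~t) sum_i P(c_i|~t) log P(c_i|~t)
        # Every argument except p_t is a list over the m categories.
        def plogp(ps):
            return sum(p * math.log(p) for p in ps if p > 0)
        return -plogp(p_c) + p_t * plogp(p_c_t) + (1 - p_t) * plogp(p_c_not_t)

    def term_strength(t, related_pairs):
        # s(t) = P(t in y | t in x), estimated over pairs (x, y) of
        # "closely-related" documents found by clustering; each
        # document is represented here as a set of terms.
        n_x = sum(1 for x, y in related_pairs if t in x)
        n_xy = sum(1 for x, y in related_pairs if t in x and t in y)
        return n_xy / n_x if n_x else 0.0

DF needs no formula: it is simply the count A + B of documents containing the term.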
The paper used kNN and LLSF because both are top-performing, state-of-the-art classifiers, both scale to large classification problems, both are m-ary classifiers that provide a global ranking of categories given a document, both are context-sensitive, and the two classifiers differ statistically. The Reuters-22173 and OHSUMED collections were used for the experiments, with recall and precision as performance measures.
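To make the m-ary ranking concrete, here is a small Python sketch of kNN category scoring in the spirit of Yang's classifier: each category is scored by the summed similarity of the k nearest training documents that carry it. The cosine measure, the (vector, categories) representation, and the default k are assumptions for illustration:

    import heapq
    import math

    def cosine(u, v):
        # Cosine similarity of two equal-length term-weight vectors.
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    def knn_rank_categories(doc_vec, train_docs, k=30):
        # train_docs: list of (vector, set_of_categories) pairs.
        # Score each category by the summed similarity of the k
        # nearest neighbours that belong to it, then return a
        # global ranking of categories, best first.
        neighbours = heapq.nlargest(k, train_docs,
                                    key=lambda dc: cosine(doc_vec, dc[0]))
        scores = {}
        for vec, cats in neighbours:
            sim = cosine(doc_vec, vec)
            for c in cats:
                scores[c] = scores.get(c, 0.0) + sim
        return sorted(scores, key=scores.get, reverse=True)

Thresholding the ranked list at different depths is what produces the recall/precision trade-off the paper measures.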
The experiments range from using the full vocabulary to removing 98% of the unique terms. Results show that IG and CHI are the most effective for aggressive term removal. DF is comparable to IG and CHI with up to 90% term removal, and TS is comparable with up to 50-60% term removal. MI has inferior performance due to a bias favoring rare terms and a strong sensitivity to probability estimation errors. The results also show a strong correlation among the DF, IG, and CHI scores.
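As a simple example of what aggressive term removal looks like in practice, the following sketch prunes a vocabulary by document frequency; keep_fraction=0.02 mirrors the most aggressive 98% removal reported. The helper name and the corpus representation (a list of token lists) are assumptions:

    from collections import Counter

    def df_prune(docs, keep_fraction=0.02):
        # Count, for each unique term, the number of documents it
        # occurs in, then keep only the top keep_fraction of terms
        # by document frequency.
        df = Counter()
        for doc in docs:
            df.update(set(doc))
        n_keep = max(1, int(len(df) * keep_fraction))
        return {term for term, _ in df.most_common(n_keep)}

With keep_fraction=0.10 this corresponds to the 90% removal level at which DF remained comparable to IG and CHI.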