This paper evaluates and compares five feature selection methods: document frequency (DF), information gain (IG), mutual information (MI), a χ2-test (CHI), and term strength (TS).
A major difficulty of text categorization is the high dimensionality of the feature space. Automatic feature selection methods include the removal of non-informative terms according to corpus statistics, and the construction of new features that combine lower-level features into higher-level orthogonal dimensions.
Document frequency is the number of documents in which a term occurs. Terms whose document frequency fell below a threshold were removed from the feature space, on the assumption that rare terms are either non-informative for category prediction or not influential in global performance.
Information gain measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a document. Terms whose information gain fell below a threshold were removed from the feature space.
Mutual information is zero if the term t and category c are independent. Its weakness is that the score is strongly influenced by the marginal probabilities of terms.
The χ2 statistic measures the lack of independence between t and c. It is known to be unreliable for low-frequency terms.
The term strength method estimates term importance from how likely a term is to appear in “closely-related” documents, and it is based on document clustering. The average number of related documents per document is used in threshold tuning.
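To make the four count-based criteria concrete, here is a minimal Python sketch. It is not the paper's code. It assumes each document is a set of tokens with exactly one category label (a simplification), uses the usual two-way contingency counts A, B, C, D for a term and a category, and estimates every probability by simple counting. The function names and the toy corpus are invented for illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): each document is a set of tokens with one label.
docs = [{"wheat", "farm"}, {"wheat", "export"}, {"oil", "price"}, {"oil", "export"}]
labels = ["grain", "grain", "crude", "crude"]

def document_frequency(docs, term):
    """DF: number of documents that contain the term."""
    return sum(1 for tokens in docs if term in tokens)

def contingency(docs, labels, term, category):
    """Counts A (t and c), B (t without c), C (c without t), D (neither)."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term, in_cat = term in tokens, label == category
        if has_term and in_cat:
            a += 1
        elif has_term:
            b += 1
        elif in_cat:
            c += 1
        else:
            d += 1
    return a, b, c, d

def mutual_information(a, b, c, d):
    """MI(t, c) estimated as log(A * N / ((A + C) * (A + B)))."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # t and c never co-occur
    return math.log(a * n / ((a + c) * (a + b)))

def chi_square(a, b, c, d):
    """CHI(t, c) = N * (AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D))."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom

def _entropy(counts):
    """Entropy in bits of a label distribution given as a Counter."""
    total = sum(counts.values())
    return -sum((k / total) * math.log2(k / total) for k in counts.values() if k)

def information_gain(docs, labels, term):
    """IG(t) = H(C) - P(t) H(C | t) - P(not t) H(C | not t), in bits."""
    n = len(docs)
    with_t = Counter(l for tokens, l in zip(docs, labels) if term in tokens)
    without_t = Counter(l for tokens, l in zip(docs, labels) if term not in tokens)
    conditional = 0.0
    for part in (with_t, without_t):
        size = sum(part.values())
        if size:
            conditional += (size / n) * _entropy(part)
    return _entropy(Counter(labels)) - conditional
```

On this toy corpus, "export" occurs equally often in both categories, so its MI, CHI and IG scores are all zero, while "wheat" perfectly predicts "grain" and gets the maximum IG of one bit. Term strength is left out because it needs a document-similarity step that these counts do not capture.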
The paper used kNN and LLSF for five reasons: both are top-performing, state-of-the-art classifiers; both scale to large classification problems; both are m-ary classifiers that provide a global ranking of categories given a document; both are context sensitive; and the two classifiers differ statistically. The Reuters-22173 and OHSUMED collections were used for the experiments, with recall and precision as performance measures.
The experiments range from using the full vocabulary to removing 98% of the unique terms. Results show that IG and CHI are the most effective in aggressive term removal. DF is comparable to IG and CHI with up to 90% term removal. TS is comparable with up to 50-60% term removal. MI has inferior performance due to a bias favoring rare terms and a strong sensitivity to probability estimation errors. Results also show a strong correlation among the DF, IG and CHI scores.
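The removal levels quoted above amount to ranking the vocabulary by a score and keeping only the top part. A minimal sketch of that step follows, assuming a dictionary that maps each term to its score under one criterion. The function name `select_terms` and the sample scores are hypothetical, and ties are broken alphabetically only to keep the output deterministic.

```python
def select_terms(scores, removal_rate):
    """Return the terms that survive after dropping the lowest-scoring fraction."""
    if not 0.0 <= removal_rate < 1.0:
        raise ValueError("removal_rate must be in [0, 1)")
    # Highest score first; alphabetical order breaks ties.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    n_remove = int(len(ranked) * removal_rate)
    return ranked[: len(ranked) - n_remove]

# Hypothetical scores for a four-term vocabulary.
example_scores = {"wheat": 4.0, "oil": 3.0, "export": 0.5, "price": 0.1}
kept = select_terms(example_scores, 0.5)
```

With a removal rate of 0.5, the two lowest-scoring terms are dropped and `kept` holds "wheat" and "oil".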