Paper Review: Text Categorization with Support Vector Machines: Learning with Many Relevant Features

Tuesday, February 24, 2009

Paper Review: Text Categorization with Support Vector Machines: Learning with Many Relevant Features

This paper was written by Thorsten Joachims from University Dortmund. It was published at ECML-98 10th European Conference on Machine Learning. It has been cited by 3632 people according to Google Scholar. Very influential paper indeed!

This paper provides both theoretical and empirical evidence that SVMs are great for text categorization.

When performing text classification, the first step is to transform documents. Each distinct word becomes a feature and the frequency of the word in the document is the value of the feature, resulting in a very high-dimensional feature space. Then, the information gain criterion can be used to select a subset of features. The final step is to scale the dimensions of the feature vector with their inverse document frequency.

SVMs are very universal learners. It can learn linear threshold function and can also learn non-linear functions by using the kernel trick. One great property of SVMs is the ability to learn which is independent of the dimensionality of the feature space, because SVMs use support vectors. SVMs also do not require parameter tuning.

Left: Group of dots in different colors on 2D plane.
Right: Boundaries identified using SVM to group colored dots.

SVMs work well for text categorization because of these following properties of text: 1) High dimensional input space. SVMs do not depend on the number of features and only use support vectors. This prevents overfitting. 2) Few irrelevant features. Aggressive feature selection may result in loss of information. SVMs can work with very high dimensional feature space. 3) Document vectors are sparse. SVMs work well with problems with dense concepts and sparse instances. 4) Most text categorization problems are linearly separable. These are the theoretical evidence.

The paper used two test collections: the “ModApte” split of the Reuters-21578 database and Ohsumed corpus. SVMs are compared with Naïve Bayes, Rocchio, kNN, and C4.5 Decision Tree. Precision/Recall-Breakeven Point is used as a measure of performance. Experimental results show that SVMs had robust performance improvements over other algorithms. Training using SVMs is slow but classification is very fast. SVMs eliminate the need for feature selection and do not require any parameter tuning.

Video of the Day:

I am not a member of the Mormon church, but I still found this story very moving. It was told by the former LDS Church President Gordon Hinckley. Hope you enjoy it! And may God bless us all if there is a God.

Lannyland - A Land of Imagination

Search within Lanny's blog:

Leave me comments so I know people are actually reading my blogs! Thanks!

Tuesday, February 24, 2009