Thursday, March 12, 2009

Paper Review: A Semi-Supervised Document Clustering Technique for Information Organization

This paper was written by Hanjoon Kim and Sanggoo Lee of Seoul National University and was presented at the Ninth International Conference on Information and Knowledge Management (CIKM) in 2000.

Traditional document clustering methods are unable to correctly discern structural details hidden within a document corpus because they depend heavily on the documents themselves and their similarity to each other. To solve this problem, the paper proposes a clustering method that incorporates relevance feedback into a traditional complete-linkage clustering algorithm in order to find more coherent clusters.

The main contribution of this paper is a method that combines the complete-linkage hierarchical agglomerative clustering (HAC) algorithm, using the vector space model (VSM), with user relevance feedback based on fuzzy information retrieval. HAC with VSM forms pre-clusters using a “fixed-diameter”. Documents in small clusters (size less than η) then form a “training document set”, which is used for user relevance feedback: the user answers “yes/no” questions to indicate the relevance of documents returned by fuzzy information retrieval. This way, positive and negative bundles can be created and used to help reassign pre-clusters. Note that none of the techniques mentioned above is an original idea from the authors. The paper simply proposes a way to combine the strengths of these techniques in order to improve clustering performance.
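To make the pre-clustering step concrete, here is a minimal sketch in Python. It is not the authors' implementation: I am assuming that "fixed-diameter" means merging stops once the complete-linkage (farthest-pair) distance would exceed a threshold, and that cosine distance over term-count vectors stands in for the VSM similarity. The names pre_cluster, split_by_size, max_diameter and eta, and the toy vectors, are all made up for illustration.

```python
import math
from itertools import combinations

def cosine_distance(a, b):
    """1 - cosine similarity between two term-count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)

def complete_linkage(c1, c2, vectors):
    """Distance between two clusters = distance of their farthest pair."""
    return max(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)

def pre_cluster(vectors, max_diameter):
    """Agglomerative clustering that stops when the closest pair of
    clusters is farther apart than max_diameter (assumed stopping rule)."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = complete_linkage(clusters[a], clusters[b], vectors)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        if d > max_diameter:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def split_by_size(clusters, eta):
    """Clusters with fewer than eta documents feed the training set."""
    large = [c for c in clusters if len(c) >= eta]
    training = [i for c in clusters if len(c) < eta for i in c]
    return large, training

# Hypothetical toy corpus: each row is a term-count vector for one document.
docs = [
    [3, 1, 0, 0],
    [2, 1, 0, 0],
    [4, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 2, 1],
    [1, 0, 0, 5],
]
clusters = pre_cluster(docs, max_diameter=0.2)
large, training = split_by_size(clusters, eta=3)
print(clusters, large, training)
```

With these toy vectors, the first three documents form one pre-cluster, the next two form another, and the last document stays alone. Setting eta to 3 sends the two smaller clusters to the training set, which is where the relevance feedback questions would come in.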

Toward the end of the paper, the authors included two subsections that discuss the sensitivity of the degree-of-supervision parameter and the pre-cluster size parameter (“fixed-diameter”). This is a good attempt to justify the “magic numbers” used in the experiments. However, there is no discussion or analysis at all of the parameter η used in the supervising phase. Also, the paper states that “the threshold diameter value for pre-clusters is experimentally adjusted to provide good quality clustering.” This makes the reader wonder whether the experimental results presented are truthful, and question how well the proposed method would work in real-world applications.

Clustering techniques are used to group together things that have common themes. However, there are frequently many different ways to group things based on different common themes, and the distance function used decides which theme will drive the clustering. For example, a bunch of people can be grouped by one of many themes (e.g. age, gender, race, wealth, and hair style). Similarly, a document collection can also be clustered by themes such as style, topic, language, or subject. The paper simply states two alternative ways to define the diameter of a cluster but does not analyze how the choice might affect the results of the document clustering. The paper also does not clarify which definition of the diameter is used in the experiments.
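Here is a small sketch of why the choice of diameter definition matters. I am not reproducing the paper's two definitions here; as an assumption for illustration, I compare two common ones: the largest pairwise distance and the average pairwise distance. The function names, the Euclidean distance, and the toy points are all hypothetical.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def diameter_max(points):
    """Diameter as the largest distance between any two members."""
    return max(euclidean(a, b) for a, b in combinations(points, 2))

def diameter_avg(points):
    """Diameter as the average distance over all member pairs."""
    pairs = list(combinations(points, 2))
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)

# Hypothetical clusters: one tight group with a single outlier,
# one evenly spread group.
tight_with_outlier = [(0, 0), (0, 1), (1, 0), (6, 8)]
spread = [(0, 0), (0, 6), (6, 0), (6, 6)]

for name, pts in [("tight+outlier", tight_with_outlier), ("spread", spread)]:
    print(name, round(diameter_max(pts), 3), round(diameter_avg(pts), 3))
```

Under the max definition the cluster with the outlier is the "bigger" one, while under the average definition the evenly spread cluster is. A fixed threshold of, say, 6 would therefore accept or reject different clusters depending on which definition is in use, so the results can't really be interpreted without knowing which one the experiments used.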

The authors mentioned in several places that the proposed method helps in discerning structural details hidden within documents. However, throughout the paper, there is no discussion about how structural details are extracted from documents. Arguably the users might be able to identify structural details through user relevance feedback. However, this really has nothing to do with the proposed algorithm. The users might not even be looking at the structural detail of these documents at all.

A core part of the proposed method relies on humans to provide domain-specific knowledge for the method to work, yet the paper is missing a detailed description of how the experiment was designed to incorporate human knowledge. For example, we don’t know how many human analysts were involved. Could the human analysts have different opinions? If so, how is that handled? Are these human analysts really domain experts who possess domain-specific knowledge about the subject? The paper is also missing a detailed description of how the queries are reformulated, which is an integral part of the user relevance feedback method.

In the supervising phase, the user is asked to answer “yes/no” questions to mark documents as relevant or non-relevant to the query document. Firstly, relevance is subjective. A document marked as relevant by one user might end up being marked as non-relevant by another user. Secondly, a binary answer completely ignores the level of relevance (ranking). As a result, documents bordering on non-relevance could still be marked as relevant and carry the same weight as the most relevant document. This leads to inaccuracy and misrepresentation. A non-binary weighting would probably work better.
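To illustrate the point about binary answers, here is a tiny sketch. It assumes, purely for illustration, that the documents marked relevant are summarized by a weighted centroid; how the paper actually uses its positive bundles is not something I am reproducing here. The vectors and grades are made up.

```python
def weighted_centroid(vectors, weights):
    """Weighted average of document vectors."""
    total = sum(weights)
    dims = len(vectors[0])
    return [
        sum(w * v[d] for v, w in zip(vectors, weights)) / total
        for d in range(dims)
    ]

# Hypothetical documents the user marked "yes"; the third is borderline.
relevant_docs = [[1.0, 0.0], [0.9, 0.1], [0.2, 0.8]]

binary_weights = [1, 1, 1]        # every "yes" counts the same
graded_weights = [1.0, 0.9, 0.2]  # made-up graded relevance scores

binary_centroid = weighted_centroid(relevant_docs, binary_weights)
graded_centroid = weighted_centroid(relevant_docs, graded_weights)
print(binary_centroid, graded_centroid)
```

With binary weights, the borderline document pulls the centroid toward the second dimension just as hard as the two clearly relevant documents pull it toward the first. With graded weights its influence shrinks, which is the kind of behavior a non-binary scheme would buy.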

When building the test document set, the authors intentionally picked documents that had a single topic, claiming that this would avoid the ambiguity of documents with multiple topics. Is this really for technical convenience so the experimental results would look better? Does this mean the proposed method would not work well at all for documents with multiple topics (which is probably closer to real-world scenarios)?

The authors used the F1-measure as the performance metric and the complete-linkage clustering method as a baseline. The baseline selected is reasonable for showing how much benefit can be achieved by adding the user relevance feedback component. However, using this baseline alone is not very convincing because we already know the complete-linkage clustering method doesn’t work well, and we also know that adding user relevance feedback will certainly help. It would be much more convincing if the paper compared the performance of the proposed method with another common document clustering technique such as K-Means or Bisecting K-Means. Using only the F1-measure as the performance metric is also weak. Including other quantitative metrics together with some qualitative analyses would certainly help.
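For readers unfamiliar with how an F1-measure gets applied to clustering, here is a minimal sketch of one common formulation: for each true class, take the best F1 over all clusters, then average weighted by class size. This is an assumption on my part; the paper's exact formula may differ. The labels and cluster assignments are made up.

```python
from collections import Counter

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def clustering_f1(true_labels, cluster_ids):
    """Class-size-weighted average of each class's best-matching cluster F1."""
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_ids)
    overlap = Counter(zip(true_labels, cluster_ids))
    total = 0.0
    for cls, cls_size in class_sizes.items():
        best = 0.0
        for clu, clu_size in cluster_sizes.items():
            n_ij = overlap[(cls, clu)]
            if n_ij == 0:
                continue
            precision = n_ij / clu_size  # share of the cluster in this class
            recall = n_ij / cls_size     # share of the class in this cluster
            best = max(best, f1(precision, recall))
        total += (cls_size / n) * best
    return total

# Hypothetical ground truth and clustering output.
truth = ["sports", "sports", "sports", "politics", "politics", "tech"]
predicted = [0, 0, 1, 1, 1, 2]
score = clustering_f1(truth, predicted)
print(round(score, 4))
```

A single number like this hides which classes are being split or merged, which is one reason a second metric or a qualitative look at the clusters would strengthen the evaluation.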

In summary, this paper proposes a novel approach that combines the strengths of several IR techniques. However, it also has plenty of room for improvement.
