Paper Review: Evaluation of evaluation in information retrieval

This paper was written by Saracevic from Rutgers University and published at the 18th annual international ACM SIGIR conference on Research and development in information retrieval, 1995.

Evaluation metrics are an important component of scientific research. Through evaluation metrics, the performance of systems, algorithms, and solutions to problems can be measured and compared against baselines or against each other. However, which metrics should we use, at what level, and how good are these metrics? Questions like these must be carefully considered. This paper discusses such concerns about past and existing evaluation metrics used in Information Retrieval (IR) and raises many more questions. Please note that the paper was published in 1995, and evaluation metrics and methods in IR have progressed dramatically since then.

This paper is, to some extent, a survey: it discusses the evaluation metrics used throughout the history of IR and provides many literature references. Its main contribution is the suggestion to look at the evaluation of IR from a much higher perspective, going back to the true purpose of IR, which is to resolve the problem of information explosion. From this vantage point, the paper points out that there is a lot more to be evaluated than the common, popular metrics that operate only at the processing level (e.g. Precision and Recall). It urges the IR community to break out of the isolation of narrow, single-level evaluations.

The author systematically defines six levels of objectives (engineering, input, processing, output, use and user, and social) that need to be addressed in IR systems, together with five evaluation requirements (a system with a process, criteria, measures, measuring instruments, and methodology). He then discusses in detail the current practice, potential problems, and needs of evaluation with respect to each of the requirements, and how they can be categorized into the six objective levels. This is an excellent way of organizing the content and arguments, and it allows readers to easily see the big picture within a structured framework.

The paper makes a strong statement that “both system- and user-centered evaluations are needed” and that more work is required to let the two orientations cooperate, in contrast to the widely proposed shift from one to the other. This again highlights the author’s suggestion of treating the evaluation of IR as a whole.

The author identifies many compelling problems and important issues with regard to the evaluation of IR and argues them well. To name a few: First, laboratory collections are too far removed from reality, and TREC has a highly unusual composition in terms of the types and subjects of its documents, so it should not be the sole vehicle for IR evaluation. Second, applying various criteria in some unified manner still poses a major challenge in IR. Third, the assumption that there is one and only one relevant set as the answer to a request is not warranted. Fourth, when relevance is the criterion and precision and recall are the corresponding measures, someone has to judge or establish relevance; the assumption is that this judgment is reasonable, yet we know relevance is highly subjective.
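To make the last point concrete, here is a minimal sketch (mine, not from the paper) of how precision and recall for the same retrieved list change when two assessors disagree on what is relevant. The document ids and both sets of judgments are invented purely for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / documents retrieved
    recall    = relevant documents retrieved / relevant documents
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical system output and hypothetical judgments from two assessors.
retrieved = ["d1", "d2", "d3", "d4"]
judge_a = {"d1", "d2", "d5"}
judge_b = {"d2", "d3", "d4", "d6", "d7"}

for name, judgments in (("judge A", judge_a), ("judge B", judge_b)):
    p, r = precision_recall(retrieved, judgments)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

The system output is identical in both cases, yet the scores differ, which is exactly why the reasonableness of the relevance judgments cannot simply be assumed.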

The paper repeatedly emphasizes evaluating the interaction between users and IR systems as an integral part of the overall evaluation. In recent years, there has also been a strong trend of researchers in various areas becoming interested in how human factors, and the interaction between humans and machines (robots), affect the performance of systems. A good example is the emergence of Human Robot Interaction (HRI). This topic therefore deserves a separate discussion here. The ultimate goal of an IR system is to serve humans. If the retrieved information is not presented to the user properly, the IR system fails miserably. Also, because of the subjectivity (with respect to an individual user) and ambiguity (such as the meaning of query terms) inherent in IR, multiple rounds of interaction between the user and the IR system can dramatically improve retrieval performance. One example is retrieving documents related to the query term “Python”. An interactive IR system can let the user specify whether he or she wants information about the animal or the programming language. As stated in the paper, interactions in IR have been extensively studied and modeled; however, interactivity plays no role in large evaluation projects (which I believe is still true today). Granted, it is difficult to come up with sound evaluation metrics for interactivity, but more discussion and research in this area is definitely necessary.

This paper certainly has its shortcomings. First of all, the author could have been more concise in the writing. Additionally, I found the comparisons with expert systems and OPAC distracting from the main ideas; they do not contribute much to the arguments. Eliminating them would have made the paper more focused.

Granted, precision and recall are the main evaluation metrics used to measure relevance at the system and process level, but many other evaluation metrics also exist and were not covered in this paper. Examples include (but are not limited to) F-measure, Entropy, Variation of Information, Adjusted Rand Index, V-Measure, Q-2, and log likelihood of the data. Besides quantitative evaluation metrics, qualitative analysis is also a common tool for evaluating the performance of IR systems, and the paper does not touch on this subject at all.
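As one example from this list, the F-measure combines precision and recall into a single number. The sketch below uses the standard weighted harmonic mean definition; the function name and the choice to return 0.0 when the denominator is zero are my own assumptions, not something taken from the paper.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily, beta < 1 weights precision
    more heavily, and beta = 1 gives the familiar F1 score.
    """
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0.0:
        # Assumed convention: the score is 0 when nothing useful was retrieved.
        return 0.0
    return (1 + b2) * precision * recall / denominator


print(f_measure(0.5, 0.5))            # balanced precision and recall
print(f_measure(0.9, 0.1))            # high precision, low recall
print(f_measure(0.9, 0.1, beta=2.0))  # same system, recall emphasized
```

Because it is a harmonic mean, the score stays low whenever either precision or recall is low, so a system cannot look good by optimizing only one of the two.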

The paper argues that it is a problem that “every algorithm and every approach anywhere, IR included, is based on certain assumptions” and that these assumptions could have negative effects. I found this argument weak and not well constructed. It is true that when researchers design algorithms or model problems, they make assumptions. Good researchers clearly identify these assumptions in their publications and analyze the potential effects the assumptions have on their algorithms or models. Sometimes assumptions are made without sound reasons but are later justified by good performance on real applications or data. Making assumptions is almost unavoidable when doing research. We should not be afraid of making assumptions, but we should be careful about them and justify them.

Lastly, there is one more important drawback of this paper. It does a good job identifying many problems and issues regarding evaluation in IR. However, it does a poor job providing constructive ideas and suggestions for many of these problems. I am not suggesting that the author should have solved them, but some initial ideas or thoughts (even throw-away or naïve ones) would have improved the paper considerably.

In summary, the paper succeeds in drawing attention to treating evaluation in IR from a much higher perspective, and it also provides good descriptions, references, and discussion of the “current” state (up to 1995) of evaluation metrics in IR. I enjoyed reading this paper.

Lannyland - A Land of Imagination

Sunday, February 08, 2009