Thursday, July 03, 2008

Paper Review: The Seven Ages of Information Retrieval (2)

This is a make-up post, continuing from the previous post.

In this survey paper, the author used the analogy of Shakespeare’s seven ages of man to describe and predict the different stages in the evolution of Information Retrieval (IR) systems. Note that the paper was written in 1996, very near the beginning of the Internet/dot-com boom. Now, in 2008, only two years away from the final stage of IR (2010) described by the author, we have the unfair advantage of being able to validate and criticize some of the author's predictions, just as the author had the same advantage over Vannevar Bush’s predictions of 1945.

The paper made a good contribution to the field by describing the history of IR systems from 1945 to 1996, with abundant information on the various technologies developed, the IR systems built, and how they affected research in IR. The paper is especially well organized and easy to understand. It started by introducing Bush’s predictions and ended by confirming that Bush’s predictions will be achieved within one lifetime, which made the paper feel complete. The author also used the comparison between the simple statistical approach and the more sophisticated Artificial Intelligence (AI) approach as a main thread running through the different ages, which connected the seven ages well.

However, the paper also has some shortcomings. Firstly, AI is a big field that also uses probabilistic/Bayesian methods throughout its many subfields. There is not really a clear-cut boundary between AI and IR. For example, Natural Language Processing (NLP) is commonly considered a subfield of AI, but many NLP techniques are the same as IR techniques.

Secondly, the author did not provide enough coverage of the AI side of the story, probably because he considered himself part of the IR camp. For example, Artificial Neural Networks (ANN) had been around for decades before the paper was written, and backpropagation (a training algorithm for ANNs, itself dating back to 1974) gained recognition in 1986. ANNs can be used to detect patterns in text documents and are a great tool for IR, but they were never mentioned in the paper. Another example is computer vision. The field of computer vision started in the 1970s, and by 1995 many algorithms had been developed to analyze image content. The paper didn’t mention any existing computer vision algorithms or techniques. The Support Vector Machine (SVM), another great tool for IR, also appeared in the mid-1990s, but I suspect that was after the author wrote this paper.

The paper also failed to mention many important IR techniques such as tf-idf and discriminant functions. Specifically, it did not cover evaluation methods such as K-L divergence and the F1-measure in enough depth. More coverage of techniques and methods like these would have improved the quality of the survey paper.
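For readers who haven't met two of the techniques named above, here is a minimal sketch of tf-idf weighting and the F1-measure in plain Python. This is my own illustration, not something from the paper: the function names, the toy documents, and the particular tf-idf variant (relative term frequency times the log of inverse document frequency) are all assumptions, and real systems often use other variants.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document (illustrative variant)."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def f1_score(retrieved, relevant):
    """Harmonic mean of precision and recall for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy corpus, made up purely for illustration.
docs = [
    "information retrieval systems rank documents".split(),
    "neural networks learn patterns in documents".split(),
    "retrieval evaluation uses precision and recall".split(),
]
weights = tf_idf(docs)
f1 = f1_score(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
```

In this toy corpus, a word that appears in only one document (like "information") gets a higher weight than a word shared by two documents (like "documents"), which is the whole point of tf-idf. The F1 value combines precision (how much of what was retrieved is relevant) with recall (how much of what is relevant was retrieved) into a single number.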

Additionally, some of the graphs (Figures 4, 5, and 9) in the paper do not contribute much to its content. Adding more information to these graphs to show how things correlate, or combining the graphs, would be more beneficial to readers.

In the latter part of the paper, the author made predictions about the possible evolution of IR and also pointed out potential problems. Since we know how technology evolved from 1996 to 2008, I’ll address some of them here.

The author mentioned that there would be enough guidance companies on the Web to help serve each user, so the lack of any fundamental advances in knowledge organization would not matter. What do we do when we need to look for information online these days? We search using Google or Wikipedia, and most of the time we are reasonably happy with the results. Google made it its mission to “organize the world's information and make it universally accessible and useful”. And the mission for Wikipedia is to “empower and engage people around the world to collect and develop educational content under a free license or in the public domain, and to disseminate it effectively and globally”. This leads to an interesting idea: with the help of Google and Wikipedia, maybe we can make the Internet the “Expert System” or “Knowledge Database” and have agents learn from it directly.

The author also worried about commercial publishing on the Internet. These days, the music industry has made (or probably was forced to make) online distribution an integral part of its sales channel. E-book sales, although not mainstream, are slowly gaining market share, and various fancy e-book readers (e.g. the Amazon Kindle) are getting better and making headlines. Google Book Search and Amazon’s book preview function are also getting more and more attention.

In the paper, the author cautioned about storage and transfer constraints in digital video. Thanks to ever lower storage costs and many competing broadband service providers, a majority of Internet users today have fast connections and watch video online on various streaming websites, some with video content provided by users and others with content provided by commercial content owners. Even Google ads these days contain video content. The paper proposed that in the 2000s, more research would be needed on image, audio, and video content extraction. The author was right on. Even today, extracting information from the abundance of rich multimedia content is still a very challenging problem for many researchers. Beyond the traditional types of information media, we now also have new media such as Google Earth, where you can retrieve information from a hybrid of satellite images, regular maps, and street views (360 degrees), coupled with driving directions and estimated travel time.

In the retirement age of IR, the author predicted that “the central library buildings on campus have been reclaimed for other uses, as students access all the works they need from dormitory room computer”. I don’t think this will happen in two years. University libraries still play a major role for students and teachers, and university bookstores are still making huge profits off poor students. And one has to admit that holding a physical book in hand is a very different experience from reading a book online.

To reduce the amount of junk and clutter on the Internet, the author suggested that maybe anonymous posting should not be allowed. This sounds funny to people today, when privacy is such a big concern, though junk and clutter remain a big problem. Think about the trillions of documents out there with more and more rich multimedia content, plus the flourishing blogs, forums, and social networks. Google suggested using page ratings by users, which was not well received. I think we just have to rely on advances in AI and search engines to deal with it. Techniques such as taking user preferences and past history into account are certainly the right way to go.

The author further pointed out some potential problems such as illegal copying (pirating, in today’s terms), copyright law itself, the difficulty for people to upload, legal liability, and public policy debates restricting technological development and availability. These remain challenges for IR systems today (the words RIAA and peer-to-peer network suddenly popped into my head for some reason!) and will probably take more than two years to resolve.

In my opinion, AI will start to play a leading role in IR in the coming years, and one day we will have true question-answering information retrieval at the fingertips of every Internet user. This concludes my review of this paper! Thanks for reading it!

Listen to smart people.