Wednesday, July 02, 2008

Paper Review: The Seven Ages of Information Retrieval (1)

This is a make-up post!

In this paper review series, I summarize interesting papers I have read in a non-technical way (as best I can) and add my own opinions. Since my research interest is in AI robotics, the papers I review will mostly cover related topics. This is a good way to read about research ideas without worrying too much about the math involved. I will also provide links to the actual papers, so you can read the original after reading my review if you want.

The paper I review today covers the history of Information Retrieval (IR). How does this relate to robots? I will reveal the answer in future posts, so stay tuned.

Here's the PDF link for the paper "The Seven Ages of Information Retrieval" by Michael Lesk. And here below is the first part of my review:

This paper uses Shakespeare’s concept of the seven ages of man to describe and predict the evolution of Information Retrieval from 1945 to 2010. Throughout the paper, the author compares two “competing” approaches to IR: simple statistical methods, or statistics (Warren Weaver’s approach), and sophisticated information analysis, or artificial intelligence (Vannevar Bush’s approach). Keep in mind that the paper was written in 1996, just at the beginning of the Internet/dot-com boom. That gives us the unfair advantage of being able to criticize some of the author’s predictions (just as the author had the advantage in criticizing Bush’s predictions).

In the childhood stage of IR (1945-1955), people still worked with very old technology. With no way of knowing how completely technology would change people’s lives by the end of the century, Bush made predictions about the evolution of IR. He believed that photographic inventions (such as ultramicrofiche) would have a great impact on libraries and IR, a view the author did not share. Bush also predicted automatic typing from dictation and OCR, which had not quite been achieved by 1996. However, his prediction about the capabilities of computer systems became reality in the 1960s. The 7.5TB/user storage he predicted was far from 1996’s reality. Bush predicted interfaces personalized to the individual user, and that people would search their own notes before searching scientific papers, but until after the 1970s it remained difficult to get information into computers. The first IR system, built in the 1950s, used indexes and concordances.

In the schoolboy stage (1960s), the first large-scale information systems were built. Computers could search indexes much better than humans, which created demand for more detailed indexing. However, indexing could also become too expensive, hence the idea of free-text searching, which eliminates the need for manual indexing. Critics objected that the words chosen might not be the correct label for a given subject. One proposed solution was official vocabularies. The ideas of recall and precision also emerged as measures for evaluating IR systems, and they showed that free-text indexing was as effective as manual indexing and much cheaper. New IR techniques such as relevance feedback and multi-lingual retrieval were invented. The 1960s also saw the start of research into natural language question-answering: AI researchers began building systems to retrieve actual answers instead of documents, but these systems turned out to be fragile.
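
To make recall and precision a bit more concrete, here is a minimal sketch of how the two measures are usually computed. This is my own illustration, not something from the paper; the function name and the document IDs are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs of the documents the system returned (hypothetical data)
    relevant:  IDs of the documents a human judged relevant (hypothetical data)
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually returned

    # Precision: what fraction of the returned documents are relevant?
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: what fraction of the relevant documents were returned?
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# The system returned four documents; three documents are truly relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r)
```

In this toy case two of the four returned documents are relevant (precision of one half) and two of the three relevant documents were found (recall of two thirds).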

In the adulthood stage (1970s), the development of computer typesetting and word processing, together with the availability of time-sharing systems, allowed IR to mature into real systems. Some of the early large-scale systems include Dialog, Orbit, BRS, OCLC, and Lexis. The most important research progress was the rise of probabilistic information retrieval, with techniques such as term frequency. On the AI side, the key subjects in the 1970s were speech recognition and the beginning of expert systems. AI researchers felt they were attacking more fundamental and complex problems and that the IR string-searching approach had inherent limits. They built programs that mapped information into standard patterns, but these tended to operate on databases rather than text files. The IR camp felt that the AI researchers did not run evaluated experiments, and in fact built only prototypes that were at grave risk of not generalizing.
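
As a rough illustration of what "term frequency" means, here is a small sketch that counts how often each word appears in a piece of text, relative to the text's length. This is my own simplification (lowercasing and splitting on whitespace are assumptions), and it is only the raw counting step, not a full probabilistic retrieval model.

```python
from collections import Counter


def term_frequencies(text):
    """Return each term's share of all tokens in the text.

    Tokenization here is deliberately naive (lowercase, split on whitespace);
    a real system would handle punctuation, stemming, and stop words.
    """
    tokens = text.lower().split()
    if not tokens:
        return {}
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


tf = term_frequencies("the cat sat on the mat")
print(tf["the"], tf["cat"])
```

The intuition is that a word appearing often in a document says something about what the document is about, which a retrieval system can then use when ranking results.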

In the maturity stage (1980s), more information was available in machine-readable form and kept that way. There was also an enormous increase in the number of databases available on the online systems. Online Public Access Catalogs (OPACs) developed during this period, and many current magazines and newspapers were now online. There was increasing interest in new kinds of retrieval methods, such as sense disambiguation using machine-readable dictionaries and computational linguistics. These would all fall under the statistical kind of retrieval. Because of the size of large commercial systems, evaluation of IR became very difficult. The widespread use of CD-ROM was a key technology change; it fit well with traditional information publishing economics and developed into a real threat to the online systems. Meanwhile, the AI community continued work on expert systems and knowledge representation languages. However, later in the decade, the failure of expert systems to deliver on their initial promises caused a movement away from this area, which marked the “AI winter”.

In the mid-life crisis stage (1990s), another technology revolution arrived: the Internet. What’s remarkable is not that everyone is accessing information, but that everyone is providing information for free. This matches the model Bush forecast, where each user organizes information of personal interest and trades it with others. Classification-type search engines (such as Yahoo) also appeared. The Internet also became a standard medium for publishing. Another important technology was scanning, which lowered the cost of digitizing publications. The Federal government also started a Digital Library research initiative. However, there was still very large scatter in the performance of retrieval systems, not only by question but even over individual relevant documents within answer lists. The author didn’t mention how the AI side fared during this period.

In the fulfillment stage (2000s), the author predicted how IR might evolve. He believed that more ordinary questions would be answered by reference to online materials rather than paper materials, that new books would be offered online, and that there would be guidance companies on the web, so the lack of any fundamental advances in knowledge organization would not matter. He thought the area that required more research was the handling of images, sounds and video. He noted that online publishing would not pose a problem for academic publishing, but would for commercial publishing. He further discussed the dramatic storage requirements for video content.

In the retirement stage (2010), the author forecast that the basic job of conversion to machine-readable form would be done and that a great deal of multimedia information would be available and as easy to deal with as text. Internationalism would become a major issue. As for research, work would focus on improving the systems and learning new ways to use the new IR systems. There might even be PhDs in probabilistic retrieval.

The author further pointed out some potential problems, such as illegal copying (piracy in today’s terms), copyright law itself, the abundance of junk and clutter on the Internet, the difficulty for people to upload, legal liability, and public policy debates restricting technological development and availability. At the end, the author also expressed optimism that Bush’s dream would be achieved in one lifetime and that the job of organizing information could have higher status in the very near future.

[To be continued....]


