Leave me comments so I know people are actually reading my blogs! Thanks!

Monday, February 02, 2009

Paper Review: A Comparative Study on Feature Selection in Text Categorization

This paper was written by Yiming Yang from Carnegie Mellon University and Jan O. Pedersen from Verity, Inc., and was presented at ICML 97 (International Conference on Machine Learning). It has been cited 2742 times according to Google Scholar. Another seminal paper indeed!

This paper evaluates and compares five feature selection methods: document frequency (DF), information gain (IG), mutual information (MI), a χ2-test (CHI), and term strength (TS).




A major difficulty of text categorization is the high dimensionality of the feature space. Automatic feature selection methods include the removal of non-informative terms according to corpus statistics, and the construction of new features which combine lower level features into higher-level orthogonal dimensions.


Document frequency is the number of documents in which a term occurs. Terms with low document frequency (below a threshold) were removed from the feature space, on the assumption that rare terms are either non-informative for category prediction or not influential in global performance. Information gain measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a document. Terms whose information gain was below a threshold were removed from the feature space. Mutual information becomes zero if the term t and category c are independent. A weakness of mutual information is that the score is strongly influenced by the marginal probabilities of terms. The χ2 statistic measures the lack of independence between t and c, and it is known not to be reliable for low-frequency terms. The term strength method estimates term importance based on how commonly a term is likely to appear in “closely-related” documents, and it is based on document clustering. The average number of related documents per document is used in threshold tuning.
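To make these scores a bit more concrete, here is a minimal Python sketch of my own (not code from the paper). It scores one term against one category from a two-way table of document counts. The function name term_scores and the count names a, b, c, d are made up for illustration, the logarithms are base 2 by assumption, and the information gain is a two-class simplification (the category versus everything else) rather than a sum over all categories.

```python
import math


def _entropy(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def term_scores(a, b, c, d):
    """Score one term against one category from document counts.

    a: documents that contain the term and belong to the category
    b: documents that contain the term but are outside the category
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    (Illustrative names, not notation taken from the reviewed paper.)
    """
    n = a + b + c + d
    if n == 0:
        raise ValueError("need at least one document")

    # Document frequency: how many documents contain the term.
    df = a + b

    # Mutual information: log of P(t, c) / (P(t) * P(c)); zero under independence.
    mi = math.log2(a * n / ((a + c) * (a + b))) if a > 0 else float("-inf")

    # Chi-square statistic for the 2x2 table; zero under independence.
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    chi = n * (a * d - c * b) ** 2 / denom if denom > 0 else 0.0

    # Information gain: entropy of the category minus the entropy that
    # remains once we know whether the term is present or absent.
    h_c = _entropy([(a + c) / n, (b + d) / n])
    h_t = _entropy([a / (a + b), b / (a + b)]) if a + b > 0 else 0.0
    h_not_t = _entropy([c / (c + d), d / (c + d)]) if c + d > 0 else 0.0
    ig = h_c - ((a + b) / n) * h_t - ((c + d) / n) * h_not_t

    return {"df": df, "ig": ig, "mi": mi, "chi": chi}
```

With these definitions, term_scores(10, 10, 40, 40) describes a term that is independent of the category, so IG, MI, and CHI all come out as zero. Comparing term_scores(20, 0, 0, 80) with term_scores(1, 0, 19, 80) hints at the marginal-probability weakness of MI: the second term occurs in a single document, yet it gets the same MI score as the first, while its information gain is much lower.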


The paper used kNN and LLSF because both are top-performing, state-of-the-art classifiers; both scale to large classification problems; both are m-ary classifiers providing a global ranking of categories given a document; both are context sensitive; and the two classifiers differ statistically. The Reuters-22173 collection and the OHSUMED collection were used for the experiments, and recall and precision were used as performance measures.


Experiments range from using the full vocabulary to removing 98% of the unique terms. Results show that IG and CHI are most effective in aggressive term removal. DF is comparable to IG and CHI with up to 90% term removal. TS is comparable with up to 50-60% term removal. MI has inferior performance due to a bias favoring rare terms and a strong sensitivity to probability estimation errors. Results also show strong correlation among DF, IG and CHI scores.
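The thresholding idea behind these experiments can be sketched in a few lines of Python. This is only my own illustration, assuming documents are already tokenized into lists of terms. The helper names document_frequency and keep_top_terms are hypothetical and do not come from the paper.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents it appears in."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term at most once per document
    return df


def keep_top_terms(scores, removal_fraction):
    """Return the terms that survive after dropping the lowest-scoring fraction.

    scores: mapping from term to score (DF here, but IG or CHI would also work)
    removal_fraction: share of the vocabulary to remove, e.g. 0.9 for 90%
    """
    # Highest score first; ties are broken alphabetically to keep runs repeatable.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    n_keep = max(1, round(len(ranked) * (1 - removal_fraction)))
    return set(ranked[:n_keep])


docs = [["a", "b", "a"], ["a", "c"], ["a", "b"], ["d"]]
df = document_frequency(docs)
vocabulary = keep_top_terms(df, 0.5)
```

Any of the scores can be plugged in as the ranking, which is what makes a side-by-side comparison at the same removal level possible.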




Video of the Day: Industrious Clock


Credit: Yugo Nakamura

Sunday, February 01, 2009

Random Thoughts: My Philosophy for Scholarship in Computer Science

I actually wrote this article to meet the requirement of a take home exam, but I thought you'd still find something useful. So here it is:

My Philosophy for Scholarship in Computer Science


When I first started graduate school at BYU three years ago, I had no idea what research in Computer Science (CS) was all about. Three years later, after successfully defending my master's thesis and publishing (presenting) multiple papers, I find myself much more comfortable with research work. I am not saying that research in Computer Science is easy. In fact, it is quite difficult and requires strong determination and much devotion. I do, however, believe that there are secrets to success in Computer Science scholarship. If we follow these guidelines, research can be fun and rewarding, and we can also make good contributions to the advancement of the field. In this document, I would like to share my “philosophy” for key elements of research in Computer Science.


(a) Reading and evaluating CS literature


Literature review is a critical element of CS research. It allows a researcher to see what others have done to solve the problem the researcher is interested in. One purpose of the exercise is to not repeat prior art; another equally important purpose is to learn from previous work, see other researchers’ insights, and identify possible places for improvement. A good place to start is to find a survey paper for your research problem, which can help you identify key researchers in the field. Your advisor can also provide useful information in this regard. Next, pick a few seminal papers from the key researchers. See what papers they reference and what papers cite them. We are blessed with the Internet, so take advantage of it and use research tools such as Google Scholar, CiteSeer, and DBLP in your search for papers. Give priority to papers published at top journals and conferences. An important skill in literature review is focused reading. There are too many papers out there and you cannot read them all. Quickly skim through the abstract and conclusions to determine if the paper is related to your research problem. If it is, then do focused reading with the goal of identifying the paper’s contributions and shortcomings. Finally, make literature review an ongoing process throughout your research because new research results continuously appear.


(b) Doing CS research


The first step of doing CS research is to define a research problem. The problem has to be important in the field. A problem no one cares about is useless. The problem also has to be well defined and scoped. Setting a clear target makes reaching the target possible. Restricting the problem to a reasonable scale provides focus and allows incremental advancement in the research field. The problem also needs to be interesting to the researcher because passion is the driving force that enables the researcher to endure the many difficulties during research and keeps the research going. The second step in CS research is to select the appropriate research pattern for conducting research investigation. Choosing the right research pattern helps the researcher identify directions to follow, strategies for validation, and fallacies to avoid. The next step is to create a testable hypothesis. The keyword here is “testable”, meaning other people can repeat the experiments and achieve similar results. This step also implicitly requires careful experiment design and meaningful metrics. Overlooked factors in experiments and badly selected metrics can skew experiment results or lead to erroneous conclusions.


(c) Writing-up CS research


A lot of people think that writing is important only because it leads to publication, which shows research achievements and also lets others know what the researchers are doing, but writing is more than that. Writing should begin at the same time research begins, because writing shapes research and is itself an important research activity. Writing things down on paper forces one to formulate and clarify. It helps a researcher better understand what to say and better organize thoughts and materials. When writing for publication, the most important thing is to choose the right set of keywords so your paper ends up in the hands of the right group of reviewers. A pithy title and a well-written abstract (including the four elements described in Secret #1 in class slides) are also important because they might be the only things the busy reviewers read. To make a reviewer’s job easier, it is a good idea to embed answers to review questions (such as contributions, new results, and validation) in the abstract, introduction, and conclusion sections. For the body of the paper, write precisely and concisely. Know the technical capabilities of your readers and then write accordingly, striking the right balance of detail. Use simple words and active verbs. Avoid “official style” writing because you will sound like a lawyer and really annoy your readers.


(d) Teaching CS topics


Teaching is not about showing off knowledge. A good teacher doesn’t focus on teaching facts (they can be looked up), but instead, focuses on teaching learning methods and problem solving skills. A great teacher not only teaches, but also inspires students to seek knowledge on their own. For CS topics, teaching should focus on algorithms (including complexity analysis) and how to apply them to solve real-world problems. An ideal CS course should have interesting problems for students to solve at the beginning of the course, and throughout the course, students can apply algorithms, techniques, and problem solving skills to these problems creatively. They are also encouraged to apply skills learned in the course to problems the students are interested in themselves. A successful teacher creates a stimulating environment that encourages students to participate and learn.


(e) Presenting CS research


Presenting your research is like advertising your ideas to peers. It promotes discussion and collaboration among researchers who are interested in the same research problem. The presentation should be self-contained and tell a complete story. However, it should not include all the technical details, because that makes the presentation tedious and bloated. If the audience wants more details about your research, they can always read your paper. Presentation slides should be simple and easy to follow, lively but not too fancy. Try to make your story interesting and impressive, and when necessary, use simple graphs to show trends in your data analysis. Speak clearly, and it is okay to use an excited tone. Don’t be afraid to show your passion, but also don’t turn your presentation into a political speech. Always back up your claims with facts or experimental results. Adding a bit of humor is always helpful. Just make sure your audience will understand your joke. Lastly, don’t perform live demos unless you are very sure the demos will work smoothly. Otherwise, you will look like a fool on stage.


(f) Evaluating the research work of others


Reviewing papers of others is a great opportunity to learn, to teach, and to contribute to the research community. The reviewer can learn not only what mistakes to avoid in his or her own writing, but also good writing styles and techniques the reviewer can follow. The review process also allows the reviewer to provide insightful feedback to the authors in order to help them improve the quality of their research and writing. A review report normally contains the reviewer’s summary of the paper, comments about the solution described, experiments performed, metrics used, and conclusions drawn, and comments about the presentation. It is possible that a reviewer will reject a paper that has great ideas but does a poor job presenting them. Remember to always be respectful and use a positive tone rather than a negative tone. Realize that no paper is perfect. Use a simple rule of thumb when evaluating others’ work: review the work of others the same way you would want others to review your work.


At the end of the document, I’d like to emphasize “passion” in all the elements of research. Passion is the best motivation for keeping the research going. Passion also leads to the best reward for research progress: feelings of self-fulfillment. I hope you find great passion in your research, which makes research more than work and turns it into a lifestyle. With passion, we strive for excellence, and make the world a better place.



Picture of the Day:

Picture credit: Argyris Kaldiris

Saturday, January 31, 2009

Joy of Life: Volume 1 Chapter 3

Volume One: The City by the Sea
-- written by Maoni

Chapter 3: Training and Schooling


Fan Xian had no idea that he was training in a very profound inner energy manuscript. Any normal martial art student would have trained with the utmost scrupulousness and prudence; furthermore, he would undoubtedly have asked his master or trustworthy friends to monitor alongside.
The most dangerous part of the manuscript was at the very beginning. When depositing energy into the Snow Mountain of the Dantian[1], the practitioner’s reaction time with respect to his body would become dramatically slower than his mind, with the direct result that the practitioner would feel as though he had completely lost control of his body.
If the practitioner had no such prior experience, he would easily come to the conclusion that he was going through a fire deviation[2], and then forcefully retract the inner energy – if he were lucky, and if he were exceptionally good at controlling his inner energy flow, he could possibly direct the energy inside his body back into the various chi channels and meridians; but that would also mean the practice achieved nothing. For new learners, such panic could very likely lead to obstacles with one’s spirit and mind.
The fact that Fan Xian, a new learner, didn’t end up with fire deviation, and even had a better understanding of the profound experience than advanced practitioners, must be attributed to his life experience and luck.
When he started to train in the nameless inner energy, he lived inside the body of an infant, which had not returned most of the congenital energy inherited from the mother’s womb back to the outside world. With this inside his body, he could get twice the result with half the effort. Additionally, most of the congenital energy actually ended up stored inside his chi channels and meridians magically.
The most common problem of obstacles in spirit and mind for practitioners didn’t cause many problems for Fan Xian.
In his prior life, Fan Xian lived in a sickbed for several years and was already accustomed to not being able to control his body with his mind. When he had the similar feeling, he didn’t panic. Instead, he actually felt slight joy and warmth for being able to find some of the remnant memories.
When he trained for the first time, even though he felt as though the energy streams inside him seemed chaotic and he felt his mind was disconnected from his body, he wasn’t afraid at all. Because there was no interference from fear, his mind was clear and focused. He overcame the most difficult barrier with ease.
Ever since then, training became easy. All he had to do was to recite the scripts inside his head, and then he would naturally enter a state of trance – that was why each day’s nap time for Fan Xian was a very pleasant experience, and why it was so hard to wake him up.
For an ordinary inner energy practitioner, entering into a state of meditation trance would have required good luck together with special circumstances. Using each day’s nap in place of meditation was indeed a luxury no one could have imagined.
God certainly had special favor for this kid.
……
……
Once he was awake, Fan Xian would roll his delicate and lovely little face swiftly in the warm towel in the servant girl’s hands, which counted as a face wash. Then in the afternoon, it was time to take lessons in the study from the literature teacher, who was respectfully invited by the Count’s Manor from the East-Sea County. The teacher wasn’t very old, roughly in his thirties, yet carried the distinct flavor of a pedantic scholar.
A literature improvement movement had been going on in the Qing Empire for the last ten years, initiated by an article calling for improvement in literary works written by the Minister of the Literature Cabinet, Mister Hu. Today’s literary world was filled with battles between the ancient prose style and the modern text style.
The ancient prose style was identical to the Classical Chinese style[3] in Fan Xian’s memory. The modern text style was similar to Vernacular Chinese style[4], only with more elegant wording.
Fan Xian’s teacher was a fan of the ancient prose style, so Fan Xian’s curriculum included many classic scriptures. Although these scriptures were not the same as the Four Books and the Five Classics[5], interestingly, the contents and ideas weren’t that different, and also had the separation of Confucianism, Mohist School, Legalists, and Taoism.
As a result, during the first lesson, Fan Xian became very suspicious about where he really was.
It was a very hot and muggy summer day. The study was also hot and humid. The teacher pushed open the south-facing window, and immediately the singing of cicadas entered the room together with the cooling breeze. Turning around, the teacher found his little pupil bending over the desk staring blankly. He was just about to reprimand him when he suddenly caught sight of that innocent little face, which touched a soft spot in his heart.
The teacher actually took pleasure in this little pupil of his. At such a young age, the kid was able to speak in an orderly manner, and could even comprehend to some extent the sublime words with deep meaning in the scriptures. It was really no easy task for a mere four-year-old.
The teacher also had doubts himself. Why was the Count of Southernland so anxious, and why did he make such high demands in the invitation letter to him? Because of such high pressure, he had no choice but to teach the four-year-old books of scripture. If it had been a normal family, kids at this age would have been learning just characters and memorizing some beginning literature.
When the lesson was over, Fan Xian saluted the teacher in good manners, and then waited respectfully until the teacher left the study. Then stripping off the sweat-soaked robe, he ran out of the study, as the servant girl followed hurriedly while calling out for him to be careful.
As soon as he entered the main courtyard, Fan Xian stopped running. Putting on the most innocent smile, he strolled inside, and as soon as he saw the Old Madam sitting in the middle of the hall, he called out in his baby voice, “Grandma!”
The Old Madam’s face was kind and affable, and the deep wrinkles were clear marks of the many passing years. Only the occasional shine in her eyes would briefly reveal her shrewdness and the fact that she was not just a simple woman – it was said that the Count of Southernland would not have been able to achieve his fame and power without the Old Madam’s connections in the Capital City.
“What have you learned today?”
Fan Xian stood obediently in front of the Old Madam’s chair and described what the teacher had taught him that day. Then after a respectful salute, he headed toward the side hall to have dinner together with his younger sister.
The Old Madam and her grandson didn’t seem to be very close to each other. Maybe it was because Fan Xian was only a baseborn son, though the Old Madam didn’t mistreat him. She always had high demands for him, which naturally resulted in the estrangement.
Fan Xian could still remember that when he had been only one year old, the Old Madam had once held him tightly in her arms late at night and wept dearly. The Old Madam, of course, did not think a one-year-old infant would understand her words, or even remember them.
“Kid, if you have to blame someone, then blame your father. You poor little thing! Losing your mother right after you came to this world.”
……
……
What kind of family did he belong to? That was the biggest question occupying Fan Xian’s mind. He had already experienced an assassination attempt right after he came to this world. He already knew his father was a high-ranking official in the Capital City, the Count of Southernland, whom he had never met in person. But who was his mother? At the time of the assassination attempt, the Count of Southernland was still at His Majesty’s service in the conquering army. Those assassins had certainly targeted his mother directly.
Since the soul inside his body really belonged to another world, he obviously didn’t think much of the father-son relation between him and the never-met Count of Southernland. Only occasionally would he think of that woman who had already left this world, the woman whom he should have called mother.



[1] Dantian literally means "cinnabar or red field" and is loosely translated as "elixir field". It is described as an important focal point for internal meditative techniques and refers specifically to the physical center of gravity located in the abdomen (about three finger widths below and two finger widths behind the navel).
[2] Improper training in inner energy that normally causes severe injury to self in the forms of paralysis and damage to internal chi channels and meridians. 
[3] Characterized by very short verses and very uncommon characters used to express very rich meanings, in order to demonstrate the author’s intellectual and language capabilities. This type of literature can only be understood by well-educated scholars. However, sometimes the extreme brevity creates ambiguity and no one is sure what the original author really wanted to say.
[4] Using verses and words that are closer to the real world language, which are easier for common people to understand.
[5] The Four Books (The Great Learning, The Doctrine of the Mean, The Confucian Analects, and The Works of Mencius) and The Five Classics (The Book of Songs, The Book of History, The Book of Changes, The Book of Rites, and The Spring and Autumn Annals) are required readings for Confucian scholars starting from the Southern Song Dynasty.


Now support the author Maoni by clicking this link, and support the translator Lanny by following my blog! :)


Picture of the Day: (Really slides)


Photo Credits: Sean Gallather

Friday, January 30, 2009

Robot of the Day: Ikhana, another Predator UAV used by NASA to support fire-fighting

In my previous posts, I showed the MQ-1 Predator and the MQ-9 Predator B Reaper UAVs. I feel obligated, especially after the previous post talking about my opinion on using military technology for civilian use, to introduce you to another Predator UAV used exclusively for scientific data collection and fire-fighting support: the NASA Ikhana Unmanned Science and Research Aircraft System.


The name Ikhana is a Native American word that means "intelligence". It is really just an MQ-9 Predator B Reaper, except that the payload does not include missiles or bombs, but scientific equipment (mostly additional sensors, such as thermal sensors) used to help see through heavy smoke from wildfires. Instead of killing people, the Predator UAV is used to save lives, the lives of firefighters and local residents affected by wildfires.


Speaking of fire-fighting, there were two big fires, the "Station Fire" and the "Guiberson Fire", near Los Angeles in the last California fire season. These two fires were very close to local residences and damaged and threatened many houses and properties. Ikhana was used for data collection in the Station Fire. Although the Ikhana UAV was not used in the Guiberson Fire, I was fortunate enough to be part of a NASA team (as a continuation of my summer internship with NASA) and actually flew to LA and participated in the fire-fighting, working on another NASA project that helps the firefighters collect information. This project is called GeoCam, and it is the topic of another blog post in the near future.

Picture of the Day:


Here's me videotaping a user from the US Forest Service using an Android phone (running our GeoCam client software) to take a picture of a live fire at the Guiberson Fire, and he took a picture of me instead. The other person in the picture is the technical manager of the project, a NASA scientist.

Thursday, January 29, 2009

AI and Robots: Rise of the Machines -- Robotics Professors Discussing AI

Yes, they are coming, the robots and the intelligent machines, into every aspect of our lives. Is this good or bad? Are we understanding humanity better during the process? Or are we really digging our own graves?

Why are people so fascinated by robots? (I know my answer: I want to build a robot that does all my work! :) Why do humans have such a dystopian view of the future where robots are concerned? These are some of the questions asked in an interview with Noel Sharkey, a Robotics and Artificial Intelligence professor at the University of Sheffield, UK. When asked when the first mass-produced robots will have a serious impact on society, Dr. Sharkey expressed his concerns about the advances of military robots. (Remember the Predator and Reaper robots we talked about recently?)

Every roboticist and AI researcher in the US knows that the majority of the research in this field is driven by military funding and initiatives. The biggest one of them all is DARPA, which stands for Defense Advanced Research Projects Agency. Many of you might not be aware that the Internet was actually created in a research program of ARPA (the earlier name for DARPA). As a matter of fact, part of my research is funded by the Army Research Lab. After all, there's not much difference between Search and Rescue and Seek and Destroy.

I, and I believe a great number of other researchers, are strong believers in Asimov's Three Laws of Robotics, especially the first half of the first one:
A robot may not injure a human being or, through inaction, allow a human being to come to harm.
On the other hand, I must also admit that the military has been a driving force behind advances in technology that benefit the entire human race. Just to name a few: the Internet, satellites, and cell networks. Therefore, the stance I take on this issue is that robotics and AI technologies developed for military purposes can also be put to use for ordinary people, and I shall work very hard to help make that a reality.









So how do we make sure we don't create the Terminator Scenario? Some people believe we should upgrade ourselves and turn ourselves into cyborgs (half human and half machine), so that we, instead of robots, still dominate the world. A robotics researcher friend of mine at NASA Ames (not naming names) holds this view, and a Robotics and Artificial Intelligence professor at the University of Reading, UK, is another strong believer.










My guess is that robots will become more capable and intelligent, and humans will also become more capable with wearable or implanted devices. There are already robotic suits enabling wearers to carry weights far exceeding human capabilities, and there are also robotic hands that connect directly to nerves in people's arms and are controlled by the human brain. I am SERIOUSLY not joking about these things, and you'll be reading more about them in my future blog posts (specifically under the Robot of the Day label).

Whether to rely on robots or become a cyborg is your own choice, but the time when you have to make that choice might not be very far in the future. At least one thing is clear: We live in a very exciting era of the world, and we should enjoy it!!

Picture of the Day:

I actually have not seen this movie. Is it good?