【Question answering of the past and present】

A pre-existence

The traditional question answering (QA) system is an application of Artificial Intelligence (AI). It is usually confined to a very narrow and specialized domain, which is basically made up of a hand-crafted knowledge base with a natural language interface. As the field is narrow, the vocabulary is very limited, and its pragmatic ambiguity can be effectively under control. Questions are highly predictable, or close to a closed set, the rules for the corresponding answers are fairly straightforward. Well-known projects in the 1960s include LUNAR, a QA system specializing in answering questions about the geological analysis on the lunar samples collected from the Apollo's landing on the Moon. SHRDLE is another famous QA expert system in AI history, it simulates the operation of a robot in the toy building world. The robot can answer the question of the geometric state of a toy and listen to the language instruction for its operation.

These early AI explorations seemed promising, revealing a fairy-tale world of scientific fantasy, greatly stimulating our curiosity and imagination. Nevertheless, in essence, these are just toy systems that are confined to the laboratory and are not of much practical value. As the field of artificial intelligence was getting narrower and narrower (although some expert systems have reached a practical level, majority AI work based on common sense and knowledge reasoning could not get out beyond lab), the corresponding QA systems failed to render meaningful results. There were some conversational systems (chatterbots) that had been developed thus far and became children's popular online toys (I remember at one time when my daughter was young, she was very fond of surfing the Internet to find various chatbots, sometimes deliberately asking tricky questions for fun. Recent years have seen a revival of this tradition by industrial giants, with some flavor seen in Siri, and greatly emphasized in Microsoft's Little Ice).

2. Rebirth

Industrial open-domain QA systems are another story, it came into existence with the development of the Internet boom and the popularity of search engines. Specifically, the open QA system was born in 1999, when the TREC-8 (Eighth Text Retrieval Conference) decided to add a natural language QA track of competition, funded by the US Department of Defense's DARPA program, administrated by the United States National Institute of Standards and Technology (NIST), thus giving birth to this emerging QA community. Its opening remarks when calling for the participation of the competition are very impressive, to this effect:

Users have questions, they need answers. Search engines claim that they are doing information retrieval, yet the information is not an answer to their questions but links to thousands of possibly related files. Answers may or may not be in the returned documents. In any case, people are compelled to read the documents in order to find answers. A QA system in our vision is to solve this key problem of information need. For QA, the input is a natural language question, the output is the answer, it is that simple.

It seems of benefit to introduce some background for academia as well as the industry when the open QA was born.

From the academic point of view, the traditional sense of artificial intelligence is no longer popular, replaced by the large-scale corpus-based machine learning and statistical research. Linguistic rules still play a role in the field of natural language, but only as a complement to the mainstream machine learning. The so-called intelligent knowledge systems based purely on knowledge or common sense reasoning are largely put on hold by academic scholars (except for a few, such as Dr. Douglas Lenat with his Cyc). In the academic community before the birth of open-domain question and answering, there was a very important development, i.e. the birth and popularity of a new area called Information Extraction (IE), again a child of DARPA. The traditional natural language understanding (NLU) faces the entire language ocean, trying to analyze each sentence seeking a complete semantic representation of all its parts. IE is different, it is task-driven, aiming at only the defined target of information, leaving the rest aside. For example, the IE template of a conference may be defined to fill in the information of the conference [name], [time], [location], [sponsors], [registration] and such. It is very similar to filling in the blank in a student's reading comprehension test. The idea of task-driven semantics for IE shortens the distance between the language technology and practicality, allowing researchers to focus on optimizing tasks according to the tasks, rather than trying to swallow the language monster at one bite. By 1999, the IE community competitions had been held for seven annual sessions (MUC-7: Seventh Message Understanding Conference), the tasks of this area, approaches and the then limitations were all relatively clear. The most mature part of information extraction technology is the so-called Named Entity (NE tagging), including identification of names for human, location, and organization as well as tagging time, percentage, etc. The state-of-the-art systems, whether using machine learning or hand-crafted rules, reached a precision-recall combined score (F-measures) of 90+%, close to the quality of human performance. This first-of-its-kind technological advancement in a young field turned out to play a key role in the new generation of open-domain QA.

In industry, by 1999, search engines had grown rapidly with the popularity of the Internet, and search algorithms based on keyword matching and page ranking were quite mature. Unless there was a methodological revolution, the keyword search field seemed to almost have reached its limit. There was an increasing call for going beyond basic keyword search. Users were dissatisfied with search results in the form of links, and they needed more granular results, at least in paragraphs (snippets) instead of URLs, preferably in the form of direct short answers to the questions in mind. Although the direct answer was a dream yet to come true waiting for the timing of open-domain QA era, the full-text search more and more frequently adopted paragraph retrieval instead of simple document URLs as a common practice in the industry, the search results changed from the simple links to web pages to the highlighting of the keywords in snippets.

In such a favorable environment in industry and academia, the open-domain question answering came onto the stage of history. NIST organized its first competition, requiring participating QA systems to provide the exact answer to each question, with a short answer of no more than 50 bytes in length and a long answer no more than 250 bytes. Here are the sample questions for the first QA track:

Who was the first American in space?
Where is the Taj Mahal?
In what year did Joe DiMaggio compile his 56-game hitting streak?

3. Short-lived prosperity

What are the results and significance of this first open domain QA competition? It should be said that the results are impressive, a milestone of significance in the QA history. The best systems (including ours) achieve more than 60% correct rate, that is, for every three questions, the system can search the given corpus and is able to return two correct answers. This is a very encouraging result as a first attempt at an open domain system. At the time of dot.com's heyday, the IT industry was eager to move this latest research into information products and revolutionize the search. There were a lot of interesting stories after that (see my related blog post in Chinese: "the road to entrepreneurship"), eventually leading to the historical AI event of IBM Watson QA beating humans in Jeopardy.

The timing and everything prepared by then from the organizers, the search industry, and academia, have all contributed to the QA systems' seemingly miraculous results. The NIST emphasizes well-formed natural language questions as appropriate input (i.e. English questions, see above), rather than traditional simple and short keyword queries. These questions tend to be long, well suited for paragraph searches as a leverage. For competition's sake, they have ensured that each question asked indeed has an answer in the given corpus. As a result, the text archive contains similar statements corresponding to the designed questions, having increased the odds of sentence matching in paragraph retrieval (Watson's later practice shows that from the big data perspective, similar statements containing answers are bound to appear in text as long as a question is naturally long). Imagine if there are only one or two keywords, it will be extremely difficult to identify relevant paragraphs and statements that contain answers. Of course, finding the relevant paragraphs or statements is not sufficient for this task, but it effectively narrows the scope of the search, creating a good condition for pinpointing the short answers required. At this time, the relatively mature technology of named entity tagging from the information extraction community kicked in. In order to achieve the objectivity and consistency in administrating the QA competition, the organizers deliberately select only those questions which are relatively simple and straightforward, questions about names, time or location (so-called factoid questions). This practice naturally agrees with the named entity task closely, making the first step into open domain QA a smooth process, returning very encouraging results as well as a shining prospect to the world. For example, for the question "In what year did Joe DiMaggio compile his 56-game hitting streak?", the paragraph or sentence search could easily find text statements similar to the following: "Joe DiMaggio's 56 game hitting streak was between May 15, 1941 and July 16". An NE system tags 1941 as time with no problem and the asking point for time in parsing the wh-phrase "in what year" is also not difficult to decode. Therefore, an exact answer to the exact question seems magically retrieved from the sea of documents to satisfy the user, like a needle found in the haystack. Following roughly the same approach, equipped with gigantic computing power for parallel processing of big data, 11 years later, IBM Watson QA beat humans in the Jeopardy live show in front of the nationwide TV audience, stimulating the entire nation's imagination with awe for this technology advance. From QA research perspective, the IBM's victory in the show is, in fact, an expected natural outcome, more of an engineering scale-up showcase rather than research breakthrough as the basic approach of snippet + NE + asking-point has long been proven.

A retrospect shows that adequate QA systems for factoid questions are invariably combined with a solid Named Entity module and a question parser for identifying asking points. As long as there is an IE-indexed big data behind, with information redundancy as its nature, factoid QA is a very tractable task .

4. State of the art

The year 1999 witnessed the academic community's initial success of the first open-domain QA track as a new frontier of the retrieval world. We also benefited from that event as a winner, having soon secured a venture capital injection of $10 million from the Wall Street. It was an exciting time shortly after AskJeeves' initial success in presenting a natural language interface online (but they did not have the QA technology for handling the huge archive for retrieving exact answers automatically, instead they used human editors behind the scene to update the answers database). A number of QA start-ups were funded. We were all expecting to create a new era in the information revolution. Unfortunately, the good times are not long, the Internet bubble soon burst, and the IT industry fell into the abyss of depression. Investors tightened their monetary operations, the QA heat soon declined to freezing point and almost disappeared from the industry (except for giants' labs such as IBM Watson; in our case, we shifted from QA to mining online brand intelligence for enterprise clients). No one in the mainstream believes in this technology anymore. Compared with traditional keyword indexing and searching, the open domain QA is not as robust and is yet to scale up to really big data for showing its power. The focus of the search industry is shifting from depth back to breadth, focusing on the indexing coverage, including the so-called deep web. As the development of QA systems is almost extinct from the industry, this emerging field stays deeply rooted in the academic community, developed into an important branch, with increasing natural language research from universities and research labs. IBM later solves the scale-up challenge, as a precursor of the current big data architectural breakthrough.

At the same time, scholars begin to summarize the various types of questions that challenge QA. A common classification is based on identifying the type of questions for their asking points. Many of us still remember our high school language classes, where the teacher stressed the 6 WHs for reading comprehension: who / what / when / where / how / why. (Who did what when, where, how and why?) Once answers to these questions are clear , the central stories of an article are in hands. As a simulation of human reading comprehension, the QA system is designed to answer these key WH questions as well. It is worth noting that these WH questions are of different difficulty levels, depending on the types of asking points (one major goal for question parsing is to identify the key need from a question, what we call asking point identification, usually based on question parsing of wh-phrases and other question clues). Those asking points corresponding to an entity as an appropriate answer, such as who / when / where, are relatively easy questions to answer (i.e. factoid questions). Another type of question is not simply answerable by an entity, such as what-is / how / why, there is consensus that answering such questions is a much more challenging task than factors questions. A brief introduction to these three types of "tough" questions and their solutions are presented below as a showcase of the on-going state to conclude this overview of the QA journey.

What/who is X? This type of questions is the so-called definition question, such as What is iPad II? Who is Bill Clinton? This type of question is typically very short, after the wh-word and the stop word "is" are stripped in question parsing, what is left is just a name or a term as input to the QA system. Such an input is detrimental to the traditional keyword retrieval system as it ends up with too many hits from which the system can only pick the documents with the most keyword density or page rank as returns. But from QA perspective, the minimal requirement to answer this question is a definition statement in the forms of "X is a ...". Since any entity or object is in multiple relationships with other entities and involved in various events as described in the corpus, a better answer to the definition question involves a summary of the entity with all the links to its key associated relations and events, giving a profile of the entity. Such technology is in existence, and, in fact, has been partly deployed today. It is called knowledge graph, supported by underlying information extraction and fusion. The state-of-the-art solution for this type of questions is best illustrated in the Google deployment of its knowledge graph in handling queries of a short search for movie stars or other VIP.

The next challenge is how-questions, asking about a solution for solving a problem or doing something, e.g. How can we increase bone density? How to treat a heart attack? This type of question calls for a summary of all types of solutions such as medicine, experts, procedures, or recipe. A simple phrase is usually not a good answer and is bound to miss varieties of possible solutions to satisfy the information need of the users (often product designers, scientists or patent lawyers) who typically are in the stage of prior art research and literature review for a conceived solution in mind. We have developed such a powerful system based on deep parsing and information extraction to answer open-domain how-questions comprehensively in the product called Illumin8, as deployed by Elsevier for quite some years. (Powerful as it is, unfortunately, it did not end up as a commercial success in the market from revenue perspective.)

The third difficult question is why. People ask why-questions to find the cause or motive of a phenomenon, whether an event or an opinion. For example, why people like or dislike our product Xyz? There might be thousands of different reasons behind a sentiment or opinion. Some reasons are explicitly expressed (I love the new iPhone 7 because of its greatly enhanced camera) and more reasons are actually in some implicit expressions (just replaced my iPhone , it sucks in battery life). An adequate QA system should be equipped with the ability to mine the corpus and summarize and rank the key reasons for the user. In the last 5 years, we have developed a customer insight product that can answer why questions behind the public opinions and sentiments for any topics by mining the entire social media space.

Since I came to the Silicon Valley 9 years ago, I have been lucky, with pride, in having had a chance to design and develop QA systems for answering the widely acknowledged challenging questions. Two products for answering the open-domain how questions and why-questions in addition to deep sentiment analysis have been developed and deployed to global customers. Our deep parsing and IE platform is also equipped with the capability to construct deep knowledge graph to help answer definition questions, but unlike Google with its huge platform for the search needs, we have not identified a commercial opportunity to deploy that capability for a market yet.

This piece of writing first appeared in 2011 in my personal blog, with only limited revisions since. Thanks to Google Translate at https://translate.google.com/ for providing a quick basis, which was post-edited by myself.

The Anti-Eliza Effect, New Concept in AI

"Knowledge map and open-domain QA (1)" (in Chinese)

"knowledge map and how-question QA (2)" (in Chinese)

【Ask Jeeves and its million-dollar idea for human interface in 】(in Chinese)

Dr Li’s NLP Blog in English