[Abstract]
In Computational Linguistics, there are two basic approaches to natural language processing: traditional rule systems and mainstream machine learning. They are complementary, and each has its pros and cons. However, as machine learning is the dominant mainstream philosophy, reflected in the overwhelming ratio of papers published in academia, the area seems heavily biased against the rule-system methodology. The tremendous success of machine learning as applied to a list of natural language tasks has reinforced the mainstream's pride and prejudice in favor of one school and against the other. As a result, numerous specious views are often taken for granted without scrutiny, including attacks on supposed defects of rule systems based on incomplete induction or outright misconception. This is not healthy for NLP as an applied research area, and it exerts an inappropriate influence on young scientists entering the field. This is the first piece of a series of writings aimed at educating the public and confronting the prevalent prejudice, focused on an in-depth examination of the so-called hand-crafted defect of rule systems and the associated knowledge-bottleneck issue.
I. Introduction
Over 20 years ago, the field of NLP (natural language processing) went through a process of replacing traditional rule-based systems with statistical machine learning as the mainstream in academia. Put in the larger context of AI (Artificial Intelligence), this represents a classical competition, with its ups and downs, between the rationalist school and the empiricist school (Church 2011). It should be noted that the dominance of statistical approaches in this area has its historical inevitability. The old school was confined to toy systems in the lab for too long without a scientific breakthrough, while machine learning began showing impressive results on numerous NLP fronts at a much larger scale: initially on very low-level NLP tasks such as POS (Part-of-Speech) tagging and speech recognition / synthesis, and later expanding to almost all NLP tasks, including machine translation, search and ranking, spam filtering, document classification, automatic summarization, lexicon acquisition, named entity tagging, relationship extraction, event classification and sentiment analysis. This dominance has continued to grow to the present day, when the other school is largely "out" of almost all major NLP arenas, journals and top conferences. New graduates hardly realize its existence. There is an entire generation gap in academic training and in carrying on the legacy of the old school, with the exception of a very few survivors (including yours truly) in industry, because few professors are motivated to teach it at all, or are even qualified with in-depth knowledge of it, when funding and publication prospects for the old school have become more and more impossible. To many minds today, learning (or deep learning) is NLP, and NLP is learning; that is all. As for the "last century's technology" of rule-based systems, it reads more like a failure tale from distant history.
The pride and prejudice of the mainstream were demonstrated most clearly in the recent incident when Google announced its deep-learning-based SyntaxNet and proudly claimed it to be "the most accurate parser in the world", so categorical, with no conditions attached, and without even bothering to check the possible existence of the other school. This is not healthy (and philosophically unbalanced, too) for a broad area challenged by one of the most complex problems of mankind, namely decoding natural language understanding. As only one voice is heard, it is scary to observe that the area is packed with prejudice and ignorance with regard to the other school, some of it from leaders of the area. Specious comments are rampant and often taken for granted without scrutiny.
Prejudice itself is not the real concern, as it is part of the world around us and within us, something to do with human nature and our innate limitations and ignorance. What is really scary is the degree and popularity of such prejudice, represented in numerous misconceptions that can be picked up everywhere in this circle (I am not going to trace their sources, as they are everywhere, and people who have been in this area for some time know this is not Quixote's windmill but a reflection of reality). I will list below some of the myths and fallacies so deeply rooted in the area that they seem to have become cliches, or part of the community consensus. If one or more of the statements below sound familiar to you, and they do not strike you as opinionated or specious claims that cannot withstand scrutiny, then you might want to give the issue a second study, to make sure we have not been subconsciously brainwashed. The real damage is to the next generation, the new scholars coming to this field, who often do not get a chance to doubt.
For each statement listed, it is not difficult to cite a poorly designed, stereotypical rule system that exhibits the alleged defect; the misconception lies in the generalization that attributes the defect to the entire family of the school, ignoring the variety of designs and the progress made within it.
There are two types of misconceptions: one might be called a myth, the other is sheer fallacy. Myths arise as a result of incomplete induction: some people have observed or tried old-school rule systems of some sort that show signs of the stated defect, and then jump to conclusions that harden into myths. These myths call for in-depth examination and argument to get at the true picture. Fallacies, on the other hand, are simply untrue, and it is quite a surprise to see that even fallacies seem to be widely accepted as true by many, including some experts in this area; all we need to do is cite facts to prove them wrong. For example, the [Grammaticality Fallacy] says that a rule system can only parse grammatical text and cannot handle degraded text containing grammar mistakes. Facts speak louder than words: the sentiment engine we have developed for our main products is a parsing-supported, rule-based system that fully automatically extracts and mines public opinions and consumer insights from all types of social media, which are typical of degraded text. Third-party evaluations show that this system is the industry leader in the data quality of sentiments, significantly better than competitors adopting machine learning. The large-scale operation of our system in the cloud, handling terabytes of real-life social media big data (a year of social media in our index involves about 30 billion documents across more than 40 languages), also proves wrong what is stated in the [Scalability Fallacy] below.
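To make the degraded-text point concrete, here is a minimal, hypothetical sketch (emphatically not our production engine) of one common rule-system tactic: a normalization pass repairs predictable social-media noise before lexicon-driven rules apply, so a misspelled, ungrammatical post still yields a usable sentiment extraction. Every rule and lexicon entry below is an illustrative assumption.

```python
import re

# Illustrative normalization rules: repair predictable social-media noise
# before any lexicon/grammar rules apply (hypothetical toy, not a real engine).
NORMALIZATION = [
    (re.compile(r"\bu\b", re.I), "you"),
    (re.compile(r"\bluv\b", re.I), "love"),
    (re.compile(r"(\w)\1{2,}"), r"\1\1"),  # squeeze "soooo" -> "soo"
]

# Tiny weighted sentiment lexicon and negator list (purely illustrative).
SENTIMENT_LEXICON = {"love": 1.0, "great": 0.8, "hate": -1.0, "awful": -0.9}
NEGATORS = {"not", "never", "no"}

def normalize(text: str) -> list[str]:
    """Repair noisy spellings, then lower-case and tokenize."""
    for pattern, repl in NORMALIZATION:
        text = pattern.sub(repl, text)
    return re.findall(r"[a-z']+", text.lower())

def extract_sentiment(text: str) -> float:
    """Score a degraded snippet: lexicon hits, flipped by a preceding negator."""
    tokens = normalize(text)
    score = 0.0
    for i, tok in enumerate(tokens):
        if tok in SENTIMENT_LEXICON:
            polarity = SENTIMENT_LEXICON[tok]
            if i > 0 and tokens[i - 1] in NEGATORS:
                polarity = -polarity  # simple negation-scope rule
            score += polarity
    return score

# Ungrammatical, misspelled input still gets a sensible polarity:
print(extract_sentiment("i luv this fone soooo much!!"))  # > 0 (positive)
print(extract_sentiment("not great, kinda awful tbh"))    # < 0 (negative)
```

The point is architectural, not the toy rules themselves: because each rule is an explicit, inspectable unit, a degraded-text failure can be patched by adding a normalization or lexical rule, without retraining anything.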
Let us now list these widely spread rumors collected from the community about rule-based systems, to see if they ring a bell, before we dive into the first two core myths in separate blog posts to uncover the true picture behind them.
II. Top 10 Misconceptions against Rules
[Hand-crafted Myth] A rule-based system faces a knowledge bottleneck of hand-crafted development, while a machine learning system involves automatic training (implying no knowledge bottleneck). [see On Hand-crafted Myth and Knowledge Bottleneck]
[Domain Portability Myth] The hand-crafted nature of a rule-based system leads to poor domain portability, as rules have to be rebuilt each time we shift to a new domain; in the case of machine learning, since the algorithm and system are universal, a domain shift only involves new training data (implying strong domain portability). [see Domain Portability Myth]
[Fragility Myth] A rule-based system is very fragile and may break when facing unseen language data, so it cannot lead to a robust real-life application.
[Weight Myth] Since there are no statistical weights associated with the results from a rule-based system, the data quality cannot be trusted with confidence.
[Complexity Myth] As a rule-based system is complex and intertwined, it easily reaches a standstill, with little hope for further improvement.
[Scalability Fallacy] The hand-crafted nature of a rule-based system makes it difficult to scale up for real-life applications; it is largely confined to the lab as a toy.
[Domain Restriction Fallacy] A rule-based system only works in a narrow domain and cannot work across domains.
[Grammaticality Fallacy] A rule-based system can only handle grammatical input in formal text (such as news, manuals or weather broadcasts); it fails in the face of degraded text involving misspellings and ungrammaticality, such as social media, oral transcripts, jargon or OCR output.
[Outdated Fallacy] A rule-based system is a technology of the last century; it is outdated (implying that it no longer works, or can no longer result in a quality system, in modern times).
[Data Quality Fallacy] Judged by the data quality of results, a machine learning system is better than a rule-based system. (cf. On Recall of Grammar Engineering Systems)
III. Retrospect and Reflection of the Mainstream
As mentioned earlier, a long list of misconceptions about the old school of rule-based systems has circulated in the mainstream for years. It may sound weird for an interdisciplinary field named Computational Linguistics to drift further and further away from linguistics; yet linguists play less and less of a role in today's NLP, which is dominated by statisticians. It seems widely assumed that with advanced deep learning algorithms, once data are available, a quality system can be trained without the need for linguistic design or domain expertise.
Not all mainstream scholars are one-sided and near-sighted. In recent years, insightful articles (e.g., Church 2011, Wintner 2009) began a serious process of retrospect and reflection and called for the return of linguistics: "In essence, linguistics is altogether missing in contemporary natural language engineering research. … I want to call for the return of linguistics to computational linguistics." (Wintner 2009) Let us hope that their voices will not be completely muffled in this new wave of deep-learning heat.
Note that the rule systems which linguists are good at crafting in industry are different from classical linguistic study: they are formalized models of linguistic analysis. For NLP tasks beyond the shallow level, an effective rule system is not a simple accumulation of computational lexicons and grammars, but involves a linguistic processing strategy (or linguistic algorithm) for different levels of linguistic phenomena (a minimal illustrative sketch follows the quotation below). However, this line of study on NLP platform design, system architecture and formalism has less and less room for academic discussion and publication, and research funding has become almost impossible to obtain; as a result, the new generation faces the risk of a cut-off legacy, with a full generation of talent gap in academia. Church (2011) points out that statistical research is so dominant and one-sided that only one voice is now heard. He is a visionary mainstream scientist, deeply concerned about the imbalance between the two schools in NLP and AI. He writes:
Part of the reason why we keep making the same mistakes, as Minsky and Papert mentioned above, has to do with teaching. One side of the debate is written out of the textbooks and forgotten, only to be revived/reinvented by the next generation. …
To prepare students for what might come after the low hanging fruit has been picked over, it would be good to provide today’s students with a broad education that makes room for many topics in Linguistics such as syntax, morphology, phonology, phonetics, historical linguistics and language universals. We are graduating Computational Linguistics students these days that have very deep knowledge of one particular narrow sub-area (such as machine learning and statistical machine translation) but may not have heard of Greenberg’s Universals, Raising, Equi, quantifier scope, gapping, island constraints and so on. We should make sure that students working on co-reference know about c-command and disjoint reference. When students present a paper at a Computational Linguistics conference, they should be expected to know the standard treatment of the topic in Formal Linguistics.
We ought to teach this debate to the next generation because it is likely that they will have to take Chomsky’s objections more seriously than we have. Our generation has been fortunate to have plenty of low hanging fruit to pick (the facts that can be captured with short ngrams), but the next generation will be less fortunate since most of those facts will have been pretty well picked over before they retire, and therefore, it is likely that they will have to address facts that go beyond the simplest ngram approximations.
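To make the "linguistic algorithm" point above concrete, here is a minimal sketch of a multi-level rule cascade, in which each level consumes the output of the previous one: lexical lookup, then phrase chunking, then a relation-extraction pattern. All lexicons and rules here are hypothetical toys; real grammar-engineering platforms are far richer, but the layered architecture is the point.

```python
# A toy multi-level rule cascade: each level builds on the previous one's
# output. All lexicons and rules below are hypothetical illustrations.

LEXICON = {  # Level 1 resource: word -> category
    "the": "DET", "a": "DET", "dog": "N", "cat": "N", "chased": "V",
}

def tag(tokens):
    """Level 1: POS tagging by lexicon lookup."""
    return [(tok, LEXICON.get(tok, "UNK")) for tok in tokens]

def chunk_np(tagged):
    """Level 2: group DET + N sequences into NP chunks (one toy grammar rule)."""
    chunks, i = [], 0
    while i < len(tagged):
        if (tagged[i][1] == "DET" and i + 1 < len(tagged)
                and tagged[i + 1][1] == "N"):
            chunks.append(("NP", tagged[i][0] + " " + tagged[i + 1][0]))
            i += 2
        else:
            chunks.append(tagged[i])
            i += 1
    return chunks

def extract_svo(chunks):
    """Level 3: relation pattern NP V NP -> (subject, verb, object)."""
    for i in range(len(chunks) - 2):
        if (chunks[i][0] == "NP" and chunks[i + 1][1] == "V"
                and chunks[i + 2][0] == "NP"):
            return (chunks[i][1], chunks[i + 1][0], chunks[i + 2][1])
    return None

tokens = "the dog chased a cat".split()
print(extract_svo(chunk_np(tag(tokens))))  # ('the dog', 'chased', 'a cat')
```

What this illustrates is the processing strategy itself: deciding which phenomena are handled at which level, and in what order, is exactly the kind of architectural knowledge that is neither a lexicon entry nor a grammar rule, and that the text above argues is at risk of being lost.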
About the Author
Dr. Wei Li is currently Chief Scientist at Netbase Solutions in Silicon Valley, leading the design and development of a multilingual sentiment mining system based on deep parsing. A hands-on computational linguist with 30 years of professional experience in Natural Language Processing (NLP), Dr. Li has a track record of making NLP work robustly. He has built three large-scale NLP systems, all transformed into real-life, globally distributed products.
Note: This is the author's own translation, with adaptation, of our paper in Chinese, which originally appeared as W. Li & T. Tang, "Pride and Prejudice of Main Stream: Rule-based System vs. Machine Learning", in Communications of Chinese Computer Federation (CCCF), Issue 8, 2013.
[Related]
K. Church. 2011. A Pendulum Swung Too Far. Linguistic Issues in Language Technology, 6(5).
S. Wintner. 2009. What Science Underlies Natural Language Engineering? Computational Linguistics, 35(4).
Domain portability myth in natural language processing
On Hand-crafted Myth and Knowledge Bottleneck
On Recall of Grammar Engineering Systems
Pros and Cons of Two Approaches: Machine Learning vs Grammar Engineering
It is untrue that Google SyntaxNet is the “world’s most accurate parser”
R. Srihari, W. Li, C. Niu & T. Cornell. 2006. InfoXtract: A Customizable Intermediate Level Information Extraction Engine. Journal of Natural Language Engineering, 12(4), 1-37.
Introduction of Netbase NLP Core Engine
Overview of Natural Language Processing
Dr. Wei Li’s English Blog on NLP