A few weeks ago, I had a chat with my daughter, who is planning to study computer science.
"Dad, how are things going?"
"Got a problem: Google announced SyntaxNet, claimed to be the world's No. 1."
"Why is that a problem?"
"Well, if they are No. 1, where am I?"
"No. 2?"
"No, I don't know who is No. 1, but I have never seen a system beating ours. I might just as well be No. 0."
"Brilliant, I like that! Then stay at No. 0 and let others fight for No. 1. ... In my data structures, I always start with 0 anyway."
It is untrue that Google SyntaxNet is the "world’s most accurate parser"
As we all know, natural language parsing is fairly complex but instrumental in Natural Language Understanding (NLU) and its applications. We also know that a breakthrough to 90%+ accuracy in parsing is close to human performance and is indeed an achievement to be proud of. Nevertheless, common sense tells us that it takes great audacity to claim the "most" of anything without a scope or other conditions attached, unless the claim is certified by an authoritative body such as Guinness. To show that Google's claim of "the world's most accurate parser" is untrue or misleading, we only need to cite one system that outperforms theirs. We happen to have built one.
For a long time, we have known that our English parser is near human performance in output quality, and that it is robust, fast, and scales up to big data in support of real-life products. Given the approach we take, namely grammar engineering, the "other school" from mainstream statistical parsing, this is a natural result of the architect's design and his decades of linguistic expertise. In fact, our parser reached near-human performance over five years ago, at a point of diminishing returns, so we decided not to invest heavily in its further development. Instead, we shifted our focus to its applications in supporting open-domain question answering and fine-grained deep sentiment analysis for our products, as well as to the multilingual space.
So a few weeks ago, when Google announced SyntaxNet, I was bombarded with the news through all kinds of channels by many colleagues of mine, including my boss and our marketing executives. All were kind enough to draw my attention to this "newest breakthrough in NLU," seeming to imply that we should work harder to catch up with the giant.
In my mind, there has never been any doubt that the other school has a long way to go before it can catch up with us. But we live in the information age, and such is the power of the Internet: eye-catching news from or about a giant, true or misleading, instantly spreads all over the world. So I felt the need to do some study, not only to uncover the true picture of this space, but more importantly, to remind the public and the young scholars coming into this field that there have always been, and will always be, two schools of NLU and AI (Artificial Intelligence). The two schools have their respective pros and cons; they can be complementary and hybridized, but by no means can one completely ignore or replace the other. Besides, how boring the world would become if there were only one approach, one choice, one voice, especially in core areas of NLU such as parsing (as well as information extraction and sentiment analysis, among others), where the "chosen" approach does not perform nearly as well as the forgotten one.
So I asked a linguist who was not involved in the development of our parser to benchmark the two systems as objectively as possible and to give an apples-to-apples comparison of their performance. Fortunately, Google SyntaxNet outputs syntactic dependency relationships, and ours is also mainly a dependency parser. Despite differences in details and naming conventions, the results are not difficult to contrast and compare based on linguistic judgment. To keep things simple and fair, we fragment the parse tree of each input sentence into binary dependency relations and let the testing linguist judge them; when in doubt, he consults another senior linguist, or puts the case on hold if it is believed to fall in a gray area, which is rare.
Unlike some other NLP tasks, e.g. sentiment analysis, where there is considerable gray area and inter-annotator disagreement, parsing results are fairly easy to reach consensus on among linguists. Despite the different formats in which the two systems embody their results (an output sample is shown below), it is not difficult to compare each dependency in the sentence tree outputs directly. (To be stricter on our side, a patched relationship called Next, used in our results, does not count as a legitimate syntactic relation in this testing.)
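To make this procedure concrete, here is a minimal sketch in Python of the tallying step; the triple representation and the function name are illustrative assumptions for this post, not our production code or SyntaxNet's API:

```python
# Minimal sketch of the scoring step (illustrative only).
# Each parse is reduced to a set of binary dependency triples
# (head, relation, dependent) and tallied against the
# linguist-adjudicated gold set.

def score(system_triples, gold_triples):
    """Count true positives, false positives, and false negatives."""
    system, gold = set(system_triples), set(gold_triples)
    tp = len(system & gold)  # dependencies the system got right
    fp = len(system - gold)  # dependencies the system proposed wrongly
    fn = len(gold - system)  # gold dependencies the system missed
    return tp, fp, fn

# Toy fragment of the sample sentence shown below:
gold = {("endorsed", "nsubj", "Obama"),
        ("endorsed", "dobj", "Clinton"),
        ("Obama", "nn", "President")}
system = {("endorsed", "nsubj", "Obama"),
          ("endorsed", "dobj", "Clinton"),
          ("Clinton", "nn", "President")}  # one wrong attachment
print(score(system, gold))  # -> (2, 1, 1)
```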
SyntaxNet output:
1.Input: President Barack Obama endorsed presumptive Democratic presidential nominee Hillary Clinton in a web video Thursday .
Parse:
endorsed VBD ROOT
 +-- Obama NNP nsubj
 |   +-- President NNP nn
 |   +-- Barack NNP nn
 +-- Clinton NNP dobj
 |   +-- nominee NN nn
 |   |   +-- presumptive JJ amod
 |   |   +-- Democratic JJ amod
 |   |   +-- presidential JJ amod
 |   +-- Hillary NNP nn
 +-- in IN prep
 |   +-- video NN pobj
 |       +-- a DT det
 |       +-- web NN nn
 +-- Thursday NNP tmod
 +-- . . punct
Netbase output:
[The Netbase parse tree appears as an image in the original post.]
Benchmarking was performed in two stages as follows.
In Stage 1, we select English formal text in the news domain, which is SyntaxNet's forte, as it is believed to have much more training data for news than for other styles or genres. The announced 94% accuracy in news parsing is indeed impressive. In our case, news is not the major source of our development corpus, because our goal is a domain-independent parser that supports a variety of genres of English text for real-life applications, from social media (informal text) for sentiment analysis to technology papers (formal text) for answering how-questions.
We randomly select three recent news articles for this testing, with the following links.
(1) http://www.cnn.com/2016/06/09/politics/president-barack-obama-endorses-hillary-clinton-in-video/
(2) Excerpt from: http://www.wsj.com/articles/nintendo-gives-gamers-look-at-new-zelda-1465936033
(3) Excerpt from: http://www.cnn.com/2016/06/15/us/alligator-attacks-child-disney-florida/
Here are the benchmarking results of parsing the above for the news genre:
(1) Google SyntaxNet: F-score = 0.94
(tp for true positives, fp for false positives, fn for false negatives;
P for Precision, R for Recall, and F for F-score)
P = tp/(tp+fp) = 1737/(1737+104) = 1737/1841 = 0.94
R = tp/(tp+fn) = 1737/(1737+96) = 1737/1833 = 0.95
F = 2*[(P*R)/(P+R)] = 2*[(0.94*0.95)/(0.94+0.95)] = 2*(0.893/1.89) = 0.94
(2) Netbase parser: F-score = 0.95
P = tp/(tp+fp) = 1714/(1714+66) = 1714/1780 = 0.96
R = tp/(tp+fn) = 1714/(1714+119) = 1714/1833 = 0.94
F = 2*[(P*R)/(P+R)] = 2*[(0.96*0.94)/(0.96+0.94)] = 2*(0.9024/1.9) = 0.95
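As a sanity check, the scores above follow directly from the raw counts. A hypothetical prf helper in Python, shown here purely for illustration, reproduces them:

```python
def prf(tp, fp, fn):
    """Precision, recall, and F-score from raw counts."""
    p = tp / (tp + fp)       # precision
    r = tp / (tp + fn)       # recall
    f = 2 * p * r / (p + r)  # harmonic mean of P and R
    return p, r, f

# News genre, from the counts above:
print(prf(1737, 104, 96))   # SyntaxNet: P ≈ 0.94, R ≈ 0.95
print(prf(1714, 66, 119))   # Netbase:   P ≈ 0.96, R ≈ 0.94, F ≈ 0.95
```

(Computed from the raw counts rather than the rounded P and R, SyntaxNet's news F-score comes out at 0.945, right on the 0.94/0.95 boundary; the comparison is unaffected either way.)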
So the Netbase parser is about 2 percentage points higher than Google SyntaxNet in precision and 1 point lower in recall. Overall, Netbase is slightly ahead of Google on the combined precision-recall measure of F-score. As both parsers are near the point of diminishing returns for further development, there is not much room left for competition in this genre.
In Stage 2, we select informal text from the social medium Twitter to test each parser's robustness in handling "degraded text." As expected, degraded text always leads to degraded performance (for a human as well as for a machine), but a robust parser should handle it with only limited degradation. If a parser performs well only in one genre or domain and its performance falls drastically in others, then it is not of much use, because most genres and domains do not have labeled data as plentiful as the seasoned news genre. With this knowledge bottleneck, a parser is severely challenged and limited in its potential to support NLU applications. After all, parsing is not an end in itself, but a means of turning unstructured text into structures that support semantic grounding for various applications in different domains.
We randomly select 100 tweets from Twitter for this testing, with some samples shown below.
1.Input: RT @ KealaLanae : ?? ima leave ths here. https : //t.co/FI4QrSQeLh
2.Input: @ WWE_TheShield12 I do what I want jk I ca n't kill you .
10.Input: RT @ blushybieber : Follow everyone who retweets this , 4 mins?
20.Input: RT @ LedoPizza : Proudly Founded in Maryland. @ Budweiser might have America on their cans but we think Maryland Pizza sounds better
30.Input: I have come to enjoy Futbol over Football ⚽️
40.Input: @ GameBurst That 's not meant to be rude. Hard to clarify the joke in tweet form .
50.Input: RT @ undeniableyella : I find it interesting , people only talk to me when they need something ...
60.Input: Petshotel Pet Care Specialist Jobs in Atlanta , GA # Atlanta # GA # jobs # jobsearch https : //t.co/pOJtjn1RUI
70.Input: FOUR ! BUTTLER nailed it past the sweeper cover fence to end the over ! # ENG - 91/6 -LRB- 20 overs -RRB- . # ENGvSL https : //t.co/Pp8pYHfQI8
79.Input: RT @ LenshayB : I need to stop spending money like I 'm rich but I really have that mentality when it comes to spending money on my daughter
89.Input: RT MarketCurrents : Valuation concerns perk up again on Blue Buffalo https : //t.co/5lUvNnwsjA , https : //t.co/Q0pEHTMLie
99.Input: Unlimited Cellular Snap-On Case for Apple iPhone 4/4S -LRB- Transparent Design , Blue/ https : //t.co/7m962bYWVQ https : //t.co/N4tyjLdwYp
100.Input: RT @ Boogie2988 : And some people say , Ethan 's heart grew three sizes that day. Glad to see some of this drama finally going away. https : //t.co/4aDE63Zm85
Here are the benchmarking results for the social media (Twitter) genre:
(1) Google SyntaxNet: F-score = 0.65
P = tp/(tp+fp) = 842/(842+557) = 842/1399 = 0.60
R = tp/(tp+fn) = 842/(842+364) = 842/1206 = 0.70
F = 2*[(P*R)/(P+R)] = 2*[(0.6*0.7)/(0.6+0.7)] = 2*(0.42/1.3) = 0.65
(2) Netbase parser: F-score = 0.80
P = tp/(tp+fp) = 866/(866+112) = 866/978 = 0.89
R = tp/(tp+fn) = 866/(866+340) = 866/1206 = 0.72
F = 2*[(P*R)/(P+R)] = 2*[(0.89*0.72)/(0.89+0.72)] = 2*(0.64/1.61) = 0.80
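The same prf helper from the news-genre check above reproduces the Twitter figures:

```python
# Twitter genre, from the counts above (prf as defined earlier):
print(prf(842, 557, 364))   # SyntaxNet: P ≈ 0.60, R ≈ 0.70, F ≈ 0.65
print(prf(866, 112, 340))   # Netbase:   P ≈ 0.89, R ≈ 0.72, F ≈ 0.8
```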
For the above benchmarking results, we leave interesting observations and more detailed illustration, analysis, and discussion to the next blog.
To summarize, our real-life production parser beats Google's research system SyntaxNet in both formal news text (by a small margin, as we are both already near human performance) and informal text (by a big margin of 15 percentage points). Therefore, it is safe to conclude that Google's SyntaxNet is by no means the "world's most accurate parser"; in fact, it has a long way to go to get even close to the Netbase parser in adapting to real-world English text of various genres for real-life applications.
[Related]
Is Google SyntaxNet Really the World’s Most Accurate Parser?
Announcing SyntaxNet: The World’s Most Accurate Parser Goes Open
K. Church: "A Pendulum Swung Too Far", Linguistics Issues in Language Technology, 2011; 6(5)
Pros and Cons of Two Approaches: Machine Learning vs Grammar Engineering
Introduction of Netbase NLP Core Engine
Overview of Natural Language Processing
Dr. Wei Li's English Blog on NLP
Is Google SyntaxNet Really the World's Most Accurate Parser?
Google is a giant, and its marketing is more than powerful. While the whole world was stunned by their exciting claim about Natural Language Parsing and Understanding, and while we respect Google Research and congratulate them on their breakthrough in the statistical parsing space, we have to point out that the claim in their recently released blog that SyntaxNet is the "world's most accurate parser" is simply not true. In fact, it is far from the truth.
The point is that they have totally ignored the other school of NLU, the one based on linguistic rules, as if it were non-existent. It is true that, for various reasons, this school is hardly represented in academia today, owing to the mainstream's domination by machine learning (which is unhealthy but admittedly a reality; see Church's long article "A Pendulum Swung Too Far" for the historical background of this imbalance in AI and NLU). But any serious researcher knows that it has never vanished from the world; in fact, it has been well developed in industry's real-life applications for many years, including ours.
In the same blog, Google mentions that Parsey McParseface is the "most accurate such model in the world," with "model" referring to "powerful machine learning algorithms." That statement may well be true based on their cited literature review, but equating it with the "world's most accurate parser" in the same blog post, publicized and almost instantly disseminated across the media and the Internet, is irresponsible, and misleading at the very least.
In my next blog, I will present an apples-to-apples comparison of Google's SyntaxNet with the NetBase deep parser to demonstrate and illustrate the misleading nature of Google's recent announcement.
Stay tuned.
[Related]
It is untrue that Google SyntaxNet is the “world’s most accurate parser”
Announcing SyntaxNet: The World’s Most Accurate Parser Goes Open
K. Church: "A Pendulum Swung Too Far", Linguistics Issues in Language Technology, 2011; 6(5)
Pros and Cons of Two Approaches: Machine Learning vs Grammar Engineering
Introduction of Netbase NLP Core Engine
Overview of Natural Language Processing
Dr. Wei Li's English Blog on NLP
Dr Wei Li on NLP (@LinkedIn)