The Morpho-syntactic Interface in a Chinese Phrase Structure Grammar
B.A., Anqing Normal College, China, 1982
M.A., The Graduate School of Chinese Academy of
Social Sciences, China, 1986
Thesis submitted in partial fulfillment of
the requirements for the degree of
DOCTOR OF PHILOSOPHY
in the Department
© Wei Li 2000
SIMON FRASER UNIVERSITY
All rights reserved. This work may not be reproduced in whole or in part, by photocopy or other means, without permission of the author.
Name: Wei Li
Title of thesis: THE MORPHO-SYNTACTIC INTERFACE IN
A CHINESE PHRASE STRUCTURE GRAMMAR
(Approved January 12, 2001)
This dissertation examines issues related to the morpho-syntactic interface in Chinese, specifically those issues related to the following long-standing problems in Chinese Natural Language Processing (NLP): (i) disambiguation in Chinese word identification; (ii) Chinese productive word formation; (iii) borderline phenomena between morphology and syntax, such as Chinese separable verbs and ‘quasi-affixation’.
All these problems pose challenges to an independent Chinese morphology system or a separate word segmenter. It is argued that syntactic analysis needs to be brought in to handle these problems.
To enable syntactic analysis in addition to morphological analysis in an integrated system, it is necessary to develop a Chinese grammar that is capable of representing sufficient information from both morphology and syntax. The dissertation presents the design of such a Chinese phrase structure grammar, named CPSG95 (for Chinese Phrase Structure Grammar). The unique feature of CPSG95 is its incorporation of Chinese morphology in the framework of Head-Driven Phrase Structure Grammar. The interface between morphology and syntax is then defined system-internally in CPSG95 and uniformly represented in the underlying grammar formalism of the Attribute Logic Engine (ALE). For each problem, arguments are presented for the proposed analysis to capture the linguistic generality; morphological or syntactic solutions are formulated based on the analysis. This provides a secure approach to solving problems at the interface of Chinese morphology and syntax.
whose babbling accompanied and inspired the writing of this work
And to my most devoted friend Dr. Jianjun Wang
whose help and advice encouraged me to complete this work
First and foremost, I feel profoundly grateful to Dr. Paul McFetridge, my senior supervisor. It was his support that brought me to SFU and the beautiful city of Vancouver, which changed my life. Over the years, he introduced me to the study of HPSG and provided me with his own parser for testing my grammar writing. His mentorship and guidance have influenced my research fundamentally. He critiqued my research experiments and thesis writing in many respects, from the development of key ideas, selection of topics, methodology and implementation details to writing and presentation style. I feel guilty for not being able to promptly understand and follow his guidance at times.
I would like to thank Dr. Fred Popowich, my second advisor. He has given me both general academic guidance on research methodology and numerous specific comments on the thesis revision, which have helped shape the present version of the thesis.
I am also grateful to Dr. Nancy Hedberg, from whom I have taken four graduate courses, including the course on HPSG. I have not only learned a lot from her lectures in the classroom, but have also benefited greatly from our numerous discussions on general linguistic topics as well as issues in Chinese linguistics.
Thanks to Davide Turkato, my friend and colleague in the Natural Language Lab. He is always there whenever I need help. We have also shared many happy hours in the Esperanto club in Vancouver.
I would like to thank Dr. Ping Xue, Dr. Zita McRobbie, Dr. Thomas Perry, Dr. Donna Gerdts and Dr. Richard DeArmond for the courses I have taken from them. These courses were an important part of my linguistic training at SFU.
For the help and encouragement I have received during my graduate studies, I should also thank all the faculty, staff and colleagues of the linguistics department and the Natural Language Lab of SFU, in particular, Rita, Sheilagh, Dr. Ross Saunders, Dr. Wyn Roberts, Dr. Murray Munro and Dr. Olivier Laurens. I am particularly thankful to Carol Jackson, our Graduate Secretary, for her years of help. She is remarkable, very caring and responsive.
I would like to extend my thanks to all my fellow students and friends in the linguistics department of SFU, in particular, Dr. Trude Heift, Dr. Janine Toole, Susan Russel, Dr. Baoning Fu, Zhongying Lu, Dr. Shuicai Zhou, Jianyi Yu, Jean Wang, Cliff Burgess and Kyoung-Ja Lee. We have had so much fun together and have had many interesting discussions, both academic and non-academic. Today, most of us have graduated, and some are professors or professionals in different universities or institutions. Our linguistics department is not big, but it is a very pleasant department where faculty, staff and the graduate student body form a sociable community. I have truly enjoyed my graduate life in this department.
Beyond SFU, I would like to thank Dr. De-Kang Lin for the insightful discussion on the possibilities of integrated Chinese parsing back in 1995. Thanks to Gerald Penn, one of the authors of ALE, for providing the powerful tool ALE and for giving me instructions on modifying some functions in ALE to accommodate some needs for Chinese parsing during my experiment in implementing a Chinese grammar.
I am also grateful to Dr. Rohini Srihari, my current industry supervisor, for giving me an opportunity to manage NLP projects for real-world applications at Cymfony. This industrial experience has helped me to broaden my NLP knowledge, especially in the areas of statistical NLP and shallow parsing using Finite State Transducers.
Thanks to Carrie Pine and Walter Gadz from the US Air Force Research Laboratory, who have been project managers for the Small Business Innovation Research (SBIR) efforts ‘A Domain Independent Event Extraction Toolkit’ (Phase II), ‘Flexible Information Extraction Learning Algorithm’ (Phase I and Phase II) and ‘Intermediate-Level Event Extraction for Temporal and Spatial Analysis and Visualization’ (Phase I and Phase II). I have been Principal Investigator for these government-funded efforts at Cymfony Inc. and have had frequent and extremely beneficial contact with them. With these projects, I have had an opportunity to apply the skills and knowledge I have acquired from my Ph.D. program at SFU.
My professional training at SFU was made possible by a grant that Dr. Paul McFetridge and Dr. Nick Cercone applied for. The work reported in this thesis was supported in its later stage by a Science Council of B.C. (CANADA) G.R.E.A.T. award. I am grateful to both my academic advisor Paul McFetridge and my industry advisor John Grayson, CEO of TCC Communications Corporation of Victoria, for assisting me in obtaining this prestigious grant.
I would not have been able to start and continue my research career without the help I received from various sources, agencies and people over the last 15 years, for which I owe a big prayer of thanks.
I owe a great deal to Prof. Zhuo Liu and Prof. Yongquan Liu for leading me into the NLP area and supervising my master's program in computational linguistics at CASS (Chinese Academy of Social Sciences, 1983-1986). Their guidance in both research ideas and implementation details has benefited me for life. I am grateful to my former colleagues Prof. Aiping Fu, Prof. Zhenghui Xiong and Prof. Linding Li at the Institute of Linguistics of CASS for many insightful discussions on issues involving NLP and Chinese grammars. Thanks also go to Ms. Fang Yang and the machine translation team at Gaoli Software Co. in Beijing for the very constructive and fruitful collaborative research and development work. Our collaboration ultimately resulted in the commercialization of the GLMT English-to-Chinese machine translation system.
Thanks to Dr. Klaus Schubert, Dr. Dan Maxwell and Dr. Victor Sadler from BSO (Utrecht, The Netherlands) for giving me the project of writing a computational grammar of Chinese dependency syntax in 1988. They gave me a lot of encouragement and guidance in the course of writing the grammar. This work enabled me to study Chinese grammar in a formal and systematic way. I have carried over this formal study of Chinese grammar to the work reported in this thesis.
I am also thankful to the Education Ministry of China, the Sir Pao Foundation and the British Council for providing me with the prestigious Sino-British Friendship Scholarship. This scholarship enabled me to study computational linguistics at the Centre for Computational Linguistics, UMIST, England (1992). During my stay at UMIST, I had opportunities to attend lectures given by Prof. Jun-ichi Tsujii, Prof. Harold Somers and Dr. Paul Bennett. I feel grateful to all of them for their guidance in and beyond the classroom. In particular, I must thank Dr. Paul Bennett for his supervision, help and care.
I would like to thank Prof. Dong Zhen Dong and Dr. Lua Kim Teng for inviting and sponsoring me for a presentation at ICCC’96 in Singapore. They are the leading researchers in the area of Chinese NLP. I have benefited greatly from the academic contact and communication with them.
Thanks to the anonymous reviewers of the international journals Communications of COLIPS, Journal of Chinese Information Processing, World Science and Technology and grkg/Humankybernetik. Thanks also to the reviewers of the International Conference on Chinese Computing (ICCC’96), North American Conference on Chinese Linguistics (NACCL-9), Applied Natural Language Conference (ANLP’2000), Text Retrieval Conference (TREC-8), Machine Translation SUMMIT II, Conference of the Pacific Association for Computational Linguistics (PACLING-II) and North West Linguistics Conferences (NWLC). These journals and conferences have provided a forum for publishing the NLP-related research work that my colleagues and I have undertaken at different times in my research career.
Thanks to Dr. Jin Guo, who has developed an influential theory of tokenization. I have benefited enormously from exchanging ideas with him on tokenization and Chinese NLP.
In terms of research methodology and personal advice, I owe a great deal to my most devoted friend Dr. Jianjun Wang, Associate Professor at California State University, Bakersfield, and Fellow of the National Center for Education Statistics in the US. Although he works in a totally different discipline, he has never had any difficulty understanding the basic problem I was facing and offering me professional advice. At times when I was puzzled and confused, his guidance often helped me to quickly sort things out. Without his advice and encouragement, I would not have been able to complete this thesis.
Finally, I wish to thank my family for their support. All my family members, including my parents, brothers and sisters in China, have been so supportive and understanding. In particular, my father has been encouraging me all the time. When I went through hardships in my pursuit, he shared the same burden; when I had some achievement, he was as happy as I was.
I am especially grateful to my wife, Chunxi. Without her love, understanding and support, it would have been impossible for me to complete this thesis. I wish I had done a better job of keeping her less worried and frustrated. I should thank my four-year-old daughter, Tian Tian. I feel sorry for not being able to spend more time with her. What has supported me all these years is the idea that some day she will understand that, as a first-generation immigrant, her dad has managed to overcome various challenges in order to create a better environment for her to grow up in.
Chapter I Introduction
1.0. Foreword
1.1. Background
- Principle of Maximum Tokenization and Critical Tokenization
- Monotonicity Principle and Task-driven Segmentation
1.2. Morpho-syntactic Interface Problems
1.2.1. Segmentation Ambiguity
1.2.2. Productive Word Formation
1.2.3. Borderline Cases between Morphology and Syntax
1.3. CPSG95: HPSG-style Chinese Grammar in ALE
1.3.1. Background and Overview of CPSG95
1.3.2. Illustration
1.4. Organization of the Dissertation
Chapter II Role of Grammar
2.0. Introduction
2.1. Segmentation Ambiguity and Syntax
2.1.1. Resolution of Hidden Ambiguity
2.1.2. Resolution of Overlapping Ambiguity
2.2. Productive Word Formation and Syntax
2.3. Borderline Cases and Grammar
2.4. Knowledge beyond Syntax
2.5. Summary
Chapter III Design of CPSG95
3.0. Introduction
3.1. Mono-stratal Design of Sign
3.2. Expectation Feature Structures
3.2.1. Morphological Expectation
3.2.2. Syntactic Expectation
3.2.3. Chinese Subcategorization
3.2.4. Configurational Constraint
3.3. Structural Feature Structure
3.4. Summary
Chapter IV Defining the Chinese Word
4.0. Introduction
4.1. Two Notions of Word
4.2. Judgment Methods
4.3. Formal Representation of Word
4.4. Summary
Chapter V Chinese Separable Verbs
5.0. Introduction
5.1. Verb-object Idioms: V+N I
5.2. Verb-object Idioms: V+N II
5.3. Verb-modifier Idioms: V+A/V
5.4. Summary
Chapter VI Morpho-syntactic Interface Involving Derivation
6.0. Introduction
6.1. General Approach to Derivation
6.2. Prefixation
6.3. Suffixation
6.4. Quasi-affixes
6.5. Suffix zhe (-er)
6.6. Summary
Chapter VII Concluding Remarks
7.0. Summary
7.1. Contributions
7.2. Limitation
7.3. Final Notes
APPENDIX I Source Code of Implemented CPSG95
APPENDIX II Source Code of Implemented CPSG95 Lexicon
APPENDIX III Tested Results in Three Experiments Using CPSG95