PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

 

The Morpho-syntactic Interface in a Chinese Phrase Structure Grammar

by

 

Wei Li

B.A., Anqing Normal College, China, 1982

M.A., The Graduate School of the Chinese Academy of Social Sciences, China, 1986

 

 

Thesis submitted in partial fulfillment of

the requirements for the degree of

DOCTOR OF PHILOSOPHY

 

in the Department

of

Linguistics


 

© Wei Li 2000

SIMON FRASER UNIVERSITY

November 2000

 

 

All rights reserved. This work may not be reproduced in whole or in part, by photocopy or other means, without permission of the author.

 

Approval

Name:                         Wei Li

Degree:                       Ph.D.

Title of thesis:             THE MORPHO-SYNTACTIC INTERFACE IN A CHINESE PHRASE STRUCTURE GRAMMAR

 

(Approved January 12, 2001)

 

Abstract

This dissertation examines issues related to the morpho-syntactic interface in Chinese, specifically those issues related to the following long-standing problems in Chinese Natural Language Processing (NLP): (i) disambiguation in Chinese word identification;  (ii) Chinese productive word formation;  (iii) borderline phenomena between morphology and syntax, such as Chinese separable verbs and ‘quasi-affixation’.

All these problems pose challenges to an independent Chinese morphology system or a separate word segmenter.  It is argued that syntactic analysis needs to be brought in to handle these problems.

To enable syntactic analysis in addition to morphological analysis in an integrated system, it is necessary to develop a Chinese grammar that is capable of representing sufficient information from both morphology and syntax.  The dissertation presents the design of such a Chinese phrase structure grammar, named CPSG95 (for Chinese Phrase Structure Grammar).  The unique feature of CPSG95 is its incorporation of Chinese morphology in the framework of Head-Driven Phrase Structure Grammar.  The interface between morphology and syntax is then defined system-internally in CPSG95 and represented uniformly in the underlying grammar formalism used by the Attribute Logic Engine.  For each problem, arguments are presented for the proposed analysis to capture the linguistic generality;  morphological or syntactic solutions are formulated based on the analysis.  This provides a secure approach to solving problems at the interface of Chinese morphology and syntax.


Dedication

To my daughter Tian Tian

whose babbling accompanied and inspired the writing of this work

And to my most devoted friend Dr. Jianjun Wang

whose help and advice encouraged me to complete this work

Acknowledgments

First and foremost, I feel profoundly grateful to Dr. Paul McFetridge, my senior supervisor.  It was his support that brought me to SFU and the beautiful city of Vancouver, which changed my life.  Over the years, he introduced me to the study of HPSG and provided me with his own parser for testing my grammar writing.  His mentorship and guidance have influenced my research fundamentally.  He critiqued my research experiments and thesis writing in many respects, from the development of key ideas, selection of topics, methodology and implementation details to writing and presentation style.  I feel guilty for not always being able to understand and follow his guidance promptly.

I would like to thank Dr. Fred Popowich, my second advisor.  He has given me both general academic guidance on research methodology and numerous specific comments on the thesis revision, which have helped shape the present version of the thesis.

I am also grateful to Dr. Nancy Hedberg, from whom I have taken four graduate courses, including the course on HPSG.  I have not only learned a lot from her lectures in the classroom, but have also benefited greatly from our numerous discussions on general linguistic topics as well as issues in Chinese linguistics.

Thanks to Davide Turkato, my friend and colleague in the Natural Language Lab.  He is always there whenever I need help.  We have also shared many happy hours in the Esperanto club in Vancouver.

I would like to thank Dr. Ping Xue, Dr. Zita McRobbie, Dr. Thomas Perry, Dr. Donna Gerdts and Dr. Richard DeArmond for the courses I have taken from them.  These courses were an important part of my linguistic training at SFU.

For the help and encouragement I have received during my graduate studies, I should also thank all the faculty, staff and colleagues of the linguistics department and the Natural Language Lab of SFU, in particular, Rita, Sheilagh, Dr. Ross Saunders, Dr. Wyn Roberts, Dr. Murray Munro and Dr. Olivier Laurens.  I am particularly thankful to Carol Jackson, our Graduate Secretary, for her years of help.  She is remarkable, very caring and responsive.

I would like to extend my thanks to all my fellow students and friends in the linguistics department of SFU, in particular, Dr. Trude Heift, Dr. Janine Toole, Susan Russel, Dr. Baoning Fu, Zhongying Lu, Dr. Shuicai Zhou, Jianyi Yu, Jean Wang, Cliff Burgess and Kyoung-Ja Lee.  We have had so much fun together and have had many interesting discussions, both academic and non-academic.  Today, most of us have graduated, and some are professors or professionals in different universities or institutions.  Our linguistics department is not big, but it is a very nice department in which faculty, staff and the graduate student body form a very sociable community.  I have truly enjoyed my graduate life in this department.

Beyond SFU, I would like to thank Dr. De-Kang Lin for the insightful discussion on the possibilities of integrated Chinese parsing back in 1995.  Thanks to Gerald Penn, one of the authors of ALE, for providing the powerful tool ALE and for giving me instructions on modifying some functions in ALE to accommodate some needs for Chinese parsing during my experiment in implementing a Chinese grammar.

I am also grateful to Dr. Rohini Srihari, my current industry supervisor, for giving me an opportunity to manage NLP projects for real world applications at Cymfony.  This industrial experience has helped me to broaden my NLP knowledge, especially in the area of statistical NLP and the area of shallow parsing using Finite State Transducers.

Thanks to Carrie Pine and Walter Gadz from US Air Force Research Laboratory who have been project managers for the Small Business Innovation Research (SBIR) efforts ‘A Domain Independent Event Extraction Toolkit’ (Phase II), ‘Flexible Information Extraction Learning Algorithm’ (Phase I and Phase II) and ‘Intermediate-Level Event Extraction for Temporal and Spatial Analysis and Visualization’ (Phase I and Phase II).  I have been Principal Investigator for these government funded efforts at Cymfony Inc. and have had frequent and extremely beneficial contact with them.  With these projects, I have had an opportunity to apply the skills and knowledge I have acquired from my Ph.D. program at SFU.

My professional training at SFU was made possible by a grant that Dr. Paul McFetridge and Dr. Nick Cercone applied for.  The work reported in this thesis was supported in its later stage by a Science Council of B.C. (CANADA) G.R.E.A.T. award.  I am grateful to both my academic advisor Paul McFetridge and my industry advisor John Grayson, CEO of TCC Communications Corporation of Victoria, for assisting me in obtaining this prestigious grant.

I would not have been able to start and continue my research career without the help I received from various sources, agencies and people over the last 15 years, for which I owe a big prayer of thanks.

I owe a great deal to Prof. Zhuo Liu and Prof. Yongquan Liu for leading me into the NLP area and supervising my master's program in computational linguistics at CASS (Chinese Academy of Social Sciences, 1983-1986).  Their guidance in both research ideas and implementation details has benefited me for life.  I am grateful to my former colleagues Prof. Aiping Fu, Prof. Zhenghui Xiong and Prof. Linding Li at the Institute of Linguistics of CASS for many insightful discussions on issues involving NLP and Chinese grammars.  Thanks also go to Ms. Fang Yang and the machine translation team at Gaoli Software Co. in Beijing for the very constructive and fruitful collaborative research and development work.  Our collaboration ultimately resulted in the commercialization of the GLMT English-to-Chinese machine translation system.

Thanks to Dr. Klaus Schubert, Dr. Dan Maxwell and Dr. Victor Sadler from BSO (Utrecht, The Netherlands) for giving me the project of writing a computational grammar of Chinese dependency syntax in 1988.  They gave me a lot of encouragement and guidance in the course of writing the grammar.  This work enabled me to study Chinese grammar in a formal and systematic way.  I have carried over this formal study of Chinese grammar to the work reported in this thesis.

I am also thankful to the Education Ministry of China, the Sir Pao Foundation and the British Council for providing me with the prestigious Sino-British Friendship Scholarship.  This scholarship enabled me to study computational linguistics at the Centre for Computational Linguistics, UMIST, England (1992).  During my stay at UMIST, I had opportunities to attend lectures given by Prof. Jun-ichi Tsujii, Prof. Harold Somers and Dr. Paul Bennett.  I feel grateful to all of them for their guidance in and beyond the classroom.  In particular, I must thank Dr. Paul Bennett for his supervision, help and care.

I would like to thank Prof. Dong Zhen Dong and Dr. Lua Kim Teng for inviting and sponsoring me for a presentation at ICCC’96 in Singapore.  They are the leading researchers in the area of Chinese NLP.  I have benefited greatly from the academic contact and communication with them.

Thanks to the anonymous reviewers of the international journals Communications of COLIPS, Journal of Chinese Information Processing, World Science and Technology and grkg/Humankybernetik.  Thanks also to the reviewers of the International Conference on Chinese Computing (ICCC’96), North American Conference on Chinese Linguistics (NACCL‑9), Applied Natural Language Processing Conference (ANLP’2000), Text Retrieval Conference (TREC-8), Machine Translation SUMMIT II, Conference of the Pacific Association for Computational Linguistics (PACLING-II) and North West Linguistics Conferences (NWLC).  These journals and conferences have provided a forum for publishing the NLP-related research work my colleagues and I have undertaken at different times of my research career.

Thanks to Dr. Jin Guo, who developed his influential theory of tokenization.  I have benefited enormously from exchanging ideas with him on tokenization and Chinese NLP.

In terms of research methodology and personal advice, I owe a great deal to my most devoted friend Dr. Jianjun Wang, Associate Professor at California State University, Bakersfield, and Fellow of the National Center for Education Statistics in the US.  Although he works in a totally different discipline, this has never been an obstacle for him to understand the basic problem I was facing and to offer me professional advice.  At times when I was puzzled and confused, his guidance often helped me to quickly sort things out.  Without his advice and encouragement, I would not have been able to complete this thesis.

Finally, I wish to thank my family for their support.  All my family members, including my parents, brothers and sisters in China, have been so supportive and understanding.  In particular, my father has encouraged me all along.  When I went through hardships in my pursuit, he shared the same burden; when I had some achievement, he was as happy as I was.

I am especially grateful to my wife, Chunxi.  Without her love, understanding and support, it would have been impossible for me to complete this thesis.  I wish I had done a better job of keeping her less worried and frustrated.  I should thank my four-year-old daughter, Tian Tian.  I feel sorry for not being able to spend more time with her.  What has supported me all these years is the idea that some day she will understand that as a first-generation immigrant, her dad has managed to overcome various challenges in order to create a better environment for her to grow up in.


 

Approval

Abstract

Dedication

Acknowledgments

Chapter I  Introduction

1.0. Foreword

1.1. Background

  • Principle of Maximum Tokenization and Critical Tokenization
  • Monotonicity Principle and Task-driven Segmentation

1.2. Morpho-syntactic Interface Problems

1.2.1. Segmentation Ambiguity

1.2.2. Productive Word Formation

1.2.3. Borderline Cases between Morphology and Syntax

1.3. CPSG95:  HPSG-style Chinese Grammar in ALE

1.3.1. Background and Overview of CPSG95

1.3.2. Illustration

1.4. Organization of the Dissertation

Chapter II  Role of Grammar

2.0. Introduction

2.1. Segmentation Ambiguity and Syntax

2.1.1. Resolution of Hidden Ambiguity

2.1.2. Resolution of Overlapping Ambiguity

2.2. Productive Word Formation and Syntax

2.3. Borderline Cases and Grammar

2.4. Knowledge beyond Syntax

2.5. Summary

Chapter III  Design of CPSG95

3.0. Introduction

3.1. Mono-stratal Design of Sign

3.2. Expectation Feature Structures

3.2.1. Morphological Expectation

3.2.2. Syntactic Expectation

3.2.3. Chinese Subcategorization

3.2.4. Configurational Constraint

3.3. Structural Feature Structure

3.4. Summary

Chapter IV  Defining the Chinese Word

4.0. Introduction

4.1. Two Notions of Word

4.2. Judgment Methods

4.3. Formal Representation of Word

4.4. Summary

Chapter V  Chinese Separable Verbs

5.0. Introduction

5.1. Verb-object Idioms: V+N I

5.2. Verb-object Idioms: V+N II

5.3. Verb-modifier Idioms: V+A/V

5.4. Summary

Chapter VI  Morpho-syntactic Interface Involving Derivation

6.0. Introduction

6.1. General Approach to Derivation

6.2. Prefixation

6.3. Suffixation

6.4. Quasi-affixes

6.5. Suffix zhe (-er)

6.6. Summary

Chapter VII  Concluding Remarks

7.0. Summary

7.1. Contributions

7.2. Limitation

7.3. Final Notes

BIBLIOGRAPHY

APPENDIX I    Source Code of Implemented CPSG95

APPENDIX II  Source Code of Implemented CPSG95 Lexicon

APPENDIX III  Tested Results in Three Experiments Using CPSG95

 


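Posted by

liweinlp

Dr. Wei Li is a senior architect in natural language processing (NLP) and Research Director at Beyond AI. He was formerly Principal Scientist at jd-valley, where he focused on deep parsing, knowledge graphs and their applications. As former Chief Scientist at Netbase, he directed the research and development of language understanding and application systems for 18 languages. The Chinese and English systems in particular achieved world-class parsing accuracy while being robust, running at linear speed, scaling up to big data, and grounding their semantics in data mining and question answering products. As former Vice President of R&D at Cymfony, he won first place in the first question answering evaluation (TREC-8 QA Track) and won 17 federal Small Business Innovation Research projects in information extraction (PI for 17 SBIRs). The application areas of his NLP work include big-data public opinion mining, customer intelligence, information extraction, knowledge graphs, question answering systems, intelligent assistants and semantic search.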
