This chapter summarizes the research conducted in this dissertation, including its contributions as well as its limitations.
7.0. Summary
The goal of this dissertation is to explore effective ways of formally approaching the Chinese morpho-syntactic interface in a phrase structure grammar. This research has led to the following results: (i) the design of a Chinese grammar, namely CPSG95, which enables flexible coordination and interaction of morphology and syntax; (ii) the solutions proposed in CPSG95 to a series of long-standing problems at the Chinese morpho-syntactic interface.
CPSG95 was designed in the general framework of HPSG (Pollard and Sag 1987, 1994). The sign-based mono-stratal design inherited from HPSG has the advantage of accommodating, and giving access to, information from different components of a grammar. One crucial feature of CPSG95 is its introduction of morphological expectation feature structures and the corresponding morphological PS rules into HPSG. As a result, CPSG95 has been demonstrated to provide a favorable environment for solving morpho-syntactic interface problems.
Three types of morpho-syntactic interface problems have been studied extensively: (i) segmentation ambiguity in Chinese word identification; (ii) Chinese separable verbs, a borderline problem between compounding and syntax; and (iii) borderline phenomena between derivational morphology and syntax.
In the context of the CPSG95 design, segmentation ambiguity is no longer a separate problem: morphology and syntax are both defined system-internally in the grammar, which supports morpho-syntactic parsing based on non-deterministic tokenization (W. Li 1997, 2000). In other words, the design of CPSG95 itself entails an adequate solution to this long-standing problem, which has been a central topic in Chinese NLP for the last two decades. This is possible because the integrated process of Chinese parsing and word identification has access to a full grammar, including both morphology and syntax, whereas traditional word segmenters can at best access partial grammar knowledge.[1]
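The idea of resolving segmentation ambiguity through parsing over non-deterministic tokenization can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy lexicon, the example string, and the single accepted category pattern (a crude stand-in for a real grammar) are not taken from CPSG95.

```python
from typing import Dict, List

# Toy lexicon (hypothetical entries): word -> category.
LEXICON: Dict[str, str] = {
    "研究": "V",    # 'to study'
    "研究生": "N",  # 'graduate student'
    "生命": "N",    # 'life'
    "命": "N",      # 'fate'
    "起源": "N",    # 'origin'
}

# Crude stand-in for a grammar: the only category sequences it accepts.
VALID_PATTERNS = {("V", "N", "N")}


def tokenizations(text: str) -> List[List[str]]:
    """Return every way of splitting `text` into lexicon words."""
    if not text:
        return [[]]
    results: List[List[str]] = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in LEXICON:
            for rest in tokenizations(text[end:]):
                results.append([word] + rest)
    return results


def parses(text: str) -> List[List[str]]:
    """Keep only the tokenizations that the toy grammar accepts."""
    return [
        tokens
        for tokens in tokenizations(text)
        if tuple(LEXICON[w] for w in tokens) in VALID_PATTERNS
    ]
```

For the string 研究生命起源, the tokenizer proposes two segmentations (研究/生命/起源 and 研究生/命/起源), and only the first survives the grammatical check. No separate segmentation decision is made; the ambiguity disappears as a by-product of the analysis of the whole string.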
The second problem involves an interesting case between compounding and syntax: different types of Chinese separable verbs demonstrate various degrees of separability in syntax, while all these verbs, when used contiguously, are part of the Chinese verb vocabulary. For each type of separable verb, arguments were presented for the proposed linguistic analysis, and a solution to the problem was then formulated in CPSG95 based on the analysis. All the proposed solutions provide a way of capturing the link between the separated use and the contiguous use of separable verbs. They are shown to be better solutions than previous approaches in the literature, which either cannot link the separated use and the contiguous use in the analysis or are not formalized.
The third problem, at the interface of derivation and syntax, involves two issues: (i) a considerable amount of ‘quasi-affix’ data, and (ii) the intriguing case of zhe-suffixation, which demonstrates an unusual combination of a phrase with a bound morpheme. A generic analysis of Chinese derivation has been proposed in CPSG95. This analysis has been demonstrated to be effective in handling both quasi-affixation and zhe-suffixation.
7.1. Contributions
The specific contributions are reflected in the study of the following five topics, each constituting a chapter.
On the topic of the Role of Grammar, the investigation leads to the central argument that knowledge from both morphology and syntax is required to properly handle the major types of morpho-syntactic interface problems. This establishes the foundation for the general design of CPSG95 as consisting of morphology and syntax in one grammar formalism.
An in-depth study has been conducted in the area of the segmentation ambiguity in Chinese word identification. The most important discovery from the study is that the disambiguation involves the analysis of the entire input string. This means that the availability of a grammar is key to the solution of this problem. A natural solution to this problem is the use of grammatical analysis to resolve, and/or prepare the basis for resolving, the segmentation ambiguity.
On the topic of the Design of CPSG95, a mono-stratal Chinese phrase structure grammar has been established in the spirit of the HPSG theory. Components of a grammar such as morphology, syntax and semantics are all accommodated in distinct features of a sign. CPSG95 is designed to provide a framework and means for formalizing the analysis of the linguistic problems at the morpho-syntactic interface.
The essential part of this work is the design of expectation feature structures. Expectation feature structures are generalized from the HPSG feature structures for syntactic subcategorization and modification. One characteristic of the CPSG95 structural expectation is the design of morphological expectation features to incorporate Chinese productive derivation, which covers a wide range of linguistic phenomena in Chinese word formation.
In order to meet the requirements induced by introducing morphology into the general grammar and by accommodating linguistic characteristics of Chinese, modifications to standard HPSG are proposed in CPSG95. The rationale and arguments for these modifications have been presented. The design of CPSG95 is demonstrated to be a successful application of HPSG in the study of Chinese morpho-syntactic phenomena.
On the topic of Defining the Chinese Word, efforts have been made to reach a better understanding of Chinese wordhood in theory, methodology and formalization.
The theoretical inquiry follows the insight from Di Sciullo and Williams (1987) and Lü (1989). Two notions of word, namely grammar word and vocabulary word, have been examined and distinguished. While a vocabulary word is easy to define once a lexicon is given, the object of linguistic study and generalization is actually the grammar word. Unfortunately, as there are a considerable number of borderline phenomena between Chinese morphology and syntax, no precise definition of the Chinese grammar word has been available across systems. Therefore, an argument in favor of a system-internal wordhood definition and interface coordination within a grammar has been made. This leads to a case-by-case approach to the analysis of specific Chinese morpho-syntactic interface problems.
On the other hand, three useful wordhood judgment methods have also been proposed as a complementary means to the case-by-case analysis. These methods are (i) a syntactic process test involving passivization and topicalization; (ii) keyword-based judgment patterns for verbs; and (iii) a general expansion test named X-insertion. These methods are demonstrated to be fairly operational and easy to apply.
In terms of formalization, a system-internal representation of word has been defined in CPSG95 feature structures. This definition distinguishes a grammar word from both bound morphemes and syntactic constructions. The formalization effort is necessary for the rigorous study of Chinese morpho-syntactic problems and ensures the implementability of the solutions to these problems as proposed in the dissertation.
On the topic of Chinese Separable Verbs, the task is to coordinate the idiomatic nature of separable verbs and their separated uses in various syntactic patterns.
Since there are different degrees of ‘separability’ for different types of Chinese separable verbs, there is no uniform analysis which can handle all separable verbs properly. A case-by-case study of each type of separable verb has been conducted. An essential part of this study consists of the arguments for the wordhood judgment for each type. In the light of this judgment, CPSG95 provides formalized analyses of separable verbs which satisfy two criteria: (i) they all capture both structural and semantic aspects of the constructions at issue; (ii) they all provide a way of capturing the link between the separated use and the contiguous use.
Finally, on the topic of Morpho-syntactic Interface Involving Derivation, a general approach to Chinese derivation has been proposed. This approach not only enables us to handle quasi-affix phenomena, but is also flexible enough to provide an adequate treatment of the special problem in zhe-suffixation.
In the CPSG95 analysis, the affix serves as head of a derivative and can impose various constraints in the lexicon on its expected stem sign for the morphological expectation. Coupled with only two PS rules formulated in the general grammar (Prefix PS Rule and Suffix PS Rule), it has been shown that various Chinese affixation phenomena can be captured equally well. The PS rules ensure that all the lexical constraints be observed before the affix and the stem combine and that the output of derivation be a word.
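The division of labor described above, with lexical constraints stated on the affix and one general rule that checks them, can be sketched as follows. This is a minimal illustration under stated assumptions, not the CPSG95 formalization: the `Sign` and `Suffix` classes, the `suffix_rule` function, and the sample entry for the suffix hua ('-ize') are all hypothetical, and real feature structures and unification are reduced to a category label and a predicate.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Sign:
    form: str
    category: str          # e.g. "N", "V", "A", "VP"
    is_word: bool = True


@dataclass(frozen=True)
class Suffix:
    form: str
    expects: Callable[[Sign], bool]   # lexical constraint on the expected stem
    result_category: str


def suffix_rule(stem: Sign, suffix: Suffix) -> Optional[Sign]:
    """Combine stem and suffix only if the suffix's constraint holds.

    The suffix acts as head: it decides the category of the result,
    and the output of derivation is always a word.
    """
    if not suffix.expects(stem):
        return None
    return Sign(stem.form + suffix.form, suffix.result_category, is_word=True)


# Hypothetical lexical entry: 'hua' expects a noun or adjective word
# as its stem and derives a verb.
HUA = Suffix("hua", lambda s: s.is_word and s.category in {"N", "A"}, "V")
```

With these definitions, `suffix_rule(Sign("xiandai", "N"), HUA)` yields a verb sign `xiandaihua`, while a verbal stem is rejected and the rule returns `None`. A prefix rule would mirror this one with the order of concatenation reversed. Because the constraint is an arbitrary predicate stored with the affix, nothing in the rule itself prevents an affix from expecting a phrasal stem.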
As for the quasi-affixation problem, based on the observation that there is no fundamental structural difference between quasi-affixation and other affixation, a proper treatment of ‘quasi-affixes’ can be established in the same way as other affixes are handled in CPSG95; the individual difference in semantics is shown to be capturable in the lexicon.
The study of zhe-suffixation started with arguments for its analysis as VP+-zhe. This is an unsolvable problem in any system which enforces sequential processing of morphology before syntax. The solution which CPSG95 offers demonstrates the power of designing derivational morphology and syntax in a mono-stratal grammar. With this novel design in modeling Chinese grammar, the CPSG95 general approach to derivation readily applies to the tough case of zhe-suffixation. This is possible because an affix is able to place any lexicalized constraint, VP in this case, on the expected stem for morphological expectation. In addition, the proposed lexicalized solution also captures the building of the semantic content for this morpho-syntactic borderline phenomenon.
7.2. Limitations
The major limitations of the work reported in this thesis lie in the following two aspects.
Limited by space, the thesis has only presented sample formulations of typical affixes and quasi-affixes to demonstrate the proposed general approach to Chinese derivational morphology. As many affixes and quasi-affixes have their own distinctive semantic properties, a reader who would like to experiment with this proposal in an implementation still has to work out the technical details for each affix. However, it is believed that the general strategy has been presented in sufficient detail to allow for easy accommodation of individual aspects of an affix which have not been specifically addressed in the thesis.
Limited by the focus on a handful of major morpho-syntactic interface problems, the treatment of reduplication and unlisted proper names has not been taken up as a special topic for in-depth exploration. These phenomena are only briefly discussed in Chapter II (Section 2.2) as cases of productive word formation which require the involvement of syntax when there is segmentation ambiguity at their boundaries. However, they are also long-standing word identification problems which affect the morpho-syntactic interface whenever segmentation ambiguity is involved. In particular, it is felt that the treatment of transliterated foreign names requires further research before a satisfactory solution can be found in the framework of CPSG95.[2]
7.3. Final Notes
This last section is used to place the research reported in this thesis in a larger context.
Chinese NLP has reached a new stage marked by the publication of Guo’s series of papers on Chinese tokenization (Guo 1997a,b,c,d, Guo 1998). There are signs that the major research focus is shifting from word segmentation to grammar design and development. In this process, the morpho-syntactic interface will remain a hot topic for quite some time to come. The work on CPSG95 can be seen as one of the efforts in this direction.
The design of CPSG95, a formal grammar capable of representing both morphology and syntax in a uniform formalism, is one successful application of the modern linguistic theory HPSG in the area of Chinese morpho-syntactic interface research. However, this is by no means a claim that CPSG95 is the only or the best framework for capturing the morpho-syntactic problems. It is only one approach which has been shown to be feasible and effective. Other equally good or better approaches may exist.
In terms of future directions, constraints from semantics and discourse should be made available in the grammatical analysis. In Chapter II (Section 2.4), we have seen problems whose ultimate solutions depend on access to semantic or discourse constraints. It is believed that the sign-based mono-stratal design of CPSG95 will be extensible to accommodate these constraints. However, years of further research will be required before they can be formally modeled and properly introduced into the grammar.
--------------------------
[1] As a matter of fact, the CPSG95 experiment shows that most segmentation ambiguity is resolved automatically as a by-product of morpho-syntactic parsing and the remaining ambiguity is embodied in the multiple syntactic trees as the results of the analysis.
[2] However, in the CPSG95 implementation, the problem of handling Chinese person names, a special case of compounding, has been solved fairly satisfactorily. The proposal is to use the surname as the head sign to expect the given name (of one or two characters) on its right to form potential full names. As the right boundary of a person name is difficult to define without the support of sentential analysis, conventional word segmenters frequently make wrong segmentations in such cases. In contrast, the approach implemented in CPSG95 is free from this problem because whether a potential name proposed by the surname ultimately survives as a proper name is decided by whether it contributes to a valid parse for the processed sentence. In the last few years, there has been rapid progress on proper name identification in the area of information extraction, called named entity tagging (MUC-7 1998; Chen et al 1997).
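The candidate-proposal step described in footnote [2] can be sketched as below. This is only an illustration under assumptions: the surname list and the function name are hypothetical, and the decisive step, in which the parser keeps only the candidates that contribute to a valid parse, is not modeled here.

```python
from typing import List, Tuple

SURNAMES = {"李", "王", "张"}  # tiny hypothetical surname list


def name_candidates(text: str) -> List[Tuple[int, int, str]]:
    """Propose (start, end, name) spans for potential person names.

    Each surname expects a given name of one or two characters on its
    right. The spans are proposals only; they are not yet names.
    """
    spans: List[Tuple[int, int, str]] = []
    for i, ch in enumerate(text):
        if ch in SURNAMES:
            for given_len in (1, 2):
                end = i + 1 + given_len
                if end <= len(text):
                    spans.append((i, end, text[i:end]))
    return spans
```

For the input 王小明来了, the surname 王 proposes both 王小 and 王小明. Choosing the right boundary is left to the sentential analysis, which is the point of the approach.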
BIBLIOGRAPHY
Bauer, Laurie (1988). Introducing Linguistic Morphology. Edinburgh: Edinburgh University Press.
Bloomfield, Leonard (1933). Language, New York: Henry Holt & Co.
Borsley, Robert (1987). Subjects and Complements in HPSG. Technical report no. CSLI-107-87. Stanford: Center for the Study of Language and Information.
Carpenter, B. and G. Penn (1994). ALE, The Attribute Logic Engine, User's Guide. From http://www.sfs.nphil.uni-tuebingen.de/~gpenn/ale.html (accessed January 30, 2001).
Chao, Yuen-Ren (1968). A Grammar of Spoken Chinese. Berkeley: University of California Press.
Chen, H.-H et al (1997). Description of the NTU System used for MET-2. Proceedings of MUC-7. From http://perso.enst.fr/~monnier/lectures/IE/MUC7/muc_7_toc.html (accessed January 30, 2001).
Chen, K. and S. Liu (1992). Word Identification for Mandarin Chinese Sentences. Proceedings of 14th International Conference on Computational Linguistics (COLING’92). Nantes, France, 101-107.
Chen, M.Y. and W. S-Y. Wang (1975). Sound Change: Actuation and Implementation. Language 51:2, 255-281.
Chen, Ping (1994). “Shilun Hanyu zhong San Zhong Juzi Chengfen yu Yuyi Cheng Fen de Peiwei Yuanze” (On Mapping Principles of Relationship between Chinese Three Syntactic Constituents and Semantic Roles). Zhongguo Yuwen (Chinese Linguistics), No.3.
Chomsky, Noam (1970). Remarks on Nominalization. Readings in English Transformational Grammar, eds. by R. Jacobs and P. Rosenbaum, Waltham, Massachusetts: Ginn and Company, 184-221.
Dai, John Xiang-ling (1993). Chinese Morphology and its Interface with Syntax. Ph.D. Dissertation, Ohio State University.
DeFrancis, John (1984). The Chinese Language: Fact and Fantasy. Honolulu: University of Hawaii Press.
Di Sciullo, A.M. and E. Williams (1987). On The Definition of Word. The MIT Press, Cambridge, Massachusetts.
Ding, Shengshu (1953). “Hanyu Yufa Jianghua” (Lectures of Chinese Grammar), Zhongguo Yuwen (Chinese Linguistics), No. 3 and No. 4.
Dowty, D. (1982). More on the Categorial Analysis of Grammatical Relations. In A. Zaenen (Ed.), Subjects and Other Subjects: Proceedings of the Harvard Conference on Grammatical Relations. Bloomington: Indiana University Linguistics Club.
Feng, Zhiwei (1996). COLIPS Lecture Series - Chinese Natural Language Processing, Communications of COLIPS, Vol.6, No.1, Singapore.
Gan, Kok Wee (1995). Integrating Word Boundary Disambiguation with Sentence Understanding, Ph.D. Dissertation, National University of Singapore.
Gazdar, G., E. Klein, G.K. Pullum, and I.A. Sag (1985). Generalized Phrase Structure Grammar. Cambridge: Blackwell, and Cambridge, Mass.: Harvard University Press.
Guo, Jin (1997a). Critical tokenization and its properties. Computational Linguistics, Vol. 23, No. 4, 569-596.
Guo, Jin (1997b). Chinese Language Modeling for Speech Recognition. Ph.D. dissertation, Institute of Systems Science, National University of Singapore.
Guo, Jin (1997c). A Comparative Study on Sentence Tokenization Generation Schemes. In review for journal publication from http://sunzi.iss.nus.sg:1996/guojin/papers/ (accessed March 25, 1999).
Guo, Jin (1998). One tokenization per source. Proceedings of the 17th International Conference on Computational Linguistics and 36th Annual Meeting of the Association for Computational Linguistics (COLING-ACL ’98), Montreal, Canada, 457-463.
He, K., H. Xu and B. Sun (1991). Design Principles of an Expert System for Automatic Word Segmentation of Written Chinese Texts, Journal of Chinese Information Processing, Vol. 5, No. 2, 1-14.
Hockett, C.F. (1958). A Course in Modern Linguistics. New York: Macmillan.
Hu, F. and L. Wen (1954). “Ci de fanwei, xingtai, gongneng” (Scope, form and function of word). Zhongguo Yuwen (Chinese Linguistics), August issue.
Jackendoff, Ray (1972). Semantic Interpretation In Generative Grammar, Cambridge, Massachusetts: MIT Press.
Jensen, John T. (1990). Morphology: Word Structure in Generative Grammar. Amsterdam/Philadelphia: John Benjamins Publishing Company.
Kathol, Andreas (1999). Agreement and the Syntax-Morphology Interface in HPSG. In Robert Levine and Georgia Green (eds.) Studies in Current Phrase Structure Grammar. Cambridge University Press, 223-274.
Kolman, B. and R.C. Busby (1987). Discrete Mathematical Structures for Computer Science, 2nd edition. Prentice-Hall, Inc.
Krieger, Hans-Ulrich (1994). Derivation without Lexical Rules, in C.J Rupp, M. Rosner and R. Johnson (eds), Constraints, Language, and Computation. Academic Press, 277-313.
Li, C.N. and S.A. Thompson (1981). Mandarin Chinese: A Functional Grammar. Berkeley: University of California Press.
Li, Linding (1986). Xiandai Hanyu Juxing (Sentence Patterns in Contemporary Mandarin), Shangwu Yinshuguan (Commercial Press), Beijing.
Li, Linding (1990). Xiandai Hanyu Dongci (Verbs in Contemporary Mandarin), Zhongguo Shehui Kexue Chubanshe, Beijing.
Li, Qinghua (1983). “Tan liheci de tedian he yongfa” (On the characteristics and usages of separable words). Yuyan Jiaoxue He Yan Jiu (Language Instruction and Research), No.3.
Li, Wei (1996). Interaction of Syntax and Semantics in Parsing Chinese Transitive Patterns. Proceedings of International Conference on Chinese Computing (ICCC'96), Singapore.
Li, Wei (1997). Chart Parsing Chinese Character Strings. Proceedings of the Ninth North American Conference on Chinese Linguistics (NACCL-9), Victoria, Canada.
Li, Wei (2000). On Chinese parsing without using a separate word segmenter. Communications of COLIPS 10 (1): 19-68.
Liang, Nanyuan (1987). CDWS -- A Written Chinese Automatic Word Segmentation System. Journal of Chinese Information Processing, 1(2): 44-52.
Lieber, R. (1992). Deconstructing Morphology. Chicago: University of Chicago Press.
Lin, Handa (1983). “Shime shi ci – xiaoyu ci de bu shi ci” (What is a word – a unit smaller than a word is not a word). Zhongguo Yuwen (Chinese Linguistics), No.34.
Lu, Jianming (1988). “Mingci-xing ‘laixin’ shi ci haishi cizu” (Nominal laixin: word or word group). Zhongguo Yuwen (Chinese Linguistics), No. 5.
Lu, Zhiwei (1957). Hanyu de Goucifa (Chinese Word Formation), Kexue Chubanshe (Science Publishing House).
Lü, Shuxiang. (1946). “Cong Zhuyu, Binyu de Fenbie Tan Guoyu Juzi de Fenxi” (On Sentence Analysis of Mandarin Chinese from the Angle of the Distinction between Subject and Object), Kaiming Shudian Er Shi Zhounian Jiannian Wenji (Selected Works to Celebrate the 20th Anniversary of Kaiming Bookstore).
Lü, Shuxiang et al (ed.) (1980). Xiandai Hanyu Babai Ci (800 Words in Contemporary Mandarin), Shangwu Yinshuguan (Commercial Press), Beijing.
Lü, Shuxiang (1989). “Hanyu Yufa Fenxi Wenti” (Issues on Chinese grammatical analysis), Lü Shuxiang Zixuanji (Self-selected Works of Shuxiang Lü), Shang Hai Jiaoyu Chubanshe (Shanghai Education Publishing House), Shanghai, 93-180.
Lua, Kim Teng (1994). Application of Information Theory Binding in Word Segmentation. Computer Processing of Chinese and Oriental Languages 8(1): 115-124.
Lyons, John (1968). Introduction to Theoretical Linguistics. Cambridge: Cambridge University Press.
MUC-7 (1998). Proceedings of the Seventh Message Understanding Conference (MUC-7). From http://perso.enst.fr/~monnier/lectures/IE/MUC7/muc_7_toc.html (accessed January 30, 2001).
Pollard, C. and I. Sag (1987). Information based Syntax and Semantics Vol. 1: Fundamentals. Centre for the Study of Language and Information, Stanford University, CA.
Pollard, C. and I. Sag (1994). Head-Driven Phrase Structure Grammar. The University of Chicago Press.
Riehemann, Susanne (1993). Word Formation in Lexical Type Hierarchies – A Case Study of bar-Adjectives in German. SfS-Report-02-93, University of Tübingen.
Riehemann, Susanne (1998). Type-based derivational morphology. Journal of Comparative Germanic Linguistics 2. 49-77.
Sapir, Edward (1921). Language: Introduction to the Study of Speech. New York: Harcourt, Brace, and World.
Selkirk, E. (1982). The Syntax of Words. Cambridge: MIT Press.
Shi, Youwei (1992). Huhuan Rouxing – Hanyu Yufa Tanyi (A Call for Flexibility – Peculiarities of Chinese Grammar), Hunan Publishing House.
Shieber, S. (1986). An Introduction to Unification-Based Approaches to Grammar. Centre for the Study of Language and Information, Stanford University, CA.
Sproat, R., C. Shih, V. Gale, and N. Chang (1996). A Stochastic Finite-State Word-Segmentation Algorithm for Chinese. Computational Linguistics. Vol. 22, No. 3.
Sun, L. and P. Cole (1991). The effect of morphology on long-distance reflexives. Journal of Chinese Linguistics 19:1, 42-62.
Sun, M. and B. T’sou (1995). Ambiguity resolution in Chinese word segmentation. Proceedings of the 10th Pacific Asia Conference on Language, Information and Computation (PACLIC-95), Hong Kong, 121-126.
Sun, M. and C. Huang (1996). Word Segmentation and Part-of-Speech Tagging for Unrestricted Chinese Texts, A Tutorial at the 1996 International Conference on Chinese Computing (ICCC96), Singapore.
Thompson, S.A. (1973). Resultative Verb Compounds in Mandarin Chinese: A Case of Lexical Rules. Language 49:2, 361-379.
Wang, Li (1955). Zhongguo Yufa Lilun (Chinese Grammatical Theory), Zhonghua Shuju, Shanghai.
Wang, Xiaolong (1989). Automatic Chinese Word Segmentation, in Word Separating and Mutual Translation of Syllable and Character Strings, Ph.D. Dissertation, Dept. of Computer Science and Engineering, Harbin Institute of Technology.
Webster, J. J. and C-Y Kit. (1992). Tokenization as the Initial Phase in NLP. Proceedings of the 14th International Conference on Computational Linguistics (COLING-92). Nantes, France, 1106-1110.
Wu, A. and Z. Jiang (1998). Word Segmentation in Sentence Analysis. Proceedings of the 1998 International Conference on Chinese Information Processing. Beijing, China, 169-180.
Wu, Dekai (1998). A Position Statement on Chinese Segmentation. Presented at the Chinese Language Processing Workshop, University of Pennsylvania. (Current draft at http://www.cs.ust.hk/~dekai/papers/segmentation.html, accessed January 30, 2001).
Wu, M. and K. Su (1993). Corpus-Based Automatic Compound Extraction with Mutual Information and Relative Frequency Count. Proceedings of R.O.C. Computational Linguistics Conference (ROCLING) VI, Taiwan, 207-216.
Xue, Ping (1991). Syntactic Dependencies in Chinese and their Theoretical Implications. Ph.D. dissertation, University of Victoria, Canada.
Yao, T., G. Zhang, and Y. Wu (1990). A Rule-Based Chinese Automatic Segmentation System. Journal of Chinese Information Processing 4(1): 37-43.
Yeh, C-L. and H-J. Lee (1991). Rule-Based Word Identification For Mandarin Chinese Sentences -- A Unification Approach. Computer Processing of Chinese and Oriental Languages. Vol. 5, No. 2, 97-118.
Yu, Shihong et al (1997). Description of the Kent Ridge Digital Labs System Used for MUC-7. Proceedings of MUC-7. From http://perso.enst.fr/~monnier/lectures/IE/MUC7/muc_7_toc.html (accessed January 30, 2001).
Zhang, J., Z. Chen and S. Chen (1991). A Method of Word Identification for Chinese by Constraint Satisfaction and Statistical Optimization Techniques. Proceedings of R.O.C. Computational Linguistics Conference (ROCLING) IV, Taiwan, 147-165.
Zhang, Shoukang (1957). “Lüetan hanyu goucifa” (A brief discussion on Chinese word formation). Xiandai Hanyu Cankao Ziliao (Reference for Contemporary Chinese), ed. by Yushu Hu (1981), Shanghai: Shanghai Jiaoyu Chubanshe (Shanghai Education Publishing Company), 241-256.
Zhao, S. and B. Zhang (1996). “Liheci de queding yu liheci de xingzhi” (Determination and characteristics of separable words). Yuyan Jiaoxue he Yanjiu (Language Instruction and Research), No.1, 40-51.
Zhu, Dexi (1985). Yufa Wenda (Questions and Answers on Chinese Grammar). Shangwu Yinshuguan (Commercial Press), Beijing.
Zwicky, A.M. (1987). Slashes in the Passive. Linguistics 25, 639-669.
Zwicky, A.M. (1989). Idioms and Constructions. Eastern States Conference on Linguistics 5, 547-558.