Handling Chinese NP predicate in HPSG
(old paper in Proceedings of the Second Conference of the Pacific
Association for Computational Linguistics, Brisbane, 1995)

Wei Li & Paul McFetridge

Department of Linguistics
Simon Fraser University
Burnaby, B.C. CANADA V5A 1S6

Key words: HPSG, knowledge representation, Chinese processing

Abstract

This paper addresses a type of Chinese NP predicate in the framework of HPSG 1994 (Pollard & Sag 1994). Special emphasis is laid on knowledge representation and the interaction of syntax and semantics in natural language processing. A knowledge-based HPSG model is designed. This design not only lays a foundation for effectively handling the Chinese NP predicate problem, but also has theoretical and methodological significance for NLP in general.

In Section 1, the data are analyzed and both the structural and the semantic constraints for this pattern are defined. Section 2 discusses the semantic constraints in the wider context of the conceived knowledge-based model. The aim of natural language analysis is to reach interpretations, i.e. to assign semantic roles correctly to the constituents. We show that some structures cannot be interpreted without resorting to some common sense knowledge, and we present a way to organize and utilize such knowledge in the HPSG lexicon. In Section 3, a lexical rule for this pattern is proposed in our HPSG model for Chinese, whose prototype is being implemented.

1. Problem

We will first present the data on the Chinese NP predicate. Then we will investigate what makes it possible for an NP to behave like a predicate. We will do this by defining both the syntactic and the semantic constraints for this Chinese pattern.

1.1. Data: one type of Chinese NP predicate

1) 他好身体。

ta hao shenti.
he good body
He is of good health.

2) 张三高个子。

Zhangsan gao gezi
Zhangsan tall figure.
Zhangsan is tall.

3) 李四圆圆的脸。

Lisi yuanyuan de lian.
Lisi round-round DE face.
Lisi has a quite round face.

4) 这件大衣红颜色。

zhe jian dayi hong yanse.
this (cl.) coat red colour.
This coat is of red colour.

5) 明天小雨。

mingtian xiao yu.
tomorrow little rain.
Tomorrow it will drizzle.

6) 那张桌子三条腿。

na zhang zhuozi san tiao tui.
that (cl.) table three (cl.) leg
That table is three-legged.

Note: (cl.) for classifier.
DE for Chinese attribute particle.

The relation between the subject NP and the predicate NP is not identity. The NP predicate in Chinese usually describes a property the subject NP has, corresponding to English be-of/have NP. In identity constructions, the linking verb SHI (be) cannot normally be omitted.[1]

7a) 他是学者。

ta shi xuezhe.
he be scholar
He is a scholar.

7b) ? 他学者。

ta xuezhe.
he scholar

1.2. Problem analysis

1.2.1. We first investigate the structural characteristics of the Chinese NP predicate pattern.

A single noun cannot act as predicate. More restrictively, not every NP can become a predicate. It seems that only the NP with the following configuration has this potential: NP [lex -, predicate +]. In other words, a predicate NP consists of a lexical N with a modifying sister. Structures of this sort should not be further modified.[2] Thus, the following patterns are predicted.

8a) 那张桌子三条腿。

na zhang zhuozi san tiao tui. [ same as 6) ]
that (cl.) table three (cl.) leg
That table is three-legged.

8b) 那张桌子塑料腿。

na zhang zhuozi suliao tui.
that (cl.) table plastic leg
That table is of plastic legs.

8c) * 那张桌子三条塑料腿。
* na zhang zhuozi san tiao suliao tui. [too many attributes]

8d) * 那张桌子腿。
* na zhang zhuozi tui. [no attributes]
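
The structural constraint illustrated by 8a) to 8d) can be sketched in Python as follows. This is only a rough illustration under our own assumptions: the class NounPhrase and the function can_be_predicate are hypothetical names, an NP is reduced to a head noun plus a flat list of modifiers, and the exceptions with yi (one) discussed in footnote [2] are not covered.

```python
from dataclasses import dataclass, field


@dataclass
class NounPhrase:
    """A toy NP: a lexical head noun plus a flat list of modifiers (assumed layout)."""
    head: str
    modifiers: list = field(default_factory=list)


def can_be_predicate(np):
    """Rough structural test: a lexical N with exactly one modifying sister."""
    return len(np.modifiers) == 1


# The predicate NPs of examples 8a) to 8d).
examples = {
    "8a": NounPhrase("tui", ["san tiao"]),
    "8b": NounPhrase("tui", ["suliao"]),
    "8c": NounPhrase("tui", ["san tiao", "suliao"]),  # too many attributes
    "8d": NounPhrase("tui"),                          # no attributes
}

for label, np in examples.items():
    print(label, can_be_predicate(np))
```

The check is purely structural; it says nothing yet about which subjects the predicate NP can combine with.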

1.2.2. What is the semantic constraint for the Chinese predicate pattern?

Although there is no syntactic agreement between subject and predicate in Chinese, there is an obvious semantic "agreement" between the two: hao shenti (good body) requires a HUMAN as its subject; san tiao tui (three leg) demands that the subject be FURNITURE or ANIMATE. Therefore, the following are unacceptable:

9) * 这杯茶好身体。

* zhe bei cha hao shenti.
this cup tea good body

10) * 空气三条腿。

* kongqi san tiao tui.
air three (cl.) leg

Obviously, it is not hao (good) or san tiao (three) that imposes this semantic selection on the subject. The semantic restriction comes from the noun shenti (body) or tui (leg). There is an internal POSSESS relationship between the subject and this noun: shenti (body) belongs to human beings, and tui (leg) is one part of an animal or of some furniture. This common sense relation is a crucial condition for the successful interpretation of Chinese NP predicate sentences.
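
The semantic "agreement" between the subject and the head noun of the predicate NP can be sketched in Python as follows. The taxonomy IS_A, the class assignments in NOUN_CLASS and the function names are hypothetical and chosen only for illustration; the expectations in EXPECTED_POSSESSOR restate the constraints given above (shenti expects a HUMAN, tui expects FURNITURE or ANIMATE).

```python
# Hypothetical is-a links between semantic classes (a toy taxonomy, assumed).
IS_A = {
    "human": "animate",
    "animal": "animate",
    "animate": "thing",
    "furniture": "thing",
    "drink": "thing",
    "gas": "thing",
}

# Semantic class of a few subject nouns from the examples (assumed).
NOUN_CLASS = {"ta": "human", "zhuozi": "furniture", "cha": "drink", "kongqi": "gas"}

# Possessor classes expected by the head noun of the predicate NP.
EXPECTED_POSSESSOR = {"shenti": {"human"}, "tui": {"furniture", "animate"}}


def is_a(cls, target):
    """Walk up the toy taxonomy from cls and report whether target is reached."""
    while cls is not None:
        if cls == target:
            return True
        cls = IS_A.get(cls)
    return False


def agrees(subject, head_noun):
    """True if the subject belongs to a class that the head noun expects as possessor."""
    subject_class = NOUN_CLASS[subject]
    return any(is_a(subject_class, expected) for expected in EXPECTED_POSSESSOR[head_noun])


print(agrees("ta", "shenti"))   # subject of 1)
print(agrees("cha", "shenti"))  # subject of 9)
print(agrees("kongqi", "tui"))  # subject of 10)
```

Note that the modifier (hao, san tiao) plays no part in the check, in line with the observation that the restriction comes from the noun alone.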

There are a number of issues involved here. First, what is the relationship of this type of knowledge to the syntactic structures and semantic interpretations? Second, where and how would this knowledge be represented? Third, how will the system use the knowledge when it is needed? More specifically, how will the introduction of this knowledge coordinate with the other parts of the well established HPSG formalism? Those are the questions we attempt to answer before we proceed to provide a solution to the Chinese NP predicate. Let us look at some more examples:

11a) 桌子坏了。

zhuozi huai le.
table bad LE
The table went wrong.

11b) 腿坏了。

tui huai le.
leg bad LE
The leg went wrong.

11c) 桌子的腿坏了。

zhuozi de tui huai le.
table DE leg bad LE
The table's leg went wrong.

12a) 他好。

ta hao.
he good
He is good.

12b) 身体好。

shenti hao.
body good
The health is good.

12c) 他的身体好。

ta de shenti hao.
he DE body good
His health is good.

note: LE for Chinese perfect aspect particle.

When people say 11b) tui huai le (leg went wrong), we know that something (the possessor) is omitted. For 11a), however, we have no such feeling of incompleteness. Although we may also ask whose table it is, this possessive relation between the owner and the table is by no means innate. Similarly, ta (he) in 12a) is a complete notion denoting someone, while shenti (body) in 12b) is not. In 11c) and 12c), the possessor appears in the possessive DE-construction, so the expectation of tui (leg) and shenti (body) is realized. These examples show that some words (concepts) have a conceptual expectation for some other words (concepts), although the expected words do not necessarily show up in a sentence and the expectation might not be satisfied. In fact, this type of expectation forms part of our knowledge (common sense). One way to represent this knowledge is to encode it with the related word in the lexicon.

Therefore we propose an underlying SYNSEM feature KNOWLEDGE to store some of our common sense knowledge by capturing the internal relation between concepts. KNOWLEDGE parallels the syntactic SUBCAT and the semantic RELATION. KNOWLEDGE imposes semantic constraints on the expected arguments no matter what syntactic forms the arguments take (they may take a null form, i.e. the underlying arguments are not realized). In contrast, SUBCAT only defines the syntactic requirements for the complements and gets interpreted in RELATION. Following this design, syntactic form and semantic constraints are kept apart. When necessary, the interaction between them can be implemented by lexical rules, or by direct coindexing in the lexicon. For example, the following KNOWLEDGE information will be enforced as the necessary semantic constraints when we handle Chinese NP predicates by a lexical rule (see 3.3).

PHON shenti
SYNSEM | KNOWLEDGE | PRED [1] possess
SYNSEM | KNOWLEDGE | POSSESSOR [2] human
SYNSEM | KNOWLEDGE | POSSESSED [3]
SYNSEM | LOCAL | CONTENT | INDEX [3]
SYNSEM | LOCAL | CONTENT | RESTRICTION { RELATION body }
SYNSEM | LOCAL | CONTENT | RESTRICTION { INSTANCE [3] }
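
As a minimal sketch, the entry above can be written as a nested Python dictionary in which the tag [3] is modelled by one shared object. The helper names make_shenti and get_path are hypothetical, and plain dictionaries are only a stand-in for typed feature structures; the point illustrated is that structure sharing means token identity, not mere equality of values.

```python
def make_shenti():
    """Build a toy feature structure for shenti (body) with a shared index."""
    index3 = {}  # tag [3]: a single object referenced from three places
    return {
        "PHON": "shenti",
        "SYNSEM": {
            "KNOWLEDGE": {
                "PRED": "possess",
                "POSSESSOR": "human",
                "POSSESSED": index3,
            },
            "LOCAL": {
                "CONTENT": {
                    "INDEX": index3,
                    "RESTRICTION": [{"RELATION": "body", "INSTANCE": index3}],
                },
            },
        },
    }


def get_path(fs, path):
    """Follow a feature path written as 'A | B | C' through nested dictionaries."""
    for feature in path.split(" | "):
        fs = fs[feature]
    return fs


entry = make_shenti()
possessed = get_path(entry, "SYNSEM | KNOWLEDGE | POSSESSED")
index = get_path(entry, "SYNSEM | LOCAL | CONTENT | INDEX")

# Information added through one path becomes visible through the other.
index["NUMBER"] = "singular"
print(possessed is index, possessed)
```

Because the same object is reached through both paths, any constraint later placed on the INDEX of shenti automatically constrains the POSSESSED argument of its KNOWLEDGE.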

2. Agreement revisited

This section relates the semantic constraints that embody common sense to the conventional linguistic notion of agreement. We will show that they are essentially the same thing seen from different perspectives. We only need a slight expansion of the definition of agreement to accommodate some of our basic knowledge. This is important as it accounts for the feasibility of coding knowledge in linguistic ways. The linguistic lexicon seems to be good enough to house some general knowledge in addition to linguistic knowledge. Some possible problems with this knowledge-based approach are also discussed.

Let's first consider the following two parallel agreement problems in English:

13) * The boy drink.

14) ? The air drinks.

13) is ungrammatical because it violates the syntactic agreement between the subject and predicate. 14) is conventionally considered grammatical although it violates the semantic agreement between the agent and the action. Since the approach taken in this paper is motivated by semantic agreement, some elaboration and comment on agreement are needed.

Agreement in person, gender and number is encoded in the CONTENT | INDEX features (Pollard & Sag 1994, Chapter 2). It follows that any two co-indexed signs naturally agree with each other. That is desirable because co-indexed signs refer to the same entity. However, person, gender and number seem to be only part of the story of agreement. We may expand the INDEX feature to cope with semantic agreement, both for handling Chinese and for in-depth semantic analysis of other languages.

Note that to accommodate semantic agreement in HPSG, we first need features to represent the result of semantic classification of lexical meanings like HUMAN, FOOD, FURNITURE, etc. We therefore propose a ROGET feature (named after the thesaurus dictionary) and put it into the INDEX feature.

Semantic agreement, sometimes termed semantic constraint or semantic selection restriction in the literature, is not a new concept in natural language processing. Hardly any in-depth language analysis can go smoothly without incorporating it to a certain extent. For languages like Chinese, with virtually no inflection, it is even more important. We can hardly imagine how the roles can be correctly assigned without the involvement of semantic agreement in the following sentences of the form NP1 NP2 Vt:

15a) 点心我吃了。

dianxin wo chi le.
Dim-Sum I eat LE
The Dim Sum I have eaten.

15b) 我点心吃了。

wo dianxin chi le.
I Dim-Sum eat LE
I have eaten the Dim Sum.

Who eats what? There is no formal way to assign the roles correctly other than to resort to the semantic agreement enforced by eat. In HPSG 1994, it was pointed out (Pollard & Sag 1994, p. 81), "... there is ample independent evidence that verbs specify information about the indices of their subject NPs. Unless verbs 'had their hands on' (so to speak) their subjects' indices, they would be unable to assign semantic roles to their subjects." The Chinese data show that sometimes verbs need to have their hands on the semantic categories (ROGET) of both their external argument (subject) and their internal arguments to be able to assign roles correctly. Now that we have expanded the INDEX feature to cover both ROGET and the conventional agreement features number, person and gender, the above claim of Pollard and Sag becomes more general.

It is widely agreed that knowledge is bound to play an important role in natural language analysis and disambiguation. The question is how to build a knowledge-based system which is manageable. Knowledge consists of linguistic knowledge (phonology, morphology, syntax, semantics, etc.) and extra-linguistic knowledge (common sense, professional knowledge, etc.). Since semantics is based on lexical meanings, lexical meanings represent concepts, and concepts are linked to each other in a way that forms knowledge, we can well regard semantics as a link between linguistic and extra-linguistic knowledge. In other words, some extra-linguistic knowledge may be represented in linguistic ways. In fact, the lexicon, if properly designed, can be a rich source of knowledge, both linguistic and extra-linguistic. A typical example of how concepts are linked in a network (a sophisticated concept lexicon) is seen in the representation of drink ((*ANI SUBJ) (((FLOW STUFF) OBJE) ((SELF IN) (((*ANI (THRU PART)) TO) (BE CAUSE))))) in Wilks 1975b. While for various reasons we will not go as far as Wilks, we can gain enlightenment from this type of AI approach to knowledge. Lexicon-driven systems like the one in HPSG can, of course, make use of this possibility. Take the Chinese role-assignment problem, for example: the common sense knowledge that an ANIMATE being eats FOOD can be seamlessly incorporated in the lexical entry chi (eat) as a semantic agreement requirement.

Note: Following the convention, the part after the colon is SYNSEM | LOCAL | CONTENT information.
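
The role-assignment problem of 15a) and 15b) can be sketched in Python as follows. This is an illustration under our own assumptions rather than the grammar itself: the role names agent and patient, the tables ROGET, IS_A and VERBS, and the function assign_roles are all hypothetical. The only knowledge used is the one stated above, namely that an ANIMATE being eats FOOD.

```python
# Toy ROGET categories for the nouns in 15a) and 15b) (assumed).
ROGET = {"wo": "human", "dianxin": "food"}

# Hypothetical is-a link: a human is an animate being.
IS_A = {"human": "animate"}

# Semantic agreement requirements of chi (eat): ANIMATE eats FOOD.
VERBS = {"chi": {"agent": "animate", "patient": "food"}}


def is_a(cls, target):
    """Walk up the toy taxonomy from cls and report whether target is reached."""
    while cls is not None:
        if cls == target:
            return True
        cls = IS_A.get(cls)
    return False


def assign_roles(np1, np2, verb):
    """Try both role assignments for 'NP1 NP2 Vt' and keep the one that agrees."""
    frame = VERBS[verb]
    for agent, patient in ((np1, np2), (np2, np1)):
        if is_a(ROGET[agent], frame["agent"]) and is_a(ROGET[patient], frame["patient"]):
            return {"agent": agent, "patient": patient}
    return None  # no assignment satisfies the semantic agreement


print(assign_roles("dianxin", "wo", "chi"))  # 15a)
print(assign_roles("wo", "dianxin", "chi"))  # 15b)
```

Both word orders receive the same role assignment, since the decision rests on the semantic categories of the two NPs and not on their positions.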

One last point we would like to make in this context is that semantic agreement, like syntactic agreement, should be able to loosen its restrictions. In other words, agreement is just a canonical requirement, or in Wilks's terms a preference (Wilks 1975a). In actual communication, deviation to different degrees is often seen, and people often relax the preference restrictions in order to understand. With semantic agreement, deliberate deviation is one of the handy means of rendering rhetorical expression. In a certain domain, Chomsky's famous sentence Colorless green ideas sleep furiously is well imaginable. On the other hand, deviation from syntactic agreement will not affect the meaning if no confusion is caused, which may or may not happen depending on the context and the structure of the language. In English, lack of syntactic agreement for the present third person singular between subject and predicate usually causes no problem. Sentence 13) The boy drink can therefore be accepted and correctly interpreted. There is much more to say on the interaction of the two types of agreement deviation, how a preference model might be conceived, what computational complexities it may cause and how to handle them effectively. We plan to address these issues in another paper. The interested reader is referred to one famous approach in this direction (Wilks 1975a, 1978).

3. Solution

We will set some requirements first and then present a lexical rule to see how well it meets our requirements.

3.1. Based on the discussion in Section 1, the solution to the Chinese predicate NP problem should meet the following 4 requirements:

(1) It should enforce the syntactic constraints for this pattern: one and only one modifier XP in the form of NP1 XP N2.

(2) It should enforce the semantic constraints for this pattern: N2 must expect NP1 as its POSSESSOR with semantic agreement.

(3) It should correctly assign roles to the constituents of the pattern: NP1 POSSESS NP2 (where NP2 consists of XP N2).

(4) It should be implementable in the HPSG formalism.

3.2. What mechanisms can we use to tackle a problem in HPSG formalism?

An HPSG grammar consists of two components: a general grammar (ID schemata and principles) and a lexical grammar (in the lexicon). The lexicon houses lexical entries with their linguistic description and knowledge representation in feature structures. The lexicon also contains generalizations captured by inheritance in the lexical hierarchy and by a set of lexical rules. Roughly speaking, lexical hierarchy covers static redundancy between related potential structures. Precisely because the lexicon can reflect different degrees of lexical redundancy in addition to idiosyncrasy, the general grammar can desirably be kept to a minimum.

The Chinese NP predicate pattern should be treated in the lexicon. There are two arguments for that. First, this pattern covers only restricted phenomena (see 3.4). Second, it relies heavily on the semantic agreement, which in our model is specified in the lexicon by KNOWLEDGE. We need somehow to link the semantic expectation KNOWLEDGE and the syntactic expectation SUBCAT or MOD. The general mechanism to achieve that is structure sharing by coindexing the features either directly in the lexical entries (see the representation of the entry chi in Section 2) or through lexical rules (see 3.3).

3.3. Lexical Rule

Lexical rules are applied to lexical signs (words, not phrases) which satisfy the condition. The result of the application is an expanded lexicon to be used during parsing. Since the pattern is of the form NP1 XP N2, the only possible target is N2, i.e. shenti (body) or tui (leg). This is due to the fact that among the three necessary signs in this form, the first two are phrases and only the final N2 is a lexical sign. We assume the following structure for our proposed lexical rule:

NP[ta[1]] [[AP[2] hao] [N<NP[1], XP[2]> shenti]]

NP Predicate Lexical Rule

SYNSEM | KNOWLEDGE | PRED [1] possess
SYNSEM | KNOWLEDGE | POSSESSOR [2]
SYNSEM | LOCAL | CATEGORY | HEAD | MAJ [6] n
SYNSEM | LOCAL | CATEGORY | PREDICATE -
SYNSEM | LOCAL | CONTENT | INDEX [4]
SYNSEM | LOCAL | CONTENT | RESTRICTION {[3]}

==>

...| CATEGORY | PREDICATE +
...| CATEGORY | SUBCAT | EXTERNAL_ARGUMENT [NP: [5]]
...| CATEGORY | SUBCAT | INTERNAL_ARGUMENTS < [...| CATEGORY | HEAD | MOD [6] ] >
...| CATEGORY | SUBCAT | INTERNAL_ARGUMENTS < [...| CONTENT | INDEX [4] ] >
...| CATEGORY | SUBCAT | INTERNAL_ARGUMENTS < [...| CONTENT | RESTRICTION {[7]} ] >
...| CATEGORY | SUBCAT | INTERNAL_ARGUMENTS < [...| LEX - ] >
...| CONTENT | RELATION [1] possess
...| CONTENT | POSSESSOR [5] | INDEX | ROGET [2]
...| CONTENT | POSSESSED | INDEX [4]
...| CONTENT | POSSESSED | RESTRICTION {[7] | [3] }

For a complicated information flow like this, it is best to explain the indices one by one with regard to the example ta hao shenti (he is of good body) in the form of NP1 XP N2.

The index [1] links the underlying PRED feature of N2 to the semantic RELATION feature; in other words, the predicate in the underlying KNOWLEDGE of shenti (body) now surfaces as the relation for the whole sentence. The index [2] enforces the semantic constraint for this pattern, i.e. shenti (body) expects a human (ROGET) possessor as the subject (EXTERNAL_ARGUMENT) for this sentence. The index [3] is the restriction relation of N2. [4] links the INDEX features of XP and N2, and [6] indicates that the internal argument is a de-facto modifier of N2, i.e. XP mods-for N2. Note that the part of speech of the internal argument (INTERNAL_ARGUMENT | SYNSEM | LOCAL | CATEGORY | HEAD | MAJ) is deliberately not specified in the rule because Chinese modifiers (XP) are not confined to one class, as can be seen in our linguistic data. Finally, [7] defines the restriction relation of the XP to the INDEX of N2.

The indices [4], [7] and [3] all contribute to artificially creating a semantic interpretation for [XP N2]. Semantically, XP is in fact a modifier of N2, and together they would form NP2, the [XP N2] constituent. In normal circumstances, the building of the NP2 interpretation is taken care of by the HPSG Semantics Principle. But in this special pattern, we have treated XP as a complement of N2, yet semantically they are still understood as one instance: hao shenti (good body) is an instance of both good and body. This interpretation of NP2 serves as the POSSESSED of the sentence predicate, indicated by the structure sharing of [4], [7] and [3]. Finally, [5] is the interpretation of NP1 and is assigned the role of POSSESSOR for the sentence predicate.
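
The information flow of the rule can be sketched in Python as follows. This is a strongly simplified illustration and not the ALE implementation: feature paths are flattened into single keys, the names np_predicate_rule, interpret and EXTERNAL_ROGET are hypothetical, and unification is replaced by plain equality tests. The comments relate each line to the tags used in the rule.

```python
def np_predicate_rule(noun):
    """Return a predicative variant of a noun entry, or None if the rule does not apply."""
    know = noun.get("KNOWLEDGE")
    if noun["MAJ"] != "n" or noun["PREDICATE"] or not know or know["PRED"] != "possess":
        return None
    return {
        "PHON": noun["PHON"],
        "MAJ": "n",
        "PREDICATE": True,
        "RELATION": know["PRED"],                  # tag [1]
        "EXTERNAL_ROGET": know["POSSESSOR"],       # tag [2]
        "RESTRICTION": list(noun["RESTRICTION"]),  # tag [3]
        "INDEX": noun["INDEX"],                    # tag [4]
    }


def interpret(subject, modifier, predicate):
    """Combine NP1, XP and the derived N2; return the CONTENT or None on failure."""
    if subject["ROGET"] != predicate["EXTERNAL_ROGET"]:
        return None  # semantic agreement on the possessor fails
    if modifier["LEX"]:
        return None  # the internal argument must be phrasal (LEX -)
    return {
        "RELATION": predicate["RELATION"],
        "POSSESSOR": subject["PHON"],              # tag [5]
        "POSSESSED": {
            "INDEX": predicate["INDEX"],
            "RESTRICTION": modifier["RESTRICTION"] + predicate["RESTRICTION"],  # [7] and [3]
        },
    }


SHENTI = {"PHON": "shenti", "MAJ": "n", "PREDICATE": False, "INDEX": "i4",
          "KNOWLEDGE": {"PRED": "possess", "POSSESSOR": "human"},
          "RESTRICTION": ["body"]}
TA = {"PHON": "ta", "ROGET": "human"}
CHA = {"PHON": "cha", "ROGET": "drink"}   # category label assumed
HAO = {"PHON": "hao", "LEX": False, "RESTRICTION": ["good"]}

derived = np_predicate_rule(SHENTI)
print(interpret(TA, HAO, derived))
print(interpret(CHA, HAO, derived))
```

The original entry SHENTI is left untouched and the derived entry is added beside it, which mirrors the view of lexical rules as expanding the lexicon before parsing.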

Let's see how well this lexical rule meets the 4 requirements set in 3.1.

(1) It enforces the syntactic constraints by treating XP as the internal argument and NP1 as the external argument.

(2) It enforces the semantic constraints through structure sharing by the index [2].

(3) It correctly assigns roles to the constituents of the pattern.

The following interpretation will be established for ta hao shenti (he is of good body) by the parser.

CONTENT | RELATION possess
CONTENT | POSSESSOR | INDEX | PERSON 3
CONTENT | POSSESSOR | INDEX | NUMBER singular
CONTENT | POSSESSOR | INDEX | GENDER male
CONTENT | POSSESSOR | INDEX | ROGET human
CONTENT | POSSESSOR | RESTRICTION { }
CONTENT | POSSESSED | INDEX [1] | PERSON 3
CONTENT | POSSESSED | INDEX | NUMBER singular
CONTENT | POSSESSED | INDEX | GENDER nil
CONTENT | POSSESSED | INDEX | ROGET organ
CONTENT | POSSESSED | RESTRICTION { [ RELATION good ], [ RELATION body ] }
CONTENT | POSSESSED | RESTRICTION { [ INSTANCE [1] ], [ INSTANCE [1] ] }

In prose, it says roughly that a third person male human (he) possesses something which is an instance of good body. We believe that this is an adequate interpretation of the original sentence.

(4) Last, this rule has been implemented in our Chinese HPSG-style grammar using ALE and Prolog. The results meet our objective.

But there is one issue we have not yet touched on: word order. At first sight, Chinese seems to have LP constraints similar to those of English. For example, the internal argument(s) of a Chinese transitive verb by default appear on the right side of the head. It seems that our formulation contradicts this constraint in the grammar. But in fact, there are many other examples with the internal argument(s), especially PP argument(s), appearing on the left side of the head.

服务 fuwu (serve): <NP, PP(wei)>

16a) 为人民服务

wei renmin fuwu
for people serve
Serve the people.

16b) ? 服务为人民。

fuwu wei renmin.
serve for people

有益 youyi (of benefit): <NP, PP(dui/yu)>

17a) 这对我有益。

zhe dui wo youyi
this to I have-benefit
This is of benefit to me.

17b) * 这有益对我。

zhe youyi dui wo
this have-benefit to I

18a) 这于我有益。

zhe yu wo youyi
this to I have-benefit
This is of benefit to me.

18b) 这有益于我。

zhe youyi yu wo
this have-benefit to I
This is of benefit to me.

Word order and its place in grammar are important issues in formulating a Chinese grammar. To play it safe and avoid premature generalization, we assume a lexicalized view of the Chinese LP constraints, encoding word order information in the lexicon through the SUBCAT and MOD features. This proves to be a realistic and precise approach to Chinese word order phenomena.
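
The lexicalized view of word order can be sketched in Python as follows. The encoding is hypothetical: each head lists, for every PP complement it subcategorizes for, the sides on which that PP may appear, and the names LEXICON and order_ok are ours. The questionable status of 16b), marked with ?, is collapsed into "not licensed" in this binary sketch.

```python
# For each head, the PP complements it takes and the sides of the head
# on which each may appear (toy encoding, assumed).
LEXICON = {
    "fuwu": {"wei": {"left"}},                            # 16a) vs 16b)
    "youyi": {"dui": {"left"}, "yu": {"left", "right"}},  # 17) and 18)
}


def order_ok(head, preposition, side):
    """True if the lexical entry of head licenses a PP(preposition) on the given side."""
    return side in LEXICON[head].get(preposition, set())


print(order_ok("fuwu", "wei", "left"))    # 16a)
print(order_ok("youyi", "dui", "right"))  # 17b)
print(order_ok("youyi", "yu", "right"))   # 18b)
```

Since the direction is stated entry by entry, the contrast between dui and yu with the same head youyi needs no general LP rule.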

3.4. As a final note, we will briefly compare the NP Predicate Pattern with one of the Chinese Topic Constructions:

NP1 NP2 Vi/A
(topic + (subject + predicate))

In Chinese, this form is closely related to the NP Predicate Pattern but much more productive, and the two have different structures.

19) 他身体好。

ta shenti hao
he body good
He is good in health.

For topic constructions, we propose a new feature CONTEXT | TOPIC, whose index in this case is token-identical to the INDEX value of ta. Note that in the above structure, the CONTEXT | TOPIC ta is considered a sentential adjunct rather than a complement subcategorized for by shenti. Why? First, ta is highly optional: a topic-less sentence is still a sentence. Second, and more convincingly, ta cannot always be predicted by the noun that follows it. Compare:

20a) 他身体好。

ta shenti hao
he body good
He is good in health.

20b) 他好身体。

ta hao shenti
he good body
He is of good health.

21a) 他脾气好。

ta piqi hao
he disposition good
He is good in disposition.

21b) 他好脾气。

ta hao piqi
he good disposition
He is of good disposition.

but:

22a) 他学习好。

ta xuexi hao. [3]
he study good
He is good in study.

22b) * 他好学习。

ta hao xuexi
he good study

What this shows is that for topic sentences like ta shenti hao (He is good in health), ta xuexi hao (He is good in study), etc., there is no requirement to regard the topic ta (he) as a necessary semantic possessor of shenti / xuexi. The relation is rather "in-aspect": something (NP1) is good (A) in some aspect (NP2), or for something (NP1), some aspect (NP2) is good (A).

Finally, it needs to be mentioned that our proposed lexical rule requires modification to accommodate sentence 6). That is beyond the scope of this paper because it is tied up with the way we handle Chinese classifiers in the HPSG framework.

References

Pollard, Carl & Sag, Ivan A. (1994): Head-Driven Phrase Structure Grammar, Centre for the Study of Language and Information, Stanford University, CA

Pollard, Carl & Sag, Ivan A. (1987): Information‑based Syntax and Semantics Vol. 1: Fundamentals. Centre for the Study of Language and Information, Stanford University, CA

Wilks, Y.A. (1975a): A Preferential Pattern-Seeking Semantics for Natural Language Inference. Artificial Intelligence, Vol. 6, pp.53-74.

Wilks, Y.A. (1975b): An Intelligent Analyzer and Understander of English, in Communications of the ACM, Vol. 18, No.5, pp.264-274

Wilks, Y.A. (1978): Making Preferences More Active. Artificial Intelligence, Vol. 11, pp. 197-223

~~~~~~~~~~~~~~~ footnotes ~~~~~~~~~~~~~~~~

[1] This is not absolute; we do have the following examples:

Ia) 约翰是纽约人。

Yuehan shi Niuyue ren
John be New-York person
John is a New Yorker.

Ib) 约翰纽约人。

Yuehan Niuyue ren.
John New-York person
John is a New Yorker.

IIa) 今天是星期天。

jintian shi xingqi-tian.
today be Sun-day
Today is Sunday.

IIb) 今天星期天。

jintian xingqi-tian.
today Sun-day
Today is Sunday.

It seems that the subject NP stands for some individual element(s), and the predicate NP describes a set (property) to which the subject belongs. But it is not clear how to capture Ib) and IIb) while excluding 7b). We leave this question open.

[2] We realize that the syntactic constraint defined here is only a rough approximation to the data from a syntactic angle. It seems to match most data, but there are exceptions when yi (one) appears in a numeral-classifier phrase:

IIIa) 他一副好身体。

ta yi fu hao shenti.
he one (cl.) good body
He is of good health. (He is of a good body.)

IIIb) * 他三副好身体。

ta san fu hao shenti
he three (cl.) good body

IIIc) 他好身体。

ta hao shenti. [same as 1) ]

IVa) 李四一张圆圆的脸。

Lisi yi zhang yuanyuan de lian.
Lisi one (cl.) round-round DE face
Lisi has a quite round face.

IVb) * 李四两张圆圆的脸。

Lisi liang zhang yuanyuan de lian.
Lisi two (cl.) round-round DE face

IVc) 李四圆圆的脸。

Lisi yuanyuan de lian. [ same as 3) ]