PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

6.0. Introduction

This chapter studies some challenging problems of Chinese derivation and its interface with syntax.  These problems have defied existing word segmenters and remain long-standing issues in Chinese grammar research.

It is observed that a good number of signs have become more and more affix-like as the Chinese language has developed.  Typical, indisputable examples include the nominalizer 性 ‑xing (-ness) and the prefix 第 di- (-th).  While few people doubt the existence of affixes in Contemporary Chinese, there is no general agreement on the exact number of Chinese affixes, due to a considerable number of borderline cases often referred to as ‘quasi-affixes’ (类语缀 lei yu-zhui).[1]  It will be argued that quasi-affixes belong to morphology and are structurally no different from other affixes.  The major difference between ‘quasi-affixes’ and the few generally acknowledged (‘genuine’) affixes lies mainly in semantics:  the former retain some ‘solid’ meaning while the latter are more functionalized.  However, this does not prevent CPSG95 from treating quasi-affixes in the same way as it handles other affixes.  It will be shown that the semantic difference between affixes and quasi-affixes can be accommodated fairly easily in the CPSG95 lexicon.

Based on an examination of the common properties of Chinese affixes and quasi-affixes, a general approach to Chinese derivation is proposed.  This approach not only enables us to handle quasi-affix phenomena, but is also flexible enough to provide an adequate treatment of a special problem in Chinese derivation, namely zhe-suffixation.  The affix status of 者 -zhe (-er) is generally acknowledged (it is classified as a suffix in authoritative works like Lü et al 1980):  it attaches to a verb sign and produces a word.  The peculiar aspect of this suffix is that the verb stem to which it attaches can be syntactically expanded.  In fact, there is a significant amount of evidence for the argument that this suffix expects a VP as its stem (see 6.5 for the evidence).  Since a VP is only formed in syntax and derivation belongs to morphology, this phenomenon presents a highly challenging case for how morphology should be properly interfaced with syntax.  The solution offered in CPSG95 demonstrates the power of designing morphology and syntax in an integrated grammar formalism.  In contrast, in any system which enforces sequential processing of derivation morphology before syntax, as most traditional systems do, this is an unsolvable problem:  there is no way for partial output of syntactic analysis (i.e. a VP) to feed back to a derivation rule in the preprocessing stage.

In Section 6.1, the general approach to Chinese derivation is proposed first.  Following this proposal, prefixation is illustrated in 6.2 and suffixation in 6.3.  Section 6.4 shows that this general approach to derivation applies equally well to the ‘quasi-affix’ phenomena.  Section 6.5 investigates the suffixation of -zhe (-er).  The analysis is based on the argument that this suffixation involves the combination VP+-zhe.  The specific solution following the CPSG95 general approach will be presented based on this analysis.

6.1. General Approach to Derivation

This section examines the properties of Chinese affixes and proposes a corresponding general approach to Chinese derivation.  This serves as the basis for the specific solutions to various problems in Chinese derivation presented in the remaining sections.

It is fairly easy to observe that in Chinese derivation it is the affix which selects the stem, not the other way round.  For example, the suffix 性 -xing (‑ness) expects an adjective to produce an (abstract) noun.  Based on an examination of the behavior of a variety of Chinese affixes and quasi-affixes, the following generalization has been reached:  an affix lexically expects a sign of category x, with possible additional constraints, to form a derived word of category y.  This generalization is believed to capture the common property shared by Chinese affixes and quasi-affixes.  It seems to account for all Chinese derivational data, including typical affixation, quasi-affixation (see 6.4) and the special case of zhe-suffixation (see 6.5).  So far no counter-evidence has been found to challenge this generalization.

The observation and the generalization above support the argument that in a grammar which relies on lexicalized expectation feature structures to drive the building of structures, affixes, not the stems, should be the selecting heads of the morphological structures.[2]  Leaving aside non-productive affixation,[3] the general strategy for Chinese productive derivation is proposed as follows.  In the lexicon, the affix, as head of the derivative, is encoded with the following derivation information:  (i) what type of stem (constraints) it expects;  (ii) where to look for the expected stem, on its right or left;  (iii) what type of (derived) word it leads to (category, semantics, etc.).  Based on this lexical information, CPSG95 has two PS rules for derivation in the general grammar:  one for prefixation, one for suffixation.[4]  These rules ensure that all the constraints are observed before an affix and a stem are combined.  They also determine that the output of derivation, i.e. the mother sign, is a word.

Along this line, the key to a lexicalized treatment of Chinese derivation is to determine the structural and semantic property of the derivative and to impose proper constraints on the expected stem.  The constraints on the expected stem can be lexically specified in the morphological expectation feature [PREFIXING] or [SUFFIXING] of the affix.  The property (category, syntactic expectation, semantics, etc.) of the derivative can also be encoded directly in the lexical entry of the affix, seen as the head of a derivational structure in the CPSG95 analysis.  This property information, as part of head features, will be percolated up when the derivation rules are applied.
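
The expectation-driven division of labor described above can be pictured in miniature.  The following Python fragment is only an illustrative sketch, not CPSG95 code:  the entry shapes, the `expects`/`output` fields and the `derive` function are hypothetical stand-ins for the lexical [PREFIXING]/[SUFFIXING] expectations and the two derivation PS rules.

```python
# Schematic sketch (not CPSG95 code): expectation-driven derivation.
# Each affix entry records (i) a constraint on the expected stem,
# (ii) the side on which the stem is sought, and (iii) the properties
# of the derived word.  All names here are illustrative only.

def make_affix(side, expects, output):
    return {"side": side, "expects": expects, "output": output}

def derive(affix, stem):
    """Prefix/Suffix PS rule collapsed into one checker: combine an
    affix with an adjacent stem only if the lexical expectation is
    satisfied.  The mother sign is always a word."""
    if not affix["expects"](stem):
        return None
    return {"sign": "word", **affix["output"](stem)}

# The ordinal prefix di- expects a cardinal numeral to its right
# and yields an ordinal numeral.
di = make_affix(
    side="right",                                  # stem follows the prefix
    expects=lambda s: s["cat"] == "cardinal_num",
    output=lambda s: {"cat": "ordinal_num", "sem": ("ordinal", s["sem"])},
)

wu = {"cat": "cardinal_num", "sem": 5}             # wu 'five'
di_wu = derive(di, wu)                             # di-wu 'fifth'
```

On this sketch, 第 di- combines with 五 wu (five) into an ordinal word but refuses a non-numeral stem, mirroring the constraint checking performed by the derivation PS rules.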

In the remaining part of this chapter, it will be demonstrated how this proposed general approach is applied to each specific derivation problem.

6.2. Prefixation

The purpose of this section is to present the CPSG95 solution to Chinese prefixation.  This is done by formulating a sample lexical entry for the ordinal prefix 第 di- (-th) in CPSG95.  It will be shown how the lexical information drives the prefix rule in the general grammar for the derivational combination.

Thanks to the productivity of the prefix 第 di- (-th), an ordinal numeral is always derived from a cardinal numeral via the rule informally formulated in (6-1).

(6-1.) 第 di- + cardinal numeral –> ordinal numeral

di-      22      tiao    jun-gui
-th     22      CLA   military-rule
the 22-nd military rule (Catch-22)

di-      ba      ge      shi     tong-xiang
-th     eight  CLA   be      bronze-statue
The eighth is the bronze statue.

The basic function of the Chinese numeral, whether cardinal or ordinal,  is to combine with a classifier, as shown in the sample sentences above.

To capture this phenomenon, CPSG95 defines two subtypes of the category numeral [num], namely [cardinal_num] and [ordinal_num].  The lexical entries of the prefix 第 di‑ (‑th) and the cardinal numeral 五 wu (five) are formulated in (6-2) and (6-3).  The prefix encodes the lexical expectation for the derivation 第 di- + [cardinal_num] --> [ordinal_num], plus the semantic composition of the combination.  Note that the constraint @numeral inherits all the common properties specified in the numeral macro.


As indicated before, prefixation in CPSG95 is handled by the Prefix PS Rule based on the lexical specification.  More specifically, it is driven by the lexical expectation encoded in [PREFIXING].  The prefix rule is formulated in (6-4).


Like all PS rules in CPSG95, this rule takes effect in parsing whenever two adjacent signs satisfy all its constraints, combining them into a higher-level sign.  For example, the prefix 第 di- (-th) and the sign 五 wu (five) will be combined into the sign shown in (6-5).


The combination of 第五 di+wu in (6-5) demonstrates how the morphological structure is built in the CPSG95 approach to Chinese prefixation.

6.3. Suffixation

As with prefixation, suffixation is driven by the lexically encoded expectation, in this case [SUFFIXING].  Parallel to the Prefix PS Rule, the Suffix PS Rule is formulated in (6-6).


With this PS rule in hand, all that is needed is to capture the individual derivational constraints in the lexical entries of the suffixes at issue.  For example, the suffix 性 -xing (-ness) changes an adjective or verb into an abstract noun:  A/V + -xing --> N.  This information is contained in the formulation of the suffix 性 -xing (-ness) in the CPSG95 lexicon, as shown in (6-7).


Note that abstract nouns are uncountable, hence the call to the uncountable_noun macro to inherit the common property of uncountable nouns.[5]

If the suffix 性 -xing (-ness) appears immediately after the adjective 实用 shi-yong (practical), formulated in (6-8), the Suffix PS Rule will combine them into a noun, as shown in (6-9).


The combination of 实用性 shi-yong+xing in (6-9) demonstrates how the morphological structure is built in the CPSG95 approach to Chinese suffixation.

6.4. Quasi-affixes

The purpose of this section is to propose an adequate treatment of the quasi-affix phenomena in Chinese.  This is an area which has not received enough investigation in the field of Chinese NLP.  Few Chinese NLP systems make explicit where and how these quasi-affixes are handled.

To achieve this purpose, typical examples of ‘quasi-affixes’ are presented and compared with some ‘genuine’ affixes.  The comparison highlights the general property shared by ‘quasi-affixes’ and other affixes, and also shows their differences.  Based on this study, it is found feasible to treat quasi-affixes within the derivational morphology of CPSG95.  The proposed solution is presented by demonstrating how a typical quasi-affix is represented in CPSG95 and how the general affix rules work with the lexical entries of ‘quasi-affixes’ as well.

The tables in (6-10) and (6-11) list some representative quasi-affixes in Chinese.

(6-10.)         Table for sample quasi-prefixes

prefixation                                    examples
lei (quasi-)+N –> N                            类前缀 lei-[qian-zhui]: quasi-[pre-fix]
                                               前缀: qian (before, pre-, former-), zhui (…)
ban (semi-)+N –> N                             半文盲 ban-[wen-mang]: semi-illiterate
                                               文盲: wen (written-language), mang (blind)
dan (mono-)+N –> N                             单音节 dan-[yin-jie]: mono-syllable
                                               音节: yin (sound), jie (segment)
shuang (bi-)+N –> N                            双音节 shuang-[yin-jie]: bi-syllable
duo (multi-)+N –> N                            多音节 duo-[yin-jie]: multi-syllable
fei (non-)+N/A –> A                            非谓 fei-wei: non-predicate
                                               非正式 fei-[zheng-shi]: non-official
xiang (each other)+Vt (mono-syllabic) –> Vi    相爱 xiang-ai: love each other
zi (self-)+Vt –> Vi                            自爱 zi-ai: self-love; zi-[xue-xi]: self-learning
qian (former, ex-)+N –> N                      前夫人 qian-[fu-ren]: ex-wife
                                               前总统 qian-[zong-tong]: former president

(6-11.)         Table for sample quasi-suffixes

suffixation                                    examples
N + shi (style) –> N                           美国式 [mei-guo]-shi: American-style
NUM/N + xing (model) –> N                      1980型 1980-xing: 1980 model
                                               IV型 IV-xing: Model IV
A/V + lü (rate) –> N                           准确率 [zhun-que]-lü: (percentage of) precision
NUM + liu (class) –> A                         一流 yi-liu: first class
                                               三流 san-liu: third class
N + mang (‘blind’, person who has little       法盲 fa-mang: person who has no knowledge of law
knowledge of) –> N                             计算机盲 [ji-suan-ji]-mang: computer-layman

Comparing the above quasi-affixes with the few widely acknowledged affixes like 性 -xing (-ness) and 第 di- (-th), it is fairly easy to observe that the property generalized in Section 6.1 is shared by both.  That is, in all cases the affix or quasi-affix expects a sign of category x, with possible additional constraints, either on its right or on its left, to form a derived word of category y (where y may be equal to x).  For example, the quasi-prefix 自 zi- (self-) expects a transitive verb to produce an intransitive verb.  This property supports the following two points of view:  (i) the affix or quasi-affix is the selecting head of the combination;  (ii) both types of combination (affixation) should be properly contained in morphology since the output is always a word (a derivative).

In terms of differences, it is observed that quasi-affixes and other affixes show different degrees of functionalization of meaning.  For example, the nominalizer 性 -xing (‑ness) seems to be semantically more functionalized than the quasi-suffix 盲 -mang (‘blind’, person who has little knowledge of).  In the case of 性 -xing (-ness), there is believed to be little semantic contribution from the affix.  But in cases of affixation by quasi-affixes, the semantic contribution of the affix is non-trivial, and it must be ensured that proper semantics is built compositionally from both the stem and the affix.

Except for the different degrees of semantic abstractness, no essential grammatical difference is observed between quasi-affixes and the few widely accepted affixes.  As the semantic variation can easily be accommodated in the lexicon, nothing needs to change in the general approach to Chinese derivation described before.  The text below demonstrates how the quasi-affix phenomena are handled in CPSG95, using a sample quasi-affix to show the derivation.

The quasi-prefix to examine is 相 xiang- (each other).  It is used before a mono-syllabic transitive verb, making it an intransitive verb: 相 xiang- + Vt (monosyllabic) ‑‑> Vi.  More precisely, the syntactic object of the transitive verb is morphologically satisfied so that the derivative becomes an intransitive verb.

Unlike the original verb, the verb derived via xiang-prefixation requires a plural subject, as shown in (6-12).  This is a linguistically interesting phenomenon.  In a sense, it is a version of subject-predicate agreement in Chinese.

(6-12.) (a)    他们相爱过。
ta-men         xiang-         ai       guo
they            each-other   love    GUO
They used to love each other.

(b)      他爱过。
ta       ai       guo
he      love    GUO.
He used to love (someone).

(c) *   他相爱过。
ta       xiang-         ai       guo
he      each-other   love    GUO.

This number agreement can help decode the plural semantics of the subject noun as shown in the first sentence (6-13a) in the following group.  Sentence (6-13a) illustrates a common, number-underspecified case where the NP has no plural marker.  This contrasts with (6-13b) which includes a plural marker 们 men (-s), and with (6-13c) which resorts to the use of a numeral-classifier construction.

(6-13.) (a)     孩子相爱了。
hai-zi           xiang-         ai       le
child           each-other   love    LE
The children have fallen in love with each other.

(b)      孩子们相爱了。
hai-zi men   xiang-         ai       le
child  PLU   each-other   love    LE
The children have fallen in love with each other.

(c)      两个孩子相爱了。
liang ge      hai-zi           xiang-         ai       le
two    CLA   child           each-other   love    LE
The two children have fallen in love with each other.

Following the practice for number agreement in HPSG, the agreement can be captured by enforcing an additional plural constraint on the subject expectation [SUBJ | SIGN | CONTENT | INDEX | NUMBER plural], as shown in the formulation of the lexical entry for 相 xiang- (each other) in (6-14) below.


As shown above, the affixation also necessitates a corresponding modification of the semantics in the argument structure:  the first argument is equal to the second via index [2].[6]  Note that the notation [ ], or more accurately the most general feature structure, is used as a place-holder.  For example, HANZI <[ ]> stands for the constraint that the sign consist of a single hanzi.  Another thing worth noting is that the derivative requires that a subject appear before it;  in other words, the subject expectation becomes obligatory.  This is based on the fact that this derived verb cannot stand by itself in syntax, unlike most underived verbs in Chinese, say 爱 ai (love), whose subject expectation is optional.
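
The number agreement enforced by the plural-subject constraint can be pictured with a toy unification check.  This is a hedged sketch only; the function names and the treatment of NUMBER as a small sort (underspecified `"number"` above `"singular"` and `"plural"`) are illustrative assumptions, not the actual CPSG95 encoding.

```python
# Illustrative sketch (not CPSG95 code) of the plural-subject agreement
# enforced by the xiang- derivative.  NUMBER values form a tiny sort
# hierarchy: 'number' (underspecified) subsumes 'singular' and 'plural';
# unification keeps the more specific value, or fails on a clash.

def unify_number(a, b):
    if a == b or a == "number":
        return b
    if b == "number":
        return a
    return None                     # e.g. singular vs plural: clash

def xiang_derivative_subject_ok(subject):
    """xiang-V requires its (obligatory) subject to unify with plural."""
    return unify_number(subject["NUMBER"], "plural") is not None

ta     = {"pinyin": "ta",     "NUMBER": "singular"}   # 'he'
ta_men = {"pinyin": "ta-men", "NUMBER": "plural"}     # 'they'
hai_zi = {"pinyin": "hai-zi", "NUMBER": "number"}     # 'child(ren)', underspecified
```

On this sketch, 他们 ta-men and number-underspecified 孩子 hai-zi both satisfy the constraint (the latter thereby being decoded as plural, as in (6-13a)), while singular 他 ta fails, as in (6-12c).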

With the lexical entries for the quasi-affixes taking care of the differences in the building of semantics, there is no need for any modification of the CPSG95 PS rules.  For example, the prefix 相 xiang- (each other) and the verb 爱 ai (love) formulated in (6-15) will be combined into the derivative 相爱 xiang-ai (love each other) shown in (6-16) via the Prefix PS Rule.


In summary, the proposed approach to Chinese derivation is effective in handling quasi-affixes as well.  The general grammar rules for derivation remain unchanged while lexical constraints are accommodated in the lexicon.  This demonstrates the advantages of the lexicalized design for grammar development.

6.5. Suffix 者 zhe (-er)

This section analyzes zhe-suffixation, a highly challenging case at the interface between morphology and syntax.  This is believed to be an unsolvable problem as long as a system is based on the sequential processing of derivation morphology and syntax.  The solution proposed in this section is based on the argument that this suffixation is a combination of VP + -zhe.

The suffix 者 zhe (-er, person) is a very productive bound morpheme.   It is often compared to the English suffix ‑er or ‑or, as seen in the pairs in (6-17).

(6-17.) 工作 gong-zuo (work)      工作者 [gong-zuo]-zhe (work‑er)
劳动 lao-dong (labor)       劳动者 [lao-dong]-zhe (labor-er)
学习 xue-xi (learn)           学习者 [xue-xi]-zhe (learn-er)

But 者 ‑zhe is not an ordinary suffix;  it belongs to the category of so-called ‘phrasal affixes’,[7] with characteristics very different from its English counterparts.  Although the output of zhe-suffixation is a word, the input is a VP, not a lexical V.  In other words, it combines with a VP and produces a lexical N:  VP + -zhe –> N.  The arguments presented below support this analysis.

The first step is to demonstrate the word status of the zhe-derivative.  This is fairly straightforward:  there are no observed facts showing that the zhe-derivative differs from other lexical nouns in syntactic distribution.  For example, like other lexical nouns, the derivative can combine with an optional classifier construction to form a noun phrase.  Compare the pairs of examples in (6-18) and (6-19).

(6-18.) (a)    两名违反这项规定者
liang  ming [[wei-fan      zhe    xiang gui-ding]     -zhe]
two    CLA   violate         this    CLA   regulation   -er
two persons who have violated this regulation

(b)    两名学生
liang  ming xue-sheng
two    CLA   student
two students

(6-19.) (a)    他是一位优秀工作者
ta       shi     yi       wei    you-xiu        [[gong-zuo]   -zhe]
he      be      one    CLA   excellent      work           -er
He is an excellent worker.

(b)    他是一位优秀工人。
ta       shi     yi       wei    you-xiu        gong-ren
he      be      one    CLA   excellent      worker
He is an excellent worker.

The next step is to demonstrate the phrasal nature of the ‘stem’.[8]  The stem is judged to be a VP because it can be freely expanded by syntactic complements or modifiers without changing the morphological relationship between the stem and the suffix, as shown in (6‑20) below.  (6-20a) involves a modifier (努力 nu-li) before the head verb.  The stem in (6-20b) and (6-20c) is a transitive VP consisting of a verb and an NP object.

(6-20.) (a)    努力工作者
[nu-li  gong-zuo]     -zhe
hard  work           ‑er
hard-worker, person who works hard

(b)      学习鲁迅者
[xue-xi         Lu Xun]       -zhe
learn           Lu Xun       -er
person who is learning from Lu Xun

(c)      违反这项规定者
[wei-fan       zhe    xiang           gui-ding]      -zhe
violate         this    CLA   regulation   -er
person who violates this rule

More examples with the head verb 雇 gu (employ) are given in (6-21), the last two involving a passivized VP.

(6-21.)(a)    雇者
[gu]             -zhe
employ         -er
those who employ, employer

(b)      雇人者
[gu               ren]             -zhe
employ        person         -er
those who employ people, employer/recruiter

(c)      被雇者
[bei gu]                  -zhe
[be-employed]       -er
those who are employed

(d)      被人雇者
[bei    ren              gu]               -zhe
by      person         employ        -er
those who are employed by (other) people

In fact, the stem VP is semantically equivalent to a relative clause.  A Chinese relative clause is normally expressed in the form of a DE-phrase:  VP+de+N (Xue 1991).  In other words, 者 ‑zhe embodies the functions of two signs, an N (‘person’, by default) and the relative clause introducer de, something like English one that + VP (or person who + VP).[9]  Compare the two examples in (6-22) and (6-23), which have the same meaning;  the expression in (6-23) is more colloquial than that in (6-22), which uses the suffix 者 ‑zhe.

(6-22.) 违反规定者,处以罚款。
wei-fan        gui-ding       zhe,            chu-yi                   fa-kuan
violate         regulation   one that      punish-by   fine
Those who violate the regulations will be punished by fines.

(6-23.) 违反规定的人,处以罚款。
wei-fan        gui-ding       de      ren,             chu-yi          fa-kuan
violate         regulation   DE     person         punish-by   fine
Those who violate the regulations will be punished by fines.

On further examination, it is found that a VP with an attached aspect marker combines with the suffix 者 -zhe only with difficulty, as seen in the following examples.

(6-24.) (a)    违反规定者
wei-fan        gui-ding       zhe
violate         regulation   -er
Those who violate the regulations

(b) ?  违反了规定者
wei-fan        le       gui-ding       zhe
violate         LE     regulation   one that

This means that some further constraint may be necessary to prevent the grammar from producing strings like (6-24b).  If CPSG95 were used only for parsing, such a constraint would not be absolutely necessary, because such input is almost never seen in normal Chinese text.  But since CPSG95 is intended to be procedure-neutral, usable for both parsing and generation, the further constraint is desirable.

This constraint is in fact not an isolated phenomenon in Chinese grammar.  In syntax, the same constraint is commonly required when a VP is not in predicate position.[10]  For example, when a verb, say 喜欢 xi-huan (like), or a preposition, say 为了 wei-le (in order to), subcategorizes for a VP complement, it actually expects a VP with no aspect marker attached.  The following pair of sentences demonstrates this point.

(6-25.) (a)    我喜欢打篮球。
wo     xi-huan       da      lan-qiu.
I         like              play   basket-ball
I like playing basket-ball.

(b) * 我喜欢打了篮球。
wo     xi-huan       da      le       lan-qiu
I         like              play   LE     basket-ball

To accommodate this common constraint in both Chinese morphology and syntax, a binary feature [FINITE] is introduced for Chinese verbs in CPSG95.  In the lexicon, this feature is under-specified for each verb, i.e. [FINITE bin].  When an aspect marker 了/着/过 le/zhe/guo combines with the verb, this feature is unified to [FINITE plus].  The constraint [FINITE minus] can then be enforced in a morphological or syntactic expectation to prevent an aspected VP from appearing in a position that expects an unaspected, non-predicate VP.
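
The working of the [FINITE] device can be sketched as follows.  This is again an illustrative Python fragment, not CPSG95 code;  the value names bin/plus/minus follow the text, and everything else (function names, the VP representation) is assumed.

```python
# Sketch of the binary [FINITE] device (illustrative only).  Verbs start
# out underspecified as 'bin'; attaching an aspect marker unifies the
# value to 'plus'; positions such as the stem of -zhe or a VP complement
# demand 'minus', which no longer unifies with 'plus'.

def unify_finite(a, b):
    if a == b or a == "bin":
        return b
    if b == "bin":
        return a
    return None

def attach_aspect(vp):
    """Combining an aspect marker (le/zhe/guo) fixes FINITE to plus."""
    finite = unify_finite(vp["FINITE"], "plus")
    assert finite is not None
    return {**vp, "FINITE": finite, "aspect": True}

def fits_unaspected_slot(vp):
    """True iff the VP can fill a position requiring [FINITE minus]."""
    return unify_finite(vp["FINITE"], "minus") is not None

vp = {"head": "wei-fan", "FINITE": "bin"}        # 'violate ...', no aspect
vp_le = attach_aspect(vp)                        # 'violate-LE ...'
```

On this sketch, the bare VP of (6-24a) fills the -zhe stem slot, while the aspected VP of (6-24b) is blocked, as desired.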

Based on the above analysis, the lexical entry of the suffix 者 -zhe is formulated in (6-26).  Note the notation for a macro with a parameter (in parentheses):  @common_noun(名|位|个).  This macro represents the following information:  the derivative, like any other common noun, inherits the common properties;  it can combine with an optional classifier construction using the classifier 名 ming, 位 wei or 个 ge.[11]


As seen, the VP expectation is realized by the macro constraint @vp.  The semantics of the derivative is [np_semantics], an instance of -er restricted by the event of the VP, represented by [2].  The index [1] ensures that whatever is expected as the subject of the VP, which has no chance of being satisfied syntactically in this case, is semantically identical to this noun.[12]  In other words, this derived noun semantically fills the argument slot held by the subject in the VP semantics [v_content].  In the active case, say 雇人者 [gu ren]–zhe (‘person who employs people’), the subject is the first argument, i.e. the index of this noun is the logical subject of employ.  When the VP is passivized, however, say 被人雇者 [bei ren gu]‑zhe (‘person who is employed by other people’), the subject expected by the VP fills the second argument, i.e. the noun in this case is the logical object of the VP.  This is believed to be the desired result for the semantic composition of zhe-derivation.
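
The index linking just described can be pictured schematically.  In this hypothetical fragment (not CPSG95 code), each VP simply records which argument slot its unsatisfied subject would fill, and zhe-derivation identifies that slot with the index of the derived noun;  all names are illustrative assumptions.

```python
# Sketch of the semantic composition of VP + -zhe: the derived noun's
# index is identified with whatever slot the VP's (never-realized)
# subject would fill -- ARG1 for an active VP, ARG2 for a passivized one.

def zhe_derive(vp):
    noun_index = "x"                         # index [1] of the derivative
    args = dict(vp["args"])
    args[vp["subj_fills"]] = noun_index      # subject slot, never realized
    return {"cat": "noun", "index": noun_index,
            "restriction": {"rel": vp["rel"], **args}}

# gu ren 'employ people' (active): subject would fill ARG1.
gu_ren = {"rel": "employ", "args": {"ARG2": "people"}, "subj_fills": "ARG1"}
# bei ren gu 'be employed by people' (passive): subject would fill ARG2.
bei_ren_gu = {"rel": "employ", "args": {"ARG1": "people"}, "subj_fills": "ARG2"}

employer = zhe_derive(gu_ren)        # [gu ren]-zhe  'one who employs people'
employee = zhe_derive(bei_ren_gu)    # [bei ren gu]-zhe 'one employed by people'
```

The active derivative thus denotes the logical subject of employ and the passive one its logical object, as argued above.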

With the lexical expectation of the suffix as the basis, the general Suffix PS Rule is ready to work.  Recall that neither of the derivation rules, formulated in (6-4) and (6-6) before, restricts the input stem:  in CPSG95 this is not part of the general grammar but a lexical property of the head affix.  It is up to the affix to decide what constraints (category, wordhood status, semantic constraints, etc.) to impose on the expected stem.  In most cases of derivation the stem is a word, but here we have the intricate case where the suffix -zhe (-er) expects a verb phrase.  What holds for all cases of derivation is that, regardless of the input, the output of derivation (as of any other type of morphology) is always a word.

Before demonstrating by examples how zhe-derivation is implemented, there is a need to address the configurational constraints of CPSG95.  This is an important factor in realizing the flexible interaction between morphology and syntax as required in this case.

In all HPSG-style grammars, some type of configurational constraint is in place to ensure the proper order of rule application.  A typical example is that the subject rule should apply after the object rule.  This is implemented in CPSG95 by requiring in the subject PS rule that the head daughter be a phrase, and in the object PS rule that the subject of the head daughter not yet be satisfied.[13]

Since derivational morphology and syntax are designed in the same framework in CPSG95, constraints are also called for to ensure the order of rule application between morphological and syntactic PS rules.  In general, morphological rules apply before syntactic rules.  However, if this constraint were made absolute, so that all morphological rules had to apply before all syntactic rules, we would in effect make morphology and syntax two independent, successive modules, just as in traditional systems.  The grammar would then lose the power of flexible interaction between morphology and syntax and could not handle cases like zhe-derivation.  This is not a problem in CPSG95.

The proposed constraint regulating the order of application between morphological and syntactic PS rules is as follows:  only when a sign has both an obligatory morphological expectation and a syntactic expectation does CPSG95 enforce that the morphological rule apply first.  For example, as formulated in (6-14) before, the sign 相 xiang- (each other) has both a morphological expectation in [PREFIXING], as a bound morpheme, and a syntactic expectation for the subject in [SUBJ], as (head of) the derivative.  If the input string is 他们相爱 ta-men (they) xiang- (each other) ai (love), the prefix rule will first combine 相 xiang- and the stem 爱 ai (love) before the subject rule can apply.  The result is the expected structure embodying both the morphological and the syntactic analysis:  [ta-men [xiang- ai]].  This constraint is implemented by specifying in all syntactic PS rules that the head daughter may not have an obligatory morphological expectation yet to be satisfied.  It effectively prevents a bound morpheme from being used as a constituent in syntax.  It should be emphasized that this constraint in the general grammar does not prohibit a bound morpheme from combining with any type of sign;  such constraints are decided only lexically, in the expectation feature of the affix.
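
The ordering constraint can be pictured with a minimal sketch in which a syntactic rule inspects the head daughter for an unsatisfied morphological expectation.  The rule functions and feature names below are hypothetical simplifications of the CPSG95 PS rules, not the actual implementation.

```python
# Sketch (illustrative only) of the rule-ordering constraint: a syntactic
# PS rule refuses a head daughter that still carries an unsatisfied
# obligatory morphological expectation, so the prefix rule must apply to
# xiang- before the subject rule can.

def prefix_rule(affix, stem):
    """Morphological rule: discharge the PREFIXING expectation."""
    assert affix.get("PREFIXING") is not None
    return {**affix, "PREFIXING": None, "form": affix["form"] + stem["form"]}

def subject_rule(subj, head):
    """Syntactic rule: blocked while the head is still a bound morpheme."""
    if head.get("PREFIXING") is not None:
        return None
    return {"form": [subj["form"], head["form"]], "cat": "S"}

xiang = {"form": "xiang-", "PREFIXING": "Vt", "SUBJ": "plural NP"}
ai = {"form": "ai"}
ta_men = {"form": "ta-men"}
```

Applying the rules to the string ta-men xiang- ai, the subject rule fails on bare xiang- but succeeds once the prefix rule has built xiang-ai, yielding the structure [ta-men [xiang- ai]].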

The following text shows, step by step, the CPSG95 solution to the problem of zhe-derivation.  The chosen example is the derivation of the derived noun 违反规定者 [[wei-fan gui-ding]-zhe]  ‘persons violating (the) regulation’.  The lexical sign of the suffix 者 -zhe (-er) has already been formulated in (6-26).  The entries for 违反 wei-fan (violate) and 规定 gui-ding (regulation) in the CPSG95 lexicon are shown in (6-27) and (6-28) respectively.


Note that all common nouns, specified as @common_noun in the lexicon, carry the INDEX features [PERSON 3, NUMBER number], i.e. third person with unspecified number.  As for the feature [GENDER], each noun encodes one of [male], [female], [have_gender], [no_gender], or leaves it unspecified as [gender].  The corresponding sort hierarchy is:  [gender] consists of the sub-sorts [no_gender] and [have_gender], and [have_gender] is sub-typed into [male] and [female].  Naturally, 规定 gui-ding (regulation) is lexically specified as [GENDER no_gender].
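
The sort hierarchy just described lends itself to a small subsumption-based unification sketch.  The code below is illustrative only;  it hard-codes the [gender] hierarchy from the text and assumes that unifying two sorts returns the more specific one when they are comparable.

```python
# Sketch of the [GENDER] sort hierarchy described above (illustrative):
# gender > {no_gender, have_gender}, have_gender > {male, female}.
PARENT = {"no_gender": "gender", "have_gender": "gender",
          "male": "have_gender", "female": "have_gender", "gender": None}

def subsumes(general, specific):
    """True iff `general` dominates (or equals) `specific`."""
    while specific is not None:
        if specific == general:
            return True
        specific = PARENT[specific]
    return False

def unify_sort(a, b):
    """Unification of two sorts = the more specific one, if comparable."""
    if subsumes(a, b):
        return b
    if subsumes(b, a):
        return a
    return None                      # e.g. no_gender vs male: clash
```

A noun underspecified as [gender] thus unifies with any gender requirement, while [no_gender] nouns like 规定 gui-ding clash with [male] or [female].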

The VP built by the object PS rule in the CPSG95 syntax is shown in (6-29).  As seen, the building of the semantics follows the practice in HPSG, with the argument slots filled by the [INDEX] features of the subject and object.  In this VP, [ARG2] has been realized.

The VP result in (6-29) and the suffix 者 –zhe will combine into the expected derived noun via the Suffix PS Rule, as shown in (6-30).


To summarize, it is the integrated model of derivational morphology and syntax in CPSG95 that makes the above analysis implementable.  Without the integration, there is no way a suffix could be allowed to expect a phrasal stem.[14]  The lexicalist approach adopted in CPSG95 makes it easy to capture the individual phrase-expectation property of the few affixes like 者 –zhe.  This enables the general PS rules for derivation in CPSG95 to apply to both typical and special cases of affixation.

6.6. Summary

This chapter has investigated some representative phenomena of Chinese derivation and their interface to syntax.  The solutions to these problems have been presented based on the arguments for the analysis.

The key to a lexicalized treatment of Chinese derivation is to determine the structural and semantic properties of the derivative and to impose proper constraints on the expected stem.  The constraints on the expected stem are lexically specified in the corresponding morphological expectation feature structure of the affix.  The properties of the derivative are also lexically encoded in the affix, seen as head of the derivational structure in the CPSG95 analysis.  This property information is percolated up when the derivation rules are applied.  These rules ensure that the output of derivation is a word.  It has been shown that this approach applies equally well to derivation via ‘quasi-affixes’ and to the tough case of zhe-suffixation.



[1] Some linguists (e.g. Li and Thompson 1981) hold the view that Chinese has only a few affixes;  others (e.g. Chao 1968) believe that the inventory of Chinese affixes should be extended to include quasi-affixes.  Interestingly, the sign lei (quasi-, original sense ‘class’) itself is a quasi-prefix in Chinese.  Phenomena similar to Chinese quasi-affixes, called ‘semi-affixes’ or ‘Affixoide’, also exist in German morphology (Riehemann 1998).

[2] This is similar to the practice in many grammars, including HPSG, where a functional sign such as a preposition is the selecting head of the corresponding syntactic structure, namely the Prepositional Phrase.

[3] Those affixes which are not or no longer productive, e.g. lao‑ (original meaning ‘old’) in lao‑hu (tiger) and lao‑shu (mouse),  are not a problem.  The corresponding derived words are simply listed in the CPSG95 lexicon.

[4] The CPSG95 phrase-structural approach to Chinese productive derivation was inspired by the implementation in HPSG of a word-syntactic approach in Krieger (1994).  Similar practice is also seen in Selkirk (1982), Riehemann (1993) and Kathol (1999), in an effort to explore alternatives to the lexical rule approach to morphology.

[5] The major common property is reflected in two aspects, formulated in the macro definition of uncountable_noun in CPSG95.  First, there is a value setting for the [NUMBER] feature, i.e. [CONTENT|INDEX|NUMBER no_number].  The CPSG95 sort hierarchy for the type [number] is defined as {a_number, no_number}, where [a_number] is further sub-typed into {singular, plural}.  [NUMBER no_number] applies to uncountable nouns while [NUMBER a_number] is used for countable nouns whose plurality is yet to be decided (i.e. under-specified for plurality).  Second, based on the syntactic difference between Chinese countable nouns and uncountable nouns, the classifier expected by uncountable nouns is exclusively zhong (kind/sort of).  That is, uncountable nouns may only combine with a preceding classifier construction using the classifier zhong.
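The [number] sort hierarchy described in this footnote can be encoded straightforwardly; the sketch below is illustrative only, with a hypothetical subsumption check rather than the actual CPSG95 type system.

```python
# Illustrative encoding of the [number] sort hierarchy:
# number = {a_number, no_number}; a_number = {singular, plural}.
HIERARCHY = {
    "number": {"a_number", "no_number"},
    "a_number": {"singular", "plural"},
}

def subsumes(general, specific):
    """True if `specific` equals `general` or lies below it in the hierarchy."""
    if general == specific:
        return True
    return any(subsumes(child, specific)
               for child in HIERARCHY.get(general, ()))

# A countable noun is [NUMBER a_number], underspecified between singular
# and plural; an uncountable noun is [NUMBER no_number] and is therefore
# incompatible with either plurality value.
assert subsumes("a_number", "plural")
assert not subsumes("no_number", "singular")
```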

[6] For the time being, the subtle difference in semantics between pairs like We love ourselves and We love each other is not represented in the content.  Reflecting this nuance requires a more elaborate system of semantics, which is left for future research.

[7] Some linguists (e.g. Z. Lu 1957; Lü et al 1980; Lü 1989; Dai 1993) have briefly introduced the notion of ‘phrasal affix’ in Chinese.  Lü further indicates that these ‘phrasal affixes’ are a distinctive characteristic of the Chinese grammar.

[8] The English possessive morpheme -’s is arguably a suffix which expects an NP instead of a lexical noun as its stem:  NP + -’s.  Unlike VP + -zhe, the result of the NP + -’s combination is generally regarded as a phrase, not a word.  In this sense, -’s seems to be closer to a functional word, similar to a preposition or postposition, than to a suffix.

[9] Chinese zhe-suffixation is somewhat like the English what-clause (as in ‘what he likes is not what interests her’).  ‘What’ in this use also embodies the functions of two signs, that and which.  However, the English what-clause functions as an NP, while VP+zhe forms a lexical N.

[10] It is generally agreed in the circle of Chinese grammar research that Chinese predicate (or finite) verbs show aspect distinctions, marked by the use or absence of aspect markers.  This is in contrast to English, where both finite and non-finite verbs show aspect distinctions but only finite verbs are tensed.

[11] It is generally agreed that each Chinese common noun may only combine with a classifier construction using a specific set of classifiers.  This classifier specification is generally regarded as lexical, idiosyncratic information of nouns (Lü et al 1980).  Using the macro with the classifier parameter follows this general idea.  It is worth noting that the lexical formulation for -zhe (-er) in CPSG95 does not rely on any specific NP analysis chosen in syntax, except that the classifier specification should be placed in the entry for nouns (or derived nouns).

[12] The proposal for building the semantics of the zhe-derivative is based on ideas similar to the assumption adopted for complement control in HPSG that ‘the fundamental mechanism of control was coindexing between the unexpressed subject of an unsaturated complement and its controller’ (Pollard and Sag 1994:282).

[13] If the object expectation is obligatory, this constraint ensures the priority of the object rule over the subject rule in application, building the desirable structure [S [V O]] instead of [[S V] O].  This is because a verb with an obligatory object yet to be satisfied is by definition not a phrase.  If the object expectation is optional, the order of rule application is still in effect, although the lexical V in this scenario does not violate the phrase definition.  There are two cases.  In case one, the object O occurs in the input string.  The subject PS rule will tentatively combine S and V, but the parse can go no further:  the object rule cannot apply after the subject rule, due to the constraint in the object rule that the head cannot have a satisfied subject.  The successful parse will only build the expected structure [S [V O]].  In case two, the object O does not appear in the input string.  Then the tentative combination [S V] built by the subject rule becomes the final parse.

[14] For example, if the lexical rule approach were adopted for derivation, this problem could not be solved.





PhD Thesis: Chapter V Chinese Separable Verbs


5.0. Introduction

This chapter investigates the phenomena usually referred to as separable verbs (离合动词 lihe dongci) in the form V+X.  Separable verbs constitute a significant portion of Chinese verb vocabulary.[1]  These idiomatic combinations seem to show dual status (Z. Lu 1957; L. Li 1990).  When V+X is not separated, it is like an ordinary verb.   When V is separated from X, it seems to be more like a phrasal combination.  The co-existence of both the separated use and contiguous use for these constructions is recognized as a long-standing problem at the interface of Chinese morphology and syntax (L. Wang 1955;  Z. Lu 1957; Chao 1968; Lü 1989; Lin 1983;  Q. Li 1983; L. Li 1990; Shi 1992; Dai 1993; Zhao and Zhang 1996).

Some linguists (e.g. L. Li 1990; Zhao and Zhang 1996) have made efforts to classify different types of separable verbs and demonstrated different linguistic facts about these types.  There are two major types of separable verbs:  V+N idioms with the verb-object relation, and V+A/V idioms with the verb-modifier relation, where X is an A or a non-conjunctive V.[2]

The V+N idiom is a typical case demonstrating the mismatch between a vocabulary word and a grammar word.  There have been three different views on whether V+N idioms are words or phrases in Chinese grammar.

Given the fact that the V and the N can be separated in usage, the most popular view (e.g. Z. Lu 1957; L. Li 1990; Shi 1992) is that they are words when V+N are contiguous and phrases otherwise.  This analysis fails to account for the link between the separated use and the contiguous use of the idioms.  For V+N idioms like 洗澡 xi zao (wash-bath: take a bath), this analysis also fails to explain why contiguous V+N idioms listed in the lexicon should receive a structural analysis different from that given to equally contiguous but non-listable combinations of V and N (e.g. 洗碗 xi wan ‘wash dishes’).[3]  As will be shown in Section 5.1, the structural distribution for this type of V+N idiom and the distribution for the corresponding non-listable combinations are identical.

Other grammarians argue that V+N idioms are not phrases (Lin 1983;  Q. Li 1983; Zhao and Zhang 1996), insisting that they are words, or a special type of word.  This argument cannot account for the demonstrated variety of separated uses.

There are scholars (e.g. Lü 1989; Dai 1993) who indicate that idioms like 洗澡 xi zao are phrases.  Their judgment is based on the linguistic variations demonstrated by such idioms.  But they have not given detailed formal analyses accounting for the difference in semantic compositionality between these V+N idioms and non-listable V+NP constructions.  That seems to be the major reason why this insightful argument has not convinced people with different views.

As for V+A/V idioms, Lü (1989) offers a theory that these idioms are words and that the insertable signs between V and A/V are Chinese infixes.  This is an insightful hypothesis.  But as with the analyses proposed for V+N idioms, no formal solution based on this analysis has been offered in the context of phrase structure grammars.  As a general goal, a good solution should not only be implementable, but also offer an analysis which captures the linguistic link, both structural and semantic, between the separated use and the contiguous use of separable verbs.  The analyses reported in the literature still fall short of this goal of formally capturing the linguistic generality.

Three types of V+X idioms can be classified by their different degrees of ‘separability’ between V and X, explored in the three major sections of this chapter.  Section 5.1 studies the first type of V+N idioms, like 洗澡 xi zao (wash-bath: take a bath).  These idioms are freely separable; this is a relatively easy case.  Section 5.2 investigates the second type of V+N idioms, represented by 伤心 shang xin (hurt-heart: sad or heartbroken).  These idioms are less separable.  This category constitutes the largest part of the V+N phenomena and is a more difficult, borderline case.  Section 5.3 studies the V+A/V idioms.  These idioms are the least separable:  only the two modal signs 得 de3 (can) and 不 bu (cannot) can be inserted inside them, and nothing else.  For each of these problems, arguments for the wordhood judgment will be presented first.  A corresponding morphological or syntactic analysis will then be proposed, together with the formulation of the solution in CPSG95 based on the given analysis.

5.1. Verb-object Idioms: V+N I

The purpose of this section is to analyze the first type of V+N idioms, represented by 洗澡 xi zao (wash‑bath: take a bath).  The basic arguments to be presented are that they are verb phrases in Chinese syntax and the relationship between the V and the N is syntactic.  Based on these arguments, formal solutions to the problems involved in this construction will be presented.

The idioms like 洗澡 xi zao are classified as V+N I, to be distinguished from another type of idioms V+N II (see 5.2).  The following is a sample list of this type of idioms.

(5-1.) V+N I: xi zao type

洗澡 xi (wash) zao (bath #)              take a bath
擦澡 ca (scrub) zao (bath #)             clean one’s body by scrubbing
吃亏 chi (eat) kui (loss #)                   get the worst
走路 zou (go) lu (way $)                      walk
吃饭 chi (eat) fan (rice $)                    have a meal
睡觉 shui (V:sleep) jiao (N:sleep #)   sleep
做梦 zuo (make) meng (N:dream)     dream (a dream)
吵架  chao (quarrel) jia (N:fight #)    quarrel (or have a row)
打仗 da (beat) zhang (battle)              fight a battle
上当 shang (get) dang (cheating #)                be taken in
拆台 chai (pull down) tai (platform #)          pull away a prop
见面 jian (see) mian (face #)                            meet (face to face)
磕头 ke (knock) tou (head)                              kowtow
带头 dai (lead) tou (head $)                            take the lead
帮忙 bang (help) mang (business #)              give a hand
告状 gao (sue) zhuang (complaint #)            lodge a complaint

Note: Many nouns (marked with # or $) in this type of construction cannot be used independently of the corresponding V.[4]  But those marked with $ have no such restriction in their literal sense.  For example, when the sign fan means ‘meal’, as it does in the idiom, it cannot be used in a context other than the idiom chi-fan (have a meal).  Only when it stands for the literal meaning ‘rice’ can it occur without chi.

There is ample evidence for the phrasal status of combinations like 洗澡 xi zao.  The evidence is of three types.  The first comes from the free insertion of some syntactic constituent X into the idioms, in the form V+X+N:  this involves the keyword-based judgment patterns and other X‑insertion tests proposed in Chapter IV.  The second type of evidence resorts to syntactic processes available to the transitive VP, namely passivization and long-distance topicalization.  The V+N I idioms can be topicalized and passivized in the same way as ordinary transitive VP structures.  The last piece of evidence comes from the reduplication process associated with this type of idiom.  All the evidence leads to the conclusion that V+N I idioms are syntactic in nature.

The first evidence comes from using the wordhood judgment pattern: V(X)+zhe/guo → word(X).  It is a well observed syntactic fact that Chinese aspectual markers appear right after a lexical verb (and before the direct object).  If 洗澡 xi zao were a lexical verb, the aspectual markers would appear after the combination, not inside it.  But that is not the case, as shown by the ungrammaticality of the example in (5-2b).  A productive transitive VP example is given in (5-3) to show its syntactic parallelism with V+N I idioms.

(5-2.) (a)      他正在洗着澡
ta       zheng-zai    xi      zhe    zao.
he      right-now    wash ZHE   bath
He is taking a bath right now.

(b) *   他正在洗澡着。
ta       zheng-zai    xizao         zhe.
he      right-now    wash-bath   ZHE

(5-3.) (a)      他正在洗着衣服。
ta       zheng-zai    xi      zhe    yi-fu.
he      right-now    wash ZHE   clothes
He is washing the clothes right now.

(b) *   他正在洗衣服着。
ta       zheng-zai    xi      yi-fu           zhe.
he      right-now    wash clothes        ZHE

The above examples show that the aspectual marker 着 zhe (ZHE) is inserted inside the V+N idiom, just as it is in an ordinary transitive VP structure.

Further evidence for X-insertion is given below.  It comes from the post-verbal modifier of ‘action-times’ (动量补语 dongliang buyu) like ‘once’, ‘twice’, etc.  In Chinese, action-times modifiers appear after the lexical verb and the aspectual marker (but before the object), as shown in (5-4a) and (5-5a).

(5-4.) (a)      他洗了两次澡。
ta       xi      le       liang  ci       zao.
he      wash LE     two    time   bath
He has taken a bath twice.

(b) *   他洗澡了两次。
ta       xizao         le       liang  ci.
he      wash-bath   LE     two    time

(5-5.) (a)      他洗了两次衣服。
ta       xi      le       liang  ci       yi-fu.
he      wash LE     two    time   clothes
He has washed the clothes twice.

(b) *   他洗衣服了两次。
ta       xi      yi-fu           le       liang  ci.
he      wash clothes        LE     two    time

So far, evidence has been provided of syntactic constituents which are attached to the verb in the V+N I idioms.  To further argue for the VP status of the whole idiom, it will be demonstrated that the N in the V+N I idioms in fact fills the syntactic NP position in the same way as all other objects do in Chinese transitive VP structures.  In fact, N in the V+N I does not have to be a bare N:  it can be legitimately expanded to a full-fledged NP (although it does not normally do so).  A full-fledged NP in Chinese typically consists of a classifier phrase (and modifiers like de-construction) before the noun.  Compare the following pair of examples.  Just like an ordinary NP 一件崭新的衣服 yi jian zan-xin de yi-fu (one piece of brand-new clothes), 一个痛快的澡 yi ge tong-kuai de zao (a comfortable bath) is a full-fledged NP.

(5-6.)           他洗了一个痛快的澡。
ta       xi      le       yi       ge      tong-kuai     de      zao.
he      wash LE     one    CLA   comfortable DE     bath
He has taken a comfortable bath.

(5-7.)           他洗了一件崭新的衣服。
ta       xi      le       yi       jian    zan-xin        de      yi-fu.
he      wash LE     one    CLA   brand-new  DE     clothes
He has washed one piece of brand-new clothes.

Note that the above evidence argues directly against the widespread view that signs like 澡 zao, marked with # in (5-1), are ‘bound morphemes’ or ‘bound stems’ (e.g. L. Li 1990; Zhao and Zhang 1996).  As shown, like every other free morpheme noun (e.g. yi-fu), zao fills a lexical position in the typical Chinese NP sequence ‘determiner + classifier + (de-construction) + N’, e.g. 一个澡 yi ge zao (a bath), 一个痛快的澡 yi ge tong-kuai de zao (a comfortable bath).[5]  In fact, once the ‘V+N I phrase’ arguments are accepted (further evidence to come), ‘bound morpheme’ is by definition a misnomer for 澡 zao.  As a part of morphology, a bound morpheme cannot play a syntactic role:  it is inside a word and cannot be seen in syntax.  The analysis of 洗 xi (…) 澡 zao as a phrase entails the syntactic roles played by 澡 zao:  (i) 澡 zao is a free morpheme noun which fills the lexical position of the final N inside the possibly full-fledged NP;  (ii) 澡 zao plays the object role in the syntactic transitive structure 洗澡 xi zao.

The bound morpheme view has been used to argue that the relevant V+N idioms are words rather than phrases (e.g. L. Li 1990).  Further examination of this widely accepted view will help strengthen the counter-argument that all V+N I idioms are phrases.

Labeling signs like 澡 zao (bath) as bound morphemes seems to come from an inappropriate interpretation of the statement that bound morphemes cannot be ‘freely’, or ‘independently’, used in syntax.[6]  This interpretation equates the idiomatic co-occurrence constraint with ‘not being freely used’.  It is true that 澡 zao is not an ordinary noun to be used in isolation.  There is a co-occurrence constraint in effect:  澡 zao cannot be used without the appearance of 洗 xi (or 擦 ca).  However, the syntactic role played by 澡 zao, the object in the syntactic VP structure, has the full potential of being ‘freely’ used like any other Chinese NP object:  it can even be placed before the verb in long-distance constructions, as will be shown shortly.  A more proper interpretation of ‘not being freely used’ in defining bound morphemes is that a genuine bound morpheme, e.g. the suffix 性 -xing (-ness), has to attach to another sign contiguously to form a word.

A comparison with similar phenomena in English may be helpful.  English also has similar idiomatic VPs, such as kick the bucket.[7]  For the same reason, it cannot be concluded that bucket (or the bucket) is a bound morpheme only because it demonstrates necessary co-occurrence with the verb literal kick.  Signs like bucket and zao (bath) are not of the same nature as bound morphemes like -less, -ly, un-, -xing (-ness), etc.

The second type of evidence involves pattern variations of the V+N I idioms.  These variations are typical syntactic patterns for the transitive V+NP structure in Chinese.  One of the most frequently used patterns for transitive structures is the topical pattern of long-distance dependency.  This provides strong evidence for judging the V+N I idioms as syntactic rather than morphological, for, with the exception of clitics, morphological theories generally conceive of the parts of a word as being contiguous.[8]  Both the V+N I idiom and the normal V+NP structure can be topicalized, as shown in (5-8b) and (5-9b) below.

(5-8.) (a)      我认为他应该洗澡。
wo     ren-wei        ta       ying-gai       xi zao.
I         think           he      should        wash-bath
I think that he should take a bath.

(b)      澡我认为他应该洗
zao    wo     ren-wei        ta       ying-gai       xi.
bath  I         think           he      should        wash
The bath I think that he should take.

(5-9.) (a)       我认为他应该洗衣服。
wo     ren-wei        ta       ying-gai       xi      yi-fu.
I         think           he      should        wash clothes
I think that he should wash the clothes.

(b)      衣服我认为他应该洗。
yi-fu           wo     ren-wei        ta       ying-gai       xi.
clothes        I         think           he      should        wash
The clothes I think that he should wash.

The minimal pair of passive sentences in (5-10) and (5‑11) further demonstrates the syntactic nature of the V+N I structure.

(5-10.)         澡洗得很干净。
zao             xi      de3    hen    gan-jing.
bath            wash DE3   very   clean
A good bath was taken so that one was very clean.

(5-11.)         衣服洗得很干净。
yi-fu           xi      de3    hen    gan-jing.
clothes        wash DE3   very   clean
The clothes were washed clean.

The third type of evidence involves the nature of the reduplication associated with such idioms.  For idioms like 洗澡 xi zao (take a bath), the first sign can be reduplicated to denote the shortness of the action:  洗澡 xi zao (take a bath) → 洗洗澡 xi xi zao (take a short bath).  If 洗澡 xi zao were a word, 洗 xi would by definition be a morpheme inside the word, and 洗洗澡 xi-xi-zao would belong to morphological reduplication (the AB → AAB type).  However, this analysis fails to account for the generality of such reduplication:  it is a general rule of Chinese grammar that a verb reduplicates itself contiguously to denote the shortness of the action.  For example, 听音乐 ting (listen to) yin-yue (music) → 听听音乐 ting ting yin-yue (listen to music for a while); 休息 xiu-xi (rest) → 休息休息 xiu-xi xiu-xi (have a short rest), etc.  On the other hand, once we accept that 洗澡 xi zao is a verb-object phrase in syntax and accordingly judge this reduplication as syntactic,[9] we arrive at a satisfactory and unified account of all the related data.  As a result, only one reduplication rule is required in CPSG95 to capture the general phenomena;[10]  there is no need to do anything special for V+N idioms.
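The unified account amounts to a single reduplication rule over lexical verbs.  The sketch below is a hypothetical rendering (the actual CPSG95 rule operates on feature structures); it shows why no special AB → AAB rule is needed once 洗 xi is analyzed as a lexical V taking 澡 zao as its syntactic object.

```python
# Illustrative sketch of the single V -> VV reduplication rule
# (short-duration reading).  Because xi in 'xi zao' is a lexical V,
# the same rule yields 'xi xi' (+ zao in syntax), 'ting ting' (+ yin-yue),
# and 'xiu-xi xiu-xi' -- the whole lexical verb reduplicates contiguously.

def reduplicate(sign):
    """Apply V -> VV to a lexical verb sign; phrases are rejected."""
    if sign["CATEGORY"] != "v" or not sign["WORD"]:
        return None  # the rule applies to lexical verbs only, not to VPs
    return {**sign, "HANZI": sign["HANZI"] * 2, "SEMANTICS": "short_action"}

xi = {"HANZI": ["洗"], "CATEGORY": "v", "WORD": True}
assert reduplicate(xi)["HANZI"] == ["洗", "洗"]      # xi xi (zao follows in syntax)

xiuxi = {"HANZI": ["休", "息"], "CATEGORY": "v", "WORD": True}
assert reduplicate(xiuxi)["HANZI"] == ["休", "息", "休", "息"]  # xiu-xi xiu-xi

xi_zao_vp = {"HANZI": ["洗", "澡"], "CATEGORY": "v", "WORD": False}  # a VP
assert reduplicate(xi_zao_vp) is None  # phrases do not feed this rule
```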

This AB → AAB type of reduplication for the V+N idioms poses a big challenge to traditional word segmenters (Sun and Huang 1996).  Moreover, even when a word segmenter successfully incorporates some procedure to cope with this problem, essentially the same rule has to be repeated in the grammar for the general VV reduplication.  This is not desirable in terms of capturing the linguistic generality.

All the evidence presented above indicates that idioms like 洗澡 xi zao, whether V and N are used contiguously or not, are not words but phrases.  The idiomatic nature of such combinations seems to be the reason why most native speakers, including some linguists, regard them as words.  Lü (1989:113-114) suggests that vocabulary words like 洗澡 xi zao should be distinguished from grammar words.  He was one of the first Chinese grammarians to observe that the V+N relation in idioms like 洗澡 xi zao is a syntactic verb-object relation.  But he did not provide full arguments for his view, nor did he offer a precise, formalized analysis of this problem.[11]

As shown in the previous examples, the V+N I idioms do not differ from other transitive verb phrases in any major syntactic behavior.  However, due to their idiomatic nature, the V+N I idioms differ from ordinary transitive VPs in the following two major aspects.  These differences need to be kept in mind when formulating the grammar to capture the phenomena.

  • Semantics:  the semantics of the idiom should be given directly in the lexicon, not as a result of the computation of the semantics of the parts based on some general principle of compositionality.
  • Co-occurrence requirement:  洗 xi (or 擦 ca) and 澡 zao must co-occur with each other;  走 zou (go) and 路 lu (way) must co-occur; etc.  This is a requirement specific to the idioms at issue.  For example, 洗 xi and 澡 zao must co-occur in order to stand as an idiom to mean ‘take a bath’.

Based on the study above, the CPSG95 solution to this problem is described below.  In order to enforce the co-occurrence of the V+N I idioms, it is specified in the CPSG95 lexicon that the head V obligatorily expects as its object an NP headed by a specific literal.  This treatment originates from the practice of handling collocations in HPSG, where features are designed to enable subcategorization for particular words, or for phrases headed by particular words.  For example, the features [NFORM there] and [NFORM it] refer to the expletives there and it respectively, for the special treatment of existential constructions, cleft constructions, etc. (Pollard and Sag 1987:62).  The values of the feature PFORM distinguish individual prepositions like for, on, etc.;  they are used for phrasal verbs like rely on NP, look for NP, etc.  In CPSG95, this approach is generalized, as described below.

As presented before, the feature for orthography [HANZI] records the Chinese character string for each lexical sign.  When a specific lexical literal is required in an idiomatic expectation, the constraint is directly placed on the value of the feature [HANZI] of the expected sign, in addition to possible other constraints.  It is standard practice in a lexicalized grammar that the expected complement (object) for the transitive structure be coded directly in the entry of the head V in the lexicon.  Usually, the expected sign is just an ordinary NP.  In the idiomatic VP like 洗 xi (…) 澡 zao, one further constraint is placed:  the expected NP must be headed by the literal character 澡zao.  This treatment ensures that all pattern variations for transitive VP such as passive constructions, topicalized constructions, etc. in Chinese syntax will equally apply to the V+N I idioms.[12]

The difference in semantics is accommodated in the feature [CONTENT] of the head V with proper co-indexing.  In ordinary cases like 洗衣服 xi yi-fu (wash clothes), the argument structure is [vt_semantics] which requires two arguments, with the role [ARG2] filled by the semantics of the object NP.  In the idiomatic case 洗澡 xi zao (take a bath), the V and N form a semantic whole, coded as [RELN take_bath].[13]  The V+N I idioms are formulated like intransitive verbs in terms of composing the semantics – hence coded as [vi_semantics], with only one argument to be co-indexed with the subject NP.  Note that there are two lexical entries in the lexicon for the verb 洗 xi (wash), one for the ordinary use and the other for the idiom, shown in (5-12) and (5-13).
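The content of the two entries for 洗 xi can be pictured with a rough sketch.  The feature names and dictionary encoding below are hypothetical simplifications of the feature structures given in (5-12) and (5-13), not the CPSG95 notation itself.

```python
# Illustrative sketch of the two lexical entries for xi 'wash'.
# Entry 1: ordinary transitive use -- the object is any NP; vt_semantics
# fills [ARG2] with the semantics of the object NP.
xi_ordinary = {
    "HANZI": ["洗"],
    "CATEGORY": "v",
    "COMP1_RIGHT": {"CATEGORY": "n"},                    # any NP object
    "CONTENT": {"RELN": "wash", "SEM": "vt_semantics"},
}

# Entry 2: idiomatic use -- the expected object NP must be headed by the
# literal zao; the V and N form one semantic whole, composed like an
# intransitive (vi_semantics, one argument co-indexed with the subject).
xi_idiom = {
    "HANZI": ["洗"],
    "CATEGORY": "v",
    "COMP1_RIGHT": {"CATEGORY": "n", "HANZI": ["澡"]},   # literal constraint
    "CONTENT": {"RELN": "take_bath", "SEM": "vi_semantics"},
}

def object_expectation_met(verb, obj):
    """The object PS rule checks the lexical HANZI constraint, if any."""
    expectation = verb["COMP1_RIGHT"]
    return "HANZI" not in expectation or obj["HANZI"] == expectation["HANZI"]

zao = {"CATEGORY": "n", "HANZI": ["澡"]}
yifu = {"CATEGORY": "n", "HANZI": ["衣", "服"]}
assert object_expectation_met(xi_idiom, zao)        # xi zao: idiom licensed
assert not object_expectation_met(xi_idiom, yifu)   # *idiomatic xi yi-fu
assert object_expectation_met(xi_ordinary, yifu)    # ordinary xi yi-fu
```

Because the constraint is stated on the expected object rather than on a fused word, all transitive VP patterns (topicalization, passivization, etc.) remain available to the idiom, as the text argues.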


The above solution takes care of the syntactic similarity between the V+N I idioms and ordinary V+NP structures.  It is also detailed enough to address their major differences.  In addition, the associated reduplication process (i.e. V+N → V+V+N) is no longer a problem once this solution is adopted.  As the V in the V+N idioms is judged and coded as a lexical V (word) in this proposal, the reduplication rule which handles V → VV applies here equally.

5.2. Verb-object Idioms: V+N II

The purpose of this section is to provide an analysis of another type of V+N idiom and present the solution implemented in CPSG95 based on the analysis.

Examples like 洗澡 xi zao (take a bath) are in fact easy cases to judge.  There are more marginal cases.  When discussing Chinese verb-object idioms, L. Li (1990) and Shi (1992) indicate that the boundary between a word and a phrase in Chinese is far from clear-cut;  there is a remarkable “gray area” in between.  The examples in (5-14) are V+N II idioms, as classified by L. Li (1990), in contrast to the V+N I type.

(5-14.) V+N II: 伤心 shang xin type

伤心 shang (hurt) xin (heart)             sad or break one’s heart
担心 dan (carry) xin (heart)               worry
留神 liu (pay) shen (attention)           pay attention to
冒险 mao (take) xian (risk)                 take the risk
借光 jie (borrow) guang (light)           benefit from
劳驾 lao (bother) jia (vehicle)             beg the pardon
革命 ge (change) ming (life)                 make revolution
落后 luo (lag) hou (back)                      lag behind
放手 fang (release) shou (hand)          release one’s hold

Compared with V+N I (洗澡xi zao type), V+N II has more characteristics of a word.  The lists below given by L. Li (1990) contrast their respective characteristics.[14]

(5-15.) V+N I (based on L. Li 1990:115-116)

as a word

(a1) corresponds to one generalized sense (concept)
(a2) usually contains ‘bound morpheme(s)’

as a phrase

(b1) may insert an aspectual particle (X=le/zhe/guo)
(b2) may insert all types of post-verbal modifiers (X=BUYU)
(b3) may insert a pre-nominal modifier de-construction (X=DEP)

(5-16.) V+N II (based on L. Li 1990:115)

as a word

(a1) corresponds to one generalized sense (concept)
(a2) usually contains ‘bound morpheme(s)’
(a3) (some) may be followed by an aspectual particle (X=le/zhe/guo)
(a4) (some) may be followed by a post-verbal modifier of duration or number of times (X=BUYU)
(a5) (some) may take an object (X=BINYU)

as a phrase

(b1) may insert an aspectual particle (X=le/zhe/guo)
(b2) may insert all types of post-verbal modifiers (X=BUYU)
(b3) may insert a pre-nominal modifier de-construction (X=DEP)

For V+N I, the previous text has already given detailed analysis and evidence, concluding that such idioms are phrases, not words.  This position is not affected by the features (a1) and (a2) demonstrated in (5-15);  as argued before, (a1) and (a2) do not contribute to the definition of a grammar word.

However, (a3), (a4) and (a5) are all syntactic evidence showing that V+N II idioms can fill lexical positions.  On the other hand, these idioms also show similarity with V+N I idioms in the phrase features (b1), (b2) and (b3).  In particular, (a3) versus (b1) and (a4) versus (b2) constitute a ‘minimal pair’ of word features and phrase features.  The following is such a minimal pair example (with the same meaning as well), based on the feature pair (a3) versus (b1), with the post-verbal modifier 透 tou (thorough) and the aspectual particle 了 le (LE).  It demonstrates the borderline status of such idioms.  As before, a similar example of an ordinary transitive VP is also given for comparison.

(5-17.)         V+N II: word or phrase?

伤心:sad; heart-broken
shang          xin
hurt            heart

(a)      我伤心透了
wo     shang-xin  tou              le.
I         sad              thorough     LE
I was extremely sad.

(b)      我伤透了心
wo     shang         tou              le       xin.
I         break          thorough     LE     heart
I was extremely sad.

(5-18.)         Ordinary V+NP phrase: 恨hen (hate) 他ta (he)

(a) *   我恨他透了
wo     hen   ta      tou              le.
I         hate   he      thorough     LE

(b)      我恨透了他
wo     hen   tou              le       ta.
I         hate   thorough     LE     he
I thoroughly hate him.

As shown in (5-18), in the common V+NP structure, the post-verbal modifier 透 tou (thorough) and the aspectual particle 了 le (perfect aspect) can only occur between the lexical V and the NP.  But in many V+N II idioms, they may occur either after the V+N combination or in between.  In (5‑17a), 伤心 shang xin is in the lexical position, because Chinese syntax requires that the post-verbal modifier attach to a lexical V, not to a VP, as indicated by (5-18a).  Following the same argument, 伤 shang (hurt) alone in (5-17b) must also be a lexical V.  The sign 心 xin (heart) in (5‑17b) establishes itself in syntax as the object of the V, playing the same role as 他 ta (he) in (5-18b).  These facts show clearly that V+N II idioms can be used both as lexical verbs and as transitive verb phrases.  In other words, before entering a context, i.e. while still in the lexicon, neither possibility can be ruled out.

However, there is a clear-cut condition for distinguishing the word use from the phrase use once a V+N II idiom is placed in context.  It is observed that a V+N II idiom assumes lexical status only when V and N are contiguous.  In all other cases, i.e. when V and N are not contiguous, these idioms behave essentially like the V+N I type.

In addition to the examples in (5-17) above, two more examples are given below to demonstrate the separated phrasal use of V+N II.  The first is the case V+X+N, where X is a possessive modifier attached to the head N.  Note also the post-verbal position of 透 tou (thorough) and 了 le (LE).  The second is an example of passivization, where N occurs before V.  These examples provide strong evidence for the syntactic nature of V+N II idioms when V and N are not contiguous.

(5-19.) (a) *   你伤他的心透了
ni       shang         ta       de      xin    tou              le.
you    hurt            he      DE     heart thorough     LE

(b)      你伤透了他的心
ni       shang         tou              le       ta       de      xin.
you    hurt            thorough     LE     he      DE     heart
You broke his heart.

(5-20.)         V+N II: instance of passive with or without 被 bei (BEI)

心(被)伤透了
xin    (bei)   shang         tou              le.
heart BEI    break          thorough     LE
The heart was completely broken.
or: (Someone) was extremely sad.

Based on the above investigation, it is proposed in CPSG95 that two distinct entries be constructed for each such idiom, one as an inseparable lexical V, and the other as a transitive VP just like that of V+N I.  Each entry covers its own part of the phenomena.  In order to capture the semantic link between the two entries, a lexical rule called V_N_II Rule is formulated in CPSG95, shown in (5-21).


The input to the V_N_II Lexical Rule is an entry with [CATEGORY v_n_ii], where [v_n_ii] is a sub-category defined in the lexicon for V+N II type verbs.  The output is another entry with the same information except for three features: [HANZI], [CATEGORY] and [COMP1_RIGHT].  The new value for [HANZI] is a list concatenating the old [HANZI] with the [HANZI] of the expected [COMP1_RIGHT].  The new [CATEGORY] value is simply [v].  The value for [COMP1_RIGHT] becomes [null].  The outlines of the two entries captured by this lexical rule are shown in (5-22) and (5-23).
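The effect of the V_N_II Lexical Rule can be sketched in Python as a mapping over dictionary-encoded entries.  The feature names [HANZI], [CATEGORY] and [COMP1_RIGHT] follow the description above; the dictionary encoding and the function name are illustrative assumptions, not the typed feature structures of the actual ALE implementation.

```python
# A sketch (illustrative only) of the V_N_II Lexical Rule: it maps a
# separable V+N II entry to the corresponding inseparable lexical-V
# entry, changing only HANZI, CATEGORY and COMP1_RIGHT.

def v_n_ii_rule(entry):
    """Derive the word entry from a v_n_ii (phrasal) entry."""
    if entry["CATEGORY"] != "v_n_ii":
        return None                      # the rule applies to v_n_ii only
    comp = entry["COMP1_RIGHT"]          # the expected N complement
    out = dict(entry)                    # all other information is unchanged
    out["HANZI"] = entry["HANZI"] + comp["HANZI"]   # concatenate hanzi lists
    out["CATEGORY"] = "v"                # plain lexical verb
    out["COMP1_RIGHT"] = None            # no longer expects an object
    return out

# Hypothetical input entry for 伤心 shang-xin (hurt-heart: sad)
shang_xin = {
    "HANZI": ["伤"],
    "CATEGORY": "v_n_ii",
    "COMP1_RIGHT": {"HANZI": ["心"], "CATEGORY": "n"},
}
word_entry = v_n_ii_rule(shang_xin)
```

The derived `word_entry` has [HANZI] 伤心, [CATEGORY v] and a [null] complement expectation, mirroring the inseparable-word entry the rule is meant to produce.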


It needs to be pointed out that the definition of [CATEGORY v_n_ii] in CPSG95 is narrower than L. Li's definition of V+N II type idioms.  As indicated by L. Li (1990), not all V+N II idioms share the same set of lexical features (a3), (a4) and (a5) as a word.  The definition in CPSG95 excludes the idioms which have the lexical feature (a5), i.e. taking a syntactic object.  These are idioms like 担心 dan-xin (carry-heart: worry about).  Such idioms, when used as inseparable compound words, can take a syntactic object.  This is not possible for the other V+N idioms, as shown below.

(5-24.) (a)     她很担心你
ta       hen    dan-xin                ni.
she     very   worry (about)        you
She is very worried about you.

(b) *   他很伤心你
ta       hen    shang-xin            ni.
he      very   sad                       you

In addition, these idioms do not demonstrate the full distributional potential of transitive VP constructions.  The separated uses of these idioms are far more limited than other V+N idioms.  For example, they can hardly be passivized or topicalized as other V+N idioms can, as shown by the following minimal pair of passive constructions.

(5-25.)(a) *   心(被)担透了
xin    (bei)   dan             tou              le.
heart BEI    carry           thorough     LE

(b)      心(被)伤透了
xin    (bei)   shang         tou              le.
heart BEI    break          thorough     LE
The heart was completely broken.
or: (Someone) was extremely sad.

In fact, the separated use ('phrasal use') of such V+N idioms seems limited to some type of X-insertion, typically the appearance of aspect signs between V and N.[15]  This type of separated use is the only one shared by all V+N idioms, as shown below.

(5-26.)(a)     他担过心
ta       dan             guo    xin
he      carry           GUO  heart
He (once) was worried.

(b)      他伤过心
ta       shang         guo    xin
he      break          GUO  heart
He (once) was heart-broken.

To summarize, the V+N idioms like 担心 dan-xin, which can take a syntactic object, do not share sufficient generality with the other V+N II idioms for a lexical rule to capture.  Therefore, such idioms are excluded from the [CATEGORY v_n_ii] type and are thus not subject to the lexical rule proposed above.  It is left for future research whether there is enough generality among this set of idioms to justify some general treatment, say, another lexical rule or some other generalization over the phenomena.  For the time being, CPSG95 simply lists both the contiguous and separated uses of these idioms in the lexicon.[16]

It is worth noticing that, these idioms aside, the lexical rule still covers the large majority of V+N II phenomena.  Idioms like 担心 dan-xin form only a very small set which is in transition to words per se (from the angle of language development) but which still retains some (though not all) characteristics of a phrase.[17]

5.3. Verb-modifier Idioms: V+A/V

This section investigates the V+X idioms of the form V+A/V.  The data on the interaction of V+A/V idioms and modal insertion are presented first.  The subsequent text argues for Lü's infix hypothesis for modal insertion and accordingly proposes lexical rules to capture these idioms with or without the inserted modal.

The following is a sample list of V+A/V idioms, represented by kan jian (look-see: have seen).

(5-27.) V+A/V: kan jian type

看见 kan (look) jian (see)                    have seen
看穿 kan (look) chuan (through)        see through
离开 li (leave) kai (off)                         leave
打倒 da (beat) dao (fall)                      down with
打败 da (beat) bai (fail)                       defeat
打赢 da (beat) ying (win)                    fight and win
睡着 shui (sleep) zhao (asleep)            fall asleep
进来 jin (enter) lai (come)                             enter
走开 zou (go) kai (off)                         go away
关上  guan (close) shang (up)             close

In the V+A/V idiom kan jian (have-seen), the first sign kan (look) is the head of the combination while the second jian (see) denotes the result.  So when we say, wo (I) kan-jian (see) ta (he), even without the aspectual marker le (LE) or guo (GUO), we know that it is a completed action:  ‘I have seen him’ or ‘I saw him’.[18]

Idioms like kan-jian (have-seen) function as a lexical whole (a transitive verb).  When there is an aspect marker, it attaches immediately after the idiom, as shown in (5‑28).  This is strong evidence for judging V+A/V idioms as words, not as syntactic constructions.

(5-28.)         我看见了他
wo     kan jian     le       ta.
I         look-see       LE     he                   I have seen him.

The only observed separated use is that such idioms allow for two modal signs 得 de3 (can) and 不 bu (cannot) in between, shown by (5-29a) and (5-29b).  But no other signs, operations or processes can enter the internal structure of these idioms.

(5-29.) (a)     我看不见他
wo     kan bu jian         ta.
I         look cannot see     he
I cannot see him.

(b)      你看得见他吗?
ni       kan de3 jian       ta       me?
you    look can see          he      ME
Can you see him?

Note that the English modal verbs 'can' and 'cannot' are used to translate these two modal signs.  In fact, Contemporary Mandarin also has corresponding modal verbs (能愿动词 neng-yuan dong-ci):  能 neng (can) and 不能 bu neng (cannot).  The major difference between the Chinese modal verbs 能 neng / 不能 bu neng and the modal signs 得 de3 / 不 bu lies in their distribution in syntax.  The use of the modal signs 得 de3 (can) and 不 bu (cannot) is extremely restricted:  they must be inserted into V+BUYU combinations.  The Chinese modal verbs, in contrast, can be used before any VP.  It is interesting to see the cases where the two are used together in one sentence, as shown in (5-30a+b) below.  Note that the meaning difference between the two types of modal signs is subtle, as the examples show.

(5-30.)(a)     你看得见他吗?
ni       kan de3 jian         ta       me?
you    look can see          he      ME
Can you see him? (Is your eye-sight good enough?)

(b)      你能看见他吗?
ni       neng kan jian      ta       me?
you    can    see              he      ME
Can you see him?
(Note: This is used in more general sense. It covers (a) and more.)

(a+b)  你能看得见他吗?
ni       neng kan de3 jian         ta       me?
you    can    look can see          he      ME
Can you see him? (Is your eye-sight good enough?)

(5-31.)(a)     我看不见他
wo     kan bu jian           ta
I         look cannot see     he
I cannot see him. (My eye-sight is too poor.)

(b)      我不能看见他
wo     bu     neng kan jian      ta
I         not    can    see              he
I cannot see him. (Otherwise, I will go crazy.)

(a+b) 我不能看不见他
wo     bu     neng kan bu jian           ta.
I         not    can    look cannot see     he
I cannot stand not being able to see him.
(I have to keep him always within the reach of my sight.)

Lü (1989:127) indicates that the modal signs are in fact the only two infixes in Contemporary Chinese.  Following this infix hypothesis, all the data above receive a good account.  In other words, the V+A/V idioms are V+BUYU compound words subject to modal infixation.  The phenomena of 看得见 kan-de3-jian (can see) and 看不见 kan-bu-jian (cannot see) are therefore morphological in nature.  Lü, however, did not offer a formal analysis of these idioms.

Thompson (1973) first proposed a lexical rule to derive the potential forms V+de3/bu+A/V from the V+A/V idioms.  The lexical rule approach seems most suitable for capturing the regularity of the V+A/V idioms and their infixation variants V+de3/bu+A/V.  The approach taken in CPSG95 is similar to Thompson's proposal.  More precisely, two lexical rules are formulated in CPSG95 to handle the infixation in V+A/V idioms.  This way, CPSG95 simply lists all V+A/V idioms in the lexicon as V+A/V type compound words, coded as [CATEGORY v_buyu].[19]  Such entries cover all the contiguous uses of the idioms.  It is up to the two lexical rules to produce two infixed entries to cover the separated uses.

The infixed entries differ from the original entry in the semantic contribution of the modal signs.  This is captured in the lexical rules in (5-32) and (5-33).  In the case of V+de3+A/V, the Modal Infixation Lexical Rule I in (5-32) assigns the value [can] to the feature [MODAL] in the semantics.  For V+bu+A/V, the setting [POLARITY minus] is used to represent the negation in the semantics, as shown in (5-33).[20]
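The combined effect of the two modal infixation rules can be sketched in Python over the same dictionary-style encoding of lexical entries.  The feature names [MODAL] and [POLARITY] follow the text; the dictionary encoding, the function name, and the assumption that the negative variant also carries [MODAL can] alongside [POLARITY minus] are illustrative assumptions, not the actual ALE rules.

```python
# A sketch (illustrative only) of the two Modal Infixation Lexical
# Rules: from one [v_buyu] compound entry, derive the V+de3+BUYU
# ('can') and V+bu+BUYU ('cannot') entries.

def modal_infixation(entry):
    """Return the two infixed variants of a V+A/V (v_buyu) compound."""
    if entry["CATEGORY"] != "v_buyu":
        return []
    v, buyu = entry["HANZI"][0], entry["HANZI"][1:]  # split V from its BUYU

    can = dict(entry)
    can["HANZI"] = [v, "得"] + buyu        # e.g. kan-de3-jian
    can["MODAL"] = "can"                    # Rule I: [MODAL can]

    cannot = dict(entry)
    cannot["HANZI"] = [v, "不"] + buyu     # e.g. kan-bu-jian
    cannot["MODAL"] = "can"                 # assumed: 'cannot' = can + negation
    cannot["POLARITY"] = "minus"            # Rule II: negation in semantics
    return [can, cannot]

# Hypothetical input entry for 看见 kan-jian (look-see: have seen)
kan_jian = {"HANZI": ["看", "见"], "CATEGORY": "v_buyu"}
can_see, cannot_see = modal_infixation(kan_jian)
```

Applied to the 看见 kan-jian entry, the sketch yields the 看得见 kan-de3-jian and 看不见 kan-bu-jian entries, mirroring how the lexical rules extend the lexicon to cover the separated uses.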


The following lexical entry shows the idiomatic compound 看见 kan-jian as coded in the CPSG95 lexicon (leaving some irrelevant details aside).   This entry satisfies the necessary condition for the proposed infixation lexical rules.


The modal infixation lexical rules will take this [v_buyu] type compound as input and produce two V+MODAL+BUYU entries.  As a result, new entries 看得见 kan-de3-jian (can see) and 看不见 kan-bu-jian (cannot see) as shown below are added to the lexicon.[21]



The above proposal offers a simple and effective way of capturing the interaction of V+A/V idioms and modal insertion:  it requires no change to the general grammar to accommodate this type of separable verb interacting with 得 de3 / 不 bu, the only two infixes in Chinese.

5.4. Summary

This chapter has conducted an inquiry into the linguistic phenomena of Chinese separable verbs, a long-standing and difficult problem at the interface of Chinese compounding and syntax.   For each type of separable verb, arguments for the wordhood judgment have been presented.  Based on this judgment, CPSG95 provides analyses which capture both the structural and the semantic aspects of the constructions at issue.  The proposed solutions are formal and implementable.  All of them capture the link between the separated and contiguous uses of the V+X idioms.  The proposals presented in this chapter cover the vast majority of separable verbs.  Some unsolved rare cases and potential problems have also been identified for further research.



[1] They are also called phrasal verbs (duanyu dongci) or compound verbs (fuhe dongci) among Chinese grammarians.  For linguists who believe that they are compounds, the V+N separable verbs are often called verb object compounds and the V+A/V separable verbs resultative compounds.  The want of a uniform term for such phenomena reflects the borderline nature of these cases.  According to Zhao and Zhang (1996), out of the 3590 entries in the frequently used verb vocabulary, there are 355 separable V+N idioms.

[2] As the term ‘separable verbs’ gives people an impression that these verbs are words (which is not necessarily true), they are better called V+X (or V+N or V+A/V) idioms.

[3] There is no disagreement among Chinese grammarians for the verb-object combinations like xi wan:  they are analyzed as transitive verb phrases in all analyses, no matter whether the head V and the N is contiguous (e.g. xi wan ‘wash dishes’) or not (e.g. xi san ge wan ‘wash three dishes’).

[4] Such signs as zao (bath), which are marked with # in (5-1), are often labeled as ‘bound morphemes’ among Chinese grammarians, appearing only in idiomatic combinations like xi zao (take a bath), ca zao (clean one’s body by scrubbing).  As will be shown shortly, bound morpheme is an inappropriate classification for these signs.

[5] It is widely acknowledged that the sequence num+classifier+noun is one typical form of Chinese NP in syntax.  The argument that zao is not a bound morpheme does not rely on any particular analysis of such Chinese NPs.  The fact that such a combination is generally regarded as syntactic ensures the validity of this argument.

[6] The notion ‘free’ or ‘freely’ is linked to the generally accepted view of regarding word as a minimal ‘free’ form, which can be traced back to classical linguistics works such as Bloomfield (1933).

[7] It is generally agreed that idioms like kick the bucket are not compounds but phrases (Zwicky 1989).

[8] That is the rationale behind the proposal of inseparability as an important criterion for wordhood judgment in Lü (1989).

[9] In Chinese, reduplication is a general mechanism used both in morphology and syntax.  This thesis only addresses certain reduplication issues when they are linked to the morpho-syntactic problems under examination, but cannot elaborate on the Chinese reduplication phenomena in general.  The topic of Chinese reduplication deserves the study of a full-length dissertation.     

[10] In the ALE implementation of CPSG95, there is a VV Diminutive Reduplication Lexical Rule in place for phenomena like xi zao (take a bath) → xi xi zao (take a short bath);  ting yin-yue (listen to music) → ting ting yin-yue (listen to music for a while);  xiu-xi (rest) → xiu-xi xiu-xi (have a short rest).

[11] He observes that there are two distinct principles on wordhood.  The vocabulary principle requires that a word represent an integrated concept, not the simple composition of its parts.  Associated with the above is a tendency to regard as a word a relatively short string.  The grammatical principle, however, emphasizes the inseparability of the internal parts of a combination.  Based on the grammatical principle, xi zao is not a word, but a phrase.  This view is very insightful.

[12] The pattern variations are captured in CPSG95 by lexical rules following the HPSG tradition.  It is out of the scope of this thesis to present these rules in the CPSG95 syntax.  See W. Li (1996) for details.

[13] In the rare cases when the noun zao is realized in a full-fledged phrase like yi ge tong-kuai de zao (a comfortable bath), we may need some complicated special treatment in the building of the semantics.  Semantically, xi (wash) yi (one) ge (CLA) tong‑kuai (comfortable) de (DE) zao (bath): ‘take a comfortable bath’ actually means tong‑kuai (comfortable) de2 (DE2) xi (wash) yi (one) ci (time) zao (bath): ‘comfortably take a bath once’.  The syntactic modifier of the N zao is semantically a modifier attached to the whole idiom.  The classifier phrase of the N becomes the semantic ‘action-times’ modifier of the idiom.  The elaboration of semantics in such cases is left for future research.

[14] The two groups classified by L. Li (1990) are not restricted to V+N combinations.  In order not to complicate the case, only the comparison of the two groups of V+N idioms is discussed here.  Note also that in the tables, he used the term 'bound morpheme' (inappropriately) to refer to the co-occurrence constraint of the idioms.

[15] Another type of X-insertion is that N can occasionally be expanded by adding a de‑phrase modifier.  However, this use is really rare.

[16] Since they are only a small, easily listable set of verbs, and they only demonstrate limited separated uses (instead of full pattern variations of a transitive VP construction), to list these words and all their separated uses in the lexicon seems to be a better way than, say, trying to come up with another lexical rule just for this small set.  Listing such idiosyncratic use of language in the lexicon is common practice in NLP.

[17] In fact, this set has been becoming smaller because some idioms, say zhu-yi ‘focus-attention: pay attention to’, which used to be in this set, have already lost all separated phrasal uses and have become words per se.  Other idioms including dan-xin (worry about) are in the process of transition (called ionization by Chao 1968) with their increasing frequency of being used as words.   There is a fairly obvious tendency that they combine more and more closely as words, and become transparent to syntax.  It is expected that some, or all, of them will ultimately become words proper in future, just as zhu-yi did.

[18] In general, one cannot use kan-jian to translate the English future tense 'will see'; instead one should use the single-morpheme word kan:  I will see him → wo (I) jiang (will) kan (see) ta (he).

[19] Of course, [v_buyu] is a sub-type of verb [v].

[20] The use of this feature for representing negation was suggested in Footnote 18 of Pollard and Sag (1994:25).

[21] This is the procedural perspective of viewing the lexical rules.  As pointed out by Pollard and Sag (1987:209), “Lexical rules can be viewed from either a declarative or a procedural perspective: on the former view, they capture generalizations about static relationships between members of two or more word classes; on the latter view, they describe processes which produce the output from the input form.”



PhD Thesis: Chapter IV Defining the Chinese Word


4.0. Introduction

This chapter examines the linguistic definition of the Chinese word and establishes its formal representation in CPSG95.  This lays a foundation for the treatment of Chinese morpho-syntactic interface problems in later chapters.

To address issues on interfacing morphology and syntax in Chinese NLP, the fundamental question is:  what is a Chinese word?  A proper answer to this question defines the boundaries between morphology, the study of how morphemes combine into words, and syntax, the study of how words combine into phrases.  However, there is no easy answer to this question.

In fact, how to define Chinese words has been a central topic among Chinese grammarians for decades (Hu and Wen 1954; L. Wang 1955; Z. Lu 1957; Lin 1983; Lü 1989; Shi 1992; Dai 1993; Zhao and Zhang 1996).  In the late 1950s, there was a heated discussion in China on the definition of the Chinese word, prompted by the campaign for the Chinese writing system reform (文字改革运动).  At that time, the government policy was to ultimately replace the Chinese characters (hanzi) with a Romanized writing system.  The system of pinyin, based on the Latin alphabet, was designed to represent the pronunciation of the characters in Contemporary Mandarin.  The simplest way would have been to use pinyin as the writing system, simply transliterating Chinese characters into pinyin syllables.  But this was soon found impractical due to the many-to-one correspondence from hanzi to syllable:  text in pinyin with no explicit word boundary delimiters is hardly comprehensible.   Linguists agreed that the key issue for the feasibility of a pinyin-based writing system was to establish a standard or definition for Chinese words (Z. Lu 1957).  Once words can be identified by a common standard, the pinyin system can in principle be adopted for recording the Chinese language, using spaces and punctuation marks to separate words.  This is because the number of homophones at the word level is dramatically reduced compared to the number of homophones at the hanzi (morpheme or monosyllable) level.

But the definition of a Chinese word is a very complicated issue due to the existence of a considerable amount of borderline cases.  It has never been possible to reach a precise definition which can be applied to all circumstances and which can be accepted by linguists from different schools.

There have been many papers addressing the Chinese wordhood issue (e.g. Z. Lu 1957; Lin 1983; Lü 1989; Dai 1993).  Although there are still many problems in defining Chinese words for borderline cases and more debate will continue for many years to come, the understanding of Chinese wordhood has been deepened in the general acknowledgement of the following key aspects:  (i) the distinct status of Chinese morphology;  (ii) the distinction of different notions of word;  and (iii) the lack of absolute definition across systems or theories.

Almost all Chinese grammarians agree that, unlike Classical Chinese, Contemporary Chinese is not based on single-morpheme words.   In other words, the word and the morpheme are no longer coextensive in Contemporary Chinese.[1]  In fact, that is the reason why Chinese morphology needs to be defined at all.  If the word and the morpheme stood for the same linguistic object, as in Classical Chinese, the definition of morpheme would entail the definition of word and there would be no role for morphology.

As it stands, there is little debate on the definition of the morpheme in Chinese.  It is generally acknowledged that each syllable (or its corresponding written form, hanzi) corresponds to (at least) one morpheme.  A characteristic 'isolating language' (Classical Chinese is close to this) has no or very poor morphology.[2]  However, Contemporary Chinese contains a significant number of bound morphemes in word formation (Dai 1993).  In particular, it is observed that many affixes are highly productive (Lü et al 1980).

It is widely acknowledged that the grammar of Contemporary Chinese is not complete without the component of morphology (Z. Lu 1957; Chao 1968; Li and Thompson 1981; Dai 1993; etc.).   Based on this widely accepted assumption, one major task for this thesis is to argue for the proper place to cut the line between morphology and syntax, and to explore effective ways of interleaving the two for analysis.

A significant development concerning the Chinese wordhood study is the  distinction between two different notions of word:  grammar word versus vocabulary word.  It is now clear that in terms of grammar analysis, a vocabulary word is not an appropriate notion (Lü 1989; more discussion to come in 4.1).

Decades of debate and discussion on the definition of the Chinese word have also shown that an operational definition of the grammar word, precise enough to apply to all cases, can hardly be established across systems or theories.  But a computational grammar of Chinese cannot be developed without precise definitions.  This leads to an argument in favor of a system-internal wordhood definition and interface coordination within the grammar.

The remaining sections of this chapter are organized as follows.  Section 4.1 examines the two notions of word.  With the appropriate notion established, Section 4.2 develops some operational methods for judging a Chinese grammar word.  Section 4.3 demonstrates the formal representation of a word in CPSG95.  This formalization is based on the design of the expectation feature structures and the structural feature structure presented in Chapter III.

4.1. Two Notions of Word

This section examines the two notions of word which have caused confusion.  The first notion, the vocabulary word, is easy to define.  For the second notion, the grammar word, unfortunately, no operational definition has been available.  It will be argued that a feasible alternative is to define the grammar word, and the division of labor between Chinese morphology and syntax, system-internally.

A grammar word stands for the grammatical unit which fits in the hierarchy of morpheme, word and phrase in linguistic analysis.  This gives the general concept of this notion but it is by no means an operational definition.  Vocabulary word, on the other hand, refers to the listed entry in the lexicon.  This definition is simple and unambiguous once a lexicon is given.  The lexical lookup will generate vocabulary words as potential building blocks for analysis.

On the one hand, vocabulary words come from the lexicon;  they are the basic building blocks for linguistic analysis.  On the other hand, as the 'resulting' unit of morphological analysis as well as the 'starting' or 'atomic' unit of syntactic analysis, the grammar word is the notion used for linguistic generalization.  But it is observed that a vocabulary word is not necessarily a grammar word and vice versa.  It is this possible mismatch between vocabulary word and grammar word that has caused problems in both Chinese grammar research and Chinese NLP system development.

Lü (1989) indicates that not making a distinction between these two notions of word has caused considerable confusion on the definition of Chinese word in the literature.  He further points out that only the former notion should be used in the grammar research.

Di Sciullo and Williams (1987) have similar ideas on these two notions of word.  They indicate that a sign listable in the lexicon does not correspond to any fixed grammatical unit.[3]   It can be a morpheme, a (grammar) word, or a phrase, including a sentence.  Some examples of different kinds of Chinese vocabulary words are given below to demonstrate this insight.

(4-1.) sample Chinese vocabulary words

(a) 性           bound morpheme, noun suffix, ‘-ness’
(b) 洗           free morpheme or word, V: ‘wash’
(c) 澡           word (only used in idioms), N: ‘bath’
(d) 澡盆        compound word, N: ‘bath-tub’
(e) 洗澡        idiom phrase, VP: ‘take a bath’
(f) 他们         pronoun as noun phrase, NP: ‘they’
(g) 城门失火,殃及池鱼

idiomatic sentence, S:
‘When the gate of a city is on fire, the fish in the
canal around the gate is also endangered.’

The above signs are all Chinese vocabulary words.  But grammatically, they do not all function as grammar words.  For example, (4-1a) functions as a suffix, smaller than a word.  (4-1e) behaves like a transitive VP (see 5.1 for more evidence), and (4-1g) acts as a sentence, both larger than a word.  The consequence of mixing up these different units in a grammar is the loss of the grammar's power to capture linguistic generality at each level of grammatical unit.

The definition of grammar word has been a contentious issue in general linguistics (Di Sciullo and Williams 1987).  Its precise definition is particularly difficult in Chinese linguistics as there is a considerable amount of phenomena marginal between Chinese morphology and syntax (Zhu 1985; L. Li 1990; Sun and Huang 1996).  The morpheme-word-phrase transition is a continuous band in the linguistic reality.  Different grammars may well cut the division differently.  As long as there is no contradiction in coordinating these objects within the grammar, there does not seem to exist absolute judgment on which definition is right and which is wrong.

It is generally agreed that a grammar word is the smallest unit in syntax (Lü 1989), as also emphasized by Di Sciullo and Williams (1987) in their 'syntactic atomicity' of word.[4]  But this statement only serves as a guideline in theory; it is not an operational definition, for the following reason.  It is logically circular to define word (the smallest unit in syntax) and syntax (the study of how words combine into phrases) one upon the other.

To avoid this circularity, a feasible alternative is to define the grammar word, and the division of labor between Chinese morphology and syntax, system-internally, as is done in CPSG95.  Of course, the system-internal definition still needs to be justified by the proposed morphological or syntactic analysis of borderline phenomena, in terms of capturing linguistic generality.  More specifically, three things need to be done:  (i) argue for the analysis case by case, e.g. why a certain construction should be treated as a morphological or syntactic phenomenon, what linguistic generality is captured by such a treatment, etc.;  (ii) establish some operational methods of wordhood judgment to cover similar cases;  (iii) use formalized data structures to represent the linguistic units once the wordhood judgment is made.  Section 4.2 handles task (ii) and Section 4.3 is devoted to the formal definition of word required by task (iii).   Task (i) is pursued in the remaining chapters.

Another important notion related to grammar word is unlisted word.  Conceptually, an unlisted word is a novel construction formed via morphological rules, e.g. a derived word like 可读性 ke-du-xing (-able-read-ness: readability), foolish-ness, a compound person name (given name + family name) such as John Smith, 毛泽东 mao-ze-dong (Mao Zedong).  Unlisted words are often rule-based.  This is where productive word formation sets in.

However, the unlisted word is not a crystal-clear notion, any more than the underlying concept of grammar word.  Many grammarians have observed that phrases and unlisted words in Chinese are formed by similar rules (e.g. Zhu 1985; J. Lu 1988).  As both syntactic constructions and unlisted words are rule-based, it can be difficult to judge a significant amount of borderline constructions as morphological or syntactic.

There are fuzzy cases where a construction is regarded as a grammar word in one analysis and as a syntactic construction in another.  For example, while san (three) ge (CLA) is regarded as a syntactic construction, namely a numeral-classifier phrase, in many grammars including CPSG95, such constructions are treated as compound words by others (e.g. Chen and Liu 1992).  'Quasi-affixation' presents another outstanding 'gray area' (see 6.2).

The difficulty in handling the borderline phenomena leads back to the argument that the labor division between Chinese morphology and syntax should be pursued system internally and argued case by case in terms of capturing the linguistic generality.  To implement the required system internal definition, it is desirable to investigate practical wordhood judgment methods in addition to case-by-case arguments.  Some judgment methods will be developed in 4.2.  Case-by-case arguments and analysis for specific phenomena will be presented in later chapters.  After the wordhood judgment is made, there is a need for the formal representation.  Section 4.3 defines the formal representation of word with illustrations.

4.2. Judgment Methods

This section proposes some operational wordhood judgment methods based on the notion of ‘syntactic atomicity’ (Di Sciullo and Williams 1987).  These methods should be applied in combination with arguments from the associated grammatical analysis.  In fact, whether a sign is judged as a morpheme, a grammar word or a phrase ultimately depends on the related grammatical analysis.  However, the operational nature of these methods helps streamline the later analysis of individual problems and avoids unnecessary repetition of similar arguments.

Most methods proposed for Chinese wordhood judgment in the literature are not fully operational.  For example, Chao (1968) agrees with Z. Lu (1957) that a word can fill the functional frame of a typical syntactic structure.  Dai (1993) points out that while this method may effectively separate bound morphemes from free words, it cannot differentiate between words and phrases, since phrases may also be positioned in a syntactic frame.  In fact, whether this method can indeed separate bound morphemes from free words remains questionable.  The method cannot be made operational unless the ‘frame of a typical syntactic structure’ is itself defined.  The judgment methods proposed in this section try to avoid this lack of operationality.

Dai (1993) made a serious effort in proposing a series of methods for drawing the line between morphemes and syntactic units in Chinese.  These methods have significantly advanced the study of this topic.  However, Dai admits that these proposals have limitations.  While each proposed method provides a sufficient (but not necessary) condition for judging whether a unit is a morpheme, none of the methods can further determine whether the unit is a word or a phrase.  For example, the method of syntactic independence tests whether a unit in a question can serve as a short answer to that question.  If it can, syntactic independence is confirmed and the unit is not a morpheme inside a word.  Obviously, such a method says nothing about the syntactic rank of the tested unit, because a word, a phrase or a clause can all serve as an answer to a question.  To determine the rank, other methods and/or analyses need to be brought in.

The first judgment method proposed below involves passivization and topicalization tests.  In essence, this tests whether a string participates in syntactic processes.  Since a word is an atomic unit, its internal structure is transparent to syntax.  It follows that no syntactic process may exert effects on the internal structure of a word.[5]  As passivization and topicalization are generally acknowledged to be typical syntactic processes, if a potential combination A+B is subject to passivization B+bei+A and topicalization B+…+NP+A, it can be concluded that A+B is not a word:  the relation between A and B must be syntactic.

The second method is to define unambiguous patterns for wordhood judgment, namely judgment patterns.  Judgment patterns are by no means a new concept.  In particular, keyword-based judgment patterns have been frequently used in the Chinese linguistics literature as a handy way of deterministic word category detection (e.g. L. Wang 1955;  Zhu 1985; Lü 1989).

The following keyword-based patterns, with the aspect markers as keywords, are proposed for judging a verb sign.

(4-2.) (a) V(X)+着/过 –> word(X)
       (b) V(X)+着/过/了+NP –> word(X)

Pattern (4-2a) states that if X is a verb sign, whether transitive or intransitive, appearing immediately before zhe/guo, then X is a word.  This proposal is backed by an important and widely acknowledged generalization in Chinese syntax:  the aspect markers appear immediately after lexical verbs (Lü et al 1980).

Note that the aspect marker le (LE) is excluded from pattern (4-2a) because the same keyword le corresponds to two distinct morphemes in Chinese:  the aspect le (LE) attaches to a lexical V while the sentence-final le (LEs) attaches to a VP (Lü et al 1980).  Therefore, no reliable judgment can be made when a sentence ends in X+le, for example, when X is an intransitive verb or a transitive verb with its optional object omitted.  In pattern (4-2b), however, le poses no problem, since there it does not occupy the ambiguous sentence-final position.  This pattern says that if any of the three aspect markers appears between a verb sign X and an NP, X must be a word:  in fact, a lexical transitive verb.
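For concreteness, the two patterns can be sketched as a simple matching procedure.  The token tags ("V", "ASP", "NP") and the list representation below are hypothetical simplifications for illustration only; in CPSG95 the judgment operates on analyzed signs, not raw tokens.

```python
ASPECT_AFTER_V = {"zhe", "guo"}          # (4-2a): le excluded (ambiguous sentence-finally)
ASPECT_BEFORE_NP = {"zhe", "guo", "le"}  # (4-2b): le is safe when an NP follows

def judge_word(tokens, i):
    """Judge whether the verb sign tokens[i] is a word.

    tokens: list of (form, tag) pairs.  Returns True when a pattern
    applies, or None when neither pattern yields a judgment.
    """
    form, tag = tokens[i]
    if tag != "V" or i + 1 >= len(tokens):
        return None
    asp, asp_tag = tokens[i + 1]
    if asp_tag != "ASP":
        return None
    if i + 2 < len(tokens) and tokens[i + 2][1] == "NP":
        # (4-2b): V + zhe/guo/le + NP  ->  X is a word (lexical transitive verb)
        return True if asp in ASPECT_BEFORE_NP else None
    # (4-2a): V + zhe/guo  ->  X is a word
    return True if asp in ASPECT_AFTER_V else None

# 他洗着澡 ta xi zhe zao: xi is judged a word by pattern (4-2b)
print(judge_word([("ta", "NP"), ("xi", "V"), ("zhe", "ASP"), ("zao", "NP")], 1))  # True
```

Note how a sentence-final le yields no judgment: for a string like ta ku le, the procedure returns None rather than a verdict, mirroring the exclusion of le from pattern (4-2a).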

There are two ways to use the judgment patterns.  If a sub-string of the input sentence matches a judgment pattern, one reaches the conclusion directly.  If the input string does not match a pattern, one can still make indirect use of the patterns.  The idiomatic combination 洗澡 xi (wash) zao (bath) is a representative example.   Assume that the vocabulary word xi zao is a grammar word.  It should then be able to fill the lexical verb position in judgment pattern (4-2a).  We construct a sentence containing a substring that matches the pattern and check its grammaticality.  The result is ungrammatical:  * 他洗澡着 ta (he) xi-zao (V) zhe (ZHE);  * 他洗澡过 ta (he) xi-zao (V) guo (GUO).  Therefore, the assumption must be wrong:  洗澡 xi zao is not a grammar word.  We then change the assumption and try inserting the aspect markers inside the combination (this is in fact an expansion test, to be discussed shortly).  The new assumption is that the verb xi alone is a grammar word.  What we get are perfectly grammatical sentences matching pattern (4-2b):  他洗着澡 ta (he) xi (V) zhe (ZHE) zao (bath): ‘He is taking a bath’;  他洗过澡 ta (he) xi (V) guo (GUO) zao (bath): ‘He has taken a bath’.  The assumption is thus proven correct.  In this way, all V+X combinations can be judged against the judgment patterns (4-2a) and (4-2b).

The third method proposed below involves a more general expansion test.  As an atomic unit in syntax, the internal parts of a word are in principle not separable.[6]  Lü (1989) emphasized inseparability as a criterion for judging grammar words but gave no instructions on how the criterion should be applied.  Nevertheless, many linguists (e.g. Bloomfield 1933; Z. Lu 1957;  Lyons 1968; Dai 1993) have discussed expansion tests in one way or another as aids to wordhood judgment.

The method of expansion to be presented below for wordhood judgment is called X-insertion.  X-insertion is based on Di Sciullo and Williams’ thesis of the syntactic atomicity of word.  The rationale is that the internal parts of a word cannot be separated by syntactic constituents.

The X-insertion test is performed as follows.   Suppose one needs to judge whether the combination A+B is a word.   If a sign X can be found satisfying all of the following conditions, then A+B is not a word, but a syntactic combination:  (i) A+X+B is a grammatical string;  (ii) X is not a bound morpheme;  and (iii) the sub-structure [A+X] is headed by A or the sub-structure [X+B] is headed by B.

The first condition is self-evident:  a syntactic combination is necessarily a grammatical string.  The second condition eliminates the danger of wrongly applying an infix here;  in fact, if X were a morphological infix, the conclusion would be just the opposite:  A+B would be a word.  The last condition requires that X be a dependent of the head A (or B).  Otherwise the insertion results in a different structure:  when A (or B) is a dependent of the head X, there is no direct structural relation between A and B, so the question of whether A+B is a phrase or a word does not arise in the first place.
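The three conditions can be summarized as a test procedure.  The sketch below is illustrative only: the predicates for grammaticality, boundness and headedness are hypothetical stand-ins for judgments that in practice require a grammar or a native speaker, and the toy data hard-codes the xi+zao example discussed above.

```python
def x_insertion_test(A, B, candidates, is_grammatical, is_bound, head_of):
    """Return True if some X proves A+B to be a syntactic combination."""
    for X in candidates:
        if (is_grammatical((A, X, B))            # (i)   A+X+B is grammatical
                and not is_bound(X)              # (ii)  X is not a bound morpheme
                and (head_of((A, X)) == A        # (iii) [A+X] headed by A, or
                     or head_of((X, B)) == B)):  #       [X+B] headed by B
            return True                          # A+B is not a word
    return False                                 # test is inconclusive

# Toy illustration for xi+zao ('take a bath') with X = yige (one-CLA):
# 'xi yige zao' is grammatical, yige is a free sign, and [yige zao] is
# headed by zao, so xi+zao is judged a syntactic combination.
grammatical = {("xi", "yige", "zao")}
bound = {"zhe", "guo", "le"}
heads = {("yige", "zao"): "zao"}

print(x_insertion_test("xi", "zao", ["yige"],
                       lambda s: s in grammatical,
                       lambda x: x in bound,
                       lambda p: heads.get(p)))  # True
```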

After the wordhood judgment is made on strings of signs based on the above judgment methods and/or the arguments for the analysis involved, the next step is to have them properly represented (coded) in the grammar formalism used.  This is the topic to be presented in 4.3 below.

4.3. Formal Representation of Word

The expectation feature structures and the structural feature structure in the mono-stratal design of CPSG95, presented in Chapter III, provide the means for the formal definition of the basic unit word in CPSG95.  Once the wordhood judgment for a unit is made, based on arguments for a structural analysis and/or the methods presented in Section 4.2, a formal representation is required for coding it in CPSG95.

This formalization is required to ensure implementability in enforcing configurational constraints.  For example, the suffix 性 -xing expects an adjective word to form an abstract noun;  the constraints [CATEGORY a] and @word can accordingly be placed in its morphological expectation feature [SUFFIXING].  These constraints will permit the legitimately derived word 严肃性 [[yan-su]-xing] (serious-ness), but will block the combination * 非常严肃性 [[fei-chang yan-su]-xing] (very-serious-ness).  This is because 非常严肃 [fei-chang yan-su] violates the formal constraint given in the word definition:  it is not an atomic unit in syntax.

In CPSG95, word is defined as a syntactically atomic unit without obligatory morphological expectations, formally represented in the following macro.

word macro
PREFIXING saturated | optional
SUFFIXING saturated | optional
STRUCT no_syn_dtr

Note that the above formal definition uses the sort hierarchy [struct] for the structural feature structure and the sort hierarchy [expected] for the expectation feature structures.  The definitions of these feature structures were given in Chapter III.

Based on the sort hierarchy struct: {syn_dtr, no_syn_dtr}, the constraint [no_syn_dtr] ensures that a word sign does not contain any syntactic daughter.[7]  This prevents syntactic constructions from being treated as words.  On the other hand, since [saturated], [obligatory] and [optional] are three subtypes of [expected], the constraint [saturated|optional] prevents a bound morpheme, i.e. a prefix or suffix with an obligatory expectation in [PREFIXING] or [SUFFIXING], from being treated as a word.
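The effect of the macro can be sketched as a predicate over (much simplified) signs.  Encoding the disjunctive constraint [saturated|optional] as membership in a set of strings is an illustrative simplification of the sort hierarchy; the names below are assumptions, not the actual ALE code.

```python
SATURATED, OPTIONAL, OBLIGATORY = "saturated", "optional", "obligatory"

def is_word(sign):
    """The word macro: [saturated|optional] in both [PREFIXING] and
    [SUFFIXING] (no obligatory affix expectation) and no syntactic daughter."""
    return (sign["PREFIXING"] in (SATURATED, OPTIONAL)
            and sign["SUFFIXING"] in (SATURATED, OPTIONAL)
            and sign["STRUCT"] == "no_syn_dtr")

du = {"PREFIXING": SATURATED, "SUFFIXING": SATURATED, "STRUCT": "no_syn_dtr"}  # free morpheme word
ke = {"PREFIXING": OBLIGATORY, "SUFFIXING": SATURATED, "STRUCT": "no_syn_dtr"} # prefix: bound morpheme
vp = {"PREFIXING": SATURATED, "SUFFIXING": SATURATED, "STRUCT": "syn_dtr"}     # syntactic construct

print(is_word(du), is_word(ke), is_word(vp))  # True False False
```

The two failure cases show the two halves of the definition at work: the prefix fails on its obligatory morphological expectation, the syntactic construct on [STRUCT].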

This macro definition covers the representation of mono-morpheme words, e.g. 鹅 e ‘goose’, 读 du ‘read’, etc., or multi-morpheme words, e.g. 小看 xiao-kan ‘look down upon’, 天鹅 tian-e ‘swan’, etc., as well as unlisted words such as derived words whose internal morphological structures have already been formed.  Some typical examples of word are shown below.



For a derived word, note that the specification [PREFIXING saturated] and [STRUCT prefix], or [SUFFIXING saturated] and [STRUCT suffix], assigned by the corresponding PS rule, is compatible with the word macro definition.

The above word definition is an extension of the corresponding representation features from HPSG (Pollard and Sag 1987).  HPSG uses a binary structural feature [LEX] to distinguish lexical signs, [LEX +], from non-lexical signs, [LEX -].  In addition, [sign] is divided into [lexical_sign] and [phrasal_sign].[8]  Except for the one-to-one correspondence between [phrasal_sign] and [syn_dtr] in terms of rank (both of which stand for non-atomic syntactic constructs, including phrases), neither of these HPSG binary divisions accounts for the distinction between a bound morpheme and a free morpheme.  Such a distinction is not necessary in HPSG because bound morphemes are assumed to be processed in a preprocessing stage (e.g. lexical rules for English inflection, Pollard and Sag 1987) and never appear as independent input to the parser.  As CPSG95 covers both derivation morphology and syntax in one integrated grammar, the HPSG binary divisions are no longer sufficient for formalizing the word definition.  ‘Word’ in CPSG95 needs to be distinguished with proper constraints not only from syntactic constructs, but also from affixes (bound morphemes).

In CPSG95, as productive derivation is designed to be an integrated component of the grammar, the word definition is both specified in the lexicon for some free morpheme words and assigned by the rules in morphological analysis.  This practice in essence follows one  suggestion in the original HPSG book:  “we might divide rules of grammar into two classes: rules of word formation, including compounding rules, which introduce the specification [LEX +] on the mother, and other rules, which introduce [LEX -] on the mother.” (Pollard and Sag 1987:73).

It is worth noticing that words thus defined can fill either a morphological position or a syntactic position.  This reflects the interface nature of word:  word is an eligible unit in both morphology and syntax.  This is in contrast to bound morphemes which can only be internal parts of morphology.

In morphology, derivation combines a word and an affix into a derived word.  These derivatives are eligible to feed morphology again.   This is shown above by the examples in (4-5) and (4-6).  The adjective word 可读 ke-du (read-able) is derived from the prefix morpheme 可 ke- (-able) and the word 读 du (read).  Like other adjective words, this derived word can further combine with the suffix 性 -xing (-ness) in morphology.  It can also directly enter syntax, as all words do.

To syntax, all words are atomic units.  If a lexical position in a syntactic pattern is specified via the macro constraint @word in CPSG95, it makes no difference whether the filler of this position is a listed grammar word or an unlisted word such as a derivative.  The distinction is transparent to the syntactic structure.

4.4. Summary

Efforts have been made to reach a better understanding of Chinese wordhood in theory, methodology and formalization.  The main spirit of the HPSG theory and Di Sciullo and Williams’ ‘syntactic atomicity’ theory has been applied to the study of Chinese wordhood and its formal representation.  Some effective wordhood judgment methods have also been proposed, based on theoretical guidelines.

The above work in the area of Chinese wordhood study provides a sound foundation for the analysis of the specific Chinese morpho-syntactic interface problems in Chapter V and Chapter VI.




[1] For Classical Chinese, word, morpheme, syllable and hanzi are presumably all co-extensive.  This is the so-called Monosyllabic Myth of Chinese (DeFrancis 1984: ch.8).  The development of large numbers of homophones, mainly due to the loss of coda stops, has led to the development of large quantities of bi-syllabic and poly-syllabic word-like expressions (Chen and Wang 1975).

[2] Classical Chinese arguably allows for a certain degree of compounding.  In the linguistic literature, some linguists (e.g. Sapir 1921; Zhang 1957; Jensen 1990) did not strictly distinguish Contemporary/Modern Chinese from Classical Chinese and held the general view that Chinese has little morphology except for limited compounding.  But this view of Contemporary Chinese has been criticized as a misconception (Dai 1993) and is no longer accepted by the community of Chinese grammarians.

[3] Di Sciullo and Williams call a sign listable in the lexicon listeme, equivalent to the notion vocabulary word.

[4] In the literature, variations of  this view include the Lexicalist position (Chomsky 1970), the Lexical Integrity Hypothesis (Jackendoff 1972), the Principle of Morphology-Free Syntax (Zwicky 1987), etc.

[5] This type of ‘atomicity’ constraint (Di Sciullo and Williams 1987) is generally known as Lexical Integrity Hypothesis (LIH, Jackendoff 1972), which states that syntactic rules or operations cannot refer to part of a word.  A more elaborate version of LIH is proposed by Zwicky (1987) as a Principle of Morphology-Free Syntax.  This principle states that syntactic rules cannot make reference to the internal morphological composition of words.  The only lexical properties accessible to syntax, according to Zwicky, are syntactic category, subcategory, and features like gender, case, person, etc.

[6] Of course, in theory a word may be separated by a morphological infix.  But except for the two modal signs de3 (can) and bu (cannot) (see Section 5.3 in Chapter V), there does not seem to be infixation in Mandarin Chinese.

[7] In terms of rank, [no_syn_dtr] in CPSG95 corresponds to the type [lexical_sign] in HPSG (Pollard and Sag 1987).  A binary division between [lexical_sign] and [phrasal_sign] is enough in HPSG to distinguish the atomic unit word from syntactic constructions.  But, as CPSG95 incorporates derivation in the general grammar, [no_syn_dtr] covers both free morphemes and bound morphemes.  That is why the [no_syn_dtr] constraint on [STRUCT] alone cannot define word in CPSG95;  constraints on the morphological expectation structures are needed as well, as shown in the macro definition.

[8] Note that there are [LEX -] signs which are not of the type [phrasal_sign].




PhD Thesis: Chapter III Design of CPSG95

3.0. Introduction

CPSG95 is the grammar designed to formalize the morpho-syntactic analysis presented in this dissertation.  This chapter presents the general design of CPSG95 with emphasis on three essential aspects related to the morpho-syntactic interface:  (i) the overall mono-stratal design of the sign;  (ii) the design of expectation feature structures;  (iii) the design of structural feature structures.

The HPSG-style mono-stratal design of the sign in CPSG95 provides a general framework for the information flow between different components of a grammar via unification.  Morphology, syntax and semantics are all accommodated in distinct features of a sign.  An example will be shown to illustrate the information flow between these components.

Expectation feature structures are designed to accommodate lexical information for the structural combination.  Expectation feature structures are vital to a lexicalized grammar like CPSG95.  The formal definition for the sort hierarchy [expected] for the expectation features will be given.  It will be demonstrated that the defined sort hierarchy provides means for imposing a proper structural hierarchy as defined by the general grammar.

One characteristic of the CPSG95 structural expectation is the unique design of morphological expectation features to incorporate Chinese productive derivation.  This design is believed to be a feasible and natural way of modeling Chinese derivation, as will be presented below and elaborated in section 3.2.1.  How this design benefits the interface coordination between derivation and syntax will be further demonstrated in Chapter VI.

The type [expected] for the expectation features is similar to the HPSG definitions of [subcat] and [mod]:  both accommodate lexical expectation information to drive the analysis conducted via the general grammar.  To meet requirements induced by introducing morphology into the general grammar and by accommodating linguistic characteristics of Chinese, three major modifications of standard HPSG are proposed in CPSG95.  They are:  (i) the CPSG95 type [expected] is generalized to cover productive derivation in addition to syntactic subcategorization and modification;  (ii) unlike HPSG, which captures word order phenomena as independent constraints, Chinese word order in CPSG95 is integrated into the definition of the expectation features and the corresponding morphological/syntactic relations;  (iii) in handling syntactic subcategorization, CPSG95 pursues a non-list alternative to the standard HPSG practice relying on the list encoding of the obliqueness hierarchy.  The rationale and arguments for these modifications are presented in the corresponding sections, with a brief summary given below.

The first modification is necessitated by meeting the needs of introducing Chinese productive derivation into the grammar.  It is observed that a Chinese affix acts as the head daughter of the derivative in terms of expectation (Dai 1993).  The expectation information that drives the analysis of a Chinese productive derivation is found to be capturable lexically by the affix sign;  this is very similar to how the information for the head-driven syntactic analysis is captured in HPSG.  The expansion of the expectation notion to include productive morphology can account for a wider range of linguistic phenomena.  The feasibility of this modification has been verified by the implementation of CPSG95 based on the generalized expectation feature structures.

One outstanding characteristic of all the expectation features designed in CPSG95 is that the word order information is implied in the definition of these features.[1]  Word order constraints in CPSG95 are captured by individual PS rules for the structural relationship between the constituents.  In other words, Chinese word order constraints are not treated as phenomena which have sufficient generalizations of themselves independent of the individual morphological or syntactic relations.  This is very different from the word order treatment in theories like HPSG (Pollard and Sag 1987) and GPSG (Gazdar, Klein, Pullum and Sag 1985).  However, a similar treatment can be found in the work from  the school of ‘categorial grammar’ (e.g. Dowty 1982).

The word order theory in HPSG and GPSG is based on the assumption that structural relations and syntactic roles can be defined without involving the factor of word order.  In other words, it is assumed that the structural nature of a constituent (subject, object, etc.) and its linear position in the related structures can be studied separately.  This assumption is found to be inappropriate in capturing Chinese structural relations.  So far, no one has been able to propose an operational definition for Chinese structural relations and morphological/syntactic roles without bringing in word order.[2]

As Ding (1953) points out, without the means of inflections and case markers, word order is a primary constraint for defining and distinguishing Chinese structural relations.[3]  In terms of expectation, it can always be lexically decided where the head sign should look for its expected daughter(s).  It is thus natural to define the expectation features directly in terms of the expected word order.

The reasons for the non-list design in capturing Chinese subcategorization can be summarized as follows:  (i) no attempt, including the initial effort in the CPSG95 experiment, has demonstrated that the obliqueness design can be applied to Chinese grammar with sufficient linguistic generalization;  (ii) the atomic approach, with a separate feature for each complement, has proven a feasible and flexible way of representing the relevant linguistic phenomena.

Finally, the design of the structural feature [STRUCT]  originates from [LEX + | -] in HPSG (Pollard and Sag 1987).  Unlike the binary type for [LEX], the type [struct] for [STRUCT] forms an elaborate sort hierarchy.  This is designed to meet the configurational requirements of introducing morphology into CPSG95.  This feature structure, together with the design of expectation feature structures, will help create a favorable framework for handling Chinese morpho-syntactic interface.  The proposed structural feature structure and the expectation feature structures contribute to the formal definition of linguistic units in CPSG95.  Such definitions enable proper lexical configurational constraints to be imposed on the expected signs when required.

3.1. Mono-stratal Design of Sign

This section presents the data structure involving the interface between morphology, syntax and semantics in CPSG95.  This is done by defining the mono-stratal design of the fundamental notion sign and by illustrating how different components, represented by the distinct features for the sign, interact.

As a dynamic unit of grammatical analysis, a sign can be a morpheme, a word, a phrase or a sentence.  It is the most fundamental object of HPSG-style grammars.  Formally, a sign is defined in CPSG95 by the type [a_sign], as shown below.[4]

(3-1.) Definition: a_sign

HANZI          hanzi_list
CONTENT        content
CATEGORY       category
SUBJ           expected
COMP0_LEFT     expected
COMP1_RIGHT    expected
COMP2_RIGHT    expected
MOD_LEFT       expected
MOD_RIGHT      expected
PREFIXING      expected
SUFFIXING      expected
STRUCT         struct

The type [a_sign] introduces a set of linguistic features for the description of a sign.  These are features for orthography, morphology, syntax and semantics, etc.[5]  The types, which are eligible to be the values of these features, have their own definitions in the sort hierarchy.  An introduction of these features follows.

The orthographic feature [HANZI] contains a list of Chinese characters (hanzi or kanji).  The feature [CONTENT] embodies the semantic representation of the sign.  [CATEGORY] carries values like [n] for noun, [v] for verb, [a] for adjective, [p] for preposition, etc.  The structural feature [STRUCT] contains information on the relation of the structure to its sub-constituents, to be presented in detail in section 3.3.

The features whose appropriate value must be the type [expected] are called expectation features.  They are the essential part of a lexicalist grammar as these features contain information about various types of potential structures in both syntax and morphology.  They specify various constraints on the expected daughter(s) of a sign for structural analysis.   The design of these expectation features and their appropriate type [expected] will be presented shortly in section 3.2.
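As an illustration, the type [a_sign] can be mirrored as a plain record.  The Python encoding below simplifies the sorted value types to strings and is not the actual ALE type declaration; the expectation values chosen for 读 du ‘read’ are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class ASign:
    hanzi: list          # HANZI: orthography, a list of characters
    content: dict        # CONTENT: semantic representation
    category: str        # CATEGORY: n, v, a, p, ...
    subj: str            # SUBJ and the five features below are expectation
    comp0_left: str      #   features, each of sort [expected] with subtypes
    comp1_right: str     #   such as saturated / optional / obligatory
    comp2_right: str
    mod_left: str
    mod_right: str
    prefixing: str       # PREFIXING: morphological expectation
    suffixing: str       # SUFFIXING: morphological expectation
    struct: str          # STRUCT: of sort [struct]

# 读 du 'read': a free-morpheme transitive verb (values are assumptions)
du = ASign(hanzi=["读"], content={"RELN": "read"}, category="v",
           subj="optional", comp0_left="saturated", comp1_right="optional",
           comp2_right="saturated", mod_left="saturated", mod_right="saturated",
           prefixing="saturated", suffixing="saturated", struct="no_syn_dtr")
print(du.category)  # v
```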

The definition of [a_sign] illustrates the HPSG philosophy of mono-stratal analysis interleaving different components.  As seen, different components of Chinese grammar are contained in different feature structures for the general linguistic unit sign.  Their interaction is effected via the unification of relevant feature structures during various stages of analysis.  This will unfold as the solutions to the morpho-syntactic interface problems are presented in Chapter V and Chapter VI.  For illustration, the prefix 可 ke (-able) is used as an example in the following discussion.

As is known, the prefix ke- (-able) makes an adjective out of a transitive verb:  ke- + Vt –> A.  This lexicalized rule is contained in the CPSG95 entry for the prefix ke-, shown in (3-2).  Following ALE notation, @ marks a macro, a shorthand for a pre-defined feature structure.[6]


As seen, the prefix ke- morphologically expects a sign with [CATEGORY vt].  Since an affix is analyzed as the head of a derivational structure in CPSG95 (see section 6.1 for discussion) and [CATEGORY] is a representative head feature percolated up to the mother sign via the corresponding morphological PS rule (formulated in (6-4) of section 6.2), this expectation eventually leads to a derived word with [CATEGORY a].  Like most Chinese adjectives, the derived adjective has an optional expectation for a subject NP, to account for sentences like 这本书很可读 zhe (this) ben (CLA) shu (book) hen (very) ke-du (read-able): ‘This book is very readable’.  This optional syntactic expectation of the derivative is accommodated in the head feature [SUBJ].

Note that before any structural combination with its expected signs, ke- is a bound morpheme:  a sign with an obligatory morphological expectation in [PREFIXING].  Ke- heads the morphological combination ke-+Vt, and the resulting derivative heads the potential syntactic combination NP+[ke-+Vt];  the interface between morphology and syntax in this case lies in the hierarchical structure that must be imposed.   That is, the morphological structure (derivation) must be established before the expected syntactic structure can be realized.  Such a configurational constraint is specified in the corresponding PS rules, i.e. the Subject PS Rule and the Prefix PS Rule.  It guarantees that the obligatory morphological expectation of ke- must be saturated before the sign can legitimately enter syntactic combination.

The interaction between morphology/syntax and semantics in this case is encoded by the information flow, i.e. structure-sharing, indicated by the number indices in square brackets, between the corresponding feature structures inside the sign.  The semantic compositionality involved in the morphological and syntactic grouping is represented as follows.  There is a semantic predicate marked as [-able] (for worthiness) in the content feature [RELN];  this predicate has an argument co-indexed by [1] with the semantics of the expected Vt.  Note that the syntactic subject of the derived adjective, say ke-du (read-able) or ke-chi (eat-able), is the semantic (or logical) object of the stem verb, co-indexed by [2] in the sample entry above.  The head feature [CONTENT], which reflects the semantic compositionality, is percolated up to the mother sign when the applicable morphological and syntactic PS rules take effect in structure building.

In summary, CPSG95 embodies a mono-stratal grammar of morphology and syntax within one formalism.  Both morphology and syntax use the same data structure (typed feature structures) and the same mechanisms (unification, sort hierarchy, PS rules, lexical rules, macros).   This design for Chinese grammar is original and has been shown to be feasible in the CPSG95 experiments on various Chinese constructions.  Its advantages for handling morpho-syntactic interface problems will be demonstrated throughout this dissertation.

3.2. Expectation Feature Structures

This section presents the design of the expectation features in CPSG95.  In general, the expectation features contain information about various types of potential structures of the sign.  In CPSG95, various constraints on the expected daughter(s) of a sign are specified in the lexicon to drive both morphological and syntactic structural analysis.  This provides a favorable basis for interleaving Chinese morphology and syntax in analysis.

The expected daughter in CPSG95 is defined as one of the following grammatical constituents:  (i) subject in the feature [SUBJ];  (ii) first complement in the feature [COMP0_LEFT] or [COMP1_RIGHT];  (iii) second complement in [COMP2_RIGHT];   (iv) head of a modifier in the feature [MOD_LEFT] or [MOD_RIGHT];   (v) stem of an affix in the feature [PREFIXING] or [SUFFIXING].[7]  The first four are syntactic daughters, to be investigated in sections 3.2.2 and 3.2.3.  The last is the morphological daughter for affixation, to be presented in section 3.2.1.  All these features are defined on the basis of the relative word order of the constituents in the structure.  The hierarchy of the structure at issue relies on the configurational constraints to be presented in section 3.2.4.

3.2.1. Morphological Expectation

One key characteristic of the CPSG95 expectation features is the design of morphological expectation features to incorporate Chinese productive derivation.

It is observed that a Chinese affix acts as the head daughter of the derivative in terms of expectation (see section 6.1 for more discussion).   An affix can lexically define what stem to expect and can predict the derivation structure to be built.  For example, the suffix 性 –xing demands that it combine with a preceding adjective to make an abstract noun, i.e. A+-xing –> N.  This type of information can be easily captured by the expectation feature structure in the lexicon, following the practice of the HPSG treatment of the syntactic expectation such as subcategorization and modification.

In the CPSG95 lexicon, each affix entry is encoded to provide the following derivation information:   (i) what type of stem it expects;  (ii) whether it is a prefix or a suffix, which determines where to look for the expected stem;  (iii) what type of (derived) word it produces.  Based on this lexical information, the general grammar only needs two PS rules for Chinese derivation:  one for prefixation, one for suffixation.  These rules will be formulated in Chapter VI (sections 6.2 and 6.3).  It will also be demonstrated that this lexicalist design for Chinese derivation works both for typical cases of affixation and for difficult cases such as ‘quasi-affixation’ and zhe-suffixation.
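The three pieces of lexical information above can be sketched schematically as follows. This is a minimal illustrative encoding in Python, not the actual CPSG95 typed-feature-structure notation; the dictionary keys and category labels are our own assumptions, chosen only to mirror the -xing and di- examples discussed in this thesis.

```python
# Illustrative sketch of affix entries: each records (i) the expected
# stem type, (ii) prefix vs. suffix (i.e. where to look for the stem),
# and (iii) the category of the derived word.

AFFIX_LEXICON = {
    # suffix 性 -xing:  A + -xing -> N  (abstract noun)
    "-xing": {"expects": "a",   "position": "suffix", "produces": "n"},
    # prefix 第 di-:    di- + NUM -> ordinal
    "di-":   {"expects": "num", "position": "prefix", "produces": "ordinal"},
}

def derive(affix, stem_category):
    """Schematic stand-in for the single prefixation/suffixation PS rule:
    the combination succeeds iff the stem matches the affix's expectation."""
    entry = AFFIX_LEXICON[affix]
    if stem_category == entry["expects"]:
        return entry["produces"]
    return None  # expectation not met: no derivative is built
```

On this sketch, `derive("-xing", "a")` yields `"n"`, while `derive("-xing", "v")` fails, reflecting the lexicon-driven nature of the two general derivation rules.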

In summary, the morphological combination for productive derivation in CPSG95 is designed to be handled by only two PS rules in the general grammar, based on the lexical specification in [PREFIXING] and [SUFFIXING].  Essentially, in CPSG95, productive derivation is treated like a ‘mini-syntax’;[8]  it becomes an integrated part of Chinese structural analysis.

3.2.2. Syntactic Expectation

This section presents the design of the expectation features to represent Chinese syntactic relations.  It will be demonstrated that constraints like word order and function words are crucial to the formalization of syntactic relations.  Based on them, four types of syntactic relations can be defined, which are accommodated in six syntactic expectation feature structures for each head word.

There is no general agreement on how to define Chinese syntactic relations.  In particular, the distinction between Chinese subject and object has been a long-debated topic (e.g. Ding 1953; L. Li 1986, 1990; Zhu 1985; Lü 1989).  The major difficulty lies in the fact that Chinese does not have inflection to indicate subject-verb agreement, nominative case or accusative case, etc.

Theory-internally, there have been various proposals that Chinese syntactic relations be defined on the basis of one or more of the following factors:  (i) word order (more precisely, constituent order);  (ii) the function words associated with the constituents;  (iii) the semantic relations or roles.  The first two factors are linguistic forms while the third factor belongs to linguistic content.

L. Li (1986, 1990) relies mainly on the third factor to study Chinese verb patterns.  The constituents in his proposal are named NP-agent (ming-shi), NP-patient (ming-shou), etc.  This practice amounts to placing an equal sign between the syntactic relation and the semantic relation.  It implies that the syntactic relation is not an independent feature, which makes syntactic generalization difficult.

Other Chinese grammarians (e.g. Ding 1953; Zhu 1985) emphasize the factor of word order in defining syntactic relations.  This school insists that syntactic relations be differentiated from semantic relations.  More precisely, semantic relations should be the result of the analysis of syntactic relations.  That is also the rationale behind the CPSG95 practice of using word order and other constraints (including function words) in the definition of Chinese relations.

In CPSG95, the expected syntactic daughter is defined as one of the following grammatical constituents:  (i) subject in the feature [SUBJ], typically an NP in the leftmost position relative to the head;  (ii) the first complement in the feature [COMP0_LEFT] or [COMP1_RIGHT], in the form of an NP or a specific PP;  (iii) the second complement in [COMP2_RIGHT]: this complement is defined to be an XP (NP, a specific PP, VP, AP, etc.) farther away from the head than [COMP1_RIGHT] in word order;  (iv) head of a modifier in the feature [MOD_LEFT] or [MOD_RIGHT].  Within this defined framework of four types of possible syntactic relations, for each head word, the lexicon is expected to specify the constraints in the corresponding expectation feature structures and to map the syntactic constituents to the corresponding semantic roles in [CONTENT].  This is a secure way of linking syntactic structures and their semantic composition for the following reason:  given a specific head word and a syntactic structure with its various constraints specified in the expectation feature structures, the decoding of semantics is guaranteed.[9]

A Chinese syntactic pattern can usually be defined by constraints on category, word order, and/or function words (W. Li 1996).  For example, NP+V, NP+V+NP, NP+PP(x)+V, NP+V+NP+NP, NP+V+NP+VP, etc. are all such patterns.  With the design of the expectation features presented above, these patterns can easily be formulated in the lexicon under the relevant head entry, as demonstrated by the sample formulations given in (3-3) and (3-4).



The structure in (3-3) is a Chinese transitive pattern in its default word order, namely NP1+Vt+NP2.  The representation in (3-4) is another transitive pattern NP+PP(x)+Vt.  This pattern requires a particular preposition x to introduce its object before the head verb.
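The gist of such lexical formulations can be sketched as follows. This is a hypothetical Python rendering, not the thesis's actual feature-structure notation: the feature names follow section 3.2.2, but the dictionary layout, the `status` values and the role labels (`agent`, `patient`) are our own illustrative assumptions.

```python
# Sketch of two transitive patterns for a head verb, echoing (3-3)/(3-4).

# Default order NP1 + Vt + NP2: subject to the left of the head,
# object to the right, housed in COMP1_RIGHT.
PATTERN_DEFAULT = {
    "SUBJ":        {"status": "obligatory", "sign": "np",    "role": "agent"},
    "COMP1_RIGHT": {"status": "obligatory", "sign": "np",    "role": "patient"},
    "COMP0_LEFT":  {"status": "null"},
    "COMP2_RIGHT": {"status": "null"},
}

# NP + PP(x) + Vt: the object is introduced by a particular preposition x
# before the head verb, so it is housed in COMP0_LEFT instead.
PATTERN_PP_OBJECT = {
    "SUBJ":        {"status": "obligatory", "sign": "np",    "role": "agent"},
    "COMP0_LEFT":  {"status": "obligatory", "sign": "pp(x)", "role": "patient"},
    "COMP1_RIGHT": {"status": "null"},
    "COMP2_RIGHT": {"status": "null"},
}
```

The point of the sketch is that the two patterns differ only in which expectation feature houses the object and what constraints (NP versus a specific PP) are placed on it; everything else is shared.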

The sample entry in (3-5) is an example of how modification is represented in CPSG95.  Following the HPSG semantics principle, the semantic content from the modifier will be percolated up to the mother sign from the head-modifier structure via the corresponding PS rule.  The added semantic contribution of the adverb chang-chang (often) is its specification of the feature [FREQUENCY] for the event at issue.


3.2.3. Chinese Subcategorization

This section presents the rationale behind the CPSG95 design for subcategorization.  Instead of a SUBCAT list, a keyword approach with a separate feature for each complement is chosen for representing the subcategorization information, as shown in the corresponding expectation features in section 3.2.2.  This design has proven to be a feasible alternative to the standard HPSG practice of relying on a list ordered by the obliqueness hierarchy and the SUBCAT Principle when handling the subject and complements.

The CPSG95 design for representing subcategorization follows one proposal from Pollard and Sag (1987:121), who point out:  “It may be possible to develop a hybrid theory that uses the keyword approach to subjects, objects and other complements, but which uses other means to impose a hierarchical structure on syntactic elements, including optional modifiers not subcategorized for in the same sense.”  There are two issues for such a hybrid theory:  the keyword approach to representing subject and complements and the means for imposing a hierarchical structure.  The former is discussed below while the latter will be addressed in the subsequent section 3.2.4.

The basic reason for abandoning the list design is the lack of an operational definition of obliqueness which captures generalizations of Chinese subcategorization.  In the English version of HPSG (Pollard and Sag 1987, 1994), the obliqueness ordering is established between the syntactic notions of subject, direct object and second object (or oblique object).[10]  But these syntactic relations themselves are by no means universal.  In order to apply this concept to the Chinese language, an operational definition of obliqueness applicable to Chinese syntactic relations is needed.  Such a definition has not been available.

In fact, how to define the Chinese subject, object and other complements has been one of the most debated topics among Chinese grammarians for decades (Lü 1946, 1989; Ding 1953; L. Li 1986, 1990; Zhu 1985; P. Chen 1994).  No general agreement on an operational, cross-theory definition of Chinese subcategorization has been reached.  Formal or informal definitions of Chinese subcategorization are often given within a particular theory or grammar.  But so far no Chinese syntactic relations defined in any theory have been shown to demonstrate convincing advantages of a possible obliqueness ordering, i.e. capturing the various syntactic generalizations for Chinese.

Technically, however, as long as subject and complements are formally defined in a theory, one can impose an ordering of them in a SUBCAT list.  But if such a list does not capture significant generalizations, there is no point in doing so.[11]  It has turned out that the keyword approach is a promising alternative once proper means are developed for the required configurational constraint on structure building.

The keyword approach is realized in CPSG95 as follows.  Syntactic constituents for subcategorization, namely subject and complements, are directly accommodated in four parallel features [SUBJ], [COMP0_LEFT], [COMP1_RIGHT] and [COMP2_RIGHT].

The feasibility of the keyword approach proposed here has been tested during the implementation of CPSG95 in representing a variety of structures.  Particular attention has been given to the constructions or patterns related to Chinese subcategorization.  They include various transitive structures, di-transitive structures, pivotal construction (jianyu-shi), ba-construction (ba-zi ju), various passive constructions (bei-dong shi), etc.  It is found to be easy to  accommodate all these structures in the defined framework consisting of the four features.

We give a couple of typical examples below, in addition to the ones in (3-3) and (3-4) formulated before, to show how various subcategorization phenomena are accommodated in the CPSG95 lexicon within the defined feature structures for subcategorization.  The expected structure and example are shown before each sample formulation in (3‑6) through (3-8) (with irrelevant implementation details left out).



Based on such lexical information, the desirable hierarchical structure on the related syntactic elements, e.g. [S [V O]] instead of [[S V] O], can be imposed via the configurational constraint based on the design of the expectation type.  This is presented in section 3.2.4 below.

3.2.4. Configurational Constraint

The means by which the configurational constraint imposes the desirable hierarchical morpho-syntactic structure defined by a grammar is the key to the success of a keyword approach to structural constituents, including the subject and complements from subcategorization.  This section defines the sort hierarchy of the expectation type [expected].  The use of this design for flexible configurational constraints, both in the general grammar and in the lexicon, will be demonstrated.

As presented before, whether a sign has structural expectation, and what type of expectation a sign has, can be lexically decided:  they form the basis for a lexicalized grammar.  Four basic cases for expectation are distinguished in the expectation type of CPSG95:  (i) obligatory: the expected sign must occur;  (ii) optional:  the expected sign may occur;  (iii) null:  no expectation;  (iv) satisfied: the expected sign has occurred.  Note that cases (i), (ii) and (iii) are static information while (iv) is dynamic information, updated at the time when the daughters are combined into a mother sign.  In other words, case (iv) is only possible when the expected structure has actually been built.  In HPSG-style grammars, only the general grammar, i.e. the set of PS rules, has the power of building structures.  For each structure being built, the general grammar will set the corresponding expectation feature of the mother sign to [satisfied].

Out of the four cases, case (i) and case (ii) form a natural class, named [a_expected];  case (iii) and case (iv) form another class, named [saturated].  The formal definition of the type [expected] is given in (3-9).

(3-9.) Definition: sorted hierarchy for [expected]

expected: {a_expected, saturated}
    a_expected: {obligatory, optional}
        ROLE role
        SIGN a_sign
    saturated: {null, satisfied}

The type [a_expected] introduces two features:  [ROLE] and [SIGN].   [ROLE] specifies the semantic role which the expected sign plays in the structure.  [SIGN] houses various types of constraints on the expected sign.
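The sort hierarchy in (3-9) can be mimicked in ordinary class inheritance, where subclassing plays the role of type subsumption. The following sketch is our own device for illustration, not the thesis's typed-feature-structure implementation; the class names simply transliterate the types in (3-9).

```python
# Sketch of the sort hierarchy for [expected] using Python classes:
# the subclass relation models subsumption between sorts.

class Expected:
    pass

class AExpected(Expected):
    """a_expected: carries the two features ROLE and SIGN."""
    def __init__(self, role, sign):
        self.role = role     # semantic role the expected sign plays
        self.sign = sign     # constraints on the expected sign

class Obligatory(AExpected): pass   # the expected sign must occur
class Optional(AExpected): pass     # the expected sign may occur

class Saturated(Expected): pass
class Null(Saturated): pass         # no expectation (static, lexical)
class Satisfied(Saturated): pass    # expectation consumed (set by a PS rule)

def subsumes(sort, value):
    """A sort constraint is met iff the value's type is at or below it."""
    return isinstance(value, sort)
```

Under this encoding, a constraint such as [saturated] is met by both [null] and [satisfied] values, but not by [obligatory] or [optional] ones, which is exactly the behaviour the configurational constraints below rely on.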

The type [expected] is designed to meet the requirement of the configurational constraint.  For example, in order to guarantee that syntactic structures for an expecting sign are built on top of its morphological structures if the sign has obligatory morphological expectation, the following configurational constraint is enforced in the general grammar.  (The notation | is used for logical OR.)

(3-10.)         configurational constraint in syntactic PS rules

PREFIXING                    saturated | optional
SUFFIXING                    saturated | optional

The constraint [saturated] means that syntactic rules are permitted to apply if a sign has no morphological expectation or after the morphological expectation has been satisfied.  The reason why the case [optional] does not block the application of syntactic rules is the following.  Optional expectation entails that the expected sign may or may not appear.  It does not have to be satisfied.
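The gate in (3-10) can be made concrete with a small predicate. This is an illustrative simplification under our own flat string encoding of the expectation values (the real constraint is checked by unification against the sort hierarchy, not by string comparison).

```python
# Sketch of the configurational constraint (3-10): a syntactic PS rule
# may apply to a sign only if each morphological expectation feature is
# saturated (null or satisfied) or optional.

SATURATED_OR_OPTIONAL = ("null", "satisfied", "optional")

def syntax_may_apply(sign):
    """sign maps feature names to status strings; missing features
    default to 'null' (no expectation)."""
    return all(sign.get(feat, "null") in SATURATED_OR_OPTIONAL
               for feat in ("PREFIXING", "SUFFIXING"))
```

Thus a verb whose obligatory suffixation is still pending blocks all syntactic rules, while a sign with no morphological expectation, or only an optional one, is free to enter syntax.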

Similarly, within syntax, the constraints can be specified in the Subject PS Rule:

(3-11.)         configurational constraint in Subject PS rule

COMP0_LEFT                 saturated | optional
COMP1_RIGHT              saturated | optional
COMP2_RIGHT              saturated | optional

This ensures that complement rules apply before the subject rule does.  This way of imposing a hierarchical structure between subcategorized elements corresponds to the use of SUBCAT Principle in HPSG based on the notion of obliqueness.

The configurational constraint is also used in CPSG95 for the formal definition of phrase, as formulated below.

phrase macro

PREFIXING saturated | optional
SUFFIXING saturated | optional
COMP0_LEFT saturated | optional
COMP1_RIGHT saturated | optional
COMP2_RIGHT saturated | optional

Despite the notational difference, this definition follows the spirit reflected in the phrase definition given in Pollard and Sag (1987:69) in terms of the saturation status of the subcategorized complements.  In essence, the above definition says that a phrase is a sign whose morphological expectation and syntactic complement expectation (except for subject) are both saturated.  The reason to include [optional] in the definition is to cover phrases whose head daughter has optional expectation, for example, a verb phrase consisting of just a verb with its optional object omitted in the text.
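The phrase macro amounts to a simple predicate over the five expectation features. The following sketch again uses our own flat string encoding (an assumption; the actual macro is a typed-feature-structure constraint), and notably leaves [SUBJ] out of the check, as the definition requires.

```python
# Sketch of the @phrase macro: a sign is a phrase iff its morphological
# expectations and all complement expectations (but not necessarily the
# subject) are saturated or optional.

PHRASE_FEATURES = ("PREFIXING", "SUFFIXING",
                   "COMP0_LEFT", "COMP1_RIGHT", "COMP2_RIGHT")

def is_phrase(sign):
    return all(sign.get(feat, "null") in ("null", "satisfied", "optional")
               for feat in PHRASE_FEATURES)
```

For example, a verb with its optional object omitted still counts as a phrase (a VP of one word), while a transitive verb with its obligatory object unconsumed does not.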

Together with the design of the structural feature [STRUCT] (section 3.3), the sort hierarchy of the type [expected] will also enable the formal definition for the representation of the fundamental notion word (see Section 4.3 in Chapter IV).  Definitions such as @word and @phrase are the basis for lexical configurational constraints to be imposed on the expected signs when required.  For example, -xing (-ness) will expect an adjective stem with the word constraint and -zhe (-er) can impose the phrase constraint on the expected verb sign based on the analysis proposed in section 6.5.

3.3. Structural Feature Structure

The design of the feature [STRUCT] serves important structural purposes in the formalization of the CPSG95 interface between morphology and syntax.  It is necessary to present the rationale of this design and the sort hierarchy of the type [struct] used in this feature.

The design of [STRUCT struct] originates from the binary structural feature structure [LEX + | -] in the original HPSG theory (Pollard and Sag 1987).  However, in the CPSG95 definition, the type [struct] forms an elaborate sort hierarchy.   It is divided into two types at the top level:  [syn_dtr] and [no_syn_dtr].  A sub-type of [no_syn_dtr] is [no_dtr].  The CPSG95 lexicon encodes the feature [STRUCT no_dtr] for all single morphemes.[12]  Another sub-type of [no_syn_dtr] is [affix] (for units formed via affixation) which is further sub-typed into [prefix] and [suffix], assigned by the Prefix PS rule and Suffix PS Rule.  In syntax, [syn_dtr] includes sub-types like [subj], [comp] and [mod].  Despite the hierarchical depth of the type, it is organized to follow the natural classification of the structural relation involved.  The formal definition is given below.

(3-12.)         Definition: sorted hierarchy for [struct]

struct: {syn_dtr, no_syn_dtr}
    syn_dtr: {subj, comp, mod}
        comp: {comp0_left, comp1_right, comp2_right}
        mod: {mod_left, mod_right}
    no_syn_dtr: {no_dtr, affix}
        affix: {prefix, suffix}

In CPSG95, [STRUCT] is not a (head) feature which percolates up to the mother sign;  its value is solely decided by the structure being built.[13]   Each PS rule, whether syntactic or morphological, assigns the value of the [STRUCT] feature for the mother sign, according to the nature of combination.  When morpheme daughters are combined into a mother sign word, the value of the feature [STRUCT] for the mother sign remains a sub-type of [no_syn_dtr].  But when some syntactic rules are applied, the rules will assign the value to the mother sign as a sub-type of [syn_dtr] to show that the structure being built is a syntactic construction.

The design of the feature structure [STRUCT struct] is motivated by the new requirement caused by introducing morphology into the general grammar of  CPSG95.  In HPSG, a simple, binary type for [LEX] is sufficient to distinguish lexical signs, i.e. [LEX +], from signs created via syntactic rules, i.e. [LEX -].  But in CPSG95, as presented in section 3.2.1 before, productive derivation is also accommodated in the general grammar.  A simple distinction between a lexical sign and a syntactic sign cannot capture the difference between signs created via morphological rules and signs created via syntactic rules.  This difference plays an essential role in formalizing the morpho-syntactic interface, as shown below.

The following examples demonstrate the structural representation enabled by the design of the feature [STRUCT].  In the CPSG95 lexicon, single Chinese characters like the prefix ke- (-able) and the free morphemes du (read) and bao (newspaper) are all coded as [STRUCT no_dtr].   When the Prefix PS Rule combines the prefix ke- and the verb du into the adjective ke-du, the rule assigns [STRUCT prefix] to the newly built derivative.  The structure may remain in the domain of morphology, as the value [prefix] is a sub-type of [no_syn_dtr].  However, when this structure is further combined with a subject, say, bao (newspaper), by the syntactic Subj PS Rule, the resulting structure [bao [ke-du]] (‘Newspapers are readable’) is syntactic, having [STRUCT subj] assigned by the Subj PS Rule;  in fact, this is a simple sentence.  Similarly, the syntactic Comp1_right PS Rule can combine the transitive verb du (read) and the object bao (newspaper), assigning [STRUCT comp1_right] to the unit du bao (read newspapers).  In general, when signs whose [STRUCT] value is a sub-type of [no_syn_dtr] combine into a unit whose [STRUCT] is assigned a sub-type of [syn_dtr], it marks the jump from the domain of morphology to syntax.  This is the way the interface of Chinese morphology and syntax is formalized in the present formalism.
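The ke-du walkthrough can be traced step by step in a toy form. This sketch is an assumption-laden simplification (signs reduced to a form string plus a [STRUCT] value, rules reduced to string concatenation); its only purpose is to show how each PS rule stamps the mother's [STRUCT] value and where the morphology-to-syntax jump happens.

```python
# Toy trace of the ke-du example: each PS rule assigns the mother sign's
# STRUCT value; moving from a no_syn_dtr value to a syn_dtr value marks
# the jump from morphology to syntax.

NO_SYN_DTR = {"no_dtr", "prefix", "suffix"}          # morphological values

def prefix_rule(prefix_sign, stem_sign):
    """Prefix PS Rule: builds a derivative, stamped [STRUCT prefix]."""
    return {"form": prefix_sign["form"] + "-" + stem_sign["form"],
            "STRUCT": "prefix"}

def subj_rule(subj_sign, head_sign):
    """Subj PS Rule: builds a subject-predicate structure, [STRUCT subj]."""
    return {"form": subj_sign["form"] + " " + head_sign["form"],
            "STRUCT": "subj"}

# lexical morphemes, all [STRUCT no_dtr]
ke  = {"form": "ke",  "STRUCT": "no_dtr"}   # prefix 可 ke- (-able)
du  = {"form": "du",  "STRUCT": "no_dtr"}   # verb 读 du (read)
bao = {"form": "bao", "STRUCT": "no_dtr"}   # noun 报 bao (newspaper)

ke_du    = prefix_rule(ke, du)    # morphology: still within no_syn_dtr
sentence = subj_rule(bao, ke_du)  # syntax: [bao [ke-du]], a syn_dtr value
```

Checking whether a unit's [STRUCT] value lies inside or outside NO_SYN_DTR is precisely the formal test for which side of the morpho-syntactic interface it sits on.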

The use of this feature structure in the definition of Chinese word will be presented in Chapter IV.  Further advantages and flexibility of the design of this structural feature structure and the expectation feature structures will be demonstrated in later chapters in presenting solutions to some long-standing problems at the morpho-syntactic interface.

3.4. Summary

This chapter has addressed the major design issues for the proposed mono-stratal Chinese grammar CPSG95.  This provides a framework and means for formalizing the analysis of the linguistic problems at the morpho-syntactic interface.  It has been shown that the design of the CPSG95 expectation structures enables configurational constraints to be imposed on the structure hierarchy defined by the grammar.  This makes the keyword approach to Chinese subcategorization a feasible alternative to the list design based on the obliqueness hierarchy of subject and complements.

Within this defined framework of CPSG95, the subsequent Chapter IV will be able to formulate the system-internal, but strictly formalized definition of Chinese word.  Formal definitions such as @word and @phrase enable proper configurational constraints to be imposed on the expected signs when required.  This lays a foundation for implementing the proposed solutions to the morpho-syntactic interface problems to be explored in the remaining chapters.



[1] More precisely, it is not ‘word’ order, it is constituent order, or linear precedence (LP) constraint between constituents.

[2]  L. Li’s (1986, 1990) definition of structural constituents does not involve word order.  However, his proposed definition is not an operational one from the angle of natural language processing.  He relies on the decoding of semantic roles for the definitions of the proposed constituents like NP-agent (ming-shi), NP-patient (ming-shou), etc.  Nevertheless, his proposal has been reported to produce good results in the field of Chinese language teaching.  This seems understandable because the process of decoding semantic roles is naturally and subconsciously conducted in the minds of language instructors/learners.

[3] Most linguists agree that Chinese has no inflectional morphology (e.g. Hockett 1958; Li and Thompson 1981; Zwicky 1987; Sun and Cole 1991).  The few linguists who believe that Chinese has developed or is developing inflectional morphology include Bauer (1988) and Dai (1993).  Typical examples cited as Chinese inflectional morphemes are the aspect markers le, zhe, guo and the plural marker men.

[4] A note for the notation: uppercase is used for feature and lowercase, for type.

[5] Phonology and discourse are not yet included in the definition.  The latter is a complicated area which requires further research before it can be properly integrated in the grammar analysis.  The former is not necessary because the object for CPSG95 is Written Chinese.   In the few cases where phonology affects structural analysis, e.g. some structural expectation needs to check the match of number of syllables, one can place such a constraint indirectly by checking the number of Chinese characters instead (as we know, a syllable roughly corresponds to a Chinese character or hanzi).

[6] The macro constraint @np in (3-2) is defined to be [CATEGORY n] and a call to another macro constraint @phrase to be defined shortly in Section 3.2.4.

[7] These expectation features defined for [a_sign] are a maximum set of possible expected daughters;  any specific sign may only activate a subset of them, represented by non-null value.

[8] This is similar to viewing morphology as ‘the syntax of words’ (Selkirk 1982; Lieber 1992; Krieger 1994).  It seems that at least affixation shares with syntax similar structural constraints on constituency and linear ordering in Chinese.  The same type of mechanisms (PS rules, typed feature structure for expectation, etc) can be used to capture both Chinese affixation and syntax (see Chapter VI).

[9] More precisely, the decoding of possible ways of semantic composition is guaranteed.  Syntactically ambiguous structures with the same constraints correspond to multiple ways of semantic compositionality.  These are expressed as different entries in the lexicon and the link between these entries is via corresponding lexical rules, following the HPSG practice. (W. Li 1996)

[10]  Borsley (1987) has proposed an HPSG framework where the subject is posited as a feature distinct from the other complements.  Pollard and Sag (1994:345) point out that “the overwhelming weight of evidence favors Borsley’s view of this matter”.

[11] The only possible benefit of such an arrangement is that one can continue using the SUBCAT Principle for building complement structures via list cancellation.

[12] It also includes idioms whose internal morphological structure is unknown or has no grammatical relevance.

[13] The reader might have noticed that the assigned value is the same as the name of the PS rule which applies.  This is because there is a correspondence between what type of structure is being built and what PS rule is building it.  Thus, the [STRUCT] feature actually records rule application information.  For example, [STRUCT subj] reflects the fact that the Subj PS Rule is the most recent rule applied to the structure at issue;  a structure built via the Prefix PS Rule has [STRUCT prefix] in place; etc.  This practice gives the extra benefit of ‘tracing’ which rules have been applied when debugging the grammar.  If no rule has ever applied to a sign, it must be a morpheme carrying [STRUCT no_dtr] from the lexicon.




I should thank my four-year-old daughter, Tian Tian. I feel sorry for not being able to spend more time with her. What has supported me all these years is the idea that some day she will understand that as a first-generation immigrant, her dad has managed to overcome various challenges in order to create a better environment for her to grow.
PhD Thesis Dedication
To my daughter Tian Tian
whose babbling accompanied and inspired the writing of this work

I still remember being in tears as I wrote this while giving the thesis its final touch.

I am now building a Chinese deep parser, which has already grown to a considerable scale. This is a good occasion to look back and compare the line of thinking from 20 years ago with the practice of 20 years later. After leaving school for industrial development, I abandoned the thesis's approach to automatic parsing without hesitation, even though I had argued for it so eloquently as a PhD student. More precisely, it was critical inheritance: some things were discarded, some retained. What was discarded is the single-level CFG; what was retained is the seamless integration of morphology and syntax. This shift reflects the distance between theory and practice, and the relationship between academia and industry.

When I was doing my PhD, unification systems were at the height of their popularity. Following my advisor, I used HPSG on a Prolog platform to conduct a bidirectional MT experiment with a single Chinese grammar (the same Chinese grammar was used for both analysis and generation, supporting Chinese-English machine translation in both directions). It was a toy. When it came time to write the dissertation, I had to keep narrowing the range of phenomena I had worked on, finally focusing on the interface between Chinese morphology (including word segmentation) and syntax. The whole dissertation argues a single idea: word segmentation, morphology and syntax must be integrated into one process, using single-level CFG parsing, and it argues the case quite persuasively.

Integration certainly holds in theory, because the interdependencies among linguistic phenomena can only be handled well in an integrated framework. Even if 90% of the phenomena are not interdependent and could be handled separately, you can always use the remaining 10% to prove the correctness of integration (which in theory does no harm to the other 90%).

And 20 years later? Forget it. I abandoned the single-level integrated approach long ago. It is a dead end: fine for a toy, but hard to scale up, hard to deepen, and unable to support real-world systems. What I inherited is the integrated communication channel and a sleeping/wake-up style patching mechanism. I would rather patch and mend than pursue the formal perfection of a grammar.

Those curious about HPSG, or interested in how HPSG can be applied to Chinese, may want to look at the PhD thesis I have put together here, even though the formalism has gone out of fashion. I remember that about half a year ago Professor Feng Zhiwei compiled and published a series of lectures introducing HPSG. A reader asked: how can it be applied to Chinese? For a so-called theoretical formalism that involves a series of theoretical assumptions and technical details, reading about it without working through it once is basically like viewing flowers in the fog. Unification and typed data structures look beautiful logically, and they are fun to work with, but after doing it once I washed my hands of them. Those who have played with Prolog may have a similar feeling.

I have decided to take the syntactically challenging examples listed in my thesis and parse them all as unit tests, to see whether the system, built on a changed design philosophy, can still capture these linguistic phenomena.











The case of 头羊 tou yang ('lead sheep'; similar cases include 个人 ge ren and 难过 nan guo) involves the so-called hidden ambiguity of word segmentation, because it directly violates the longest-match principle. It is a pain point of Chinese word segmentation and also strong evidence for integration. In theory, any segmentation ambiguity (not only hidden ambiguity) can be finally resolved only by taking the whole sentence into account; local context always has loopholes, and one can always construct a context in which a local decision fails. In practice, however, local and global decisions can largely be separated, and there is no need to carry segmentation ambiguities all the way to the end. A hidden ambiguity that does not affect the overall analysis can be put to sleep, as in the example above, and be woken up when necessary by a word-driven module after syntactic parsing.




PhD Thesis: Chapter II Role of Grammar


2.0. Introduction

This chapter examines the role of grammar in handling the three major types of morpho-syntactic interface problems.  This investigation justifies the mono-stratal design of CPSG95, which contains feature structures for both morphology and syntax.

The major observation from this study is:  (i) grammatical analysis, including both morphology and syntax, plays the fundamental role in contributing to the solutions of the morpho-syntactic problems;  (ii)  when grammar alone is not sufficient to reach the final solution, knowledge beyond morphology and syntax may come into play and serve as “filters” based on the grammatical analysis results.[1]  Based on this observation, a study in the direction of interleaving morphology and syntax will be pursued in the grammatical analysis.  Knowledge beyond morphology and syntax is left to future research.

Section 2.1 investigates the relationship between grammatical analysis and  the resolution of segmentation ambiguity.  Section 2.2 studies the role of syntax in handling Chinese productive word formation.  The borderline cases and their relationship with grammar are explored in 2.3.  Section 2.4 examines the relevance of knowledge beyond syntax to segmentation disambiguation.  Finally, a summary of the presented arguments and discoveries is given in 2.5.

2.1. Segmentation Ambiguity and Syntax

Segmentation ambiguity is one major problem which challenges the traditional word segmenter or an independent morphology.  The following study shows that this ambiguity is structural in nature, not fundamentally different from other structural ambiguity in grammar.  It will be demonstrated that sentential structural analysis is the key to this problem.

A huge amount of research effort has been devoted in the last decade to resolving segmentation ambiguity (e.g. Chen and Liu 1992; Gan 1995; He, Xu and Sun 1991; Liang 1987; Lua 1994; Sproat, Shih, Gale and Chang 1996; Sun and T’sou 1995; Sun and Huang 1996; X. Wang 1989; Wu and Su 1993; Yao, Zhang and Wu 1990; Yeh and Lee 1991; Zhang, Chen and Chen 1991; Guo 1997b).  Many (e.g. Sun and Huang 1996; Guo 1997b) agree that this is still an unsolved problem.  The major difficulty with most approaches reported in the literature lies in the lack of support from sufficient grammar knowledge.  To ultimately solve this problem, grammatical analysis is vital, a point to be elaborated in the subsequent sections.

2.1.1. Resolution of Hidden Ambiguity

The topic of this section is the treatment of hidden ambiguity.  The conclusion of the investigation below is that the structural analysis of the entire input string provides a sound basis for handling this problem.

The following sample sentences illustrate a typical case involving the hidden ambiguity string 烤白薯 kao bai shu.

(2-1.) (a)      他吃烤白薯
ta         | chi  | kao-bai-shu
he      | eat  | baked-sweet-potato
[S [NP ta] [VP [V chi] [NP kao-bai-shu]]]
He eats the baked sweet potato.

(b) * ta       | chi  | kao          | bai-shu
he      | eat  | bake         | sweet-potato

(2-2.) (a) *    他会烤白薯
ta         | hui  | kao-bai-shu.
he      | can | baked-sweet-potato

(b)     ta       | hui  | kao          | bai-shu.
he      | can | bake         | sweet-potato
[S [NP ta] [VP [V hui] [VP [V kao] [NP bai-shu]]]]
He can bake sweet potatoes.

Sentences (2-1) and (2-2) are a minimal pair;  the only difference is the choice of the predicate verb, namely chi (eat) versus hui (can, be capable of).  But they have very different structures and assume different word identification.  This is because verbs like chi expect an NP object but verbs like hui require a VP complement.  The two segmentations of the string kao bai shu provide two possibilities, one as an NP kao-bai-shu and the other as a VP kao | bai-shu.  When the provided unit matches the expectation, it leads to a successful syntactic analysis, as illustrated by the parse trees in (2‑1a) and (2-2b).  When the expectation constraint is not satisfied, as in (2-1b) and (2-2a), the analysis fails.  These examples show that all candidate words in the input string should be considered for grammatical analysis.  The disambiguation choice can be made via the analysis, as seen in the examples above with the sample parse trees.  Correct segmentation results in at least one successful parse.
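The reasoning behind this minimal pair can be sketched as a toy search over candidate segmentations, where a segmentation survives only if the verb's lexical expectation is matched. This is a drastic simplification of sentential parsing, with a hypothetical two-entry lexicon of our own; it is meant only to show the mechanism, not the actual CPSG95 analysis.

```python
# Toy illustration of (2-1)/(2-2): chi (eat) expects an NP object while
# hui (can) expects a VP complement; the two segmentations of kao bai shu
# supply one candidate of each category.

LEXICON_EXPECTS = {"chi": "np", "hui": "vp"}

SEGMENTATIONS = {
    "kao-bai-shu":   "np",   # baked-sweet-potato as a single noun
    "kao | bai-shu": "vp",   # bake + sweet-potato as a verb phrase
}

def viable_segmentations(verb):
    """Keep only the segmentations whose category matches the verb's
    expectation; the correct segmentation yields a successful parse."""
    wanted = LEXICON_EXPECTS[verb]
    return [seg for seg, cat in SEGMENTATIONS.items() if cat == wanted]
```

With chi, only the NP segmentation survives, as in (2-1a); with hui, only the VP segmentation does, as in (2-2b), so disambiguation falls out of the sentential analysis itself.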

He, Xu and Sun (1991) indicate that a hidden ambiguity string requires a larger context for disambiguation.  But they did not define what the ‘larger context’ should be.  The following discussion attempts to answer this question.

The input string to the parser constitutes a basic context as well as the object for sentential analysis.[2]  It will be argued that this input string is the proper context for handling the hidden ambiguity problem.  The point to be made is, context smaller than the input string is not reliable for the hidden ambiguity resolution.  This point is illustrated by the following examples of the hidden ambiguity string ge ren in (2-3).[3]  In each successive case, the context is expanded to form a new input string.   As a result, the analysis and the associated interpretation of ‘person’ versus ‘individual’ change accordingly.

(2-3.)  input string                            reading(s)

(a)      人  ren                                       person (or man, human)
[N ren]

(b)      个人  ge ren                               individual
[N ge-ren]

(c)      三个人  san ge ren                               three persons
[NP [CLAP [NUM san] [CLA ge]] [N ren]]

(d)      人的力量  ren de li liang                      the human power
[NP [DEP [NP ren] [DE de]] [N li-liang]]

(e)      个人的力量  ge ren de li liang                        the power of an individual
[NP [DEP [NP ge-ren] [DE de]] [N li-liang]]

(f)       三个人的力量  san ge ren de li liang              the power of three persons
[NP [DEP [NP [CLAP [NUM san] [CLA ge]] [N ren]] [DE de]] [N li-liang]]

(g)      他不是个人  ta bu shi ge ren.
           (1)    He is not a man. (He is a pig.)
[S [NP ta] [VP [ADV bu] [VP [V shi] [NP [CLAP ge] [N ren]]]]]
(2)  He is not an individual. (He represents a social organization.)
[S [NP ta] [VP [ADV bu] [VP [V shi] [NP ge-ren]]]]

Comparing (a) and (b) with (c), and (d) and (e) with (f), one can see the associated change of readings as each successively expanded input string leads to a different grammatical analysis.  Accordingly, one segmentation is chosen over the other on the condition that the grammatical analysis of the full string can be established based on that segmentation.  In (b), the ambiguous string is all that is input to the parser, so the local context becomes the full context.  It then acquires the lexical reading individual, as the other possible segmentation ge | ren does not form a legitimate combination.  This reading may be retained, as in (e), or changed to the other reading person, as in (c) and (f), or reduced to one of the possible interpretations, as in (g), when the input string is further lengthened.  All these changes depend on the sentential analysis of the entire input string, as shown by the associated structural trees above.  This demonstrates that the full context is required for the adequate treatment of the hidden ambiguity phenomena.  Full context here refers to the entire input string to the parser.

It is necessary to explain some of the analyses as shown in the sample parses  above.  In Contemporary Mandarin, a numeral cannot  combine with a noun without a classifier in between.[4]  Therefore, the segmentation san (three) | ge-ren (individual) is excluded in (c) and (f), and the correct segmentation san (three) | ge (CLA) | ren (person) leads to the NP analysis.  In general, a classifier alone cannot combine with the following noun either, hence the interpretation of ge ren as one word ge-ren (individual) in (b) and (e).  A classifier usually combines with a preceding numeral or determiner before it can combine with the noun.  But things are more complicated.  In fact, the Chinese numeral yi (one) can be omitted when the NP is in object position.  In other words, the classifier alone can combine with a noun in a very restricted syntactic environment.  That explains the two readings in (g).[5]

The following is a summary of the arguments presented above.   These arguments have been shown to account for the hidden ambiguity phenomena.  The next section will further demonstrate the validity of these arguments for overlapping ambiguity as well.

(2-4.) Conclusion
The grammatical analysis of the entire input string is required for the adequate treatment of the hidden ambiguity problem in word identification.

2.1.2. Resolution of Overlapping Ambiguity

This section investigates overlapping ambiguity and its resolution.  A previous influential theory is examined, which claims that the overlapping ambiguity string can be locally disambiguated.   However, this theory is found to be unable to account for a significant amount of data.  The conclusion is that both overlapping ambiguity and hidden ambiguity require a context of the entire input string and a grammar for disambiguation.

Overlapping ambiguity can be detected by comparing different critical tokenizations, but such a technique cannot guarantee a correct choice without introducing other knowledge.  Guo (1997) pointed out:

As all critical tokenizations hold the property of minimal elements on the word string cover relationship, the existence of critical ambiguity in tokenization implies that the “most powerful and commonly used” (Chen and Liu 1992, page 104) principle of maximum tokenization would not be effective in resolving critical ambiguity in tokenization and implies that other means such as statistical inferencing or grammatical reasoning have to be introduced.

However, He, Xu and Sun (1991) claim that overlapping ambiguity can be resolved within the local context of the ambiguous string.  They classify overlapping ambiguity strings into nine types.  The classification is based on the categories of the presumably correctly segmented words in the ambiguous strings, as described below.

Suppose there is an overlapping ambiguous string consisting of ABC;  both AB and BC are entries listed in the lexicon.  There are two possible cases.  In case one, the category of A and the category of BC define the classification of the ambiguous string.  This is the case when the segmentation A|BC is considered correct.  For example, in the ambiguous string 白天鹅 bai tian e, the word AB is bai-tian (day-time) and the word BC is tian-e (swan).  The correct segmentation for this string is assumed to be A|BC, i.e. bai (A: white) | tian-e (N: swan) (in fact, this cannot be taken for granted, as shall be shown shortly);  therefore, it belongs to the A-N type.  In case two, i.e. when the segmentation AB|C is considered correct, the category of AB and the category of C define the classification of the ambiguous string.  For example, in the ambiguous string 需求和 xu qiu he, the word AB is xu-qiu (requirement) and the word BC is qiu-he (sue for peace).  The correct segmentation for this string is AB|C, i.e. xu-qiu (N: requirement) | he (CONJ: and) (again, this should not be taken for granted);  therefore, it belongs to the N-CONJ type.

After classifying the overlapping ambiguous strings into one of nine types, using the two different cases described above, they claim to have discovered a rule.[6]  That is, the category of the correctly segmented word BC in case one (or AB in case two) is predictable from AB (or BC in case two) within the local ambiguous string.  For example, the category of tian-e (swan) in bai | tian-e (white swan) is noun.  This information is predictable from bai tian within the respective local string bai tian e.  The idea is that if an overlapping ambiguity string is ever formed of bai tian and C, the judgment of bai | tian-C as the correct segmentation entails that the word tian-C must be a noun.  Otherwise, the segmentation A|BC is wrong and the other segmentation AB|C is right.  For illustration, it is noted that tian-shi (angel) in the ambiguous string 白天使 bai | tian-shi (white angel) is, as expected, a noun.  This predictability of the category information from within the local overlapping ambiguous string is seen as an important discovery (Feng 1996).  Based on this assumed feature of the overlapping ambiguous strings, He, Xu and Sun (1991) developed their theory that an overlapping ambiguity string can be disambiguated within the local string itself.

The proposed disambiguation process within the overlapping ambiguous string proceeds as follows.  In order to correctly segment an overlapping ambiguous string, say, bai tian e or bai tian shi, the following information needs to be given under the entry bai-tian (day-time) in the tokenization lexicon:  (i) an ambiguity label, to indicate the necessity to call a disambiguation rule;  (ii) the ambiguity type A-N, to indicate that it should call the rule corresponding to this type.  Then the following disambiguation rule can be formulated.

(2-5.) A-N type rule       (He, Xu and Sun 1991)
In the overlapping ambiguous string A(1)…A(i) B(1)…B(j) C(1)…C(k),
if        B(1)…B(j) and C(1)…C(k) form a noun,
then  the correct segmentation is A(1)…A(i) | B(1)…B(j)-C(1)…C(k),
else    the correct segmentation is A(1)…A(i)-B(1)…B(j) | C(1)…C(k).

This way, bai tian e and bai tian shi will always be segmented as bai (white) | tian-e (swan) and bai (white) | tian-shi (angel) instead of bai-tian (daytime) | e (goose) and bai-tian (daytime) | shi (make).  This can be easily accommodated in a segmentation algorithm provided the above information is added to the lexicon and the disambiguation rules are implemented.  The whole procedure runs within the local context of the overlapping ambiguous string and uses only lexical information.  They accordingly name overlapping ambiguity disambiguation morphology-based disambiguation, as it need not consult syntax, semantics or discourse.
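The procedure can be sketched as follows.  This is a minimal reconstruction of the rule under the lexicon conventions described above;  the entry format and feature names are my own illustrative assumptions:

```python
# Hypothetical tokenization lexicon with category and ambiguity-type labels.
LEXICON = {
    "bai":      {"cat": "A"},                      # white
    "bai-tian": {"cat": "N", "ambiguity": "A-N"},  # daytime: triggers the A-N rule
    "tian-e":   {"cat": "N"},                      # swan
    "tian-shi": {"cat": "N"},                      # angel
    "e":        {"cat": "N"},                      # goose
    "shi":      {"cat": "V"},                      # make
}

def a_n_rule(a, b, c):
    """A-N type rule (2-5): if B-C forms a noun, segment A | B-C; else A-B | C."""
    bc = b + "-" + c
    if LEXICON.get(bc, {}).get("cat") == "N":
        return [a, bc]
    return [a + "-" + b, c]

print(a_n_rule("bai", "tian", "e"))    # -> ['bai', 'tian-e']   (always 'white swan')
print(a_n_rule("bai", "tian", "shi"))  # -> ['bai', 'tian-shi'] (always 'white angel')
```

Note that the rule consults only the local string and the lexicon;  this locality is exactly what the counter-examples presented later in this section exploit.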

Feng (1996) emphasizes that He, Xu and Sun’s view on the overlapping ambiguous string constitutes a valuable contribution to the theory of Chinese word identification.  Indeed, this overlapping ambiguous string theory, if it were right, would be a breakthrough in this field.  It in effect suggests that the majority of the segmentation ambiguity is resolvable without and before a grammar module.  A handful of simple rules, like the A-N type rule formulated above, plus a lexicon would solve most ambiguity problems in word identification.[7]

Feng (1996) provides examples for all the nine types of overlapping ambiguous strings as evidence to support He, Xu and Sun (1991)’s theory.  In the case of the A-N type ambiguous string bai tian e, the correct segmentation is supposed to be bai | tian-e in this theory.  However, even in his own cited example, Feng ignores a perfectly good second reading (parse), in which the time NP bai-tian (daytime) directly modifies the sentence with no need for a preposition, as shown in (2-6b) below.

(2-6.)           白天鹅游过来了
bai tian e you guo lai le       (Feng 1996)

(a)      bai     | tian-e       | you          | guo-lai      | le.
          white | swan        | swim        | over-here  | LE
[S [NP bai tian-e] [VP you guo-lai le]]
The white swan swam over here.

(b)      bai-tian       | e              | you          | guo-lai      | le.
          day-time      | goose        | swim        | over-here  | LE
[S [NP+mod bai-tian] [S [NP e] [VP you guo-lai le]]]
In the day time the geese swam over here.

In addition, one only needs to add a preposition zai (in) to the beginning of the sentence to make the abandoned segmentation bai-tian | e the only right one in the changed context.  The presumably correct segmentation, namely bai | tian-e, now turns out to be wrong, as shown in (2-7a) below.

(2-7.)           在白天鹅游过来了
zai bai tian e you guo lai le

(a) *   zai     | bai           | tian-e       | you          | guo-lai      | le.
          in      | white        | swan        | swim        | over-here  | LE

(b)      zai     | bai-tian    | e              | you          | guo-lai      | le.
          in      | day-time   | goose        | swim        | over-here  | LE
[S [PP+mod zai bai-tian] [S [NP e] [VP you guo-lai le]]]
In the day time the geese swam over here.

The above counter-example is by no means accidental.  In fact, for each cited ambiguous string in the examples given by Feng, there exist counter-examples.  It is not difficult to construct a different context where the segmentation preferred within the local string, i.e. the segmentation chosen according to one of the rules, proves to be wrong.[8]  In the pairs of sample sentences (2-8) through (2-10), (a) is an example which Feng (1996) cited to support the view that the local ambiguous string itself is enough for disambiguation.  The sentences in (b) are counter-examples to this theory.  It is a notable fact that the listed local string is often properly contained in a more complicated ambiguous string in an expanded context, as seen in (2-9b) and (2-10b).  Therefore, even when the abandoned segmentation can never be linguistically correct in any context, as shown for tu-xing (graph) | shi (BM) in (2-9) where a bound morpheme is left over after the segmentation, it does not entail the correctness of the other segmentation in all contexts.  These data show that all possible segmentations should be retained for the grammatical analysis to judge.

(2-8.)  V-N type of overlapping ambiguous string

          yan jiu sheng ming:
          yan-jiu (V:study) | sheng-ming (N:life)
yan-jiu-sheng (N:graduate student) | ming (life/destiny)

(a)      研究生命的本质
          yan-jiu    sheng-ming de      ben-zhi
          study          life               DE     essence
Study the essence of life.

(b)      研究生命金贵
           yan-jiu-sheng      ming  jin-gui
          graduate-student  life     precious
Life for graduate students is precious.

(2-9.)  CONJ-N type of overlapping ambiguous string
和平等 he ping deng:
          he (CONJ:and) | ping-deng (N:equality)
he-ping (N:peace) | deng (V:wait)

(a)      独立自主和平等互利的原则
           du-li-zi-zhu           he      ping-deng-hu-li               de      yuan-ze
          independence       and    equal-reciprocal-benefit  DE     principle
the principle of independence and equal reciprocal benefit

(b)      和平等于胜利
          he-ping       deng-yu       sheng-li
          peace           equal           victory
Peace is equal to victory.

(2-10.)  V-P type of overlapping ambiguous string
看中和 kan zhong he:
          kan-zhong (V:target) | he (P:with)
kan (V:see) | zhong-he (V:neutralize)

(a)      他们看中和日本人做生意的机会
ta-men   kan-zhong   he     ri-ben   ren      zuo    sheng-yi   de     ji-hui
they     target      with   Japan    person   do     business   DE     opportunity
They have targeted the opportunity to do business with the Japanese.

(b)      这要看中和作用的效果
zhe    yao     kan     zhong-he-zuo-yong    de      xiao-guo
this   need    see     neutralization       DE      effect
This will depend on the effects of the neutralization.

The data in (b) above directly contradict the claim that an overlapping ambiguous string can be disambiguated within the local string itself.  While this approach has been shown to be inadequate in practice, the following comments attempt to reveal its theoretical motivation.

As reviewed in the previous text, He, Xu and Sun (1991)’s overlapping ambiguity theory is established on the classification of the overlapping ambiguous strings.  A careful examination of their proposed nine types of the overlapping ambiguous strings reveals an underlying assumption on which the classification is based.  That is, the correctly segmented words within the overlapping ambiguous string will automatically remain correct in a sentence containing the local string.   This is in general untrue, as shown by the counter-examples above.[9]   The following analysis reveals why.

Within the local context of the overlapping ambiguous string, the chosen segmentation often leads to a syntactically legitimate structure while the abandoned segmentation does not.  For example,  bai (white) | tian-e (swan) combines into a valid syntactic unit while there is no structure which can span bai-tian (daytime) | e (goose).  For another example,  yan-jiu (study) | sheng-ming (life) can be combined into a legitimate verb phrase [VP [V yan-jiu] [NP sheng-ming]], but  yan-jiu-sheng (graduate student) | ming (life/destiny) cannot.  But that legitimacy only stands locally within the boundary of the ambiguous string.  It does not necessarily hold true in a larger context containing the string.  As shown previously in (2-7a),  the locally legitimate structure bai | tian-e (white swan) does not lead to a successful parse for the sentence.  In contrast, the locally abandoned segmentation bai-tian (daytime) | e (goose) has turned out to be right with the parse in (2-7b).   Therefore, the full context instead of the local context of the ambiguous string is required for the final judgment on which segmentation can be safely abandoned.  Context smaller than the entire input string is not reliable for the overlapping ambiguity resolution.  Note that exactly the same conclusion has been reached for the hidden ambiguous strings in the previous section.

The following data in (2-11) further illustrate the point of the full context requirement for the overlapping ambiguity resolution, similar to what has been presented for the hidden ambiguity phenomena in (2-3).  In each successive case, the context is expanded to form a new input string.  As a result, the interpretation of ‘goose’ versus ‘swan’ changes accordingly.

(2-11.)  input string                reading(s)

(a)      鹅 e                                goose
[N e]

(b)      天鹅 tian e                                swan
[N tian-e]

(c)      白天鹅 bai tian e                       white swan
[N [A bai] [N tian-e]]

(d)      鹅游过来了 e you guo lai le.
The geese swam over here.
[S [NP e] [VP you guo-lai le]]

(e)      天鹅游过来了 tian e you guo lai le.
The swans swam over here.
[S [NP tian-e] [VP you guo-lai le]]

(f)      白天鹅游过来了 bai tian e you guo lai le.
          (i)       The white swan swam over here.
[S [NP bai tian-e] [VP you guo-lai le]]
          (ii)      In the daytime, the geese swam over here.
[S [NP+mod bai-tian] [S [NP e] [VP you guo-lai le]]]

(g)       在白天鹅游过来了 zai bai tian e you guo lai le.
            In the daytime, the geese swam over here.
[S [PP zai bai-tian] [S [NP e] [VP you guo-lai le]]]

(h)      三只白天鹅游过来了 san zhi bai tian e you guo lai le.
           Three white swans swam over here.
[S [NP san zhi bai tian-e] [VP you guo-lai le]]

It is interesting to compare (c) with (f), (g) and (h) to see the associated change of readings based on different ways of segmentation.  In (c), the overlapping ambiguous string is all that is input to the parser, so the local context becomes the full context.  It then acquires the reading white swan corresponding to the segmentation bai | tian-e.  This reading may be retained, or changed, or reduced to one of the possible interpretations when the input string is lengthened, as is respectively the case in (h), (g) and (f).  All these changes depend on the grammatical analysis of the entire input string.  This shows that the full context and a grammar are required for the resolution of most ambiguities;  and when sentential analysis cannot disambiguate, as in cases of ‘genuine’ segmentation ambiguity like (f), the structural analysis can make the ambiguity explicit in the form of multiple parses (readings).

In the light of the inquiry in this section, the theoretical significance of the distinction between overlapping ambiguity and hidden ambiguity seems to have diminished.[10]  They are both structural in nature.  They both require full context and a grammar for proper treatment.

(2-12.) Conclusion

(i)  It is not necessarily true that an overlapping ambiguous string can be disambiguated within the local string.

(ii) The grammatical analysis of the entire input string is required for the adequate treatment of the overlapping ambiguity problem as well as the hidden ambiguity problem.

2.2. Productive Word Formation and Syntax

This section examines the connection between productive word formation and segmentation ambiguity.  The observation is that there is always a possible involvement of ambiguity with each type of word formation.  The point to be made is that no independent morphology system can resolve this ambiguity when syntax is unavailable.  This is because words formed via morphology, just like words looked up from the lexicon, only provide ‘candidate’ constituents for the sentential analysis.  The choice is decided by the structural analysis of the entire sentence.

Derivation is a major type of productive word formation in Chinese.   Section 1.2.2 has given an example of the involvement of hidden ambiguity in derivation, repeated below.

(2-13.)         这道菜没有吃头  zhe dao cai mei you chi tou.

(a)      zhe    | dao          | cai            | mei-you    | chi-tou
          this    | CLA                   | dish         | not-have   | worth-of-eating
[S [NP zhe dao cai] [VP [V mei-you] [NP chi-tou]]]
This dish is not worth eating.

(b) ?   zhe    | dao          | cai            | mei-you    | chi  | tou
          this    | CLA                   | dish         | not have   | eat  | head
[S [NP zhe dao cai] [VP [ADV mei-you] [VP [V chi] [NP tou]]]]
This dish did not eat the head.

(2-14.)         他饿得能吃头牛 ta e de neng chi tou niu.

(a) *   ta       | e              | de            | neng         | chi-tou               | niu
he      | hungry     | DE3         | can           | worth-of-eating  | ox

(b)      ta       | e              | de            | neng         | chi  | tou            | niu
he      | hungry     | DE3         | can           | eat  | CLA          | ox
[…[VP [V e] [DE3P [DE3 de] [VP [V neng] [VP [V chi] [NP tou niu]]]]]]
He is so hungry that he can eat an ox.

A derivation rule like the one in (2-15) is responsible for combining the transitive verb stem and the suffix –tou (worth-of) into a derived noun for (2-13a) and (2-14a).

(2-15.)         X (transitive verb) + tou –> X-tou (noun, semantics: worth-of-X)

However, when syntax is not available, there is always a danger of wrongly applying this morphological rule due to possible ambiguity involved, as shown in (2-14a).  In other words, morphological rules only provide candidate words;  they cannot make the decision whether these words are legitimate in the context.
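Rule (2-15) and its candidate-only status can be sketched as follows (a hedged illustration;  the category labels are my own assumptions):

```python
def apply_tou_rule(stem, cat):
    """Derivation rule (2-15): transitive verb X + tou -> noun X-tou ('worth-of-X').
    The output is only a CANDIDATE word; the sentential analysis decides whether
    it survives, as the contrast between (2-13) and (2-14) shows."""
    if cat != "Vt":                     # rule applies to transitive verbs only
        return None
    return (stem + "-tou", "N", "worth-of-" + stem)

print(apply_tou_rule("chi", "Vt"))   # -> ('chi-tou', 'N', 'worth-of-chi')
print(apply_tou_rule("niu", "N"))    # -> None (rule not applicable)
```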

Reduplication is another method of productive word formation in Chinese.  An outstanding problem is the AB –> AABB reduplication, or the AB –> AAB reduplication if AB is a listed word.  In these cases, some reduplication rules or procedures need to be invoked to recognize AABB or AAB.  If reduplication were a simple process confined to a small local context, it might be possible to handle it by incorporating some procedure-based function calls during lexical lookup.  For example, when a three-character string, say 分分心 fen fen xin, cannot be found in the lexicon, the reduplication function will check whether the first two characters are the same, and if so, delete one of them and consult the lexicon again.  This method is expected to handle the AAB type of reduplication, e.g. fen-xin (divide-heart: distract) –> fen-fen-xin (distract a bit).
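The lexical-lookup fallback just described might look like this (a sketch only, with a one-entry toy lexicon;  the real treatment would also need to cope with the ambiguity discussed next):

```python
def lookup_with_aab(s, lexicon):
    """Lexical lookup with a fallback for AAB reduplication: if a three-character
    string is not in the lexicon and its first two characters are identical,
    delete one of them and consult the lexicon again."""
    if s in lexicon:
        return s
    if len(s) == 3 and s[0] == s[1] and s[1:] in lexicon:
        return s[1:]                    # recognized as reduplicated AB
    return None

LEXICON = {"分心"}                      # fen-xin 'distract'
print(lookup_with_aab("分分心", LEXICON))   # -> 分心 (AAB recognized)
```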

But segmentation ambiguity can be involved in reduplication as well.  Compare the following examples (2-16) and (2-17), both containing the sub-string fen fen xin:  the first is ambiguity-free but the second is ambiguous.  In fact, (2-17) involves an overlapping ambiguous string shi fen fen xin:  shi (ten) | fen-fen-xin (distract a bit) versus shi-fen (very) | fen-xin (distract).  Based on the conclusion presented in 2.1, it requires grammatical analysis to resolve the segmentation ambiguity.  This is illustrated in (2-17).

(2-16.)         让他分分心

rang     | ta    | fen-fen-xin
let      | he   | distracted-a-bit
Let him relax a while.

(2-17.)         这件事十分分心

zhe jian shi shi fen fen xin.

(a) *   zhe    | jian          | shi           | shi  | fen-fen-xin
          this    | CLA         | thing       | ten  | distracted a bit

(b)      zhe    | jian          | shi            | shi-fen     | fen-xin
           this    | CLA         | thing       | very         | distract
[S [NP zhe jian shi] [VP [ADV shi-fen] [V fen-xin]]]
This thing is very distracting.

Finally, there is also possible ambiguity involvement in the proper name formation.  Proper names for persons, locations, etc. that are not listed in the lexicon are recognized as another major problem in word identification (Sun and Huang 1996).[11]  This problem is complicated when ambiguity is involved.

For example, a Chinese person name usually consists of a family name followed by a given name of one or two characters.  For example, the late Chinese chairman mao-ze-dong (Mao Zedong) used to have another name li-de-sheng (Li Desheng).  In the lexicon, li is a listed family name.  Both de-sheng and sheng mean ‘win’.  This may lead to three ways of word segmentation, a complicated case involving both overlapping ambiguity and hidden ambiguity:  (i) li | de-sheng;  (ii) li-de | sheng;  (iii) li-de-sheng, as shown in (2-18) below.

(2-18.)         李得胜了 li de sheng le.

(a)      li        | de-sheng  | le
           Li       | win          | LE
[S [NP li] [VP de-sheng le]]
Li won.

(b)      li-de   | sheng       | le
           Li De | win          | LE
[S [NP li de] [VP sheng le]]
Li De won.

(c) *    li-de-sheng  | le
           Li Desheng  | LE

For this particular type of compounding, the family name serves as the left boundary of a potential compound person name, and the length constraint (a given name of one or two characters) can be used to determine the candidates.[12]  Again, the choice is decided by the grammatical analysis of the entire sentence, as illustrated in (2-18).
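This candidate-generation step can be sketched as follows (a minimal illustration;  the candidate list covers the three segmentations in (2-18), and the final choice is left to the parse):

```python
def name_candidates(chars, i, family_names):
    """At position i, if chars[i] is a listed family name, propose the bare
    surname plus compound names with a one- or two-character given name."""
    if chars[i] not in family_names:
        return []
    cands = []
    for given_len in (0, 1, 2):
        if i + 1 + given_len <= len(chars):
            cands.append("".join(chars[i:i + 1 + given_len]))
    return cands

print(name_candidates(list("李得胜了"), 0, {"李"}))   # -> ['李', '李得', '李得胜']
```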

(2-19.) Conclusion

Due to the possible ambiguity involvement in productive word formation, a grammar containing both morphology and syntax is required for an adequate treatment.  An independent morphology system or separate word segmenter cannot solve ambiguity problems.

2.3. Borderline Cases and Grammar

This section reviews some outstanding morpho-syntactic borderline phenomena.  The points to be made are:  (i) each proposed morphological or syntactic analysis should be justified in terms of capturing the linguistic generality;  (ii) the design of a grammar should facilitate the access to the knowledge from both morphology and syntax in analysis.

The nature of the borderline phenomena calls for the coordination of morphology and syntax in a grammar.  The phenomena of Chinese separable verbs are one typical example.  The co-existence of their contiguous use and separate use leads to confusion over whether they belong to the lexicon and morphology, or whether they are syntactic phenomena.  In fact, as will be discussed in Chapter V, there are different degrees of ‘separability’ for different types of Chinese separable verbs;  there is no uniform analysis which can handle all separable verbs properly.  Different types of separable verbs may justify different approaches to the problems.  In terms of capturing linguistic generality, a good analysis should account for the demonstrated variety of separated uses and link the separated use to the contiguous use.

‘Quasi-affixation’ is another outstanding interface problem.  This problem requires careful morpho-syntactic coordination.  As presented in Chapter I, ‘quasi-affixes’ and ‘true’ affixes structurally demonstrate very similar word formation potential, but ‘quasi-affixes’ often retain some ‘solid’ meaning while the meaning of ‘true’ affixes is functionalized.  Therefore, the key is how to coordinate the semantic contribution of the words derived via ‘quasi-affixation’ with the building of the semantics for the entire sentence.  This coordination requires flexible information flow between data structures for morphology, syntax and semantics during the morpho-syntactic analysis.

In short, the proper treatment of the morpho-syntactic borderline phenomena requires inquiry into each individual problem in order to reach a morphological or syntactic analysis which maximally captures linguistic generality.  It also calls for the design of a grammar where information between morphology and syntax can be effectively coordinated.

2.4. Knowledge beyond Syntax

This section examines the roles of knowledge beyond syntax in the resolution of segmentation ambiguity.  Despite the fact that further information beyond syntax may be necessary for a thorough solution to segmentation ambiguity,[13] it will be argued that syntax is the appropriate place for initiating this process due to the structural nature of segmentation ambiguity.

Depending on which type of information is essential for the disambiguation, it can be classified as structure-oriented, semantics-oriented or pragmatics-oriented.  This classification hierarchy is modified from that in He, Xu and Sun (1991).  They classified hidden ambiguity disambiguation into three categories:  syntax-based, semantics-based and pragmatics-based.  Together with the morphology-based disambiguation, which is equivalent to overlapping ambiguity resolution in their theory, they built a hierarchy from morphology up to pragmatics.

A note on the technical details is called for here.  The term X‑oriented (where X is either syntax, semantics or pragmatics) is selected here instead of X-based in order to avoid the potential misunderstanding that X is the basis for the relevant disambiguation.  It will be shown that while information from X is required for the ambiguity resolution, the basis is always syntax.

Based on the study in 2.1, it is believed that there is no morphology-based (or morphology-oriented) disambiguation independent of syntax.  This is because the context of morphology is a local context, too small for resolving structural ambiguity.  There is little doubt that the morphological analysis is a necessary part of word identification in terms of handling productive word formation.  But this analysis cannot by itself resolve ambiguity, as argued in 2.2.  The notion ‘structure’ in structure-oriented disambiguation includes both syntax and morphology.

He, Xu and Sun (1991) exclude overlapping ambiguity resolution from the classification beyond morphology.  This exclusion is found to be inappropriate.  In fact, the resolution of both hidden ambiguity and overlapping ambiguity can be classified into this hierarchy.  In order to illustrate this point, for each such class, I will give examples of both hidden ambiguity and overlapping ambiguity.

The sentences in (2-20) and (2-21), which contain the hidden ambiguity string 阵风 zhen feng, are examples of structure-oriented disambiguation.  This type of disambiguation, which relies on a grammar, constitutes the bulk of the disambiguation task required for word identification.

(2-20.)         一阵风吹过来了
yi zhen feng chui guo lai le.          (Feng 1996)

(a)      yi       | zhen         | feng         | chui          | guo-lai      | le
          one    | CLA          | wind        | blow         | over-here  | LE
[S [NP [CLAP yi zhen] [N feng]] [VP chui guo-lai le]]
A gust of wind blew here.

(b) *   yi       | zhen-feng                    | chui                   | guo-lai      | le
          one    | gusts-of-wind    | blow         | over-here  | LE

(2-21.)         阵风会很快来临 zhen feng hui hen kuai lai lin.

(a)      zhen-feng              | hui  | hen          | kuai         | lai-lin
          gusts-of-wind       | will | very                   | soon         | come
[S [NP zhen-feng] [VP hui hen kuai lai-lin]]
Gusts of wind will come very soon.

(b) *   zhen  | feng                   | hui  | hen          | kuai         | lai-lin
          CLA   | wind        | will | very                   | soon         | come

Compare (2-20a), where the ambiguity string is identified as two words zhen (CLA) feng (wind), with (2-21a), where the string is taken as one word zhen-feng (gusts-of-wind).  Chinese syntax defines that a numeral cannot directly combine with a noun, nor can a classifier alone when the NP is in non-object position.  The numeral and the classifier must combine together before they can combine with a noun.  So (2-20b) and (2-21b) are both ruled out while (2-20a) and (2-21a) are structurally well-formed.
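The constraints just stated can be encoded as a toy rule table (categories as in the parse trees above;  a hedged sketch, not the CPSG95 formulation, and ignoring the restricted object-position exception):

```python
# NUM + CLA -> CLAP, CLAP + N -> NP; there is no rule NUM + N or (here) CLA + N.
RULES = {("NUM", "CLA"): "CLAP", ("CLAP", "N"): "NP"}

def combine(left, right):
    """Return the mother category of a binary combination, or None if ill-formed."""
    return RULES.get((left, right))

print(combine("NUM", "N"))                      # -> None  (*yi feng, cf. (2-20b))
print(combine(combine("NUM", "CLA"), "N"))      # -> NP    (yi zhen feng, (2-20a))
```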

For the structure-oriented overlapping ambiguity resolution,  numerous examples have been cited before, and one typical example is repeated below.

(2-22.)         研究生命金贵 yan jiu sheng ming jin gui

(a)      yan-jiu-sheng       | ming         | jin-gui
graduate student | life            | precious
[S [NP yan-jiu-sheng] [S [NP ming] [AP jin-gui]]]
Life for graduate students is precious.

(b) *   yan-jiu        | sheng-ming        | jin-gui
study          | life                     | precious

As a predicate, the adjective jin-gui (precious) syntactically expects an NP as its subject, which is saturated by the second NP ming (life) in (2-22a).  The first NP serves as the topic of the sentence and is semantically linked to the subject ming (life) as its possessor.[14]  But there is no parse for (2-22b), despite the fact that the sub-string yan-jiu sheng-ming (to study life) forms a verb phrase [VP [V yan-jiu] [NP sheng-ming]] and the sub-string sheng-ming jin-gui (life is precious) forms a sentence [S [NP sheng-ming] [AP jin-gui]].  On the one hand, the VP in the subject position does not satisfy the syntactic constraint (the category NP) expected by the adjective jin-gui (precious), although other adjectives, say zhong-yao (important), may expect a VP subject.  On the other hand, the transitive verb yan-jiu (study) expects an NP object; it cannot take an S object (an embedded object clause) as verbs like ren-wei (think) can.

The resolution of the following hidden ambiguity belongs to the semantics-oriented disambiguation.

(2-23.)         请把手抬高一点儿 qing ba shou tai gao yi dian er            (Feng 1996)

(a1)    qing             | ba   | shou         | tai            | gao | yi-dian-er
          please          | BA  | hand        | hold         | high| a-little
[VP [ADV qing] [VP ba shou tai gao yi-dian-er]]
Please raise your hand a little higher.

(a2) * qing   | ba   | shou         | tai            | gao           | yi-dian-er
          invite | BA  | hand        | hold         | high         | a-little

(b1) * qing             | ba-shou    | tai            | gao           | yi-dian-er
          please          | N:handle  | hold         | high         | a-little

(b2) ? qing   | ba-shou    | tai            | gao           | yi-dian-er
          invite | N:handle  | hold         | high         | a-little
[VP [VG [V qing] [NP ba-shou]] [VP tai gao yi-dian-er]]
Invite the handle to hold a little higher.

This is an interesting example.  The same character qing is both an adverb ‘please’ and a verb ‘invite’.  (2-23b2) is syntactically valid, but violates a semantic constraint, or semantic selection restriction:  the logical object of qing (invite) should be human, but ba-shou (handle) is not.  The two syntactically valid parses (2-23a1) and (2-23b2), which correspond to the two ways of segmentation, can only be disambiguated on such semantic grounds.

The following case is an example of semantics-oriented resolution of the overlapping ambiguity.

(2-24.)         茶点心吃了 cha dian xin chi le.

(a1)    cha    | dian-xin   | chi  | le
tea     | dim sum  | eat  | LE
[S [NP+object cha dian-xin] [VP chi le]]
The tea dim sum was eaten.

(a2) ? cha    | dian-xin   | chi  | le
tea     | dim sum  | eat  | LE
[S [NP+agent cha dian-xin] [VP chi le]]
The tea dim sum ate (something).

(a3) ? cha    | dian-xin   | chi  | le
tea     | dim sum  | eat  | LE
[S [NP+object cha ] [S [NP+agent dian-xin] [VP chi le]]]
Tea, the dim sum ate.

(a4) ? cha    | dian-xin   | chi  | le
tea     | dim sum  | eat  | LE
[S [NP+agent cha ] [VP [NP+object dian-xin] [VP chi le]]]
The tea ate the dim sum.

(b1) ? cha-dian               | xin           | chi  | le
tea dim sum         | heart       | eat  | LE
[S [NP+object cha-dian] [S [NP+agent xin] [VP chi le]]]
The tea dim sum, the heart ate.

(b2) ? cha-dian               | xin           | chi  | le
tea dim sum         | heart        | eat  | LE
[S [NP+agent cha-dian] [VP [NP+object xin] [VP chi le]]]
The tea dim sum ate the heart.

Most Chinese dictionaries list the compound noun cha-dian (tea-dim-sum), but not cha dian-xin, which stands for the same thing, namely the snacks served with tea.  As shown above, there are four analyses for one segmentation and two for the other.  All six are syntactically legitimate, corresponding to six different readings.  But only one analysis makes sense, namely the implicit passive construction with the compound noun cha dian-xin as the preposed (logical) object in (a1).  The other five analyses are nonsensical and can be ruled out if the semantic selection restriction that an animate being eats (chi) food is enforced.  Syntactically, (a2) is an active construction with the optional object omitted.  The constructions in (a3) and (b1) involve long-distance dependency, where the object is topicalized and placed at the beginning.  The SOV (Subject Object Verb) pattern in (a4) and (b2) is a highly restricted construction in Chinese.[15]
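A selection-restriction filter of the kind invoked here can be sketched as follows.  This is a minimal illustration, not the actual CPSG95 machinery; the ontology labels and the entry xiao-mao (kitten) are hypothetical additions for the example.

```python
# Each word is assigned a set of semantic types; the verb chi 'eat'
# lexically restricts the type of its agent and its logical object.
ONTOLOGY = {
    "cha dian-xin": {"food"},    # tea dim sum (productive compound)
    "cha-dian": {"food"},        # tea dim sum (listed compound)
    "xin": {"body-part"},        # heart
    "xiao-mao": {"animate"},     # hypothetical entry: kitten
}

SELECTION = {"chi": {"agent": "animate", "object": "food"}}

def role_ok(verb, role, filler):
    """True if the filler's semantic types satisfy the verb's
    restriction on the given role (no restriction = always OK)."""
    required = SELECTION.get(verb, {}).get(role)
    return required is None or required in ONTOLOGY.get(filler, set())

# (a1): cha dian-xin as logical OBJECT of chi -- well-formed
assert role_ok("chi", "object", "cha dian-xin")
# (a2): cha dian-xin as AGENT of chi -- food is not animate, ruled out
assert not role_ok("chi", "agent", "cha dian-xin")
```

Running such a check over the six syntactically legitimate parses leaves only the (a1) reading, which is the disambiguation effect described above.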

The pragmatics-oriented disambiguation is required for the case where ambiguity remains after the application of both structural and semantic constraints.[16]  The sentences containing this type of ambiguity are genuinely ambiguous within the sentence boundary, as shown with the multiple parses in (2-25) for the hidden ambiguity and (2-26) for the overlapping ambiguity below.

(2-25.)         他喜欢烤白薯 ta xi huan kao bai shu.

(a)      ta       | xi-huan    | kao          | bai-shu.
          he      | like           | bake         | sweet-potato
[S [NP ta] [VP [V xi-huan] [VP [V kao] [NP bai-shu]]]]
He likes baking sweet potatoes.

(b)      ta       | xi-huan    | kao-bai-shu.
          he      | like           | baked-sweet-potato
[S [NP ta] [VP [V xi-huan] [NP kao-bai-shu]]]
He likes the baked sweet potatoes.

(2-26.)         研究生命不好 yan jiu sheng ming bu hao

(a)      yan-jiu-sheng       | ming         | bu   | hao.
          graduate student | destiny     | not | good
[S [NP yan-jiu-sheng] [S [NP ming] [AP bu hao]]]
The destiny of graduate students is not good.

(b)      yan-jiu        | sheng-ming        | bu   | hao.
          study          | life                     | not | good
[S [VP yan-jiu sheng-ming] [AP bu hao]]
It is not good to study life.

An important distinction should be made among these classes of disambiguation.  Some ambiguity must be resolved in order to obtain a reading during analysis;  other ambiguity can be retained in the form of multiple parses, corresponding to multiple readings.  In either case, at least a grammar (syntax and morphology) is required.  The structure-oriented ambiguity belongs to the former class and can be handled by the appropriate structural analysis.  The semantics-oriented and pragmatics-oriented ambiguities belong to the latter class, so multiple parses are the way out.  The examples for the different classes of ambiguity show that structural analysis is the foundation for handling ambiguity problems in word identification:  it provides the candidate structures for semantic or pragmatic constraints to work on.

In fact, the resolution of segmentation ambiguity in Chinese word identification is but a special case of the resolution of structural ambiguity in NLP in general.  Grammatical analysis has been routinely used to resolve, or to prepare the basis for resolving, structural ambiguity such as PP attachment.[17]

2.5. Summary

The most important discovery in the field of Chinese word identification presented in this chapter is that the resolution of both types of segmentation ambiguity involves the analysis of the entire input string.  This means that the availability of a grammar is the key to the solution of this problem.

This chapter has also examined the involvement of ambiguity in productive word formation and reached the following conclusion:  a grammar covering morphological as well as sentential analysis is required for an adequate treatment of this problem.  This establishes the foundation for the general design of CPSG95, which consists of morphology and syntax in one grammar formalism. [18]

The study of the morpho-syntactic borderline problems shows that a sophisticated grammar design is called for, so that information between morphology and syntax can be effectively coordinated.  This is the work to be presented in Chapter III and Chapter IV.  It also demonstrates that each individual borderline problem should be studied carefully in order to reach a morphological or syntactic analysis which maximally captures linguistic generality.  This study will be pursued in Chapter V and Chapter VI.




[1]  Constraints beyond morphology and syntax can be implemented as subsequent modules, or “filters”, in order to select the correct analysis when morpho-syntactic analysis leads to multiple results (parses).  Alternatively, such constraints can also be integrated into CPSG95 as components parallel to, and interacting with, morphology and syntax.  W. Li (1996) illustrates how semantic selection restriction can be integrated into syntactic constraints in CPSG95 to support Chinese parsing.

[2] In theory, if discourse is integrated in the underlying grammar, the input can be a unit larger than sentence, say, a paragraph or even a full text.  But this will depend on the further development in discourse theory and its formalization.  Most grammars in current use assume sentential analysis.

[3]  Similar examples for the overlapping ambiguity string will be shown in 2.1.2.

[4]  But in Ancient Chinese, a numeral can freely combine with countable nouns.

[5] These two readings in written Chinese correspond to an obvious difference in Spoken Chinese:  ge (CLA) in (g1) is weakened in pronunciation, marked by the dropping of the tone, while in (g2) it reads with the original 4th tone emphatically.

[6] It is likely that what they have found corresponds to Guo’s discovery of “one tokenization per source” (Guo 1998).  Guo’s finding is based on his experimental study involving domain (“source”) evidence and seems to account for the phenomena better.  Guo’s proposed strategy is also more effective, reported to be among the best disambiguation strategies used in word segmenters.

[7] According to the statistics of He, Xu and Sun (1991) on a corpus of 50,833 Chinese characters, overlapping ambiguous strings make up 84.10%, and hidden ambiguous strings 15.90%, of all ambiguous strings.

[8] Guo (1997b) goes to the other extreme and hypothesizes that “every tokenization is possible”.  Although this seems too strong a claim, the investigation in this chapter shows that, at least domain-independently, local context is very unreliable for making tokenization decisions one way or the other.

[9] However, this assumption may become statistically valid within a specific domain or source, as examined in Guo (1998).  But Guo did not give an operational definition of source/domain.  Without such a definition, it is difficult to decide where to collect the domain-specific information required for disambiguation based on the principle one tokenization per source, as proposed by Guo (1998).

[10] This distinction is crucial in the theories of Liang (1987) and He,  Xu and Sun (1991).

[11] This work is now defined as one fundamental task, called Named Entity tagging, in the world of information extraction (MUC-7 1998).  There has been great advance in developing Named Entity taggers both for Chinese (e.g. Yu et al 1997; Chen et al 1997) and for other languages.

[12] That is what was actually done with the CPSG95 implementation.  More precisely, the family name expects a special sign with hanzi-length of 1 or 2 to form a full name candidate.

[13] A typical, sophisticated word segmenter making reference to knowledge beyond syntax is presented in Gan (1995).

[14] This is in fact one very common construction in Chinese in the form of NP1 NP2 Predicate.  Other examples include ta (he) tou (head) tong (ache): ‘he has a head-ache’ and ta (he) shen-ti (body) hao (good): ‘he is good in health’.

[15] For the detailed analysis of these constructions, see W. Li (1996).

[16] It may be more appropriate to use terms like global disambiguation or discourse-oriented disambiguation, instead of pragmatics-oriented disambiguation, for the relevant phenomena.

[17] Some PP attachment problems can be resolved via grammatical analysis alone, e.g. put something on the table, found the key to that door.  Others require information beyond syntax (semantics, discourse, etc.) for a proper solution, e.g. see somebody with a telescope.  In either case, the structural analysis provides the basis.  The same holds for the disambiguation in Chinese word identification.

[18] In fact, once morphology is incorporated in the grammar, the identification of both vocabulary words and non-listable words becomes a by-product during the integrated morpho-syntactic analysis.  Most ambiguity is resolved automatically and the remaining ambiguity will be embodied in the multiple syntactic trees as the results of the analysis.  This has been shown to be true and viable by W. Li (1997, 2000) and Wu and Jiang (1998).



PhD Thesis: Chapter I Introduction

1.0. Foreword

This thesis addresses the issue of the Chinese morpho-syntactic interface.  This study is motivated by the need for a solution to a series of long-standing problems at the interface.  These problems pose challenges to an independent morphology system or a separate word segmenter as there is a need to bring in syntactic information in handling these problems.

The key is to develop a Chinese grammar which is capable of representing sufficient information from both morphology and syntax.  On the basis of the theory of Head-Driven Phrase Structure Grammar (Pollard and Sag 1987, 1994), the thesis will present the design of a Chinese grammar, named CPSG95 (for Chinese Phrase Structure Grammar).  The interface between morphology and syntax is defined system internally in CPSG95.  For each problem, arguments will be presented for the linguistic analysis involved.  A solution to the problem will then be formulated based on the analysis.  The proposed solutions are formalized and implementable;  most of the proposals have been tested in the implementation of CPSG95.

In what follows, Section 1.1 reviews some important developments in the field of Chinese NLP (Natural Language Processing).  This serves as the background for this study.  Section 1.2 presents a series of long-standing problems related to the Chinese morpho-syntactic interface.  These problems are the focus of this thesis.  Section 1.3 introduces CPSG95 and sketches its morpho-syntactic interface by illustrating an example of the proposed morpho-syntactic analysis.

1.1. Background

This section presents the background for the work on the interface between morphology and syntax in CPSG95.  Major developments in Chinese tokenization and parsing, the two areas related to this study, will be reviewed.

1.1.1. Principle of Maximum Tokenization and Critical Tokenization

This section reviews the influential Theory of Critical Tokenization (Guo 1997a) and its implications.  The point to be made is that the results of Guo’s study can help us to select the tokenization scheme used in the lexical lookup phase in order to create the basis for morpho-syntactic parsing.

Guo (1997a,b,c) has conducted a comprehensive formal study on tokenization schemes in the framework of formal languages, including deterministic tokenization such as FT (Forward Maximum Tokenization) and BT (Backward Maximum Tokenization), and non-deterministic tokenization such as CT (Critical Tokenization), ST (Shortest Tokenization) and ET (Exhaustive Tokenization).  In particular, Guo has focused on the study of the rich family of tokenization strategies following the general Principle of Maximum Tokenization, or “PMT”.  Except for ET, all the tokenization schemes mentioned above are PMT-based.

In terms of lexical lookup, PMT can be understood as a heuristic by which a longer match overrides all shorter matches.  PMT has been widely adopted (e.g. Webster and Kit 1992; Guo 1997b) and is believed to be “the most powerful and commonly used disambiguation rule” (Chen and Liu 1992:104).
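The longest-match heuristic can be illustrated with the textbook greedy Forward Maximum Tokenization algorithm (this is an illustration of FT, not Guo's optimized implementation; the toy lexicon below is an assumption for the example):

```python
def ft_tokenize(s, lexicon):
    """FT: at each position take the longest lexicon match;
    single characters always match as a fallback."""
    tokens, i = [], 0
    while i < len(s):
        for j in range(len(s), i, -1):          # longest candidate first
            if j - i == 1 or s[i:j] in lexicon:
                tokens.append(s[i:j])
                i = j
                break
    return tokens

LEX = {"大学", "大学生", "生活", "有趣"}
# On sentence (1-1) below, FT commits prematurely to da-xue-sheng
# 'university student', yielding the wrong segmentation:
assert ft_tokenize("大学生活很有趣", LEX) == ["大学生", "活", "很", "有趣"]
```

The failure on this example is exactly the premature commitment problem of deterministic PMT-based schemes that motivates the non-deterministic schemes discussed next.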

Shortest Tokenization, or “ST”, first proposed by X. Wang (1989), is a non-deterministic tokenization scheme following the Principle of Maximum Tokenization.  A segmented token string is shortest if it contains the minimum number of vocabulary words possible – “short” in the sense of the shortest word string length.

Exhaustive Tokenization, or “ET”, does not follow PMT.  As its name suggests, the ET set is the universe of all possible segmentations consisting of all candidate vocabulary words.  The mathematical definition of ET is contained in Definition 4 for “the character string tokenization operation”  in Guo (1997a).

The most important concept in Guo’s theory is Critical Tokenization, or “CT”.  Guo’s definition is based on the partially ordered set, or ‘poset’, theory in discrete mathematics (Kolman and Busby 1987).  Guo has found that different segmentations can be linked by the cover relationship to form a poset.   For example, abc|d and ab|cd both cover ab|c|d, but they do not cover each other.

Critical tokenization is defined as the set of minimal elements, i.e. tokenizations which are not covered by other tokenizations, in the tokenization poset.  Guo has given proof for a number of mathematical properties involving critical tokenization.  The major ones are listed below.

  • Every tokenization is a subtokenization of (i.e. covered by) a critical tokenization, but no critical tokenization has a true supertokenization;
  • The tokenization variations following the Principle of Maximum Tokenization proposed in the literature, such as FT, BT, FT+BT and ST, are all true sub-classes of CT.

Based on these properties, Guo concludes that CT is the precise mathematical description of the widely adopted Principle of Maximum Tokenization.
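The cover relation and CT can be made concrete by representing each tokenization by its set of internal boundary positions.  The following is a sketch over a toy lexicon, not Guo's algorithms: x covers y iff x's boundaries are a proper subset of y's.

```python
def tokenizations(s, lexicon):
    """ET: all segmentations of s into lexicon words."""
    if not s:
        return [[]]
    return [[s[:j]] + rest
            for j in range(1, len(s) + 1) if s[:j] in lexicon
            for rest in tokenizations(s[j:], lexicon)]

def boundaries(tok):
    """Internal word-boundary positions of a tokenization."""
    pos, b = 0, set()
    for w in tok[:-1]:
        pos += len(w)
        b.add(pos)
    return b

def covers(x, y):
    """x covers y iff y is a true refinement of x."""
    return boundaries(x) < boundaries(y)   # proper subset

def critical(toks):
    """CT: the minimal elements, i.e. tokenizations not covered
    by any other tokenization in the set."""
    return [t for t in toks
            if not any(covers(o, t) for o in toks if o is not t)]

LEX = {"a", "b", "c", "d", "ab", "cd", "abc"}
ET = tokenizations("abcd", LEX)            # five segmentations in all
# Guo's example: abc|d and ab|cd both cover ab|c|d...
assert covers(["abc", "d"], ["ab", "c", "d"])
# ...and the critical tokenizations are exactly the maximal ones:
assert critical(ET) == [["ab", "cd"], ["abc", "d"]]
```

Note that the CT set here coincides with what a maximum-tokenization strategy would consider, which is the content of Guo's conclusion above.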

Guo (1997c) further reports his experimental studies on relative merits of these tokenization schemes in terms of three quality indicators, namely, perplexity, precision and recall.  The perplexity of a tokenization scheme gives the expected number of tokenized strings generated for average ambiguous fragments.  The precision score is the percentage of correctly tokenized strings among all possible tokenized strings while the recall rate is the percentage of correctly tokenized strings generated by the system among all correctly tokenized strings.  The main results are:

  • Both FT and BT can achieve perfect unity perplexity but have the worst precision and recall;
  • ET achieves perfect recall but has the lowest precision and highest perplexity;
  • ST and CT are simple with good computational properties.  Between the two, ST has lower perplexity but CT has better recall.

Guo (1997c) concludes, “for applications with moderate performance requirement, ST is the choice;  otherwise, CT is the solution.”
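The precision and recall indicators are straightforward set ratios over the candidate tokenized strings for an ambiguous fragment.  The following is a direct transcription of the definitions above (not Guo's experimental code), with a made-up fragment for illustration:

```python
def scores(generated, correct):
    """Precision and recall of a tokenization scheme's output:
    precision = |generated & correct| / |generated|,
    recall    = |generated & correct| / |correct|."""
    generated, correct = set(generated), set(correct)
    hit = len(generated & correct)
    return hit / len(generated), hit / len(correct)

# A deterministic scheme proposes one string (unity perplexity) and may
# miss; ET proposes every candidate and gets perfect recall:
all_candidates = {"ab|cd", "abc|d", "ab|c|d"}
truth = {"ab|cd"}
assert scores({"abc|d"}, truth) == (0.0, 0.0)        # wrong single guess
assert scores(all_candidates, truth) == (1/3, 1.0)   # ET-style output
```

The two assertions mirror the trade-off in Guo's results: deterministic schemes minimize perplexity at the cost of recall, while ET maximizes recall at the cost of precision and perplexity.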

In addition to the above theoretical and experimental study, Guo (1997b) also develops a series of optimized algorithms for the implementation of these generation schemes.

The relevance and significance of Guo’s achievement to the research in this thesis lie in the following aspect.  The research on Chinese morpho-syntactic interface is conducted with the goal of  supporting Chinese morpho-syntactic parsing.  The input to a Chinese morpho-syntactic parser comes directly from the lexical lookup of the input string based on some non-deterministic tokenization scheme (W. Li 1997, 2000; Wu and Jiang 1998).  Guo’s research and algorithm development can help us to decide which tokenization schemes to use depending on the tradeoff between precision, recall and perplexity or the balance between reducing the search space and minimizing premature commitment.

1.1.2. Monotonicity Principle and Task-driven Segmentation

This section reviews the recent development on Chinese analysis systems involving the interface between morphology and syntax.  The research on the Chinese morpho-syntactic interface in this thesis echoes this new development in the field of Chinese NLP.

In the last few years, projects have been proposed for implementing a Chinese analysis system which integrates word identification and parsing.  Both rule-based systems and statistical models have been attempted with good results.

Wu (1998) has addressed the drawbacks of the conventional practice on the development of Chinese word segmenters, in particular, the problem of premature commitment in handling segmentation ambiguity.  In his A Position Statement on Chinese Segmentation, Wu proposed a general principle:

Monotonicity Principle for segmentation:

A valid basic segmentation unit (segment or token) is a substring that no processing stage after the segmenter needs to decompose.

The rationale behind this principle is to prevent premature commitment and to avoid repetition of work between modules.   In fact, traditional word segmenters are modules independent of subsequent applications (e.g. parsing).  Due to the lack of means for accessing sufficient grammar knowledge, they suffer from premature commitment and repetition of work, hence violating this principle.

Wu’s proposal of the monotonicity principle is a challenge to the Principle of Maximum Tokenization.  These two principles are not always compatible.  Due to the existence of hidden ambiguity (see 1.2.1), the PMT-based segmenters by definition are susceptible to premature commitment leading to “too-long segments”.  If the target application is designed to solve the hidden ambiguity problem in the segments, “decomposition” of some segments is unavoidable.

In line with the Monotonicity Principle, Wu (1998) proposes an alternative approach which he claims “eliminates the danger of premature commitment”, namely task-driven segmentation.  Wu (1998) points out, “Task-driven segmentation is performed in tandem with the application (parsing, translating, named-entity labeling, etc.) rather than as a preprocessing stage.  To optimize accuracy, modern systems make use of integrated statistically-based scores to make simultaneous decisions about segmentation and parsing/translation.”  The HKUST parser, developed by Wu’s group, is such a statistical system employing the task-driven segmentation.

As for rule-based systems, similar practice of integrating word identification and parsing has also been explored.  W. Li (1997, 2000) proposed that the results of an ET-based lexical lookup directly feed the parser for the hanzi-based parsing.  More concretely, morphological rules are designed to build word internal structure for productive morphology and non-productive morphology is lexicalized via entry enumeration.[1]  This approach is the background for conducting the research on Chinese morpho-syntactic interface for CPSG95 in this dissertation.
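The output of such an ET-based lexical lookup is naturally a word lattice over the hanzi string.  The sketch below is a hypothetical illustration of that lookup step (single hanzi are kept as candidate tokens by default so the parser can consider them):

```python
def lexical_lookup(s, lexicon):
    """ET-based lexical lookup: return all (start, end, word) edges
    over the hanzi string, i.e. the word lattice fed to the parser."""
    return [(i, j, s[i:j])
            for i in range(len(s))
            for j in range(i + 1, len(s) + 1)
            if j - i == 1 or s[i:j] in lexicon]

edges = lexical_lookup("个人的力量", {"个人", "力量", "的"})
assert (0, 2, "个人") in edges    # ge-ren 'individual'
assert (0, 1, "个") in edges      # ge CLA, kept for the parser to decide
```

Both the ge-ren reading of (1-3) and the ge | ren reading of (1-4) survive lookup; the choice between them is deferred to morpho-syntactic parsing, which is the point of the integrated design.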

The Chinese parser on the platform of multilingual NLPWin developed by Microsoft Research also integrates word identification and parsing (Wu and Jiang 1998).  They also use a hand-coded grammar for word identification as well as for sentential parsing.  The unique part of this system is the use of a certain lexical constraint on ET in the lexical lookup phase.  This effectively reduces the parsing search space as well as the number of syntactic trees produced by the parser, with minimal sacrifice in the recall of tokenization.  This tokenization strategy provides a viable alternative to the PMT-based tokenization schemes like CT or ST in terms of the overall balance between precision, recall and perplexity.

The practice of simultaneous word identification and parsing in implementing a Chinese analysis system calls for the support of a grammar (or statistical model) which contains sufficient information from both morphology and syntax.  The research on Chinese morpho-syntactic interface in this dissertation aims at providing this support.

1.2. Morpho-syntactic Interface Problems

This section presents a series of outstanding problems in Chinese NLP which are related to the morpho-syntactic interface.  One major goal of this dissertation is to argue for the proposed analyses of the problems and to provide solutions to them based on the analyses.

Sun and Huang (1996) have reviewed numerous cases which challenge the existing word segmenters.  As many of these cases call for an exchange of information between morphology and syntax, an appropriate solution can hardly be reached within the module of a separate word segmenter.  Three major problems at issue are presented below.

1.2.1. Segmentation ambiguity

This section presents the long-standing problem in Chinese tokenization, i.e. the resolution of the segmentation ambiguity.  Within a separate word segmenter, resolving the segmentation ambiguity is a difficult, sometimes hopeless job.  However, the majority of ambiguity can be resolved when a grammar is available.

Segmentation ambiguity has been the focus of extensive study in Chinese NLP for the last decade (e.g. Chen and Liu 1992; Liang 1987;  Sproat, Shih, Gale and Chang 1996; Sun and Huang 1996; Guo 1997b).  There are two types of segmentation ambiguities (Liang 1987; Guo 1997b):  (i) overlapping ambiguity:  e.g. da-xue | sheng-huo vs. da-xue-sheng | huo as shown in (1-1) and (1-2);  and (ii) hidden ambiguity:  ge-ren vs. ge | ren, as shown in (1-3) and (1-4).

(1-1.) 大学生活很有趣
da-xue         | sheng-huo          | hen          | you-qu
university    | life                     | very          | interesting
The university life is very interesting.

(1-2.)  大学生活不下去了
da-xue-sheng                 | huo          | bu | xia-qu      | le
university student          | live           | not  | down        | LE
University students can no longer make a living.

(1-3.)  个人的力量
ge-ren         | de   | li-liang
individual   | DE  | power
the power of an individual

(1-4.) 三个人的力量
san    |  ge            | ren           | de   | li-liang
three  | CLA          | person      | DE   | power
the power of three persons

These examples show that the resolution of segmentation ambiguity requires larger syntactic context and grammatical analysis.  There will be further arguments and evidence in Chapter II (2.1) for the following conclusion:  both types of segmentation ambiguity are structural by nature and require sentential analysis for their resolution.  Without access to a grammar, no matter how sophisticated its tokenization algorithm is, a word segmenter is bound to face an upper bound on the precision of word identification.  However, in an integrated system, word identification becomes a natural by-product of parsing (W. Li 1997, 2000;  Wu and Jiang 1998).  More precisely, the majority of ambiguity can be resolved automatically during morpho-syntactic parsing;  the remaining ambiguity can be made explicit in the form of multiple syntactic trees.[2]  But in order to make this happen, the parser requires reliable support from a grammar which contains both morphology and syntax.

1.2.2. Productive Word Formation

Non-listable words created via productive morphology pose another challenge (Sun and Huang 1996).  There are two major problems involved in this issue:  (i) problem in identifying lexicon-unlisted words;  (ii) problem of possible segmentation ambiguity.

One important method of productive word formation is derivation.  For example, the derived word 可读性 ke-du-xing (-able-read-ness: readability) is created via the morphology rules informally formulated below.

(1-5.) derivation rules

ke + X (transitive verb) –> ke-X (adjective, semantics: X-able)

Y (adjective or verb) + xing –> Y-xing (abstract noun, semantics: Y-ness)

Rules like the above have to be incorporated properly in order to correctly identify such non-listable words.  However, there has been little research in the literature on what formalism should be adopted for Chinese morphology and how it should be interfaced with syntax.
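The two rules in (1-5) can be sketched as functions over toy dictionary-style lexical entries.  This is only an illustration of how the rules compose (CPSG95 encodes such information as typed feature structures; the entry for du 'read' below is an assumption for the example):

```python
def derive_ke(stem):
    """Rule 1: ke + transitive verb -> adjective 'X-able'."""
    if stem["cat"] == "V" and stem.get("transitive"):
        return {"cat": "A", "gloss": stem["gloss"] + "-able"}
    return None

def derive_xing(stem):
    """Rule 2: adjective or verb + xing -> abstract noun 'Y-ness'."""
    if stem["cat"] in ("A", "V"):
        return {"cat": "N", "gloss": stem["gloss"] + "-ness"}
    return None

du = {"cat": "V", "transitive": True, "gloss": "read"}
ke_du = derive_ke(du)                 # 可读 'readable'
assert ke_du == {"cat": "A", "gloss": "read-able"}
ke_du_xing = derive_xing(ke_du)       # 可读性 'readability'
assert ke_du_xing == {"cat": "N", "gloss": "read-able-ness"}
```

Note that ke-du-xing is derived by feeding the output of the first rule into the second, which is why such non-listable words can be identified without being enumerated in the lexicon.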

To make the case more complicated, ambiguity may also be involved in productive word formation.  When the segmentation ambiguity is involved in word formation, there is always a danger of wrongly applying morphological rules.  For example, 吃头 chi-tou (worth of eating) is a derived word (transitive verb + suffix tou);   however, it can also be segmented as two separate tokens chi (eat) | tou (CLA), as shown in (1-6) and (1-7) below.

(1-6.)  这道菜没有吃头
zhe    | dao           | cai            | mei-you    | chi-tou
this    | CLA          | dish         | not have   | worth-of-eating
This dish is not worth eating.

(1-7.) 他饿得能吃头牛
ta       | e               | de             | neng        | chi  | tou           | niu
he      | hungry     | DE3         | can           | eat  | CLA                   | ox
He is so hungry that he can eat an ox.

To resolve this segmentation ambiguity, as indicated before in 1.2.1, the structural analysis of the complete sentences is required.  An independent morphology system or a separate word segmenter cannot handle this problem without accessing syntactic knowledge.

1.2.3. Borderline Cases between Morphology and Syntax

It is widely acknowledged that there is a remarkable gray area between Chinese morphology and Chinese syntax (L. Li 1990; Sun and Huang 1996).  Two typical cases are described below.  The first is the phenomena of Chinese separable verbs.  The second case involves interfacing derivation and syntax.

Chinese separable verbs are usually in the form of V+N and V+V or V+A.  These idiomatic combinations are long-standing problems at the interface between compounding and syntax in Chinese grammar (L. Wang 1955; Z. Lu 1957; Lü 1989; Lin 1983; Q. Li 1983; L. Li 1990; Shi 1992; Zhao and Zhang 1996).

The separable verb 洗澡 xi zao (wash‑bath: take a bath) is a typical example.  Many native speakers regard xi zao as one word (verb), but the two morphemes are separable.  In fact, xi+zao shares the syntactic behavior and the pattern variations with the syntactic transitive combination V+NP:  not only can aspect markers appear between xi and zao,  but this structure can be passivized and topicalized as well.  The following is an example of topicalization (of long distance dependency) for xi zao.

(1-8.)(a)       我认为他应该洗澡
wo     ren-wei        ta       ying-gai       xi zao.
I         think           he      should        wash-bath
I think that he should take a bath.

(b)      澡我认为他应该洗
zao    wo     ren-wei        ta       ying-gai       xi.
bath  I         think           he      should        wash
The bath I think that he should take.

Although xi zao behaves like a syntactic phrase, it is a vocabulary word in the lexicon due to its idiomatic nature.  As a result, almost all word segmenters output xi-zao in (1-8a) as one word while treating the two signs[3] in (1-8b) as two words.  Thus the relationship between the separated use of the idiom and the non-separated use is lost.

The second case represents a considerable number of borderline cases often referred to as  ‘quasi-affixes’.  These are morphemes like 前 qian (former, ex-) in words like 前夫 qian-fu (ex-husband), 前领导 qian-[ling-dao] (former boss) and -盲 mang (person who has little knowledge of) in words like 计算机盲 [ji-suan-ji]-mang (computer layman), 法盲 fa-mang (person who has no knowledge of laws).

It is observed that ‘quasi-affixes’ are structurally not different from other affixes.  The major difference between ‘quasi-affixes’ and the few generally honored (‘genuine’) affixes like the nominalizer 性 -xing (-ness) lies mainly in the following aspect.  The former retain some ‘solid’ meaning while the latter are more functionalized.  Therefore, the key to this problem seems to lie in the appropriate way of coordinating the semantic contribution of the derived words using ‘quasi-affixes’ to the building of the semantics for the entire sentence.  This is an area which has not received enough investigation in the field of Chinese NLP.  While many word segmenters have included some type of derivational processing for a few typical affixes, few systems demonstrate where and how to handle these ‘quasi-affixes’.

1.3. CPSG95:  HPSG-style Chinese Grammar in ALE

To investigate the interaction between morphological and syntactic information, it is important to develop a Chinese grammar which incorporates morphology and syntax in the same formalism.  This section gives a brief presentation on the design and background of CPSG95 (including lexicon).

1.3.1. Background and Overview of CPSG95

Shieber (1986) distinguishes two types of grammar formalism:  (i) theory-oriented formalism;  (ii) tool-oriented formalism.  In general, a language-specific grammar turns to a theory-oriented formalism for its foundation and a tool-oriented formalism for its implementation.  The work on CPSG95 is developed in the spirit of the theory-oriented formalism Head-driven Phrase Structure Grammar (HPSG, proposed by Pollard and Sag 1987).  The tool-oriented formalism used to implement CPSG95 is the Attribute Logic Engine (ALE, developed by Carpenter and Penn 1994).

The unique feature of CPSG95 is its incorporation of Chinese morphology in the HPSG framework.[4]  Like other HPSG grammars, CPSG95 is a heavily lexicalized unification grammar.  It consists of two parts:  a minimized general grammar and an information-enriched lexicon.  The general grammar contains a small number of Phrase Structure (PS) rules, roughly corresponding to the HPSG schemata tuned to the Chinese language.[5]  The syntactic PS rules capture the subject-predicate structure, complement structure, modifier structure, conjunctive structure and long-distance dependency.  The morphological PS rules cover morphological structures for productive word formation.  In one version of CPSG95 (its source code is  shown in APPENDIX I), there are nine PS rules:  seven syntactic rules and two morphological rules.

In CPSG95, potential morphological structures and potential syntactic structures are both lexically encoded.  In syntax, a word can expect (subcat-for or mod in HPSG terms) another sign to form a phrase.   Likewise, in Chinese morphology, a morpheme can expect another sign to form a word.[6]
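The shared expectation mechanism described above can be sketched in miniature.  This is a hedged illustration, not the ALE encoding of CPSG95: the dictionary-style signs, the `expects` feature and the `combine` function are all invented for exposition, but they show how one generic combination step can serve both a morpheme expecting a stem and a word expecting a complement.

```python
# Toy model of CPSG95-style lexical expectation (names are illustrative).
# Both morphological and syntactic combination go through the same step:
# a functor's "expects" feature is checked against the argument's category.

def combine(functor, argument):
    """Satisfy the functor's expectation with the argument, if compatible."""
    exp = functor.get("expects")
    if exp and exp["cat"] == argument["cat"]:
        return {"cat": exp["result"], "parts": (functor, argument)}
    return None  # expectation not satisfied

# morphology: suffix -xing expects an A-category stem to form an N
xing = {"cat": "suf", "expects": {"cat": "A", "result": "N"}}
ke_du = {"cat": "A"}                  # ke-du 'readable'
assert combine(xing, ke_du)["cat"] == "N"

# syntax: a transitive verb expects an NP complement to form a VP
xi = {"cat": "Vt", "expects": {"cat": "NP", "result": "VP"}}
zao = {"cat": "NP"}
assert combine(xi, zao)["cat"] == "VP"
```

The point of the sketch is only that nothing in the combination step itself distinguishes morphology from syntax; the division of labor is carried by the lexical entries.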

One important modification of HPSG in designing CPSG95 is to use an atomic approach with separate features for each complement to replace the list design of obliqueness hierarchy among complements.  The rationale and arguments for this modification are presented in Section 3.2.3 in Chapter III.

1.3.2. Illustration

The example shown in (1-9) demonstrates the morpho-syntactic analysis  in CPSG95.

(1-9.) 这本书的可读性
zhe    ben    shu    de      ke               du      xing
this    CLA   book  DE     AF:-able      read   AF:-ness
this book’s readability
(Note: CLA for classifier; DE for particle de; AF for affix.)

Figure 1 illustrates the tree structure built by the morphological PS rules and the syntactic PS rules in CPSG95.


Figure 1. Sample Tree Structure for CPSG95 Analysis

As shown, the tree embodies both morphological analysis (the sub-tree for ke-du-xing) and syntactic analysis (the NP structure).  The results of the morphological analysis (the category change from V to A and to N and the building of semantics, etc.) are readily accessible in building syntactic structures.
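The category chain in the morphological sub-tree can be traced with a toy derivation.  The rule tables and the `derive` function below are assumptions for exposition only (the real CPSG95 morphological PS rules also build semantics), but they reproduce the V-to-A-to-N path for ke-du-xing in (1-9).

```python
# Illustrative sketch of the derivation sub-tree for ke-du-xing:
# prefix ke- (AF:-able) turns the verb du 'read' into an adjective,
# and suffix -xing (AF:-ness) turns that adjective into a noun.
# Rule shapes are invented for exposition, not the CPSG95 rules.

PREFIX_RULES = {("ke", "V"): "A"}    # ke- + V   -> A  ('-able')
SUFFIX_RULES = {("xing", "A"): "N"}  # A + -xing -> N  ('-ness')

def derive(morphemes):
    """Derive the category of a [prefix, stem, suffix] sequence."""
    prefix, _stem, suffix = morphemes
    cat = "V"                             # du 'read' is a verb
    cat = PREFIX_RULES[(prefix, cat)]     # V -> A
    cat = SUFFIX_RULES[(suffix, cat)]     # A -> N
    return cat

assert derive(["ke", "du", "xing"]) == "N"   # ke-du-xing 'readability'
```

Because the derived N carries its category (and, in the real grammar, its semantics) in the same sign representation used by syntax, the NP rules can consume it directly, which is the interface point the figure illustrates.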

1.4. Organization of the Dissertation

The remainder of this dissertation is divided into six chapters.

Chapter II presents arguments for the need to involve syntactic analysis for a proper solution to the targeted morpho-syntactic problems.   This establishes the foundation on which CPSG95 is based.

Chapter III presents the design of CPSG95.  In particular, the expectation feature structures will be defined.  They are used to encode the lexical expectation of both morphological and syntactic structures.  This design provides the necessary means for formally defining Chinese word and the interface of morphology, syntax and semantics.

Chapter IV is on defining the Chinese word.  This is generally recognized as a basic issue in discussing the Chinese morpho-syntactic interface.  The investigation leads to a formalization of wordhood and a coherent, system-internal division of labor between morphology and syntax.

Chapter V studies Chinese separable verbs.  It discusses wordhood judgment for each type of separable verb based on its distribution.  The corresponding morphological or syntactic solutions are then presented.

Chapter VI investigates some outstanding problems of Chinese derivation and its interface with syntax.  It will be demonstrated that the general approach to Chinese derivation in CPSG95 works both for typical cases of derivation and for the two special problems, namely the ‘quasi-affix’ phenomena and zhe-suffixation.

The last chapter, Chapter VII, concludes this dissertation.  In addition to a concise review of what has been achieved, it gives an account of the limitations of the present research and of future research directions.

Finally, the three appendices give the source code of one version of the implemented CPSG95 and some tested results.[7]



[1] In line with the requirements by Chinese NLP, this thesis places emphasis on the analysis of productive morphology:  phenomena which are listable in the lexicon are not the major concern.  This is different from many previous works on Chinese morphology (e.g. Z. Lu 1957; Dai 1993) where the bulk of discussions is on unproductive morphemes (affixes or ‘bound stems’).

[2] Ambiguity which remains after sentential parsing may be resolved by using further semantic, discourse or pragmatic knowledge, or ‘filters’.

[3] In CPSG95 and other HPSG-style grammars, a ‘sign’ usually stands for the generalized notion of grammatical units such as morpheme, word, phrase, etc.

[4] Researchers have looked at the incorporation of the morphology of other natural languages in the HPSG framework (e.g. Type-Based Derivational Morphology by Riehemann 1998).  Arguments for the inclusion of morphological features in the definition of sign will be presented in detail in Chapter III.

[5] Note that ‘phrase structure’ in terms like Phrase Structure Grammar (PSG) or Phrase Structure rules (PS rules) does not necessarily refer to structures of (syntactic) phrases. It stands for surface-based constituency structure, in contrast to, say, dependency structure in Dependency Grammar.  In CPSG95, some productive morphological structures are also captured by PS rules.

[6] Note that in this dissertation, the term expect is used as a more generalized notion than the terms subcat-for (subcategorize for) and mod (modify).  ‘Expect’ is intended to be applied to morphology as well as to syntax.

[7]  There are differences in technical details between the grammar proposed in this dissertation and the implemented version.  This is because any implemented version was tested at a given time, while this thesis evolved over a long period.  It is the author’s belief that readers (including those who want to follow the CPSG practice) benefit most when a version that was actually tested is given as is.



PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

PhD Thesis: Chapter VII Concluding Remarks

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)


The Morpho-syntactic Interface in a Chinese Phrase Structure Grammar



Wei Li

B.A., Anqing Normal College, China, 1982

M.A., The Graduate School of Chinese Academy of

Social Sciences, China, 1986



Thesis submitted in partial fulfillment of

the requirements for the degree of

Doctor of Philosophy

in the Department



Morpho-syntactic Interface in a Chinese Phrase Structure Grammar


Wei Li 2000


November 2000



All rights reserved. This work may not be reproduced in whole or in part, by photocopy or other means, without permission of the author.



Name:                         Wei Li

Degree:                       Ph.D.

Title of thesis:             THE MORPHO-SYNTACTIC INTERFACE IN

A CHINESE PHRASE STRUCTURE GRAMMAR

(Approved January 12, 2001)



This dissertation examines issues related to the morpho-syntactic interface in Chinese, specifically those issues related to the following long-standing problems in Chinese Natural Language Processing (NLP): (i) disambiguation in Chinese word identification;  (ii) Chinese productive word formation;  (iii) borderline phenomena between morphology and syntax, such as Chinese separable verbs and ‘quasi-affixation’.

All these problems pose challenges to an independent Chinese morphology system or separate word segmenter.  It is argued that there is a need to bring in the syntactic analysis in handling these problems.

To enable syntactic analysis in addition to morphological analysis in an integrated system, it is necessary to develop a Chinese grammar that is capable of representing sufficient information from both morphology and syntax.  The dissertation presents the design of such a Chinese phrase structure grammar, named CPSG95 (for Chinese Phrase Structure Grammar).  The unique feature of CPSG95 is its incorporation of Chinese morphology in the framework of Head-Driven Phrase Structure Grammar.  The interface between morphology and syntax is then defined system internally in CPSG95 and uniformly represented using the underlying grammar formalism used by the Attribute Logic Engine.  For each problem, arguments are presented for the proposed analysis to capture the linguistic generality;  morphological or syntactic solutions are formulated based on the analysis.  This provides a secure approach to solving problems at the interface of Chinese morphology and syntax.


To my daughter Tian Tian

whose babbling accompanied and inspired the writing of this work

And to my most devoted friend Dr. Jianjun Wang

whose help and advice encouraged me to complete this work


First and foremost, I feel profoundly grateful to Dr. Paul McFetridge, my senior supervisor.  It was his support that brought me to SFU and the beautiful city Vancouver, which changed my life.  Over the years,  he introduced me into the HPSG study, and provided me with his own parser for testing grammar writing.  His mentorship and guidance have influenced my research fundamentally.  He critiqued my research experiments and thesis writing in many facets, from the development of key ideas, selection of topics, methodology, implementation details to writing and presentation style.  I feel guilty for not being able to promptly understand and follow his guidance at times.

I would like to thank Dr. Fred Popowich, my second advisor.  He has given me both general academic guidance on research methodology and numerous specific comments for the thesis revision which have helped shape the present version of the thesis as it is today.

I am also grateful to Dr. Nancy Hedberg from whom I have taken four graduate courses, including the course of HPSG.  I have not only learned a lot from her lectures in the classroom, but have benefited greatly from our numerous discussions on general linguistic topics as well as issues in Chinese linguistics.

Thanks to Davide Turkato, my friend and colleague in the Natural Language Lab.  He is always there whenever I need help.  We have also shared many happy hours in our common circle of Esperanto club in Vancouver.

I would like to thank Dr. Ping Xue, Dr. Zita McRobbie, Dr. Thomas Perry, Dr. Donna Gerdts and Dr. Richard DeArmond for the courses I have taken from them.  These courses were an important part of my linguistic training at SFU.

For various help and encouragement I have got during my graduate studies, I should also thank all the faculty, staff and colleagues of the linguistics department and the Natural Language Lab of SFU, in particular, Rita, Sheilagh, Dr. Ross Saunders, Dr. Wyn Roberts, Dr. Murray Munro and Dr. Olivier Laurens.  I am particularly thankful to Carol Jackson, our Graduate Secretary for her years of help.  She is remarkable, very caring and responsive.

I would like to extend my thanks to all my fellow students and friends in the linguistics department of SFU, in particular, Dr. Trude Heift, Dr. Janine Toole, Susan Russel, Dr. Baoning Fu, Zhongying Lu, Dr. Shuicai Zhou, Jianyi Yu, Jean Wang, Cliff Burgess and Kyoung-Ja Lee.  We have had so much fun together and have had many interesting discussions, both academic and non-academic.  Today, most of us have graduated, some are professors or professionals in different universities or institutions.  Our linguistics department is not big, but it is such a nice department where faculty, staff and the graduate student body form a very sociable community.  I have truly enjoyed my graduate life in this department.

Beyond SFU, I would like to thank Dr. De-Kang Lin for the insightful discussion on the possibilities of integrated Chinese parsing back in 1995.  Thanks to Gerald Penn, one of the authors of ALE, for providing the powerful tool ALE and for giving me instructions on modifying some functions in ALE to accommodate some needs for Chinese parsing during my experiment in implementing a Chinese grammar.

I am also grateful to Dr. Rohini Srihari, my current industry supervisor, for giving me an opportunity to manage NLP projects for real world applications at Cymfony.  This industrial experience has helped me to broaden my NLP knowledge, especially in the area of statistical NLP and the area of shallow parsing using Finite State Transducers.

Thanks to Carrie Pine and Walter Gadz from US Air Force Research Laboratory who have been project managers for the Small Business Innovation Research (SBIR) efforts ‘A Domain Independent Event Extraction Toolkit’ (Phase II), ‘Flexible Information Extraction Learning Algorithm’ (Phase I and Phase II) and ‘Intermediate-Level Event Extraction for Temporal and Spatial Analysis and Visualization’ (Phase I and Phase II).  I have been Principal Investigator for these government funded efforts at Cymfony Inc. and have had frequent and extremely beneficial contact with them.  With these projects, I have had an opportunity to apply the skills and knowledge I have acquired from my Ph.D. program at SFU.

My professional training at SFU was made possible by a grant that Dr. Paul McFetridge and Dr. Nick Cercone applied for.  The work reported in this thesis was supported in the later stage  by a Science Council of B.C. (CANADA) G.R.E.A.T. award.  I am grateful to both my academic advisor Paul McFetridge and my industry advisor John Grayson, CEO of TCC Communications Corporation of Victoria, for assisting me in obtaining this prestigious grant.

I would not have been able to start and continue my research career without many previous helps I got from various sources, agencies and people in the last 15 years, for which I owe a big prayer of thanks.

I owe a great deal to Prof. Zhuo Liu and Prof. Yongquan Liu for leading me into the NLP area and supervising my master program in computational linguistics at CASS (Chinese Academy of Social Sciences, 1983-1986).  Their guidance in both research ideas and implementation details benefited me for life.  I am grateful to my former colleagues Prof. Aiping Fu, Prof. Zhenghui Xiong and Prof. Linding Li at the Institute of Linguistics of CASS for many insightful discussions on issues involving NLP and Chinese grammars.  Thanks also go to Ms. Fang Yang and the machine translation team at Gaoli Software Co. in Beijing for the very constructive and fruitful collaborative research and development work.  Our collaboration ultimately resulted in the commercialization of the GLMT English-to-Chinese machine translation system.

Thanks to Dr. Klaus Schubert, Dr. Dan Maxwell and Dr. Victor Sadler from BSO (Utrecht, The Netherlands) for giving me the project of writing a computational grammar of Chinese dependency syntax in 1988.  They gave me a lot of encouragement and guidance in the course of writing the grammar.  This work enabled me to study Chinese grammar in a formal and systematic way.  I have carried over this formal study of Chinese grammar to the work reported in this thesis.

I am also thankful to the Education Ministry of China, Sir Pao Foundation and British Council for providing me with the prestigious Sino-British Friendship Scholarship.  This scholarship enabled me to study computational linguistics at Centre for Computational Linguistics, UMIST, England (1992).  During my stay in UMIST, I had opportunities to attend lectures given by Prof. Jun-ichi Tsujii, Prof. Harold Somers and Dr. Paul Bennett.  I feel grateful to all of them for their guidance in and beyond the classroom.  In particular, I must thank Dr. Paul Bennett for his supervision, help and care.

I would like to thank Prof. Dong Zhen Dong and Dr. Lua Kim Teng for inviting and sponsoring me for a presentation at ICCC’96 in Singapore.  They are the leading researchers in the area of Chinese NLP.  I have benefited greatly from the academic contact and communication with them.

Thanks to anonymous reviewers of the international journals of  Communications of COLIPS, Journal of Chinese Information Processing, World Science and Technology and grkg/Humankybernetik.  Thanks also to reviewers of the International Conference on Chinese Computing (ICCC’96), North American Conference on Chinese Linguistics (NACCL‑9), Applied Natural Language Conference (ANLP’2000), Text Retrieval Conference (TREC-8), Machine Translation SUMMIT II, Conference of the Pacific Association for Computational Linguistics (PACLING-II) and North West Linguistics Conferences (NWLC).  These journals and conferences have provided a forum for publishing the NLP-related research work I and my colleagues have undertaken at different times of my research career.

Thanks to Dr. Jin Guo who has developed his influential theory of tokenization.  I have benefited enormously from exchanging ideas with him on tokenization and Chinese NLP.

In terms of research methodology and personal advice, I owe a great deal to my most devoted friend Dr. Jianjun Wang, Associate Professor at California State University, Bakersfield, and Fellow of the National Center for Education Statistics in US.  Although in a totally different discipline, there has never been an obstacle for him to understand the basic problem I was facing and to offer me professional advice.  At times when I was puzzled and confused, his guidance often helped me to quickly sort things out.  Without his advice and encouragement, I would not have been able to complete this thesis.

Finally, I wish to thank my family for their support.  All my family members, including my parents, brothers and sisters in China, have been so supportive and understanding.  In particular, my father has been encouraging me all the time.  When I went through hardships  in my pursuit,  he shared the same burden;  when I had some achievement,  he was as happy as I was.

I am especially grateful to my wife, Chunxi.  Without her love, understanding and support, it is impossible for me to complete this thesis.  I wish I had done a better job to have kept her less worried and frustrated.  I should thank my four-year-old daughter, Tian Tian.  I feel sorry for not being able to spend more time with her.  What has supported me all these years is the idea that some day she will understand that as a first-generation immigrant, her dad has managed to overcome various challenges in order to create a better environment for her to grow.


Approval

Abstract

Dedication

Acknowledgments

Chapter I  Introduction

1.0. Foreword

1.1. Background

  • Principle of Maximum Tokenization and Critical Tokenization
  • Monotonicity Principle and Task-driven Segmentation

1.2. Morpho-syntactic Interface Problems

1.2.1. Segmentation Ambiguity

1.2.2. Productive Word Formation

1.2.3. Borderline Cases between Morphology and Syntax

1.3. CPSG95:  HPSG-style Chinese Grammar in ALE

1.3.1. Background and Overview of CPSG95

1.3.2. Illustration

1.4. Organization of the Dissertation

Chapter II  Role of Grammar

2.0. Introduction

2.1. Segmentation Ambiguity and Syntax

2.1.1. Resolution of Hidden Ambiguity

2.1.2. Resolution of Overlapping Ambiguity

2.2. Productive Word Formation and Syntax

2.3. Borderline Cases and Grammar

2.4. Knowledge beyond Syntax

2.5. Summary

Chapter III  Design of CPSG95

3.0. Introduction

3.1. Mono-stratal Design of Sign

3.2. Expectation Feature Structures

3.2.1. Morphological Expectation

3.2.2. Syntactic Expectation

3.2.3. Chinese Subcategorization

3.2.4. Configurational Constraint

3.3. Structural Feature Structure

3.4. Summary

Chapter IV  Defining the Chinese Word

4.0. Introduction

4.1. Two Notions of Word

4.2. Judgment Methods

4.3. Formal Representation of Word

4.4. Summary

Chapter V  Chinese Separable Verbs

5.0. Introduction

5.1. Verb-object Idioms: V+N I

5.2. Verb-object Idioms: V+N II

5.3. Verb-modifier Idioms: V+A/V

5.4. Summary

Chapter VI  Morpho-syntactic Interface Involving Derivation

6.0. Introduction

6.1. General Approach to Derivation

6.2. Prefixation

6.3. Suffixation

6.4. Quasi-affixes

6.5. Suffix zhe (-er)

6.6. Summary

Chapter VII  Concluding Remarks

7.0. Summary

7.1. Contributions

7.2. Limitation

7.3. Final Notes

BIBLIOGRAPHY

APPENDIX I    Source Code of Implemented CPSG95

APPENDIX II  Source Code of Implemented CPSG95 Lexicon

APPENDIX III  Tested Results in Three Experiments Using CPSG95





Chinese has an extremely common sentence pattern which on the surface looks like a subject-predicate relation (S-Pred) or Topic+Clause, but which in fact expresses a condensed form of a conditional, quasi-subjunctive causal relation; usually the first part is a VP or clause and the second part is a Pred, occasionally a clause.  Consider giving this relation a special name: not S, not Next, not Conj.  What should it be called?  In essence it is a conditional adverbial, Cond-Adv.

他要来就好了 == [if] 他要来 [then it] 就好了 == 如果他要来,[那] 就好了。 (If he comes, that would be great.)

LOCK状态下连按两下HOME可以快速启动照相机 ==
[if] LOCK状态下 [你] 连按两下HOME [then 你就] 可以快速启动照相机
(In the LOCK state, double-press HOME to quickly launch the camera == [if] in the LOCK state [you] double-press HOME, [then you] can quickly launch the camera.)

喝粥吃不饱 <-- [if 我们] 喝粥 [then 我们就] 吃不饱 (Porridge will not fill you up: [if we] eat porridge, [then we] won't be full.)

This pattern does seem to have some function words that trigger it, but they are very subtle and take many forms, hard to pin down.  The sentences above are off-the-cuff examples; the triggers in them probably include 就 'then', 可以 'can', 也 'also', and the like.  喝粥吃不饱 seems related to the resultative complement 不饱 'not full'.  There are also cases with a clause on both sides:

她来我走 == [if] 她来 [then] 我 [就] 走。 (If she comes, I leave.)
她不来我也走 <-- [if] 她不来 [then] 我也 [仍然要] 走 (Even if she does not come, I will still leave.)


@wei 要来 could also be a contraction of 要是来 'if (he) comes'.

The internal logical relations of these compressed complex sentences are context-dependent.  Someone who works himself to the bone and neglects his health gets criticized as 要钱不要命, i.e. '(in order to) get money, (he) disregards (his own) life'; but from a robber's mouth it becomes '(if you) want the money, (then you) don't want your life', what a robber says to his victim.  It could also be '(I) (only) want the money, not (your) life.'  This is ambiguity at the level of complex-sentence semantics.

The melon vendor says 不甜不要钱 'if it's not sweet, no charge'.  It is plainly ambiguous, yet nobody adopts the other reading; doing so would look like quibbling, e.g. '(it's) not sweet AND (I) don't want money'.



This stuff is a real difficulty: hard to recognize, yet failing to recognize it has bad consequences.  For example, if the semantic grounding is sentiment extraction, more than half of such sentences are essentially subjunctive, describing non-facts that should not be extracted.
'x 不好不要钱' ('if x is no good, no charge') is not an evaluation that x is bad but a subjunctive conditional 'if x is no good', and a condition carries no sentiment.

Not necessarily subjunctive.  '(I) (only) want the money, not (your) life.' is not subjunctive.

'x 不降价就不要买' ('if x doesn't cut its price, don't buy it') likewise does not assert whether x has cut its price or not.  A price cut for x is generally good news, and 'don't buy' is generally an act against x.  'x 不增加电池寿命就不喜欢了' ('if x doesn't improve battery life, (I) won't like it anymore'): the speaker probably liked x before, but there is an implicit complaint here, and its cause is battery life.  Cases like these all require recognizing the conditional before sentiment and its cause can be judged accurately.

How did Chinese come to develop such an expression, avoiding precisely the explicit function words 如果, 要是, 假如, 倘若 'if'?  And it is especially common in speech.  When English drops if, it must use inverted word order, which still leaves an explicit formal trace:
Had I done that I would have lost
Should they get there in time they would make it

Very nice.  For the Chinese generation, the MT system used the explicit subordinating conjunction 如果 'if'.

It is only for familiar matters that people use contractions and colloquialisms, and that is when grammatically ambiguous sentences appear, because everyone tacitly knows what is meant; they are too familiar with it.  That is what context does.  Less familiar matters lean toward written style, where ambiguity is rare.  So I don't think the ambiguity problem is that serious: when you run into one, just memorize the special case.  English surely has plenty of elliptical colloquialisms in daily life too.  It is human nature to abbreviate the most common things; everyone understands anyway.

What can be memorized can be lexicalized, but open-ended patterns are hard to put into a dictionary; rote listing does not work.  The omission of function words in Chinese is, overall, a challenge for the syntactic level, not something the lexicon can solve.  Besides subordinating conjunctions, omitted prepositions are also extremely common:

== [关于 'regarding'] 这件事儿 [依 'according to'] 我的意见 [,] 大家还是要往前看
(On this matter, in my opinion, we should all look forward.)

Translated into English, these prepositions can by no means be omitted, or the result is Chinglish: on this matter, in my opinion, ...

* this matter my opinion we all should look forward

关于这件事我的意见是 'Regarding this matter, my opinion is...'.  Couldn't it be an omission of 是 'is'?  Spoken English surely has ellipsis too.





Without function words, slots still get filled.  But when there are multiple filling possibilities, trouble starts.  In situations involving fair trade, both digging the slot and filling it are devices of implicit form.  If you adopt the interpretation that clearly disadvantages your counterpart in the deal, the move is too low.  不甜不要钱 'if it's not sweet, no charge' is exactly such a situation.

When explicit forms are available to exploit, reliance on implicit forms (subcat and the like) is reduced.
Pragmatics is another level and can be set aside at first.  Parsing out the if-then frame is enough.

如果…则 'if...then' favors the counterpart; '... and ...' disfavors the counterpart.

I plan to add a CondAdv link to the links, carving out of the current S, Next and Conj a subset that expresses this kind of condition.  Next narrows from a default to pure [succession]; Conj becomes [coordination], S [subject], and CondAdv [condition].


There is also 不破不立 'no destruction, no construction': without destroying, how can one build?  And 不学则无术 'if one does not study, one has no skill'.

Really it is not 'then' but 'because...therefore'.

[because-therefore] and [if-then] are already quite close; we can find a supertype that covers both and blur them together.  Both are completely different from [coordination].

不学无术 read as [coordination] also makes good sense.  [if-then] is subjunctive, while [because-therefore] may or may not be realized: unless the verb inside carries an aspect particle, because-therefore does not insist on a realized event.

Lin Biao was both unlearned and unskilled: 'doesn't read books, doesn't read newspapers, doesn't talk an ounce of Marxism-Leninism.'  Lines like that were common during the campaign against Lin.
'不甜不要钱,甜也不要钱' ('if it's not sweet, no charge; even if it is sweet, no charge'): the former omits if, the latter omits even if, making a concession, yet the only formal marker is the nearly all-purpose 也 'also/still'.

In a pattern like [不 VP 不 Pred(result)], a predicate such as 罢休 'give up' can be marked in the lexicon as implicitly carrying a resultative flavor.
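As a rough illustration of how such a trigger pattern might be captured, here is a minimal sketch.  The regex and the function are assumptions invented for this note; a real system would need word segmentation, subcat checks and the other triggers discussed above (就, 也, 可以), none of which this toy covers.

```python
# Toy detector for the [bu VP bu Pred] condensed conditional
# (e.g. 不甜不要钱 'if it's not sweet, no charge').
# Surface-level only: it merely exposes the if-then frame.
import re

CONDENSED = re.compile(r"^不(?P<cond>[^不]+)不(?P<cons>.+)$")

def as_if_then(clause):
    """Map a [bu X bu Y] clause to an (if-not-X, then-not-Y) frame."""
    m = CONDENSED.match(clause)
    if not m:
        return None
    return ("if not " + m.group("cond"), "then not " + m.group("cons"))

assert as_if_then("不甜不要钱") == ("if not 甜", "then not 要钱")
assert as_if_then("他要来就好了") is None  # other triggers not covered here
```

The design point is that the conditional frame, once recognized, marks the clause as non-factual, which is exactly what the sentiment discussion above needs.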


it won’t be good that you are not coming
it won’t be good if u r not coming

This type of pattern is limited to judgment predicates such as not good or does not work.  Sometimes I wonder: if we translate the Chinese condensed conditional literally into English, ungrammatical as it is, foreigners might still understand it (to be verified; if it is understandable, that would show the condensed form still carries traces of the logical chain even without the help of explicit function words):

? u come, I will leave
* u not come, i stay
* not work not succeed
* no work no success
* not leave won’t work




if you/we/one/anyone/everyone/anyone/they/people do not work
you/we/one/they/he/she should not get food

English is forced to fill in these senseless subjects in both the main clause and the conditional clause; the Chinese condensed form simply drops them, and by default the statement is a universal truth applying to everyone.  Most cumbersome of all, for political correctness one often has to fill these meaningless subjects as 'he or she' or 's/he': if a person does not work, he or she should not get food.

'no hard work no success' is probably the most Chinese-like condensed pattern that English accepts.

@wei What form does the content generated by your sentence parsing take?  How do you represent the semantics obtained from parsing?

tree (& roles) / IE Templates




On parsing






Joining the fun of the essay call [A Dream Come True], some netizens could not figure out what the dream was or how it 'came true'.  Which shows that the so-called dream I was so excited about failed not only to be understandable to an old granny, but even to be clear to scientists; as popular science, a thorough failure.



Thousands of years of inquiry produced something called grammar, with which the inner regularities of language can be summarized, so that infinitely varied sentences can be analyzed into a finite set of construction types, aiding the understanding and mastery of language.  The innate human ability to understand language thus appears to follow discernible tracks.  This is what our grammar teachers taught us at school, in particular a technique for drawing structural diagrams of sentences (grammar diagramming), analyzing out subject, predicate, object, attributive, adverbial and the structural relations among them, which proved a very useful skill for language analysis.  All of this was originally meant to strengthen our language proficiency.

After computers appeared, AI scientists thought of teaching human language to machines, a field called Natural Language Understanding (NLU), whose core is automatic analysis (parsing) of human language, with results typically represented as tree diagrams like those from grammar class.  Automatic language analysis matters greatly: it is the core technology of language processing.  A high-quality, noise-tolerant (so-called robust) parser that can run over big data is a nuclear weapon.  With such automatic analysis, many language tasks can be supported: human-machine dialogue, machine secretaries, information extraction, public-opinion mining, automatic summarization, machine translation, hot-topic tracking and so on.  (There are also many everyday language-processing applications, such as keyword search, spam filtering, text classification, authorship identification, and even summarization and machine translation, that involve no analysis and no understanding: language is treated as a black box, the task is defined as an input-to-output mapping through the box, and statistical models are used to learn an approximation; this can go quite far.  These approximate methods, which bypass structure and understanding, are in fact the mainstream, dominant practice, thanks to advantages such as robustness.)



[11] Fang Jinqing  2013-10-17 15:04

Blogger's reply (2013-10-17 19:18):

The mission impossible accomplished.




  • This is a real-life story of a research dream coming true after a quarter of a century.  When the protagonist was an assistant researcher, full of enthusiasm and oblivious to his own limits, he drew up a Natural Language Understanding (NLU) blueprint for Modern Chinese, one of the subtlest languages in the world, with automatic grammatical analysis of infinitely varied Chinese sentences at its core.  The blueprint was far removed from reality, and its realization seemed indefinitely remote, beyond human power.  Yet a quarter of a century later, through accumulation plus opportunity, the right time and the right place, the protagonist finally realized the ideal, which is now being put to use in real-world big-data applications.  The mission impossible accomplished.







(1) The blueprint of 25 years ago (the dream):

(2) The blueprint of 25 years ago (the dream):

(4) The blueprint of 25 years ago (the dream):

(5) The blueprint of 25 years ago (the dream):




胶合板是把原木旋切或刨切成单片薄板, 经过干燥、涂胶, 并按木材纹理方向纵横交错相叠, 在加热或不加热的条件下压制而成的一种板材。 (Plywood is a board made by rotary-cutting or slicing logs into thin veneers which, after drying and gluing, are stacked with the grain running crosswise in alternate layers and pressed together with or without heat.)



First draft (2012-10-13): ScienceNet, 'Wei's Notes: The Heart of Chinese, as if in My Own Home' (立委随笔:中文之心,如在吾庐)

'Chinese Dependency Grammar: Retrieving an Old Text' (汉语依从文法:维文钩沉; an old work from 25 years ago; in a browser, choose the GB national-standard encoding to avoid garbled text and distorted graphics):

'Wei's Science Popularization: The Beauty of Syntactic Trees (with English examples)' (立委科普:语法结构树之美)





On parsing







"终于冰箱安装到位了, 欣喜之余发现有点儿小问题, 就联系了店家, 店家主动帮助联系客服上门查看, 虽然最终没有解决问题, 心里有点儿遗憾, 但是因为不影响使用, 所以也就无所谓了."  (Finally the fridge was installed.  Amid the joy I found a small problem, so I contacted the seller; the seller proactively helped arrange a customer-service home visit.  Although in the end the problem was not solved, which left me a little regretful, it does not affect use, so it does not really matter.)  This sentence is complex enough; at the moment it parses like this:

Beyond the subject link, the semantic middleware also made 店家 'the seller' the logical object of 主动帮助 'proactively help', an overkill: it assumed the object slot in the subcat of 帮助 'help' was not saturated, but a verbal object (ObjV) counts as an object too; a small adjustment can fix this.
The last error is long-distance: 虽然 'although' should have found 但是 'but', a strong collocation, but several clauses stand in the way.  The clauses before 但是 are no problem; with a strong collocation we can gallop past them, whip in hand.  But after 但是 comes another 因为…所以 'because...therefore', and this nesting is annoying: the landing site of 但是 is therefore not the first clause but the second, the one headed by 所以.  In other words, the human reading is that the concessive clause introduced by 虽然 attaches long-distance to the final 无所谓 'doesn't matter'; only then do syntax, semantics and logic line up.  Even casually written social-media sentences show this kind of intricate clause nesting with long-distance dependencies (the poster is probably an educated type; a rough-and-ready writer would not pile up 因为…所以 and 虽然…但是, nested at that).  Finally, 联系客服上门查看 'contact customer service to come and take a look' has a bug from an incomplete subcat entry in the lexicon, a small case, easy to fix.  Small bugs are de-ed:


这OV的 --> 瞧人家这OV的 (extending the pattern 这OV的 'this OV-ing' to 瞧人家这OV的 'look at their OV-ing')










On parsing





Old Mao, the ten years of the Cultural Revolution, 1966-1976, were my ten years of primary and secondary school.  Wasn't that hard on me?  Less than half of those ten school years went to academic classes; the rest was 'learning from workers, peasants and soldiers', training as a barefoot doctor, learning to drive a walking tractor.
Why is the reading [十年中][学文化课] 'in ten years / study academic classes' and not [十年中学][文化课] 'ten-year middle school / academic classes'?

@wei For this sentence alone, it truly is ambiguous both ways.  But you have so many occurrences of 学 'study' afterwards...

How so?  Because 学 is frequent, forming the word 中学 'middle school' is disfavored?  How would a statistical model work on this case and show its merit?  I would like to hear the details.
Big data says there are 五年中学 'five-year middle schools' and 六年中学 'six-year middle schools' but hardly ever 十年中学 'ten-year middle school', reflecting common knowledge about school terms.  But this knowledge is weak and hardly counts, because it is not positive evidence.  If a sentence had a boundary dispute at 六年中学 and received direct support from big data, that would be positive evidence with real force.  Negative evidence does not help, because it faces the sea of [not six] (or [not five]), theoretically boundless, and that bit of evidence was drowned long ago.

Statistics divides into long-term/global vs short-term/local.

Some of today's hot 'deep neural' work, intentionally or not, leans more toward the latter.  For example, LSTM, the 'jewel in the crown' of deep neural networks, is Long Short-Term Memory.  Although it does not explicitly compute and exploit 'on-the-fly statistics', you can still sense that flavor.

@Guo Right.  The relation between local and global here is quite tricky.


N+N scores low to begin with; a reading with an adverbial and a verb is more 'canonical'.  N+N compounding is a last resort for mopping up leftovers when nothing else works; when an adverbial and a verb are available, who bothers with N+N?  However many years of schooling, it cannot outweigh that structural factor.  That is to say, rules are all rules, but some rules belong in the parlor and some only in the kitchen.  When no parlor rule applies, the kitchen rules may thrash about as they please; but when a parlor rule applies, nobody goes down to the kitchen.

The issue here is not just N+N.  Most segmentation modules never get as far as N+N, so this case may actually challenge quite a few existing segmenters: 十年/中学/文化课 or 十年中/学/文化课?
A common segmentation heuristic prefers the path whose syllable counts are more even, and the former is obviously far more even than the latter.
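The evenness heuristic just mentioned is easy to sketch.  The variance-based scoring below is one assumed formulation of 'prefer the more even path', invented for illustration; under it, 十年/中学/文化课 does beat 十年中/学/文化课.

```python
# Toy version of the even-syllable segmentation heuristic:
# among competing segmentations, prefer the path whose word lengths
# (in characters/syllables) vary the least.

def evenness_penalty(path):
    """Sum of squared deviations of word lengths; lower is more even."""
    lengths = [len(w) for w in path]
    mean = sum(lengths) / len(lengths)
    return sum((n - mean) ** 2 for n in lengths)

path_a = ["十年", "中学", "文化课"]   # lengths 2-2-3
path_b = ["十年中", "学", "文化课"]   # lengths 3-1-3
assert evenness_penalty(path_a) < evenness_penalty(path_b)
```

As the surrounding discussion notes, such a heuristic is only a tiebreaker; a structural ('parlor') rule, when available, overrides it.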

If syntax is to talk about hierarchy, it is likewise 狗/咬吕洞宾 'dog / bites Lü Dongbin', not 狗咬/吕洞宾.

How syntax treats hierarchy is actually beside the point, because multi-level segmentation is just an engineering tactic that (usually) does not itself take part in parsing; as long as the final result is 狗/咬/吕洞宾, that is fine.  For that matter, even in syntax the SVO hierarchy is rather controversial for Chinese; the evidence for a V+NP constituent is not nearly as solid as in Western languages.


The current interface, in most systems, is like this: the segmentation output carries no hierarchy, although segmentation can and should use hierarchy internally.  There surely are research systems that reject this interface, but the majority of practical systems seem to be just this simple.


Adding a dimension to the data structure touches quite a lot in a traditional system.  A word is then no longer just a word; a word itself ceases to be a simple object.  In older systems the word stream was just a string, or at most a token+POS list; adding a dimension to so simple a structure is still manageable.

Words, like phrases, can lock and unlock positions, competing for the lock on a position.

True: the word is the birthplace of all potential structure and harbors great potential; in the design, the lexicon should even be able to build structure in, integrated with the parsing machinery.  Under that design philosophy, adding dimensions to words is dancing in shackles, not easy to handle well.  Nondeterministic is a nice-sounding but not very workable strategy; otherwise, in theory, no hibernation and wake-up would ever be needed.


Nondeterministic structures under a multi-level system are like Pandora's box: releasing the ghosts is easy, subduing them is hard, and the more levels, the truer this is.  Perhaps the machine-learning camp is not afraid, since it is not humans who must subdue the ghosts.

That was Chairman Mao's line: great chaos under heaven leading to great order under heaven.  Ten years of Cultural Revolution chaos brought the national economy to the brink of collapse, yet unlike 1960 it never completely collapsed; besides dumb luck, there was an absolute authority in place, cold-blooded and ready to turn on anyone: the Red Guard rebels, red as the sky one day, were in prison the next.

Systems like that are mostly impossible to debug.  By the time a human sees the result, the outcome is already settled, good or bad.  As Stalin said, the victors are not to be judged.

The ghosts do fight by rules that humans laid down, but the details are hard to trace and therefore hard to correct.  Of course this fault is nothing new; it is the common disease of every black-box strategy.

As Diao Deyi sings in the opera, only at this point does the tea acquire its flavor.  Looking forward to Bai's design.










The logical subjects of 围 'wrap' and 做 'serve as' are missing.  One reason is that the subcats of these two verbs do not themselves call for 她 'she' [human] or 海草 'seaweed' [physical object].  The semantic middleware currently follows a conservative strategy, because logical slot-filling creates something out of nothing; better to leave a slot empty than to fill it wrongly, rather underkill than overkill, precision first.

How does human understanding work here?  For bare 围 'wrap' it is hard to say, but the VP 围在身上 'wrap around the body' inherits from 身上 'body' an unfilled [human] slot, which 她 'she' fills nicely as the logical subject.  Likewise 做 'serve as' is an all-purpose verb without semantically specific slots of its own, but the VP 做装饰物 'serve as an ornament' (act as NP) digs an appositive semantic slot [physical object] that 海草 'seaweed' can fill: [human] 把/用 [physical object] 围在身上 'wraps [physical object] around the body'; [physical object] 做装饰物 'serves as an ornament'.
The syntactic subject of 围在身上 can be [human] or [physical object]: 一根漂亮的海草围在身上 'a beautiful strand of seaweed wrapped around the body'.  But in the underlying logical semantics, the logical subject is always [human].
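The inheritance idea can be put in miniature code.  Everything here, the feature names, the tiny lexicon and the helper functions, is invented for illustration; the sketch only shows the shape of the mechanism: the VP's open logical-subject slot comes from the complement noun, and a candidate filler is checked against it.

```python
# Toy sketch of slot inheritance: the bare verbs 围 'wrap' and 做 'serve as'
# impose no semantic requirement on a logical subject, but the VP they head
# inherits an unfilled slot from the complement (身上 'body' contributes
# [human]; 装饰物 'ornament' contributes [physical-object]).

LEXICON = {
    "身上": {"cat": "NP", "unfilled": "human"},
    "装饰物": {"cat": "NP", "unfilled": "physical-object"},
}

def vp_slot(verb, complement):
    """The VP's open logical-subject slot is inherited from the complement."""
    return LEXICON[complement].get("unfilled")

def fill(slot, candidate_type):
    """Check a candidate filler's semantic type against the open slot."""
    return slot == candidate_type

# 她 [human] fills the slot of 围在身上; 海草 [physical-object] fills 做装饰物
assert fill(vp_slot("围", "身上"), "human")
assert fill(vp_slot("做", "装饰物"), "physical-object")
```

The conservative policy described above corresponds to `fill` returning False whenever the types do not match: the slot is simply left open rather than filled wrongly.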


围 belongs to a subclass of verbs with an 'attach, fasten' subcat; when topicalized, it can on its own denote the state left after the initial action is completed.  Topicalization: the fastened object serves as topic.

And 海草 'seaweed' can be viewed either as an [instrument] (subsuming [material] adverbials) or as the [patient] of 围 inside the VP 围在身上.


In practice, semantic analysis of this class of Chinese monosyllabic verbs at such a fine grain is very challenging.  They are too polysemous; only after forming compound verbs, or even whole VPs, do they gradually shed their ambiguity and settle down.  This dynamic determination and filling of subcats is quite arduous, if not impossible.



Again, the logical subject (deep appositive) of 做 fails to connect to 房子 'house'.


SVO is all in place, yet the object of the main clause is cut off.  Covering the tail while exposing the head; this needs a good round of debugging:






On parsing




'Zhaohua Wushi: The Days When I Grew Up Alongside the Women Militia' (朝华午拾:与女民兵一道成长的日子)

Many years passed, yet the old team leader's singing stayed in my memory, though I never found out where the song came from.  Then last year, a new song on my daughter's iPod seized me at once.  It was not the old team leader's song, of course, but its melody and spirit were uncannily similar, and it revived the song long buried in my heart.  Whenever it plays, I sink into reverie: the old team leader's face and figure, the clear wind and bright sun of the open countryside, the simple, unhurried rhythms of farm life and labor rise before my eyes.  I asked my daughter what the tune was.  She gave me that look reserved for country bumpkins: it's Akon, the wildly popular 'Don't Matter'.  Released in 2007, the song was soon in heavy radio rotation and topped the chart for two straight weeks.  I was delighted, and also amazed that, across a thousand mountains and rivers, a mysterious old Chinese folk tune could chime so well with an African-American song.  I even vaguely saw in Akon himself the figure of my lean, dark, capable old team leader.
Akon “Don’t Matter”




As for the original song, I have by now forgotten the exact melody; only the intoxicating impression remains.  Having settled on Akon, I am not sure I could recognize the real tune even if it reappeared today.  So let it be Akon.  The memory has been externalized and given a place to rest, which is good.







Parsing is the best game, and it is useful too.

They say that fun games are useless and that useful things cannot be made into games.  But for AI people, parsing is exactly that game: the most fun and the most useful.  To indulge in it is to live well, and to die content.


Tiresome?  No, it is great fun.  We get to play with what is acknowledged as a world-class problem for humankind in AI, with the summit in sight; what is there to be weary of?
What is tiresome is an easy language: you work and work until there is nothing left to do.  That is tiresome.  English has become a bit tiresome to work on.  Chinese is not: plenty of territory has not yet submitted, and taking a city or a hilltop feels like a general winning a campaign, deeply satisfying.




Parsing is the best game.  First cast a default net and sweep in as much as possible.  It hardly counts as an 'elegant rule set'; it is a guerrilla strategy, nothing elegant about it, more like staking claims in the age of primitive accumulation: grab all you can.  Then begins the lexicalist assault on precision, which is the real work of the Foolish Old Man moving the mountain.  Set up a dynamic channel of communication between the default strategy and the lexicalist strategy, and the whole game comes alive.

Parsing is the best game.  On the one hand, it is not really the endless mountain the Foolish Old Man faced, though the monster does look frightening: in the large, the structure can be fathomed, even if one can tangle with the details forever.  On the other hand, it is an acknowledged world-class problem for humankind.  Many say that natural language understanding (NLU) is the ultimate problem of artificial intelligence (AI), and deep parsing is the acknowledged, obligatory path to NLU, comparable in importance to Chen Jingrun's '1+2' on the climb toward the Goldbach conjecture.  Our generation will not forget Xu Chi's flowery prose when the 'springtime of science' arrived thirty-odd years ago: 'The queen of the natural sciences is mathematics.  The crown of mathematics is number theory.  The Goldbach conjecture is the jewel in the crown.  ...  Now, only one step remains to the jewel in the crown.'  (As one of the last 'educated youth' of the Mao era, I read Xu Chi's book-length reportage on the Goldbach conjecture while riding a tractor along the bumpy mountain road back to the county seat, finishing it in one sitting, dizzy and bleary-eyed yet thrilled.)

Even Lin Biao, a rare genius, had his bout of pessimism, asking how long the red flag could keep flying.  But for deep parsing we can now say plainly: the red flag is within reach of the summit, one year at the shortest, three to five at the longest.  Summiting can be defined as roughly 95% precision and coverage (f-score, near-human performance) on open-domain formal text.  In other words, the structural analysis already surpasses the average person's and is only slightly inferior to a linguist's.  For English, for example, we summited five or six years ago.

What matters most is that parsing is genuinely useful; calling it the nuclear weapon of natural language applications is no exaggeration. With it or without it, the work is utterly different. Shallow parsing lets one do the work of ten; with deep parsing, one does the work of a hundred or more. In other words, this is a technology that is already mature (90+ precision can be considered mature) and of almost unlimited potential.

@wei Your dedication to parsing is admirable.

Thanks for the encouragement. Whether parsing finally lands does not hinge on a three-to-five-point technical gap, but on having a good product manager: one who knows the market and the customers, and who also appreciates and understands the technology's potential.


Quantity turns into quality. Above 90, a four-to-five-point difference may not matter much to the product or the customer. But ten-plus points is another matter entirely. For example, in open-domain social media sentiment analysis, our precision using deep parsing support runs nearly 20 points higher than competitors using machine learning. The results are worlds apart. The dashboards may look equally flashy, but when one actually tries to use the sentiment data for concrete analysis and decision support, such a gap cannot be papered over. Big-data statistical filtering can tolerate some errors, but it cannot tolerate a system at only 60-70 precision.

Of course, some customers only want dashboards to follow the fashion, rather than using the insights to adjust marketing strategy or inform decisions; for such customers, precision and quality are less persuasive than a product that is convenient, fancy, and cheap. And such customers are at present not few. In that case even rock-solid technology cannot exert its strength. This is really a sign that the market is not yet mature: after embracing big data became the trend, the market's capacity to digest, discriminate, and apply the results has not caught up. Seen from this angle, the North American market is clearly more mature than the Chinese one.



泥沙龙笔记:parsing 是引擎的核武器,再论NLP与搜索

泥沙龙笔记:从 sparse data 再论parsing乃是NLP应用的核武器

It is untrue that Google SyntaxNet is the “world’s most accurate parser”








The mainstream sentiment approach simply breaks in front of social media

I have articulated this point in various previous posts and blogs, but the world is so dominated by the mainstream that the message does not seem to carry.  So let me state it simply:

The sentiment classification approach based on the bag-of-words (BOW) model, so far the dominant approach in the mainstream of sentiment analysis, simply breaks in front of social media.  The major reason is simple: social media posts are full of short messages which do not have the "keyword density" a classifier requires to make a proper sentiment decision.   Larger training sets cannot remedy this fundamental defect of the methodology.  The precision ceiling for this line of work on real-life social media is found to be around 60%, far below the widely acknowledged minimum of 80% precision for a usable extraction system.  Trusting a machine learning classifier to perform social media sentiment mining is not much better than flipping a coin.
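The keyword-density point can be made concrete with a toy sketch (the lexicon, weights, and scoring scheme below are made up for illustration; no production BOW classifier is this simple):

```python
# Toy bag-of-words sentiment scorer: sums the sentiment weights of
# known keywords. Lexicon and weights are invented for illustration.
LEXICON = {"love": 1.0, "great": 1.0, "hate": -1.0, "broken": -1.0}

def bow_score(text):
    words = text.lower().split()
    hits = [LEXICON[w] for w in words if w in LEXICON]
    # With no sentiment keywords at all, the classifier has nothing
    # to go on -- the typical case for short social media posts.
    return sum(hits) / len(hits) if hits else None

print(bow_score("this product is great , i love it"))  # 1.0
print(bow_score("well that went as expected ..."))     # None: no signal
```

A long review usually contains several lexicon hits, so the average is informative; a short (often sarcastic) post contains none, and no amount of extra training data changes that.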

So let us get this straight.  From now on, any claim of using machine learning for social media mining of public opinions and sentiments is likely a trap (unless it is verified to involve parsing of linguistic structures or patterns, which so far is unheard of in practical machine-learning-based systems).  Fancy visualizations may make the results of the mainstream approach look real and attractive, but they are simply not trustworthy.

Related Posts:

Why deep parsing rules instead of deep learning model for sentiment analysis?
Pros and Cons of Two Approaches: Machine Learning and Grammar Engineering
Coarse-grained vs. fine-grained sentiment analysis

Why deep parsing rules instead of deep learning model for sentiment analysis?


(1)    Learning does not work on short messages, as short messages do not have enough data points (or keyword density) to support the statistical model trained by machine learning.  Social media is dominated by short messages.

(2)    With long messages, learning can do a fairly good job in coarse-grained sentiment classification of thumbs-up and thumbs-down, but it is not good at decoding the fine-grained sentiment analysis to answer why people like or dislike a topic or brand.  Such fine-grained insights are much more actionable and valuable than the simple classification of thumbs-up and thumbs-down.

We have experimented with and compared  both approaches to validate the above conclusions.  That is why we use deep parsing rules instead of a deep learning model to reach the industry-leading data quality we have for sentiment analysis.

We do use deep learning for other tasks such as logo and image processing.  But for sentiment analysis and information extraction from text, especially in processing social media, the deep parsing approach is a clear leader in data quality.



The mainstream sentiment approach simply breaks in front of social media

Coarse-grained vs. fine-grained sentiment analysis

Deep parsing is the key to natural language understanding 

Automated survey based on social media

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP





Take Esperanto: its word order should in principle be very free; however you permute the elements, the meaning stays the same. But since many writers probably think in English and subconsciously translate into Esperanto as they write, the output, much like machine translation and thanks to human laziness, ends up with a relatively fixed, English-like word order, far from exploiting the freedom the language itself affords.

By my intuition, the word-order freedom of Chinese is greater than what the figures show. But the Chinese-English bilingual data used in that study were probably mostly formal genres (e.g. news) rather than the freer spoken language, so such a conclusion is not surprising. Chinese is a so-called isolating language; English is close to Chinese but not that "isolating", and Chinese word order is freer than English. In English-to-Chinese MT generation, word-order adjustment is rarely needed: in most cases, keeping the source order more or less works, exploiting the elasticity and relative freedom of Chinese word order. I have not built Chinese-to-English MT hands-on (apart from a toy bidirectional English-Chinese MT on a Prolog platform for my PhD project), but my sense is that it needs reordering more often than English-to-Chinese. Reordering easily goes wrong, especially when the structural analysis falls short; it is one of MT's pain points. Reorder as little as possible, and beware of overdoing it and outsmarting yourself: that is the strategy often adopted in practice. It covers English relative clauses too: most of the time, not reordering beats reordering, using the trick of treating the relative clause as a parenthetical, adding a comma or bracket in front and rendering "which" as 它 (it), and so on.





Broadly speaking, it is true that morphology-rich languages have freer word order and morphology-poor languages a relatively fixed one. Yet Chinese is not as word-order-rigid as the "isolating languages have fixed order" camp claims; its freedom exceeds what most of us imagine. Consider the variations of the most typical SVO pattern: with the three elements S, V, O, the limit is six possible orderings. Esperanto's morphology is not rich, with only an accusative marker -n (compare Russian's six cases) and a zero-form nominative (a zero ending is also a form), yet it can adopt any of the six variants without changing the SVO syntax and semantics:

1. SVO: Mi manĝas fiŝon (I eat fish)
2. SOV: Mi fiŝon manĝas
3. VOS: Manĝas fiŝon mi
4. VSO: Manĝas mi fiŝon
5. OVS: Fiŝon manĝas mi
6. OSV: Fiŝon mi manĝas


1. SVO is the default order; no problem:
I eat fish

2. SOV:
* I fish eat (English does not allow this order)
我鱼吃【了】 (Chinese basically allows it, especially with a sentence-final aspect particle; it sounds quite natural)
Although English has case inflection on pronouns (the literal forms "I" vs "me") while Chinese has no case at all, English is actually less free than Chinese on this variant. So morphological richness is not a sure predictor of word-order freedom.

3. VOS:
* Eat fish I (English does not allow this order)

4. VSO:
* Eat I fish (not allowed)
* 吃我鱼 (not allowed as VSO, though the string itself can exist with a different syntax and semantics: "eat my fish")

5. OVS:
* Fish eat I (not allowed, even though "I" carries the nominative form)
* 鱼吃我 (a legal sentence, but the syntactic semantics is exactly reversed: it is SVO, not OVS; legal as a sentence, illegal as OVS)

6. OSV:
fish I eat (legal: besides expressing the OSV logical semantics, this order also expresses the relative-clause relation)
鱼我吃 (legal and commonly heard: 鱼 "fish" is the so-called Topic, 我 "I" is S, and the logical semantics is unchanged)

To sum up: of the six orders, Chinese allows three, has one gray area, and disallows two. English allows only two; the rest are illegal. So in the most common SVO pattern, Chinese word order is freer than English.
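The tally above can be written down as a small table, following the acceptability judgments given in the text (these are the author's judgments, not corpus counts):

```python
# Acceptability of the six SVO permutations in English vs Chinese,
# per the judgments in the discussion above ("gray" = VSO strings
# like 吃我鱼 that exist only under a different reading).
judgments = {
    "SVO": {"en": "legal",   "zh": "legal"},
    "SOV": {"en": "illegal", "zh": "legal"},
    "VOS": {"en": "illegal", "zh": "illegal"},
    "VSO": {"en": "illegal", "zh": "gray"},
    "OVS": {"en": "illegal", "zh": "illegal"},
    "OSV": {"en": "legal",   "zh": "legal"},
}

def count(lang, status):
    return sum(1 for v in judgments.values() if v[lang] == status)

# Chinese: 3 legal, 1 gray, 2 illegal; English: 2 legal, 4 illegal.
print(count("zh", "legal"), count("zh", "gray"), count("en", "legal"))
```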


In any case, this illustration shows that word-order freedom does not correlate linearly with morphological richness. It also shows that Chinese often has more freedom and elasticity than we, including many linguists, imagine; Prof. Bai's example is another illustration of the latter. In fact, once other factors and tokens are added in, this elasticity and freedom can be downright jaw-dropping. Chinese is not only a "streaking" language, stripped of explicit form; it is also a language of considerably free word order. Word-order elasticity beyond imagination is itself one manifestation of the streaking: whichever concept surfaces first in thought simply pops out first. Moreover, Chinese not only lacks morphology in the strict sense; even function words, another kind of explicit form, are frequently omitted. It is a language that seems inconceivable until you study it: it relies on implicit form, more than explicit form, to achieve communication. That is naturally bad news for NLP and parsing, but it places no great burden on humans.


It depends on how you define word-order freedom. Your definition is tailored to case-marking languages: in a language with an accusative, the syntactic relation is condensed onto the word bearing the role, so it stays the object wherever it moves, as a matter of course. But the more standard, open definition of word-order freedom is different: if SVO is the basic order, then any possible order deviating from it counts as word-order freedom, and what we study is its degree. The very existence of those possibilities proves that in understanding language, or when a machine parses, we must accommodate these differences in linear order; otherwise we cannot parse, and cannot capture the freely ordered expressions. One cannot deny the possibility and reality of a free order just because, with a particular choice of words, some ordering happens not to be realizable.

Step back: even if your word-order freedom is the narrow definition, we can also view it broadly, because the broad phenomenon objectively exists, and you cannot understand it without handling it. Take 小王打小张 (Xiao Wang beats Xiao Zhang): the SVO seems immutable. But 小张小王打不过 (Xiao Zhang, Xiao Wang cannot beat) is OSV. One cannot deny that the order really changed just because this variant has a complement as its trigger. The pattern must transform to cope with the change in word order.

Finally, the Chinese-English comparison further shows that Chinese word-order freedom exceeds English's; otherwise one cannot explain why Chinese, which lacks morphology, exhibits more word-order freedom than English, whose morphology, though poor, is still richer than Chinese's. 鱼我吃了 (the fish, I ate) and 我鱼吃了 (I the fish ate) form a minimal pair, and the word-order freedom it marks is plain to see. That people can still recover the syntactic semantics when order is free shows that morphology, though an important enabler of freedom, is not the only one: implicit form, and even common sense, can also help word order become free.



Morphology is certainly an important factor in word-order freedom, but other linguistic forms also play a part. Subordinate clause or main clause, the SVO logical semantics is there: who beats whom? When we speak of SVO word-order freedom, the starting point is the logical semantics in thought, i.e. who beats whom; we then examine how this who-1 and who-2 are expressed in the language's surface form, and in what order they appear.

Now that is forced. By that logic, "the apple he ate is red" is also OSV? Where is the logical relation in "apple he ate"? So English can be OSV too?

Exactly: that is bona fide OSV. Who eats what, with the 【what】 now moved in front of the 【who】 and "ate"; the underlying logical semantics is unchanged, only the surface order differs.

Calling English an SVO language is just a label; it does not mean English permits only that order. Of the six SVO orderings, as noted above, two are legal and common in English, and the other four are basically illegal.





By your definition, "Eating the apple he smiled" makes English VOS as well.

beat him as much as I can


@禹 Russian word order is indeed very flexible, especially in the colloquial register, but the meaning does not change: nouns have six cases, so the agent-patient relations basically never get scrambled.


@白硕 What does this Japanese sentence mean?

被张三 把李四 在胡同里 打了个半死…… (by Zhang San, [BA] Li Si, in the alley, beaten half to death…)

Broadly speaking, prepositions are also case, also morphology: case usually takes the form of a suffix, but prepositions are the same in essence.
被 (bei) marks the (logical) subject case, 给 (gei) the dative, 用 (yong) the instrumental.


Case is one of the formal conditions that parsing relies on. The richer a language's morphology, the easier the parsing.




泥沙龙笔记:汉语就是一种“裸奔” 的语言










If we pursue perfection, a little surgery is in order: (1) sever the original subject-predicate link (行李 "luggage" with 确认 "confirm"); (2) establish a new subject-predicate link (行李 with 摆放 "place"); (3) sever the original OPred (predicative object); (4) replace it with O-S (object clause). This too is reasonable, and the conditions are equally clear-cut.


Hey, reporting to Prof. Bai, with Chairman Mao's guarantee:
mission impossible accomplished in the semantics module


Regardless, the overall architecture is on track: the object clause O-S, and before it the relative clause Mod-S.
If we pursue perfection, 旅客 (passengers) and 你 (you) are appositives. But because 请你VP (please you VP) is used so much, and its 你 is often omitted, parsing simply ignores your existence: you have no status, being merely the default subject of the imperative (here the imperative marker is the particle 请 "please"). So 旅客 need not be made an appositive of that nonexistent 你; making it the subject is fine.


Me: Prof. Bai, give an inch and you take a mile.


看说明书 (read the manual) and 看书 (read a book) belong to the same collocation class; this one can be debugged a bit, as they should have hooked up.
确认 (confirm) and 是 (be) got disconnected. But 是, unlike other verbs, is no easy customer, and I dare not touch it lightly.


The semantic middleware continues; logical SVO completion:




谁也赢不了 / 谁都赢不了 (lit. "can-beat no one / no one can-beat") go into the lexicon with two readings: 1. sure to win (no one can beat X); 2. sure to lose (X can beat no one).
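As a sketch, such an ambiguous idiom can be stored as one lexical entry that carries both readings until context disambiguates (the entry format below is invented for illustration; it is not the format of CPSG95 or any actual lexicon):

```python
# Hypothetical lexicon entry for the ambiguous pattern "X 谁也/谁都 赢不了":
# both competing readings are kept and passed downstream, to be pruned
# later by context or pragmatics.
ENTRY = {
    "forms": ["谁也赢不了", "谁都赢不了"],
    "senses": [
        {"id": 1, "gloss": "no one can beat X", "implication": "X always wins"},
        {"id": 2, "gloss": "X can beat no one", "implication": "X always loses"},
    ],
}

def lookup(token):
    """Return all surviving readings for a token (empty if not listed)."""
    return ENTRY["senses"] if token in ENTRY["forms"] else []

print(len(lookup("谁也赢不了")))  # 2 readings survive until disambiguation
```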












More on the Chinese Topic construction. At bottom it is syntactic laziness, asking no deeper questions: whatever sits sentence-initially and looks like a content word gets slapped with a discourse-flavored Topic label, and that is that, regardless of what role it actually plays in the logical semantics and of how deep understanding is to be achieved. To put it unkindly, this is Chinese grammar "playing rogue".

吃苦他在前 (hardship, he is first in) → Topic 【吃苦】 Subj 【他】 Pred 【在前】
More common, in fact, is 他吃苦在前 (he is first to bear hardship). The analysis follows the same routine:

他学习好 (he studies well) is the same: the topic is a person (他 "he"), and what is said is a trait of his: which aspect (aspect) is good (evaluation). Compare 学习,他好 (as for study, he is good; the comma can be dropped, but then there is ambiguity: 【学习他】好 "studying him is good"). There the topic is the matter of 学习 (study), and what is said is which people (subset) are good in that respect (evaluation).

In English that would be roughly: he is good at study; his study is good; he studies well.

(1) 他 (the big object) is good; (2) 学习 (the small object) is good; (3) 他学习 (he studies).

iPhone 屏幕不好 (the iPhone's screen is not good).
The detail is that the screen disappoints; but the screen (part) being bad also drags down the evaluation of the iPhone (whole), so it is also "the iPhone is not good".

In the end: Topic as the first step of syntax is fine, but it is not the end point of syntactic semantics. It is more like laziness, or a bridge; the job is done only when we reach (1), (2), (3). Whether "iPhone屏幕不行" or "屏幕iPhone不行", whether Chinese or English, the expressions may differ, but the final logical landing point should be the same, roughly the 1-2-3 above. English has no topic construction but uses at least three other expressions (as shown); reflecting on how those expressions ultimately normalize to the same logical 1-2-3 is interesting and instructive.
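A minimal sketch of that normalization step, mapping a part-whole topic sentence to the logical statements (1)-(3) (the representation and the `PART_OF` table are invented for illustration; a real system would consult an ontology):

```python
# Normalize a Chinese part-whole topic sentence like "iPhone 屏幕 不好"
# into the three logical statements discussed above.
PART_OF = {"屏幕": "iPhone"}  # toy ontology: screen is a part of iPhone

def normalize(topic, subject, evaluation):
    facts = [(subject, evaluation)]                 # (2) the part is bad
    if PART_OF.get(subject) == topic:
        facts.append((topic, evaluation))           # (1) so the whole is bad
        facts.append((topic, "has-part", subject))  # (3) the part-whole link
    return facts

print(normalize("iPhone", "屏幕", "不好"))
```

The same three facts should come out regardless of whether the input surface form was the topic construction or one of its English paraphrases.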

The 1-2-3 of syntactic analysis or logical semantics must eventually land in pragmatics to support applications; the pragmatic definitions can follow the intelligence needs at the application level. Below is our current automatic syntactic analysis and the pragmatic representation of the related sentiment analysis:









Parsing of the day: 宝宝的经纪人睡了宝宝的宝宝 … (Baobao's agent slept with Baobao's baobao …)


Gossip like this is probably hard to analyze. Hard on the parser, this 宝宝!






@wei This is mainly not a parsing problem; it is mainly a coreference-resolution problem.

Right. My point is that it is not hard for parsing to provide the structural foundation; how the semantic module resolves the references is a separate problem. Many people think this kind of thing cannot be parsed because they do not quite understand the respective modules and functions of syntax and semantics. Of course, in the average person's mind, parsing equals understanding: a whole system combining syntax, semantics, and even pragmatics.
I am a native speaker, and I could not make sense of it either; I did not know who each 宝宝 referred to. Only after cramming the news online just now did I understand. So the coreference resolution here needs a great deal of pragmatic background knowledge; syntactic-semantic analysis alone cannot decode it.

The jokes arrived this fast: Wang Baoqiang went to a stable to pick a horse to ride. The stable master stopped him, saying the horse he picked was no good. Baoqiang asked why. The master explained: 这马容易劈腿 ("this horse tends to do the splits"). Baoqiang did not quite catch it, so the master said loudly: 这马蓉易劈腿啊 ("this Ma Rong tends to cheat")!! (A pun: 马容易 "the horse tends to" vs. 马蓉易 "Ma Rong tends to".)

Prof. Bai forwarded this earlier; let us parse it too, to round out the section:




Lei: Is Baobao's baobao Baobao's?

Me: At most we can offer the possibilities; pragmatics or background knowledge then decides.

Bai: Anyone with 宝 in the name is a candidate.
The title 宝宝 was once bestowed on that star-gazing VIP, so there is precedent.

Me: So each 宝宝 gets three senses: 宝宝1 (the child), 宝宝2 (the spouse), 宝宝3 (Baoqiang).



Lei: Also, it is now fashionable to call oneself 宝宝, as in 宝宝不高兴 (baobao is not happy).

Bai: It is not that complicated. "Is the 【child】 the 【man】's?" is the only question worth asking. Everything else is dross.

Me: A human cannot decode it without reading the news either; at my first reading I had no idea what it meant.

Lei: And there is the 宝-马 (Bao vs. Ma) dispute.

Me: Is Baoqiang's child Baoqiang's (biologically)?

Lei: Is Baoqiang's wife Baoqiang's?

Me: Right, those two are the key. How do we distill those two from the heap of dross?


Lei: Is the child's father the child's?

Me: "Is the child's father the child's?" That probably holds only in imagination: imagine the child calling his own father 宝宝, the child having such a motive of inquiry, and so on.

Lei: From a third party's viewpoint, it works too.









S. Bai: Natural Language Caterpillar Breaks through Chomsky’s Castle


Translator’s note:

This article written in Chinese by Prof. S. Bai is a wonderful piece of writing worthy of recommendation for all natural language scholars.  Prof. Bai’s critical study of Chomsky’s formal language theory with regard to natural language has reached a depth never seen before, ever since Chomsky’s revolution in the 1950s.  For decades, with so many papers published by so many scholars who have studied Chomsky, this novel “caterpillar” theory still stands out and strikes me as an insight that offers a much clearer and deeper explanation of how natural language should be modeled in formalism, based on my decades of natural language parsing study and practice (in our practice, I call the caterpillar FSA++, an extension of the regular grammar formalism adequate for multi-level natural language deep parsing).  For example, so many people have been trapped in Chomsky’s recursion theory and made endless futile efforts to attempt a linear or near-linear algorithm to handle the so-called recursive nature of natural language, which is practically non-existent (see Chomsky’s Negative Impact).  There used to be heated debates in computational linguistics on whether natural language is context-free or context-sensitive, or mildly context-sensitive as some scholars call it.  Such debates mechanically apply Chomsky’s formal language hierarchy to natural languages, trapped in metaphysical academic controversies, far from language facts and data.  In contrast, Prof. Bai’s original “caterpillar” theory presents a novel picture that provides insights into the true nature of natural languages.

S. Bai: Natural Language Caterpillar Breaks through Chomsky’s Castle

Tags: Chomsky Hierarchy, computational linguistics, Natural Language Processing, linear speed

This is a technical article; do not be fooled by the title, which sounds like a bug story in some VIP’s castle.  If you are neither an NLP professional nor an NLP fan, you can stop here and do not need to continue the journey with me on this topic.

Chomsky’s Castle refers to the famous Chomsky Hierarchy in his formal language theory, built by the father of contemporary linguistics Noam Chomsky more than half a century ago.  According to this theory, the language castle is built with four enclosing walls.  The outermost wall is named Type-0, also called Phrase Structure Grammar, corresponding to a Turing machine.  The second wall is Type-1, or Context-sensitive Grammar (CSG), corresponding to a parsing device called a linear bounded automaton, with time complexity known to be NP-complete.  The third wall is Type-2, or Context-free Grammar (CFG), corresponding to a pushdown automaton, with polynomial time complexity, somewhere between square and cubic in the size of the input sentence for the best asymptotic order measured in the worst-case scenario.  The innermost wall is Type-3, or Regular Grammar, corresponding to deterministic finite state automata, with linear time complexity.  The sketch of the 4-wall Chomsky Castle is illustrated below.
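The four walls and their pairings as described above can be tabulated (the table simply restates the description in the preceding paragraph):

```python
# The four walls of the Chomsky Castle, as described above:
# (type, grammar name, recognizing device, time complexity).
WALLS = [
    ("Type-0", "Phrase Structure Grammar",
     "Turing machine", "undecidable in general"),
    ("Type-1", "Context-sensitive Grammar (CSG)",
     "linear bounded automaton", "intractable"),
    ("Type-2", "Context-free Grammar (CFG)",
     "pushdown automaton", "polynomial (between O(n^2) and O(n^3))"),
    ("Type-3", "Regular Grammar",
     "deterministic finite state automaton", "linear"),
]

for t, grammar, device, complexity in WALLS:
    print(f"{t}: {grammar} -> {device} ({complexity})")
```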

This castle of Chomsky has impacted generations of scholars, mainly along two lines.  The first line of impact can be called “the outward fear syndrome”.  Because the time complexity for the second wall (CSG) is NP-complete, anywhere therein and beyond becomes a Forbidden City before NP=P can be proved.  Thus, the pressure for parsing natural languages has to be confined entirely to within the third wall (CFG).  Everyone knows that natural language involves some context sensitivity, but the computing device cannot keep it tractable once it goes beyond the third wall of CFG.  So it has to be left out.

The second line of impact is called “the inward perfection syndrome”.  Following the initial success of using Type-2 grammar (CFG) comes a severe abuse of recursion.  When the number of recursive layers increases even slightly, the acceptability of a sentence soon approaches 0.  For example, “The person that hit Peter is John” looks fine, but it starts sounding weird to hear “The person that hit Peter that met Tom is John”.  It becomes gibberish with sentences like “The person that hit Peter that met Tom that married Mary is John”.  In fact, the majority of resources spent on parsing efficiency are associated with such abuse of recursion in coping with gibberish-like sentences, rarely seen in real-life language.  For natural language processing to be practical, the pursuit of linear speed cannot be overemphasized.  If we reflect on the efficiency of the human language understanding process, the conclusion is certainly “linear speed” in accordance with the length of the speech input.  In fact, the abuse of recursion is most likely triggered by the “inward perfection syndrome”, by which we intend to cover every inch of the land within the third wall of CFG, even if it is an area piled up with gibberish or garbage.
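The degradation is easy to reproduce: chaining relative clauses keeps the string grammatical under CFG while acceptability collapses. A quick generator for the example sentences above:

```python
# Generate the increasingly unacceptable sentences from the text by
# chaining relative clauses onto "The person ... is John". Each added
# clause stays CFG-grammatical, yet acceptability plunges.
def nested_relatives(clauses):
    np = "The person"
    for clause in clauses:
        np += f" that {clause}"
    return np + " is John"

print(nested_relatives(["hit Peter"]))
print(nested_relatives(["hit Peter", "met Tom"]))
print(nested_relatives(["hit Peter", "met Tom", "married Mary"]))
```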

In a sense, it can be said that one reason the statistical approach took over the rule-based approach for such a long time in the academia of natural language processing is just the combined effect of these two syndromes.  To overcome the effects of these syndromes, many researchers have made all kinds of efforts, to be reviewed below one by one.

Along the line of the outward fear syndrome, evidence against context-freeness has been found in some constructions in Swiss German.  Chinese has similar examples in expressing the respective correspondence of conjoined items and their descriptions.  For example, “张三、李四、王五的年龄分别是25岁、32岁、27岁,出生地分别是武汉、成都、苏州” (Zhang San, Li Si and Wang Wu’s ages are respectively 25, 32, and 27; they were born respectively in Wuhan, Chengdu, Suzhou).  Here, the three named entities constitute a list of nouns.  The length of the conjoined list of entities cannot be predetermined, and although the respective descriptors for this list of nouns likewise vary in length, the key condition is that they correspond to the antecedent list of nouns one by one.  This respective correspondence is beyond the expressive power of the context-free formalism.  It needs to get out of the third wall.
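The “respectively” construction imposes a cross-serial, one-to-one correspondence between lists of unbounded length, which a context-free device cannot enforce in general; checked procedurally, however, the condition is trivial:

```python
# Check the one-to-one "respectively" correspondence between a list of
# entities and one or more lists of descriptors, as in the Chinese
# example above: every descriptor list must align with the entity list.
def respective_ok(entities, *descriptor_lists):
    return all(len(d) == len(entities) for d in descriptor_lists)

entities = ["张三", "李四", "王五"]
ages = ["25岁", "32岁", "27岁"]
birthplaces = ["武汉", "成都", "苏州"]

print(respective_ok(entities, ages, birthplaces))      # True
print(respective_ok(entities, ages, birthplaces[:2]))  # False: 2 vs 3
```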

As for overcoming “the inward perfection syndrome”, the pursuit of “linear speed” in the field of NLP has never stopped.  It ranges from allowing for the look-ahead mechanism in LR (k) grammar, to the cascaded finite state automata, to the probabilistic CFG parsers which are trained on a large treebank and eventually converted to an Ngram (n=>5) model.  It should also include RNN/LSTM for its unique pursuit for deep parsing from the statistical school.  All these efforts are striving for defining a subclass in Type-2 CFG that reaches linear speed efficiency yet still with adequate linguistic power.  In fact, all parsers that have survived after fighting the statistical methods are to some degree a result of overcoming “the inward perfection syndrome”, with certain success in linear speed pursuit while respecting linguistic principles.  The resulting restricted subclass, compared to the area within the original third wall CFG, is a greatly “squashed” land.

If we agree that everything in parsing should be based on real life natural language as the starting point and the ultimate landing point, it should be easy to see that the outward limited breakthrough and the inward massive compression should be the two sides of a coin.  We want to strive for a formalism that balances both sides.  In other words, our ideal natural language parsing formalism should look like a linguistic “caterpillar” breaking through the Chomsky walls in his castle, illustrated below:

It seems to me that such a “caterpillar” may have already been found by someone.  It will not take too long before we can confirm it.
Original article in Chinese: 《穿越乔家大院寻找“毛毛虫”》
Translated by Dr. Wei Li




[转载]【白硕 – 穿越乔家大院寻找“毛毛虫”】


This article by Prof. Bai Shuo deserves study and reflection by every natural language scholar. Slapping the table in admiration was my honest reaction on first reading. Prof. Bai gives the deepest and sharpest analysis I have yet seen of how Chomsky's formal language theory has misled its application to natural language, and he writes it accessibly, vividly, and wittily. All these years, all these scholars: why has no one else reached this depth? Chomsky's recursion trap alone has ensnared who knows how many people, causing endless wasted effort on "not human language" phenomena and countless detours. The literature has long treatises mechanically applying the Chomsky hierarchy, arguing endlessly over whether natural language is context-free or context-sensitive, with compromise positions such as "mildly context-sensitive"; these metaphysical, scholastic debates mostly look at flowers through fog and scratch the itch through the boot, missing the point and staying far from the facts of language. Prof. Bai's original "caterpillar" theory vividly breaks through these strictures.

Prof. Bai's own summary: "If one accepts the principle that everything takes real natural language as the starting point and the final landing point, then one must admit: limited breakthrough outward and massive compression inward are two sides of the same coin." Golden words, ringing true.

【白硕 – 穿越乔家大院寻找“毛毛虫”】














【新智元:parsing 在希望的田野上】 


【科研笔记:NLP “毛毛虫” 笔记,从一维到二维】

【泥沙龙笔记:NLP 专门语言是规则系统的斧头】








On Hand-crafted Myth of Knowledge Bottleneck

In my article “Pride and Prejudice of Main Stream”, the first myth listed among the top 10 misconceptions in NLP is as follows:

[Hand-crafted Myth]  Rule-based system faces a knowledge bottleneck of hand-crafted development while a machine learning system involves automatic training (implying no knowledge bottleneck).

While there are numerous misconceptions about the old school of rule systems, this hand-crafted myth can be regarded as the source of them all.  Just review NLP papers: no matter what language phenomena are being discussed, it is almost a cliché to cite a couple of old-school works to demonstrate the superiority of machine learning algorithms, and the attack needs only one sentence, to the effect that the hand-crafted rules lead to a system “difficult to develop” (or “difficult to scale up”, “with low efficiency”, “lacking robustness”, etc.), or simply a rejection like this: “literature [1], [2] and [3] have tried to handle the problem in different aspects, but these systems are all hand-crafted”.  Once labeled as hand-crafted, one does not even need to discuss the effect and quality.  Hand-crafting becomes the rule system’s “original sin”, and the linguists who craft rules become the community’s second-class citizens, bearing that sin.

So what is wrong with hand-crafting or coding linguistic rules for computer processing of languages?  NLP development is software engineering.  From a software engineering perspective, hand-crafting is programming while machine learning belongs to automatic programming.  Unless we assume that natural language is a special object whose processing can all be handled by systems automatically programmed or learned by machine learning algorithms, it does not make sense to reject or belittle the practice of coding linguistic rules for developing an NLP system.

For consumer products and arts, hand-craft is definitely a positive word: it represents quality or uniqueness and high value, a legitimate reason for a good price.  Why does it become a derogatory term in NLP?  The root cause is that in the field of NLP, almost as if some collective hypnosis had hit the community, people are intentionally or unintentionally led to believe that machine learning is the only correct choice.  In other words, by criticizing, rejecting or disregarding hand-crafted rule systems, the underlying assumption is that machine learning is a panacea, universal and effective, always the preferred approach over the other school.

The fact of life is, in the face of the complexity of natural language, machine learning from data so far only surfaces the tip of the iceberg of the language monster (called low-hanging fruit by Church in K. Church: A Pendulum Swung Too Far), far from reaching the goal of a complete solution to language understanding and applications.  There is no basis for the claim that machine learning alone can solve all language problems, nor is there any evidence that machine learning necessarily leads to better quality than coding rules by domain specialists (e.g. computational grammarians).  Depending on the nature and depth of the NLP tasks, hand-crafted systems actually have more chances of performing better than machine learning, at least for non-trivial and deep-level NLP tasks such as parsing, sentiment analysis and information extraction (we have tried and compared both approaches).  In fact, the only major reason why they are still there, having survived all the rejections from the mainstream and still playing a role in industrial practical applications, is their superior data quality, for otherwise they could not have been justified for industrial investment at all.

The “forgotten” school: why is it still there?  What does it have to offer?  The key is the excellent data quality of a hand-crafted system as its advantage, not only in precision: high recall is achievable as well.
(Quoted from On Recall of Grammar Engineering Systems.)

In the real world, NLP is applied research which eventually must land on the engineering of language applications where the results and quality are evaluated.  As an industry, software engineering has attracted many ingenious coding masters, each and every one of them recognized for their coding skills, including algorithm design and implementation expertise, which are hand-crafting by nature.   Have we ever heard of a star engineer being criticized for his (manual) programming?  With NLP applications also part of software engineering, why should computational linguists coding linguistic rules receive so much criticism while engineers coding other applications get recognized for their hard work?  Is it because NLP applications are simpler than other applications?  On the contrary, many natural language applications are more complex and difficult than other types of applications (e.g. graphics software, or word processing apps).  The likely explanation for the different treatment of a general-purpose programmer and a linguist knowledge engineer is that the big environment of software engineering does not involve as much prejudice, while the small environment of the NLP domain is deeply biased, with the belief that the automatic programming of an NLP system by machine learning can replace and outperform manual coding for all language projects.   For software engineering in general, (manual) programming is the norm and no one believes that programmers’ jobs can be replaced by automatic programming in any foreseeable time.  Automatic programming, a concept not rare in science fiction for visions like machines making machines, is currently only a research area, for very restricted low-level functions.
Rather than placing hope on automatic programming, software engineering as an industry has seen significant progress in development infrastructures, such as development environments and rich libraries of functions to support efficient coding and debugging.  Maybe one day applications will use more and more automated code for simple modules, but the full automation of constructing any complex software project is nowhere in sight.  By any standard, natural language parsing and understanding (beyond shallow tasks such as classification, clustering or tagging) is a complex task.  Therefore, it is hard to expect machine learning, as a manifestation of automatic programming, to miraculously replace the manual code for all language applications.  The application value of hand-crafting a rule system will continue to exist and evolve for a long time, disregarded or not.

“Automatic” is a fancy word.  What a beautiful world it would be if all artificial intelligence and natural language tasks could be accomplished by automatic machine learning from data.  There is, naturally, a high expectation of and regard for a machine learning breakthrough to help realize this dream of mankind.  All this should encourage machine learning experts to continue to innovate to demonstrate its potential, and should not be a reason for pride and prejudice against a competing school or other approaches.

Before we embark on further discussions of the so-called rule system’s knowledge bottleneck defect, it is worth mentioning that the word “automatic” refers to the system development, not to be confused with running the system.  At the application level, whether it is a machine-learned system or a manual system coded by domain programmers (linguists), the system always runs fully automatically, with no human interference.  Although this is an obvious fact for both types of systems, I have seen people get confused so as to equate hand-crafted NLP systems with manual or semi-automatic applications.

Is hand-crafting rules a knowledge bottleneck for its development?  Yes, there is no denying or need to deny that.  The bottleneck is reflected in the system development cycle.  But keep in mind that this “bottleneck” is common to all large software engineering projects; it is a resource cost, not something introduced only by NLP.  From this perspective, the knowledge bottleneck argument against hand-crafted systems cannot really stand, unless it can be proved that machine learning can do all of NLP equally well, free of the knowledge bottleneck: that might not be far from the truth for some special low-level tasks, e.g. document classification and word clustering, but it is definitely misleading or incorrect for NLP in general, a point to be discussed in detail shortly.

Here are the ballpark estimates based on our decades of NLP practice and experiences.  For shallow level NLP tasks (such as Named Entity tagging, Chinese segmentation), a rule approach needs at least three months of one linguist coding and debugging the rules, supported by at least half an engineer for tools support and platform maintenance, in order to come up with a decent system for initial release and running.  As for deep NLP tasks (such as deep parsing, deep sentiments beyond thumbs-up and thumbs-down classification), one should not expect a working engine to be built up without due resources that at least involve one computational linguist coding rules for one year, coupled with half an engineer for platform and tools support and half an engineer for independent QA (quality assurance) support.  Of course, the labor resources requirements vary according to the quality of the developers (especially the linguistic expertise of the knowledge engineers) and how well the infrastructures and development environment support linguistic development.  Also, the above estimates have not included the general costs, as applied to all software applications, e.g. the GUI development at app level and operations in running the developed engines.

Let us present the scene of the modern day rule-based system development.  A hand-crafted NLP rule system is based on compiled computational grammars which are nowadays often architected as an integrated pipeline of different modules from shallow processing up to deep processing.  A grammar is a set of linguistic rules encoded in some formalism, which is the core of a module intended to achieve a defined function in language processing, e.g. a module for shallow parsing may target noun phrase (NP) as its object for identification and chunking.  What happens in grammar engineering is not much different from other software engineering projects.  As knowledge engineer, a computational linguist codes a rule in an NLP-specific language, based on a development corpus.  The development is data-driven, each line of rule code goes through rigid unit tests and then regression tests before it is submitted as part of the updated system for independent QA to test and feedback.  The development is an iterative process and cycle where incremental enhancements on bug reports from QA and/or from the field (customers) serve as a necessary input and step towards better data quality over time.

Depending on the design of the architect, all types of information are available for the linguist developer to use in crafting a rule’s conditions.  A rule can check any element of a pattern by enforcing conditions on (i) the word or stem itself (i.e. the string literal, e.g. for capturing idiomatic expressions), and/or (ii) POS (part-of-speech, such as noun, adjective, verb, preposition), and/or (iii) orthography features (e.g. initial upper case, mixed case, token with digits and dots), and/or (iv) morphology features (e.g. tense, aspect, person, number, case, etc., decoded by a previous morphology module), and/or (v) syntactic features (e.g. verb subcategory features such as intransitive, transitive, ditransitive), and/or (vi) lexical semantic features (e.g. human, animal, furniture, food, school, time, location, color, emotion).  There are almost infinite combinations of such conditions that can be enforced in rules’ patterns.  A linguist’s job is to code such conditions to maximize the benefits of capturing the target language phenomena, a balancing art in engineering through a process of trial and error.
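A pattern with conditions of kinds (i)-(vi) can be pictured with a sketch like the following (the feature names, the rule, and the matcher are invented for illustration; real systems use their own NLP-specific formalisms, not Python dicts):

```python
# Toy pattern matcher over tokens annotated with the kinds of
# information listed above (POS, orthography, semantic features, ...).
def matches(token, conditions):
    """True if the token satisfies every feature condition."""
    return all(token.get(k) == v for k, v in conditions.items())

def match_pattern(tokens, pattern):
    """Match a sequence of condition sets against a token sequence."""
    if len(tokens) != len(pattern):
        return False
    return all(matches(t, c) for t, c in zip(tokens, pattern))

# Rule sketch: determiner + adjective + noun carrying the lexical
# semantic feature 'human' (condition kinds ii and vi).
rule = [{"pos": "DET"}, {"pos": "ADJ"}, {"pos": "N", "sem": "human"}]

tokens = [
    {"word": "the",    "pos": "DET"},
    {"word": "little", "pos": "ADJ"},
    {"word": "boy",    "pos": "N", "sem": "human"},
]
print(match_pattern(tokens, rule))  # True
```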

Macroscopically speaking, the rule hand-crafting process is in essence the same as programmers coding an application, except that linguists usually use a different, very high-level NLP-specific language, in a chosen or designed formalism appropriate for modeling natural language, within a framework on a platform geared towards facilitating NLP work.  Hard-coding NLP in a general-purpose language like Java is not impossible for prototyping or a toy system.  But as natural language is known to be a complex monster, its processing calls for a special formalism (some form or extension of Chomsky’s formal language types) and an NLP-oriented language to help implement any non-toy system that scales.  So linguists are trained on the scene of development to be knowledge programmers hand-crafting linguistic rules.  In terms of the different levels of languages used for coding, it is to an extent similar to the contrast between programmers in the old days and the modern software engineers of today who code in so-called high-level languages like Java or C.  Decades ago, programmers had to use assembly or machine language to code a function.  The process and workflow for hand-crafting linguistic rules are just like any software engineer’s daily coding practice, except that the language designed for linguists is so high-level that linguistic developers can concentrate on linguistic challenges without having to worry about low-level technical details of memory allocation, garbage collection or pure code optimization for efficiency, which are taken care of by the NLP platform itself.  Everything else follows software development norms to ensure the development stays on track, including unit testing, baseline construction and monitoring, regression testing, independent QA, code reviews for rules’ quality, etc.  Each level of language has its own star engineers who master its coding skills.
It sounds ridiculous to respect software engineers while belittling linguistic engineers only because the latter hand-craft linguistic code as knowledge resources.

The chief architect in this context plays the key role in building a real-life, robust NLP system that scales.  To deep-parse or process natural language, he or she needs to define and design the formalism and language with the necessary extensions, the related data structures, the system architecture with the interaction of different levels of linguistic modules in mind (e.g. the morpho-syntactic interface), the workflow that integrates all components for internal coordination (including patching and the handling of interdependency and error propagation), and the external coordination with other modules or sub-systems, including machine learning or off-the-shelf tools, when needed or beneficial.  The architect also needs to ensure an efficient development environment and to train new linguists into effective linguistic "coders" with an engineering sense who follow software development norms (knowledge engineers are not trained by schools today).  Unlike mainstream machine learning systems, which are by nature robust and scalable, a hand-crafted system's robustness and scalability depend largely on the design and deep skills of the architect.  The architect defines the NLP platform, with specs for its core engine compiler and runner, plus the debugger in a friendly development environment.  He or she must also work with product managers to turn their requirements into operational specs for linguistic development, in a process we call semantic grounding of linguistic processing to applications.  The success of a large NLP system based on hand-crafted rules is never a simple accumulation of linguistic resources such as computational lexicons and grammars using a fixed formalism (e.g. CFG) and algorithm (e.g. chart parsing).  It calls for seasoned masters of language engineering as architects of the system design.

Given the scene of practice for NLP development as described above, it should be clear that the negative sentiment associated with "hand-crafting" is unjustifiable and inappropriate.  The only remaining argument against coding rules by hand comes down to the hard work and costs associated with the hand-crafted approach, the so-called knowledge bottleneck of rule-based systems.  If things can be learned by a machine without cost, why bother using costly linguistic labor?  This sounds like a reasonable argument until we examine it closely.  First, for this argument to stand, we need proof that machine learning indeed incurs no costs and faces no, or very little, knowledge bottleneck.  Second, for this argument to withstand scrutiny, we should be convinced that machine learning can reach the same or better quality than the hand-crafted rule approach.  Unfortunately, neither of these necessarily holds true.  Let us study them one by one.

As is known to all, any non-trivial NLP task is by nature based on linguistic knowledge, irrespective of the form in which that knowledge is learned or encoded.  Knowledge needs to be formalized in some form to support NLP, and machine learning is by no means immune to this knowledge-resource requirement.  In rule-based systems, the knowledge is directly hand-coded by linguists; in the case of (supervised) machine learning, knowledge resources take the form of labeled data for the learning algorithm to learn from.  (There is, indeed, so-called unsupervised learning, which needs no labeled data and is supposed to learn from raw data, but it is research-oriented and hardly practical for any non-trivial NLP, so we leave it aside for now.)  Although the learning process is automatic, the feature design, the learning algorithm implementation, the debugging and the fine-tuning are all manual, on top of the requirement of manually labeling a large training corpus in advance (unless an existing labeled corpus is available, which is rare; machine translation is a nice exception, as it can use existing human translations as labeled aligned corpora for training).  The labeling of data is a very tedious manual job.  Note that the sparse data challenge reflects machine learning's need for a very large labeled corpus.  So it is clear that the knowledge bottleneck takes different forms, but it applies equally to both approaches.  No machine can learn knowledge without cost, and it is incorrect to regard the knowledge bottleneck as a defect of rule-based systems only.

One may argue that rules require expert skilled labor, while the labeling of data requires only high school kids or college students with minimal training.  So to do a fair comparison of the associated costs, we perhaps need to turn to Karl Marx, whose "Das Kapital" offers a formula for converting simple labor into complex labor for the exchange of equal value: for a given task at the same level of performance quality (assuming machine learning can reach the quality of professional expertise, which is not necessarily true), how much cheap labor is needed to label the required amount of training corpus before it becomes economically advantageous?  Something like that.  This varies from task to task and even from location to location (e.g. different minimum wage laws), of course.  But the key point here is that the knowledge bottleneck challenges both approaches, and it is not the case, as many believe, that machine learning produces a system automatically with little or no cost attached.  In fact, things are far more complicated than a simple yes or no, as costs must also be calculated in the larger context of how many tasks need to be handled and how much underlying knowledge can be shared as reusable resources.  We will leave it to a separate writing to elaborate the point that, when put into the context of developing multiple NLP applications, the rule-based approach, which shares the core parsing engine, demonstrates a significant saving in knowledge costs over machine learning.
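The labor-conversion question can be pictured with a toy back-of-the-envelope calculation.  Every number below (hourly rates, development hours, corpus size, labeling speed) is a made-up assumption purely for illustration, not data from any real project; the point is only the shape of the comparison, not the break-even figure itself.

```python
# Hypothetical cost comparison: expert rule-crafting vs. mass data labeling.
# All numbers are invented for illustration.
linguist_rate = 60.0     # $/hour for a computational linguist (assumed)
labeler_rate = 12.0      # $/hour for a student labeler (assumed)
rule_hours = 2000        # hours to hand-craft the grammar (assumed)
labels_needed = 500_000  # labeled examples for comparable quality (assumed)
labels_per_hour = 50     # labeling throughput per person (assumed)

rule_cost = linguist_rate * rule_hours
labeling_cost = labeler_rate * (labels_needed / labels_per_hour)

# Under these particular assumptions the two costs happen to break even:
print(rule_cost, labeling_cost)  # 120000.0 120000.0
```

Change any single assumption (a stricter quality bar raising `labels_needed`, a different minimum wage, a more experienced linguist) and the balance tips one way or the other, which is exactly why the comparison cannot be settled by slogan.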

Let us step back and, for argument's sake, accept that coding rules is indeed more costly than machine learning.  So what?  As with any other commodity, hand-crafted products may indeed cost more, but they also have better quality and value than products of mass production; otherwise a commodity society would leave no room for craftsmen and their products to survive.  This is common sense, and it also applies to NLP.  If not for better quality, no investor would fund a team that could be replaced by machine learning.  What is surprising is that so many people, NLP experts included, believe that machine learning necessarily outperforms hand-crafted systems not only in costs saved but also in quality achieved.  There are low-level NLP tasks, such as speech processing and document classification, which are not the experts' forte, as we humans have far more restricted memory than computers do; but deep NLP involves far more linguistic expertise and design than the simple notion of learning from corpora can be expected to deliver with superior data quality.

In summary, the hand-crafted rule defect is largely a misconception circulating widely in NLP and reinforced by the mainstream, due to incomplete induction or ignorance of the scene of modern-day rule development.  It rests on the incorrect assumption that machine learning necessarily handles all NLP tasks with the same or better quality but with less of a knowledge bottleneck, or none, in comparison with systems based on hand-crafted rules.



Note: This is the author’s own translation, with adaptation, of part of our paper which originally appeared in Chinese in Communications of Chinese Computer Federation (CCCF), Issue 8, 2013




Pride and Prejudice of NLP Main Stream


In the area of Computational Linguistics, there are two basic approaches to natural language processing, the traditional rule system and the mainstream machine learning.  They are complementary and there are pros and cons associated with both.  However, as machine learning is the dominant mainstream philosophy reflected by the overwhelming ratio of papers published in academia, the area seems to be heavily biased against the rule system methodology.  The tremendous success of machine learning as applied to a list of natural language tasks has reinforced the mainstream pride and prejudice in favor of one and against the other.   As a result, there are numerous specious views which are often taken for granted without check, including attacks on the rule system’s defects based on incomplete induction or misconception.  This is not healthy for NLP itself as an applied research area and exerts an inappropriate influence on the young scientists coming to this area.  This is the first piece of a series of writings aimed at educating the public and confronting the prevalent prejudice, focused on the in-depth examination of the so-called hand-crafted defect of the rule system and the associated  knowledge bottleneck issue.

I. Introduction

Over 20 years ago, the field of NLP (natural language processing) went through a process of replacing traditional rule-based systems with statistical machine learning as the mainstream in academia.  Put in the larger context of AI (Artificial Intelligence), this represents a classical competition, with its ups and downs, between the rationalist school and the empiricist school (Church 2007).  It needs to be noted that the dominance of statistical approaches in this field has its historical inevitability.  The old school was confined to toy systems or the lab for too long without a scientific breakthrough, while machine learning started showing impressive results on numerous fronts of NLP at a much larger scale, initially in very low-level NLP such as POS (part-of-speech) tagging and speech recognition / synthesis, and later expanding to almost all NLP tasks, including machine translation, search and ranking, spam filtering, document classification, automatic summarization, lexicon acquisition, named entity tagging, relationship extraction, event classification and sentiment analysis.  This dominance has continued to grow to this day, when the other school is largely "out" of almost all major NLP arenas, journals and top conferences.  New graduates hardly realize its existence.  There is an entire generation gap in the academic training that carries on the legacy of the old school, with the exception of very few survivors (including yours truly) in industry, because few professors are motivated to teach it at all, or are even qualified with in-depth knowledge of it, as the funding and publication prospects for the old school grow ever bleaker.  To many people's minds today, learning (or deep learning) is NLP, and NLP is learning; that is all.  As for the "last century's technology" of rule-based systems, it reads more like a tale of failure from distant history.

The pride and prejudice of the mainstream were demonstrated most clearly in the recent incident when Google announced its deep-learning-based SyntaxNet and proudly claimed it to be "the most accurate parser in the world", so categorical, with no conditions attached, and without even bothering to check the possible existence of the other school.  This is not healthy (and philosophically unbalanced, too) for a broad field challenged by one of the most complex problems of mankind, i.e. decoding natural language understanding.  As only one voice is heard, it is scary to observe that the field is packed with prejudice and ignorance with regard to the other school, some of it from leaders of the field.  Specious comments are rampant and often taken for granted without check.

Prejudice itself is not the real concern, as it is part of the real world around us and involving ourselves, something to do with human nature and our innate limitations and ignorance.  What is really scary is the degree and popularity of such prejudice, represented in numerous misconceptions that can be picked up everywhere in this circle (I am not going to trace their sources, as they are everywhere, and people who have been in this field for some time know these are not Quixote's windmills but a reflection of reality).  I will list below some of the myths or fallacies so deeply rooted in the field that they seem to have become clichés, or part of the community consensus.  If one or more of the statements below sound familiar to you, and they do not strike you as opinionated or specious claims that cannot withstand scrutiny, then you might want to give the issue a second study to make sure we have not been subconsciously brainwashed.  The real damage is to our next generation, the new scholars coming to this field, who often do not get a chance to doubt.

For each such statement listed, it is not difficult to cite a poorly designed, stereotypical rule system that falls short on the point; the misconception lies in generalizing an accused defect of one system to the entire family of a school, ignorant of the variety of designs and the progress made in that school.

There are two types of misconceptions: one might be called myths, and the other sheer fallacies.  Myths arise as a result of "incomplete induction".  Some people may have observed or tried old-school rule systems of some sort, which showed signs of the stated defect, and then jumped to conclusions leading to the myths.  These myths call for in-depth examination and argument to get the real picture of the truth.  As for fallacies, they are simply untrue.  It is quite a surprise, though, to see that even fallacies seem to be widely accepted as true by many, including some experts in this field.  All we need is to cite facts to prove them wrong.  For example, [Grammaticality Fallacy] says that a rule system can only parse grammatical text and cannot handle degraded text with grammar mistakes in it.  Facts speak louder than words: the sentiment engine we have developed for our main products is a parsing-supported, rule-based system that fully automatically extracts and mines public opinions and consumer insights from all types of social media, typical of degraded text.  Third-party evaluations show that this system is the industry leader in sentiment data quality, significantly better than competitors adopting machine learning.  The large-scale operation of our system in the cloud, handling terabytes of real-life social media big data (a year of social media in our index involves about 30 billion documents across more than 40 languages), also proves wrong what is stated in [Scalability Fallacy] below.

Let us now list these widely spread rumours collected from the community about rule-based systems to see if they ring a bell, before we dive into the first two core myths in separate blogs to uncover the true picture behind them.

II.  Top 10 Misconceptions against Rules

[Hand-crafted Myth]  A rule-based system faces a knowledge bottleneck of hand-crafted development, while a machine learning system involves automatic training (implying no knowledge bottleneck). [see On Hand-crafted Myth of Knowledge Bottleneck.]

[Domain Portability Myth] The hand-crafted nature of a rule-based system leads to its poor domain portability, as rules have to be rebuilt each time we shift to a new domain; but in the case of machine learning, since the algorithm and system are universal, a domain shift only involves new training data (implying strong domain portability). [see Domain Portability Myth]

[Fragility Myth]  A rule-based system is very fragile and may break when faced with unseen language data, so it cannot lead to a robust real-life application.

[Weight Myth] Since there is no statistical weight associated with the results from a rule-based system, the data quality cannot be trusted with confidence.

[Complexity Myth] As a rule-based system is complex and intertwined, it is easy to get to a standstill, with little hope for further improvement.

[Scalability Fallacy]  The hand-crafted nature of a rule-based system makes it difficult to scale up for real-life applications; it is largely confined to the lab as a toy.

[Domain Restriction Fallacy]  A rule-based system only works in a narrow domain and it cannot work across domains.

[Grammaticality Fallacy] A rule-based system can only handle grammatical input in formal text (such as news, manuals, weather broadcasts); it fails on degraded text involving misspellings and ungrammaticality, such as social media, oral transcripts, jargon or OCR output.

[Outdated Fallacy]  A rule-based system is a technology of the last century and is outdated (implying that it no longer works or cannot result in a quality system in modern days).

[Data Quality Fallacy]  In terms of the data quality of results, a machine learning system is better than a rule-based system. (cf. On Recall of Grammar Engineering Systems)

III.  Retrospect and Reflection of Mainstream

As mentioned earlier, a long list of misconceptions about the old school of rule-based systems has circulated in the mainstream for years.  It may sound weird for an interdisciplinary field named Computational Linguistics to drift further and further away from linguistics; linguists play less and less of a role in an NLP dominated by statisticians today.  It seems widely assumed that with advanced deep learning algorithms, once data are available, a quality system can be trained without the need for linguistic design or domain expertise.

Not all mainstream scholars are one-sided and near-sighted.  In recent years, insightful articles (e.g. Church 2007, Wintner 2009) began a serious process of retrospect and reflection and called for the return of linguistics: "In essence, linguistics is altogether missing in contemporary natural language engineering research. … I want to call for the return of linguistics to computational linguistics." (Wintner 2009).  Let us hope that their voice will not be completely muffled in this new wave of deep learning heat.

Note that the rule systems which linguists are good at crafting in industry are different from classical linguistic study: they are formalized models of linguistic analysis.  For NLP tasks beyond the shallow level, an effective rule system is not a simple accumulation of computational lexicons and grammars, but involves a linguistic processing strategy (or linguistic algorithm) for different levels of linguistic phenomena.  However, this line of study on NLP platform design, system architecture and formalism has ever smaller space for academic discussion and publication, and research funding has become almost impossible to obtain; as a result, the new generation faces the risk of a cut-off legacy, with a full generation gap of talent in academia.  Church (2007) points out that statistical research is so dominant and one-sided that only one voice is now heard.  He is a visionary mainstream scientist, deeply concerned about the imbalance of the two schools in NLP and AI.  He writes:

Part of the reason why we keep making the same mistakes, as Minsky and Papert mentioned above, has to do with teaching. One side of the debate is written out of the textbooks and forgotten, only to be revived/reinvented by the next generation.  ……

To prepare students for what might come after the low hanging fruit has been picked over, it would be good to provide today’s students with a broad education that makes room for many topics in Linguistics such as syntax, morphology, phonology, phonetics, historical linguistics and language universals. We are graduating Computational Linguistics students these days that have very deep knowledge of one particular narrow sub-area (such as machine learning and statistical machine translation) but may not have heard of Greenberg’s Universals, Raising, Equi, quantifier scope, gapping, island constraints and so on. We should make sure that students working on co-reference know about c-command and disjoint reference. When students present a paper at a Computational Linguistics conference, they should be expected to know the standard treatment of the topic in Formal Linguistics.

We ought to teach this debate to the next generation because it is likely that they will have to take Chomsky’s objections more seriously than we have. Our generation has been fortunate to have plenty of low hanging fruit to pick (the facts that can be captured with short ngrams), but the next generation will be less fortunate since most of those facts will have been pretty well picked over before they retire, and therefore, it is likely that they will have to address facts that go beyond the simplest ngram approximations.



About Author

Dr. Wei Li is currently Chief Scientist at Netbase Solutions in the Silicon Valley, leading the effort for the design and development of a multi-lingual sentiment mining system based on deep parsing.  A hands-on computational linguist with 30 years of professional experience in Natural Language Processing (NLP), Dr. Li has a track record of making NLP work robust. He has built three large-scale NLP systems, all transformed into real-life, globally distributed products.


Note: This is the author’s own translation, with adaptation, of our paper in Chinese which originally appeared in W. Li & T. Tang, “Pride and Prejudice of Main Stream:  Rule-based System vs. Machine Learning“, in Communications of Chinese Computer Federation (CCCF), Issue 8, 2013



K. Church: A Pendulum Swung Too Far, Linguistics issues in Language Technology, 2011; 6(5)

Wintner 2009. What Science Underlies Natural Language Engineering? Computational Linguistics, Volume 35, Number 4

Domain portability myth in natural language processing

On Hand-crafted Myth and Knowledge Bottleneck

On Recall of Grammar Engineering Systems

Pros and Cons of Two Approaches: Machine Learning vs Grammar Engineering

It is untrue that Google SyntaxNet is the “world’s most accurate parser”

R. Srihari, W Li, C. Niu, T. Cornell: InfoXtract: A Customizable Intermediate Level Information Extraction Engine. Journal of Natural Language Engineering, 12(4), 1-37, 2006

Introduction of Netbase NLP Core Engine

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP



I. Introduction

Some well-known scholars who have reviewed the history of NLP (Natural Language Processing) recount how machine learning replaced traditional rule-based systems as the academic mainstream, describing the events of some 20 years ago as something like a soul-stirring religious war.  It must be acknowledged that the statisticians' complete victory in the NLP field has its historical inevitability.  The great results and benefits of machine learning on many NLP tasks are there for all to see: machine translation, speech recognition/synthesis, search ranking, spam filtering, document classification, automatic summarization, lexicon acquisition, named entity tagging, POS tagging, and so on (Church 2007).

Yet, browsing several recent survey articles by representative figures of the NLP field, I was still astonished to find no lack of mainstream pride and prejudice in them.  On reflection, the statistical community indeed harbors many deep-rooted preconceptions against traditional rule systems, along with sweeping conclusions that are widely circulated yet cannot withstand scrutiny.  What is frightening is not prejudice itself; prejudice is everywhere.  What is truly frightening is that prejudice spreads unchecked, and in the NLP field its prevalence has reached a staggering degree: accepting these preconceptions without a second thought has become the norm.  Hence this attempt to put them on record and to discuss the core ones in detail.  The misconceptions listed below can be seen everywhere and are widely spread; to avoid disputes I will not cite their sources, but anyone who has been in this field for some time knows these are by no means fabricated straw men.  They are specious and cannot withstand scrutiny, yet many take them for granted as truth.  It is not difficult to find a particular rule system matching each of them; the crux, however, lies in generalizing from the defects of specific systems to a methodological critique of rule systems as a whole.



[Misconception 3] Rule systems are fragile: when they encounter unforeseen language phenomena, the system will "break" (what does "break" mean — crash? freeze? fail?), so no robust product can be developed from them.



[Misconception 6] The hand-crafted nature of rule systems dooms them to impracticality; they cannot scale up and can only remain laboratory toys.


[Misconception 8] Rule systems can only process well-formed language (e.g. manuals, weather forecasts, news) and cannot cope with degraded text such as social media, colloquial speech, dialects, jargon, or OCR output.



The "misconceptions" listed fall into two types.  One type is biased views, such as Misconceptions 1 through 5.  These biases mainly stem from incomplete induction: their holders may have seen or tried some particular type of rule system, stopped at a shallow taste, and then jumped to conclusions.  That is understandable, though each still deserves correction one by one; this article is the first such effort at setting the record straight.  The other type of misconception is sheer fallacy, which facts can prove absurd.  It is astonishing that fallacies, too, can be so popular.  Misconceptions 6 and onward are all self-defeating fallacies.  For instance, Misconception 8 says rule systems can only analyze well-formed language.  Facts speak louder than words: the sentiment mining system we developed, based mainly on a rule system, processes precisely the non-standard text of social media.  The large-scale operation and use of this system also refutes Misconception 6; readers may judge for themselves whether such a rule system qualifies as a practical system:


The multilingual customer intelligence mining system, whose major clients are Global 500 enterprises, consists of two sub-systems.  The core engine is the back-end sub-system (back-end indexing engine), which automatically analyzes social media big data and extracts information from it.  The analysis and extraction results are stored with the open-source Apache Lucene text search engine.  The back-end index is generated within a Map-Reduce framework, using 200 virtual servers in a computing cloud for distributed indexing.  For the archive of one year of social media big data (about 30 billion documents across more than 40 languages), the back-end indexing system can complete the full indexing in about 7 days.  The front-end sub-system (front-end app) is a SaaS-based, search-like application.  Users log on to the application server through a browser and enter a topic of interest; the application server performs a distributed search over the back-end index, consolidates the results, and presents them to the user in configurable ways.  The whole process is almost instant, with a response time of only three or four seconds.

II. The Charge of Hand-craftedness against Rule Systems


The NLP mainstream's accumulated prejudices against rule systems and linguists are many, and this first one can be regarded as the root of them all.  Open any paper from a computational linguistics conference: whatever phenomenon it discusses, in arguing for the superiority of some machine learning algorithm, rule systems are mostly dragged in as convenient targets of attack alongside the competing learning algorithms.  And the reason for the attack is often just one sentence: the hand-crafted nature of rule systems makes them "hard to develop" (or "unable to scale up", "inefficient", "not robust", and so on).  Or no concrete reason is given at all: "References [1][2][3] have tried different aspects of this problem, but these systems are all hand-crafted."  A death sentence in one sentence, without even discussing their results and quality.  Hand-craftedness has almost become the "original sin" of rule systems, and the linguists who build them have thus become second-class citizens bearing original sin in the academic community.


In the real world, NLP is an applied discipline whose final results are embodied in application software; it belongs to language software engineering.  As an industry, software engineering has attracted countless engineers.  Although they jokingly call themselves "code farmers", the respect and rewards the community gives them are high (Bill Gates titled himself Chief Engineer, showing the software king's high regard for master craftsmen).  In ancient times there was Lu Ban the master carpenter; today there are coding masters.  Which of these engineers does not make a living by hand-crafting code?  We have never heard of a star engineer being belittled for the manual nature of coding.  It is the same software engineering, so why do computational linguists hand-crafting NLP code receive such different treatment from other engineers hand-crafting software code?  Could it be that NLP applications are simpler than others?  Quite the opposite: many natural language applications are more complex and difficult than most applications (e.g. graphics software or word processors).  The only explanation for this different treatment is that the software field at large does not have as much pride and prejudice as the small circle of the NLP mainstream.  The masters of the software field have not been so arrogant as to believe that automatic programming can replace manual programming.  They work hard on the infrastructure for manual programming (programming architectures, development environments, etc.) instead of pinning their hopes on the omnipotence of automatic programming.  Perhaps one day some simple applications will be realized through code automation, but the full automation of complex tasks is nowhere in sight.  By any standard, non-shallow natural language analysis and understanding is a complex task.  Therefore, machine learning, as one embodiment of automatic programming, can hardly replace hand-crafted code, and the application value of NLP rule systems will persist for a long time.



Is the hand-crafting of NLP systems a knowledge bottleneck for rule systems?  Undeniably, it is.  This bottleneck shows up in the system development cycle.  However, the bottleneck is shared by almost all large software engineering projects; it is a natural resource cost, not something "unique" to NLP.  In this sense, criticizing rule systems for the knowledge bottleneck is laughable, unless it can be proven that for all NLP projects, developing a system with machine learning takes a shorter cycle and yields higher quality than building a rule system (this may hold for individual projects, but generally it does not, as discussed later).  Roughly speaking, for shallow NLP applications (e.g. Chinese word segmentation, named entity recognition, etc.), a rule system cannot be delivered without three months of development, at least one computational linguist hand-crafting and debugging rules, and at least half an engineer for platform-level support.  For deep NLP applications (e.g. parsing, sentiment extraction), no real software product can be delivered without at least a year of development, involving at least one computational linguist hand-crafting rules, at least half a QA person assisting and half an engineer for platform support, plus the application-level investment common to software projects, such as user interface development.  Of course, how many development resources are needed depends largely on the experience and quality of the developers (including the computational linguists as knowledge engineers) and on the infrastructure of the system platform and development environment.

The main work of a computational linguist building a rule system is to use formal tools to write and debug language rules, various lexicons, and the control flow of linguistic analysis.  Macroscopically, this process is not essentially different from a software engineer writing an application; what differ are the language, formalism, and development platform used, and the emphasis of system design and development.  It is like the contrast between a modern engineer using a so-called high-level language such as Java or C and an engineer 30 years ago using assembly language: it is the same programming in essence, just at different levels.  On "high-level" languages and platforms tailored for NLP, computational linguists need not be fettered by non-linguistic engineering details such as memory allocation, nor, in general, be troubled by code optimization and efficiency.  Their attention goes instead to the many complex phenomena of natural language: how to design the architecture and flow of language processing; how to balance the tightness and looseness of rule conditions; how to coordinate with QA (quality assurance) to keep development healthy; how to enforce the operational norms for the linguist team writing rules (unit testing, regression testing, code review, baselines, etc.) to ensure system sustainability; how to request extensions to the existing formalism when its limits are reached by development needs; how to guarantee the robustness of a complex system; and how to reach beyond the rule-system framework to coordinate with other processing, including machine learning.  A lead computational linguist is the architect of the rule system; the success or failure of a system never lies merely in the writing and accumulation of language rules, but much more in the soundness of the system architecture.  Star engineers are the soul of software enterprises, and the large-scale success of NLP rule systems likewise calls for masters of language engineering.

The prejudice about the knowledge bottleneck must be assessed through comparison.  Natural language processing requires linguistic knowledge, and formalizing that knowledge is a built-in requirement of every NLP system; machine learning is by no means immune, as if it needed no knowledge formalization.  Rule systems require the investment of linguists' manual development; machine learning likewise requires resource investment, just in a different form.  Specifically, machine learning's knowledge bottleneck lies in its need for large training data sets.  Setting aside unsupervised learning, which is strong in research but weak in practicality, the machine learning method usable for developing systems is supervised learning.  The prerequisite for supervised learning to yield applicable knowledge systems is a large amount of manually labeled data as the source of learning.  Although the learning process is automatic (the innovation, debugging, and implementation of learning algorithms are of course still manual), the massive data labeling is manual (cases with existing labeled data aside; those are exceptions).  Thus machine learning equally faces a knowledge bottleneck; it merely changes form, from requiring a few linguists to requiring a large number of low-end laborers (high school or college students who understand the language and the task can qualify).  Marx said money is the universal equivalent; the knowledge bottleneck question thus turns into a question of the cost of, and conversion between, high-level and low-level labor: is it more expensive to hire one computational linguist, or ten high school students?  Although the answer varies with the project, the region, and other factors, the myth that machine learning has no knowledge bottleneck can be laid to rest.

Moreover, the comparison of knowledge bottlenecks should not target a single application only, but should be examined in terms of portability across multiple applications.  We know that the technical support for most non-shallow NLP applications derives from specific information extraction from natural language: extracting relationships, events, sentiments, etc.  Since machine learning treats information extraction as a black box directly mapping input to output, once the extraction target and application direction change, the previous manual labeling is discarded, and the labeling work, being the knowledge bottleneck, must start all over again.  Rule systems are different: they are usually designed as a hierarchy of rules, with a domain-independent language analyzer (parser) supporting domain-specific information extractors.  As a result, when the application target shifts, the parser, as the technical foundation, remains unchanged; only new extraction rules need to be written.  Practice proves that for rule systems the real knowledge bottleneck lies in the development of the parser, while information extraction itself does not cost much.  This is because the former must cope with natural language's endlessly varied expressions and map them into logic, while the latter is built on logical forms, where one rule is equivalent to hundreds or thousands of underlying surface rules.  Hence, from the multi-application perspective, the knowledge cost of rule systems tends to shrink, while machine learning enjoys no such convenience.
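The parser-plus-extractor layering can be pictured with a minimal sketch.  The dictionary-shaped logical form and the `extract_acquisition` rule below are hypothetical illustrations of the layering idea, not the data structures of any actual engine.

```python
# Hypothetical sketch: one extraction rule over parser-normalized logical
# forms replaces many surface-pattern rules.

# Assume the (domain-independent) parser maps all of these surface variants:
#   "Google acquired DeepMind"
#   "DeepMind was acquired by Google"
#   "Google's acquisition of DeepMind"
# to the same logical form with a predicate and role slots:
logical_form = {"pred": "acquire", "agent": "Google", "theme": "DeepMind"}

def extract_acquisition(lf):
    """Domain-specific extractor: one rule on the logical form, instead of
    separate surface rules for actives, passives, nominalizations, etc."""
    if lf.get("pred") == "acquire" and "agent" in lf and "theme" in lf:
        return ("ACQUISITION", lf["agent"], lf["theme"])
    return None

print(extract_acquisition(logical_form))  # ('ACQUISITION', 'Google', 'DeepMind')
```

When the application target shifts, only thin rules like `extract_acquisition` are rewritten; the parser that produced the logical form is reused as-is, which is the claimed saving in knowledge cost.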

III. Reflections of the Mainstream

As mentioned above, many misconceptions are entrenched in the mainstream consciousness of the NLP field, and they are hard to reverse.  It is rare in this world to see so strange a phenomenon: a field called Computational Linguistics has kept marginalizing linguistics and linguists.  The rule systems that linguists excel at are entirely different from traditional linguistics; they are an implementable embodiment of formal linguistics.  For non-shallow NLP tasks, an effective rule system cannot be a simple accumulation of computational lexicons and grammars; it embodies linguistic processing strategies (or algorithms) for different linguistic phenomena.  However, this line of research has an ever-shrinking space on NLP podiums, and funding is hard to obtain, so the new generation of scholars faces the danger of a broken legacy.  Church (2007) points out that the one-sided dominance of statistical research in NLP is so obvious that no other voice can be heard.  After the low-hanging fruit of shallow NLP has been almost entirely picked, when the next generation of scholars faces complex tasks, a deficiency in linguistic nourishment may leave the statistical line stretched thin.

Encouragingly, in recent years insightful people in the mainstream (e.g. Church 2007, Wintner 2009) have begun a process of reflection and have called for the return of linguistics: "In essence, linguistics is altogether missing in contemporary natural language engineering research. … I want to call for the return of linguistics to computational linguistics." (Wintner 2009).  I believe their voice will receive more and more attention.


  • Church 2007. A Pendulum Swung Too Far.  Linguistics issues in Language Technology,  Volume 2, Issue 4.
  • Wintner 2009. What Science Underlies Natural Language Engineering? Computational Linguistics, Volume 35, Number 4


Originally published (in Chinese) as W. Li & T. Tang, "Pride and Prejudice of the Mainstream: Rule Systems vs. Machine Learning"









On Recall of Grammar Engineering Systems

After I showed the benchmarking results of SyntaxNet and our rule system based on grammar engineering, many people seemed surprised that the rule system beats the newest deep-learning-based parser in data quality.  I then got asked many questions, one of which is:

Q: We know that rules crafted by linguists are good at precision, how about recall?

This question is worth a more in-depth discussion and a serious answer, because it touches the core of the viability of the "forgotten" school: why is it still there?  What does it have to offer?  The key is that excellent data quality is the advantage of a hand-crafted system: not only precision, but high recall is achievable as well.

Before we elaborate, here was my quick answer to the above question:

  • Unlike precision, recall is not rules’ forte, but there are ways to enhance recall;
  • To enhance recall without compromising precision, one needs to develop more rules, organize the rules in a hierarchy, and organize the grammars in a pipeline, so recall is a function of time;
  • To enhance recall with limited compromise in precision, one can fine-tune the rules to loosen their conditions.

Let me address these points by presenting the scene of action for this linguistic art in its engineering craftsmanship.

A rule system is based on compiled computational grammars.  A grammar is a set of linguistic rules encoded in some formalism.  What happens in grammar engineering is not much different from other software engineering projects.  As knowledge engineer, a computational linguist codes a rule in an NLP-specific language, based on a development corpus.  The development is data-driven: each line of rule code goes through rigorous unit tests and then regression tests before it is submitted as part of the updated system.  Depending on the architect's design, all types of information are available for the linguist developer to use in crafting a rule's conditions.  A rule can check any element of a pattern by enforcing conditions on (i) the word or stem itself (i.e. a string literal, for capturing, say, idiomatic expressions), and/or (ii) POS (part-of-speech, such as noun, adjective, verb, preposition), and/or (iii) orthography features (e.g. initial upper case, mixed case, token with digits and dots), and/or (iv) morphology features (e.g. tense, aspect, person, number, case, etc., decoded by a previous morphology module), and/or (v) syntactic features (e.g. verb subcategorization features such as intransitive, transitive, ditransitive), and/or (vi) lexical semantic features (e.g. human, animal, furniture, food, school, time, location, color, emotion).  There are almost infinite combinations of such conditions that can be enforced in rules' patterns.  A linguist's job is to use such conditions to maximize the benefits of capturing the target language phenomena, through a process of trial and error.

Given this description of grammar engineering, what we expect to see in the initial stage of development is a system that is precision-oriented by nature.  Each rule is geared towards a target linguistic phenomenon observed in the development corpus: conditions can be as tight as one wants them to be, ensuring precision.  But no single rule, or small set of rules, can cover all the phenomena, so recall is low in the beginning stage.  To push things to the extreme: if a rule system consisted of only one grammar with only one rule, it would not be difficult to quickly build a system with 100% precision but very poor recall.  But what good is a system that is precise but has no coverage?

So a linguist is trained to generalize.  In fact, most linguists are over-trained in school for theorizing and generalization before they get involved in industrial software development.  In my own experience turning new linguists into knowledge engineers, I often have to de-train this aspect of their education by enforcing a strict procedure of data-driven, regression-free development.  As a result, the system generalizes only to the extent allowed while maintaining a target precision, say 90% or above.

It is a balancing art, and experienced linguists do it better than new graduates.  Out of the explosive space of possible conditions, one tests only the most likely combinations, guided by linguistic knowledge and judgement, in order to reach the desired precision with maximized recall on the target phenomena.  For a given rule, it is always possible to increase recall at the expense of precision by dropping a condition or replacing a strict condition with a loose one (e.g. checking a feature instead of a literal, or checking a general feature such as noun instead of a narrow feature such as human).  Once a rule is fine-tuned with the right conditions for the desired balance of precision and recall, the linguist developer moves on to another rule to cover more of the target phenomena.
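The precision/recall trade in loosening a condition can be seen in a tiny made-up example.  Here the feature names and tokens are invented for illustration: replacing the narrow feature human with the general feature noun widens the matched set (more recall) but also pulls in a case the rule may not have intended (risk to precision).

```python
# Toy illustration of loosening one condition in a rule.
# Tokens and feature names are made up for this sketch.
tokens = [
    {"stem": "teacher",   "features": {"noun", "human"}},
    {"stem": "committee", "features": {"noun", "human"}},
    {"stem": "table",     "features": {"noun", "furniture"}},
]

# strict condition: narrow semantic feature
strict = [t["stem"] for t in tokens if "human" in t["features"]]
# loose condition: general category feature
loose = [t["stem"] for t in tokens if "noun" in t["features"]]

assert strict == ["teacher", "committee"]
assert loose == ["teacher", "committee", "table"]  # broader coverage, riskier
```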

So, as development goes on and more data from the development corpus come onto the developer’s radar, more rules are developed to cover more and more phenomena, much like silkworms eating mulberry leaves.  This is the incremental enhancement fairly typical of software development cycles for new releases.  Most of the time, newly developed rules overlap with existing rules, but their logical OR covers an enlarged territory.  It is hard work, but recall gradually and naturally picks up over time while precision is maintained, until the effort hits the long tail of diminishing returns.

There are two caveats worth discussing for readers curious about this “seasoned” school of grammar engineering.

First, not all rules are equal.  A non-toy rule system often provides a mechanism for organizing rules in a hierarchy, for better quality as well as easier maintenance: after all, a grammar that is hard to understand and difficult to maintain has little prospect for debugging and incremental enhancement.  Typically, a grammar has some general rules at the top which serve as defaults and cover the majority of phenomena well, but which make mistakes on the exceptions, and exceptions are not rare in natural language.  As is well known, natural language is such a monster that almost no rule is without exceptions.  Remember high school grammar class, where the teacher taught us grammar rules.  For example, one rule says that a bare verb cannot serve as predicate with a third person singular subject: the subject must agree with the predicate in person and number, adding -s to the verb; hence She leaves instead of *She leave.  But soon we find exceptions in sentences like The teacher demanded that she leave.  This exception to the original rule occurs only in object clauses following certain main-clause verbs such as demand, labeled by linguists as the subjunctive mood.  This more restricted rule needs to work together with the more general rule to form a better-formulated grammar.

Likewise, in building a computational grammar for automatic parsing or other NLP tasks, we need to handle a spectrum of rules with different degrees of generalization in order to achieve good data quality with balanced precision and recall.  Rather than piling more and more restrictions onto a general rule to keep it from overkilling the exceptions, it is more elegant and practical to organize the rules in a hierarchy, so that general rules apply only as defaults after the more specific rules have been tried, or equivalently, specific rules apply to overturn or correct the results of the general rules.  Thus, most real-life formalisms are equipped with a hierarchy mechanism to help linguists develop computational grammars that model the human linguistic capability in language analysis and understanding.
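The specific-before-general ordering can be sketched in a few lines.  This is a toy model, not any real grammar formalism: the subjunctive case from the agreement example above is reduced to a single invented feature, and the specific rule is consulted before the general default fires.

```python
# Toy sketch of a two-level rule hierarchy for subject-verb agreement.
# The clause representation and the "subjunctive" feature are invented
# for illustration.
def check_agreement(clause):
    # specific rule, tried first: subjunctive object clauses
    # (e.g. "The teacher demanded that she leave") allow a bare verb
    if "subjunctive" in clause["features"]:
        return "ok"
    # general default rule: a 3rd person singular subject
    # requires the -s form of the verb ("She leaves")
    if clause["subject_3sg"] and not clause["verb"].endswith("s"):
        return "agreement error"
    return "ok"

# *She leave -> flagged by the general default rule
assert check_agreement({"verb": "leave", "subject_3sg": True,
                        "features": set()}) == "agreement error"
# ... demanded that she leave -> rescued by the specific rule
assert check_agreement({"verb": "leave", "subject_3sg": True,
                        "features": {"subjunctive"}}) == "ok"
```

The same effect can be obtained by letting a later specific rule overwrite the output of an earlier general rule; which arrangement is used depends on the formalism.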

The second point, which also concerns the recall of a rule system, is so significant yet so often neglected that it cannot be over-emphasized; it deserves a separate write-up of its own.  I will only present a concise conclusion here.  It relates to multi-level parsing, which can significantly enhance recall for both parsing and parsing-supported NLP applications.  In a multi-level rule system, each level is one module of the system, involving one grammar.  Lower-level grammars help build local structures (e.g. basic noun phrases), performing shallow parsing.  Systems thus designed are not only good for modularized engineering, but also great for recall, because shallow parsing shortens the distance between words that hold syntactic relations (including long-distance relations), and the lower-level constructions clear the way for higher-level rules to generalize over linguistic phenomena.
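A minimal two-level sketch shows why shallow parsing helps the higher levels generalize.  The tag names and the chunking logic below are illustrative only: a low-level pass merges determiner/adjective/noun runs into single NP units, so a higher-level rule can state one pattern, verb + NP, instead of enumerating every internal shape of the noun phrase.

```python
# Toy two-level pipeline: level 1 groups basic NPs (shallow parsing),
# so level-2 rules see shorter, more uniform sequences.
def chunk_np(tagged):
    """Merge maximal runs of det/adj/noun tokens into single NP units."""
    out, i = [], 0
    while i < len(tagged):
        if tagged[i][1] in ("det", "adj", "noun"):
            j = i
            while j < len(tagged) and tagged[j][1] in ("det", "adj", "noun"):
                j += 1
            out.append((" ".join(w for w, _ in tagged[i:j]), "NP"))
            i = j
        else:
            out.append(tagged[i])
            i += 1
    return out

sent = [("saw", "verb"), ("the", "det"), ("old", "adj"), ("tree", "noun")]
chunked = chunk_np(sent)
assert chunked == [("saw", "verb"), ("the old tree", "NP")]
# a higher-level rule now needs only one pattern: verb + NP
```

The recall gain comes from exactly this compression: words that hold a syntactic relation end up adjacent or nearly so, so one higher-level rule covers what would otherwise require many.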

In summary, a parser based on grammar engineering can reach very high precision, and there are proven, effective ways of enhancing its recall; high recall can be achieved if enough time and expertise are invested in development.  As shown by our test results, our seasoned English parser is good at both precision (96% vs. SyntaxNet 94%) and recall (94% vs. SyntaxNet 95%, only one percentage point lower) in the news genre, and on social media our parser is robust enough to beat SyntaxNet in both precision (89% vs. 60%) and recall (72% vs. 70%).



Is Google SyntaxNet Really the World’s Most Accurate Parser?

It is untrue that Google SyntaxNet is the “world’s most accurate parser”





Small talk: World’s No 0

A few weeks ago, I had a chat with my daughter, who is planning to study CS.
“Dad, how are things going?”
“Got a problem: Google announced SyntaxNet, claimed to be the world’s No. 1.”
“Why a problem?”
“Well if they are no 1, where am I?”
“No 2?”
“No, I don’t know who is No. 1, but I have never seen a system beating ours. I might just as well be No. 0.”
“Brilliant, I like that! Then stay at No. 0 and let others fight for No. 1. …… In my data structures, I always start from 0 anyway.”