PhD Thesis: Chapter II Role of Grammar


2.0. Introduction

This chapter examines the role of grammar in handling the three major types of morpho-syntactic interface problems.  This investigation justifies the mono-stratal design of CPSG95, which contains feature structures of both morphology and syntax.

The major observations from this study are:  (i) grammatical analysis, including both morphology and syntax, plays the fundamental role in solving the morpho-syntactic problems;  (ii) when grammar alone is not sufficient to reach the final solution, knowledge beyond morphology and syntax may come into play and serve as “filters” on the results of the grammatical analysis.[1]  Based on these observations, this study pursues the interleaving of morphology and syntax in grammatical analysis.  Knowledge beyond morphology and syntax is left to future research.

Section 2.1 investigates the relationship between grammatical analysis and  the resolution of segmentation ambiguity.  Section 2.2 studies the role of syntax in handling Chinese productive word formation.  The borderline cases and their relationship with grammar are explored in 2.3.  Section 2.4 examines the relevance of knowledge beyond syntax to segmentation disambiguation.  Finally, a summary of the presented arguments and discoveries is given in 2.5.

2.1. Segmentation Ambiguity and Syntax

Segmentation ambiguity is one major problem which challenges the traditional word segmenter or an independent morphology.  The following study shows that this ambiguity is structural in nature, not fundamentally different from other structural ambiguity in grammar.  It will be demonstrated that sentential structural analysis is the key to this problem.

A huge amount of research effort in the last decade has been made on resolving segmentation ambiguity (e.g. Chen and Liu 1992; Gan 1995; He, Xu and Sun 1991; Liang 1987; Lua 1994; Sproat, Shih, Gale and Chang 1996; Sun and T’sou 1995; Sun and Huang 1996; X. Wang 1989; Wu and Su 1993; Yao, Zhang and Wu 1990; Yeh and Lee 1991; Zhang, Chen and Chen 1991; Guo 1997b).  Many (e.g. Sun and Huang 1996; Guo 1997b) agree that this is still an unsolved problem.  The major difficulty with most approaches reported in the literature lies in the lack of support from sufficient grammar knowledge.  To ultimately solve this problem, grammatical analysis is vital, a point to be elaborated in the subsequent sections.

2.1.1. Resolution of Hidden Ambiguity

The topic of this section is the treatment of hidden ambiguity.  The conclusion drawn from the investigation below is that the structural analysis of the entire input string provides a sound basis for handling this problem.

The following sample sentences illustrate a typical case involving the hidden ambiguity string 烤白薯 kao bai shu.

(2-1.) (a)      他吃烤白薯
ta         | chi  | kao-bai-shu
he      | eat  | baked-sweet-potato
[S [NP ta] [VP [V chi] [NP kao-bai-shu]]]
He eats the baked sweet potato.

(b) * ta       | chi  | kao          | bai-shu
he      | eat  | bake         | sweet-potato

(2-2.) (a) *    他会烤白薯
ta         | hui  | kao-bai-shu.
he      | can | baked-sweet-potato

(b)     ta       | hui  | kao          | bai-shu.
he      | can | bake         | sweet-potato
[S [NP ta] [VP [V hui] [VP [V kao] [NP bai-shu]]]]
He can bake sweet potatoes.

Sentences (2-1) and (2-2) are a minimal pair;  the only difference is the choice of the predicate verb, namely chi (eat) versus hui (can, be capable of).  But they have very different structures and assume different word identification.  This is because verbs like chi expect an NP object but verbs like hui require a VP complement.  The two segmentations of the string kao bai shu provide two possibilities, one as an NP kao-bai-shu and the other as a VP kao | bai-shu.  When the provided unit matches the expectation, it leads to a successful syntactic analysis, as illustrated by the parse trees in (2‑1a) and (2-2b).  When the expectation constraint is not satisfied, as in (2-1b) and (2-2a), the analysis fails.  These examples show that all candidate words in the input string should be considered for grammatical analysis.  The disambiguation choice can be made via the analysis, as seen in the examples above with the sample parse trees.  Correct segmentation results in at least one successful parse.
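
The selection mechanism described above can be sketched in a few lines of Python.  The sketch is illustrative only and is not the CPSG95 implementation:  the toy lexicon, the category labels V_np (verb expecting an NP object) and V_vp (verb expecting a VP complement), and all function names are assumptions introduced here.  It enumerates every candidate segmentation and keeps only those for which a sentence analysis S -> NP VP can be built.

```python
from typing import Dict, List, Set

# Toy lexicon (an assumption for illustration): word -> categories.
LEXICON: Dict[str, Set[str]] = {
    "ta": {"NP"},
    "chi": {"V_np"},         # 'eat': expects an NP object
    "hui": {"V_vp"},         # 'can': expects a VP complement
    "kao": {"V_np"},         # 'bake': expects an NP object
    "bai-shu": {"NP"},       # 'sweet potato'
    "kao-bai-shu": {"NP"},   # 'baked sweet potato'
}


def segmentations(syllables: List[str]) -> List[List[str]]:
    """Return every way to split the syllables into lexicon words."""
    if not syllables:
        return [[]]
    results = []
    for end in range(1, len(syllables) + 1):
        word = "-".join(syllables[:end])
        if word in LEXICON:
            for rest in segmentations(syllables[end:]):
                results.append([word] + rest)
    return results


def vp_ends(words: List[str], i: int) -> Set[int]:
    """Return the positions at which a VP starting at position i can end."""
    ends: Set[int] = set()
    if i >= len(words):
        return ends
    cats = LEXICON[words[i]]
    # VP -> V NP
    if "V_np" in cats and i + 1 < len(words) and "NP" in LEXICON[words[i + 1]]:
        ends.add(i + 2)
    # VP -> V VP
    if "V_vp" in cats:
        ends |= vp_ends(words, i + 1)
    return ends


def is_sentence(words: List[str]) -> bool:
    """S -> NP VP, where the VP must cover the rest of the input string."""
    return bool(words) and "NP" in LEXICON[words[0]] and len(words) in vp_ends(words, 1)


def analyses(sentence: str) -> List[List[str]]:
    """Keep only the segmentations that support a full sentence analysis."""
    return [seg for seg in segmentations(sentence.split()) if is_sentence(seg)]
```

With this toy grammar, analyses("ta chi kao bai shu") keeps only ta | chi | kao-bai-shu, and analyses("ta hui kao bai shu") keeps only ta | hui | kao | bai-shu.  Neither segmentation is discarded before the expectation of the predicate verb has been checked.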

He, Xu and Sun (1991) indicate that a hidden ambiguity string requires a larger context for disambiguation.  But they did not define what the ‘larger context’ should be.  The following discussion attempts to answer this question.

The input string to the parser constitutes a basic context as well as the object for sentential analysis.[2]  It will be argued that this input string is the proper context for handling the hidden ambiguity problem.  The point to be made is, context smaller than the input string is not reliable for the hidden ambiguity resolution.  This point is illustrated by the following examples of the hidden ambiguity string ge ren in (2-3).[3]  In each successive case, the context is expanded to form a new input string.   As a result, the analysis and the associated interpretation of ‘person’ versus ‘individual’ change accordingly.

(2-3.)  input string                            reading(s)

(a)      人  ren                                       person (or man, human)
[N ren]

(b)      个人  ge ren                               individual
[N ge-ren]

(c)      三个人  san ge ren                               three persons
[NP [CLAP [NUM san] [CLA ge]] [N ren]]

(d)      人的力量  ren de li liang                      the human power
[NP [DEP [NP ren] [DE de]] [N li-liang]]

(e)      个人的力量  ge ren de li liang                        the power of an individual
[NP [DEP [NP ge-ren] [DE de]] [N li-liang]]

(f)       三个人的力量  san ge ren de li liang              the power of three persons
[NP [DEP [NP [CLAP [NUM san] [CLA ge]] [N ren]] [DE de]] [N li-liang]]

(g)      他不是个人  ta bu shi ge ren.
           (1)    He is not a man. (He is a pig.)
[S [NP ta] [VP [ADV bu] [VP [V shi] [NP [CLAP ge] [N ren]]]]]
(2)  He is not an individual. (He represents a social organization.)
[S [NP ta] [VP [ADV bu] [VP [V shi] [NP  ge-ren]]]]

Comparing (a), (b) with (c), and (d), (e) with (f), one can see the associated change of readings when each successively expanded input string leads to a different grammatical analysis.  Accordingly, one segmentation is chosen over the other on the condition that the grammatical analysis of the full string can be established based on the segmentation.  In (b), the ambiguous string is all that is input to the parser, therefore the local context becomes full context.  It then acquires the lexical reading individual as the other possible segmentation ge | ren does not form a legitimate combination.  This reading may be retained, as in (e), or changed to the other reading person, as in (c) and (f), or reduced to one of the possible interpretations, as in (g), when the input string is further lengthened.  All these changes depend on the sentential analysis of the entire input string, as shown by the associated structural trees above.  It demonstrates that the full context is required for the adequate treatment of the hidden ambiguity phenomena.  Full context here refers to the entire input string to the parser.

It is necessary to explain some of the analyses as shown in the sample parses  above.  In Contemporary Mandarin, a numeral cannot  combine with a noun without a classifier in between.[4]  Therefore, the segmentation san (three) | ge-ren (individual) is excluded in (c) and (f), and the correct segmentation san (three) | ge (CLA) | ren (person) leads to the NP analysis.  In general, a classifier alone cannot combine with the following noun either, hence the interpretation of ge ren as one word ge-ren (individual) in (b) and (e).  A classifier usually combines with a preceding numeral or determiner before it can combine with the noun.  But things are more complicated.  In fact, the Chinese numeral yi (one) can be omitted when the NP is in object position.  In other words, the classifier alone can combine with a noun in a very restricted syntactic environment.  That explains the two readings in (g).[5]
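
The combinatory constraints just described can be made concrete with a small sketch.  It is a simplification under stated assumptions, not the grammar of CPSG95:  the lexicon LEX, the function names, and in particular the Boolean flag object_position (which stands in for the restricted syntactic environment that licenses the omission of yi) are all hypothetical.

```python
from typing import List

# Toy lexicon (an assumption): each word has exactly one category.
LEX = {"san": "NUM", "ge": "CLA", "ren": "N", "ge-ren": "N"}


def segmentations(syllables: List[str]) -> List[List[str]]:
    """Return every way to split the syllables into lexicon words."""
    if not syllables:
        return [[]]
    results = []
    for end in range(1, len(syllables) + 1):
        word = "-".join(syllables[:end])
        if word in LEX:
            for rest in segmentations(syllables[end:]):
                results.append([word] + rest)
    return results


def np_ok(words: List[str], object_position: bool = False) -> bool:
    """Check the NP patterns discussed in the text."""
    cats = [LEX[w] for w in words]
    if cats == ["N"]:
        return True
    if cats == ["NUM", "CLA", "N"]:   # numeral + classifier + noun
        return True
    if cats == ["CLA", "N"]:          # yi (one) omitted: object position only
        return object_position
    return False                      # e.g. NUM + N without a classifier


def np_readings(text: str, object_position: bool = False) -> List[List[str]]:
    """Return the segmentations of text that form a legitimate NP."""
    return [seg for seg in segmentations(text.split())
            if np_ok(seg, object_position)]
```

Under these assumptions, np_readings("ge ren") yields only ge-ren, as in (b);  np_readings("san ge ren") yields only san | ge | ren, as in (c);  and np_readings("ge ren", object_position=True) yields both segmentations, corresponding to the two readings in (g).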

The following is a summary of the arguments presented above.   These arguments have been shown to account for the hidden ambiguity phenomena.  The next section will further demonstrate the validity of these arguments for overlapping ambiguity as well.

(2-4.) Conclusion
The grammatical analysis of the entire input string is required for the adequate treatment of the hidden ambiguity problem in word identification.

2.1.2. Resolution of Overlapping Ambiguity

This section investigates overlapping ambiguity and its resolution.  A previous influential theory is examined, which claims that the overlapping ambiguity string can be locally disambiguated.   However, this theory is found to be unable to account for a significant amount of data.  The conclusion is that both overlapping ambiguity and hidden ambiguity require a context of the entire input string and a grammar for disambiguation.

For overlapping ambiguity, comparing different critical tokenizations can detect the ambiguity, but such a technique cannot guarantee a correct choice without introducing other knowledge.  Guo (1997) pointed out:

As all critical tokenizations hold the property of minimal elements on the word string cover relationship, the existence of critical ambiguity in tokenization implies that the “most powerful and commonly used” (Chen and Liu 1992, page 104) principle of maximum tokenization would not be effective in resolving critical ambiguity in tokenization and implies that other means such as statistical inferencing or grammatical reasoning have to be introduced.

However, He, Xu and Sun (1991) claim that overlapping ambiguity can be resolved within the local context of the ambiguous string.  They classify the overlapping ambiguity string into nine types.  The classification is based on the categories of the presumably correctly segmented words in the ambiguous strings, as described below.

Suppose there is an overlapping ambiguous string consisting of ABC;  both AB and BC are entries listed in the lexicon.  There are two possible cases.  In case one, the category of A and the category of BC define the classification of the ambiguous string.  This is the case when the segmentation A|BC is considered correct.  For example, in the  ambiguous string 白天鹅 bai tian e, the word AB is  bai-tian (day-time) and the word BC is tian-e (swan).  The correct segmentation for this string is assumed to be A|BC, i.e. bai (A: white) | tian-e (N: swan) (in fact, this cannot be taken for granted as shall be shown shortly), therefore, it belongs to the A-N type.  In case two, i.e. when the segmentation AB|C is considered correct, the category of AB and the category C define the classification of the ambiguous string.   For example, in the ambiguous string 需求和 xu qiu he, the word AB is  xu-qiu (requirement) and the word BC qiu-he (sue for peace).  The correct segmentation for this string is AB|C, i.e. xu-qiu (N: requirement) | he (CONJ: and) (again, this should not be taken for granted), therefore, it belongs to the N-CONJ type.
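
The configuration ABC, with both AB and BC listed in the lexicon, can be detected mechanically.  The following sketch shows one way to do so;  the toy lexicon and the function name are assumptions made for illustration, and the sketch only detects the ambiguity without choosing between A|BC and AB|C.

```python
from typing import List, Set, Tuple

# Toy lexicon (an assumption): only the multi-syllable entries matter here.
LEXICON: Set[str] = {"bai-tian", "tian-e", "xu-qiu", "qiu-he"}


def overlapping_ambiguities(syllables: List[str]) -> List[Tuple[str, str]]:
    """Return (AB, BC) pairs of lexicon words that overlap on a non-empty B."""
    found = []
    n = len(syllables)
    for start in range(n):                        # A begins at start
        for mid in range(start + 1, n):           # B begins at mid
            for ab_end in range(mid + 1, n):      # AB covers start..ab_end-1
                ab = "-".join(syllables[start:ab_end])
                if ab not in LEXICON:
                    continue
                for bc_end in range(ab_end + 1, n + 1):   # C is non-empty
                    bc = "-".join(syllables[mid:bc_end])
                    if bc in LEXICON:
                        found.append((ab, bc))
    return found
```

For the two strings discussed above, overlapping_ambiguities("bai tian e".split()) returns the pair (bai-tian, tian-e) and overlapping_ambiguities("xu qiu he".split()) returns (xu-qiu, qiu-he).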

After classifying the overlapping ambiguous strings into one of nine types, using the two different cases described above, they claim to have discovered a rule.[6]  That is, the category of the correctly segmented word BC in case one (or AB in case two) is predictable from AB (or BC in case two) within the local ambiguous string.  For example, the category of tian-e (swan) in bai | tian-e (white swan) is a noun.  This information is predictable from bai tian within the respective local string bai tian e.  The idea is, if ever an overlapping ambiguity string is formed of bai tian and C, the judgment of bai | tian-C as the correct segmentation entails that the word tian-C  must be a noun.  Otherwise, the segmentation A|BC is wrong and the other segmentation AB|C is right.  For illustration, it is noted that tian-shi (angel) in the ambiguous string 白天使 bai | tian-shi (white angel) is, as expected, a noun.  This predictability of the category information from within the local overlapping ambiguous string is seen as an important discovery (Feng 1996).  Based on this assumed feature of the overlapping ambiguous strings, He,  Xu and Sun (1991) developed their theory that an overlapping ambiguity string can be disambiguated within the local string itself.

The proposed disambiguation process within the overlapping ambiguous string proceeds as follows.  In order to correctly segment an overlapping ambiguous string, say, bai tian e or bai tian shi, the following information needs to be given under the entry bai-tian (day-time) in the tokenization lexicon:  (i) an ambiguity label, to indicate the necessity to call a disambiguation rule;  (ii) the ambiguity type A-N, to indicate that it should call the rule corresponding to this type.  Then the following disambiguation rule can be formulated.

(2-5.) A-N type rule       (He,  Xu and Sun 1991)
In the overlapping ambiguous string A(1)…A(i) B(1)…B(j) C(1)…C(k),
if        B(1)…B(j) and C(1)…C(k) form a noun,
then  the correct segmentation is A(1)…A(i) | B(1)…B(j)-C(1)…C(k),
else    the correct segmentation is A(1)…A(i)-B(1)…B(j) | C(1)…C(k).

This way, bai tian e and bai tian shi will always be segmented as bai (white) | tian-e (swan) and bai (white) | tian-shi (angel) instead of bai-tian (daytime) | e (goose) and bai-tian (daytime) | shi (make).  This can be easily accommodated in a segmentation algorithm provided the above information is added to the lexicon and the disambiguation rules are implemented.  The whole procedure runs within the local context of the overlapping ambiguous string and uses only lexical information.  For this reason, they call the resolution of overlapping ambiguity morphology-based disambiguation, which has no need to consult syntax, semantics or discourse.
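
To make the locality of this procedure explicit, rule (2-5) can be transcribed as a function.  This is a reading of the rule as stated above, not code from He, Xu and Sun (1991);  the toy lexicon and the function signature are assumptions.

```python
from typing import Dict, List

# Toy lexicon (an assumption): word -> category.
LEXICON: Dict[str, str] = {
    "bai": "A",
    "bai-tian": "N",
    "tian-e": "N",
    "tian-shi": "N",
    "e": "N",
    "shi": "V",
}


def a_n_rule(a: List[str], b: List[str], c: List[str]) -> List[str]:
    """Rule (2-5): if B and C form a noun, choose A | BC; else choose AB | C.

    a, b and c are the syllable sequences A(1)...A(i), B(1)...B(j) and
    C(1)...C(k) of the overlapping ambiguous string.
    """
    bc = "-".join(b + c)
    if LEXICON.get(bc) == "N":
        return ["-".join(a), bc]
    return ["-".join(a + b), "-".join(c)]
```

The function receives only the ambiguous string itself.  Nothing outside A, B and C is available to it, so it returns bai | tian-e for bai tian e regardless of the sentence in which the string occurs.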

Feng (1996) emphasizes that He, Xu and Sun’s view on the overlapping ambiguous string constitutes a valuable contribution to the theory of Chinese word identification.  Indeed, this overlapping ambiguous string theory, if it were right, would be a breakthrough in this field.  It in effect suggests that the majority of the segmentation ambiguity is resolvable without and before a grammar module.  A handful of simple rules, like the A-N type rule formulated above, plus a lexicon would solve most ambiguity problems in word identification.[7]

Feng (1996) provides examples for all the nine types of overlapping ambiguous strings as evidence to support He, Xu and Sun (1991)’s theory.   In the case of the A-N type ambiguous string bai tian e, the correct segmentation is supposed to be bai | tian-e in this theory.  However, even with his own cited example, Feng ignores a perfect second reading (parse) when the time NP bai-tian (daytime) directly acts as a modifier for the sentence with no need for a preposition, as shown in (2‑6b) below.

(2-6.)           白天鹅游过来了
bai tian e you guo lai le       (Feng 1996)

(a)      bai     | tian-e       | you          | guo-lai      | le.
          white | swan        | swim        | over-here  | LE
[S [NP bai tian-e] [VP you guo-lai le]]
The white swan swam over here.

(b)      bai-tian       | e              | you          | guo-lai      | le.
          day-time      | goose        | swim        | over-here  | LE
[S [NP+mod bai-tian] [S [NP e] [VP you guo-lai le]]]
In the day time the geese swam over here.

In addition, one only needs to add a preposition zai (in) to the beginning of the sentence to make the abandoned segmentation bai-tian | e the only right one in the changed context.  The presumably correct segmentation, namely bai | tian-e, now turns out to be wrong, as shown in (2-7a) below.

(2-7.)           在白天鹅游过来了
zai bai tian e you guo lai le

(a) *   zai     | bai           | tian-e       | you          | guo-lai      | le.
          in      | white        | swan        | swim        | over-here  | LE

(b)      zai     | bai-tian    | e              | you          | guo-lai      | le.
          in      | day-time   | goose        | swim        | over-here  | LE
[S [PP+mod zai bai-tian] [S [NP e] [VP you guo-lai le]]]
In the day time the geese swam over here.

The above counter-example is by no means accidental.  In fact, for each cited ambiguous string in the examples given by Feng, there exist counter-examples.  It is not difficult to construct a different context where the preferred segmentation within the local string, i.e. the segmentation chosen according to one of the rules, is proven to be wrong.[8]  In the pairs of sample sentences (2‑8) through (2-10), (a) is an example which Feng (1996) cited to support the view that the local ambiguous string itself is enough for disambiguation.  Sentences in (b) are counter-examples to this theory.  It is a notable fact that the listed local string is often properly contained in a more complicated ambiguous string in an expanded context, seen in (2-9b) and (2-10b).  Therefore, even when the abandoned segmentation can never be linguistically correct in any context, as shown for tu-xing (graph) | shi (BM) in (2-9) where a bound morpheme still exists after the segmentation, it does not entail the correctness of the other segmentation in all contexts.  These data show that all possible segmentations should be retained for the grammatical analysis to judge.

(2-8.)  V-N type of overlapping ambiguous string

          yan jiu sheng ming:
          yan-jiu (V:study) | sheng-ming (N:life)
yan-jiu-sheng (N:graduate student) | ming (life/destiny)

(a)      研究生命的本质
          yan-jiu    sheng-ming de      ben-zhi
          study          life               DE     essence
Study the essence of life.

(b)      研究生命金贵
           yan-jiu-sheng      ming  jin-gui
          graduate-student  life     precious
Life for graduate students is precious.

(2-9.)  CONJ-N type of overlapping ambiguous string
和平等 he ping deng:
          he (CONJ:and) | ping-deng (N:equality)
he-ping (N:peace) | deng (V:wait)?

(a)      独立自主和平等互利的原则
           du-li-zi-zhu           he      ping-deng-hu-li               de      yuan-ze
          independence       and    equal-reciprocal-benefit  DE     principle
the principle of independence and equal reciprocal benefit

(b)      和平等于胜利
          he-ping       deng-yu       sheng-li
          peace           equal           victory
Peace is equal to victory.

(2-10.)  V-P type of overlapping ambiguous string
看中和 kan zhong he:
          kan-zhong (V:target) | he (P:with)
kan (V:see) | zhong-he (V:neutralize)

(a)      他们看中和日本人做生意的机会
ta-men    kan-zhong   he      ri-ben          ren              zuo     sheng-yi      de      ji-hui    
they         target           with    Japan          person          do      business     DE   opportunity
They have targeted the opportunity to do business with the Japanese.

(b)      这要看中和作用的效果
zhe          yao    kan    zhong-he-zuo-yong                   de          xiao-guo
this    need  see     neutralization                DE     effect
This will depend on the effects of the neutralization.

The data in (b) above directly contradict the claim that an overlapping ambiguous string can be disambiguated within the local string itself.  While this approach is shown to be inappropriate in practice, the following comment attempts to reveal its theoretical motivation.

As reviewed in the previous text, He, Xu and Sun (1991)’s overlapping ambiguity theory is established on the classification of the overlapping ambiguous strings.  A careful examination of their proposed nine types of the overlapping ambiguous strings reveals an underlying assumption on which the classification is based.  That is, the correctly segmented words within the overlapping ambiguous string will automatically remain correct in a sentence containing the local string.   This is in general untrue, as shown by the counter-examples above.[9]   The following analysis reveals why.

Within the local context of the overlapping ambiguous string, the chosen segmentation often leads to a syntactically legitimate structure while the abandoned segmentation does not.  For example,  bai (white) | tian-e (swan) combines into a valid syntactic unit while there is no structure which can span bai-tian (daytime) | e (goose).  For another example,  yan-jiu (study) | sheng-ming (life) can be combined into a legitimate verb phrase [VP [V yan-jiu] [NP sheng-ming]], but  yan-jiu-sheng (graduate student) | ming (life/destiny) cannot.  But that legitimacy only stands locally within the boundary of the ambiguous string.  It does not necessarily hold true in a larger context containing the string.  As shown previously in (2-7a),  the locally legitimate structure bai | tian-e (white swan) does not lead to a successful parse for the sentence.  In contrast, the locally abandoned segmentation bai-tian (daytime) | e (goose) has turned out to be right with the parse in (2-7b).   Therefore, the full context instead of the local context of the ambiguous string is required for the final judgment on which segmentation can be safely abandoned.  Context smaller than the entire input string is not reliable for the overlapping ambiguity resolution.  Note that exactly the same conclusion has been reached for the hidden ambiguous strings in the previous section.

The following data in (2-11) further illustrate the point of the full context requirement for the overlapping ambiguity resolution, similar to what has been presented for the hidden ambiguity phenomena in (2-3).  In each successive case, the context is expanded to form a new input string.  As a result, the interpretation of ‘goose’ versus ‘swan’ changes accordingly.

(2-11.)  input string                reading(s)

(a)      鹅 e                                goose
[N e]

(b)      天鹅 tian e                                swan
[N tian-e]

(c)      白天鹅 bai tian e                       white swan
[N [A bai] [N tian-e]]

(d)      鹅游过来了 e you guo lai le.
The geese swam over here.
[S [NP e] [VP you guo-lai le]]

(e)      天鹅游过来了 tian e you guo lai le.
The swans swam over here.
[S [NP tian-e] [VP you guo-lai le]]

(f)      白天鹅游过来了 bai tian e you guo lai le.
          (i)       The white swan swam over here.
[S [NP bai tian-e] [VP you guo-lai le]]
          (ii)      In the daytime, the geese swam over here.
[S [NP+mod bai-tian] [S [NP e] [VP you guo-lai le]]]

(g)       在白天鹅游过来了 zai bai tian e you guo lai le.
            In the daytime, the geese swam over here.
[S [PP+mod zai bai-tian] [S [NP e] [VP you guo-lai le]]]

(h)      三只白天鹅游过来了 san zhi bai tian e you guo lai le.
           Three white swans swam over here.
[S [NP san zhi bai tian-e] [VP you guo-lai le]]

It is interesting to compare (c) with (f), (g) and (h) to see their associated change of readings based on different ways of  segmentation.  In (c), the overlapping ambiguous string is all that is input to the parser, therefore the local context becomes full context.  It then acquires the reading white swan corresponding to the segmentation bai | tian-e.  This reading may be retained, or changed, or reduced to one of the possible interpretations when the input string is lengthened.  That is respectively the case in (h), (g) and (f).  All these changes depend on the grammatical analysis of the entire input string.  It shows that the full context and a grammar are required for the resolution of most ambiguities;  and when sentential analysis cannot disambiguate – in cases of ‘genuine’ segmentation ambiguity like (f), the structural analysis can make explicit the ambiguity in the form of multiple parses (readings).
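
The behaviour observed in (2-11f) through (2-11h) can be reproduced with a toy recognizer.  The sketch below is a simplification under stated assumptions and not the CPSG95 grammar:  the lexicon, the category label NT (time noun), the fixed VP pattern and the three rules (NP -> (NUM CLA) (A) N;  modifier -> NT or P NT;  S -> NP VP or modifier S) are all introduced here for illustration.

```python
from typing import List, Optional

# Toy lexicon (an assumption): word -> category; NT marks a time noun.
LEXICON = {
    "bai": "A", "tian-e": "N", "e": "N", "bai-tian": "NT",
    "you": "V", "guo-lai": "DIR", "le": "LE",
    "zai": "P", "san": "NUM", "zhi": "CLA",
}
VP_PATTERN = ["V", "DIR", "LE"]   # covers only 'you guo-lai le'


def segmentations(syllables: List[str]) -> List[List[str]]:
    """Return every way to split the syllables into lexicon words."""
    if not syllables:
        return [[]]
    results = []
    for end in range(1, len(syllables) + 1):
        word = "-".join(syllables[:end])
        if word in LEXICON:
            for rest in segmentations(syllables[end:]):
                results.append([word] + rest)
    return results


def np_end(cats: List[str], i: int) -> Optional[int]:
    """NP -> (NUM CLA) (A) N.  Return the end position, or None."""
    j = i
    if cats[j:j + 2] == ["NUM", "CLA"]:
        j += 2
    if cats[j:j + 1] == ["A"]:
        j += 1
    return j + 1 if cats[j:j + 1] == ["N"] else None


def modifier_end(cats: List[str], i: int) -> Optional[int]:
    """Sentence modifier: a bare time noun, or P followed by a time noun."""
    if cats[i:i + 1] == ["NT"]:
        return i + 1
    if cats[i:i + 2] == ["P", "NT"]:
        return i + 2
    return None


def is_sentence(cats: List[str], i: int = 0) -> bool:
    """S -> NP VP  |  modifier S, covering the input from i to the end."""
    end = np_end(cats, i)
    if end is not None and cats[end:] == VP_PATTERN:
        return True
    end = modifier_end(cats, i)
    return end is not None and is_sentence(cats, end)


def readings(text: str) -> List[List[str]]:
    """Return every segmentation that supports a parse of the whole string."""
    return [seg for seg in segmentations(text.split())
            if is_sentence([LEXICON[w] for w in seg])]
```

Under these assumptions, readings("bai tian e you guo lai le") returns both segmentations, matching the two parses in (f), whereas adding zai or san zhi in front leaves exactly one, as in (g) and (h).  The same local string thus receives different segmentations depending solely on the analysis of the entire input.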

In the light of the inquiry in this section, the theoretical significance of the distinction between overlapping ambiguity and hidden ambiguity seems to have diminished.[10]  They are both structural in nature.  They both require full context and a grammar for proper treatment.

(2-12.) Conclusion

(i)  It is not necessarily true that an overlapping ambiguous string can be disambiguated within the local string.

(ii) The grammatical analysis of the entire input string is required for the adequate treatment of the overlapping ambiguity problem as well as the hidden ambiguity problem.

2.2. Productive Word Formation and Syntax

This section examines the connection between productive word formation and segmentation ambiguity.  The observation is that ambiguity can potentially be involved in every type of word formation.  The point to be made is that no independent morphology system can resolve this ambiguity when syntax is unavailable.  This is because words formed via morphology, just like words looked up from the lexicon, only provide syntactic ‘candidate’ constituents for the sentential analysis.  The choice is decided by the structural analysis of the entire sentence.

Derivation is a major type of productive word formation in Chinese.   Section 1.2.2 has given an example of the involvement of hidden ambiguity in derivation, repeated below.

(2-13.)         这道菜没有吃头  zhe dao cai mei you chi tou.

(a)      zhe    | dao          | cai            | mei-you    | chi-tou
          this    | CLA                   | dish         | not-have   | worth-of-eating
[S [NP zhe dao cai] [VP [V mei-you] [NP chi-tou]]]
This dish is not worth eating.

(b) ?   zhe    | dao          | cai            | mei-you    | chi  | tou
          this    | CLA                   | dish         | not have   | eat  | head
[S [NP zhe dao cai] [VP [ADV mei-you] [VP [V chi] [NP tou]]]]
This dish did not eat the head.

(2-14.)         他饿得能吃头牛 ta e de neng chi tou niu.

(a) *   ta       | e              | de            | neng         | chi-tou               | niu
he      | hungry     | DE3         | can           | worth-of-eating  | ox

(b)      ta       | e              | de            | neng         | chi  | tou            | niu
he      | hungry     | DE3         | can           | eat  | CLA          | ox
[…[VP [V e] [DE3P [DE3 de] [VP [V neng] [VP [V chi] [NP tou niu]]]]]]
He is so hungry that he can eat an ox.

A derivation rule like the one in (2-15) is responsible for combining the transitive verb stem and the suffix –tou (worth-of) into a derived noun for (2-13a) and (2-14a).

(2-15.)         X (transitive verb) + tou –> X-tou (noun, semantics: worth-of-X)

However, when syntax is not available, there is always a danger of wrongly applying this morphological rule due to possible ambiguity involved, as shown in (2-14a).  In other words, morphological rules only provide candidate words;  they cannot make the decision whether these words are legitimate in the context.
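
The status of derived words as mere candidates can be illustrated with a sketch of rule (2-15).  The set of transitive verbs, the dictionary layout of the derived entry and the function names are assumptions made here for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Toy list of transitive verb stems (an assumption).
TRANSITIVE_VERBS = {"chi", "kan"}


def derive_tou(stem: str, suffix: str) -> Optional[Dict[str, str]]:
    """Rule (2-15): X (transitive verb) + tou -> X-tou (noun, worth-of-X)."""
    if stem in TRANSITIVE_VERBS and suffix == "tou":
        return {"form": f"{stem}-tou", "cat": "N", "sem": f"worth-of-{stem}"}
    return None


def candidate_derivations(syllables: List[str]) -> List[Tuple[int, Dict[str, str]]]:
    """Return (position, derived word) wherever the rule could apply."""
    candidates = []
    for i in range(len(syllables) - 1):
        derived = derive_tou(syllables[i], syllables[i + 1])
        if derived is not None:
            candidates.append((i, derived))
    return candidates
```

The rule proposes chi-tou for the input of (2-13) and equally for the input of (2-14), where the derived noun is not part of any successful analysis.  The morphological rule by itself has no means of telling the two cases apart.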

Reduplication is another method for productive word formation in Chinese.  An outstanding problem is the AB –> AABB reduplication or AB –> AAB reduplication if AB is a listed word.   In these cases, some reduplication rules or procedures need to be involved to recognize AABB or AAB.  If reduplication is a simple process confined to a local small context, it may be possible to handle it by incorporating some procedure-based function calls during the lexical lookup.  For example, when a three-character string, say 分分心 fen fen xin, cannot be found in the lexicon, the reduplication function will check whether the first two characters are the same, and if yes, delete one of them and consult the lexicon again.  This method is expected to handle the AAB type reduplication, e.g. fen-xin (divide-heart: distract) –> fen-fen-xin (distract a bit).
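
The lookup procedure for AAB reduplication described above might be sketched as follows.  The toy lexicon, the function name and the shape of the return value are assumptions;  the sketch covers only the three-character AAB case.

```python
from typing import Dict, List, Optional, Tuple

# Toy lexicon (an assumption): word -> category.
LEXICON: Dict[str, str] = {"fen-xin": "V"}


def lookup_with_aab(syllables: List[str]) -> Optional[Tuple[str, str, bool]]:
    """Look up a string; on failure, try to undo AAB reduplication.

    Returns (base word, category, is_reduplicated), or None if neither the
    string nor its de-reduplicated form is listed.
    """
    word = "-".join(syllables)
    if word in LEXICON:
        return (word, LEXICON[word], False)
    # AAB: the first two characters are identical; delete one and retry.
    if len(syllables) == 3 and syllables[0] == syllables[1]:
        base = "-".join(syllables[1:])
        if base in LEXICON:
            return (base, LEXICON[base], True)
    return None
```

For fen fen xin the direct lookup fails, the first two characters match, and the second lookup finds fen-xin, so the string is recognized as its reduplicated form.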

However, segmentation ambiguity can be involved in reduplication as well.  Compare the following examples in (2-16) and (2-17), both containing the sub-string fen fen xin:  the first is free of ambiguity but the second is ambiguous.  In fact, (2‑17) involves the overlapping ambiguous string shi fen fen xin, with two candidate segmentations:  shi (ten) | fen-fen-xin (distract a bit) and shi-fen (very) | fen-xin (distract).  Based on the conclusion presented in 2.1, grammatical analysis is required to resolve this segmentation ambiguity.  This is illustrated in (2‑17).

(2-16.)         让他分分心

rang     | ta    | fen-fen-xin
let      | he   | distracted-a-bit
Let him relax a while.

(2-17.)         这件事十分分心

zhe jian shi shi fen fen xin.

(a) *   zhe    | jian          | shi           | shi  | fen-fen-xin
          this    | CLA         | thing       | ten  | distracted a bit

(b)      zhe    | jian          | shi            | shi-fen     | fen-xin
           this    | CLA         | thing       | very         | distract
[S [NP zhe jian shi] [VP [ADV shi-fen] [V fen-xin]]]
This thing is very distracting.

Finally, there is also possible ambiguity involvement in the proper name formation.  Proper names for persons, locations, etc. that are not listed in the lexicon are recognized as another major problem in word identification (Sun and Huang 1996).[11]  This problem is complicated when ambiguity is involved.

A Chinese person name usually consists of a family name followed by a given name of one or two characters.  For example, the late Chinese chairman mao-ze-dong (Mao Zedong) used to have another name li-de-sheng (Li Desheng).  In the lexicon, li is a listed family name.  Both de-sheng and sheng mean ‘win’.  This may lead to three ways of word segmentation, a complicated case involving both overlapping ambiguity and hidden ambiguity:  (i) li | de-sheng;  (ii) li-de | sheng;  (iii) li-de-sheng, as shown in (2-18) below.

(2-18.)         李得胜了 li de sheng le.

(a)      li        | de-sheng  | le
           Li       | win          | LE
[S [NP li] [VP de-sheng le]]
Li won.

(b)      li-de   | sheng       | le
           Li De | win          | LE
[S [NP li-de] [VP sheng le]]
Li De won.

(c) *    li-de-sheng  | le
           Li Desheng  | LE

For this particular type of compounding, the family name serves as the left boundary of a potential compound name of person and the length can be used to determine candidates.[12]  Again, the choice is decided by the grammatical analysis of the entire sentence, as illustrated in (2-18).
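
Candidate generation for person names along these lines can be sketched as follows.  The list of family names and the function name are assumptions;  the sketch only proposes candidates of the two permitted lengths and leaves the choice among them to the subsequent analysis.

```python
from typing import List, Tuple

# Toy list of family names (an assumption).
FAMILY_NAMES = {"li", "mao"}


def person_name_candidates(syllables: List[str]) -> List[Tuple[int, int, str]]:
    """Propose (start, end, name) spans: family name + 1 or 2 given-name syllables."""
    candidates = []
    for i, syllable in enumerate(syllables):
        if syllable not in FAMILY_NAMES:
            continue                      # the family name is the left boundary
        for given_len in (1, 2):          # given name of one or two characters
            end = i + 1 + given_len
            if end <= len(syllables):
                candidates.append((i, end, "-".join(syllables[i:end])))
    return candidates
```

For the input of (2-18), li de sheng le, the candidates are li-de and li-de-sheng.  Together with the listed family name li, they correspond to the three segmentations (a) through (c).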

(2-19.) Conclusion

Due to the possible ambiguity involvement in productive word formation, a grammar containing both morphology and syntax is required for an adequate treatment.  An independent morphology system or separate word segmenter cannot solve ambiguity problems.

2.3. Borderline Cases and Grammar

This section reviews some outstanding morpho-syntactic borderline phenomena.  The points to be made are:  (i) each proposed morphological or syntactic analysis should be justified in terms of capturing the linguistic generality;  (ii) the design of a grammar should facilitate the access to the knowledge from both morphology and syntax in analysis.

The nature of the borderline phenomena calls for the coordination of morphology and syntax in a grammar.  Chinese separable verbs are one typical example.  The co-existence of their contiguous use and separate use leads to confusion over whether they belong to the lexicon and morphology or are syntactic phenomena.  In fact, as will be discussed in Chapter V, there are different degrees of ‘separability’ for different types of Chinese separable verbs;  there is no uniform analysis which can handle all separable verbs properly.  Different types of separable verbs may justify different approaches to the problems.  In terms of capturing linguistic generality, a good analysis should account for the demonstrated variety of separated uses and link the separated use and the contiguous use.

‘Quasi-affixation’ is another outstanding interface problem.  This problem requires careful morpho-syntactic coordination.  As presented in Chapter I, structurally, ‘quasi-affixes’ and ‘true’ affixes demonstrate very similar word formation potential, but ‘quasi-affixes’ often retain some ‘solid’ meaning while the meaning of ‘true’ affixes is functionalized.  Therefore, the key is how to coordinate the semantic contribution of the words derived via ‘quasi-affixation’ with the building of the semantics for the entire sentence.  This coordination requires flexible information flow between data structures for morphology, syntax and semantics during the morpho-syntactic analysis.

In short, the proper treatment of the morpho-syntactic borderline phenomena requires inquiry into each individual problem in order to reach a morphological or syntactic analysis which maximally captures linguistic generality.  It also calls for the design of a grammar where information between morphology and syntax can be effectively coordinated.

2.4. Knowledge beyond Syntax

This section examines the roles of knowledge beyond syntax in the resolution of segmentation ambiguity.  Despite the fact that further information beyond syntax may be necessary for a thorough solution to segmentation ambiguity,[13] it will be argued that syntax is the appropriate place for initiating this process due to the structural nature of segmentation ambiguity.

Depending on which type of information is essential for the disambiguation, disambiguation can be classified as structure-oriented, semantics-oriented and pragmatics-oriented.  This classification hierarchy is modified from that in He, Xu and Sun (1991).  They have classified the hidden ambiguity disambiguation into three categories:  syntax-based, semantics-based and pragmatics-based.  Together with the morphology-based disambiguation which is equivalent to the overlapping ambiguity resolution in their theory, they have built a hierarchy from morphology up to pragmatics.

A note on the technical details is called for here.  The term X‑oriented (where X is either syntax, semantics or pragmatics) is selected here instead of X-based in order to avoid the potential misunderstanding that X is the basis for the relevant disambiguation.  It will be shown that while information from X is required for the ambiguity resolution, the basis is always syntax.

Based on the study in 2.1, it is believed that there is no morphology-based (or morphology-oriented) disambiguation independent of syntax.  This is because the context of morphology is a local context, too small for resolving structural ambiguity.  There is little doubt that the morphological analysis is a necessary part of word identification in terms of handling productive word formation.  But this analysis cannot by itself resolve ambiguity, as argued in 2.2.  The notion ‘structure’ in structure-oriented disambiguation includes both syntax and morphology.

He, Xu and Sun (1991) exclude overlapping ambiguity resolution from the classification beyond morphology.  This exclusion is not appropriate.  In fact, the resolution of both hidden ambiguity and overlapping ambiguity can be classified into this hierarchy.  To illustrate this point, for each class I will give examples of both hidden ambiguity and overlapping ambiguity.

Sentences in (2-20) and (2-21), which contain the hidden ambiguity string 阵风 zhen feng, are examples of structure-oriented disambiguation.  This type of disambiguation, which relies on a grammar, constitutes the bulk of the disambiguation task required for word identification.

(2-20.)         一阵风吹过来了
yi zhen feng chui guo lai le.          (Feng 1996)

(a)      yi       | zhen         | feng         | chui          | guo-lai      | le
          one    | CLA          | wind        | blow         | over-here  | LE
[S [NP [CLAP yi zhen] [N feng]] [VP chui guo-lai le]]
A gust of wind blew here

(b) *   yi     | zhen-feng     | chui | guo-lai   | le
        one    | gusts-of-wind | blow | over-here | LE

(2-21.)         阵风会很快来临 zhen feng hui hen kuai lai lin.

(a)      zhen-feng     | hui  | hen  | kuai | lai-lin
         gusts-of-wind | will | very | soon | come
[S [NP zhen-feng] [VP hui hen kuai lai-lin]]
Gusts of wind will come very soon.

(b) *   zhen  | feng | hui  | hen  | kuai | lai-lin
        CLA   | wind | will | very | soon | come

Compare (2-20a), where the ambiguity string is identified as two words zhen (CLA) feng (wind), with (2-21a), where the string is taken as one word zhen-feng (gusts-of-wind).  In Chinese syntax, a numeral cannot combine directly with a noun, nor can a classifier alone when it is in non-object position.  The numeral and the classifier must combine with each other before they can combine with a noun.  So (2-20b) and (2-21b) are both ruled out while (2-20a) and (2-21a) are structurally well-formed.
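The structural argument can be made concrete with a minimal sketch.  The toy grammar below is an assumption built only for these two sentences:  the rule names follow the bracketings in (2-20a) and (2-21a), while the tags NUM, DIR, AUX and ADV and the flat VP rules are illustrative simplifications, not CPSG95 rules.  Because the grammar has no rule combining a bare numeral or a bare classifier with a noun, only the (a) segmentations are recognized as sentences.

```python
# Toy phrase-structure rules (hypothetical; not the CPSG95 grammar).
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("CLAP", "N"), ("N",)],      # no NUM+N rule and no CLA+N rule
    "CLAP": [("NUM", "CLA")],           # numeral and classifier combine first
    "VP": [("V", "DIR", "LE"), ("AUX", "ADV", "ADV", "V")],
}


def derives(symbol, tags):
    """Return True if `symbol` can derive exactly the tag sequence `tags`."""
    if symbol not in GRAMMAR:           # a terminal tag must match itself
        return tags == (symbol,)
    return any(matches(rhs, tags) for rhs in GRAMMAR[symbol])


def matches(rhs, tags):
    """Return True if the symbols in `rhs` can jointly derive `tags`."""
    if not rhs:
        return not tags
    first, rest = rhs[0], rhs[1:]
    # Try every non-empty prefix for the first symbol, leaving at least
    # one tag for each remaining symbol.
    return any(
        derives(first, tags[:k]) and matches(rest, tags[k:])
        for k in range(1, len(tags) - len(rest) + 1)
    )


def is_sentence(tagged_words):
    """Check whether a segmented, tagged word list is recognized as S."""
    return derives("S", tuple(tag for _, tag in tagged_words))


SEGMENTATIONS = {
    "2-20a": [("yi", "NUM"), ("zhen", "CLA"), ("feng", "N"),
              ("chui", "V"), ("guo-lai", "DIR"), ("le", "LE")],
    "2-20b": [("yi", "NUM"), ("zhen-feng", "N"),
              ("chui", "V"), ("guo-lai", "DIR"), ("le", "LE")],
    "2-21a": [("zhen-feng", "N"), ("hui", "AUX"), ("hen", "ADV"),
              ("kuai", "ADV"), ("lai-lin", "V")],
    "2-21b": [("zhen", "CLA"), ("feng", "N"), ("hui", "AUX"),
              ("hen", "ADV"), ("kuai", "ADV"), ("lai-lin", "V")],
}

for label, words in SEGMENTATIONS.items():
    print(label, is_sentence(words))
```

The recognizer never looks at the hanzi themselves;  the decision between one word and two falls out of whether the whole tag sequence can be built into an S.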

For the structure-oriented overlapping ambiguity resolution,  numerous examples have been cited before, and one typical example is repeated below.

(2-22.)         研究生命金贵 yan jiu sheng ming jin gui

(a)      yan-jiu-sheng       | ming         | jin-gui
graduate student | life            | precious
[S [NP yan-jiu-sheng] [S [NP ming] [AP jin-gui]]]
Life for graduate students is precious.

(b) *   yan-jiu        | sheng-ming        | jin-gui
study          | life                     | precious

As a predicate, the adjective jin-gui (precious) syntactically expects an NP as its subject, which is saturated by the second NP ming (life) in (2-22a).  The first NP serves as the topic of the sentence and is semantically linked to the subject ming (life) as its possessor.[14]  But there is no parse for (2-22b), despite the fact that the sub-string yan-jiu sheng-ming (to study life) forms a verb phrase [VP [V yan-jiu] [NP sheng-ming]] and the sub-string sheng-ming jin-gui (life is precious) forms a sentence [S [NP sheng-ming] [AP jin-gui]].  On the one hand, the VP in the subject position does not satisfy the syntactic constraint (the category NP) expected by the adjective jin-gui (precious), although other adjectives, say zhong-yao ‘important’, may expect a VP subject.  On the other hand, the transitive verb yan-jiu (study) expects an NP object.  It cannot take an S object (an embedded object clause) as other verbs, say ren-wei (think), do.

The resolution of the following hidden ambiguity belongs to the semantics-oriented disambiguation.

(2-23.)         请把手抬高一点儿 qing ba shou tai gao yi dian er            (Feng 1996)

(a1)    qing             | ba   | shou         | tai            | gao | yi-dian-er
          please          | BA  | hand        | hold         | high| a-little
[VP [ADV qing] [VP ba shou tai gao yi-dian-er]]
Please raise your hand a little higher.

(a2) * qing   | ba   | shou         | tai            | gao           | yi-dian-er
          invite | BA  | hand        | hold         | high         | a-little

(b1) * qing             | ba-shou    | tai            | gao           | yi-dian-er
          please          | N:handle  | hold         | high         | a-little

(b2) ? qing   | ba-shou    | tai            | gao           | yi-dian-er
          invite | N:handle  | hold         | high         | a-little
[VP [VG [V qing] [NP ba-shou]] [VP tai gao yi-dian-er]]
Invite the handle to hold a little higher.

This is an interesting example.  The same character qing is both an adverb ‘please’ and a verb ‘invite’.  (2-23b2) is syntactically valid, but violates the semantic constraint or semantic selection restriction.  The logical object of qing (invite) should be human but ba-shou (handle)  is not human.  The two syntactically valid parses (2-23a1) and (2-23b2), which correspond to two ways of segmentation, are expected to be somehow disambiguated on the above semantic grounds.

The following case is an example of semantics-oriented resolution of the overlapping ambiguity.

(2-24.)         茶点心吃了 cha dian xin chi le.

(a1)    cha    | dian-xin   | chi  | le
tea     | dim sum  | eat  | LE
[S [NP+object cha dian-xin] [VP chi le]]
The tea dim sum was eaten.

(a2) ? cha    | dian-xin   | chi  | le
tea     | dim sum  | eat  | LE
[S [NP+agent cha dian-xin] [VP chi le]]
The tea dim sum ate (something).

(a3) ? cha    | dian-xin   | chi  | le
tea     | dim sum  | eat  | LE
[S [NP+object cha ] [S [NP+agent dian-xin] [VP chi le]]]
Tea, the dim sum ate.

(a4) ? cha    | dian-xin   | chi  | le
tea     | dim sum  | eat  | LE
[S [NP+agent cha ] [VP [NP+object dian-xin] [VP chi le]]]
The tea ate the dim sum.

(b1) ? cha-dian               | xin           | chi  | le
tea dim sum         | heart       | eat  | LE
[S [NP+object cha-dian] [S [NP+agent xin] [VP chi le]]]
The tea dim sum, the heart ate.

(b2) ? cha-dian               | xin           | chi  | le
tea dim sum         | heart        | eat  | LE
[S [NP+agent cha-dian] [VP [NP+object xin] [VP chi le]]]
The tea dim sum ate the heart.

Most Chinese dictionaries list the compound noun cha-dian (tea-dim-sum), but not cha dian-xin, which stands for the same thing, namely the snacks served with tea.  As shown above, there are four analyses for one segmentation and two analyses for the other.  All six are syntactically legitimate and correspond to six different readings.  But only one analysis makes sense, namely the implicit passive construction in (a1), with the compound noun cha dian-xin as the preceding (logical) object.  The other five analyses are nonsensical and can be ruled out if the semantic selection restriction that an animate being eats (chi) food is enforced.  Syntactically, (a2) is an active construction with the optional object omitted.  The constructions in (a3) and (b1) involve long-distance dependency, where the object is topicalized and placed at the beginning.  The SOV (Subject Object Verb) pattern in (a4) and (b2) is a very restrictive construction in Chinese.[15]
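A semantic filter of this kind, applied to the syntactically valid analyses, can be sketched as follows.  The role labels are taken from the bracketings in (2-24);  the feature table `FEATURES` and the helper `satisfies` are hypothetical, and an actual system would draw such features from its lexicon rather than from a hand-written table.

```python
# Hypothetical semantic features for the fillers in (2-24).
FEATURES = {
    "cha dian-xin": {"food"},
    "cha-dian": {"food"},
    "dian-xin": {"food"},
    "cha": {"food"},
    "xin": {"body-part"},
}

# Role assignments read off the six bracketed analyses in (2-24).
ANALYSES = {
    "a1": {"object": "cha dian-xin"},
    "a2": {"agent": "cha dian-xin"},
    "a3": {"object": "cha", "agent": "dian-xin"},
    "a4": {"agent": "cha", "object": "dian-xin"},
    "b1": {"object": "cha-dian", "agent": "xin"},
    "b2": {"agent": "cha-dian", "object": "xin"},
}

# Selection restriction of chi (eat): an animate being eats food.
CHI_RESTRICTION = {"agent": "animate", "object": "food"}


def satisfies(analysis, restriction=CHI_RESTRICTION):
    """True if every role that is filled meets the verb's restriction."""
    return all(
        restriction[role] in FEATURES.get(filler, set())
        for role, filler in analysis.items()
    )


surviving = [label for label, roles in ANALYSES.items() if satisfies(roles)]
print(surviving)
```

The filter operates on roles that only the structural analysis can supply, which is the sense in which the disambiguation is semantics-oriented but still syntax-based.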

The pragmatics-oriented disambiguation is required for the case where ambiguity remains after the application of both structural and semantic constraints.[16]  The sentences containing this type of ambiguity are genuinely ambiguous within the sentence boundary, as shown with the multiple parses in (2-25) for the hidden ambiguity and (2-26) for the overlapping ambiguity below.

(2-25.)         他喜欢烤白薯 ta xi huan kao bai shu.

(a)      ta       | xi-huan    | kao          | bai-shu.
          he      | like           | bake         | sweet-potato
[S [NP ta] [VP [V xi-huan] [VP [V kao] [NP bai-shu]]]]
He likes baking sweet potatoes.

(b)      ta       | xi-huan    | kao-bai-shu.
          he      | like           | baked-sweet-potato
[S [NP ta] [VP [V xi-huan] [NP kao-bai-shu]]]
He likes the baked sweet potatoes.

(2-26.)         研究生命不好 yan jiu sheng ming bu hao

(a)      yan-jiu-sheng       | ming         | bu   | hao.
          graduate student | destiny     | not | good
[S [NP yan-jiu-sheng] [S [NP ming] [AP bu hao]]]
The destiny of graduate students is not good.

(b)      yan-jiu        | sheng-ming        | bu   | hao.
          study          | life                     | not | good
[S [VP yan-jiu sheng-ming] [AP bu hao]]
It is not good to study life.

An important distinction should be made among these classes of disambiguation.  Some ambiguity must be solved in order to get a reading during analysis.  Other ambiguity can be retained in the form of multiple parses, corresponding to multiple readings.  In either case, it demonstrates that at least a grammar (syntax and morphology) is required.  The structure-oriented ambiguity belongs to the former, and can be handled by the appropriate structural analysis.  The semantics-oriented ambiguity and the pragmatics-oriented ambiguity belong to the latter, so multiple parses are a way out.  The examples for different classes of ambiguity show that the structural analysis is the foundation for handling ambiguity problems in word identification.  It provides possible structures for the semantic constraints or pragmatic constraints to work on.

In fact, the resolution of segmentation ambiguity in Chinese word identification is but a special case of the resolution of structural ambiguity for NLP in general.  Grammatical analysis has been routinely used to resolve, or to prepare the basis for resolving, structural ambiguity such as PP attachment.[17]

2.5. Summary

The most important discovery in the field of Chinese word identification presented in this chapter is that the resolution of both types of segmentation ambiguity involves the analysis of the entire input string.  This means that the availability of a grammar is the key to the solution of this problem.

This chapter has also examined the involvement of ambiguity in productive word formation and reached the following conclusion.  A grammar for morphological analysis as well as for sentential analysis is required for an adequate treatment of this problem.  This establishes the foundation for the general design of CPSG95 as consisting of morphology and syntax in one grammar formalism.[18]

The study of the morpho-syntactic borderline problems shows that a sophisticated grammar design is called for so that information between morphology and syntax can be effectively coordinated.  This is the work to be presented in Chapter III and Chapter IV.  It also demonstrates that each individual borderline problem should be studied carefully in order to reach a morphological or syntactic analysis which maximally captures linguistic generality.  This study will be pursued in Chapter V and Chapter VI.




[1]  Constraints beyond morphology and syntax can be implemented as subsequent modules, or “filters”, in order to select the correct analysis when morpho-syntactic analysis leads to multiple results (parses).  Alternatively, such constraints can also be integrated into CPSG95 as components parallel to, and interacting with, morphology and syntax.  W. Li (1996) illustrates how semantic selection restriction can be integrated into syntactic constraints in CPSG95 to support Chinese parsing.

[2] In theory, if discourse is integrated in the underlying grammar, the input can be a unit larger than sentence, say, a paragraph or even a full text.  But this will depend on the further development in discourse theory and its formalization.  Most grammars in current use assume sentential analysis.

[3]  Similar examples for the overlapping ambiguity string will be shown in 2.1.2.

[4]  But in Ancient Chinese, a numeral can freely combine with countable nouns.

[5] These two readings in written Chinese correspond to an obvious difference in Spoken Chinese:  ge (CLA) in (g1) is weakened in pronunciation, marked by the dropping of the tone, while in (g2) it reads with the original 4th tone emphatically.

[6] It is likely that what they have found corresponds to Guo’s discovery of “one tokenization per source” (Guo 1998).  Guo’s finding is based on his experimental study involving domain (“source”) evidence and seems to account for the phenomena better.  In addition, Guo’s strategy in his proposal is also more effective, reported to be one of the best strategies for disambiguation in word segmenters.

[7] According to He, Xu and Sun (1991)’s statistics on a corpus of 50833 Chinese characters, the overlapping ambiguous strings make up 84.10%, and the hidden ambiguous strings 15.90%, of all ambiguous strings.

[8] Guo (1997b) goes to the other extreme and hypothesizes that “every tokenization is possible”.  Although this statement seems too strong, the investigation in this chapter shows that, at least domain-independently, local context is very unreliable for making a tokenization decision one way or the other.

[9] However, this assumption may become statistically valid within a specific domain or source, as examined in Guo (1998).  But Guo did not give an operational definition of source/domain.  Without such a definition, it is difficult to decide where to collect the domain-specific information required for disambiguation based on the principle of one tokenization per source, as proposed by Guo (1998).

[10] This distinction is crucial in the theories of Liang (1987) and He,  Xu and Sun (1991).

[11] This work is now defined as one fundamental task, called Named Entity tagging, in the world of information extraction (MUC-7 1998).  There have been great advances in developing Named Entity taggers both for Chinese (e.g. Yu et al. 1997; Chen et al. 1997) and for other languages.

[12] That is what was actually done with the CPSG95 implementation.  More precisely, the family name expects a special sign with hanzi-length of 1 or 2 to form a full name candidate.

[13] A typical, sophisticated word segmenter making reference to knowledge beyond syntax is presented in Gan (1995).

[14] This is in fact one very common construction in Chinese in the form of NP1 NP2 Predicate.  Other examples include ta (he) tou (head) tong (ache): ‘he has a head-ache’ and ta (he) shen-ti (body) hao (good): ‘he is good in health’.

[15] For the detailed analysis of these constructions, see W. Li (1996).

[16] It seems that it may be more appropriate to use terms like global disambiguation or discourse-oriented disambiguation instead of the term pragmatics-oriented disambiguation for the relevant phenomena.

[17] It seems that some PP attachment problems can be resolved via grammatical analysis alone, for example, put something on the table; found the key to that door.  Others require information beyond syntax (semantics, discourse, etc.) for a proper solution, for example, see somebody with a telescope.  In either case, the structural analysis provides a basis.  The same holds for disambiguation in Chinese word identification.

[18] In fact, once morphology is incorporated in the grammar, the identification of both vocabulary words and non-listable words becomes a by-product during the integrated morpho-syntactic analysis.  Most ambiguity is resolved automatically and the remaining ambiguity will be embodied in the multiple syntactic trees as the results of the analysis.  This has been shown to be true and viable by W. Li (1997, 2000) and Wu and Jiang (1998).




PhD Thesis: Chapter I Introduction

1.0. Foreword

This thesis addresses the issue of the Chinese morpho-syntactic interface.  This study is motivated by the need for a solution to a series of long-standing problems at the interface.  These problems pose challenges to an independent morphology system or a separate word segmenter as there is a need to bring in syntactic information in handling these problems.

The key is to develop a Chinese grammar which is capable of representing sufficient information from both morphology and syntax.  On the basis of the theory of Head-Driven Phrase Structure Grammar (Pollard and Sag 1987, 1994), the thesis will present the design of a Chinese grammar, named CPSG95 (for Chinese Phrase Structure Grammar).  The interface between morphology and syntax is defined system internally in CPSG95.  For each problem, arguments will be presented for the linguistic analysis involved.  A solution to the problem will then be formulated based on the analysis.  The proposed solutions are formalized and implementable;  most of the proposals have been tested in the implementation of CPSG95.

In what follows, Section 1.1 reviews some important developments in the field of Chinese NLP (Natural Language Processing).  This serves as the background for this study.  Section 1.2 presents a series of long-standing problems related to the Chinese morpho-syntactic interface.  These problems are the focus of this thesis.  Section 1.3 introduces CPSG95 and sketches its morpho-syntactic interface by illustrating an example of the proposed morpho-syntactic analysis.

1.1. Background

This section presents the background for the work on the interface between morphology and syntax in CPSG95.  Major developments in Chinese tokenization and parsing, the two areas related to this study, will be reviewed.

1.1.1. Principle of Maximum Tokenization and Critical Tokenization

This section reviews the influential Theory of Critical Tokenization (Guo 1997a) and its implications.  The point to be made is that the results of Guo’s study can help us to select the tokenization scheme used in the lexical lookup phase in order to create the basis for morpho-syntactic parsing.

Guo (1997a,b,c) has conducted a comprehensive formal study on tokenization schemes in the framework of formal languages, including deterministic tokenization such as FT (Forward Maximum Tokenization) and BT (Backward Maximum Tokenization), and non-deterministic tokenization such as CT (Critical Tokenization), ST (Shortest Tokenization) and ET (Exhaustive Tokenization).  In particular, Guo has focused on the study of the rich family of tokenization strategies following the general Principle of Maximum Tokenization, or “PMT”.  Except for ET, all the tokenization schemes mentioned above are PMT-based.

In terms of lexical lookup, PMT can be understood as a heuristic by which a longer match overrides all shorter matches.  PMT has been widely adopted (e.g. Webster and Kit 1992; Guo 1997b) and is believed to be “the most powerful and commonly used disambiguation rule” (Chen and Liu 1992:104).
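The longest-match heuristic can be illustrated with a minimal sketch of the two deterministic schemes, FT and BT.  The function names, the `max_len` window and the single-character fallback for unmatched material are assumptions made for this illustration;  the toy lexicon contains only the words needed for the string 大学生活很有趣 (da xue sheng huo hen you qu).

```python
def forward_max_match(text, lexicon, max_len=4):
    """FT sketch: scan left to right, always taking the longest match."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to one character when no lexicon word matches.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, lexicon, max_len=4):
    """BT sketch: scan right to left, always taking the longest match."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                tokens.insert(0, piece)
                j -= size
                break
    return tokens


# Toy lexicon (hypothetical): da-xue, da-xue-sheng, sheng-huo, huo, hen, you-qu
LEXICON = {"大学", "大学生", "生活", "活", "很", "有趣"}
TEXT = "大学生活很有趣"

ft_result = forward_max_match(TEXT, LEXICON)
bt_result = backward_max_match(TEXT, LEXICON)
print(" | ".join(ft_result))
print(" | ".join(bt_result))
```

On this input the two scans disagree:  the forward scan commits to 大学生 (da-xue-sheng) and the backward scan to 大学 | 生活 (da-xue | sheng-huo).  Each returns exactly one answer, which is why such schemes cannot by themselves leave an ambiguity open for later analysis.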

Shortest Tokenization, or “ST”, first proposed by X. Wang (1989), is a non-deterministic tokenization scheme following the Principle of Maximum Tokenization.  A segmented token string is shortest if it contains the minimum number of vocabulary words possible – “short” in the sense of the shortest word string length.

Exhaustive Tokenization, or “ET”, does not follow PMT.  As its name suggests, the ET set is the universe of all possible segmentations consisting of all candidate vocabulary words.  The mathematical definition of ET is contained in Definition 4 for “the character string tokenization operation”  in Guo (1997a).
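The two non-deterministic schemes just described can be sketched together.  This is a naive enumeration written for illustration, not Guo's optimized algorithms;  the function names and the toy lexicon are assumptions, and single characters are listed as words so that every segmentation is available.

```python
def all_tokenizations(text, lexicon):
    """ET sketch: every way of cutting `text` into lexicon words."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            for rest in all_tokenizations(text[end:], lexicon):
                results.append([head] + rest)
    return results


def shortest_tokenizations(text, lexicon):
    """ST sketch: the tokenizations with the fewest words (may be several)."""
    candidates = all_tokenizations(text, lexicon)
    if not candidates:
        return []
    fewest = min(len(tokens) for tokens in candidates)
    return [tokens for tokens in candidates if len(tokens) == fewest]


# Toy lexicon (hypothetical) for the string da xue sheng huo.
LEXICON = {"大", "学", "生", "活", "大学", "学生", "大学生", "生活"}
TEXT = "大学生活"

et_set = all_tokenizations(TEXT, LEXICON)
st_set = shortest_tokenizations(TEXT, LEXICON)
for tokens in et_set:
    print("ET:", " | ".join(tokens))
for tokens in st_set:
    print("ST:", " | ".join(tokens))
```

ST keeps both two-word candidates, 大学 | 生活 and 大学生 | 活, and so defers the choice, while ET additionally keeps the segmentations into smaller units.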

The most important concept in Guo’s theory is Critical Tokenization, or “CT”.  Guo’s definition is based on the partially ordered set, or ‘poset’, theory in discrete mathematics (Kolman and Busby 1987).  Guo has found that different segmentations can be linked by the cover relationship to form a poset.   For example, abc|d and ab|cd both cover ab|c|d, but they do not cover each other.

Critical tokenization is defined as the set of minimal elements, i.e. tokenizations which are not covered by other tokenizations, in the tokenization poset.  Guo has given proof for a number of mathematical properties involving critical tokenization.  The major ones are listed below.

  • Every tokenization is a subtokenization of (i.e. covered by) a critical tokenization, but no critical tokenization has a true supertokenization;
  • The tokenization variations following the Principle of Maximum Tokenization proposed in the literature, such as FT, BT, FT+BT and ST, are all true sub-classes of CT.

Based on these properties, Guo concludes that CT is the precise mathematical description of the widely adopted Principle of Maximum Tokenization.
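The poset construction can be sketched under one stated assumption:  the cover relationship is read here as inclusion between sets of cut positions, which is consistent with the abc|d example above but is not quoted from Guo's formal definition.  The function names are hypothetical, and the candidate list is supplied by hand.

```python
def boundaries(tokens):
    """Return the set of internal cut positions of a tokenization."""
    cuts, position = set(), 0
    for token in tokens[:-1]:
        position += len(token)
        cuts.add(position)
    return frozenset(cuts)


def covers(x, y):
    """x covers y when every cut made in x is also made in y."""
    return boundaries(x) <= boundaries(y)


def critical_tokenizations(candidates):
    """Keep the tokenizations that no other candidate properly covers."""
    critical = []
    for x in candidates:
        cuts_x = boundaries(x)
        if not any(boundaries(y) < cuts_x for y in candidates):
            critical.append(x)
    return critical


# The abstract example from the text, over the string "abcd".
CANDIDATES = [
    ["abc", "d"],
    ["ab", "cd"],
    ["ab", "c", "d"],
    ["a", "b", "c", "d"],
]

critical = critical_tokenizations(CANDIDATES)
print(critical)
```

Under this reading abc|d and ab|cd both cover ab|c|d without covering each other, so both are retained as critical.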

Guo (1997c) further reports his experimental studies on relative merits of these tokenization schemes in terms of three quality indicators, namely, perplexity, precision and recall.  The perplexity of a tokenization scheme gives the expected number of tokenized strings generated for average ambiguous fragments.  The precision score is the percentage of correctly tokenized strings among all possible tokenized strings while the recall rate is the percentage of correctly tokenized strings generated by the system among all correctly tokenized strings.  The main results are:

  • Both FT and BT can achieve perfect unity perplexity but have the worst precision and recall;
  • ET achieves perfect recall but has the lowest precision and highest perplexity;
  • ST and CT are simple with good computational properties.  Between the two, ST has lower perplexity but CT has better recall.

Guo (1997c) concludes, “for applications with moderate performance requirement, ST is the choice;  otherwise, CT is the solution.”
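The three quality indicators can be sketched as follows.  This is one reading of the verbal definitions given above, not Guo's own formulas:  perplexity is taken as the mean number of tokenizations generated per ambiguous fragment, and precision and recall are computed over the pooled tokenizations.  The function name and the two sample fragments are invented for illustration.

```python
def evaluate(samples):
    """Score a tokenization scheme on a list of ambiguous fragments.

    `samples` is a list of (generated, gold) pairs.  Each member is a set
    of tokenizations, and each tokenization is a tuple of words.
    """
    generated_total = sum(len(generated) for generated, _ in samples)
    gold_total = sum(len(gold) for _, gold in samples)
    hits = sum(len(generated & gold) for generated, gold in samples)
    return {
        "perplexity": generated_total / len(samples),
        "precision": hits / generated_total,
        "recall": hits / gold_total,
    }


# Invented toy data: two fragments, one over-generated and one missed.
SAMPLES = [
    ({("大学", "生活"), ("大学生", "活")}, {("大学", "生活")}),
    ({("个人",)}, {("个", "人")}),
]

SCORES = evaluate(SAMPLES)
print(SCORES)
```

The sketch makes the tradeoff visible:  generating more candidates per fragment can only raise recall, but it raises perplexity and tends to lower precision.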

In addition to the above theoretical and experimental study, Guo (1997b) also develops a series of optimized algorithms for the implementation of these generation schemes.

The relevance and significance of Guo’s achievement to the research in this thesis lie in the following aspect.  The research on Chinese morpho-syntactic interface is conducted with the goal of  supporting Chinese morpho-syntactic parsing.  The input to a Chinese morpho-syntactic parser comes directly from the lexical lookup of the input string based on some non-deterministic tokenization scheme (W. Li 1997, 2000; Wu and Jiang 1998).  Guo’s research and algorithm development can help us to decide which tokenization schemes to use depending on the tradeoff between precision, recall and perplexity or the balance between reducing the search space and minimizing premature commitment.

1.1.2. Monotonicity Principle and Task-driven Segmentation

This section reviews recent developments in Chinese analysis systems involving the interface between morphology and syntax.  The research on the Chinese morpho-syntactic interface in this thesis echoes these new developments in the field of Chinese NLP.

In the last few years, projects have been proposed for implementing a Chinese analysis system which integrates word identification and parsing.  Both rule-based systems and statistical models have been attempted with good results.

Wu (1998) has addressed the drawbacks of the conventional practice on the development of Chinese word segmenters, in particular, the problem of premature commitment in handling segmentation ambiguity.  In his A Position Statement on Chinese Segmentation, Wu proposed a general principle:

Monotonicity Principle for segmentation:

A valid basic segmentation unit (segment or token) is a substring that no processing stage after the segmenter needs to decompose.

The rationale behind this principle is to prevent premature commitment and to avoid repetition of work between modules.   In fact, traditional word segmenters are modules independent of subsequent applications (e.g. parsing).  Due to the lack of means for accessing sufficient grammar knowledge, they suffer from premature commitment and repetition of work, hence violating this principle.

Wu’s proposal of the monotonicity principle is a challenge to the Principle of Maximum Tokenization.  These two principles are not always compatible.  Due to the existence of hidden ambiguity (see 1.2.1), the PMT-based segmenters by definition are susceptible to premature commitment leading to “too-long segments”.  If the target application is designed to solve the hidden ambiguity problem in the segments, “decomposition” of some segments is unavoidable.

In line with the Monotonicity Principle, Wu (1998) proposes an alternative approach which he claims “eliminates the danger of premature commitment”, namely task-driven segmentation.  Wu (1998) points out, “Task-driven segmentation is performed in tandem with the application (parsing, translating, named-entity labeling, etc.) rather than as a preprocessing stage.  To optimize accuracy, modern systems make use of integrated statistically-based scores to make simultaneous decisions about segmentation and parsing/translation.”  The HKUST parser, developed by Wu’s group, is such a statistical system employing the task-driven segmentation.

As for rule-based systems, similar practice of integrating word identification and parsing has also been explored.  W. Li (1997, 2000) proposed that the results of an ET-based lexical lookup directly feed the parser for the hanzi-based parsing.  More concretely, morphological rules are designed to build word internal structure for productive morphology and non-productive morphology is lexicalized via entry enumeration.[1]  This approach is the background for conducting the research on Chinese morpho-syntactic interface for CPSG95 in this dissertation.
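What it means for an ET-based lexical lookup to feed the parser directly can be sketched as follows.  The representation of each candidate word as a (start, end, word) edge is an assumption made for this illustration, in the style of a chart parser's input;  it is not a description of the data structures actually used in CPSG95, and the toy lexicon is hypothetical.

```python
def lexical_lookup(text, lexicon):
    """Return every lexicon word found in `text` as a (start, end, word) edge."""
    return [
        (start, end, text[start:end])
        for start in range(len(text))
        for end in range(start + 1, len(text) + 1)
        if text[start:end] in lexicon
    ]


# Toy lexicon (hypothetical) for the string da xue sheng huo.
LEXICON = {"大学", "大学生", "学生", "生活", "活"}

edges = lexical_lookup("大学生活", LEXICON)
for edge in edges:
    print(edge)
```

No segmentation decision is taken at this stage.  All candidate words, including overlapping ones, are handed on, and it is left to the grammar to determine which of them can take part in an analysis of the whole string.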

The Chinese parser on the platform of multilingual NLPWin developed by Microsoft Research also integrates word identification and parsing (Wu and Jiang 1998).  They also use a hand-coded grammar for word identification as well as for sentential parsing.  The unique part of this system is the use of a certain lexical constraint on ET in the lexical lookup phase.  This effectively reduces the parsing search space as well as the number of syntactic trees produced by the parser, with minimal sacrifice in the recall of tokenization.  This tokenization strategy provides a viable alternative to the PMT-based tokenization schemes like CT or ST in terms of the overall balance between precision, recall and perplexity.

The practice of simultaneous word identification and parsing in implementing a Chinese analysis system calls for the support of a grammar (or statistical model) which contains sufficient information from both morphology and syntax.  The research on Chinese morpho-syntactic interface in this dissertation aims at providing this support.

1.2. Morpho-syntactic Interface Problems

This section presents a series of outstanding problems in Chinese NLP which are related to the morpho-syntactic interface.  One major goal of this dissertation is to argue for the proposed analyses of the problems and to provide solutions to them based on the analyses.

Sun and Huang (1996) have reviewed numerous cases which challenge the existing word segmenters.  As many of these cases call for an exchange of information between morphology and syntax, an appropriate solution can hardly be reached within the module of a separate word segmenter.  Three major problems at issue are presented below.

1.2.1. Segmentation ambiguity

This section presents the long-standing problem in Chinese tokenization, i.e. the resolution of the segmentation ambiguity.  Within a separate word segmenter, resolving the segmentation ambiguity is a difficult, sometimes hopeless job.  However, the majority of ambiguity can be resolved when a grammar is available.

Segmentation ambiguity has been the focus of extensive study in Chinese NLP for the last decade (e.g. Chen and Liu 1992; Liang 1987;  Sproat, Shih, Gale and Chang 1996; Sun and Huang 1996; Guo 1997b).  There are two types of segmentation ambiguities (Liang 1987; Guo 1997b):  (i) overlapping ambiguity:  e.g. da-xue | sheng-huo vs. da-xue-sheng | huo as shown in (1-1) and (1-2);  and (ii) hidden ambiguity:  ge-ren vs. ge | ren, as shown in (1-3) and (1-4).

(1-1.) 大学生活很有趣
da-xue         | sheng-huo          | hen          | you-qu
university    | life                     | very          | interesting
The university life is very interesting.

(1-2.)  大学生活不下去了
da-xue-sheng                 | huo          | bu | xia-qu      | le
university student          | live           | not | down        | LEs
University students can no longer make a living.

(1-3.)  个人的力量
ge-ren         | de   | li-liang
individual   | DE  | power
the power of an individual

(1-4.) 三个人的力量
san    | ge   | ren    | de | li-liang
three  | CLA  | person | DE | power
the power of three persons

These examples show that the resolution of segmentation ambiguity requires a larger syntactic context and grammatical analysis.  Further arguments and evidence will be given in Chapter II (2.1) for the following conclusion:  both types of segmentation ambiguity are structural by nature and require sentential analysis for their resolution.  Without access to a grammar, a word segmenter faces an upper bound on the precision of word identification, no matter how sophisticated its tokenization algorithm is.  However, in an integrated system, word identification becomes a natural by-product of parsing (W. Li 1997, 2000;  Wu and Jiang 1998).  More precisely, the majority of ambiguity can be resolved automatically during morpho-syntactic parsing;  the remaining ambiguity can be made explicit in the form of multiple syntactic trees.[2]  But in order to make this happen, the parser requires reliable support from a grammar which contains both morphology and syntax.
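The two ambiguity types illustrated in (1-1) to (1-4) can be detected mechanically from a lexicon, although, as argued above, detection is not resolution.  The sketch below rests on simplifying assumptions:  hidden ambiguity is tested only for splits into two lexicon words, overlapping ambiguity is tested only for pairs of partially overlapping lexicon words, and the function names and toy lexicon are hypothetical.

```python
def hidden_ambiguities(lexicon):
    """Lexicon words that can also be split into two lexicon words."""
    return sorted(
        word for word in lexicon
        if any(word[:k] in lexicon and word[k:] in lexicon
               for k in range(1, len(word)))
    )


def overlapping_ambiguities(text, lexicon):
    """Substrings spanned by two lexicon words that partially overlap."""
    spans = [
        (i, j)
        for i in range(len(text))
        for j in range(i + 1, len(text) + 1)
        if text[i:j] in lexicon
    ]
    found = set()
    for i, j in spans:
        for k, m in spans:
            if i < k < j < m:           # the second word starts inside the first
                found.add(text[i:m])
    return sorted(found)


# Toy lexicon (hypothetical) covering examples (1-1) to (1-4).
LEXICON = {"大学", "大学生", "生活", "活", "个人", "个", "人", "三", "的", "力量"}

hidden = hidden_ambiguities(LEXICON)
overlapping = overlapping_ambiguities("大学生活很有趣", LEXICON)
print(hidden)
print(overlapping)
```

The sketch flags 个人 (ge-ren) as a hidden ambiguity string and 大学生活 (da xue sheng huo) as an overlapping one, but it cannot say which segmentation is right in a given sentence;  that decision needs the sentential context.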

1.2.2. Productive Word Formation

Non-listable words created via productive morphology pose another challenge (Sun and Huang 1996).  There are two major problems involved in this issue:  (i) problem in identifying lexicon-unlisted words;  (ii) problem of possible segmentation ambiguity.

One important method of productive word formation is derivation.  For example, the derived word 可读性 ke-du-xing (-able-read-ness: readability) is created via morphological rules, informally formulated below:

(1-5.) derivation rules

ke + X (transitive verb) –> ke-X (adjective, semantics: X-able)

Y (adjective or verb) + xing –> Y-xing (abstract noun, semantics: Y-ness)

Rules like the above have to be incorporated properly in order to correctly identify such non-listable words.  However, there has been little research in the literature on what formalism should be adopted for Chinese morphology  and how it should be interfaced to syntax.
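As a rough illustration of the two informal rules in (1-5), the sketch below encodes each rule as a function over a small word record. The `Word` class, the category labels and the gloss strings are assumptions made for this example; they are not the CPSG95 formalism, which encodes such structures in feature structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    form: str  # pinyin form, e.g. "du"
    cat: str   # "V_trans", "V", "A" or "N" (labels assumed for this sketch)
    sem: str   # informal semantic gloss


def ke_rule(x: Word) -> Word:
    """ke + X (transitive verb) -> ke-X (adjective, semantics: X-able)."""
    if x.cat != "V_trans":
        raise ValueError(f"ke- expects a transitive verb, got {x.cat}")
    return Word(f"ke-{x.form}", "A", f"{x.sem}-able")


def xing_rule(y: Word) -> Word:
    """Y (adjective or verb) + xing -> Y-xing (abstract noun, semantics: Y-ness)."""
    if y.cat not in {"A", "V", "V_trans"}:
        raise ValueError(f"-xing expects an adjective or verb, got {y.cat}")
    return Word(f"{y.form}-xing", "N", f"{y.sem}-ness")


du = Word("du", "V_trans", "read")
ke_du = ke_rule(du)            # adjective: readable
ke_du_xing = xing_rule(ke_du)  # abstract noun: readability
print(ke_du_xing)
```

The category checks are what keep the rules from applying to the wrong input, which is the risk the following paragraphs discuss.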

To complicate matters, ambiguity may also be involved in productive word formation.  When segmentation ambiguity is involved in word formation, there is always a danger of wrongly applying morphological rules.  For example, 吃头 chi-tou (worth eating) is a derived word (transitive verb + suffix tou);   however, it can also be segmented as two separate tokens chi (eat) | tou (CLA), as shown in (1-6) and (1-7) below.

(1-6.)  这道菜没有吃头
zhe    | dao           | cai            | mei-you    | chi-tou
this    | CLA          | dish         | not have   | worth-of-eating
This dish is not worth eating.

(1-7.) 他饿得能吃头牛
ta       | e               | de             | neng        | chi  | tou           | niu
he      | hungry     | DE3         | can           | eat  | CLA                   | ox
He is so hungry that he can eat an ox.

To resolve this segmentation ambiguity, as indicated before in 1.2.1, the structural analysis of the complete sentences is required.  An independent morphology system or a separate word segmenter cannot handle this problem without accessing syntactic knowledge.

1.2.3. Borderline Cases between Morphology and Syntax

It is widely acknowledged that there is a remarkable gray area between Chinese morphology and Chinese syntax (L. Li 1990; Sun and Huang 1996).  Two typical cases are described below.  The first is the phenomenon of Chinese separable verbs.  The second case involves interfacing derivation and syntax.

Chinese separable verbs are usually in the form of V+N and V+V or V+A.  These idiomatic combinations are long-standing problems at the interface between compounding and syntax in Chinese grammar (L. Wang 1955; Z. Lu 1957; Lü 1989; Lin 1983; Q. Li 1983; L. Li 1990; Shi 1992; Zhao and Zhang 1996).

The separable verb 洗澡 xi zao (wash‑bath: take a bath) is a typical example.  Many native speakers regard xi zao as one word (verb), but the two morphemes are separable.  In fact, xi+zao shares the syntactic behavior and the pattern variations with the syntactic transitive combination V+NP:  not only can aspect markers appear between xi and zao,  but this structure can be passivized and topicalized as well.  The following is an example of topicalization (of long distance dependency) for xi zao.

(1-8.)(a)       我认为他应该洗澡
wo     ren-wei        ta       ying-gai       xi zao.
I         think           he      should        wash-bath
I think that he should take a bath.

(b)      澡我认为他应该洗
zao    wo     ren-wei        ta       ying-gai       xi.
bath  I         think           he      should        wash
The bath I think that he should take.

Although xi zao behaves like a syntactic phrase, it is a vocabulary word in the lexicon due to its idiomatic nature.  As a result, almost all word segmenters output xi-zao in (1-8a) as one word while treating the two signs[3] in (1-8b) as two words.  Thus the relationship between the separated use of the idiom and the non-separated use is lost.
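The lost relationship can be pictured with a deliberately naive sketch. It assumes a hypothetical idiom table (`SEPARABLE`) and simple token matching; a real grammar would license the link through syntactic structure rather than co-occurrence, so this only shows what information should be shared by (1-8a) and (1-8b).

```python
from typing import Dict, List, Tuple

# Hypothetical idiom table: (verb morpheme, noun morpheme) -> gloss.
SEPARABLE: Dict[Tuple[str, str], str] = {("xi", "zao"): "take a bath"}


def find_separable(tokens: List[str]) -> List[Tuple[int, int, str]]:
    """Return (verb_index, noun_index, gloss) for each listed idiom whose
    two morphemes both occur in the sentence, adjacent or not."""
    hits = []
    for (verb, noun), gloss in SEPARABLE.items():
        for i, tok in enumerate(tokens):
            if tok != verb:
                continue
            for j, other in enumerate(tokens):
                if other == noun:
                    hits.append((i, j, gloss))
    return hits


plain = find_separable(["wo", "ren-wei", "ta", "ying-gai", "xi", "zao"])
topicalized = find_separable(["zao", "wo", "ren-wei", "ta", "ying-gai", "xi"])
print(plain, topicalized)
```

Both sentences yield the same idiom gloss, whereas a segmenter that lists xi-zao only as one contiguous word would recognize the idiom in the first sentence alone.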

The second case represents a considerable number of borderline cases often referred to as  ‘quasi-affixes’.  These are morphemes like 前 qian (former, ex-) in words like 前夫 qian-fu (ex-husband), 前领导 qian-[ling-dao] (former boss) and -盲 mang (person who has little knowledge of) in words like 计算机盲 [ji-suan-ji]-mang (computer layman), 法盲 fa-mang (person who has no knowledge of laws).

Structurally, ‘quasi-affixes’ do not differ from other affixes.  The major difference between ‘quasi-affixes’ and the few generally recognized (‘genuine’) affixes like the nominalizer 性 -xing (-ness) is that the former retain some ‘solid’ meaning while the latter are more functionalized.  Therefore, the key to this problem seems to lie in appropriately coordinating the semantic contribution of words derived with ‘quasi-affixes’ to the building of the semantics for the entire sentence.  This area has not received enough investigation in the field of Chinese NLP.  While many word segmenters include some type of derivational processing for a few typical affixes, few systems demonstrate where and how to handle these ‘quasi-affixes’.

1.3. CPSG95:  HPSG-style Chinese Grammar in ALE

To investigate the interaction between morphological and syntactic information, it is important to develop a Chinese grammar which incorporates morphology and syntax in the same formalism.  This section gives a brief presentation on the design and background of CPSG95 (including lexicon).

1.3.1. Background and Overview of CPSG95

Shieber (1986) distinguishes two types of grammar formalism:  (i) theory-oriented formalism;  (ii) tool-oriented formalism.  In general, a language-specific grammar turns to a theory-oriented formalism for its foundation and a tool-oriented formalism for its implementation.  The work on CPSG95 is developed in the spirit of the theory-oriented formalism Head-driven Phrase Structure Grammar (HPSG, proposed by Pollard and Sag 1987).  The tool-oriented formalism used to implement CPSG95 is the Attribute Logic Engine (ALE, developed by Carpenter and Penn 1994).

The unique feature of CPSG95 is its incorporation of Chinese morphology in the HPSG framework.[4]  Like other HPSG grammars, CPSG95 is a heavily lexicalized unification grammar.  It consists of two parts:  a minimized general grammar and an information-enriched lexicon.  The general grammar contains a small number of Phrase Structure (PS) rules, roughly corresponding to the HPSG schemata tuned to the Chinese language.[5]  The syntactic PS rules capture the subject-predicate structure, complement structure, modifier structure, conjunctive structure and long-distance dependency.  The morphological PS rules cover morphological structures for productive word formation.  In one version of CPSG95 (its source code is  shown in APPENDIX I), there are nine PS rules:  seven syntactic rules and two morphological rules.

In CPSG95, potential morphological structures and potential syntactic structures are both lexically encoded.  In syntax, a word can expect (subcat-for or mod in HPSG terms) another sign to form a phrase.   Likewise, in Chinese morphology, a morpheme can expect another sign to form a word.[6]
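The parallel between syntactic and morphological expectation can be sketched as follows. This is a minimal illustration, not the CPSG95 feature geometry: the `Sign` fields (`expects`, `result`, `level`) and the `combine` function are names invented for the example, and real expectation features carry far more information.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sign:
    form: str
    cat: str                       # category of this sign
    expects: Optional[str] = None  # category of the sign it expects, if any
    result: Optional[str] = None   # category of the resulting combination
    level: str = "word"            # "morpheme", "word" or "phrase"


def combine(head: Sign, other: Sign, level: str) -> Optional[Sign]:
    """Combine `head` with the sign it expects; return None if the expectation fails."""
    if head.expects is None or head.expects != other.cat:
        return None
    sep = "-" if level == "word" else " "
    return Sign(head.form + sep + other.form, head.result, level=level)


# Morphology: the prefix ke expects a verb to form a word (an adjective).
ke = Sign("ke", "AF", expects="V", result="A", level="morpheme")
du = Sign("du", "V")
ke_du = combine(ke, du, level="word")

# Syntax: a transitive verb expects an NP to form a phrase (a VP).
du_trans = Sign("du", "V", expects="NP", result="VP")
shu = Sign("shu", "NP", level="phrase")
du_shu = combine(du_trans, shu, level="phrase")
print(ke_du, du_shu)
```

The same `combine` step serves both levels; only the lexical entries differ, which mirrors the idea of encoding both kinds of structure in the lexicon.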

One important modification of HPSG in designing CPSG95 is to use an atomic approach with separate features for each complement to replace the list design of obliqueness hierarchy among complements.  The rationale and arguments for this modification are presented in Section 3.2.3 in Chapter III.

1.3.2. Illustration

The example shown in (1-9) demonstrates the morpho-syntactic analysis  in CPSG95.

(1-9.) 这本书的可读性
zhe    ben    shu    de      ke               du      xing
this    CLA   book  DE     AF:-able      read   AF:-ness
this book’s readability
(Note: CLA for classifier; DE for particle de; AF for affix.)

Figure 1 illustrates the tree structure built by the morphological PS rules and the syntactic PS rules in CPSG95.


Figure 1. Sample Tree Structure for CPSG95 Analysis

As shown, the tree embodies both morphological analysis (the sub-tree for ke-du-xing) and syntactic analysis (the NP structure).  The results of the morphological analysis (the category change from V to A and to N and the building of semantics, etc.) are readily accessible in building syntactic structures.

1.4. Organization of the Dissertation

The remainder of this dissertation is divided into six chapters.

Chapter II presents arguments for the need to involve syntactic analysis for a proper solution to the targeted morpho-syntactic problems.   This establishes the foundation on which CPSG95 is based.

Chapter III presents the design of CPSG95.  In particular, the expectation feature structures will be defined.  They are used to encode the lexical expectation of both morphological and syntactic structures.  This design provides the necessary means for formally defining the Chinese word and the interface of morphology, syntax and semantics.

Chapter IV defines the Chinese word.  This is generally recognized as a basic issue in discussing the Chinese morpho-syntactic interface.  The investigation leads to a formalization of wordhood and a coherent, system-internal definition of the division of labor between morphology and syntax.

Chapter V studies Chinese separable verbs.  It discusses the wordhood judgment for each type of separable verb based on its distribution.   The corresponding morphological or syntactic solutions are then presented.

Chapter VI investigates some outstanding problems of Chinese derivation and its interface with syntax.  It will be demonstrated that the general approach to Chinese derivation in CPSG95 works both for typical cases of derivation and the two special problems, namely ‘quasi-affix’ phenomena and zhe-affixation.

The last chapter, Chapter VII, concludes this dissertation.  In addition to a concise retrospective on what has been achieved, it gives an account of the limitations of the present research and of future research directions.

Finally, the three appendices give the source code of one version of the implemented CPSG95 and some tested results.[7]



[1] In line with the requirements by Chinese NLP, this thesis places emphasis on the analysis of productive morphology:  phenomena which are listable in the lexicon are not the major concern.  This is different from many previous works on Chinese morphology (e.g. Z. Lu 1957; Dai 1993) where the bulk of discussions is on unproductive morphemes (affixes or ‘bound stems’).

[2] Ambiguity which remains after sentential parsing may be resolved by using further semantic, discourse or pragmatic knowledge, or ‘filters’.

[3] In CPSG95 and other HPSG-style grammars, a ‘sign’ usually stands for the generalized notion of grammatical units such as morpheme, word, phrase, etc.

[4] Researchers have looked at the incorporation of morphology of other natural languages in the HPSG framework (e.g. Type-based Derivational Morphology by Riehemann 1998).  Arguments for the inclusion of morphological features in the definition of sign will be presented in detail in Chapter III.

[5] Note that ‘phrase structure’ in terms like Phrase Structure Grammar (PSG) or Phrase Structure rules (PS rules) does not necessarily refer to structures of (syntactic) phrases. It stands for surface-based constituency structure, in contrast to, say, dependency structure in Dependency Grammar.  In CPSG95, some productive morphological structures are also captured by PS rules.

[6] Note that in this dissertation, the term expect is used as a more generalized notion than the terms subcat-for (subcategorize for) and mod (modify).  ‘Expect’ is intended to be applied to morphology as well as to syntax.

[7]  There are differences in technical details between the grammar proposed in this dissertation and the implemented version.  This is because any implemented version was tested at a given time, while this thesis evolved over a long period.  It is the author’s belief that readers (including those who want to follow the CPSG practice) benefit most when a version that was actually tested is given as is.







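Fortunately, we have solved the collocation problem of the verb-object separable verb 洗澡 xi-zao (take a bath). Attributive modifiers (Mod) and adverbials (Adv) are both adjuncts, and both now attach to the same verb 洗澡; since some complements (Buyu) are adjuncts as well, everything is unified. If one insists on asking whether 痛快 tong-kuai (to one's heart's content) applies to 澡, to 洗, or to 洗澡: who cares? They all mean the same thing. English has a similar case: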
live a happy life
live (a life) happily


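It is also acceptable for the program to choose only one path in such cases, or to skip normalization altogether. When the semantics is grounded, it is enough for the system to be robustly adaptive: Adv:happily OR Mod:happy.

Engineers need not care, but the architect must give an account.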

do + Adjunct + core pred
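Considerable effort has already gone into normalizing these essentially identical expressions, as with 洗澡 in the earlier figure: whether Mod, Adv or Buyu, they are all adjuncts of largely the same nature:
adjunct 痛快 --> pred 洗澡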


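追悔莫及 (too late for regret) inherently has a slot for a human.
做出决定 (make a decision) also has a slot for a human.
Now the human (张三, Zhang San) is directly linked (S) to 做出决定 and indirectly linked to 追悔莫及 (via 做出决定). It is only one step away from linking the human (张三) directly to 追悔莫及, which needs a human filler.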


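的 (de) is a stepping stone. Once the syntactic tree has been built, marking it with an x as a token gesture is perhaps better than discarding it.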


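The key point is that the first sentence is one step away and the second is two steps away; more than two steps is almost impossible. In n-gram terms, these are no more than bigram or trigram semantic rules over the DAG, if one really wants to write them. As long as turning indirect links into direct links in the semantic middleware can be shown to benefit applications, this work is very tractable.
One node has a semantic slot, another fits that slot semantically, and the two are close at hand: what is the difficulty? Give me five minutes and I can connect both links, with no stopgap measures and no side effects. The reason this fine-grained semantic middleware work has not all been done, easy as it is, is that it is unclear how much benefit it would bring, although in theory it is beneficial.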




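I will run regression testing to check for side effects; if there are none, this trigram semantic slot-filling rule stays.

In this specific case, a linear 5-gram shrinks to a trigram over the graph.
In terms of combinatorial explosion, 5 and 3 are worlds apart.
Moreover, equally valid examples with distances longer than 5 can easily be constructed; this is the power of syntax.
More importantly, even if a linear system can afford 5-grams


It amounts to the same thing: 5-grams are inevitably sparse data and cannot support long-distance selection. A slot cannot be filled merely because one token needs a human and another token, four words away, happens to be human. In short, without structure this cannot be done.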
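The slot-filling rule discussed above can be sketched over a dependency graph. Everything specific here is an assumption for illustration: the edge format, the relation label `link` between the two predicates, and the small `NEEDS_HUMAN` and `HUMAN` sets stand in for whatever the actual parser and ontology provide.

```python
from typing import List, Tuple

Edge = Tuple[str, str, str]  # (head, relation, dependent)

# Hypothetical parse: 张三 is the subject (S) of 做出决定, and 追悔莫及 is
# connected to 做出决定 by some relation, labelled "link" for this sketch.
EDGES: List[Edge] = [("做出决定", "S", "张三"), ("追悔莫及", "link", "做出决定")]

NEEDS_HUMAN = {"追悔莫及", "做出决定"}  # predicates with a human slot
HUMAN = {"张三"}                        # tokens carrying the feature human


def fill_human_slots(edges: List[Edge]) -> List[Edge]:
    """Propose a direct S edge when a predicate with an unfilled human slot
    is one graph step away from a predicate whose subject is human."""
    subject_of = {head: dep for head, rel, dep in edges if rel == "S"}
    proposed = []
    for head, rel, dep in edges:
        if head in NEEDS_HUMAN and head not in subject_of and dep in subject_of:
            filler = subject_of[dep]
            if filler in HUMAN:
                proposed.append((head, "S", filler))
    return proposed


new_edges = fill_human_slots(EDGES)
print(new_edges)
```

The rule inspects three nodes connected by two edges, regardless of how many words separate them in the linear string, which is the contrast with a linear 5-gram drawn above.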






NLP University


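NLP University: Grand Opening


I have devoted myself to natural language processing (NLP) for 30 years, with the aim of unimpeded communication, free flow of information, unity of languages and harmony of the world. From 30 years of experience, I know that to reach this goal we must enlighten newcomers, popularize the science and work together to build the Tower of Babel; hence these writings to promote it. Processing has not yet succeeded; comrades must carry on.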

0. Latest AI/NLP Posts

AIGC 潮流扑面而来,是顺应还是(无谓)抵抗呢?
漫谈AI 模型生成图像
AI 正在不声不响渗透我们的生活
RPA 是任务执行器还是数字员工?
NLP 新纪元来临了吗?
推荐Chris Manning 论大模型,并附上相关讨论
[转载]转载:斯坦福Chris Manning: 大模型剑指通用人工智能?
[转载]编译 Gary Marcus 最新著述:《深度学习正在撞南墙》
关于NLP 落地以及冷启动的对话
《AI 随笔:从对张医生的综述抄袭指控谈起》 
《AI 随笔:观老教授Walid的神经网络批判有感》
《AI 理性主义的终结是不可能的吗》
《AI 赚钱真心难》

1. On NLP Architecture and Methodology





【立委科普:语言学算法是 deep NLP 绕不过去的坎儿】


《NLP White Paper: Overview of Our NLP Core Engine》

White Paper of NLP Engine


【新智元笔记:李白对话录 – RNN 与语言学算法】


《新智元笔记:对于 tractable tasks, 机器学习很难胜过专家》

《新智元笔记:【Google 年度顶级论文】有感》

《新智元笔记:NLP 系统的分层挑战》


《泥沙龙笔记:parsing 的休眠反悔机制》


【泥沙龙笔记:NLP hard 的歧义突破】


【新智元笔记:李白对话录 – 从“把手”谈起】


立委科普:NLP 中的一袋子词是什么






【科普小品:NLP 的锤子和斧头】



Comparison of Pros and Cons of Two NLP Approaches

why hybrid? on machine learning vs. hand-coded rules in NLP

Why Hybrid?

钩沉:Early arguments for a hybrid model for NLP and IE



《泥沙龙笔记:铿锵众人行,parsing 可以颠覆关键词吗?》







Chomsky’s Negative Impact

[转载]【白硕 – 穿越乔家大院寻找“毛毛虫”】


【科研笔记:NLP “毛毛虫” 笔记,从一维到二维】

【泥沙龙笔记:NLP 专门语言是规则系统的斧头】



【Church – 钟摆摆得太远(2):乔姆斯基论】

【NLP主流的反思:Church – 钟摆摆得太远(1):历史回顾】

【Church – 钟摆摆得太远(3):皮尔斯论】

【Church – 钟摆摆得太远(4):明斯基论】

【Church – 钟摆摆得太远(5):现状与结论】





Notes on Building and Using Lexical Semantic Knowledge Bases


Domain portability myth in natural language processing (NLP)


Church – 计算语言学课程的缺陷 (翻译节选)



NLP 围脖:成语从来不是问题

NLP 是一个力气活:再论成语不是问题


《科普随笔:keep ambiguity untouched》



没有语言学的 CL 走不远









《新智元:通用的机器人都是闹着玩的,有用的都是 domain 的》

SBIR Grants


2. On NLP Analysis (Parsing)

语义计算沙龙:Parsing 的数据结构和形式文法













语义计算沙龙:parsing 的鲁棒比精准更重要】


【做 parsing 还是要靠语言学家,机器学习不给力】







泥沙龙笔记:parsing 是引擎的核武器,再论NLP与搜索

泥沙龙笔记:从 sparse data 再论parsing乃是NLP应用的核武器







【立委科普:及物、不及物 与 动词 subcat 及句型】





Parsing nonsense with a sense of humor


Parent-child Principle in Dependency Grammar

乔氏 X 杠杠理论 以及各式树形图表达法


【没有语言结构可以解析语义么?浅论 LSA】






《泥沙龙笔记:NLP component technology 的市场问题》


Deep parsing:每日一析

Deep parsing 每日一析:内情曝光 vs 假货曝光

Deep parsing 每日一析 半垃圾进 半垃圾出

【一日一parsing: 屈居世界第零】


【deep parsing:植树为林自成景(20/n)】

【deep parsing:植树为林自成景(30/n)】


【deep parsing 吃文化:植树为林自成景(60/n)】

【deep parsing (70/n):离合词与定语从句的纠缠】

【deep parsing (80/n):植树成林自成景】

【deep parsing (90/n):“雨是好雨,但风不正经”】

【deep parsing (100/n):其实 NLP 也没那么容易气死】


3. On NLP Extraction

【立委科普:NLU 的螺旋式上升及其 open知识图谱的趋向】

【语义计算沙龙:知识图谱无需动用太多知识 负重而行】




《知识图谱的先行:从Julian Hill 说起》

《有了deep parsing,信息抽取就是个玩儿》


泥沙龙笔记: parsing vs. classification and IE

前知识图谱钩沉: 信息抽取引擎的架构

前知识图谱钩沉: 信息体理论



钩沉:SVO as General Events

Pre-Knowledge-Graph Profile Extraction Research via SBIR (1)

Pre-Knowledge-Graph Profile Extraction Research via SBIR (2)

Coarse-grained vs. fine-grained sentiment extraction



SBIR Grants






Automated survey based on social media








On Big Data NLP



【立委科普:所谓大数据(BIG DATA)】

【科研笔记:big data NLP, how big is big?】





2011 信息产业的两大关键词:社交媒体和云计算

《扫了 sentiment,NLP 一览众山小:从“良性肿瘤”说起》


5. On NLP Applications




【Bots 的愿景】

《新智元笔记:知识图谱和问答系统:how-question QA(2)》


【泥沙龙笔记:NLP 市场落地,主餐还是副食?】






新智元笔记:微软小冰,QA 和AI,历史与展望(4)

泥沙龙笔记:把酒话桑麻,聊聊 NLP 工业研发的掌故


泥沙龙笔记:parsing 是引擎的核武器,再论NLP与搜索



立委硕士论文【附录一:EChA 试验结果】


2011 信息产业的两大关键词:社交媒体和云计算

再说苹果爱疯的贴身小蜜 死日(Siri)


非常折服苹果的技术转化能力,但就自然语言技术本身来说 …

科研笔记:big data NLP, how big is big?





6. On Chinese NLP


【parsing 在希望的田野上】

语义计算沙龙:其实 NLP 也没那么容易气死

【deep parsing (70/n):离合词与定语从句的纠缠】

【立委科普:deep parsing 小讲座】




【新智元笔记:parsing 汉语涉及重叠的鸡零狗碎及其他】


【deep parsing:“对医闹和对大夫使用暴力者,应该依法严惩”












钩沉:博士阶段的汉语HPSG研究 2015-11-02



泥沙龙笔记:汉语就是一种“裸奔” 的语言



汉语依从文法 (维文钩沉)



应该立法禁止切词研究 :=)








社会媒体舆情自动分析:马英九 vs 陈水扁







《立委随笔: 语言学家是怎样炼成的》







《科普随笔:汉语自动断词 “一次性交500元”》

《科普随笔:“他走得风一样地快” 的详细语法结构分析》

【立委科普:自动分析 《偉大的中文》】









汉语依从文法 (维文钩沉)



7. On the Practice of NLP Social Media Public Opinion and Sentiment Mining


【语义计算沙龙:sentiment 中的讽刺和正话反说】














【『科学』预测:A-股 看好】











只认数据不认人:IRT 的鼓噪左右美国民情了么?















大数据淹没下的冰美人(之三): 喜欢的理由

大数据淹没下的冰美人(之四): 流言蜚语篇(慎入)

大数据淹没下的冰美人(之五): 星光灿烂谁为最?







舆情挖掘:九合一國民黨慘敗 馬英九時代行將結束?

社会媒体舆情自动分析:马英九 vs 陈水扁






全球社交媒体热议苹果推出 iPhone 6













Chinese First Lady in Social Media

Social media mining on credit industry in China

Sina Weibo IPO and its automatic real time monitoring

Social media mining: Teens and Issues










新鲜出炉:2012 热点话题五大盘点之五【小方vs韩2】

【凡事不决问 social:切糕是神马?】

Social media mining: 2013 vs. 2012


尝试揭秘百度的“哪里有小姐”: 小姐年年讲、月月讲、天天讲?


圣诞社媒印象: 简体世界狂欢,繁體世界分享

WordClouds: Season’s sentiments, pros & cons of Xmas

新鲜出炉:2012 热点话题五大盘点之一【吊丝】

新鲜出炉:2012 热点的社会媒体五大盘点之二【林书豪】

新鲜出炉:2012 热点话题五大盘点之三【舌尖上的中国】

新鲜出炉:2012 热点话题五大盘点之四【三星vs苹果】






8. On NLP Anecdotes


《朝华午拾 – 水牛风云》











10 周年入职纪念日有感

《立委随笔: 语言学家是怎样炼成的》


围脖:一个人对抗一个世界,理性主义大师 Lenat 教授

《泥沙龙笔记:再谈 cyc》

围脖:格语法创始人菲尔墨(Charles J. Fillmore)教授千古!

百度大脑从谷歌大脑挖来深度学习掌门人 Andrew Ng



NLP 历史上最大的媒体误导:成语难倒了电脑



[转载]欧阳锋:巧遇语言学新锐 - 乔姆斯基



【随记:湾区的年度 NLP BBQ 】







MT 杀手皮尔斯 (翻译节选)

ALPAC 黑皮书 1/9:前言



立委随笔:Chomsky meets Gates







【泥沙龙笔记:机器 parsing 洪爷,无论打油或打趣】


我要是退休了,就机器 parse 《离骚》玩儿




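Who would misread it, and why? Let us look at the linguistics behind it, and beyond.

A ditransitive verb has two slots. For a human, the default slot is the recipient: 老婆 (wife) is the recipient of 送 (give). This is the correct reading.
For those with less innocent minds, a human can also fill the patient slot: 老婆, just like a gift, becomes the patient of 送.
This is the ambiguity of 送. With the compound 送给 (give to) in the caption, the subcat changes slightly and the ambiguity disappears. Why is 送-个 unambiguous as well? Because 个 is indefinite, while the recipient role is usually definite.

一所 is an indefinite numeral-classifier phrase, serving as the recipient.
In Chinese, 一 (one) + classifier and the bare classifier are usually considered equivalent: both are indefinite, and the latter is derived from the former by omitting 一. However, the two are not fully equivalent.

(2) Now consider the linguistics inside the compound 送给.
Chinese words expressing ditransitive concepts can often combine further with 给 to form a compound verb with the same meaning; note, however, the subtle change in subcat before and after compounding: 送 vs. 送给 (likewise 寄给, 赠给, 赠送给, etc.).
Subcat patterns of 送:
(1) 送 + recipient NP + patient NP: 送她一本书 (give her a book)
(2) 把 + patient NP + 送 + recipient: 把一本书送她 (give a book to her)
(3) patient NP + 送 + recipient: 这本书送她了 (this book has been given to her)
(4) 送 + patient NP: 送个老婆 (give a wife)

Note (4) and (5): when the two patterns overlap and compete, ambiguity arises. Once 送 + 给 forms a compound verb, subcat patterns (1), (2), (3) and (5) remain unchanged, while (4) essentially drops out. "Essentially", because although 送给老婆 can only follow pattern 5, 送给个老婆 (slightly awkward, but still acceptable) apparently must still be read as pattern 4. Why is that?
This is the subtlety of language: pattern 4 should have dropped out, since 给 already determines that what follows is the recipient rather than the patient; but Chinese has another rule, narrow but strong, which says that an NP headed by a bare classifier can only be a patient, not a recipient or anything else. When these two rules (the recipient rule of pattern 5 and the bare-classifier patient rule) conflict, the latter wins, so 送给个老婆 has to take the patient reading of pattern 4. This is rules fighting rules; which one wins is also part of linguistics, and a computer implementation can model it with a priority mechanism.

The picture above also involves a common promotional pattern: 买 NP1 送 NP2 (buy NP1, get NP2 free)
买 iPhone 6 送耳机 (buy an iPhone 6, get earphones free)
买 Prius 送三年保修 (buy a Prius, get a three-year warranty free)
The existence of this pragmatic pattern strengthens the likelihood of NP2 being the patient, counterbalancing the default tendency for a human to be the recipient. This seems to touch on the boundary between pragmatics and syntax.

So much for the linguistics. Beyond linguistics, this misreading or ambiguity can also be viewed culturally:


Much as the venerable Wang Ruoshui (王若水) discussed the philosophy of the table, this piece is mainly about everyday linguistics. Philosophers see philosophy everywhere, and linguists view the world through linguistics. Everyone can speak a language, but the linguistics behind it is not something any layperson can explain. Language is like water and air: ordinary people take it for granted, and linguists reveal it. This is real life linguistics: trivial yet regular, vast as the sea yet still fathomable.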
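A priority mechanism of the kind mentioned above might look like the following sketch. The `Rule` class, the numeric priorities and the string test for a bare-classifier NP are all assumptions for illustration; only the two competing rules and the outcome (the bare-classifier rule wins) come from the discussion.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int                        # higher priority wins a conflict
    applies: Callable[[str, str], bool]  # (verb, object NP) -> rule fires?
    role: str                            # role assigned to the object NP


RULES: List[Rule] = [
    # Recipient rule of pattern 5: the NP after the compound 送给 is the recipient.
    Rule("recipient after 送给", 1, lambda verb, np: verb == "送给", "recipient"),
    # Narrow but strong rule: an NP headed by the bare classifier 个 is a patient.
    Rule("bare-classifier NP", 2, lambda verb, np: np.startswith("个"), "patient"),
]


def assign_role(verb: str, np: str) -> Optional[str]:
    """Return the role chosen by the highest-priority rule that fires, if any."""
    matching = [rule for rule in RULES if rule.applies(verb, np)]
    if not matching:
        return None  # no rule decides, so the ambiguity is left open
    return max(matching, key=lambda rule: rule.priority).role


print(assign_role("送给", "老婆"))    # only the recipient rule fires
print(assign_role("送给", "个老婆"))  # both fire; the higher priority wins
```

For plain 送 + 老婆 neither rule fires in this toy setup, so the function returns `None`, which matches the observation that this case stays ambiguous.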


《立委随笔: 语言学家是怎样炼成的》






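[Liwei's note] With this "On ..." series, most of what needs to be said about NLP has been said. Whatever is said later will mostly be repetition or detail. Some arguments can be made from different angles, and key points can be repeated, using redundancy to help ensure that the information is transmitted effectively and completely. As I have said before, I have three role models in this regard, all of them tireless in their exhortations: the first is Marx, as reflected especially in his brick-thick Das Kapital, unfinished after more than 30 years of work; the second is Chomsky, whose critique of American hegemonic foreign policy and of the American mass media he has repeated all his life without ever straying from its core; the third is my old friend Mr. Jingzi (镜子), who ranges over everything under the sun, as seen in 【镜子大全】, which I edited. All of them, out of kindness, tell the world what they take to be real insight (not the truth, of course, and inevitably biased at times). For me at least, I speak to the world without caring whether the world listens. An old man indulging for once in youthful abandon; let the flowers bloom and fall as they will.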

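【On NLP and Miscellany】    Column: Miscellany (English)

【On NLP Architecture and Design Philosophy】    Column: NLP Architecture

【On NLP Methodology and the Debate between the Two Approaches】    Column: NLP Methodology

【On Parsing】    Column: Parsing

【On Chinese NLP】    Column: Chinese Processing

【On Information Extraction】    Column: Information Extraction

【On Big Data Mining】    Column: Intelligence Mining

【On Knowledge Graphs】    Column: Knowledge Graphs

【On Public Opinion and Sentiment Mining】    Column: Public Opinion and Sentiment Mining

【On Question Answering Systems】    Column: Question Answering

【On Machine Translation】    Column: Machine Translation

【On NLP Applications】    Column: NLP Applications

【On Me and NLP】    Column: NLP Anecdotes

【On NLP Anecdotes】    Column: NLP Anecdotes

【On Artificial Intelligence】    Column: Miscellany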





《新智元笔记:知识图谱和问答系统:how-question QA(2)》


【Bots 的愿景】





新智元笔记:微软小冰,QA 和AI,历史与展望(4)

再说苹果爱疯的贴身小蜜 死日(Siri)


非常折服苹果的技术转化能力,但就自然语言技术本身来说 …


关于 NLP 以及杂谈



关于 parsing












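Although I hold the title of Chief Scientist at a small company, I hardly dare call myself a scientist: I left academia long ago and have never really made a living from science. That golden rice bowl is too heavy for me to hold. This is neither modesty nor self-deprecation, because in my mind scientists and technologists are hard to rank. As a front-line technologist, I do not feel inferior to first-rate scientists.

Leaving biology aside, let us talk about NLP. Reproducibility is fundamental to science; otherwise fortune tellers and shamans would be scientists too. For a well-defined task or a pure algorithm, when the community has an annotated test set, reproducibility should reasonably be required, although exactly how to verify it, and how far verification must go before it is accepted as valid, seems far from black and white.

My question is: for a more complex system, such as a deep parser or an MT system, especially in industry, is reproducibility achievable? And if a result cannot be reproduced, can it not be recognized? Leaving aside that non-reproducibility is a necessary condition for keeping a competitive edge, even if a company did not care about IP, it is hard to imagine a competitor reproducing its results, unless the entire source code and all original resources, including every lexicon, were handed over intact, with no reconfiguration and no parameter changes allowed. Otherwise, how could the results possibly be reproduced?

So the key is which bowl of rice you eat from. If you make your living in academia, you must pass this test. How to judge it is up to the community's peer reviewers.

If you make your living in industry, it is enough that your black box performs.

As a result, academia can only publish on "constituent elements", and building an integrated system is a thankless task. This is scientifically reasonable, but many academics are unwilling to stay in the ivory tower forever, sewing wedding dresses for others; they want to build practical systems too. The results of an integrated practical system can almost certainly not be reproduced by others, because there are too many variables and the process is too complex.

Not necessarily: Unix, back in the day, was a system. But results obtained under the same configuration should fall within a certain margin of error.

Put another way, never mind others: even I may not be able to reproduce my own results. If I started from scratch and built another parser, how much deviation would count as acceptable? Even with the same basic design and algorithms, and the belief that each build gets better, the deviation is hard to predict before the system is finished. It depends on factors such as the resources available at the new development site.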














《知识图谱的先行:从Julian Hill 说起》

《有了deep parsing,信息抽取就是个玩儿》


泥沙龙笔记: parsing vs. classification and IE

前知识图谱钩沉: 信息抽取引擎的架构 2015-11-01

前知识图谱钩沉: 信息体理论 2015-10-31

前知识图谱钩沉,信息抽取任务由浅至深的定义 2015-10-30


钩沉:SVO as General Events

Pre-Knowledge-Graph Profile Extraction Research via SBIR (1)

Pre-Knowledge-Graph Profile Extraction Research via SBIR (2)

Coarse-grained vs. fine-grained sentiment extraction



SBIR Grants


【关于 parsing】

关于 NLP 以及杂谈







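A Socio-Computational Linguistic Analysis of 快叫爸爸小视频

An expression like 快叫爸爸小视频 has a sociolinguistic flavor and shifts with the times and trends. Before WeChat Moments and its short-video feature became popular, 小视频 (short video) was not a term, not a compound, and had no extended verbal use. It was simply an attributive-head NP, and in the sentence pattern it amounts to saying 把爸爸叫做小视频 (call Dad "short video"), even though common sense says that a person (Dad) cannot be equated with a thing (a video). In the obligatory subcat structure of the language (叫 NP1 NP2), common sense has no place. Syntax need not respect common sense, just as 鸡把我吃了 (the chicken ate me) violates common sense, and just as in Chomsky's famous "green ideas" sentence.
But then sociolinguistics enters. Language is set against a shifting social background: 小视频 became a technical term, and then passed from terminology into verbal use in the speech community, just as 谷歌 (Google) went from a term (proper name) to a verb: 我还是先谷歌一下再回应吧 (I had better google it before replying), 快小视频呀 (quick, take a short video), 一定要小视频这个精彩时刻 (we must short-video this wonderful moment).
一下 (a bit) coerces 谷歌 into a verb. One half of the bracket is already there; the other half must exist even if it is absent.
So the subcats begin to compete. With competition comes structural ambiguity, and with it a reason for common sense to step in. The reading that agrees with common sense then overturns the first syntactic reading.
你是我的小苹果 (you are my little apple) is obligatory syntax. However one interprets the apple (to this day I do not understand why a lover or sweetheart is called a little apple; is the classy apple a metaphor for preciousness?), it has nothing to do with common sense: 你是我的 x (you are my x) is an equivalence relation imposed by syntax.
Coercions like 一下 forcing 谷歌 into a verb look temporary, but once they gradually become the norm in the speech community, they enter the lexicon. In other words, earlier lexicons had no "potential verb" marking for 谷歌 (a lexical candidate POS feature) and did not need one, because almost all of its verbal uses were sporadic, coerced by syntax, and needed no lexical support. As the language developed, however, the verbal use of 谷歌 became a commonplace expression in the speech community (its popularity as a verb seems concise, fashionable, even playful). At that point the usage is reflected in the community's collective vocabulary, and when we model the community's linguistic competence we begin to mark the word as a possible verb.
One may ask: what is the difference between marking it in the lexicon (which reflects the community's collective awareness that the usage is popular) and not marking it?
There is certainly a difference. If it is marked, the verbal use takes part in the normal parsing competition as a legitimate path. If it is not marked, a temporary verbal use cannot be ruled out, but without bottom-level lexical support the verbal path is illegal by default, unless the syntactic (including morphological) context forces the word to be a verb. This is the so-called bandit syntax of 一哈 (一下): not only is the lexicon a paradise for hijacking; syntax can hijack too.
For comparison, let us revisit Mr. Bai's remarks: every word a gem, and delightful too. 冰冻三尺 (three feet of ice) is precisely sociolinguistics.

When we study linguistics and model syntax, we mostly target the present, treating language as a static cross-section to study and model. This is not wrong, and it simplifies the problem. But language flows, and this fluidity is exactly what sociolinguistics emphasizes. The flow is naturally reflected in big data. A static language model therefore needs continual updating; if big data is available, check the model against it regularly.
Chen Yuan (陈原) was a master. His writings on sociolinguistics are fascinating. At an Esperanto event I had the privilege of hearing Mr. Chen Yuan speak in Esperanto: his brilliance, charisma and individuality were awe-inspiring. Linguistics was his sideline; his profession was publishing. He is said to have been China's most authoritative publisher, and also a left-wing social activist.
That speech by Chen Yuan and the one by Huang Hua (黄华), for which I served as interpreter, shared one feature: both were expressive and infectious, and one could feel the speaker's personality. Both were masters.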
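The difference between lexical marking and syntactic coercion can be sketched in a few lines. The lexicon entries, the `COERCERS` set and the function name are hypothetical; the sketch only mirrors the two routes described above by which a verb reading becomes available.

```python
from typing import Dict, List, Set

# Hypothetical candidate-POS lexicons before and after the verbal use of 谷歌
# has become established in the speech community.
OLD_LEXICON: Dict[str, Set[str]] = {"谷歌": {"N"}}
NEW_LEXICON: Dict[str, Set[str]] = {"谷歌": {"N", "V"}}

# Words that coerce the immediately preceding token into a verb (assumed list).
COERCERS = {"一下"}


def verb_reading_allowed(tokens: List[str], i: int, lexicon: Dict[str, Set[str]]) -> bool:
    """True if token i may enter parsing as a verb."""
    if "V" in lexicon.get(tokens[i], set()):
        return True  # lexically supported: competes as a normal path
    # Otherwise the verb path is off by default, unless the context forces it.
    return i + 1 < len(tokens) and tokens[i + 1] in COERCERS


coerced = verb_reading_allowed(["先", "谷歌", "一下"], 1, OLD_LEXICON)
unmarked = verb_reading_allowed(["我", "谷歌", "它"], 1, OLD_LEXICON)
marked = verb_reading_allowed(["我", "谷歌", "它"], 1, NEW_LEXICON)
print(coerced, unmarked, marked)
```

With the old lexicon the verb reading appears only under coercion; once the entry is marked, the same reading competes everywhere, which is the difference the paragraph above draws.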






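Liwei's NLP Channel: Grand Opening


Its predecessor is the set of NLP popular-science posts on 【立委科学网博客】, the so-called NLP University. I will gradually move the original NLP posts here, and new posts will be published here in parallel. Non-NLP posts will remain based at ScienceNet (科学网).


I have lost count of how many times I typed NLP (natural language processing) into the computer and got 你老婆 (ni lao po, "your wife") instead. No wonder NLP has stayed with me all my life, or I with NLP. Never apart.

Opening words: I have devoted myself to natural language processing for 30 years, with the aim of unimpeded communication, free flow of information, unity of languages and harmony of the world. From 30 years of experience, I know that to reach this goal we must enlighten newcomers, popularize the science and work together to build the Tower of Babel; hence these writings to promote it. Processing has not yet succeeded; comrades must carry on.


Chapter 1: Architecture and methodology. The key post is 【NLP 联络图 】 (the NLP roadmap). Besides the roadmap of architecture and terminology, it also discusses methodology and the struggle between its two approaches.

Chapter 2: Parsing, covering all aspects of shallow parsing and deep parsing. The point to stress is that deep parsing is the nuclear weapon of NLP. Once the unstructured text of natural language has been accurately analyzed into structures, language becomes traceable through a finite set of patterns, and many hard problems in NLP applications are readily solved.

Chapter 3: Extraction, which moves into the pragmatics of NLP. Although the vast majority of extraction work in academia uses no parsing, or only stemming, or at most shallow parsing, the focus here is on extraction built on deep parsing. It can be seen as a fully automatic, ultimate solution for knowledge graphs.

Chapter 4: Mining. Extraction and mining are often confused, but the general consensus is that they sit at different levels: extraction targets individuals, tree by tree, whereas mining targets the forest, that is, a corpus or text data source. In the age of big data, text mining is regarded as the nuclear weapon for digging the gold mine and may lead the next decade, but within the NLP framework it comes after parsing and extraction and is the statistical aggregation of extraction results. The real nuclear weapon is deep parsing: with it, extraction can enter a new domain quickly, coping with all changes by staying the same, and extraction quality improves substantially. This lays a solid foundation for the eventual big data mining.

Chapter 5: Other NLP applications. Text mining is the flagship application of NLP and can be used in many products and domains; other applications include machine translation (MT), question answering (QA), and intelligent search such as SVO search (structural search beyond keywords). Language generation (needed by chatbots) and automatic summarization also belong here. Coverage of these areas is not yet complete, and there are some applications I have not yet had the chance to work on.

Chapter 6: Chinese NLP. The author and the readers are Chinese, the blog is written in Chinese, and Chinese processing has its own special challenges, so it gets a chapter of its own. More importantly, for many years Chinese NLP was considered far behind NLP for European languages. The material here examines the characteristics and difficulties of Chinese in depth and presents new progress in Chinese NLP. The conclusion is that Chinese processing does have its challenges, but its level is not far behind. Like NLP for English or other European languages, the most advanced Chinese NLP systems have entered the era of large-scale big data applications.

Chapter 7: Public opinion and sentiment mining in practice. This is also mining; it gets its own chapter because it is my current R&D focus, and because it is the trickiest yet highly valuable application of NLP, whose examples can spark the imagination about big data mining. The chapter collects Chinese and foreign examples of such mining and several years of tracking hot topics; some are tongue-in-cheek, with a fair share done just for fun, including ranking male and female celebrities and even mining their gossip.

Public opinion and sentiment mining is much harder than fact mining. Although the two have much in common in architecture and methodology, the difference in difficulty feels like night and day, because subjective language is one of the harder aspects of human language. Strictly speaking, sentiment analysis belongs to extraction, and sentiment extraction would be the more accurate term, but everyone is used to saying sentiment analysis; opinion mining (or mining of public opinions and sentiments) is what belongs to mining. The work most often reported in academia is actually sentiment classification, but classification only scratches the surface of sentiment analysis. 舆情 consists of 舆 and 情: 舆 is public opinion and 情 is public sentiment. Later, to unify everything under the familiar umbrella of sentiment, we restricted 情 to the expression of emotion. But the expression of emotion is only emotion mining, which maps well to classification: whether into two classes (positive, negative), three (positive, negative, neutral), four (joy, anger, sorrow, happiness) or n, it is still classification. Deep sentiment analysis cannot stop at emotion classification; it must find what lies behind. This is why we stress mining the reasons behind emotions: people cannot keep offering only emotions (like or dislike) and conclusions (adopt or not) without giving reasons. The former is mere venting; the latter is the concrete intelligence meant to inform, persuade or influence people, and it can support decision making. Mining has two main purposes. The first is to aggregate this intelligence into an overview, whether as charts or as visualizations such as word clouds. The second is to let users drill down from this intelligence in any direction and follow the trail. Often we show only the former, but the real value lies in the latter (a system demo can show its power, whereas a blog post can hardly convey its dynamics). The latter truly shows the system's power; the former is just a static report. Deep sentiment analysis is the toughest nut to crack among NLP applications.

Chapter 8, the last chapter: NLP anecdotes. These are all stories, some from personal experience and some seen or heard.

I hope this NLP University offers content and perspectives not found in NLP classrooms and textbooks. Several hundred posts have accumulated; they are sorted into broad categories, and I have tried to include cross-links in each post.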


科学网【NLP University