Chart Parsing Chinese Character Strings

W. Li. 1997. Chart Parsing Chinese Character Strings. In
Proceedings of the Ninth North American Conference on Chinese
Linguistics (NACCL-9). Victoria, Canada.

Chart Parsing Chinese Character Strings [1]

 

Wei  LI

Simon Fraser University
Burnaby B.C. V5A 1S6 CANADA

 

ABSTRACT

This paper examines problems in word identification for a Chinese natural language processing system and presents our solution to these problems. In conventional systems, parsing written Chinese takes two steps: (1) a segmentation preprocessor (segmenter) identifies the words; (2) a grammar parses the string of identified words. Morphological analysis, when required, as in the case of productive word formation, has to be incorporated in the segmenter. This matches the conventional morphology-before-syntax architecture. We will demonstrate the theoretical defect of this architecture when applied to Chinese. This leads to the conclusion that the segmentation approach, despite being the mainstream in Chinese computational morphology, is in general not adequate for the task of Chinese word identification. Solving this problem requires that a full grammar be made available. We therefore take an alternative one-step approach: we have implemented an integrated grammar of morphology and syntax that directly parses a string of Chinese characters, building both morphological and syntactic structures. Compared with the conventional two-step approach, our strategy has advantages in resolving ambiguity in word identification and in handling productive word formation.

  1. Introduction

A written Chinese sentence is a string of characters with no blanks to mark word boundaries. In conventional systems, Chinese parsing takes two steps, as shown in Figure 1 below: (1) a segmentation preprocessor (called a segmenter) for word identification; (2) a word-based parsing grammar, building syntactic structures (Feng 1996; Chen & Liu 1992).

[Figure 1: the conventional two-step architecture (segmenter followed by a word-based parsing grammar)]

 

In contrast, we take an alternative one-step approach, as shown in Figure 2 below. We have implemented a grammar named W‑CPSG (for Wei's Chinese Phrase Structure Grammar). W‑CPSG integrates morphology and syntax for character based parsing, building both morphological and syntactic structures.

[Figure 2: the one-step architecture (character-based parsing with the integrated grammar W‑CPSG)]

In the two-step architecture, the purpose of the segmenter is to properly identify a string of words to feed syntax. This is not an easy task because segmentation ambiguity may be involved. For example, given the string of 4 Chinese characters 研究生命, the segmentation ambiguity is shown in (1.a) and (1.b) below.

(1.)  研究生命

(a)        研究生               | 命
graduate student         | life or destiny

(b)        研究    | 生命
study   | life

The resolution of the above ambiguity in the segmenter is a hopeless job because such ambiguity is syntactically conditioned. For sentences like 研究生命金贵 (life for graduate students is precious), (1.a) is the right identification. For the phrase 研究生命起源 (to study the origin of life), (1.b) is right. So far there are no segmenters which can handle this properly and guarantee right word segmentation (Feng 1996). In fact, there can never be such segmenters as long as a grammar is not brought in. This is a theoretical defect of all Chinese analysis systems in the conventional architecture. We have solved this problem in our morphology-syntax integrated W‑CPSG. Word identification in our design becomes a by-product of parsing instead of a pre-condition for parsing.

In the text below, Section 2 investigates problems with the conventional two-step approach. In Section 3, we present the W‑CPSG one-step approach and demonstrate how W‑CPSG parsing solves these problems. The following is a list of abbreviations used in this paper.

A (Adjective); AF (Affix); BM (Bound Morpheme);
CLA (Classifier); CLAP (Classifier Phrase);
DE (Chinese particle introducing a modifier of noun); DEP (DE Phrase);
DE3 (Chinese particle introducing a modifier of result or capability);
DET (Determiner); LE (Chinese perfective aspect marker);
N (Noun); NP (Noun Phrase); P (Preposition); PP (Prepositional Phrase);
S (Sentence); V (Verb); VP (Verb Phrase); Vt (Transitive Verb)

  2. Problems Challenging Segmenters

In general, there are two basic problems for segmenters, namely, segmentation ambiguity and productive word formation.

2.1. segmentation ambiguity

This sub-section studies the segmentation ambiguity for Chinese word identification. We indicate that this ambiguity is structural in nature. Therefore it should be captured by structural trees via parsing. We conclude that a parsing grammar is indispensable in the resolution of the segmentation ambiguity.

Behind all segmenters are procedure-based segmentation algorithms. Most proposals are modified versions of matching algorithms based on a large lexicon. Their underlying hypothesis is that a longer match overrides a shorter match, hence the name maximum match. Depending on the direction of the procedure, i.e. whether the segmentation proceeds from left (the beginning of a string) to right (the end of the string) or from right to left, there are two general types of maximum match: (1) the FMM (Forward Maximum Match) algorithm; (2) the BMM (Backward Maximum Match) algorithm (Feng 1996).

According to Liang (1987), segmenters have trouble with cases involving segmentation ambiguity. There are two types of segmentation ambiguity: the cross ambiguity (AB|C vs. A|BC) and the embedded ambiguity (AB vs. A|B).

To detect possible ambiguity, many researchers combine the FMM algorithm and the BMM algorithm. When the outputs of FMM and BMM differ, some ambiguity must be involved. The following table lists the cases associated with the combined FMM and BMM approach.[2]

[Table: cases associated with the combined FMM and BMM approach]
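The two maximum-match procedures and their combination can be sketched as follows. This is a minimal illustration in Python, not the implementation of any cited system; the toy lexicon and the function names `fmm` and `bmm` are assumptions introduced here, and unknown single characters are simply passed through as one-character words.

```python
# Toy lexicon (hypothetical, for illustration only).
LEXICON = {"研究", "研究生", "生命", "命", "起源", "金贵",
           "他", "会", "烤", "白薯", "烤白薯"}


def fmm(text, lexicon=LEXICON):
    """Forward Maximum Match: scan left to right, always taking the longest match."""
    max_len = max(len(w) for w in lexicon)
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is accepted even if it is not in the lexicon.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def bmm(text, lexicon=LEXICON):
    """Backward Maximum Match: scan right to left, always taking the longest match."""
    max_len = max(len(w) for w in lexicon)
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.insert(0, piece)
                j -= size
                break
    return words


# Cross ambiguity: the two outputs differ, so an ambiguity is detected.
print(fmm("研究生命起源"), bmm("研究生命起源"))
# Embedded ambiguity: the two outputs agree, so nothing is detected.
print(fmm("他会烤白薯"), bmm("他会烤白薯"))
```

With this lexicon the two procedures disagree on 研究生命起源 but agree on 他会烤白薯, although the agreed segmentation is not the right one for that sentence.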

The following 3 examples all contain a cross ambiguity sub-string 研究生命 with 2 segmentation possibilities: 研究生|命 and 研究|生命. Example (4.) is a genuinely ambiguous case. Genuinely ambiguous sentences cannot be disambiguated within the sentence boundary, rendering multiple readings.

(2.) case 1:      研究生命金贵。

(a)        研究生                | 命     | 金贵                  (FMM: correct)
graduate student         | life   | precious
Life for graduate students is precious.

(b) * 研究 | 生命    |起源                                   (BMM: incorrect)
study        | life     | precious

(3.) case 2:       研究生命起源。

(a) *     研究生              | 命     | 起源                       (FMM: incorrect)
graduate-student       | life   | origin

(b)        研究     | 生命    | 起源                                (BMM: correct)
study   | life     | origin
to study the origin of life

(4.) case 3:       研究生命不好。

(a)        研究生                   | 命                 | 不      | 好     (FMM: correct)
graduate student         | destiny        | not     | good
The destiny of graduate students is not good.

(b) 研究 | 生命   | 不      | 好                                      (BMM: correct)
study    | life     |  not    | good
It is not good to study life.

The following example is a complicated case of cross ambiguity, involving more than 2 ways of segmentation. Both the FMM segmentation 出现|在世|界 and the BMM segmentation 出|现在|世界 are wrong. A third segmentation 出现|在|世界 is right.

(5.)  case 4:      出现在世界东方。

(a) * 出现 | 在世          | 界   | 东方                       (FMM: incorrect)
appear     | be-alive   | BM   | east

(b) * 出  | 现在  | 世界    | 东方                               (BMM: incorrect)
out        | now   | world | east

(c)  出现  |     | 世界     | 东方                               (correct)
appear    | at    | world  | east
to appear in the east of the world

In the following examples (6.) through (8.), 烤白薯 involves embedded ambiguity. As separate words, the verb 烤 (bake) and the NP 白薯 (sweet potato) form a VP. As a whole, it is a compound noun 烤白薯 (baked sweet potato). In cases of embedded ambiguity, FMM and BMM always make the same segmentation, namely AB instead of A|B. It may be the only right choice, as seen in (6.). It may be wrong, as shown in (7.). It may only be half right, as in the case of genuine ambiguity shown in (8.).

(6.) case 5:       他吃烤白薯。

(a)        他       |       | 烤白薯                                 (FMM&BMM: correct)
he       | eat     | baked sweet potato
He eats baked sweet potatoes.

(b) *     他       | 吃    | 烤    | 白薯                        (incorrect)
he       | eat     | bake | sweet potato

(7.) case 6:       他会烤白薯。

(a) *     他       | 会    | 烤白薯                                 (FMM&BMM: incorrect)
he       | can    | baked sweet potato

(b)        他      | 会    | 烤    | 白薯                         (correct)
he      | can   | bake | sweet potato
He can bake sweet potatoes.

(8.) case 7:       他喜欢烤白薯。

(a)       他       | 喜欢 | 烤白薯                                  (FMM&BMM: correct)
he      | like  | baked sweet potato
He likes baked sweet potatoes.

(b)        他       | 喜欢   | 烤    | 白薯                       (correct)
he      | like     | bake | sweet potato
He likes baking sweet potatoes.

Comparing the above examples, we see that the FMM-BMM combined approach has severe limitations. First, it serves only to detect ambiguity (when the results of FMM and BMM do not match) and contributes nothing to resolving it. It cannot tell which segmentation is right (compare case 1 and case 2) or, worse still, whether both are right (case 3) or both are wrong (case 4). Second, even when the results of FMM and BMM do match, a right segmentation is by no means guaranteed (case 6). Third, detection is limited to the cross ambiguity. The embedded ambiguity is a blind area for this way of detection (case 6 and case 7), because the maximum match hypothesis underlying both the FMM and BMM algorithms directly contradicts the phenomenon of embedded ambiguity.

In the face of ambiguity, how do people judge which segmentation is right in the first place? It depends on whether we can understand the sentence or phrase under that segmentation. In computational linguistics, this is equivalent to whether the segmented string can be parsed by a grammar. The segmentation ambiguity is one type of structural ambiguity, not different in essence from typical structural ambiguity such as PP attachment ambiguity. In fact, the PP attachment problem in English syntax is a counterpart of the cross ambiguity, as shown below.

(9.)       Cross ambiguity in PP attachment: V NP PP

(a) [V NP] [PP]
(b) [V] [NP PP]

Therefore, like English PP attachment, Chinese word segmentation ambiguity should also be captured by a parsing grammar. A parser resolves the ambiguity if it can, or detects the ambiguity in the form of multiple parses when it cannot. As shall be demonstrated in Section 3, wrong segmentation will not lead to a parse. Right segmentation results in at least one successful parse. In any case, at least a parser (hence a grammar on which the parser is based) is required for proper word identification.

The important point is that ambiguity in word identification is a grammatical problem. Any attempt to solve this problem without a grammar is bound to be crippled. Since traditional segmentation algorithms are non-grammatical in nature, they are theoretically not equipped to handle such ambiguity. A sequential segmenter-before-grammar model asks the segmenter to do what it is not yet able to do. This is the theoretical defect of almost all existing segmentation approaches.

(10.)     Conclusion for 2.1.

The segmentation ambiguity in word identification is one type of structural ambiguity. In order to solve this problem, a parsing grammar is indispensable.

2.2. productive word formation

Unless morphological analysis is incorporated, lexicon match based segmenters will have trouble with new words produced by Chinese productive word formation, including reduplication, derivation and the formation of proper names. When the morphology component is incorporated in the segmenter, the two-step design becomes a variant of the conventional morphology-before-syntax architecture. But this architecture is not effective when the segmentation ambiguity is at issue.

In the following, we investigate reduplication, derivation and proper names one by one. In each case, we find that there is always a possible involvement of the segmentation ambiguity. This problem cannot be solved by a morphology component independent of syntax. We therefore propose a  grammar incorporating both morphology and syntax.

2.2.1. reduplication

Reduplication in Chinese serves various grammatical and/or lexical functions. Not all reduplications pose challenges to segmentation algorithms. Assuming that a word consists of 2 characters AB, reduplication of the type AB --> ABAB is no problem. What becomes a problem for word segmentation is reduplication of the type AB --> AABB or its variants like AB --> AAB. For example, a two-morpheme verb with a verb-object relation at the level of morphology is reduplicated in the following way.

(11.) Verb Reduplication: AB --> AAB  (for diminutive use)

分心 (get distracted) --> 分分心 (get distracted a bit)

让他分分心。

让       | 他     | 分分心
let       | he    | get distracted a bit
Let him relax a while.

It seems that reduplication is a simple process which can be handled by incorporating some procedure-based function calls in the segmentation algorithm. If a 3-character string, say 分分心, cannot be found in the lexicon, the reduplication procedure checks whether the first 2 characters are the same and, if so, deletes one of them and consults the lexicon again. But such an extension of the segmentation algorithm is powerless when segmentation ambiguity is involved. For example, it is wrong to regard 分分心 as a case of reduplication in the following sentence.

(12.)   这件事十分分心。

(a) *     这       | 件   | 事      | 十    | 分分心
this      | CLA  | thing  | ten    | get distracted a bit

(b)        这       | 件    | 事      | 十分    | 分心
this      | CLA  | thing  | very   | distracting
This thing is very distracting.
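The reduplication procedure described above, and its weakness, can be sketched in a few lines. This is only an illustration under stated assumptions: the toy lexicon and the function name `lookup_with_reduplication` are hypothetical and do not come from any cited segmenter.

```python
# Toy lexicon (hypothetical, for illustration only).
LEXICON = {"分心", "十分", "这", "件", "事", "让", "他"}


def lookup_with_reduplication(chunk, lexicon=LEXICON):
    """Return the listed base word for chunk, or None if no analysis is found.

    A chunk is accepted if it is in the lexicon, or if it has the form AAB
    and the string AB (one of the doubled characters deleted) is in the lexicon.
    """
    if chunk in lexicon:
        return chunk
    if len(chunk) == 3 and chunk[0] == chunk[1]:
        base = chunk[1:]  # delete one of the two identical characters
        if base in lexicon:
            return base
    return None


# Correct use: 分分心 in 让他分分心 is a reduplication of 分心.
print(lookup_with_reduplication("分分心"))
# Over-application: the last three characters of 这件事十分分心 look the same
# to the procedure, although here the first 分 belongs to 十分.
print(lookup_with_reduplication("这件事十分分心"[-3:]))
```

The procedure has no access to the surrounding sentence, so it returns the same analysis in both contexts.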

2.2.2. derivation

In Contemporary Mandarin, there have come to be a few morphemes functioning similarly to English affixes, e.g. 可 (-able) turns a transitive verb into an adjective.

(13.)     可 (-able) + Vt --> A

可 (-able) + 读 (Vt: read) -->   可读 (A:readable)

这本书非常可读。

这       | 本     | 书       | 非常   | 可读
this    | CLA  | book  | very  | readable
This book is very readable.

The suffix 性 works just like '-ness',  changing an adjective into an abstract noun.  The derived noun 可读性 (readability) in the following example, similar to its English counterpart, involves a process of double affixation.

(14.)     A + 性 (-ness)  --> N
可 (-able) + 读 (Vt: read) -->   可读 (A:readable)
可读 (A:readable) + 性 (-ness) --> 可读性 (N:readability)

这本书的可读性

这       | 本      | 书       | 的    | 可读性
this    | CLA  | book  | DE    | readability
this book's readability

The suffix 头 can change a transitive verb into an abstract noun, adding to it the meaning "worth-of".

(15.) Vt + 头 (AF:worth of) --> N

吃 (Vt:eat) + 头 (AF:worth of) --> 吃头 (N:worth of eating)

这道菜没有吃头

这       | 道     | 菜      | 没有             | 吃头
this    | CLA  | dish  | not-have    | worth-of-eating
This dish is not worth eating.
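The three derivation rules in (13.) through (15.) can be written down as a small recursive procedure. The sketch below is an illustration only: the two-entry lexicon and the function name `derive` are assumptions made here, and the rules are applied to an isolated string, with no access to sentence context.

```python
# Toy category lexicon (hypothetical): 读 (read) and 吃 (eat) are transitive verbs.
LEXICON = {"读": "Vt", "吃": "Vt"}


def derive(chars, lexicon=LEXICON):
    """Return the category of chars under the three derivation rules, or None.

    Rules: 可 + Vt --> A;  A + 性 --> N;  Vt + 头 --> N.
    """
    if chars in lexicon:
        return lexicon[chars]
    if chars.startswith("可") and derive(chars[1:], lexicon) == "Vt":
        return "A"
    if chars.endswith("性") and derive(chars[:-1], lexicon) == "A":
        return "N"
    if chars.endswith("头") and derive(chars[:-1], lexicon) == "Vt":
        return "N"
    return None


print(derive("可读"))    # adjective: readable
print(derive("可读性"))  # noun: readability (double affixation)
print(derive("吃头"))    # noun: worth of eating
```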

It is not difficult to incorporate these derivation rules in the segmenter for morphological analysis. But, as in the case of reduplication, there is always a danger of wrongly applying the rules because of the ambiguity that may be involved. For example, 吃头 is a sub-string with embedded ambiguity. It can be either a derived noun 'worth of eating' or two separate words, as seen in the following example.

(16.)  他饿得能吃头牛。

(a) *     他      | 饿             | 得    | 能   | 吃头                        | 牛
             he     | hungry    | DE3  | can  | worth-of-eating   | ox

(b)        他      | 饿              | 得   | 能   | 吃    | 头    | 牛
              he     | hungry    | DE3  | can  | eat    | CLA  | ox
He is so hungry that he can eat an ox.

2.2.3. proper name

Proper names are of 2 major types: (1) Chinese names; (2) transliterated foreign names. In this paper, we only target the identification of Chinese names and leave the problem of transliterated foreign names for further research (Li, 1997b).

A Chinese human name usually consists of a family name followed by a given name. Chinese family names form a clear-cut closed set. A given name is usually either one character or two characters. For example, the late Chinese chairman 毛泽东 (Mao Zedong) used to have another name 李得胜 (Li Desheng). In the lexicon, 李 is a registered family name. Both 得胜 and 胜 mean 'win'. This may lead to 3 ways of word segmentation: (1) 李得胜; (2) 李|得胜; (3) 李得|胜, as seen in the following examples.

(17.)    李得胜了

(a)  李    | 得胜 | 了
       Li    | win  | LE
Li won.

(b)   李得   | 胜   | 了
        Li De | win  | LE
Li De won.

(c) *  李得胜          | 了
          Li Desheng | LE

(18.)   李得胜胜了 。

(a) *  李 | 得胜 | 胜  | 了
         Li  | win | win | LE

(b) *  李得   | 胜   | 胜   | 了
          Li De | win  | win  | LE

(c)   李得胜            | 胜   | 了
Li Desheng   | win  | LE
Li Desheng won.

Since a given name like 得胜 is an arbitrary string of 1 or 2 characters, the morphological analysis of the full name should start with the family name, which can optionally combine with any 1 or 2 following characters to form the candidate proper names 李, 李得 and 李得胜. In other words, the family name serves as the left boundary of a full name, and length is used to determine the candidates. The right segmentation can only be made via sentence analysis, as shown in the above examples.
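Candidate generation for Chinese names, as just described, can be sketched as follows. The family-name set and the function name `name_candidates` are assumptions for illustration; the sketch handles only single-character family names and leaves the choice among candidates to later sentence analysis.

```python
# A tiny, hypothetical sample of the closed set of family names.
FAMILY_NAMES = {"李", "毛"}


def name_candidates(text, start, family_names=FAMILY_NAMES):
    """Return candidate full names starting at text[start].

    The family name is the left boundary; it may stand alone or combine
    with the next 1 or 2 characters, whatever those characters are.
    """
    if start >= len(text) or text[start] not in family_names:
        return []
    return [text[start:start + 1 + extra]
            for extra in range(3)
            if start + 1 + extra <= len(text)]


print(name_candidates("李得胜了", 0))
```

All three candidates are produced for both 李得胜了 and 李得胜胜了; nothing at this stage decides between them.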

Most Chinese place names are made of 1 to 3 characters, for example, 武汉市 (Wuhan City), 南陵县 (Nanling County). The arbitrariness of these names makes any sub-string of n characters (0<n<4) in the sentence a suspect. Fortunately, in most cases we can find boundary indicators for these names, like 省 (province), 市 (city), 县 (county), etc. Once the boundary indicator is located, a technique similar to the use of the Chinese family name to identify the given name can be applied to select candidate place names for verification through grammatical analysis.

In general, there is always a possibility of ambiguity involvement in the formation of all types of proper names.

(19.)     Conclusion for 2.2.

Due to the possible involvement of ambiguity, a parsing grammar for morphological analysis as well as for sentence analysis is required for the proper identification of the words produced by Chinese productive word formation.

  3. W‑CPSG Grammatical Approach

This section presents the W‑CPSG approach to Chinese word identification and morphological analysis. We will demonstrate how a parser based on W‑CPSG solves the problems of word identification ambiguity and productive word formation.

3.1. rationale of W‑CPSG approach

There have been a number of word identification algorithms based on both morphological and syntactic information (see the surveys in Feng 1996 and Sun & Huang 1996). Most such approaches do not use a self-contained grammar to parse the complete sentence. They are confined to the conventional two-step process of the segmentation-before-grammar design. As long as the word identification procedure is independent of a parsing grammar, it is extremely difficult to make full use of grammatical information to resolve ambiguity in word identification. Careful tuning and sophisticated design improve the precision but do not remove the theoretical defect of all such approaches. Chen & Liu acknowledge the limitation of their approach due to the lack of a grammar.  “However”, they say,  “it is almost impossible to apply real world knowledge nor to check the grammatical validity at this stage”. (Chen & Liu 1992, p.105) Why impossible at this stage? Because these segmentation systems are based on the two-step architecture, and the grammar is not yet available! As we have demonstrated, the final judgment on proper word identification can hardly be made until the whole sentence is parsed, hence the requirement of a full grammar.

Under the two-step design, one is therefore forced to compromise, deciding how much grammatical information to involve according to how much word identification precision one can afford to sacrifice. Needless to say, there is significant double labor between such a word segmentation procedure and the following stage of parsing. As more and more grammatical information is used to achieve better precision, the overhead of this double labor becomes more serious. We consider the double labor a strong argument against the two-step approach. If enough grammatical information is incorporated, it is essentially equivalent to a grammar, and the segmenter becomes equivalent to a parser. Then why two grammars, one for word identification and one for sentence parsing? Why not combine them? That is exactly what we propose in W‑CPSG: a one-step approach based on an integrated grammar, eliminating the need for a segmentation preprocessor.

3.2. W‑CPSG character-based parsing

W‑CPSG (Li 1997a, 1997b) is a lexicalized Chinese unification grammar, developed in the spirit of Head-driven Phrase Structure Grammar (Pollard & Sag 1994). W‑CPSG consists of two parts: a minimized general grammar and an enriched lexicon. The general grammar contains only a handful of PS (phrase structure) rules, covering complement structure, modifier structure, conjunctive structure and morphological structure. This is the nature of lexicalized grammars: PS rules in such grammars are very abstract. Essentially, all they say is that 2 signs can combine so long as the lexicon so indicates. The lexicon houses lexical entries with their linguistic description in feature structures. Potential morphological structures as well as potential syntactic structures are lexically encoded. In syntax, a word expects another sign to form a phrase. In morphology, a morpheme expects another sign to form a word. For example, the prefix 可 (-able) expects a transitive verb to form an adjective. The morphological PS rule builds the morphological structure when a transitive verb does appear after the prefix 可 (-able) in the input string.

We now illustrate how W‑CPSG parses a string of Chinese characters with a sample parsing chart. The prototype of W‑CPSG was written in ALE, a grammar compiler developed on top of Prolog by Carpenter & Penn (1994). ALE compiles W‑CPSG into a Chinese parser, a Prolog program ready to accept a string of characters for analysis. A W‑CPSG parse tree embodies both morphological analysis and syntactic analysis, as shown below.

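[Figure (20.): sample W‑CPSG parsing chart, with lexical edges 1 through 7 and the phrasal edges built from them]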

 

This is so-called bottom-up parsing. It starts with lexicon look-up. Edges 1 through 7 are lexical edges. Other edges are phrasal edges. Each edge represents a sign, i.e. a character (morpheme), a word, a phrase or a sentence. Lexical edges result from a successful match between the signs in the input string and the entries in the lexicon during lexicon look-up. For example, 可 (-able), 读 (read) and 性 (-ness) are all registered entries in the lexicon, so they get matched and shown by edge 5, edge 6 and edge 7. Words produced by productive word formation present themselves as phrasal edges, e.g. edge ((5+6)+7) for 可读性 (readability). For the sake of concise illustration, we only show two pieces of information for the signs in the chart, namely category and interpretation with a delimiting colon (lexical edges are only labeled for either category or interpretation). The parser attempts to combine the signs according to PS rules in the grammar until parses are found. A parse is an edge which ranges over the whole string. The parse ((((1+2)+3)+4)+((5+6)+7)) for (20.) represents a binary structural tree based on the W‑CPSG analysis, as shown below.

[Figure: binary structural tree for the parse ((((1+2)+3)+4)+((5+6)+7))]

3.3. ambiguity resolution in word identification

Given the resources of a phrase structure grammar like W‑CPSG, a parser based on standard chart parsing algorithms can handle both the cross ambiguity and the embedded ambiguity, provided that lexicon lookup uses exhaustive lookup instead of maximum match. All candidate words in the input string are presented to the parser for judgment. Ambiguous segmentation becomes a natural part of parsing: different ways of segmentation add different edges, and a successful parse always embodies a right identification. In other words, word identification in our design becomes a by-product of parsing instead of a pre-condition for parsing. The following example of the complicated cross ambiguity illustrates how the W‑CPSG parser resolves ambiguity. As seen, both the FMM segmentation (represented by the edge sequence 8-9-5-10) and the BMM segmentation (represented by 1-11-12-10) are in the chart as a result of exhaustive lexicon lookup. They are proved wrong because they do not lead to a successful parse according to the grammar. As a by-product, the final parse (8+(3+(12+10))) automatically embodies the rightly identified word sequence 8-3-12-10, i.e. 出现  (appear) |在  (at) |世界 (world) |东方 (east).

[Figure: W‑CPSG parsing chart for the cross ambiguity example 出现在世界东方]
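The idea that word identification falls out of parsing can be illustrated with a much simpler device than W‑CPSG. The sketch below is a plain context-free, CKY-style chart parser over characters, not the unification grammar compiled by ALE; the toy lexicon, the three binary rules, the goal categories and the function names `parse` and `leaves` are all assumptions made for this illustration.

```python
from collections import defaultdict

# Toy lexicon and grammar (hypothetical, for illustration only).
LEXICON = {"研究": ["V"], "研究生": ["N"], "生命": ["N"],
           "命": ["N"], "起源": ["N"], "金贵": ["A"]}
RULES = {("N", "N"): "N", ("V", "N"): "VP", ("N", "A"): "S"}
GOALS = {"S", "VP"}  # categories accepted as a parse of the whole string


def parse(text):
    """Return the trees of all goal edges that span the whole character string."""
    n = len(text)
    chart = defaultdict(list)  # (start, end) -> list of (category, tree)
    # Exhaustive lookup: every sub-string found in the lexicon is a lexical edge.
    for i in range(n):
        for j in range(i + 1, n + 1):
            for cat in LEXICON.get(text[i:j], []):
                chart[i, j].append((cat, text[i:j]))
    # Bottom-up combination of adjacent edges, shorter spans first.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for lcat, ltree in chart[i, k]:
                    for rcat, rtree in chart[k, j]:
                        parent = RULES.get((lcat, rcat))
                        if parent:
                            chart[i, j].append((parent, (ltree, rtree)))
    return [tree for cat, tree in chart[0, n] if cat in GOALS]


def leaves(tree):
    """Read the identified word sequence off a parse tree."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


for sentence in ("研究生命起源", "研究生命金贵"):
    print(sentence, [leaves(t) for t in parse(sentence)])
```

Both segmentations of 研究生命 enter the chart as lexical edges in each case; under these toy rules only one of them takes part in a spanning S or VP edge, so the word sequence is read off the parse rather than fixed beforehand.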

 

Exhaustive lookup also makes an embedded ambiguity sub-string like 烤红薯 no longer a blind area for word identification, as shown in (22.) below. All the candidate words in the sub-string including 烤 (bake), 红薯 (sweet potato), 烤红薯 (baked sweet potato) are added to the chart as lexical edges (edge 4, edge 8 and edge 10). This is a case of genuine ambiguity, resulting in 2 parses corresponding to 2 readings. The first parse (1+(7+10)) identifies the word sequence 他|喜欢|烤红薯, and the second parse (1+(9+(4+8))) a different sequence 他|喜欢|烤|红薯. Edge 7 and edge 9 represent two lexical entries for the verb 喜欢 (like), with different syntactic expectation (categorization). One expects an NP object, notated in the chart by like<NP>, and the other expects a VP complement, notated by like<VP>.

[Figure (22.): W‑CPSG parsing chart for the embedded ambiguity example 他喜欢烤红薯]

 

We now illustrate how Chinese proper names are identified in W‑CPSG parsing. In the W‑CPSG lexicon, a Chinese family name is encoded to optionally expect a given name. Due to the arbitrariness of given names, no constraint other than length (either 1 character or 2 characters) is specified in the expectation. Therefore, we have three candidates for proper names in the following example, namely 李 (Li), 李得 (Li De) and 李得胜 (Li Desheng), represented respectively by edge 1, edge (1+2) and the NP edge (1+5).[3] The first two candidates contribute to two valid parses while the third does not, hence the identification of the word sequences 李|得胜|了 and 李得|胜|了.

[Figure: W‑CPSG parsing chart for 李得胜了]

 

Now we add one more character 胜 (win) to form a new sentence, as shown in (24.) below.

[Figure (24.): W‑CPSG parsing chart for 李得胜胜了]

 

The first two candidate proper names 李 (Li) and 李得 (Li De) no longer lead to parses. But the third candidate 李得胜 (Li Desheng) becomes part of the parse as a subject NP. The parse (((1+6)+4)+5) corresponds to the identification of the only valid word sequence 李得胜|胜|了.

Finally, we give an example to demonstrate how W‑CPSG handles reduplication in parsing and word identification. The sample sentence to be processed by the parser is 让他分分心 (Let him relax a while), involving the AB-->AAB type verb reduplication for diminutive use.

In most lexicons, 分心 (distract-heart: get distracted) is a registered 2-morpheme verb with an internal morphological verb-object relation. Therefore, the reduplication is considered morphological. But in Chinese syntax, we also have a general verb reduplication rule of the type A-->AA for diminutive use, for example, 看(look) --> 看看(have a look). The morphological verb reduplication rule AB-->AAB and the syntactic verb reduplication rule A-->AA are essentially the same rule in Chinese grammar.

分心 sits in the gray area between morphology and syntax. It looks both like a word (verb) and like a phrase (VP). Lexically, it corresponds to one generalized sense (concept) and the internal combination is idiomatic, i.e. 分 (distract) must combine with 心 (heart) to mean 'get distracted'. But, structurally, the combination of 分 and 心 is not fundamentally different from a VP consisting of Vt and NP, as in the phrase 看电影 (see a film). In fact, there is no clear-cut boundary between Chinese morphology and syntax. This isomorphism between morphology and syntax is a further argument for the W‑CPSG design of integrating morphology and syntax in one grammar module. Although the boundary between Chinese morphology and syntax is fuzzy, and basic notions like word and phrase therefore have no universal definition, the division can easily be defined system-internally in an integrated grammar.

In W‑CPSG, 分心 is treated as a phrase (VP) instead of a word (verb). The lexical entry 分 (distract) is coded to obligatorily expect the literal 心 (heart) as its syntactic object, shown in the following chart by the notation V<心>. This approach has the advantage of eliminating the duplication of the reduplication rule for diminutive use in both syntax and morphology, making the grammar more elegant. The verb reduplication rule is implemented as a lexical rule in W‑CPSG.[4] This lexical rule creates a reduplicated verb with an added diminutive sense, shown by edge 8 (a lexical edge). The whole parsing process is illustrated below.

[Figure: W‑CPSG parsing chart for 让他分分心]
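The effect of a reduplication lexical rule that expands the lexicon before parsing can be sketched as below. This is an illustration under stated assumptions, not the ALE lexical rule of W‑CPSG: the verb entries, the way the diminutive sense is recorded and the function name `expand_with_reduplication` are all hypothetical.

```python
# Toy verb entries (hypothetical): form -> gloss.
VERBS = {"看": "look", "分": "distract"}


def expand_with_reduplication(verbs):
    """Apply the rule A --> AA once to every single-character verb entry.

    Returns a new lexicon containing the original entries plus the
    reduplicated forms, each marked with a diminutive sense.
    """
    expanded = dict(verbs)
    for form, gloss in verbs.items():
        if len(form) == 1:
            expanded[form + form] = gloss + " (a bit)"
    return expanded


lexicon = expand_with_reduplication(VERBS)
print(lexicon["看看"])  # look (a bit)
print(lexicon["分分"])  # distract (a bit)
```

Because 分 is entered as a verb expecting 心 as its object, one rule of this kind covers both 看看 and the 分分 of 分分心, with no separate AB-->AAB rule.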

 

 

REFERENCES

Carpenter, B. & Penn, G. (1994): ALE, The Attribute Logic Engine, User's Guide, Carnegie Mellon University

Chen, K-J. & S-H. Liu (1992): "Word identification for Mandarin Chinese sentences", Proceedings of the 15th International Conference on Computational Linguistics, Nantes, 101-107.

Feng, Z-W. (1996): "COLIPS lecture series - Chinese natural language processing",  Communications of COLIPS, Vol.6, No.1 1996, Singapore

Li, W. (1997a): "Outline of an HPSG-style Chinese reversible grammar", Proceedings of The Northwest Linguistics Conference-97 (NWLC-97, forthcoming), UBC, Vancouver, Canada

Li, W. (1997b): W‑CPSG: A Lexicalized Chinese Unification Grammar And Its Application, Doctoral dissertation (on-going), Simon Fraser University, Canada

Liang, N. (1987): "Shumian Hanyu Zidong Fenci Xitong - CDWS" (Automatic word segmentation system for written Chinese - CDWS), Journal of Chinese Information Processing, No.2 1987, pp 44-52, Beijing

Pollard, C.  & I. Sag (1994): Head-Driven Phrase Structure Grammar,  Centre for the Study of Language and Information, Stanford University, CA

Sun, M-S. & C-N. Huang  (1996): "Word segmentation and part of speech tagging for unrestricted Chinese texts" (Tutorial Notes for International Conference on Chinese Computing ICCC'96), Singapore

~~~~~~~~~~~~~~~~~~~

[1] The author benefited from the insightful discussion with Dr. Dekang Lin on the feasibility of parsing Chinese character strings instead of word strings. Thanks also go to Paul McFetridge and Fred Popowich for their supervision and encouragement.

[2] This table is adapted from the following table in Sun & Huang (1996).

case 1    The output of FMM and BMM are different, but both are incorrect      0.054%
case 2    The output of FMM and BMM are different, but only one is correct     9.24%
case 3    The output of FMM and BMM are identical, but incorrect               0.41%
case 4    The output of FMM and BMM are identical, and correct                 90.30%

The 4 cases which they listed are not logically exhaustive in terms of sentence based processing (i.e. when discourse is not involved in a system). In particular, there is another case when the output of FMM and BMM are different, and both are correct. We call this a case of genuine cross ambiguity.

[3] Note that there is another S edge (1+5) in the chart. These two edges are structurally different, created via different PS rules. The NP edge (1+5) is formed through the morphological PS rule, combining the family name (edge 1) and its expected given name (edge 5). In the S edge (1+5), however, it is the subject rule (one of the complement PS rules) that decides the combination of the predicate (edge 5) and its expected subject NP (edge 1).

[4] Lexical rules are favored by many linguists to capture redundancy in the lexicon instead of the conventional approach of syntactic transformation. Lexical rules are applied at compile time to form an expanded lexicon before parsing starts.

 

PhD Thesis: Chapter II Role of Grammar

 

2.0. Introduction

This chapter examines the role of grammar in handling the three major types of morpho-syntactic interface problems.  This investigation  justifies the mono-stratal design of CPSG95 which contains feature structures of both morphology and syntax.

The major observation from this study is:  (i) grammatical analysis, including both morphology and syntax, plays the fundamental role in contributing to the solutions of the morpho-syntactic problems;  (ii)  when grammar alone is not sufficient to reach the final solution, knowledge beyond morphology and syntax may come into play and serve as “filters” based on the grammatical analysis results.[1]  Based on this observation, a study in the direction of interleaving morphology and syntax will be pursued in the grammatical analysis.  Knowledge beyond morphology and syntax is left to future research.

Section 2.1 investigates the relationship between grammatical analysis and  the resolution of segmentation ambiguity.  Section 2.2 studies the role of syntax in handling Chinese productive word formation.  The borderline cases and their relationship with grammar are explored in 2.3.  Section 2.4 examines the relevance of knowledge beyond syntax to segmentation disambiguation.  Finally, a summary of the presented arguments and discoveries is given in 2.5.

2.1. Segmentation Ambiguity and Syntax

Segmentation ambiguity is one major problem which challenges the traditional word segmenter or an independent morphology.  The following study shows that this ambiguity is structural in nature, not fundamentally different from other structural ambiguity in grammar.  It will be demonstrated that sentential structural analysis is the key to this problem.

A huge amount of research effort in the last decade has been made on resolving segmentation ambiguity (e.g. Chen and Liu 1992; Gan 1995; He, Xu and Sun 1991; Liang 1987; Lua 1994; Sproat, Shih, Gale and Chang 1996; Sun and T’sou 1995; Sun and Huang 1996; X. Wang 1989; Wu and Su 1993; Yao, Zhang and Wu 1990; Yeh and Lee 1991; Zhang, Chen and Chen 1991; Guo 1997b).  Many (e.g. Sun and Huang 1996; Guo 1997b) agree that this is still an unsolved problem.  The major difficulty with most approaches reported in the literature lies in the lack of support from sufficient grammar knowledge.  To ultimately solve this problem, grammatical analysis is vital, a point to be elaborated in the subsequent sections.

2.1.1. Resolution of Hidden Ambiguity

The topic of this section is the treatment of hidden ambiguity.  The conclusion drawn from the investigation below is that the structural analysis of the entire input string provides a sound basis for handling this problem.

The following sample sentences illustrate a typical case involving the hidden ambiguity string 烤白薯 kao bai shu.

(2-1.) (a)      他吃烤白薯
ta         | chi  | kao-bai-shu
he      | eat  | baked-sweet-potato
[S [NP ta] [VP [V chi] [NP kao-bai-shu]]]
He eats the baked sweet potato.

(b) * ta       | chi  | kao          | bai-shu
he      | eat  | bake         | sweet-potato

(2-2.) (a) *    他会烤白薯
ta         | hui  | kao-bai-shu.
he      | can | baked-sweet-potato

(b)     ta       | hui  | kao          | bai-shu.
he      | can | bake         | sweet-potato
[S [NP ta] [VP [V hui] [VP [V kao] [NP bai-shu]]]]
He can bake sweet potatoes.

Sentences (2-1) and (2-2) are a minimal pair;  the only difference is the choice of the predicate verb, namely chi (eat) versus hui (can, be capable of).  But they have very different structures and assume different word identification.  This is because verbs like chi expect an NP object but verbs like hui require a VP complement.  The two segmentations of the string kao bai shu provide two possibilities, one as an NP kao-bai-shu and the other as a VP kao | bai-shu.  When the provided unit matches the expectation, it leads to a successful syntactic analysis, as illustrated by the parse trees in (2‑1a) and (2-2b).  When the expectation constraint is not satisfied, as in (2-1b) and (2-2a), the analysis fails.  These examples show that all candidate words in the input string should be considered for grammatical analysis.  The disambiguation choice can be made via the analysis, as seen in the examples above with the sample parse trees.  Correct segmentation results in at least one successful parse.
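The idea that every candidate word is kept and the grammar makes the choice can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation described in this thesis: the lexicon, the category labels `V_np` and `V_vp`, and the function names are all hypothetical, and the "grammar" is reduced to the two complement expectations discussed above (chi expects an NP, hui expects a VP).

```python
from typing import Dict, List

# Hypothetical toy lexicon: word -> category (labels are assumptions).
LEXICON: Dict[str, str] = {
    "他": "NP",       # ta (he)
    "吃": "V_np",     # chi (eat): expects an NP object
    "会": "V_vp",     # hui (can): expects a VP complement
    "烤": "V_np",     # kao (bake): expects an NP object
    "白薯": "NP",     # bai-shu (sweet potato)
    "烤白薯": "NP",   # kao-bai-shu (baked sweet potato)
}


def segmentations(s: str) -> List[List[str]]:
    """Return every way to split s into words listed in LEXICON."""
    if not s:
        return [[]]
    results = []
    for i in range(1, len(s) + 1):
        word = s[:i]
        if word in LEXICON:
            for rest in segmentations(s[i:]):
                results.append([word] + rest)
    return results


def is_vp(cats: List[str]) -> bool:
    """True if the category sequence forms a VP in the toy grammar."""
    if len(cats) < 2:
        return False
    head, rest = cats[0], cats[1:]
    if head == "V_np":          # verb + NP object
        return rest == ["NP"]
    if head == "V_vp":          # verb + VP complement
        return is_vp(rest)
    return False


def valid_segmentations(s: str) -> List[List[str]]:
    """Keep only the segmentations that parse as [S NP VP]."""
    kept = []
    for seg in segmentations(s):
        cats = [LEXICON[w] for w in seg]
        if cats and cats[0] == "NP" and is_vp(cats[1:]):
            kept.append(seg)
    return kept


print(valid_segmentations("他吃烤白薯"))  # [['他', '吃', '烤白薯']]
print(valid_segmentations("他会烤白薯"))  # [['他', '会', '烤', '白薯']]
```

Both strings yield the same two candidate segmentations for kao bai shu; only the sentential analysis decides which one survives.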

He, Xu and Sun (1991) indicate that a hidden ambiguity string requires a larger context for disambiguation.  But they did not define what the 'larger context' should be.  The following discussion attempts to answer this question.

The input string to the parser constitutes a basic context as well as the object for sentential analysis.[2]  It will be argued that this input string is the proper context for handling the hidden ambiguity problem.  The point to be made is, context smaller than the input string is not reliable for the hidden ambiguity resolution.  This point is illustrated by the following examples of the hidden ambiguity string ge ren in (2-3).[3]  In each successive case, the context is expanded to form a new input string.   As a result, the analysis and the associated interpretation of ‘person’ versus ‘individual’ change accordingly.

(2-3.)  input string                            reading(s)

(a)      人  ren                                       person (or man, human)
[N ren]

(b)      个人  ge ren                               individual
[N ge-ren]

(c)      三个人  san ge ren                               three persons
[NP [CLAP [NUM san] [CLA ge]] [N ren]]

(d)      人的力量  ren de li liang                      the human power
[NP [DEP [NP ren] [DE de]] [N li-liang]]

(e)      个人的力量  ge ren de li liang                        the power of an individual
[NP [DEP [NP ge-ren] [DE de]] [N li-liang]]

(f)       三个人的力量  san ge ren de li liang              the power of three persons
[NP [DEP [NP [CLAP [NUM san] [CLA ge]] [N ren]] [DE de]] [N li-liang]]

(g)      他不是个人  ta bu shi ge ren.
           (1)    He is not a man. (He is a pig.)
[S [NP ta] [VP [ADV bu] [VP [V shi] [NP [CLAP ge] [N ren]]]]]
(2)  He is not an individual. (He represents a social organization.)
[S [NP ta] [VP [ADV bu] [VP [V shi] [NP  ge-ren]]]]

Comparing (a), (b) with (c), and (d), (e) with (f), one can see the associated change of readings when each successively expanded input string leads to a different grammatical analysis.  Accordingly, one segmentation is chosen over the other on the condition that the grammatical analysis of the full string can be established based on the segmentation.  In (b), the ambiguous string is all that is input to the parser, therefore the local context becomes full context.  It then acquires the lexical reading individual as the other possible segmentation ge | ren does not form a legitimate combination.  This reading may be retained, as in (e), or changed to the other reading person, as in (c) and (f), or reduced to one of the possible interpretations, as in (g), when the input string is further lengthened.  All these changes depend on the sentential analysis of the entire input string, as shown by the associated structural trees above.  It demonstrates that the full context is required for the adequate treatment of the hidden ambiguity phenomena.  Full context here refers to the entire input string to the parser.

It is necessary to explain some of the analyses as shown in the sample parses  above.  In Contemporary Mandarin, a numeral cannot  combine with a noun without a classifier in between.[4]  Therefore, the segmentation san (three) | ge-ren (individual) is excluded in (c) and (f), and the correct segmentation san (three) | ge (CLA) | ren (person) leads to the NP analysis.  In general, a classifier alone cannot combine with the following noun either, hence the interpretation of ge ren as one word ge-ren (individual) in (b) and (e).  A classifier usually combines with a preceding numeral or determiner before it can combine with the noun.  But things are more complicated.  In fact, the Chinese numeral yi (one) can be omitted when the NP is in object position.  In other words, the classifier alone can combine with a noun in a very restricted syntactic environment.  That explains the two readings in (g).[5]

The following is a summary of the arguments presented above.   These arguments have been shown to account for the hidden ambiguity phenomena.  The next section will further demonstrate the validity of these arguments for overlapping ambiguity as well.

(2-4.) Conclusion
The grammatical analysis of the entire input string is required for the adequate treatment of the hidden ambiguity problem in word identification.

2.1.2. Resolution of Overlapping Ambiguity

This section investigates overlapping ambiguity and its resolution.  A previous influential theory is examined, which claims that the overlapping ambiguity string can be locally disambiguated.   However, this theory is found to be unable to account for a significant amount of data.  The conclusion is that both overlapping ambiguity and hidden ambiguity require a context of the entire input string and a grammar for disambiguation.

For overlapping ambiguity, comparing different critical tokenizations will be able to detect it, but such a technique cannot guarantee a correct choice without introducing other knowledge.  Guo (1997) pointed out:

As all critical tokenizations hold the property of minimal elements on the word string cover relationship, the existence of critical ambiguity in tokenization implies that the “most powerful and commonly used” (Chen and Liu 1992, page 104) principle of maximum tokenization would not be effective in resolving critical ambiguity in tokenization and implies that other means such as statistical inferencing or grammatical reasoning have to be introduced.

However, He, Xu and Sun (1991) claim that overlapping ambiguity can be resolved within the local context of the ambiguous string.  They classify the overlapping ambiguity string into nine types.  The classification is based on the categories of the presumably correctly segmented words in the ambiguous strings, as described below.

Suppose there is an overlapping ambiguous string consisting of ABC;  both AB and BC are entries listed in the lexicon.  There are two possible cases.  In case one, the category of A and the category of BC define the classification of the ambiguous string.  This is the case when the segmentation A|BC is considered correct.  For example, in the ambiguous string 白天鹅 bai tian e, the word AB is bai-tian (day-time) and the word BC is tian-e (swan).  The correct segmentation for this string is assumed to be A|BC, i.e. bai (A: white) | tian-e (N: swan) (in fact, this cannot be taken for granted, as shall be shown shortly);  it therefore belongs to the A-N type.  In case two, i.e. when the segmentation AB|C is considered correct, the category of AB and the category of C define the classification of the ambiguous string.  For example, in the ambiguous string 需求和 xu qiu he, the word AB is xu-qiu (requirement) and the word BC is qiu-he (sue for peace).  The correct segmentation for this string is AB|C, i.e. xu-qiu (N: requirement) | he (CONJ: and) (again, this should not be taken for granted);  it therefore belongs to the N-CONJ type.
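The ABC configuration described above can be detected mechanically. The following Python sketch is an illustration only; the function name and the tiny lexicons are assumptions introduced here, not part of He, Xu and Sun's system. It reports every pair of lexicon words (AB, BC) that share characters inside the input string.

```python
from typing import List, Set, Tuple


def overlapping_ambiguities(s: str, lexicon: Set[str]) -> List[Tuple[str, str]]:
    """Return (AB, BC) pairs of multi-character lexicon words that overlap in s."""
    # Collect spans (start, end) of all multi-character lexicon words in s.
    spans = [
        (i, j)
        for i in range(len(s))
        for j in range(i + 2, len(s) + 1)
        if s[i:j] in lexicon
    ]
    found = []
    for i1, j1 in spans:
        for i2, j2 in spans:
            # AB starts first, BC starts inside AB and ends after it.
            if i1 < i2 < j1 < j2:
                found.append((s[i1:j1], s[i2:j2]))
    return found


print(overlapping_ambiguities("白天鹅", {"白", "白天", "天鹅", "鹅"}))
print(overlapping_ambiguities("需求和", {"需求", "求和", "和"}))
print(overlapping_ambiguities("烤白薯", {"烤", "白薯", "烤白薯"}))
```

The first two calls report the pairs (白天, 天鹅) and (需求, 求和). The third returns an empty list: kao bai shu is a hidden ambiguity string, where one word properly contains the others, so no two words cross each other's boundary.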

After classifying the overlapping ambiguous strings into one of nine types, using the two different cases described above, they claim to have discovered a rule.[6]  That is, the category of the correctly segmented word BC in case one (or AB in case two) is predictable from AB (or BC in case two) within the local ambiguous string.  For example, the category of tian-e (swan) in bai | tian-e (white swan) is a noun.  This information is predictable from bai tian within the respective local string bai tian e.  The idea is, if ever an overlapping ambiguity string is formed of bai tian and C, the judgment of bai | tian-C as the correct segmentation entails that the word tian-C  must be a noun.  Otherwise, the segmentation A|BC is wrong and the other segmentation AB|C is right.  For illustration, it is noted that tian-shi (angel) in the ambiguous string 白天使 bai | tian-shi (white angel) is, as expected, a noun.  This predictability of the category information from within the local overlapping ambiguous string is seen as an important discovery (Feng 1996).  Based on this assumed feature of the overlapping ambiguous strings, He,  Xu and Sun (1991) developed their theory that an overlapping ambiguity string can be disambiguated within the local string itself.

The proposed disambiguation process within the overlapping ambiguous string proceeds as follows.  In order to correctly segment an overlapping ambiguous string, say, bai tian e or bai tian shi, the following information needs to be given under the entry bai-tian (day-time) in the tokenization lexicon:  (i) an ambiguity label, to indicate the necessity to call a disambiguation rule;  (ii) the ambiguity type A-N, to indicate that it should call the rule corresponding to this type.  Then the following disambiguation rule can be formulated.

(2-5.) A-N type rule       (He,  Xu and Sun 1991)
In the overlapping ambiguous string A(1)...A(i) B(1)...B(j) C(1)...C(k),
if        B(1)...B(j) and C(1)...C(k) form a noun,
then  the correct segmentation is A(1)...A(i) | B(1)...B(j)-C(1)...C(k),
else    the correct segmentation is A(1)...A(i)-B(1)...B(j) | C(1)...C(k).

This way, bai tian e and bai tian shi will always be segmented as bai (white) | tian-e (swan) and bai (white) | tian-shi (angel) instead of bai-tian (daytime) | e (goose) and bai-tian (daytime) | shi (make).  This can be easily accommodated in a segmentation algorithm provided the above information is added to the lexicon and the disambiguation rules are implemented.  The whole procedure runs within the local context of the overlapping ambiguous string and uses only lexical information.  They therefore call the resolution of overlapping ambiguity morphology-based disambiguation, with no need to consult syntax, semantics or discourse.
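For concreteness, rule (2-5) can be written as a small Python function. This is a sketch under stated assumptions: the function name `a_n_rule` and the noun list are hypothetical, and the rule is transcribed from the formulation above rather than from He, Xu and Sun's actual code.

```python
from typing import Set, Tuple


def a_n_rule(a: str, b: str, c: str, nouns: Set[str]) -> Tuple[str, str]:
    """Rule (2-5): segment as A | BC if BC is a noun, otherwise as AB | C."""
    if b + c in nouns:
        return (a, b + c)
    return (a + b, c)


NOUNS = {"天鹅", "天使"}  # tian-e (swan), tian-shi (angel); an assumed noun list

print(a_n_rule("白", "天", "鹅", NOUNS))  # ('白', '天鹅')
print(a_n_rule("白", "天", "使", NOUNS))  # ('白', '天使')
print(a_n_rule("白", "天", "很", NOUNS))  # ('白天', '很')
```

The function receives only the three parts of the local string. Nothing outside the ambiguous string can influence its answer, which is exactly the property the theory claims to be sufficient.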

Feng (1996) emphasizes that He, Xu and Sun's view on the overlapping ambiguous string constitutes a valuable contribution to the theory of Chinese word identification.  Indeed, this overlapping ambiguous string theory, if it were right, would be a breakthrough in this field.  It in effect suggests that the majority of the segmentation ambiguity is resolvable without and before a grammar module.  A handful of simple rules, like the A-N type rule formulated above, plus a lexicon would solve most ambiguity problems in word identification.[7]

Feng (1996) provides examples for all nine types of overlapping ambiguous strings as evidence to support He, Xu and Sun (1991)'s theory.  In the case of the A-N type ambiguous string bai tian e, the correct segmentation is supposed to be bai | tian-e in this theory.  However, even with his own cited example, Feng overlooks a perfectly valid second reading (parse), in which the time NP bai-tian (daytime) directly acts as a modifier for the sentence with no need for a preposition, as shown in (2-6b) below.

(2-6.)           白天鹅游过来了
bai tian e you guo lai le       (Feng 1996)

(a)      bai     | tian-e       | you          | guo-lai      | le.
          white | swan        | swim        | over-here  | LE
[S [NP bai tian-e] [VP you guo-lai le]]
The white swan swam over here.

(b)      bai-tian       | e              | you          | guo-lai      | le.
          day-time      | goose        | swim        | over-here  | LE
[S [NP+mod bai-tian] [S [NP e] [VP you guo-lai le]]]
In the day time the geese swam over here.

In addition, one only needs to add a preposition zai (in) to the beginning of the sentence to make the abandoned segmentation bai-tian | e the only right one in the changed context.  The presumably correct segmentation, namely bai | tian-e, now turns out to be wrong, as shown in (2-7a) below.

(2-7.)           在白天鹅游过来了
zai bai tian e you guo lai le

(a) *   zai     | bai           | tian-e       | you          | guo-lai      | le.
          in      | white        | swan        | swim        | over-here  | LE

(b)      zai     | bai-tian    | e              | you          | guo-lai      | le.
          in      | day-time   | goose        | swim        | over-here  | LE
[S [PP+mod zai bai-tian] [S [NP e] [VP you guo-lai le]]]
In the day time the geese swam over here.

The above counter-example is by no means accidental.  In fact, for each cited ambiguous string in the examples given by Feng, there exist counter-examples.  It is not difficult to construct a different context where the preferred segmentation within the local string, i.e. the segmentation chosen according to one of the rules, is proven to be wrong.[8]  In the pairs of sample sentences (2‑8) through (2-10), (a) is an example which Feng (1996) cited to support the view that the local ambiguous string itself is enough for disambiguation.  Sentences in (b) are counter-examples to this theory.  It is a notable fact that the listed local string is often properly contained in a more complicated ambiguous string in an expanded context, seen in (2-9b) and (2-10b).  Therefore, even when the abandoned segmentation can never be linguistically correct in any context, as shown for tu-xing (graph) | shi (BM) in (2-9) where a bound morpheme still exists after the segmentation, it does not entail the correctness of the other segmentation in all contexts.  These data show that all possible segmentations should be retained for the grammatical analysis to judge.

(2-8.)  V-N type of overlapping ambiguous string

研究生命
          yan jiu sheng ming:
          yan-jiu (V:study) | sheng-ming (N:life)
yan-jiu-sheng (N:graduate student) | ming (life/destiny)

(a)      研究生命的本质
          yan-jiu    sheng-ming de      ben-zhi
          study          life               DE     essence
Study the essence of life.

(b)      研究生命金贵
           yan-jiu-sheng      ming  jin-gui
          graduate-student  life     precious
Life for graduate students is precious.

(2-9.)  CONJ-N type of overlapping ambiguous string
和平等 he ping deng:
          he (CONJ:and) | ping-deng (N:equality)
he-ping (N:peace) | deng (V:wait)?

(a)      独立自主和平等互利的原则
           du-li-zi-zhu           he      ping-deng-hu-li               de      yuan-ze
          independence       and    equal-reciprocal-benefit  DE     principle
the principle of independence and equal reciprocal benefit

(b)      和平等于胜利
          he-ping       deng-yu       sheng-li
          peace           equal           victory
Peace is equal to victory.

(2-10.)  V-P type of overlapping ambiguous string
看中和 kan zhong he:
          kan-zhong (V:target) | he (P:with)
kan (V:see) | zhong-he (V:neutralize)

(a)      他们看中和日本人做生意的机会
ta-men    kan-zhong   he      ri-ben          ren              zuo     sheng-yi      de      ji-hui    
they         target           with    Japan          person          do      business     DE   opportunity
They have targeted the opportunity to do business with the Japanese.

(b)      这要看中和作用的效果
zhe          yao    kan    zhong-he-zuo-yong                   de          xiao-guo
this    need  see     neutralization                DE     effect
This will depend on the effects of the neutralization.

The data in (b) above directly contradict the claim that an overlapping ambiguous string can be disambiguated within the local string itself.  While this approach is shown to be inappropriate in practice, the following comment attempts to reveal its theoretical motivation.

As reviewed in the previous text, He, Xu and Sun (1991)'s overlapping ambiguity theory is established on the classification of the overlapping ambiguous strings.  A careful examination of their proposed nine types of the overlapping ambiguous strings reveals an underlying assumption on which the classification is based.  That is, the correctly segmented words within the overlapping ambiguous string will automatically remain correct in a sentence containing the local string.   This is in general untrue, as shown by the counter-examples above.[9]   The following analysis reveals why.

Within the local context of the overlapping ambiguous string, the chosen segmentation often leads to a syntactically legitimate structure while the abandoned segmentation does not.  For example,  bai (white) | tian-e (swan) combines into a valid syntactic unit while there is no structure which can span bai-tian (daytime) | e (goose).  For another example,  yan-jiu (study) | sheng-ming (life) can be combined into a legitimate verb phrase [VP [V yan-jiu] [NP sheng-ming]], but  yan-jiu-sheng (graduate student) | ming (life/destiny) cannot.  But that legitimacy only stands locally within the boundary of the ambiguous string.  It does not necessarily hold true in a larger context containing the string.  As shown previously in (2-7a),  the locally legitimate structure bai | tian-e (white swan) does not lead to a successful parse for the sentence.  In contrast, the locally abandoned segmentation bai-tian (daytime) | e (goose) has turned out to be right with the parse in (2-7b).   Therefore, the full context instead of the local context of the ambiguous string is required for the final judgment on which segmentation can be safely abandoned.  Context smaller than the entire input string is not reliable for the overlapping ambiguity resolution.  Note that exactly the same conclusion has been reached for the hidden ambiguous strings in the previous section.

The following data in (2-11) further illustrate the point of the full context requirement for the overlapping ambiguity resolution, similar to what has been presented for the hidden ambiguity phenomena in (2-3).  In each successive case, the context is expanded to form a new input string.  As a result, the interpretation of ‘goose’ versus ‘swan’ changes accordingly.

(2-11.)  input string                reading(s)

(a)      鹅 e                                goose
[N e]

(b)      天鹅 tian e                                swan
[N tian-e]

(c)      白天鹅 bai tian e                       white swan
[N [A bai] [N tian-e]]

(d)      鹅游过来了 e you guo lai le.
The geese swam over here.
[S [NP e] [VP you guo-lai le]]

(e)      天鹅游过来了 tian e you guo lai le.
The swans swam over here.
[S [NP tian-e] [VP you guo-lai le]]

(f)      白天鹅游过来了 bai tian e you guo lai le.
          (i)       The white swan swam over here.
[S [NP bai tian-e] [VP you guo-lai le]]
          (ii)      In the daytime, the geese swam over here.
[S [NP+mod bai-tian] [S [NP e] [VP you guo-lai le]]]

(g)       在白天鹅游过来了 zai bai tian e you guo lai le.
            In the daytime, the geese swam over here.
[S [PP zai bai-tian] [S [NP e] [VP you guo-lai le]]]

(h)      三只白天鹅游过来了 san zhi bai tian e you guo lai le.
           Three white swans swam over here.
[S [NP san zhi bai tian-e] [VP you guo-lai le]]

It is interesting to compare (c) with (f), (g) and (h) to see the associated change of readings based on different segmentations.  In (c), the overlapping ambiguous string is all that is input to the parser, therefore the local context becomes full context.  It then acquires the reading white swan corresponding to the segmentation bai | tian-e.  This reading may be retained, or changed, or reduced to one of the possible interpretations when the input string is lengthened.  That is respectively the case in (h), (g) and (f).  All these changes depend on the grammatical analysis of the entire input string.  It shows that the full context and a grammar are required for the resolution of most ambiguities;  and when sentential analysis cannot disambiguate, as in cases of ‘genuine’ segmentation ambiguity like (f), the structural analysis can make the ambiguity explicit in the form of multiple parses (readings).
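The behaviour of (2-11c), (f), (g) and (h) can be reproduced with a very small chart parser that works directly on the character string. The Python sketch below is illustrative only and is not the CPSG95 grammar: the lexicon, the category labels (for instance `T` for a time noun and `MOD` for a preposition phrase modifier) and the ten binary rules are assumptions chosen to cover just these examples. Every lexicon entry that matches a substring becomes an edge in the chart, so no segmentation is discarded before the grammar has applied.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: character string -> possible categories.
LEXICON: Dict[str, List[str]] = {
    "白": ["A"], "天鹅": ["N"], "鹅": ["N"], "白天": ["T"],
    "游": ["V"], "过来": ["DIR"], "了": ["LE"],
    "在": ["P"], "三": ["NUM"], "只": ["CLA"],
}

# Hypothetical binary rules: (parent, left child, right child).
RULES: List[Tuple[str, str, str]] = [
    ("N", "A", "N"),          # bai + tian-e
    ("CLAP", "NUM", "CLA"),   # san + zhi
    ("NP", "CLAP", "N"),
    ("VP", "V", "DIR"),       # you + guo-lai
    ("VP", "VP", "LE"),
    ("S", "N", "VP"),
    ("S", "NP", "VP"),
    ("MOD", "P", "T"),        # zai + bai-tian
    ("S", "T", "S"),          # time noun as sentence modifier
    ("S", "MOD", "S"),
]


def count_parses(s: str, goal: str = "S") -> int:
    """Count analyses of the whole string s as category goal (CYK-style chart)."""
    n = len(s)
    # chart[(i, j)] maps a category to its number of derivations over s[i:j].
    chart = defaultdict(lambda: defaultdict(int))
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            cell = chart[(i, j)]
            # Lexical edges: every lexicon word spanning s[i:j] is a candidate.
            for cat in LEXICON.get(s[i:j], []):
                cell[cat] += 1
            # Phrasal edges: combine two adjacent, shorter edges.
            for k in range(i + 1, j):
                left, right = chart[(i, k)], chart[(k, j)]
                for parent, l, r in RULES:
                    if l in left and r in right:
                        cell[parent] += left[l] * right[r]
    return chart[(0, n)].get(goal, 0)


print(count_parses("白天鹅", goal="N"))      # 1: white swan
print(count_parses("白天鹅游过来了"))        # 2: genuine ambiguity, as in (f)
print(count_parses("在白天鹅游过来了"))      # 1: only bai-tian | e survives, as in (g)
print(count_parses("三只白天鹅游过来了"))    # 1: only bai | tian-e survives, as in (h)
```

In all four inputs the chart contains edges for both bai-tian and tian-e. Which of them ends up inside a spanning analysis depends on the rest of the string, not on the local string bai tian e.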

In the light of the inquiry in this section, the theoretical significance of the distinction between overlapping ambiguity and hidden ambiguity seems to have diminished.[10]  They are both structural in nature.  They both require full context and a grammar for proper treatment.

(2-12.) Conclusion

(i)  It is not necessarily true that an overlapping ambiguous string can be disambiguated within the local string.

(ii) The grammatical analysis of the entire input string is required for the adequate treatment of the overlapping ambiguity problem as well as the hidden ambiguity problem.

2.2. Productive Word Formation and Syntax

This section examines the connection between productive word formation and segmentation ambiguity.  The observation is that ambiguity can potentially be involved in each type of word formation.  The point to be made is that no independent morphology system can resolve this ambiguity when syntax is unavailable.  This is because words formed via morphology, just like words looked up in the lexicon, only provide syntactic ‘candidate’ constituents for the sentential analysis.  The choice is decided by the structural analysis of the entire sentence.

Derivation is a major type of productive word formation in Chinese.   Section 1.2.2 has given an example of the involvement of hidden ambiguity in derivation, repeated below.

(2-13.)         这道菜没有吃头  zhe dao cai mei you chi tou.

(a)      zhe    | dao          | cai            | mei-you    | chi-tou
          this    | CLA                   | dish         | not-have   | worth-of-eating
[S [NP zhe dao cai] [VP [V mei-you] [NP chi-tou]]]
This dish is not worth eating.

(b) ?   zhe    | dao          | cai            | mei-you    | chi  | tou
          this    | CLA                   | dish         | not have   | eat  | head
[S [NP zhe dao cai] [VP [ADV mei-you] [VP [V chi] [NP tou]]]]
This dish did not eat the head.

(2-14.)         他饿得能吃头牛 ta e de neng chi tou niu.

(a) *   ta       | e              | de            | neng         | chi-tou               | niu
he      | hungry     | DE3         | can           | worth-of-eating  | ox

(b)      ta       | e              | de            | neng         | chi  | tou            | niu
he      | hungry     | DE3         | can           | eat  | CLA          | ox
[…[VP [V e] [DE3P [DE3 de] [VP [V neng] [VP [V chi] [NP tou niu]]]]]]
He is so hungry that he can eat an ox.

Some derivation rule like the one in (2-15) is responsible for combining the transitive verb stem and the suffix –tou (worth-of) into a derived noun for (2-13a) and (2-14a).

(2-15.)         X (transitive verb) + tou --> X-tou (noun, semantics: worth-of-X)

However, when syntax is not available, there is always a danger of wrongly applying this morphological rule due to possible ambiguity involved, as shown in (2-14a).  In other words, morphological rules only provide candidate words;  they cannot make the decision whether these words are legitimate in the context.
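The point that a morphological rule only proposes candidates can be made concrete. The Python sketch below applies rule (2-15) blindly to a character string; the function name and the one-verb list are assumptions for illustration.

```python
from typing import Iterable, List, Tuple


def tou_candidates(s: str, transitive_verbs: Iterable[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, stem) for every substring matching X + tou, rule (2-15)."""
    found = []
    for i in range(len(s)):
        for verb in sorted(transitive_verbs):
            end = i + len(verb) + 1
            # The stem must start at i and be followed directly by the suffix 头 (tou).
            if s.startswith(verb, i) and s[i + len(verb):end] == "头":
                found.append((i, end, verb))
    return found


VERBS = {"吃"}  # chi (eat); an assumed list of transitive verb stems

print(tou_candidates("这道菜没有吃头", VERBS))  # [(5, 7, '吃')]
print(tou_candidates("他饿得能吃头牛", VERBS))  # [(4, 6, '吃')]
```

The rule fires in both sentences and produces the candidate chi-tou each time. Nothing in the rule itself can tell that the candidate is legitimate in (2-13) but not in (2-14); that decision has to come from the analysis of the whole sentence.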

Reduplication is another method for productive word formation in Chinese.  An outstanding problem is the AB --> AABB reduplication or AB --> AAB reduplication if AB is a listed word.   In these cases, some reduplication rules or procedures need to be involved to recognize AABB or AAB.  If reduplication is a simple process confined to a local small context, it may be possible to handle it by incorporating some procedure-based function calls during the lexical lookup.  For example, when a three-character string, say 分分心 fen fen xin, cannot be found in the lexicon, the reduplication function will check whether the first two characters are the same, and if yes, delete one of them and consult the lexicon again.  This method is expected to handle the AAB type reduplication, e.g. fen-xin (divide-heart: distract) --> fen-fen-xin (distract a bit).
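The lookup procedure just described for the AAB type is simple enough to sketch directly in Python. The function name and the one-word lexicon are assumptions for illustration.

```python
from typing import Optional, Set


def lookup_with_aab(s: str, lexicon: Set[str]) -> Optional[str]:
    """Look s up in the lexicon; on failure, try undoing AAB reduplication.

    Returns the listed base word, or None if nothing is found.
    """
    if s in lexicon:
        return s
    # AAB pattern: three characters, the first two identical.
    if len(s) == 3 and s[0] == s[1]:
        base = s[1:]  # delete one copy of A and consult the lexicon again
        if base in lexicon:
            return base
    return None


LEXICON = {"分心"}  # fen-xin (distract)

print(lookup_with_aab("分分心", LEXICON))  # 分心
print(lookup_with_aab("十分分", LEXICON))  # None
```

The function inspects only the three characters it is given, so it proposes fen-fen-xin wherever that substring occurs, regardless of what precedes or follows it.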

But segmentation ambiguity can be involved in reduplication as well.  Compare the following examples in (2-16) and (2-17), both containing the sub-string fen fen xin:  the first is ambiguity free but the second is ambiguous.  In fact, (2-17) involves an overlapping ambiguous string shi fen fen xin, with the two segmentations shi (ten) | fen-fen-xin (distract a bit) and shi-fen (very) | fen-xin (distract).  Based on the conclusion presented in 2.1, grammatical analysis is required to resolve the segmentation ambiguity.  This is illustrated in (2-17).

(2-16.)         让他分分心

rang     | ta    | fen-fen-xin
let      | he   | distracted-a-bit
Let him relax a while.

(2-17.)         这件事十分分心

zhe jian shi shi fen fen xin.

(a) *   zhe    | jian          | shi           | shi  | fen-fen-xin
          this    | CLA         | thing       | ten  | distracted a bit

(b)      zhe    | jian          | shi            | shi-fen     | fen-xin
           this    | CLA         | thing       | very         | distract
[S [NP zhe jian shi] [VP [ADV shi-fen] [V fen-xin]]]
This thing is very distracting.

Finally, there is also possible ambiguity involvement in the proper name formation.  Proper names for persons, locations, etc. that are not listed in the lexicon are recognized as another major problem in word identification (Sun and Huang 1996).[11]  This problem is complicated when ambiguity is involved.

A Chinese personal name usually consists of a family name followed by a given name of one or two characters.  For example, the late Chinese chairman mao-ze-dong (Mao Zedong) used to have another name, li-de-sheng (Li Desheng).  In the lexicon, li is a listed family name.  Both de-sheng and sheng mean ‘win’.  This may lead to three ways of word segmentation, a complicated case involving both overlapping ambiguity and hidden ambiguity:  (i) li | de-sheng;  (ii) li-de | sheng;  (iii) li-de-sheng, as shown in (2-18) below.

(2-18.)         李得胜了 li de sheng le.

(a)      li        | de-sheng  | le
           Li       | win          | LE
[S [NP li] [VP de-sheng le]]
Li won.

(b)      li-de   | sheng       | le
           Li De | win          | LE
[S [NP li-de] [VP sheng le]]
Li De won.

(c) *    li-de-sheng  | le
           Li Desheng  | LE

For this particular type of compounding, the family name serves as the left boundary of a potential compound name of person and the length can be used to determine candidates.[12]  Again, the choice is decided by the grammatical analysis of the entire sentence, as illustrated in (2-18).
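Candidate generation for person names, using the family name as the left boundary and the length restriction on given names, can be sketched as follows. The function name, the single-character family name list and the `max_given` parameter are assumptions for illustration.

```python
from typing import List, Set


def name_candidates(s: str, family_names: Set[str], max_given: int = 2) -> List[str]:
    """Propose person-name candidates: a family name plus 1..max_given characters."""
    found = []
    for i, ch in enumerate(s):
        if ch in family_names:
            for given_len in range(1, max_given + 1):
                end = i + 1 + given_len
                if end <= len(s):
                    found.append(s[i:end])
    return found


FAMILY_NAMES = {"李"}  # li; an assumed list of single-character family names

print(name_candidates("李得胜了", FAMILY_NAMES))  # ['李得', '李得胜']
```

Together with the bare family name li, which is already a lexicon entry, these candidates give the three segmentations (i) to (iii) above. The sketch only proposes them; it has no means of ruling out li-de-sheng in (2-18c).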

(2-19.) Conclusion

Due to the possible ambiguity involvement in productive word formation, a grammar containing both morphology and syntax is required for an adequate treatment.  An independent morphology system or separate word segmenter cannot solve ambiguity problems.

2.3. Borderline Cases and Grammar

This section reviews some outstanding morpho-syntactic borderline phenomena.  The points to be made are:  (i) each proposed morphological or syntactic analysis should be justified in terms of capturing the linguistic generality;  (ii) the design of a grammar should facilitate the access to the knowledge from both morphology and syntax in analysis.

The nature of the borderline phenomena calls for the coordination of morphology and syntax in a grammar.  The phenomena of Chinese separable verbs are one typical example.  The co-existence of their contiguous use and separate use leads to the confusion whether they belong to the lexicon and morphology, or whether they are syntactic phenomena.  In fact, as will be discussed in Chapter V, there are different degrees of ‘separability’ for different types of Chinese separable verbs;  there is no uniform analysis which can handle all separable verbs properly.  Different types of separable verbs may justify different approaches to the problems.  In terms of capturing linguistic generality, a good analysis should account for the demonstrated variety of separated uses and link the separated use and the contiguous use.

‘Quasi-affixation’ is another outstanding interface problem.  This problem requires careful morpho-syntactic coordination.  As presented in Chapter I, structurally, ‘quasi-affixes’ and ‘true’ affixes demonstrate very similar word formation potential, but ‘quasi-affixes’ often retain some ‘solid’ meaning while the meaning of ‘true’ affixes is functionalized.  Therefore, the key is how to coordinate the semantic contribution of words derived via ‘quasi-affixation’ with the building of the semantics for the entire sentence.  This coordination requires flexible information flow between data structures for morphology, syntax and semantics during the morpho-syntactic analysis.

In short, the proper treatment of the morpho-syntactic borderline phenomena requires inquiry into each individual problem in order to reach a morphological or syntactic analysis which maximally captures linguistic generality.  It also calls for the design of a grammar where information between morphology and syntax can be effectively coordinated.

2.4. Knowledge beyond Syntax

This section examines the roles of knowledge beyond syntax in the resolution of segmentation ambiguity.  Despite the fact that further information beyond syntax may be necessary for a thorough solution to segmentation ambiguity,[13] it will be argued that syntax is the appropriate place for initiating this process due to the structural nature of segmentation ambiguity.

Depending on which type of information is essential for the disambiguation, disambiguation can be classified as structure-oriented, semantics-oriented and pragmatics-oriented.  This classification hierarchy is modified from that in He, Xu and Sun (1991).  They classify the resolution of hidden ambiguity into three categories:  syntax-based, semantics-based and pragmatics-based.  Together with the morphology-based disambiguation, which is equivalent to the overlapping ambiguity resolution in their theory, they have built a hierarchy from morphology up to pragmatics.

A note on the technical details is called for here.  The term X‑oriented (where X is either syntax, semantics or pragmatics) is selected here instead of X-based in order to avoid the potential misunderstanding that X is the basis for the relevant disambiguation.  It will be shown that while information from X is required for the ambiguity resolution, the basis is always syntax.

Based on the study in 2.1, it is believed that there is no morphology-based (or morphology-oriented) disambiguation independent of syntax.  This is because the context of morphology is a local context, too small for resolving structural ambiguity.  There is little doubt that the morphological analysis is a necessary part of word identification in terms of handling productive word formation.  But this analysis cannot by itself resolve ambiguity, as argued in 2.2.  The notion 'structure' in structure-oriented disambiguation includes both syntax and morphology.

He, Xu and Sun (1991) exclude the overlapping ambiguity resolution from the classification beyond morphology.  This exclusion is not appropriate.  In fact, the resolution of both hidden ambiguity and overlapping ambiguity can be classified into this hierarchy.  In order to illustrate this point, for each class, I will give examples from both hidden ambiguity and overlapping ambiguity.

Sentences in (2-20) and (2-21) which contain the hidden ambiguity string 阵风zhen feng  are examples for the structure-oriented disambiguation.  This type of disambiguation relying on a grammar constitutes the bulk of the disambiguation task required for word identification.

(2-20.)         一阵风吹过来了
yi zhen feng chui guo lai le.          (Feng 1996)

(a)      yi       | zhen         | feng         | chui          | guo-lai      | le
          one    | CLA          | wind        | blow         | over-here  | LE
[S [NP [CLAP yi zhen] [N feng]] [VP chui guo-lai le]]
A gust of wind blew here

(b) *   yi       | zhen-feng                    | chui                   | guo-lai      | le
          one    | gusts-of-wind    | blow         | over-here  | LE

(2-21.)         阵风会很快来临 zhen feng hui hen kuai lai lin.

(a)      zhen-feng              | hui  | hen          | kuai         | lai-lin
          gusts-of-wind       | will | very                   | soon         | come
[S [NP zhen-feng] [VP hui hen kuai lai-lin]]
Gusts of wind will come very soon.

(b) *   zhen  | feng                   | hui  | hen          | kuai         | lai-lin
          CLA   | wind        | will | very                   | soon         | come

Compare (2-20a), where the ambiguity string is identified as two words zhen (CLA) feng (wind), and (2-21a), where the string is taken as one word zhen-feng (gusts-of-wind).  Chinese syntax does not allow a numeral to combine directly with a noun, nor does it allow a bare classifier to do so in non-object position.  The numeral and the classifier must combine together before they can combine with a noun.  So (2-20b) and (2‑21b) are both ruled out while (2-20a) and (2-21a) are structurally well-formed.
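The way a grammar rules out (2-20b) and (2-21b) can be illustrated with a small bottom-up chart parser over syllable strings.  This is a minimal sketch under stated assumptions, not CPSG95: the toy lexicon, the category labels and the binary rules are invented for these two sentences, and pinyin syllables stand in for hanzi.  All candidate words are entered into the chart, and only segmentations that take part in a full sentence parse survive.

```python
from collections import defaultdict

# Toy lexicon (hypothetical): syllable tuples mapped to categories.
LEXICON = {
    ("yi",): ["NUM"], ("zhen",): ["CLA"], ("feng",): ["N"],
    ("zhen", "feng"): ["N"], ("chui",): ["V"], ("guo", "lai"): ["DIR"],
    ("le",): ["ASP"], ("hui",): ["AUX"], ("hen",): ["DEG"],
    ("kuai",): ["A"], ("lai", "lin"): ["V"],
}
# Toy binary rules (hypothetical): (left, right) -> parent.
BINARY = {
    ("NP", "VP"): "S", ("CLAP", "N"): "NP", ("NUM", "CLA"): "CLAP",
    ("V", "DIR"): "VG", ("VG", "ASP"): "VP", ("AUX", "VP"): "VP",
    ("ADVP", "VP"): "VP", ("DEG", "A"): "ADVP",
}
# Toy unary rules: a bare N is an NP, a bare V is a VP.
UNARY = {"N": "NP", "V": "VP"}

def parse(syllables, goal="S"):
    """Return all bracketed trees of category `goal` spanning the input."""
    n = len(syllables)
    chart = defaultdict(lambda: defaultdict(list))

    def add(i, j, cat, tree):
        chart[i, j][cat].append(tree)
        if cat in UNARY:
            add(i, j, UNARY[cat], f"[{UNARY[cat]} {tree}]")

    # Lexical lookup: enter every candidate word, whatever its length.
    for i in range(n):
        for j in range(i + 1, n + 1):
            key = tuple(syllables[i:j])
            for cat in LEXICON.get(key, []):
                add(i, j, cat, f"[{cat} {'-'.join(key)}]")

    # Combine adjacent constituents, shorter spans first.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for lcat, ltrees in list(chart[i, k].items()):
                    for rcat, rtrees in list(chart[k, j].items()):
                        parent = BINARY.get((lcat, rcat))
                        if parent is None:
                            continue
                        for lt in ltrees:
                            for rt in rtrees:
                                add(i, j, parent, f"[{parent} {lt} {rt}]")
    return chart[0, n][goal]

if __name__ == "__main__":
    for sentence in ("yi zhen feng chui guo lai le",
                     "zhen feng hui hen kuai lai lin"):
        print(parse(sentence.split()))
```

In this sketch the word zhen-feng is always a chart candidate, but it ends up in a full parse only for the second sentence, because no rule combines a numeral directly with a noun or a bare classifier with a noun.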

For the structure-oriented overlapping ambiguity resolution,  numerous examples have been cited before, and one typical example is repeated below.

(2-22.)         研究生命金贵 yan jiu sheng ming jin gui

(a)      yan-jiu-sheng       | ming         | jin-gui
graduate student | life            | precious
[S [NP yan-jiu-sheng] [S [NP ming] [AP jin-gui]]]
Life for graduate students is precious.

(b) *   yan-jiu        | sheng-ming        | jin-gui
study          | life                     | precious

As a predicate, the adjective jin-gui (precious) syntactically expects an NP as its subject, which is saturated by the second NP ming (life) in (2-22a).  The first NP serves as a topic of the sentence and is semantically linked to the subject ming (life) as its possessor.[14]  But there is no parse for (2-22b) despite the fact that the sub-string yan-jiu sheng-ming (to study life) forms a verb phrase [VP [V yan-jiu] [NP sheng-ming]] and the sub-string sheng-ming jin-gui (life is precious) forms a sentence [S [NP sheng-ming] [AP jin-gui]].  On the one hand, the VP in the subject position does not satisfy the syntactic constraint (the category NP) expected by the adjective jin-gui (precious), although other adjectives, say zhong-yao 'important', may expect a VP subject.  On the other hand, the transitive verb yan-jiu (study) expects an NP object.  It cannot take an S object (embedded object clause) as do other verbs, say ren-wei (think).

The resolution of the following hidden ambiguity belongs to the semantics-oriented disambiguation.

(2-23.)         请把手抬高一点儿 qing ba shou tai gao yi dian er            (Feng 1996)

(a1)    qing             | ba   | shou         | tai            | gao | yi-dian-er
          please          | BA  | hand        | hold         | high| a-little
[VP [ADV qing] [VP ba shou tai gao yi-dian-er]]
Please raise your hand a little higher.

(a2) * qing   | ba   | shou         | tai            | gao           | yi-dian-er
          invite | BA  | hand        | hold         | high         | a-little

(b1) * qing             | ba-shou    | tai            | gao           | yi-dian-er
          please          | N:handle  | hold         | high         | a-little

(b2) ? qing   | ba-shou    | tai            | gao           | yi-dian-er
          invite | N:handle  | hold         | high         | a-little
[VP [VG [V qing] [NP ba-shou]] [VP tai gao yi-dian-er]]
Invite the handle to hold a little higher.

This is an interesting example.  The same character qing is both an adverb ‘please’ and a verb ‘invite’.  (2-23b2) is syntactically valid, but violates the semantic constraint or semantic selection restriction.  The logical object of qing (invite) should be human but ba-shou (handle)  is not human.  The two syntactically valid parses (2-23a1) and (2-23b2), which correspond to two ways of segmentation, are expected to be somehow disambiguated on the above semantic grounds.

The following case is an example of semantics-oriented resolution of the overlapping ambiguity.

(2-24.)         茶点心吃了 cha dian xin chi le.

(a1)    cha    | dian-xin   | chi  | le
tea     | dim sum  | eat  | LE
[S [NP+object cha dian-xin] [VP chi le]]
The tea dim sum was eaten.

(a2) ? cha    | dian-xin   | chi  | le
tea     | dim sum  | eat  | LE
[S [NP+agent cha dian-xin] [VP chi le]]
The tea dim sum ate (something).

(a3) ? cha    | dian-xin   | chi  | le
tea     | dim sum  | eat  | LE
[S [NP+object cha ] [S [NP+agent dian-xin] [VP chi le]]]
Tea, the dim sum ate.

(a4) ? cha    | dian-xin   | chi  | le
tea     | dim sum  | eat  | LE
[S [NP+agent cha ] [VP [NP+object dian-xin] [VP chi le]]]
The tea ate the dim sum.

(b1) ? cha-dian               | xin           | chi  | le
tea dim sum         | heart       | eat  | LE
[S [NP+object cha-dian] [S [NP+agent xin] [VP chi le]]]
The tea dim sum, the heart ate.

(b2) ? cha-dian               | xin           | chi  | le
tea dim sum         | heart        | eat  | LE
[S [NP+agent cha-dian] [VP [NP+object xin] [VP chi le]]]
The tea dim sum ate the heart.

Most Chinese dictionaries list the compound noun cha-dian (tea-dim-sum), but not cha dian-xin, which stands for the same thing, namely the snacks served with tea.  As shown above, there are four analyses for one segmentation and two analyses for the other.  These are all syntactically legitimate, corresponding to six different readings.  But only one analysis makes sense, namely the implicit passive construction in (a1) with the compound noun cha dian-xin as the preceding (logical) object.  The other five analyses are nonsensical and can be ruled out if the semantic selection restriction that an animate being eats (i.e. chi) food is enforced.  Syntactically, (a2) is an active construction with the optional object omitted.  The constructions for (a3) and (b1) involve long distance dependency, where the object is topicalized and placed at the beginning.  The SOV (Subject Object Verb) pattern for (a4) and (b2) is a very restrictive construction in Chinese.[15]
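The filtering effect of the selection restriction on the six analyses of (2-24) can be sketched as a post-parse check.  The semantic features assigned to each noun and the flat (label, agent, object) representation are assumptions made for this illustration; they are not the feature structures used in CPSG95.

```python
# Hypothetical semantic features for the nouns in (2-24).
FEATURES = {
    "cha": {"beverage"},
    "dian-xin": {"food"},
    "cha dian-xin": {"food"},
    "cha-dian": {"food"},
    "xin": {"body-part"},
}

# The six syntactic analyses as (label, agent, object); None = unexpressed.
ANALYSES = [
    ("a1", None, "cha dian-xin"),
    ("a2", "cha dian-xin", None),
    ("a3", "dian-xin", "cha"),
    ("a4", "cha", "dian-xin"),
    ("b1", "xin", "cha-dian"),
    ("b2", "cha-dian", "xin"),
]

def satisfies_chi(agent, obj):
    """Selection restriction for chi (eat): an animate agent eats food.
    An unexpressed role is not checked."""
    agent_ok = agent is None or "animate" in FEATURES[agent]
    object_ok = obj is None or "food" in FEATURES[obj]
    return agent_ok and object_ok

def filter_analyses(analyses):
    return [label for label, agent, obj in analyses
            if satisfies_chi(agent, obj)]

if __name__ == "__main__":
    print(filter_analyses(ANALYSES))
```

Under these assumed features only (a1) passes the check.  The filter operates on structures that the grammar has already built, which is the sense in which the structural analysis provides the basis for semantics-oriented disambiguation.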

The pragmatics-oriented disambiguation is required for the case where ambiguity remains after the application of both structural and semantic constraints.[16]  The sentences containing this type of ambiguity are genuinely ambiguous within the sentence boundary, as shown with the multiple parses in (2-25) for the hidden ambiguity and (2-26) for the overlapping ambiguity below.

(2-25.)         他喜欢烤白薯 ta xi huan kao bai shu.

(a)      ta       | xi-huan    | kao          | bai-shu.
          he      | like           | bake         | sweet-potato
[S [NP ta] [VP [V xi-huan] [VP [V kao] [NP bai-shu]]]]
He likes baking sweet potatoes.

(b)      ta       | xi-huan    | kao-bai-shu.
          he      | like           | baked-sweet-potato
[S [NP ta] [VP [V xi-huan] [NP kao-bai-shu]]]
He likes the baked sweet potatoes.

(2-26.)         研究生命不好 yan jiu sheng ming bu hao

(a)      yan-jiu-sheng       | ming         | bu   | hao.
          graduate student | destiny     | not | good
[S [NP yan-jiu-sheng] [S [NP ming] [AP bu hao]]]
The destiny of graduate students is not good.

(b)      yan-jiu        | sheng-ming        | bu   | hao.
          study          | life                     | not | good
[S [VP yan-jiu sheng-ming] [AP bu hao]]
It is not good to study life.

An important distinction should be made among these classes of disambiguation.  Some ambiguity must be solved in order to get a reading during analysis.  Other ambiguity can be retained in the form of multiple parses, corresponding to multiple readings.  In either case, it demonstrates that at least a grammar (syntax and morphology) is required.  The structure-oriented ambiguity belongs to the former, and can be handled by the appropriate structural analysis.  The semantics-oriented ambiguity and the pragmatics-oriented ambiguity belong to the latter, so multiple parses are a way out.  The examples for different classes of ambiguity show that the structural analysis is the foundation for handling ambiguity problems in word identification.  It provides possible structures for the semantic constraints or pragmatic constraints to work on.

In fact, the resolution of segmentation ambiguity in Chinese word identification is but a special case of the resolution of structural ambiguity for NLP in general.  Grammatical analysis has been routinely used to resolve, and/or prepare the basis for resolving, structural ambiguity such as PP attachment.[17]

2.5. Summary

The most important discovery in the field of Chinese word identification presented in this chapter is that the resolution of both types of segmentation ambiguity involves the analysis of the entire input string.  This means that the availability of a grammar is the key to the solution of this problem.

This chapter has also examined the ambiguity involvement in productive word formation and reached the following conclusion.  A grammar for morphological analysis as well as for sentential analysis is required for an adequate treatment of this problem.  This establishes the foundation for the general design of CPSG95 as consisting of morphology and syntax in one grammar formalism. [18]

The study of the morpho-syntactic borderline problems shows that a sophisticated grammar design is called for so that information between morphology and syntax can be effectively coordinated.  This is the work to be presented in Chapter III and Chapter IV.  It also demonstrates that each individual borderline problem should be studied carefully in order to reach a morphological or syntactic analysis which maximally captures linguistic generality.  This study will be pursued in Chapter V and Chapter VI.

 

 

----------------------------------------------------------

[1]  Constraints beyond morphology and syntax can be implemented as subsequent modules, or “filters”, in order to select the correct analysis when morpho-syntactic analysis leads to multiple results (parses).  Alternatively, such constraints can also be integrated into CPSG95 as components parallel to, and interacting with, morphology and syntax.  W. Li (1996) illustrates how semantic selection restriction can be integrated into syntactic constraints in CPSG95 to support Chinese parsing.

[2] In theory, if discourse is integrated in the underlying grammar, the input can be a unit larger than a sentence, say, a paragraph or even a full text.  But this will depend on further development in discourse theory and its formalization.  Most grammars in current use assume sentential analysis.

[3]  Similar examples for the overlapping ambiguity string will be shown in 2.1.2.

[4]  But in Ancient Chinese, a numeral can freely combine with countable nouns.

[5] These two readings in written Chinese correspond to an obvious difference in Spoken Chinese:  ge (CLA) in (g1) is weakened in pronunciation, marked by the dropping of the tone, while in (g2) it reads with the original 4th tone emphatically.

[6] It is likely that what they have found corresponds to Guo’s discovery of “one tokenization per source” (Guo 1998).  Guo’s finding is based on his experimental study involving domain (“source”) evidence and seems to account for the phenomena better.  In addition, Guo’s strategy in his proposal is also more effective, reported to be one of the best strategies for disambiguation in word segmenters.

[7] According to He, Xu and Sun (1991)'s statistics on a corpus of 50833 Chinese characters, the overlapping ambiguous strings make up 84.10%, and the hidden ambiguous strings 15.90%, of all ambiguous strings.

[8] Guo (1997b) goes to the other extreme to hypothesize that “every tokenization is possible”.  Although this seems to be too strong a statement, the investigation in this chapter shows that, at least domain independently, local context is very unreliable for making a tokenization decision one way or the other.

[9] However, this assumption may become statistically valid within a specific domain or source, as examined in Guo (1998).  But Guo did not give an operational definition of source/domain.  Without such a definition, it is difficult to decide where to collect the domain-specific information required for disambiguation based on the principle one tokenization per source, as proposed by Guo (1998).

[10] This distinction is crucial in the theories of Liang (1987) and He,  Xu and Sun (1991).

[11] This work is now defined as one fundamental task, called Named Entity tagging, in the world of information extraction (MUC-7 1998).  There have been great advances in developing Named Entity taggers both for Chinese (e.g. Yu et al 1997; Chen et al 1997) and for other languages.

[12] That is what was actually done with the CPSG95 implementation.  More precisely, the family name expects a special sign with hanzi-length of 1 or 2 to form a full name candidate.

[13] A typical, sophisticated word segmenter making reference to knowledge beyond syntax is presented in Gan (1995).

[14] This is in fact one very common construction in Chinese in the form of NP1 NP2 Predicate.  Other examples include ta (he) tou (head) tong (ache): ‘he has a head-ache’ and ta (he) shen-ti (body) hao (good): 'he is good in health'.

[15] For the detailed analysis of these constructions, see W. Li (1996).

[16] It seems that it may be more appropriate to use terms like global disambiguation or discourse-oriented disambiguation instead of the term pragmatics-oriented disambiguation for the relevant phenomena.

[17] It seems that some PP attachment problems can be resolved via grammatical analysis alone.  For example, put something on the table; found the key to that door.  Others require information beyond syntax (semantics, discourse, etc.) for a proper solution.  For example, see somebody with a telescope.  In either case, the structural analysis provides a basis.  The same holds for disambiguation in Chinese word identification.

[18] In fact, once morphology is incorporated in the grammar, the identification of both vocabulary words and non-listable words becomes a by-product during the integrated morpho-syntactic analysis.  Most ambiguity is resolved automatically and the remaining ambiguity will be embodied in the multiple syntactic trees as the results of the analysis.  This has been shown to be true and viable by W. Li (1997, 2000) and Wu and Jiang (1998).

 


PhD Thesis: Chapter I Introduction

1.0. Foreword

This thesis addresses the issue of the Chinese morpho-syntactic interface.  This study is motivated by the need for a solution to a series of long-standing problems at the interface.  These problems pose challenges to an independent morphology system or a separate word segmenter as there is a need to bring in syntactic information in handling these problems.

The key is to develop a Chinese grammar which is capable of representing sufficient information from both morphology and syntax.  On the basis of the theory of Head-Driven Phrase Structure Grammar (Pollard and Sag 1987, 1994), the thesis will present the design of a Chinese grammar, named CPSG95 (for Chinese Phrase Structure Grammar).  The interface between morphology and syntax is defined system internally in CPSG95.  For each problem, arguments will be presented for the linguistic analysis involved.  A solution to the problem will then be formulated based on the analysis.  The proposed solutions are formalized and implementable;  most of the proposals have been tested in the implementation of CPSG95.

In what follows, Section 1.1 reviews some important developments in the field of Chinese NLP (Natural Language Processing).  This serves as the background for this study.  Section 1.2 presents a series of long-standing problems related to the Chinese morpho-syntactic interface.  These problems are the focus of this thesis.  Section 1.3 introduces CPSG95 and sketches its morpho-syntactic interface by illustrating an example of the proposed morpho-syntactic analysis.

1.1. Background

This section presents the background for the work on the interface between morphology and syntax in CPSG95.  Major developments in Chinese tokenization and parsing, the two areas which are related to this study, will be reviewed.

1.1.1. Principle of Maximum Tokenization and Critical Tokenization

This section reviews the influential Theory of Critical Tokenization (Guo 1997a) and its implications.  The point to be made is that the results of Guo’s study can help us to select the tokenization scheme used in the lexical lookup phase in order to create the basis for morpho-syntactic parsing.

Guo (1997a,b,c) has conducted a comprehensive formal study on tokenization schemes in the framework of formal languages, including deterministic tokenization such as FT (Forward Maximum Tokenization) and BT (Backward Maximum Tokenization), and non-deterministic tokenization such as CT (Critical Tokenization), ST (Shortest Tokenization) and ET (Exhaustive Tokenization).  In particular, Guo has focused on the study of the rich family of tokenization strategies following the general Principle of Maximum Tokenization, or “PMT”.  Except for ET, all the tokenization schemes mentioned above are PMT-based.

In terms of lexical lookup, PMT can be understood as a heuristic by which a longer match overrides all shorter matches.  PMT has been widely adopted (e.g. Webster and Kit 1992; Guo 1997b) and is believed to be “the most powerful and commonly used disambiguation rule” (Chen and Liu 1992:104).
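The two deterministic PMT schemes, FT and BT, are commonly realized as greedy longest-match scans from the left and from the right.  The following is a minimal sketch of that idea; the toy lexicon, the `max_len` cap and the fallback of treating an unlisted single syllable as a token are assumptions of the example, and pinyin syllables stand in for hanzi.

```python
# Toy lexicon (hypothetical) of multi- and single-syllable words.
LEXICON = {
    ("da", "xue"), ("da", "xue", "sheng"), ("sheng", "huo"), ("huo",),
}

def forward_max(syllables, lexicon, max_len=4):
    """FT: scan left to right, always taking the longest listed word."""
    tokens, i = [], 0
    while i < len(syllables):
        for k in range(min(max_len, len(syllables) - i), 0, -1):
            cand = tuple(syllables[i:i + k])
            if k == 1 or cand in lexicon:  # single syllable as fallback
                tokens.append("-".join(cand))
                i += k
                break
    return tokens

def backward_max(syllables, lexicon, max_len=4):
    """BT: scan right to left, always taking the longest listed word."""
    tokens, j = [], len(syllables)
    while j > 0:
        for k in range(min(max_len, j), 0, -1):
            cand = tuple(syllables[j - k:j])
            if k == 1 or cand in lexicon:  # single syllable as fallback
                tokens.insert(0, "-".join(cand))
                j -= k
                break
    return tokens

if __name__ == "__main__":
    text = "da xue sheng huo".split()
    print(forward_max(text, LEXICON))
    print(backward_max(text, LEXICON))
```

On the overlapping ambiguity string da xue sheng huo the two scans commit to different segmentations, da-xue-sheng | huo and da-xue | sheng-huo, and each returns exactly one of them regardless of the rest of the sentence.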

Shortest Tokenization, or “ST”, first proposed by X. Wang (1989), is a non-deterministic tokenization scheme following the Principle of Maximum Tokenization.  A segmented token string is shortest if it contains the minimum number of vocabulary words possible - “short” in the sense of the shortest word string length.

Exhaustive Tokenization, or “ET”, does not follow PMT.  As its name suggests, the ET set is the universe of all possible segmentations consisting of all candidate vocabulary words.  The mathematical definition of ET is contained in Definition 4 for “the character string tokenization operation”  in Guo (1997a).
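As an informal illustration of ET and ST (not Guo's formal definitions), the ET set can be enumerated recursively and the ST set selected from it by word count.  The toy lexicon and function names are assumptions of this sketch.

```python
# Toy lexicon (hypothetical); pinyin syllables stand in for hanzi.
LEXICON = {
    ("da", "xue"), ("da", "xue", "sheng"),
    ("sheng", "huo"), ("sheng",), ("huo",),
}

def exhaustive(syllables, lexicon):
    """ET: every segmentation of the input into listed words."""
    if not syllables:
        return [[]]
    results = []
    for k in range(1, len(syllables) + 1):
        head = tuple(syllables[:k])
        if head in lexicon:
            for rest in exhaustive(syllables[k:], lexicon):
                results.append(["-".join(head)] + rest)
    return results

def shortest(syllables, lexicon):
    """ST: the segmentations with the minimum number of words."""
    candidates = exhaustive(syllables, lexicon)
    if not candidates:
        return []
    fewest = min(len(tokens) for tokens in candidates)
    return [tokens for tokens in candidates if len(tokens) == fewest]

if __name__ == "__main__":
    text = "da xue sheng huo".split()
    print(exhaustive(text, LEXICON))
    print(shortest(text, LEXICON))
```

With this lexicon ET yields three segmentations of da xue sheng huo, while ST keeps the two two-word ones, which shows why ST is non-deterministic.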

The most important concept in Guo’s theory is Critical Tokenization, or “CT”.  Guo’s definition is based on the partially ordered set, or ‘poset’, theory in discrete mathematics (Kolman and Busby 1987).  Guo has found that different segmentations can be linked by the cover relationship to form a poset.   For example, abc|d and ab|cd both cover ab|c|d, but they do not cover each other.

Critical tokenization is defined as the set of minimal elements, i.e. tokenizations which are not covered by other tokenizations, in the tokenization poset.  Guo has given proof for a number of mathematical properties involving critical tokenization.  The major ones are listed below.

  • Every tokenization is a subtokenization of (i.e. covered by) a critical tokenization, but no critical tokenization has a true supertokenization;
  • The tokenization variations following the Principle of Maximum Tokenization proposed in the literature, such as FT, BT, FT+BT and ST, are all true sub-classes of CT.

Based on these properties, Guo concludes that CT is the precise mathematical description of the widely adopted Principle of Maximum Tokenization.
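The cover relationship and the resulting critical tokenizations can be illustrated on the abstract string abcd used above.  The sketch assumes a simplified reading of the relation, namely that one tokenization covers another when the second can be obtained from the first by further splitting; the lexicon and helper names are hypothetical, and Guo (1997a) should be consulted for the formal definition.

```python
# Hypothetical lexicon over the abstract string "abcd".
LEXICON = {"ab", "abc", "cd", "c", "d"}

def exhaustive(text, lexicon):
    """All segmentations of `text` into listed words."""
    if not text:
        return [[]]
    results = []
    for k in range(1, len(text) + 1):
        if text[:k] in lexicon:
            for rest in exhaustive(text[k:], lexicon):
                results.append([text[:k]] + rest)
    return results

def boundaries(tokens):
    """Internal word-boundary positions of a tokenization."""
    positions, pos = set(), 0
    for token in tokens[:-1]:
        pos += len(token)
        positions.add(pos)
    return positions

def covers(x, y):
    """x covers y if y only adds boundaries to those of x (assumed reading)."""
    return boundaries(x) < boundaries(y)

def critical(text, lexicon):
    """Tokenizations not covered by any other tokenization."""
    candidates = exhaustive(text, lexicon)
    return [t for t in candidates
            if not any(covers(other, t) for other in candidates)]

if __name__ == "__main__":
    print(critical("abcd", LEXICON))
```

Here abc|d and ab|cd both cover ab|c|d but not each other, so they form the critical set, while ab|c|d is dropped.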

Guo (1997c) further reports his experimental studies on relative merits of these tokenization schemes in terms of three quality indicators, namely, perplexity, precision and recall.  The perplexity of a tokenization scheme gives the expected number of tokenized strings generated for average ambiguous fragments.  The precision score is the percentage of correctly tokenized strings among all possible tokenized strings while the recall rate is the percentage of correctly tokenized strings generated by the system among all correctly tokenized strings.  The main results are:

  • Both FT and BT can achieve perfect unity perplexity but have the worst precision and recall;
  • ET achieves perfect recall but has the lowest precision and highest perplexity;
  • ST and CT are simple with good computational properties.  Between the two, ST has lower perplexity but CT has better recall.
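The three indicators can be made concrete with a small calculation.  The sketch below follows one plausible reading of the prose definitions above, with exactly one correct tokenization per ambiguous fragment; the data are invented, and the exact formulas in Guo (1997c) may differ.

```python
def evaluate(candidates, gold):
    """candidates: for each ambiguous fragment, the tokenizations a scheme
    generates; gold: the single correct tokenization of each fragment."""
    total = sum(len(c) for c in candidates)
    hits = sum(1 for c, g in zip(candidates, gold) if g in c)
    return {
        "perplexity": total / len(candidates),  # mean candidates per fragment
        "precision": hits / total,              # correct among generated
        "recall": hits / len(gold),             # correct that were generated
    }

if __name__ == "__main__":
    # Invented data: two fragments, tokenizations written as strings.
    gold = ["ab|cd", "e|fg"]
    deterministic = [["ab|cd"], ["ef|g"]]                 # one guess each
    exhaustive = [["ab|cd", "a|b|cd"], ["ef|g", "e|fg", "e|f|g"]]
    print(evaluate(deterministic, gold))
    print(evaluate(exhaustive, gold))
```

On this invented data the deterministic scheme has unity perplexity but misses one fragment, whereas the exhaustive scheme reaches full recall at the cost of higher perplexity and lower precision, the same pattern as in the results listed above.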

Guo (1997c) concludes, “for applications with moderate performance requirement, ST is the choice;  otherwise, CT is the solution.”

In addition to the above theoretical and experimental study, Guo (1997b) also develops a series of optimized algorithms for the implementation of these generation schemes.

The relevance and significance of Guo’s achievement to the research in this thesis lie in the following aspect.  The research on Chinese morpho-syntactic interface is conducted with the goal of  supporting Chinese morpho-syntactic parsing.  The input to a Chinese morpho-syntactic parser comes directly from the lexical lookup of the input string based on some non-deterministic tokenization scheme (W. Li 1997, 2000; Wu and Jiang 1998).  Guo’s research and algorithm development can help us to decide which tokenization schemes to use depending on the tradeoff between precision, recall and perplexity or the balance between reducing the search space and minimizing premature commitment.

1.1.2. Monotonicity Principle and Task-driven Segmentation

This section reviews recent developments in Chinese analysis systems involving the interface between morphology and syntax.  The research on the Chinese morpho-syntactic interface in this thesis echoes this new development in the field of Chinese NLP.

In the last few years, projects have been proposed for implementing a Chinese analysis system which integrates word identification and parsing.  Both rule-based systems and statistical models have been attempted with good results.

Wu (1998) has addressed the drawbacks of the conventional practice on the development of Chinese word segmenters, in particular, the problem of premature commitment in handling segmentation ambiguity.  In his A Position Statement on Chinese Segmentation, Wu proposed a general principle:

Monotonicity Principle for segmentation:

A valid basic segmentation unit (segment or token) is a substring that no processing stage after the segmenter needs to decompose.

The rationale behind this principle is to prevent premature commitment and to avoid repetition of work between modules.   In fact, traditional word segmenters are modules independent of subsequent applications (e.g. parsing).  Due to the lack of means for accessing sufficient grammar knowledge, they suffer from premature commitment and repetition of work, hence violating this principle.

Wu’s proposal of the monotonicity principle is a challenge to the Principle of Maximum Tokenization.  These two principles are not always compatible.  Due to the existence of hidden ambiguity (see 1.2.1), the PMT-based segmenters by definition are susceptible to premature commitment leading to “too-long segments”.  If the target application is designed to solve the hidden ambiguity problem in the segments, “decomposition” of some segments is unavoidable.

In line with the Monotonicity Principle, Wu (1998) proposes an alternative approach which he claims “eliminates the danger of premature commitment”, namely task-driven segmentation.  Wu (1998) points out, “Task-driven segmentation is performed in tandem with the application (parsing, translating, named-entity labeling, etc.) rather than as a preprocessing stage.  To optimize accuracy, modern systems make use of integrated statistically-based scores to make simultaneous decisions about segmentation and parsing/translation.”  The HKUST parser, developed by Wu’s group, is such a statistical system employing the task-driven segmentation.

As for rule-based systems, similar practice of integrating word identification and parsing has also been explored.  W. Li (1997, 2000) proposed that the results of an ET-based lexical lookup directly feed the parser for the hanzi-based parsing.  More concretely, morphological rules are designed to build word internal structure for productive morphology and non-productive morphology is lexicalized via entry enumeration.[1]  This approach is the background for conducting the research on Chinese morpho-syntactic interface for CPSG95 in this dissertation.

The Chinese parser on the platform of multilingual NLPWin developed by Microsoft Research also integrates word identification and parsing (Wu and Jiang 1998).  They also use a hand-coded grammar for word identification as well as for sentential parsing.  The unique part of this system is the use of a certain lexical constraint on ET in the lexical lookup phase.  This effectively reduces the parsing search space as well as the number of syntactic trees produced by the parser, with minimal sacrifice in the recall of tokenization.  This tokenization strategy provides a viable alternative to the PMT-based tokenization schemes like CT or ST in terms of the overall balance between precision, recall and perplexity.

The practice of simultaneous word identification and parsing in implementing a Chinese analysis system calls for the support of a grammar (or statistical model) which contains sufficient information from both morphology and syntax.  The research on Chinese morpho-syntactic interface in this dissertation aims at providing this support.

1.2. Morpho-syntactic Interface Problems

This section presents a series of outstanding problems in Chinese NLP which are related to the morpho-syntactic interface.  One major goal of this dissertation is to argue for the proposed analyses of the problems and to provide solutions to them based on the analyses.

Sun and Huang (1996) have reviewed numerous cases which challenge the existing word segmenters.  As many of these cases call for an exchange of information between morphology and syntax, an appropriate solution can hardly be reached within the module of a separate word segmenter.  Three major problems at issue are presented below.

1.2.1. Segmentation ambiguity

This section presents the long-standing problem in Chinese tokenization, i.e. the resolution of the segmentation ambiguity.  Within a separate word segmenter, resolving the segmentation ambiguity is a difficult, sometimes hopeless job.  However, the majority of ambiguity can be resolved when a grammar is available.

Segmentation ambiguity has been the focus of extensive study in Chinese NLP for the last decade (e.g. Chen and Liu 1992; Liang 1987;  Sproat, Shih, Gale and Chang 1996; Sun and Huang 1996; Guo 1997b).  There are two types of segmentation ambiguities (Liang 1987; Guo 1997b):  (i) overlapping ambiguity:  e.g. da-xue | sheng-huo vs. da-xue-sheng | huo as shown in (1-1) and (1-2);  and (ii) hidden ambiguity:  ge-ren vs. ge | ren, as shown in (1-3) and (1-4).

(1-1.) 大学生活很有趣
da-xue         | sheng-huo          | hen          | you-qu
university    | life                     | very          | interesting
The university life is very interesting.

(1-2.)  大学生活不下去了
da-xue-sheng                 | huo          | bu | xia-qu      | le
university student          | live           | not | down        | LE
University students can no longer make a living.

(1-3.)  个人的力量
ge-ren         | de   | li-liang
individual   | DE  | power
the power of an individual

(1-4.) 三个人的力量
san    |  ge            | ren           | de   | li-liang
three  | CLA          | person      |DE   | power
the power of three persons

These examples show that resolving segmentation ambiguity requires a larger syntactic context and grammatical analysis.  Chapter II (2.1) gives further arguments and evidence for the following conclusion:  both types of segmentation ambiguity are structural by nature and require sentential analysis for their resolution.  Without access to a grammar, a word segmenter faces an upper bound on the precision of word identification, no matter how sophisticated its tokenization algorithm.  In an integrated system, however, word identification becomes a natural by-product of parsing (W. Li 1997, 2000;  Wu and Jiang 1998).  More precisely, the majority of ambiguity can be resolved automatically during morpho-syntactic parsing;  the remaining ambiguity can be made explicit in the form of multiple syntactic trees.[2]  For this to happen, the parser requires reliable support from a grammar which contains both morphology and syntax.
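
To make the two ambiguity types concrete, here is a minimal Python sketch that enumerates every lexicon-compatible segmentation of a character string. The toy lexicon and the function name `segmentations` are assumptions made for illustration only; they do not come from any segmenter discussed here or from CPSG95.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into words found in `lexicon`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


# Hypothetical toy lexicon, just large enough for examples (1-1) to (1-4).
LEXICON = {"大学", "大学生", "生活", "活", "个人", "个", "人"}

overlapping = segmentations("大学生活", LEXICON)  # overlapping ambiguity
hidden = segmentations("个人", LEXICON)           # hidden ambiguity
print(overlapping)
print(hidden)
```

Each string comes back with two candidate segmentations. The lexicon alone cannot choose between them, which is why the choice is left to sentential analysis.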

1.2.2. Productive Word Formation

Non-listable words created via productive morphology pose another challenge (Sun and Huang 1996).  Two major problems are involved:  (i) identifying words that are not listed in the lexicon;  (ii) possible segmentation ambiguity.

One important method of productive word formation is derivation.  For example, the derived word 可读性 ke-du-xing (-able-read-ness: readability) is created via the morphological rules informally formulated below.

(1-5.) derivation rules

ke + X (transitive verb) --> ke-X (adjective, semantics: X-able)

Y (adjective or verb) + xing --> Y-xing (abstract noun, semantics: Y-ness)

Rules like the above have to be incorporated properly in order to correctly identify such non-listable words.  However, there has been little research in the literature on what formalism should be adopted for Chinese morphology  and how it should be interfaced to syntax.
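
The two rules in (1-5) can be sketched as a small Python function that derives a category for an unlisted word from its parts. This is only an illustrative sketch: the stem lexicon, the category labels and the function `derive` are hypothetical, and they are not the formalism adopted in CPSG95.

```python
# Hypothetical stem lexicon: form -> category ("Vt" = transitive verb).
STEMS = {"读": "Vt"}


def derive(word, stems=STEMS):
    """Return (category, gloss) for `word`, or None if no rule applies."""
    if word in stems:
        return stems[word], word
    # Y (adjective or verb) + xing --> Y-xing (abstract noun, Y-ness)
    if word.endswith("性") and len(word) > 1:
        inner = derive(word[:-1], stems)
        if inner and inner[0] in {"A", "V", "Vt"}:
            return "N", f"({inner[1]})-ness"
    # ke + X (transitive verb) --> ke-X (adjective, X-able)
    if word.startswith("可") and len(word) > 1:
        inner = derive(word[1:], stems)
        if inner and inner[0] == "Vt":
            return "A", f"{inner[1]}-able"
    return None


print(derive("可读"))
print(derive("可读性"))
```

The recursion mirrors the category change from V to A to N. A word-internal analysis like this still says nothing about how the derived word interacts with the rest of the sentence.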

To complicate matters further, ambiguity may also be involved in productive word formation.  When segmentation ambiguity is involved in word formation, there is always a danger of wrongly applying morphological rules.  For example, 吃头 chi-tou (worth eating) is a derived word (transitive verb + suffix tou);   however, it can also be segmented as two separate tokens chi (eat) | tou (CLA), as shown in (1-6) and (1-7) below.

(1-6.)  这道菜没有吃头
zhe    | dao           | cai            | mei-you    | chi-tou
this    | CLA          | dish         | not have   | worth-of-eating
This dish is not worth eating.

(1-7.) 他饿得能吃头牛
ta       | e               | de             | neng        | chi  | tou           | niu
he      | hungry     | DE3         | can           | eat  | CLA                   | ox
He is so hungry that he can eat an ox.

To resolve this segmentation ambiguity, as indicated in 1.2.1, structural analysis of the complete sentence is required.  An independent morphology system or a separate word segmenter cannot handle this problem without access to syntactic knowledge.

1.2.3. Borderline Cases between Morphology and Syntax

It is widely acknowledged that there is a remarkable gray area between Chinese morphology and Chinese syntax (L. Li 1990; Sun and Huang 1996).  Two typical cases are described below.  The first is the phenomenon of Chinese separable verbs.  The second case involves interfacing derivation and syntax.

Chinese separable verbs are usually in the form of V+N and V+V or V+A.  These idiomatic combinations are long-standing problems at the interface between compounding and syntax in Chinese grammar (L. Wang 1955; Z. Lu 1957; Lü 1989; Lin 1983; Q. Li 1983; L. Li 1990; Shi 1992; Zhao and Zhang 1996).

The separable verb 洗澡 xi zao (wash‑bath: take a bath) is a typical example.  Many native speakers regard xi zao as one word (verb), but the two morphemes are separable.  In fact, xi+zao shares the syntactic behavior and the pattern variations with the syntactic transitive combination V+NP:  not only can aspect markers appear between xi and zao,  but this structure can be passivized and topicalized as well.  The following is an example of topicalization (of long distance dependency) for xi zao.

(1-8.)(a)       我认为他应该洗澡
wo     ren-wei        ta       ying-gai       xi zao.
I         think           he      should        wash-bath
I think that he should take a bath.

(b)      澡我认为他应该洗
zao    wo     ren-wei        ta       ying-gai       xi.
bath  I         think           he      should        wash
The bath I think that he should take.

Although xi zao behaves like a syntactic phrase, it is a vocabulary word in the lexicon due to its idiomatic nature.  As a result, almost all word segmenters output xi-zao in (1-8a) as one word while treating the two signs[3] in (1-8b) as two words.  Thus the relationship between the separated use of the idiom and the non-separated use is lost.

The second case represents a considerable number of borderline cases often referred to as  ‘quasi-affixes’.  These are morphemes like 前 qian (former, ex-) in words like 前夫 qian-fu (ex-husband), 前领导 qian-[ling-dao] (former boss) and -盲 mang (person who has little knowledge of) in words like 计算机盲 [ji-suan-ji]-mang (computer layman), 法盲 fa-mang (person who has no knowledge of laws).

It is observed that 'quasi-affixes' are structurally not different from other affixes.  The major difference between 'quasi-affixes' and the few generally honored ('genuine') affixes like the nominalizer 性 -xing (-ness) lies mainly in the following aspect.  The former retain some 'solid' meaning while the latter are more functionalized.  Therefore, the key to this problem seems to lie in the appropriate way of coordinating the semantic contribution of the derived words using 'quasi-affixes' to the building of the semantics for the entire sentence.  This is an area which has not received enough investigation in the field of Chinese NLP.  While many word segmenters have included some type of derivational processing for a few typical affixes, few systems demonstrate where and how to handle these 'quasi-affixes'.

1.3. CPSG95:  HPSG-style Chinese Grammar in ALE

To investigate the interaction between morphological and syntactic information, it is important to develop a Chinese grammar which incorporates morphology and syntax in the same formalism.  This section gives a brief presentation on the design and background of CPSG95 (including lexicon).

1.3.1. Background and Overview of CPSG95

Shieber (1986) distinguishes two types of grammar formalism:  (i) theory-oriented formalism;  (ii) tool-oriented formalism.  In general, a language-specific grammar turns to a theory-oriented formalism for its foundation and a tool-oriented formalism for its implementation.  The work on CPSG95 is developed in the spirit of the theory-oriented formalism Head-driven Phrase Structure Grammar (HPSG, proposed by Pollard and Sag 1987).  The tool-oriented formalism used to implement CPSG95 is the Attribute Logic Engine (ALE, developed by Carpenter and Penn 1994).

The unique feature of CPSG95 is its incorporation of Chinese morphology in the HPSG framework.[4]  Like other HPSG grammars, CPSG95 is a heavily lexicalized unification grammar.  It consists of two parts:  a minimized general grammar and an information-enriched lexicon.  The general grammar contains a small number of Phrase Structure (PS) rules, roughly corresponding to the HPSG schemata tuned to the Chinese language.[5]  The syntactic PS rules capture the subject-predicate structure, complement structure, modifier structure, conjunctive structure and long-distance dependency.  The morphological PS rules cover morphological structures for productive word formation.  In one version of CPSG95 (its source code is  shown in APPENDIX I), there are nine PS rules:  seven syntactic rules and two morphological rules.

In CPSG95, potential morphological structures and potential syntactic structures are both lexically encoded.  In syntax, a word can expect (subcat-for or mod in HPSG terms) another sign to form a phrase.   Likewise, in Chinese morphology, a morpheme can expect another sign to form a word.[6]

One important modification of HPSG in designing CPSG95 is to use an atomic approach with separate features for each complement to replace the list design of obliqueness hierarchy among complements.  The rationale and arguments for this modification are presented in Section 3.2.3 in Chapter III.

1.3.2. Illustration

The example shown in (1-9) demonstrates the morpho-syntactic analysis  in CPSG95.

(1-9.) 这本书的可读性
zhe    ben    shu    de      ke               du      xing
this    CLA   book  DE     AF:-able      read   AF:-ness
this book’s readability
(Note: CLA for classifier; DE for particle de; AF for affix.)

Figure 1 illustrates the tree structure built by the morphological PS rules and the syntactic PS rules in CPSG95.


Figure 1. Sample Tree Structure for CPSG95 Analysis

As shown, the tree embodies both morphological analysis (the sub-tree for ke-du-xing) and syntactic analysis (the NP structure).  The results of the morphological analysis (the category change from V to A and to N and the building of semantics, etc.) are readily accessible in building syntactic structures.

1.4. Organization of the Dissertation

The remainder of this dissertation is divided into six chapters.

Chapter II presents arguments for the need to involve syntactic analysis for a proper solution to the targeted morpho-syntactic problems.   This establishes the foundation on which CPSG95 is based.

Chapter III presents the design of CPSG95.  In particular, the expectation feature structures will be defined.  They are used to encode the lexical expectation of both morphological and syntactic structures.  This design provides the necessary means for formally defining the Chinese word and the interface of morphology, syntax and semantics.

Chapter IV is on defining the Chinese word.  This is generally recognized as a basic issue in discussing the Chinese morpho-syntactic interface.  The investigation leads to a formalization of wordhood and a coherent, system-internal definition of the division of labor between morphology and syntax.

Chapter V studies Chinese separable verbs.  It discusses wordhood judgments for each type of separable verb based on their distribution.   The corresponding morphological or syntactic solutions will then be presented.

Chapter VI investigates some outstanding problems of Chinese derivation and its interface with syntax.  It will be demonstrated that the general approach to Chinese derivation in CPSG95 works both for typical cases of derivation and the two special problems, namely 'quasi-affix' phenomena and zhe-affixation.

The last chapter, Chapter VII, concludes this dissertation.  In addition to a concise review of what has been achieved, it also gives an account of the limitations of the present research and future research directions.

Finally, the three appendices give the source code of one version of the implemented CPSG95 and some tested results.[7]

 

--------------------------------------------------

[1] In line with the requirements by Chinese NLP, this thesis places emphasis on the analysis of productive morphology:  phenomena which are listable in the lexicon are not the major concern.  This is different from many previous works on Chinese morphology (e.g. Z. Lu 1957; Dai 1993) where the bulk of discussions is on unproductive morphemes (affixes or ‘bound stems’).

[2] Ambiguity which remains after sentential parsing may be resolved by using further semantic, discourse or pragmatic knowledge, or ‘filters’.

[3] In CPSG95 and other HPSG-style grammars, a ‘sign’ usually stands for the generalized notion of grammatical units such as morpheme, word, phrase, etc.

[4] Researchers have looked at the incorporation of the morphology of other natural languages in the HPSG framework (e.g. Type-based Derivation Morphology by Riehemann 1998).  Arguments for the inclusion of morphological features in the definition of sign will be presented in detail in Chapter III.

[5] Note that ‘phrase structure’ in terms like Phrase Structure Grammar (PSG) or Phrase Structure rules (PS rules) does not necessarily refer to structures of (syntactic) phrases. It stands for surface-based constituency structure, in contrast to, say, dependency structure in Dependency Grammar.  In CPSG95, some productive morphological structures are also captured by PS rules.

[6] Note that in this dissertation, the term expect is used as a more generalized notion than the terms subcat-for (subcategorize for) and mod (modify).  ‘Expect’ is intended to be applied to morphology as well as to syntax.

[7]  There are differences in technical details between the grammar proposed in this dissertation and the implemented version.  This is because any implemented version was tested at a given point in time, while this thesis evolved over a long period.  It is the author’s belief that readers (including those who want to follow the CPSG practice) benefit most when a version that was actually tested is given as it was.

 

[Related]

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

PhD Thesis: Chapter VII Concluding Remarks

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

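【Semantic Computing Group: An Adverbial Filling in an Attributive Shell】

Consider a set of examples:
“洗了一个痛快的澡” (took a thoroughly enjoyable bath)
“痛快地洗了一个澡” (thoroughly enjoyed taking a bath)
“洗澡洗得痛快” (bathed to one's heart's content)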

[Figure: t0708p]

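Fortunately, we have solved the collocation problem of the separable verb-object word 洗澡 (xi zao, take a bath): the attributive (Mod) and the adverbial (Adv) are both adjuncts, and both attach to the same verb 洗澡. Since some complements (Buyu) are adjuncts too, everything is in harmony. If one insisted on being strict and asked whether 痛快 (enjoyable) applies to 澡 (bath), to 洗 (wash), or to 洗澡, who cares? They all mean the same thing. English has something similar: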
live a happy life
live (a life) happily

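Bai:
do 了一个痛快的“洗澡” (did an enjoyable "bath-taking")
The program still has to care.

Me:
In cases like this, the program may choose just one path, or skip normalization altogether. At semantic grounding time, it is enough for the system to be robustly adaptive: Adv:happily OR Mod:happy.

Bai:
An adverbial filling in an attributive shell: the two have to meet somewhere.
“开了一个无聊的会” (held a boring meeting)
An engineer may not care, but an architect must give an account.
My point is that pseudo-attributives and pseudo-adverbials can be handled at the level of the formalism, without any extra burden.

Me:
do + Adjunct + core pred
A good deal of effort has already gone into normalizing these essentially identical expressions, as with 洗澡 in the figure earlier: whether Mod, Adv or Buyu, they are broadly adjuncts of the same nature:
adjunct 痛快 ----》 pred 洗澡

Bai:
“张三做出了一个追悔莫及的决定。” (Zhang San made a decision he bitterly regretted.)
“张三遇上了这个倒霉的天气。” (Zhang San ran into this rotten weather.)
倒霉的 (unlucky) modifies 天气 (weather), but it is not the weather that is unlucky.
Likewise, 追悔莫及的 (bitterly regretful) modifies 决定 (decision), but it is not the decision that is regretful.
The modification relation is decoupled from the slot-filling relation internal to the modifier.

Me:
追悔莫及 inherently has a slot for a human.
做出决定 (make a decision) also has a slot for a human.
Now the human (张三) is directly linked to 做出决定 (as its subject, S) and indirectly linked to 追悔莫及 (via 做出决定). Linking the human (张三) directly to 追悔莫及, which needs a human to fill its slot, is only one step away.

Bai:
This shows that once there is a 的-construction, it is 的 that uniformly deals with the modified element. As for who fills the slot inside the modifier, the modified element is merely one ordinary candidate. If it is not chosen, there is no need to force it; a better candidate is welcome to step in. That is why I have reservations about treating a word as important as 的 merely as x.

Me:
的 is a stepping stone. Once the syntactic tree is out, labeling it x as a token gesture is perhaps better than throwing it away.

Bai:
I have a better way of handling it; it is by no means just a stepping stone.

Me:
The key point is that the first sentence is one step away and the second is two steps away; it is almost impossible to go beyond two steps. In other words, in n-gram terms these are just semantic rules over bigrams or trigrams in the DAG, if one really wants to write them. As long as it can be shown that turning indirect links into direct links in the semantic middleware benefits applications, the work is very tractable.
A slot with a semantic requirement, and a candidate close at hand that happens to meet it: what is hard about that? Give me five minutes and I can hook up both links, and I guarantee it is no stopgap and causes no side effects. The reason this fine-grained semantic middleware work, though not difficult, has not all been done is that it is unclear how much benefit it would actually bring, even though in theory there is some.

Bai:
These suffixes are the same in almost every case.

Me:
Is this the result you want?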
[Figure: t0708r]

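Bai:
Exactly, that's it.

Me:
I'll run regression testing to check for side effects; if there are none, this trigram semantic slot-filling rule stays.

Me:
trigram
For this particular case, a linear 5-gram shrinks to a trigram over the graph.
In terms of combinatorial explosion, 5 versus 3 is a world of difference.
What is more, one can easily construct equally valid examples with dependencies longer than 5; this is the power of syntax.
More importantly, even if a linear system could afford 5-grams,
without structural support it would not dare to use them freely.

Bai:
Where would non-sparse data good enough for 5-grams come from?

Me:
We are saying the same thing: 5-gram data are bound to be sparse and cannot support long-distance selection. You cannot fill a slot just because one token needs a human and another token, four words away, happens to be human. In short, without structure this cannot be done.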
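
A minimal Python sketch of what such a structure-based slot-filling rule could look like. The dependency arcs, the relation labels (S, O, Mod), the feature sets and the function `fill_slots` are all assumptions made for illustration; this is not the rule format of the system discussed in the chat.

```python
# Hypothetical dependency arcs for 张三做出了一个追悔莫及的决定,
# written as (head, relation, dependent).
ARCS = [
    ("做出", "S", "张三"),
    ("做出", "O", "决定"),
    ("决定", "Mod", "追悔莫及"),
]
HUMAN = {"张三"}            # tokens carrying the feature "human"
NEEDS_HUMAN = {"追悔莫及"}  # modifiers with an unfilled human slot


def fill_slots(arcs, human, needs_human):
    """Link a modifier's human slot to the subject of the governing verb."""
    subject_of = {head: dep for head, rel, dep in arcs if rel == "S"}
    verb_of_object = {dep: head for head, rel, dep in arcs if rel == "O"}
    fills = []
    for head, rel, dep in arcs:
        if rel == "Mod" and dep in needs_human:
            # Walk from the modified noun up to its verb, then to the subject.
            verb = verb_of_object.get(head, head)
            subject = subject_of.get(verb)
            if subject in human:
                fills.append((dep, subject))
    return fills


print(fill_slots(ARCS, HUMAN, NEEDS_HUMAN))
```

The rule follows a short path in the graph and never looks at linear distance, so the particles between the two tokens play no part in the match.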

 

【Related】

【置顶:立委NLP博文一览】

《朝华午拾》总目录

NLP University

【立委NLP相关博文汇总一览】

NLP University 开张大吉

 《朝华午拾》电子版

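I have devoted myself to natural language processing (NLP, Natural Language Processing) for 30 years, with the aim of unimpeded communication, freedom of information, unity of language, and harmony of the world. From 30 years of experience I know well that to reach this aim we must enlighten newcomers, popularize science, and work together to build the Tower of Babel; hence I write to drum up the cause. Processing has not yet succeeded; comrades must keep up the effort.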

0. Latest AI/NLP posts

AIGC 潮流扑面而来,是顺应还是(无谓)抵抗呢?
美术新时代,视频展示
漫谈AI 模型生成图像
《李白宋梁130:从短语结构的词序基础约束到大模型向量空间的天马行空》
AI 正在不声不响渗透我们的生活
RPA 是任务执行器还是数字员工?
图灵测试其实已经过时了
《立委科普:自注意力机制解说》
《深层解析符号模型与深度学习预训练模型》(修订文字版)
NLP 新纪元来临了吗?
【随感:大数据时代的信息茧房和“自洗脑”】
推荐Chris Manning 论大模型,并附上相关讨论
[转载]转载:斯坦福Chris Manning: 大模型剑指通用人工智能?
《我看好超大生成模型的创造前途》
[转载]编译 Gary Marcus 最新著述:《深度学习正在撞南墙》
老司机谈NLP半自动驾驶,欢迎光临。
立委随笔:机器翻译,从学者到学员
关于NLP 落地以及冷启动的对话
《AI 随笔:从对张医生的综述抄袭指控谈起》 
《AI 随笔:观老教授Walid的神经网络批判有感》
从人类认知谈AI融合之不易
与AI老友再谈特斯拉自动驾驶
《AI 理性主义的终结是不可能的吗》
《马斯克AI自动驾驶的背后:软件的内伤,硬件的短板》
《王婆不卖瓜,特斯拉车主说自驾》
《AI 赚钱真心难》
NLP自选系列2020专栏连载
《语言形式的无中生有:从隐性到显性》

1. On NLP architecture and methodology

 
 
 

【立委科普:自然语言parsers是揭示语言奥秘的LIGO式探测仪】

泥沙龙笔记:漫谈语言形式

《泥沙龙笔记:沾深度神经的光,谈parsing的深度与多层》

【立委科普:语言学算法是 deep NLP 绕不过去的坎儿】

《OVERVIEW OF NATURAL LANGUAGE PROCESSING》

《NLP White Paper: Overview of Our NLP Core Engine》

White Paper of NLP Engine

【新智元笔记:工程语法和深度神经】

【新智元笔记:李白对话录 – RNN 与语言学算法】

《新智元笔记:再谈语言学手工编程与机器学习的自动编程》

《新智元笔记:对于 tractable tasks, 机器学习很难胜过专家》

《新智元笔记:【Google 年度顶级论文】有感》

《新智元笔记:NLP 系统的分层挑战》

《泥沙龙笔记:连续、离散,模块化和接口》

《泥沙龙笔记:parsing 的休眠反悔机制》

【立委科普:歧义parsing的休眠唤醒机制初探】

【泥沙龙笔记:NLP hard 的歧义突破】

【立委科普:结构歧义的休眠唤醒演义】

【新智元笔记:李白对话录 – 从“把手”谈起】

《新智元笔记:跨层次结构歧义的识别表达痛点》

立委科普:NLP 中的一袋子词是什么

一切声称用机器学习做社会媒体舆情挖掘的系统,都值得怀疑

立委科普:关键词革命

立委科普:关键词外传

《立委随笔:机器学习和自然语言处理》

【泥沙龙笔记:语法工程派与统计学习派的总结】

【科普小品:NLP 的锤子和斧头】

【新智元笔记:两条路线上的NLP数据制导】

《立委随笔:语言自动分析的两个路子》

Comparison of Pros and Cons of Two NLP Approaches

why hybrid? on machine learning vs. hand-coded rules in NLP

Why Hybrid?

钩沉:Early arguments for a hybrid model for NLP and IE

【李白对话录:你波你的波,我粒我的粒】

【泥沙龙笔记:学习乐观主义的极致,奇文共欣赏】

《泥沙龙笔记:铿锵众人行,parsing 可以颠覆关键词吗?》

泥沙龙笔记:铿锵三人行

《泥沙龙铿锵三人行:句法语义纠缠论》

【科普随笔:NLP主流的傲慢与偏见】

【科普随笔:NLP主流最大的偏见,规则系统的手工性】

再谈机器学习和手工系统:人和机器谁更聪明能干?

乔姆斯基批判

Chomsky’s Negative Impact

[转载]【白硕 – 穿越乔家大院寻找“毛毛虫”】

【新智元笔记:语法糖霜论不值得认真对待】

【科研笔记:NLP “毛毛虫” 笔记,从一维到二维】

【泥沙龙笔记:NLP 专门语言是规则系统的斧头】

【新智元:理论家的围墙和工程师的私货】

泥沙龙笔记:从乔姆斯基大战谷歌Norvig说起

【Church – 钟摆摆得太远(2):乔姆斯基论】

【NLP主流的反思:Church – 钟摆摆得太远(1):历史回顾】

【Church – 钟摆摆得太远(3):皮尔斯论】

【Church – 钟摆摆得太远(4):明斯基论】

【Church – 钟摆摆得太远(5):现状与结论】

《泥沙龙笔记:【钟摆摆得太远】高大上,但有偏颇》

自给自足是NLP王道

自然语言后学都应该看看白硕老师的“自然语言处理与人工智能”

语言创造简史

Notes on Building and Using Lexical Semantic Knowledge Bases

【NLP主流成见之二,所谓规则系统的移植性太差】

Domain portability myth in natural language processing (NLP)

【科普随笔:NLP的宗教战争?】

Church – 计算语言学课程的缺陷 (翻译节选)

【科普随笔:NLP主流之偏见重复一万遍成为反真理】

坚持四项基本原则,开发鲁棒性NLP系统

NLP 围脖:成语从来不是问题

NLP 是一个力气活:再论成语不是问题

立委围脖:对于用户来说,抓住老鼠就是好猫

《科普随笔:keep ambiguity untouched》

【科研笔记:NLP的词海战术】

在构筑一个模型时,枚举法是常用的必要的强盗分类

没有语言学的 CL 走不远

[转载]为什么谷歌搜索并不像广泛相信的那样主要采用机器学习?

手工规则系统的软肋在文章分类

老教授回函:理性主义回摆可能要再延迟10几年

每隔二十年振荡一次的钟摆要多长?

【系统不能太精巧,正如人不能太聪明】

《泥沙龙李白对话录:关于纯语义系统》

【泥沙龙笔记:语义可以绕过句法吗】

一袋子词的主流方法面对社交媒体捉襟见肘,结构分析是必由之路

《新智元:通用的机器人都是闹着玩的,有用的都是 domain 的》

SBIR Grants

 

2. On NLP analysis (parsing)

语义计算沙龙:Parsing 的数据结构和形式文法

【语义计算群:句法语义的萝卜与坑】

【语义计算群:李白侃中文parsing】

【语义计算群:借定语的壳装状语的瓤】

【语义计算群:带歧义或模糊前行,有如带病生存】

【一日一parsing:”钱是没有问题”】

【一日一parsing:休眠唤醒的好例子】

【一日一parse:长尾问题种种】

【语言学小品:送老婆后面的语言学】 

【一日一parsing:NLP应用可以对parsing有所包容】

泥沙龙笔记:骨灰级砖家一席谈,真伪结构歧义的对策(1/2)

泥沙龙笔记:骨灰级砖家一席谈,真伪结构歧义的对策(2/2)

【语义计算沙龙:巨头谷歌昨天称句法分析极难,但他们最强】

【语义计算沙龙:parsing 的鲁棒比精准更重要】

《语义计算沙龙:基本短语是浅层和深层parsing的重要接口》

【做 parsing 还是要靠语言学家,机器学习不给力】

《泥沙龙笔记:狗血的语言学》

【语义计算沙龙:关于汉语介词的兼语句型,兼论POS】

泥沙龙笔记:在知识处理中,很多时候,人不如机

《立委科普:机器可以揭开双关语神秘的面纱》

《泥沙龙笔记:漫谈自动句法分析和树形图表达》

泥沙龙笔记:语言处理没有文法就不好玩了

泥沙龙笔记:parsing 是引擎的核武器,再论NLP与搜索

泥沙龙笔记:从 sparse data 再论parsing乃是NLP应用的核武器

【立委科普:NLP核武器的奥秘】

【立委科普:语法结构树之美】

【立委科普:语法结构树之美(之二)】

【立委科普:自然语言理解当然是文法为主,常识为辅】

【语义计算沙龙:从《知网》抽取逻辑动宾的关系】

【立委科普:教机器识英文】

【立委科普:及物、不及物 与 动词 subcat 及句型】

泥沙龙笔记:再聊乔老爷的递归陷阱

【泥沙龙笔记:人脑就是豆腐,别扯什么递归了】

泥沙龙笔记:儿童语言没有文法的问题

《自然语言是递归的么?》

Parsing nonsense with a sense of humor

【科普小品:文法里的父子原则】

Parent-child Principle in Dependency Grammar

乔氏 X 杠杠理论 以及各式树形图表达法

【泥沙龙笔记:依存语言学的怪圈】

【没有语言结构可以解析语义么?浅论 LSA】

【没有语言结构可以解析语义么?(之二)】

自然语言中,约定俗成大于文法教条和逻辑

泥沙龙笔记:三论世界语

泥沙龙笔记:再聊世界语及其文化

泥沙龙笔记:聊一聊世界语及老柴老乔以及老马老恩

《泥沙龙笔记:NLP component technology 的市场问题》

【泥沙龙笔记:没有结构树,万古如长夜】

Deep parsing:每日一析

Deep parsing 每日一析:内情曝光 vs 假货曝光

Deep parsing 每日一析 半垃圾进 半垃圾出

【一日一parsing: 屈居世界第零】

【研发随笔:植树为林自成景(10/n)】

【deep parsing:植树为林自成景(20/n)】

【deep parsing:植树为林自成景(30/n)】

【语义计算沙龙:植树为林自成景(40/n)】

【deep parsing 吃文化:植树为林自成景(60/n)】

【deep parsing (70/n):离合词与定语从句的纠缠】

【deep parsing (80/n):植树成林自成景】

【deep parsing (90/n):“雨是好雨,但风不正经”】

【deep parsing (100/n):其实 NLP 也没那么容易气死】

 

3. On NLP extraction

【立委科普:NLU 的螺旋式上升及其 open知识图谱的趋向】

【语义计算沙龙:知识图谱无需动用太多知识 负重而行】

【立委科普:信息抽取】

《朝华午拾:信息抽取笔记》

泥沙龙笔记:搜索和知识图谱的话题

《知识图谱的先行:从Julian Hill 说起》

《有了deep parsing,信息抽取就是个玩儿》

【立委科普:实体关系到知识图谱,从“同学”谈起】

泥沙龙笔记: parsing vs. classification and IE

前知识图谱钩沉: 信息抽取引擎的架构

前知识图谱钩沉: 信息体理论

前知识图谱钩沉,信息抽取任务由浅至深的定义

前知识图谱钩沉,关于事件的抽取

钩沉:SVO as General Events

Pre-Knowledge-Graph Profile Extraction Research via SBIR (1)

Pre-Knowledge-Graph Profile Extraction Research via SBIR (2)

Coarse-grained vs. fine-grained sentiment extraction

【立委科普:基于关键词的舆情分类系统面临挑战】

【“剩女”的去向和出路】

SBIR Grants

 

4. On NLP big data mining

 

“大数据与认识论”研讨会的书面发言(草稿)

【立委科普:自动民调】

Automated survey based on social media

《立委科普:机器八卦》

言多必露,文本挖掘可以揭示背景信息

社媒是个大染缸,大数据挖掘有啥价值?

大数据挖掘问答2:会哭的孩子有奶吃

大数据挖掘问答1:所谓数据完整性

为什么做大数据的吹鼓手?

大数据NLP论

On Big Data NLP

作为公开课的大数据演讲

【立委科普:舆情挖掘的背后】

【立委科普:所谓大数据(BIG DATA)】

【科研笔记:big data NLP, how big is big?】

文本挖掘需要让用户既能见林又能见木

【社媒挖掘:《品牌舆情图》的设计问题】

研究发现,国人爱说反话:夸奖的背后藏着嘲讽

立委统计发现,人是几乎无可救药的情绪性动物

2011 信息产业的两大关键词:社交媒体和云计算

《扫了 sentiment,NLP 一览众山小:从“良性肿瘤”说起》

 

5. On NLP applications

 

【河东河西,谁敢说SMT最终一定打得过规则MT?】

【立委科普:NLP应用的平台之叹】

【Bots 的愿景】

《新智元笔记:知识图谱和问答系统:how-question QA(2)》

《新智元笔记:知识图谱和问答系统:开题(1)》

【泥沙龙笔记:NLP 市场落地,主餐还是副食?】

《泥沙龙笔记:怎样满足用户的信息需求》

立委科普:问答系统的前生今世

《新智元笔记:微软小冰,人工智能聊天伙伴(1)》

《新智元笔记:微软小冰,可能的商业模式(2)》

《新智元笔记:微软小冰,两分钟定律(3)》

新智元笔记:微软小冰,QA 和AI,历史与展望(4)

泥沙龙笔记:把酒话桑麻,聊聊 NLP 工业研发的掌故

泥沙龙笔记:创新,失败,再创新,再失败,直至看上去没失败

泥沙龙笔记:parsing 是引擎的核武器,再论NLP与搜索

立委科普:从产业角度说说NLP这个行当

【立委科普:机器翻译】

立委硕士论文【附录一:EChA 试验结果】

社会媒体(围脖啦)火了,信息泛滥成灾,技术跟上了么?

2011 信息产业的两大关键词:社交媒体和云计算

再说苹果爱疯的贴身小蜜 死日(Siri)

从新版iPhone发布,看苹果和微软技术转化能力的天壤之别

非常折服苹果的技术转化能力,但就自然语言技术本身来说 ...

科研笔记:big data NLP, how big is big?

与机器人对话

《机器翻译词义辨识对策》

【立委随笔:机器翻译万岁】

 

6. On Chinese NLP

【语义计算群:李白侃中文秀parsing】

【parsing 在希望的田野上】

语义计算沙龙:其实 NLP 也没那么容易气死

【deep parsing (70/n):离合词与定语从句的纠缠】

【立委科普:deep parsing 小讲座】

【新智元笔记:词的幽灵在NLP徘徊】

《新智元笔记:机器的馅饼专砸用心者的头》

【新智元笔记:机器的馅饼(续篇)】

【新智元笔记:parsing 汉语涉及重叠的鸡零狗碎及其他】

【新智元笔记:中文自动分析杂谈】

【deep parsing:“对医闹和对大夫使用暴力者,应该依法严惩”】

【让机器人解读洪爷的《人工智能忧思录》(4/n)】

【让机器人解读洪爷的《人工智能忧思录》(3/n)】

【让机器人解读洪爷的《人工智能忧思录》(2/n)】

【让机器人解读洪爷的《人工智能忧思录》(1/n)】

《新智元笔记:找茬拷问立氏parser》

【新智元笔记:汉语分离词的自动分析】

《新智元笔记:与汉语离合词有关的结构关系》

《新智元笔记:汉语使动结构与定中结构的纠缠》

《新智元笔记:汉语parsing的合成词痛点》

《新智元:填空“的子结构”、“所字结构”和“者字结构“》

【沙龙笔记:汉语构词和句法都要用到reduplication机制】

钩沉:博士阶段的汉语HPSG研究 2015-11-02

泥沙龙小品:小词搭配是上帝给汉语文法的恩赐

泥沙龙笔记:汉语牛逼,国人任性!句法语义,粗细不同

泥沙龙笔记:汉语就是一种“裸奔” 的语言

【NLP笔记:人工智能神话的背后是汗水】

【立委随笔:中文之心,如在吾庐】

汉语依从文法 (维文钩沉)

《立委科普:现代汉语语法随笔》

“自由”的语言学至少有三种理论

应该立法禁止切词研究 :=)

再谈应该立法禁止切词研究

中文处理的迷思之一:切词特有论

中文处理的迷思之二:词类标注是句法分析的前提

中文NLP迷思之三:中文处理的长足进步有待于汉语语法的理论突破

专业老友痛批立委《迷思》系列搅乱NLP秩序,立委固执己见

后生可畏,专业新人对《迷思》争论表面和稀泥,其实门儿清

突然有一种紧迫感:再不上中文NLP,可能就错过时代机遇了

社会媒体舆情自动分析:马英九 vs 陈水扁

舆情自动分析表明,谷歌的社会评价度高出百度一倍

方韩大战高频情绪性词的词频分析

方韩大战的舆情自动分析:小方的评价比韩少差太多了

研究发现,国人爱说反话:夸奖的背后藏着嘲讽

立委统计发现,人是几乎无可救药的情绪性动物

研发笔记:粤语文句的情报挖掘

《立委随笔: 语言学家是怎样炼成的》

《立委科普:汉语只有完成体,没有过去时》

《科研笔记:中文图灵试题?》

立委统计发现,汉语既适合吹嘘拍马亦长于恶意构陷

比起英语,汉语感情更外露还是更炽烈?

科研笔记:究竟好还是不好

《科普随笔:汉字和语素》

《科普随笔:汉语自动断词 “一次性交500元”》

《科普随笔:“他走得风一样地快” 的详细语法结构分析》

【立委科普:自动分析 《偉大的中文》】

《立委随笔:汉语并不简单》

语言学小品:结婚的远近距离搭配

中文处理的模块化纠结

【立委科普:《非诚勿扰》中是谁心动谁动心?】

曙光在眼前,轻松过个年

挺反自便,但不要欺负语言学!

当面对很烦很难很挑战的时候

创造着是美丽的

汉语依从文法 (维文钩沉)

《新智元:挖掘你的诗人气质,祝你新年快乐》

 

7. On the practice of NLP social media sentiment mining

 

【语义计算沙龙:sentiment 中的讽刺和正话反说】

【喋喋不休论大数据(立委博文汇总)】

【新智元笔记:再谈舆情】

舆情挖掘系统独立验证的意义

【社媒挖掘:雷同学之死】

《利用大数据高科技,实时监测美国总统大选舆情变化》

世人皆错nlp不错,民调错大数据也不会错

社媒大数据的困境:微信的风行导致舆情的碎片化

从微信的用户体验谈大数据挖掘的客户情报

社媒挖掘:社会媒体疯传柴静调查,毁誉参半,争议趋于情绪化

奥巴马赢了昨晚辩论吗?舆情自动检测告诉你

全球社交媒体热议阿里巴巴上市

到底社媒曲线与股市曲线有没有、有多少相关度?

再谈舆情与股市的相关性

【『科学』预测:A-股 看好】

舆情挖掘用于股市房市预测靠谱么?

大数据帮助决策实例:《走进“大数据”——洗衣机寻购记》

【社媒挖掘:外来快餐店风光不再】

【社媒挖掘:中国手机市场仍处于战国争雄的阶段】

世界杯是全世界的热点,纵不懂也有义务挖掘一哈

【大数据挖掘:方崔大战一年回顾】(更正版)

【大数据挖掘:转基因一年回顾】

【大数据挖掘:“苦逼”小崔2013年5-7月为什么跌入谷底?】

【大数据挖掘:转基因中文网络的自动民调,东风压倒西风?】

【大数据挖掘:转基因英文网络的自动民调和分析】

只认数据不认人:IRT 的鼓噪左右美国民情了么?

继续转基因的大数据挖掘:谁在说话?发自何处?能代表美国人民么

关于转基因及其社会媒体大数据挖掘的种种问题

【美国网民怎么看转基因:英文社交媒体大数据调查告诉你】

【社媒挖掘:必胜客是七夕节情侣聚餐的首选之地?】

【社媒挖掘:大数据时代的危机管理】

测试粤语舆情挖掘:拿娱乐界名人阿娇和陈冠希开刀

【社媒挖掘:不朽邓丽君】

【社媒挖掘:社会媒体眼中的李开复老师】

【社媒挖掘:糟糕透顶的方韩社会形象】

社媒挖掘:关于狗肉的争议

社媒挖掘:央视的老毕

社媒挖掘:老毕私下辱毛事件再挖掘

大数据淹没下的冰美人(之一)

大数据淹没下的冰美人(之二)

大数据淹没下的冰美人(之三): 喜欢的理由

大数据淹没下的冰美人(之四): 流言蜚语篇(慎入)

大数据淹没下的冰美人(之五): 星光灿烂谁为最?

【社媒挖掘:成都暴打事件中的男司机和女司机】

【社媒挖掘:社会媒体眼中的陳水扁】

【社媒挖掘:社会媒体眼中的李登輝】

【社媒挖掘:馬英九施政一年來輿情晴雨表】

【社媒挖掘:臺灣政壇輿情圖】

【社媒挖掘:社会媒体眼中的臺灣綠營大佬】

舆情挖掘:九合一國民黨慘敗 馬英九時代行將結束?

社会媒体舆情自动分析:马英九 vs 陈水扁

社媒挖掘:争议人物方博士被逐,提升了其网路形象

方韩大战高频情绪性词的词频分析

方韩大战的舆情自动分析:小方的评价比韩少差太多了

社媒挖掘:苹果CEO库克公开承认同志身份,媒体反应相当正面

苹果智能手表会是可穿戴设备的革命么?

全球社交媒体热议苹果推出 iPhone 6

互联网盛世英雄马云的媒体形象

革命革到自身头上,给咱“科学网”也挖掘一下形象

两年来中国红十字会的社会媒体形象调查

自动民调Walmart,挖掘发现跨国公司在中国的日子不好过

【社媒挖掘:“剩女”问题】

【舆情挖掘:2013央视春晚播后】

【舆情挖掘:年三十挖一挖央视春晚】

新浪微博下周要大跌?舆情指数不看好,负面评价太多(疑似虚惊)

【大数据挖掘:微信(WeChat)】

【大数据解读:方崔大战对转基因形象的影响】

【微博自动民调:薄熙来、薛蛮子和李天一】

【社媒挖掘:第一夫人光彩夺目赞誉有加】

Chinese First Lady in Social Media

Social media mining on credit industry in China

Sina Weibo IPO and its automatic real time monitoring

Social media mining: Teens and Issues

立委元宵节大数据科技访谈土豆视频上网

【大数据挖掘:中国红十字会的社会媒体形象】

【社媒挖掘:社会媒体眼中的财政悬崖】

【社媒挖掘:美国的枪支管制任重道远】

【舆情挖掘:房市总体看好】

【社媒挖掘:社会媒体眼中的米拉先生】

【社会媒体:现代婚姻推背图】

【社会媒体:现代爱情推背图】

【科学技术之云】

新鲜出炉:2012 热点话题五大盘点之五【小方vs韩2】

【凡事不决问 social:切糕是神马?】

Social media mining: 2013 vs. 2012

社会媒体测试知名品牌百度,有惊人发现

尝试揭秘百度的“哪里有小姐”: 小姐年年讲、月月讲、天天讲?

舆情自动分析表明,谷歌的社会评价度高出百度一倍

圣诞社媒印象: 简体世界狂欢,繁體世界分享

WordClouds: Season's sentiments, pros & cons of Xmas

新鲜出炉:2012 热点话题五大盘点之一【吊丝】

新鲜出炉:2012 热点的社会媒体五大盘点之二【林书豪】

新鲜出炉:2012 热点话题五大盘点之三【舌尖上的中国】

新鲜出炉:2012 热点话题五大盘点之四【三星vs苹果】

社会媒体比烂,但国骂隐含舆情

肮脏语言研究:英语篇

肮脏语言研究:汉语篇(18岁以下勿入)

新年新打算:【社媒挖掘】专栏开张大吉

 

8. On NLP anecdotes and stories

《朝华午拾:创业之路》

《朝华午拾 - 水牛风云》

《朝华午拾:用人之道》

《朝华午拾:欧洲之行》

《朝华午拾:“数小鸡”的日子》

《朝华午拾:一夜成为万元户》

《朝华午拾:世界语之恋》

《朝华午拾:我的考研经历》

80年代在国内,社科院的硕士训练使我受益最多

科研笔记:开天辟地的感觉真好

《朝华午拾:今天是个好日子》

【朝华午拾:那天是个好日子】

10 周年入职纪念日有感

《立委随笔: 语言学家是怎样炼成的》

说说科研立项中的大跃进

围脖:一个人对抗一个世界,理性主义大师 Lenat 教授

《泥沙龙笔记:再谈 cyc》

围脖:格语法创始人菲尔墨(Charles J. Fillmore)教授千古!

百度大脑从谷歌大脑挖来深度学习掌门人 Andrew Ng

冯志伟老师以及机器翻译历史的一些事儿

《立委随笔:微软收购PowerSet》

NLP 历史上最大的媒体误导:成语难倒了电脑

立委推荐:乔姆斯基

巧遇语言学上帝乔姆斯基

[转载]欧阳锋:巧遇语言学新锐 - 乔姆斯基

【科普小品:伟哥的关键词故事】

不是那根萝卜,不做那个葱

【随记:湾区的年度 NLP BBQ 】

女怕嫁错郎,男怕入错行,专业怕选错方向

据说,神奇的NLP可以增强性吸引力,增加你的信心和幽会成功率

【立委科普:美梦成真的通俗版解说】

【征文参赛:美梦成真】

【创业故事:技术的力量和技术公司的命运】

把酒话桑麻,再泡一壶茶,白头老机译,闲坐说研发

MT 杀手皮尔斯 (翻译节选)

ALPAC 黑皮书 1/9:前言

《眼睛一眨,来了王子,走了白马》

职业随想曲:语言学万岁

立委随笔:Chomsky meets Gates

钩沉:《中国报道》上与导师用世界语发表的第一篇论文

钩沉:《中国报道》上用世界语发表的第二篇论文

贴身小蜜的面纱和人工智能的奥秘

有感于人工智能的火热

泥沙龙笔记微博议摘要

【泥沙龙笔记:没有结构树,万古如长夜】

【泥沙龙笔记:机器 parsing 洪爷,无论打油或打趣】

老革命遇到新问题,洪爷求饶打油翁

我要是退休了,就机器 parse 《离骚》玩儿

 

【Linguistics Vignette: The Linguistics Behind 送老婆】

[Image: 456822675539882531]

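Who would misread it? Why? Let us look at the linguistics behind it, and beyond.

The ditransitive pattern has two slots. For a human NP, the default slot is the recipient: 老婆 (wife) is the recipient of 送 (give), which is the correct reading.
For someone with crooked thoughts, a human can also fill the patient slot: 老婆, like a gift, becomes the patient of 送, the thing given away.
This is the ambiguity of 送. With the compound 送给 in the caption, the subcat changes subtly and the ambiguity disappears. And why is 送-个 unambiguous as well? Because 个 is indefinite, whereas the recipient role is usually definite.
Spelled out in detail, there is a whole stack of linguistics in here.

(1) The recipient in the ditransitive pattern is generally definite. An indefinite recipient is not absolutely ruled out, for example:
“我把一大批书送(给)一所学校了。” (I gave a large batch of books to a school.)
一所 is an indefinite numeral-classifier phrase, here marking the recipient.
In Chinese, “一 + classifier” and the bare classifier are usually regarded as equivalent, both belonging to the category indefinite, the latter being derived from the former by omitting 一. But the two are not fully equivalent.
The recipient role is definite by default (even though Chinese has no definite article). For this role, 一 cannot be omitted; in other words, the role cannot be filled by an NP with a bare classifier.
A fine-grained rule of Chinese syntax can thus be stated: an NP with a bare classifier can only serve as the direct object; it cannot serve as the indirect object (recipient) or anything else.

(2) Now for the linguistics inside the compound 送给.
Chinese words expressing a ditransitive concept can often combine with 给 to form a compound verb with the same meaning. Note, however, the subtle change in subcat before and after compounding: 送 vs 送给 (likewise 寄给, 赠给, 赠送给, etc.)
Subcat patterns of 送:
(1) 送 + recipient NP + patient NP: 送她一本书 (give her a book)
(2) 把 + patient NP + 送 + recipient: 把一本书送她 (give a book to her)
(3) patient NP + 送 + recipient: 这本书送她了 (this book has been given to her)
(4) 送 + patient NP: 送个老婆 (give away a wife)
(5) 送 + recipient NP (human, definite): 送(我)老婆 (give to (my) wife).

Pay attention to (4) and (5): where the two patterns overlap and compete, ambiguity arises. Once 送 + 给 forms a compound verb, subcat patterns (1), (2), (3) and (5) remain unchanged, while (4) essentially drops out. "Essentially", because although 送给老婆 can only follow pattern 5, 送给个老婆 (slightly awkward, but still within the range of acceptability) seems to have to be understood as pattern 4. How so?
This is the subtlety of language: pattern 4 should have dropped out, because 给 has already determined that what follows is the recipient rather than the patient. But Chinese has another very fine yet very strong rule: an NP with a bare classifier can only be the patient, not the recipient or anything else. When these two rules (the recipient rule of pattern 5 and the bare-classifier patient rule) conflict, the latter wins, so 送给个老婆 has to take the patient reading of pattern 4. This is a fight between rules, and which one wins is itself part of linguistics; in a computer implementation it can be modeled with a priority mechanism.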
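
A minimal Python sketch of such a priority mechanism, for illustration only. The rule table, the numeric priorities, the feature names (`bare_classifier`, `human`) and the function `assign_role` are all assumptions; only the ordering "the bare-classifier rule beats the 给 recipient rule" is taken from the text.

```python
# Each rule is (priority, description, condition, role).
# On conflict, the matching rule with the highest priority wins.
RULES = [
    (2, "NP with a bare classifier is the patient",
     lambda verb, np: np["bare_classifier"], "patient"),
    (1, "NP right after 送给 is the recipient",
     lambda verb, np: verb == "送给", "recipient"),
    (0, "human NP defaults to the recipient",
     lambda verb, np: np["human"], "recipient"),
]


def assign_role(verb, np):
    """Pick the role of the NP that follows `verb` using rule priorities."""
    matching = [rule for rule in RULES if rule[2](verb, np)]
    if not matching:
        return "patient"  # assumed fallback for non-human, non-marked NPs
    return max(matching, key=lambda rule: rule[0])[3]


wife = {"human": True, "bare_classifier": False}       # 老婆
a_wife = {"human": True, "bare_classifier": True}      # 个老婆
earphones = {"human": False, "bare_classifier": False}  # 耳机

print(assign_role("送给", wife))    # 送给老婆
print(assign_role("送给", a_wife))  # 送给个老婆
print(assign_role("送", a_wife))    # 送个老婆
print(assign_role("送", wife))      # 送老婆 (default reading only)
```

This sketch returns only the default reading for 送老婆. The competing patient reading of pattern 4 would need context that the sketch does not model, such as the promotion pattern discussed next.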

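The picture above also involves a common sales-promotion pattern: 买 NP1 送 NP2 (buy NP1, get NP2 free)
买iPhone 6 送耳机 (buy an iPhone 6, get earphones free)
买 Prius 送三年保修 (buy a Prius, get a three-year warranty free)
The existence of this pragmatic pattern strengthens the likelihood of NP2 being the patient, counterbalancing the default pull of a human NP towards the recipient role. This seems to touch on the boundary between pragmatics and syntax.

So much for linguistics. Beyond linguistics, this misreading or ambiguity can also be viewed from a cultural angle:

For people from backward rural areas, reading 老婆 as the patient is almost a matter of course: feudal backwardness in the countryside leaves too many bachelors unable to afford a bride, and the longing to get a wife for free inclines them towards the patient reading rather than the recipient one. Besides, a mobile phone carries a sky-high price for them, obtainable only by selling a kidney, so they are all the more sensitive to the promotion pattern. Conversely, an intellectual or an affluent person is probably more inclined to read 老婆 in 送老婆 as the recipient.

Just as the venerable Wang Ruoshui (王若水) in his old age discussed the philosophy of the table, this vignette is mainly meant to talk about everyday linguistics. Philosophers see philosophy everywhere; linguists see the world through linguistics. Everyone can speak a language, but the linguistics behind it is not something any layperson can readily grasp. Language is like water and air: ordinary people look at it without seeing it, and it falls to linguists to reveal it. This is real life linguistics: trivial yet not without regularity, vast as the sea yet with a bottom still in sight.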

【Related】

《立委随笔: 语言学家是怎样炼成的》

《朝华午拾》总目录

【关于立委NLP的《关于系列》】

【置顶:立委NLP博文一览(定期更新版)】

立委NLP频道

【立委NLP频道的《关于系列》】

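【Liwei's note】 With this 《关于系列》 (About series) in place, what needs to be said about NLP has largely been said. Whatever is said later will mostly be repetition or detail. Some points can be made from different angles, and the key things can be said repeatedly, using redundancy of information to try to secure effective and complete transmission. As I have said before, Liwei has three role models in this respect, all of them tirelessly earnest. The first is Marx, especially as reflected in his brick-thick 《Das Kapital》 (Capital), on which he spent more than 30 years without finishing it. The second is Chomsky, whose critique of American foreign-policy hegemony and the American mass media has gone on for a lifetime without ever straying from its core. The third is my old friend Mr. Jingzi (镜子), who sweeps across everything under the sun, as seen in 【镜子大全】, edited by Liwei. All of them have a bodhisattva's heart, telling the world what they take to be true insight (not the truth, of course, and inevitably somewhat extreme). For me at least, the point is to tell the world, without minding whether the world listens. An old man briefly indulging in youthful abandon; let the flowers bloom and fall as they will.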

【关于 NLP 以及杂谈】                         专栏:杂类English

【关于NLP体系和设计哲学】               专栏:NLP架构

【关于NLP方法论以及两条路线之争】 专栏:NLP方法论

【关于 parsing】                                    专栏:Parsing

【关于中文NLP】                                   专栏:中文处理

【关于信息抽取】                                   专栏:信息抽取

【关于大数据挖掘】                               专栏:情报挖掘

【关于知识图谱】                                   专栏:知识图谱

【关于舆情挖掘】                                   专栏:舆情挖掘

【关于问答系统】                                   专栏:问答系统

【关于机器翻译】                                    专栏:机器翻译

【关于NLP应用】                                   专栏:NLP应用

【关于我与NLP】                                  专栏:NLP掌故

【关于NLP掌故】                                  专栏:NLP掌故

【关于人工智能】                                  专栏:杂类

 

【关于问答系统】

立委科普:问答系统的前生今世

《新智元笔记:知识图谱和问答系统:开题(1)》

《新智元笔记:知识图谱和问答系统:how-question QA(2)》

《朝华午拾:创业之路》

【Bots 的愿景】

《泥沙龙笔记:怎样满足用户的信息需求》

《新智元笔记:微软小冰,人工智能聊天伙伴(1)》

《新智元笔记:微软小冰,可能的商业模式(2)》

《新智元笔记:微软小冰,两分钟定律(3)》

新智元笔记:微软小冰,QA 和AI,历史与展望(4)

再说苹果爱疯的贴身小蜜 死日(Siri)

从新版iPhone发布,看苹果和微软技术转化能力的天壤之别

非常折服苹果的技术转化能力,但就自然语言技术本身来说 ...

与机器人对话

关于 NLP 以及杂谈

关于NLP体系和设计哲学

关于NLP方法论以及两条路线之争

关于 parsing

【关于中文NLP】

【关于信息抽取】

【关于舆情挖掘】

【关于大数据挖掘】

【关于知识图谱】

【关于NLP应用】

【关于人工智能】

【关于我与NLP】

【关于NLP掌故】

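【泥沙龙 Notes: Living off Science, or off Technology?】

Me:

Although I have been given the title of Chief Scientist at a small company, I really do not dare to call myself a scientist any more, because I left academia long ago and have never truly made a living from science: that golden rice bowl is too heavy for me to hold. This is neither modesty nor self-deprecation, because in my mind scientists and technologists are hard to rank against each other. As a front-line technologist, I do not feel inferior to a first-rate scientist.

Leaving biology aside, let us talk about NLP. Reproducibility is the foundation of science; otherwise fortune-tellers and shamans would all be scientists too. For a single well-defined task, or a pure algorithm, when the community has an annotated test set, reproducibility seems a reasonable thing to require, although how exactly to verify it, and how far the verification must go before it is accepted as valid, is far from black and white.

My question is: for a more complex system, such as a deep parser or an MT system, especially in industry, is reproducibility possible at all? Should results that cannot be reproduced go unrecognized? Quite apart from the fact that non-reproducibility is a necessary condition for keeping a competitive edge, even if a company did not care about IP, it is hard to imagine a competitor reproducing its results. That could only happen if the entire source code and the original resources, including all the lexicons, were handed over intact, with no configuring and no change to any parameter allowed. Otherwise, how could the results possibly be reproduced?

Mao:

Every "constitutive element" must be reproducible within a certain margin of error. Otherwise it is a trade secret rather than a scientific discovery.

Me:

So the key is which bowl of rice you eat from. If you live off academia, you must pass this test. How strictly to apply it is up to the community's peer reviewers.

Mao:

As I said before, you cannot have all the benefits to yourself.

Me:

If you live off industry, all that matters is that your black box performs.

This means that academia can only publish on "constitutive elements", and building an integrated system is a thankless job. Scientifically this makes sense, but many academics are not content to stay cooped up in the ivory tower forever, toiling for others' benefit; they too want to build practical systems. The results of an integrated practical system almost certainly cannot be reproduced by others, because there are too many variables and the process is too complex.

Mao:

Not necessarily; Unix back in the day was a system. But under the same configuration, the results obtained should fall within a certain margin of error.

Me:

Put another way: never mind others, even I may not be able to reproduce my own results. If I started from scratch and built another parser, how much deviation in the results would count as acceptable? Even with the basic design and algorithms unchanged, and trusting that it gets better with each attempt, the deviation is hard to predict before the work is done. It depends on factors such as the resources available at the new development site.

Mao:

Right, so others would not nitpick either; there will be a consensus. For something like a parser, if it is for natural language, the tolerance should be quite wide. But for formal languages or programming languages, the requirement is very strict.

Me:

I mean natural language. A dozen or so years ago, I was still hovering at the edge of the academic hall, trying to please the mainstream and get a share of the pie, even though I knew full well that the one-sided dominance of statistics in academia had let prejudice run rampant (【科普随笔:NLP主流的傲慢与偏见】) until it became chronic. People like me were stifled, and colleagues stood as if on opposite sides of a mountain, unable to hear each other. Then one day it dawned on me: whose rice am I actually eating, and what do I earn it with? It turned out that what feeds and clothes me is not science, still less the mainstream. I am no different from Ah Er the carpenter next door: I live mainly by my craft, by the special skills of technical innovation, not by breakthroughs in pure science. Realizing this, I stopped throwing eggs at rocks, which only boosts others' prestige and saps one's own morale. In the end, in industry, the boss does not care which camp you are in, and customers care even less whether you follow the trend: white cat or black cat, the system does the talking. You have your scientific breakthroughs, I have my technical skills; in the field, what counts is who is down to earth and who holds hard currency. System results may be hard to reproduce, but objective measurement is not difficult.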

【Related】

关于NLP方法论以及两条路线之争

【关于我与NLP】

《朝华午拾》总目录

 

【关于信息抽取】

【立委科普:信息抽取】

《朝华午拾:信息抽取笔记》

泥沙龙笔记:搜索和知识图谱的话题

《知识图谱的先行:从Julian Hill 说起》

《有了deep parsing,信息抽取就是个玩儿》

【立委科普:实体关系到知识图谱,从“同学”谈起】

泥沙龙笔记: parsing vs. classification and IE

前知识图谱钩沉: 信息抽取引擎的架构 2015-11-01

前知识图谱钩沉: 信息体理论 2015-10-31

前知识图谱钩沉,信息抽取任务由浅至深的定义 2015-10-30

前知识图谱钩沉,关于事件的抽取

钩沉:SVO as General Events

Pre-Knowledge-Graph Profile Extraction Research via SBIR (1)

Pre-Knowledge-Graph Profile Extraction Research via SBIR (2)

Coarse-grained vs. fine-grained sentiment extraction

【立委科普:基于关键词的舆情分类系统面临挑战】

【“剩女”的去向和出路】

SBIR Grants

 

【关于 parsing】

关于 NLP 以及杂谈

关于人工智能

关于NLP体系和设计哲学

关于NLP方法论以及两条路线之争

《朝华午拾》总目录

【置顶:立委NLP博文一览(定期更新版)】

立委NLP频道

"快叫爸爸小视频" 的社会计算语言学解析

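An expression like "快叫爸爸小视频" has a sociolinguistic flavor: it shifts with the times and the trends. Before WeChat Moments and its short-video (小视频) feature became popular, 小视频 was not a technical term, not a compound word, and had no extended verbal use. It was simply an NP with a modifier-head structure, so in this sentence pattern the string amounted to "把爸爸叫做小视频" ("call Dad a little video"), even though common sense says that a person (Dad) cannot be equated with a thing (a video). Within the obligatory subcat structure of the language (叫 NP1 NP2), common sense has no place. Syntax need not heed common sense, just as "鸡把我吃了" ("the chicken ate me") violates common sense, and just like the "green ideas" of Chomsky's famous sentence.
But then sociolinguistics enters the stage. Language is set against a shifting social background: 小视频 became a technical term, and then moved from term into verbal use in the speech community, just as 谷歌 (Google) went from a term (a proper name) to a verb: "我还是先谷歌一下再回应吧" ("let me Google it first before I respond"), "快小视频呀" ("quick, video it"), "一定要小视频这个精彩时刻" ("you must video this wonderful moment").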
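Bai:
"一下" coerces "谷歌" into a verb. One half of the bracket is already there, so the other half must exist even when it is missing.
Me:
So the subcats begin to compete. With competition comes structural ambiguity, and with ambiguity common sense has a reason to step in. Those who follow common sense then overturn the first syntactic reading.
Bai:
How do you interpret "你是我的小苹果" ("you are my little apple")?
Me:
"你是我的小苹果" is obligatory syntax. However one understands this apple (to this day I do not understand why a lover or sweetheart is called a little apple; is it because the classy apple stands for something precious?), it has nothing to do with common sense: "你是我的 x" ("you are my x") is a syntactically imposed equivalence relation.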
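Take "一下" coercing "谷歌" into a verb: this kind of seemingly ad hoc coercion, once it moves from occasional to routine in the speech community, makes its way into the lexicon. In other words, "谷歌" previously had no "potential verb" annotation (a lexical candidate POS feature) in the lexicon and needed none, because nearly all of its verbal uses were sporadic and syntactically coerced, requiring no lexical support. But as the language developed, the verbal use of "谷歌" became commonplace in the speech community (its popularity comes across as concise, fashionable, even playful). At that point the usage is reflected in the community's collective vocabulary, and when we model the community's linguistic competence we begin to annotate the verb possibility.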
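Jin:
Impressive, such hair-splitting! 金融语义 (financial semantics) is watching the fun from the sidelines.
Me:
One might ask: what difference does it make whether the lexicon carries this annotation (which reflects the community's collective awareness that the usage is popular) or not?
It makes a real difference. With the annotation, the verbal use joins the normal competition in parsing as a legitimate path. Without it, an ad hoc verbal use still cannot be ruled out, but lacking lexical support from the bottom, the verbal path is illegal by default unless the syntactic (including morphological) context forces the word to be a verb. That is the so-called bandit syntax of "一哈" (a colloquial variant of "一下"): not only is the lexicon a paradise for kidnapping, syntax can kidnap too.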
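The difference between a lexically licensed verb path and a syntactically coerced one can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy lexicon, the tag names `N` and `V`, the default of `N` for unknown words, the pre-tokenized input, and the function name `pos_paths`. It is not the parser discussed in the conversation.

```python
# Toy lexicon (hypothetical): word -> POS candidates licensed by the lexicon.
LEXICON = {
    "谷歌": {"N"},        # before the verbal use entered the lexicon
    "搜索": {"N", "V"},
}

# Hypothetical coercing contexts: a following "一下" forces a verb reading.
COERCERS = {"一下", "一哈"}


def pos_paths(tokens, i, lexicon=LEXICON):
    """Return the POS paths open to tokens[i] in this toy model."""
    word = tokens[i]
    # Assumption: unknown words default to a noun reading.
    licensed = set(lexicon.get(word, {"N"}))
    following = tokens[i + 1] if i + 1 < len(tokens) else None
    if following in COERCERS:
        # Syntactic coercion: the context forces the verb reading,
        # whether or not the lexicon licenses it.
        return {"V"}
    # Otherwise only lexically licensed paths compete.
    return licensed


# No annotation, no coercing context: the verb path is closed by default.
before = pos_paths(["快", "谷歌"], 1)
# No annotation, but "一下" follows: the verb reading is forced.
coerced = pos_paths(["先", "谷歌", "一下"], 1)
# With the annotation, the verb path competes normally.
updated = {**LEXICON, "谷歌": {"N", "V"}}
after = pos_paths(["快", "谷歌"], 1, updated)
```

In this sketch the annotation changes the outcome only where no coercing context is present, which is the case the paragraph above singles out.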
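Bai put it this way: "The pivotal reading (叫某人做某事, tell someone to do something) leaves a predicate slot unsaturated; the double-object reading (叫某人某称呼, call someone by some name) leaves a nominal slot unsaturated. On structure alone, the two are six of one and half a dozen of the other. Once context is taken into account, however, the non-pivotal reading is subversive and the pivotal reading is the common-sense one. Passing over the common-sense reading in favor of the subversive one shows that the cloud over one's heart did not gather in a day or two. Three feet of ice does not form in a single day."
Revisiting Bai's words for comparison: every word is a gem, and witty too. "冰冻三尺" ("three feet of ice") is sociolinguistics.

One could also say that "three feet of ice" is big data.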
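Me:
When we study linguistics and model syntax, we mostly target the present, treating language as a static cross-section to study and model. That is not far wrong, and it simplifies the problem. But language flows, and this fluidity is exactly what sociolinguistics stresses. The flow naturally shows up in big data. A static language model therefore needs constant updating; if big data is available, check the model against it at regular intervals.
Bai:
A dynamically updated middleware layer would be enough.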
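One way to picture such a periodic check is the sketch below. It is only an illustration under stated assumptions: the corpus is already tokenized, a following "一下" is taken as the sole evidence of verbal use, and the names `update_lexicon` and `min_count` as well as the threshold of 3 are hypothetical.

```python
from collections import Counter


def update_lexicon(lexicon, corpus, coercers=frozenset({"一下"}), min_count=3):
    """Return a copy of the lexicon with "V" added to words that are
    coerced into verb use at least min_count times in the corpus."""
    coerced_counts = Counter()
    for tokens in corpus:
        # Look at each word together with the token that follows it.
        for word, following in zip(tokens, tokens[1:]):
            if following in coercers and "V" not in lexicon.get(word, set()):
                coerced_counts[word] += 1

    # Copy so the static model is left untouched until the update is accepted.
    updated = {word: set(tags) for word, tags in lexicon.items()}
    for word, count in coerced_counts.items():
        if count >= min_count:
            updated.setdefault(word, set()).add("V")
    return updated


static_lexicon = {"谷歌": {"N"}}
rare = [["先", "谷歌", "一下"]] * 2      # too sporadic to enter the lexicon
common = [["先", "谷歌", "一下"]] * 3    # frequent enough under this threshold

unchanged = update_lexicon(static_lexicon, rare)
refreshed = update_lexicon(static_lexicon, common)
```

The threshold stands in for the point at which a sporadic, coerced usage has become routine in the speech community; where to set it is a judgment call that this sketch does not settle.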
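Me:
Chen Yuan (陈原) was a master. His writing on sociolinguistics is fascinating. At an Esperanto event I had the privilege of hearing Mr. Chen give a speech in Esperanto: the brilliance, the charisma, and the individuality were awe-inspiring. Linguistics was his sideline; his day job was publishing. He is said to have been China's most authoritative publisher, and also a left-wing social activist.
Hong:
Although he formally joined the Party only in the early years after Liberation, he had probably long been an underground CCP member. He was an editor at 三联 (Sanlian) from the early 1930s, working under Hu Yuzhi and Zou Taofen. 《读书》 used to carry Chen Yuan's 《在语词的密林里》 regularly.
Me:
Chen Yuan's speech and Huang Hua's (the one for which I served as interpreter) shared one trait: both were expressive and charismatic, and you could feel the speaker's personality. Both were true masters.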


立委 NLP Channel: Grand Opening

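Thanks to help from 高博, the 立委 NLP blog channel opens today. A quick plug, especially for newcomers to the field: https://liweinlp.com/

Its predecessor is the collection of NLP-related popular-science posts on 【立委科学网博客】, the so-called NLP University (http://blog.sciencenet.cn/blog-362400-902391.html). I will gradually move the original NLP blog here, and new posts will be published here in sync. Non-NLP posts will remain based at 科学网 (ScienceNet).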

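This university has a network but no walls and teaches all comers; it is open to the public unconditionally, around the clock. Credits are counted per post by Professor 立委: choose any 100 posts from the list below, for 100 points. Study first what you need most, gain from what you study, and put it to use; that is what counts as a valid credit. Students grade themselves, and anything over 60 is enough to graduate. The bar is neither high nor low. The master leads you to the door, but the practice is up to you; whether you graduate honestly depends on your own fortune.

I have lost count of how many times I typed NLP (natural language processing) into an input method and got "你老婆" ("your wife"). No wonder NLP has stayed with me all my life, or I with NLP. Never parting.

Opening words: I have devoted some 30 years to natural language processing, aiming at smooth communication, free flow of information, unity of language, and harmony in the world. From 30 years of experience I know well that to reach this goal we must enlighten the newcomers, popularize science, and work together to build the tower that reaches heaven; hence these essays to drum up the cause. Processing has not yet succeeded; comrades must keep striving.

There are eight chapters.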

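Chapter 1: Architecture and methodology. The key post is 【NLP 联络图 】. Besides the map of the architecture and terminology, it also discusses methodology and the struggle between its two schools.

Chapter 2: Parsing, covering all aspects of shallow parsing and deep parsing. One point to stress: deep parsing is the nuclear weapon of NLP. Once the unstructured text of natural language is accurately analyzed into structures, language becomes tractable because it now has a finite set of patterns, and many hard problems in NLP applications are readily solved.

Chapter 3: Extraction, where NLP enters pragmatics. Although the vast majority of extraction work in academia uses no parsing, or only stemming, or at most shallow parsing, the emphasis here is on extraction built on deep parsing. It can be seen as a fully automatic, ultimate solution for knowledge graphs.

Chapter 4: Mining. Extraction and mining are often confused, but the general consensus is that they sit at different levels: extraction targets individuals, tree by tree, whereas mining targets the forest, that is, a corpus or text data source. In the big data era, text mining is seen as the nuclear weapon for digging gold and may lead the next decade, but within the NLP framework it comes after parsing and extraction; it is the statistical aggregation of extraction results. The real nuclear weapon is deep parsing: with it, extraction can move quickly into a new domain, meeting all changes with one constant, and extraction quality improves greatly as well. Only this lays a solid foundation for the eventual big data mining.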
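The layering described here, with extraction working on one tree at a time and mining aggregating over the whole corpus, can be sketched as follows. The toy parse format (a dictionary with `S`, `V`, `O` roles), the sample data, and the function names `extract_svo` and `mine` are assumptions made for illustration only; a real deep parser produces far richer structures.

```python
from collections import Counter


def extract_svo(parse):
    """Extraction: one parsed sentence in, at most one (S, V, O) record out."""
    if all(parse.get(role) for role in ("S", "V", "O")):
        return (parse["S"], parse["V"], parse["O"])
    return None  # this sentence yields no complete SVO record


def mine(parses):
    """Mining: aggregate the extracted records over the whole corpus."""
    records = (extract_svo(parse) for parse in parses)
    return Counter(record for record in records if record is not None)


# Hypothetical parser output for four sentences.
parses = [
    {"S": "Apple", "V": "released", "O": "iPhone"},
    {"S": "Apple", "V": "released", "O": "iPhone"},
    {"S": "Google", "V": "acquired", "O": "YouTube"},
    {"S": "It", "V": "rained"},  # no object, so nothing is extracted
]

stats = mine(parses)
top = stats.most_common(1)
```

The mining step never looks at raw text; it sees only what extraction produced, which is why its quality is bounded by the parsing and extraction beneath it.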

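Chapter 5: Other NLP applications. Text mining is the flagship NLP application and can be used in many products and domains; other applications include machine translation (MT), question answering (QA), and intelligent search such as SVO search (structural search beyond keywords). Language generation (needed by chatbots) and automatic summarization belong here too. Coverage is not yet complete; there are some applications I have not yet had the chance to work on.

Chapter 6: Chinese NLP. Author and readers are Chinese, the blog is written in Chinese, and Chinese processing has its own challenges, hence a separate chapter. More importantly, for many years Chinese NLP was considered far behind NLP for European languages. The material here examines the characteristics and difficulties of Chinese in depth and presents new progress in Chinese NLP. The conclusion: Chinese processing does have its challenges, but its level is not far behind. As with NLP for English or other European languages, the most advanced Chinese NLP systems have also entered the era of large-scale big data applications.

Chapter 7: Sentiment mining (舆情挖掘) in practice. Sentiment mining is mining too; it gets its own chapter because it is my current R&D focus, and because it is the trickiest yet highly valuable NLP application, whose examples can spark the imagination about big data mining. This chapter collects Chinese and foreign cases of sentiment mining and tracks hot topics of recent years. Some of it is tongue-in-cheek, with a fair amount of play, including ranking male and female celebrities and even mining their gossip.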

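Sentiment mining is much harder than fact mining. The two have much in common in architecture and methodology, yet the difficulty feels worlds apart, because subjective language is one of the harder sides of human language. Strictly speaking, sentiment analysis belongs to extraction, and "sentiment extraction" would be the more accurate name, though everyone is used to "sentiment analysis"; opinion mining is what belongs to mining (or mining of public opinions and sentiments). The work most often reported in academia is actually sentiment classification, but classification only scratches the surface of sentiment analysis. 舆情 has two parts: 舆 is public opinion, and 情 is public sentiment. Later, to unify everything under the familiar umbrella of "sentiment", we restricted 情 to the expression of emotion. But mining expressions of emotion is only mining of moods, which maps neatly onto classification: whether two classes (positive, negative), three (positive, negative, neutral), four (喜怒哀乐: joy, anger, sorrow, delight), or n, it is classification all the same. Deep sentiment analysis cannot stop at classifying emotions; it must find what lies behind them. This is why we stress mining the reasons behind emotions: people cannot just have emotions (like or dislike) and conclusions (adopt or not adopt) without giving reasons. Emotions alone are merely venting; the reasons are the concrete intelligence meant to inform, persuade, or influence, and they can support decision making. Mining has two main purposes. The first is to aggregate this intelligence statistically and give an overview, whether as charts or as visualizations such as word clouds. The second is to let users drill down from this intelligence in any direction and follow the trail. We often show only the first, but the real value lies in the second (a system demo can show its power; a blog post can hardly convey its dynamics). The second is what truly shows the system's power; the first is just a static report. Deep sentiment analysis is the hardest nut to crack among NLP applications.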

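Chapter 8, the last chapter: NLP anecdotes. These are all stories, some from personal experience and some seen or heard.

I hope this NLP University offers content and perspectives not found in NLP classrooms and textbooks. Several hundred posts have accumulated so far; they are sorted into major categories, and each post links to related ones wherever possible.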

【相关】

科学网【NLP University