Chart Parsing Chinese Character Strings

W. Li. 1997. Chart Parsing Chinese Character Strings. In
Proceedings of the Ninth North American Conference on Chinese
Linguistics (NACCL-9). Victoria, Canada.

Chart Parsing Chinese Character Strings [1]

 

Wei  LI

Simon Fraser University
Burnaby B.C. V5A 1S6 CANADA ([email protected]) 

 

ABSTRACT

This paper examines problems in word identification for a Chinese natural language processing system and presents our solution to these problems. In conventional systems, written Chinese parsing takes two steps: (1) a segmentation preprocessor for word identification (segmenter); (2) a grammar parsing the string of identified words. Morphological analysis, when required, as in the case of productive word formation, has to be incorporated in the segmenter. This matches the conventional morphology-before-syntax architecture. We will demonstrate the theoretical defect of this architecture when applied to Chinese. This leads to the conclusion that segmentational approach, despite its being the mainstream in Chinese computational morphology, is in general not adequate for the task of Chinese word identification. To solve this problem, a full grammar should be made available. Therefore, we take an alternative one-step approach. We have implemented an integrated grammar of morphology and syntax for directly parsing a string of Chinese characters, building both morphological and syntactic structures. Compared with the conventional two-step approach, our strategy has advantages in resolving ambiguity in word identification and in handling productive word formation.

  1. Introduction

A written Chinese sentence is a string of characters with no blanks to mark word boundaries. In conventional systems, Chinese parsing takes two steps as shown in the following Figure 1: (1) a segmentation preprocessor (called segmenter) for word identification; (2) a word based parsing grammar, building syntactic structures (Feng 1996, Chen & Liu (1992).

hpsg4

 

In contrast, we take an alternative one-step approach, as shown in Figure 2 below. We have implemented a grammar named W‑CPSG (for Wei's Chinese Phrase Structure Grammar). W‑CPSG integrates morphology and syntax for character based parsing, building both morphological and syntactic structures.

hpsg5

In the two-step architecture, the purpose for the segmenter is to properly identify a string of words to feed syntax. This is not an easy task due to the possible involvement of the segmentation ambiguity. For example, given a string of 4 Chinese characters 研究生命, the segmentation ambiguity is shown in (1.a) and (1.b) below.

(1.)  研究生命

(a)        研究生               |
graduate student         | life or destiny

(b)        研究    | 生命
study   | life

The resolution of the above ambiguity in the segmenter is a hopeless job because such ambiguity is syntactically conditioned. For sentences like 研究生命金贵 (life for graduate students is precious), (1.a) is the right identification. For the phrase 研究生命起源 (to study the origin of life), (1.b) is right. So far there are no segmenters which can handle this properly and guarantee right word segmentation (Feng 1996). In fact, there can never be such segmenters as long as a grammar is not brought in. This is a theoretical defect of all Chinese analysis systems in the conventional architecture. We have solved this problem in our morphology-syntax integrated W‑CPSG. Word identification in our design becomes a by-product of parsing instead of a pre-condition for parsing.

In the text below, Section 2 investigates problems with the conventional two-step approach. In Section 3, we will present W‑CPSG one-step approach and demonstrate how W‑CPSG parsing solves these problems. The following is a list for abbreviations used in this paper.

A (Adjective); AF (Affix); BM (Bound Morpheme);
CLA (Classifier); CLAP (Classifier Phrase);
DE (Chinese particle introducing a modifier of noun); DEP (DE Phrase);
DE3 (Chinese particle introducing a modifier of result or capability);
DET (Determiner); LE (Chinese perfective aspect marker);
N (Noun); NP (Noun Phrase); P (Preposition); PP (Prepositional Phrase);
S (Sentence); V (Verb); VP (Verb Phrase); Vt (Transitive Verb)

  1. Problems Challenging Segmenters

In general, there are two basic problems for segmenters, namely, segmentation ambiguity and productive word formation.

2.1. segmentation ambiguity

This sub-section studies the segmentation ambiguity for Chinese word identification. We indicate that this ambiguity is structural in nature. Therefore it should be captured by structural trees via parsing. We conclude that a parsing grammar is indispensable in the resolution of the segmentation ambiguity.

Behind all segmenters are procedure based segmentation algorithms. Most proposals are some modified versions of large-lexicon based matching algorithms. As an underlying hypothesis, a longer match overrides a shorter match, hence the name maximum match. Decided by the  direction of the  procedure, i.e. whether  the segmentation proceeds from left (the beginning of a string) to right (the end of the string) or from right to left, we have two general types of maximum match: (1) FMM (Forward Maximum Match) algorithm; (2) BMM (Backward Maximum Match) algorithm (Feng 1996).

According to Liang 1987, segmenters have trouble with cases involving the segmentation ambiguity. There are two types of segmentation ambiguity: the cross ambiguity (AB|C vs. A|BC) and the embedded ambiguity (AB vs. A|B).

To detect possible ambiguity, many researchers use the technique of combining the FMM algorithm and the BMM algorithm. When the output of FMM and BMM are different, there must be some ambiguity involved. The following table lists the cases associated with the FMM and BMM combined approach.[2]

hpsg6

The following 3 examples all contain a cross ambiguity sub-string 研究生命 with 2 segmentation possibilities: 研究生|命 and 研究|生命. Example (4.) is a genuinely ambiguous case. Genuinely ambiguous sentences cannot be disambiguated within the sentence boundary, rendering multiple readings.

(2.) case 1:      研究生命金贵。

(a)        研究生                |      | 金贵                  (FMM: correct)
graduate student         | life   | precious
Life for graduate students is precious.

(b) * 研究 | 生命    |起源                                   (BMM: incorrect)
study        | life     | precious

(3.) case 2:       研究生命起源。

(a) *     研究生              | 命     | 起源                       (FMM: incorrect)
graduate-student       | life   | origin

(b)        研究     | 生命    | 起源                                (BMM: correct)
study   | life     | origin
to study the origin of life

(4.) case 3:       研究生命不好。

(a)        研究生                   | 命                 |        |      (FMM: correct)
graduate student         | destiny        | not     | good
The destiny of graduate students is not good.

(b) 研究 | 生命   | 不      | 好                                      (BMM: correct)
study    | life     |  not    | good
It is not good to study life.

The following example is a complicated case of cross ambiguity, involving more than 2 ways of segmentation. Both the FMM segmentation 出现|在世|界 and the BMM segmentation 出|现在|世界 are wrong. A third segmentation 出现||世界 is right.

(5.)  case 4:      出现在世界东方。

(a) * 出现 | 在世          |      | 东方                       (FMM: incorrect)
appear     | be-alive   | BM   | east

(b) * 出  | 现在  | 世界    | 东方                               (BMM: incorrect)
out        | now   | world | east

(c)  出现  |     | 世界     | 东方                               (correct)
appear    | at    | world  | east
to appear in the east of the world

In the following examples (6.) through (8.), ¿¾°×Êí involves embedded ambiguity. As separate words, the verb ¿¾ (bake) and the NP °×Êí (sweet potato) form a VP. As a whole, it is a compound noun ¿¾°×Êí (baked sweet potato). In cases of the embedded ambiguity, FMM and BMM always make the same segmentation, namely AB instead of A|B. It may be the only right choice, as seen in (6.). It may be wrong as shown in (7.). It may only be half right, as in the case of genuine ambiguity shown in (8.).

(6.) case 5:       他吃烤白薯。

(a)        他       |       | 烤白薯                                 (FMM&BMM: correct)
he       | eat     | baked sweet potato
He eats baked sweet potatoes.

(b) *     他       |       |       | 白薯                        (incorrect)
he       | eat     | bake | sweet potato

(7.) case 6:       他会烤白薯。

(a) *     他       |       | 烤白薯                                 (FMM&BMM: incorrect)
he       | can    | baked sweet potato

(b)        他      |       |       | 白薯                         (correct)
he      | can   | bake | sweet potato
He can bake sweet potatoes.

(8.) case 7:       他喜欢烤白薯。

(a)       他       | 喜欢 | 烤白薯                                  (FMM&BMM: correct)
he      | like  | baked sweet potato
He likes baked sweet potatoes.

(b)        他       | 喜欢   |       | 白薯                       (correct)
he      | like     | bake | sweet potato
He likes baking sweet potatoes.

Compare the above examples, we see that there are severe limitations for the FMM-BMM combined approach. First, it only serves the purpose of ambiguity detection (when the results of FMM and BMM do not match), and contributes nothing to its resolution. It has no way to tell which segmentation is right (compare case 1 and case 2), and, worse still, whether both are right (case 3) or wrong (case 4). Second, even when the results of FMM and BMM do match, it by no means guarantees right segmentation (case 6). Third, as far as detection is concerned, it is only limited to the problems for the cross ambiguity. The existence of the embedded ambiguity defines a blind area for this way of detection (case 6 and case 7). This is because the underlying maximum match hypothesis assumed in the FMM and BMM segmentation algorithms is directly contradictory to the phenomena of the embedded ambiguity.

In face of ambiguity, how do people judge which segmentation is right in the first place? It really depends on whether we can understand the sentence or phrase based on the segmentation. In computational linguistics, this is equivalent to whether the segmented string can be parsed by a grammar. The segmentation ambiguity is one type of structural ambiguity, not in essence different from typical structural ambiguity like, say, PP attachment ambiguity. In fact, PP attachment problem is a counterpart of the cross ambiguity in English syntax, as shown below.

(9.)       Cross ambiguity in PP attachment: V NP PP

(a) [V NP] [PP]
(b) [V] [NP PP]

Therefore, like English PP attachment, Chinese word segmentation ambiguity should also be captured by a parsing grammar. A parser resolves the ambiguity if it can, or detects the ambiguity in the form of multiple parses when it cannot. As shall be demonstrated in Section 3, wrong segmentation will not lead to a parse. Right segmentation results in at least one successful parse. In any case, at least a parser (hence a grammar on which the parser is based) is required for proper word identification.

The important thing is that the ambiguity in word identification is a grammatical problem. The attempt to solve this problem without a grammar is bound to be crippled. Since traditional segmentation algorithms are non-grammatical in nature, they are theoretically not equipped for handling such ambiguity. A successive model of segmentater-before-grammar attempts to do what it is not yet able to do. This is the theoretical defect for almost all existing segmentation approaches.

(10.)     Conclusion for 2.1.

The segmentation ambiguity in word identification is one type of structural ambiguity. In order to solve this problem, a parsing grammar is indispensable.

2.2. productive word formation

Unless morphological analysis is incorporated, lexicon match based segmenters will have trouble with new words produced by Chinese productive word formation, including reduplication, derivation and the formation of proper names. When the morphology component is incorporated in the segmenter, the two-step design becomes a variant of the conventional morphology-before-syntax architecture. But this architecture is not effective when the segmentation ambiguity is at issue.

In the following, we investigate reduplication, derivation and proper names one by one. In each case, we find that there is always a possible involvement of the segmentation ambiguity. This problem cannot be solved by a morphology component independent of syntax. We therefore propose a  grammar incorporating both morphology and syntax.

2.2.1. reduplication

Reduplication in Chinese serves various grammatical and/or lexical functions. Not all reduplications pose challenges to segmentation algorithms. Assume that a word consists of 2 characters AB, reduplication of the type AB --> ABAB is no problem. What becomes a problem for word segmentation is the reduplication of the type AB --> AABB or its variants like AB --> AAB. For example, a two-morpheme verb with verb-object relation at the level of morphology has the following way of reduplication.

(11.) Verb Reduplication: AB --> AAB  (for diminutive use)

分心 (get distracted) --> 分分心 (get distracted a bit)

让他分分心。

让       | 他     | 分分心
let       | he    | get distracted a bit
Let him relax a while.

It seems that reduplication is a simple process which can be handled by incorporating some procedure-based function calls in the segmentation algorithm. If a 3-character string, say 分分心, cannot be found in the lexicon, the reduplication procedure will check whether the first 2 characters are the same, and if yes, delete one of them and consult the lexicon again. But, such expansion of the segmentation algorithm is powerless when the segmentation ambiguity is involved. For example, it is wrong to regard 分分心 as of reduplication in the following sentence.

(12.)   这件事十分分心。

(a) *     这       |      |         |       | 分分心
this      | CLA  | thing  | ten    | get distracted a bit

(b)        这       |       |         | 十分    | 分心
this      | CLA  | thing  | very   | distracting
This thing is very distracting.

2.2.2. derivation

In Contemporary Mandarin, there have come to be a few morphemes functioning similarly to English affixes, e.g. 可 (-able) turns a transitive verb into an adjective.

(13.)     可 (-able) + Vt --> A

可 (-able) + 读 (Vt: read) -->   可读 (A:readable)

这本书非常可读。

这       | 本     | 书       | 非常   | 可读
this    | CLA  | book  | very  | readable
This book is very readable.

The suffix 性 works just like '-ness',  changing an adjective into an abstract noun.  The derived noun 可读性 (readability) in the following example, similar to its English counterpart, involves a process of double affixation.

(14.)     A + 性 (-ness)  --> N
可 (-able) + 读 (Vt: read) -->   可读 (A:readable)
可读 (A:readable) + 性 (-ness) --> 可读性 (N:readability)

这本书的可读性

这       | 本      | 书       |      | 可读性
this    | CLA  | book  | DE    | readability
this book's readability

The suffix Í· can change a transitive verb into an abstract noun adding to it the meaning "worth-of".

(15.) Vt + 头 (AF:worth of) --> N

吃 (Vt:eat) + 头 (AF:worth of) --> 吃头 (N:worth of eating)

这道菜没有吃头

这       | 道     | 菜      | 没有             | 吃头
this    | CLA  | dish  | not-have    | worth-of-eating
This dish is not worth eating.

It is not difficult to incorporate in the segmenter these derivation rules for the morphological analysis. But, as in the case of reduplication, there is always a danger of wrongly applying the rules due to possible ambiguity involved. For example, 吃头 is a sub-string of embedded ambiguity. It can be both a derived noun 'worth of eating' or two separate words as seen in the following example.

(16.)  他饿得能吃头牛。

(a) *     他      | 饿             |       |      | 吃头·                       |
             he     | hungry    | DE3  | can  | worth-of-eating   | ox

(b)        他      | 饿              |      |      |       |       |
              he     | hungry    | DE3  | can  | eat    | CLA  | ox
He is so hungry that he can eat an ox.

2.2.3. proper name

Proper names are of 2 major types: (1) Chinese names; (2) transliterated foreign names. In this paper, we only target the identification of Chinese names and leave the problem of transliterated foreign names for further research (Li, 1997b).

A Chinese human name usually consists of a family name followed by a given name. Chinese family names form a clear-cut closed set. A given name is usually either one character or two characters. For example, the late Chinese chairman 毛泽东 (Mao Zedong) used to have another name 李得胜 (Li Desheng). In the lexicon, 李 is a registered family name. Both 得胜 and 胜 mean 'win'. This may lead to 3 ways of word segmentation: (1) 李得胜; (2) 李|得胜; (3) 李得|胜, as seen in the following examples.

(17.)    李得胜了

(a)  李    | 得胜 | .
       Li    | win  | LE
Li won.

(b)   李得   |      |
        Li De | win  | LE
Li De won.

(c) *  李得胜          | .
          Li Desheng | LE

(18.)   李得胜胜了 。

(a) *  李 | 得胜 |     | .
         Li  | win | win | LE

(b) *  李得   |      |      |
          Li De | win  | win  | LE

(c)   李得胜            |      |
Li Desheng   | win  | LE
Li Desheng won.

Since the given name like µÃʤ is an arbitrary string of 1 or 2 characters, the morphological analysis of the full name should start with family name which can optionally combine with any 1 or 2 characters to form candidate proper names Àî, ÀîµÃ and ÀîµÃʤ. In other words, family name serves as the left boundary of a full name and the length is used to determine candidates. The right segmentation can only be made via sentence analysis as shown in the above examples.

Most Chinese place proper names are made of 1 to 3 characters, for example, 武汉市(Wuhu City), 南陵县 (Nanling County). The arbitrariness of these names makes any sub-strings of n characters (0<n<4) in the sentence a suspect. Fortunately, in most cases we may find boundary indicators of these names, like 省 (province), 市 (city), 县 (county), etc. Once the boundary indicator is located, the similar technique in using Chinese family name to identify the given name can be applied to select candidates of place proper names for verification through grammatical analysis.

In general, there is always a possibility of ambiguity involvement in the formation of all types of proper names.

(19.)     Conclusion for 2.2.

Due to the possible involvement of ambiguity, a parsing grammar for morphological analysis as well as for sentence analysis is required for the proper identification of the words produced by Chinese productive word formation.

  1. W‑CPSG Grammatical Approach

This section presents W‑CPSG approach to Chinese word identification and morphological analysis. We will demonstrate how a parser based on W‑CPSG solves the problems of the word identification ambiguity and productive word formation.

3.1. rationale of W‑CPSG approach

There have been a number of word identification algorithms based on both morphological and syntactic information (see survey in Feng 1996 and Sun & Huang 1996). Most such approaches do not use a self-contained grammar to parse the complete sentence. They are confined to the conventional two-step process of the segmentation-before-grammar design. As long as the word identification procedure is independent of a parsing grammar, it is extremely difficult to make full use of grammatical information to resolve ambiguity in word identification. Careful tuning up and sophisticated design improves the precision but will not change the theoretical defect of all such approaches. Chen & Liu acknowledges the limitation of their approach due to the lack of a grammar.  “However”, they say,  “it is almost impossible to apply real world knowledge nor to check the grammatical validity at this stage”. (Chen & Liu 1992, p.105) Why impossible at this stage? Because these segmentation systems are based on the concept of  two-step architecture and the grammar is not yet available! As we have demonstrated, the final judgment for proper word identification can hardly be made until the whole sentence is parsed, hence the requirement of a full grammar. Therefore, we are forced to make a compromise in involving how much of grammatical information depending on how much word identification precision we can afford to sacrifice. Needless to say, there is significant double-labor between such a word segmentation procedure and the following stage of parsing. As more and more grammatical information is used to achieve better precision, the overhead of this double labor becomes more serious. We consider the double labor as one strong argument against the two-step approach. If enough grammatical information is incorporated, it is essentially equivalent to a grammar. And  the segmenter will be equivalent to a parser.  Then why two grammars, one for word identification, and one for sentence parsing? Why not combine them? That is exactly what we are proposing in W‑CPSG - one-step approach based on an integrated grammar,  eliminating the necessity of a segmentation preprocessor.

3.2. W‑CPSG character-based parsing

W‑CPSG (Li. 1997a, 1997b) is a lexicalized Chinese unification grammar. The work on W‑CPSG is taken in the spirit of the modern linguistic theory Head-driven Phrase Structure Grammar (Pollard & Sag 1994). W‑CPSG consists of two parts: a minimized general grammar and an enriched lexicon. The general grammar only contains a handful of PS (phrase structure) rules, covering complement structure, modifier structure, conjunctive structure and morphological structure. This is the nature of lexicalized grammars. PS rules in such grammars are very abstract. Essentially, all they say is one thing, that is, 2 signs can combine so long as the lexicon so indicates. The lexicon houses lexical entries with their linguistic description in feature structures. Potential morphological structures as well as potential syntactic structures are lexically encoded. In syntax, a word expects another sign to form a phrase. In morphology, a morpheme expects another sign to form a word. For example, the prefix 可 (-able) expects a transitive verb to form an adjective. The morphological PS rule will build the morphological structure when a transitive verb does appear after the prefix 可 (-able) in the input string.

We now illustrate how W‑CPSG parses a string of Chinese characters by a sample parsing chart. The prototype of W‑CPSG was written in ALE, a grammar compiler developed on top of Prolog by Carpenter & Penn (1994). ALE compiles W‑CPSG into a Chinese parser, a Prolog program ready to accept a string of characters for analysis. W‑CPSG parse tree embodies both morphological analysis and syntactic analysis, as shown below.

hpsg12

 

This is so-called bottom-up parsing. It starts with lexicon look-up. Edges 1 through 7 are lexical edges. Other edges are phrasal edges. Each edge represents a sign, i.e. a character (morpheme), a word, a phrase or a sentence. Lexical edges result from a successful match between the signs in the input string and the entries in the lexicon during lexicon look-up. For example, 可 (-able), 读 (read) and 性 (-ness) are all registered entries in the lexicon, so they get matched and shown by edge 5, edge 6 and edge 7. Words produced by productive word formation present themselves as phrasal edges, e.g. edge ((5+6)+7) for 可读性 (readability). For the sake of concise illustration, we only show two pieces of information for the signs in the chart, namely category and interpretation with a delimiting colon (lexical edges are only labeled for either category or interpretation). The parser attempts to combine the signs according to PS rules in the grammar until parses are found. A parse is an edge which ranges over the whole string. The parse ((((1+2)+3)+4)+((5+6)+7)) for (20.) represents a binary structural tree based on the W‑CPSG analysis, as shown below.

hpsg13

3.3. ambiguity resolution in word identification

Given the resources of a phrase structure grammar like W‑CPSG, a parser based on standard chart parsing algorithms can handle both the cross ambiguity and the embedded ambiguity provided that a match algorithm based on exhaustive lookup instead of maximum match is adopted for lexicon lookup. All candidate words in the input string are presented to the parser for judgment. Ambiguous segmentation becomes a natural part of parsing: different ways of segmentation add different edges, a successful parse always embodies right identification. In other words, word identification in our design becomes a by-product of parsing instead of a pre-condition for parsing. The following example of the complicated cross ambiguity illustrates how the W‑CPSG parser resolves ambiguity. As seen, both the FMM segmentation (represented by the edge sequence 8-9-5-10) and the BMM segmentation (represented by 1-11-12-10) are in the chart as a result of exhaustive lexicon lookup. They are proved to be wrong because they do not lead to a successful parse according to the grammar. As a by-product, the final parse (8+(3+(12+10))) automatically embodies rightly identified word sequence 8-3-12-10, i.e. 出现  (appear) |在  (at) |世界 (world) |东方 (east).

hpsg10

 

Exhaustive lookup also makes an embedded ambiguity sub-string like 烤红薯 no longer a blind area for word identification, as shown in (22.) below. All the candidate words in the sub-string including 烤 (bake), 红薯 (sweet potato), 烤红薯 (baked sweet potato) are added to the chart as lexical edges (edge 4, edge 8 and edge 10). This is a case of genuine ambiguity, resulting in 2 parses corresponding to 2 readings. The first parse (1+(7+10)) identifies the word sequence 他|喜欢|烤红薯, and the second parse (1+(9+(4+8))) a different sequence 他|喜欢|烤|红薯. Edge 7 and edge 9 represent two lexical entries for the verb 喜欢 (like), with different syntactic expectation (categorization). One expects an NP object, notated in the chart by like<NP>, and the other expects a VP complement, notated by like<VP>.

hpsg11

 

We now illustrate how Chinese proper names are identified in W‑CPSG parsing. In the W‑CPSG lexicon, Chinese family name is encoded to optionally expect the given name. Due to the arbitrariness of given names, no other constraint except for the length (either 1 character or 2 characters) is specified in the expectation. Therefore, we have three candidates for proper names in the following example, namely 李 (Li), 李得 (Li De), 李得胜 (Li Desheng), represented respectively by edge 1, edge (1+2) and the NP edge (1+5).[3] The first two candidates contribute to two valid parses while the third does not, hence the identification of the word sequences 李|得胜|了 and 李得|胜|了.

hpsg8

 

Now we add one more character 胜 (win) to form a new sentence, as shown in (24.) below.

hpsg9

 

The first two candidate proper names 李 (Li) and 李得 (Li De) no longer lead to parses. But the third candidate 李得胜 (Li Desheng) becomes part of the parse as a subject NP. The parse (((1+6)+4)+5) corresponds to the identification of the only valid word sequence 李得胜|胜|了.

Finally, we give an example to demonstrate how W‑CPSG handles reduplication in parsing and word identification. The sample sentence to be processed by the parser is 让他分分心 (Let him relax a while), involving the AB-->AAB type verb reduplication for diminutive use.

In most lexicons, 分心 (distract-heart: get distracted) is a registered 2-morpheme verb with internal morphological verb-object relation. Therefore, the reduplication is considered morphological. But in Chinese syntax, we also have a  general verb reduplication rule of the type A-->AA for diminutive use, for example, 看(look) --> 看看(have a look). This morphological verb reduplication rule AB-->AAB and the syntactic verb reduplication rule A-->AA are essentially the same rule in Chinese grammar. 分心 sits in the gray area between morphology and syntax. It looks both like a word (verb) and a phrase (VP). Lexically, it corresponds to one generalized sense (concept) and the internal combination is idiomatic, i.e. 分 (distract) must combine with 心 (heart) to mean 'get distracted'. But, structurally, the combination of 分 and 心 is not fundamentally different from a VP consisting of Vt and NP, as in the phrase 看电影 (see a film). In fact, there is no clear-cut boundary between Chinese morphology and syntax. This morphology-syntax isomorphic fact serves as a further argument to support the W‑CPSG design of integrating morphology and syntax in one grammar module. Although the boundary between Chinese morphology and syntax is fuzzy, hence no universal definition of basic notions like word and phrase, the division can be easily defined system internally in an integrated grammar. In W‑CPSG,  分心 is treated as a phrase (VP) instead of a word (verb). The lexical entry 分 (distract) is coded to obligatorily expect the literal 心 (heart) as its syntactic object, shown in the following chart by the notation V<>. This approach has the advantage of eliminating the doubling of the reduplication rule for diminutive use in both syntax and morphology, making the grammar more elegant. The verb reduplication rule is implemented as a lexical rule in W‑CPSG.[4] This lexical rule creates a reduplicated verb with added diminutive sense, shown by edge 8 (a lexical edge).  The whole parsing process is illustrated below.

hpsg7

 

 

REFERENCES

Carpenter, B. & Penn, G. (1994): ALE, The Attribute Logic Engine, User's Guide, Carnegie Mellon University

Chen, K-J., & S-H. Liu (1992): "Word identification for mandarin Chinese sentences". Proceedings of the 15th International Conference on Computational Linguistics, Nantes, 101-107.

Feng, Z-W. (1996): "COLIPS lecture series - Chinese natural language processing",  Communications of COLIPS, Vol.6, No.1 1996, Singapore

Li, W. (1997a): "Outline of an HPSG-style Chinese reversible grammar", Proceedings of The Northwest Linguistics Conference-97 (NWLC-97, forthcoming), UBC, Vancouver, Canada

Li, W. (1997b): W‑CPSG: A Lexicalized Chinese Unification Grammar And Its Application, Doctoral dissertation (on-going), Simon Fraser University, Canada

Liang, N. (1987): "Shumian Hanyu Zidong Fenci Xitong - CDWS" (Automatic word segmentation system for written Chinese - CDWS), Journal of Chinese Information Processing, No.2 1987, pp 44-52, Beijing

Pollard, C.  & I. Sag (1994): Head-Driven Phrase Structure Grammar,  Centre for the Study of Language and Information, Stanford University, CA

Sun, M-S. & C-N. Huang  (1996): "Word segmentation and part of speech tagging for unrestricted Chinese texts" (Tutorial Notes for International Conference on Chinese Computing ICCC'96), Singapore

~~~~~~~~~~~~~~~~~~~

[1] The author benefited from the insightful discussion with Dr. Dekang Lin on the feasibility of parsing Chinese character strings instead of word strings. Thanks also go to Paul McFetridge and Fred Popowich for their supervision and encouragement.

[2] This table is adapted from the following table in Sun & Huang (1996).

case 1 The output of FMM and BMM are different, but both are incorrect 0.054%
case 2 The output of FMM and BMM are different, but only one is correct 9.24%
case 3 The output of FMM and BMM are identical, but incorrect 0.41%
case 4 The output of FMM and BMM are identical, and correct 90.30%

The 4 cases which they listed are not logically exhaustive in terms of sentence based processing (i.e. when discourse is not involved in a system). In particular, there is another case when the output of FMM and BMM are different, and both are correct. We call this a case of genuine cross ambiguity.

[3] Note that there is another S edge (1+5) in the chart. These two edges are structurally different, created via different PS rules. The NP edge (1+5) is formed through the morphological PS rule, combining the family name (edge 1) and its expected given name (edge 5). In the S edge (1+5). however, it is the subject rule (one of the complement PS rules) that decides the combination of the predicate (edge 5) and its expected subject NP (edge 1).

[4] Lexical rules are favored by many linguists to capture redundancy in the lexicon instead of the conventional approach of syntactic transformation. Lexical rules are applied at compile time to form an expanded lexicon before parsing starts.

 

[Related]

Interaction of syntax and semantics in parsing Chinese transitive verb patterns 

Handling Chinese NP predicate in HPSG 

Notes for An HPSG-style Chinese Reversible Grammar

Outline of an HPSG-style Chinese reversible grammar

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter VII Concluding Remarks

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

Interaction of syntax and semantics in parsing Chinese transitive verb patterns

Interaction of syntax and semantics in parsing Chinese transitive verb patterns *
(old paper in Proceedings of International Chinese Computing Conference, ICCC'96)

Wei  LI

Department of Linguistics, Simon Fraser University
Burnaby, B.C. V5A 1S6 CANADA (email: [email protected])

Keywords: Chinese processing, transitive pattern, syntax, semantics, lexical rule, HPSG

Abstract

This paper addresses the problem of parsing Chinese transitive verb patterns (including the BA construction and the BEI construction) and handling the related phenomena of semantic deviation (i.e. the violation of the semantic constraint).

We designed a syntax-semantics combined model of Chinese grammar in the framework of Head-driven Phrase Structure Grammar [Pollard & Sag 1994]. Lexical rules are formulated to handle both the transitive patterns which allow for semantic deviation and the patterns which disallow it. The lexical rules ensure the effective interaction between the syntactic constraint and the semantic constraint in analysis.

The contribution of our research can be summarized as:

(1) the insight on the interaction of syntax and semantics in analysis;
(2) a proposed lexical rule approach to semantic deviation based on (1);
(3) the application of (2) to the study of the Chinese transitive patterns;
(4) the implementation of (3) in an unification-based Chinese HPSG prototype.

  1. Background

When Chomsky proposed his Syntactic Structures in Fifties, he seemed to indicate that syntax should be addressed independently of semantics. As a convincing example, he presented a famous sentence:

1)             Colorless green ideas sleep furiously.

Weird as it sounds, the grammaticality of this sentence is intuitively acknowledged: (1) it follows the English syntax; (2) it can be interpreted. In fact, there is only one possible interpretation, solely decided by its syntactic structure. In other words, without the semantic interference, our linguistic knowledge about the English syntax is sufficient to assign roles to each constituent to produce a reading although the reading does not seem to make sense.

However, things are not always this simple. Compare the following Chinese sentences of the same form NP NP V:

2a)           dianxin  wo           chi           le.
                Dim-Sum I               eat           LE.
The Dim Sum I have eaten.
Note:        LE is a particle for perfect aspect.

2b)   wo dianxin chi le.
I have eaten the Dim Sum.

Who eats what? There is no formal way but to resort to the semantic constraint imposed by the notion eat to reach the correct interpretation [Li, W. & McFetridge 1995].

Of course, if we want to maintain the purity of syntax, it could be argued that syntax will only render possible interpretations and not the interpretation.  It is up to other components (semantic filter and/or other filters) of grammar to decide which interpretation holds in a certain context or discourse. The power of syntax lies in the ability to identify structural ambiguities and to render possible corresponding interpretations. We call this type of linguistic design a syntax-before-semantics model. While this is one way to organize a  grammar, we found it unsatisfactory for two reasons. First, it does not seem to simulate the linguistic process of human comprehension closely.  For human listeners, there are no ambiguities involved in sentences 2a) and 2b). Secondly, there is considerable cost on processing efficiency in terms of computer implementation. This efficiency problem can be very serious in the analysis of languages like Chinese with virtually no inflection.

Head-driven Phrase Structure Grammar (HPSG) [Pollard & Sag 1994, 1987] assumes a lexicalist approach to linguistic analysis and advocates an integrated model of syntax and the other components of grammar. It serves as a desirable framework for the integration of the semantic constraint in establishing syntactic structures and interpretations. Therefore, we proposed to enforce the semantic constraint that animate being eats food directly in the lexical entry chi  (eat) [Li, W. & McFetridge 1995]: chi (eat) requires an animate NP subject and a food NP object. It correctly addresses who-eats-what problem for sentences like 2a) and 2b). In fact, this type of semantic constraint (selection restriction) has been widely used for disambiguation in NLP systems.

The problem is, the constraint should not always be enforced. In the practice of communication, deviation from the constraint is common and deviation is often deliberately applied to help render rhetorical expressions.

 

3) xiang      chi           yueliang,  ni             gou           de3    zhao       me?
    want        eat           moon,       you          reach       DE3  -able          ME?
Wanting to eat the moon, but can you reach it?
Note:  DE3 is a particle, introducing a postverbal adjunct of result or capability. ME is a sentence final particle for yes-no question.

4) dajia         dou   chi           shehui zhuyi,           neng         bu            qiong       me?
     people      all      eat           social -ism,               can            not           poor         ME
Everyone is eating socialism, can it not be poor?

yueliang (moon) is not food, of course. It is still some physical object, though. But in 4), shehui zhuyi (socialism) is a purely abstract notion. If a parser enforces the rigid semantic constraint, there are many such sentences that will be rejected without getting a chance to be interpreted. The fact is, we do have interpretations for 3) and 4). Hence an adequate grammar should be able to accommodate those interpretations.

To capture such deviation, Wilks came up with his Preference Semantics [Wilks 1975, 1978]. A sophisticated mechanism is designed to calculate the semantic weight for each possible interpretation, i.e. how much it deviates from the preference semantic constraint. The final choice will be given to the interpretation with the most semantic weight in total. His preference model simulates the process of how human comprehends language more closely than most previous approaches.

The problem with this design is the serious computational complexities involved in the model [Huang 1987]. In order to calculate the semantic weight, the preference semantic constraint is loosened step by step. Each possible substructure has to be re-tried with each step of loosening. It may well lead to combinatorial explosion.

What we are proposing here is to look at semantic deviation in the light of the interaction of the syntactic constraint and the semantic constraint. In concrete terms, the loosening of the semantic constraint is conditioned by syntactic patterns. Syntactic pattern is defined as the representation of an argument structure in surface form. A pattern consists of 2 parts: a structure's syntactic constraint (in terms of the syntactic categories and configuration, word order,  function words and/or inflections) and its interpretation (role assignment). For example, for Chinese transitive structure, NP V NP: SVO is one pattern, NP NP V: SOV is another pattern, and NP [ba NP] V: SOV (the BA construction) is still another. The expressive power of a language is indicated by the variety of patterns used in that language. Our design will account for some semantic deviation or rhetorical phenomena seen in everyday Chinese without the overhead of computational complexities. We will focus on Chinese transitive verb patterns for illustration of this approach.

  1. Chinese transitive patterns

Assuming three notional signs wo (I), chi (eat) and dianxin (Dim Sum), there are maximally 6 possible combinations in surface word order, out of which 3 are grammatical in Chinese.[1]

5a)           wo chi le dianxin.                                   SVO
5b)           wo dianxin chi le.                                   SOV
5c)           dianxin wo chi le.                                    OSV

SVO is the canonical word order for Chinese transitive structure. When a string of signs matches the order NP V NP, the semantic constraint has to yield to syntax for interpretation.

NP V NP: SVO

6)  daodi         shi     ni             zai         du       shu          ne,
haishi                 shu           zai         du       ni             ne?

     on-earth     be     you          ZAI        read     book        NE,
or                        book        ZAI        read     you          NE?

Are you reading the book, or is the book reading you, anyway?
Note:        ZAI is a particle for continuous aspect.
NE is a sentence final particle for or-question.

Same as in the English equivalent, the interpretation of  6) can only be SVO, no matter how contradictory  it might be to our common sense. In other words, in the form of NP V NP, syntax plays a decisive role.

In contrast, to interpret the form NP NP V as SOV in 2b), the semantic constraint is critical. Without the enforcement of the semantic constraint, the interpretation of SOV does not  hold. In fact, this SOV pattern (NP1 NP2 V: SOV) has been regarded as ungrammatical in a Case Theory account for Chinese transitive structure in the framework of GB. According to their analysis, something similar to this pattern constitutes the D‑Structure for transitive pattern and Chinese is an underlying SOV language (called "SOV Hypothesis": see the survey in Gao 1993). In the surface structure, NP2 is without case on the assumption that V assigns its CASE only to the right. One has to either insert the case-marker ba to assign CASE to it (the BA construction) or move it to the right of V to get its CASE (the SVO pattern). This analysis suffers from not being able to account for the grammaticality of sentences like 2b).  However, by distinguishing the deep pattern SOV from the 2 surface patterns (the SVO and the BA construction), the theory has its merit to alert us that the SOV pattern seems to be syntactically problematic (crippled, so to speak). This is an insightful point, but it goes one step too far in totally rejecting the SOV pattern in surface structure. If we modify this idea, we can claim that SOV is a syntactically unstable pattern and that SOV tends to (not must) "transform" to the SVO or the BA construction unless it is reinforced by semantic coherence (i.e. the enforcement of the semantic constraint). This argument in the light of syntax-semantics interaction is better supported by the Chinese data. In essence, our account is close to this reformulated argument, but in our theory, we do not assume a deep structure and transformation. All patterns are surface constructions. If no sentences can match a construction, it is not considered as a pattern by our definition.

This type of unstable pattern which depends on the semantic constraint is not limited to the transitive phenomena. For example, the type of Chinese NP predicate defined in  [Li, W. & McFetridge 1995] is also a semantics dependent pattern. Compare:

7a)  zhe           zhang       zhuozi                  san          tiao          tui.
        this           Cl.         table(furniture)      three        Cl.            leg
This table is three-legged.
Note:        Cl for classifier.

7b) *        zhe           zhang       ditu                          san          tiao          tui.
                this           Cl.           map(non-furniture)  three        Cl.            leg

There is clearly a semantic constraint of the NP predicate on its subject: it should be furniture (or animate). Without this "semantic agreement", Chinese NP is normally not capable of functioning as a predicate, as shown in 7b).

Between semantics dependent and semantics independent patterns, we may have partially dependent patterns. For example, in NP NP V: OSV, it seems that the semantic constraint on the initial object is less important than the semantic constraint on the subject.

8)   shitou                wo              ye   xiang  chi,    kexi      yao       bu      dong.
   stone(non-food)  I(animate) also want  eat,    pity       chew    not      -able

Even stones I also want to eat, but it's such a pity that I am not able to chew them.

If the constraint on the object matches well, is the subject allowed to be semantically deviant?

9) ?          dianxin                     zhuozi                        chi           le.
                Dim-Sum(food)        table(non-animate)  eat           LE.

Those are the marginal cases, a grammar may choose to be more tolerable to accept it or to be more restrained to reject it.

Unlike SOV, but similar to its English counterpart, OSV is one type of Chinese topic constructions and the relationship between the initial O and V is of long distance dependency.

10a)  dianxin      wo     xiangxin   ni           yiwei        Lisi          chi           le.
          Dim-Sum    I         believe     you          think        Lisi           eat           LE

The Dim Sum I believe you think that Lisi ate.

10b) *      Lisi wo xiangxin ni yiwei dianxin chi le.

10b) will not be accepted in our model because (1) it cannot be interpreted as OSV since it violates the semantic constraint on S: dianxin is not animate; (2) it can neither be interpreted as SOV since it violates the configurational constraint: SOV is simply not of a long distance pattern. In fact, NP NP V: SOV is such a restricted pattern in Chinese that it not only excludes any long distance dependency but even disallows some adjuncts. Compare 11a) in the OSV pattern and 11b) and 11c) in the SOV pattern:

11a)  dianxin      wo           jinjinyouwei             de2           chi           le.
          Dim-Sum      I              with-relish                DE2         eat           LE

The Dim Sum I ate with relish.
Note:        DE2 is a particle introducing a preverbal adjunct of  manner.

11b) *      wo dianxin jinjinyouwei de2 chi le.

11c) *      wo jinjinyouwei de2 dianxin chi le.

There is another pattern of the linear order SOV, the Chinese notorious BA construction. ba is usually regarded as a preposition which introduces a preverbal object for transitive verbs.

NP [ba NP] V: SOV

12a)  wo           ba            dianxin       jinjinyouwei             de2          chi           le.
           I              BA           Dim-Sum     with-relish                DE2         eat           LE

I ate the Dim Sum with relish.

12b)         wo jinjinyouwei de2 ba dianxin  chi le.
With relish, I ate the Dim Sum.

12c)         dianxin  ba wo jinjinyouwei de2  chi le.
The Dim Sum ate me with relish.

12d)         dianxin jinjinyouwei de2 ba wo  chi le.
With relish, the Dim Sum ate me.

For the OSV order, there is another so-called BEI construction. The BEI construction is usually regarded as an explicit passive pattern in Chinese.

NP [bei NP] V: OSV

13a)        dianxin       bei          wo           chi           le.
                Dim-Sum     BEI          I               eat           LE

The Dim Sum was eaten by me.

13b)         wo bei dianxin  chi le.

I was eaten by the Dim Sum.

The BEI construction and the BA construction are both semantics independent. In fact, any pattern resorting to the means of function words in Chinese seems to be sufficiently independent of the semantic constraint.

To conclude, semantic deviation often occurs in some more independent patterns, as seen in 5d2), 6), 8), 12c), 12d), 13b). Close study reveals that different patterns result in different reliance on the semantic constraint, as summarized in the following table.

                syntactic pattern                                 semantic dependence

                NP V NP: SVO                                                    no dependence
                NP [ba NP] V: SOV                                            no dependence
                NP [bei NP] V: OSV                                           no dependence
                NP NP V: OSV                                                    partial dependence
                NP NP V: SOV                                                    full dependence
............

It should be emphasized that this observation constitutes the rationale behind our approach.

  1. Formulation of lexical rules

Based on the above observation, we have designed a syntax-semantics combined model. In this model, we take a lexical rule approach to Chinese patterns and the related problem of semantic deviation.

A lexical rule takes as its input a lexical entry which satisfies its condition and generates another entry. Lexical rules are usually used to cover lexical redundancy between related patterns. The design of lexical rules is preferred by many grammarians over the more conventional use of syntactic transformation, especially for lexicalist theories.

Our general design is as follows, still using chi (eat) for illustration:

(1)   Syntactically, chi (eat) as a transitive verb subcategorizes for a left NP as its subject and a right NP as its object.

(2)   Semantically, the corresponding notion eat expects an entity of category animate as its logical subject and an entity of category food as its logical object. Therefore the common sense (knowledge) that animate being eats food is represented.

(3)   The interaction of syntax and semantics is implemented by lexical rules. The lexical rules embody the linguistic generalizations about the transitive patterns. They will decide to enforce or waive the semantic constraint based on different patterns.

As seen, syntax only stipulates the requirement of two NPs as complements for chi and does not care about the NPs' semantic constraint. Semantics sets its own expectation of animate entity and food entity as arguments for eat and does not care what syntactic forms these entities assume on the surface. It is up to lexical rules to coordinate the two. In our model, the information in (1) and (2) is encoded in the corresponding lexical entry and the lexical rules in (3) will then be applied to expand the lexicon before parsing begins. Driven by the expanded lexicon, analysis is implemented by a lexicalist parser to build the interpretation structure for the input sentence. Following this design, there will be sufficient interaction between syntax and semantics as desired while syntax still remains to be a self-contained component from semantics in the lexicon. More importantly, this design does not add any computational complexities to parsing because in order to handle different patterns, the similar lexical rules are also required even for a pure syntax model.

Before we proceed to formulate lexical rules for transitive patterns, we should make sure what a transitive pattern is. As we defined before, a pattern consists of 2 parts: a structure's syntactic constraint and the corresponding interpretation. Word order is important constraint for Chinese syntax. In addition to word order, we have categories and function words (preposition, particle, etc.). As for interpretation, transitive structure involves 3 elements: V (predicate) and its arguments S (logical subject) and O (logical object). There is a further factor to take into account: Chinese complements are often optional. In many cases, subject and/or object can be omitted either because they can be recovered in the discourse or they are unknown. We call those patterns  elliptical patterns (with some complement(s) omitted), in contrast to full patterns. With these in mind, we can define 10 patterns for Chinese transitive structure: 5 full patterns and 5 elliptical patterns.

We now investigate these transitive patterns one by one and try to informally formulate the corresponding lexical rules to capture them. Please note that the basic input condition is the same with all the lexical rules. This is because they share one same argument structure - transitive structure.

Lexical rule 1:   

                V ((NP1, NP2), (constr1, constr2)) --> NP1 V NP2: SVO

The above notation for the lexical rule should be quite obvious. The input of the rule is a transitive verb which subcategorizes for two NPs: NP1 and NP2 and whose corresponding notion expects two arguments of constr1 and constr2NP is syntactic category, and constr is semantic category (human, animate, food, etc.). The output pattern is in a defined word order SVO and waives the semantic constraint.

Lexical rule 2:   

      V ((NP1, NP2), (constr1, constr2)) --> [NP1, constr1] [NP2, constr2] V: SOV

Please note that the semantic constraint is enforced for this SOV pattern. Since this pattern shares the form NP NP V with the OSV pattern, it would be interesting to see what happens if a transitive verb has the same semantic constraint on both its subject and object. For example, qingjiao (consult) expects a human subject and a human object.

14)           ta                     ni                               qingjiao    guo        me?
                he(human)     you(human)             consult     GUO        ME

Him, have you ever consulted?
Note: GUO is a particle for experience aspect.

15)           ni ta  qingjiao guo  me?

You, has he ever consulted?

In both cases, the interpretation is OSV instead of SOV. Therefore, we need to reformulate Lexical rule 2 to exclude the case when the subject constraint is the same as the object constraint.

Lexical rule 2' (refined version):

                V ((NP1, NP2), (constr1, constr2), (constr1 not = constr2))

                --> [NP1, constr1] [NP2, constr2] V: SOV

Lexical rule 3:

                V ((NP1, NP2), (constr1, constr2)) --> NP1 [ba NP2] V: SOV

This is the typical BA construction. But not every transitive verb can assume the BA pattern. In fact, ba is one of a set of prepositions to introduce the logical object. There are other more idiosyncratic prepositions (xiang, dao, dui, etc.) required by different verbs to do the same job.

16a)      ni             qingjiao    guo         ta             me?
              you          consult     GUO        he            ME

Have you ever consulted him?

16b)         ni             xiang        ta             qingjiao    guo        me?
                 you          XIANG     he            consult     GUO        ME

Have you ever consulted him?

16c) *      ni             ba            ta             qingjiao    guo        me?
                you          BA           he            consult     GUO        ME

17a)         ta             qu             guo         Beijing.
                 he            go-to        GUO        Beijing

He has been to Beijing.

17b)         ta             dao         Beijing     qu             guo.
                 he            DAO        Beijing     go-to        GUO

He has been to Beijing.

17c) *      ta             ba            Beijing     qu            guo.
                 he            BA           Beijing     go-to        GUO

18a)         ta             hen         titie                             zhangfu.
                 she           very       tenderly-care-for      husband

She cares for her husband very tenderly.

18b)         ta             dui          zhangfu       hen        titie.
                 she           DUI         husband      very       tenderly-care-for

She cares for her husband very tenderly.

18c) *      ta             ba            zhangfu         hen                          titie.
                she           BA           husband         very                         tenderly-care-for

This originates from different theta-roles assumed by different verb notions on their object argument: patient, theme, destination, to name only a few. These theta-roles are further classification of the more general semantic role logical object. We can rely on the subcategorization property of the verb for the choice of the preposition literal (so-called valency preposition). With the valency information in place, we now reformulate Lexical rule 3 to make it more general:

Lexical rule 3' (refined version):

       V ((NP1, NP2), (constr1, constr2),  (valency_preposition=P), (P not = null))

       --> NP1 [P NP2] V: SOV

Lexical rule 4:   

                V ((NP1, NP2), (constr1, constr2)) --> NP2 ... [NP1, constr1] V: OSV

This is a topic pattern of long distance dependency. It is up to different formalisms to provide different approaches to long-distance phenomena. In our present implementation, NP2 is placed in a feature called BIND to indicate the nature of long distance dependency. One phrase structure rule Topic Rule is designed to use this information and handle the unification of the long distance complement properly.

Following the topic pattern, the passive BEI construction is formulated in Lexical rule 5.

Lexical rule 5:   

                V ((NP1, NP2), (constr1, constr2)) --> NP2 [bei NP1] V: OSV

We now turn to elliptical patterns.

Lexical rule 6:   

                V ((NP1, NP2), (constr1, constr2)) --> V NP2: VO

19)           chi           guo          jiaozi                        me?
                eat           GUO        dumpling                 ME

Have (you) ever eaten dumpling?

Lexical rule 7:   

                V ((NP1, NP2), (constr1, constr2)) --> [NP1, constr1] V: SV

20)           wo           chi           le.
                I               eat           LE

I have eaten (it).

21)           ji                                 chi           le.
                chicken1(animate)   eat           LE

The chicken has eaten (it).

Like its English counterpart, ji (chicken) has two senses: (1) chicken1 as animate; (2) chicken2 as food. We code this difference in two lexical entries. Only the first entry matches the semantic constraint on the subject in the pattern and reaches the above SV interpretation in 21). Interestingly enough, the same sentence will get another parse with a different interpretation OV in 23) because the second entry also satisfies the semantic constraint on the object in the OV pattern in Lexical rule 8.

22)           ni             qingjiao    guo         me?
                you          consult     GUO        ME

Have you consulted (someone)?

22) indicates that the SV interpretation is preferred over the OV interpretation when the semantic constraint on the subject and the semantic constraint on the object happen to be the same. Hence the added condition in Lexical rule 8.

Lexical rule 8:   

                V ((NP1, NP2), (constr1, constr2), (constr1 not = constr2))

                --> [NP2, constr2] V: OV

23)           ji                                 chi           le.
                chicken2(food)         eat           LE

The chicken has been eaten.

Lexical rule 9:   

                V ((NP1, NP2), (constr1, constr2)) --> NP2 [bei V]: OV

24)           dianxin    bei           chi           le.
                Dim-Sum  BEI          eat           LE

The Dim Sum has been eaten.

Lexical rule 10:

                V ((NP1, NP2), (constr1, constr2)) --> V: V

25)           chi           le             me?
                eat           LE            ME?                        

(Have you) eaten (it)?

  1. Implementation

We begin with a discussion of some major feature structures in HPSG related to handling the transitive patterns.  Then, we will show how our proposal works and discuss some related implementation issues.

HPSG is a highly lexicalist theory. Most information is housed in the lexicon. The general grammar is kept to minimum: only a few phrase structure rules (called ID Schemata) associated with a couple of principles. The data structure is typed feature structure. The necessary part for a typed feature structure is the type information. A simple feature structure contains only the type information, but a complex feature structure can introduce a set of feature/value pairs in addition to the type information. In a feature/value pair, the value is itself a feature structure (simple or complex). The following is a sample implementation of the lexical entry chi for our Chinese HPSG grammar using the ALE formalism [Carpenter  & Penn 1994].

hpsg3

Note:  (1) Uppercase notation for feature; (2) Lowercase notation for type; (3) Number indices in square brackets for unification.

Leaving the notational details aside, what this roughly says is: (1) for the semantic constraint, the arguments of the notion eat are an animate entity and a food entity; (2) for the syntactic constraint, the complements of the verb chi are 2 NPs: one on the left and the other on the right; (3) the interpretation of the structure is a transitive predicate with a subject and an object. The three corresponding features are: (1) KNOWLEDGE; (2) SUBCAT; (3) CONTENT. KNOWLEDGE stores some of our common sense by capturing the internal relation between concepts. Such common sense knowledge is represented in linguistic ways, i.e. it is represented as a semantic expectation feature, which parallels to the syntactic expectation feature SUBCAT. KNOWLEDGE defines the semantic constraint on the expected arguments no matter what syntactic forms the arguments will take.  In contrast, SUBCAT only defines the syntactic constraint on the expected complements. The syntactic constraint includes word order (LEFT feature), syntactic category (CATEGORY feature) and configurational information (LEX feature).  Finally, CONTENT feature assigns the roles SUBJECT and OBJECT for the represented structure.

A more important issue is the interaction of the three feature structures. Among the three features, only KNOWLEDGE is our add-on. The relationship between SUBCAT and CONTENT has been established in all HPSG versions: SUBCAT resorts to CONTENT for interpretation.  This interaction corresponds to our definition of pattern. Everything goes fine as far as the syntactic constraint alone can decide interpretation. When the semantic constraint (in KNOWLEDGE) has to be involved in the interpretation process, we need a way to access this information. In unification based theories, information flow is realized by unification (i.e. structure sharing, which is represented by the co-index of feature values). In general, we have two ways to ensure structure sharing in the lexicon. It is either directly co-indexed in the lexical entries, or it resorts to lexical rules. The former is unconditional, and the latter is conditional. As argued before, we cannot directly enforce the semantic constraint for every transitive pattern in Chinese, for otherwise our grammar will not allow for any semantic deviation. We are left with lexical rules which we have informally formulated in Section 3 and implemented in the ALE formalism.

CATEGORY is another major feature for a sign. The CATEGORY feature in our implementation includes functional category which can specify functional literal (function word) as its value. Function words belong to closed categories. Therefore, they can be classified by enumeration of literals. Like word order, function words are important form for Chinese syntactic constraint. Grammars for other languages also resort to some functional literals for constraint. In most HPSG grammars for English, for example, a preposition literal is specified in a feature called P_FORM. There are two problems involved there. First, at representation level, there is redundancy: P_FORM:x --> CATEGORY:p (where x is not null). In other words, there exists feature dependency between P_FORM and CATEGORY which is not captured in the formalism. Second, if P_FORM is designed to stipulate a preposition literal, we will ultimately need to add features like CL_FORM for classifier specification, CO_FORM for conjunction specification, etc. In fact, for each functional category, literal specification may be required for constraint in a non-toy grammar. That will make the feature system of the grammar too cumbersome. These problems are solved in our grammar implementation in ALE. One significant mechanism in ALE is its type inheritance and appropriateness specifications for feature structures [Carpenter  & Penn 1994]. (Similar design is found in the new software paradigm of Object Oriented Programming.) Thanks to ALE, we can now use literals (ba, xiang, dao, dui, etc) as well as major categories (n, v, a, p, etc.) to define the CATEGORY feature. In fact, any intermediate level of subclassification between these two extremes, major categories and literals, can all be represented in CATEGORY just as handily. They together constitute a type hierarchy of CATEGORY. The same mechanism can also be applied to semantic categories (human, animate, food, etc.) to capture the thesaurus inference like human --> animate. This makes our knowledge representation much more powerful than in those formalisms without this mechanism. We will address this issue in depth in another paper Typology for syntactic category and semantic category in Chinese grammar.

In the following, we give a brief description on how our grammar works. The grammar consists of several phrase structure rules and a lexicon with lexical entries and lexical rules. First, ALE compiles the grammar into a Prolog parser. During this process (at compile time), lexical rules are applied to lexical entries. In the case of transitive patterns, this means that one entry of chi will evolve into 10 entries. Please note that it is this expanded lexicon that is used for parsing (at run time).

At the level of implementation, we do not need to presuppose an abstract transitive structure as input of the lexical rules and from there generates 10 new entries for each transitive verb. What is needed is one pattern as the basic pattern for transitive structure and derives the other patterns. In fact, we only need 4 lexical rules to derive the other 4 full patterns from 1 basic full pattern. Elliptical patterns can be handled more elegantly by other means than lexical rules.[2]

The basic pattern constitutes the common condition for lexical rules. Although in theory any one of the 5 full patterns can be seen as the basic pattern, the choice is not arbitrarily made. The pattern we chose is the valency preposition pattern (the BA-type construction) NP1 [P NP2] V: SOV (see Lexical rule 3').[3] This is justified as follows. The valency preposition P (ba, xiang, dao, dui, etc.) is idiosyncratically associated with the individual verb. To derive a more general pattern from a specific pattern is easier than the other way round, for example,  NP1 [P NP2] V: SOV --> NP1 V NP2: SVO is easier than NP1 V NP2: SVO --> NP1 [P NP2] V: SOV. This is because we can then directly code the valency preposition under CATEGORY in the SUBCAT feature and do not have to design a specific feature to store this valency information.

 

  1. Summery

The ultimate aim for natural language analysis is to reach interpretation, i.e. to assign roles to the constituents. An old question is how syntax (form) and semantics (meaning) interact in this interpretation process. More specifically, which is a more important factor in Chinese analysis, the syntactic constraint or the semantic constraint? For the linguistic data we have investigated, it seems that sometimes syntax plays a decisive role and other times semantics has the final say. The essence is how to adequately handle the interface between syntax and semantics.

In our proposal, the syntactic constraint is seen as a more fundamental factor. It serves as the frame of reference for the semantic constraint. The involvement of the semantic constraint seems to be most naturally conditioned by syntactic patterns. In order to ensure their effective interaction, we accommodate syntax and semantics in one model.  The model is designed to be based on syntax and resorts to semantic information only when necessary. In concrete terms, the system will selectively enforce or waive the semantic constraint, depending on syntactic patterns.

It needs to be advised that there are other factors involved in reaching a correct interpretation. For example, in order to recover the omitted complements in elliptical patterns, information from discourse and pragmatics may be vital. We leave this for future research.

 

References

Carpenter, B. & Penn, G. (1994): ALE, The Attribute Logic Engine, User's Guide, Version 2.0

Gao, Qian (1993): “Chinese BA-Construction: Its Syntax and Semantics”, OSU Working Papers in Linguistics 1993, Kathol A. & Pollard C. (eds.)

Huang, Xiuming (1987): “XTRA: The Design and Implementation of A Fully Automatic Machine Translation System”, Ph.D. dissertation.

Li, Audry (1990): Chapter 6 “Passive, BA, and topic constructions”, Order & Constituency in Mandarin Chinese. Kluwer Academic Publishers

Li, Wei & McFetridge, Paul (1995): “Handling Chinese NP predicate in HPSG”, Proceedings of PACLING-II, Brisbane, Australia

Pollard, Carl  & Sag, Ivan A. (1994): Head-Driven Phrase Structure Grammar,  Centre for the Study of Language and Information, Stanford University, CA

Pollard, Carl  & Sag, Ivan A. (1987): Information-based Syntax and Semantics. Vol. 1: Fundamentals. Centre for the Study of Language and Information, Stanford University, CA

Wilks, Y.A. (1978): “Making Preferences More Active”,  Artificial Intelligence, Vol. 11

Wilks, Y.A. (1975): “A Preferential Pattern-Seeking Semantics for Natural Language Interference”, Artificial Intelligence, Vol. 6

~~~~~~~~~~~~

* This research is part of my Ph.D. project on a Chinese HPSG-style grammar, supported by the Science Council of British Columbia, Canada under G.R.E.A.T. award (code: 61). I thank my supervisor Dr. Paul McFetridge for his supervision. He introduced me into the HPSG theory and provided me with his sample grammars. Without his help, I would not have been able to implement the Chinese grammar in a relatively short time. Thanks also go to Prof. Dong Zhen Dong and Dr. Ping Xue for their comments and encouragement.

 

[1]               The other combinations are:

5d1) *      dianxin chi le wo.              OVS

5d2)         dianxin chi le wo.
The Dim Sum ate me.

Note:        It is OK with the 5d2) reading in the pattern NP V NP: SVO.

5e1) *      chi le wo dianxin.               VSO
5e2)         chi le wo dianxin.

(Somebody) ate my Dim Sum.

Note:        It is OK with the 5e2) reading of in the pattern V [NP1 NP2]: VO where NP1 modifies NP2.

5f1) *      chi le dianxin wo.                 VOS
5f2)         chi le dianxin, wo.

Eaten the Dim Sum, I have.

Note:        It is OK in Spoken Chinese, with a short pause before wo, in a  pattern like V NP, NP: VOS.

[2]   The conventional configurational approach is based on the assumption that complements are obligatory and should be saturated. If saturation of complements were not taken as a precondition for a phrase, serious problems might arise in structural overgeneration. On the other hand, optionality of complement(s) is a real life fact. Elliptical patterns are seen in many languages and especially commonplace in Chinese. In order to ensure obligatoriness of complements, the lexical rule approach can be applied to elliptical patterns, as shown in Section 3. This approach maintains configurational constraint in tree building to block structural overgeneration, but the cost is great: each possible elliptical pattern for a head will have to be accommodated by a new lexical entry. With the type mechanism provided by ALE, we have developed a technique to allow for optionality of complement(s) and still maintain proper configurational constraint. We will address this issue in another paper Configurational constraint in Chinese grammar.

[3]    This choice is coincidental to the base‑generated account of the BA construction in [Li, A. 1990], but that does not mean much. First, our so‑called basic pattern is not their D‑Structure. Second, our choice is based on more practical considerations. Their claim involves more theoretical arguments in the context of the generative grammar.

 

 

[Related]

Handling Chinese NP predicate in HPSG (old paper)

Notes for An HPSG-style Chinese Reversible Grammar

Outline of an HPSG-style Chinese reversible grammar

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

PhD Thesis: Chapter VII Concluding Remarks

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

【创业笔记:安娜离职记】

安娜是个很可爱的俄罗斯上进女青年,从小弹钢琴跳芭蕾,小学没毕业即随父母移民美国。她身材高佻,曲线优美,性情温和,举止得体,善解人意,给人一种古典但不古板,现代却不俗艳,阳光而浪漫的印象。大家知道,虽然俄罗斯大嫂大多偏胖粗线条,但俄罗斯姑娘却多有迷人的风采,老帮菜耳熟能详念念不忘的就有钢铁怎样炼成里面的资产阶级小姐冬妮亚,芭蕾舞天后乌兰诺娃,风华绝代的花样滑冰艺术家 Ekaterina Gordeeva。安娜也是这样一位俄罗斯女郎,每天就在身边,给满屋大多是 boys 的办公室带来了温馨柔和的气息。自然地,大家都喜欢她。

然而,安娜辞职了,很快就要离开,大家都舍不得。我心里也不是滋味,想到午餐时不再有她的说说笑笑,餐后也不能邀她打乒乓球了,失落落的。我问她一定要离开么,你不是说很喜欢这个环境么?You know this office is already too crowded with boys, and we are trying to change this situation, trying to find some girls with affirmative action, and you are leaving?

她回说,我喜欢这个环境,是因为在这里我接触的都是你这样的世界上最聪明的人,因为你们太聪明了,结果我的发展道路堵死了,只好痛下决心离开了,我还是去 consulting company 做我擅长的分析工作去吧。两年来,我亲眼目睹我的20小时的人工怎样被你的20秒的全自动搜索所替代,而且结果往往比人工更好更全更有一致性。

她说的不假。确实是技术的转移抢走了她的饭碗,但公司不想辞她,决定让她转型做在线客户服务,可她思前想后,觉得年轻轻不能放弃自己的专长,只好决定离开了。

作为技术带头人,她的离开与我直接相关。这是一个活生生的机器取代人工的例子。

两年前我加入公司的时候,公司基本上是一个 professional service 类型的公司,虽然也开发了一个内部使用的系统,但系统的输出只是缩小了人工范围,必须有长时间的后编辑,手动增删修补,分析归纳,才能提供给客户。编辑人员我们称为信息分析员,要求语言能力强,阅读理解一目十行,并具有分析综合的技能。安娜就是信息分析员中的佼佼者。经她过手的分析报告,客户特别满意。

可是公司需要成本核算。核算的结果是,肉工可以,要适度,否则入不敷出,是亏本买卖。当时平均每个搜索分析的订单需要肉工22小时方能完工,这22小时叫做 pain time (既是分析员的pain, 更是公司的pain)。要想赚钱,理想的 pain time 支出需要控制在两个小时之内,在当时有点天方夜谭。老板找我谈的时候,就把它定为主要目标,但并没有设置时间限度,因为没有人知道其可行性以及达成这样的目标需要多少资源。我自己也不明白,只是感觉到了这个重担。我以前做过的工作,都是先研究,后做原型引擎,然后寻找应用领域,最后开发产品。而这家公司与多数技术创新公司截然相反,它是先有客户,后有粗糙的引擎,最后才引进人才和技术,把希望寄托在技术的快速转移身上。这条路子让我觉得新鲜和刺激,觉得可以试一下,我的技术转移技能能不能如鱼得水,发挥出来。先有客户和应用领域的好处是显而易见的,就像搞共产主义有了遵义会议的明灯一样,省却了在黑暗中的漫长摸索。道路是光明的,就看路怎样走才能赚钱了。

长话短说。我上马以后,三个月把系统的核心部分替换了,半年下来结果明显改善,到一周年的时候,肉工的痛苦时间已经缩短到两小时以下,老板喜不自禁。

人心不足蛇吞象,老板告诉我,Wei,你知道,你的技术给我们的业务带来了革命性变化。我们的立足已经不成问题,只要我们愿意,维持一个机器加人工的服务,发展成年入几千万的企业指日可待。但是,只要有人工,就不能 scale up, 赚钱就有限,盘子就做不大。我知道你是有雄心的人(我心里说,子非鱼),肯定不满足小打小闹。不管多大风险,我们还是决定放弃这条道路,而走全自动的路子,让系统可以服务所有的分析客户,而不是只供我们内部人工(安娜这样的)或者需要专门训练的 power users 使用。我们的目标是让世界上每个分析员都离不开我们,就如大家离不开Google一样。为此,我们必须做到 pain time  为零,这是着险棋,但是前景不可限量。

好家伙,这个口气,就梦想称霸全世界了。美国是个很有意思的地方,这方水土盛产百折不挠,心比天高的企业梦想家。但美国并非梦想家的乐园,95%的梦想家牺牲了,不到5%得以生存,其中不过1%最终做大,真正是一将功成万骨枯。虽然如此,美国造企业梦想家仍然前赴后继,生生不息。我其实很喜欢这些梦想家,他们的坚韧豪情很感染人。

一年又过去了。我们实现了在一个主要分析领域完全铲除痛苦时间的目标(pain time 0),把搜索分析从两年前的22小时人工,发展成为如今的20秒钟全自动立等可取,无需任何人工编辑。

得之桑榆,失之东隅, 两年的奋战取得了超出所有人预料的成就,但同时也失去了一位可爱的俄罗斯女郎。

【二次创业笔记】 记于2008年四月

【后记】关于安娜,还有一个小插曲。大家知道,创业公司的人都爱做梦数小鸡,股票期权则是催梦剂。

有一天,公司哥们跟往常一样数小鸡玩儿,安娜跟我说:Wei, come here, I got something to show you. 我走近一看,是一辆轿车。她跟我一字一板地说:

I like this car. I just love it. It is my dream car. I want to buy it.
Guys, work hard so I can own this car.

及至仔细一看价码,吓了一个筋斗,百万以上,她可真敢想啊,乖乖隆的东,here it is:

http://abcnews.go.com/GMA/Moms/story?id=1406161

相关篇什:

【一日一parsing:舍我其谁,我又是谁?】

昨夜名段:
【中秋,混得好的是花前月下,混得一般的是月下花钱,混得最差的是花下月的钱,混得最好的是钱下月花。】

0916a

0916b

几乎完美parsing了,但有一个分离词没有搭配的瑕疵,对比:

0916d

合在一起就眼花缭乱了,这是非一般的 graph,与多数句法树颇不同:

0916c

索性把前天的 parsing 也秀一秀。汉语 deep parsing 没有绝对的标准,但语言学家心里还是有杆秤的:靠谱不靠谱,内行看门道,外行看热闹罢。这种感觉有些奇诡刺激,一方面觉得是在走前人没走过的路,充满了拓荒者的悲壮与豪情。另一方面,也好像冥冥之中的命定,替天行道,舍我其谁,我又是谁?如果语言是思想的载体和表达(presentation),parsing 就是思想的形式化机器展示(representation),而我就是贯通二者的使者。感谢上帝,在创造了谜一样的语言的同时,没忘记把钥匙留下。

0915a

0915b

0915c

0915d

是的,【人类最无法理解的事情,就是机器对人类语言结构的分析能力】。机器达到人类的语言结构分析能力,现在已经没有悬念了。而机器难以达到的那部分理解能力,可以用人机辅助的方式进行,这个景象就在不太远的将来,已然历历在目了。让我们准备好,去拥抱这个人机交融的新时代。

洪爷有诗云:
庖丁解牛在语言,伟爷Parser之中练。善刀藏之于深山,实则乱麻可以斩。

【相关】

中文处理

Parsing

【置顶:立委NLP博文一览】

《朝华午拾》总目录

 

【博士涂鸦回顾:把常识代入文法的尝试】

上次说过,绝大多数的parsers对于谓词的 subcat 的表达都很简陋,伸展不开,多数不过把 subcat 当成一个代码,然后在相关的 subcat 规则中去确定 pattern。但是词驱动的文法 HPSG 却可以丝丝入扣,合情合理,可以直接在词典里面把 subcat 的 pattern 细致地描述,并对其句法语义的输入(pattern的条件)和输出(逻辑语义)之间的映射和解构,做出一个符合语言学原则的表达(representation)。

简陋有简陋的工程考量和理由,叠床架屋有叠床架屋的逻辑优美。鱼与熊掌不可兼得,我们最终还是更加倾向于简陋之法。尽管如此,走简陋快捷的路线的人,如果对结构表达的优美有所体验,还是有莫大的好处,至少不会被简陋的表象所迷惑,对于复杂的语言现象,逐渐摆脱简陋的捉襟见肘。

最近回看当年博士阶段的涂鸦文章,虽然其中反映出的对汉语句法的见识并不出彩,但是得力于 HPSG 的结构丰富性,还是把 subcat 在汉语文法中应用,表现得有条不紊,经得起时间的检验。当年钻研 HPSG 还是很专心的,吃得蛮透。正因为吃得透了,后来扬弃的时候就没有拖泥带水的牵挂。

譬如,在论及汉语NP带坑的现象的时候,是这样模型的:

11a)     桌子坏了。
11b)     腿坏了。
11c)     桌子的腿坏了。
12a)     他好。
12b)     身体好。
12c)     他的身体好。

When people say 11b) tui huai le (leg went wrong), we know something (the possessor) is omitted. For 11a), however, we have no such feel of incompleteness.

Although we may also ask whose table, this possessive relation between who and table is by no means innate. Similarly, ta (he) in 12a) is a complete notion denoting someone while shenti (body) in 12b) is not. In 11c) and 12c), the possessor appears in the possessive structure DE-construction, the expectation of tui (leg) and shenti (body) is realized. These examples show that some words (concepts) have conceptual expectation for some other words (concepts) although the expected words do not necessarily show up in a sentence and the expectation might not be satisfied. In fact, this type of expectation forms part of our knowledge (common sense). One way to represent the knowledge is to encode it with the related word in the lexicon.
Therefore we propose an underlying SYNSEM feature KNOWLEDGE to store some of our common sense knowledge by capturing the internal relation between concepts. KNOWLEDGE parallels to syntactic SUBCAT and semantic RELATION. KNOWLEDGE imposes semantic constraints on their expected arguments no matter what syntactic forms the arguments will take (they may take null form, i.e. the underlying arguments are not realized). In contrast, SUBCAT only defines syntactic requirement for the complements and gets interpreted in RELATION. Following this design, syntactic form and semantic constraints are kept apart. When necessary, the interaction between them can be implemented by lexical rules, or directly coindexed in the lexicon. For example, the following KNOWLEDGE information will be enforced as the necessary semantic constraints when we handle Chinese NP predicates by a lexical rule (see 3.3).

把常识暗度陈仓从后门带入文法,就是从那时候开始的。这个做法在欧洲语言的形式文法中不多见,因为句法形式大体够用了,通常不需要常识的帮忙。但是对于汉语,没有某种常识的引入,想做一个成熟的深度分析系统,则很难。当年带常识的的句法结构模型是这样定义的:

PHON      shenti
SYNSEM | KNOWLEDGE | PRED [1] possess
SYNSEM | KNOWLEDGE | POSSESSOR [2] human
SYNSEM | KNOWLEDGE | POSSESSED [3]
SYNSEM | LOCAL | CONTENT | INDEX [3]
SYNSEM | LOCAL | CONTENT | RESTRICTION { RELATION body }
SYNSEM | LOCAL | CONTENT | RESTRICTION { INSTANCE [3] }

最后,汉语文法中常识的引入被认为是对欧洲语言利用性数格的 agreement 的一个自然延伸。句法手段到语义限制的延伸。

Agreement revisited
This section relates semantic constraints which embody common sense to the conventional linguistic notion of agreement. We will show that they are essentially the same thing from different perspectives. We only need slight expansion for the definition of agreement to accommodate some of our basic knowledge. This is important as it accounts for the feasibility of coding knowledge in linguistic ways. Linguistic lexicon seems to be good enough to house some general knowledge in addition to linguistic knowledge.

为 parse“我鸡吃“ 和“鸡我吃”, 常识进入了文法(现在也可以利用大数据把常识代入):

A typical example of how concepts are linked in a network (a sophisticated concept lexicon) is seen in the representation of drink ((*ANI SUBJ) (((FLOW STUFF) OBJE) ((SELF IN) (((*ANI (THRU PART)) TO) (BE CAUSE))))) in Wilks 1975b. While for  various reasons we will not go as far as Wilks, we can gain enlightenment from this type of AI approach to knowledge. Lexicon-driven systems like the one in HPSG can, of course, make use of this possibility. Take the Chinese role-assignment problem, for example, the common sense that ANIMATE being eats FOOD can be seamlessly incorporated in the lexical entry chi (eat) as a semantic agreement requirement.

PHON chi
SYNSEM | KNOWLEDGE | PRED [1]  eat
SYNSEM | KNOWLEDGE | AGENT [2] animate
SYNSEM | KNOWLEDGE | PATIENT [3] food
SYNSEM | LOCAL | CATEGORY | SUBCAT | EXTERNAL_ARGUMENT [NP: [4]]
SYNSEM | LOCAL | CATEGORY | SUBCAT | INTERNAL_ARGUMENTS <[NP: [5]]>
SYNSEM | LOCAL | CONTENT | RELATION [1]
SYNSEM | LOCAL | CONTENT | EATER [4] | INDEX | ROGET [2]
SYNSEM | LOCAL | CONTENT | EATEN [5] | INDEX | ROGET [3]

可见,看上去不过是 POS 细分后的一个 subcat 的代码,里面其实包含了多少结构及其蕴含其内的知识。在 unification grammars 几乎成为历史陈迹的今天,我还是认为 HPSG 这样的表达是最优美的语言学的逻辑表达之一,论逻辑的清晰和美,后来的文法很难超越。

 

[Related]

Handling Chinese NP predicate in HPSG (old paper)

Notes for An HPSG-style Chinese Reversible Grammar

Outline of an HPSG-style Chinese reversible grammar

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

 

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

 

Handling Chinese NP predicate in HPSG (old paper)

Handling Chinese NP predicate in HPSG
(old paper in Proceedings of the Second Conference of the Pacific
Association for Computational Linguistics, Brisbane, 1995)

Wei Li & Paul McFetridge

Department of Linguistics
Simon Fraser University
Burnaby, B.C. CANADA  V5A 1S6

 

Key words: HPSG; knowledge representation, Chinese processing 

 

Abstract 

This paper addresses a type of Chinese NP predicate in the framework of HPSG 1994 (Pollard & Sag 1994). The special emphasis is laid on knowledge representation and the interaction of syntax and semantics in natural language processing. A knowledge based HPSG model is designed. This design not only lays a foundation for effectively handling Chinese NP predicate problem, but has theoretical and methodological significance on NLP in general.

In Section 1, the data are analyzed. Both structural and semantic constraints for this pattern are defined. Section 2 discusses the semantic constraints in the wider context of the conceived knowledge-based model. The aim of natural language analysis is to reach interpretations, i.e. correctly assigning semantic roles to the constituents. We indicate that without being able to resort to some common sense knowledge, some structures cannot get interpreted. We present a way on how to organize and utilize knowledge in HPSG lexicon. In Section 3, a lexical rule for this pattern is proposed in our HPSG model for Chinese, whose prototype is being implemented.

  1. Problem

We will show the data of Chinese NP predicate first. Then we will investigate what makes it possible for an NP to behave like a predicate. We will do this by defining both the syntactic and semantic constraints for this Chinese pattern.

1.1. Data: one type of Chinese NP predicate

1) 他好身体。

ta         hao      shenti.
he        good    body
He is of good health.

2)  张三高个子。

Zhangsan         gao      gezi
Zhangsan         tall       figure.
Zhangsan is tall.

3)  李四圆圆的脸。       Lisi

Lisi      yuanyuan         de        lian.
Lisi      round-round    DE       face.
Lisi has a quite round face.

4) 这件大衣红颜色。

zhe       jian      dayi     hong    yanse.
this      (cl.)      coat     red       colour.
This coat is of red colour.

5)  明天小雨。

mingtian          xiao     yu.
tomorrow        little     rain.
Tomorrow it will drizzle.

6)  那张桌子三条腿。

na        zhang   zhuozi san       tiao      tui.
that      (cl.)      table   three    (cl.)      leg
That table is three-legged.

Note:      (cl.) for classifier.
DE for Chinese attribute particle.

The relation between the subject NP and the predicate NP is not identity. The NP predicate in Chinese usually describes a property the subject NP has, corresponding to English be-of/have NP. In identity constructions, the linking verb SHI (be) cannot normally be omitted.[1]

7a)  他是学者。

ta         shi        xuezhe.
he        be        scholar
He is a scholar.

8b) ?他学者。

ta         xuezhe.  他学者。
he        scholar

1.2.  Problem analysis

1.2.1. We first investigate the structural characteristics of the Chinese NP predicate pattern.

A single noun cannot act as predicate. More restrictively, not every NP can become a predicate. It seems that only the NP with the following configuration has this potential: NP [lex -, predicate +].  In other words, a predicate NP consists of a lexical N with a modifying sister. Structures of this sort should not be further modified.[2] Thus, the following patterns are predicted.

8a)      那张桌子三条腿。

na        zhang   zhuozi san       tiao      tui.                   [ same as 6) ]
that      (cl.)      table    three    (cl.)      leg
That table is three-legged.

8b)       那张桌子塑料腿。

na        zhang   zhuozi suliao   tui.
that      (cl.)      table    plastic leg
That table is of plastic legs.

8c) * 那张桌子三条塑料腿。
*    na        zhang   zhuozi san       tiao      suliao   tui.       [too many attributes]

8d) * 那张桌子腿。
*    na        zhang   zhuozi tui.                                           [no attributes]

1.2.2. What is the semantic constraint for the Chinese predicate pattern?

Although there is no syntactic agreement between subject and predicate in Chinese, there is an obvious semantic "agreement" between the two: hao shenti (good body) requires a HUMAN as its subject; san tiao tui (three leg) demands that the subject be FURNITURE or ANIMATE. Therefore, the following are unacceptable:

9) * 这杯茶好身体。

* zhe       bei       cha       hao      shenti.
this      cup      tea       good    body

10) * 空气三条腿。

* kongqi san       tiao      tui.
air        three    (cl.)      leg

Obviously,. it is not hao (good) or san tiao (three) which poses this semantic selection of subject. The semantic restriction comes from the noun shenti (body) or tui (leg). There is an internal POSSESS relationship between them: shenti (body)  belongs to human beings and tui (leg) is one part of an animal or some furniture. This common sense relation is a crucial condition for the successful interpretation of the Chinese NP predicate sentences.

There are a number of issues involved here. First, what is the relationship of this type of knowledge to the syntactic structures and semantic interpretations? Second, where and how would this knowledge be represented? Third, how will the system use the knowledge when it is needed? More specifically, how will the introduction of this knowledge coordinate with the other parts of the well established HPSG formalism? Those are the questions we attempt to answer before we proceed to provide a solution to the Chinese NP predicate. Let us look at some more examples:

11a)     桌子坏了。

zhuozi huai     le.
table    bad      LE
The table went wrong.

11b)     腿坏了。

tui        huai     le.leg       bad      LE
leg       bad      LE
The leg went wrong.

11c)     桌子的腿坏了。

zhuozi  de        tui        huai     le.
table    DE       leg       bad      LE
The table's leg went wrong.

12a)     他好。

ta         hao.
he        good
He is good.

12b)     身体好。

shenti   hao.
body    good
The health is good.

12c)     他的身体好。

ta         de        shenti   hao.
he        DE       body    good
His health is good.

note: LE for Chinese perfect aspect particle.

When people say 11b) tui huai le (leg went wrong), we know something (the possessor) is omitted. For 11a), however, we have no such feel of incompleteness. Although we may also ask whose table, this possessive relation between who and table is by no means innate. Similarly, ta (he) in 12a) is a complete notion denoting someone while shenti (body) in 12b) is not. In 11c) and 12c), the possessor appears in the possessive structure DE-construction, the expectation of tui (leg) and shenti (body) is realized. These examples show that some words (concepts) have conceptual expectation for some other words (concepts) although the expected words do not necessarily show up in a sentence and the expectation might not be satisfied. In fact, this type of expectation forms part of our knowledge (common sense). One way to represent the knowledge is to encode it with the related word in the lexicon.

Therefore we propose an underlying SYNSEM feature KNOWLEDGE to store some of our common sense knowledge by capturing the internal relation between concepts. KNOWLEDGE parallels to syntactic SUBCAT and semantic RELATION. KNOWLEDGE imposes semantic constraints on their expected arguments no matter what syntactic forms the arguments will take (they may take null form, i.e. the underlying arguments are not realized). In contrast, SUBCAT only defines syntactic requirement for the complements and gets interpreted in RELATION. Following this design, syntactic form and semantic constraints are kept apart. When necessary, the interaction between them can be implemented by lexical rules, or directly coindexed in the lexicon. For example, the following KNOWLEDGE information will be enforced as the necessary semantic constraints when we handle Chinese NP predicates by a lexical rule (see 3.3).

PHON      shenti
SYNSEM | KNOWLEDGE | PRED [1] possess
SYNSEM | KNOWLEDGE | POSSESSOR [2] human
SYNSEM | KNOWLEDGE | POSSESSED [3]
SYNSEM | LOCAL | CONTENT | INDEX [3]
SYNSEM | LOCAL | CONTENT | RESTRICTION { RELATION body }
SYNSEM | LOCAL | CONTENT | RESTRICTION { INSTANCE [3] }

  1. Agreement revisited

This section relates semantic constraints which embody common sense to the conventional linguistic notion of agreement. We will show that they are essentially the same thing from different perspectives. We only need slight expansion for the definition of agreement to accommodate some of our basic knowledge. This is important as it accounts for the feasibility of coding knowledge in linguistic ways. Linguistic lexicon seems to be good enough to house some general knowledge in addition to linguistic knowledge. Some possible problems with this knowledge-based approach are also discussed.

Let's first consider the following two parallel agreement problems in English:

13) *    The boy drink.

14) ?    The air drinks.

13) is ungrammatical because it violates the syntactic agreement between the subject and predicate. 14) is conventionally considered as grammatical although it violates the semantic agreement between the agent and the action. Since the approach taken in this paper is motivated by semantic agreement, some elaboration and comment on agreement seem to be in need.

The agreement in person, gender and number are included in CONTENT | INDEX features (Pollard & Sag 1994, Chapter 2). It follows that any two signs co-indexed naturally agree with each other. That is desirable because co-indexed signs refer to the same entity. However, person, gender and number seem to be only part of the story of agreement. We may expand the INDEX feature to cope with the semantic agreement for handling Chinese and for in-depth semantic analysis for other languages as well.

Note that to accommodate semantic agreement in HPSG, we first need features to represent the result of semantic classification of lexical meanings like HUMAN, FOOD, FURNITURE, etc. We therefore propose a ROGET feature (named after the thesaurus dictionary) and put it into the INDEX feature.

Semantic agreement, termed sometimes as semantic constraint or semantic selection restriction in literature, is not a new conception in natural language processing. Hardly any in-depth language analysis can go smoothly without incorporating it to a certain extent. For languages like Chinese with virtually no inflection, it is more important. We can hardly imagine how the roles can be correctly assigned without the involvement of semantic agreement in the following sentences of the form NP1 NP2 Vt:

15a)     点心我吃了。

dianxin            wo       chi       le.
Dim-Sum         I           eat       LE
The Dim Sum I have eaten.

15b)     我点心吃了。

wo       dianxin            chi       le.
I           Dim-Sum         eat       LE
I have eaten the Dim Sum.

Who eats what?  There is no formal way but to resort to semantic agreement enforced by eat to correctly assign the roles. In HPSG 1994, it was pointed out (Pollard & Sag 1994, p81), "... there is ample independent evidence that verbs specify information about the indices of their subject NPs. Unless verbs 'had their hands on' (so to speak) their subjects' indices, they would be unable to assign semantic roles to their subjects." The Chinese data show that sometimes verbs need to have their hands on the semantic categories (ROGET) of both their external argument (subject) and internal arguments to be able to correctly assign roles. Now we have expanded the INDEX feature to cover both ROGET and the conventional agreement features number, person and gender, the above claim of Pollard and Sag becomes more general.

It is widely agreed that knowledge is bound to play an important role in natural language analysis and disambiguation. The question is how to build a knowledge-based system which is manageable. Knowledge consists of linguistic knowledge (phonology, morphology, syntax, semantics, etc.) and extra-linguistic knowledge (common sense, professional knowledge, etc.). Since semantics is based on lexical meanings, lexical meanings represent concepts and concepts are linked to each other in a way to form knowledge, we can well regard semantics as a link between linguistics and beyond-linguistics in terms of knowledge. In other words, some extra-linguistic knowledge may be represented in linguistic ways. In fact, lexicon, if properly designed, can be a rich source of knowledge, both linguistic and extra-linguistic. A typical example of how concepts are linked in a network (a sophisticated concept lexicon) is seen in the representation of drink ((*ANI SUBJ) (((FLOW STUFF) OBJE) ((SELF IN) (((*ANI (THRU PART)) TO) (BE CAUSE))))) in Wilks 1975b. While for  various reasons we will not go as far as Wilks, we can gain enlightenment from this type of AI approach to knowledge. Lexicon-driven systems like the one in HPSG can, of course, make use of this possibility. Take the Chinese role-assignment problem, for example, the common sense that ANIMATE being eats FOOD can be seamlessly incorporated in the lexical entry chi (eat) as a semantic agreement requirement.

PHON chi
SYNSEM | KNOWLEDGE | PRED [1]  eat
SYNSEM | KNOWLEDGE | AGENT [2] animate
SYNSEM | KNOWLEDGE | PATIENT [3] food
SYNSEM | LOCAL | CATEGORY | SUBCAT | EXTERNAL_ARGUMENT [NP: [4]]
SYNSEM | LOCAL | CATEGORY | SUBCAT | INTERNAL_ARGUMENTS <[NP: [5]]>
SYNSEM | LOCAL | CONTENT | RELATION [1]
SYNSEM | LOCAL | CONTENT | EATER [4] | INDEX | ROGET [2]
SYNSEM | LOCAL | CONTENT | EATEN [5] | INDEX | ROGET [3]

Note:        Following the convention, the part after the colon is SYNSEM | LOCAL | CONTENT information.

One last point we would like to make in this context is that semantic agreement, like syntactic agreement, should be able to loosen its restriction, in other words, agreement is just a canonical, in Wilk's term preference, requirement (Wilks 1975a). In practice of communication, deviation in different degrees is often seen and people often relax the preference restriction in order to understand. With semantic agreement, the deliberate deviation is one of the handy means to help render rhetorical expression. In a certain domain, Chomsky's famous sentence Colorless green ideas sleep furiously is well imaginable. On the other hand, the syntactic agreement deviation will not affect the meaning if no confusion is caused, which may or may not happen depending on context and the structure of the language. In English, lack of syntactic agreement for the present third person singular between subject and predicate usually causes no problem. Sentence 15) The boy drink therefore can be accepted and correctly interpreted. There is much more to say on the interaction of the two types of agreement deviation, how a preference model might be conceived, what computational complexities it may cause and how to handle them effectively. We plan to address it in another paper. The interested reader is referred to one famous approach in this direction. (Wilks 1975a, 1978).

 

  1. Solution

We will set some requirements first and then present a lexical rule to see how well it meets our requirements.

3.1. Based on the discussion in Section 1, the solution to the Chinese predicate NP problem should meet the following 4 requirements:

(1)        It should enforce the syntactic constraints for this pattern: one and only one modifier XP in the form of NP1 XP NP2.

(2)        It should enforce the semantic constraints for this pattern: N2 must expect NP1 as its POSSESSOR with semantic agreement.

(3)        It should correctly assign roles to the constituents of the pattern: NP1 POSSESS NP2 (where NP2 consists of XP N2).

(4)        It should be implementable in HPSG formalism.

 

3.2. What mechanisms can we use to tackle a problem in HPSG formalism?

HPSG grammar consists of two components: a general grammar (ID schemata and principles) and a lexical grammar (in the lexicon). The lexicon houses lexical entries with their linguistic description and knowledge representation in feature structures. The lexicon also contains generalizations captured by inheritance of lexical hierarchy and by a set of lexical rules. Roughly speaking, lexical hierarchy covers static redundancy between related potential structures. Just because the lexicon can reflect different degrees of lexical redundancy in addition to idiosyncrasy, the general grammar can desirably be kept to minimum.

The Chinese NP predicate pattern should be treated in the lexicon. There are two arguments for that. First, this pattern covers only restricted phenomena (see 3.4). Second, it relies heavily on the semantic agreement, which in our model is specified in the lexicon by KNOWLEDGE. We need somehow to link the semantic expectation KNOWLEDGE and the syntactic expectation SUBCAT or MOD. The general mechanism to achieve that is structure sharing by coindexing the features either directly in the lexical entries (see the representation of the entry chi in Section 2) or through lexical rules (see 3.3).

3.3. Lexical Rule

Lexical rules are applied to lexical signs (words, not phrases) which satisfy the condition. The result of the application is an expanded lexicon to be used during parsing. Since the pattern is of the form NP1 XP N2, the only possible target is N2, i.e. shenti (body) or tui (leg). This is due to the fact that among the three necessary signs in this form, the first two are phrases and only the final N2 is a lexical sign. We assume the following structure for our proposed lexical rule:

NP[ta[1]]         [[AP[2] hao] [N<NP[1], XP[2]> shenti]]

NP Predicate Lexical Rule

hpsg1

SYNSEM | KNOWLEDGE | PRED [1] possess
SYNSEM | KNOWLEDGE | POSSESSOR [2]
SYNSEM | LOCAL | CATEGORY | HEAD | MAJ [6] n
SYNSEM | LOCAL | CATEGORY | PREDICATE -
SYNSEM | LOCAL | CONTENT | INDEX [4]
SYNSEM | LOCAL | CONTENT | RESTRICTION {[3]}
...| CATEGORY | PREDICATE +
...| CATEGORY | SUBCAT | EXTERNAL_ARGUMENT [NP: [5]]
...| CATEGORY | SUBCAT | INTERNAL_ARGUMENTS < [...| CATEGORY | HEAD | MOD [6] ] >
...| CATEGORY | SUBCAT | INTERNAL_ARGUMENTS < [...| CONTENT | INDEX [4] ]

==>

...| CATEGORY | SUBCAT | INTERNAL_ARGUMENTS < [...| CONTENT | RESTRICTION {[7]} ] >
...| CATEGORY | SUBCAT | INTERNAL_ARGUMENTS < [...| LEX - ] >
...| CONTENT | RELATION [1] possess
...| CONTENT | POSSESSOR [5] | INDEX | ROGET [2]
...| CONTENT | POSSESSED | INDEX [4]
...| CONTENT | POSSESSED | RESTRICTION {[7] | [3] }

For complicated information flow like this, it is best to explain the indices one by one with regards to the example ta hao shenti (he is of good body) in the form of NP1 XP N2.

The index [1] links the underlying PRED feature of N2 to the semantic RELATION feature; in other words, the predicate in the underlying KNOWLEDGE of shenti (body) now surfaces as the relation for the whole sentence. The index [2] enforces the semantic constraint for this pattern, i.e. shenti (body) expects a human (ROGET) possessor as the subject (EXTERNAL_ARGUMENT) for this sentence. The index [3] is the restriction relation of N2. [4] links the INDEX features of XP and N2, and [6] indicates that the internal argument is a de-facto modifier of N2, i.e. XP mods-for N2. Note that the part of speech of the internal argument (INTERNAL_ARGUMENT | SYNSEM | LOCAL | CATEGORY | HEAD | MAJ) is deliberately not specified in the rule because Chinese modifiers (XP) are not confined to one class, as can be seen in our linguistic data. Finally, [7] defines the restriction relation of the XP to the INDEX of N2.

The indices [4], [7] and [3] all contribute to artificially creating a semantic interpretation for [XP N2]. As is interpreted, XP is, in fact, a modifier of N2 and they would form an NP2, or [XP N2] constituent. In normal circumstances, the building of NP2 interpretation is taken care of by HPSG Semantics Principle. But in this special pattern, we have treated XP as a complement of N2, yet semantically they are still understood as one instance: hao shenti (good body) is an instance of good and body. This interpretation of NP2 serves as POSSESSED of the sentence predicate, indicated by the structure-sharing of [4], [7] and [3]. Finally, [5] is the interpretation of NP1 and is assigned the role of POSSESSOR for the sentence predicate.

Let's see how well this lexical rule meets the 4 requirements set in 3.1.

(1) It enforces the syntactic constraints by treating XP as the internal argument and NP1 as the external argument.

(2) It enforces the semantic constraints through structure sharing by the index [2].

(3) It correctly assigns roles to the constituents of the pattern.

The following interpretation will be established for ta hao shenti (he is of good body) by the parser.

hpsg2

CONTENT | RELATION possess
CONTENT | POSSESSOR | INDEX | PERSON 3
CONTENT | POSSESSOR | INDEX | NUMBER singular
CONTENT | POSSESSOR | INDEX | GENDER male
CONTENT | POSSESSOR | INDEX | ROGET human
CONTENT | POSSESSOR | RESTRICTION { }
CONTENT | POSSESSED | INDEX [1]    | PERSON 3
CONTENT | POSSESSED | INDEX          | NUMBER singular
CONTENT | POSSESSED | INDEX          | GENDER nil
CONTENT | POSSESSED | INDEX          | ROGET organ
CONTENT | POSSESSED | RESTRICTION { [ RELATION good],              [ RELATION body  ] }
CONTENT | POSSESSED | RESTRICTION { [ INSTANCE [1] ],              [ INSTANCE [1]  ] }

In prose, it says roughly that a third person male human he possesses something which is an instance of good body. We believe that this is the adequate interpretation for the original sentence.

(4) Last, this rule has been implemented in our Chinese HPSG-style grammar using ALE and Prolog.  The results meet our objective.

But there is one issue we have not touched yet, word order. At first sight, Chinese seems to have similar LP constraints as those in English. For example, the internal argument(s) of a Chinese transitive verb by default appear on the right side of the head. It seems that our formulation contradicts this constraint in grammar. But in fact, there are many other examples with the internal argument(s), especially PP argument(s), appearing on the left side of the head.

服务 fuwu (serve): <NP, PP(wei)>

16a) 为人民服务

wei      renmin fuwu
for       people  serve
Serve the people.

16b) ? 服务为人民。

fuwu    wei      renmin.
serve    for       people

有益 youyi (of benefit): <NP, PP(dui yu)>

17a) 这对我有益。

zhe       dui       wo       youyi
this      to         I           have-benefit
This is of benefit to me.

17b) * 这有益对我。

zhe       youyi               dui       wo
this      have-benefit    to         I

18a) 这于我有益。

zhe       yu        wo       youyi
this      to         I           have-benefit
This is of benefit to me.

18b) 这有益于我。

zhe       youyi               yu        wo
this      have-benefit    to         I
This is of benefit to me.

Word order and its place in grammar are important issues in formulating Chinese grammar. To play safe and avoid generalization too soon, we assume a lexicalized view on Chinese LP constraint, encoding word order information in LEXICON through SUBCAT and MOD features. This proves to be a realistic and precise approach to Chinese word order phenomena.

3.4. As a final note, we will briefly compare the NP Predicate Pattern with one of the Chinese Topic Constructions:

NP1 NP2 Vi/A
(topic + (subject + predicate))

In Chinese, this is a closely related but much more productive form than this NP Predicate Pattern. And their structures are different.

19)       他身体好。

ta         shenti   hao
he        body    good
He is good in health.

For topic constructions, we propose a new feature CONTEXT | TOPIC, whose index in this case is token identical to the INDEX value of ta. Please be advised that in the above structure, the CONTEXT | TOPIC ta is considered as a sentential adjunct instead of a complement subcated-for by shenti. Why? First, ta is highly optional: topic-less sentence is still a sentence. Second, and more convincingly, ta cannot always be predicted by its following noun. Compare:

20a) 他身体好。

ta         shenti   hao
he        body    good
He is good in health.

20b) 他好身体。

ta         hao      shenti
he        good    body
He is of good health.

21a) 他脾气好。

ta         piqi                  hao
he        disposition       good
He is good in disposition.

21b)  他好脾气。

ta         hao      piqi
he        good    disposition
He is of good disposition.

but:

22a) 她学习好。

ta         xuexi   hao. [3]
he        study   good
He is good in study.

22b) *  他好学习。

ta         hao      xuexi
he        good    study

What this shows is that for topic sentences like ta shenti hao (He is good in health), ta xuexi hao (He is good in study), etc., there is no requirement to regard topic ta (he) as a necessary semantic possessor of shenti / xuexi, the relation is rather "in-aspect": something (NP1) is good (A) in some aspect (NP2), or for something (NP1), some aspect (NP2) is good (A).

Finally, it needs to be mentioned that our proposed lexical rule requires modification to accommodate sentence 6). That is already beyond what we can reach in this paper because it is integrated with the way we handle Chinese classifiers in HPSG framework.

 

References

Pollard, Carl  & Sag, Ivan A. (1994): Head-Driven Phrase Structure Grammar,  Centre for the Study of Language and Information, Stanford University, CA

Pollard, Carl & Sag, Ivan A. (1987): Information‑based Syntax and Semantics Vol. 1: Fundamentals. Centre for the Study of Language and Information, Stanford University, CA

Wilks, Y.A. (1975a): A Preferential Pattern-Seeking Semantics for Natural Language Interference.  Artificial Intelligence, Vol. 6, pp.53-74.

Wilks, Y.A. (1975b): An Intelligent Analyzer and Understander of English, in Communications of the ACM, Vol. 18, No.5, pp.264-274

Wilks, Y.A. (1978): Making Preferences More Active.  Artificial Intelligence, Vol. 11, pp. 197-223

~~~~~~~~~~~~~~~ footnotes ~~~~~~~~~~~~~~~~

[1] This is not absolute, we do have the following examples:

Ia)          约翰是纽约人。

Yuehan shi           Niuyue                   ren
John       be            New-York              person
John is a New Yorker.

Ib)           约翰纽约人。

Yuehan  Niuyue                   ren.
John       New-York              person
John is a New Yorker.

IIa)         今天是星期天。

jintian    shi           xingqi-tian.
today     be            Sun-day
Today is Sunday.

IIb)         今天星期天。

jintian    xingqi-tian.
today     Sun-day
Today is Sunday.

It seems to be that the subject NP stands for some individual element(s), and the predicate NP describes a set (property) where the subject belongs. But it is not clear how to capture Ib) and IIb) while excluding 7b). We leave this question open.

[2] We realize that the syntactic constraint defined here is only a rough approximation to the data from syntactic angle. It seems to match most data, but there are exceptions when yi (one) appears in a numeral-classifier phrase:

IIIa)  他一副好身体。

ta            yi             fu            hao         shenti.
he            one         (cl.)         good       body
He is of good health. (He is of a good body.)

IIIb) * 他三副好身体。

ta            san          fu            hao         shenti
he            three       (cl.)         good       body

IIIc)   他好身体。

ta            hao         shenti.    [same as 1) ]

IVa) 李四一张圆圆的脸。

Lisi          yi             zhang     yuanyuan             de            lian.
Lisi          one         (cl.)         round-round         DE          face
Lisi has a quite round face.

IVb) * 李四两张圆圆的脸。

Lisi          liang       zhang     yuanyuan             de            lian.
Lisi          two         (cl.)         round-round         DE          face

IVc)  李四圆圆的脸。

Lisi          yuanyuan             de            lian.        [ same as 3) ]

[3] Another reading for 22a) is [S [Sta xuexi][AP hao]], where ta xuexi is a subject clause: "That he studies is good". This is another issue.

 

[Related]

Interaction of syntax and semantics in parsing Chinese transitive verb patterns 

Notes for An HPSG-style Chinese Reversible Grammar

Outline of an HPSG-style Chinese reversible grammar

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

PhD Thesis: Chapter VII Concluding Remarks

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

Notes for An HPSG-style Chinese Reversible Grammar

ABSTRACT

Key words: Chinese parsing, Chinese generation, reversible grammar,  HPSG

This paper presents a reversible Chinese unification grammar named CPSG. The lexicalized and integrated design of CPSG embodies the general spirit of the modern linguistic theory Head-driven Phrase Structure Grammar (HPSG, Pollard & Sag 1987, 1994). Using ALE formalism in Prolog (Carpenter & Penn 1994), we have implemented a prototype of CPSG.

CPSG covers Chinese morphology, Chinese syntax and semantics in a novel integrated language model (Figure 1, for interface between morphology, see Li 1997; for interface between syntax and semantics, see Li 1996). CPSG model is in sharp contrast to the conventional clear-cut successive design of grammar components (Figure 2, see survey in Feng 1996). We will show that our model is much better suited and more efficient for Chinese analysis (or generation).

 

cpsg

Grammar reversibility is a highly desired feature for multi-lingual machine translation application (Hutchins & Somers 1992, Huang 1986, 1987). To test its reversible features, we have applied the CPSG prototype to an experiment of bi-directional machine translation between English and Chinese. The machine translation engine developed in our Natural Language Lab is based on shake-and-bake design, a novel approach to machine translation suited for unification grammars (Whitelock 1992, 1994, Beaven 1992, Brew 1992). The experimental results meet our design objective and verify the feasibility of CPSG approach.

~~~~~~~~~~~~~~~~~~~~~

Notes for NWLC-97, UBC, Vancouver

Outline of An HPSG-style Chinese Reversible Grammar

Wei LI   ([email protected])

Linguistics Department, Simon Fraser University

 

 Key words:          lexicalist approach, integrated language model, HPSG,

                                reversible grammar,  bi-directional machine translation, 

                                Chinese computational grammar,

                                Chinese word identification, Chinese parsing,
Chinese generation

 

  1. background

1.1. design philosophy

Two major obstacles in writing Chinese computational grammar:

lacking in serious study on Chinese lexical base

well designed lexicon is crucial for a successful computational system

theoretical linguists have made fruitful efforts (e.g. Li Linding) but lack formalization

computational linguists require more patience in adapting and formalizing the fruits:

it is huge work, but has to be done if a non-toy system is targeted

lack of effective interaction between morphology, syntax and semantics.

e.g.

ambiguity in word identification makes it hard to interface morphology & syntax:

a theoretical defect of morphology preprocessor (segmenter)

e.g. ABC: ABC or A | BC or AB | C or A | B | C?

active/passive isomorphic phenomena make semantic constraint a desired need in parsing NP Vt: subject NP or object NP?

Solution: the lexicalized and integrated design of Chinese grammar

1.2. major theoretical foundation:

HPSG:       lexicalist theory encouraging integration of different components

a desired framework matching our design philosophy

CPSG: HPSG-style unification grammar

CPSG: reversible grammar suited for both parsing and generation

CPSG: formalized grammar, a description that does not rely on undefined notions

  1. integrated language model

2.1. CPSG versus conventional Chinese grammar

 

 

parse tree embodies both morphological and syntactic structures in CPSG

  1. lexicalized formal grammar

3.1. formalized grammar, as required by a computational grammar: formulation of CPSG

readily implementable (theories, principles, rules, etc.);

precise definition for the very basic notions (e.g. sign, morpheme, word, phrase, sentence, NP, VP, etc.), rules (PS rules and lexical rules), lexical items (lexical hierarchy), typology (hierarchy embodied in feature structures)

(4.)       Definition: sign

A sign is the most fundamental concept of grammar. Formally, a sign is defined by the type [a_sign], which introduces a set of linguistic features for its description, as shown below.

a_sign
INDEX index
KANJI kanji
MORPH1 expected
MORPH2 expected
CATEGORY category
COMP0 expected
COMP1 expected
COMP2 expected
MOD expected
KNOWLEDGE knowledge
CONTENT content
INDEX0 index
INDEX1 index
INDEX2 index
DTR dtr

(5.)       Definition: word

In CPSG, a word is a sign satisfying the following two conditions: (1) its obligatory morphological expectation has all been saturated; (2) it is not a mother of any syntactic structures, hence no syntactic daughters. Formally, a word is defined as shown below.

(6.)       word

a_sign
MORPH1 ~obligatory
MORPH2 ~obligatory
DTR no_syn_dtr

3.2. lexicalized grammar

CPSG consists of two parts:

(1) a minimized general grammar:

only 11 phrase structure rules
(covering complement structure, modifier structure,
conjunctive structure and morphological structure)

(2) a feature enriched lexicon:

lexical entries;
lexical hierarchy and a set of lexical rules
(capturing lexical generalizations).

 

(7.)          comp0 PS rule

MOTHER               a_sign
COMP0 saturated
COMP1 [1]
COMP2 [2]
DTR comp0
MYSISTER [6]
LEFTMOD [7] category
RIGHTMOD [8] category
LEFTCOMP [9] category
RIGHTCOMP [10] category

===>

EXPECTING          a_sign
COMP0 a_expected
DIRECTION left
ROLE [3]
SIGN [4]
COMP1 [1] ~obligatory
COMP2 [2] ~obligatory
INDEX [5]
DTR dtr
LEFTMOD [7]
RIGHTMOD [8]
RIGHTCOMP [10]

EXPECTED            a_sign [4]
CONTENT content
MYHEAD [5]
MYROLE [3] comp_role
INDEX [6]
CATEGORY [9]

PRINCIPLE            #head_feature

(8.)          lexical entry: chi

a_sign
KANJI one_character
H1 chi
CATEGORY v
INDEX0 [1] index
INDEX1 [2] index
COMP0 a_expected
DIRECTION left
SIGN a_sign
CATEGORY n
INDEX [1]
COMP1 a_expected
DIRECTION right
SIGN a_sign
CATEGORY n
INDEX [2]
KNOWLEDGE eat
U_OBJECT food
MALE none
PERSON 3
SINGULAR bin
U_SUBJECT animate
MALE bin
PERSON tri
SINGULAR bin

  1. Implementation and Application of CPSG

CPSG prototype implemented in ALE and Prolog, having parsed a corpus of 200 various types of sentences

ALE and Prolog: suitable for unification grammar
ALE:         mechanism for typed feature structures: type polymorphism
a powerful tool in language modeling

CPSG prototype adapted for application to bi-directional MT, having generated the same corpus of 200 sentences

References

Beaven, John L. (1992): "Shake and Bake Machine Translation", Proceedings of the 15th International Conference on Computational Linguistics, pp. 603-609, Nantes, France.

Brew, Chris (1992): "Letting the Cat out of the Bag: Generation for Shake-and-bake MT", Proceedings of the 15th International Conference on Computational Linguistics, pp. 610-616, Nantes, France.

Carpenter, B. & Penn, G. (1994): ALE, The Attribute Logic Engine, User's Guide

Feng, Z.  (1996): "COLIPS Lecture Series - Chinese Natural Language Processing",  Communications of COLIPS, Vol.6, No.1 1996, Singapore (http://www.iscs.nus.sg/~colips/commcolips/paper/p96.html)

Huang, X-M. (1986): "A Bidirectional Grammar for Parsing and Generating Chinese".  Proceedings of the International Conference on Chinese Computing, Singapore, pp. 46-54

Huang, X-M. (1987): XTRA: The Design and Implementation of A Fully Automatic Machine Translation System, Doctoral dissertation, University of Essex.

Hutchins, W.J. & H.L. Somers (1992): An Introduction to Machine Translation. London, Academic Press.

Li, W. (1996): Interaction of Syntax and Semantics in Parsing Chinese Transitive Patterns. Proceedings of International Conference on Chinese Computing (ICCC'96), Singapore

Li, W. (1997): Chart Parsing Chinese Character Strings. Proceedings of The Ninth North American Conference on Chinese Linguistics (NACCL-9, to be available), Victoria, Canada

Pollard, C.  & I. Sag (1987): Information based Syntax and Semantics Vol. 1: Fundamentals. Centre for the Study of Language  and Information, Stanford University, CA

Pollard, C.  & I. Sag (1994): Head-Driven Phrase Structure Grammar,  Centre for the Study of Language and Information, Stanford University, CA

Whitelock, Pete (1992): "Shake and Bake Translation", Proceedings of the 14th International Conference on Computational Linguistics, pp. 784-790, Nantes, France.

Whitelock, Pete (1994). "Shake and Bake Translation", C.J. Rupp, M.A. Rosner, and R.L. Johnson (eds.), Constraints, Language and Computation, pp. 339-359, London, Academic Press.

 

[Related]

Outline of an HPSG-style Chinese reversible grammar

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

PhD Thesis: Chapter VII Concluding Remarks

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

【一日一parsing:从“见面”的subcat谈起】

白:
“三两面”和“两三面”很不一样啊……
我借过他三两面。我见过他两三面。

我:
三两面 > 两三面
我见过他三两面

0912a
ditransitive, no problem, but:

0912b

separable verb jian-mian is still not connected

还有:
(0)我见过他两三面。
(1)我见过他。
(2)我与他见过面。
(3)* 我见过面
(4)我们见过面。
(5)我与他,见面过。

“见面” 要求或者主语是复数(4),或者主语是并列结构(5),或者带有介词短语“与(with)”(PP或并列在汉语界限不清,(2)),或者动量词疑似的“两三面”前必须有定语【human】。所有的这些句法subcat要求都是满足语义(或常识)的一个【human】的坑:常识是,“见面“”必须在两个或以上的 human entities 之间进行。

HPSG 这类极端依赖subcat数据结构的词驱动的理论和语言学表达,尽管繁缛,但有一个亮点, 就是把上述的句法要求作为 input 的匹配条件描述,与内在的语义要求(类似于 HowNet 的描述)作为语义的 output,一条一条形式化,细致入微,丝丝入扣。用的是 label 的unification(就是 label 所代表的子结构的 sharing)机制。多数系统对于 subcat 的内部结构,input到output的映射,以及背后的句法与语义的关系(语义是句法的动因,同时也是句法的目标:句法匹配,语义实现),都显得太简陋了。

过犹不及,不及犹过。我们一直在探索在 subcat 的表达和实现中,如何做到中庸而不平庸,简约而不简陋。

白:
他我见过几面

我:
简陋之极的一个例证是给人用的 Oxford 高级词典和朗曼词典的那些 subcat codes,类似 v1,。。。v23 之类。后来纽约大学专门组织CL的研究生做 CompLex 和 NomLex 等 subcat 词典。中文方面,社科院语言所的【现代汉语800词】开 subcat 先河,【动词用法词典】等系列辞典,开始试图把 subcat 用某种编码加例句予以表达。所有这些工作,从数据表达和关系看,都显得有些简陋。其根子是,句法和语义没有厘清。

对于一个 NLP practitioner,拿来这些资源,必须在肚子里做这个句法语义的连接和消化,然后确定数据结构,找寻自己的实现途径。实现的时候,很难达到 unification 文法的漂亮,大多是凑合事儿,为的是避免 HPSG 这类的实现起来的低效率和数据结构的难维护。

董老师的 HowNet 对于汉语和英语的 subcat,语义上登峰造极了,但是句法方面还是显得不够细致周全。譬如“见面”这类的上述6-7种句法规定,好像就没有一一描述(董老师指正:也许我没吃透),也没见哪家描述清楚过。也都需要一个重新咀嚼消化,然后去实现。

0912c

(3)的 generation 不合法(*),但对于 parsing,鲁棒性要求这样parsing,没错。

0912d

没调试,居然出来了,912 的狗屎运吧。(911恐袭,913林跑,都不是好日子。)只剩下 “我见过他两三面” 这个 case 了。这个类似动量补语的东西其实仅限于:“一面”,“几面”,“两三面”,“三两面”,等少数几个。起码,100+ 面 基本不可能 除非是恋人。

张: 崇拜严重中

我:
张老师谬赞。清谈误国,我只要不误“人”子弟就好了,一辈子没当过教授,要误也都是人家子弟,哈。

张: 白求恩

我:
认真说,其实真地涉嫌误人子弟,因为凡事都有一个大环境和背景,我说的这些个多少有些异类,结果是,主流学生雾里看花。雾里看花也算增加视野,最误人的是,看到花,却够不着。这就好比鲁老爷子说的,本来人家黑屋子里面睡得蛮香甜,你非要去【呐喊】,唤醒了,可屋子还是黑屋子,这就不仅仅是残忍了。不残忍的法子就是,等以后退休了,开一个 Deep Parsing 开源公园,每条代码,每个词条,每段规则,全部公开,然后看看能不能靠众人的力量,弄一个无敌系统来。大家一起玩符号逻辑,让两条路线永远。

 

 

【相关】

中文处理

Parsing

【置顶:立委NLP博文一览】

《朝华午拾》总目录

 

【语言学小品:苹果发布 iPhone 7 的“话术”】

我:
前一阵提到汉语 if-then 简约式对parsing的挑战。昨天又遇到一些例子,也是极少显性形式痕迹,可是人就理解为 if-then: “中国出生,美国长大,如何申请回国定居?”

VP1, VP2, how VP3

中国出生,美国长大,如何申请回国定居?
== 【如果】【一个人】【在】中国出生,【并且】【在】美国长大,【那么】【他/她】【将】如何申请【他/她】回国定居【的paperwork】【呢】?

省去了多少玩意儿,简约的中文!

这种句式听起来很顺耳,普罗没感觉有理解或缺省的问题。仔细看,也不能算没有形式痕迹,这样的 pattern 似乎就应该是这样的理解(?):

VP1(, VP2, ...), how VPn?

一旦匹配上,还有其他的语义可能吗?VP1 到 VPn-1 都是 AND 条件, VPn 才是虚拟条件的结果。

白:
不甜不要钱,不甜的不要钱
一个意思,形式上真要拉开那么大差距吗
理解为省略“的”,就是单。理解为省略“如果……则……”就是复句

我:
的字结构,是一个短语与从句的中间怪物,英语的 what-clause 亦然。

白:
如果依照“懒人定律”,无论如何过程简约、结果简约的理解优先。
用最小能量补齐者优先

宋:
不完全一样。瓜主指着一堆瓜说“不甜不要钱”,意思是我保证个个都甜。“不甜的不要钱”口气软一点,是说我不保证每个瓜都甜,如果你买到的瓜不甜,我就不收钱。

白:
您的例子只能区分省略掉的名词加的是存在量词还是全称量词,不能区分省略掉的小词是“的”还是“如果……则”

我:
@宋 好区分。不过,这种口气的软硬真地很 subtle,广告商似乎常常有意利用这种 nuances,来忽悠老百姓。同样的广告词,软的方面理解才是实在的,广告商希望听众往硬的方面理解,来凸显其底气。“不甜不要钱”,就是这样的话术。它的实际意义和法律意义等价于“不甜的不要钱”。但它想传达的却是,我的产品多牛,根本不可能不甜,不信我愿意跟你打赌。

白:
不管软硬,真遇到不甜的(逻辑反例),肯定是哪个瓜不甜哪个瓜不要钱,不会整堆儿不要钱。不信试试。

我:
不用试吧,@白硕

说到“话术”,昨天看苹果发布会,体会才深,从乔布斯时代到现在,苹果最经常用到的忽悠信众和普罗的话术就是:iXyz is the best Xyz ever made by Apple
这种话是宇宙真理,没有丝毫信息量,却听上去似乎是最有力量的广告词。

白:
有sentiment就够了

我:
尼玛做电子产品,不是越做越好,难道越做越坏?新一代比起前面的几代好,不是理所当然吗?这里的 best 不就是这么声称吗?屡试不爽,把全世界当傻瓜,可是全世界还就愿意当傻瓜。没人 question 或反讽。我要是苹果的竞争对手,就专门做一个宣传片,嘲讽这个“话术”。

白:
made和比较范围并没有硬捆绑呀。
不是硬性的

我:
是 iPhone7 与 iPhones 比较;iWatch Series 2 与 iWatch 1 比较

白:
也可以理解为横向比

我:
这是正式新闻发布:
San Francisco — Apple today introduced iPhone 7 and iPhone 7 Plus, the best, most advanced iPhone ever, packed with unique innovations that improve all the ways iPhone is used every day.“the

“the best, most advanced iPhone ever”

白:
又回到限定性非限定性问题上,聪明的一休

我:
逻辑上,剔除定语,就是 iPhone 7 is iPhone

白:
这个跟“媳妇是娘”那种剔除法一样不可取。

我:
苹果就是完全烂了,没有任何创新,也永远可以这样声称:
iPhone 7 is the best iPhone.
(iPhone 8 will be the best iPhone)
In fact, a new iPhone release is always the best iPhone.

白:
问题是,把苹果买在手里的用户,按照另一种理解,会有一种傲视天下的感觉。

宋:
马列主义的顶峰。
新顶峰。

我:
他要是真牛,应该说 iPhone 7 is the best smart phone.
不过他不敢

白:
苹果不蠢,只是蒙不了伟哥而已。

我:
只有谷歌 SyntaxNet 才傻乎乎地敢于不带范围地如此声称世界第一

 

 

【相关】

【汉语句法的挑战之一:if-then的简约式】

【置顶:立委NLP博文一览】

《朝华午拾》总目录

 

《立委科普:NLP系统语义模块的任务》

本篇旨在探讨NLP(Natural Language Processing)语义模块的任务,尤其在知识图谱应用中。探讨之前,我们先站在万米高看俯瞰一下语义模块在语言学和NLP的主要模块的架构中位于何处。
语言学的教科书通常把语言文本研究从浅入深划分为这么几个分支:词法(morphology)、句法(syntax)、语义(semantics)和语用(pragmatics)。还有另一个维度的分支,叫篇章研究(discourse study),是跨句进行,其他的研究一般限于句内。词法句法的研究成果在 NLP 中表现为 parser,可以自动把线性字符串的语句分析为句法树结构,千变万化的语句因此化为有限的句型或 patterns,为语言理解和应用提供了坚实的基础。语义处于句法之后、语用之前,我们叫它为语义中间件 (middleware),因为它是领域独立的语言研究的终点,支持的是依赖领域和应用的语用。这个语义中间件的任务也可以留到语用阶段在语义落地(semantic grounding)的时候根据语用对语义的要求来一起做,但是理论上,总有一部分语义工作有足够的领域独立性,值得提前做好,来支持种种不同的语用场景和应用,减轻语用模块的负担。
如此定义的语义模块(语义中间件),主要是寻找 hidden links,譬如隐含的逻辑主语、宾语 等。这些在句法阶段没有显性表明,但是有足够证据去确定如何填补。填补的时候,一个是利用句法(显性的links),一个是利用 ontology,通常是二者的结合。词驱动(word-driven)来做,是一个很 tractable 的任务,是比parsing更琐碎但难度较低的工作,因为要结构有结构,要ontology有ontology(包括动态形成的ontology节点,譬如NE专名的分类),条件比纯句法分析模块只有线性的pattern可用,是成熟多了。其有用性还是不太清晰: argument 之一就是,如果 hidden 的语义重要,人为什么不用显性句法手段?即便在一个句子的选定的句法结构中,某个重要的语义难以显性表达,如果足够重要,人就会换一种句法结构在另一个句子显性表达出来。如果上述 argument 有一定的道理,那么不做 hidden 语义,对于大数据挖掘,应该不会有太大的损害。至少在大数据挖掘这样的场景,信息的冗余性足以弥补 个体 hidden 语义的不全。在句法结束的时候,有些句子提到的 arg(s) 并没有到位,可以说是不饱和(unsaturated)。语义中间件的任务就是把句法没有做全的不饱和的坑填得饱和,hidden links 建立了,于是就饱和了。如果句法模块和语义模块以后,仍然不饱和,就应该在 discourse 中去找。如果 discourse 中还是没找到,那么理论上是应该通过常识去饱和它。
回到万米高空俯瞰,昨天还在想所谓“语义计算”到底包含哪些呢。从 community 来看,相关的方面有:(1)WSD(Word Sense Disambiguation); (2) FrameNet (role labeling); (3) IE(Information Extraction)。“经典”IE (MUC IE 传统)里面一般分 NE、relationship、event,外加 Coreference,等任务。从结构图的角度看,NE 和 WSD 是做 node 的语义计算;FrameNet 和 IE Template (for relationship or event) 是做 arc (link)的语义计算。这样来看 community 定义的几个任务和方向,可以发现,(1) 和 (2) 都是学究式的任务,不实用。(3) 是最接地气的东西,是应用(apps)直接需要的。但是 IE 是针对领域的,直接为产品服务的,不好抽象,那么就可以想想什么东西是句法之后,语用之前,最能帮助 IE。其中之一就是 Coreference,这个任务已经被 IE 收编了,但它实际上是独立于领域的篇章(discourse)尺度的语义计算,是为了支持 IE 的跨句整合的。
沿着这个思路,我们还可以细化,根据实际需求,我们定义过三个任务,觉得应该在语义中间件里面做,它们应该可以惠及所有的应用:第一个是 同位语关系,这个可以看成是 Corference之一种;第二个是 部分和整体的关系(譬如,苹果和iPhone);第三个 原因和结果的关系。上述三个关系不限于句法短距离,也包括远距离的,甚至跨句的这类联系。我们一直在这三个关系,加上代词的coreference (包括专名的 aliasing) 上下功夫,比在 hidden 逻辑主谓宾方面更多,因为前者直接服务于 local IE 以后的 IE,以便整合成图谱,是整合的粘合剂,后者大多可以通过信息冗余去做弥补。
以上说的是实践中摸索出来的体会,就是自然而然这么走下来的。local IE 在抓取信息填 IE Template 里面的坑的时候,所看到的都是局部的信息,所填坑的材料经常很“虚”。虚的极端例子就是代词(“它”,“这个”),或者 一些指代性的名词(“这台电脑”),这些东西只能作为桥梁,不能真正导致图谱。这时候语义模块在上述四个方面所做的工作,就可以帮助把这些虚的材料,变得实在,这是通向图谱的一个很重要的支持。
大而言之,语义中间件做到什么程度合适,有很大的争论空间。在确定应用之前,不少细线条语义进一步伸展没有太大意义,或者劳而少功。就是说在句法把结构的框架搭起来以后,在语用层面的具体应用确定之前,到底要做多少语义计算,不是容易说清楚的,直觉上和经验上,不赞成做得太多。从某种意义上看,费尔默创立 FrameNet 就是想把语义中间件进行到底。理论上,他的深入是有道理的,因为在 arg structure (句法subcat的拿手好戏)之后,如果要深入,domain independent 的 Frame hierarchy 是通向语用的深度桥梁。起码理论上如此。但是我们做了18年的 IE 以后,结论是,费尔默那个语义计算的路子基本是歧途。没感觉到啥好处,却带来了很大的 overhead,可操作性很差,也并不省功。IE 领域用 Template 定义语用领域的需求,没有人主张把这些 Templates 定义在 FrameNet 的 hierarchy 上面,因为感觉不到需要,而且也不现实。100 年后,也许 FrameNet 可以被重新发现,因为那时候的语用落地已经太多了,需要组织组织了。FrameNet 正好提供了一个组织和整合的框架,如今的语用落地都是零星的。
立委牌 NLP University 中,能看懂上面这些参杂了些假洋鬼子话(术语)的“高阶科普”的后学,是可以授予学位的。这个学位是硬通货。看不懂也没关系,可以视为狂人乱语,或者是误入迷宫,不隔行也如山,耽误了你玩深度学习(dl)的宝贵时间。

【语义计算沙龙:三角关系的 chemistry 种种】

白:
朴泰恒小组成绩不好,今天不一定能进决赛
上面例子,“小组”怎么摆,是个考验。
原意是“在小组赛阶段的”

梁:
朴泰恒今天小组成绩不好。
孙杨小组第一。

白:
以人命名的小组也是存在的

梁:
是啊,感觉“小组成绩不好”是谓语。这里小组也不是“朴泰恒的小组“,考验来了。

我:
不是说大数据吗 看 某某某小组 是不是够资格

t08061

t08062

t08063

t08064

t08065

梁:
@wei 很棒! 有个 Topic.

宋:
@wei 确实很好。但是确实能区分两种“小组”,还是只顾一头?

我:
没有大数据,应该是只顾一头吧,可以试试另一头的典型案例

宋:
即使有大数据,还得区分时代、地域、行业等,不好办。
而且,这就成了有监督的学习了,需要做语料标注。

白:
不一定宋老师。可以词典里离线加标签,目标文本在线只需计算标签密度,不涉及监督学习。

宋:
具体解释一下吗?

我:
词典习得本质上是无监督的 ngram 频率做底。假设北京大学不在词典 应该可以学出来,某某某小组 亦然。白老师说的是在线词典化 通过现场计算。

宋:
@wei 就这个例子而言,对比“朴泰恒小组”和“朴泰恒……小组”的频率,是吗?

我:
能不能解决这个问题:北京大学、中学、小学要立刻全部动员起来
xyz 相交切分的通则:xy 强 还是 yz 强,这个道理上可以在线检索计算
“北京大学” 还是 “大学、中学” 强

宋:
如果看作交搭型歧义问题,那么在大数据中,肯定是“小组成绩”频率高过“朴泰恒”的频率,除非朴泰恒这个人太红。因此,以此决定句法结构,似乎理由不足。

我:
人是怎么决策的呢?
这里可能涉及大数据的范围问题。
数据不是越大越好 尤其不能杂 大而杂 就把领域抹平了,而很可能这是领域知识

宋:
对,我糊涂了。

白:
其实,和人名结合是兜底的,要学的只是不和人名结合的高频词串。
向右结合的条件不满足,就默认向左好了。
大数据不是这么用的。

宋:
不过无论如何,一般来说,X小组 比不上 小组成绩。这里是领域知识问题,不大好用词频去处理。

我:
先说一下篇章现象 one sense per discourse.
如果同一篇中 还有 某某某小组 再现。那个原则是过硬的 可以 在篇章内搞定,这时候大数据认输。

宋:
张三小组第一,李四小组第二。

白:
@宋柔 这个是歧义

我:
分为四级
第一级 是词典绑架 北京大学基本如此
第二级 是篇章原则
第三级 是领域数据
第四级 才是大数据 超领域的
涉及到专名 术语的 走不到超领域的大数据,大数据抹平了领域知识 反而不妙

白:
词例级如此,特征级未必
特征级可以把xx小组一起拿上来统计。

我:
明白。不过具体操作起来,还是一笔糊涂账。xxx 小组 与 小组成绩 打架,要赢多少 算赢?在多大的数据里?如果特别悬殊 好说,稍微有些接近 就是烂帐,or 烂仗。

白:
另外,针对篇章可以计算特征密度,如果某种特征密度显著比其他特征高,也可用。比如体育特征显著,“小组”做前缀就优先级较高。

宋:
我在11年人民日报中检索,“小组赛”1013次,“小组成绩”4次,“小组赛成绩”两次,人名+小组3次。对于一个毫无体育比赛知识的人,如果有一般的比赛知识,知道比赛会出成绩,就能推知“小组比赛”是一个短语。首先是从黏着的“赛”黏着到“小组赛”,知道有“小组赛”这个术语,并能理解这是分小组而比赛。由于知道比赛会出成绩,就能推知“小组成绩”是一个短语,指某人在小组赛中的成绩。人名+小组7次,但都与体育无关:赵梦桃小组,郝建秀小组等,都是棉纺厂的。一个人,没有体育比赛知识,但有一般的比赛知识,又有语言知识,就可以有这样的推理

我:
“周恩来思想深刻 谈吐幽默”,vs. “毛泽东思想深刻”
“思想” 与 “小组” 类似

宋:
1940年代以前,汉语中好像没有“人名+思想”作为一个词的。此后,“毛泽东思想”频率越来越高。但其他人名+思想就不能成词。

我:
这个政治有意思:从此 其他 人名+思想 成为禁忌:我花开来百花杀啊。

白:
@宋 “小组循环赛”“小组出线”“小组第一”……等各种组合均以“小组”为前缀,如果只对实例,其实比“朴泰恒小组”好不到哪里去。统计频度多一点少一点都做不得结构优选的依据。但是如果抽象地考察“前缀模式”和“后缀模式”的优先程度受什么影响,必然会追溯到特征以及特征在篇章中的密度分布。如果“体育”或“竞赛”特征及其密度优势显著,“小组”倾向于做前缀,否则倾向于做后缀。如果前缀所带的实例碰巧在大数据里固然好,不在,也可通过特征及特征密度间接获得友军的支持。同样,如果“人名”“任务名”特征或特征密度显著,“小组”倾向于做后缀。

 

【相关】

【置顶:立委NLP博文一览】

《朝华午拾》总目录

 

 

【一日一parsing:degraded text and robust parsing】

我:
“i love programming the games are cool its fun to play them don't you think”
@梁 here are parsing results of your casual English:

t0721a

So there is one error in parsing this "degraded text":
our parser links "the games" as Object of "programming" which is locally correct, an understandable mistake. But human knows there is a missing punctuation and will link "the games" as Subject of "are", other aspects of parsing seem alright.  So "degraded text" does pose some challenges, but a robust parser can still handle most of it.

@梁:
Thank you, @wei. It is very well handled. By the way, it is not my casual English. I copied it from Khan Academy.
@wei, ”Opred“ means predicate as objective, what is "infmod"?

白:
不定式作后置修饰语

我:
对。Opred 是谓词性宾语,包括ing和不定式。
其实那个错误 做细活 是可以改正的 因为 are 对主语的强制性力量 远远超越了作为前面动词宾语的力量。这样就达到人的结构分析水平了。

白:
think怎么next了?这个是个反义疑问句啊。

我:
白老师眼毒,不指出我根本就没注意到呢。那显然是一个 bug:助动当成主动词了。
就事论事 那个应该词典化。

白:
are距离又近,不填主语又不饱和。反倒是programming,不是非有坑不可。
词典化赞同。

 

 

 

【相关】

【置顶:立委NLP博文一览】

《朝华午拾》总目录

 

 

成语的弹性识别和理解机制

白:
“去年秋膘应犹在,只是猪颜改”

我:
1234应犹在 只是56改
成语弹性机制一抓一个准。一个成语中 哪些是变量 哪些是常量 可以研究。人心里大体有本帐。拿 “九牛二虎之力” 为例,弹性第一环是数词的变量化:m牛n虎之力

二牛九虎之力
九虎二牛之力
八虎七牛之力
四牛五虎之力

都不影响parsing和理解,总之是 费了老鼻子劲儿。

弹性第二环 是名词沿着taxonomy变量化:m 【大动物】n【大动物】之力

九熊二豹之力
三象五狮之力

转:
今个立秋,问苍天什么季节最忙? 秋天,多事之秋; 什么季节最公平? 秋天,平分秋色; 什么季节最简单? 秋天, 一叶知秋; 什么季节最长? 秋天,一日不见如隔三秋; 什么季节最爽? 秋天,秋高气爽;什么季节最险?秋天,秋后算账: 什么季节最暧昧? 秋天,暗送秋波!秋日快乐!!

成语弹性机制 从 “秋” 上升到 【季节】 再上升可以到 【时段】:

多事之春 多事之年 多事之岁月
平分春色
一花知春
一日不见如隔三冬
一日不见如隔九冬

白:
秋天来了,冬天还会远吗

我:
冬天来了 秋天还会远吗
这是时间隧道
或倒转 或快进。

关于小标题:

0905b

【成语】的【【弹性【识别和理解】】机制】,论句法应该是这样的:对于成语,需要一个弹性的识别机制,或者弹性识别的机制。但写的时候,脑子里更可能想的是,对于【成语的弹性】,需要一个识别机制。

再一想,who cares,人的表达和理解不常常是这样模模糊糊的吗。除了段子或较真,通常人根本就对这类结构歧义无感。语义上的模糊也不影响理解的大面。

【相关】

立委NLP博文一览

《朝华午拾》总目录

立委NLP频道

Once upon a time, we were publishing like crazy

List of 23 NLP Publications (Cymfony Period)

Once upon a time, we were publishing like crazy ...... as if we were striving for tenure faculty

[1] R. Srihari, W. Li and X. Li. 2006. Question Answering Supported by
Multiple Levels of Information Extraction.  a book chapter in T. Strzalkowski & S. Harabagiu (eds.), Advances in Open- Domain Question Answering.  Springer, 2006, ISBN:1-4020-4744-4.

http://link.springer.com/chapter/10.1007%2F978-1-4020-4746-6_11

[2] R. Srihari, W. Li, C. Niu and T. Cornell. 2006.  InfoXtract: A Customizable Intermediate Level Information Extraction Engine.  Journal of Natural Language Engineering, 12(4), 1-37

http://journals.cambridge.org/action/displayAbstract?fromPage=online&aid=1513012

This paper focuses on IE tasks designed to support  information discovery applications. It defines new IE tasks such as entity profiles, and concept-based general events which represent realistic goals in terms of what can be accomplished in the near-term as well as providing useful, actionable information.

[3] C. Niu, W. Li, R. Srihari, H. Li.  2005. Word Independent Context Pair Classification Model For Word Sense Disambiguation.  Proceedings of Ninth Conference on Computational Natural Language Learning (CoNLL-2005)

W05-0605

[4] C. Niu, W. Li and R. Srihari. 2004. Weakly Supervised Learning for
Cross-document Person Name Disambiguation Supported by Information
Extraction. In Proceedings of ACL 2004.

ACL 2004 Niu Li Srihari 372_pdf_2-col

[5] C. Niu, W. Li, R. Srihari, H. Li and L. Christ. 2004. Context Clustering for Word Sense Disambiguation Based on Modeling Pairwise Context Similarities. In Proceedings of Senseval-3 Workshop.

ACL 2004 Context Clustering for WSD niu1

[6] C. Niu, W. Li, J. Ding, and R. Rohini. 2004. Orthographic Case
Restoration Using Supervised Learning Without Manual Annotation.
International Journal of Artificial Intelligence Tools, Vol. 13, No.
1, 2004.

IJAIT 2004 Niu, Li, Ding, and Srihari caseR

(7) Cheng Niu, Wei Li and Rohini Srihari 2004. A Bootstrapping
Approach to Information Extraction Domain Porting. ATEM-2004: The
AAAI-04 Workshop on Adaptive Text Extraction and Mining. San Jose. (PDF)

[8] W. Li, X. Zhang, C. Niu, Y. Jiang, and R. Srihari. 2003. An Expert
Lexicon Approach to Identifying English Phrasal Verbs. In Proceedings
of ACL 2003. Sapporo, Japan. pp. 513-520.

ACL 2003 Li, Zhang, Niu, Jiang and Srihari 2003 PhrasalVerb_ACL2003_submitted

[9] C. Niu, W. Li, J. Ding, and R. Srihari 2003. A Bootstrapping
Approach to Named Entity Classification using Successive Learners. In
Proceedings of ACL 2003. Sapporo, Japan. pp. 335-342.

ACL 2003 Niu, Li, Ding and Srihari 2003 ne-acl2003

[10] W. Li, R. Srihari, C. Niu, and X. Li. 2003. Question Answering on
a Case Insensitive Corpus. In Proceedings of Workshop on Multilingual
Summarization and Question Answering - Machine Learning and Beyond
(ACL-2003 Workshop). Sapporo, Japan. pp. 84-93.

ACL 2003 Workshop Li, Srihari, Niu and Li 2003 QA-workshopl2003_final

[11] C. Niu, W. Li, J. Ding, and R.K. Srihari. 2003. Bootstrapping for
Named Entity Tagging using Concept-based Seeds. In Proceedings of
HLT/NAACL 2003. Companion Volume, pp. 73-75, Edmonton, Canada.

NAACL 2003 Niu, Li, Ding and Srihari 2003 ne_submitted

[12] R. Srihari, W. Li, C. Niu and T. Cornell. 2003. InfoXtract: A
Customizable Intermediate Level Information Extraction Engine. In
Proceedings of HLT/NAACL 2003 Workshop on Software Engineering and
Architecture of Language Technology Systems (SEALTS). pp. 52-59,
Edmonton, Canada.

NAACL 2003 Workshop InfoXtract SEALTS paper2

[13] H. Li, R. Srihari, C. Niu, and W. Li. 2003. InfoXtract Locatio
Normalization: A Hybrid Approach to Geographic References in
Information Extraction. In Proceedings of HLT/NAACL 2003 Workshop on
Analysis of Geographic References. Edmonton, Canada.

NAACL 2003 Workshop Li, Srihari, Niu and Li 2003 CymfonyLoc_final

[14] W. Li, R. Srihari, C. Niu, and X. Li 2003. Entity Profile
Extraction from Large Corpora. In Proceedings of Pacific Association
for Computational Linguistics 2003 (PACLING03). Halifax, Nova Scotia,
Canada.

PACLING 2003 Li, Srihari, Niu and Li 2003 Entity Profile profile_PACLING_final_submitted

[15] C. Niu, W. Li, R. Srihari, and L. Crist 2003. Bootstrapping a
Hidden Markov Model for Relationship Extraction Using Multi-level
Contexts. In Proceedings of Pacific Association for Computational
Linguistics 2003 (PACLING03). Halifax, Nova Scotia, Canada.

PACLING 2003 Niu, Li, Srihari and Crist 2003 CE Bootstrapping PACLING03_15_final

[16] C. Niu, Z. Zheng, R. Srihari, H. Li, and W. Li 2003. Unsupervised
Learning for Verb Sense Disambiguation Using Both Trigger Words and
Parsing Relations. In Proceedings of Pacific Association for
Computational Linguistics 2003 (PACLING03). Halifax, Nova Scotia,
Canada.

PACLING 2003 Niu, Zheng, Srihari, Li and Li 2003 Verb Sense Identification PACLING_14_final

[17] C. Niu, W. Li, J. Ding, and R.K. Srihari 2003. Orthographic Case
Restoration Using Supervised Learning Without Manual Annotation. In
Proceedings of the Sixteenth International FLAIRS Conference, St.
Augustine, FL, May 2003, pp. 402-406.

FLAIRS 2003 Niu, Li, Ding and Srihari 2003 FLAIRS03CNiu

[18] R. Srihari  and W. Li 2003. Rapid Domain Porting of an
Intermediate Level Information Extraction Engine. In Proceedings of
International Conference on Natural Language Processing 2003.

ICON2003 paper FINAL

[19] H. Li, R. Srihari, C. Niu and W. Li 2002. Location Normalization
for Information Extraction. In Proceedings of the 19th International
Conference on Computational Linguistics (COLING-2002). Taipei, Taiwan.

COLING 2002 Li, Srihari, Niu and Li 2002 coling2002LocNZ

[20] W. Li, R. Srihari, X. Li, M. Srikanth, X. Zhang and C. Niu 2002.
Extracting Exact Answers to Questions Based on Structural Links. In
Proceedings of Multilingual Summarization and Question Answering
(COLING-2002 Workshop). Taipei, Taiwan.

COLING 2002 Workshop Li et al CymfonyQA_final

[21] R. Srihari, and W. Li. 2000. A Question Answering System
Supported by Information Extraction. In Proceedings of ANLP 2000.
Seattle.

ANLP 2000 Srihari and Li 2000 anlp9l

[22] R. Srihari, C. Niu and W. Li. 2000. A Hybrid Approach for Named
Entity and Sub-Type Tagging. In Proceedings of ANLP 2000. Seattle.

ANLP 2000 Srihari, Niu and Li 2000 anlp105_final9

[23] R. Srihari and W. Li. 1999. Question Answering Supported by
Information Extraction. In Proceedings of TREC-8. Washington

cymfony

Other publications: SBIR Final Reports

W. Li & R. Srihari. 2003.  Flexible Information Extraction Learning Algorithm (Phase 2), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. 


W. Li & R. Srihari. 2001.  Intermediate-Level Event Extraction for Temporal and Spatial Analysis and Visualization (Phase 1), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.

W. Li & R. Srihari.  2000.  A Domain Independent Event Extraction Toolkit (Phase 2), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.

W. Li & R. Srihari. 2000.  Flexible Information Extraction Learning Algorithm (Phase 1), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.

W. Li & R. Srihari 2003. Automated Verb Sense Identification (Phase I), Final Techinical Report, U.S. DoD SBIR (Navy), Contract No. N00178-02-C-3073 (2002-2003)

R. Srihari & W. Li 2003. Fusion of Information from Diverse, Textual Media: A Case Restoration Approach (Phase I), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. Contract No. F30602-02-C-0156 (2002-2003)

R. Srihari, W. Li & C. Niu 2004. A Large Scale Knowledge Repository and Information Discovery Portal Derived from Information Extraction (Phase 1), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. (2003-2004)

R. Srihari & W. Li 2003. An Automated Domain Porting Toolkit for Information Extraction (Phase I), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. Contract No. F30602-02-C-0057 (2002-2003)

T. Cornell, R. Srihari & W. Li 2004. Automatically Time Stamping Events in Unrestricted Text (Phase I), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. (2003-2004)

 

[Related]

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

【一日一parsing:谈parsing是问答系统的核武】

一日一parsing:今天的是。。。

0831d

怎么知道这里的问题和答案可以相配呢?如果有 parsing 和建立其上的知识图谱,那就好办。图谱里面有 professionOf 的 relationship,有了 parsing 抽取这个关系就是小菜(这个例子很简单,就是把同位语关系映射到professionOf关系)。有了 parsing 对于 question 要问的关系,也可以解出来 asking point,子树(S:李娜-从事,O:从事-运动;Mod:什么-关系)就确定了 asking point 是寻求 professionOf(“李娜”)。然后做语义 matching,问答系统的这个环就圆了。This is IE or knowledge-graph supported QA.

具体说,为了让Q和A能match,我们可以对两边做子树规则,填空(抽取)到 professionOf 的关系去,语义一体化,然后就顺风顺水了。第一条子树规则是:

"从事"O: (“职业|运动”)

O: (“职业|运动”)

Mod (“什么|何种”)

S: ^Sombody==>

==> professionOf(^Somebody,?)

professionOf(^Somebody,?)

这是 Question parsing 和 asking point extraction.  在答案源那一边,也有一组规则做 professionOf 的抽取,其中有这样一条规则:[personNE]

[person-NE]:^Person

equiv([profession_token]:^Profession)

==> professionOf(^Person,^Profession)

QA 就这样 match 了。

如果没有专门的知识图谱,没有事先定义好的关系的抽取,怎样做 QA 来应对呢?那就用 SVO parsing 也可以应对相当多的关于事件的问答。但是关系和复杂的事件的问答,简单的 SVO matching 就不行。好在原则上说,复杂的语义大多可以预先定义成 IE (predefined), 专门去做针对性抽取。简单的语义是 open-ended 的,语言学parsing(主谓宾定状补等)就够应付了。

天不我欺也。

IE 对于 SVO,实质就是 (semantic) slot normalization,原来的 slots 是语言学的,叫 S 也好, O 也好,equiv(同位语)也好,mod 也好 。。。。现在的 slots 是 pragmatic 的语义: 譬如 professionOf, locationOf, employeeOf, acquiringCompany, acquiredCompany, priceOfAcqusition, etc.

SVO matching 的 QA 也可以举一个例子, 譬如询问如何做某事:做+某事 就是一个 V+O:

0831a

0831b

0831c

甭管怎样换说法,不变的是 VO (格式化,硬盘)。有了这个 VO matching 做底,离开QA 或人机对话就不远了。譬如,FAQ 档案里很可能就有这样的标题: 格式化硬盘的步骤;关于格式化硬盘;等。于是 Q与A基本就是 SVO 子树 matching:"格式化“ ---O---> “硬盘”。
0901b

接着这个话题再发挥一下。IE 说的是信息抽取,多数时候这个 information 是与 insights (情报,有价值的信息)等价。但其实 IE 可以是抽取有价值的情报,也可以是抽取无价值的情报(噪音)。

为啥要抽取无价值的信息呢?道理很简单,噪音捣乱啊,为了剔除噪音,首先要识别它,或者说抽取它以便扔掉它。所用的方法可以完全一样。搜索界有 stop words ,被当做噪音扔掉了,那是噪音的最简单形式,不需要上下文,纯粹是高频虚词:对于 parsing 这些 stop words 其实很关键,是必要的建立结构的桥梁,但对于关键词搜索,因为里面没有结构,这些词就变成纯粹的噪音了。用 IE 来剔除噪音,实际上是根据上下文结构来断定哪些信息是应该扔掉的,譬如上面的句子里面,在 QA 的语用场景下,就可以剔除诸如:“请告诉我”、“我不知道”等,这样才凸显关键的的VO“格式化-硬盘”。要是做相似度计算,这些个词都是噪音。把“请告诉我”当成一个 4-gram 的 stop word 行不行?可以,但是如果这种东西有很多变式,ngram 就不行了。这时候在子树基础上做 IE 抽取噪音就非常可取了。又因为噪音大多可以用 word-driven 来做,做这件事儿是很靠谱的,基本一抓一准。

小结一下,一般而言,如果 Q 和 A 说法类似,譬如“格式化”+ “硬盘”,那么只要在 SVO 基础上做 matching 就可以把 QA couple 起来。如果 说法很不相同,或者一个关系或事件的变式太多,那么就加一层 IE,matching 在 IE 语义上做。SVO 的 QA matching 是智能搜索的本质,可以对付不可预测的问题。IE 的 QA matching 是预先定义的,针对领域的,不仅精准,而且可以应对变式。两个方案相辅相成。一个善于领域的精准,一个善于open domain 的广度和召回。二者都比 keywords 好出很多,因为有结构。如果从 backoff 来看,那就是 IE 优先, SVO 其次,keywords 楼底。这样精度广度就全照顾到了。

说来归齐,对于QA,对于对话系统,parsing 是核心引擎的关键技术。QA 说到底就是在 Q 与 A 中建立映射,映射的基础是语义匹配。deep parsing 及其 IE 是语义匹配的核武。

 

【相关】

【Bots 的愿景】

立委科普:问答系统的前生今世

泥沙龙笔记:parsing 是引擎的核武器,再论NLP与搜索

泥沙龙笔记:从 sparse data 再论parsing乃是NLP应用的核武器

【立委科普:NLP核武器的奥秘】

问答系统

泥沙龙笔记:搜索和知识图谱的话题

置顶:立委NLP博文一览】

《朝华午拾》总目录

立委NLP频道

立委硕士论文:EChA 试验结果 (11)

世界语到汉语和英语的自动翻译试验
-- EChA机器翻译系统概述

 [参考书目]

 

  1. Heinz Dieter MAAS "Automata  Tradukado en kaj el Esperanto" ( "Lingvo-kibernetiko kaj aliajinternacilingvaj aktoj de l(1a IX-a Internacia Kongreso de Kibernetiko", pp 75-81, 1982 Gunter Narr Verlag Tubingen )
  1. <<机器翻译论文选辑>> ( 科学技术文献出版社, 1979 )
  2. Kalocsay-Waringhien <<Plena Analiza Gramatiko de Esperanto>> ( 中国世界语出版社, 1984 )
  3. 刘涌泉等著 <<中国的机器翻译>> ( 知识出版社, 1984 )
  4. 刘涌泉, 高祖舜, 刘倬著 <<机器翻译浅说>> ( 科学普及出版社, 1964 )
  5. 刘涌泉, 李维 <<巴贝尔通天塔必将建成>> ( 中国第一届世界语大会论文, 1985.8 )
  6. 刘倬 <<三次机器翻译试验>> ( 第一次机器翻译学术会议论文, 1980.9 )
    <<论机器翻译规则系统的编制方法>> ( 1982.3 上海 )
    <<JFY型英汉机器翻译系统的研制和试验>> ( 语言学会第二届年会论文, 1983.4 )
  1. 乔毅 <<开展语言的计算机处理和世界语类型的机器翻译>> ( 中国第一届世界语大会论文, 1985.8 )
  2. 魏原枢, 徐文琪编 <<世界语语法>> ( 上海外语教育出版社, 1982 )
  3. 叶蜚声, 徐通锵著 <<语言学纲要>> ( 北京大学出版社, 1981 )
  4. <<语言和计算机>> (1) (中国社会科学出版社, 1982 )
  5. <<语言和计算机>> (2) (中国社会科学出版社, 1985 )
  6. 张道真编著 <<实用英语语法>> ( 商务印书馆, 1984 )

[致谢]

研制世界语类型的机器翻译系统, 从一开始就得到刘涌泉老师的热情支持, 从方案主体到具体问题的处理, 他都给以认真指导。在程序设计和上机调试的的过程中, 刘倬老师也多次给予指导, 有些基本操作的算法也是刘倬老师提供的。在EChA系统取得初步成果的时候, 笔者向他们表示深切的感谢。另外, 还要特别感谢机房韩老师的多方协助。没有她提供的方便, EChA系统根本不可能在这么短时间试验成功。

[附录一] EChA试验结果

 

(1) LA ORIGINALA TEKSTO / THE ORIGINAL TEXT / 世界语原文

(001) TIEL EVOLUIGHIS PLI KAJ PLI LA PLANADO PER MASHINOJ . (002) TIUJ MASHINOJ KOMENCE NUR ELKALKULIS LA DIKTITAJN MATEMATIKAJN PROBLEMOJN , KONFORME AL LA ENPROGRAMIGO . (003) LA ELEKTRONIKAN PROGRAMIGON PRETIGIS HOMOJ . (004) PLI POSTE , KIAM LA SCIODISKETOJ ESTIS ELTROVITAJ , LA PLENAN INDIKARON , ENDISKIGITAN , ONI METIS EN MASHINOJN KAJ ILI TIAMANIERE POVIS EN SI MEM AKUMULI SCIENCAN STOKON , PLI GRANDAN OL LA HOMA CERBO. (005) KAJ SE TEMIS EKZEMPLE PRI LA PLANADO DE ELEKTROMOTORO , ONI ENMETIS LA SHABLONDISKETON DE LA ELEKTROMOTOR-PLANADO , DONIS LA INDIKOJN DE LA DEZIRATA MOTORO ( KILOVATO , TENSIO , ROTACIO , TIPO , KTP ) , (006) POST KIO LA MASHINO MEM PROGRAMIGIS SIN KAJ FARIS LA KALKULOJN . POST KELKAJ MINUTOJ GHI JAM PRETE ELDONIS LA MEZUROJN : LA DIAMETRON DE LA ROTACIA PARTO , GHIAN LONGON, LA MEZUROJN DE LA KANELOJ , DRATOJ , LA VOLVONOMBRON , ENTUTE CHION BEZONATAN . (007) ECH PLI : BALDAU ESTIS ATINGITE , KE LA MASHINO FARIS LA TUTAN DESEGNON KAJ TRANSDONIS GHIN AL LA FABRIKO . (008) KOMPRENEBLE TIUJ < DESEGNOJ > NE ESTIS IDENTAJ KUN NIAJ PAPERDESEGNOJ . (009) ILI ESTIS DISKETOJ , KIUJ ENTENIS CHIUN DETALON . (010) TIAMANIERE LA PLANADON KAJ FABRIKADON DE LA MASHINOJ JAM PLENUMIS SAME MASHINOJ . (011) ILI PLANIS LA MENDITAN MASHINON , FABRIKIS , ECH KONTROLPROVIS GHIN KAJ LA FUSHAN FORJHETIS . (012) SED CHIO CHI ANKORAU OKAZIS SUB HOMA GVIDADO KAJ PLEJ GRAVE ESTIS , KE CHIO CHI BAZIGHIS SUR LA HOMA SCIO .

LA TEKSTO TRADUKITA EN LA ANGLAN / THE TEXT TRANSLATED INTO ENGLISH / 英语译文

(001) SO DEVELOPED MORE AND MORE THE PLANNING BY MACHINES . (002) THOSE MACHINES AT BEGINNING ONLY CALCULATED OUT THE DICTATED MATHEMATICAL PROBLEMS , ACCORDING TO THE PROGRAMMING . (003) MEN PREPARED THE ELECTRONIC PROGRAMMING . (004) MORE LATER , WHEN THE KNOWLEDGE-DISKETTES HAD BEEN FOUND OUT , PEOPLE PUT THE FULL INDICATION , ENDISKED , INTO MACHINES AND THEY THEREFORE COULD IN THEMSELVES ACCUMULATE SCIENTIFIC STOCK, MORE GREAT THAN THE MAN'SBRAIN . (005) AND IF IT CONCERNED FOR EXAMPLE ABOUT THE PLANNING OF ELECTRIC MOTOR, PEOPLE INPUT THE SAMPLE DISKETTE OF THE MOTOR PLANNING , GAVE THE INDICATIONS OF THE DESIRED MOTOR (KILOWATT , VOLTAGE , ROTATION , TYPE , ETC ) , AFTER WHICH THE MACHINE ITSELF PROGRAMMED ITSELF AND DID THE CALCULATIONS . (006) AFTER SEVERAL MINUTES IT ALREADY READILY GAVE OUT THE MEASUREMENTS : THE DIAMETER OF THE ROTARY PART ,ITS LENGTH , THE MEASUREMENTS OF THE GROOVES , WIRES , THE WINDING NUMBER , IN TOTAL ALL REQUIRED . (007) EVEN MORE : SOON IT HAD BEEN ACHIEVED , THAT THE MACHINE DID THE TOTAL DESIGN AND OVERHANDED IT TO THE FACTORY . (008) OF COURSE THOSE < DESIGNS >  WERE NOT IDENTICAL WITH OUR PAPERDESIGNS . (009) THEY WERE DISKETTES , WHICH CARRIED ALL DETAIL . (010) THEREFORE MACHINES ALREADY FULFILED THE PLANNING AND MANUFACTURING OF THE MACHINES SAMELY . (011) THEY PLANNED THE ORDERED MACHINE , MANUFACTURED , EVEN EXAMINED IT AND THREW AWAY THE USELESS . (012) BUT ALL THIS STILL HAPPENED UNDER MAN'S GUIDING AND IT WAS MOST IMPORTANT , THAT ALL THIS WAS BASED ON THE MAN'S KNOWLEDGE .

LA TEKSTO TRADUKITA EN LA CHINAN / THE TEXT TRANSLATED INTO CHINESE / 汉语译文

(001) 这样用机器设计越来越发展了. (002) 那些机器开始时仅仅按照输入程序计算出所命令的数学问题. (003) 人准备了电子程序设计. (004) 更以后,当微型知识磁盘被发明了时,人们把所写入磁盘的全套指令集合放到机器里面,他(它)们这样能在自己本身里面积累比人的头脑更大的科学贮蓄. (005) 如果涉及例如关于电动机的设计, 人们输入了电动机设计的微型样品磁盘, 给了所希望的电动机的指标(千瓦,电压,运转,型号,等等),在此以后机器本身把自己程序化了,做了计算. (006) 在几分钟以后它已经就能给出尺寸:运转部分的直径,它的长度,槽纹,导线的尺寸,圈数,总之所需要的一切. (007) 甚至更:很快达到了,机器做了整个图样,把它转交到工厂. (008) 当然那些<图样>与我们的图纸不是一样的. (009) 他(它)们是储有所有细节的微型磁盘. (010) 这样机器已经同样地完成了机器的设计和制造. (011) 他(它)们设计了所定购的机器,制造了,甚至检验了它,把废的抛弃了. (012) 但是这一切仍然在人的指导下进行,最重要的是,这一切以人的知识作为基础.

(2) DIVERSAJ FRAZOJ / VARIOUS SENTENCES / 各类文句

(016) KIAM MI ESTIS LUDANTA VIOLONON , MIA ONKLO VIZITIS NIAN HEJMON .
WHEN I WAS PLAYING VIOLIN , MY UNCLE VISITED OUR HOME .
当我(当时)正在拉小提琴时,我的叔叔访问了我的家.

(020)  MI ESTOS FININTA LA EKSPERIMENTON PRI MASHINA TRADUKADO POST KELKAJ MONATOJ .
I WILL HAVE FINISHED THE EXPERIMENT ABOUT MACHINE'S TRANSLATING IN SEVERAL MONTHS.
我在几月以后将已经完成关于机器的翻译的实验.

(028)  BABELO NE ESTIS ELKONSTRUITA.
BABEL HAD NOT BEEN BUILT UP .
巴贝尔塔没有被建成.

(029)  NEPRE ESTOS ELKONSTRUITA LA NOVA BABELO .
ABSOLUTELY WILL HAVE BEEN BUILT UP THE NEW BABEL .
新巴贝尔塔必然地将被建成.

(040)  KIAL VI LERNAS ESPERANTON ?
WHY DO YOU LEARN ESPERANTO ?
为什么你学习世界语?

(044)  NE PROKRASTU LA HODIAUAN LABORON GHIS MORGAU .
DON'T PUT OFF THE TODAY'S WORK TILL TOMORROW .
别把今天的工作推迟到明天.

(045)  KIEL BONE PENTRAS LA KNABO !
HOW WELL THE BOY PAINTS !
男孩多么好地画画啊!

(048)  KIU ESTAS LA AUTORO DE LA LIBRO , KIUN VI JHUS LEGIS ?
WHO IS THE AUTHOR OF THE BOOK , WHICH YOU JUST READ ?
你刚刚读了的书的作者是谁?

(050)  SE MI PARTOPRENUS EN VIA AMUZA AKTIVADO , MI ESTUS TRE GHOJA .
IF I WOULD TAKE PART IN YOUR RECREATIONAL ACTIVITY , I WOULD BE VERY GLAD .
如果我参加你(们)的文娱活动,我会是很高兴的.

(056)  CHU VI MEMORAS LA TAGOJN , KIAM NI KUNE STUDIS EN LA UNIVERSITATO ?
DO YOU REMEMBER THE DAYS , WHEN WE TOGETHER STUDIED IN THE UNIVERSITY ?
你记得我们在一起在大学里面学习的日子吗?

(058)  UNUIGHU PROLETOJ DE CHIUJ LANDOJ !
LET PROLETARIANS OF ALL COUNTRIES UNITE !
让所有国家的无产者联合吧!

(061)  KIEL SAGHA VI ESTAS !
HOW WISE YOU ARE !
你是多么聪明啊!

(062)  ESPERANTO ESTAS INTERNACIA HELPA LINGVO .
ESPERANTO IS INTERNATIONAL HELP LANGUAGE .
世界语是国际辅助语言.

(067)  LIA PROPONO ESTAS , KE NI CHIUJ LIBERE ELMETU NIAJN OPINIOJN .
HIS PROPOSAL IS , THAT WE ALL FREELY OUTPUT OUR OPINIONS .
他的建议是,让我们所有人自由地提出我们的意见.

(068)  MI NE SCIAS , KIAM KOMENCIGHOS NIAJ FERIOJ .
I DON'T KNOW , WHEN WILL BEGIN OUR HOLIDAYS .
我不知道,我们的假日什么时候将开始.

(069)  LA LIBRO , KIU KUSHAS SUR LA TABLO , ESTAS VERDA .
THE BOOK , WHICH LIES ON THE TABLE , IS GREEN .
在桌子上躺的书是绿的.

(071)  LA INFANO PLORAS , CHAR IU LIN BATIS .
THE CHILD CRIES , BECAUSE SOMEBODY BEAT HIM .
小孩哭,因为某人打了他.

(078)  LERNI ESPERANTON NE ESTAS MALFACILE .
TO LEARN ESPERANTO IS NOT DIFFICULT .
学习世界语不是困难的.

(084)  MI NE SCIAS , CHU VI POVAS PLENUMI TIUN CHI TASKON .
I DON'T KNOW , WHETHER YOU CAN FULFIL THIS TASK .
我不知道,是否你能完成这个任务.

(086)  MULTAJ DIVERSLANDAJ ESPERANTISTOJ CHEESTOS LA UNIVERSALAN KONGRESON DE ESPERANTO OKAZONTAN PEKINE .
A LOT OF VARIOUS COUNTRY'S ESPERANTISTS WILL ATTEND THE UNIVERSAL CONGRESS OF ESPERANTO TO BE HELD IN BEIJING .
许多不同国家的世界语者将参加在北京将召开的世界语的国际大会.

(089)  LIA PROPONO ELEKTI NOVAN PREZIDANTON NE ESTIS AKCEPTITA .
HIS PROPOSAL TO ELECT NEW PRESIDENT HAD NOT BEEN ACCEPTED .
他的选举新总统的建议没有被接受.

(090)  SHI ESTAS LA PLEJ BELA EL LA KNABINOJ .
SHE IS THE MOST BEAUTIFUL OF THE GIRLS .
她在女孩里面是最漂亮的.

(092)  FALINTE , LI NE POVIS RELEVIGHI .
HAVING FALLEN , HE COULD NOT GET UP .
摔倒了,他不能重新起来.

(093)  FORIRONTE , LI PREMIS MIAN MANON .
TO GO AWAY , HE SHOOK MY HAND .
将要离去,他握了我的手.

(098)  MI TRE AMAS ESPERANTON , MI PLI AMAS ESPERANTISTOJN , MI PLEJ AMAS LA IDEALON DE ESPERANTO .
I VERY MUCH LOVE ESPERANTO , I MORE LOVE ESPERANTISTS , I MOST LOVE THE IDEAL OF ESPERANTO .
我很爱世界语,我更爱世界语者,我最爱世界语的理想.

(116)  NI LUDU , CHU BONE ?
LET'S PLAY , ALL RIGHT ?
让我们玩吧,好吗?

(119)  KIA MIRAKLO TIO ESTAS , KE NIAJ ANTIKVULOJ KONSTRUIS LA GRANDAN MURON NUR PER SIAJ DU MANOJ !
WHAT MIRACLE IT IS , THAT OUR ANCESTORS BUILT THE GREAT WALL ONLY BY THEIR TWO HANDS !
我们的祖先仅仅用自己的两手建造了长城,这是怎样的奇迹啊!

(121)  FORPASIS UNU TAGO , FORPASIS ANKAU LA DUA .
PASSED AWAY ONE DAY , PASSED AWAY ALSO THE SECOND .
一天过去了,第二也过去了.

(122)  CHU ESTAS EBLE , KE VI NENION SCIAS ?
IS IT POSSIBLE , THAT YOU KNOW NOTHING ?
你不知道任何事,这是可能的吗?

(131)  LA HOMON , PRI KIU VI PAROLAS , MI NENIAM VIDIS .
I NEVER SAW THE MAN , ABOUT WHOM YOU SPEAK .
我从未看见过你提到的人.

(132)  NI , ESPERANTISTOJ , DEVAS LABORI PLI ENERGIE OL IAM .
WE , ESPERANTISTS , MUST WORK MORE HARD THAN EVER .
我们,世界语者,应该比任何时候更努力工作.

(133)  SOMERE ESTAS TRE VARME .
IN SUMMER IT IS VERY HOT .
夏天是很热的.

(134)  DOKTORO ZAMENHOF NASKIGHIS LA 15-AN DE DECEMBRO EN 1859 .
DOCTOR ZAMENHOF WAS BORN ON THE 15TH OF DECEMBER IN 1859 .
柴门霍夫博士1859年十二月的15号出生.

(135)  SE VI SCIUS , KIU LI ESTAS , VI LIN PLI ESTIMUS .
IF YOU WOULD KNOW , WHO HE IS , YOU MORE WOULD ESTEEM HIM .
如果你知道,他是谁,你更会尊敬他.

(136)  CENTOJ DA MALFERMAJ AUTOJ NIN PORTIS AL LA CENTRA LENIN-STADIONO, MALRAPIDE MOVIGHANTE TRA LA HOMA SVARMO .
HUNDREDS OF OPEN CARS CARRIED US TO THE CENTRAL LENIN STADIUM , SLOWLY MOVING THROUGH THE MAN'S SWARM .
成百敞篷汽车把我们带到中央列宁运动场,缓慢地通过人群运动.

(137)  MI VIDIS , KE LI FALIS KAJ LIA VESTO MALPURIGHIS .
I SAW , THAT HE FELL AND HIS CLOTHES BECAME DIRTY .
我看见了,他摔倒了,他的衣服弄脏了.

(139)  MI SCIIS , KE LI NE FAROS , KION LI PROMESIS .
I KNEW , THAT HE WOULD NOT DO WHAT HE PROMISED .
我知道,他将不做他允诺的.

(140)  ESTAS PAULO , KIU ARANGHIS LA AFERON .
IT IS PAULO THAT ARRANGED THE AFFAIR .
是PAULO安排了事情.

(142)  KUREGIS LA KNABO PER SIA TUTA FORTO , SED LI NE POVIS ATINGI LA PAPILION .
RAN THE BOY BY HIS TOTAL STRENGTH , BUT HE COULD NOT ACHIEVE THE BUTTERFLY .
男孩用自己的整个力量狂奔,但是他不能达到蝴蝶.

(144)  LI DONIS AL MI MULTAJN INSTRUAJN LIBROJN .
HE GAVE ME A LOT OF TEACHING BOOKS .
他给了我许多教科书.

(145)  CHU VI PAROLAS CHINE AU JAPANE ?
DO YOU SPEAK IN CHINESE OR IN JAPANESE ?
你用中文还是用日文说话?

(151)  NUR TIU NE ERARAS , KIU NENIAM ION FARAS .
ONLY THAT PERSON IS NOT WRONG , WHO NEVER DOES SOMETHING .
仅仅从不做某事的那个人不犯错误.

(155)  ESPERANTO ESTAS CHIES PROPRAJHO .
ESPERANTO IS EVERYBODY'S PROPERTY .
世界语是所有人的财产.

(156)  MI MEMORAS CHIUN , KIUN MI VIDIS .
I REMEMBER ALL , WHOM I SAW .
我记得我看见了的所有人.

(157)  ESTAS NENIU EN LA CHAMBRO .
THERE IS NOBODY IN THE ROOM .
在房间里面没有任何人.

(3) DU POEMOJ / TWO POEMS / 两首诗歌

(099) LA ESPERO : ESPERANTISTA HIMNO ( POEMO FAR ZAMENHOF ) .

(100) EN LA MONDON VENIS NOVA SENTO ,
TRA LA MONDO IRAS FORTA VOKO ;
(101) PER FLUGILOJ DE FACILA VENTO ,
NUN DE LOKO FLUGU GHI AL LOKO .

(102) NE AL GLAVO SANGONSOIFANTA ,
GHI LA HOMAN TIRAS FAMILION ;
(103) AL LA MOND' ETERNE MILITANTA ,
GHI PROMESAS SANKTAN HARMONION .
(099) THE HOPE : ESPERANTIST'S HYMN ( POEM BY ZAMENHOF ) .

(100) INTO THE WORLD CAME NEW FEELING ,
OVER THE WORLD GOES STRONG VOICE ;
(101) BY WINGS OF EASY WIND ,
NOW FROM PLACE LET IT FLY TO PLACE .
(102) NOT TO SWORD BLOODTHIRSTY ,
IT PULLS THE MAN FAMILY ;
(103) TO THE WORLD EVER FIGHTING ,
IT PROMISES SACRED HARMONY .

(099) 希望: 世界语者的颂歌 (柴门霍夫所作的诗歌).

(100) 新感觉来到了世界,
有力的声音走遍世界;
(101) 用顺风的翅膀,
现在让它从一个地方飞到另一个地方吧.

(102) 它不把人的家庭
引到渴血的刀剑;
(103) 向永远战争着的世界,
它允诺神圣的和谐.

(104) AL NIA KARA LINGVO ( FAR IU NOVA ESPERANTISTO ) .

(105) LA LINGVO GRACIA , KARA MIA ,
GHIS KIAM VI VENIS AL MI FINE FIN ?
(106) ATENDIS SOIFE MI , ETERNE VIA ,
MI AMAS VIN !

(107) MI AMAS VIN VERE , PRUVU DIO ,
KAJ MIA BON-KORO BATAS NUR POR VI ;
(108) NE PLU SEKRETETO ESTAS TIO :
VIN AMAS MI !

(109) CHU KREDAS VI MIAN AMON MARAN ?
(110) CHU KREDAS , KE MIA KORO FLAMAS ?
(111) CHU KREDAS LA VORTON PURE KARAN :
VIN MI AMAS !

(104) TO OUR DEAR LANGUAGE ( BY SOME NEW ESPERANTIST ) .

(105) THE LANGUAGE GRACEFUL , MY DEAR ,
TILL WHEN YOU CAME TO ME AT LAST ?
(106) WAITED LONGINGLY I , EVER YOURS ,
I LOVE YOU !

(107) I LOVE YOU TRUELY , LET GOD PROVE ,
AND MY GOOD HEART BEATS ONLY FOR YOU ;
(108) NO LONGER THAT IS LITTLE SECRET :
I LOVE YOU !

(109) DO YOU BELIEVE MY LOVE LIKE SEA ?
(110) DO BELIEVE , THAT MY HEART BURNS ?
(111) DO BELIEVE THE WORD PURELY DEAR :
I LOVE YOU !

(104) 献给我们的亲爱的语言(某新世界语者所作).

(105) 优美的语言,我的亲爱的,
到什么时候你最后来到了我这儿?
(106) 我渴望地等待,你的永远的,
我爱你!

(107) 我真实地爱你,让上帝证明吧,
我的善良的心仅仅为了你跳动;
(108) 那已经不再是小秘密:
我爱你!

(109) 你相信我的大海一样的爱吗?
(110) 相信,我的心燃烧吗?
(111) 相信纯粹地亲爱的词吗:
我爱你!

 

 

【相关】

硕士论文: 世界语到汉语和英语的自动翻译试验
立委硕士论文:1. EChA概况
立委硕士论文:2. 世界语: 语言学特点及其研究价值
立委硕士论文:3. 层次递归成分体系
立委硕士论文:4. EChA机器词典及词表
立委硕士论文:5. 世界语形态分析
立委硕士论文:6/7 世界语句法分析
立委硕士论文:8. 英语形态生成
立委硕士论文:9. 目标语调序
立委硕士论文:10. EChA 试验结果的分析
立委硕士论文【致谢】【参考书目】
立委硕士论文全文(世界语版)

《朝华午拾:shijie-师弟轶事(3)——疯狂世界语 》

灵感有如神授,巧夺岂止天工

《立委随笔:一小时学会世界语语法》

立委世界语文章 (1987): 《中国报道:通天塔必将建成》

立委世界语论文(1986): 《国际语到汉语和英语的自动翻译》

立委(1988)《世界科技:世界语到汉语和英语的自动翻译试验》

DLT项目背景介绍

立委硕士论文全文(世界语版)

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

【关于机器翻译】

【置顶:立委NLP博文一览】

《朝华午拾》总目录

Outline of an HPSG-style Chinese reversible grammar

 Outline of an HPSG-style Chinese reversible grammar*

Wei  LI
Simon Fraser University
(NLWC97)

This paper presents the outline and the design philosophy of a lexicalized Chinese unification grammar named W‑CPSG. W‑CPSG covers Chinese morphology, Chinese syntax and semantics in a novel integrated language model. The grammar works reversibly, suited for both parsing and generation. This work is developed in the general spirit of the linguistic theory Head-driven Phrase Structure Grammar (Pollard & Sag 1994). We identify the following two problems as major obstacles in formulating a precise and efficient Chinese grammar. First, we lack in serious study on Chinese lexical base and often jump too soon for linguistic generalization. Second, there is a lack of effective interaction and adequate interface between morphology, syntax and semantics. We address these problems in depth with the lexicalized and integrated design of W‑CPSG. We will also illustrate how W‑CPSG is formalized and how it works.

 

  1. Background

Unification grammars have been extensively studied in the last decade (Shieber 1986). Implementations of such grammars for English are being used in a wide variety of applications. Attempts also have been made to write Chinese unification grammars (Huang 1986, among others). W‑CPSG (for Wei's Chinese Phrase Structure Grammar, Li, W. 1997b) is a new endeavor in this direction, with its unique design and characteristics.

1.1. Design philosophy

We identify the following two problems as major obstacles in formulating a precise and efficient Chinese grammar. First, we lack in serious study on Chinese lexical base and often jump too soon for linguistic generalization. Second, there is a lack of effective interaction and adequate interface between morphology, syntax and semantics. We address these problems in depth with the lexicalized and integrated design of W‑CPSG.

1.1.1. Lexicalized design

It has been widely accepted that a well-designed lexicon is crucial for a successful grammar, especially for a natural language computational system. But Chinese linguistics in general and Chinese computational grammars in particular have generally been lacking in in-depth research on Chinese lexical base. For many years, most dictionaries published in China did not even contain information for grammatical categories in the lexical entries (except for a few dictionaries intended for foreign readers learning Chinese). Compared with the sophisticated design and rich linguistic information embodied in English dictionaries like Oxford Advanced Learners' Dictionary and Longman Dictionary of Contemporary English, Chinese linguistics is hampered by the lack of such reliable lexical resources.

In the last decade, however, Chinese linguists have achieved significant progress in this field. The publication of 800 Words in Contemporary Mandarin (Lü et al., 1980) marked a milestone for Chinese lexical research. This book is full of detailed linguistic description of the most frequently used Chinese words and their collocations. Since then, Chinese linguists have made fruitful efforts, marked by the publication of a series of valency dictionaries (e.g. Meng et al., 1987) and books  (e.g. Li, L. 1986, 1990). But almost all such work was done by linguists with little knowledge of computational linguistics. Their description lacks formalization and consistency. Therefore, Chinese computational linguists require patience in adapting and formalizing these results, making them implementable.

1.1.2. Integrated design

Most conventional grammars assume a successive model of morphology, syntax and semantics. We argue that this design is not adequate for Chinese natural language processing. Instead, an integrated grammar of morphology, syntax and semantics is adopted in W‑CPSG.

Let us first discuss the rationale of integrating morphology and syntax in Chinese grammar. As it stands, a written Chinese sentence is a string of characters (morphemes) with no blanks to mark word boundaries. In conventional systems, there is a procedure-based Chinese morphology preprocessor (so-called segmenter). The major purpose for the segmenter is to identify a string of words to feed syntax. This is not an easy task, due to the possible involvement of the segmentation ambiguity. For example, given a string of 4 Chinese characters da xue sheng huo, the segmentation ambiguity is shown in (1a) and (1b) below.

(1)                    da xue sheng huo

(a)        da-xue                          | sheng-huo
university                    | life

(b)        da-xue-sheng               | huo
university-student       | live

The resolution of the above ambiguity in the morphology preprocessor is a hopeless job because such structural ambiguity is syntactically conditioned. For sentences like da xue sheng huo you qu (university life is interesting), (1a) is the right identification. For sentences like da xue sheng huo bu xia qu le (university students cannot make a living), (1b) is right. So far there are no segmenters which can handle this properly and guarantee correct word segmentation (Feng 1996). In fact, there can never be such segmenters as long as syntax is not brought in. This is a theoretical defect of all Chinese analysis systems in the morphology-before-syntax architecture (Li, W. 1997a). I have solved this problem in our morphology-syntax integrated W‑CPSG (see 2.2. below).

Now we examine the motivation of integrating syntax and semantics in Chinese grammar. It has been observed that, compared with the analysis of Indo-European languages, proper Chinese analysis relies more heavily on semantic information (see, e.g. Chen 1996, Feng 1996). Chinese syntax is not as rigid as languages with inflections. Semantic constraint is called for in both structural and lexical disambiguation as well as in solving the problem of computational complexity.  The integration of syntax and semantics helps establish flexible ways of their interaction in analysis (see 2.3. below).

1.2. Major theoretical foundation: HPSG

The work on W‑CPSG is developed in the spirit of the linguistic theory Head-driven Phrase Structure Grammar (HPSG, proposed by Pollard & Sag, 1987). HPSG is a highly lexicalist theory, which encourages the integration of different components. This matches our design philosophy for implementing our Chinese computational grammar. HPSG serves as a desired framework to start this research with. We benefit most from the general linguistic ideas in HPSG. However, W‑CPSG is not confined to the theory-internal formulations of principles and rules and other details in HPSG versions (e.g. Pollard & Sag 1987, 1994 or later developments). We borrow freely from other theoretical sources or form our own theories in W‑CPSG to meet our goal of Natural Language Processing in general and Chinese computing in particular. For example, treating morphology as an integrated part of parsing and placing it right into grammar is our deliberate choice. In syntax, we formulate our own theory for configuration and word order. Our semantics differs most from any standard version of situation-semantics-based theory in HPSG. It is based on insights from Tesnière's Dependency Grammar (Tesnière 1959), Fillmore's Case Grammar (Fillmore 1968) and  Wilks' Preference Semantics (Wilks 1975, 1978) as well as our own semantic view for knowledge representation and better coordination of syntax-semantics interaction (Li, W. 1996). For these differences and other modifications, it is more accurate to regard W‑CPSG as an HPSG-style Chinese grammar, rather than an (adapted) version of Chinese HPSG.

  1. Integrated language model

2.1. W‑CPSG versus conventional Chinese grammar

The lexicalized design sets the common basis for the organization of the grammar in W‑CPSG. This involves the interfaces of morphology, syntax and semantics.[1]   W‑CPSG assumes an integrated language model of its components (see Figure 1).  The W‑CPSG model is in sharp contrast to the conventional clear-cut successive design of grammar components (see Figure 2).

 

 lw1

Figure 2.  conventional language model (non-reversible)

2.2. Interfacing morphology and syntax

As shown in Figure 2 above, conventional  systems take a two-step approach: a procedure-based preprocessor for word identification (without discovering the internal structure) and a grammar for word-based parsing. W‑CPSG takes an alternative one-step approach and the parsing is character- (i.e. morpheme-) based. A morphological PS (phrase structure) rule is designed not only to identify candidate words but to build word‑internal structures as well. In other words, W‑CPSG is a self-contained model, directly accepting the input of a character string for parsing. The parse tree embodies both the morphological analysis and the syntactic analysis, as illustrated by the following sample parsing chart.

lw6

Note:    DET for determiner; CLA for classifier; N for noun; DE for particle de;
AF for affix; V for verb; A for adjective; CLAP for classifier phrase;
NP for noun phrase; DEP for DE-phrase

This is so-called bottom-up parsing. It starts with lexicon look-up. Simple edges 1 through 7 are lexical edges. Combined edges are phrasal edges. Each edge represents a sign, i.e. a character (morpheme), a word, a phrase or a sentence. Lexical edges result from a successful match between the signs in the input string and the entries in the lexicon during lexicon look-up. After looking up the lexicon, the lexical information for the signs are made available to the parser. For the sake of concise illustration, we only show two crucial pieces of information for each edge in the chart, namely category and interpretation with a delimiting colon (some function words are only labeled for category). The parser attempts to combine the edges according to PS rules in the grammar until a parse is found. A parse is an edge which ranges over the whole string. The parse ((((1+2)+3)+4)+((5+6)+7)) represents the following binary structural tree embodying both the morphological and syntactic analysis of this NP phrase.

lw5

As seen, word identification is no longer a pre-condition for parsing. It becomes a natural by-product of parsing in this integrated grammar of morphology and syntax: a successful parse always embodies the right word identification. For example, the parse ((((1+2)+3)+4)+((5+6)+7)) includes the identification of a word-string zhe (DET) ben (CLA) shu (N) de (DE) ke-du-xing (N). An argument against the conventional separation model is that there exists in the two-step approach a theoretical threshold beyond which the precision for the correct word identification is not possible. This is because proper word identification in Chinese is to a considerable extent syntactically conditioned due to  possible structural ambiguity involved. Our strategy has advantages over the conventional approach  in  resolving word identification ambiguities and in handling the productive word formation. It has solved the problems inherent in the morphology-before-syntax architecture (for detailed argumentation, see Li, W. 1997a).

2.3. Interaction of syntax and semantics

The interface and interaction of syntax and semantics are of vital importance in a Chinese grammar. We are of the same opinion as Chen (1996) and many others that it is more effective to analyze Chinese in an environment where semantic constraints are enforced during the parsing, not after. The argument is based on the linguistic characteristics of Chinese. Chinese has no inflection (like English ‑'s, ‑s, ‑ing, ‑ed, etc.), no such formatives as article (like English a, the), infinitivizer (like English to) and complementizer (like English that). Instead, function words and word order are used as major syntactic devices. But Chinese function words (prepositions, aspect particles, passive particle, plural suffix, conjunctions, etc.) can often be omitted (Lü et al. 1980, p.2). Moreover, fixed word order in order to mark syntactic functions which is usually assumed for isolating languages, is to a considerable extent untrue for Chinese. In fact, there is remarkable freedom or flexibility in Chinese word order. One typical example is demonstrated in the numerous word order variations (although the default order is S‑V‑O subject-verb-object) for the Chinese transitive patterns  (Li, W. 1996).  All these added up project a picture of Chinese as a language of loose syntactic constraint. A weak syntax requires some support beyond syntax to enhance grammaticality. Semantic constraints are therefore called for. I believe that an effective way to model this interaction between syntax and semantics is to integrate the two in one grammar.

One strong piece of evidence for this syntax-semantics integration argument is that Chinese has what I call syntactically crippled structures. These are structures which can hardly be understood on purely formal grounds and are usually judged as ungrammatical unless accompanied with the support from the semantic constraints (i.e. the match of semantic selection restrictions). Some Chinese NP predicate (Li, W. & McFetridge 1995) and transitive patterns like S‑O‑V (Li, W. 1996), among others, are such structures. The NP Predicate is a typical instance of semantic dependence. It is highly undesirable if we assume a general rule like S --> NP1 NP2 in a Chinese grammar to capture such phenomena. This is because there is a semantic condition for NP2 to function as predicate, which makes the Chinese NP predicate a very restricted pattern. For example, in the sentence This table is three-legged: zhe (this) zhang (classifier) zhuo-zi (desk) san (three) tiao (classifier) tui (leg), the subject must be of the semantic type animate or furniture (which can have legs). The general rule with no recourse to semantic constraints is simply too productive and may cause severe computational complexity. In the case of Chinese transitive patterns, formal means are decisive for some variations in their interpretation (i.e. role assignment) process. But others are heavily dependent on semantic constraint. Take chi (eat) as an example. There is no difference in syntactic form in sentences like wo (I) chi (eat) dianxin (Dim-Sum) le (perfect-aspect) and dianxin (Dim-Sum) wo (I) chi (eat) le (perfect-aspect). Who eats what? To properly assign roles to NP1 NP2 V as S-O-V versus O-S-V, the semantic constraint animate eats food needs to be enforced.

The conventional syntax-before-semantics model has now received less popularity in Chinese computing community. Researchers have been exploring various ways of integrating syntax and semantics in Chinese grammar (Chen 1996). In W‑CPSG, the Chinese syntax was enhanced by the incorporation of a semantic constraint mechanism. This mechanism embodies a lexicalized knowledge representation, which parallels to the syntactic representation in the lexicon. I have developed a way to dynamically coordinate the syntactic constraint and semantic constraint in one model. This technique proves to be effective in handling rhetorical expressions and in making the grammar both precise and robust (Li, W 1996).

 

  1. Lexicalized formal grammar

3.1. Formalized grammar

The application nature of this research requires that we pay equal attention to practical issues of computational systems as well as to a sound theoretical design. All theories and rule formulations in W‑CPSG are implementable. In fact. most of them have been implemented in our prototype W‑CPSG. W‑CPSG is a strictly formalized grammar that does not rely on undefined notions. The whole grammar is represented by typed feature structures (TFS), as defined below based on Carpenter & Penn (1994).

(3)        Definition: typed feature structure 

A typed feature structure is a data structure adopted to model a certain object of a grammar. The necessary part for a typed feature structure is type. Type represents the classification of the feature structure. A simple feature structure contains only the type information, but a complex feature structure can introduce a set of feature-value pairs in addition to the type. A feature-value pair consists of a feature and a value. A feature reflects one aspect of an object. The value describes that aspect. A value is itself a feature structure (simple or complex). A feature determines which type of feature structures it takes as its value. Typed feature structures are finite in a grammar. Their definition constitutes the typology of the grammar.

With this formal device of typed feature structures, we formulate W‑CPSG by defining from the very basic notions (e.g. sign, morpheme, word, phrase, S, NP, VP, etc.) to rules (PS rules and lexical rules), lexical items, lexical hierarchy and typology (hierarchy embodied in feature structures) (Li, W. 1997b). The following sample definitions of some basic notions illustrate the formal nature of W‑CPSG. Please note that they are system-internal definitions and are used in W‑CPSG to serve the purpose of configurational constraints (see Chapter VI of Li, W. 1997b).

(4)        Definition: sign [2]

a_sign
KANJI kanji
MORPH expected
CATEGORY category
COMP0 expected
COMP1 expected
COMP2 expected
MOD expected
KNOWLEDGE knowledge
CONTENT content
DTR dtr

A sign is the most fundamental concept of grammar. A sign is a dynamic unit of grammatical analysis. It can be a morpheme, a word, a phrase or a sentence. Formally, a sign is defined by the TFS a_sign, which introduces a set of linguistic features for its description, as shown above. These features include the orthographic feature KANJI; morphological feature MORPH; syntactic features CATEGORY, COMP0, COMP1, COMP2, and MOD; structural feature (for both morphology and syntax) DTR; semantic features KNOWLEDGE and CONTENT.

(5)        Definition: morpheme

a_sign
MORPH ~saturated

A morpheme is a sign whose morphological expectation has not been saturated. In W‑CPSG, ~saturated is equivalent to obligatory/optional/null. For example, the suffix ‑xing (‑ness) is such a morpheme whose morphological expectation for a preceding adjective is obligatory.  In W‑CPSG, a morpheme like ‑xing (‑ness) ceases to be a morpheme when its obligatory expectation, say the adjective ke-du (readable), is saturated. Therefore, the sign ke-du-xing (readability) is not a morpheme, but becomes a word per se.

(6)        Definition: word

a_sign
MORPH ~obligatory
DTR no_syn_dtr

In W‑CPSG, ~obligatory is equivalent to saturated/optional/null. The specification [MORPH ~obligatory] defines a syntactic sign, i.e. a sign whose obligatory morphological expectation has been saturated. A word is a syntactic sign with no syntactic daughters, i.e. [DTR no_syn_dtr]. Obviously, word with [MORPH saturated/optional/null] overlaps morpheme with [MORPH obligatory/optional/null] in cases when the morphological expectation is optional or null.

Just like the overlapping of morpheme and word, there is also an intersection between word and phrase. Compare the following definition of phrase with the above definition of word.

(7)        Definition: phrase

a_sign
MORPH ~obligatory
COMP0 ~obligatory
COMP1 ~obligatory
COMP2 ~obligatory 

A phrase is a syntactic sign whose obligatory complement expectation has all been saturated, i.e. [COMP0 ~obligatory, COMP1 ~obligatory, COMP2 ~obligatory]. When a word has only optional complement expectation or no complement expectation, it is also a phrase. The overlapping relationship among morpheme, word and phrase can be shown by the following illustration of the three sets.

lw4 

S is a syntactic sign satisfying the following 3 conditions: (1) its category is pred (which includes V and A); (2) its comp0 is saturated; (3) its obligatory comp1 and comp2  are saturated.

3.2. Lexicalized grammar

W‑CPSG takes a radical lexicalist approach. We started with individual words in the lexicon and have gradually built up a lexical hierarchy and the grammar prototype.

W‑CPSG consists of two parts: a minimized general grammar and a information-enriched lexicon. The general grammar contains only 11 PS rules, covering complement structure, modifier structure, conjunctive structure and morphological structure. We formulate a PS rule for illustration.

lw3

This comp0 PS rule is similar to the rule S ==> NP VP in the conventional phrase structure grammar. The feature COMP0 represents the expectation of the head daughter for its external complement (subject or specifier) on its left side, i.e. [DIRECTION left]. The nature of its expected comp0, NP or other types of sign, is lexically decided by the individual head (hence head-driven or lexicon-driven). It will always be warranted by the general grammar, here via the index [3]. This is the nature of lexicalized grammars. PS rules in such grammars are very abstract. Essentially, they say one thing, namely, 2 signs can combine so long as the lexicon so indicates. The indices [1] and [2] represent configurational constraint. They ensure that internal obligatory complements COMP1 and COMP2 must be saturated before this rule can be applied. Finally, Head Feature Principle (defined elsewhere in the grammar based on the adaptation of the Head Feature Principle in HPSG, Pollard & Sag, 1994) ensures that head features are percolated up from the head daughter to the mother sign.

The lexicon houses lexical entries with their linguistic description and knowledge representation. Potential morphological structures, as well as potential syntactic structures, are lexically encoded (in the feature MORPH for the former and in the features COMP0, COMP1, COMP2, MOD for the latter). Our knowledge representation is also embodied in the lexicon (in the feature KNOWLEDGE). I believe that this is an effective and realistic way of handling natural language phenomena and their disambiguation without having to resort to an encyclopedia-like knowledge base. The following sample formulation of the lexical entry chi (eat) projects a rough picture of what the W‑CPSG lexicon looks like.

lw2

The lexicon also contains lexical generalizations. The  generalizations are captured by the inheritance of the lexical hierarchy and by a set of lexical rules. Due to space limitations, I will not show them in this paper.

  1. Implementation and application of W‑CPSG

A substantial Chinese computational grammar has been implemented in the W‑CPSG prototype.  It covers all basic Chinese constructions. Particular attention is paid to the handling of function words and verb patterns.  On the basis of the information- enriched lexicon and the general grammar, the system adequately handles the relationship between linguistic individuality and generality. The grammar formalism which I use to code W‑CPSG is ALE, a grammar compiler on top of Prolog, developed by Carpenter & Penn (1994). ALE  is equipped with an inheritance mechanism on typed feature structures, a powerful tool in grammar modeling. I have made extensive use of the mechanism in the description of lexical categories as well as in knowledge representation. This seems to be an adequate way of capturing the inherent relationship between features in a grammar. Prolog is a programming environment particularly suitable for the development of unification and reversible grammars (Huang 1986, 1987). ALE compiles W‑CPSG into a Chinese parser, a Prolog program ready to accept a string of characters for analysis. In the first experiment, W‑CPSG has parsed a corpus of 200 Chinese sentences of various types.

An important benefit of a unification-based grammar is that the same grammar can be used both for parsing and generation. Grammar reversibility is a highly desired feature for multi-lingual machine translation application. Following this line, I have successfully applied W‑CPSG to the experiment of bi-directional machine translation between English and Chinese. The machine translation system developed in our Natural Language Lab is based on the shake-and-bake design (Whitelock 1992, 1994). I used the same three grammar modules (W‑CPSG, an English grammar and a bilingual transfer lexicon) and the same corpus for the experiment. As part of machine translation output, W‑CPSG has successfully generated the 200 Chinese sentences. The experimental results meet our design objective and verify the feasibility of our approach.

 

References

 

Carpenter, B. & Penn, G. (1994): ALE, The Attribute Logic Engine, User's Guide

Chen, K-J.  (1996): "Chinese sentence parsing" Tutorial Notes for International Conference on Chinese Computing ICCC'96, Singapore

Feng, Z-W.  (1996): "COLIPS lecture series - Chinese natural language processing",  Communications of COLIPS, Vol. 6, No. 1 1996, Singapore

Fillmore, C. J. (1968): "The case for case". Bach and Harms (eds.), Universals in Linguistic Theory. Holt, Reinhart and Winston, pp. 1-88.

Huang, X-M. (1986): "A bidirectional grammar for parsing and generating Chinese".  Proceedings of the International Conference on Chinese Computing, Singapore, pp. 46-54

Huang, X-M. (1987): XTRA: The Design and Implementation of A Fully Automatic Machine Translation System, Doctoral dissertation, University of Essex.

Li, L-D. (1986): Xiandai Hanyu Juxing (Sentence Patterns in Contemporary Mandarin), Shangwu Yinshuguan, Beijing

Li, L-D. (1990): Xiandai Hanyu Dongci (Verbs in Contemporary Mandarin), Zhongguo Shehui Kexue Chubanshe, Beijing

Li, W. & P. McFetridge (1995): "Handling Chinese NP predicate in HPSG", Proceedings of PACLING-II, Brisbane, Australia

Li, W. (1996): "Interaction of syntax and semantics in parsing Chinese transitive patterns", Proceedings of International Conference on Chinese Computing (ICCC'96), Singapore

Li, W. (1997a): "Chart parsing Chinese character strings", Proceedings of The Ninth North American Conference on Chinese Linguistics (NACCL-9, to be available), Victoria, Canada

Li, W. (1997b): W‑CPSG: A Lexicalized Chinese Unification Grammar, Doctoral dissertation, Simon Fraser University (on-going)

Lü, S-X. et al. (ed.) (1980): Xiandai Hanyu Babai Ci (800 Words in Contemporary Mandarin), Shangwu Yinshuguan, Beijing

Meng, Z., H-D. Zheng, Q-H. Meng, & W-L. Cai (1987): Dongci Yongfa Cidian (Dictionary of Verb Usages), Shanghai Cishu Chubanshe, Shanghai

Pollard, C.  & I. Sag (1987): Information based Syntax and Semantics Vol. 1: Fundamentals. Centre for the Study of Language  and Information, Stanford University, CA

Pollard, C.  & I. Sag (1994): Head-Driven Phrase Structure Grammar,  Centre for the Study of Language and Information, Stanford University, CA

Shieber, S. (1986): An Introduction to Unification-Based Approaches to Grammar. Centre for the Study of Language  and Information, Stanford University, CA

Tesnière, L. (1959): Éléments de Syntaxe Structurale, Paris: Klincksieck

Whitelock, Pete (1992): "Shake and bake translation", Proceedings of the 14th International Conference on Computational Linguistics, pp. 784-790, Nantes, France.

Whitelock, Pete (1994). "Shake and bake translation", C.J. Rupp, M.A. Rosner, and R.L. Johnson (eds.), Constraints, Language and Computation, pp. 339-359, London, Academic Press.

Wilks, Y.A. (1975). "A preferential pattern-seeking semantics for natural language interference".  Artificial Intelligence, Vol. 6, pp. 53-74.

Wilks, Y.A. (1978). "Making preferences more active".  Artificial Intelligence, Vol. 11,  pp. 197-223

 

-------------------------------------

* This project was supported by the Science Council of British Columbia, Canada under G.R.E.A.T. Award (code: 61) and by my industry partner TCC Communications Corporation, British Columbia, Canada. I thank my academic advisors Paul McFetridge and Fred Popowich and my industry advisor John Grayson for their supervision and encouragement. Thanks also go to my colleagues Davide Turcato, James Devlan Nicholson and Olivier Laurens for their help during the implementation of this grammar in our Natural Language Lab. I am also grateful to the editors of the NWLC'97 Proceedings for their comments and corrections.

[1] We leave aside the other components such as discourse, pragmatics, etc. They are an important part of a grammar for a full analysis of language phenomena, but they are beyond what can be addressed in this research.

[2] In formulating W‑CPSG, we use uppercase for feature and lowercase for type; ~ for logical not and / for logical or; number in square brackets for unification.

 

[Related]

Outline of An HPSG-style Chinese Reversible Grammar ABSTRACT

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

立委硕士论文:目标语调序 (9)

世界语到汉语和英语的自动翻译试验
-- EChA机器翻译系统概述

目标语调序

在前面的虚词一线和形态生成一线, 已经做了一些局部调序并给了同号. 如:

CHIO (一切) CHI (这) ----> 这一切 (012);
DOKTORO (博士) ZAMENHOF (柴门霍夫) ----> 柴门霍夫博士 (134)

英语疑问句和否定句所需要的调序, 就放在形态生成的同时进行. 如:

NE (NOT) ESTIS (WERE) ----> WERE NOT (008)

CHU VIA (YOUR) AMIKO (FRIEND) ESTAS (IS) KURACISTO (DOCTOR) ?
----> IS YOUR FRIEND DOCTOR ? (039)

从综合第二线开始, 系统从句子整体着眼, 自底而上分别做各目标语的归约调序. 有了CDC和调序子程序, 建立目标语的归约生成算法就很简单了. 其基本思路是:

(1) 由句首至句末依次取词, 放过已加工和非终结节点.
(2) 若该词层号为一, 右链为零, 说明已经归约到顶层主轴心, 该句加工完毕.
(3) 若该词需要调序, 入调序子程序.
(4) 该词做已加工特征, 并视情况决定是否给该词以轴心词同号.
(5) 入子程序检查该词的姐妹词是否也都已加工.
(6) 若是, 则该词及其所有姐妹词给以轴心词同号, 轴心词做终结节点特征.
(7) 返回第(1)步.

对于英语, 问题特别简单, 只有一种情况需要调序, 即及物谓语所带的前置宾语和后置主语. (不及物谓语句中的后置主语无需调序.) 汉语的问题就复杂得多, 主要规则有:

(1) 存在 "有" (ESTI) 的主语应后置. 除此以外, 后置主语(包括多数主语从句)一律前移.

(2) 要求带 "把", "使" 等的汉语及物动词做谓语的句子, 其宾语在加上 "把", "使"等以后, 应置于谓语前. 除此以外, 前置宾语一律后移.

(3) 后置定语从句在两种情况下不需前移: 1. ESTAS + X, KIU 型强调句式; 2. 长15词以上的定语从句. 其余的所有后置定语一律前移. 各姐妹定语的相对位置主要由它们的语义特征决定, 具体是通过调序时给或不给同号来实现.

(4) 状语从句一般原位不动(但后置时间状语从句最好前移). 其余后置状语一律前移. 各姐妹状语相对位置的处理原则同上.

 

 

【相关】

硕士论文: 世界语到汉语和英语的自动翻译试验
立委硕士论文:1. EChA概况
立委硕士论文:2. 世界语: 语言学特点及其研究价值
立委硕士论文:3. 层次递归成分体系
立委硕士论文:4. EChA机器词典及词表
立委硕士论文:5. 世界语形态分析
立委硕士论文:6/7 世界语句法分析
立委硕士论文:8. 英语形态生成
立委硕士论文:9. 目标语调序
立委硕士论文:10. EChA 试验结果的分析
立委硕士论文【致谢】【参考书目】
立委硕士论文全文(世界语版)

《朝华午拾:shijie-师弟轶事(3)——疯狂世界语 》

灵感有如神授,巧夺岂止天工

《立委随笔:一小时学会世界语语法》

立委世界语文章 (1987): 《中国报道:通天塔必将建成》

立委世界语论文(1986): 《国际语到汉语和英语的自动翻译》

立委(1988)《世界科技:世界语到汉语和英语的自动翻译试验》

DLT项目背景介绍

立委硕士论文全文(世界语版)

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

【关于机器翻译】

【置顶:立委NLP博文一览】

《朝华午拾》总目录

立委硕士论文:EChA试验结果分析 (10)

世界语到汉语和英语的自动翻译试验
-- EChA机器翻译系统概述

EChA试验结果分析

总的来说, 这次试验结果相当令人满意. 译文不但可读, 多数都很通顺. 由于比较重视修辞, 机器味儿也不浓. 当然, 这毕竟是小范围的实验, 虽然我们尽量照顾到各种可能出现的语言现象, 但也难说在今后的扩大试验中会出现什么问题, 好在该系统比较容易维护和改进.

第二首诗中有两处(110)(111)把疑问句错译成英语强调句:

CHU kredas la vorton pure karan: vin mi amas! (111)
DO BELIEVE the word purely dear: I love you!
Cf: 相信纯粹地亲爱的词吗:我爱你!

这是因为原诗句为了节奏的需要, 承前省略了主语 VI (YOU). 有意思的是, 译成强调句于诗意没有什么损害.

在EChA上机伊始, 我们由于专心于检验方案主体的可行性和合理性, 而忽略了修辞. 初期译文(1985.12)显得较粗糙, 比较后期结果(1986.2), 译文的改进是明显的. 例如:

  1. 形式主语IT的增加 (007)(012)(077)(122)(125)(133):

Sed chio chi ankorau okazis sub homa gvidado kaj PLEJ GRAVE ESTIS, KE chio chi bazighis sur la homa scio. (012)

1) But all this still happened under man's guiding and MOST IMPORTANT WAS, THAT all this was based on the man's knowledge.

2) But all this still happened under man's guiding and IT WAS MOST IMPORTANT, THAT all this was based on the man's knowledge.

  1. 不定式带TO跟不带TO的区分 (004)(019)(072)(078)(083)(084)(088)(089)(092)(095)(132)(142)(146):

LABORI estas necese.(072)
1) (TO) WORK is necessary.
2) TO WORK is necessary.
工作是必要的.

  1. 双宾语 (128)(143)(144):

Donu AL mi iom da kafo! (128)
1) Give TO me a little coffee!
2) Give me a little coffee!
给我一点咖啡!

  1. 表示存在的 ESTI 译 "有" 和 THERE TO BE (049)(157):

En unu jaro ESTAS kvar sezonoj: printempo, somero, autuno kaj vintro. (049)

1) In one year ARE four seasons: spring, summer, autumn and winter.
在一年里面 "是" 四季节:春季,夏季,秋季和冬季.

2) In one year THERE ARE four seasons: spring, summer, autumn and winter.
在一年里面 "有" 四季节:春季,夏季,秋季和冬季.

  1. 目标语词义的选择 (059)(067)(081)(046)(098)(013)(014)(027)(118)(130):

ELMETU viajn opiniojn pri nia laboro! (059)

1) "输出" 你们的关于我们的工作的意见!
2) "提出" 你们的关于我们的工作的意见!
OUTPUT your opinions about our work!

Chu mi FARIS multajn erarojn en mia hejmtasko? (081)

1) Did I DO a lot of mistakes in my homework?
我在我的家庭作业里面 "做" 了许多错误吗?

2) Did I MAKE a lot of mistakes in my homework?
我在我的家庭作业里面 "犯" 了许多错误吗?

La partio TRE zorgas la vivon de la popolamaso. (046)

1) The party VERY cares for the life of the masses.
2) The party VERY MUCH cares for the life of the masses.
党很关心人民群众的生活.

La suno levighas CHE oriento. (013)

1) The sun rises AT east.
2) The sun rises IN THE east.
太阳在东方升起.

POST unu monato komencighos la someraj ferioj. (014)

1) AFTER one month will begin the summer's holidays.
2) IN one month will begin the summer's holidays.
暑假在一月以后将开始.

La eksperimento pri mashina tradukado ANKORAU NE estas finita. (027)

1) The experiment about machine's translating STILL has been NOT finished.
关于机器的翻译的试验 "仍然没有" 被完成.

2) The experiment about machine's translating has been NOT finshed YET.
关于机器的翻译的试验 "还没有" 被完成.

Ni esperas, ke li GAJNU championecon en la konkurso. (118)

1) We hope, that he WIN championship in the competition.
2) We hope, that he WILL WIN championship in the competition.
我们希望,让他在比赛里面赢得冠军.

Prenu la lingvon neutralan KIEL la bazon. (130)

1) Take the language neutral AS the base.
2) Take the language neutral FOR the base.
拿中立的语言作为基础.

通过EChA试验, 我们深深体会到, 同一语系中的语言转换较之不同语系容易许多. 亲属关系越近, 机器翻译对自动分析的精度要求也就越低, 因而越容易推向实用. 英语和汉语都是分析型语言, 有很多类似的语言特点, 即便如此, 世英转换比

世汉还是简单得多. 只要建立一部世英自动词典, 再加上一套形态转换算法, 甚至无需进行层次和句法的分析, 就可以实现词对词世英机器翻译. 这样的译文尽管粗糙, 但在相当程度上是可用的. 我们对ECHA综合第一线(形态转换)输出的未经调序的中间译文作了统计, 以不引起误解为标准, 英语正确率为 95% (150/158) 左右, 费解的有八句 (003)(010)(075)(095)(102)(108)(111)(141), 汉语正确率为 72% (113/158) 左右. 排除形态转换中利用了句法分析结果的部分, (但不排除第一线的虚词分析和转换), 英语正确率也在80%以上. 如果在输出译文时, 对前置宾格名词加上标识符, 则可懂度还可提高. 当然, 我们试验的这158句总有一定的局限, 所以上述统计也只具有相对意义. 中国的机器翻译, 从一开始研究的就是印欧和汉臧这两个没有亲属关系的语系间语言的自动转换, 难度很大. 这恐怕是我们的实用系统迟迟不能问世的重要原因之一. 所以, 崐中国机器翻译工作者肩上的担子更重, 任务更艰巨, 更需要独创和献身精神. 这种不利的条件也有它的另一面: 机器翻译与汉语结合带来的许多特别的问题, 客观上使我们的研究比较深入. 我国的机译研究就没有象欧美那样经历词对词翻译的第一代, 而是直接从第二代句对句翻译开始, 起点较高, 并且在很短时间内(60年代初期)就赶上了当时的世界先进水平. 这显然与我们所研究的特定对象(俄-汉, 英-汉等)的要求有关.[10]

现在谈谈另一个问题: 文学作品可不可以由机器翻译? 我们说完全可以, 不过很困难. 要把人在翻译文学作品时所遵循的规则(其中很多是下意识的)形式化算法化, 显然不容易. 即便做到了, 经济上也不上算. 所以, 在相当长的时间内, 除特别的实验需要外, 人们一般不去花这个力气. EChA选译了两首诗歌, 在这个方面做了粗浅的尝试, 证明机器也可译诗. 从译?
文看, 英语比汉语美, 保留了更多的节奏和韵律的特点, 更象一首诗. 汉语译文除了几句译得较好( 如: "向永远战争着的世界, / 它允诺神圣的和谐" ), 总体上看, 更象一篇散文. 这也难怪, 因为EChA本来就不是专门为翻译诗歌而设计的. 诗歌形式上的两个最大特点是节奏和尾韵. 可以设想, 诗歌机译系统的词典跟一般机器词典应有所不同: 各词条的每一义项下集中了一批同义的目标语等价词. 这些词长短不一, 韵尾各异, 供机器在诗歌综合时选用, 正象人在写诗或译诗时常需要翻韵书一样.

一提机器翻译, 人们总爱问: 机器能够翻译文学作品吗? 为什么不能? 离散是对连续的逼近, 机器智能是对人的智能的模拟, 二者之间并没有一道不可逾越的鸿沟. 从功能上看, 机器和人没有什么不同. 机器不过是无机体的人罢了. 只要人会的事情, 机器迟早也能会. 机器的不会并不是它不能, 而是人没有使它会, 这正如文盲不会写字是因为没人教他一样. 不过, 机器胃口很刁, 不懂 "意会", 只有 "言传"(通过计算机语言)才能教会它. 可惜, 对很多事, 人至今还是知其然, 并不知其所以然, 无法传授. 可见, 机器的无能全由于人的无能. 可人今天不知其所以然的, 并不说明将来总也不知, 所以从发展的观点看, 机器和人一样是无所不能的. 事实上, 机器目前已能代替医生, 译员和作曲家做部分工作, 而且比技术较差的人做得还象样些, 因为它 "取法乎上". 即便人, 也只有很少一部分专家能够从事这些工作. 机器已经闯进了万物之灵的神圣禁地.

最后, 一般地谈谈修辞问题. 由于机器翻译至今多局限在实验室里, 所以未予修辞而产生的阅读障碍(包括心理障碍)还不突出. 但随着机器翻译的逐步实用化, 修辞的必要性将越来越明显. 前面所举的后期译文对初期译文的改进的实例, 主要涉及的就是修辞.

1) 什么是机器翻译修辞?

机器翻译修辞是保证译文通顺的一个重要手段. 它是机器语法之后译文综合的一部分, 是自动翻译过程的最后一个环节. 广义的修辞包括贯穿翻译全过程的, 一切旨在促使译文通顺和美化的手段, 譬如成语手段(通过成语词典), 虚词分析(通过虚词模块), 结构手段(通过搭配关系)等等. 有些所谓多义区分, 实际上也是一种修辞, 例如 LUDI (PLAY) 可分为 "玩", "打球)", "演奏(乐器)"等义项, 但 "演奏" 义下具体选择 "拉(提琴, 胡琴)"(016), "弹(钢琴)"(038) 还是 "吹(口琴)" 就属于修辞了. EChA对于涉及多义的修辞, 即目标语合适对等词的选择, 就把它当作多义问题解决(见EChA虚词模块, 词类词义区分表和多义区分模块). 一般来说, 跟具体的词汇或语法现象联系很紧的修辞, 以及其他个性较强的特例修辞, 应该放在相应的词典或语法部分同时处理, 而可以归出类别的修辞, 则由最后独立的修辞模块统一解决.

机器翻译修辞具有某种超语言学的特征, 属于翻译学范畴. 我们知道, 根据原语和译语的语言学角度的对比差异, 就可以对所译文句实现转换(主要是句型转换), 这是我们目前机器翻译的主体工作. 但这样直接转换的句子不能保证其通顺, 甚至也不能保证其正确(即不被误解), 因为语言间(尤其是没有亲属关系的语言间)除了词汇语法等差异外, 还有超语言学(表达习惯, 思维方式等等)的差异存在, 即翻译学角度的对比差异. 例如: nun DE LOKO flugu ghi AL LOKO (now FROM PLACE let it fly TO PLACE) (101) / 现在从 "一个" 地方让它飞到 "另一个" 地方吧("从地方到地方" 不符合汉语表达习惯). 修辞主要是为消除这种差异而设置的. 因此, 只有翻译学角度的语言对比差异, 才是修辞的根本依据.

2) 修辞的分类

可分作两大类: 必要修辞和美修辞. 必要修辞是保证译文正确可懂所必需的修辞, 它是修辞的初级阶段. 美修辞则是保证译文通顺畅达, 甚至产生某种美感或帮助形成译文风格所要求的修辞, 它是修辞的高级阶段. 机器翻译修辞首先是作为必要修辞提出来的. 必要修辞是基础, 具有更大的迫切性, 是所有实用系统的必要组成部分, 如形态修辞. 这部分修辞数量很有限, 一定量的研究就可以穷尽它. 美修辞可以说是锦上添花. 它是为机器译文不断提高质量, 使之朝成熟, 完美方向发展, 以期赶上人工翻译的手段. 可见, 美修辞是无限发展的, 它本身具有许多层次和侧面. 修修补补远不能满足美修辞发展的需要. 它要求体系和方法上的不断革新. 就机器翻译的前景来说, 美修辞的比重将逐渐变大. 从严格的意义上讲, 只有美修辞才真正体现修辞本身的特点和规律, 因为必要修辞在一定的意义上不过是语法的推广, 即可以算作广义的语法. 它的手段跟机器语法没有根本的不同. 在现行的EChA系统中, 必要修辞就常常跟语法混在一起.

关于美修辞, EChA只是做了一点尝试. 应该指出, 机器翻译的美有自己的侧重点, 它最推崇 "通顺流畅, 合乎习惯和简洁自然", 其次是译文风格的形成. 我们认为, 机器译文的风格逐步形成, 是完全可能的. 因为从形式上看, 风格的承担者主要是词汇, 尤其是小词(语气词, 结构词), 其次, 语法形式也有些不同. 不同风格的形式特点, 是可以为机器识辨和接受的. ?
具体做法可以吸收计算风格学(Computational stylistics)的研究成果, 去设计不同风格的译语修辞模型. 风格可以有正规体, 典雅体和口语体等等. 正规体格式规范, 清楚简单, 给人的印象是客观公正, 不假藻饰. 典雅体的特点是虚词多用古字 (如 "则", 即", "乃", "便", "故", "且", "其", "及" 等), 成语用的也较多, 显得简洁古雅. 口语体则比较松散自由, 带?
有更多的语气词(如 "吗", "呢", "可不", "是吗", "啊" 等).

____________________________________________________________________

附注: [10] 参见 刘涌泉 <<中国的机器翻译>> ( <<情报科学>> 1980, 3 )

 

[致谢]

研制世界语类型的机器翻译系统, 从一开始就得到刘涌泉老师的热情支持, 从方案主体到具体问题的处理, 他都给以认真指导. 在程序设计和上机调试的的过程中, 刘倬老师也多次给予指导, 有些基本操作的算法也是刘倬老师提供的. 在EChA系统取得初步成果的时候, 笔者向他们表示深切的感谢. 另外, 还要特别感谢机房韩老师的多方协助. 没有她提供的方便, EChA系统根本不可能在这么短时间试验成功.

 

\

【相关】

硕士论文: 世界语到汉语和英语的自动翻译试验
立委硕士论文:1. EChA概况
立委硕士论文:2. 世界语: 语言学特点及其研究价值
立委硕士论文:3. 层次递归成分体系
立委硕士论文:4. EChA机器词典及词表
立委硕士论文:5. 世界语形态分析
立委硕士论文:6/7 世界语句法分析
立委硕士论文:8. 英语形态生成
立委硕士论文:9. 目标语调序
立委硕士论文:10. EChA 试验结果的分析
立委硕士论文【致谢】【参考书目】
立委硕士论文全文(世界语版)

《朝华午拾:shijie-师弟轶事(3)——疯狂世界语 》

灵感有如神授,巧夺岂止天工

《立委随笔:一小时学会世界语语法》

立委世界语文章 (1987): 《中国报道:通天塔必将建成》

立委世界语论文(1986): 《国际语到汉语和英语的自动翻译》

立委(1988)《世界科技:世界语到汉语和英语的自动翻译试验》

DLT项目背景介绍

立委硕士论文全文(世界语版)

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

【关于机器翻译】

【置顶:立委NLP博文一览】

《朝华午拾》总目录

立委硕士论文:英语形态生成 (8)

世界语到汉语和英语的自动翻译试验
-- EChA机器翻译系统概述

英语形态生成

加尾算法跟削尾算法正好是逆过程. 建立一个完全的, 符合实用系统要求的英语加尾算法并不困难, 因为英语的形态比较简单. EChA把汉语形态修辞与英语形态生成放在一处进行.

原语和译语的对比差异是建立语言转换规则的依据. 这种对比差异可以归纳为下面五种情况: 1) 一一对应; 2) 此一彼多; 3) 此多彼一; 4) 此有彼无; 5) 此无彼有. 我们以世界语到英语的形态转换分别举例如下:

1) 一一对应

世界语派生副词(由逻辑类为形容词的词干加 "-E" 尾构成)
--------->英语相应形容词加 "-LY" 尾.

例: diligent-E ----> diligent-LY ; serioz-E ----> serious-LY ;
sincer-E ----> sincere-LY. (063)

例外: bon-E ----> well (045)
( 不是 good-LY, 这种情况在词典一线入词类词义区分表处理. )

显然, 一一对应的情形最好办.

2) 此一彼多

世界语不定式 --------> 英语动词原形 或 TO + 动词原形
世界语条件句(谓语动词以 "-US" 收尾) --------> 英语三种形式(过去, 现在, 将来).
例: 1. Se mi sci-US hierau, mi certe ven-US. -
---> If I HAD KNOWN yesterday, I certainly                                                                                SHOULD HAVE COME. (与过去事实相反的假设)

  1. Se vi est-US mi, kion vi far-US? ----> If you WERE me, what WOULD you do?                                 (与现在事实相反)
  1. Se vi ven-US morgau, vi shin vid-US.
    ----> If you SHOULD come tomorrow, you WOULD see her.                                                                 (与将来事实相反)

这种情况最麻烦. 机器翻译中的多义现象盖源于此. 如果上例没有明确的时间状语, 那只能靠跨句上下文去推测, 这对机器实在太难了. EChA遇到这种情况, 就干脆一律用 "WOULD" 代替 "-US" (050), 这虽然不大符合英语语语法规范, 暂时也只能这样了. 好在这样转换并不造成误解.

此一彼多另一个常见的例子是, 世界语现在时简单式(-AS尾)对应于英语一般现在时和现在进行时两种. 虽然世界语复合时态有与英语现在进行时对应的形式( ESTAS x-ANTA ), 但是世界语的节约原则要求人们尽可能少用复杂形式. 我们一时还找不出足够可靠的形式规则, 来决定 "-AS" 究竟何时译作一般时态, 何时译作进行时态. EChA目前一律以一般现在时译之, 这使得部分译文不是很确切, 但并不造成误解或费解. 如:

Kien vi ir-RA? (158) ----> To where DO you go? ( CF: Where ARE you GOING? )
Chu kredas, ke mia koro flam-AS? (110) ----> Do believe, that my heart burn-S?
( CF: Do you believe that my heart IS BURNING? )

3) 此多彼一

世界语形动词或副动词的各种形式 --------> 英语分词的相应形式.

-ANTA 和 -ANTE ----> -ING ; -INTA 和 -INTE ----> HAVING+过去分词 ;
-OTA 和 -OTE ----> TO BE+过去分词; 等等.

[例] KURANTE sur la strato, li falis. (091) ----> RUNNING on the street, he fell.

Laboristoj estas KONSTRUANTAJ fabrikon. (015)
----> Workers are BUILDING factory.

这种情况好办. 世界语形态比较丰富, 而现代英语形态不发达, 所以世英形态转换中最经常出现的, 就是此多彼一或此有彼无的情形, 这对建立比较完全的EChA英语形态生成(加尾)算法是很有利的条件.

4) 此有彼无

世界语将来将来时 ( ESTOS x-ONTA(J) ) --------> 英语 ?

[例] Mi ESTOS LEGONTA la libron kiam shi venos. (023)
----> I WILL ( 或: WILL BE GOING TO ) read the book when she comes.

这种情况看上去似乎很不利, 实际上并不难处理. 因为现今存在的各种语言, 作为人们千百年来交流思想的工具, 一般都能够表达各种细微的语义差别. 虽然乙语言也许缺乏甲语言的某个特定的表达手段, 但如果必要, 它总可以找到代替的表达方式. 如上例 ESTOS LEGONTA 通常译作 WILL READ 已经足够, 如果一定要强调将来的将来, 也不妨译作 WILL BE GOING TO READ 这样繁冗的形式. 再如汉语缺乏形态, 但如果需要, 总可以用适当的助词或副词等来代替, 这就是所谓的形态修辞.

5) 此无彼有

世界语 ? --------> 英语完成进行时

[例] Mi atend-AS vin chi tie du horojn.
----> I HAVE BEEN WAITING here for you for two hours.
CF: I WAIT here for you for two hours.
I AM WAITING here for you for two hours.

此所无彼所有的, 如果在彼也是可有可无的, 或并不太影响语义, 那还好办, 如上例. 再如, 英语的不定冠词, 世界语就没有, EChA对此干脆不管, 也没造成严重的后果, 只是译文显得有些不顺: Is your friend (*) doctor?       (039) This is (*) green star, and that is (*) red star. (152) ( * 处本应有不定冠词 A ) 最头痛的是此所无彼必有. 从完全没有冠词的语言(如汉语和俄语)译入有冠词的语言在很多情况下就是这样.

上述归纳在机器翻译的转换生成中具有普遍意义. 最困难的是此一彼多和此所无彼必有两种情况, 一般要通过精密的句法和语义的对比和分析来解决. 比如通过分析不定式所直接联系的英语轴心词的句型特征, 就可以决定该不定式采用带 TO 还是不带 TO 的形式. 实在不得已, 只好把几种可能的选择同时打印出来, 由用户自己决定----这当然是权宜之计, 但常常比编制一套不可靠的区分规则, 客观上更有利一些. 机器模拟人的智能, 在一定的阶段总还有某些局限. 上面的做法, 实际上就是把机器暂时还不具有的智能, 交还给人发挥, 特别是那些很难形式化, 但人凭经验和直感却很容易判断的部分. 然而, 人工智能的使命决定了, 人们应该尽最大努力提高机器智能化程度. 条件允许却不去努力是设计者的懒惰和失职.

在EChA形态生成一线, 还有词典化了的多义区分程序段(它在形态生成前执行), 用BASIC写起来很容易. 现举例介绍如下:

1) LUDI 玩 / 打(各类球) / 拉(提琴, 胡琴) / 弹(钢琴) / 吹(口琴)

2120 IF VT$(GC)<>"1" THEN 2160
( 若该词不及物则保留词典基本义项 "玩", 该词多义区分毕, 转2160. )

2130 IF HY$(ZC)="胡琴" OR RIGHT$(HY$(ZC),4)="提琴" THEN HY$(GC)="拉": GOTO 2160
( 若找到词为 "胡琴", 或找到词的后两字为 "提琴" (包括大提琴,小提琴,中音提琴等), 则该词取汉义 "拉", 该词毕, 转2160. )

2140 IF HY$(ZC)="钢琴" THEN HY$(GC)="弹": GOTO 2160
2145 IF HY$(ZC)="口琴" THEN HY$(GC)="吹": GOTO 2160
2150 IF RIGHT$(HY$(ZC),2)="球" THEN HY$(GC)="打"
2160 GC=GC+1: GOTO 1830 ( 放过该词, 取后一词, 转1830. )

2) BATI 打 / (心)跳动

1990 IF VT$(GC)="1" AND (RIGHT$(HY$(ZC),2)=心" OR HY$(ZC)="心脏") THEN HY$(GC)="跳动"
2000 GOTO 2160

3) OKAZI 进行 / 发生 / 召开

2450 IF RIGHT$(HY$(ZC),2)="事" THEN HY$(GC)="发生":GOTO 2160
2460 IF RIGHT$(HY$(ZC),2)="会" THEN HY$(GC)="召开":YY$(GC)="BE HELD": YTZ$(GC)="8": XX$(GC)="1"
2470 GOTO 2160

3) RIGARDI: LOOK AT / LOOK / WATCH (TV) / SEE (FILM)

2830 IF VT$(GC)<>"1" THEN YY$(GC)="LOOK": GOTO 2160
2840 IF YY$(ZC)="TELEVISION" OR YY$(ZC)="TV" THEN YY$(GC)="WATCH": GOTO 2160
2850 IF YY$(ZC)="FILM" THEN YY$(GC)="SEE": YTZ$(GC)="1"
2860 GOTO 2160

4) NENIAM 从不 / 从未

3070 IF ST$(ZC)="2" THEN HY$(GC)="从未": HY$(ZC)=HY$(ZC)+"过": JG$(ZC)="9"
3080 GOTO 2160

 

 

 

【相关】

《朝华午拾:shijie-师弟轶事(3)——疯狂世界语 》

灵感有如神授,巧夺岂止天工

《立委随笔:一小时学会世界语语法》

立委世界语文章 (1987): 《中国报道:通天塔必将建成》

立委世界语论文(1986): 《国际语到汉语和英语的自动翻译》

立委(1988)《世界科技:世界语到汉语和英语的自动翻译试验》

DLT项目背景介绍

立委硕士论文全文(世界语版)

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

【关于机器翻译】

【置顶:立委NLP博文一览】

《朝华午拾》总目录立委硕士论文:9. 目标语调序

立委硕士论文:世界语句法分析(6&7)

世界语到汉语和英语的自动翻译试验
-- EChA机器翻译系统概述

世界语句法分析(1): 虚词处理

虚词分析是世界语句法分析中最困难的部分. EChA的策略是分而治之, 各个击破. 每一个虚词的分析规则自成一体, 互相独立, 这样在充实或改进某一具体虚词的规则时, 便不致于影响其他虚词的规则, 这也就是规则和规则分开吧.[9] 语言规则和算法程序应该分开, 大家已经说了许多, 而规则和规则分开, 似乎还没有引起足够的重视. (不是指所有规则都分开: 具有普遍意义的抽象语法规则集合, 作为系统对于该语言充分形式化的逻辑描述, 是自动分析的枢纽, 本身就是一个可以做的很美的统一整体, 谈不上分开. (参考EChA句法分析第二线, 见第7节.) 一个优良的系统应该既能分得开, 又能合得拢.) 我们认为, 规则和规则分开, 对于研制实用性机译系统具有决定性意义. 没有什么系统从一开始研制就可以足够完善, 所以是否容易扩充和改进, 在很大程度上决定了一个系统的前途. 规则和算法分开, 固然大大增强了系统的扩充能力, 并且便于语言工作者和软件工作者充分合作. 但这还不够. 如果能实现规则和规则分开, 不但有利于遵循具体问题具体分析原则, 去解决语言这种特别复杂的现象中的许多个性问题, 从而大大提高翻译质量, 而且也为语言工作者和语言工作者的协作, 创造了必要的条件----这种协作, 对于研制大型实用系统是必不可少的.

规则和规则分开的主要方式是: 1) 词典语法化: 以词为基本单位, 把关于该词的各种用法及其分析规则, 以数据的形式写入词典(它建在外存贮器上). 这样的机器词典, 形式上很类似于我们案头的词典工具书, 如牛津, 韦式, LONGMAN等, 而且也较容易借鉴已有的这些词典的研究成果. 我们建议首先把虚词和动词的条目语法化. 2) 语法词典化: 在编写句法分析或综合程序(它在内存贮器中)时, 把规则落实到具体词或小类上, 并使这些规则独立开来. 这两种方法形式有别, 实质是一样的. 我们在EChA中采用的是第二种方法. (参见EChA虚词分析部分和EChA综合部分的多义词区分规则.) 说到底, EChA分析第一线不过是一个带有分析规则的虚词大词典.

当然, 应该指出, 规则和规则分开, 必然使规则量成倍增长. 然而, 由于边界分明, 这种增长并不影响系统结构上的逻辑清晰性, 这跟以前语言和算法, 规则和规则都没分开时的情形大不相同, 那时的规则无限膨胀, 只能致使系统最终报废. 不过规则量的增长, 涉及到机器的存贮容量问题. 但这实际上也不成问题, 因为现在的机器对于存贮节省的要求, 已经不是那么苛刻了. 即便是微型机, 中高挡的内存容量就能达到, 或很容易扩充到四兆到八兆字节. 值得强调的是, 规则量的增长, 一般并不影响系统的工作效率, 因为规则是附在具体的词或小类下, 只有所译文句出现了某词, 才会入该词一线.

在EChA虚词分析一线中, 我们把虚词的多义区分, 甚至有些涉及虚词特点的目标语修辞, 都一古脑纳入具体虚词的分析规则中. 这样处理显然比较简便易行, 也大大减轻了综合的困难. 但是, 正是在这儿, EChA违背了我们所极力赞同的分析和综合独立的原则. 目前还想不出更好更合理的办法. 不过, 我们主张独立分析的本意, 不外乎为了两点: 1) 为了使分析深入以便提高机译质量; 2) 让同一个独立分析结果, 能为多语综合所利用. 考虑到虚词的分析和综合同步进行, 有助于提高译文崐质量, 而且由于虚词数量的有限及其分析规则的相互独立, 在增加新的目标语时充实这些规则不会有很大困难, 更不会影响整个系统的筋骨, 因而我们目前的做法是有理由的, 它并不违背我们的宗旨.

 

世界语句法分析(2)

分析第(2)线与目标语综合充分独立, 逻辑性强, 是一个相当完整的语言分析模型. 它由一个主程序和几个以动词分析算法为核心的环环相扣的子程序构成. 主程序主要用来确定各语段的范围(前限后限)及其加工次序, 为它们进入动词子程序做好准备. 它必须对各种类型的世界语文句作出正确, 合理的处理, 才能保证系统的充分概括性和适应性. 从各类文句的试验结果看, EChA相当好地做到了这一点.

我们把世界语文句的类型归纳如下:

1.无谓句. 如:

Kia belega pejzagho ! (041) / What beautiful scenery ! 多么绝美的景色!

2.谓语句:

1) 简单句: 全句只有一个谓语. 如: Skribu klare ! (033) / Write clearly ! 写清楚!

2) 扩展的简单句: 全句至少有两个谓语, 但只有一个主句, 从句跟主句(以主轴心为代表)没有直接联系, 即从句处于2层以外 ( 其层号 >= 3 ). 这类从句往往是定语从句或同位语从句. 如:

La homon , pri kiu vi parolas , mi neniam vidis . (131)
The man(宾), about whom you speak , I never saw .
我从未见过你提到的人.

3) 主从句: 全句至少有两个谓语, 但只有一个主句, 从句跟主句发生直接联系. 如:

Se mi partoprenus en via amuza aktivado , mi estus tre ghoja . (050)
If I should take part in your recreational activity , I would be very glad .
如果我参加你们的文娱活动, 我会是很高兴的.

4) 并列句: 全句至少有两个谓语, 同时也至少有两个有并列关系的分句, 并且其中一个是主轴心. 如:

Mi miras , timas , tremas . (074)
I wonder, fear, tremble.
我惊奇, 害怕, 颤抖.

5) 交错句: 以上四类句子交错组合而成的复杂句. 如本文第3节举的例句(004)就是.

EChA在对付这些不同类型的句子时, 能够把复杂的句子分解成简单的句子处理. 分析程序首先查找从句. 如果查到, 先入并列从句子程序分解(若是光杆从句就放过, 返主), 然后确定每一个从句的前后限, 入动词子程序加工. 加工完毕, 做绝对放过标志. 所有从句处理完毕, 再行主句加工. 这时候, 句子呈或者简单句, 或者并列句的形式.

世界语中表示关系的从句, 如有相应的 T 类相关词与之呼应, 就是同位语从句. 而当主句中 T 类相关词省略时, 便于表示疑问的名词性从句同形, 从而增加了识辨难度. 对此本系统暂时不予考虑. 这种省略虽然显得较干练 (成语警句中常用), 崐但不宜提倡, 因为甚至人(尤其是非印欧语系的人)理解起来, 也常常感到困难.

[例] Bone ridas , KIU laste ridas .
Well smiles, WHO smiles at last.
谁笑得最后, 笑得最好.

KIO pasis , ne revenos .
WHAT passed, will not return.
时不再来. (一去不复返.)

CF: Nur TIU ne eraras, KIU neniam ion faras.(151)
Only THAT PERSON is not wrong, WHO never dose something.
仅仅从不做某事的那个人不犯错误.

第二线的关键是动词子程序的建立. (这儿所谓动词包括谓语动词, 形动词, 副动词和不定式, 但不包括-ADO词, 因为世界语的-ADO词已经完全名词化了, 不再具有动词的特性.) 如果说先从句后主句的加工过程, 实际上是自下而上的方法, 那么动词算法的路径正好反过来, 是自上而下. 动词子程序首先设三个开关. 一是检验是否可以构成动词短语 VP. 若不能, 如独词句及光杆的形动词, 副动词或不定式, 则给该词节点信息 J (终结节点), 该词加工完毕, 退出. 二是检验该词是否系词, 若是, 转系词子程序作适当处理, 再回动词子程序递归加工. 这是因为系动词有其特殊性, 比如一般动词谓语简单句, 只可能有一个前面没有介词的普通格名词(它当然是主语), 而系词谓语句却可以有两个(一主一表), 因而不能直接入动词子程序.  最后一个开关检验该动词短语是否扩展的 VP, 若不是, 即行分析. 扩展的 VP 定义为该动词的间接成分层中(所谓间接成分层是指其层号 >= 动词轴心的层号 + 2 的层次), 至少又包含一个 VP. 对于扩展的动词短语, 运用栈技术作递归加工. 这样动词子程序真正的加工单位便是不扩展的各类 VP (简单句, 形动词短语, 副动词短语, 不定式短语). 动词子程序在工作期间, 常常需要调用其他子程序. 各子程序间的逻辑关系是十分清楚的.

名词子程序也要设开关. 扩展的 NP 定义为带有至少一个 VP 的 NP, 它必须回动词子程序递归加工.

对于不扩展的动词短语, 一般来说加工次序如下:

丨动词子程序丨--------丨 名词子程序 丨------丨形容词子程序丨----丨 副词子程序 丨

这形象地体现了 "自顶而下" 的分析思想.

试验表明, EChA的两线分析程序, 一具体一抽象, 一个对付个性一个对付共性, 一个面向虚词一个面向实词, 一个尽量使句法分析词典化, 一个则努力使分析过程逻辑化, 二者相互配合, 很有效地实现了各类世界语文句的自动分析. EChA输出的中间结果158条CDC链中只发现一处分析错误. 它出现在第一首诗歌 "LA ESPERO" 的第三句:

Ne al glavo sangonsoifanta , ghi LA HOMAN tiras FAMILION . (102)
Not to sword bloodthirsty , it THE MAN'S (目的格) pulls FAMILY (目的格).

为了节奏和韵律的关系, 作者把形容词修饰语与其轴心词分开了(当然仍同格同数), 中间插进一个动词谓语. 于是系统误把二者都看作是动词谓语的宾语, 因为 "冠词+形容词" (后不跟名词) 结构一般总是代替 NP 的, 所以EChA也就这样分析了. 幸运的是, 这一分析错误没有导致译文错误, 因为中英文综合都把前置宾语移至动词轴心之后, 客观上恢复了修饰语与其中心词的正常词序, 当然这只是巧合.

_____________________________________________________________________

附注: [9] 这儿关于规则和规则分开的讨论, 很大程度上得益于与刘倬老师的几次谈话.

 

【相关】

硕士论文: 世界语到汉语和英语的自动翻译试验
立委硕士论文:1. EChA概况
立委硕士论文:2. 世界语: 语言学特点及其研究价值
立委硕士论文:3. 层次递归成分体系
立委硕士论文:4. EChA机器词典及词表
立委硕士论文:5. 世界语形态分析
立委硕士论文:6/7 世界语句法分析
立委硕士论文:8. 英语形态生成
立委硕士论文:9. 目标语调序
立委硕士论文:10. EChA 试验结果的分析
立委硕士论文【致谢】【参考书目】
立委硕士论文全文(世界语版)

《朝华午拾:shijie-师弟轶事(3)——疯狂世界语 》

灵感有如神授,巧夺岂止天工

《立委随笔:一小时学会世界语语法》

立委世界语文章 (1987): 《中国报道:通天塔必将建成》

立委世界语论文(1986): 《国际语到汉语和英语的自动翻译》

立委(1988)《世界科技:世界语到汉语和英语的自动翻译试验》

DLT项目背景介绍

立委硕士论文全文(世界语版)

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

【关于机器翻译】

【置顶:立委NLP博文一览】

《朝华午拾》总目录

立委硕士论文:世界语形态分析 (5)

世界语到汉语和英语的自动翻译试验
-- EChA机器翻译系统概述

世界语形态分析

源语文句分析大体可以分形态分析和句法分析两大类. 前者研究的对象小于等于词, 而后者的对象大于等于词(句素). 分析的终极目的就是求解词的正确的CDC成分. 本节先讨论形态分析问题. 我们把构词分析的讨论也放在这一节.

世界语形态分析的主体是消尾算法的建立. 世界语没有形态同形现象, 所以只要削尾正确, 形态分析也就完成. 下面给出EChA的削尾算法. 应该说, 该算法是比较完备和合理的, 完全能够满足世界语自动分析实用系统的要求.

世界语削尾算法

(1) 若该词最末字母为 "-O" 取 "名词 / 普通格 / 单数" 的结论, 该词削尾后查实词词干词典, 转下一步(2), 否则步骤(12).

(2) 若查词典成功, 取词典信息到加工场, 该词加工完毕, 否则下一步(3).

(3) 若该词最末二字母为 "-AD" 取 "AD词" 的结论, 该词削尾后查实词词干词典, 转下一步(4), 否则步骤(5).

(4) 若查词典成功, 取词典信息到加工场, 该词加工完毕, 否则步骤(11).

(5) 若该词最末三字母为 "-ANT" 取 "分词 / 进行式 / 主动式" 的结论, 该词削尾后查实词词干词典, 转步骤(4), 否则下一步(6).

(6) 若该词最末三字母为 "-INT" 取 "分词 / 完成式 / 主动式" 的结论, 该词削尾后查实词词干词典, 转步骤(4), 否则下一步(7).

(7) 若该词最末三字母为 "-ONT" 取 "分词 / 将来式 / 主动式" 的结论, 该词削尾后查实词词干词典, 转步骤(4), 否则下一步(8).

(8) 若该词最末二字母为 "-AT" 取 "分词 / 进行式 / 被动式" 的结论, 该词削尾后查实词词干词典, 转步骤(4), 否则下一步(9).

(9) 若该词最末二字母为 "-IT" 取 "分词 / 完成式 / 被动式" 的结论, 该词削尾后查实词词干词典, 转步骤(4), 否则下一步(10).

(10) 若该词最末二字母为 "-OT" 取 "分词 / 将来式 / 被动式" 的结论, 该词削尾后查实词词干词典, 转步骤(4), 否则下一步(11).

(11) 该词取 "生词" 的结论, 保留削尾结论, 在加工场的目标语语义项里复制该词, 该词加工完毕.

(12) 若该词最末字母为 "-'" 取 "名词 / 普通格 / 单数" 的结论, 该词削尾后查实词词干词典, 转步骤(2), 否则下一步(13).

(13) 若该词最末字母为 "-A" 取 "形容词 / 普通格 / 单数" 的结论, 该词削尾后查实词词干词典, 转步骤(2),  否则下一步(14).

(14) 若该词最末字母为 "-E" 取 "副词 / 普通格" 的结论, 该词削尾后查实词词干词典, 转步骤(2), 否则下一步(15).

(15) 若该词最末字母为 "-J" 取 "普通格 / 复数" 的结论, 该词削尾后转下一步(16), 否则步骤(18).

(16) 若该词最末字母为 "-O" 取 "名词" 的结论, 该词削尾后查实词词干词典, 转步骤(2), 否则下一步(17).

(17) 若该词最末字母为 "-A" 取 "形容词" 的结论, 该词削尾后查实词词干词典, 转步骤(2), 否则步骤(11).

(18) 若该词最末字母为 "-N" 取 "目的格" 的结论, 该词削尾后转下一步(19), 否则步骤(23).

(19) 若该词最末字母为 "-J" 取 "复数" 的结论, 该词削尾后转步骤(16), 否则下一步(20).

(20) 若该词最末字母为 "-O" 取 "名词 / 单数" 的结论, 该词削尾后查实词词干词典, 转步骤(2), 否则下一步(21).

(21) 若该词最末字母为 "-A" 取 "形容词 / 单数" 的结论, 该词削尾后查实词词干词典, 转步骤(2), 否则下一步(22).

(22) 若该词最末字母为 "-E" 取 "副词" 的结论, 该词削尾后查实词词干词典, 转步骤(2), 否则步骤(11).

(23) 若该词最末字母为 "-S" 转下一步(24), 否则转步骤(30).

(24) 若该词最末二字母为 "-AS" 取 "现在时" 的结论, 该词削尾后转步骤(28), 否则下一步(25).

(25) 若该词最末二字母为 "-IS" 取 "过去时" 的结论, 该词削尾后转步骤(28), 否则下一步(26).

(26) 若该词最末二字母为 "-OS" 取 "将来时" 的结论, 该词削尾后转步骤(28), 否则下一步(27).

(27) 若该词最末二字母为 "-US" 取 "虚拟式" 的结论, 该词削尾后转步骤(29), 否则步骤(32).

(28) 取 "陈述式" 的结论, 转下一步(29).

(29) 取 "动词 / 谓语 / 主动语态" 的结论, 查实词词干词典, 转步骤(2).

(30) 若该词最末字母为 "-I" 取 "动词 / 不定式" 的结论, 该词削尾后查实词词干词典, 转步骤(2), 否则下一步(31).

(31) 若该词最末字母为 "-U" 取 "命令式" 的结论, 该词削尾后转步骤(29), 否则下一步(32).

(32) 查虚词词典(因该词无尾可削). 若成功取词典信息到加工场, 该词加工完毕, 否则取 "名词 / 专有名词" 的结论, 返回步骤(11).

[注] 世界语基本法规第16条说: "名词和冠词末尾的元音字母可以省略, 用省略号 ' 来代替". 这种现象多出现在诗歌里, 如 MOND'(103). 我们在步骤(12)对它作了处理(冠词是长度小于 3 的虚词, 直接查虚词词典, 不入削尾一线, 故不予考虑).

我们谈谈构词分析问题, 这包括两个方面: 1. 关于建立削缀算法(派生词处理)的讨论; 2. 关于拆离合成词的讨论. 在现行的EChA系统中, 这两个问题都回避了. 我们建立的词典, 是以词干(包括合成词词干)作存贮单位的, 加工词只要削去语法词尾, 就可以查到. 但是, 应该指出, 这样做, 对于世界语这种构词特别灵活的语言并不合理. 以词干存词, 在做小型实验时还可应付, 如果是实用系统, 就会出现存不胜存的情况. 我们主张实词词典既存词根也存词干, 同时建立一个完全的世界语削缀算法和合成词拆离算法, 以便对付生词. (世界语除国际性的专业词汇外, 基本词根很有限. 所谓生词, 一般都是由基本词根及几十个词缀随机组合的派生词或合成词. 因此, 只要切分正确, 生词便不 "生".)

世界语后缀可以叠加(理论上无限), 但前缀通常只能有一个. 这样词典一线的加工路径应该是:

lw9

削缀与削尾不同, 并非有缀必削. 对于削尾, 机器是先削后查, 而对于削缀, 则是先查词典, 查不着的生词再去削缀. 这样处理便于我们根据设计要求(实验型还是实用型, 对于翻译速度, 质量, 成本的要求等等)和机器条件(内存容量, 运算速度等)决定实词词典收词干的标准.

现在, 由于计算机技术的发展, 机器功能(存贮, 速度)越来越强, 而成本急遽下降. 因此机器翻译界如今有人提倡存贮单位宜大不宜小(如尽量多收成语的主张[7] ), 以海量存贮和快速查找来减轻分析的负担. 这是很有见地的认识. 单位越大, 确定性就越强, 对分析综合(机器智能)的要求就越低, 研制的难度相对减轻, 而译文的质量会大大提高. 机器翻译是实用性?
很强的学科, 这种主张就显得更有价值. 当然, 单位也不是越大越好, 因为单位每大一级(从词根到词干, 从词干到词, 从词到词组, 从词组到语句), 其组合的可能性呈指数增长.[8] 如果推向极端, 以句子为存贮单位, 则完全不需要分析和综合, 只要对号入座即可输出译文. 这时候, 人工智能的程度等于零, 翻译质量却可以达到最佳(如果以人工水平为最佳). 可惜, 硬件技术无论怎样发达, 其存贮容量和查找速度也总有限, 不可能对付无穷的句子. (但为了某种特殊的需要在有限的范围内, 这种办法是可行的, 如旅游翻译机. 这到底还算不算机器翻译? 应该算的, 只是它不是人工智能意义下的机器翻译.) 机器翻译的另一极是以词素(词根, 词缀, 词尾)为分析单位, 它所需要的词典容量(只存词根)最小, 人工智能的水平最高, 不但有句法分析和综合, 还要有构词分析和综合. 但费了好大劲儿, 质量却最不能保证, 因为一个句子掰得太碎(原文分析), 捏拢来总难免有些难看的痕迹(译文综合). 所以, 现行的机译系统, 一般都是在这两极中根据具体条件和设计者的观点取某个中值. 我们认为, 一个优秀的实用系统应该有两手, 既能分析得很透彻, 又能对常用词组(成语)囫囵儿处理. 该细的地方细得下去, 该粗的地方粗得起来. 一般来说, 对于常用的, 固定的, 个性的可枚举现象粗一点比较有利, 而对于规律性的随机现象, 则适宜较细致的分析. 所以, 对于以世界语为分析对象的实用机译系统, 我们既主张尽可能多收成语和带缀词干, 也充分肯定建立一个完备的削缀算法的必要性.

那么, 世界语实词词典收多少派生词词干比较合理呢? 对于独立型机器翻译:

(1) 如果是小型实验系统, 目的是在有限的材料内试验系统的句法分析和综合能力, 那就词干全收; 否则:

(2) 凡是常用的派生词词干一律收进词典, 而不再入削缀子程序----常用性(出现频率高)是根本标准;

(3) 有助于区别同形多义的派生词词干, 应该收;

(4) 可收可不收的, 主张收;

(5) 在刚开始设计实用系统的机器词典时, 由于世界语词缀的极端灵活性和随机性, 很难一次收入许多带缀的词干, 这样, 削缀算法就显得更重要. 削下缀来, 虽然表义不是很确切, 甚至有时在目标语综合时, 还需要辅以说明性注释(见后面例释), 但总比直接打出生词来(信息量为零)强出百倍. 随着系统的不断扩充和完善, 收的词干自然会越来越多.

如果是具有特定的目标语的相关型机器翻译:

(1) 收多少派生词词干应该考虑目标语的构词特点及词汇状况;

(2) 在目标语中作为一个完整概念, 而不是词根和词缀意义简单相加所能反映的词干, 应该收入词典. 如: DOM-EGO 楼房, 大厦 (而不是一般的 "大-房子" );

(3) 如果以汉语为目标语, 削缀更多一些, 因为世汉构词法很相似, 汉族人的心理本能地习惯于理解词素与词素的组合. (这种民族偏爱心理在引进外来词时表现的很明显, 如 "德律风" 为 "电话" 取代, "莱塞" 为 "激光" 取代等.) 可以举出很多世汉构词神似的例子. 而且也有许多世界语派生词如 DOM-ACHO 虽然整个儿译作 "陋室" 更雅一些, 但也不妨用统一的削缀合成法组成新词 "鬼-房子", 与原义相去也不远. 特别是有些缀与汉字(词素)有很多一致性, 如 VIC-/副-, -IN-/女-, -EBL-/可- 等等, 就更有理由作削缀处理.

世汉构词对比例释(1): 派生词

(1) BO- 姻- : BO-PATRO 姻-父亲 (岳父或公公) , BO-FILO 姻-儿子 (女婿) , BO-FRATO 姻-兄弟 (内弟) ;

(2) GE- (男女)- : GE-AMIKOJ (男女)-朋友们 , GE-KAMARADOJ (男女)-同志们 , GE-AKTOROJ (男女)-演员们 ;

(3) EKS- 前- : EKS-OFICISTO 前-职员 , EKS-MINISTRO 前-部长 , EKS-INSTRUISTO 前-教师 ;

(4) MAL- [反义] : MAL-BONA [反义]好 (坏) , MAL-AMIKO [反义]朋友 (敌人) , MAL-SAGHE [反义]聪明 (愚苯) ;

[说明] MAL-是世界语中用得最广, 随机性最强的前缀之一, 具有极强的造词能力, 可惜, 中文没有对应的词素. 如果系统遇到某个MAL-型生词, 削下前缀后给出[反义]这样的说明性标识, 也还可以使人理解.

(5) VIC- 副- : VIC-PREZIDANTO 副-主席 , VIC-ESTRO 副-队长 , VIC-CHEFMINISTRO 副-总理 ;

(6) FI- 坏- : FI-INSEKTO 坏-虫 , FI-KOMERCISTO 坏-商人 (奸商) , FI-KUTIMO 坏-习惯 (恶习) ;

(7) SEN- 1. 若词根逻辑类为名词则 "无-" : SEN-GUSTA 无-味的 , SEN-SENCA 无-意义的 ;

  1. 若词根逻辑类为动词则 "不-" : SEN-MORTA 不-死的 (不朽的) , SEN-ATENTA 不-注意的 ;

(8) NE- 若词根逻辑类为名词则 "非-" 否则 "不-" : NE-ESPERANTISTO 非-世界语者 , NE-BONA 不-好的 ;

(9) 介词性前缀:  1. SUR- -上: SUR-TABLE 桌子-上 ; 2. APUD- -旁: APUD-VOJA 路-旁的 ;

  1. EN- -内: EN-LANDE 国-内 ; 4. LAU- 按-: LAU-VICE 按-次序 ; 5. DE- 从-: DE-NOVE 从-新 ;

(10) -ACH- 鬼- : DOM-ACHO 鬼-房子 (陋室) , KNAB-ACHO 鬼-男孩 (捣蛋鬼) , VETER-ACHO 鬼天气 ;

(11) -AN- -成员 : KLUB-ANO 俱乐部-成员 , KURS-ANO 讲习班-成员 , KOMUNUM-ANO 公社-成员 ;

(12) -UL- -者 : BON-ULO 好-者 , KAR-ULO 亲爱-者 , JUN-ULO 年青-者 , LONG-KRUR-ULO 长/腿-者 ;

(13) -IN- 女- : KAMARAD-INO 女-同志 , INSTRUIST-INO 女-教师 , OFICIST-INO 女-职员 , AKTOR-INO , 女-演员 ;

(14) -EBL- 可- : VID-EBLA 可-见的 , MANGH-EBLA 可-吃的 , UZ-EBLA 可-用的 , NE-ATING-EBLA 不-可-达到的 ;

(15) -EC- -性 : CERT-ECO 确实-性 , NECES-ECO 必要-性 , KLAR-ECO 清楚-性 , LIBER-ECO 自由-性 ;

(16) -EM- 爱- : LABOR-EMA 爱-工作的 (勤劳的) , PAROL-EMA 爱-说话的 , MENSOG-EMA 爱-撒谎的 ;

(17) -IND- 值得- : LERN-INDA 值得-学习的 , LAUD-INDE 值得-称赞 , LEG-INDA 值得-读的 , AM-INDA 值得-爱的 ;

(18) -ON- 1. 若 -ONO 则 "-分之一": DU-ONO 二-分之一 , TRI-ONO 三-分之一 , KVAR-ONO 四-分之一 ;

  1. 若 X+Y-ONOJ 则 "Y-分之X": TRI DEK-ONOJ 十-分之三 , KVIN OK-ONOJ 八-分之五 .

合成词 ("词根+词根") 也是一样. 比较固定的, 应该整个儿存入词典, 随机组合的, 应该拆开. 但这儿有一个困难, 世界语语法为了方便使用者, 即便对完全随机组合的合成词, 也不作加连字符的规定. 那么怎么拆呢? 词根的数量与词缀不能比, 长度也变化很大, 一个字母一个字母地削查比较, 显然不是办法. 如果坚持不要译前编辑, 还找不到一个合理的解决办法. 目前可以考虑先对中间有连字符的合成词作拆词加工. 我们提倡除比较固定常用的合成词外, 世界语者在运用随机合成词时,为读者的省力和机器的识辨计加上连字符. 鉴于世界语构词法与汉语构词法惊人的一致(组合方式及其高度随机性都很类似), 对于世汉机器翻译这一倡议更加必要.

世汉构词对比例释(2): 合成词

(1) AKVO-FONTO 水/源 ; (2) VARM-ENERGIO 热/能 ; (3) ARBO-BRANCHO 树/枝 ; (4) VAPOR-SHIPO 汽/船 ;

(5) SURD-MUT-ULO 聋/哑-者 ; (6) BLANK-HARA 白/发的 ; (7) NUD-PIEDA 光/脚的 ; (8) FISH-KAPTI 捕/鱼

______________________________________________________________

附注: [7] 参见:

刘涌泉 <<中国的机器翻译>> ( <<情报科学>> 1980, 3 )

王广义 <<机器翻译中的固定词组和固定结构问题>> ( <<语言和计算机>> (1), 1982 )

[8] 参看: 叶蜚声, 徐通锵 <<语言学纲要>> 第二章第二节 " 1. 语言的层级体系", PP.34-36 ( 北京大学出版社, 1981 )

 

 

【相关】

硕士论文: 世界语到汉语和英语的自动翻译试验
立委硕士论文:1. EChA概况
立委硕士论文:2. 世界语: 语言学特点及其研究价值
立委硕士论文:3. 层次递归成分体系
立委硕士论文:4. EChA机器词典及词表
立委硕士论文:5. 世界语形态分析
立委硕士论文:6/7 世界语句法分析
立委硕士论文:8. 英语形态生成
立委硕士论文:9. 目标语调序
立委硕士论文:10. EChA 试验结果的分析
立委硕士论文【致谢】【参考书目】
立委硕士论文全文(世界语版)

《朝华午拾:shijie-师弟轶事(3)——疯狂世界语 》

灵感有如神授,巧夺岂止天工

《立委随笔:一小时学会世界语语法》

立委世界语文章 (1987): 《中国报道:通天塔必将建成》

立委世界语论文(1986): 《国际语到汉语和英语的自动翻译》

立委(1988)《世界科技:世界语到汉语和英语的自动翻译试验》

DLT项目背景介绍

立委硕士论文全文(世界语版)

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

【关于机器翻译】

【置顶:立委NLP博文一览】

《朝华午拾》总目录

立委硕士论文:EChA机器词典及词表 (4)

世界语到汉语和英语的自动翻译试验
-- EChA机器翻译系统概述

EChA机器词典及词表

EChA所有词典词表都是随机数据文件, 并且各配有一套修改和扩充的外围维护程序, 这给系统的改进提供了方便. 下面

分别介绍各词典词表的定义.

1) 实词词干词典
格式:            __________________________________________________________________________________
词干丨逻辑类丨及物性丨带不定式丨支配词丨支配词汉义码丨汉义丨汉义特征 丨 英义
____丨_______丨______丨_________丨_______丨_____________丨____丨_________丨______    ___________________________________________________________
丨英义特征 丨 语义特征 丨 词类词义区分表记录号  丨 备用项 丨
丨_________丨__________丨_______________________丨________丨

<逻辑类>::= { N, V, A, F, P, C, K, T, R, S, W, E, D, X }

N=名词 , V=动词 , A=形容词 , F=副词 , P=介词 , C=连词或标点 , K=K类相关词 ,
T=T类相关词 , R=其他相关词 , S=数词 , W=人称代词 , E=系词 , D=冠词 , X=万能词

[说明] 逻辑类用来表明词的静态词性. 世界语实词的语法词性是动态随机的, 只能由削尾决定. 但每个词一般具有一个基本词性, 这是单词的深层的逻辑特征. 语法词性不过是由它通过加词尾派生的表层的句法特征.

<汉义特征>::= { "...以后", "...的", "使...", "把...", "给...", "...下", "...上", "...里", "...时",
多义词特征, 构成成语特征, ... }

[说明] 汉义特征揭示了该词汉义的结构特性, 也给出了汉语生成的修辞信息.

<英义特征>::= { 不规则变化特征, 双写特征, 形式不变特征, ... }

[说明] 英义特征给出该词的英语形态生成方式信息.

<支配词汉义>::= { 零义, "给", "以", "到", ... }

[说明] 支配词汉义标示该词所支配的词(通常是介词)的汉义.

<语义特征>::= { HM, LK, TM, FX, ... }

HM=人类特征, LK=地点特征, TM=时间特征, FX=方向特征
2) 虚词词典

虚词词典除包含实词词典的各项信息外, 还揭示了部分CDC信息, 如词性, 格, 数, 关系, 分布, 节点等. 分析之前就能在词典里给出某些动态信息, 这是由虚词特点决定的. 例如: 介词永远处于非终结节点(节点"Y")上, 原副词和万能词一般是不扩展的, 所以总处于终结节点(节点"J")上. 万能词 ECH (EVEN) 永远位于其轴心词之前(分布"Q"). 原副词 JAM (ALREADY) 永远做状语(关系"F"). 从属连词 KE (THAT) 总是引导名词性从句(词类"K", 节点"K"), 而且总位于其轴心词之后(分布"H").

冠词LA永远做定语(关系"D"), 位于轴心词前(分布"Q"), 处于终结节点上(节点"J").

3) 成语词典

机器翻译界所谓的成语, 比其通常的意义要宽泛得多. 凡是常用的比较固定的词组都可收作成语. 世界语中纯粹的不可分析的习惯表达法较少, 所以成语词典容量相对不大. 成语词典的收词范围, 还在很大程度上决定于原语和译语的对比差异. 亲属关系相近的表达方法类似, 可以少收或不收成语. 在EChA中, 就没有设立世英成语词典, 只有一部世汉成语词典.

EChA成语例释:

MALFERMA(JN) AUTO(JN) ----- 敞蓬汽车 ( CF: OPEN CAR(S) )
SOMERA(JN) FERIO(JN) ----- 暑假 ( CF: SUMMER HOLIDAY(S) )
LA ANGLA(N) LINGVO(N) ---- 英语 ( CF: THE ENGLISH LANGUAGE )
INSTRUA(JN) LIBRO(JN) ---- 教科书 ( CF: TEACHING BOOK(S) )
LA GRANDA(N) MURO(N) ---- 长城 ( CF: THE GREAT WALL )
HOMA(N) SVARMO(N) ---- 人群 ( CF: MAN'S SWARM )
FACILA(N) VENTO(N) ---- 顺风 (CF: EASY WIND )

4) 词类词义区分表

建立该词表对于世界语作为源语的机器翻译很必要, 可以大大减轻综合时多义区分的负担. 凡是随着词性和逻辑类的不同, 目标语的义项也相应不同, 而这种改变并不遵循形态转换规律, 这样的单词就收入区分表. 例如: MATEMATIK-A(JN) 必须收入, 而 HOM-A(JN) 就不必收, 因为前者的英义是 MATHEMATICAL (不是 MATHEMATICS' ), 而后者只要按规律从源语形容格(形容词性), 生成目标语所有格的词尾 -'S 或助词  "的" ( MAN-'S / "人-的" ) 就可以了. 我们在实词词典中对要入区分表的词, 都给出了查表记录号(随机文件地址), 所以系统只要按地址取记录就行了. 用BASIC编程时, 拿随机文件记录号?
作为单词内部代码, 是值得推荐的.

词类词义区分表例释:

实词词典                      词类词义区分表

ATING-I: ACHIEVE / 达到        ATING-O: ACHIEVEMENT / 成就
EKZEMPL-O: EXAMPLE / 例子      EKZEMPL-E: FOR EXAMPLE / 例如
KOMENC-I: BEGIN / 开始         KOMENC-E: AT BEGINNING / 开始时
MEZUR-I: MEASURE / 测量        MEZUR-O: MEASUREMENT / 尺寸
OKAZ-I: HAPPEN / 发生          OKAZ-O: OCCASION / 场合
SCI-I: KNOW / 知道             SCI-O: KNOWLEDGE / 知识
TIP-O: TYPE / 型号             TIP-A: TYPICAL / 典型的

5) 英语不规则词表

这个词表跟一般英语词典附录中列的不规则表没什么两样, 不过为了简便, 我们把动词形式的不规则变化和名词复数的不规则变化放在一个表内. 不规则词表是供英语形态生成查用的.

英语不规则词表

原形             过去时                过去分词              名词复数

BEAT             BEAT                  BEATEN
BECOME       BECAME                BECOME
...              ...                   ...                    ...
CHILD                                                         CHILDREN
...              ...                   ...                    ...

最后我们给出EChA句子加工场的格式:

目标语序号丨实词词典各项丨CDC信息丨已加工特征丨虚词特征丨
目标语调序信息丨目标语位移序号丨

[说明] 1. 目标语序号用来在综合阶段自底而上归约加工时给同号.

  1. 目标语位移序号用来在用搬家法作虚拟调序时代表整个词条. 用序号代替整个词条位移的虚拟调序, 比纯粹用搬家法效率高, 大约跟拉链法相仿. 鉴于BASIC不能处理组合项变量, 如果采用搬家法调序, 只能一项一项位移, 这种虚拟调序的技术更显出优越性. 但须注意, 跟位移序号一起移动的, 还必须包括该词的自然顺序号, 用它标示原词条位置, 这样查问时才无后顾之忧.

【相关】

硕士论文: 世界语到汉语和英语的自动翻译试验
立委硕士论文:1. EChA概况
立委硕士论文:2. 世界语: 语言学特点及其研究价值
立委硕士论文:3. 层次递归成分体系
立委硕士论文:4. EChA机器词典及词表
立委硕士论文:5. 世界语形态分析
立委硕士论文:6/7 世界语句法分析
立委硕士论文:8. 英语形态生成
立委硕士论文:9. 目标语调序
立委硕士论文:10. EChA 试验结果的分析
立委硕士论文【致谢】【参考书目】
立委硕士论文全文(世界语版)

《朝华午拾:shijie-师弟轶事(3)——疯狂世界语 》

灵感有如神授,巧夺岂止天工

《立委随笔:一小时学会世界语语法》

立委世界语文章 (1987): 《中国报道:通天塔必将建成》

立委世界语论文(1986): 《国际语到汉语和英语的自动翻译》

立委(1988)《世界科技:世界语到汉语和英语的自动翻译试验》

DLT项目背景介绍

立委硕士论文全文(世界语版)

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

【关于机器翻译】

【置顶:立委NLP博文一览】

《朝华午拾》总目录

 

立委硕士论文:层次递归成分体系 (3)

世界语到汉语和英语的自动翻译试验
-- EChA机器翻译系统概述

层次递归成分体系

在给出层次递归成分体系(CDC)的定义之前, 我们先说说该体系的来源及其理论依据.

CDC体系是机器翻译的一种中间语言, 我们试图提供一套更加合乎独立分析独立综合要求的机器翻译抽象文法. CDC是EChA系统的关键, 它体现了我们对语言结构的看法和对机器翻译的认识. CDC是直接从导师们的中介成分体系[2] 脱胎而来的, 它保留了中介成分的形式, 继承和改造了它的内容, 其思想基础是有向直接联系理论(或轴心词理论). 体现在CDC中的要点是:

1) 句子的最顶层是主句谓语, 它是全句的最大联系中心(主轴心), 所以谓语是全句的代表. 一个完整的句子的最简单也是最典型的形式, 就是独词祈使句. 如:

Venu! Come! 来!

任何其他句子(无谓句是不完整句, 除外)都是从上面的简单形式一层一层推衍出来的:

Venu! ... La studento venu chi tien! ... La studento, kiu parolis, venu chi tien! ......

Come!     Let the student come here!     Let the student, who spoke, come here!

反过来说, 对一个无论怎样复杂的句子层层归约, 归约的顶层必然是主句动词谓语:

VENU
/                \    \
studento         tien   (!)
/           \               /
la        parolis      chi
/     /    \
(,)    kiu      (,)

2) 一个词只能跟另外的一个词发生直接联系, 但一个词可以带 N 个 ( N>=0 ) 直接联系词. 这就是句子结构的有向直接联系观点.[3] 带直接联系词的词叫轴心词, 当 N>0 时, 它是非终结节点词. 直接联系词本身也常常是低一层次的轴心词.

3) 主句谓语(主轴心)处在第一层. 与主句谓语发生直接联系的词位于第二层. 与第二层词直接联系的词在第三层. 这样一环扣一环, 组成句子的每一个词都处在某一个层次上. 理论上说, 句子的层次可以是无限的.

4) "虚词不虚." 虚词(或者叫功能词, 结构词)较之实词包含更多的句法结构信息. 有些虚词同样可以充当轴心词. 比如: 在 "介+名" 结构中, 介词是轴心词. 主从连词如 SE (IF), KVANKAM (ALTHOUGH) 等也充当轴心词, 作为从句的代表, 它跟主句谓语发生直接联系, 它所带的下位直接联系词是从句谓语.崐    5) 作为源语文句的中间语言映射, 层次递归成分应该, 也可以落实到每个词上. 所谓词, 从机器角度来看, 就是两空之间的字符串(汉语另当别论). 严格地说, 标点符号也是词(虚词), 也要参与文句的分析和归约.

建立CDC体系的两项基本原则是:

1) 层次递归原则: 有多少层次反映多少层次, 而且层次是递归的. 层次的递归性表现在: (1) 对文句可以自底而上层层归约(参见EChA系统的目标语生成算法); (2) 对文句可以自顶而下层层分析(参见EChA的源语分析算法).

2) 词本位原则:[4] 词到句子(以主句谓语为代表)是一个动态递归过程的两极, 其间的各个环节就是所谓层次. 贯彻词本位原则的实质, 就是在一切层次上都把成分(CDC)落实到词. 句子是, 也仅仅是由句素组成的. 而每一个大大小小的句素(词组, 短语, 从句等)按照我们的看法, 总是以一个轴心词来代表的.

现在, 我们给出层次递归成分体系的形式化定义:

  1. 层次递归成分体系是层次递归成分的集合.
  2. 层次递归成分是这样一个六元信息组:
    形态信息 | 结构关系信息 | 节点信息 | 分布信息 | 层号信息 | 链号信息
  1.  <形态信息>::=
    { <词性>, <格>, <数>, <时态>, <语态>, <语式>, <非谓语形式>, <体>, <人称>, ... }

<词性>::= { N, V, A, F, P, Z, C, K, B }

N=名词, V=动词, A=形容词, F=副词, P=介词, Z=助动词, C=并列连词,
K=主从连词, B=标点符号

<格>::= { 非格, 普通格, 目的格 }

<数>::= { 非数, 单数, 复数 }

<时态>::= { 非时态, 现在时, 过去时, 将来时 }

<语态>::= { 非语态, 主动语态, 被动语态 }

<语式>::= { 非语式, 陈述语式, 命令语式, 虚拟语式 }

<非谓语形式>::= { 非非谓语形式, 分词, 不定式, 名动词 }

<体>::= { 非体, 进行体, 完成体, 将来体 }

<人称>::= { 非人称, 第一人称, 第二人称, 第三人称 }

  1. <结构关系信息>::= { S, W, O, D, F, B, T, I, C, L, M, A, Z, V, R }

S=主语, W=谓语, O=宾语, D=定语, F=状语, B=补语, T=同位语,
I=独立成分, C=同等连词或标点, L=从句起始标点, M=从句末标点,
A=插入成分起始标点,Z=插入成分末标点, V=非结构意义标点, R=句末标点

  1. <节点信息>::= { J, <非终结节点> }

J=终结节点

<非终结节点>::= { S, O, D, B, K, X, Y }

S=主语从句节点, O=宾语从句节点, D=定语从句节点, B=补语从句节点,
K=一般从句节点, X=动词性非终结节点, Y=其他非终结节点

  1. <分布信息>::= { Q, H, G }

Q=位于轴心词前, H=位于轴心词后, G=轴心

  1. <层号信息>::= { 非层号, <自然数> }

<自然数>::= { 1, 2, 3, ... }

  1. <链号信息>::= { <左链号>, <右链号> }

<左链号>::= { 非左链号, 99, N }

N=大于句首号小于句末号的自然数

<右链号>::= { 非右链号, N }

[说明]   左链号的设置是为了处理同等成分的方便. 我们把同等成分的最右元素认作整个成分的代表(落脚点, 轴心).  左链号99是同等成分最左元素的标志. 有了左链号, 消除了后顾之忧, 同等成分就可以和其他句素一样, 参加文句的分析和归约.

下面是用这套成分体系作分析的例句(004):

CDC中形态信息略去, 余下依次是: 关系/节点/分布/层号/左链/右链, 例如:

FJQ 05 00 02 --->
状语/终结节点/位于其轴心词之前/处于第5层/没有左链(00是非左链号)
/右链号为02

Pli    poste          ,              kiam           la                     sciodisketoj
英:   More  later ,             when           the            knowledge-disks
汉:   更以后           ,            当(...时)                            微型知识磁盘
CDC链:  FJQ 05 00 02   FYQ 04 00 17   LJQ 05 00 04   FKQ 04 00 17   DJQ 07 00 06   SYQ 06 00 07

estis          eltrovitaj     ,             la          plenan         indikaron [注:目的格]
had been       found out      ,    the            full           indication
被             发明了         ,                             全套           指令集合
WBH 05 00 04   BJH 06 00 07   MJH 05 00 04   DJQ 05 00 12   DJQ 05 00 12   OYQ 04 00 17

,              endiskigitan   ,              oni            metis          en
,              endisked       ,              people         put            into
,              所写入磁盘的   ,              人们           放             到(...里面)
AJQ 06 00 14   DYH 05 00 12   ZJH 06 00 14   SJQ 04 00 17   WXG 03 99 20   BYH 04 00 17

mashinojn      kaj            ili            tiamaniere     povis          en
machines       and            they           therefore      could          in
机器                                    它们           这样               能             在(...里面)
OJH 05 00 18   CJQ 02 17 23   SJQ 02 00 23   FJQ 02 00 23   WXG 01 20 00   FYQ 03 00 27

si             mem            akumuli           sciencan       stokon         ,
themselves                 accumulate     scientific      stock          ,
自己           本身           积累                     科学           贮蓄           ,
BYH 04 00 24   BJH 05 00 25   BXH 02 00 23   DJQ 04 00 29   OYH 03 00 27   VJQ 05 00 32

pli            grandan        ol             la             homa           cerbo          .
more       great          than           the            man's          brain          .
更             大                 比                                 人的           头脑           .
FJQ 05 00 32   DYH 04 00 29   FYH 05 00 32   DJQ 07 00 36   DJQ 07 00 36   BYH 06 00 33   RJH 02 00 23

层次递归成分实质上就是不同层次的词之间直接联系关系的一种反映. 它揭示了文句结构的正确的句法树. 根据文句的CDC链, 我们很容易画出该句的句法树.

实验证明, 作为体现独立分析结果的机器翻译中间语言, 层次递归成分体系是比较有效的. 现在, 越来越多的专家呼吁建立能充分体现对源语分析的结果, 正确揭示文句的层次结构和语义信息的媒介语, 或类似媒介语的东西. 许多文章论证了分析和综合独立的必要性. 原语分析依赖译语, 或译语综合依赖原语, 使分析和综合都不能深入, 而且难免捉襟见肘.[5]

当然, 层次递归成分体系还处于草创时期, 必然存在不少问题, 有待于在实践中不断检验, 改进和完善. 通过时间的考验和我们的努力, 也许它最终能成为一个比较得心应手的机译工具, 而为人们乐于采用, 这当然是我们所希望的. 也许它不是一个好的方案, 很快便被淘汰了. 但无论如何, 总是一次有益的尝试.

这套体系的不足之处是, 它不大能够反映有向直接联系的语义性质, 而这对于高质量的机器翻译是比较关键的信息. 人类语言不管怎样千差万别, 总有某些共同的东西. 例如, 句素间的层次结构及其直接联系关系就具有很强的普遍性. 正是这些语言共性才使翻译成为可能, 从而它成为语言转换的基础. 句素与句素之间的逻辑语义联系, 也是重要的语言共性之一.[6] 逻辑语义的确定, 将大大有助于生成地道的目标语. 在CDC体系中, 结构关系一项基本上是传统语法中句法成分的继承, 反映的是句子表层结构的关系(主谓宾定状补等). 看来, 有必要扩充CDC, 再加一个逻辑语义元:

<逻辑语义信息>::= { Ag, Sb, Ob, Vb, Pl, Tl, Mn, Pp, Rs, Fr, Rg, Dg, Tm, Pr, Cl, Fn, Ms, Pm, Cd, Nb, Pt, Mt, Ps, Tg, Cs, Ex, Dt, Ct, Cn, Cc, Cp, Tw, Xx }

Ag=施事(Agent), Sb=主体(Subject), Ob=受事(Object), Vb=行为(Verb), Pl=地点(Place),
Tl=工具(Tool), Mn=方式(Manner), Pp=目的(Purpose), Rs=结果(Result),
Fr=频率(Frequency), Rg=范围(Range), Dg=程度(degree), Tm=时点(Time),
Pr=时段(Period), Cl=颜色(Colour), Fn=功能(Function), Ms=尺寸(Measurement),
Pm=后饰(Post-modifier), Cd=条件(Condition) , Nb=数量(Number),
Pt=属性(Property), Mt=质料(Material), Ps=领属(Possession), Tg=对象(Target),
Cs=原因(Cause), Ex=说明(Explanation), Dt=限定(Determiner),
Ct=环境(Circumstance), Cn=内容(Content), Cc=让步(Concession),
Cp=比较(Comparison), Tw=同位, Xx=非语义(或不定语义)

[注] Xx是所有无法确定, 或没有必要确定的成分的逻辑语义. 机器翻译跟自然语言理解不同, 并不一味要求分析得越具体越透彻越好. 机器翻译过程中的中间信息究竟要深入到怎样的程度, 应根据充分必要的原则来决定. 少则影响效果(质量), 多则白费功夫.

_____________________________________________________________

附注: [2] 关于中介成分体系, 参见:

刘涌泉, 刘倬, 高祖舜 <<俄汉机器翻译规则系统新旧方案比较>> ( <<中国语文>> 1962.2 )

刘涌泉 <<外汉机器翻译中的中介成分体系>> ( <<中国语文>> 1982.2 )

刘  倬 <<三次机器翻译试验>> ( 第一次机器翻译学术会议论文, 1980.9 )

[3] 关于有向直接联系理论, 参见:

刘涌泉, 刘倬, 高祖舜 <<俄汉机器翻译规则系统新旧方案比较>> (同上)

刘涌泉, 刘倬, 高祖舜 <<机器翻译中的词序问题>> ( <<中国语文>> 1965.3 )

并请参阅 <<特斯尼埃的 <结构句法基础> 简介>> ( 张烈材, <<国外语言学>> 1985.2 )

[4] 参见: 刘涌泉 <<词>> ( 1984年机器翻译及自然语言处理学术讨论会论文, 1984.9 )

[5] 参见: 冯志伟 <<当前机器翻译的一些新特点>> ( <<情报学刊>> 1982. Vol 1 No.2 )

[6] 参见: 董振东 <<逻辑语义及其在机译中的应用>> ( <<中国的机器翻译>> pp.25-45 )

 

 

 

 

【相关】

立委硕士论文:目标语调序

《朝华午拾:shijie-师弟轶事(3)——疯狂世界语 》

灵感有如神授,巧夺岂止天工

《立委随笔:一小时学会世界语语法》

立委世界语文章 (1987): 《中国报道:通天塔必将建成》

立委世界语论文(1986): 《国际语到汉语和英语的自动翻译》

立委(1988)《世界科技:世界语到汉语和英语的自动翻译试验》

DLT项目背景介绍

立委硕士论文全文(世界语版)

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

【关于机器翻译】

【置顶:立委NLP博文一览】

《朝华午拾》总目录

 

 

立委硕士论文:世界语: 语言学特点及其研究价值 (2)

世界语到汉语和英语的自动翻译试验
-- EChA机器翻译系统概述

世界语: 语言学特点及其研究价值

在进入EChA系统的细节和探讨机器翻译的一般理论和方法之前, 我们专列这一节讨论世界语本身, 这对说明本系统的设计思想和具体方法是很必要的. 毫无疑问, 我们的讨论主要是从语言学角度着眼.

世界语(Esperanto)是波兰的语言大师柴门霍夫博士( L.L.Zamenhof 1859.12.15 - 1917.4.14 )于1887年在印欧语系的基础上经过艰苦研究提出的一个人造语方案. 由于其科学, 简明, 逻辑性强, 由于日益增长的克服语言障碍的国际需要, 也由于其维护世界和平, 增进各民族相互了解, 实现世界大同的崇高理想的感召, 它逐渐为人们所接受. 目前, 世界上有2000多万人在学习和使用世界语. 世界语早已脱尽了人造的斧痕, 走上了自然发展的道路. 它不但能写也能说, 不但适于表达精密的科学思想, 而且在文学上也取得了令人赞叹的成就. 从莱勃尼茨的万国通用文字的设想开始, 先后提出的人造语方案达150多种, 唯有世界语经受住各种考验生存下来了. 现在, 越来越多的人认识到世界语作为国际辅助语的独特价值. 有些国际性学术会议(如控制论大会)已经采用世界语作为工作语言.

世界语中除数量有限的虚词外, 其他词都有非常规则的形态变化, 借以表现该词的词性, 格, 数, 时态, 语态, 语式, 分词形式等语法信息. 另外还有一整套前缀后缀, 用以表现词汇意义上的细微差别和修辞色彩. 世界语是典型的黏着语, 词尾和语缀的意义单一, 可以叠加. 这套词尾和语缀设计得非常巧妙, 规则, 特别容易掌握, 而且也非常适合机器的递归加工.

(EChA的削尾算法就体现了这种递归加工的优点, 见本文第5节.) 世界语没有语法同形词, 句法关系一目了然, 这不论对人还是对机器的识辨, 都是一个极为有利的条件(民族语机器翻译中同形判别的问题在这儿根本不存在了). 同时, 世界语的词类转换也特别灵活, 只要逻辑上说得过去, 不致引起误解, 同一个词干可以根据句法需要, 通过词尾变化随意改变词性. (我国古汉语词类活用也比较自由, 在一定程度上具有类似的灵活性, 可惜这种活用没有明确的形态标志, 常常要靠逻辑语义的分析才能确定.)

世界语的词尾形式并不很多, 但却很完备, 可以和形态发达的语言相媲美, 这一点我们不能不为之惊叹. 拿格来说, 世界语只有普通格(零形态)和目的格(加词尾-N)两种, 但由于它把词性和格的用法巧妙地统一起来, 再加上有介词这种分析形式的后备, 表达起来跟形态丰富的语言一样灵活自由. 俄语是现代形态最丰富的语言之一, 它有六个格. 粗略地说, 它的一格(主格)跟世界语普通格对应, 二格(属格)跟世界语形容词--姑且叫做形容格吧(加词尾-A)对应, 三格(与格)在世界语中没有相应的屈折形式, 一般用介词AL来代替. 四格(宾格)对应于世界语的目的格. 五格(工具格)跟世界语副词--也姑且叫做状格吧相对应. 六格是前置格, 跟前置词O,Ha,B等搭配, 它本身并不表示特定的语义关系. 有意思的是, 世界语介词后可以跟崐普通格和目的格两种, 前者表示静态, 后者表示动态(方向). 比较俄语的类似用法, 世界语的简洁和完备的特点是很明显的.

世界语基本语法规则共16条, 原则上没有例外.[0] 由此人们也许会推断这门语言很简陋, 刻板, 缺乏表现力. 这是一个极大的误解. 这里涉及世界语的另一个非常突出的语言学特点, 就是它兼有分析性语言和综合性语言的要素(虚词和形态都比较丰富), 同一种语义既可以用分析形式(借助于虚词), 又可以用综合形式(借助于屈折变化)来表示----当然, 这两种形式并不等同, 它们体现了不同的风格. 由于这一特点, 世界语兼容性强, 文体多样, 特别灵活, 富于弹性和表现力. 如果作为目标语, 它最能维妙维肖地模仿原文的语言特色. 它既可以反映语序自由, 文体柔美的斯拉夫风格, 又可以表现形态缺乏的语言(如汉语和英语)的单纯, 严谨, 密集的特点. 下面我们举几个例子来看一下分析形式和综合形式在世界语中的兼容并存情况:

分析形式                              综合形式

  1. 时态: Mi ESTAS skrib-ANTA. Mi skrib-AS. / Mi skrib-ANTAS.

I AM writ-ING. 我 "在" 写字.

  1. 语态: Ghi ESTAS limig-ITA. Ghi limig-ITAS. / Ghi lim-IGHAS.

It IS limit-ED. 它 "被" 限定了.

  1. 词义: Tio estas MALGRANDA (ETA) Tio estas sekret-ETO.

sekreto.

That is a LITTLE secret.

那是 "小" 秘密.

  1. 介词与副词(状格):

Li parolas EN (PER) Esperanto.
Li parolas esperant-E. Li parolas Esperant-ON.

He speaks IN Esperanto.
He speaks Esperanto.

他说世界语.
他 "用" 世界语说话.

  1. 介词与格(目的格):

Shi parolis POR 30 minutoj.           Shi parolis 30 minut-OJN.

She spoke FOR 30 minutes.             她说了30分钟.

  1. 分析形式向综合形式的转换:

LAU kutimo ...............LAU-kutim-E...kutim-E

这种分析形式和综合形式并存的情形在世界语中极其普遍, 这一点跟民族语不一样. 虽然没有绝对不用分析形式的综合性语言, 也没有绝对不用综合形式的分析性语言, 但是, 每一个具体的民族语言总是以一种形式为主, 而且在多数场合总是一种形式排斥另一种形式, 一般不允许并存.

总之, 跟人们通常想象的正相反, 世界语是高度灵活的, 表达方式极其多样, 且能互相转换. 这种高度灵活性正好适应了人类思维模糊性的特点. 灵活性与规则性的高度统一, 这就是世界语的真正奇迹.

人造语言的规则性容易为人理解. 关于灵活性, 再补充几点. 由于篇幅关系, 我们不打算展开, 必要时辅以一两句例证.

  1. 在世界语中动词的及物与不及物的界限模糊了.

Mi IRAS. / IRU vian propran VOJON!

I GO. / GO your own WAY!                 我行走. / 走你自己的路!

La tuta homaro PAROLOS nur unu LINGVON.
/ Mi PAROLAS esperante (en Esperanto, per Espernato).

The whole mankind will SPEAK only one LANGUAGE.
/ I SPEAK in Esperanto.

全人类将说仅仅一种语言. / 我用世界语说话.

  1. 直接宾语(所谓宾格)与间接宾语(所谓与格)的界限模糊了.

informi ION al IU / informi IUN pri IO

tell sth. to sb. / tell sb. about sth.   向某人告诉某事 / 告诉某人关于某事

  1. 宾语与状语的界限模糊了. 世界语语法规定: 目的格(即通常所谓宾格)也可以表达某种状语意义(参见基本法规第14和第13条).

Mi invitas vin VOJAGHI kun mi PEKINON.

I invite you to TRAVEL with me TO PEKING. 我邀请你和我一起 "旅游北京".

  1. 词缀与词根的界限模糊了, 从而派生词与合成词的界限模糊了. 同时虚词与实词的界限也模糊了.

sekret-ET-o / ET-a sekreto       JES, / mi JES-as vian opinion.

little secret  小秘密            Yes, I agree with you. 是的, 我同意你的意见.

ANTAU-vidi / Sinjorinoj ANTAU-as.  Kred-IND-a
/ ne-IND-a , IND-igi , sen-IND-ulo

foresee / Ladies first.            believ-able
/ not worthy, make worthy, good-for-nothing

  1. 万能介词JE的设置. 人们在表达思想时, 常常只意识到从属成分与中心成分有某种朦胧的修饰关系, 但却说不出, 往崐往也不必要说究竟是何种语义联系. 为了适应人类思维的这种模糊特点, 柴门霍夫引入介词JE. 这是一个很有见识的创造. (表达这种模糊关系还可用屈折形式的目的格或副词(状格), 见基本法规第14条.)
  2. 词性与格在用法上的统一. 词性和格都是根据词尾 "入句而后定" 的动态句法特征, 都能表现比较抽象的语义关系, 可以相互补充. (这跟分析形式的介词短语不同. 介词除了上述JE外, 一般用来表示较为具体和确定的语义关系.)

Mi skribas plum-E.
CF: (俄)                       (五格)

  1. 极其灵活的词类转换.

La FLOR-OJ FLOR-AS.     Li KANT-AS italan popolan KANT-ON.
Mi estas GHOJ-A. Mi GHOJ-AS.

The flowers blossom. .  He sang an Italian folk song.
I am glad.

  1. 词序的自由.

Mi amas vin. (106) / Mi vin amas. / Vin amas mi. (108)
/ Vin mi amas. (111) / Amas mi vin.  / Amas vin mi.

I love you. 我爱你.

  1. 构词的灵活. 派生词: 词缀的丰富及其黏合特点; 合成词: 词根与词根的自由复合.

Shi rid-AS. Shi rid-ETAS. Shi estas rid-EMA.
Shi estas rid-EMULO. Shi estas rid-EMULINO ( rid-EMINO ).
Shi estas rid-EMULINETO ( rid-EMINETO ).......

她笑.      她微笑.        她爱笑.
她是爱笑的人.        她是爱笑的女人.
她是爱笑的小女孩儿 .......

INTER-lingvo   中间语言

fonto-lingvo         celo-lingvo       ponto-lingvo
naci-lingvo       internaci-lingvo

源语                 目标语            媒介语(桥梁语言)
民族语             国际语

  1. 完善的时态语态系统和精巧的相关词表. 世界语的时态语态系统和相关词表是两项绝妙的创造. 它们是如此地精巧完善, 富有逻辑的力量和美, 每一个世界语者都象化学家欣赏元素周期表一样体验到这种美, 并为此感到自豪. 借助于唯一的一个助动词ESTI, 世界语能表达各种复合时态语态. 相关词表所能表达的语义的简洁和丰富更是无与伦比的.

世界语的这些特点给人们的自由创造留下了很大的余地, 为人们充分发挥自己的语言才能提供了最好的条件. 这种灵活性并不影响作为世界语基础的16条基本法则的不可动摇的严格性. 在这儿, 自由和约束达到了完美的统一. 在世界语国里, 每个人都在不同程度上是创造者, 每一个世界语者都体验到这种创造的乐趣. 人们再也不是习惯的奴隶了.

然而, 不能不承认, 世界语的灵活和自由给机器的自动处理带来了一定的困难. 我们在研制EChA系统的过程中, 深深感到, 与民族语相比, 以世界语为源语的机器翻译虽然有其容易的一面, 也有其特有的难处, 总之要比我们预料的要复杂得多. 容易来自其高度规则性, 困难则源于其高度灵活性.

世界语作为人们唯一实际使用的人造语言自然有它独特的研究价值. 拿它与民族语作对比研究, 我们会得到很多有益的启示. 由于其独特的地位, 人们在研究思维与语言, 民族与语言, 社会与语言, 个体与语言, 信仰与语言等等的关系, 以及探讨语言的共性, 语言的本质, 语言的前途(未来社会的语言), 语言的形式和内容, 语言的类型, 语言的教学等问题时都可?
能在研究世界语的过程中获益. 另外, 世界语本身的发展也需要语言学者对它作科学的研究和总结, 这不但有益于这门语言健康的发展, 有助于世界语语言学理论体系的建立, 同时也会丰富一般语言学的理论. 语言学者对世界语的理论研究虽然早已开始, 但还远远不够.

对于机器翻译工作者, 世界语还有一层特殊的意义, 就是世界语作为民族语间机器翻译的媒介语的价值.[1] 这可以从两方面看: 1) 按照机器特点对世界语作必要改造, 定义一个作为媒介语的世界语子集, 再辅以一套高度形式化的成分体系. 这个设想我们在第一届中国世界语大会上提过. 我们也确实设计过一个以世界语作为媒介语的英汉机器翻译规则系统. 虽然由于时间等原因没有能上机试验, 但我们相信该方案是可行的, 也是值得尝试的. 拿世界语或其子集作媒介语, 尽管还远远不是最理想, 但如果研制的是印欧语系间多语言自动翻译, 或者是以这些语言为源语的多对一系统(如英/法/德/俄--汉系统), 相信会带来很多方便. 2) 虽然不直接采用世界语作媒介语, 但在设计机译媒介语时, 认真吸取世界语的优点, 可以少走弯路.

_______________________________________________________________________

附注: [0] 为便于查对, 这里把世界语16条基本法规转抄如下:

(1) 不存在不定冠词, 只存在定冠词 (LA), 其性数格不变.

(2) 名词词尾为 "-O", 复数形式加词尾 "-J". 只存在两个格: 普通格和目的格; 后者由普通格加词尾 "-N" 构成.

(3) 形容词以 "-A" 收尾, 其格数与名词同. 比较级用PLI和连词OL, 最高级用PLEJ.

(4) 基数词(没有词尾变化)是: UNU 1, DU 2, TRI 3, KVAR 4, KVIN 5, SES 6, SEP 7, OK 8, NAU 9, DEK 10, CENT 100, MIL 1000. 几十和几百由数词简单合并而成. 序数词加形容词词尾; 倍数加后缀 "-OBL-", 分数加 "-ON-", 集合数词加 "-OP-", 分配意义用介词 PO. 此外, 数词也可以有名词和副词形式.

(5) 人称代词: MI, VI, LI, SHI, LI, GHI (代物件或动物), NI, VI, ILI. 其所有格形式加形容词词尾构成. 数格的变化与名词同.

(6) 动词没有人称和数的变化. 动词的各种形式: 现在时用词尾 "-AS"; 过去时 "-IS"; 将来时 "-OS"; 假定式 "-US"; 命令式 "-U"; 不定式 "-I". 分词(有形容词和副词的意义): 主动现在式 "-ANT-"; 主动过去式 "-INT-"; 主动将来式 "-ONT-"; 被动现在式 "-AT-"; 被动过去式 "-IT-"; 被动将来式 "-OT-". 被动语态的各种形式, 都借助于ESTI的相应形式和所需要的动词的被动分词构成; 被动式所用的介词是DE.

(7) 副词以 "-E" 收尾; 各比较等级与形容词同.

(8) 所有介词都要求普通格.

(9) 每个词读写一致.

(10) 单词重音永远在倒数第二个音节上.

(11) 合成词由词与词简单合并而成(主要的词放在后面); 语法词尾也被看作独立的词.

(12) 有其他否定词的时候, 就不再用 NE.

(13) 为了表示方向, 单词加目的格词尾.

(14) 每个介词都有确定不变的意义. 但是如果我们需要用一个介词, 而从意义上看不出应该用哪一个, 这时我们就用没有独立意义的介词JE. 介词JE也可以用没有介词的目的格来代替.

(15) 所谓外来词, 即大多数语言取自同一来源的词, 在世界语里不加变化地应用, 只需照世界语拼写法书写; 但如果一个词根派生几个不同的词时, 最好只不加变化地采用那个基本词, 并由此按照世界语的规则构造出其他的词来.

(16) 名词和冠词末尾的元音字母可以省略, 用省略号 ' 来代替.

 

[1] 请参看 <<巴贝尔通天塔必将建成>> (刘涌泉 李维, 中国第一届世界语大会论文. 其中第四节专门讨论了世界语作为机译媒介语的优点, 缺点, 可能和前景.)

 

 

【相关】

硕士论文: 世界语到汉语和英语的自动翻译试验
立委硕士论文:1. EChA概况
立委硕士论文:2. 世界语: 语言学特点及其研究价值
立委硕士论文:3. 层次递归成分体系
立委硕士论文:4. EChA机器词典及词表
立委硕士论文:5. 世界语形态分析
立委硕士论文:6/7 世界语句法分析
立委硕士论文:8. 英语形态生成
立委硕士论文:9. 目标语调序
立委硕士论文:10. EChA 试验结果的分析
立委硕士论文【致谢】【参考书目】
立委硕士论文全文(世界语版)

《朝华午拾:shijie-师弟轶事(3)——疯狂世界语 》

灵感有如神授,巧夺岂止天工

《立委随笔:一小时学会世界语语法》

立委世界语文章 (1987): 《中国报道:通天塔必将建成》

立委世界语论文(1986): 《国际语到汉语和英语的自动翻译》

立委(1988)《世界科技:世界语到汉语和英语的自动翻译试验》

DLT项目背景介绍

立委硕士论文全文(世界语版)

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

【关于机器翻译】

【置顶:立委NLP博文一览】

《朝华午拾》总目录

 

硕士论文:世界语到汉语和英语的自动翻译试验(1)

世界语到汉语和英语的自动翻译试验
-- EChA机器翻译系统概述

本文是我在导师刘涌泉和刘倬先生指导下所做的毕业设计的论文总结. 共分十大部分:
1. EChA概况: 系统流程图; 2. 世界语: 语言学特点及其研究价值; 3. 层次递归成分体系CDC: 体现独立分析结果的EChA中间语言; 4. EChA机器词典, 句子加工场格式; 5. 世界语形态分析: 削尾算法, 关于削缀问题的讨论; 6. 句法分析第一线: 虚词处理, 规则和规则分开的讨论; 7. 句法分析第二线: CDC的求解, 中间结果分析; 8. 英语形态生成, 汉语形态修辞, 原语和译语对比差异的一般总结, 多义区分例释; 9. 调序: 自底而上加工; 10. EChA试验结果分析, 汉语和英语的机译文的比较, 关于文学作品可不可以跟机器翻译结合的问题, 修辞的讨论。

                         目       录

  1. EChA概况 ............................................................... 3
  2. 世界语: 语言学特点及其研究价值 ......................................... 7
  3.   层次递归成分体系 ....................................................... 13
  4. EChA机器词典 ........................................................... 19
  5. 世界语形态分析 ......................................................... 23
  6. 世界语句法分析(1) ...................................................... 29
  7. 世界语句法分析(2) ...................................................... 31
  8. 英语形态生成 ........................................................... 34
  9. 目标语调序 ............................................................. 38
  10. EChA试验结果的分析 ..................................................... 39

[致谢] ..................................................................... 44

[参考书目] ................................................................. 45

[附录一] EChA试验结果 ...................................................... 46

[附录二] 世界语文摘 ........................................................ 57

EChA概况

EChA (E-Ch/A: el Esperanto en la Chinan kaj Anglan Lingvojn) 系统是以世界语作为源语, 以汉语和英语作为目标语的一对多小型实验系统. 它是一个句对句的, 分析和综合有一定独立性的全文机器翻译系统. 本系统实现了翻译过程的完全自动化,不需要译前和译后编辑. (由于纯技术原因, 世界语中的几个戴帽字母暂时还需要用加 H 的复合字母来转写.) EChA系统从上机调试到打出译文只用了五个月, 全部工作历时近一年, 进展比较顺利. 本系统使用的是IBM-PC/XT微型机, 编程语言 BASIC (Version D2.00), 同时选用IBM公司的BASIC编译程序软件包. EChA由CCDOS操作系统(即带有汉字库的PC DOS 2.10)支持. 系统主体是六线分析和综合程序. 另外还建立了三部词典, 两个词表, 编制了词典的造查, 扩充和维护程序. 整个系统由近一万条BASIC语句构成. 编程时充分利用了BASIC串处理函数, 显得特别方便.

这次试验共翻译了150多句世界语文句. 汉语和英语的机器译文都通顺或可懂, 结果令人满意. (见附录) 提供本系统试验的源语素材有三部分: 第一部分是选自著名世界语作家Sandor Szhatmari的世界语原文著作 "Mashinmondo" (<<机器世界>>, 中国展望出版社)上的两段连续文章(12句, P.100-101), 句子比较长, 结构也比较复杂. 第二部分选自魏原枢和徐文琪编著的 <<世界语语法>> (上海外语教育出版社, 1982.10)中的典型例句(100多句), 这些例句(其中有一部分是日常用语)都具有一定的语言学特点, 表现了不同时态(简单时态,复合时态), 语态(主动语态, 被动语态), 语式(陈述语式, 命令语式, 假定语式),不同的句式(简单句, 并列句, 复合句, 无主句, 独词句, 一般疑问句, 特殊疑问句, 等等),不同的句型以及动词的各种形式. 总之, 它们具有相当的代表性, 基本上反映了世界语语法概貌, 这就弥补了连续文句特点单一的不足, 更有利于试验EChA系统的能力和适应性. 最后作为一种尝试,还选译了两首世界语诗歌(第一首是著名的世界语者的颂歌"希望之歌").

EChA由三大部分组成: 1) 机器词典; 2) 源语分析; 3) 目标语生成. 源语分析部分包括了世界语的全部基本语法和常用句型. 然而, 由于机器条件和实验周期的限制, 本系统的规模(特别是词典的规模)还很小, 有待于进一步扩充和改进. ----准备从两方面来扩充EChA系统, 一是补充例句, 做扩大试验; 二是增加俄语和法语作为新的目标语, 进一步检验体现独立分析结果的中间语言CDC(层次递归成分体系, 第3节详述)的适应范围, 并探讨其完善的途径. 另外, 时间仓促给系统还带来一些问题:  EChA的结构还不是很合理, 算法有待于进一步优化, 规则和算法还没能分开, 在分析和综合的独立性上下了不少功夫, 但还没有完全独立.

尽管还有上述问题, 然而按照设计要求, 只要适当扩充词典, 系统就有能力处理世界语的绝大多数语言现象. 在中国近三十年的机器翻译研究历史中, EChA是第一个以世界语为研究对象的机译系统. 在世界语跟机器翻译结合的过程中, EChA是一个成功的尝试和良好的开端. 我们热切希望得到专家学者, 世界语同志们的帮助和指导.

EChA系统流程图
______丨________
/   原文输入    丨
/_________________丨
_______________________丨________________________
词               丨 1. 削尾, 查词典(实词词典, 虚词词典, 成语词典, 丨
典               丨    词类词义区分表)                                               丨
(形态分析)     丨_____________________________________________丨
-------------------  _______________________丨_________________________
句               丨 2. 连词标点, 切分, 其他虚词                                     丨
法               丨________________________________________________丨
分                _______________________丨_________________________
析               丨 3. 中间语言CDC的求解                                           丨
丨________________________________________________丨
-------------------  _______________________丨_________________________
丨 4. 多义词区分; 英语形态生成及汉语形态修辞; 查丨
目               丨      英语不规则词词表                                              丨
标               丨_______________________________________________丨
语                _______________________丨_________________________
生               丨 5. 英语调序                                                                丨
成               丨________________________________________________丨
_______________________丨_________________________
丨  6. 汉语调序及其他修辞                                            丨
丨________________________________________________丨
_________丨_________
丨     译文输出           丨
丨__________________丨

源语文句输入以后, 作第一遍扫描. 首先判定加工词长度是否大于三. 若大于三, 转子程序削尾后查实词词干词典, 否则查虚词词典. 因为世界语虚词(无词尾变化)大多短小, 以三为界限最合理, 可以大大减少虚查次数. 词典查不着的作生词处理, 削尾信息保留. 查完词典及词表以后, 把削尾信息和词典信息移到计算机内存中所开辟的句子加工场.

句法分析确定源语文句的层次结构和句法关系. 分析结果以一种高度形式化的层次递归成分体系CDC来体现. CDC是独立于目标语的机器翻译中间语言, 这种独立性对于一对多机译系统是必要的. CDC由形态, 成分, 节点, 分布, 链号和层次几部分信息构成. 它不但揭示了源语文句的正确的句法树, 而且还包含了其它的有用的信息. 事实上, 它为建立多目标语的生成系统奠定了良好的基础.

句法分析第一线处理虚词, 中心任务是加工连词和标点, 正确切分语段. 原则上为每一个虚词编制一套分析规则. 世界语虚词数量很有限, 但用法较多, 具有民族语功能词的类似的复杂性, 是语言个性的集中表现, 所以分别加工比较适宜, 这也有利于规则跟规则分开. 该线加工任务很重, 特别是连词KAJ和KE, 分析规则十分复杂. 在很大程度上, 虚词分析对了, 句法关系也就清楚了. 因此, 集中力量编制一套完备的针对具体虚词的分析系统, 对于世界语类型的机器翻译至关重要. 该线正确处理了虚词个性现象, 便可以保证下一线分析的充分抽象性和概括性, 这样做对于象世界语这样的科学而规则的语言显得特别有利. 句法分析第二线运用自顶而下的方法, 从句子的谓语轴心(第一层)着手, 一层一层往下递归加工, 直到最末层(终结节点层). 加工过程就是不断递归调用各子程序的过程. 其中以动词子程序为核心, 它充分反映了世界语语法的基本内容及其高度规则性. 分析完毕得出一条对应于源语文句的中间语言CDC的链.

综合第一线做英语形态生成和汉语形态修辞. 英语形态并不发达, 所以世英的形态转换规则也不复杂. 汉语缺乏形态, 一般用适当的虚词(助词, 副词等)来代替. 我们把多义词区分规则也放在这一线, 这是因为多义区分的条件至此已经具备. 一般来说, 根据多义词及其联系词的CDC成分和语义特征就可以得出该词的正确义项. 综合第二线和第三线分别做英语调序和汉语调序. 调序信息由CDC结合目标语语法规律得出, 调序的方法是自底而上, 层层归约, 这样就不至于调乱. 我们知道, 世界语语序极为灵活自由, 而汉语语序却很固定, 所以生成汉语的主要任务是调序. 对于英语, 调序的任务较轻, 主要是保证文句主干 "主谓宾" 次序不乱. 英语名词没有主宾格的区分, 所以关键是把前置宾语移到动词之后. "世界语是印欧语系的一个合理化的公分母", 与英语相似处毕竟很多, 比如同一句法层次的定语或状语的内部调序, 在译汉语时是一个难题, 而在印欧系诸语言中则不是大问题. 另外修辞加工的过程也可以免了. (世英转换中的成语和多义现象较之世汉转换也少得多.) 总之, 英语生成比汉语生成容易许多.

EChA虽然是个不大的系统, 但是内容比较丰富. 它既有形态分析, 又有形态生成, 也有调序和修辞, 还有自己的一套成分体系. 我们在总体设计时, 已经考虑到增加新的不同类型的目标语扩充该系统的需要. 可以预计, 如果增加两线俄语和法语的生成程序(主要是形态生成), 分析部分稍作改动(主要是充实与综合还没有完全独立开来的虚词分析规则), 就可以实现崐世到汉/英/法/俄的自动翻译. 总之, 实用机译系统所能遇到的问题, EChA几乎都已涉及, 而且主体六线程序各个有自己的特色, 是个有相当代表性的一对多全自动机译模型.

 

【相关】

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

【关于机器翻译】

【置顶:立委NLP博文一览】

《朝华午拾》总目录

PhD Thesis: Chapter VII Concluding Remarks

This chapter summarizes the research conducted in this dissertation, including its contributions as well as limitation.

7.0. Summary

The goal of this dissertation is to explore effective ways of formally approaching Chinese morpho-syntactic interface in a phrase structure grammar.  This research has led to the following results:  (i) the design of a Chinese grammar, namely CPSG95, which enables flexible coordination and interaction of morphology and syntax;  (ii) the solutions proposed in CPSG95 to a series of long-standing problems at the Chinese morpho-syntactic interface.

CPSG95 was designed in the general framework of HPSG (Pollard and Sag 1987, 1994).  The sign-based mono-stratal design from HPSG demonstrates the advantage in being capable of accommodating and accessing information of different components of a grammar.  One crucial feature of CPSG95 is its introduction of morphology expectation feature structures and the corresponding morphological PS rules into HPSG.  As a result, CPSG95 has been demonstrated to provide a favorable environment for solving morpho-syntactic interface problems.

Three types of morpho-syntactic interface problems have been studied extensively: (i) the segmentation ambiguity in Chinese word identification;  (ii) Chinese separable verbs, a borderline problem between compounding and syntax; and (iii) borderline phenomena between derivation morphology and syntax.

In the context of the CPSG95 design, the segmentation ambiguity is no longer a problem as morphology and syntax are designed system internally in the grammar to support morpho-syntactic parsing based on non-deterministic tokenization (W. Li 1997, 2000).  In other words, the design of CPSG95 itself entails an adequate solution to this long-standing problem, a problem which has been a central topic in Chinese NLP for the last two decades.  This is made possible because the access to a full grammar including both morphology and syntax is available in the integrated process of Chinese parsing and word identification while traditional word segmenters can at best access partial grammar knowledge.[1]

The second problem involves an interesting case between compounding and syntax:  different types of Chinese separable verbs demonstrate various degrees of separability in syntax while all these verbs, when used contiguously, are part of Chinese verb vocabulary.  For each type of separable verbs, arguments were presented for the proposed linguistic analysis and a solution to the problem was then formulated in CPSG95 based on the analysis.  All the proposed solutions provide a way of capturing the link between the separated use and the contiguous use of the separable verb phenomena.  They are shown to be better solutions than previous approaches in the literature which either cannot link the separated use and the contiguous use in the analysis or suffer from being not formal.

The third problem at the interface of derivation and syntax involves two issues: (i) a considerable amount of ‘quasi-affix’ data, and (ii) the intriguing case of zhe-suffixation which demonstrates an unusual combination of a phrase with a bound morpheme.  A generic analysis of Chinese derivation has been proposed in CPSG95.  This analysis has been demonstrated to be also effective in handling both quasi-affixation and zhe-affixation.

7.1. Contributions

The specific contributions are reflected in the study of the following five topics, each constituting a chapter.

On the topic of the Role of Grammar, the investigation leads to the central argument that knowledge from both morphology and syntax is required to properly handle the major types of morpho-syntactic interface problems.  This establishes the foundation for the general design of CPSG95 as consisting of morphology and syntax in one grammar formalism.

An in-depth study has been conducted in the area of the segmentation ambiguity in Chinese word identification.  The most important discovery from the study is that the disambiguation involves the analysis of the entire input string.  This means that the availability of a grammar is key to the solution of this problem.  A natural solution to this problem is the use of grammatical analysis to resolve, and/or prepare the basis for resolving, the segmentation ambiguity.

On the topic of the Design of CPSG95, a mono-stratal Chinese phrase structure grammar has been established in the spirit of the HPSG theory.  Components of a grammar such as morphology, syntax and semantics are all accommodated in distinct features of a sign.  CPSG95 is designed to provide a framework and means for formalizing the analysis of the linguistic problems at the morpho-syntactic interface.

The essential part of this work is the design of expectation feature structures.  Expectation feature structures are generalized from the HPSG feature structures for syntactic subcategorization and modification.  One characteristic of the CPSG95 structural expectation is the design of morphological expectation features to incorporate Chinese productive derivation, which covers a wide range of linguistic phenomena in Chinese word formation.

In order to meet the requirements induced by introducing morphology into the general grammar and by accommodating linguistic characteristics of Chinese, modifications from the standard HPSG are proposed in CPSG95.  The rationale and arguments for these modifications have been presented.  The design of CPSG95 is demonstrated to be a successful application of HPSG in the study of Chinese morpho-syntactic phenomena.

On the topic of Defining the Chinese Word, efforts have been made to reach a better understanding of Chinese wordhood in theory, methodology and formalization.

The theoretical inquiry follows the insight from Di Sciullo and Williams (1987) and Lü (1989).  Two notions of word, namely grammar word and vocabulary word, have been examined and distinguished.  While vocabulary word is easy to define once a lexicon is given, the object for linguistic study and generalization is actually grammar word.  Unfortunately, as there is a considerable amount of borderline phenomena between Chinese morphology and syntax, no precise definition of Chinese grammar word has been available across systems.  Therefore, an argument in favor of the system-internal wordhood definition and interface coordination within a grammar has been made.  This leads to a case-by-case approach to the analysis of specific Chinese morpho-syntactic interface problems.

On the other hand, three useful wordhood judgment methods have also been proposed as a complementary means to the case-by-case analysis.  These methods are (i) syntactic process test involving passivization and topicalization; (ii) keyword based judgment patterns for verbs, and (iii) a general expansion test named X-insertion.  These methods are demonstrated to be fairly operational and easy to apply.

In terms of formalization, a system-internal representation of word has been defined in CPSG95 feature structures.  This definition distinguishes a grammar word from both bound morphemes and syntactic constructions.  The formalization effort is necessary for the rigid study of Chinese morpho-syntactic problems and ensures the implementability of the solutions to these problems as proposed in the dissertation.

On the topic of Chinese Separable Verbs, the task is to coordinate the idiomatic nature of separable verbs and their separated uses in various syntactic patterns.

Since there are different degrees of ‘separability’ for different types of Chinese separable verbs, there is no uniform analysis which can handle all separable verbs properly.  A case-by-case study for each type of separable verbs has been conducted.  An essential part of this study is the arguments for the wordhood judgment for each type.  In the light of this judgment, CPSG95 provides formalized analyses of separable verbs which satisfy two criteria:  (i)  they all capture both structural and semantic aspects of the constructions at issue; (ii) they all provide a way of capturing the link between the separated use and contiguous use.

Finally, on the topic of Morpho-syntactic Interface Involving Derivation, a general approach to Chinese derivation has been proposed.  This approach not only enables us to handle quasi-affix phenomena, but is also flexible enough to provide an adequate treatment of the special problem in zhe-suffixation.

In the CPSG95 analysis, the affix serves as head of a derivative and can impose various constraints in the lexicon on its expected stem sign for the morphological expectation.  Coupled with only two PS rules formulated in the general grammar (Prefix PS Rule and Suffix PS Rule), it has been shown that various Chinese affixation phenomena can be captured equally well.  The PS rules ensure that all the lexical constraints be observed before the affix and the stem combine and that the output of derivation be a word.

As for the quasi-affixation problem, based on the observation that there is no fundamental structural difference between quasi-affixation and other affixation, a proper treatment of 'quasi-affixes' can be established in the same way as other affixes are handled in CPSG95; the individual difference in semantics is shown to be capturable in the lexicon.

The study of zhe-suffixation started with arguments for its analysis of VP+-zhe.  This is an unsolvable problem in any system which enforces sequential processing of morphology before syntax.  The solution which CPSG95 offers demonstrates the power of designing derivation morphology and syntax in a mono-stratal grammar.   With this novel design in modeling Chinese grammar, the CPSG95 general approach to derivation readily applies to the tough case of zhe-suffixation.  This is possible because of the ability of an affix in placing any lexicalized constraints, VP in this case, on the expected stem for morphological expectation.  In addition, the proposed lexicalized solution also captures the building of the semantic content for this morpho-syntactic borderline phenomenon.

7.2. Limitation

The major limitation of the work reported in this thesis lies in the following two aspects.

Limited by space, the thesis has only presented some sample formulation of typical affixes and quasi-affixes to demonstrate the proposed general approach to Chinese derivation morphology.  As many affixes/quasi-affixes have their distinctive semantic property, a reader who likes to experiment with this proposal in implementation still has to work out the technical details for each affix.  However, it is believed that the general strategy has been presented in sufficient details to allow for easy accommodation of individual aspects of an affix which have not been specifically addressed in the thesis.

Limited by the focus on a handful of major morpho-syntactic interface problems, the treatment of reduplication and unlisted proper names have not been listed as special topics for in-depth exploration.  They are only briefly discussed in Chapter II (Section 2.2) as cases of productive word formation for the need to involve syntax when they involve segmentation ambiguity at the boundaries.  However, they are also long-standing word identification problems which affect morpho-syntactic interface when the segmentation ambiguity is involved.  In particular, it is felt that the treatment of transliterated foreign names requires further research before a satisfactory solution can be found in the framework of CPSG95.[2]

7.3. Final Notes

This last section is used to place the research reported in this thesis in a larger context.

Chinese NLP has reached a new stage marked by the publication of Guo’s series of papers on Chinese tokenization (Guo 1997a,b,c,d, Guo 1998).  There are signs that the major research focus is being shifted from word segmentation to the grammar design and development.  In this process,  the morph-syntactic interface will remain a hot topic for quite some time to come.  The work on CPSG95 can be seen as one of the efforts in this direction.

The design of CPSG95, a formal grammar capable of representing both morphology and syntax in a uniform formalism, is one successful application of the modern linguistic theory HPSG in the area of Chinese  morpho-syntactic interface research.  However, this is by no means to claim that CPSG95 is the only or best framework to capture the morpho-syntactic problems.   This is only one approach which has been shown to be feasible and effective.  Other equally good or better approaches may exist.

In terms of future directions, constraints from semantics and discourse should be made available in the grammatical analysis.  In Chapter II (Section 2.4), we have seen problems whose ultimate solutions depend on the access to the semantic or discourse constraints.  It is believed that the sign-based mono-stratal design of CPSG95 will be extensible to accommodate these constraints.  However, this will require years of future research before they can be formally modeled and properly introduced into the grammar.

 

--------------------------

[1] As a matter of fact, the CPSG95 experiment shows that most segmentation ambiguity is resolved automatically as a by-product of morpho-syntactic parsing and the remaining ambiguity is embodied in the multiple syntactic trees as the results of the analysis.

[2] However, in the CPSG95 implementation, the problem of handling the Chinese person names, a special case of compounding, has been solved fairly satisfactorily.  The proposal is to use the surname as the head sign to expect the given name (of one or two characters) on its right to form potential full names.  As the right boundary of a person name is difficult to define without the support of sentential analysis, the conventional word segmenter frequently makes wrong segmentation in such cases.  In contrast, the approach implemented in CPSG95 is free from this problem because whether a potential name proposed by the surname ultimately survive as a proper name is decided by whether it contributes to a valid parse for the processed sentence.  In last few years, there has been rapid progress on proper name identification in the area of information extraction, called named entity tagging (MUC7 1998; Chen et al 1997).

 

BIBLIOGRAPHY

Bauer, Laurie (1988).  Introducing Linguistic Morphology.  Edinburgh:  Edinburgh University Press.

Bloomfield, Leonard (1933). Language, New York: Henry Holt & Co.

Borsley, Robert (1987).  Subjects and Complements in HPSG.   Technical report no. CSLI-107-87.  Stanford:  Center for the Study of Language and Information.

Carpenter, B. and G. Penn (1994).  ALE, The Attribute Logic Engine, User's Guide.  From http://www.sfs.nphil.uni-tuebingen.de/~gpenn/ale.html (accessed January 30, 2001).

Chao, Yuen-Ren (1968).  A Grammar of Spoken Chinese.  Berkeley:  University of California Press.

Chen, H.-H et al (1997).  Description of the NTU System used for MET-2.  Proceedings of MUC-7.  From http://perso.enst.fr/~monnier/lectures/IE/MUC7/muc_7_toc.html (accessed January 30, 2001).

Chen, K. and S. Liu (1992).  Word Identification for Mandarin Chinese Sentences.  Proceedings of 14th International Conference on Computational Linguistics (COLING’92). Nantes, France, 101-107.

Chen, M.Y. and W. S-Y. Wang (1975).  Sound Change:  Actuation and Implementation.  Language 51:2, 255-281.

Chen, Ping (1994).  “Shilun Hanyu zhong San Zhong Juzi Chengfen yu Yuyi Cheng Fen de Peiwei Yuanze” (On Mapping Principles of Relationship between Chinese Three Syntactic Constituents and Semantic Roles). Zhongguo Yuwen (Chinese Linguistics), No.3.

Chomsky, Noam (1970).  Remarks on Nominalization.  Readings in English Transformational Grammar, eds. by R. Jacobs and P. Rosenbaum, Waltham, Massachasetts:  Ginn and Company, 184-221.

Dai, John Xiang-ling (1993).  Chinese Morphology and its Interface with Syntax.  Ph.D. Dissertation, Ohio State University.

DeFrancis, John (1984).  The ChineseLanguage: Fact and Fantasy.  Honolulu:  University of Hawaii Press.

Di Sciullo, A.M. and E. Williams (1987).  On The Definition of Word.  The MIT Press, Cambridge, Massachusetts.

Ding, Shengshu (1953). “Hanyu Yufa Jianghua” (Lectures of Chinese Grammar), Zhongguo Yuwen (Chinese Linguistics), No. 3 and No. 4.

Dowty, D. (1982).  More on the Categorial Analysis of Grammatical Relations.  In A. Zaenen (Ed.), Subjects and Other Subjects:  Proceedings of the Harvard Conference on Grammatical Relations.  Bloomington:  Indiana University Linguistics Club.

Feng, Zhiwei (1996).  COLIPS Lecture Series - Chinese Natural Language Processing,  Communications of COLIPS, Vol.6, No.1, Singapore.

Gan, Kok Wee (1995).  Integrating Word Boundary Disambiguation with Sentence Understanding, Ph.D. Dissertation, National University of Singapore.

Gazdar, G., E. Klein, G.K. Pullum, and I.A. Sag (1985).  Generalized Phrase Structure Grammar.  Cambridge: Blackwell, and Cambridge, Mass.:  Harvard University Press.

Guo, Jin (1997a).  Critical tokenization and its properties.  Computational Linguistics, Vo. 23, No.4, 569-596.

Guo, Jin (1997b).  Chinese Language Modeling for Speech Recognition.  Ph.D. dissertation, Institute of Systems Science, National University of Singapore.

Guo, Jin (1997c).  A Comparative Study on Sentence Tokenization Generation Schemes.  In review for journal publication from http://sunzi.iss.nus.sg:1996/guojin/papers/ (accessed March 25, 1999).

Guo, Jin (1998).  One tokenization per source.  Proceedings of the 17th International Conference on Computational Linguistics and 36th Annual Meeting of the Association for Computational Linguistics (COLING-ACL ’98),  Montreal, Canada, 457-463.

He, K., H. Xu and B. Sun (1991).  Design Principles of an Expert System for Automatic Word Segmentation of Written Chinese Texts, Journal of Chinese Information Processing, Vol. 5, No. 2, 1-14.

Hockett, C.F. (1958).  A Course in Modern Linguistics.  New York:  Macmillan.

Hu, F. and L. Wen (1954).  “Ci de fanwei, xingtai, gongneng” (Scope, form and function of word). Zhongguo Yuwen (Chinese Linguistics), August issue.

Jackendoff, Ray (1972). Semantic Interpretation In Generative Grammar, Cambridge, Massachusetts:  MIT Press.

Jensen, John T. (1990).  Morphology:  Word Structure in Generative Grammar.  Amsterdam/Philadephia:  John Benjamins Publishing Company.

Kathol, Andreas (1999).  Agreement and the Syntax-Morphology Interface in HPSG. In Robert Levine and Georgia Green (eds.) Studies in Current Phrase Structure Grammar. Cambridge University Press, 223-274.

Kolman, B. and R.C. Busby (1987). Discrete Mathematical Structures for Computer Science, 2nd edition. Prentice-Hall, Inc.

Krieger, Hans-Ulrich (1994). Derivation without Lexical Rules,  in C.J Rupp, M. Rosner and R. Johnson (eds), Constraints, Language, and Computation.  Academic Press, 277-313.

Li, C.N. and  S.A. Thompson (1981).  Mandarin Chinese:  A Functional Grammar.  Berkeley:  University of California Press.

Li, Linding (1986).  Xiandai Hanyu Juxing (Sentence Patterns in Contemporary Mandarin), Shangwu Yinshuguan (Commercial Press), Beijing.

Li, Linding (1990).  Xiandai Hanyu Dongci (Verbs in Contemporary Mandarin), Zhongguo Shehui Kexue Chubanshe, Beijing.

Li, Qinghua (1983).  “Tan liheci de tedian he yongfa” (On the characteristics and usages of separable words).  Yuyan Jiaoxue He Yan Jiu (Language Instruction and Research), No.3.

Li, Wei (1996).  Interaction of Syntax and Semantics in Parsing Chinese Transitive Patterns.  Proceedings of International Conference on Chinese Computing (ICCC'96), Singapore.

Li, Wei (1997).  Chart Parsing Chinese Character Strings.  Proceedings of the Ninth North American Conference on Chinese Linguistics (NACCL-9), Victoria, Canada.

Li, Wei (2000). On Chinese parsing without using a separate word segmenter.  Communication of COLIPS 10 (1): 19-68.

Liang, Nanyuan (1987).  CDWS -- A Written Chinese Automatic Word Segmentation System.  Journal of Chinese Information Processing, 1(2): 44-52.

Lieber, R. (1992).  Deconstructing Morphology. Chicago: University of Chicago Press.

Lin, Handa (1983).  “Shime shi ci – xiaoyu ci de bu shi ci” (What is a word – a unit smaller than a word is not a word). Zhongguo Yuwen (Chinese Linguistics), No.34.

Lu, Jianming (1988).  “Mingci-xing ‘laixin’ shi ci haishi cizu” (Nominal laixin: word or word group).  Zhongguo Yuwen (Chinese Linguistics), No. 5.

Lu, Zhiwei (1957).  Hanyu de Goucifa (Chinese Word Formation), Kexue Chubanshe (Science Publishing House)..

Lü, Shuxiang. (1946). “Cong Zhuyu, Binyu de Fenbie Tan Guoyu Juzi de Fenxi” (On Sentence Analysis of Mandarin Chinese from the Angle of the Distinction between Subject and Object),  Kaiming Shudian Er Shi Zhounian Jiannian Wenji (Selected Works to Celebrate the 20th Anniversary of Kaiming Bookstore).

Lü, Shuxinag et al (ed.) (1980).  Xiandai Hanyu Babai Ci (800 Words in Contemporary Mandarin), Shangwu Yinshuguan (Commercial Press), Beijing.

Lü, Shuxiang (1989). “Hanyu Yufa Fenxi Wenti” (Issues on Chinese grammatical analysis),  Lü Shuxiang Zixuanji (Self-selected Works of Shuxiang Lü), Shang Hai Jiaoyu Chubanshe (Shanghai Education Publishing House), Shanghai, 93-180.

Lua, Kim Teng (1994).  Application of Information Theory Binding in Word Segmentation. Computer Processing of Chinese and Oriental Languages 8(1): 115-124.

Lyons, John (1968).  Introduction to Theoretical Linguistics.  Cambridge:  Cambridge University Press.

MUC-7 (1998).  Proceedings of the Seventh Message Understanding Conference (MUC-7).  From http://perso.enst.fr/~monnier/lectures/IE/MUC7/muc_7_toc.html (accessed January 30, 2001).

Pollard, C. and I. Sag (1987).  Information based Syntax and Semantics Vol. 1: Fundamentals.  Centre for the Study of Language  and Information, Stanford University, CA.

Pollard, C. and I. Sag (1994).  Head-Driven Phrase Structure Grammar.  The University of Chicago Press.

Riehemann, Susanne (1993). Word Formation in Lexical Type Hierarchies – A Case Study of bar-Adjectives in German. SfS-Report-02-93, University of Tübingen.

Riehemann, Susanne (1998). Type-based derivational morphology.  Journal of Comparative Germanic Linguistics 2. 49-77.

Sapir, Edward (1921).  Language:  Introduction to the Study of Speech.  NewYork:  Harcourt, Brace, and World.

Selkirk, E. (1982).  The Syntax of Words.  Cambridge:  MIT Press.

Shi, Youwei (1992).  Huhuan Rouxing – Hanyu Yufa Tanyi (A Call for Flexibility – Peculiarities of Chinese Grammar), Hunan Publishing House.

Shieber, S. (1986).  An Introduction to Unification-Based Approaches to Grammar.  Centre for the Study of Language  and Information, Stanford University, CA.

Sproat, R., C. Shih, V. Gale, and N. Chang (1996).  A Stochastic Finite-State Word-Segmentation Algorithm for Chinese.  Computational Linguistics. Vol. 22, No. 3.

Sun, L. and P. Cole (1991).  The effect of morphology on long-distance reflexives.  Journal of Chinese Linguistics 19:1, 42-62.

Sun, M. and B. T’sou (1995).  Ambiguity resolution in Chinese word segmentation.  Proceedings of the 10th Pacific Asia Conference on Language, Information and Computation (PACLIC-95), Hong Kong, 121-126.

Sun, M. and C. Huang (1996).  Word Segmentation and Part-of-Speech Tagging for Unrestricted Chinese Texts, A Tutorial at the 1996 International Conference on Chinese Computing (ICCC96), Singapore.

Thompson, S.A. (1973).  Resultative Verb Compounds in Mandarin Chinese:  A Case of Lexical Rules. Language 49:2, 361-379.

Wang, Li (1955).  ZhongguoYufa Lilun (Chinese Grammatical Theory), Zhonghua Shuju, Shanghai.

Wang, Xiaolong (1989).  Automatic Chinese Word Segmentation, in Word Separating and Mutual Translation of Syllable and Character Strings, Ph.D. Dissertation, Dept. of Computer Science and Engineering, Harbin Institute of Technology.

Webster, J. J. and C-Y Kit. (1992).  Tokenization as the Initial Phase in NLP.  Proceedings of the 14th International Conference on Computational Linguistics (COLING-92).  Nantes, France, 1106-1110.

Wu, A. and Z. Jiang (1998).  Word Segmentation in Sentence Analysis.  Proceedings of the 1998 International Conference on Chinese Information Processing.  Beijing, China, 169-180.

Wu, Dekai (1998).  A Position Statement on Chinese Segmentation.  Presented at the Chinese Language Processing Workshop, University of Pennsylvania. (Current draft at http://www.cs.ust.hk/~dekai/papers/segmentation.html, accessed January 30, 2001).

Wu, M. and K. Su (1993).  Corpus-Based Automatic Compound Extraction with Mutual Information and Relative Frequency Count.  Proceedings of R.O.C. Computational Linguistics Conference (ROCLING) VI, Taiwan, 207-216.

Xue, Ping (1991).  Syntactic Dependencies in Chinese and their Theoretical Implications.  Ph.D. dissertation, University of Victoria, Canada.

Yao, T., G. Zhang, and Y. Wu (1990).  A Rule-Based Chinese Automatic Segmentation System.  Journal of Chinese Information Processing 4(1): 37-43.

Yeh, C-L. and H-J. Lee (1991).  Rule-Based Word Identification For Mandarin Chinese Sentences -- A Unification Approach.  Computer Processing of Chinese and Oriental Languages. Vol. 5, No. 2, 97-118.

Yu, Shihong et al (1997).  Description of the Kent Ridge Digital Labs System Used for MUC-7.  Proceedings of MUC-7.  From http://perso.enst.fr/~monnier/lectures/IE/MUC7/muc_7_toc.html (accessed January 30, 2001).

Zhang, J., Z. Chen and S. Chen (1991).  A Method of Word Identification for Chinese by Constraint Satisfaction and Statistical Optimization Techniques.  Proceedings of R.O.C. Computational Linguistics Conference (ROCLING) IV, Taiwan, 147-165.

Zhang, Shoukang (1957).  “Lüetan hanyu goucifa” (A brief discussion on Chinese word formation)  Xiandai Hanyu Cankao Ziliao (Reference for Comtemporary Chinese),  ed. by Yushu Hu (1981),  Shanghai:  Shanghai Jiaoyu Chubanshe (Shanghai Education Publishing Company), 241-256.

Zhao, S. and B. Zhang (1996).  “Liheci de queding yu liheci de xingzhi” (Determination and characteristics of separable words).  Yuyan Jiaoxue he Yanjiu (Language Instruction and Research), No.1, 40-51.

Zhu, Dexi (1985).  Yufa Wenda (Questions and Answers on Chinese Grammar).  Shangwu Yinshuguan (Commercial Press), Beijing.

Zwicky, A.M. (1987). Slashes in the Passive.  Linguistics 25, 639-669.

Zwicky, A.M. (1989).  Idioms and Constructions.  Eastern States Conference on Linguistics 5, 547-558.

 

[Related]

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

PhD Thesis: Chapter VII Concluding Remarks

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

6.0. Introduction

This chapter studies some challenging problems of Chinese derivation and its interface with syntax.  These problems have been a challenge to existing word segmenters; they are also long-standing problems for Chinese grammar research.

It is observed that a good number of signs have become more and more like affixes as the Chinese language develops.  Typical, indisputable examples include signs like the nominalizer 性 ‑xing (-ness) and the prefix 第 di- (-th).  While few people doubt the existence of affixes in Contemporary Chinese, there is no general agreement on the exact number of Chinese affixes, due to a considerable number of borderline cases often referred to as ‘quasi-affixes’ (类语缀 lei yu-zhui).[1]  It will be argued that the quasi-affixes belong to morphology and are structurally not different from other affixes.  The major difference between ‘quasi-affixes’ and the few generally honored (‘genuine’) affixes lies mainly in the following aspect.  The former retain some ‘solid’ meaning while the latter are more functionalized.  However, this does not prevent CPSG95 from providing a proper treatment of quasi-affixes in the same way as it handles other affixes.  It will be shown that the difference in semantics between affixes or quasi-affixes can be accommodated fairly easily in the CPSG95 lexicon.

Based on the examination of the common property of Chinese affixes and quasi-affixes, a general approach to Chinese derivation is proposed.  This approach not only enables us to handle quasi-affix phenomena, but is also flexible enough to provide an adequate treatment of a special problem in Chinese derivation, namely zhe-suffixation.  The affix status of 者 -zhe (-er) is generally acknowledged (classified as suffix in the authoritative books like Lü et al 1980):  it attaches to a verb sign and produces a word.  The peculiar aspect of this suffix is that the verb stem which it attaches to can be syntactically expanded.  In fact, there is significant amount of evidence for the argument that this suffix expects a VP as its stem (see 6.5 for evidence).   Since a VP is only formed in syntax and derivation is within the domain of morphology, this phenomenon presents a highly challenging case on how morphology should be interfaced properly to syntax.  The solution which is offered in CPSG95 demonstrates the power of designing morphology and syntax in an integrated grammar formalism.  In contrast, in any system which enforces sequential processing of derivation morphology before syntax - most traditional systems assume this, this is an unsolvable problem.  There does not seem to be a way of enabling partial output of syntactic analysis (i.e. VP) to feed back to some derivation rule in the preprocessing stage.

In Section 6.1, the general approach to Chinese derivation is proposed first.  Following this proposal, prefixation is illustrated in 6.2 and suffixation in 6.3.  Section 6.4 shows that this general approach to derivation applies equally well to the 'quasi-affix' phenomena.  Section 6.5 investigates the suffixation of -zhe (-er).  The analysis is based on the argument that this suffixation involves the combination VP+-zhe.  The specific solution following the CPSG95 general approach will be presented based on this analysis.

6.1. General Approach to Derivation

This section examines the property of Chinese affixes and proposes a corresponding general approach to Chinese derivation.  This serves as the basis for the specific solutions to be presented in the remaining sections to various problems in Chinese derivation.

It is fairly easy to observe that in Chinese derivation it is the affix which selects the stem, not the other way round.  For example, the suffix 性 -xing (‑ness) expects an adjective to produce an (abstract) noun.   Based on the examination of the behavior of a variety of Chinese affixes or quasi-affixes, the following generalization has been reached.  That is, an affix lexically expects a sign of category x, with possible additional constraints, to form a derived word of category y.   This generalization is believed to capture the common property shared by Chinese affixes/quasi-affixes.  It seems to account for all Chinese derivational data, including typical affixation, quasi-affixation (see 6.4) and the special case of zhe-suffixation (see 6.5).  So far no counter evidence has been found to challenge this generalization.

The observation and the generalization above support the argument that in a grammar which relies on lexicalized expectation feature structures to drive the building of structures, affixes, not the stems, should be selecting heads of the morphological structures.[2]   Leaving aside the non-productive affixation,[3] the general strategy to Chinese productive derivation is proposed as follows.  In the lexicon, the affix as head of derivative is encoded with the following derivation information:  (i) what type of stem (constraints) it expects;  (ii) where to look for the expected stem, on its right or left;  (iii) what type of (derived) word it leads to (category, semantics, etc.).  Based on this lexical information, CPSG95 has two PS rules in the general grammar for derivation:  one for prefixation, one for suffixation.[4]  These rules ensure that all the constraints be observed before an affix and a stem are combined.  They also determine that the output of derivation, i.e. the mother sign, be a word.

Along this line, the key to a lexicalized treatment of Chinese derivation is to determine the structural and semantic property of the derivative and to impose proper constraints on the expected stem.  The constraints on the expected stem can be lexically specified in the morphological expectation feature [PREFIXING] or [SUFFIXING] of the affix.  The property (category, syntactic expectation, semantics, etc.) of the derivative can also be encoded directly in the lexical entry of the affix, seen as the head of a derivational structure in the CPSG95 analysis.  This property information, as part of head features, will be percolated up when the derivation rules are applied.

In the remaining part of this chapter, it will be demonstrated how this proposed general approach is applied to each specific derivation problem.

6.2. Prefixation

The purpose of this section is to present the CPSG95 solution to Chinese prefixation.  This is done by formulating a sample lexical entry for the ordinal prefix 第 di- (-th) in CPSG95.  It will be shown how the lexical information drives the prefix rule in the general grammar for the derivational combination.

Thanks to the productivity of the prefix 第 di- (-th), the ordinal numeral is always a derived word from the cardinal numeral via the following rule, informally formulated in (6-1).

(6-1.) 第 di- + cardinal numeral --> ordinal numeral

第22条军规
di-      22      tiao    jun-gui
-th     22      CLA   military-rule
the 22-nd military rule (Catch-22)

第八个是铜像
di-      ba      ge      shi     tong-xiang
-th     eight  CLA   be      bronze-statue
The eighth is the bronze statue.

The basic function of the Chinese numeral, whether cardinal or ordinal,  is to combine with a classifier, as shown in the sample sentences above.

To capture this phenomenon, CPSG95 defines two subtypes for the category numeral [num], namely the [cardinal_num] and [ordinal_num].   The lexical entries of the prefix 第 di‑ (‑th) and the cardinal numeral 五 wu (five) are formulated in (6-2) and (6-3).  The prefix encodes the lexical expectation for the derivation 第 di- + [cardinal_num] ‑‑> [ordinal_num] plus the semantic composition of the combination.  Note that the constraint @numeral inherits all common property specified for the numeral macro.

th6263

As indicated before, prefixation in CPSG95 is handled by the Prefix PS Rule based on the lexical specification.  More specifically, it is driven by the lexical expectation encoded in [PREFIXING].  The prefix rule is formulated in (6-4).

th64

Like all PS rules in CPSG95, whenever two adjacent signs satisfy all the constraints, this rule takes effect in combining them into a higher level sign in parsing.  For example, the prefix 第 di- (-th) and the sign 五 wu (five) will be combined into the sign as shown in (6-5).

th65

The combination of 第五 di+wu in (6-5) demonstrates how the morphological structure is built in the CPSG95 approach to Chinese prefixation.

6.3. Suffixation

Like prefixation, the Suffix PS Rule for suffixation is driven by the lexically encoded expectation in [SUFFIXING].  Parallel to the Prefix PS Rule, the suffix rule is formulated in (6-6).

th66

With this PS rule in hand, all that is needed is to capture the individual derivational constraint in the lexical entries of the suffixes at issue.  For example, the suffix 性 -xing (-ness) changes an adjective or verb into an abstract noun:  A/V + ‑xing  ‑‑> N.  This information is contained in the formulation of the suffix 性 –xing (-ness) in the CPSG95 lexicon, as shown in (6-7).

th67

Note that abstract nouns are uncountable, hence the call to the uncountable_noun macro to inherit the common property of uncountable nouns.[5]

Suppose the suffix 性 -xing (-ness) appears immediately after the adjective 实用 shi-yong (practical) formulated in (6-8), the suffix PS rule will combine them into a noun, as  shown in (6-9).

th6869

The combination of 实用性 shi-yong+xing in (6-9) demonstrates how the morphological structure is built in the CPSG95 approach to Chinese suffixation.

6.4. Quasi-affixes

The purpose of this section is to propose an adequate treatment of the quasi-affix phenomena in Chinese.  This is an area which has not received enough investigation in the field of Chinese NLP.  Few Chinese NLP systems demonstrate where and how to handle these quasi-affixes.

To achieve the purpose, typical examples of ‘quasi-affixes’ are presented and compared with some ‘genuine’ affixes.  The comparison highlights the general property shared by both 'quasi-affixes' and other affixes and also shows their differences.  Based on this study, it is found to be a feasible proposal to treat quasi-affixes within the derivation morphology of CPSG95.  The proposed solution will be presented by demonstrating how a typical quasi-affix is represented in CPSG95 and how the general affix rules can work with the lexical entries of 'quasi-affixes' as well.

The tables in (6-10) and (6-11) list some representative quasi-affixes in Chinese.

(6-10.)         Table for sample quasi-prefixes

prefixation examples
lei (quasi-)+N --> N 类前缀 lei-[qian-zhui]: quasi-[pre-fix]
前缀 qian (before, pre-, former-) zhui (...)
ban (semi-)+N --> N 半文盲 ban-[wen-mang]: semi-illiterate
文盲 wen (written-language), mang (blind)
dan (mono-)+N --> N 单音节 dan-[yin-jie]: mono-syllable
音节 yin (sound), jie (segment)
shuang (bi-)+N --> N 双音节 shuang-[yin-jie]: bi-syllable
duo (multi-)+N --> N 多音节 duo-[yin-jie]: multi-syllable
fei (non-)+N/A --> A 非谓 fei-wei: non-predicate
非正式 fei-[zheng-shi]: non-official
xiang (each other)+Vt (mono-syllabic) --> Vi 相爱 xiang-ai: love each other
zi (self-)+Vt --> Vi 自爱 zi-ai: self-love zi-xue-xi: self-learning
qian (former, ex-) + N
--> N
前夫人 qian-[fu-ren]: ex-wife
前总统 qian-[zong-tong]: former president

(6-11.)         Table for sample quasi-suffixes

suffixation Examples
N + shi (style) --> N 美国式 [mei-guo]-shi: American-style
NUM/N + xing (model)
--> N
1980型 1980-xing: 1980 model;
IV型 IV-xing: Model IV
A/V + (rate) --> N 准确率 [zhun-que]-lü: (percentage of) precision
NUM + liu (class) --> A 一流 yi-liu: first class
三流 san-liu: third class
N + mang ('blind', person who has little knowledge of) --> N 法盲 fa-mang:
person who has no knowledge of law
计算机盲 [ji-suan-ji]-mang: computer-layman

Compare the above quasi-affixes with the few widely acknowledged affixes like 性 -xing (-ness) and 第 di- (-th), it is fairly easy to observe that the property as generalized in Section 6.1 is shared by both affixes and quasi-affixes.  That is, in all cases of the combination, the affix or quasi-affix expects a sign of category x, with possible additional constraints, either on the right or on the left to form a derived word of category y (y may be equal to x).  For example, the quasi-prefix 自 zi- (self-) expects a transitive verb to produce an intransitive verb, etc.  This property supports the following two points of view:  (i) the affix or quasi-affix is the selecting head of the combination;  (ii) both types of combination (affixation) should be properly contained in morphology since the output is always a word (derivative).

In terms of difference, it is observed that there are different degrees of the functionalization of the meaning between quasi-affixes and other affixes.  For example, the nominalizer 性 -xing (‑ness) seems to be semantically more functionalized than the quasi-suffix 盲 -mang (blind-man, person who has little knowledge of).  In the case of 性 -xing (-ness), there is believed to be little semantic contribution from the affix.  But in cases of affixation by quasi-affixes, the semantic contribution of the affixes is non-trivial, and it must be ensured that proper semantics be built based on semantic compositionality of both the stem and the affix.

Except for the different degrees of semantic abstractness, there is no essential grammatical difference observed between quasi-affixes and the few widely accepted affixes.  As the semantic variation can be easily accommodated in the lexicon, nothing needs to be changed in the  general approach to Chinese derivation as described before.  The text below demonstrates how the quasi-affix phenomena are handled in CPSG95, using a sample quasi-affix to show the derivation.

The quasi-prefix to examine is 相 xiang- (each other).  It is used before a mono-syllabic transitive verb, making it an intransitive verb: 相 xiang- + Vt (monosyllabic) ‑‑> Vi.  More precisely, the syntactic object of the transitive verb is morphologically satisfied so that the derivative becomes an intransitive verb.

Unlike the original verb, the verb derived via xiang-prefixation requires a plural subject, as shown in (6-12).  This is a linguistically interesting phenomenon.  In a sense, it is a version of subject-predicate agreement in Chinese.

(6-12.) (a)    他们相爱过。
ta-men         xiang-         ai       guo
they            each-other   love    GUO
They used to love each other.

(b)      他爱过。
ta       ai       guo
he      love    GUO.
He used to love (someone).

(b) *   他相爱过。
ta       xiang-         ai       guo
he      each-other   love    GUO.

This number agreement can help decode the plural semantics of the subject noun as shown in the first sentence (6-13a) in the following group.  Sentence (6-13a) illustrates a common, number-underspecified case where the NP has no plural marker.  This contrasts with (6-13b) which includes a plural marker 们 men (-s), and with (6-13c) which resorts to the use of a numeral-classifier construction.

(6-13.) (a)     孩子相爱了。
hai-zi           xiang-         ai       le
child           each-other   love    LE
The children have fallen in love with each other.

(b)      孩子们相爱了。
hai-zi men   xiang-         ai       le
child  PLU   each-other   love    LE
The children have fallen in love with each other.

(c)      两个孩子相爱了。
liang ge      hai-zi           xiang-         ai       le
two    CLA   child           each-other   love    LE
The two children have fallen in love with each other.

Following the practice for number agreement in HPSG, the agreement can be captured by enforcing an additional plural constraint on the subject expectation [SUBJ | SIGN | CONTENT | INDEX | NUMBER plural], as shown in the formulation of the lexical entry for 相 xiang- (each other) in (6-14) below.

th614

As shown above, the affixation also necessitates corresponding modification of the semantics in the argument structure:  the first argument is equal to the second via index [2].[6]  Note that the notation [ ], or more accurately, the most general feature structure, is used as a place holder.  For example, HANZI <[ ]> stands for the constraint of a mono-hanzi sign.  Another thing worth noticing is that the derivative requires that a subject must appear before it.  In other words, the subject expectation becomes obligatory.  This is based on the fact that this derived verb cannot stand by itself in syntax, unlike most original verbs in Chinese, say 爱 ai (love), whose subject expectation is optional.

With the lexical entries for the quasi-affixes taking care of the differences in the building of semantics, there is no need for any modification of the CPSG95 PS rules.  For example, the prefix 相 xiang- (each other) and the verb 爱 ai (love) formulated in (6-15) will be combined into the derivative 相爱 xiang-ai (love each other) shown in (6-16) via the Prefix PS Rule.

th615616

In summary, the proposed approach to Chinese derivation is effective in handling quasi-affixes as well.  The general grammar rules for derivation remain unchanged while lexical constraints are accommodated in the lexicon.  This demonstrates the advantages of the lexicalized design for grammar development.

6.5. Suffix 者 zhe (-er)

This section analyzes zhe-suffixation, a highly challenging  case at the interface between morphology and syntax.  This is believed to be an unsolvable problem as long as a system is based on the sequential processing of derivation morphology and syntax.  The solution to be proposed in this section is based on the argument that this suffixation is a combination of VP+zhe.

The suffix 者 zhe (-er, person) is a very productive bound morpheme.   It is often compared to the English suffix ‑er or ‑or, as seen in the pairs in (6-17).

(6-17.)
工作 gong-zuo (work)      工作者 [gong-zuo]-zhe (work‑er)
劳动 lao-dong (labor)       劳动者 [lao-dong]-zhe (labor-er)
学习 xue-xi (learn)           学习者 [xue-xi]-zhe (learn-er);.

But 者 ‑zhe is not an ordinary suffix;  it belongs to the category of so-called ‘phrasal affix’,[7] with very different characteristics than the English counterpart.  Although the output of the zhe-suffixation is a word, the input is a VP, not a lexical V.  In other words, it combines with a VP and produces a lexical N:  VP+zhe --> N.   The arguments to be presented below support this analysis.

The first thing is to demonstrate the word status of zhe‑suffixation.  This is fairly straightforward:  there are no observed facts to show that the zhe-derivative is different from other lexical nouns in the syntactic distribution.  For example, like other lexical nouns, the derivative can combine with an optional classifier construction to form a noun phrase.   Compare the following pairs of examples in (6-18) and (6-19).

(6-18.) (a)    两名违反这项规定者
liang  ming [[wei-fan      zhe    xiang gui-ding]     -zhe]
two    CLA   violate         this    CLA   regulation   -er
two persons who have violated this regulation

(b)    两名学生
liang  ming xue-sheng
two    CLA   student
two students

(6-19.) (a)    他是一位优秀工作者
ta       shi     yi       wei    you-xiu        [[gong-zuo]   -zhe]
he      be      one    CLA   excellent      work           -er
He is an excellent worker.

(b)    他是一位优秀工人。
ta       shi     yi       wei    you-xiu        gong-ren
he      be      one    CLA   excellent      worker
He is an excellent worker.

The next thing is to demonstrate the phrasal nature of the ‘stem’.[8]   The stem is judged as a VP because it can be freely expanded by syntactical complements or modifiers without changing the morphological relationship between the stem and the suffix, as shown in (6‑20) below.  (6-20a) involves a modifier (努力 nu-li) before the head verb.  The verb stem in (6-20b) and (6-20c) is a transitive VP consisting of a verb and an NP object.

(6-20.) (a)    努力工作者
[nu-li  gong-zuo]     -zhe
hard  work           ‑er
hard-worker, person who works hard

(b)      学习鲁迅者
[xue-xi         Lu Xun]       -zhe
learn           Lu Xun       -er
person who is learning from Lu Xun

(c)      违反这项规定者
[wei-fan       zhe    xiang           gui-ding]      -zhe
violate         this    CLA   regulation   -er
person who violates this rule

More examples with the head verb 雇 gu (employ) are given in (6-21), with the last two expressions involving passivized VP.

(6-21.)(a)    雇者
gu-zhe
employ-er

(b)      雇人者
[gu               ren]             -zhe
employ        person         -er
those who employ people, employer/recruiter

(c)      被雇者
[bei gu]                  -zhe
[be-employed]       -er
employee

(d)      被人雇者
[bei    ren              gu]               -zhe
by      person         employ        -er
those who are employed by (other) people

In fact, the stem VP is semantically equivalent to a relative clause.   A Chinese relative clause is normally expressed in the form of a DE-phrase: VP+de+N (Xue 1991).  In other words, 者 ‑zhe embodies functions of two signs, an N (‘person’, by default) and a relative clause introducer de, something like English one that + VP (or person who + VP).[9]  Compare the two examples in (6-22) and (6-23) with the same meaning - the expression in (6-23) is more colloquial than the first in (6-22) which uses the suffix 者‑zhe.

(6-22.) 违反规定者,处以罚款。
wei-fan        gui-ding       zhe,            chu-yi                   fa-kuan
violate         regulation   one that      punish-by   fine

Those who violate the regulations will be punished by fines.

(6-23.) 违反规定的人,处以罚款。
wei-fan        gui-ding       de      ren,             chu-yi          fa-kuan
violate         regulation   DE     person         punish-by   fine
Those who violate the regulations will be punished by fines.

On further examination, it is found that VPs with attached aspect markers combine with the suffix 者 -zhe with difficulty, as seen in the following examples.

(6-24.) (a)    违反规定者
wei-fan        gui-ding       zhe
violate         regulation   -er
Those who violate the regulations

(b) ?  违反了规定者
wei-fan        le       gui-ding       zhe
violate         LE     regulation   one that

This means that some further constraint may be necessary in order to prevent the grammar from producing strings like (6-24b).  If CPSG95 is only used for parsing, such a constraint is not absolutely necessary because, in normal Chinese text, such input is almost never seen.  Since CPSG95 is intended to be procedure-neutral, for use in both parsing and generation, the further constraint is desirable.

This constraint is in fact not an isolated phenomenon in Chinese grammar.  In syntax, the constraint is commonly required when the VP is not in the predicate position.[10]  For example, when a verb, say 喜欢 xi-huan (like), or a preposition, say 为了 wei-le (in order to), subcategorizes for a VP as a complement, it actually expects a VP with no aspect markers attached.   The following pair of sentences demonstrates this point.

(6-25.) (a)    我喜欢打篮球。
wo     xi-huan       da      lan-qiu.
I         like              play   basket-ball
I like playing basket-ball.

(b) * 我喜欢打了篮球。
wo     xi-huan       da      le       lan-qiu
I         like              play   LE     basket-ball

To accommodate such common constraint requirement in both Chinese morphology and syntax, a binary feature [FINITE] is designed for Chinese verbs in CPSG95.  In the lexicon, this feature is under-specified for each Chinese verb, i.e. [FINITE bin].  When an aspect marker 了着过 le/zhe/guo combines with the verb, this feature is unified to be [FINITE plus].  We can then enforce the required constraint [FINITE minus] in the morphological expectation or syntactic expectation to prevent aspected VP from appearing in a position expecting a non-predicate un-aspected  VP.

Based on the above analysis, the lexical entry of the suffix 者 –zhe is formulated in (6-26).  Note the notation for the macro with parameter (placed in parentheses) @common_noun(名|位|个).  This macro represents the following information.  The derivative is like any other common noun, it inherits the common property;  it can combine with an optional classifier construction using the classifier 名 ming or 位  wei or 个 ge.[11]

th626

As seen, the VP expectation is realized by using the macro constraint @vp.  The semantics of the derivative is [np_semantics], an instance of -er with restriction from the event of VP, represented by [2].  The index [1] ensures that whatever is expected as a subject by the VP, which has no chances to be satisfied syntactically in this case, is semantically identical to this noun.[12]  In other words, this derived noun semantically fills an argument slot held by the subject in the VP semantics [v_content].  In the active case, say, 雇人者 [gu ren]–zhe (‘person who employs people’), the subject is the first argument, i.e. the index of this noun is the logical subject of employ.  However, when the VP is in passive, say, 被人雇者 [bei ren gu]‑zhe (‘person who is employed by other people’), the subject expected by the VP fills the second argument, i.e. the noun in this case is the logical object of the VP.  It is believed that this is the desired result for the semantic composition of zhe-derivation.

With the lexical expectation of the suffix as the basis, the general Suffix PS Rule is ready to work.  Remember that there is nothing restricting the input stem to the derivation in either of the derivation rules, formulated in (6-4) and (6-6) before.  In CPSG95, this is not considered part of the general grammar but rather a lexical property of the head affix.  It is up to the affix to decide what constraints such as category, wordhood status, semantic constraint, etc., to impose on the expected stem to produce a derivative.  In most cases of derivation, the input status of the stem is a word, but now we have an intricate case where the suffix zhe (-er) expects a verb phrase for derivation.  The general property for all cases of derivation is that regardless of the input, the output of derivation (as well as any other types of morphology) is always a word.

Before demonstrating by examples how zhe-derivation is implemented, there is a need to address the configurational constraints of CPSG95.  This is an important factor in realizing the flexible interaction between morphology and syntax as required in this case.

In all HPSG-style grammars, some type of configurational constraint is in place to ensure the proper order of rule application.  A typical constraint is that the subject rule should apply after the object rule.  This is implemented in CPSG95 by imposing the constraint in the subject PS rule that the head daughter must be a phrase and by imposing the constraint in the object PS rule that the subject of the head daughter may not be satisfied.[13]

Since derivation morphology and syntax are designed in the same framework in CPSG95, constraints are called for to ensure the ordering of rule application between morphological PS rules and syntactic PS rules as well.  In general, morphological rules apply before syntactic rules.  However, if this constraint is made absolute, to the extent that that all morphological rules must apply before all syntactic rules, we in effect make morphology and syntax two independent, successive modules, just like the case for traditional systems.  The grammar will then lose the power of flexible interaction between morphology and syntax and cannot handle cases like zhe-derivation.  However, this is not a problem in CPSG95.

The proposed constraint regulating the rule application order between morphological PS rules and syntactic PS rules is as follows.  Only when a sign has both obligatory morphological expectation and syntactic expectation will CPSG95 have constraints ensuring that the morphological rule apply first.  For example, as formulated in (6-14) before, the sign 相 xiang- (each other) has both morphological expectation in [PREFIXING] as a bound morpheme and syntactic expectation for the subject in [SUBJ] as (head of) derivative.  If the input string is 他们相爱  ta-men (they) xiang- (each other) ai (love), the prefix rule will first combine 相 xiang- (each other) and the stem 爱 ai (love) before the subject rule can apply.  The result is the expected structure embodying the results of both morphological analysis and syntactic analysis, [ta-men [xiang- ai]].  This constraint is implemented by specifying in all syntactic PS rules that the head daughter cannot have obligatory morphological expectation yet to be satisfied.  It effectively prevents a bound morpheme from being used as a constituent in syntax.   It should be emphasized that this constraint in the general grammar does not prohibit a bound morpheme from combining with any types of sign;  such constraints are only lexically decided in the expectation feature of the affix.

The following text shows step by step the CPSG95 solution to the problem of zhe-derivation.  The chosen example is the derivation for the derived noun 违法规定者 [[wei-fan gui-ding]-zhe]  ‘persons violating (the) regulation’.  The lexical sign of the suffix 者 -zhe (-er) has already been formulated in (6-26) before.  The words 违反 wei-fan (violate) and 规定 gui-ding (regulation) in the CPSG95 lexicon are shown in (6-27) and (6-28) respectively.

th627628

Note that all common nouns, specified as @common_noun, in the lexicon have the following INDEX features [PERSON 3, NUMBER number], i.e. third person with unspecified number.  As for the feature [GENDER], it is encoded in the noun itself with one of the following [male], [female], [have_gender], [no_gender] or unspecified as [gender].   The corresponding sort hierarchy is: [gender] consists of sub-sorts [no_gender] and [have_gender];  and [have_gender] is sub-typed into [male] and [female].  Of course, 规定 gui-ding (regulation) is lexically specified as [GENDER no_gender].

The following is the VP built by the object PS rule in the CPSG95 syntax.  As seen, the building of the semantics follows the practice in HPSG, with the argument slots filled by the [INDEX] feature of the subject and object.  In this VP case, [ARG2] has been realized.

th629
The VP result in (6-29) and the suffix 者 –zhe will combine into the expected derived noun via the Suffix PS Rule, as shown in (6-30).

th630

To summarize, it is the integrated model of derivational morphology and syntax in CPSG95 that makes the above analysis implementable.  Without the integration, there is no way that a suffix is allowed to expect a phrasal stem.[14]  The lexicalist approach adopted in CPSG95 facilitates the capturing of the individual feature of the phrase expectation for the few individual affixes like 者 -zhe. This enables the general PS rules for derivation in CPSG95 to be applicable to both typical cases of affixation and special cases of affixation.

6.6. Summary

This chapter has investigated some representative phenomena of Chinese derivation and their interface to syntax.  The solutions to these problems have been presented based on the arguments for the analysis.

The key to a lexicalized treatment of Chinese derivation is to determine the structural and semantic property of the derivative and to impose proper constraints on the expected stem.  The constraints on the expected stem are lexically specified in the corresponding morphological expectation feature structure of the affix.  The property of the derivative is also lexically encoded in the affix, seen as head of derivational structure in the CPSG95 analysis.  This property information will be percolated up when the derivation rules are applied.  These rules ensure that the output of derivation is a word.  It has been shown that this approach applies equally well to derivation via ‘quasi-affixes’ and the tough case of zhe-suffixation as well.

 

------------------------------------

[1] Some linguists (e.g. Li and Thompson 1981) hold the view that Chinese has only a few affixes;  others (e.g. Chao 1968) believe that the inventory of Chinese affixes should be extended to include quasi-affixes.  Interestingly, the sign lei (quasi-, original sense ‘class’) itself is a quasi-prefix in Chinese.  Phenomena similar to Chinese quasi-affixes, called ‘semi-affixes’ or ‘Affixoide’, also exist in German morphology (Riehemann 1998).

[2] This is similar to the practice in many grammars, including HPSG, that a functional sign preposition is the selecting head of the corresponding syntactic structure, namely Prepositional Phrase.

[3] Those affixes which are not or no longer productive, e.g. lao‑ (original meaning ‘old’) in lao‑hu (tiger) and lao‑shu (mouse),  are not a problem.  The corresponding derived words are simply listed in the CPSG95 lexicon.

[4] The CPSG95 phrase-structural approach to Chinese productive derivation was inspired by the implementation in HPSG of a word-syntactic approach in Krieger (1994).  Similar practice is also seen in Selkirk (1982), Riehemann (1993) and Kathol (1999) in an effort to explore alternative approaches than the lexical rule approach to morphology.

[5] The major common property is reflected in two aspects, formulated in the macro definition of uncountable_noun in CPSG95.  First, there is value setting for the [NUMBER] feature, i.e. [CONTENT|INDEX|NUMBER no_number].  The CPSG95 sort hierarchy for the type [number] is defined as {a_number, no_number} where [a_number] is further sub-typed into {singular, plural}.  [NUMBER no_number] applies to uncountable nouns while [NUMBER a_number] is used for countable noun where the plurality is yet to be decided (i.e. under-specified for plurality).  Second, based on the syntactic difference between Chinese countable nouns and uncountable nouns, the classifier expected by uncountable nouns is exclusively zhong (kind/sort of).  That is, uncountable nouns may only combine with a preceding classifier construction using the classifier zhong.

[6] For time being, the subtle difference in semantics between pairs like We love ourselves and We love each other is not represented in the content.  It requires a more elaborate system of semantics to reflect the nuance.  The elaboration of semantics is left for future research.

[7] Some linguists (e.g. Z. Lu 1957; Lü et al 1980; Lü 1989; Dai 1993) have briefly introduced the notion of ‘phrasal affix’ in Chinese.  Lü further indicates that these ‘phrasal affixes’ are a distinctive characteristic of the Chinese grammar.

[8] The English possessive morpheme ‘s is arguably a suffix which expects an NP instead of a lexical noun as its stem:  NP + -’s.  Unlike VP + -zhe, the result of this NP + -‘s combination is generally regarded as a phrase, not a word.  In this sense, ‘s seems to be closer to a functional word, similar to a preposition or postposition, than to a suffix.

[9] Chinese zhe-suffixation is somewhat like the English phenomenon of what-clause (in ‘what he likes is not what interests her’). ‘What’ in this use also embodies functions of two signs that which. But the English what-clause functions as an NP, but VP+zhe forms a lexical N.

[10] It is generally agreed in the circle of Chinese grammar research that Chinese predicate (or finite) verbs have aspect distinction, using or not using aspect markers.  This is in contrast to English where both finite and non-finite verbs have aspect distinction but only finite verbs are tensed.

[11] It is generally agreed that each Chinese common noun may only combine with a classifier construction using a specific set of classifiers.  This classifier specification is generally regarded as lexical, idiosyncratic information of nouns (Lü et al 1980).  Using the macro with the classifier parameter follows this general idea.  It is worth noticing that the lexical formulation for -zhe (-er) in CPSG95 does not rely on any specific NP analysis chosen in syntax, except that the classifier specification should be placed under the entry for nouns (or derived nouns).

[12] The proposal in building the semantics for the zhe-derivative is based on ideas similar to the assumption adopted for the complement control in HPSG that ‘the fundamental mechanism of control was coindexing between the unexpressed subject of an unsaturated complement and its controler’ (Pollard and Sag 1994:282).

[13] If the object expectation is obligatory, this constraint ensures the priority of the object rule over the subject rule in application, building the desirable structure [S [V O]] instead of [[S V] O].  This is because, a verb with obligatory object yet to be satisfied is by definition not a phrase.  If the object expectation is optional, the order of rule application is still in effect although the lexical V in this scenario does not violate the phrase definition.  There are two cases for this situation.  In case one, the object O happens to occur in the input string.  The subject PS rule will tentatively combine S and V via the subject rule, but it can go no further.  This is because the object rule cannot apply after the subject rule, due to the constraint in the object rule that the head cannot have a satisfied subject.  The successful parse will only build the expected structure [S [V O]].  In case two, the object O does not appear in the input string.  Then the tentative combination [S V] built by the subject rule becomes the final parse.

[14] For example, if the lexical rule approach were adopted for derivation, this problem could not be solved.

 

[Related]

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

PhD Thesis: Chapter VII Concluding Remarks

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

 

PhD Thesis: Chapter V Chinese Separable Verbs

 

5.0. Introduction

This chapter investigates the phenomena usually referred to as separable verbs (离合动词 lihe dongci) in the form V+X.  Separable verbs constitute a significant portion of Chinese verb vocabulary.[1]  These idiomatic combinations seem to show dual status (Z. Lu 1957; L. Li 1990).  When V+X is not separated, it is like an ordinary verb.   When V is separated from X, it seems to be more like a phrasal combination.  The co-existence of both the separated use and contiguous use for these constructions is recognized as a long-standing problem at the interface of Chinese morphology and syntax (L. Wang 1955;  Z. Lu 1957; Chao 1968; Lü 1989; Lin 1983;  Q. Li 1983; L. Li 1990; Shi 1992; Dai 1993; Zhao and Zhang 1996).

Some linguists (e.g. L. Li 1990; Zhao and Zhang 1996) have made efforts to classify different types of separable verbs and demonstrated different linguistic facts about these types.  There are two major types of separable verbs:  V+N idioms with the verb-object relation and V+A/V idioms with the verb-modifier relation - when X is A or non-conjunctive V.[2]

The V+N idiom is a typical case which demonstrates the mismatch between a vocabulary word and grammar word.  There have been three different views on whether V+N idioms are words or phrases in Chinese grammar.

Given the fact that the V and the N can be separated in usage, the most popular view (e.g. Z. Lu 1957; L. Li 1990; Shi 1992) is that they are words when V+N are contiguous and they are phrases otherwise.  This analysis fails to account for the link between the separated use and the contiguous use of the idioms.  In terms of the type of V+N idioms like 洗澡 xi zao (wash-bath: take a bath), this analysis also fails to explain why a different structural analysis should be given to this type of contiguous V+N idioms listed in the lexicon than the analysis to the also contiguous but non-listable combination of V and N (e.g. 洗碗 xi wan 'wash dishes').[3]  As will be shown in Section 5.1, the structural distribution for this type of V+N idioms and the distribution for the corresponding non-listable combinations are identical.

Other grammarians argue that V+N idioms are not phrases (Lin 1983;  Q. Li 1983; Zhao and Zhang 1996).  They insist that they are words, or a special type of words.  This argument cannot explain the demonstrated variety of separated uses.

There are scholars (e.g. Lü 1989; Dai 1993) who indicate that idioms like 洗澡 xi zao are phrases.  Their judgment is based on their observation of the linguistic variations demonstrated by such idioms.  But they have not given detailed formal analyses which account for the difference between these V+N idioms and the non-listable V+NP constructions in the semantic compositionality.  That seems to be the major reason why this insightful argument has not convinced people with different views.

As for V+A/V idioms, Lü (1989) offers a theory that these idioms are words and the insertable signs between V and A/V are Chinese infixes.  This is an insightful hypothesis.  But as in the case of the analyses proposed for V+N idioms, no formal solutions have been proposed based on the analyses in the context of phrase structure grammars.  As a general goal, a good solution should not only be implementable, but also offer an analysis which captures the linguistic link, both structural and semantic, between the separated use and the contiguous use of separable verbs.  It is felt that there is still a distance between the proposed analyses reported in literature and achieving this goal of formally capturing the linguistic generality.

Three types of V+X idioms can be classified based on their different degrees of 'separability' between V and X, to be explored in three major sections of this chapter.  Section 5.1 studies the first type of V+N idioms like 洗澡 xi zao (wash-bath: take a bath).  These idioms are freely separable.  It is a relatively easy case.  Section 5.2 investigates the second type of the V+N idioms represented by 伤心 shang xin (hurt-heart: sad or heartbroken).  These idioms are less separable.  This category constitutes the largest part of the V+N phenomena.  It is a more difficult borderline case.  Section 5.3 studies the V+A/V idioms.  These idioms are least separable:  only the two modal signs 得 de3 (can) and 不 bu (cannot) can be inserted inside them, and nothing else.  For all these problems, arguments for the wordhood judgment will be presented first.  A corresponding morphological or syntactic analysis will be proposed, together with the formulation of the solution in CPSG95 based on the given analysis.

5.1. Verb-object Idioms: V+N I

The purpose of this section is to analyze the first type of V+N idioms, represented by 洗澡 xi zao (wash‑bath: take a bath).  The basic arguments to be presented are that they are verb phrases in Chinese syntax and the relationship between the V and the N is syntactic.  Based on these arguments, formal solutions to the problems involved in this construction will be presented.

The idioms like 洗澡 xi zao are classified as V+N I, to be distinguished from another type of idioms V+N II (see 5.2).  The following is a sample list of this type of idioms.

(5-1.) V+N I: xi zao type

洗澡 xi (wash) zao (bath #)              take a bath
擦澡 ca (scrub) zao (bath #)             clean one's body by scrubbing
吃亏 chi (eat) kui (loss #)                   get the worst
走路 zou (go) lu (way $)                      walk
吃饭 chi (eat) fan (rice $)                    have a meal
睡觉 shui (V:sleep) jiao (N:sleep #)   sleep
做梦 zuo (make) meng (N:dream)     dream (a dream)
吵架  chao (quarrel) jia (N:fight #)    quarrel (or have a row)
打仗 da (beat) zhang (battle)              fight a battle
上当 shang (get) dang (cheating #)                be taken in
拆台 chai (pull down) tai (platform #)          pull away a prop
见面 jian (see) mian (face #)                            meet (face to face)
磕头 ke (knock) tou (head)                              kowtow
带头 dai (lead) tou (head $)                            take the lead
帮忙 bang (help) mang (business #)              give a hand
告状 gao (sue) zhuang (complaint #)            lodge a complaint

Note: Many nouns (marked with # or $) in this type of constructions cannot be used independently of the corresponding V.[4]  But those with the mark $ have no such restriction in their literal sense.  For example, when the sign fan  means 'meal', as it does in the idiom, it cannot be used in a context other than the idiom chi-fan (have a meal).  Only when it stands for the literal meaning ‘rice’, it does not have to co-occur with  chi.

There is ample evidence for the phrasal status of the combinations like 洗澡 xi zao.  The evidence is of three types.  The first comes from the free insertion of some syntactic constituent X between the idioms in the form V+X+N: this involves keyword-based judgment patterns and other X‑insertion tests proposed in Chapter IV.  The second type of evidence resorts to some syntactic processes for the transitive VP, namely passivization and long-distance topicalization.  The V+N I idioms can be topicalized and passivized in the same way as ordinary transitive VP structures do.  The last piece of evidence comes from the reduplication process associated with this type of idiom.   All the evidence leads to the conclusion that V+N I idioms are syntactic in nature.

The first evidence comes from using the wordhood judgment pattern: V(X)+zhe/guo à word(X).  It is a well observed syntactic fact that Chinese aspectual markers appear right after a lexical verb (and before the direct object).  If 洗澡 xi zao were a lexical verb, the aspectual markers would appear after the combinations, not inside them.  But that is not the case, shown by the ungrammaticality of the example in (5-2b).  A productive transitive VP example is given in (5-3) to show its syntactic similarity (parallelness) with V+N I idioms.

(5-2.) (a)      他正在洗着澡
ta       zheng-zai    xi      zhe    zao.
he      right-now    wash ZHE   bath
He is taking a bath right now.

(b) *   他正在洗澡着。
ta       zheng-zai    xi-zao         zhe.
he      right-now    wash-bath   ZHE

(5-3.) (a)      他正在洗着衣服。
ta       zheng-zai    xi      zhe    yi-fu.
he      right-now    wash ZHE   clothes
He is washing the clothes right now.

(b) *   他正在洗衣服着。
ta       zheng-zai    xi      yi-fu           zhe.
he      right-now    wash clothes        ZHE

The above examples show that the aspectual marker 着 zhe (ZHE) should be inserted in the V+N idiom, just as it does in an ordinary transitive VP structure.

Further evidence for X-insertion is given below.   This comes from the post-verbal modifier of ‘action-times’ (动量补语 dongliang buyu) like 'once', 'twice', etc.  In Chinese, action-times modifiers appear after the lexical verb and aspectual marker (but before the object), as shown in (5-4a) and (5-5a).

(5-4.) (a)      他洗了两次澡。
ta       xi      le       liang  ci       zao.
he      wash LE     two    time   bath
He has taken a bath twice.

(b) *   他洗澡了两次。
ta       xi-zao         le       liang  ci.
he      wash-bath   LE     two    time

(5-5.) (a)      他洗了两次衣服。
ta       xi      le       liang  ci       yi-fu.
he      wash LE     two    time   clothes
He has washed the clothes twice.

(b) *   他洗衣服了两次。
ta       xi      yi-fu           le       liang  ci.
he      wash clothes        LE     two    time

So far, evidence has been provided of syntactic constituents which are attached to the verb in the V+N I idioms.  To further argue for the VP status of the whole idiom, it will be demonstrated that the N in the V+N I idioms in fact fills the syntactic NP position in the same way as all other objects do in Chinese transitive VP structures.  In fact, N in the V+N I does not have to be a bare N:  it can be legitimately expanded to a full-fledged NP (although it does not normally do so).  A full-fledged NP in Chinese typically consists of a classifier phrase (and modifiers like de-construction) before the noun.  Compare the following pair of examples.  Just like an ordinary NP 一件崭新的衣服 yi jian zan-xin de yi-fu (one piece of brand-new clothes), 一个痛快的澡 yi ge tong-kuai de zao (a comfortable bath) is a full-fledged NP.

(5-6.)           他洗了一个痛快的澡。
ta       xi      le       yi       ge      tong-kuai     de      zao.
he      wash LE     one    CLA   comfortable DE     bath
He has taken a comfortable bath.

(5-7.)           他洗了一件崭新的衣服。
ta       xi      le       yi       jian    zan-xin        de      yi-fu.
he      wash LE     one    CLA   brand-new  DE     clothes
He has washed one piece of brand-new clothes.

It requires attention that the above evidence is directly against the following widespread view, i.e. signs like 澡 zao, marked with # in (5-1), are 'bound morphemes' or ‘bound stems’ (e.g. L. Li 1990; Zhao and Zhang 1996).  As shown, like every other free morpheme noun (e.g. yi-fu), zao holds a lexical position in the typical Chinese NP sequence 'determiner + classifier + (de-construction) + N', e.g. 一个澡 yi ge zao (a bath), 一个痛快的澡 yi ge tong-kuai de zao (a comfortable bath).[5]  In fact, as long as the ‘V+N I phrase’ arguments are accepted (further evidence to come), by definition ‘bound morpheme’ is a misnomer for 澡 zao.  As a part of morphology, a bound morpheme cannot play a syntactic role:  it is inside a word and cannot be seen in syntax.  The analysis of 洗xi (...) zao as a phrase entails the syntactic roles played by 澡 zao:  (i) 澡 zao is a free morpheme noun which fills the lexical position as the final N inside the possibly full-fledged NP;  (ii) 澡zao plays the object role in the syntactic transitive structure 洗澡xi zao.

This bound morpheme view is an argument used for demonstrating  the relevant V+N idioms to be words rather than phrases (e.g. L. Li 1990).  Further examination of this widely accepted view will help to strengthen the counter-arguments that all V+N I idioms are phrases.

Labeling signs like 澡zao (bath) as bound morphemes seem to come from an inappropriate interpretation of the statement that bound morphemes cannot be ‘freely’, or ‘independently’, used in syntax.[6]  This interpretation places an equal sign between the idiomatic co-occurrence constraint and ‘not being freely used’.  It is true that 澡zao is not an ordinary noun to be used in isolation.  There is a co-occurrence constraint in effect:  澡zao cannot be used without the appearance of 洗xi (or 擦ca).  However, the syntactic role played by 澡zao, the object in the syntactic VP structure, has full potential of being ‘freely’ used as any other Chinese NP object:   it can even be placed before the verb in long-distance constructions as shall be shown shortly.  A more proper interpretation of ‘not being freely used’ in terms of defining bound morphemes should be that a genuine bound morpheme, e.g. the suffix 性 -xing ‘-ness’, has to attach to another sign contiguously to form a word.

A comparison with similar phenomena in English may be helpful.  English also has similar idiomatic VPs, such as kick the bucket.[7]  For the same reason, it cannot be concluded that bucket (or the bucket) is a bound morpheme only because it demonstrates necessary co-occurrence with the verb literal kick.  Signs like bucket, zao (bath) are not of the same nature as bound morphemes like –less, -ly, un-, ‑xing (-ness), etc

The second type of evidence shows some pattern variations for the V+N I idioms.  These variations are typical syntactic patterns for the transitive V+NP structure in Chinese.  One of most frequently used patterns for transitive structures is the topical pattern of long distance dependency.  This provides strong evidence for judging the V+N I idioms as syntactic rather than morphological.  For, with the exception of clitics, morphological theories in general conceive of the parts of a word as being contiguous.[8]  Both the V+N I idiom and the normal V+NP structure can be topicalized, as shown in (5-8b) and (5-9b) below.

(5-8.) (a)      我认为他应该洗澡。
wo     ren-wei        ta       ying-gai       xi zao.
I         think           he      should        wash-bath
I think that he should take a bath.

(b)      澡我认为他应该洗
zao    wo     ren-wei        ta       ying-gai       xi.
bath  I         think           he      should        wash
The bath I think that he should take.

(5-9.) (a)       我认为他应该洗衣服。
wo     ren-wei        ta       ying-gai       xi      yi-fu.
I         think           he      should        wash clothes
I think that he should wash the clothes.

(b)      衣服我认为他应该洗。
yi-fu           wo     ren-wei        ta       ying-gai       xi.
clothes        I         think           he      should        wash
The clothes I think that he should wash.

The minimal pair of passive sentences in (5-10) and (5‑11) further demonstrates the syntactic nature of the V+N I structure.

(5-10.)         澡洗得很干净。
zao             xi      de3    hen    gan-jing.
bath            wash DE3   very   clean
A good bath was taken so that one was very clean.

(5-11.)         衣服洗得很干净。
yi-fu           xi      de3    hen    gan-jing.
clothes        wash DE3   very   clean
The clothes were washed clean.

The third type of evidence involves the nature of reduplication associated with such idioms.  For idioms like 洗澡 xi zao (take a bath), the first sign can be reduplicated to denote the shortness of the action:  洗澡 xi zao (take a bath) --> 洗洗澡 xi xi zao (take a short bath).  If 洗澡 xi zao is a word, by definition, 洗xi is a morpheme inside the word and 洗洗澡 xi-xi-zao belongs to morphological reduplication (AB-->AAB type).  However, this analysis fails to account for the generality of such reduplication:  it is a general rule in Chinese grammar that a verb reduplicates itself contiguously to denote the shortness of the action.  For example, 听音乐 ting (listen to) yin-yue (music) --> 听听音乐 ting ting yin-yue (listen to music for a while); 休息 xiu-xi (rest) --> 休息休息 xiu-xi xiu-xi (have a short rest), etc.  On the other hand, when we accept that 洗澡 xi zao is a verb-object phrase in syntax and the nature of this reduplication is accordingly judged as syntactic,[9] we come to a satisfactory and unified account for all the related data.  As a result, only one reduplication rule is required in CPSG95 to capture the general phenomena;[10]  there is no need to do anything special for V+N  idioms.

This AB ‑‑> AAB type reduplication problem for the V+N idioms poses a big challenge to traditional word segmenters (Sun and Huang 1996).  Moreover, even when a word segmenter successfully incorporates some procedure to cope with this problem, the essentially same rule has to be repeated in the grammar for the general VV reduplication.  This is not desirable in terms of capturing the linguistic generality.

All the evidence presented above indicates that idioms like 洗澡xi zao, no matter whether V and N are used contiguously or not, are not words, but phrases.  The idiomatic nature of such combinations seems to be the reason why most native speakers, including some linguists, regard them as words.  Lü (1989: 113-114) suggests that vocabulary words  like 洗澡 xi zao should be distinguished from grammar words.  He was one of the first Chinese grammarians who found that the V+N relation in the idioms like 洗澡 xi zao is a syntactic verb object relation.  But he did not provide full arguments for his view, neither did he offer a precise formalized analysis of this problem.[11]

As shown in the previous examples, the V+N I idioms do not differ from other transitive verb phrases in all major syntactic behaviors.   However, due to their idiomatic nature, the V+N I idioms are different from ordinary transitive VPs in the following two major aspects.  These differences need to be kept in mind when formulating the grammar to capture the phenomena.

  • Semantics:  the semantics of the idiom should be given directly in the lexicon, not as a result of the computation of the semantics of the parts based on some general principle of compositionality.
  • Co-occurrence requirement:  洗 xi (or 擦 ca) and 澡 zao must co-occur with each other;  走 zou (go) and 路 lu (way) must co-occur; etc.  This is a requirement specific to the idioms at issue.  For example, 洗 xi and 澡 zao must co-occur in order to stand as an idiom to mean ‘take a bath’.

Based on the study above, the CPSG95 solution to this problem is described below.  In order to enforce the co-occurrence of the V+N I idioms, it is specified in the CPSG95 lexicon that the head V obligatorily expects as its object an NP headed by a specific literal.  This treatment originates from the practice of handling collocations in HPSG.  In HPSG, there are features designed to enable the subcategorization for particular words, or phrases headed by particular words.  For example, the feature [NFORM there] and [NFORM it] refer to the expletive there and it respectively for the special treatment of existential constructions, cleft constructions, etc. (Pollard and Sag 1987:62).  The values of the feature PFORM distinguish individual prepositions like for, on, etc.  They are used in phrasal verbs like rely on NP, look for NP, etc.  In CPSG95, this approach is being generalized, as described below.

As presented before, the feature for orthography [HANZI] records the Chinese character string for each lexical sign.  When a specific lexical literal is required in an idiomatic expectation, the constraint is directly placed on the value of the feature [HANZI] of the expected sign, in addition to possible other constraints.  It is standard practice in a lexicalized grammar that the expected complement (object) for the transitive structure be coded directly in the entry of the head V in the lexicon.  Usually, the expected sign is just an ordinary NP.  In the idiomatic VP like 洗 xi (...) 澡 zao, one further constraint is placed:  the expected NP must be headed by the literal character 澡zao.  This treatment ensures that all pattern variations for transitive VP such as passive constructions, topicalized constructions, etc. in Chinese syntax will equally apply to the V+N I idioms.[12]

The difference in semantics is accommodated in the feature [CONTENT] of the head V with proper co-indexing.  In ordinary cases like 洗衣服 xi yi-fu (wash clothes), the argument structure is [vt_semantics] which requires two arguments, with the role [ARG2] filled by the semantics of the object NP.  In the idiomatic case 洗澡 xi zao (take a bath), the V and N form a semantic whole, coded as [RELN take_bath].[13]  The V+N I idioms are formulated like intransitive verbs in terms of composing the semantics - hence coded as [vi_semantics], with only one argument to be co-indexed with the subject NP.  Note that there are two lexical entries in the lexicon for the verb 洗 xi (wash), one for the ordinary use and the other for the idiom, shown in (5-12) and (5-13).

th000

The above solution takes care of the syntactic similarity of the
V+N I idioms and ordinary V+NP structures.  It is also detailed enough to address their major differences.  In addition, the associated reduplication process (i.e. V+N --> V+V+N) is no longer a problem once this solution is adopted.  As the V in the V+N idioms is judged and coded as a lexical V (word) in this proposal, the reduplication rule which handles V --> VV will equally apply here.

5.2. Verb-object Idioms: V+N II

The purpose of this section is to provide an analysis of another type of V+N idiom and present the solution implemented in CPSG95 based on the analysis.

Examples like 洗澡 xi zao (take a bath) are in fact easy cases to judge.   There are more marginal cases.  When discussing Chinese verb-object idioms, L. Li (1990) and Shi (1992) indicate that the boundary between a word and a phrase in Chinese is far from clear-cut.  There is a remarkable “gray area” in between.  Examples in (5-14) are V+N II idioms, in contrast to the V+N I type, classified by L. Li (1990).

(5-14.) V+N II: 伤心 shang xin type

伤心 shang (hurt) xin (heart)             sad or break one's heart
担心 dan (carry) xin (heart)               worry
留神 liu (pay) shen (attention)           pay attention to
冒险 mao (take) xian (risk)                 take the risk
借光 jie (borrow) guang (light)           benefit from
劳驾 lao (bother) jia (vehicle)             beg the pardon
革命 ge (change) ming (life)                 make revolution
落后 luo (lag) hou (back)                      lag behind
放手 fang (release) shou (hand)          release one's hold

Compared with V+N I (洗澡xi zao type), V+N II has more characteristics of a word.  The lists below given by L. Li (1990) contrast their respective characteristics.[14]

(5-15.) V+N I (based on L. Li 1990:115-116)

as a word

V-N

(a1) corresponds to one generalized sense (concept)

(a2) usually contains ‘bound morpheme(s)’

as a phrase

V X N

 

(b1) may insert an aspectual particle (X=le/zhe/guo)

(b2) may insert all types of post-verbal modifiers (X=BUYU)

(b3) may insert a pre-nominal modifier de-construction (X=DEP)

(5-16.) V+N II (based on L. Li 1990:115)

as a word

 

V-N X

(a1) corresponds to one generalized sense (concept)

(a2) usually contains ‘bound morpheme(s)’

(a3) (some) may be followed by an aspectual particle (X=le/zhe/guo)

(a4) (some) may be followed by a post-verbal modifier
of duration or number of times (X=BUYU)

(a5) (some) may take an object (X=BINYU)

as a phrase

 

V X N

(b1) may insert an aspectual particle (X=le/zhe/guo)

(b2) may insert all types of post-verbal modifiers (X=BUYU)

(b3) may insert a pre-nominal modifier de-construction (X=DEP)

For V+N I, the previous text has already given detailed analysis and evidence and decided that such idioms are phrases, not words.  This position is not affected by the demonstrated features (a1) and (a2) in (5‑15);  as argued before,  (a1) and (a2) do not contribute to the definition of a grammar word.

However, (a3), (a4) and (a5) are all syntactic evidence showing that V+N II idioms can be inserted in lexical positions.   On the other hand, these idioms also show the similarity with V+N I idioms in the features (b1), (b2) and (b3) as a phrase.  In particular, (a3) versus (b1) and (a4) versus (b2) demonstrate a 'minimal pair' of phrase features and word features.  The following is such a minimal pair example (with the same meaning as well) based on the feature pairs (a3) versus (b1), with a post-verbal modifier 透tou (thorough) and aspectual particle 了le (LE).  It demonstrates the borderline status of such idioms.  As before, a similar example of an ordinary transitive VP is also given below for comparison.

(5-17.)         V+N II: word or phrase?

伤心:sad; heart-broken
shang          xin
hurt            heart

(a)      我伤心透了
wo     shang-xin  tou              le.
I         sad              thorough     LE
I was extremely sad.

(b)      我伤透了心
wo     shang         tou              le       xin.
I         break          thorough     LE     heart
I was extremely sad.

(5-18.)         Ordinary V+NP phrase: 恨hen (hate) 他ta (he)

(a) *   我恨他透了
wo     hen   ta      tou              le.
I         hate   he      thorough     LE

(b)      我恨透了他
wo     hen   tou              le       ta.
I         hate   thorough     LE     he
I thoroughly hate him.

As shown in (5-18), in the common V+NP structure, the post-verbal modifier 透 tou (thorough) and the aspectual particle 了 le (perfect aspect) can only occur between the lexical V and NP.  But in many V+N II idioms, they may occur either after the V+N combination or in between.  In (5‑17a), 伤心 shang xin is in the lexical position because Chinese syntax requires that the post-verbal modifier attach to the lexical V, not to a VP as indicated in (5-18a).  Following the same argument, 伤 shang (hurt) alone in (5-17b) must be a lexical V as well.  The sign 心 xin (heart) in (5‑17b) establishes itself in syntax as object of the V, playing the same role as 他ta (he) in (5-18b).  These facts show clearly that V+N II idioms can be used both as lexical verbs and as transitive verb phrases.   In other words, before entering a context, while still in the lexicon, one can not rule out either possibility.

However, there is a clear cut condition for distinguishing its use as a word and its use as a phrase once a V+N II idiom is placed in a context.   It is observed that the only time a V+N II idiom assumes the lexical status is when V and N are contiguous.  In all other cases, i.e. when V and N are not contiguous, they behave essentially similar to the V+N I type.

In addition to the examples in (5-17) above, two more examples are given below to demonstrate the separated phrasal use of V+N II.  The first is the case V+X+N where X is a possessive modifier attached to the head N.  Note also the post-verbal position of 透 tou (thorough) and 了le (LE).  The second is an example of passivization when N occurs before V.  These examples provide strong evidence for the syntactic nature of V+N II idioms when V and N are not used contiguously.

(5-19.) (a) *   你伤他的心透了
ni       shang         ta       de      xin    tou              le.
you    hurt            she    DE     heart thorough     LE

(b)      你伤透了他的心
ni       shang         tou              le       ta       de      xin.
you    hurt            thorough     LE     she    DE     heart
You broke her heart.

(5-20.)         V+N II: instance of passive with or without 被 bei (BEI)

心(被)伤透了
xin    (bei)   shang         tou              le.
heart BEI    break          thorough     LE
The heart was completely broken.
or: (Someone) was extremely sad.

Based on the above investigation, it is proposed in CPSG95 that two distinct entries be constructed for each such idiom, one as an inseparable lexical V, and the other as a transitive VP just like that of V+N I.  Each entry covers its own part of the phenomena.  In order to capture the semantic link between the two entries, a lexical rule called V_N_II Rule is formulated in CPSG95, shown in (5-21).

th001

The input to the V_N_II Lexical Rule is an entry with [CATEGORY v_n_ii] where [v_n_ii] is a given sub-category in the lexicon for V+N II type verbs.  The output is another entry with the same information except for three features [HANZI], [CATEGROY] and [COMP1_RIGHT].  The new value for [HANZI] is a list concatenating the old [HANZI] and the [HANZI] for the expected [COMP1_RIGHT].  The new [CATEGORY] value is simply [v].  The value for [COMP1_RIGHT] becomes [null].  The outline of the two entries captured by this lexical rule are shown in (5-22) and (5-23).

th002

It needs to be pointed out that the definition of [CATEGORY v_n_ii] in CPSG95 is narrower than L. Li’s definition of V+N II type idioms.  As indicated by L. Li (1990), not all V+N II idioms share the same set of lexical features (a3), (a4) and (a5) as a word.  The definition in CPSG95 does not include the idioms which share the lexical feature (a5), i.e. taking a syntactic object.  These are idioms like 担心dan-xin (carry-heart: worry about).  For such idioms, when they are used as inseparable compound words, they can take a syntactic object.  This is not possible for all other V+N idioms, as shown below.

(5-24.) (a)     她很担心你
ta       hen    dan-xin                ni.
he      very   worry (about)        you
He is very concerned about you.

(b) *   他很伤心你
ta       hen    shang-xin            ni.
he      very   sad                       you

In addition, these idioms do not demonstrate the full distributional potential of transitive VP constructions.  The separated uses of these idioms are far more limited than other V+N idioms.  For example, they can hardly be passivized or topicalized as other V+N idioms can, as shown by the following minimal pair of passive constructions.

(5-25.)(a) *   心(被)担透了
xin    (bei)   dan             tou              le.
heart BEI    carry           thorough     LE

(b)      心(被)伤透了
xin    (bei)   shang         tou              le.
heart BEI    break          thorough     LE
The heart was completely broken.
or: (Someone) was extremely sad.

In fact, the separated use ('phrasal use') for such V+N idioms seems only limited to some type of X-insertion, typically the appearance of aspect signs between V and N.[15]  Such separated use is the only thing shared by all V+N idioms, as shown below.

(5-26.)(a)     他担过心
ta       dan             guo    xin
he      carry           GUO  heart
He (once) was worried.

(b)      他伤过心
ta       shang         guo    xin
he      break          GUO  heart
He (once) was heart-broken.

To summarize,  the V+N idioms like 担心 dan-xin which can take a syntactic object do not share sufficient generality with other V+N II idioms for a lexical rule to capture.  Therefore, such idioms are excluded from the [CATEGORY v_n_ii] type.  This makes these idioms not subject to the lexical rule proposed above.  It is left for future research to answer the question whether there is enough generality among this set of idioms to justify some general approach to this problem, say, another lexical rule or some other ways of generalization of the phenomena.  For time being, CPSG95 simply lists both the contiguous and separated uses of these idioms in the lexicon.[16]

It is worth noticing that leaving such idioms aside, this lexical rule still covers large parts of V+N II phenomena.  The idioms like 担心dan-xin only form a very small set which are in the state of transition to words per se (from the angle of language development) but which still retain some (but not complete) characteristics of a phrase.[17]

5.3. Verb-modifier Idioms: V+A/V

This section investigates the V+X idioms in the form of V+A/V.  The data for the interaction of V+A/V idioms and the modal insertion are presented first.  The subsequent text will argue for Lü's infix hypothesis for the modal insertion and accordingly propose a lexical rule to capture the idioms with or without modal insertion.

The following is a sample list of V+A/V idioms, represented by kan jian (look-see: have seen).

(5-27.) V+A/V: kan jian type

看见 kan (look) jian (see)                    have seen
看穿 kan (look) chuan (through)        see through
离开 li (leave) kai (off)                         leave
打倒 da (beat) dao (fall)                      down with
打败 da (beat) bai (fail)                       defeat
打赢 da (beat) ying (win)                    fight and win
睡着 shui (sleep) zhao (asleep)            fall asleep
进来 jin (enter) lai (come)                             enter
走开 zou (go) kai (off)                         go away
关上  guan (close) shang (up)             close

In the V+A/V idiom kan jian (have-seen), the first sign kan (look) is the head of the combination while the second jian (see) denotes the result.  So when we say, wo (I) kan-jian (see) ta (he), even without the aspectual marker le (LE) or guo (GUO), we know that it is a completed action:  'I have seen him' or 'I saw him'.[18]

Idioms like kan-jian (have-seen) function just as a lexical whole (transitive verb).  When there is an aspect marker, it is attached immediately after the idioms as shown in (5‑28).  This is strong evidence for judging V+A/V idioms as words, not as syntactic constructions.

(5-28.)         我看见了他
wo     kan jian     le       ta.
I         look-see       LE     he                   I have seen him.

The only observed separated use is that such idioms allow for two modal signs 得 de3 (can) and 不 bu (cannot) in between, shown by (5-29a) and (5-29b).  But no other signs, operations or processes can enter the internal structure of these idioms.

(5-29.) (a)     我看不见他
wo     kan bu jian         ta.
I         look cannot see     he
I cannot see him.

(c)      你看得见他吗?
ni       kan de3 jian       ta       me?
you    look can see          he      ME
Can you see him?

Note that English modal verbs ‘can’ and ‘cannot’ are used to translate these two modal signs.  In fact, Contemporary Mandarin also has corresponding modal verbs (能愿动词 neng-yuan dong-ci):  能 neng (can) and 不能 bu neng (cannot).  The major difference between Chinese modal verbs 能 neng / 不能 bu neng and the modal signs 得 de3 / 不 bu lies in their different distribution in syntax.  The use of modal signs 得 de3 (can) and 不 bu (cannot) is extremely restrictive:  they have to be inserted into V+BUYU combinations.  But Chinese modal verbs can be used before any VP structures.  It is interesting to see the cases when they are used together in one sentence, as shown in (5-30 a+b) below.  Note that the meaning difference between the two types of modal signs is subtle, as shown in the examples.

(5-30.)(a)     你看得见他吗?
ni       kan de3 jian         ta       me?
you    look can see          he      ME
Can you see him? (Is your eye-sight good enough?)

(b)      你能看见他吗?
ni       neng kan jian      ta       me?
you    can    see              he      ME
Can you see him?
(Note: This is used in more general sense. It covers (a) and more.)

(a+b)  你能看得见他吗?
ni       neng kan de3 jian         ta       me?
you    can    look can see          he      ME
Can you see him? (Is your eye-sight good enough?)

(5-31.)(a)     我看不见他
wo     kan bu jian           ta
I         look cannot see     he
I cannot see him. (My eye-sight is too poor.)

(b)      我不能看见他
wo     bu     neng kan jian      ta
I         not    can    see              he
I cannot see him. (Otherwise, I will go crazy.)

(a+b) 我不能看不见他
wo     bu     neng kan bu jian           ta.
I         not    can    look cannot see     he
I cannot stand not being able to see him.
(I have to keep him always within the reach of my sight.)

Lü (1989:127) indicates that the modal signs are in fact the only two infixes in Contemporary Chinese.  Following this infix hypothesis, there is a good account for all the data above.  In other words, the V+A/V idioms are V+BUYU compound words subject to the modal infixation.  The phenomena of 看得见 kan-de3-jian (can see) and 看不见 kan-bu-jian (cannot see) are therefore morphological by nature.  But Lü did not offer formal analysis for these idioms.

Thompson (1973) first proposed a lexical rule to derive the potential forms V+de3/bu+A/V from the V+A/V idioms.  The lexical rule approach seems to be most suitable for capturing the regularity of the V+A/V idioms and their infixation variants V+de3/bu+A/V.  The  approach taken in CPSG95 is similar to Thompson’s proposal.  More precisely, two lexical rules are formulated in CPSG95 to handle the infixation in V+A/V idioms.  This way, CPSG95 simply lists all V+A/V idioms in the lexicon as V+A/V type compound words, coded as [CATEGORY v_buyu].[19]  Such entries cover all the contiguous uses of the idioms.  It is up to the two lexical rules to produce two infixed entries to cover the separated uses of the idioms.

The change of the infixed entries from the original entry lies in the semantic contribution of the modal signs.  This is captured in the lexical rules in (5-32) and (5-33).  In case of V+de3+A/V, the Modal Infixation Lexical Rule I in (5-32) assigns the value [can] to the feature [MODAL] in the semantics.  As for V+bu+A/V, there is a setting  [POLARITY minus] used to represent the negation in the semantics, shown in (5-33).[20]

th003

The following lexical entry shows the idiomatic compound 看见 kan-jian as coded in the CPSG95 lexicon (leaving some irrelevant details aside).   This entry satisfies the necessary condition for the proposed infixation lexical rules.

th004

The modal infixation lexical rules will take this [v_buyu] type compound as input and produce two V+MODAL+BUYU entries.  As a result, new entries 看得见 kan-de3-jian (can see) and 看不见 kan-bu-jian (cannot see) as shown below are added to the lexicon.[21]

th005

th006

The above proposal offers a simple, effective way of capturing the linguistic data of the interaction of V+A/V idioms and the modal insertion, since it eliminates the need for any change of the general grammar in order to accommodate this type of separable verbs interacting with 得 de3 / 不 bu, the only two infixes in Chinese.

5.4. Summary

This chapter has conducted an inquiry into the linguistic phenomena of Chinese separable verbs, a long-standing difficult problem at the interface of Chinese compounding and syntax.   For each type of separable verb, arguments for the wordhood judgment have been presented.  Based on this judgment, CPSG95 provides analyses which capture both structural and semantic aspects of the constructions at issue.  The proposed solutions are formal and implementable.  All the solutions provide a way of capturing the link between the separated use and contiguous use of the V+X idioms.  The proposals presented in this chapter cover the vast majority of separable verbs.  Some unsolved rare cases or potential problems are also identified for further research.

 

----------------------------------------------------------------------

[1] They are also called phrasal verbs (duanyu dongci) or compound verbs (fuhe dongci) among Chinese grammarians.  For linguists who believe that they are compounds, the V+N separable verbs are often called verb object compounds and the V+A/V separable verbs resultative compounds.  The want of a uniform term for such phenomena reflects the borderline nature of these cases.  According to Zhao and Zhang (1996), out of the 3590 entries in the frequently used verb vocabulary, there are 355 separable V+N idioms.

[2] As the term 'separable verbs' gives people an impression that these verbs are words (which is not necessarily true), they are better called V+X (or V+N or V+A/V) idioms.

[3] There is no disagreement among Chinese grammarians for the verb-object combinations like xi wan:  they are analyzed as transitive verb phrases in all analyses, no matter whether the head V and the N is contiguous (e.g. xi wan 'wash dishes') or not (e.g. xi san ge wan 'wash three dishes').

[4] Such signs as zao (bath), which are marked with # in (5-1), are often labeled as 'bound morphemes' among Chinese grammarians, appearing only in idiomatic combinations like xi zao (take a bath), ca zao (clean one's body by scrubbing).  As will be shown shortly, bound morpheme is an inappropriate classification for these signs.

[5] It is widely acknowledged that the sequence num+classifier+noun is one typical form of Chinese NP in syntax.  The argument that zao is not a bound morpheme does not rely on any particular analysis of such Chinese NPs.  The fact that such a combination is generally regarded as syntactic ensures the validity of this argument.

[6] The notion ‘free’ or ‘freely’ is linked to the generally accepted view of regarding word as a minimal ‘free’ form, which can be traced back to classical linguistics works such as Bloomfield (1933).

[7] It is generally agreed that idioms like kick the bucket are not compounds but phrases (Zwicky 1989).

[8] That is the rationale behind the proposal of inseparability as important criterion for wordhood judgment in Lü (1989).

[9] In Chinese, reduplication is a general mechanism used both in morphology and syntax.  This thesis only addresses certain reduplication issues when they are linked to the morpho-syntactic problems under examination, but cannot elaborate on the Chinese reduplication phenomena in general.  The topic of Chinese reduplication deserves the study of a full-length dissertation.     

[10] In the ALE implementation of CPSG95, there is a VV Diminutive Reduplication Lexical Rule in place for phenomena like xi zao (take a bath) à xi xi zao (take a short bath);  ting yin-yue (listen to music) à ting ting yin-yue (listen to music for a while);  xiu-xi (rest) à xiu-xi xiu-xi (have a short rest).

[11] He observes that there are two distinct principles on wordhood.  The vocabulary principle requires that a word represent an integrated concept, not the simple composition of its parts.  Associated with the above is a tendency to regard as a word a relatively short string.  The grammatical principle, however, emphasizes the inseparability of the internal parts of a combination.  Based on the grammatical principle, xi zao is not a word, but a phrase.  This view is very insightful.

[12] The pattern variations are captured in CPSG95 by lexical rules following the HPSG tradition.  It is out of the scope of this thesis to present these rules in the CPSG95 syntax.  See W. Li (1996) for details.

[13] In the rare cases when the noun zao is realized in a full-fledged phrase like yi ge tong-kuai de zao (a comfortable bath), we may need some complicated special treatment in the building of the semantics.  Semantically, xi (wash) yi (one) ge (CLA) tong‑kuai (comfortable) de (DE) zao (bath): ‘take a comfortable bath’ actually means tong‑kuai (comfortable) de2 (DE2) xi (wash) yi (one) ci (time) zao (bath): ‘comfortably take a bath once’.  The syntactic modifier of the N zao is semantically a modifier attached to the whole idiom.  The classifier phrase of the N becomes the semantic 'action-times' modifier of the idiom.  The elaboration of semantics in such cases is left for future research.

[14] The two groups classified by L. Li (1990) are not restricted to the V+N combinations.  In order not to complicate the case,  only the comparison of the two groups of V+N idioms are discussed here.  Note also that in the tables, he used the term ‘bound morpheme’ (inappropriately) to refer to the co-occurrence constraint of the idioms.

[15] Another type of X-insertion is that N can occasionally be expanded by adding a de‑phrase modifier.  However, this use is really rare.

[16] Since they are only a small, easily listable set of verbs, and they only demonstrate limited separated uses (instead of full pattern variations of a transitive VP construction), to list these words and all their separated uses in the lexicon seems to be a better way than, say, trying to come up with another lexical rule just for this small set.  Listing such idiosyncratic use of language in the lexicon is common practice in NLP.

[17] In fact, this set has been becoming smaller because some idioms, say zhu-yi 'focus-attention: pay attention to', which used to be in this set, have already lost all separated phrasal uses and have become words per se.  Other idioms including dan-xin (worry about) are in the process of transition (called ionization by Chao 1968) with their increasing frequency of being used as words.   There is a fairly obvious tendency that they combine more and more closely as words, and become transparent to syntax.  It is expected that some, or all, of them will ultimately become words proper in future, just as zhu-yi did.

[18] In general, one cannot use kan-jian to translate English future tense 'will see', instead one should use the single-morpheme word kan:  I will see him --> wo (I) jiang (will) kan (see) ta (he).

[19] Of course, [v_buyu] is a sub-type of verb [v].

[20] The use of this feature for representing negation was suggested in  Footnote 18 in Pollard and Sag (1994:25)

[21] This is the procedural perspective of viewing the lexical rules.  As pointed out by Pollard and Sag (1987:209), “Lexical rules can be viewed from either a declarative or a procedural perspective: on the former view, they capture generalizations about static relationships between members of two or more word classes; on the latter view, they describe processes which produce the output from the input form.”

 

[Related]

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

PhD Thesis: Chapter VII Concluding Remarks

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

PhD Thesis: Chapter IV Defining the Chinese Word

 

4.0. Introduction

This chapter examines the linguistic definition of the Chinese word and establishes its formal representation in CPSG95.  This lays a foundation for the treatment of Chinese morpho-syntactic interface problems in later chapters.

To address issues on interfacing morphology and syntax in Chinese NLP, the fundamental question is:  what is a Chinese word?  A proper answer to this question defines the boundaries between morphology, the study of how morphemes combine into words, and syntax, the study of how words combine into phrases.  However, there is no easy answer to this question.

In fact, how to define Chinese words has been a central topic among Chinese grammarians for decades (Hu and Wen 1954; L. Wang 1955;  Z. Lu 1957; Lin 1983; Lü 1989; Shi 1992; Dai 1993; Zhao and Zhang 1996).  In late 50's, there was a heated discussion on the definition of Chinese word in China.  This discussion was induced by the campaign for the Chinese writing system reform (文字改革运动).  At that time, the government policy was to ultimately replace the Chinese characters (hanzi) by a Romanized writing system.  The system of pinyin, based on the Latin alphabet, was designed to represent the pronunciation of the characters in the Contemporary Mandarin.  The simplest way is to use pinyin as a writing system and simply translate Chinese characters into syllables in pinyin.  But it was soon found impractical due to the many-to-one correspondence from hanzi to syllable.  Text in pinyin with no  explicit word boundary delimiters is hardly comprehensible.   Linguists agree that the key issue for the feasibility of a pinyin-based writing system is to establish a standard or definition for Chinese words (Z. Lu 1957).  Once words can be identified by a common standard, the pinyin system can in principle be adopted for recording the Chinese language by using space and punctuation marks to separate words.  This is because the number of homophones at the word level is dramatically reduced when compared to the number of homophones at the hanzi (morpheme or monosyllabic) level.

But the definition of a Chinese word is a very complicated issue due to the existence of a considerable amount of borderline cases.  It has never been possible to reach a precise definition which can be applied to all circumstances and which can be accepted by linguists from different schools.

There have been many papers addressing the Chinese wordhood issue (e.g. Z. Lu 1957; Lin 1983; Lü 1989; Dai 1993).  Although there are still many problems in defining Chinese words for borderline cases and more debate will continue for many years to come, the understanding of Chinese wordhood has been deepened in the general acknowledgement of the following key aspects:  (i) the distinct status of Chinese morphology;  (ii) the distinction of different notions of word;  and (iii) the lack of absolute definition across systems or theories.

Almost all Chinese grammarians agree that unlike Classical Chinese, Contemporary Chinese is not based on single-morpheme words.   In other words, the word and the morpheme are no longer coextensive in Contemporary Chinese.[1]  In fact, that is the reason why we need to define Chinese morphology.  If the word and the morpheme stand for the same linguistic object in a language, like Classical Chinese, the definition of  morpheme will entail the definition of word and there is no role of morphology.

As it stands, there is little debate on the definition of morpheme in Chinese.  It is generally acknowledged that each syllable (or its corresponding written form hanzi) corresponds to (at least) one morpheme.  In a characteristic ‘isolating language’ - Classical Chinese is close to this, there is no or very poor morphology.[2]  However, Contemporary Chinese contains a significant number of bound morphemes in word formation (Dai 1993).  In particular, it is observed that many affixes are highly productive (Lü et al 1980).

It is widely acknowledged that the grammar of Contemporary Chinese is not complete without the component of morphology (Z. Lu 1957; Chao 1968; Li and Thompson 1981; Dai 1993; etc.).   Based on this widely accepted assumption, one major task for this thesis is to argue for the proper place to cut the line between morphology and syntax, and to explore effective ways of interleaving the two for analysis.

A significant development concerning the Chinese wordhood study is the  distinction between two different notions of word:  grammar word versus vocabulary word.  It is now clear that in terms of grammar analysis, a vocabulary word is not an appropriate notion (Lü 1989; more discussion to come in 4.1).

Decades of debate and discussion on the definition of a Chinese word have also shown that an operational definition for a grammar word precise enough to apply to all cases can hardly be established across systems or theories.  But a computational grammar of Chinese cannot be developed without precise definitions.  This leads to an argument in favor of the system internal wordhood definition and the interface coordination within a grammar.

The remaining sections of this chapter are organized like this.  Section 4.1 examines two notions of word.  Making sure that we use the right notion based on some appropriate guideline, some operational methods for judging a Chinese grammar word will be developed in 4.2.  Section 4.3 demonstrates the formal representation of a word in CPSG95.  This formalization is based on the design of expectation feature structures and the structural feature structure  presented in Chapter III.

4.1. Two Notions of Word

This section examines the two notions of word which have caused confusion.  The first notion, namely vocabulary word, is easy to define.  However, for the second notion, namely, grammar word, unfortunately,  no operational definition has been available.  It will be argued that a feasible alternative is to system internally define a grammar word and the labor division between Chinese morphology and syntax.

A grammar word stands for the grammatical unit which fits in the hierarchy of morpheme, word and phrase in linguistic analysis.  This gives the general concept of this notion but it is by no means an operational definition.  Vocabulary word, on the other hand, refers to the listed entry in the lexicon.  This definition is simple and unambiguous once a lexicon is given.  The lexical lookup will generate vocabulary words as potential building blocks for analysis.

On one hand, vocabulary words come from the lexicon;  they are basic building blocks for linguistic analysis.  On the other hand, as the ‘resulting’ unit for morphological analysis as well as the ‘starting’ or ‘atomic’ unit for syntactic analysis, the grammar word is the notion for linguistic generalization.  But it is observed that a vocabulary word is not necessarily a grammar word and vice versa.  It is this possible mismatch between vocabulary word and grammar word that has caused a problem in both Chinese grammar research and Chinese NLP system development.

Lü (1989) indicates that not making a distinction between these two notions of word has caused considerable confusion on the definition of Chinese word in the literature.  He further points out that only the former notion should be used in the grammar research.

Di Sciullo and Williams (1987) have similar ideas on these two notions of word.  They indicate that a sign listable in the lexicon corresponds to no certain grammatical unit.[3]   It can be a morpheme, a (grammar) word, or a phrase including sentence.  Some examples of different kinds of Chinese vocabulary words are given below to demonstrate this insight.

(4-1.) sample Chinese vocabulary words

(a) 性           bound morpheme, noun suffix, ‘-ness’
(b) 洗           free morpheme or word, V: ‘wash’
(c) 澡           word (only used in idioms), N: ‘bath’
(d) 澡盆        compound word, N: ‘bath-tub’
(e) 洗澡        idiom phrase, VP: ‘take a bath’
(f) 他们         pronoun as noun phrase, NP: ‘they’
(g) 城门失火,殃及池鱼

idiomatic sentence, S:
‘When the gate of a city is on fire, the fish in the
canal around the gate is also endangered.’

The above signs are all Chinese vocabulary words.  But grammatically, they do not necessarily function as a grammar word.  For example, (4-1a) functions as a suffix, smaller than a word.  (4-1e) behaves like a transitive VP (see 5.1 for more evidence), and (4-1g) acts as a sentence, both larger than a word.  The consequence of mixing up these different units in a grammar is the loss of power for a grammar to capture the linguistic generality for each level of grammatical unit.

The definition of grammar word has been a contentious issue in general linguistics (Di Sciullo and Williams 1987).  Its precise definition is particularly difficult in Chinese linguistics as there is a considerable amount of phenomena marginal between Chinese morphology and syntax (Zhu 1985; L. Li 1990; Sun and Huang 1996).  The morpheme-word-phrase transition is a continuous band in the linguistic reality.  Different grammars may well cut the division differently.  As long as there is no contradiction in coordinating these objects within the grammar, there does not seem to exist absolute judgment on which definition is right and which is wrong.

It is generally agreed that a grammar word is the smallest unit in syntax (Lü 1989), as also emphasized by Di Sciullo and Williams (1987) on the 'syntactic atomicity' of word.[4]  But this statement only serves as a guideline in theory, it is not an operational definition for the following reason.  It is logically circular to define word, smallest unit in syntax, and syntax, study of how words combine into phrases, one upon the other.

To avoid this 'circular definition' problem, a feasible alternative is to system internally define grammar word and the labor division between Chinese morphology and syntax, as in the case of CPSG95.  Of course, the system internal definition still needs to be justified based on the proposed morphological or syntactic analysis of borderline phenomena in terms of capturing the linguistic generality.  More specifically, three things need to be done:  (i) argue for the analysis case by case, e.g. why a certain construction should be treated as a morphological or syntactic phenomenon, what linguistic generality is captured by such a treatment, etc.;  (ii) establish some operational methods for wordhood judgment to cover similar cases;  (iii) use formalized data structures to represent the linguistic units after the wordhood judgment is made.  Section 4.2 will handle task (ii) and Section 4.3 is devoted to the formal definition of word required by task (iii).   The task in (i) will be pursued in the remaining chapters.

Another important notion related to grammar word is unlisted word.  Conceptually, an unlisted word is a novel construction formed via morphological rules, e.g. a derived word like 可读性 ke-du-xing (-able-read-ness: readability), foolish-ness, a compound person name (given name + family name) such as John Smith, 毛泽东 mao-ze-dong (Mao Zedong).  Unlisted words are often rule-based.  This is where productive word formation sets in.

However, unlisted word is not a crystal clear notion, just like the underlying concept grammar word.  Many grammarians have observed that phrases and unlisted words in Chinese are formed under similar rules (e.g. Zhu 1985; J. Lu 1988).  As both syntactic constructions and unlisted words are rule based, it can be difficult to judge a significant amount of borderline constructions as morphological or syntactic.

There are fuzzy cases where a construction is regarded as a grammar word by one and judged as a syntactic construction by another.  For example, while san (three) ge (CLA) is regarded as a syntactic construction, namely numeral-classifier phrase, in many grammars including CPSG95, such constructions are treated as compound words by others (e.g. Chen and Liu 1992).  ‘Quasi-affixation’ presents another outstanding ‘gray area’ (see 6.2).

The difficulty in handling the borderline phenomena leads back to the argument that the labor division between Chinese morphology and syntax should be pursued system internally and argued case by case in terms of capturing the linguistic generality.  To implement the required system internal definition, it is desirable to investigate practical wordhood judgment methods in addition to case-by-case arguments.  Some judgment methods will be developed in 4.2.  Case-by-case arguments and analysis for specific phenomena will be presented in later chapters.  After the wordhood judgment is made, there is a need for the formal representation.  Section 4.3 defines the formal representation of word with illustrations.

4.2. Judgment Methods

This section proposes some operational wordhood judgment methods based on the notion of ‘syntactic atomicity’ (Di Sciullo and Williams 1987).  These methods should be applied in combination with arguments of the associated grammatical analysis.  In fact, whether a sign is judged as a morpheme, a grammar word or a phrase ultimately depends on the related grammatical analysis.  However, the operationality of these methods will help facilitate the later analysis for some individual problems and avoid unnecessary repetition of similar arguments.

Most methods proposed for Chinese wordhood judgment in the literature are not fully operational.  For example, Chao (1968) agrees with Z. Lu (1957) that a word can fill the functional frame of a typical syntactic structure.  Dai (1993) points out that this method may effectively separate bound morphemes from free words, it cannot differentiate between words and phrases, as phrases may also be positioned in a syntactic frame.  In fact, whether this method can indeed separate bound morphemes from free words is still a problem.  This method cannot be made operational unless the definition of ‘frame of a typical syntactic structure’ is given.  The judgment methods proposed in this section try to avoid this ‘lack of operationality’ problem.

Dai (1993) made a serious effort in proposing a series of methods for cutting the line between morphemes and syntactic units in Chinese.  These methods have significantly advanced the study of this topic.  However, Dai admits that there is limitation associated with these proposals.  While each proposed method provides a sufficient (but not necessary) condition for judging whether a unit is a morpheme,  none of the methods can further determine whether this unit is a word or a phrase.  For example, the method of syntactic independence tests whether a unit in a question can be used as a short answer to the question.  If yes, the syntactic independence is confirmed and this unit is not a morpheme inside a word.  Obviously, such a method tells nothing about the syntactic rank of the tested unit because a word, a phrase or clause can all serve as an answer to a question.  In order to achieve that, other methods and/or analyses need to be brought in.

The first judgment method proposed below involves passivization and topicalization tests.  In essence, this is to see whether a string involves syntactic processes.  As an atomic unit, the internal structure of a word is transparent to syntax.  It follows that no syntactic processes are allowed to exert effects on the internal structure of a word.[5]  As  passivization and topicalization are generally acknowledged to be typical syntactic processes, if a potential combination A+B is subject to passivization B+bei+A and topicalization B+…+NP+A, it can be concluded that A+B is not a word:   the relation between A and B must be syntactic.

The second method is to define an unambiguous pattern for the wordhood judgment, namely, judgment patterns.  Judgment patterns are by no means a new concept.  In particular, keyword based judgment patterns have been frequently used in the literature of Chinese linguistics as a handy way for deterministic word category detection (e.g. L. Wang 1955;  Zhu 1985; Lü 1989).

The following keyword (i.e. aspect markers) based patterns are proposed for  judging a verb sign.

(4-2.)
(a) V(X)+着/过 --> word(X)
(b) V(X)+着/过/了+NP --> word(X)

The pattern (4-2a) states that if X is a sign of verb, no matter transitive or intransitive, appearing immediately before zhe/guo, then X is a word.  This proposal is backed by the following argument.  It is an important and widely acknowledged grammatical generalization in Chinese syntax that the aspect markers appear immediately after lexical verbs (Lü et al 1980).

Note that the aspect marker le (LE) is excluded from the pattern in (4-2a) because the same keyword le corresponds to two distinctive morphemes in Chinese:  the aspect le (LE) attaches to a lexical V while the sentence-final le (LEs) attaches to a VP (Lü et al 1980).  Therefore, judgment cannot be reliably made when a sentence ends in X+le, for example, when X is an intransitive verb or a transitive verb with the optional object omitted.  However, le in pattern (4-2b) has no problem since le is not in the ambiguous sentence final position.  This pattern says that if any of the three aspect markers appears between a sign X of verb and NP, X must be a word:  in fact, it is a lexical transitive verb.

There are two ways to use the judgment patterns.  If a sub-string of the input sentence matches a judgment pattern, one reaches the conclusion promptly.  If the input string does not match a pattern directly, one can still make indirect use of the patterns for judgment.  The idiomatic combination xi (wash) zao (bath) is a representative example.   Assume that the vocabulary word xi zao is a grammar word.  It follows that it should be able to fill in the lexical verb position in the judgment pattern (4-2a).  We then make a sentence which contains a substring matching the pattern to see whether it is grammatical.  The result is ungrammatical:  * 他洗澡着 ta (he) xi-zao (V) zhe (ZHE);  * 他洗澡过 ta (he) xi-zao (V) guo (GUO).  Therefore, our assumption must be wrong:  洗澡 xi zao is not a grammar word.  We then change the assumption and try to insert aspect markers inside them (it is in fact an expansion test, to be discussed shortly).  The new assumption is that the verb xi alone is a grammar word.  What we get are perfectly grammatical sentences and they match the pattern (4-2b):  他洗着澡 ta (he) xi (V) zhe (ZHE) zao (bath): ‘He is taking a bath’;  他洗过澡 ta (he) xi (V) guo (GUO) zao (bath): ‘He has taken the bath’.  Therefore the assumption is proven to be correct.  This way, all V+X combinations can be judged based on the judgment patterns (4-2a) or (4-2b).

The third method proposed below involves a more general expansion test.  As an atomic unit in syntax, the internal parts of a word are in principle not separable.[6]  Lü (1989) emphasized inseparability as a criterion for judging grammar words.  But he did not give instructions how this criterion should be applied.  Nevertheless, many linguists (e.g. Bloomfield 1933; Z. Lu 1957;  Lyons 1968; Dai 1993) have discussed expansion tests one way or another in assisting the wordhood judgment.

The method of expansion to be presented below for wordhood judgment is called X-insertion.  X-insertion is based on Di Sciullo and Williams’ thesis of the syntactic atomicity of word.  The rationale is that the internal parts of a word cannot be separated by syntactic constituents.

As a method, how to perform X-insertion is defined as follows.   Suppose that one needs to judge whether the combination A+B is a word.   If a sign X can be found to satisfy the following condition, then A+B is not a word, but a syntactic combination:  (i) A+X+B is a grammatical string,  (ii) X is not a bound morpheme, and (iii) the sub-structure [A+X] is headed by A or the sub-string [X+B] is headed by B.

The first constraint is self-evident:  a syntactic combination is necessarily a grammatical string.  The second constraint aims at  eliminating the danger of wrongly applying an infix here.  In fact, if X is a morphological infix, the conclusion would be just opposite:  A+B is a word.  The last constraint states that X must be a dependant of the head A (or B).  Otherwise, it results in a different structure.  There is no direct structural relation between A and B when A (or B) is a dependant of the head X in the structure.  Therefore, the question of whether A+B is a phrase or a word does not apply in the first place.

After the wordhood judgment is made on strings of signs based on the above judgment methods and/or the arguments for the analysis involved, the next step is to have them properly represented (coded) in the grammar formalism used.  This is the topic to be presented in 4.3 below.

4.3. Formal Representation of Word

The expectation feature structure and structural phrase structure in the mono-stratal design of CPSG95 presented in Chapter III provide means for the formal definition of the basic unit word in CPSG95.  Once the wordhood judgment for a unit is made based on arguments for a structural analysis and/or using the methods presented in Section 4.2., the formal representation is required for coding it in CPSG95.

This type of formalization is required to ensure its implementability in enforcing a required configurational constraint.  For example, the suffix 性 -xing expects an adjective word to form an abstract noun, such constraints [CATEGORY a] and @word can be placed in the morphological expectation feature [SUFFIXING].  These constraints will permit, for example, the legitimately derived word 严肃性 [yan-su]-xing] (serious-ness), but will block the following combination * 非常严肃性 [[fei-chang yan-su]-xing] (very-serious-ness).  This is because 非常严肃 [fei-chang yan-su] violates the formal constraint as given in the word definition:  it is not an atomic unit in syntax.

In CPSG95, word is defined as a syntactically atomic unit without obligatory morphological expectations, formally represented in the following macro.

word macro
a_sign
PREFIXING saturated | optional
SUFFIXING saturated | optional
STRUCT no_syn_dtr

Note that the above formal definition uses the sorted hierarchy [struct] for the structural feature structure and the sorted hierarchy [expected] for the expectation feature structure.  The definitions of these feature structures have been given in the preceding Chapter III.

Based on the sorted hierarchy struct: {syn_dtr, no_syn_dtr}, the constraint [no_syn_dtr] ensures that the word sign do not contain any syntactic daughter.[7]  This prevents syntactic constructions from being treated as words.  On the other hand, since [saturated], [obligatory] and [optional] are three subtypes of [expected], the constraint [saturated|optional] prevents a bound morpheme, say a prefix or suffix which has obligatory expectation in [PREFIXING] or [SUFFIXING], from being treated as a word.

This macro definition covers the representation of mono-morpheme words, e.g. 鹅 e ‘goose’, 读 du ‘read’, etc., or multi-morpheme words, e.g. 小看 xiao-kan ‘look down upon’, 天鹅 tian-e ‘swan’, etc., as well as unlisted words such as derived words whose internal morphological structures have already been formed.  Some typical examples of word are shown below.

th11

th12

For a derived word, note that the specification of [PREFIXING satisfied] and [STRUCT prefix], or [SUFFIXING satisfied] and [STRUCT suffix], assigned by the corresponding PS rule is compatible with the macro word definition.

The above word definition is an extension of the corresponding representation features from HPSG (Pollard and Sag 1987).  HPSG uses a binary structural feature [LEX] to distinguish lexical signs, [LEX +], and non-lexical signs, [LEX -].  In addition, [sign] is divided into [lexical_sign] and [phrasal_sign].[8]  Except for the one-to-one correspondence between [phrasal_sign] and [syn_dtr] in terms of rank (which stands for non-atomic syntactic constructs including phrases), neither of these HPSG binary divisions account for the distinction between a bound morpheme and a free morpheme.  Such a distinction is not necessary in HPSG because bound morphemes are assumed to be processed in the preprocessing stage (e.g. lexical rules for English inflection, Pollard and Sag 1987) and do not show themselves as independent input to the parser.  As CPSG95 involves both derivation morphology and syntax in an integrated general grammar, the HPSG binary divisions are no longer sufficient for formalizing the word definition.  ‘Word’ in CPSG95 needs to be distinguished with proper constraints from not only syntactic constructs, but also from affixes (bound morphemes).

In CPSG95, as productive derivation is designed to be an integrated component of the grammar, the word definition is both specified in the lexicon for some free morpheme words and assigned by the rules in morphological analysis.  This practice in essence follows one  suggestion in the original HPSG book:  "we might divide rules of grammar into two classes: rules of word formation, including compounding rules, which introduce the specification [LEX +] on the mother, and other rules, which introduce [LEX -] on the mother." (Pollard and Sag 1987:73).

It is worth noticing that words thus defined can fill either a morphological position or a syntactic position.  This reflects the interface nature of word:  word is an eligible unit in both morphology and syntax.  This is in contrast to bound morphemes which can only be internal parts of morphology.

In morphology, derivation combines a word and an affix into a derived word.  These derivatives are eligible to feed morphology again.   This is shown above by the examples in (4-5) and (4-6).  The adjective word 可读 ke-du (read-able) is derived from the prefix morpheme 可 ke- (-able) and the word 读 du (read).  Like other adjective words, this derived word can further combine with the suffix 性
–xing (-ness) in morphology.  It can also directly enter syntax, as all words do.

To syntax, all words are atomic units.  If a lexical position is specified, via the macro constraint @word in CPSG95, in a syntactic pattern, it makes no difference whether a filler of this position is a listed grammar word, or an unlisted word such as a derivative.  Such distinction is transparent to the syntactic structure.

4.4. Summary

Efforts have been made to reach a better understanding of Chinese wordhood in theory, methodology and formalization.  The main spirit of the HPSG theory and Di Sciullo and Williams' ‘syntactic atomicity’ theory has been applied to the study of Chinese wordhood and its formal representation.  Some effective wordhood judgment methods have also been proposed, based on theoretical guidelines.

The above work in the area of Chinese wordhood study provides a sound foundation for the analysis of the specific Chinese morpho-syntactic interface problems in Chapter V and Chapter VI.

 

 

-------------------------------------------------------

[1] For Classical Chinese, word, morpheme, syllable and hanzi are presumably all co-extensive.  This is the so-called Monosyllabic Myth of Chinese (DeFrancis 1984: ch.8).  The development of large numbers of homophones, mainly due to the loss of coda stops, has led to the development of large quantities of bi-syllabic and poly-syllabic word-like expressions (Chen and Wang 1975).

[2] Classical Chinese arguably allows for a certain degree of compounding.  In the linguistic literature, some linguists (e.g. Sapir 1921; Zhang 1957; Jensen 1990) did not strictly distinguish Contemporay/Modern Chinese from Classical Chinese and they held the general view that Chinese has little morphology except for limited compounding.  But this view of Contemporary Chinese has been criticized as misconception (Dai 1993) and is no longer accepted by the community of Chinese grammarians.

[3] Di Sciullo and Williams call a sign listable in the lexicon listeme, equivalent to the notion vocabulary word.

[4] In the literature, variations of  this view include the Lexicalist position (Chomsky 1970), the Lexical Integrity Hypothesis (Jackendoff 1972), the Principle of Morphology-Free Syntax (Zwicky 1987), etc.

[5] This type of ‘atomicity’ constraint (Di Sciullo and Williams 1987) is generally known as Lexical Integrity Hypothesis (LIH, Jackendoff 1972), which states that syntactic rules or operations cannot refer to part of a word.  A more elaborate version of LIH is proposed by Zwicky (1987) as a Principle of Morphology-Free Syntax.  This principle states that syntactic rules cannot make reference to the internal morphological composition of words.  The only lexical properties accessible to syntax, according to Zwicky, are syntactic category, subcategory, and features like gender, case, person, etc.

[6] Of course, in theory a word may be separated by morphological infix.  But except for the two modal signs de3 (can) and bu (cannot) (see Section 5.3 in Chapter V), there does not seem to exist infixation in Mandarin Chinese.

[7] In terms of rank, [no_syn_dtr] in CPSG95 corresponds to the type [lexical_sign] in HPSG (Pollard and Sag 1987).  A binary division between [lexical_sign] and [phrasal_sign] is enough in HPSG to distinguish the atomic unit word from syntactic construction.  But, as CPSG95 incorporates derivation in the general grammar, [no_syn_dtr] covers for both free morphemes and bound morphemes.  That is why the [no_syn_dtr] constraint on [STRUCT] alone cannot define word in CPSG95;  it needs to involve constraints on morphological expectation structures as well, as shown in the macro definition.

[8] Note that there are [LEX -] signs which are not of the type [phrasal_sign].

 

[Related]

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

PhD Thesis: Chapter VII Concluding Remarks

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP