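【Li-Bai Dialogues 19: A Third Look at Prof. Bai's Secret Weapon】

【Liwei's note】In any technical discussion, the first requirement is to understand each other's terminology. Veterans have a habit: after years of accumulated practice they build their own systems and coin their own terms freely. Prof. Bai has his own terminology, and so do I. Fortunately, after a year of chatting in Prof. Bai's semantic computing group, we have come to understand what each other's terms refer to. For newcomers, however, this may well be a hurdle. So for this third expedition into the tiger's den, I have compiled the relevant terms at the end of the post for reference; Prof. Bai is welcome to correct anything inaccurate.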

白:
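We first settle "who is related to whom" without having to specify "what kind of relation" it is. We only distinguish three broad cases: "a is a direct constituent of b", "a is a modifier of b", and "a is merged with b".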

洪:
@wei In the 1980s and 1990s Steven Small had a theory of distributed Word Expert Parsing; around the same time Garrison Cottrell at UCSD and Wendy Lehnert at UMass did similar research.

白:
word expert理论当年也跟踪过,因为跟汉语实际相差太远,后来不了了之了。

李:
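Small's work used to be cited often, because my advisor, Prof. Liu, also named his own MT mechanism the "expert lexicon". The terms collided, so a citation was unavoidable. Lexicalizing syntax, as a general direction, has long enjoyed broad consensus among parsing researchers and the NLP community at large, even though implementations differ (the lexicalist LFG and HPSG that flourished after GPSG reflect this consensus). Prof. Bai's stepwise approach looks like a broad road that combines big data with lexicalization. The first step does dependency only, and allows later revision. Relying on the lexicon alone, it first sets up a syntactic scaffold and counts on the semantic operations of the middleware to eliminate spurious ambiguity. The semantics in parsing comes down to two layers: one concerns the semantics of nodes, i.e. WSD; the other concerns the semantics of arcs, i.e. the matcher's structural disambiguation, whose goal is to find reliable parses (what Prof. Bai calls binary relations). The fine-grained logical-semantic analysis that follows, including putting some binary relations to sleep and waking them up, and revealing hidden logical-semantic relations, counts as deep semantic computing. Both layers overcome the knowledge bottleneck through the "romance" between big data and the initial structures, not through a labeled treebank. The most interesting part should be this unsupervised learning process in which big data courts the initial structures: seemingly promiscuous, casting a wide net and taking whoever bites, until the statistics settle each word's disposition and best partners. I look forward to hearing about Prof. Bai's unsupervised "nuclear explosion": big data igniting the acquisition of semantic knowledge for deep parsing, reportedly with the help of deep learning's RNN mechanism.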

白:
不务虚了,讨论点昨天出的具体的例子吧。总会有突如其来的不带介词的NP,让没有坑的VP措手不及。躲得过初一躲不过十五。大数据会告诉我们什么呢?比如,“那堆砖让我垒了鸡窝了”,垒,没有预备坑给“那堆砖”,怎么办?

李:
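No more abstractions, then. Let me ask Prof. Bai a few questions. Is the first-layer lexicalized binary parsing nondeterministic, or strictly deterministic? If it is the latter, parsing that coarse seems too weak to support the later learning and revision.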

白:
大数据变了,结果会不同。这算nondeterministic?

李:
不算。那是两套系统,依据的是不同的数据和训练,在不同的时间框架。
不是说不需要大数据吗?连二元关系的性质都模糊,就是先勾搭上而已。

白:
不需要带标大数据。性质可以模糊,但约束必须明确。比如萝卜什么时候占名额什么时候不占名额。Matcher不是语义中间件,他要用到语义中间件。wsd也要用到。一个确定节点标签,一个确定留下的二元关系。

李:
约束不就是词典里面的挖坑,实际中的填坑 ➕ 挖坑么?用的是 cat,因为一个词可以有多个 cats(or subcats),所以调用了 WSD 模块来决定。根据这个决定来填坑构成二元结构。好像就是这么个过程。

白:
“这碗猪”还记得吧。

李:
不搭没关系吧 - 开始的时候。

白:
【碗,猪】这个二元关系有还是没有,问中间件。没有,就不建立arc。虽然cat相配,也不建。
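The gating described here can be sketched in a few lines. This is only an illustration under stated assumptions: the `AFFINITY` table, its scores, and the `threshold` are hypothetical stand-ins for whatever the middleware actually returns; the source only says that an arc is not built when the middleware reports no relation, even if the categories match.

```python
# Hypothetical affinity scores standing in for a middleware query result.
AFFINITY = {("碗", "汤"): 0.9, ("碗", "猪"): 0.0}

def harmonious(a, b, threshold=0.5):
    """Ask the (toy) middleware whether a and b are attested as a pair."""
    return AFFINITY.get((a, b), 0.0) >= threshold

def try_arc(classifier, noun, cats_match=True):
    """Build an arc only if the categories match AND the middleware approves."""
    if cats_match and harmonious(classifier, noun):
        return (classifier, noun, "mod")
    return None  # categories may match, but no arc is built

print(try_arc("碗", "汤"))
print(try_arc("碗", "猪"))
```

The point of the design is that category compatibility is necessary but not sufficient; the semantic check has veto power.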

李:
那是大数据训练以后的事儿了,训练之前呢?语义中间件就是一个大数据训练出来的类似 hownet 的资源。在训练之前 大数据怎么结构化的?

白:
训练是独立的,跟matcher无关,跟ontology有关,ontology是结构化的

李:
无监督训练,总要有个啥吧。训练是独立的 offline 进行,利用大数据得出的语义相谐的统计性历史总结,作为 parsing 的资源。Matcher 是 online runner,来对新的 input 做 parsing 的。这跟我们专家去写 patterns 道理一样。训练的结果包含 ontology,
训练的支持难道不是结构化的大数据?这个结构怎么来的?谁给的第一推动?

白:
这是一个冷启动窗口长短的问题。matcher可以只看3个,大数据看13个。大数据的13个中包含被matcher拉近到3个的概率不低。

李:
拉近不是结构化的作为吗?

白:
大数据中非结构化的词串,十三个词里面“碗”和“猪”的共现,以及背后subcat的共现,同“碗”和“汤”的共现相比,这数据有统计意义不?我说的是“包含”。

李:
有意思。非结构化词串就是 ngram,13 词区间大体就是一个子句的长度,再长也没啥统计价值的关联了。

白:
碗,背后的subcat是“容器”“餐具”;汤,背后的subcat是“液体”“食物”。统计subcat共现,可以脱离具体的词例,获得大样本。在大窗口里进行,跑都跑不掉。所以,有无结构的说法是含混的。从parse角度讲,冷启动时无结构;从ontology角度讲,冷启动时结构很丰富。
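A minimal sketch of counting subcat co-occurrence in an unstructured window, assuming a toy ontology. The labels for 碗 and 汤 come from the discussion; the label for 猪, the corpus, and the function name are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy ontology: word -> subcats. The entry for 猪 is an assumption.
SUBCATS = {
    "碗": {"容器", "餐具"},
    "汤": {"液体", "食物"},
    "猪": {"动物"},
}

def subcat_cooccurrence(sentences, window=13):
    """Count unordered subcat pairs whose words lie within `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, j in combinations(range(len(tokens)), 2):
            if j - i >= window:
                continue  # too far apart to count as co-occurring
            for a in SUBCATS.get(tokens[i], ()):
                for b in SUBCATS.get(tokens[j], ()):
                    counts[frozenset((a, b))] += 1
    return counts

corpus = [
    ["他", "喝", "了", "一", "碗", "汤"],
    ["猪", "拱", "翻", "了", "碗"],
    ["这", "碗", "汤", "很", "烫"],
]
counts = subcat_cooccurrence(corpus)
print(counts[frozenset(("容器", "液体"))], counts[frozenset(("容器", "动物"))])
```

Because the counts are keyed on subcats rather than on word forms, every sentence containing any container and any liquid adds to the same cell, which is how the sample grows beyond individual word pairs.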

李:
嗯,为了统计性,脱离具体词,先用 hownet 或 wordnet 支持一下。

白:
冷和热的唯一区别,就是有了冷的基础,热应该更好做。因为大窗口的关联都挖出来了,小窗口更不在话下。

只有一种情况,就是热的情况下,小窗口里面的关联,是把大窗口都覆盖不到的远距离关联拉近了的结果,这种会失手。

李:
OK. Supported by the ontology, within a 13-word window, the system has learned that 碗 (bowl) collocates with 汤 (soup). So what?

白:
在遇到这碗猪的时候,会选择不match,把“这碗”留着,让“猪”去找自己的坑

李:
这口气得憋多久啊

白:
就是所谓的“过程性因素”,用中间件的查询结果来控制,而不是用手编的语言学知识或规则来控制。

李:
停下的意思类似于入栈。稍有闪失就沉底出不来了。

白:
对啊,RNN+栈。入栈,等着填坑

李:
不知道栈有多深

白:
出不来的情况,参见刚才的例子 “那堆砖让我垒了鸡窝了”。在“垒”只有两个坑的情况,“那堆砖”就是进去了出不来的,如果不想其他办法的话。

荀:
如果这种二元决策是确定性的过程,如果出错,填入的坑的萝卜就得靠唤醒了。

白:
不妨仔细推演下这个例子。

李:
赶巧这个【工具】的坑,处于可有可无的边缘。“垒” 其实也可以带三个坑的。

白:
如果大数据中,存在着大量“砖”带着明确的介词和“垒”共处一个窗口的情况呢?或者投射到subcat上,“建筑材料”带着介词和“建筑行为”共现?

荀:
如何辨认“工具”和“施事”就很重要了

李:
【工具主语】 与 【人主语】 几乎有类似的统计性。

荀:
需要用启发式信息,引导RNN训练,这个引导过程是至关重要的。

白:
这里有“我”,已经明确会填坑。我说的是,没有坑可填不可怕,翻翻大数据,历史上别人用它带什么介词,就把那个介词补上好了。然后就堂而皇之地做状语了。这些东东,有了ontology和大数据的结合,就不要人来操心了。

荀:
把subcat嵌入到RNN中,用启发式信息结合LM训练方式引导RNN编织权重。

李:
如果加上显性形式“用”,工具作为萝卜有很多数据。

白:
我昨天出了那么多例子,伟哥居然没觉出用心良苦:

“这些纸能写很多字”
“这些铁可以打很多钉子”

荀:
[用]这些铁可以打很多钉子
[在]这些纸能写很多字

白:
从形式上,为严谨起见,我们不会去给这个句子凭空添加任何一个莫须有的介词,但总可以用一个不占位置的虚介词吧……
【phi】这些铁可以打很多钉子。
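One way to picture the recovery step is a lookup over attested (preposition, noun subcat, verb subcat) counts that falls back to the empty marker phi when the evidence is thin. Everything here is hypothetical: the counts, the `min_count` cutoff, and the function name are assumptions for illustration, not Prof. Bai's actual procedure.

```python
from collections import Counter

# Hypothetical corpus counts of (preposition, noun subcat, verb subcat).
PREP_COUNTS = Counter({
    ("用", "建筑材料", "建筑行为"): 42,
    ("在", "建筑材料", "建筑行为"): 3,
})

def recover_marker(noun_subcat, verb_subcat, min_count=5):
    """Return the best-attested preposition for this pair, else "phi"."""
    candidates = {
        prep: count
        for (prep, n, v), count in PREP_COUNTS.items()
        if (n, v) == (noun_subcat, verb_subcat)
    }
    if not candidates:
        return "phi"
    prep, count = max(candidates.items(), key=lambda kv: kv[1])
    return prep if count >= min_count else "phi"

print(recover_marker("建筑材料", "建筑行为"))
print(recover_marker("动物", "建筑行为"))
```

The surface sentence is left untouched either way; the returned marker only licenses attaching the bare NP as an adverbial so that it does not stay stuck on the stack.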

荀:
利用大数据可以做“小词“还原,这对缺少标记的汉语很重要了。

白:

至少有了这个phi,栈里的不会出不来了。

荀:
借助大数据,RNN做“还原”这类事情很在行。把小词“虚化”,也是一种subcat处理。抓住了小词就抓住了汉语结构命门,白老师在这上花足了心思。对句子做“结构归一化”处理。

白:
推而广之,就是利用大资源+大数据把看起来不那么规范的句子有理有据地整理成更规范的,这样parser负担就轻了,无需独自面对复杂情况。“这场火多亏消防队来得及时”,这里的“这场火”同样面临“没给留坑”的尴尬。但是,把句子中的“火”“消防队”两个实词送入中间件,可以发现与他们共现频次相当高的“救”。有“救”垫底,就可以引入及物的虚动词phi,这样萝卜和坑就相安无事了。

荀:
白老师提到的parser需要确定的三种关系,权重信息编织在网中了,在应用时,词典发出请求,RNN做认定。Parsing就是做<W1,W2,Relation>认定的过程, W1或者W2 可以是小词。 功夫在于Relation定义,在承载结构的小词处理以及<W1,W2,Relation>训练过程,白老师对这些都有一套不同以往的做法。

李:
如果没有坑可跳,就自己挖个坑去跳,这也是 mods 的常规了。在形态语言中,mods 有显性小词或词尾帮助确定该怎么挖坑自裁。在裸奔的汉语,形式没了,只好靠搭配。

白:
救火这个例子,已经不是subcat嵌入了,根本就是词嵌入。

李:
wait:“这场火多亏消防队来得及时”,这里的“这场火”同样面临“没给留坑”的尴尬。

咱们走一走这场火。哪里出来的“救火”,“消防队”本体里面的吗?Hownet 里面肯定有。

常规的做法是,遇到句首 np 没法填坑,就给个 topic 标签。有点像英语的 as for,with regards to,topic 很像pp做的状语。往后找一个谓语挂靠:“这场火” 挂靠到 “来”。

白:
人家只有一个坑,还是给human预留的。

李:
不需要啊。状语是随机的。状语可以看成是不填坑,而是挖坑,挖个坑让谓语填进去
或者让自己跳进去 再去找主儿。

白:
比如“为了”?

李:
想不出来为什么要绕那么大弯,让“救火”出来救驾。Topic 式状语,无需那么清晰的标签,就是把np 降级为 pp。至于什么 p 什么格,另说着。

世界语有个万能介词 je,柴门霍夫这样解说:介词就是格,都是确定性语义的。
几十个介词 就是几十个格。但是如果有一个状语,你不知道哪个介词合适
或者你懒得费劲琢磨什么格合适,你就用 je。与前面提的phi,异曲同工啊。

白:
那样活儿太糙。补介词合适还是补动词合适,大数据说了算。

李:
用了 je 就确定了其地位。不是没有道理。人如果要清晰,他可以有清晰的形式,譬如介词或词尾。如果他不用,那就模糊。虽然模糊,句法地位和关系还是大体确定了。这类模糊要确定语义关系,可以在后面的语义模块(我以前也叫它语义中间件)决定,而不是白老师的中间件在parsing 过程中调用。我选择把二者分开,因为这类情形句法没有到走投无路,就算耍个流氓 亦无不可。先躲过初一,到15再说。其实 15 到了,要求很可能与初一不一样了。人走茶凉不了了之也是有的。

白:
数据支持的话,可以冒进一点。中间件就是在过程中调用啊,否则有啥用。

李:
deep parsing 的过程可以分两个阶段,两个模块:句法和语义。我叫语义中间件是指它在句法模块之后,产品语义落地之前,夹在中间。怎么没用?几乎所有的 hidden 逻辑语义,都可以留到这里做,而不必在句法模块做。

不仅句法模块内部可以多层去做,句法到逻辑语义,也可以分开,成为两个层面的 parsing,Syntactic parsing to semantic parsing。非谓语动词的主宾等都可以后延,
句法只要确定其状语还是定语或补足语身份即可。对于谓语的主宾等,也可以先在句法做一个糙活,到语义中间件再细化或修正。糙活是不到不得已不调用 ontology,如 np 主语,管他 【human】 还是 【instrument】:

张三砍了李四
斧头砍了李四

开始都是同一个parse。

张三吃了大餐
乌云吃了月亮

也是如此。

白:
现在还都没说定性,只说定位,谁跟谁有关系。结论是,就这么糙的事儿,也得动用ontology。

李:
句法不必要太细。语义可以细,但那个活儿可以悠着点,做多少算多少。

回到白老师前面给的句子,试试我目前语义模块还没丰富完善的 parsing:

“那堆砖让我给搭鸡窝了”
“这辆车能坐六个人”
“这个方向不被看好”
“这些铁可以打很多钉子”
“这些纸能写很多字”


see,句法架子是出来了,但未尽如意的语义还有一步之遥。这一步补不补,不紧急,因为语义落地的时候,如果是 integrated 一体化的直通车,而不是提供给第三方做 offshelf support 的,就可以在落地模块内部协调。譬如,“坐车” 带了“六个人” 为 O,ideally,语义模块应该把 “六个人” 从句法的 O 转为 逻辑语义的 S。但是,如果是内部协调,转不转也无所谓。O 不过是一个符号而已。词驱动落地的时候,“坐车”的 arg 是 O 或 S,完全不必计较。当然,如果要补足这一步,虽然琐细,但真要做也不难。在没弄清楚多少利益之前,懒得做这细活。同理:“那堆砖”最好是加一条线,连上“搭”,标签是 【Instrument】。“这个方向不被看好”已经把表层的小词 “被” 带入考量,直接给了 O,一切到位,没有可做了。“打铁” 和 “钉子”,最好是加上标签【Result】。最后一句,最好给 S 进一步加上逻辑语义标签【Instrument】or 【Material】,但其实落地也未必需要这个,就是加上了显得很酷,很智能,让人看着爽,倒未必是对落地产品真地就有多大利益。

 

 

【术语 Index】

Matcher:the syntactic parsing program,有时候我们叫 runner,在白老师的系统里面,就是接受输入文句,对其二元依存关系解析的模块。

WSD:与 community 的依据义项划分的定义有别,白老师的 WSD 模块指的是:在词负载结构的体系里,一个具体的词负载了好几种可能的结构,结合上下文选择其中一种的模块,称之为wsd模块。事实上,这里的WSD 是利用大数据得来的词与词或其上位概念之间的语义相谐,来决定采纳某种区分一个词不同用法的扩展的 POS tags or 白老师所谓 subcats,来帮助结构消歧。粗线条义项的区分成为二元关系结构消歧的副产品。当(细线条)义项区别不影响结构的时候,义项区分就不是这个WSD模块的任务。

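Binary relation (二元关系): a syntactic dependency between two words (binary dependency). Prof. Bai's system distinguishes three types: modifier relations (e.g. attributive, adverbial), argument (args) relations (e.g. subject, object), and merge relations.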

POS(cat):part-of-speech (or category,相对于 subcat 子类而言)词类,不必是 PennTree 定义的集合。作为模块,指的是根据系统给定的词类标准,自动做词性标注。一个词可能跨类,POS 模块可以根据上下文决定最合适的类别(词性)。在白老师的系统中,是所谓 WSD 模块做这个 POS 的事儿,来供给 Matcher 充当合法填坑的 candidates。在白老师的系统,我们可以把 POS 的词性标注理解为粗线条的 WSD。不影响结构的词义区分不是白老师所说的 WSD 模块的任务,虽然 community 的 WSD 不是这样定义的。

subcat:subcat 的原义指的是谓词的子类,这个子类对应了这个词的特定句型(譬如,双宾句型,宾+宾补句型,等)。白老师说的 subcat 扩展到不一定具有对应句型的子类。譬如,碗,背后的subcat是“容器”“餐具”;汤,背后的subcat是“液体”“食物”。这实际上是本体语义(ontology)的层级结构,如 ISA taxonomy chain:碗 ISA 餐具,餐具 ISA 工具,工具 ISA 商品;商品 ISA 人造物品;人造物品 ISA 物品;物品 ISA 实体(逻辑名词,这是这个 chain 的顶端节点 TOP 了)。
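The ISA chain in this entry can be walked with a simple parent map. The chain itself is the one given above; the function names are illustrative, and the sketch assumes the taxonomy has no cycles.

```python
# Parent map for the ISA chain quoted in the glossary entry.
ISA = {
    "碗": "餐具",
    "餐具": "工具",
    "工具": "商品",
    "商品": "人造物品",
    "人造物品": "物品",
    "物品": "实体",
}

def isa_chain(word):
    """Follow ISA links upward until the TOP node (assumes no cycles)."""
    chain = [word]
    while chain[-1] in ISA:
        chain.append(ISA[chain[-1]])
    return chain

def is_a(word, category):
    """True if `category` is the word itself or one of its ancestors."""
    return category in isa_chain(word)

print(" ISA ".join(isa_chain("碗")))
```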

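“耍流氓” ("playing the hooligan"): establishing a binary dependency whose type cannot yet be determined, although some relation can be asserted. In Chinese syntax, a sentence-initial noun phrase whose role (subject, object, attributive, or adverbial) is not yet determined is often first given a Topic label and attached to the following predicate; Prof. Bai regards this as playing the hooligan. Likewise, when a relation between two content words is fairly certain but its type is not, we often let the parser link them with a Next label according to their linear order, as a patch that improves the parser's robustness and recall. This too is playing the hooligan for the time being, because in theory a later semantic module still has to confirm the type of relation before the deep analysis is complete. If the system assigns Next to two Chinese verbs in sequence, the default relation is 【接续】 (succession), the so-called serial-verb (连动) construction of Chinese grammar books.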

Topic:汉语分析中,句首名词短语如果不直接做主语、宾语等,很多分析就给 一个Topic(主题)的标签。汉语文法的一个突出语言句型现象就是所谓双主语句(常常分析成一个Topic or 大主语,加一个小主语:譬如,他身体特别好。这家公司业绩直线上升。)由于这种关系逻辑语义的性质不明,聊胜于无,所以也称这种二元关系的建立为“耍流氓”。

Next:两个词一先一后,但不能确认他们发生了什么句法语义关系,系统常常给一个特殊的关系标签,叫 Next,其默认关系是【接续】。 这是一个增强句法分析器鲁棒性(robustness)和查全率(recall)的打补丁的手段。由于这种关系逻辑语义的性质不明,聊胜于无,所以也称建立这种二元关系为“耍流氓”。

mod:修饰成分或关系。包括定语、状语、补语。

arg:算元成分或关系。包括主语、宾语、(宾语)补足语或间接宾语。

Hownet:董振东前辈发明的面向MT和NLP服务的跨语言本体知识(ontology)网络《知网》的英文名称。

小词:教科书上叫做功能词。包括介词、连词、代词、副词、感叹词、联系动词等。

伪歧义:也叫伪路径,指的是 parsers 产生出来的貌似成功但没有价值的结构分析路径。伪歧义,是相对于真(结构)歧义而言。真的结构歧义的典型案例是某些 PP-attachment 的现象,同一个 PP 可以理解为两种可能:做宾语的后置定语;或做谓语动词的后置状语,这两个 parses 都是有效的语义解析。但是,很多传统的 parsers,会产生很多貌似成功解析输入文句的分析路径(numerous parses),给人以文句结构歧义严重的假象,但其实这些不同路径大多没有区别意义,是为伪歧义。这是一个困扰了传统 parsing 很多年的难题。白老师和立委的系统都利用不同的策略(包括休眠唤醒机制)很好地解决了这个问题。

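Middleware (中间件): Prof. Bai's semantic middleware is a statistical knowledge base of semantic harmony between words (the semantic fit within various relations), trained from big data with the support of an ontology (a knowledge base such as HowNet or WordNet). The WSD and Matcher modules call it as a resource when parsing input sentences. The "semantic middleware" in Liwei's earlier NLP posts overlaps with Prof. Bai's middleware but refers to something different. In Liwei's deep parsing system it is not a knowledge resource but the semantic module that follows the syntactic module. Building on the syntactic scaffold, this module refines logical semantics, resolves hidden logical-semantic relations, and wakes dormant structures into new semantic relations (including correcting earlier wrong paths); if needed, it can also do some word sense disambiguation in the original sense of WSD. In short, this semantic module is domain-independent and sits after syntactic analysis and before domain-specific semantic grounding, so as to better serve that grounding. To avoid further confusion, Liwei plans to stop calling this module middleware, cede the term, and simply call it the semantic module.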

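坑 (slot): the slot is the role-provider and the 萝卜 (carrot) is the role-fulfiller. A slot is an expected node in a dependency relation. For a predicate, its slots are the arguments (args) it expects: subject, object, complement. Modifier relations (mods), such as attributives, adverbials, and (in Chinese) complements, are generally regarded as additional, peripheral semantics and do not occupy a slot. One can also view the modifier as expecting the predicate, or the predicate as absorbing the modifier without using up a slot. In HowNet the "slot" is called a "role" and the "carrot" a "typical actor"; both are considered purely semantically, independent of any particular language.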

萝卜:指的是那些参与谓词结构(所谓 argument structure)所要求的实体角色的词,譬如充当主语、宾语、补足语的成分。谓词结构通常被认为是一个语句的核心语义。谓词以动词为主(但也有形容词和名词做谓词的),在词典主义(lexicalist)的系统中(白老师和立委的系统均属于词典主义),一个谓词的潜在的结构都标注在这个词的词典信息 subcat 里面。换句话说,谓词的 subcat 规定了它期望什么样的成分(所谓挖坑),需要什么样的词(萝卜)来填。譬如,“走路”挖了一个坑,需要一个优选语义位【human】的名词萝卜来充当其施事主语。再如,“喜欢” 挖了两个坑:谁喜欢什么。充当主语的是【human】名词,充当宾语的是几乎任何词。

填坑:一个词(包括代表短语的头词)根据谓词对坑的句法(甚至语义)要求,充当了其谓词结构的成分,建立了与谓词的二元关系(binary dependency),这个建构过程叫做填坑。谓词结构的成分填满了,核心语义就完整了,这个状态叫 saturated。
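As a rough sketch of slots, fillers, and saturation, assuming a simplified model in which each slot expects a single subcat label or accepts anything. The class and its fields are illustrative, not either system's actual data structure; the 喜欢 example follows the glossary (a 【human】 subject and almost any object).

```python
from dataclasses import dataclass, field

@dataclass
class Predicate:
    word: str
    slots: list                      # expected subcat per open slot, e.g. ["human", "any"]
    fillers: list = field(default_factory=list)

    def fill(self, word, subcat):
        """Fill the first open slot that accepts this subcat; report success."""
        for i, expected in enumerate(self.slots):
            if expected in ("any", subcat):
                self.slots.pop(i)
                self.fillers.append(word)
                return True
        return False

    @property
    def saturated(self):
        """A predicate is saturated when no open slot remains."""
        return not self.slots

like = Predicate("喜欢", ["human", "any"])
like.fill("张三", "human")
like.fill("数学", "knowledge")
print(like.fillers, like.saturated)
```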

萝卜指标:指的就是坑。所谓不占萝卜指标,是说的一个词可以合法填两个坑的情形,其中一个坑不影响其填另一个坑的能力。听上去似乎与坑与填坑的概念出发点相违背,但在依存关系图的构建过程中,是必须考虑一个萝卜填多个坑(一个儿子多个老子)的情形才可以把依存关系进行到底(有些一个萝卜多个坑的情形在短语结构表达中,可以借助非终结节点避免)。

优选语义:最早由著名人工智能和机器翻译前辈 Wilks 提出的概念,指的是在本体网络(ontology)中,概念之间的语义相谐表现在自然语言的表达的时候,呈现的是一个区间,而不是一个固定的语义约束。譬如,【eat】这个概念对于【受事】的优选语义是【food】,但是这只是其优选,并不是一定要是【food】。语言表达的时候,优选语义可以根据句法的约束条件不断放松,以至于达到完全不相谐的程度(nonsense)。乔姆斯基认为,句法可以独立于这些语义相谐的约束,举的就是句法约束决定结构关系,偏离优选语义到极端的例子:Colorless green ideas sleep furiously。对于形态语言,句法独立性的原则有较多的证据。对于汉语,这个原则需要打折扣,合理利用优选语义的约束就成为汉语解析的关键依据。立委 parser 改造使用了 HowNet 来弥补句法形式的不足。白老师的系统是依靠大数据训练出来的中间件来实现优选语义的对 parsing 的约束。

逻辑语义:指的是深层结构关系。最早起源于乔姆斯基的深层结构和费尔默的深层格(关系)。中国NLP和MT的旗手级前辈董振东老师发扬光大,深化了这方面的研究,指出解析逻辑语义是深度自然语言理解的关键:所谓理解一个句子,主要就是理解了这个句子里面概念之间的逻辑语义,谁是施事,谁是受事,时间、地点、条件,等等。在 community,对应于所谓 role labeling 的任务。一般而言,主谓宾定状补之类的句法关系比较粗糙,这些是表层关系,一个语言深度解析器(deep parser)不仅要解析(decode)句法关系,而且要进一步揭示后面的逻辑语义关系,包括细化句法关系(譬如句法主语可以进一步标注为施事、受事、工具等逻辑语义,句法宾语可以标注为受事、对象、结果等逻辑语义,诸如此类),和揭示隐含的逻辑语义关系(所谓 hidden links,就是句法上没有直接联系但逻辑语义上具有直接联系的结构关系,譬如宾语是宾语补足语的隐藏的逻辑主语)。

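Sleep and wake-up (休眠唤醒): in the Li-Bai discussions this term refers to a parsing strategy that temporarily shelves less likely paths; a shelved path can be woken up under suitable conditions. The strategy is believed to mirror human language processing, as seen in jokes and the punchlines of crosstalk (相声). Liwei has a series of posts on this mechanism, for example 【立委科普:结构歧义的休眠唤醒演义】.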

 


【李白之16:小词负载结构与小词只参与模式条件之辩】

白:

“是他杀的张三”是一个完整句子吗?

主谓宾都在哪儿?

李:

shi S Vt de O ==》SVO

很常见的句式,主谓宾齐全,

是 S V 的 O == S V 的 O == SVO

不过其中 “SV的O” 有歧义,因为与带定语从句的NP同形。

不过加了强调小词 “是” 在句首以后,似乎歧义就消失了。

白:

“是”当什么?自己没有主谓宾?

李:

是数学我不喜欢。

是在北京他们开的董事会。

是1990年我毕业的。

句首的“是”,是强调小词。

类似于英语的强调表达法: it is X 。。。。

it was in 1990 when I graduated
It was in Beijing where we got the deal

白:

可否认为“张三是他杀的”,然后“张三”后移到句尾。

李:

张三是他杀的 ==》 是他杀的张三 ?

张三他杀的 ==》 他杀的张三

白:

“是他杀了张三”可以这么做。“是他杀的张三”不能。

“是他杀了张三”跟“有人敲门”是一个性质,在坑论里是两个谓词合并,共享一个萝卜。

但“是他杀的张三”不好套用这个结构。没办法把二元关系进行到底。不仅“的”捞不出来,连“是”还得搭进去。

李:

“的” 字的两个用法:1 定语标志(或所有格);2. 肯定语气

表示肯定语气的 ”的“,通常位于句末,也常与表示肯定或强调的 ”是“ 搭配使用: 是 XP 的

貌似由此衍生出表示肯定的  ”的“ 用于谓宾之间。

“是他杀的张三” 说的是对过去或完成的肯定,但是却不允许用助词 ”了“ 或 ”过“,原因可能是这个位置被 “的” 占据了。另一个原因是 这种肯定语气蕴含了完成。肯定的行为动作不可能是没有发生的事件。

在 pattern 中,只要能列举出这种即可,很容易捕捉,除非是歧义。

白:

表达什么先不管,谁跟谁有关系是首先要解决的。

李:

没有句首“是”的pattern  “SV的O” 的确有歧义,但是这种歧义是 consistent 的。对于consistent 的歧义,其实不难处理,可以将错就错。直到错到某个点,系统觉得应该校正了,就校正。现在的处置是,开始 parsing 的时候,一律做定语从句看。

白:

有套路,就把构成套路的词摘到二元关系之外,语言学上不够简约。

前面说到的踢出机制试了几个例子,很好玩,基本通了。

李:

有套路,就把构成套路的词摘到二元关系之外,没有问题啊。因为小词已经在套路(patterns)起到了该起的条件作用,譬如 “是+S+V+的+O”,在这个 pattern 中,没有歧义, SVO 被确定,逻辑语义被解构,一抓一个准,完事了,把 “是” 和 “的” 这种句法辅助小词挂起来。这是pattern的天经地义。pattern 比起二元关系环环相扣的 parsing 有不同的优缺点:pattern 可能比较长,上述 pattern 是个五元组,实词的元是XP,所以实际的跨越可能是很长的 string,用的是长度来换取确定性,牺牲了某种抽象性,或换句话说,带来了一些规则的冗余度。二元关系环环相扣的做法,可能更加简约和概括。
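The five-element pattern can be illustrated with a toy matcher over pre-chunked input. The chunk tags ("NP", "V") and the output keys are assumptions chosen for the example; the only parts taken from the discussion are the pattern 是+S+V+的+O, the affirmative feature, and the idea of hanging the function words on an X link rather than discarding them.

```python
def match_shi_s_v_de_o(chunks):
    """chunks: list of (text, tag). Returns an SVO record or None."""
    tags = [tag for _, tag in chunks]
    if tags != ["是", "NP", "V", "的", "NP"]:
        return None
    (shi, _), (s, _), (v, _), (de, _), (o, _) = chunks
    return {
        "args": {"S": s, "V": v, "O": o},
        "feature": "affirmative",
        "X": [shi, de],  # function words kept, but outside the core relations
    }

sentence = [("是", "是"), ("他", "NP"), ("杀", "V"), ("的", "的"), ("张三", "NP")]
print(match_shi_s_v_de_o(sentence))
```

Since each NP slot stands for a whole phrase, the matched string can be long even though the pattern has only five elements, which is the length-for-certainty trade-off described above.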

白:

做系统咋都行。做句法。感觉非常实用主义,理论上不连贯。

李:

句法标配说的是 sv 语序,多数系统都用的。你的系统先不用,是例外。

白:

我这不用。至少matcher不用。以后语义模块用另说。

李:

虽然汉语语序很操蛋,孤立语中它自由得简直不讲道理,但是 sv 是默认,有统计性依据,也有心理认知的依据。这一汉语句法标配的形式痕迹,不用白不用。

白:

用了也有误导的时候。

我在尝试踢出边的功能:一个强搭配萝卜进来,在坑饱和的情况下,踢走一个已经进坑的萝卜,自己跳坑。如果没有不占指标的额度的话。一进一出,不破坏结构,不重构结构,也不改变结构对外部的联系。与所谓“回溯”大不一样。拔出来的萝卜再进什么坑,全看后续发展。
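A toy version of the kick-out step, under the assumption that each filler carries a numeric collocation strength and that the weakest current filler is the one evicted. The source does not say which filler gets kicked out, so that choice, the scores, and the function name are all hypothetical.

```python
def enter(filled, capacity, newcomer, strength):
    """filled: list of (word, strength). Returns (new_filled, evicted_word)."""
    if len(filled) < capacity:
        return filled + [(newcomer, strength)], None   # a free slot exists
    weakest = min(filled, key=lambda ws: ws[1])
    if strength > weakest[1]:
        kept = [ws for ws in filled if ws is not weakest]
        return kept + [(newcomer, strength)], weakest[0]  # one in, one out
    return filled, None                                 # newcomer is turned away

slots, evicted = enter([("a", 0.2), ("b", 0.6)], capacity=2, newcomer="c", strength=0.9)
print(slots, evicted)
```

Note that nothing else in the structure is touched: the number of filled slots is unchanged, and the evicted word is simply free to look for another slot later, which is what distinguishes this from backtracking.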

李:

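So when building the syntactic structure, word order need not be used, because the label of the structure does not have to be given for now. In effect this skips syntactic labels and goes straight to logical-semantic labels in the next step. The traditional approach distinguishes arg1, arg2, arg3; here they are not distinguished. It is enough to say that this is an arg, as opposed to a mod.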

白:

但是各个arg如果subcat不同的话,需要锁定,免得互相串了。

李:

所以是标签隐藏在后面,暂时不露而已。

对,免得互相串了 是必须的。

白:

如果连另一个可能性都没有指出来,焉知落地想要的不是另一个?

李:

这个问题哪里会有?是落地的需求 drive 开发呀。世界上哪里有飘在天上搞开发的呢。何况开发这事儿也不是一锤子买卖。今天没有的可能性,明天加上可能性也是可以的。系统不可能是一成不变的。pattern 不变的话,在结论上增加点什么,连重新测试都不需要就可以搞定,有何难哉?更何况 我们 patterns 用完小词以后,还发扬了革命人道主义,并没有扔掉敲门砖,还用 X 把小词给挂上呢。所有的痕迹都在,过河没拆桥。不过是不让过了河的桥和敲了门的砖占据我们的语义核心地位而已。

小词负载结构,我的理解,本质上也就是一个过渡,一个粘结剂,一个特定的 parsing 算法所依赖的一种手段,并不是一种必需。小词成为条件,则是一种通用的必需,因为没有小词,结构关系就很难搞定,这是小词存在的理由。

白:

“杀人犯把卖盐的杀了化妆成卖盐的在那卖盐。”杀人犯是卖盐的?

李:

那句话一时看不懂,但 point 明白了。你是在诘问 把“S是V的”处理成 SV 的做法。它们不等同,不仅仅是 affirmative 的差别,还有另一个差别在。这个差别是,“S是V” 的 可以回答 “S是谁” 的问题,而 “SV” 不能回答 “S是谁” 的问题。好,这是一个典型的语义落地决定如何表达的例子。现在的问答系统的语义落地有对这两种结构做区分的需求,那就让第二个pattern在输出表达中,把这种需求满足即可。譬如,可以让第二个pattern (S 是 V 的)输出这样的结果:

arg structure: S V

feature: affirmative

answer: who is S

白:

杀人犯不是卖盐的VS杀人犯不卖盐

这些零碎副词加在affirmative上还是加在普通谓语上怎么区分?

李:

bottom line is pattern1 和 pattern2 是两个独立的捕捉,二者该怎样处理都可以,加在哪里都可以。加在哪里落地好用,落地觉得有用,就加在哪里。这都不是事儿。媳妇都娶回家了,怎么打扮还不是男家一句话吗?

白:

那就是说实际上做了两个谓词,简化成一个谓词是伪命题。而小词负载结构,只不过把两个谓词显性化而已。

李:

早早年的parsing,其原始定义记得是没有parse tree的表达的。什么都没有。就是一个合法非法的结论。所谓合法的结论,就是 parser 把那个句子从头到尾都吃进去了。

白:

判定问题

李:

后来的 tree representation 或其他的表达,全部是 parsing 过程留下的痕迹,或副作用。这样看parsing就明白了娶媳妇是核心,打扮媳妇是具有任意性和功利性的副产品。小词负载结构如果能在语义上表现出贡献,那么这种贡献可以等价地由 pattern 给出。换句话说,如果某种用小词作为枢纽来区别谓词的表达法,对于语义落地有益,那么没有人可以阻挡pattern的编写者,输出同样的表达。但实践中,我们知道,其实绝大多数时候,这些小词丢掉了,核心语义没啥损失。譬如 小词 “把”点名了宾语的所在,借助它表达出宾语的核心语义以后,“把”的使命也就完结了。

白:

“把”和述补结构连接,绝不是只有“宾语”这一个含义。参照“他把眼睛哭肿了。”

李:

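The semantic contribution of structures carried by function words is bound to be limited. Function words are in essence the explicit form of syntax, and as syntax moves toward logical semantics, form gives way to content, surface to depth, speech to logic. This tendency means that function-word-borne structure mostly carries marginal semantics. Seen from another angle, function words are language-specific, whereas semantics is by nature shared across humanity and language-independent. What is language-specific will not play the lead in semantics.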

30 年前,董老师提出,以人类共同的逻辑语义作为机器翻译的基础,而不是在句子表层实施转换。这样一来,不仅用了不同小词和词序的主动语态和被动语态被认为是相同的,而且动词与deverbal的名词也被认为是相同的。因为其 arg structure 的核心逻辑语义都是相同的。用它指导 MT 就是:

I translated A from B into C

== A is translated from B to C (by me)

== my translation of A from B to C

==> 我把A从B翻译成C

当时觉得董老师的做法的确抓到了要害,但也觉得表层的小词和细微差别(譬如语态)也不能就这么扔了。最后的体会和结论是: 在语义落地的时候(譬如MT),逻辑语义是主要的,表层结构是辅助的。做到了逻辑语义的转换,基本任务可以算是完成了。但是要想做得更好一点,还可以参照表层结构或features,再做一些细节上的调整。譬如 英语是被动态的,也许也翻译成被动态更合适(其实,由于两个语言的显性被动表达形式具有不同的使用频度,只能说,部分的被动态用汉语的显性被动为宜,其他的被动态可以用隐性的被动形式,最后还有一部分被动其实更合适用汉语的主动态来翻译,这个要细细研究的话,可以针对不同情形结合表层和深层结构写一大篇来)。

总而言之,小词和表层,顾不上来的话,扔掉了也没啥大不了的。这些边缘的语义色彩,对于语义落地的不同场景或许可以有参照作用,但不是核心。

白:

实际情况是,逻辑语义也是人参照表层写出来的。本族语表层研究不到位,就只好迁就着走。就好像grandma不知道是姥姥还是奶奶。并不是他们说英语的人逻辑上不能定义和区分爸爸的妈妈还是妈妈的妈妈。我们要高频率地使用,就不能绕着走。

李:

这样看也是一个角度,有其道理。

理论上,逻辑语义应该是参照多数的人类语言提出来。基本立足点就是,人类的概念和思维是共同的,理解也应该是共同的,只是表达的时候穿了不同的外衣。当然,语言对思维也有反作用,因此人类思维和理解的共同性,只可能是大同小异,而不可能是完全一致。

白:
共同性体现为外衣的并集

如果主要外衣缺失,就谈不上共同性了


【李白之15:白老师的秘密武器探幽】

【立委按】 白老师不动声色开始亮剑了:独创的自然语言的parsing法术,无需规则,无需带标数据,词典主义标注加无监督大数据以克服知识瓶颈。深度计算,句法语义交融,借力RNN,以平天下。小试牛刀,以中文难句为例。先睹为快,以飨同仁。

 

李:
省略 head 最可恨。从“的字结构”和what-clause始,恨的是 队员都在 头儿却跑了,缺少头就缺少了语义相谐的依据。what I read 的语义是 【book】,可是很难找到这个本体的入口点。语义计算和细线条深度parsing就犯难了。当然,可以转弯抹角通过 “read”的 HowNet 网络里面的 logic subcat 的逻辑宾语的标配,把这个语义节点挖出来,这多费劲啊。
白:
别拿“的”不当头儿
李:
“的” 与 what 一样,可以当头儿,但没有本体的底气。
白:
给它它就有
李:
我尝试过把 V 当头, 也尝试过拿 “的” 当头,都遇到这个本体的滑铁卢。V 麻烦更大一些,V 本身的本体在那儿添乱。理论上可以通过 V 询问 HowNet 去 retrieve 出来逻辑宾语的标配,然后赋值,并替代 V 的本体属性。Word 天,这不是人做的活儿。
我吃的 --》【food】
我看的 --》【ANY】
我修读的 --》【knowledge】
我parse的--》【language】
我干的 --【事业,or 勾当?】
白:
只需要指回来,不需要明确哪个坑。
的当头不是问题,当头的赋予什么subcat才要紧。谓词的坑不饱和不要紧,可以到坑里去挖。谓词的坑饱和了还要“凭空”憋出一个subcat来最麻烦了
李:
出去买外卖,路上冒出个英语打油:
What I read -- is not a book
What I eat -- is not food
What I do -- is not a cause,
What I love, is you.
标配被形式推翻,哈。
白:
not a book, but newspapers; not food, but pills; not a cause, but fun.
看看我自己搞的parser的图:
李:
牛叉!看着就高大上。
what I love, is the girl:见过的都说漂亮的。
见过的都说漂亮的:有歌为证:“在那遥远的地方,有位好姑娘;人们走过她的帐房,都要回头留恋地张望。”  里面有两个 “的”字结构:
见过的 --》 【human】
都说漂亮的 --》【ANY】
前者是主语,后者是宾语,补全了就是: 见过的【人】都说【她】漂亮的【那个姑娘】
白:
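Tomita is not even close: I have no rules here, only a lexicon.
sh = shift; up = upgrade; ma = match (fill a slot); mo = modify
It is simpler than a graph-structured-stack automaton, and it corresponds one-to-one with queries to the semantic middleware.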
李:
就四个操作?跟汇编似的。记得汇编就是两个字母做操作缩写。我自己没整过汇编,我领导当年整天就是汇编。
白:
还有一个me,合并,这个例子木有用到。
match,modify,merge,那么凑巧都是m打头,第二个跟不同的元音。
下划线义为“关闭”。五角星是关闭后又打开复用。
李:
SH 的缩写是?
白:
shift
李:
my bad, I thought it was shit
wondering why this naming
by the way, shit and crap are NOT negative in Oral English when used in NPs
白:
up含元音u,shift含元音i,这aeiou也算集齐了
shift:move to the next token
李:
we call it read head in FSA
白:
pda players like to call it that way.
但实际上我这也不是栈。
李:
能把姑娘那句串出来,真心不易。
我要是硬做也可以做,可是感觉不踏实,不知道哪天又断了。很多事儿是选择不做,因为没有鲁棒的把握,当然也因为不足够常见感觉不值。
match 是填坑,那就是 saturated 了。up 是升格,意思是?
白:
修饰语提升为被自己修饰的pos
坑有指标,填一个少一个
matcher就做这五个动作
李:
为了理解白老师的parsing机制,咱们对照上图来个walk-through吧:
“这” 是 N++,Up 了一下,就成了 N+ ?
白:
指示词,数词,都是N++
李:
Up 之前为什么 Shift?
白:
位置一开始-1,进入0就是shift
李:
Mo01 就是被吃掉了,N+ 就是往右边找 N,modifier 找 head,找到了 就自裁了。
白:
李:
可是这个 NP 看不出来是一个完整的有 det 的 NP,过了这村,痕迹也没留下的感觉。
白:
弧都在。每条边都是痕迹。
李:
Ma12 那是填坑了,带有 2N 的 “见”,saturate 了一个,成了 S/N,可是怎么知道是主语坑填了 还是宾语坑填了呢?
白:
不知道。也不care。
李:
所以 先吃掉一个再说。也不问问中间件?
白:
目前每步都问
李:
每一步都查相谐?
道理上不需要,只有出现歧义可能才需要查问和比对。
白:
两个如果有一个相谐,就锁定一个,再来只需查另一个。
李:
小词 “过” +S 是向左修饰谓语的,因此 Mo32 就把时态助词吃掉。
so far so good
可这时候没遇到 N,只遇到 “的”。“的” 很特别,谁都要
白:
的,左面吃一坨,吐一个定语N+
李:
X 来了就吃掉。
对 突出了一个 N+
白:
X是wildcard 以前说过的,不管S还是N,来者不拒
李:
Up 4 晕了
原地踏步就是 Up?
白:
定语升格为NP,实际是创建了一个虚节点,图上有
李:
Up 的原因是因为前后的路都堵死了,等于是默认操作?走不下去了,就 Up 一下。
白:
什么因素驱动什么操作,应该是最核心的东东了。
李:
Ma42 填的啥坑?“见” 还有一个没有填的 arg N。
白:
见,残留的坑。
因为是残留的坑,萝卜不占指标,依然可以它用。
残留的坑就灭了。
李:
哦,那是“的”没找到 head N,自我升格为 N 以后,去填了 “见” 的第二个坑。至此我们其实不知道“姑娘”和“的”各填的什么坑:如果的字结构中被省去的N是【非人】(东西),则“女孩”是主语;否则,“女孩”可能是宾语(也可以是主语),类似于说:
这女孩见过的【东西】
见过这女孩的【人】
其实在本例句“这女孩见过的都说漂亮”中,“女孩”是宾语,而的字结构指的是主语【人】。
不占指标的意思是,这个 V 做了定语从句,所以 V 全部saturated 了
白:
残留是指,head已经填坑去了
我们matcher是没有语言学知识的,只知道填坑去了,定语从句什么的,不知道。
另一面说,如果有其他情形导致残留的,也一样办理。
六亲不认。
李:
Ma42 结果 把 “见” /N 变成了_,关闭了,就是用 “的” 填进去的结果?语言学上对应于 “的”字结构 反填为子句里面的主语。
makes sense
反填不占指标,所以动词saturated,可“的”字还是 N,从这个 N 进一步取下一词 (Sh5) ?
白:
sh是先放着,看下一个
李:
入栈?
白:
不完全是栈,暂且理解为栈也将就
李:
小词 “都” S+ 往右找谓词做修饰(Mo56),于是吃掉了 “都”。
“说”有两个 args?一个是 N 主语,一个是 X 爱咋咋:NP宾语也好,宾语从句也好
这时候 的 N 可以填进去 (Ma46),后面的 A 作为第二个 X 也填进去,大功告成?
白:
要处理残留:谁漂亮
李:
Ma17 于是把“女孩”连上了“漂亮”,填坑。
白:
因为是残留,要在之前已经关闭的N里面找一个做兼职。不占名额。
李:
远距离逻辑关系 不能占句法关系的坑。
白:
_是关闭,五角星是再打开
李:
关闭是入栈,打开是 pop?
白:
好像不是
关闭是了结,打开是废物利用。
李:
这个游戏好玩。
parsing 是走通了,哪里看出怎么给标签?主谓宾等
白:
不给标签,只说谁跟谁有什么关系,留下来的arcs正好构成这么一幅图
李:
“什么”关系不就是标签吗?
MO 是修饰;MA 是填坑,但没说主语关系还是宾语关系
白:
水平的是修饰关系(红色),垂直的是填坑关系(蓝色),跨接是合并关系(橙色)。
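To make the bookkeeping concrete, here is a small sketch that replays an abridged trace for 这女孩见过的都说漂亮 and collects the arcs. It rests on two assumptions: that a label such as Mo01 reads as "token 0 modifies token 1" (source position first, target second), and that only the arc-building operations named in the walk-through are included. It does not model how the matcher chooses an operation, which is the real core of the algorithm.

```python
from enum import Enum

class Op(Enum):
    SH = "shift"
    UP = "up"
    MA = "match"
    MO = "modify"
    ME = "merge"

# Only these three operations leave an arc behind.
ARC_KIND = {Op.MA: "fill", Op.MO: "modify", Op.ME: "merge"}

def replay(trace):
    """trace: list of (Op, src, dst); src/dst are token positions or None."""
    return [(src, dst, ARC_KIND[op]) for op, src, dst in trace if op in ARC_KIND]

# 0 这  1 女孩  2 见  3 过  4 的  5 都  6 说  7 漂亮
TRACE = [
    (Op.SH, None, None), (Op.UP, None, None),
    (Op.MO, 0, 1), (Op.MA, 1, 2), (Op.MO, 3, 2), (Op.MA, 4, 2),
    (Op.MO, 5, 6), (Op.MA, 4, 6), (Op.MA, 1, 7),
]
for arc in replay(TRACE):
    print(arc)
```

The output carries no subject or object labels, only who is linked to whom and by which of the three relation kinds, matching the "no labels, only arcs" position stated above.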
李:
没看到跨接
白:
这句没有
不说没关系的,范围已经框死了
语义层面往下走接得住
李:
主语宾语怎么接得住?
主宾的区分往往是,相谐只是可能,句法才是决定。“老鼠爱大米” 填坑以后 相谐可以决定主宾,“张三爱李四” 呢?
白:
我们汉语可能要反过来,相谐如果搞定,不问语序;相谐搞不定的,再问语序。
语序的原始编号都在。
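The "harmony first, word order second" policy can be sketched as follows. The category labels and the verb preferences are invented for the example, and real harmony checks would be graded rather than exact matches; the sketch only shows the order in which the two sources of evidence are consulted.

```python
def assign_roles(verb_prefs, first, second, cats):
    """first precedes second in the sentence. Returns (subject, object).

    verb_prefs = (preferred subject category, preferred object category).
    """
    s_cat, o_cat = verb_prefs
    fits_in_order = cats[first] == s_cat and cats[second] == o_cat
    fits_reversed = cats[second] == s_cat and cats[first] == o_cat
    if fits_in_order != fits_reversed:
        # Harmony alone settles the roles, regardless of word order.
        return (first, second) if fits_in_order else (second, first)
    # Harmony cannot decide: fall back to the default subject-first order.
    return (first, second)

CATS = {"老鼠": "animal", "大米": "food", "张三": "human", "李四": "human"}
print(assign_roles(("animal", "food"), "大米", "老鼠", CATS))
print(assign_roles(("human", "human"), "张三", "李四", CATS))
```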
李:
至少对于此句,不问语序是对的。问了语序的话,“女孩”在主语位置,应该是定语从句的主语了,但其实是宾语。
白:
如果填坑时没有竞争者,也不用查中间件。
大部分情况只用相谐就搞得定。
三省吾身,用得妥妥的。
李:
有意思,太有意思了。
白:
当个玩具吧,希望尽快升格为不再是玩具。
李:
这可不是玩具 玩具哪里能搞定这样的句子。看得出来 小词很关键。实词一边有坑,一边有中间件。
白:
玩小词其乐无穷啊
李:
“的”字的玩法 令人惊诧。
白:
“圆圆地画一个圈”
这里要解决“伪状语”的问题。顺带考查一下小词“地”。
李:
洗一个痛快的澡
是伪定语,同理。
白:
这里,“圆”残留的/N,靠“圈”的废物利用搞定。二者之间的subcat不要太般配哦。
同理,“痛快”残留的/N,找到了subcat相谐的已关闭的“他N”。
李:
这句没看懂。
“圆” 一个坑,后来让 “圈” 填了。类似于 “痛快” 的坑 让 “他” 填。
白:
画的逻辑宾语坑是“图形”,圈的subcat也是图形,这不是般配是什么?
伟哥没看懂的是上海话吧……
不要太 means 太tm
李:
哦。
北方话就是王八绿豆对上眼了。
对上眼的是远距离的 “圆” 与 “圈” 啊,“画” 与 “圈” 哪里需要对上眼,那是句法绑定 父母包办:
v 了 一个 n
白:
父母包办的也送中间件里,无妨
李:
不需要。先婚后恋。不恋也成婚。
白:
圆圈也包办的
不过我还没处理成包办
需要磨
所谓包办,就是word embedding。自由恋爱,就是subcat-embedding。
李:
前者是强搭配?后者是搭配
强搭配在两个直接量之间进行:洗-澡;搭配可以在 subcats 之间
吃 -【food】
or【consume】-【food】
词对词:洗-澡 ==》 词对subcat: 吃-【food】 ==》 subcat 对 subcat:【consume】-【food】
HowNet 基本是后者,因为是概念之间。汉语词典里面有前者,因为有习惯表达法,language-specific。问题是,由于自然语言有多义,词到概念的映射不是一一对应的,除非存在一个完美的 WSD 支持。因此,subcat 对 subcat 的这个宇宙真理,尽管概括性和逻辑性强,但不好实施,容易走偏。除非有大数据做底,指望 WSD 不太现实。
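The three levels of collocation listed above can be read as a back-off lookup, from most specific to most general. The tables below are toy data: 洗-澡, 吃-【food】 and 【consume】-【food】 come from the discussion, while the extra words and their subcat assignments are assumptions.

```python
WORD_WORD = {("洗", "澡")}
WORD_SUBCAT = {("吃", "food")}
SUBCAT_SUBCAT = {("consume", "food")}
# Hypothetical word -> subcat assignments.
SUBCAT_OF = {"吃": "consume", "喝": "consume", "饭": "food", "粥": "food"}

def collocation_level(head, dep):
    """Return the most specific level at which (head, dep) is attested."""
    if (head, dep) in WORD_WORD:
        return "word-word"
    dep_cat = SUBCAT_OF.get(dep)
    if (head, dep_cat) in WORD_SUBCAT:
        return "word-subcat"
    if (SUBCAT_OF.get(head), dep_cat) in SUBCAT_SUBCAT:
        return "subcat-subcat"
    return None

for pair in [("洗", "澡"), ("吃", "饭"), ("喝", "粥"), ("洗", "粥")]:
    print(pair, collocation_level(*pair))
```

Trying the literal pair first reflects the point made here: word-to-word entries generalize least but almost never go astray, while subcat-to-subcat entries generalize most and are the easiest to get wrong when a word's sense is uncertain.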
白:
中间件看到的就是实例对实例、标签对实例、标签对标签(含标签它八辈儿祖宗)。
李:
我把 HowNet 的搭配搬过来以后,吃过亏。不过实例对实例,这个不需要大数据,拍脑袋也不会走偏。基本就是词典的记忆,脑袋里都有了,而且因为概括性弱,走偏的可能几乎没有:譬如 洗-澡。实例对标签 处于二者之间。
白:
WSD再怎么不完美也要分开,绝不能搅在一起。宁可分头完善去
李:
我从来不指望 WSD
默认假设是没有 WSD 怎么做 NLU 或落地。WSD 是其他过程的结果或副作用,而不是支持其他模块的前提。
白:
“洗”是万金油,“澡”是单打一。
一个固定搭配入口在哪很要紧,放在万金油那儿就惨了
李:
那是效率的问题。有不同的 indexing 的入口。“澡”作为入口 效率更好而已。
所谓 word driven 其中一个考量就是入口的驱动词的选择。传统的词典编纂也有这个考量。
白:
WSD和matcher工作时都要调用中间件
李:
一时想不出来 parsing 为什么要 WSD,有中间件就可以 parse 了。理论上 parse结果里面,词的节点应该是 WSD 过的概念。
白:
不存在独立于中间件之外的WSD
给Matcher的是单选的pos流,从多选到单选这一步是WSD做。再回到多选,就是休眠唤醒了。就是我说的,“纵向不确定性”WSD负责搞定;“横向不确定性”matcher负责搞定。二者都要借助中间件。
李:
多选到单选不是中间件吗?当然说这里面隐含了WSD也是不错的,因为所谓相谐就是两个节点的某一个 ws 与某一个 ws 对上了。从图上说,node 才有 wsd 的问题,arc 不是。
白:
义项的多选到单选,由WSD借助中间件做。parsing动作的多选到单选,由matcher借助中间件做。
“我想战胜AI的心,仅仅是为了作为棋手的尊严。”
“想战胜AI的心”,遇到“心”属不属于“那个小集合”的问题。可以人为设定“心”的一个属于那个小集合的新义项,(类似“心情、心愿”),在中间件里面靠“想、V、的”等捆绑,希望运行WSD时可以体现出来。目前资源太小,很多时候不顺手。
李:
我来推演一下:
parsing 到某个步骤,需要决定定语从句修饰的N,是不是应该反填子句谓语还未填的坑。如果 N 与坑的arg的要求相谐,则填,否则不填。如果 args 都已经 saturated 也没有填的问题。
“我想战胜AI的心”: “战胜”已经saturated,“心”不填。无需给心做 WSD
“想战胜AI的心”: 这时候,“战胜”还有一个主语的 arg 没有填,“心” 能不能填,决定于大数据中有没有 “心” 做 “战胜” 主语的历史积淀。应该是不相谐,没有积淀,因此不填。即便是那个“小集合”的典型案例,譬如“消息”,也有可能是相谐可填坑的:
他走漏的消息,很关键。
他走失的消息,很关键。
大数据搞定 “走漏-消息” 是肯定的。至于“走失” 与 “消息”,那应该是词典决定的标配,而不是大数据。换句话说,搭配是大数据的统计,不搭配则是默认。

白: 再看:

因为“碗”和“猪”不相谐,标红的这一步选择了Sh,而不是Mo

王:
白老师,这个句子最后一步match的是17,可以是47么?

Plus,“这女孩见过的都说漂亮”这个句子似乎有歧义?对比:“老祖宗讲过的都说有理”。

“女孩”在这个带“见”的的字结构里,可以当主语,也可以当宾语。

白:
对。这里只取了一种分析结果

李:

这姑娘见到的都说漂亮。
这小伙儿见到的都说英俊。
这小子见到的都说漂亮(因为他以前根本没遇到过漂亮的)。
这姑娘见到的都说英俊(因为她足不出户,见识太少)。
猪八戒见到的都说漂亮。
这傻瓜见到的都说奇妙。

结论,是 ”姑娘“ 与 ”漂亮“ 的高度相谐性,决定了姑娘与句法标配唱反调,做了 ”见到“的逻辑 宾语。甚至替换成同义词 ”英俊“,这种相谐性有所降低,就很难打败句法标配了。这也说明,语义中间件的相谐性不是好玩的游戏,非高手不能。甚至高手也会失手,过犹不及。

王:
操作有Shift, Modify, Match, Up, 还有这个句子里没用到的Merge,一共五种……

白老师,如何决定每一步用哪种操作呢?是在每一步都把五种操作全部轮一遍,看看哪个能用,然后继续,最后把成功parse全句的依存关系留下,没parse出全句的依存关系丢弃?

另外parsing以前做pos tagging的时候是不是也要把所有可能的pos序列全部给出来?

白:
这里面有大量无效的结合需要排除。算法的核心就体现在这个地方。

目前算法还没有面向所有歧义分析结果,取的是按照系统排序原则首先形成的第一个满足条件的分析结果。

另外不同的pos标记是靠WSD模块来选取的,每个词只有一个pos标记胜出。

如果做不下去了,又发现“里外勾结(甲词的首选pos和乙词的非首选pos类型相配)”,则启动翻盘。

李: 总结一哈。

优选的路径亮相的背后是大量的伪歧义,白老师怎么对付的呢?一个是基于训练出来的语义中间件的WSD,它负责提供每一个词的唯一而合适的pos供给 以 subcats 驱动的 parsing 去匹配。另一个就是 parsing 的算法,想来是糅合了某些语言学原则的,来决定操作的顺序。

这解答了我以前的一个疑问,为什么不可以绕过WSD做深度parsing?

在白老师,是绕不过去的,因为是基础支持。在我这儿,基本上是绕过去了。POS (可以看成是最粗线条的 WSD 的语法表现)我基本是绕过去做parsing的。见:【NLP 迷思之四:词义消歧(WSD)是NLP应用的瓶颈】;【中文处理的迷思之二:词类标注是句法分析的前提】。

能绕不能绕,决定于算法。条条大道通罗马 of course

白老师算法的精炼和操作的简约,是建立在两个基础之上:一个是语言学标注丰富的词典,潜在的路径都藏在里面,就等 matcher去选秀。另一个就是要有一个大数据的语义中间件的有力支持。

我这边也要靠信息丰富的词典,词典的一头是语言学,词典的另一头是HowNet本体,前者是主,后者是辅。

另一个靠山就是规则,根据语言学原则和经验设计出来的支持多层parsing 模块的 hierarchical 的规则集。

多层、细线条规则,为绕过POS和绕过WSD施行对伪歧义免疫的高精度深度分析,提供了条件。parsing 本身的基本机制也很简单,但利用这个机制把语言学揉进去来组织多层,那就是可乐式秘方了。

白:

“貌似咱倆把天聊歪了”--隔壁群里的一句话,离合词活用经典。

李: 就此打住吧。

白:
我说,天好好的,没歪。

当规则寓于词典的时候,WSD不是传统含义,POS也不是。某种意义上说,此时选择义项就是在选择规则。也就是说,绕开彼WSD使用的技术,跟此WSD使用的技术是相通的。

李:
WSD 本来是一个独立的与结构分析不必交融的任务,譬如,bank 是选“银行”还是“河岸”的消歧问题。再如,this coach is believed to be tough 这是以前提过的 WSD 经典案例,说的是,利用语义相谐来做 WSD

coach 有n个义项 tough 也有 m 个,二者互谐的只有两个:
coach【human】:教练 ; tough【human feature】:严厉
coach 【vehicle】 :马车; tough 【object feature】:皮实
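The coach/tough case amounts to filtering the cross product of senses by feature compatibility. The sense glosses and features are the ones listed above; the `COMPATIBLE` set encoding which noun feature goes with which adjective feature is an illustrative assumption.

```python
# Senses and their semantic features, as listed in the example.
COACH = {"教练": "human", "马车": "vehicle"}
TOUGH = {"严厉": "human", "皮实": "object"}
# Assumed compatibility: (noun feature, kind of entity the adjective describes).
COMPATIBLE = {("human", "human"), ("vehicle", "object")}

def harmonious_pairs():
    """Keep only the sense pairs whose features are mutually compatible."""
    return [
        (noun, adj)
        for noun, noun_feat in COACH.items()
        for adj, adj_feat in TOUGH.items()
        if (noun_feat, adj_feat) in COMPATIBLE
    ]

print(harmonious_pairs())
```

Of the four possible combinations only two survive, and neither changes the POS of either word, which is why this kind of disambiguation does not affect the parse structure.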

白:
这不影响结构啊,一个N,一个S/N。pos没有其他选择就不check

李:
这个案例不影响各自的POS,不影响结构,这是 WSD 原本要追求的目标,但不是 parsing 所需要的支持。

然而,如果相谐是需要check的一个条件,出现的情况就是:

1. 由于 sparse data,两个直接量在一起的机会不够,所以系统认为是不相谐: 就是说语义不及格,全靠句法了。如果句法无歧义,没关系。否则影响parsing的质量。

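2. If the data is very large, there is no need to rely on harmony at the subcat level: the literal words tough and coach alone have enough attested instances of harmonious co-occurrence, so semantics supports combining the two, even though whether the sense is 【human】 or 【non-human】 remains unresolved.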

我要说的是,白老师的 WSD 模块不是通常意义的 WSD,而是针对结构歧义(structural disambiguation)而来的相谐的支持,是粗线条的,而且是调用 when needed 的。WSD 的本义不是这个,本义是 lexical disambiguation,是为了确定词义的。本义的 WSD 对 结构 parsing 理论上有帮助,实践中基本不需要。在结构 parsing 的时候,WSD 可以隐含(或成为结果,就是所谓 positive 的副作用),但不必是条件。

即便如此,白老师由于没有显式的多层的 pattern 规则,只有隐含在词典可以被 matcher激发的潜在规则种子,其结果是对所谓 WSD或POS 模块的依赖远远大于多层的规则系统。既然有休眠唤醒,白老师应该也引入了多层。但总体上,白老师的层次是少数的,仍然在传统 parsing 单层搜索空间的延长线上。因此理论上,伪歧义会成为极大的困扰。白老师的创新就在,层次虽然不多,但背靠两座大山。这两座大山,都是传统 parsing 不具备,或者严重不充分的。第一座大山是词典主义标注,这是一个巨大的语言学工作,特别对小词和 top 1000 的用法众多的实词。第二座大山就是大数据的语义相谐的训练。建造这两座山都不是简单的活儿,除了设计家的宏观规划外,牵扯的具体的数据工作和调试测试工作非常庞大。没本事建大山,也就无法克服传统parsing的伪歧义瓶颈。



【李白之14:Chinese deep parsing,说的是 deep!】

白:

“认错的人我原谅了”“认错的人原谅我了”
“这场雨来的不是时候”“这场雨来得不是时候”哪个对?
“这场雨来的不是时候”、“这场雨来得不是时候”,感觉前者说“来”不是时候,后者说“雨”不是时候。个人倾向前者。

李:


谁原谅谁,句式蛮普通的。这里面还有其他的 catch 吗?

白:
两个不同的“认错”:一个是承认错误,一个是认错人,负面sentiment在不同主体身上。原谅的方向不一样,可以反推是哪个“认错”

李:
我很木
没想到第二个认错(人)。

白:
在“认错”与“人”之间,至少一个S,一个O。

李:
两句都理解成“承认错误”不行吗?

白:
一个大概率,一个小概率
把后者标成O,都是大概率。

李:
Got it. (Of course, I only got it after being taught.)

我的问题是,普罗百姓更多人跟我似的木,还是白老师一样敏锐?
第二个问题是,这种不影响句子大局(大结构)的微结构里面的hidden args:decode 出来或不理它,decode 对了或错了,对一般的语义落地目标影响多大?

白:每个都不大,加起来不小。

李:
下面这组更看不出区别了:

白:
的/得,对错别字再加点容忍度,肯定是你这结果

李:
这一组,几乎肯定老百姓对区别无感。
“得”、“的” 混用已经如此普遍,以至于正式文字里面也不少见了。
实践中的体会是,遇到 “得” 就遇到了救星,因为基本可以肯定,用 “得” 的人是有文化的,是有意为之。由于 “得” 的补语标志性很强烈,加上 “的” 用法太多,系统不用担心 “得”,但是对 “的” 不得不格外小心。无论怎么小心,也还常搞不定。现代汉语该死该诅咒的东西很多,“的” 肯定是 top 3 可以千刀万剐的。恨得不止咬牙。

白:
“他菜炒的不够熟练”“他菜炒得不够熟练”呢?前者说的“炒”不够熟练;后者说的“他”不够熟练。其实用哪个字倒在其次,关键是结构不同。不同的萝卜来填坑。

李:

其实第二句的离合词“炒菜”也抓着了,不过显示的时候与 “得字结构” 撞车,没显示出来,这是显示的bug,不是parser的bug。有意思的是,炒菜也可以是名词:

白:
“生的伟大死的光荣”还是“生得伟大死得光荣”?
我觉得谓词的体词化+comments谓词,和谓词与谓词共享坑,根本就是两个结构。
我其实想说,“圈画de圆”和“圈画de慢”是两个结构,不管用“的”还是“得”。
前者是双爹,后者是祖孙三代。
“坑挖de整齐”和“坑挖de突然”也是两个结构。

李:

后者是默认,不论。
前者很 tricky。

白:
这两个用不同字区分,有意义。

李:
没法教育普罗。语言实践中不可操作。

白:
受过教育的尽管都写“得”,其实很勉强。谓词名词化用“的”本来天经地义,为啥算错。
我觉得做区分也是“语文”的需要,并不是“语言学”的必然。所以有今天这样的混用局面
这个区分并不高明。

李:
因为的字过载,其中的 “的字结构” 已经很各别:一个子句突然变成一个 NP,
这已经很让人困扰了(英语的 what-clause 也有这种困扰,容后论),这时候大家在学 “的字结构” 的时候,尽量趋向于收紧。最常最先记住的 pattern 就是“我吃的”、“他读的”、“你扔掉的” 这类。现在突然来了倒装,又夹杂了分离词,别说普罗会懵,就是文化人能整明白的也在少数。

白:
所以我才只问填坑,不贴标签

李:
更主要的还是口语中读音一样,这种细致的语义区别,要想教育用文字区别,不可行。

白:
不说哪个是补语,填坑的方向自然说明一切
@wei 试试旅馆那个?
“这里的旅馆住过的都知道很脏。”
“这女孩见过的都说漂亮。”

这个例子的奇妙之处在于,一般的情态动词和它带的宾语动词具有同一个逻辑主语,但这里却不是。“住过的都知道”似乎像是一个插入语。但是parser怎么会知道这里是插入语?

“这女孩见过的都说漂亮。”

同理。撇开是不是插入语的事情,这就是“很脏”的坑谁来填的问题。有“住过的”和“旅馆”两个选项。为啥不是“住过的”?我之所以拿女孩漂亮的句子做补充,是想说明,这一选择与sentiment无关。


李:


何年何月 肯定做过努力。至于努力的成效就不好说了。汗滴禾下土 有迹可循。弄巧成拙也在所难免。不遇到特别的句子还看不出来。

白:
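“这里的旅馆 住旅馆过的”: this looks like tolerating uncertainty.
Both S arcs point to this set, while sidestepping whether they refer to the same element.

李:
Is that wisdom or slipperiness?

白:
This way of tolerating uncertainty while fencing it in must be carried forward.

李:
That sounds like praise.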

白:
当然
只是还不解渴
期待更多 更系统 更elegant

白:
是知道一个词这样,还是情态动词都这样?

“这个人大家都相信是无辜的”

似乎“相信”也是对的

李:

白:
“这只老虎尿过的都认为是自己的地盘。” 出现反例了。“尿”本是不及物动词,但是这时候要强制提拔一个编外的坑,给“尿”。处所,很幸运地被选中。

李:
还有:“这个小便池尿过的都说干净。”

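A computational linguist's instinct tells us we are approaching a dangerous edge. The sentences are getting stranger and stranger, hovering between grammatical and ungrammatical; worse, they do not feel statistically significant. They are hard to get right, and handling them takes enormous effort for no payoff.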

见好就收。拉倒。

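As mentioned earlier, the English what-clause, like the Chinese 的-construction, easily confuses people. It looks like a wh-clause, yet in the great majority of uses it is an NP, equivalent to an NP with a relative clause whose head has been omitted. Quite a nuisance. As a result, the parser gets dizzy in complex cases.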

What you said is not what you did.

这个中规中矩,还好。

I don't know what you did.
I don't understand what you said.

前一句,是 NP 还是 wh-子句?其实两可。翻译过来就是:

我不知道你做的(事儿)。
我不知道你做了什么。

这种细微差别,老百姓是不管的,也管不了,大家也就打马虎眼抹平了,反正也差不太多。第二句呢,一般理解就是 NP:我不理解你说的(话儿)。

understand and know are near-synonyms, but their subcats differ. know can take either an NP or a wh-clause:

I don't know who you are, but I know what you are
I also know where you live and how you got your permit.

understand 通常带NP,或比 know 对 NP 更青睐。

 


 

【李白之13:所谓话题或大小主语的句式】

白:
“我讨厌那种窗户不透亮的房间”
“我喜欢家不在本地的女朋友”
“他期盼着进入一个放学不留作业的学校”
@崔佳悦 欢迎

李:

欢迎@崔佳悦 虽然词库目前未登录 好在 ne 还是对了。

1. “秦汉胡同学员作品展” 里切词有错,可是 NP 对了,应该不影响parsing该NP涉及到的句子大局;

2. “蒙太格” 不在词典,更甭提“蒙太格语法”这个术语了,但是 NP 还是没错;

3. “逻辑语义学研究”,初期 parse 成主语 S,到了语义中间件(不是白老师那种中间件,是句法后的语义模块)发现是逻辑宾语 O,但并未删除初始的主语结论(其实是有意为之,因为这个更新不是百分百把握,留下初始结果可以增加语义落地 recall,不过是一个儿子两个老子而已的可以容纳的 non-deterministic 的结果);

4. 从事的主语应该是【human】or 【institution】,现在错了,这个可以在语义模块更正的,如果磨细活的话。

5. 最后的话(怎么听上去像瞿秋白先烈就义前的心灵忏悔?): 所谓鲁棒,就是步步为营,不求完美。有知识就细柔一点,没知识就大老粗。

白:
“我不喜欢穿没有领子的毛衣。”

李:
人在外 领导血拼 估计没问题 等回家测试
人在外 == body 在外
不是领导ing shopping 而是陪领导 shopping
估计what没问题 等回家测试what
这个要问白老师 语义中间件搞不定。
估计【白句的测试结果】没问题 等回家测试【白句】
白句是what 问一问小冰
据说小冰是目前对discourse最能耐的妞儿了。

白:
窗户那个,看不出细节
如果“不透亮”一个坑,而且被“窗户”占了,而“房间”又不属于“那个小集合”,那么“房间”回填“不透亮”应该遭遇阻碍而且不能不了了之。不知道伟哥怎么摆平的。别告诉我就是不了了之了。

李:
做 sentiment 的 这个都摆不平 就完蛋了:窗户不透亮 是 negative 中最具情报价值的信息,我们叫做 actionable insights,比纯粹情绪的宣泄价值大太多

白:
“没有”两个坑,把问题掩盖了。再来:“我最喜欢吃皮儿薄的饺子。”
估计和窗户一样
毛衣、饺子、房间,都是俩爹。

李:


这个、这个 -- 差强人意了点儿。
可毛主席说了,吕端大事不糊涂。大局不错。

白:
领子、窗户、皮儿,都有坑
在图上把俩爹找齐了才漂亮

李:
哪两个爹?
皮儿 (the wrapper) is the S of 薄 (thin), and also a part of 饺子 (the dumpling)?
part-of 关系可以作为支持,不必作为 output,因为这个在 ontology 已经预定了

白:
饺子,皮儿一个爹,吃一个爹。

李:
吃是爹,饺子是宾语儿子。
至于饺子与皮的关系,这是一种不易的本体知识,不是动态的语义,有何必要?HowNet 里面就给了这个关系,这是宇宙真理,放之四海而皆准。没有情报价值。

白:
句法层面不需要知道是part-of,只需要知道有坑

“这些饺子皮儿太厚”,饺子的爹是谁?
是“太厚”,不通。是“皮儿太厚”,违反了词负载结构的原则,答案只能是“皮儿”,而且人家俩的关系就摆在那里。

李:

一个大主语,一个小主语而已,层次不同。
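Those of us doing sentiment analysis are very sensitive to a description like 皮儿太厚 (the wrapper is too thick), because its intelligence value is high. It is a complaint, and the complaint carries actionable insight: next time dumplings are made, keep an eye on whoever rolls the wrappers.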

白:
我就是不满意这个大小主语的说法
本来可以做得更好,而且是举手之劳

李:
目前这个做法 很合理啊。
你看看 “皮” 虽然也是 S, “这些饺子” 也是 S,但两个 Ss 明显层次不同

白:
“张三不知怎么搞的亲戚那么多。”
亲戚不是part-of,但有坑的特点是一样的

李:
从短语结构即可看出:"皮"几乎就是词法 open morphology 而 "饺子"则是句法。

白:
亲戚也是词法?
sentiment里面所谓的aspect,在句法里都可以显性处理,不用走曲折道路。

一个性质,完全平行的句子,被处理成了不同样子。

李:


这些扩展的主谓结构,处于词法句法边缘地带,是句子一级的句法的一个单元。违反了词负载结构的原则,也没什么。

白:
硬盘也是?
这台机器不知怎么搞的硬盘总坏。

李:

说是 open morphology 也差不多:硬盘总坏;身体好;性格糟糕;屏幕太宽;价钱死贵 。。。。。。

白:
本来可以纯句法处理的,只要更新一下观念

李:
现在的处理办法, 经过太多的测试,一直感觉很顺溜
语义落地,特别是落地到 sentiment,真地是一顺百顺
展示这样的sentiment 的短语结构的结果 也很漂亮。容易理解,情报丰富。

白:
在句法上把坑显性化,只会做得更好

“硬盘总是坏的机器不要买”
“秘书过于漂亮的老板不好当”

李:
这个 这个 这个 。。。。这个 这个。。。

反正这样玩法,白老师铁心要玩残它。不过:
“硬盘总是坏的采购员不要买,好的采购员可以买”

白:
这个例子好。

问题是我有更好的方法

李:
Prof. Bai's questions are not my problem, but Prof. Bai's troubles are. (Ha, cf. 【李白对话录之10:白老师的麻烦不是白老师的】)

白:
有自由度了。中间件该出场了。
如果没有自由度,还轮不到中间件呢

李:
中间件是这样的:没有它 parser 一样行走,而且多数情况下走得还不错。
这就是 parser 第一步要做到的目标。
然后,随着中间件不断丰富,parsing 不断细化、语义逻辑化。这是 parsing 的第二步目标,是 semantic parsing 的终极目标,但不是语义落地的必要条件。

白:
当然。但句法不同,伪歧义表现不同,中间件出场的条件自然也不同。
伪歧义多点少点死不了人,断链子就不同了

李:
无论是 白老师的中间件的 integrated or call-on-the-fly 的做法,还是我的 pipeline 的语义模块的做法,本质是一样的。

白:
我在质疑句法本身,而不是parsing。

李:
现在的情况是,第一步目标几乎达到了,靠谱了。第二步还在断续补足中。
句法的目标只能是方便语义落地。如果较好地支持语义落地,那么就是一个好的句法(representation)。

白:
这里的“方便”往往夹杂着对软件和数据legacy的考量,着眼点不在“地”而在“落“
如果没有legacy,反而更容易着眼于“地”。成功是失败之母。因为任何一种成功都有固化自己的冲动。

李:
评判 syntax 是一个困难的事儿。譬如说吧 虽然我们可以批判 penn tree,抱怨它设计得糟糕透了。但是 换一个人制定标准 估计也好不了太多,我们还会批判它。我可以一口气数落 penn tree 的10大错,而且可以条条是道。所以,在认识到这种评判的困难以后,我们可以退一步说,至少我们评判一个 syntax 的表达,着眼点是多大程度 多么明显地 有助于或妨碍了 语义落地,而不是从一个句法算法的一致性与否的角度。

白:
还是那句话,落地是有潜台词的,站在什么legacy的山上唱什么歌。最后是legacy主导一切,落主导一切,地是啥反而可以不care了。

李:
那是。我是站在 sentiment,或 IE 的legacy 上。

白:
不对,IE/sentiment还是地,不是落。

李:
The "ground" (地) is IE (sentiment is IE too); the "landing" (落) is the matching of subtree patterns.

白:
subtree自身的局限决定了落的局限
我大概4-5年前放弃subtree的。
我当时起名叫 ppt:partial parse tree

李:
结构匹配
聚焦就是这么来的。
叫不叫 tree 无所谓 实际是个 sub-graph,我们以前也叫 svo search or svo retrieval or svo matching,有了这个 老鼠爱大米 就可以挖掘出来了。

白:
“他死去多年的消息终于曝光了”“他死去多年的弟弟终于平反了”

李:

这个是经典 minimal pair 了
同位subcat

白:
“专家对这种方法的意见还是不靠谱。”
“还是”歧义了。

李:

“还是” 还不是问题,“对” 是个隐患。
【对 np 的 n】 ,np or pp?

谁的意见?
专家的。
什么意见?
对这种方法的意见
什么不靠谱?
意见不靠谱。
谈的 topic 是?
专家。

专家作为虚的 topic 和 实的 mod 都 decode 了,parser 也算兢兢业业了吧。
流氓耍到了这个境界,也可以立牌坊了,正所谓:功到雄奇即罪名,流氓耍尽成英雄。

【相关】

 

【李白对话录系列】

中文处理

Parsing

【置顶:立委NLP博文一览】

《朝华午拾》总目录

 

 

【李白之12:修正乔老爷的保守派自由派之辨】

白:
“他们把总裁开掉的人训了一顿。”“他们把总裁开掉的人吃了一顿。”

我:
总裁开掉的那些人吃了一顿。
把总裁开掉的那些人吃了一顿。

“他们把总裁开掉的人吃了一顿。”属于合法非法边缘,语感上别扭:“他们”与“人”coreference,很多人不接受。

白:
同位语

我:
觉得别扭。
这些句子真心难。
试一试 parser。别扭的说法出来了(第二句),顺溜的句法反而走歪了(第一句):

白:
吃的宾语相谐条件太明显不满足。

他们戴着大盖帽的人很强势
他们把子曰诗云挂在嘴边的人对民间俚语一点兴趣都没有。

我:

“他们戴着大盖帽的人很强势”这句稍微好一点,不过这类句子总体别扭是因为有更简约明了的说法在竞争:

他们戴着大盖帽的人很强势 --> 戴着大盖帽的人很强势
他们戴着大盖帽的那些人很强势 --> 戴着大盖帽的那些人很强势

这个“他们”不仅多此一举,而且平添理解困难。

白:
理解是不应该裁定是否别扭的
生成可以

我:
道理是。但是别扭决定了统计性弱,因此理解系统忽略它后果不严重,甚至总体更有利(减少了弄巧成拙的可能性)。可惜,我们目前坏在没有忽略它。因为 local SVO 很正,想忽略也不容易呢。即便想降低 recall,减少对罕见例子的鲁棒性,也不容易,除非费力刻意为之 。。。

白:
我是在探索方法论问题:不回头的matcher需要看多远。

我:
知道,这是"消息"类的延长线。不过这个同位复杂了,需要回头。不好办。弄巧成拙的可能很大。

白:
如果必须在一个阶段内将错就错,那么等trigger到来之际,强行上车的乘客挤掉之前在车上的哪个乘客,还会不会翻掉更早的盘。

代词相当于有个坑,虽然和谓词隔了一层,但毕竟和“信息”类不同。
非代词同位结构不能这样用。

我:
道理明白。
道理是道理。那什么是什么。

as expected,前一句虽然对了,对得不开心。歪打正着,这不是第一次遇到了。在非设计的成功里,设计者不可能开心。而这一路不好设计。

前句各就各位,一路通畅。正因为此,后者只好把“把”落到定语从句的 head N 身上,又因为“把”的句法强势,“。。。那些人”成了盘中餐。哈,荒诞不过如此,但parsing 的逻辑线条却是清晰的。

白:
这里有个逻辑顺序问题。“把”怎么摆布,是有余地的,“吃”做逻辑宾语的语义不相谐,却是没余地的。应该句法不到山穷水尽,语义不相谐的不要登场才是。

我:
这个说法实践中很容易把人带进坑的。
换句话说,白老师自己有一个路数,按照这个路数,这个说法没啥问题。可是 followers 如果不是那个路数,或不明白那类路数,把这个说法当原则去指导实践,九成以上就掉坑里了。比较容易 follow 而大面上不错的原则还是乔老爷的句法独立原则的修正:句法不到山穷水尽,语义相谐的不要登场才是。对比白老师原则:句法不到山穷水尽,语义不相谐的“不”要登场才是。

白:
实践中,语义不相谐又被采纳的基本是活用性质的修辞,它们都发生在“高确定性、低相谐度”那个区域。如果明明是活用性的修辞用法,但却发生在低确定性区域,只能证明句法本身出问题了。

我:
对啊。
“高确定性、低相谐度”那个区域是不小的一个区间。因此句法独立的做法也不是完全要推翻,适当使用还是有益的。

白:
@wei 这个乔老爷原则用在英语上。
汉语不灵。

我:
明白。但还是一个度的问题。
完全实行乔老爷,根本就没有语义相谐或不相谐的事儿,语义被句法踢得远远的,老死不相往来。Note 我的原则是对乔老爷的修正: 句法不到山穷水尽,语义相谐的不要登场才是。可见,在这个原则下,语义登场了,语义句法融合了。
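白老师's principle is also a fusion, and also an amendment of (or rebellion against) 乔老爷 (Chomsky). But the one-word difference is the dispute between conservatives and liberals. My version: as a principle, do not use semantics unless you have to; when you must, use semantic coherence (相谐), not incoherence. This version is consistent. (1) Not using semantics until forced means syntax leads and coherence is ignored for the time being, so parses that are semantically incoherent but syntactically fine are already in the net, and there is no need to check for incoherence again. As a rough estimate, this settles 90%+ of English and 80%+ of Chinese. (2) For the remainder that syntax cannot settle or gets wrong, semantic coherence is used to refine (syntactic roles refined into logical-semantic ones, e.g. deciding agent subject vs. instrument subject) or to repair (including waking up dormant parses). This road is steadier; at least it feels less likely to end in a pit.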

 

【相关】

《泥沙龙笔记:parsing 的休眠反悔机制》

【立委科普:歧义parsing的休眠唤醒机制再探】 

【立委科普:结构歧义的休眠唤醒演义】

【李白对话录之11:parser 的三省吾身】

【李白对话录之10:白老师的麻烦不是白老师的】

【李白对话录之九:语义破格的出口】

李白对话录之八:有语义落地直通车的parser才是核武器

【李白对话录之七:NLP 的 Components 及其关系】

【李白对话录之六:如何学习和处置“打了一拳”】

【李白对话录之五:你波你的波,我粒我的粒】

【李白对话录之四:RNN 与语言学算法】

【李白对话录之三:从“把手”谈起】

【李白隔空对话录之二:关于词类活用】

《李白对话录:关于纯语义系统》


 

【语义计算:从神经机器翻译谈起】

我:
机器翻译所蕴含的厚重和神圣,在新一代是不可理解的

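When I entered the field I worked on foreign-language-to-Chinese machine translation and never quite dared to touch Chinese-to-foreign-language, because Chinese grammar is hard to formalize. It felt too difficult, and at the time I thought there was no hope of it in my lifetime. Even today, few large-scale formalizations of Chinese grammar are usable in practice, so along the old route Chinese-to-foreign MT could hardly take a step: Chinese analysis is the prerequisite, and only then come transfer and generation.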

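But who could have imagined that machine learning would grow ever more powerful, and that human-translated bilingual text, a by-product of human activity, would keep flowing in almost "naturally"? That is what made deep neural machine translation possible. Analysis and generation are bypassed altogether, and transfer is carried out directly, end to end. Google Translate can thus support translation among dozens of languages under one model architecture. It looks like a miracle, yet it is a technical fact. What is most incredible is that Chinese-to-foreign translation, once considered the hardest, has improved the most (at least for Chinese-to-English). However poor the output, it gives you the gist, instantly and completely free. That beats studying a foreign language for two years and grinding through a dictionary only to remain confused. Outside of heaven, where else is there such a good deal?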

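Machine translation (MT) is the oldest application area in natural language processing (NLP). Since its beginnings in the early 1950s it has carried the youth and dreams of several generations of people in China and abroad, including 立委 as a young man. Today the dream has become reality: embedded machine translation is everywhere on the internet, a handy tool that ordinary people summon and dismiss at will, quietly serving millions of internet users at every moment. My daughter uses it to learn Chinese, to learn Spanish, and to read Japanese anime pages, so routinely that she no longer notices it and takes machine translation for granted. Only when a translation is absurdly wrong does she become aware of it, and then she mocks it: how dumb. Machine translation, for its part, stays the modest gentleman, without complaint or regret. Machine translation is already a natural part of life for my daughter's generation, and I, full of its history and anecdotes, do not know how to tell her about it. From my fragmentary remarks she seems to sense vaguely that machine translation has a special meaning in her father's life, but I still cannot recount it to her the way I would to my own generation, or convey the weight and sacredness that machine translation holds deep inside me. It is not just a generation gap: the leapfrog development of technology has given the two generations utterly different perspectives, which is something to sigh over. (from 【机器翻译万岁】)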

刘:
@wei 深有同感。科学技术的发展真是出人意料,做梦也想不到机器翻译能到现在这个程度。我一个刚入门不久的学生跑NMT,轻松超过Moses十几个点,仅几年前,这还是天方夜谭,要是超出Moses五个点绝对可以发最高等级的论文、拿博士学位了。
而且现在用现有的深度学习工具编NMT程序,代码量跟SMT相比都很小,不像写一个SMT程序,要花大量时间处理小的细节。深度学习的工具本身太强大了。同一套工具,稍加修改,既可以做机器翻译,也可以做语言识别、图像识别。
深度学习并没有解决所有问题,但为我们解决一些难题提供了全新的框架,带来了新的希望,潜力还远远没有挖掘完,这给我们这些搞研究的也带来了巨大的机会

我:
很羡慕ing @刘 那天与讯飞的院长谈这事儿,他也是超级兴奋,说以前以为大约四五年会有全方位的大突破,神经在大系统大应用上全面开花。现在他确信只要2-3年就可以了,到时候很多事情会超出我们的想象。他是这样描述的,非常由衷。感觉是作为一线领航者,他看到一种排山倒海的科学潜力正在转化为技术力量,面对巨大机会忍不住激动。这很感染人。这种心态我可以体会。

biao:
@wei  所以,哥儿几个在这死磕语法似乎很难看到什么时候是出头之日。

科大讯飞的确有过人之处。起码它的语音输入可以让你节约大量时间。
前几天有人在这里抱怨说输入码字太累。实际上现在语音输入完全可以帮助你非常轻松的输入,而且效果很好。
上面这两段话完全是讯飞语音输入的。一个字没有改,十几秒钟搞定,非常轻松。

刘:
我不敢预测哪些问题能解决哪些不能,但总体的进步是可预期的

我:
说语法没有出头之日 是小看了咱语言学家 等到dl打败我的 parser 再说不迟。
事实是 迄今全世界最牛的 dl syntaxnet 仍然是我手下败将
另一个事实是 迄今没有sentiment系统在 open domain social media 这个几乎最难的 space,能赶上我们。Not even close :the margin is almost 20 percentage points apart

所以我跟讯飞院长说 你我是同一类人。不过你在舞台中央 我在野。但是论信心和对nlp的展望 心态和世界观惊人的一致。要不咱们互补、合作、合流,要不咱们就来个友谊赛,我就不自量力一哈。反正论年龄 我输得起 你们输不起 =)
(我输了 就钓鱼去 乐见ai一统天下于dl if they truly deliver as well as nmt did
可是 nmt 有data 而大多数 nlp 没有那么多clean labeled data 啊)

biao:
语法分析最大的问题是不灵活。鲜活的语言千变万化。一句话稍微变个说法,语法分析就抓狂了。

我:
根本不是这回事 你的理解有误

白:
死守固定语序才这样 但语法分析死守固定语序已经是老黄历了
你变个说法给伟哥试试 他会告诉你一个robust的句法分析器能做到什么

从“计算”角度说,黑盒子容纳结构的能力是最本质的。从“语言”角度说,结构应该长什么样,比其他的事情更值得关注。
两栖人

biao:
先分析一个名句:
”其为人也孝悌而好犯上者鲜矣。”

我:
如果变个说法 语法就抓狂 要这劳什子干嘛。语法的目的不就是为了对付变体吗

白:
大战风车,其乐无穷

我:
你弄句文言做啥?这个 sublanguage 里面没钱,开发他有卵用。
“卵” 属于 P 系列:是现代汉语口语的脏字否定限定词,== fucking no,社会媒体口语的这个 sublanguage 我们倒是对付了,不妨试试。

biao:
你的机器怎么知道它是文言文,半文言文,还是白话文?他们都是中文。

我:
不在一个频道 算了

biao:
“工欲善其事,必先利其器”。这是文言文还是白话文?大量的成语是文言文还是白话文?金庸的小说是文言文还是白话文?四大名著,是文言文还是白话文?鲁迅的文章是文言文还是白话文?
这些都是在现实生活中大量遇到的语言素材。绕是绕不开的。

白:
高频小体量,适合死记硬背。文言文句法上并不比白话文更难处理,某种程度上还容易。文言文没有白话文里那种NP、VP串烧。有词类活用,但有规律可循。

我:
文言文长句 相对少。排比 平行用法普遍 也是形式痕迹。还有些非常固定的文言句式 用到特定的文言虚字 可以借力。等退休以后 玩玩文言文应该是一个不错 time killer。文言词汇量大大减小,字基本就是词,但每个字的用法 包括活用或引申用法 就多一些。

白:
关键看WSD一选出错率会不会增大?

我:
有不小比例的wsd,等价于pos,pos搞定 就搞定:老吾老。及物动词的“老”是一个活用义项,词典可以绑架为“尊崇”、“孝顺”之列,与作为形容词的“老(old)”的本义,以及作为名词的“老(the old,senior,parents)”都不同。
文言处理也少了切词错误的干扰 基本没可切之词。字驱动的路子,有很多字典工作可做

白:
有些歧义是简化字造成,之前古籍并无。比如后,简化之前就有这个字,就是皇后的意思。以后的后,之前是“後”。做pos也好wsd也好,要考虑文本的基准。

我:
所谓更多的活用,可以在字典假想如果处于某种活用,它义项是什么,然后绑架,倒也便利。另外,现代汉语对虚词的省略 似乎大于文言中虚字的省略,这也是文言处理的便利,虚字的频繁使用,给确定句子成分的边界创造了条件。

weidong:
娱乐一下:陈亢问于伯鱼曰子亦有异闻乎对曰未也尝独立鲤趋而过庭曰学诗乎对曰未也不学诗无以言鲤退而学诗他日又独立鲤趋而过庭曰学礼乎对曰未也不学礼无以立鲤退而学礼闻斯二者陈亢退而喜曰问一得三闻诗闻礼又闻君子之远其子也
标点断句先

我:
试了一下我的 parser,满篇都是 Next ;=)

weidong:
没有引号连话到哪儿结束都猜半天

我:
索性也试试前面要求的测试


其为人Next 也孝悌,而好犯上者 Next 鲜矣。

以前学美国之音英语900句,都说有900句,英语的基本句型就搞定了。这些年,我都 unit tested 近两万句了。是不是差不多该搞定了?最近翻阅以前内部论坛的帖子,有这么一贴,好玩:

池子里说说无妨,万一明年中文核弹爆了,你们可以作证立委就是钱学森。
作者: 立委 (*)
日期: 2012/04/18 23:13:13
不说的话,将来被代笔,说中文核弹不是我的作品 ,找个旁证都找不到。 

换句话说,各路身怀绝技的侠客剑法可能不同,但有个共识:就是我们面临技术核弹大爆炸的前夕。至于AI泡沫,那是商业上的炒作,技术的发展与成熟只是给了它一个炒作的话题而已。

 

【相关】

机器翻译万岁

【语义计算:没有语言学的计算语言学,NLP的亚健康现状】


 

【一日一parsing:自然语言太难了吗?】

今天微博同仁圈子里盛传下面这个年末搞笑的帖子,标题是 #自然语言理解太难了#,其实一点不难,可见即便是圈子内人,如果没深入做过parsing,有时也被表象迷惑。

#自然语言理解太难了# 转发段子:今年基本已经结束了,我刚在群里问了很多朋友今年挣钱了没?大多朋友都有挣,而且挣得五花八门:有挣个屁的,有挣个锤子的,有挣个毛的,更有甚者挣个妹的,简直奢侈之极!最恐怖的是挣个鬼的!有的还可以,挣个球,下午我碰见一朋友,问今年挣了吗?他望着天空喃喃自语:挣个鸟!
看吧,只要肯努力,什么都能挣到[坏笑]

liu:
乐呵乐呵,语言理解很不易啊。

白:
真心不难
基本上可穷尽

liu:
“挣”后面的惯常搭配和选择限制反而简单?

我:
早已解决

liu:
但是,“挣一辆车”就是它本身的含义了。

我:
简单确定的pattern 有限的词汇可填充项 这是 pattern 的拿手好戏

liu:
这倒是的,它的“能产性”不高。

我:
如果训练 有可能漏掉低频率填充项 sparse data
但对于确定性的 patterns 规则可以一网打尽。

liu:
蕴含、推理方面的理解反而是重要的?

我:
一辆车 走通则
屁屎鸟、妹妈奶奶等 走特定规则
我们做社会媒体分析的 这类玩意儿早涵盖了

白:
“规则+例外”的总描述长度最短,就踩到点儿上了。高频用变量泛化,低频用实例枚举。

我:
parse results

图上只是显示这个结构被抓着了,没有显示系统“理解”这种用法的内部表达:实际上这个 chunk 抓住时,系统也就知道头词是动词“挣”,也知道这是一个口语化的动宾否定式,用了脏字,模式匹配规则“绑架”了这一切。

WD:
放个屁 长个毛 真歧义

白:
不是只针对这些话,是一般性的philosophy
这件事跟“挣”关联弱,跟“个”关联强。

WD:
鬼都挣不着
跟”什么女人“类似,有所谓元语否定用法
就是拒绝前一句话的陈述恰当性 负面评价功能
什么一流大学=什么破大学
甲:挣了不少吧。乙:挣什么屁钱啊,都……

我:
“屁”“鸟”之类 有一个英语 no 的用法,是汉语的np 否定式。一般认为汉语没有 否定限定词 no,其实汉语有 不过汉语的 no 混杂着 负面情绪 用的是脏字。而英语的 no 很单纯。
屁事儿: Nothing
没见屁人。在 “没” 后 脏字成了 any,避免double negative 吧,英语也有:
Ain't see nobody == didn't see anybody

从这个例子说开去:“v个P”,P={P,屁,头,鸟,吊,jiba,妹,鬼,......}
从类似的现象可以看到,小数据为依据的规则系统,有时候比大数据训练的系统,更为有效:更加精准,更加能对抗 sparse data 因此而提高 recall(具有 clear patterns 性质的语言现象,可以一网打尽,完全没有 sparse data 的困扰),模拟语言现象更加直接,因此也更加容易debug和维护。
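The "V个P" pattern above can be sketched as a small rule. This is only an illustration, not the system discussed in the chat: the filler list is a hypothetical closed set copied from the examples in this post, and the verb is simplified to a single Chinese character.

```python
import re

# Hypothetical closed list of vulgar fillers, taken from the examples in the text.
FILLERS = ["屁", "头", "鸟", "吊", "妹", "鬼", "毛", "球", "锤子"]

# One Chinese character as the verb (a simplification), then 个, then a filler.
V_GE_P = re.compile(
    r"(?P<verb>[\u4e00-\u9fff])个(?P<filler>" + "|".join(FILLERS) + r")"
)

def find_negated_vp(text):
    """Return (verb, filler) pairs matched by the colloquial 'V个P' negation pattern."""
    return [(m.group("verb"), m.group("filler")) for m in V_GE_P.finditer(text)]

print(find_negated_vp("他望着天空喃喃自语:挣个鸟!"))
print(find_negated_vp("挣一辆车"))
```

Because the fillers are enumerated, a low-frequency filler is caught as reliably as a frequent one. The sketch deliberately ignores the true ambiguities mentioned above (放个屁, 长个毛), which a real rule would have to exclude.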

在 IE 历史上,直到 MUC-7,当时表现最牛的 NE 系统 NetOwl 就是基于 pattern rules 的,几乎所有的统计对手都拿它作为拼杀的对象。NetOwl 从 SRA spinoff 出去想以 NE 为技术基础,进行商业运作,一开始在分类广告业拿下了一些业务,终究不能持续赚钱,后来被 SRA 收回,逐渐销声匿迹了。后来追随潮流,系统里面也混杂了机器学习的模块。https://www.netowl.com/our-story/

从此在学界就再也见不到规则系统了,哪怕是对于规则非常适用的某些 NE 任务:譬如 时间,数量结构,等。可见潮流之厉害。反潮流者不得食,发不了论文,拿不到 grants,带不了学生,自然自生自灭。

但事物的本质和本性并没有改变,尤其是对于自然语言中的具有 clear patterns 的现象,依据小数据,经过人脑的归纳,数据驱动去开发规则系统,仍然是如上述,具有高效高质量。工业界默默实行的这类人、团队和系统并不鲜见,只不过大家心知肚明,只做不说而已。犯不着顶风作案。相对应,发动群众去标注大数据,然后用大数据训练一个系统如何?这是主流的默认 honored 的方法。如果数据足够大,其质量的确可以接近或匹敌规则系统。当数据量不理想的时候,就捉襟见肘了: 或者 underkill (由于 sparse data,漏掉很多统计性稍弱的变体)伤害 recall,或者 overkill (smoothing 过度,把不该抓的现象抓进),影响了precision

什么叫有 clear pattern 的语言现象呢?举个例子,抓取邮政地址,这个工作我自己作为一个 fun 做过。出来的系统请邮局员工测试过,他们啧啧称奇。美国地址大体是 门牌 街道 城市 州 邮政编码 最后是 国名,patterns 相当地 clear ,可你可能无法想象上述 pattern 构件的变体之多,有些变体绝对是 long tails 再大的数据量也不可能涵盖其组合爆炸的本性。如果你收集了一个巨大的美国地址库作为训练集(大数据),你完全可以设计一个学习系统来做这件事儿。而另一边,虽然也是 data driven,但只需要小数据样本,然后经过人的大脑去举一反三进行开发。可以拍胸脯的是,后一种办法做出来的系统绝对是高质量易维护,天生地具有 sparse data 的免疫性。
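As a rough illustration of what a "clear pattern" looks like in code, here is a minimal sketch of the skeleton described above (number, street, city, state, ZIP code, optional country). The regular expression and the field names are assumptions made for this example; a production rule set would need far more variants for each component, which is exactly the long-tail point made in the paragraph.

```python
import re

# Skeleton: number street, city, ST 12345[-6789][, USA]
ADDRESS = re.compile(
    r"(?P<number>\d+)\s+"
    r"(?P<street>[A-Za-z0-9 .]+?)\s*,\s*"
    r"(?P<city>[A-Za-z .]+?)\s*,\s*"
    r"(?P<state>[A-Z]{2})\s+"
    r"(?P<zip>\d{5}(?:-\d{4})?)"
    r"(?:\s*,\s*(?P<country>USA|United States))?"
)

def parse_address(text):
    """Return the address fields found in text as a dict, or None if nothing matches."""
    m = ADDRESS.search(text)
    return m.groupdict() if m else None

print(parse_address("Send it to 1600 Pennsylvania Ave NW, Washington, DC 20500, USA."))
```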

 

【语义计算沙龙:坐而论道 on 中文 parsing】

董:
刺死前妻男友男子获刑5年 死者系酒醉持刀上门 -- 百度新闻
Stabbed her boyfriend man jailed for 5 years, the drunken knife door --百度翻译
Stabbed his ex-boyfriend boyfriend was sentenced to death for 5 years the Department of drunken knife door -- 谷歌翻译
不知道这样结果是什么智能? -- 人工?鬼工?骗工?

白:
也是醉了

董:
我主要是要探讨“连动”--酒醉,持刀,上门。这三个动词在知网词典里都是有的。 酒醉 -- {dizzy|昏迷:cause={drink|喝:patient={drinks|饮品:{addict|嗜好:patient={~}}}}}
持刀 -- {hold|拿:aspect={Vgoingon|进展},patient={tool|用具:{cut|切削:instrument={~}},{split|破开:instrument={~}}}}
上门 -- {visit|看望}
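The hypernym of 酒醉 reaches "state". The hypernym of 持刀 reaches "action", but unlike 拿 ("hold") it means 拿着 ("holding"), so its definition carries the extra "aspect=Vgoingon". Finally, 上门 is an "action". So I tried the following rules: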
DefineVP1 0712 CN[*pos==`verb`,*def_h=={act|行动},*syl==`2`];L1[*pos==`verb`,*def_h=={act|行动},*def_s==`aspect={Vgoingon|进展}`,*syl==`2`]$L1[*log==`preceding`]@chunk(CN,L1)# // 酒醉持刀上门;
DefineVP1 0722 CN[*pos==`verb`,*def_h=={act|行动},*syl==`2`];L1[*pos==`verb`,*def_h=={state|状态},*syl==`2`]$L1[*log==`preceding`]@chunk(CN,L1)# // 酒醉持刀上门;
心里还是不踏实,因为没有大数据的支持。想听你们的意见。其他例子如:骑车上街买菜遇到一个老同学;

白:
直观感觉,状态的标签不是太好贴。比如,拿着刀子砍人,拿着是状态;抡起斧子砍人,抡起就不是状态?隔着玻璃射击,隔着是状态;打开窗户通风,打开算不算状态 ?
买菜和遇到老同学,谁是前景,谁是背景?谁是主线谁是旁岔,很难说。像伟哥这样一律next最省事。
打开保险射击,打开保险就不是状态

我:
伟哥于是成为懒汉的同义语 。工业界呆久了 想不懒都不成。我曾经多么勤勉地一条道走到黑啊。Next 的好处是拖延决策 或者无需决策。可以拖延到语义中间件,有时也可以一直拖延到语义落地。更多的时候 拖延到不了了之 这就是无需决策的情形。

白:
董老师说的就是语义落地啊。花五毛钱打酱油,花五毛钱打醋。花五毛钱该贴啥标签?
要不是语义落地谁费这事儿。

我:
花 money vp
这个是 subcat 可以预测的模式。凡是subcat可明确预测的句型 通常都不是事儿。给标签于是成为 system internal 的内部协调。

白:
关键是不知道该有多少标签,如何通过粒度筛选、领域筛选、时空背景筛选,快速拿到最有用的标签。

我:
通常的给法是:money 是 o (object),vp 是 c (complement),这是句法。
句法之上这几个节点如何标签逻辑语义 也可以由 subcat 输出端强行给定。譬如 可以给 vp 一个【结果】的标签,vp 是 “花钱” 的结果。
subcat 的实质就是定义输入端的线性模式匹配 并 指明如何 map 到输出端的句法和逻辑语义的结构。这种词典化的subcat驱动简化了分析算法 而且包容了语义甚至常识。
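The idea that a subcat entry maps a linear input pattern to output roles can be sketched as a lexicon lookup. The entry format, the category names (`MONEY`, `VP`) and the function name are hypothetical, chosen only to mirror the "花 money vp" example above.

```python
# Hypothetical subcat lexicon: input pattern -> syntactic roles -> optional semantic labels.
SUBCAT = {
    "花": {
        "pattern": ["MONEY", "VP"],      # what must follow the head word
        "roles": ["O", "C"],             # object, complement
        "semantics": {"C": "result"},    # label imposed on the output side
    },
}

def apply_subcat(head, args):
    """args is a list of (text, category); return (role, text, semantic label) triples or None."""
    entry = SUBCAT.get(head)
    if entry is None or [cat for _, cat in args] != entry["pattern"]:
        return None
    return [
        (role, text, entry["semantics"].get(role))
        for role, (text, _) in zip(entry["roles"], args)
    ]

print(apply_subcat("花", [("五毛钱", "MONEY"), ("打酱油", "VP")]))
```

Since the labels live in the lexical entry, which labels to output becomes an internal decision of the system, as the text says.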

董:
我是因为首先要解决句法关系引起的。例如:欢迎参观;争取投资,就是VO关系,而不是参观游览。也就是说,两个或更多的动词连着时,如何排除歧义?试着只给两个标签:动宾、连动。

我:
一般而言 动宾 是动决定的,连动可以是第一个动决定, 也可以是随机的组合。后者有一个与conjoin区分的问题。
“欢迎” 在词典subcat 中决定了可以带 “参观” 这样的宾语,就事论事 这个“欢迎-参观”的关系几乎是强搭配,与 “洗-澡” 类似。
连动也有词典 subcat 决定的,譬如 “去” vp,“驱车” vp,“出门” vp。
词典决定的东西 没有排除歧义的问题 就是词典绑架 通过 subcat。只有随机组合才有歧义区分的问题。而动宾的本质是不随机,原则上不存在歧义 一律是强盗逻辑 本质就是记忆。可以假设 人的动宾关系是死记在词典预测(expectation)里的,预测实现了 动宾就构建了 这符合 arg structure 的词典主义原则。

董:
负责挖坑,负责浇水,负责填土。。。动宾关系;

我:
负责 vp
为 vp 负责
后者是变式

董:
这么看来,动宾还是连动还是修饰(限定),都由词典解决了。统统做进词典里,就可以了。明白了。

我:
词典主义。随机度太大的组合比较难做进词典。所以一方面尽量做进词典,另一方面 来几条非词典化的规则 兜个底。
随机性而言 似乎 修饰大于连动 连动大于动宾。

白:
如果只有这三个标签,当然做进词典是首选,就怕落地时要的不止这三个。

董:
[Figure: parse result of the test sentence]

这是我刚才试的一个句子。我们为每个节点预留10个子节点。动词与动词也得包括这些。

我:
进不进词典 主要不是有几个标签 而是这个标签的性质。
语言学的理论比较文科,说的东西有些模糊,但大体还是有影子的。
语言学理论中一个最基本的概念区分就是 complement vs adjunct,这是句法的术语,对应到较深的层面 就是 argument vs modifier。一般而言,arguments or complements 都是词典的主导词可以 subcat 预测的。HowNet 从语义层面对 args 已经做了预测。语言学词典(譬如英语的计算词典,汉语的计算词典等)就是要相应地从具体语言的句法表达方式的角度把 subcat 预测的 complements 定义出来。至于 modifier 和 adjuncts,他们的组合性随机,词典就难以尽收。最典型的就是普世的时间地点状语等。世界上的所有事件都是在时间和地点中进行。

白:
跑步去公园,去公园跑步。前者去公园的路上都在跑步,两个事件在时间上重合;后者只有到了公园才开始跑步,在时间上只是先后衔接。
如果语义落地需要对此作出区分,该有什么标签?怎么词典化?
动词为其他动词挖坑的情况都不难处理,难的是压根儿没有标配的坑。这是从ontology的事件根结点继承下来的。

我:
跑步去公园,去公园跑步。
先说第二句:【去 + NP + VP】 这是可以词典预测的,万一预测不准,可以 fine-tune 条件,譬如:【去 + 地点 + 动作】,总之是词典预测的。既然词典预测了,那么该给什么标签就不是问题了。给什么都可以,要什么给什么。
再看第一句:跑步去公园。
去公园 不是问题 这是一个动宾 VP 是词典预测的:【去 + NP】 或 【去 + 地点】。
问题于是就成为 “跑步” 与 VP(人类动作)之间的关系。 这种关系在哪里处理,词典可以不可以预测?

白:
吃口饭去单位,又是接续关系不是重叠关系了

我:
这个的确有些 tricky 但不是无迹可寻。

白:
跑会儿步去公园,也是接续关系了。

我:
偷懒的办法就是有一条非辞典化的模糊的规则 Next 连接二者。
费劲的办法也有:一个是 “跑步去” 词典化 作为“去”的变体,“跑步”是对“去”的方式限定。

白:
现在的问题是,句法上承认next,语义上细化next

我:
另一个词典化的做法是,在“跑步”词条下,预测 movement 的动词 VP, 【去NP】 、【来NP】 、【到达NP】 等等 都符合条件,可以跟在“跑步”后面。

白:
为啥跑步加了时态,限定就失效?

我:
这个预测的subcat里面的句法规定是:
1. 本词不许有显性时态,不许分离;
2. 后面的 VP 必须是 movement;
3. 输出端:本词作为后一个 VP 的限定方式(句法叫方式状语:adverbial of manner)。
Bingo!
至于为啥?这个问题,系统可以不回答,系统可以是数据驱动的。
系统背后的语言学家可以一直为了 “为啥” 去争论下去,系统不必听见。总之是让 “跑会儿步去公园” 不能在此预测pattern中实现。词典化实现不了,那就只好找兜底的规则了,于是 Next 了。【限定】与【接续】的区别由此实现。前者是词典强盗,后者是句法标配。
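The three conditions above, with Next as the fallback, can be written down as a minimal sketch. The `Verb` record and its flags are assumptions made for illustration; in a real parser these features would come from the lexicon and from earlier analysis stages.

```python
from dataclasses import dataclass

@dataclass
class Verb:
    lemma: str                 # dictionary form, e.g. "跑步" for "跑会儿步"
    has_aspect: bool = False   # carries overt aspect or duration marking
    is_split: bool = False     # separable verb realized with inserted material
    is_movement: bool = False  # head of a movement VP (去/来/到达 + NP)

# Hypothetical lexical entries that predict a following movement VP.
MANNER_LEXICON = {"跑步"}

def link(first, second):
    """Label the relation between two adjacent verb phrases."""
    if (first.lemma in MANNER_LEXICON
            and not first.has_aspect
            and not first.is_split
            and second.is_movement):
        return "manner"   # lexicon-driven: first VP is the manner adverbial of the second
    return "next"         # syntactic default: plain succession

go_park = Verb("去", is_movement=True)
print(link(Verb("跑步"), go_park))                                 # 跑步去公园
print(link(Verb("跑步", has_aspect=True, is_split=True), go_park)) # 跑会儿步去公园
```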

白:
在词典之外搞几个标签模版也不难,句法上都对着next,只不过依据前后subcat细化了,这有多困难,而且清爽。

我:
亦无不可。差不多是一回事儿。一碗豆腐,豆腐一碗,就是先扣条件还是后补条件的区别而已。无论前后,总之是要用到词典信息,细线条的词典信息。

白:
看上去不那么流氓

我:
先耍流氓【注1】,还是先门当户对,是两个策略。
很多年前跟刘倬老师做专家词典。他是老一代无产阶级革命家,谆谆教导的是不能耍流氓,要门当户对,理想一致了才能结合成为革命伴侣。后来到了美国闹革命,开始转变策略,总是先耍了流氓再重新做人。其实都是有道理的。

白:
@董 跑步和上班是先后关系,跑步和去是同时关系。

董:
这句分析后,有两个“preceding”,不符合我们理想的结果。我们要的是“跑步”是“去上班”的manner 才好。因为我们要准备用户提出更多的信息要求。例如:系统要告诉用户,我平时是HOW去上班的。

我:
刘老师做系统是在科学院殿堂里面,可以数年磨一剑,we can afford to 不耍流氓。来美国闹革命拿的是风投的钱,恨不能你明天就造出语言理解机器人出来,鞭子在上,不耍流氓出不了活。形势比人强,不养童养媳成不了亲,不抓壮丁打不了仗,于是先霸王,然后有闲再甄别。

董:
是的,我们现在连科学院殿堂都不是,而是家庭作坊,可以慢慢磨。其实已经磨了20多年了。

我:
我还记得当年我们为了一个不足100句的英语sample,翻来覆去磨剑磨了两三年,反复地磨平台、磨算法和磨规则。当时的董老师已经大数据(现在看也不是大数据了)开放集测试“科研一号”【注:中国MT划时代的第一款工业产品“译星”的前身】了。

董:
我们给我们的现在开发的中文分析的目标是:看看能最大限度地挖掘出多少信息。

我:
董老师20年磨出的 HowNet 打下了语言分析的牢固基础。现在是把普世的 HowNet 细化为具体语言的句法规定。路线上是一脉相承的。换个角度看,董老师在 HowNet 中已经把普世的 Subcat 的输出端统一定义了,现在是要反过来再进一步去定义具体语言的句法表达形式,也就是输入端的pattern和条件,然后把二者的映射关系搭上,大功即告成。先深层结构 和 UG,然后回过头来应对每个语言的鸡零狗碎的形式。

董:
这倒是的,我们这个中文系统还没到半年,就有点模样了。词典22万义项,规则近4000条。当然,要真正交给用户,那还有一段磨的。

我:
蛮 impressive。我们开发四年多了,但绝对没有 8x 的规则量。

董:
这回我们不做中英翻译,因为英语生成我们做不起,又没有大数据的。其实做出来也只是给别人添砖加瓦,多一个陪着玩的。这种事情我们不玩的。

我:
对,MT 从大面上就拱手相让吧,数据为王。 符号逻辑和规则路线现在的切入点就是应对数据不足的情境:其实数据不足比人们想象的要严重得多,领域、文体等等,大数据人工标注根本玩不起。不带标的 raw 数据哪里都不缺 但那比垃圾也好不了多少。

宋:
"中国对蒙出口产品开始加征费用"

白:
这个哪里特殊?

宋:
中国对(蒙出口产品)开始加征费用, (中国对蒙)出口产品开始加征费用

白:
进口出口,应该站在自己立场吧

宋:
出口是自己的立场,但也有两种解读:蒙古出口,中国对蒙古出口。我一开始理解为后者,看了内容才知道是前者。

我:
这个 tricky,在争抢同一个介词“对”:对 np 征税;对 n 出口。
远距离赢。

白:
常识是保护自己一方的出口,限制非自己一方的进口

我:
远距离原则有逻辑 scope 的根据。但是具体看 很难说 因为汉语的介词常常省略。scope 的起点用零形式 并不鲜见。
“对阔人征税” 可以减省为 “阔人征税”;“对牛肉征税” 可以简化为 “牛肉征税”。但 “对蒙古出口”,不可简化为 “蒙古出口”。本来也可以简化的,但赶上了 “出口” ,逻辑主语相谐。“牛肉” 与 “征税” 没有这种逻辑主谓的可能,于是“对”可省 而NP的逻辑语义不变。

白:
势均力敌时,常识是关键一票

宋:
这个例子在我所看到的语境下是远距离赢,在别的语境下则不一定。因此,分析器是否应当给出两个结果,然后在进一步的处理中再筛选?

我:
Giving two results is not hard in principle, but it makes trouble downstream.

白:
其实关键是什么时候定结果,几个倒在其次

我:
"中国对蒙出口产品开始被加征费用"

加了一个 被 字 哈哈 可能是蒙古对中国的反制。

白:
两个对,有一个和被不兼容

【注1】所谓parsing耍流氓,指的是在邻近的短语之间,虽然他们之间句法语义关系的条件和性质尚不清晰,parser 先行把他们勾搭上,给个 Next 或 Topic 之类的虚标签,类似未婚同居,后去或确认具体关系,明媒正娶,或红杏出墙,另攀高枝,或划清界限,分手拉倒。

 


【李白对话录之10:白老师的麻烦不是白老师的】

我:

突然想起一句话 怕忘了 写在这:

“白老师的麻烦是 他懂的 我不懂 我懂的 他懂。”

谁的麻烦?

乔姆斯基说 麻烦是白老师的

菲尔默说 麻烦是我的

后一种语义深度分析的结论是如何得出的?

语义要多茁壮 才能敌得过句法的标配啊。

而且这种语义的蛛丝马迹并非每个人都有捕捉的能力 它远远超出语言学 与一个人的背景知识和领悟力有关

遇到这种极深度的人工智慧 目前能想出来的形式化途径 还是词驱动比较靠谱 如果真想较真探索的话

“麻烦 问题 毛病” 这类词有两个与【human】有关的坑

一个是标配 表达的是所有关系 possessive

另一个是 about 要求填坑的是 【event】或【entity】 后者自然也包括 【human】

白:

“他的教训我一辈子忘不了”

谁被教训?

我: 哈。

回到前面, 近水楼台的 【human】 “白老师” 是标配。

另一条词驱动的可能路径自然休眠。因为词驱动 也就埋下来唤醒的种子。

上下文中遇到另一个 【human】 candidate “我”,加上其他一时也整不清楚但终究可能抓到的蛛丝马迹, 于是休眠唤醒 了。

白:

好像sentiment在休眠唤醒中起比较重要的作用

我:

此句是一例 本来是褒 可不唤醒就是贬了。

白:

标配的麻烦,把负面情感赋与那谁,等到后面说的都是正面,纠结了,另一个human就有空子钻了。

我:

对对对

这个 trick 我们做了n年 sentiment 摸索出来了就在用。典型案例是: “Thank you for misleading me”

Thank 里表达的抽象的褒 由于遭遇了 misleading 的较为具体的贬 而转化为讽刺。

还有:“你做的好事儿 great”。这里 great 的讽刺也是有迹可寻的。

白:

more specific expressions承载的sentiment优先

我:

遇到过两次记者采访,两次都被问到 你们教给机器 sentiment,机器可以理解正话反说 和 讽刺 吗?

我的回答是:这是一个挑战 但其中的一些常见的讽刺说法 是可以形式化 可以捕捉到的。举例就是上面。

白:

具体override抽象。

我:

yes yes yes

白:

如果二者纠结,具体承载的sentiment才是基调,抽象的反向sentiment不是抵消而是修辞手法的开关。
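The rule "the specific overrides the abstract, and a clash switches on sarcasm" can be sketched as follows. The two tiny word lists are hypothetical stand-ins; the real system described here works on parsed structure, not on word lists.

```python
# Hypothetical mini-lexicons: abstract emotion words vs. specific pros/cons.
ABSTRACT = {"thank": "pos", "great": "pos"}
SPECIFIC = {"misleading": "neg", "crashed": "neg"}

def analyze(text):
    """Return (polarity, sarcastic) for a short English message."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    abstract = next((ABSTRACT[w] for w in words if w in ABSTRACT), None)
    specific = next((SPECIFIC[w] for w in words if w in SPECIFIC), None)
    if specific is None:
        return abstract, False
    # The specific sentiment sets the polarity; a conflicting abstract
    # sentiment does not cancel it but marks the message as sarcastic.
    sarcastic = abstract is not None and abstract != specific
    return specific, sarcastic

print(analyze("Thank you for misleading me"))
```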

我:

我一直在强调,sentiment 的世界里面,主要是两类东西:一类是情绪的表达,一类是情绪背后的理由。

有些人只表达情绪,但有些人为了说服或影响别人,好恶表态的前后,会说一通理由:you make a point,then you need to support your point with arguments

所谓 sentiment analysis 很长一段时间 领域里面以为那是一个简单的分类问题:thumbs up thumbs down。这个浅陋而流行的观点只是针对的情绪,而面对情绪背后千变万化的理由 就有些抓瞎了。可是没有后者,那个sentiment就没啥特别的价值。

所谓讽刺,只是情绪的转向,正话反说。具体的理由是不能转向的,否则人类的交流就没有一个 protocol 而可以相互理解了。褒贬里面具体的东西 我们叫 pros and cons, 那个东西因为其具体,所以语义是恒定的,不会轻易改变。

情绪却不同。人是一个奇怪的动物,爱极而恨,恨极而爱,都有。甚至很多时候 爱恨交织 自己都搞不清楚。表达为语言,就更诡异善变。

英语口语中 sick 是强烈的褒义情绪,shit 和 crap 等词也不是贬义,bad ass is very positive too:

“The inside of a prius is bad ass no lie.” 是非常正面的褒奖。

人类在情绪表达中说反话,或者由于反话说常了 community 都理解成正话了,这种情形也屡见不鲜。

关键词的褒贬分类系统遇到这种东西不傻眼才怪:当然如果input很长,可以 assume 这类现象只是杂音,整个关键词分类还可以靠谱。但一旦是社会媒体的短消息,这种语言模型比丢硬币好不了多少。

汉语中 老婆太喜欢老公了 喜欢到不知道怎么好了 就说 杀千刀的。

再举一个今天遇到的 sentiment 实际案例:
@Monster47_eNd nah, you have no idea how bad I would kill to eat taco bell or any kind of shit like that.
瞧瞧里面的 sentiment triggers: bad;kill;shit 三个都是强烈的 negative triggers
谈论的 topic 是 Taco Bell,一家流行的墨西哥快餐连锁品牌。
这条短消息通篇没有褒义词出现,因此没有理解、缺乏结构的关键词系统只能得出贬义的结论。但这句话其实是对 Taco Bell 异乎寻常的褒奖 用的是完全草根普罗的用语。

谷歌的神经翻译遇到口语化的句子也基本抓瞎,训练的数据严重口语不足(那是因为双语语料质量过得去的来源大多是正规文档,组织人力去标注口语,做地道的口语翻译,是一个浩大的工程,巨头也无能为力吧):
@ Monster47_eNd nah,你不知道我會殺了多少吃塔可鐘或任何種類的狗屎。

尝试“人工”翻译一哈:
@ Monster47_eNd nah,你不知道为了能吃上Taco Bell 的东东,我會怎样不惜代价(哪怕让我杀人都行)。

简单的译法是:
想吃 Taco Bell 这样的垃圾,我他妈都想疯了。

谁要再说 sentiment 好做,我TM跟他急。这无疑是 NLP 中最艰涩的果子之一。
【相关】

《泥沙龙笔记:parsing 的休眠反悔机制》

【立委科普:基于关键词的舆情分类系统面临挑战】

【立委科普:舆情挖掘的背后】

【李白对话录之九:语义破格的出口】 

李白对话录之八:有语义落地直通车的parser才是核武器

【李白对话录之七:NLP 的 Components 及其关系】

【李白对话录之六:如何学习和处置“打了一拳”】

【李白对话录之五:你波你的波,我粒我的粒】

【李白对话录之四:RNN 与语言学算法】

【李白对话录之三:从“把手”谈起】

【李白隔空对话录之二:关于词类活用】

《李白对话录:关于纯语义系统》


【一日一parsing:“这瓶酒他只喝了一杯”】

白:
“这瓶酒他只喝了一杯。”
两个量词(瓶、杯)和一个名词(酒)关联。
三个问题:1、“这瓶酒”是什么成分?为什么?2、“一杯”是回指到句中的“酒”还是指到另一个省略了的“酒”?3、如果“喝”的逻辑宾语是杯中酒,那么瓶中酒又是什么逻辑角色?
就是说,如果把逻辑宾语看成“部分”,其相对的“总体”提前为“话题主语”或“大主语”,那么后者到底填了什么坑?目测已经没位置了

詹:
“语文他答对了三道题。”跟白老师例子类似。
他只喝了这瓶酒中一杯的量
这瓶酒他只喝了一口
这瓶酒他只喝了二两
“喝”事件可以设计一个“消耗量”的事件元素
“这瓶酒他喝了一大半”

白:
随意增减动词坑的数目总是不好,量词倒是可负载两种结构:一种是绝对量,一种是相对量。相对量有坑,绝对量没坑。

詹:
动词的坑的数量可以设计(因而可调)。消耗量设计为“喝”的一个坑,可以跟“讨论、谈、喜欢”这样的动词对比。“这瓶酒他们讨论了一杯”不能接受。因为“讨论”类动词没有预留这个坑
“这瓶酒他们讨论了一天。”
请教白老师说的绝对量和相对量具体如何理解?形式区别是什么?

白:
相对量和绝对量都是数量组合。绝对量与中心语结合,相对量中心语省略,但与同形的先行中心语形成远距离照应。
“山东聊城市”

我:

[Figure: parse tree]
句法是清楚的。

白:
buyu是个大杂烩 装了很多不同的东西,从填坑角度看更是五花八门缺少共性。

我:
那就加个标签【数量补语】,与其他补语对照:【程度补语】【结果补语】或【原因补语】等。如果想进一步区分 “喝了一杯” 与 “喝了一斤”,还可以进一步区分 根据数量结构本身的子类即可。句法到这一步 落地应该水到渠成了。

白:
那倒不必。喝了一口有点麻烦。可是这不是一个好的二元关系。
或者说,buyu才是真正的宾语,O反而只跟buyu发生直接关系,通过buyu才跟动词发生间接关系。O跟buyu的关系是明确的总分关系

我:
喝---酒 应该是直接的关系 否则 语义不搭。

白:
一杯后面有个省略的酒
正常也可以说,走,喝两杯去。省略是肯定的,省略的是酒,则是通过先行词照应出来的。先行词是茶,省略的就是茶。杯和酒,也有强关联,不管语义上还是统计上。
试试:“这瓶酒张三只喝了一杯,李四却喝了三杯。”
要想把“一杯”和“三杯”都分析成buyu,还有点小难度呢。
“一瓶酒四个人喝,张三和李四各喝了一杯,王五和赵六各喝了两杯,瓶里还剩一杯,问这瓶酒共有几杯?”

我:

[Figure: parse tree]

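Consistent or not, it is fine as long as the downstream is prepared for it. In our grounding module we are in fact prepared: we do not expect syntactic analysis to yield fully consistent results. The relation label is only one of the conditions for grounding, not the only one; if both x and y are possible relations, the inconsistency is handled as x|y, which generally does not affect the result.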

白:
“X杯”都分析成buyu吗?
不好的句法不一致多些,好的句法不一致少些

我:
一切都是平衡,某个条件宽了,另外的条件就可以弥补。

白:
遇到不好的句法,不一致不是不能对付,只是一边对付一边喷语言学家而已。

我:
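It is the same everywhere. If parsing goes badly you can blame whoever developed the POS module; if POS goes badly you can blame the lexicographers for a poor dictionary, or say the learning module is lousy and cannot cope with sparse data. But in a real development environment, internal coordination is what counts in the end. Passing the buck never builds a good system.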

白:
But with a small adjustment to the syntax, it could be done better.

我:

铁路警察各管一段 是一个非常坏的原则,adaptive dev 才是正道。当然,凡事都一个度。

白:
补语和宾语补足语弄成两个东西,一个指向动词,一个指向名词。已经做了初一,还怕十五么?

我:
一杯和酒 脱离上下文 也有很强的特征上的不同 而且也有ontology或大数据方面的高度相关性。因此 句法把它们连成 x 也好 y 也好 都不是大问题,因为各自的本性的、静态的标签是恒定的、随时可check 的

白:
这话推到极端,就是不要句法也行
可你老人家早就有话等在那里,有现成的梯子,为什么不用?
我现在要说,反正也没到顶,有另一部可以爬得更高的梯子,为什么不用?
与大数据或ontology的关系,自然语言是跑不掉的,波粒二象性摆在那里。
其中可以帮到句法的部分,封装成中间件直接拿来用,早已不是禁忌。

我:
真地没看到显然的必要性,起码对于抽取情报,V 连上了实体 N做 O,连上了数量做 Buyu,想从中抽取啥都可以。要细做,也最多是把 Buyu 和 O 再加一条通道,说 Buyu 是限定 O 的。

白:
看看上面的应用题。要解题,不知道总分关系怎么解?不把句法关系标成一致,怎么获取总分关系?
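Once every "X杯" is tied to the bottle through the same whole-part relation, the word problem reduces to a sum. A minimal sketch, assuming the parser has already produced a uniform list of parts for the whole 这瓶酒; the dictionary below is typed in by hand to stand for that output.

```python
# Parts of the whole "这瓶酒", in cups, as a uniform whole-part analysis would deliver them.
parts_in_cups = {
    "张三": 1,
    "李四": 1,
    "王五": 2,
    "赵六": 2,
    "瓶里剩余": 1,   # what is left in the bottle
}

def whole_from_parts(parts):
    """The whole is the sum of its parts."""
    return sum(parts.values())

print(whole_from_parts(parts_in_cups))
```

If some of the "X杯" phrases were labelled buyu and others as objects, collecting this list would already require special handling, which is the point being argued.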

我:
自然语言理解落地为自动解题,作为复杂问答系统的一个分支,这个倒是确实要求比一般情报抽取要高。那天与胡总聊到高考机器人项目,胡总说,数学应用题道理上应该电脑是大拿吧。可惜,电脑读不懂应用题。自然语言理解是拦路虎。如果读懂了题,转化成了公式,电脑当然当小菜来解题。

白:
NLU做应用题,@约翰 师兄三十几年前就在做了。

我:
做几何题,@严 也兴趣了很久。

白:
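If slot-filling governs the syntactic relations, this stops being so awkward. Carry binary relations through to the end, and carry lexicalization (词例化) through to the end. 吴文俊's team in fact also did some work on understanding geometry problems, but mathematicians consider it dirty, tiring work with no academic value, so they stopped after a brief attempt.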

wang:
机器做数学应用题,是验证自然语言理解效果的一个非常好的测试。但是没有市场。
本人2000年是在做小学数学应用题求解系统,当时也是为了检验自然语言理解效果的。当时系统,本群的刘群老师,周明老师,詹卫东老师,董强老师都见过,只是这些老师是否想起16年前的事就不得而知了。
当时演示的应用题“一条河里有4条小船,5条大船,河里一共有几条船?”--对于求解有几条小船,几条大船,或者颠倒顺序,都可以演示OK。但是在北大詹卫东老师把“一条河”改成“一个河”,系统就出不来结果,量词啊,量词没细致考虑。
这都是过去多年的事了,只是这个系统没有市场,最后只能搁浅。落不了地就被历史淹没了。记得当时台湾的中研院许文廉老师也做数学应用题求解。对于几何求解系统前几年看过文献,好像已经非常成熟了。可能语义理解的信息不是复杂,还是封闭环境非歧义语义,也许相对容易,这个后期我关注就不是很多了。

白:
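Swap the content of a word problem and you have a listed company's financial statements. Who would dare say there is no market for analyzing listed companies' statements?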

wang:
白老师,我那个时候抱着系统广泛寻求市场,却没有市场关爱我。

白:
关键是不要被技术的表现形式所迷惑,要看穿技术的实质,有没有用是由实质决定的,不是由眼下的表现形式决定的。定位问题了。天上不会掉下个产品经理,最初的产品经理就是你自己。这世界上能看穿技术实质的人少之又少,要把技术包装对方向,还要扶上马送一程,理解的人才有可能多那么一点点。现在的教育里用人工智能逐渐多起来,但是系统更像系统而不是老师。要想让系统像老师,必须有NLP。像伟哥这样可以躺在垄断场景上高枕无忧,犯不着关注其他场景的人毕竟也是少数。

wang:
遗憾当初没有遇到白老师啊!以白老师的眼力,就活了。
觉得李老师也是在找更宽的场景。
回到昨天的话题“这瓶酒他只喝了一杯”。我的想法是“这瓶酒”--不是补语
应该是个强调部分。类似英语“It is .... that”
这瓶“酒”和一杯(“酒”),这酒是同质的事物,后者必须省略。不同质的事物,必须交代。

白:
还有不涉及量词的总分关系:“我们班的同学就他混到了正部级”
“我们班的同学”相当于瓶中酒,“他”相当于杯中酒。
总分关系,“总”表现为话题主语,“分”表现为动词的直接成分,主语或宾语。
但是按照移位理论,移出来的话题主语的原位必须是某个论元,所以一定要找到这个坑。

wang:
这种情况可否理解介词短语省略了介词“在...中”,(among)
单独“总”这个论元好像对应不了谓词,比如这里“混”

白:
英语介词短语可以修饰名词 总直接对分,分对谓词
我早上核心观点就是这个

wang:
恩,同意白老师

我:
I drink a cup of tea
cup is O of drink and then tea is linked to cup??
this is not what has been practised for long
tea is O of drink and cup (or a_cup_of) is Mod of tea
these are standard treatments

白:
@wei 这个treatment我太同意了。
英语不能省略tea吧。
即使前面提及了tea
壶里的茶我只喝了一杯,英语怎么说?

我:
NMT: I only drank a cup of tea, how to say English?
壶呢?
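So when the neural system translates, it goes with whatever is most common; words left over with nowhere to go are simply not placed, wiped out, out of sight and out of mind. That reads smoothly, but you cannot fudge it like that. Earlier MT, whether SMT or RMT, probably would not dare do this.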

白:
有些口译人士倒是真的如此

刘:
SMT也一样的,经常丟词,还有论文专门研究SMT的丟词问题

白:
我在上交所的时候,就领教过知名公司的随团口译。我们提出的尖锐问题,一律抹平了翻,尖锐的词儿影都没有。有时我不得不自己用英语纠正一遍。

我:
那就是 RMT 不敢丢,其实也不是不敢,是丢不掉。除非生成程序有意设计了丢的条件。默认,实词是不能丢的。
“壶里的茶我只喝了一杯” 应该是:
as for the tea in the pot, I only drank one cup of it.
“it" refers to the "tea"

白:
it,相当于移走的tea的trace 在汉语是空范畴 在英语里总要有个真实代词。从伟哥的英译可以看出,他是真心不把“壶里的茶”当主语或宾语的。

我:
顺便一提,我觉得将来机器口译会有更好的用户体验
这是因为人的口译也就那么回事儿,糊弄的时候多,不合格的口译多,合格的在时间紧张的时候也老出乱子。这个观察在前些时候尝试用 NMT 翻译汉语到英语的时候就很清晰了。当时翻译到了英语以后,第一个震惊是,NND,神经真厉害,然后看到谷歌翻译下面有一个 speech 的按钮,就顺手一按,这一听,是第二个震惊,听上去比读居然更顺耳!读起来别扭或不合法的地方,给当今的语音合成一糊弄,居然那么自然,加上人的口译也是错误不断,相比之下,机器读出来里面有几个错就相当可以接受了。于是我用 iPhone 把那一段录音下来,放到了我的博客里面,让世人见识一下,机器口译不是梦。见:

谷歌NMT,见证奇迹的时刻

以前一直认为,口语到文字是第一层损耗,文字翻译是信息的第二层损耗,再从目标语文字到语音,是第三层损耗,损耗这样叠加下来,语音机器翻译是一个完全没谱的事儿。但实际上不是这么回事儿。
这第三层损耗,由于有人的陪绑和陪衬,不但不减分,反而加分。第一层的问题也基本解决了。当然前提是语音技术要神(经),语音合成要做得自然巧妙,而这些现在已经不是问题了。前几天讯飞合成一个广告词,居然声情并茂。

赵忠祥当年深陷录音门丑闻,声誉形象大减,那是错了时代。隔现在,赵大叔可以一口咬定那个录音是机器假冒的。

白:
啥时候声乐也能人工合成了,让帕瓦罗蒂唱我写的歌。

我:
白老师等着吧,不远了。

 


【一日一parsing:他 / 喝了 / 三碗 / 汤】

bai:
“他汤喝了三碗”
问题:“三碗”指向“汤”还是“喝”还是自己的省略被修饰语?
问题:它和“他喝了三碗汤”在语义上等价吗?

马:
强调的内容不一样吧,前者强调喝了三碗的是汤不是别的,后者强调是三碗

我:
要挖出变式的 nuances,不如把表层结构包括词序的差异保存 等到落地的时候 由应用的需要来决定这种差异是不是有必要。脱离落地谈细微差别 及其抽象表达,容易莫衷一是 也容易丢了西瓜。

他喝了三碗汤
他喝了汤三碗
三碗汤他喝了
汤他喝了三碗
他汤喝了三碗
? 他三碗喝了汤
? 三碗他喝了汤

最后两个变式走在句法的边缘。

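One label is Mod, the other is buyu; everything else is identical, including the separable compound verb 喝汤. All surface-structure information, word order included, remains accessible if needed. A parser's internal representation is usually built incrementally, enriching information as it goes; unless an update is made to correct an error, past information is not lost. This is also why the dormancy and wake-up mechanism we discussed before can work: the original state being woken up was never lost, so a substring can always be redone with a second round of parsing. Pushed to the extreme, an entire sentence can be torn down and redone, since the original token string is never discarded. In practice, of course, wake-up almost always targets one subtree of the sentence; even a poor parser is not so wrong that everything must be redone.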

Topic 再进一步转为 S 就完美了,语义中间件还有细致的工作可做。

最后这两句句法边缘的句子不是不可能出现,但比较罕见,对于毛毛虫边缘的毛刺部分的现象,合法非法中间的数据,如果不常见,那就拉倒,parser 出啥结果都无需太 care,反正有做不完的活计,不值当在它们身上花时间。


【社煤挖掘:为什么要选ta而不是ta做总统?】

中文社煤挖掘美国大选的华人舆情,接着练。

Why and why not Clinton/Trump?

Why 喜大妈?Why 川大叔?Why not Clinton? Why not Trump?这是大选的首要问题,也是我们舆情挖掘想要探究的重点。Why???

First, why Clinton and why not Clinton? 看看喜大妈在舆情中的优劣对比图(pros and cons)。

[Figure: sentiment drivers for Clinton, pros and cons]

why Clinton?剔除竞选表现优秀等等与总统辩论和 campaign 有关的好话(“领先”、“获胜”、“占上风”、“赢得”等)外,主要理由有:

1. seasoned, tough; 2. optimistic; 3. clear; 4. energetic, chatting and laughing with ease; 5. dreaming of a common market

拿着放大镜,除了政治套话和谀辞外也没看到什么真正的亮点。舆情领先,只能说对手太差了吧。四年前与奥巴马竞争被甩出一条街去,那是遇到了真正的强手。

OK,why not Clinton?

1. 性侵 性骚扰 威胁(她丈夫做的好事,她来背黑锅,呵呵。照常理她是受害者,可以同情的,不料给同样管不住下半身的川普一抹黑,她倒成了性侵的帮凶,说是威胁被性侵的女性。最滑稽的是,川普自己的丑闻曝光,他却一本正经带了一帮前总统克林顿的绯闻女士开记者会,来抹黑自己的对手克林顿夫人。滑稽逆天了。)

2. 邮件门 曝光 泄密

3 竞选团队的不轨行为 操纵大选 作弊

4. 克林顿基金会的问题

5. 华尔街收费

6 健康问题

7 撒谎、可耻

8. 缺乏判断力

这些都不是新鲜事儿,大选以来已经炒了很久了,但比起她的长处(经验老练等少数几条),喜妈被抓住的辫子还真不少。再看网民的情绪性吐槽, 说好话都是相似的,坏话却各有不同:轻的是,“乏善可陈”、“不喜欢”、“不信任”; 重的是:“妖婆”,“婊子”、“灾难”、“无耻”、“邪恶”。

[Figure: sentiment drivers]
作为对比,来看川大叔,why or why not Trump?

[Figure: sentiment drivers for Trump, pros and cons]

pros:1. 减税;2. 承诺 崛起 (America great again);3. 真实;4. 擅长 business
cons:
1. 曝光的视频丑闻 性骚扰
2. 偷税漏税
3. 吹嘘
4 咄咄逼人 喜怒无常
5 粗鄙、威胁
6 撒谎

情绪性吐槽,轻的是 “不靠谱”、“出言不逊”,重的是 “恶心”、“愚蠢”、“卑劣”、“众叛亲离”。

[Figure: sentiment drivers]
上篇中文社煤自动民调博文发了以后有朋友问,为什么不见大名鼎鼎的脸书。(微信不见可以理解,人家数据不对外开放,对隐私性特别敏感,比脸书严多了。不过,地球人都知道,反映我大唐舆情最及时精准的大数据宝库,非微信莫属)。查对了一下,上次做的中文舆情调查,不知何故 Facebook 不在 top 10,只占调查数据的 0.1%:

[Figure: data sources]

记得以前的英语社煤调查,通常的比例是 70% twitter,20% Facebook, 其他所有论坛和社交媒体只占 10%。最近加了 instagram、Tumblr 等,格局似有变。但是中文在海外,除了推特,Facebook 本来应该有比重的,特别是我台湾同胞,用 Facebook 跟东土用微信一样普遍。

再看看这次调查的网民背景分类。

1.  职业是科技为主(大概不少是咱码农),其次才是新闻界和教育界。这些人喜欢到网上嚷嚷。

[Figure: professions]

这是他们的兴趣(interests),有意思的关联似乎是,喜欢谈政治的与喜欢谈宗教和美食的有相当大交集。

[Figure: interests]

这是年龄分组,分布比较均匀,但还是中青年为主。

[Figure: age groups]

性别不用说,男多女少。男人谈政治与女人谈shopping一样热心。

[Figure: gender]

最后看看地理分布,社煤的地理来源:
[Figure: geographic regions]

 

 

【相关】

【社媒挖掘:川大叔喜大妈谁长出了总统样?】

Big data mining shows clear social rating decline of Trump last month

【川普和希拉里的幽默竞赛】

【大数据舆情挖掘:希拉里川普最近一个月的形象消长】

论保守派该投票克林顿

【立委科普:自动民调】

【关于舆情挖掘】

《朝华午拾》总目录

【社媒挖掘:川大叔喜大妈谁长出了总统样?】

眼看决战时刻快到了,调查一下华人怎么看美国大选,最近一个月的舆情趋势。中文社会媒体对于美国总统候选人的自动调查。


先看喜大妈,是过去三十天的调查(时间区间:9/26-10/25)
[Figure: summary metrics]
mentions 是热议度,net sentiment 是褒贬指数,反映的网民心目中的形象。
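The post does not say how net sentiment is computed. A common convention, assumed here purely for illustration, is the share of positive mentions minus the share of negative mentions among all opinionated mentions; the counts in the example are made up.

```python
def net_sentiment(positive, negative):
    """Assumed definition: (positive - negative) / (positive + negative)."""
    total = positive + negative
    if total == 0:
        return 0.0
    return (positive - negative) / total

# Made-up counts: 555 positive vs. 445 negative mentions.
print(f"{net_sentiment(555, 445):+.0%}")
```

Under this definition the score ranges from -100% to +100%, and it is independent of the mention volume, which is why the two metrics can move in different directions.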

[Figure: summary metrics]
很自然,二者并不总是吻合:譬如,在十月10日到11日的时候,希拉里被热议,而她的褒贬指数则跌入谷底。那天有喜大妈的什么丑闻吗?咱们把时间按周(by weeks)而不是按日来看 trends,粗线条看趋势也许更明显一些:

[Figure: summary metrics, trends by week]
Anyway,过去30天的总社煤形象分(net sentiment)是 11%,比起英语世界的冰点之下(-18%)好太多了,似乎华语世界远不如英语世界对老政客喜大妈的吐槽刻薄。

作为对比,我们看看川普(特朗普)在同一个时期的社会形象的消长趋势:川普过去30天的总社煤形象分(net sentiment)是 -12%,比希拉里的+11%成鲜明对比。

[Figure: summary metrics, trends by week]

看上面的趋势图(by weeks),川普的热议度一直居高不下,话题之王名副其实,但他的社会评价却一直在冰点之下,十月初更是跌入万丈深渊。同时期的希拉里,热议度与社会评价却时有交叉。趋势 by days:

[Figure: summary metrics, trends by day]

这样看来,虽然有所谓华人挺川的民间鼓噪,总体来看,川大叔在华人的网上口水战中,与喜大妈完全不是一个量级的对手。川普很臭,真地很臭。在英语社煤中,川普也很臭(-20%),但希拉里也不香,民间厌恶她诅咒她的说法随处可见,得分 -18%,略好于川普。譬如电邮门事件,很多老美对此深恶痛绝,不少华人(包括在下)心里难免觉得是小题大作。为什么华人世界对希拉里没有那么反感呢?居然给希拉里 +11% 的高评价。朋友说,希拉里更符合华人主流价值观吧。

这是我们的品牌对比图,三维直观地对比两位候选人在社煤的形象位置:

[Figure: brand passion index]

希拉里领先太多,虽然热议度略逊。

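There are always people who question the intelligence value of social media mining, saying the NLU may not be good enough and the mining may be wrong. A more common doubt is that supporters of one party may be more eager to muddy the waters (for example with paid posters or bots). All of this raises questions about how far social media sentiment mining reflects public opinion. Traditional polls, however, also differ from one organization to the next, and manual polls cannot use large samples, so the error margin is large as well; everyone is skeptical of them too. Despite the constant doubt, polls keep being run. In an election year this information is simply a hard necessity.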

所有的自动的或人工的民调,都可能有偏差,都只能做民意的参考。但是我要强调的是:

1. 现在的深度 NLU 支持的舆情挖掘,已经今非昔比,加上大数据信息冗余度的支撑,精准度在宏观上是可以保障的;

2. 全自动的社煤民调,其大数据的特性,是人工民调无法比的(时效以及costs也无法比,见【立委科普:自动民调】);

3. 虽然社煤上的口水、噪音以及不同党派或群体在其上的反映都可能有很大差异,但是社煤民调的消长趋势的情报以及不同候选人(或品牌)的对比情报,是相对可靠的。怎么讲?因为自动系统具有与生俱来的一视同仁性。

时间维度上的舆情消长,具有相对的比较价值,它基本不受噪音或其他因素的影响。也不大受系统数据质量的影响(当然,太臭的舆情系统也还是糊不上墙,跟抛硬币差不了太多的一袋子词这样的“主流”舆情分类,在短消息压倒多数的社会媒体,还是不要提了吧,见一切声称用机器学习做社会媒体舆情挖掘的系统,都值得怀疑)。

我们目前的系统,是 deep parsing 支持,本性是 precision 优于 recall(precision 不降低,recall 也可以慢慢爬上来,譬如我们的英语舆情系统就有相当好的recall,recall在符号逻辑路线里面,本质上就是开发时间的函数)。Given big data 这样的场景,recall 的某种缺失,其实并不影响舆情的相对意义,因为决定 recall 的是规则量,缺少的是一些长尾 pattern rules,而语言学的 rules 不会因为时间或候选人的不同,而有所不同。同理,因为系统的编制是独立于千变万化的候选人、品牌或话题,因此数据质量对于候选人之间的比较,是靠谱的。这样看,舆情趋势和候选人对比的情报挖掘,的确真实地反映了民意的消长和相对评价。下面是这次自动民调的 Top 10 数据来源(可惜没有“她”,我是说 wechat),还是最动态反映舆情的推特中文帖子占多数(其中 66% 简体,30% 繁体,4% 粤语)。

[Figure: top 10 source domains]

看一下popular的帖子,居然小方的也在其列。倒也不怪,方在中文社煤还是有影响力的。

[Figure: popular posts about Trump]

小方总结得不错啊,难得同意他:满嘴跑火车的川大叔是“谎言大王”。其实川普与其说是谎话连篇,不如说是他根本不care 或不屑去核对事实。就跟北京出租司机信口开河成为习惯一样,话说到这里,转一篇我的老友刚写的博文(论保守派该投票克林顿),quote:

川普说话不顾事实是众所周知的。只要他一开口,就忙坏了各种事实核查 fact check ......
更重要的是,川普不仅犯了大大小小众多的事实错误,而且对事实抱着强烈的轻蔑和鄙视。

总结一下这次民调的结果可以说,如果是华人投票,川普不仅是 lose 而是要死得很惨,很难看。(当然,不管华人与否,川普都没有啥胜算。)

[Figure: timeline comparison]

这是 by days 的趋势对比,这种持续的舆情领先在大选前很难改变吧:

[Figure: timeline comparison, by day]

【更多美国大选舆情的自动调查还在进行整理中,stay tuned】

 

【相关】

【社煤挖掘:为什么要选ta而不是ta做总统?】

Big data mining shows clear social rating decline of Trump last month

【川普和希拉里的幽默竞赛】

【大数据舆情挖掘:希拉里川普最近一个月的形象消长】

论保守派该投票克林顿

【立委科普:自动民调】

【立委科普:舆情挖掘的背后】

【社媒挖掘:《品牌舆情图》的设计问题】

一切声称用机器学习做社会媒体舆情挖掘的系统,都值得怀疑

【关于舆情挖掘】

《朝华午拾》总目录

【一日一parsing:parser 超越创造parser的人,不是不可能的】

白:
“那些林彪说过的话”
看看复数指示词(det)是如何跳过单数NP找到自己的中心语的。

我:

[Figure: parse tree]

[Figure: parse tree]
何难之有?

[Figure: parse tree]

看着最后这句出来,不禁有些惶恐:这样下去,机器超越造机器的人,不是不可能的。内行看门道,自不必说,可今天还是对后学做个科普吧:为什么说此句的 deep parsing 牛得达到了语言学专家的水平,已经超越了普通人的语言结构分析的能力呢?这个自动生成、看似简单的树形图涵盖这么多的语言学:

(1) 复数指示词 “那批” 跳过了近距离的“你”,甚至跳过了定语从句的谓词“写-过”,连上了远距离的中心词“文章”,做其修饰语(Mod),牛不牛?

(2) 确定了定语从句(Mod-S)“你写过的”及其中心词“文章”;

(3) 定语从句谓词“写过”的主语(S)“你”和逻辑宾语(O)“文章”(所谓的 argument structure 的解构);

(4) 句首的这个带有定语从句的名词短语(“......文章”),与后续句子的谓词“保存-着”的远距离动宾关系(O)也揭示了,这个也有点儿牛吧;

(5) 事实上,句子主干的主(S)谓宾(O)都是各就各位,还有那些小词也都附着到了应该存在的地方(X)。

从深度结构分析的逻辑语义角度,可以说以上的分析已臻完美。

科普完。

能够达到以上对咱中文语句的语言学自动深度分析(deep parsing)水平的,得瑟一哈,也许算是可以原谅的“寡人之疾”了吧。

得瑟毕。

抹一把插大葱的象鼻,拍拍尘土,咱继续谦虚谨慎愚公移山去也。

白:
最后这句的next有些多余
即使去掉,所有有用的关系都在

我:
Next 是桥梁(敲门砖),本来是可以用完扔掉的,后来觉得留下也可以。
做个青春的纪念。
青春是褒义词,耍流氓是贬义词,但都是一回事儿:盲目躁动。(Next 残存了一点语序的信息,虽然逻辑上没有语序的地位,但在语义落地的时候,这个痕迹有时可能还有一点用。)

我一直相信,结构分析,机器达到或超越人的水平,是在望的。
结构分析后的语义落地,与人类的智力还有一些距离。但是因为语义落地几乎都是面向领域或应用的,因此有 leverage,有些觉得是天大的难题,有时在领域语用里面,就自然化解了,或者简化了。由此看来,NLU (或语义计算)是靠谱的 monster。

近两个月出了两件牛刀宰鸡的事儿。一个是英文,一个是中文。具体不让说,但可以假语村言。都是在某个产品领域被认为是拦路虎的与自然语言有关的难题。研究了一下,回答说,有了 deep parsing 的核武器,这有何难?

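I ran a trial, and it really was using an ox-cleaver to kill a chicken: the problem was solved at a glance. Many people think the nuclear-weapon claim is wild exaggeration on 立委's part. Heaven knows it is not. The party in question said that once this problem is solved in their product area, many follow-up applications open up. Still, unless forced to, I would rather use the ox-cleaver on oxen than get stuck in the henhouse killing chickens without end. There is no glory in such a win. As the old saying goes, do not bow for five pecks of rice. I hope it never comes to five pecks of rice.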


 

【李白对话录:如何学习和处置“打了一拳”】

白:
“张三打了李四一拳”“张三打李四的那一拳”
我的问题:1、“一拳”在两个例子里,跟“打”的“逻辑语义关系”是否是相同的?
2、如果相同,这种关系是不是萝卜和坑的关系?
3、如果是,那么这个坑是“打”自带的,还是被“一拳”的出现逼出来的?
4、非自带但可以被逼出来的坑,是一个个别现象还是一个普遍现象?是汉语特有的现象还是一个语言共性现象?
2':如果不同,第二例中的定语从句和中心语“那一拳”之间的关系是怎么建立的?
“张三喊了一嗓子”“张三喊的那一嗓子,我老远就听见了”,一个道理
另外,“回马枪”“窝心脚”等“工具扩展为招式”固定短语,是不是可以直接略掉量词,与数词结合?

我:

1. 逻辑语义上应该相同,句法上有【主谓】和【定语从句+NP】 的不同,很典型。

2 具体说,“打一拳” 就是搭配,是合成动词,与“洗澡”可比,不过后者是动宾搭配,前者是动补搭配。都是合成词的句法表现,都涉及词典与句法的动态接口。
直接量的搭配,当然属于罗卜与坑。
语言中的萝卜和坑,不外是 :(1)一个直接量(词)准备了一类词(feature)的坑;(2)一个直接量(词)准备了另一个直接量(词)的坑,通常叫强搭配;(3)一类词(feature)准备了另一类词(feature)的坑。(3) 是常规句法的表现,属于空对空,两边都不着地。其规则(feature based grammar)概括性强,但容易遭遇例外的滑铁卢。lexicalized grammar or word driven rules,越来越远离(3),或者把(3)限定在一个极少的数量上。那么就剩下(1)和(2)了。
“打...一拳” 是(1),这就到了你的第三个问题,两个直接量的搭配,谁 expects 谁?
纯技术上讲,根本就没有区分,或者说,等价。x 与 y 相互勾搭,说是 x 勾搭了 y 或者 y 勾搭了 x,都无所谓,反正他们是一家人,本来就是一个词,一个概念,不过到了语言表达,被人为分开了距离。

【3、如果是,那么这个坑是“打”自带的,还是被“一拳”的出现逼出来的?】
“打一拳”就是一个词条,概念上是混为一体的,不分你我,无所谓主次(动补的主次是词法内部的,可以无视)。但是操作上,可以有说法。(不知道汉语的搭配词典里面,“打一拳”这样的条目是放在 “打” 的下面,还是 “一拳” 的下面,还是两个地方都有?)但是,在NLP实现中,“打一拳” 与 “洗澡” 一样,是一个特定的分离词词条。不过是标签不同而已,譬如 Vo 与 Vbu,其他的事儿就交给句法了。

【4、非自带但可以被逼出来的坑,是一个个别现象还是一个普遍现象?是汉语特有的现象还是一个语言共性现象?】
对于直接量搭配,我的看法是,没有自带和被逼的问题,都是两厢情愿的相互吸引。
这个应该属于普遍现象: x--y,汉语有 “洗-澡”, 英语有 “take--bath”。词法是动补或者动词与状语这样的直接量与直接量的搭配,其他语言肯定也会有,不过一时想不到例子而已。

白:
打一苕帚疙瘩,也是搭配
任何顺手的东西,都可以抄起来就打
搭配的做法未免太ad hoc

我:
所有的词典都是 ad hoc,不然就不叫绑架了。但是 词条背后的 x--y 搭配 则是有语言共性的。

白:
问题是不可穷尽,而且本来能产,是一个有规律性的现象,打两鞭子,砍三刀,踹五脚。

我:
不可穷尽 那就不是 x--y 强搭配。理论上 不是 x --- y,就只能是 x ---- feature,或者 feature1 ----- feature2,没有其他的框可以进去。
“砍三刀” 与 “洗三个澡” 可比吗?要是可以,那就是 x --- y,可变的不过是 numeral,两端还是固定的:“踹-脚”,“砍--刀”。

白:
加量词的不算,只算省略量词的.明显的是工具,但是原动词很难说自带了“工具”这个坑。

我:
有些中间地带的现象。
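In the end it is a question of approach. On the lexicalist route, everything in the middle ground goes into the lexicon, with no worry about being ad hoc or redundant; the payoff is precision. With a "traditional" grammar, the middle ground is assigned to syntax and is fully productive; the payoff is decent recall, but exceptions easily spoil it and precision suffers. The two can of course be combined: first write one catch-all rule for recall, then, whenever a middle-ground case goes wrong, use the lexicon to block it. A conceivable catch-all rule for recall looks like this; it exploits the syntactic fact that a Chinese noun usually cannot be modified directly by a numeral: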

V + CD + N --> V Buyu

这一条可以搂住很多,但是危险。修修补补也可以把这条规则的危险减小,但不能杜绝,因为这是 feature based rule 的本性(POS 是 feature )。

Continuing the exercise, we can have a catch-all rule that captures the generality of the phenomenon 白老师 describes:

V +(时态小词)+ CD + N ==> V <-- Buyu[CD+N]

这条规则可以 parse 上面列举的所有现象,但是还是 too “powerful”, recall 有余,precision 不足。不过 precision 这东西,工程上靠的就不断扩大测试,测试不错的话就当没有精度问题,如果测试遇到问题了,有三个路子:(1)一个是在这一条规则中打磨,把 POS 条件细化成子类或ontology,或其他限制;(2) 第二个路子是另写一条细线条规则去 override 它,使得文法成为一个 hierarchy 的模块;(3) 第三个路子就是把错的东西(例外)扔进词典, 这实际上等价于第二条路子的极限 case,把词典当成是 rule hierarchy 的极端。有了这么一个从词典规则,到细线条 feature 规则,最后到 POS 的抽象层规则的 hierarchy 的规则化设计,就可以应对语言的例外、个性一直到共性及其之间的灰色地带。
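Two levels of that hierarchy, the lexical override and the POS-level default `V + (aspect particle) + CD + N`, can be sketched as below. The token format, the particle list and the single exception entry are hypothetical; the middle level (fine-grained feature rules) is left out to keep the sketch short.

```python
ASPECT = {"了", "过", "着"}

# Hypothetical lexical overrides: (verb, noun) pairs for which CD+N is an object, not a buyu.
EXCEPTIONS = {("访问", "国"): "O"}

def label_v_cd_n(tokens):
    """tokens is a list of (word, pos); return (verb, numeral+noun, label) triples."""
    out = []
    i = 0
    while i < len(tokens):
        word, pos = tokens[i]
        if pos == "V":
            j = i + 1
            if j < len(tokens) and tokens[j][0] in ASPECT:
                j += 1                      # skip the optional aspect particle
            if (j + 1 < len(tokens)
                    and tokens[j][1] == "CD" and tokens[j + 1][1] == "N"):
                noun = tokens[j + 1][0]
                # The lexicon overrides the default; otherwise fall back to Buyu.
                label = EXCEPTIONS.get((word, noun), "Buyu")
                out.append((word, tokens[j][0] + noun, label))
                i = j + 2
                continue
        i += 1
    return out

print(label_v_cd_n([("砍", "V"), ("了", "AS"), ("三", "CD"), ("刀", "N")]))
```

Adding an exception is a one-line lexicon change and leaves the default rule untouched, which is what makes the lexicon the extreme end of the rule hierarchy.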

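Too lazy for big data, too lazy even for lexicon-kidnapped collocations: just feed the default rule above into the system and make do for now, then sit and wait for the exceptions to show up over time.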

[Image 0925b]

[Image 0925c]

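Bai:
Why insist on rules at the fine-grained level?
The weaknesses of rules at the level discussed here are exactly where learning has the advantage.

Me:
Going without fine granularity is fine too: grab the two ends and let them carry the middle. At worst there is some redundancy, with everything gray treated as black. “Cannot be exhausted” is just a figure of speech. Statistically, whatever sits in the gray zone can certainly be enumerated; it is only that the enumeration eventually runs into the statistical long tail, where one stops bothering to enumerate.

Bai:
My point is that there is no dichotomy here, as if whatever is not lexicon binding must be rule-based. It can be learning-based.

Me:
Could Bai illustrate the learning-based approach and where its advantage lies? (In fact I do not see this problem as a challenge to rule systems. I do not see it as any harder than “洗澡”.)

Bai:
It cannot be enumerated and the rules are messy, so it is just the thing to learn from a partial set of examples. Features are valuable, and long-tail instances are valuable too; learning them bundled together is the right way, giving both generalization and rote memorization.

Bai:
Taking something regular and memorizing it by rote is forcing a good kid to play the hooligan.

Me:
Seen in a kinder light, you could also say it teaches the child to keep both feet on the ground, one step at a time.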

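Bai:
In the gray zone between generalization and rote memorization, use learning where learning fits.
It is unpleasant to look at, and it is not as if there were no alternative.
Only exam-oriented education and last-minute cramming turn everything living into something dead.

Me:
The root of the matter is that, so far, a system is either statistical or rule-based. So-called hybrid systems are mostly two systems stacked together rather than fused. In that context, one does not get to say: rules where rules fit, lexicon where lexicon fits, with some statistical learning mixed in between. The latter ought to be a research direction, though, and could probably be done better than a stacked hybrid. If Bai means a purely learning-based system, that is a different discourse altogether; no comment. From the rule side, grabbing the two ends and treating gray as black poses no problem; it just takes grinding time. The general rule guarantees recall, and precision is a function of time.

Bai:
What I am saying is this: who may combine with whom is decided by rules; among candidates that equally satisfy the rules, who is ruled out from combining with whom is decided by learning. But this is unsupervised learning, with the labels coming from the lexicon. The rule part involves only carrots, pits and hats, not subcat. The learning part is the one that uses subcat.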

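Me:
In fact, just take the simple pattern V+CD+N to massive data, and the unsupervised harvest it brings back will be more or less complete. This is a very narrow linguistic phenomenon. The result of the unsupervised learning is the knowledge acquisition for this particular subcat, which is an offline learning process. The learned result is then used to support parsing.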
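The offline acquisition step described here can be sketched as a simple counting pass. The sketch below assumes a corpus that is already segmented and POS-tagged with the hypothetical tags `V`, `CD`, `N`; the toy corpus and the frequency threshold `min_count` are made up for illustration and stand in for "massive data".

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

Token = Tuple[str, str]  # (word, POS); the tag names are assumptions


def harvest(sentences: Iterable[List[Token]]) -> Counter:
    """Count (verb, noun) pairs found in the surface pattern V + CD + N."""
    counts: Counter = Counter()
    for sent in sentences:
        for k in range(len(sent) - 2):
            (verb, p1), (_, p2), (noun, p3) = sent[k:k + 3]
            if (p1, p2, p3) == ("V", "CD", "N"):
                counts[(verb, noun)] += 1
    return counts


def acquire(counts: Counter, min_count: int = 2) -> Dict[str, Set[str]]:
    """Keep pairs seen at least min_count times; rarer pairs are treated as noise."""
    learned: Dict[str, Set[str]] = {}
    for (verb, noun), c in counts.items():
        if c >= min_count:
            learned.setdefault(verb, set()).add(noun)
    return learned


# Toy corpus (hypothetical); the last sentence plays the role of noise.
corpus = [
    [("打", "V"), ("两", "CD"), ("鞭子", "N")],
    [("打", "V"), ("三", "CD"), ("鞭子", "N")],
    [("砍", "V"), ("三", "CD"), ("刀", "N")],
    [("砍", "V"), ("一", "CD"), ("刀", "N")],
    [("打", "V"), ("一", "CD"), ("房子", "N")],
]
learned = acquire(harvest(corpus))
```

The resulting table maps each verb to the nouns it was seen with in this pattern, which is the kind of resource a parser could consult online; a real run would need a far larger corpus and a more careful noise filter than a raw count.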

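Bai:
Actually this thread has drifted. My original intent was to explore non-standard pits that get forced out.
If it can be done that way, we may be closer to the essence of language.

“他上学的那个学校” (the school where he goes to school); “他约会的那个晚上” (the evening when he had a date).

Even without a numeral, there are cases where a noun that serves as an adverbial or complement in one construction serves as subject or predicate in another related construction, while the logical-semantic relation stays unchanged. The true identity of that noun is a role such as instrument, location or time, which is not standard equipment for the verb to begin with. Once it lands in a certain position, it forces the verb to make that role standard.
English preposition stranding, as in "the man you look for", can give such nouns a clear identity: even inside a relative clause they are born of a secondary line (raised by the preposition, not by the verb). One can of course say they are raised by the verb-preposition combination look for.
In Chinese, once they are inside a relative clause, one cannot tell who raised them; the preposition disappears anyway, and keeping it is actually wrong. To keep it, the zero form must be replaced by an actual pronoun: “你在其中上学的学校” (the school in which you study), “你与之结婚的女人” (the woman with whom you marry).

Adding a numeral merely foregrounds the sense of action count; it does not change the logical-semantic relation.

砍张三的斧子 ... focus on the instrument
砍张三的两斧子 ... focus on the number of times the action occurs
砍张三的斧子 ... 用来(以/之/其)砍张三的斧子 (the axe used to chop Zhang San)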

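Me:
A complement expressing a count is a “bleached” (and at the same time “vivid”) use of the logical-semantic instrument in language. The bleached use itself is not a language universal, but it maps to the deep logical-semantic role 【instrument】, and 【instrument】 is universal. For “砍” (chop), 【instrument】 is not standard equipment that was forced out but standard equipment that comes built in; if in doubt, check Dong's HowNet. The standard equipment of 结婚 (marry) is with [human]. For 上学 (go to school), is 学校 (school) built in? One could probably say so; I do not know whether 上学 in HowNet has a location slot whose default is school.

One can try a completely random attributive or adverbial; it does not seem to work. It seems hard to find a case with the same logical semantics that can take part in both of the following constructions: the complement construction (expressing a count) and the attributive construction. In other words, this phenomenon is either collocation or an extension of collocation, not the combination of a random modifier (adjunct), nor a complement forced out of an adjunct. The logical semantics inside is some argument of a conceptual relation, and the combination has its necessity. Such collocation can apparently be word-to-word (both feet on the ground) or word-to-subclass (feature: one foot on the ground). The former is strong collocation kidnapped by the lexicon; the latter is gray, cannot necessarily be kidnapped, and can be learned statistically.

Bai:
That is exactly what I wanted to say.

Me:
Bai does far more than move a thousand pounds with four ounces, lol.

Acquiring word-to-subclass subcat, for instance that a given verb requires a certain kind of object (say 【human】), is something that can be learned from big data; the idea has been around for a while. A Cambridge University professor advocated this kind of learning many years ago, apparently ran a batch of experiments, and as I recall published some papers. But overall this research has been sporadic: research stays research and applications stay applications, and there seem to be no impressive results from combining the two.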

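Bai:
Without anchoring collocation learning to structure, it goes nowhere.
If you learn structure and collocation at the same time, it is bound to be a mess.
You must pick a small number of possible structures and let collocation discriminate further among them, each doing its own job.

Bai:
The instrument of “砍” can be standard equipment; that of “打” cannot. The subcats that fit “打” are very ragged: what we have in mind is “an object one can readily grab”, but no subcat list will hand you that neatly. So many subcats and many word instances all have to be treated as features, and one has to find a way to learn from enumerable examples (including word-instance-to-subcat sub-rules that are already confirmed).
A stove is too big to grab. A house is bigger still. A broom is about the right size. A bacterium is too small. Hence “张三打李四一大肠杆菌” (Zhang San hit Li Si one E. coli) does not work.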

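Me:
With the pattern 打+CD+N, every shot hits, as long as there is massive data. There is no need to fear noise at all, because this pattern works very well.
This reminds me of a paper someone at Google published over ten years ago, which used two specially selected ngram patterns to learn an ISA taxonomy and was very impressive. We later replicated that work; although we never really used the results, the approach is right. Following a similar learning route, HowNet too could one day be learned, provided Dong defines the nature of the semantic relations to be learned and suitable patterns are found.
The two patterns Google used are: N such as X, Y, Z ; X, Y, Z and other N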

e.g.
furniture such as desks, chairs, coffee-tables
desks, chairs, coffee-tables and other furniture (will all be on sale)
taxonomy is: {X, Y, Z} -->N
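
The two patterns above (lexico-syntactic patterns of this shape are often called Hearst patterns) can be tried out with plain regular expressions. This is a rough sketch rather than the method of the paper mentioned: it assumes single-token items separated by ", ", does no POS filtering or lemmatization, and the function name `extract_isa` is made up for illustration.

```python
import re
from typing import List, Tuple

ITEM = r"[\w-]+"                      # one token, hyphens allowed
LIST = rf"{ITEM}(?:, {ITEM})*"        # X, Y, Z
SUCH_AS = re.compile(rf"({ITEM}) such as ({LIST})")
AND_OTHER = re.compile(rf"({LIST}) and other ({ITEM})")


def extract_isa(text: str) -> List[Tuple[str, str]]:
    """Return (hyponym, hypernym) pairs found by the two patterns."""
    pairs: List[Tuple[str, str]] = []
    for m in SUCH_AS.finditer(text):          # N such as X, Y, Z
        pairs += [(x, m.group(1)) for x in m.group(2).split(", ")]
    for m in AND_OTHER.finditer(text):        # X, Y, Z and other N
        pairs += [(x, m.group(2)) for x in m.group(1).split(", ")]
    return pairs
```

Run on the two example sentences, both patterns yield the same three pairs linking desks, chairs and coffee-tables to furniture. On real text the raw matches are noisy, so counting pairs over a large corpus and thresholding would be the natural next step.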

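What is the use of learning it, when people can work it out slowly off the top of their heads anyway? HowNet is rich in semantic relations, so it took many years to compile, but in the end it was compiled and is nearly complete (Dong now seems to make only occasional additions). Since experts can compile it by hand, both complete and well made, what reason is there to expect big data to acquire this knowledge? That is one side of the question; for conceptual semantic relations that are relatively constant and long-lived in particular, there really is no reason not to use the experts' product.

The other side of the question is that for conceptual relations with some fluidity, experts can hardly keep up with machine acquisition, and there is also knowledge from different domains, and so on. This is territory beyond human effort, where one can only rely on big data and machines. The Google paper above gave some particularly interesting examples. As I recall, it learned the hyponyms of dictator, and the members were very much in the character of big data: Castro, Mao Zedong, Stalin, Hitler, etc.

Bai:
That is a subjective classification and does not belong in a lexicon. And then there are the instances of “well-known brand”, which have commercial value right away.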

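Me:
Isn't that the work I do every day: social media mining of public opinions and sentiments.
Our company periodically publishes handsomely printed word-of-mouth rankings of well-known global brands. Earlier issues covered luxury brands (designer bags, luxury cars, high-end perfume) and so on. The most recent issue is: Social Media Industry Report 2016: Restaurant Brand

I just tested Bai's example sentences, and the oddest one is this: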

[Image 0925a: parse tree of the example sentence]

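A gourd-shaped tree diagram; I really had not seen one before. (The lexicon lacks the function word “与之”, and PP did not compose it either, so it was dropped.) Even so, the whole graph is quite logical, by who knows what luck: “你” (you) is one party to the marriage (S), “女人” (woman) is also a party to the marriage (S), and the marriage event of the two is a relative clause (Mod-S) that modifies “女人”. As for the function words “的” and “之”, and the groping hand Next, they are all stepping stones for building the structure. These surface items have nothing to do with the logical semantics; they stay there not to be an eyesore but in case some trace of the surface is needed when the semantics is grounded in pragmatics. After all, the goal of semantic computing is not to draw pretty logical graphs for our own amusement but to ground the semantics and build products.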

 


 

Interaction of syntax and semantics in parsing Chinese transitive verb patterns

(old paper in Proceedings of International Chinese Computing Conference, ICCC'96)

Wei  LI

Department of Linguistics, Simon Fraser University
Burnaby, B.C. V5A 1S6 CANADA

Keywords: Chinese processing, transitive pattern, syntax, semantics, lexical rule, HPSG

Abstract

This paper addresses the problem of parsing Chinese transitive verb patterns (including the BA construction and the BEI construction) and handling the related phenomena of semantic deviation (i.e. the violation of the semantic constraint).

We designed a syntax-semantics combined model of Chinese grammar in the framework of Head-driven Phrase Structure Grammar [Pollard & Sag 1994]. Lexical rules are formulated to handle both the transitive patterns which allow for semantic deviation and the patterns which disallow it. The lexical rules ensure the effective interaction between the syntactic constraint and the semantic constraint in analysis.

The contribution of our research can be summarized as:

(1) the insight on the interaction of syntax and semantics in analysis;
(2) a proposed lexical rule approach to semantic deviation based on (1);
(3) the application of (2) to the study of the Chinese transitive patterns;
(4) the implementation of (3) in a unification-based Chinese HPSG prototype.

  1. Background

When Chomsky proposed his Syntactic Structures in the Fifties, he seemed to indicate that syntax should be addressed independently of semantics. As a convincing example, he presented a famous sentence:

1)             Colorless green ideas sleep furiously.

Weird as it sounds, the grammaticality of this sentence is intuitively acknowledged: (1) it follows the English syntax; (2) it can be interpreted. In fact, there is only one possible interpretation, solely decided by its syntactic structure. In other words, without the semantic interference, our linguistic knowledge about the English syntax is sufficient to assign roles to each constituent to produce a reading although the reading does not seem to make sense.

However, things are not always this simple. Compare the following Chinese sentences of the same form NP NP V:

2a)           dianxin  wo           chi           le.
                Dim-Sum I               eat           LE.
The Dim Sum I have eaten.
Note:        LE is a particle for perfect aspect.

2b)   wo dianxin chi le.
I have eaten the Dim Sum.

Who eats what? There is no formal way but to resort to the semantic constraint imposed by the notion eat to reach the correct interpretation [Li, W. & McFetridge 1995].

Of course, if we want to maintain the purity of syntax, it could be argued that syntax will only render possible interpretations and not the interpretation.  It is up to other components (semantic filter and/or other filters) of grammar to decide which interpretation holds in a certain context or discourse. The power of syntax lies in the ability to identify structural ambiguities and to render possible corresponding interpretations. We call this type of linguistic design a syntax-before-semantics model. While this is one way to organize a  grammar, we found it unsatisfactory for two reasons. First, it does not seem to simulate the linguistic process of human comprehension closely.  For human listeners, there are no ambiguities involved in sentences 2a) and 2b). Secondly, there is considerable cost on processing efficiency in terms of computer implementation. This efficiency problem can be very serious in the analysis of languages like Chinese with virtually no inflection.

Head-driven Phrase Structure Grammar (HPSG) [Pollard & Sag 1994, 1987] assumes a lexicalist approach to linguistic analysis and advocates an integrated model of syntax and the other components of grammar. It serves as a desirable framework for the integration of the semantic constraint in establishing syntactic structures and interpretations. Therefore, we proposed to enforce the semantic constraint that animate being eats food directly in the lexical entry chi  (eat) [Li, W. & McFetridge 1995]: chi (eat) requires an animate NP subject and a food NP object. It correctly addresses who-eats-what problem for sentences like 2a) and 2b). In fact, this type of semantic constraint (selection restriction) has been widely used for disambiguation in NLP systems.

The problem is, the constraint should not always be enforced. In the practice of communication, deviation from the constraint is common and deviation is often deliberately applied to help render rhetorical expressions.

 

3) xiang      chi           yueliang,  ni             gou           de3    zhao       me?
    want        eat           moon,       you          reach       DE3  -able          ME?
Wanting to eat the moon, but can you reach it?
Note:  DE3 is a particle, introducing a postverbal adjunct of result or capability. ME is a sentence final particle for yes-no question.

4) dajia         dou   chi           shehui zhuyi,           neng         bu            qiong       me?
     people      all      eat           social -ism,               can            not           poor         ME
Everyone is eating socialism, can it not be poor?

yueliang (moon) is not food, of course. It is still some physical object, though. But in 4), shehui zhuyi (socialism) is a purely abstract notion. If a parser enforces the rigid semantic constraint, there are many such sentences that will be rejected without getting a chance to be interpreted. The fact is, we do have interpretations for 3) and 4). Hence an adequate grammar should be able to accommodate those interpretations.

To capture such deviation, Wilks came up with his Preference Semantics [Wilks 1975, 1978]. A sophisticated mechanism is designed to calculate the semantic weight for each possible interpretation, i.e. how much it deviates from the preference semantic constraint. The final choice will be given to the interpretation with the most semantic weight in total. His preference model simulates how humans comprehend language more closely than most previous approaches.

The problem with this design is the serious computational complexities involved in the model [Huang 1987]. In order to calculate the semantic weight, the preference semantic constraint is loosened step by step. Each possible substructure has to be re-tried with each step of loosening. It may well lead to combinatorial explosion.

What we are proposing here is to look at semantic deviation in the light of the interaction of the syntactic constraint and the semantic constraint. In concrete terms, the loosening of the semantic constraint is conditioned by syntactic patterns. Syntactic pattern is defined as the representation of an argument structure in surface form. A pattern consists of 2 parts: a structure's syntactic constraint (in terms of the syntactic categories and configuration, word order,  function words and/or inflections) and its interpretation (role assignment). For example, for Chinese transitive structure, NP V NP: SVO is one pattern, NP NP V: SOV is another pattern, and NP [ba NP] V: SOV (the BA construction) is still another. The expressive power of a language is indicated by the variety of patterns used in that language. Our design will account for some semantic deviation or rhetorical phenomena seen in everyday Chinese without the overhead of computational complexities. We will focus on Chinese transitive verb patterns for illustration of this approach.

  2. Chinese transitive patterns

Assuming three notional signs wo (I), chi (eat) and dianxin (Dim Sum), there are maximally 6 possible combinations in surface word order, out of which 3 are grammatical in Chinese.[1]

5a)           wo chi le dianxin.                                   SVO
5b)           wo dianxin chi le.                                   SOV
5c)           dianxin wo chi le.                                    OSV

SVO is the canonical word order for Chinese transitive structure. When a string of signs matches the order NP V NP, the semantic constraint has to yield to syntax for interpretation.

NP V NP: SVO

6)  daodi         shi     ni             zai         du       shu          ne,
haishi                 shu           zai         du       ni             ne?

     on-earth     be     you          ZAI        read     book        NE,
or                        book        ZAI        read     you          NE?

Are you reading the book, or is the book reading you, anyway?
Note:        ZAI is a particle for continuous aspect.
NE is a sentence final particle for or-question.

Same as in the English equivalent, the interpretation of  6) can only be SVO, no matter how contradictory  it might be to our common sense. In other words, in the form of NP V NP, syntax plays a decisive role.

In contrast, to interpret the form NP NP V as SOV in 2b), the semantic constraint is critical. Without the enforcement of the semantic constraint, the interpretation of SOV does not  hold. In fact, this SOV pattern (NP1 NP2 V: SOV) has been regarded as ungrammatical in a Case Theory account for Chinese transitive structure in the framework of GB. According to their analysis, something similar to this pattern constitutes the D‑Structure for transitive pattern and Chinese is an underlying SOV language (called "SOV Hypothesis": see the survey in Gao 1993). In the surface structure, NP2 is without case on the assumption that V assigns its CASE only to the right. One has to either insert the case-marker ba to assign CASE to it (the BA construction) or move it to the right of V to get its CASE (the SVO pattern). This analysis suffers from not being able to account for the grammaticality of sentences like 2b).  However, by distinguishing the deep pattern SOV from the 2 surface patterns (the SVO and the BA construction), the theory has its merit to alert us that the SOV pattern seems to be syntactically problematic (crippled, so to speak). This is an insightful point, but it goes one step too far in totally rejecting the SOV pattern in surface structure. If we modify this idea, we can claim that SOV is a syntactically unstable pattern and that SOV tends to (not must) "transform" to the SVO or the BA construction unless it is reinforced by semantic coherence (i.e. the enforcement of the semantic constraint). This argument in the light of syntax-semantics interaction is better supported by the Chinese data. In essence, our account is close to this reformulated argument, but in our theory, we do not assume a deep structure and transformation. All patterns are surface constructions. If no sentences can match a construction, it is not considered as a pattern by our definition.

This type of unstable pattern which depends on the semantic constraint is not limited to the transitive phenomena. For example, the type of Chinese NP predicate defined in  [Li, W. & McFetridge 1995] is also a semantics dependent pattern. Compare:

7a)  zhe           zhang       zhuozi                  san          tiao          tui.
        this           Cl.         table(furniture)      three        Cl.            leg
This table is three-legged.
Note:        Cl for classifier.

7b) *        zhe           zhang       ditu                          san          tiao          tui.
                this           Cl.           map(non-furniture)  three        Cl.            leg

There is clearly a semantic constraint of the NP predicate on its subject: it should be furniture (or animate). Without this "semantic agreement", Chinese NP is normally not capable of functioning as a predicate, as shown in 7b).

Between semantics dependent and semantics independent patterns, we may have partially dependent patterns. For example, in NP NP V: OSV, it seems that the semantic constraint on the initial object is less important than the semantic constraint on the subject.

8)   shitou                wo              ye   xiang  chi,    kexi      yao       bu      dong.
   stone(non-food)  I(animate) also want  eat,    pity       chew    not      -able

Even stones I also want to eat, but it's such a pity that I am not able to chew them.

If the constraint on the object matches well, is the subject allowed to be semantically deviant?

9) ?          dianxin                     zhuozi                        chi           le.
                Dim-Sum(food)        table(non-animate)  eat           LE.

These are marginal cases: a grammar may choose to be more tolerant and accept them, or more restrictive and reject them.

Unlike SOV, but similar to its English counterpart, OSV is one type of Chinese topic constructions and the relationship between the initial O and V is of long distance dependency.

10a)  dianxin      wo     xiangxin   ni           yiwei        Lisi          chi           le.
          Dim-Sum    I         believe     you          think        Lisi           eat           LE

The Dim Sum I believe you think that Lisi ate.

10b) *      Lisi wo xiangxin ni yiwei dianxin chi le.

10b) will not be accepted in our model because (1) it cannot be interpreted as OSV since it violates the semantic constraint on S: dianxin is not animate; (2) nor can it be interpreted as SOV, since it violates the configurational constraint: SOV is simply not a long-distance pattern. In fact, NP NP V: SOV is such a restricted pattern in Chinese that it not only excludes any long distance dependency but even disallows some adjuncts. Compare 11a) in the OSV pattern and 11b) and 11c) in the SOV pattern:

11a)  dianxin      wo           jinjinyouwei             de2           chi           le.
          Dim-Sum      I              with-relish                DE2         eat           LE

The Dim Sum I ate with relish.
Note:        DE2 is a particle introducing a preverbal adjunct of  manner.

11b) *      wo dianxin jinjinyouwei de2 chi le.

11c) *      wo jinjinyouwei de2 dianxin chi le.

There is another pattern of the linear order SOV, the notorious Chinese BA construction. ba is usually regarded as a preposition which introduces a preverbal object for transitive verbs.

NP [ba NP] V: SOV

12a)  wo           ba            dianxin       jinjinyouwei             de2          chi           le.
           I              BA           Dim-Sum     with-relish                DE2         eat           LE

I ate the Dim Sum with relish.

12b)         wo jinjinyouwei de2 ba dianxin  chi le.
With relish, I ate the Dim Sum.

12c)         dianxin  ba wo jinjinyouwei de2  chi le.
The Dim Sum ate me with relish.

12d)         dianxin jinjinyouwei de2 ba wo  chi le.
With relish, the Dim Sum ate me.

For the OSV order, there is another so-called BEI construction. The BEI construction is usually regarded as an explicit passive pattern in Chinese.

NP [bei NP] V: OSV

13a)        dianxin       bei          wo           chi           le.
                Dim-Sum     BEI          I               eat           LE

The Dim Sum was eaten by me.

13b)         wo bei dianxin  chi le.

I was eaten by the Dim Sum.

The BEI construction and the BA construction are both semantics independent. In fact, any pattern resorting to the means of function words in Chinese seems to be sufficiently independent of the semantic constraint.

To conclude, semantic deviation often occurs in some more independent patterns, as seen in 5d2), 6), 8), 12c), 12d), 13b). Close study reveals that different patterns result in different reliance on the semantic constraint, as summarized in the following table.

                syntactic pattern          semantic dependence

                NP V NP: SVO               no dependence
                NP [ba NP] V: SOV          no dependence
                NP [bei NP] V: OSV         no dependence
                NP NP V: OSV               partial dependence
                NP NP V: SOV               full dependence

It should be emphasized that this observation constitutes the rationale behind our approach.

  3. Formulation of lexical rules

Based on the above observation, we have designed a syntax-semantics combined model. In this model, we take a lexical rule approach to Chinese patterns and the related problem of semantic deviation.

A lexical rule takes as its input a lexical entry which satisfies its condition and generates another entry. Lexical rules are usually used to cover lexical redundancy between related patterns. The design of lexical rules is preferred by many grammarians over the more conventional use of syntactic transformation, especially for lexicalist theories.

Our general design is as follows, still using chi (eat) for illustration:

(1)   Syntactically, chi (eat) as a transitive verb subcategorizes for a left NP as its subject and a right NP as its object.

(2)   Semantically, the corresponding notion eat expects an entity of category animate as its logical subject and an entity of category food as its logical object. Therefore the common sense (knowledge) that animate being eats food is represented.

(3)   The interaction of syntax and semantics is implemented by lexical rules. The lexical rules embody the linguistic generalizations about the transitive patterns. They will decide to enforce or waive the semantic constraint based on different patterns.

As seen, syntax only stipulates the requirement of two NPs as complements for chi and does not care about the NPs' semantic constraint. Semantics sets its own expectation of animate entity and food entity as arguments for eat and does not care what syntactic forms these entities assume on the surface. It is up to lexical rules to coordinate the two. In our model, the information in (1) and (2) is encoded in the corresponding lexical entry and the lexical rules in (3) will then be applied to expand the lexicon before parsing begins. Driven by the expanded lexicon, analysis is implemented by a lexicalist parser to build the interpretation structure for the input sentence. Following this design, there will be sufficient interaction between syntax and semantics as desired, while syntax still remains a component self-contained from semantics in the lexicon. More importantly, this design does not add any computational complexity to parsing because, in order to handle different patterns, similar lexical rules are required even for a pure syntax model.

Before we proceed to formulate lexical rules for transitive patterns, we should make sure what a transitive pattern is. As we defined before, a pattern consists of 2 parts: a structure's syntactic constraint and the corresponding interpretation. Word order is an important constraint for Chinese syntax. In addition to word order, we have categories and function words (preposition, particle, etc.). As for interpretation, transitive structure involves 3 elements: V (predicate) and its arguments S (logical subject) and O (logical object). There is a further factor to take into account: Chinese complements are often optional. In many cases, subject and/or object can be omitted either because they can be recovered in the discourse or they are unknown. We call those patterns elliptical patterns (with some complement(s) omitted), in contrast to full patterns. With these in mind, we can define 10 patterns for Chinese transitive structure: 5 full patterns and 5 elliptical patterns.

We now investigate these transitive patterns one by one and try to informally formulate the corresponding lexical rules to capture them. Please note that the basic input condition is the same for all the lexical rules. This is because they share the same argument structure: the transitive structure.

Lexical rule 1:   

                V ((NP1, NP2), (constr1, constr2)) --> NP1 V NP2: SVO

The above notation for the lexical rule should be quite obvious. The input of the rule is a transitive verb which subcategorizes for two NPs, NP1 and NP2, and whose corresponding notion expects two arguments of constr1 and constr2. NP is a syntactic category, and constr is a semantic category (human, animate, food, etc.). The output pattern is in a defined word order SVO and waives the semantic constraint.

Lexical rule 2:   

      V ((NP1, NP2), (constr1, constr2)) --> [NP1, constr1] [NP2, constr2] V: SOV

Please note that the semantic constraint is enforced for this SOV pattern. Since this pattern shares the form NP NP V with the OSV pattern, it would be interesting to see what happens if a transitive verb has the same semantic constraint on both its subject and object. For example, qingjiao (consult) expects a human subject and a human object.

14)           ta                     ni                               qingjiao    guo        me?
                he(human)     you(human)             consult     GUO        ME

Him, have you ever consulted?
Note: GUO is a particle for experience aspect.

15)           ni ta  qingjiao guo  me?

You, has he ever consulted?

In both cases, the interpretation is OSV instead of SOV. Therefore, we need to reformulate Lexical rule 2 to exclude the case when the subject constraint is the same as the object constraint.

Lexical rule 2' (refined version):

                V ((NP1, NP2), (constr1, constr2), (constr1 not = constr2))

                --> [NP1, constr1] [NP2, constr2] V: SOV
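
As a rough illustration of how Lexical rule 2' interacts with the OSV topic pattern of Section 2 on the shared surface form NP NP V, the following sketch returns the readings licensed for a given pair of NPs. It is not the paper's ALE implementation: the function name is hypothetical, semantic categories are compared as flat strings (no hierarchy such as human under animate), and the OSV reading checks only the subject constraint, reflecting the partial dependence noted for that pattern.

```python
from typing import List


def interpret_np_np_v(cat1: str, cat2: str,
                      subj_constr: str, obj_constr: str) -> List[str]:
    """Readings for the surface form NP NP V, given each NP's semantic category."""
    readings: List[str] = []
    # Lexical rule 2': SOV needs both constraints satisfied and is blocked
    # when the subject and object constraints are the same.
    if subj_constr != obj_constr and cat1 == subj_constr and cat2 == obj_constr:
        readings.append("SOV")
    # OSV topic pattern: only the subject (the second NP) is checked.
    if cat2 == subj_constr:
        readings.append("OSV")
    return readings
```

Under these assumptions, `interpret_np_np_v("animate", "food", "animate", "food")` gives the SOV reading of 2b), swapping the two NPs gives the OSV reading of 2a), and a verb like qingjiao with two human constraints only ever gets OSV, as in 14) and 15).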

Lexical rule 3:

                V ((NP1, NP2), (constr1, constr2)) --> NP1 [ba NP2] V: SOV

This is the typical BA construction. But not every transitive verb can assume the BA pattern. In fact, ba is one of a set of prepositions to introduce the logical object. There are other more idiosyncratic prepositions (xiang, dao, dui, etc.) required by different verbs to do the same job.

16a)      ni             qingjiao    guo         ta             me?
              you          consult     GUO        he            ME

Have you ever consulted him?

16b)         ni             xiang        ta             qingjiao    guo        me?
                 you          XIANG     he            consult     GUO        ME

Have you ever consulted him?

16c) *      ni             ba            ta             qingjiao    guo        me?
                you          BA           he            consult     GUO        ME

17a)         ta             qu             guo         Beijing.
                 he            go-to        GUO        Beijing

He has been to Beijing.

17b)         ta             dao         Beijing     qu             guo.
                 he            DAO        Beijing     go-to        GUO

He has been to Beijing.

17c) *      ta             ba            Beijing     qu            guo.
                 he            BA           Beijing     go-to        GUO

18a)         ta             hen         titie                             zhangfu.
                 she           very       tenderly-care-for      husband

She cares for her husband very tenderly.

18b)         ta             dui          zhangfu       hen        titie.
                 she           DUI         husband      very       tenderly-care-for

She cares for her husband very tenderly.

18c) *      ta             ba            zhangfu         hen                          titie.
                she           BA           husband         very                         tenderly-care-for

This originates from different theta-roles assumed by different verb notions on their object argument: patient, theme, destination, to name only a few. These theta-roles are further classification of the more general semantic role logical object. We can rely on the subcategorization property of the verb for the choice of the preposition literal (so-called valency preposition). With the valency information in place, we now reformulate Lexical rule 3 to make it more general:

Lexical rule 3' (refined version):

       V ((NP1, NP2), (constr1, constr2),  (valency_preposition=P), (P not = null))

       --> NP1 [P NP2] V: SOV

Lexical rule 4:   

                V ((NP1, NP2), (constr1, constr2)) --> NP2 ... [NP1, constr1] V: OSV

This is a topic pattern of long distance dependency. It is up to different formalisms to provide different approaches to long-distance phenomena. In our present implementation, NP2 is placed in a feature called BIND to indicate the nature of long distance dependency. One phrase structure rule Topic Rule is designed to use this information and handle the unification of the long distance complement properly.

Following the topic pattern, the passive BEI construction is formulated in Lexical rule 5.

Lexical rule 5:   

                V ((NP1, NP2), (constr1, constr2)) --> NP2 [bei NP1] V: OSV

We now turn to elliptical patterns.

Lexical rule 6:   

                V ((NP1, NP2), (constr1, constr2)) --> V NP2: VO

19)           chi           guo          jiaozi                        me?
                eat           GUO        dumpling                 ME

Have (you) ever eaten dumpling?

Lexical rule 7:   

                V ((NP1, NP2), (constr1, constr2)) --> [NP1, constr1] V: SV

20)           wo           chi           le.
                I               eat           LE

I have eaten (it).

21)           ji                                 chi           le.
                chicken1(animate)   eat           LE

The chicken has eaten (it).

Like its English counterpart, ji (chicken) has two senses: (1) chicken1 as animate; (2) chicken2 as food. We code this difference in two lexical entries. Only the first entry matches the semantic constraint on the subject in the pattern and reaches the above SV interpretation in 21). Interestingly enough, the same sentence will get another parse with a different interpretation OV in 23) because the second entry also satisfies the semantic constraint on the object in the OV pattern in Lexical rule 8.

22)           ni             qingjiao    guo         me?
                you          consult     GUO        ME

Have you consulted (someone)?

22) indicates that the SV interpretation is preferred over the OV interpretation when the semantic constraint on the subject and the semantic constraint on the object happen to be the same. Hence the added condition in Lexical rule 8.

Lexical rule 8:   

                V ((NP1, NP2), (constr1, constr2), (constr1 not = constr2))

                --> [NP2, constr2] V: OV

23)           ji                                 chi           le.
                chicken2(food)         eat           LE

The chicken has been eaten.

Lexical rule 9:   

                V ((NP1, NP2), (constr1, constr2)) --> NP2 [bei V]: OV

24)           dianxin    bei           chi           le.
                Dim-Sum  BEI          eat           LE

The Dim Sum has been eaten.

Lexical rule 10:

                V ((NP1, NP2), (constr1, constr2)) --> V: V

25)           chi           le             me?
                eat           LE            ME?                        

(Have you) eaten (it)?
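
The ten lexical rules can be pictured as a function that expands one transitive entry into its surface patterns before parsing. The sketch below is an informal Python rendering under stated assumptions, not the ALE code: the `Verb` record, its field names and the string notation of the output are made up for illustration, and rules 2', 3' and 8 carry the conditions given in the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Verb:
    form: str
    constr1: str                        # semantic constraint on the logical subject
    constr2: str                        # semantic constraint on the logical object
    valency_prep: Optional[str] = None  # e.g. "ba", "xiang", "dao"; None if absent


def expand(v: Verb) -> List[str]:
    """Apply Lexical rules 1-10 and return the licensed patterns."""
    s = f"[NP1, {v.constr1}]"           # subject with its constraint enforced
    o = f"[NP2, {v.constr2}]"           # object with its constraint enforced
    differ = v.constr1 != v.constr2
    out = [f"NP1 {v.form} NP2: SVO"]                             # rule 1
    if differ:
        out.append(f"{s} {o} {v.form}: SOV")                     # rule 2'
    if v.valency_prep is not None:
        out.append(f"NP1 [{v.valency_prep} NP2] {v.form}: SOV")  # rule 3'
    out.append(f"NP2 ... {s} {v.form}: OSV")                     # rule 4
    out.append(f"NP2 [bei NP1] {v.form}: OSV")                   # rule 5
    out.append(f"{v.form} NP2: VO")                              # rule 6
    out.append(f"{s} {v.form}: SV")                              # rule 7
    if differ:
        out.append(f"{o} {v.form}: OV")                          # rule 8
    out.append(f"NP2 [bei {v.form}]: OV")                        # rule 9
    out.append(f"{v.form}: V")                                   # rule 10
    return out


chi = Verb("chi", "animate", "food", "ba")
qingjiao = Verb("qingjiao", "human", "human", "xiang")
```

Here `expand(chi)` yields all ten patterns, while `expand(qingjiao)` yields eight, since rules 2' and 8 do not fire when the two constraints coincide.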

  4. Implementation

We begin with a discussion of some major feature structures in HPSG related to handling the transitive patterns.  Then, we will show how our proposal works and discuss some related implementation issues.

HPSG is a highly lexicalist theory. Most information is housed in the lexicon. The general grammar is kept to a minimum: only a few phrase structure rules (called ID Schemata) associated with a couple of principles. The data structure is the typed feature structure. The necessary part of a typed feature structure is the type information. A simple feature structure contains only the type information, but a complex feature structure can introduce a set of feature/value pairs in addition to the type information. In a feature/value pair, the value is itself a feature structure (simple or complex). The following is a sample implementation of the lexical entry chi for our Chinese HPSG grammar using the ALE formalism [Carpenter & Penn 1994].

[Figure hpsg3: lexical entry for chi in ALE notation]

Note:  (1) Uppercase notation for feature; (2) Lowercase notation for type; (3) Number indices in square brackets for unification.

Leaving the notational details aside, what this roughly says is: (1) for the semantic constraint, the arguments of the notion eat are an animate entity and a food entity; (2) for the syntactic constraint, the complements of the verb chi are 2 NPs: one on the left and the other on the right; (3) the interpretation of the structure is a transitive predicate with a subject and an object. The three corresponding features are: (1) KNOWLEDGE; (2) SUBCAT; (3) CONTENT. KNOWLEDGE stores some of our common sense by capturing the internal relation between concepts. Such common sense knowledge is represented in linguistic ways, i.e. it is represented as a semantic expectation feature, which parallels the syntactic expectation feature SUBCAT. KNOWLEDGE defines the semantic constraint on the expected arguments no matter what syntactic forms the arguments will take. In contrast, SUBCAT only defines the syntactic constraint on the expected complements. The syntactic constraint includes word order (LEFT feature), syntactic category (CATEGORY feature) and configurational information (LEX feature). Finally, the CONTENT feature assigns the roles SUBJECT and OBJECT for the represented structure.

A more important issue is the interaction of the three feature structures. Among the three features, only KNOWLEDGE is our add-on. The relationship between SUBCAT and CONTENT has been established in all HPSG versions: SUBCAT resorts to CONTENT for interpretation. This interaction corresponds to our definition of pattern. Everything works as long as the syntactic constraint alone can decide the interpretation. When the semantic constraint (in KNOWLEDGE) has to be involved in the interpretation process, we need a way to access this information. In unification-based theories, information flow is realized by unification (i.e. structure sharing, which is represented by the co-indexing of feature values). In general, we have two ways to ensure structure sharing in the lexicon. It is either directly co-indexed in the lexical entries, or it resorts to lexical rules. The former is unconditional, and the latter is conditional. As argued before, we cannot directly enforce the semantic constraint for every transitive pattern in Chinese, for otherwise our grammar will not allow for any semantic deviation. We are left with lexical rules, which we have informally formulated in Section 3 and implemented in the ALE formalism.

CATEGORY is another major feature for a sign. The CATEGORY feature in our implementation includes functional categories, which can specify a functional literal (function word) as their value. Function words belong to closed categories. Therefore, they can be classified by enumeration of literals. Like word order, function words are an important form of syntactic constraint in Chinese. Grammars for other languages also resort to some functional literals for constraint. In most HPSG grammars for English, for example, a preposition literal is specified in a feature called P_FORM. There are two problems involved there. First, at the representation level, there is redundancy: P_FORM:x --> CATEGORY:p (where x is not null). In other words, there exists a feature dependency between P_FORM and CATEGORY which is not captured in the formalism. Second, if P_FORM is designed to stipulate a preposition literal, we will ultimately need to add features like CL_FORM for classifier specification, CO_FORM for conjunction specification, etc. In fact, for each functional category, literal specification may be required for constraint in a non-toy grammar. That will make the feature system of the grammar too cumbersome. These problems are solved in our grammar implementation in ALE. One significant mechanism in ALE is its type inheritance and appropriateness specifications for feature structures [Carpenter & Penn 1994]. (A similar design is found in the software paradigm of object-oriented programming.) Thanks to ALE, we can now use literals (ba, xiang, dao, dui, etc.) as well as major categories (n, v, a, p, etc.) to define the CATEGORY feature. In fact, any intermediate level of subclassification between these two extremes, major categories and literals, can be represented in CATEGORY just as handily. Together they constitute a type hierarchy of CATEGORY. The same mechanism can also be applied to semantic categories (human, animate, food, etc.) to capture thesaurus inferences like human --> animate. This makes our knowledge representation much more powerful than in formalisms without this mechanism. We will address this issue in depth in another paper, Typology for syntactic category and semantic category in Chinese grammar.
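
The idea of one type hierarchy covering major categories, intermediate subclasses and literals can be illustrated outside ALE. The sketch below is plain Python rather than ALE syntax, and it rests on assumptions: the intermediate type name `valency_p`, the dictionary layout and the helper `subsumes` are hypothetical and chosen only for illustration.

```python
# Each type points to its immediate supertype; None marks a root type.
# "valency_p" is a hypothetical intermediate subclass between p and the literals.
CATEGORY = {
    "n": None, "v": None, "a": None, "p": None,
    "valency_p": "p",
    "ba": "valency_p", "xiang": "valency_p",
    "dao": "valency_p", "dui": "valency_p",
}

# The same mechanism applied to semantic categories: human --> animate.
SEMANTIC = {"animate": None, "human": "animate", "food": None}


def subsumes(hierarchy, general, specific):
    """Return True if `general` is `specific` itself or one of its ancestors."""
    current = specific
    while current is not None:
        if current == general:
            return True
        current = hierarchy[current]
    return False
```

Under this layout a constraint stated as `p` accepts the literal `ba`, but a constraint stated as `ba` does not accept an arbitrary `p`, so no separate P_FORM-style feature is needed to hold the literal.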

In the following, we give a brief description on how our grammar works. The grammar consists of several phrase structure rules and a lexicon with lexical entries and lexical rules. First, ALE compiles the grammar into a Prolog parser. During this process (at compile time), lexical rules are applied to lexical entries. In the case of transitive patterns, this means that one entry of chi will evolve into 10 entries. Please note that it is this expanded lexicon that is used for parsing (at run time).
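
The compile-time expansion described above can be pictured as follows. This is a minimal Python sketch, not ALE's compiler: the `Entry` class, the `RULES` list and `compile_lexicon` are hypothetical names, and only three derived patterns are shown, whereas the grammar described here expands chi into 10 entries.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Entry:
    phon: str             # phonological form, e.g. "chi"
    pattern: str          # surface pattern
    reading: str          # interpretation assigned to the pattern
    semantic_check: bool  # whether the semantic constraint is enforced


# Each lexical rule maps a base entry to one derived entry (illustrative subset).
RULES = [
    lambda e: replace(e, pattern="NP1 V NP2", reading="SVO"),
    lambda e: replace(e, pattern="NP2 [bei V]", reading="OV"),
    lambda e: replace(e, pattern="[NP2, constr2] V", reading="OV",
                      semantic_check=True),
]


def compile_lexicon(base_entries, rules):
    """Apply every lexical rule to every base entry once, before parsing."""
    expanded = []
    for entry in base_entries:
        expanded.append(entry)
        expanded.extend(rule(entry) for rule in rules)
    return expanded


# One base entry for chi, using the BA-type pattern as the basic pattern.
BASE = [Entry("chi", "NP1 [P NP2] V", "SOV", False)]

# The parser consults only this expanded lexicon at run time.
LEXICON = compile_lexicon(BASE, RULES)
```

The design point is that rule application happens once, ahead of parsing, so the run-time parser never sees lexical rules, only ordinary entries.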

At the level of implementation, we do not need to presuppose an abstract transitive structure as the input of the lexical rules and generate from it 10 new entries for each transitive verb. What is needed is one pattern that serves as the basic pattern for the transitive structure and from which the other patterns are derived. In fact, we only need 4 lexical rules to derive the other 4 full patterns from 1 basic full pattern. Elliptical patterns can be handled more elegantly by other means than lexical rules.[2]

The basic pattern constitutes the common condition for lexical rules. Although in theory any one of the 5 full patterns can be seen as the basic pattern, the choice is not arbitrarily made. The pattern we chose is the valency preposition pattern (the BA-type construction) NP1 [P NP2] V: SOV (see Lexical rule 3').[3] This is justified as follows. The valency preposition P (ba, xiang, dao, dui, etc.) is idiosyncratically associated with the individual verb. To derive a more general pattern from a specific pattern is easier than the other way round, for example,  NP1 [P NP2] V: SOV --> NP1 V NP2: SVO is easier than NP1 V NP2: SVO --> NP1 [P NP2] V: SOV. This is because we can then directly code the valency preposition under CATEGORY in the SUBCAT feature and do not have to design a specific feature to store this valency information.

 

  5. Summary

The ultimate aim for natural language analysis is to reach interpretation, i.e. to assign roles to the constituents. An old question is how syntax (form) and semantics (meaning) interact in this interpretation process. More specifically, which is a more important factor in Chinese analysis, the syntactic constraint or the semantic constraint? For the linguistic data we have investigated, it seems that sometimes syntax plays a decisive role and other times semantics has the final say. The essence is how to adequately handle the interface between syntax and semantics.

In our proposal, the syntactic constraint is seen as a more fundamental factor. It serves as the frame of reference for the semantic constraint. The involvement of the semantic constraint seems to be most naturally conditioned by syntactic patterns. In order to ensure their effective interaction, we accommodate syntax and semantics in one model.  The model is designed to be based on syntax and resorts to semantic information only when necessary. In concrete terms, the system will selectively enforce or waive the semantic constraint, depending on syntactic patterns.

It should be noted that there are other factors involved in reaching a correct interpretation. For example, in order to recover the omitted complements in elliptical patterns, information from discourse and pragmatics may be vital. We leave this for future research.

 

References

Carpenter, B. & Penn, G. (1994): ALE, The Attribute Logic Engine, User's Guide, Version 2.0

Gao, Qian (1993): “Chinese BA-Construction: Its Syntax and Semantics”, OSU Working Papers in Linguistics 1993, Kathol A. & Pollard C. (eds.)

Huang, Xiuming (1987): “XTRA: The Design and Implementation of A Fully Automatic Machine Translation System”, Ph.D. dissertation.

Li, Audrey (1990): Chapter 6 “Passive, BA, and topic constructions”, Order & Constituency in Mandarin Chinese. Kluwer Academic Publishers

Li, Wei & McFetridge, Paul (1995): “Handling Chinese NP predicate in HPSG”, Proceedings of PACLING-II, Brisbane, Australia

Pollard, Carl  & Sag, Ivan A. (1994): Head-Driven Phrase Structure Grammar,  Centre for the Study of Language and Information, Stanford University, CA

Pollard, Carl  & Sag, Ivan A. (1987): Information-based Syntax and Semantics. Vol. 1: Fundamentals. Centre for the Study of Language and Information, Stanford University, CA

Wilks, Y.A. (1978): “Making Preferences More Active”,  Artificial Intelligence, Vol. 11

Wilks, Y.A. (1975): “A Preferential Pattern-Seeking Semantics for Natural Language Inference”, Artificial Intelligence, Vol. 6

~~~~~~~~~~~~

* This research is part of my Ph.D. project on a Chinese HPSG-style grammar, supported by the Science Council of British Columbia, Canada under G.R.E.A.T. award (code: 61). I thank my supervisor Dr. Paul McFetridge for his supervision. He introduced me to HPSG theory and provided me with his sample grammars. Without his help, I would not have been able to implement the Chinese grammar in a relatively short time. Thanks also go to Prof. Dong Zhen Dong and Dr. Ping Xue for their comments and encouragement.

 

[1]               The other combinations are:

5d1) *      dianxin chi le wo.              OVS

5d2)         dianxin chi le wo.
The Dim Sum ate me.

Note:        It is OK with the 5d2) reading in the pattern NP V NP: SVO.

5e1) *      chi le wo dianxin.               VSO
5e2)         chi le wo dianxin.

(Somebody) ate my Dim Sum.

Note:        It is OK with the 5e2) reading in the pattern V [NP1 NP2]: VO where NP1 modifies NP2.

5f1) *      chi le dianxin wo.                 VOS
5f2)         chi le dianxin, wo.

Eaten the Dim Sum, I have.

Note:        It is OK in Spoken Chinese, with a short pause before wo, in a  pattern like V NP, NP: VOS.

[2]   The conventional configurational approach is based on the assumption that complements are obligatory and should be saturated. If saturation of complements were not taken as a precondition for a phrase, serious problems might arise in structural overgeneration. On the other hand, optionality of complement(s) is a real life fact. Elliptical patterns are seen in many languages and especially commonplace in Chinese. In order to ensure obligatoriness of complements, the lexical rule approach can be applied to elliptical patterns, as shown in Section 3. This approach maintains configurational constraint in tree building to block structural overgeneration, but the cost is great: each possible elliptical pattern for a head will have to be accommodated by a new lexical entry. With the type mechanism provided by ALE, we have developed a technique to allow for optionality of complement(s) and still maintain proper configurational constraint. We will address this issue in another paper Configurational constraint in Chinese grammar.

[3]    This choice is coincidental to the base‑generated account of the BA construction in [Li, A. 1990], but that does not mean much. First, our so‑called basic pattern is not their D‑Structure. Second, our choice is based on more practical considerations. Their claim involves more theoretical arguments in the context of the generative grammar.

 

 

[Related]

Handling Chinese NP predicate in HPSG (old paper)

Notes for An HPSG-style Chinese Reversible Grammar

Outline of an HPSG-style Chinese reversible grammar

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

PhD Thesis: Chapter VII Concluding Remarks

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

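[One Parse a Day: If Not Me, Then Who? And Who Am I?]

Last night's famous joke:
[Mid-Autumn Festival: those doing well are 花前月下 (among the flowers, under the moon); those doing so-so are 月下花钱 (spending money under the moon); those doing worst are 花下月的钱 (spending next month's money); those doing best are 钱下月花 (keeping the money to spend next month).]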

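[Figure 0916a: parse graph]

[Figure 0916b: parse graph]

The parse is nearly perfect, but it has one flaw: a separable word was not matched with its partner. Compare:

[Figure 0916d: parse graph]

Put together, the result is dazzling. This is no ordinary graph, and it is quite unlike most syntactic trees:

[Figure 0916c: parse graph]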

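Let me also show off the parses from the day before yesterday. Chinese deep parsing has no absolute standard, but linguists carry a scale in their hearts: whether a parse is sound or not, insiders see the craft while outsiders just watch the show. The feeling is both uncanny and thrilling. On the one hand, I feel I am walking a road no one has walked before, full of a pioneer's solemnity and pride. On the other hand, it also feels like something fated: carrying out heaven's will, if not me, then who? And who am I? If language is the carrier and expression (presentation) of thought, then parsing is the formalized machine display (representation) of thought, and I am the messenger connecting the two. Thank God that, while creating language as a riddle, He did not forget to leave the key behind.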

[Figures 0915a to 0915d: parse graphs]

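Yes: [What humans find hardest to comprehend is the machine's ability to analyze the structure of human language]. That machines will reach human-level ability in analyzing linguistic structure is no longer in doubt. The part of understanding that machines find hard to reach can be handled through human-machine collaboration; that scene lies in the not-too-distant future and is already vivid before our eyes. Let us get ready to embrace this new era of human-machine fusion.

Master Hong has a poem:
Cook Ding carves the ox of language; Master Wei hones his craft inside the Parser. The good blade is kept hidden deep in the mountains, yet in truth it can slash through tangled hemp.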

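[Related]

Chinese Language Processing

Parsing

[Pinned: Overview of Liwei's NLP Blog Posts]

Table of Contents of 《朝华午拾》 (Morning Blossoms Gathered at Noon)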

 

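[Looking Back at My PhD Scribbles: An Attempt to Bring Common Sense into Grammar]

As I said last time, the vast majority of parsers represent a predicate's subcat very crudely, with little room to stretch: most simply treat the subcat as a code and then determine the pattern in the relevant subcat rules. The word-driven grammar HPSG, by contrast, can do this in fine detail and in a principled way: it can describe the subcat pattern precisely in the lexicon itself, and give a linguistically principled representation of the mapping and decomposition between its syntactic-semantic input (the conditions of the pattern) and its output (the logical semantics).

Crudeness has its engineering considerations and reasons; elaborate layering has its logical elegance. You cannot have both the fish and the bear's paw, and in the end we still lean toward the crude approach. Even so, those who take the crude, fast route benefit greatly from having experienced the elegance of structural representation: at the very least they will not be misled by the crude surface, and when facing complex linguistic phenomena they can gradually escape the limitations of the crude approach.

Recently I reread the scribbles from my PhD years. Although the insight into Chinese syntax they reflect is not outstanding, thanks to the structural richness of HPSG they still present the application of subcat in Chinese grammar in an orderly way that stands the test of time. Back then I studied HPSG with real concentration and digested it thoroughly. Precisely because I had digested it thoroughly, there was no lingering attachment when I later moved beyond it.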

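For example, when discussing Chinese NPs that carry a slot, the modeling went like this:

11a)     桌子坏了。 (The table went wrong.)
11b)     腿坏了。 (The leg went wrong.)
11c)     桌子的腿坏了。 (The table's leg went wrong.)
12a)     他好。 (He is good.)
12b)     身体好。 (The health is good.)
12c)     他的身体好。 (His health is good.)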

When people say 11b) tui huai le (leg went wrong), we know something (the possessor) is omitted. For 11a), however, we have no such feeling of incompleteness.

Although we may also ask whose table, this possessive relation between who and table is by no means innate. Similarly, ta (he) in 12a) is a complete notion denoting someone while shenti (body) in 12b) is not. In 11c) and 12c), the possessor appears in the possessive DE-construction, and the expectation of tui (leg) and shenti (body) is realized. These examples show that some words (concepts) have conceptual expectation for some other words (concepts) although the expected words do not necessarily show up in a sentence and the expectation might not be satisfied. In fact, this type of expectation forms part of our knowledge (common sense). One way to represent the knowledge is to encode it with the related word in the lexicon.

Therefore we propose an underlying SYNSEM feature KNOWLEDGE to store some of our common sense knowledge by capturing the internal relation between concepts. KNOWLEDGE parallels the syntactic SUBCAT and the semantic RELATION. KNOWLEDGE imposes semantic constraints on the expected arguments no matter what syntactic forms the arguments will take (they may take null form, i.e. the underlying arguments are not realized). In contrast, SUBCAT only defines the syntactic requirement for the complements and gets interpreted in RELATION. Following this design, syntactic form and semantic constraints are kept apart. When necessary, the interaction between them can be implemented by lexical rules, or directly coindexed in the lexicon. For example, the following KNOWLEDGE information will be enforced as the necessary semantic constraints when we handle Chinese NP predicates by a lexical rule (see 3.3).

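It was from that point that common sense was smuggled into the grammar through the back door. This practice is rare in formal grammars of European languages, because syntactic form is largely sufficient there and common sense is usually not needed. For Chinese, however, it is very hard to build a mature deep analysis system without bringing in some kind of common sense. Back then, the syntactic structure model carrying common sense was defined like this: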

PHON      shenti
SYNSEM | KNOWLEDGE | PRED [1] possess
SYNSEM | KNOWLEDGE | POSSESSOR [2] human
SYNSEM | KNOWLEDGE | POSSESSED [3]
SYNSEM | LOCAL | CONTENT | INDEX [3]
SYNSEM | LOCAL | CONTENT | RESTRICTION { RELATION body }
SYNSEM | LOCAL | CONTENT | RESTRICTION { INSTANCE [3] }

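Finally, the introduction of common sense into Chinese grammar was regarded as a natural extension of the agreement in gender, number and case that European languages use: an extension from syntactic devices to semantic constraints.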

Agreement revisited
This section relates semantic constraints which embody common sense to the conventional linguistic notion of agreement. We will show that they are essentially the same thing from different perspectives. We only need slight expansion for the definition of agreement to accommodate some of our basic knowledge. This is important as it accounts for the feasibility of coding knowledge in linguistic ways. Linguistic lexicon seems to be good enough to house some general knowledge in addition to linguistic knowledge.

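To parse “我鸡吃” (I chicken eat) and “鸡我吃” (chicken I eat), common sense entered the grammar (nowadays big data can also be used to bring common sense in):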

A typical example of how concepts are linked in a network (a sophisticated concept lexicon) is seen in the representation of drink ((*ANI SUBJ) (((FLOW STUFF) OBJE) ((SELF IN) (((*ANI (THRU PART)) TO) (BE CAUSE))))) in Wilks 1975b. While for  various reasons we will not go as far as Wilks, we can gain enlightenment from this type of AI approach to knowledge. Lexicon-driven systems like the one in HPSG can, of course, make use of this possibility. Take the Chinese role-assignment problem, for example, the common sense that ANIMATE being eats FOOD can be seamlessly incorporated in the lexical entry chi (eat) as a semantic agreement requirement.

PHON chi
SYNSEM | KNOWLEDGE | PRED [1]  eat
SYNSEM | KNOWLEDGE | AGENT [2] animate
SYNSEM | KNOWLEDGE | PATIENT [3] food
SYNSEM | LOCAL | CATEGORY | SUBCAT | EXTERNAL_ARGUMENT [NP: [4]]
SYNSEM | LOCAL | CATEGORY | SUBCAT | INTERNAL_ARGUMENTS <[NP: [5]]>
SYNSEM | LOCAL | CONTENT | RELATION [1]
SYNSEM | LOCAL | CONTENT | EATER [4] | INDEX | ROGET [2]
SYNSEM | LOCAL | CONTENT | EATEN [5] | INDEX | ROGET [3]

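As can be seen, what looks like nothing more than a subcat code obtained by subdividing POS actually contains a great deal of structure and the knowledge embedded in it. Today, when unification grammars have almost become a historical relic, I still consider a representation like HPSG's to be one of the most elegant logical representations in linguistics; in logical clarity and beauty, later grammars can hardly surpass it.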

 


 

Handling Chinese NP predicate in HPSG (old paper)

Handling Chinese NP predicate in HPSG
(old paper in Proceedings of the Second Conference of the Pacific
Association for Computational Linguistics, Brisbane, 1995)

Wei Li & Paul McFetridge

Department of Linguistics
Simon Fraser University
Burnaby, B.C. CANADA  V5A 1S6

 

Key words: HPSG; knowledge representation; Chinese processing

 

Abstract 

This paper addresses a type of Chinese NP predicate in the framework of HPSG 1994 (Pollard & Sag 1994). Special emphasis is laid on knowledge representation and the interaction of syntax and semantics in natural language processing. A knowledge-based HPSG model is designed. This design not only lays a foundation for effectively handling the Chinese NP predicate problem, but also has theoretical and methodological significance for NLP in general.

In Section 1, the data are analyzed. Both structural and semantic constraints for this pattern are defined. Section 2 discusses the semantic constraints in the wider context of the conceived knowledge-based model. The aim of natural language analysis is to reach interpretations, i.e. to correctly assign semantic roles to the constituents. We indicate that without being able to resort to some common sense knowledge, some structures cannot get interpreted. We present a way to organize and utilize knowledge in the HPSG lexicon. In Section 3, a lexical rule for this pattern is proposed in our HPSG model for Chinese, whose prototype is being implemented.

  1. Problem

We will show the data of Chinese NP predicate first. Then we will investigate what makes it possible for an NP to behave like a predicate. We will do this by defining both the syntactic and semantic constraints for this Chinese pattern.

1.1. Data: one type of Chinese NP predicate

1) 他好身体。

ta         hao      shenti.
he        good    body
He is of good health.

2)  张三高个子。

Zhangsan         gao      gezi.
Zhangsan         tall       figure
Zhangsan is tall.

3)  李四圆圆的脸。

Lisi      yuanyuan         de        lian.
Lisi      round-round    DE       face.
Lisi has a quite round face.

4) 这件大衣红颜色。

zhe       jian      dayi     hong    yanse.
this      (cl.)      coat     red       colour.
This coat is of red colour.

5)  明天小雨。

mingtian          xiao     yu.
tomorrow        little     rain.
Tomorrow it will drizzle.

6)  那张桌子三条腿。

na        zhang   zhuozi san       tiao      tui.
that      (cl.)      table   three    (cl.)      leg
That table is three-legged.

Note:      (cl.) for classifier.
DE for Chinese attribute particle.

The relation between the subject NP and the predicate NP is not identity. The NP predicate in Chinese usually describes a property the subject NP has, corresponding to English be-of/have NP. In identity constructions, the linking verb SHI (be) cannot normally be omitted.[1]

7a)  他是学者。

ta         shi        xuezhe.
he        be        scholar
He is a scholar.

7b) ?他学者。

ta         xuezhe.
he        scholar

1.2.  Problem analysis

1.2.1. We first investigate the structural characteristics of the Chinese NP predicate pattern.

A single noun cannot act as predicate. More restrictively, not every NP can become a predicate. It seems that only the NP with the following configuration has this potential: NP [lex -, predicate +].  In other words, a predicate NP consists of a lexical N with a modifying sister. Structures of this sort should not be further modified.[2] Thus, the following patterns are predicted.

8a)      那张桌子三条腿。

na        zhang   zhuozi san       tiao      tui.                   [ same as 6) ]
that      (cl.)      table    three    (cl.)      leg
That table is three-legged.

8b)       那张桌子塑料腿。

na        zhang   zhuozi suliao   tui.
that      (cl.)      table    plastic leg
That table is of plastic legs.

8c) * 那张桌子三条塑料腿。
*    na        zhang   zhuozi san       tiao      suliao   tui.       [too many attributes]

8d) * 那张桌子腿。
*    na        zhang   zhuozi tui.                                           [no attributes]

1.2.2. What is the semantic constraint for the Chinese predicate pattern?

Although there is no syntactic agreement between subject and predicate in Chinese, there is an obvious semantic "agreement" between the two: hao shenti (good body) requires a HUMAN as its subject; san tiao tui (three leg) demands that the subject be FURNITURE or ANIMATE. Therefore, the following are unacceptable:

9) * 这杯茶好身体。

* zhe       bei       cha       hao      shenti.
this      cup      tea       good    body

10) * 空气三条腿。

* kongqi san       tiao      tui.
air        three    (cl.)      leg

Obviously, it is not hao (good) or san tiao (three) that imposes this semantic selection on the subject. The semantic restriction comes from the noun shenti (body) or tui (leg). There is an internal POSSESS relationship between them: shenti (body) belongs to human beings and tui (leg) is one part of an animal or some furniture. This common sense relation is a crucial condition for the successful interpretation of the Chinese NP predicate sentences.

There are a number of issues involved here. First, what is the relationship of this type of knowledge to the syntactic structures and semantic interpretations? Second, where and how would this knowledge be represented? Third, how will the system use the knowledge when it is needed? More specifically, how will the introduction of this knowledge coordinate with the other parts of the well established HPSG formalism? Those are the questions we attempt to answer before we proceed to provide a solution to the Chinese NP predicate. Let us look at some more examples:

11a)     桌子坏了。

zhuozi huai     le.
table    bad      LE
The table went wrong.

11b)     腿坏了。

tui        huai     le.
leg       bad      LE
The leg went wrong.

11c)     桌子的腿坏了。

zhuozi  de        tui        huai     le.
table    DE       leg       bad      LE
The table's leg went wrong.

12a)     他好。

ta         hao.
he        good
He is good.

12b)     身体好。

shenti   hao.
body    good
The health is good.

12c)     他的身体好。

ta         de        shenti   hao.
he        DE       body    good
His health is good.

Note: LE for Chinese perfect aspect particle.

When people say 11b) tui huai le (leg went wrong), we know something (the possessor) is omitted. For 11a), however, we have no such feeling of incompleteness. Although we may also ask whose table, this possessive relation between who and table is by no means innate. Similarly, ta (he) in 12a) is a complete notion denoting someone while shenti (body) in 12b) is not. In 11c) and 12c), the possessor appears in the possessive DE-construction, and the expectation of tui (leg) and shenti (body) is realized. These examples show that some words (concepts) have conceptual expectation for some other words (concepts) although the expected words do not necessarily show up in a sentence and the expectation might not be satisfied. In fact, this type of expectation forms part of our knowledge (common sense). One way to represent the knowledge is to encode it with the related word in the lexicon.

Therefore we propose an underlying SYNSEM feature KNOWLEDGE to store some of our common sense knowledge by capturing the internal relation between concepts. KNOWLEDGE parallels the syntactic SUBCAT and the semantic RELATION. KNOWLEDGE imposes semantic constraints on the expected arguments no matter what syntactic forms the arguments will take (they may take null form, i.e. the underlying arguments are not realized). In contrast, SUBCAT only defines the syntactic requirement for the complements and gets interpreted in RELATION. Following this design, syntactic form and semantic constraints are kept apart. When necessary, the interaction between them can be implemented by lexical rules, or directly coindexed in the lexicon. For example, the following KNOWLEDGE information will be enforced as the necessary semantic constraints when we handle Chinese NP predicates by a lexical rule (see 3.3).

PHON      shenti
SYNSEM | KNOWLEDGE | PRED [1] possess
SYNSEM | KNOWLEDGE | POSSESSOR [2] human
SYNSEM | KNOWLEDGE | POSSESSED [3]
SYNSEM | LOCAL | CONTENT | INDEX [3]
SYNSEM | LOCAL | CONTENT | RESTRICTION { RELATION body }
SYNSEM | LOCAL | CONTENT | RESTRICTION { INSTANCE [3] }

  2. Agreement revisited

This section relates semantic constraints which embody common sense to the conventional linguistic notion of agreement. We will show that they are essentially the same thing from different perspectives. We only need slight expansion for the definition of agreement to accommodate some of our basic knowledge. This is important as it accounts for the feasibility of coding knowledge in linguistic ways. Linguistic lexicon seems to be good enough to house some general knowledge in addition to linguistic knowledge. Some possible problems with this knowledge-based approach are also discussed.

Let's first consider the following two parallel agreement problems in English:

13) *    The boy drink.

14) ?    The air drinks.

13) is ungrammatical because it violates the syntactic agreement between the subject and predicate. 14) is conventionally considered grammatical although it violates the semantic agreement between the agent and the action. Since the approach taken in this paper is motivated by semantic agreement, some elaboration and comment on agreement seem to be needed.

Agreement in person, gender and number is included in the CONTENT | INDEX features (Pollard & Sag 1994, Chapter 2). It follows that any two co-indexed signs naturally agree with each other. That is desirable because co-indexed signs refer to the same entity. However, person, gender and number seem to be only part of the story of agreement. We may expand the INDEX feature to cope with semantic agreement for handling Chinese and for in-depth semantic analysis of other languages as well.

Note that to accommodate semantic agreement in HPSG, we first need features to represent the result of semantic classification of lexical meanings like HUMAN, FOOD, FURNITURE, etc. We therefore propose a ROGET feature (named after the thesaurus dictionary) and put it into the INDEX feature.

Semantic agreement, termed sometimes as semantic constraint or semantic selection restriction in literature, is not a new conception in natural language processing. Hardly any in-depth language analysis can go smoothly without incorporating it to a certain extent. For languages like Chinese with virtually no inflection, it is more important. We can hardly imagine how the roles can be correctly assigned without the involvement of semantic agreement in the following sentences of the form NP1 NP2 Vt:

15a)     点心我吃了。

dianxin            wo       chi       le.
Dim-Sum         I           eat       LE
The Dim Sum I have eaten.

15b)     我点心吃了。

wo       dianxin            chi       le.
I           Dim-Sum         eat       LE
I have eaten the Dim Sum.

Who eats what? There is no formal way but to resort to semantic agreement enforced by eat to correctly assign the roles. In HPSG 1994, it was pointed out (Pollard & Sag 1994, p81), "... there is ample independent evidence that verbs specify information about the indices of their subject NPs. Unless verbs 'had their hands on' (so to speak) their subjects' indices, they would be unable to assign semantic roles to their subjects." The Chinese data show that sometimes verbs need to have their hands on the semantic categories (ROGET) of both their external argument (subject) and internal arguments to be able to correctly assign roles. Now that we have expanded the INDEX feature to cover both ROGET and the conventional agreement features number, person and gender, the above claim of Pollard and Sag becomes more general.

It is widely agreed that knowledge is bound to play an important role in natural language analysis and disambiguation. The question is how to build a knowledge-based system which is manageable. Knowledge consists of linguistic knowledge (phonology, morphology, syntax, semantics, etc.) and extra-linguistic knowledge (common sense, professional knowledge, etc.). Since semantics is based on lexical meanings, lexical meanings represent concepts and concepts are linked to each other in a way to form knowledge, we can well regard semantics as a link between linguistics and beyond-linguistics in terms of knowledge. In other words, some extra-linguistic knowledge may be represented in linguistic ways. In fact, lexicon, if properly designed, can be a rich source of knowledge, both linguistic and extra-linguistic. A typical example of how concepts are linked in a network (a sophisticated concept lexicon) is seen in the representation of drink ((*ANI SUBJ) (((FLOW STUFF) OBJE) ((SELF IN) (((*ANI (THRU PART)) TO) (BE CAUSE))))) in Wilks 1975b. While for  various reasons we will not go as far as Wilks, we can gain enlightenment from this type of AI approach to knowledge. Lexicon-driven systems like the one in HPSG can, of course, make use of this possibility. Take the Chinese role-assignment problem, for example, the common sense that ANIMATE being eats FOOD can be seamlessly incorporated in the lexical entry chi (eat) as a semantic agreement requirement.

PHON chi
SYNSEM | KNOWLEDGE | PRED [1]  eat
SYNSEM | KNOWLEDGE | AGENT [2] animate
SYNSEM | KNOWLEDGE | PATIENT [3] food
SYNSEM | LOCAL | CATEGORY | SUBCAT | EXTERNAL_ARGUMENT [NP: [4]]
SYNSEM | LOCAL | CATEGORY | SUBCAT | INTERNAL_ARGUMENTS <[NP: [5]]>
SYNSEM | LOCAL | CONTENT | RELATION [1]
SYNSEM | LOCAL | CONTENT | EATER [4] | INDEX | ROGET [2]
SYNSEM | LOCAL | CONTENT | EATEN [5] | INDEX | ROGET [3]

Note:        Following the convention, the part after the colon is SYNSEM | LOCAL | CONTENT information.
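
To see how the co-indexing in the entry for chi decides who eats what in 15a) and 15b), consider the following sketch. It is written in Python purely for illustration and is not the ALE grammar: the tables `ISA` and `ROGET`, the dictionary `CHI` and the function `assign_roles` are hypothetical, and the ROGET values of wo and dianxin are assumed.

```python
# Thesaurus inference used by semantic agreement: human --> animate.
ISA = {"human": "animate", "animate": None, "food": None}

# Assumed ROGET (semantic category) values for the two NPs in 15a) and 15b).
ROGET = {"wo": "human", "dianxin": "food"}

# KNOWLEDGE part of the entry chi: ANIMATE being eats FOOD.
CHI = {"PRED": "eat", "AGENT": "animate", "PATIENT": "food"}


def agrees(category, constraint):
    """Semantic agreement: the category is the constraint or a subtype of it."""
    while category is not None:
        if category == constraint:
            return True
        category = ISA.get(category)
    return False


def assign_roles(np_a, np_b, knowledge):
    """Try both role assignments for NP NP V; keep those that satisfy agreement."""
    readings = []
    for agent, patient in ((np_a, np_b), (np_b, np_a)):
        if (agrees(ROGET[agent], knowledge["AGENT"])
                and agrees(ROGET[patient], knowledge["PATIENT"])):
            readings.append({"RELATION": knowledge["PRED"],
                             "EATER": agent, "EATEN": patient})
    return readings
```

Both word orders, `assign_roles("dianxin", "wo", CHI)` and `assign_roles("wo", "dianxin", CHI)`, yield the single reading in which wo is the EATER and dianxin the EATEN, because only that assignment satisfies the constraints shared between KNOWLEDGE and CONTENT.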

One last point we would like to make in this context is that semantic agreement, like syntactic agreement, should be able to loosen its restriction. In other words, agreement is just a canonical requirement, or in Wilks' terms a preference (Wilks 1975a). In actual communication, deviation in different degrees is often seen, and people often relax the preference restriction in order to understand. With semantic agreement, deliberate deviation is one of the handy means of rendering rhetorical expression. In a certain domain, Chomsky's famous sentence Colorless green ideas sleep furiously is well imaginable. On the other hand, a deviation in syntactic agreement will not affect the meaning if no confusion is caused, which may or may not happen depending on context and the structure of the language. In English, lack of syntactic agreement for the present third person singular between subject and predicate usually causes no problem. Sentence 13) The boy drink therefore can be accepted and correctly interpreted. There is much more to say on the interaction of the two types of agreement deviation, how a preference model might be conceived, what computational complexities it may cause and how to handle them effectively. We plan to address these issues in another paper. The interested reader is referred to one famous approach in this direction (Wilks 1975a, 1978).

 

3. Solution

We will set some requirements first and then present a lexical rule to see how well it meets our requirements.

3.1. Based on the discussion in Section 1, the solution to the Chinese predicate NP problem should meet the following 4 requirements:

(1)        It should enforce the syntactic constraints for this pattern: one and only one modifier XP in the form of NP1 XP N2.

(2)        It should enforce the semantic constraints for this pattern: N2 must expect NP1 as its POSSESSOR with semantic agreement.

(3)        It should correctly assign roles to the constituents of the pattern: NP1 POSSESS NP2 (where NP2 consists of XP N2).

(4)        It should be implementable in HPSG formalism.

 

3.2. What mechanisms can we use to tackle a problem in HPSG formalism?

An HPSG grammar consists of two components: a general grammar (ID schemata and principles) and a lexical grammar (in the lexicon). The lexicon houses lexical entries with their linguistic description and knowledge representation in feature structures. The lexicon also contains generalizations captured by inheritance in the lexical hierarchy and by a set of lexical rules. Roughly speaking, the lexical hierarchy covers static redundancy between related potential structures. Precisely because the lexicon can reflect different degrees of lexical redundancy in addition to idiosyncrasy, the general grammar can be kept to a minimum.

The Chinese NP predicate pattern should be treated in the lexicon. There are two arguments for that. First, this pattern covers only restricted phenomena (see 3.4). Second, it relies heavily on the semantic agreement, which in our model is specified in the lexicon by KNOWLEDGE. We need somehow to link the semantic expectation KNOWLEDGE and the syntactic expectation SUBCAT or MOD. The general mechanism to achieve that is structure sharing by coindexing the features either directly in the lexical entries (see the representation of the entry chi in Section 2) or through lexical rules (see 3.3).

3.3. Lexical Rule

Lexical rules are applied to lexical signs (words, not phrases) which satisfy the condition. The result of the application is an expanded lexicon to be used during parsing. Since the pattern is of the form NP1 XP N2, the only possible target is N2, i.e. shenti (body) or tui (leg). This is due to the fact that among the three necessary signs in this form, the first two are phrases and only the final N2 is a lexical sign. We assume the following structure for our proposed lexical rule:

NP[ta[1]]         [[AP[2] hao] [N<NP[1], XP[2]> shenti]]

NP Predicate Lexical Rule


SYNSEM | KNOWLEDGE | PRED [1] possess
SYNSEM | KNOWLEDGE | POSSESSOR [2]
SYNSEM | LOCAL | CATEGORY | HEAD | MAJ [6] n
SYNSEM | LOCAL | CATEGORY | PREDICATE -
SYNSEM | LOCAL | CONTENT | INDEX [4]
SYNSEM | LOCAL | CONTENT | RESTRICTION {[3]}
...| CATEGORY | PREDICATE +
...| CATEGORY | SUBCAT | EXTERNAL_ARGUMENT [NP: [5]]
...| CATEGORY | SUBCAT | INTERNAL_ARGUMENTS < [...| CATEGORY | HEAD | MOD [6] ] >
...| CATEGORY | SUBCAT | INTERNAL_ARGUMENTS < [...| CONTENT | INDEX [4] ] >

==>

...| CATEGORY | SUBCAT | INTERNAL_ARGUMENTS < [...| CONTENT | RESTRICTION {[7]} ] >
...| CATEGORY | SUBCAT | INTERNAL_ARGUMENTS < [...| LEX - ] >
...| CONTENT | RELATION [1] possess
...| CONTENT | POSSESSOR [5] | INDEX | ROGET [2]
...| CONTENT | POSSESSED | INDEX [4]
...| CONTENT | POSSESSED | RESTRICTION {[7] | [3] }

For a complicated information flow like this, it is best to explain the indices one by one with regard to the example ta hao shenti (he is of good body) in the form of NP1 XP N2.

The index [1] links the underlying PRED feature of N2 to the semantic RELATION feature; in other words, the predicate in the underlying KNOWLEDGE of shenti (body) now surfaces as the relation for the whole sentence. The index [2] enforces the semantic constraint for this pattern, i.e. shenti (body) expects a human (ROGET) possessor as the subject (EXTERNAL_ARGUMENT) for this sentence. The index [3] is the restriction relation of N2. [4] links the INDEX features of XP and N2, and [6] indicates that the internal argument is a de-facto modifier of N2, i.e. XP mods-for N2. Note that the part of speech of the internal argument (INTERNAL_ARGUMENT | SYNSEM | LOCAL | CATEGORY | HEAD | MAJ) is deliberately not specified in the rule because Chinese modifiers (XP) are not confined to one class, as can be seen in our linguistic data. Finally, [7] defines the restriction relation of the XP to the INDEX of N2.

The indices [4], [7] and [3] all contribute to artificially creating a semantic interpretation for [XP N2]. As interpreted, XP is in fact a modifier of N2, and together they would form NP2, the [XP N2] constituent. In normal circumstances, the building of the NP2 interpretation is taken care of by the HPSG Semantics Principle. But in this special pattern we have treated XP as a complement of N2, yet semantically they are still understood as one instance: hao shenti (good body) is an instance of good and body. This interpretation of NP2 serves as the POSSESSED of the sentence predicate, indicated by the structure sharing of [4], [7] and [3]. Finally, [5] is the interpretation of NP1 and is assigned the role of POSSESSOR for the sentence predicate.

Let's see how well this lexical rule meets the 4 requirements set in 3.1.

(1) It enforces the syntactic constraints by treating XP as the internal argument and NP1 as the external argument.

(2) It enforces the semantic constraints through structure sharing by the index [2].

(3) It correctly assigns roles to the constituents of the pattern.

The following interpretation will be established for ta hao shenti (he is of good body) by the parser.


CONTENT | RELATION possess
CONTENT | POSSESSOR | INDEX | PERSON 3
CONTENT | POSSESSOR | INDEX | NUMBER singular
CONTENT | POSSESSOR | INDEX | GENDER male
CONTENT | POSSESSOR | INDEX | ROGET human
CONTENT | POSSESSOR | RESTRICTION { }
CONTENT | POSSESSED | INDEX [1]    | PERSON 3
CONTENT | POSSESSED | INDEX          | NUMBER singular
CONTENT | POSSESSED | INDEX          | GENDER nil
CONTENT | POSSESSED | INDEX          | ROGET organ
CONTENT | POSSESSED | RESTRICTION { [ RELATION good],              [ RELATION body  ] }
CONTENT | POSSESSED | RESTRICTION { [ INSTANCE [1] ],              [ INSTANCE [1]  ] }

In prose, it says roughly that a third person male human he possesses something which is an instance of good body. We believe that this is the adequate interpretation for the original sentence.
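The effect of the NP Predicate Lexical Rule on ta hao shenti can be mimicked with plain dictionaries. This is a rough sketch, not the ALE rule: the flat dictionary layout and the names `np_predicate_rule` and `interpret` are assumptions, and coindexing is reduced to copying values.

```python
def np_predicate_rule(noun):
    """Derive a predicative entry from a noun whose KNOWLEDGE has PRED possess.

    Returns None when the input does not satisfy the rule's condition.
    """
    knowledge = noun.get("KNOWLEDGE", {})
    if noun.get("MAJ") != "n" or knowledge.get("PRED") != "possess":
        return None
    return {
        "PHON": noun["PHON"],
        "PREDICATE": True,
        # NP1: external argument, must agree with the expected possessor class.
        "EXTERNAL_ARGUMENT": {"CAT": "NP", "ROGET": knowledge["POSSESSOR"]},
        # XP: exactly one phrasal internal argument, part of speech left open.
        "INTERNAL_ARGUMENTS": [{"LEX": False}],
        "RESTRICTION": noun["RESTRICTION"],
    }

def interpret(entry, np1, xp):
    """Build the possess relation for NP1 XP N2, or None if agreement fails."""
    if np1["ROGET"] != entry["EXTERNAL_ARGUMENT"]["ROGET"]:
        return None
    return {
        "RELATION": "possess",
        "POSSESSOR": np1["PHON"],
        "POSSESSED": {"RESTRICTION": [xp["RELATION"], entry["RESTRICTION"]]},
    }

# Hypothetical simplified entries.
SHENTI = {"PHON": "shenti", "MAJ": "n",
          "KNOWLEDGE": {"PRED": "possess", "POSSESSOR": "human"},
          "RESTRICTION": "body"}
XUEXI = {"PHON": "xuexi", "MAJ": "n", "KNOWLEDGE": {}, "RESTRICTION": "study"}
```

Applying the rule to `SHENTI` and interpreting it with ta and hao yields a possess relation whose POSSESSED carries both restrictions (good, body); `XUEXI`, which has no possess knowledge in this toy lexicon, does not undergo the rule.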

(4) Last, this rule has been implemented in our Chinese HPSG-style grammar using ALE and Prolog.  The results meet our objective.

But there is one issue we have not yet touched on: word order. At first sight, Chinese seems to have linear precedence (LP) constraints similar to those in English. For example, the internal argument(s) of a Chinese transitive verb by default appear to the right of the head. It seems that our formulation contradicts this constraint. But in fact there are many other examples in which the internal argument(s), especially PP argument(s), appear to the left of the head.

服务 fuwu (serve): <NP, PP(wei)>

16a) 为人民服务

wei      renmin fuwu
for       people  serve
Serve the people.

16b) ? 服务为人民。

fuwu    wei      renmin.
serve    for       people

有益 youyi (of benefit): <NP, PP(dui yu)>

17a) 这对我有益。

zhe       dui       wo       youyi
this      to         I           have-benefit
This is of benefit to me.

17b) * 这有益对我。

zhe       youyi               dui       wo
this      have-benefit    to         I

18a) 这于我有益。

zhe       yu        wo       youyi
this      to         I           have-benefit
This is of benefit to me.

18b) 这有益于我。

zhe       youyi               yu        wo
this      have-benefit    to         I
This is of benefit to me.

Word order and its place in grammar are important issues in formulating a Chinese grammar. To play it safe and avoid premature generalization, we assume a lexicalized view of Chinese LP constraints, encoding word order information in the lexicon through the SUBCAT and MOD features. This proves to be a realistic and precise approach to Chinese word order phenomena.
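A lexicalized view of word order can be sketched by storing, for each head, the side(s) on which each PP argument may appear. The table below encodes only the judgments in 16) to 18); treating the marginal 16b) as disallowed is an assumption, and `LEXICON` and `pp_order_ok` are hypothetical names.

```python
# For each head: which preposition heads its PP argument, and on which side(s).
LEXICON = {
    "fuwu": [{"prep": "wei", "sides": {"left"}}],
    "youyi": [
        {"prep": "dui", "sides": {"left"}},
        {"prep": "yu", "sides": {"left", "right"}},
    ],
}

def pp_order_ok(head, prep, side):
    """True if the lexicon licenses a PP headed by `prep` on `side` of `head`."""
    return any(frame["prep"] == prep and side in frame["sides"]
               for frame in LEXICON.get(head, []))
```

Because the direction lives in each entry, the contrast between 17b) and 18b) needs no general LP rule: the dui frame simply lists only the left side, while the yu frame lists both.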

3.4. As a final note, we will briefly compare the NP Predicate Pattern with one of the Chinese Topic Constructions:

NP1 NP2 Vi/A
(topic + (subject + predicate))

In Chinese, this is a closely related but much more productive form than this NP Predicate Pattern. And their structures are different.

19)       他身体好。

ta         shenti   hao
he        body    good
He is good in health.

For topic constructions, we propose a new feature CONTEXT | TOPIC, whose index in this case is token-identical to the INDEX value of ta. Note that in the above structure the CONTEXT | TOPIC ta is considered a sentential adjunct rather than a complement subcategorized for by shenti. Why? First, ta is highly optional: a topic-less sentence is still a sentence. Second, and more convincingly, ta cannot always be predicted by the noun that follows it. Compare:

20a) 他身体好。

ta         shenti   hao
he        body    good
He is good in health.

20b) 他好身体。

ta         hao      shenti
he        good    body
He is of good health.

21a) 他脾气好。

ta         piqi                  hao
he        disposition       good
He is good in disposition.

21b)  他好脾气。

ta         hao      piqi
he        good    disposition
He is of good disposition.

but:

22a) 他学习好。

ta         xuexi   hao. [3]
he        study   good
He is good in study.

22b) *  他好学习。

ta         hao      xuexi
he        good    study

What this shows is that for topic sentences like ta shenti hao (He is good in health), ta xuexi hao (He is good in study), etc., there is no requirement to regard the topic ta (he) as a necessary semantic possessor of shenti / xuexi. The relation is rather "in-aspect": something (NP1) is good (A) in some aspect (NP2), or for something (NP1), some aspect (NP2) is good (A).

Finally, it should be mentioned that our proposed lexical rule requires modification to accommodate sentence 6). That is beyond what we can cover in this paper because it is tied to the way we handle Chinese classifiers in the HPSG framework.

 

References

Pollard, Carl  & Sag, Ivan A. (1994): Head-Driven Phrase Structure Grammar,  Centre for the Study of Language and Information, Stanford University, CA

Pollard, Carl & Sag, Ivan A. (1987): Information‑based Syntax and Semantics Vol. 1: Fundamentals. Centre for the Study of Language and Information, Stanford University, CA

Wilks, Y.A. (1975a): A Preferential Pattern-Seeking Semantics for Natural Language Inference.  Artificial Intelligence, Vol. 6, pp.53-74.

Wilks, Y.A. (1975b): An Intelligent Analyzer and Understander of English, in Communications of the ACM, Vol. 18, No.5, pp.264-274

Wilks, Y.A. (1978): Making Preferences More Active.  Artificial Intelligence, Vol. 11, pp. 197-223

~~~~~~~~~~~~~~~ footnotes ~~~~~~~~~~~~~~~~

[1] This is not absolute, we do have the following examples:

Ia)          约翰是纽约人。

Yuehan shi           Niuyue                   ren
John       be            New-York              person
John is a New Yorker.

Ib)           约翰纽约人。

Yuehan  Niuyue                   ren.
John       New-York              person
John is a New Yorker.

IIa)         今天是星期天。

jintian    shi           xingqi-tian.
today     be            Sun-day
Today is Sunday.

IIb)         今天星期天。

jintian    xingqi-tian.
today     Sun-day
Today is Sunday.

It seems that the subject NP stands for some individual element(s), and the predicate NP describes a set (property) to which the subject belongs. But it is not clear how to capture Ib) and IIb) while excluding 7b). We leave this question open.

[2] We realize that the syntactic constraint defined here is only a rough approximation to the data from syntactic angle. It seems to match most data, but there are exceptions when yi (one) appears in a numeral-classifier phrase:

IIIa)  他一副好身体。

ta            yi             fu            hao         shenti.
he            one         (cl.)         good       body
He is of good health. (He is of a good body.)

IIIb) * 他三副好身体。

ta            san          fu            hao         shenti
he            three       (cl.)         good       body

IIIc)   他好身体。

ta            hao         shenti.    [same as 1) ]

IVa) 李四一张圆圆的脸。

Lisi          yi             zhang     yuanyuan             de            lian.
Lisi          one         (cl.)         round-round         DE          face
Lisi has a quite round face.

IVb) * 李四两张圆圆的脸。

Lisi          liang       zhang     yuanyuan             de            lian.
Lisi          two         (cl.)         round-round         DE          face

IVc)  李四圆圆的脸。

Lisi          yuanyuan             de            lian.        [ same as 3) ]

[3] Another reading for 22a) is [S [Sta xuexi][AP hao]], where ta xuexi is a subject clause: "That he studies is good". This is another issue.

 

[Related]

Interaction of syntax and semantics in parsing Chinese transitive verb patterns 

Notes for An HPSG-style Chinese Reversible Grammar

Outline of an HPSG-style Chinese reversible grammar

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

PhD Thesis: Chapter VII Concluding Remarks

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

Notes for An HPSG-style Chinese Reversible Grammar

ABSTRACT

Key words: Chinese parsing, Chinese generation, reversible grammar,  HPSG

This paper presents a reversible Chinese unification grammar named CPSG. The lexicalized and integrated design of CPSG embodies the general spirit of the modern linguistic theory Head-driven Phrase Structure Grammar (HPSG, Pollard & Sag 1987, 1994). Using ALE formalism in Prolog (Carpenter & Penn 1994), we have implemented a prototype of CPSG.

CPSG covers Chinese morphology, Chinese syntax and semantics in a novel integrated language model (Figure 1; for the interface between morphology and syntax, see Li 1997; for the interface between syntax and semantics, see Li 1996). The CPSG model is in sharp contrast to the conventional clear-cut successive design of grammar components (Figure 2, see the survey in Feng 1996). We will show that our model is much better suited to, and more efficient for, Chinese analysis (or generation).

 


Grammar reversibility is a highly desired feature for multi-lingual machine translation application (Hutchins & Somers 1992, Huang 1986, 1987). To test its reversible features, we have applied the CPSG prototype to an experiment of bi-directional machine translation between English and Chinese. The machine translation engine developed in our Natural Language Lab is based on shake-and-bake design, a novel approach to machine translation suited for unification grammars (Whitelock 1992, 1994, Beaven 1992, Brew 1992). The experimental results meet our design objective and verify the feasibility of CPSG approach.

~~~~~~~~~~~~~~~~~~~~~

Notes for NWLC-97, UBC, Vancouver

Outline of An HPSG-style Chinese Reversible Grammar

Wei LI

Linguistics Department, Simon Fraser University

 

Key words: lexicalist approach, integrated language model, HPSG, reversible grammar, bi-directional machine translation, Chinese computational grammar, Chinese word identification, Chinese parsing, Chinese generation

 

  1. background

1.1. design philosophy

Two major obstacles in writing a Chinese computational grammar:

(1) lack of serious study of the Chinese lexical base

    a well-designed lexicon is crucial for a successful computational system

    theoretical linguists have made fruitful efforts (e.g. Li Linding) but lack formalization

    computational linguists need more patience in adapting and formalizing these results: it is a huge amount of work, but it has to be done if a non-toy system is targeted

(2) lack of effective interaction between morphology, syntax and semantics, e.g.

    ambiguity in word identification makes it hard to interface morphology and syntax: a theoretical defect of the morphology preprocessor (segmenter), e.g. ABC: ABC or A | BC or AB | C or A | B | C?

    active/passive isomorphic phenomena make semantic constraints necessary in parsing NP Vt: subject NP or object NP?

Solution: the lexicalized and integrated design of Chinese grammar
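The ABC example above can be made concrete by enumerating every segmentation that a lexicon allows, which is what a non-deterministic tokenizer must hand to the parser. The sketch below is a plain recursive enumeration with a hypothetical toy lexicon; it is not the CPSG chart parser.

```python
def segmentations(s, lexicon):
    """Return all ways to split `s` into consecutive words found in `lexicon`."""
    if not s:
        return [[]]
    results = []
    for i in range(1, len(s) + 1):
        prefix = s[:i]
        if prefix in lexicon:
            # Keep this prefix as a word and segment the remainder recursively.
            results.extend([prefix] + rest for rest in segmentations(s[i:], lexicon))
    return results

# Hypothetical lexicon in which every substring of "ABC" is a word.
TOY_LEXICON = {"A", "B", "C", "AB", "BC", "ABC"}
```

For `segmentations("ABC", TOY_LEXICON)` this yields the four candidates listed in the outline; a segmenter run as a preprocessor has to commit to one of them before any syntactic evidence is available.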

1.2. major theoretical foundation:

HPSG:       lexicalist theory encouraging integration of different components

a desired framework matching our design philosophy

CPSG: HPSG-style unification grammar

CPSG: reversible grammar suited for both parsing and generation

CPSG: formalized grammar, a description that does not rely on undefined notions

2. integrated language model

2.1. CPSG versus conventional Chinese grammar

 

 

parse tree embodies both morphological and syntactic structures in CPSG

3. lexicalized formal grammar

3.1. formalized grammar, as required by a computational grammar: formulation of CPSG

readily implementable (theories, principles, rules, etc.);

precise definition for the very basic notions (e.g. sign, morpheme, word, phrase, sentence, NP, VP, etc.), rules (PS rules and lexical rules), lexical items (lexical hierarchy), typology (hierarchy embodied in feature structures)

(4.)       Definition: sign

A sign is the most fundamental concept of grammar. Formally, a sign is defined by the type [a_sign], which introduces a set of linguistic features for its description, as shown below.

a_sign
INDEX index
KANJI kanji
MORPH1 expected
MORPH2 expected
CATEGORY category
COMP0 expected
COMP1 expected
COMP2 expected
MOD expected
KNOWLEDGE knowledge
CONTENT content
INDEX0 index
INDEX1 index
INDEX2 index
DTR dtr

(5.)       Definition: word

In CPSG, a word is a sign satisfying the following two conditions: (1) all of its obligatory morphological expectations have been saturated; (2) it is not the mother of any syntactic structure and hence has no syntactic daughters. Formally, a word is defined as shown below.

(6.)       word

a_sign
MORPH1 ~obligatory
MORPH2 ~obligatory
DTR no_syn_dtr
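The two conditions in the definition of word translate directly into a predicate over a sign. In this sketch a sign is a plain dictionary; the feature values other than `obligatory` and `no_syn_dtr` (for example `saturated`, `null`, `comp0`) are assumed for illustration.

```python
def is_word(sign):
    """True if no morphological expectation is still obligatory
    and the sign has no syntactic daughters."""
    return (sign.get("MORPH1") != "obligatory"
            and sign.get("MORPH2") != "obligatory"
            and sign.get("DTR") == "no_syn_dtr")
```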

3.2. lexicalized grammar

CPSG consists of two parts:

(1) a minimized general grammar:

only 11 phrase structure rules
(covering complement structure, modifier structure,
conjunctive structure and morphological structure)

(2) a feature enriched lexicon:

lexical entries;
lexical hierarchy and a set of lexical rules
(capturing lexical generalizations).

 

(7.)          comp0 PS rule

MOTHER               a_sign
COMP0 saturated
COMP1 [1]
COMP2 [2]
DTR comp0
MYSISTER [6]
LEFTMOD [7] category
RIGHTMOD [8] category
LEFTCOMP [9] category
RIGHTCOMP [10] category

===>

EXPECTING          a_sign
COMP0 a_expected
DIRECTION left
ROLE [3]
SIGN [4]
COMP1 [1] ~obligatory
COMP2 [2] ~obligatory
INDEX [5]
DTR dtr
LEFTMOD [7]
RIGHTMOD [8]
RIGHTCOMP [10]

EXPECTED            a_sign [4]
CONTENT content
MYHEAD [5]
MYROLE [3] comp_role
INDEX [6]
CATEGORY [9]

PRINCIPLE            #head_feature

(8.)          lexical entry: chi

a_sign
    KANJI one_character
        H1 chi
    CATEGORY v
    INDEX0 [1] index
    INDEX1 [2] index
    COMP0 a_expected
        DIRECTION left
        SIGN a_sign
            CATEGORY n
            INDEX [1]
    COMP1 a_expected
        DIRECTION right
        SIGN a_sign
            CATEGORY n
            INDEX [2]
    KNOWLEDGE eat
        U_OBJECT food
            MALE none
            PERSON 3
            SINGULAR bin
        U_SUBJECT animate
            MALE bin
            PERSON tri
            SINGULAR bin
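To show how an entry like chi interacts with the comp0 PS rule, here is a much-reduced sketch: the head expects a sign of a given category on its left, and once that sign is found the mother is marked as COMP0-saturated while the head's other features are carried up. The dictionary layout and `apply_comp0` are assumptions; real CPSG signs are typed feature structures unified in ALE.

```python
def apply_comp0(left, head):
    """Combine `left` with `head` if head's COMP0 expects a matching sign on the left.

    Returns the mother sign, or None if the rule does not apply.
    """
    comp0 = head.get("COMP0")
    if not isinstance(comp0, dict):          # already saturated, or no expectation
        return None
    if comp0["DIRECTION"] != "left" or comp0["CATEGORY"] != left["CATEGORY"]:
        return None
    mother = dict(head)                       # head features are passed up
    mother["COMP0"] = "saturated"
    mother["DTR"] = ("comp0", left["H1"], head["H1"])
    return mother

# Hypothetical simplified entries.
WO = {"H1": "wo", "CATEGORY": "n"}
CHI = {
    "H1": "chi",
    "CATEGORY": "v",
    "COMP0": {"DIRECTION": "left", "CATEGORY": "n"},
    "COMP1": {"DIRECTION": "right", "CATEGORY": "n"},
}
```

Calling `apply_comp0(WO, CHI)` returns a mother of category v whose COMP0 is saturated and whose COMP1 expectation is still open; applying the rule to that mother again returns None.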

4. Implementation and Application of CPSG

CPSG prototype implemented in ALE and Prolog, having parsed a corpus of 200 various types of sentences

ALE and Prolog: suitable for unification grammar
ALE:         mechanism for typed feature structures: type polymorphism
a powerful tool in language modeling

CPSG prototype adapted for application to bi-directional MT, having generated the same corpus of 200 sentences

References

Beaven, John L. (1992): "Shake and Bake Machine Translation", Proceedings of the 14th International Conference on Computational Linguistics, pp. 603-609, Nantes, France.

Brew, Chris (1992): "Letting the Cat out of the Bag: Generation for Shake-and-bake MT", Proceedings of the 14th International Conference on Computational Linguistics, pp. 610-616, Nantes, France.

Carpenter, B. & Penn, G. (1994): ALE, The Attribute Logic Engine, User's Guide

Feng, Z.  (1996): "COLIPS Lecture Series - Chinese Natural Language Processing",  Communications of COLIPS, Vol.6, No.1 1996, Singapore (http://www.iscs.nus.sg/~colips/commcolips/paper/p96.html)

Huang, X-M. (1986): "A Bidirectional Grammar for Parsing and Generating Chinese".  Proceedings of the International Conference on Chinese Computing, Singapore, pp. 46-54

Huang, X-M. (1987): XTRA: The Design and Implementation of A Fully Automatic Machine Translation System, Doctoral dissertation, University of Essex.

Hutchins, W.J. & H.L. Somers (1992): An Introduction to Machine Translation. London, Academic Press.

Li, W. (1996): Interaction of Syntax and Semantics in Parsing Chinese Transitive Patterns. Proceedings of International Conference on Chinese Computing (ICCC'96), Singapore

Li, W. (1997): Chart Parsing Chinese Character Strings. Proceedings of The Ninth North American Conference on Chinese Linguistics (NACCL-9, to be available), Victoria, Canada

Pollard, C.  & I. Sag (1987): Information based Syntax and Semantics Vol. 1: Fundamentals. Centre for the Study of Language  and Information, Stanford University, CA

Pollard, C.  & I. Sag (1994): Head-Driven Phrase Structure Grammar,  Centre for the Study of Language and Information, Stanford University, CA

Whitelock, Pete (1992): "Shake and Bake Translation", Proceedings of the 14th International Conference on Computational Linguistics, pp. 784-790, Nantes, France.

Whitelock, Pete (1994). "Shake and Bake Translation", C.J. Rupp, M.A. Rosner, and R.L. Johnson (eds.), Constraints, Language and Computation, pp. 339-359, London, Academic Press.

 

[Related]

Outline of an HPSG-style Chinese reversible grammar

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

PhD Thesis: Chapter VII Concluding Remarks

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

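【One parse a day: starting from the subcat of “见面” (jian-mian, meet)】

Bai:
“三两面” and “两三面” are very different...
我借过他三两面 (I once borrowed three liang of flour from him). 我见过他两三面 (I have met him two or three times).

Me:
三两面 > 两三面
我见过他三两面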

0912a
ditransitive, no problem, but:

0912b

separable verb jian-mian is still not connected

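There are more cases:
(0)我见过他两三面。 (I have met him two or three times.)
(1)我见过他。 (I have met him.)
(2)我与他见过面。 (I have met with him.)
(3)* 我见过面 (*I have met.)
(4)我们见过面。 (We have met.)
(5)我与他,见面过。 (He and I have met.)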

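“见面” requires one of the following: a plural subject (4); a coordinated subject (5); a prepositional phrase with “与” (with) (the boundary between PP and coordination is blurry in Chinese, see (2)); or a 【human】 modifier in front of the quasi verbal-measure expression “两三面”. All these syntactic subcat requirements serve to fill one semantic (or common-sense) slot for a 【human】: common sense says that “见面” must take place between two or more human entities.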
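The alternatives above can be read as a disjunction of surface configurations, any one of which supplies the second human participant. The sketch below is only an illustration: the four boolean features are hand-annotated per sentence (a real system would read them off the parse), and all names are hypothetical.

```python
def jianmian_licensed(subject_plural=False, subject_coordinated=False,
                      has_yu_pp=False, human_before_mian=False):
    """True if at least one configuration fills the second 【human】 slot."""
    return any((subject_plural, subject_coordinated, has_yu_pp, human_before_mian))

# Hand-annotated features for the example sentences (assumed, not parsed).
EXAMPLES = {
    "我见过他两三面": dict(human_before_mian=True),
    "我与他见过面": dict(has_yu_pp=True),
    "我见过面": dict(),
    "我们见过面": dict(subject_plural=True),
    "我与他,见面过": dict(subject_coordinated=True),
}
```

Only the starred sentence (3) fails the check, which matches the judgment for generation; a robust parser may still choose to accept it.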

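HPSG-style theories and linguistic representations, which are word-driven and depend heavily on the subcat data structure, are cumbersome, but they have one bright spot: the syntactic requirements above are described as matching conditions on the input, the inherent semantic requirements (similar to the descriptions in HowNet) as the semantic output, and the two are formalized item by item in meticulous, tightly interlocking detail. The mechanism is unification of labels (that is, sharing of the substructures the labels stand for). Most systems are far too crude about the internal structure of subcat, the mapping from input to output, and the underlying relation between syntax and semantics (semantics is both the motivation and the goal of syntax: syntax is matched, semantics is realized).

Going too far is as bad as not going far enough, and the reverse holds too. We have been exploring how to make the representation and implementation of subcat moderate without being mediocre, and simple without being crude.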

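Bai:
他我见过几面 (Him, I have met a few times)

Me:
A prime example of crudeness is the subcat codes in dictionaries for human readers such as the Oxford advanced dictionary and Longman, things like v1, ..., v23. Later, New York University organized CL graduate students to build subcat lexicons such as CompLex and NomLex. On the Chinese side, 【现代汉语800词】 from the Institute of Linguistics of the Chinese Academy of Social Sciences pioneered subcat description, and the series of dictionaries such as 【动词用法词典】 began trying to express subcat with some coding scheme plus example sentences. In terms of data representation and relations, all of this work looks somewhat crude. The root cause is that syntax and semantics were not clearly separated.

An NLP practitioner who picks up these resources must digest them and connect syntax with semantics in their own head, then settle on a data structure and find their own route to implementation. The implementation can hardly be as elegant as a unification grammar; mostly it is a makeshift, meant to avoid the inefficiency and hard-to-maintain data structures of HPSG-style implementations.

Mr. Dong's HowNet is unsurpassed semantically for the subcat of Chinese and English, but on the syntactic side it still seems not detailed and complete enough. For example, the six or seven syntactic specifications for “见面” listed above do not seem to be described one by one (Mr. Dong, please correct me: perhaps I have not fully understood it), and I have not seen anyone else describe them clearly either. They too need to be re-digested and then implemented.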

0912c

(3) is ill-formed (*) for generation, but for parsing, robustness requires parsing it this way, and that is correct.

0912d

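It came out without any debugging; call it dumb luck on 9/12. (9/11 was the terror attack and 9/13 was Lin's flight; neither was a good day.) Only the case “我见过他两三面” is left. This quasi verbal-measure complement is in fact limited to a handful of forms: “一面”, “几面”, “两三面”, “三两面” and a few others. At the very least, 100+ 面 is basically impossible, unless the two are lovers.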

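Zhang: Seriously in awe.

Me:
You flatter me, Prof. Zhang. Idle talk ruins a country; as long as I do not lead other people's students astray I am fine. I have never been a professor in my life, so any students I mislead are someone else's, ha.

Zhang: A Bethune.

Me:
Seriously, there really is a risk of misleading students, because everything has a larger environment and background, and what I say is somewhat heterodox; as a result, mainstream students see flowers through fog. Seeing flowers through fog at least widens one's view; the most misleading thing is seeing the flower and being unable to reach it. It is like what old Lu Xun said: people were sleeping soundly in the dark room, you insisted on going in to 【呐喊】 (Call to Arms), and once they are awake the room is still dark, which is more than merely cruel. The non-cruel way is to wait until I retire and open a Deep Parsing open-source park, with every line of code, every lexical entry and every rule made public, and then see whether, with everyone's effort, we can build an unbeatable system. Let everyone play with symbolic logic together, so that the two approaches live on.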

 

 

【相关】

中文处理

Parsing

【置顶:立委NLP博文一览】

《朝华午拾》总目录

 

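成语的弹性识别和理解机制 (An elastic mechanism for recognizing and understanding idioms)

Bai:
“去年秋膘应犹在,只是猪颜改” (Last year's autumn fat should still be there; only the pig's face has changed)

Me:
1234应犹在 只是56改
The elastic idiom mechanism catches it every time. Which parts of an idiom are variables and which are constants can be studied; people have a rough ledger of this in mind. Take “九牛二虎之力” (the strength of nine oxen and two tigers, i.e. a tremendous effort) as an example. The first ring of elasticity turns the numerals into variables: m牛n虎之力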

二牛九虎之力
九虎二牛之力
八虎七牛之力
四牛五虎之力

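None of these affects parsing or understanding; in every case the meaning is that a huge effort was made.

The second ring of elasticity turns the nouns into variables along the taxonomy: m 【big animal】 n 【big animal】 之力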

九熊二豹之力
三象五狮之力
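The two rings of elasticity can be approximated with a regular expression in which the numerals and the animal nouns are character classes while 之力 stays constant. This is a minimal sketch: the numeral list and the 【big animal】 list are small hand-picked assumptions standing in for a real taxonomy lookup.

```python
import re

NUMERALS = "一二两三四五六七八九十百千万"
BIG_ANIMALS = "牛虎熊豹象狮"   # stand-in for the taxonomy node 【big animal】

# m 【big animal】 n 【big animal】 之力
PATTERN = re.compile(f"[{NUMERALS}][{BIG_ANIMALS}][{NUMERALS}][{BIG_ANIMALS}]之力")

def is_elastic_variant(text):
    """True if `text` is the idiom 九牛二虎之力 or one of its elastic variants."""
    return PATTERN.fullmatch(text) is not None
```

Both the first-ring variants (二牛九虎之力) and the second-ring variants (九熊二豹之力) match, while a string with nouns outside the class, such as 九鸡二鸭之力, does not.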

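Forwarded:
Today is the Beginning of Autumn. Ask the heavens: which season is the busiest? Autumn: 多事之秋 (an eventful autumn, i.e. troubled times). Which season is the fairest? Autumn: 平分秋色 (sharing the autumn scenery equally). Which season is the simplest? Autumn: 一叶知秋 (one leaf tells of autumn). Which season is the longest? Autumn: 一日不见如隔三秋 (a day apart feels like three autumns). Which season is the most refreshing? Autumn: 秋高气爽 (high autumn sky, crisp air). Which season is the most dangerous? Autumn: 秋后算账 (settling accounts after autumn). Which season is the most flirtatious? Autumn: 暗送秋波 (secretly sending autumn ripples, i.e. making eyes)! Happy autumn day!!

The elastic idiom mechanism generalizes from “秋” (autumn) up to 【season】 and then further up to 【time period】: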

多事之春 多事之年 多事之岁月
平分春色
一花知春
一日不见如隔三冬
一日不见如隔九冬

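Bai:
秋天来了,冬天还会远吗 (If autumn comes, can winter be far behind?)

Me:
冬天来了 秋天还会远吗 (If winter comes, can autumn be far behind?)
This is a time tunnel,
running in reverse or on fast-forward.

About the title:

0905b

Syntactically the bracketing should be 【成语】的【【弹性【识别和理解】】机制】: for idioms, what is needed is an elastic recognition mechanism, or a mechanism of elastic recognition. But while writing it, what I more likely had in mind was: for 【成语的弹性】 (the elasticity of idioms), a recognition mechanism is needed.

On second thought, who cares; isn't human expression and understanding often vague in just this way? Except in jokes or when splitting hairs, people are usually simply insensitive to this kind of structural ambiguity. Semantic vagueness does not affect the gist of understanding either.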

【相关】

立委NLP博文一览

《朝华午拾》总目录

立委NLP频道

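Master's thesis: An experiment in automatic translation from Esperanto into Chinese and English (1)

An experiment in automatic translation from Esperanto into Chinese and English
-- Overview of the EChA machine translation system

This paper summarizes the thesis project I carried out under the supervision of Liu Yongquan and Liu Zhuo. It has ten parts:
1. Overview of EChA: system flowchart; 2. Esperanto: linguistic characteristics and research value; 3. The hierarchical recursive constituent system CDC: the EChA intermediate language embodying the result of independent analysis; 4. The EChA machine dictionary and the format of the sentence workspace; 5. Esperanto morphological analysis: the ending-stripping algorithm and a discussion of affix stripping; 6. Syntactic analysis, line 1: function word processing and a discussion of keeping rules apart from one another; 7. Syntactic analysis, line 2: solving for the CDC and analysis of intermediate results; 8. English morphological generation, Chinese morphological polishing, a general summary of contrastive differences between source and target languages, and worked examples of word-sense disambiguation; 9. Reordering: bottom-up processing; 10. Analysis of the EChA experimental results, a comparison of the Chinese and English machine output, the question of whether literary works can be combined with machine translation, and a discussion of rhetoric.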

Contents

  1. Overview of EChA
  2. Esperanto: linguistic characteristics and research value
  3. The hierarchical recursive constituent system
  4. The EChA machine dictionary
  5. Esperanto morphological analysis
  6. Esperanto syntactic analysis (1)
  7. Esperanto syntactic analysis (2)
  8. English morphological generation
  9. Target-language reordering
  10. Analysis of the EChA experimental results

[Acknowledgements]

[Bibliography]

[Appendix 1] EChA experimental results

[Appendix 2] Esperanto abstract

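Overview of EChA

EChA (E-Ch/A: el Esperanto en la Chinan kaj Anglan Lingvojn) is a small one-to-many experimental system with Esperanto as the source language and Chinese and English as the target languages. It is a sentence-by-sentence, full-text machine translation system in which analysis and synthesis are to some extent independent. The system fully automates the translation process and requires no pre-editing or post-editing. (For purely technical reasons, the few Esperanto letters with diacritics must for now be transliterated as digraphs with an added H.) It took only five months from first debugging on the machine to printing translations, and close to a year for the whole project; progress was fairly smooth. The system runs on an IBM-PC/XT microcomputer and is programmed in BASIC (Version D2.00), using IBM's BASIC compiler package. EChA is supported by the CCDOS operating system (that is, PC DOS 2.10 with a Chinese character library). The core of the system is a six-line analysis and synthesis program. In addition, three dictionaries and two word lists were built, together with programs for creating, looking up, extending and maintaining the dictionaries. The whole system consists of nearly ten thousand BASIC statements. Programming made full use of the BASIC string-handling functions, which proved especially convenient.

The experiment translated more than 150 Esperanto sentences. Both the Chinese and the English machine translations are fluent or understandable, and the results are satisfactory (see the appendix). The source material has three parts. The first is two consecutive passages (12 sentences, pp. 100-101) from "Mashinmondo" (<<机器世界>>, 中国展望出版社), an original Esperanto work by the well-known Esperanto writer Sandor Szhatmari; the sentences are fairly long and structurally complex. The second is a set of typical example sentences (more than 100) from <<世界语语法>> (Esperanto Grammar) by Wei Yuanshu (魏原枢) and Xu Wenqi (徐文琪) (上海外语教育出版社, 1982.10). These examples (some of them everyday expressions) all have particular linguistic features and exhibit different tenses (simple, compound), voices (active, passive), moods (indicative, imperative, conditional), sentence types (simple, coordinate, complex, subjectless, one-word, yes-no questions, wh-questions, and so on), different sentence patterns and the various forms of the verb. In short, they are quite representative and largely reflect the overall shape of Esperanto grammar, which makes up for the uniformity of consecutive text and is a better test of EChA's capability and adaptability. Finally, as an experiment, two Esperanto poems were also translated (the first is the well-known anthem of Esperantists, 希望之歌, "Song of Hope").

EChA consists of three major parts: 1) the machine dictionary; 2) source-language analysis; 3) target-language generation. The source-language analysis covers all the basic grammar and common sentence patterns of Esperanto. However, because of limits on hardware and on the length of the experiment, the system (especially its dictionary) is still very small and needs further extension and improvement. We plan to extend EChA in two ways: first, by adding example sentences for a larger experiment; second, by adding Russian and French as new target languages, to further test the range of applicability of the intermediate language CDC (the hierarchical recursive constituent system, described in Section 3), which embodies the result of independent analysis, and to explore ways of perfecting it. In addition, time pressure has left the system with some problems: the structure of EChA is not yet very well organized, the algorithms need further optimization, rules and algorithms have not yet been separated, and although much effort went into making analysis and synthesis independent, they are not yet fully independent.

Despite these problems, by design the system is able to handle the great majority of Esperanto linguistic phenomena once the dictionary is suitably extended. In the nearly thirty-year history of machine translation research in China, EChA is the first machine translation system to take Esperanto as its object of study. In bringing Esperanto and machine translation together, EChA is a successful attempt and a good start. We sincerely hope for help and guidance from experts, scholars and fellow Esperantists.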

EChA system flowchart

Source text input
    |
Dictionary (morphological analysis)
    1. Ending stripping; dictionary lookup (content-word dictionary, function-word dictionary, idiom dictionary, table for distinguishing word class and word sense)
    |
Syntactic analysis
    2. Conjunctions and punctuation, segmentation, other function words
    3. Solving for the intermediate language CDC
    |
Target-language generation
    4. Word-sense disambiguation; English morphological generation and Chinese morphological polishing; lookup in the table of English irregular words
    5. English reordering
    6. Chinese reordering and other polishing
    |
Translation output

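After the source sentence is input, a first scan is made. First the length of the word being processed is checked against three. If it is longer than three, a subroutine strips the ending and the content-word stem dictionary is consulted; otherwise the function-word dictionary is consulted. Because Esperanto function words (which have no inflectional endings) are mostly short, three is the most reasonable cutoff and greatly reduces wasted lookups. Words not found in the dictionary are treated as unknown words, with the ending information retained. After the dictionaries and word lists have been consulted, the ending information and the dictionary information are moved to the sentence workspace set up in the computer's memory.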
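The first-scan lookup described above can be sketched in a few lines. This is a Python illustration, not the original BASIC code: the ending list covers only the standard Esperanto grammatical endings, and the toy dictionaries and function names are assumptions.

```python
# Esperanto grammatical endings, longest first so that "ojn" wins over "n".
ENDINGS = ("ojn", "ajn", "oj", "aj", "on", "an",
           "as", "is", "os", "us", "o", "a", "e", "i", "u", "n", "j")

def strip_ending(word):
    """Split `word` into (stem, ending); ending is "" if none matches."""
    for ending in ENDINGS:
        if word.endswith(ending) and len(word) > len(ending):
            return word[:-len(ending)], ending
    return word, ""

def lookup(word, stems, function_words):
    """Words longer than three letters are stripped and looked up as content
    words; shorter ones go to the function-word dictionary. Unknown words
    keep their ending information."""
    w = word.lower()
    if len(w) <= 3:
        return ("function" if w in function_words else "unknown", w, "")
    stem, ending = strip_ending(w)
    return ("content" if stem in stems else "unknown", stem, ending)

# Hypothetical toy dictionaries.
STEMS = {"libr", "leg"}
FUNCTION_WORDS = {"kaj", "ke", "la"}
```

For instance, `lookup("librojn", STEMS, FUNCTION_WORDS)` finds the stem libr with the ending ojn, while an out-of-dictionary word such as hundo is returned as unknown but still carries its ending o for later analysis.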

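Syntactic analysis determines the hierarchical structure and syntactic relations of the source sentence. The result is expressed in a highly formalized hierarchical recursive constituent system, CDC. CDC is a machine translation intermediate language independent of the target languages; this independence is necessary for a one-to-many system. CDC consists of several kinds of information: morphology, constituent, node, distribution, chain number and level. It not only reveals the correct syntactic tree of the source sentence but also contains other useful information. In fact, it lays a good foundation for building generation systems for multiple target languages.

Line 1 of syntactic analysis handles function words; its central task is to process conjunctions and punctuation and to segment the sentence into spans correctly. In principle, a set of analysis rules is written for each function word. Esperanto function words are few in number but have many uses; they show a complexity similar to that of function words in ethnic languages and are where the individuality of the language is concentrated, so it is appropriate to process them separately, which also helps keep the rules apart from one another. This line carries a heavy load; in particular, the analysis rules for the conjunctions KAJ and KE are very complex. To a large extent, once the function words are analyzed correctly, the syntactic relations are clear. Concentrating effort on a complete analysis system for individual function words is therefore crucial for machine translation of Esperanto-type languages. Once this line has correctly handled the idiosyncrasies of function words, the next line of analysis can be fully abstract and general, which is especially advantageous for a scientific and regular language like Esperanto. Line 2 of syntactic analysis works top-down: starting from the predicate pivot of the sentence (level 1), it processes recursively, level by level, down to the last level (the terminal nodes). Processing is a matter of repeated recursive calls to the subroutines. The verb subroutine is the core; it fully reflects the basic content of Esperanto grammar and its high regularity. When analysis is complete, a chain in the intermediate language CDC corresponding to the source sentence is obtained.

Line 1 of synthesis does English morphological generation and Chinese morphological polishing. English morphology is not rich, so the Esperanto-to-English morphological conversion rules are not complex. Chinese lacks morphology and generally uses suitable function words (particles, adverbs, etc.) instead. We also put the word-sense disambiguation rules in this line, because the conditions for disambiguation are in place by this point. In general, the correct sense of a polysemous word can be derived from the CDC constituents and semantic features of the word and of the words linked to it. Lines 2 and 3 of synthesis do English reordering and Chinese reordering respectively. Reordering information is derived from the CDC together with the grammar of the target language; the method is bottom-up, reducing level by level, so that the order does not get scrambled. Esperanto word order is extremely flexible and free, whereas Chinese word order is quite fixed, so the main task in generating Chinese is reordering. For English, reordering is lighter work, mainly keeping the subject-verb-object order of the sentence backbone intact. English nouns do not distinguish nominative from accusative, so the key is to move a preposed object to after the verb. "Esperanto is a rationalized common denominator of the Indo-European family" and has many similarities with English. For example, reordering among attributives or adverbials at the same syntactic level is a hard problem when translating into Chinese but not a major one among the Indo-European languages. The polishing step can also be omitted. (Idioms and polysemy are also far fewer in Esperanto-to-English conversion than in Esperanto-to-Chinese.) In short, generating English is much easier than generating Chinese.

Although EChA is not a large system, it is rich in content. It has morphological analysis, morphological generation, reordering and polishing, and its own constituent system. In the overall design we already took into account the need to extend the system with new target languages of different types. It can be expected that by adding two generation lines for Russian and French (mainly morphological generation) and slightly modifying the analysis part (mainly filling out the function-word analysis rules that are not yet fully independent of synthesis), automatic translation from Esperanto into Chinese/English/French/Russian could be achieved. In short, EChA touches on almost every problem a practical machine translation system can encounter, and each of the six main program lines has its own character; it is a fairly representative one-to-many, fully automatic machine translation model.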

 

【相关】

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

【关于机器翻译】

【置顶:立委NLP博文一览】

《朝华午拾》总目录

PhD Thesis: Chapter VII Concluding Remarks

This chapter summarizes the research conducted in this dissertation, including its contributions as well as its limitations.

7.0. Summary

The goal of this dissertation is to explore effective ways of formally approaching Chinese morpho-syntactic interface in a phrase structure grammar.  This research has led to the following results:  (i) the design of a Chinese grammar, namely CPSG95, which enables flexible coordination and interaction of morphology and syntax;  (ii) the solutions proposed in CPSG95 to a series of long-standing problems at the Chinese morpho-syntactic interface.

CPSG95 was designed in the general framework of HPSG (Pollard and Sag 1987, 1994).  The sign-based mono-stratal design from HPSG demonstrates the advantage in being capable of accommodating and accessing information of different components of a grammar.  One crucial feature of CPSG95 is its introduction of morphology expectation feature structures and the corresponding morphological PS rules into HPSG.  As a result, CPSG95 has been demonstrated to provide a favorable environment for solving morpho-syntactic interface problems.

Three types of morpho-syntactic interface problems have been studied extensively: (i) the segmentation ambiguity in Chinese word identification;  (ii) Chinese separable verbs, a borderline problem between compounding and syntax; and (iii) borderline phenomena between derivation morphology and syntax.

In the context of the CPSG95 design, the segmentation ambiguity is no longer a problem, as morphology and syntax are designed system-internally in the grammar to support morpho-syntactic parsing based on non-deterministic tokenization (W. Li 1997, 2000).  In other words, the design of CPSG95 itself entails an adequate solution to this long-standing problem, a problem which has been a central topic in Chinese NLP for the last two decades.  This is made possible because access to a full grammar, including both morphology and syntax, is available in the integrated process of Chinese parsing and word identification, while traditional word segmenters can at best access partial grammar knowledge.[1]
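
As a rough illustration of what non-deterministic tokenization means, the sketch below enumerates every way of covering a character string with lexicon entries instead of committing to a single segmentation. It is only a sketch under stated assumptions: the toy `LEXICON`, the function name `tokenizations` and the sample string are illustrative and not taken from CPSG95, and the grammar that would later filter the candidates is not modeled.

```python
from functools import lru_cache
from typing import FrozenSet, List, Tuple

# Toy lexicon (assumption): 研究 "research", 研究生 "graduate student",
# 生命 "life", 命 "fate".
LEXICON: FrozenSet[str] = frozenset({"研究", "研究生", "生命", "命"})


def tokenizations(text: str, lexicon: FrozenSet[str] = LEXICON) -> List[List[str]]:
    """Return every segmentation of `text` into words found in `lexicon`."""

    @lru_cache(maxsize=None)
    def from_pos(start: int) -> Tuple[Tuple[str, ...], ...]:
        if start == len(text):
            return ((),)                      # one way to segment the empty rest
        results = []
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in lexicon:
                for rest in from_pos(end):
                    results.append((word,) + rest)
        return tuple(results)

    return [list(t) for t in from_pos(0)]


for candidate in tokenizations("研究生命"):
    print(" / ".join(candidate))
# 研究 / 生命
# 研究生 / 命
```

In an integrated design, all such candidates stay available to the parser, and only those that take part in a valid parse of the whole sentence survive.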

The second problem involves an interesting case between compounding and syntax:  different types of Chinese separable verbs demonstrate various degrees of separability in syntax while all these verbs, when used contiguously, are part of the Chinese verb vocabulary.  For each type of separable verb, arguments were presented for the proposed linguistic analysis and a solution to the problem was then formulated in CPSG95 based on the analysis.  All the proposed solutions provide a way of capturing the link between the separated use and the contiguous use of separable verbs.  They are shown to be better solutions than previous approaches in the literature, which either cannot link the separated use and the contiguous use in the analysis or are not formalized.

The third problem at the interface of derivation and syntax involves two issues: (i) a considerable amount of ‘quasi-affix’ data, and (ii) the intriguing case of zhe-suffixation which demonstrates an unusual combination of a phrase with a bound morpheme.  A generic analysis of Chinese derivation has been proposed in CPSG95.  This analysis has been demonstrated to be also effective in handling both quasi-affixation and zhe-affixation.

7.1. Contributions

The specific contributions are reflected in the study of the following five topics, each constituting a chapter.

On the topic of the Role of Grammar, the investigation leads to the central argument that knowledge from both morphology and syntax is required to properly handle the major types of morpho-syntactic interface problems.  This establishes the foundation for the general design of CPSG95 as consisting of morphology and syntax in one grammar formalism.

An in-depth study has been conducted in the area of the segmentation ambiguity in Chinese word identification.  The most important discovery from the study is that the disambiguation involves the analysis of the entire input string.  This means that the availability of a grammar is key to the solution of this problem.  A natural solution to this problem is the use of grammatical analysis to resolve, and/or prepare the basis for resolving, the segmentation ambiguity.

On the topic of the Design of CPSG95, a mono-stratal Chinese phrase structure grammar has been established in the spirit of the HPSG theory.  Components of a grammar such as morphology, syntax and semantics are all accommodated in distinct features of a sign.  CPSG95 is designed to provide a framework and means for formalizing the analysis of the linguistic problems at the morpho-syntactic interface.

The essential part of this work is the design of expectation feature structures.  Expectation feature structures are generalized from the HPSG feature structures for syntactic subcategorization and modification.  One characteristic of the CPSG95 structural expectation is the design of morphological expectation features to incorporate Chinese productive derivation, which covers a wide range of linguistic phenomena in Chinese word formation.

In order to meet the requirements induced by introducing morphology into the general grammar and by accommodating linguistic characteristics of Chinese, modifications from the standard HPSG are proposed in CPSG95.  The rationale and arguments for these modifications have been presented.  The design of CPSG95 is demonstrated to be a successful application of HPSG in the study of Chinese morpho-syntactic phenomena.

On the topic of Defining the Chinese Word, efforts have been made to reach a better understanding of Chinese wordhood in theory, methodology and formalization.

The theoretical inquiry follows the insight from Di Sciullo and Williams (1987) and Lü (1989).  Two notions of word, namely grammar word and vocabulary word, have been examined and distinguished.  While vocabulary word is easy to define once a lexicon is given, the object for linguistic study and generalization is actually grammar word.  Unfortunately, as there is a considerable amount of borderline phenomena between Chinese morphology and syntax, no precise definition of Chinese grammar word has been available across systems.  Therefore, an argument in favor of the system-internal wordhood definition and interface coordination within a grammar has been made.  This leads to a case-by-case approach to the analysis of specific Chinese morpho-syntactic interface problems.

On the other hand, three useful wordhood judgment methods have also been proposed as a complementary means to the case-by-case analysis.  These methods are (i) syntactic process test involving passivization and topicalization; (ii) keyword based judgment patterns for verbs, and (iii) a general expansion test named X-insertion.  These methods are demonstrated to be fairly operational and easy to apply.

In terms of formalization, a system-internal representation of word has been defined in CPSG95 feature structures.  This definition distinguishes a grammar word from both bound morphemes and syntactic constructions.  The formalization effort is necessary for the rigorous study of Chinese morpho-syntactic problems and ensures the implementability of the solutions to these problems as proposed in the dissertation.

On the topic of Chinese Separable Verbs, the task is to coordinate the idiomatic nature of separable verbs and their separated uses in various syntactic patterns.

Since there are different degrees of ‘separability’ for different types of Chinese separable verbs, there is no uniform analysis which can handle all separable verbs properly.  A case-by-case study for each type of separable verb has been conducted.  An essential part of this study is the arguments for the wordhood judgment for each type.  In the light of this judgment, CPSG95 provides formalized analyses of separable verbs which satisfy two criteria:  (i)  they all capture both structural and semantic aspects of the constructions at issue; (ii) they all provide a way of capturing the link between the separated use and the contiguous use.

Finally, on the topic of Morpho-syntactic Interface Involving Derivation, a general approach to Chinese derivation has been proposed.  This approach not only enables us to handle quasi-affix phenomena, but is also flexible enough to provide an adequate treatment of the special problem in zhe-suffixation.

In the CPSG95 analysis, the affix serves as head of a derivative and can impose various constraints in the lexicon on its expected stem sign for the morphological expectation.  Coupled with only two PS rules formulated in the general grammar (Prefix PS Rule and Suffix PS Rule), it has been shown that various Chinese affixation phenomena can be captured equally well.  The PS rules ensure that all the lexical constraints be observed before the affix and the stem combine and that the output of derivation be a word.

As for the quasi-affixation problem, based on the observation that there is no fundamental structural difference between quasi-affixation and other affixation, a proper treatment of 'quasi-affixes' can be established in the same way as other affixes are handled in CPSG95; the individual difference in semantics is shown to be capturable in the lexicon.

The study of zhe-suffixation started with arguments for its analysis as VP+-zhe.  This is an unsolvable problem in any system which enforces sequential processing of morphology before syntax.  The solution which CPSG95 offers demonstrates the power of designing derivation morphology and syntax in a mono-stratal grammar.   With this novel design in modeling Chinese grammar, the CPSG95 general approach to derivation readily applies to the tough case of zhe-suffixation.  This is possible because an affix is able to place any lexicalized constraint, VP in this case, on the expected stem for morphological expectation.  In addition, the proposed lexicalized solution also captures the building of the semantic content for this morpho-syntactic borderline phenomenon.

7.2. Limitation

The major limitations of the work reported in this thesis lie in the following two aspects.

Limited by space, the thesis has only presented sample formulations of typical affixes and quasi-affixes to demonstrate the proposed general approach to Chinese derivation morphology.  As many affixes/quasi-affixes have their own distinctive semantic properties, a reader who wishes to experiment with this proposal in implementation still has to work out the technical details for each affix.  However, it is believed that the general strategy has been presented in sufficient detail to allow for easy accommodation of individual aspects of an affix which have not been specifically addressed in the thesis.

Limited by the focus on a handful of major morpho-syntactic interface problems, the treatment of reduplication and unlisted proper names has not been taken up as a special topic for in-depth exploration.  They are only briefly discussed in Chapter II (Section 2.2) as cases of productive word formation which need to involve syntax when segmentation ambiguity arises at their boundaries.  However, they are also long-standing word identification problems which affect the morpho-syntactic interface when segmentation ambiguity is involved.  In particular, it is felt that the treatment of transliterated foreign names requires further research before a satisfactory solution can be found in the framework of CPSG95.[2]

7.3. Final Notes

This last section is used to place the research reported in this thesis in a larger context.

Chinese NLP has reached a new stage marked by the publication of Guo’s series of papers on Chinese tokenization (Guo 1997a,b,c,d, Guo 1998).  There are signs that the major research focus is shifting from word segmentation to grammar design and development.  In this process, the morpho-syntactic interface will remain a hot topic for quite some time to come.  The work on CPSG95 can be seen as one of the efforts in this direction.

The design of CPSG95, a formal grammar capable of representing both morphology and syntax in a uniform formalism, is one successful application of the modern linguistic theory HPSG in the area of Chinese  morpho-syntactic interface research.  However, this is by no means to claim that CPSG95 is the only or best framework to capture the morpho-syntactic problems.   This is only one approach which has been shown to be feasible and effective.  Other equally good or better approaches may exist.

In terms of future directions, constraints from semantics and discourse should be made available in the grammatical analysis.  In Chapter II (Section 2.4), we have seen problems whose ultimate solutions depend on the access to the semantic or discourse constraints.  It is believed that the sign-based mono-stratal design of CPSG95 will be extensible to accommodate these constraints.  However, this will require years of future research before they can be formally modeled and properly introduced into the grammar.

 

--------------------------

[1] As a matter of fact, the CPSG95 experiment shows that most segmentation ambiguity is resolved automatically as a by-product of morpho-syntactic parsing and the remaining ambiguity is embodied in the multiple syntactic trees as the results of the analysis.

[2] However, in the CPSG95 implementation, the problem of handling Chinese person names, a special case of compounding, has been solved fairly satisfactorily.  The proposal is to use the surname as the head sign to expect the given name (of one or two characters) on its right to form potential full names.  As the right boundary of a person name is difficult to define without the support of sentential analysis, the conventional word segmenter frequently makes wrong segmentations in such cases.  In contrast, the approach implemented in CPSG95 is free from this problem because whether a potential name proposed by the surname ultimately survives as a proper name is decided by whether it contributes to a valid parse for the processed sentence.  In the last few years, there has been rapid progress on proper name identification in the area of information extraction, called named entity tagging (MUC7 1998; Chen et al 1997).
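
The surname-driven proposal in note [2] can be sketched as follows. This is a hedged illustration only: the `SURNAMES` set, the function name `candidate_names` and the sample sentence are assumptions, and the decisive step, keeping only the candidate that contributes to a valid parse, is not modeled.

```python
from typing import FrozenSet, List, Tuple

SURNAMES: FrozenSet[str] = frozenset({"李", "王", "张"})  # toy list (assumption)


def candidate_names(sentence: str,
                    surnames: FrozenSet[str] = SURNAMES) -> List[Tuple[int, int, str]]:
    """Propose (start, end, text) spans: a surname plus one or two characters."""
    candidates = []
    for i, ch in enumerate(sentence):
        if ch in surnames:
            for given_len in (1, 2):          # given name of one or two characters
                end = i + 1 + given_len
                if end <= len(sentence):
                    candidates.append((i, end, sentence[i:end]))
    return candidates


# 李明说 can be read as 李明 + 说 ("Li Ming says") or as a three-character name.
print(candidate_names("李明说"))
# [(0, 2, '李明'), (0, 3, '李明说')]
```

Both spans are only potential names; the right boundary is left open until sentential analysis decides which one, if any, fits a valid parse.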

 

BIBLIOGRAPHY

Bauer, Laurie (1988).  Introducing Linguistic Morphology.  Edinburgh:  Edinburgh University Press.

Bloomfield, Leonard (1933). Language, New York: Henry Holt & Co.

Borsley, Robert (1987).  Subjects and Complements in HPSG.   Technical report no. CSLI-107-87.  Stanford:  Center for the Study of Language and Information.

Carpenter, B. and G. Penn (1994).  ALE, The Attribute Logic Engine, User's Guide.  From http://www.sfs.nphil.uni-tuebingen.de/~gpenn/ale.html (accessed January 30, 2001).

Chao, Yuen-Ren (1968).  A Grammar of Spoken Chinese.  Berkeley:  University of California Press.

Chen, H.-H et al (1997).  Description of the NTU System used for MET-2.  Proceedings of MUC-7.  From http://perso.enst.fr/~monnier/lectures/IE/MUC7/muc_7_toc.html (accessed January 30, 2001).

Chen, K. and S. Liu (1992).  Word Identification for Mandarin Chinese Sentences.  Proceedings of 14th International Conference on Computational Linguistics (COLING’92). Nantes, France, 101-107.

Chen, M.Y. and W. S-Y. Wang (1975).  Sound Change:  Actuation and Implementation.  Language 51:2, 255-281.

Chen, Ping (1994).  “Shilun Hanyu zhong San Zhong Juzi Chengfen yu Yuyi Cheng Fen de Peiwei Yuanze” (On Mapping Principles of Relationship between Chinese Three Syntactic Constituents and Semantic Roles). Zhongguo Yuwen (Chinese Linguistics), No.3.

Chomsky, Noam (1970).  Remarks on Nominalization.  Readings in English Transformational Grammar, eds. by R. Jacobs and P. Rosenbaum, Waltham, Massachusetts:  Ginn and Company, 184-221.

Dai, John Xiang-ling (1993).  Chinese Morphology and its Interface with Syntax.  Ph.D. Dissertation, Ohio State University.

DeFrancis, John (1984).  The Chinese Language: Fact and Fantasy.  Honolulu:  University of Hawaii Press.

Di Sciullo, A.M. and E. Williams (1987).  On The Definition of Word.  The MIT Press, Cambridge, Massachusetts.

Ding, Shengshu (1953). “Hanyu Yufa Jianghua” (Lectures of Chinese Grammar), Zhongguo Yuwen (Chinese Linguistics), No. 3 and No. 4.

Dowty, D. (1982).  More on the Categorial Analysis of Grammatical Relations.  In A. Zaenen (Ed.), Subjects and Other Subjects:  Proceedings of the Harvard Conference on Grammatical Relations.  Bloomington:  Indiana University Linguistics Club.

Feng, Zhiwei (1996).  COLIPS Lecture Series - Chinese Natural Language Processing,  Communications of COLIPS, Vol.6, No.1, Singapore.

Gan, Kok Wee (1995).  Integrating Word Boundary Disambiguation with Sentence Understanding, Ph.D. Dissertation, National University of Singapore.

Gazdar, G., E. Klein, G.K. Pullum, and I.A. Sag (1985).  Generalized Phrase Structure Grammar.  Cambridge: Blackwell, and Cambridge, Mass.:  Harvard University Press.

Guo, Jin (1997a).  Critical tokenization and its properties.  Computational Linguistics, Vol. 23, No.4, 569-596.

Guo, Jin (1997b).  Chinese Language Modeling for Speech Recognition.  Ph.D. dissertation, Institute of Systems Science, National University of Singapore.

Guo, Jin (1997c).  A Comparative Study on Sentence Tokenization Generation Schemes.  In review for journal publication from http://sunzi.iss.nus.sg:1996/guojin/papers/ (accessed March 25, 1999).

Guo, Jin (1998).  One tokenization per source.  Proceedings of the 17th International Conference on Computational Linguistics and 36th Annual Meeting of the Association for Computational Linguistics (COLING-ACL ’98),  Montreal, Canada, 457-463.

He, K., H. Xu and B. Sun (1991).  Design Principles of an Expert System for Automatic Word Segmentation of Written Chinese Texts, Journal of Chinese Information Processing, Vol. 5, No. 2, 1-14.

Hockett, C.F. (1958).  A Course in Modern Linguistics.  New York:  Macmillan.

Hu, F. and L. Wen (1954).  “Ci de fanwei, xingtai, gongneng” (Scope, form and function of word). Zhongguo Yuwen (Chinese Linguistics), August issue.

Jackendoff, Ray (1972). Semantic Interpretation In Generative Grammar, Cambridge, Massachusetts:  MIT Press.

Jensen, John T. (1990).  Morphology:  Word Structure in Generative Grammar.  Amsterdam/Philadelphia:  John Benjamins Publishing Company.

Kathol, Andreas (1999).  Agreement and the Syntax-Morphology Interface in HPSG. In Robert Levine and Georgia Green (eds.) Studies in Current Phrase Structure Grammar. Cambridge University Press, 223-274.

Kolman, B. and R.C. Busby (1987). Discrete Mathematical Structures for Computer Science, 2nd edition. Prentice-Hall, Inc.

Krieger, Hans-Ulrich (1994). Derivation without Lexical Rules,  in C.J Rupp, M. Rosner and R. Johnson (eds), Constraints, Language, and Computation.  Academic Press, 277-313.

Li, C.N. and  S.A. Thompson (1981).  Mandarin Chinese:  A Functional Grammar.  Berkeley:  University of California Press.

Li, Linding (1986).  Xiandai Hanyu Juxing (Sentence Patterns in Contemporary Mandarin), Shangwu Yinshuguan (Commercial Press), Beijing.

Li, Linding (1990).  Xiandai Hanyu Dongci (Verbs in Contemporary Mandarin), Zhongguo Shehui Kexue Chubanshe, Beijing.

Li, Qinghua (1983).  “Tan liheci de tedian he yongfa” (On the characteristics and usages of separable words).  Yuyan Jiaoxue He Yan Jiu (Language Instruction and Research), No.3.

Li, Wei (1996).  Interaction of Syntax and Semantics in Parsing Chinese Transitive Patterns.  Proceedings of International Conference on Chinese Computing (ICCC'96), Singapore.

Li, Wei (1997).  Chart Parsing Chinese Character Strings.  Proceedings of the Ninth North American Conference on Chinese Linguistics (NACCL-9), Victoria, Canada.

Li, Wei (2000). On Chinese parsing without using a separate word segmenter.  Communication of COLIPS 10 (1): 19-68.

Liang, Nanyuan (1987).  CDWS -- A Written Chinese Automatic Word Segmentation System.  Journal of Chinese Information Processing, 1(2): 44-52.

Lieber, R. (1992).  Deconstructing Morphology. Chicago: University of Chicago Press.

Lin, Handa (1983).  “Shime shi ci – xiaoyu ci de bu shi ci” (What is a word – a unit smaller than a word is not a word). Zhongguo Yuwen (Chinese Linguistics), No.34.

Lu, Jianming (1988).  “Mingci-xing ‘laixin’ shi ci haishi cizu” (Nominal laixin: word or word group).  Zhongguo Yuwen (Chinese Linguistics), No. 5.

Lu, Zhiwei (1957).  Hanyu de Goucifa (Chinese Word Formation), Kexue Chubanshe (Science Publishing House).

Lü, Shuxiang. (1946). “Cong Zhuyu, Binyu de Fenbie Tan Guoyu Juzi de Fenxi” (On Sentence Analysis of Mandarin Chinese from the Angle of the Distinction between Subject and Object),  Kaiming Shudian Er Shi Zhounian Jiannian Wenji (Selected Works to Celebrate the 20th Anniversary of Kaiming Bookstore).

Lü, Shuxiang et al (ed.) (1980).  Xiandai Hanyu Babai Ci (800 Words in Contemporary Mandarin), Shangwu Yinshuguan (Commercial Press), Beijing.

Lü, Shuxiang (1989). “Hanyu Yufa Fenxi Wenti” (Issues on Chinese grammatical analysis),  Lü Shuxiang Zixuanji (Self-selected Works of Shuxiang Lü), Shang Hai Jiaoyu Chubanshe (Shanghai Education Publishing House), Shanghai, 93-180.

Lua, Kim Teng (1994).  Application of Information Theory Binding in Word Segmentation. Computer Processing of Chinese and Oriental Languages 8(1): 115-124.

Lyons, John (1968).  Introduction to Theoretical Linguistics.  Cambridge:  Cambridge University Press.

MUC-7 (1998).  Proceedings of the Seventh Message Understanding Conference (MUC-7).  From http://perso.enst.fr/~monnier/lectures/IE/MUC7/muc_7_toc.html (accessed January 30, 2001).

Pollard, C. and I. Sag (1987).  Information based Syntax and Semantics Vol. 1: Fundamentals.  Centre for the Study of Language  and Information, Stanford University, CA.

Pollard, C. and I. Sag (1994).  Head-Driven Phrase Structure Grammar.  The University of Chicago Press.

Riehemann, Susanne (1993). Word Formation in Lexical Type Hierarchies – A Case Study of bar-Adjectives in German. SfS-Report-02-93, University of Tübingen.

Riehemann, Susanne (1998). Type-based derivational morphology.  Journal of Comparative Germanic Linguistics 2. 49-77.

Sapir, Edward (1921).  Language:  Introduction to the Study of Speech.  New York:  Harcourt, Brace, and World.

Selkirk, E. (1982).  The Syntax of Words.  Cambridge:  MIT Press.

Shi, Youwei (1992).  Huhuan Rouxing – Hanyu Yufa Tanyi (A Call for Flexibility – Peculiarities of Chinese Grammar), Hunan Publishing House.

Shieber, S. (1986).  An Introduction to Unification-Based Approaches to Grammar.  Centre for the Study of Language  and Information, Stanford University, CA.

Sproat, R., C. Shih, V. Gale, and N. Chang (1996).  A Stochastic Finite-State Word-Segmentation Algorithm for Chinese.  Computational Linguistics. Vol. 22, No. 3.

Sun, L. and P. Cole (1991).  The effect of morphology on long-distance reflexives.  Journal of Chinese Linguistics 19:1, 42-62.

Sun, M. and B. T’sou (1995).  Ambiguity resolution in Chinese word segmentation.  Proceedings of the 10th Pacific Asia Conference on Language, Information and Computation (PACLIC-95), Hong Kong, 121-126.

Sun, M. and C. Huang (1996).  Word Segmentation and Part-of-Speech Tagging for Unrestricted Chinese Texts, A Tutorial at the 1996 International Conference on Chinese Computing (ICCC96), Singapore.

Thompson, S.A. (1973).  Resultative Verb Compounds in Mandarin Chinese:  A Case of Lexical Rules. Language 49:2, 361-379.

Wang, Li (1955).  Zhongguo Yufa Lilun (Chinese Grammatical Theory), Zhonghua Shuju, Shanghai.

Wang, Xiaolong (1989).  Automatic Chinese Word Segmentation, in Word Separating and Mutual Translation of Syllable and Character Strings, Ph.D. Dissertation, Dept. of Computer Science and Engineering, Harbin Institute of Technology.

Webster, J. J. and C-Y Kit. (1992).  Tokenization as the Initial Phase in NLP.  Proceedings of the 14th International Conference on Computational Linguistics (COLING-92).  Nantes, France, 1106-1110.

Wu, A. and Z. Jiang (1998).  Word Segmentation in Sentence Analysis.  Proceedings of the 1998 International Conference on Chinese Information Processing.  Beijing, China, 169-180.

Wu, Dekai (1998).  A Position Statement on Chinese Segmentation.  Presented at the Chinese Language Processing Workshop, University of Pennsylvania. (Current draft at http://www.cs.ust.hk/~dekai/papers/segmentation.html, accessed January 30, 2001).

Wu, M. and K. Su (1993).  Corpus-Based Automatic Compound Extraction with Mutual Information and Relative Frequency Count.  Proceedings of R.O.C. Computational Linguistics Conference (ROCLING) VI, Taiwan, 207-216.

Xue, Ping (1991).  Syntactic Dependencies in Chinese and their Theoretical Implications.  Ph.D. dissertation, University of Victoria, Canada.

Yao, T., G. Zhang, and Y. Wu (1990).  A Rule-Based Chinese Automatic Segmentation System.  Journal of Chinese Information Processing 4(1): 37-43.

Yeh, C-L. and H-J. Lee (1991).  Rule-Based Word Identification For Mandarin Chinese Sentences -- A Unification Approach.  Computer Processing of Chinese and Oriental Languages. Vol. 5, No. 2, 97-118.

Yu, Shihong et al (1997).  Description of the Kent Ridge Digital Labs System Used for MUC-7.  Proceedings of MUC-7.  From http://perso.enst.fr/~monnier/lectures/IE/MUC7/muc_7_toc.html (accessed January 30, 2001).

Zhang, J., Z. Chen and S. Chen (1991).  A Method of Word Identification for Chinese by Constraint Satisfaction and Statistical Optimization Techniques.  Proceedings of R.O.C. Computational Linguistics Conference (ROCLING) IV, Taiwan, 147-165.

Zhang, Shoukang (1957).  “Lüetan hanyu goucifa” (A brief discussion on Chinese word formation)  Xiandai Hanyu Cankao Ziliao (Reference for Contemporary Chinese),  ed. by Yushu Hu (1981),  Shanghai:  Shanghai Jiaoyu Chubanshe (Shanghai Education Publishing Company), 241-256.

Zhao, S. and B. Zhang (1996).  “Liheci de queding yu liheci de xingzhi” (Determination and characteristics of separable words).  Yuyan Jiaoxue he Yanjiu (Language Instruction and Research), No.1, 40-51.

Zhu, Dexi (1985).  Yufa Wenda (Questions and Answers on Chinese Grammar).  Shangwu Yinshuguan (Commercial Press), Beijing.

Zwicky, A.M. (1987). Slashes in the Passive.  Linguistics 25, 639-669.

Zwicky, A.M. (1989).  Idioms and Constructions.  Eastern States Conference on Linguistics 5, 547-558.

 

[Related]

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

PhD Thesis: Chapter VII Concluding Remarks

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

6.0. Introduction

This chapter studies some challenging problems of Chinese derivation and its interface with syntax.  These problems have been a challenge to existing word segmenters; they are also long-standing problems for Chinese grammar research.

It is observed that a good number of signs have become more and more like affixes as the Chinese language develops.  Typical, indisputable examples include signs like the nominalizer 性 ‑xing (-ness) and the prefix 第 di- (-th).  While few people doubt the existence of affixes in Contemporary Chinese, there is no general agreement on the exact number of Chinese affixes, due to a considerable number of borderline cases often referred to as ‘quasi-affixes’ (类语缀 lei yu-zhui).[1]  It will be argued that the quasi-affixes belong to morphology and are structurally not different from other affixes.  The major difference between ‘quasi-affixes’ and the few generally accepted (‘genuine’) affixes lies in semantics:  the former retain some ‘solid’ meaning while the latter are more functionalized.  However, this does not prevent CPSG95 from providing a proper treatment of quasi-affixes in the same way as it handles other affixes.  It will be shown that the difference in semantics between affixes and quasi-affixes can be accommodated fairly easily in the CPSG95 lexicon.

Based on the examination of the common property of Chinese affixes and quasi-affixes, a general approach to Chinese derivation is proposed.  This approach not only enables us to handle quasi-affix phenomena, but is also flexible enough to provide an adequate treatment of a special problem in Chinese derivation, namely zhe-suffixation.  The affix status of 者 -zhe (-er) is generally acknowledged (it is classified as a suffix in authoritative books like Lü et al 1980):  it attaches to a verb sign and produces a word.  The peculiar aspect of this suffix is that the verb stem which it attaches to can be syntactically expanded.  In fact, there is a significant amount of evidence for the argument that this suffix expects a VP as its stem (see 6.5 for evidence).   Since a VP is only formed in syntax and derivation is within the domain of morphology, this phenomenon presents a highly challenging case of how morphology should be interfaced properly to syntax.  The solution which is offered in CPSG95 demonstrates the power of designing morphology and syntax in an integrated grammar formalism.  In contrast, this is an unsolvable problem in any system which enforces sequential processing of derivation morphology before syntax, as most traditional systems do.  There does not seem to be a way of enabling partial output of syntactic analysis (i.e. VP) to feed back to some derivation rule in the preprocessing stage.
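
The ordering problem can be made concrete with a toy sketch. It is not the CPSG95 formalization: the `Sign` class, the two rule functions and the sample phrase 违反规则者 ("one who violates rules") are illustrative assumptions. The point is only that the suffix rule consumes a sign which a syntactic rule has just built, something a morphology-first pipeline cannot do.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sign:
    form: str
    category: str   # toy category labels: "v", "np", "vp", "n", "suffix"


def verb_object_rule(verb: Sign, obj: Sign) -> Optional[Sign]:
    """Toy syntactic rule: V + NP -> VP."""
    if verb.category == "v" and obj.category == "np":
        return Sign(verb.form + obj.form, "vp")
    return None


def zhe_suffix_rule(stem: Optional[Sign], suffix: Sign) -> Optional[Sign]:
    """Toy morphological rule: VP + 者 -> N; the stem must already be a VP."""
    if stem is not None and suffix.form == "者" and stem.category == "vp":
        return Sign(stem.form + suffix.form, "n")
    return None


zhe = Sign("者", "suffix")
vp = verb_object_rule(Sign("违反", "v"), Sign("规则", "np"))   # syntax applies first
noun = zhe_suffix_rule(vp, zhe)                                # then derivation
print(noun)  # Sign(form='违反规则者', category='n')
```

When both rules live in one grammar and are applied to whatever adjacent signs are available, no special feedback channel from syntax to morphology is needed.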

In Section 6.1, the general approach to Chinese derivation is proposed first.  Following this proposal, prefixation is illustrated in 6.2 and suffixation in 6.3.  Section 6.4 shows that this general approach to derivation applies equally well to the 'quasi-affix' phenomena.  Section 6.5 investigates the suffixation of -zhe (-er).  The analysis is based on the argument that this suffixation involves the combination VP+-zhe.  The specific solution following the CPSG95 general approach will be presented based on this analysis.

6.1. General Approach to Derivation

This section examines the property of Chinese affixes and proposes a corresponding general approach to Chinese derivation.  This serves as the basis for the specific solutions to be presented in the remaining sections to various problems in Chinese derivation.

It is fairly easy to observe that in Chinese derivation it is the affix which selects the stem, not the other way round.  For example, the suffix 性 -xing (‑ness) expects an adjective to produce an (abstract) noun.   Based on the examination of the behavior of a variety of Chinese affixes or quasi-affixes, the following generalization has been reached.  That is, an affix lexically expects a sign of category x, with possible additional constraints, to form a derived word of category y.   This generalization is believed to capture the common property shared by Chinese affixes/quasi-affixes.  It seems to account for all Chinese derivational data, including typical affixation, quasi-affixation (see 6.4) and the special case of zhe-suffixation (see 6.5).  So far no counter evidence has been found to challenge this generalization.

The observation and the generalization above support the argument that in a grammar which relies on lexicalized expectation feature structures to drive the building of structures, affixes, not the stems, should be selecting heads of the morphological structures.[2]   Leaving aside the non-productive affixation,[3] the general strategy to Chinese productive derivation is proposed as follows.  In the lexicon, the affix as head of derivative is encoded with the following derivation information:  (i) what type of stem (constraints) it expects;  (ii) where to look for the expected stem, on its right or left;  (iii) what type of (derived) word it leads to (category, semantics, etc.).  Based on this lexical information, CPSG95 has two PS rules in the general grammar for derivation:  one for prefixation, one for suffixation.[4]  These rules ensure that all the constraints be observed before an affix and a stem are combined.  They also determine that the output of derivation, i.e. the mother sign, be a word.
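
The three pieces of lexical information and the two rules can be sketched roughly as follows. This is a simplified stand-in, not the CPSG95 feature structures: the `Sign` and `Affix` classes, the field names and the category labels are assumptions, semantics is omitted, and 重要 (important) is an illustrative stem.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sign:
    form: str
    category: str
    is_word: bool = True


@dataclass(frozen=True)
class Affix:
    form: str
    expects: str     # (i) category of the expected stem
    direction: str   # (ii) "right" for a prefix, "left" for a suffix
    result: str      # (iii) category of the derived word


def prefix_rule(affix: Affix, stem: Sign) -> Optional[Sign]:
    """Combine prefix + stem only if the lexical constraints are met."""
    if affix.direction == "right" and stem.category == affix.expects:
        return Sign(affix.form + stem.form, affix.result, is_word=True)
    return None


def suffix_rule(stem: Sign, affix: Affix) -> Optional[Sign]:
    """Combine stem + suffix only if the lexical constraints are met."""
    if affix.direction == "left" and stem.category == affix.expects:
        return Sign(stem.form + affix.form, affix.result, is_word=True)
    return None


DI = Affix("第", expects="cardinal_num", direction="right", result="ordinal_num")
XING = Affix("性", expects="adj", direction="left", result="n")

print(prefix_rule(DI, Sign("五", "cardinal_num")))   # 第五 as an ordinal numeral
print(suffix_rule(Sign("重要", "adj"), XING))         # 重要性 as a noun
print(prefix_rule(DI, Sign("重要", "adj")))           # None: constraint not met
```

All affix-specific knowledge sits in the lexical entries, so adding a new affix means adding an entry rather than a rule.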

Along this line, the key to a lexicalized treatment of Chinese derivation is to determine the structural and semantic property of the derivative and to impose proper constraints on the expected stem.  The constraints on the expected stem can be lexically specified in the morphological expectation feature [PREFIXING] or [SUFFIXING] of the affix.  The property (category, syntactic expectation, semantics, etc.) of the derivative can also be encoded directly in the lexical entry of the affix, seen as the head of a derivational structure in the CPSG95 analysis.  This property information, as part of head features, will be percolated up when the derivation rules are applied.

In the remaining part of this chapter, it will be demonstrated how this proposed general approach is applied to each specific derivation problem.

6.2. Prefixation

The purpose of this section is to present the CPSG95 solution to Chinese prefixation.  This is done by formulating a sample lexical entry for the ordinal prefix 第 di- (-th) in CPSG95.  It will be shown how the lexical information drives the prefix rule in the general grammar for the derivational combination.

Thanks to the productivity of the prefix 第 di- (-th), the ordinal numeral is always a derived word from the cardinal numeral via the following rule, informally formulated in (6-1).

(6-1.) 第 di- + cardinal numeral --> ordinal numeral

第22条军规
di-      22      tiao    jun-gui
-th     22      CLA   military-rule
the 22-nd military rule (Catch-22)

第八个是铜像
di-      ba      ge      shi     tong-xiang
-th     eight  CLA   be      bronze-statue
The eighth is the bronze statue.

The basic function of the Chinese numeral, whether cardinal or ordinal,  is to combine with a classifier, as shown in the sample sentences above.

To capture this phenomenon, CPSG95 defines two subtypes for the category numeral [num], namely the [cardinal_num] and [ordinal_num].   The lexical entries of the prefix 第 di‑ (‑th) and the cardinal numeral 五 wu (five) are formulated in (6-2) and (6-3).  The prefix encodes the lexical expectation for the derivation 第 di- + [cardinal_num] ‑‑> [ordinal_num] plus the semantic composition of the combination.  Note that the constraint @numeral inherits all common property specified for the numeral macro.

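(6-2.), (6-3.)  [Feature structures for the lexical entries of 第 di- (-th) and 五 wu (five); figure not reproduced]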

As indicated before, prefixation in CPSG95 is handled by the Prefix PS Rule based on the lexical specification.  More specifically, it is driven by the lexical expectation encoded in [PREFIXING].  The prefix rule is formulated in (6-4).

(6-4.)  [Feature structure for the Prefix PS Rule; figure not reproduced]

Like all PS rules in CPSG95, whenever two adjacent signs satisfy all the constraints, this rule takes effect in combining them into a higher level sign in parsing.  For example, the prefix 第 di- (-th) and the sign 五 wu (five) will be combined into the sign as shown in (6-5).

(6-5.)  [Feature structure for the derived sign 第五 di+wu; figure not reproduced]

The combination of 第五 di+wu in (6-5) demonstrates how the morphological structure is built in the CPSG95 approach to Chinese prefixation.

6.3. Suffixation

Like prefixation, the Suffix PS Rule for suffixation is driven by the lexically encoded expectation in [SUFFIXING].  Parallel to the Prefix PS Rule, the suffix rule is formulated in (6-6).

(6-6.)  [Feature structure for the Suffix PS Rule; figure not reproduced]

With this PS rule in hand, all that is needed is to capture the individual derivational constraint in the lexical entries of the suffixes at issue.  For example, the suffix 性 -xing (-ness) changes an adjective or verb into an abstract noun:  A/V + ‑xing  ‑‑> N.  This information is contained in the formulation of the suffix 性 –xing (-ness) in the CPSG95 lexicon, as shown in (6-7).

(6-7.)  [Feature structure for the lexical entry of 性 -xing (-ness); figure not reproduced]

Note that abstract nouns are uncountable, hence the call to the uncountable_noun macro to inherit the common property of uncountable nouns.[5]

Suppose the suffix 性 -xing (-ness) appears immediately after the adjective 实用 shi-yong (practical) formulated in (6-8), the suffix PS rule will combine them into a noun, as  shown in (6-9).

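(6-8.), (6-9.)  [Feature structures for 实用 shi-yong (practical) and the derived noun 实用性 shi-yong+xing; figure not reproduced]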

The combination of 实用性 shi-yong+xing in (6-9) demonstrates how the morphological structure is built in the CPSG95 approach to Chinese suffixation.

6.4. Quasi-affixes

The purpose of this section is to propose an adequate treatment of the quasi-affix phenomena in Chinese.  This is an area which has not received enough investigation in the field of Chinese NLP.  Few Chinese NLP systems demonstrate where and how to handle these quasi-affixes.

To achieve the purpose, typical examples of ‘quasi-affixes’ are presented and compared with some ‘genuine’ affixes.  The comparison highlights the general property shared by both 'quasi-affixes' and other affixes and also shows their differences.  Based on this study, it is found to be a feasible proposal to treat quasi-affixes within the derivation morphology of CPSG95.  The proposed solution will be presented by demonstrating how a typical quasi-affix is represented in CPSG95 and how the general affix rules can work with the lexical entries of 'quasi-affixes' as well.

The tables in (6-10) and (6-11) list some representative quasi-affixes in Chinese.

(6-10.)         Table for sample quasi-prefixes

prefixation                                       examples
lei (quasi-) + N --> N                            类前缀 lei-[qian-zhui]: quasi-[pre-fix]
                                                  (前缀: qian (before, pre-, former-), zhui (...))
ban (semi-) + N --> N                             半文盲 ban-[wen-mang]: semi-illiterate
                                                  (文盲: wen (written-language), mang (blind))
dan (mono-) + N --> N                             单音节 dan-[yin-jie]: mono-syllable
                                                  (音节: yin (sound), jie (segment))
shuang (bi-) + N --> N                            双音节 shuang-[yin-jie]: bi-syllable
duo (multi-) + N --> N                            多音节 duo-[yin-jie]: multi-syllable
fei (non-) + N/A --> A                            非谓 fei-wei: non-predicate
                                                  非正式 fei-[zheng-shi]: non-official
xiang (each other) + Vt (mono-syllabic) --> Vi    相爱 xiang-ai: love each other
zi (self-) + Vt --> Vi                            自爱 zi-ai: self-love
                                                  自学习 zi-xue-xi: self-learning
qian (former, ex-) + N --> N                      前夫人 qian-[fu-ren]: ex-wife
                                                  前总统 qian-[zong-tong]: former president

(6-11.)         Table for sample quasi-suffixes

suffixation                                       examples
N + shi (style) --> N                             美国式 [mei-guo]-shi: American-style
NUM/N + xing (model) --> N                        1980型 1980-xing: 1980 model
                                                  IV型 IV-xing: Model IV
A/V + lü (rate) --> N                             准确率 [zhun-que]-lü: (percentage of) precision
NUM + liu (class) --> A                           一流 yi-liu: first class
                                                  三流 san-liu: third class
N + mang ('blind', person who has                 法盲 fa-mang: person who has no knowledge of law
  little knowledge of) --> N                      计算机盲 [ji-suan-ji]-mang: computer-layman

Comparing the above quasi-affixes with the few widely acknowledged affixes like 性 -xing (-ness) and 第 di- (-th), it is fairly easy to observe that the property generalized in Section 6.1 is shared by both affixes and quasi-affixes.  That is, in all cases of the combination, the affix or quasi-affix expects a sign of category x, with possible additional constraints, either on the right or on the left to form a derived word of category y (y may be equal to x).  For example, the quasi-prefix 自 zi- (self-) expects a transitive verb to produce an intransitive verb.  This property supports the following two points of view:  (i) the affix or quasi-affix is the selecting head of the combination;  (ii) both types of combination (affixation) should be properly contained in morphology since the output is always a word (derivative).

In terms of difference, it is observed that there are different degrees of the functionalization of the meaning between quasi-affixes and other affixes.  For example, the nominalizer 性 -xing (‑ness) seems to be semantically more functionalized than the quasi-suffix 盲 -mang (blind-man, person who has little knowledge of).  In the case of 性 -xing (-ness), there is believed to be little semantic contribution from the affix.  But in cases of affixation by quasi-affixes, the semantic contribution of the affixes is non-trivial, and it must be ensured that proper semantics be built based on semantic compositionality of both the stem and the affix.

Except for the different degrees of semantic abstractness, there is no essential grammatical difference observed between quasi-affixes and the few widely accepted affixes.  As the semantic variation can be easily accommodated in the lexicon, nothing needs to be changed in the  general approach to Chinese derivation as described before.  The text below demonstrates how the quasi-affix phenomena are handled in CPSG95, using a sample quasi-affix to show the derivation.

The quasi-prefix to examine is 相 xiang- (each other).  It is used before a mono-syllabic transitive verb, making it an intransitive verb: 相 xiang- + Vt (monosyllabic) ‑‑> Vi.  More precisely, the syntactic object of the transitive verb is morphologically satisfied so that the derivative becomes an intransitive verb.

Unlike the original verb, the verb derived via xiang-prefixation requires a plural subject, as shown in (6-12).  This is a linguistically interesting phenomenon.  In a sense, it is a version of subject-predicate agreement in Chinese.

(6-12.) (a)    他们相爱过。
ta-men         xiang-         ai       guo
they            each-other   love    GUO
They used to love each other.

(b)      他爱过。
ta       ai       guo
he      love    GUO
He used to love (someone).

(c) *   他相爱过。
ta       xiang-         ai       guo
he      each-other   love    GUO

This number agreement can help decode the plural semantics of the subject noun as shown in the first sentence (6-13a) in the following group.  Sentence (6-13a) illustrates a common, number-underspecified case where the NP has no plural marker.  This contrasts with (6-13b) which includes a plural marker 们 men (-s), and with (6-13c) which resorts to the use of a numeral-classifier construction.

(6-13.) (a)     孩子相爱了。
hai-zi           xiang-         ai       le
child           each-other   love    LE
The children have fallen in love with each other.

(b)      孩子们相爱了。
hai-zi men   xiang-         ai       le
child  PLU   each-other   love    LE
The children have fallen in love with each other.

(c)      两个孩子相爱了。
liang ge      hai-zi           xiang-         ai       le
two    CLA   child           each-other   love    LE
The two children have fallen in love with each other.

Following the practice for number agreement in HPSG, the agreement can be captured by enforcing an additional plural constraint on the subject expectation [SUBJ | SIGN | CONTENT | INDEX | NUMBER plural], as shown in the formulation of the lexical entry for 相 xiang- (each other) in (6-14) below.

(6-14.)  [Feature structure for the lexical entry of 相 xiang- (each other); figure not reproduced]

As shown above, the affixation also necessitates corresponding modification of the semantics in the argument structure:  the first argument is equal to the second via index [2].[6]  Note that the notation [ ], or more accurately, the most general feature structure, is used as a place holder.  For example, HANZI <[ ]> stands for the constraint of a mono-hanzi sign.  Another thing worth noticing is that the derivative requires that a subject must appear before it.  In other words, the subject expectation becomes obligatory.  This is based on the fact that this derived verb cannot stand by itself in syntax, unlike most original verbs in Chinese, say 爱 ai (love), whose subject expectation is optional.

With the lexical entries for the quasi-affixes taking care of the differences in the building of semantics, there is no need for any modification of the CPSG95 PS rules.  For example, the prefix 相 xiang- (each other) and the verb 爱 ai (love) formulated in (6-15) will be combined into the derivative 相爱 xiang-ai (love each other) shown in (6-16) via the Prefix PS Rule.

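(6-15.), (6-16.)  [Feature structures for 爱 ai (love) and the derivative 相爱 xiang-ai (love each other); figure not reproduced]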

In summary, the proposed approach to Chinese derivation is effective in handling quasi-affixes as well.  The general grammar rules for derivation remain unchanged while lexical constraints are accommodated in the lexicon.  This demonstrates the advantages of the lexicalized design for grammar development.
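The plural-subject constraint that xiang-prefixation adds can be pictured as unification over a sort hierarchy.  The sketch below is only an illustration under simplifying assumptions:  the hierarchy follows the description of the type [number] in footnote [5], while the names `PARENT`, `subsumes`, `unify_sort` and `xiang_subject_number` are hypothetical and stand in for the typed unification of the actual formalism.

```python
from typing import Optional

# Sort hierarchy for [number]: number > {a_number, no_number},
# a_number > {singular, plural}.
PARENT = {
    "singular": "a_number",
    "plural": "a_number",
    "a_number": "number",
    "no_number": "number",
    "number": None,
}


def subsumes(general: str, specific: Optional[str]) -> bool:
    """True if `general` is `specific` itself or one of its ancestors."""
    while specific is not None:
        if specific == general:
            return True
        specific = PARENT[specific]
    return False


def unify_sort(a: str, b: str) -> Optional[str]:
    """Return the more specific of two compatible sorts, or None on a clash."""
    if subsumes(a, b):
        return b
    if subsumes(b, a):
        return a
    return None


def xiang_subject_number(subject_number: str) -> Optional[str]:
    """Unify the subject's NUMBER value with the [NUMBER plural] constraint."""
    return unify_sort(subject_number, "plural")
```

Under these assumptions a number-underspecified subject such as 孩子 hai-zi (`"a_number"`) is resolved to `"plural"`, as in (6-13a), while a singular subject such as 他 ta fails to unify, which corresponds to the starred sentence in (6-12).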

6.5. Suffix 者 zhe (-er)

This section analyzes zhe-suffixation, a highly challenging  case at the interface between morphology and syntax.  This is believed to be an unsolvable problem as long as a system is based on the sequential processing of derivation morphology and syntax.  The solution to be proposed in this section is based on the argument that this suffixation is a combination of VP+zhe.

The suffix 者 zhe (-er, person) is a very productive bound morpheme.   It is often compared to the English suffix ‑er or ‑or, as seen in the pairs in (6-17).

(6-17.)
工作 gong-zuo (work)      工作者 [gong-zuo]-zhe (work‑er)
劳动 lao-dong (labor)       劳动者 [lao-dong]-zhe (labor-er)
学习 xue-xi (learn)           学习者 [xue-xi]-zhe (learn-er)

But 者 ‑zhe is not an ordinary suffix;  it belongs to the category of so-called ‘phrasal affix’,[7] with characteristics very different from those of its English counterpart.  Although the output of the zhe-suffixation is a word, the input is a VP, not a lexical V.  In other words, it combines with a VP and produces a lexical N:  VP+zhe --> N.   The arguments to be presented below support this analysis.

The first thing is to demonstrate the word status of zhe‑suffixation.  This is fairly straightforward:  there are no observed facts to show that the zhe-derivative is different from other lexical nouns in the syntactic distribution.  For example, like other lexical nouns, the derivative can combine with an optional classifier construction to form a noun phrase.   Compare the following pairs of examples in (6-18) and (6-19).

(6-18.) (a)    两名违反这项规定者
liang  ming [[wei-fan      zhe    xiang gui-ding]     -zhe]
two    CLA   violate         this    CLA   regulation   -er
two persons who have violated this regulation

(b)    两名学生
liang  ming xue-sheng
two    CLA   student
two students

(6-19.) (a)    他是一位优秀工作者
ta       shi     yi       wei    you-xiu        [[gong-zuo]   -zhe]
he      be      one    CLA   excellent      work           -er
He is an excellent worker.

(b)    他是一位优秀工人。
ta       shi     yi       wei    you-xiu        gong-ren
he      be      one    CLA   excellent      worker
He is an excellent worker.

The next thing is to demonstrate the phrasal nature of the ‘stem’.[8]   The stem is judged as a VP because it can be freely expanded by syntactical complements or modifiers without changing the morphological relationship between the stem and the suffix, as shown in (6‑20) below.  (6-20a) involves a modifier (努力 nu-li) before the head verb.  The verb stem in (6-20b) and (6-20c) is a transitive VP consisting of a verb and an NP object.

(6-20.) (a)    努力工作者
[nu-li  gong-zuo]     -zhe
hard  work           ‑er
hard-worker, person who works hard

(b)      学习鲁迅者
[xue-xi         Lu Xun]       -zhe
learn           Lu Xun       -er
person who is learning from Lu Xun

(c)      违反这项规定者
[wei-fan       zhe    xiang  gui-ding]      -zhe
violate         this    CLA   regulation   -er
person who violates this regulation

More examples with the head verb 雇 gu (employ) are given in (6-21), with the last two expressions involving passivized VP.

(6-21.) (a)    雇者
gu-zhe
employ-er

(b)      雇人者
[gu               ren]             -zhe
employ        person         -er
those who employ people, employer/recruiter

(c)      被雇者
[bei gu]                  -zhe
[be-employed]       -er
employee

(d)      被人雇者
[bei    ren              gu]               -zhe
by      person         employ        -er
those who are employed by (other) people

In fact, the stem VP is semantically equivalent to a relative clause.   A Chinese relative clause is normally expressed in the form of a DE-phrase: VP+de+N (Xue 1991).  In other words, 者 ‑zhe embodies functions of two signs, an N (‘person’, by default) and a relative clause introducer de, something like English one that + VP (or person who + VP).[9]  Compare the two examples in (6-22) and (6-23) with the same meaning - the expression in (6-23) is more colloquial than the first in (6-22) which uses the suffix 者‑zhe.

(6-22.) 违反规定者,处以罚款。
wei-fan        gui-ding       zhe,            chu-yi                   fa-kuan
violate         regulation   one that      punish-by   fine

Those who violate the regulations will be punished by fines.

(6-23.) 违反规定的人,处以罚款。
wei-fan        gui-ding       de      ren,             chu-yi          fa-kuan
violate         regulation   DE     person         punish-by   fine
Those who violate the regulations will be punished by fines.

On further examination, it is found that VPs with attached aspect markers combine with the suffix 者 -zhe with difficulty, as seen in the following examples.

(6-24.) (a)    违反规定者
wei-fan        gui-ding       zhe
violate         regulation   -er
Those who violate the regulations

(b) ?  违反了规定者
wei-fan        le       gui-ding       zhe
violate         LE     regulation   one that

This means that some further constraint may be necessary in order to prevent the grammar from producing strings like (6-24b).  If CPSG95 is only used for parsing, such a constraint is not absolutely necessary because, in normal Chinese text, such input is almost never seen.  Since CPSG95 is intended to be procedure-neutral, for use in both parsing and generation, the further constraint is desirable.

This constraint is in fact not an isolated phenomenon in Chinese grammar.  In syntax, the constraint is commonly required when the VP is not in the predicate position.[10]  For example, when a verb, say 喜欢 xi-huan (like), or a preposition, say 为了 wei-le (in order to), subcategorizes for a VP as a complement, it actually expects a VP with no aspect markers attached.   The following pair of sentences demonstrates this point.

(6-25.) (a)    我喜欢打篮球。
wo     xi-huan       da      lan-qiu.
I         like              play   basket-ball
I like playing basket-ball.

(b) * 我喜欢打了篮球。
wo     xi-huan       da      le       lan-qiu
I         like              play   LE     basket-ball

To accommodate this constraint, which is common to both Chinese morphology and syntax, a binary feature [FINITE] is designed for Chinese verbs in CPSG95.  In the lexicon, this feature is under-specified for each Chinese verb, i.e. [FINITE bin].  When an aspect marker 了/着/过 le/zhe/guo combines with the verb, this feature is unified to be [FINITE plus].  The required constraint [FINITE minus] can then be enforced in the morphological or syntactic expectation to prevent an aspect-marked VP from appearing in a position that expects a non-predicate VP without aspect markers.
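The behavior of the [FINITE] feature can be sketched as unification over three values.  This is a minimal illustration, not the CPSG95 code:  the dictionary representation of a VP and the helper names `unify_bin`, `make_vp` and `fits_nonfinite_slot` are assumptions made for the example.

```python
from typing import Optional


def unify_bin(a: str, b: str) -> Optional[str]:
    """Unify two values of a binary feature; "bin" is the under-specified value."""
    if a == "bin":
        return b
    if b == "bin" or a == b:
        return a
    return None  # "plus" and "minus" clash


def make_vp(words, aspect_marked: bool = False) -> dict:
    """Build a toy VP; every verb starts out lexically as [FINITE bin]."""
    finite = "bin"
    if aspect_marked:
        # Combining with le/zhe/guo unifies the feature to [FINITE plus].
        finite = unify_bin(finite, "plus")
    return {"words": list(words), "FINITE": finite}


def fits_nonfinite_slot(vp: dict) -> bool:
    """Check the [FINITE minus] constraint imposed by the expecting sign."""
    return unify_bin(vp["FINITE"], "minus") is not None
```

With these definitions the VP 打篮球 da lan-qiu passes the check and can serve as the complement of 喜欢 xi-huan or as the stem of 者 -zhe, whereas 打了篮球 da le lan-qiu is rejected, in line with (6-25) and (6-24).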

Based on the above analysis, the lexical entry of the suffix 者 –zhe is formulated in (6-26).  Note the notation for the macro with parameter (placed in parentheses) @common_noun(名|位|个).  This macro represents the following information:  the derivative is like any other common noun and inherits the common property;  it can combine with an optional classifier construction using the classifier 名 ming, 位 wei or 个 ge.[11]

(6-26.)  [Feature structure for the lexical entry of 者 -zhe (-er); figure not reproduced]

As seen, the VP expectation is realized by using the macro constraint @vp.  The semantics of the derivative is [np_semantics], an instance of -er with restriction from the event of VP, represented by [2].  The index [1] ensures that whatever is expected as a subject by the VP, which has no chance of being satisfied syntactically in this case, is semantically identical to this noun.[12]  In other words, this derived noun semantically fills an argument slot held by the subject in the VP semantics [v_content].  In the active case, say, 雇人者 [gu ren]–zhe (‘person who employs people’), the subject is the first argument, i.e. the index of this noun is the logical subject of employ.  However, when the VP is in passive, say, 被人雇者 [bei ren gu]‑zhe (‘person who is employed by other people’), the subject expected by the VP fills the second argument, i.e. the noun in this case is the logical object of the VP.  It is believed that this is the desired result for the semantic composition of zhe-derivation.
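The role assignment just described, where the index of the derived noun fills whichever argument slot the unexpressed subject of the VP would have filled, can be sketched as follows.  The flat dictionaries, the index name `"i"` and the functions `vp_content` and `zhe_derive` are hypothetical simplifications of the [v_content] and [np_semantics] structures.

```python
def vp_content(relation: str, voice: str, other=None):
    """Return (content, subject_role) for a toy VP.

    In an active VP the unexpressed subject is ARG1; in a passive VP it is ARG2.
    `other` is the argument already realized inside the VP, if any.
    """
    subject_role = "ARG1" if voice == "active" else "ARG2"
    other_role = "ARG2" if voice == "active" else "ARG1"
    content = {"RELN": relation, "ARG1": None, "ARG2": None}
    content[other_role] = other
    return content, subject_role


def zhe_derive(content: dict, subject_role: str) -> dict:
    """VP + zhe --> N: coindex the noun with the subject slot of the VP."""
    index = "i"                      # index of the derived noun
    restriction = dict(content)
    restriction[subject_role] = index
    return {"CATEGORY": "N", "INDEX": index, "RESTRICTION": restriction}


# 雇人者 [gu ren]-zhe: the noun is the logical subject of employ.
active = zhe_derive(*vp_content("employ", "active", other="person"))
# 被人雇者 [bei ren gu]-zhe: the noun is the logical object of employ.
passive = zhe_derive(*vp_content("employ", "passive", other="person"))
```

In the active case the noun index lands in `ARG1`, in the passive case in `ARG2`, which mirrors the contrast between 雇人者 and 被人雇者.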

With the lexical expectation of the suffix as the basis, the general Suffix PS Rule is ready to work.  Remember that there is nothing restricting the input stem to the derivation in either of the derivation rules, formulated in (6-4) and (6-6) before.  In CPSG95, this is not considered part of the general grammar but rather a lexical property of the head affix.  It is up to the affix to decide what constraints such as category, wordhood status, semantic constraint, etc., to impose on the expected stem to produce a derivative.  In most cases of derivation, the input status of the stem is a word, but now we have an intricate case where the suffix zhe (-er) expects a verb phrase for derivation.  The general property for all cases of derivation is that regardless of the input, the output of derivation (as well as any other types of morphology) is always a word.

Before demonstrating by examples how zhe-derivation is implemented, there is a need to address the configurational constraints of CPSG95.  This is an important factor in realizing the flexible interaction between morphology and syntax as required in this case.

In all HPSG-style grammars, some type of configurational constraint is in place to ensure the proper order of rule application.  A typical constraint is that the subject rule should apply after the object rule.  This is implemented in CPSG95 by two constraints:  the subject PS rule requires that the head daughter be a phrase, and the object PS rule requires that the subject of the head daughter not yet be satisfied.[13]
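The interplay of these two constraints can be made concrete with a small sketch.  It assumes a toy representation in which a sign is a dictionary with the hypothetical flags `obj_obligatory`, `obj_done` and `subj_done`;  the function names are likewise illustrative and do not come from CPSG95.

```python
from typing import Optional


def is_phrase(sign: dict) -> bool:
    """A sign whose obligatory object is still unsatisfied is not a phrase."""
    return not (sign["obj_obligatory"] and not sign["obj_done"])


def object_rule(head: dict, obj: str) -> Optional[dict]:
    """Head + object; blocked if the head's subject is already satisfied."""
    if head["subj_done"] or head["obj_done"]:
        return None
    return {**head, "form": [head["form"], obj], "obj_done": True}


def subject_rule(subj: str, head: dict) -> Optional[dict]:
    """Subject + head; the head daughter must be a phrase."""
    if not is_phrase(head) or head["subj_done"]:
        return None
    return {**head, "form": [subj, head["form"]], "subj_done": True}


# Verb with an obligatory object, and verb with an optional object.
vt = {"form": "V", "obj_obligatory": True, "obj_done": False, "subj_done": False}
v_opt = {"form": "V", "obj_obligatory": False, "obj_done": False, "subj_done": False}
```

For `vt` the only derivation is [S [V O]].  For `v_opt` the subject rule may build [S V], but the object rule can no longer apply to that result, so when an object is present only [S [V O]] yields a complete parse.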

Since derivation morphology and syntax are designed in the same framework in CPSG95, constraints are called for to ensure the ordering of rule application between morphological PS rules and syntactic PS rules as well.  In general, morphological rules apply before syntactic rules.  However, if this constraint is made absolute, to the extent that all morphological rules must apply before all syntactic rules, morphology and syntax in effect become two independent, successive modules, as in traditional systems.  The grammar will then lose the power of flexible interaction between morphology and syntax and cannot handle cases like zhe-derivation.  This is not a problem in CPSG95.

The proposed constraint regulating the rule application order between morphological PS rules and syntactic PS rules is as follows.  Only when a sign has both obligatory morphological expectation and syntactic expectation will CPSG95 have constraints ensuring that the morphological rule apply first.  For example, as formulated in (6-14) before, the sign 相 xiang- (each other) has both morphological expectation in [PREFIXING] as a bound morpheme and syntactic expectation for the subject in [SUBJ] as (head of) derivative.  If the input string is 他们相爱  ta-men (they) xiang- (each other) ai (love), the prefix rule will first combine 相 xiang- (each other) and the stem 爱 ai (love) before the subject rule can apply.  The result is the expected structure embodying the results of both morphological analysis and syntactic analysis, [ta-men [xiang- ai]].  This constraint is implemented by specifying in all syntactic PS rules that the head daughter cannot have obligatory morphological expectation yet to be satisfied.  It effectively prevents a bound morpheme from being used as a constituent in syntax.   It should be emphasized that this constraint in the general grammar does not prohibit a bound morpheme from combining with any types of sign;  such constraints are only lexically decided in the expectation feature of the affix.
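The ordering constraint can be illustrated with the 他们相爱 example.  The sketch below is a simplification under stated assumptions:  signs are plain dictionaries, the key names `PREFIXING`, `SUBJ` and `result` only loosely mirror the features discussed above, and the two rule functions are hypothetical stand-ins for the Prefix PS Rule and the subject PS rule.

```python
from typing import Optional


def prefix_rule(prefix: dict, stem: dict) -> Optional[dict]:
    """Prefix + stem --> derived word, driven by the prefix's PREFIXING expectation."""
    if prefix.get("PREFIXING") is None or prefix["PREFIXING"] != stem["cat"]:
        return None
    return {
        "form": [prefix["form"], stem["form"]],
        "cat": prefix["result"],
        "SUBJ": prefix.get("SUBJ"),   # syntactic expectation carried by the head affix
        "PREFIXING": None,            # morphological expectation now satisfied
    }


def subject_rule(subj: dict, head: dict) -> Optional[dict]:
    """Subject + head; the head may not have an unsatisfied morphological expectation."""
    if head.get("PREFIXING") is not None:
        return None                   # a bound morpheme cannot be a syntactic constituent
    if head.get("SUBJ") != subj["cat"]:
        return None
    return {"form": [subj["form"], head["form"]], "cat": "S",
            "SUBJ": None, "PREFIXING": None}


tamen = {"form": "ta-men", "cat": "NP"}
xiang = {"form": "xiang-", "cat": "prefix", "PREFIXING": "Vt",
         "result": "Vi", "SUBJ": "NP"}
ai = {"form": "ai", "cat": "Vt", "SUBJ": "NP"}
```

Here `subject_rule(tamen, xiang)` is blocked, so the only way to use 相 xiang- is to apply the prefix rule first;  the subject rule then applies to the derived verb and produces the bracketing [ta-men [xiang- ai]].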

The following text shows step by step the CPSG95 solution to the problem of zhe-derivation.  The chosen example is the derivation for the derived noun 违反规定者 [[wei-fan gui-ding]-zhe]  ‘persons violating (the) regulation’.  The lexical sign of the suffix 者 -zhe (-er) has already been formulated in (6-26) before.  The words 违反 wei-fan (violate) and 规定 gui-ding (regulation) in the CPSG95 lexicon are shown in (6-27) and (6-28) respectively.

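(6-27.), (6-28.)  [Feature structures for the lexical entries of 违反 wei-fan (violate) and 规定 gui-ding (regulation); figure not reproduced]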

Note that all common nouns, specified as @common_noun, in the lexicon have the following INDEX features [PERSON 3, NUMBER number], i.e. third person with unspecified number.  As for the feature [GENDER], it is encoded in the noun itself with one of the following [male], [female], [have_gender], [no_gender] or unspecified as [gender].   The corresponding sort hierarchy is: [gender] consists of sub-sorts [no_gender] and [have_gender];  and [have_gender] is sub-typed into [male] and [female].  Of course, 规定 gui-ding (regulation) is lexically specified as [GENDER no_gender].

The following is the VP built by the object PS rule in the CPSG95 syntax.  As seen, the building of the semantics follows the practice in HPSG, with the argument slots filled by the [INDEX] feature of the subject and object.  In this VP case, [ARG2] has been realized.

(6-29.)  [Feature structure for the VP 违反规定 wei-fan gui-ding; figure not reproduced]

The VP result in (6-29) and the suffix 者 –zhe will combine into the expected derived noun via the Suffix PS Rule, as shown in (6-30).

(6-30.)  [Feature structure for the derived noun 违反规定者 [[wei-fan gui-ding]-zhe]; figure not reproduced]

To summarize, it is the integrated model of derivational morphology and syntax in CPSG95 that makes the above analysis implementable.  Without the integration, there is no way that a suffix is allowed to expect a phrasal stem.[14]  The lexicalist approach adopted in CPSG95 facilitates the capturing of the individual feature of the phrase expectation for the few individual affixes like 者 -zhe. This enables the general PS rules for derivation in CPSG95 to be applicable to both typical cases of affixation and special cases of affixation.

6.6. Summary

This chapter has investigated some representative phenomena of Chinese derivation and their interface to syntax.  The solutions to these problems have been presented based on the arguments for the analysis.

The key to a lexicalized treatment of Chinese derivation is to determine the structural and semantic property of the derivative and to impose proper constraints on the expected stem.  The constraints on the expected stem are lexically specified in the corresponding morphological expectation feature structure of the affix.  The property of the derivative is also lexically encoded in the affix, seen as head of derivational structure in the CPSG95 analysis.  This property information will be percolated up when the derivation rules are applied.  These rules ensure that the output of derivation is a word.  It has been shown that this approach applies equally well to derivation via ‘quasi-affixes’ and to the tough case of zhe-suffixation.

 

------------------------------------

[1] Some linguists (e.g. Li and Thompson 1981) hold the view that Chinese has only a few affixes;  others (e.g. Chao 1968) believe that the inventory of Chinese affixes should be extended to include quasi-affixes.  Interestingly, the sign lei (quasi-, original sense ‘class’) itself is a quasi-prefix in Chinese.  Phenomena similar to Chinese quasi-affixes, called ‘semi-affixes’ or ‘Affixoide’, also exist in German morphology (Riehemann 1998).

[2] This is similar to the practice in many grammars, including HPSG, that a functional sign preposition is the selecting head of the corresponding syntactic structure, namely Prepositional Phrase.

[3] Those affixes which are not or no longer productive, e.g. lao‑ (original meaning ‘old’) in lao‑hu (tiger) and lao‑shu (mouse),  are not a problem.  The corresponding derived words are simply listed in the CPSG95 lexicon.

[4] The CPSG95 phrase-structural approach to Chinese productive derivation was inspired by the implementation in HPSG of a word-syntactic approach in Krieger (1994).  Similar practice is also seen in Selkirk (1982), Riehemann (1993) and Kathol (1999) in an effort to explore alternative approaches than the lexical rule approach to morphology.

[5] The major common property is reflected in two aspects, formulated in the macro definition of uncountable_noun in CPSG95.  First, there is value setting for the [NUMBER] feature, i.e. [CONTENT|INDEX|NUMBER no_number].  The CPSG95 sort hierarchy for the type [number] is defined as {a_number, no_number} where [a_number] is further sub-typed into {singular, plural}.  [NUMBER no_number] applies to uncountable nouns while [NUMBER a_number] is used for countable noun where the plurality is yet to be decided (i.e. under-specified for plurality).  Second, based on the syntactic difference between Chinese countable nouns and uncountable nouns, the classifier expected by uncountable nouns is exclusively zhong (kind/sort of).  That is, uncountable nouns may only combine with a preceding classifier construction using the classifier zhong.

[6] For the time being, the subtle difference in semantics between pairs like We love ourselves and We love each other is not represented in the content.  It requires a more elaborate system of semantics to reflect the nuance.  The elaboration of semantics is left for future research.

[7] Some linguists (e.g. Z. Lu 1957; Lü et al 1980; Lü 1989; Dai 1993) have briefly introduced the notion of ‘phrasal affix’ in Chinese.  Lü further indicates that these ‘phrasal affixes’ are a distinctive characteristic of the Chinese grammar.

[8] The English possessive morpheme -’s is arguably a suffix which expects an NP instead of a lexical noun as its stem:  NP + -’s.  Unlike VP + -zhe, the result of this NP + -’s combination is generally regarded as a phrase, not a word.  In this sense, -’s seems to be closer to a functional word, similar to a preposition or postposition, than to a suffix.

[9] Chinese zhe-suffixation is somewhat like the English phenomenon of what-clause (in ‘what he likes is not what interests her’). ‘What’ in this use also embodies the functions of two signs, that which. However, the English what-clause functions as an NP, whereas VP+zhe forms a lexical N.

[10] It is generally agreed in the circle of Chinese grammar research that Chinese predicate (or finite) verbs have aspect distinction, using or not using aspect markers.  This is in contrast to English where both finite and non-finite verbs have aspect distinction but only finite verbs are tensed.

[11] It is generally agreed that each Chinese common noun may only combine with a classifier construction using a specific set of classifiers.  This classifier specification is generally regarded as lexical, idiosyncratic information of nouns (Lü et al 1980).  Using the macro with the classifier parameter follows this general idea.  It is worth noticing that the lexical formulation for -zhe (-er) in CPSG95 does not rely on any specific NP analysis chosen in syntax, except that the classifier specification should be placed under the entry for nouns (or derived nouns).

[12] The proposal in building the semantics for the zhe-derivative is based on ideas similar to the assumption adopted for the complement control in HPSG that ‘the fundamental mechanism of control was coindexing between the unexpressed subject of an unsaturated complement and its controler’ (Pollard and Sag 1994:282).

[13] If the object expectation is obligatory, this constraint ensures the priority of the object rule over the subject rule in application, building the desirable structure [S [V O]] instead of [[S V] O].  This is because a verb with an obligatory object yet to be satisfied is by definition not a phrase.  If the object expectation is optional, the order of rule application is still in effect although the lexical V in this scenario does not violate the phrase definition.  There are two cases for this situation.  In case one, the object O happens to occur in the input string.  The parser will tentatively combine S and V via the subject rule, but it can go no further.  This is because the object rule cannot apply after the subject rule, due to the constraint in the object rule that the head cannot have a satisfied subject.  The successful parse will only build the expected structure [S [V O]].  In case two, the object O does not appear in the input string.  Then the tentative combination [S V] built by the subject rule becomes the final parse.

[14] For example, if the lexical rule approach were adopted for derivation, this problem could not be solved.

 


PhD Thesis: Chapter V Chinese Separable Verbs

 

5.0. Introduction

This chapter investigates the phenomena usually referred to as separable verbs (离合动词 lihe dongci) in the form V+X.  Separable verbs constitute a significant portion of Chinese verb vocabulary.[1]  These idiomatic combinations seem to show dual status (Z. Lu 1957; L. Li 1990).  When V+X is not separated, it is like an ordinary verb.   When V is separated from X, it seems to be more like a phrasal combination.  The co-existence of both the separated use and contiguous use for these constructions is recognized as a long-standing problem at the interface of Chinese morphology and syntax (L. Wang 1955;  Z. Lu 1957; Chao 1968; Lü 1989; Lin 1983;  Q. Li 1983; L. Li 1990; Shi 1992; Dai 1993; Zhao and Zhang 1996).

Some linguists (e.g. L. Li 1990; Zhao and Zhang 1996) have made efforts to classify different types of separable verbs and demonstrated different linguistic facts about these types.  There are two major types of separable verbs:  V+N idioms with the verb-object relation and V+A/V idioms with the verb-modifier relation - when X is A or non-conjunctive V.[2]

The V+N idiom is a typical case which demonstrates the mismatch between a vocabulary word and grammar word.  There have been three different views on whether V+N idioms are words or phrases in Chinese grammar.

Given the fact that the V and the N can be separated in usage, the most popular view (e.g. Z. Lu 1957; L. Li 1990; Shi 1992) is that they are words when V+N are contiguous and they are phrases otherwise.  This analysis fails to account for the link between the separated use and the contiguous use of the idioms.  For V+N idioms like 洗澡 xi zao (wash-bath: take a bath), this analysis also fails to explain why contiguous V+N idioms listed in the lexicon should receive a different structural analysis from contiguous but non-listable combinations of V and N (e.g. 洗碗 xi wan 'wash dishes').[3]  As will be shown in Section 5.1, the structural distribution for this type of V+N idioms and the distribution for the corresponding non-listable combinations are identical.

Other grammarians argue that V+N idioms are not phrases (Lin 1983;  Q. Li 1983; Zhao and Zhang 1996).  They insist that they are words, or a special type of words.  This argument cannot explain the demonstrated variety of separated uses.

There are scholars (e.g. Lü 1989; Dai 1993) who indicate that idioms like 洗澡 xi zao are phrases.  Their judgment is based on their observation of the linguistic variations demonstrated by such idioms.  But they have not given detailed formal analyses which account for the difference between these V+N idioms and the non-listable V+NP constructions in the semantic compositionality.  That seems to be the major reason why this insightful argument has not convinced people with different views.

As for V+A/V idioms, Lü (1989) offers a theory that these idioms are words and the insertable signs between V and A/V are Chinese infixes.  This is an insightful hypothesis.  But as in the case of the analyses proposed for V+N idioms, no formal solutions have been proposed based on the analyses in the context of phrase structure grammars.  As a general goal, a good solution should not only be implementable, but also offer an analysis which captures the linguistic link, both structural and semantic, between the separated use and the contiguous use of separable verbs.  It is felt that there is still a distance between the proposed analyses reported in literature and achieving this goal of formally capturing the linguistic generality.

Three types of V+X idioms can be classified based on their different degrees of 'separability' between V and X, to be explored in three major sections of this chapter.  Section 5.1 studies the first type of V+N idioms like 洗澡 xi zao (wash-bath: take a bath).  These idioms are freely separable.  It is a relatively easy case.  Section 5.2 investigates the second type of the V+N idioms represented by 伤心 shang xin (hurt-heart: sad or heartbroken).  These idioms are less separable.  This category constitutes the largest part of the V+N phenomena.  It is a more difficult borderline case.  Section 5.3 studies the V+A/V idioms.  These idioms are least separable:  only the two modal signs 得 de3 (can) and 不 bu (cannot) can be inserted inside them, and nothing else.  For all these problems, arguments for the wordhood judgment will be presented first.  A corresponding morphological or syntactic analysis will be proposed, together with the formulation of the solution in CPSG95 based on the given analysis.

5.1. Verb-object Idioms: V+N I

The purpose of this section is to analyze the first type of V+N idioms, represented by 洗澡 xi zao (wash‑bath: take a bath).  The basic arguments to be presented are that they are verb phrases in Chinese syntax and the relationship between the V and the N is syntactic.  Based on these arguments, formal solutions to the problems involved in this construction will be presented.

The idioms like 洗澡 xi zao are classified as V+N I, to be distinguished from another type of idioms V+N II (see 5.2).  The following is a sample list of this type of idioms.

(5-1.) V+N I: xi zao type

洗澡 xi (wash) zao (bath #)              take a bath
擦澡 ca (scrub) zao (bath #)             clean one's body by scrubbing
吃亏 chi (eat) kui (loss #)                   get the worst
走路 zou (go) lu (way $)                      walk
吃饭 chi (eat) fan (rice $)                    have a meal
睡觉 shui (V:sleep) jiao (N:sleep #)   sleep
做梦 zuo (make) meng (N:dream)     dream (a dream)
吵架  chao (quarrel) jia (N:fight #)    quarrel (or have a row)
打仗 da (beat) zhang (battle)              fight a battle
上当 shang (get) dang (cheating #)                be taken in
拆台 chai (pull down) tai (platform #)          pull away a prop
见面 jian (see) mian (face #)                            meet (face to face)
磕头 ke (knock) tou (head)                              kowtow
带头 dai (lead) tou (head $)                            take the lead
帮忙 bang (help) mang (business #)              give a hand
告状 gao (sue) zhuang (complaint #)            lodge a complaint

Note: Many nouns (marked with # or $) in this type of constructions cannot be used independently of the corresponding V.[4]  But those with the mark $ have no such restriction in their literal sense.  For example, when the sign fan means 'meal', as it does in the idiom, it cannot be used in a context other than the idiom chi-fan (have a meal).  Only when it stands for the literal meaning ‘rice’ does it not have to co-occur with chi.

There is ample evidence for the phrasal status of the combinations like 洗澡 xi zao.  The evidence is of three types.  The first comes from the free insertion of some syntactic constituent X between the idioms in the form V+X+N: this involves keyword-based judgment patterns and other X‑insertion tests proposed in Chapter IV.  The second type of evidence resorts to some syntactic processes for the transitive VP, namely passivization and long-distance topicalization.  The V+N I idioms can be topicalized and passivized in the same way as ordinary transitive VP structures do.  The last piece of evidence comes from the reduplication process associated with this type of idiom.   All the evidence leads to the conclusion that V+N I idioms are syntactic in nature.

The first evidence comes from using the wordhood judgment pattern: V(X)+zhe/guo --> word(X).  It is a well observed syntactic fact that Chinese aspectual markers appear right after a lexical verb (and before the direct object).  If 洗澡 xi zao were a lexical verb, the aspectual markers would appear after the combinations, not inside them.  But that is not the case, as shown by the ungrammaticality of the example in (5-2b).  A productive transitive VP example is given in (5-3) to show its syntactic parallelism with V+N I idioms.

(5-2.) (a)      他正在洗着澡
ta       zheng-zai    xi      zhe    zao.
he      right-now    wash ZHE   bath
He is taking a bath right now.

(b) *   他正在洗澡着。
ta       zheng-zai    xi-zao         zhe.
he      right-now    wash-bath   ZHE

(5-3.) (a)      他正在洗着衣服。
ta       zheng-zai    xi      zhe    yi-fu.
he      right-now    wash ZHE   clothes
He is washing the clothes right now.

(b) *   他正在洗衣服着。
ta       zheng-zai    xi      yi-fu           zhe.
he      right-now    wash clothes        ZHE

The above examples show that the aspectual marker 着 zhe (ZHE) should be inserted in the V+N idiom, just as it is in an ordinary transitive VP structure.

Further evidence for X-insertion is given below.   This comes from the post-verbal modifier of ‘action-times’ (动量补语 dongliang buyu) like 'once', 'twice', etc.  In Chinese, action-times modifiers appear after the lexical verb and aspectual marker (but before the object), as shown in (5-4a) and (5-5a).

(5-4.) (a)      他洗了两次澡。
ta       xi      le       liang  ci       zao.
he      wash LE     two    time   bath
He has taken a bath twice.

(b) *   他洗澡了两次。
ta       xi-zao         le       liang  ci.
he      wash-bath   LE     two    time

(5-5.) (a)      他洗了两次衣服。
ta       xi      le       liang  ci       yi-fu.
he      wash LE     two    time   clothes
He has washed the clothes twice.

(b) *   他洗衣服了两次。
ta       xi      yi-fu           le       liang  ci.
he      wash clothes        LE     two    time
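The ordering facts in (5-2) through (5-5) can be summarized as a toy check. This is only a minimal sketch, assuming pre-segmented input and a hypothetical tag set (`V`, `ASP`, `TIMES`, `N`) invented for illustration; it is not part of CPSG95. The point it illustrates is that the idiom 洗澡 and the ordinary VP 洗衣服 pass or fail the same ordering test.

```python
# Toy check of the ordering facts in (5-2)-(5-5): aspect markers and
# action-times modifiers follow the lexical verb and precede the object.
# The tag names and this tiny lexicon are illustrative assumptions.
TAGS = {
    "洗": "V",
    "澡": "N",
    "衣服": "N",
    "着": "ASP",
    "了": "ASP",
    "过": "ASP",
    "两次": "TIMES",
}

def verb_phrase_ok(tokens):
    """Return True if tokens match the order V (ASP) (TIMES) N."""
    tags = [TAGS[t] for t in tokens]
    expected = ["V", "ASP", "TIMES", "N"]
    optional = {"ASP", "TIMES"}
    i = 0
    for slot in expected:
        if i < len(tags) and tags[i] == slot:
            i += 1          # slot filled by the next token
        elif slot not in optional:
            return False    # an obligatory slot is missing
    return i == len(tags)   # no tokens may be left over

# Idiom and ordinary VP behave alike:
print(verb_phrase_ok(["洗", "着", "澡"]))          # (5-2a)
print(verb_phrase_ok(["洗", "澡", "着"]))          # (5-2b)
print(verb_phrase_ok(["洗", "了", "两次", "衣服"]))  # (5-5a)
```

The same function accepts the (a) examples and rejects the (b) examples for both 洗澡 and 洗衣服, which is the parallelism the examples are meant to show.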

So far, evidence has been provided of syntactic constituents which are attached to the verb in the V+N I idioms.  To further argue for the VP status of the whole idiom, it will be demonstrated that the N in the V+N I idioms in fact fills the syntactic NP position in the same way as all other objects do in Chinese transitive VP structures.  In fact, N in the V+N I does not have to be a bare N:  it can be legitimately expanded to a full-fledged NP (although it does not normally do so).  A full-fledged NP in Chinese typically consists of a classifier phrase (and modifiers like de-construction) before the noun.  Compare the following pair of examples.  Just like an ordinary NP 一件崭新的衣服 yi jian zhan-xin de yi-fu (one piece of brand-new clothes), 一个痛快的澡 yi ge tong-kuai de zao (a comfortable bath) is a full-fledged NP.

(5-6.)           他洗了一个痛快的澡。
ta       xi      le       yi       ge      tong-kuai     de      zao.
he      wash LE     one    CLA   comfortable DE     bath
He has taken a comfortable bath.

(5-7.)           他洗了一件崭新的衣服。
ta       xi      le       yi       jian    zhan-xin       de      yi-fu.
he      wash LE     one    CLA   brand-new  DE     clothes
He has washed one piece of brand-new clothes.

It should be noted that the above evidence directly contradicts the widespread view that signs like 澡 zao, marked with # in (5-1), are 'bound morphemes' or ‘bound stems’ (e.g. L. Li 1990; Zhao and Zhang 1996).  As shown, like every other free morpheme noun (e.g. yi-fu), zao holds a lexical position in the typical Chinese NP sequence 'determiner + classifier + (de-construction) + N', e.g. 一个澡 yi ge zao (a bath), 一个痛快的澡 yi ge tong-kuai de zao (a comfortable bath).[5]  In fact, as long as the ‘V+N I phrase’ arguments are accepted (further evidence to come), by definition ‘bound morpheme’ is a misnomer for 澡 zao.  As a part of morphology, a bound morpheme cannot play a syntactic role:  it is inside a word and cannot be seen in syntax.  The analysis of 洗 xi (...) 澡 zao as a phrase entails the syntactic roles played by 澡 zao:  (i) 澡 zao is a free morpheme noun which fills the lexical position as the final N inside the possibly full-fledged NP;  (ii) 澡 zao plays the object role in the syntactic transitive structure 洗澡 xi zao.

This bound morpheme view is an argument used for demonstrating  the relevant V+N idioms to be words rather than phrases (e.g. L. Li 1990).  Further examination of this widely accepted view will help to strengthen the counter-arguments that all V+N I idioms are phrases.

Labeling signs like 澡 zao (bath) as bound morphemes seems to come from an inappropriate interpretation of the statement that bound morphemes cannot be ‘freely’, or ‘independently’, used in syntax.[6]  This interpretation places an equal sign between the idiomatic co-occurrence constraint and ‘not being freely used’.  It is true that 澡 zao is not an ordinary noun to be used in isolation.  There is a co-occurrence constraint in effect:  澡 zao cannot be used without the appearance of 洗 xi (or 擦 ca).  However, the syntactic role played by 澡 zao, the object in the syntactic VP structure, has full potential of being ‘freely’ used as any other Chinese NP object:   it can even be placed before the verb in long-distance constructions as shall be shown shortly.  A more proper interpretation of ‘not being freely used’ in terms of defining bound morphemes should be that a genuine bound morpheme, e.g. the suffix 性 -xing ‘-ness’, has to attach to another sign contiguously to form a word.

A comparison with similar phenomena in English may be helpful.  English also has similar idiomatic VPs, such as kick the bucket.[7]  For the same reason, it cannot be concluded that bucket (or the bucket) is a bound morpheme only because it demonstrates necessary co-occurrence with the literal verb kick.  Signs like bucket, zao (bath) are not of the same nature as bound morphemes like -less, -ly, un-, -xing (-ness), etc.

The second type of evidence shows some pattern variations for the V+N I idioms.  These variations are typical syntactic patterns for the transitive V+NP structure in Chinese.  One of the most frequently used patterns for transitive structures is the topical pattern of long distance dependency.  This provides strong evidence for judging the V+N I idioms as syntactic rather than morphological, because, with the exception of clitics, morphological theories in general conceive of the parts of a word as being contiguous.[8]  Both the V+N I idiom and the normal V+NP structure can be topicalized, as shown in (5-8b) and (5-9b) below.

(5-8.) (a)      我认为他应该洗澡。
wo     ren-wei        ta       ying-gai       xi zao.
I         think           he      should        wash-bath
I think that he should take a bath.

(b)      澡我认为他应该洗
zao    wo     ren-wei        ta       ying-gai       xi.
bath  I         think           he      should        wash
The bath I think that he should take.

(5-9.) (a)       我认为他应该洗衣服。
wo     ren-wei        ta       ying-gai       xi      yi-fu.
I         think           he      should        wash clothes
I think that he should wash the clothes.

(b)      衣服我认为他应该洗。
yi-fu           wo     ren-wei        ta       ying-gai       xi.
clothes        I         think           he      should        wash
The clothes I think that he should wash.

The minimal pair of passive sentences in (5-10) and (5‑11) further demonstrates the syntactic nature of the V+N I structure.

(5-10.)         澡洗得很干净。
zao             xi      de3    hen    gan-jing.
bath            wash DE3   very   clean
A good bath was taken so that one was very clean.

(5-11.)         衣服洗得很干净。
yi-fu           xi      de3    hen    gan-jing.
clothes        wash DE3   very   clean
The clothes were washed clean.

The third type of evidence involves the nature of reduplication associated with such idioms.  For idioms like 洗澡 xi zao (take a bath), the first sign can be reduplicated to denote the shortness of the action:  洗澡 xi zao (take a bath) --> 洗洗澡 xi xi zao (take a short bath).  If 洗澡 xi zao is a word, by definition, 洗xi is a morpheme inside the word and 洗洗澡 xi-xi-zao belongs to morphological reduplication (AB-->AAB type).  However, this analysis fails to account for the generality of such reduplication:  it is a general rule in Chinese grammar that a verb reduplicates itself contiguously to denote the shortness of the action.  For example, 听音乐 ting (listen to) yin-yue (music) --> 听听音乐 ting ting yin-yue (listen to music for a while); 休息 xiu-xi (rest) --> 休息休息 xiu-xi xiu-xi (have a short rest), etc.  On the other hand, when we accept that 洗澡 xi zao is a verb-object phrase in syntax and the nature of this reduplication is accordingly judged as syntactic,[9] we come to a satisfactory and unified account for all the related data.  As a result, only one reduplication rule is required in CPSG95 to capture the general phenomena;[10]  there is no need to do anything special for V+N  idioms.
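The claim that a single reduplication rule suffices can be illustrated with a short sketch. It assumes pre-segmented input in which the lexical verb is its own token, and the function name `reduplicate_verb` is hypothetical; the actual CPSG95 rule is stated over feature structures, not token lists.

```python
# Sketch of one general rule V --> V V applied to a lexical verb token.
# Assumption: the input is already segmented so that the verb is a token.

def reduplicate_verb(tokens, verb_index=0):
    """Repeat the lexical verb contiguously (denoting a short action)."""
    verb = tokens[verb_index]
    return tokens[:verb_index] + [verb, verb] + tokens[verb_index + 1:]

# The same rule covers the idiom, an ordinary VP, and a two-syllable verb:
print(reduplicate_verb(["洗", "澡"]))    # xi xi zao
print(reduplicate_verb(["听", "音乐"]))  # ting ting yin-yue
print(reduplicate_verb(["休息"]))        # xiu-xi xiu-xi
```

Because 洗 xi is treated as a lexical verb in its own right, 洗洗澡 falls out of the same rule as 听听音乐, with no separate AB --> AAB pattern needed.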

This AB ‑‑> AAB type reduplication problem for the V+N idioms poses a big challenge to traditional word segmenters (Sun and Huang 1996).  Moreover, even when a word segmenter successfully incorporates some procedure to cope with this problem, the essentially same rule has to be repeated in the grammar for the general VV reduplication.  This is not desirable in terms of capturing the linguistic generality.

All the evidence presented above indicates that idioms like 洗澡 xi zao, no matter whether V and N are used contiguously or not, are not words, but phrases.  The idiomatic nature of such combinations seems to be the reason why most native speakers, including some linguists, regard them as words.  Lü (1989: 113-114) suggests that vocabulary words like 洗澡 xi zao should be distinguished from grammar words.  He was one of the first Chinese grammarians who found that the V+N relation in the idioms like 洗澡 xi zao is a syntactic verb object relation.  But he did not provide full arguments for his view, nor did he offer a precise formalized analysis of this problem.[11]

As shown in the previous examples, the V+N I idioms do not differ from other transitive verb phrases in all major syntactic behaviors.   However, due to their idiomatic nature, the V+N I idioms are different from ordinary transitive VPs in the following two major aspects.  These differences need to be kept in mind when formulating the grammar to capture the phenomena.

  • Semantics:  the semantics of the idiom should be given directly in the lexicon, not as a result of the computation of the semantics of the parts based on some general principle of compositionality.
  • Co-occurrence requirement:  洗 xi (or 擦 ca) and 澡 zao must co-occur with each other;  走 zou (go) and 路 lu (way) must co-occur; etc.  This is a requirement specific to the idioms at issue.  For example, 洗 xi and 澡 zao must co-occur in order to stand as an idiom to mean ‘take a bath’.

Based on the study above, the CPSG95 solution to this problem is described below.  In order to enforce the co-occurrence of the V+N I idioms, it is specified in the CPSG95 lexicon that the head V obligatorily expects as its object an NP headed by a specific literal.  This treatment originates from the practice of handling collocations in HPSG.  In HPSG, there are features designed to enable the subcategorization for particular words, or phrases headed by particular words.  For example, the feature [NFORM there] and [NFORM it] refer to the expletive there and it respectively for the special treatment of existential constructions, cleft constructions, etc. (Pollard and Sag 1987:62).  The values of the feature PFORM distinguish individual prepositions like for, on, etc.  They are used in phrasal verbs like rely on NP, look for NP, etc.  In CPSG95, this approach is being generalized, as described below.

As presented before, the feature for orthography [HANZI] records the Chinese character string for each lexical sign.  When a specific lexical literal is required in an idiomatic expectation, the constraint is directly placed on the value of the feature [HANZI] of the expected sign, in addition to possible other constraints.  It is standard practice in a lexicalized grammar that the expected complement (object) for the transitive structure be coded directly in the entry of the head V in the lexicon.  Usually, the expected sign is just an ordinary NP.  In the idiomatic VP like 洗 xi (...) 澡 zao, one further constraint is placed:  the expected NP must be headed by the literal character 澡zao.  This treatment ensures that all pattern variations for transitive VP such as passive constructions, topicalized constructions, etc. in Chinese syntax will equally apply to the V+N I idioms.[12]

The difference in semantics is accommodated in the feature [CONTENT] of the head V with proper co-indexing.  In ordinary cases like 洗衣服 xi yi-fu (wash clothes), the argument structure is [vt_semantics] which requires two arguments, with the role [ARG2] filled by the semantics of the object NP.  In the idiomatic case 洗澡 xi zao (take a bath), the V and N form a semantic whole, coded as [RELN take_bath].[13]  The V+N I idioms are formulated like intransitive verbs in terms of composing the semantics - hence coded as [vi_semantics], with only one argument to be co-indexed with the subject NP.  Note that there are two lexical entries in the lexicon for the verb 洗 xi (wash), one for the ordinary use and the other for the idiom, shown in (5-12) and (5-13).

[Figures (5-12) and (5-13): lexical entries for 洗 xi (wash), ordinary use and idiomatic use]

The above solution takes care of the syntactic similarity of the V+N I idioms and ordinary V+NP structures.  It is also detailed enough to address their major differences.  In addition, the associated reduplication process (i.e. V+N --> V+V+N) is no longer a problem once this solution is adopted.  As the V in the V+N idioms is judged and coded as a lexical V (word) in this proposal, the reduplication rule which handles V --> VV will equally apply here.
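The two-entry treatment of 洗 xi can be sketched in code. This is a simplified illustration, not the CPSG95 formalism: the class names, the flat fields standing in for [HANZI] and [CONTENT], and the relation name `wash` for the ordinary entry are all assumptions made here; only `take_bath`, `vt_semantics` and `vi_semantics` come from the text. Preferring the entry with a literal constraint is likewise an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class ObjectExpectation:
    category: str               # e.g. "NP"
    head_hanzi: Optional[str]   # required literal head, or None for any head

@dataclass(frozen=True)
class VerbEntry:
    hanzi: str
    obj: ObjectExpectation
    semantics_type: str         # "vt_semantics" or "vi_semantics"
    reln: str                   # "wash" is a hypothetical relation name

XI_ORDINARY = VerbEntry("洗", ObjectExpectation("NP", None), "vt_semantics", "wash")
XI_IDIOM = VerbEntry("洗", ObjectExpectation("NP", "澡"), "vi_semantics", "take_bath")
LEXICON = [XI_ORDINARY, XI_IDIOM]

def choose_entry(verb: str, object_head: str,
                 lexicon: List[VerbEntry]) -> Optional[VerbEntry]:
    """Pick an entry whose object expectation accepts the NP's head."""
    candidates = [e for e in lexicon
                  if e.hanzi == verb and e.obj.head_hanzi in (None, object_head)]
    # Assumption of this sketch: an entry with a literal constraint wins.
    candidates.sort(key=lambda e: e.obj.head_hanzi is None)
    return candidates[0] if candidates else None

print(choose_entry("洗", "澡", LEXICON).reln)
print(choose_entry("洗", "衣服", LEXICON).reln)
```

The check looks only at the head of the expected NP, not at its position, so the same entry would be selected whether the object follows the verb, is topicalized, or appears in a passive pattern.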

5.2. Verb-object Idioms: V+N II

The purpose of this section is to provide an analysis of another type of V+N idiom and present the solution implemented in CPSG95 based on the analysis.

Examples like 洗澡 xi zao (take a bath) are in fact easy cases to judge.   There are more marginal cases.  When discussing Chinese verb-object idioms, L. Li (1990) and Shi (1992) indicate that the boundary between a word and a phrase in Chinese is far from clear-cut.  There is a remarkable “gray area” in between.  Examples in (5-14) are V+N II idioms, in contrast to the V+N I type, classified by L. Li (1990).

(5-14.) V+N II: 伤心 shang xin type

伤心 shang (hurt) xin (heart)             sad or break one's heart
担心 dan (carry) xin (heart)               worry
留神 liu (pay) shen (attention)           pay attention to
冒险 mao (take) xian (risk)                 take the risk
借光 jie (borrow) guang (light)           benefit from
劳驾 lao (bother) jia (vehicle)             beg the pardon
革命 ge (change) ming (life)                 make revolution
落后 luo (lag) hou (back)                      lag behind
放手 fang (release) shou (hand)          release one's hold

Compared with V+N I (洗澡xi zao type), V+N II has more characteristics of a word.  The lists below given by L. Li (1990) contrast their respective characteristics.[14]

(5-15.) V+N I (based on L. Li 1990:115-116)

as a word: V-N

(a1) corresponds to one generalized sense (concept)
(a2) usually contains ‘bound morpheme(s)’

as a phrase: V X N

(b1) may insert an aspectual particle (X=le/zhe/guo)
(b2) may insert all types of post-verbal modifiers (X=BUYU)
(b3) may insert a pre-nominal modifier de-construction (X=DEP)

(5-16.) V+N II (based on L. Li 1990:115)

as a word: V-N X

(a1) corresponds to one generalized sense (concept)
(a2) usually contains ‘bound morpheme(s)’
(a3) (some) may be followed by an aspectual particle (X=le/zhe/guo)
(a4) (some) may be followed by a post-verbal modifier of duration or number of times (X=BUYU)
(a5) (some) may take an object (X=BINYU)

as a phrase: V X N

(b1) may insert an aspectual particle (X=le/zhe/guo)
(b2) may insert all types of post-verbal modifiers (X=BUYU)
(b3) may insert a pre-nominal modifier de-construction (X=DEP)

For V+N I, the previous text has already given detailed analysis and evidence and decided that such idioms are phrases, not words.  This position is not affected by the demonstrated features (a1) and (a2) in (5‑15);  as argued before,  (a1) and (a2) do not contribute to the definition of a grammar word.

However, (a3), (a4) and (a5) are all syntactic evidence showing that V+N II idioms can be inserted in lexical positions.   On the other hand, these idioms also show the similarity with V+N I idioms in the features (b1), (b2) and (b3) as a phrase.  In particular, (a3) versus (b1) and (a4) versus (b2) demonstrate a 'minimal pair' of phrase features and word features.  The following is such a minimal pair example (with the same meaning as well) based on the feature pairs (a3) versus (b1), with a post-verbal modifier 透tou (thorough) and aspectual particle 了le (LE).  It demonstrates the borderline status of such idioms.  As before, a similar example of an ordinary transitive VP is also given below for comparison.

(5-17.)         V+N II: word or phrase?

伤心:sad; heart-broken
shang          xin
hurt            heart

(a)      我伤心透了
wo     shang-xin  tou              le.
I         sad              thorough     LE
I was extremely sad.

(b)      我伤透了心
wo     shang         tou              le       xin.
I         break          thorough     LE     heart
I was extremely sad.

(5-18.)         Ordinary V+NP phrase: 恨hen (hate) 他ta (he)

(a) *   我恨他透了
wo     hen   ta      tou              le.
I         hate   he      thorough     LE

(b)      我恨透了他
wo     hen   tou              le       ta.
I         hate   thorough     LE     he
I thoroughly hate him.

As shown in (5-18), in the common V+NP structure, the post-verbal modifier 透 tou (thorough) and the aspectual particle 了 le (perfect aspect) can only occur between the lexical V and NP.  But in many V+N II idioms, they may occur either after the V+N combination or in between.  In (5-17a), 伤心 shang xin is in the lexical position because Chinese syntax requires that the post-verbal modifier attach to the lexical V, not to a VP as indicated in (5-18a).  Following the same argument, 伤 shang (hurt) alone in (5-17b) must be a lexical V as well.  The sign 心 xin (heart) in (5-17b) establishes itself in syntax as object of the V, playing the same role as 他 ta (he) in (5-18b).  These facts show clearly that V+N II idioms can be used both as lexical verbs and as transitive verb phrases.   In other words, before entering a context, while still in the lexicon, one cannot rule out either possibility.

However, there is a clear-cut condition for distinguishing its use as a word and its use as a phrase once a V+N II idiom is placed in a context.   It is observed that the only time a V+N II idiom assumes the lexical status is when V and N are contiguous.  In all other cases, i.e. when V and N are not contiguous, they behave essentially like the V+N I type.

In addition to the examples in (5-17) above, two more examples are given below to demonstrate the separated phrasal use of V+N II.  The first is the case V+X+N where X is a possessive modifier attached to the head N.  Note also the post-verbal position of 透 tou (thorough) and 了le (LE).  The second is an example of passivization when N occurs before V.  These examples provide strong evidence for the syntactic nature of V+N II idioms when V and N are not used contiguously.

(5-19.) (a) *   你伤他的心透了
ni       shang         ta       de      xin    tou              le.
you    hurt            he     DE     heart thorough     LE

(b)      你伤透了他的心
ni       shang         tou              le       ta       de      xin.
you    hurt            thorough     LE     he     DE     heart
You broke his heart.

(5-20.)         V+N II: instance of passive with or without 被 bei (BEI)

心(被)伤透了
xin    (bei)   shang         tou              le.
heart BEI    break          thorough     LE
The heart was completely broken.
or: (Someone) was extremely sad.
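The contiguity condition illustrated by (5-17), (5-19) and (5-20) can be stated as a small classifier. This is only a sketch over pre-segmented tokens, with a hypothetical function name; it decides nothing about well-formedness and merely records which of the two uses a given string instantiates.

```python
# Sketch of the contiguity condition for V+N II idioms:
# lexical use only when V is immediately followed by N, phrasal use otherwise.

def v_n_ii_use(tokens, v, n):
    """Return 'word', 'phrase', or None if V or N is absent from tokens."""
    if v not in tokens or n not in tokens:
        return None
    for i in range(len(tokens) - 1):
        if tokens[i] == v and tokens[i + 1] == n:
            return "word"
    return "phrase"

print(v_n_ii_use(["我", "伤", "心", "透", "了"], "伤", "心"))   # (5-17a)
print(v_n_ii_use(["我", "伤", "透", "了", "心"], "伤", "心"))   # (5-17b)
print(v_n_ii_use(["心", "被", "伤", "透", "了"], "伤", "心"))   # (5-20)
```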

Based on the above investigation, it is proposed in CPSG95 that two distinct entries be constructed for each such idiom, one as an inseparable lexical V, and the other as a transitive VP just like that of V+N I.  Each entry covers its own part of the phenomena.  In order to capture the semantic link between the two entries, a lexical rule called V_N_II Rule is formulated in CPSG95, shown in (5-21).

[Figure (5-21): the V_N_II Lexical Rule]

The input to the V_N_II Lexical Rule is an entry with [CATEGORY v_n_ii] where [v_n_ii] is a given sub-category in the lexicon for V+N II type verbs.  The output is another entry with the same information except for three features [HANZI], [CATEGORY] and [COMP1_RIGHT].  The new value for [HANZI] is a list concatenating the old [HANZI] and the [HANZI] for the expected [COMP1_RIGHT].  The new [CATEGORY] value is simply [v].  The value for [COMP1_RIGHT] becomes [null].  The outline of the two entries captured by this lexical rule is shown in (5-22) and (5-23).

[Figures (5-22) and (5-23): outlines of the two entries related by the V_N_II Lexical Rule]
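The three feature changes made by the V_N_II Lexical Rule can be sketched as a function from one entry to another. This is a simplification under stated assumptions: the `Entry` class is hypothetical, the expected [COMP1_RIGHT] sign is reduced to just its [HANZI] value, `None` stands in for [null], and the relation name `sad` is invented for the example.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

@dataclass(frozen=True)
class Entry:
    hanzi: Tuple[str, ...]             # [HANZI] as a list of characters
    category: str                      # [CATEGORY]
    comp1_right_hanzi: Optional[str]   # [HANZI] of expected [COMP1_RIGHT]; None = [null]
    reln: str                          # stands in for the shared semantics

def v_n_ii_rule(entry: Entry) -> Optional[Entry]:
    """Derive the inseparable lexical V from a [CATEGORY v_n_ii] entry."""
    if entry.category != "v_n_ii" or entry.comp1_right_hanzi is None:
        return None  # the rule does not apply
    return replace(
        entry,
        hanzi=entry.hanzi + (entry.comp1_right_hanzi,),  # concatenate HANZI
        category="v",                                    # plain verb
        comp1_right_hanzi=None,                          # no object expected
    )

shang = Entry(hanzi=("伤",), category="v_n_ii", comp1_right_hanzi="心", reln="sad")
shang_xin = v_n_ii_rule(shang)
print(shang_xin)
```

All other information, here represented only by `reln`, is carried over unchanged, which is how the rule captures the semantic link between the separated and the contiguous use.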

It needs to be pointed out that the definition of [CATEGORY v_n_ii] in CPSG95 is narrower than L. Li’s definition of V+N II type idioms.  As indicated by L. Li (1990), not all V+N II idioms share the same set of lexical features (a3), (a4) and (a5) as a word.  The definition in CPSG95 does not include the idioms which share the lexical feature (a5), i.e. taking a syntactic object.  These are idioms like 担心dan-xin (carry-heart: worry about).  For such idioms, when they are used as inseparable compound words, they can take a syntactic object.  This is not possible for all other V+N idioms, as shown below.

(5-24.) (a)     她很担心你
ta       hen    dan-xin                ni.
she     very   worry (about)        you
She is very concerned about you.

(b) *   他很伤心你
ta       hen    shang-xin            ni.
he      very   sad                       you

In addition, these idioms do not demonstrate the full distributional potential of transitive VP constructions.  The separated uses of these idioms are far more limited than other V+N idioms.  For example, they can hardly be passivized or topicalized as other V+N idioms can, as shown by the following minimal pair of passive constructions.

(5-25.)(a) *   心(被)担透了
xin    (bei)   dan             tou              le.
heart BEI    carry           thorough     LE

(b)      心(被)伤透了
xin    (bei)   shang         tou              le.
heart BEI    break          thorough     LE
The heart was completely broken.
or: (Someone) was extremely sad.

In fact, the separated use ('phrasal use') for such V+N idioms seems only limited to some type of X-insertion, typically the appearance of aspect signs between V and N.[15]  Such separated use is the only thing shared by all V+N idioms, as shown below.

(5-26.)(a)     他担过心
ta       dan             guo    xin
he      carry           GUO  heart
He (once) was worried.

(b)      他伤过心
ta       shang         guo    xin
he      break          GUO  heart
He (once) was heart-broken.

To summarize, the V+N idioms like 担心 dan-xin which can take a syntactic object do not share sufficient generality with other V+N II idioms for a lexical rule to capture.  Therefore, such idioms are excluded from the [CATEGORY v_n_ii] type.  This makes these idioms not subject to the lexical rule proposed above.  It is left for future research to answer the question whether there is enough generality among this set of idioms to justify some general approach to this problem, say, another lexical rule or some other way of generalizing over the phenomena.  For the time being, CPSG95 simply lists both the contiguous and separated uses of these idioms in the lexicon.[16]

It is worth noticing that leaving such idioms aside, this lexical rule still covers large parts of V+N II phenomena.  The idioms like 担心dan-xin only form a very small set which are in the state of transition to words per se (from the angle of language development) but which still retain some (but not complete) characteristics of a phrase.[17]

5.3. Verb-modifier Idioms: V+A/V

This section investigates the V+X idioms in the form of V+A/V.  The data for the interaction of V+A/V idioms and the modal insertion are presented first.  The subsequent text will argue for Lü's infix hypothesis for the modal insertion and accordingly propose a lexical rule to capture the idioms with or without modal insertion.

The following is a sample list of V+A/V idioms, represented by kan jian (look-see: have seen).

(5-27.) V+A/V: kan jian type

看见 kan (look) jian (see)                    have seen
看穿 kan (look) chuan (through)        see through
离开 li (leave) kai (off)                         leave
打倒 da (beat) dao (fall)                      down with
打败 da (beat) bai (fail)                       defeat
打赢 da (beat) ying (win)                    fight and win
睡着 shui (sleep) zhao (asleep)            fall asleep
进来 jin (enter) lai (come)                             enter
走开 zou (go) kai (off)                         go away
关上  guan (close) shang (up)             close

In the V+A/V idiom kan jian (have-seen), the first sign kan (look) is the head of the combination while the second jian (see) denotes the result.  So when we say, wo (I) kan-jian (see) ta (he), even without the aspectual marker le (LE) or guo (GUO), we know that it is a completed action:  'I have seen him' or 'I saw him'.[18]

Idioms like kan-jian (have-seen) function just as a lexical whole (transitive verb).  When there is an aspect marker, it is attached immediately after the idioms as shown in (5‑28).  This is strong evidence for judging V+A/V idioms as words, not as syntactic constructions.

(5-28.)         我看见了他
wo     kan jian     le       ta.
I         look-see       LE     he                   I have seen him.

The only observed separated use is that such idioms allow for two modal signs 得 de3 (can) and 不 bu (cannot) in between, shown by (5-29a) and (5-29b).  But no other signs, operations or processes can enter the internal structure of these idioms.

(5-29.) (a)     我看不见他
wo     kan bu jian         ta.
I         look cannot see     he
I cannot see him.

(b)      你看得见他吗?
ni       kan de3 jian       ta       me?
you    look can see          he      ME
Can you see him?

Note that English modal verbs ‘can’ and ‘cannot’ are used to translate these two modal signs.  In fact, Contemporary Mandarin also has corresponding modal verbs (能愿动词 neng-yuan dong-ci):  能 neng (can) and 不能 bu neng (cannot).  The major difference between Chinese modal verbs 能 neng / 不能 bu neng and the modal signs 得 de3 / 不 bu lies in their different distribution in syntax.  The use of modal signs 得 de3 (can) and 不 bu (cannot) is extremely restrictive:  they have to be inserted into V+BUYU combinations.  But Chinese modal verbs can be used before any VP structures.  It is interesting to see the cases when they are used together in one sentence, as shown in (5-30 a+b) below.  Note that the meaning difference between the two types of modal signs is subtle, as shown in the examples.

(5-30.)(a)     你看得见他吗?
ni       kan de3 jian         ta       me?
you    look can see          he      ME
Can you see him? (Is your eye-sight good enough?)

(b)      你能看见他吗?
ni       neng kan jian      ta       me?
you    can    see              he      ME
Can you see him?
(Note: This is used in more general sense. It covers (a) and more.)

(a+b)  你能看得见他吗?
ni       neng kan de3 jian         ta       me?
you    can    look can see          he      ME
Can you see him? (Is your eye-sight good enough?)

(5-31.)(a)     我看不见他
wo     kan bu jian           ta
I         look cannot see     he
I cannot see him. (My eye-sight is too poor.)

(b)      我不能看见他
wo     bu     neng kan jian      ta
I         not    can    see              he
I cannot see him. (Otherwise, I will go crazy.)

(a+b) 我不能看不见他
wo     bu     neng kan bu jian           ta.
I         not    can    look cannot see     he
I cannot stand not being able to see him.
(I have to keep him always within the reach of my sight.)

Lü (1989:127) indicates that the modal signs are in fact the only two infixes in Contemporary Chinese.  Following this infix hypothesis, there is a good account for all the data above.  In other words, the V+A/V idioms are V+BUYU compound words subject to the modal infixation.  The phenomena of 看得见 kan-de3-jian (can see) and 看不见 kan-bu-jian (cannot see) are therefore morphological by nature.  But Lü did not offer formal analysis for these idioms.

Thompson (1973) first proposed a lexical rule to derive the potential forms V+de3/bu+A/V from the V+A/V idioms.  The lexical rule approach seems to be most suitable for capturing the regularity of the V+A/V idioms and their infixation variants V+de3/bu+A/V.  The  approach taken in CPSG95 is similar to Thompson’s proposal.  More precisely, two lexical rules are formulated in CPSG95 to handle the infixation in V+A/V idioms.  This way, CPSG95 simply lists all V+A/V idioms in the lexicon as V+A/V type compound words, coded as [CATEGORY v_buyu].[19]  Such entries cover all the contiguous uses of the idioms.  It is up to the two lexical rules to produce two infixed entries to cover the separated uses of the idioms.

The change of the infixed entries from the original entry lies in the semantic contribution of the modal signs.  This is captured in the lexical rules in (5-32) and (5-33).  In case of V+de3+A/V, the Modal Infixation Lexical Rule I in (5-32) assigns the value [can] to the feature [MODAL] in the semantics.  As for V+bu+A/V, there is a setting  [POLARITY minus] used to represent the negation in the semantics, shown in (5-33).[20]

[Figure th003: Modal Infixation Lexical Rules (5-32) and (5-33); image not reproduced]

The following lexical entry shows the idiomatic compound 看见 kan-jian as coded in the CPSG95 lexicon (leaving some irrelevant details aside).   This entry satisfies the necessary condition for the proposed infixation lexical rules.

[Figure th004: lexical entry for 看见 kan-jian; image not reproduced]

The modal infixation lexical rules will take this [v_buyu] type compound as input and produce two V+MODAL+BUYU entries.  As a result, new entries 看得见 kan-de3-jian (can see) and 看不见 kan-bu-jian (cannot see) as shown below are added to the lexicon.[21]

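[Figures th005 and th006: derived lexical entries for 看得见 kan-de3-jian and 看不见 kan-bu-jian; images not reproduced]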

The above proposal offers a simple, effective way of capturing the linguistic data of the interaction of V+A/V idioms and the modal insertion, since it eliminates the need for any change of the general grammar in order to accommodate this type of separable verbs interacting with 得 de3 / 不 bu, the only two infixes in Chinese.
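The procedural reading of the two modal infixation rules can be illustrated with a minimal Python sketch.  It is not the ALE implementation of CPSG95:  the `Entry` class, the `RELN` attribute and the output category label `v_modal_buyu` are hypothetical names introduced here for illustration only.  The sketch takes a [v_buyu] entry as input and produces the two infixed entries, adding [MODAL can] for 得 de3 and [POLARITY minus] for 不 bu, as described for (5-32) and (5-33).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entry:
    """Toy lexical entry; field names are illustrative, not CPSG95 features."""
    orth: tuple                      # one hanzi string per morph
    category: str                    # e.g. "v_buyu"
    semantics: dict = field(default_factory=dict)


def modal_infixation(entry):
    """Return the two infixed entries derived from a [v_buyu] compound.

    Entries of any other category are outside the rules' input condition,
    so an empty list is returned for them.
    """
    if entry.category != "v_buyu" or len(entry.orth) != 2:
        return []
    verb, buyu = entry.orth
    # Rule I, cf. (5-32): V + de3 + A/V, with [MODAL can] in the semantics.
    can = Entry((verb, "得", buyu), "v_modal_buyu",   # category label assumed
                {**entry.semantics, "MODAL": "can"})
    # Rule II, cf. (5-33): V + bu + A/V, with [POLARITY minus].
    cannot = Entry((verb, "不", buyu), "v_modal_buyu",
                   {**entry.semantics, "POLARITY": "minus"})
    return [can, cannot]


kan_jian = Entry(("看", "见"), "v_buyu", {"RELN": "see"})  # RELN is hypothetical
for derived in modal_infixation(kan_jian):
    print("".join(derived.orth), derived.semantics)
```

Because the general grammar is never consulted, the sketch reflects the point made above:  the separated uses are covered entirely by adding entries to the lexicon.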

5.4. Summary

This chapter has conducted an inquiry into the linguistic phenomena of Chinese separable verbs, a long-standing difficult problem at the interface of Chinese compounding and syntax.   For each type of separable verb, arguments for the wordhood judgment have been presented.  Based on this judgment, CPSG95 provides analyses which capture both structural and semantic aspects of the constructions at issue.  The proposed solutions are formal and implementable.  All the solutions provide a way of capturing the link between the separated use and contiguous use of the V+X idioms.  The proposals presented in this chapter cover the vast majority of separable verbs.  Some unsolved rare cases or potential problems are also identified for further research.

 

----------------------------------------------------------------------

[1] They are also called phrasal verbs (duanyu dongci) or compound verbs (fuhe dongci) among Chinese grammarians.  For linguists who believe that they are compounds, the V+N separable verbs are often called verb object compounds and the V+A/V separable verbs resultative compounds.  The want of a uniform term for such phenomena reflects the borderline nature of these cases.  According to Zhao and Zhang (1996), out of the 3590 entries in the frequently used verb vocabulary, there are 355 separable V+N idioms.

[2] As the term 'separable verbs' gives people an impression that these verbs are words (which is not necessarily true), they are better called V+X (or V+N or V+A/V) idioms.

[3] There is no disagreement among Chinese grammarians for the verb-object combinations like xi wan:  they are analyzed as transitive verb phrases in all analyses, no matter whether the head V and the N is contiguous (e.g. xi wan 'wash dishes') or not (e.g. xi san ge wan 'wash three dishes').

[4] Such signs as zao (bath), which are marked with # in (5-1), are often labeled as 'bound morphemes' among Chinese grammarians, appearing only in idiomatic combinations like xi zao (take a bath), ca zao (clean one's body by scrubbing).  As will be shown shortly, bound morpheme is an inappropriate classification for these signs.

[5] It is widely acknowledged that the sequence num+classifier+noun is one typical form of Chinese NP in syntax.  The argument that zao is not a bound morpheme does not rely on any particular analysis of such Chinese NPs.  The fact that such a combination is generally regarded as syntactic ensures the validity of this argument.

[6] The notion ‘free’ or ‘freely’ is linked to the generally accepted view of regarding word as a minimal ‘free’ form, which can be traced back to classical linguistics works such as Bloomfield (1933).

[7] It is generally agreed that idioms like kick the bucket are not compounds but phrases (Zwicky 1989).

[8] That is the rationale behind the proposal of inseparability as important criterion for wordhood judgment in Lü (1989).

[9] In Chinese, reduplication is a general mechanism used both in morphology and syntax.  This thesis only addresses certain reduplication issues when they are linked to the morpho-syntactic problems under examination, but cannot elaborate on the Chinese reduplication phenomena in general.  The topic of Chinese reduplication deserves the study of a full-length dissertation.     

[10] In the ALE implementation of CPSG95, there is a VV Diminutive Reduplication Lexical Rule in place for phenomena like xi zao (take a bath) --> xi xi zao (take a short bath);  ting yin-yue (listen to music) --> ting ting yin-yue (listen to music for a while);  xiu-xi (rest) --> xiu-xi xiu-xi (have a short rest).

[11] He observes that there are two distinct principles on wordhood.  The vocabulary principle requires that a word represent an integrated concept, not the simple composition of its parts.  Associated with the above is a tendency to regard as a word a relatively short string.  The grammatical principle, however, emphasizes the inseparability of the internal parts of a combination.  Based on the grammatical principle, xi zao is not a word, but a phrase.  This view is very insightful.

[12] The pattern variations are captured in CPSG95 by lexical rules following the HPSG tradition.  It is out of the scope of this thesis to present these rules in the CPSG95 syntax.  See W. Li (1996) for details.

[13] In the rare cases when the noun zao is realized in a full-fledged phrase like yi ge tong-kuai de zao (a comfortable bath), we may need some complicated special treatment in the building of the semantics.  Semantically, xi (wash) yi (one) ge (CLA) tong‑kuai (comfortable) de (DE) zao (bath): ‘take a comfortable bath’ actually means tong‑kuai (comfortable) de2 (DE2) xi (wash) yi (one) ci (time) zao (bath): ‘comfortably take a bath once’.  The syntactic modifier of the N zao is semantically a modifier attached to the whole idiom.  The classifier phrase of the N becomes the semantic 'action-times' modifier of the idiom.  The elaboration of semantics in such cases is left for future research.

[14] The two groups classified by L. Li (1990) are not restricted to the V+N combinations.  In order not to complicate the case, only the comparison of the two groups of V+N idioms is discussed here.  Note also that in the tables, he used the term ‘bound morpheme’ (inappropriately) to refer to the co-occurrence constraint of the idioms.

[15] Another type of X-insertion is that N can occasionally be expanded by adding a de‑phrase modifier.  However, this use is really rare.

[16] Since they are only a small, easily listable set of verbs, and they only demonstrate limited separated uses (instead of full pattern variations of a transitive VP construction), to list these words and all their separated uses in the lexicon seems to be a better way than, say, trying to come up with another lexical rule just for this small set.  Listing such idiosyncratic use of language in the lexicon is common practice in NLP.

[17] In fact, this set has been becoming smaller because some idioms, say zhu-yi 'focus-attention: pay attention to', which used to be in this set, have already lost all separated phrasal uses and have become words per se.  Other idioms including dan-xin (worry about) are in the process of transition (called ionization by Chao 1968) with their increasing frequency of being used as words.   There is a fairly obvious tendency that they combine more and more closely as words, and become transparent to syntax.  It is expected that some, or all, of them will ultimately become words proper in future, just as zhu-yi did.

[18] In general, one cannot use kan-jian to translate English future tense 'will see', instead one should use the single-morpheme word kan:  I will see him --> wo (I) jiang (will) kan (see) ta (he).

[19] Of course, [v_buyu] is a sub-type of verb [v].

[20] The use of this feature for representing negation was suggested in Footnote 18 of Pollard and Sag (1994:25).

[21] This is the procedural perspective of viewing the lexical rules.  As pointed out by Pollard and Sag (1987:209), “Lexical rules can be viewed from either a declarative or a procedural perspective: on the former view, they capture generalizations about static relationships between members of two or more word classes; on the latter view, they describe processes which produce the output from the input form.”

 

[Related]

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

PhD Thesis: Chapter VII Concluding Remarks

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

PhD Thesis: Chapter IV Defining the Chinese Word

 

4.0. Introduction

This chapter examines the linguistic definition of the Chinese word and establishes its formal representation in CPSG95.  This lays a foundation for the treatment of Chinese morpho-syntactic interface problems in later chapters.

To address issues on interfacing morphology and syntax in Chinese NLP, the fundamental question is:  what is a Chinese word?  A proper answer to this question defines the boundaries between morphology, the study of how morphemes combine into words, and syntax, the study of how words combine into phrases.  However, there is no easy answer to this question.

In fact, how to define Chinese words has been a central topic among Chinese grammarians for decades (Hu and Wen 1954; L. Wang 1955;  Z. Lu 1957; Lin 1983; Lü 1989; Shi 1992; Dai 1993; Zhao and Zhang 1996).  In the late 1950s, there was a heated discussion in China on the definition of the Chinese word.  This discussion was induced by the campaign for the Chinese writing system reform (文字改革运动).  At that time, the government policy was to ultimately replace the Chinese characters (hanzi) with a Romanized writing system.  The system of pinyin, based on the Latin alphabet, was designed to represent the pronunciation of the characters in Contemporary Mandarin.  The simplest approach would be to use pinyin as a writing system and simply transcribe each Chinese character as a pinyin syllable.  But this was soon found impractical because of the many-to-one correspondence from hanzi to syllable:  text in pinyin with no explicit word boundary delimiters is hardly comprehensible.  Linguists agree that the key issue for the feasibility of a pinyin-based writing system is to establish a standard or definition for Chinese words (Z. Lu 1957).  Once words can be identified by a common standard, the pinyin system can in principle be adopted for recording the Chinese language by using space and punctuation marks to separate words.  This is because the number of homophones at the word level is dramatically reduced when compared to the number of homophones at the hanzi (morpheme or monosyllabic) level.

But the definition of a Chinese word is a very complicated issue because of the considerable number of borderline cases.  It has never been possible to reach a precise definition which can be applied to all circumstances and which can be accepted by linguists from different schools.

There have been many papers addressing the Chinese wordhood issue (e.g. Z. Lu 1957; Lin 1983; Lü 1989; Dai 1993).  Although there are still many problems in defining Chinese words for borderline cases and more debate will continue for many years to come, the understanding of Chinese wordhood has been deepened in the general acknowledgement of the following key aspects:  (i) the distinct status of Chinese morphology;  (ii) the distinction of different notions of word;  and (iii) the lack of absolute definition across systems or theories.

Almost all Chinese grammarians agree that unlike Classical Chinese, Contemporary Chinese is not based on single-morpheme words.   In other words, the word and the morpheme are no longer coextensive in Contemporary Chinese.[1]  In fact, that is the reason why we need to define Chinese morphology.  If the word and the morpheme stand for the same linguistic object in a language, like Classical Chinese, the definition of  morpheme will entail the definition of word and there is no role of morphology.

As it stands, there is little debate on the definition of morpheme in Chinese.  It is generally acknowledged that each syllable (or its corresponding written form hanzi) corresponds to (at least) one morpheme.  A characteristic ‘isolating language’, which Classical Chinese comes close to being, has little or no morphology.[2]  However, Contemporary Chinese contains a significant number of bound morphemes in word formation (Dai 1993).  In particular, it is observed that many affixes are highly productive (Lü et al 1980).

It is widely acknowledged that the grammar of Contemporary Chinese is not complete without the component of morphology (Z. Lu 1957; Chao 1968; Li and Thompson 1981; Dai 1993; etc.).   Based on this widely accepted assumption, one major task for this thesis is to argue for the proper place to cut the line between morphology and syntax, and to explore effective ways of interleaving the two for analysis.

A significant development concerning the Chinese wordhood study is the  distinction between two different notions of word:  grammar word versus vocabulary word.  It is now clear that in terms of grammar analysis, a vocabulary word is not an appropriate notion (Lü 1989; more discussion to come in 4.1).

Decades of debate and discussion on the definition of a Chinese word have also shown that an operational definition for a grammar word precise enough to apply to all cases can hardly be established across systems or theories.  But a computational grammar of Chinese cannot be developed without precise definitions.  This leads to an argument in favor of the system internal wordhood definition and the interface coordination within a grammar.

The remaining sections of this chapter are organized as follows.  Section 4.1 examines two notions of word.  To make sure that the right notion is used under an appropriate guideline, Section 4.2 develops some operational methods for judging a Chinese grammar word.  Section 4.3 demonstrates the formal representation of a word in CPSG95.  This formalization is based on the design of expectation feature structures and the structural feature structure presented in Chapter III.

4.1. Two Notions of Word

This section examines the two notions of word which have caused confusion.  The first notion, namely vocabulary word, is easy to define.  For the second notion, namely grammar word, unfortunately no operational definition has been available.  It will be argued that a feasible alternative is to define the grammar word, and the division of labor between Chinese morphology and syntax, system-internally.

A grammar word stands for the grammatical unit which fits in the hierarchy of morpheme, word and phrase in linguistic analysis.  This gives the general concept of this notion but it is by no means an operational definition.  Vocabulary word, on the other hand, refers to the listed entry in the lexicon.  This definition is simple and unambiguous once a lexicon is given.  The lexical lookup will generate vocabulary words as potential building blocks for analysis.

On one hand, vocabulary words come from the lexicon;  they are basic building blocks for linguistic analysis.  On the other hand, as the ‘resulting’ unit for morphological analysis as well as the ‘starting’ or ‘atomic’ unit for syntactic analysis, the grammar word is the notion for linguistic generalization.  But it is observed that a vocabulary word is not necessarily a grammar word and vice versa.  It is this possible mismatch between vocabulary word and grammar word that has caused a problem in both Chinese grammar research and Chinese NLP system development.

Lü (1989) indicates that failing to distinguish these two notions of word has caused considerable confusion over the definition of the Chinese word in the literature.  He further points out that only the notion of grammar word should be used in grammar research.

Di Sciullo and Williams (1987) have similar ideas on these two notions of word.  They indicate that a sign listable in the lexicon corresponds to no certain grammatical unit.[3]   It can be a morpheme, a (grammar) word, or a phrase including sentence.  Some examples of different kinds of Chinese vocabulary words are given below to demonstrate this insight.

(4-1.) sample Chinese vocabulary words

(a) 性           bound morpheme, noun suffix, ‘-ness’
(b) 洗           free morpheme or word, V: ‘wash’
(c) 澡           word (only used in idioms), N: ‘bath’
(d) 澡盆        compound word, N: ‘bath-tub’
(e) 洗澡        idiom phrase, VP: ‘take a bath’
(f) 他们         pronoun as noun phrase, NP: ‘they’
(g) 城门失火,殃及池鱼

idiomatic sentence, S:
‘When the gate of a city is on fire, the fish in the
canal around the gate is also endangered.’

The above signs are all Chinese vocabulary words.  But grammatically, they do not necessarily function as a grammar word.  For example, (4-1a) functions as a suffix, smaller than a word.  (4-1e) behaves like a transitive VP (see 5.1 for more evidence), and (4-1g) acts as a sentence, both larger than a word.  The consequence of mixing up these different units in a grammar is the loss of power for a grammar to capture the linguistic generality for each level of grammatical unit.
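The mismatch between the two notions can be made concrete with a small Python sketch.  The dictionary below simply encodes the grammatical rank stated for each item in (4-1);  the rank labels and the function names are assumptions of this illustration, not part of CPSG95.  Every key is a vocabulary word by virtue of being listed, but only some of the keys are grammar words.

```python
# Grammatical rank of each sample vocabulary word, as annotated in (4-1).
LEXICON = {
    "性": "morpheme",                  # (4-1a) bound morpheme, noun suffix
    "洗": "word",                      # (4-1b) free morpheme or word
    "澡": "word",                      # (4-1c) word used only in idioms
    "澡盆": "word",                    # (4-1d) compound word
    "洗澡": "phrase",                  # (4-1e) idiom phrase, VP
    "他们": "phrase",                  # (4-1f) pronoun as NP
    "城门失火,殃及池鱼": "sentence",    # (4-1g) idiomatic sentence
}


def is_vocabulary_word(sign):
    """A vocabulary word is any sign listed in the lexicon."""
    return sign in LEXICON


def is_grammar_word(sign):
    """A grammar word is a listed sign whose rank is 'word'."""
    return LEXICON.get(sign) == "word"


grammar_words = [s for s in LEXICON if is_grammar_word(s)]
print(grammar_words)
```

The sketch covers only listed signs.  Unlisted words formed by productive rules are grammar words without being vocabulary words, which is the other direction of the mismatch.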

The definition of grammar word has been a contentious issue in general linguistics (Di Sciullo and Williams 1987).  Its precise definition is particularly difficult in Chinese linguistics as there is a considerable number of phenomena on the margin between Chinese morphology and syntax (Zhu 1985; L. Li 1990; Sun and Huang 1996).  The morpheme-word-phrase transition is a continuous band in the linguistic reality.  Different grammars may well cut the division differently.  As long as there is no contradiction in coordinating these objects within the grammar, there does not seem to be any absolute judgment on which definition is right and which is wrong.

It is generally agreed that a grammar word is the smallest unit in syntax (Lü 1989), as also emphasized by Di Sciullo and Williams (1987) on the 'syntactic atomicity' of word.[4]  But this statement serves only as a guideline in theory;  it is not an operational definition, for the following reason.  It is logically circular to define word (the smallest unit in syntax) and syntax (the study of how words combine into phrases) in terms of each other.

To avoid this 'circular definition' problem, a feasible alternative is to define the grammar word, and the division of labor between Chinese morphology and syntax, system-internally, as in the case of CPSG95.  Of course, the system internal definition still needs to be justified based on the proposed morphological or syntactic analysis of borderline phenomena in terms of capturing the linguistic generality.  More specifically, three things need to be done:  (i) argue for the analysis case by case, e.g. why a certain construction should be treated as a morphological or syntactic phenomenon, what linguistic generality is captured by such a treatment, etc.;  (ii) establish some operational methods for wordhood judgment to cover similar cases;  (iii) use formalized data structures to represent the linguistic units after the wordhood judgment is made.  Section 4.2 will handle task (ii) and Section 4.3 is devoted to the formal definition of word required by task (iii).   The task in (i) will be pursued in the remaining chapters.

Another important notion related to grammar word is unlisted word.  Conceptually, an unlisted word is a novel construction formed via morphological rules, e.g. a derived word like 可读性 ke-du-xing (-able-read-ness: readability), foolish-ness, a compound person name (given name + family name) such as John Smith, 毛泽东 mao-ze-dong (Mao Zedong).  Unlisted words are often rule-based.  This is where productive word formation sets in.

However, unlisted word is not a crystal clear notion, just like the underlying concept grammar word.  Many grammarians have observed that phrases and unlisted words in Chinese are formed under similar rules (e.g. Zhu 1985; J. Lu 1988).  As both syntactic constructions and unlisted words are rule based, it can be difficult to judge a significant amount of borderline constructions as morphological or syntactic.

There are fuzzy cases where a construction is regarded as a grammar word by one and judged as a syntactic construction by another.  For example, while san (three) ge (CLA) is regarded as a syntactic construction, namely numeral-classifier phrase, in many grammars including CPSG95, such constructions are treated as compound words by others (e.g. Chen and Liu 1992).  ‘Quasi-affixation’ presents another outstanding ‘gray area’ (see 6.2).

The difficulty in handling the borderline phenomena leads back to the argument that the labor division between Chinese morphology and syntax should be pursued system internally and argued case by case in terms of capturing the linguistic generality.  To implement the required system internal definition, it is desirable to investigate practical wordhood judgment methods in addition to case-by-case arguments.  Some judgment methods will be developed in 4.2.  Case-by-case arguments and analysis for specific phenomena will be presented in later chapters.  After the wordhood judgment is made, there is a need for the formal representation.  Section 4.3 defines the formal representation of word with illustrations.

4.2. Judgment Methods

This section proposes some operational wordhood judgment methods based on the notion of ‘syntactic atomicity’ (Di Sciullo and Williams 1987).  These methods should be applied in combination with arguments of the associated grammatical analysis.  In fact, whether a sign is judged as a morpheme, a grammar word or a phrase ultimately depends on the related grammatical analysis.  However, the operationality of these methods will help facilitate the later analysis for some individual problems and avoid unnecessary repetition of similar arguments.

Most methods proposed for Chinese wordhood judgment in the literature are not fully operational.  For example, Chao (1968) agrees with Z. Lu (1957) that a word can fill the functional frame of a typical syntactic structure.  Dai (1993) points out that, while this method may effectively separate bound morphemes from free words, it cannot differentiate between words and phrases, as phrases may also be positioned in a syntactic frame.  In fact, whether this method can indeed separate bound morphemes from free words is still a problem.  This method cannot be made operational unless the definition of ‘frame of a typical syntactic structure’ is given.  The judgment methods proposed in this section try to avoid this ‘lack of operationality’ problem.

Dai (1993) made a serious effort in proposing a series of methods for cutting the line between morphemes and syntactic units in Chinese.  These methods have significantly advanced the study of this topic.  However, Dai admits that there are limitations associated with these proposals.  While each proposed method provides a sufficient (but not necessary) condition for judging whether a unit is a morpheme, none of the methods can further determine whether this unit is a word or a phrase.  For example, the method of syntactic independence tests whether a unit in a question can be used as a short answer to the question.  If yes, the syntactic independence is confirmed and this unit is not a morpheme inside a word.  Obviously, such a method tells nothing about the syntactic rank of the tested unit because a word, a phrase or a clause can all serve as an answer to a question.  To determine the syntactic rank, other methods and/or analyses need to be brought in.

The first judgment method proposed below involves passivization and topicalization tests.  In essence, this is to see whether a string involves syntactic processes.  Because a word is an atomic unit, its internal structure is invisible to syntax.  It follows that no syntactic processes are allowed to exert effects on the internal structure of a word.[5]  As passivization and topicalization are generally acknowledged to be typical syntactic processes, if a potential combination A+B is subject to passivization B+bei+A and topicalization B+…+NP+A, it can be concluded that A+B is not a word:   the relation between A and B must be syntactic.

The second method is to define an unambiguous pattern for the wordhood judgment, namely, judgment patterns.  Judgment patterns are by no means a new concept.  In particular, keyword based judgment patterns have been frequently used in the literature of Chinese linguistics as a handy way for deterministic word category detection (e.g. L. Wang 1955;  Zhu 1985; Lü 1989).

The following keyword (i.e. aspect markers) based patterns are proposed for  judging a verb sign.

(4-2.)
(a) V(X)+着/过 --> word(X)
(b) V(X)+着/过/了+NP --> word(X)

The pattern (4-2a) states that if X is a verb sign, whether transitive or intransitive, appearing immediately before zhe/guo, then X is a word.  This proposal is backed by the following argument.  It is an important and widely acknowledged grammatical generalization in Chinese syntax that the aspect markers appear immediately after lexical verbs (Lü et al 1980).

Note that the aspect marker le (LE) is excluded from the pattern in (4-2a) because the same keyword le corresponds to two distinctive morphemes in Chinese:  the aspect le (LE) attaches to a lexical V while the sentence-final le (LEs) attaches to a VP (Lü et al 1980).  Therefore, judgment cannot be reliably made when a sentence ends in X+le, for example, when X is an intransitive verb or a transitive verb with the optional object omitted.  However, le in pattern (4-2b) has no problem since le is not in the ambiguous sentence final position.  This pattern says that if any of the three aspect markers appears between a sign X of verb and NP, X must be a word:  in fact, it is a lexical transitive verb.

There are two ways to use the judgment patterns.  If a sub-string of the input sentence matches a judgment pattern, one reaches the conclusion promptly.  If the input string does not match a pattern directly, one can still make indirect use of the patterns for judgment.  The idiomatic combination xi (wash) zao (bath) is a representative example.   Assume that the vocabulary word xi zao is a grammar word.  It follows that it should be able to fill in the lexical verb position in the judgment pattern (4-2a).  We then make a sentence which contains a substring matching the pattern to see whether it is grammatical.  The result is ungrammatical:  * 他洗澡着 ta (he) xi-zao (V) zhe (ZHE);  * 他洗澡过 ta (he) xi-zao (V) guo (GUO).  Therefore, our assumption must be wrong:  洗澡 xi zao is not a grammar word.  We then change the assumption and try to insert aspect markers inside the combination (this is in fact an expansion test, to be discussed shortly).  The new assumption is that the verb xi alone is a grammar word.  What we get are perfectly grammatical sentences and they match the pattern (4-2b):  他洗着澡 ta (he) xi (V) zhe (ZHE) zao (bath): ‘He is taking a bath’;  他洗过澡 ta (he) xi (V) guo (GUO) zao (bath): ‘He has taken the bath’.  Therefore the new assumption is confirmed.  This way, all V+X combinations can be judged based on the judgment patterns (4-2a) or (4-2b).
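The direct use of the judgment patterns in (4-2) can be sketched as a simple matcher over a tagged token sequence.  This is an illustration only:  the tag names 'V' and 'NP', the function name and the assumption that the input is already segmented and tagged are all hypothetical.  The sketch reports which verb signs are judged to be words;  it cannot decide grammaticality, so the indirect use described above still relies on a speaker's judgment of the constructed sentences.

```python
ASPECT_A = {"着", "过"}          # (4-2a): le is excluded (ambiguous sentence-finally)
ASPECT_B = {"着", "过", "了"}    # (4-2b): all three markers, when an NP follows


def verbs_judged_as_words(tokens):
    """Return the verb signs that match pattern (4-2a) or (4-2b).

    tokens is a list of (form, tag) pairs; only the tags 'V' and 'NP'
    are inspected, and both tag names are assumptions of this sketch.
    """
    judged = []
    for i, (form, tag) in enumerate(tokens[:-1]):
        if tag != "V":
            continue
        marker = tokens[i + 1][0]
        np_follows = i + 2 < len(tokens) and tokens[i + 2][1] == "NP"
        if marker in ASPECT_A or (marker in ASPECT_B and np_follows):
            judged.append(form)
    return judged


# 他洗着澡: xi is followed by zhe, so xi alone is judged a word.
print(verbs_judged_as_words([("他", "NP"), ("洗", "V"), ("着", "ASP"), ("澡", "NP")]))
# 我看见了他: kan-jian + le + NP matches (4-2b).
print(verbs_judged_as_words([("我", "NP"), ("看见", "V"), ("了", "ASP"), ("他", "NP")]))
```

A sentence ending in V+了 yields no judgment, which mirrors the exclusion of le from pattern (4-2a).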

The third method proposed below involves a more general expansion test.  As an atomic unit in syntax, the internal parts of a word are in principle not separable.[6]  Lü (1989) emphasized inseparability as a criterion for judging grammar words.  But he did not give instructions how this criterion should be applied.  Nevertheless, many linguists (e.g. Bloomfield 1933; Z. Lu 1957;  Lyons 1968; Dai 1993) have discussed expansion tests one way or another in assisting the wordhood judgment.

The method of expansion to be presented below for wordhood judgment is called X-insertion.  X-insertion is based on Di Sciullo and Williams’ thesis of the syntactic atomicity of word.  The rationale is that the internal parts of a word cannot be separated by syntactic constituents.

As a method, X-insertion is defined as follows.  Suppose that one needs to judge whether the combination A+B is a word.  If a sign X can be found that satisfies the following conditions, then A+B is not a word but a syntactic combination:  (i) A+X+B is a grammatical string,  (ii) X is not a bound morpheme, and (iii) the sub-structure [A+X] is headed by A or the sub-string [X+B] is headed by B.

The first constraint is self-evident:  a syntactic combination is necessarily a grammatical string.  The second constraint aims at eliminating the danger of wrongly applying an infix here.  In fact, if X is a morphological infix, the conclusion would be just the opposite:  A+B is a word.  The last constraint states that X must be a dependent of the head A (or B).  Otherwise, the insertion results in a different structure:  when A (or B) is a dependent of the head X, there is no direct structural relation between A and B.  In that case, the question of whether A+B is a phrase or a word does not arise in the first place.
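The three conditions of X-insertion can be read as a small decision procedure.  The Python sketch below is only an illustration under stated assumptions:  the grammaticality and headedness judgments are supplied by hand (the procedure does not compute them), the names Insertion, licenses_split and is_syntactic_combination are hypothetical, and the sample judgments are illustrative rather than data taken from this chapter.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Insertion:
    """A candidate X placed between A and B, with hand-supplied judgments."""
    x: str
    grammatical: bool       # (i) A+X+B is a grammatical string
    bound_morpheme: bool    # (ii) fails if X is a bound morpheme, e.g. an infix
    headed_by_a_or_b: bool  # (iii) [A+X] is headed by A, or [X+B] is headed by B


def licenses_split(c: Insertion) -> bool:
    """True if this X satisfies all three conditions of X-insertion."""
    return c.grammatical and not c.bound_morpheme and c.headed_by_a_or_b


def is_syntactic_combination(candidates: Iterable[Insertion]) -> bool:
    """A+B is a syntactic combination if at least one X passes the test."""
    return any(licenses_split(c) for c in candidates)


# Illustrative judgments (assumptions, not data from the thesis):
# xi (wash) + yi-ge (one-CLA) + zao (bath), with [yi-ge zao] headed by zao.
xi_zao = [Insertion("yi-ge", grammatical=True, bound_morpheme=False,
                    headed_by_a_or_b=True)]
# An infix between A and B does not count as evidence for a split.
with_infix = [Insertion("de3", grammatical=True, bound_morpheme=True,
                        headed_by_a_or_b=False)]

print(is_syntactic_combination(xi_zao))      # True
print(is_syntactic_combination(with_infix))  # False
```

A False result here only means that none of the supplied candidates passed the test;  it does not by itself establish that A+B is a word.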

After the wordhood judgment is made on strings of signs based on the above judgment methods and/or the arguments for the analysis involved, the next step is to have them properly represented (coded) in the grammar formalism used.  This is the topic to be presented in 4.3 below.

4.3. Formal Representation of Word

The expectation feature structure and the structural feature structure in the mono-stratal design of CPSG95 presented in Chapter III provide the means for the formal definition of the basic unit word in CPSG95.  Once the wordhood judgment for a unit is made based on arguments for a structural analysis and/or using the methods presented in Section 4.2, a formal representation is required for coding it in CPSG95.

This type of formalization is required to ensure implementability in enforcing a required configurational constraint.  For example, the suffix 性 -xing expects an adjective word to form an abstract noun;  the constraints [CATEGORY a] and @word can therefore be placed in the morphological expectation feature [SUFFIXING].  These constraints will permit, for example, the legitimately derived word 严肃性 [[yan-su]-xing] (serious-ness), but will block the combination * 非常严肃性 [[fei-chang yan-su]-xing] (very-serious-ness).  This is because 非常严肃 [fei-chang yan-su] violates the formal constraint given in the word definition:  it is not an atomic unit in syntax.

In CPSG95, word is defined as a syntactically atomic unit without obligatory morphological expectations, formally represented in the following macro.

word macro
a_sign
PREFIXING saturated | optional
SUFFIXING saturated | optional
STRUCT no_syn_dtr

Note that the above formal definition uses the sorted hierarchy [struct] for the structural feature structure and the sorted hierarchy [expected] for the expectation feature structure.  The definitions of these feature structures have been given in the preceding Chapter III.

Based on the sorted hierarchy struct: {syn_dtr, no_syn_dtr}, the constraint [no_syn_dtr] ensures that the word sign does not contain any syntactic daughter.[7]  This prevents syntactic constructions from being treated as words.  On the other hand, since [saturated], [obligatory] and [optional] are three subtypes of [expected], the constraint [saturated|optional] prevents a bound morpheme, say a prefix or suffix which has an obligatory expectation in [PREFIXING] or [SUFFIXING], from being treated as a word.
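The effect of the word macro can be sketched as a simple predicate.  The Python fragment below is a simplified illustration, not the ALE implementation of CPSG95:  the sort hierarchies are flattened to plain strings (finer subtypes are ignored), and the class Sign and the function is_word are hypothetical names.

```python
from dataclasses import dataclass

# Values named in the text; finer subtypes of these sorts are ignored here.
WORD_EXPECTATION = {"saturated", "optional"}


@dataclass(frozen=True)
class Sign:
    form: str
    prefixing: str = "saturated"   # "saturated", "optional" or "obligatory"
    suffixing: str = "saturated"
    struct: str = "no_syn_dtr"     # "no_syn_dtr" or "syn_dtr"


def is_word(sign: Sign) -> bool:
    """Mirror of the word macro: no obligatory affixal expectation
    and no syntactic daughter."""
    return (
        sign.prefixing in WORD_EXPECTATION
        and sign.suffixing in WORD_EXPECTATION
        and sign.struct == "no_syn_dtr"
    )


du = Sign("du")                                       # free morpheme 'read'
ke = Sign("ke-", prefixing="obligatory")              # bound prefix
phrase = Sign("fei-chang yan-su", struct="syn_dtr")   # syntactic construction

print(is_word(du), is_word(ke), is_word(phrase))  # True False False
```

The two halves of the predicate correspond to the two things a word must be distinguished from:  the [STRUCT] check excludes syntactic constructions, and the expectation checks exclude bound morphemes.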

This macro definition covers the representation of mono-morpheme words, e.g. 鹅 e ‘goose’, 读 du ‘read’, etc., or multi-morpheme words, e.g. 小看 xiao-kan ‘look down upon’, 天鹅 tian-e ‘swan’, etc., as well as unlisted words such as derived words whose internal morphological structures have already been formed.  Some typical examples of word are shown below.

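[Figures: sample feature structures of words, referred to in the text as (4-5) and (4-6); images not reproduced]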

For a derived word, note that the specification of [PREFIXING satisfied] and [STRUCT prefix], or [SUFFIXING satisfied] and [STRUCT suffix], assigned by the corresponding PS rule is compatible with the macro word definition.

The above word definition is an extension of the corresponding representation features from HPSG (Pollard and Sag 1987).  HPSG uses a binary structural feature [LEX] to distinguish lexical signs, [LEX +], and non-lexical signs, [LEX -].  In addition, [sign] is divided into [lexical_sign] and [phrasal_sign].[8]  Except for the one-to-one correspondence between [phrasal_sign] and [syn_dtr] in terms of rank (which stands for non-atomic syntactic constructs including phrases), neither of these HPSG binary divisions accounts for the distinction between a bound morpheme and a free morpheme.  Such a distinction is not necessary in HPSG because bound morphemes are assumed to be processed in the preprocessing stage (e.g. lexical rules for English inflection, Pollard and Sag 1987) and do not show themselves as independent input to the parser.  As CPSG95 involves both derivational morphology and syntax in an integrated general grammar, the HPSG binary divisions are no longer sufficient for formalizing the word definition.  ‘Word’ in CPSG95 needs to be distinguished by proper constraints not only from syntactic constructs but also from affixes (bound morphemes).

In CPSG95, as productive derivation is designed to be an integrated component of the grammar, the word definition is both specified in the lexicon for some free morpheme words and assigned by the rules in morphological analysis.  This practice in essence follows one  suggestion in the original HPSG book:  "we might divide rules of grammar into two classes: rules of word formation, including compounding rules, which introduce the specification [LEX +] on the mother, and other rules, which introduce [LEX -] on the mother." (Pollard and Sag 1987:73).

It is worth noticing that words thus defined can fill either a morphological position or a syntactic position.  This reflects the interface nature of word:  word is an eligible unit in both morphology and syntax.  This is in contrast to bound morphemes which can only be internal parts of morphology.

In morphology, derivation combines a word and an affix into a derived word.  These derivatives are eligible to feed morphology again.  This is shown above by the examples in (4-5) and (4-6).  The adjective word 可读 ke-du (read-able) is derived from the prefix morpheme 可 ke- (-able) and the word 读 du (read).  Like other adjective words, this derived word can further combine with the suffix 性 -xing (-ness) in morphology.  It can also directly enter syntax, as all words do.

To syntax, all words are atomic units.  If a lexical position is specified in a syntactic pattern, via the macro constraint @word in CPSG95, it makes no difference whether a filler of this position is a listed grammar word or an unlisted word such as a derivative.  Such a distinction is transparent to the syntactic structure.

4.4. Summary

Efforts have been made to reach a better understanding of Chinese wordhood in theory, methodology and formalization.  The main spirit of the HPSG theory and Di Sciullo and Williams' ‘syntactic atomicity’ theory has been applied to the study of Chinese wordhood and its formal representation.  Some effective wordhood judgment methods have also been proposed, based on theoretical guidelines.

The above work in the area of Chinese wordhood study provides a sound foundation for the analysis of the specific Chinese morpho-syntactic interface problems in Chapter V and Chapter VI.

 

 

-------------------------------------------------------

[1] For Classical Chinese, word, morpheme, syllable and hanzi are presumably all co-extensive.  This is the so-called Monosyllabic Myth of Chinese (DeFrancis 1984: ch.8).  The development of large numbers of homophones, mainly due to the loss of coda stops, has led to the development of large quantities of bi-syllabic and poly-syllabic word-like expressions (Chen and Wang 1975).

[2] Classical Chinese arguably allows for a certain degree of compounding.  In the linguistic literature, some linguists (e.g. Sapir 1921; Zhang 1957; Jensen 1990) did not strictly distinguish Contemporary/Modern Chinese from Classical Chinese, and they held the general view that Chinese has little morphology except for limited compounding.  But this view of Contemporary Chinese has been criticized as a misconception (Dai 1993) and is no longer accepted by the community of Chinese grammarians.

[3] Di Sciullo and Williams call a sign listable in the lexicon listeme, equivalent to the notion vocabulary word.

[4] In the literature, variations of  this view include the Lexicalist position (Chomsky 1970), the Lexical Integrity Hypothesis (Jackendoff 1972), the Principle of Morphology-Free Syntax (Zwicky 1987), etc.

[5] This type of ‘atomicity’ constraint (Di Sciullo and Williams 1987) is generally known as Lexical Integrity Hypothesis (LIH, Jackendoff 1972), which states that syntactic rules or operations cannot refer to part of a word.  A more elaborate version of LIH is proposed by Zwicky (1987) as a Principle of Morphology-Free Syntax.  This principle states that syntactic rules cannot make reference to the internal morphological composition of words.  The only lexical properties accessible to syntax, according to Zwicky, are syntactic category, subcategory, and features like gender, case, person, etc.

[6] Of course, in theory a word may be separated by a morphological infix.  But except for the two modal signs de3 (can) and bu (cannot) (see Section 5.3 in Chapter V), there does not seem to exist infixation in Mandarin Chinese.

[7] In terms of rank, [no_syn_dtr] in CPSG95 corresponds to the type [lexical_sign] in HPSG (Pollard and Sag 1987).  A binary division between [lexical_sign] and [phrasal_sign] is enough in HPSG to distinguish the atomic unit word from a syntactic construction.  But, as CPSG95 incorporates derivation in the general grammar, [no_syn_dtr] covers both free morphemes and bound morphemes.  That is why the [no_syn_dtr] constraint on [STRUCT] alone cannot define word in CPSG95;  it needs to involve constraints on morphological expectation structures as well, as shown in the macro definition.

[8] Note that there are [LEX -] signs which are not of the type [phrasal_sign].

 

PhD Thesis: Chapter III Design of CPSG95

3.0. Introduction

CPSG95 is the grammar designed to formalize the morpho-syntactic analysis presented in this dissertation.  This chapter presents the general design of CPSG95 with emphasis on three essential aspects related to the morpho-syntactic interface:  (i) the overall mono-stratal design of the sign;  (ii) the design of expectation feature structures;  (iii) the design of structural feature structures.

The HPSG-style mono-stratal design of the sign in CPSG95 provides a general framework for the information flow between different components of a grammar via unification.  Morphology, syntax and semantics are all accommodated in distinct features of a sign.  An example will be shown to illustrate the information flow between these components.

Expectation feature structures are designed to accommodate lexical information for the structural combination.  Expectation feature structures are vital to a lexicalized grammar like CPSG95.  The formal definition for the sort hierarchy [expected] for the expectation features will be given.  It will be demonstrated that the defined sort hierarchy provides means for imposing a proper structural hierarchy as defined by the general grammar.

One characteristic of the CPSG95 structural expectation is the unique design of morphological expectation features to incorporate Chinese productive derivation.  This design is believed to be a feasible and natural way of modeling Chinese derivation, as shall be presented shortly below and elaborated in section 3.2.1.  How this design benefits the interface coordination between derivation and syntax will be further demonstrated in Chapter VI.

The type [expected] for the expectation features is similar to the HPSG definition of [subcat] and [mod].  They both accommodate lexical expectation information to drive the analysis conducted via the general grammar.  In order to meet the requirements induced by introducing morphology into the general grammar and by accommodating linguistic characteristics of Chinese, three major modifications to standard HPSG are proposed in CPSG95.  They are:  (i) the CPSG95 type [expected] is generalized so as to cover productive derivation in addition to syntactic subcategorization and modification;  (ii) unlike HPSG, which tries to capture word order phenomena as independent constraints, Chinese word order in CPSG95 is integrated in the definition of the expectation features and the corresponding morphological/syntactic relations;  (iii) in handling syntactic subcategorization, CPSG95 pursues a non-list alternative to the standard HPSG practice of relying on the list design of the obliqueness hierarchy.  The rationale and arguments for these modifications are presented in the corresponding sections, with a brief summary given below.

The first modification is necessitated by the need to introduce Chinese productive derivation into the grammar.  It is observed that a Chinese affix acts as the head daughter of the derivative in terms of expectation (Dai 1993).  The expectation information that drives the analysis of a Chinese productive derivation can be captured lexically by the affix sign;  this is very similar to how the information for the head-driven syntactic analysis is captured in HPSG.  The expansion of the expectation notion to include productive morphology can account for a wider range of linguistic phenomena.  The feasibility of this modification has been verified by the implementation of CPSG95 based on the generalized expectation feature structures.

One outstanding characteristic of all the expectation features designed in CPSG95 is that the word order information is implied in the definition of these features.[1]  Word order constraints in CPSG95 are captured by individual PS rules for the structural relationship between the constituents.  In other words, Chinese word order constraints are not treated as phenomena which have sufficient generalizations of themselves independent of the individual morphological or syntactic relations.  This is very different from the word order treatment in theories like HPSG (Pollard and Sag 1987) and GPSG (Gazdar, Klein, Pullum and Sag 1985).  However, a similar treatment can be found in the work from  the school of ‘categorial grammar’ (e.g. Dowty 1982).

The word order theory in HPSG and GPSG is based on the assumption that structural relations and syntactic roles can be defined without involving the factor of word order.  In other words, it is assumed that the structural nature of a constituent (subject, object, etc.) and its linear position in the related structures can be studied separately.  This assumption is found to be inappropriate in capturing Chinese structural relations.  So far, no one has been able to propose an operational definition for Chinese structural relations and morphological/syntactic roles without bringing in word order.[2]

As Ding (1953) points out, without the means of inflections and case markers, word order is a primary constraint for defining and distinguishing Chinese structural relations.[3]  In terms of expectation, it can always be lexically decided where the head sign should look for its expected daughter(s).  It is thus natural to define the expectation features directly in terms of their expected word order.

The reason for the non-list design in capturing Chinese subcategorization can be summarized as follows:  (i) there has been no successful attempt by anyone, including the initial effort involved in the CPSG95 experiment, which demonstrates that the obliqueness design can be applied to Chinese grammar with sufficient linguistic generalizations;  (ii) it is found that the atomic approach with separate features for each complement is a feasible and flexible proposal in representing the relevant linguistic phenomena.

Finally, the design of the structural feature [STRUCT]  originates from [LEX + | -] in HPSG (Pollard and Sag 1987).  Unlike the binary type for [LEX], the type [struct] for [STRUCT] forms an elaborate sort hierarchy.  This is designed to meet the configurational requirements of introducing morphology into CPSG95.  This feature structure, together with the design of expectation feature structures, will help create a favorable framework for handling Chinese morpho-syntactic interface.  The proposed structural feature structure and the expectation feature structures contribute to the formal definition of linguistic units in CPSG95.  Such definitions enable proper lexical configurational constraints to be imposed on the expected signs when required.

3.1. Mono-stratal Design of Sign

This section presents the data structure involving the interface between morphology, syntax and semantics in CPSG95.  This is done by defining the mono-stratal design of the fundamental notion sign and by illustrating how different components, represented by the distinct features for the sign, interact.

As a dynamic unit of grammatical analysis, a sign can be a morpheme, a word, a phrase or a sentence.  It is the most fundamental object of HPSG-style grammars.  Formally, a sign is defined in CPSG95 by the type [a_sign], as shown below.[4]

(3-1) Definition: a_sign

a_sign
HANZI          hanzi_list
CONTENT        content
CATEGORY       category
SUBJ           expected
COMP0_LEFT     expected
COMP1_RIGHT    expected
COMP2_RIGHT    expected
MOD_LEFT       expected
MOD_RIGHT      expected
PREFIXING      expected
SUFFIXING      expected
STRUCT         struct

The type [a_sign] introduces a set of linguistic features for the description of a sign.  These are features for orthography, morphology, syntax and semantics, etc.[5]  The types, which are eligible to be the values of these features, have their own definitions in the sort hierarchy.  An introduction of these features follows.

The orthographic feature [HANZI] contains a list of Chinese characters (hanzi or kanji).  The feature [CONTENT] embodies the semantic representation of the sign.  [CATEGORY] carries values like [n] for noun, [v] for verb, [a] for adjective, [p] for preposition, etc.  The structural feature [STRUCT] contains information on the relation of the structure to its sub-constituents, to be presented in detail in section 3.3.

The features whose appropriate value must be the type [expected] are called expectation features.  They are the essential part of a lexicalist grammar as these features contain information about various types of potential structures in both syntax and morphology.  They specify various constraints on the expected daughter(s) of a sign for structural analysis.   The design of these expectation features and their appropriate type [expected] will be presented shortly in section 3.2.
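As a rough analogue of the definition in (3-1), the sign can be pictured as a typed record in which the expectation features are exactly those fields whose value is of the type [expected].  The Python sketch below is only an illustration:  the classes Expected and ASign, the flattened string values, and the default of "saturated" for features that expect nothing are assumptions made here, not part of the CPSG95 implementation.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class Expected:
    status: str = "saturated"        # assumed default: nothing is expected
    category: Optional[str] = None   # constraint on the expected daughter


@dataclass
class ASign:
    hanzi: List[str]
    content: dict
    category: str
    subj: Expected = field(default_factory=Expected)
    comp0_left: Expected = field(default_factory=Expected)
    comp1_right: Expected = field(default_factory=Expected)
    comp2_right: Expected = field(default_factory=Expected)
    mod_left: Expected = field(default_factory=Expected)
    mod_right: Expected = field(default_factory=Expected)
    prefixing: Expected = field(default_factory=Expected)
    suffixing: Expected = field(default_factory=Expected)
    struct: str = "no_syn_dtr"


def expectation_features(sign: ASign) -> List[str]:
    """Names of the features whose value is of type Expected."""
    return [f.name for f in fields(sign)
            if isinstance(getattr(sign, f.name), Expected)]


# The prefix ke-: obligatory morphological expectation of a vt stem,
# optional syntactic expectation of a subject NP.
ke = ASign(hanzi=["可"], content={}, category="a",
           subj=Expected(status="optional", category="np"),
           prefixing=Expected(status="obligatory", category="vt"))

print(expectation_features(ke))
```

The list returned contains the six syntactic and two morphological expectation features;  [HANZI], [CONTENT], [CATEGORY] and [STRUCT] are not among them.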

The definition of [a_sign] illustrates the HPSG philosophy of mono-stratal analysis interleaving different components.  As seen, different components of Chinese grammar are contained in different feature structures for the general linguistic unit sign.  Their interaction is effected via the unification of relevant feature structures during various stages of analysis.  This will unfold as the solutions to the morpho-syntactic interface problems are presented in Chapter V and Chapter VI.  For illustration, the prefix 可 ke (-able) is used as an example in the following discussion.

As is known, the prefix ke- (-able) makes an adjective out of a transitive verb:  ke- + Vt --> A.  This lexicalized rule is contained in the CPSG95 entry for the prefix ke-, shown in (3-2).  Following the ALE notation, @ is used for macro, a shorthand mechanism for a pre-defined feature structure.[6]

(3-2) [Figure: CPSG95 lexical entry for the prefix 可 ke-; image not reproduced]

As seen, the prefix ke- morphologically expects a sign with [CATEGORY vt].  An affix is analyzed as the head of a derivational structure in CPSG95 (see section 6.1 for discussion), and [CATEGORY] is a representative head feature that is percolated up to the mother sign via the corresponding morphological PS rule, formulated in (6-4) of section 6.2.  This expectation therefore eventually leads to a derived word with [CATEGORY a].  Like most Chinese adjectives, the derived adjective has an optional expectation for a subject NP to account for sentences like 这本书很可读 zhe (this) ben (CLA) shu (book) hen (very) ke-du (read-able): ‘This book is very readable’.  This optional syntactic expectation for the derivative is accommodated in the head feature [SUBJ].

Note that before any structural combination of ke- with other expected signs, ke- is a bound morpheme, a sign which has an obligatory morphological expectation in [PREFIXING].  Since ke- is the head for both the morphological combination ke+Vt and the potential syntactic combination NP+[ke+Vt], the interface between morphology and syntax in this case lies in the hierarchical structure that must be imposed.  That is, the morphological structure (derivation) should be established before the syntactically expected structure can be realized.  Such a configurational constraint is specified in the corresponding PS rules, i.e. the Subject PS Rule and the Prefix PS Rule.  It guarantees that the obligatory morphological expectation of ke- is saturated before the sign can be legitimately used in a syntactic combination.

The interaction between morphology/syntax and semantics in this case is encoded by the information flow, i.e. structure-sharing indicated by the number index in square brackets, between the corresponding feature structures inside this sign.  The semantic compositionality involved in the morphological and syntactic grouping is represented like this.  There is a semantic predicate marked as [-able] (for worthiness) in the content feature [RELN];  this predicate has an argument which is co-indexed by [1] with the semantics of the expected Vt.  Note that the syntactic subject of the derived adjective, say ke-du (read-able) or ke-chi (eat-able), is the semantic (or logical) object of the stem verb, co-indexed by [2] in the sample entry above.  The head feature [CONTENT] which reflects the semantic compositionality will be percolated up to the mother sign when applicable morphological and syntactic PS rules take effect in structure building.
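The structure-sharing described above can be imitated with shared object references.  The Python sketch below is a loose analogy only:  the dictionary keys ARG, OBJECT and REFERENT and the functions make_ke_derivative and fill_subject are hypothetical, chosen for illustration, and do not reproduce the actual feature geometry of the entry in (3-2).

```python
def make_ke_derivative(stem_reln: str) -> dict:
    """Build a toy sign for ke- + Vt in which the subject slot and the
    stem verb's logical object are one and the same object."""
    logical_object = {}                          # shared slot, cf. index [2]
    stem_content = {"RELN": stem_reln,           # semantics of the Vt, cf. index [1]
                    "OBJECT": logical_object}
    return {
        "CATEGORY": "a",
        "CONTENT": {"RELN": "-able", "ARG": stem_content},
        "SUBJ": logical_object,                  # same object as OBJECT above
    }


def fill_subject(sign: dict, referent: str) -> dict:
    """Filling the syntactic subject also fills the stem's logical object,
    because both paths lead to the same shared slot."""
    sign["SUBJ"]["REFERENT"] = referent
    return sign


ke_du = fill_subject(make_ke_derivative("read"), "book")
print(ke_du["CONTENT"]["ARG"]["OBJECT"]["REFERENT"])  # book
```

In a unification-based system the sharing is expressed declaratively by co-indexing rather than by mutation;  the mutable dictionary is used here only to make the identity of the two slots visible.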

In summary, embodied in CPSG95 is a mono-stratal grammar of morphology and syntax within the same formalism.  Both morphology and syntax use the same data structure (typed feature structure) and mechanisms (unification, sort hierarchy, PS rules, lexical rules, macros).  This design for Chinese grammar is original and is shown to be feasible in the CPSG95 experiments on various Chinese constructions.  The advantages of handling morpho-syntactic interface problems under this design will be demonstrated throughout this dissertation.

3.2. Expectation Feature Structures

This section presents the design of the expectation features in CPSG95.  In general, the expectation features contain information about various types of potential structures of the sign.  In CPSG95, various constraints on the expected daughter(s) of a sign are specified in the lexicon to drive both morphological and syntactic structural analysis.  This provides a favorable basis for interleaving Chinese morphology and syntax in analysis.

The expected daughter in CPSG95 is defined as one of the following grammatical constituents:  (i) subject in the feature [SUBJ];  (ii) first complement in the feature [COMP0_LEFT] or [COMP1_RIGHT];  (iii) second complement in [COMP2_RIGHT];   (iv) head of a modifier in the feature [MOD_LEFT] or [MOD_RIGHT];   (v) stem of an affix in the feature [PREFIXING] or [SUFFIXING].[7]  The first four are syntactic daughters which will be investigated in sections 3.2.2 and 3.2.3.  The last one is the morphological daughter for affixation, to be presented in section 3.2.1.  All these features are defined on the basis of the relative word order of the constituents in the structure.  The hierarchy for the structure at issue resorts to the configurational constraints which will be presented in section 3.2.4.

3.2.1. Morphological Expectation

One key characteristic of the CPSG95 expectation features is the design of morphological expectation features to incorporate Chinese productive derivation.

It is observed that a Chinese affix acts as the head daughter of the derivative in terms of expectation (see section 6.1 for more discussion).   An affix can lexically define what stem to expect and can predict the derivation structure to be built.  For example, the suffix 性 –xing demands that it combine with a preceding adjective to make an abstract noun, i.e. A+-xing --> N.  This type of information can be easily captured by the expectation feature structure in the lexicon, following the practice of the HPSG treatment of the syntactic expectation such as subcategorization and modification.

In the CPSG95 lexicon, each affix entry is encoded to provide the following derivation information:   (i) what type of stem it expects;  (ii) whether it is a prefix or suffix to decide where to look for the expected stem;  (iii) what type of (derived) word it produces.  Based on this lexical information, the general grammar only needs to include two PS rules for Chinese derivation:  one for prefixation, one for suffixation.  These rules will be formulated in Chapter VI (sections 6.2 and 6.3).  It will also be demonstrated that this lexicalist design for Chinese derivation works for both typical cases of affixation and for some difficult cases such as ‘quasi-affixation’ and zhe-suffixation.

In summary, the morphological combination for productive derivation in CPSG95 is designed to be handled by only two PS rules in the general grammar, based on the lexical specification in [PREFIXING] and [SUFFIXING].  Essentially, in CPSG95, productive derivation is treated like a ‘mini-syntax’;[8]  it becomes an integrated part of Chinese structural analysis.
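The division of labor between lexical affix entries and two general rules can be sketched as follows.  This Python fragment is a simplified illustration, not the PS rules formulated in Chapter VI:  the classes Word and Affix, the functions prefix_rule and suffix_rule, and the reduction of each entry to three pieces of information are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Word:
    form: str
    category: str


@dataclass(frozen=True)
class Affix:
    form: str
    kind: str             # "prefix" or "suffix": where to look for the stem
    stem_category: str    # what type of stem it expects
    result_category: str  # what type of derived word it produces


def prefix_rule(affix: Affix, stem: Word) -> Optional[Word]:
    """General rule for prefixation: the affix precedes its stem."""
    if affix.kind != "prefix" or stem.category != affix.stem_category:
        return None
    return Word(f"{affix.form}-{stem.form}", affix.result_category)


def suffix_rule(stem: Word, affix: Affix) -> Optional[Word]:
    """General rule for suffixation: the affix follows its stem."""
    if affix.kind != "suffix" or stem.category != affix.stem_category:
        return None
    return Word(f"{stem.form}-{affix.form}", affix.result_category)


KE = Affix("ke", "prefix", stem_category="vt", result_category="a")
XING = Affix("xing", "suffix", stem_category="a", result_category="n")

du = Word("du", "vt")
ke_du = prefix_rule(KE, du)            # ke- + Vt --> A
ke_du_xing = suffix_rule(ke_du, XING)  # A + -xing --> N

print(ke_du, ke_du_xing)
print(suffix_rule(du, XING))           # None: -xing expects an adjective
```

All derivation-specific information lives in the two affix entries;  adding a new productive affix under this design would mean adding a lexical entry, not a new rule.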

3.2.2. Syntactic Expectation

This section presents the design of the expectation features to represent Chinese syntactic relations.  It will be demonstrated that constraints like word order and function words are crucial to the formalization of syntactic relations.  Based on them, four types of syntactic relations can be defined, which are accommodated in six syntactic expectation feature structures for each head word.

There is no general agreement on how to define Chinese syntactic relations.  In particular, the distinction between Chinese subject and object has been a long debated topic (e.g. Ding 1953; L. Li 1986, 1990; Zhu 1985; Lü 1989).  The major difficulty lies in the fact that Chinese does not have inflection to indicate subject-verb agreement and nominative case or accusative case, etc.

Theory-internally, there have been various proposals that Chinese syntactic relations be defined on the basis of one or more of the following factors:  (i) word order (more precisely, constituent order);  (ii) the function words associated with the constituents;  (iii) the semantic relations or roles.  The first two factors are linguistic forms while the third factor belongs to linguistic content.

L. Li (1986, 1990) relies mainly on the third factor to study Chinese verb patterns. The constituents in his proposal are named as NP-agent (ming-shi), NP-patient (ming-shou), etc. This practice amounts to placing an equal sign between the syntactic relation and semantic relation.  It implies that the syntactic relation is not an independent feature.  This makes syntactic generalization difficult.

Other Chinese grammarians (e.g. Ding 1953; Zhu 1985) emphasize the factor of word order in defining syntactic relations.  This school insists that syntactic relations be differentiated from semantic relations.  More precisely, semantic relations should be the result of the analysis of syntactic relations.  That is also the rationale behind the CPSG95 practice of using word order and other constraints (including function words) in the definition of Chinese relations.

In CPSG95, the expected syntactic daughter is defined as one of the following grammatical constituents:  (i) subject in the feature [SUBJ], typically an NP which is in the leftmost position relative to the head;  (ii) complements closer to the head in the feature [COMP0_LEFT] or [COMP1_RIGHT], in the form of an NP or a specific PP;  (iii) the second complement in [COMP2_RIGHT]: this complement is defined to be an XP (NP, a specific PP, VP, AP, etc.) farther away from the head than [COMP1_RIGHT] in word order;  (iv) head of a modifier in the feature [MOD_LEFT] or [MOD_RIGHT].  In this defined framework of four types of possible syntactic relations, for each head word, the lexicon is expected to specify the specific constraints in its corresponding expectation feature structures and map the syntactic constituents to the corresponding semantic roles in [CONTENT].  This is a secure way of linking syntactic structures and their semantic composition for the following reason.  Given a specific head word and a syntactic structure with its various constraints specified in the expectation feature structures, the decoding of semantics is guaranteed.[9]

A Chinese syntactic pattern can usually be defined by constraints from category, word order, and/or function words (W. Li 1996).  For example, NP+V, NP+V+NP, NP+PP(x)+NP, NP+V+NP+NP and NP+V+NP+VP are all such patterns.  With the design of the expectation features presented above, these patterns can be easily formulated in the lexicon under the relevant head entry, as demonstrated by the sample formulations given in (3-3) and (3-4).

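(3-3) [Figure: lexical formulation of the transitive pattern NP1+Vt+NP2; image not reproduced]

(3-4) [Figure: lexical formulation of the transitive pattern NP+PP(x)+Vt; image not reproduced]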

The structure in (3-3) is a Chinese transitive pattern in its default word order, namely NP1+Vt+NP2.  The representation in (3-4) is another transitive pattern NP+PP(x)+Vt.  This pattern requires a particular preposition x to introduce its object before the head verb.
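Because each expectation feature is defined on relative word order, the surface pattern of a head can be read directly off its lexical entry.  The Python sketch below illustrates this under simplifying assumptions:  entries are plain dictionaries keyed by the feature names of (3-1), values are bare category labels, modifiers are left out, and the function linearize is a hypothetical helper rather than part of CPSG95.

```python
# Expectation features to the left and to the right of the head,
# listed in their surface order relative to the head.
LEFT_FEATURES = ("SUBJ", "COMP0_LEFT")
RIGHT_FEATURES = ("COMP1_RIGHT", "COMP2_RIGHT")


def linearize(entry: dict) -> str:
    """Return the surface pattern implied by a head entry's
    subcategorization features."""
    left = [entry[f] for f in LEFT_FEATURES if f in entry]
    right = [entry[f] for f in RIGHT_FEATURES if f in entry]
    return "+".join(left + [entry["CATEGORY"]] + right)


# Default transitive pattern, cf. (3-3)
transitive = {"CATEGORY": "Vt", "SUBJ": "NP", "COMP1_RIGHT": "NP"}
# Transitive pattern with a preposed PP object, cf. (3-4)
pp_transitive = {"CATEGORY": "Vt", "SUBJ": "NP", "COMP0_LEFT": "PP(x)"}
# A pattern with two right complements
ditransitive = {"CATEGORY": "V", "SUBJ": "NP",
                "COMP1_RIGHT": "NP", "COMP2_RIGHT": "NP"}

print(linearize(transitive))     # NP+Vt+NP
print(linearize(pp_transitive))  # NP+PP(x)+Vt
print(linearize(ditransitive))   # NP+V+NP+NP
```

Each complement is addressed by its own keyword feature rather than by its position in a list, so an entry mentions only the features it actually uses.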

The sample entry in (3-5) is an example of how modification is represented in CPSG95.  Following the HPSG semantics principle, the semantic content from the modifier will be percolated up to the mother sign from the head-modifier structure via the corresponding PS rule.  The added semantic contribution of the adverb chang-chang (often) is its specification of the feature [FREQUENCY] for the event at issue.

(3-5) [Figure: sample entry for the adverb chang-chang (often); image not reproduced]

3.2.3. Chinese Subcategorization

This section presents the rationale behind the CPSG95 design for subcategorization.  Instead of a SUBCAT-list, a keyword approach with separate features for each complement is chosen for representing the subcategorization information, as shown in the corresponding expectation features in section 3.2.2.  This design has been found to be a feasible alternative to the standard practice of HPSG relying on the list design of obliqueness hierarchy and SUBCAT Principle when handling subject and complements.

The CPSG95 design for representing subcategorization follows one proposal from Pollard and Sag (1987:121), who point out:  “It may be possible to develop a hybrid theory that uses the keyword approach to subjects, objects and other complements, but which uses other means to impose a hierarchical structure on syntactic elements, including optional modifiers not subcategorized for in the same sense.”  There are two issues for such a hybrid theory:  the keyword approach to representing subject and complements and the means for imposing a hierarchical structure.  The former is discussed below while the latter will be addressed in the subsequent section 3.2.4.

The basic reason for abandoning the list design is the lack of an operational definition of obliqueness which captures generalizations of Chinese subcategorization.  In the English version of HPSG (Pollard and Sag 1987, 1994), the obliqueness ordering is established between the syntactic notions of subject, direct object and second object (or oblique object).[10]  But these syntactic relations themselves are by no means universal.  In order to apply this concept to the Chinese language, there is a need for an operational definition of obliqueness which can be applied to Chinese syntactic relations.  Such a definition has not been available.

In fact, how to define Chinese subject, object and other complements has been one of the central debated topics among Chinese grammarians for decades (Lü 1946, 1989; Ding 1953; L. Li 1986, 1990; Zhu 1985; P. Chen 1994).  No general agreement for an operational, cross-theory definition of Chinese subcategorization has been reached.  It is often the case that formal or informal definitions of Chinese subcategorization are given within a theory or grammar.   But so far no Chinese syntactic relations defined in a theory are found to demonstrate convincing advantages of a possible obliqueness ordering, i.e. capturing the various syntactic generalizations for Chinese.

Technically, however, as long as subject and complements are formally defined in a theory, one can impose an ordering of them in a SUBCAT list.  But if such a list does not capture significant generalizations, there is no point in doing so.[11]  It has turned out that the keyword approach is a promising alternative once proper means are developed for the required configurational constraint on structure building.

The keyword approach is realized in CPSG95 as follows.  Syntactic constituents for subcategorization, namely subject and complements, are directly accommodated in four parallel features [SUBJ], [COMP0_LEFT], [COMP1_RIGHT] and [COMP2_RIGHT].

The feasibility of the keyword approach proposed here has been tested during the implementation of CPSG95 in representing a variety of structures.  Particular attention has been given to the constructions or patterns related to Chinese subcategorization.  They include various transitive structures, di-transitive structures, pivotal construction (jianyu-shi), ba-construction (ba-zi ju), various passive constructions (bei-dong shi), etc.  It is found to be easy to  accommodate all these structures in the defined framework consisting of the four features.
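The difference between the two designs can be pictured with a small Python sketch.  It is an illustration only:  the function names, the toy string values and the ordering convention used for the list are assumptions of this example, not part of CPSG95 or of any HPSG implementation.

```python
# Toy contrast between a SUBCAT list and the keyword approach (illustrative only).

def cancel_next(subcat):
    """List design: the list order alone fixes which element is cancelled next."""
    *rest, cancelled = subcat
    return rest, cancelled

def fill(sign, feature):
    """Keyword design: address one expectation by name and mark it satisfied."""
    updated = dict(sign)          # copy, so the lexical entry is left untouched
    updated[feature] = "satisfied"
    return updated

subcat_list = ["NP_subject", "NP_object"]   # ordering convention is an assumption
remaining, cancelled = cancel_next(subcat_list)

keyword_sign = {
    "SUBJ": "obligatory",
    "COMP0_LEFT": "null",
    "COMP1_RIGHT": "obligatory",
    "COMP2_RIGHT": "null",
}
after_object = fill(keyword_sign, "COMP1_RIGHT")

print(remaining, cancelled)
print(after_object)
```

Nothing in `fill` itself enforces an order among the four features.  In the keyword design that job is left to the configurational constraint of section 3.2.4.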

We give a few typical examples below, in addition to (3-3) and (3-4) above, to show how various subcategorization phenomena are accommodated in the CPSG95 lexicon within the defined feature structures for subcategorization.  The expected structure and an example are shown before each sample formulation in (3-6) through (3-8) (with irrelevant implementation details left out).

[Figures (3-6) through (3-8): sample formulations, images not reproduced]

Based on such lexical information, the desirable hierarchical structure on the related syntactic elements, e.g. [S [V O]] instead of [[S V] O], can be imposed via the configurational constraint based on the design of the expectation type.  This is presented in section 3.2.4 below.

3.2.4. Configurational Constraint

The key to the success of a keyword approach to structural constituents, including the subject and complements from subcategorization, is a means for the configurational constraint to impose the desired hierarchical morpho-syntactic structure defined by the grammar.  This section defines the sort hierarchy of the expectation type [expected].  The use of this design for flexible configurational constraint, both in the general grammar and in the lexicon, will then be demonstrated.

As presented before, whether a sign has structural expectation, and what type of expectation it has, can be lexically decided:  these decisions form the basis for a lexicalized grammar.  Four basic cases of expectation are distinguished in the expectation type of CPSG95:  (i) obligatory:  the expected sign must occur;  (ii) optional:  the expected sign may occur;  (iii) null:  no expectation;  (iv) satisfied:  the expected sign has occurred.  Note that cases (i), (ii) and (iii) are static information while case (iv) is dynamic information, updated at the time when the daughters are combined into a mother sign.  In other words, case (iv) is only possible when the expected structure has actually been built.  In HPSG-style grammars, only the general grammar, i.e. the set of PS rules, has the power of building structures.  For each structure being built, the general grammar will set the corresponding expectation feature of the mother sign to [satisfied].

Out of the four cases, case (i) and case (ii) form a natural class, named [a_expected];  case (iii) and case (iv) form another class, named [saturated].  The formal definition of the type [expected] is given in (3-9).

(3-9.) Definition: sorted hierarchy for [expected]

expected: {a_expected, saturated}
a_expected: {obligatory, optional}
    ROLE role
    SIGN a_sign
saturated: {null, satisfied}

The type [a_expected] introduces two features:  [ROLE] and [SIGN].   [ROLE] specifies the semantic role which the expected sign plays in the structure.  [SIGN] houses various types of constraints on the expected sign.
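For readers who prefer code, the sort hierarchy in (3-9) can be transcribed into a small Python table with a subsumption check.  The sort names are taken from the definition;  the dictionary encoding and the helpers `subsumes` and `make_expectation` are assumptions of this sketch rather than the CPSG95 implementation.

```python
# Sort hierarchy for [expected], transcribed from (3-9); the encoding is an assumption.
EXPECTED_SORTS = {
    "expected": ["a_expected", "saturated"],
    "a_expected": ["obligatory", "optional"],
    "saturated": ["null", "satisfied"],
}

def subsumes(general, specific, hierarchy=EXPECTED_SORTS):
    """True if `specific` equals `general` or is one of its (transitive) sub-types."""
    if general == specific:
        return True
    return any(subsumes(child, specific, hierarchy)
               for child in hierarchy.get(general, []))

def make_expectation(sort, role=None, sign=None):
    """Only sub-types of a_expected may carry the features ROLE and SIGN."""
    if not subsumes("expected", sort):
        raise ValueError(f"unknown sort: {sort}")
    if subsumes("a_expected", sort):
        return {"sort": sort, "ROLE": role, "SIGN": sign or {}}
    if role is not None or sign is not None:
        raise ValueError(f"{sort} does not introduce ROLE or SIGN")
    return {"sort": sort}

print(subsumes("saturated", "satisfied"))
print(make_expectation("optional", role="some_role", sign={"CATEGORY": "n"}))
```

A constraint such as `saturated | optional` then amounts to asking whether a value is subsumed by [saturated] or equals [optional].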

The type [expected] is designed to meet the requirement of the configurational constraint.  For example, in order to guarantee that syntactic structures for an expecting sign are built on top of its morphological structures if the sign has obligatory morphological expectation, the following configurational constraint is enforced in the general grammar.  (The notation | is used for logical OR.)

(3-10.)         configurational constraint in syntactic PS rules

PREFIXING                    saturated | optional
SUFFIXING                    saturated | optional

The constraint [saturated] means that syntactic rules are permitted to apply if a sign has no morphological expectation or after the morphological expectation has been satisfied.  The reason why the case [optional] does not block the application of syntactic rules is the following.  Optional expectation entails that the expected sign may or may not appear.  It does not have to be satisfied.
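The effect of (3-10) can be sketched as a simple gate in front of every syntactic rule.  The three sample signs below are hypothetical and exist only to exercise each case;  the function name `syntax_may_apply` is likewise an assumption of this example.

```python
# Values that let a syntactic rule go ahead, per (3-10): saturated | optional.
SATURATED = {"null", "satisfied"}
PERMITTED = SATURATED | {"optional"}

def syntax_may_apply(sign):
    """Check the (3-10) constraint on the morphological expectation features."""
    return all(sign[feature] in PERMITTED
               for feature in ("PREFIXING", "SUFFIXING"))

# Hypothetical signs; the values are chosen only to illustrate the three cases.
pending = {"PREFIXING": "null", "SUFFIXING": "obligatory"}   # expectation not yet met
done = {"PREFIXING": "null", "SUFFIXING": "satisfied"}       # expectation already met
may_skip = {"PREFIXING": "null", "SUFFIXING": "optional"}    # need not be met

for name, sign in (("pending", pending), ("done", done), ("may_skip", may_skip)):
    print(name, syntax_may_apply(sign))
```

Only the sign with an unmet obligatory morphological expectation is kept out of syntax, so syntactic structure is always built on top of the completed morphological structure.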

Similarly, within syntax, the constraints can be specified in the Subject PS Rule:

(3-11.)         configurational constraint in Subject PS rule

COMP0_LEFT                 saturated | optional
COMP1_RIGHT              saturated | optional
COMP2_RIGHT              saturated | optional

This ensures that complement rules apply before the subject rule does.  This way of imposing a hierarchical structure between subcategorized elements corresponds to the use of the SUBCAT Principle in HPSG based on the notion of obliqueness.
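A minimal Python sketch of this ordering effect follows.  The two rule functions, the `tree` field, the placeholder labels S, V and O, and the sample verb entry are all assumptions made for illustration;  only the feature names and the `saturated | optional` check come from the text.

```python
PERMITTED = {"null", "satisfied", "optional"}   # saturated | optional
COMPLEMENTS = ("COMP0_LEFT", "COMP1_RIGHT", "COMP2_RIGHT")

def comp1_right_rule(head, obj):
    """Combine a head with its right complement and mark the expectation satisfied."""
    if head["COMP1_RIGHT"] not in ("obligatory", "optional"):
        return None
    mother = dict(head, COMP1_RIGHT="satisfied")
    mother["tree"] = [head["tree"], obj]
    return mother

def subject_rule(subj, head):
    """Combine a subject with a head, but only once the complements pass the check."""
    if head["SUBJ"] not in ("obligatory", "optional"):
        return None
    if any(head[c] not in PERMITTED for c in COMPLEMENTS):
        return None                      # blocked: a complement is still obligatory
    mother = dict(head, SUBJ="satisfied")
    mother["tree"] = [subj, head["tree"]]
    return mother

# Hypothetical transitive verb entry; S, V and O are placeholder labels.
verb = {"tree": "V", "SUBJ": "obligatory", "COMP0_LEFT": "null",
        "COMP1_RIGHT": "obligatory", "COMP2_RIGHT": "null"}

blocked = subject_rule("S", verb)        # the bracketing [[S V] O] cannot get started
vp = comp1_right_rule(verb, "O")
sentence = subject_rule("S", vp)
print(blocked, sentence["tree"])
```

Because the subject rule refuses a head with an unmet obligatory complement, the only derivation available in this sketch is the one that yields [S [V O]].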

The configurational constraint is also used in CPSG95 for the formal definition of phrase, as formulated below.

phrase macro

a_sign
PREFIXING saturated | optional
SUFFIXING saturated | optional
COMP0_LEFT saturated | optional
COMP1_RIGHT saturated | optional
COMP2_RIGHT saturated | optional

Despite the notational difference, this definition follows the spirit reflected in the phrase definition given in Pollard and Sag (1987:69) in terms of the saturation status of the subcategorized complements.  In essence, the above definition says that a phrase is a sign whose morphological expectation and syntactic complement expectation (except for subject) are both saturated.  The reason to include [optional] in the definition is to cover phrases whose head daughter has optional expectation, for example, a verb phrase consisting of just a verb with its optional object omitted in the text.
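Read as a predicate, the phrase definition might look like the following Python sketch.  It uses the complement feature names introduced in section 3.2.3;  treating a missing feature as [null] and the three sample signs are assumptions of this example.

```python
PERMITTED = {"null", "satisfied", "optional"}   # saturated | optional
PHRASE_FEATURES = ("PREFIXING", "SUFFIXING",
                   "COMP0_LEFT", "COMP1_RIGHT", "COMP2_RIGHT")

def is_phrase(sign):
    """A sign is a phrase if none of the listed expectations is still obligatory.

    SUBJ is deliberately not checked, matching the definition in the text.
    Features missing from the sign are treated as null (an assumption).
    """
    return all(sign.get(feature, "null") in PERMITTED
               for feature in PHRASE_FEATURES)

# Hypothetical verb signs differing only in the status of the object expectation.
verb_object_omitted = {"SUBJ": "obligatory", "COMP1_RIGHT": "optional"}
verb_object_missing = {"SUBJ": "obligatory", "COMP1_RIGHT": "obligatory"}
verb_with_object = {"SUBJ": "obligatory", "COMP1_RIGHT": "satisfied"}

print(is_phrase(verb_object_omitted))
print(is_phrase(verb_object_missing))
print(is_phrase(verb_with_object))
```

The first sign corresponds to the case mentioned above:  a verb whose optional object is omitted still counts as a verb phrase.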

Together with the design of the structural feature [STRUCT] (section 3.3), the sort hierarchy of the type [expected] will also enable the formal definition for the representation of the fundamental notion word (see Section 4.3 in Chapter IV).  Definitions such as @word and @phrase are the basis for lexical configurational constraints to be imposed on the expected signs when required.  For example, -xing (-ness) will expect an adjective stem with the word constraint and -zhe (-er) can impose the phrase constraint on the expected verb sign based on the analysis proposed in section 6.5.

3.3. Structural Feature Structure

The design of the feature [STRUCT] serves important structural purposes in the formalization of the CPSG95 interface between morphology and syntax.  It is necessary to present the rationale of this design and the sort hierarchy of the type [struct] used in this feature.

The design of [STRUCT struct] originates from the binary structural feature structure [LEX + | -] in the original HPSG theory (Pollard and Sag 1987).  However, in the CPSG95 definition, the type [struct] forms an elaborate sort hierarchy.   It is divided into two types at the top level:  [syn_dtr] and [no_syn_dtr].  A sub-type of [no_syn_dtr] is [no_dtr].  The CPSG95 lexicon encodes the feature [STRUCT no_dtr] for all single morphemes.[12]  Another sub-type of [no_syn_dtr] is [affix] (for units formed via affixation) which is further sub-typed into [prefix] and [suffix], assigned by the Prefix PS rule and Suffix PS Rule.  In syntax, [syn_dtr] includes sub-types like [subj], [comp] and [mod].  Despite the hierarchical depth of the type, it is organized to follow the natural classification of the structural relation involved.  The formal definition is given below.

(3-12.)         Definition: sorted hierarchy for [struct]

struct: {syn_dtr, no_syn_dtr}
syn_dtr: {subj, comp, mod}
comp: {comp0_left, comp1_right, comp2_right}
mod: {mod_left, mod_right}
no_syn_dtr: {no_dtr, affix}
affix: {prefix, suffix}

In CPSG95, [STRUCT] is not a (head) feature which percolates up to the mother sign;  its value is solely decided by the structure being built.[13]   Each PS rule, whether syntactic or morphological, assigns the value of the [STRUCT] feature for the mother sign, according to the nature of the combination.  When morpheme daughters are combined into a mother sign word, the value of the feature [STRUCT] for the mother sign remains a sub-type of [no_syn_dtr].  But when a syntactic rule is applied, the rule assigns the mother sign a value that is a sub-type of [syn_dtr], to show that the structure being built is a syntactic construction.

The design of the feature structure [STRUCT struct] is motivated by the new requirement caused by introducing morphology into the general grammar of  CPSG95.  In HPSG, a simple, binary type for [LEX] is sufficient to distinguish lexical signs, i.e. [LEX +], from signs created via syntactic rules, i.e. [LEX -].  But in CPSG95, as presented in section 3.2.1 before, productive derivation is also accommodated in the general grammar.  A simple distinction between a lexical sign and a syntactic sign cannot capture the difference between signs created via morphological rules and signs created via syntactic rules.  This difference plays an essential role in formalizing the morpho-syntactic interface, as shown below.

The following examples demonstrate the structural representation through the design of the feature [STRUCT].  In the CPSG95 lexicon, single Chinese characters such as the prefix ke- (-able) and the free morphemes du (read) and bao (newspaper) are all coded as [STRUCT no_dtr].   When the Prefix PS Rule combines the prefix ke- and the verb du into the adjective ke-du, the rule assigns [STRUCT prefix] to the newly built derivative.  The structure may remain in the domain of morphology, as the value [prefix] is a sub-type of [no_syn_dtr].  However, when this structure is further combined with a subject, say bao (newspaper), by the syntactic Subj PS Rule, the resulting structure [bao [ke-du]] (‘Newspapers are readable’) is syntactic, having [STRUCT subj] assigned by the Subj PS Rule;  in fact, this is a simple sentence.  Similarly, the syntactic Comp1_right PS Rule can combine the transitive verb du (read) and the object bao (newspaper) and assign [STRUCT comp1_right] to the resulting unit du bao (read newspapers).  In general, when signs whose [STRUCT] value is a sub-type of [no_syn_dtr] combine into a unit whose [STRUCT] is assigned a sub-type of [syn_dtr], this marks the jump from the domain of morphology to syntax.  This is how the interface of Chinese morphology and syntax is formalized in the present formalism.
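The ke-du, bao ke-du and du bao examples can be traced in a short Python sketch.  The sort table is transcribed from (3-12);  the helpers `morpheme`, `apply_rule` and `crosses_into_syntax`, and the dictionary representation of signs, are assumptions made for this illustration.

```python
# Sort hierarchy for [struct], transcribed from (3-12); the encoding is an assumption.
STRUCT_SORTS = {
    "struct": ["syn_dtr", "no_syn_dtr"],
    "syn_dtr": ["subj", "comp", "mod"],
    "comp": ["comp0_left", "comp1_right", "comp2_right"],
    "mod": ["mod_left", "mod_right"],
    "no_syn_dtr": ["no_dtr", "affix"],
    "affix": ["prefix", "suffix"],
}

def is_subtype(sort, ancestor):
    """True if `sort` equals `ancestor` or lies somewhere below it."""
    if sort == ancestor:
        return True
    return any(is_subtype(sort, child) for child in STRUCT_SORTS.get(ancestor, []))

def morpheme(form):
    """A lexical morpheme carries [STRUCT no_dtr]."""
    return {"form": form, "STRUCT": "no_dtr"}

def apply_rule(struct_value, form, *daughters):
    """The mother's STRUCT comes from the rule; it is not inherited from a daughter."""
    return {"form": form, "STRUCT": struct_value, "daughters": daughters}

def crosses_into_syntax(mother):
    """True when morphological daughters are combined into a syntactic mother."""
    return (is_subtype(mother["STRUCT"], "syn_dtr")
            and all(is_subtype(d["STRUCT"], "no_syn_dtr")
                    for d in mother["daughters"]))

ke, du, bao = morpheme("ke-"), morpheme("du"), morpheme("bao")
ke_du = apply_rule("prefix", "ke-du", ke, du)              # Prefix PS Rule
bao_ke_du = apply_rule("subj", "bao ke-du", bao, ke_du)    # Subj PS Rule
du_bao = apply_rule("comp1_right", "du bao", du, bao)      # Comp1_right PS Rule

for sign in (ke_du, bao_ke_du, du_bao):
    print(sign["form"], sign["STRUCT"], crosses_into_syntax(sign))
```

In this sketch only the last two combinations cross the boundary:  ke-du stays under [no_syn_dtr], while bao ke-du and du bao receive sub-types of [syn_dtr].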

The use of this feature structure in the definition of Chinese word will be presented in Chapter IV.  Further advantages and flexibility of the design of this structural feature structure and the expectation feature structures will be demonstrated in later chapters in presenting solutions to some long-standing problems at the morpho-syntactic interface.

3.4. Summary

The major design issues for the proposed mono-stratal Chinese grammar CPSG95 have been addressed.  This provides a framework and means for formalizing the analysis of the linguistic problems at the morpho-syntactic interface.  It has been shown that the design of the CPSG95 expectation structures enables configurational constraints to be imposed on the structure hierarchy defined by the grammar.  This makes the keyword approach to Chinese subcategorization a feasible alternative to the list design based on the obliqueness hierarchy of subject and complements.

Within this defined framework of CPSG95, the subsequent Chapter IV will be able to formulate the system-internal, but strictly formalized definition of Chinese word.  Formal definitions such as @word and @phrase enable proper configurational constraints to be imposed on the expected signs when required.  This lays a foundation for implementing the proposed solutions to the morpho-syntactic interface problems to be explored in the remaining chapters.

 

---------------------------------------------------------------------------------

[1] More precisely, it is not ‘word’ order, it is constituent order, or linear precedence (LP) constraint between constituents.

[2]  L. Li’s (1986, 1990) definition of structural constituents does not involve word order.  However, his proposed definition is not an operational one from the angle of natural language processing.  He relies on the decoding of the semantic roles for the definitions of the proposed constituents like NP-agent (ming-shi), NP-patient (ming-shou), etc.  Nevertheless, his proposal has been reported to produce good results in the field of Chinese language teaching.  This seems understandable because the process of decoding semantic roles is naturally and subconsciously conducted in the mind of the language instructors/learners.

[3] Most linguists agree that Chinese has no inflectional morphology (e.g. Hockett 1958; Li and Thompson 1981; Zwicky 1987; Sun and Cole 1991).  The few linguists who believe that Chinese has developed or is developing inflectional morphology include Bauer (1988) and Dai (1993).  Typical examples cited as Chinese inflectional morphemes are the aspect markers le, zhe, guo and the plural marker men.

[4] A note for the notation: uppercase is used for feature and lowercase, for type.

[5] Phonology and discourse are not yet included in the definition.  The latter is a complicated area which requires further research before it can be properly integrated in the grammar analysis.  The former is not necessary because the object for CPSG95 is Written Chinese.   In the few cases where phonology affects structural analysis, e.g. some structural expectation needs to check the match of number of syllables, one can place such a constraint indirectly by checking the number of Chinese characters instead (as we know, a syllable roughly corresponds to a Chinese character or hanzi).

[6] The macro constraint @np in (3-2) is defined to be [CATEGORY n] and a call to another macro constraint @phrase to be defined shortly in Section 3.2.4.

[7] These expectation features defined for [a_sign] are a maximum set of possible expected daughters;  any specific sign may only activate a subset of them, represented by a non-null value.

[8] This is similar to viewing morphology as ‘the syntax of words’ (Selkirk 1982; Lieber 1992; Krieger 1994).  It seems that at least affixation shares with syntax similar structural constraints on constituency and linear ordering in Chinese.  The same type of mechanisms (PS rules, typed feature structure for expectation, etc) can be used to capture both Chinese affixation and syntax (see Chapter VI).

[9] More precisely, the decoding of possible ways of semantic composition is guaranteed.  Syntactically ambiguous structures with the same constraints correspond to multiple ways of semantic compositionality.  These are expressed as different entries in the lexicon and the link between these entries is via corresponding lexical rules, following the HPSG practice. (W. Li 1996)

[10]  Borsley (1987) has proposed an HPSG framework where subject is posited as a feature distinct from other complements.  Pollard and Sag (1994:345) point out that “the overwhelming weight of evidence favors Borsley’s view of this matter”.

[11] The only possible benefit of such an arrangement is that one can continue using the SUBCAT Principle for building complement structure via list cancellation.

[12] It also includes idioms whose internal morphological structure is unknown or has no grammatical relevance.

[13] The reader might have noticed that the assigned value is the same as the name of the PS rule which applies.  This is because there is a correspondence between the type of structure being built and the PS rule building it.  Thus, the [STRUCT] feature actually records the rule application information.  For example, [STRUCT subj] reflects the fact that the Subj PS Rule is the most recently applied rule to the structure in question;  a structure built via the Prefix PS Rule has [STRUCT prefix] in place;  etc.  This practice gives the extra benefit of ‘tracing’ which rules have been applied when debugging the grammar.  If no rule has ever been applied to a sign, it must be a morpheme carrying [STRUCT no_dtr] from the lexicon.

 

[Related]

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

PhD Thesis: Chapter VII Concluding Remarks

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP