Monthly Archives: September 2016

【A Parse a Day: Going Off the Rails, the Parser Seems to Have Gone Mad】

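Me:
System debugging is addictive too. No sleep tonight: I kept tuning and tuning, and the parser seemed to go mad. Maybe it resents me for feeding it anything and everything, and is throwing a tantrum?

[figure: 0927a]
On a closer look, nothing seems seriously wrong; it has not gone mad. Unlike Lu Xun's "Diary of a Madman", my suspicion was unfounded.

Any coordinate (Conj) structure in natural language has to be distributed into separate relations at the logical level. It gets lively when several coordinations meet: the relations tend toward combinatorial explosion. Blame the Chinese enumeration comma (顿号). Why use so many of them? Would it kill anyone to write a few more short clauses? Purely syntactic parsing ignores all this, so its graphs look clean. But the semantic computation in deep parsing is logical, so it cannot ignore it.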
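To make the combinatorial growth concrete, here is a minimal sketch, not the parser's actual mechanism: it assumes each coordinated slot has already been collected into a list and simply takes the cross product. The function name `distribute` and the example words are illustrative assumptions.

```python
from itertools import product

def distribute(subjects, verbs, objects):
    """Expand coordinated slots into one (S, V, O) triple per combination."""
    return list(product(subjects, verbs, objects))

# Hypothetical sentence with three coordinations: 2 subjects, 2 verbs, 3 objects.
triples = distribute(["张三", "李四"], ["买", "卖"], ["苹果", "梨", "桃"])
print(len(triples))  # 2 * 2 * 3 logical relations from one surface clause
```

Each additional coordination multiplies the number of logical relations, which is why the logical-level graph gets crowded while the purely syntactic one stays small.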

Bai:
"Or" binds more weakly than "and". When no "or" manages to capture it, the enumeration comma is interpreted as "and" by default.

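Me:
Odd things keep turning up these days. I cannot tell whether the machine is possessed or the person playing with it is. Either way, some very strange graphs have come out, nothing like the impression left by the syntactic trees shown in textbooks. Textbook trees all look like this, far too elegant:

[figure: parse_tree_1]

Two days ago a gourd-shaped graph came out, yesterday a double-umbrella one, today it threw a fit, and who knows what tomorrow will bring.

These are yesterday's two umbrellas. After a look, they seem right too:

[figure: 0926a]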

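Bai:
The position of 吗 is wrong. In the two-umbrella graph, 能……吗 is the real pair.

Me:
Right, 吗 should go one level higher. If there were no higher level, the attachment of 吗 would look right. Climbing a level for one small word is hardly worth it, although it can be done (patching). Of course, this really involves deciding which clause the yes-no question belongs to, so in the end it may have to climb.

If someone says “电子签证是什么吗。”, that is a non-standard use. On the surface it is a question, but does it actually function as an exclamation? It is not the standard use of 吗, because 吗 by nature marks a yes-no question, while 什么 is the question word (wh-word) of a wh-question, and the two do not go together.

Bai:
That one is 嘛, not 吗.

Me:
Is it certain that 吗 cannot be used here?

Bai:
他知道电子签证是什么 (He knows what an e-visa is)

Me:
It feels acceptable, and it does not seem equivalent to 嘛 either.

是那个什么吗。 (roughly: "It's that... what was it?")
As in: I really have forgotten what that thing was.

Bai:
The exclamatory sense you mean should use 嘛. The forgetting sense can use 吗.
But with wrong characters used everywhere nowadays, it has long been a mess.

Me:
This is the gourd from two days ago, Bai's famous sentence. Only 与之 failed to attach as an arg, so the result is just barely satisfactory, but the overall logical-semantic computation is still right: 你 (S) and 女人 (S) got married, and this event modifies (Mod-S: relative clause) 女人.

[figure: 0925a]

Isn't the machine amazing, isn't the parser fun, and doesn't this count as a stepping stone toward machine understanding of human language: Open sesame! Sesame, sesame, open up quickly.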


Who we are. Not an ad, but a snapshot.

from http://www.netbase.com/about-netbase/ 09/27/2016

WHO WE ARE

EMPOWERING GLOBAL BUSINESSES WITH SOCIAL INSIGHTS

We are uniquely positioned to help global businesses create real business value from the unprecedented level of growth opportunities presented each day by social media. We have the industry’s fastest and most accurate social analytics platform, strong partnerships with companies like Twitter, DataSift, and Tumblr, and award-winning patented language technology.

We empower brands and agencies to make the smartest business decisions grounded on the deepest and most reliable consumer insights from social. We’ve grown 300 percent year-over-year and are excited to see revenue grow by 4,000% since the second quarter of 2012.

RECENT ACCOLADES

We were recently named a top rated social media management platform by software users on TrustRadius and a market leader by G2 Crowd.

LEARN MORE

“NetBase is one of the strongest global social listening and analytics tools in the market. Their new interface makes customized dashboard creation a breeze.”

- Omri Duek, Coca-Cola

“Data reporting is both broad and detailed, with the ability to drill down from annual data to hourly data. NetBase allows us to have a pulse on the marketplace in just a few minutes.”

- Susie Thomas, VP, Palisades Media Group

“We started with a gen one solution, but then found that we needed to move to a tool with a better accuracy that could support digital strategy and insights research. NetBase satisfied all our needs.”

- Jared Degnan, Director of Digital Strategy

“As one of the first brands to test NetBase Audience 3D for our Mobile App launch, we’ve found that we could engage with our consumers on a deeper, more human level that further drives them to be brand champions.”

- Mihir Minawala, Manager of Social, Industry & Competitive Intelligence, Taco Bell

OUR CUSTOMERS

We work with executives from forward-looking agencies and leading brands across all verticals in over 99 countries. Our customers use NetBase for real-time consumer insights across the organization, from brand and digital marketing, public relations, product management to customer care.

KEY MILESTONES

March 2003: Founded by Michael Osofsky at MIT. Later joined by Wei Li, Chief NetBase Scientist

July 2009: P&G, Coca-Cola and Kraft signed as first customers of NetBase

January 2014: Named Best-in-Class By Consumer Goods Technology

April 2014: Launched Brand Live Pulse, the first real-time view of brands’ social movements

May 2014: Celebrated 10 years with 500% customer growth in 3 years

January 2015: AdAge Names 5 NetBase Customers to the Agency A-List

March 2015: Introduced Audience 3D, the first ever 3D view of audiences

April 2015: Raised $33 MM in Series E Round

November 2015: Named Market Leader by G2 Crowd. Earned Top Ratings by Trust Radius

What inspired you to join NetBase?

It was exciting to build the technology that could quickly surface meaningful customer insights at scale. For example, what used to take a day to run a simple analysis now takes just a second. Our platform now analyzes data in “Google time”, yet the depth and breadth of our analysis is exponentially deeper and larger than what you’ll ever get from a Google search.

What are you most proud of at NetBase?

I’m especially proud that we have the industry’s most accurate, deepest, fastest, and most granular text analysis technology. This enables us to give our customers very actionable insights, unlike other platforms that offer broad sentiment analysis and general trending topics. Plus, NetBase reads 42 languages. Other platforms don’t even come close. We are customer-centric. Our platform truly helps customers quickly identify their priorities and next steps. This is what sets us apart.

What is the next frontier for NetBase?

With the exploding growth of social and mobile data and new social networks emerging, we’ll be working on connecting all these data points to help our customers get even more out of social data. As Chief Scientist, I’m more excited than ever to develop a “recipe” that can work with the world’s languages and further expand our language offerings.

WE’RE GLOBAL: 42 LANGUAGES, 99+ COUNTRIES, 8 OFFICES

Handling Chinese NP predicate in HPSG

【A Parse a Day: A Parser Surpassing the People Who Built It Is Not Impossible】

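Bai:
“那些林彪说过的话” ("those words that Lin Biao said")
Let's see how the plural demonstrative (det) skips over a singular NP to find its head.

Me:

[figure: 0924a]

[figure: 0924b]
What is hard about that?

[figure: 0924c]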

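Seeing this last sentence come out, I cannot help feeling a little uneasy: at this rate, it is not impossible for the machine to surpass the people who built it. Insiders see the craft without being told, but today let me give newcomers a short primer. Why do I say that the deep parsing of this sentence reaches the level of a linguistics expert, already beyond an ordinary person's ability to analyze linguistic structure? This automatically generated, seemingly simple tree covers this much linguistics:

(1) The plural demonstrative 那批 skips the nearby 你, and even skips the relative-clause predicate 写-过, to attach to the distant head 文章 as its modifier (Mod). Impressive or not?

(2) It identifies the relative clause (Mod-S) 你写过的 and its head 文章;

(3) It identifies the subject (S) 你 and the logical object (O) 文章 of the relative-clause predicate 写过 (the decoding of the so-called argument structure);

(4) It also reveals the long-distance verb-object relation (O) between the sentence-initial noun phrase carrying the relative clause (“......文章”) and the later predicate 保存-着, which is fairly impressive too;

(5) In fact, the subject (S), predicate and object (O) of the main clause are all in place, and the small function words are all attached where they belong (X).

From the logical-semantic viewpoint of deep structural analysis, the analysis above is close to perfect.

End of primer.

For reaching this level of automatic deep parsing of Chinese sentences, a little showing off is perhaps a forgivable weakness (寡人之疾).

End of showing off.

Dropping the airs and dusting myself off, I go back to being modest and prudent, moving mountains like the Foolish Old Man.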

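Bai:
The Next links in that last sentence are somewhat redundant.
Even without them, all the useful relations are still there.

Me:
Next is a bridge (a stepping stone) that could be thrown away after use. Later I decided it could stay.
A souvenir of youth.
"Youth" is a commendatory word and "acting like a hooligan" is a derogatory one, but they are the same thing: blind restlessness. (Next preserves a little word-order information. Word order has no place in logic, but when semantics is grounded this trace can sometimes still be of some use.)

I have always believed that machines reaching or surpassing human level in structural analysis is within sight.
Semantic grounding after structural analysis is still some distance from human intelligence. But because grounding is almost always domain- or application-oriented, there is leverage: some problems that look enormous dissolve, or become simpler, within the pragmatics of a domain. Seen this way, NLU (or semantic computation) is a tractable monster.

In the past two months there were two cases of using an ox-cleaver to kill a chicken, one in English and one in Chinese. I am not allowed to give details, but I can tell the story in veiled terms. Both were natural-language problems regarded as roadblocks in some product area. After looking into them, my answer was: with the nuclear weapon of deep parsing, what is hard about this?

A trial run showed it really was an ox-cleaver killing a chicken: one look and you see the bottom. Many people think the nuclear-weapon talk is a wild exaggeration by 立法委. Heaven knows it really is not. The party concerned said that once this problem is solved in this product area, there will be many follow-up applications. But unless I have to, I would rather use the ox-cleaver on oxen than get stuck in the chicken coop killing chickens without end. There is no glory in such a win. Isn't there an old saying about not bowing for five pecks of rice? I hope it will not come down to the five pecks.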


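【Li-Bai Dialogues: How to Learn and Handle “打了一拳”】

Bai:
“张三打了李四一拳” (Zhang San gave Li Si a punch); “张三打李四的那一拳” (that punch Zhang San gave Li Si)
My questions: 1. Is the logical-semantic relation between 一拳 and 打 the same in the two examples?
2. If it is the same, is it a filler-and-slot relation (萝卜和坑, "carrot and pit")?
3. If so, does the slot come built in with 打, or is it forced out by the appearance of 一拳?
4. Are slots that are not built in but can be forced out an isolated phenomenon or a general one? Are they specific to Chinese or a language universal?
2': If it is not the same, how is the relation between the relative clause and the head 那一拳 in the second example established?
“张三喊了一嗓子” and “张三喊的那一嗓子，我老远就听见了” follow the same logic.
Also, for fixed phrases in which an instrument is extended into a move, such as 回马枪 and 窝心脚, can the classifier simply be dropped so that they combine directly with a numeral?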

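Me:

1. The logical semantics should be the same. Syntactically there is the difference between [subject-predicate] and [relative clause + NP], which is very typical.

2. Specifically, 打一拳 is a collocation, a compound verb, comparable to 洗澡, except that the latter is a verb-object collocation and the former a verb-complement collocation. Both are syntactic manifestations of compound words, and both involve the dynamic interface between lexicon and syntax.
Collocation between literals naturally belongs to filler-and-slot.
The fillers and slots in language are no more than: (1) a literal (word) provides a slot for a class of words (feature); (2) a literal (word) provides a slot for another literal (word), usually called strong collocation; (3) a class of words (feature) provides a slot for another class of words (feature). Type (3) is the behavior of ordinary syntax: abstract on both sides, with neither side grounded. Its rules (feature based grammar) generalize well but easily meet their Waterloo in exceptions. Lexicalized grammar or word driven rules move further and further away from (3), or restrict (3) to a very small number of rules. That leaves (1) and (2).
“打...一拳” is (2), which brings us to your third question: in a collocation of two literals, who expects whom?
Purely technically, there is no distinction at all; the two are equivalent. x and y hook up with each other, and whether we say x hooked y or y hooked x makes no difference. They are one family anyway, originally one word and one concept, only pulled apart in linguistic expression.

【3. If so, does the slot come built in with 打, or is it forced out by the appearance of 一拳?】
打一拳 is one lexical entry, conceptually fused, with no primary and secondary member (the head-complement hierarchy is internal to morphology and can be ignored). Operationally, though, there are choices. (I do not know whether Chinese collocation dictionaries list an entry like 打一拳 under 打, under 一拳, or under both.) In an NLP implementation, however, 打一拳 is, like 洗澡, a specific separable-word entry. Only the label differs, for example Vo versus Vbu, and the rest is left to syntax.

【4. Are slots that are not built in but can be forced out an isolated phenomenon or a general one? Are they specific to Chinese or a language universal?】
For collocations between literals, my view is that there is no question of built-in versus forced: it is mutual attraction by mutual consent.
This should be a general phenomenon, x--y: Chinese has 洗-澡 and English has take--bath. Literal-to-literal collocations whose morphology is verb-complement or verb-adverbial must exist in other languages too; I just cannot think of examples at the moment.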

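Bai:
打一苕帚疙瘩 (to hit someone with a broom stub) is also a collocation.
Anything handy can be grabbed and used to hit.
The collocation approach is too ad hoc.

Me:
All lexicons are ad hoc, otherwise it would not be called hard-coding (绑架). But the x--y collocation behind the entries is a language universal.

Bai:
The problem is that it cannot be exhausted, and it is productive, a regular phenomenon: 打两鞭子, 砍三刀, 踹五脚.

Me:
If it cannot be exhausted, then it is not x--y strong collocation. In theory, if it is not x --- y, it can only be x ---- feature or feature1 ----- feature2; there is no other box to put it in.
Are 砍三刀 and 洗三个澡 comparable? If so, it is x --- y, where only the numeral varies and both ends are fixed: 踹-脚, 砍--刀.

Bai:
Those with a classifier do not count; only those that omit the classifier count. It is clearly an instrument, but it is hard to say the original verb carries a built-in "instrument" slot.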

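Me:
There are some phenomena in the middle ground.
In the end it is a question of approach. On a lexicalist line, everything in the middle ground goes into the lexicon, never mind that it is ad hoc or redundant; the benefit is precision. With a "traditional" grammar, the middle ground is assigned to syntax, with full productivity; the benefit is decent recall, but it is easily disrupted by exceptions and loses precision. The two can also be combined: first write a catch-all rule for recall, and when it gets middle-ground cases wrong, plug the holes with the lexicon. A conceivable catch-all rule for recall looks like this, exploiting the syntactic fact that Chinese nouns usually cannot be modified directly by numerals:

V + CD + N --> V Buyu

This rule catches a lot, but it is dangerous. Patching can reduce the danger but cannot eliminate it, because that is the nature of feature based rules (POS is a feature).

Continuing the exercise, we can write a catch-all rule that captures the regularity of the phenomenon Bai describes:

V + (aspect particle) + CD + N ==> V <-- Buyu[CD+N]

This rule can parse all the phenomena listed above, but it is still too “powerful”: recall to spare, precision lacking. As for precision, in engineering it comes down to continually widening the tests. If the tests look fine, treat precision as not a problem. If the tests reveal problems, there are three routes: (1) refine this rule itself, narrowing the POS conditions to subclasses or ontology classes, or adding other constraints; (2) write another fine-grained rule to override it, so that the grammar becomes a hierarchy of modules; (3) throw the errors (exceptions) into the lexicon, which is in effect the limiting case of route (2), with the lexicon as the extreme end of the rule hierarchy. With rules designed as a hierarchy, from lexical rules, through fine-grained feature rules, to abstract POS-level rules, one can handle everything from exceptions and idiosyncrasies to generalizations, along with the gray zones in between.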
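As a rough illustration of the default rule V + (aspect particle) + CD + N and of route (3), lexical exceptions overriding it, here is a minimal sketch. It is not the author's rule formalism: the tag names (`V`, `AS`, `CD`, `N`), the function `match_buyu`, and the exception pair in `LEXICAL_EXCEPTIONS` are all assumptions made for the example.

```python
# Hypothetical tags: V = verb, AS = aspect particle, CD = numeral, N = noun.
# Assumed exception list: (verb, noun) pairs where CD+N is an object,
# not a verbal complement (Buyu). The single entry is illustrative only.
LEXICAL_EXCEPTIONS = {("杀", "人")}

def match_buyu(tokens):
    """Find V (AS)? CD N and return (verb_index, (cd_index, n_index)), else None."""
    for i, (word, tag) in enumerate(tokens):
        if tag != "V":
            continue
        j = i + 1
        if j < len(tokens) and tokens[j][1] == "AS":  # optional aspect particle
            j += 1
        if j + 1 < len(tokens) and tokens[j][1] == "CD" and tokens[j + 1][1] == "N":
            if (word, tokens[j + 1][0]) in LEXICAL_EXCEPTIONS:
                continue  # the lexicon entry overrides the POS-level default
            return i, (j, j + 1)
    return None

print(match_buyu([("砍", "V"), ("了", "AS"), ("三", "CD"), ("刀", "N")]))
print(match_buyu([("踹", "V"), ("五", "CD"), ("脚", "N")]))
print(match_buyu([("杀", "V"), ("了", "AS"), ("三", "CD"), ("人", "N")]))
```

The design point is the ordering: the abstract POS pattern gives recall, and each exception added to the lexicon buys back a little precision without touching the rule itself.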

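Too lazy for big data, too lazy even to hard-code the collocations in the lexicon: let the default rule above go into the system and make do for now, and wait for the exceptions to show up slowly later on.

[figure: 0925b]

[figure: 0925c]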

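Bai:
Why insist on rules at the fine-grained level?
The weakness of rules at the level discussed here is exactly where learning has the advantage.

Me:
We can do without fine granularity too: grasp the two ends and let them carry the middle. At worst there is some redundancy, with everything gray treated as black. "Cannot be enumerated" is only a rhetorical way of speaking. Statistically, whatever lies in the gray zone can certainly be enumerated; it is just that the enumeration eventually runs into a statistical long tail and one stops.

Bai:
My point is that there is no dichotomy here between lexical hard-coding and rules; it can also be learning-based.

Me:
Could Bai illustrate the learning-based approach: where does the advantage lie? (Actually, I do not see this problem as a challenge to rule systems. I do not think it is any more challenging than 洗澡.)

Bai:
It cannot be enumerated and the rules are messy, so it is just right to learn from a partial set of examples. Features are valuable, and long-tail instances are valuable too; learning them bundled together is the right way, giving both generalization and rote memorization.

Bai:
Handling something regular by rote memorization is forcing a good kid to act like a hooligan.

Me:
Seen charitably, one could also say it teaches the child to be down-to-earth, one step at a time.

Bai:
In the gray zone between generalization and rote memorization, use learning where learning fits.
It is unsatisfying to look at, and it is not as if there were no alternative.
Only exam-oriented education and last-minute cramming kill everything that is alive.

Me:
The root of the matter is that, to date, a system is either statistical or rule-based. So-called hybrid systems are mostly two systems stacked together, not fused. In that context, it is not a matter of rules handling rules and the lexicon handling the lexicon, with some statistical learning mixed in between, although the latter should be a research direction and could probably do better than stacked hybrids. If Bai means a pure learning system, that is a different discourse altogether: no comment. From the rule side, grasping both ends and treating gray as black is no problem; it just takes time. The general rules guarantee recall, and precision is a function of time.

Bai:
What I mean is this: rules decide who may combine with whom; among candidates that equally satisfy the rules, learning decides who is excluded from combining with whom. But this is unsupervised learning, with the labels coming from the lexicon. The rule part involves only fillers, slots and caps (帽子), not subcat. The learning part is what uses subcat.

Me:
In fact, just take the simple pattern V+CD+N to massive data, and what it brings back gives unsupervised learning most of what it needs. This is a very narrow linguistic phenomenon. The result of the unsupervised learning is knowledge acquisition for this particular subcat, an offline learning process. The learned results are then used to support parsing.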
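The offline acquisition step can be pictured with a minimal sketch. It assumes a corpus that has already been POS-tagged; the tag names, the function `harvest`, and the three toy sentences are assumptions for illustration, not real data or the author's pipeline.

```python
from collections import Counter

def harvest(corpus, verb):
    """Count nouns N seen in the pattern verb + CD + N over a tagged corpus."""
    counts = Counter()
    for sent in corpus:
        for (w1, t1), (_, t2), (w3, t3) in zip(sent, sent[1:], sent[2:]):
            if w1 == verb and t1 == "V" and t2 == "CD" and t3 == "N":
                counts[w3] += 1
    return counts

# Toy tagged corpus (hypothetical); a real run would use massive data.
corpus = [
    [("他", "N"), ("打", "V"), ("一", "CD"), ("拳", "N")],
    [("她", "N"), ("打", "V"), ("两", "CD"), ("鞭子", "N")],
    [("我", "N"), ("打", "V"), ("三", "CD"), ("拳", "N")],
]
print(harvest(corpus, "打").most_common())
```

The resulting noun list is the acquired subcat knowledge for that verb; a frequency threshold (not shown) would be one simple way to cut the long tail.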

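Bai:
Actually this thread has drifted. My intent was to discuss non-default slots that get forced out.
If that can be done, it may bring us closer to the essence of language.

“他上学的那个学校” (the school where he studies); “他约会的那个晚上” (the evening he had the date).

Even without a numeral, there are cases where a noun that serves as an adverbial or complement in one construction serves as a main constituent in a related construction, with the logical-semantic relation unchanged. The real identity of that noun is a role such as instrument, location or time, which was not a default role for the verb. Once the noun arrives in a certain position, it forces the verb to make this role a default.
English stranded prepositions, as in the man you look for, give such nouns a definite identity: even in a relative clause they are of a lesser line (raised by the preposition, not by the verb). Of course one can also say they are raised by the verb-preposition combination look for.
In Chinese, once inside a relative clause, one cannot tell who raised them. The preposition disappears anyway, and keeping it is actually wrong. To keep it, the zero form must be replaced by a real pronoun: “你在其中上学的学校”, “你与之结婚的女人”.

Adding a numeral merely highlights the verbal-quantity meaning; it does not change the logical-semantic relation.

砍张三的斧子 ... focus on the instrument
砍张三的两斧子 ... focus on the number of actions
砍张三的斧子 ... the axe used to (以/之/其) chop Zhang San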

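Me:
A complement that expresses the number of times is a "bleached" (and at the same time "vivid") use in language of the logical-semantic instrument. This bleached use is not itself a language universal, but it maps onto the deep logical-semantic [Instrument], and [Instrument] is universal. For 砍, [Instrument] is not a default that was forced out but a built-in default; if in doubt, check Dong's HowNet. The default for 结婚 is with [human]. For 上学, is the school built in? One can probably say so too, though I do not know whether HowNet has a location slot for 上学 with school as the default.

One can try a completely random attributive or adverbial, and it does not seem to work. It seems hard to find a case with the same logical semantics that can take part in both constructions: the complement construction (number of times) and the attributive construction. In other words, this phenomenon is either collocation or an extension of collocation, not a combination with a random modifier (adjunct), nor an adjunct forced into a complement. The logical semantics inside is some argument of the conceptual relation, and the combination is inherently motivated. Such collocation can be word-to-word (both feet on the ground) or word-to-subclass (feature: one foot on the ground). The former is strong collocation hard-coded in the lexicon; the latter is gray, cannot necessarily be hard-coded, and can be learned statistically.

Bai:
That is exactly what I wanted to say.

Me:
Bai is doing far more than moving a thousand pounds with four ounces, lol.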

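Acquisition of word-to-subclass subcat, for example a verb requiring a certain kind of object (say [human]), can be learned from big data: this idea has been around for a while. A professor at Cambridge University advocated this kind of learning years ago, apparently ran a batch of experiments, and as I recall published some papers. But overall such research has been scattered. Research stays research and applications stay applications, and there seem to be no impressive results from combining the two.

Bai:
Collocation learning that is not anchored to structure is hopeless.
If you learn structure and collocation at the same time, it is bound to be chaos.
One has to select a few possible structures and let collocation discriminate further among them, each doing its own job.

Bai:
The instrument of 砍 can be a default; for 打 it cannot. The subcats that fit 打 are very untidy. What we have in mind is "objects handy enough to grab", but the subcat list will not hand you that neatly. So many subcats and many word instances all have to be treated as features, and one must find a way to learn from the examples that can be enumerated (including word-subcat sub-rules that are already confirmed).
A stove is too big to grab. A house is bigger still. A broom is just the right size. Bacteria are too small. So “张三打李四一大肠杆菌” (Zhang San hit Li Si with an E. coli) does not work.

Me:
With the pattern 打+CD+N, given massive data, it learns accurately every time. There is no need to fear noise, because this pattern works very well.
This reminds me that more than ten years ago someone at Google published a paper that used two specially chosen ngram patterns to learn an ISA taxonomy, which was impressive. Later we reproduced that work, and although we never really used the results, the approach is right. Along similar lines, HowNet too could one day be learned, as long as Dong defines the nature of the few semantic relations to be learned and suitable patterns are found.
The two patterns Google used are: N such as X, Y, Z; X, Y, Z and other N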

e.g.
furniture such as desks, chairs, coffee-tables
desks, chairs, coffee-tables and other furniture (will all be on sale)
taxonomy is: {X, Y, Z} -->N
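The two patterns can be turned into a minimal extraction sketch. This is not the cited paper's implementation: the regular expressions, the function name `isa_pairs`, and the restriction to single-word items are simplifying assumptions.

```python
import re

# N such as X, Y, Z
SUCH_AS = re.compile(r"(\w+) such as ((?:[\w-]+, )*[\w-]+)")
# X, Y, Z and other N
AND_OTHER = re.compile(r"((?:[\w-]+, )*[\w-]+) and other (\w+)")

def isa_pairs(sentence):
    """Return (hyponym, hypernym) pairs found by the two patterns."""
    pairs = []
    m = SUCH_AS.search(sentence)
    if m:
        pairs += [(x, m.group(1)) for x in m.group(2).split(", ")]
    m = AND_OTHER.search(sentence)
    if m:
        pairs += [(x, m.group(2)) for x in m.group(1).split(", ")]
    return pairs

print(isa_pairs("furniture such as desks, chairs, coffee-tables"))
print(isa_pairs("desks, chairs, coffee-tables and other furniture (will all be on sale)"))
```

Both example sentences yield the same three pairs, which is the {X, Y, Z} --> N taxonomy fragment; over a large corpus, pairs would be counted and filtered by frequency.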

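What is the use of learning it, when people can think it all out slowly anyway? HowNet is rich in semantic relations, so it took many years to compile, but it did get compiled in the end and is nearly complete (Dong now seems to make only occasional additions). Since experts can build it by hand, complete and refined, why expect big data to acquire this knowledge? That is one side of the matter: especially for relatively stable, long-lasting conceptual semantic relations, there is indeed no reason not to use the experts' product.

The other side is that, for conceptual relations with some fluidity, experts can hardly keep up with machine acquisition, and then there is the knowledge of different domains, and so on. This is territory beyond human effort, where one can only rely on big data and machines. The Google paper mentioned above gave some particularly interesting examples. I recall it learned the hyponyms of dictator, whose members were very characteristic of big data: Castro, Mao Zedong, Stalin, Hitler, etc.

Bai:
That is subjective classification, not suitable for a dictionary. And then there are instances of "well-known brands", which have commercial value right away.

Me:
Isn't that the work I do every day: social media mining of public opinions and sentiments
Our company regularly publishes finely printed rankings of the reputation of well-known global brands. Earlier issues covered luxury brands (designer bags, luxury cars, high-end perfumes). The most recent issue is: Social Media Industry Report 2016: Restaurant Brand

I just tested Bai's example sentences, and the most bizarre one is this:

[figure: 0925a]

A tree graph shaped like a gourd: I had really never seen one before. (The lexicon lacks the function word 与之, and PP did not compose it either, so it was skipped.) Even so, the whole graph is quite logical, by whatever stroke of luck: 你 is one party to the marriage (S), 女人 is also a party to the marriage (S), and the event of these two parties marrying is a relative clause (Mod-S) that modifies 女人. As for the small words 的 and 之, and the groping hand Next, these are all stepping stones that help build the structure. These surface items have nothing to do with logical semantics. They are left there not to be an eyesore, but in case some help from surface traces is needed when semantics is grounded in pragmatics. After all, the purpose of semantic computation is not to draw pretty logical graphs for our own amusement, but to ground the semantics and build products.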


Chart Parsing Chinese Character Strings

W. Li. 1997. Chart Parsing Chinese Character Strings. In
Proceedings of the Ninth North American Conference on Chinese
Linguistics (NACCL-9). Victoria, Canada.

Chart Parsing Chinese Character Strings [1]

Wei LI

Simon Fraser University
Burnaby B.C. V5A 1S6 CANADA ([email protected])

ABSTRACT

This paper examines problems in word identification for a Chinese natural language processing system and presents our solution to these problems. In conventional systems, written Chinese parsing takes two steps: (1) a segmentation preprocessor for word identification (segmenter); (2) a grammar parsing the string of identified words. Morphological analysis, when required, as in the case of productive word formation, has to be incorporated in the segmenter. This matches the conventional morphology-before-syntax architecture. We will demonstrate the theoretical defect of this architecture when applied to Chinese. This leads to the conclusion that segmentational approach, despite its being the mainstream in Chinese computational morphology, is in general not adequate for the task of Chinese word identification. To solve this problem, a full grammar should be made available. Therefore, we take an alternative one-step approach. We have implemented an integrated grammar of morphology and syntax for directly parsing a string of Chinese characters, building both morphological and syntactic structures. Compared with the conventional two-step approach, our strategy has advantages in resolving ambiguity in word identification and in handling productive word formation.

1. Introduction

A written Chinese sentence is a string of characters with no blanks to mark word boundaries. In conventional systems, Chinese parsing takes two steps as shown in the following Figure 1: (1) a segmentation preprocessor (called segmenter) for word identification; (2) a word based parsing grammar, building syntactic structures (Feng 1996; Chen & Liu 1992).

[Figure 1: hpsg4]

In contrast, we take an alternative one-step approach, as shown in Figure 2 below. We have implemented a grammar named W‑CPSG (for Wei's Chinese Phrase Structure Grammar). W‑CPSG integrates morphology and syntax for character based parsing, building both morphological and syntactic structures.

[Figure 2: hpsg5]

In the two-step architecture, the purpose for the segmenter is to properly identify a string of words to feed syntax. This is not an easy task due to the possible involvement of the segmentation ambiguity. For example, given a string of 4 Chinese characters 研究生命, the segmentation ambiguity is shown in (1.a) and (1.b) below.

(1.) 研究生命

(a) 研究生 | 命
graduate student | life or destiny

(b) 研究 | 生命
study | life

The resolution of the above ambiguity in the segmenter is a hopeless job because such ambiguity is syntactically conditioned. For sentences like 研究生命金贵 (life for graduate students is precious), (1.a) is the right identification. For the phrase 研究生命起源 (to study the origin of life), (1.b) is right. So far there are no segmenters which can handle this properly and guarantee right word segmentation (Feng 1996). In fact, there can never be such segmenters as long as a grammar is not brought in. This is a theoretical defect of all Chinese analysis systems in the conventional architecture. We have solved this problem in our morphology-syntax integrated W‑CPSG. Word identification in our design becomes a by-product of parsing instead of a pre-condition for parsing.

In the text below, Section 2 investigates problems with the conventional two-step approach. In Section 3, we will present the W‑CPSG one-step approach and demonstrate how W‑CPSG parsing solves these problems. The following is a list of abbreviations used in this paper.

A (Adjective); AF (Affix); BM (Bound Morpheme);
CLA (Classifier); CLAP (Classifier Phrase);
DE (Chinese particle introducing a modifier of noun); DEP (DE Phrase);
DE3 (Chinese particle introducing a modifier of result or capability);
DET (Determiner); LE (Chinese perfective aspect marker);
N (Noun); NP (Noun Phrase); P (Preposition); PP (Prepositional Phrase);
S (Sentence); V (Verb); VP (Verb Phrase); Vt (Transitive Verb)

2. Problems Challenging Segmenters

In general, there are two basic problems for segmenters, namely, segmentation ambiguity and productive word formation.

2.1. segmentation ambiguity

This sub-section studies the segmentation ambiguity for Chinese word identification. We indicate that this ambiguity is structural in nature. Therefore it should be captured by structural trees via parsing. We conclude that a parsing grammar is indispensable in the resolution of the segmentation ambiguity.

Behind all segmenters are procedure based segmentation algorithms. Most proposals are some modified versions of large-lexicon based matching algorithms. As an underlying hypothesis, a longer match overrides a shorter match, hence the name maximum match. Decided by the direction of the procedure, i.e. whether the segmentation proceeds from left (the beginning of a string) to right (the end of the string) or from right to left, we have two general types of maximum match: (1) FMM (Forward Maximum Match) algorithm; (2) BMM (Backward Maximum Match) algorithm (Feng 1996).
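The two maximum-match procedures can be sketched as follows. This is a minimal illustration rather than any particular segmenter's code: the function names, the `max_len` parameter, the single-character fallback for unknown strings, and the toy lexicon are assumptions.

```python
def fmm(text, lexicon, max_len=4):
    """Forward Maximum Match: scan left to right, taking the longest lexicon match."""
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in lexicon or n == 1:  # fall back to a single character
                words.append(text[i:i + n])
                i += n
                break
    return words

def bmm(text, lexicon, max_len=4):
    """Backward Maximum Match: scan right to left, taking the longest lexicon match."""
    words, j = [], len(text)
    while j > 0:
        for n in range(min(max_len, j), 0, -1):
            if text[j - n:j] in lexicon or n == 1:
                words.insert(0, text[j - n:j])
                j -= n
                break
    return words

LEXICON = {"研究", "研究生", "生命", "命", "起源", "金贵"}  # toy lexicon
print(fmm("研究生命起源", LEXICON))
print(bmm("研究生命起源", LEXICON))
```

On this string the two directions disagree, which is the signal used for ambiguity detection in the combined approach discussed next in the text.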

According to Liang 1987, segmenters have trouble with cases involving the segmentation ambiguity. There are two types of segmentation ambiguity: the cross ambiguity (AB|C vs. A|BC) and the embedded ambiguity (AB vs. A|B).

To detect possible ambiguity, many researchers use the technique of combining the FMM algorithm and the BMM algorithm. When the outputs of FMM and BMM are different, there must be some ambiguity involved. The following table lists the cases associated with the FMM and BMM combined approach.[2]

[Table: hpsg6]

The following 3 examples all contain a cross ambiguity sub-string 研究生命 with 2 segmentation possibilities: 研究生|命 and 研究|生命. Example (4.) is a genuinely ambiguous case. Genuinely ambiguous sentences cannot be disambiguated within the sentence boundary, rendering multiple readings.

(2.) case 1: 研究生命金贵。

(a) 研究生 | 命 | 金贵 (FMM: correct)
graduate student | life | precious
Life for graduate students is precious.

(b) * 研究 | 生命 | 金贵 (BMM: incorrect)
study | life | precious

(3.) case 2: 研究生命起源。

(a) * 研究生 | 命 | 起源 (FMM: incorrect)
graduate-student | life | origin

(b) 研究 | 生命 | 起源 (BMM: correct)
study | life | origin
to study the origin of life

(4.) case 3: 研究生命不好。

(a) 研究生 | 命 | 不 | 好 (FMM: correct)
graduate student | destiny | not | good
The destiny of graduate students is not good.

(b) 研究 | 生命 | 不 | 好 (BMM: correct)
study | life | not | good
It is not good to study life.

The following example is a complicated case of cross ambiguity, involving more than 2 ways of segmentation. Both the FMM segmentation 出现|在世|界 and the BMM segmentation 出|现在|世界 are wrong. A third segmentation 出现|在|世界 is right.

(5.) case 4: 出现在世界东方。

(a) * 出现 | 在世 | 界 | 东方 (FMM: incorrect)
appear | be-alive | BM | east

(b) * 出 | 现在 | 世界 | 东方 (BMM: incorrect)
out | now | world | east

In the following examples (6.) through (8.), 烤白薯 involves embedded ambiguity. As separate words, the verb 烤 (bake) and the NP 白薯 (sweet potato) form a VP. As a whole, it is a compound noun 烤白薯 (baked sweet potato). In cases of the embedded ambiguity, FMM and BMM always make the same segmentation, namely AB instead of A|B. It may be the only right choice, as seen in (6.). It may be wrong as shown in (7.). It may only be half right, as in the case of genuine ambiguity shown in (8.).

(6.) case 5: 他吃烤白薯。

(a) 他 | 吃 | 烤白薯 (FMM&BMM: correct)
he | eat | baked sweet potato
He eats baked sweet potatoes.

(b) * 他 | 吃 | 烤 | 白薯 (incorrect)
he | eat | bake | sweet potato

(7.) case 6: 他会烤白薯。

(a) * 他 | 会 | 烤白薯 (FMM&BMM: incorrect)
he | can | baked sweet potato

(b) 他 | 会 | 烤 | 白薯 (correct)
he | can | bake | sweet potato
He can bake sweet potatoes.

(8.) case 7: 他喜欢烤白薯。

(a) 他 | 喜欢 | 烤白薯 (FMM&BMM: correct)
he | like | baked sweet potato
He likes baked sweet potatoes.

(b) 他 | 喜欢 | 烤 | 白薯 (correct)
he | like | bake | sweet potato
He likes baking sweet potatoes.

Comparing the above examples, we see that there are severe limitations for the FMM-BMM combined approach. First, it only serves the purpose of ambiguity detection (when the results of FMM and BMM do not match), and contributes nothing to its resolution. It has no way to tell which segmentation is right (compare case 1 and case 2), and, worse still, whether both are right (case 3) or wrong (case 4). Second, even when the results of FMM and BMM do match, it by no means guarantees right segmentation (case 6). Third, as far as detection is concerned, it is only limited to the problems for the cross ambiguity. The existence of the embedded ambiguity defines a blind area for this way of detection (case 6 and case 7). This is because the underlying maximum match hypothesis assumed in the FMM and BMM segmentation algorithms is directly contradictory to the phenomena of the embedded ambiguity.
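The blind area can be made visible by enumerating every segmentation the lexicon allows instead of committing to the longest match. The sketch below is illustrative only; the function name `all_segmentations` and the toy lexicon are assumptions.

```python
def all_segmentations(text, lexicon):
    """Return every way to split text into words that are all in the lexicon."""
    if not text:
        return [[]]
    results = []
    for n in range(1, len(text) + 1):
        head = text[:n]
        if head in lexicon:
            for rest in all_segmentations(text[n:], lexicon):
                results.append([head] + rest)
    return results

LEXICON = {"他", "会", "烤", "白薯", "烤白薯"}  # toy lexicon
for seg in all_segmentations("他会烤白薯", LEXICON):
    print(" | ".join(seg))
```

For case 6 this yields both 他 | 会 | 烤 | 白薯 and 他 | 会 | 烤白薯, whereas any maximum-match procedure only ever proposes the second. Choosing between the candidates still requires grammatical information.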

In face of ambiguity, how do people judge which segmentation is right in the first place? It really depends on whether we can understand the sentence or phrase based on the segmentation. In computational linguistics, this is equivalent to whether the segmented string can be parsed by a grammar. The segmentation ambiguity is one type of structural ambiguity, not in essence different from typical structural ambiguity like, say, PP attachment ambiguity. In fact, PP attachment problem is a counterpart of the cross ambiguity in English syntax, as shown below.

(9.) Cross ambiguity in PP attachment: V NP PP

(a) [V NP] [PP]
(b) [V] [NP PP]

Therefore, like English PP attachment, Chinese word segmentation ambiguity should also be captured by a parsing grammar. A parser resolves the ambiguity if it can, or detects the ambiguity in the form of multiple parses when it cannot. As shall be demonstrated in Section 3, wrong segmentation will not lead to a parse. Right segmentation results in at least one successful parse. In any case, at least a parser (hence a grammar on which the parser is based) is required for proper word identification.
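The claim that only the right segmentation leads to a parse can be illustrated with a toy CYK-style recognizer. This is a sketch under strong assumptions and is unrelated to the actual W‑CPSG implementation: the three binary rules, the category sets in `LEXICON`, and the accepted goal categories are invented for the example.

```python
# Toy grammar: binary rules mapping (left, right) to a parent category.
RULES = {("N", "N"): "NP", ("NP", "A"): "S", ("V", "NP"): "VP"}
# Toy lexicon: a bare noun may also stand as an NP.
LEXICON = {
    "研究生": {"N", "NP"}, "命": {"N", "NP"}, "生命": {"N", "NP"},
    "起源": {"N", "NP"}, "研究": {"V"}, "金贵": {"A"},
}

def parses(words, goals=("S", "VP")):
    """Return True if the word sequence is covered by one goal category."""
    n = len(words)
    chart = {(i, i + 1): set(LEXICON.get(w, ())) for i, w in enumerate(words)}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            k = i + span
            cell = set()
            for j in range(i + 1, k):
                for (left, right), parent in RULES.items():
                    if left in chart[(i, j)] and right in chart[(j, k)]:
                        cell.add(parent)
            chart[(i, k)] = cell
    return any(g in chart[(0, n)] for g in goals)

print(parses(["研究生", "命", "金贵"]), parses(["研究", "生命", "金贵"]))
print(parses(["研究", "生命", "起源"]), parses(["研究生", "命", "起源"]))
```

Under this toy grammar, only the segmentation marked correct in cases 1 and 2 receives a full parse, so word identification falls out of parsing rather than preceding it.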

The important thing is that the ambiguity in word identification is a grammatical problem. The attempt to solve this problem without a grammar is bound to be crippled. Since traditional segmentation algorithms are non-grammatical in nature, they are theoretically not equipped for handling such ambiguity. A successive model of segmenter-before-grammar attempts to do what it is not yet able to do. This is the theoretical defect for almost all existing segmentation approaches.

(10.) Conclusion for 2.1.

The segmentation ambiguity in word identification is one type of structural ambiguity. In order to solve this problem, a parsing grammar is indispensable.

2.2. productive word formation

Unless morphological analysis is incorporated, lexicon match based segmenters will have trouble with new words produced by Chinese productive word formation, including reduplication, derivation and the formation of proper names. When the morphology component is incorporated in the segmenter, the two-step design becomes a variant of the conventional morphology-before-syntax architecture. But this architecture is not effective when the segmentation ambiguity is at issue.

In the following, we investigate reduplication, derivation and proper names one by one. In each case, we find that there is always a possible involvement of the segmentation ambiguity. This problem cannot be solved by a morphology component independent of syntax. We therefore propose a grammar incorporating both morphology and syntax.

2.2.1. reduplication

Reduplication in Chinese serves various grammatical and/or lexical functions. Not all reduplications pose challenges to segmentation algorithms. Assume that a word consists of 2 characters AB, reduplication of the type AB --> ABAB is no problem. What becomes a problem for word segmentation is the reduplication of the type AB --> AABB or its variants like AB --> AAB. For example, a two-morpheme verb with verb-object relation at the level of morphology has the following way of reduplication.

(11.) Verb Reduplication: AB --> AAB (for diminutive use)

分心 (get distracted) --> 分分心 (get distracted a bit)

让他分分心。

让 | 他 | 分分心
let | he | get distracted a bit
Let him relax a while.

It seems that reduplication is a simple process which can be handled by incorporating some procedure-based function calls in the segmentation algorithm. If a 3-character string, say 分分心, cannot be found in the lexicon, the reduplication procedure will check whether the first 2 characters are the same, and if yes, delete one of them and consult the lexicon again. But such expansion of the segmentation algorithm is powerless when the segmentation ambiguity is involved. For example, it is wrong to regard 分分心 as a case of reduplication in the following sentence.

(12.) 这件事十分分心。

(a) * 这 | 件 | 事 | 十 | 分分心
this | CLA | thing | ten | get distracted a bit

(b) 这 | 件 | 事 | 十分 | 分心
this | CLA | thing | very | distracting
This thing is very distracting.
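The reduplication procedure just described, and the way it misfires, can be sketched in a few lines. The function name `lookup_with_reduplication` and the two-word lexicon are assumptions for illustration.

```python
def lookup_with_reduplication(s, lexicon):
    """Return the lexicon form for s, undoing AAB reduplication if needed."""
    if s in lexicon:
        return s
    # AAB check: first two characters identical, and AB is a known word.
    if len(s) == 3 and s[0] == s[1] and s[1:] in lexicon:
        return s[1:]
    return None

LEXICON = {"分心", "十分"}  # toy lexicon
print(lookup_with_reduplication("分分心", LEXICON))   # genuine reduplication, as in (11.)
sentence = "这件事十分分心"
print(lookup_with_reduplication(sentence[4:7], LEXICON))  # same substring inside (12.)
```

The procedure returns 分心 in both calls, because it only inspects the three characters it is given. It cannot know that in (12.) the first 分 belongs to 十分, which is exactly the ambiguity a string-local procedure is unable to resolve.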

2.2.2. derivation

In Contemporary Mandarin, there have come to be a few morphemes functioning similarly to English affixes, e.g. 可 (-able) turns a transitive verb into an adjective.

(13.) 可 (-able) + Vt --> A

可 (-able) + 读 (Vt: read) --> 可读 (A:readable)

这本书非常可读。

这 | 本 | 书 | 非常 | 可读
this | CLA | book | very | readable
This book is very readable.

The suffix 性 works just like '-ness', changing an adjective into an abstract noun. The derived noun 可读性 (readability) in the following example, similar to its English counterpart, involves a process of double affixation.

(14.) A + 性 (-ness) --> N
可 (-able) + 读 (Vt: read) --> 可读 (A:readable)
可读 (A:readable) + 性 (-ness) --> 可读性 (N:readability)

这本书的可读性

这 | 本 | 书 | 的 | 可读性
this | CLA | book | DE | readability
this book's readability

The suffix 头 can change a transitive verb into an abstract noun adding to it the meaning "worth-of".

(15.) Vt + 头 (AF:worth of) --> N

吃 (Vt:eat) + 头 (AF:worth of) --> 吃头 (N:worth of eating)

这道菜没有吃头

这 | 道 | 菜 | 没有 | 吃头
this | CLA | dish | not-have | worth-of-eating
This dish is not worth eating.

It is not difficult to incorporate in the segmenter these derivation rules for the morphological analysis. But, as in the case of reduplication, there is always a danger of wrongly applying the rules due to possible ambiguity involved. For example, 吃头 is a sub-string of embedded ambiguity. It can be either a derived noun 'worth of eating' or two separate words as seen in the following example.

(16.) 他饿得能吃头牛。

(a) * 他 | 饿 | 得 | 能 | 吃头 | 牛
he | hungry | DE3 | can | worth-of-eating | ox

(b) 他 | 饿 | 得 | 能 | 吃 | 头 | 牛
he | hungry | DE3 | can | eat | CLA | ox
He is so hungry that he can eat an ox.

2.2.3. proper name

Proper names are of 2 major types: (1) Chinese names; (2) transliterated foreign names. In this paper, we only target the identification of Chinese names and leave the problem of transliterated foreign names for further research (Li, 1997b).

A Chinese human name usually consists of a family name followed by a given name. Chinese family names form a clear-cut closed set. A given name is usually either one character or two characters. For example, the late Chinese chairman 毛泽东 (Mao Zedong) used to have another name 李得胜 (Li Desheng). In the lexicon, 李 is a registered family name. Both 得胜 and 胜 mean 'win'. This may lead to 3 ways of word segmentation: (1) 李得胜; (2) 李|得胜; (3) 李得|胜, as seen in the following examples.

(17.) 李得胜了

(a) 李 | 得胜 | 了.
Li | win | LE
Li won.

(b) 李得 | 胜 | 了
Li De | win | LE
Li De won.

(18.) 李得胜胜了。

(a) * 李 | 得胜 | 胜 | 了.
Li | win | win | LE

(b) * 李得 | 胜 | 胜 | 了
Li De | win | win | LE

Since the given name like 得胜 is an arbitrary string of 1 or 2 characters, the morphological analysis of the full name should start with family name which can optionally combine with any 1 or 2 characters to form candidate proper names 李, 李得 and 李得胜. In other words, family name serves as the left boundary of a full name and the length is used to determine candidates. The right segmentation can only be made via sentence analysis as shown in the above examples.
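Candidate generation from the left boundary can be sketched as below. The function name `name_candidates` and the tiny `FAMILY_NAMES` set are illustrative assumptions; selecting among the candidates is left to sentence analysis, as the text says.

```python
FAMILY_NAMES = {"李", "毛"}  # tiny illustrative subset of the closed set

def name_candidates(text, start):
    """Family name at text[start] plus 0, 1 or 2 following characters."""
    if start >= len(text) or text[start] not in FAMILY_NAMES:
        return []
    return [text[start:start + 1 + k]
            for k in range(3)
            if start + 1 + k <= len(text)]

print(name_candidates("李得胜了", 0))
```

For (17.) this produces the three candidates 李, 李得 and 李得胜, each of which then has to be verified by whether the rest of the sentence parses.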

Most Chinese place proper names are made of 1 to 3 characters, for example, 芜湖市 (Wuhu City), 南陵县 (Nanling County). The arbitrariness of these names makes any sub-strings of n characters (0<n<4) in the sentence a suspect. Fortunately, in most cases we may find boundary indicators of these names, like 省 (province), 市 (city), 县 (county), etc. Once the boundary indicator is located, the similar technique in using Chinese family name to identify the given name can be applied to select candidates of place proper names for verification through grammatical analysis.

In general, there is always a possibility of ambiguity involvement in the formation of all types of proper names.

(19.) Conclusion for 2.2.

Due to the possible involvement of ambiguity, a parsing grammar for morphological analysis as well as for sentence analysis is required for the proper identification of the words produced by Chinese productive word formation.

3. W‑CPSG Grammatical Approach

This section presents W‑CPSG approach to Chinese word identification and morphological analysis. We will demonstrate how a parser based on W‑CPSG solves the problems of the word identification ambiguity and productive word formation.

3.1. rationale of W‑CPSG approach

There have been a number of word identification algorithms based on both morphological and syntactic information (see survey in Feng 1996 and Sun & Huang 1996). Most such approaches do not use a self-contained grammar to parse the complete sentence. They are confined to the conventional two-step process of the segmentation-before-grammar design. As long as the word identification procedure is independent of a parsing grammar, it is extremely difficult to make full use of grammatical information to resolve ambiguity in word identification. Careful tuning up and sophisticated design improves the precision but will not change the theoretical defect of all such approaches. Chen & Liu acknowledge the limitation of their approach due to the lack of a grammar. “However”, they say, “it is almost impossible to apply real world knowledge nor to check the grammatical validity at this stage”. (Chen & Liu 1992, p.105) Why impossible at this stage? Because these segmentation systems are based on the concept of two-step architecture and the grammar is not yet available! As we have demonstrated, the final judgment for proper word identification can hardly be made until the whole sentence is parsed, hence the requirement of a full grammar. Therefore, such systems are forced to make a compromise: how much grammatical information to involve depends on how much word identification precision they can afford to sacrifice. Needless to say, there is significant double labor between such a word segmentation procedure and the following stage of parsing. As more and more grammatical information is used to achieve better precision, the overhead of this double labor becomes more serious. We consider the double labor as one strong argument against the two-step approach. If enough grammatical information is incorporated, it is essentially equivalent to a grammar. And the segmenter will be equivalent to a parser. Then why two grammars, one for word identification, and one for sentence parsing? Why not combine them? That is exactly what we are proposing in W‑CPSG: a one-step approach based on an integrated grammar, eliminating the necessity of a segmentation preprocessor.

3.2. W‑CPSG character-based parsing

W‑CPSG (Li 1997a, 1997b) is a lexicalized Chinese unification grammar. The work on W‑CPSG is carried out in the spirit of the modern linguistic theory Head-driven Phrase Structure Grammar (Pollard & Sag 1994). W‑CPSG consists of two parts: a minimized general grammar and an enriched lexicon. The general grammar contains only a handful of PS (phrase structure) rules, covering complement structure, modifier structure, conjunctive structure and morphological structure. This is the nature of lexicalized grammars. PS rules in such grammars are very abstract. Essentially, all they say is one thing: two signs can combine so long as the lexicon so indicates. The lexicon houses lexical entries with their linguistic description in feature structures. Potential morphological structures as well as potential syntactic structures are lexically encoded. In syntax, a word expects another sign to form a phrase. In morphology, a morpheme expects another sign to form a word. For example, the prefix 可 (-able) expects a transitive verb to form an adjective. The morphological PS rule will build the morphological structure when a transitive verb does appear after the prefix 可 (-able) in the input string.
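The idea of a lexically encoded expectation can be sketched in a few lines of Python. This is only an illustration: the `Sign` class, the category labels and the `combine` function are assumptions made for this sketch, not the ALE feature structures actually used in W‑CPSG.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Sign:
    form: str                      # surface characters
    cat: str                       # category of the sign itself
    expects: Optional[str] = None  # category expected to the right, if any
    result: Optional[str] = None   # category built once the expectation is met

# Hypothetical toy lexicon: the prefix 可 expects a transitive verb (Vt)
# and yields an adjective (A), as described in the text.
LEXICON = {
    "可": Sign("可", "Prefix", expects="Vt", result="A"),
    "读": Sign("读", "Vt"),
    "心": Sign("心", "N"),
}

def combine(left: Sign, right: Sign) -> Optional[Sign]:
    """One abstract rule: two signs combine only if the lexicon so indicates."""
    if left.expects is not None and left.expects == right.cat:
        return Sign(left.form + right.form, left.result)
    return None

print(combine(LEXICON["可"], LEXICON["读"]))  # an adjective sign for 可读
print(combine(LEXICON["可"], LEXICON["心"]))  # None: 心 is not a transitive verb
```

The rule itself knows nothing about 可 or 读; all the specific information sits in the lexical entries.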

We now illustrate how W‑CPSG parses a string of Chinese characters by a sample parsing chart. The prototype of W‑CPSG was written in ALE, a grammar compiler developed on top of Prolog by Carpenter & Penn (1994). ALE compiles W‑CPSG into a Chinese parser, a Prolog program ready to accept a string of characters for analysis. W‑CPSG parse tree embodies both morphological analysis and syntactic analysis, as shown below.

hpsg12

This is so-called bottom-up parsing. It starts with lexicon look-up. Edges 1 through 7 are lexical edges. Other edges are phrasal edges. Each edge represents a sign, i.e. a character (morpheme), a word, a phrase or a sentence. Lexical edges result from a successful match between the signs in the input string and the entries in the lexicon during lexicon look-up. For example, 可 (-able), 读 (read) and 性 (-ness) are all registered entries in the lexicon, so they get matched and shown by edge 5, edge 6 and edge 7. Words produced by productive word formation present themselves as phrasal edges, e.g. edge ((5+6)+7) for 可读性 (readability). For the sake of concise illustration, we only show two pieces of information for the signs in the chart, namely category and interpretation with a delimiting colon (lexical edges are only labeled for either category or interpretation). The parser attempts to combine the signs according to PS rules in the grammar until parses are found. A parse is an edge which ranges over the whole string. The parse ((((1+2)+3)+4)+((5+6)+7)) for (20.) represents a binary structural tree based on the W‑CPSG analysis, as shown below.

hpsg13

3.3. ambiguity resolution in word identification

Given the resources of a phrase structure grammar like W‑CPSG, a parser based on standard chart parsing algorithms can handle both the cross ambiguity and the embedded ambiguity, provided that a match algorithm based on exhaustive lookup instead of maximum match is adopted for lexicon lookup. All candidate words in the input string are presented to the parser for judgment. Ambiguous segmentation becomes a natural part of parsing: different ways of segmentation add different edges, and a successful parse always embodies the right identification. In other words, word identification in our design becomes a by-product of parsing instead of a pre-condition for parsing. The following example of the complicated cross ambiguity illustrates how the W‑CPSG parser resolves ambiguity. As seen, both the FMM segmentation (represented by the edge sequence 8-9-5-10) and the BMM segmentation (represented by 1-11-12-10) are in the chart as a result of exhaustive lexicon lookup. They are proved wrong because they do not lead to a successful parse according to the grammar. As a by-product, the final parse (8+(3+(12+10))) automatically embodies the correctly identified word sequence 8-3-12-10, i.e. 出现 (appear) |在 (at) |世界 (world) |东方 (east).

hpsg10
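The difference between maximum match and exhaustive lookup can be sketched as follows. The input string 出现在世界东方 is reconstructed from the word sequence given above, and the toy lexicon and function names are assumptions for illustration only.

```python
# Hypothetical toy lexicon for the string 出现在世界东方; a real lexicon is far larger.
LEXICON = {"出", "现", "在", "世", "界", "东", "方",
           "出现", "现在", "在世", "世界", "东方"}
MAX_LEN = 2  # longest entry in the toy lexicon

def fmm(text):
    """Forward maximum match: take the longest lexicon word, left to right."""
    words, i = [], 0
    while i < len(text):
        for n in range(min(MAX_LEN, len(text) - i), 0, -1):
            if n == 1 or text[i:i + n] in LEXICON:
                words.append(text[i:i + n])
                i += n
                break
    return words

def bmm(text):
    """Backward maximum match: take the longest lexicon word, right to left."""
    words, j = [], len(text)
    while j > 0:
        for n in range(min(MAX_LEN, j), 0, -1):
            if n == 1 or text[j - n:j] in LEXICON:
                words.append(text[j - n:j])
                j -= n
                break
    return words[::-1]

def exhaustive(text):
    """Exhaustive lookup: every lexicon word at every position is a candidate edge."""
    return [(i, j, text[i:j])
            for i in range(len(text))
            for j in range(i + 1, min(i + MAX_LEN, len(text)) + 1)
            if text[i:j] in LEXICON]

s = "出现在世界东方"
print(fmm(s))         # ['出现', '在世', '界', '东方']
print(bmm(s))         # ['出', '现在', '世界', '东方']
print(exhaustive(s))  # all candidate edges as (start, end, word)
```

Neither maximum match output is the right segmentation, but the four words of the right one, 出现|在|世界|东方, are all among the candidates returned by `exhaustive`, so a parser can still find them.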

Exhaustive lookup also makes an embedded ambiguity sub-string like 烤红薯 no longer a blind area for word identification, as shown in (22.) below. All the candidate words in the sub-string including 烤 (bake), 红薯 (sweet potato), 烤红薯 (baked sweet potato) are added to the chart as lexical edges (edge 4, edge 8 and edge 10). This is a case of genuine ambiguity, resulting in 2 parses corresponding to 2 readings. The first parse (1+(7+10)) identifies the word sequence 他|喜欢|烤红薯, and the second parse (1+(9+(4+8))) a different sequence 他|喜欢|烤|红薯. Edge 7 and edge 9 represent two lexical entries for the verb 喜欢 (like), with different syntactic expectation (categorization). One expects an NP object, notated in the chart by like<NP>, and the other expects a VP complement, notated by like<VP>.

hpsg11
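How both readings fall out of one chart can be shown with a very small bottom-up parser over character positions. This is a sketch under stated assumptions: the category names (including the intermediate label `Pred`), the binary rule table and the toy lexicon are invented for illustration, and W‑CPSG itself uses feature structures and unification rather than atomic categories.

```python
from collections import defaultdict

# Hypothetical toy lexicon: 喜欢 has two entries with different expectations.
LEXICON = {
    "他": ["NP"],
    "喜欢": ["V<NP>", "V<VP>"],
    "烤": ["Vt"],
    "红薯": ["NP"],
    "烤红薯": ["NP"],
}

# Hypothetical binary rules: (left category, right category) -> parent category.
RULES = {
    ("Vt", "NP"): "VP",
    ("V<NP>", "NP"): "Pred",
    ("V<VP>", "VP"): "Pred",
    ("NP", "Pred"): "S",
}

def parse(text):
    """Return the word sequence of every S edge spanning the whole string."""
    n = len(text)
    chart = defaultdict(list)  # (start, end) -> list of (category, words)
    # Exhaustive lexicon lookup adds one lexical edge per matching entry.
    for i in range(n):
        for j in range(i + 1, n + 1):
            for cat in LEXICON.get(text[i:j], []):
                chart[i, j].append((cat, [text[i:j]]))
    # Bottom-up combination of adjacent edges, shortest spans first.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for lcat, lwords in chart[i, k]:
                    for rcat, rwords in chart[k, j]:
                        parent = RULES.get((lcat, rcat))
                        if parent:
                            chart[i, j].append((parent, lwords + rwords))
    return [words for cat, words in chart[0, n] if cat == "S"]

for words in parse("他喜欢烤红薯"):
    print("|".join(words))
```

The parser never segments the input in advance; each word sequence is simply read off the leaves of a successful parse.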

We now illustrate how Chinese proper names are identified in W‑CPSG parsing. In the W‑CPSG lexicon, a Chinese family name is encoded to optionally expect a given name. Due to the arbitrariness of given names, no constraint other than the length (either 1 character or 2 characters) is specified in the expectation. Therefore, we have three candidates for proper names in the following example, namely 李 (Li), 李得 (Li De), 李得胜 (Li Desheng), represented respectively by edge 1, edge (1+2) and the NP edge (1+5).[3] The first two candidates contribute to two valid parses while the third does not, hence the identification of the word sequences 李|得胜|了 and 李得|胜|了.

hpsg8
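The length-only expectation on given names amounts to a simple candidate generator. The following sketch assumes the input 李得胜了 (reconstructed from the word sequences above) and a one-entry family name list; both, like the function name, are illustrative assumptions.

```python
FAMILY_NAMES = {"李"}  # hypothetical one-entry list of family names

def name_candidates(text, start=0):
    """Candidate proper names beginning at `start`: a family name optionally
    followed by a given name of 1 or 2 characters (no other constraint)."""
    family = text[start:start + 1]
    if family not in FAMILY_NAMES:
        return []
    candidates = [family]
    for given_len in (1, 2):
        end = start + 1 + given_len
        if end <= len(text):
            candidates.append(text[start:end])
    return candidates

print(name_candidates("李得胜了"))  # ['李', '李得', '李得胜']
```

The generator deliberately overproduces. Which candidate survives is decided by whether the rest of the sentence can be parsed around it.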

Now we add one more character 胜 (win) to form a new sentence, as shown in (24.) below.

hpsg9

The first two candidate proper names 李 (Li) and 李得 (Li De) no longer lead to parses. But the third candidate 李得胜 (Li Desheng) becomes part of the parse as a subject NP. The parse (((1+6)+4)+5) corresponds to the identification of the only valid word sequence 李得胜|胜|了.

Finally, we give an example to demonstrate how W‑CPSG handles reduplication in parsing and word identification. The sample sentence to be processed by the parser is 让他分分心 (Let him relax a while), involving the AB-->AAB type verb reduplication for diminutive use.

In most lexicons, 分心 (distract-heart: get distracted) is a registered 2-morpheme verb with internal morphological verb-object relation. Therefore, the reduplication is considered morphological. But in Chinese syntax, we also have a general verb reduplication rule of the type A-->AA for diminutive use, for example, 看(look) --> 看看(have a look). This morphological verb reduplication rule AB-->AAB and the syntactic verb reduplication rule A-->AA are essentially the same rule in Chinese grammar. 分心 sits in the gray area between morphology and syntax. It looks both like a word (verb) and a phrase (VP). Lexically, it corresponds to one generalized sense (concept) and the internal combination is idiomatic, i.e. 分 (distract) must combine with 心 (heart) to mean 'get distracted'. But, structurally, the combination of 分 and 心 is not fundamentally different from a VP consisting of Vt and NP, as in the phrase 看电影 (see a film). In fact, there is no clear-cut boundary between Chinese morphology and syntax. This morphology-syntax isomorphic fact serves as a further argument to support the W‑CPSG design of integrating morphology and syntax in one grammar module. Although the boundary between Chinese morphology and syntax is fuzzy, hence no universal definition of basic notions like word and phrase, the division can be easily defined system internally in an integrated grammar. In W‑CPSG, 分心 is treated as a phrase (VP) instead of a word (verb). The lexical entry 分 (distract) is coded to obligatorily expect the literal 心 (heart) as its syntactic object, shown in the following chart by the notation V<心>. This approach has the advantage of eliminating the doubling of the reduplication rule for diminutive use in both syntax and morphology, making the grammar more elegant. The verb reduplication rule is implemented as a lexical rule in W‑CPSG.[4] This lexical rule creates a reduplicated verb with added diminutive sense, shown by edge 8 (a lexical edge). 
The whole parsing process is illustrated below.

hpsg7
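The treatment of reduplication as a lexical rule applied before parsing can be sketched as a lexicon expansion step. The dictionary layout, field names and function name below are assumptions for illustration, not the W‑CPSG encoding.

```python
# Hypothetical toy lexicon. 分 obligatorily expects the literal 心 as its object
# (the V<心> notation in the text); 看 expects an ordinary NP object.
LEXICON = {
    "分": {"cat": "V", "expects": "心", "sense": "distract"},
    "看": {"cat": "V", "expects": "NP", "sense": "look"},
    "心": {"cat": "N", "expects": None, "sense": "heart"},
}

def apply_reduplication_rule(lexicon):
    """Lexical rule A --> AA: add a reduplicated, diminutive entry for each verb.

    The rule runs once, before parsing, and returns an expanded lexicon.
    """
    expanded = dict(lexicon)
    for form, entry in lexicon.items():
        if entry["cat"] == "V":
            expanded[form * 2] = {**entry, "sense": entry["sense"] + " (diminutive)"}
    return expanded

EXPANDED = apply_reduplication_rule(LEXICON)
print(EXPANDED["分分"])  # still expects the literal 心, so 分分 + 心 forms a VP
print(EXPANDED["看看"])
```

Because 分 is a verb that expects 心, the same rule that produces 看看 also produces 分分, and no separate AB-->AAB rule is needed.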

REFERENCES

Carpenter, B. & Penn, G. (1994): ALE, The Attribute Logic Engine, User's Guide, Carnegie Mellon University

Chen, K-J., & S-H. Liu (1992): "Word identification for Mandarin Chinese sentences". Proceedings of the 15th International Conference on Computational Linguistics, Nantes, 101-107.

Feng, Z-W. (1996): "COLIPS lecture series - Chinese natural language processing", Communications of COLIPS, Vol.6, No.1 1996, Singapore

Li, W. (1997a): "Outline of an HPSG-style Chinese reversible grammar", Proceedings of The Northwest Linguistics Conference-97 (NWLC-97, forthcoming), UBC, Vancouver, Canada

Li, W. (1997b): W‑CPSG: A Lexicalized Chinese Unification Grammar And Its Application, Doctoral dissertation (on-going), Simon Fraser University, Canada

Liang, N. (1987): "Shumian Hanyu Zidong Fenci Xitong - CDWS" (Automatic word segmentation system for written Chinese - CDWS), Journal of Chinese Information Processing, No.2 1987, pp 44-52, Beijing

Pollard, C. & I. Sag (1994): Head-Driven Phrase Structure Grammar, Centre for the Study of Language and Information, Stanford University, CA

Sun, M-S. & C-N. Huang (1996): "Word segmentation and part of speech tagging for unrestricted Chinese texts" (Tutorial Notes for International Conference on Chinese Computing ICCC'96), Singapore

~~~~~~~~~~~~~~~~~~~

[1] The author benefited from the insightful discussion with Dr. Dekang Lin on the feasibility of parsing Chinese character strings instead of word strings. Thanks also go to Paul McFetridge and Fred Popowich for their supervision and encouragement.

[2] This table is adapted from the following table in Sun & Huang (1996).

case 1	The output of FMM and BMM are different, but both are incorrect	0.054%
case 2	The output of FMM and BMM are different, but only one is correct	9.24%
case 3	The output of FMM and BMM are identical, but incorrect	0.41%
case 4	The output of FMM and BMM are identical, and correct	90.30%

The 4 cases which they listed are not logically exhaustive in terms of sentence based processing (i.e. when discourse is not involved in a system). In particular, there is another case when the output of FMM and BMM are different, and both are correct. We call this a case of genuine cross ambiguity.

[3] Note that there is another S edge (1+5) in the chart. These two edges are structurally different, created via different PS rules. The NP edge (1+5) is formed through the morphological PS rule, combining the family name (edge 1) and its expected given name (edge 5). In the S edge (1+5), however, it is the subject rule (one of the complement PS rules) that decides the combination of the predicate (edge 5) and its expected subject NP (edge 1).

[4] Lexical rules are favored by many linguists to capture redundancy in the lexicon instead of the conventional approach of syntactic transformation. Lexical rules are applied at compile time to form an expanded lexicon before parsing starts.

Outline of an HPSG-style Chinese reversible grammar

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

Interaction of syntax and semantics in parsing Chinese transitive verb patterns

Interaction of syntax and semantics in parsing Chinese transitive verb patterns *
(old paper in Proceedings of International Chinese Computing Conference, ICCC'96)

Wei LI

Department of Linguistics, Simon Fraser University
Burnaby, B.C. V5A 1S6 CANADA (email: [email protected])

Keywords: Chinese processing, transitive pattern, syntax, semantics, lexical rule, HPSG

Abstract

This paper addresses the problem of parsing Chinese transitive verb patterns (including the BA construction and the BEI construction) and handling the related phenomena of semantic deviation (i.e. the violation of the semantic constraint).

We designed a syntax-semantics combined model of Chinese grammar in the framework of Head-driven Phrase Structure Grammar [Pollard & Sag 1994]. Lexical rules are formulated to handle both the transitive patterns which allow for semantic deviation and the patterns which disallow it. The lexical rules ensure the effective interaction between the syntactic constraint and the semantic constraint in analysis.

The contribution of our research can be summarized as:

(1) the insight on the interaction of syntax and semantics in analysis;
(2) a proposed lexical rule approach to semantic deviation based on (1);
(3) the application of (2) to the study of the Chinese transitive patterns;
(4) the implementation of (3) in a unification-based Chinese HPSG prototype.

Background

When Chomsky proposed his Syntactic Structures in the Fifties, he seemed to indicate that syntax should be addressed independently of semantics. As a convincing example, he presented a famous sentence:

1) Colorless green ideas sleep furiously.

Weird as it sounds, the grammaticality of this sentence is intuitively acknowledged: (1) it follows the English syntax; (2) it can be interpreted. In fact, there is only one possible interpretation, solely decided by its syntactic structure. In other words, without the semantic interference, our linguistic knowledge about the English syntax is sufficient to assign roles to each constituent to produce a reading although the reading does not seem to make sense.

However, things are not always this simple. Compare the following Chinese sentences of the same form NP NP V:

2a) dianxin wo chi le.
Dim-Sum I eat LE.
The Dim Sum I have eaten.
Note: LE is a particle for perfect aspect.

2b) wo dianxin chi le.
I have eaten the Dim Sum.

Who eats what? There is no formal way but to resort to the semantic constraint imposed by the notion eat to reach the correct interpretation [Li, W. & McFetridge 1995].

Of course, if we want to maintain the purity of syntax, it could be argued that syntax will only render possible interpretations and not the interpretation. It is up to other components (semantic filter and/or other filters) of grammar to decide which interpretation holds in a certain context or discourse. The power of syntax lies in the ability to identify structural ambiguities and to render possible corresponding interpretations. We call this type of linguistic design a syntax-before-semantics model. While this is one way to organize a grammar, we found it unsatisfactory for two reasons. First, it does not seem to simulate the linguistic process of human comprehension closely. For human listeners, there are no ambiguities involved in sentences 2a) and 2b). Secondly, there is considerable cost on processing efficiency in terms of computer implementation. This efficiency problem can be very serious in the analysis of languages like Chinese with virtually no inflection.

Head-driven Phrase Structure Grammar (HPSG) [Pollard & Sag 1994, 1987] assumes a lexicalist approach to linguistic analysis and advocates an integrated model of syntax and the other components of grammar. It serves as a desirable framework for the integration of the semantic constraint in establishing syntactic structures and interpretations. Therefore, we proposed to enforce the semantic constraint that an animate being eats food directly in the lexical entry chi (eat) [Li, W. & McFetridge 1995]: chi (eat) requires an animate NP subject and a food NP object. It correctly addresses the who-eats-what problem for sentences like 2a) and 2b). In fact, this type of semantic constraint (selection restriction) has been widely used for disambiguation in NLP systems.
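A rigid selection restriction of this kind is easy to sketch. In the following Python fragment the semantic categories assigned to the nouns, the dictionary layout and the function name are assumptions for illustration; only the constraint on chi (animate subject, food object) comes from the text.

```python
# Hypothetical semantic categories for a few nouns.
SEM = {"wo": "animate", "Lisi": "animate", "dianxin": "food", "zhuozi": "furniture"}

# chi (eat) requires an animate subject and a food object, as stated in the text.
CHI = {"subject": "animate", "object": "food"}

def assign_roles(np1, np2, verb=CHI):
    """Readings of the form 'NP1 NP2 V' licensed by the verb's semantic constraint."""
    readings = []
    if SEM[np1] == verb["subject"] and SEM[np2] == verb["object"]:
        readings.append({"S": np1, "O": np2, "order": "SOV"})
    if SEM[np2] == verb["subject"] and SEM[np1] == verb["object"]:
        readings.append({"S": np2, "O": np1, "order": "OSV"})
    return readings

print(assign_roles("dianxin", "wo"))  # 2a): wo is the eater, order OSV
print(assign_roles("wo", "dianxin"))  # 2b): wo is the eater, order SOV
```

Each of 2a) and 2b) receives exactly one reading, which matches the observation that human listeners perceive no ambiguity here.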

The problem is, the constraint should not always be enforced. In the practice of communication, deviation from the constraint is common and deviation is often deliberately applied to help render rhetorical expressions.

3) xiang chi yueliang, ni gou de3 zhao me?
want eat moon, you reach DE3 -able ME?
Wanting to eat the moon, but can you reach it?
Note: DE3 is a particle, introducing a postverbal adjunct of result or capability. ME is a sentence final particle for yes-no question.

4) dajia dou chi shehui zhuyi, neng bu qiong me?
people all eat social -ism, can not poor ME
Everyone is eating socialism, can it not be poor?

yueliang (moon) is not food, of course. It is still some physical object, though. But in 4), shehui zhuyi (socialism) is a purely abstract notion. If a parser enforces the rigid semantic constraint, there are many such sentences that will be rejected without getting a chance to be interpreted. The fact is, we do have interpretations for 3) and 4). Hence an adequate grammar should be able to accommodate those interpretations.

To capture such deviation, Wilks came up with his Preference Semantics [Wilks 1975, 1978]. A sophisticated mechanism is designed to calculate the semantic weight for each possible interpretation, i.e. how much it deviates from the preference semantic constraint. The final choice will be given to the interpretation with the most semantic weight in total. His preference model simulates the process of how humans comprehend language more closely than most previous approaches.

The problem with this design is the serious computational complexities involved in the model [Huang 1987]. In order to calculate the semantic weight, the preference semantic constraint is loosened step by step. Each possible substructure has to be re-tried with each step of loosening. It may well lead to combinatorial explosion.

What we are proposing here is to look at semantic deviation in the light of the interaction of the syntactic constraint and the semantic constraint. In concrete terms, the loosening of the semantic constraint is conditioned by syntactic patterns. Syntactic pattern is defined as the representation of an argument structure in surface form. A pattern consists of 2 parts: a structure's syntactic constraint (in terms of the syntactic categories and configuration, word order, function words and/or inflections) and its interpretation (role assignment). For example, for Chinese transitive structure, NP V NP: SVO is one pattern, NP NP V: SOV is another pattern, and NP [ba NP] V: SOV (the BA construction) is still another. The expressive power of a language is indicated by the variety of patterns used in that language. Our design will account for some semantic deviation or rhetorical phenomena seen in everyday Chinese without the overhead of computational complexities. We will focus on Chinese transitive verb patterns for illustration of this approach.

Chinese transitive patterns

Assuming three notional signs wo (I), chi (eat) and dianxin (Dim Sum), there are maximally 6 possible combinations in surface word order, out of which 3 are grammatical in Chinese.[1]

5a) wo chi le dianxin. SVO
5b) wo dianxin chi le. SOV
5c) dianxin wo chi le. OSV

SVO is the canonical word order for Chinese transitive structure. When a string of signs matches the order NP V NP, the semantic constraint has to yield to syntax for interpretation.

NP V NP: SVO

6) daodi shi ni zai du shu ne,
haishi shu zai du ni ne?

on-earth be you ZAI read book NE,
or book ZAI read you NE?

Are you reading the book, or is the book reading you, anyway?
Note: ZAI is a particle for continuous aspect.
NE is a sentence final particle for or-question.

Same as in the English equivalent, the interpretation of 6) can only be SVO, no matter how contradictory it might be to our common sense. In other words, in the form of NP V NP, syntax plays a decisive role.

In contrast, to interpret the form NP NP V as SOV in 2b), the semantic constraint is critical. Without the enforcement of the semantic constraint, the interpretation of SOV does not hold. In fact, this SOV pattern (NP1 NP2 V: SOV) has been regarded as ungrammatical in a Case Theory account for Chinese transitive structure in the framework of GB. According to that analysis, something similar to this pattern constitutes the D‑Structure for the transitive pattern and Chinese is an underlying SOV language (called "SOV Hypothesis": see the survey in Gao 1993). In the surface structure, NP2 is without case on the assumption that V assigns its CASE only to the right. One has to either insert the case-marker ba to assign CASE to it (the BA construction) or move it to the right of V to get its CASE (the SVO pattern). This analysis suffers from not being able to account for the grammaticality of sentences like 2b). However, by distinguishing the deep pattern SOV from the 2 surface patterns (the SVO and the BA construction), the theory has the merit of alerting us that the SOV pattern seems to be syntactically problematic (crippled, so to speak). This is an insightful point, but it goes one step too far in totally rejecting the SOV pattern in surface structure. If we modify this idea, we can claim that SOV is a syntactically unstable pattern and that SOV tends to (not must) "transform" to the SVO or the BA construction unless it is reinforced by semantic coherence (i.e. the enforcement of the semantic constraint). This argument in the light of syntax-semantics interaction is better supported by the Chinese data. In essence, our account is close to this reformulated argument, but in our theory, we do not assume a deep structure and transformation. All patterns are surface constructions. If no sentence can match a construction, it is not considered a pattern by our definition.

This type of unstable pattern which depends on the semantic constraint is not limited to the transitive phenomena. For example, the type of Chinese NP predicate defined in [Li, W. & McFetridge 1995] is also a semantics dependent pattern. Compare:

7a) zhe zhang zhuozi san tiao tui.
this Cl. table(furniture) three Cl. leg
This table is three-legged.
Note: Cl. for classifier.

7b) * zhe zhang ditu san tiao tui.
this Cl. map(non-furniture) three Cl. leg

There is clearly a semantic constraint of the NP predicate on its subject: it should be furniture (or animate). Without this "semantic agreement", Chinese NP is normally not capable of functioning as a predicate, as shown in 7b).

Between semantics dependent and semantics independent patterns, we may have partially dependent patterns. For example, in NP NP V: OSV, it seems that the semantic constraint on the initial object is less important than the semantic constraint on the subject.

8) shitou wo ye xiang chi, kexi yao bu dong.
stone(non-food) I(animate) also want eat, pity chew not -able

Even stones I also want to eat, but it's such a pity that I am not able to chew them.

If the constraint on the object matches well, is the subject allowed to be semantically deviant?

9) ? dianxin zhuozi chi le.
Dim-Sum(food) table(non-animate) eat LE.

These are marginal cases: a grammar may choose to be more tolerant and accept them, or to be more restrictive and reject them.

Unlike SOV, but similar to its English counterpart, OSV is one type of Chinese topic construction, and the relationship between the initial O and V is a long distance dependency.

10a) dianxin wo xiangxin ni yiwei Lisi chi le.
Dim-Sum I believe you think Lisi eat LE

The Dim Sum I believe you think that Lisi ate.

10b) * Lisi wo xiangxin ni yiwei dianxin chi le.

10b) will not be accepted in our model because (1) it cannot be interpreted as OSV since it violates the semantic constraint on S: dianxin is not animate; (2) nor can it be interpreted as SOV since it violates the configurational constraint: SOV is simply not a long distance pattern. In fact, NP NP V: SOV is such a restricted pattern in Chinese that it not only excludes any long distance dependency but even disallows some adjuncts. Compare 11a) in the OSV pattern and 11b) and 11c) in the SOV pattern:

11a) dianxin wo jinjinyouwei de2 chi le.
Dim-Sum I with-relish DE2 eat LE

The Dim Sum I ate with relish.
Note: DE2 is a particle introducing a preverbal adjunct of manner.

11b) * wo dianxin jinjinyouwei de2 chi le.

11c) * wo jinjinyouwei de2 dianxin chi le.

There is another pattern of the linear order SOV, the notorious Chinese BA construction. ba is usually regarded as a preposition which introduces a preverbal object for transitive verbs.

NP [ba NP] V: SOV

12a) wo ba dianxin jinjinyouwei de2 chi le.
I BA Dim-Sum with-relish DE2 eat LE

I ate the Dim Sum with relish.

12b) wo jinjinyouwei de2 ba dianxin chi le.
With relish, I ate the Dim Sum.

12c) dianxin ba wo jinjinyouwei de2 chi le.
The Dim Sum ate me with relish.

12d) dianxin jinjinyouwei de2 ba wo chi le.
With relish, the Dim Sum ate me.

For the OSV order, there is another so-called BEI construction. The BEI construction is usually regarded as an explicit passive pattern in Chinese.

NP [bei NP] V: OSV

13a) dianxin bei wo chi le.
Dim-Sum BEI I eat LE

The Dim Sum was eaten by me.

13b) wo bei dianxin chi le.

I was eaten by the Dim Sum.

The BEI construction and the BA construction are both semantics independent. In fact, any pattern resorting to the means of function words in Chinese seems to be sufficiently independent of the semantic constraint.

To conclude, semantic deviation often occurs in the patterns that are more independent of the semantic constraint, as seen in 5d2), 6), 8), 12c), 12d), 13b). Close study reveals that different patterns rely on the semantic constraint to different degrees, as summarized in the following table.

syntactic pattern        semantic dependence

NP V NP: SVO             no dependence
NP [ba NP] V: SOV        no dependence
NP [bei NP] V: OSV       no dependence
NP NP V: OSV             partial dependence
NP NP V: SOV             full dependence
............

It should be emphasized that this observation constitutes the rationale behind our approach.
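The table above can be read as a lookup from pattern to the set of semantic constraints that must hold. The sketch below is an assumption-laden illustration: "partial dependence" is modelled as enforcing only the subject constraint (following the discussion of the OSV pattern), and the category labels and function name are invented here.

```python
# Which role constraints each pattern enforces, following the table above.
ENFORCED = {
    "NP V NP: SVO": (),
    "NP [ba NP] V: SOV": (),
    "NP [bei NP] V: OSV": (),
    "NP NP V: OSV": ("S",),        # partial dependence (assumed: subject only)
    "NP NP V: SOV": ("S", "O"),    # full dependence
}

CHI = {"S": "animate", "O": "food"}  # semantic constraint of chi (eat)

def licensed(pattern, roles, verb=CHI):
    """roles maps 'S' and 'O' to the semantic category of the NP in that role."""
    return all(roles[r] == verb[r] for r in ENFORCED[pattern])

# 13b) wo bei dianxin chi le: deviant, yet BEI is semantics independent.
print(licensed("NP [bei NP] V: OSV", {"S": "food", "O": "animate"}))   # True
# 8) shitou wo ye xiang chi: the object deviates, the subject is animate.
print(licensed("NP NP V: OSV", {"S": "animate", "O": "non-food"}))     # True
# NP NP V read as SOV needs both constraints, so a deviant object is rejected.
print(licensed("NP NP V: SOV", {"S": "animate", "O": "non-food"}))     # False
```

Under this modelling choice the marginal example 9) is rejected, since its subject is not animate; a more tolerant grammar could relax the OSV entry instead.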

Formulation of lexical rules

Based on the above observation, we have designed a syntax-semantics combined model. In this model, we take a lexical rule approach to Chinese patterns and the related problem of semantic deviation.

A lexical rule takes as its input a lexical entry which satisfies its condition and generates another entry. Lexical rules are usually used to cover lexical redundancy between related patterns. The design of lexical rules is preferred by many grammarians over the more conventional use of syntactic transformation, especially for lexicalist theories.

Our general design is as follows, still using chi (eat) for illustration:

(1) Syntactically, chi (eat) as a transitive verb subcategorizes for a left NP as its subject and a right NP as its object.

(2) Semantically, the corresponding notion eat expects an entity of category animate as its logical subject and an entity of category food as its logical object. Therefore the common sense (knowledge) that animate being eats food is represented.

(3) The interaction of syntax and semantics is implemented by lexical rules. The lexical rules embody the linguistic generalizations about the transitive patterns. They will decide to enforce or waive the semantic constraint based on different patterns.

As seen, syntax only stipulates the requirement of two NPs as complements for chi and does not care about the NPs' semantic constraint. Semantics sets its own expectation of animate entity and food entity as arguments for eat and does not care what syntactic forms these entities assume on the surface. It is up to lexical rules to coordinate the two. In our model, the information in (1) and (2) is encoded in the corresponding lexical entry and the lexical rules in (3) will then be applied to expand the lexicon before parsing begins. Driven by the expanded lexicon, analysis is implemented by a lexicalist parser to build the interpretation structure for the input sentence. Following this design, there will be sufficient interaction between syntax and semantics as desired, while syntax still remains a component self-contained from semantics in the lexicon. More importantly, this design does not add any computational complexity to parsing because, in order to handle different patterns, similar lexical rules are also required even for a pure syntax model.

Before we proceed to formulate lexical rules for transitive patterns, we should make sure what a transitive pattern is. As we defined before, a pattern consists of 2 parts: a structure's syntactic constraint and the corresponding interpretation. Word order is an important constraint for Chinese syntax. In addition to word order, we have categories and function words (preposition, particle, etc.). As for interpretation, transitive structure involves 3 elements: V (predicate) and its arguments S (logical subject) and O (logical object). There is a further factor to take into account: Chinese complements are often optional. In many cases, subject and/or object can be omitted either because they can be recovered in the discourse or because they are unknown. We call those patterns elliptical patterns (with some complement(s) omitted), in contrast to full patterns. With these in mind, we can define 10 patterns for Chinese transitive structure: 5 full patterns and 5 elliptical patterns.

We now investigate these transitive patterns one by one and try to informally formulate the corresponding lexical rules to capture them. Please note that the basic input condition is the same for all the lexical rules. This is because they share the same argument structure, the transitive structure.

Lexical rule 1:

V ((NP1, NP2), (constr1, constr2)) --> NP1 V NP2: SVO

The above notation for the lexical rule should be quite obvious. The input of the rule is a transitive verb which subcategorizes for two NPs, NP1 and NP2, and whose corresponding notion expects two arguments of constr1 and constr2. NP is a syntactic category, and constr is a semantic category (human, animate, food, etc.). The output pattern is in a defined word order SVO and waives the semantic constraint.

Lexical rule 2:

V ((NP1, NP2), (constr1, constr2)) --> [NP1, constr1] [NP2, constr2] V: SOV

Please note that the semantic constraint is enforced for this SOV pattern. Since this pattern shares the form NP NP V with the OSV pattern, it would be interesting to see what happens if a transitive verb has the same semantic constraint on both its subject and object. For example, qingjiao (consult) expects a human subject and a human object.

14) ta ni qingjiao guo me?
he(human) you(human) consult GUO ME

Him, have you ever consulted?
Note: GUO is a particle for experience aspect.

15) ni ta qingjiao guo me?

You, has he ever consulted?

In both cases, the interpretation is OSV instead of SOV. Therefore, we need to reformulate Lexical rule 2 to exclude the case when the subject constraint is the same as the object constraint.

Lexical rule 2' (refined version):

V ((NP1, NP2), (constr1, constr2), (constr1 not = constr2))

--> [NP1, constr1] [NP2, constr2] V: SOV
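As an informal illustration, Lexical rule 2' can be written as a function from a lexical entry to an output pattern, with the extra input condition checked first. The dictionary layout and names are assumptions made for this sketch; the actual rules are stated over HPSG feature structures.

```python
# Hypothetical lexical entries: subcategorization plus semantic constraints.
CHI = {"form": "chi", "subcat": ("NP", "NP"), "constr": ("animate", "food")}
QINGJIAO = {"form": "qingjiao", "subcat": ("NP", "NP"), "constr": ("human", "human")}

def lexical_rule_2_prime(verb):
    """V ((NP1, NP2), (constr1, constr2), (constr1 not = constr2))
       --> [NP1, constr1] [NP2, constr2] V: SOV

    Returns the SOV pattern with both constraints enforced, or None when the
    input condition fails because subject and object constraints coincide."""
    constr1, constr2 = verb["constr"]
    if constr1 == constr2:
        return None
    np1, np2 = verb["subcat"]
    return {
        "order": [(np1, constr1), (np2, constr2), verb["form"]],
        "reading": "SOV",
    }

print(lexical_rule_2_prime(CHI))       # SOV pattern for chi, constraints enforced
print(lexical_rule_2_prime(QINGJIAO))  # None: 14) and 15) can only be read as OSV
```

Since the rule produces no SOV entry for qingjiao, a string of the form NP NP V with this verb can only match the OSV pattern, as observed for 14) and 15).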

Lexical rule 3:

V ((NP1, NP2), (constr1, constr2)) --> NP1 [ba NP2] V: SOV

This is the typical BA construction. But not every transitive verb can assume the BA pattern. In fact, ba is one of a set of prepositions to introduce the logical object. There are other more idiosyncratic prepositions (xiang, dao, dui, etc.) required by different verbs to do the same job.

16a) ni qingjiao guo ta me?
you consult GUO he ME

Have you ever consulted him?

16b) ni xiang ta qingjiao guo me?
you XIANG he consult GUO ME

Have you ever consulted him?

16c) * ni ba ta qingjiao guo me?
you BA he consult GUO ME

17a) ta qu guo Beijing.
he go-to GUO Beijing

He has been to Beijing.

17b) ta dao Beijing qu guo.
he DAO Beijing go-to GUO

He has been to Beijing.

17c) * ta ba Beijing qu guo.
he BA Beijing go-to GUO

18a) ta hen titie zhangfu.
she very tenderly-care-for husband

She cares for her husband very tenderly.

18b) ta dui zhangfu hen titie.
she DUI husband very tenderly-care-for

She cares for her husband very tenderly.

18c) * ta ba zhangfu hen titie.
she BA husband very tenderly-care-for

This variation originates from the different theta-roles that different verb notions assign to their object argument: patient, theme, destination, to name only a few. These theta-roles are finer classifications of the more general semantic role of logical object. We can rely on the subcategorization property of the verb for the choice of the preposition literal (the so-called valency preposition). With the valency information in place, we now reformulate Lexical rule 3 to make it more general:

Lexical rule 3' (refined version):

V ((NP1, NP2), (constr1, constr2), (valency_preposition=P), (P not = null))

--> NP1 [P NP2] V: SOV
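
The condition in Lexical rule 3' can be pictured as a simple lexical lookup. The sketch below is only an illustration: the dictionary `VALENCY_PREPOSITION` and the function `licenses` are hypothetical names, and the three verb entries are taken from examples 16) to 18) above.

```python
# Hypothetical mini-lexicon: verb -> its valency preposition.
# A verb missing from the table is treated as having P = null.
VALENCY_PREPOSITION = {
    "qingjiao": "xiang",  # 16b) ni xiang ta qingjiao guo me?
    "qu": "dao",          # 17b) ta dao Beijing qu guo.
    "titie": "dui",       # 18b) ta dui zhangfu hen titie.
}


def licenses(verb, preposition):
    """True if NP1 [preposition NP2] verb is licensed by Lexical rule 3'."""
    p = VALENCY_PREPOSITION.get(verb)
    # The rule requires a non-null valency preposition, and the
    # preposition in the string must be exactly that literal.
    return p is not None and p == preposition
```

Under these assumptions `licenses("qingjiao", "xiang")` holds while `licenses("qingjiao", "ba")` does not, which corresponds to the contrast between 16b) and 16c).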

Lexical rule 4:

V ((NP1, NP2), (constr1, constr2)) --> NP2 ... [NP1, constr1] V: OSV

This is a topic pattern of long distance dependency. It is up to different formalisms to provide different approaches to long-distance phenomena. In our present implementation, NP2 is placed in a feature called BIND to indicate the nature of long distance dependency. One phrase structure rule Topic Rule is designed to use this information and handle the unification of the long distance complement properly.

Following the topic pattern, the passive BEI construction is formulated in Lexical rule 5.

Lexical rule 5:

V ((NP1, NP2), (constr1, constr2)) --> NP2 [bei NP1] V: OSV

We now turn to elliptical patterns.

Lexical rule 6:

V ((NP1, NP2), (constr1, constr2)) --> V NP2: VO

19) chi guo jiaozi me?
eat GUO dumpling ME

Have (you) ever eaten dumpling?

Lexical rule 7:

V ((NP1, NP2), (constr1, constr2)) --> [NP1, constr1] V: SV

20) wo chi le.
I eat LE

I have eaten (it).

21) ji chi le.
chicken1(animate) eat LE

The chicken has eaten (it).

Like its English counterpart, ji (chicken) has two senses: (1) chicken1 as animate; (2) chicken2 as food. We code this difference in two lexical entries. Only the first entry matches the semantic constraint on the subject in the pattern and reaches the above SV interpretation in 21). Interestingly enough, the same sentence will get another parse with a different interpretation OV in 23) because the second entry also satisfies the semantic constraint on the object in the OV pattern in Lexical rule 8.

22) ni qingjiao guo me?
you consult GUO ME

Have you consulted (someone)?

22) indicates that the SV interpretation is preferred over the OV interpretation when the semantic constraint on the subject and the semantic constraint on the object happen to be the same. Hence the added condition in Lexical rule 8.

Lexical rule 8:

V ((NP1, NP2), (constr1, constr2), (constr1 not = constr2))

--> [NP2, constr2] V: OV

23) ji chi le.
chicken2(food) eat LE

The chicken has been eaten.
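
The interplay of Lexical rules 7 and 8 on the string NP V can be sketched as follows. This is a simplified illustration, not the ALE grammar: `SENSES` and `interpret_np_v` are hypothetical names, and semantic categories are compared as plain strings.

```python
# Hypothetical sense inventory: one list item per lexical entry.
SENSES = {
    "ji": ["animate", "food"],  # chicken1, chicken2
    "ni": ["human"],
}


def interpret_np_v(noun, constr1, constr2):
    """Return (pattern, sense) readings for the string 'noun V'."""
    readings = []
    for sense in SENSES[noun]:
        # Lexical rule 7: [NP1, constr1] V: SV
        if sense == constr1:
            readings.append(("SV", sense))
        # Lexical rule 8: [NP2, constr2] V: OV, blocked when the subject
        # and object constraints coincide.
        if constr1 != constr2 and sense == constr2:
            readings.append(("OV", sense))
    return readings
```

For chi (animate, food), ji receives both the SV reading of 21) and the OV reading of 23); for qingjiao (human, human), ni receives only the SV reading of 22), because the added condition suppresses OV.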

Lexical rule 9:

V ((NP1, NP2), (constr1, constr2)) --> NP2 [bei V]: OV

24) dianxin bei chi le.
Dim-Sum BEI eat LE

The Dim Sum has been eaten.

Lexical rule 10:

V ((NP1, NP2), (constr1, constr2)) --> V: V

25) chi le me?
eat LE ME?

(Have you) eaten (it)?

Implementation

We begin with a discussion of some major feature structures in HPSG related to handling the transitive patterns. Then, we will show how our proposal works and discuss some related implementation issues.

HPSG is a highly lexicalist theory. Most information is housed in the lexicon. The general grammar is kept to a minimum: only a few phrase structure rules (called ID Schemata) associated with a couple of principles. The data structure is the typed feature structure. The obligatory part of a typed feature structure is its type information. A simple feature structure contains only the type information, but a complex feature structure can introduce a set of feature/value pairs in addition to the type information. In a feature/value pair, the value is itself a feature structure (simple or complex). The following is a sample implementation of the lexical entry chi for our Chinese HPSG grammar using the ALE formalism [Carpenter & Penn 1994].

[Figure hpsg3: sample lexical entry for chi in ALE]

Note: (1) Uppercase notation for feature; (2) Lowercase notation for type; (3) Number indices in square brackets for unification.

Leaving the notational details aside, what this roughly says is: (1) for the semantic constraint, the arguments of the notion eat are an animate entity and a food entity; (2) for the syntactic constraint, the complements of the verb chi are 2 NPs: one on the left and the other on the right; (3) the interpretation of the structure is a transitive predicate with a subject and an object. The three corresponding features are: (1) KNOWLEDGE; (2) SUBCAT; (3) CONTENT. KNOWLEDGE stores some of our common sense by capturing the internal relation between concepts. Such common sense knowledge is represented in linguistic ways, i.e. it is represented as a semantic expectation feature, which parallels the syntactic expectation feature SUBCAT. KNOWLEDGE defines the semantic constraint on the expected arguments no matter what syntactic forms the arguments will take. In contrast, SUBCAT only defines the syntactic constraint on the expected complements. The syntactic constraint includes word order (LEFT feature), syntactic category (CATEGORY feature) and configurational information (LEX feature). Finally, the CONTENT feature assigns the roles SUBJECT and OBJECT for the represented structure.

A more important issue is the interaction of the three feature structures. Among the three features, only KNOWLEDGE is our add-on. The relationship between SUBCAT and CONTENT has been established in all HPSG versions: SUBCAT resorts to CONTENT for interpretation. This interaction corresponds to our definition of pattern. Everything goes fine as far as the syntactic constraint alone can decide interpretation. When the semantic constraint (in KNOWLEDGE) has to be involved in the interpretation process, we need a way to access this information. In unification based theories, information flow is realized by unification (i.e. structure sharing, which is represented by the co-index of feature values). In general, we have two ways to ensure structure sharing in the lexicon. It is either directly co-indexed in the lexical entries, or it resorts to lexical rules. The former is unconditional, and the latter is conditional. As argued before, we cannot directly enforce the semantic constraint for every transitive pattern in Chinese, for otherwise our grammar will not allow for any semantic deviation. We are left with lexical rules which we have informally formulated in Section 3 and implemented in the ALE formalism.

CATEGORY is another major feature for a sign. The CATEGORY feature in our implementation includes functional categories, which can specify a functional literal (function word) as their value. Function words belong to closed categories. Therefore, they can be classified by enumeration of literals. Like word order, function words are an important formal means of syntactic constraint in Chinese. Grammars for other languages also resort to some functional literals for constraint. In most HPSG grammars for English, for example, a preposition literal is specified in a feature called P_FORM. There are two problems involved there. First, at the representation level, there is redundancy: P_FORM:x --> CATEGORY:p (where x is not null). In other words, there exists a feature dependency between P_FORM and CATEGORY which is not captured in the formalism. Second, if P_FORM is designed to stipulate a preposition literal, we will ultimately need to add features like CL_FORM for classifier specification, CO_FORM for conjunction specification, etc. In fact, for each functional category, literal specification may be required for constraint in a non-toy grammar. That will make the feature system of the grammar too cumbersome.

These problems are solved in our grammar implementation in ALE. One significant mechanism in ALE is its type inheritance and appropriateness specifications for feature structures [Carpenter & Penn 1994]. (A similar design is found in the new software paradigm of Object Oriented Programming.) Thanks to ALE, we can now use literals (ba, xiang, dao, dui, etc.) as well as major categories (n, v, a, p, etc.) to define the CATEGORY feature. In fact, any intermediate level of subclassification between these two extremes, major categories and literals, can be represented in CATEGORY just as handily. Together they constitute a type hierarchy of CATEGORY. The same mechanism can also be applied to semantic categories (human, animate, food, etc.) to capture thesaurus inference like human --> animate. This makes our knowledge representation much more powerful than in those formalisms without this mechanism. We will address this issue in depth in another paper, Typology for syntactic category and semantic category in Chinese grammar.
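
The idea of a single type hierarchy covering major categories, literals and semantic categories can be sketched in a few lines of Python. This is not ALE syntax; the table `PARENT`, the root names `category` and `semantic_category`, and the function `subsumes` are assumptions introduced only for this illustration.

```python
# Hypothetical fragment of a type hierarchy, stored as child -> parent.
PARENT = {
    # preposition literals are subtypes of the major category p
    "ba": "p", "xiang": "p", "dao": "p", "dui": "p",
    "n": "category", "v": "category", "a": "category", "p": "category",
    # thesaurus inference: human --> animate
    "human": "animate",
    "animate": "semantic_category", "food": "semantic_category",
}


def subsumes(general, specific):
    """True if `general` equals `specific` or is one of its ancestors."""
    t = specific
    while t is not None:
        if t == general:
            return True
        t = PARENT.get(t)
    return False
```

Because ba sits below p in the hierarchy, a constraint stated as p is satisfied by ba without a separate P_FORM feature, and a constraint stated as animate is satisfied by human.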

In the following, we give a brief description on how our grammar works. The grammar consists of several phrase structure rules and a lexicon with lexical entries and lexical rules. First, ALE compiles the grammar into a Prolog parser. During this process (at compile time), lexical rules are applied to lexical entries. In the case of transitive patterns, this means that one entry of chi will evolve into 10 entries. Please note that it is this expanded lexicon that is used for parsing (at run time).
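
The compile-time expansion can be pictured with the following sketch, which follows the informal ten-rule formulation rather than the actual ALE code. The class `Entry`, the list `RULES` and the function `expand` are hypothetical names, and the choice of ba as the valency preposition of chi is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    verb: str
    constr1: str         # semantic constraint on the subject
    constr2: str         # semantic constraint on the object
    prep: Optional[str]  # valency preposition, None if null


def always(e):
    return True


def differ(e):
    return e.constr1 != e.constr2


def has_prep(e):
    return e.prep is not None


# (output pattern, applicability condition) for Lexical rules 1 to 10.
RULES = [
    ("NP1 V NP2: SVO", always),
    ("[NP1, constr1] [NP2, constr2] V: SOV", differ),
    ("NP1 [P NP2] V: SOV", has_prep),
    ("NP2 ... [NP1, constr1] V: OSV", always),
    ("NP2 [bei NP1] V: OSV", always),
    ("V NP2: VO", always),
    ("[NP1, constr1] V: SV", always),
    ("[NP2, constr2] V: OV", differ),
    ("NP2 [bei V]: OV", always),
    ("V: V", always),
]


def expand(entry):
    """Apply every applicable lexical rule to one base entry."""
    return [(entry.verb, pattern) for pattern, ok in RULES if ok(entry)]
```

Under these assumptions `expand(Entry("chi", "animate", "food", "ba"))` yields 10 entries, while a verb such as qingjiao, whose two constraints are both human, yields 8 because the two rules conditioned on differing constraints do not apply.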

At the level of implementation, we do not need to presuppose an abstract transitive structure as the input of the lexical rules and generate 10 new entries from it for each transitive verb. What is needed is one pattern that serves as the basic pattern for the transitive structure, from which the other patterns are derived. In fact, we only need 4 lexical rules to derive the other 4 full patterns from 1 basic full pattern. Elliptical patterns can be handled more elegantly by other means than lexical rules.[2]

The basic pattern constitutes the common condition for lexical rules. Although in theory any one of the 5 full patterns can be seen as the basic pattern, the choice is not arbitrarily made. The pattern we chose is the valency preposition pattern (the BA-type construction) NP1 [P NP2] V: SOV (see Lexical rule 3').[3] This is justified as follows. The valency preposition P (ba, xiang, dao, dui, etc.) is idiosyncratically associated with the individual verb. To derive a more general pattern from a specific pattern is easier than the other way round, for example, NP1 [P NP2] V: SOV --> NP1 V NP2: SVO is easier than NP1 V NP2: SVO --> NP1 [P NP2] V: SOV. This is because we can then directly code the valency preposition under CATEGORY in the SUBCAT feature and do not have to design a specific feature to store this valency information.

Summary

The ultimate aim for natural language analysis is to reach interpretation, i.e. to assign roles to the constituents. An old question is how syntax (form) and semantics (meaning) interact in this interpretation process. More specifically, which is a more important factor in Chinese analysis, the syntactic constraint or the semantic constraint? For the linguistic data we have investigated, it seems that sometimes syntax plays a decisive role and other times semantics has the final say. The essence is how to adequately handle the interface between syntax and semantics.

In our proposal, the syntactic constraint is seen as a more fundamental factor. It serves as the frame of reference for the semantic constraint. The involvement of the semantic constraint seems to be most naturally conditioned by syntactic patterns. In order to ensure their effective interaction, we accommodate syntax and semantics in one model. The model is designed to be based on syntax and resorts to semantic information only when necessary. In concrete terms, the system will selectively enforce or waive the semantic constraint, depending on syntactic patterns.

It should be noted that there are other factors involved in reaching a correct interpretation. For example, in order to recover the omitted complements in elliptical patterns, information from discourse and pragmatics may be vital. We leave this for future research.

References

Carpenter, B. & Penn, G. (1994): ALE, The Attribute Logic Engine, User's Guide, Version 2.0

Gao, Qian (1993): “Chinese BA-Construction: Its Syntax and Semantics”, OSU Working Papers in Linguistics 1993, Kathol A. & Pollard C. (eds.)

Huang, Xiuming (1987): “XTRA: The Design and Implementation of A Fully Automatic Machine Translation System”, Ph.D. dissertation.

Li, Audrey (1990): Chapter 6 “Passive, BA, and topic constructions”, Order & Constituency in Mandarin Chinese. Kluwer Academic Publishers

Li, Wei & McFetridge, Paul (1995): “Handling Chinese NP predicate in HPSG”, Proceedings of PACLING-II, Brisbane, Australia

Pollard, Carl & Sag, Ivan A. (1994): Head-Driven Phrase Structure Grammar, Centre for the Study of Language and Information, Stanford University, CA

Pollard, Carl & Sag, Ivan A. (1987): Information-based Syntax and Semantics. Vol. 1: Fundamentals. Centre for the Study of Language and Information, Stanford University, CA

Wilks, Y.A. (1978): “Making Preferences More Active”, Artificial Intelligence, Vol. 11

Wilks, Y.A. (1975): “A Preferential Pattern-Seeking Semantics for Natural Language Inference”, Artificial Intelligence, Vol. 6

~~~~~~~~~~~~

* This research is part of my Ph.D. project on a Chinese HPSG-style grammar, supported by the Science Council of British Columbia, Canada under G.R.E.A.T. award (code: 61). I thank my supervisor Dr. Paul McFetridge for his supervision. He introduced me to the HPSG theory and provided me with his sample grammars. Without his help, I would not have been able to implement the Chinese grammar in a relatively short time. Thanks also go to Prof. Dong Zhen Dong and Dr. Ping Xue for their comments and encouragement.

[1] The other combinations are:

5d1) * dianxin chi le wo. OVS

5d2) dianxin chi le wo.
The Dim Sum ate me.

Note: It is OK with the 5d2) reading in the pattern NP V NP: SVO.

5e1) * chi le wo dianxin. VSO
5e2) chi le wo dianxin.

(Somebody) ate my Dim Sum.

Note: It is OK with the 5e2) reading in the pattern V [NP1 NP2]: VO where NP1 modifies NP2.

5f1) * chi le dianxin wo. VOS
5f2) chi le dianxin, wo.

Eaten the Dim Sum, I have.

Note: It is OK in Spoken Chinese, with a short pause before wo, in a pattern like V NP, NP: VOS.

[2] The conventional configurational approach is based on the assumption that complements are obligatory and should be saturated. If saturation of complements were not taken as a precondition for a phrase, serious problems might arise in structural overgeneration. On the other hand, optionality of complement(s) is a real life fact. Elliptical patterns are seen in many languages and especially commonplace in Chinese. In order to ensure obligatoriness of complements, the lexical rule approach can be applied to elliptical patterns, as shown in Section 3. This approach maintains configurational constraint in tree building to block structural overgeneration, but the cost is great: each possible elliptical pattern for a head will have to be accommodated by a new lexical entry. With the type mechanism provided by ALE, we have developed a technique to allow for optionality of complement(s) and still maintain proper configurational constraint. We will address this issue in another paper Configurational constraint in Chinese grammar.

[3] This choice is coincidental to the base‑generated account of the BA construction in [Li, A. 1990], but that does not mean much. First, our so‑called basic pattern is not their D‑Structure. Second, our choice is based on more practical considerations. Their claim involves more theoretical arguments in the context of the generative grammar.

Outline of an HPSG-style Chinese reversible grammar

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

http://abcnews.go.com/GMA/Moms/story?id=1406161

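【Startup Notes: Anna's Resignation】

Anna is a lovely, ambitious young Russian woman. She played the piano and danced ballet from childhood, and emigrated to the United States with her parents before finishing elementary school. Tall, with graceful lines, gentle in temperament, poised and considerate, she gives an impression that is classical but not stuffy, modern but not gaudy, sunny and romantic. As everyone knows, although middle-aged Russian women tend to be on the heavy side, Russian girls often have a captivating charm. The ones that old-timers know by heart and never forget include Tonya, the bourgeois young lady in How the Steel Was Tempered; the ballet queen Ulanova; and the peerless figure-skating artist Ekaterina Gordeeva. Anna is just such a Russian girl, there beside us every day, bringing a warm and gentle air to an office full of mostly boys. Naturally, everyone liked her.

Yet Anna has resigned and will be leaving soon, and nobody wants to see her go. I felt bad too: no more of her chatting and laughing at lunch, no more inviting her to a game of ping-pong afterwards. It left me with a sense of loss. I asked her: do you really have to leave? Didn't you say you liked this environment? You know this office is already too crowded with boys, and we are trying to change this situation, trying to find some girls with affirmative action, and you are leaving?

She replied: I like this environment because the people I work with here, people like you, are among the smartest in the world. But because you are all so smart, my own path forward has been blocked, so I had to make the painful decision to leave. I had better go to a consulting company and do the analysis work I am good at. Over the past two years I have watched with my own eyes how my 20 hours of manual work were replaced by your 20 seconds of fully automatic search, with results that are often better, more complete and more consistent than the manual ones.

What she said is true. It was indeed the technology transfer that took her job away. The company did not want to let her go and decided to move her into online customer service, but after thinking it over she felt that, being so young, she should not give up her specialty, and so she decided to leave.

As the technical lead, I am directly connected to her departure. This is a living example of machines replacing human labor.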

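When I joined the company two years ago, it was basically a professional service company. It had developed a system for internal use, but the system's output only narrowed down the scope of the manual work; it took long hours of post-editing, manual additions, deletions and fixes, analysis and summarization before anything could be delivered to a customer. We called the editors information analysts. The job demands strong language skills, the ability to read and comprehend at a glance, and a talent for analysis and synthesis. Anna was the best of the information analysts. Customers were especially pleased with the analysis reports that passed through her hands.

But a company has to account for its costs. The accounting showed that manual labor is fine in moderation; otherwise expenses exceed income and the business loses money. At the time, each search-analysis order took 22 hours of manual labor on average to complete. Those 22 hours were called pain time (pain for the analysts, and even more so for the company). To make money, pain time ideally had to be kept under two hours, which sounded like a fantasy back then. When the boss talked with me, he set this as the main goal but did not set a deadline, because nobody knew whether it was feasible or how many resources it would take. I did not know either; I only felt the weight of the burden. In all my previous work, research came first, then a prototype engine, then the search for an application area, and finally product development. This company was the exact opposite of most technology innovation companies: it had customers first, then a crude engine, and only at the end brought in talent and technology, pinning its hopes on rapid technology transfer. The path struck me as fresh and exciting, and I thought it was worth a try to see whether my technology-transfer skills would be in their element. The advantage of having customers and an application area first is obvious: like the Communist cause once it had the guiding light of the Zunyi Conference, it spares you the long groping in the dark. The road ahead was bright; it all came down to how to walk it and make money.

To make a long story short: after I took over, the core of the system was replaced in three months, results had clearly improved within half a year, and by the first anniversary the pain time of manual labor had been cut to under two hours. The boss was overjoyed.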

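But human desire knows no bounds, like the snake that would swallow an elephant. The boss told me: Wei, you know, your technology has revolutionized our business. Our survival is no longer in question. If we wanted to, we could maintain a machine-plus-human service and grow into a business with tens of millions in annual revenue before long. But as long as there is manual labor, we cannot scale up; the money is limited and the pie stays small. I know you are an ambitious person (to myself I thought: you are not me, how would you know) and surely would not be satisfied with small-time stuff. Whatever the risk, we have decided to abandon this path and go fully automatic, so that the system can serve all analysis customers instead of only our in-house staff (people like Anna) or specially trained power users. Our goal is for every analyst in the world to be unable to do without us, just as nobody can do without Google. To get there, we must bring pain time down to zero. It is a risky move, but the prospects are limitless.

My goodness, with that tone he was already dreaming of ruling the world. America is an interesting place; this land abounds in indomitable entrepreneurial dreamers whose ambitions reach the sky. But America is no paradise for dreamers: 95% of them perish and fewer than 5% survive, with no more than 1% eventually making it big. Truly, one general's glory is built on ten thousand bones. Even so, American-made entrepreneurial dreamers keep coming, wave after wave, without end. I actually like these dreamers a lot; their tenacity and passion are contagious.

Another year went by. We reached the goal of completely eliminating pain time in one major analysis area (pain time 0), taking search analysis from 22 hours of manual labor two years ago to 20 seconds today, fully automatic and ready while you wait, with no manual editing needed.

What is gained on one side is lost on another: two years of hard battle produced achievements beyond everyone's expectations, but also cost us a lovely Russian girl.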

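【Notes on a Second Startup】 Written in April 2008

【Postscript】 There is one more little episode about Anna. As everyone knows, people at startups love to daydream and count their chickens, and stock options are the dream-inducer.

One day, while the guys in the company were counting chickens as usual, Anna said to me: Wei, come here, I got something to show you. I walked over and saw that it was a car. She said to me, word by word: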

I like this car. I just love it. It is my dream car. I want to buy it.
Guys, work hard so I can own this car.

When I took a closer look at the price tag, I nearly fell over: more than a million. She really dares to dream. Good heavens, here it is:

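【One Parsing a Day: If Not Me, Then Who? And Who Am I?】

A famous line from last night:
【中秋，混得好的是花前月下，混得一般的是月下花钱，混得最差的是花下月的钱，混得最好的是钱下月花。】
(Gloss: At Mid-Autumn, those doing well are among the flowers and under the moon; those doing so-so are spending money under the moon; those doing worst are spending next month's money; those doing best are leaving their money to be spent next month.)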

0916a

0916b

The parsing is nearly perfect, but there is one flaw: a separable word whose two parts were not linked. Compare:

0916d

Put together, the result is dazzling. This is no ordinary graph, and it is quite unlike most syntactic trees:

0916c

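Let me show off the parsing from the day before yesterday as well. There is no absolute standard for Chinese deep parsing, but linguists do have a scale in their hearts for what is sound and what is not: the insider sees the craft, the outsider watches the show. The feeling is strange and thrilling. On the one hand, I feel I am walking a road no one has walked before, filled with a pioneer's solemnity and pride. On the other hand, it feels like a destiny ordained from somewhere unseen, carrying out Heaven's way: if not me, then who? And who am I? If language is the carrier and expression of thought (presentation), then parsing is the formalized machine display of thought (representation), and I am the envoy who connects the two. Thank God that, in creating language like a riddle, He did not forget to leave the key behind.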

0915a

0915b

0915c

0915d

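Yes, 【the most incomprehensible thing for humans is the machine's ability to analyze the structure of human language】. That machines will reach the human level in analyzing language structure is no longer in doubt. The part of understanding that machines find hard to reach can be handled through human-machine collaboration. That scene lies in the not-too-distant future and is already vivid before our eyes. Let us get ready to embrace this new era of human-machine integration.

Master Hong has a poem:
Like Cook Ding carving the ox, but in language; Master Wei trains within the Parser. The good knife is kept hidden deep in the mountains, yet in truth it can cut through a tangle of hemp.

【Related】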

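【Looking Back at My PhD Doodles: An Attempt to Bring Common Sense into Grammar】

As I said last time, the vast majority of parsers represent the subcat of predicates very crudely, with no room to stretch: most merely treat subcat as a code and then determine the pattern in the related subcat rules. The lexicon-driven grammar HPSG, however, can do this in fine detail and in a principled way: the subcat pattern can be described meticulously right in the lexicon, and the mapping and decomposition between its syntactic-semantic input (the conditions of the pattern) and its output (the logical semantics) can be given a representation that conforms to linguistic principles.

Crudeness has its engineering considerations and reasons; elaborate layering has its logical elegance. You cannot have both the fish and the bear's paw, and in the end we lean toward the crude approach. Even so, those who take the crude and quick route benefit greatly from having experienced the elegance of structural representation: at the very least they will not be misled by the crude surface, and when facing complex language phenomena they can gradually escape the constraints of crudeness.

Recently I looked back at the doodles from my doctoral years. Although the insight into Chinese syntax they reflect is not outstanding, thanks to the structural richness of HPSG they still apply subcat to Chinese grammar in an orderly way that stands the test of time. I studied HPSG with real concentration back then and digested it thoroughly. Precisely because I had digested it thoroughly, there were no lingering attachments when I later moved beyond it.

For example, the phenomenon of Chinese NPs that carry slots was modeled like this: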

11a) 桌子坏了。
11b) 腿坏了。
11c) 桌子的腿坏了。
12a) 他好。
12b) 身体好。
12c) 他的身体好。

When people say 11b) tui huai le (leg went wrong), we know something (the possessor) is omitted. For 11a), however, we have no such feeling of incompleteness.

Although we may also ask whose table, this possessive relation between who and table is by no means innate. Similarly, ta (he) in 12a) is a complete notion denoting someone while shenti (body) in 12b) is not. In 11c) and 12c), the possessor appears in the possessive structure DE-construction, the expectation of tui (leg) and shenti (body) is realized. These examples show that some words (concepts) have conceptual expectation for some other words (concepts) although the expected words do not necessarily show up in a sentence and the expectation might not be satisfied. In fact, this type of expectation forms part of our knowledge (common sense). One way to represent the knowledge is to encode it with the related word in the lexicon.
Therefore we propose an underlying SYNSEM feature KNOWLEDGE to store some of our common sense knowledge by capturing the internal relation between concepts. KNOWLEDGE parallels syntactic SUBCAT and semantic RELATION. KNOWLEDGE imposes semantic constraints on the expected arguments no matter what syntactic forms the arguments will take (they may take null form, i.e. the underlying arguments are not realized). In contrast, SUBCAT only defines the syntactic requirement for the complements and gets interpreted in RELATION. Following this design, syntactic form and semantic constraints are kept apart. When necessary, the interaction between them can be implemented by lexical rules, or directly coindexed in the lexicon. For example, the following KNOWLEDGE information will be enforced as the necessary semantic constraints when we handle Chinese NP predicates by a lexical rule (see 3.3).

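It was from that time on that common sense was smuggled into the grammar through the back door. This practice is not often seen in formal grammars of European languages, because syntactic form is largely sufficient there and common sense is usually not needed. For Chinese, however, it is hard to build a mature deep analysis system without bringing in some kind of common sense. Back then, the syntactic structure model with common sense was defined like this: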

PHON shenti
SYNSEM | KNOWLEDGE | PRED [1] possess
SYNSEM | KNOWLEDGE | POSSESSOR [2] human
SYNSEM | KNOWLEDGE | POSSESSED [3]
SYNSEM | LOCAL | CONTENT | INDEX [3]
SYNSEM | LOCAL | CONTENT | RESTRICTION { RELATION body }
SYNSEM | LOCAL | CONTENT | RESTRICTION { INSTANCE [3] }

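Finally, the introduction of common sense into Chinese grammar was viewed as a natural extension of the agreement in gender, number and case used in European languages: an extension from syntactic devices to semantic constraints.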

Agreement revisited
This section relates semantic constraints which embody common sense to the conventional linguistic notion of agreement. We will show that they are essentially the same thing from different perspectives. We only need slight expansion for the definition of agreement to accommodate some of our basic knowledge. This is important as it accounts for the feasibility of coding knowledge in linguistic ways. Linguistic lexicon seems to be good enough to house some general knowledge in addition to linguistic knowledge.

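To parse “我鸡吃” (wo ji chi, "I chicken eat") and “鸡我吃” (ji wo chi, "chicken I eat"), common sense entered the grammar (nowadays common sense can also be brought in by way of big data):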

A typical example of how concepts are linked in a network (a sophisticated concept lexicon) is seen in the representation of drink ((*ANI SUBJ) (((FLOW STUFF) OBJE) ((SELF IN) (((*ANI (THRU PART)) TO) (BE CAUSE))))) in Wilks 1975b. While for various reasons we will not go as far as Wilks, we can gain enlightenment from this type of AI approach to knowledge. Lexicon-driven systems like the one in HPSG can, of course, make use of this possibility. Take the Chinese role-assignment problem, for example, the common sense that ANIMATE being eats FOOD can be seamlessly incorporated in the lexical entry chi (eat) as a semantic agreement requirement.

PHON chi
SYNSEM | KNOWLEDGE | PRED [1] eat
SYNSEM | KNOWLEDGE | AGENT [2] animate
SYNSEM | KNOWLEDGE | PATIENT [3] food
SYNSEM | LOCAL | CATEGORY | SUBCAT | EXTERNAL_ARGUMENT [NP: [4]]
SYNSEM | LOCAL | CATEGORY | SUBCAT | INTERNAL_ARGUMENTS <[NP: [5]]>
SYNSEM | LOCAL | CONTENT | RELATION [1]
SYNSEM | LOCAL | CONTENT | EATER [4] | INDEX | ROGET [2]
SYNSEM | LOCAL | CONTENT | EATEN [5] | INDEX | ROGET [3]
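
The numbered tags in the entry above stand for structure sharing. One way to picture this outside a unification formalism is with shared Python objects, as in the rough sketch below. The class `Node` and the helper `get` are hypothetical, and no unification algorithm is implemented; the sketch only shows that two feature paths can lead to one and the same value.

```python
class Node:
    """A mutable cell; one Node reached by two paths models a shared tag."""

    def __init__(self, value=None):
        self.value = value


# Tags [1], [2], [3] of the entry above.
pred, agent, patient = Node("eat"), Node("animate"), Node("food")
# Tags [4] and [5]: the subject and object signs, reduced to their INDEX.
subj = {"INDEX": {"ROGET": agent}}
obj = {"INDEX": {"ROGET": patient}}

chi = {
    "PHON": "chi",
    "SYNSEM": {
        "KNOWLEDGE": {"PRED": pred, "AGENT": agent, "PATIENT": patient},
        "LOCAL": {
            "CATEGORY": {
                "SUBCAT": {
                    "EXTERNAL_ARGUMENT": subj,
                    "INTERNAL_ARGUMENTS": [obj],
                }
            },
            "CONTENT": {"RELATION": pred, "EATER": subj, "EATEN": obj},
        },
    },
}


def get(fs, path):
    """Follow a 'A|B|C' feature path through nested dictionaries."""
    for feature in path.split("|"):
        fs = fs[feature]
    return fs
```

Here `get(chi, "SYNSEM|KNOWLEDGE|AGENT")` and `get(chi, "SYNSEM|LOCAL|CONTENT|EATER|INDEX|ROGET")` return the same object, so any information added along one path is visible along the other.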

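As can be seen, what looks like no more than a subcat code under a finer POS classification actually contains a great deal of structure and the knowledge embedded in it. Today, when unification grammars have almost become a relic of history, I still think a representation like that of HPSG is among the most elegant logical representations in linguistics; for clarity and beauty of logic, later grammars can hardly surpass it.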

Outline of an HPSG-style Chinese reversible grammar

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

Handling Chinese NP predicate in HPSG （old paper）

Handling Chinese NP predicate in HPSG
(old paper in Proceedings of the Second Conference of the Pacific
Association for Computational Linguistics, Brisbane, 1995)

Wei Li & Paul McFetridge

Department of Linguistics
Simon Fraser University
Burnaby, B.C. CANADA V5A 1S6

Key words: HPSG, knowledge representation, Chinese processing

Abstract

This paper addresses a type of Chinese NP predicate in the framework of HPSG 1994 (Pollard & Sag 1994). The special emphasis is laid on knowledge representation and the interaction of syntax and semantics in natural language processing. A knowledge based HPSG model is designed. This design not only lays a foundation for effectively handling Chinese NP predicate problem, but has theoretical and methodological significance on NLP in general.

In Section 1, the data are analyzed. Both structural and semantic constraints for this pattern are defined. Section 2 discusses the semantic constraints in the wider context of the conceived knowledge-based model. The aim of natural language analysis is to reach interpretations, i.e. to correctly assign semantic roles to the constituents. We show that some structures cannot be interpreted without resorting to some common sense knowledge. We present a way to organize and utilize knowledge in the HPSG lexicon. In Section 3, a lexical rule for this pattern is proposed in our HPSG model for Chinese, whose prototype is being implemented.

Problem

We will show the data of Chinese NP predicate first. Then we will investigate what makes it possible for an NP to behave like a predicate. We will do this by defining both the syntactic and semantic constraints for this Chinese pattern.

1.1. Data: one type of Chinese NP predicate

1) 他好身体。

ta hao shenti.
he good body
He is of good health.

2) 张三高个子。

Zhangsan gao gezi
Zhangsan tall figure.
Zhangsan is tall.

3) 李四圆圆的脸。

Lisi yuanyuan de lian.
Lisi round-round DE face.
Lisi has a quite round face.

4) 这件大衣红颜色。

zhe jian dayi hong yanse.
this (cl.) coat red colour.
This coat is of red colour.

5) 明天小雨。

mingtian xiao yu.
tomorrow little rain.
Tomorrow it will drizzle.

6) 那张桌子三条腿。

na zhang zhuozi san tiao tui.
that (cl.) table three (cl.) leg
That table is three-legged.

Note: (cl.) for classifier.
DE for Chinese attribute particle.

The relation between the subject NP and the predicate NP is not identity. The NP predicate in Chinese usually describes a property the subject NP has, corresponding to English be-of/have NP. In identity constructions, the linking verb SHI (be) cannot normally be omitted.[1]

7a) 他是学者。

ta shi xuezhe.
he be scholar
He is a scholar.

7b) ？他学者。

ta xuezhe.
he scholar

1.2. Problem analysis

1.2.1. We first investigate the structural characteristics of the Chinese NP predicate pattern.

A single noun cannot act as predicate. More restrictively, not every NP can become a predicate. It seems that only the NP with the following configuration has this potential: NP [lex -, predicate +]. In other words, a predicate NP consists of a lexical N with a modifying sister. Structures of this sort should not be further modified.[2] Thus, the following patterns are predicted.

8a) 那张桌子三条腿。

na zhang zhuozi san tiao tui. [ same as 6) ]
that (cl.) table three (cl.) leg
That table is three-legged.

8b) 那张桌子塑料腿。

na zhang zhuozi suliao tui.
that (cl.) table plastic leg
That table is of plastic legs.

8c) * 那张桌子三条塑料腿。
* na zhang zhuozi san tiao suliao tui. [too many attributes]

8d) * 那张桌子腿。
* na zhang zhuozi tui. [no attributes]

1.2.2. What is the semantic constraint for the Chinese predicate pattern?

Although there is no syntactic agreement between subject and predicate in Chinese, there is an obvious semantic "agreement" between the two: hao shenti (good body) requires a HUMAN as its subject; san tiao tui (three leg) demands that the subject be FURNITURE or ANIMATE. Therefore, the following are unacceptable:

9) * 这杯茶好身体。

* zhe bei cha hao shenti.
this cup tea good body

10) * 空气三条腿。

* kongqi san tiao tui.
air three (cl.) leg

Obviously, it is not hao (good) or san tiao (three) which poses this semantic selection of subject. The semantic restriction comes from the noun shenti (body) or tui (leg). There is an internal POSSESS relationship between them: shenti (body) belongs to human beings and tui (leg) is one part of an animal or some furniture. This common sense relation is a crucial condition for the successful interpretation of the Chinese NP predicate sentences.

There are a number of issues involved here. First, what is the relationship of this type of knowledge to the syntactic structures and semantic interpretations? Second, where and how would this knowledge be represented? Third, how will the system use the knowledge when it is needed? More specifically, how will the introduction of this knowledge coordinate with the other parts of the well established HPSG formalism? Those are the questions we attempt to answer before we proceed to provide a solution to the Chinese NP predicate. Let us look at some more examples:

11a) 桌子坏了。

zhuozi huai le.
table bad LE
The table went wrong.

11b) 腿坏了。

tui huai le.
leg bad LE
The leg went wrong.

11c) 桌子的腿坏了。

zhuozi de tui huai le.
table DE leg bad LE
The table's leg went wrong.

12a) 他好。

ta hao.
he good
He is good.

12b) 身体好。

shenti hao.
body good
The health is good.

12c) 他的身体好。

ta de shenti hao.
he DE body good
His health is good.

Note: LE for Chinese perfect aspect particle.

When people say 11b) tui huai le (leg went wrong), we know something (the possessor) is omitted. For 11a), however, we have no such feeling of incompleteness. Although we may also ask whose table, this possessive relation between who and table is by no means innate. Similarly, ta (he) in 12a) is a complete notion denoting someone while shenti (body) in 12b) is not. In 11c) and 12c), the possessor appears in the possessive structure DE-construction, and the expectation of tui (leg) and shenti (body) is realized. These examples show that some words (concepts) have conceptual expectation for some other words (concepts) although the expected words do not necessarily show up in a sentence and the expectation might not be satisfied. In fact, this type of expectation forms part of our knowledge (common sense). One way to represent the knowledge is to encode it with the related word in the lexicon.

Therefore we propose an underlying SYNSEM feature KNOWLEDGE to store some of our common sense knowledge by capturing the internal relation between concepts. KNOWLEDGE parallels syntactic SUBCAT and semantic RELATION. KNOWLEDGE imposes semantic constraints on the expected arguments no matter what syntactic forms the arguments will take (they may take null form, i.e. the underlying arguments are not realized). In contrast, SUBCAT only defines the syntactic requirement for the complements and gets interpreted in RELATION. Following this design, syntactic form and semantic constraints are kept apart. When necessary, the interaction between them can be implemented by lexical rules, or directly coindexed in the lexicon. For example, the following KNOWLEDGE information will be enforced as the necessary semantic constraints when we handle Chinese NP predicates by a lexical rule (see 3.3).

PHON shenti
SYNSEM | KNOWLEDGE | PRED [1] possess
SYNSEM | KNOWLEDGE | POSSESSOR [2] human
SYNSEM | KNOWLEDGE | POSSESSED [3]
SYNSEM | LOCAL | CONTENT | INDEX [3]
SYNSEM | LOCAL | CONTENT | RESTRICTION { RELATION body }
SYNSEM | LOCAL | CONTENT | RESTRICTION { INSTANCE [3] }
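
To make the role of KNOWLEDGE in the NP predicate pattern concrete, here is a small Python sketch of the semantic check alone. It is a simplification and not the lexical rule itself: `POSSESSOR`, `CATEGORY_OF` and `np_predicate_ok` are hypothetical names, the structural condition on the predicate NP is ignored, and the category labels assigned to cha (tea) and kongqi (air) are invented for the example.

```python
# Hypothetical KNOWLEDGE fragments: head noun -> admissible POSSESSOR types.
POSSESSOR = {
    "shenti": {"human"},             # body belongs to human beings
    "tui": {"furniture", "animate"}, # leg is part of an animal or furniture
}

# Hypothetical semantic categories of a few subject nouns.
CATEGORY_OF = {
    "ta": "human",
    "zhuozi": "furniture",
    "cha": "drink",
    "kongqi": "gas",
}


def np_predicate_ok(subject, head_noun):
    """True if the subject can fill the POSSESSOR slot of the head noun."""
    expected = POSSESSOR.get(head_noun)
    # A noun without a POSSESSOR expectation cannot head this pattern.
    return expected is not None and CATEGORY_OF[subject] in expected
```

With these toy tables, ta with shenti and zhuozi with tui pass the check, as in 1) and 6), while cha with shenti and kongqi with tui fail, as in 9) and 10).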

Agreement revisited

This section relates semantic constraints which embody common sense to the conventional linguistic notion of agreement. We will show that they are essentially the same thing from different perspectives. We only need slight expansion for the definition of agreement to accommodate some of our basic knowledge. This is important as it accounts for the feasibility of coding knowledge in linguistic ways. Linguistic lexicon seems to be good enough to house some general knowledge in addition to linguistic knowledge. Some possible problems with this knowledge-based approach are also discussed.

Let's first consider the following two parallel agreement problems in English:

13) * The boy drink.

14) ? The air drinks.

13) is ungrammatical because it violates the syntactic agreement between the subject and predicate. 14) is conventionally considered grammatical although it violates the semantic agreement between the agent and the action. Since the approach taken in this paper is motivated by semantic agreement, some elaboration and comment on agreement are in order.

Agreement in person, gender and number is included in the CONTENT | INDEX features (Pollard & Sag 1994, Chapter 2). It follows that any two co-indexed signs naturally agree with each other. That is desirable because co-indexed signs refer to the same entity. However, person, gender and number seem to be only part of the story of agreement. We may expand the INDEX feature to cope with semantic agreement for handling Chinese, and for in-depth semantic analysis of other languages as well.

Note that to accommodate semantic agreement in HPSG, we first need features to represent the result of semantic classification of lexical meanings like HUMAN, FOOD, FURNITURE, etc. We therefore propose a ROGET feature (named after the thesaurus dictionary) and put it into the INDEX feature.

Semantic agreement, sometimes termed semantic constraint or semantic selection restriction in the literature, is not a new concept in natural language processing. Hardly any in-depth language analysis can go smoothly without incorporating it to a certain extent. For languages like Chinese with virtually no inflection, it is even more important. We can hardly imagine how the roles can be correctly assigned without the involvement of semantic agreement in the following sentences of the form NP1 NP2 Vt:

15a) 点心我吃了。

dianxin wo chi le.
Dim-Sum I eat LE
The Dim Sum I have eaten.

15b) 我点心吃了。

wo dianxin chi le.
I Dim-Sum eat LE
I have eaten the Dim Sum.

Who eats what? There is no formal way but to resort to the semantic agreement enforced by eat to correctly assign the roles. In HPSG 1994, it was pointed out (Pollard & Sag 1994, p81), "... there is ample independent evidence that verbs specify information about the indices of their subject NPs. Unless verbs 'had their hands on' (so to speak) their subjects' indices, they would be unable to assign semantic roles to their subjects." The Chinese data show that sometimes verbs need to have their hands on the semantic categories (ROGET) of both their external argument (subject) and internal arguments to be able to correctly assign roles. Now that we have expanded the INDEX feature to cover both ROGET and the conventional agreement features number, person and gender, the above claim of Pollard and Sag becomes more general.
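The role assignment for 15a) and 15b) can be illustrated with a small sketch. This is a hedged illustration in Python rather than the paper's HPSG/ALE implementation; the `ROGET` table, the two-level `IS_A` hierarchy and the function names are all hypothetical. It tries both orders of the two preverbal NPs and keeps only the pairing that satisfies the semantic agreement imposed by chi (eat).

```python
# Hypothetical ROGET categories for the nouns in 15a) and 15b).
ROGET = {"wo": "human", "dianxin": "food"}
# Hypothetical mini-hierarchy: a human is an animate being.
IS_A = {"human": "animate"}

# Semantic agreement required by chi (eat): an ANIMATE being eats FOOD.
CHI = {"agent": "animate", "patient": "food"}


def subsumes(general, specific):
    """True if `specific` equals `general` or lies below it in IS_A."""
    while specific is not None:
        if specific == general:
            return True
        specific = IS_A.get(specific)
    return False


def assign_roles(np1, np2, verb=CHI):
    """Return every agent/patient pairing of NP1 NP2 Vt that agrees with the verb."""
    readings = []
    for agent, patient in ((np1, np2), (np2, np1)):
        if subsumes(verb["agent"], ROGET[agent]) and subsumes(verb["patient"], ROGET[patient]):
            readings.append({"agent": agent, "patient": patient})
    return readings


reading_15a = assign_roles("dianxin", "wo")  # dianxin wo chi le
reading_15b = assign_roles("wo", "dianxin")  # wo dianxin chi le
```

Both word orders end up with the same single reading, which is the sense in which the verb must "have its hands on" the semantic categories of both arguments.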

It is widely agreed that knowledge is bound to play an important role in natural language analysis and disambiguation. The question is how to build a knowledge-based system which is manageable. Knowledge consists of linguistic knowledge (phonology, morphology, syntax, semantics, etc.) and extra-linguistic knowledge (common sense, professional knowledge, etc.). Since semantics is based on lexical meanings, lexical meanings represent concepts, and concepts are linked to each other in a way that forms knowledge, we can well regard semantics as a link between linguistic and extra-linguistic knowledge. In other words, some extra-linguistic knowledge may be represented in linguistic ways. In fact, a lexicon, if properly designed, can be a rich source of knowledge, both linguistic and extra-linguistic. A typical example of how concepts are linked in a network (a sophisticated concept lexicon) is seen in the representation of drink ((*ANI SUBJ) (((FLOW STUFF) OBJE) ((SELF IN) (((*ANI (THRU PART)) TO) (BE CAUSE))))) in Wilks 1975b. While for various reasons we will not go as far as Wilks, we can gain enlightenment from this type of AI approach to knowledge. Lexicon-driven systems like the one in HPSG can, of course, make use of this possibility. Take the Chinese role-assignment problem, for example: the common sense that an ANIMATE being eats FOOD can be seamlessly incorporated in the lexical entry chi (eat) as a semantic agreement requirement.

Note: Following the convention, the part after the colon is SYNSEM | LOCAL | CONTENT information.

One last point we would like to make in this context is that semantic agreement, like syntactic agreement, should be able to loosen its restriction. In other words, agreement is just a canonical requirement, or in Wilks's term a preference (Wilks 1975a). In actual communication, deviation in different degrees is often seen, and people often relax the preference restriction in order to understand. With semantic agreement, deliberate deviation is one of the handy means of rendering rhetorical expression. In a certain domain, Chomsky's famous sentence Colorless green ideas sleep furiously is well imaginable. On the other hand, deviation from syntactic agreement will not affect the meaning if no confusion is caused, which may or may not happen depending on context and the structure of the language. In English, lack of syntactic agreement for the present third person singular between subject and predicate usually causes no problem. Sentence 13) The boy drink can therefore be accepted and correctly interpreted. There is much more to say on the interaction of the two types of agreement deviation, how a preference model might be conceived, what computational complexities it may cause and how to handle them effectively. We plan to address these issues in another paper. The interested reader is referred to one famous approach in this direction (Wilks 1975a, 1978).
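One simple way to picture agreement as a preference rather than a hard filter is to record which requirements a reading breaks instead of rejecting it outright. The sketch below is only an illustration in Python under stated assumptions: the feature names, the toy entries and the idea of a numeric penalty are hypothetical and are not the preference model the author defers to a later paper, nor Wilks's actual algorithm.

```python
# Toy entries; feature names and values are hypothetical.
BOY = {"form": "the boy", "number": "sg", "roget": "human"}
AIR = {"form": "the air", "number": "sg", "roget": "gas"}
DRINK = {"form": "drink", "number": "pl", "agent_types": {"human", "animal"}}
DRINKS = {"form": "drinks", "number": "sg", "agent_types": {"human", "animal"}}


def violations(subject, verb):
    """List the agreement preferences that a subject-verb pairing breaks."""
    broken = []
    if subject["number"] != verb["number"]:
        broken.append("syntactic")
    if subject["roget"] not in verb["agent_types"]:
        broken.append("semantic")
    return broken


def interpret(subject, verb):
    """Always return a reading; deviations only add to its penalty."""
    broken = violations(subject, verb)
    return {
        "agent": subject["form"],
        "action": verb["form"],
        "violations": broken,
        "penalty": len(broken),
    }


sentence_13 = interpret(BOY, DRINK)   # * The boy drink.
sentence_14 = interpret(AIR, DRINKS)  # ? The air drinks.
canonical = interpret(BOY, DRINKS)    # The boy drinks.
```

Under this view both deviant sentences still receive an interpretation, and a parser with several competing readings could prefer the one with the lowest penalty.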

Solution

We will set some requirements first and then present a lexical rule to see how well it meets our requirements.

3.1. Based on the discussion in Section 1, the solution to the Chinese predicate NP problem should meet the following 4 requirements:

(1)        It should enforce the syntactic constraints for this pattern: one and only one modifier XP in the form of NP1 XP N2.

(2)        It should enforce the semantic constraints for this pattern: N2 must expect NP1 as its POSSESSOR with semantic agreement.

(3)        It should correctly assign roles to the constituents of the pattern: NP1 POSSESS NP2 (where NP2 consists of XP N2).

(4)        It should be implementable in HPSG formalism.

3.2. What mechanisms can we use to tackle a problem in HPSG formalism?

An HPSG grammar consists of two components: a general grammar (ID schemata and principles) and a lexical grammar (in the lexicon). The lexicon houses lexical entries with their linguistic description and knowledge representation in feature structures. The lexicon also contains generalizations captured by inheritance in the lexical hierarchy and by a set of lexical rules. Roughly speaking, lexical hierarchy covers static redundancy between related potential structures. Precisely because the lexicon can reflect different degrees of lexical redundancy in addition to idiosyncrasy, the general grammar can desirably be kept to a minimum.

The Chinese NP predicate pattern should be treated in the lexicon. There are two arguments for that. First, this pattern covers only restricted phenomena (see 3.4). Second, it relies heavily on the semantic agreement, which in our model is specified in the lexicon by KNOWLEDGE. We need somehow to link the semantic expectation KNOWLEDGE and the syntactic expectation SUBCAT or MOD. The general mechanism to achieve that is structure sharing by coindexing the features either directly in the lexical entries (see the representation of the entry chi in Section 2) or through lexical rules (see 3.3).

3.3. Lexical Rule

Lexical rules are applied to lexical signs (words, not phrases) which satisfy the condition. The result of the application is an expanded lexicon to be used during parsing. Since the pattern is of the form NP1 XP N2, the only possible target is N2, i.e. shenti (body) or tui (leg). This is due to the fact that among the three necessary signs in this form, the first two are phrases and only the final N2 is a lexical sign. We assume the following structure for our proposed lexical rule:

NP[ta[1]] [[AP[2] hao] [N<NP[1], XP[2]> shenti]]

NP Predicate Lexical Rule

hpsg1

SYNSEM | KNOWLEDGE | PRED [1] possess
SYNSEM | KNOWLEDGE | POSSESSOR [2]
SYNSEM | LOCAL | CATEGORY | HEAD | MAJ [6] n
SYNSEM | LOCAL | CATEGORY | PREDICATE -
SYNSEM | LOCAL | CONTENT | INDEX [4]
SYNSEM | LOCAL | CONTENT | RESTRICTION {[3]}
...| CATEGORY | PREDICATE +
...| CATEGORY | SUBCAT | EXTERNAL_ARGUMENT [NP: [5]]
...| CATEGORY | SUBCAT | INTERNAL_ARGUMENTS < [...| CATEGORY | HEAD | MOD [6] ] >
...| CATEGORY | SUBCAT | INTERNAL_ARGUMENTS < [...| CONTENT | INDEX [4] ] >

==>

...| CATEGORY | SUBCAT | INTERNAL_ARGUMENTS < [...| CONTENT | RESTRICTION {[7]} ] >
...| CATEGORY | SUBCAT | INTERNAL_ARGUMENTS < [...| LEX - ] >
...| CONTENT | RELATION [1] possess
...| CONTENT | POSSESSOR [5] | INDEX | ROGET [2]
...| CONTENT | POSSESSED | INDEX [4]
...| CONTENT | POSSESSED | RESTRICTION {[7] | [3] }

For a complicated information flow like this, it is best to explain the indices one by one with regard to the example ta hao shenti (he is of good body) in the form of NP1 XP N2.

The index [1] links the underlying PRED feature of N2 to the semantic RELATION feature; in other words, the predicate in the underlying KNOWLEDGE of shenti (body) now surfaces as the relation for the whole sentence. The index [2] enforces the semantic constraint for this pattern, i.e. shenti (body) expects a human (ROGET) possessor as the subject (EXTERNAL_ARGUMENT) for this sentence. The index [3] is the restriction relation of N2. [4] links the INDEX features of XP and N2, and [6] indicates that the internal argument is a de-facto modifier of N2, i.e. XP mods-for N2. Note that the part of speech of the internal argument (INTERNAL_ARGUMENT | SYNSEM | LOCAL | CATEGORY | HEAD | MAJ) is deliberately not specified in the rule because Chinese modifiers (XP) are not confined to one class, as can be seen in our linguistic data. Finally, [7] defines the restriction relation of the XP to the INDEX of N2.

The indices [4], [7] and [3] all contribute to artificially creating a semantic interpretation for [XP N2]. As is interpreted, XP is, in fact, a modifier of N2 and they would form an NP2, or [XP N2] constituent. In normal circumstances, the building of NP2 interpretation is taken care of by HPSG Semantics Principle. But in this special pattern, we have treated XP as a complement of N2, yet semantically they are still understood as one instance: hao shenti (good body) is an instance of good and body. This interpretation of NP2 serves as POSSESSED of the sentence predicate, indicated by the structure-sharing of [4], [7] and [3]. Finally, [5] is the interpretation of NP1 and is assigned the role of POSSESSOR for the sentence predicate.
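The information flow described above can be traced in a simplified executable form. The following is a sketch in Python under stated assumptions, not the ALE rule the author implemented: entries are flattened into plain dicts, the names `np_predicate_rule` and `interpret` are hypothetical, and structure sharing is approximated by copying values. Comments mark which tag of the rule each step corresponds to.

```python
SHENTI = {
    "PHON": "shenti", "MAJ": "n", "PREDICATE": False,
    "KNOWLEDGE": {"PRED": "possess", "POSSESSOR": "human"},
    "RESTRICTION": ["body"],
}
ZHUOZI = {
    "PHON": "zhuozi", "MAJ": "n", "PREDICATE": False,
    "KNOWLEDGE": None,  # a table has no innate possessor expectation
    "RESTRICTION": ["table"],
}


def np_predicate_rule(noun):
    """Derive a predicative entry from a noun whose KNOWLEDGE expects a possessor."""
    know = noun.get("KNOWLEDGE")
    if noun["MAJ"] != "n" or noun["PREDICATE"] or not know:
        return None
    if know["PRED"] != "possess":
        return None
    return {
        "PHON": noun["PHON"], "MAJ": "n", "PREDICATE": True,
        "EXTERNAL_ARGUMENT": {"cat": "NP", "ROGET": know["POSSESSOR"]},  # tag [2]
        "INTERNAL_ARGUMENTS": [{"MOD": noun["MAJ"], "LEX": False}],       # tag [6]
        "RELATION": know["PRED"],                                         # tag [1]
        "NOUN_RESTRICTION": list(noun["RESTRICTION"]),                    # tag [3]
    }


def interpret(pred, np1, xp):
    """Combine NP1 XP N2 into a possess relation, or fail on disagreement."""
    if pred is None or xp["LEX"]:
        return None  # the internal argument must be a phrase
    if np1["ROGET"] != pred["EXTERNAL_ARGUMENT"]["ROGET"]:
        return None  # semantic agreement on the possessor fails
    return {
        "RELATION": pred["RELATION"],
        "POSSESSOR": np1["INDEX"],                                        # tag [5]
        "POSSESSED": {"RESTRICTION": [xp["RELATION"]] + pred["NOUN_RESTRICTION"]},
    }


ta = {"INDEX": "ta", "ROGET": "human"}
hao = {"RELATION": "good", "LEX": False}
reading = interpret(np_predicate_rule(SHENTI), ta, hao)  # ta hao shenti
```

A noun without the possess expectation, such as the toy `ZHUOZI` entry, yields no predicative entry at all, so the pattern is blocked in the lexicon rather than in the general grammar.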

Let's see how well this lexical rule meets the 4 requirements set in 3.1.

(1) It enforces the syntactic constraints by treating XP as the internal argument and NP1 as the external argument.

(2) It enforces the semantic constraints through structure sharing by the index [2].

(3) It correctly assigns roles to the constituents of the pattern.

The following interpretation will be established for ta hao shenti (he is of good body) by the parser.

hpsg2

CONTENT | RELATION possess
CONTENT | POSSESSOR | INDEX | PERSON 3
CONTENT | POSSESSOR | INDEX | NUMBER singular
CONTENT | POSSESSOR | INDEX | GENDER male
CONTENT | POSSESSOR | INDEX | ROGET human
CONTENT | POSSESSOR | RESTRICTION { }
CONTENT | POSSESSED | INDEX [1] | PERSON 3
CONTENT | POSSESSED | INDEX [1] | NUMBER singular
CONTENT | POSSESSED | INDEX [1] | GENDER nil
CONTENT | POSSESSED | INDEX [1] | ROGET organ
CONTENT | POSSESSED | RESTRICTION { [ RELATION good ], [ RELATION body ] }
CONTENT | POSSESSED | RESTRICTION { [ INSTANCE [1] ], [ INSTANCE [1] ] }

In prose, it says roughly that a third person male human (he) possesses something which is an instance of good body. We believe that this is an adequate interpretation of the original sentence.

(4) Last, this rule has been implemented in our Chinese HPSG-style grammar using ALE and Prolog. The results meet our objective.

But there is one issue we have not touched yet, word order. At first sight, Chinese seems to have similar LP constraints as those in English. For example, the internal argument(s) of a Chinese transitive verb by default appear on the right side of the head. It seems that our formulation contradicts this constraint in grammar. But in fact, there are many other examples with the internal argument(s), especially PP argument(s), appearing on the left side of the head.

服务 fuwu (serve): <NP, PP(wei)>

16a) 为人民服务

wei renmin fuwu
for people serve
Serve the people.

16b) ? 服务为人民。

fuwu wei renmin.
serve for people

有益 youyi (of benefit): <NP, PP(dui / yu)>

17a) 这对我有益。

zhe dui wo youyi
this to I have-benefit
This is of benefit to me.

17b) * 这有益对我。

zhe youyi dui wo
this have-benefit to I

18a) 这于我有益。

zhe yu wo youyi
this to I have-benefit
This is of benefit to me.

18b) 这有益于我。

zhe youyi yu wo
this have-benefit to I
This is of benefit to me.

Word order and its place in grammar are important issues in formulating Chinese grammar. To play it safe and avoid premature generalization, we assume a lexicalized view of Chinese LP constraints, encoding word order information in the lexicon through the SUBCAT and MOD features. This proves to be a realistic and precise approach to Chinese word order phenomena.

3.4. As a final note, we will briefly compare the NP Predicate Pattern with one of the Chinese Topic Constructions:

NP1 NP2 Vi/A
(topic + (subject + predicate))

In Chinese, this is a closely related but much more productive form than this NP Predicate Pattern. And their structures are different.

19) 他身体好。

ta shenti hao
he body good
He is good in health.

For topic constructions, we propose a new feature CONTEXT | TOPIC, whose index in this case is token identical to the INDEX value of ta. Note that in the above structure, the CONTEXT | TOPIC ta is considered a sentential adjunct instead of a complement subcategorized for by shenti. Why? First, ta is highly optional: a topic-less sentence is still a sentence. Second, and more convincingly, ta cannot always be predicted by its following noun. Compare:

20a) 他身体好。

ta shenti hao
he body good
He is good in health.

20b) 他好身体。

ta hao shenti
he good body
He is of good health.

21a) 他脾气好。

ta piqi hao
he disposition good
He is good in disposition.

21b) 他好脾气。

ta hao piqi
he good disposition
He is of good disposition.

but:

22a) 他学习好。

ta xuexi hao. [3]
he study good
He is good in study.

22b) * 他好学习。

ta hao xuexi
he good study

What this shows is that for topic sentences like ta shenti hao (He is good in health), ta xuexi hao (He is good in study), etc., there is no requirement to regard the topic ta (he) as a necessary semantic possessor of shenti / xuexi. The relation is rather "in-aspect": something (NP1) is good (A) in some aspect (NP2), or for something (NP1), some aspect (NP2) is good (A).

Finally, it needs to be mentioned that our proposed lexical rule requires modification to accommodate sentence 6). That is already beyond what we can reach in this paper because it is integrated with the way we handle Chinese classifiers in HPSG framework.

References

Pollard, Carl & Sag, Ivan A. (1994): Head-Driven Phrase Structure Grammar, Centre for the Study of Language and Information, Stanford University, CA

Pollard, Carl & Sag, Ivan A. (1987): Information‑based Syntax and Semantics Vol. 1: Fundamentals. Centre for the Study of Language and Information, Stanford University, CA

Wilks, Y.A. (1975a): A Preferential Pattern-Seeking Semantics for Natural Language Inference. Artificial Intelligence, Vol. 6, pp.53-74.

Wilks, Y.A. (1975b): An Intelligent Analyzer and Understander of English, in Communications of the ACM, Vol. 18, No.5, pp.264-274

Wilks, Y.A. (1978): Making Preferences More Active. Artificial Intelligence, Vol. 11, pp. 197-223

~~~~~~~~~~~~~~~ footnotes ~~~~~~~~~~~~~~~~

[1] This is not absolute, we do have the following examples:

Ia) 约翰是纽约人。

Yuehan shi Niuyue ren
John be New-York person
John is a New Yorker.

Ib) 约翰纽约人。

Yuehan Niuyue ren.
John New-York person
John is a New Yorker.

IIa) 今天是星期天。

jintian shi xingqi-tian.
today be Sun-day
Today is Sunday.

IIb) 今天星期天。

jintian xingqi-tian.
today Sun-day
Today is Sunday.

It seems that the subject NP stands for some individual element(s), and the predicate NP describes a set (property) to which the subject belongs. But it is not clear how to capture Ib) and IIb) while excluding 7b). We leave this question open.

[2] We realize that the syntactic constraint defined here is only a rough approximation to the data from syntactic angle. It seems to match most data, but there are exceptions when yi (one) appears in a numeral-classifier phrase:

IIIa) 他一副好身体。

ta yi fu hao shenti.
he one (cl.) good body
He is of good health. (He is of a good body.)

IIIb) * 他三副好身体。

ta san fu hao shenti
he three (cl.) good body

IIIc) 他好身体。

ta hao shenti. [same as 1) ]

IVa) 李四一张圆圆的脸。

Lisi yi zhang yuanyuan de lian.
Lisi one (cl.) round-round DE face
Lisi has a quite round face.

IVb) * 李四两张圆圆的脸。

Lisi liang zhang yuanyuan de lian.
Lisi two (cl.) round-round DE face

IVc) 李四圆圆的脸。

Lisi yuanyuan de lian. [ same as 3) ]

[3] Another reading for 22a) is [S [Sta xuexi][AP hao]], where ta xuexi is a subject clause: "That he studies is good". This is another issue.

Outline of an HPSG-style Chinese reversible grammar

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

Notes for An HPSG-style Chinese Reversible Grammar

ABSTRACT

Key words: Chinese parsing, Chinese generation, reversible grammar, HPSG

This paper presents a reversible Chinese unification grammar named CPSG. The lexicalized and integrated design of CPSG embodies the general spirit of the modern linguistic theory Head-driven Phrase Structure Grammar (HPSG, Pollard & Sag 1987, 1994). Using ALE formalism in Prolog (Carpenter & Penn 1994), we have implemented a prototype of CPSG.

CPSG covers Chinese morphology, Chinese syntax and semantics in a novel integrated language model (Figure 1; for the interface between morphology and syntax, see Li 1997; for the interface between syntax and semantics, see Li 1996). The CPSG model is in sharp contrast to the conventional clear-cut successive design of grammar components (Figure 2, see survey in Feng 1996). We will show that our model is much better suited and more efficient for Chinese analysis (or generation).

cpsg

Grammar reversibility is a highly desired feature for multi-lingual machine translation application (Hutchins & Somers 1992, Huang 1986, 1987). To test its reversible features, we have applied the CPSG prototype to an experiment of bi-directional machine translation between English and Chinese. The machine translation engine developed in our Natural Language Lab is based on shake-and-bake design, a novel approach to machine translation suited for unification grammars (Whitelock 1992, 1994, Beaven 1992, Brew 1992). The experimental results meet our design objective and verify the feasibility of CPSG approach.

~~~~~~~~~~~~~~~~~~~~~

Notes for NWLC-97, UBC, Vancouver

Outline of An HPSG-style Chinese Reversible Grammar

Wei LI

Linguistics Department, Simon Fraser University

Key words: lexicalist approach, integrated language model, HPSG, reversible grammar, bi-directional machine translation, Chinese computational grammar, Chinese word identification, Chinese parsing, Chinese generation

1. background

1.1. design philosophy

Two major obstacles in writing Chinese computational grammar:

lacking in serious study on Chinese lexical base

well designed lexicon is crucial for a successful computational system

theoretical linguists have made fruitful efforts (e.g. Li Linding) but lack formalization

computational linguists require more patience in adapting and formalizing the fruits:

it is huge work, but has to be done if a non-toy system is targeted

lack of effective interaction between morphology, syntax and semantics.

e.g.

ambiguity in word identification makes it hard to interface morphology & syntax:

a theoretical defect of morphology preprocessor (segmenter)

e.g. ABC: ABC or A | BC or AB | C or A | B | C?

active/passive isomorphic phenomena make semantic constraint a desired need in parsing NP Vt: subject NP or object NP?

Solution: the lexicalized and integrated design of Chinese grammar
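The word identification ambiguity noted above for a string ABC (ABC, A | BC, AB | C, or A | B | C) can be made concrete by enumerating every way of cutting the string. The sketch below is in Python and is purely illustrative; the function name `segmentations` is hypothetical, and a real system would filter the candidates against a lexicon instead of listing them all.

```python
def segmentations(chars):
    """Return every way to split `chars` into contiguous, non-empty pieces."""
    if not chars:
        return [[]]
    results = []
    for cut in range(1, len(chars) + 1):
        head = chars[:cut]
        for rest in segmentations(chars[cut:]):
            results.append([head] + rest)
    return results


candidates = segmentations("ABC")
```

A string of n characters has 2^(n-1) such segmentations, which is why a segmenter that commits to one of them before syntax has been consulted is a theoretical weak point of the preprocessor design.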

1.2. major theoretical foundation:

HPSG: lexicalist theory encouraging integration of different components

a desired framework matching our design philosophy

CPSG: HPSG-style unification grammar

CPSG: reversible grammar suited for both parsing and generation

CPSG: formalized grammar, a description that does not rely on undefined notions

2. integrated language model

2.1. CPSG versus conventional Chinese grammar

parse tree embodies both morphological and syntactic structures in CPSG

3. lexicalized formal grammar

3.1. formalized grammar, as required by a computational grammar: formulation of CPSG

readily implementable (theories, principles, rules, etc.);

precise definition for the very basic notions (e.g. sign, morpheme, word, phrase, sentence, NP, VP, etc.), rules (PS rules and lexical rules), lexical items (lexical hierarchy), typology (hierarchy embodied in feature structures)

(4.) Definition: sign

A sign is the most fundamental concept of grammar. Formally, a sign is defined by the type [a_sign], which introduces a set of linguistic features for its description, as shown below.

a_sign
INDEX index
KANJI kanji
MORPH1 expected
MORPH2 expected
CATEGORY category
COMP0 expected
COMP1 expected
COMP2 expected
MOD expected
KNOWLEDGE knowledge
CONTENT content
INDEX0 index
INDEX1 index
INDEX2 index
DTR dtr

(5.) Definition: word

In CPSG, a word is a sign satisfying the following two conditions: (1) its obligatory morphological expectation has all been saturated; (2) it is not a mother of any syntactic structures, hence no syntactic daughters. Formally, a word is defined as shown below.

(6.) word

a_sign
MORPH1 ~obligatory
MORPH2 ~obligatory
DTR no_syn_dtr
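The two conditions in the definition of a word translate directly into a predicate over signs. The following is a minimal sketch in Python, assuming a sign is flattened into a dict; the values `"saturated"`, `"optional"`, `"null"` and `"syn_dtr"` are hypothetical stand-ins for whatever the CPSG type hierarchy actually provides, and only `obligatory` and `no_syn_dtr` come from the definition above.

```python
def is_word(sign):
    """A word has no pending obligatory morphology and no syntactic daughters."""
    morph_done = sign["MORPH1"] != "obligatory" and sign["MORPH2"] != "obligatory"
    return morph_done and sign["DTR"] == "no_syn_dtr"


# A free morpheme: nothing morphological is expected, no syntactic daughters.
free_morpheme = {"MORPH1": "null", "MORPH2": "null", "DTR": "no_syn_dtr"}
# An affix still waiting for its obligatory stem is not yet a word.
bare_affix = {"MORPH1": "obligatory", "MORPH2": "null", "DTR": "no_syn_dtr"}
# A derived word whose expectation has been met, with one optional slot left.
derived = {"MORPH1": "saturated", "MORPH2": "optional", "DTR": "no_syn_dtr"}
# A phrase: morphologically complete, but the mother of a syntactic structure.
phrase = {"MORPH1": "null", "MORPH2": "null", "DTR": "syn_dtr"}

results = [is_word(s) for s in (free_morpheme, bare_affix, derived, phrase)]
```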

3.2. lexicalized grammar

CPSG consists of two parts:

(1) a minimized general grammar:

only 11 phrase structure rules
(covering complement structure, modifier structure,
conjunctive structure and morphological structure)

(2) a feature enriched lexicon:

lexical entries;
lexical hierarchy and a set of lexical rules
(capturing lexical generalizations).

(7.) comp0 PS rule

MOTHER a_sign
COMP0 saturated
COMP1 [1]
COMP2 [2]
DTR comp0
MYSISTER [6]
LEFTMOD [7] category
RIGHTMOD [8] category
LEFTCOMP [9] category
RIGHTCOMP [10] category

===>

EXPECTING a_sign
COMP0 a_expected
DIRECTION left
ROLE [3]
SIGN [4]
COMP1 [1] ~obligatory
COMP2 [2] ~obligatory
INDEX [5]
DTR dtr
LEFTMOD [7]
RIGHTMOD [8]
RIGHTCOMP [10]

EXPECTED a_sign [4]
CONTENT content
MYHEAD [5]
MYROLE [3] comp_role
INDEX [6]
CATEGORY [9]

PRINCIPLE #head_feature

(8.) lexical entry: chi

a_sign
KANJI one_character
H1 chi
CATEGORY v
INDEX0 [1] index
INDEX1 [2] index
COMP0 a_expected
DIRECTION left
SIGN a_sign
CATEGORY n
INDEX [1]
COMP1 a_expected
DIRECTION right
SIGN a_sign
CATEGORY n
INDEX [2]
KNOWLEDGE eat
U_OBJECT food
MALE none
PERSON 3
SINGULAR bin
U_SUBJECT animate
MALE bin
PERSON tri
SINGULAR bin

Implementation and Application of CPSG

CPSG prototype implemented in ALE and Prolog, having parsed a corpus of 200 various types of sentences

ALE and Prolog: suitable for unification grammar
ALE: mechanism for typed feature structures: type polymorphism
a powerful tool in language modeling

CPSG prototype adapted for application to bi-directional MT, having generated the same corpus of 200 sentences

References

Beaven, John L. (1992): "Shake and Bake Machine Translation", Proceedings of the 14th International Conference on Computational Linguistics, pp. 603-609, Nantes, France.

Brew, Chris (1992): "Letting the Cat out of the Bag: Generation for Shake-and-bake MT", Proceedings of the 14th International Conference on Computational Linguistics, pp. 610-616, Nantes, France.

Carpenter, B. & Penn, G. (1994): ALE, The Attribute Logic Engine, User's Guide

Feng, Z. (1996): "COLIPS Lecture Series - Chinese Natural Language Processing", Communications of COLIPS, Vol.6, No.1 1996, Singapore (http://www.iscs.nus.sg/~colips/commcolips/paper/p96.html)

Huang, X-M. (1986): "A Bidirectional Grammar for Parsing and Generating Chinese". Proceedings of the International Conference on Chinese Computing, Singapore, pp. 46-54

Huang, X-M. (1987): XTRA: The Design and Implementation of A Fully Automatic Machine Translation System, Doctoral dissertation, University of Essex.

Hutchins, W.J. & H.L. Somers (1992): An Introduction to Machine Translation. London, Academic Press.

Li, W. (1996): Interaction of Syntax and Semantics in Parsing Chinese Transitive Patterns. Proceedings of International Conference on Chinese Computing (ICCC'96), Singapore

Li, W. (1997): Chart Parsing Chinese Character Strings. Proceedings of The Ninth North American Conference on Chinese Linguistics (NACCL-9, to be available), Victoria, Canada

Pollard, C. & I. Sag (1987): Information-based Syntax and Semantics Vol. 1: Fundamentals. Centre for the Study of Language and Information, Stanford University, CA

Pollard, C. & I. Sag (1994): Head-Driven Phrase Structure Grammar, Centre for the Study of Language and Information, Stanford University, CA

Whitelock, Pete (1992): "Shake and Bake Translation", Proceedings of the 14th International Conference on Computational Linguistics, pp. 784-790, Nantes, France.

Whitelock, Pete (1994). "Shake and Bake Translation", C.J. Rupp, M.A. Rosner, and R.L. Johnson (eds.), Constraints, Language and Computation, pp. 339-359, London, Academic Press.

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

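【A Parse a Day: Starting from the subcat of 见面 (jian-mian, "meet")】

Bai:
“三两面” and “两三面” are very different...
我借过他三两面。(I once lent him three liang of flour.) 我见过他两三面。(I have met him two or three times.)

Me:
三两面 > 两三面
我见过他三两面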

0912a
ditransitive, no problem, but:

0912b

separable verb jian-mian is still not connected

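Also:
(0) 我见过他两三面。(I have met him two or three times.)
(1) 我见过他。(I have met him.)
(2) 我与他见过面。(I have met with him.)
(3) * 我见过面 (* I have met.)
(4) 我们见过面。(We have met.)
(5) 我与他，见面过。(He and I have met.)

“见面” requires one of the following: a plural subject (4); a coordinated subject (5); a prepositional phrase with 与 (with), where the boundary between PP and coordination is blurry in Chinese (2); or a [human] attributive in front of the quasi verbal-measure phrase 两三面. All of these syntactic subcat requirements serve to fill a single semantic (or common-sense) [human] slot: common sense says that 见面 (meeting) must take place between two or more human entities.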
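The observation above, that several different syntactic configurations all fill the same common-sense [human] slot, can be sketched as a single semantic check over different syntactic sources. This is an illustrative Python sketch only: the clause representation, the slot names `with_pp` and `measure_modifier`, and the rule of counting a plural NP as two participants are assumptions, not the author's parser.

```python
def human_count(clause):
    """Count the human participants that the syntax makes available to 见面."""
    nps = list(clause.get("subject", []))  # one NP, or several if coordinated
    for slot in ("with_pp", "measure_modifier"):
        if clause.get(slot):
            nps.append(clause[slot])
    # A plural human NP is counted as at least two participants.
    return sum((2 if np.get("plural") else 1) for np in nps if np.get("human"))


def licenses_jianmian(clause):
    """见面 needs two or more human entities, whatever syntax supplies them."""
    return human_count(clause) >= 2


wo = {"form": "我", "human": True}
ta = {"form": "他", "human": True}
women = {"form": "我们", "human": True, "plural": True}

examples = {
    "(0) 我见过他两三面": {"subject": [wo], "measure_modifier": ta},
    "(2) 我与他见过面": {"subject": [wo], "with_pp": ta},
    "(3) 我见过面": {"subject": [wo]},
    "(4) 我们见过面": {"subject": [women]},
    "(5) 我与他，见面过": {"subject": [wo, ta]},
}
verdicts = {name: licenses_jianmian(c) for name, c in examples.items()}
```

Only (3) fails the check, matching the starred judgment for generation; a robust parser may still choose to build a structure for it.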

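Lexicon-driven theories and linguistic representations such as HPSG, which depend heavily on subcat data structures, are cumbersome, but they have one bright spot: they treat the syntactic requirements above as matching conditions on the input, treat the inherent semantic requirement (similar to the descriptions in HowNet) as the semantic output, and formalize them one by one in fine, tightly interlocking detail. The mechanism is unification of labels (that is, sharing of the substructures the labels stand for). Most systems are far too crude about the internal structure of subcat, the mapping from input to output, and the underlying relation between syntax and semantics (semantics is both the motivation and the goal of syntax: syntax is matched, semantics is realized).

Too much is as bad as too little, and too little is as bad as too much. We have long been exploring how, in representing and implementing subcat, to be moderate without being mediocre, and concise without being crude.

Bai:
他我见过几面 (Him, I have met a few times.)

Me:
An extreme example of crudeness is the subcat codes in dictionaries made for human users, such as the Oxford Advanced dictionary and the Longman dictionary: v1, ..., v23 and the like. Later, New York University organized CL graduate students to build subcat lexicons such as CompLex and NomLex. On the Chinese side, 【现代汉语800词】 (Eight Hundred Words of Modern Chinese) from the Institute of Linguistics of the Chinese Academy of Social Sciences pioneered subcat description, and a series of dictionaries such as 【动词用法词典】 (Dictionary of Verb Usage) began trying to express subcat with some kind of coding plus example sentences. In terms of data representation and relations, all of this work looks rather crude. The root cause is that syntax and semantics were never clearly disentangled.

An NLP practitioner who takes up these resources has to work out and digest the syntax-semantics connection internally, then settle on a data structure and find an implementation path. In implementation it is hard to reach the elegance of a unification grammar; mostly one makes do, in order to avoid the inefficiency and the hard-to-maintain data structures of HPSG-style implementations.

Professor Dong's HowNet has reached the peak in the semantics of subcat for Chinese and English, but on the syntactic side it still seems not detailed and comprehensive enough. For example, the six or seven syntactic requirements of 见面 listed above do not seem to be described one by one (Professor Dong, please correct me; perhaps I have not fully digested it), and I have not seen anyone else describe them clearly either. They too need to be chewed over and digested anew before they can be implemented.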

0912c

(3) is illegitimate for generation (*), but for parsing, robustness demands that it be parsed this way, and that is not wrong.

0912d

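It came out without any debugging; dumb luck on 9/12, I suppose. (9/11 was the terrorist attack and 9/13 was the day Lin fled; neither is a good day.) Only the case “我见过他两三面” is left. This quasi verbal-measure complement is really limited to a handful: “一面”, “几面”, “两三面”, “三两面”, and a few others. At the very least, 100+ 面 is basically impossible unless the two are lovers.

Zhang: In serious awe.

Me:
Professor Zhang flatters me. Idle talk ruins a country; as long as I do not mislead "other people's" students, I am fine. I have never been a professor in my life, so any students I mislead are other people's anyway, ha.

Zhang: Bethune.

Me:
Seriously, I really may be guilty of misleading students, because everything has its larger environment and background, and what I say is somewhat heterodox. The result is that mainstream students are looking at flowers through fog. Looking at flowers through fog at least broadens the view; what misleads most is seeing the flower but being unable to reach it. It is like what Lu Xun said: people were sleeping soundly in a dark room, and you insist on going in to 【呐喊】 (Call to Arms); they wake up, but the room is still dark, and that is worse than merely cruel. The way that is not cruel is this: after I retire, open a Deep Parsing open-source park, with every line of code, every lexical entry and every rule made public, and then see whether an unbeatable system can be built with everyone's effort. Let everyone play with symbolic logic together, and let the two lines go on forever.

【Related】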

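【Linguistics Sketch: The "Sales Rhetoric" of Apple's iPhone 7 Launch】

Me:
A while ago I mentioned the challenge that the condensed Chinese if-then poses to parsing. Yesterday I ran into more examples, again with very few explicit formal traces, yet people read them as if-then: “中国出生，美国长大，如何申请回国定居？” (Born in China, raised in the US: how to apply to return and settle?)

VP1, VP2, how VP3

中国出生，美国长大，如何申请回国定居？
== [If] [a person] is born [in] China [and] grows up [in] the US, [then] how [will] [he/she] apply for [the paperwork for] [his/her] returning to China to settle?

So much has been left out. Such condensed Chinese!

This sentence pattern sounds perfectly natural; ordinary people sense no problem of understanding or omission. Looked at closely, it is not entirely without formal traces; a pattern like this seems to call for exactly this reading (?):

VP1 (, VP2, ...), how VPn?

Once it matches, are there any other possible readings? VP1 through VPn-1 are all AND conditions; only VPn is the consequence of the hypothetical condition.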
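The pattern "VP1 (, VP2, ...), how VPn?" can be sketched as a tiny matcher that maps a comma-separated sentence onto an if-then structure. This is a hedged illustration in Python working on raw strings; a real system would match over parsed VPs. The function names and the extra wh-words 怎么 and 怎样 are assumptions added for the example, since only 如何 appears in the sentence discussed.

```python
def split_clauses(sentence):
    """Split on the Chinese comma after dropping a final question mark."""
    return [c for c in sentence.rstrip("？?").split("，") if c]


def read_as_conditional(sentence, wh_words=("如何", "怎么", "怎样")):
    """Map 'VP1, ..., VPn-1, how VPn?' to AND-ed conditions plus a consequence."""
    clauses = split_clauses(sentence)
    if len(clauses) < 2 or not clauses[-1].startswith(wh_words):
        return None  # the pattern does not match
    return {"if": clauses[:-1], "then": clauses[-1]}


reading = read_as_conditional("中国出生，美国长大，如何申请回国定居？")
no_match = read_as_conditional("中国出生，美国长大。")
```

Every clause before the last lands in the `if` list and is understood conjunctively; only the final how-clause becomes the `then` part.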

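Bai:
不甜不要钱，不甜的不要钱 ("not sweet, no charge" vs. "the ones that are not sweet are free")
The meaning is the same; does the gap in form really need to be that wide?
Read as omitting “的”, it is a simple sentence. Read as omitting “如果……则……” (if ... then ...), it is a complex sentence.

Me:
The 的-construction is an odd creature halfway between a phrase and a clause; so is the English what-clause.

Bai:
By the "lazy person's law", the reading with the simplest process and the simplest result takes priority in any case.
Whatever can be completed with the least energy takes priority.

Song:
Not exactly the same. When a melon seller points at a pile of melons and says “不甜不要钱”, he means: I guarantee every one of them is sweet. “不甜的不要钱” is softer in tone: I do not guarantee every melon is sweet, but if the one you buy is not sweet, I will not charge you.

Bai:
Your example can only distinguish whether the omitted noun takes an existential or a universal quantifier; it cannot distinguish whether the omitted function word is “的” or “如果……则”.

Me:
@宋 Nice distinction. Still, this difference in tone is really subtle, and advertisers often seem to exploit such nuances deliberately to fool the public. For the same slogan, the soft reading is the real one, while the advertiser hopes the audience will take the hard reading, which shows off confidence. “不甜不要钱” is exactly that kind of sales rhetoric. Its actual and legal meaning is equivalent to “不甜的不要钱”, but what it wants to convey is: my product is so good that it cannot possibly be unsweet, and I am willing to bet you on it.

Bai:
Soft or hard, if you really hit an unsweet one (a logical counterexample), it is certainly the unsweet melon that is free, not the whole pile. Try it if you do not believe me.

Me:
No need to try, @白硕

Speaking of sales rhetoric, I felt it most keenly watching the Apple launch yesterday. From the Jobs era to now, the line Apple uses most often to dazzle the faithful and the general public is: iXyz is the best Xyz ever made by Apple
A statement like this is a universal truth with zero information content, yet it sounds like the most powerful ad copy.

Bai:
Having sentiment is enough.

Me:
For crying out loud, when you make electronics, do they not get better and better? Would they get worse? Is it not a given that the new generation is better than the previous ones? Is that not all the "best" here claims? It works every time: they treat the whole world as fools, and the whole world is happy to play the fool. Nobody questions or mocks it. If I were a competitor of Apple, I would make a promo specifically to ridicule this rhetoric.

Bai:
"made" is not hard-bound to the comparison range, though.
It is not mandatory.

Me:
The comparison is iPhone7 against iPhones, and iWatch Series 2 against iWatch 1.

Bai:
It can also be read as a horizontal comparison.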

Me:
This is the official press release:
San Francisco — Apple today introduced iPhone 7 and iPhone 7 Plus, the best, most advanced iPhone ever, packed with unique innovations that improve all the ways iPhone is used every day.

“the best, most advanced iPhone ever”

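Bai:
Back to the restrictive vs. non-restrictive question again, clever Ikkyu.

Me:
Logically, strip the modifiers and you get: iPhone 7 is iPhone

Bai:
That kind of stripping is as unacceptable as the one that yields “媳妇是娘” ("the wife is the mother").

Me:
Even if Apple went completely rotten, with no innovation at all, it could forever make this claim:
iPhone 7 is the best iPhone.
(iPhone 8 will be the best iPhone)
In fact, a new iPhone release is always the best iPhone.

Bai:
The thing is, users who have bought an Apple product and hold it in their hands will, under the other reading, feel that they look down on the whole world.

Song:
The pinnacle of Marxism-Leninism.
A new pinnacle.

Me:
If it were really that great, it should say iPhone 7 is the best smart phone.
But it does not dare.

Bai:
Apple is not stupid; it just cannot fool 伟哥 (Brother Wei).

Me:
Only Google's SyntaxNet is naive enough to claim to be number one in the world like that, with no restriction on scope.

【Related】

【Challenges of Chinese Syntax, No. 1: The Condensed if-then】

【NLP Myths, No. 4: Word Sense Disambiguation (WSD) Is the Bottleneck of NLP Applications】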

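《Liwei's Popular Science: The Tasks of the Semantic Module in an NLP System》

This post explores the tasks of the semantic module in NLP (Natural Language Processing), especially in knowledge graph applications. Before that, let us take a view from ten thousand meters up and see where the semantic module sits in the architecture of the main modules of linguistics and NLP.

Linguistics textbooks usually divide the study of language text, from shallow to deep, into these branches: morphology, syntax, semantics and pragmatics. There is another dimension, discourse study, which works across sentences, whereas the other branches generally stay within the sentence. In NLP, the results of morphological and syntactic research take the form of a parser, which automatically analyzes the linear string of a sentence into a syntactic tree structure; endlessly varied sentences are thereby reduced to a finite set of sentence types or patterns, which gives language understanding and applications a solid foundation. Semantics sits after syntax and before pragmatics. We call it semantic middleware, because it is the end point of domain-independent language research and supports pragmatics, which depends on the domain and the application. The tasks of this semantic middleware could also be deferred to the pragmatic stage and done together with semantic grounding, according to what the application requires of semantics. In theory, however, there is always a portion of semantic work that is domain-independent enough to be worth doing in advance, so as to support all kinds of pragmatic scenarios and applications and lighten the load on the pragmatic module.

The semantic module so defined (the semantic middleware) mainly looks for hidden links, such as implicit logical subjects and objects. These are not marked explicitly at the syntactic stage, but there is enough evidence to determine how to fill them in. The filling draws on syntax (the explicit links) on the one hand and on an ontology on the other, usually the two in combination. Done in a word-driven way, it is a very tractable task, more tedious than parsing but less difficult, because both structure and ontology are available (including dynamically formed ontology nodes, such as the classification of NE proper names); the conditions are much more mature than those of a pure syntactic analysis module, which has only linear patterns to work with. Its usefulness is still not very clear. One argument goes: if a piece of hidden semantics is important, why would people not use explicit syntactic means? Even if some important semantics is hard to express explicitly within the syntactic structure chosen for one sentence, people will, if it matters enough, switch to another syntactic structure and express it explicitly in another sentence. If this argument has some merit, then skipping hidden semantics should not do much harm to big data mining. At least in a scenario like big data mining, the redundancy of information is enough to make up for the incompleteness of individual hidden semantics. When syntax finishes, some of the arg(s) mentioned in a sentence are not yet in place; the structure can be called unsaturated. The task of the semantic middleware is to fill the unsaturated slots that syntax left incomplete; once the hidden links are established, the structure is saturated. If it is still unsaturated after the syntactic and semantic modules, the filler should be sought in the discourse. If the discourse does not supply it either, then in theory it should be saturated through common sense.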

回到万米高空俯瞰，昨天还在想所谓“语义计算”到底包含哪些呢。从 community 来看，相关的方面有：（1）WSD（Word Sense Disambiguation）; (2) FrameNet (role labeling); (3) IE（Information Extraction）。“经典”IE （MUC IE 传统）里面一般分 NE、relationship、event，外加 Coreference，等任务。从结构图的角度看，NE 和 WSD 是做 node 的语义计算；FrameNet 和 IE Template （for relationship or event）是做 arc （link）的语义计算。这样来看 community 定义的几个任务和方向，可以发现，（1）和（2）都是学究式的任务，不实用。（3）是最接地气的东西，是应用（apps）直接需要的。但是 IE 是针对领域的，直接为产品服务的，不好抽象，那么就可以想想什么东西是句法之后，语用之前，最能帮助 IE。其中之一就是 Coreference，这个任务已经被 IE 收编了，但它实际上是独立于领域的篇章（discourse）尺度的语义计算，是为了支持 IE 的跨句整合的。

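Following this line of thought, we can go into more detail. Based on practical needs, we have defined three tasks that we feel belong in semantic middleware and should benefit all applications. The first is the appositive relation, which can be seen as a kind of Coreference. The second is the part-whole relation (for example, Apple and iPhone). The third is the cause-effect relation. These three relations are not limited to short syntactic distances; they also include long-distance and even cross-sentence connections. We have long put effort into these three relations, plus pronoun coreference (including aliasing of proper names), more than into hidden logical subject, predicate, and object. The former directly serve the IE that comes after local IE, so that results can be integrated into a graph; they are the glue of integration. The latter can mostly be made up for by information redundancy.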

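The above is what we have learned by feeling our way through practice; it is simply how things naturally unfolded. When local IE grabs information to fill the slots of an IE Template, all it sees is local information, and the slot-filling material is often very "hollow". The extreme case of hollowness is a pronoun ("它" 'it', "这个" 'this') or an anaphoric noun phrase ("这台电脑" 'this computer'). Such things can only serve as bridges; they cannot by themselves lead to a graph. This is where the work of the semantic module on the four aspects above helps turn hollow material into something solid, which is an important support on the road to the graph.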
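A minimal sketch of how coreference turns hollow slot fillers into graph-ready ones. The template layout (`subject`, `relation`, `object`), the `coref` table, the entity names, and the helpers `ground` and `merge_templates` are all hypothetical, chosen only to illustrate the point; a real pipeline would get the coreference links from the semantic module rather than from a hand-written dictionary.

```python
from collections import defaultdict


def ground(mention, coref):
    """Follow coreference links from a hollow mention to its antecedent."""
    seen = set()
    while mention in coref and mention not in seen:
        seen.add(mention)  # guard against accidental cycles in the link table
        mention = coref[mention]
    return mention


def merge_templates(templates, coref):
    """Merge local IE templates into one graph keyed by grounded entities."""
    graph = defaultdict(set)
    for t in templates:
        subj = ground(t["subject"], coref)
        obj = ground(t["object"], coref)
        graph[subj].add((t["relation"], obj))
    return dict(graph)


# Two local templates; the second one is filled only with hollow material.
templates = [
    {"subject": "AcmeCo", "relation": "released", "object": "Widget X"},
    {"subject": "it", "relation": "made_by", "object": "the company"},
]
# Hypothetical coreference links (pronoun -> anaphoric NP -> proper name).
coref = {"it": "this device", "this device": "Widget X", "the company": "AcmeCo"}

graph = merge_templates(templates, coref)
print(graph)
```

Without the `coref` table the second template would hang off the node "it" and could not be merged with anything, which is the sense in which hollow fillers are only bridges.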

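Broadly speaking, how far semantic middleware should go leaves much room for debate. Before the application is fixed, extending much of the fine-grained semantics further makes little sense, or yields little for the effort. That is, after syntax has set up the structural framework and before the concrete application at the pragmatic level is determined, how much semantic computing should be done is not easy to say. Both intuition and experience argue against doing too much. In a sense, Fillmore created FrameNet precisely to carry semantic middleware through to the end. In theory, his going deeper is justified: after arg structure (the specialty of syntactic subcat), if one is to go deeper, a domain-independent Frame hierarchy is the deep bridge to pragmatics. At least in theory. But after 18 years of doing IE, our conclusion is that Fillmore's path of semantic computing is basically a wrong turn. We felt no benefit from it, yet it brings a large overhead, is hard to operationalize, and saves no effort. The IE field uses Templates to define the needs of a pragmatic domain, and nobody advocates defining these Templates on the FrameNet hierarchy, because no need is felt and it is not realistic either. A hundred years from now, FrameNet may be rediscovered, because by then there will be so many pragmatic groundings that they will need organizing. FrameNet happens to offer a framework for organization and integration, whereas today's pragmatic groundings are scattered.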

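In Liwei's NLP University, any student who can follow this "advanced popular science", laced as it is with pseudo-foreign talk (terminology), may be awarded a degree. This degree is hard currency. If you cannot follow it, never mind: treat it as a madman's ravings, or as a wrong turn into a maze where even the same trade feels mountains apart, one that wasted precious time you could have spent on deep learning (dl).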

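【Related】

《Salon Talk Among Three: On the Entanglement of Syntax and Semantics》

【Liwei's Popular Science: The Saga of Putting Structural Ambiguity to Sleep and Waking It Up】

【Research Notes: NLP "Caterpillar" Notes, from One Dimension to Two】

【Semantic Computing Salon: The Various Chemistries of Triangular Relations】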

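Bai:
朴泰恒小组成绩不好，今天不一定能进决赛 (朴泰恒 did poorly in the group stage and may not make the final today)
In the example above, where to attach "小组" (group) is a test.
The intended meaning is "in the group-stage phase".

Liang:
朴泰恒今天小组成绩不好。 (朴泰恒's group-stage results today were not good.)
孙杨小组第一。 (Sun Yang was first in his group.)

Bai:
Groups named after a person do exist too.

Liang:
Right, it feels like "小组成绩不好" (group-stage results not good) is the predicate. And here 小组 is not "朴泰恒的小组" (朴泰恒's group) either. Here comes the test.

Me:
Aren't we supposed to have big data? Check whether "so-and-so 小组" is frequent enough to qualify.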

t08061

t08062

t08063

t08064

t08065

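Liang:
@wei Great! There is a Topic.

Song:
@wei Very good indeed. But can it really distinguish the two kinds of "小组", or does it only cover one side?

Me:
Without big data, it probably covers only one side. We can try a typical case of the other side.

Song:
Even with big data, you still have to distinguish era, region, industry, and so on, which is hard to handle.
Besides, this turns into supervised learning and requires corpus annotation.

Bai:
Not necessarily, Prof. Song. Labels can be added to the lexicon offline; online, the target text only needs a computation of label density, which involves no supervised learning.

Song:
Could you explain in more detail?

Me:
Lexicon acquisition is, at bottom, unsupervised and built on ngram frequencies. Suppose 北京大学 (Peking University) were not in the lexicon; it should be learnable, and likewise "so-and-so 小组". What Prof. Bai describes is on-the-fly lexicalization through computation at run time.

Song:
@wei For this example, that means comparing the frequency of "朴泰恒小组" with that of "朴泰恒……小组", right?

Me:
Can it solve this one: 北京大学、中学、小学要立刻全部动员起来 (Beijing's universities, middle schools, and primary schools must all be mobilized at once)
The general rule for overlapping segmentation of xyz: whether xy or yz is stronger can, in principle, be computed by online retrieval.
Is "北京大学" stronger, or "大学、中学"?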
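The "xy or yz, which is stronger" comparison can be sketched as follows. This is only one possible reading of "stronger": the sketch uses add-one smoothed pointwise mutual information, which is my assumption and not something stated in the chat, and the counts are made up to stand in for what an online corpus lookup would return. The function names `association` and `stronger_side` are hypothetical.

```python
import math


def association(pair, left, right, total):
    """Add-one smoothed pointwise mutual information of an adjacent pair."""
    return math.log((pair + 1) * total / ((left + 1) * (right + 1)))


def stronger_side(x, y, z, unigram, bigram, total):
    """For the string x+y+z, report whether y binds more strongly to x or to z."""
    xy = association(bigram.get((x, y), 0), unigram.get(x, 0), unigram.get(y, 0), total)
    yz = association(bigram.get((y, z), 0), unigram.get(y, 0), unigram.get(z, 0), total)
    return "xy" if xy >= yz else "yz"


# Made-up counts, standing in for what an online corpus lookup would return.
unigram = {"北京": 5000, "大学": 3000, "中学": 1000}
bigram = {("北京", "大学"): 900, ("大学", "中学"): 50}

print(stronger_side("北京", "大学", "中学", unigram, bigram, total=1_000_000))
```

The sketch shows only the mechanics of the comparison. Which side wins depends entirely on the counts fed in, and a statistical win does not by itself guarantee the right reading of a given sentence.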

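Song:
If we treat it as an overlapping ambiguity, then in big data the frequency of "小组成绩" is bound to exceed that of "朴泰恒", unless 朴泰恒 is extremely famous. So deciding the syntactic structure on that basis seems insufficiently justified.

Me:
How do humans make the decision?
This may involve the question of the scope of big data.
Bigger data is not always better, and above all it must not be a hodgepodge: data that is big and mixed flattens out the domain, and this may well be domain knowledge.

Song:
Right, I was confused.

Bai:
In fact, combining with a person name is the fallback; what needs learning is only the high-frequency strings that do not combine with person names.
If the conditions for combining to the right are not met, just default to the left.
That is not how big data should be used.

Song:
Still, in general "X小组" cannot compete with "小组成绩". This is a matter of domain knowledge and is not well handled by word frequency.

Me:
Let me first mention a discourse phenomenon: one sense per discourse.
If "so-and-so 小组" recurs in the same text, that principle is solid and the question can be settled within the discourse; big data yields in that case.

Song:
张三小组第一，李四小组第二。 (either "Zhang San's group came first, Li Si's group second" or "Zhang San was first in his group, Li Si second in his")

Bai:
@Song Rou This one is ambiguous.

Me:
There are four tiers.
Tier 1 is hijacking by the lexicon; 北京大学 is basically such a case.
Tier 2 is the discourse principle.
Tier 3 is domain data.
Only tier 4 is cross-domain big data.
Cases involving proper names and terminology never get as far as cross-domain big data; big data flattens out domain knowledge, which is counterproductive.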
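The four tiers read naturally as a back-off cascade. Below is a minimal sketch under loud assumptions: the function names `decide` and `by_counts`, the representation of each statistics source as a plain string-to-count dictionary, the substring test used for the discourse tier, and the `margin` threshold are all hypothetical choices of mine, not the author's implementation.

```python
def by_counts(xy, yz, stats, margin):
    """Pick a side only if its count beats the other by the assumed margin."""
    a, b = stats.get(xy, 0), stats.get(yz, 0)
    if a >= margin * max(b, 1):
        return "xy"
    if b >= margin * max(a, 1):
        return "yz"
    return None


def decide(x, y, z, lexicon, discourse, domain_stats, general_stats, margin=2.0):
    """Resolve the overlap in x+y+z by backing off through four tiers."""
    xy, yz = x + y, y + z

    # Tier 1: the lexicon hijacks the decision if it lists exactly one side.
    if (xy in lexicon) != (yz in lexicon):
        return ("xy" if xy in lexicon else "yz", "lexicon")

    # Tier 2: one sense per discourse. Look for either side elsewhere in the
    # same text, outside the ambiguous string itself.
    rest = discourse.replace(x + y + z, "")
    seen_xy, seen_yz = xy in rest, yz in rest
    if seen_xy != seen_yz:
        return ("xy" if seen_xy else "yz", "discourse")

    # Tier 3 then tier 4: domain data first, cross-domain data last.
    for tier, stats in (("domain", domain_stats), ("general", general_stats)):
        side = by_counts(xy, yz, stats, margin)
        if side:
            return (side, tier)
    return (None, None)


text = "朴泰恒小组成绩不好。孙杨小组成绩很好。"
print(decide("朴泰恒", "小组", "成绩", set(), text, {}, {}))
```

The second element of the result records which tier made the call, so a case that falls all the way through to cross-domain data can be flagged as low confidence.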

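Bai:
That holds at the token level, but not necessarily at the feature level.
At the feature level, all "xx小组" can be pulled up and counted together.

Me:
Understood. But in actual operation it is still a muddle. When "xxx 小组" fights "小组成绩", by how much must one win to count as winning? In how much data? If the gap is huge, fine; if the two are somewhat close, it is a messy account (烂帐), or a messy battle (烂仗).

Bai:
In addition, feature density can be computed over the discourse; if the density of one feature is markedly higher than that of the others, it can be used too. For example, if the sports feature is prominent, "小组" as a prefix gets higher priority.

Song:
I searched 11 years of People's Daily: "小组赛" (group-stage match) 1013 times, "小组成绩" (group-stage results) 4 times, "小组赛成绩" twice, person name + 小组 3 times. A person with no knowledge of sports competitions at all, but with general knowledge of competitions, knows that competitions produce results and can thus infer that "小组比赛" is a phrase. First, the bound morpheme "赛" attaches to give "小组赛"; one learns that the term "小组赛" exists and understands it as competing in groups. Knowing that competitions produce results, one can then infer that "小组成绩" is a phrase, meaning someone's results in the group stage. Person name + 小组 occurred 7 times, but none related to sports: 赵梦桃小组, 郝建秀小组, and so on, all from cotton mills. A person without sports knowledge, but with general knowledge of competitions plus knowledge of the language, can reason this way.

Me:
"周恩来思想深刻谈吐幽默" (Zhou Enlai is deep in thought and witty in speech) vs. "毛泽东思想深刻" (either "Mao Zedong Thought is profound" or "Mao Zedong is deep in thought")
"思想" (thought) behaves like "小组".

Song:
Before the 1940s, Chinese did not seem to have "person name + 思想" as a word. After that, "毛泽东思想" grew more and more frequent. But other person names + 思想 cannot form a word.

Me:
The politics here is interesting: from then on, other person names + 思想 became taboo. 我花开来百花杀啊 (when my flower blooms, all other flowers perish).

Bai:
@Song "小组循环赛" (group round-robin), "小组出线" (advancing from the group), "小组第一" (first in the group) ... all these combinations take "小组" as a prefix, but if we look only at instances, they are really no better off than "朴泰恒小组". A bit more or a bit less in frequency counts cannot serve as a basis for preferring one structure. However, if we ask abstractly what affects the relative priority of the "prefix pattern" and the "suffix pattern", we are inevitably led back to features and the density distribution of features in the discourse. If the "sports" or "competition" feature and its density advantage are pronounced, "小组" tends to be a prefix; otherwise it tends to be a suffix. If the instance carried by the prefix happens to be in the big data, so much the better; if not, it can still get indirect support from allies through features and feature density. Likewise, if "person name" or "task name" features or their density are pronounced, "小组" tends to be a suffix.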
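A minimal sketch of the feature-density idea: tag words offline in the lexicon, then compute tag density over the target text online, with no supervised training. The tag table `FEATURES`, the tag names `sports` and `work`, the `ratio` threshold, and the assumption that the text arrives already segmented into tokens are all hypothetical. The default to suffix when the sports feature does not dominate follows the "otherwise" clause above.

```python
from collections import Counter

# Hypothetical offline lexicon tags: word -> feature.
FEATURES = {
    "决赛": "sports", "比赛": "sports", "出线": "sports", "游泳": "sports",
    "车间": "work", "任务": "work", "棉纺厂": "work",
}


def feature_density(tokens):
    """Share of tokens carrying each feature tag in the discourse."""
    tags = Counter(FEATURES[t] for t in tokens if t in FEATURES)
    n = max(len(tokens), 1)
    return {feature: count / n for feature, count in tags.items()}


def xiaozu_role(tokens, ratio=2.0):
    """Prefer 小组 as a prefix when the sports feature clearly dominates."""
    density = feature_density(tokens)
    sports = density.get("sports", 0.0)
    other = max((d for f, d in density.items() if f != "sports"), default=0.0)
    if sports > 0 and sports >= ratio * other:
        return "prefix"   # 小组 + 成绩 / 赛 / 第一 ...
    return "suffix"       # person name or task name + 小组


sports_text = ["朴泰恒", "小组", "成绩", "不好", "今天", "不", "一定", "能", "进", "决赛"]
mill_text = ["赵梦桃", "小组", "是", "棉纺厂", "的", "先进", "车间"]
print(xiaozu_role(sports_text), xiaozu_role(mill_text))
```

Because the decision rests on tags rather than on the specific string "朴泰恒小组", an unseen instance still gets support from the other tagged words in the same text.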

【Related】

【A Parse a Day: degraded text and robust parsing】

Me:
“i love programming the games are cool its fun to play them don't you think”
@Liang here are the parsing results for your casual English:

t0721a

So there is one error in parsing this "degraded text":
our parser links "the games" as the Object of "programming", which is locally correct and an understandable mistake. But a human knows that a punctuation mark is missing and will link "the games" as the Subject of "are". The other aspects of the parse seem alright. So "degraded text" does pose some challenges, but a robust parser can still handle most of it.

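Liang:
Thank you, @wei. It is very well handled. By the way, it is not my casual English. I copied it from Khan Academy.
@wei, "Opred" means predicate as object; what is "infmod"?

Bai:
An infinitive used as a postmodifier.

Me:
Right. Opred is a predicative object, which includes -ing forms and infinitives.
In fact, with finer work that error can be corrected, because the force with which "are" demands a subject far exceeds the pull of the preceding verb for an object. That would reach the level of human structural analysis.

Bai:
How did "think" end up as Next? This is a tag question.

Me:
Prof. Bai has a sharp eye; I would not have noticed if he had not pointed it out. That is clearly a bug: the auxiliary was taken as the main verb.
For this particular case, it should be lexicalized.

Bai:
"are" is also closer, and it stays unsaturated unless its subject is filled. "programming", by contrast, does not necessarily have a slot to fill.
Agreed on lexicalizing.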
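The attachment argument made here (an obligatory slot beats an optional one, and the nearer head beats the farther one) can be sketched as a simple ranking. This is an illustration only: the `OBLIGATORY` table, the candidate tuple layout, and the function `attach` are hypothetical, and the real parser's repair logic is not shown in the post.

```python
# Hypothetical slot table: (head word, role) -> whether the slot must be filled.
OBLIGATORY = {("are", "S"): True, ("programming", "O"): False}


def attach(np_index, candidates):
    """Choose the head for a noun phrase among competing (head, role, index) slots.

    Obligatory slots outrank optional ones; ties are broken by distance.
    """
    def score(candidate):
        head, role, head_index = candidate
        return (OBLIGATORY.get((head, role), False), -abs(head_index - np_index))

    return max(candidates, key=score)


# Token positions: i(0) love(1) programming(2) the(3) games(4) are(5) cool(6)
candidates = [("programming", "O", 2), ("are", "S", 5)]
print(attach(4, candidates))
```

With these assumed settings "the games" goes to "are" as Subject, leaving "programming" without an object, which it can do without.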

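【Related】

成语的弹性识别和理解机制 (An Elastic Recognition and Understanding Mechanism for Idioms)

Bai:
“去年秋膘应犹在，只是猪颜改” (Last year's autumn fat should still be there; only the pig's face has changed. A play on the classic line "雕栏玉砌应犹在，只是朱颜改".)

Me:
1234应犹在只是56改
The elastic idiom mechanism catches these every time. Which parts of an idiom are variables and which are constants is something one can study. People have a rough ledger of it in their heads. Take "九牛二虎之力" (the strength of nine oxen and two tigers, i.e. tremendous effort) as an example. The first ring of elasticity is turning the numerals into variables: m牛n虎之力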

二牛九虎之力
九虎二牛之力
八虎七牛之力
四牛五虎之力

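None of these affects parsing or understanding; either way, it took a tremendous amount of effort.

The second ring of elasticity is turning the nouns into variables along the taxonomy: m【large animal】n【large animal】之力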

九熊二豹之力
三象五狮之力
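The two rings of elasticity can be sketched with a regular expression in which the numerals and the animal nouns are character classes while "之力" stays constant. The character sets `NUMERALS` and `BIG_ANIMALS` are small hypothetical stand-ins for a real numeral class and a real taxonomy node, and `match_elastic` is an illustrative helper, not an existing API.

```python
import re

# Ring 1: numerals become a variable (hypothetical, incomplete numeral set).
NUMERALS = "一二两三四五六七八九十百千万"
# Ring 2: nouns become a variable over a taxonomy node (hypothetical members).
BIG_ANIMALS = "牛虎熊豹象狮"

# m【large animal】n【large animal】之力, with 之力 as the constant part.
PATTERN = re.compile(
    rf"([{NUMERALS}])([{BIG_ANIMALS}])([{NUMERALS}])([{BIG_ANIMALS}])之力"
)


def match_elastic(text):
    """Map any variant back to the canonical idiom and report its variables."""
    m = PATTERN.search(text)
    if m is None:
        return None
    return {
        "canonical": "九牛二虎之力",
        "variables": m.groups(),
        "meaning": "tremendous effort",
    }


for phrase in ["二牛九虎之力", "费了三象五狮之力", "九猫二鼠之力"]:
    print(phrase, match_elastic(phrase))
```

The last phrase fails to match because 猫 and 鼠 are not in the assumed large-animal class; how far the class may stretch before the idiom stops being recognized is exactly the kind of question raised above.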

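Forwarded:
Today is the Beginning of Autumn (立秋). Ask the heavens: which season is the busiest? Autumn: 多事之秋 (an eventful autumn, i.e. troubled times). Which season is the fairest? Autumn: 平分秋色 (to share the autumn scenery equally). Which season is the simplest? Autumn: 一叶知秋 (one leaf tells of autumn). Which season is the longest? Autumn: 一日不见如隔三秋 (one day apart feels like three autumns). Which season is the most refreshing? Autumn: 秋高气爽 (high autumn sky and crisp air). Which season is the most perilous? Autumn: 秋后算账 (settling accounts after autumn). Which season is the most flirtatious? Autumn: 暗送秋波 (secretly sending autumn ripples, i.e. making eyes at someone)! Happy autumn!!

The elastic idiom mechanism can rise from "秋" (autumn) to 【season】 and further up to 【time period】:

多事之春
多事之年
多事之岁月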
平分春色
一花知春
一日不见如隔三冬
一日不见如隔九冬

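Bai:
秋天来了，冬天还会远吗 (Autumn is here; can winter be far behind?)

Me:
冬天来了秋天还会远吗 (Winter is here; can autumn be far behind?)
This is a time tunnel,
running either in reverse or on fast-forward.

About the subheading: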

0905b

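【成语】的【【弹性【识别和理解】】机制】: syntactically it should be bracketed like this. For idioms (成语), we need an elastic recognition mechanism, or a mechanism for elastic recognition. But at the time of writing, what I more likely had in mind was: for 【the elasticity of idioms】 (成语的弹性), we need a recognition mechanism.

On second thought, who cares; isn't human expression and understanding often fuzzy in just this way? Except in jokes or when nitpicking, people usually do not even notice this kind of structural ambiguity. Semantic fuzziness does not affect the gist of understanding either.

【Related】

Overview of Liwei's NLP Blog Posts

http://link.springer.com/chapter/10.1007%2F978-1-4020-4746-6_11

Liwei's NLP Channel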

Once upon a time, we were publishing like crazy

List of 23 NLP Publications (Cymfony Period)

Once upon a time, we were publishing like crazy ...... as if we were striving for tenured faculty positions

[1] R. Srihari, W. Li and X. Li. 2006. Question Answering Supported by Multiple Levels of Information Extraction. Book chapter in T. Strzalkowski & S. Harabagiu (eds.), Advances in Open-Domain Question Answering. Springer, 2006, ISBN:1-4020-4744-4.

[2] R. Srihari, W. Li, C. Niu and T. Cornell. 2006. InfoXtract: A Customizable Intermediate Level Information Extraction Engine. Journal of Natural Language Engineering, 12(4), 1-37

http://journals.cambridge.org/action/displayAbstract?fromPage=online&aid=1513012

This paper focuses on IE tasks designed to support information discovery applications. It defines new IE tasks such as entity profiles, and concept-based general events which represent realistic goals in terms of what can be accomplished in the near-term as well as providing useful, actionable information.

[3] C. Niu, W. Li, R. Srihari, H. Li. 2005. Word Independent Context Pair Classification Model For Word Sense Disambiguation. Proceedings of Ninth Conference on Computational Natural Language Learning (CoNLL-2005)

W05-0605

[4] C. Niu, W. Li and R. Srihari. 2004. Weakly Supervised Learning for
Cross-document Person Name Disambiguation Supported by Information
Extraction. In Proceedings of ACL 2004.

ACL 2004 Niu Li Srihari 372_pdf_2-col

[5] C. Niu, W. Li, R. Srihari, H. Li and L. Christ. 2004. Context Clustering for Word Sense Disambiguation Based on Modeling Pairwise Context Similarities. In Proceedings of Senseval-3 Workshop.

ACL 2004 Context Clustering for WSD niu1

[6] C. Niu, W. Li, J. Ding, and R. Srihari. 2004. Orthographic Case
Restoration Using Supervised Learning Without Manual Annotation.
International Journal of Artificial Intelligence Tools, Vol. 13, No.
1, 2004.

IJAIT 2004 Niu, Li, Ding, and Srihari caseR

[7] C. Niu, W. Li and R. Srihari. 2004. A Bootstrapping
Approach to Information Extraction Domain Porting. ATEM-2004: The
AAAI-04 Workshop on Adaptive Text Extraction and Mining. San Jose.

WS104NiuC

[8] W. Li, X. Zhang, C. Niu, Y. Jiang, and R. Srihari. 2003. An Expert
Lexicon Approach to Identifying English Phrasal Verbs. In Proceedings
of ACL 2003. Sapporo, Japan. pp. 513-520.

ACL 2003 Li, Zhang, Niu, Jiang and Srihari 2003 PhrasalVerb_ACL2003_submitted

[9] C. Niu, W. Li, J. Ding, and R. Srihari 2003. A Bootstrapping
Approach to Named Entity Classification using Successive Learners. In
Proceedings of ACL 2003. Sapporo, Japan. pp. 335-342.

ACL 2003 Niu, Li, Ding and Srihari 2003 ne-acl2003

[10] W. Li, R. Srihari, C. Niu, and X. Li. 2003. Question Answering on
a Case Insensitive Corpus. In Proceedings of Workshop on Multilingual
Summarization and Question Answering - Machine Learning and Beyond
(ACL-2003 Workshop). Sapporo, Japan. pp. 84-93.

ACL 2003 Workshop Li, Srihari, Niu and Li 2003 QA-workshopl2003_final

[11] C. Niu, W. Li, J. Ding, and R.K. Srihari. 2003. Bootstrapping for
Named Entity Tagging using Concept-based Seeds. In Proceedings of
HLT/NAACL 2003. Companion Volume, pp. 73-75, Edmonton, Canada.

NAACL 2003 Niu, Li, Ding and Srihari 2003 ne_submitted

[12] R. Srihari, W. Li, C. Niu and T. Cornell. 2003. InfoXtract: A
Customizable Intermediate Level Information Extraction Engine. In
Proceedings of HLT/NAACL 2003 Workshop on Software Engineering and
Architecture of Language Technology Systems (SEALTS). pp. 52-59,
Edmonton, Canada.

NAACL 2003 Workshop InfoXtract SEALTS paper2

[13] H. Li, R. Srihari, C. Niu, and W. Li. 2003. InfoXtract Location
Normalization: A Hybrid Approach to Geographic References in
Information Extraction. In Proceedings of HLT/NAACL 2003 Workshop on
Analysis of Geographic References. Edmonton, Canada.

NAACL 2003 Workshop Li, Srihari, Niu and Li 2003 CymfonyLoc_final

[14] W. Li, R. Srihari, C. Niu, and X. Li 2003. Entity Profile
Extraction from Large Corpora. In Proceedings of Pacific Association
for Computational Linguistics 2003 (PACLING03). Halifax, Nova Scotia,
Canada.

PACLING 2003 Li, Srihari, Niu and Li 2003 Entity Profile profile_PACLING_final_submitted

[15] C. Niu, W. Li, R. Srihari, and L. Crist 2003. Bootstrapping a
Hidden Markov Model for Relationship Extraction Using Multi-level
Contexts. In Proceedings of Pacific Association for Computational
Linguistics 2003 (PACLING03). Halifax, Nova Scotia, Canada.

PACLING 2003 Niu, Li, Srihari and Crist 2003 CE Bootstrapping PACLING03_15_final

[16] C. Niu, Z. Zheng, R. Srihari, H. Li, and W. Li 2003. Unsupervised
Learning for Verb Sense Disambiguation Using Both Trigger Words and
Parsing Relations. In Proceedings of Pacific Association for
Computational Linguistics 2003 (PACLING03). Halifax, Nova Scotia,
Canada.

PACLING 2003 Niu, Zheng, Srihari, Li and Li 2003 Verb Sense Identification PACLING_14_final

[17] C. Niu, W. Li, J. Ding, and R.K. Srihari 2003. Orthographic Case
Restoration Using Supervised Learning Without Manual Annotation. In
Proceedings of the Sixteenth International FLAIRS Conference, St.
Augustine, FL, May 2003, pp. 402-406.

FLAIRS 2003 Niu, Li, Ding and Srihari 2003 FLAIRS03CNiu

[18] R. Srihari and W. Li 2003. Rapid Domain Porting of an
Intermediate Level Information Extraction Engine. In Proceedings of
International Conference on Natural Language Processing 2003.

ICON2003 paper FINAL

[19] H. Li, R. Srihari, C. Niu and W. Li 2002. Location Normalization
for Information Extraction. In Proceedings of the 19th International
Conference on Computational Linguistics (COLING-2002). Taipei, Taiwan.

COLING 2002 Li, Srihari, Niu and Li 2002 coling2002LocNZ

[20] W. Li, R. Srihari, X. Li, M. Srikanth, X. Zhang and C. Niu 2002.
Extracting Exact Answers to Questions Based on Structural Links. In
Proceedings of Multilingual Summarization and Question Answering
(COLING-2002 Workshop). Taipei, Taiwan.

COLING 2002 Workshop Li et al CymfonyQA_final

[21] R. Srihari, and W. Li. 2000. A Question Answering System
Supported by Information Extraction. In Proceedings of ANLP 2000.
Seattle.

ANLP 2000 Srihari and Li 2000 anlp9l

[22] R. Srihari, C. Niu and W. Li. 2000. A Hybrid Approach for Named
Entity and Sub-Type Tagging. In Proceedings of ANLP 2000. Seattle.

ANLP 2000 Srihari, Niu and Li 2000 anlp105_final9

[23] R. Srihari and W. Li. 1999. Question Answering Supported by
Information Extraction. In Proceedings of TREC-8. Washington

cymfony

Other publications: SBIR Final Reports

W. Li & R. Srihari. 2003. Flexible Information Extraction Learning Algorithm (Phase 2), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.

W. Li & R. Srihari. 2001. Intermediate-Level Event Extraction for Temporal and Spatial Analysis and Visualization (Phase 1), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.

W. Li & R. Srihari. 2000. A Domain Independent Event Extraction Toolkit (Phase 2), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.

W. Li & R. Srihari. 2000. Flexible Information Extraction Learning Algorithm (Phase 1), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.

W. Li & R. Srihari 2003. Automated Verb Sense Identification (Phase I), Final Technical Report, U.S. DoD SBIR (Navy), Contract No. N00178-02-C-3073 (2002-2003)

R. Srihari & W. Li 2003. Fusion of Information from Diverse, Textual Media: A Case Restoration Approach (Phase I), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. Contract No. F30602-02-C-0156 (2002-2003)

R. Srihari, W. Li & C. Niu 2004. A Large Scale Knowledge Repository and Information Discovery Portal Derived from Information Extraction (Phase 1), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. (2003-2004)

R. Srihari & W. Li 2003. An Automated Domain Porting Toolkit for Information Extraction (Phase I), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. Contract No. F30602-02-C-0057 (2002-2003)

T. Cornell, R. Srihari & W. Li 2004. Automatically Time Stamping Events in Unrestricted Text (Phase I), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. (2003-2004)

[Related]