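【Li-Bai Dialogues, No. 5: The Components of NLP and How They Relate】

Bai:
In “交杯酒” (the nuptial cup, drunk with linked arms), “交杯” does not seem to modify “酒”. “散伙饭” (farewell dinner) fares a little better, probably because “饭” standing alone for a dinner party is more frequent than “酒” standing alone for a toast.

Me:
Isn't this just a black box? What use are its internal relations for semantic computation? If they are useful, hard-code them in the lexicon entry; if not, ignore them. The difference between “交杯酒” and “酒” is that the former has a slot 【with+human】: “与张三的交杯酒刚喝过,李四就跟他掰了。” (Li Si had barely drunk the nuptial cup with Zhang San when they fell out.) The latter seems to allow the slot too, but only sporadically, or when it actually refers to the former: “与张三的酒刚喝过。。。”

Bai:
If we care about the safety and acceptability of coining new expressions, hard-coding cannot settle the matter. “见面礼” (first-meeting gift) belongs to the same class. The now-popular “谢师宴” (thank-the-teacher banquet) was certainly not said some years ago. How to "generalize safely" is a new problem for language generation.

Me:
If we are talking about generation, say in a machine translation application, the system has room to choose. It need not produce a short, compact 【compound】. It can use a looser syntactic expression, which is relatively safe and sidesteps the generalization problem of word formation, because syntax is by nature general and open-ended while word formation is not. “谢师宴” can be rendered as “感谢恩师的宴会” (a banquet to thank one's teachers).

Bai:
Human-machine dialogue is different.
It needs surprises.

Me:
Teacher Bai is looking at the future, at icing on the cake. Today we are still far from delivering the bare necessities.
For parsing, such productive compounds really are more than any lexicon can list. Chinese word formation is extremely productive, and a dedicated compounding module is needed to recognize them.

Bai:
Reduplication of a single-character adjective plus “的” should be a systematic phenomenon. Listing it in the lexicon does not look like the right road, however you look at it.

Me:
It should be both. Common AA reduplications, and especially two-character compounds, are listed in ordinary dictionaries. System rules are also needed as a backstop to guarantee recall. Besides, “美” and “美美” are not in a 1+1 relation. The predicates that “美美” can modify adverbially are impossible for bare “美”, whether sleeping or eating. Likewise, “好好” differs greatly from “好”. But “幸幸福福” versus “幸福” is a fully regular, systematic phenomenon. Even where the usage differs, it differs systematically. This is unlike “好好” and “美美”.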
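The "both hands" strategy above (listed entries first, a productive rule as the recall backstop) can be sketched as follows. This is a minimal illustration, not the system discussed in the dialogue: the `LISTED` set, the function names and the returned tags are all assumptions made for the example.

```python
from typing import Optional, Set, Tuple

# Hypothetical lexicon of idiosyncratic AA forms that must be listed,
# because their meaning is not "base + reduplication".
LISTED: Set[str] = {"美美", "好好"}


def aabb_base(word: str) -> Optional[str]:
    """Return the base AB if the word has the shape AABB, else None."""
    if len(word) == 4 and word[0] == word[1] and word[2] == word[3]:
        return word[0] + word[2]
    return None


def analyze(word: str, adjectives: Set[str]) -> Tuple[str, str]:
    """Lexicon lookup first; fall back on the productive AABB rule."""
    if word in LISTED:
        return ("lexicon", word)
    base = aabb_base(word)
    if base is not None and base in adjectives:
        return ("rule", base)
    return ("unknown", word)


print(analyze("美美", {"美", "幸福"}))
print(analyze("幸幸福福", {"美", "幸福"}))
```

The listed forms win over the rule, which mirrors the point that “美美” is not simply “美” doubled, while “幸幸福福” is recovered from “幸福” by rule.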

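Bai:
“美美” means that the experiencer feels good. “好好” means that whoever states the requirement or wish finds it satisfied. “轻轻” means that the actor's body, or the object the actor handles, seems light. All of this has basically nothing to do with the predicate.

Me:
“美美睡上一觉” (have a lovely sleep); “睡一个美美的觉。” (sleep a lovely sleep).
Say it has nothing to do with the predicate, and the predicate will object.
If there is no relation here, then “辛勤” (diligent) and “工作” (work) are unrelated too. “辛勤” is about a person and “工作” is about a person; when “辛勤” modifies “工作” as an adverbial, the two persons are one and the same.
If a syntactic modification relation should leave no trace at the logical-semantic level, then the logical-semantic representation has no path for relative clauses. How, then, do we handle the semantic difference between “我说的话” (the words I said) and “我说话” (I speak)?
Our current treatment: “我说的话” contains the clause “我说话”, and this clause has a modification path (Mod-S) pointing to “话”.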
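The treatment just described can be pictured as a small labeled graph. The sketch below is only an assumed encoding (the `Edge` and `Parse` types and the edge direction of Mod-S are illustrative choices, not the parser's actual data structures); it shows that the two expressions share one argument structure and differ only in the extra Mod-S path and in which word represents the whole.

```python
from typing import FrozenSet, NamedTuple, Set


class Edge(NamedTuple):
    source: str
    label: str   # "S", "O" or "Mod-S" (labels taken from the dialogue)
    target: str


class Parse(NamedTuple):
    head: str                # the word that represents the structure externally
    edges: FrozenSet[Edge]


def arg_structure(parse: Parse) -> Set[Edge]:
    """Keep only the argument-structure dimension (S and O edges)."""
    return {e for e in parse.edges if e.label in {"S", "O"}}


# “我说话”: a plain clause headed by the verb.
clause = Parse("说", frozenset({Edge("说", "S", "我"), Edge("说", "O", "话")}))

# “我说的话”: the same clause, plus a Mod-S path from the clause to “话”.
relative = Parse("话", frozenset({
    Edge("说", "S", "我"),
    Edge("说", "O", "话"),
    Edge("说", "Mod-S", "话"),
}))

print(arg_structure(clause) == arg_structure(relative))  # same S/O relations
print(clause.head, relative.head)                        # different representatives
```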

0928b

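Bai:
Logical semantics is a "structure". By default, the word that carries the outermost structure represents it externally. When another word needs to face outward, a relative clause is used to make the change. So a relative clause does not change the logical-semantic relations inside the structure; it only assigns a different word to "represent" the structure externally. S-mod is a syntactic relation, not a logical-semantic one.

Me:
My understanding of logical semantics is broad: it covers all the semantics a sentence expresses about relations between concepts, and it represents how a person understands the sentence. Given logical semantics plus the concepts on the nodes (the mapping from words to concepts, which in theory goes through WSD), we can say that a person understands the language. If a machine achieves those two things, that is machine natural language understanding. From this angle, the relative clause is not merely a surface syntactic relation; it is also a deep semantic relation (in a dimension other than argument structure).

Bai:
In “吃饭” (eat a meal) and “吃的饭” (the meal eaten), the logical-semantic relation between 吃 and 饭 is unchanged. Only the external representative of the structure differs: it falls on “吃” in one and on “饭” in the other.

Me:
Right. In the arg structure dimension, 吃饭 and 吃的饭 are identical in logical semantics, which is why the clause underneath is the same in our deep parse tree. In a dimension beyond this SVO structure, that is, when this SVO relates to other SVOs, the relation is also required for understanding and is also semantics. Whether this semantics and its formal representation are called logical semantics is a matter of naming. But it is indeed needed for understanding, it is indeed semantics, and one can hardly call it illogical. In “我喜欢吃饭” (I like eating), the arg structure of “吃饭” directly serves as the object of “喜欢”; in “我喜欢我吃的饭” (I like the food I eat), this arg structure has to step down one level and serve as the object of liking by way of “饭”. Logically, an arg structure is only the most basic building block of event semantics.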

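Bai:
Several components of a structure can serve as a filler, including the outermost predicate itself. This does not go beyond logical semantics. What truly goes beyond it is pragmatics: a relative clause, for example, carries a sense of "presenting something as a fait accompli" and thereby "imposing it on the audience".

Me:
The semantics produced by stacking these building blocks can be expressed in many ways, and the economy (or laziness) principle of language means that the fillers of their slots are often omitted. This makes the mapping from language to logic hard and constitutes the challenge of deep parsing. To say that the relative clause is a syntactic form that expresses pragmatics rather than semantics is a defensible position. But the boundary between semantics and pragmatics has a sizable gray zone. Which things to pull from pragmatics over to semantics, and which to leave pending in semantics for pragmatics to resolve, is a practice in which each side has its reasons; in the end it is a matter of system-internal coordination.

Bai:
“惯于充当世界警察的美国” (the United States, which is used to acting as the world's policeman) imposes “美国惯于充当世界警察” on the audience as an established fact.

Me:
My own principle is this: whatever is domain independent should be represented and resolved in semantics. Whatever involves a domain or an application is left pending for pragmatics. This ties pragmatics closely to applications (apps). Relative clauses are independent of domain: whatever the domain or application, what a relative clause expresses is the same. Of course there are semantic phenomena that seem somewhat domain independent but not entirely, and then the decision is arbitrary. Resolving them in the semantic stage increases the burden on the semantic component and saves work for the domains that need them, but the work is wasted for the domains that do not.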

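Bai:
What you call 语用 is not pragmatics but language usage. Pragmatics is necessarily domain independent. But pragmatics is likewise independent of logical-semantic structure.

Me:
On the former I am not sure; perhaps the community understands pragmatics as you say, in which case "my definition of semantics" includes that part. The latter seems wrong: language usage generally refers to purely linguistic surface phenomena such as syntax, morphology and idiomatic usage. Language usage is not a relatively independent, complete component of linguistics.

Bai:
Or application; in any case not what pragmatics means. This misunderstanding is not new. Whenever I heard you say 语用 before, I felt we were not connecting.

Me:
Haha.
When people use different terminological systems and do not know each other's, communication is indeed awkward.
Take a concrete case. When Fillmore proposed Case Grammar (deep cases) in the 1970s (?), I understood it as semantics; in fact it is logical semantics. As he pursued this line further, it moved closer and closer to pragmatics, and the FrameNet that finally resulted is, in my scheme, the product of a transition from "semantics" to "pragmatics" (which is why I have always criticized it as neither fish nor fowl in NLP). It is still basically domain independent, though, and can be classed under semantics in the broad sense. But when MUC founded IE, the work was no longer domain independent and became pragmatics entirely. The Templates for events and relations defined in information extraction (the origin of knowledge graphs) are formally the same kind of thing as Fillmore's FrameNet. But in FrameNet, thousands of Frames are organized into a largely domain-independent hierarchy, while IE abandoned this top-down inheritance altogether: everything is piecemeal Templates, assembled ad hoc by domain and application, directly serving products.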

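Bai:
Anaphora resolution, inference of implicature and the like: these are the problems that "pragmatics" proper has to solve.
In the US, as long as you do not use "pragmatics" for what you call 语用, there is no problem, but with people in China it is different. What you call 语用 is called knowledge representation in China.

Me:
In my "popular science" scheme, anaphora is another component, belonging to Discourse; that is another dimension, the dimension of text. Knowledge representation comes in two broad kinds. One is ontology, either universal, such as Prof. Dong's HowNet, or domain specific, such as a medical ontology. The other kind is dynamic and fluid: the knowledge graph that is so popular right now, whose basis is IE plus fusion supported by discourse-level and cross-document work, including mining steps such as merging and deconflicting.

Bai:
There is anaphora within the sentence too; no discourse needed.

Me:
There is, which is why syntax, whose largest unit is the sentence, interacts with it. The result of that interaction is Chomsky's Binding Theory or Principles. But once anaphora has used syntax for sentence-internal resolution, the natural next step is discourse. In fact, one of Chomsky's binding principles pushes what syntax cannot settle over to discourse. The principle says that a certain NP in the current sentence cannot be the referent of the pronoun (“自己”, “他”). By this principle, syntax merely rules out one possibility and leaves the others for discourse to find.

My related overview is at 【立委科普:NLP 联络图 】 (in English: OVERVIEW OF NATURAL LANGUAGE PROCESSING). There I sorted out, according to my own understanding, the linguistic components relevant to NLP.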

 

【Related】

Not an ad, but a historical record.

Although it has not been updated for a long time, this wiki page still reads as follows as of today, 9/28/2016
from https://en.wikipedia.org/wiki/NetBase_Solutions,_Inc.

NetBase Solutions, Inc.

Type: Private
Industry: Market Research
Founded: 2004
Founders: Jonathan Spier and Michael Osofsky
Headquarters: Mountain View, CA, USA
Area served: Worldwide
Key people: Peter Caswell, CEO; Mark Bowles, CTO; Lisa Joy Rosner, CMO; Dr. Wei Li, Chief Scientist
Products: NetBase Insight Workbench
Website: www.netbase.com

NetBase Solutions, Inc. is a Mountain View, CA based developer of natural language processing technology used to analyze social media and other web content. It was founded by two engineers from Ariba in 2004 as Accelovation, before changing names to NetBase in 2008. It has raised a total of $21 million in funding. It’s sold primarily on a subscription basis to large companies to conduct market research and social media marketing analytics. NetBase has been used to evaluate the top reasons men wear stubble, the products Kraft should develop and the favorite tech company based on digital conversations.

History

NetBase was founded by Jonathan Spier and Michael Osofsky, both of whom were engineers at Ariba, in 2004 as Accelovation, based on the combination of the words “acceleration” and “innovation.”[1][2] It raised $3 million in funding in 2005, followed by another $4 million in 2007.[1][3] The company changed its name to NetBase in February 2008.[4][5]

It developed its analytics tools in March 2010 and began publishing monthly brand passion indexes (BPI) comparing brands in a market segment using the tool shortly afterwards.[6] In 2010 it raised $9 million in additional funding and another $2.5 million in debt financing.[1][3] NetBase Insight Workbench was released in March 2011 and a partnership was formed with SAP AG that December for SAP to resell NetBase’s software.[7] In April 2011, a new CEO Peter Caswell was appointed.[8] Former TIBCO co-inventor, patent author and CTO Mark Bowles is now the CTO at NetBase and held responsible for many technical achievements in scalability.[9]

Software and services

Screenshot of NetBase Insight Workbench dashboard

NetBase sells a tool called NetBase Insight Workbench that gives market researchers and social marketers a set of analytics, charts and research tools on a subscription basis. ConsumerBase is what the company calls the back-end that collects and analyzes the data. NetBase targets market research firms and social media marketing departments, primarily at large enterprises with a price-point of around $100,000.[10][11] NetBase is also white-labeled by Reed Elsevier in a product called illumin8.[12]

Uses

For the average NetBase user, 12 months of activity is twenty billion sound bytes from just over seven billion digital documents. The company claims to index 50,000 sentences a minute from sources like public-facing Facebook, blogs, forums, Twitter and consumer review sites.[13][14]

According to a story in InformationWeek, Kraft uses NetBase to measure customer needs and conduct market research for new product ideas.[15] In 2011 the company released a report based on 18 billion postings over twelve months on the most loved tech companies. Salesforce.com, Cisco Systems and Netflix were among the top three.[16] Also in 2011, NetBase found that the news of Osama Bin Laden eclipsed the royal wedding and the Japan earthquake in online activity.[17]

References

  1. Matt Marshall, VentureBeat. “Accelovation Raises $4M for online software for IT market research.” December 3, 2007.
  2. BusinessWeek profile.
  3. Jon Xavier, BizJournals. “NetBase filters social media for what clients need to know.” June 3, 2011.
  4. Barbara Quint, Information Today. “Elsevier and NetBase Launch illumin8.” February 28, 2008.
  5. The Economist. “Improving Innovation.” February 29, 2008.
  6. Rachael King, BusinessWeek. “Most Loved — And Hated — Tech Companies.”
  7. Barb Darrow, GigaOm. “SAP taps NetBase for deep social media analytics.” December 12, 2011. Retrieved May 8, 2012.
  8. San Jose Mercury News. “People on the Move.” May 15, 2011.
  9. David F. Carr, InformationWeek. “How Much is your Brand Loved (or Hated)?” June 16, 2011.
  10. Eric Schoenfeld, TechCrunch. “NetBase Offers Powerful Semantic Indexing Platform That Reads The Web.” April 22, 2009.
  11. Jon Xavier, BizJournals. “NetBase filters social media for what clients need to know.” June 3, 2011.
  12. Barbara Quint, Newsbreak. “Elsevier and NetBase Launch illumin8.” February 28, 2008.
  13. Neil Glassman, Social Times. “What Every Social Media Marketer Should Know About NetBase.” August 24, 2010.
  14. Ryan Flinn, BusinessWeek. “Wanted: Social Media Sifters.” October 21, 2010.
  15. David F. Carr, InformationWeek. “How Kraft Foods Listens to Social Media.” June 30, 2011.
  16. Ryan Flinn, Bloomberg. “Tech companies measure online sentiment.” May 19, 2011.
  17. Geoffrey Fowler and Alexandra Berzon, Wall Street Journal. “Social Media Buzzes, Comes Into Its Own.” May 2, 2011.

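【One Parse a Day: Possessed? The parser seems to have gone mad】

Me:
Debugging a system is addictive too. Sleepless tonight; as I kept tuning, the parser seemed to go mad. Perhaps it resents being fed everything and is throwing a tantrum??

0927a
On a closer look, there seems to be no big error; it has not gone mad. Unlike in Lu Xun's 【狂人日记】 (Diary of a Madman), my suspicion was groundless.

Any coordinate (Conj) structure in natural language must be split apart at the logical level. When several coordinations co-occur, things get lively: the relations tend toward combinatorial explosion. Blame the Chinese enumeration comma (、). Why use so many of them? Would it kill anyone to write a few more clauses? Purely syntactic parsing ignores all this, and its picture looks clean. But the semantic computation in deep parsing is logical and cannot ignore it.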
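To make the combinatorial growth concrete, here is a minimal sketch of splitting coordinated subjects, verbs and objects into separate logical triples. It assumes a purely distributive reading of every coordination; the function name and the example words are hypothetical and not taken from the parser.

```python
from itertools import product
from typing import List, Tuple


def distribute(subjects: List[str], verbs: List[str],
               objects: List[str]) -> List[Tuple[str, str, str]]:
    """Expand coordinated S, V and O lists into one (S, V, O) triple per combination."""
    return list(product(subjects, verbs, objects))


# Hypothetical sentence: "张三、李四喜欢苹果、梨、桃" (two subjects, three objects).
triples = distribute(["张三", "李四"], ["喜欢"], ["苹果", "梨", "桃"])
for triple in triples:
    print(triple)
print(len(triples))  # 2 * 1 * 3 relations from a single clause
```

The number of triples is the product of the conjunct counts, which is why a sentence with several enumeration commas yields a crowded graph even when every individual relation is correct.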

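Bai:
“或” (or) binds more weakly than “与” (and). When an enumeration comma is not captured by “或”, its default interpretation is “与”.

Me:
Strange things keep coming out these days. I do not know whether the machine is possessed or the person playing with it is; in any case some odd graphs come out, far from the impression left by the syntax trees shown in textbooks. Textbook trees all look like this one, too elegant:

parse_tree_1

Two days ago there was a gourd-shaped graph, yesterday a double umbrella, today a rampage, and who knows what tomorrow will bring.

These are yesterday's two umbrellas. On a look, nothing seems wrong:

0926a

Bai:
The position of “吗” is wrong. In the two-umbrella one, “能……吗” is the pair.

Me:
Right, “吗” should go one level higher. Had there been no higher level, “吗” would have come out right. It is hardly worth climbing a level for a small word, though climbing (patching) is possible. Of course this involves deciding what a yes-no question attaches to, so in the end we may have to do it.

If one says “电子签证是什么吗。”, that is a non-standard use. On the surface it is a question, but is it really an exclamation? It is not the standard use of “吗”, because “吗” by nature marks a yes-no question while “什么” is the question word (wh-word) of a wh-question; the two do not agree.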

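Bai:
That one is “嘛”, not “吗”.

Me:
Is it certain that “吗” cannot be used here?

Bai:
他知道电子签证是什么 (He knows what an e-visa is)

Me:
It feels possible, and not quite the same as “嘛”.

是那个什么吗。 (Is it that what's-it-called?)
真地忘了是那个什么了。 (I really forgot what it was.)

Bai:
The exclamatory sense you mention should use “嘛”. The forgetting sense can use “吗”.
But given how characters get miswritten these days, it is all a mess anyway.

Me:
This is the gourd from the day before yesterday, Teacher Bai's famous sentence. “与之” did not get attached as an arg, which is barely satisfactory, but overall the logical semantics is computed correctly. “你” (S) married “女人” (S), and what this event modifies (Mod-S: relative clause) is “女人”.

0925a

Isn't the machine marvelous, isn't the parser fun, and doesn't this count as a stepping stone to machine understanding of human language: Open sesame! Sesame, sesame, open up.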

 

【Related】

【立委科普:语法结构树之美】

【立委科普:语法结构树之美(之二)】

中文处理

Parsing

【置顶:立委NLP博文一览】

《朝华午拾》总目录

 

Who we are. Not an ad, but a snapshot.

NetBase

WHO WE ARE

EMPOWERING GLOBAL BUSINESSES WITH SOCIAL INSIGHTS

We are uniquely positioned to help global businesses create real business value from the unprecedented level of growth opportunities presented each day by social media. We have the industry’s fastest and most accurate social analytics platform, strong partnerships with companies like Twitter, DataSift, and Tumblr, and award-winning patented language technology.

We empower brands and agencies to make the smartest business decisions grounded on the deepest and most reliable consumer insights from social. We’ve grown 300 percent year-over-year and are excited to see revenue grow by 4,000% since the second quarter of 2012.

RECENT ACCOLADES

We were recently named a top rated social media management platform by software users on TrustRadius and a market leader by G2 Crowd.

“NetBase is one of the strongest global social listening and analytics tools in the market. Their new interface makes customized dashboard creation a breeze.”

– Omri Duek, Coca-Cola

“Data reporting is both broad and detailed, with the ability to drill down from annual data to hourly data. NetBase allows us to have a pulse on the marketplace in just a few minutes.”

– Susie Thomas, VP, Palisades Media Group

“We started with a gen one solution, but then found that we needed to move to a tool with a better accuracy that could support digital strategy and insights research. NetBase satisfied all our needs.”

– Jared Degnan, Director of Digital Strategy

“As one of the first brands to test NetBase Audience 3D for our Mobile App launch, we’ve found that we could engage with our consumers on a deeper, more human level that further drives them to be brand champions.”

– Mihir Minawala, Manager of Social, Industry & Competitive Intelligence, Taco Bell

OUR CUSTOMERS

We work with executives from forward-looking agencies and leading brands across all verticals in over 99 countries. Our customers use NetBase for real-time consumer insights across the organization, from brand and digital marketing, public relations, product management to customer care.

KEY MILESTONES

  • March 2003
    Founded by Michael Osofsky at MIT. Later joined by Wei Li, Chief NetBase Scientist
  • July 2009
    P&G, Coca-Cola and Kraft signed as first customers of NetBase
  • January 2014
    Named Best-in-Class By Consumer Goods Technology
  • April 2014
    Launched Brand Live Pulse, the first real-time view of brands’ social movements
  • May 2014
    Celebrated 10 years with 500% customer growth in 3 years
  • January 2015
    AdAge Names 5 NetBase Customers to the Agency A-List
  • March 2015
    Introduced Audience 3D, the first ever 3D view of audiences
  • April 2015
    Raised $33 MM in Series E Round
  • November 2015
    Named Market Leader by G2 Crowd. Earned Top Ratings by Trust Radius

What inspired you to join NetBase?

It was exciting to build the technology that could quickly surface meaningful customer insights at scale. For example, what used to take a day to run a simple analysis now takes just a second. Our platform now analyzes data in “Google time”, yet the depth and breadth of our analysis is exponentially deeper and larger than what you’ll ever get from a Google search.

What are you most proud of at NetBase?

I’m especially proud that we have the industry’s most accurate, deepest, fastest, and most granular text analysis technology. This enables us to give our customers very actionable insights, unlike other platforms that offer broad sentiment analysis and general trending topics. Plus, NetBase reads 42 languages. Other platforms don’t even come close. We are customer-centric. Our platform truly helps customers quickly identify their priorities and next steps. This is what sets us apart.

What is the next frontier for NetBase?

With the exploding growth of social and mobile data and new social networks emerging, we’ll be working on connecting all these data points to help our customers get even more out of social data. As Chief Scientist, I’m more excited than ever to develop a “recipe” that can work with the world’s languages and further expand our language offerings.

WE’RE GLOBAL: 42 LANGUAGES, 99+ COUNTRIES, 8 OFFICES

NetBase Solutions, Inc  © 2016

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

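【One Parse a Day: It is not impossible for a parser to surpass its maker】

Bai:
“那些林彪说过的话” (those words that Lin Biao said)
See how a plural demonstrative (det) skips over a singular NP to find its own head.

Me:

0924a

0924b
What is hard about that?

0924c

Seeing this last sentence come out, I feel a little uneasy: at this rate, it is not impossible for the machine to surpass the person who built it. Insiders can see what is going on, but today let me explain for newcomers why I say the deep parsing of this sentence reaches the level of a linguist and exceeds an ordinary person's ability in structural analysis. This automatically generated, simple-looking tree covers this much linguistics:

(1) The plural demonstrative “那批” skips the nearby “你”, and even the relative-clause predicate “写-过”, and attaches to the distant head “文章” as its modifier (Mod). Impressive or not?

(2) It identifies the relative clause (Mod-S) “你写过的” and its head “文章”;

(3) It identifies the subject (S) “你” and the logical object (O) “文章” of the relative-clause predicate “写过” (the decoding of the so-called argument structure);

(4) It also reveals the long-distance verb-object relation (O) between the sentence-initial noun phrase containing the relative clause (“……文章”) and the predicate of the following sentence, “保存-着”; this is rather impressive too;

(5) In fact, the subject (S), predicate and object (O) of the main clause are all in place, and the small words are attached where they belong (X).

From the logical-semantic perspective of deep structural analysis, the analysis above is close to perfect.

End of explanation.

Having reached this level of automatic deep parsing for Chinese sentences, a little showing off is perhaps a forgivable “寡人之疾” (a ruler's weakness).

Showing off done.

Time to drop the airs, dust myself off, and get back to humbly moving mountains.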

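Bai:
The Next links in the last sentence are somewhat redundant.
Even if they are removed, all the useful relations are still there.

Me:
Next is a bridge (a stepping stone). It could be thrown away after use, but later I felt it could just as well stay.
A souvenir of youth.
"Youth" is a term of praise and "playing the hooligan" a term of abuse, but they are the same thing: blind restlessness. (Next retains a little word-order information. Word order has no place in logic, but when the semantics is grounded, this trace may sometimes be of some use.)

I have always believed that in structural analysis, machines reaching or surpassing human level is within sight.
Semantic grounding after structural analysis is still some distance from human intelligence. But because grounding is almost always oriented to a domain or application, there is leverage: problems that look enormous sometimes dissolve or simplify naturally within domain pragmatics. Seen this way, NLU (or semantic computation) is a tractable monster.

In the last two months there were two cases of killing a chicken with an ox-cleaver, one in English and one in Chinese. I am not allowed to give details, but I can speak in veiled terms. Both were natural-language problems regarded as roadblocks in some product area. After studying them, my answer was: with the nuclear weapon of deep parsing, what is hard about this?

A trial run showed that it really was killing a chicken with an ox-cleaver: solved at a glance. Many people think the nuclear-weapon talk is Liwei's wild exaggeration. Heaven knows it is not. The party concerned said that once this problem is solved in that product area, many follow-on applications open up. But unless forced to, I would still rather use the ox-cleaver on oxen than get stuck in the henhouse slaughtering chickens without end. There is no glory in such victories. As the old saying goes, do not bow for five pecks of rice. I hope it will not come to that.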


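【Li-Bai Dialogues: How to learn and handle “打了一拳”】

Bai:
“张三打了李四一拳” (Zhang San gave Li Si a punch) “张三打李四的那一拳” (that punch Zhang San gave Li Si)
My questions: 1. Does “一拳” bear the same "logical-semantic relation" to “打” in both examples?
2. If so, is this a slot-and-filler relation?
3. If it is, does “打” bring this slot with it, or is the slot forced out by the appearance of “一拳”?
4. Are slots that are not inherent but can be forced out an isolated phenomenon or a general one? Are they specific to Chinese or a language universal?
2': If the relation is not the same, how is the relation between the relative clause and the head “那一拳” established in the second example?
“张三喊了一嗓子” (Zhang San gave a shout) “张三喊的那一嗓子,我老远就听见了” (That shout Zhang San gave, I heard it from far away): the same reasoning applies.
Also, can fixed phrases that extend an instrument into a move, such as “回马枪” and “窝心脚”, drop the classifier and combine directly with a numeral?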

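Me:

1. In logical semantics they should be the same; syntactically there is the difference between 【subject-predicate】 and 【relative clause+NP】, which is very typical.

2. Specifically, “打一拳” is a collocation, a compound verb, comparable to “洗澡” (take a bath), except that the latter is a verb-object collocation and the former a verb-complement collocation. Both are syntactic manifestations of compounds, and both involve the dynamic interface between lexicon and syntax.
A collocation between literals is of course a case of slot and filler.
Slots and fillers in language come down to: (1) a literal (word) sets up a slot for a class of words (feature); (2) a literal (word) sets up a slot for another literal (word), usually called strong collocation; (3) a class of words (feature) sets up a slot for another class of words (feature). (3) is what regular syntax does; it is air-to-air, with neither side touching the ground. Its rules (feature based grammar) generalize well but easily meet their Waterloo in exceptions. Lexicalized grammar or word driven rules move further and further away from (3), or confine (3) to a very small number of rules. That leaves (1) and (2).
“打…一拳” is (2), which brings us to your third question: in a collocation of two literals, who expects whom?
Technically there is no distinction at all; the two are equivalent. x and y hook up with each other, and whether we say x hooked y or y hooked x does not matter. They are one family, originally one word and one concept, merely pulled apart in linguistic expression.

【3. If it is, does “打” bring this slot with it, or is the slot forced out by the appearance of “一拳”?】
“打一拳” is one lexical entry, conceptually a single unit with no primary or secondary member (the head and dependent of a verb-complement pair are internal to morphology and can be ignored). Operationally, though, there are choices to make. (I do not know whether Chinese collocation dictionaries list an entry like “打一拳” under “打”, under “一拳”, or under both.) In an NLP implementation, “打一拳”, like “洗澡”, is a particular separable-word entry. Only the tags differ, for example Vo versus Vbu; the rest is left to syntax.

【4. Are slots that are not inherent but can be forced out an isolated phenomenon or a general one? Are they specific to Chinese or a language universal?】
For collocations of literals, my view is that there is no question of inherent versus forced: it is mutual attraction, willingly on both sides.
This should be a general phenomenon: x–y, where Chinese has “洗-澡” and English has “take–bath”. Literal-to-literal collocations that are morphologically verb-complement or verb-adverbial surely exist in other languages as well; I just cannot think of examples at the moment.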

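Bai:
打一苕帚疙瘩 (hit with a broom stub) is also a collocation.
Anything handy can be grabbed and used to hit.
The collocation approach is too ad hoc.

Me:
All lexicons are ad hoc; otherwise we would not call it hard-coding. But the x–y collocation behind the entries is a language universal.

Bai:
The problem is that it cannot be exhausted and is productive by nature, a regular phenomenon: 打两鞭子 (give two lashes), 砍三刀 (give three chops), 踹五脚 (give five kicks).

Me:
If it cannot be exhausted, it is not an x–y strong collocation. In theory, if it is not x–y, it can only be x–feature or feature1–feature2; there is no other box to put it in.
Is “砍三刀” comparable to “洗三个澡”? If so, it is x–y; only the numeral varies, and the two ends are fixed: “踹-脚”, “砍–刀”.

Bai:
Those with a classifier do not count; only those that omit the classifier do. It is clearly an instrument, but it is hard to say the original verb inherently has an "instrument" slot.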

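Me:
These are phenomena in the intermediate zone.
Ultimately it is a question of approach. On the lexicalist approach, everything in the intermediate zone goes into the lexicon, without worrying about being ad hoc or redundant; the benefit is precision. On the "traditional" grammar approach, the intermediate zone is assigned to syntax and is fully productive; the benefit is good recall, but exceptions easily disrupt it and precision suffers. The two can also be combined: first write a recall backstop, then plug the gray-zone errors with the lexicon as they are found. A conceivable backstop rule looks like this, exploiting the syntactic fact that Chinese nouns normally cannot be modified directly by a numeral:

V + CD + N –> V Buyu

This rule catches a lot, but it is dangerous. Patching can reduce the danger but not eliminate it, because this is the nature of a feature based rule (POS is a feature).

Continuing the exercise, we can write a backstop rule that covers the generality Teacher Bai is pointing at:

V +(aspect particle)+ CD + N ==> V <– Buyu[CD+N]

This rule can parse all the phenomena listed above, but it is still too "powerful": recall to spare, precision lacking. In engineering, precision is a matter of ever-expanding testing. If the tests look fine, we treat precision as not a problem. If the tests reveal problems, there are three options: (1) refine this rule itself, narrowing the POS conditions to subclasses, ontology classes or other constraints; (2) write another, finer-grained rule to override it, so that the grammar becomes a hierarchy of modules; (3) throw the errors (exceptions) into the lexicon, which is the limiting case of option (2), with the lexicon as the extreme end of the rule hierarchy. With a rule hierarchy that runs from lexical rules through fine-grained feature rules to abstract POS-level rules, we can handle everything from exceptions and idiosyncrasies to generalities, and the gray zones in between.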
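As a rough sketch of the backstop rule with a lexical override (option 3 above), consider the following. Everything here is an assumption for illustration: the token format, the POS tags `V`, `CD`, `N`, `AS`, the `ASPECT` set and the single entry in `EXCEPTIONS` are hypothetical, and a real rule hierarchy would be far richer.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]  # (word, POS tag)

ASPECT: Set[str] = {"了", "过", "着"}           # aspect particles (assumed list)
EXCEPTIONS: Set[Tuple[str, str]] = {("来", "人")}  # hypothetical lexical override


def find_buyu(tokens: List[Token]) -> List[Tuple[str, str]]:
    """Apply V + (aspect particle) + CD + N ==> V <- Buyu[CD+N].

    Returns (verb, complement) pairs; lexicon exceptions block the rule.
    """
    results = []
    n = len(tokens)
    for i, (word, pos) in enumerate(tokens):
        if pos != "V":
            continue
        j = i + 1
        if j < n and tokens[j][0] in ASPECT:   # optional aspect particle
            j += 1
        if j + 1 < n and tokens[j][1] == "CD" and tokens[j + 1][1] == "N":
            noun = tokens[j + 1][0]
            if (word, noun) not in EXCEPTIONS:  # the lexicon overrides the rule
                results.append((word, tokens[j][0] + noun))
    return results


print(find_buyu([("砍", "V"), ("了", "AS"), ("三", "CD"), ("刀", "N")]))
print(find_buyu([("来", "V"), ("了", "AS"), ("三", "CD"), ("人", "N")]))
```

The first call fires the general rule; the second is blocked by the exception list, so the general rule supplies recall while the exception entries recover precision one case at a time.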

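Too lazy for big data, and too lazy even to hard-code the collocations in the lexicon, I will just put the default rule above into the system for now and wait for the exceptions to show up over time.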

0925b

0925c

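Bai:
Why insist on rules at the fine-grained level?
The weakness of rules at this level is exactly where learning is strong.

Me:
We can do without fine granularity too: grasp the two ends and let them carry the middle. At worst there is some redundancy, with everything gray treated as black. "Inexhaustible" is just a figure of speech. Statistically, the items in the gray zone can certainly be enumerated; the enumeration simply runs into a statistical long tail, at which point one stops.

Bai:
My point is that there is no dichotomy here between hard-coding in the lexicon and writing rules. Learning can be used.

Me:
Could Teacher Bai illustrate the learning-based approach and where its advantage lies? (In fact I do not see this problem as a challenge to rule systems. I do not think it is any harder than “洗澡”.)

Bai:
It cannot be enumerated and the rules are messy, so it is just right to learn from the partial examples available. Features are valuable and long-tail instances are valuable; learning them together is the right way, with both generalization and rote memorization.

Bai:
Using rote memorization on regular phenomena is forcing a good child to play the hooligan.

Me:
Viewed kindly, one could also call it teaching the child to be down to earth, one step at a time.

Bai:
In the gray zone between generalization and rote memorization, use learning where learning fits.
It is unpleasant to look at otherwise, and it is not as if there were no alternative.
Only exam-oriented cramming kills everything that is alive.

Me:
The root of the matter is that, so far, a system is either statistical or rule-based. So-called hybrid systems are mostly two systems stacked together, not fused. In that context, it is not a matter of leaving rules to rules and lexicon to lexicon with some statistical learning mixed in between, although that should be a research direction and could probably be done better than stacked hybrids. If Teacher Bai means a pure learning system, that is a different discourse altogether, no comment. From the rule side, grasping both ends and treating gray as black is no problem; it just takes time. General rules guarantee recall, and precision is a function of time.

Bai:
What I mean is this: use rules to decide who may combine with whom; where several candidates equally satisfy the rules, use learning to decide which combinations to exclude. This is unsupervised learning, though, with the annotation coming from the lexicon. The rule part involves only fillers, slots and "caps", not subcat. The learning part uses subcat.

Me:
Actually, just run the simple pattern V+CD+N over massive data, and the unsupervised harvest will be largely complete. This is a very narrow linguistic phenomenon. The result of the unsupervised learning is knowledge acquisition for this specific subcat, an offline learning process. The learned results are then used to support parsing.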
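The offline acquisition step described here amounts to counting which nouns occur in the pattern with a given verb. The following is a minimal sketch under strong assumptions: the corpus is already POS-tagged, the three-sentence `corpus` is invented for the example, and the aspect particle and any intervening object are ignored for brevity.

```python
from collections import Counter
from typing import List, Tuple

Token = Tuple[str, str]  # (word, POS tag)


def harvest(corpus: List[List[Token]], verb: str) -> Counter:
    """Count nouns N observed in the pattern verb + CD + N."""
    counts: Counter = Counter()
    for sent in corpus:
        for i in range(len(sent) - 2):
            (w0, p0), (_, p1), (w2, p2) = sent[i:i + 3]
            if w0 == verb and p0 == "V" and p1 == "CD" and p2 == "N":
                counts[w2] += 1
    return counts


# Hypothetical toy corpus standing in for massive tagged data.
corpus = [
    [("张三", "N"), ("打", "V"), ("一", "CD"), ("拳", "N")],
    [("他", "N"), ("打", "V"), ("两", "CD"), ("拳", "N")],
    [("他", "N"), ("打", "V"), ("两", "CD"), ("鞭子", "N")],
]

print(harvest(corpus, "打").most_common())
```

The resulting counts are the acquired subcat knowledge for that verb; a parser could consult them at run time, keeping the learning itself offline as suggested above.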

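Bai:
Actually this thread has drifted. What I meant to explore is the non-default slots that get forced out.
If that can be done, we may get closer to the essence of language.

“他上学的那个学校” (the school where he studies); “他约会的那个晚上” (the evening of his date).

Even without a numeral, a noun that serves as an adverbial or complement in one sentence pattern can take a main position in another, related pattern, with the logical-semantic relation unchanged. The noun's real identity is a role such as instrument, location or time. Such a role is not a default for the verb. Once the noun arrives in a certain position, it forces the verb to make the role a default.
English preposition stranding, as in "the man you look for", gives such nouns a clear identity: even in a relative clause they are of collateral birth (raised by the preposition, not by the verb). Of course one can also say they are raised by the verb-preposition combination "look for".
In Chinese, once inside a relative clause, one cannot tell who raised the noun. The preposition simply disappears, and keeping it would be wrong. To keep it, the zero form must be replaced by a real pronoun: “你在其中上学的学校” (the school in which you study), “你与之结婚的女人” (the woman with whom you married)

Adding a numeral only highlights the verbal-quantity meaning; it does not change the logical-semantic relation.

砍张三的斧子……focus on the instrument
砍张三的两斧子……focus on the number of actions
砍张三的斧子……用来(以/之/其)砍张三的斧子 (the axe used to chop Zhang San)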

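Me:
A complement that expresses a count is a "bleached" (and at the same time "figurative") use of the logical-semantic instrument in language. This bleached use is not itself a language universal, but it can be mapped to the deep logical-semantic 【instrument】, and 【instrument】 is universal. For “砍”, 【instrument】 is not a default that has been forced out but a default the verb brings with it; if in doubt, check Prof. Dong's HowNet. The default for 结婚 (marry) is with [human]. For 上学 (attend school), is 学校 (school) inherent? One can probably say so too; I do not know whether HowNet gives 上学 a location slot whose default is school.

One can try a completely random attributive or adverbial, and it does not seem to work. It seems very hard to find a case that has the same logical semantics and can take part in both of the following patterns: the complement pattern (expressing a count) and the attributive pattern. In other words, this phenomenon is either collocation or an extension of collocation, not a combination with a random modifier (adjunct), nor an adjunct forced into a complement. The logical semantics inside is some argument of the conceptual relation, and the combination has its own necessity. Such collocation can be word-to-word (both feet on the ground) or word-to-subclass (feature: one foot on the ground). The former is strong collocation, hard-coded in the lexicon. The latter is gray; it cannot always be hard-coded, but statistics can learn it.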

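Bai:
That is exactly my point.

Me:
Teacher Bai does far more than move a thousand pounds with four ounces, lol

The acquisition of word-to-subclass subcat, for example a verb requiring a certain kind of object (say 【human】), can be learned from big data. This idea has been around for a while. A Cambridge professor advocated this kind of learning years ago and seems to have run a batch of experiments and, as I recall, published some papers. But overall such research has been sporadic: research stays research, applications stay applications, and the two do not seem to have combined into any impressive results.

Bai:
If collocation learning is not anchored on structure, it has no chance.
If you learn structure and collocation at the same time, it will certainly be a mess.
One must pick a small number of possible structures and let collocation discriminate further among them, each doing its own job.

Bai:
The instrument of “砍” can be a default; that of “打” cannot. The subcats that fit “打” are very ragged. What we have in mind is "objects one can readily grab", but no subcat list will hand you that neatly. So the many subcats and the many word instances all have to be treated as features, and a way found to learn from the examples that can be listed (including word-subcat sub-rules already confirmed).
A stove is too big to grab. A house is bigger still. A broom is about the right size. Bacteria are too small. So “张三打李四一大肠杆菌” (Zhang San hit Li Si one E. coli) does not work.

Me:
With the pattern 打+CD+N, learning hits the mark every time. As long as the data is massive, there is no need to fear noise, because this pattern works so well.
This reminds me that more than ten years ago someone at Google published a paper that used two specially chosen ngram patterns to learn an ISA taxonomy, which was impressive. We later replicated that work; although we never really used the results, the approach is right. Following a similar route, HowNet could one day be learned as well, as long as Prof. Dong defines the nature of the semantic relations to be learned and suitable patterns are found.
The two patterns Google used are: N such as X, Y, Z ; X, Y, Z and other N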

e.g.
furniture such as desks, chairs, coffee-tables
desks, chairs, coffee-tables and other furniture (will all be on sale)
taxonomy is: {X, Y, Z} –>N
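The two lexical patterns quoted above can be tried out with a pair of regular expressions. This is only a sketch over raw text: the regexes, the function name `extract_isa` and the restriction to single-token items are assumptions for the example, not the method of the paper being recalled.

```python
import re
from typing import Set, Tuple

# N such as X, Y, Z
SUCH_AS = re.compile(r"(\w[\w-]*) such as ((?:[\w-]+, )*[\w-]+)")
# X, Y, Z and other N
AND_OTHER = re.compile(r"((?:[\w-]+, )*[\w-]+) and other ([\w-]+)")


def extract_isa(text: str) -> Set[Tuple[str, str]]:
    """Return (hyponym, hypernym) pairs matched by either pattern."""
    pairs = set()
    for m in SUCH_AS.finditer(text):
        for hypo in m.group(2).split(", "):
            pairs.add((hypo, m.group(1)))
    for m in AND_OTHER.finditer(text):
        for hypo in m.group(1).split(", "):
            pairs.add((hypo, m.group(2)))
    return pairs


s1 = "furniture such as desks, chairs, coffee-tables"
s2 = "desks, chairs, coffee-tables and other furniture (will all be on sale)"
print(sorted(extract_isa(s1)))
print(extract_isa(s1) == extract_isa(s2))
```

Both example sentences yield the same three pairs, which is the {X, Y, Z} –> N taxonomy link. On real data one would need noun-phrase chunking and frequency thresholds, which this sketch leaves out.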

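What is the use of learning it, when people can think it all up anyway, given time? HowNet is rich in semantic relations, so it took many years to write, but in the end it was written and is nearly complete (Prof. Dong now seems to make only sporadic additions). If experts can write it by hand, both complete and refined, why expect big data to acquire this knowledge? That is one side of the question. Especially for conceptual relations that are relatively stable and long-lasting, there is indeed no reason not to use the experts' product.

The other side of the question is that for conceptual relations that are somewhat fluid, experts can hardly keep up with machine acquisition; the same goes for knowledge in different domains, and so on. This is territory beyond the reach of manual labor, where we can only count on big data and machines. The Google paper mentioned above gave some particularly interesting examples. As I recall, it learned the hyponyms of dictator, and the members had a strong big-data flavor: Castro, Mao Zedong, Stalin, Hitler, etc.

Bai:
That is a subjective classification and does not belong in a lexicon. Then there are instances of "famous brands", which have immediate commercial value.

Me:
Isn't this what I do every day: social media mining of public opinions and sentiments
Our company periodically publishes nicely printed rankings of the social reputation of well-known global brands. Earlier issues covered luxury brands (designer bags, luxury cars, high-end perfume). The latest issue is: Social Media Industry Report 2016: Restaurant Brand

I just tested Teacher Bai's example sentences, and the oddest result is this one:

0925a

I had never seen a tree shaped like a gourd before. (The lexicon does not have the small word “与之”, and PP did not compose it either, so it was skipped.) Even so, the whole graph is quite logical, by whatever luck: “你” is one party to the marriage (S), “女人” is also a party to the marriage (S), and the event of these two parties marrying is a relative clause (Mod-S) that modifies “女人”. As for the small words “的” and “之”, and the groping hand of Next, they are all stepping stones that help build the structure. These surface items have nothing to do with logical semantics; they are left there not to be an eyesore but in case the surface traces are of some help when the semantics is grounded in pragmatics. After all, the purpose of semantic computation is not to draw pretty logical graphs for our own amusement but to ground the semantics and build products.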

 


Chart Parsing Chinese Character Strings

W. Li. 1997. Chart Parsing Chinese Character Strings. In
Proceedings of the Ninth North American Conference on Chinese
Linguistics (NACCL-9). Victoria, Canada.

Chart Parsing Chinese Character Strings [1]

 

Wei  LI

Simon Fraser University
Burnaby B.C. V5A 1S6 CANADA (lio@sfu.ca) 

 

ABSTRACT

This paper examines problems in word identification for a Chinese natural language processing system and presents our solution to these problems. In conventional systems, written Chinese parsing takes two steps: (1) a segmentation preprocessor for word identification (segmenter); (2) a grammar parsing the string of identified words. Morphological analysis, when required, as in the case of productive word formation, has to be incorporated in the segmenter. This matches the conventional morphology-before-syntax architecture. We will demonstrate the theoretical defect of this architecture when applied to Chinese. This leads to the conclusion that segmentational approach, despite its being the mainstream in Chinese computational morphology, is in general not adequate for the task of Chinese word identification. To solve this problem, a full grammar should be made available. Therefore, we take an alternative one-step approach. We have implemented an integrated grammar of morphology and syntax for directly parsing a string of Chinese characters, building both morphological and syntactic structures. Compared with the conventional two-step approach, our strategy has advantages in resolving ambiguity in word identification and in handling productive word formation.

  1. Introduction

A written Chinese sentence is a string of characters with no blanks to mark word boundaries. In conventional systems, Chinese parsing takes two steps as shown in the following Figure 1: (1) a segmentation preprocessor (called segmenter) for word identification; (2) a word based parsing grammar, building syntactic structures (Feng 1996; Chen & Liu 1992).

[Figure 1: the conventional two-step architecture]

 

In contrast, we take an alternative one-step approach, as shown in Figure 2 below. We have implemented a grammar named W‑CPSG (for Wei’s Chinese Phrase Structure Grammar). W‑CPSG integrates morphology and syntax for character based parsing, building both morphological and syntactic structures.

[Figure 2: the W‑CPSG one-step architecture]

In the two-step architecture, the purpose for the segmenter is to properly identify a string of words to feed syntax. This is not an easy task due to the possible involvement of the segmentation ambiguity. For example, given a string of 4 Chinese characters 研究生命, the segmentation ambiguity is shown in (1.a) and (1.b) below.

(1.)  研究生命

(a)   研究生             | 命
      graduate student   | life or destiny

(b)   研究    | 生命
      study   | life

The resolution of the above ambiguity in the segmenter is a hopeless job because such ambiguity is syntactically conditioned. For sentences like 研究生命金贵 (life for graduate students is precious), (1.a) is the right identification. For the phrase 研究生命起源 (to study the origin of life), (1.b) is right. So far there are no segmenters which can handle this properly and guarantee right word segmentation (Feng 1996). In fact, there can never be such segmenters as long as a grammar is not brought in. This is a theoretical defect of all Chinese analysis systems in the conventional architecture. We have solved this problem in our morphology-syntax integrated W‑CPSG. Word identification in our design becomes a by-product of parsing instead of a pre-condition for parsing.

In the text below, Section 2 investigates problems with the conventional two-step approach. In Section 3, we will present the W‑CPSG one-step approach and demonstrate how W‑CPSG parsing solves these problems. The following is a list of abbreviations used in this paper.

A (Adjective); AF (Affix); BM (Bound Morpheme);
CLA (Classifier); CLAP (Classifier Phrase);
DE (Chinese particle introducing a modifier of noun); DEP (DE Phrase);
DE3 (Chinese particle introducing a modifier of result or capability);
DET (Determiner); LE (Chinese perfective aspect marker);
N (Noun); NP (Noun Phrase); P (Preposition); PP (Prepositional Phrase);
S (Sentence); V (Verb); VP (Verb Phrase); Vt (Transitive Verb)

  2. Problems Challenging Segmenters

In general, there are two basic problems for segmenters, namely, segmentation ambiguity and productive word formation.

2.1. segmentation ambiguity

This sub-section studies the segmentation ambiguity for Chinese word identification. We indicate that this ambiguity is structural in nature. Therefore it should be captured by structural trees via parsing. We conclude that a parsing grammar is indispensable in the resolution of the segmentation ambiguity.

Behind all segmenters are procedure based segmentation algorithms. Most proposals are modified versions of large-lexicon based matching algorithms. As an underlying hypothesis, a longer match overrides a shorter match, hence the name maximum match. Depending on the direction of the procedure, i.e. whether the segmentation proceeds from left (the beginning of a string) to right (the end of the string) or from right to left, we have two general types of maximum match: (1) FMM (Forward Maximum Match) algorithm; (2) BMM (Backward Maximum Match) algorithm (Feng 1996).

According to Liang 1987, segmenters have trouble with cases involving the segmentation ambiguity. There are two types of segmentation ambiguity: the cross ambiguity (AB|C vs. A|BC) and the embedded ambiguity (AB vs. A|B).

To detect possible ambiguity, many researchers use the technique of combining the FMM algorithm and the BMM algorithm. When the output of FMM and BMM are different, there must be some ambiguity involved. The following table lists the cases associated with the FMM and BMM combined approach.[2]

[Table: cases associated with the FMM and BMM combined approach]
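The two maximum match procedures and the disagreement check described above can be sketched as follows. This is a minimal illustration rather than any particular published segmenter: the toy `LEXICON`, the `max_len` limit and the fallback to single characters are assumptions made for the example.

```python
from typing import List, Set

# Toy lexicon (an assumption for illustration only).
LEXICON: Set[str] = {"研究", "研究生", "生命", "命", "起源", "金贵"}


def fmm(text: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Forward Maximum Match: scan left to right, longest match first."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + size]
            if size == 1 or cand in lexicon:  # single characters always pass
                words.append(cand)
                i += size
                break
    return words


def bmm(text: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Backward Maximum Match: scan right to left, longest match first."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            cand = text[j - size:j]
            if size == 1 or cand in lexicon:
                words.insert(0, cand)
                j -= size
                break
    return words


for s in ["研究生命金贵", "研究生命起源"]:
    f, b = fmm(s, LEXICON), bmm(s, LEXICON)
    print(s, f, b, "ambiguity detected" if f != b else "no ambiguity detected")
```

For both strings the two directions disagree, which flags the cross ambiguity, but nothing in the procedure says which output is right; that is the limitation the examples below illustrate.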

The following 3 examples all contain a cross ambiguity sub-string 研究生命 with 2 segmentation possibilities: 研究生|命 and 研究|生命. Example (4.) is a genuinely ambiguous case. Genuinely ambiguous sentences cannot be disambiguated within the sentence boundary, rendering multiple readings.

(2.) case 1:      研究生命金贵。

(a)   研究生             | 命     | 金贵                  (FMM: correct)
      graduate student   | life   | precious
      Life for graduate students is precious.

(b) * 研究    | 生命   | 金贵                             (BMM: incorrect)
      study   | life   | precious

(3.) case 2:      研究生命起源。

(a) * 研究生             | 命     | 起源                  (FMM: incorrect)
      graduate-student   | life   | origin

(b)   研究    | 生命   | 起源                             (BMM: correct)
      study   | life   | origin
      to study the origin of life

(4.) case 3:      研究生命不好。

(a)   研究生             | 命        | 不    | 好         (FMM: correct)
      graduate student   | destiny   | not   | good
      The destiny of graduate students is not good.

(b)   研究    | 生命   | 不    | 好                       (BMM: correct)
      study   | life   | not   | good
      It is not good to study life.

The following example is a complicated case of cross ambiguity, involving more than 2 ways of segmentation. Both the FMM segmentation 出现|在世|界 and the BMM segmentation 出|现在|世界 are wrong. A third segmentation 出现|在|世界 is right.

(5.) case 4:      出现在世界东方。

(a) * 出现     | 在世       | 界   | 东方                  (FMM: incorrect)
      appear   | be-alive   | BM   | east

(b) * 出    | 现在   | 世界    | 东方                      (BMM: incorrect)
      out   | now    | world   | east

(c)   出现     | 在   | 世界    | 东方                     (correct)
      appear   | at   | world   | east
      to appear in the east of the world

In the following examples (6.) through (8.), 烤白薯 involves embedded ambiguity. As separate words, the verb 烤 (bake) and the NP 白薯 (sweet potato) form a VP. As a whole, it is a compound noun 烤白薯 (baked sweet potato). In cases of the embedded ambiguity, FMM and BMM always make the same segmentation, namely AB instead of A|B. It may be the only right choice, as seen in (6.). It may be wrong as shown in (7.). It may only be half right, as in the case of genuine ambiguity shown in (8.).

(6.) case 5:      他吃烤白薯。

(a)   他   | 吃    | 烤白薯                               (FMM&BMM: correct)
      he   | eat   | baked sweet potato
      He eats baked sweet potatoes.

(b) * 他   | 吃    | 烤     | 白薯                        (incorrect)
      he   | eat   | bake   | sweet potato

(7.) case 6:      他会烤白薯。

(a) * 他   | 会    | 烤白薯                               (FMM&BMM: incorrect)
      he   | can   | baked sweet potato

(b)   他   | 会    | 烤     | 白薯                        (correct)
      he   | can   | bake   | sweet potato
      He can bake sweet potatoes.

(8.) case 7:      他喜欢烤白薯。

(a)   他   | 喜欢   | 烤白薯                              (FMM&BMM: correct)
      he   | like   | baked sweet potato
      He likes baked sweet potatoes.

(b)   他   | 喜欢   | 烤     | 白薯                       (correct)
      he   | like   | bake   | sweet potato
      He likes baking sweet potatoes.

Compare the above examples, we see that there are severe limitations for the FMM-BMM combined approach. First, it only serves the purpose of ambiguity detection (when the results of FMM and BMM do not match), and contributes nothing to its resolution. It has no way to tell which segmentation is right (compare case 1 and case 2), and, worse still, whether both are right (case 3) or wrong (case 4). Second, even when the results of FMM and BMM do match, it by no means guarantees right segmentation (case 6). Third, as far as detection is concerned, it is only limited to the problems for the cross ambiguity. The existence of the embedded ambiguity defines a blind area for this way of detection (case 6 and case 7). This is because the underlying maximum match hypothesis assumed in the FMM and BMM segmentation algorithms is directly contradictory to the phenomena of the embedded ambiguity.

In the face of ambiguity, how do people judge which segmentation is right in the first place? It really depends on whether we can understand the sentence or phrase based on the segmentation. In computational linguistics, this is equivalent to whether the segmented string can be parsed by a grammar. The segmentation ambiguity is one type of structural ambiguity, not in essence different from typical structural ambiguity like, say, PP attachment ambiguity. In fact, the PP attachment problem is a counterpart of the cross ambiguity in English syntax, as shown below.

(9.)       Cross ambiguity in PP attachment: V NP PP

(a) [V NP] [PP]
(b) [V] [NP PP]

Therefore, like English PP attachment, Chinese word segmentation ambiguity should also be captured by a parsing grammar. A parser resolves the ambiguity if it can, or detects the ambiguity in the form of multiple parses when it cannot. As shall be demonstrated in Section 3, wrong segmentation will not lead to a parse. Right segmentation results in at least one successful parse. In any case, at least a parser (hence a grammar on which the parser is based) is required for proper word identification.

The important thing is that the ambiguity in word identification is a grammatical problem. The attempt to solve this problem without a grammar is bound to be crippled. Since traditional segmentation algorithms are non-grammatical in nature, they are theoretically not equipped for handling such ambiguity. A successive model of segmenter-before-grammar attempts to do what it is not yet able to do. This is the theoretical defect of almost all existing segmentation approaches.

(10.)     Conclusion for 2.1.

The segmentation ambiguity in word identification is one type of structural ambiguity. In order to solve this problem, a parsing grammar is indispensable.

2.2. productive word formation

Unless morphological analysis is incorporated, lexicon match based segmenters will have trouble with new words produced by Chinese productive word formation, including reduplication, derivation and the formation of proper names. When the morphology component is incorporated in the segmenter, the two-step design becomes a variant of the conventional morphology-before-syntax architecture. But this architecture is not effective when the segmentation ambiguity is at issue.

In the following, we investigate reduplication, derivation and proper names one by one. In each case, we find that there is always a possible involvement of the segmentation ambiguity. This problem cannot be solved by a morphology component independent of syntax. We therefore propose a  grammar incorporating both morphology and syntax.

2.2.1. reduplication

Reduplication in Chinese serves various grammatical and/or lexical functions. Not all reduplications pose challenges to segmentation algorithms. Assuming that a word consists of 2 characters AB, reduplication of the type AB –> ABAB is no problem. What becomes a problem for word segmentation is the reduplication of the type AB –> AABB or its variants like AB –> AAB. For example, a two-morpheme verb with verb-object relation at the level of morphology has the following way of reduplication.

(11.) Verb Reduplication: AB –> AAB  (for diminutive use)

分心 (get distracted) –> 分分心 (get distracted a bit)

让他分分心。

让       | 他     | 分分心
let       | he    | get distracted a bit
Let him relax a while.

It seems that reduplication is a simple process which can be handled by incorporating some procedure-based function calls in the segmentation algorithm. If a 3-character string, say 分分心, cannot be found in the lexicon, the reduplication procedure will check whether the first 2 characters are the same, and if so, delete one of them and consult the lexicon again. But such an extension of the segmentation algorithm is powerless when segmentation ambiguity is involved. For example, it is wrong to regard 分分心 as a case of reduplication in the following sentence.

(12.)   这件事十分分心。

(a) *     这       | 件   | 事      | 十     | 分分心
this      | CLA  | thing  | ten    | get distracted a bit

(b)        这       | 件    | 事      | 十分    | 分心
this      | CLA  | thing  | very   | distracting
This thing is very distracting.
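
The procedure-based check described above can be sketched as follows. The toy lexicon and the function name `lookup` are assumptions for illustration. The sketch shows that the check fires on the substring 分分心 in both (11.) and (12.), since it has no access to context.

```python
# Toy lexicon (an assumption for illustration only).
LEXICON = {"这", "件", "事", "十", "十分", "分", "心", "分心", "让", "他"}

def lookup(candidate):
    """Return the base word for a candidate string, or None if there is no match.

    Implements the fallback described in the text: a 3-character string AAB
    that is not in the lexicon is accepted if AB is.
    """
    if candidate in LEXICON:
        return candidate
    if len(candidate) == 3 and candidate[0] == candidate[1] and candidate[1:] in LEXICON:
        return candidate[1:]  # drop one copy of the repeated character
    return None

for sentence in ("让他分分心", "这件事十分分心"):
    start = sentence.find("分分心")
    # The check succeeds in both sentences; it cannot see the preceding 十.
    print(sentence, start, lookup(sentence[start:start + 3]))
```

Whether the misfire surfaces in the final segmentation depends on the scanning direction of the segmenter, but the procedure by itself has no means of ruling it out.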

2.2.2. derivation

In Contemporary Mandarin, there have come to be a few morphemes functioning similarly to English affixes, e.g. 可 (-able) turns a transitive verb into an adjective.

(13.)     可 (-able) + Vt –> A

可 (-able) + 读 (Vt: read) –>   可读 (A:readable)

这本书非常可读。

这       | 本     | 书       | 非常   | 可读
this    | CLA  | book  | very  | readable
This book is very readable.

The suffix 性 works just like ‘-ness’,  changing an adjective into an abstract noun.  The derived noun 可读性 (readability) in the following example, similar to its English counterpart, involves a process of double affixation.

(14.)     A + 性 (-ness)  –> N
可 (-able) + 读 (Vt: read) –>   可读 (A:readable)
可读 (A:readable) + 性 (-ness) –> 可读性 (N:readability)

这本书的可读性

这       | 本      | 书       | 的   | 可读性
this    | CLA  | book  | DE    | readability
this book’s readability

The suffix 头 can change a transitive verb into an abstract noun adding to it the meaning “worth-of”.

(15.) Vt + 头 (AF:worth of) –> N

吃 (Vt:eat) + 头 (AF:worth of) –> 吃头 (N:worth of eating)

这道菜没有吃头

这       | 道     | 菜      | 没有             | 吃头
this    | CLA  | dish  | not-have    | worth-of-eating
This dish is not worth eating.

It is not difficult to incorporate these derivation rules in the segmenter for morphological analysis. But, as in the case of reduplication, there is always a danger of wrongly applying the rules because of the ambiguity that may be involved. For example, 吃头 is a sub-string of embedded ambiguity. It can be either a derived noun ‘worth of eating’ or two separate words, as seen in the following example.

(16.)  他饿得能吃头牛。

(a) *     他      | 饿             | 得    | 能   | 吃头              | 牛
             he     | hungry    | DE3  | can  | worth-of-eating   | ox

(b)        他      | 饿              | 得   | 能   | 吃     | 头   | 牛
              he     | hungry    | DE3  | can  | eat    | CLA  | ox
He is so hungry that he can eat an ox.

2.2.3. proper name

Proper names are of 2 major types: (1) Chinese names; (2) transliterated foreign names. In this paper, we only target the identification of Chinese names and leave the problem of transliterated foreign names for further research (Li, 1997b).

A Chinese human name usually consists of a family name followed by a given name. Chinese family names form a clear-cut closed set. A given name is usually either one character or two characters. For example, the late Chinese chairman 毛泽东 (Mao Zedong) used to have another name 李得胜 (Li Desheng). In the lexicon, 李 is a registered family name. Both 得胜 and 胜 mean ‘win’. This may lead to 3 ways of word segmentation: (1) 李得胜; (2) 李|得胜; (3) 李得|胜, as seen in the following examples.

(17.)    李得胜了

(a)  李    | 得胜 | 了
       Li    | win  | LE
Li won.

(b)   李得   | 胜   | 了
        Li De | win  | LE
Li De won.

(c) *  李得胜          | 了
          Li Desheng | LE

(18.)   李得胜胜了。

(a) *  李 | 得胜 | 胜  | 了
         Li  | win | win | LE

(b) *  李得   | 胜   | 胜   | 了
          Li De | win  | win  | LE

(c)   李得胜            | 胜   | 了
Li Desheng   | win  | LE
Li Desheng won.

Since the given name like 得胜 is an arbitrary string of 1 or 2 characters, the morphological analysis of the full name should start with the family name, which can optionally combine with any 1 or 2 characters to form the candidate proper names 李, 李得 and 李得胜. In other words, the family name serves as the left boundary of a full name and the length is used to determine candidates. The right segmentation can only be made via sentence analysis as shown in the above examples.
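
Candidate generation of this kind can be sketched in a few lines. The set `FAMILY_NAMES` and the function `name_candidates` are hypothetical names introduced for illustration; the sketch only proposes candidates and leaves the choice among them to sentence analysis.

```python
# Toy set of registered family names (an assumption for illustration only).
FAMILY_NAMES = {"李", "毛"}

def name_candidates(sentence, start=0):
    """Return candidate full names beginning at position `start`.

    The family name is the left boundary; the given name is any 0, 1 or 2
    following characters, so length alone determines the candidates.
    """
    if sentence[start:start + 1] not in FAMILY_NAMES:
        return []
    longest = min(len(sentence), start + 3)
    return [sentence[start:end] for end in range(start + 1, longest + 1)]

print(name_candidates("李得胜了"))    # the same three candidates arise in (17.)
print(name_candidates("李得胜胜了"))  # and in (18.)
```

The candidate lists for (17.) and (18.) are identical, which is why the decision cannot be made at this level.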

Most Chinese place names are made of 1 to 3 characters, for example, 芜湖市 (Wuhu City), 南陵县 (Nanling County). The arbitrariness of these names makes any sub-string of n characters (0<n<4) in the sentence a suspect. Fortunately, in most cases we may find boundary indicators of these names, like 省 (province), 市 (city), 县 (county), etc. Once the boundary indicator is located, a technique similar to the one that uses the Chinese family name to identify the given name can be applied to select candidate place names for verification through grammatical analysis.

In general, there is always a possibility of ambiguity involvement in the formation of all types of proper names.

(19.)     Conclusion for 2.2.

Due to the possible involvement of ambiguity, a parsing grammar for morphological analysis as well as for sentence analysis is required for the proper identification of the words produced by Chinese productive word formation.

3. W‑CPSG Grammatical Approach

This section presents the W‑CPSG approach to Chinese word identification and morphological analysis. We will demonstrate how a parser based on W‑CPSG solves the problems of word identification ambiguity and productive word formation.

3.1. rationale of W‑CPSG approach

There have been a number of word identification algorithms based on both morphological and syntactic information (see the surveys in Feng 1996 and Sun & Huang 1996). Most such approaches do not use a self-contained grammar to parse the complete sentence. They are confined to the conventional two-step process of the segmentation-before-grammar design. As long as the word identification procedure is independent of a parsing grammar, it is extremely difficult to make full use of grammatical information to resolve ambiguity in word identification. Careful tuning and sophisticated design improve the precision but will not change the theoretical defect of all such approaches. Chen & Liu acknowledge the limitation of their approach due to the lack of a grammar. “However”, they say, “it is almost impossible to apply real world knowledge nor to check the grammatical validity at this stage” (Chen & Liu 1992, p.105). Why impossible at this stage? Because these segmentation systems are based on the two-step architecture, and the grammar is not yet available! As we have demonstrated, the final judgment for proper word identification can hardly be made until the whole sentence is parsed, hence the requirement of a full grammar. A two-step system is therefore forced to compromise: how much grammatical information it involves depends on how much word identification precision it can afford to sacrifice. Needless to say, there is significant double labor between such a word segmentation procedure and the following stage of parsing. As more and more grammatical information is used to achieve better precision, the overhead of this double labor becomes more serious. We consider the double labor a strong argument against the two-step approach. If enough grammatical information is incorporated, it is essentially equivalent to a grammar, and the segmenter is equivalent to a parser. Then why two grammars, one for word identification and one for sentence parsing? Why not combine them? That is exactly what we propose in W‑CPSG: a one-step approach based on an integrated grammar, eliminating the need for a segmentation preprocessor.

3.2. W‑CPSG character-based parsing

W‑CPSG (Li 1997a, 1997b) is a lexicalized Chinese unification grammar. The work on W‑CPSG follows the spirit of the modern linguistic theory Head-driven Phrase Structure Grammar (Pollard & Sag 1994). W‑CPSG consists of two parts: a minimized general grammar and an enriched lexicon. The general grammar only contains a handful of PS (phrase structure) rules, covering complement structure, modifier structure, conjunctive structure and morphological structure. This is the nature of lexicalized grammars. PS rules in such grammars are very abstract. Essentially, all they say is one thing: 2 signs can combine so long as the lexicon so indicates. The lexicon houses lexical entries with their linguistic description in feature structures. Potential morphological structures as well as potential syntactic structures are lexically encoded. In syntax, a word expects another sign to form a phrase. In morphology, a morpheme expects another sign to form a word. For example, the prefix 可 (-able) expects a transitive verb to form an adjective. The morphological PS rule will build the morphological structure when a transitive verb does appear after the prefix 可 (-able) in the input string.
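
The division of labor between an abstract rule and an enriched lexicon can be illustrated with a deliberately reduced sketch. It is not the W‑CPSG formalism: flat category strings stand in for feature structures, equality stands in for unification, only rightward expectation is modeled, and the names `Sign` and `combine` and the three entries are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Sign:
    form: str
    cat: str                        # category of the sign itself
    expects: Optional[str] = None   # category of the sign it expects to its right
    result: Optional[str] = None    # category of the combined sign

# Toy entries (assumptions): the expectations live in the lexicon, not in the rule.
KE = Sign("可", "AF", expects="Vt", result="A")    # prefix: expects a Vt, forms an adjective
DU = Sign("读", "Vt", expects="NP", result="VP")   # verb: expects an NP object, forms a VP
SHU = Sign("书", "NP")

def combine(head, other):
    """The one abstract rule: two adjacent signs combine if the lexicon so indicates."""
    if head.expects is not None and head.expects == other.cat:
        return Sign(head.form + other.form, head.result)
    return None

print(combine(KE, DU))    # morphological structure: 可读 as an adjective
print(combine(DU, SHU))   # syntactic structure: 读书 as a VP
print(combine(KE, SHU))   # None: 可 does not expect an NP
```

The same `combine` function builds a word in the first call and a phrase in the second, which is the sense in which morphology and syntax share one grammar.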

We now illustrate how W‑CPSG parses a string of Chinese characters by a sample parsing chart. The prototype of W‑CPSG was written in ALE, a grammar compiler developed on top of Prolog by Carpenter & Penn (1994). ALE compiles W‑CPSG into a Chinese parser, a Prolog program ready to accept a string of characters for analysis. A W‑CPSG parse tree embodies both morphological analysis and syntactic analysis, as shown below.

[Figure hpsg12: sample parsing chart for (20.)]

 

This is so-called bottom-up parsing. It starts with lexicon look-up. Edges 1 through 7 are lexical edges. Other edges are phrasal edges. Each edge represents a sign, i.e. a character (morpheme), a word, a phrase or a sentence. Lexical edges result from a successful match between the signs in the input string and the entries in the lexicon during lexicon look-up. For example, 可 (-able), 读 (read) and 性 (-ness) are all registered entries in the lexicon, so they get matched and shown by edge 5, edge 6 and edge 7. Words produced by productive word formation present themselves as phrasal edges, e.g. edge ((5+6)+7) for 可读性 (readability). For the sake of concise illustration, we only show two pieces of information for the signs in the chart, namely category and interpretation with a delimiting colon (lexical edges are only labeled for either category or interpretation). The parser attempts to combine the signs according to PS rules in the grammar until parses are found. A parse is an edge which ranges over the whole string. The parse ((((1+2)+3)+4)+((5+6)+7)) for (20.) represents a binary structural tree based on the W‑CPSG analysis, as shown below.

[Figure hpsg13: binary structural tree for (20.)]

3.3. ambiguity resolution in word identification

Given the resources of a phrase structure grammar like W‑CPSG, a parser based on standard chart parsing algorithms can handle both the cross ambiguity and the embedded ambiguity, provided that lexicon lookup uses a match algorithm based on exhaustive lookup instead of maximum match. All candidate words in the input string are presented to the parser for judgment. Ambiguous segmentation becomes a natural part of parsing: different ways of segmentation add different edges, and a successful parse always embodies the right identification. In other words, word identification in our design becomes a by-product of parsing instead of a pre-condition for parsing. The following example of complicated cross ambiguity illustrates how the W‑CPSG parser resolves ambiguity. As seen, both the FMM segmentation (represented by the edge sequence 8-9-5-10) and the BMM segmentation (represented by 1-11-12-10) are in the chart as a result of exhaustive lexicon lookup. They are proved wrong because they do not lead to a successful parse according to the grammar. As a by-product, the final parse (8+(3+(12+10))) automatically embodies the rightly identified word sequence 8-3-12-10, i.e. 出现 (appear) | 在 (at) | 世界 (world) | 东方 (east).

[Figure hpsg10: parsing chart for the cross ambiguity example 出现在世界东方]
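
Exhaustive lookup itself is simple to sketch. The toy lexicon below is an assumption chosen so that the lookup yields the same number of lexical edges as the chart discussed above; the function names are hypothetical. Note that the sketch only enumerates candidates. Choosing among them is the job of the grammar and is not attempted here.

```python
# Toy lexicon (an assumption for illustration only).
LEXICON = {"出", "现", "在", "世", "界", "东", "方",
           "出现", "现在", "在世", "世界", "东方"}

def lexical_edges(sentence):
    """Exhaustive lookup: every substring found in the lexicon becomes an edge."""
    n = len(sentence)
    return [(i, j, sentence[i:j])
            for i in range(n) for j in range(i + 1, n + 1)
            if sentence[i:j] in LEXICON]

def segmentations(sentence):
    """Enumerate every word sequence that the lexical edges allow."""
    if not sentence:
        return [[]]
    results = []
    for end in range(1, len(sentence) + 1):
        if sentence[:end] in LEXICON:
            results += [[sentence[:end]] + rest
                        for rest in segmentations(sentence[end:])]
    return results

sentence = "出现在世界东方"
candidates = segmentations(sentence)
print(len(lexical_edges(sentence)), "lexical edges")
for words in (["出现", "在世", "界", "东方"],    # the FMM result
              ["出", "现在", "世界", "东方"],    # the BMM result
              ["出现", "在", "世界", "东方"]):   # the sequence embodied in the final parse
    print("|".join(words), words in candidates)
```

All three sequences are available to the parser; maximum match would have committed to one of the first two before any grammatical information could be consulted.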

 

Exhaustive lookup also makes an embedded ambiguity sub-string like 烤红薯 no longer a blind area for word identification, as shown in (22.) below. All the candidate words in the sub-string including 烤 (bake), 红薯 (sweet potato), 烤红薯 (baked sweet potato) are added to the chart as lexical edges (edge 4, edge 8 and edge 10). This is a case of genuine ambiguity, resulting in 2 parses corresponding to 2 readings. The first parse (1+(7+10)) identifies the word sequence 他|喜欢|烤红薯, and the second parse (1+(9+(4+8))) a different sequence 他|喜欢|烤|红薯. Edge 7 and edge 9 represent two lexical entries for the verb 喜欢 (like), with different syntactic expectation (categorization). One expects an NP object, notated in the chart by like<NP>, and the other expects a VP complement, notated by like<VP>.

[Figure hpsg11: parsing chart (22.) for 他喜欢烤红薯]
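
How exhaustive lookup and bottom-up chart parsing interact on embedded ambiguity can be sketched with a small stand-in grammar. This is not W‑CPSG: three binary context-free rules over flat categories replace the lexicalized PS rules and feature structures, and the lexicon, rule table and function name `parse` are assumptions for illustration. The categories V<NP> and V<VP> mirror the two expectations of 喜欢 described above.

```python
from collections import defaultdict

# Toy lexicon (an assumption): word -> set of categories.
LEXICON = {
    "他": {"NP"},
    "吃": {"V<NP>"},
    "会": {"V<VP>"},
    "喜欢": {"V<NP>", "V<VP>"},
    "烤": {"V<NP>"},
    "红薯": {"NP"},
    "烤红薯": {"NP"},
}

# Binary rules: (left category, right category) -> parent category.
RULES = {
    ("V<NP>", "NP"): "VP",
    ("V<VP>", "VP"): "VP",
    ("NP", "VP"): "S",
}

def parse(sentence):
    """Return all bracketed S analyses that span the whole character string."""
    n = len(sentence)
    chart = defaultdict(set)  # (start, end) -> {(category, bracketing)}
    # Exhaustive lookup: every lexicon match becomes a lexical edge.
    for i in range(n):
        for j in range(i + 1, n + 1):
            for cat in LEXICON.get(sentence[i:j], ()):
                chart[i, j].add((cat, sentence[i:j]))
    # Bottom-up combination of adjacent edges, shortest spans first.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for lcat, ltree in list(chart[i, j]):
                    for rcat, rtree in list(chart[j, k]):
                        parent = RULES.get((lcat, rcat))
                        if parent:
                            chart[i, k].add((parent, f"({ltree}+{rtree})"))
    return sorted(tree for cat, tree in chart[0, n] if cat == "S")

for s in ("他吃烤红薯", "他会烤红薯", "他喜欢烤红薯"):
    print(s, parse(s))
```

Under these assumptions the first two sentences each receive one parse and the third receives two, so the segmentation falls out of the parse rather than preceding it.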

 

We now illustrate how Chinese proper names are identified in W‑CPSG parsing. In the W‑CPSG lexicon, Chinese family name is encoded to optionally expect the given name. Due to the arbitrariness of given names, no other constraint except for the length (either 1 character or 2 characters) is specified in the expectation. Therefore, we have three candidates for proper names in the following example, namely 李 (Li), 李得 (Li De), 李得胜 (Li Desheng), represented respectively by edge 1, edge (1+2) and the NP edge (1+5).[3] The first two candidates contribute to two valid parses while the third does not, hence the identification of the word sequences 李|得胜|了 and 李得|胜|了.

[Figure hpsg8: parsing chart for 李得胜了]

 

Now we add one more character 胜 (win) to form a new sentence, as shown in (24.) below.

[Figure hpsg9: parsing chart (24.) for 李得胜胜了]

 

The first two candidate proper names 李 (Li) and 李得 (Li De) no longer lead to parses. But the third candidate 李得胜 (Li Desheng) becomes part of the parse as a subject NP. The parse (((1+6)+4)+5) corresponds to the identification of the only valid word sequence 李得胜|胜|了.

Finally, we give an example to demonstrate how W‑CPSG handles reduplication in parsing and word identification. The sample sentence to be processed by the parser is 让他分分心 (Let him relax a while), involving the AB–>AAB type verb reduplication for diminutive use.

In most lexicons, 分心 (distract-heart: get distracted) is a registered 2-morpheme verb with internal morphological verb-object relation. Therefore, the reduplication is considered morphological. But in Chinese syntax, we also have a  general verb reduplication rule of the type A–>AA for diminutive use, for example, 看(look) –> 看看(have a look). This morphological verb reduplication rule AB–>AAB and the syntactic verb reduplication rule A–>AA are essentially the same rule in Chinese grammar. 分心 sits in the gray area between morphology and syntax. It looks both like a word (verb) and a phrase (VP). Lexically, it corresponds to one generalized sense (concept) and the internal combination is idiomatic, i.e. 分 (distract) must combine with 心 (heart) to mean ‘get distracted’. But, structurally, the combination of 分 and 心 is not fundamentally different from a VP consisting of Vt and NP, as in the phrase 看电影 (see a film). In fact, there is no clear-cut boundary between Chinese morphology and syntax. This morphology-syntax isomorphic fact serves as a further argument to support the W‑CPSG design of integrating morphology and syntax in one grammar module. Although the boundary between Chinese morphology and syntax is fuzzy, hence no universal definition of basic notions like word and phrase, the division can be easily defined system internally in an integrated grammar. In W‑CPSG,  分心 is treated as a phrase (VP) instead of a word (verb). The lexical entry 分 (distract) is coded to obligatorily expect the literal 心 (heart) as its syntactic object, shown in the following chart by the notation V<>. This approach has the advantage of eliminating the doubling of the reduplication rule for diminutive use in both syntax and morphology, making the grammar more elegant. The verb reduplication rule is implemented as a lexical rule in W‑CPSG.[4] This lexical rule creates a reduplicated verb with added diminutive sense, shown by edge 8 (a lexical edge).  
The whole parsing process is illustrated below.

[Figure hpsg7: parsing chart for 让他分分心]

 

 

REFERENCES

Carpenter, B. & G. Penn (1994): ALE, The Attribute Logic Engine, User’s Guide, Carnegie Mellon University

Chen, K-J. & S-H. Liu (1992): “Word identification for Mandarin Chinese sentences”, Proceedings of the 15th International Conference on Computational Linguistics, Nantes, 101-107.

Feng, Z-W. (1996): “COLIPS lecture series – Chinese natural language processing”,  Communications of COLIPS, Vol.6, No.1 1996, Singapore

Li, W. (1997a): “Outline of an HPSG-style Chinese reversible grammar”, Proceedings of The Northwest Linguistics Conference-97 (NWLC-97, forthcoming), UBC, Vancouver, Canada

Li, W. (1997b): W‑CPSG: A Lexicalized Chinese Unification Grammar And Its Application, Doctoral dissertation (on-going), Simon Fraser University, Canada

Liang, N. (1987): “Shumian Hanyu Zidong Fenci Xitong – CDWS” (Automatic word segmentation system for written Chinese – CDWS), Journal of Chinese Information Processing, No.2 1987, pp 44-52, Beijing

Pollard, C. & I. Sag (1994): Head-Driven Phrase Structure Grammar, Center for the Study of Language and Information, Stanford University, CA

Sun, M-S. & C-N. Huang  (1996): “Word segmentation and part of speech tagging for unrestricted Chinese texts” (Tutorial Notes for International Conference on Chinese Computing ICCC’96), Singapore

~~~~~~~~~~~~~~~~~~~

[1] The author benefited from the insightful discussion with Dr. Dekang Lin on the feasibility of parsing Chinese character strings instead of word strings. Thanks also go to Paul McFetridge and Fred Popowich for their supervision and encouragement.

[2] This table is adapted from the following table in Sun & Huang (1996).

case 1 The output of FMM and BMM are different, but both are incorrect 0.054%
case 2 The output of FMM and BMM are different, but only one is correct 9.24%
case 3 The output of FMM and BMM are identical, but incorrect 0.41%
case 4 The output of FMM and BMM are identical, and correct 90.30%

The 4 cases which they list are not logically exhaustive in terms of sentence based processing (i.e. when discourse is not involved in a system). In particular, there is another case, in which the outputs of FMM and BMM are different and both are correct. We call this a case of genuine cross ambiguity.

[3] Note that there is another S edge (1+5) in the chart. These two edges are structurally different, created via different PS rules. The NP edge (1+5) is formed through the morphological PS rule, combining the family name (edge 1) and its expected given name (edge 5). In the S edge (1+5), however, it is the subject rule (one of the complement PS rules) that decides the combination of the predicate (edge 5) and its expected subject NP (edge 1).

[4] Lexical rules are favored by many linguists to capture redundancy in the lexicon instead of the conventional approach of syntactic transformation. Lexical rules are applied at compile time to form an expanded lexicon before parsing starts.

 


Interaction of syntax and semantics in parsing Chinese transitive verb patterns

Interaction of syntax and semantics in parsing Chinese transitive verb patterns *
(old paper in Proceedings of International Chinese Computing Conference, ICCC’96)

Wei  LI

Department of Linguistics, Simon Fraser University
Burnaby, B.C. V5A 1S6 CANADA (email: lio@sfu.ca)

Keywords: Chinese processing, transitive pattern, syntax, semantics, lexical rule, HPSG

Abstract

This paper addresses the problem of parsing Chinese transitive verb patterns (including the BA construction and the BEI construction) and handling the related phenomena of semantic deviation (i.e. the violation of the semantic constraint).

We designed a syntax-semantics combined model of Chinese grammar in the framework of Head-driven Phrase Structure Grammar [Pollard & Sag 1994]. Lexical rules are formulated to handle both the transitive patterns which allow for semantic deviation and the patterns which disallow it. The lexical rules ensure the effective interaction between the syntactic constraint and the semantic constraint in analysis.

The contribution of our research can be summarized as:

(1) the insight on the interaction of syntax and semantics in analysis;
(2) a proposed lexical rule approach to semantic deviation based on (1);
(3) the application of (2) to the study of the Chinese transitive patterns;
(4) the implementation of (3) in a unification-based Chinese HPSG prototype.

  1. Background

When Chomsky proposed his Syntactic Structures in the fifties, he seemed to indicate that syntax should be addressed independently of semantics. As a convincing example, he presented a famous sentence:

1)             Colorless green ideas sleep furiously.

Weird as it sounds, the grammaticality of this sentence is intuitively acknowledged: (1) it follows the English syntax; (2) it can be interpreted. In fact, there is only one possible interpretation, solely decided by its syntactic structure. In other words, without the semantic interference, our linguistic knowledge about the English syntax is sufficient to assign roles to each constituent to produce a reading although the reading does not seem to make sense.

However, things are not always this simple. Compare the following Chinese sentences of the same form NP NP V:

2a)           dianxin  wo           chi           le.
                Dim-Sum I               eat           LE.
The Dim Sum I have eaten.
Note:        LE is a particle for perfect aspect.

2b)   wo dianxin chi le.
I have eaten the Dim Sum.

Who eats what? There is no formal way but to resort to the semantic constraint imposed by the notion eat to reach the correct interpretation [Li, W. & McFetridge 1995].

Of course, if we want to maintain the purity of syntax, it could be argued that syntax will only render possible interpretations and not the interpretation. It is up to other components (semantic filter and/or other filters) of grammar to decide which interpretation holds in a certain context or discourse. The power of syntax lies in the ability to identify structural ambiguities and to render possible corresponding interpretations. We call this type of linguistic design a syntax-before-semantics model. While this is one way to organize a grammar, we found it unsatisfactory for two reasons. First, it does not seem to simulate the linguistic process of human comprehension closely. For human listeners, there are no ambiguities involved in sentences 2a) and 2b). Second, there is a considerable cost in processing efficiency in terms of computer implementation. This efficiency problem can be very serious in the analysis of languages like Chinese with virtually no inflection.

Head-driven Phrase Structure Grammar (HPSG) [Pollard & Sag 1994, 1987] assumes a lexicalist approach to linguistic analysis and advocates an integrated model of syntax and the other components of grammar. It serves as a desirable framework for integrating the semantic constraint in establishing syntactic structures and interpretations. Therefore, we proposed to enforce the semantic constraint that an animate being eats food directly in the lexical entry chi (eat) [Li, W. & McFetridge 1995]: chi (eat) requires an animate NP subject and a food NP object. This correctly addresses the who-eats-what problem for sentences like 2a) and 2b). In fact, this type of semantic constraint (selectional restriction) has been widely used for disambiguation in NLP systems.
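
As a rough illustration of how such a lexically encoded constraint settles who eats what, consider the following sketch. It is not the HPSG entry itself: plain dictionaries stand in for feature structures, and the semantic classes, the entry `CHI` and the function `assign_roles` are assumptions for illustration.

```python
# Toy semantic classes (assumptions for illustration only).
SEM = {"wo": "animate", "dianxin": "food"}

# Simplified lexical entry for chi (eat): animate subject, food object.
CHI = {"subject": "animate", "object": "food"}

def assign_roles(np1, np2, verb=CHI):
    """Assign subject and object for the form NP NP V by the verb's constraint."""
    for subj, obj in ((np1, np2), (np2, np1)):
        if SEM[subj] == verb["subject"] and SEM[obj] == verb["object"]:
            return {"subject": subj, "object": obj}
    return None  # the constraint, rigidly enforced, admits no reading

print(assign_roles("dianxin", "wo"))   # word order of 2a)
print(assign_roles("wo", "dianxin"))   # word order of 2b)
```

Both word orders receive the same role assignment, since the decision rests entirely on the semantic classes of the two NPs.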

The problem is, the constraint should not always be enforced. In the practice of communication, deviation from the constraint is common and deviation is often deliberately applied to help render rhetorical expressions.

 

3) xiang      chi           yueliang,  ni             gou           de3    zhao       me?
    want        eat           moon,       you          reach       DE3  -able          ME?
Wanting to eat the moon, but can you reach it?
Note:  DE3 is a particle, introducing a postverbal adjunct of result or capability. ME is a sentence final particle for yes-no question.

4) dajia         dou   chi           shehui zhuyi,           neng         bu            qiong       me?
     people      all      eat           social -ism,               can            not           poor         ME
Everyone is eating socialism, can it not be poor?

yueliang (moon) is not food, of course. It is still some physical object, though. But in 4), shehui zhuyi (socialism) is a purely abstract notion. If a parser enforces the rigid semantic constraint, there are many such sentences that will be rejected without getting a chance to be interpreted. The fact is, we do have interpretations for 3) and 4). Hence an adequate grammar should be able to accommodate those interpretations.

To capture such deviation, Wilks came up with his Preference Semantics [Wilks 1975, 1978]. A sophisticated mechanism is designed to calculate the semantic weight for each possible interpretation, i.e. how much it deviates from the preference semantic constraint. The final choice will be given to the interpretation with the most semantic weight in total. His preference model simulates the process of how humans comprehend language more closely than most previous approaches.

The problem with this design is the serious computational complexity involved in the model [Huang 1987]. In order to calculate the semantic weight, the preference semantic constraint is loosened step by step. Each possible substructure has to be re-tried with each step of loosening. It may well lead to combinatorial explosion.

What we are proposing here is to look at semantic deviation in the light of the interaction of the syntactic constraint and the semantic constraint. In concrete terms, the loosening of the semantic constraint is conditioned by syntactic patterns. Syntactic pattern is defined as the representation of an argument structure in surface form. A pattern consists of 2 parts: a structure’s syntactic constraint (in terms of the syntactic categories and configuration, word order,  function words and/or inflections) and its interpretation (role assignment). For example, for Chinese transitive structure, NP V NP: SVO is one pattern, NP NP V: SOV is another pattern, and NP [ba NP] V: SOV (the BA construction) is still another. The expressive power of a language is indicated by the variety of patterns used in that language. Our design will account for some semantic deviation or rhetorical phenomena seen in everyday Chinese without the overhead of computational complexities. We will focus on Chinese transitive verb patterns for illustration of this approach.

  2. Chinese transitive patterns

Assuming three notional signs wo (I), chi (eat) and dianxin (Dim Sum), there are maximally 6 possible combinations in surface word order, out of which 3 are grammatical in Chinese.[1]

5a)           wo chi le dianxin.                                   SVO
5b)           wo dianxin chi le.                                   SOV
5c)           dianxin wo chi le.                                    OSV

SVO is the canonical word order for Chinese transitive structure. When a string of signs matches the order NP V NP, the semantic constraint has to yield to syntax for interpretation.

NP V NP: SVO

6)  daodi      shi    ni     zai    du      shu     ne,
    on-earth   be     you    ZAI    read    book    NE,

    haishi     shu    zai    du     ni      ne?
    or         book   ZAI    read   you     NE?

Are you reading the book, or is the book reading you, anyway?
Note:        ZAI is a particle for continuous aspect.
NE is a sentence final particle for or-question.

As in the English equivalent, the interpretation of 6) can only be SVO, no matter how contradictory it might be to our common sense. In other words, in the form of NP V NP, syntax plays a decisive role.

In contrast, to interpret the form NP NP V as SOV in 2b), the semantic constraint is critical. Without the enforcement of the semantic constraint, the interpretation of SOV does not hold. In fact, this SOV pattern (NP1 NP2 V: SOV) has been regarded as ungrammatical in a Case Theory account of Chinese transitive structure in the framework of GB. According to that analysis, something similar to this pattern constitutes the D‑Structure for the transitive pattern, and Chinese is an underlying SOV language (called the “SOV Hypothesis”: see the survey in Gao 1993). In the surface structure, NP2 is without case on the assumption that V assigns its CASE only to the right. One has to either insert the case-marker ba to assign CASE to it (the BA construction) or move it to the right of V to get its CASE (the SVO pattern). This analysis suffers from not being able to account for the grammaticality of sentences like 2b). However, by distinguishing the deep pattern SOV from the 2 surface patterns (the SVO and the BA construction), the theory has the merit of alerting us that the SOV pattern seems to be syntactically problematic (crippled, so to speak). This is an insightful point, but it goes one step too far in totally rejecting the SOV pattern in surface structure. If we modify this idea, we can claim that SOV is a syntactically unstable pattern and that SOV tends to (not must) “transform” to the SVO or the BA construction unless it is reinforced by semantic coherence (i.e. the enforcement of the semantic constraint). This argument in the light of syntax-semantics interaction is better supported by the Chinese data. In essence, our account is close to this reformulated argument, but in our theory, we do not assume a deep structure and transformation. All patterns are surface constructions. If no sentences can match a construction, it is not considered a pattern by our definition.

This type of unstable pattern, which depends on the semantic constraint, is not limited to transitive phenomena. For example, the type of Chinese NP predicate defined in [Li, W. & McFetridge 1995] is also a semantics dependent pattern. Compare:

7a)  zhe           zhang       zhuozi                  san          tiao          tui.
        this           Cl.         table(furniture)      three        Cl.            leg
This table is three-legged.
Note:        Cl for classifier.

7b) *        zhe           zhang       ditu                          san          tiao          tui.
                this           Cl.           map(non-furniture)  three        Cl.            leg

There is clearly a semantic constraint of the NP predicate on its subject: it should be furniture (or animate). Without this “semantic agreement”, Chinese NP is normally not capable of functioning as a predicate, as shown in 7b).

Between semantics dependent and semantics independent patterns, we may have partially dependent patterns. For example, in NP NP V: OSV, it seems that the semantic constraint on the initial object is less important than the semantic constraint on the subject.

8)   shitou                wo              ye   xiang  chi,    kexi      yao       bu      dong.
   stone(non-food)  I(animate) also want  eat,    pity       chew    not      -able

Even stones I also want to eat, but it’s such a pity that I am not able to chew them.

If the constraint on the object matches well, is the subject allowed to be semantically deviant?

9) ?          dianxin                     zhuozi                        chi           le.
                Dim-Sum(food)        table(non-animate)  eat           LE.

These are marginal cases: a grammar may choose to be more tolerant and accept them, or more restrictive and reject them.

Unlike SOV, but like its English counterpart, OSV is one type of Chinese topic construction, and the relationship between the initial O and V is a long distance dependency.

10a)  dianxin      wo     xiangxin   ni           yiwei        Lisi          chi           le.
          Dim-Sum    I         believe     you          think        Lisi           eat           LE

The Dim Sum I believe you think that Lisi ate.

10b) *      Lisi wo xiangxin ni yiwei dianxin chi le.

10b) will not be accepted in our model because (1) it cannot be interpreted as OSV, since it violates the semantic constraint on S: dianxin is not animate; (2) nor can it be interpreted as SOV, since it violates the configurational constraint: SOV is simply not a long distance pattern. In fact, NP NP V: SOV is such a restricted pattern in Chinese that it not only excludes any long distance dependency but even disallows some adjuncts. Compare 11a) in the OSV pattern with 11b) and 11c) in the SOV pattern:

11a)  dianxin      wo           jinjinyouwei             de2           chi           le.
          Dim-Sum      I              with-relish                DE2         eat           LE

The Dim Sum I ate with relish.
Note:        DE2 is a particle introducing a preverbal adjunct of  manner.

11b) *      wo dianxin jinjinyouwei de2 chi le.

11c) *      wo jinjinyouwei de2 dianxin chi le.

There is another pattern with the linear order SOV: the notorious Chinese BA construction. ba is usually regarded as a preposition which introduces a preverbal object for transitive verbs.

NP [ba NP] V: SOV

12a)  wo           ba            dianxin       jinjinyouwei             de2          chi           le.
           I              BA           Dim-Sum     with-relish                DE2         eat           LE

I ate the Dim Sum with relish.

12b)         wo jinjinyouwei de2 ba dianxin  chi le.
With relish, I ate the Dim Sum.

12c)         dianxin  ba wo jinjinyouwei de2  chi le.
The Dim Sum ate me with relish.

12d)         dianxin jinjinyouwei de2 ba wo  chi le.
With relish, the Dim Sum ate me.

For the OSV order, there is another so-called BEI construction. The BEI construction is usually regarded as an explicit passive pattern in Chinese.

NP [bei NP] V: OSV

13a)        dianxin       bei          wo           chi           le.
                Dim-Sum     BEI          I               eat           LE

The Dim Sum was eaten by me.

13b)         wo bei dianxin  chi le.

I was eaten by the Dim Sum.

The BEI construction and the BA construction are both semantics independent. In fact, any pattern resorting to the means of function words in Chinese seems to be sufficiently independent of the semantic constraint.

To conclude, semantic deviation often occurs in some more independent patterns, as seen in 5d2), 6), 8), 12c), 12d), 13b). Close study reveals that different patterns result in different reliance on the semantic constraint, as summarized in the following table.

                syntactic pattern          semantic dependence

                NP V NP: SVO               no dependence
                NP [ba NP] V: SOV          no dependence
                NP [bei NP] V: OSV         no dependence
                NP NP V: OSV               partial dependence
                NP NP V: SOV               full dependence

It should be emphasized that this observation constitutes the rationale behind our approach.
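
As a rough illustration only (this is not part of the ALE implementation discussed later), the summary table can be written down as a small lookup structure. The names SEMANTIC_DEPENDENCE and needs_semantic_check are hypothetical and chosen for this sketch.

```python
# Hypothetical encoding of the summary table: pattern -> degree of semantic dependence.
SEMANTIC_DEPENDENCE = {
    "NP V NP: SVO": "none",
    "NP [ba NP] V: SOV": "none",
    "NP [bei NP] V: OSV": "none",
    "NP NP V: OSV": "partial",
    "NP NP V: SOV": "full",
}


def needs_semantic_check(pattern: str) -> bool:
    """Return True when the pattern relies on the semantic constraint at all."""
    return SEMANTIC_DEPENDENCE[pattern] != "none"
```

Under this encoding, the patterns marked by word order alone (NP NP V) are the only ones for which a semantic check is consulted.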

  3. Formulation of lexical rules

Based on the above observation, we have designed a syntax-semantics combined model. In this model, we take a lexical rule approach to Chinese patterns and the related problem of semantic deviation.

A lexical rule takes as its input a lexical entry which satisfies its condition and generates another entry. Lexical rules are usually used to cover lexical redundancy between related patterns. The design of lexical rules is preferred by many grammarians over the more conventional use of syntactic transformation, especially for lexicalist theories.

Our general design is as follows, still using chi (eat) for illustration:

(1)   Syntactically, chi (eat) as a transitive verb subcategorizes for a left NP as its subject and a right NP as its object.

(2)   Semantically, the corresponding notion eat expects an entity of category animate as its logical subject and an entity of category food as its logical object. This represents the common sense (knowledge) that animate beings eat food.

(3)   The interaction of syntax and semantics is implemented by lexical rules. The lexical rules embody the linguistic generalizations about the transitive patterns. They will decide to enforce or waive the semantic constraint based on different patterns.

As seen, syntax only stipulates the requirement of two NPs as complements for chi and does not care about the NPs’ semantic constraint. Semantics sets its own expectation of an animate entity and a food entity as arguments for eat and does not care what syntactic forms these entities assume on the surface. It is up to lexical rules to coordinate the two. In our model, the information in (1) and (2) is encoded in the corresponding lexical entry, and the lexical rules in (3) are then applied to expand the lexicon before parsing begins. Driven by the expanded lexicon, analysis is implemented by a lexicalist parser that builds the interpretation structure for the input sentence. Following this design, there is sufficient interaction between syntax and semantics, as desired, while syntax still remains a self-contained component, separate from semantics, in the lexicon. More importantly, this design does not add any computational complexity to parsing because, in order to handle different patterns, similar lexical rules are required even for a pure syntax model.
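
The division of labour in (1) to (3) can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not the ALE grammar: LexicalEntry, accepts and the flat string categories are hypothetical, and semantic categories are compared by plain equality rather than through a type hierarchy.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LexicalEntry:
    word: str
    subcat: Tuple[str, ...]     # (1) syntactic expectation: complement categories
    knowledge: Tuple[str, ...]  # (2) semantic expectation: argument categories


def accepts(entry: LexicalEntry, arg_semantics: Tuple[str, ...], enforce: bool) -> bool:
    """(3) A pattern either enforces or waives the semantic constraint.

    arg_semantics lists the semantic categories of the (subject, object)
    candidates in the order in which the pattern assigns the roles.
    """
    if len(arg_semantics) != len(entry.subcat):
        return False          # the syntactic requirement is never waived
    if not enforce:
        return True           # e.g. NP V NP: SVO ignores the semantics
    return tuple(arg_semantics) == entry.knowledge


chi = LexicalEntry("chi", ("NP", "NP"), ("animate", "food"))
```

With this sketch, a semantically deviant assignment such as subject dianxin (food) and object wo (animate) is accepted when the pattern waives the constraint, `accepts(chi, ("food", "animate"), enforce=False)`, and rejected when the pattern enforces it.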

Before we proceed to formulate lexical rules for transitive patterns, we should make clear what a transitive pattern is. As defined before, a pattern consists of two parts: a structure’s syntactic constraint and the corresponding interpretation. Word order is an important constraint in Chinese syntax. In addition to word order, we have categories and function words (preposition, particle, etc.). As for interpretation, the transitive structure involves three elements: V (predicate) and its arguments S (logical subject) and O (logical object). There is a further factor to take into account: Chinese complements are often optional. In many cases, the subject and/or object can be omitted, either because they can be recovered from the discourse or because they are unknown. We call such patterns elliptical patterns (with some complement(s) omitted), in contrast to full patterns. With these in mind, we can define 10 patterns for the Chinese transitive structure: 5 full patterns and 5 elliptical patterns.

We now investigate these transitive patterns one by one and try to informally formulate the corresponding lexical rules to capture them. Please note that the basic input condition is the same for all the lexical rules, because they share the same argument structure: the transitive structure.

Lexical rule 1:   

                V ((NP1, NP2), (constr1, constr2)) –> NP1 V NP2: SVO

The above notation for the lexical rule should be largely self-explanatory. The input of the rule is a transitive verb which subcategorizes for two NPs, NP1 and NP2, and whose corresponding notion expects two arguments of constr1 and constr2. NP is a syntactic category, and constr is a semantic category (human, animate, food, etc.). The output pattern has the defined word order SVO and waives the semantic constraint.

Lexical rule 2:   

      V ((NP1, NP2), (constr1, constr2)) –> [NP1, constr1] [NP2, constr2] V: SOV

Please note that the semantic constraint is enforced for this SOV pattern. Since this pattern shares the form NP NP V with the OSV pattern, it would be interesting to see what happens if a transitive verb has the same semantic constraint on both its subject and object. For example, qingjiao (consult) expects a human subject and a human object.

14)           ta                     ni                               qingjiao    guo        me?
                he(human)     you(human)             consult     GUO        ME

Him, have you ever consulted?
Note: GUO is a particle for experience aspect.

15)           ni ta  qingjiao guo  me?

You, has he ever consulted?

In both cases, the interpretation is OSV instead of SOV. Therefore, we need to reformulate Lexical rule 2 to exclude the case when the subject constraint is the same as the object constraint.

Lexical rule 2′ (refined version):

                V ((NP1, NP2), (constr1, constr2), (constr1 not = constr2))

                –> [NP1, constr1] [NP2, constr2] V: SOV

Lexical rule 3:

                V ((NP1, NP2), (constr1, constr2)) –> NP1 [ba NP2] V: SOV

This is the typical BA construction. But not every transitive verb can assume the BA pattern. In fact, ba is one of a set of prepositions to introduce the logical object. There are other more idiosyncratic prepositions (xiang, dao, dui, etc.) required by different verbs to do the same job.

16a)      ni             qingjiao    guo         ta             me?
              you          consult     GUO        he            ME

Have you ever consulted him?

16b)         ni             xiang        ta             qingjiao    guo        me?
                 you          XIANG     he            consult     GUO        ME

Have you ever consulted him?

16c) *      ni             ba            ta             qingjiao    guo        me?
                you          BA           he            consult     GUO        ME

17a)         ta             qu             guo         Beijing.
                 he            go-to        GUO        Beijing

He has been to Beijing.

17b)         ta             dao         Beijing     qu             guo.
                 he            DAO        Beijing     go-to        GUO

He has been to Beijing.

17c) *      ta             ba            Beijing     qu            guo.
                 he            BA           Beijing     go-to        GUO

18a)         ta             hen         titie                             zhangfu.
                 she           very       tenderly-care-for      husband

She cares for her husband very tenderly.

18b)         ta             dui          zhangfu       hen        titie.
                 she           DUI         husband      very       tenderly-care-for

She cares for her husband very tenderly.

18c) *      ta             ba            zhangfu         hen                          titie.
                she           BA           husband         very                         tenderly-care-for

This originates from the different theta-roles assumed by different verb notions for their object argument: patient, theme, destination, to name only a few. These theta-roles are further classifications of the more general semantic role logical object. We can rely on the subcategorization property of the verb for the choice of the preposition literal (the so-called valency preposition). With the valency information in place, we now reformulate Lexical rule 3 to make it more general:

Lexical rule 3′ (refined version):

       V ((NP1, NP2), (constr1, constr2),  (valency_preposition=P), (P not = null))

       –> NP1 [P NP2] V: SOV
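
A minimal sketch of Lexical rule 3′ in Python, assuming the valency preposition is stored per verb in a plain dictionary. The table VALENCY_PREPOSITION and the function prepositional_pattern are hypothetical names; the four verb/preposition pairs are taken from examples 12a), 16b), 17b) and 18b).

```python
from typing import Optional

# Valency prepositions as illustrated in the examples above.
VALENCY_PREPOSITION = {
    "chi": "ba",          # 12a) wo ba dianxin ... chi le
    "qingjiao": "xiang",  # 16b) ni xiang ta qingjiao guo me
    "qu": "dao",          # 17b) ta dao Beijing qu guo
    "titie": "dui",       # 18b) ta dui zhangfu hen titie
}


def prepositional_pattern(verb: str) -> Optional[str]:
    """Apply rule 3': only verbs with a non-null valency preposition qualify."""
    p = VALENCY_PREPOSITION.get(verb)
    if p is None:
        return None  # condition (P not = null) fails, so no pattern is generated
    return f"NP1 [{p} NP2] {verb}: SOV"
```

For instance, `prepositional_pattern("qingjiao")` yields the xiang pattern rather than a BA pattern, which is the behaviour that blocks 16c).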

Lexical rule 4:   

                V ((NP1, NP2), (constr1, constr2)) –> NP2 … [NP1, constr1] V: OSV

This is a topic pattern involving long distance dependency. It is up to different formalisms to provide different approaches to long-distance phenomena. In our present implementation, NP2 is placed in a feature called BIND to indicate the nature of the long distance dependency. A phrase structure rule, the Topic Rule, is designed to use this information and handle the unification of the long distance complement properly.

Following the topic pattern, the passive BEI construction is formulated in Lexical rule 5.

Lexical rule 5:   

                V ((NP1, NP2), (constr1, constr2)) –> NP2 [bei NP1] V: OSV

We now turn to elliptical patterns.

Lexical rule 6:   

                V ((NP1, NP2), (constr1, constr2)) –> V NP2: VO

19)           chi           guo          jiaozi                        me?
                eat           GUO        dumpling                 ME

Have (you) ever eaten dumplings?

Lexical rule 7:   

                V ((NP1, NP2), (constr1, constr2)) –> [NP1, constr1] V: SV

20)           wo           chi           le.
                I               eat           LE

I have eaten (it).

21)           ji                                 chi           le.
                chicken1(animate)   eat           LE

The chicken has eaten (it).

Like its English counterpart, ji (chicken) has two senses: (1) chicken1 as animate; (2) chicken2 as food. We code this difference in two lexical entries. Only the first entry matches the semantic constraint on the subject in the pattern and reaches the above SV interpretation in 21). Interestingly enough, the same sentence will get another parse with a different interpretation OV in 23) because the second entry also satisfies the semantic constraint on the object in the OV pattern in Lexical rule 8.

22)           ni             qingjiao    guo         me?
                you          consult     GUO        ME

Have you consulted (someone)?

22) indicates that the SV interpretation is preferred over the OV interpretation when the semantic constraint on the subject and the semantic constraint on the object happen to be the same. Hence the added condition in Lexical rule 8.

Lexical rule 8:   

                V ((NP1, NP2), (constr1, constr2), (constr1 not = constr2))

                –> [NP2, constr2] V: OV

23)           ji                                 chi           le.
                chicken2(food)         eat           LE

The chicken has been eaten.
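
The two parses of ji chi le in 21) and 23) can be reproduced with a toy version of Lexical rules 7 and 8. This is a sketch under simplifying assumptions: the small LEXICON and the function readings are hypothetical, and semantic categories are matched by equality.

```python
from typing import List, Tuple

# Each noun lists the semantic category of each of its lexical entries.
LEXICON = {
    "ji": ["animate", "food"],  # chicken1 (animate) and chicken2 (food)
    "wo": ["animate"],
    "dianxin": ["food"],
    "ni": ["human"],
}


def readings(noun: str, verb_constraints: Tuple[str, str]) -> List[str]:
    """Return the interpretations of the string 'noun V' for a transitive verb."""
    constr1, constr2 = verb_constraints
    result = []
    for sense in LEXICON[noun]:
        if sense == constr1:
            result.append("SV")  # rule 7: [NP1, constr1] V
        if sense == constr2 and constr1 != constr2:
            result.append("OV")  # rule 8: blocked when constr1 equals constr2
    return result
```

With chi expecting (animate, food), `readings("ji", ("animate", "food"))` returns both SV and OV, whereas for qingjiao, which expects (human, human), only SV survives, as in 22).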

Lexical rule 9:   

                V ((NP1, NP2), (constr1, constr2)) –> NP2 [bei V]: OV

24)           dianxin    bei           chi           le.
                Dim-Sum  BEI          eat           LE

The Dim Sum has been eaten.

Lexical rule 10:

                V ((NP1, NP2), (constr1, constr2)) –> V: V

25)           chi           le             me?
                eat           LE            ME?                        

(Have you) eaten (it)?

  4. Implementation

We begin with a discussion of some major feature structures in HPSG related to handling the transitive patterns.  Then, we will show how our proposal works and discuss some related implementation issues.

HPSG is a highly lexicalist theory. Most information is housed in the lexicon. The general grammar is kept to a minimum: only a few phrase structure rules (called ID Schemata) associated with a couple of principles. The data structure is the typed feature structure. The necessary part of a typed feature structure is the type information. A simple feature structure contains only the type information, but a complex feature structure can introduce a set of feature/value pairs in addition to the type information. In a feature/value pair, the value is itself a feature structure (simple or complex). The following is a sample implementation of the lexical entry chi for our Chinese HPSG grammar using the ALE formalism [Carpenter & Penn 1994].

[Figure hpsg3: lexical entry for chi in ALE notation]

Note:  (1) Uppercase notation for feature; (2) Lowercase notation for type; (3) Number indices in square brackets for unification.

Leaving the notational details aside, what this roughly says is: (1) for the semantic constraint, the arguments of the notion eat are an animate entity and a food entity; (2) for the syntactic constraint, the complements of the verb chi are 2 NPs: one on the left and the other on the right; (3) the interpretation of the structure is a transitive predicate with a subject and an object. The three corresponding features are: (1) KNOWLEDGE; (2) SUBCAT; (3) CONTENT. KNOWLEDGE stores some of our common sense by capturing the internal relation between concepts. Such common sense knowledge is represented in a linguistic way, i.e. as a semantic expectation feature, which parallels the syntactic expectation feature SUBCAT. KNOWLEDGE defines the semantic constraint on the expected arguments no matter what syntactic forms the arguments will take. In contrast, SUBCAT only defines the syntactic constraint on the expected complements. The syntactic constraint includes word order (LEFT feature), syntactic category (CATEGORY feature) and configurational information (LEX feature). Finally, the CONTENT feature assigns the roles SUBJECT and OBJECT for the represented structure.
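
For readers without access to the ALE notation, the three features can be mimicked with plain Python dictionaries. This is only an approximation under our own assumptions, not the actual ALE entry: the key RELATION is a hypothetical name, the LEX feature is omitted, and structure sharing is imitated by reusing the same Python object, whereas a real HPSG entry would share indices rather than whole complement descriptions.

```python
# Complement descriptions; object identity stands in for co-indexing.
left_np = {"CATEGORY": "np", "LEFT": True}    # complement to the left of chi
right_np = {"CATEGORY": "np", "LEFT": False}  # complement to the right of chi

chi = {
    "type": "verb",
    # Semantic expectation: what the notion eat expects of its arguments.
    "KNOWLEDGE": ["animate", "food"],
    # Syntactic expectation: two NP complements, one on each side.
    "SUBCAT": [left_np, right_np],
    # Interpretation: roles assigned to the complements (basic SVO reading).
    "CONTENT": {"RELATION": "eat", "SUBJECT": left_np, "OBJECT": right_np},
}

# Note that KNOWLEDGE is deliberately not tied to SUBCAT here; whether the
# semantic constraint is enforced is left to the lexical rules.
```

The point of the sketch is that `chi["CONTENT"]["SUBJECT"]` and `chi["SUBCAT"][0]` are one and the same object, which is the effect that co-indexing achieves in the feature structure.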

A more important issue is the interaction of the three feature structures. Among the three features, only KNOWLEDGE is our add-on. The relationship between SUBCAT and CONTENT has been established in all HPSG versions: SUBCAT resorts to CONTENT for interpretation. This interaction corresponds to our definition of pattern. Everything works as long as the syntactic constraint alone can decide the interpretation. When the semantic constraint (in KNOWLEDGE) has to be involved in the interpretation process, we need a way to access this information. In unification based theories, information flow is realized by unification (i.e. structure sharing, which is represented by the co-indexing of feature values). In general, we have two ways to ensure structure sharing in the lexicon. It is either directly co-indexed in the lexical entries, or it resorts to lexical rules. The former is unconditional, and the latter is conditional. As argued before, we cannot directly enforce the semantic constraint for every transitive pattern in Chinese, for otherwise our grammar would not allow for any semantic deviation. We are left with lexical rules, which we have informally formulated in Section 3 and implemented in the ALE formalism.

CATEGORY is another major feature for a sign. The CATEGORY feature in our implementation includes functional categories, which can specify a functional literal (function word) as their value. Function words belong to closed categories. Therefore, they can be classified by enumeration of literals. Like word order, function words are an important form of syntactic constraint in Chinese. Grammars for other languages also resort to some functional literals for constraint. In most HPSG grammars for English, for example, a preposition literal is specified in a feature called P_FORM. There are two problems involved there. First, at the representation level, there is redundancy: P_FORM:x –> CATEGORY:p (where x is not null). In other words, there exists a feature dependency between P_FORM and CATEGORY which is not captured in the formalism. Second, if P_FORM is designed to stipulate a preposition literal, we will ultimately need to add features like CL_FORM for classifier specification, CO_FORM for conjunction specification, etc. In fact, for each functional category, literal specification may be required for constraint in a non-toy grammar. That will make the feature system of the grammar too cumbersome. These problems are solved in our grammar implementation in ALE. One significant mechanism in ALE is its type inheritance and appropriateness specifications for feature structures [Carpenter & Penn 1994]. (A similar design is found in the new software paradigm of Object Oriented Programming.) Thanks to ALE, we can now use literals (ba, xiang, dao, dui, etc.) as well as major categories (n, v, a, p, etc.) to define the CATEGORY feature. In fact, any intermediate level of subclassification between these two extremes, major categories and literals, can be represented in CATEGORY just as handily. Together they constitute a type hierarchy of CATEGORY. The same mechanism can also be applied to semantic categories (human, animate, food, etc.) to capture thesaurus inferences like human –> animate. This makes our knowledge representation much more powerful than in formalisms without this mechanism. We will address this issue in depth in another paper, Typology for syntactic category and semantic category in Chinese grammar.
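
The idea of a single type hierarchy covering both literals and major categories, and likewise semantic categories, can be sketched as a parent table with a subsumption test. This is an illustration in Python, not ALE's type system; the root types entity and category and the function subsumes are assumptions made for the sketch.

```python
# Each type points to its immediate supertype; roots have no entry.
PARENT = {
    # semantic categories (thesaurus inference such as human -> animate)
    "human": "animate",
    "animate": "entity",
    "food": "entity",
    # syntactic categories: preposition literals are subtypes of p
    "ba": "p",
    "xiang": "p",
    "dao": "p",
    "dui": "p",
    "p": "category",
    "n": "category",
    "v": "category",
    "a": "category",
}


def subsumes(general: str, specific: str) -> bool:
    """True if 'specific' is the same type as, or a subtype of, 'general'."""
    current = specific
    while current is not None:
        if current == general:
            return True
        current = PARENT.get(current)  # None once we step past a root type
    return False
```

Under such a hierarchy, a constraint stated as p is satisfied by the literal ba without any separate P_FORM feature, and a constraint stated as animate is satisfied by a human argument.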

In the following, we give a brief description of how our grammar works. The grammar consists of several phrase structure rules and a lexicon with lexical entries and lexical rules. First, ALE compiles the grammar into a Prolog parser. During this process (at compile time), lexical rules are applied to lexical entries. In the case of transitive patterns, this means that one entry of chi will evolve into 10 entries. Please note that it is this expanded lexicon that is used for parsing (at run time).
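
The compile-time expansion can be pictured with the following Python sketch, which applies the ten informal rules of Section 3 to one verb. It is an illustration under stated assumptions rather than the ALE compiler: the function expand and its string encoding of patterns are hypothetical, and verb-specific restrictions beyond the two conditions stated in rules 2′, 3′ and 8 are ignored.

```python
from typing import List, Optional, Tuple


def expand(verb: str, constr1: str, constr2: str,
           prep: Optional[str]) -> List[Tuple[str, str, str]]:
    """Expand one transitive entry into (verb, form, reading) patterns."""
    distinct = constr1 != constr2  # condition of rules 2' and 8
    has_prep = prep is not None    # condition of rule 3'
    candidates = [
        ("NP1 V NP2", "SVO", True),                      # rule 1
        ("[NP1,constr1] [NP2,constr2] V", "SOV", distinct),  # rule 2'
        (f"NP1 [{prep} NP2] V", "SOV", has_prep),        # rule 3'
        ("NP2 ... [NP1,constr1] V", "OSV", True),        # rule 4
        ("NP2 [bei NP1] V", "OSV", True),                # rule 5
        ("V NP2", "VO", True),                           # rule 6
        ("[NP1,constr1] V", "SV", True),                 # rule 7
        ("[NP2,constr2] V", "OV", distinct),             # rule 8
        ("NP2 [bei V]", "OV", True),                     # rule 9
        ("V", "V", True),                                # rule 10
    ]
    return [(verb, form, reading) for form, reading, ok in candidates if ok]
```

In this sketch chi (animate, food, ba) expands to 10 patterns, while a verb like qingjiao (human, human, xiang) expands to 8, because the conditions of rules 2′ and 8 are not met.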

At the level of implementation, we do not need to presuppose an abstract transitive structure as the input of the lexical rules and generate 10 new entries from it for each transitive verb. What is needed is one pattern that serves as the basic pattern for the transitive structure, from which the other patterns are derived. In fact, we only need 4 lexical rules to derive the other 4 full patterns from 1 basic full pattern. Elliptical patterns can be handled more elegantly by means other than lexical rules.[2]

The basic pattern constitutes the common condition for lexical rules. Although in theory any one of the 5 full patterns can be seen as the basic pattern, the choice is not arbitrarily made. The pattern we chose is the valency preposition pattern (the BA-type construction) NP1 [P NP2] V: SOV (see Lexical rule 3′).[3] This is justified as follows. The valency preposition P (ba, xiang, dao, dui, etc.) is idiosyncratically associated with the individual verb. To derive a more general pattern from a specific pattern is easier than the other way round, for example,  NP1 [P NP2] V: SOV –> NP1 V NP2: SVO is easier than NP1 V NP2: SVO –> NP1 [P NP2] V: SOV. This is because we can then directly code the valency preposition under CATEGORY in the SUBCAT feature and do not have to design a specific feature to store this valency information.
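
The asymmetry between the two directions of derivation can be made concrete with a small Python sketch. It assumes, hypothetically, that a SUBCAT list is a list of dictionaries with CATEGORY and LEFT keys; the function to_svo is our own illustrative name.

```python
from typing import Dict, List


def to_svo(basic_subcat: List[Dict]) -> List[Dict]:
    """Derive NP1 V NP2: SVO from the basic pattern NP1 [P NP2] V: SOV.

    The subject is kept as is; the preposition-marked complement on the left
    is replaced by a plain NP on the right. The preposition literal is simply
    dropped, so no extra feature is needed to carry it.
    """
    subject, _prepositional_object = basic_subcat
    return [subject, {"CATEGORY": "np", "LEFT": False}]


# Basic entry for titie: the valency preposition dui is coded under CATEGORY.
titie_basic = [
    {"CATEGORY": "np", "LEFT": True},
    {"CATEGORY": "dui", "LEFT": True},
]
titie_svo = to_svo(titie_basic)
```

Going the other way would require recovering dui from an entry that no longer mentions it, which is why the more specific pattern is the more convenient starting point.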

 

  5. Summary

The ultimate aim for natural language analysis is to reach interpretation, i.e. to assign roles to the constituents. An old question is how syntax (form) and semantics (meaning) interact in this interpretation process. More specifically, which is a more important factor in Chinese analysis, the syntactic constraint or the semantic constraint? For the linguistic data we have investigated, it seems that sometimes syntax plays a decisive role and other times semantics has the final say. The essence is how to adequately handle the interface between syntax and semantics.

In our proposal, the syntactic constraint is seen as a more fundamental factor. It serves as the frame of reference for the semantic constraint. The involvement of the semantic constraint seems to be most naturally conditioned by syntactic patterns. In order to ensure their effective interaction, we accommodate syntax and semantics in one model.  The model is designed to be based on syntax and resorts to semantic information only when necessary. In concrete terms, the system will selectively enforce or waive the semantic constraint, depending on syntactic patterns.

It should be noted that there are other factors involved in reaching a correct interpretation. For example, in order to recover the omitted complements in elliptical patterns, information from discourse and pragmatics may be vital. We leave this for future research.

 

References

Carpenter, B. & Penn, G. (1994): ALE, The Attribute Logic Engine, User’s Guide, Version 2.0

Gao, Qian (1993): “Chinese BA-Construction: Its Syntax and Semantics”, OSU Working Papers in Linguistics 1993, Kathol A. & Pollard C. (eds.)

Huang, Xiuming (1987): “XTRA: The Design and Implementation of A Fully Automatic Machine Translation System”, Ph.D. dissertation.

Li, Audrey (1990): Chapter 6 “Passive, BA, and topic constructions”, Order & Constituency in Mandarin Chinese. Kluwer Academic Publishers

Li, Wei & McFetridge, Paul (1995): “Handling Chinese NP predicate in HPSG”, Proceedings of PACLING-II, Brisbane, Australia

Pollard, Carl  & Sag, Ivan A. (1994): Head-Driven Phrase Structure Grammar,  Centre for the Study of Language and Information, Stanford University, CA

Pollard, Carl  & Sag, Ivan A. (1987): Information-based Syntax and Semantics. Vol. 1: Fundamentals. Centre for the Study of Language and Information, Stanford University, CA

Wilks, Y.A. (1978): “Making Preferences More Active”,  Artificial Intelligence, Vol. 11

Wilks, Y.A. (1975): “A Preferential Pattern-Seeking Semantics for Natural Language Inference”, Artificial Intelligence, Vol. 6

~~~~~~~~~~~~

* This research is part of my Ph.D. project on a Chinese HPSG-style grammar, supported by the Science Council of British Columbia, Canada under a G.R.E.A.T. award (code: 61). I thank my supervisor Dr. Paul McFetridge for his supervision. He introduced me to HPSG theory and provided me with his sample grammars. Without his help, I would not have been able to implement the Chinese grammar in a relatively short time. Thanks also go to Prof. Dong Zhen Dong and Dr. Ping Xue for their comments and encouragement.

 

[1]               The other combinations are:

5d1) *      dianxin chi le wo.              OVS

5d2)         dianxin chi le wo.
The Dim Sum ate me.

Note:        It is OK with the 5d2) reading in the pattern NP V NP: SVO.

5e1) *      chi le wo dianxin.               VSO
5e2)         chi le wo dianxin.

(Somebody) ate my Dim Sum.

Note:        It is OK with the 5e2) reading in the pattern V [NP1 NP2]: VO where NP1 modifies NP2.

5f1) *      chi le dianxin wo.                 VOS
5f2)         chi le dianxin, wo.

Eaten the Dim Sum, I have.

Note:        It is OK in Spoken Chinese, with a short pause before wo, in a  pattern like V NP, NP: VOS.

[2]   The conventional configurational approach is based on the assumption that complements are obligatory and should be saturated. If saturation of complements were not taken as a precondition for a phrase, serious problems of structural overgeneration might arise. On the other hand, optionality of complement(s) is a real-life fact. Elliptical patterns are seen in many languages and are especially commonplace in Chinese. In order to ensure obligatoriness of complements, the lexical rule approach can be applied to elliptical patterns, as shown in Section 3. This approach maintains the configurational constraint in tree building to block structural overgeneration, but the cost is great: each possible elliptical pattern for a head will have to be accommodated by a new lexical entry. With the type mechanism provided by ALE, we have developed a technique to allow for optionality of complement(s) and still maintain a proper configurational constraint. We will address this issue in another paper, Configurational constraint in Chinese grammar.

[3]    This choice coincides with the base‑generated account of the BA construction in [Li, A. 1990], but that does not mean much. First, our so‑called basic pattern is not their D‑Structure. Second, our choice is based on more practical considerations. Their claim involves more theoretical arguments in the context of generative grammar.

 

 

[Related]

Handling Chinese NP predicate in HPSG (old paper)

Notes for An HPSG-style Chinese Reversible Grammar

Outline of an HPSG-style Chinese reversible grammar

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

PhD Thesis: Chapter VII Concluding Remarks

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

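【Startup Notes: Anna Resigns】

Anna is a lovely, ambitious young Russian woman. She played the piano and danced ballet from childhood, and emigrated to the United States with her parents before finishing elementary school. She is tall and graceful, gentle in temperament, poised in manner and considerate, and she gives the impression of being classical but not stuffy, modern but not gaudy, sunny and romantic. As everyone knows, although Russian matrons tend to be stout and coarse-featured, Russian girls are often captivating. The ones that old-timers like me know by heart and never forget include Tonya, the bourgeois young lady in How the Steel Was Tempered, the ballet queen Ulanova, and the peerless figure-skating artist Ekaterina Gordeeva. Anna is just such a Russian girl, there beside us every day, bringing a warm and gentle air to an office full mostly of boys. Naturally, everyone likes her.

However, Anna has resigned and will be leaving soon, and nobody wants to see her go. I feel bad about it too: lunch will no longer come with her chatter and laughter, and I can no longer invite her to play ping-pong afterwards, which leaves me with a sense of loss. I asked her whether she really had to leave: didn't you say you liked this environment? You know this office is already too crowded with boys, and we are trying to change this situation, trying to find some girls with affirmative action, and you are leaving?

She replied: I like this environment because everyone I deal with here is, like you, among the smartest people in the world. But because you are all so smart, my own path for growth has been blocked, so I had to make the painful decision to leave. I had better go to a consulting company and do the analysis work I am good at. Over these two years I have watched with my own eyes how my 20 hours of manual work got replaced by your 20 seconds of fully automatic search, with results that are often better, more complete and more consistent than the manual ones.

What she said is true. It was indeed the technology transfer that took her job away. The company did not want to let her go and decided to move her into online customer service, but after thinking it over she felt that, young as she is, she should not give up her specialty, and so she decided to leave.

As the technical lead, I am directly tied to her departure. This is a living example of machines replacing human labor.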

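When I joined the company two years ago, it was basically a professional service company. It had developed a system for internal use, but the system's output only narrowed down the scope of the manual work: a long period of post-editing (manual additions, deletions and fixes, plus analysis and summarization) was needed before anything could be delivered to a client. We call the editors information analysts. The job requires strong language skills, the ability to read and comprehend ten lines at a glance, and skill in analysis and synthesis. Anna was one of the best of the information analysts. Clients were especially satisfied with the analysis reports that passed through her hands.

But a company has to do its cost accounting. The outcome of the accounting was that manual labor is fine in moderation, but beyond that costs exceed income and the business loses money. At the time, each search-analysis order took an average of 22 hours of manual work to complete. These 22 hours were called pain time (pain for the analysts, and even more so for the company). To make money, pain time ideally had to be kept under two hours, which at the time sounded like a fantasy. When the boss talked to me, he set this as the main goal but did not set a deadline, because nobody knew whether it was feasible or how many resources it would take to get there. I did not know either; I only felt the weight of the burden. In my previous jobs the order had always been research first, then a prototype engine, then a search for an application area, and finally product development. This company was the exact opposite of most technology startups: it had customers first, then a rough engine, and only at the end brought in talent and technology, pinning its hopes on rapid technology transfer. This path felt fresh and exciting to me, and I thought it worth a try, to see whether my technology transfer skills could come into their own like a fish in water. The advantage of having customers and an application area first is obvious: like the guiding light of the Zunyi Conference for the communist cause, it spares you the long groping in the dark. The road ahead was bright; it only remained to see how to walk it so as to make money.

To cut a long story short: after I took over, the core of the system was replaced within three months, results improved markedly within half a year, and by the first anniversary the pain time of manual work had been cut to under two hours. The boss was overjoyed.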

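Human desire is never satisfied, like a snake trying to swallow an elephant. The boss told me: Wei, you know, your technology has revolutionized our business. Our survival is no longer in question; if we wanted, we could maintain a machine-plus-human service and soon grow into a company with tens of millions in annual revenue. But as long as there is manual work, we cannot scale up, the money we make is limited, and the business cannot grow big. I know you are an ambitious man (I said to myself: you are not the fish, how would you know), and you surely will not be satisfied with small-time stuff. Whatever the risk, we have decided to abandon that road and go fully automatic, so that the system can serve all analyst customers, rather than only our in-house staff (people like Anna) or specially trained power users. Our goal is for every analyst in the world to be unable to do without us, just as nobody can do without Google. For this we must bring pain time down to zero. It is a risky move, but the prospects are limitless.

Wow, with that tone he was already dreaming of world domination. America is an interesting place: this land produces indomitable entrepreneurial dreamers whose ambitions reach the sky. But America is no paradise for dreamers. 95% of them perish, fewer than 5% survive, and no more than 1% finally make it big. Truly, one general's success is built on ten thousand bones. Even so, American-made entrepreneurial dreamers keep coming, wave after wave, without end. I actually like these dreamers a lot; their tenacity and passion are infectious.

Another year went by. In one major analysis area we achieved the goal of eliminating pain time completely (pain time 0), taking search analysis from 22 hours of manual work two years ago to 20 seconds of fully automatic, while-you-wait processing today, with no manual editing at all.

What is gained on one side is lost on the other: two years of hard fighting brought achievements beyond everyone's expectations, but also cost us a lovely Russian girl.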

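【Notes from a Second Startup】 Written in April 2008

【Postscript】 There is one more little episode about Anna. As everyone knows, people at startups love to dream and count their chickens, and stock options are the dream-inducing drug.

One day, while the guys at the company were counting their chickens for fun as usual, Anna said to me: Wei, come here, I got something to show you. I walked over and took a look: it was a car. She said to me, word by word: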

I like this car. I just love it. It is my dream car. I want to buy it.
Guys, work hard so I can own this car.

及至仔细一看价码,吓了一个筋斗,百万以上,她可真敢想啊,乖乖隆的东,here it is:

http://abcnews.go.com/GMA/Moms/story?id=1406161

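Related posts:

[One Parse a Day: If Not Me, Then Who? And Who Am I?]

Last night's famous line:
【中秋,混得好的是花前月下,混得一般的是月下花钱,混得最差的是花下月的钱,混得最好的是钱下月花。】 (At Mid-Autumn, those doing well are "before the flowers, under the moon"; those doing so-so are "spending money under the moon"; those doing worst are "spending next month's money"; those doing best have "money to spend next month".)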

0916a

0916b

Parsed almost perfectly, except for one flaw: a separable word whose two parts were not linked up. Compare:

0916d

Put together, it becomes dazzling. This is no ordinary graph, and it is quite unlike most syntactic trees:

0916c

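I might as well show off the parses from the day before yesterday too. There is no absolute standard for Chinese deep parsing, but linguists do carry a scale in their hearts: as to whether a parse is sound, insiders see the substance while outsiders watch the show. The feeling is somewhat uncanny and thrilling. On one hand, it feels like walking a road no one has walked before, full of a pioneer's solemnity and pride. On the other hand, it also feels like something fated, a mandate from heaven: if not me, then who? And who am I? If language is the carrier and expression of thought (presentation), then parsing is the formalized machine display of thought (representation), and I am the messenger linking the two. Thank God that, in creating language as a riddle, He did not forget to leave the key behind.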

0915a

0915b

0915c

0915d

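Yes: [What humans find hardest to comprehend is the machine's ability to analyze the structure of human language]. That machines can match the human ability to analyze linguistic structure is no longer in suspense. The part of understanding that machines find hard to reach can be handled through human-machine collaboration; that scene lies in the not-too-distant future and is already vivid before our eyes. Let us get ready to embrace this new era of human-machine fusion.

Master Hong has a verse:
The butcher's art of carving the ox lies in language, honed inside Master Wei's parser. The good blade is kept hidden deep in the mountains, yet in truth it can cut through any tangle.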


 

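[Looking Back at My PhD Scribbles: An Attempt to Bring Common Sense into Grammar]

As I said last time, the great majority of parsers represent a predicate's subcat very crudely, with no room to stretch: most simply treat subcat as a code and then pin down the pattern in the corresponding subcat rules. The lexicon-driven grammar HPSG, by contrast, can be meticulous and well motivated. It can describe subcat patterns in detail directly in the lexicon, and it can give a linguistically principled representation of the mapping between the syntactic-semantic input (the conditions of the pattern) and the output (the logical semantics).

Crudeness has its engineering considerations and reasons; elaborate layering has its logical elegance. You cannot have both, and in the end we lean toward the crude approach. Even so, those who take the quick and crude route benefit greatly from having experienced the elegance of structural representation. At the very least they will not be misled by the crude surface, and when facing complex linguistic phenomena they can gradually escape the makeshift inadequacy of the crude approach.

Recently I reread the scribbled papers of my PhD years. The insight into Chinese syntax they reflect is not outstanding, but thanks to the structural richness of HPSG, the application of subcat to Chinese grammar is laid out in an orderly way and has stood the test of time. I studied HPSG with real dedication back then and digested it thoroughly. Precisely because I had digested it thoroughly, there were no lingering attachments when I later moved beyond it.

For example, the phenomenon of Chinese NPs that carry a slot was modeled as follows: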

11a)     桌子坏了。
11b)     腿坏了。
11c)     桌子的腿坏了。
12a)     他好。
12b)     身体好。
12c)     他的身体好。

When people say 11b) tui huai le (leg went wrong), we know something (the possessor) is omitted. For 11a), however, we have no such feel of incompleteness.

Although we may also ask whose table, this possessive relation between who and table is by no means innate. Similarly, ta (he) in 12a) is a complete notion denoting someone while shenti (body) in 12b) is not. In 11c) and 12c), the possessor appears in the possessive structure DE-construction, the expectation of tui (leg) and shenti (body) is realized. These examples show that some words (concepts) have conceptual expectation for some other words (concepts) although the expected words do not necessarily show up in a sentence and the expectation might not be satisfied. In fact, this type of expectation forms part of our knowledge (common sense). One way to represent the knowledge is to encode it with the related word in the lexicon.
Therefore we propose an underlying SYNSEM feature KNOWLEDGE to store some of our common sense knowledge by capturing the internal relation between concepts. KNOWLEDGE parallels to syntactic SUBCAT and semantic RELATION. KNOWLEDGE imposes semantic constraints on their expected arguments no matter what syntactic forms the arguments will take (they may take null form, i.e. the underlying arguments are not realized). In contrast, SUBCAT only defines syntactic requirement for the complements and gets interpreted in RELATION. Following this design, syntactic form and semantic constraints are kept apart. When necessary, the interaction between them can be implemented by lexical rules, or directly coindexed in the lexicon. For example, the following KNOWLEDGE information will be enforced as the necessary semantic constraints when we handle Chinese NP predicates by a lexical rule (see 3.3).

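Sneaking common sense into the grammar through the back door started right then. This practice is uncommon in formal grammars of European languages, because syntactic form is by and large sufficient there and common sense is usually not needed. For Chinese, however, it is very hard to build a mature deep-analysis system without bringing in some kind of common sense. The syntactic structure model carrying common sense was defined as follows: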

PHON      shenti
SYNSEM | KNOWLEDGE | PRED [1] possess
SYNSEM | KNOWLEDGE | POSSESSOR [2] human
SYNSEM | KNOWLEDGE | POSSESSED [3]
SYNSEM | LOCAL | CONTENT | INDEX [3]
SYNSEM | LOCAL | CONTENT | RESTRICTION { RELATION body }
SYNSEM | LOCAL | CONTENT | RESTRICTION { INSTANCE [3] }

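Finally, the introduction of common sense into Chinese grammar was viewed as a natural extension of the agreement in gender, number, and case that European languages exploit: an extension from syntactic means to semantic constraints.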

Agreement revisited
This section relates semantic constraints which embody common sense to the conventional linguistic notion of agreement. We will show that they are essentially the same thing from different perspectives. We only need slight expansion for the definition of agreement to accommodate some of our basic knowledge. This is important as it accounts for the feasibility of coding knowledge in linguistic ways. Linguistic lexicon seems to be good enough to house some general knowledge in addition to linguistic knowledge.

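To parse “我鸡吃” (wo ji chi, "I chicken eat") and “鸡我吃” (ji wo chi, "chicken I eat"), common sense entered the grammar (nowadays common sense can also be brought in by way of big data):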

A typical example of how concepts are linked in a network (a sophisticated concept lexicon) is seen in the representation of drink ((*ANI SUBJ) (((FLOW STUFF) OBJE) ((SELF IN) (((*ANI (THRU PART)) TO) (BE CAUSE))))) in Wilks 1975b. While for  various reasons we will not go as far as Wilks, we can gain enlightenment from this type of AI approach to knowledge. Lexicon-driven systems like the one in HPSG can, of course, make use of this possibility. Take the Chinese role-assignment problem, for example, the common sense that ANIMATE being eats FOOD can be seamlessly incorporated in the lexical entry chi (eat) as a semantic agreement requirement.

PHON chi
SYNSEM | KNOWLEDGE | PRED [1]  eat
SYNSEM | KNOWLEDGE | AGENT [2] animate
SYNSEM | KNOWLEDGE | PATIENT [3] food
SYNSEM | LOCAL | CATEGORY | SUBCAT | EXTERNAL_ARGUMENT [NP: [4]]
SYNSEM | LOCAL | CATEGORY | SUBCAT | INTERNAL_ARGUMENTS <[NP: [5]]>
SYNSEM | LOCAL | CONTENT | RELATION [1]
SYNSEM | LOCAL | CONTENT | EATER [4] | INDEX | ROGET [2]
SYNSEM | LOCAL | CONTENT | EATEN [5] | INDEX | ROGET [3]

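As we can see, what looks like nothing more than a subcat code after POS subdivision actually contains a great deal of structure and the knowledge embedded in it. Today, when unification grammars have almost become a relic of history, I still think that a representation like HPSG's is among the most elegant logical representations in linguistics. In logical clarity and beauty, later grammars can hardly surpass it.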

 


 

Handling Chinese NP predicate in HPSG (old paper)

Handling Chinese NP predicate in HPSG
(old paper in Proceedings of the Second Conference of the Pacific
Association for Computational Linguistics, Brisbane, 1995)

Wei Li & Paul McFetridge

Department of Linguistics
Simon Fraser University
Burnaby, B.C. CANADA  V5A 1S6

 

Key words: HPSG; knowledge representation; Chinese processing

 

Abstract 

This paper addresses a type of Chinese NP predicate in the framework of HPSG 1994 (Pollard & Sag 1994). The special emphasis is laid on knowledge representation and the interaction of syntax and semantics in natural language processing. A knowledge based HPSG model is designed. This design not only lays a foundation for effectively handling Chinese NP predicate problem, but has theoretical and methodological significance on NLP in general.

In Section 1, the data are analyzed. Both structural and semantic constraints for this pattern are defined. Section 2 discusses the semantic constraints in the wider context of the conceived knowledge-based model. The aim of natural language analysis is to reach interpretations, i.e. correctly assigning semantic roles to the constituents. We indicate that without being able to resort to some common sense knowledge, some structures cannot get interpreted. We present a way on how to organize and utilize knowledge in HPSG lexicon. In Section 3, a lexical rule for this pattern is proposed in our HPSG model for Chinese, whose prototype is being implemented.

  1. Problem

We will show the data of Chinese NP predicate first. Then we will investigate what makes it possible for an NP to behave like a predicate. We will do this by defining both the syntactic and semantic constraints for this Chinese pattern.

1.1. Data: one type of Chinese NP predicate

1) 他好身体。

ta         hao      shenti.
he        good    body
He is of good health.

2)  张三高个子。

Zhangsan         gao      gezi.
Zhangsan         tall       figure
Zhangsan is tall.

3)  李四圆圆的脸。

Lisi      yuanyuan         de        lian.
Lisi      round-round    DE       face.
Lisi has a quite round face.

4) 这件大衣红颜色。

zhe       jian      dayi     hong    yanse.
this      (cl.)      coat     red       colour.
This coat is of red colour.

5)  明天小雨。

mingtian          xiao     yu.
tomorrow        little     rain.
Tomorrow it will drizzle.

6)  那张桌子三条腿。

na        zhang   zhuozi san       tiao      tui.
that      (cl.)      table   three    (cl.)      leg
That table is three-legged.

Note:      (cl.) for classifier.
DE for Chinese attribute particle.

The relation between the subject NP and the predicate NP is not identity. The NP predicate in Chinese usually describes a property the subject NP has, corresponding to English be-of/have NP. In identity constructions, the linking verb SHI (be) cannot normally be omitted.[1]

7a)  他是学者。

ta         shi        xuezhe.
he        be        scholar
He is a scholar.

7b) ?他学者。

ta         xuezhe.
he        scholar

1.2.  Problem analysis

1.2.1. We first investigate the structural characteristics of the Chinese NP predicate pattern.

A single noun cannot act as predicate. More restrictively, not every NP can become a predicate. It seems that only the NP with the following configuration has this potential: NP [lex -, predicate +].  In other words, a predicate NP consists of a lexical N with a modifying sister. Structures of this sort should not be further modified.[2] Thus, the following patterns are predicted.

8a)      那张桌子三条腿。

na        zhang   zhuozi san       tiao      tui.                   [ same as 6) ]
that      (cl.)      table    three    (cl.)      leg
That table is three-legged.

8b)       那张桌子塑料腿。

na        zhang   zhuozi suliao   tui.
that      (cl.)      table    plastic leg
That table is of plastic legs.

8c) * 那张桌子三条塑料腿。
*    na        zhang   zhuozi san       tiao      suliao   tui.       [too many attributes]

8d) * 那张桌子腿。
*    na        zhang   zhuozi tui.                                           [no attributes]
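
The structural constraint just described can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the `NounPhrase` class and the `can_be_predicate` function are hypothetical names, not part of the grammar implementation, and the check captures only the rough approximation stated above (a lexical N with exactly one modifying sister).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NounPhrase:
    """A toy NP: a lexical head noun plus its modifying sisters (each one an XP)."""
    head: str
    modifiers: List[str] = field(default_factory=list)


def can_be_predicate(np: NounPhrase) -> bool:
    """NP [lex -, predicate +]: a lexical N with exactly one modifying sister."""
    return len(np.modifiers) == 1


# The four patterns 8a) to 8d), with pinyin stand-ins for the constituents.
examples = {
    "8a": NounPhrase("tui", ["san tiao"]),            # three (cl.) leg
    "8b": NounPhrase("tui", ["suliao"]),              # plastic leg
    "8c": NounPhrase("tui", ["san tiao", "suliao"]),  # too many attributes
    "8d": NounPhrase("tui"),                          # no attributes
}

for label, np in examples.items():
    print(label, can_be_predicate(np))
```

The sketch deliberately ignores the exceptions with yi (one) discussed in footnote [2].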

1.2.2. What is the semantic constraint for the Chinese NP predicate pattern?

Although there is no syntactic agreement between subject and predicate in Chinese, there is an obvious semantic “agreement” between the two: hao shenti (good body) requires a HUMAN as its subject; san tiao tui (three leg) demands that the subject be FURNITURE or ANIMATE. Therefore, the following are unacceptable:

9) * 这杯茶好身体。

* zhe       bei       cha       hao      shenti.
this      cup      tea       good    body

10) * 空气三条腿。

* kongqi san       tiao      tui.
air        three    (cl.)      leg

Obviously, it is not hao (good) or san tiao (three) that imposes this semantic selection on the subject. The semantic restriction comes from the noun shenti (body) or tui (leg). There is an internal POSSESS relationship between them: shenti (body) belongs to human beings and tui (leg) is one part of an animal or some furniture. This common sense relation is a crucial condition for the successful interpretation of the Chinese NP predicate sentences.

There are a number of issues involved here. First, what is the relationship of this type of knowledge to the syntactic structures and semantic interpretations? Second, where and how would this knowledge be represented? Third, how will the system use the knowledge when it is needed? More specifically, how will the introduction of this knowledge coordinate with the other parts of the well established HPSG formalism? Those are the questions we attempt to answer before we proceed to provide a solution to the Chinese NP predicate. Let us look at some more examples:

11a)     桌子坏了。

zhuozi huai     le.
table    bad      LE
The table went wrong.

11b)     腿坏了。

tui        huai     le.
leg       bad      LE
The leg went wrong.

11c)     桌子的腿坏了。

zhuozi  de        tui        huai     le.
table    DE       leg       bad      LE
The table’s leg went wrong.

12a)     他好。

ta         hao.
he        good
He is good.

12b)     身体好。

shenti   hao.
body    good
The health is good.

12c)     他的身体好。

ta         de        shenti   hao.
he        DE       body    good
His health is good.

note: LE for Chinese perfect aspect particle.

When people say 11b) tui huai le (leg went wrong), we know something (the possessor) is omitted. For 11a), however, we have no such feel of incompleteness. Although we may also ask whose table, this possessive relation between who and table is by no means innate. Similarly, ta (he) in 12a) is a complete notion denoting someone while shenti (body) in 12b) is not. In 11c) and 12c), the possessor appears in the possessive structure DE-construction, the expectation of tui (leg) and shenti (body) is realized. These examples show that some words (concepts) have conceptual expectation for some other words (concepts) although the expected words do not necessarily show up in a sentence and the expectation might not be satisfied. In fact, this type of expectation forms part of our knowledge (common sense). One way to represent the knowledge is to encode it with the related word in the lexicon.

Therefore we propose an underlying SYNSEM feature KNOWLEDGE to store some of our common sense knowledge by capturing the internal relation between concepts. KNOWLEDGE parallels syntactic SUBCAT and semantic RELATION. KNOWLEDGE imposes semantic constraints on the expected arguments no matter what syntactic forms the arguments will take (they may take null form, i.e. the underlying arguments are not realized). In contrast, SUBCAT only defines the syntactic requirement for the complements and gets interpreted in RELATION. Following this design, syntactic form and semantic constraints are kept apart. When necessary, the interaction between them can be implemented by lexical rules, or directly coindexed in the lexicon. For example, the following KNOWLEDGE information will be enforced as the necessary semantic constraints when we handle Chinese NP predicates by a lexical rule (see 3.3).

PHON      shenti
SYNSEM | KNOWLEDGE | PRED [1] possess
SYNSEM | KNOWLEDGE | POSSESSOR [2] human
SYNSEM | KNOWLEDGE | POSSESSED [3]
SYNSEM | LOCAL | CONTENT | INDEX [3]
SYNSEM | LOCAL | CONTENT | RESTRICTION { RELATION body }
SYNSEM | LOCAL | CONTENT | RESTRICTION { INSTANCE [3] }
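
As a rough illustration of how such KNOWLEDGE information might be stored and checked, consider the following Python sketch. Everything in it is an assumption made for the example: the `ISA` mini-hierarchy, the dictionary layout of `LEXICON`, the ROGET class given to tui, and the helper names `is_a` and `possessor_ok`. It is not the ALE/Prolog encoding; it only shows the idea that the expected possessor is recorded with the noun and tested against the semantic class of the subject, as in 9) and 10).

```python
# Hypothetical mini-hierarchy of semantic classes: each class maps to its parent.
ISA = {
    "human": "animate", "animal": "animate", "animate": "entity",
    "furniture": "entity", "liquid": "entity", "gas": "entity",
}

# Toy lexicon. KNOWLEDGE is present only for nouns that innately expect a possessor.
LEXICON = {
    "shenti": {"ROGET": "organ",
               "KNOWLEDGE": {"PRED": "possess", "POSSESSOR": ["human"]}},
    "tui":    {"ROGET": "part",  # assumed class, for illustration only
               "KNOWLEDGE": {"PRED": "possess", "POSSESSOR": ["animate", "furniture"]}},
    "ta":     {"ROGET": "human"},
    "zhuozi": {"ROGET": "furniture"},
    "cha":    {"ROGET": "liquid"},
    "kongqi": {"ROGET": "gas"},
}


def is_a(cls, target):
    """True if cls equals target or inherits from it in the ISA hierarchy."""
    while cls is not None:
        if cls == target:
            return True
        cls = ISA.get(cls)
    return False


def possessor_ok(subject, noun):
    """Check whether the subject satisfies the possessor expectation of the noun."""
    knowledge = LEXICON[noun].get("KNOWLEDGE")
    if knowledge is None:
        return False  # the noun carries no innate possessor expectation
    subject_class = LEXICON[subject]["ROGET"]
    return any(is_a(subject_class, wanted) for wanted in knowledge["POSSESSOR"])


print(possessor_ok("ta", "shenti"))    # ta hao shenti
print(possessor_ok("cha", "shenti"))   # 9) zhe bei cha hao shenti
print(possessor_ok("kongqi", "tui"))   # 10) kongqi san tiao tui
```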

  2. Agreement revisited

This section relates semantic constraints which embody common sense to the conventional linguistic notion of agreement. We will show that they are essentially the same thing from different perspectives. We only need slight expansion for the definition of agreement to accommodate some of our basic knowledge. This is important as it accounts for the feasibility of coding knowledge in linguistic ways. Linguistic lexicon seems to be good enough to house some general knowledge in addition to linguistic knowledge. Some possible problems with this knowledge-based approach are also discussed.

Let’s first consider the following two parallel agreement problems in English:

13) *    The boy drink.

14) ?    The air drinks.

13) is ungrammatical because it violates the syntactic agreement between the subject and predicate. 14) is conventionally considered grammatical although it violates the semantic agreement between the agent and the action. Since the approach taken in this paper is motivated by semantic agreement, some elaboration and comment on agreement seem to be needed.

The agreement in person, gender and number are included in CONTENT | INDEX features (Pollard & Sag 1994, Chapter 2). It follows that any two signs co-indexed naturally agree with each other. That is desirable because co-indexed signs refer to the same entity. However, person, gender and number seem to be only part of the story of agreement. We may expand the INDEX feature to cope with the semantic agreement for handling Chinese and for in-depth semantic analysis for other languages as well.

Note that to accommodate semantic agreement in HPSG, we first need features to represent the result of semantic classification of lexical meanings like HUMAN, FOOD, FURNITURE, etc. We therefore propose a ROGET feature (named after the thesaurus dictionary) and put it into the INDEX feature.

Semantic agreement, termed sometimes as semantic constraint or semantic selection restriction in literature, is not a new conception in natural language processing. Hardly any in-depth language analysis can go smoothly without incorporating it to a certain extent. For languages like Chinese with virtually no inflection, it is more important. We can hardly imagine how the roles can be correctly assigned without the involvement of semantic agreement in the following sentences of the form NP1 NP2 Vt:

15a)     点心我吃了。

dianxin            wo       chi       le.
Dim-Sum         I           eat       LE
The Dim Sum I have eaten.

15b)     我点心吃了。

wo       dianxin            chi       le.
I           Dim-Sum         eat       LE
I have eaten the Dim Sum.

Who eats what?  There is no formal way but to resort to semantic agreement enforced by eat to correctly assign the roles. In HPSG 1994, it was pointed out (Pollard & Sag 1994, p81), “… there is ample independent evidence that verbs specify information about the indices of their subject NPs. Unless verbs ‘had their hands on’ (so to speak) their subjects’ indices, they would be unable to assign semantic roles to their subjects.” The Chinese data show that sometimes verbs need to have their hands on the semantic categories (ROGET) of both their external argument (subject) and internal arguments to be able to correctly assign roles. Now that we have expanded the INDEX feature to cover both ROGET and the conventional agreement features number, person and gender, the above claim of Pollard and Sag becomes more general.

It is widely agreed that knowledge is bound to play an important role in natural language analysis and disambiguation. The question is how to build a knowledge-based system which is manageable. Knowledge consists of linguistic knowledge (phonology, morphology, syntax, semantics, etc.) and extra-linguistic knowledge (common sense, professional knowledge, etc.). Since semantics is based on lexical meanings, lexical meanings represent concepts and concepts are linked to each other in a way to form knowledge, we can well regard semantics as a link between linguistics and beyond-linguistics in terms of knowledge. In other words, some extra-linguistic knowledge may be represented in linguistic ways. In fact, lexicon, if properly designed, can be a rich source of knowledge, both linguistic and extra-linguistic. A typical example of how concepts are linked in a network (a sophisticated concept lexicon) is seen in the representation of drink ((*ANI SUBJ) (((FLOW STUFF) OBJE) ((SELF IN) (((*ANI (THRU PART)) TO) (BE CAUSE))))) in Wilks 1975b. While for  various reasons we will not go as far as Wilks, we can gain enlightenment from this type of AI approach to knowledge. Lexicon-driven systems like the one in HPSG can, of course, make use of this possibility. Take the Chinese role-assignment problem, for example, the common sense that ANIMATE being eats FOOD can be seamlessly incorporated in the lexical entry chi (eat) as a semantic agreement requirement.

PHON chi
SYNSEM | KNOWLEDGE | PRED [1]  eat
SYNSEM | KNOWLEDGE | AGENT [2] animate
SYNSEM | KNOWLEDGE | PATIENT [3] food
SYNSEM | LOCAL | CATEGORY | SUBCAT | EXTERNAL_ARGUMENT [NP: [4]]
SYNSEM | LOCAL | CATEGORY | SUBCAT | INTERNAL_ARGUMENTS <[NP: [5]]>
SYNSEM | LOCAL | CONTENT | RELATION [1]
SYNSEM | LOCAL | CONTENT | EATER [4] | INDEX | ROGET [2]
SYNSEM | LOCAL | CONTENT | EATEN [5] | INDEX | ROGET [3]

Note:        Following the convention, the part after the colon is SYNSEM | LOCAL | CONTENT information.
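
To make the role-assignment idea concrete for sentences 15a) and 15b), here is a minimal Python sketch. The names `ISA`, `ROGET`, `CHI`, `is_a`, and `assign_roles` are hypothetical and chosen for this illustration; the sketch simply tries both possible assignments for the form NP1 NP2 Vt and keeps the ones that satisfy the semantic agreement recorded with chi (eat).

```python
# Assumed semantic classes for the two nouns in 15a) and 15b).
ISA = {"human": "animate", "animal": "animate"}
ROGET = {"wo": "human", "dianxin": "food"}

# The common sense carried by chi (eat): an ANIMATE being eats FOOD.
CHI = {"PRED": "eat", "AGENT": "animate", "PATIENT": "food"}


def is_a(cls, target):
    """True if cls equals target or inherits from it in the ISA hierarchy."""
    while cls is not None:
        if cls == target:
            return True
        cls = ISA.get(cls)
    return False


def assign_roles(np1, np2, verb=CHI):
    """Try both role assignments for NP1 NP2 Vt; keep those that agree semantically."""
    readings = []
    for eater, eaten in ((np1, np2), (np2, np1)):
        if is_a(ROGET[eater], verb["AGENT"]) and is_a(ROGET[eaten], verb["PATIENT"]):
            readings.append({"RELATION": verb["PRED"], "EATER": eater, "EATEN": eaten})
    return readings


print(assign_roles("dianxin", "wo"))  # 15a) dianxin wo chi le
print(assign_roles("wo", "dianxin"))  # 15b) wo dianxin chi le
```

In this sketch word order plays no part at all: both orders yield the same single reading, which is the point of the two examples.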

One last point we would like to make in this context is that semantic agreement, like syntactic agreement, should be able to loosen its restriction. In other words, agreement is just a canonical requirement, or in Wilks's term a preference (Wilks 1975a). In actual communication, deviation in different degrees is often seen, and people often relax the preference restriction in order to understand. With semantic agreement, deliberate deviation is one of the handy means to help render rhetorical expression. In a certain domain, Chomsky's famous sentence Colorless green ideas sleep furiously is well imaginable. On the other hand, a deviation in syntactic agreement will not affect the meaning if no confusion is caused, which may or may not happen depending on context and the structure of the language. In English, lack of syntactic agreement for the present third person singular between subject and predicate usually causes no problem. Sentence 13) The boy drink therefore can be accepted and correctly interpreted. There is much more to say on the interaction of the two types of agreement deviation, how a preference model might be conceived, what computational complexities it may cause and how to handle them effectively. We plan to address it in another paper. The interested reader is referred to one famous approach in this direction (Wilks 1975a, 1978).
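
One simple way to picture agreement as a preference rather than a hard filter is to count violations and keep the least deviant reading. The Python sketch below is our own toy illustration, not Wilks's algorithm and not part of the grammar described here; the names `ROGET`, `DRINK`, `violations`, and `preferred` and the assigned semantic classes are all assumptions.

```python
ISA = {"human": "animate", "animal": "animate"}
ROGET = {"boy": "human", "air": "gas", "water": "liquid"}  # assumed classes

# Preferences of drink: an ANIMATE agent and a LIQUID patient.
DRINK = {"AGENT": "animate", "PATIENT": "liquid"}


def is_a(cls, target):
    """True if cls equals target or inherits from it in the ISA hierarchy."""
    while cls is not None:
        if cls == target:
            return True
        cls = ISA.get(cls)
    return False


def violations(roles, prefs):
    """List the roles whose fillers deviate from the verb's preferences."""
    return [role for role, filler in roles.items()
            if not is_a(ROGET[filler], prefs[role])]


def preferred(candidates, prefs):
    """Pick the reading with the fewest violations instead of rejecting deviants."""
    return min(candidates, key=lambda roles: len(violations(roles, prefs)))


# 14) The air drinks: deviant, but still interpretable with one violation noted.
print(violations({"AGENT": "air"}, DRINK))

# With two competing readings, the one that respects the preferences wins.
print(preferred([{"AGENT": "water", "PATIENT": "boy"},
                 {"AGENT": "boy", "PATIENT": "water"}], DRINK))
```

A reading with violations is kept when nothing better is available, which is how a rhetorical or deviant sentence could still receive an interpretation.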

 

  3. Solution

We will set some requirements first and then present a lexical rule to see how well it meets our requirements.

3.1. Based on the discussion in Section 1, the solution to the Chinese predicate NP problem should meet the following 4 requirements:

(1)        It should enforce the syntactic constraints for this pattern: one and only one modifier XP in the form of NP1 XP NP2.

(2)        It should enforce the semantic constraints for this pattern: N2 must expect NP1 as its POSSESSOR with semantic agreement.

(3)        It should correctly assign roles to the constituents of the pattern: NP1 POSSESS NP2 (where NP2 consists of XP N2).

(4)        It should be implementable in HPSG formalism.

 

3.2. What mechanisms can we use to tackle a problem in HPSG formalism?

HPSG grammar consists of two components: a general grammar (ID schemata and principles) and a lexical grammar (in the lexicon). The lexicon houses lexical entries with their linguistic description and knowledge representation in feature structures. The lexicon also contains generalizations captured by inheritance of lexical hierarchy and by a set of lexical rules. Roughly speaking, lexical hierarchy covers static redundancy between related potential structures. Just because the lexicon can reflect different degrees of lexical redundancy in addition to idiosyncrasy, the general grammar can desirably be kept to minimum.

The Chinese NP predicate pattern should be treated in the lexicon. There are two arguments for that. First, this pattern covers only restricted phenomena (see 3.4). Second, it relies heavily on the semantic agreement, which in our model is specified in the lexicon by KNOWLEDGE. We need somehow to link the semantic expectation KNOWLEDGE and the syntactic expectation SUBCAT or MOD. The general mechanism to achieve that is structure sharing by coindexing the features either directly in the lexical entries (see the representation of the entry chi in Section 2) or through lexical rules (see 3.3).

3.3. Lexical Rule

Lexical rules are applied to lexical signs (words, not phrases) which satisfy the condition. The result of the application is an expanded lexicon to be used during parsing. Since the pattern is of the form NP1 XP N2, the only possible target is N2, i.e. shenti (body) or tui (leg). This is due to the fact that among the three necessary signs in this form, the first two are phrases and only the final N2 is a lexical sign. We assume the following structure for our proposed lexical rule:

NP[ta[1]]         [[AP[2] hao] [N<NP[1], XP[2]> shenti]]

NP Predicate Lexical Rule

SYNSEM | KNOWLEDGE | PRED [1] possess
SYNSEM | KNOWLEDGE | POSSESSOR [2]
SYNSEM | LOCAL | CATEGORY | HEAD | MAJ [6] n
SYNSEM | LOCAL | CATEGORY | PREDICATE –
SYNSEM | LOCAL | CONTENT | INDEX [4]
SYNSEM | LOCAL | CONTENT | RESTRICTION {[3]}

==>

…| CATEGORY | PREDICATE +
…| CATEGORY | SUBCAT | EXTERNAL_ARGUMENT [NP: [5]]
…| CATEGORY | SUBCAT | INTERNAL_ARGUMENTS < […| CATEGORY | HEAD | MOD [6] ] >
…| CATEGORY | SUBCAT | INTERNAL_ARGUMENTS < […| CONTENT | INDEX [4] ] >
…| CATEGORY | SUBCAT | INTERNAL_ARGUMENTS < […| CONTENT | RESTRICTION {[7]} ] >
…| CATEGORY | SUBCAT | INTERNAL_ARGUMENTS < […| LEX – ] >
…| CONTENT | RELATION [1] possess
…| CONTENT | POSSESSOR [5] | INDEX | ROGET [2]
…| CONTENT | POSSESSED | INDEX [4]
…| CONTENT | POSSESSED | RESTRICTION {[7] | [3] }

For complicated information flow like this, it is best to explain the indices one by one with regards to the example ta hao shenti (he is of good body) in the form of NP1 XP N2.

The index [1] links the underlying PRED feature of N2 to the semantic RELATION feature; in other words, the predicate in the underlying KNOWLEDGE of shenti (body) now surfaces as the relation for the whole sentence. The index [2] enforces the semantic constraint for this pattern, i.e. shenti (body) expects a human (ROGET) possessor as the subject (EXTERNAL_ARGUMENT) for this sentence. The index [3] is the restriction relation of N2. [4] links the INDEX features of XP and N2, and [6] indicates that the internal argument is a de-facto modifier of N2, i.e. XP mods-for N2. Note that the part of speech of the internal argument (INTERNAL_ARGUMENT | SYNSEM | LOCAL | CATEGORY | HEAD | MAJ) is deliberately not specified in the rule because Chinese modifiers (XP) are not confined to one class, as can be seen in our linguistic data. Finally, [7] defines the restriction relation of the XP to the INDEX of N2.

The indices [4], [7] and [3] all contribute to artificially creating a semantic interpretation for [XP N2]. As is interpreted, XP is, in fact, a modifier of N2 and they would form an NP2, or [XP N2] constituent. In normal circumstances, the building of NP2 interpretation is taken care of by HPSG Semantics Principle. But in this special pattern, we have treated XP as a complement of N2, yet semantically they are still understood as one instance: hao shenti (good body) is an instance of good and body. This interpretation of NP2 serves as POSSESSED of the sentence predicate, indicated by the structure-sharing of [4], [7] and [3]. Finally, [5] is the interpretation of NP1 and is assigned the role of POSSESSOR for the sentence predicate.

Let’s see how well this lexical rule meets the 4 requirements set in 3.1.

(1) It enforces the syntactic constraints by treating XP as the internal argument and NP1 as the external argument.

(2) It enforces the semantic constraints through structure sharing by the index [2].

(3) It correctly assigns roles to the constituents of the pattern.

The following interpretation will be established for ta hao shenti (he is of good body) by the parser.

CONTENT | RELATION possess
CONTENT | POSSESSOR | INDEX | PERSON 3
CONTENT | POSSESSOR | INDEX | NUMBER singular
CONTENT | POSSESSOR | INDEX | GENDER male
CONTENT | POSSESSOR | INDEX | ROGET human
CONTENT | POSSESSOR | RESTRICTION { }
CONTENT | POSSESSED | INDEX [1] | PERSON 3
CONTENT | POSSESSED | INDEX [1] | NUMBER singular
CONTENT | POSSESSED | INDEX [1] | GENDER nil
CONTENT | POSSESSED | INDEX [1] | ROGET organ
CONTENT | POSSESSED | RESTRICTION { [ RELATION good, INSTANCE [1] ], [ RELATION body, INSTANCE [1] ] }

In prose, it says roughly that a third person male human he possesses something which is an instance of good body. We believe that this is the adequate interpretation for the original sentence.

(4) Last, this rule has been implemented in our Chinese HPSG-style grammar using ALE and Prolog.  The results meet our objective.
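
For readers who want a feel for the information flow without the full feature-structure machinery, the following Python sketch mimics the lexical rule with plain dictionaries. It is a simplified illustration and not the ALE/Prolog implementation mentioned above: the names `SHENTI`, `TA`, `CHA`, `np_predicate_rule`, and `interpret` are hypothetical, structure sharing is replaced by copying values, and semantic agreement is reduced to an equality test on ROGET.

```python
import copy

# Base lexical entry for shenti (body), flattened from the feature structure above.
SHENTI = {
    "PHON": "shenti",
    "KNOWLEDGE": {"PRED": "possess", "POSSESSOR": "human"},
    "CATEGORY": {"HEAD": {"MAJ": "n"}, "PREDICATE": False},
    "CONTENT": {"INDEX": {"ROGET": "organ"}, "RESTRICTION": ["body"]},
}

TA = {"PHON": "ta", "INDEX": {"PERSON": 3, "NUMBER": "singular",
                              "GENDER": "male", "ROGET": "human"}}
CHA = {"PHON": "cha", "INDEX": {"ROGET": "liquid"}}  # assumed class for tea


def np_predicate_rule(noun):
    """Derive a predicate entry from a non-predicate noun that expects a possessor."""
    category = noun["CATEGORY"]
    if category["HEAD"]["MAJ"] != "n" or category["PREDICATE"]:
        return None
    if noun.get("KNOWLEDGE", {}).get("PRED") != "possess":
        return None
    derived = copy.deepcopy(noun)  # the base entry stays in the lexicon unchanged
    derived["CATEGORY"]["PREDICATE"] = True
    derived["CATEGORY"]["SUBCAT"] = {
        # index [2]: the subject must match the expected possessor class
        "EXTERNAL_ARGUMENT": {"ROGET": noun["KNOWLEDGE"]["POSSESSOR"]},
        # index [6] and LEX -: exactly one phrasal modifier of a noun
        "INTERNAL_ARGUMENTS": [{"MOD": "n", "LEX": False}],
    }
    return derived


def interpret(subject, modifier_relation, predicate_noun):
    """Build the CONTENT for NP1 XP N2, or return None if agreement fails."""
    wanted = predicate_noun["CATEGORY"]["SUBCAT"]["EXTERNAL_ARGUMENT"]["ROGET"]
    if subject["INDEX"]["ROGET"] != wanted:
        return None
    content = predicate_noun["CONTENT"]
    return {
        "RELATION": predicate_noun["KNOWLEDGE"]["PRED"],
        "POSSESSOR": {"INDEX": subject["INDEX"]},
        "POSSESSED": {"INDEX": content["INDEX"],
                      "RESTRICTION": [modifier_relation] + content["RESTRICTION"]},
    }


predicate_shenti = np_predicate_rule(SHENTI)
print(interpret(TA, "good", predicate_shenti))   # ta hao shenti
print(interpret(CHA, "good", predicate_shenti))  # 9) is rejected
```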

But there is one issue we have not touched yet, word order. At first sight, Chinese seems to have similar LP constraints as those in English. For example, the internal argument(s) of a Chinese transitive verb by default appear on the right side of the head. It seems that our formulation contradicts this constraint in grammar. But in fact, there are many other examples with the internal argument(s), especially PP argument(s), appearing on the left side of the head.

服务 fuwu (serve): <NP, PP(wei)>

16a) 为人民服务

wei      renmin fuwu
for       people  serve
Serve the people.

16b) ? 服务为人民。

fuwu    wei      renmin.
serve    for       people

有益 youyi (of benefit): <NP, PP(dui/yu)>

17a) 这对我有益。

zhe       dui       wo       youyi
this      to         I           have-benefit
This is of benefit to me.

17b) * 这有益对我。

zhe       youyi               dui       wo
this      have-benefit    to         I

18a) 这于我有益。

zhe       yu        wo       youyi
this      to         I           have-benefit
This is of benefit to me.

18b) 这有益于我。

zhe       youyi               yu        wo
this      have-benefit    to         I
This is of benefit to me.

Word order and its place in grammar are important issues in formulating Chinese grammar. To play it safe and avoid premature generalization, we assume a lexicalized view of Chinese LP constraints, encoding word order information in the lexicon through the SUBCAT and MOD features. This proves to be a realistic and precise approach to Chinese word order phenomena.
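
The lexicalized view of word order can be pictured as a small table keyed by head and preposition. The Python sketch below is only an assumed encoding for illustration (the `PP_ORDER` table and the `pp_position_ok` function are hypothetical names); the questionable 16b) is treated here as not licensed.

```python
# Which side of the head a PP argument may appear on, recorded per lexical head.
PP_ORDER = {
    ("fuwu", "wei"): {"left"},            # 16a) wei renmin fuwu
    ("youyi", "dui"): {"left"},           # 17a) zhe dui wo youyi
    ("youyi", "yu"): {"left", "right"},   # 18a) and 18b)
}


def pp_position_ok(head, preposition, side):
    """True if the lexicon licenses a PP with this preposition on this side of the head."""
    return side in PP_ORDER.get((head, preposition), set())


print(pp_position_ok("fuwu", "wei", "left"))    # 16a)
print(pp_position_ok("youyi", "dui", "right"))  # 17b)
print(pp_position_ok("youyi", "yu", "right"))   # 18b)
```

Because the information sits with each head, the contrast between dui and yu after youyi needs no general LP statement.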

3.4. As a final note, we will briefly compare the NP Predicate Pattern with one of the Chinese Topic Constructions:

NP1 NP2 Vi/A
(topic + (subject + predicate))

In Chinese, this is a closely related but much more productive form than this NP Predicate Pattern. And their structures are different.

19)       他身体好。

ta         shenti   hao
he        body    good
He is good in health.

For topic constructions, we propose a new feature CONTEXT | TOPIC, whose index in this case is token identical to the INDEX value of ta. Please be advised that in the above structure, the CONTEXT | TOPIC ta is considered as a sentential adjunct instead of a complement subcated-for by shenti. Why? First, ta is highly optional: topic-less sentence is still a sentence. Second, and more convincingly, ta cannot always be predicted by its following noun. Compare:

20a) 他身体好。

ta         shenti   hao
he        body    good
He is good in health.

20b) 他好身体。

ta         hao      shenti
he        good    body
He is of good health.

21a) 他脾气好。

ta         piqi                  hao
he        disposition       good
He is good in disposition.

21b)  他好脾气。

ta         hao      piqi
he        good    disposition
He is of good disposition.

but:

22a) 他学习好。

ta         xuexi   hao. [3]
he        study   good
He is good in study.

22b) *  他好学习。

ta         hao      xuexi
he        good    study

What this shows is that for topic sentences like ta shenti hao (He is good in health), ta xuexi hao (He is good in study), etc., there is no requirement to regard topic ta (he) as a necessary semantic possessor of shenti / xuexi, the relation is rather “in-aspect”: something (NP1) is good (A) in some aspect (NP2), or for something (NP1), some aspect (NP2) is good (A).

Finally, it needs to be mentioned that our proposed lexical rule requires modification to accommodate sentence 6). That is already beyond what we can reach in this paper because it is integrated with the way we handle Chinese classifiers in HPSG framework.

 

References

Pollard, Carl  & Sag, Ivan A. (1994): Head-Driven Phrase Structure Grammar,  Centre for the Study of Language and Information, Stanford University, CA

Pollard, Carl & Sag, Ivan A. (1987): Information‑based Syntax and Semantics Vol. 1: Fundamentals. Centre for the Study of Language and Information, Stanford University, CA

Wilks, Y.A. (1975a): A Preferential Pattern-Seeking Semantics for Natural Language Inference.  Artificial Intelligence, Vol. 6, pp.53-74.

Wilks, Y.A. (1975b): An Intelligent Analyzer and Understander of English, in Communications of the ACM, Vol. 18, No.5, pp.264-274

Wilks, Y.A. (1978): Making Preferences More Active.  Artificial Intelligence, Vol. 11, pp. 197-223

~~~~~~~~~~~~~~~ footnotes ~~~~~~~~~~~~~~~~

[1] This is not absolute, we do have the following examples:

Ia)          约翰是纽约人。

Yuehan shi           Niuyue                   ren
John       be            New-York              person
John is a New Yorker.

Ib)           约翰纽约人。

Yuehan  Niuyue                   ren.
John       New-York              person
John is a New Yorker.

IIa)         今天是星期天。

jintian    shi           xingqi-tian.
today     be            Sun-day
Today is Sunday.

IIb)         今天星期天。

jintian    xingqi-tian.
today     Sun-day
Today is Sunday.

It seems that the subject NP stands for some individual element(s), and the predicate NP describes a set (property) to which the subject belongs. But it is not clear how to capture Ib) and IIb) while excluding 7b). We leave this question open.

[2] We realize that the syntactic constraint defined here is only a rough approximation of the data from a syntactic angle. It seems to match most of the data, but there are exceptions when yi (one) appears in a numeral-classifier phrase:

IIIa)  他一副好身体。

ta            yi             fu            hao         shenti.
he            one         (cl.)         good       body
He is of good health. (He is of a good body.)

IIIb) * 他三副好身体。

ta            san          fu            hao         shenti
he            three       (cl.)         good       body

IIIc)   他好身体。

ta            hao         shenti.    [same as 1) ]

IVa) 李四一张圆圆的脸。

Lisi          yi             zhang     yuanyuan             de            lian.
Lisi          one         (cl.)         round-round         DE          face
Lisi has a quite round face.

IVb) * 李四两张圆圆的脸。

Lisi          liang       zhang     yuanyuan             de            lian.
Lisi          two         (cl.)         round-round         DE          face

IVc)  李四圆圆的脸。

Lisi          yuanyuan             de            lian.        [ same as 3) ]

[3] Another reading for 22a) is [S [S ta xuexi] [AP hao]], where ta xuexi is a subject clause: “That he studies is good”. This is another issue.

 


Notes for An HPSG-style Chinese Reversible Grammar

ABSTRACT

Key words: Chinese parsing, Chinese generation, reversible grammar,  HPSG

This paper presents a reversible Chinese unification grammar named CPSG. The lexicalized and integrated design of CPSG embodies the general spirit of the modern linguistic theory Head-driven Phrase Structure Grammar (HPSG, Pollard & Sag 1987, 1994). Using ALE formalism in Prolog (Carpenter & Penn 1994), we have implemented a prototype of CPSG.

CPSG covers Chinese morphology, syntax and semantics in a novel integrated language model (Figure 1; for the interface between morphology and syntax, see Li 1997; for the interface between syntax and semantics, see Li 1996). The CPSG model is in sharp contrast to the conventional clear-cut successive design of grammar components (Figure 2; see the survey in Feng 1996). We will show that our model is much better suited to, and more efficient for, Chinese analysis (or generation).

 

[Figure: cpsg (image not preserved)]

Grammar reversibility is a highly desirable feature for multi-lingual machine translation applications (Hutchins & Somers 1992, Huang 1986, 1987). To test its reversibility, we have applied the CPSG prototype to an experiment in bi-directional machine translation between English and Chinese. The machine translation engine developed in our Natural Language Lab is based on the shake-and-bake design, a novel approach to machine translation suited to unification grammars (Whitelock 1992, 1994, Beaven 1992, Brew 1992). The experimental results meet our design objective and verify the feasibility of the CPSG approach.

~~~~~~~~~~~~~~~~~~~~~

Notes for NWLC-97, UBC, Vancouver

Outline of An HPSG-style Chinese Reversible Grammar

Wei LI   (lio@sfu.ca)

Linguistics Department, Simon Fraser University

 

Key words: lexicalist approach, integrated language model, HPSG, reversible grammar, bi-directional machine translation, Chinese computational grammar, Chinese word identification, Chinese parsing, Chinese generation

 

  1. background

1.1. design philosophy

Two major obstacles in writing a Chinese computational grammar:

(1) lack of serious study of the Chinese lexical base

    a well-designed lexicon is crucial for a successful computational system

    theoretical linguists have made fruitful efforts (e.g. Li Linding), but these lack formalization

    computational linguists need more patience in adapting and formalizing those results:
    it is a huge amount of work, but it has to be done if a non-toy system is targeted

(2) lack of effective interaction between morphology, syntax and semantics

    e.g. ambiguity in word identification makes it hard to interface morphology & syntax:
    a theoretical defect of the morphology preprocessor (segmenter)

    e.g. ABC: ABC or A | BC or AB | C or A | B | C?

    e.g. active/passive isomorphic phenomena make semantic constraints necessary in parsing NP Vt: subject NP or object NP?

Solution: the lexicalized and integrated design of Chinese grammar
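
The "ABC" word-identification ambiguity noted above can be made concrete with a small sketch. This is only an illustration of why a stand-alone segmenter faces a choice it cannot resolve by itself; the function names and the toy lexicon are assumptions made for the example and are not part of CPSG.

```python
def segmentations(s):
    """Return every way of cutting s into contiguous, non-empty pieces."""
    if not s:
        return [[]]
    result = []
    for i in range(1, len(s) + 1):
        head = s[:i]
        for rest in segmentations(s[i:]):
            result.append([head] + rest)
    return result


def lexical_segmentations(s, lexicon):
    """Keep only the segmentations whose pieces are all lexicon entries."""
    return [seg for seg in segmentations(s) if all(w in lexicon for w in seg)]


if __name__ == "__main__":
    # A three-character string has four possible cuts, as in the outline.
    for seg in segmentations("ABC"):
        print(" | ".join(seg))
    # With a hypothetical lexicon, two candidates survive; choosing between
    # them needs syntactic (or semantic) context, not morphology alone.
    print(lexical_segmentations("ABC", {"A", "BC", "AB", "C"}))
```

In an integrated design, the surviving candidates would be handed to the parser together rather than being pruned by a preprocessor.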

1.2. major theoretical foundation:

HPSG:       lexicalist theory encouraging integration of different components

a desired framework matching our design philosophy

CPSG: HPSG-style unification grammar

CPSG: reversible grammar suited for both parsing and generation

CPSG: formalized grammar, a description that does not rely on undefined notions

  2. integrated language model

2.1. CPSG versus conventional Chinese grammar

 

 

parse tree embodies both morphological and syntactic structures in CPSG

  3. lexicalized formal grammar

3.1. formalized grammar, as required by a computational grammar: formulation of CPSG

readily implementable (theories, principles, rules, etc.);

precise definition for the very basic notions (e.g. sign, morpheme, word, phrase, sentence, NP, VP, etc.), rules (PS rules and lexical rules), lexical items (lexical hierarchy), typology (hierarchy embodied in feature structures)

(4.)       Definition: sign

A sign is the most fundamental concept of grammar. Formally, a sign is defined by the type [a_sign], which introduces a set of linguistic features for its description, as shown below.

a_sign
INDEX index
KANJI kanji
MORPH1 expected
MORPH2 expected
CATEGORY category
COMP0 expected
COMP1 expected
COMP2 expected
MOD expected
KNOWLEDGE knowledge
CONTENT content
INDEX0 index
INDEX1 index
INDEX2 index
DTR dtr

(5.)       Definition: word

In CPSG, a word is a sign satisfying the following two conditions: (1) all of its obligatory morphological expectations have been saturated; (2) it is not the mother of any syntactic structure, and hence has no syntactic daughters. Formally, a word is defined as shown below.

(6.)       word

a_sign
MORPH1 ~obligatory
MORPH2 ~obligatory
DTR no_syn_dtr
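
As a rough illustration of the two conditions in this definition, the sketch below models a sign as a small record and tests whether it counts as a word. It is not the ALE type definition: the class, the string markers and the example entries are assumptions chosen for readability.

```python
from dataclasses import dataclass

# Hypothetical status markers for a morphological expectation.
OBLIGATORY = "obligatory"
OPTIONAL = "optional"
SATURATED = "saturated"


@dataclass
class Sign:
    kanji: str
    morph1: str = SATURATED          # status of the first morphological expectation
    morph2: str = SATURATED          # status of the second morphological expectation
    has_syntactic_daughters: bool = False


def is_word(sign):
    """A word has no pending obligatory morphology and no syntactic daughters."""
    no_pending_morphology = OBLIGATORY not in (sign.morph1, sign.morph2)
    return no_pending_morphology and not sign.has_syntactic_daughters


if __name__ == "__main__":
    print(is_word(Sign("桌子")))                                # free-standing word
    print(is_word(Sign("化", morph1=OBLIGATORY)))               # affix still expecting a stem
    print(is_word(Sign("吃饭", has_syntactic_daughters=True)))  # built by a syntactic rule
```

The check mirrors the `~obligatory` constraints on MORPH1 and MORPH2 and the `no_syn_dtr` value of DTR in the definition above.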

3.2. lexicalized grammar

CPSG consists of two parts:

(1) a minimized general grammar:

only 11 phrase structure rules
(covering complement structure, modifier structure,
conjunctive structure and morphological structure)

(2) a feature enriched lexicon:

lexical entries;
lexical hierarchy and a set of lexical rules
(capturing lexical generalizations).

 

(7.)          comp0 PS rule

MOTHER               a_sign
COMP0 saturated
COMP1 [1]
COMP2 [2]
DTR comp0
MYSISTER [6]
LEFTMOD [7] category
RIGHTMOD [8] category
LEFTCOMP [9] category
RIGHTCOMP [10] category

===>

EXPECTING          a_sign
COMP0 a_expected
DIRECTION left
ROLE [3]
SIGN [4]
COMP1 [1] ~obligatory
COMP2 [2] ~obligatory
INDEX [5]
DTR dtr
LEFTMOD [7]
RIGHTMOD [8]
RIGHTCOMP [10]

EXPECTED            a_sign [4]
CONTENT content
MYHEAD [5]
MYROLE [3] comp_role
INDEX [6]
CATEGORY [9]

PRINCIPLE            #head_feature

(8.)          lexical entry: chi

a_sign
    KANJI one_character
        H1 chi
    CATEGORY v
    INDEX0 [1] index
    INDEX1 [2] index
    COMP0 a_expected
        DIRECTION left
        SIGN a_sign
            CATEGORY n
            INDEX [1]
    COMP1 a_expected
        DIRECTION right
        SIGN a_sign
            CATEGORY n
            INDEX [2]
    KNOWLEDGE eat
        U_OBJECT food
            MALE none
            PERSON 3
            SINGULAR bin
        U_SUBJECT animate
            MALE bin
            PERSON tri
            SINGULAR bin

  4. Implementation and Application of CPSG

CPSG prototype implemented in ALE and Prolog, having parsed a corpus of 200 various types of sentences

ALE and Prolog: suitable for unification grammar
ALE:         mechanism for typed feature structures: type polymorphism
a powerful tool in language modeling

CPSG prototype adapted for application to bi-directional MT, having generated the same corpus of 200 sentences

References

Beaven, John L. (1992): “Shake and Bake Machine Translation”, Proceedings of the 14th International Conference on Computational Linguistics, pp. 603-609, Nantes, France.

Brew, Chris (1992): “Letting the Cat out of the Bag: Generation for Shake-and-bake MT”, Proceedings of the 14th International Conference on Computational Linguistics, pp. 610-616, Nantes, France.

Carpenter, B. & Penn, G. (1994): ALE, The Attribute Logic Engine, User’s Guide

Feng, Z.  (1996): “COLIPS Lecture Series – Chinese Natural Language Processing”,  Communications of COLIPS, Vol.6, No.1 1996, Singapore (http://www.iscs.nus.sg/~colips/commcolips/paper/p96.html)

Huang, X-M. (1986): “A Bidirectional Grammar for Parsing and Generating Chinese”.  Proceedings of the International Conference on Chinese Computing, Singapore, pp. 46-54

Huang, X-M. (1987): XTRA: The Design and Implementation of A Fully Automatic Machine Translation System, Doctoral dissertation, University of Essex.

Hutchins, W.J. & H.L. Somers (1992): An Introduction to Machine Translation. London, Academic Press.

Li, W. (1996): Interaction of Syntax and Semantics in Parsing Chinese Transitive Patterns. Proceedings of International Conference on Chinese Computing (ICCC’96), Singapore

Li, W. (1997): Chart Parsing Chinese Character Strings. Proceedings of The Ninth North American Conference on Chinese Linguistics (NACCL-9, to be available), Victoria, Canada

Pollard, C.  & I. Sag (1987): Information based Syntax and Semantics Vol. 1: Fundamentals. Centre for the Study of Language  and Information, Stanford University, CA

Pollard, C.  & I. Sag (1994): Head-Driven Phrase Structure Grammar,  Centre for the Study of Language and Information, Stanford University, CA

Whitelock, Pete (1992): “Shake and Bake Translation”, Proceedings of the 14th International Conference on Computational Linguistics, pp. 784-790, Nantes, France.

Whitelock, Pete (1994). “Shake and Bake Translation”, C.J. Rupp, M.A. Rosner, and R.L. Johnson (eds.), Constraints, Language and Computation, pp. 339-359, London, Academic Press.

 


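【One Parse a Day: Starting from the Subcat of 见面 (jian-mian, "meet")】

Bai:
“三两面” and “两三面” are very different...
我借过他三两面 (I once borrowed three liang of flour from him). 我见过他两三面 (I have met him two or three times).

Me:
三两面 > 两三面
我见过他三两面 (I have met him two or three times)

[parse tree image: 0912a]
ditransitive, no problem, but:

[parse tree image: 0912b]

the separable verb jian-mian is still not connected

More examples:
(0) 我见过他两三面。 (I have met him two or three times.)
(1) 我见过他。 (I have met him.)
(2) 我与他见过面。 (I have met with him.)
(3) * 我见过面 (*I have met.)
(4) 我们见过面。 (We have met.)
(5) 我与他,见面过。 (He and I have met.)

“见面” requires one of the following: a plural subject, as in (4); a coordinated subject, as in (5); a prepositional phrase with 与 (with), as in (2) (the boundary between PP and coordination is blurry in Chinese); or a 【human】 modifier in front of the quasi verbal-measure phrase “两三面”. All of these syntactic subcat requirements serve to fill a single semantic (or common-sense) 【human】 slot: common sense says that “见面” must take place between two or more human entities.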
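
The point that several surface patterns all serve one semantic 【human】 slot can be sketched as a single check over a simplified clause record. This is only a toy under stated assumptions: the record fields, the pronoun list and the function names are invented for illustration and do not reflect any particular parser's data structures.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical closed list of plural human pronouns.
PLURAL_PRONOUNS = {"我们", "你们", "他们", "她们"}


@dataclass
class MeetClause:
    subjects: List[str]                  # one entry, or several if coordinated
    with_pp: Optional[str] = None        # human introduced by 与 (with)
    human_object: Optional[str] = None   # e.g. 他 in 我见过他(两三面)


def participants(clause):
    """Collect every human mentioned in any of the syntactic positions."""
    people = list(clause.subjects)
    if clause.with_pp:
        people.append(clause.with_pp)
    if clause.human_object:
        people.append(clause.human_object)
    return people


def human_slot_saturated(clause):
    """见面 needs two or more humans, however syntax supplies them."""
    if any(s in PLURAL_PRONOUNS for s in clause.subjects):
        return True
    return len(participants(clause)) >= 2


if __name__ == "__main__":
    print(human_slot_saturated(MeetClause(["我"], human_object="他")))  # (0), (1)
    print(human_slot_saturated(MeetClause(["我"], with_pp="他")))       # (2)
    print(human_slot_saturated(MeetClause(["我"])))                     # (3)
    print(human_slot_saturated(MeetClause(["我们"])))                   # (4)
    print(human_slot_saturated(MeetClause(["我", "他"])))               # (5)
```

For generation the unsaturated case (3) would be blocked; for robust parsing the same check could merely lower a score rather than reject the sentence.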

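HPSG and similar word-driven theories and linguistic representations, which depend heavily on subcat data structures, are cumbersome, but they have one bright spot: they describe the syntactic requirements above as matching conditions on the input, and the inherent semantic requirements (similar to the descriptions in HowNet) as the semantic output, formalizing them one by one in meticulous detail. The mechanism is unification of labels (that is, sharing of the substructures the labels stand for). Most systems, by comparison, are too crude about the internal structure of subcat, the mapping from input to output, and the underlying relation between syntax and semantics (semantics is both the motivation and the goal of syntax: syntax is matched, semantics is realized).

Too much is as bad as too little, and too little is as bad as too much. We have long been exploring how to represent and implement subcat in a way that is moderate without being mediocre, and concise without being crude.

Bai:
他我见过几面 (Him, I have met a few times.)

Me:
A prime example of crudeness is the subcat codes in dictionaries made for human users, such as the Oxford advanced dictionary and the Longman dictionary: codes like v1, ..., v23. Later, New York University organized CL graduate students to build subcat lexicons such as CompLex and NomLex. On the Chinese side, 【现代汉语800词】 from the Institute of Linguistics of the Chinese Academy of Social Sciences pioneered subcat description, and the 【动词用法词典】 series of dictionaries began trying to express subcat through some kind of coding plus example sentences. Seen from the angle of data representation and relations, all of this work looks rather crude. The root cause is that syntax and semantics were never clearly separated.

An NLP practitioner who picks up these resources has to connect and digest the syntax and semantics internally, then decide on a data structure and find a path to implementation. The implementation rarely achieves the elegance of a unification grammar; mostly it is a makeshift, meant to avoid the inefficiency and hard-to-maintain data structures of HPSG-style implementations.

Professor Dong's HowNet is unsurpassed on the semantic side of subcat for Chinese and English, but on the syntactic side it still seems less detailed and complete. For example, the six or seven syntactic patterns of “见面” above do not seem to be described one by one (Professor Dong, please correct me; perhaps I have not fully digested it), and I have not seen anyone else describe them clearly either. All of these need to be re-digested before they can be implemented.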

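[parse tree image: 0912c]

(3) is ungrammatical (*) for generation, but for parsing, robustness requires that it be parsed this way, and that is fine.

[parse tree image: 0912d]

It came out without any debugging; call it the dumb luck of 9/12. (9/11 was the terror attack and 9/13 was the day Lin fled; neither is a good day.) Only the case “我见过他两三面” remains. This quasi verbal-measure complement is in fact limited to a handful of forms: “一面”, “几面”, “两三面”, “三两面” and a few others. At the very least, 100+ 面 is basically impossible, unless the two are lovers.

Zhang: Deeply in awe.

Me:
Professor Zhang flatters me. Idle talk ruins the country; I am content as long as I do not lead other “people's” children astray. I was never a professor in my life, so whoever is led astray is someone else's student, ha.

Zhang: Bethune

Me:
Seriously, I may well be leading students astray, because everything has its larger environment and background. What I say is somewhat unorthodox, and the result is that mainstream students see flowers through a fog. Seeing flowers through a fog at least broadens the view; the worst is seeing the flowers and not being able to reach them. It is like what Lu Xun said: people were sleeping soundly in the dark room, you insist on a 【呐喊】 (Call to Arms), they wake up, but the room is still dark, and that is more than merely cruel. The non-cruel way is to wait until retirement and open a Deep Parsing open-source park, with every line of code, every lexical entry and every rule made public, and then see whether the crowd can build an unbeatable system. Let everyone play with symbolic logic together, so that both lines of work live on.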

 

 


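【Linguistic Vignette: The "Sales Talk" of Apple's iPhone 7 Launch】

Me:
A while ago I mentioned the challenge that the compact Chinese if-then construction poses to parsing. Yesterday I ran into more examples with almost no overt formal marking, which people nevertheless read as if-then: “中国出生,美国长大,如何申请回国定居?”

VP1, VP2, how VP3

中国出生,美国长大,如何申请回国定居?
== 【如果】【一个人】【在】中国出生,【并且】【在】美国长大,【那么】【他/她】【将】如何申请【他/她】回国定居【的paperwork】【呢】?
(== 【If】 【a person】 was born 【in】 China 【and】 grew up 【in】 the US, 【then】 how 【will】 【he/she】 apply for 【the paperwork for】 【his/her】 return to settle in China?)

So much is left out. Compact Chinese indeed!

This sentence pattern sounds perfectly natural, and ordinary speakers feel no problem of understanding or ellipsis. On closer inspection it is not entirely without formal marking: a pattern like this seems bound to be understood in this way (?):

VP1(, VP2, …), how VPn?

Once the pattern matches, is any other reading possible? VP1 through VPn-1 are AND-ed conditions, and only VPn is the consequence of the hypothetical condition.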
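
The pattern above can be sketched as a small string-level matcher. This is a deliberately naive illustration: a real system would match over parsed VPs rather than comma-separated substrings, and the function name and the list of question words are assumptions made for the example.

```python
import re


def parse_if_then(sentence, question_words=("如何", "怎么", "怎样")):
    """Match 'VP1(, VP2, ...), how VPn?' and split conditions from consequence."""
    body = sentence.strip().rstrip("??。")
    clauses = [c.strip() for c in re.split(r"[,,]", body) if c.strip()]
    # Need at least one condition, and the last clause must open with "how".
    if len(clauses) < 2 or not clauses[-1].startswith(question_words):
        return None
    return {"conditions": clauses[:-1], "consequence": clauses[-1]}


if __name__ == "__main__":
    result = parse_if_then("中国出生,美国长大,如何申请回国定居?")
    print(result["conditions"])    # the AND-ed conditions VP1 .. VPn-1
    print(result["consequence"])   # the "how VPn" part
```

The returned structure makes the implicit logic explicit: all entries of `conditions` are conjoined, and `consequence` is asked under that hypothesis.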

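Bai:
不甜不要钱 ("not sweet, no charge") and 不甜的不要钱 ("the ones that are not sweet are free")
mean the same thing; do their forms really have to be pulled that far apart?
Read it as omitting “的” and it is a simple sentence. Read it as omitting “if ... then ...” and it is a complex sentence.

Me:
The 的-construction is an odd hybrid between a phrase and a clause, and so is the English what-clause.

Bai:
By the "lazy person's law", the reading with the simplest process and the simplest result takes priority, no matter what.
Whichever completion costs the least energy wins.

Song:
They are not quite the same. When a melon seller points at a pile of melons and says “不甜不要钱”, he means: I guarantee every one is sweet. “不甜的不要钱” is softer in tone: I do not guarantee that every melon is sweet, but if the one you buy is not sweet, I will not charge you.

Bai:
Your example only distinguishes whether the omitted noun takes an existential or a universal quantifier. It does not distinguish whether the omitted function word is “的” or “if ... then”.

Me:
@Song Nicely distinguished. Still, this hardness or softness of tone is really subtle, and advertisers seem to exploit such nuances deliberately to fool the public. For the same slogan, the soft reading is the real one, while the advertiser hopes the audience will take the hard reading, to show off its confidence. “不甜不要钱” is exactly this kind of sales talk. Its actual and legal meaning is equivalent to “不甜的不要钱”. What it wants to convey, though, is: my product is so good that it cannot possibly be unsweet, and I am willing to bet you on it.

Bai:
Hard or soft, when an unsweet melon (a logical counterexample) really turns up, it is that melon that is free, not the whole pile. Try it if you do not believe me.

Me:
No need to try, @白硕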

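Speaking of "sales talk", I felt it keenly while watching Apple's launch event yesterday. From the Jobs era to the present, the line Apple uses most often to dazzle its faithful and the general public is: iXyz is the best Xyz ever made by Apple
This statement is a universal truth with no information content whatsoever, yet it sounds like the most powerful advertising copy.

Bai:
Sentiment is all it needs.

Me:
For heaven's sake, with electronic products, aren't they supposed to get better and better rather than worse and worse? Isn't it a matter of course that a new generation is better than the previous ones? Isn't that all this "best" is claiming? It works every time, treating the whole world as fools, and the whole world is happy to play the fool. Nobody questions or mocks it. If I were Apple's competitor, I would make a promotional video specifically to mock this "sales talk".

Bai:
"made" and the scope of comparison are not rigidly bound together.
It is not a hard constraint.

Me:
It is iPhone7 compared with iPhones; iWatch Series 2 compared with iWatch 1

Bai:
It can also be read as a comparison across vendors.

Me:
This is the official press release: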
San Francisco — Apple today introduced iPhone 7 and iPhone 7 Plus, the best, most advanced iPhone ever, packed with unique innovations that improve all the ways iPhone is used every day.“the

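“the best, most advanced iPhone ever”

Bai:
We are back to the restrictive versus non-restrictive question, clever Ikkyu.

Me:
Logically, strip away the modifiers and what is left is: iPhone 7 is iPhone

Bai:
That kind of stripping is as unacceptable as the one in “媳妇是娘” ("the wife is the mother").

Me:
Even if Apple rots completely, with no innovation at all, it can always make this claim:
iPhone 7 is the best iPhone.
(iPhone 8 will be the best iPhone)
In fact, a new iPhone release is always the best iPhone.

Bai:
The thing is, users who hold an Apple product in their hands, taking the other reading, get the feeling of looking down on the whole world.

Song:
The peak of Marxism-Leninism.
A new peak.

Me:
If it were really that good, it should say iPhone 7 is the best smart phone.
But it does not dare.

Bai:
Apple is not stupid; it just cannot fool Brother Wei.

Me:
Only Google's SyntaxNet is silly enough to claim to be number one in the world without stating any scope.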

 

 


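《Liwei's Popular Science: The Task of the Semantic Module in an NLP System》

This piece discusses the task of the semantic module in NLP (Natural Language Processing), especially in knowledge-graph applications. Before the discussion, let us take a bird's-eye view from ten thousand meters to see where the semantic module sits in the architecture of the main modules of linguistics and NLP.
Linguistics textbooks usually divide the study of language text, from shallow to deep, into the following branches: morphology, syntax, semantics and pragmatics. There is also a branch along another dimension, discourse study, which works across sentences, whereas the other branches are generally confined to the sentence. In NLP, the results of morphological and syntactic research take the form of a parser, which automatically analyzes a linear string into a syntactic tree, so that endlessly varied sentences are reduced to a finite set of sentence types or patterns, providing a solid foundation for language understanding and applications. Semantics sits after syntax and before pragmatics. We call it semantic middleware, because it is the end point of domain-independent language study and it supports pragmatics, which depends on domain and application. The work of this middleware could also be left to the pragmatic stage and done together with semantic grounding, according to what pragmatics demands of semantics. In theory, however, there is always a portion of semantic work that is domain-independent enough to be worth doing in advance, in order to support many different pragmatic scenarios and applications and to lighten the load on the pragmatic module.
The semantic module defined in this way (the semantic middleware) mainly looks for hidden links, such as implicit logical subjects and objects. These are not marked explicitly at the syntactic stage, but there is enough evidence to determine how to fill them. The filling draws on syntax (the explicit links) and on ontology, usually in combination. Done in a word-driven way, it is a very tractable task, more tedious than parsing but less difficult, because both structure and ontology are available (including dynamically formed ontology nodes, such as the classification of NE proper names); this is a far better starting point than that of the pure syntactic module, which has only linear patterns to work with.
Its usefulness, however, is less clear. One argument runs: if hidden semantics is important, why do people not express it by explicit syntactic means? Even if an important piece of meaning is hard to express explicitly within the syntactic structure chosen for one sentence, people will, if it matters enough, switch to another structure and express it explicitly in another sentence. If this argument has some merit, then skipping hidden semantics should not do much harm to big-data mining. At least in that scenario, the redundancy of information is enough to make up for the incompleteness of individual hidden semantics.
When syntax is finished, some of the arg(s) mentioned in a sentence are not yet in place, and the structure can be called unsaturated. The task of the semantic middleware is to fill the slots that syntax left unsaturated: once the hidden links are established, the structure is saturated. If it is still unsaturated after the syntactic and semantic modules, the filler should be sought in the discourse. If it is not found in the discourse either, then in theory it should be saturated through common sense.
Back to the bird's-eye view: only yesterday I was still wondering what exactly "semantic computing" includes. Judging from the community, the relevant areas are: (1) WSD (Word Sense Disambiguation); (2) FrameNet (role labeling); (3) IE (Information Extraction). "Classical" IE (the MUC IE tradition) is generally divided into the tasks of NE, relationship and event, plus Coreference. From the viewpoint of a structure graph, NE and WSD compute the semantics of nodes, while FrameNet and the IE Template (for relationship or event) compute the semantics of arcs (links). Looking at the community's tasks and directions in this light, (1) and (2) are academic tasks and not practical. (3) is the most down-to-earth and is what applications (apps) directly need. But IE is domain-specific and serves products directly, so it is hard to abstract. One can then ask what, after syntax and before pragmatics, helps IE the most. One answer is Coreference. This task has been absorbed into IE, but it is in fact domain-independent semantic computing at the discourse level, there to support IE's cross-sentence integration.
Following this line of thought, we can refine further. Based on practical needs, we have defined three tasks that we think belong in the semantic middleware and should benefit all applications: the first is the appositive relation, which can be seen as a kind of Coreference; the second is the part-whole relation (for example, Apple and iPhone); the third is the cause-effect relation. These three relations are not limited to short syntactic distances; they also cover long-distance and even cross-sentence links. We have consistently put more effort into these three relations, plus pronoun coreference (including aliasing of proper names), than into hidden logical subjects, predicates and objects, because the former directly serve the IE that follows local IE, acting as the glue for integration into a graph, while the latter can mostly be compensated for by information redundancy.
The above is what we learned in practice; it is simply how things naturally went. When local IE grabs information to fill the slots of an IE Template, all it sees is local information, and the material it fills in is often "empty". The extreme examples are pronouns (“它” it, “这个” this one) or referring nouns (“这台电脑” this computer). These can only serve as bridges and cannot by themselves lead to a graph. The work the semantic module does on the four fronts above helps turn this empty material into something solid, which is an important support on the way to the graph.
Broadly speaking, how far the semantic middleware should go is open to considerable debate. Before the application is fixed, pushing many fine-grained semantic distinctions further is of little value, or yields little for the effort. That is, once syntax has set up the structural frame and before a concrete application is fixed at the pragmatic level, how much semantic computing should be done is not easy to pin down; intuition and experience both argue against doing too much. In a sense, Fillmore created FrameNet in order to carry semantic middleware all the way through. In theory his deepening makes sense, because beyond arg structure (the specialty of syntactic subcat), a domain-independent Frame hierarchy is the deep bridge leading to pragmatics. At least in theory. But after 18 years of IE work, our conclusion is that Fillmore's route to semantic computing is basically a wrong turn. We felt no benefit from it, yet it brings a large overhead, poor operability and no saving of effort. The IE field uses Templates to define the needs of a pragmatic domain, and nobody advocates defining these Templates on the FrameNet hierarchy, because no need is felt and it is not realistic either. A hundred years from now FrameNet may be rediscovered, because by then there will be so many pragmatic groundings that they will need organizing, and FrameNet offers exactly such a framework for organization and integration. Today's pragmatic groundings are still scattered.
At the Liwei-brand NLP University, any student who can follow this "advanced popular science", laced as it is with pseudo-foreign talk (terminology), may be granted a degree. The degree is hard currency. If you cannot follow it, no matter: treat it as a madman's ravings, or as having wandered into a maze where even people in the same trade are separated by mountains, at the cost of precious time you could have spent playing with deep learning (dl).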

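【Semantic Computing Salon: The Various Chemistries of a Three-Way Relation】

Bai:
朴泰恒小组成绩不好,今天不一定能进决赛 (Park Tae-hwan's result in the group stage was poor; he may not make the final today)
In this example, where to attach “小组” (group) is a test.
The intended meaning is "in the group-stage competition".

Liang:
朴泰恒今天小组成绩不好。 (Park Tae-hwan did poorly in the group stage today.)
孙杨小组第一。 (Sun Yang was first in his group.)

Bai:
Groups named after a person do exist as well.

Liang:
Right. It feels as if “小组成绩不好” is the predicate. And 小组 here is not “朴泰恒的小组” (Park Tae-hwan's group); here comes the test.

Me:
Isn't this what big data is for? Check whether 某某某小组 (So-and-so's group) qualifies.

[parse tree images: t08061, t08062, t08063, t08064, t08065]

Liang:
@wei Great! There is a Topic.

Song:
@wei Very good indeed. But can it really tell the two kinds of “小组” apart, or does it only cover one side?

Me:
Without big data, it probably covers only one side. We can try some typical cases of the other side.

Song:
Even with big data you would still have to distinguish era, region, industry and so on, which is not easy.
Besides, this turns into supervised learning, which requires corpus annotation.

Bai:
Not necessarily, Professor Song. Labels can be added to the lexicon offline, and for the target text one only needs to compute label density online. No supervised learning is involved.

Song:
Could you explain in more detail?

Me:
Lexicon acquisition is essentially unsupervised, built on ngram frequency. Suppose 北京大学 (Peking University) were not in the lexicon; it should be learnable, and the same goes for 某某某小组. What Bai is describing is online lexicalization through on-the-spot computation.

Song:
@wei For this example, that means comparing the frequency of “朴泰恒小组” with that of “朴泰恒……小组”, right?

Me:
Can it solve this one: 北京大学、中学、小学要立刻全部动员起来 (universities, middle schools and primary schools in Beijing must all be mobilized at once)
The general rule for overlapping segmentation of xyz: is xy stronger, or is yz stronger? In principle this can be computed by online lookup.
Is “北京大学” stronger, or “大学、中学”?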
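
The "xy versus yz" rule of thumb can be written down in a few lines. The sketch below only illustrates the comparison itself; the function name and the toy counts are assumptions, and, as the discussion that follows makes clear, raw frequency alone is often not a safe basis for the decision.

```python
def resolve_overlap(x, y, z, counts):
    """Resolve the overlapping ambiguity x-y-z by comparing how strongly
    xy and yz are attested; return None when the evidence is tied."""
    xy, yz = x + y, y + z
    left, right = counts.get(xy, 0), counts.get(yz, 0)
    if left > right:
        return [xy, z]
    if right > left:
        return [x, yz]
    return None


if __name__ == "__main__":
    # Hypothetical counts, as if taken from a sports-domain corpus.
    toy_counts = {"小组成绩": 4, "朴泰恒小组": 0}
    print(resolve_overlap("朴泰恒", "小组", "成绩", toy_counts))
```

Returning `None` on a tie is a design choice of this sketch: it leaves room for other evidence (lexicon, discourse, domain) to decide instead of forcing a guess.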

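Song:
If we treat it as an overlapping ambiguity, then in big data the frequency of “小组成绩” is surely higher than the frequency of “朴泰恒”, unless Park Tae-hwan is extremely famous. So deciding the syntactic structure on that basis seems insufficiently justified.

Me:
How do humans make the decision?
This may involve the question of the scope of the big data.
Bigger data is not always better, and above all it must not be a hodgepodge. Data that is big and mixed flattens out the domains, and this is quite possibly domain knowledge.

Song:
Right, I was muddled.

Bai:
Actually, combining with a person's name is the fallback; what needs to be learned is only the high-frequency strings that do not combine with a person's name.
When the conditions for attaching to the right are not met, just default to attaching to the left.
That is not how big data should be used.

Song:
In any case, generally speaking, X小组 cannot compete with 小组成绩. This is a matter of domain knowledge and is not well handled by word frequency.

Me:
First, a discourse phenomenon: one sense per discourse.
If 某某某小组 appears again elsewhere in the same text, that principle is solid and the matter can be settled within the discourse; in that case big data has to yield.

Song:
张三小组第一,李四小组第二。 (Zhang San [group] first, Li Si [group] second.)

Bai:
@宋柔 That one is ambiguous.

Me:
There are four levels:
Level 1 is lexicon hard-coding; 北京大学 is basically such a case
Level 2 is the discourse principle
Level 3 is domain data
Level 4, and only then, is big data across domains
Where proper names and terminology are involved, one never gets as far as cross-domain big data; big data flattens out domain knowledge, which makes things worse.

Bai:
That holds at the token level, but not necessarily at the feature level.
At the feature level, all the xx小组 cases can be pooled and counted together.

Me:
Understood. In practice, though, it is still a muddle. When xxx小组 fights with 小组成绩, by how much does one have to win for it to count as a win? Over how much data? If the gap is huge, fine; if the two are fairly close, it is a bad account, or a bad battle.

Bai:
In addition, feature density can be computed over the discourse; if the density of one feature is markedly higher than that of the others, that can be used too. For example, if the sports feature is prominent, “小组” as a prefix gets higher priority.

Song:
I searched the People's Daily (11年): “小组赛” 1013 times, “小组成绩” 4 times, “小组赛成绩” twice, person name + 小组 3 times. Someone with no knowledge of sports competitions at all, but with general knowledge of competitions, who knows that competitions produce results, can infer that “小组比赛” is a phrase. First, the bound morpheme “赛” attaches to form “小组赛”; one learns that “小组赛” is a term and understands it as competing in groups. Knowing that competitions produce results, one can infer that “小组成绩” is a phrase referring to someone's result in the group stage. Person name + 小组 occurs 7 times, but none of them has to do with sports: 赵梦桃小组, 郝建秀小组 and so on are all from cotton mills. A person without sports knowledge, but with general knowledge of competitions and knowledge of the language, can reason in this way.

Me:
“周恩来思想深刻 谈吐幽默” (Zhou Enlai is profound in thought and witty in speech) vs. “毛泽东思想深刻” (Mao Zedong Thought is profound)
“思想” (thought) behaves like “小组”

Song:
Before the 1940s, Chinese does not seem to have had "person name + 思想" as a word. After that, “毛泽东思想” became more and more frequent. But other person name + 思想 combinations cannot form words.

Me:
The politics here is interesting: from then on, other person name + 思想 became taboo. 我花开来百花杀 (when my flower blooms, all other flowers die).

Bai:
@Song “小组循环赛”, “小组出线”, “小组第一” and all the other such combinations have “小组” as a prefix; if one only matches instances, they are really not much better off than “朴泰恒小组”. A little more or a little less in the frequency counts cannot serve as the basis for structural preference. But if we examine abstractly what influences the relative priority of the "prefix pattern" and the "suffix pattern", we are inevitably led back to features and the density distribution of features in the discourse. If the "sports" or "competition" feature and its density are clearly dominant, “小组” tends to be a prefix; otherwise it tends to be a suffix. If the instance carrying the prefix happens to be in the big data, so much the better; if not, it can still gain indirect support from allies through features and feature density. Likewise, if the "person name" or "task name" feature or its density is prominent, “小组” tends to be a suffix.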
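
The feature-density idea can be illustrated with a toy calculation. Everything here is an assumption made for the sake of the sketch: the feature labels, the tiny offline lexicon, the pre-tokenized input and the `margin` threshold are all hypothetical, and a real system would need a principled way to set them.

```python
from collections import Counter

# Hypothetical offline lexicon: word -> domain feature label.
FEATURES = {
    "决赛": "sports", "成绩": "sports", "比赛": "sports", "游泳": "sports",
    "车间": "work", "棉纺厂": "work", "任务": "work",
}


def feature_density(tokens):
    """Share of tokens in the text that carry each feature label."""
    counts = Counter(FEATURES[t] for t in tokens if t in FEATURES)
    total = len(tokens) or 1
    return {label: n / total for label, n in counts.items()}


def xiaozu_attachment(tokens, margin=2.0):
    """Prefer 小组 as a prefix (小组+成绩) when the sports feature clearly
    dominates; otherwise fall back to suffix (name+小组)."""
    density = feature_density(tokens)
    sports = density.get("sports", 0.0)
    work = density.get("work", 0.0)
    if sports > 0 and sports >= margin * work:
        return "prefix"
    return "suffix"


if __name__ == "__main__":
    sports_text = ["朴泰恒", "小组", "成绩", "不好", "今天", "不一定", "能", "进", "决赛"]
    mill_text = ["赵梦桃", "小组", "在", "棉纺厂", "车间", "完成", "任务"]
    print(xiaozu_attachment(sports_text))
    print(xiaozu_attachment(mill_text))
```

No annotated training data is involved: the labels live in the lexicon, and only their density in the target text is computed at run time.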

 


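【One Parse a Day: degraded text and robust parsing】

Me:
“i love programming the games are cool its fun to play them don’t you think”
@Liang here are the parsing results for your casual English:

[parse tree image: t0721a]

So there is one error in parsing this “degraded text”:
our parser links “the games” as Object of “programming”, which is locally correct and an understandable mistake. But a human knows there is a missing punctuation mark and will link “the games” as Subject of “are”. Other aspects of the parse seem all right. So “degraded text” does pose some challenges, but a robust parser can still handle most of it.

Liang:
Thank you, @wei. It is very well handled. By the way, it is not my casual English. I copied it from Khan Academy.
@wei, “Opred” means predicate as object; what is “infmod”?

Bai:
An infinitive used as a postmodifier.

Me:
Right. Opred is a predicative object, which includes -ing forms and infinitives.
In fact that error can be corrected with finer work, because the force with which "are" demands a subject far exceeds the pull of serving as the object of the preceding verb. That would bring the result up to the level of human structural analysis.

Bai:
Why did "think" come out as next? This is a tag question.

Me:
Bai has sharp eyes; I would not have noticed if you had not pointed it out. That is clearly a bug: the auxiliary was taken as the main verb.
For this particular case, it should be lexicalized.

Bai:
"are" is also closer, and without a subject it stays unsaturated. "programming", by contrast, does not necessarily have a slot to fill.
Agreed on lexicalization.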
 

 

 


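The Elastic Recognition and Understanding Mechanism for Idioms

Bai:
“去年秋膘应犹在,只是猪颜改” ("last autumn's fat should still be there; only the pig's face has changed")

Me:
1234应犹在 只是56改
The elastic mechanism for idioms catches these every time. Which parts of an idiom are variables and which are constants is something that can be studied; people have a rough ledger of it in their minds. Take “九牛二虎之力” ("the strength of nine oxen and two tigers", i.e. tremendous effort) as an example. The first ring of elasticity is turning the numerals into variables: m牛n虎之力

二牛九虎之力
九虎二牛之力
八虎七牛之力
四牛五虎之力

None of these affects parsing or understanding; they all mean that a huge effort was spent.

The second ring of elasticity is turning the nouns into variables along the taxonomy: m 【large animal】 n 【large animal】 之力

九熊二豹之力
三象五狮之力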
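Forwarded:
Today is the Beginning of Autumn (立秋). Ask the heavens: which season is the busiest? Autumn: 多事之秋 (an eventful autumn, i.e. troubled times). Which season is the fairest? Autumn: 平分秋色 (to share the autumn scenery equally). Which season is the simplest? Autumn: 一叶知秋 (one leaf tells of autumn). Which season is the longest? Autumn: 一日不见如隔三秋 (a day apart feels like three autumns). Which season is the most refreshing? Autumn: 秋高气爽 (high autumn sky and crisp air). Which season is the most dangerous? Autumn: 秋后算账 (settling scores after autumn). Which season is the most flirtatious? Autumn: 暗送秋波 (secretly sending autumn ripples, i.e. making eyes). Happy autumn!!

The elastic mechanism for idioms can rise from “秋” (autumn) to 【season】 and then further up to 【time period】:

多事之春 多事之年 多事之岁月
平分春色
一花知春
一日不见如隔三冬
一日不见如隔九冬

Bai:
秋天来了,冬天还会远吗 (Autumn has come; can winter be far behind?)

Me:
冬天来了 秋天还会远吗 (Winter has come; can autumn be far behind?)
This is a time tunnel,
running either backward or fast-forward.

About the subtitle:

[parse tree image: 0905b]

【成语】的【【弹性【识别和理解】】机制】: syntactically it should be like this. For idioms, we need an elastic recognition mechanism, or a mechanism of elastic recognition. But at the time of writing, what I more likely had in mind was: for 【the elasticity of idioms】, we need a recognition mechanism.

On second thought, who cares; isn't human expression and understanding often this fuzzy? Except in jokes or when nitpicking, people are usually insensitive to this kind of structural ambiguity. Semantic fuzziness does not affect the gist of understanding either.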


Once upon a time, we were publishing like crazy

List of 23 NLP Publications (Cymfony Period)

Once upon a time, we were publishing like crazy …… as if we were striving for tenure faculty

[1] R. Srihari, W. Li and X. Li. 2006. Question Answering Supported by
Multiple Levels of Information Extraction.  a book chapter in T. Strzalkowski & S. Harabagiu (eds.), Advances in Open- Domain Question Answering.  Springer, 2006, ISBN:1-4020-4744-4.

http://link.springer.com/chapter/10.1007%2F978-1-4020-4746-6_11

[2] R. Srihari, W. Li, C. Niu and T. Cornell. 2006.  InfoXtract: A Customizable Intermediate Level Information Extraction Engine.  Journal of Natural Language Engineering, 12(4), 1-37

http://journals.cambridge.org/action/displayAbstract?fromPage=online&aid=1513012

This paper focuses on IE tasks designed to support  information discovery applications. It defines new IE tasks such as entity profiles, and concept-based general events which represent realistic goals in terms of what can be accomplished in the near-term as well as providing useful, actionable information.

[3] C. Niu, W. Li, R. Srihari, H. Li.  2005. Word Independent Context Pair Classification Model For Word Sense Disambiguation.  Proceedings of Ninth Conference on Computational Natural Language Learning (CoNLL-2005)

W05-0605

[4] C. Niu, W. Li and R. Srihari. 2004. Weakly Supervised Learning for
Cross-document Person Name Disambiguation Supported by Information
Extraction. In Proceedings of ACL 2004.

ACL 2004 Niu Li Srihari 372_pdf_2-col

[5] C. Niu, W. Li, R. Srihari, H. Li and L. Christ. 2004. Context Clustering for Word Sense Disambiguation Based on Modeling Pairwise Context Similarities. In Proceedings of Senseval-3 Workshop.

ACL 2004 Context Clustering for WSD niu1

[6] C. Niu, W. Li, J. Ding, and R. Srihari. 2004. Orthographic Case
Restoration Using Supervised Learning Without Manual Annotation.
International Journal of Artificial Intelligence Tools, Vol. 13, No.
1, 2004.

IJAIT 2004 Niu, Li, Ding, and Srihari caseR

[7] C. Niu, W. Li and R. Srihari. 2004. A Bootstrapping
Approach to Information Extraction Domain Porting. ATEM-2004: The
AAAI-04 Workshop on Adaptive Text Extraction and Mining. San Jose.

[8] W. Li, X. Zhang, C. Niu, Y. Jiang, and R. Srihari. 2003. An Expert
Lexicon Approach to Identifying English Phrasal Verbs. In Proceedings
of ACL 2003. Sapporo, Japan. pp. 513-520.

ACL 2003 Li, Zhang, Niu, Jiang and Srihari 2003 PhrasalVerb_ACL2003_submitted

[9] C. Niu, W. Li, J. Ding, and R. Srihari 2003. A Bootstrapping
Approach to Named Entity Classification using Successive Learners. In
Proceedings of ACL 2003. Sapporo, Japan. pp. 335-342.

ACL 2003 Niu, Li, Ding and Srihari 2003 ne-acl2003

[10] W. Li, R. Srihari, C. Niu, and X. Li. 2003. Question Answering on
a Case Insensitive Corpus. In Proceedings of Workshop on Multilingual
Summarization and Question Answering – Machine Learning and Beyond
(ACL-2003 Workshop). Sapporo, Japan. pp. 84-93.

ACL 2003 Workshop Li, Srihari, Niu and Li 2003 QA-workshopl2003_final

[11] C. Niu, W. Li, J. Ding, and R.K. Srihari. 2003. Bootstrapping for
Named Entity Tagging using Concept-based Seeds. In Proceedings of
HLT/NAACL 2003. Companion Volume, pp. 73-75, Edmonton, Canada.

NAACL 2003 Niu, Li, Ding and Srihari 2003 ne_submitted

[12] R. Srihari, W. Li, C. Niu and T. Cornell. 2003. InfoXtract: A
Customizable Intermediate Level Information Extraction Engine. In
Proceedings of HLT/NAACL 2003 Workshop on Software Engineering and
Architecture of Language Technology Systems (SEALTS). pp. 52-59,
Edmonton, Canada.

NAACL 2003 Workshop InfoXtract SEALTS paper2

[13] H. Li, R. Srihari, C. Niu, and W. Li. 2003. InfoXtract Location
Normalization: A Hybrid Approach to Geographic References in
Information Extraction. In Proceedings of HLT/NAACL 2003 Workshop on
Analysis of Geographic References. Edmonton, Canada.

NAACL 2003 Workshop Li, Srihari, Niu and Li 2003 CymfonyLoc_final

[14] W. Li, R. Srihari, C. Niu, and X. Li 2003. Entity Profile
Extraction from Large Corpora. In Proceedings of Pacific Association
for Computational Linguistics 2003 (PACLING03). Halifax, Nova Scotia,
Canada.

PACLING 2003 Li, Srihari, Niu and Li 2003 Entity Profile profile_PACLING_final_submitted

[15] C. Niu, W. Li, R. Srihari, and L. Crist 2003. Bootstrapping a
Hidden Markov Model for Relationship Extraction Using Multi-level
Contexts. In Proceedings of Pacific Association for Computational
Linguistics 2003 (PACLING03). Halifax, Nova Scotia, Canada.

PACLING 2003 Niu, Li, Srihari and Crist 2003 CE Bootstrapping PACLING03_15_final

[16] C. Niu, Z. Zheng, R. Srihari, H. Li, and W. Li 2003. Unsupervised
Learning for Verb Sense Disambiguation Using Both Trigger Words and
Parsing Relations. In Proceedings of Pacific Association for
Computational Linguistics 2003 (PACLING03). Halifax, Nova Scotia,
Canada.

PACLING 2003 Niu, Zheng, Srihari, Li and Li 2003 Verb Sense Identification PACLING_14_final

[17] C. Niu, W. Li, J. Ding, and R.K. Srihari 2003. Orthographic Case
Restoration Using Supervised Learning Without Manual Annotation. In
Proceedings of the Sixteenth International FLAIRS Conference, St.
Augustine, FL, May 2003, pp. 402-406.

FLAIRS 2003 Niu, Li, Ding and Srihari 2003 FLAIRS03CNiu

[18] R. Srihari  and W. Li 2003. Rapid Domain Porting of an
Intermediate Level Information Extraction Engine. In Proceedings of
International Conference on Natural Language Processing 2003.

ICON2003 paper FINAL

[19] H. Li, R. Srihari, C. Niu and W. Li 2002. Location Normalization
for Information Extraction. In Proceedings of the 19th International
Conference on Computational Linguistics (COLING-2002). Taipei, Taiwan.

COLING 2002 Li, Srihari, Niu and Li 2002 coling2002LocNZ

[20] W. Li, R. Srihari, X. Li, M. Srikanth, X. Zhang and C. Niu 2002.
Extracting Exact Answers to Questions Based on Structural Links. In
Proceedings of Multilingual Summarization and Question Answering
(COLING-2002 Workshop). Taipei, Taiwan.

COLING 2002 Workshop Li et al CymfonyQA_final

[21] R. Srihari, and W. Li. 2000. A Question Answering System
Supported by Information Extraction. In Proceedings of ANLP 2000.
Seattle.

ANLP 2000 Srihari and Li 2000 anlp9l

[22] R. Srihari, C. Niu and W. Li. 2000. A Hybrid Approach for Named
Entity and Sub-Type Tagging. In Proceedings of ANLP 2000. Seattle.

ANLP 2000 Srihari, Niu and Li 2000 anlp105_final9

[23] R. Srihari and W. Li. 1999. Question Answering Supported by
Information Extraction. In Proceedings of TREC-8. Washington


Other publications: SBIR Final Reports

W. Li & R. Srihari. 2003.  Flexible Information Extraction Learning Algorithm (Phase 2), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. 


W. Li & R. Srihari. 2001.  Intermediate-Level Event Extraction for Temporal and Spatial Analysis and Visualization (Phase 1), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.

W. Li & R. Srihari.  2000.  A Domain Independent Event Extraction Toolkit (Phase 2), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.

W. Li & R. Srihari. 2000.  Flexible Information Extraction Learning Algorithm (Phase 1), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.

W. Li & R. Srihari 2003. Automated Verb Sense Identification (Phase I), Final Technical Report, U.S. DoD SBIR (Navy), Contract No. N00178-02-C-3073 (2002-2003)

R. Srihari & W. Li 2003. Fusion of Information from Diverse, Textual Media: A Case Restoration Approach (Phase I), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. Contract No. F30602-02-C-0156 (2002-2003)

R. Srihari, W. Li & C. Niu 2004. A Large Scale Knowledge Repository and Information Discovery Portal Derived from Information Extraction (Phase 1), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. (2003-2004)

R. Srihari & W. Li 2003. An Automated Domain Porting Toolkit for Information Extraction (Phase I), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. Contract No. F30602-02-C-0057 (2002-2003)

T. Cornell, R. Srihari & W. Li 2004. Automatically Time Stamping Events in Unrestricted Text (Phase I), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. (2003-2004)

 


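【One Parse a Day: Parsing as the Nuclear Weapon of Question Answering】

One parse a day: today's is ...

[parse tree image: 0831d]

How do we know that the question and the answer here can be matched? With parsing and a knowledge graph built on top of it, this is easy. The graph has a professionOf relationship, and with parsing, extracting this relation is a piece of cake (this example is simple: the appositive relation is mapped to the professionOf relation). With parsing, the relation that the question asks about can also be resolved into an asking point: the subtree (S:李娜-从事,O:从事-运动;Mod:什么-关系) determines that the asking point is a request for professionOf(“李娜”). Then semantic matching is done, and the loop of the question answering system is closed. This is IE or knowledge-graph supported QA.

Specifically, to let Q and A match, we can write subtree rules on both sides that fill (extract into) the professionOf relation, unifying the semantics, after which everything goes smoothly. The first subtree rule is:

“从事”
    O: (“职业|运动”)
        Mod: (“什么|何种”)
    S: ^Somebody
==> professionOf(^Somebody,?)

This is Question parsing and asking point extraction. On the answer-source side there is also a set of rules that extract professionOf, including this one:

[person-NE]:^Person
    equiv([profession_token]:^Profession)
==> professionOf(^Person,^Profession)

And so the QA matches.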
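
The way the two rules meet in the professionOf relation can be sketched as follows. This is a toy under stated assumptions: the dictionary encoding of a parsed subtree, the function names and the example fact (an appositive pairing 李娜 with 网球运动员) are all made up for illustration and are not the rule formalism or data used above.

```python
def question_asking_point(parse):
    """Question side: 从事 + O(职业|运动) with Mod(什么|何种) and a subject
    yields the asking point professionOf(subject, ?)."""
    if (
        parse.get("verb") == "从事"
        and parse.get("O") in {"职业", "运动"}
        and parse.get("O_mod") in {"什么", "何种"}
        and parse.get("S")
    ):
        return ("professionOf", parse["S"], None)
    return None


def extract_profession(apposition):
    """Answer side: an appositive pair (person NE, profession token)
    yields the fact professionOf(person, profession)."""
    person, profession = apposition
    return ("professionOf", person, profession)


def answer(asking_point, facts):
    """Fill the open slot of the asking point from matching facts."""
    relation, who, _ = asking_point
    return [f[2] for f in facts if f[0] == relation and f[1] == who]


if __name__ == "__main__":
    question = {"verb": "从事", "S": "李娜", "O": "运动", "O_mod": "什么"}
    facts = [extract_profession(("李娜", "网球运动员"))]  # hypothetical source text
    print(answer(question_asking_point(question), facts))
```

Because both sides are normalized to the same relation name, the match no longer depends on the question and the answer sharing any surface wording.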

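What if there is no dedicated knowledge graph and no extraction of predefined relations: how can QA cope? SVO parsing alone can handle a good share of questions about events. For relations and complex events, however, simple SVO matching is not enough. Fortunately, in principle most complex semantics can be predefined as IE targets and extracted specifically. Simple semantics is open-ended, and linguistic parsing (subject, predicate, object, attributive, adverbial, complement, etc.) is enough to handle it.

Heaven does not deceive us.

For SVO, IE is in essence (semantic) slot normalization. The original slots are linguistic: S, O, equiv (appositive), mod, and so on. The new slots carry pragmatic semantics, for example professionOf, locationOf, employeeOf, acquiringCompany, acquiredCompany, priceOfAcquisition, etc.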

QA by SVO matching can also be illustrated with an example, such as asking how to do something: do + something is simply a V+O:

0831a

0831b

0831c

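No matter how the wording changes, what stays constant is the VO (格式化 “format”, 硬盘 “hard disk”). With this VO matching as the foundation, QA or human-machine dialogue is not far off. For instance, an FAQ archive may well contain titles such as 格式化硬盘的步骤 (“steps for formatting a hard disk”) or 关于格式化硬盘 (“about formatting a hard disk”). Q and A then basically come down to SVO subtree matching: “格式化” —O—> “硬盘”.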
0901b
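
A minimal sketch of this VO matching, assuming a parser has already turned each question and each FAQ title into dependency triples of the form (head, relation, dependent). The triples, the relation labels other than O, and the third FAQ title are hypothetical illustrations, not output of the author's parser.

```python
def vo_pairs(triples):
    """Collect (verb, object) pairs from (head, relation, dependent) triples."""
    return {(head, dep) for head, rel, dep in triples if rel == "O"}


def match_faq(question_triples, faq):
    """Return the FAQ titles that share at least one VO pair with the question."""
    wanted = vo_pairs(question_triples)
    return [title for title, triples in faq.items() if wanted & vo_pairs(triples)]


# Hypothetical parses of FAQ titles.
faq = {
    "格式化硬盘的步骤": [("步骤", "Mod", "格式化"), ("格式化", "O", "硬盘")],
    "关于格式化硬盘": [("关于", "PObj", "格式化"), ("格式化", "O", "硬盘")],
    "安装打印机的步骤": [("步骤", "Mod", "安装"), ("安装", "O", "打印机")],
}

# Two phrasings of the same question; the VO pair is what stays constant.
q1 = [("格式化", "Mod", "怎样"), ("格式化", "O", "硬盘")]   # 怎样格式化硬盘
q2 = [("格式化", "Mod", "如何"), ("格式化", "O", "硬盘")]   # 硬盘如何格式化

hits1 = match_faq(q1, faq)
hits2 = match_faq(q2, faq)
```

The second question fronts the object, yet a deep parse still labels 硬盘 as O of 格式化, which is why both phrasings retrieve the same titles.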

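Let me take this topic a little further. IE stands for information extraction, and most of the time this information is equated with insights (intelligence, valuable information). But IE can extract valuable intelligence, and it can equally extract worthless information (noise).

Why extract worthless information? The reason is simple: noise gets in the way, and to remove it you first have to identify it, that is, extract it in order to throw it away. The method can be exactly the same. The search community has stop words, which are discarded as noise; that is the simplest form of noise, needing no context, purely high-frequency function words. For parsing these stop words are actually crucial, the bridges needed to build structure, but keyword search has no structure, so there they become pure noise. Using IE to remove noise means deciding from the contextual structure which information should be discarded. In the sentences above, for example, in a QA usage scenario one can strip out things like “请告诉我” (“please tell me”) and “我不知道” (“I don't know”), which brings out the key VO “格式化-硬盘”. For similarity computation, all of these words are noise. Could “请告诉我” be treated as a 4-gram stop word? Yes, but if such expressions come in many variants, n-grams stop working. At that point, doing IE over subtrees to extract the noise becomes very attractive. And since noise can mostly be handled in a word-driven way, this is reliable work that hits its target almost every time.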
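
As a rough sketch of word-driven noise removal over subtrees, the function below drops every edge headed by a noise-frame word and keeps the embedded content. The trigger list `NOISE_FRAMES`, the relation labels and the triples are hypothetical; the point is only that one word-driven rule covers variants that a fixed n-gram stop list would miss.

```python
NOISE_FRAMES = {"告诉", "知道"}  # hypothetical word-driven triggers


def strip_noise(triples):
    """Drop edges whose head is a noise-frame word; keep the embedded clause."""
    return [(h, r, d) for h, r, d in triples if h not in NOISE_FRAMES]


def vo_pairs(triples):
    """Collect (verb, object) pairs from (head, relation, dependent) triples."""
    return {(h, d) for h, r, d in triples if r == "O"}


# 请告诉我怎样格式化硬盘
polite = [("告诉", "Adv", "请"), ("告诉", "O", "我"), ("告诉", "C", "格式化"),
          ("格式化", "Mod", "怎样"), ("格式化", "O", "硬盘")]
# 我不知道怎么格式化硬盘
puzzled = [("知道", "S", "我"), ("知道", "Adv", "不"), ("知道", "C", "格式化"),
           ("格式化", "Mod", "怎么"), ("格式化", "O", "硬盘")]

clean_polite = strip_noise(polite)
clean_puzzled = strip_noise(puzzled)
```

Without the stripping step the first question would also yield the spurious VO pair (告诉, 我); after it, both questions reduce to the same key pair.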

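To sum up: in general, if Q and A are phrased similarly, e.g. “格式化” + “硬盘”, then matching on the SVO basis is enough to couple Q with A. If the phrasings differ a lot, or a relation or event has too many variants, add an IE layer and do the matching on the IE semantics. SVO-based QA matching is the essence of intelligent search and can handle unpredictable questions. IE-based QA matching is predefined and domain-specific; it is not only precise but can also cope with variants. The two approaches complement each other: one excels at in-domain precision, the other at open-domain breadth and recall. Both are far better than keywords, because they have structure. Viewed as a backoff scheme, IE comes first, SVO second, and keywords serve as the safety net. That way both precision and coverage are taken care of.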
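
The backoff order can be sketched as a simple cascade. Everything here is an assumed toy representation: each question and candidate carries optional sets of IE facts, SVO triples and keywords, and the first level that produces any hit wins.

```python
LEVELS = ("ie", "svo", "keywords")  # most precise level first


def match_with_backoff(question, candidates):
    """Try IE matching first, then SVO, then keywords; stop at the first hit."""
    for level in LEVELS:
        keys = question.get(level, set())
        hits = [c["id"] for c in candidates if keys & c.get(level, set())]
        if hits:
            return level, hits
    return None, []


# Hypothetical indexed answer candidates.
candidates = [
    {"id": "faq-1", "svo": {("格式化", "O", "硬盘")},
     "keywords": {"格式化", "硬盘", "步骤"}},
    {"id": "bio-1", "ie": {("professionOf", "李娜")},
     "keywords": {"李娜", "网球"}},
]

q_relation = {"ie": {("professionOf", "李娜")}, "keywords": {"李娜", "运动"}}
q_howto = {"svo": {("格式化", "O", "硬盘")}, "keywords": {"格式化", "硬盘"}}
q_vague = {"keywords": {"网球"}}

r1 = match_with_backoff(q_relation, candidates)
r2 = match_with_backoff(q_howto, candidates)
r3 = match_with_backoff(q_vague, candidates)
```

Returning the level together with the hits lets a caller report how much structure supported the match.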

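In the end, for QA and for dialogue systems, parsing is the key technology of the core engine. QA ultimately means establishing a mapping between Q and A, and the basis of that mapping is semantic matching. Deep parsing and the IE built on it are the nuclear weapon of semantic matching.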

 

【相关】

【Bots 的愿景】

立委科普:问答系统的前生今世

泥沙龙笔记:parsing 是引擎的核武器,再论NLP与搜索

泥沙龙笔记:从 sparse data 再论parsing乃是NLP应用的核武器

【立委科普:NLP核武器的奥秘】

问答系统

泥沙龙笔记:搜索和知识图谱的话题

【置顶:立委NLP博文一览】

《朝华午拾》总目录

立委NLP频道

立委硕士论文:EChA 试验结果 (11)

世界语到汉语和英语的自动翻译试验
— EChA机器翻译系统概述

 [参考书目]

 

  1. Heinz Dieter MAAS “Automata Tradukado en kaj el Esperanto” ( “Lingvo-kibernetiko kaj aliaj internacilingvaj aktoj de la IX-a Internacia Kongreso de Kibernetiko”, pp 75-81, 1982 Gunter Narr Verlag Tübingen )
  2. <<机器翻译论文选辑>> ( 科学技术文献出版社, 1979 )
  3. Kalocsay-Waringhien <<Plena Analiza Gramatiko de Esperanto>> ( 中国世界语出版社, 1984 )
  4. 刘涌泉等著 <<中国的机器翻译>> ( 知识出版社, 1984 )
  5. 刘涌泉, 高祖舜, 刘倬著 <<机器翻译浅说>> ( 科学普及出版社, 1964 )
  6. 刘涌泉, 李维 <<巴贝尔通天塔必将建成>> ( 中国第一届世界语大会论文, 1985.8 )
  7. 刘倬 <<三次机器翻译试验>> ( 第一次机器翻译学术会议论文, 1980.9 )
    <<论机器翻译规则系统的编制方法>> ( 1982.3 上海 )
    <<JFY型英汉机器翻译系统的研制和试验>> ( 语言学会第二届年会论文, 1983.4 )
  8. 乔毅 <<开展语言的计算机处理和世界语类型的机器翻译>> ( 中国第一届世界语大会论文, 1985.8 )
  9. 魏原枢, 徐文琪编 <<世界语语法>> ( 上海外语教育出版社, 1982 )
  10. 叶蜚声, 徐通锵著 <<语言学纲要>> ( 北京大学出版社, 1981 )
  11. <<语言和计算机>> (1) (中国社会科学出版社, 1982 )
  12. <<语言和计算机>> (2) (中国社会科学出版社, 1985 )
  13. 张道真编著 <<实用英语语法>> ( 商务印书馆, 1984 )

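[Acknowledgments]

From the very beginning, the development of an Esperanto-type machine translation system received the warm support of my teacher Liu Yongquan (刘涌泉), who gave careful guidance on everything from the overall design to the handling of specific problems. During programming and on-machine debugging, my teacher Liu Zhuo (刘倬) also offered guidance many times, and the algorithms for some basic operations were provided by him as well. Now that the EChA system has achieved its first results, the author expresses deep gratitude to them. Special thanks also go to Ms. Han (韩老师) of the computer room for her help in many ways. Without the convenience she provided, the EChA system could not possibly have been tested successfully in so short a time.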

[附录一] EChA试验结果

 

(1) LA ORIGINALA TEKSTO / THE ORIGINAL TEXT / 世界语原文

(001) TIEL EVOLUIGHIS PLI KAJ PLI LA PLANADO PER MASHINOJ . (002) TIUJ MASHINOJ KOMENCE NUR ELKALKULIS LA DIKTITAJN MATEMATIKAJN PROBLEMOJN , KONFORME AL LA ENPROGRAMIGO . (003) LA ELEKTRONIKAN PROGRAMIGON PRETIGIS HOMOJ . (004) PLI POSTE , KIAM LA SCIODISKETOJ ESTIS ELTROVITAJ , LA PLENAN INDIKARON , ENDISKIGITAN , ONI METIS EN MASHINOJN KAJ ILI TIAMANIERE POVIS EN SI MEM AKUMULI SCIENCAN STOKON , PLI GRANDAN OL LA HOMA CERBO. (005) KAJ SE TEMIS EKZEMPLE PRI LA PLANADO DE ELEKTROMOTORO , ONI ENMETIS LA SHABLONDISKETON DE LA ELEKTROMOTOR-PLANADO , DONIS LA INDIKOJN DE LA DEZIRATA MOTORO ( KILOVATO , TENSIO , ROTACIO , TIPO , KTP ) , (006) POST KIO LA MASHINO MEM PROGRAMIGIS SIN KAJ FARIS LA KALKULOJN . POST KELKAJ MINUTOJ GHI JAM PRETE ELDONIS LA MEZUROJN : LA DIAMETRON DE LA ROTACIA PARTO , GHIAN LONGON, LA MEZUROJN DE LA KANELOJ , DRATOJ , LA VOLVONOMBRON , ENTUTE CHION BEZONATAN . (007) ECH PLI : BALDAU ESTIS ATINGITE , KE LA MASHINO FARIS LA TUTAN DESEGNON KAJ TRANSDONIS GHIN AL LA FABRIKO . (008) KOMPRENEBLE TIUJ < DESEGNOJ > NE ESTIS IDENTAJ KUN NIAJ PAPERDESEGNOJ . (009) ILI ESTIS DISKETOJ , KIUJ ENTENIS CHIUN DETALON . (010) TIAMANIERE LA PLANADON KAJ FABRIKADON DE LA MASHINOJ JAM PLENUMIS SAME MASHINOJ . (011) ILI PLANIS LA MENDITAN MASHINON , FABRIKIS , ECH KONTROLPROVIS GHIN KAJ LA FUSHAN FORJHETIS . (012) SED CHIO CHI ANKORAU OKAZIS SUB HOMA GVIDADO KAJ PLEJ GRAVE ESTIS , KE CHIO CHI BAZIGHIS SUR LA HOMA SCIO .

LA TEKSTO TRADUKITA EN LA ANGLAN / THE TEXT TRANSLATED INTO ENGLISH / 英语译文

(001) SO DEVELOPED MORE AND MORE THE PLANNING BY MACHINES . (002) THOSE MACHINES AT BEGINNING ONLY CALCULATED OUT THE DICTATED MATHEMATICAL PROBLEMS , ACCORDING TO THE PROGRAMMING . (003) MEN PREPARED THE ELECTRONIC PROGRAMMING . (004) MORE LATER , WHEN THE KNOWLEDGE-DISKETTES HAD BEEN FOUND OUT , PEOPLE PUT THE FULL INDICATION , ENDISKED , INTO MACHINES AND THEY THEREFORE COULD IN THEMSELVES ACCUMULATE SCIENTIFIC STOCK, MORE GREAT THAN THE MAN’SBRAIN . (005) AND IF IT CONCERNED FOR EXAMPLE ABOUT THE PLANNING OF ELECTRIC MOTOR, PEOPLE INPUT THE SAMPLE DISKETTE OF THE MOTOR PLANNING , GAVE THE INDICATIONS OF THE DESIRED MOTOR (KILOWATT , VOLTAGE , ROTATION , TYPE , ETC ) , AFTER WHICH THE MACHINE ITSELF PROGRAMMED ITSELF AND DID THE CALCULATIONS . (006) AFTER SEVERAL MINUTES IT ALREADY READILY GAVE OUT THE MEASUREMENTS : THE DIAMETER OF THE ROTARY PART ,ITS LENGTH , THE MEASUREMENTS OF THE GROOVES , WIRES , THE WINDING NUMBER , IN TOTAL ALL REQUIRED . (007) EVEN MORE : SOON IT HAD BEEN ACHIEVED , THAT THE MACHINE DID THE TOTAL DESIGN AND OVERHANDED IT TO THE FACTORY . (008) OF COURSE THOSE < DESIGNS >  WERE NOT IDENTICAL WITH OUR PAPERDESIGNS . (009) THEY WERE DISKETTES , WHICH CARRIED ALL DETAIL . (010) THEREFORE MACHINES ALREADY FULFILED THE PLANNING AND MANUFACTURING OF THE MACHINES SAMELY . (011) THEY PLANNED THE ORDERED MACHINE , MANUFACTURED , EVEN EXAMINED IT AND THREW AWAY THE USELESS . (012) BUT ALL THIS STILL HAPPENED UNDER MAN’S GUIDING AND IT WAS MOST IMPORTANT , THAT ALL THIS WAS BASED ON THE MAN’S KNOWLEDGE .

LA TEKSTO TRADUKITA EN LA CHINAN / THE TEXT TRANSLATED INTO CHINESE / 汉语译文

(001) 这样用机器设计越来越发展了. (002) 那些机器开始时仅仅按照输入程序计算出所命令的数学问题. (003) 人准备了电子程序设计. (004) 更以后,当微型知识磁盘被发明了时,人们把所写入磁盘的全套指令集合放到机器里面,他(它)们这样能在自己本身里面积累比人的头脑更大的科学贮蓄. (005) 如果涉及例如关于电动机的设计, 人们输入了电动机设计的微型样品磁盘, 给了所希望的电动机的指标(千瓦,电压,运转,型号,等等),在此以后机器本身把自己程序化了,做了计算. (006) 在几分钟以后它已经就能给出尺寸:运转部分的直径,它的长度,槽纹,导线的尺寸,圈数,总之所需要的一切. (007) 甚至更:很快达到了,机器做了整个图样,把它转交到工厂. (008) 当然那些<图样>与我们的图纸不是一样的. (009) 他(它)们是储有所有细节的微型磁盘. (010) 这样机器已经同样地完成了机器的设计和制造. (011) 他(它)们设计了所定购的机器,制造了,甚至检验了它,把废的抛弃了. (012) 但是这一切仍然在人的指导下进行,最重要的是,这一切以人的知识作为基础.

(2) DIVERSAJ FRAZOJ / VARIOUS SENTENCES / 各类文句

(016) KIAM MI ESTIS LUDANTA VIOLONON , MIA ONKLO VIZITIS NIAN HEJMON .
WHEN I WAS PLAYING VIOLIN , MY UNCLE VISITED OUR HOME .
当我(当时)正在拉小提琴时,我的叔叔访问了我的家.

(020)  MI ESTOS FININTA LA EKSPERIMENTON PRI MASHINA TRADUKADO POST KELKAJ MONATOJ .
I WILL HAVE FINISHED THE EXPERIMENT ABOUT MACHINE’S TRANSLATING IN SEVERAL MONTHS.
我在几月以后将已经完成关于机器的翻译的实验.

(028)  BABELO NE ESTIS ELKONSTRUITA.
BABEL HAD NOT BEEN BUILT UP .
巴贝尔塔没有被建成.

(029)  NEPRE ESTOS ELKONSTRUITA LA NOVA BABELO .
ABSOLUTELY WILL HAVE BEEN BUILT UP THE NEW BABEL .
新巴贝尔塔必然地将被建成.

(040)  KIAL VI LERNAS ESPERANTON ?
WHY DO YOU LEARN ESPERANTO ?
为什么你学习世界语?

(044)  NE PROKRASTU LA HODIAUAN LABORON GHIS MORGAU .
DON’T PUT OFF THE TODAY’S WORK TILL TOMORROW .
别把今天的工作推迟到明天.

(045)  KIEL BONE PENTRAS LA KNABO !
HOW WELL THE BOY PAINTS !
男孩多么好地画画啊!

(048)  KIU ESTAS LA AUTORO DE LA LIBRO , KIUN VI JHUS LEGIS ?
WHO IS THE AUTHOR OF THE BOOK , WHICH YOU JUST READ ?
你刚刚读了的书的作者是谁?

(050)  SE MI PARTOPRENUS EN VIA AMUZA AKTIVADO , MI ESTUS TRE GHOJA .
IF I WOULD TAKE PART IN YOUR RECREATIONAL ACTIVITY , I WOULD BE VERY GLAD .
如果我参加你(们)的文娱活动,我会是很高兴的.

(056)  CHU VI MEMORAS LA TAGOJN , KIAM NI KUNE STUDIS EN LA UNIVERSITATO ?
DO YOU REMEMBER THE DAYS , WHEN WE TOGETHER STUDIED IN THE UNIVERSITY ?
你记得我们在一起在大学里面学习的日子吗?

(058)  UNUIGHU PROLETOJ DE CHIUJ LANDOJ !
LET PROLETARIANS OF ALL COUNTRIES UNITE !
让所有国家的无产者联合吧!

(061)  KIEL SAGHA VI ESTAS !
HOW WISE YOU ARE !
你是多么聪明啊!

(062)  ESPERANTO ESTAS INTERNACIA HELPA LINGVO .
ESPERANTO IS INTERNATIONAL HELP LANGUAGE .
世界语是国际辅助语言.

(067)  LIA PROPONO ESTAS , KE NI CHIUJ LIBERE ELMETU NIAJN OPINIOJN .
HIS PROPOSAL IS , THAT WE ALL FREELY OUTPUT OUR OPINIONS .
他的建议是,让我们所有人自由地提出我们的意见.

(068)  MI NE SCIAS , KIAM KOMENCIGHOS NIAJ FERIOJ .
I DON’T KNOW , WHEN WILL BEGIN OUR HOLIDAYS .
我不知道,我们的假日什么时候将开始.

(069)  LA LIBRO , KIU KUSHAS SUR LA TABLO , ESTAS VERDA .
THE BOOK , WHICH LIES ON THE TABLE , IS GREEN .
在桌子上躺的书是绿的.

(071)  LA INFANO PLORAS , CHAR IU LIN BATIS .
THE CHILD CRIES , BECAUSE SOMEBODY BEAT HIM .
小孩哭,因为某人打了他.

(078)  LERNI ESPERANTON NE ESTAS MALFACILE .
TO LEARN ESPERANTO IS NOT DIFFICULT .
学习世界语不是困难的.

(084)  MI NE SCIAS , CHU VI POVAS PLENUMI TIUN CHI TASKON .
I DON’T KNOW , WHETHER YOU CAN FULFIL THIS TASK .
我不知道,是否你能完成这个任务.

(086)  MULTAJ DIVERSLANDAJ ESPERANTISTOJ CHEESTOS LA UNIVERSALAN KONGRESON DE ESPERANTO OKAZONTAN PEKINE .
A LOT OF VARIOUS COUNTRY’S ESPERANTISTS WILL ATTEND THE UNIVERSAL CONGRESS OF ESPERANTO TO BE HELD IN BEIJING .
许多不同国家的世界语者将参加在北京将召开的世界语的国际大会.

(089)  LIA PROPONO ELEKTI NOVAN PREZIDANTON NE ESTIS AKCEPTITA .
HIS PROPOSAL TO ELECT NEW PRESIDENT HAD NOT BEEN ACCEPTED .
他的选举新总统的建议没有被接受.

(090)  SHI ESTAS LA PLEJ BELA EL LA KNABINOJ .
SHE IS THE MOST BEAUTIFUL OF THE GIRLS .
她在女孩里面是最漂亮的.

(092)  FALINTE , LI NE POVIS RELEVIGHI .
HAVING FALLEN , HE COULD NOT GET UP .
摔倒了,他不能重新起来.

(093)  FORIRONTE , LI PREMIS MIAN MANON .
TO GO AWAY , HE SHOOK MY HAND .
将要离去,他握了我的手.

(098)  MI TRE AMAS ESPERANTON , MI PLI AMAS ESPERANTISTOJN , MI PLEJ AMAS LA IDEALON DE ESPERANTO .
I VERY MUCH LOVE ESPERANTO , I MORE LOVE ESPERANTISTS , I MOST LOVE THE IDEAL OF ESPERANTO .
我很爱世界语,我更爱世界语者,我最爱世界语的理想.

(116)  NI LUDU , CHU BONE ?
LET’S PLAY , ALL RIGHT ?
让我们玩吧,好吗?

(119)  KIA MIRAKLO TIO ESTAS , KE NIAJ ANTIKVULOJ KONSTRUIS LA GRANDAN MURON NUR PER SIAJ DU MANOJ !
WHAT MIRACLE IT IS , THAT OUR ANCESTORS BUILT THE GREAT WALL ONLY BY THEIR TWO HANDS !
我们的祖先仅仅用自己的两手建造了长城,这是怎样的奇迹啊!

(121)  FORPASIS UNU TAGO , FORPASIS ANKAU LA DUA .
PASSED AWAY ONE DAY , PASSED AWAY ALSO THE SECOND .
一天过去了,第二也过去了.

(122)  CHU ESTAS EBLE , KE VI NENION SCIAS ?
IS IT POSSIBLE , THAT YOU KNOW NOTHING ?
你不知道任何事,这是可能的吗?

(131)  LA HOMON , PRI KIU VI PAROLAS , MI NENIAM VIDIS .
I NEVER SAW THE MAN , ABOUT WHOM YOU SPEAK .
我从未看见过你提到的人.

(132)  NI , ESPERANTISTOJ , DEVAS LABORI PLI ENERGIE OL IAM .
WE , ESPERANTISTS , MUST WORK MORE HARD THAN EVER .
我们,世界语者,应该比任何时候更努力工作.

(133)  SOMERE ESTAS TRE VARME .
IN SUMMER IT IS VERY HOT .
夏天是很热的.

(134)  DOKTORO ZAMENHOF NASKIGHIS LA 15-AN DE DECEMBRO EN 1859 .
DOCTOR ZAMENHOF WAS BORN ON THE 15TH OF DECEMBER IN 1859 .
柴门霍夫博士1859年十二月的15号出生.

(135)  SE VI SCIUS , KIU LI ESTAS , VI LIN PLI ESTIMUS .
IF YOU WOULD KNOW , WHO HE IS , YOU MORE WOULD ESTEEM HIM .
如果你知道,他是谁,你更会尊敬他.

(136)  CENTOJ DA MALFERMAJ AUTOJ NIN PORTIS AL LA CENTRA LENIN-STADIONO, MALRAPIDE MOVIGHANTE TRA LA HOMA SVARMO .
HUNDREDS OF OPEN CARS CARRIED US TO THE CENTRAL LENIN STADIUM , SLOWLY MOVING THROUGH THE MAN’S SWARM .
成百敞篷汽车把我们带到中央列宁运动场,缓慢地通过人群运动.

(137)  MI VIDIS , KE LI FALIS KAJ LIA VESTO MALPURIGHIS .
I SAW , THAT HE FELL AND HIS CLOTHES BECAME DIRTY .
我看见了,他摔倒了,他的衣服弄脏了.

(139)  MI SCIIS , KE LI NE FAROS , KION LI PROMESIS .
I KNEW , THAT HE WOULD NOT DO WHAT HE PROMISED .
我知道,他将不做他允诺的.

(140)  ESTAS PAULO , KIU ARANGHIS LA AFERON .
IT IS PAULO THAT ARRANGED THE AFFAIR .
是PAULO安排了事情.

(142)  KUREGIS LA KNABO PER SIA TUTA FORTO , SED LI NE POVIS ATINGI LA PAPILION .
RAN THE BOY BY HIS TOTAL STRENGTH , BUT HE COULD NOT ACHIEVE THE BUTTERFLY .
男孩用自己的整个力量狂奔,但是他不能达到蝴蝶.

(144)  LI DONIS AL MI MULTAJN INSTRUAJN LIBROJN .
HE GAVE ME A LOT OF TEACHING BOOKS .
他给了我许多教科书.

(145)  CHU VI PAROLAS CHINE AU JAPANE ?
DO YOU SPEAK IN CHINESE OR IN JAPANESE ?
你用中文还是用日文说话?

(151)  NUR TIU NE ERARAS , KIU NENIAM ION FARAS .
ONLY THAT PERSON IS NOT WRONG , WHO NEVER DOES SOMETHING .
仅仅从不做某事的那个人不犯错误.

(155)  ESPERANTO ESTAS CHIES PROPRAJHO .
ESPERANTO IS EVERYBODY’S PROPERTY .
世界语是所有人的财产.

(156)  MI MEMORAS CHIUN , KIUN MI VIDIS .
I REMEMBER ALL , WHOM I SAW .
我记得我看见了的所有人.

(157)  ESTAS NENIU EN LA CHAMBRO .
THERE IS NOBODY IN THE ROOM .
在房间里面没有任何人.

(3) DU POEMOJ / TWO POEMS / 两首诗歌

(099) LA ESPERO : ESPERANTISTA HIMNO ( POEMO FAR ZAMENHOF ) .

(100) EN LA MONDON VENIS NOVA SENTO ,
TRA LA MONDO IRAS FORTA VOKO ;
(101) PER FLUGILOJ DE FACILA VENTO ,
NUN DE LOKO FLUGU GHI AL LOKO .

(102) NE AL GLAVO SANGONSOIFANTA ,
GHI LA HOMAN TIRAS FAMILION ;
(103) AL LA MOND’ ETERNE MILITANTA ,
GHI PROMESAS SANKTAN HARMONION .
(099) THE HOPE : ESPERANTIST’S HYMN ( POEM BY ZAMENHOF ) .

(100) INTO THE WORLD CAME NEW FEELING ,
OVER THE WORLD GOES STRONG VOICE ;
(101) BY WINGS OF EASY WIND ,
NOW FROM PLACE LET IT FLY TO PLACE .
(102) NOT TO SWORD BLOODTHIRSTY ,
IT PULLS THE MAN FAMILY ;
(103) TO THE WORLD EVER FIGHTING ,
IT PROMISES SACRED HARMONY .

(099) 希望: 世界语者的颂歌 (柴门霍夫所作的诗歌).

(100) 新感觉来到了世界,
有力的声音走遍世界;
(101) 用顺风的翅膀,
现在让它从一个地方飞到另一个地方吧.

(102) 它不把人的家庭
引到渴血的刀剑;
(103) 向永远战争着的世界,
它允诺神圣的和谐.

(104) AL NIA KARA LINGVO ( FAR IU NOVA ESPERANTISTO ) .

(105) LA LINGVO GRACIA , KARA MIA ,
GHIS KIAM VI VENIS AL MI FINE FIN ?
(106) ATENDIS SOIFE MI , ETERNE VIA ,
MI AMAS VIN !

(107) MI AMAS VIN VERE , PRUVU DIO ,
KAJ MIA BON-KORO BATAS NUR POR VI ;
(108) NE PLU SEKRETETO ESTAS TIO :
VIN AMAS MI !

(109) CHU KREDAS VI MIAN AMON MARAN ?
(110) CHU KREDAS , KE MIA KORO FLAMAS ?
(111) CHU KREDAS LA VORTON PURE KARAN :
VIN MI AMAS !

(104) TO OUR DEAR LANGUAGE ( BY SOME NEW ESPERANTIST ) .

(105) THE LANGUAGE GRACEFUL , MY DEAR ,
TILL WHEN YOU CAME TO ME AT LAST ?
(106) WAITED LONGINGLY I , EVER YOURS ,
I LOVE YOU !

(107) I LOVE YOU TRUELY , LET GOD PROVE ,
AND MY GOOD HEART BEATS ONLY FOR YOU ;
(108) NO LONGER THAT IS LITTLE SECRET :
I LOVE YOU !

(109) DO YOU BELIEVE MY LOVE LIKE SEA ?
(110) DO BELIEVE , THAT MY HEART BURNS ?
(111) DO BELIEVE THE WORD PURELY DEAR :
I LOVE YOU !

(104) 献给我们的亲爱的语言(某新世界语者所作).

(105) 优美的语言,我的亲爱的,
到什么时候你最后来到了我这儿?
(106) 我渴望地等待,你的永远的,
我爱你!

(107) 我真实地爱你,让上帝证明吧,
我的善良的心仅仅为了你跳动;
(108) 那已经不再是小秘密:
我爱你!

(109) 你相信我的大海一样的爱吗?
(110) 相信,我的心燃烧吗?
(111) 相信纯粹地亲爱的词吗:
我爱你!

 

 

【相关】

硕士论文: 世界语到汉语和英语的自动翻译试验
立委硕士论文:1. EChA概况
立委硕士论文:2. 世界语: 语言学特点及其研究价值
立委硕士论文:3. 层次递归成分体系
立委硕士论文:4. EChA机器词典及词表
立委硕士论文:5. 世界语形态分析
立委硕士论文:6/7 世界语句法分析
立委硕士论文:8. 英语形态生成
立委硕士论文:9. 目标语调序
立委硕士论文:10. EChA 试验结果的分析
立委硕士论文【致谢】【参考书目】
立委硕士论文全文(世界语版)

《朝华午拾:shijie-师弟轶事(3)——疯狂世界语 》

灵感有如神授,巧夺岂止天工

《立委随笔:一小时学会世界语语法》

立委世界语文章 (1987): 《中国报道:通天塔必将建成》

立委世界语论文(1986): 《国际语到汉语和英语的自动翻译》

立委(1988)《世界科技:世界语到汉语和英语的自动翻译试验》

DLT项目背景介绍

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

【关于机器翻译】

【置顶:立委NLP博文一览】

《朝华午拾》总目录

Outline of an HPSG-style Chinese reversible grammar

Wei  LI
Simon Fraser University
(NLWC97)

This paper presents the outline and the design philosophy of a lexicalized Chinese unification grammar named W‑CPSG. W‑CPSG covers Chinese morphology, Chinese syntax and semantics in a novel integrated language model. The grammar works reversibly and is suited for both parsing and generation. This work is developed in the general spirit of the linguistic theory Head-driven Phrase Structure Grammar (Pollard & Sag 1994). We identify the following two problems as major obstacles in formulating a precise and efficient Chinese grammar. First, we lack serious study of the Chinese lexical base and often jump too soon to linguistic generalization. Second, there is a lack of effective interaction and adequate interface between morphology, syntax and semantics. We address these problems in depth with the lexicalized and integrated design of W‑CPSG. We will also illustrate how W‑CPSG is formalized and how it works.

 

  1. Background

Unification grammars have been extensively studied in the last decade (Shieber 1986). Implementations of such grammars for English are being used in a wide variety of applications. Attempts also have been made to write Chinese unification grammars (Huang 1986, among others). W‑CPSG (for Wei’s Chinese Phrase Structure Grammar, Li, W. 1997b) is a new endeavor in this direction, with its unique design and characteristics.

1.1. Design philosophy

We identify the following two problems as major obstacles in formulating a precise and efficient Chinese grammar. First, we lack serious study of the Chinese lexical base and often jump too soon to linguistic generalization. Second, there is a lack of effective interaction and adequate interface between morphology, syntax and semantics. We address these problems in depth with the lexicalized and integrated design of W‑CPSG.

1.1.1. Lexicalized design

It has been widely accepted that a well-designed lexicon is crucial for a successful grammar, especially for a natural language computational system. But Chinese linguistics in general and Chinese computational grammars in particular have generally lacked in-depth research on the Chinese lexical base. For many years, most dictionaries published in China did not even contain information on grammatical categories in the lexical entries (except for a few dictionaries intended for foreign readers learning Chinese). Compared with the sophisticated design and rich linguistic information embodied in English dictionaries like the Oxford Advanced Learner’s Dictionary and the Longman Dictionary of Contemporary English, Chinese linguistics is hampered by the lack of such reliable lexical resources.

In the last decade, however, Chinese linguists have achieved significant progress in this field. The publication of 800 Words in Contemporary Mandarin (Lü et al., 1980) marked a milestone for Chinese lexical research. This book is full of detailed linguistic description of the most frequently used Chinese words and their collocations. Since then, Chinese linguists have made fruitful efforts, marked by the publication of a series of valency dictionaries (e.g. Meng et al., 1987) and books  (e.g. Li, L. 1986, 1990). But almost all such work was done by linguists with little knowledge of computational linguistics. Their description lacks formalization and consistency. Therefore, Chinese computational linguists require patience in adapting and formalizing these results, making them implementable.

1.1.2. Integrated design

Most conventional grammars assume a successive model of morphology, syntax and semantics. We argue that this design is not adequate for Chinese natural language processing. Instead, an integrated grammar of morphology, syntax and semantics is adopted in W‑CPSG.

Let us first discuss the rationale for integrating morphology and syntax in Chinese grammar. As it stands, a written Chinese sentence is a string of characters (morphemes) with no blanks to mark word boundaries. Conventional systems use a procedure-based Chinese morphology preprocessor (a so-called segmenter). The major purpose of the segmenter is to identify a string of words to feed syntax. This is not an easy task, because segmentation can be ambiguous. For example, given a string of 4 Chinese characters da xue sheng huo, the segmentation ambiguity is shown in (1a) and (1b) below.

(1)        da xue sheng huo

(a)        da-xue                  | sheng-huo
           university              | life

(b)        da-xue-sheng            | huo
           university-student      | live

The resolution of the above ambiguity in the morphology preprocessor is a hopeless job because such structural ambiguity is syntactically conditioned. For sentences like da xue sheng huo you qu (university life is interesting), (1a) is the right identification. For sentences like da xue sheng huo bu xia qu le (university students cannot make a living), (1b) is right. So far there are no segmenters which can handle this properly and guarantee correct word segmentation (Feng 1996). In fact, there can never be such segmenters as long as syntax is not brought in. This is a theoretical defect of all Chinese analysis systems in the morphology-before-syntax architecture (Li, W. 1997a). I have solved this problem in our morphology-syntax integrated W‑CPSG (see 2.2. below).
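
The ambiguity in (1) can be reproduced with a small sketch that enumerates every way of covering the character string with lexicon entries. The four-entry lexicon and the function name are illustrative assumptions. A lexicon-only procedure returns both (1a) and (1b) and has no basis for choosing between them, which is the point made above.

```python
# Toy lexicon containing only the entries needed for example (1).
LEXICON = {"da-xue", "sheng-huo", "da-xue-sheng", "huo"}


def segmentations(chars, lexicon):
    """Return every way to split a character list into lexicon words."""
    if not chars:
        return [[]]
    results = []
    for i in range(1, len(chars) + 1):
        word = "-".join(chars[:i])
        if word in lexicon:
            for rest in segmentations(chars[i:], lexicon):
                results.append([word] + rest)
    return results


candidates = segmentations(["da", "xue", "sheng", "huo"], LEXICON)
```

Choosing between the two candidates requires the rest of the sentence (you qu versus bu xia qu le), that is, syntactic context that the segmenter does not see.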

Now we examine the motivation of integrating syntax and semantics in Chinese grammar. It has been observed that, compared with the analysis of Indo-European languages, proper Chinese analysis relies more heavily on semantic information (see, e.g. Chen 1996, Feng 1996). Chinese syntax is not as rigid as languages with inflections. Semantic constraint is called for in both structural and lexical disambiguation as well as in solving the problem of computational complexity.  The integration of syntax and semantics helps establish flexible ways of their interaction in analysis (see 2.3. below).

1.2. Major theoretical foundation: HPSG

The work on W‑CPSG is developed in the spirit of the linguistic theory Head-driven Phrase Structure Grammar (HPSG, proposed by Pollard & Sag, 1987). HPSG is a highly lexicalist theory, which encourages the integration of different components. This matches our design philosophy for implementing our Chinese computational grammar. HPSG serves as a desired framework to start this research with. We benefit most from the general linguistic ideas in HPSG. However, W‑CPSG is not confined to the theory-internal formulations of principles and rules and other details in HPSG versions (e.g. Pollard & Sag 1987, 1994 or later developments). We borrow freely from other theoretical sources or form our own theories in W‑CPSG to meet our goal of Natural Language Processing in general and Chinese computing in particular. For example, treating morphology as an integrated part of parsing and placing it right into grammar is our deliberate choice. In syntax, we formulate our own theory for configuration and word order. Our semantics differs most from any standard version of situation-semantics-based theory in HPSG. It is based on insights from Tesnière’s Dependency Grammar (Tesnière 1959), Fillmore’s Case Grammar (Fillmore 1968) and  Wilks’ Preference Semantics (Wilks 1975, 1978) as well as our own semantic view for knowledge representation and better coordination of syntax-semantics interaction (Li, W. 1996). For these differences and other modifications, it is more accurate to regard W‑CPSG as an HPSG-style Chinese grammar, rather than an (adapted) version of Chinese HPSG.

  2. Integrated language model

2.1. W‑CPSG versus conventional Chinese grammar

The lexicalized design sets the common basis for the organization of the grammar in W‑CPSG. This involves the interfaces of morphology, syntax and semantics.[1]   W‑CPSG assumes an integrated language model of its components (see Figure 1).  The W‑CPSG model is in sharp contrast to the conventional clear-cut successive design of grammar components (see Figure 2).

 

 lw1

Figure 2.  conventional language model (non-reversible)

2.2. Interfacing morphology and syntax

As shown in Figure 2 above, conventional  systems take a two-step approach: a procedure-based preprocessor for word identification (without discovering the internal structure) and a grammar for word-based parsing. W‑CPSG takes an alternative one-step approach and the parsing is character- (i.e. morpheme-) based. A morphological PS (phrase structure) rule is designed not only to identify candidate words but to build word‑internal structures as well. In other words, W‑CPSG is a self-contained model, directly accepting the input of a character string for parsing. The parse tree embodies both the morphological analysis and the syntactic analysis, as illustrated by the following sample parsing chart.

lw6

Note:    DET for determiner; CLA for classifier; N for noun; DE for particle de;
AF for affix; V for verb; A for adjective; CLAP for classifier phrase;
NP for noun phrase; DEP for DE-phrase

This is so-called bottom-up parsing. It starts with lexicon look-up. Simple edges 1 through 7 are lexical edges. Combined edges are phrasal edges. Each edge represents a sign, i.e. a character (morpheme), a word, a phrase or a sentence. Lexical edges result from a successful match between the signs in the input string and the entries in the lexicon during lexicon look-up. After lexicon look-up, the lexical information for the signs is made available to the parser. For the sake of concise illustration, we only show two crucial pieces of information for each edge in the chart, namely category and interpretation, separated by a colon (some function words are only labeled for category). The parser attempts to combine the edges according to PS rules in the grammar until a parse is found. A parse is an edge which ranges over the whole string. The parse ((((1+2)+3)+4)+((5+6)+7)) represents the following binary structural tree embodying both the morphological and syntactic analysis of this NP phrase.

lw5

As seen, word identification is no longer a pre-condition for parsing. It becomes a natural by-product of parsing in this integrated grammar of morphology and syntax: a successful parse always embodies the right word identification. For example, the parse ((((1+2)+3)+4)+((5+6)+7)) includes the identification of the word string zhe (DET) ben (CLA) shu (N) de (DE) ke-du-xing (N). An argument against the conventional separation model is that the two-step approach has a theoretical ceiling on the precision of word identification. This is because proper word identification in Chinese is to a considerable extent syntactically conditioned, due to the possible structural ambiguity involved. Our strategy has advantages over the conventional approach in resolving word identification ambiguities and in handling productive word formation. It has solved the problems inherent in the morphology-before-syntax architecture (for detailed argumentation, see Li, W. 1997a).
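
A character-based bottom-up chart parse of this example can be sketched as follows. This is not W‑CPSG itself: the binary rules and the category assigned to each character are assumptions reconstructed from the category labels in the note and from the bracketing ((((1+2)+3)+4)+((5+6)+7)), and plain category symbols stand in for the feature structures and constraints of the real grammar.

```python
# Assumed lexical categories for the seven characters.
LEXICON = {"zhe": {"DET"}, "ben": {"CLA"}, "shu": {"N"}, "de": {"DE"},
           "ke": {"AF"}, "du": {"V"}, "xing": {"AF"}}

# Assumed binary rules; morphological and syntactic rules share one table.
RULES = {
    ("DET", "CLA"): "CLAP",  # zhe ben
    ("CLAP", "N"): "NP",     # zhe ben shu
    ("NP", "DE"): "DEP",     # zhe ben shu de
    ("AF", "V"): "A",        # ke-du (readable), word-internal
    ("A", "AF"): "N",        # ke-du-xing (readability), word-internal
    ("DEP", "N"): "NP",      # the whole noun phrase
}


def parse(chars):
    """Fill a chart bottom-up; each edge is a (category, tree) pair."""
    n = len(chars)
    chart = {}
    for i, ch in enumerate(chars):  # lexical edges
        chart[(i, i + 1)] = {(cat, ch) for cat in LEXICON.get(ch, ())}
    for width in range(2, n + 1):   # phrasal edges, shortest spans first
        for start in range(n - width + 1):
            end = start + width
            cell = set()
            for mid in range(start + 1, end):
                for lcat, ltree in chart[(start, mid)]:
                    for rcat, rtree in chart[(mid, end)]:
                        parent = RULES.get((lcat, rcat))
                        if parent:
                            cell.add((parent, (ltree, rtree)))
            chart[(start, end)] = cell
    return chart


chars = ["zhe", "ben", "shu", "de", "ke", "du", "xing"]
chart = parse(chars)
full_parses = chart[(0, len(chars))]
```

The edge of category N over ke du xing appears in the chart as part of the same process that builds the NP, so the word ke-du-xing is identified by the parse rather than by a separate segmenter.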

2.3. Interaction of syntax and semantics

The interface and interaction of syntax and semantics are of vital importance in a Chinese grammar. We share the opinion of Chen (1996) and many others that it is more effective to analyze Chinese in an environment where semantic constraints are enforced during parsing, not after. The argument is based on the linguistic characteristics of Chinese. Chinese has no inflection (like English ‑’s, ‑s, ‑ing, ‑ed, etc.) and no such formatives as the article (like English a, the), the infinitivizer (like English to) and the complementizer (like English that). Instead, function words and word order are used as the major syntactic devices. But Chinese function words (prepositions, aspect particles, passive particle, plural suffix, conjunctions, etc.) can often be omitted (Lü et al. 1980, p.2). Moreover, the fixed word order that is usually assumed to mark syntactic functions in isolating languages is to a considerable extent absent in Chinese. In fact, there is remarkable freedom or flexibility in Chinese word order. One typical example is the numerous word order variations of the Chinese transitive patterns, although the default order is S‑V‑O (subject-verb-object) (Li, W. 1996). Taken together, these facts paint a picture of Chinese as a language with loose syntactic constraints. A weak syntax requires some support beyond syntax to enhance grammaticality. Semantic constraints are therefore called for. I believe that an effective way to model this interaction between syntax and semantics is to integrate the two in one grammar.

One strong piece of evidence for this syntax-semantics integration argument is that Chinese has what I call syntactically crippled structures. These are structures which can hardly be understood on purely formal grounds and are usually judged ungrammatical unless supported by semantic constraints (i.e. the match of semantic selection restrictions). Some Chinese NP predicates (Li, W. & McFetridge 1995) and transitive patterns like S‑O‑V (Li, W. 1996), among others, are such structures. The NP predicate is a typical instance of semantic dependence. It is highly undesirable to assume a general rule like S –> NP1 NP2 in a Chinese grammar to capture such phenomena, because there is a semantic condition for NP2 to function as predicate, which makes the Chinese NP predicate a very restricted pattern. For example, in the sentence This table is three-legged: zhe (this) zhang (classifier) zhuo-zi (table) san (three) tiao (classifier) tui (leg), the subject must be of the semantic type animate or furniture (which can have legs). The general rule with no recourse to semantic constraints is simply too productive and may cause severe computational complexity. In the case of Chinese transitive patterns, formal means are decisive for some variations in their interpretation (i.e. role assignment) process, but others depend heavily on semantic constraints. Take chi (eat) as an example. There is no difference in syntactic form in sentences like wo (I) chi (eat) dianxin (Dim-Sum) le (perfect-aspect) and dianxin (Dim-Sum) wo (I) chi (eat) le (perfect-aspect). Who eats what? To properly assign roles to NP1 NP2 V as S-O-V versus O-S-V, the semantic constraint animate eats food needs to be enforced.
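
The role assignment for NP1 NP2 V can be sketched with a toy selection restriction. The dictionaries `SEM_TYPE` and `SELECTION` and the function name are hypothetical stand-ins for the lexicalized semantic constraints described above, and the order wo dianxin chi is a constructed S-O-V counterpart of the O-S-V example in the text.

```python
# Hypothetical semantic types and selection restrictions.
SEM_TYPE = {"wo": "animate", "dianxin": "food"}
SELECTION = {"chi": {"agent": "animate", "patient": "food"}}  # animate eats food


def assign_roles(np1, np2, verb):
    """Resolve NP1 NP2 V as S-O-V or O-S-V using selection restrictions."""
    want = SELECTION[verb]
    readings = []
    for subj, obj, pattern in ((np1, np2, "S-O-V"), (np2, np1, "O-S-V")):
        if (SEM_TYPE.get(subj) == want["agent"]
                and SEM_TYPE.get(obj) == want["patient"]):
            readings.append({"pattern": pattern, "agent": subj, "patient": obj})
    return readings


osv = assign_roles("dianxin", "wo", "chi")  # dianxin wo chi le
sov = assign_roles("wo", "dianxin", "chi")  # wo dianxin chi le
```

Both orders have the same form NP NP V; only the semantic check decides which NP is the eater.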

The conventional syntax-before-semantics model has now lost popularity in the Chinese computing community. Researchers have been exploring various ways of integrating syntax and semantics in Chinese grammar (Chen 1996). In W‑CPSG, the Chinese syntax is enhanced by the incorporation of a semantic constraint mechanism. This mechanism embodies a lexicalized knowledge representation, which parallels the syntactic representation in the lexicon. I have developed a way to dynamically coordinate the syntactic constraint and semantic constraint in one model. This technique proves to be effective in handling rhetorical expressions and in making the grammar both precise and robust (Li, W. 1996).

 

  3. Lexicalized formal grammar

3.1. Formalized grammar

The applied nature of this research requires that we pay equal attention to practical issues of computational systems and to a sound theoretical design. All theories and rule formulations in W‑CPSG are implementable. In fact, most of them have been implemented in our prototype W‑CPSG. W‑CPSG is a strictly formalized grammar that does not rely on undefined notions. The whole grammar is represented by typed feature structures (TFS), as defined below based on Carpenter & Penn (1994).

(3)        Definition: typed feature structure 

A typed feature structure is a data structure adopted to model a certain object of a grammar. The necessary part for a typed feature structure is type. Type represents the classification of the feature structure. A simple feature structure contains only the type information, but a complex feature structure can introduce a set of feature-value pairs in addition to the type. A feature-value pair consists of a feature and a value. A feature reflects one aspect of an object. The value describes that aspect. A value is itself a feature structure (simple or complex). A feature determines which type of feature structures it takes as its value. Typed feature structures are finite in a grammar. Their definition constitutes the typology of the grammar.
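
Definition (3) can be made concrete with a minimal sketch. The class `TFS`, the table `APPROPRIATE` and the check `well_typed` are illustrative names rather than part of W‑CPSG or of the Carpenter & Penn system; the three feature and value-type pairs are taken from the definition of a_sign in (4), and subtyping is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class TFS:
    """A typed feature structure: a type plus optional feature-value pairs."""
    type: str
    features: dict = field(default_factory=dict)  # feature name -> TFS

    def is_simple(self) -> bool:
        """A simple feature structure carries only its type."""
        return not self.features


# Fragment of a typology: which value type each feature of a type takes.
APPROPRIATE = {
    "a_sign": {"KANJI": "kanji", "MORPH": "expected", "CATEGORY": "category"},
}


def well_typed(fs: TFS) -> bool:
    """Check that every feature is declared for the type and has the declared value type."""
    allowed = APPROPRIATE.get(fs.type, {})
    return all(
        allowed.get(name) == value.type and well_typed(value)
        for name, value in fs.features.items()
    )


sign = TFS("a_sign", {"MORPH": TFS("expected"), "CATEGORY": TFS("category")})
bad = TFS("a_sign", {"CATEGORY": TFS("expected")})
```

Because a value is itself a feature structure, the check recurses, mirroring the statement that a feature determines which type of feature structure it takes as its value.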

With this formal device of typed feature structures, we formulate W‑CPSG by defining from the very basic notions (e.g. sign, morpheme, word, phrase, S, NP, VP, etc.) to rules (PS rules and lexical rules), lexical items, lexical hierarchy and typology (hierarchy embodied in feature structures) (Li, W. 1997b). The following sample definitions of some basic notions illustrate the formal nature of W‑CPSG. Please note that they are system-internal definitions and are used in W‑CPSG to serve the purpose of configurational constraints (see Chapter VI of Li, W. 1997b).

(4)        Definition: sign [2]

a_sign
KANJI kanji
MORPH expected
CATEGORY category
COMP0 expected
COMP1 expected
COMP2 expected
MOD expected
KNOWLEDGE knowledge
CONTENT content
DTR dtr

A sign is the most fundamental concept of grammar. A sign is a dynamic unit of grammatical analysis. It can be a morpheme, a word, a phrase or a sentence. Formally, a sign is defined by the TFS a_sign, which introduces a set of linguistic features for its description, as shown above. These features include the orthographic feature KANJI; morphological feature MORPH; syntactic features CATEGORY, COMP0, COMP1, COMP2, and MOD; structural feature (for both morphology and syntax) DTR; semantic features KNOWLEDGE and CONTENT.

(5)        Definition: morpheme

a_sign
MORPH ~saturated

A morpheme is a sign whose morphological expectation has not been saturated. In W‑CPSG, ~saturated is equivalent to obligatory/optional/null. For example, the suffix ‑xing (‑ness) is such a morpheme whose morphological expectation for a preceding adjective is obligatory.  In W‑CPSG, a morpheme like ‑xing (‑ness) ceases to be a morpheme when its obligatory expectation, say the adjective ke-du (readable), is saturated. Therefore, the sign ke-du-xing (readability) is not a morpheme, but becomes a word per se.

(6)        Definition: word

a_sign
MORPH ~obligatory
DTR no_syn_dtr

In W‑CPSG, ~obligatory is equivalent to saturated/optional/null. The specification [MORPH ~obligatory] defines a syntactic sign, i.e. a sign whose obligatory morphological expectation has been saturated. A word is a syntactic sign with no syntactic daughters, i.e. [DTR no_syn_dtr]. Obviously, word with [MORPH saturated/optional/null] overlaps morpheme with [MORPH obligatory/optional/null] in cases when the morphological expectation is optional or null.

Just like the overlapping of morpheme and word, there is also an intersection between word and phrase. Compare the following definition of phrase with the above definition of word.

(7)        Definition: phrase

a_sign
MORPH ~obligatory
COMP0 ~obligatory
COMP1 ~obligatory
COMP2 ~obligatory 

A phrase is a syntactic sign whose obligatory complement expectation has all been saturated, i.e. [COMP0 ~obligatory, COMP1 ~obligatory, COMP2 ~obligatory]. When a word has only optional complement expectation or no complement expectation, it is also a phrase. The overlapping relationship among morpheme, word and phrase can be shown by the following illustration of the three sets.

[Figure omitted: the overlapping sets of morpheme, word and phrase]

S is a syntactic sign satisfying the following 3 conditions: (1) its category is pred (which includes V and A); (2) its comp0 is saturated; (3) its obligatory comp1 and comp2  are saturated.
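The definitions (5) to (7) and the conditions on S can be read as simple predicates over the expectation features. The following is only a minimal sketch of that reading, not the ALE encoding used in W‑CPSG: the `Sign` class, the string constants and the function names are assumptions made for illustration, and the type hierarchy is flattened into plain strings.

```python
from dataclasses import dataclass

# Possible states of an expectation feature (MORPH, COMP0, COMP1, COMP2).
OBLIGATORY, OPTIONAL, NULL, SATURATED = "obligatory", "optional", "null", "saturated"


@dataclass
class Sign:
    """Hypothetical flat stand-in for the typed feature structure a_sign."""
    category: str = "n"
    morph: str = NULL
    comp0: str = NULL
    comp1: str = NULL
    comp2: str = NULL
    has_syn_dtr: bool = False  # False corresponds to [DTR no_syn_dtr]


def is_morpheme(s: Sign) -> bool:
    # (5): [MORPH ~saturated], i.e. obligatory/optional/null
    return s.morph != SATURATED


def is_syntactic_sign(s: Sign) -> bool:
    # [MORPH ~obligatory], i.e. saturated/optional/null
    return s.morph != OBLIGATORY


def is_word(s: Sign) -> bool:
    # (6): a syntactic sign with no syntactic daughters
    return is_syntactic_sign(s) and not s.has_syn_dtr


def is_phrase(s: Sign) -> bool:
    # (7): a syntactic sign with no obligatory complement expectation left
    comps = (s.comp0, s.comp1, s.comp2)
    return is_syntactic_sign(s) and all(c != OBLIGATORY for c in comps)


def is_sentence(s: Sign) -> bool:
    # S: category pred, comp0 saturated, no obligatory comp1/comp2 left
    return (is_syntactic_sign(s) and s.category == "pred"
            and s.comp0 == SATURATED
            and s.comp1 != OBLIGATORY and s.comp2 != OBLIGATORY)


xing = Sign(morph=OBLIGATORY)        # the suffix -xing before it combines
ke_du_xing = Sign(morph=SATURATED)   # ke-du-xing after saturation
print(is_morpheme(xing), is_word(xing))              # True False
print(is_morpheme(ke_du_xing), is_word(ke_du_xing))  # False True
```

A sign whose MORPH value is optional or null satisfies both `is_morpheme` and `is_word`, which is the overlap between the two sets described above.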

3.2. Lexicalized grammar

W‑CPSG takes a radical lexicalist approach. We started with individual words in the lexicon and have gradually built up a lexical hierarchy and the grammar prototype.

W‑CPSG consists of two parts: a minimized general grammar and an information-enriched lexicon. The general grammar contains only 11 PS rules, covering complement structure, modifier structure, conjunctive structure and morphological structure. We formulate one PS rule, the comp0 rule, for illustration.

[Figure omitted: formulation of the comp0 PS rule]

This comp0 PS rule is similar to the rule S ==> NP VP in the conventional phrase structure grammar. The feature COMP0 represents the expectation of the head daughter for its external complement (subject or specifier) on its left side, i.e. [DIRECTION left]. The nature of its expected comp0, NP or other types of sign, is lexically decided by the individual head (hence head-driven or lexicon-driven). It will always be warranted by the general grammar, here via the index [3]. This is the nature of lexicalized grammars. PS rules in such grammars are very abstract. Essentially, they say one thing, namely, 2 signs can combine so long as the lexicon so indicates. The indices [1] and [2] represent configurational constraint. They ensure that internal obligatory complements COMP1 and COMP2 must be saturated before this rule can be applied. Finally, Head Feature Principle (defined elsewhere in the grammar based on the adaptation of the Head Feature Principle in HPSG, Pollard & Sag, 1994) ensures that head features are percolated up from the head daughter to the mother sign.

The lexicon houses lexical entries with their linguistic description and knowledge representation. Potential morphological structures, as well as potential syntactic structures, are lexically encoded (in the feature MORPH for the former and in the features COMP0, COMP1, COMP2, MOD for the latter). Our knowledge representation is also embodied in the lexicon (in the feature KNOWLEDGE). I believe that this is an effective and realistic way of handling natural language phenomena and their disambiguation without having to resort to an encyclopedia-like knowledge base. The following sample formulation of the lexical entry chi (eat) projects a rough picture of what the W‑CPSG lexicon looks like.

[Figure omitted: sample lexical entry for chi (eat)]

The lexicon also contains lexical generalizations. The  generalizations are captured by the inheritance of the lexical hierarchy and by a set of lexical rules. Due to space limitations, I will not show them in this paper.

4. Implementation and application of W‑CPSG

A substantial Chinese computational grammar has been implemented in the W‑CPSG prototype. It covers all basic Chinese constructions. Particular attention is paid to the handling of function words and verb patterns. On the basis of the information-enriched lexicon and the general grammar, the system adequately handles the relationship between linguistic individuality and generality. The grammar formalism which I use to code W‑CPSG is ALE, a grammar compiler on top of Prolog, developed by Carpenter & Penn (1994). ALE is equipped with an inheritance mechanism on typed feature structures, a powerful tool in grammar modeling. I have made extensive use of this mechanism in the description of lexical categories as well as in knowledge representation. This seems to be an adequate way of capturing the inherent relationship between features in a grammar. Prolog is a programming environment particularly suitable for the development of unification and reversible grammars (Huang 1986, 1987). ALE compiles W‑CPSG into a Chinese parser, a Prolog program ready to accept a string of characters for analysis. In the first experiment, W‑CPSG has parsed a corpus of 200 Chinese sentences of various types.

An important benefit of a unification-based grammar is that the same grammar can be used both for parsing and generation. Grammar reversibility is a highly desired feature for multi-lingual machine translation application. Following this line, I have successfully applied W‑CPSG to the experiment of bi-directional machine translation between English and Chinese. The machine translation system developed in our Natural Language Lab is based on the shake-and-bake design (Whitelock 1992, 1994). I used the same three grammar modules (W‑CPSG, an English grammar and a bilingual transfer lexicon) and the same corpus for the experiment. As part of machine translation output, W‑CPSG has successfully generated the 200 Chinese sentences. The experimental results meet our design objective and verify the feasibility of our approach.

 

References

 

Carpenter, B. & Penn, G. (1994): ALE, The Attribute Logic Engine, User’s Guide

Chen, K-J.  (1996): “Chinese sentence parsing” Tutorial Notes for International Conference on Chinese Computing ICCC’96, Singapore

Feng, Z-W.  (1996): “COLIPS lecture series – Chinese natural language processing”,  Communications of COLIPS, Vol. 6, No. 1 1996, Singapore

Fillmore, C. J. (1968): “The case for case”. Bach and Harms (eds.), Universals in Linguistic Theory. Holt, Rinehart and Winston, pp. 1-88.

Huang, X-M. (1986): “A bidirectional grammar for parsing and generating Chinese”.  Proceedings of the International Conference on Chinese Computing, Singapore, pp. 46-54

Huang, X-M. (1987): XTRA: The Design and Implementation of A Fully Automatic Machine Translation System, Doctoral dissertation, University of Essex.

Li, L-D. (1986): Xiandai Hanyu Juxing (Sentence Patterns in Contemporary Mandarin), Shangwu Yinshuguan, Beijing

Li, L-D. (1990): Xiandai Hanyu Dongci (Verbs in Contemporary Mandarin), Zhongguo Shehui Kexue Chubanshe, Beijing

Li, W. & P. McFetridge (1995): “Handling Chinese NP predicate in HPSG”, Proceedings of PACLING-II, Brisbane, Australia

Li, W. (1996): “Interaction of syntax and semantics in parsing Chinese transitive patterns”, Proceedings of International Conference on Chinese Computing (ICCC’96), Singapore

Li, W. (1997a): “Chart parsing Chinese character strings”, Proceedings of The Ninth North American Conference on Chinese Linguistics (NACCL-9, to be available), Victoria, Canada

Li, W. (1997b): W‑CPSG: A Lexicalized Chinese Unification Grammar, Doctoral dissertation, Simon Fraser University (on-going)

Lü, S-X. et al. (ed.) (1980): Xiandai Hanyu Babai Ci (800 Words in Contemporary Mandarin), Shangwu Yinshuguan, Beijing

Meng, Z., H-D. Zheng, Q-H. Meng, & W-L. Cai (1987): Dongci Yongfa Cidian (Dictionary of Verb Usages), Shanghai Cishu Chubanshe, Shanghai

Pollard, C.  & I. Sag (1987): Information based Syntax and Semantics Vol. 1: Fundamentals. Centre for the Study of Language  and Information, Stanford University, CA

Pollard, C.  & I. Sag (1994): Head-Driven Phrase Structure Grammar,  Centre for the Study of Language and Information, Stanford University, CA

Shieber, S. (1986): An Introduction to Unification-Based Approaches to Grammar. Centre for the Study of Language  and Information, Stanford University, CA

Tesnière, L. (1959): Éléments de Syntaxe Structurale, Paris: Klincksieck

Whitelock, Pete (1992): “Shake and bake translation”, Proceedings of the 14th International Conference on Computational Linguistics, pp. 784-790, Nantes, France.

Whitelock, Pete (1994). “Shake and bake translation”, C.J. Rupp, M.A. Rosner, and R.L. Johnson (eds.), Constraints, Language and Computation, pp. 339-359, London, Academic Press.

Wilks, Y.A. (1975). “A preferential pattern-seeking semantics for natural language inference”.  Artificial Intelligence, Vol. 6, pp. 53-74.

Wilks, Y.A. (1978). “Making preferences more active”.  Artificial Intelligence, Vol. 11,  pp. 197-223

 

————————————-

* This project was supported by the Science Council of British Columbia, Canada under G.R.E.A.T. Award (code: 61) and by my industry partner TCC Communications Corporation, British Columbia, Canada. I thank my academic advisors Paul McFetridge and Fred Popowich and my industry advisor John Grayson for their supervision and encouragement. Thanks also go to my colleagues Davide Turcato, James Devlan Nicholson and Olivier Laurens for their help during the implementation of this grammar in our Natural Language Lab. I am also grateful to the editors of the NWLC’97 Proceedings for their comments and corrections.

[1] We leave aside the other components such as discourse, pragmatics, etc. They are an important part of a grammar for a full analysis of language phenomena, but they are beyond what can be addressed in this research.

[2] In formulating W‑CPSG, we use uppercase for feature and lowercase for type; ~ for logical not and / for logical or; number in square brackets for unification.

 

[Related]

Outline of An HPSG-style Chinese Reversible Grammar ABSTRACT

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

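Liwei's master's thesis: Target-language reordering (9)

An experiment in automatic translation from Esperanto into Chinese and English
(overview of the EChA machine translation system)

Target-language reordering

Some local reordering has already been done in the function-word pass and the morphological-generation pass described earlier, with the reordered words given the same index. For example:

CHIO (一切) CHI (这) —-> 这一切 (012);
DOKTORO (博士) ZAMENHOF (柴门霍夫) —-> 柴门霍夫博士 (134)

The reordering required by English questions and negative sentences is carried out together with morphological generation. For example:

NE (NOT) ESTIS (WERE) —-> WERE NOT (008)

CHU VIA (YOUR) AMIKO (FRIEND) ESTAS (IS) KURACISTO (DOCTOR) ?
—-> IS YOUR FRIEND DOCTOR ? (039)

From the second synthesis pass onward, the system looks at the sentence as a whole and performs reduction and reordering bottom-up, separately for each target language. With the CDC and the reordering subroutine in place, the reduction-based generation algorithm for a target language is simple. Its basic steps are:

(1) Take the words one by one from the beginning of the sentence to the end, skipping words already processed and non-terminal nodes.
(2) If the word's layer number is one and its right link is zero, reduction has reached the top-level main head and processing of the sentence is complete.
(3) If the word needs reordering, enter the reordering subroutine.
(4) Mark the word as processed and decide, as the case requires, whether to give it the index of its head word.
(5) Enter a subroutine that checks whether all sister words of this word have also been processed.
(6) If so, give the word and all its sister words the index of the head word, and mark the head word as a terminal node.
(7) Return to step (1).

For English the problem is particularly simple: only one situation requires reordering, namely a preposed object or postposed subject of a transitive predicate. (A postposed subject in an intransitive sentence needs no reordering.) Chinese is much more complicated. The main rules are:

(1) The subject of existential “有” (ESTI) should be placed after the predicate. Otherwise, postposed subjects (including most subject clauses) are always moved forward.

(2) In sentences whose predicate is a Chinese transitive verb that requires “把”, “使”, etc., the object, once “把” or “使” has been added, is placed before the predicate. Otherwise, preposed objects are always moved after the predicate.

(3) A postposed attributive clause is not moved forward in two cases: 1. the emphatic pattern ESTAS + X, KIU; 2. attributive clauses longer than 15 words. All other postposed attributives are moved forward. The relative position of sister attributives is determined mainly by their semantic features and is implemented by giving or withholding the shared index during reordering.

(4) Adverbial clauses generally stay in place (although a postposed temporal clause is better moved forward). All other postposed adverbials are moved forward. The relative position of sister adverbials is handled on the same principle as above.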

 

 

【相关】

硕士论文: 世界语到汉语和英语的自动翻译试验
立委硕士论文:1. EChA概况
立委硕士论文:2. 世界语: 语言学特点及其研究价值
立委硕士论文:3. 层次递归成分体系
立委硕士论文:4. EChA机器词典及词表
立委硕士论文:5. 世界语形态分析
立委硕士论文:6/7 世界语句法分析
立委硕士论文:8. 英语形态生成
立委硕士论文:9. 目标语调序
立委硕士论文:10. EChA 试验结果的分析
立委硕士论文【致谢】【参考书目】
立委硕士论文全文(世界语版)

《朝华午拾:shijie-师弟轶事(3)——疯狂世界语 》

灵感有如神授,巧夺岂止天工

《立委随笔:一小时学会世界语语法》

立委世界语文章 (1987): 《中国报道:通天塔必将建成》

立委世界语论文(1986): 《国际语到汉语和英语的自动翻译》

立委(1988)《世界科技:世界语到汉语和英语的自动翻译试验》

DLT项目背景介绍

立委硕士论文全文(世界语版)

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

【关于机器翻译】

【置顶:立委NLP博文一览】

《朝华午拾》总目录

Liwei's master's thesis: Analysis of the EChA experimental results (10)


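Analysis of the EChA experimental results

Overall, the results of this experiment are quite satisfactory. The translations are not merely readable; most of them are fluent. Because we paid a good deal of attention to rhetoric, the output does not sound very machine-made. This is, of course, only a small-scale experiment. Although we tried to cover the linguistic phenomena that are likely to occur, it is hard to say what problems a larger experiment will reveal. Fortunately the system is fairly easy to maintain and improve.

In the second poem there are two places (110)(111) where a question was mistranslated as an English emphatic sentence:

CHU kredas la vorton pure karan: vin mi amas! (111)
DO BELIEVE the word purely dear: I love you!
Cf: 相信纯粹地亲爱的词吗:我爱你!

This is because the original line, for the sake of rhythm, omits the subject VI (YOU), which is understood from the preceding context. Interestingly, the emphatic rendering does little harm to the sense of the poem.

When EChA first ran on the machine, we concentrated on testing the feasibility and soundness of the core design and neglected rhetoric. The early output (December 1985) was rather rough; compared with it, the later output (February 1986) is clearly improved. For example:

  1. Insertion of the formal subject IT (007)(012)(077)(122)(125)(133):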

Sed chio chi ankorau okazis sub homa gvidado kaj PLEJ GRAVE ESTIS, KE chio chi bazighis sur la homa scio. (012)

1) But all this still happened under man’s guiding and MOST IMPORTANT WAS, THAT all this was based on the man’s knowledge.

2) But all this still happened under man’s guiding and IT WAS MOST IMPORTANT, THAT all this was based on the man’s knowledge.

  2. Distinguishing the infinitive with TO from the bare infinitive (004)(019)(072)(078)(083)(084)(088)(089)(092)(095)(132)(142)(146):

LABORI estas necese.(072)
1) (TO) WORK is necessary.
2) TO WORK is necessary.
工作是必要的.

  3. Double objects (128)(143)(144):

Donu AL mi iom da kafo! (128)
1) Give TO me a little coffee!
2) Give me a little coffee!
给我一点咖啡!

  4. Existential ESTI rendered as “有” and THERE BE (049)(157):

En unu jaro ESTAS kvar sezonoj: printempo, somero, autuno kaj vintro. (049)

1) In one year ARE four seasons: spring, summer, autumn and winter.
在一年里面 “是” 四季节:春季,夏季,秋季和冬季.

2) In one year THERE ARE four seasons: spring, summer, autumn and winter.
在一年里面 “有” 四季节:春季,夏季,秋季和冬季.

  5. Choice of target-language word sense (059)(067)(081)(046)(098)(013)(014)(027)(118)(130):

ELMETU viajn opiniojn pri nia laboro! (059)

1) “输出” 你们的关于我们的工作的意见!
2) “提出” 你们的关于我们的工作的意见!
OUTPUT your opinions about our work!

Chu mi FARIS multajn erarojn en mia hejmtasko? (081)

1) Did I DO a lot of mistakes in my homework?
我在我的家庭作业里面 “做” 了许多错误吗?

2) Did I MAKE a lot of mistakes in my homework?
我在我的家庭作业里面 “犯” 了许多错误吗?

La partio TRE zorgas la vivon de la popolamaso. (046)

1) The party VERY cares for the life of the masses.
2) The party VERY MUCH cares for the life of the masses.
党很关心人民群众的生活.

La suno levighas CHE oriento. (013)

1) The sun rises AT east.
2) The sun rises IN THE east.
太阳在东方升起.

POST unu monato komencighos la someraj ferioj. (014)

1) AFTER one month will begin the summer’s holidays.
2) IN one month will begin the summer’s holidays.
暑假在一月以后将开始.

La eksperimento pri mashina tradukado ANKORAU NE estas finita. (027)

1) The experiment about machine’s translating STILL has been NOT finished.
关于机器的翻译的试验 “仍然没有” 被完成.

2) The experiment about machine’s translating has been NOT finished YET.
关于机器的翻译的试验 “还没有” 被完成.

Ni esperas, ke li GAJNU championecon en la konkurso. (118)

1) We hope, that he WIN championship in the competition.
2) We hope, that he WILL WIN championship in the competition.
我们希望,让他在比赛里面赢得冠军.

Prenu la lingvon neutralan KIEL la bazon. (130)

1) Take the language neutral AS the base.
2) Take the language neutral FOR the base.
拿中立的语言作为基础.

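The EChA experiment brought home to us that conversion between languages of the same family is much easier than between languages of different families. The closer the kinship, the lower the precision that machine translation demands of automatic analysis, and the easier it is to reach practical use. English and Chinese are both analytic languages and share many features; even so, Esperanto-to-English conversion is much simpler than Esperanto-to-Chinese. With nothing more than an Esperanto-English automatic dictionary and a set of morphological conversion algorithms, and without any hierarchical or syntactic analysis, word-for-word Esperanto-English machine translation can be achieved. Such output is rough but usable to a considerable degree. We took statistics on the intermediate translations output by EChA's first synthesis pass (morphological conversion) before reordering, using “does not cause misunderstanding” as the criterion. The English accuracy is about 95% (150/158), with eight sentences hard to understand (003)(010)(075)(095)(102)(108)(111)(141); the Chinese accuracy is about 72% (113/158). Excluding the part of morphological conversion that makes use of syntactic analysis results (but not the function-word analysis and conversion of the first pass), the English accuracy is still above 80%. If preposed accusative nouns were marked in the output, intelligibility would be higher still. Of course, the 158 sentences of our experiment have their limits, so these statistics have only relative significance. Machine translation in China has from the very beginning studied automatic conversion between Indo-European and Sino-Tibetan, two unrelated language families, which is very difficult. This is probably one of the main reasons why our practical systems have been slow to appear. Chinese machine translation workers therefore carry a heavier burden and face a harder task, and need more originality and dedication. This unfavourable condition also has another side: the many special problems that arise when machine translation meets Chinese have, objectively, made our research go deeper. Machine translation research in China did not pass through a first generation of word-for-word translation as in Europe and America, but started directly from second-generation sentence-for-sentence translation, a relatively high starting point, and within a short time (the early 1960s) caught up with the advanced world level of the day. This is clearly related to the demands of the particular language pairs we studied (Russian-Chinese, English-Chinese, etc.).[10]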

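Now for another question: can literary works be translated by machine? We say they certainly can, though with great difficulty. Formalizing and turning into algorithms the rules that people follow when translating literature (many of them subconscious) is clearly not easy. Even if it were done, it would not pay economically. So for a long time to come, apart from special experimental needs, people will generally not make the effort. EChA translated two poems as a modest attempt in this direction, showing that a machine can also translate poetry. Judging from the output, the English is more beautiful than the Chinese: it keeps more of the rhythm and rhyme and reads more like a poem. Apart from a few well-rendered lines (such as “向永远战争着的世界, / 它允诺神圣的和谐”), the Chinese translation on the whole reads more like prose. This is hardly surprising, since EChA was never designed specifically for translating poetry. The two most prominent formal features of poetry are rhythm and end rhyme. One can imagine that the dictionary of a poetry translation system would differ from an ordinary machine dictionary: under each sense of each entry it would gather a set of synonymous target-language equivalents, of different lengths and with different rhymes, for the machine to choose from when synthesizing verse, just as people often consult a rhyming dictionary when writing or translating poems.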

一提机器翻译, 人们总爱问: 机器能够翻译文学作品吗? 为什么不能? 离散是对连续的逼近, 机器智能是对人的智能的模拟, 二者之间并没有一道不可逾越的鸿沟. 从功能上看, 机器和人没有什么不同. 机器不过是无机体的人罢了. 只要人会的事情, 机器迟早也能会. 机器的不会并不是它不能, 而是人没有使它会, 这正如文盲不会写字是因为没人教他一样. 不过, 机器胃口很刁, 不懂 “意会”, 只有 “言传”(通过计算机语言)才能教会它. 可惜, 对很多事, 人至今还是知其然, 并不知其所以然, 无法传授. 可见, 机器的无能全由于人的无能. 可人今天不知其所以然的, 并不说明将来总也不知, 所以从发展的观点看, 机器和人一样是无所不能的. 事实上, 机器目前已能代替医生, 译员和作曲家做部分工作, 而且比技术较差的人做得还象样些, 因为它 “取法乎上”. 即便人, 也只有很少一部分专家能够从事这些工作. 机器已经闯进了万物之灵的神圣禁地.

最后, 一般地谈谈修辞问题. 由于机器翻译至今多局限在实验室里, 所以未予修辞而产生的阅读障碍(包括心理障碍)还不突出. 但随着机器翻译的逐步实用化, 修辞的必要性将越来越明显. 前面所举的后期译文对初期译文的改进的实例, 主要涉及的就是修辞.

1) 什么是机器翻译修辞?

机器翻译修辞是保证译文通顺的一个重要手段. 它是机器语法之后译文综合的一部分, 是自动翻译过程的最后一个环节. 广义的修辞包括贯穿翻译全过程的, 一切旨在促使译文通顺和美化的手段, 譬如成语手段(通过成语词典), 虚词分析(通过虚词模块), 结构手段(通过搭配关系)等等. 有些所谓多义区分, 实际上也是一种修辞, 例如 LUDI (PLAY) 可分为 “玩”, “打球)”, “演奏(乐器)”等义项, 但 “演奏” 义下具体选择 “拉(提琴, 胡琴)”(016), “弹(钢琴)”(038) 还是 “吹(口琴)” 就属于修辞了. EChA对于涉及多义的修辞, 即目标语合适对等词的选择, 就把它当作多义问题解决(见EChA虚词模块, 词类词义区分表和多义区分模块). 一般来说, 跟具体的词汇或语法现象联系很紧的修辞, 以及其他个性较强的特例修辞, 应该放在相应的词典或语法部分同时处理, 而可以归出类别的修辞, 则由最后独立的修辞模块统一解决.

机器翻译修辞具有某种超语言学的特征, 属于翻译学范畴. 我们知道, 根据原语和译语的语言学角度的对比差异, 就可以对所译文句实现转换(主要是句型转换), 这是我们目前机器翻译的主体工作. 但这样直接转换的句子不能保证其通顺, 甚至也不能保证其正确(即不被误解), 因为语言间(尤其是没有亲属关系的语言间)除了词汇语法等差异外, 还有超语言学(表达习惯, 思维方式等等)的差异存在, 即翻译学角度的对比差异. 例如: nun DE LOKO flugu ghi AL LOKO (now FROM PLACE let it fly TO PLACE) (101) / 现在从 “一个” 地方让它飞到 “另一个” 地方吧(“从地方到地方” 不符合汉语表达习惯). 修辞主要是为消除这种差异而设置的. 因此, 只有翻译学角度的语言对比差异, 才是修辞的根本依据.

2) 修辞的分类

可分作两大类: 必要修辞和美修辞. 必要修辞是保证译文正确可懂所必需的修辞, 它是修辞的初级阶段. 美修辞则是保证译文通顺畅达, 甚至产生某种美感或帮助形成译文风格所要求的修辞, 它是修辞的高级阶段. 机器翻译修辞首先是作为必要修辞提出来的. 必要修辞是基础, 具有更大的迫切性, 是所有实用系统的必要组成部分, 如形态修辞. 这部分修辞数量很有限, 一定量的研究就可以穷尽它. 美修辞可以说是锦上添花. 它是为机器译文不断提高质量, 使之朝成熟, 完美方向发展, 以期赶上人工翻译的手段. 可见, 美修辞是无限发展的, 它本身具有许多层次和侧面. 修修补补远不能满足美修辞发展的需要. 它要求体系和方法上的不断革新. 就机器翻译的前景来说, 美修辞的比重将逐渐变大. 从严格的意义上讲, 只有美修辞才真正体现修辞本身的特点和规律, 因为必要修辞在一定的意义上不过是语法的推广, 即可以算作广义的语法. 它的手段跟机器语法没有根本的不同. 在现行的EChA系统中, 必要修辞就常常跟语法混在一起.

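As for aesthetic rhetoric, EChA has made only a small attempt. It should be pointed out that beauty in machine translation has its own emphasis: what it values most is that the text be “fluent, idiomatic, concise and natural”, and only after that the formation of a translation style. We believe it is entirely possible for machine output gradually to acquire a style. Formally, style is carried mainly by vocabulary, especially small words (modal particles, structural words), and secondarily by some differences in grammatical form. The formal features of different styles can be recognized and adopted by a machine. In practice one can draw on the results of computational stylistics to design target-language rhetoric models for different styles. Styles may include a formal style, an elegant style, a colloquial style and so on. The formal style is regular in format, clear and simple, and gives an impression of objectivity and impartiality, without ornament. The elegant style is characterized by archaic function words (such as “则”, “即”, “乃”, “便”, “故”, “且”, “其”, “及”) and more frequent use of idioms, which make it concise and classical. The colloquial style is looser and freer and carries more modal particles (such as “吗”, “呢”, “可不”, “是吗”, “啊”).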

____________________________________________________________________

附注: [10] 参见 刘涌泉 <<中国的机器翻译>> ( <<情报科学>> 1980, 3 )

 

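[Acknowledgements]

From the very beginning, the development of an Esperanto-type machine translation system received the warm support of Professor Liu Yongquan, who gave careful guidance on everything from the core design to the handling of specific problems. During program design and debugging on the machine, Professor Liu Zhuo also gave guidance on many occasions, and the algorithms for some basic operations were provided by him. Now that the EChA system has achieved its first results, the author wishes to express deep gratitude to them. Special thanks also go to Ms. Han of the computer room for her help in many ways. Without the convenience she provided, the EChA system could not possibly have been tested successfully in so short a time.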

 



Liwei's master's thesis: English morphological generation (8)


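English morphological generation

The ending-attachment algorithm is exactly the inverse of the ending-stripping algorithm. Building a complete English ending-attachment algorithm that meets the requirements of a practical system is not difficult, because English morphology is fairly simple. EChA carries out Chinese morphological rhetoric and English morphological generation in the same place.

The contrastive differences between source and target language are the basis for establishing conversion rules. These differences fall into five cases: 1) one-to-one; 2) one-to-many; 3) many-to-one; 4) present in the source but absent in the target; 5) absent in the source but present in the target. We illustrate each with the morphological conversion from Esperanto to English:

1) One-to-one

Esperanto derived adverb (a stem whose logical class is adjective, plus the ending “-E”)
———> the corresponding English adjective plus the ending “-LY”.

Examples: diligent-E —-> diligent-LY ; serioz-E —-> serious-LY ;
sincer-E —-> sincere-LY. (063)

Exception: bon-E —-> well (045)
( Not good-LY; such cases are handled in the dictionary pass, in the table for word-class and word-sense disambiguation. )

Clearly, the one-to-one case is the easiest to handle.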
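As an illustration of the one-to-one case, the sketch below attaches “-ly” to an English adjective and consults an exception table first, mirroring the way bon-E is caught at the dictionary level. The function name and the table are assumptions made for this example, not EChA code, and English spelling adjustments (for example happy/happily) are deliberately left out.

```python
# Exceptions are checked first, like the dictionary-level treatment of bon-E.
ADVERB_EXCEPTIONS = {"good": "well"}


def english_adverb(adjective: str) -> str:
    """Map an English adjective to its adverb: exception table, then -ly."""
    if adjective in ADVERB_EXCEPTIONS:
        return ADVERB_EXCEPTIONS[adjective]
    return adjective + "ly"


for adj in ("diligent", "serious", "sincere", "good"):
    print(adj, "->", english_adverb(adj))
```

Keeping the irregular forms in a table rather than in the rule itself is the same design choice the thesis describes: the regular rule stays trivial and the exceptions live with the individual word.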

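2) One-to-many

Esperanto infinitive ——–> English bare infinitive or TO + bare infinitive
Esperanto conditional (predicate verb ending in “-US”) ——–> three English forms (past, present, future).

Examples:

  1. Se mi sci-US hierau, mi certe ven-US.
     —-> If I HAD KNOWN yesterday, I certainly SHOULD HAVE COME. (contrary to past fact)
  2. Se vi est-US mi, kion vi far-US?
     —-> If you WERE me, what WOULD you do? (contrary to present fact)
  3. Se vi ven-US morgau, vi shin vid-US.
     —-> If you SHOULD come tomorrow, you WOULD see her. (contrary to future fact)

This is the most troublesome case; ambiguity in machine translation largely stems from it. If the examples above had no explicit time adverbial, the choice could only be inferred from the context across sentences, which is really too hard for a machine. In such cases EChA simply replaces “-US” with “WOULD” throughout (050). This does not fully conform to English grammar, but it will have to do for now. Fortunately the conversion causes no misunderstanding.

Another common one-to-many example is the Esperanto simple present (ending -AS), which corresponds to both the English simple present and the present progressive. Although the Esperanto compound tenses include a form that corresponds to the English present progressive ( ESTAS x-ANTA ), Esperanto's principle of economy asks speakers to avoid complex forms as far as possible. We have not yet found sufficiently reliable formal rules for deciding when “-AS” should be rendered as simple and when as progressive. EChA currently always uses the simple present, which makes some translations imprecise but not misleading or hard to understand. For example:

Kien vi ir-AS? (158) —-> To where DO you go? ( CF: Where ARE you GOING? )
Chu kredas, ke mia koro flam-AS? (110) —-> Do believe, that my heart burn-S?
( CF: Do you believe that my heart IS BURNING? )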

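3) Many-to-one

The various forms of the Esperanto adjectival or adverbial participle ——–> the corresponding form of the English participle.

-ANTA and -ANTE —-> -ING ; -INTA and -INTE —-> HAVING + past participle ;
-OTA and -OTE —-> TO BE + past participle; etc.

[Examples] KURANTE sur la strato, li falis. (091) —-> RUNNING on the street, he fell.

Laboristoj estas KONSTRUANTAJ fabrikon. (015)
—-> Workers are BUILDING factory.

This case is easy to handle. Esperanto morphology is fairly rich while modern English morphology is not, so the cases that occur most often in Esperanto-English morphological conversion are many-to-one and present-in-source, absent-in-target. This is a very favourable condition for building a reasonably complete English morphological generation (ending-attachment) algorithm for EChA.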
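The many-to-one mapping above can be sketched as a small lookup table. This is an illustrative sketch only: the table covers just the three correspondences listed in the text, the function names are assumptions, and the English forms (`ing`, `pp`) are assumed to be supplied by the bilingual dictionary rather than generated here. A real system would also verify the stem in the dictionary, since an ordinary adjective such as elefanta would otherwise look like a participle.

```python
# Esperanto participle marker -> English template (only the pairs named in the text).
PARTICIPLE_TEMPLATES = {
    "ANT": "{ing}",         # -ANTA / -ANTE
    "INT": "having {pp}",   # -INTA / -INTE
    "OT": "to be {pp}",     # -OTA / -OTE
}


def participle_marker(word):
    """Return the participle marker of an Esperanto word form, or None."""
    w = word.upper()
    if w.endswith("N"):          # accusative ending
        w = w[:-1]
    if w.endswith("J"):          # plural ending
        w = w[:-1]
    if not w.endswith(("A", "E")):
        return None              # neither adjectival nor adverbial ending
    w = w[:-1]
    for marker in PARTICIPLE_TEMPLATES:
        if w.endswith(marker):
            return marker
    return None


def english_participle(word, ing, pp):
    """Render an Esperanto participle with the English forms of its verb."""
    marker = participle_marker(word)
    if marker is None:
        return None
    return PARTICIPLE_TEMPLATES[marker].format(ing=ing, pp=pp)


print(english_participle("KURANTE", "running", "run"))
print(english_participle("KONSTRUANTAJ", "building", "built"))
```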

4) 此有彼无

世界语将来将来时 ( ESTOS x-ONTA(J) ) ——–> 英语 ?

[例] Mi ESTOS LEGONTA la libron kiam shi venos. (023)
—-> I WILL ( 或: WILL BE GOING TO ) read the book when she comes.

这种情况看上去似乎很不利, 实际上并不难处理. 因为现今存在的各种语言, 作为人们千百年来交流思想的工具, 一般都能够表达各种细微的语义差别. 虽然乙语言也许缺乏甲语言的某个特定的表达手段, 但如果必要, 它总可以找到代替的表达方式. 如上例 ESTOS LEGONTA 通常译作 WILL READ 已经足够, 如果一定要强调将来的将来, 也不妨译作 WILL BE GOING TO READ 这样繁冗的形式. 再如汉语缺乏形态, 但如果需要, 总可以用适当的助词或副词等来代替, 这就是所谓的形态修辞.

5) 此无彼有

世界语 ? ——–> 英语完成进行时

[例] Mi atend-AS vin chi tie du horojn.
—-> I HAVE BEEN WAITING here for you for two hours.
CF: I WAIT here for you for two hours.
I AM WAITING here for you for two hours.

此所无彼所有的, 如果在彼也是可有可无的, 或并不太影响语义, 那还好办, 如上例. 再如, 英语的不定冠词, 世界语就没有, EChA对此干脆不管, 也没造成严重的后果, 只是译文显得有些不顺: Is your friend (*) doctor?       (039) This is (*) green star, and that is (*) red star. (152) ( * 处本应有不定冠词 A ) 最头痛的是此所无彼必有. 从完全没有冠词的语言(如汉语和俄语)译入有冠词的语言在很多情况下就是这样.

上述归纳在机器翻译的转换生成中具有普遍意义. 最困难的是此一彼多和此所无彼必有两种情况, 一般要通过精密的句法和语义的对比和分析来解决. 比如通过分析不定式所直接联系的英语轴心词的句型特征, 就可以决定该不定式采用带 TO 还是不带 TO 的形式. 实在不得已, 只好把几种可能的选择同时打印出来, 由用户自己决定—-这当然是权宜之计, 但常常比编制一套不可靠的区分规则, 客观上更有利一些. 机器模拟人的智能, 在一定的阶段总还有某些局限. 上面的做法, 实际上就是把机器暂时还不具有的智能, 交还给人发挥, 特别是那些很难形式化, 但人凭经验和直感却很容易判断的部分. 然而, 人工智能的使命决定了, 人们应该尽最大努力提高机器智能化程度. 条件允许却不去努力是设计者的懒惰和失职.

In EChA's morphological-generation pass there are also lexicalized sense-disambiguation routines (executed before morphological generation), which are easy to write in BASIC. Some examples follow:

1) LUDI (play): 玩 (play) / 打 (play a ball game) / 拉 (play the violin or huqin) / 弹 (play the piano) / 吹 (play the harmonica)

2120 IF VT$(GC)<>"1" THEN 2160
( If the word is intransitive, keep the basic dictionary sense “玩”; disambiguation of this word is finished, go to 2160. )

2130 IF HY$(ZC)="胡琴" OR RIGHT$(HY$(ZC),4)="提琴" THEN HY$(GC)="拉": GOTO 2160
( If the word found is “胡琴”, or its last two characters are “提琴” (covering 大提琴, 小提琴, 中音提琴, etc.), the word takes the Chinese sense “拉”; finished, go to 2160. )

2140 IF HY$(ZC)="钢琴" THEN HY$(GC)="弹": GOTO 2160
2145 IF HY$(ZC)="口琴" THEN HY$(GC)="吹": GOTO 2160
2150 IF RIGHT$(HY$(ZC),2)="球" THEN HY$(GC)="打"
2160 GC=GC+1: GOTO 1830 ( Skip this word, take the next word, go to 1830. )

2) BATI (beat): 打 (hit) / (心)跳动 ((of the heart) beat)

1990 IF VT$(GC)="1" AND (RIGHT$(HY$(ZC),2)="心" OR HY$(ZC)="心脏") THEN HY$(GC)="跳动"
2000 GOTO 2160

3) OKAZI (take place): 进行 (proceed) / 发生 (happen) / 召开 (be convened)

2450 IF RIGHT$(HY$(ZC),2)="事" THEN HY$(GC)="发生":GOTO 2160
2460 IF RIGHT$(HY$(ZC),2)="会" THEN HY$(GC)="召开":YY$(GC)="BE HELD": YTZ$(GC)="8": XX$(GC)="1"
2470 GOTO 2160

4) RIGARDI: LOOK AT / LOOK / WATCH (TV) / SEE (FILM)

2830 IF VT$(GC)<>"1" THEN YY$(GC)="LOOK": GOTO 2160
2840 IF YY$(ZC)="TELEVISION" OR YY$(ZC)="TV" THEN YY$(GC)="WATCH": GOTO 2160
2850 IF YY$(ZC)="FILM" THEN YY$(GC)="SEE": YTZ$(GC)="1"
2860 GOTO 2160

5) NENIAM (never): 从不 (never, habitual) / 从未 (never yet)

3070 IF ST$(ZC)="2" THEN HY$(GC)="从未": HY$(ZC)=HY$(ZC)+"过": JG$(ZC)="9"
3080 GOTO 2160
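For readers unfamiliar with line-numbered BASIC, the LUDI routine (lines 2120 to 2160) can be restated as a single function. This is a paraphrase under assumptions, not EChA code: the parameter `object_zh` stands for the Chinese gloss of the word that the BASIC code reaches through the index `ZC`, and `is_transitive` stands for the test `VT$(GC)="1"`. Note that the BASIC `RIGHT$(...,4)` counts bytes in a double-byte encoding, whereas Python's `endswith` works on characters.

```python
def translate_ludi(is_transitive: bool, object_zh: str) -> str:
    """Choose the Chinese verb for LUDI from the Chinese gloss of its object."""
    if not is_transitive:
        return "玩"                      # line 2120: keep the basic sense
    if object_zh == "胡琴" or object_zh.endswith("提琴"):
        return "拉"                      # line 2130: bowed string instruments
    if object_zh == "钢琴":
        return "弹"                      # line 2140
    if object_zh == "口琴":
        return "吹"                      # line 2145
    if object_zh.endswith("球"):
        return "打"                      # line 2150: ball games
    return "玩"                          # no rule fired: basic sense stays


print(translate_ludi(True, "小提琴"))
print(translate_ludi(True, "篮球"))
print(translate_ludi(False, ""))
```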

 

 

 


Liwei's master's thesis: Esperanto syntactic analysis (6 & 7)


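Esperanto syntactic analysis (1): function-word processing

Function-word analysis is the hardest part of Esperanto syntactic analysis. EChA's strategy is divide and conquer. The analysis rules for each function word form a self-contained unit, independent of the others, so that extending or improving the rules for one function word does not affect the rules for any other. This is what we may call separating rules from rules.[9] That linguistic rules and algorithmic programs should be kept apart has been much discussed, but the separation of rules from rules does not yet seem to have received enough attention. (This does not mean that all rules are separated. The set of abstract grammar rules of general validity, as the system's fully formalized logical description of the language, is the hub of automatic analysis and is itself a unified whole that can be made quite elegant; there is nothing to separate there. (See the second pass of EChA syntactic analysis, Section 7.) A good system should be able both to split apart and to fit together.) We believe that separating rules from rules is of decisive importance for developing practical machine translation systems. No system is sufficiently complete from the start, so how easily it can be extended and improved largely determines its future. Separating rules from algorithms certainly increases the extensibility of a system greatly and makes it easy for linguists and software developers to cooperate fully. But that is not enough. If rules can also be separated from rules, this not only helps us analyse concrete problems concretely and so deal with the many idiosyncrasies of so complex a phenomenon as language, greatly improving translation quality, but also creates the necessary conditions for cooperation among linguists themselves. Such cooperation is indispensable for developing large practical systems.

There are two main ways of separating rules from rules: 1) Grammaticalizing the dictionary: taking the word as the basic unit, the various usages of a word and their analysis rules are written into the dictionary as data (the dictionary resides on external storage). Such a machine dictionary is very similar in form to the desk dictionaries we use, such as Oxford, Webster's and Longman, and can readily draw on the research embodied in them. We suggest grammaticalizing the entries for function words and verbs first. 2) Lexicalizing the grammar: when writing the syntactic analysis or synthesis program (which resides in main memory), the rules are tied to specific words or subclasses and kept independent of one another. The two methods differ in form but are the same in substance. EChA uses the second method. (See the function-word analysis part of EChA and the polysemy rules in its synthesis part.) In the final analysis, the first analysis pass of EChA is nothing but a large function-word dictionary with analysis rules attached.

It should of course be pointed out that separating rules from rules inevitably multiplies the number of rules. However, because the boundaries are clear, this growth does not affect the logical clarity of the system's structure. This is very different from the earlier situation in which neither language and algorithm nor rule and rule were separated: the rules then swelled without limit and the system eventually had to be scrapped. The growth in rules does raise the question of storage capacity, but this is not really a problem, because today's machines are no longer so demanding about saving storage. Even among microcomputers, mid-range and high-end models have, or can easily be expanded to, four to eight megabytes of memory. It is worth stressing that the growth in rules generally does not affect the system's efficiency, because the rules are attached to specific words or subclasses: the routine for a word is entered only when that word occurs in the sentence being translated.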
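The point that per-word rules do not slow the system down can be illustrated with a rule table keyed by word: each rule is an independent function, and a rule runs only when its word occurs. The sketch below is hypothetical (the registry, the decorator and the `past` flag are assumptions for illustration), loosely modelled on the NENIAM routine shown in the chapter on English morphological generation.

```python
WORD_RULES = {}  # word -> its own, independent rule


def rule_for(word):
    """Register a rule under one specific word."""
    def register(fn):
        WORD_RULES[word] = fn
        return fn
    return register


@rule_for("NENIAM")
def neniam(context):
    # "never yet" when the clause is past, otherwise habitual "never";
    # the "past" flag is a stand-in for whatever feature the system tests.
    return "从未" if context.get("past") else "从不"


def apply_word_rules(tokens, context):
    """Run a word's rule only if that word actually occurs in the sentence."""
    out = []
    for token in tokens:
        rule = WORD_RULES.get(token.upper())   # dictionary lookup, no rule scan
        out.append(rule(context) if rule else token)
    return out


print(apply_word_rules(["mi", "neniam", "vidis"], {"past": True}))
```

Adding a rule for another word means adding one more entry to the table; existing entries are untouched, which is the extensibility argument made above.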

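In EChA's function-word analysis pass we put the disambiguation of function words, and even some target-language rhetoric that involves their peculiarities, all into the analysis rules of the individual function word. This is clearly simple and practical, and it greatly eases synthesis. But it is exactly here that EChA violates the principle, which we strongly endorse, of keeping analysis and synthesis independent. We cannot yet think of a better or more principled approach. However, our advocacy of independent analysis serves just two purposes: 1) to deepen analysis so as to improve translation quality; 2) to allow one and the same independent analysis result to be used for synthesis into several languages. Doing the analysis and synthesis of function words in step helps improve translation quality. Moreover, since function words are limited in number and their analysis rules are mutually independent, extending these rules when a new target language is added will not be very difficult, still less affect the backbone of the whole system. Our present practice is therefore justified and does not run counter to our aims.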

 

世界语句法分析(2)

分析第(2)线与目标语综合充分独立, 逻辑性强, 是一个相当完整的语言分析模型. 它由一个主程序和几个以动词分析算法为核心的环环相扣的子程序构成. 主程序主要用来确定各语段的范围(前限后限)及其加工次序, 为它们进入动词子程序做好准备. 它必须对各种类型的世界语文句作出正确, 合理的处理, 才能保证系统的充分概括性和适应性. 从各类文句的试验结果看, EChA相当好地做到了这一点.

我们把世界语文句的类型归纳如下:

1.无谓句. 如:

Kia belega pejzagho ! (041) / What beautiful scenery ! 多么绝美的景色!

2.谓语句:

1) 简单句: 全句只有一个谓语. 如: Skribu klare ! (033) / Write clearly ! 写清楚!

2) 扩展的简单句: 全句至少有两个谓语, 但只有一个主句, 从句跟主句(以主轴心为代表)没有直接联系, 即从句处于2层以外 ( 其层号 >= 3 ). 这类从句往往是定语从句或同位语从句. 如:

La homon , pri kiu vi parolas , mi neniam vidis . (131)
The man(宾), about whom you speak , I never saw .
我从未见过你提到的人.

3) 主从句: 全句至少有两个谓语, 但只有一个主句, 从句跟主句发生直接联系. 如:

Se mi partoprenus en via amuza aktivado , mi estus tre ghoja . (050)
If I should take part in your recreational activity , I would be very glad .
如果我参加你们的文娱活动, 我会是很高兴的.

4) 并列句: 全句至少有两个谓语, 同时也至少有两个有并列关系的分句, 并且其中一个是主轴心. 如:

Mi miras , timas , tremas . (074)
I wonder, fear, tremble.
我惊奇, 害怕, 颤抖.

5) 交错句: 以上四类句子交错组合而成的复杂句. 如本文第3节举的例句(004)就是.

EChA在对付这些不同类型的句子时, 能够把复杂的句子分解成简单的句子处理. 分析程序首先查找从句. 如果查到, 先入并列从句子程序分解(若是光杆从句就放过, 返主), 然后确定每一个从句的前后限, 入动词子程序加工. 加工完毕, 做绝对放过标志. 所有从句处理完毕, 再行主句加工. 这时候, 句子呈或者简单句, 或者并列句的形式.

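In Esperanto, a relative clause that is matched by a corresponding T-correlative is an appositive clause. When the T-correlative in the main clause is omitted, the clause becomes identical in form to an interrogative nominal clause, which makes it harder to identify. The present system ignores this for the time being. Such omission looks terse (it is common in proverbs and maxims) but should not be encouraged, because even people (especially speakers of non-Indo-European languages) often find it hard to understand.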

[例] Bone ridas , KIU laste ridas .
Well smiles, WHO smiles at last.
谁笑得最后, 笑得最好.

KIO pasis , ne revenos .
WHAT passed, will not return.
时不再来. (一去不复返.)

CF: Nur TIU ne eraras, KIU neniam ion faras.(151)
Only THAT PERSON is not wrong, WHO never dose something.
仅仅从不做某事的那个人不犯错误.

第二线的关键是动词子程序的建立. (这儿所谓动词包括谓语动词, 形动词, 副动词和不定式, 但不包括-ADO词, 因为世界语的-ADO词已经完全名词化了, 不再具有动词的特性.) 如果说先从句后主句的加工过程, 实际上是自下而上的方法, 那么动词算法的路径正好反过来, 是自上而下. 动词子程序首先设三个开关. 一是检验是否可以构成动词短语 VP. 若不能, 如独词句及光杆的形动词, 副动词或不定式, 则给该词节点信息 J (终结节点), 该词加工完毕, 退出. 二是检验该词是否系词, 若是, 转系词子程序作适当处理, 再回动词子程序递归加工. 这是因为系动词有其特殊性, 比如一般动词谓语简单句, 只可能有一个前面没有介词的普通格名词(它当然是主语), 而系词谓语句却可以有两个(一主一表), 因而不能直接入动词子程序.  最后一个开关检验该动词短语是否扩展的 VP, 若不是, 即行分析. 扩展的 VP 定义为该动词的间接成分层中(所谓间接成分层是指其层号 >= 动词轴心的层号 + 2 的层次), 至少又包含一个 VP. 对于扩展的动词短语, 运用栈技术作递归加工. 这样动词子程序真正的加工单位便是不扩展的各类 VP (简单句, 形动词短语, 副动词短语, 不定式短语). 动词子程序在工作期间, 常常需要调用其他子程序. 各子程序间的逻辑关系是十分清楚的.

名词子程序也要设开关. 扩展的 NP 定义为带有至少一个 VP 的 NP, 它必须回动词子程序递归加工.

对于不扩展的动词短语, 一般来说加工次序如下:

丨动词子程序丨——–丨 名词子程序 丨——丨形容词子程序丨—-丨 副词子程序 丨

这形象地体现了 “自顶而下” 的分析思想.

试验表明, EChA的两线分析程序, 一具体一抽象, 一个对付个性一个对付共性, 一个面向虚词一个面向实词, 一个尽量使句法分析词典化, 一个则努力使分析过程逻辑化, 二者相互配合, 很有效地实现了各类世界语文句的自动分析. EChA输出的中间结果158条CDC链中只发现一处分析错误. 它出现在第一首诗歌 “LA ESPERO” 的第三句:

Ne al glavo sangonsoifanta , ghi LA HOMAN tiras FAMILION . (102)
Not to sword bloodthirsty , it THE MAN’S (目的格) pulls FAMILY (目的格).

为了节奏和韵律的关系, 作者把形容词修饰语与其轴心词分开了(当然仍同格同数), 中间插进一个动词谓语. 于是系统误把二者都看作是动词谓语的宾语, 因为 “冠词+形容词” (后不跟名词) 结构一般总是代替 NP 的, 所以EChA也就这样分析了. 幸运的是, 这一分析错误没有导致译文错误, 因为中英文综合都把前置宾语移至动词轴心之后, 客观上恢复了修饰语与其中心词的正常词序, 当然这只是巧合.

_____________________________________________________________________

附注: [9] 这儿关于规则和规则分开的讨论, 很大程度上得益于与刘倬老师的几次谈话.

 


Liwei's master's thesis: Esperanto morphological analysis (5)


世界语形态分析

源语文句分析大体可以分形态分析和句法分析两大类. 前者研究的对象小于等于词, 而后者的对象大于等于词(句素). 分析的终极目的就是求解词的正确的CDC成分. 本节先讨论形态分析问题. 我们把构词分析的讨论也放在这一节.

世界语形态分析的主体是消尾算法的建立. 世界语没有形态同形现象, 所以只要削尾正确, 形态分析也就完成. 下面给出EChA的削尾算法. 应该说, 该算法是比较完备和合理的, 完全能够满足世界语自动分析实用系统的要求.

世界语削尾算法

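(1) If the word ends in "-O", conclude "noun / nominative / singular", strip the ending, look the result up in the content-word stem dictionary, and go to step (2); otherwise go to step (12).

(2) If the lookup succeeds, copy the dictionary information to the workspace; the word is finished. Otherwise go to step (3).

(3) If the word ends in "-AD", conclude "AD word", strip the suffix, look the result up in the content-word stem dictionary, and go to step (4); otherwise go to step (5).

(4) If the lookup succeeds, copy the dictionary information to the workspace; the word is finished. Otherwise go to step (11).

(5) If the word ends in "-ANT", conclude "participle / progressive / active", strip the suffix, look the result up in the content-word stem dictionary, and go to step (4); otherwise go to step (6).

(6) If the word ends in "-INT", conclude "participle / perfect / active", strip the suffix, look the result up in the content-word stem dictionary, and go to step (4); otherwise go to step (7).

(7) If the word ends in "-ONT", conclude "participle / future / active", strip the suffix, look the result up in the content-word stem dictionary, and go to step (4); otherwise go to step (8).

(8) If the word ends in "-AT", conclude "participle / progressive / passive", strip the suffix, look the result up in the content-word stem dictionary, and go to step (4); otherwise go to step (9).

(9) If the word ends in "-IT", conclude "participle / perfect / passive", strip the suffix, look the result up in the content-word stem dictionary, and go to step (4); otherwise go to step (10).

(10) If the word ends in "-OT", conclude "participle / future / passive", strip the suffix, look the result up in the content-word stem dictionary, and go to step (4); otherwise go to step (11).

(11) Conclude "unknown word", keep the conclusions reached by ending stripping, and copy the word itself into the target-language sense field of the workspace; the word is finished.

(12) If the word ends in "-'", conclude "noun / nominative / singular", strip the ending, look the result up in the content-word stem dictionary, and go to step (2); otherwise go to step (13).

(13) If the word ends in "-A", conclude "adjective / nominative / singular", strip the ending, look the result up in the content-word stem dictionary, and go to step (2); otherwise go to step (14).

(14) If the word ends in "-E", conclude "adverb / nominative", strip the ending, look the result up in the content-word stem dictionary, and go to step (2); otherwise go to step (15).

(15) If the word ends in "-J", conclude "nominative / plural", strip the ending, and go to step (16); otherwise go to step (18).

(16) If the word now ends in "-O", conclude "noun", strip the ending, look the result up in the content-word stem dictionary, and go to step (2); otherwise go to step (17).

(17) If the word now ends in "-A", conclude "adjective", strip the ending, look the result up in the content-word stem dictionary, and go to step (2); otherwise go to step (11).

(18) If the word ends in "-N", conclude "accusative", strip the ending, and go to step (19); otherwise go to step (23).

(19) If the word now ends in "-J", conclude "plural", strip the ending, and go to step (16); otherwise go to step (20).

(20) If the word now ends in "-O", conclude "noun / singular", strip the ending, look the result up in the content-word stem dictionary, and go to step (2); otherwise go to step (21).

(21) If the word now ends in "-A", conclude "adjective / singular", strip the ending, look the result up in the content-word stem dictionary, and go to step (2); otherwise go to step (22).

(22) If the word now ends in "-E", conclude "adverb", strip the ending, look the result up in the content-word stem dictionary, and go to step (2); otherwise go to step (11).

(23) If the word ends in "-S", go to step (24); otherwise go to step (30).

(24) If the word ends in "-AS", conclude "present tense", strip the ending, and go to step (28); otherwise go to step (25).

(25) If the word ends in "-IS", conclude "past tense", strip the ending, and go to step (28); otherwise go to step (26).

(26) If the word ends in "-OS", conclude "future tense", strip the ending, and go to step (28); otherwise go to step (27).

(27) If the word ends in "-US", conclude "subjunctive mood", strip the ending, and go to step (29); otherwise go to step (32).

(28) Conclude "indicative mood" and go to step (29).

(29) Conclude "verb / predicate / active voice", look the stem up in the content-word stem dictionary, and go to step (2).

(30) If the word ends in "-I", conclude "verb / infinitive", strip the ending, look the result up in the content-word stem dictionary, and go to step (2); otherwise go to step (31).

(31) If the word ends in "-U", conclude "imperative mood", strip the ending, and go to step (29); otherwise go to step (32).

(32) Look the word up in the function-word dictionary (it has no ending to strip). If the lookup succeeds, copy the dictionary information to the workspace; the word is finished. Otherwise conclude "noun / proper noun" and go to step (11).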

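[Note] Rule 16 of the Fundamento de Esperanto states: "The final vowel of the noun and of the article may be dropped and replaced by an apostrophe." This occurs mostly in poetry, as in MOND' (103). Step (12) handles this case. (The article is a function word shorter than three letters; it is looked up directly in the function-word dictionary and never enters the ending-stripping path, so it is not considered here.)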
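The thirty-two steps above can be sketched compactly in Python. This is a minimal illustration rather than EChA's actual code (which, as the text later indicates, was written in BASIC). The function names, the feature labels, and the two dictionaries passed in are assumptions made for the example, and the ASCII apostrophe stands in for the elision mark. The sketch follows the steps literally, so short function words such as the article are assumed to have been looked up before it is called.

```python
PARTICIPLES = [  # steps (5)-(10): suffix, aspect, voice
    ("ANT", "progressive", "active"), ("INT", "perfect", "active"),
    ("ONT", "future", "active"), ("AT", "progressive", "passive"),
    ("IT", "perfect", "passive"), ("OT", "future", "passive"),
]
TENSES = {"AS": "present", "IS": "past", "OS": "future"}  # steps (24)-(26)
VERB = {"pos": "verb", "role": "predicate", "voice": "active"}  # step (29)


def lookup(stem, feats, stems):
    """Steps (2)-(11): look the stem up, retrying once without -AD or a participle suffix."""
    if stem in stems:                                    # step (2)
        return stem, feats, True
    if stem.endswith("AD"):                              # step (3)
        feats["ad_word"] = True
        stem = stem[:-2]
    else:
        for suffix, aspect, voice in PARTICIPLES:        # steps (5)-(10)
            if stem.endswith(suffix):
                feats.update(participle=aspect, participle_voice=voice)
                stem = stem[:-len(suffix)]
                break
        else:
            return stem, feats, False                    # step (11): unknown word
    return stem, feats, stem in stems                    # step (4), else step (11)


def nominal(w, feats, stems, singular):
    """Steps (16)-(17) and (20)-(22): classify what remains after -J and/or -N."""
    pos = {"O": "noun", "A": "adjective"}.get(w[-1:])
    if pos is None and singular and w.endswith("E"):     # step (22)
        pos = "adverb"
    if pos is None:
        return w, feats, False                           # step (11)
    feats["pos"] = pos
    if singular and pos != "adverb":
        feats["number"] = "singular"
    return lookup(w[:-1], feats, stems)


def strip_ending(word, stems, function_words):
    """Return (stem, features, found) for one Esperanto word."""
    w, feats = word.upper(), {}
    last = w[-1:]
    if last in ("O", "'"):                               # steps (1), (12)
        feats.update(pos="noun", case="nominative", number="singular")
    elif last == "A":                                    # step (13)
        feats.update(pos="adjective", case="nominative", number="singular")
    elif last == "E":                                    # step (14)
        feats.update(pos="adverb", case="nominative")
    elif last == "J":                                    # step (15)
        feats.update(case="nominative", number="plural")
        return nominal(w[:-1], feats, stems, singular=False)
    elif last == "N":                                    # step (18)
        feats["case"] = "accusative"
        if w.endswith("JN"):                             # step (19)
            feats["number"] = "plural"
            return nominal(w[:-2], feats, stems, singular=False)
        return nominal(w[:-1], feats, stems, singular=True)
    elif w[-2:] in TENSES:                               # steps (24)-(26), (28), (29)
        feats.update(VERB, tense=TENSES[w[-2:]], mood="indicative")
        return lookup(w[:-2], feats, stems)
    elif w.endswith("US"):                               # step (27)
        feats.update(VERB, mood="subjunctive")
        return lookup(w[:-2], feats, stems)
    elif last == "I":                                    # step (30)
        feats.update(pos="verb", form="infinitive")
    elif last == "U":                                    # step (31)
        feats.update(VERB, mood="imperative")
    else:                                                # step (32)
        if w in function_words:
            return w, feats, True
        feats.update(pos="noun", proper=True)
        return w, feats, False
    return lookup(w[:-1], feats, stems)
```

With a toy stem set such as `{"LIBR", "LEG"}`, calling `strip_ending("librojn", stems, set())` peels off -N, -J and -O in turn and finds the stem LIBR, while `strip_ending("legantaj", stems, set())` fails on LEGANT first and succeeds only after the participle suffix is removed.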

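We now turn to word-formation analysis, which has two aspects: (1) building an affix-stripping algorithm (for derived words); (2) splitting compounds. The current EChA system sidesteps both. Our dictionary stores stems (including compound stems), so a word can be found as soon as its grammatical ending has been stripped. It must be said, however, that this is not a sound approach for a language whose word formation is as flexible as Esperanto's. Storing stems is workable in a small experiment, but in a practical system the stems become too numerous to store. We advocate storing both roots and stems in the content-word dictionary and building a complete Esperanto affix-stripping algorithm and compound-splitting algorithm to cope with unknown words. (Apart from international technical vocabulary, Esperanto has a very limited stock of basic roots. So-called unknown words are generally derived words or compounds freely assembled from basic roots and a few dozen affixes. Segment them correctly and they are no longer unknown.)

Esperanto suffixes can be stacked (in theory without limit), but a word usually has at most one prefix. The processing path on the dictionary side should therefore be:

[Figure: processing path on the dictionary side; image not preserved in this copy]

Affix stripping differs from ending stripping: an affix is not stripped merely because it is present. For endings, the machine strips first and looks up afterwards; for affixes, it looks up the dictionary first and strips affixes only from unknown words that were not found. This lets us decide which stems the content-word dictionary should hold according to the design requirements (experimental or practical system; demands on translation speed, quality, and cost) and the machine conditions (memory size, processing speed).

Today, with advances in computer technology, machines are ever more capable in storage and speed while costs fall sharply. Some in the machine translation community therefore argue that storage units should be large rather than small (for instance, by entering as many idioms as possible [7]), so that mass storage and fast lookup lighten the burden of analysis. This is an insightful view. The larger the unit, the more deterministic it is, the lower the demands on analysis and synthesis (machine intelligence), the easier the system is to build, and the better the translation. Because machine translation is a highly practical discipline, this view is all the more valuable. Of course, bigger is not always better, because each step up in unit size (root to stem, stem to word, word to phrase, phrase to sentence) makes the number of possible combinations grow exponentially.[8] Taken to the extreme, with the sentence as the storage unit, no analysis or synthesis is needed at all: the translation is output by simple matching. Artificial intelligence is then zero, yet translation quality can be the best possible (taking human translation as the best). Unfortunately, however advanced the hardware, storage capacity and lookup speed are always finite and cannot cope with an unbounded set of sentences. (For special needs within a limited domain the approach is feasible, as in a travel phrase translator. Does that still count as machine translation? It should, though it is not machine translation in the artificial-intelligence sense.) At the other extreme, machine translation takes the morpheme (root, affix, ending) as the unit of analysis. This requires the smallest dictionary (roots only) and the highest level of artificial intelligence: not only syntactic analysis and synthesis but also word-formation analysis and synthesis. Yet for all that effort, quality is the least assured, because a sentence broken into very small pieces (source analysis) rarely goes back together without visible seams (target synthesis). Existing MT systems therefore generally settle somewhere between the two extremes, depending on circumstances and on the designer's views. We believe a good practical system needs both capabilities: it should be able to analyze thoroughly and also to handle common phrases (idioms) as whole units, fine-grained where that is needed and coarse where that serves better. In general, coarse treatment suits phenomena that are frequent, fixed, idiosyncratic, and enumerable, while fine-grained analysis suits regular, productive phenomena. For a practical MT system with Esperanto as the source language, we therefore advocate entering as many idioms and affixed stems as possible, and we equally affirm the need for a complete affix-stripping algorithm.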

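How many derived stems, then, should an Esperanto content-word dictionary hold? For target-independent machine translation:

(1) In a small experimental system whose purpose is to test syntactic analysis and synthesis on limited material, enter every stem. Otherwise:

(2) Every commonly used derived stem goes into the dictionary and bypasses the affix-stripping subroutine. Frequency of use is the basic criterion.

(3) Derived stems that help to distinguish homographs and polysemous words should be entered.

(4) When in doubt, enter the stem.

(5) When the machine dictionary of a practical system is first being designed, the extreme flexibility and productivity of Esperanto affixes make it hard to enter many affixed stems at once, so the affix-stripping algorithm matters all the more. A stripped affix may not convey the meaning precisely, and target synthesis may sometimes need an explanatory gloss (see the examples below), but this is still far better than printing the unknown word as is (zero information). As the system is extended and refined, more and more stems will naturally be entered.

For machine translation tied to a specific target language:

(1) The number of derived stems to enter should take into account the word-formation characteristics and the vocabulary of the target language.

(2) A stem that corresponds to a single complete concept in the target language, one that cannot be rendered by simply adding the meanings of root and affix, should be entered. For example: DOM-EGO 楼房, 大厦 (building, mansion), rather than a generic "大-房子" (big house).

(3) With Chinese as the target language, more affix stripping is appropriate, because Esperanto and Chinese word formation are very similar and Chinese speakers are instinctively used to understanding combinations of morphemes. (This preference shows clearly in loanwords: "德律风" gave way to "电话" (telephone), "莱塞" to "激光" (laser), and so on.) Many examples of closely parallel Esperanto and Chinese word formation can be given. Moreover, although many Esperanto derived words such as DOM-ACHO read more elegantly when rendered as a whole ("陋室", humble dwelling), it does no harm to form a new word by uniform stripping and composition, "鬼-房子", which is not far from the original meaning. Some affixes in particular correspond closely to Chinese characters (morphemes), such as VIC-/副-, -IN-/女-, -EBL-/可-, which is all the more reason to strip them.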

Esperanto-Chinese word-formation comparison, examples (1): derived words

(1) BO- 姻- (relation by marriage): BO-PATRO 姻-父亲 (岳父或公公), BO-FILO 姻-儿子 (女婿), BO-FRATO 姻-兄弟 (内弟);

(2) GE- (男女)- (both sexes): GE-AMIKOJ (男女)-朋友们, GE-KAMARADOJ (男女)-同志们, GE-AKTOROJ (男女)-演员们;

(3) EKS- 前- (former): EKS-OFICISTO 前-职员, EKS-MINISTRO 前-部长, EKS-INSTRUISTO 前-教师;

(4) MAL- [反义] (antonym): MAL-BONA [反义]好 (坏), MAL-AMIKO [反义]朋友 (敌人), MAL-SAGHE [反义]聪明 (愚笨);

[Note] MAL- is one of the most widely used and most productive prefixes in Esperanto, with enormous word-forming power; unfortunately Chinese has no corresponding morpheme. If the system meets an unknown MAL- word, stripping the prefix and printing an explanatory tag such as [反义] still leaves the word understandable.

(5) VIC- 副- (vice-, deputy): VIC-PREZIDANTO 副-主席, VIC-ESTRO 副-队长, VIC-CHEFMINISTRO 副-总理;

(6) FI- 坏- (bad): FI-INSEKTO 坏-虫, FI-KOMERCISTO 坏-商人 (奸商), FI-KUTIMO 坏-习惯 (恶习);

(7) SEN- 1. If the logical class of the root is noun, "无-": SEN-GUSTA 无-味的, SEN-SENCA 无-意义的;

  2. If the logical class of the root is verb, "不-": SEN-MORTA 不-死的 (不朽的), SEN-ATENTA 不-注意的;

(8) NE- If the logical class of the root is noun, "非-", otherwise "不-": NE-ESPERANTISTO 非-世界语者, NE-BONA 不-好的;

(9) Prepositional prefixes: 1. SUR- -上: SUR-TABLE 桌子-上; 2. APUD- -旁: APUD-VOJA 路-旁的;

  3. EN- -内: EN-LANDE 国-内; 4. LAU- 按-: LAU-VICE 按-次序; 5. DE- 从-: DE-NOVE 从-新;

(10) -ACH- 鬼- (pejorative): DOM-ACHO 鬼-房子 (陋室), KNAB-ACHO 鬼-男孩 (捣蛋鬼), VETER-ACHO 鬼-天气;

(11) -AN- -成员 (member): KLUB-ANO 俱乐部-成员, KURS-ANO 讲习班-成员, KOMUNUM-ANO 公社-成员;

(12) -UL- -者 (person): BON-ULO 好-者, KAR-ULO 亲爱-者, JUN-ULO 年青-者, LONG-KRUR-ULO 长/腿-者;

(13) -IN- 女- (female): KAMARAD-INO 女-同志, INSTRUIST-INO 女-教师, OFICIST-INO 女-职员, AKTOR-INO 女-演员;

(14) -EBL- 可- (possible): VID-EBLA 可-见的, MANGH-EBLA 可-吃的, UZ-EBLA 可-用的, NE-ATING-EBLA 不-可-达到的;

(15) -EC- -性 (quality): CERT-ECO 确实-性, NECES-ECO 必要-性, KLAR-ECO 清楚-性, LIBER-ECO 自由-性;

(16) -EM- 爱- (inclined to): LABOR-EMA 爱-工作的 (勤劳的), PAROL-EMA 爱-说话的, MENSOG-EMA 爱-撒谎的;

(17) -IND- 值得- (worthy of): LERN-INDA 值得-学习的, LAUD-INDE 值得-称赞, LEG-INDA 值得-读的, AM-INDA 值得-爱的;

(18) -ON- 1. For -ONO, "-分之一": DU-ONO 二-分之一, TRI-ONO 三-分之一, KVAR-ONO 四-分之一;

  2. For X + Y-ONOJ, "Y-分之X": TRI DEK-ONOJ 十-分之三, KVIN OK-ONOJ 八-分之五.
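The "look up first, strip affixes only when the lookup fails" policy and the gloss rules in the examples above can be sketched as follows. This is an illustrative sketch, not the thesis's algorithm: the table contents are a small subset of the examples, the `analyse` function and its return shape are hypothetical, and the fallback of SEN- to "不-" for roots that are neither noun nor verb is an assumption, since the text specifies only those two cases.

```python
PREFIXES = {  # prefix -> function from the root's logical class to a Chinese gloss
    "MAL": lambda cls: "[反义]",
    "VIC": lambda cls: "副-",
    "EKS": lambda cls: "前-",
    "SEN": lambda cls: "无-" if cls == "N" else "不-",   # non-noun fallback is assumed
    "NE": lambda cls: "非-" if cls == "N" else "不-",
}
SUFFIXES = {  # suffix -> (Chinese gloss, True if the gloss precedes the root)
    "ACH": ("鬼-", True), "IN": ("女-", True), "EBL": ("可-", True),
    "AN": ("-成员", False), "UL": ("-者", False), "EC": ("-性", False),
}


def analyse(stem, stems, allow_prefix=True):
    """Return (logical_class, gloss) for a stem, or None if it cannot be analysed.

    `stems` maps a stem to (logical_class, Chinese gloss). The dictionary is
    consulted first, so a stored derived stem is never taken apart.
    """
    if stem in stems:
        return stems[stem]
    if allow_prefix:                                  # at most one prefix
        for prefix, rule in PREFIXES.items():
            rest = stem[len(prefix):]
            if stem.startswith(prefix) and rest:
                inner = analyse(rest, stems, allow_prefix=False)
                if inner:
                    cls, gloss = inner
                    return cls, rule(cls) + gloss
    for suffix, (text, before) in SUFFIXES.items():   # suffixes may stack
        rest = stem[:-len(suffix)]
        if stem.endswith(suffix) and rest:
            inner = analyse(rest, stems, allow_prefix)
            if inner:
                cls, gloss = inner
                return cls, (text + gloss if before else gloss + text)
    return None
```

Given a dictionary that stores DOMEG as 楼房, the stored entry wins over any decomposition, while an unlisted stem such as NEATINGEBL (from NE-ATING-EBLA) is glossed compositionally as 不-可-达到.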

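The same holds for compounds ("root + root"). Relatively fixed compounds should be stored whole in the dictionary; freely combined ones should be split. There is a difficulty, however: for the convenience of users, Esperanto grammar does not require a hyphen even in completely free compounds. How, then, are they to be split? Roots are far more numerous than affixes and vary greatly in length, so stripping and comparing letter by letter is clearly not workable. As long as we insist on doing without pre-editing, no sound solution is in sight. For now, one can start by splitting only those compounds that contain a hyphen. We recommend that, except for fixed and common compounds, Esperantists hyphenate free compounds, to spare the reader effort and to help the machine recognize them. Given the striking similarity between Esperanto and Chinese word formation (both in the mode of combination and in its high productivity), this recommendation is all the more relevant to Esperanto-Chinese machine translation.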
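Splitting only hyphenated compounds, as proposed above, might look like the following sketch. The function name and the toy dictionary are hypothetical, and two simplifications are assumed: the last part carries a one-letter grammatical ending, and a non-final part may carry a linking -O (as in AKVO-FONTO). Derivational suffixes inside compounds (as in SURD-MUT-ULO) are not handled.

```python
def gloss_compound(word, stems):
    """Gloss a hyphenated root+root compound part by part.

    `stems` maps a root to its Chinese gloss. Returns the glosses joined by "/",
    or None when the word has no hyphen or any part is unknown.
    """
    parts = word.upper().split("-")
    if len(parts) < 2:
        return None                       # no hyphen: outside the scope of this sketch
    head = parts[-1][:-1]                 # assumed one-letter ending on the last part
    glosses = []
    for part in parts[:-1] + [head]:
        if part not in stems and part.endswith("O"):
            part = part[:-1]              # linking vowel, as in AKVO-FONTO
        if part not in stems:
            return None
        glosses.append(stems[part])
    return "/".join(glosses)
```

With roots AKV (水) and FONT (源) in the dictionary, `gloss_compound("AKVO-FONTO", stems)` yields 水/源, while the unhyphenated AKVOFONTO is left alone.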

Esperanto-Chinese word-formation comparison, examples (2): compounds

(1) AKVO-FONTO 水/源 ; (2) VARM-ENERGIO 热/能 ; (3) ARBO-BRANCHO 树/枝 ; (4) VAPOR-SHIPO 汽/船 ;

(5) SURD-MUT-ULO 聋/哑-者 ; (6) BLANK-HARA 白/发的 ; (7) NUD-PIEDA 光/脚的 ; (8) FISH-KAPTI 捕/鱼

______________________________________________________________

Notes: [7] See:

刘涌泉 《中国的机器翻译》 ( 《情报科学》 1980, 3 )

王广义 《机器翻译中的固定词组和固定结构问题》 ( 《语言和计算机》 (1), 1982 )

[8] See: 叶蜚声, 徐通锵 《语言学纲要》, Chapter 2, Section 2, "1. 语言的层级体系", pp. 34-36 ( 北京大学出版社, 1981 )

 

 

【相关】

硕士论文: 世界语到汉语和英语的自动翻译试验
立委硕士论文:1. EChA概况
立委硕士论文:2. 世界语: 语言学特点及其研究价值
立委硕士论文:3. 层次递归成分体系
立委硕士论文:4. EChA机器词典及词表
立委硕士论文:5. 世界语形态分析
立委硕士论文:6/7 世界语句法分析
立委硕士论文:8. 英语形态生成
立委硕士论文:9. 目标语调序
立委硕士论文:10. EChA 试验结果的分析
立委硕士论文【致谢】【参考书目】
立委硕士论文全文(世界语版)

《朝华午拾:shijie-师弟轶事(3)——疯狂世界语 》

灵感有如神授,巧夺岂止天工

《立委随笔:一小时学会世界语语法》

立委世界语文章 (1987): 《中国报道:通天塔必将建成》

立委世界语论文(1986): 《国际语到汉语和英语的自动翻译》

立委(1988)《世界科技:世界语到汉语和英语的自动翻译试验》

DLT项目背景介绍

立委硕士论文全文(世界语版)


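Liwei's Master's Thesis: EChA Machine Dictionaries and Word Tables (4)

An Experiment in Automatic Translation from Esperanto to Chinese and English
(Overview of the EChA Machine Translation System)

EChA Machine Dictionaries and Word Tables

All EChA dictionaries and word tables are random-access data files, each with its own set of maintenance programs for modification and extension, which makes the system easy to improve. The dictionaries and tables are defined below.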

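1) Content-word stem dictionary

Record format (fields in order):

stem | logical class | transitivity | takes infinitive | governed word | Chinese-sense code of governed word | Chinese sense | Chinese-sense features | English sense | English-sense features | semantic features | record number in the word-class sense distinction table | reserved

<logical class> ::= { N, V, A, F, P, C, K, T, R, S, W, E, D, X }

N=noun, V=verb, A=adjective, F=adverb, P=preposition, C=conjunction or punctuation, K=K-correlative,
T=T-correlative, R=other correlative, S=numeral, W=personal pronoun, E=copula, D=article, X=universal word (万能词)

[Note] The logical class indicates the static part of speech of a word. The grammatical part of speech of an Esperanto content word is dynamic and can only be determined by ending stripping. Each word nevertheless generally has a basic part of speech, which is a deep logical feature of the word. The grammatical part of speech is merely a surface syntactic feature derived from it by adding an ending.

<Chinese-sense features> ::= { "…以后", "…的", "使…", "把…", "给…", "…下", "…上", "…里", "…时",
polysemy feature, idiom-forming feature, … }

[Note] The Chinese-sense features describe the structural properties of the word's Chinese sense and also supply stylistic information for Chinese generation.

<English-sense features> ::= { irregular-inflection feature, consonant-doubling feature, invariant-form feature, … }

[Note] The English-sense features tell how the English form of the word is to be generated morphologically.

<Chinese sense of governed word> ::= { null, "给", "以", "到", … }

[Note] This field gives the Chinese sense of the word governed by the entry (usually a preposition).

<semantic features> ::= { HM, LK, TM, FX, … }

HM=human, LK=place, TM=time, FX=direction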
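
2) Function-word dictionary

Besides all the fields of the content-word dictionary, the function-word dictionary also gives part of the CDC information, such as part of speech, case, number, relation, distribution, and node. That some dynamic information can be given in the dictionary before analysis follows from the nature of function words. For example, a preposition is always on a non-terminal node (node "Y"); primitive adverbs and universal words are generally not expanded, so they are always on terminal nodes (node "J"). The universal word ECH (EVEN) always precedes its pivot word (distribution "Q"). The primitive adverb JAM (ALREADY) is always an adverbial (relation "F"). The subordinating conjunction KE (THAT) always introduces a nominal clause (part of speech "K", node "K") and always follows its pivot word (distribution "H").

The article LA is always an attribute (relation "D"), precedes its pivot word (distribution "Q"), and sits on a terminal node (node "J").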

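3) Idiom dictionary

In machine translation, "idiom" is understood much more broadly than usual: any common, relatively fixed phrase may be entered as an idiom. Esperanto has few purely unanalyzable idiomatic expressions, so the idiom dictionary is relatively small. Its coverage also depends largely on the contrast between source and target language. Closely related languages express things in similar ways, so few or no idioms need to be entered. EChA accordingly has no Esperanto-English idiom dictionary, only an Esperanto-Chinese one.

Examples of EChA idioms:

MALFERMA(JN) AUTO(JN) -> 敞篷汽车 ( CF: OPEN CAR(S) )
SOMERA(JN) FERIO(JN) -> 暑假 ( CF: SUMMER HOLIDAY(S) )
LA ANGLA(N) LINGVO(N) -> 英语 ( CF: THE ENGLISH LANGUAGE )
INSTRUA(JN) LIBRO(JN) -> 教科书 ( CF: TEACHING BOOK(S) )
LA GRANDA(N) MURO(N) -> 长城 ( CF: THE GREAT WALL )
HOMA(N) SVARMO(N) -> 人群 ( CF: MAN'S SWARM )
FACILA(N) VENTO(N) -> 顺风 ( CF: EASY WIND )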
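One way to match such entries against running text is to compile each entry into a regular expression. The sketch below rests on a reading of the notation that the text does not state explicitly: "(JN)" is taken to mean an optional plural -J followed by an optional accusative -N, and "(N)" an optional -N only. The function names and the two-entry dictionary are hypothetical.

```python
import re

IDIOMS = {  # two entries from the examples above
    "SOMERA(JN) FERIO(JN)": "暑假",
    "LA GRANDA(N) MURO(N)": "长城",
}


def compile_idiom(pattern):
    """Turn an entry such as 'SOMERA(JN) FERIO(JN)' into a compiled regex."""
    words = []
    for token in pattern.split():
        m = re.fullmatch(r"(\w+?)(?:\((J?N?)\))?", token)
        base, optional = m.group(1), m.group(2) or ""
        # each listed ending letter becomes individually optional, in order
        words.append(re.escape(base) + "".join(letter + "?" for letter in optional))
    return re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)


def find_idioms(sentence, idioms=IDIOMS):
    """Return (matched text, Chinese gloss) for every idiom occurrence."""
    hits = []
    for pattern, gloss in idioms.items():
        for m in compile_idiom(pattern).finditer(sentence):
            hits.append((m.group(0), gloss))
    return hits
```

A real system would presumably match on analysed stems in the workspace rather than on raw text; the regex form is used here only because it keeps the example short.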

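4) Word-class sense distinction table

This table is very useful for machine translation with Esperanto as the source language: it greatly lightens the burden of sense disambiguation during synthesis. A word is entered in the table when its target-language sense changes according to whether its part of speech differs from its logical class, and the change does not follow the regular morphological conversion. For example, MATEMATIK-A(JN) must be entered while HOM-A(JN) need not be: the English for the former is MATHEMATICAL (not MATHEMATICS'), whereas for the latter it suffices to generate, by rule, the target-language possessive ending -'S or the particle "的" from the source adjectival form ( MAN-'S / "人-的" ). For every word that belongs in the table, the content-word dictionary gives the table record number (the random-access file address), so the system simply fetches the record at that address. When programming in BASIC, using the random-file record number as the internal code of a word is a technique worth recommending.

Examples from the word-class sense distinction table:

Content-word dictionary        Word-class sense distinction table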

ATING-I: ACHIEVE / 达到        ATING-O: ACHIEVEMENT / 成就
EKZEMPL-O: EXAMPLE / 例子      EKZEMPL-E: FOR EXAMPLE / 例如
KOMENC-I: BEGIN / 开始         KOMENC-E: AT BEGINNING / 开始时
MEZUR-I: MEASURE / 测量        MEZUR-O: MEASUREMENT / 尺寸
OKAZ-I: HAPPEN / 发生          OKAZ-O: OCCASION / 场合
SCI-I: KNOW / 知道             SCI-O: KNOWLEDGE / 知识
TIP-O: TYPE / 型号             TIP-A: TYPICAL / 典型的
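The lookup described above can be sketched with a list standing in for the random-access file, so that the record number stored in the stem dictionary is simply a list index. The data layout and the `senses` function are assumptions for illustration; the entries are taken from the examples above.

```python
# stem -> (logical class, English sense, Chinese sense, record number or None)
STEMS = {
    "ATING": ("V", "ACHIEVE", "达到", 0),
    "TIP": ("N", "TYPE", "型号", 1),
    "HOM": ("N", "MAN", "人", None),      # regular: no table record needed
}
# record number -> {part of speech found by ending stripping: (English, Chinese)}
SENSE_TABLE = [
    {"N": ("ACHIEVEMENT", "成就")},
    {"A": ("TYPICAL", "典型的")},
]


def senses(stem, pos):
    """Pick the target senses for a stem used with grammatical part of speech `pos`."""
    logical_class, english, chinese, record = STEMS[stem]
    if pos != logical_class and record is not None:
        special = SENSE_TABLE[record].get(pos)
        if special:
            return special                # irregular: the table overrides the rule
    # regular case: keep the base senses; morphological generation handles the
    # class change later (e.g. MAN -> MAN'S, 人 -> 人-的)
    return english, chinese
```

For ATING used as a noun (ATING-O) the table supplies ACHIEVEMENT / 成就, while HOM used as an adjective (HOM-A) falls through to the regular rule.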

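5) English irregular word table

This table is no different from the irregular lists found in the appendix of an ordinary English dictionary, except that, for simplicity, irregular verb forms and irregular noun plurals are kept in a single table. It is consulted during English morphological generation.

English irregular word table

Base form    Past tense    Past participle    Plural

BEAT         BEAT          BEATEN
BECOME       BECAME        BECOME
…            …             …                  …
CHILD                                         CHILDREN
…            …             …                  …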
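A minimal sketch of how such a table could be consulted during English morphological generation follows. The table holds only the three rows shown above, the function names are hypothetical, and the regular rules are deliberately simplified (no consonant doubling, no -Y to -IES), so they are an assumption rather than EChA's generation rules.

```python
# base form -> (past tense, past participle, plural); None where not applicable
IRREGULAR = {
    "BEAT": ("BEAT", "BEATEN", None),
    "BECOME": ("BECAME", "BECOME", None),
    "CHILD": (None, None, "CHILDREN"),
}


def past_tense(verb):
    """Irregular form if listed, otherwise a simplified regular -ED rule."""
    entry = IRREGULAR.get(verb)
    if entry and entry[0]:
        return entry[0]
    return verb + ("D" if verb.endswith("E") else "ED")


def plural(noun):
    """Irregular plural if listed, otherwise a simplified regular -S rule."""
    entry = IRREGULAR.get(noun)
    if entry and entry[2]:
        return entry[2]
    return noun + "S"
```

Keeping verbs and nouns in one table, as the text describes, means a single lookup serves both generators; the unused columns are simply empty.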

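Finally, the format of the EChA sentence workspace:

target-language sequence number | all content-word dictionary fields | CDC information | processed flags | function-word features | target-language reordering information | target-language displacement number

[Notes] 1. The target-language sequence number is used to assign a shared number during bottom-up reduction in the synthesis stage.

  2. The target-language displacement number stands for the whole entry when virtual reordering is done by the moving method. Virtual reordering, which moves a number instead of the whole entry, is more efficient than pure moving and roughly on a par with the linked-list method. Because BASIC cannot handle composite (record) variables, reordering by moving would have to shift the fields one at a time, so the virtual technique shows its advantage all the more. Note, however, that the word's natural sequence number must move together with the displacement number to mark the position of the original entry, so that later lookups remain reliable.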


 

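Liwei's Master's Thesis: The Hierarchical Recursive Constituent System (3)

An Experiment in Automatic Translation from Esperanto to Chinese and English
(Overview of the EChA Machine Translation System)

The Hierarchical Recursive Constituent System

Before defining the hierarchical recursive constituent system (CDC), we first describe its origin and theoretical basis.

CDC is an intermediate language for machine translation. With it we try to provide an abstract MT grammar that better meets the requirements of independent analysis and independent synthesis. CDC is the key to the EChA system and embodies our view of language structure and our understanding of machine translation. It derives directly from our supervisors' intermediate constituent system [2]: it keeps the form of the intermediate constituents while inheriting and reworking their content. Its theoretical basis is the theory of directed direct connections (or pivot-word theory). The main points embodied in CDC are:

1) The top level of a sentence is the main-clause predicate. It is the largest connection center of the whole sentence (the main pivot), so the predicate represents the whole sentence. The simplest and also the most typical form of a complete sentence is the one-word imperative, for example:

Venu! Come! 来!

Every other sentence (except predicate-less sentences, which are incomplete) is derived layer by layer from this simple form:

Venu! … La studento venu chi tien! … La studento, kiu parolis, venu chi tien! ……

Come! … Let the student come here! … Let the student, who spoke, come here! ……

Conversely, however complex a sentence is, reducing it layer by layer always ends at the top with the main-clause verb predicate: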

                  VENU
               /    |    \
        studento   tien   (!)
        /      \     |
      la    parolis  chi
           /   |   \
        (,)   kiu   (,)

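2) A word can be directly connected upward to only one other word, but it can itself carry N (N >= 0) directly connected words. This is the directed direct connection view of sentence structure.[3] A word that carries directly connected words is called a pivot word; when N > 0 it is a non-terminal node word. A directly connected word is itself often a pivot word one layer down.

3) The main-clause predicate (the main pivot) is on layer 1. Words directly connected to it are on layer 2. Words directly connected to layer-2 words are on layer 3. Link by link, every word of the sentence sits on some layer. In theory the number of layers is unbounded.

4) "Function words are not empty." Function words (also called structure words) carry more syntactic information than content words. Some function words can also act as pivot words. In the "preposition + noun" structure, for example, the preposition is the pivot. Subordinating conjunctions such as SE (IF) and KVANKAM (ALTHOUGH) also act as pivots: representing the subordinate clause, the conjunction connects directly to the main-clause predicate, and the directly connected word beneath it is the predicate of the subordinate clause.

5) As the intermediate-language mapping of the source sentence, hierarchical recursive constituents should, and can, be assigned to every single word. From the machine's point of view, a word is simply the character string between two spaces (Chinese is another matter). Strictly speaking, punctuation marks are also words (function words) and take part in the analysis and reduction of the sentence.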

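The two basic principles on which CDC is built are:

1) The principle of hierarchical recursion: reflect exactly as many layers as there are, and treat the layers as recursive. The recursion shows in two ways: (1) a sentence can be reduced layer by layer from the bottom up (see EChA's target-language generation algorithm); (2) a sentence can be analyzed layer by layer from the top down (see EChA's source-language analysis algorithm).

2) The word-based principle:[4] word and sentence (the latter represented by the main-clause predicate) are the two poles of a dynamic recursive process, and the links in between are the layers. In essence, applying the word-based principle means assigning constituents (CDC) to words at every layer. A sentence consists of sentence elements and of nothing else, and in our view every sentence element, large or small (word group, phrase, clause, and so on), is always represented by one pivot word.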

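We now give the formal definition of the hierarchical recursive constituent system:

  1. The hierarchical recursive constituent system is the set of hierarchical recursive constituents.
  2. A hierarchical recursive constituent is a six-part group of information:
    morphological information | structural relation information | node information | distribution information | layer number | chain numbers
  3. <morphological information> ::=
    { <part of speech>, <case>, <number>, <tense>, <voice>, <mood>, <non-finite form>, <aspect>, <person>, … }

<part of speech> ::= { N, V, A, F, P, Z, C, K, B }

N=noun, V=verb, A=adjective, F=adverb, P=preposition, Z=auxiliary verb, C=coordinating conjunction,
K=subordinating conjunction, B=punctuation mark

<case> ::= { none, nominative, accusative }

<number> ::= { none, singular, plural }

<tense> ::= { none, present, past, future }

<voice> ::= { none, active, passive }

<mood> ::= { none, indicative, imperative, subjunctive }

<non-finite form> ::= { none, participle, infinitive, verbal noun }

<aspect> ::= { none, progressive, perfect, future }

<person> ::= { none, first person, second person, third person }

  4. <structural relation information> ::= { S, W, O, D, F, B, T, I, C, L, M, A, Z, V, R }

S=subject, W=predicate, O=object, D=attribute, F=adverbial, B=complement, T=appositive,
I=independent element, C=coordinating conjunction or punctuation, L=clause-opening punctuation, M=clause-closing punctuation,
A=parenthetical-opening punctuation, Z=parenthetical-closing punctuation, V=punctuation without structural meaning, R=sentence-final punctuation

  5. <node information> ::= { J, <non-terminal node> }

J=terminal node

<non-terminal node> ::= { S, O, D, B, K, X, Y }

S=subject-clause node, O=object-clause node, D=attributive-clause node, B=complement-clause node,
K=general clause node, X=verbal non-terminal node, Y=other non-terminal node

  6. <distribution information> ::= { Q, H, G }

Q=before the pivot word, H=after the pivot word, G=pivot

  7. <layer number> ::= { none, <natural number> }

<natural number> ::= { 1, 2, 3, … }

  8. <chain numbers> ::= { <left chain number>, <right chain number> }

<left chain number> ::= { none, 99, N }

N=a natural number greater than the sentence-initial number and less than the sentence-final number

<right chain number> ::= { none, N }

[Note] The left chain number is provided to make coordinated elements easy to handle. We take the rightmost member of a coordination as the representative (landing point, pivot) of the whole. Left chain number 99 marks the leftmost member of a coordination. With the left chain number in place, coordinated elements can take part in analysis and reduction like any other sentence element.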

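Below is an example sentence (004) analyzed with this constituent system.

The morphological information in each CDC is omitted; the remaining fields are, in order: relation / node / distribution / layer / left chain / right chain. For example:

FJQ 05 00 02 ->
adverbial / terminal node / before its pivot word / on layer 5 / no left chain (00 means no left chain number)
/ right chain number 02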

(Esp = Esperanto source, Eng = English gloss, Chi = Chinese gloss, CDC = CDC chain; "-" marks a word with no separate equivalent.)

Esp:  Pli            poste          ,              kiam           la             sciodisketoj
Eng:  More           later          ,              when           the            knowledge-disks
Chi:  更             以后           ,              当(…时)        -              微型知识磁盘
CDC:  FJQ 05 00 02   FYQ 04 00 17   LJQ 05 00 04   FKQ 04 00 17   DJQ 07 00 06   SYQ 06 00 07

Esp:  estis          eltrovitaj     ,              la             plenan         indikaron [accusative]
Eng:  had been       found out      ,              the            full           indication
Chi:  被             发明了         ,              -              全套           指令集合
CDC:  WBH 05 00 04   BJH 06 00 07   MJH 05 00 04   DJQ 05 00 12   DJQ 05 00 12   OYQ 04 00 17

Esp:  ,              endiskigitan   ,              oni            metis          en
Eng:  ,              endisked       ,              people         put            into
Chi:  ,              所写入磁盘的   ,              人们           放             到(…里面)
CDC:  AJQ 06 00 14   DYH 05 00 12   ZJH 06 00 14   SJQ 04 00 17   WXG 03 99 20   BYH 04 00 17

Esp:  mashinojn      kaj            ili            tiamaniere     povis          en
Eng:  machines       and            they           therefore      could          in
Chi:  机器           -              它们           这样           能             在(…里面)
CDC:  OJH 05 00 18   CJQ 02 17 23   SJQ 02 00 23   FJQ 02 00 23   WXG 01 20 00   FYQ 03 00 27

Esp:  si             mem            akumuli        sciencan       stokon         ,
Eng:  themselves     -              accumulate     scientific     stock          ,
Chi:  自己           本身           积累           科学           贮蓄           ,
CDC:  BYH 04 00 24   BJH 05 00 25   BXH 02 00 23   DJQ 04 00 29   OYH 03 00 27   VJQ 05 00 32

Esp:  pli            grandan        ol             la             homa           cerbo          .
Eng:  more           great          than           the            man's          brain          .
Chi:  更             大             比             -              人的           头脑           .
CDC:  FJQ 05 00 32   DYH 04 00 29   FYH 05 00 32   DJQ 07 00 36   DJQ 07 00 36   BYH 06 00 33   RJH 02 00 23

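Hierarchical recursive constituents are in essence a reflection of the direct connections between words on different layers. They reveal the correct syntax tree of the sentence: from the CDC chain of a sentence, its syntax tree is easy to draw.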
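To make the last point concrete, the sketch below decodes the CDC chain of example (004) and recovers the tree. It relies on a reading inferred from the example rather than stated in the text: the right chain number is taken to be the position of a word's pivot (00 for the main predicate), and a word's layer is taken to be its pivot's layer plus one. The names `CDC`, `parse_cdc`, `children`, and `layers_consistent` are hypothetical.

```python
from collections import namedtuple

CDC = namedtuple("CDC", "relation node distribution layer left_chain right_chain")

# CDC chain of example (004), one code per word, positions 1..37
EXAMPLE_004 = [
    "FJQ 05 00 02", "FYQ 04 00 17", "LJQ 05 00 04", "FKQ 04 00 17", "DJQ 07 00 06",
    "SYQ 06 00 07", "WBH 05 00 04", "BJH 06 00 07", "MJH 05 00 04", "DJQ 05 00 12",
    "DJQ 05 00 12", "OYQ 04 00 17", "AJQ 06 00 14", "DYH 05 00 12", "ZJH 06 00 14",
    "SJQ 04 00 17", "WXG 03 99 20", "BYH 04 00 17", "OJH 05 00 18", "CJQ 02 17 23",
    "SJQ 02 00 23", "FJQ 02 00 23", "WXG 01 20 00", "FYQ 03 00 27", "BYH 04 00 24",
    "BJH 05 00 25", "BXH 02 00 23", "DJQ 04 00 29", "OYH 03 00 27", "VJQ 05 00 32",
    "FJQ 05 00 32", "DYH 04 00 29", "FYH 05 00 32", "DJQ 07 00 36", "DJQ 07 00 36",
    "BYH 06 00 33", "RJH 02 00 23",
]


def parse_cdc(code):
    """Decode a string such as 'FJQ 05 00 02' (morphological information omitted)."""
    tags, layer, left, right = code.split()
    return CDC(tags[0], tags[1], tags[2], int(layer), int(left), int(right))


def children(chain):
    """Map each pivot position to the positions of its directly connected words.

    Assumes the right chain number is the pivot's position; key 0 holds the root.
    """
    tree = {}
    for position, code in enumerate(chain, start=1):
        tree.setdefault(parse_cdc(code).right_chain, []).append(position)
    return tree


def layers_consistent(chain):
    """Check that every word sits exactly one layer below its pivot."""
    items = [parse_cdc(code) for code in chain]
    for item in items:
        expected = 1 if item.right_chain == 0 else items[item.right_chain - 1].layer + 1
        if item.layer != expected:
            return False
    return True
```

Under this reading the root is word 23 (povis), whose directly connected words are kaj, ili, tiamaniere, akumuli, and the final full stop, and the layer numbers of all 37 words agree with the tree.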

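Experiments show that, as an MT intermediate language embodying the results of independent analysis, the hierarchical recursive constituent system is fairly effective. More and more experts now call for an interlingua, or something like one, that fully captures the results of source-language analysis and correctly reveals the hierarchical structure and semantic information of a sentence. Many papers have argued that analysis and synthesis need to be independent: when source analysis depends on the target language, or target synthesis depends on the source language, neither can go deep and both are stretched thin.[5]

Of course, the hierarchical recursive constituent system is still in its infancy and is bound to have many problems, which must be tested, improved, and refined in practice. With time and effort it may eventually become a handy MT tool that people are glad to adopt, which is naturally what we hope. Or it may prove a poor scheme and soon be discarded. Either way, it is a worthwhile attempt.

A shortcoming of the system is that it does not capture well the semantic nature of directed direct connections, which is fairly crucial information for high-quality machine translation. However much human languages differ, they always share something. The hierarchical structure among sentence elements and their direct connections, for instance, are highly universal. It is precisely such language universals that make translation possible and form the basis of conversion between languages. The logical-semantic connection between sentence elements is another important universal.[6] Determining logical semantics greatly helps in generating idiomatic target text. In CDC, the structural relation field is largely inherited from the syntactic functions of traditional grammar and reflects surface-structure relations (subject, predicate, object, attribute, adverbial, complement, and so on). It seems necessary to extend CDC with one more element, logical semantics: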

<logical-semantic information> ::= { Ag, Sb, Ob, Vb, Pl, Tl, Mn, Pp, Rs, Fr, Rg, Dg, Tm, Pr, Cl, Fn, Ms, Pm, Cd, Nb, Pt, Mt, Ps, Tg, Cs, Ex, Dt, Ct, Cn, Cc, Cp, Tw, Xx }

Ag=Agent, Sb=Subject, Ob=Object (patient), Vb=Verb (action), Pl=Place,
Tl=Tool, Mn=Manner, Pp=Purpose, Rs=Result,
Fr=Frequency, Rg=Range, Dg=Degree, Tm=Time (point in time),
Pr=Period, Cl=Colour, Fn=Function, Ms=Measurement,
Pm=Post-modifier, Cd=Condition, Nb=Number,
Pt=Property, Mt=Material, Ps=Possession, Tg=Target,
Cs=Cause, Ex=Explanation, Dt=Determiner,
Ct=Circumstance, Cn=Content, Cc=Concession,
Cp=Comparison, Tw=Apposition, Xx=non-semantic (or undetermined)

[Note] Xx is the logical semantics of every constituent whose semantics cannot, or need not, be determined. Unlike natural language understanding, machine translation does not demand ever more specific and thorough analysis. How deep the intermediate information should go is decided by the principle of necessity and sufficiency: too little hurts quality; too much is wasted effort.

_____________________________________________________________

Notes: [2] On the intermediate constituent system, see:

刘涌泉, 刘倬, 高祖舜 《俄汉机器翻译规则系统新旧方案比较》 ( 《中国语文》 1962.2 )

刘涌泉 《外汉机器翻译中的中介成分体系》 ( 《中国语文》 1982.2 )

刘倬 《三次机器翻译试验》 ( paper for the First Machine Translation Conference, 1980.9 )

[3] On the theory of directed direct connections, see:

刘涌泉, 刘倬, 高祖舜 《俄汉机器翻译规则系统新旧方案比较》 (as above)

刘涌泉, 刘倬, 高祖舜 《机器翻译中的词序问题》 ( 《中国语文》 1965.3 )

See also 《特斯尼埃的〈结构句法基础〉简介》 ( 张烈材, 《国外语言学》 1985.2 )

[4] See: 刘涌泉 《词》 ( paper for the 1984 Symposium on Machine Translation and Natural Language Processing, 1984.9 )

[5] See: 冯志伟 《当前机器翻译的一些新特点》 ( 《情报学刊》 1982, Vol. 1, No. 2 )

[6] See: 董振东 《逻辑语义及其在机译中的应用》 ( 《中国的机器翻译》 pp. 25-45 )

 

 

 

 

【相关】

立委硕士论文:目标语调序
