分类： 杂类

郭:
Amazon’s $2.5M ‘Alexa Prize’ seeks chatbot that can converse intelligently for 20 minutes

【从IBM沃森平台的云服务谈AI热门中的热门 bots】

我:
哥仨老革命在去 IBM 的 traffic 中去大名鼎鼎的沃森（Watson）系统探秘

洪:
讲者是这位印度籍女士：http://researcher.watson.ibm.com/researcher/view.php?person=us-vibha.sinha:

郭:
比较有意思的是她后面讲的三点:

1. LSTM based intent recognition and entity extraction

2. "tone" recognition
这里tone指的是从一句话（书面语）反应出的说话人的喜怒哀乐和处事方式等

3. personality recognition
主要基于心理学的分类，用200到2000条tweets训练

她重点强调的是，通过增加tone和personality的识别，人机对话可以有更高的可接受度。

我:
唐老师诸位汇报一下昨天的听闻。上面郭老师也总结了几条，很好。我再说几点。
话说三位老革命慕名而去，这个 meet-up 一共才来了20几位听众吧大概湾区此类活动甚多 marketing 不够的话也难。据说北京的 AI 沙龙，弄个花哨一点的题目往往门庭若市。

1. 没有什么 surprises 但参加沙龙的好处是可以问问题和可以听别人问问题，而主讲人常常在回答的时候给出一些书面没有的数据和细节。否则的话，各种资料都在网上（最后的 slide 给了链接），要写利人似的调研报告，只要不怕苦，有的是资料。

听讲的另一个好处是，主讲人事先已经组织好材料讲解，可以快速了解一个项目的概貌。

2. 特地替唐老师问了他钟情的 Prolog，问你们有用吗，在什么模块用。主讲人说，没有用。我说有报道说有用到。她说，她没听说，至少在她主讲的已经产品化的这个沃森 chatbot 的组建 toolkit 里面没有 Prolog。当然她不排除某个小组或个人在沃森的某个项目或模块用到。IBM 对 AI 的投入增大，在沃森的名号下的各种研究项目和小组很多。

马:
我问过了IBM中国的，在沃森参加电视节目版本中没有用prolog，但是后续的版本中，确实用到了prolog

陈:
它是很多services构成，用不会奇怪，尤其是某些既有系统

我:

3. 现在不少巨头都在 offer 这样的 toolkit，问微软 offer 的 cortana 的 toolkit 与你们沃森的这套有啥不同。回答是，非常类似，不过她自认为沃森质量更好。亚马逊也有类似的 offer。

所以回来路上，我们就谈到这个 bots 遍地开花的场景。郭老师说，现如今谁要想做一个领域内的 bot，或自己的 app 做一个 bot 接口，根本就不需要编程。只要准备好领域的 experts，把数据准备好，用这些巨头的工具箱就可以构建一个出来。也一样可以 deploy 到 messenger 或嵌入其他场景，这几乎是一条龙的云服务。

当然用这些服务是要交钱的，但主讲人说很便宜很便宜的，郭兄说，真用上了，其实也不便宜。便宜与否放一边，至少现如今，bots 的门槛很低，需要的不是软件人才，而是领域数据的人。于是，我看到一种前景，以前毕业即失业的语言学家、图书馆业人士，将来可能成为 AI 的主力，只有对数据和细节敏感的人，最终才是 AI 接口的血肉构筑者，反正架构是现成通用的。这个细想想是有道理的。这是沃森 API calls 的价格。

我:
这就回到我们以前议论过的话题。AI 创业，如果做平台或工具箱，初创公司怎么敌得过巨头呢？我觉得几乎是死路。

大而言之做平台和工具箱创业的，历史上就没见过什么成功案例（不排除做了被收购那种，那也是“成功”，如果你的技术有幸被巨头看中：其实昨晚介绍的沃森系统的一个重要组件 AlchomyLanguage 就是收购的，洪爷知道收购的来路和细节）。

白:
麦当劳玩法，方便，质量可控，但绝非美食，虽然是“美”食。

我:
不错，这些巨头的 offerring 都是麦当劳式的流程。创业的空间，从工具角度，可以是中华料理的配方辅助工具之类。不过，还是那句话，最好绕过平台本身创业的思维，而是用巨头的工具或者自家建造匕首去做领域的 AI，这样的创业应该具有更大的空间和更多的可能性。

对于 NLP（AI之一种）我写过 n 篇博文强调，所有的 offshelf 的平台和toolkit（譬如历史悠久的GATE），甚至一个小插件（譬如 Brill Tagger or some Chinese word segmenter）都不好用。可以 prototyping 但如果稍微有点长期观点要建一个大规模的NLP的应用，还是一切自家建造为好。当然，自家建造的门槛很高，多数人造不起，也没这个 architect 来指挥。但最终是，自家建造的胜出，从质量上说（质量包括速度、鲁棒性、精度广度、领域的可适应性等关键综合指标）。

巨头的工具箱的产品 offers 一开始也不赚钱，但他们的研发积累已经做了，且还在不断投入，不产品化成工具箱不是傻瓜吗，赚多少算多少。如果真到了AI bots 遍地开花的时候，他们凭借巨大的平台优势，赚钱也是可能的。小公司这条路没门吧。如果你的 offer 的确 unique，譬如是中华料理，譬如是伟哥的 parsing，你可能会吸引一批使用者。但想赚钱必须有规模，而 component tech 或平台工具之类，在小公司的环境中，是成不了规模的。所以不要想赚钱的事儿。

赚钱靠的是产品，而不是工具，这是AI创业铁律。

当然，通过平台或工具打出影响，做 marketing，曲线救国创业，另当别论。
回到 meet-up：

4. bots 构建的核心当然是 conversations 的训练工具。IBM沃森的工具用的是深度神经。

对于 bots，input 是确定的，就是用 bots 的人的输入。自然语言的语音也好文字也好，语音反正也要转化为文字所以我们面对的就是人机接口中的“人话”，理论上无止境千变万化。

bots 的 output 呢？

在目前的框架里，在绝大多数实际场景，这个 output 都是以极为有限的集合
最典型的案例是为 apps（天气、股票、时间之类）做 bots 作为 apps 的人机接口，
其 output 就是 app 里面的 commands 集合。于是 bot 产品定义为从无限到有限的映射，这是一个典型的分类场景。于是沃森提供这个深度学习为基础的工具帮助你训练你所需要的 classifiers，这是标准做法无甚新意。

数据越多，分类质量越好。千变万化的死敌是稀疏数据。好在对于 bots，数据的收集会是一个边使用边加强的过程。如果你的 bots 开始有用户，你就形成了正循环，数据源源而来，你不断打磨、训练，这些都是可以 streamline 的流水作业，就越来越好。Siri 如此，Echo 也如此。

白:
分类本身是不带参数的，而bots的应对必须是带参数的，这是硬伤。
拿分类来做对话是看得到天花板的。

我:
I cannot agree more :=)

这里其实是有历史渊源的。IBM 做问答，一直是把问题简化为分类。18 年前我们在第一次 QA 竞赛（TREC-8）中交流就是如此，这么多年这个核心做法一直不变。当时我们的QA成绩最好，得分66%，沃森的系统印象是40%左右，他们的组长就追在后面问，我们思路差不多呀，都是 question intents（我们叫 asking points，比多数 intents 其实更聚焦），外加 Named Entity 的support。我说我们还用到了语言结构啊。

直到今天他们仍然是没有句法分析，更甭提深度分析。他们当年的 QA 就是基于两点：
1. 问句分类：试图了解 intents；2. NE。有了这两条，通过 keywords 检索作为 context，在大数据中寻找答案，对于 factoid questions 是不难的（见【立委科普：问答系统的前生今世】）。这就是沃森打败人类的基本原理，一点也不奥秘，从来没有根本改变。现在这一套继续体现在其 bots 工具箱 offering 里面。

洪:

昨晚Watson讲座听，
今早广告已跟进。
IBM可真下本，
今天我试Bluemix云。

我:
2. 因此 conversations 训练，其核心就是两条：一个是 intents classification （这个 intents 是根据 output 的需求来定义的），一个 NE，不过 NE 是他们已经训练好的模块（NE有一定的domain独立性），用户只是做一些微调和增强而已。

顺便插一句，这几天一直在想，AI 现在的主打就是深度神经，所有的希望都寄托在神经上。但无论怎么神经，都不改 supervised learning 的本性：所以，我的问题是：你怎么克服缺乏带标大数据的知识瓶颈？

ok 你把机器翻译玩转了。因为 MT 有几乎无限的 “自然” 带标数据（其实也不是自然了，也是人工，幸运的是这些人力是历史的积累，是人类翻译活动的副产品，是不需要开发者花钱的 free ride）。可其他的 ai 和 nlp 应用呢，你还可以像 MT 这样幸运这样享用免费午餐吗？

现在想，紧接着 MT 的具有大数据的热门应用是什么？非 bots 莫属。
对于 bots，数据已经有一定的积累了，其最大的特点在于，bots 的使用过程，数据就会源源而来。问题是这些数据是对路的，real life data from the field，但还是不带标啊。所以，bots 的前景就是玩的跟数据打仗：可以雇佣人去没完没了地给数据做标注。这是一个很像卓别林的【摩登时代】的AI工厂的场景，或者是列宁同志攻打冬宫的人海战术。看上去很笨，但可以确定的是，bots 会越来越“智能”，应对的场景也越来越多。应了那句老话，有多少人工，就有多少智能。然而，这不是、也不应该是唯一的克服知识瓶颈的做法。

毛:
嗯，有多少人工，就有多少智能。这话说得好。

我:
但这个景象成为常规也不错至少是帮助解决了一些白领就业。是用高级的专家知识去编写规则来提高系统质量，还是利用普罗标注去提高质量，从帮助就业和维稳角度看，几乎蛮力似的深度神经对于标注大数据的无休止的渴望和胃口，对于社会似乎更为有利。为了社会稳定和世界和平，我们该看好这种蛮力。我们做深度分析和理解的专家，试图尽可能逼真地去模拟人的智能过程，但对蛮力也应该起一份敬意。

将来的AI，什么人都可做：1. 你发现一个领域的 AI 需求； 2. 你雇佣一个对这个需求可以形式化定义的设计家； 3. 你调用巨头的一个通用的 AI 工具箱（譬如 TensorFlow）或面向专项产品的工具箱（譬如 bot 的沃森工具箱） 4 你雇佣一批失业但受过教育的普罗，像富士康一样训练他们在流水线上去根据设计家的定义去标注数据、测试系统，你于是通过 AI 创造了价值，不排除你的产品会火。因为产品火不火已经不是技术了，而是你满足需求的产品角度。

3. 但是正如白老师说的这种用分类来简化问题的 AI 产品化，走不远。它可能满足一些特定领域的特定的需求但是后劲不足是显然的。其中一个痛点或挑战就是，这种东西走不出三步，三步以上就抓瞎。如果你的应用可以在三步之内就基本满足需求，没问题。

bots 最显然的有利可图的应用场景是客服。一般而言，bots 取代和补充客服是大势所趋，因为客服的知识资源和记忆，根本没法与我们可以灌输给 bots 的知识来相比。利用知识去回答客户疑问，人不如机，是可以想见的。但是观察一个好的客服与客户的交互可以发现，三步的交流模型是远远无法满足稍微复杂一点的场景的。三步的说法是一个比喻，总之是目前的工具箱，对于较长时期的对话，还是束手无策。

bots 对用户话语的理解简化为 classification，以此为基础对用户的回答就不是那么简单了。目前提供的做法是：因为 intents 是有限的集合，是 classification 的结果，那么对于每一个 intent 可以预知答案（存在数据库的 hand-crafted text snippet）或回应（譬如展示一个图，譬如天气app的今日天气图表)。这些预制的答案，听上去非常自然、生动甚至诙谐，它们都是领域专家的作品。且不说这些预制的 snippets，如何根据classification hierarchy 本身需要做不同组装，在存于数据库里面的核心应答的预制以外，还可以加上情感的维度，还可以加上 personalized 的维度，这些都可以使得对话更加人性化、自然化，但每加一个维度就意味着我们开始接近组装式策略的组合爆炸后果。三步、三维以上就无法收拾。

我问主讲人，你的这些预先制定好的应答片段，按照你的工具的组装方式，不就是一个 decision tree 吗？回答是，的确，就是一个 decision tree 的做法。然后她说，有不少研究想突破这种应答模式，但都是在探索，没有到可以产品化工具化的阶段。

郭老师说，谁要是有本事把人机的 “自然对话”能够延长到 20 分钟，换句话说就是突破图灵测试，谁就是 AI bots 的真正破局者。如果你证明你能做到，巨头会抢着来高价收购你的。这是所有做 bots 的所面临的共同挑战。

据说小冰最高记录是与单一的人谈了九个小时的心。但那不是真正的突破，那是遇到了一个异常人类。正常的人，我的体会是两分钟定律，你与小冰谈话超不过两分钟。我试过多次，到了两分钟，它所露出来的破绽就让你无法忍受，除非自己铁心要自我折磨。其实工业界要求的连续对话，不是小冰这种闲扯。而是针对一个稍微复杂一点的任务场景（譬如订票）如何用自然对话的假象去把相关的信息收集全，来最大限度地满足客户需求。

累了，先笔记和评论如上。其余还有一些有趣的点儿可以讨论，以后再说。这是交给我们唐老师的作业。

洪:
亚马逊正设大奖，
chatbot赛悬赏。
对话若超廿分长，
两半米粒到手上。// 2.5M

【相关】

【立委科普：问答系统的前生今世】

Amazon’s $2.5M ‘Alexa Prize’ seeks chatbot that can converse intelligently for 20 minutes

【语义计算：精灵解语多奇智，不是冤家不上船】

白:
“他分分钟就可以教那些不讲道理的人做人的道理。”

我:

1016a

一路通，直到最后的滑铁卢。
定语从句谓语是“做人”而不是“可以教”，可是定语从句【【可以教。。。的】道理】与 vp定语【【做人的】道理】，这账人是怎么算的？

白：
还记得“那个小集合”吗？sb 教 sb sth，坑已经齐活儿了
“道理”是一般性的，定语是谓词的话一定要隐含全称陈述，不能是所有坑都有萝卜的。当然这也是软性的。只是在比较中不占优而已。单独使用不参与比较就没事：“张三打李四的道理你不懂”就可以，这时意味着“张三打李四背后的逻辑你不懂”。
“他分分钟就可以把一个活人打趴下的道理我实在是琢磨不透。”这似乎可以。

我:
教至少两个 subcats：
教 sb sth
教 sb todo sth

白:
这个可以有
刚刚看到一个标题起：没有一滴雨会认为自己制造了洪灾。
这个句法关系分析的再清楚，也解释不了标题的语义。

宋:
有意思。

我:
教他
教他做人
教他道理
教他做人的道理
教他的道理
教他做人的往事儿

这个 “道理” 和 “往事”，是属于同一个集合的，我们以前讨论过的那个集合，不参与定语从句成分的 head n。

白:
对

我:
这个集合里面有子集是关于 info 的，包括道理新闻公告往事。。。

白：
但是于“道理”而言，坑不满更显得有抽象度。是没“提取”，但坑不满更顺更优先，因为隐含了全称量词。

我:
就是说这个集合里面还有 nuances 需要照顾。滑铁卢就在 “教他做人的往事儿” 上，照顾了它就照顾不了 “做人的道理”。
就事论事我可以词典化 “做人的道理”，后者有大数据的支持。

白:
这可是能产的语言现象。
试试这个：“你们懂不懂做人要低调的道理？”

我:
我试试人在外但电脑带了只好拍照了

371656522530864097

你们懂不懂道理，这是主干
什么道理？
要低调的道理。
谁要低调？
你们。
懂什么类型的道理？
做人的道理。
谁做人？
你们。
小小的语义计算图谱，能回答这么多问题，这机器是不是有点牛叉？

白:
图上看，“要低调”是“懂道理”的状语而不是“道理”的定语？

我:
这个是对的，by design。但我们设计vn合成词的时候，我们要求把分离词合成起来。如果 n 带有定语，合成以后就指向合成词整体。这时候为了留下一些痕迹，有意在系统内部保留定语的标签，以区别于其他的动词的状语修饰语。否则，“懂【要低调的】道理” 与 “【要低调的】懂道理”，就无法区分了。这样处理语义落地有好处完全是系统内部的对这种现象的约定和协调 system internal。定语状语都是修饰语大类无异。

白:
“做人要低调”是一个整体，被拆解了。逻辑似乎不对。
拆解的问题还没解决：不管x是谁，如果x做人，x就要低调。
两个x是受全称量词管辖的同一个约束变元。
@宋早上您似乎对“没有一滴雨会认为自己制造了洪灾”这个例子有话要说？

宋:
@白硕主要是觉得这句话的意思有意思。从语义分析看应该不难，因为这是一种模式：没有NP V。即任何x，若x属于NP，则否定V(x)。

白:
首先这是一个隐喻，雨滴是不会“认为”如何如何的，既然这样用，就要提炼套路，看把雨滴代换成什么：雨滴和洪水的关系，是天上的部分和地上的整体的关系，是无害无责任的个体和有害有责任的整体的关系。

“美国网约车判决给北上广深的启示”

洪:
中土NLP全家福，
烟台开会倾巢出。
语言架桥机辅助，
兵强马壮数据足。

chinanlp
中国nlp全家福啊@wei

白: 哈
李白无暇混贵圈，一擎核弹一拨弦。精灵解语多奇智，不是冤家不上船。

洪:
冤家全都上贼船，李白有事别处赶。天宫迄今无甚关，Alien语言亟需练。

我:
白老师也没去啊敢情。
黑压压一片吾道不孤勒。

【相关】

【李白对话录：RNN 与语言学算法】

【李白对话录：如何学习和处置“打了一拳”】

【李白对话录：你波你的波，我粒我的粒】

【李白对话录- 从“把手”谈起】

【李白对话录：如何学习和处置“打了一拳”】

【李白对话录之六：NLP 的Components 及其关系】

乔姆斯基批判

[转载]【白硕 – 穿越乔家大院寻找“毛毛虫”】

泥沙龙笔记：parsing 是引擎的核武器，再论NLP与搜索

泥沙龙笔记：从 sparse data 再论parsing乃是NLP应用的核武器

【Question answering of the past and present】

A pre-existence

The traditional question answering (QA) system is an application of Artificial Intelligence (AI). It is usually confined to a very narrow and specialized domain, which is basically made up of a hand-crafted knowledge base with a natural language interface. As the field is narrow, the vocabulary is very limited, and its pragmatic ambiguity can be effectively under control. Questions are highly predictable, or close to a closed set, the rules for the corresponding answers are fairly straightforward. Well-known projects in the 1960s include LUNAR, a QA system specializing in answering questions about the geological analysis on the lunar samples collected from the Apollo's landing on the Moon. SHRDLE is another famous QA expert system in AI history, it simulates the operation of a robot in the toy building world. The robot can answer the question of the geometric state of a toy and listen to the language instruction for its operation.

These early AI explorations seemed promising, revealing a fairy-tale world of scientific fantasy, greatly stimulating our curiosity and imagination. Nevertheless, in essence, these are just toy systems that are confined to the laboratory and are not of much practical value. As the field of artificial intelligence was getting narrower and narrower (although some expert systems have reached a practical level, majority AI work based on common sense and knowledge reasoning could not get out beyond lab), the corresponding QA systems failed to render meaningful results. There were some conversational systems (chatterbots) that had been developed thus far and became children's popular online toys (I remember at one time when my daughter was young, she was very fond of surfing the Internet to find various chatbots, sometimes deliberately asking tricky questions for fun. Recent years have seen a revival of this tradition by industrial giants, with some flavor seen in Siri, and greatly emphasized in Microsoft's Little Ice).

2. Rebirth

Industrial open-domain QA systems are another story, it came into existence with the development of the Internet boom and the popularity of search engines. Specifically, the open QA system was born in 1999, when the TREC-8 (Eighth Text Retrieval Conference) decided to add a natural language QA track of competition, funded by the US Department of Defense's DARPA program, administrated by the United States National Institute of Standards and Technology (NIST), thus giving birth to this emerging QA community. Its opening remarks when calling for the participation of the competition are very impressive, to this effect:

Users have questions, they need answers. Search engines claim that they are doing information retrieval, yet the information is not an answer to their questions but links to thousands of possibly related files. Answers may or may not be in the returned documents. In any case, people are compelled to read the documents in order to find answers. A QA system in our vision is to solve this key problem of information need. For QA, the input is a natural language question, the output is the answer, it is that simple.

It seems of benefit to introduce some background for academia as well as the industry when the open QA was born.

From the academic point of view, the traditional sense of artificial intelligence is no longer popular, replaced by the large-scale corpus-based machine learning and statistical research. Linguistic rules still play a role in the field of natural language, but only as a complement to the mainstream machine learning. The so-called intelligent knowledge systems based purely on knowledge or common sense reasoning are largely put on hold by academic scholars (except for a few, such as Dr. Douglas Lenat with his Cyc). In the academic community before the birth of open-domain question and answering, there was a very important development, i.e. the birth and popularity of a new area called Information Extraction (IE), again a child of DARPA. The traditional natural language understanding (NLU) faces the entire language ocean, trying to analyze each sentence seeking a complete semantic representation of all its parts. IE is different, it is task-driven, aiming at only the defined target of information, leaving the rest aside. For example, the IE template of a conference may be defined to fill in the information of the conference [name], [time], [location], [sponsors], [registration] and such. It is very similar to filling in the blank in a student's reading comprehension test. The idea of task-driven semantics for IE shortens the distance between the language technology and practicality, allowing researchers to focus on optimizing tasks according to the tasks, rather than trying to swallow the language monster at one bite. By 1999, the IE community competitions had been held for seven annual sessions (MUC-7: Seventh Message Understanding Conference), the tasks of this area, approaches and the then limitations were all relatively clear. The most mature part of information extraction technology is the so-called Named Entity (NE tagging), including identification of names for human, location, and organization as well as tagging time, percentage, etc. The state-of-the-art systems, whether using machine learning or hand-crafted rules, reached a precision-recall combined score (F-measures) of 90+%, close to the quality of human performance. This first-of-its-kind technological advancement in a young field turned out to play a key role in the new generation of open-domain QA.

In industry, by 1999, search engines had grown rapidly with the popularity of the Internet, and search algorithms based on keyword matching and page ranking were quite mature. Unless there was a methodological revolution, the keyword search field seemed to almost have reached its limit. There was an increasing call for going beyond basic keyword search. Users were dissatisfied with search results in the form of links, and they needed more granular results, at least in paragraphs (snippets) instead of URLs, preferably in the form of direct short answers to the questions in mind. Although the direct answer was a dream yet to come true waiting for the timing of open-domain QA era, the full-text search more and more frequently adopted paragraph retrieval instead of simple document URLs as a common practice in the industry, the search results changed from the simple links to web pages to the highlighting of the keywords in snippets.

In such a favorable environment in industry and academia, the open-domain question answering came onto the stage of history. NIST organized its first competition, requiring participating QA systems to provide the exact answer to each question, with a short answer of no more than 50 bytes in length and a long answer no more than 250 bytes. Here are the sample questions for the first QA track:

Who was the first American in space?
Where is the Taj Mahal?
In what year did Joe DiMaggio compile his 56-game hitting streak?

3. Short-lived prosperity

What are the results and significance of this first open domain QA competition? It should be said that the results are impressive, a milestone of significance in the QA history. The best systems (including ours) achieve more than 60% correct rate, that is, for every three questions, the system can search the given corpus and is able to return two correct answers. This is a very encouraging result as a first attempt at an open domain system. At the time of dot.com's heyday, the IT industry was eager to move this latest research into information products and revolutionize the search. There were a lot of interesting stories after that (see my related blog post in Chinese: "the road to entrepreneurship"), eventually leading to the historical AI event of IBM Watson QA beating humans in Jeopardy.

The timing and everything prepared by then from the organizers, the search industry, and academia, have all contributed to the QA systems' seemingly miraculous results. The NIST emphasizes well-formed natural language questions as appropriate input (i.e. English questions, see above), rather than traditional simple and short keyword queries. These questions tend to be long, well suited for paragraph searches as a leverage. For competition's sake, they have ensured that each question asked indeed has an answer in the given corpus. As a result, the text archive contains similar statements corresponding to the designed questions, having increased the odds of sentence matching in paragraph retrieval (Watson's later practice shows that from the big data perspective, similar statements containing answers are bound to appear in text as long as a question is naturally long). Imagine if there are only one or two keywords, it will be extremely difficult to identify relevant paragraphs and statements that contain answers. Of course, finding the relevant paragraphs or statements is not sufficient for this task, but it effectively narrows the scope of the search, creating a good condition for pinpointing the short answers required. At this time, the relatively mature technology of named entity tagging from the information extraction community kicked in. In order to achieve the objectivity and consistency in administrating the QA competition, the organizers deliberately select only those questions which are relatively simple and straightforward, questions about names, time or location (so-called factoid questions). This practice naturally agrees with the named entity task closely, making the first step into open domain QA a smooth process, returning very encouraging results as well as a shining prospect to the world. For example, for the question "In what year did Joe DiMaggio compile his 56-game hitting streak?", the paragraph or sentence search could easily find text statements similar to the following: "Joe DiMaggio's 56 game hitting streak was between May 15, 1941 and July 16". An NE system tags 1941 as time with no problem and the asking point for time in parsing the wh-phrase "in what year" is also not difficult to decode. Therefore, an exact answer to the exact question seems magically retrieved from the sea of documents to satisfy the user, like a needle found in the haystack. Following roughly the same approach, equipped with gigantic computing power for parallel processing of big data, 11 years later, IBM Watson QA beat humans in the Jeopardy live show in front of the nationwide TV audience, stimulating the entire nation's imagination with awe for this technology advance. From QA research perspective, the IBM's victory in the show is, in fact, an expected natural outcome, more of an engineering scale-up showcase rather than research breakthrough as the basic approach of snippet + NE + asking-point has long been proven.

A retrospect shows that adequate QA systems for factoid questions are invariably combined with a solid Named Entity module and a question parser for identifying asking points. As long as there is an IE-indexed big data behind, with information redundancy as its nature, factoid QA is a very tractable task .

4. State of the art

The year 1999 witnessed the academic community's initial success of the first open-domain QA track as a new frontier of the retrieval world. We also benefited from that event as a winner, having soon secured a venture capital injection of $10 million from the Wall Street. It was an exciting time shortly after AskJeeves' initial success in presenting a natural language interface online (but they did not have the QA technology for handling the huge archive for retrieving exact answers automatically, instead they used human editors behind the scene to update the answers database). A number of QA start-ups were funded. We were all expecting to create a new era in the information revolution. Unfortunately, the good times are not long, the Internet bubble soon burst, and the IT industry fell into the abyss of depression. Investors tightened their monetary operations, the QA heat soon declined to freezing point and almost disappeared from the industry (except for giants' labs such as IBM Watson; in our case, we shifted from QA to mining online brand intelligence for enterprise clients). No one in the mainstream believes in this technology anymore. Compared with traditional keyword indexing and searching, the open domain QA is not as robust and is yet to scale up to really big data for showing its power. The focus of the search industry is shifting from depth back to breadth, focusing on the indexing coverage, including the so-called deep web. As the development of QA systems is almost extinct from the industry, this emerging field stays deeply rooted in the academic community, developed into an important branch, with increasing natural language research from universities and research labs. IBM later solves the scale-up challenge, as a precursor of the current big data architectural breakthrough.

At the same time, scholars begin to summarize the various types of questions that challenge QA. A common classification is based on identifying the type of questions for their asking points. Many of us still remember our high school language classes, where the teacher stressed the 6 WHs for reading comprehension: who / what / when / where / how / why. (Who did what when, where, how and why?) Once answers to these questions are clear , the central stories of an article are in hands. As a simulation of human reading comprehension, the QA system is designed to answer these key WH questions as well. It is worth noting that these WH questions are of different difficulty levels, depending on the types of asking points (one major goal for question parsing is to identify the key need from a question, what we call asking point identification, usually based on question parsing of wh-phrases and other question clues). Those asking points corresponding to an entity as an appropriate answer, such as who / when / where, are relatively easy questions to answer (i.e. factoid questions). Another type of question is not simply answerable by an entity, such as what-is / how / why, there is consensus that answering such questions is a much more challenging task than factors questions. A brief introduction to these three types of "tough" questions and their solutions are presented below as a showcase of the on-going state to conclude this overview of the QA journey.

What/who is X? This type of questions is the so-called definition question, such as What is iPad II? Who is Bill Clinton? This type of question is typically very short, after the wh-word and the stop word "is" are stripped in question parsing, what is left is just a name or a term as input to the QA system. Such an input is detrimental to the traditional keyword retrieval system as it ends up with too many hits from which the system can only pick the documents with the most keyword density or page rank as returns. But from QA perspective, the minimal requirement to answer this question is a definition statement in the forms of "X is a ...". Since any entity or object is in multiple relationships with other entities and involved in various events as described in the corpus, a better answer to the definition question involves a summary of the entity with all the links to its key associated relations and events, giving a profile of the entity. Such technology is in existence, and, in fact, has been partly deployed today. It is called knowledge graph, supported by underlying information extraction and fusion. The state-of-the-art solution for this type of questions is best illustrated in the Google deployment of its knowledge graph in handling queries of a short search for movie stars or other VIP.

The next challenge is how-questions, asking about a solution for solving a problem or doing something, e.g. How can we increase bone density? How to treat a heart attack? This type of question calls for a summary of all types of solutions such as medicine, experts, procedures, or recipe. A simple phrase is usually not a good answer and is bound to miss varieties of possible solutions to satisfy the information need of the users (often product designers, scientists or patent lawyers) who typically are in the stage of prior art research and literature review for a conceived solution in mind. We have developed such a powerful system based on deep parsing and information extraction to answer open-domain how-questions comprehensively in the product called Illumin8, as deployed by Elsevier for quite some years. (Powerful as it is, unfortunately, it did not end up as a commercial success in the market from revenue perspective.)

The third difficult question is why. People ask why-questions to find the cause or motive of a phenomenon, whether an event or an opinion. For example, why people like or dislike our product Xyz? There might be thousands of different reasons behind a sentiment or opinion. Some reasons are explicitly expressed (I love the new iPhone 7 because of its greatly enhanced camera) and more reasons are actually in some implicit expressions (just replaced my iPhone , it sucks in battery life). An adequate QA system should be equipped with the ability to mine the corpus and summarize and rank the key reasons for the user. In the last 5 years, we have developed a customer insight product that can answer why questions behind the public opinions and sentiments for any topics by mining the entire social media space.

Since I came to the Silicon Valley 9 years ago, I have been lucky, with pride, in having had a chance to design and develop QA systems for answering the widely acknowledged challenging questions. Two products for answering the open-domain how questions and why-questions in addition to deep sentiment analysis have been developed and deployed to global customers. Our deep parsing and IE platform is also equipped with the capability to construct deep knowledge graph to help answer definition questions, but unlike Google with its huge platform for the search needs, we have not identified a commercial opportunity to deploy that capability for a market yet.

This piece of writing first appeared in 2011 in my personal blog, with only limited revisions since. Thanks to Google Translate at https://translate.google.com/ for providing a quick basis, which was post-edited by myself.

The Anti-Eliza Effect, New Concept in AI

"Knowledge map and open-domain QA (1)" (in Chinese)

"knowledge map and how-question QA (2)" (in Chinese)

【Ask Jeeves and its million-dollar idea for human interface in 】(in Chinese)

Dr Li’s NLP Blog in English

【立委科普：谷歌NMT，见证奇迹的时刻】

微信最近疯传人工智能新进展：谷歌翻译实现重大突破！值得关注和庆贺。mt 几乎无限量的自然带标数据在新技术下，似乎开始发力。报道说：

十年前，我们发布了 Google Translate（谷歌翻译），这项服务背后的核心算法是基于短语的机器翻译（PBMT:Phrase-Based Machine Translation）。

自那时起，机器智能的快速发展已经给我们的语音识别和图像识别能力带来了巨大的提升，但改进机器翻译仍然是一个高难度的目标。

今天，我们宣布发布谷歌神经机器翻译（GNMT：Google Neural Machine Translation）系统，该系统使用了当前最先进的训练技术，能够实现到目前为止机器翻译质量的最大提升。我们的全部研究结果详情请参阅我们的论文《Google`s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation》。

几年前，我们开始使用循环神经网络（RNN：Recurrent Neural Networks）来直接学习一个输入序列（如一种语言的一个句子）到一个输出序列（另一种语言的同一个句子）的映射。其中基于短语的机器学习（PBMT）将输入句子分解成词和短语，然后在很大程度上对它们进行独立的翻译，而神经机器翻译（NMT）则将输入的整个句子视作翻译的基本单元。

这种方法的优点是：相比于之前的基于短语的翻译系统，这种方法所需的工程设计更少。当其首次被提出时，NMT 在中等规模的公共基准数据集上的准确度，就达到了可与基于短语的翻译系统媲美的程度。

自那以后，研究者已经提出了很多改进 NMT 的技术，其中包括模拟外部对准模型（external alignment model）来处理罕见词，使用注意（attention）来对准输入词和输出词，以及将词分解成更小的单元应对罕见词。尽管有这些进步，但 NMT 的速度和准确度还没能达到成为 Google Translate 这样的生产系统的要求。

我们的新论文描述了怎样克服让 NMT 在非常大型的数据集上工作的许多挑战、如何打造一个在速度和准确度上都足够能为谷歌用户和服务带来更好的翻译体验的系统。

来自对比评估的数据，其中人类评估者对给定源句子的翻译质量进行比较评分。得分范围是 0 到 6，其中 0 表示「完全没有意义的翻译」，6 表示「完美的翻译」。

............

使用人类评估的并排比较作为一项标准，GNMT 系统得出的翻译相比于之前基于短语的生产系统有了极大提升。

在双语人类评估者的帮助下，我们在来自维基百科和新闻网站的样本句子上测定发现：GNMT 在多个主要语言对的翻译中将翻译误差降低了 55%-85% 甚至更多。

今天除了发布这份研究论文之外，我们还宣布将 GNMT 投入到了一个非常困难的语言（汉语-英语）的翻译的生产中。

现在，移动版和网页版的 Google Translate 的汉英翻译已经在 100% 使用 GNMT 机器翻译了——每天大约 1800 万条翻译。GNMT 的生产部署是使用我们公开开放的机器学习工具套件 TensorFlow 和我们的张量处理单元（TPU：Tensor Processing Units），它们为部署这些强大的 GNMT 模型提供了足够的计算算力，同时也满足了 Google Translate 产品的严格的延迟要求。

汉语到英语的翻译是 Google Translate 所支持的超过 10000 种语言对中的一种，在未来几个月，我们还将继续将我们的 GNMT 扩展到远远更多的语言对上。

from 谷歌翻译实现重大突破！

作为老机译，不能不被吸引。准备小试一下这最新版的谷歌神经翻译。
此前试过谷歌在线翻译，总体不如百度，可现如今说汉语mt已经很神经了：深度神经，接近人类。我有几百篇待译正好一试，先试为快。期待谷歌的神译。

董:
@wei 但愿不致让你失望。我曾半开玩笑地说：规则机译是傻子，统计机译是疯子，现在我继续调侃：神经机译是“骗子”（我绝不是指研发者）。语言可不是猫脸或马克杯之类的，仅仅表面像不行，内容也要像！

我：现在是见证奇迹的时刻：

The automatic speech generation of this science blog of mine is attached here, it is amazingly clear and understandable. If you are an NLP student, you can listen to it as a lecture note from a seasoned NLPer (definitely clearer than if I were giving this lecture myself with my strong accent). More amazingly, the original blog was in Chinese and I used the newest Google Translate claimed to be based on deep learning using sentence-based translation as well as character-based techniques. My original blog in Chinese is here, you can compare:【立委科普：自然语言系统架构简说】。

董老师，您是知道我的背景和怀疑的。但是，面对这样的进步，这种远远超出我们当初入行的时候可以想象的极限的自动翻译质量和鲁棒性，我们不得不，不得不，不得不叹服。

董:
用他们的术语叫“less adequate，but more fluent”。机译已经历了三次paradigm shift，当人们发现无论如何它只能是一种很好的信息处理工具，而无法取代人类翻译时，那就选取代价较少的。

我:
无论如何，这个小小的测试让我这个老机译有点瞠目结舌。还没有从这种冲击回过味来。当然，赶巧我测试的是正规文体，谈的又是电脑和NLP话题，肯定在语料库的涵盖范围内，撞上枪眼了。可比起此前我使用的前神经时代的谷歌SMT和百度SMT，这个飞跃式突破还是让人惊异的。向我们的神经同行致敬。这是一批绝顶聪明的疯子。

毛老，这是我对谷歌最近的 claim 的一个反馈。上次对他们的 parser 嘲笑了一通，这次对他们用同样技术带来的MT的突破，表达一下深深的敬佩。这种 contrast 不是我神经了，或分裂了，而是在 parsing，他们苦于没有自然带标数据，巧妇难为无米之炊，因此无法跟符号逻辑派比试。可是 MT 不同，几乎有无穷无尽的自然带标数据（人的翻译从来没有终止过，留下的对照译文浩如烟海）。

毛: @wei 这就是说，你服了基于神经元的MT，改变了自己的见解和主张？

我: 我服了，但没真地改变。

毛: 怎么说？

我:
无论多少门户之见，基本事实总应该看清吧。听一下上面列出的他们的 SMT 翻译，其流利程度和对我原文的忠实程度，已经超过了一个普通的人做翻译。因为一个口译如果不懂我这一行，我如果拿着这个讲稿讲课，让这样的 average interpreter 做现场翻译，他是比不过机器的，无论信还是达。（翻译高手不论。）这一点不得不服。另一方面，我以前说的，神经再深度，我看不出它在今后几年内可以赶上我的深度 parser，这一点还没改变。尤其是应对不同 domains 和 genres 的能力，他们不可能做到。因为在自然世界里面，没有带标的句法树，有的只是线性句子。而迄今看到的突破都是有监督的深度学习（supervised learning），离开海量带标数据（labeled data）就抓瞎了。

毛: 我被你弄糊涂了。你究竟是说哪一派强哪？@wei 究竟谁是世界第0？

我: parsing 我是第0，谷歌赶不上。MT 谷歌有重大突破，估计符号逻辑派的 MT 的日子不好过。

毛: 我问的是，MT谁是第0，不管用什么方法。

我: 这不是说，MT 规则系统就没有活路了，但是总体而言，SMT（statistical MT）占据上风的 trends 还在增强。

云: THKS. 我来试试能不能翻译我写的公司白皮书？

我:
你要是加一点译后人工编辑的话，我估计会很好的。再不要傻傻地从头请人工做翻译了。翻译公司如果不使用 MT 做底，将会被淘汰，成本上看很难存活。

董:
学习上，初二是一个分水岭，学科数量明显增多，学习方法也有所改变，一些学生能及时调整适应变化，进步很快，由成绩中等上升为优秀。但也有一部分学生存在畏难情绪，将心思用在学习之外，成绩迅速下降，对学习失去兴趣，自暴自弃，从此一蹶不振，这样的同学到了初三往往很难有所突破，中考的失利难以避免。
Learning, the second is a watershed, the number of subjects increased significantly, learning methods have also changed, some students can adjust to adapt to changes in progress, progress quickly, from the middle to rise to outstanding. But there are some students there is fear of hard feelings, the mind used in the study, the rapid decline in performance, loss of interest in learning, self-abandonment, since the devastated, so the students often difficult to break through the third day,

毛: 这翻译没什么好呀？

我:
要的就是这句话 🙂 @毛，需要一个对比，才能回答你的问题。

毛: 那就拿你的出来比比？

我: 我早就不做 MT 了，我是逃兵。近20年前，我就转移到信息抽取 IE（以及sentiment）挖掘了，这方面我有底气，不怕比。

刘：转：谷歌新版翻译有多神？英文教授亲证后告诉你...

我: thanks 似乎评论得比较中肯。对于口语现在肯定还不行，它的训练集一直对口语 cover 的很不够。以前测过，有些常用的简单的口语表达也弄错。不知这次这方面加强多少。

董老师上面给的那段谷歌翻译的段落，毛老说翻译不咋样。不过我做过MT多年，知道达到这一步其实是有很大进步的。以前的汉译英的不可读，到现在读起来大体能听得懂。这里面有很多的进步值得留意。

刘: @wei 转一个: 话说大数据干的一些事属于人工智能操练（不能用“研究”这词了）的范畴吗，那本来不就是传统计算机系的人干的？动不动嘲笑每开掉一个语言学家就往前多走几步这眼界太浅了

马: 在数据充足的领域，这几年DL方法都突飞猛进，我认识的好几个以前对DL有偏见的都多少改变了看法。在IR领域DL还基本不见功效，但也在慢慢渗透中。

毛: 不认同“传统计算机系”这个话。计算机系应该跟着实践走，而不是反过来。

董:
NMT的关键是个“像”。于是出了一个有时不懂原文的人以为翻的很顺溜。没了信的翻译，不就成骗子了吗？如何知道自己的翻译是满拧了?这也是NMT的死穴。

马: 董老师，我觉得统计方法都有这个死穴

我:
寸有所长尺有所短，这也不奇怪。我今天自己听这个对我的blog的翻译已经三篇了，一步一叹。NND 咋这么溜呢。找碴子找翻译错总是有的。可是人也有译错啊。从可懂度和流畅程度看，我反正是服了。而这发生在没有亲属关系的两个语言之间。

董:
想当年有的领导干部对我说，“其实机器翻译只有百分之50的正确性，也不要紧，问题是你能不能把那个一半告诉我，我可以找人专翻那部分。”我回答说我做不到。从那时起我一直在关注这个问题。直到如今很多人在叫嚷要取代人工翻译了。这真有点像有了麦当劳就说不要法式大餐了一样。何况机译还做不到麦当劳。计算机、以致机译是上帝给人类玩的，上帝没有给人类那种可以复制自己的本领。

洪:

我的观点很简单：
影子不能三维变。
人若二维非压扁，
自叹弗如影子前。

人工智能影子般，
随人活动数据攒。
深度学习模型建，
类似皮影戏好玩。

董:
是的。我曾对照过10多本英国名著，曾经发现有一本译著明显的是译者故意大段大段地漏译的，那里面有太多的花草等，估计大师懒得查。就不译了。

为什么GNMT首先选择的语言对是汉英，而不是英汉？这是非常精明的。人工翻译即使错了或漏了，译文通常会是顺溜的，至少绝不会像传统的机译那样有傻又疯的，诘屈聱牙的，而这正是NMT的特点，它挑选的是译文中最大相似的。那样一来广大的英语读者，多数不懂中文，就容易被它“唬住”了。

我：
对。仔细看来，这次“突破”是达有余而信不足，矫枉过正了。
但一切才刚开始。我可以理解做NMT的人面对突破的欣喜心情。

洪:
伟爷久玩nlp，
一直孤傲头不低。
今朝服膺叹奇迹，
深度神经已皈依！

我:
皈依还不至于，也不够格。赞佩是由衷的，希望今后有合作的机会，取长补短，达成共赢。人家要是看不上咱呢，咱就单干。deep parsing 是 NLP 的皇冠。神经 parsing 何时全方位超过在下，咱就退休。现在仍然觉得，照这个标准，估计这辈子也退休不了。但愿自己错了，可以提早周游世界。

【相关】

Wei’s Introduction to NLP Architecture

谷歌翻译实现重大突破

谷歌新版翻译有多神？英文教授亲证后告诉你...

【立委科普：NLP 联络图】（姐妹篇）

【立委科普：自然语言系统架构简说】

【立委科普：NLP 联络图】

OVERVIEW OF NATURAL LANGUAGE PROCESSING

《新智元笔记：【Google 年度顶级论文】有感》

【语义计算沙龙：巨头谷歌昨天称句法分析极难，但他们最强】

It is untrue that Google SyntaxNet is the "world’s most accurate parser"

Announcing SyntaxNet: The World’s Most Accurate Parser Goes Open

Is Google SyntaxNet Really the World’s Most Accurate Parser?

机器翻译

【置顶：立委NLP博文一览】

Wei's Introduction to NLP Architecture Translated by Google

Introduction to NLP Architecture
by Dr. Wei Li
(fully automatically translated by Google Translate)

The automatic speech generation of this science blog of mine is attached here, it is amazingly clear and understandable, if you are an NLP student, you can listen to it as a lecture note from a seasoned NLPer (definitely clearer than if I were giving this lecture myself with my strong accent):

To preserve the original translation, nothing is edited below. I will write another blog to post-edit it to make this an "official" NLP architecture introduction to the audiences perused and honored by myself, the original writer. But for time being, it is completely unedited, thanks to the newly launched Google Translate service from Chinese into English at https://translate.google.com/

[Legislature science: natural language system architecture brief]

For the natural language processing (NLP) and its application, the system architecture is the core issue, I blog [the legislature of science: NLP contact diagram] which gave four NLP system architecture diagram, now one by one to be a brief .
I put the NLP system from the core engine to the application, is divided into four stages, corresponding to the four frame diagram. At the bottom of the core is deep parsing, is the natural language of the bottom-up layer of automatic analyzer, this work is the most difficult, but it is the vast majority of NLP system based technology.

160213sg5p2r8ro18v17z8

The purpose of parsing is to structure unstructured languages. The face of the ever-changing language, only structured, and patterns can be easily seized, the information we go to extract semantics to solve. This principle began to be the consensus of (linguistics) when Chomsky proposed the transition from superficial structure to deep structure after the linguistic revolution of 1957. A tree is not only the arcs that express syntactic relationships, but also the nodes of words or phrases that carry various information. Although the importance of the tree, but generally can not directly support the product, it is only the internal expression of the system, as a language analysis and understanding of the carrier and semantic landing for the application of the core support.

160216n8x8jj08qj2y1a8y

The next layer is the extraction layer (extraction), as shown above. Its input is the tree, the output is filled in the content of the templates, similar to fill in the form: is the information needed for the application, pre-defined a table out, so that the extraction system to fill in the blank, the statement related words or phrases caught out Sent to the table in the pre-defined columns (fields) to go. This layer has gone from the original domain-independent parser into the face-to-face, application-oriented and product-demanding tasks.
It is worth emphasizing that the extraction layer is domain-oriented semantic focus, while the previous analysis layer is domain-independent. Therefore, a good framework is to do a very thorough analysis of logic, in order to reduce the burden of extraction. In the depth analysis of the logical semantic structure to do the extraction, a rule is equivalent to the extraction of thousands of surface rules of language. This creates the conditions for the transfer of the domain.
There are two types of extraction, one is the traditional information extraction (IE), the extraction of fact or objective information: the relationship between entities, entities involved in different entities, such as events, can answer who dis what when and where When and where to do what) and the like. This extraction of objective information is the core technology and foundation of the knowledge graph which can not be renewed nowadays. After completion of IE, the next layer of information fusion (IF) can be used to construct the knowledge map. Another type of extraction is about subjective information, public opinion mining is based on this kind of extraction. What I have done over the past five years is this piece of fine line of public opinion to extract (not just praise classification, but also to explore the reasons behind the public opinion to provide the basis for decision-making). This is one of the hardest tasks in NLP, much more difficult than IE in objective information. Extracted information is usually stored in a database. This provides fragmentation information for the underlying excavation layer.
Many people confuse information extraction and text mining, but in fact this is two levels of the task. Extraction is the face of a language tree, from a sentence inside to find the information you want. The mining face is a corpus, or data source as a whole, from the language of the forest inside the excavation of statistical value information. In the information age, the biggest challenge we face is information overload, we have no way to exhaust the information ocean, therefore, must use the computer to dig out the information from the ocean of critical intelligence to meet different applications. Therefore, mining rely on natural statistics, there is no statistics, the information is still out of the chaos of the debris, there is a lot of redundancy, mining can integrate them.

160215hzp5hq5pfd1alldj

Many systems do not dig deep, but simply to express the information needs of the query as an entrance, real-time (real time) to extract the relevant information from the fragmentation of the database, the top n results simply combined, and then provide products and user. This is actually a mining, but is a way to achieve a simple search mining directly support the application.
In fact, in order to do a good job of mining, there are a lot of work to do, not only can improve the quality of existing information. Moreover, in-depth, you can also tap the hidden information, that is not explicitly expressed in the metadata information, such as the causal relationship between information found, or other statistical trends. This type of mining was first done in traditional data mining because the traditional mining was aimed at structural data such as transaction records, making it easy to mine implicit associations (eg, people who buy diapers often buy beer , The original is the father of the new people's usual behavior, such information can be excavated to optimize the display and sale of goods). Nowadays, natural language is also structured to extract fragments of intelligence in the database, of course, can also do implicit association intelligence mining to enhance the value of intelligence.
The fourth architectural diagram is the NLP application layer. In this layer, analysis, extraction, mining out of the various information can support different NLP products and services. From the Q & A system to the dynamic mapping of the knowledge map (Google search search star has been able to see this application), from automatic polling to customer intelligence, from intelligent assistants to automatic digest and so on.

16221285l5wkx8t5ffi8a9

This is my overall understanding of the basic architecture of NLP. Based on nearly 20 years in the industry to do NLP product experience. 18 years ago, I was using a NLP structure diagram to the first venture to flicker, investors themselves told us that this is million dollar slide. Today's explanation is to extend from that map to expand from.
Days unchanged Road is also unchanged.

Where previously mentioned the million-dollar slide story. Clinton said that during the reign of 2000, the United States to a great leap forward in Internet technology, known as. Com bubble, a time of hot money rolling, all kinds of Internet startups are sprang up. In such a situation, the boss decided to hot to find venture capital, told me to achieve our prototype of the language system to do an introduction. I then draw the following three-tier structure of a NLP system diagram, the bottom is the parser, from shallow to deep, the middle is built on parsing based on information extraction, the top of the main categories are several types of applications, including Q & A system. Connection applications and the following two language processing is the database, used to store the results of information extraction, these results can be applied at any time to provide information. This architecture has not changed much since I made it 15 years ago, although the details and icons have been rewritten no less than 100 times. The architecture diagram in this article is about one of the first 20 editions. Off the core engine (background), does not include the application (front). Saying that early in the morning by my boss sent to Wall Street angel investors, by noon to get his reply, said he was very interested. Less than two weeks, we got the first $ 1 million angel investment check. Investors say that this is a million dollar slide, which not only shows the threshold of technology, but also shows the great potential of the technology.

165325a3pamcdcdr3daapw

Pre - Knowledge Mapping: The Structure of Information Extraction Engine

"NLP White Paper: Overview of Our NLP Core Engine"

White Paper of NLP Engine

"Zhaohua afternoon pick up" directory

[Top: Legislative Science Network blog NLP blog at a glance (regularly updated version)]

nmt1

nmt2

nmt3

nmt4

nmt5

nmt6

nmt7

retrieved 10/1/2016 from https://translate.google.com/

translated from http://blog.sciencenet.cn/blog-362400-981742.html

Not an ad. But a historical record.

Although not updated for long, this wiki remains like this until today 9/28/2016
from https://en.wikipedia.org/wiki/NetBase_Solutions,_Inc.

wikinetbase

NetBase Solutions, Inc.

From Wikipedia, the free encyclopedia

(Redirected from NetBase)

NetBase Solutions, Inc.

Type	Private
Industry	Market Research
Founded	2004
Founder	Jonathan Spier and Michael Osofsky
Headquarters	Mountain View, CA, USA
Area served	Worldwide
Key people	Peter Caswell, CEO Mark Bowles, CTO Lisa Joy Rosner, CMO Dr. Wei Li, Chief Scientist
Products	NetBase Insight Workbench
Website	www.netbase.com

NetBase Solutions, Inc. is a Mountain View, CA based developer of natural language processing technology used to analyze social media and other web content. It was founded by two engineers from Ariba in 2004 as Accelovation, before changing names to NetBase in 2008. It has raised a total of $21 million in funding. It's sold primarily on a subscription basis to large companies to conduct market research and social media marketing analytics. NetBase has been used to evaluate the top reasons men wear stubble, the products Kraft should develop and the favorite tech company based on digital conversations.

History

NetBase was founded by Jonathan Spier and Michael Osofsky, both of whom were engineers at Ariba, in 2004 as Accelovation, based on the combination of the words “acceleration” and “innovation.”^[1]^[2] It raised $3 million in funding in 2005, followed by another $4 million in 2007.^[1]^[3] The company changed its name to NetBase in February 2008.^[4]^[5]

It developed its analytics tools in March 2010 and began publishing monthly brand passion indexes (BPI) comparing brands in a market segment using the tool shortly afterwards.^[6] In 2010 it raised $9 million in additional funding and another $2.5 million in debt financing.^[1]^[3] NetBase Insight Workbench was released in March 2011 and a partnership was formed with SAP AG that December for SAP to resell NetBase's software.^[7] In April 2011, a new CEO Peter Caswell was appointed.^[8] Former TIBCO co-inventor, patent author and CTO Mark Bowles is now the CTO at NetBase and held responsible for many technical achievements in scalability.^[9]

Software and services

Screenshot of NetBase Insight Workbench dashboard

NetBase sells a tool called NetBase Insight Workbench that gives market researchers and social marketers a set of analytics, charts and research tools on a subscription basis. ConsumerBase is what the company calls the back-end that collects and analyzes the data. NetBase targets market research firms and social media marketing departments, primarily at large enterprises with a price-point of around $100,000.^[10]^[11] NetBase is also white-labeled by Reed Elsevier in a product called illumin8.^[12]

Uses

For the average NetBase user, 12 months of activity is twenty billion sound bytes from just over seven billion digital documents. The company claims to index 50,000 sentences a minute from sources like public-facing Facebook, blogs, forums, Twitter and consumer review sites.^[13]^[14]

According to a story in InformationWeek, Kraft uses NetBase to measure customer needs and conduct market research for new product ideas.^[15] In 2011 the company released a report based on 18 billion postings over twelve months on the most loved tech companies. Salesforce.com, Cisco Systems and Netflix were among the top three.^[16] Also in 2011, NetBase found that the news of Osama Bin Laden eclipsed the royal wedding and the Japan earthquake in online activity.^[17]

External links

NetBase Corporate Site

References

^ Jump up to:^a ^b ^c By Matt Marshall, VentureBeat. “Accelovation Raises $4M for online software for IT market research.” December 3, 2007.
Jump up^ BusinessWeek profile
^ Jump up to:^a ^b By Jon Xavier, BizJournals. “NetBase filters social media for what clients need to know.” June 3, 2011.
Jump up^ By Barbara Quint, Information Today. “Elsevier and NetBase Launch illumin8.” February 28, 2008.
Jump up^ The Economist. “Improving Innovation.” February 29, 2008.
Jump up^ By Rachael King, BusinessWeek. “Most Loved -- And Hated -- Tech Companies.”
Jump up^ Darrow, Barb (December 12, 2011). "SAP taps NetBase for deep social media analytics". GigaOm. Retrieved May 8, 2012.
Jump up^ San Jose Mercury News. “People on the Move.” May 15, 2011.
Jump up^ By David F. Carr, InformationWeek. “How Much is your Brand Loved (or Hated)?” June 16, 2011.
Jump up^ By Eric Schoenfeld, TechCrunch. "NetBase Offers Powerful Semantic Indexing Platform That Reads The Web." April 22, 2009.
Jump up^ By Jon Xavier, BizJournals. "NetBase filters social media for what clients need to know." June 3, 2011.
Jump up^ By Barbara Quint, Newsbreak. "Elsevier and NetBase Launch illumin8." February 28, 2008.
Jump up^ By Neil Glassman, Social Times. “What Every Social Media Marketer Should Know About NetBase.” August 24, 2010.
Jump up^ By Ryan Flinn, BusinessWeek. “Wanted: Social Media Sifters.” October 21, 2010.
Jump up^ By David F. Carr, InformationWeek. “How Kraft Foods Listens to Social Media.” June 30, 2011.
Jump up^ By Ryan Flinn, Bloomberg. “Tech companies measure online sentiment.” May 19, 2011.
Jump up^ By Geoffrey Fowler and Alexandra Berzon, Wall Street Journal. “Social Media Buzzes, Comes Into Its Own.” May 2, 2011.

Categories:

~~~~~~~~~~~~~~

Leadership

The current CEO Peter Caswell, former Siperian and Advent CEO, was appointed in April, 2011^[16] The company’s marketing advisory board includes current and former executives from Taco Bell, PepsiCo, Yahoo! and Procter & Gamble.^[17] Former TIBCO co-inventor, patent author and CTO Mark Bowles is now the CTO at NetBase and held responsible for many technical achievements in scalability.^[3] Computational linguist Dr. WeiLi leads natural language processing development. NetBase CMO Lisa Joy Rosner, former Vice President of Worldwide Marketing at BroadVision, is an ARF Great Minds award winner.^[15]

[Related]

Who we are. Not an ad, but a snapshot.

Handling Chinese NP predicate in HPSG

【一日一parsing：走火入魔，parser 貌似发疯了】

我:
系统调试也上瘾。今夜无眠，调着调着，parser 貌似发疯了，大概是嫌我啥都喂给它，闹情绪了？？

0927a
仔细瞅瞅，好像也没啥大错，没疯。与鲁爷【狂人日记】不同，我怀疑得没理。

自然语言的任何并列（Conj）结构，到了逻辑层，都必须分列。赶上遇到好几个并列就热闹了，关系有组合爆炸的趋向。都是汉语的顿号惹的祸。用恁多顿号做啥，多写几个小句能死吗？纯句法parsing不管这些，图面倒是显得干净。可是 deep parsing 的语义计算是逻辑的，就不能不管。

白:
“或”的结合能力弱于“与”，顿号在被“或”绑架不成情况下标配解释为“与”。

我:
这几天净出怪，不知是机器走火入魔了，还是玩机器的走火入魔，总之，出来一些奇奇怪怪的 graphs，远远不是教科书上展示的句法树形图给人留下的印象。教科书都是这样的，太过优雅：

parse_tree_1

前两天出了一个葫芦形的图，昨天又出了双伞形的，今天是发飙，明天还不知会咋样。

这是昨天的两把伞。瞅了一瞅，好像也没错：

0926a

白:
吗的位置不对。两把伞那个，能……吗，才是一对。

我:
对，“吗“”应该更上一层楼。如果没有上一层，“吗”疑似就对了。为个小词爬楼不值当了，不是不可以爬 (patching). 当然这里面其实牵涉到决定 yes-no question 的所属问题，最终可能还是要上。

如果说 “电子签证是什么吗。”那就是活用。表面上用疑问，实际是应该是感叹？不是“吗”的标准用法。因为“吗”的本性是一般疑问句，而“什么”是特殊疑问句的疑问词（wh-word），不相谐。

白:
那个是“嘛”，不是“吗”

我:
肯定这里不可以用 “吗” 吗？

白:
他知道电子签证是什么

我:
感觉上可以，好像也不等同于“嘛”。

是那个什么吗。
真地忘了是那个什么了。

白:
你说的感叹义，应该用“嘛”。遗忘义，可以用“吗”
不过现在白字用的，早乱套了。

我:
这是前天的葫芦，白老师的名句。就是“与之”没挂上arg，差强人意，但总体逻辑语义的计算还都对。“你”（S）与“女人”（S）结了婚，而且这事儿修饰的（Mod-S：定语从句）是“女人”。

0925a

你说机器神不神，parser 好玩不好玩，这算不算对人类语言的机器理解的敲门砖：芝麻开门！芝麻芝麻快开门。

【相关】

Chart Parsing Chinese Character Strings

W. Li. 1997. Chart Parsing Chinese Character Strings. In
Proceedings of the Ninth North American Conference on Chinese
Linguistics (NACCL-9). Victoria, Canada.

Chart Parsing Chinese Character Strings [1]

Wei LI

Simon Fraser University
Burnaby B.C. V5A 1S6 CANADA ([email protected])

ABSTRACT

This paper examines problems in word identification for a Chinese natural language processing system and presents our solution to these problems. In conventional systems, written Chinese parsing takes two steps: (1) a segmentation preprocessor for word identification (segmenter); (2) a grammar parsing the string of identified words. Morphological analysis, when required, as in the case of productive word formation, has to be incorporated in the segmenter. This matches the conventional morphology-before-syntax architecture. We will demonstrate the theoretical defect of this architecture when applied to Chinese. This leads to the conclusion that segmentational approach, despite its being the mainstream in Chinese computational morphology, is in general not adequate for the task of Chinese word identification. To solve this problem, a full grammar should be made available. Therefore, we take an alternative one-step approach. We have implemented an integrated grammar of morphology and syntax for directly parsing a string of Chinese characters, building both morphological and syntactic structures. Compared with the conventional two-step approach, our strategy has advantages in resolving ambiguity in word identification and in handling productive word formation.

Introduction

A written Chinese sentence is a string of characters with no blanks to mark word boundaries. In conventional systems, Chinese parsing takes two steps as shown in the following Figure 1: (1) a segmentation preprocessor (called segmenter) for word identification; (2) a word based parsing grammar, building syntactic structures (Feng 1996, Chen & Liu (1992).

hpsg4

In contrast, we take an alternative one-step approach, as shown in Figure 2 below. We have implemented a grammar named W‑CPSG (for Wei's Chinese Phrase Structure Grammar). W‑CPSG integrates morphology and syntax for character based parsing, building both morphological and syntactic structures.

hpsg5

In the two-step architecture, the purpose for the segmenter is to properly identify a string of words to feed syntax. This is not an easy task due to the possible involvement of the segmentation ambiguity. For example, given a string of 4 Chinese characters 研究生命, the segmentation ambiguity is shown in (1.a) and (1.b) below.

(1.) 研究生命

(a) 研究生 | 命
graduate student | life or destiny

(b) 研究 | 生命
study | life

The resolution of the above ambiguity in the segmenter is a hopeless job because such ambiguity is syntactically conditioned. For sentences like 研究生命金贵 (life for graduate students is precious), (1.a) is the right identification. For the phrase 研究生命起源 (to study the origin of life), (1.b) is right. So far there are no segmenters which can handle this properly and guarantee right word segmentation (Feng 1996). In fact, there can never be such segmenters as long as a grammar is not brought in. This is a theoretical defect of all Chinese analysis systems in the conventional architecture. We have solved this problem in our morphology-syntax integrated W‑CPSG. Word identification in our design becomes a by-product of parsing instead of a pre-condition for parsing.

In the text below, Section 2 investigates problems with the conventional two-step approach. In Section 3, we will present W‑CPSG one-step approach and demonstrate how W‑CPSG parsing solves these problems. The following is a list for abbreviations used in this paper.

A (Adjective); AF (Affix); BM (Bound Morpheme);
CLA (Classifier); CLAP (Classifier Phrase);
DE (Chinese particle introducing a modifier of noun); DEP (DE Phrase);
DE3 (Chinese particle introducing a modifier of result or capability);
DET (Determiner); LE (Chinese perfective aspect marker);
N (Noun); NP (Noun Phrase); P (Preposition); PP (Prepositional Phrase);
S (Sentence); V (Verb); VP (Verb Phrase); Vt (Transitive Verb)

Problems Challenging Segmenters

In general, there are two basic problems for segmenters, namely, segmentation ambiguity and productive word formation.

2.1. segmentation ambiguity

This sub-section studies the segmentation ambiguity for Chinese word identification. We indicate that this ambiguity is structural in nature. Therefore it should be captured by structural trees via parsing. We conclude that a parsing grammar is indispensable in the resolution of the segmentation ambiguity.

Behind all segmenters are procedure based segmentation algorithms. Most proposals are some modified versions of large-lexicon based matching algorithms. As an underlying hypothesis, a longer match overrides a shorter match, hence the name maximum match. Decided by the direction of the procedure, i.e. whether the segmentation proceeds from left (the beginning of a string) to right (the end of the string) or from right to left, we have two general types of maximum match: (1) FMM (Forward Maximum Match) algorithm; (2) BMM (Backward Maximum Match) algorithm (Feng 1996).

According to Liang 1987, segmenters have trouble with cases involving the segmentation ambiguity. There are two types of segmentation ambiguity: the cross ambiguity (AB|C vs. A|BC) and the embedded ambiguity (AB vs. A|B).

To detect possible ambiguity, many researchers use the technique of combining the FMM algorithm and the BMM algorithm. When the output of FMM and BMM are different, there must be some ambiguity involved. The following table lists the cases associated with the FMM and BMM combined approach.[2]

hpsg6

The following 3 examples all contain a cross ambiguity sub-string 研究生命 with 2 segmentation possibilities: 研究生|命 and 研究|生命. Example (4.) is a genuinely ambiguous case. Genuinely ambiguous sentences cannot be disambiguated within the sentence boundary, rendering multiple readings.

(2.) case 1: 研究生命金贵。

(a) 研究生 | 命 | 金贵 (FMM: correct)
graduate student | life | precious
Life for graduate students is precious.

(b) * 研究 | 生命 |起源 (BMM: incorrect)
study | life | precious

(3.) case 2: 研究生命起源。

(a) * 研究生 | 命 | 起源 (FMM: incorrect)
graduate-student | life | origin

(b) 研究 | 生命 | 起源 (BMM: correct)
study | life | origin
to study the origin of life

(4.) case 3: 研究生命不好。

(a) 研究生 | 命 | 不 | 好 (FMM: correct)
graduate student | destiny | not | good
The destiny of graduate students is not good.

(b) 研究 | 生命 | 不 | 好 (BMM: correct)
study | life | not | good
It is not good to study life.

The following example is a complicated case of cross ambiguity, involving more than 2 ways of segmentation. Both the FMM segmentation 出现|在世|界 and the BMM segmentation 出|现在|世界 are wrong. A third segmentation 出现|在|世界 is right.

(5.) case 4: 出现在世界东方。

(a) * 出现 | 在世 | 界 | 东方 (FMM: incorrect)
appear | be-alive | BM | east

(b) * 出 | 现在 | 世界 | 东方 (BMM: incorrect)
out | now | world | east

In the following examples (6.) through (8.), ¿¾°×Êí involves embedded ambiguity. As separate words, the verb ¿¾ (bake) and the NP °×Êí (sweet potato) form a VP. As a whole, it is a compound noun ¿¾°×Êí (baked sweet potato). In cases of the embedded ambiguity, FMM and BMM always make the same segmentation, namely AB instead of A|B. It may be the only right choice, as seen in (6.). It may be wrong as shown in (7.). It may only be half right, as in the case of genuine ambiguity shown in (8.).

(6.) case 5: 他吃烤白薯。

(a) 他 | 吃 | 烤白薯 (FMM&BMM: correct)
he | eat | baked sweet potato
He eats baked sweet potatoes.

(b) * 他 | 吃 | 烤 | 白薯 (incorrect)
he | eat | bake | sweet potato

(7.) case 6: 他会烤白薯。

(a) * 他 | 会 | 烤白薯 (FMM&BMM: incorrect)
he | can | baked sweet potato

(b) 他 | 会 | 烤 | 白薯 (correct)
he | can | bake | sweet potato
He can bake sweet potatoes.

(8.) case 7: 他喜欢烤白薯。

(a) 他 | 喜欢 | 烤白薯 (FMM&BMM: correct)
he | like | baked sweet potato
He likes baked sweet potatoes.

(b) 他 | 喜欢 | 烤 | 白薯 (correct)
he | like | bake | sweet potato
He likes baking sweet potatoes.

Compare the above examples, we see that there are severe limitations for the FMM-BMM combined approach. First, it only serves the purpose of ambiguity detection (when the results of FMM and BMM do not match), and contributes nothing to its resolution. It has no way to tell which segmentation is right (compare case 1 and case 2), and, worse still, whether both are right (case 3) or wrong (case 4). Second, even when the results of FMM and BMM do match, it by no means guarantees right segmentation (case 6). Third, as far as detection is concerned, it is only limited to the problems for the cross ambiguity. The existence of the embedded ambiguity defines a blind area for this way of detection (case 6 and case 7). This is because the underlying maximum match hypothesis assumed in the FMM and BMM segmentation algorithms is directly contradictory to the phenomena of the embedded ambiguity.

In face of ambiguity, how do people judge which segmentation is right in the first place? It really depends on whether we can understand the sentence or phrase based on the segmentation. In computational linguistics, this is equivalent to whether the segmented string can be parsed by a grammar. The segmentation ambiguity is one type of structural ambiguity, not in essence different from typical structural ambiguity like, say, PP attachment ambiguity. In fact, PP attachment problem is a counterpart of the cross ambiguity in English syntax, as shown below.

(9.) Cross ambiguity in PP attachment: V NP PP

(a) [V NP] [PP]
(b) [V] [NP PP]

Therefore, like English PP attachment, Chinese word segmentation ambiguity should also be captured by a parsing grammar. A parser resolves the ambiguity if it can, or detects the ambiguity in the form of multiple parses when it cannot. As shall be demonstrated in Section 3, wrong segmentation will not lead to a parse. Right segmentation results in at least one successful parse. In any case, at least a parser (hence a grammar on which the parser is based) is required for proper word identification.

The important thing is that the ambiguity in word identification is a grammatical problem. The attempt to solve this problem without a grammar is bound to be crippled. Since traditional segmentation algorithms are non-grammatical in nature, they are theoretically not equipped for handling such ambiguity. A successive model of segmentater-before-grammar attempts to do what it is not yet able to do. This is the theoretical defect for almost all existing segmentation approaches.

(10.) Conclusion for 2.1.

The segmentation ambiguity in word identification is one type of structural ambiguity. In order to solve this problem, a parsing grammar is indispensable.

2.2. productive word formation

Unless morphological analysis is incorporated, lexicon match based segmenters will have trouble with new words produced by Chinese productive word formation, including reduplication, derivation and the formation of proper names. When the morphology component is incorporated in the segmenter, the two-step design becomes a variant of the conventional morphology-before-syntax architecture. But this architecture is not effective when the segmentation ambiguity is at issue.

In the following, we investigate reduplication, derivation and proper names one by one. In each case, we find that there is always a possible involvement of the segmentation ambiguity. This problem cannot be solved by a morphology component independent of syntax. We therefore propose a grammar incorporating both morphology and syntax.

2.2.1. reduplication

Reduplication in Chinese serves various grammatical and/or lexical functions. Not all reduplications pose challenges to segmentation algorithms. Assume that a word consists of 2 characters AB, reduplication of the type AB --> ABAB is no problem. What becomes a problem for word segmentation is the reduplication of the type AB --> AABB or its variants like AB --> AAB. For example, a two-morpheme verb with verb-object relation at the level of morphology has the following way of reduplication.

(11.) Verb Reduplication: AB --> AAB (for diminutive use)

分心 (get distracted) --> 分分心 (get distracted a bit)

让他分分心。

让 | 他 | 分分心
let | he | get distracted a bit
Let him relax a while.

It seems that reduplication is a simple process which can be handled by incorporating some procedure-based function calls in the segmentation algorithm. If a 3-character string, say 分分心, cannot be found in the lexicon, the reduplication procedure will check whether the first 2 characters are the same, and if yes, delete one of them and consult the lexicon again. But, such expansion of the segmentation algorithm is powerless when the segmentation ambiguity is involved. For example, it is wrong to regard 分分心 as of reduplication in the following sentence.

(12.) 这件事十分分心。

(a) * 这 | 件 | 事 | 十 | 分分心
this | CLA | thing | ten | get distracted a bit

(b) 这 | 件 | 事 | 十分 | 分心
this | CLA | thing | very | distracting
This thing is very distracting.

2.2.2. derivation

In Contemporary Mandarin, there have come to be a few morphemes functioning similarly to English affixes, e.g. 可 (-able) turns a transitive verb into an adjective.

(13.) 可 (-able) + Vt --> A

可 (-able) + 读 (Vt: read) --> 可读 (A:readable)

这本书非常可读。

这 | 本 | 书 | 非常 | 可读
this | CLA | book | very | readable
This book is very readable.

The suffix 性 works just like '-ness', changing an adjective into an abstract noun. The derived noun 可读性 (readability) in the following example, similar to its English counterpart, involves a process of double affixation.

(14.) A + 性 (-ness) --> N
可 (-able) + 读 (Vt: read) --> 可读 (A:readable)
可读 (A:readable) + 性 (-ness) --> 可读性 (N:readability)

这本书的可读性

这 | 本 | 书 | 的 | 可读性
this | CLA | book | DE | readability
this book's readability

The suffix Í· can change a transitive verb into an abstract noun adding to it the meaning "worth-of".

(15.) Vt + 头 (AF:worth of) --> N

吃 (Vt:eat) + 头 (AF:worth of) --> 吃头 (N:worth of eating)

这道菜没有吃头

这 | 道 | 菜 | 没有 | 吃头
this | CLA | dish | not-have | worth-of-eating
This dish is not worth eating.

It is not difficult to incorporate in the segmenter these derivation rules for the morphological analysis. But, as in the case of reduplication, there is always a danger of wrongly applying the rules due to possible ambiguity involved. For example, 吃头 is a sub-string of embedded ambiguity. It can be both a derived noun 'worth of eating' or two separate words as seen in the following example.

(16.) 他饿得能吃头牛。

(a) * 他 | 饿 | 得 | 能 | 吃头· | 牛
he | hungry | DE3 | can | worth-of-eating | ox

(b) 他 | 饿 | 得 | 能 | 吃 | 头 | 牛
he | hungry | DE3 | can | eat | CLA | ox
He is so hungry that he can eat an ox.

2.2.3. proper name

Proper names are of 2 major types: (1) Chinese names; (2) transliterated foreign names. In this paper, we only target the identification of Chinese names and leave the problem of transliterated foreign names for further research (Li, 1997b).

A Chinese human name usually consists of a family name followed by a given name. Chinese family names form a clear-cut closed set. A given name is usually either one character or two characters. For example, the late Chinese chairman 毛泽东 (Mao Zedong) used to have another name 李得胜 (Li Desheng). In the lexicon, 李 is a registered family name. Both 得胜 and 胜 mean 'win'. This may lead to 3 ways of word segmentation: (1) 李得胜; (2) 李|得胜; (3) 李得|胜, as seen in the following examples.

(17.) 李得胜了

(a) 李 | 得胜 | 了.
Li | win | LE
Li won.

(b) 李得 | 胜 | 了
Li De | win | LE
Li De won.

(18.) 李得胜胜了。

(a) * 李 | 得胜 | 胜 | 了.
Li | win | win | LE

(b) * 李得 | 胜 | 胜 | 了
Li De | win | win | LE

Since the given name like µÃÊ¤ is an arbitrary string of 1 or 2 characters, the morphological analysis of the full name should start with family name which can optionally combine with any 1 or 2 characters to form candidate proper names Àî, ÀîµÃ and ÀîµÃÊ¤. In other words, family name serves as the left boundary of a full name and the length is used to determine candidates. The right segmentation can only be made via sentence analysis as shown in the above examples.

Most Chinese place proper names are made of 1 to 3 characters, for example, 武汉市(Wuhu City), 南陵县 (Nanling County). The arbitrariness of these names makes any sub-strings of n characters (0<n<4) in the sentence a suspect. Fortunately, in most cases we may find boundary indicators of these names, like 省 (province), 市 (city), 县 (county), etc. Once the boundary indicator is located, the similar technique in using Chinese family name to identify the given name can be applied to select candidates of place proper names for verification through grammatical analysis.

In general, there is always a possibility of ambiguity involvement in the formation of all types of proper names.

(19.) Conclusion for 2.2.

Due to the possible involvement of ambiguity, a parsing grammar for morphological analysis as well as for sentence analysis is required for the proper identification of the words produced by Chinese productive word formation.

W‑CPSG Grammatical Approach

This section presents W‑CPSG approach to Chinese word identification and morphological analysis. We will demonstrate how a parser based on W‑CPSG solves the problems of the word identification ambiguity and productive word formation.

3.1. rationale of W‑CPSG approach

There have been a number of word identification algorithms based on both morphological and syntactic information (see survey in Feng 1996 and Sun & Huang 1996). Most such approaches do not use a self-contained grammar to parse the complete sentence. They are confined to the conventional two-step process of the segmentation-before-grammar design. As long as the word identification procedure is independent of a parsing grammar, it is extremely difficult to make full use of grammatical information to resolve ambiguity in word identification. Careful tuning up and sophisticated design improves the precision but will not change the theoretical defect of all such approaches. Chen & Liu acknowledges the limitation of their approach due to the lack of a grammar. “However”, they say, “it is almost impossible to apply real world knowledge nor to check the grammatical validity at this stage”. (Chen & Liu 1992, p.105) Why impossible at this stage? Because these segmentation systems are based on the concept of two-step architecture and the grammar is not yet available! As we have demonstrated, the final judgment for proper word identification can hardly be made until the whole sentence is parsed, hence the requirement of a full grammar. Therefore, we are forced to make a compromise in involving how much of grammatical information depending on how much word identification precision we can afford to sacrifice. Needless to say, there is significant double-labor between such a word segmentation procedure and the following stage of parsing. As more and more grammatical information is used to achieve better precision, the overhead of this double labor becomes more serious. We consider the double labor as one strong argument against the two-step approach. If enough grammatical information is incorporated, it is essentially equivalent to a grammar. And the segmenter will be equivalent to a parser. Then why two grammars, one for word identification, and one for sentence parsing? Why not combine them? That is exactly what we are proposing in W‑CPSG - one-step approach based on an integrated grammar, eliminating the necessity of a segmentation preprocessor.

3.2. W‑CPSG character-based parsing

W‑CPSG (Li. 1997a, 1997b) is a lexicalized Chinese unification grammar. The work on W‑CPSG is taken in the spirit of the modern linguistic theory Head-driven Phrase Structure Grammar (Pollard & Sag 1994). W‑CPSG consists of two parts: a minimized general grammar and an enriched lexicon. The general grammar only contains a handful of PS (phrase structure) rules, covering complement structure, modifier structure, conjunctive structure and morphological structure. This is the nature of lexicalized grammars. PS rules in such grammars are very abstract. Essentially, all they say is one thing, that is, 2 signs can combine so long as the lexicon so indicates. The lexicon houses lexical entries with their linguistic description in feature structures. Potential morphological structures as well as potential syntactic structures are lexically encoded. In syntax, a word expects another sign to form a phrase. In morphology, a morpheme expects another sign to form a word. For example, the prefix 可 (-able) expects a transitive verb to form an adjective. The morphological PS rule will build the morphological structure when a transitive verb does appear after the prefix 可 (-able) in the input string.

We now illustrate how W‑CPSG parses a string of Chinese characters by a sample parsing chart. The prototype of W‑CPSG was written in ALE, a grammar compiler developed on top of Prolog by Carpenter & Penn (1994). ALE compiles W‑CPSG into a Chinese parser, a Prolog program ready to accept a string of characters for analysis. W‑CPSG parse tree embodies both morphological analysis and syntactic analysis, as shown below.

hpsg12

This is so-called bottom-up parsing. It starts with lexicon look-up. Edges 1 through 7 are lexical edges. Other edges are phrasal edges. Each edge represents a sign, i.e. a character (morpheme), a word, a phrase or a sentence. Lexical edges result from a successful match between the signs in the input string and the entries in the lexicon during lexicon look-up. For example, 可 (-able), 读 (read) and 性 (-ness) are all registered entries in the lexicon, so they get matched and shown by edge 5, edge 6 and edge 7. Words produced by productive word formation present themselves as phrasal edges, e.g. edge ((5+6)+7) for 可读性 (readability). For the sake of concise illustration, we only show two pieces of information for the signs in the chart, namely category and interpretation with a delimiting colon (lexical edges are only labeled for either category or interpretation). The parser attempts to combine the signs according to PS rules in the grammar until parses are found. A parse is an edge which ranges over the whole string. The parse ((((1+2)+3)+4)+((5+6)+7)) for (20.) represents a binary structural tree based on the W‑CPSG analysis, as shown below.

hpsg13

3.3. ambiguity resolution in word identification

Given the resources of a phrase structure grammar like W‑CPSG, a parser based on standard chart parsing algorithms can handle both the cross ambiguity and the embedded ambiguity provided that a match algorithm based on exhaustive lookup instead of maximum match is adopted for lexicon lookup. All candidate words in the input string are presented to the parser for judgment. Ambiguous segmentation becomes a natural part of parsing: different ways of segmentation add different edges, a successful parse always embodies right identification. In other words, word identification in our design becomes a by-product of parsing instead of a pre-condition for parsing. The following example of the complicated cross ambiguity illustrates how the W‑CPSG parser resolves ambiguity. As seen, both the FMM segmentation (represented by the edge sequence 8-9-5-10) and the BMM segmentation (represented by 1-11-12-10) are in the chart as a result of exhaustive lexicon lookup. They are proved to be wrong because they do not lead to a successful parse according to the grammar. As a by-product, the final parse (8+(3+(12+10))) automatically embodies rightly identified word sequence 8-3-12-10, i.e. 出现 (appear) |在 (at) |世界 (world) |东方 (east).

hpsg10

Exhaustive lookup also makes an embedded ambiguity sub-string like 烤红薯 no longer a blind area for word identification, as shown in (22.) below. All the candidate words in the sub-string including 烤 (bake), 红薯 (sweet potato), 烤红薯 (baked sweet potato) are added to the chart as lexical edges (edge 4, edge 8 and edge 10). This is a case of genuine ambiguity, resulting in 2 parses corresponding to 2 readings. The first parse (1+(7+10)) identifies the word sequence 他|喜欢|烤红薯, and the second parse (1+(9+(4+8))) a different sequence 他|喜欢|烤|红薯. Edge 7 and edge 9 represent two lexical entries for the verb 喜欢 (like), with different syntactic expectation (categorization). One expects an NP object, notated in the chart by like<NP>, and the other expects a VP complement, notated by like<VP>.

hpsg11

We now illustrate how Chinese proper names are identified in W‑CPSG parsing. In the W‑CPSG lexicon, Chinese family name is encoded to optionally expect the given name. Due to the arbitrariness of given names, no other constraint except for the length (either 1 character or 2 characters) is specified in the expectation. Therefore, we have three candidates for proper names in the following example, namely 李 (Li), 李得 (Li De), 李得胜 (Li Desheng), represented respectively by edge 1, edge (1+2) and the NP edge (1+5).[3] The first two candidates contribute to two valid parses while the third does not, hence the identification of the word sequences 李|得胜|了 and 李得|胜|了.

hpsg8

Now we add one more character 胜 (win) to form a new sentence, as shown in (24.) below.

hpsg9

The first two candidate proper names 李 (Li) and 李得 (Li De) no longer lead to parses. But the third candidate 李得胜 (Li Desheng) becomes part of the parse as a subject NP. The parse (((1+6)+4)+5) corresponds to the identification of the only valid word sequence 李得胜|胜|了.

Finally, we give an example to demonstrate how W‑CPSG handles reduplication in parsing and word identification. The sample sentence to be processed by the parser is 让他分分心 (Let him relax a while), involving the AB-->AAB type verb reduplication for diminutive use.

In most lexicons, 分心 (distract-heart: get distracted) is a registered 2-morpheme verb with internal morphological verb-object relation. Therefore, the reduplication is considered morphological. But in Chinese syntax, we also have a general verb reduplication rule of the type A-->AA for diminutive use, for example, 看(look) --> 看看(have a look). This morphological verb reduplication rule AB-->AAB and the syntactic verb reduplication rule A-->AA are essentially the same rule in Chinese grammar. 分心 sits in the gray area between morphology and syntax. It looks both like a word (verb) and a phrase (VP). Lexically, it corresponds to one generalized sense (concept) and the internal combination is idiomatic, i.e. 分 (distract) must combine with 心 (heart) to mean 'get distracted'. But, structurally, the combination of 分 and 心 is not fundamentally different from a VP consisting of Vt and NP, as in the phrase 看电影 (see a film). In fact, there is no clear-cut boundary between Chinese morphology and syntax. This morphology-syntax isomorphic fact serves as a further argument to support the W‑CPSG design of integrating morphology and syntax in one grammar module. Although the boundary between Chinese morphology and syntax is fuzzy, hence no universal definition of basic notions like word and phrase, the division can be easily defined system internally in an integrated grammar. In W‑CPSG, 分心 is treated as a phrase (VP) instead of a word (verb). The lexical entry 分 (distract) is coded to obligatorily expect the literal 心 (heart) as its syntactic object, shown in the following chart by the notation V<心>. This approach has the advantage of eliminating the doubling of the reduplication rule for diminutive use in both syntax and morphology, making the grammar more elegant. The verb reduplication rule is implemented as a lexical rule in W‑CPSG.[4] This lexical rule creates a reduplicated verb with added diminutive sense, shown by edge 8 (a lexical edge). The whole parsing process is illustrated below.

hpsg7

REFERENCES

Carpenter, B. & Penn, G. (1994): ALE, The Attribute Logic Engine, User's Guide, Carnegie Mellon University

Chen, K-J., & S-H. Liu (1992): "Word identification for mandarin Chinese sentences". Proceedings of the 15th International Conference on Computational Linguistics, Nantes, 101-107.

Feng, Z-W. (1996): "COLIPS lecture series - Chinese natural language processing", Communications of COLIPS, Vol.6, No.1 1996, Singapore

Li, W. (1997a): "Outline of an HPSG-style Chinese reversible grammar", Proceedings of The Northwest Linguistics Conference-97 (NWLC-97, forthcoming), UBC, Vancouver, Canada

Li, W. (1997b): W‑CPSG: A Lexicalized Chinese Unification Grammar And Its Application, Doctoral dissertation (on-going), Simon Fraser University, Canada

Liang, N. (1987): "Shumian Hanyu Zidong Fenci Xitong - CDWS" (Automatic word segmentation system for written Chinese - CDWS), Journal of Chinese Information Processing, No.2 1987, pp 44-52, Beijing

Pollard, C. & I. Sag (1994): Head-Driven Phrase Structure Grammar, Centre for the Study of Language and Information, Stanford University, CA

Sun, M-S. & C-N. Huang (1996): "Word segmentation and part of speech tagging for unrestricted Chinese texts" (Tutorial Notes for International Conference on Chinese Computing ICCC'96), Singapore

~~~~~~~~~~~~~~~~~~~

[1] The author benefited from the insightful discussion with Dr. Dekang Lin on the feasibility of parsing Chinese character strings instead of word strings. Thanks also go to Paul McFetridge and Fred Popowich for their supervision and encouragement.

[2] This table is adapted from the following table in Sun & Huang (1996).

case 1	The output of FMM and BMM are different, but both are incorrect	0.054%
case 2	The output of FMM and BMM are different, but only one is correct	9.24%
case 3	The output of FMM and BMM are identical, but incorrect	0.41%
case 4	The output of FMM and BMM are identical, and correct	90.30%

The 4 cases which they listed are not logically exhaustive in terms of sentence based processing (i.e. when discourse is not involved in a system). In particular, there is another case when the output of FMM and BMM are different, and both are correct. We call this a case of genuine cross ambiguity.

[3] Note that there is another S edge (1+5) in the chart. These two edges are structurally different, created via different PS rules. The NP edge (1+5) is formed through the morphological PS rule, combining the family name (edge 1) and its expected given name (edge 5). In the S edge (1+5). however, it is the subject rule (one of the complement PS rules) that decides the combination of the predicate (edge 5) and its expected subject NP (edge 1).

[4] Lexical rules are favored by many linguists to capture redundancy in the lexicon instead of the conventional approach of syntactic transformation. Lexical rules are applied at compile time to form an expanded lexicon before parsing starts.

Notes for An HPSG-style Chinese Reversible Grammar

Outline of an HPSG-style Chinese reversible grammar

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter VII Concluding Remarks

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter II Role of Grammar

2.0. Introduction

This chapter examines the role of grammar in handling the three major types of morpho-syntactic interface problems. This investigation justifies the mono-stratal design of CPSG95 which contains feature structures of both morphology and syntax.

The major observation from this study is: (i) grammatical analysis, including both morphology and syntax, plays the fundamental role in contributing to the solutions of the morpho-syntactic problems; (ii) when grammar alone is not sufficient to reach the final solution, knowledge beyond morphology and syntax may come into play and serve as “filters” based on the grammatical analysis results.[1] Based on this observation, a study in the direction of interleaving morphology and syntax will be pursued in the grammatical analysis. Knowledge beyond morphology and syntax is left to future research.

Section 2.1 investigates the relationship between grammatical analysis and the resolution of segmentation ambiguity. Section 2.2 studies the role of syntax in handling Chinese productive word formation. The borderline cases and their relationship with grammar are explored in 2.3. Section 2.4 examines the relevance of knowledge beyond syntax to segmentation disambiguation. Finally, a summary of the presented arguments and discoveries is given in 2.5.

2.1. Segmentation Ambiguity and Syntax

Segmentation ambiguity is one major problem which challenges the traditional word segmenter or an independent morphology. The following study shows that this ambiguity is structural in nature, not fundamentally different from other structural ambiguity in grammar. It will be demonstrated that sentential structural analysis is the key to this problem.

A huge amount of research effort in the last decade has been made on resolving segmentation ambiguity (e.g. Chen and Liu 1992; Gan 1995; He, Xu and Sun 1991; Liang 1987; Lua 1994; Sproat, Shih, Gale and Chang 1996; Sun and T’sou 1995; Sun and Huang 1996; X. Wang 1989; Wu and Su 1993; Yao, Zhang and Wu 1990; Yeh and Lee 1991; Zhang, Chen and Chen 1991; Guo 1997b). Many (e.g. Sun and Huang 1996; Guo 1997b) agree that this is still an unsolved problem. The major difficulty with most approaches reported in the literature lies in the lack of support from sufficient grammar knowledge. To ultimately solve this problem, grammatical analysis is vital, a point to be elaborated in the subsequent sections.

2.1.1. Resolution of Hidden Ambiguity

The topic of this section is the treatment of hidden ambiguity. The conclusion out of the investigation below is that the structural analysis of the entire input string provides a sound basis for handling this problem.

The following sample sentences illustrate a typical case involving the hidden ambiguity string 烤白薯 kao bai shu.

(2-1.) (a)      他吃烤白薯
ta         | chi | kao-bai-shu
he      | eat | baked-sweet-potato
[S [NP ta] [VP [V chi] [NP kao-bai-shu]]]
He eats the baked sweet potato.

(b) * ta | chi | kao | bai-shu
he | eat | bake | sweet-potato

(2-2.) (a) *    他会烤白薯
ta         | hui | kao-bai-shu.
he      | can | baked-sweet-potato

(b) ta | hui | kao | bai-shu.
he | can | bake | sweet-potato
[S [NP ta] [VP [V hui] [VP [V kao] [NP bai-shu]]]]
He can bake sweet potatoes.

Sentences (2-1) and (2-2) are a minimal pair; the only difference is the choice of the predicate verb, namely chi (eat) versus hui (can, be capable of). But they have very different structures and assume different word identification. This is because verbs like chi expect an NP object but verbs like hui require a VP complement. The two segmentations of the string kao bai shu provide two possibilities, one as an NP kao-bai-shu and the other as a VP kao | bai-shu. When the provided unit matches the expectation, it leads to a successful syntactic analysis, as illustrated by the parse trees in (2‑1a) and (2-2b). When the expectation constraint is not satisfied, as in (2-1b) and (2-2a), the analysis fails. These examples show that all candidate words in the input string should be considered for grammatical analysis. The disambiguation choice can be made via the analysis, as seen in the examples above with the sample parse trees. Correct segmentation results in at least one successful parse.

He, Xu and Sun (1991) indicate that a hidden ambiguity string requires a larger context for disambiguation. But they did not define what the 'larger context' should be. The following discussion attempts to answer this question.

The input string to the parser constitutes a basic context as well as the object for sentential analysis.[2] It will be argued that this input string is the proper context for handling the hidden ambiguity problem. The point to be made is, context smaller than the input string is not reliable for the hidden ambiguity resolution. This point is illustrated by the following examples of the hidden ambiguity string ge ren in (2-3).[3] In each successive case, the context is expanded to form a new input string. As a result, the analysis and the associated interpretation of ‘person’ versus ‘individual’ change accordingly.

(2-3.) input string reading(s)

(a) 人 ren person (or man, human)
[N ren]

(b) 个人 ge ren individual
[N ge-ren]

(d) 人的力量 ren de li liang the human power
[NP [DEP [NP ren] [DE de]] [N li-liang]]

(e) 个人的力量 ge ren de li liang the power of an individual
[NP [DEP [NP ge-ren] [DE de]] [N li-liang]]

(f) 三个人的力量 san ge ren de li liang the power of three persons
[NP [DEP [NP [CLAP [NUM san] [CLA ge]] [N ren]] [DE de]] [N li-liang]]

(g) 他不是个人 ta bu shi ge ren.
(1) He is not a man. (He is a pig.)
[S [NP ta] [VP [ADV bu] [VP [V shi] [NP [CLAP ge] [N ren]]]]]
(2) He is not an individual. (He represents a social organization.)
[S [NP ta] [VP [ADV bu] [VP [V shi] [NP ge-ren]]]]

Comparing (a), (b) with (c), and (d), (e) with (f), one can see the associated change of readings when each successively expanded input string leads to a different grammatical analysis. Accordingly, one segmentation is chosen over the other on the condition that the grammatical analysis of the full string can be established based on the segmentation. In (b), the ambiguous string is all that is input to the parser, therefore the local context becomes full context. It then acquires the lexical reading individual as the other possible segmentation ge | ren does not form a legitimate combination. This reading may be retained, as in (e), or changed to the other reading person, as in (c) and (f), or reduced to one of the possible interpretations, as in (g), when the input string is further lengthened. All these changes depend on the sentential analysis of the entire input string, as shown by the associated structural trees above. It demonstrates that the full context is required for the adequate treatment of the hidden ambiguity phenomena. Full context here refers to the entire input string to the parser.

It is necessary to explain some of the analyses as shown in the sample parses above. In Contemporary Mandarin, a numeral cannot combine with a noun without a classifier in between.[4] Therefore, the segmentation san (three) | ge-ren (individual) is excluded in (c) and (f), and the correct segmentation san (three) | ge (CLA) | ren (person) leads to the NP analysis. In general, a classifier alone cannot combine with the following noun either, hence the interpretation of ge ren as one word ge-ren (individual) in (b) and (e). A classifier usually combines with a preceding numeral or determiner before it can combine with the noun. But things are more complicated. In fact, the Chinese numeral yi (one) can be omitted when the NP is in object position. In other words, the classifier alone can combine with a noun in a very restricted syntactic environment. That explains the two readings in (g).[5]

The following is a summary of the arguments presented above. These arguments have been shown to account for the hidden ambiguity phenomena. The next section will further demonstrate the validity of these arguments for overlapping ambiguity as well.

(2-4.) Conclusion
The grammatical analysis of the entire input string is required for the adequate treatment of the hidden ambiguity problem in word identification.

2.1.2. Resolution of Overlapping Ambiguity

This section investigates overlapping ambiguity and its resolution. A previous influential theory is examined, which claims that the overlapping ambiguity string can be locally disambiguated. However, this theory is found to be unable to account for a significant amount of data. The conclusion is that both overlapping ambiguity and hidden ambiguity require a context of the entire input string and a grammar for disambiguation.

For overlapping ambiguity, comparing different critical tokenizations will be able to detect it, but such a technique cannot guarantee a correct choice without introducing other knowledge. Guo (1997) pointed out:

As all critical tokenizations hold the property of minimal elements on the word string cover relationship, the existence of critical ambiguity in tokenization implies that the “most powerful and commonly used” (Chen and Liu 1992, page 104) principle of maximum tokenization would not be effective in resolving critical ambiguity in tokenization and implies that other means such as statistical inferencing or grammatical reasoning have to be introduced.

However, He, Xu and Sun (1991) claim that overlapping ambiguity can be resolved within the local context of the ambiguous string. They classify the overlapping ambiguity string into nine types. The classification is based on the categories of the assumably correctly segmented words in the ambiguous strings, described below.

Suppose there is an overlapping ambiguous string consisting of ABC; both AB and BC are entries listed in the lexicon. There are two possible cases. In case one, the category of A and the category of BC define the classification of the ambiguous string. This is the case when the segmentation A|BC is considered correct. For example, in the ambiguous string 白天鹅 bai tian e, the word AB is bai-tian (day-time) and the word BC is tian-e (swan). The correct segmentation for this string is assumed to be A|BC, i.e. bai (A: white) | tian-e (N: swan) (in fact, this cannot be taken for granted as shall be shown shortly), therefore, it belongs to the A-N type. In case two, i.e. when the segmentation AB|C is considered correct, the category of AB and the category C define the classification of the ambiguous string. For example, in the ambiguous string 需求和 xu qiu he, the word AB is xu-qiu (requirement) and the word BC qiu-he (sue for peace). The correct segmentation for this string is AB|C, i.e. xu-qiu (N: requirement) | he (CONJ: and) (again, this should not be taken for granted), therefore, it belongs to the N-CONJ type.

After classifying the overlapping ambiguous strings into one of nine types, using the two different cases described above, they claim to have discovered a rule.[6] That is, the category of the correctly segmented word BC in case one (or AB in case two) is predictable from AB (or BC in case two) within the local ambiguous string. For example, the category of tian-e (swan) in bai | tian-e (white swan) is a noun. This information is predictable from bai tian within the respective local string bai tian e. The idea is, if ever an overlapping ambiguity string is formed of bai tian and C, the judgment of bai | tian-C as the correct segmentation entails that the word tian-C must be a noun. Otherwise, the segmentation A|BC is wrong and the other segmentation AB|C is right. For illustration, it is noted that tian-shi (angel) in the ambiguous string 白天使 bai | tian-shi (white angel) is, as expected, a noun. This predictability of the category information from within the local overlapping ambiguous string is seen as an important discovery (Feng 1996). Based on this assumed feature of the overlapping ambiguous strings, He, Xu and Sun (1991) developed their theory that an overlapping ambiguity string can be disambiguated within the local string itself.

The proposed disambiguation process within the overlapping ambiguous string proceeds as follows. In order to correctly segment an overlapping ambiguous string, say, bai tian e or bai tian shi, the following information needs to be given under the entry bai-tian (day-time) in the tokenization lexicon: (i) an ambiguity label, to indicate the necessity to call a disambiguation rule; (ii) the ambiguity type A-N, to indicate that it should call the rule corresponding to this type. Then the following disambiguation rule can be formulated.

(2-5.) A-N type rule       (He, Xu and Sun 1991)
In the overlapping ambiguous string A(1)...A(i) B(1)...B(j) C(1)...C(k),
if        B(1)...B(j) and C(1)...C(k) form a noun,
then the correct segmentation is A(1)...A(i) | B(1)...B(j)-C(1)...C(k),
else    the correct segmentation is A(1)...A(i)-B(1)...B(j) | C(1)...C(k).

This way, bai tian e and bai tian shi will always be segmented as bai (white) | tian-e (swan) and bai (white) | tian-shi (angel) instead of bai-tian (daytime) | e (goose) and bai-tian (daytime) | shi (make). This can be easily accommodated in a segmentation algorithm provided the above information is added to the lexicon and the disambiguation rules are implemented. The whole procedure is running within the local context of the overlapping ambiguous string and uses only lexical information. So they also name the overlapping ambiguity disambiguation morphology-based disambiguation, with no need to consult syntax, semantics or discourse.

Feng (1996) emphasizes that He, Xu and Sun's view on the overlapping ambiguous string constitutes a valuable contribution to the theory of Chinese word identification. Indeed, this overlapping ambiguous string theory, if it were right, would be a breakthrough in this field. It in effect suggests that the majority of the segmentation ambiguity is resolvable without and before a grammar module. A handful of simple rules, like the A-N type rule formulated above, plus a lexicon would solve most ambiguity problems in word identification.[7]

Feng (1996) provides examples for all the nine types of overlapping ambiguous strings as evidence to support He, Xu and Sun (1991)'s theory. In the case of the A-N type ambiguous string bai tian e, the correct segmentation is supposed to be bai | tian-e in this theory. However, even with his own cited example, Feng ignores a perfect second reading (parse) when the time NP bai-tian (daytime) directly acts as a modifier for the sentence with no need for a preposition, as shown in (2‑6b) below.

(2-6.) 白天鹅游过来了
bai tian e you guo lai le (Feng 1996)

In addition, one only needs to add a preposition zai (in) to the beginning of the sentence to make the abandoned segmentation bai-tian | e the only right one in the changed context. The presumably correct segmentation, namely bai | tian-e, now turns out to be wrong, as shown in (2-7a) below.

(2-7.) 在白天鹅游过来了
zai bai tian e you guo lai le

The above counter-example is by no means accidental. In fact, for each cited ambiguous string in the examples given by Feng, there exist counter-examples. It is not difficult to construct a different context where the preferred segmentation within the local string, i.e. the segmentation chosen according to one of the rules, is proven to be wrong.[8] In the pairs of sample sentences (2‑8) through (2-10), (a) is an example which Feng (1996) cited to support the view that the local ambiguous string itself is enough for disambiguation. Sentences in (b) are counter-examples to this theory. It is a notable fact that the listed local string is often properly contained in a more complicated ambiguous string in an expanded context, seen in (2-9b) and (2-10b). Therefore, even when the abandoned segmentation can never be linguistically correct in any context, as shown for tu-xing (graph) | shi (BM) in (2-9) where a bound morpheme still exists after the segmentation, it does not entail the correctness of the other segmentation in all contexts. These data show that all possible segmentations should be retained for the grammatical analysis to judge.

(2-8.) V-N type of overlapping ambiguous string

研究生命
yan jiu sheng ming:
yan-jiu (V:study) | sheng-ming (N:life)
yang-jiu-sheng (N:graduate student) | ming (life/destiny)

(a)      研究生命的本质
yan-jiu    sheng-ming de      ben-zhi
          study          life               DE     essence
Study the essence of life.

(b)      研究生命金贵
yan-jiu-sheng      ming jin-gui
          graduate-student life     precious
Life for graduate students is precious.

(2-9.) CONJ-N type of overlapping ambiguous string
和平等 he ping deng:
he (CONJ:and) | ping-deng (N:equality)
he-ping (N:peace) | deng (V:wait)?

(a)      独立自主和平等互利的原则
du-li-zi-zhu           he      ping-deng-hu-li               de      yuan-ze
          independence       and    equal-reciprocal-benefit DE     principle
the principle of independence and equal reciprocal benefit

(b) 和平等于胜利 he-ping deng-yu sheng-li
peace equal victory
Peace is equal to victory.

(2-10.) V-P type of overlapping ambiguous string
看中和 kan zhong he:
kan-zhong (V:target) | he (P:with)
kan (V:see) | zhong-he (V:neutralize)

(a)      他们看中和日本人生意的机会
ta-men    kan-zhong   he      ri-ben          ren              zuo     sheng-yi      de      ji-hui
they target with Japan person do   business     DE opportunity
They have targeted the opportunity to do business with the Japanese.

(b)      这要看中和作用的效果
zhe          yao    kan    zhong-he-zuo-yong                   de          xiao-guo
this    need see     neutralization                DE     effect
This will depend on the effects of the neutralization.

The data in (b) above directly contradict the claim that an overlapping ambiguous string can be disambiguated within the local string itself. While this approach is shown to be inappropriate in practice, the following comment attempts to reveal its theoretical motivation.

As reviewed in the previous text, He, Xu and Sun (1991)'s overlapping ambiguity theory is established on the classification of the overlapping ambiguous strings. A careful examination of their proposed nine types of the overlapping ambiguous strings reveals an underlying assumption on which the classification is based. That is, the correctly segmented words within the overlapping ambiguous string will automatically remain correct in a sentence containing the local string. This is in general untrue, as shown by the counter-examples above.[9] The following analysis reveals why.

Within the local context of the overlapping ambiguous string, the chosen segmentation often leads to a syntactically legitimate structure while the abandoned segmentation does not. For example, bai (white) | tian-e (swan) combines into a valid syntactic unit while there is no structure which can span bai-tian (daytime) | e (goose). For another example, yan-jiu (study) | sheng-ming (life) can be combined into a legitimate verb phrase [VP [V yan-jiu] [NP sheng-ming]], but yan-jiu-sheng (graduate student) | ming (life/destiny) cannot. But that legitimacy only stands locally within the boundary of the ambiguous string. It does not necessarily hold true in a larger context containing the string. As shown previously in (2-7a), the locally legitimate structure bai | tian-e (white swan) does not lead to a successful parse for the sentence. In contrast, the locally abandoned segmentation bai-tian (daytime) | e (goose) has turned out to be right with the parse in (2-7b). Therefore, the full context instead of the local context of the ambiguous string is required for the final judgment on which segmentation can be safely abandoned. Context smaller than the entire input string is not reliable for the overlapping ambiguity resolution. Note that exactly the same conclusion has been reached for the hidden ambiguous strings in the previous section.

The following data in (2-11) further illustrate the point of the full context requirement for the overlapping ambiguity resolution, similar to what has been presented for the hidden ambiguity phenomena in (2-3). In each successive case, the context is expanded to form a new input string. As a result, the interpretation of ‘goose’ versus ‘swan’ changes accordingly.

(2-11.) input string reading(s)

(a) 鹅 e goose
[N e]

(b) 天鹅 tian e swan
[N tian-e]

(d) 鹅游过来了 e you guo lai le.
The geese swam over here.
[S [NP e] [VP you guo-lai le]]

(e) 天鹅游过来了 tian e you guo lai le.
The swans swam over here.
[S [NP tian-e] [VP you guo-lai le]]

(f) 白天鹅游过来了 bai tian e you guo lai le.
(i) The white swan swam over here.
[S [NP bai tian-e] [VP you guo-lai le]]
(ii) In the daytime, the geese swam over here.
S [NP+mod bai-tian] [S [NP e] [VP you guo-lai le]]]

(g) 在白天鹅游过来了 zai bai tian e you guo lai le.
In the daytime, the geese swam over here.
[S [PP zai bai-tian] [S [NP e] [VP you guo-lai le]]]

(h) 三只白天鹅游过来了 san zhi bai tian e you guo lai le.
Three white swans swam over here.
[S [NP san zhi bai tian-e] [VP you guo-lai le]]

It is interesting to compare (c) with (f), (g) and (h) to see their associated change of readings based on different ways of segmentation. In (c), the overlapping ambiguous string is all that is input to the parser, therefore the local context becomes full context. It then acquires the reading white swan corresponding to the segmentation bai | tian-e. This reading may be retained, or changed, or reduced to one of the possible interpretations when the input string is lengthened. That is respectively the case in (h), (g) and (f). All these changes depend on the grammatical analysis of the entire input string. It shows that the full context and a grammar are required for the resolution of most ambiguities; and when sentential analysis cannot disambiguate - in cases of ‘genuine’ segmentation ambiguity like (f), the structural analysis can make explicit the ambiguity in the form of multiple parses (readings).

In the light of the inquiry in this section, the theoretical significance of the distinction between overlapping ambiguity and hidden ambiguity seems to have diminished.[10] They are both structural in nature. They both require full context and a grammar for proper treatment.

(2-12.) Conclusion

(i) It is not necessarily true that an overlapping ambiguous string can be disambiguated within the local string.

(ii) The grammatical analysis of the entire input string is required for the adequate treatment of the overlapping ambiguity problem as well as the hidden ambiguity problem.

2.2. Productive Word Formation and Syntax

This section examines the connection of productive word formation and segmentation ambiguity. The observation is that there is always a possible involvement of ambiguity with each type of word formation. The point to be made is that no independent morphology systems can resolve this ambiguity when syntax is unavailable. This is because words formed via morphology, just like words looked up from lexicon, only provide syntactic ‘candidate’ constituents for the sentential analysis. The choice is decided by the structural analysis of the entire sentence.

Derivation is a major type of productive word formation in Chinese. Section 1.2.2 has given an example of the involvement of hidden ambiguity in derivation, repeated below.

(2-13.) 这道菜没有吃头 zhe dao cai mei you chi tou.

(b) ? zhe | dao | cai | mei-you | chi | tou
this | CLA | dish | not have | eat | head
[S [NP zhe dao cai] [VP [ADV mei-you] [VP [V chi] [NP tou]]]]
This dish did not eat the head.

(2-14.) 他饿得能吃头牛 ta e de neng chi tou niu.

(a) * ta | e | de | neng | chi-tou | niu
he | hungry | DE3 | can | worth-of-eating | ox

(b) ta | e | de | neng | chi | tou | niu
he | hungry | DE3 | can | eat | CLA | ox
[…[VP [V e] [DE3P [DE3 de] [VP [V neng] [VP [V chi] [NP tou niu]]]]]]
He is so hungry that he can eat an ox.

Some derivation rule like the one in (2-15) is responsible for combining the transitive verb stem and the suffix –tou (worth-of) into a derived noun for (2-13a) and (2-14a).

(2-15.) X (transitive verb) + tou --> X-tou (noun, semantics: worth-of-X)

However, when syntax is not available, there is always a danger of wrongly applying this morphological rule due to possible ambiguity involved, as shown in (2-14a). In other words, morphological rules only provide candidate words; they cannot make the decision whether these words are legitimate in the context.

Reduplication is another method for productive word formation in Chinese. An outstanding problem is the AB --> AABB reduplication or AB --> AAB reduplication if AB is a listed word. In these cases, some reduplication rules or procedures need to be involved to recognize AABB or AAB. If reduplication is a simple process confined to a local small context, it may be possible to handle it by incorporating some procedure-based function calls during the lexical lookup. For example, when a three-character string, say 分分心 fen fen xin, cannot be found in the lexicon, the reduplication function will check whether the first two characters are the same, and if yes, delete one of them and consult the lexicon again. This method is expected to handle the AAB type reduplication, e.g. fen-xin (divide-heart: distract) --> fen-fen-xin (distract a bit).

But, segmentation ambiguity can be involved in reduplication as well. Compare the following examples in (2-16) and (2-17) containing the sub-string fen fen xin, the first is ambiguity free but the second is ambiguous. In fact, (2‑17) involves an overlapping ambiguous string shi fen fen xin: shi (ten) | fen-fen-xin (distract a bit) and shi-fen (very) | fen-xin (distract). Based on the conclusion presented in 2.1, it requires grammatical analysis to resolve the segmentation ambiguity. This is illustrated in (2‑17).

(2-16.) 让他分分心

rang | ta | fen-fen-xin
let | he | distracted-a-bit
Let him relax a while.

(2-17.) 这件事十分分心

zhe jian shi shi fen fen xin.

(a) * zhe | jian | shi | shi | fen-fen-xin
this | CLA | thing | ten | distracted a bit

Finally, there is also possible ambiguity involvement in the proper name formation. Proper names for persons, locations, etc. that are not listed in the lexicon are recognized as another major problem in word identification (Sun and Huang 1996).[11] This problem is complicated when ambiguity is involved.

For example, a Chinese person name usually consists of a family name followed by a given name of one or two characters. For example, the late Chinese chairman mao-ze-dong (Mao Zedong) used to have another name li-de-sheng (Li Desheng). In the lexicon, li is a listed family name. Both de-sheng and sheng mean ‘win’. This may lead to three ways of word segmentation, a complicated case involving both overlapping ambiguity and hidden ambiguity: (i) li | de-sheng; (ii) li-de | sheng; (iii) li-de-sheng, as shown in (2-18) below.

(2-18.) 李得胜了 li de sheng le.

(a) li | de-sheng | le
Li | win | LE
[S [NP li] [VP de-sheng le]]
Li won.

(b) li-de | sheng | le
Li De | win | LE
[S [NP li de] [VP sheng le]]
Li De won.

For this particular type of compounding, the family name serves as the left boundary of a potential compound name of person and the length can be used to determine candidates.[12] Again, the choice is decided by the grammatical analysis of the entire sentence, as illustrated in (2-18).

(2-19.) Conclusion

Due to the possible ambiguity involvement in productive word formation, a grammar containing both morphology and syntax is required for an adequate treatment. An independent morphology system or separate word segmenter cannot solve ambiguity problems.

2.3. Borderline Cases and Grammar

This section reviews some outstanding morpho-syntactic borderline phenomena. The points to be made are: (i) each proposed morphological or syntactic analysis should be justified in terms of capturing the linguistic generality; (ii) the design of a grammar should facilitate the access to the knowledge from both morphology and syntax in analysis.

The nature of the borderline phenomena calls for the coordination of morphology and syntax in a grammar. The phenomena of Chinese separable verbs are one typical example. The co-existence of their contiguous use and separate use leads to the confusion whether they belong to the lexicon and morphology, or whether they are syntactic phenomena. In fact, as will be discussed in Chapter V, there are different degrees of ‘separability’ for different types of Chinese separable verbs; there is no uniform analysis which can handle all separable verbs properly. Different types of separable verbs may justify different approaches to the problems. In terms of capturing linguistic generality, a good analysis should account for the demonstrated variety of separated uses and link the separated use and the contiguous use.

‘Quasi-affixation’ is another outstanding interface problem. This problem requires careful morpho-syntactic coordination. As presented in Chapter I, structurally, ‘quasi-affixes’ and ‘true’ affixes demonstrate very similar word formation potential, but ‘quasi-affixes’ often retain some ‘solid’ meaning while the meaning of ‘true’ affixes are functionalized. Therefore, how to coordinate the semantic contribution of the derived words via ‘quasi-affixation’ in the context of the building of the semantics for the entire sentence is the key. This coordination requires flexible information flow between data structures for morphology, syntax and semantics during the morpho-syntactic analysis.

In short, the proper treatment of the morpho-syntactic borderline phenomena requires inquiry into each individual problem in order to reach a morphological or syntactic analysis which maximally captures linguistic generality. It also calls for the design of a grammar where information between morphology and syntax can be effectively coordinated.

2.4. Knowledge beyond Syntax

This section examines the roles of knowledge beyond syntax in the resolution of segmentation ambiguity. Despite the fact that further information beyond syntax may be necessary for a thorough solution to segmentation ambiguity,[13] it will be argued that syntax is the appropriate place for initiating this process due to the structural nature of segmentation ambiguity.

Depending on which type of information is essential for the disambiguation, disambiguation can be classified as structure-oriented, semantics-oriented and pragmatics-oriented. This classification hierarchy is modified from that in He, Xu and Sun (1991). They have classified the hidden ambiguity disambiguation into three categories: syntax-based, semantics-based and pragmatics-based. Together with the morphology-based disambiguation which is equivalent to the overlapping ambiguity resolution in their theory, they have built a hierarchy from morphology up to pragmatics.

A note on the technical details is called for here. The term X‑oriented (where X is either syntax, semantics or pragmatics) is selected here instead of X-based in order to avoid the potential misunderstanding that X is the basis for the relevant disambiguation. It will be shown that while information from X is required for the ambiguity resolution, the basis is always syntax.

Based on the study in 2.1, it is believed that there is no morphology-based (or morphology-oriented) disambiguation independent of syntax. This is because the context of morphology is a local context, too small for resolving structural ambiguity. There is little doubt that the morphological analysis is a necessary part of word identification in terms of handling productive word formation. But this analysis cannot by itself resolve ambiguity, as argued in 2.2. The notion 'structure' in structure-oriented disambiguation includes both syntax and morphology.

He, Xu and Sun (1991) exclude the overlapping ambiguity resolution in the classification beyond morphology. This exclusion is found to be not appropriate. In fact, both the resolution of hidden ambiguity and overlapping ambiguity can be classified into this hierarchy. In order to illustrate this point, for each such class, I will give examples from both hidden ambiguity and overlapping ambiguity.

Sentences in (2-20) and (2-21) which contain the hidden ambiguity string 阵风zhen feng are examples for the structure-oriented disambiguation. This type of disambiguation relying on a grammar constitutes the bulk of the disambiguation task required for word identification.

(2-20.) 一阵风吹过来了
yi zhen feng chui guo lai le. (Feng 1996)

(2-21.) 阵风会很快来临 zhen feng hui hen kuai lai lin.

Compare (2-20a) where the ambiguity string is identified as two words zhen (CLA) feng (wind) and (2-21a) where the string is taken as one word zhen-feng (gusts-of-wind). Chinese syntax defines that a numeral cannot directly combine with a noun, neither can a classifier alone when it is in non-object position. The numeral and the classifier must combine together before they can combine with a noun. So (2-20b) and (2‑21b) are both ruled out while (2-20a) and (2-21a) are structurally well-formed.

For the structure-oriented overlapping ambiguity resolution, numerous examples have been cited before, and one typical example is repeated below.

(2-22.) 研究生命金贵 yan jiu sheng ming jin gui

(a) yan-jiu-sheng | ming | jin-gui
graduate student | life | precious
[S [NP yan-jiu-sheng] [S [NP ming] [AP jin-gui]]]
Life for graduate students is precious.

(b) * yan-jiu | sheng-ming | jin-gui
study | life | precious

As a predicate, the adjective jin-gui (precious) syntactically expects an NP as its subject, which is saturated by the second NP ming (life) in (2-22a). The first NP serves as a topic of the sentence and is semantically linked to the subject ming (life) as its possessive entity.[14] But there is no parse for (2-22b) despite the fact that the sub-string yan-jiu sheng-ming (to study life) forms a verb phrase [VP [V yan-jiu] [NP sheng-ming]] and the sub-string sheng-ming jin-gui (life is precious) forms a sentence [S [NP sheng-ming] [AP jin-gui]]. On one hand, the VP in the subject position does not satisfy the syntactic constraint (the category NP) expected by the adjective jin-gui (precious) - although other adjectives, say zhong-yao 'important', may expect a VP subject. On the other hand, the transitive verb yan-jiu (study) expects an NP object. It cannot take an S object (embedded object clause) as do other verbs, say ren-wei (think).

The resolution of the following hidden ambiguity belongs to the semantics-oriented disambiguation.

(2-23.) 请把手抬高一点儿 qing ba shou tai gao yi dian er (Feng 1996)

This is an interesting example. The same character qing is both an adverb ‘please’ and a verb ‘invite’. (2-23b2) is syntactically valid, but violates the semantic constraint or semantic selection restriction. The logical object of qing (invite) should be human but ba-shou (handle) is not human. The two syntactically valid parses (2-23a1) and (2-23b2), which correspond to two ways of segmentation, are expected to be somehow disambiguated on the above semantic grounds.

The following case is an example of semantics-oriented resolution of the overlapping ambiguity.

(2-24.) 茶点心吃了 cha dian xin chi le.

(b1) ? cha-dian | xin | chi | le
tea dim sum | heart | eat | LE
[S [NP+object cha-dian] [S [NP+agent xin] [VP chi le]]]
The tea dim sum, the heart ate.

(b2) ? cha-dian | xin | chi | le
tea dim sum | heart | eat | LE
[S [NP+agent cha-dian] [VP [NP+object xin] [VP chi le]]]
The tea dim sum ate the heart.

Most Chinese dictionaries contain the listed compound noun cha-dian (tea-dim-sum), but not cha dian-xin which stands for the same thing, namely the snacks served with the tea. As shown above, there are four analyses for one segmentation and two analyses for the other segmentation. These are all syntactically legitimate, corresponding to six different readings. But there is only one analysis which makes sense, namely the implicit passive construction with the compound noun cha dian-xin as the preceding (logical) object in (a1). All the other five analyses are nonsense and can be disambiguated if the semantic selection restriction that animate being eats (i.e. chi) food is enforced. Syntactically, (a2) is an active construction with the optional object omitted. The constructions for (a3) and (b1) are of long distance dependency where the object is topicalized and placed at the beginning. The SOV (Subject Object Verb) pattern for (a4) and (b2) is a very restrictive construction in Chinese.[15]

The pragmatics-oriented disambiguation is required for the case where ambiguity remains after the application of both structural and semantic constraints.[16] The sentences containing this type of ambiguity are genuinely ambiguous within the sentence boundary, as shown with the multiple parses in (2-25) for the hidden ambiguity and (2-26) for the overlapping ambiguity below.

(2-25.) 他喜欢烤白薯 ta xi huan kao bai shu.

(b) ta | xi-huan | kao-bai-shu.
he | like | baked-sweet-potato
[S [NP ta] [VP [V xi-huan] [NP kao-bai-shu]]]
He likes the baked sweet potatoes.

(2-26.) 研究生命不好 yan jiu sheng ming bu hao

An important distinction should be made among these classes of disambiguation. Some ambiguity must be solved in order to get a reading during analysis. Other ambiguity can be retained in the form of multiple parses, corresponding to multiple readings. In either case, it demonstrates that at least a grammar (syntax and morphology) is required. The structure-oriented ambiguity belongs to the former, and can be handled by the appropriate structural analysis. The semantics-oriented ambiguity and the pragmatics-oriented ambiguity belong to the latter, so multiple parses are a way out. The examples for different classes of ambiguity show that the structural analysis is the foundation for handling ambiguity problems in word identification. It provides possible structures for the semantic constraints or pragmatic constraints to work on.

In fact, the resolution of segmentation ambiguity in Chinese word identification is but a special case of the resolution of structural ambiguity for NLP in general. As a matter of fact, the grammatical analysis has been routinely used to resolve, and/or prepare the basis for resolving, the structural ambiguity like the PP attachment.[17]

2.5. Summary

The most important discovery in the field of Chinese word identification presented in this chapter is that the resolution of both types of segmentation ambiguity involves the analysis of the entire input string. This means that the availability of a grammar is the key to the solution of this problem.

This chapter has also examined the ambiguity involvement in productive word formation and reached the following conclusion. A grammar for morphological analysis as well as for sentential analysis is required for an adequate treatment of this problem. This establishes the foundation for the general design of CPSG95 as consisting of morphology and syntax in one grammar formalism. [18]

The study of the morpho-syntactic borderline problems shows that the sophisticated design of a grammar is called for so that information between morphology and syntax can be effectively coordinated. This is the work to be presented in Chapter III and Chapter IV. It also demonstrates that each individual borderline problem should be studied carefully in order to reach a morphological or syntactic analysis which maximally captures linguistic generality. This study will be pursued in Chapter V and Chapter VI.

----------------------------------------------------------

[1] Constraints beyond morphology and syntax can be implemented as subsequent modules, or “filters”, in order to select the correct analysis when morpho-syntactic analysis leads to multiple results (parses). Alternatively, such constraints can also be integrated into CPSG95 as components parallel to, and interacting with, morphology and syntax. W. Li (1996) illustrates how semantic selection restriction can be integrated into syntactic constraints in CPSG95 to support Chinese parsing.

[2] In theory, if discourse is integrated in the underlying grammar, the input can be a unit larger than sentence, say, a paragraph or even a full text. But this will depend on the further development in discourse theory and its formalization. Most grammars in current use assume sentential analysis.

[3] Similar examples for the overlapping ambiguity string will be shown in 2.1.2.

[4] But in Ancient Chinese, a numeral can freely combine with countable nouns.

[5] These two readings in written Chinese correspond to an obvious difference in Spoken Chinese: ge (CLA) in (g1) is weakened in pronunciation, marked by the dropping of the tone, while in (g2) it reads with the original 4^th tone emphatically.

[6] It is likely that what they have found corresponds to Guo’s discovery of “one tokenization per source” (Guo 1998). Guo’s finding is based on his experimental study involving domain (“source”) evidence and seems to account for the phenomena better. In addition, Guo’s strategy in his proposal is also more effective, reported to be one of the best strategies for disambiguation in word segmenters.

[7] According to He, Xu and Sun (1991)'s statistics on a corpus of 50833 Chinese characters, the overlapping ambiguous strings make up 84.10%, and the hidden ambiguous strings 15.90%, of all ambiguous strings.

[8] Guo (1997b) goes to the other extreme to hypothesize that “every tokenization is possible”. Although this seems to be a statement too strong, the investigation in this chapter shows that at least domain independently, local context is very unreliable for making tokenization decision one way or the other.

[9] However, this assumption may become statistically valid within a specific domain or source, as examined in Guo (1998). But Guo did not give an operational definition of source/domain. Without such a definition, it is difficult to decide where to collect the domain-specific information required for disambiguation based on the principle one tokenization per source, as proposed by Guo (1998).

[10] This distinction is crucial in the theories of Liang (1987) and He, Xu and Sun (1991).

[11] This work is now defined as one fundamental task, called Named Entity tagging, in the world of information extraction (MUC-7 1998). There has been great advance in developing Named Entity taggers both for Chinese (e.g. Yu et al 1997; Chen et al 1997) and for other languages.

[12] That is what was actually done with the CPSG95 implementation. More precisely, the family name expects a special sign with hanzi-length of 1 or 2 to form a full name candidate.

[13] A typical, sophisticated word segmenter making reference to knowledge beyond syntax is presented in Gan (1995).

[14] This is in fact one very common construction in Chinese in the form of NP1 NP2 Predicate. Other examples include ta (he) tou (head) tong (ache): ‘he has a head-ache’ and ta (he) shen-ti (body) hao (good): 'he is good in health'.

[15] For the detailed analysis of these constructions, see W. Li (1996).

[16] It seems that it may be more appropriate to use terms like global disambiguation or discourse-oriented disambiguation instead of the term pragmatics-oriented disambiguation for the relevant phenomena.

[17] It seems that some PP attachment problems can be resolved via grammatical analysis alone. For example, put something on the table; found the key to that door. Others require information beyond syntax (semantics, discourse, etc.) for a proper solution. For example, see somebody with telescope. In either case, the structural analysis provides a basis. The same thing happens to the disambiguation in Chinese word identification.

[18] In fact, once morphology is incorporated in the grammar, the identification of both vocabulary words and non-listable words becomes a by-product during the integrated morpho-syntactic analysis. Most ambiguity is resolved automatically and the remaining ambiguity will be embodied in the multiple syntactic trees as the results of the analysis. This has been shown to be true and viable by W. Li (1997, 2000) and Wu and Jiang (1998).

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

PhD Thesis: Chapter VII Concluding Remarks

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter I Introduction

1.0. Foreword

This thesis addresses the issue of the Chinese morpho-syntactic interface. This study is motivated by the need for a solution to a series of long-standing problems at the interface. These problems pose challenges to an independent morphology system or a separate word segmenter as there is a need to bring in syntactic information in handling these problems.

The key is to develop a Chinese grammar which is capable of representing sufficient information from both morphology and syntax. On the basis of the theory of Head-Driven Phrase Structure Grammar (Pollard and Sag 1987, 1994), the thesis will present the design of a Chinese grammar, named CPSG95 (for Chinese Phrase Structure Grammar). The interface between morphology and syntax is defined system internally in CPSG95. For each problem, arguments will be presented for the linguistic analysis involved. A solution to the problem will then be formulated based on the analysis. The proposed solutions are formalized and implementable; most of the proposals have been tested in the implementation of CPSG95.

In what follows, Section 1.1 reviews some important developments in the field of Chinese NLP (Natural Language Processing). This serves as the background for this study. Section 1.2 presents a series of long-standing problems related to the Chinese morpho-syntactic interface. These problems are the focus of this thesis. Section 1.3 introduces CPSG95 and sketches its morpho-syntactic interface by illustrating an example of the proposed morpho-syntactic analysis.

1.1. Background

This section presents the background for the work on the interface between morphology and syntax in CPSG95. Major development on Chinese tokenization and parsing, the two areas which are related to this study, will be reviewed.

1.1.1. Principle of Maximum Tokenization and Critical Tokenization

This section reviews the influential Theory of Critical Tokenization (Guo 1997a) and its implications. The point to be made is that the results of Guo’s study can help us to select the tokenization scheme used in the lexical lookup phase in order to create the basis for morpho-syntactic parsing.

Guo (1997a,b,c) has conducted a comprehensive formal study on tokenization schemes in the framework of formal languages, including deterministic tokenization such as FT (Forward Maximum Tokenization) and BT (Backward Maximum Tokenization), and non-deterministic tokenization such as CT (Critical Tokenization), ST (Shortest Tokenization) and ET (Exhaustive Tokenization). In particular, Guo has focused on the study of the rich family of tokenization strategies following the general Principle of Maximum Tokenization, or “PMT”. Except for ET, all the tokenization schemes mentioned above are PMT-based.

In terms of lexical lookup, PMT can be understood as a heuristic by which a longer match overrides all shorter matches. PMT has been widely adopted (e.g. Webster and Kit 1992; Guo 1997b) and is believed to be “the most powerful and commonly used disambiguation rule” (Chen and Liu 1992:104).

Shortest Tokenization, or “ST”, first proposed by X. Wang (1989), is a non-deterministic tokenization scheme following the Principle of Maximum Tokenization. A segmented token string is shortest if it contains the minimum number of vocabulary words possible - “short” in the sense of the shortest word string length.

Exhaustive Tokenization, or “ET”, does not follow PMT. As its name suggests, the ET set is the universe of all possible segmentations consisting of all candidate vocabulary words. The mathematical definition of ET is contained in Definition 4 for “the character string tokenization operation” in Guo (1997a).

The most important concept in Guo’s theory is Critical Tokenization, or “CT”. Guo’s definition is based on the partially ordered set, or ‘poset’, theory in discrete mathematics (Kolman and Busby 1987). Guo has found that different segmentations can be linked by the cover relationship to form a poset. For example, abc|d and ab|cd both cover ab|c|d, but they do not cover each other.

Critical tokenization is defined as the set of minimal elements, i.e. tokenizations which are not covered by other tokenizations, in the tokenization poset. Guo has given proof for a number of mathematical properties involving critical tokenization. The major ones are listed below.

Every tokenization is a subtokenization of (i.e. covered by) a critical tokenization, but no critical tokenization has a true supertokenization;
The tokenization variations following the Principle of Maximum Tokenization proposed in the literature, such as FT, BT, FT+BT and ST, are all true sub-classes of CT.

Based on these properties, Guo concludes that CT is the precise mathematical description of the widely adopted Principle of Maximum Tokenization.

Guo (1997c) further reports his experimental studies on relative merits of these tokenization schemes in terms of three quality indicators, namely, perplexity, precision and recall. The perplexity of a tokenization scheme gives the expected number of tokenized strings generated for average ambiguous fragments. The precision score is the percentage of correctly tokenized strings among all possible tokenized strings while the recall rate is the percentage of correctly tokenized strings generated by the system among all correctly tokenized strings. The main results are:

Both FT and BT can achieve perfect unity perplexity but have the worst precision and recall;
ET achieves perfect recall but has the lowest precision and highest perplexity;
ST and CT are simple with good computational properties. Between the two, ST has lower perplexity but CT has better recall.

Guo (1997c) concludes, “for applications with moderate performance requirement, ST is the choice; otherwise, CT is the solution.”

In addition to the above theoretical and experimental study, Guo (1997b) also develops a series of optimized algorithms for the implementation of these generation schemes.

The relevance and significance of Guo’s achievement to the research in this thesis lie in the following aspect. The research on Chinese morpho-syntactic interface is conducted with the goal of supporting Chinese morpho-syntactic parsing. The input to a Chinese morpho-syntactic parser comes directly from the lexical lookup of the input string based on some non-deterministic tokenization scheme (W. Li 1997, 2000; Wu and Jiang 1998). Guo’s research and algorithm development can help us to decide which tokenization schemes to use depending on the tradeoff between precision, recall and perplexity or the balance between reducing the search space and minimizing premature commitment.

1.1.2. Monotonicity Principle and Task-driven Segmentation

This section reviews the recent development on Chinese analysis systems involving the interface between morphology and syntax. The research on the Chinese morpho-syntactic interface in this thesis echoes this new development in the field of Chinese NLP.

In the last few years, projects have been proposed for implementing a Chinese analysis system which integrates word identification and parsing. Both rule-based systems and statistical models have been attempted with good results.

Wu (1998) has addressed the drawbacks of the conventional practice on the development of Chinese word segmenters, in particular, the problem of premature commitment in handling segmentation ambiguity. In his A Position Statement on Chinese Segmentation, Wu proposed a general principle:

Monotonicity Principle for segmentation:

A valid basic segmentation unit (segment or token) is a substring that no processing stage after the segmenter needs to decompose.

The rationale behind this principle is to prevent premature commitment and to avoid repetition of work between modules. In fact, traditional word segmenters are modules independent of subsequent applications (e.g. parsing). Due to the lack of means for accessing sufficient grammar knowledge, they suffer from premature commitment and repetition of work, hence violating this principle.

Wu’s proposal of the monotonicity principle is a challenge to the Principle of Maximum Tokenization. These two principles are not always compatible. Due to the existence of hidden ambiguity (see 1.2.1), the PMT-based segmenters by definition are susceptible to premature commitment leading to “too-long segments”. If the target application is designed to solve the hidden ambiguity problem in the segments, “decomposition” of some segments is unavoidable.

In line with the Monotonicity Principle, Wu (1998) proposes an alternative approach which he claims “eliminates the danger of premature commitment”, namely task-driven segmentation. Wu (1998) points out, “Task-driven segmentation is performed in tandem with the application (parsing, translating, named-entity labeling, etc.) rather than as a preprocessing stage. To optimize accuracy, modern systems make use of integrated statistically-based scores to make simultaneous decisions about segmentation and parsing/translation.” The HKUST parser, developed by Wu’s group, is such a statistical system employing the task-driven segmentation.

As for rule-based systems, similar practice of integrating word identification and parsing has also been explored. W. Li (1997, 2000) proposed that the results of an ET-based lexical lookup directly feed the parser for the hanzi-based parsing. More concretely, morphological rules are designed to build word internal structure for productive morphology and non-productive morphology is lexicalized via entry enumeration.[1] This approach is the background for conducting the research on Chinese morpho-syntactic interface for CPSG95 in this dissertation.

The Chinese parser on the platform of multilingual NLPWin developed by Microsoft Research also integrates word identification and parsing (Wu and Jiang 1998). They also use a hand-coded grammar for word identification as well as for sentential parsing. The unique part of this system is the use of a certain lexical constraint on ET in the lexical lookup phase. This effectively reduces the parsing search space as well as the number of syntactic trees produced by the parser, with minimal sacrifice in the recall of tokenization. This tokenization strategy provides a viable alternative to the PMT-based tokenization schemes like CT or ST in terms of the overall balance between precision, recall and perplexity.

The practice of simultaneous word identification and parsing in implementing a Chinese analysis system calls for the support of a grammar (or statistical model) which contains sufficient information from both morphology and syntax. The research on Chinese morpho-syntactic interface in this dissertation aims at providing this support.

1.2. Morpho-syntactic Interface Problems

This section presents a series of outstanding problems in Chinese NLP which are related to the morpho-syntactic interface. One major goal of this dissertation is to argue for the proposed analyses of the problems and to provide solutions to them based on the analyses.

Sun and Huang (1996) have reviewed numerous cases which challenge the existing word segmenters. As many of these cases call for an exchange of information between morphology and syntax, an appropriate solution can hardly be reached within the module of a separate word segmenter. Three major problems at issue are presented below.

1.2.1. Segmentation ambiguity

This section presents the long-standing problem in Chinese tokenization, i.e. the resolution of the segmentation ambiguity. Within a separate word segmenter, resolving the segmentation ambiguity is a difficult, sometimes hopeless job. However, the majority of ambiguity can be resolved when a grammar is available.

Segmentation ambiguity has been the focus of extensive study in Chinese NLP for the last decade (e.g. Chen and Liu 1992; Liang 1987; Sproat, Shih, Gale and Chang 1996; Sun and Huang 1996; Guo 1997b). There are two types of segmentation ambiguities (Liang 1987; Guo 1997b): (i) overlapping ambiguity: e.g. da-xue | sheng-huo vs. da-xue-sheng | huo as shown in (1-1) and (1-2); and (ii) hidden ambiguity: ge-ren vs. ge | ren, as shown in (1-3) and (1-4).

(1-3.) 个人的力量
ge-ren | de | li-liang
individual | DE | power
the power of an individual

(1-4.) 三个人的力量
san | ge | ren | de | li-liang
three | CLA | person |DE | power
the power of three persons

These examples show that the resolution of segmentation ambiguity requires larger syntactic context and grammatical analysis. There will be further arguments and evidence in Chapter II (2.1) for the following conclusion: both types of segmentation ambiguity are structural by nature and require sentential analysis for the resolution. Without access to a grammar, no matter how sophisticated a tokenization algorithm is designed, a word segmenter is bound to face an upper bound for the precision of word identification. However, in an integrated system, word identification becomes a natural by-product of parsing (W. Li 1997, 2000; Wu and Jiang 1998). More precisely, the majority of ambiguity can be resolved automatically during morpho-syntactic parsing; the remaining ambiguity can be made explicit in the form of multiple syntactic trees.[2] But in order to make this happen, the parser requires reliable support from a grammar which contains both morphology and syntax.

1.2.2. Productive Word Formation

Non-listable words created via productive morphology pose another challenge (Sun and Huang 1996). There are two major problems involved in this issue: (i) problem in identifying lexicon-unlisted words; (ii) problem of possible segmentation ambiguity.

One important method of productive word formation is derivation. For example, the derived word 可读性 ke-du-xing (-able-read-ness: readability) is created via morphology rules, informally formulated below

(1-5.) derivation rules

ke + X (transitive verb) --> ke-X (adjective, semantics: X-able)

Y (adjective or verb) + xing --> Y-xing (abstract noun, semantics: Y-ness)

Rules like the above have to be incorporated properly in order to correctly identify such non-listable words. However, there has been little research in the literature on what formalism should be adopted for Chinese morphology and how it should be interfaced to syntax.

To make the case more complicated, ambiguity may also be involved in productive word formation. When the segmentation ambiguity is involved in word formation, there is always a danger of wrongly applying morphological rules. For example, 吃头 chi-tou (worth of eating) is a derived word (transitive verb + suffix tou); however, it can also be segmented as two separate tokens chi (eat) | tou (CLA), as shown in (1-6) and (1-7) below.

(1-7.) 他饿得能吃头牛
ta | e | de | neng | chi | tou | niu
he | hungry | DE3 | can | eat | CLA | ox
He is so hungry that he can eat an ox.

To resolve this segmentation ambiguity, as indicated before in 1.2.1, the structural analysis of the complete sentences is required. An independent morphology system or a separate word segmenter cannot handle this problem without accessing syntactic knowledge.

1.2.3. Borderline Cases between Morphology and Syntax

It is widely acknowledged that there is a remarkable gray area between Chinese morphology and Chinese syntax (L. Li 1990; Sun and Huang 1996). Two typical cases are described below. The first is the phenomena of Chinese separable verbs. The second case involves interfacing derivation and syntax.

Chinese separable verbs are usually in the form of V+N and V+V or V+A. These idiomatic combinations are long-standing problems at the interface between compounding and syntax in Chinese grammar (L. Wang 1955; Z. Lu 1957; Lü 1989; Lin 1983; Q. Li 1983; L. Li 1990; Shi 1992; Zhao and Zhang 1996).

The separable verb 洗澡 xi zao (wash‑bath: take a bath) is a typical example. Many native speakers regard xi zao as one word (verb), but the two morphemes are separable. In fact, xi+zao shares the syntactic behavior and the pattern variations with the syntactic transitive combination V+NP: not only can aspect markers appear between xi and zao, but this structure can be passivized and topicalized as well. The following is an example of topicalization (of long distance dependency) for xi zao.

(1-8.)(a) 我认为他应该洗澡
wo ren-wei ta ying-gai xi zao.
I think he should wash-bath
I think that he should take a bath.

(b) 澡我认为他应该洗
zao wo ren-wei ta ying-gai xi.
bath I think he should wash
The bath I think that he should take.

Although xi zao behaves like a syntactic phrase, it is a vocabulary word in the lexicon due to its idiomatic nature. As a result, almost all word segmenters output xi-zao in (1-8a) as one word while treating the two signs[3] in (1-8b) as two words. Thus the relationship between the separated use of the idiom and the non-separated use is lost.

The second case represents a considerable number of borderline cases often referred to as ‘quasi-affixes’. These are morphemes like 前 qian (former, ex-) in words like 前夫 qian-fu (ex-husband), 前领导 qian-[ling-dao] (former boss) and -盲 mang (person who has little knowledge of) in words like 计算机盲 [ji-suan-ji]-mang (computer layman), 法盲 fa-mang (person who has no knowledge of laws).

It is observed that 'quasi-affixes' are structurally not different from other affixes. The major difference between 'quasi-affixes' and the few generally honored ('genuine') affixes like the nominalizer 性 -xing (-ness) lies mainly in the following aspect. The former retain some 'solid' meaning while the latter are more functionalized. Therefore, the key to this problem seems to lie in the appropriate way of coordinating the semantic contribution of the derived words using 'quasi-affixes' to the building of the semantics for the entire sentence. This is an area which has not received enough investigation in the field of Chinese NLP. While many word segmenters have included some type of derivational processing for a few typical affixes, few systems demonstrate where and how to handle these 'quasi-affixes'.

1.3. CPSG95: HPSG-style Chinese Grammar in ALE

To investigate the interaction between morphological and syntactic information, it is important to develop a Chinese grammar which incorporates morphology and syntax in the same formalism. This section gives a brief presentation on the design and background of CPSG95 (including lexicon).

1.3.1. Background and Overview of CPSG95

Shieber (1986) distinguishes two types of grammar formalism: (i) theory-oriented formalism; (ii) tool-oriented formalism. In general, a language-specific grammar turns to a theory-oriented formalism for its foundation and a tool-oriented formalism for its implementation. The work on CPSG95 is developed in the spirit of the theory-oriented formalism Head-driven Phrase Structure Grammar (HPSG, proposed by Pollard and Sag 1987). The tool-oriented formalism used to implement CPSG95 is the Attribute Logic Engine (ALE, developed by Carpenter and Penn 1994).

The unique feature of CPSG95 is its incorporation of Chinese morphology in the HPSG framework.[4] Like other HPSG grammars, CPSG95 is a heavily lexicalized unification grammar. It consists of two parts: a minimized general grammar and an information-enriched lexicon. The general grammar contains a small number of Phrase Structure (PS) rules, roughly corresponding to the HPSG schemata tuned to the Chinese language.[5] The syntactic PS rules capture the subject-predicate structure, complement structure, modifier structure, conjunctive structure and long-distance dependency. The morphological PS rules cover morphological structures for productive word formation. In one version of CPSG95 (its source code is shown in APPENDIX I), there are nine PS rules: seven syntactic rules and two morphological rules.

In CPSG95, potential morphological structures and potential syntactic structures are both lexically encoded. In syntax, a word can expect (subcat-for or mod in HPSG terms) another sign to form a phrase. Likewise, in Chinese morphology, a morpheme can expect another sign to form a word.[6]

One important modification of HPSG in designing CPSG95 is to use an atomic approach with separate features for each complement to replace the list design of obliqueness hierarchy among complements. The rationale and arguments for this modification are presented in Section 3.2.3 in Chapter III.

1.3.2. Illustration

The example shown in (1-9) demonstrates the morpho-syntactic analysis in CPSG95.

(1-9.) 这本书的可读性
zhe ben shu de ke du xing
this CLA book DE AF:-able read AF:-ness
this book’s readability
(Note: CLA for classifier; DE for particle de; AF for affix.)

Figure 1 illustrates the tree structure built by the morphological PS rules and the syntactic PS rules in CPSG95

cpsgtree

Figure 1. Sample Tree Structure for CPSG95 Analysis

As shown, the tree embodies both morphological analysis (the sub-tree for ke-du-xing) and syntactic analysis (the NP structure). The results of the morphological analysis (the category change from V to A and to N and the building of semantics, etc.) are readily accessible in building syntactic structures.

1.4. Organization of the Dissertation

The remainder of this dissertation is divided into six chapters.

Chapter II presents arguments for the need to involve syntactic analysis for a proper solution to the targeted morpho-syntactic problems. This establishes the foundation on which CPSG95 is based.

Chapter III presents the design of CPSG95. In particular, the expectation feature structures will be defined. They are used to encode the lexical expectation of both morphological and syntactic structures. This design provides the necessary means for formally defining Chinese word and the interface of morphology, syntax and semantics.

Chapter IV is on defining the Chinese word. This is generally recognized as a basic issue in discussing Chinese morpho-syntactic interface. The investigation leads to a way of the wordhood formalization and a coherent, system-internal definition of the work division between morphology and syntax.

Chapter V studies Chinese separable verbs. It discusses wordhood judgment for each type of separable verbs based on their distribution. The corresponding morphological or syntactic solutions will then be presented.

Chapter VI investigates some outstanding problems of Chinese derivation and its interface with syntax. It will be demonstrated that the general approach to Chinese derivation in CPSG95 works both for typical cases of derivation and the two special problems, namely 'quasi-affix' phenomena and zhe-affixation.

The last chapter, Chapter VII, concludes this dissertation. In addition to a concise retrospect for what has been achieved, it also gives an account of the limitations of the present research and future research directions.

Finally, the three appendices give the source code of one version of the implemented CPSG95 and some tested results.[7]

--------------------------------------------------

[1] In line with the requirements by Chinese NLP, this thesis places emphasis on the analysis of productive morphology: phenomena which are listable in the lexicon are not the major concern. This is different from many previous works on Chinese morphology (e.g. Z. Lu 1957; Dai 1993) where the bulk of discussions is on unproductive morphemes (affixes or ‘bound stems’).

[2] Ambiguity which remains after sentential parsing may be resolved by using further semantic, discourse or pragmatic knowledge, or ‘filters’.

[3] In CPSG95 and other HPSG-style grammars, a ‘sign’ usually stands for the generalized notion of grammatical units such as morpheme, word, phrase, etc.

[4] Researchers have looked at the incorporation of morphology of other natural languages in the HPSG framework (e.g. Type-based Derivation Morphology by Riehemann 1998). Arguments for the inclusion of morphological features in the definition of sign will be presented in detail in Chapter III

[5] Note that ‘phrase structure’ in terms like Phrase Structure Grammar (PSG) or Phrase Structure rules (PS rules) does not necessarily refer to structures of (syntactic) phrases. It stands for surface-based constituency structure, in contrast to, say, dependency structure in Dependency Grammar. In CPSG95, some productive morphological structures are also captured by PS rules.

[6] Note that in this dissertation, the term expect is used as a more generalized notion than the terms subcat-for (subcategorize for) and mod (modify). ‘Expect’ is intended to be applied to morphology as well as to syntax.

[7] There are differences in technical details between the proposed grammar in this dissertation and the implemented version. This is because any implemented version was tested at a given time while this thesis evolved over a long period of time. It is the author’s belief that it best benefits readers (including those who want to follow the CPSG practice) when a version was actually tested and given as was.

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

PhD Thesis: Chapter VII Concluding Remarks

【语义计算群：借定语的壳装状语的瓤】

看一组例子：
“洗了一个痛快的澡”
“痛快地洗了一个澡”
“洗澡洗得痛快”

t0708p

好在我们把动宾离合词“洗澡”的搭配问题解决了，定语（Mod）、状语（Adv）同是附加语（adjunct），都挂到了同样的动词“洗澡”身上了，加上部分补语（Buyu）也是附加语，可谓世界大同了。原先较真的话，要问“痛快”的是“澡”，还是“洗”，还是“洗澡”, who cares？其实都是一个意思。类似的，英语也有：
live a happy life
live (a life) happily

白:
do了一个痛快的“洗澡”
程序还是要care的。

我:
如果程序在此类情形下只选一个路径，或不做规约也是可以的。到语义落地的时候只要系统适应性鲁棒即可：Adv:happily OR Mod:happy。

白:
借定语的壳装状语的瓤，总要有一个地方碰上的。
“开了一个无聊的会”
工程师可以不 care，架构师必须给说法。
我要说的是，伪定语伪状语在formalism层面就是可以解决的，并不带来额外负担。

我:
do + Adjunct + core pred
已经做了相当努力去规约这些本质上相同的说法了，如前面图中的“洗澡”：Mod 也好 Adv 也好 Buyu 也好，大体属于同样性质的附加语：
adjunct 痛快 ----》 pred 洗澡

白:
“张三做出了一个追悔莫及的决定。”
“张三遇上了这个倒霉的天气。”
“倒霉的”修饰“天气”，但倒霉的不是天气。
同理，“追悔莫及的”修饰“天气”，但追悔莫及的不是天气。
修饰关系和修饰语内置的填坑关系是脱钩的。

我:
“追悔莫及” 本义有一个 human 的坑
“做出决定” 也有一个 human 的坑
现在 human （张三）与 “做出决定” 发生了直接联系（S）与 “追悔莫及” 发生了间接关系（通过“做出决定”）。离开让 human （张三）与需要 human 坑的 “追悔莫及”直接联系只有一步之遥了。

白:
由此可见，有了的字结构，就由“的”统一应对被修饰语。至于修饰语内部的坑由谁填，被修饰语不过只是一个普通的候选而已。选不上不勉强，有更好的候选完全可以进来。所以我对把“的”这种重要的词仅仅处理成x，是有保留看法的。

我:
“的” 是敲门砖。句法树出来了， x它意思意思，比扔掉它也许好一些。

白:
我有更好的处理办法，绝非仅是敲门砖。

我:
关键是，第一个句子是一步之遥，第二个句子是两步之遥，几乎不可能超过两步。也就是说从ngram角度看也不过是dag中的 bigram 或 trigram 的语义规则，如果真想做的话。只要证明从间接联系到直接联系在语义中间件做对应用有益处这个工作是非常 tractable 的。
一个有语义的坑一个正好符合语义可以填坑近在咫尺有何难处？给我五分钟我两条线都可以勾搭上，而且保证不是权宜之计不引起副作用。其所以这些语义中间件的细活虽然不难但并没去全做是因为不很确定做了到底能带来多大好处，虽然理论上是有好处的。

白:
这些后缀，几乎每个case都是一样的。

我:
要的是这个结果吗？
t0708r

白:
一点不错，就是它

我:
我做一下 regression testing 看看有无副作用，没有的话，这个 trigram 的语义填坑规则就留下来。

我:
trigram
具体到这个 case 是从线性 5-gram 缩小成 graph 的 trigram
5 与 3 在组合爆炸的考量中是天壤之别
何况完全可以造出比 5 更加远距离的同样合适的例子来这就是句法的威力。
更主要的是，即便一个线性系统用得起 5-gram
没有结构支撑，也不敢乱用

白:
5-gram配得上的不稀疏的数据哪里来?

我:
说的是一回事儿 5gram 必然是稀疏数据不足以支撑远距离选取。不能因为一个token需要human 另一个token恰好是human 中间隔了四个词，就可以填坑了。总之是，没有结构，这事儿就做不成。

【相关】

【置顶：立委NLP博文一览】

【deep parsing，deep learning 以及在对话和问答系统中的应用】

【置顶：立委NLP博文一览】

NLP University

【立委NLP相关博文汇总一览】

NLP University 开张大吉

《朝华午拾》电子版

余致力自然语言处理(NLP,Natural Language Processing)凡30年，其目的在求交流之通畅，信息之自由，语言之归一，世界之大同。积30年之经验，深知欲达此目的，必须启蒙后进，普及科学，同心协力，共建通天之塔，因作文鼓而吹之。处理尚未成功，同志仍需努力。

【关于 NLP 以及杂谈】

【关于 parsing】

0. AI/NLP最新博文

AIGC 潮流扑面而来，是顺应还是（无谓）抵抗呢？

美术新时代，视频展示

漫谈AI 模型生成图像

《李白宋梁130：从短语结构的词序基础约束到大模型向量空间的天马行空》

AI 正在不声不响渗透我们的生活

RPA 是任务执行器还是数字员工？
图灵测试其实已经过时了 《立委科普：自注意力机制解说》 《深层解析符号模型与深度学习预训练模型》（修订文字版） NLP 新纪元来临了吗？ 【随感：大数据时代的信息茧房和“自洗脑”】 推荐Chris Manning 论大模型，并附上相关讨论 [转载]转载：斯坦福Chris Manning: 大模型剑指通用人工智能？ 《我看好超大生成模型的创造前途》 [转载]编译 Gary Marcus 最新著述：《深度学习正在撞南墙》

老司机谈NLP半自动驾驶，欢迎光临。

立委随笔：机器翻译，从学者到学员

关于NLP 落地以及冷启动的对话

《AI 随笔：从对张医生的综述抄袭指控谈起》

《AI 随笔：观老教授Walid的神经网络批判有感》

从人类认知谈AI融合之不易

与AI老友再谈特斯拉自动驾驶

《AI 理性主义的终结是不可能的吗》

《马斯克AI自动驾驶的背后：软件的内伤，硬件的短板》

《王婆不卖瓜，特斯拉车主说自驾》

《AI 赚钱真心难》

NLP自选系列2020专栏连载

《语言形式的无中生有：从隐性到显性》

1. 关于NLP体系及方法论

【语义计算沙龙：乔老爷的围墙，community 的盲区】

【立委科普：NLP 联络图】

【立委科普：自然语言系统架构简说】

【立委科普：自然语言parsers是揭示语言奥秘的LIGO式探测仪】

泥沙龙笔记:漫谈语言形式

Notes on Building and Using Lexical Semantic Knowledge Bases

【NLP 迷思之四：词义消歧（WSD）是NLP应用的瓶颈】

《泥沙龙笔记：沾深度神经的光，谈parsing的深度与多层》

【立委科普：语言学算法是 deep NLP 绕不过去的坎儿】

《OVERVIEW OF NATURAL LANGUAGE PROCESSING》

《NLP White Paper: Overview of Our NLP Core Engine》

White Paper of NLP Engine

《泥沙龙笔记：deep，情到深处仍孤独》

《泥沙龙铿锵三人行：句法语义纠缠论》

《泥沙龙笔记：知识习得对本体知识，信息抽取对知识图谱》

【泥沙龙笔记：再谈知识图谱和知识习得】

【立委科普：本体知识系统的发展历程】

Notes on Building and Using Lexical Semantic Knowledge Bases

NLP 是什么，不是什么？做什么，不做什么？

【新智元笔记：工程语法和深度神经】

【新智元笔记：李白对话录 – RNN 与语言学算法】

《新智元笔记：再谈语言学手工编程与机器学习的自动编程》

《新智元笔记：对于 tractable tasks, 机器学习很难胜过专家》

《新智元笔记：【Google 年度顶级论文】有感》

《新智元笔记：NLP 系统的分层挑战》

《泥沙龙笔记：连续、离散，模块化和接口》

《泥沙龙笔记：parsing 的休眠反悔机制》

【立委科普：歧义parsing的休眠唤醒机制初探】

【泥沙龙笔记：NLP hard 的歧义突破】

【立委科普：结构歧义的休眠唤醒演义】

【新智元笔记：李白对话录 – 从“把手”谈起】

《新智元笔记：跨层次结构歧义的识别表达痛点》

【立委科普：NLP 中的一袋子词是什么】

一切声称用机器学习做社会媒体舆情挖掘的系统，都值得怀疑

《立委科普：关键词革命》

《立委科普：关键词外传》

《立委随笔：机器学习和自然语言处理》

【泥沙龙笔记：语法工程派与统计学习派的总结】

【科普小品：NLP 的锤子和斧头】

【新智元笔记：两条路线上的NLP数据制导】

《立委随笔：语言自动分析的两个路子》

Comparison of Pros and Cons of Two NLP Approaches

【why hybrid? on machine learning vs. hand-coded rules in NLP】

Why Hybrid?

钩沉：Early arguments for a hybrid model for NLP and IE

【李白对话录：你波你的波，我粒我的粒】

【泥沙龙笔记：学习乐观主义的极致，奇文共欣赏】

《泥沙龙笔记:铿锵众人行,parsing 可以颠覆关键词吗?》

泥沙龙笔记:铿锵三人行

《泥沙龙铿锵三人行：句法语义纠缠论》

【科普随笔：NLP主流的傲慢与偏见】

【科普随笔：NLP主流最大的偏见，规则系统的手工性】

再谈机器学习和手工系统：人和机器谁更聪明能干？

乔姆斯基批判

Chomsky’s Negative Impact

[转载]【白硕 – 穿越乔家大院寻找“毛毛虫”】

【新智元笔记：语法糖霜论不值得认真对待】

【科研笔记：NLP “毛毛虫” 笔记，从一维到二维】

【泥沙龙笔记：NLP 专门语言是规则系统的斧头】

【新智元：理论家的围墙和工程师的私货】

泥沙龙笔记：从乔姆斯基大战谷歌Norvig说起

【Church – 钟摆摆得太远（2）：乔姆斯基论】

【NLP主流的反思：Church – 钟摆摆得太远（1）：历史回顾】

【Church – 钟摆摆得太远（3）：皮尔斯论】

【Church – 钟摆摆得太远（4）：明斯基论】

【Church – 钟摆摆得太远（5）：现状与结论】

《泥沙龙笔记：【钟摆摆得太远】高大上，但有偏颇》

自给自足是NLP王道

自然语言后学都应该看看白硕老师的“自然语言处理与人工智能”

语言创造简史

Notes on Building and Using Lexical Semantic Knowledge Bases

【NLP主流成见之二，所谓规则系统的移植性太差】

Domain portability myth in natural language processing (NLP)

【科普随笔：NLP的宗教战争？】

Church – 计算语言学课程的缺陷（翻译节选）

【科普随笔：NLP主流之偏见重复一万遍成为反真理】

【坚持四项基本原则，开发鲁棒性NLP系统】

【NLP 围脖：成语从来不是问题】

【NLP 是一个力气活：再论成语不是问题】

【立委围脖：对于用户来说，抓住老鼠就是好猫】

《科普随笔：keep ambiguity untouched》

【科研笔记：NLP的词海战术】

【在构筑一个模型时，枚举法是常用的必要的强盗分类】

没有语言学的 CL 走不远

[转载]为什么谷歌搜索并不像广泛相信的那样主要采用机器学习？

手工规则系统的软肋在文章分类

老教授回函：理性主义回摆可能要再延迟10几年

一袋子词的主流方法面对社交媒体捉襟见肘，结构分析是必由之路

《新智元：通用的机器人都是闹着玩的，有用的都是 domain 的》

【新智元笔记：反伊莉莎效应，人工智能的新概念】

The Anti-Eliza effect, New Concept in AI

《新智元笔记：机器的秒杀人类和霍金的杞人忧天》

【新智元笔记：强弱人工智能之辩】

【泥沙龙笔记：强人工智能的伟哥测试】

重温AI历史上的思维实验：老外不会中文，正如机器没有理解

《语义三巨人》

【泥沙龙笔记：吃科学的饭，还是技术的饭？】

《立委随笔：人工“智能”》

人机接口是机器人的面子

《新智元：通用的机器人都是闹着玩的，有用的都是 domain 的》

【泥沙龙笔记：从民科谈到五代机及其AI历史与现状】

【泥沙龙笔记：吃科学的饭，还是技术的饭？】

SBIR Grants

2. 关于NLP分析（parsing）

【语义计算沙龙：Parsing 的数据结构和形式文法】

【语义计算群：句法语义的萝卜与坑】

【语义计算群：李白侃中文parsing】

【语义计算群：借定语的壳装状语的瓤】

【语义计算群：带歧义或模糊前行，有如带病生存】

【一日一parsing：”钱是没有问题”】

【一日一parsing：休眠唤醒的好例子】

【一日一parse：长尾问题种种】

【语言学小品：送老婆后面的语言学】

【一日一parsing：NLP应用可以对parsing有所包容】

泥沙龙笔记：骨灰级砖家一席谈，真伪结构歧义的对策（1/2)

泥沙龙笔记：骨灰级砖家一席谈，真伪结构歧义的对策（2/2)

【语义计算沙龙：巨头谷歌昨天称句法分析极难，但他们最强】

【语义计算沙龙：parsing 的鲁棒比精准更重要】

《语义计算沙龙：基本短语是浅层和深层parsing的重要接口》

【做 parsing 还是要靠语言学家，机器学习不给力】

《泥沙龙笔记：狗血的语言学》

【deep parsing 小品：天涯若比邻的远距离关系】

《有了deep parsing，信息抽取就是个玩儿》

【语义计算沙龙：关于汉语介词的兼语句型，兼论POS】

泥沙龙笔记：在知识处理中，很多时候，人不如机

《立委科普：机器可以揭开双关语神秘的面纱》

《泥沙龙笔记：漫谈自动句法分析和树形图表达》

泥沙龙笔记：语言处理没有文法就不好玩了

泥沙龙笔记：parsing 是引擎的核武器，再论NLP与搜索

泥沙龙笔记：从 sparse data 再论parsing乃是NLP应用的核武器

【立委科普：NLP核武器的奥秘】

【立委科普：语法结构树之美】

【立委科普：语法结构树之美（之二）】

【立委科普：自然语言理解当然是文法为主，常识为辅】

【语义计算沙龙：从《知网》抽取逻辑动宾的关系】

【立委科普：教机器识英文】

【立委科普：及物、不及物与动词 subcat 及句型】

泥沙龙笔记：再聊乔老爷的递归陷阱

 【泥沙龙笔记：人脑就是豆腐，别扯什么递归了】

泥沙龙笔记：儿童语言没有文法的问题

《自然语言是递归的么？》

【从 colorless green ideas sleep furiously 说开去】

Parsing nonsense with a sense of humor

【科普小品：文法里的父子原则】

Parent-child Principle in Dependency Grammar

乔氏 X 杠杠理论以及各式树形图表达法

【泥沙龙笔记：依存语言学的怪圈】

【没有语言结构可以解析语义么？浅论 LSA】

泥沙龙笔记：聊一聊世界语及老柴老乔以及老马老恩

《泥沙龙笔记：NLP component technology 的市场问题》

【泥沙龙笔记：没有结构树，万古如长夜】

Deep parsing：每日一析

Deep parsing 每日一析：内情曝光 vs 假货曝光

Deep parsing 每日一析半垃圾进半垃圾出

【一日一parsing: 屈居世界第零】

【研发随笔：植树为林自成景（10/n）】

【deep parsing：植树为林自成景（20/n）】

【deep parsing：植树为林自成景（30/n）】

【语义计算沙龙：植树为林自成景（40/n）】

【deep parsing 吃文化：植树为林自成景（60/n）】

【deep parsing (70/n)：离合词与定语从句的纠缠】

【deep parsing (80/n)：植树成林自成景】

【deep parsing (90/n)：“雨是好雨，但风不正经”】

【deep parsing (100/n)：其实 NLP 也没那么容易气死】

3. 关于NLP抽取

【立委科普：NLU 的螺旋式上升及其 open知识图谱的趋向】

【语义计算沙龙：知识图谱无需动用太多知识负重而行】

【立委科普：信息抽取】

《朝华午拾：信息抽取笔记》

泥沙龙笔记：搜索和知识图谱的话题

《知识图谱的先行：从Julian Hill 说起》

《有了deep parsing，信息抽取就是个玩儿》

【立委科普：实体关系到知识图谱，从“同学”谈起】

泥沙龙笔记： parsing vs. classification and IE

前知识图谱钩沉: 信息抽取引擎的架构

前知识图谱钩沉: 信息体理论

前知识图谱钩沉，信息抽取任务由浅至深的定义

前知识图谱钩沉，关于事件的抽取

钩沉：SVO as General Events

Pre-Knowledge-Graph Profile Extraction Research via SBIR (1)

Pre-Knowledge-Graph Profile Extraction Research via SBIR (2)

Coarse-grained vs. fine-grained sentiment extraction

【立委科普：基于关键词的舆情分类系统面临挑战】

【“剩女”的去向和出路】

SBIR Grants

4.关于NLP大数据挖掘

“大数据与认识论”研讨会的书面发言（草稿）

【立委科普：自动民调】

Automated survey based on social media

【立委科普：所谓大数据（BIG DATA）】

【科研笔记：big data NLP, how big is big?】

文本挖掘需要让用户既能见林又能见木

【社媒挖掘：《品牌舆情图》的设计问题】

研究发现，国人爱说反话：夸奖的背后藏着嘲讽

立委统计发现，人是几乎无可救药的情绪性动物

2011 信息产业的两大关键词：社交媒体和云计算

《扫了 sentiment，NLP 一览众山小：从“良性肿瘤”说起》

5. 关于NLP应用

【河东河西，谁敢说SMT最终一定打得过规则MT？】

【立委科普：NLP应用的平台之叹】

【Bots 的愿景】

《新智元笔记：知识图谱和问答系统：how-question QA（2）》

《新智元笔记：知识图谱和问答系统：开题（1）》

【泥沙龙笔记：NLP 市场落地，主餐还是副食？】

《泥沙龙笔记：怎样满足用户的信息需求》

【立委科普：问答系统的前生今世】

《新智元笔记：微软小冰，人工智能聊天伙伴（1）》

《新智元笔记：微软小冰，可能的商业模式（2）》

《新智元笔记：微软小冰，两分钟定律（3）》

新智元笔记：微软小冰，QA 和AI，历史与展望（4）

泥沙龙笔记：把酒话桑麻，聊聊 NLP 工业研发的掌故

泥沙龙笔记：创新，失败，再创新，再失败，直至看上去没失败

泥沙龙笔记：parsing 是引擎的核武器，再论NLP与搜索

【立委科普：从产业角度说说NLP这个行当】

【立委科普：机器翻译】

立委硕士论文【附录一：EChA 试验结果】

【社会媒体（围脖啦）火了，信息泛滥成灾，技术跟上了么？】

【2011 信息产业的两大关键词：社交媒体和云计算】

【再说苹果爱疯的贴身小蜜死日（Siri）】

【从新版iPhone发布，看苹果和微软技术转化能力的天壤之别】

【非常折服苹果的技术转化能力，但就自然语言技术本身来说 ...】

【科研笔记：big data NLP, how big is big?】

【与机器人对话】

《机器翻译词义辨识对策》

【立委随笔：机器翻译万岁】

6. 关于中文NLP

【语义计算群：李白侃中文秀parsing】

【parsing 在希望的田野上】

【语义计算沙龙：其实 NLP 也没那么容易气死】

【deep parsing (70/n)：离合词与定语从句的纠缠】

【立委科普：deep parsing 小讲座】

【新智元笔记：词的幽灵在NLP徘徊】

《新智元笔记：机器的馅饼专砸用心者的头》

【新智元笔记：机器的馅饼（续篇）】

【新智元笔记：parsing 汉语涉及重叠的鸡零狗碎及其他】

【新智元笔记：中文自动分析杂谈】

【deep parsing：“对医闹和对大夫使用暴力者，应该依法严惩" 】

【让机器人解读洪爷的《人工智能忧思录》（4/n）】

【让机器人解读洪爷的《人工智能忧思录》（3/n）】

【让机器人解读洪爷的《人工智能忧思录》（2/n）】

【让机器人解读洪爷的《人工智能忧思录》（1/n）】

《新智元笔记：找茬拷问立氏parser》

【新智元笔记：汉语分离词的自动分析】

《新智元笔记：与汉语离合词有关的结构关系》

《新智元笔记：汉语使动结构与定中结构的纠缠》

《新智元笔记：汉语parsing的合成词痛点》

《新智元：填空“的子结构”、“所字结构”和“者字结构“》

【沙龙笔记：汉语构词和句法都要用到reduplication机制】

钩沉：博士阶段的汉语HPSG研究 2015-11-02

泥沙龙小品：小词搭配是上帝给汉语文法的恩赐

泥沙龙笔记：汉语牛逼，国人任性！句法语义，粗细不同

【中文处理的迷思之二：词类标注是句法分析的前提】

【中文NLP迷思之三：中文处理的长足进步有待于汉语语法的理论突破】

【专业老友痛批立委《迷思》系列搅乱NLP秩序，立委固执己见】

【后生可畏，专业新人对《迷思》争论表面和稀泥，其实门儿清】

【突然有一种紧迫感：再不上中文NLP，可能就错过时代机遇了】

【社会媒体舆情自动分析：马英九 vs 陈水扁】

【舆情自动分析表明，谷歌的社会评价度高出百度一倍】

【方寒大战高频情绪性词的词频分析】

【方韩大战的舆情自动分析：小方的评价比韩少差太多了】

【研究发现，国人爱说反话：夸奖的背后藏着嘲讽】

【立委统计发现，人是几乎无可救药的情绪性动物】

【立委统计发现，汉语既适合吹嘘拍马亦长于恶意构陷】

【比起英语，汉语感情更外露还是更炽烈？】

【科研笔记：究竟好还是不好】

《科普随笔：汉字和语素》

《科普随笔：汉语自动断词 “一次性交500元”》

《科普随笔：“他走得风一样地快” 的详细语法结构分析》

【立委科普：《非诚勿扰》中是谁心动谁动心？】

《新智元：挖掘你的诗人气质，祝你新年快乐》

7. 关于NLP社会媒体舆情挖掘的实践

【语义计算沙龙：sentiment 中的讽刺和正话反说】

《利用大数据高科技，实时监测美国总统大选舆情变化》

世人皆错nlp不错，民调错大数据也不会错

社媒大数据的困境：微信的风行导致舆情的碎片化

从微信的用户体验谈大数据挖掘的客户情报

社媒挖掘：社会媒体疯传柴静调查，毁誉参半，争议趋于情绪化

【奥巴马赢了昨晚辩论吗？舆情自动检测告诉你】

全球社交媒体热议阿里巴巴上市

到底社媒曲线与股市曲线有没有、有多少相关度?

再谈舆情与股市的相关性

【『科学』预测：A-股看好】

舆情挖掘用于股市房市预测靠谱么？

大数据帮助决策实例：《走进“大数据”——洗衣机寻购记》

【社媒挖掘：外来快餐店风光不再】

【社媒挖掘：中国手机市场仍处于战国争雄的阶段】

世界杯是全世界的热点，纵不懂也有义务挖掘一哈

【大数据挖掘：方崔大战一年回顾】（更正版）

【大数据挖掘：转基因一年回顾】

【大数据挖掘：“苦逼”小崔2013年5-7月为什么跌入谷底？】

【大数据挖掘：转基因中文网络的自动民调，东风压倒西风？】

【大数据挖掘：转基因英文网络的自动民调和分析】

只认数据不认人：IRT 的鼓噪左右美国民情了么？

继续转基因的大数据挖掘：谁在说话？发自何处？能代表美国人民么

关于转基因及其社会媒体大数据挖掘的种种问题

【美国网民怎么看转基因：英文社交媒体大数据调查告诉你】

【社媒挖掘：必胜客是七夕节情侣聚餐的首选之地？】

【社媒挖掘：大数据时代的危机管理】

测试粤语舆情挖掘：拿娱乐界名人阿娇和陈冠希开刀

大数据淹没下的冰美人（之三）: 喜欢的理由

大数据淹没下的冰美人（之四）: 流言蜚语篇（慎入）

大数据淹没下的冰美人（之五）: 星光灿烂谁为最？

【社媒挖掘：成都暴打事件中的男司机和女司机】

舆情挖掘：九合一國民黨慘敗馬英九時代行將結束？

社会媒体舆情自动分析：马英九 vs 陈水扁

社媒挖掘：争议人物方博士被逐，提升了其网路形象

方韩大战高频情绪性词的词频分析

方韩大战的舆情自动分析：小方的评价比韩少差太多了

社媒挖掘：苹果CEO库克公开承认同志身份，媒体反应相当正面

苹果智能手表会是可穿戴设备的革命么？

全球社交媒体热议苹果推出 iPhone 6

互联网盛世英雄马云的媒体形象

革命革到自身头上，给咱“科学网”也挖掘一下形象

两年来中国红十字会的社会媒体形象调查

自动民调Walmart，挖掘发现跨国公司在中国的日子不好过

【社媒挖掘：“剩女”问题】

【舆情挖掘：2013央视春晚播后】

【舆情挖掘：年三十挖一挖央视春晚】

新浪微博下周要大跌？舆情指数不看好，负面评价太多（疑似虚惊）

【大数据挖掘：微信（WeChat）】

【大数据解读：方崔大战对转基因形象的影响】

【微博自动民调：薄熙来、薛蛮子和李天一】

【社媒挖掘：第一夫人光彩夺目赞誉有加】

Chinese First Lady in Social Media

Social media mining on credit industry in China

Sina Weibo IPO and its automatic real time monitoring

Social media mining: Teens and Issues

立委元宵节大数据科技访谈土豆视频上网

【大数据挖掘：中国红十字会的社会媒体形象】

新鲜出炉：2012 热点话题五大盘点之五【小方vs韩2】

【凡事不决问 social：切糕是神马？】

Social media mining: 2013 vs. 2012

社会媒体测试知名品牌百度，有惊人发现

尝试揭秘百度的“哪里有小姐”：小姐年年讲、月月讲、天天讲？

舆情自动分析表明，谷歌的社会评价度高出百度一倍

圣诞社媒印象: 简体世界狂欢，繁體世界分享

WordClouds: Season's sentiments, pros & cons of Xmas

新鲜出炉：2012 热点话题五大盘点之一【吊丝】

新鲜出炉：2012 热点的社会媒体五大盘点之二【林书豪】

新鲜出炉：2012 热点话题五大盘点之三【舌尖上的中国】

新鲜出炉：2012 热点话题五大盘点之四【三星vs苹果】

8. 关于NLP的掌故趣闻

【立委随笔：文傻和理呆的世纪悲剧（romance tragedy）】

【把酒话桑麻，MT 产品落地史话】

【泥沙龙笔记：从机器战胜人类围棋谈开去】

遭遇脸书的 Deep Text

泥沙龙笔记：创新，失败，再创新，再失败，直至看上去没失败

泥沙龙笔记：把酒话桑麻，聊聊 NLP 工业研发的掌故

【 IT风云掌故：金点子起家的　AskJeeves 】

【今天的Ask.com】

《语义三巨人》

《朝华午拾：在美国写基金申请的酸甜苦辣》

看望导师刘倬先生，中国机器翻译的开山鼻祖之一

【朝华午拾集锦：立委流浪图】

【立委随笔：流浪的概念网络】

【朝华午拾：安娜离职记】

《知识图谱的先行：从 Julian Hill 说起》

【80年代在国内，社科院的硕士训练使我受益最多】

围脖：一个人对抗一个世界，理性主义大师 Lenat 教授

《泥沙龙笔记：再谈 cyc》

围脖：格语法创始人菲尔墨（Charles J. Fillmore）教授千古！

百度大脑从谷歌大脑挖来深度学习掌门人 Andrew Ng

冯志伟老师以及机器翻译历史的一些事儿

《立委随笔：微软收购PowerSet》

【NLP 历史上最大的媒体误导：成语难倒了电脑】

【立委推荐：乔姆斯基】

巧遇语言学上帝乔姆斯基

[转载]欧阳锋：巧遇语言学新锐　－　乔姆斯基

【科普小品：伟哥的关键词故事】

【不是那根萝卜，不做那个葱】

【随记：湾区的年度 NLP BBQ 】

【女怕嫁错郎，男怕入错行，专业怕选错方向】

【据说，神奇的NLP可以增强性吸引力，增加你的信心和幽会成功率】

【立委科普：美梦成真的通俗版解说】

【征文参赛：美梦成真】

【创业故事：技术的力量和技术公司的命运】

把酒话桑麻，再泡一壶茶，白头老机译，闲坐说研发

立委随笔：Chomsky meets Gates

钩沉：《中国报道》上与导师用世界语发表的第一篇论文

钩沉：《中国报道》上用世界语发表的第二篇论文

【泥沙龙笔记：机器 parsing 洪爷，无论打油或打趣】

老革命遇到新问题，洪爷求饶打油翁

我要是退休了，就机器 parse 《离骚》玩儿

【语言学小品：送老婆后面的语言学】

456822675539882531

谁会误读？为什么误读？研究一下背后的语言学 and beyond。

双宾两个坑 human 默认的坑是对象 “老婆”是“送”的对象，这是正解。
对于心术不正的人 human 也可以填受事的坑，“老婆”跟礼物一样，成了“送”的受事。
这是 “送” 的歧义，到了 caption 里面的合成词 “送给”，subcat 有细微变化，就没歧义了。为什么 “送-个” 也没歧义呢？因为“个”是不定的，而对象这个角色通常是有定的。
这里面细说起来还有一摞的语言学。

（1）双宾句型的对象一般是有定的，不定的对象不是绝对不可以，譬如：
“我把一大批书送（给）一所学校了。”
“一所” 是不定数量词，作为对象。
汉语中的 “一+量词”与光杆“量词”通常认为是等价的，范畴都是不定（indefinite），后者是前者省略了“一”而得。但是二者并非完全等价。
对象这个角色默认有定（definite，虽然汉语没有定冠词），如果是有定，不可以省略“一”，或者说，不可以由带光杆量词的NP充当。
汉语句法里面可以总结出这么一条细则：带有光杆量词的NP只能充当直接宾语，不能充当间接宾语（对象）或其他。

（2）再看合成词 “送给” 里面的语言学。
汉语反映双宾概念的语词，常常可以进一步与“给”组成合成动词，意义不变，但注意合成前后的subcat的微妙变化：“送” vs “送给” （寄给，赠给，赠送给，等）
“送”的 subcat patterns：
（1）送 + 对象NP + 受事NP: 送她一本书
（2） “把”受事NP+送+对象：把一本书送她
（3）受事NP+送+对象：这本书送她了
（4）送+受事NP：送个老婆
（5）送+对象NP（human，definite）：送（我）老婆。

请留心（4）和（5）：两个patterns有相交竞争的时候，于是歧义产生。当“送+给”构成合成动词后，subcat 的 patterns（1）（2）（3）（5）保持不变，而（4）基本失效（退出）了。说基本失效，是因为：虽然 “送给老婆”只能循 pattern 5，但“送给个老婆”（稍微有限别扭，但仍在语言可接受之列）似乎仍然需要理解为 pattern 4，这是怎么回事呢？
这就是语言的微妙之处：pattern 4 本来应该退出，因为“给”已经决定了后面是对象而不是受事；但是因为汉语有另一条很细但是很强的规则说，光杆量词的NP只能做受事，不能做对象或其他。在这两条规则（pattern 5的对象规则与光杆受事规则）发生冲突的时候，后一条胜，因此“送给个老婆”就不得不做 pattern 4 的受事解了。这叫规则与规则打架，谁胜谁输也是语言学的一部分，电脑实现的时候可以运用一个priority的机制来model。

上图还涉及一个常见的促销句式：买NP1送NP2
买iPhone 6 送耳机
买 Prius 送三年保修
这个语用句式的存在，加强了NP2作为受事的可能性，使得 human 本来默认为对象的力量受到制衡。这似乎涉及语用与句法的交界了。

这些算是语言学。Beyond 语言学，也可以从文化上看这个误解或歧义的现象：

对于来自落后农村的人，老婆作为受事的理解几乎是理所当然，因为农村的封建落后使得娶不起媳妇的光棍汉太多，白捞一个媳妇的渴望诱使他们更多向受事而不是对象方面联想，何况手机对于他们是天价，卖肾才可得之，因此对于促销句式也就更加敏感。反之，对于一个知识分子或富裕阶层人士，“送老婆”可能更偏向于理解为对象。

就跟王若水老老年谈桌子的哲学类似，这则小品主要是想谈谈日常的语言学。哲学家满眼都是哲学，语言学家以语言学看世界。语言人人会说，背后的语言学却不是老妪能解。语言如水如空气，一般人熟视无睹了，语言学家来揭示。这是 real life linguistics，琐碎而不乏规律，似海却仍可见底。

【相关】

《立委随笔：语言学家是怎样炼成的》

【关于立委NLP的《关于系列》】

【立委NLP频道的《关于系列》】

【立委按】有了这个《关于系列》，NLP有关的话，该说的已经大体说完了。以后再说，大多是重复或细节而已。有些论点可以不同角度说，关键的事情可以反复说，以信息的冗余试图保障信息传输的有效性和完整性。以前说过的，这方面立委有三个榜样，一律苦口婆心：第一是马克思，尤其反映在他集30多年功力未及完工的砖头一般厚重的《Das Kapital（资本论）》；第二是乔姆斯基，他对美国外交霸权主义和美国大众媒体的批判，絮叨了一辈子，万变不离其宗；三是老友镜子先生横扫万事万物，见诸立委主编【镜子大全】。都是菩萨心肠，把自以为的真知灼见（当然不是真理，也难免偏激）说给世界听。至少于我，说给世界听，但并不在乎世界听不听。老夫聊发少年狂，花开花落两由之。

【关于 NLP 以及杂谈】专栏：杂类；English

【关于NLP体系和设计哲学】; 专栏：NLP架构

【关于NLP方法论以及两条路线之争】专栏：NLP方法论

【关于 parsing】专栏：Parsing

【关于中文NLP】专栏：中文处理

【关于机器翻译】专栏：机器翻译；

【关于NLP应用】专栏：NLP应用

【关于我与NLP】专栏：NLP掌故

【关于NLP掌故】专栏：NLP掌故

【关于人工智能】专栏：杂类

《新智元笔记：知识图谱和问答系统：how-question QA（2）》

【关于问答系统】

【立委科普：问答系统的前生今世】

《新智元笔记：知识图谱和问答系统：开题（1）》

《朝华午拾：创业之路》

【Bots 的愿景】

《泥沙龙笔记：怎样满足用户的信息需求》

《新智元笔记：微软小冰，人工智能聊天伙伴（1）》

《新智元笔记：微软小冰，可能的商业模式（2）》

《新智元笔记：微软小冰，两分钟定律（3）》

新智元笔记：微软小冰，QA 和AI，历史与展望（4）

【再说苹果爱疯的贴身小蜜死日（Siri）】

【从新版iPhone发布，看苹果和微软技术转化能力的天壤之别】

【非常折服苹果的技术转化能力，但就自然语言技术本身来说 ...】

【与机器人对话】

【 IT风云掌故：金点子起家的　AskJeeves 】

【今天的Ask.com】

【关于 NLP 以及杂谈】

【关于 parsing】

【泥沙龙笔记：吃科学的饭，还是技术的饭？】

我:

我虽然被封了个小公司 Chief Scientist 的职称，实在不敢称科学家了，因为早已脱离 academia，也没真正靠科学吃饭：这个金饭碗太沉，端不起。这倒不是谦虚，也不是自我矮化，因为科学家和技术人在我心中难分高低。作为一线技术人，并没觉得自己比一流科学家逊色。

不说生物，说说NLP。可重复性是科学的根本，否则算命先生和跳大神的也都是科学家了。针对一个单纯的任务，或一个纯粹的算法，在 community 有一个标注测试集的时候，这个可重复性似乎是理应有所要求的，虽然具体怎么验证这个要求，验证到哪一步才被公认有效，似乎远非黑白分明。

我的问题是，如果是一个复杂一些的系统，譬如 deep parser，譬如 MT，特别是在工业界，有可能做到可重复吗？不可重复就不能认可吗？且不说不可重复是保持竞争优势的必要条件，就算一家公司不在乎 IP，指望对手能重复自己的结果，也是难以想象的事儿 -- 除非把全盘源代码、原资源，包括所有的词典，原封不动交给对方，而且不许configure，亦不允许改动任何参数，否则怎么可能做到结果可以被重复呢？

毛:

凡是“构成性要素”，必须在一定的误差范围内可重复。要不然就属于商业秘密而不属于科学发现了。

我:

所以 key 就是看你吃哪一碗饭。吃学术的饭，你就必须过这一关。怎么拿捏是 community peer reviewers 的事儿。

毛:

还是那句话，你不能把什么好处都占了。

我:

吃工业的饭，你只要你的黑箱子 performs 就ok了。

这就使得学术界只能就“构成性要素”而发表，做一个 integrated 系统是不讨好的。这个从科学上是有道理的，但是很多做学术的人也不甘心总猫在象牙塔里，为他人做嫁衣裳，他们也想做实用系统。integrated 的实用系统几乎肯定无法由他人重复出结果来，因为变数太多，过程太复杂。

毛:

那倒也不一定，当年的 unix 就是系统。但是在同样的配置条件下得到的结果应该在一定的误差范围之内。

我:

换句话说吧，别说他人，就是自己也不见得能重复出自己的结果来。如果重起炉灶，再做一个 parser 出来，结果的误差是多少才能算容许的范围呢？就算基本设计和算法不变，相信是越做越好，但结果的误差在做成之前是很难预测的。这与在新的开发现场所能调用的资源等因素有关。

毛:

对呀，所以别人也不至于吹毛求疵，大家会有个共识的。像Parser一类，如果是对自然语言，那应该是很宽的。但如果是形式语言、编程语言，那就要求很严了。

我：

说的是自然语言。十几年前，我还在学术殿堂边徘徊，试图讨好主流，分一杯羹，虽然明知学界的统计一边倒造成偏见流行（【科普随笔：NLP主流的傲慢与偏见】）积久成疾，我辈压抑，同行如隔山，相互听不见。直到有一天大彻大悟，我到底吃的是谁的饭，我凭的什么在吃饭？原来我的衣食父母不是科学，更不是主流。我与隔壁的木匠阿二无异，主要靠的是手艺吃饭，靠的是技术创新的绝技，而不是纯科学的突破。认清这一点，也就避免了以卵击石，长他人威风，灭自己志气。说到底，在业界，老板不在意你在哪一条路线上，客户更不在乎你有没有追赶潮流，白猫黑猫，一切由系统说话。你有你的科学突破，我有我的技术绝技，到了应用现场，还要看谁接地气，有没有硬通货呢。系统结果可能难以重复，客观测量却并非难事儿。

【相关】

【关于NLP方法论以及两条路线之争】

【关于我与NLP】

《知识图谱的先行：从Julian Hill 说起》

【关于信息抽取】

【立委科普：信息抽取】

《朝华午拾：信息抽取笔记》

泥沙龙笔记：搜索和知识图谱的话题

《有了deep parsing，信息抽取就是个玩儿》

【立委科普：实体关系到知识图谱，从“同学”谈起】

泥沙龙笔记： parsing vs. classification and IE

前知识图谱钩沉: 信息抽取引擎的架构 2015-11-01

前知识图谱钩沉: 信息体理论 2015-10-31

前知识图谱钩沉，信息抽取任务由浅至深的定义 2015-10-30

前知识图谱钩沉，关于事件的抽取

钩沉：SVO as General Events

Pre-Knowledge-Graph Profile Extraction Research via SBIR (1)

Pre-Knowledge-Graph Profile Extraction Research via SBIR (2)

Coarse-grained vs. fine-grained sentiment extraction

【立委科普：基于关键词的舆情分类系统面临挑战】

【“剩女”的去向和出路】

SBIR Grants

【关于 parsing】

【关于 NLP 以及杂谈】

【关于人工智能】

【关于NLP体系和设计哲学】

【关于NLP方法论以及两条路线之争】

【置顶：立委科学网博客NLP博文一览（定期更新版）】

"快叫爸爸小视频" 的社会计算语言学解析

“快叫爸爸小视频” 这样的东西有社会语言学的味道随着时代和潮流翻滚。在微信朋友圈及其提供的小视频功能风靡之前小视频不是术语不是合成词也没有动词的引申用法。它就是一个定中结构的 NP，在句型中等价于说”把爸爸叫做小视频”，虽然常识是 “人（爸爸）不可以等价于物（视频）”。在语言的强制性subcat结构（叫NP1NP2）里面，常识是没有位置的。句法不需要顾及常识正如 “鸡把我吃了”的违反常识一样也正如乔姆斯基千古名句的 green ideas。
可是社会语言学登场了语言被置于流动的社会背景之下，小视频成了 technical term，然后又从术语融入了语言共同体的动词用法，正如谷歌从术语（专名）变成动词一样： “我还是先谷歌一下再回应吧”，“快小视频呀”，“一定要小视频这个精彩时刻”。
白:
“一下”强制“谷歌”为动词。半个括号已经有了另半个没有也得有。
我:
于是 subcats 开始 compete，有了 competition，有了结构歧义就有了常识出场的理由。顺应常识者于是推翻了句法的第一个 reading。
白:
你是我的小苹果，怎解？
我：
“你是我的小苹果”是强制性的句法啊，无论怎么理解这个苹果（到现在我也没有理解为什么把爱人或意中人叫做小苹果，是因为拿高大上的苹果比喻珍贵吗？）都与常识无关：你是我的 x，就是强行的句法等价关系。
“一下”强制“谷歌”为动词这一类看似临时的强制在语言共同体中逐渐从临时变成常态后就侵入了词汇。换句话说，“谷歌”在以前的词典里面是没有也无需“潜在动词”的标注（lexical candidate POS feature），因为几乎所有的动词用法都是零星的句法强制的无需词典 support 的。但是随着语言的发展 “谷歌”的动词用法逐渐变成了语言共同体司空见惯的表达方式（其动词用法的流行显得简洁、时髦甚至俏皮），这时候语言的用法被反映在语言共同体的集体词汇表中，我们模型这个共同体的语言能力的时候就开始标注其动词的可能性了。
金:
厉害，这抠的!金融语义在一边看热闹
我:
或问：这词典里面标注了（反映的是共同体集体意识到这种用法的流行）和不标注有什么区别？
当然有区别。标注了就意味着其动词用法作为一个合理的路径参与 parsing 的正常竞争；不标注虽然也不能排除临时的动词用法但是因为缺乏了底部的词典支持其动词用法的路径是默认不合法，除非句法（包括词法）的context逼迫它成为动词，这就是 “一哈”的所谓强盗句法：不仅词典是绑架的天堂，句法也可以绑架。
白老师说：“兼语理解（叫某人做某事）有谓词性的坑不饱和，双宾理解（叫某人某称呼）有体词性的坑不饱和。如果拘泥于结构，二者半斤八两。但如果结合语境，非兼语理解是颠覆性的，兼语理解是常识性的。放着常识性的理解不选选择颠覆性的理解，说明心头的阴云不是一天两天了。冰冻三尺。”
重温一下白老师作为对比，字字玑珠，而且妙趣啊。“冰冻三尺”就是社会语言学。
白
也可以说，冰冻三尺就是大数据
我:
我们学习语言学模型句法绝大多数都是针对现时的把语言看成是一个静态的剖面来研究它模型它。这个也没大错而且简化了问题。但是语言是流动的社会语言学强调的就是这个流动性。流动自然反映在大数据中。因此对于静态的语言模型需要不断的更新如果有大数据那就定时地 check 它。
白：
有个动态更新的中间件就够了
我:
陈原是个大家。他写的社会语言学很有趣味。在世界语场合有幸聆听过陈原先生的世界语演讲：那个才华四射、感染力和个性特色让人高山仰止。人家做语言学是业余本职工作是出版商。据说是中国最权威的出版家，也是个左派社会活动家。
洪:
虽然解放初才入党，但应该早就是中共地下党员，三十年代初就在三联当编辑，胡愈之邹韬奋的部下，以前《读书》上一直有陈原的《在语词的密林里》
我:
陈原的那次演讲与黄华(我做翻译的那次)的演讲都有一个共同的特点，就是表情丰富、富于感染力，能感受到人的 personality，都是“大家”。
aaa

【相关】

《我的世界语国》

《朝华午拾：欧洲之行》