分类： NLP方法论

【NLP主流的反思：Church – 钟摆摆得太远（1）】

【NLP主流的傲慢与偏见】

【Church – 钟摆摆得太远（5）：现状与结论】

【骨灰级语言学家开讲段子小品】

走在路上瞎琢磨，突然脑中冒出句俏皮话的段子：

天下无贼贼有看头
why
好看啊
我是问无贼怎么有看头呢

白老师曰相声的段子就是这么来的。

马：
"贼有看头"，估计有人看不懂这句
想起一个段子。一个人去东北出差，问东北人宾馆是否好找，答曰：东北宾馆贼多。于是吓得不敢去了。

哈马老师这个更好。

今天有闲，侃侃这个即兴段子的语言学。这样的对话在语言学家眼中有些什么看点呢？

从语义计算的角度，并不是下述每一个点都那么容易形式化、模型化，但是人机对话要想逼近人类对话的高度，这些方方面面迟早要被 addressed。

看点 1. 专名与字面语义的纠缠：《天下无贼》

自从摈弃了上世纪30-40（？）年代流行过的书名号（一种括号）和专名号（下划直线或波浪线：据说后来嫌排版麻烦，就逐渐舍弃了）以后，这个纠缠就很 annoying。这是不同层次的纠缠，但没留下形式痕迹。通常的做法是指望有一部专名大辞典，搜罗进去的遵从 hidden ambiguity 的休眠原则。于是，“天下无贼”被词典识别为默认的电影专名，其内部的语义结构（小句结构）则被休眠。

2. 两个贼的纠缠

自然语言有一个广为人知的属于 discourse 范畴的 heuristic，叫做 one sense per token，说的是，一个token在同一个discourse里面重复出现，那么这个token的所指是相同的。有数据证明，这个 heuristic 的准确率非常高。于是，两个“贼”因为这个 heuristic，就埋下了一条伏线：同指（coreference）的 heuristic 一线与不同指的 heuristic 例外的一线。我们知道，凡 heuristic 一律有例外：再高的精确度，也有不灵光的时候。

其实，这个例外也有人研究过，例外里面还是有规则。规则就是，如果一个token隐藏在一个成语（计算语言学所谓成语包括术语、专名和其他的合成词）内部，那么这个token就不（必）遵循 one sense per token 的原则。

3. 贼的内部歧义

贼的标配定义就是 blah blah 的【human】。不知何时，好像是早先来自我们东北兄弟，开始用“贼”表达程度（副词），感觉贼形象、贼酷。这个用法显得别致、匪气、接地气，进而渐次推广到全国，尤其在网络用语里面。于是，贼的 hidden ambiguity 出现了，（i）默认的名词【human】和（ii）程度副词。

4. “vt头”的语言学

（有）.... 看头、吃头、玩头

这事儿咱从头说起。汉语是孤立语，一般认为没有欧洲语言的形态（词尾等），也没有严格意义的前缀后缀。如果n个词素（morphemes）组合成了一个词典单位，通常的说法就是合成词（compounding），而不是有明显主干枝叶之分的派生词（derivatives）。但语言是发展的，从古汉语甚至合成词都极少（这是“孤立语”的本义，孤立语的典型和极致是没有 morphology的）、一切都是 syntax，逐步发展到现代汉语，汉语的孤立特性在明显降低。有些所谓类语缀（quasi-affix）的语言学材料开始出现。换句话说，汉语有朝着印欧语言的方向演变的蛛丝马迹。

“头”就是一例。

“看头”，不是句法的动宾：看（了个啥）头；
也不是通常的合成词的定中套路：（所）看（的）头。
而是一个特别的后缀，其派生词的构词法与句法的接口，可以这样来做形式化的描述（by the way 我的博士论文专门有一节论汉语的类语缀现象）：

NP 有/没有 Vt-头 ==》有/没有 VP{Vt NP} 的【value】，VP {Vt NP} 结构自然是典型的动宾式动词短语。

细究的话，这里面还有“学问”：

其一，Vt 不仅要求及物动词，而且要求是单语素（说白了就是一个汉字），两个汉字可能吗？也许由于灰色过渡的存在，可以找到个别的例子，但感觉不是很多：

？这本书有学习头吗
？这个课题没有研究头。
* 这个曲子有弹奏头。

到了二字以上，那就绝对违法了。（MD 想一个三字的及物动词还真不容易：汉语的双音化太突出了。算了，不想了。）

其二，这个搭配句式中的有/没有不是 “拥有” 的“有” 而是“存在”的“有”，相当于英语的 (there) be or (there) exist。因此这个 V 的唯一的 arg 前置到主语的位置和后置到宾语的位置，语义关系不变：

这本书有看头 ==》
（i）有看这本书的价值
（ii）看这本书的价值有。
（后者头重脚轻，稍微有些不顺，但句法上是可以自由语序的， cf：

（的确）存在看这本书的【价值】
看这本书的【价值】（的确）存在。

其他看点还有，譬如 “why” 和 “怎么”（字面意义是 how）的逻辑语义其实是一样的，问的是【原因】而不是【方式】。累了。歇了。只是记住一点：
自然语言里面的名堂，比我们每天说话的人想象的要丰富很多。希望年轻的 NLP 后学不要小看这门语言的学问。至于老人，那就不用指望了。傲慢与偏见，爱咋咋。

【相关】

【NLP主流的傲慢与偏见】

【NLP主流的反思：Church - 钟摆摆得太远（1）】

【Church - 钟摆摆得太远（5）：现状与结论】

【语义计算：李白对话录系列】

【立委按】世有李白者，精于语义，勤于计算，一敏一木，一弦一弹，无论魏晋，不知有汉。坐而论道，波澜不惊，各得其乐，天马空行。挥斥方遒，指点语言，和寡曲高，流水云天。有道是，一擎核弹一拨弦，不是冤家不上船。

【李白100：Parsing 的休眠反悔机制】

【李白99：从大小S的整体部分关系看舆情挖掘的统计性】

【李白98：从对联和孔子遗言看子语言自动解析】

【李白97：大S小S句式中插入“的”所引起的交叉陷阱】

【李白96：想哪扯哪，不离其宗】

【李白95：走在路上……】

【李白梁于94：爱因斯坦是卓别林的崇拜者, 谁崇拜谁？】

【李白宋93：汉语语序的自由与不自由】

【李白92：自然语言漏得筛子似的，未必要补漏】

【李白91：休眠唤醒需要打离婚证】

【李白宋郭90：句法与逻辑和语用的纠缠】

【李白王89：模糊语义与真假歧义，兼论PSG与DG】

【李白宋88：再谈量词搭配与名词短语自动解析】

【李白洪87：人工智能，真的该让这样的哲学家走开】

【李白86：这是最后的斗争？】

【李白刘董85：汉字优越吗？】

【李白王董84：再谈POS迷思，兼论 PennTree 的误导】

【李白宋83：点评 “人工智能的诗与远方”】

【李白82：汉语重叠式再议】

【李白81：某些人的讽刺与挖苦】

【李白毛洪80：驯兽散记】

【李白79：中文深度解析的地基是词法分析器】

【李白78：毛主席保证】

【李白77：基本短语是浅层和深层parsing的重要接口】

【李白76：跨层次结构歧义的识别表达痛点】

【李白洪毛75：乔姆斯基批判】

【李白雷宋74：乔老爷的递归陷阱】

【李白73：汉语parsing的合成词痛点】

【李白宋毛72：NLP的测不准与追求完美】

【李白71：“上交所有不义之财！”】

【李白70：计算语言学界最“浪漫”的事儿】

【李白69：“蛋要是能炒饭，要厨师干啥用？”】

【李白68：NLP扯着扯着还是扯到萝卜填坑】

【李白67：带结构变量的词驱动模式注定是有限的】

【李白66：“青春期父母指南”的语义计算】

【李白邢65：“着”字VP的处置】

【李白董冯吕64：NLPers 谈 NLP渊源及其落地】

【李白雷63：做NLP也要见好就收，适可而止】

【李白梅宋62：工程语法与深度神经】

【李白张61：长尾问题种种】

【李白60：事理图谱之辨】

【李白雷梅59：自动句法分析中的伪歧义泥潭】

【李白之58：爬楼NLU】

【李白董57：中文字驱动patterns初探】

【李白董51：说不完的subcat和逻辑语义】

【李白之50：符号战壕的两条道路之辩（续）】

【李白梁49：同一个战壕的两条道路之辨】

【李白之48：关系不交叉原则再探】

【李白之47：深度分析是图不是树，逻辑语义不怕句法交叉】

【李白之46：做NLP想不乐观都找不到理由】

【李白之45：从变性谈到模糊与歧义的不同】

【李白之44：“明确”是老子还是儿子，需要明确】

【李白之43：谈谈绑定和回指】

【李白之42：谈谈工具格的语言形式】

【李白之41：Gui冒VP的风险】

【李白之40：逻辑语义是语义核心，但不是全部】

【李白之39：探究自然语言的毛毛虫机制】

【李白之38：叫ＮＬＰ太沉重】

【李白之37：分层与一锅煮的parsing机制探讨】

【李白之36：汉语可以裸奔，不可能无法】

【李白之35：句法分析 bottom up 为基础，可穿插 top down】

【李白之34：汉语情态词和计划类动词的异同】

【李白之33：从语言的毛毛虫特性聊到语文纠错的辅助工具】

【李白之32：从“没 de Vt” 聊开去】

【李白之31：绕弯可以，弯不过三】

【李白之30：李白侃中文parsing】

【李白之29：依存关系图引入浅层短语结构的百利一弊】

【李白之28：“天就是这样被聊死的”】

【李白之27：莫名其妙之妙，妙不可道】

【李白之26：汉语动结式和情态式的隐式被动现象】

【李白之25：句法能简则简，只要不影响总体结构】

【李白之24：“这碗花纹很别致的”】

【李白之23：“一切都在变，只有变本身不变”】

【李白之22：兼语式的处置及其结构表达】

【李白之21：萝卜多坑不够咋办】

【李白之20：得字结构的处置及其结构表达】

【李白之19：三探白老师的秘密武器】

【李白之18：白老师的秘密武器再探】

【李白之17：“我的人回来了, 可心还在路上”】

【李白之16：小词负载结构与小词只参与模式条件之辩】

【李白之15：白老师的秘密武器探秘】

【李白之14：Chinese deep parsing，说的是 deep！】

【李白之13：所谓话题或大小主语的句式】

【李白之12：修正乔老爷的保守派自由派之辨】

【李白之11：parser 的三省吾身】

【李白之10：白老师的麻烦不是白老师的】

【李白之九：语义破格的出口】

【李白之八：有语义落地直通车的parser才是核武器】

【李白之七：NLP 的 Components 及其关系】

《李白之零：NLP 骨灰级砖家一席谈，关于伪歧义》

【李白隔空对话录：谁无知呢?】

【李白：其实NLP 也没那么容易气死】

【相关】

自然语言后学都应该看看白硕老师的“自然语言处理与人工智能”

【白硕 – 打回原形】

立委译白硕：“入口载体”之争（中英对照）

《铿锵众人行, parsing 可以颠覆关键词吗?》

【语义计算沙龙：乔老爷的围墙，community 的盲区】

【理论家的围墙和工程师的私货】

【deep parsing 小品：天涯若比邻的远距离关系】

Deep parsing: 每日一析，内情曝光vs 假货曝光

《立委科普：关键词革命》

《立委科普：关键词外传》

骨灰级砖家一席谈，真伪结构歧义的对策（1/2)

骨灰级砖家一席谈，真伪结构歧义的对策（2/2)

【李白之12：修正乔老爷的保守派自由派之辨】

白:
“他们把总裁开掉的人训了一顿。”“他们把总裁开掉的人吃了一顿。”

我:
总裁开掉的那些人吃了一顿。
把总裁开掉的那些人吃了一顿。

“他们把总裁开掉的人吃了一顿。”属于合法非法边缘，语感上别扭：“他们”与“人”coreference，很多人不接受。

白:
同位语

我:
觉得别扭。
这些句子真心难。
试一试 parser。别扭的说法出来了（第二句），顺溜的句法反而走歪了（第一句）：

白:
吃的宾语相谐条件太明显不满足。

他们戴着大盖帽的人很强势
他们把子曰诗云挂在嘴边的人对民间俚语一点兴趣都没有。

我:

“他们戴着大盖帽的人很强势”这句稍微好一点，不过这类句子总体别扭是因为有更简约明了的说法在竞争：

他们戴着大盖帽的人很强势 --> 戴着大盖帽的人很强势
他们戴着大盖帽的那些人很强势 --> 戴着大盖帽的那些人很强势

这个“他们”不仅多此一举，而且平添理解困难。

白:
理解是不应该裁定是否别扭的
生成可以

我:
道理是。但是别扭决定了统计性弱，因此理解系统忽略它后果不严重，甚至总体更有利（减少了弄巧成拙的可能性）。可惜，我们目前坏在没有忽略它。因为 local SVO 很正，想忽略也不容易呢。即便想降低 recall，减少对罕见例子的鲁棒性，也不容易，除非费力刻意为之。。。

白:
我是在探索方法论问题：不回头的matcher需要看多远。

我:
知道，这是"消息"类的延长线。不过这个同位复杂了，需要回头。不好办。弄巧成拙的可能很大。

白:
如果必须在一个阶段内将错就错，那么等trigger到来之际，强行上车的乘客挤掉之前在车上的哪个乘客，还会不会翻掉更早的盘。

代词相当于有个坑，虽然和谓词隔了一层，但毕竟和“信息”类不同。
非代词同位结构不能这样用。

我:
道理明白。
道理是道理。那什么是什么。

as expected，前一句虽然对了，对得不开心。歪打正着，这不是第一次遇到了。在非设计的成功里，设计者不可能开心。而这一路不好设计。

前句各就各位，一路通畅。正因为此，后者只好把“把”落到定语从句的 head N 身上，又因为“把”的句法强势，“。。。那些人”成了盘中餐。哈，荒诞不过如此，但parsing 的逻辑线条却是清晰的。

白:
这里有个逻辑顺序问题。“把”怎么摆布，是有余地的，“吃”做逻辑宾语的语义不相谐，却是没余地的。应该句法不到山穷水尽，语义不相谐的不要登场才是。

我:
这个说法实践中很容易把人带进坑的。
换句话说，白老师自己有一个路数，按照这个路数，这个说法没啥问题。可是 followers 如果不是那个路数，或不明白那类路数，把这个说法当原则去指导实践，九成以上就掉坑里了。比较容易 follow 而大面上不错的原则还是乔老爷的句法独立原则的修正：句法不到山穷水尽，语义相谐的不要登场才是。对比白老师原则：句法不到山穷水尽，语义不相谐的“不”要登场才是。

白:
实践中，语义不相谐又被采纳的基本是活用性质的修辞，它们都发生在“高确定性、低相谐度”那个区域。如果明明是活用性的修辞用法，但却发生在低确定性区域，只能证明句法本身出问题了。

我:
对啊。
“高确定性、低相谐度”那个区域是不小的一个区间。因此句法独立的做法也不是完全要推翻，适当使用还是有益的。

白:
@wei 这个乔老爷原则用在英语上。
汉语不灵。

我:
明白。但还是一个度的问题。
完全实行乔老爷，根本就没有语义相谐或不相谐的事儿，语义被句法踢得远远的，老死不相往来。Note 我的原则是对乔老爷的修正: 句法不到山穷水尽，语义相谐的不要登场才是。可见，在这个原则下，语义登场了，语义句法融合了。
白老师的原则也是融合，也是对乔老爷的修正或反叛。但一字之差，就是保守派和自由派之争。我的说法：作为原则，不到不得已不动用语义。不得不动用的话，动用相偕，而不是不相谐。这个说法是有一贯性的：（1）不到不得已不用语义，差不多就是让句法来主导，暂时不管谐不谐，这等于语义不谐但句法不错的已在网中，因此也就没有再查不相谐的必要了。毛姑姑，这样可以搞定英语的90%+，汉语的 80%+，那么剩下的句法搞不定的，句法出错的，就用语义相谐来细化（句法角色细化为逻辑语义：譬如确定 agent 主语 vs instrument 主语）或修正（包括休眠唤醒）。这条路稳妥一些，至少感觉跌进坑的可能减少一些。

【相关】

《泥沙龙笔记：parsing 的休眠反悔机制》

【李白对话录之八：有语义落地直通车的parser才是核武器】

【立委科普：结构歧义的休眠唤醒演义】

【李白对话录之11：parser 的三省吾身】

【李白对话录之10：白老师的麻烦不是白老师的】

【李白对话录之九：语义破格的出口】

【李白对话录之七：NLP 的 Components 及其关系】

【李白对话录之六：如何学习和处置“打了一拳”】

【李白对话录之五：你波你的波，我粒我的粒】

【李白对话录之11：parser 的三省吾身】

李：
白老师有关于深度分析的名言曰：

parser三省吾身：有坑填乎？有盘翻乎？有subcat相谐乎？

填坑乃细线条句法，翻盘为语义重新计算，subcat 相谐引入本体常识。

宋老师提出的例子很有意思，属于语义翻盘:

Jogger's nipple happens when a runner's shirt rubs against his exposed chest.
google译为慢跑者的乳头发生在跑步者的衬衫摩擦他暴露的胸部时。

宋老师说：其实，”Jogger's nipple“应该译成“ 慢跑者乳头”，具有专指意义，可以看作临床医学的术语。汉语中，“慢跑者乳头”和“慢跑者的乳头”一字之差，决定了前者是术语，后者为普通的短语。但是，如果说“慢跑者乳头会被擦伤”时，只能理解成（慢跑者（乳头会被擦伤））

这个一字之差很合理啊。这与 hidden ambigity 同，是 context 强拆词典词。为了维持词典的优先或默认地位，这种强拆就是我们讨论过的休眠唤醒问题。这与 “难过” sad 被唤醒为 difficult to cross 是一个套路。可以用词驱动的方式局部重新洗牌。

按照 “难过” 的处理思路此例不难。大体是第一遍粗线条parsing的时候不check语义限制条件，所以得出这种疾病会被擦伤的逻辑动宾关系，大面上的 parse 是不错的，尽管不 make sense 因为疾病与擦伤不搭。到后面的模块或者是语义落地的模块，我们可以再做局部的 parsing 调整。正因为它有两种可能性，才使得词驱动的反悔策略可以成功。这个策略的成功已经在我的sentiment语义落地模块得到应用和证实。我专门有一篇博文，详细解说这种局部parsing反悔而使得语义正确落地，否则“难过”就是主观的负面情绪，可我的系统最终结论为客观的困难。“这条小河很难过” 于是不同于 “这个小孩很难过”。sentiment系统做到这个程度没听说过第二家，但的的确确是可行的。它基于的机制就是与白老师讨论过好几回的休眠唤醒，而不是 nondeterministic 带着瓶瓶罐罐跑。原则上只要是可以词驱动的这类现象都可以做。

白：
方法论上，就是纵向不确定性（词汇歧义）和横向不确定性（填坑歧义）不要搅在一起。先撇开结构（但不排除非结构的各种信息包括subcat的使用）确定无歧义的词汇（WSD），再聚焦横向不确定性。当纵向自己冒出更加相谐的其他候选或者横向遭遇结构性不顺或者二者兼而有之的时候，启动WSD翻盘，即休眠唤醒。这里的潜台词是，随着填坑的进行，WSD一直在继续，类似阈下联想。但这种继续，只是横向填坑的结果单方面抛给纵向WSD，但是不到翻盘启动，WSD只不过瞎联想而已，并不反作用于横向填坑。

李：
白老师是哲学家善于总结和抽象。

白:
段子就是这么产生的
包袱就是这么抖的
直到WSD的最后一根稻草打破僵局，启动翻盘

李：
人也是这么个理解过程，叫恍然大悟。恍然前其实在标配休眠。
所以说以前很多人认为 hidden ambiguity 是中文分词的死穴、无解，那是因为误把分词当成了一个独立的死模块。明明是 context parsing 的催眠唤醒的任务，硬要怪罪和强加给分词。现在清楚了这个问题有解，但解不必在分词层面，解隐藏在词驱动规则里面，不到一定的时候不 activate。这与以前的 exhaustive parsing 的方案虽然原理都是借助 parsing 之力，但却适应了 real life 系统多层模块化开发的需求。（我在博士论文中提过用 chart parsing 通过 exhaustive candidates 解决一切切词任务，包括 hidden ambiguity 的，是把切词看成 parsing 的一个有机成分，这个解决方案理论上可行，但难以 scale up。）

白:
parser三省吾身：有坑填乎？有盘翻乎？有subcat相谐乎？

李：
笑喷。差点喷在手机屏幕上

李：
说 subcat 是命根子等价于说教科书上的 cat（POS）为基础的 grammar 太粗线条，很误导，只是 toy，无法对付真实语料。subcat 与词典主义是一致的。

白:
两个粒度
没得可选时，pos很给力。选择太多时，pos就是累赘了。

李：
hierarchy: literal -- subcat -- cat
subcat includes sub-sub-cat and hownet taxonomy

白:
沿上下位链条有一个统计分布，并不是任何一个节点的传播强度都相同。
非常有意思的一个问题

比如说到“猴子”，可能其典型的上位词是“灵长类”，而“哺乳动物”“动物”“生物”这些上位词就不那么典型。也不一定就是直接上位最典型。比如“豹子”，可能“猫科”并不典型，而“野兽”更加典型。如果要做无监督学习，典型性分布是一个必须解决的问题。典型性还会“条件化”。比如上下文中有“吃”，则“动物”上位就会强化。说“产卵”，则卵生上位会强化。

这样才能把词典中的subcat标签如所愿地变成非标注语料的自动标签。也就是说，实现正确的subcat embedding, subcat embedding是比word embedding意义重大很多的一件事，难度也不在一个数量级上。如果subcat embedding成功，意味着从此告别苦力，进入自动化标注时代。

李：
subcats （sets or clusters of words from any angles） or taxonomy 链条中的所谓典型 nodes，说到底，是作为语言特征，它是不是有区别性。

早早年做 MT 有个例子很有意思。说英语的 down 有一个用法和义项，与 along 同，翻译成汉语是 “沿着”。需要什么条件才翻译成沿着呢？研究了 data 发现，原来是它后面的名词都有一个特征，可以叫做“线条性”，于是这个特征就成为语言使用和理解中有意义的 feature 了：

down the street
down the line
down the pipe
down the corridor
etc

down + NP【线条性】 --》沿着 NP

那么 along 呢？ along 基本没有歧义，于是就不需要这个条件了：

along + NP --》沿着 NP

生物学上的 taxonomy 不一定具有语言学上的区别性特征，其中的有的 nodes 典型，具有语言意义，有的 nodes 就没有语言学意义。人、鬼、神、妖很不同的。但是语言使用上，其搭配关系大同小异。

白:
所以闭门造taxonomy是不管用的
标签造出来就是为了区别的
如果不知道谁对区别敏感，就吃力不讨好

李：
完全从大数据去 clustering，也不好说结果就好使；闭门造车拍脑袋也容易偏差。最好还是二者的某种结合。

回到那个【线条性】的区别性feature来。一旦总结出来，我们就可以穷尽词典，根据这个特征给每一个具有线条性的名词标注。从此就可以说 down 的这个用法，我们基本搞定了，没有 sparse data 的顾虑了。如果没有总结出这个 feature，并在词典里面穷尽它，SMT 和 NMT 能自己学出这个 feature 并对 sparse data 免疫吗？它能够从 down the street 举一反三到 down the line 吗？---- 假如后者是 sparse data，训练数据里面没见的话。

白:
等会儿，两个问题要分开。词典标“线条性”特征是一件事，写不写规则是另一件事。用学习的方式，学down 和“线条性”subcat之间的搭配，机器学习方法是没问题的。
走半步，做词典里的subcat标注，另半步交给机器，这很正常。

李：
有理。不过，难点在发现“线条性”是一个值得标注的东西。假设人已经词典标注了，机器学习出这种条件，是自然的。还有一种就是不管3721，把几万个能想到的 features 都标注上，然后让学习自动筛选区别性特征，选出 top 1000 features 其他的舍去。然后，词典维护的负担就大大减轻了，只要把注意力集中在 1000 个最具区别性的概念上就可以了。其实 1000 以外的 features 也没啥概括性了，经验告诉我们舍弃没问题。反正后面还有 literal 做底。literal 做底的就是我们说的强搭配。

白:
语言学家灵感所至，想出一个“线条性”来，当然是一个好的启发。但是对于学习来说，有好的启发就很好了，不需要语言学家干脏活累活调规则。所有工作集中到词典，剩下的交给机器。

李：
HowNet 和 WordNet 里面都有万以上的 features，我们都知道者万以上的 features 其实只有千这个量级的子集最有意义。具体是哪些有意义，目前是拍脑袋。这个选取工作应该是机器来做的。

白:
作用还不仅此。
有时一词多义，不是所有义项都有“线条性”含义，比如thread作为“线程”解，其“线条性”就很弱。string当弦乐器解的时候其实已经没有线条性。所以一个外部条件，可以只和词典里一部分义项勾搭，把另一些义项冷落在一边，形成事实上的WSD，就通过subcat做。WSD和用搭配筛选固化结构，一石二鸟。义项支持结构，结构支持义项，形成正反馈。环形支持，不是单边支持。

李：
“WSD和用搭配筛选固化结构，一石二鸟。”
这就是我以前说过的，为什么 IE 可以绕过 WSD，因为 parse 基础上的 IE 语义落地根本不需要独立的 WSD 模块作为支持，因为 WSD 在 IE 过程中自然实现了。当然前提是 deep parsing 支持的 IE，而不是主流那种没有结构支持的 IE。一般而言，一个词有多义不可怕，可怕的是多义没有结构去制约。如果对于最终的语义落地，总是以 parsing 作为跳板的话，这种多义的困扰就自然消失了。

白:
当然，也少不了反悔
有subcat干扰的反悔总是比较艰难的。只需三省吾身。

李：
实际上，IE 落地不仅可以容忍词多义（WSD），也可以容忍结构歧义。因为到了 IE 的份上，domain 已经聚焦到要落地的语义。这时候，词驱动因为聚焦变得可行。因为词驱动变得对于歧义可以容忍而不失精度，这就是 deep parsing 是语义落地核武器的奥秘所在。

自然语言最让人困扰的问题是歧义性。恰恰在这个最困扰的地方，parsing + IE聚焦使得对于歧义可以免疫的词驱动的 approach 变得切实可行。不少人因为只知道 IE 是学出来的，不需要句法和结构，无法理解 parsing 的核武器性质。结果是今天的 IE 与明天的 IE 被看成是两个独立的任务，具有各自的知识瓶颈。但在 parsing + IE 的架构里面，这就不再是独立的任务了，而是80%+ 相交的任务了。说白了就是，结构不够（结构歧义）词来弥补，词不够（歧义）结构来制约。要恰好赶上词的多义与结构的多义在给定的IE语义落地任务中重合，并且这种重合影响到落地的质量，很不容易呢。换句话说，如果是两条腿走路，想出错都难，想质量不高都不容易。一条腿是结构，哪怕是歧义的结构。一条腿是词（nodes），哪怕是歧义的词（当然这词所代表的不仅仅是词，还有其上的 ontology）。两条腿走路踏空掉悬崖去的例子，学者研究过，不能说没有，但真实应用中完全不足为虑。

我的黄学长（Wilks 的门生黄秀铭）在他的 Prolog MT 的博士论文中特地举了这个两脚踏空的倒霉案例，为了彰显 Prolog 回溯消歧的本领：tough coach, 第一条腿是结构：定中关系，很幸运这条腿本身没有结构歧义。第二条腿是词义，两边都是常用词，义项比较多。结果是，加上了结构以后，还留下了两个语义相谐（ontologically appropriate）的可能性，不能完全WSD消歧: 1. 严厉的教练；2. 牢固的马车。原则上在这个 local 结构的 context 里面，这个罕见的多义案例是无解的，需要更大的上下文来消歧。要我说，拉倒吧，难得一错，认栽吧，否则不像是人造的 intelligence 呢。

白:
大数据说谁就是谁了
哪有那么纠结

李：
那倒是，不就是个 bigram 嘛。类似的例子如果远距离，不知道大数据会不会稀疏到不能定夺。譬如：The coach that has been there for years is known to be really tough.

白：
我理解WSD和分析器使用语义中间件是个动态递进的过程：随着分析的进展，原来远距离的会拉近，原来WSD的结论也会翻盘。

【相关】

《泥沙龙笔记：parsing 的休眠反悔机制》

【科普随笔：NLP主流最大的偏见，规则系统的手工性】

【语义计算：没有语言学的计算语言学，NLP的亚健康现状】

我:
大而言之，实词（对应概念）之间，只要发生句法关系，逻辑语义上就有个说法。
作为总原则去操作，句法标签总带着一个逻辑语义标签的做法，是有益无害的（最多是逻辑语义那边不增加新的信息，给个 dummy 的逻辑符号，assuming 句法标签对于语义落地足够了）。
但反过来，我们都知道，有不少逻辑语义是建立在没有句法直接联系的实词之间的所谓 hidden args, 语义中间件的主要任务就是挖掘出这些 hidden 的逻辑语义关系来。
还有一个突出的区别：对于句法 dependency，大体上要遵循一个老子的原则。而对于逻辑 dependency，这一条就废了：一个儿子有多个老子，对于逻辑是天经地义。因此这树形图也就变得诡异了。

白:
定语从句就是多个老子，用坑的话说，就是填一送一

我:
定语从句的老子儿子相互循环，直接对抗 acyclic 的天条，那是 DG（Dependency Grammar）的 formalism 引起的。DG 有一万个好，在这一点上还是露出了皮袍下面的“小”来。不过虽然君臣父子乱套，看上去挺窝心的，实际操作使用上也无大碍。要是单单为了这一点就采纳了叠床架屋的短语结构，不值得。

白:
我不认为树或者DAG是动不得的天条。语义那头已经是这样了，句法why not

我:
我无异议。不过多数语言学家和逻辑学家看不惯乱伦。

白:
而且我现在的填坑体系里根本就没有树。天生允许多爹，允许loop

我:
总得有个数据结构某种 internal representation 作为 output。我的老印搭档在实现这个 graph 的时候，遇到 loop，以前是 error，系统罢工。后来改成 warning。实践中我发现，这个 warning 对于 debug 还是有用的。遇到定语从句这种 loop by design 就忽略警告。但很多时候，那个警告帮助指出了多层规则系统的不合理之处。人的脑子蛮可怜，再有经验的语言学家，也看不过三步。因此在编码规则的时候，容易陷入局部思维。看到 warning 时候“回溯”，往往恍然大悟，原来全局上看，有些东西是不合理的，需要协调。

biao:
哥儿几个在这死磕语法似乎很难看到什么时候是出头之日。

liang:
据说，我们都是乘着“计算”这趟历史快车。跟着时代走。

白:
做股票可不是这样说哦，都是在讲“抄底”。

这要回归到一个老问题，状态机的学习。从非确定有限状态自动机到RNN只有半步之遥。从正则表达式直接编译到RNN的路径是畅通的。所以，规则和学习两条路都可以到达RNN。说得清的用规则，说不清的用学习，谁也不碍着谁。

我:
有数据的用学习缺乏数据的用规则。

另外说语法没有出头之日是小看了咱语言学家。等到 dl 先打败我的 parser 再说不迟。想起奥巴马与希拉里当年党内初选，希拉里老说奥巴马做副手不赖，可以与她搭档跟共和党竞争。奥巴马笑说，你一路输在我后面，说什么呢？当然，这些与潮流相左的话没人当真。一律当成妄人或民科的鼓噪而已。好在在应用现场，最终还是系统说话。

白:
对标注来说，上量，和自洽，是同一个问题的两面。

我:
我信服dl的power 但文本标注和domain化的挑战貌似没看到根本的突破。知识瓶颈 kills a cat。

白:
对我来说不存在两条路线竞赛不竞赛的问题。那个东西该长什么样是更重要的，这点一旦定下来，怎么弄成那个样都行。比如说，肯定不是树。所以树库再庞大也那个。

我:
端对端的理念是不要那个：不要结构不要语言学。

白:
那只是表象只不过把问题转化为中间黑盒子长什么样而已。

我:
问题是结构的目的是帮助克服domain化的挑战。没有结构每一个nlp的应用就是一个独立的问题，就需要无止境的带标数据，到哪里去克服这个知识瓶颈呢？一千个应用需要一千种带标大数据。在我这里不需要因为结构化了；我只要少量的数据样本让我知道任务的定义即可。专家天生懂得举一反n，谁叫我们是人呢，linguists，domain specialists ......

白:
这真的是表象
因为黑盒子不是仅仅学习可以得到，对规则进行编译一样可以得到。

张:
李白的discourse 省略太多，求Wei的分析

白:
所以关键是黑盒子本质上有没有容纳结构的能力。黑盒子长的模样不对，容纳结构就不力。之所以一任务一标注一训练，是因为不了解黑盒子容纳结构的通用能力。
也是因为这样拆分有利于持续发论文

wang:
白老师今天高见，“句法关系不是树结构”，领教！但是常规大部分句子，用树结构表示还是可以表达清晰的，除非您列出的那些刁钻的句子。我认为那是语言学家功课还没有做好，这些看似异类的句子（其实是现实合法句），还没有语言学家的事先分类归属。我现在的观点（也许以后会改变），句法处理过程中可以不是树结构，但是最终结果还是树，而不能出现环。

我:
无所谓啦。
社会网络里任何人都可能与任何人发生关系何况语词？

白:
环必然会出现
定语从句在汉语中是个并非偏门的表达方式

我:
他喜欢的女孩
什么女孩？ -- 他喜欢的女孩
他喜欢谁？ -- 喜欢女孩

白:
这个环形结构的思想大约在1998年就形成了。当时是在范畴语法的框架内表述的。后来一直想把范畴语法发展成可用的mechanism，遇到诸多困难。最近几年才转向，把当时的一些精华嫁接到依存语法中来，弄了个不伦不类没名没姓的坑论。@梁 @赵都有涉及这项工作。看到伟哥也果断打破树结构，拥抱俩爹，非常欣慰！董老师的框架，箭头方向和我一致：萝卜指向坑，修饰语指向被修饰语。伟哥的方向，随依存体系，反过来的。伟哥省略了小词。而在我的坑论里面，小词负载很重要的结构，“的”是构成定语从句环路的最核心节点。不仅“的”，像“地”、“得”之类也负载结构，也挖坑。

wang:
看来白老师这是深酿多年的酒了哈！希望这理论能取个好听的名字。更希望早日形成系统，发挥应有力量。

我:
【坑论】，不蛮好？
环形不明白的问利鹏。他自从解雇了小蜜就聘了自家领导做手下，并与新手下约法三章：一切服从领导。

我:
说到填坑，HPSG 里面有个说法：
对于 args，当然是 head 挖坑（subcat），期待（expect）那些 args 填坑。是 head 找 args。但对于 mods，一切反过来，不是 head 找 mods，而是 mod 去找 head。
所以对于词例化的 HPSG，修饰语的词也挖坑挖的是让 head 去填的坑。

刘:
为什么mod不能做head而把动词作为arg？

我:
但实际上我自己在 parsing 的操作中，两条路线都走过：
做过 head 去找 mods，大不了多几层，或来个循环。也做过 mod 去找 head。

mod 做 head 从语义表达上，是本末倒置吧，至少人看着不舒服。真要做，也可以做，可是 mods 是数量不定的，除非是短语结构，一层一层嵌套上来，让最远的 mod 做总 head。否则怎么表达多 mods 对于同一个 vp 的填坑要求呢？能想到的办法就是让同一个 vp 或 s 可以有 n 个 mods 的老子（说的是依存关系的表达）。总之，一般认为还是谓词做 head，既做 args 的 head 构成 arg structure 作为语义核心，也做 mods 的 head，表达边缘的语义（修饰限定）。

白:
这里有模糊地带。
比如，马上种树，必然种树，肯定种树，会种树。
副词和情态动词的边界情态动词就被认为是动词填情态动词的坑

范畴语法就是mod做head，比如形容词是n/n，你喂给它一个n，它吐给你一个经过修饰了的n。

副词就被认为是给核心动词戴帽子。我曾经坚持了很长时间喂一个吐一个的思路处理修饰词，后来证明有害无益。后来把方向扭过来了。

我:
喂一个吐一个的做法早早年我导师刘老师就是这么做的。所谓名词组抱团（就是我们说的 chunking），就是从head N 开始往左一个一个的吃。情态动词与副词有类似也有不同。说情态动词是 head 后面的动词是 dependent，这种处理有其优点。主要是情态动词与主语往往有一致关系，而且也常带有谓语的时体信息
但副词不同，让副词做 head 就有些反客为主了。

白:
这个地方是范畴语法和依存语法的重大差别。

我:
情态动词与后面动词，谁主谁副，很有说头。从句法上，情态动词做主，因为上述理由，最合适。从语义上（谓词的ontology），当然是后面的动词，因为情态动词是功能词，反映的是语法意义，概念意义很虚。当主语与谓语需要check语法上的一致关系的时候，应该 check 情态动词。而当主语与谓语需要 check 语义一致关系（最典型的是主谓搭配关系）的时候，就必须 check 后面的动词。这是两个矛盾的要求。一般都在一个体系内部协调解决，确保情态动词与后面动词的 acessibility，适应不同的需求。

有时候想，白老师这个群里交流的这些体会、经验、理论和实践，算不算 CL 和 NLP 方面的学问呢？要说是学问吧，好像这种学问没处发表。（语言学的刊物那边或许有一些 room，但掌管语言学的学者，对语义计算好奇多于了解和欣赏。）计算语言学这边吧，一律的学习啊学习，或者深度啊神经，根本没人拿这个学问当回事儿，或者也听不懂。

这真是一个有意思的怪象。
所以我说岂止是隔行如隔山同行也隔山。锤子不同，虽然做的是同一个事儿，也还是隔锤如隔山。白老师这样两边都不隔的，绝对是熊猫。

这种亚健康状态，终有一天会被领域认识到。

【相关】

科学网—计算语言学的尴尬

【语义计算：从神经机器翻译谈起】

【科普随笔：NLP主流的傲慢与偏见】

【NLP主流的反思：Church – 钟摆摆得太远（1）：历史回顾】

【Church – 钟摆摆得太远（5）：现状与结论】

【从V个P到抓取邮电地址看 clear patterns 如何抵御 sparse data】

从前几天的例子：V个P （挣个毛、挣个求、挣个妹，等）
P={P，屁，头，鸟，吊，jiba，妹，鬼，......}
可以看到，小数据为依据的规则系统，有时候比大数据训练的系统，可能更加有效：更精准，更能对抗 sparse data 从而提高 recall（具有 clear patterns 性质的语言现象，可以一网打尽，完全没有 sparse data 的困扰），模拟语言现象更加直接，因此也更加容易debug和维护。

在 IE 历史上，直到 MUC-7，当时表现最牛的 NE 系统 NetOwl 就是基于 pattern rules 的，几乎所有的统计对手都拿它作为拼杀的对象。NetOwl 从 SRA spinoff 出去想以 NE 为技术基础，进行商业运作，一开始在分类广告业拿下了一些业务，终究不能持续赚钱，后来被 SRA 收回，逐渐销声匿迹了。后来追随潮流，系统里面也混杂了机器学习的模块。

从此在学界就再也见不到规则系统了，哪怕是对于规则非常适用的某些 NE 任务：譬如时间，数量结构，等。可见潮流之厉害，貌似所向披靡。但事物的本质和本性并没有改变，对于自然语言中的具有 clear patterns 的现象，依据小数据，经过人脑的归纳，行数据驱动去开发规则系统，仍然是如上述高效而高质量：工业界默默实行的人、团队和系统并不鲜见，只不过大家心知肚明，只做不说而已。

相对应，发动群众去标注大数据，然后用大数据训练一个系统如何？这是主流的默认、honored 的方法。如果数据足够大，其质量的确可以接近或匹敌规则系统。当数据量不理想的时候，就捉襟见肘了：或者 underkill （由于 sparse data，漏掉很多统计性稍弱的变体）伤害 recall，或者 overkill （smoothing 过度，把不该抓的现象抓进），影响了precision。

什么叫有 clear pattern 的语言现象呢？举个例子，抓取邮政地址，这个工作我自己作为一个 fun project 做过。美国地址大体是门牌、街道、城市、州、邮政编码，最后是国名，patterns 相当地 clear，可你可能无法想象上述 pattern 的构件变体之多，有些变体绝对是 long tails，再大的数据量也难涵盖其组合爆炸的本性。

如果你收集了一个巨大的美国地址库作为训练集（大数据），你完全可以设计一个学习系统来做这件事儿。而另一边，虽然也是 data driven，但只需要小数据样本，然后经过人的大脑去举一反三进行开发，最后到 raw data 的大数据中去验证反馈。可以拍胸脯的是，后一种办法做出来的系统绝对是高质量易维护，几乎天生地具有 sparse data 的免疫性。

云:
@wei ，地址parsing属于reg expressions就能搞定的事，我们大数据分析经常要做的事。这个和NLP没有多大的关系。这是一个context free的grammer，相对简单。

我: finite state，是 regex 就搞定，但不少人还是训练。这是其一。
其二是，自然语言复杂性比起相对简单的地址识别，不过是多了几层而已。都可以 finite. 譬如，subcat 说需要主语、宾语，还要一个宾语补足语，这与地址说需要一个街名、城市名和州名，也差不多。

云:
不一样的，
1. 街名
2. 城市
3. 州名
各自独立，互不依赖。
而主谓宾相互有上下文关系

我:
比喻都是跛脚的。anyway 二者都是 finite 装置可以搞定。地址由于其组件的独立性，利用 macros 调用，可以一层搞定，也可以不利用 macros 多层搞定。NL 通常要多层 finite 装置搞定。

其实我要说的是，自然语言看上去千丝万缕，复杂无比，但本性上、大面上是背后具有 clear patterns 的 monster。为什么自然语言有 clear patterns （所谓句法）在背后？乔姆斯基归结为 UG，是从娘胎里带出来的。有意思的是，语言学家看自然语言，看到的是章法，甭管这个章法多么地扑朔迷离。而没多少语言学训练的NLP工作者，往往看到的是一团纠缠不清的迷雾。

【相关】

【语义计算：从神经机器翻译谈起】

我:
机器翻译所蕴含的厚重和神圣，在新一代是不可理解的

刚入行的时候做的是外汉机器翻译，一直不大敢碰汉外，原因是汉语语法不好形式化，感觉太难了，当时想，这辈子怕都没指望了。现如今，汉语语法还真没有见到多少大规模形式化能实用的，按照以前的路子，那汉外机器翻译必然寸步难行，因为汉语分析是前提，然后才是转换和生成。

可谁能想到，机器学习越来越牛，人工翻译的双语资料作为人类活动的副产品，几乎“天然地”源源不断而来，这就成就了深度神经机器翻译。什么分析，什么生成，统统绕过去，端对端直接施行转换。Google Translate 因此可以在同一个模型架构下，支持几十种语言的互译。这简直就是神迹。可却是技术的事实。尤其不可思议的是，以前认为最难的的汉外翻译，反而进步最大（至少汉英是如此）。译文再不济，也给你个大概齐，不仅立等可取，而且还完全免费。比你学两年外语，带上词典死磕还一头雾水要强多少。除了天堂，天下哪里有这样的美事？

机器翻译（MT）是自然语言处理（NLP）领域历史最悠久的应用方向，从上个世纪50年代初发轫，承载了中外几代不知道多少人的青春和梦想，也包括青年时代的立委。如今，梦想化为现实，嵌入式机器翻译在互联网无孔不入，已经成为普罗大众手中招之即来挥之即去的便捷工具，每时每刻在默默服务着千百万互联网用户。女儿学汉语用它，学西班牙语用它，去日本动漫网页也用它，用到对它熟视无睹，把机器翻译视为理所当然。只在翻译错得离谱的时候才意识到它的存在，不时报以嘲讽：真笨。可机器翻译呢，谦谦君子，玉树临风，虚怀若谷，无怨无悔。对于已经天然成为女儿这代人生活一部分的机器翻译，我满腹机器翻译的历史和掌故，却不知如何给她诉说。耳濡目染，她从我断续的话语中似乎隐隐觉得机器翻译对于她父亲的一生具有特别的意义，可是我还是无法象对同辈人那样娓娓道来，如数家珍，传达出我内心深处的机器翻译所蕴含的那份厚重和神圣。不仅仅是代沟，是技术的跨越式发展造成了两代人迥然不同的视角，令人感慨。 from 【机器翻译万岁】

刘:
@wei 深有同感。科学技术的发展真是出人意料，做梦也想不到机器翻译能到现在这个程度。我一个刚入门不久的学生跑NMT，轻松超过Moses十几个点，仅几年前，这还是天方夜谭，要是超出Moses五个点绝对可以发最高等级的论文、拿博士学位了。
而且现在用现有的深度学习工具编NMT程序，代码量跟SMT相比都很小，不像写一个SMT程序，要花大量时间处理小的细节。深度学习的工具本身太强大了。同一套工具，稍加修改，既可以做机器翻译，也可以做语言识别、图像识别。
深度学习并没有解决所有问题，但为我们解决一些难题提供了全新的框架，带来了新的希望，潜力还远远没有挖掘完，这给我们这些搞研究的也带来了巨大的机会

我:
很羡慕ing @刘那天与讯飞的院长谈这事儿，他也是超级兴奋，说以前以为大约四五年会有全方位的大突破，神经在大系统大应用上全面开花。现在他确信只要2-3年就可以了，到时候很多事情会超出我们的想象。他是这样描述的，非常由衷。感觉是作为一线领航者，他看到一种排山倒海的科学潜力正在转化为技术力量，面对巨大机会忍不住激动。这很感染人。这种心态我可以体会。

biao:
@wei 所以，哥儿几个在这死磕语法似乎很难看到什么时候是出头之日。

科大讯飞的确有过人之处。起码它的语音输入可以让你节约大量时间。
前几天有人在这里抱怨说输入码字太累。实际上现在语音输入完全可以帮助你非常轻松的输入，而且效果很好。
上面这两段话完全是讯飞语音输入的。一个字没有改，十几秒钟搞定，非常轻松。

刘:
我不敢预测哪些问题能解决哪些不能，但总体的进步是可预期的

我:
说语法没有出头之日是小看了咱语言学家等到dl打败我的 parser 再说不迟。
事实是迄今全世界最牛的 dl syntaxnet 仍然是我手下败将。
另一个事实是迄今没有sentiment系统在 open domain social media 这个几乎最难的 space，能赶上我们。Not even close ：the margin is almost 20 percentage points apart

所以我跟讯飞院长说你我是同一类人。不过你在舞台中央我在野。但是论信心和对nlp的展望心态和世界观惊人的一致。要不咱们互补、合作、合流，要不咱们就来个友谊赛，我就不自量力一哈。反正论年龄我输得起你们输不起 =）
（我输了就钓鱼去乐见ai一统天下于dl if they truly deliver as well as nmt did
可是 nmt 有data 而大多数 nlp 没有那么多clean labeled data 啊）

biao:
语法分析最大的问题是不灵活。鲜活的语言千变万化。一句话稍微变个说法，语法分析就抓狂了。

我:
根本不是这回事你的理解有误

白:
死守固定语序才这样但语法分析死守固定语序已经是老黄历了
你变个说法给伟哥试试他会告诉你一个robust的句法分析器能做到什么

从“计算”角度说，黑盒子容纳结构的能力是最本质的。从“语言”角度说，结构应该长什么样，比其他的事情更值得关注。
两栖人

biao:
先分析一个名句：
”其为人也孝悌而好犯上者鲜矣。”

我:
如果变个说法语法就抓狂要这劳什子干嘛。语法的目的不就是为了对付变体吗

白:
大战风车，其乐无穷

我:
你弄句文言做啥？这个 sublanguage 里面没钱，开发他有卵用。
“卵” 属于 P 系列：是现代汉语口语的脏字否定限定词，== fucking no，社会媒体口语的这个 sublanguage 我们倒是对付了，不妨试试。

biao:
你的机器怎么知道它是文言文，半文言文，还是白话文？他们都是中文。

我:
不在一个频道算了

biao:
“工欲善其事，必先利其器”。这是文言文还是白话文？大量的成语是文言文还是白话文？金庸的小说是文言文还是白话文？四大名著，是文言文还是白话文？鲁迅的文章是文言文还是白话文？
这些都是在现实生活中大量遇到的语言素材。绕是绕不开的。

白:
高频小体量，适合死记硬背。文言文句法上并不比白话文更难处理，某种程度上还容易。文言文没有白话文里那种NP、VP串烧。有词类活用，但有规律可循。

我:
文言文长句相对少。排比平行用法普遍也是形式痕迹。还有些非常固定的文言句式用到特定的文言虚字可以借力。等退休以后玩玩文言文应该是一个不错 time killer。文言词汇量大大减小，字基本就是词，但每个字的用法包括活用或引申用法就多一些。

白:
关键看WSD一选出错率会不会增大？

我:
有不小比例的wsd，等价于pos，pos搞定就搞定：老吾老。及物动词的“老”是一个活用义项，词典可以绑架为“尊崇”、“孝顺”之列，与作为形容词的“老（old）”的本义，以及作为名词的“老（the old，senior，parents）”都不同。
文言处理也少了切词错误的干扰基本没可切之词。字驱动的路子，有很多字典工作可做

白:
有些歧义是简化字造成，之前古籍并无。比如后，简化之前就有这个字，就是皇后的意思。以后的后，之前是“後”。做pos也好wsd也好，要考虑文本的基准。

我:
所谓更多的活用，可以在字典假想如果处于某种活用，它义项是什么，然后绑架，倒也便利。另外，现代汉语对虚词的省略似乎大于文言中虚字的省略，这也是文言处理的便利，虚字的频繁使用，给确定句子成分的边界创造了条件。

weidong:
娱乐一下：陈亢问于伯鱼曰子亦有异闻乎对曰未也尝独立鲤趋而过庭曰学诗乎对曰未也不学诗无以言鲤退而学诗他日又独立鲤趋而过庭曰学礼乎对曰未也不学礼无以立鲤退而学礼闻斯二者陈亢退而喜曰问一得三闻诗闻礼又闻君子之远其子也
标点断句先

我:
试了一下我的 parser，满篇都是 Next ；=）

weidong:
没有引号连话到哪儿结束都猜半天

我:
索性也试试前面要求的测试

其为人Next 也孝悌，而好犯上者 Next 鲜矣。
哈

以前学美国之音英语900句，都说有900句，英语的基本句型就搞定了。这些年，我都 unit tested 近两万句了。是不是差不多该搞定了？最近翻阅以前内部论坛的帖子，有这么一贴，好玩:

池子里说说无妨，万一明年中文核弹爆了，你们可以作证立委就是钱学森。
作者: 立委 (*)
日期: 2012/04/18 23:13:13
不说的话，将来被代笔，说中文核弹不是我的作品，找个旁证都找不到。

换句话说，各路身怀绝技的侠客剑法可能不同，但有个共识：就是我们面临技术核弹大爆炸的前夕。至于AI泡沫，那是商业上的炒作，技术的发展与成熟只是给了它一个炒作的话题而已。

【相关】

【机器翻译万岁】

【语义计算：没有语言学的计算语言学，NLP的亚健康现状】

【语言学家妄论深度学习和AI，旨在 invite questions】

与董老师调侃AI泡沫，不过泡沫归泡沫，这次ai热让我们看清了几点：

第一是大数据里面有名堂不全是虚的。

第二是长远一点看 ai 和 nlp 在领域里可以解决实际问题
譬如我们做的客户情报产品虽然发现市场没有预想的那么大但价值是确认了

第三是深度神经是技术突破真东西虽然目前被神话了。至少在 nmt 中我们看到了以前达不到的质量。语音方面已经提升了整个产业的水平。

第四是 nlp 与大数据结合让我们看到很多可能。虽然并不是每一种可能都可以满足某种社会刚需但nlp大规模实用的大门已经开启就看谁的市场角度对路了。

有一位风头正健冲在世界最前沿的深度学习大牛好友，看了我最新的博文【如何自动识别同一个意思千变万化的表达】, 回说：李老师你还没有理解深度学习啊，深度学习做这件事儿（识别一个 statement 的千变万化的语言表达）其实比较简单。

我不懂深度学习，那是肯定的。说这件事儿很简单，我有点存疑。至少目前所有做 bots 和问答系统的人，都在 fight 这个挑战，不能说已经完美解决。当然，Siri 这类显示了在 apps 上的应用，令人印象深刻。

Anyway，我的回答是，我们属于同类，心态和世界观是一样的。手里有把得心应手的锤子，世界就变成了钉子。区别只是锤子的不同，我不懂你的锤子，你也未必使得了我的锤子。术业有专攻，隔锤如隔山。但我确认，我的锤子可以对付这个钉子。

咱们还是来个友谊赛吧，否则这个世界多么单调。

无监督学习除了 clustering 在某些特定场景可以得到应用外，基本还是 research 的探索性质吧，没人指望它能大规模应用。clustering 到 classification 还有不小的距离，总得有某种监督或人参与才靠谱吧。那天我说，学习界啥时把机器放到raw data 的语言大海里，机器就跟小孩一样学会了语言，那才是牛逼翻天了。否则的话，你有你的知识瓶颈（巨量带标数据），我有我的知识瓶颈（专家经验），谁的瓶颈更大难说着呢。

深度神经学习前，semi-supervised 的研究很热。至少从研究角度，那个领域是令人兴奋和期待的。说的是以最少的监督（种子啥的少量带标数据，或者人工的规则做引子），结合 raw data 去试图引导系统按照指定的方向做事儿。听上去在轨道上，至少不是所谓完全的无监督那种让人觉得不靠谱。还有就是白老师的语义计算主张，不必用带标数据，但要用丰富的词典信息，结合 raw data 做 parsing，也用到深度学习模型RNN啥的，听上去也是可行的。这是因为词典信息里面已经隐含了深入的人工监督（语言学知识和用法），各种 expectations 譬如 subcat，然后到大数据里面去定位。

微博上有人问除了图像和语音，文本NLP方面，深度学习有突破吗？我的记忆中，至少n月前，相当普遍有说，深度神经在文本遭遇瓶颈（by 看到瓶子有一半是空的人），或文本有待突破（by 看到瓶子有一半是满的人）。由于DL乐观主义流行加上全世界的CL牛人都憋足了劲儿地攻关努力，据说最近收到的答案是：文本也很突破了。

于是我正面反面各问了一下，拷贝于下，在此一并求教方家：

谁能给个神经在文本NLP中突破的清单就好了, 看 so far 到底哪些是真突破，哪些仍是瓶颈？先起个头，突破似乎表现在：

1 NMT，例如谷歌翻译，特别是中到英，的确突破性发展了（百度声称更早神经了，但翻译质量远不如谷歌NMT令人印象深刻，虽然在前神经时代，百度的中文方面的SMT比谷歌强）；

2 SyntaxNet 至少在新闻正规文本上，parsing 比前突破了，已经达到 94%，虽然离应用还远，虽然不是声称的世界第一

关于神经在文本NLP上的瓶颈或缺陷也抛块砖：

1. 迄今的突破都是 supervised 的，倚赖的是 insatiable 的巨量带标数据：带标数据于是成为知识瓶颈；

2 对于众多领域和文体，神经系统基本没有适应性，除非假设有海量领域数据可以重新训练成功；

3 几乎所有 unsupervised 尝试都是研究性质，离应用还远；

4 模型庞大带来的costs：训练和运行对计算资源的高要求

5. 迄今的端对端系统的神经应用，未见用到语言结构或理解，隐含层里的葫芦据说人也解不透；

6. 貌似黑箱子，有说 debug 不易（统计模型黑箱子不易debug的毛病以前是公认的痛点，不过最近有深度学习大牛一再强调，这个箱子一点也不黑，debug 也容易，此瓶颈存疑）；

端对端除了 NMT，还有哪些投入大规模应用的文本处理系统？似乎还在探索中，成熟的不多。在IE和QA领域，不久应该会有某种突破，因为这两个领域的系统基本是端对端，只要somehow（人海战术？）得到了大量的带标数据，突破是可以期待的。不过，在这些方面，高明的规则系统已经有了很好很快的解决方案。不信，可以到时候拉出来遛遛。

【相关】

It is untrue that Google SyntaxNet is the "world's most accurate parser ...

【李白对话录之八：有语义落地直通车的parser才是核武器】

【谷歌NMT，见证奇迹的时刻】

【泥沙龙笔记：语法工程派与统计学习派的总结】

【新智元笔记：两条路线上的NLP数据制导】

《立委随笔：语言自动分析的两个路子》

Comparison of Pros and Cons of Two NLP Approaches

From IBM's Jeopardy robot, Apple's Siri, to the new Google Translate

Latest Headline News: Samsung acquires Viv, a next-gen AI assistant built by the creators of Apple's Siri.

Wei:
Some people are just smart, or shrewd, more than we can imagine. I am talking about Fathers of Siri, who have been so successful with their technology that they managed to sell the same type of technology twice, both at astronomical prices, and both to the giants in the mobile and IT industry. What is more amazing is, the companies they sold their tech-assets to are direct competitors. How did that happen? How "nice" this world is, to a really really smart technologist with sharp business in mind.

What is more stunning is the fact that, Siri and the like so far are regarded more as toys than must-carry tools, intended at least for now to satisfy more curiosity than to meet the rigid demand of the market. The most surprising is that the technology behind Siri is not unreachable rocket science by nature, similar technology and a similar level of performance are starting to surface from numerous teams or companies, big or small.

I am a tech guy myself, loving gadgets, always watching for new technology breakthrough. To my mind, something in the world is sheer amazing, taking us in awe, for example, the wonder of smartphones when the iPhone first came out. But some other things in the tech world do not make us admire or wonder that much, although they may have left a deep footprint in history. For example, the question answering machine made by IBM Watson Lab in winning Jeopardy. They made it into the computer history exhibition as a major AI milestone. More recently, the iPhone Siri, which Apple managed to put into hands of millions of people first time for seemingly live man-machine interaction. Beyond that accomplishment, there is no magic or miracle that surprises me. I have the feel of "seeing through" these tools, both the IBM answering robot type depending on big data and Apple's intelligent agent Siri depending on domain apps (plus a flavor of AI chatbot tricks).

Chek: @ Wei I bet the experts in rocket technology will not be impressed that much by SpaceX either,

Wei: Right, this is because we are in the same field, what appears magical to the outside world can hardly win an insider's heart, who might think that given a chance, they could do the same trick or better.

The Watson answering system can well be regarded as a milestone in engineering for massive, parallel big data processing, not striking us as an AI breakthrough. what shines in terms of engineering accomplishment is that all this happened before the big data age when all the infrastructures for indexing, storing and retrieving big data in the cloud are widely adopted. In this regard, IBM is indeed the first to run ahead of the trend, with the ability to put a farm of servers in working for the QA engine to be deployed onto massive data. But from true AI perspective, neither the Watson robot nor the Siri assistant can be compared with the more-recent launch of the new Google Translate based on neural networks. So far I have tested using this monster to help translate three Chinese blogs of mine (including this one in making), I have to say that I have been thrown away by what I see. As a seasoned NLP practitioner who started MT training 30 years ago, I am still in disbelief before this wonder of the technology showcase.

Chen: wow, how so?

Wei: What can I say? It has exceeded my imagination limit for all my dreams of what MT can be and should be since I entered this field many years ago. While testing, I only needed to do limited post-editing to make the following Chinese blogs of mine presentable and readable in English, a language with no kinship whatsoever with the source language Chinese.

Question answering of the past and present

Introduction to NLP Architecture

Hong: Wei seemed frightened by his own shadow.Chen:

Chen: The effect is that impressive?

Wei: Yes. Before the deep neural-nerve age, I also tested and tried to use SMT for the same job, having tried both Google Translate and Baidu MT, there is just no comparison with this new launch based on technology breakthrough. If you hit their sweet spot, if your data to translate are close to the data they have trained the system on, Google Translate can save you at least 80% of the manual work. 80% of the time, it comes so smooth that there is hardly a need for post-editing. There are errors or crazy things going on less than 20% of the translated crap, but who cares? I can focus on that part and get my work done way more efficiently than before. The most important thing is, SMT before deep learning rendered a text hardly readable no matter how good a temper I have. It was unbearable to work with. Now with this breakthrough in training the model based on sentence instead of words and phrase, the translation magically sounds fairly fluent now.

It is said that they are good a news genre, IT and technology articles, which they have abundant training data. The legal domain is said to be good too. Other domains, spoken language, online chats, literary works, etc., remain a challenge to them as there does not seem to have sufficient data available yet.

Chen: Yes, it all depends on how large and good the bilingual corpora are.

Wei: That is true. SMT stands on the shoulder of thousands of professional translators and their works. An ordinary individual's head simply has no way in digesting this much linguistic and translation knowledge to compete with a machine in efficiency and consistency, eventually in quality as well.

Chen: Google's major contribution is to explore and exploit the existence of huge human knowledge, including search, anchor text is the core.

Ma: I very much admire IBM's Watson, and I would not dare to think it possible to make such an answering robot back in 2007.

Wei: But the underlying algorithm does not strike as a breakthrough. They were lucky in targeting the mass media Jeopardy TV show to hit the world. The Jeopardy quiz is, in essence, to push human brain's memory to its extreme, it is largely a memorization test, not a true intelligence test by nature. For memorization, a human has no way in competing with a machine, not even close. The vast majority of quiz questions are so-called factoid questions in the QA area, asking about things like who did what when and where, a very tractable task. Factoid QA depends mainly on Named Entity technology which was mature long ago, coupled with the tractable task of question parsing for identifying its asking point, and the backend support from IR, a well studied and practised area for over 2 decades now. Another benefit in this task is that most knowledge questions asked in the test involve standard answers with huge redundancy in the text archive expressed in various ways of expressions, some of which are bound to correspond to the way question is asked closely. All these factors contribute to IBM's huge success in its almost mesmerizing performance in the historical event. The bottom line is, shortly after the 1999 open domain QA was officially born with the first TREC QA track, the technology from the core engine has been researched well and verified for factoid questions given a large corpus as a knowledge source. The rest is just how to operate such a project in a big engineering platform and how to fine-tune it to adapt to the Jeopardy-style scenario for best effects in the competition. Really no magic whatsoever.

Google Translated from【泥沙龙笔记：从三星购买Siri之父的二次创业技术谈起】, with post-editing by the author himself.

Introduction to NLP Architecture

Newest GNMT: time to witness the miracle of Google Translate

Dr Li’s NLP Blog in English

《立委科普：NLP系统语义模块的任务》

本篇旨在探讨NLP（Natural Language Processing）语义模块的任务，尤其在知识图谱应用中。探讨之前，我们先站在万米高看俯瞰一下语义模块在语言学和NLP的主要模块的架构中位于何处。

语言学的教科书通常把语言文本研究从浅入深划分为这么几个分支：词法（morphology）、句法（syntax）、语义（semantics）和语用（pragmatics）。还有另一个维度的分支，叫篇章研究（discourse study），是跨句进行，其他的研究一般限于句内。词法句法的研究成果在 NLP 中表现为 parser，可以自动把线性字符串的语句分析为句法树结构，千变万化的语句因此化为有限的句型或 patterns，为语言理解和应用提供了坚实的基础。语义处于句法之后、语用之前，我们叫它为语义中间件 (middleware)，因为它是领域独立的语言研究的终点，支持的是依赖领域和应用的语用。这个语义中间件的任务也可以留到语用阶段在语义落地（semantic grounding）的时候根据语用对语义的要求来一起做，但是理论上，总有一部分语义工作有足够的领域独立性，值得提前做好，来支持种种不同的语用场景和应用，减轻语用模块的负担。

如此定义的语义模块（语义中间件），主要是寻找 hidden links，譬如隐含的逻辑主语、宾语等。这些在句法阶段没有显性表明，但是有足够证据去确定如何填补。填补的时候，一个是利用句法（显性的links），一个是利用 ontology，通常是二者的结合。词驱动（word-driven）来做，是一个很 tractable 的任务，是比parsing更琐碎但难度较低的工作，因为要结构有结构，要ontology有ontology（包括动态形成的ontology节点，譬如NE专名的分类），条件比纯句法分析模块只有线性的pattern可用，是成熟多了。其有用性还是不太清晰： argument 之一就是，如果 hidden 的语义重要，人为什么不用显性句法手段？即便在一个句子的选定的句法结构中，某个重要的语义难以显性表达，如果足够重要，人就会换一种句法结构在另一个句子显性表达出来。如果上述 argument 有一定的道理，那么不做 hidden 语义，对于大数据挖掘，应该不会有太大的损害。至少在大数据挖掘这样的场景，信息的冗余性足以弥补个体 hidden 语义的不全。在句法结束的时候，有些句子提到的 arg（s）并没有到位，可以说是不饱和（unsaturated）。语义中间件的任务就是把句法没有做全的不饱和的坑填得饱和，hidden links 建立了，于是就饱和了。如果句法模块和语义模块以后，仍然不饱和，就应该在 discourse 中去找。如果 discourse 中还是没找到，那么理论上是应该通过常识去饱和它。

回到万米高空俯瞰，昨天还在想所谓“语义计算”到底包含哪些呢。从 community 来看，相关的方面有：（1）WSD（Word Sense Disambiguation）; (2) FrameNet (role labeling); (3) IE（Information Extraction）。“经典”IE （MUC IE 传统）里面一般分 NE、relationship、event，外加 Coreference，等任务。从结构图的角度看，NE 和 WSD 是做 node 的语义计算；FrameNet 和 IE Template （for relationship or event）是做 arc （link）的语义计算。这样来看 community 定义的几个任务和方向，可以发现，（1）和（2）都是学究式的任务，不实用。（3）是最接地气的东西，是应用（apps）直接需要的。但是 IE 是针对领域的，直接为产品服务的，不好抽象，那么就可以想想什么东西是句法之后，语用之前，最能帮助 IE。其中之一就是 Coreference，这个任务已经被 IE 收编了，但它实际上是独立于领域的篇章（discourse）尺度的语义计算，是为了支持 IE 的跨句整合的。

沿着这个思路，我们还可以细化，根据实际需求，我们定义过三个任务，觉得应该在语义中间件里面做，它们应该可以惠及所有的应用：第一个是同位语关系，这个可以看成是 Corference之一种；第二个是部分和整体的关系（譬如，苹果和iPhone）；第三个原因和结果的关系。上述三个关系不限于句法短距离，也包括远距离的，甚至跨句的这类联系。我们一直在这三个关系，加上代词的coreference (包括专名的 aliasing) 上下功夫，比在 hidden 逻辑主谓宾方面更多，因为前者直接服务于 local IE 以后的 IE，以便整合成图谱，是整合的粘合剂，后者大多可以通过信息冗余去做弥补。

以上说的是实践中摸索出来的体会，就是自然而然这么走下来的。local IE 在抓取信息填 IE Template 里面的坑的时候，所看到的都是局部的信息，所填坑的材料经常很“虚”。虚的极端例子就是代词（“它”，“这个”），或者一些指代性的名词（“这台电脑”），这些东西只能作为桥梁，不能真正导致图谱。这时候语义模块在上述四个方面所做的工作，就可以帮助把这些虚的材料，变得实在，这是通向图谱的一个很重要的支持。

大而言之，语义中间件做到什么程度合适，有很大的争论空间。在确定应用之前，不少细线条语义进一步伸展没有太大意义，或者劳而少功。就是说在句法把结构的框架搭起来以后，在语用层面的具体应用确定之前，到底要做多少语义计算，不是容易说清楚的，直觉上和经验上，不赞成做得太多。从某种意义上看，费尔默创立 FrameNet 就是想把语义中间件进行到底。理论上，他的深入是有道理的，因为在 arg structure （句法subcat的拿手好戏）之后，如果要深入，domain independent 的 Frame hierarchy 是通向语用的深度桥梁。起码理论上如此。但是我们做了18年的 IE 以后，结论是，费尔默那个语义计算的路子基本是歧途。没感觉到啥好处，却带来了很大的 overhead，可操作性很差，也并不省功。IE 领域用 Template 定义语用领域的需求，没有人主张把这些 Templates 定义在 FrameNet 的 hierarchy 上面，因为感觉不到需要，而且也不现实。100 年后，也许 FrameNet 可以被重新发现，因为那时候的语用落地已经太多了，需要组织组织了。FrameNet 正好提供了一个组织和整合的框架，如今的语用落地都是零星的。

在立委牌 NLP University 中，能看懂上面这些参杂了些假洋鬼子话（术语）的“高阶科普”的后学，是可以授予学位的。这个学位是硬通货。看不懂也没关系，可以视为狂人乱语，或者是误入迷宫，不隔行也如山，耽误了你玩深度学习（dl）的宝贵时间。

【相关】

【NLP 迷思之四：词义消歧（WSD）是NLP应用的瓶颈】

《泥沙龙铿锵三人行：句法语义纠缠论》

【立委科普：结构歧义的休眠唤醒演义】

【语义计算沙龙：从“10年中学文化课”切词谈系统设计】

我:
毛老啊，1966-1976 10年文革，是我十年的中小学，我容易吗？10年中学文化课的时间不到一半，其余是学工学农学军。学赤脚医生学开手扶拖拉机。
为什么是【十年中】【学文化课】不是【十年中学】【文化课】?

Guo:
@wei 单就这句，确实两可。但你后面有这么多的"学"……
至少对这个例子，统计，"深度神经"RNN之类还是有merit的。当然，这两种解析其实也没本质的区别。不必多费心思。

我:
怎讲？因为“学”频率高所以“中学”成词就不便？统计模型在这个case怎么工作显示merit呢？愿闻其详。
大数据说有五年中学有六年中学，极少见十年中学，反映的是中学学制的常识。但是这个知识不是很强大，很难作数，因为这不是 positive evidence。如果句子在 “六年中学” 发生边界纠纷的时候得到来自大数据的直接支持，那是正面的 evidence，力量就很强。负面证据不顶事儿，因为它面对的是【非六】（或【非五】）的大海，理论上无边无沿，那点儿证据早被淹没了。

Guo:
统计分long term / global vs short term / local.

你讲的"大数据"，其实是在讲前者。

现在热的"深度神经"，有些是有意无意地多考虑些后者。例如，深度神经"皇冠上的明珠"LSTM即是Long Short Term Memory。虽非显式地求取利用"即时统计"，那层意思还是感觉的到的。

我:
@Guo 恩。这个 local 和 global 之间的关系很tricky
0821e

这个貌似歪打正着的parse应该纯粹是狗屎运，不理论。

白:
N+N的得分本来就低有状语有动词的更加“典型” N+N是实在没招了只能借助构词法解决零碎的产物有状语有动词时谁还理N+N。不管几年中学，也抗衡不了这个结构要素。就是说，同样是使用规则，有些规则上得厅堂，有些规则只能下得厨房。如果没有上得厅堂的规则可用，随你下厨房怎么折腾。但是如果有上得厅堂的规则可用，谁也不去下厨房。

我:
这里不仅仅是 N+N 的问题，在绝大多数切词模块中，还没走到N+N这一步，因此这个问题实际上可能挑战不少现存的切词程序：十年/中学/文化课 or 十年中/学/文化课 ?
有一个常用的切词 heuristic 要求偏向于音节数均匀的路径显然前者比后者均匀多了。

白:
句法上谈多层，也是“狗/咬吕洞宾”, 不是“狗咬/吕洞宾”

我:
真正的反例是交叉型的。
句法怎么谈层次其实无关因为多层的切词不过是一个技术策略，（通常）本身并不参与 parsing，最终的结果是狗/咬/吕洞宾就行了。其实即便论句法 SVO 层次在汉语中还是颇有争论的不像西方语言里面 V+NP 的证据那么充分。

白:
这有点循环论证了

我:
目前的接口是这样的多数系统的接口是。切词的结果并不存在层次，虽然切词内部可以也应该使用层次。肯定有研究型系统不采用这样的接口，但实用系统中的多数似乎就是这样简单。

白:
都保留也没啥，交给句法处理好了，谁说一定要分出个唯一结果再交上去，很多系统接受词图而不是词流了。对于神经网络这种天然接受不确定性的formalism而言，接受词图并不比接受词流多什么负担。

我:
数据结构多了维度，对于传统系统，涉及面蛮大的。词不仅仅是词，词本身不是一个简单的 object。以前的系统词流就是string 或最多是 token+POS list 对那样简单的结构增加维度还好。

白:
词和短语一样可以给位置加锁解锁竞争位置的锁

我:
不错，词是一切潜在结构的发源地，蕴藏了很大潜能，甚至在设计中，应该让词典可以内建结构，与parsing机制一体化。这种设计思想下的词增加维度就是带着镣铐跳舞不是容易处置好的。nondeterministic 是一个动听但不太好使的策略。否则理论上无需任何休眠与唤醒。

白:
可以参数化，连续过渡。处理得好，管子就粗些。处理不好，管子就细些。极端就回到一条线。一个位置允许几个词竞争锁，可以参数化。超出管子容量的，再做休眠唤醒。

我:
多层系统下的 nondeterministic 结构，就好比潘多拉的盒子。放鬼容易降鬼难，层次越多越是这样。也许机器学习那边不怕，反正不是人在降服鬼。

白:
其实一个词多个POS，或者多个subcat，机制是一样的。不仅有组合增加的一面，也有限制增加的一面。不用人降服鬼，鬼自己就打起来，打不赢没脸见人。只要制定好“见人”的标准，其他就交给鬼。

我:
这就是毛主席的路线叫天下大乱达到天下大治。文革大乱10年国民经济临近崩溃的边缘，但没有像60年那样彻底崩盘，除了狗屎运，还因为有一个绝对权威在。这个权威冷酷无情翻脸不认人。今天红上了天的红卫兵造反派明天就下牢狱。

白:
鬼打架也是有秩序的，不是大乱，是分布式表示。

我:
这样的系统大多难以调试等到见人了结果已定局好坏都是它了斯大林说胜利者是不受指责的。

白:
局部作用，高度自治

我:
鬼虽然是按照人制定的规则打架。具体细节却难以追踪因此也难以改正。当然这个毛病也不是现在才有的是一切黑箱子策略的通病。

白:
不是黑箱子，是基于规则、分布式表示、局部自治。打架的任何细节语言学上都可解释。理论上，如果词典确定，所有交集型分词歧义就已经确定，是词流还是词图，只是一个编码问题。如果再加上管子粗细的限制，编码也是高度可控的。

我:
刁德一说这茶喝到这儿才有了滋味。看好白老师及其design

白:
“10年”说的究竟是时长（duration）为10年的时间段，还是2010年这一年的简称，也是需要甄别的。

【相关】

Why deep parsing rules instead of deep learning model for sentiment analysis?
Pros and Cons of Two Approaches: Machine Learning and Grammar Engineering
Coarse-grained vs. fine-grained sentiment analysis
一切声称用机器学习做社会媒体舆情挖掘的系统，都值得怀疑
 【立委科普：基于关键词的舆情分类系统面临挑战】

The mainstream sentiment approach simply breaks in front of social media

I have articulated this point in various previous posts or blogs before, but the world is so dominated by the mainstream that it does not seem to carry. So let me make it simple to be understood:

The sentiment classification approach based on bag of words (BOW) model, so far the dominant approach in the mainstream for sentiment analysis, simply breaks in front of social media. The major reason is simple: the social media posts are full of short messages which do not have the "keyword density" required by a classifier to make the proper sentiment decision. Larger training sets cannot help this fundamental defect of the methodology. The precision ceiling for this line of work in real-life social media is found to be 60%, far behind the widely acknowledged precision minimum 80% for a usable extraction system. Trusting a machine learning classifier to perform social media sentiment mining is not much better than flipping a coin.

So let us get straight. From now on, whoever claims the use of machine learning for social media mining of public opinions and sentiments is likely to be a trap (unless it is verified to have involved parsing of linguistic structures or patterns, which so far has never been heard of in practical systems based on machine learning). Fancy visualizations may make the results of the mainstream approach look real and attractive but they are just not trustable at all.

Why deep parsing rules instead of deep learning model for sentiment analysis?

aaa

(1) Learning does not work in short messages as short messages do not have enough data points (or keyword density) to support the statistical model trained by machine learning. Social media is dominated by short messages.

(2) With long messages, learning can do a fairly good job in coarse-grained sentiment classification of thumbs-up and thumbs-down, but it is not good at decoding the fine-grained sentiment analysis to answer why people like or dislike a topic or brand. Such fine-grained insights are much more actionable and valuable than the simple classification of thumbs-up and thumbs-down.

We have experimented with and compared both approaches to validate the above conclusions. That is why we use deep parsing rules instead of a deep learning model to reach the industry-leading data quality we have for sentiment analysis.

We do use deep learning for other tasks such as logo and image processing. But for sentiment analysis and information extraction from text, especially in processing social media, the deep parsing approach is a clear leader in data quality.

Coarse-grained vs. fine-grained sentiment analysis

Deep parsing is the key to natural language understanding

Automated survey based on social media

Chomsky's Negative Impact

S. Bai: Natural Language Caterpillar Breaks through Chomsky's Castle

masterminds-it-quiz-10-728

Translator's note:

This article written in Chinese by Prof. S. Bai is a wonderful piece of writing worthy of recommendation for all natural language scholars. Prof. Bai's critical study of Chomsky's formal language theory with regards to natural language has reached a depth never seen before ever since Chomsky's revolution in 50's last century. For decades with so many papers published by so many scholars who have studied Chomsky, this novel "caterpillar" theory still stands out and strikes me as an insight that offers a much clearer and deeper explanation for how natural language should be modeled in formalism, based on my decades of natural language parsing study and practice (in our practice, I call the caterpillar FSA++, an extension of regular grammar formalism adequate for multi-level natural language deep parsing). For example, so many people have been trapped in Chomsky's recursion theory and made endless futile efforts to attempt a linear or near-linear algorithm to handle the so-called recursive nature of natural language which is practically non-existent (see Chomsky's Negative Impact). There used to be heated debates in computational linguistics on whether natural language is context-free or context-sensitive, or mildly sensitive as some scholars call it. Such debates mechanically apply Chomsky's formal language hierarchy to natural languages, trapped in metaphysical academic controversies, far from language facts and data. In contrast, Prof. Bai's original "caterpillar" theory presents a novel picture that provides insights in uncovering the true nature of natural languages.

S. Bai: Natural Language Caterpillar Breaks through Chomsky's Castle

Tags: Chomsky Hierarchy, computational linguistics, Natural Language Processing, linear speed

This is a technology-savvy article, not to be fooled by the title seemingly about a bug story in some VIP's castle. If you are neither an NLP professional nor an NLP fan, you can stop here and do not need to continue the journey with me on this topic.

Chomsky's Castle refers to the famous Chomsky Hierarchy in his formal language theory, built by the father of contemporary linguistics Noam Chomsky more than half a century ago. According to this theory, the language castle is built with four enclosing walls. The outmost wall is named Type-0, also called Phrase Structure Grammar, corresponding to a Turing machine. The second wall is Type-1, or Context-sensitive Grammar (CSG), corresponding to a parsing device called linear bounded automaton with time complexity known to be NP-complete. The third wall is Type-2, or Context-free Grammar (CFG), corresponding to a pushdown automaton, with a time complexity that is polynomial, somewhere between square and cubic in the size of the input sentence for the best asymptotic order measured in the worst case scenario. The innermost wall is Type-3, or Regular Grammar, corresponding to deterministic finite state automata, with a linear time complexity. The sketch of the 4-wall Chomsky Castle is illustrated below.

This castle of Chomsky has impacted generations of scholars, mainly along two lines. The first line of impact can be called "the outward fear syndrome". Because the time complexity for the second wall (CSG) is NP-complete, anywhere therein and beyond becomes a Forbidden City before NP=P can be proved. Thus, the pressure for parsing natural languages has to be all confined to within the third wall (CFG). Everyone knows the natural language involves some context sensitivity, but the computing device cannot hold it to be tractable once it is beyond the third wall of CFG. So it has to be left out.

The second line of impact is called "the inward perfection syndrome". Following the initial success of using Type 2 grammar (CFG) comes a severe abuse of recursion. When the number of recursive layers increases slightly, the acceptability of a sentence soon approximates to almost 0. For example, "The person that hit Peter is John" looks fine, but it starts sounding weird to hear "The person that hit Peter that met Tom is John". It becomes gibberish with sentences like "The person that hit Peter that met Tom that married Mary is John". In fact, the majority resources spent with regards to the parsing efficiency are associated with such abuse of recursion in coping with gibberish-like sentences, rarely seen in real life language. For natural language processing to be practical, pursuing the linear speed cannot be over emphasized. If we reflect on the efficiency of the human language understanding process, the conclusion is certainly about the "linear speed" in accordance with the length of the speech input. In fact, the abuse of recursion is most likely triggered by the "inward perfection syndrome", for which we intend to cover every inch of the land within the third wall of CFG, even if it is an area piled up by gibberish or garbage.

In a sense, it can be said that one reason for the statistical approach to take over the rule-based approach for such a long time in the academia of natural language processing is just the combination effect of these two syndromes. To overcome the effects of these syndromes, many researchers have made all kinds of efforts, to be reviewed below one by one.

Along the line of the outward fear syndrome, evidence against the context-freeness has been found in some constructions in Swiss-German. Chinese has similar examples in expressing respective correspondence of conjoined items and their descriptions. For example, “张三、李四、王五的年龄分别是25岁、32岁、27岁，出生地分别是武汉、成都、苏州” (Zhang San, Li Si, Wang Wu's age is respectively 25, 32, and 27, they were born respectively in Wuhan, Chengdu, Suzhou" ). Here, the three named entities constitute a list of nouns. The number of the conjoined list of entities cannot be predetermined, but although the respective descriptors about this list of nouns also vary in length, the key condition is that they need to correspond to the antecedent list of nouns one by one. This respective correspondence is something beyond the expression power of the context-free formalism. It needs to get out of the third wall.

As for overcoming "the inward perfection syndrome", the pursuit of "linear speed" in the field of NLP has never stopped. It ranges from allowing for the look-ahead mechanism in LR (k) grammar, to the cascaded finite state automata, to the probabilistic CFG parsers which are trained on a large treebank and eventually converted to an Ngram (n=>5) model. It should also include RNN/LSTM for its unique pursuit for deep parsing from the statistical school. All these efforts are striving for defining a subclass in Type-2 CFG that reaches linear speed efficiency yet still with adequate linguistic power. In fact, all parsers that have survived after fighting the statistical methods are to some degree a result of overcoming "the inward perfection syndrome", with certain success in linear speed pursuit while respecting linguistic principles. The resulting restricted subclass, compared to the area within the original third wall CFG, is a greatly "squashed" land.

If we agree that everything in parsing should be based on real life natural language as the starting point and the ultimate landing point, it should be easy to see that the outward limited breakthrough and the inward massive compression should be the two sides of a coin. We want to strive for a formalism that balances both sides. In other words, our ideal natural language parsing formalism should look like a linguistic "caterpillar" breaking through the Chomsky walls in his castle, illustrated below:

It seems to me that such a "caterpillar" may have already been found by someone. It will not take too long before we can confirm it.
Original article in Chinese from 《穿越乔家大院寻找“毛毛虫”》
Translated by Dr. Wei Li

【Related】

K. Church: A Pendulum Swung Too Far, Linguistics issues in Language Technology, 2011; 6(5)

from http://blog.sina.com.cn/s/blog_729574a00102wf63.html

[转载]【白硕 - 穿越乔家大院寻找“毛毛虫”】

【立委按】

白硕老师这篇文章值得所有自然语言学者研读和反思。击节叹服，拍案叫绝，是初读此文的真切感受。白老师对乔姆斯基形式语言理论用于自然语言所造成的误导，给出了迄今所见最有深度的犀利解析，而且写得深入浅出，形象生动，妙趣横生。这么多年，这么多学者，怎么就达不到这样的深度呢？一个乔姆斯基的递归陷阱不知道栽进去多少人，造成多少人在 “不是人话” 的现象上做无用功，绕了无数弯路。学界曾有多篇长篇大论，机械地套用乔氏层级体系，在自然语言是 context-free 还是 context-sensitive 的框框里争论不休，也有折衷的说法，诸如自然语言是 mildly sensitive，这些形而上的学究式争论，大多雾里看花，隔靴搔痒，不得要领，离语言事实甚远。白老师独创的 “毛毛虫” 论，形象地打破了这些条条框框。

白老师自己的总结是：‘如果认同“一切以真实的自然语言为出发点和最终落脚点”的理念，那就应该承认：向外有限突破，向内大举压缩，应该是一枚硬币的两面。’ 此乃金玉良言，掷地有声。

【白硕 - 穿越乔家大院寻找“毛毛虫”】

标签：乔姆斯基语言学自然语言处理线速

看标题，您八成以为这篇文章讲的是山西的乔家大院的事儿了吧？不是。这是一篇烧脑的技术贴。如果您既不是NLP专业人士也不是NLP爱好者，就不用往下看了。

咱说的这乔家大院，是当代语言学祖师爷乔姆斯基老爷子画下来的形式语言类型谱系划分格局。最外边一圈围墙，是0型文法，又叫短语结构文法，其对应的分析处理机制和图灵机等价，亦即图灵可计算的；第二圈围墙，是1型文法，又叫上下文相关文法，其对应的分析处理机制，时间复杂度是NP完全的；第三圈围墙，是2型文法，又叫上下文无关文法，其对应的分析处理机制，时间复杂度是多项式的，最坏情况下的最好渐进阶在输入句子长度的平方和立方之间；最里边一层围墙，是3型文法，又叫正则文法，其对应的分析处理机制和确定性有限状态自动机等价，时间复杂度是线性的。这一圈套一圈的，归纳整理下来，如下图所示：

乔老爷子建的这座大院，影响了几代人。影响包括这样两个方面：

第一个方面，我们可以称之为“外向恐惧情结”。因为第二圈的判定处理机制，时间复杂度是NP完全的，于是在NP=P还没有证明出来之前，第二圈之外似乎是禁区，没等碰到已经被宣判了死刑。这样，对自然语言的描述压力，全都集中到了第三圈围墙里面，也就是上下文无关文法。大家心知肚明自然语言具有上下文相关性，想要红杏出墙，但是因为出了围墙计算上就hold不住，也只好打消此念。0院点灯……1院点灯……大红灯笼高高挂，红灯停，闲人免出。

第二个方面，我们可以称之为“内向求全情结”。2型文法大行其道，取得了局部成功，也带来了一个坏风气，就是递归的滥用。当递归层数稍微加大，人类对于某些句式的可接受性就快速衰减至几近为0。比如，“我是县长派来的”没问题，“我是县长派来的派来的”就有点别扭，“我是县长派来的派来的派来的”就不太像人话了。而影响分析判定效率的绝大多数资源投入，都花在了应对这类“不像人话”的递归滥用上了。自然语言处理要想取得实用效果，处理的“线速”是硬道理。反思一下，我们人类的语言理解过程，也肯定是在“线速”范围之内。递归的滥用，起源于“向内求全情结”，也就是一心想覆盖第三圈围墙里面最犄角旮旯的区域，哪怕那是一个由“不像人话”的实例堆积起来的垃圾堆。

可以说，在自然语言处理领域，统计方法之所以在很长时间内压倒规则方法，在一定程度上，就是向外恐惧情结与向内求全情结叠加造成的。NLP领域内也有很多的仁人志士为打破这两个情结做了各种各样的努力。

先说向外恐惧情结。早就有人指出，瑞士高地德语里面有不能用上下文无关文法描述的语言现象。其实，在涉及到“分别”的表述时，汉语也同样。比如：“张三、李四、王五的年龄分别是25岁、32岁、27岁，出生地分别是武汉、成都、苏州。”这里“张三、李四、王五”构成一个名词列表，对这类列表的一般性句法表述，肯定是不定长的，但后面的两个“分别”携带的列表，虽然也是不定长的，但却需要跟前面这个列表的长度相等。这个相等的条件，上下文无关文法不能表达，必须走出第三圈围墙。

再说向内求全情结。追求“线速”的努力，在NLP领域一直没有停止过。从允许预读机制的LR(k)文法，到有限自动机堆叠，再到基于大型树库训练出来的、最终转化为Ngram模型（N=5甚至更大）的概率上下文无关文法分析器，甚至可以算上统计阵营里孤军深入自然语言深层处理的RNN/LSTM等等，都试图从2型文法中划出一个既有足够的语言学意义、又能达到线速处理效率的子类。可以说，凡是在与统计方法的搏杀中还能活下来的分析器，无一不是在某种程度上摆脱了向内求全情结、在基本尊重语言学规律基础上尽可能追求线速的努力达到相对成功的结果。这个经过限制的子类，比起第三圈围墙来，是大大地“压扁”了的。

如果认同“一切以真实的自然语言为出发点和最终落脚点”的理念，那就应该承认：向外有限突破，向内大举压缩，应该是一枚硬币的两面。我们希望，能够有一种形式化机制同时兼顾这两面。也就是说，我们理想中的自然语言句法的形式化描述机制，应该像一条穿越乔家大院的“毛毛虫”，如下图所示：

据笔者妄加猜测，这样的“毛毛虫”，可能有人已经找到，过一段时间自然会见分晓。

【相关】

【新智元：parsing 在希望的田野上】

【新智元：理论家的围墙和工程师的私货】

泥沙龙笔记：骨灰级砖家一席谈，真伪结构歧义的对策（2/2)

【泥沙龙笔记：NLP 专门语言是规则系统的斧头】

乔姆斯基批判

泥沙龙笔记：再聊乔老爷的递归陷阱

《自然语言是递归的么？》

语言创造简史

【置顶：立委博客NLP博文一览（定期更新版）】

On Hand-crafted Myth of Knowledge Bottleneck

In my article "Pride and Prejudice of Main Stream", the first myth listed as top 10 misconceptions in NLP is as follows:

[Hand-crafted Myth] Rule-based system faces a knowledge bottleneck of hand-crafted development while a machine learning system involves automatic training (implying no knowledge bottleneck).

While there are numerous misconceptions on the old school of rule systems , this hand-crafted myth can be regarded as the source of all. Just take a review of NLP papers, no matter what are the language phenomena being discussed, it's almost cliche to cite a couple of old school work to demonstrate superiority of machine learning algorithms, and the reason for the attack only needs one sentence, to the effect that the hand-crafted rules lead to a system "difficult to develop" (or "difficult to scale up", "with low efficiency", "lacking robustness", etc.), or simply rejecting it like this, "literature [1], [2] and [3] have tried to handle the problem in different aspects, but these systems are all hand-crafted". Once labeled with hand-crafting, one does not even need to discuss the effect and quality. Hand-craft becomes the rule system's "original sin", the linguists crafting rules, therefore, become the community's second-class citizens bearing the sin.

So what is wrong with hand-crafting or coding linguistic rules for computer processing of languages? NLP development is software engineering. From software engineering perspective, hand-crafting is programming while machine learning belongs to automatic programming. Unless we assume that natural language is a special object whose processing can all be handled by systems automatically programmed or learned by machine learning algorithms, it does not make sense to reject or belittle the practice of coding linguistic rules for developing an NLP system.

For consumer products and arts, hand-craft is definitely a positive word: it represents quality or uniqueness and high value, a legit reason for good price. Why does it become a derogatory term in NLP? The root cause is that in the field of NLP, almost like some collective hypnosis hit in the community, people are intentionally or unintentionally lead to believe that machine learning is the only correct choice. In other words, by criticizing, rejecting or disregarding hand-crafted rule systems, the underlying assumption is that machine learning is a panacea, universal and effective, always a preferred approach over the other school.

The fact of life is, in the face of the complexity of natural language, machine learning from data so far only surfaces the tip of an iceberg of the language monster (called low-hanging fruit by Church in K. Church: A Pendulum Swung Too Far), far from reaching the goal of a complete solution to language understanding and applications. There is no basis to support that machine learning alone can solve all language problems, nor is there any evidence that machine learning necessarily leads to better quality than coding rules by domain specialists (e.g. computational grammarians). Depending on the nature and depth of the NLP tasks, hand-crafted systems actually have more chances of performing better than machine learning, at least for non-trivial and deep level NLP tasks such as parsing, sentiment analysis and information extraction (we have tried and compared both approaches). In fact, the only major reason why they are still there, having survived all the rejections from mainstream and still playing a role in industrial practical applications, is the superior data quality, for otherwise they cannot have been justified for industrial investments at all.

the “forgotten” school: why is it still there? what does it have to offer? The key is the excellent data quality as advantage of a hand-crafted system, not only for precision, but high recall is achievable as well.
quote from On Recall of Grammar Engineering Systems

In the real world, NLP is applied research which eventually must land on the engineering of language applications where the results and quality are evaluated. As an industry, software engineering has attracted many ingenious coding masters, each and every one of them gets recognized for their coding skills, including algorithm design and implementation expertise, which are hand-crafting by nature. Have we ever heard of a star engineer gets criticized for his (manual) programming? With NLP application also as part of software engineering, why should computational linguists coding linguistic rules receive so much criticism while engineers coding other applications get recognized for their hard work? Is it because the NLP application is simpler than other applications? On the contrary, many applications of natural language are more complex and difficult than other types of applications (e.g. graphics software, or word processing apps). The likely reason to explain the different treatment between a general purpose programmer and a linguist knowledge engineer is that the big environment of software engineering does not involve as much prejudice while the small environment of NLP domain is deeply biased, with belief that the automatic programming of an NLP system by machine learning can replace and outperform manual coding for all language projects. For software engineering in general, (manual) programming is the norm and no one believes that programmers' jobs can be replaced by automatic programming in any time foreseeable. Automatic programming, a concept not rare in science fiction for visions like machines making machines, is currently only a research area, for very restricted low-level functions. Rather than placing hope on automatic programming, software engineering as an industry has seen a significant progress on work of the development infrastructures, such as development environment and a rich library of functions to support efficient coding and debugging. Maybe in the future one day, applications can use more and more of automated code to achieve simple modules, but the full automation of constructing any complex software project is nowhere in sight. By any standards, natural language parsing and understanding (beyond shallow level tasks such as classification, clustering or tagging) is a type of complex tasks. Therefore, it is hard to expect machine learning as a manifestation of automatic programming to miraculously replace the manual code for all language applications. The application value of hand-crafting a rule system will continue to exist and evolve for a long time, disregarded or not.

"Automatic" is a fancy word. What a beautiful world it would be if all artificial intelligence and natural languages tasks could be accomplished by automatic machine learning from data. There is, naturally, a high expectation and regard for machine learning breakthrough to help realize this dream of mankind. All this should encourage machine learning experts to continue to innovate to demonstrate its potential, and should not be a reason for the pride and prejudice against a competitive school or other approaches.

Before we embark on further discussions on the so-called rule system's knowledge bottleneck defect, it is worth mentioning that the word "automatic" refers to the system development, not to be confused with running the system. At the application level, whether it is a machine-learned system or a manual system coded by domain programmers (linguists), the system is always run fully automatically, with no human interference. Although this is an obvious fact for both types of systems, I have seen people get confused so to equate hand-crafted NLP system with manual or semi-automatic applications.

Is hand-crafting rules a knowledge bottleneck for its development? Yes, there is no denying or a need to deny that. The bottleneck is reflected in the system development cycle. But keep in mind that this "bottleneck" is common to all large software engineering projects, it is a resources cost, not only introduced by NLP. From this perspective, the knowledge bottleneck argument against hand-crafted system cannot really stand, unless it can be proved that machine learning can do all NLP equally well, free of knowledge bottleneck: it might be not far from truth for some special low-level tasks, e.g. document classification and word clustering, but is definitely misleading or incorrect for NLP in general, a point to be discussed below in details shortly.

Here are the ballpark estimates based on our decades of NLP practice and experiences. For shallow level NLP tasks (such as Named Entity tagging, Chinese segmentation), a rule approach needs at least three months of one linguist coding and debugging the rules, supported by at least half an engineer for tools support and platform maintenance, in order to come up with a decent system for initial release and running. As for deep NLP tasks (such as deep parsing, deep sentiments beyond thumbs-up and thumbs-down classification), one should not expect a working engine to be built up without due resources that at least involve one computational linguist coding rules for one year, coupled with half an engineer for platform and tools support and half an engineer for independent QA (quality assurance) support. Of course, the labor resources requirements vary according to the quality of the developers (especially the linguistic expertise of the knowledge engineers) and how well the infrastructures and development environment support linguistic development. Also, the above estimates have not included the general costs, as applied to all software applications, e.g. the GUI development at app level and operations in running the developed engines.

Let us present the scene of the modern day rule-based system development. A hand-crafted NLP rule system is based on compiled computational grammars which are nowadays often architected as an integrated pipeline of different modules from shallow processing up to deep processing. A grammar is a set of linguistic rules encoded in some formalism, which is the core of a module intended to achieve a defined function in language processing, e.g. a module for shallow parsing may target noun phrase (NP) as its object for identification and chunking. What happens in grammar engineering is not much different from other software engineering projects. As knowledge engineer, a computational linguist codes a rule in an NLP-specific language, based on a development corpus. The development is data-driven, each line of rule code goes through rigid unit tests and then regression tests before it is submitted as part of the updated system for independent QA to test and feedback. The development is an iterative process and cycle where incremental enhancements on bug reports from QA and/or from the field (customers) serve as a necessary input and step towards better data quality over time.

Depending on the design of the architect, there are all types of information available for the linguist developer to use in crafting a rule’s conditions, e.g. a rule can check any elements of a pattern by enforcing conditions on (i) word or stem itself (i.e. string literal, in cases of capturing, say, idiomatic expressions), and/or (ii) POS (part-of-speech, such as noun, adjective, verb, preposition), (iii) and/or orthography features (e.g. initial upper case, mixed case, token with digits and dots), and/or (iv) morphology features (e.g. tense, aspect, person, number, case, etc. decoded by a previous morphology module), (v) and/or syntactic features (e.g. verb subcategory features such as intransitive, transitive, ditransitive), (vi) and/or lexical semantic features (e.g. human, animal, furniture, food, school, time, location, color, emotion). There are almost infinite combinations of such conditions that can be enforced in rules’ patterns. A linguist’s job is to code such conditions to maximize the benefits of capturing the target language phenomena, a balancing art in engineering through a process of trial and error.

Macroscopically speaking, the rule hand-crafting process is in its essence the same as programmers coding an application, only that linguists usually use a different, very high-level NLP-specific language, in a chosen or designed formalism appropriate for modeling natural language and framework on a platform that is geared towards facilitating NLP work. Hard-coding NLP in a general purpose language like Java is not impossible for prototyping or a toy system. But as natural language is known to be a complex monster, its processing calls for a special formalism (some form or extension of Chomsky's formal language types) and an NLP-oriented language to help implement any non-toy systems that scale. So linguists are trained on the scene of development to be knowledge programmers in hand-crafting linguistic rules. In terms of different levels of languages used for coding, to an extent, it is similar to the contrast between programmers in old days and the modern software engineers today who use so-called high-level languages like Java or C to code. Decades ago, programmers had to use assembly or machine language to code a function. The process and workflow for hand-crafting linguistic rules are just like any software engineers in their daily coding practice, except that the language designed for linguists is so high-level that linguistic developers can concentrate on linguistic challenges without having to worry about low-level technical details of memory allocation, garbage collection or pure code optimization for efficiency, which are taken care of by the NLP platform itself. Everything else follows software development norms to ensure the development stay on track, including unit testing, baselines construction and monitoring, regressions testing, independent QA, code reviews for rules' quality, etc. Each level language has its own star engineer who masters the coding skills. It sounds ridiculous to respect software engineers while belittling linguistic engineers only because the latter are hand-crafting linguistic code as knowledge resources.

The chief architect in this context plays the key role in building a real life robust NLP system that scales. To deep-parse or process natural language, he/she needs to define and design the formalism and language with the necessary extensions, the related data structures, system architecture with the interaction of different levels of linguistic modules in mind (e.g. morpho-syntactic interface), workflow that integrate all components for internal coordination (including patching and handling interdependency and error propagation) and the external coordination with other modules or sub-systems including machine learning or off-shelf tools when needed or felt beneficial. He also needs to ensure efficient development environment and to train new linguists into effective linguistic "coders" with engineering sense following software development norms (knowledge engineers are not trained by schools today). Unlike the mainstream machine learning systems which are by nature robust and scalable, hand-crafted systems' robustness and scalability depend largely on the design and deep skills of the architect. The architect defines the NLP platform with specs for its core engine compiler and runner, plus the debugger in a friendly development environment. He must also work with product managers to turn their requirements into operational specs for linguistic development, in a process we call semantic grounding to applications from linguistic processing. The success of a large NLP system based on hand-crafted rules is never a simple accumulation of linguistics resources such as computational lexicons and grammars using a fixed formalism (e.g. CFG) and algorithm (e.g. chart-parsing). It calls for seasoned language engineering masters as architects for the system design.

Given the scene of practice for NLP development as describe above, it should be clear that the negative sentiment association with "hand-crafting" is unjustifiable and inappropriate. The only remaining argument against coding rules by hands comes down to the hard work and costs associated with hand-crafted approach, so-called knowledge bottleneck in the rule-based systems. If things can be learned by a machine without cost, why bother using costly linguistic labor? Sounds like a reasonable argument until we examine this closely. First, for this argument to stand, we need proof that machine learning indeed does not incur costs and has no or very little knowledge bottleneck. Second, for this argument to withstand scrutiny, we should be convinced that machine learning can reach the same or better quality than hand-crafted rule approach. Unfortunately, neither of these necessarily hold true. Let us study them one by one.

As is known to all, any non-trivial NLP task is by nature based on linguistic knowledge, irrespective of what form the knowledge is learned or encoded. Knowledge needs to be formalized in some form to support NLP, and machine learning is by no means immune to this knowledge resources requirement. In rule-based systems, the knowledge is directly hand-coded by linguists and in case of (supervised) machine learning, knowledge resources take the form of labeled data for the learning algorithm to learn from (indeed, there is so-called unsupervised learning which needs no labeled data and is supposed to learn from raw data, but that is research-oriented and hardly practical for any non-trivial NLP, so we leave it aside for now). Although the learning process is automatic, the feature design, the learning algorithm implementation, debugging and fine-tuning are all manual, in addition to the requirement of manual labeling a large training corpus in advance (unless there is an existing labeled corpus available, which is rare; but machine translation is a nice exception as it has the benefit of using existing human translation as labeled aligned corpora for training). The labeling of data is a very tedious manual job. Note that the sparse data challenge represents the need of machine learning for a very large labeled corpus. So it is clear that knowledge bottleneck takes different forms, but it is equally applicable to both approaches. No machine can learn knowledge without costs, and it is incorrect to regard knowledge bottleneck as only a defect for the rule-based system.

One may argue that rules require expert skilled labor, while the labeling of data only requires high school kids or college students with minimal training. So to do a fair comparison of the costs associated, we perhaps need to turn to Karl Marx whose "Das Kapital" has some formula to help convert simple labor to complex labor for exchange of equal value: for a given task with the same level of performance quality (assuming machine learning can reach the quality of professional expertise, which is not necessarily true), how much cheap labor needs to be used to label the required amount of training corpus to make it economically an advantage? Something like that. This varies from task to task and even from location to location (e.g. different minimal wage laws), of course. But the key point here is that knowledge bottleneck challenges both approaches and it is not the case believed by many that machine learning learns a system automatically with no or little cost attached. In fact, things are far more complicated than a simple yes or no in comparing the costs as costs need also to be calculated in a larger context of how many tasks need to be handled and how much underlying knowledge can be shared as reusable resources. We will leave it to a separate writing for the elaboration of the point that when put into the context of developing multiple NLP applications, the rule-based approach which shares the core engine of parsing demonstrates a significant saving on knowledge costs than machine learning.

Let us step back and, for argument's sake, accept that coding rules is indeed more costly than machine learning, so what? Like in any other commodities, hand-crafted products may indeed cost more, they also have better quality and value than products out of mass production. For otherwise a commodity society will leave no room for craftsmen and their products to survive. This is common sense, which also applies to NLP. If not for better quality, no investors will fund any teams that can be replaced by machine learning. What is surprising is that there are so many people, NLP experts included, who believe that machine learning necessarily performs better than hand-crafted systems not only in costs saved but also in quality achieved. While there are low-level NLP tasks such as speech processing and document classification which are not experts' forte as we human have much more restricted memory than computers do, deep NLP involves much more linguistic expertise and design than a simple concept of learning from corpora to expect superior data quality.

In summary, the hand-crafted rule defect is largely a misconception circling around wildly in NLP and reinforced by the mainstream, due to incomplete induction or ignorance of the scene of modern day rule development. It is based on the incorrect assumption that machine learning necessarily handles all NLP tasks with same or better quality but less or no knowledge bottleneck, in comparison with systems based on hand-crafted rules.

Note: This is the author's own translation, with adaptation, of part of our paper which originally appeared in Chinese in Communications of Chinese Computer Federation (CCCF), Issue 8, 2013

Pride and Prejudice of NLP Main Stream

K. Church: A Pendulum Swung Too Far, Linguistics issues in Language Technology, 2011; 6(5)

Wintner 2009. What Science Underlies Natural Language Engineering? Computational Linguistics, Volume 35, Number 4

Pros and Cons of Two Approaches: Machine Learning vs Grammar Engineering

Domain portability myth in natural language processing

Pride and Prejudice of NLP Main Stream

[Abstract]

In the area of Computational Linguistics, there are two basic approaches to natural language processing, the traditional rule system and the mainstream machine learning. They are complementary and there are pros and cons associated with both. However, as machine learning is the dominant mainstream philosophy reflected by the overwhelming ratio of papers published in academia, the area seems to be heavily biased against the rule system methodology. The tremendous success of machine learning as applied to a list of natural language tasks has reinforced the mainstream pride and prejudice in favor of one and against the other. As a result, there are numerous specious views which are often taken for granted without check, including attacks on the rule system's defects based on incomplete induction or misconception. This is not healthy for NLP itself as an applied research area and exerts an inappropriate influence on the young scientists coming to this area. This is the first piece of a series of writings aimed at educating the public and confronting the prevalent prejudice, focused on the in-depth examination of the so-called hand-crafted defect of the rule system and the associated knowledge bottleneck issue.

I. introduction

Over 20 years ago, the area of NLP (natural language processing) went through a process of replacing traditional rule-based systems by statistical machine learning as the mainstream in academia. Put in a larger context of AI (Artificial Intelligence), this represents a classical competition, and their ups and downs, between the rational school and the empirical school (Church 2007 ). It needs to be noted that the statistical approaches' dominance in this area has its historical inevitability. The old school was confined to toy systems or lab for too long without scientific break-through while machine learning started showing impressive results in numerous fronts of NLP in a much larger scale, initially very low level NLP such as POS (Part-of-Speech) tagging and speech recognition / synthesis, and later on expanded to almost all NLP tasks, including machine translation, search and ranking, spam filtering, document classification, automatic summarization, lexicon acquisition, named entity tagging, relationship extraction, event classification, sentiment analysis. This dominance has continued to grow till today when the other school is largely "out" from almost all major NLP arenas, journals and top conferences. New graduates hardly realize its existence. There is an entire generation gap for such academic training or carrying on the legacy of the old school, with exceptions of very few survivors (including yours truly) in industry because few professors are motivated to teach it at all or even qualified with an in-depth knowledge of this when the funding and publication prospects for the old school are getting more and more impossible. To many people's minds today, learning (or deep learning) is NLP, and NLP is learning, that is all. As for the "last century's technology" of rule-based systems, it is more like a failure tale from a distant history.

The pride and prejudice of the mainstream were demonstrated the most in the recent incidence when Google announced its deep-learning-based SyntaxNet and proudly claimed it to be "the most accurate parser in the world", so resolute and no any conditions attached, and without even bothering to check the possible existence of the other school. This is not healthy (and philosophically unbalanced too) for a broad area challenged by one of the most complex problems of mankind, i.e. to decode natural language understanding. As there is only one voice heard, it is scaring to observe that the area is packed with prejudice and ignorance with regards to the other school, some from leaders of the area. Specious comments are rampant and often taken for granted without check.

Prejudice is not a real concern as it is part of the real world around and involving ourselves, something to do with human nature and our innate limitation and ignorance. What is really scary is the degree and popularity of such prejudice represented in numerous misconceptions that can be picked up everywhere in this circle (I am not going to trace the sources of these as they are everywhere and people who are in this area for some time know this is not Quixote's windmill but a reality reflection). I will list below some of the myths or fallacies so deeply rooted in the area that they seem to become cliche, or part of the community consensus. If one or more statements below sound familiar to you and they do not strike you as opinionated or specious which cannot withstand scrutiny, then you might want to give a second study of the issue to make sure we have not been subconsciously brain-washed. The real damage is to our next generation, the new scholars coming to this field, who often do not get a chance for doubt.

For each such statement to be listed, it is not difficult to cite a poorly designed stereotypical rule system that falls short of the point, but the misconception lies in its generalization of associating an accused defect to the entire family of a school, ignorant of the variety of designs and the progress made in that school.

There are two types of misconceptions, one might be called myth and the other is sheer fallacy. Myths arise as a result of "incomplete induction". Some may have observed or tried some old school rule systems of some sort, which show signs of the stated defect, then they jump to conclusions leading to the myths. These myths call for in-depth examination and arguments to get the real picture of the truth. As for fallacies, they are simply untrue. It is quite a surprise, though, to see that even fallacies seem to be widely accepted as true by many, including some experts in this area. All we need is to cite facts to prove them wrong. For example, [Grammaticality Fallacy] says that the rule system can only parse grammatical text and cannot handle degraded text with grammar mistakes in it. Facts speak louder than words: the sentiment engine we have developed for our main products is a parsing-supported, rule-based system that fully automatically extracts and mines public opinions and consumer insights from all types of social media, typical of degraded text. Third-party evaluations show that this system is industry leader in data quality of sentiments, significantly better than competitions adopting machine learning. The large-scale operation of our system in the cloud in handling terabytes of real life social media big data (a year of social media in our index involve about 30 billion documents across more than 40 languages) also prove wrong what is stated in [Scalability Fallacy] below.

Let us now list these widely spread rumours collected from the community about the rule-based system to see if they ring the bell before we dive into the first two core myths to uncover the true picture behind in separate blogs.

II. Top 10 Misconceptions against Rules

[Hand-crafted Myth] Rule-based system faces a knowledge bottleneck of hand-crafted development while a machine learning system involves automatic training (implying no knowledge bottleneck). [see On Hand-crafted Myth of Knowledge Bottleneck.]

[Domain Portability Myth] The hand-crafted nature of a rule-based system leads to its poor domain portability as rules have to be rebuilt each time we shift to a new domain; but in case of machine learning, since the algorithm and system are universal, domain shift only involves new training data (implying strong domain portability). [see Domain Portability Myth]

[Fragility Myth] A rule-based system is very fragile and it may break before unseen language data, so it cannot lead to a robust real life application.

[Weight Myth] Since there is no statistical weight associated with the results from a rule-based system, the data quality cannot be trusted with confidence.

[Complexity Myth] As a rule-based system is complex and intertwined, it is easy to get to a standstill, with little hope for further improvement.

[Scalability Fallacy] The hand-crafted nature of a rule-based system makes it difficult to scale up for real life application; it is largely confined to the lab as a toy.

[Domain Restriction Fallacy] A rule-based system only works in a narrow domain and it cannot work across domains.

[Grammaticality Fallacy] A rule-based system can only handle grammatical input in the formal text (such as news, manuals, weather broadcasts), it fails in front of degraded text involving misspellings and ungrammaticality such as social media, oral transcript, jargons or OCR output.

[Outdated Fallacy] A rule-based system is a technology of last century, it is outdated (implying that it no longer works or can result in a quality system in modern days).

[Data Quality Fallacy] Based on the data quality of results, a machine learning system is better than a rule based system. (cf: On Recall of Grammar Engineering Systems)

III. Retrospect and Reflection of Mainstream

As mentioned earlier, a long list of misconceptions about the old school of rule-based systems have been around the mainstream for years in the field. It may sound weird for an interdisciplinary field named Computational Linguistics to drift more and more distant from linguistics; linguists play less and less a role in NLP dominated by statisticians today. It seems widely assumed that with advanced deep learning algorithms, once data are available, a quality system will be trained without the need for linguistic design or domain expertise.

Not all main stream scholars are one-sided and near-sighted. In recent years, insightful articles (e.g., church 2007, Wintner 2009) began a serious retrospect and reflection process and called for the return of Linguistics: “In essence, linguistics is altogether missing in contemporary natural language engineering research. … I want to call for the return of linguistics to computational linguistics.”（Wintner 2009）Let us hope that their voice will not be completely muffled in this new wave of deep learning heat.

Note that the rule system which the linguists are good at crafting in industry is different from the classical linguistic study, it is formalized modeling of linguistic analysis. For NLP tasks beyond shallow level, an effective rule system is not a simple accumulation of computational lexicons and grammars, but involves a linguistic processing strategy (or linguistic algorithm) for different levels of linguistic phenomena. However, this line of study on the NLP platform design, system architecture and formalism has increasingly smaller space for academic discussion and publication, the research funding becomes almost impossible, as a result, the new generation faces the risk of a cut-off legacy, with a full generation of talent gap in academia. Church (2007) points out that the statistical research is so dominant and one-sided that only one voice is now heard. He is a visionary main stream scientist, deeply concerned about the imbalance of the two schools in NLP and AI. He writes:

Part of the reason why we keep making the same mistakes, as Minsky and Papert mentioned above, has to do with teaching. One side of the debate is written out of the textbooks and forgotten, only to be revived/reinvented by the next generation. ......

To prepare students for what might come after the low hanging fruit has been picked over, it would be good to provide today’s students with a broad education that makes room for many topics in Linguistics such as syntax, morphology, phonology, phonetics, historical linguistics and language universals. We are graduating Computational Linguistics students these days that have very deep knowledge of one particular narrow sub-area (such as machine learning and statistical machine translation) but may not have heard of Greenberg’s Universals, Raising, Equi, quantifier scope, gapping, island constraints and so on. We should make sure that students working on co-reference know about c-command and disjoint reference. When students present a paper at a Computational Linguistics conference, they should be expected to know the standard treatment of the topic in Formal Linguistics.

We ought to teach this debate to the next generation because it is likely that they will have to take Chomsky’s objections more seriously than we have. Our generation has been fortunate to have plenty of low hanging fruit to pick (the facts that can be captured with short ngrams), but the next generation will be less fortunate since most of those facts will have been pretty well picked over before they retire, and therefore, it is likely that they will have to address facts that go beyond the simplest ngram approximations.

About Author

Dr. Wei Li is currently Chief Scientist at Netbase Solutions in the Silicon Valley, leading the effort for the design and development of a multi-lingual sentiment mining system based on deep parsing. A hands-on computational linguist with 30 years of professional experience in Natural Language Processing (NLP), Dr. Li has a track record of making NLP work robust. He has built three large-scale NLP systems, all transformed into real-life, globally distributed products.

Note: This is the author's own translation, with adaptation, of our paper in Chinese which originally appeared in W. Li & T. Tang, "Pride and Prejudice of Main Stream: Rule-based System vs. Machine Learning", in Communications of Chinese Computer Federation (CCCF), Issue 8, 2013

[Related]

K. Church: A Pendulum Swung Too Far, Linguistics issues in Language Technology, 2011; 6(5)

Wintner 2009. What Science Underlies Natural Language Engineering? Computational Linguistics, Volume 35, Number 4

On Hand-crafted Myth and Knowledge Bottleneck

On Recall of Grammar Engineering Systems

Pros and Cons of Two Approaches: Machine Learning vs Grammar Engineering

It is untrue that Google SyntaxNet is the “world’s most accurate parser”

R. Srihari, W Li, C. Niu, T. Cornell: InfoXtract: A Customizable Intermediate Level Information Extraction Engine. Journal of Natural Language Engineering, 12(4), 1-37, 2006

Pride and Prejudice of NLP Main Stream

【主流的傲慢与偏见：规则系统与机器学习】

I. 引言

有回顾NLP（Natural Language Processing）历史的知名学者介绍机器学习（machine learning）取代传统规则系统（rule-based system）成为学界主流的掌故，说20多年前好像经历了一场惊心动魄的宗教战争。必须承认，NLP 这个领域，统计学家的完胜，是有其历史必然性的。机器学习在NLP很多任务上的巨大成果和效益是有目共睹的：机器翻译，语音识别/合成，搜索排序，垃圾过滤，文档分类，自动文摘，词典习得，专名标注，词性标注等（Church 2007）。

然而，近来浏览几篇 NLP 领域代表人物的综述，见其中不乏主流的傲慢与偏见，依然令人惊诧。细想之下，统计学界的确有很多对传统规则系统根深蒂固的成见和经不起推敲但非常流行的蛮横结论。可怕的不是成见，成见无处不在。真正可怕的是成见的流行无阻。而在NLP这个领域，成见的流行到了让人瞠目结舌的程度。不假思索而认同接受这些成见成为常态。因此想到立此存照一下，并就核心的几条予以详论。下列成见随处可见，流传甚广，为免纷扰，就不列出处了，明白人自然知道这绝不是杜撰和虚立的靶子。这些成见似是而非，经不起推敲，却被很多人视为理所当然的真理。为每一条成见找一个相应的规则系统的案例并不难，但是从一些特定系统的缺陷推广到对整个规则系统的方法学上的批判，乃是其要害所在。

【成见一】规则系统的手工编制（hand-crafted）是其知识瓶颈，而机器学习是自动训练的（言下之意：没有知识瓶颈）。

【成见二】规则系统的手工编制导致其移植性差，转换领域必须重启炉灶，而机器学习因为算法和系统保持不变，转换领域只要改变训练数据即可（言下之意：移植性强）。

【成见三】规则系统很脆弱，遇到没有预测的语言现象系统就会 break（什么叫 break，死机？瘫痪？失效？），开发不了鲁棒（robust）产品。

【成见四】规则系统的结果没有置信度，鱼龙混杂。

【成见五】规则系统的编制越来越庞杂，最终无法改进，只能报废。

【成见六】规则系统的手工编制注定其无法实用，不能 scale up，只能是实验室里的玩具。

【成见七】规则系统只能在极狭窄的领域成事，无法实现跨领域的系统。

【成见八】规则系统只能处理规范的语言（譬如说明书、天气预报、新闻等），无法应对 degraded text，如社会媒体、口语、方言、黑话、OCR 文档。

【成见九】规则系统是上个世纪的技术，早已淘汰（逻辑的结论似乎是：因此不可能做出优质系统）。

【成见十】从结果上看，机器学习总是胜过规则系统。

所列“成见”有两类：一类是“偏”见，如【成见一】至【成见五】。这类偏见主要源于不完全归纳，他们也许看到过或者尝试过规则系统某一个类型, 浅尝辄止，然后遽下结论（jump to conclusions）。盗亦有道，情有可原，虽然还是应该对其一一纠“正”。本文即是拨乱反正的第一篇。成见的另一类是谬见，可以事实证明其荒谬。令人惊诧的是，谬见也可以如此流行。【成见五】以降均属不攻自破的谬见。譬如【成见八】说规则系统只能分析规范性语言。事实胜于雄辩，我们开发的以规则体系为主的舆情挖掘系统处理的就是非规范的社交媒体。这个系统的大规模运行和使用也驳斥了【成见六】,可以让读者评判这样的规则系统够不够资格称为实用系统：

以全球500强企业为主要客户的多语言客户情报挖掘系统由前后两个子系统组成。核心引擎是后台子系统（back-end indexing engine），用于对社交媒体大数据做自动分析和抽取。分析和抽取结果用开源的Apache Lucene文本搜索引擎(lucene.apache.org) 存储。生成后台索引的过程基于Map-Reduce框架，利用计算云(computing cloud) 中200台虚拟服务器进行分布式索引。对于过往一年的社会媒体大数据存档（约300亿文档跨越40多种语言），后台索引系统可以在7天左右完成全部索引。前台子系统（front-end app）是基于 SaaS 的一种类似搜索的应用。用户通过浏览器登录应用服务器，输入一个感兴趣的话题，应用服务器对后台索引进行分布式搜索，搜索的结果在应用服务器经过整合，以用户可以预设（configurable）的方式呈现给用户。这一过程立等可取，响应时间不过三四秒。

II. 规则系统手工性的责难

【成见一】说：规则系统的手工编制（hand-crafted）是其知识瓶颈，而机器学习是自动训练的（言下之意：因此没有知识瓶颈）。

NLP主流对规则系统和语言学家大小偏见积久成堆，这第一条可以算是万偏之源。随便翻开计算语言学会议的论文，无论讨论什么语言现象，为了论证机器学习某算法的优越，在对比批评其他学习算法的同时，规则系统大多是随时抓上来陪斗的攻击对象，而攻击的理由往往只有这么一句话，规则系统的手工性决定了 “其难以开发”（或“其不能 scale up”，“其效率低下”，“其不鲁棒”，不一而足），或者干脆不给具体理由，直接说“文献【1】【2】【3】尝试了这个问题的不同方面，但这些系统都是手工编制的”，一句话判处死刑，甚至不用讨论它们的效果和质量。手工性几乎成了规则系统的“原罪”，编制这些系统的语言学家因此成为学术共同体背负原罪的二等公民。

手工编制（hand-crafted）又如何？在日常消费品领域，这是对艺人特别的嘉奖，是对批量机械化生产和千篇一律的反抗，是独特和匠心的代表，是高价格理直气壮的理由。缘何到了NLP领域，突然就成贬义词了呢？这是因为在NLP领域，代表主流的统计学家由于他们在NLP某些任务上的非凡成功，居功自傲，把成功无限夸大，给这个共同体施行了集体催眠术，有意无意引导人相信机器学习是万能的。换句话说，批判手工编制的劣根性，其隐含的前提是机器学习是万能的，有效的，首选的。而实际情况是，面对自然语言的复杂性，机器学习只是划过了语言学的冰山一角，远远没有到主流们自觉或不自觉吹嘘的万能境界。催眠的结果是，不独不少语言学家以及NLP相关利益方（如投资人和用户）被他们洗脑了，连他们自己也似乎逐渐相信了自己编制的神话。

真实世界中，NLP 是应用学科，最终结果体现在应用软件（applications）上，属于语言软件工程。作为一个产业，软件工程领域吸引了无数软件工程师，虽然他们自嘲为“码工”，社会共同体给予他们的尊重和待遇是很高的（Bill Gates 自封了一个 Chief Engineer，说明了这位软件大王对工匠大师的高度重视）。古有鲁班，现有码师（coding master）。这些码工谁不靠手工编制代码作为立足之本呢？没听说一位明星工程师因为编制代码的手工性质而被贬损。同是软件工程，为什么计算语言学家手工编制NLP代码与其他工程师手工编制软件代码，遭遇如此不同的对待。难道是因为NLP应用比其他应用简单？恰恰相反，自然语言的很多应用比起大多数应用（譬如图形软件、字处理软件等等）更加复杂和艰难。解释这种不同遭遇的唯一理由就是，作为大环境的软件领域没有NLP主流的小环境里面那么多的傲慢和偏见。软件领域的大师们还没有狂妄到以为可以靠自动编程取代手工编程。他们在手工编程的基础建设（编程架构和开发环境等）上下功夫，而不是把希望寄托在自动编程的万能上。也许在未来的某一天，一些简单的应用可以用代码自动化来实现，但是复杂任务的全自动化从目前来看是遥遥无期的。不管从什么标准来看，非浅层的自然语言分析和理解都是复杂任务的一种。因此，机器学习作为自动编程的一个体现是几乎不可能取代手工代码的。规则系统的NLP应用价值会长期存在。

自动是一个动听的词汇。如果一切人工智能都是自动学习的，前景该有多么美妙。机器学习因为与自动连接在一起，显得那么高高在上，让人仰视。它承载着人类对未来世界的幻想。这一切理应激励自动学习专家不断创新，而绝不该成为其傲慢和偏见的理由。

在下面具体论述所谓规则系统的知识瓶颈软肋之前，值得一提的是，本文所谓自动是指系统的开发，不要混淆为系统的应用。在应用层面，无论是机器学习出来的系统，还是手工编制的系统，都是全自动地服务用户的，这是软件应用的性质决定的。虽然这是显而易见的事实，可确实有人被误导，一听说手工编制，就引申为基于规则系统的应用也是手工的，或者半自动的。

手工编制NLP系统是不是规则系统的知识瓶颈？毋庸讳言，确实如此。这个瓶颈体现在系统开发的周期上。但是，这个瓶颈是几乎所有大型软件工程项目所共有的，是理所当然的资源成本，不独为 NLP “专美”。从这个意义上看，以知识瓶颈诟病规则系统是可笑的，除非可以证明对所有NLP项目，用机器学习开发系统比编制规则系统，周期短且质量高（个别的项目可能是这样，但一般而言绝非如此，后面还要详谈）。大体说来，对于NLP的浅层应用（譬如中文切词，专名识别，等等），没有三个月的开发，没有至少一位计算语言学家手工编制和调试规则和至少半个工程师的平台层面的支持，是出不来规则系统的。对于NLP的深层应用（如句法分析，舆情抽取等），没有至少一年的开发，涉及至少一位计算语言学家的手工编制规则，至少半个质量检测员的协助和半个工程师的平台支持，外加软件工程项目普遍具有的应用层面的用户接口开发等投入，也是出不来真正的软件产品的。当然需要多少开发资源在很大程度上决定于开发人员（包括作为知识工程师的计算语言学家）的经验和质量以及系统平台和开发环境的基础（infrastructures）如何。

计算语言学家编制规则系统的主体工作是利用形式化工具编写并调试语言规则、各类词典以及语言分析的流程调控。宏观上看，这个过程与软件工程师编写应用程序没有本质不同，不过是所用的语言、形式框架和开发平台（language，formalism and development platform）不同，系统设计和开发的测重点不同而已。这就好比现代的工程师用所谓高级语言 Java 或者 C，与30年前的工程师使用汇编语言的对比类似，本质是一样的编程，只是层次不同罢了。在为NLP特制的“高级”语言和平台上，计算语言学家可以不用为内存分配等非语言学的工程细节所羁绊，一般也不用为代码的优化和效率而烦扰，他们的注意力更多地放在面对自然语言的种种复杂现象，怎样设计语言处理的架构和流程，怎样平衡语言规则的条件宽窄，怎样与QA（质量检测）协调确保系统开发的健康，怎样保证语言学家团队编制规则的操作规范（unit testing，regression testing，code review，baselines，等等）以确保系统的可持续性，怎样根据语言开发需求对于现有形式框架的限制提出扩展要求，以及怎样保证复杂系统的鲁棒性，以及怎样突破规则系统的框架与其他语言处理包括机器学习进行协调，等等。一个领头的计算语言学家就是规则系统的架构师，系统的成败绝不仅仅在于语言规则的编制及其堆积，更多的决定于系统架构的合理性。明星工程师是软件企业的灵魂，NLP 规则系统的大规模成功也一样召唤着语言工程大师。

关于知识瓶颈的偏见，必须在对比中评估。自然语言处理需要语言学知识，把这些知识形式化是每个NLP系统的题中应有之义，机器学习绝不会自动免疫，无需知识的形式化。规则系统需要语言学家手工开发的资源投入，机器学习也同样需要资源的投入，不过是资源方式不同而已。具体说，机器学习的知识瓶颈在于需要大数量的训练数据集。排除研究性强实用性弱的无监督学习（unsupervised learning），机器学习中可资开发系统的方法是有监督的学习（supervised learning）。有监督的学习能开发知识系统成为应用的前提是必须有大量的手工标注的数据，作为学习的源泉。虽然机器学习的过程是自动的（学习算法的创新、调试和实现当然还是手工的），但是大量的数据标注则是手工的（本来就有现成标注不计，那是例外）。因此，机器学习同样面临知识瓶颈，不过是知识瓶颈的表现从需要少量的语言学家变成需要大量的低端劳动者（懂得语言及其任务的中学生或大学生即可胜任）。马克思说金钱是一般等价物，知识瓶颈的问题于是转化为高级劳动低级劳动的开销和转换问题：雇佣一个计算语言学家的代价大，还是雇佣10个中学生的代价大？虽然这个问题根据不同项目不同地区等因素答案会有不同，但所谓机器学习没有知识瓶颈的神话可以休矣。

另外，知识瓶颈的对比问题不仅仅是针对一个应用而言，而应该放在多应用的可移植性上来考察。我们知道大多数非浅层的NLP应用的技术支持都源于从自然语言做特定的信息抽取：抽取关系、事件、舆情等。由于机器学习把信息抽取看成一个直接对应输入和输出的黑匣子，所以一旦改变信息抽取目标和应用方向，以前的人工标注就废弃了，作为知识瓶颈的标注工作必须完全重来。可是规则系统不同，它通常设计成一个规则层级体系，由独立于领域的语言分析器（parser）来支持针对领域的信息抽取器（extractor）。结果是，在转移应用目标的时候，作为技术基础的语言分析器保持不变，只需重新编写不同的抽取规则而已。实践证明，对于规则系统，真正的知识瓶颈在语言分析器的开发上，而信息抽取本身花费不多。这是因为前者需要应对自然语言变化多端的表达方式，将其逻辑化，后者则是建立在逻辑形式（logical form）上，一条规则等价于底层规则的几百上千条。因此，从多应用的角度看，规则系统的知识成本趋小，而机器学习的知识成本则没有这个便利。

III. 主流的反思

如前所述，NLP领域主流意识中的成见很多，积重难返。世界上还很少有这样的怪现象：号称计算语言学（Computational Linguistics）的领域一直在排挤语言学和语言学家。语言学家所擅长的规则系统，与传统语言学完全不同，是可实现的形式语言学（Formal Linguistics）的体现。对于非浅层的NLP任务，有效的规则系统不可能是计算词典和文法的简单堆积，而是蕴含了对不同语言现象的语言学处理策略（或算法）。然而，这一路研究在NLP讲台发表的空间日渐狭小，资助亦难，使得新一代学人面临技术传承的危险。Church （2007）指出，NLP研究统计一边倒的状况是如此明显，其他的声音已经听不见。在浅层NLP的低垂果实几乎全部采摘完毕以后，当下一代学人面对复杂任务时，语言学营养缺乏症可能导致统计路线捉襟见肘。

可喜的是，近年来主流中有识之士（如，Church 2007, Wintner 2009）开始了反思和呼吁，召唤语言学的归来：“In essence, linguistics is altogether missing in contemporary natural language engineering research. … I want to call for the return of linguistics to computational linguistics.”（Wintner 2009）。相信他们的声音会受到越来越多的人的注意。

参考文献

Church 2007. A Pendulum Swung Too Far. Linguistics issues in Language Technology, Volume 2, Issue 4.
Wintner 2009. What Science Underlies Natural Language Engineering? Computational Linguistics, Volume 35, Number 4

原载《W. Li & T. Tang: 主流的傲慢与偏见：规则系统与机器学习》
【计算机学会通讯】2013年第8期（总第90期）

[Abstract]

Pride and Prejudice in Mainstream: Rule System vs. Machine Learning

In the area of Computational Linguistics, there are two basic approaches to natural language processing, the traditional rule system and the mainstream machine learning. They are complementary and there are pros and cons associated with both. However, as machine learning is the dominant mainstream philosophy reflected by the overwhelming ratio of papers published in academia, the area seems to be heavily biased against the rule system methodology. The tremendous success of machine learning as applied to a list of natural language tasks has reinforced the mainstream pride and prejudice in favor of one and against the other. As a result, there are numerous specious views which are often taken for granted without check, including attacks on the rule system's defects based on incomplete induction or misconception. This is not healthy for NLP itself as an applied research area and exerts an inappropriate influence on the young scientists coming to this area. This is the first piece of a series of writings aimed at correcting the prevalent prejudice, focused on the in-depth examination of the so-called hand-crafted defect of the rule system and the associated knowledge bottleneck issue.

【相关】

K. Church: A Pendulum Swung Too Far, Linguistics issues in Language Technology, 2011; 6(5)

【科普随笔：NLP主流的傲慢与偏见】

On Hand-crafted Myth and Knowledge Bottleneck

Domain portability myth in natural language processing

【关于NLP方法论以及两条路线之争】专栏：NLP方法论

It is untrue that Google SyntaxNet is the “world’s most accurate parser”

On Recall of Grammar Engineering Systems

After I showed the benchmarking results of SyntaxNet and our rule system based on grammar engineering, many people seem to be surprised by the fact that the rule system beats the newest deep-learning based parser in data quality. I then got asked many questions, one question is:

Q: We know that rules crafted by linguists are good at precision, how about recall?

This question is worth a more in-depth discussion and serious answer because it touches the core of the viability of the "forgotten" school: why is it still there? what does it have to offer? The key is the excellent data quality as advantage of a hand-crafted system, not only for precision, but high recall is achievable as well.

Before we elaborate, here was my quick answer to the above question:

Unlike precision, recall is not rules' forte, but there are ways to enhance recall;
To enhance recall without precision compromise, one needs to develop more rules and organize the rules in a hierarchy, and organize grammars in a pipeline, so recall is a function of time;
To enhance recall with limited compromise in precision, one can fine-tune the rules to loosen conditions.

Let me address these points by presenting the scene of action for this linguistic art in its engineering craftsmanship.

A rule system is based on compiled computational grammars. A grammar is a set of linguistic rules encoded in some formalism. What happens in grammar engineering is not much different from other software engineering projects. As knowledge engineer, a computational linguist codes a rule in a NLP-specific language, based on a development corpus. The development is data-driven, each line of rule code goes through rigid unit tests and then regression tests before it is submitted as part of the updated system. Depending on the design of the architect, there are all types of information available for the linguist developer to use in crafting a rule's conditions, e.g. a rule can check any elements of a pattern by enforcing conditions on (i) word or stem itself (i.e. string literal, in cases of capturing, say, idiomatic expressions), and/or (ii) POS (part-of-speech, such as noun, adjective, verb, preposition), (iii) and/or orthography features (e.g. initial upper case, mixed case, token with digits and dots), and/or (iv) morphology features (e.g. tense, aspect, person, number, case, etc. decoded by a previous morphology module), (v) and/or syntactic features (e.g. verb subcategory features such as intransitive, transitive, ditransitive), (vi) and/or lexical semantic features (e.g. human, animal, furniture, food, school, time, location, color, emotion). There are almost infinite combinations of such conditions that can be enforced in rules' patterns. A linguist's job is to use such conditions to maximize the benefits of capturing the target language phenomena, through a process of trial and error.

Given the description of grammar engineering as above, what we expect to see in the initial stage of grammar development is a system precision-oriented by nature. Each rule developed is geared towards a target linguistic phenomenon based on the data observed in the development corpus: conditions can be as tight as one wants them to be, ensuring precision. But no single rule or a small set of rules can cover all the phenomena. So the recall is low in the beginning stage. Let us push things to extreme, if a rule system is based on only one grammar consisting of only one rule, it is not difficult to quickly develop a system with 100% precision but very poor recall. But what is good of a system that is precise but without coverage?

So a linguist is trained to generalize. In fact, most linguists are over-trained in school for theorizing and generalization before they get involved in software industrial development. In my own experience in training new linguists into knowledge engineers, I often have to de-train this aspect of their education by enforcing strict procedures of data-driven and regression-free development. As a result, the system will generalize only to the extent allowed to maintain a target precision, say 90% or above.

It is a balancing art. Experienced linguists are better than new graduates. Out of explosive possibilities of conditions, one will only test some most likely combination of conditions based on linguistic knowledge and judgement in order to reach the desired precision with maximized recall of the target phenomena. For a given rule, it is always possible to increase recall at compromise of precision by dropping some conditions or replacing a strict condition by a loose condition (e.g. checking a feature instead of literal, or checking a general feature such as noun instead of a narrow feature such as human). When a rule is fine-tuned with proper conditions for the desired balance of precision and recall, the linguist developer moves on to try to come up with another rule to cover more space of the target phenomena.

So, as the development goes on, and more data from the development corpus are brought to the attention on the developer's radar, more rules are developed to cover more and more phenomena, much like silkworms eating mulberry leaves. This is incremental enhancement fairly typical of software development cycles for new releases. Most of the time, newly developed rules will overlap with existing rules, but their logical OR points to an enlarged conquered territory. It is hard work, but recall gradually, and naturally, picks up with time while maintaining precision until it hits long tail with diminishing returns.

There are two caveats which are worth discussing for people who are curious about this "seasoned" school of grammar engineering.

First, not all rules are equal. A non-toy rule system often provides mechanism to help organize rules in a hierarchy for better quality as well as easier maintenance: after all, a grammar hard to understand and difficult to maintain has little prospect for debugging and incremental enhancement. Typically, a grammar has some general rules at the top which serve as default and cover the majority of phenomena well but make mistakes in the exceptions which are not rare in natural language. As is known to all, naturally language is such a monster that almost no rules are without exceptions. Remember in high school grammar class, our teacher used to teach us grammar rules. For example, one rule says that a bare verb cannot be used as predicate with third person singular subject, which should agree with the predicate in person and number by adding -s to the verb: hence, She leaves instead of *She leave. But soon we found exceptions in sentences like The teacher demanded that she leave. This exception to the original rule only occurs in object clauses following certain main clause verbs such as demand, theoretically labeled by linguists as subjunctive mood. This more restricted rule needs to work with the more general rule to result in a better formulated grammar.

Likewise, in building a computational grammar for automatic parsing or other NLP tasks, we need to handle a spectrum of rules with different degrees of generalizations in achieving good data quality for a balanced precision and recall. Rather than adding more and more restrictions to make a general rule not to overkill the exceptions, it is more elegant and practical to organize the rules in a hierarchy so the general rules are only applied as default after more specific rules are tried, or equivalently, specific rules are applied to overturn or correct the results of general rules. Thus, most real life formalisms are equipped with hierarchy mechanism to help linguists develop computational grammars to model the human linguistic capability in language analysis and understanding.

The second point that relates to the topic of recall of a rule system is so significant but often neglected that it cannot be over-emphasized and it calls for a separate writing in itself. I will only present a concise conclusion here. It relates to multiple levels of parsing that can significantly enhance recall for both parsing and parsing-supported NLP applications. In a multi-level rule system, each level is one module of the system, involving a grammar. Lower levels of grammars help build local structures (e.g. basic Noun Phrase), performing shallow parsing. System thus designed are not only good for modularized engineering, but also great for recall because shallow parsing shortens the distance of words that hold syntactic relations (including long distance relations) and lower level linguistic constructions clear the way for generalization by high level rules in covering linguistic phenomena.

In summary, a parser based on grammar engineering can reach very high precision and there are proven effective ways of enhancing its recall. High recall can be achieved if enough time and expertise are invested in its development. In case of parsing, as shown by test results, our seasoned English parser is good at both precision (96% vs. SyntaxNet 94%) and recall (94% vs. SyntaxNet 95%, only 1 percentage point lower than SyntaxNet) in news genre, and with regards to social media, our parser is robust enough to beat SyntaxNet in both precision (89% vs. SyntaxNet 60%) and recall (72% vs. SyntaxNet 70%).

R. Srihari, W Li, C. Niu, T. Cornell: InfoXtract: A Customizable Intermediate Level Information Extraction Engine. Journal of Natural Language Engineering, 12(4), 1-37, 2006

K. Church: A Pendulum Swung Too Far, Linguistics issues in Language Technology, 2011; 6(5)

Pros and Cons of Two Approaches: Machine Learning vs Grammar Engineering

Pride and Prejudice of NLP Main Stream

On Hand-crafted Myth and Knowledge Bottleneck

Domain portability myth in natural language processing

Announcing SyntaxNet: The World’s Most Accurate Parser Goes Open

It is untrue that Google SyntaxNet is the "world’s most accurate parser"

As we all know, natural language parsing is fairly complex but instrumental in Natural Language Understanding (NLU) and its applications. We also know that a breakthrough to 90%+ accuracy for parsing is close to human performance and is indeed an achievement to be proud of. Nevertheless, following the common sense, we all have learned that you got to have greatest guts to claim the "most" for anything without a scope or other conditions attached, unless it is honored by authoritative agencies such as Guinness. For Google's claim of "the world's most accurate parser", we only need to cite one system out-performing theirs to prove its being untrue or misleading. We happen to have built one.

For a long time, we know that our English parser is near human performance in data quality, and is robust, fast and scales up to big data in supporting real life products. For the approach we take, i.e. the approach of grammar engineering, which is the other "school" from the mainstream statistical parsing, this was just a natural result based on the architect's design and his decades of linguistic expertise. In fact, our parser reached near-human performance over 5 years ago, at a point of diminishing returns, hence we decided not to invest heavily any more in its further development. Instead, our focus was shifted to its applications in supporting open-domain question answering and fine-grained deep sentiment analysis for our products, as well as to the multilingual space.

So a few weeks ago when Google announced SyntaxNet, I was bombarded by the news cited to me from all kinds of channels by many colleagues of mine, including my boss and our marketing executives. All are kind enough to draw my attention to this "newest breakthrough in NLU" and seem to imply that we should work harder, trying to catch up with the giant.

In my mind, there has never been doubt that the other school has a long way before they can catch us. But we are in information age, and this is the power of Internet: eye-catching news from or about a giant, true or misleading, instantly spreads to all over the world. So I felt the need to do some study, not only to uncover the true picture of this space, but more importantly, also to attempt to educate the public and the young scholars coming to this field that there have always been and will always be two schools of NLU and AI (Artificial Intelligence). These two schools actually have their respective pros and cons, they can be complementary and hybrid, but by no means can we completely ignore or replace one by the other. Plus, how boring a world would become if there were only one approach, one choice, one voice, especially in core cases of NLU such as parsing (as well as information extraction and sentiment analysis, among others) where the "select approach" does not perform nearly as well as the forgotten one.

So I instructed a linguist who was not involved in the development of the parser to benchmark both systems as objectively as possible, and to give an apples-to-apples comparison of their respective performance. Fortunately, the Google SyntaxNet outputs syntactic dependency relationships and ours is also mainly a dependency parser. Despite differences in details or naming conventions, the results are not difficult to contrast and compare based on linguistic judgment. To make things simple and fair, we fragment a parse tree of an input sentence into binary dependency relations and let the testor linguist judge; once in doubt, he will consult another senior linguist to resolve, or to put on hold if believed to be in gray area, which is rare.

Unlike some other areas of NLP tasks, e.g. sentiment analysis, where there is considerable space of gray area or inter-annotator disagreement, parsing results are fairly easy to reach consensus among linguists. Despite the different format such results are embodied in by two systems (an output sample is shown below), it is not difficult to make a direct comparison of each dependency in the sentence tree output of both systems. (To be stricter on our side, a patched relationship called Next link used in our results do not count as a legit syntactic relation in testing.)

SyntaxNet output:

1.Input: President Barack Obama endorsed presumptive Democratic presidential nominee Hillary Clinton in a web video Thursday .
Parse:
endorsed VBD ROOT
 +-- Obama NNP nsubj
 |   +-- President NNP nn
 |   +-- Barack NNP nn
 +-- Clinton NNP dobj
 |   +-- nominee NN nn
 |   |   +-- presumptive JJ amod
 |   |   +-- Democratic JJ amod
 |   |   +-- presidential JJ amod
 |   +-- Hillary NNP nn
 +-- in IN prep
 |   +-- video NN pobj
 |       +-- a DT det
 |       +-- web NN nn
 +-- Thursday NNP tmod
 +-- . . punct

Netbase output:

Benchmarking was performed in two stages as follows.

Stage 1, we select English formal text in the news domain, which is SyntaxNet's forte as it is believed to have much more training data in news than in other styles or genres. The announced 94% accuracy in news parsing is indeed impressive. In our case, news is not the major source of our development corpus because our goal is to develop a domain-independent parser to support a variety of genres of English text for real life applications on text such as social media (informal text) for sentiment analysis, as well as technology papers (formal text) for answering how questions.

We randomly select three recent news article for this testing, with the following links.

(1) http://www.cnn.com/2016/06/09/politics/president-barack-obama-endorses-hillary-clinton-in-video/
(2) Part of news from: http://www.wsj.com/articles/nintendo-gives-gamers-look-at-new-zelda-1465936033
(3) Part of news from: http://www.cnn.com/2016/06/15/us/alligator-attacks-child-disney-florida/

Here are the benchmarking results of parsing the above for the news genre:

(1) Google SyntaxNet: F-score= 0.94
(tp for true positive, fp for false positive, tn for true negative;
P for Precision, R for Recall, and F for F-score)

P = tp/(tp+fp) = 1737/(1737+104) = 1737/1841 = 0.94
R = tp/(tp+tn) = 1737/(1737+96) = 1737/1833 = 0.95
F= 2*[(P*R)/(P+R)] = 2*[(0.94*0.95)/(0.94+0.95)] = 2*(0.893/1.89) = 0.94

(2) Netbase parser: F-score = 0.95

P = tp/(tp+fp) = 1714/(1714+66) = 1714/1780 = 0.96
R = tp/(tp+tn) = 1714/(1714+119) = 1714/1833 = 0.94
F = 2*[(P*R)/(P+R)] = 2*[(0.96*0.94)/(0.96+0.94)] = 2*(0.9024/1.9) = 0.95

So the Netbase parser is about 2 percentage points better than Google SyntaxNet in precision but 1 point lower in recall. Overall, Netbase is slightly better than Google in the precision-recall combined measures of F-score. As both parsers are near the point of diminishing returns for further development, there is not too much room for further competition.

Stage 2, we select informal text, from social media Twitter to test a parser's robustness in handling "degraded text": as is expected, degraded text will always lead to degraded performance (for a human as well as a machine), but a robust parser should be able to handle it with only limited degradation. If a parser can only perform well in one genre or one domain and the performance drastically falls in other genres, then this parser is not of much use because most genres or domains do not have as large labeled data as the seasoned news genre. With this knowledge bottleneck, a parser is severely challenged and limited in its potential to support NLU applications. After all, parsing is not the end, but a means to turn unstructured text into structures to support semantic grounding to various applications in different domains.

We randomly select 100 tweets from Twitter for this testing, with some samples shown below.

1.Input: RT @ KealaLanae : ?? ima leave ths here. https : //t.co/FI4QrSQeLh2.Input: @ WWE_TheShield12 I do what I want jk I ca n't kill you .10.Input: RT @ blushybieber : Follow everyone who retweets this , 4 mins?

20.Input: RT @ LedoPizza : Proudly Founded in Maryland. @ Budweiser might have America on their cans but we think Maryland Pizza sounds better

30.Input: I have come to enjoy Futbol over Football ⚽️

40.Input: @ GameBurst That 's not meant to be rude. Hard to clarify the joke in tweet form .

50.Input: RT @ undeniableyella : I find it interesting , people only talk to me when they need something ...

60.Input: Petshotel Pet Care Specialist Jobs in Atlanta , GA # Atlanta # GA # jobs # jobsearch https : //t.co/pOJtjn1RUI

70.Input: FOUR ! BUTTLER nailed it past the sweeper cover fence to end the over ! # ENG - 91/6 -LRB- 20 overs -RRB- . # ENGvSL https : //t.co/Pp8pYHfQI8

79..Input: RT @ LenshayB : I need to stop spending money like I 'm rich but I really have that mentality when it comes to spending money on my daughter

89.Input: RT MarketCurrents : Valuation concerns perk up again on Blue Buffalo https : //t.co/5lUvNnwsjA , https : //t.co/Q0pEHTMLie

99.Input: Unlimited Cellular Snap-On Case for Apple iPhone 4/4S -LRB- Transparent Design , Blue/ https : //t.co/7m962bYWVQ https : //t.co/N4tyjLdwYp

100.Input: RT @ Boogie2988 : And some people say , Ethan 's heart grew three sizes that day. Glad to see some of this drama finally going away. https : //t.co/4aDE63Zm85

Here are the benchmarking results for the social media Twitter:

(1) Google SyntaxNet: F-score = 0.65

P = tp/(tp+fp) = 842/(842+557) = 842/1399 = 0.60
R = tp/(tp+tn) = 842/(842+364) = 842/1206 = 0.70
F = 2*[(P*R)/(P+R)] = 2*[(0.6*0.7)/(0.6+0.7)] = 2*(0.42/1.3) = 0.65

Netbase parser: F-score = 0.80

P = tp/(tp+fp) = 866/(866+112) = 866/978 = 0.89
R = tp/(tp+tn) = 866/(866+340) = 866/1206 = 0.72
F = 2*[(P*R)/(P+R)] = 2*[(0.89*0.72)/(0.89+0.72)] = 2*(0.64/1.61) = 0.80

For the above benchmarking results, we leave it to the next blog for interesting observations and more detailed illustration, analyses and discussions.

To summarize, our real life production parser beats Google's research system SyntaxtNet in both formal news text (by a small margin as we both are already near human performance) and informal text, with a big margin of 15 percentage points. Therefore, it is safe to conclude that Google's SytaxNet is by no means "world’s most accurate parser", in fact, it has a long way to get even close to the Netbase parser in adapting to the real world English text of various genres for real life applications.

K. Church: "A Pendulum Swung Too Far", Linguistics issues in Language Technology, 2011; 6(5)

Pros and Cons of Two Approaches: Machine Learning vs Grammar Engineering

Dr. Wei Li's English Blog on NLP

Is Google SyntaxNet Really the World's Most Accurate Parser?

Google is a giant and its marketing is more than powerful. While the whole world was stunned at their exciting claim in Natural Language Parsing and Understanding, while we respect Google research and congratulate their breakthrough in statistical parsing space, we have to point out that their claim in their recently released blog that that SyntaxNet is the "world’s most accurate parser" is simply not true. In fact, far from truth.

The point is that they have totally ignored the other school of NLU, which is based on linguistic rules, as if it were non-existent. While it is true that for various reasons, the other school is hardly presented any more in academia today due to the mainstream's dominance by machine learning (which is unhealthy but admittedly a reality, see Church's long article for a historical background of this inbalance in AI and NLU: K. Church: "A Pendulum Swung Too Far"）, any serious researcher knows that it has never vanished from the world, and it actually has been well developed in industry's real life applications for many years, including ours.

In the same blog, Google mentioned that Parsey McParseface is the "most accurate such model in the world", with model referring to "powerful machine learning algorithms". This statement seems to be true based on their cited literature review, but the equating this to the "world's most accurate parser" publicized in the same blog news and almost instantly disseminated all over the media and Internet is simply irresponsible, and misleading at the very least.

In the next blog of mine, I will present an apples-to-apples comparison of Google's SyntaxNet with the NetBase deep parser to prove and illustrate the misleading nature of Google's recent announcement.

Stay tuned.

Announcing SyntaxNet: The World’s Most Accurate Parser Goes Open

K. Church: "A Pendulum Swung Too Far", Linguistics issues in Language Technology, 2011; 6(5)

Pros and Cons of Two Approaches: Machine Learning vs Grammar Engineering

Dr. Wei Li's English Blog on NLP

【没有翻不了的案子，兼论专家vs学习的现状】

白:
分层不是要害，one way才是

我:

什么叫 one way? 没有不可推翻的。原则上讲，如果某个现象足够突出，值得去做，NLP 就没有翻不了的案子。连毛太祖钦定的文化大革命都全面否定、彻底翻案了。

Parsing的分层设计本身隐含了语言学的流程和算法，但与一切的语言学规则一样，规则的背后就是例外。只不过规则及其例外构成的 hierarchy 在同一层表现，而分层的例外则在 pipeline（管式）中处置。经常是做几层就加一层 patching 做例外处置或修正，也有留到最后通过【词驱动】（word-driven）去唤醒的。词驱动不单单是词，可以是任意可能 trigger 歧义休眠及其唤醒的 ngram。（非词驱动的唤醒，如果需要，还需要研究，目前不太清晰。）但凡是可以词驱动的，问题就不大，因为词驱动聚焦了特定的歧义现象，错误的 parse 在聚焦为有限子树patterns以后是可以预计的，当然也就可以修正。错误不怕，就怕错误不可预测。可预测的 consistent 的错误，在管式架构下不是挑战，不必担心其 error propagation，如果设计者具有“负负得正”的准备和机制的话。

白:
唤醒的ngram再进一步，就是一个CNN了。parsing用明网RNN，休眠唤醒用暗网CNN。相得益彰啊。

我:
听上去高大上，cnn fox abc 呵呵

白:
多层卷积

我:

我骨子里是相信数据的，相信大数据的自动学习，因为太多的语言细节隐含其内，终归是可以挖掘出来帮助parsing和理解的。但不大相信短期内可以改天换地，匹敌专家的经验积累。

syntaxnet 已经被我剁成稀泥了。但同时也注意到 statistical parsing 的精度在最成熟的文体 news 方面，很多年 stuck 在 80 以下，syntaxnet 确实突破了 90，这个成就让他们忘乎所以一把、吹点不符合实际的牛也是情有可原的，虽然多年前我们就把规则系统做到了 90 以上的parsing精度，当时的感觉是理所当然，貌似苦力不值得弹冠相庆（不是蛮力，当然也不可能仅仅是力气活，还有架构者的设计匠心和类似 dark art 一样不可言传的绝技，譬如经年经验加研究而来的 NL“毛毛虫”的formalism及其实现，等等）。没有炫耀，就这么一直默默地领先了“主流”很多年。

虽然仍然无法匹敌规则系统，但深度神经的运用的确使得统计型parser有了新闻领域内的90的突破。很好奇他们如今用了多大的训练库，还用了什么 tricks（据报道行内达人声称真正能玩转深度神经系统的大牛全世界不过百人，因为里面不仅仅是科学，还是 art），其他人多快可以重复结果？最后的大问题是，cnn rnn 等深度神经的牛算法，多快可以移植到新的文体、新的domain和新的语言，这种成功移植的最低条件（譬如最少需要多大的带标数据）是什么。未来的某个时候，如果新的文体新的语言，就像流水线一样，可以高质量快速自动学习出来一个个可应用的 parser 出来，语言学专家们也就死得其所，可以安然地“永垂不朽”了。

不过，在共产主义神经大同真能实现之前，专家还不愁饭碗。

在 parsing 这个NLP核心任务方面，要赶上专家的系统质量也并非易事，因为专家的系统已经证明可以做到非常接近人的分析水平，而且文体和领域独立，鲁棒、线速且可以 scale up，这对学习有诸多挑战。Deep parsing，专家一边是 production system，已经达到实用的高度，学习一边还是 research 在努力追赶，这就是 parsing 质量的现状。可很多人误导或被误导，把深度神经未来可能的成功当成现实或铁定，完全无视专家系统现实的存在。

【相关】

【泥沙龙笔记：语法工程派与统计学习派的总结】

《新智元笔记：NLP 系统的分层挑战》

【NLP 是一个力气活：再论成语不是问题】

【科普随笔：NLP主流的傲慢与偏见】

【关于NLP方法论以及两条路线之争】专栏：NLP方法论

【NLP 迷思之四：词义消歧（WSD）是NLP应用的瓶颈】

【语义计算群：带歧义或模糊前行，有如带病生存】

众所周知，作为符号系统，自然语言与电脑语言的最大差异和挑战在于其歧义性，有两类，结构歧义（structural ambiguity）和一词多义（相应的消歧任务叫WSD，word sense disambiguation）。如果没有这些随处可见的歧义，自然语言的自动分析就会与电脑语言的编译一样做到精准无误。因此，一般认为，自然语言parsing和NLU（自然语言理解）的核心任务就是消歧。至少理论上如此。

有意思的是，尽管自然语言一词多义极为普遍，结构歧义也颇常见，人类用语言交流却相当流畅，很多时候人根本就没有感觉到歧义的存在。只是到了我们做 parser 在计算机上实现的时候，这个问题才凸显。与宋老师的下列对话显示，计算语言学家模拟结构分析常遭遇歧义。

宋:
“张三对李四的批评咬牙切齿”，这是两可。
“張三对李四的批评不置一词”，这里有第三种可能。
“張三对李四的批评保持中立”，另一种两可。
“張三对李四的批评态度温和”，这是三可了。

我:
宋老师我已经晕了。您是计算语言学家的敏感或敏锐，绝大多数 native speakers 是感觉不到这些句子之间的结构歧义及其不同之处的。

t0708o

目前的 parsing 结果，“保持中立” 的主语（S）是“批评”，这个解读不是不可能（批评意见的保持中立，可以间接指代给出这个批评的“张三”），但很勉强；多数人的解读应该是：“张三” 保持中立，“张三”不是“批评”的主语，“李四”是，不仅如此，“批评”隐含宾语回指到“张三”。第二句的parse倒显得更合理一些，关于这个“批评”（Topic），（其）“态度"是"温和"的，指代的是“张三”，而“批评”“李四”的正是“张三”。

宋:
“张三对李四的批评”+谓语，就批评者和被批评者来讲，有3种填坑的可能：
（1）批评者是张三，被批评者是李四。（2）批评者是李四，被批评者是张三。（3）批评者是李四，被批评者是第三者。
“置若罔闻”与“不置一词”不一样。对于这个V的主体A来说，一定是有一个评论，“置若罔闻”是说该评论是针对A的，而且是负面的；“不置一词”则没有这两条限制。

我:
两个逻辑谓词（句末的谓语和前面的“批评”）抢同一个PP（对），计算上总会遇到 scope 纠缠。再加一个 “对（or 对于）” 歧义就没了。“张三【‘对于’【‘对’李四的批评】保持中立】。” 可是两个 “对” 听起来别扭，很少人这么用。

结构歧义其实没有我们想象的可怕。如果目标是语义落地需要调整的不是追求落地前消灭一切歧义，而是反过来思维，如何让语义落地能够容忍歧义的保留，或者歧义的休眠，或者任意的某个 valid 的路径。其实人的理解和响应也不是在 ambiguity-free 的前提下进行。现代医学有一个概念，叫带病生存。语言理解也应该有一个概念，带歧义落地。适度的歧义作为常态来容忍。

这是结构歧义，WSD 更是如此。绝大多数语义落地可以容忍或绕过 WSD 的不作为（【NLP 迷思之四：词义消歧（WSD）是NLP应用的瓶颈】）。MT 可能是对 WSD 最敏感的一个语义落地的应用了。即便如此，也并非先做好 WSD 然后才能做好 MT 落地（MT中叫 “lexical transfer”）。有亲戚关系的语言对之间有很大的 keep ambiguity untouched 的空间自不必说。即便在不相关的语系之间，譬如英汉的MT中，实践证明，全方位的 WSD 也是不必要的。细线条的 WSD 则更不必要。细线条指的是词典里面的那些义项，或 WordNet 中 synsets，其中的很多本义和引申义的细微差别没有必要区分。

还有那些那些 hidden 的逻辑语义，是不是要挖掘出来呢？迄今为止，我们在句法后的语义中间件中做了部分这样的工作，但一直没有全力以赴去做全，虽然因为句法结构树已经提供了很好的条件了，这个工作并不是高难度的。

今天思考的结果是，其实很多 hidden links 没有必要整出来。如果一个 hidden link 本身就很模糊或歧义，那就更应该置之不理。自然语言带有相当程度的模糊性，语言本身也不是为了把每个细节都弄清白。人的交流不需要。如果一个细节足够重要，但这个细节在表达上是 hidden 的，省略的，或模糊的，那么人的交流就会在接下去的句子中把它 explicitly 用清晰无误的句法结构表达出来。

从语义落地的实践中也发现，大多数的 hidden links 也是不必要的。背后的道理是：信息流动的常态是不完整，不完整在信息交流中起到了减轻记忆负担、强化信息核心的重要作用。

理论上，每一个提到的谓词都有自己的 arg structure，里面都有潜在的坑，需要信息的萝卜来填。但语言的句法会区分谓词的不同地位，来决定是否把萝卜显性地表达出来，或隐去萝卜。常见的情形是，隐去、省略的萝卜或者不重要，或者不确定，都是信息交流双方不太 care 的细节。譬如一个动词 nominalize 后，就往往隐去 args （英语的动名词，汉语利用“的”的NP句式）。这种自然的隐去已经说明了细节不是关注点，我们何苦要硬去究它呢？

当然，上面说的是原则。凡原则一定有例外，某个隐去的细节如果不整明白，语义就很难落地到某个产品。能想到的“例外”就是，很多 hidden links 虽然其语义本身在语用上不是重要的信息，但是至少在 MT 的产品中，这个 hidden link 可以提供结构条件，帮助确定更合适的译词： e.g. this mistake is easy to make：make 与 mistake 的 hidden VO link 不整出来，就很难确定 make 的合适译法为 “犯（错误）”
关于隐去或省略的大多是不重要的，因此也 NLU 通常不 decode 出来也 OK，可以举个极端的例子来说明：

Giving to the poor is a virtue
Giving is a virtue

give 是一个 3-arg 的谓词，who give what to whom，但是在句法的名物化过程中，我们看到第一句只显性保留了一个萝卜（“to the poor”）。第二句连一个萝卜也没有。
我们要不要从上下文或利用标配去把这些剩下的坑都填上呢？
不。

白:
从陈述性用法“降格”为指称性用法的时候，对坑所采取的态度应该是八个字：“来者不拒、过时不候。” 比如，"这本书，出版比不出版好。"
我们没有必要关心谁出版，但是既然提高了这本书，填坑也就是一个举手之劳。

我：
很同意。就是说，一般来说对于这些有坑近处没萝卜的，我们不要觉得愧疚和心虚，who cares

【相关】

【一日一parsing：休眠唤醒的好例子】

白:
IPTV首页广告语“IPTV，电视新看法”。
需要休眠唤醒的好例子。

我:
t0796a
对，这个 hidden ambiguity 与 “难过”、“好过”一样，可以也应该休眠唤醒。如果真想做的话，就在“看法”的词条下，在句法后做一个词例化规则：
查一下“看法”（默认词义：viewpoint，有个 human 的坑）的 Mod 来唤醒第二条路径（词义）。
查 Mod 的条件有宽松两个选择，可以根据大数据测试的质量来权衡定夺：
（1）宽的话，如果 Mod 不是 Human，那就唤醒
（2）窄的话，如果 Mod 是“看”的搭配词（看电影、看电视、看戏、看热闹、看耍猴 etc），那就唤醒

白:
作为一个词的“看法”有两个坑，“谁”对“什么”的看法。“谁”即使锁定到human和“电视”不搭调了，还有“什么”可供填充。所以，单纯从一个坑的不匹配，还不足以翻盘。需要“看-电视”这样的强搭配来推波助澜里应外合。

我：
好！
唤醒的是句法层面的定中关系（默认的是词典的合成词，可以看成词的黑箱子，也可以看成是词法内部的定中关系）：“看-法”（而不是“看法”），语义落地在 MT 上的表现就是选择第二个译法：way of 看。因为搭配找着了，MT 也就可以把“看”的翻译从默认的“see”改成搭配的“watch”：
IPTV，电视新看法 ==》
IPTV, a new way of watching TV (而不是 TV's new viewpoint）
这个思路是没有问题的虽然需要花点儿力气。至于选择做还是不做那是另一个问题。
在“难过” vs “难-过” 这样的case上，我们做 sentiment，是选择了去做，用到上述休眠唤醒的招术，把表达主观负面情绪的默认的“难过”解读，语义落地为表达客观负面情形的“难-过”。见【立委科普：歧义parsing的休眠唤醒机制再探】。

梁:
“IPTV，电视新看法” 这句话，连我都休眠了好一阵，刚刚被唤醒。
语义落地到 MT, 是 “ IPTV, a new way to watch TV" ?
so it is either, 一种看电视到新方法， or 一种对电视的新的看法? ”看法“ could be "opinion"?

我:
广告跟段子一样，故意弄这些，为了俏皮，加深印象。
可喜的是，至少我们寻到了解决这类段子似的俏皮话的计算机实现的路径。
休眠唤醒的事儿，以前在语义落地的时候“自发地”做过，但从来没有总结到现在的高度，可以有意识地推广运用。这是在本群与白老师等探讨的收获。
hidden ambiguity 和段子这类的parsing，以前一般都认为是难以企及的语言难题。至少现在不是那么遥不可及了。

梁:
想想这事是挺有趣。“看法”，本来挺强劲的结合，硬是给左边的“电视”拆开了。这类暗中较劲，猜想、比较、争吵，分裂，结合，settle 的过程，人的大脑 parsing 的时候，可能也有这样的过程。据说人思考时经历一种微型达尔文过程。

白:
【转神段子】航拍记录显示，湖北已经基本都是湖，找不着北了。
看看如何休眠唤醒。

K:
他伸出双手，要露上两手。

我:
唤醒啥? “找不着北” 是成语：
t0707a

白:
湖北两个字，只剩一个字了另一个字找不着了
元层次和对象层次的纠缠

我:
并不影响语义理解

白:
影响段子理解

我:
俏皮的定性如果也是语义的一部分可以利用机械扫描匹配来发掘无需与句法层发生纠缠

白:
总之是要吃回头草

我:
不属于核心语义。核心语义是湖北这个地儿现如今到处是水方向辨别不清了

白:
核心语义没发生颠覆或荒诞化只是附加了一层元层次上的俏皮。段子理解，有颠覆型的也有附加型的。

我:
“伸出双手露上两手” 同理同机制也可以扫描匹配发掘后半部也是成语唤醒的是成语的非成语解读。这与切词中的 hidden ambiguity 同理。

白:
俏皮也是附加。还有我之前举的例子，“贾宝玉托举林黛玉，纯粹是多此一举”。先导词语素的重叠使用。

K:
电视的新看法看法有了非成语语义属于唤醒

我:
Longest principle 默认一个多语素词的黑箱子性质，但在特定的上下文中可以唤醒句法透明的解读。
t0707b

K:
露上两手多此一举是否与之类似呢？唤醒了手与举的非成语语义？

白:
是

我:
就是。词法的内部关系唤醒为句法的外部关系。哪怕词法句法是一脉相承平行的，句法解读与词法解读在语义和概念层面是很不同的。英语类似的 hidden ambiguity 有blackboard 解读为black board

K:
感谢二位老师
我理解成语语义的整体性与成语内部结构的潜在可分解性会影响休眠唤醒
比如新看法作为成语有其常用语义，但其内在结构可分解，（新（看（法））），这种结构分解为看法增添了新的语义。这与分词中的组合型歧义有关联。何时分，何时合，可能单在分词层面难以完成，有什么好的解决办法吗？

白:
先说没有外部刺激是不该拆开来的
然后再说外部刺激可能有哪些特征
比如，重复。

【相关】

【新智元：parsing 在希望的田野上】

【泥沙龙笔记：NLP hard 的歧义突破】

【deep parsing，deep learning 以及在对话和问答系统中的应用】

如何把deep parsing的东西系统地用到对话系统中来？
这个以前多次零星讨论过，再梳理一下。

1. 先说 deep parsing 与 deep learning 的结合

两种结合方式，一种是内在的，一种是外在的。

1.1. 内在的结合

问题可以反过来看，绝大多数对话系统是不用 deep parsing 的，这就是没有代入结构的 baseline。如果不代入结构就已经满足应用的要求，自然没有 deep parsing 什么事儿。但其实是 baseline 有缺陷，不能解决一些问题，譬如 sparse data 的问题。
下一个问题就是，我们知道语言是有结构的，理论上讲代入结构一定对克服 sparse data 的问题有利，但实际上怎么去做，还不清楚。抽象地说，这是一个研究课题，沙龙里面讨论时候也有人提过，迄今为止的浅尝辄止，把结构作为features代入一个ngram为基础的学习模型，没有发现简单的结合办法，可以显著提高质量，这受到到 evidence overlapping 和算法复杂度等挑战。我们以前也做过这种结合的尝试，也是浅尝辄止没有深入，当时的结论是，看到了曙光，但还需要更深入的探索尝试。
迄今整个领域没有深入研究这个问题，不仅仅是研究性的挑战（算法复杂度等），更主要是：很多团队没有靠谱的deep parsing 的条件作为探索的基础，加上整个领域20多年一直陷在浅层里面做 NLP，还没有余力去做这一方面。但现在似乎到了认真做这个的时候了，因为甚至 deep learning 在算法上的突破，迄今也还是没有突破 text 结构的瓶颈，这是下一步比较清楚的方向。
我的设想是，可以重新定义 ngram 作为带入结构的探索。譬如 bigram 以前的定义就是 w(i) w(i+1) 序列，我们可以重新定义为 R（w1,w2)。这个 R 就是结构，譬如动宾，主谓，定中，状语等关系，deep parsing 的那些句法关系。其中的 w1 与 w2 也可以做扩展，不再是词（直接量），而是包括直接量的关于那个节点的不同抽象度的 features （最上层是 POS，中间还有带入 ontology 的可能）。当然从直接量提升到不同的 features，很可能造成模型太复杂：怎么控制复杂度，又不局限于直接量，是一个需要拿捏分寸的事儿。但总体思路就是，既要跨越线性距离的局限（通过用 R 代替 linear ngram，这是 arc 的抽象概括能力），又要跨越单词直接量的局限（node 也要具有抽象概括能力），这才有可能在学习的系统中真正有突破性进展。这个探索值得花大力气，因为这是 text NLP 的方法学上的突破，适用的不仅仅是对话系统，而是一切的ngram为基础的NLP。这是从机器学习带入 deep parsing 的角度看探索的方向。

1.2. 外在的结合

如果不从机器学习的角度看，deep parsing 支持对话系统本质上就是一个小数据的精度分析和语义落地的任务，规则本身是有一套自己的方法去做的。这一路做下去的结果是，精度（precision）高，但广度（recall）是挑战，需要靠时间去磨（incremental recall enhancement）。

这种高精度的 deep parsing 作为支持的规则系统，也可以外在地与机器学习的 baseline 系统结合。我们叫做 backoff model：就是让没有结构的机器学习的系统垫底，以弥补 recall 的不足，让 parsing 支持的精度系统作为首选。

这种外在的结合相对容易，因为两套系统是分别开发的，虽然目标是一致的。只不过到调用的时候再把两套系统结合起来。这与前面说的把 parsing 带入机器学习（1.1）不同，因为前面的是你中有我的结合，具有研究性的挑战（overlapping和perplexity等）。

我的看法是，两个路子都值得尝试：前者一旦突破，是研究的成就，有学术的影响。
后者是工程的做法可以立竿见影。双方最终都要求一个靠谱扎实的 deep parser，对于汉语，迄今很少人做得到。汉语 chunking 倒是有靠谱的但 deep parsing 未曾见到（不算在下）。

2. 再说说，deep parsing 在小数据系统与大数据系统的应用

对话系统基本是 front-end，面对的是小数据的处理与应对，虽然对于学习系统，在训练的时候，训练集可以是所有收集得到的对话，也有一定的量。

对于大数据挖掘以及用挖掘结果去支持问答系统（也可算是一种“对话”的延伸，譬如类似 IBM 沃森的问答系统），那边有一个 back-end engine，专门处理大数据，以抽取挖掘或在线搜寻 open-ended 的答案。这个大数据架构下，deep parsing 怎样起作用，可以说得更清晰有把握　因为这些都是做过的工作：been there done that，在大数据挖掘，知识图谱，舆情监控，客户情报，自动民调，问答系统，智能搜索和智能浏览这些方面，deep parsing 是地地道道的核武器。

在小数据应用场合，deep parsing 也应该有很大助益，因为毕竟小数据更加依赖分析的精度。但是小数据的短板是没有大数据的信息冗余作为补偿。

或曰：车轱辘话
答曰：信息冗余是大数据时代的自保策略，否则一切真知灼见都可能烟消云散。

总之是大数据有了deep parsing，柳暗花明；大数据没有它，万古如长夜。但并不是一切的 deep parser 都是核武器，不信你下载一个斯坦福或者SyntaxNet试试就知道了。这些 deep parsers 下载来，你很难做一个像样的应用。主要原因有二：（i）他们基本没有 domain portability，设计者有本事训练出来，你下载以后想如法炮制，在你要应用的领域去重新训练，哪怕你真弄到了训练数据，其成功可能性还是渺茫；（ii） deep parsing 是 NLP 应用的核武器，绝不仅仅是指我们看到的那些句法树，还有句法树上的每个节点的丰富信息，包括 ontology。而下载来的这些 parsers 通常缺乏丰富的节点信息：光靠单词（直接量）加上句法树结构是做不了大事的。因此核武器能不能发挥威力，很大程度上取决于 made in where and made by which approach.

其实如果斯坦福的 parser 或谷歌的 SyntaxNet 能够发挥威力，迅速横扫深度 NLP 应用和产品哪里会等到你去做早有捷足先登的人了。而且果如此，他们大概也不会开源了，自家先发了再说。天上掉不下馅饼就是这个道理。

Guo 君给我留言说，你整天追在学习后面喊打，人家眼珠子都不动，该干嘛干嘛。他眼珠动还是不动，根本不是我关心的。我是菩萨心肠，为了世界的丰富性和多样性，为了后学，为了NLP的未来世界，心胸早已超越一己之私，一党之争。如果没有人发声，这个世界多么单调寂寞，多少人被误导和洗脑。多少人在结果还没出来的时候，就盲目相信深度神经一定会 work，又有多少人根本就不知道这个世界还有神经以外的理性算法

阿弥陀佛。

【后记】

话多可以洗脑，嗓门高也可以洗脑。乔姆斯基的不同政见，批判美国外交政策和媒体洗脑，几十年来基本没有变动，一直是车轱辘话。他就是反复地不同角度地以不同的案例讲的是同一个见解，与马克思讲《资本论》一个路数。

不止一位朋友反映：你这速度，写的比我读的还快，哪里有那么多时间。我当是 compliments，因为所写虽有冗余，但并非口水，不乏真知灼见。这点儿信心还有。

哪里有那么多时间？鲁迅早就说过的，时间就是海绵。又说，那是把资产阶级喝咖啡的时间也挤出来了。大家都是24小时，彼此彼此，吃喝拉撒睡、上班点卯，样样不能少。

喷涌的时候写，总比枯竭的时候挤牙膏强。兴致来的时候不写，兴尽而归，就了无音息了。了无音信其实也没啥，自己脑袋的知识和见识并没有因此减少，但人活着，也不能只为了自己的脑袋不是。

其实人一辈子有这个状态也不多。少年求学，青年求偶，中年养家，青年-老年也都还有个梦想 put a ding somewhere，总之人一辈子就是个劳碌命。随心随性的自由不过是劳碌之间的一次休假。要惜福感恩。

【相关】

【关于 parsing】

【语义计算沙龙：乔老爷的围墙，community 的盲区】

白:
所以，述补结构的处理，分层上要先于名词（短语）填坑。全放在一个平面，就说不清楚坑多萝卜少咋分的。
我:
那是因为 subcat 不是静止的。起点是词典，但在后续中有变。
在欧洲语言有一些构词元素具有改变 subcat 的功效使动语素（世界语用 -ig-）和自动语素（-igh-）是代表，morphology 里面的专门一章学问有一套套的术语名词。
到了孤立语句法结构就承担了一些这类subcat remapping 的功效。
白:
“这些馒头把我吃胖了”是个非常好的例子。X吃Y，Z胖。分析X、Y、Z各自的subcat，发现X与Z匹配的可能远大于Y与Z匹配的可能。而Z说好了要做结合后述补结构的逻辑宾语的，于是Y只好很不情愿地做了结合后述补结构的逻辑主语。
“把”就是“置底”，“被”就是“置顶”，“的”就是“掏心”，都是在做语序的重定向。
我:
有意思的说法。
总之 open ended 动补结构需要动态调整 subcat 的预期指向。
白:
分层了就把复杂问题变简单了。先合并坑，再让坑一致对外。等到真正填坑的萝卜来了，坑多已经是历史传说了，现实的坑不多不少。
我:
分层是必由之路。
不仅仅是为这个 args 的捕捉。很多现象都有 local 和全局的区分，把 local 的和全局的放在一个锅里煮。不是偷懒就是脑子进水了。出了问题找不到合适解决途径，还振振有词辩解说语言是 inter-dependent 的，怎样分层都是割裂整体。这样思维的人是自作自受。不管语言这个 monster 多么相互关联，难以一刀切干净，作为语言工程你都必须切成模块。关键不在模块之间是不是绝对地合适分开大体齐就行了；关键在于切割了还仍然是一个 integrated 的无缝连接的系统。即便有些地方切得不妥了也要有改正、弥补、唤醒或其他补救或patching的机会。这样才好把千头万绪简化成 tractable 的工程开发任务。
事实上，以前红过一阵子的 unification grammars 一派虽然最终在业界没能开花结果在学界也被排挤得差不多了，这拨人还在有不少是名校的名教授。他们深陷在 CFG 的单层的 formalism 里面加上 unification 的实现也是以 Prolog 的回溯机制为基础，既无效率也不能真正深入很难有 scale up 的指望。结果这帮人形成了自己的一个圈子也有一定的体量自己跟自己玩儿，虽然对NLP的学界和业界的影响几近于零了。每年各地诸侯会聚一次。起个名字好像叫 ...... 忘了，总之是类似 very deep parsing 意思的一个什么。其实，怎么可能 very deep，如果层次和formalism这一关不过的话？在乔姆斯基倡导的 formal linguistics 的研究中，他们算是异端。姥姥不疼舅舅不爱我们外人看去怪落寞的。可是当年（博士阶段）初学的时候却被它的巨大魅力而吸引。这是一个看上去很美的框架。
白:
毛毛虫万岁！
我:
对。可是看清这一点的人不多。那么多人陷在乔老爷的怪圈里。

QUOTE ( from [转载]【白硕 – 穿越乔家大院寻找“毛毛虫”】)：

白硕老师这篇文章值得所有自然语言学者研读和反思。击节叹服，拍案叫绝，是初读此文的真切感受。白老师对乔姆斯基形式语言理论用于自然语言所造成的误导，给出了迄今所见最有深度的犀利解析，而且写得深入浅出，形象生动，妙趣横生。这么多年，这么多学者，怎么就达不到这样的深度呢？一个乔姆斯基的递归陷阱不知道栽进去多少人，造成多少人在 “不是人话” 的现象上做无用功，绕了无数弯路。学界曾有多篇长篇大论，机械地套用乔氏层级体系，在自然语言是 context-free 还是 context-sensitive 的框框里争论不休，也有折衷的说法，诸如自然语言是 mildly sensitive，这些形而上的学究式争论，大多雾里看花，隔靴搔痒，不得要领，离语言事实甚远。白老师独创的 “毛毛虫” 论，形象地打破了这些条条框框。

白老师自己的总结是：‘如果认同“一切以真实的自然语言为出发点和最终落脚点”的理念，那就应该承认：向外有限突破，向内大举压缩，应该是一枚硬币的两面。’ 此乃金玉良言，掷地有声。

Church 写《钟摆摆得太远》的时候看到了表象格局和视野都有但就是没看到毛毛虫的实质。他看到了这个奇怪的现象：做 formal linguistics 的 community 研究了很多语言现象有些问题研究得很深入并试图以自己的框架形式化分析他们而做 nlp 的这个community 几乎全部陷在浅层的泥坑里。本来面对的是同样的自然语言而且双方都力图形式化和电脑实现它应该可以互补的但这两个 communities 是如此格格不入老死不相往来互相看对方都是雾里看花。
白:
门户之见是不需要证据的
我:
所以他就开了个药方要求在理性主义预计要回归的时候请下一代 CL 的学生必须去修语言学的课程，“计算语言学”不能丢了“语言学”的本，必须知道语言学圈做了些什么改变这一代的"NLP大师"缺乏语言学的怪象。
白:
抄底靠的是先知先觉。等学生上课，黄花菜都凉了
我:
但是这个药方和呼声是如此疲弱至少迄今没人当回事。而且这个药方本身也有问题因为那个另一派也的确问题多多。自己在围墙里面束缚了自己，完全不接地气。争论的问题一多半是口水，大多是 system internal，玩的过家家的游戏（《Church：钟摆摆得太远》）。

quote 计算语言学课程的缺陷

正如上面明斯基和帕佩特指出的，我们不断犯同样错误的部分原因与我们的教学有关。辩论的一方在当代计算语言学教科书中已被遗忘，不再提及，只能靠下一代人重新认识和复原。当代的计算语言学教科书很少介绍PCM 三位前辈。在汝拉夫斯基(Jurafsky) 和马丁(Martin) 编著的教科书以及曼宁(Manning) 等编著的两套教科书中根本没有提及皮尔斯。三本教科书中只有一本简要提起明斯基对感知机的批评。

他要学生回头去啃乔姆斯基等理性主义大师，但不知道乔姆斯基本人就是最大的误导者（乔姆斯基批判；《【钟摆摆得太远】高大上，但有偏颇》；[转载]【白硕 – 穿越乔家大院寻找“毛毛虫”】）。

w:
@wei 很是同意“关键在于切割了还仍然是一个 integrated 的无缝连接的系统，即便有些地方切得不妥了也要有改正、弥补、唤醒或其他补救或patching的机会”。切割是为了方便处理，藕虽断但丝还连。即使切错了，还能找回来才是真谛。
白:
这是一个搜索调度策略的问题。无论是平推，还是花开N朵先表一枝，还是台面上的路径之外另有暗送秋波的密道，都是可供选择的。“先表一枝”如能综合使用词典的、subcat的、中间件的、大数据的知识，在概率上可以占得先机。“密道”维护一条“阈下”暗流，一旦主线不保，立刻跳出来翻案，在段子的理解上最像人，但是密道的日常维护机制和受激翻案机制是个高难度的机密。
平推加概率，如果有很好的硬件实现RNN，效率绝对不低。至少确保线速。

我:
rnn 经白老师不断灌输我就当它为神器来膜拜。
林彪说了理解的要执行不理解的更要执行我对白老师的 rnn 的感受就是如此。
白老师啥时来一个 rnn 直通道提供一个接口我就可以不断给它 feed 语言学。
w:
@白现在的硬件发展肯定会助推DL的性能的。只要基础理论框架结实且有市场，很多的硬件商都会挤进来的，没有人和蛋糕过不去的，AI硬件GPU，FPGA，还有其他什么xPU的倒不少，不过自己没认真关注。
@wei 如果接口来了，那语言学知识好不好喂？是专家级的专属？还是一般用户可为？感觉这是能走多远，走多广的问题。
感觉上李师parser的确是核武器，专家维护几枚就好。不过市场倒很大，很多人想有一枚。就是渠道打不开。
同意白老师的密道说。而且密道也应该不是直通型的，是网络交叉型的。运行时维护密道和应时启用的确是关健。
白:
伟哥说我描述句法分析还带着账房先生的味道，这让我想起当年被傅爱萍请到社科院语言所做交流时，我提起过的“铜臭”。一晃过去这么多年了。

【相关】

《Church：钟摆摆得太远》

【语义计算沙龙：Parsing 的数据结构和形式文法】

【乔姆斯基批判】

《泥沙龙笔记：【钟摆摆得太远】高大上，但有偏颇》

【关于NLP方法论以及两条路线之争】