The mainstream sentiment approach simply breaks in front of social media

I have articulated this point in various previous posts or blogs before, but the world is so dominated by the mainstream that it does not seem to carry.  So let me make it simple to be understood:

The sentiment classification approach based on bag of words (BOW) model, so far the dominant approach in the mainstream for sentiment analysis, simply breaks in front of social media.  The major reason is simple: the social media posts are full of short messages which do not have the "keyword density" required by a classifier to make the proper sentiment decision.   Larger training sets cannot help this fundamental defect of the methodology.  The precision ceiling for this line of work in real-life social media is found to be 60%, far behind the widely acknowledged precision minimum 80% for a usable extraction system.  Trusting a machine learning classifier to perform social media sentiment mining is not much better than flipping a coin.

So let us get straight.  From now on, whoever claims the use of machine learning for social media mining of public opinions and sentiments is likely to be a trap (unless it is verified to have involved parsing of linguistic structures or patterns, which so far has never been heard of in practical systems based on machine learning).  Fancy visualizations may make the results of the mainstream approach look real and attractive but they are just not trustable at all.

Related Posts:

Why deep parsing rules instead of deep learning model for sentiment analysis?
Pros and Cons of Two Approaches: Machine Learning and Grammar Engineering
Coarse-grained vs. fine-grained sentiment analysis
一切声称用机器学习做社会媒体舆情挖掘的系统,都值得怀疑
【立委科普:基于关键词的舆情分类系统面临挑战】

Why deep parsing rules instead of deep learning model for sentiment analysis?

aaa

(1)    Learning does not work in short messages as short messages do not have enough data points (or keyword density) to support the statistical model trained by machine learning.  Social media is dominated by short messages.

(2)    With long messages, learning can do a fairly good job in coarse-grained sentiment classification of thumbs-up and thumbs-down, but it is not good at decoding the fine-grained sentiment analysis to answer why people like or dislike a topic or brand.  Such fine-grained insights are much more actionable and valuable than the simple classification of thumbs-up and thumbs-down.

We have experimented with and compared  both approaches to validate the above conclusions.  That is why we use deep parsing rules instead of a deep learning model to reach the industry-leading data quality we have for sentiment analysis.

We do use deep learning for other tasks such as logo and image processing.  But for sentiment analysis and information extraction from text, especially in processing social media, the deep parsing approach is a clear leader in data quality.

 

【Related】

The mainstream sentiment approach simply breaks in front of social media

Coarse-grained vs. fine-grained sentiment analysis

Deep parsing is the key to natural language understanding 

Automated survey based on social media

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

 

【语义计算沙龙:语序自由度之辩】

刘:
WMT2016上有一篇文章,讨论了语言的语序自由度,结论很有趣,见附图。根据这篇论文统计,汉语和英语之间语序关系是最稳定的(注意:语序关系稳定与语序一致不是一回事),比其他语言稳定度都高出许多。日语虽然是粘着语,但跟英语的语序关系也是相当稳定的。相反,德语虽然跟英语亲缘关系很近,但其相对语序的自由(不稳定)程度相当高。

916632021314869711
论文链接 http://www.statmt.org/wmt16/pdf/W16-2213.pdf

我:
这个研究是说,如果这些语言要与英语做自动翻译,语序需要调整多少?
英语相对语序很固定,加上是最流行的语言,拿它做底来比较,对于各语言的相对语序自由度应该是不离谱的。但是,从(平行)大数据来的这些计算,与这些语言的语言学意义上的语序自由度,有差别:
譬如 Esperanto 的语序自由度应该很大,怎么排列,意思都不变,但是由于很多人可能思想是用英语的,写出来的时候下意识在头脑里面翻译成了世界语,结果跟机器翻译一样,人的懒惰使得表达出来的语序照着英语的样子相对固定起来,并没有充分利用语言本身本来有的那么大自由度。

汉语的语序自由度,语感上,比图示出来的,要大。但是,做这项研究的双英对照数据也许大多是正规文体(譬如新闻),而不是自由度更大的口语,因此出现这样的结论也不奇怪。虽然汉语是所谓孤立语,英语接近汉语,但没有那么“孤立”,汉语的语序自由度比英语要大。做英汉MT的 generation 的时候,需要调整词序的时候并不很多,多数情况,保留原词序,基本就凑合了,这是利用了汉语语序有弹性,相对自由度大的特点。汉英MT没亲手做过(除了博士项目在Prolog平台上做过的一个英汉双向MT的玩具),感觉上应该比英汉MT,需要做调序的时候更多。调序多容易乱套,特别是结构分析不到位的时候更容易出乱子,是 MT 的痛点之一。尽量少调序,警惕调序过度弄巧成拙,是实践中常常采取的策略。包括英语的定语从句,多数时候不调序比调序好,用的技巧就是把定语从句当成一个插入语似的,前面加个逗号或括号,适当把 which 翻译成“它”等等。

刘:
你说的有道理,这个研究是以英语为基准的,虽然严格说不是很合理,但还是靠谱的,英文英语语序是比较固定的。我们说汉语语序自由,我觉得是错觉。汉语语序是很不自由的。实际上,对一个语言来说,形态的复杂程度和语序的自由程度是成正比的。形态越复杂的语言,语序越自由。汉语没有形态,只能用语序来表示句法关系。因此是严格语序语言。不可能说一种语言既没有形态,又语序自由,那么这种语言基本上没法表达意义了。

白:
这个,需要分开说。一是subcat算不算形态,因为不是显性的标记,很可能不算。二是subcat是否提供了冗余信息使得一定范围内的语序变化不影响语义的表达,这是肯定的。

Jiang:
嗯!subcat这里指的是什么?

白:
比如“司机、厨师、出纳……”都携带human这个subcat,但是human并不是一个显示的形式标记。

我:
虽然大而言之形态丰富的语言语序自由度就大、形态贫乏的语言语序相对固定是对的,但汉语并不是持孤立语语序固定论者说的那样语序死板,其语序的自由度超出我们一般人的想象:拿最典型的 SVO patterns 的变式来看,SVO 三个元素,排列的极限是6种词序的组合。Esperanto 形态并不丰富,只有一个宾格 -n 的形态(比较 俄语有6个格变):主格是零形式(零词尾也是形式),它可以采用六种变式的任意一个,而不改变 SVO 的句法语义:

1. SVO Mi manĝas fiŝon (I eat fish)
2. SOV: Mi fiŝon manĝas
3. VOS: Manĝas fiŝon mi
4. VSO: Manĝas mi fiŝon
5. OVS: Fiŝon manĝas mi.
6. OSV: Fiŝon mi manĝas.

比较一下形态贫乏的英语(名词没有格变,但是代词有)和缺乏形态的汉语(名词代词都没有格变)的SVO自由度,很有意思:

1. SVO 是默认的语序,没有问题:
I eat fish
我吃鱼

2. SOV:
* I fish eat (英语不允许这个语序)
我鱼吃 【了】(汉语基本上是允许的,尤其是后面有时态小词的时候,听起来很自然)
虽然英语有代词的格变(小词直接量:"I" vs "me"), 而汉语没有格变,英语在这个变式上的语序反而不如汉语。可见形态的丰富性指标不是语序自由度的必然对应。

3. VOS:
* Eat fish I (英语不允许这个语序)
?吃鱼我(汉语似乎处于灰色地带,不像英语那样绝对不行,设想飞机空姐问餐:“吃鱼还是吃肉?”你可以回答:“吃鱼,我”)

4. VSO:
* Eat I fish (不允许)
* 吃我鱼 (作为 VSO 是不允许的,但可以存在,表示另外一种句法语义:吃我的鱼)
做VSO不合法,但有些灰色的意思,至少不像英语那样绝对不允许。

5. OVS:
* Fish eat I (不允许,尽管 I 有主格标记)
* 鱼吃我 (句子是合法的,但句法语义正好相反了 , 是 SVO 不是 OVS。句子本身合法,但做OVS非法。)

6 OSV:
fish I eat (合法,除了表达 OSV 的逻辑语义 这个语序,还表达定语从句的关系)
鱼我吃(合法,常听到,鱼是所谓 Topic 我是 S,逻辑语义不变)

总结一下,汉语在 6 个语序中,有 3 个是合法的,1 个灰色地带,2 个非法。英语呢,只有两个合法,其余皆非法。可见汉语的语序自由度在最常见的SVO句式中,比英语要大。

白:
不考虑加不加零碎的语序研究都是那啥。“鱼吃我”不行,“鱼吃得我直恶心”就行

我:
不管那啥,这个 illustration 说明,语序自由度不是与形态丰富性线性相关。也说明了,汉语往往比我们想象的,比很多人(包括语言学家)想象的具有更大的自由度和弹性。白老师的例子也是后者的一个例示。其实,如果加上其他因素和tokens,这种弹性和自由,简直有点让人瞠目结舌。汉语不仅是裸奔的语言,也是有相当程度随心所欲语序的语言。超出想象的语序弹性其实是裸奔的表现之一,思维里什么概念先出现,就直接蹦出来。而且汉语不仅没有(严格意义的)形态,小词这种形式也常常省略,是一种不研究它会觉得不可思议的语言。它依赖隐性形式比依赖显性形式更多,来达到交流。这对 NLP 和 parsing 自然很不利,但是对人并不构成大负担。

刘:
首先,语序变化以后意义发生变化,不说明语序自由,相反,正说明语序不自由。语序传达了意义。其次,语序变化以后要加词才能成立(鱼我吃了)也正好说明语序不自由。再者,这种简单的句子不说明汉语普遍语序自由。在绝大部分清晰下,汉语都是svo结构,个别情况下需要特别强调o的时候,可以把o放到最前面。语序自由的前提,是通过词尾变化明确了词在句子中的功能,这样的话,主谓宾不管怎么交换顺序,都不会搞混,所以语序自由。没有形态变化,不可能真正语序自由。
“小王打小张”,语序就不能随便调整。
“我爱思考”,“我思考爱”,意思完全不一样

我:
这要看你怎么定义语序自由了。你给的定义是针对格变语言做的,有宾格的语言,等于是把句法关系浓缩了标给了充当角色的词,它跑到哪里都是宾语是题中应有之意。但语序自由的更标准和开放的定义不是这样的,如果 SVO 是基本的语序,凡是与它相左的语序的可能性,就是语序自由,研究的是其自由度。这种可能性的存在就证实了我们在理解语言的时候,或者机器在做 parse 的时候,必须要照顾这种 linear order 的不同,否则就 parse 不了,就抓不住语序自由的表达。不能因为一种相左的语序,由于词选的不同,某个可能语序不能实现,来否定那种语序自由的可能性和现实性。

退一步说,你的语序自由是 narrow definition, 我们也可以从广义来看语序自由,因为这种广义是客观的存在,这种存在你不对付它就不能理解它。就说 “小王打小张”,SVO 似乎不能变化。但是 “小张小王打不过” 就是 OSV,不能因为这个变式有一个补语的触发因素,来否定语序的确改变了。pattern 必须变换才能应对这种词序的改变。

最后,汉语与英语的对比,更说明了汉语的语序自由度大于英语,否则不能解释为什么汉语缺乏形态,反而比形态虽然贫乏但是比汉语多一些形态的英语,表现出更多的语序自由。“鱼我吃了” 和 “我鱼吃了” 是一个 minimal pair,它所标示的语序自由的可能性,是如此显然。人在语序自由的时候仍然可以做句法语义的理解,说明了形态虽然是促进自由的一个重要因素,但不会是唯一的因素。隐性形式乃至常识也可以帮助语序变得自由。

“打小张小王不给力。”(这是VOS。。。)
“打老张小王还行。”

刘:
这两个句子里面“打”都是小句谓语,不是主句谓语。主句谓语是“给力”和“还行”。例子不成立。

我:
影响语序自由的,形态肯定是重要因素,其他的语言形式也有作用。小句也不好 主句也好,SVO 的逻辑语义在那里,谁打谁?我们在说SVO语序自由这个概念的时候,出发点是思维里的逻辑语义,就是谁打谁,然后考察这个谁1 和 谁2,在语言的 surface form 里面是怎样表达的,它们之间的次序是怎样的。。

刘:
这就强拧了。这么说the apple he ate is red. 也是osv了?apple he ate的逻辑关系在哪里。这么说英语也可以osv了?

我:
不错,那就是地地道道的 OSV:谁吃什么,现在这个【什么】 跑到 【谁】 和 “ate” 的前面去了,底层的逻辑语义不变,表层次序不同了。

说英语是 svo 语言,这种说法只是一种标签,并不代表英语只允许这个词序。英语的SVO 6 种语序中,前面说了,有两种合法常见,其他四种基本不合法。

刘:
如果你对语序自由是这样定义的话,那英语也是语序自由了。

我:
不是的。只能说语序自由度。英语的语序自由度还是不如汉语。汉语的语序自由度不如世界语,也不如俄语。世界语的语序自由度不亚于俄语,虽然俄语的形态比世界语丰富。

刘:
那我们不必争论了,我们对语序自由这个概念的定义不一样。

我:
不错,这是定义的问题。我的定义是广义一些。你的定义窄。

刘:
按照你的定义:Eating the apple he smiled. 英语还可以VOS

白:
beat him as much as I can
总而言之S是从相反方向填它的坑

禹:
俄语的我吃鱼这么多种语序也可以?当真现实就是这么用吗?

易:
@禹 俄语的语序确实很灵活,尤其在口语体中,但意思不会变,因为名词有六个格,施受关系基本不会乱。

白:
日语里面有个名句:きしやのきしやはきしやにきしやできしやえきしやした
除了动词,其他成分的位置也是各种挪来挪去

刘:
@白硕 这个日语句子什么意思啊?

白:
贵社的记者坐火车朝着贵社打道回府了
考验日语输入法的经典例子,流传了将近百年
据说是电报引入日本不久的事情
这么个拼音电文,没人知道啥意思
跟赵元任发明一音节文,有得一拼
格标记本来就是给语序重定向的,所以不在乎原来语序也是情理之中。
如果汉语的“把”“被”“给”“用”“往”一起招呼,也可以不在乎语序的。
被张三 把李四 在胡同里 打了个半死……

我:
广义说 介词也是格 也是形态,格通常是词尾形式,介词的本质却是一样的。
“被” 是主格,“给” 是与格,“用” 是工具格。

禹:
俄语格的问题,有没有需要三四阶语法模型才能确定的还是基本上就是看之前的动词或名词的类别

我:
格就是parsing依赖的形式条件之一。形态丰富一些的语言 parsing 难度降低
不需要过多依赖上下文条件。

 

【相关】

【语义计算:汉语语序自由再辩】

泥沙龙笔记:汉语就是一种“裸奔” 的语言

泥沙龙笔记:漫谈语言形式

关于 parsing

【关于中文NLP】

【置顶:立委NLP博文一览】

《朝华午拾》总目录

 

中文parsing:语义模块大有可为

白:
“放在行李架上的行李,请您确认已摆放稳妥。”----高铁的词儿。

我:
t0812a

t0812b
二者应该是等价的,现在接近了,还没等价。
想等价的话,条件已经具备:确认这样的词的前S(主语)与其后的OPred(动词性宾语),勾搭上,成为逻辑主谓,这是语义中间件很容易做的,因为条件清晰。
如果追求极致,那就动一下手术:(1)断掉原先的主谓(行李与确认);(2)建立新的主谓 (行李与摆放);(3)断掉原先的 OPred(谓词性宾语);(4)代之以 O-S(宾语从句)。这个也合情合理,条件同样清晰。
如果追求极致的极致,再进一步在主谓关系上加一层逻辑动宾关系,“摆放”的宾语是“行李”。这个可以在“摆放”上做,但必须在新的主谓确立以后再做,可做,稍微有点tricky。

t0812c

Hey @白老师报告,毛主席保证:
mission impossible accomplished in semantics module
中间件大有可为。现在要做一下regressions测试了。
极致的极致,不能如此得来全不费工夫吧。

白:
“坐在座位上的旅客,请您确认您的安全带已扣好系紧。”

我:
t0812f
真要扑哧一笑了:
“好系紧”
大概当成广东话了,曾经把广东话揉进了系统。
不管那个,整体架构在轨道上,宾语从句 O-S 和前面的定语从句 Mod-S。
追求极致的话,“旅客”和“你”是同位语。但是,因为“请你VP”用得太多,而且其中的“你”常常省略,因此parsing根本就不理会你的存在,你没有地位,就是祈使句的默认(这里的祈使句标志是小词 “请”)。因此旅客无需与那个子虚乌有的“你”做同位语了,做主语就好了。
应该是无可挑剔了吧(除了句末的广东话疑似)。

白:
“放在座位前方的说明书,请您确认已看过读懂。”
“走在前方道路上的行人,请您确认跟照片上是同一个人。”

我: 白老师 得寸进尺呢。

t0812gt0812h

“看说明书” 与 “看书” 同属于搭配,这个还可以debug一下,本来应该勾搭上的。
“确认” 与 “是” 断链子了,不过 “是” 与其他动词不同,不是好缠的主儿,不敢轻易动它。

白:
这个时候还是有点念老乔的好。甭管多少层谓词,只要一个必填的坑没填,而外边C-command位置上跟它配型,基本就是它了。就是说,主语(话题)部分的中心词一旦与谓语部分的自由坑配型,就可解释为移位。比同位结构还来得优先

我:
语义中间件 continues,逻辑SVO补全:
t0813a

t0813b

宋:
【转发】周末开心一刻!
中国有两项比赛大家基本不用看,也不用担心:一个是乒乓球,一个是男足。
前者是“谁也赢不了”,后者是“谁也赢不了”!(外国人看不懂,咱们也不告诉他)

白:
太多处见到宋老师转的这个段子。这不是一个句法问题,两个分析结果在句法上都成立。关键是语用。要想正确理解,要明白:(1)有歧义结构的句式连用两次且都指向其中同一种结构,在修辞上是非常乏味的。(2)这两个结构分别描述了竞技能力水平的两个极端。(3)进入同一个句式的差异部分的所指如果恰好处于这两个极端,可以构成一个完美的段子(结构急转弯伴随价值评判急转弯)。(4)常识(或大数据)支持第(3)条。

我:
谁也赢不了 / 谁都赢不了 入词典,两个义项:1 必赢;2 必输
歧义保留到底。

0816a

“打败”也有两个义项,不过条件清晰一些:
(1)有句法主语没宾语:被打败
(2)主宾俱全,“打赢”
中国男足打败了
中国乒乓球打败了瑞典

0816b

我:
想起这句“成语”:毛主席保证!
“毛主席”不是“保证”的【施事】,而是“保证”的【对象】。
尽管处于绝对标准的主语位置。历史大概是,原来有介词“向”的,后来说得常了,于是省略小词,有意造成似歧义但语用无歧义的效果,显得别致,结果就传播开了。如今只好词典绑架死记了。
什么叫似歧义语用无歧义?
从句法上看并无歧义,似乎只能是主语。但从语用上看,先王毛何等高高在上,皇帝是不用向子民“保证”任何事的,只有蚁民向他保证或效忠(文革时有早请示晚汇报)。
其实,严格说,这个向先王的保证是做给对方看的,真正的对象是说话的对方。但经过向先王的保证,就赋予了这种保证一种特别的严肃(实际是转化为滑稽了)的效果:君子无戏言,对君子的保证更不敢戏言。
这两天做语义中间件的逻辑语义补全,有些着魔,总琢磨这事儿。昨天想,逻辑语义的前辈董老师一辈子琢磨它,该是怎么个心态和功力呢。是不是看自然语言达到了穿透一切形式,无申报直达语义的境界?

 

【相关】

关于 parsing

【关于中文NLP】

【置顶:立委NLP博文一览】

《朝华午拾》总目录

从汉语Topic句式谈起

再谈汉语的 Topic 句式,这玩意儿说到底就是句法偷懒:不求甚解,凡是句首看上去像个实词的,贴个discourse意味上的标签 Topic 完事儿。管它逻辑语义上究竟是扮演什么角色,怎样达成深度的理解。说得难听一点儿,这就是汉语文法“耍流氓”。

宋老师的例子:
“吃苦他在前”--> Topic【吃苦】Subj【他】Pred【在前】
这就交差了,句法算及格了。
更常见的其实是:“他吃苦在前”。 分析起来,也是一个套路:
“他吃苦在前”-->Topic【他】Subj【吃苦】Pred【在前】。

“他学习好” 也是如此,话题是某个人(“他”),说的是他的弱点:什么地方(aspect)好(evaluation)。“学习【,】他好”(不用逗号亦可,但有歧义:【学习他】好。)。话题是 “学习”这事儿,说的是哪些人(subset)这方面好(evaluation)。

英语大概是: he is good in study;his study is good; he studies well

逻辑语义呢,似乎有这几个关系:
(1)他(big object)好;(2)学习(small object)好;(3)他学习。

人无完人。一个人的一个方面好了,就可以说这个人(整体)好,好的所在(优点,pros)就是部分。这是整体与部分的相互关系。缺点(cons)亦然,如:
iPhone 屏幕不好。
细节是屏幕的不如人意,但是屏幕(部分)不好,也就影响了iPhone(整体)的评价,所以也是 iPhone 不好。

说来归齐,就是 Topic 做句法的第一步没问题,但不是句法语义的终点。更像是偷懒,或者桥梁,最终要达到(1)(2)(3)才算完事儿。无论“iPhone屏幕不行”还是“屏幕iPhone不行”,无论中文英文,表达法可以不同,最终的逻辑归结点应该是一致的,大体上就是123。思考一下英语没有话题句式但用了至少三种其他的表达式(如上所述),想想这些表达式最终怎么归化到逻辑的123,是非常有意思和启迪的。

句法分析或逻辑语义上的123,最终要落地到语用去支持应用。语用上的定义可以依据应用层面的情报需求。下面是我们目前的自动句法分析及其相关的 sentiment analysis 的语用表达:

0815d

 

【相关】

关于 parsing

【关于中文NLP】

【置顶:立委NLP博文一览】

《朝华午拾》总目录

 

一日一parsing:“宝宝的经纪人睡了宝宝的宝宝 ..."

bai:
宝宝的经纪人睡了宝宝的宝宝,宝宝不知道宝宝的宝宝是不是宝宝亲生的宝宝,宝宝的宝宝为什么要这样对待宝宝!宝宝真的很难过!宝宝现在最担心的是宝宝的宝宝是不是宝宝的宝宝,如果宝宝的宝宝不是宝宝的宝宝那真的吓死宝宝了。

tan:
这种八卦,估计很难分析。 难为parse这个宝宝了!

姜:
里边的歧义真是不少啊!要仔细用心才能理解每个“宝宝”的含义!
宝宝1:王宝强;宝宝2:王宝强媳妇;宝宝3:王宝强媳妇的儿子;宝宝4:王宝强的儿子

我:
结构上看还是蛮清晰的,有些人看上去类似绕口令的话语,结构分析的挑战其实不大

0814a

0814b

0814c

刘:
@wei 这主要不是parsing的问题,主要是指代消解的问题

我:
是的,我是说parsing提供结构的基础不难, 语义模块怎么消解指代是另外的问题。很多人以为这玩意儿没法 parsing,是因为对句法和语义的模块和功能不甚了解。当然一般人心目中的 parsing 等于 understanding,是结合了句法、语义甚至语用的整体的系统。
一步一步来吧。打下一个坚实的结构基础总是有利的。
我是 native speaker,其实也看不懂,不知道里面宝宝指的都是谁。刚才上网恶补了一下新闻,才明白怎么回事儿。所以,这里的指代消解需要很多语用背景知识,光句法语义分析不能解析的。

白:
这么快段子就来了:王宝强去马场选了一匹马准备骑,这时候马场的师傅阻止了他,说选的这个马不好。宝强问为啥?师傅解释说:“这马容易劈腿。” 宝强没有听太明白,师傅又大声说:“这马蓉易劈腿啊!!”?

我:
这是白老师先前转的,还是也 parse parse 凑齐一段吧:

814d

814e

814f

雷: 宝宝的宝宝是宝宝的吗?

我: 最多也只能提供可能性,然后语用或背景知识去定。
从词典上,“宝宝(Baby)”的可能性是:(1)孩子;(2)爱人
至于宝宝指王宝强,那是抽风似的娱乐界当前的背景知识,此前此后都没有这个知识。

白: 名字里带“宝”字的都有可能
之前宝宝的称号还被赋予过仰望星空那位大人物 是有先例的

我: 这样每一个宝宝就有三个义项: 宝宝1 (孩子),宝宝2 (爱人),宝宝3 (宝强)

孩子的孩子是孩子吗
爱人的爱人是爱人吗
宝强的宝强是宝强吗
孩子的爱人是孩子吗
孩子的爱人是爱人吗
孩子的爱人是宝强吗
孩子的宝强是孩子吗
孩子的宝强是爱人吗
孩子的宝强是宝强吗
爱人的孩子是孩子吗
爱人的孩子是爱人吗
爱人的孩子是宝强么
.........

总之,逃不过这些爆炸的组合之一
加一点限制条件可以排除一些不可能组合,但留下的空间还是远远大于可以离开新闻背景知识而能消解的可能。

雷: 还有,现在还流行自称“宝宝”,比如,宝宝不高兴

白: 没那么复杂。“【孩子】是【男人】的吗?”才是唯一有看点的提问。其他都是渣。

我: 人不看新闻也解不了,反正我第一次看,完全不知所云。

雷: 还有,宝马之争

我: 宝强的孩子是宝强(亲生)的吗

雷: 宝强的老婆是宝强的吗?

我: 恩,这两条是关键,怎么从成堆的渣里面提炼出这两条?
通过什么样的知识?
这不是新闻背景知识了,这是人类道德在现阶段的扭曲和窥探欲的某种知识

白:
不是窥探欲的问题,是信息量、冲击力大小的问题。

雷: 孩子的爸爸是孩子的吗?

我: 孩子的爸爸是孩子的吗? 这个可能只在想象中成立,想象孩子把自己的爸爸叫做宝宝,孩子有这种探究的动机,等。

雷: 第三方看,也是可以的

 

【相关】

【立委科普:语法结构树之美(之二)】

关于 parsing

【关于中文NLP】

【置顶:立委NLP博文一览】

《朝华午拾》总目录

 

S. Bai: Natural Language Caterpillar Breaks through Chomsky's Castle

masterminds-it-quiz-10-728

Translator's note:

This article written in Chinese by Prof. S. Bai is a wonderful piece of writing worthy of recommendation for all natural language scholars.  Prof. Bai's critical study of Chomsky's formal language theory with regards to natural language has reached a depth never seen before ever since Chomsky's revolution in 50's last century.  For decades with so many papers published by so many scholars who have studied Chomsky, this novel "caterpillar" theory still stands out and strikes me as an insight that offers a much clearer and deeper explanation for how natural language should be modeled in formalism, based on my decades of natural language parsing study and practice (in our practice, I call the caterpillar FSA++, an extension of regular grammar formalism adequate for multi-level natural language deep parsing).  For example, so many people have been trapped in Chomsky's recursion theory and made endless futile efforts to attempt a linear or near-linear algorithm to handle the so-called recursive nature of natural language which is practically non-existent (see Chomsky's Negative Impact).  There used to be heated debates  in computational linguistics on whether natural language is context-free or context-sensitive, or mildly sensitive as some scholars call it.  Such debates mechanically apply Chomsky's formal language hierarchy to natural languages, trapped in metaphysical academic controversies, far from language facts and data.  In contrast, Prof. Bai's original "caterpillar" theory presents a novel picture that provides insights in uncovering the true nature of natural languages.

S. Bai: Natural Language Caterpillar Breaks through Chomsky's Castle

Tags: Chomsky Hierarchy, computational linguistics, Natural Language Processing, linear speed

This is a technology-savvy article, not to be fooled by the title seemingly about a bug story in some VIP's castle.  If you are neither an NLP professional nor an NLP fan, you can stop here and do not need to continue the journey with me on this topic.

Chomsky's Castle refers to the famous Chomsky Hierarchy in his formal language theory, built by the father of contemporary linguistics Noam Chomsky more than half a century ago.  According to this theory, the language castle is built with four enclosing walls.  The outmost wall is named Type-0, also called Phrase Structure Grammar, corresponding to a Turing machine.  The second wall is Type-1, or Context-sensitive Grammar (CSG), corresponding to a parsing device called linear bounded automaton  with time complexity known to be NP-complete.  The third wall is Type-2, or Context-free Grammar (CFG), corresponding to a  pushdown automaton, with a time complexity that is polynomial, somewhere between square and cubic in the size of the input sentence for the best asymptotic order measured in the worst case scenario.  The innermost wall is Type-3, or Regular Grammar, corresponding to deterministic finite state automata, with a linear time complexity.  The sketch of the 4-wall Chomsky Castle is illustrated below.

This castle of Chomsky has impacted generations of scholars, mainly along two lines.  The first line of impact can be called "the outward fear syndrome".  Because the time complexity for the second wall (CSG) is NP-complete, anywhere therein and beyond becomes a Forbidden City before NP=P can be proved.  Thus, the pressure for parsing natural languages has to be all confined to within the third wall (CFG).  Everyone knows the natural language involves some context sensitivity,  but the computing device cannot hold it to be tractable once it is beyond the third wall of CFG.  So it has to be left out.

The second line of impact is called "the inward perfection syndrome".  Following the initial success of using Type 2 grammar (CFG) comes a severe abuse of recursion.  When the number of recursive layers increases slightly, the acceptability of a sentence soon approximates to almost 0.  For example, "The person that hit Peter is John" looks fine,  but it starts sounding weird to hear "The person that hit Peter that met Tom is John".  It becomes gibberish with sentences like "The person that hit Peter that met Tom that married Mary is John".  In fact, the majority resources spent with regards to the parsing efficiency are associated with such abuse of recursion in coping with gibberish-like sentences, rarely seen in real life language.  For natural language processing to be practical,  pursuing the linear speed cannot be over emphasized.  If we reflect on the efficiency of the human language understanding process, the conclusion is certainly about the "linear speed" in accordance with the length of the speech input.  In fact, the abuse of recursion is most likely triggered by the "inward perfection syndrome", for which we intend to cover every inch of the land within the third wall of CFG, even if it is an area piled up by gibberish or garbage.

In a sense, it can be said that one reason for the statistical approach to take over the rule-based approach for such a long time in the academia of natural language processing is just the combination effect of these two syndromes.  To overcome the effects of these syndromes, many researchers have made all kinds of efforts, to be reviewed below one by one.

Along the line of the outward fear syndrome, evidence against the context-freeness has been found in some constructions in Swiss-German.  Chinese has similar examples in expressing respective correspondence of conjoined items and their descriptions.  For example,   “张三、李四、王五的年龄分别是25岁、32岁、27岁,出生地分别是武汉、成都、苏州” (Zhang San, Li Si, Wang Wu's age is respectively 25, 32, and 27, they were born respectively in Wuhan, Chengdu, Suzhou" ).  Here, the three named entities constitute a list of nouns.  The number of the conjoined list of entities cannot be predetermined, but although the respective descriptors about this list of nouns also vary in length, the key condition is that they need to correspond to the antecedent list of nouns one by one.  This respective correspondence is something beyond the expression power of the context-free formalism.  It needs to get out of the third wall.

As for overcoming "the inward perfection syndrome", the pursuit of "linear speed" in the field of NLP has never stopped.  It ranges from allowing for the look-ahead mechanism in LR (k) grammar, to the cascaded finite state automata, to the probabilistic CFG parsers which are trained on a large treebank and eventually converted to an Ngram (n=>5) model.  It should also include RNN/LSTM for its unique pursuit for deep parsing from the statistical school.  All these efforts are striving for defining a subclass in Type-2 CFG that reaches linear speed efficiency yet still with adequate linguistic power.  In fact, all parsers that have survived after fighting the statistical methods are to some degree a result of overcoming "the inward perfection syndrome", with certain success in linear speed pursuit while respecting linguistic principles.  The resulting restricted subclass, compared to the area within the original third wall CFG, is a greatly "squashed" land.

If we agree that everything in parsing should be based on real life natural language as the starting point and the ultimate landing point, it should be easy to see that the outward limited breakthrough and the inward massive compression should be the two sides of a coin.  We want to strive for a formalism that balances both sides.  In other words, our ideal natural language parsing formalism should look like a linguistic "caterpillar" breaking through the Chomsky walls in his castle, illustrated below:

It seems to me that such a "caterpillar" may have already been found by someone.  It will not take too long before we can confirm it.
Original article in Chinese from 《穿越乔家大院寻找“毛毛虫”
Translated by Dr. Wei Li

 

 

【Related】

[转载]【白硕 - 穿越乔家大院寻找“毛毛虫”】

【立委按】

白硕老师这篇文章值得所有自然语言学者研读和反思。击节叹服,拍案叫绝,是初读此文的真切感受。白老师对乔姆斯基形式语言理论用于自然语言所造成的误导,给出了迄今所见最有深度的犀利解析,而且写得深入浅出,形象生动,妙趣横生。这么多年,这么多学者,怎么就达不到这样的深度呢?一个乔姆斯基的递归陷阱不知道栽进去多少人,造成多少人在 “不是人话” 的现象上做无用功,绕了无数弯路。学界曾有多篇长篇大论,机械地套用乔氏层级体系,在自然语言是 context-free 还是 context-sensitive 的框框里争论不休,也有折衷的说法,诸如自然语言是 mildly sensitive,这些形而上的学究式争论,大多雾里看花,隔靴搔痒,不得要领,离语言事实甚远。白老师独创的 “毛毛虫” 论,形象地打破了这些条条框框。

白老师自己的总结是:‘如果认同“一切以真实的自然语言为出发点和最终落脚点”的理念,那就应该承认:向外有限突破,向内大举压缩,应该是一枚硬币的两面。’ 此乃金玉良言,掷地有声。

【白硕 - 穿越乔家大院寻找“毛毛虫”】

看标题,您八成以为这篇文章讲的是山西的乔家大院的事儿了吧?不是。这是一篇烧脑的技术贴。如果您既不是NLP专业人士也不是NLP爱好者,就不用往下看了。

咱说的这乔家大院,是当代语言学祖师爷乔姆斯基老爷子画下来的形式语言类型谱系划分格局。最外边一圈围墙,是0型文法,又叫短语结构文法,其对应的分析处理机制和图灵机等价,亦即图灵可计算的;第二圈围墙,是1型文法,又叫上下文相关文法,其对应的分析处理机制,时间复杂度是NP完全的;第三圈围墙,是2型文法,又叫上下文无关文法,其对应的分析处理机制,时间复杂度是多项式的,最坏情况下的最好渐进阶在输入句子长度的平方和立方之间;最里边一层围墙,是3型文法,又叫正则文法,其对应的分析处理机制和确定性有限状态自动机等价,时间复杂度是线性的。这一圈套一圈的,归纳整理下来,如下图所示:

乔老爷子建的这座大院,影响了几代人。影响包括这样两个方面:

第一个方面,我们可以称之为“外向恐惧情结”。因为第二圈的判定处理机制,时间复杂度是NP完全的,于是在NP=P还没有证明出来之前,第二圈之外似乎是禁区,没等碰到已经被宣判了死刑。这样,对自然语言的描述压力,全都集中到了第三圈围墙里面,也就是上下文无关文法。大家心知肚明自然语言具有上下文相关性,想要红杏出墙,但是因为出了围墙计算上就hold不住,也只好打消此念。0院点灯……1院点灯……大红灯笼高高挂,红灯停,闲人免出。

第二个方面,我们可以称之为“内向求全情结”。2型文法大行其道,取得了局部成功,也带来了一个坏风气,就是递归的滥用。当递归层数稍微加大,人类对于某些句式的可接受性就快速衰减至几近为0。比如,“我是县长派来的”没问题,“我是县长派来的派来的”就有点别扭,“我是县长派来的派来的派来的”就不太像人话了。而影响分析判定效率的绝大多数资源投入,都花在了应对这类“不像人话”的递归滥用上了。自然语言处理要想取得实用效果,处理的“线速”是硬道理。反思一下,我们人类的语言理解过程,也肯定是在“线速”范围之内。递归的滥用,起源于“向内求全情结”,也就是一心想覆盖第三圈围墙里面最犄角旮旯的区域,哪怕那是一个由“不像人话”的实例堆积起来的垃圾堆。

可以说,在自然语言处理领域,统计方法之所以在很长时间内压倒规则方法,在一定程度上,就是向外恐惧情结与向内求全情结叠加造成的。NLP领域内也有很多的仁人志士为打破这两个情结做了各种各样的努力。

先说向外恐惧情结。早就有人指出,瑞士高地德语里面有不能用上下文无关文法描述的语言现象。其实,在涉及到“分别”的表述时,汉语也同样。比如:“张三、李四、王五的年龄分别是25岁、32岁、27岁,出生地分别是武汉、成都、苏州。”这里“张三、李四、王五”构成一个名词列表,对这类列表的一般性句法表述,肯定是不定长的,但后面的两个“分别”携带的列表,虽然也是不定长的,但却需要跟前面这个列表的长度相等。这个相等的条件,上下文无关文法不能表达,必须走出第三圈围墙。

再说向内求全情结。追求“线速”的努力,在NLP领域一直没有停止过。从允许预读机制的LR(k)文法,到有限自动机堆叠,再到基于大型树库训练出来的、最终转化为Ngram模型(N=5甚至更大)的概率上下文无关文法分析器,甚至可以算上统计阵营里孤军深入自然语言深层处理的RNN/LSTM等等,都试图从2型文法中划出一个既有足够的语言学意义、又能达到线速处理效率的子类。可以说,凡是在与统计方法的搏杀中还能活下来的分析器,无一不是在某种程度上摆脱了向内求全情结、在基本尊重语言学规律基础上尽可能追求线速的努力达到相对成功的结果。这个经过限制的子类,比起第三圈围墙来,是大大地“压扁”了的。

如果认同“一切以真实的自然语言为出发点和最终落脚点”的理念,那就应该承认:向外有限突破,向内大举压缩,应该是一枚硬币的两面。我们希望,能够有一种形式化机制同时兼顾这两面。也就是说,我们理想中的自然语言句法的形式化描述机制,应该像一条穿越乔家大院的“毛毛虫”,如下图所示:

据笔者妄加猜测,这样的“毛毛虫”,可能有人已经找到,过一段时间自然会见分晓。

from http://blog.sina.com.cn/s/blog_729574a00102wf63.html

 

【相关】

【新智元:parsing 在希望的田野上】 

【新智元:理论家的围墙和工程师的私货】

【科研笔记:NLP “毛毛虫” 笔记,从一维到二维】

【泥沙龙笔记:NLP 专门语言是规则系统的斧头】

乔姆斯基批判

泥沙龙笔记:再聊乔老爷的递归陷阱

泥沙龙笔记:骨灰级砖家一席谈,真伪结构歧义的对策(2/2) 

《自然语言是递归的么?》

语言创造简史

【置顶:立委博客NLP博文一览(定期更新版)】

 

On Hand-crafted Myth of Knowledge Bottleneck

In my article "Pride and Prejudice of Main Stream", the first myth listed as top 10 misconceptions in NLP is as follows:

[Hand-crafted Myth]  Rule-based system faces a knowledge bottleneck of hand-crafted development while a machine learning system involves automatic training (implying no knowledge bottleneck).

While there are numerous misconceptions on the old school of rule systems , this hand-crafted myth can be regarded as the source of all.  Just take a review of NLP papers, no matter what are the language phenomena being discussed, it's almost cliche to cite a couple of old school work to demonstrate superiority of machine learning algorithms, and the reason for the attack only needs one sentence, to the effect that the hand-crafted rules lead to a system "difficult to develop" (or "difficult to scale up", "with low efficiency", "lacking robustness", etc.), or simply rejecting it like this, "literature [1], [2] and [3] have tried to handle the problem in different aspects, but these systems are all hand-crafted".  Once labeled with hand-crafting, one does not even need to discuss the effect and quality.  Hand-craft becomes the rule system's "original sin", the linguists crafting rules, therefore, become the community's second-class citizens bearing the sin.

So what is wrong with hand-crafting or coding linguistic rules for computer processing of languages?  NLP development is software engineering.  From software engineering perspective, hand-crafting is programming while machine learning belongs to automatic programming.  Unless we assume that natural language is a special object whose processing can all be handled by systems automatically programmed or learned by machine learning algorithms, it does not make sense to reject or belittle the practice of coding linguistic rules for developing an NLP system.

For consumer products and arts, hand-craft is definitely a positive word: it represents quality or uniqueness and high value, a legit reason for good price. Why does it become a derogatory term in NLP?  The root cause is that in the field of NLP,  almost like some collective hypnosis hit in the community, people are intentionally or unintentionally lead to believe that machine learning is the only correct choice.  In other words, by criticizing, rejecting or disregarding hand-crafted rule systems, the underlying assumption is that machine learning is a panacea, universal and effective, always a preferred approach over the other school.

The fact of life is, in the face of the complexity of natural language, machine learning from data so far only surfaces the tip of an iceberg of the language monster (called low-hanging fruit by Church in K. Church: A Pendulum Swung Too Far), far from reaching the goal of a complete solution to language understanding and applications.  There is no basis to support that machine learning alone can solve all language problems, nor is there any evidence that machine learning necessarily leads to better quality than coding rules by domain specialists (e.g. computational grammarians).  Depending on the nature and depth of the NLP tasks, hand-crafted systems actually have more chances of performing better than machine learning, at least for non-trivial and deep level NLP tasks such as parsing, sentiment analysis and information extraction (we have tried and compared both approaches).  In fact, the only major reason why they are still there, having survived all the rejections from mainstream and still playing a role in industrial practical applications, is the superior data quality, for otherwise they cannot have been justified for industrial investments at all.

the “forgotten” school:  why is it still there? what does it have to offer? The key is the excellent data quality as advantage of a hand-crafted system, not only for precision, but high recall is achievable as well.
quote from On Recall of Grammar Engineering Systems

In the real world, NLP is applied research which eventually must land on the engineering of language applications where the results and quality are evaluated.  As an industry, software engineering has attracted many ingenious coding masters, each and every one of them gets recognized for their coding skills, including algorithm design and implementation expertise, which are hand-crafting by nature.   Have we ever heard of a star engineer gets criticized for his (manual) programming?  With NLP application also as part of software engineering, why should computational linguists coding linguistic rules receive so much criticism while engineers coding other applications get recognized for their hard work?  Is it because the NLP application is simpler than other applications?  On the contrary, many applications of natural language are more complex and difficult than other types of applications (e.g. graphics software, or word processing apps).  The likely reason to explain the different treatment between a general purpose programmer and a linguist knowledge engineer is that the big environment of software engineering does not involve as much prejudice while the small environment of NLP domain  is deeply biased, with belief that the automatic programming of an NLP system by machine learning can replace and outperform manual coding for all language projects.   For software engineering in general, (manual) programming is the norm and no one believes that programmers' jobs can be replaced by automatic programming in any time foreseeable.  Automatic programming, a concept not rare in science fiction for visions like machines making machines, is currently only a research area, for very restricted low-level functions.  Rather than placing hope on automatic programming, software engineering as an industry has seen a significant progress on work of the development infrastructures, such as development environment and a rich library of functions to support efficient coding and debugging.  Maybe in the future one day, applications can use more and more of automated code to achieve simple modules, but the full automation of constructing any complex software project  is nowhere in sight.  By any standards, natural language parsing and understanding (beyond shallow level tasks such as classification, clustering or tagging)  is a type of complex tasks. Therefore, it is hard to expect machine learning as a manifestation of automatic programming to miraculously replace the manual code for all language applications.  The application value of hand-crafting a rule system will continue to exist and evolve for a long time, disregarded or not.

"Automatic" is a fancy word.  What a beautiful world it would be if all artificial intelligence and natural languages tasks could be accomplished by automatic machine learning from data.  There is, naturally, a high expectation and regard for machine learning breakthrough to help realize this dream of mankind.  All this should encourage machine learning experts to continue to innovate to demonstrate its potential, and should not be a reason for the pride and prejudice against a competitive school or other approaches.

Before we embark on further discussions on the so-called rule system's knowledge bottleneck defect, it is worth mentioning that the word "automatic" refers to the system development, not to be confused with running the system.  At the application level, whether it is a machine-learned system or a manual system coded by domain programmers (linguists), the system is always run fully automatically, with no human interference.  Although this is an obvious fact for both types of systems, I have seen people get confused so to equate hand-crafted NLP system with manual or semi-automatic applications.

Is hand-crafting rules a knowledge bottleneck for its development?  Yes, there is no denying or a need to deny that.  The bottleneck is reflected in the system development cycle.  But keep in mind that this "bottleneck" is common to all large software engineering projects, it is a resources cost, not only introduced by NLP.  From this perspective, the knowledge bottleneck argument against hand-crafted system cannot really stand, unless it can be proved that machine learning can do all NLP equally well, free of knowledge bottleneck:  it might be not far from truth for some special low-level tasks, e.g. document classification and word clustering, but is definitely misleading or incorrect for NLP in general, a point to be discussed below in details shortly.

Here are the ballpark estimates based on our decades of NLP practice and experiences.  For shallow level NLP tasks (such as Named Entity tagging, Chinese segmentation), a rule approach needs at least three months of one linguist coding and debugging the rules, supported by at least half an engineer for tools support and platform maintenance, in order to come up with a decent system for initial release and running.  As for deep NLP tasks (such as deep parsing, deep sentiments beyond thumbs-up and thumbs-down classification), one should not expect a working engine to be built up without due resources that at least involve one computational linguist coding rules for one year, coupled with half an engineer for platform and tools support and half an engineer for independent QA (quality assurance) support.  Of course, the labor resources requirements vary according to the quality of the developers (especially the linguistic expertise of the knowledge engineers) and how well the infrastructures and development environment support linguistic development.  Also, the above estimates have not included the general costs, as applied to all software applications, e.g. the GUI development at app level and operations in running the developed engines.

Let us present the scene of the modern day rule-based system development.  A hand-crafted NLP rule system is based on compiled computational grammars which are nowadays often architected as an integrated pipeline of different modules from shallow processing up to deep processing.  A grammar is a set of linguistic rules encoded in some formalism, which is the core of a module intended to achieve a defined function in language processing, e.g. a module for shallow parsing may target noun phrase (NP) as its object for identification and chunking.  What happens in grammar engineering is not much different from other software engineering projects.  As knowledge engineer, a computational linguist codes a rule in an NLP-specific language, based on a development corpus.  The development is data-driven, each line of rule code goes through rigid unit tests and then regression tests before it is submitted as part of the updated system for independent QA to test and feedback.  The development is an iterative process and cycle where incremental enhancements on bug reports from QA and/or from the field (customers) serve as a necessary input and step towards better data quality over time.

Depending on the design of the architect, there are all types of information available for the linguist developer to use in crafting a rule’s conditions, e.g. a rule can check any elements of a pattern by enforcing conditions on (i) word or stem itself (i.e. string literal, in cases of capturing, say, idiomatic expressions), and/or (ii) POS (part-of-speech, such as noun, adjective, verb, preposition), (iii) and/or orthography features (e.g. initial upper case, mixed case, token with digits and dots), and/or (iv) morphology features (e.g. tense, aspect, person, number, case, etc. decoded by a previous morphology module), (v) and/or syntactic features (e.g. verb subcategory features such as intransitive, transitive, ditransitive), (vi) and/or lexical semantic features (e.g. human, animal, furniture, food, school, time, location, color, emotion).  There are almost infinite combinations of such conditions that can be enforced in rules’ patterns.  A linguist’s job is to code such conditions to maximize the benefits of capturing the target language phenomena, a balancing art in engineering through a process of trial and error.

Macroscopically speaking, the rule hand-crafting process is in its essence the same as programmers coding an application, only that linguists usually use a different, very high-level NLP-specific language, in a chosen or designed formalism appropriate for modeling natural language and framework on a platform that is geared towards facilitating NLP work.  Hard-coding NLP in a general purpose language like Java is not impossible for prototyping or a toy system.  But as natural language is known to be a complex monster, its processing calls for a special formalism (some form or extension of Chomsky's formal language types) and an NLP-oriented language to help implement any non-toy systems that scale.  So linguists are trained on the scene of development to be knowledge programmers in hand-crafting linguistic rules.  In terms of different levels of languages used for coding, to an extent, it is similar to the contrast between programmers in old days and the modern software engineers today who use so-called high-level languages like Java or C to code.  Decades ago, programmers had to use assembly or machine language to code a function.  The process and workflow for hand-crafting linguistic rules are just like any software engineers in their daily coding practice, except that the language designed for linguists is so high-level that linguistic developers can concentrate on linguistic challenges without having to worry about low-level technical details of memory allocation, garbage collection or pure code optimization for efficiency, which are taken care of by the NLP platform itself.  Everything else follows software development norms to ensure the development stay on track, including unit testing, baselines construction and monitoring, regressions testing, independent QA, code reviews for rules' quality, etc.  Each level language has its own star engineer who masters the coding skills.  It sounds ridiculous to respect software engineers while belittling linguistic engineers only because the latter are hand-crafting linguistic code as knowledge resources.

The chief architect in this context plays the key role in building a real life robust NLP system that scales.  To deep-parse or process natural language, he/she needs to define and design the formalism and language with the necessary extensions, the related data structures, system architecture with the interaction of different levels of linguistic modules in mind (e.g. morpho-syntactic interface), workflow that integrate all components for internal coordination (including patching and handling interdependency and error propagation) and the external coordination with other modules or sub-systems including machine learning or off-shelf tools when needed or felt beneficial.  He also needs to ensure efficient development environment and to train new linguists into effective linguistic "coders" with engineering sense following software development norms (knowledge engineers are not trained by schools today).  Unlike the mainstream machine learning systems which are by nature robust and scalable,  hand-crafted systems' robustness and scalability depend largely on the design and deep skills of the architect.  The architect defines the NLP platform with specs for its core engine compiler and runner, plus the debugger in a friendly development environment.  He must also work with product managers to turn their requirements into operational specs for linguistic development, in a process we call semantic grounding to applications from linguistic processing.   The success of a large NLP system based on hand-crafted rules is never a simple accumulation of linguistics resources such as computational lexicons and grammars using a fixed formalism (e.g. CFG) and algorithm (e.g. chart-parsing).  It calls for seasoned language engineering masters as architects for the system design.

Given the scene of practice for NLP development as describe above, it should be clear that the negative sentiment association with "hand-crafting" is unjustifiable and inappropriate.  The only remaining argument against coding rules by hands comes down to the hard work and costs associated with hand-crafted approach, so-called knowledge bottleneck in the rule-based systems.  If things can be learned by a machine without cost, why bother using costly linguistic labor?  Sounds like a reasonable argument until we examine this closely.  First, for this argument to stand, we need proof that machine learning indeed does not incur costs and has no or very little knowledge bottleneck.  Second, for this argument to withstand scrutiny, we should be convinced that machine learning can reach the same or better quality than hand-crafted rule approach.  Unfortunately, neither of these necessarily hold true.  Let us study them one by one.

As is known to all, any non-trivial NLP task is by nature based on linguistic knowledge, irrespective of what form the knowledge is learned or encoded.   Knowledge needs to be formalized in some form to support NLP, and machine learning is by no means immune to this knowledge resources requirement.  In rule-based systems, the knowledge is directly hand-coded by linguists and in case of (supervised) machine learning, knowledge resources take the form of labeled data for the learning algorithm to learn from (indeed, there is so-called unsupervised learning which needs no labeled data and is supposed to learn from raw data, but that is research-oriented and hardly practical for any non-trivial NLP, so we leave it aside for now).  Although the learning process is automatic,  the feature design, the learning algorithm implementation, debugging and fine-tuning are all manual, in addition to the requirement of manual labeling a large training corpus in advance (unless there is an existing labeled corpus available, which is rare; but machine translation is a nice exception as it has the benefit of using existing human translation as labeled aligned corpora for training).  The labeling of data is a very tedious manual job.   Note that the sparse data challenge represents the need of machine learning for a very large labeled corpus.  So it is clear that knowledge bottleneck takes different forms, but it is equally applicable to both approaches.  No machine can learn knowledge without costs, and it is incorrect to regard knowledge bottleneck as only a defect for the rule-based system.

One may argue that rules require expert skilled labor, while the labeling of data only requires high school kids or college students with minimal training.  So to do a fair comparison of the costs associated, we perhaps need to turn to Karl Marx whose "Das Kapital" has some formula to help convert simple labor to complex labor for exchange of equal value: for a given task with the same level of performance quality (assuming machine learning can reach the quality of professional expertise, which is not necessarily true), how much cheap labor needs to be used to label the required amount of training corpus to make it economically an advantage?  Something like that.  This varies from task to task and even from location to location (e.g. different minimal wage laws), of course.   But the key point here is that knowledge bottleneck challenges both approaches and it is not the case believed by many that machine learning learns a system automatically with no or little cost attached.  In fact, things are far more complicated than a simple yes or no in comparing the costs as costs need also to be calculated in a larger context of how many tasks need to be handled and how much underlying knowledge can be shared as reusable resources.  We will leave it to a separate writing for the elaboration of the point that when put into the context of developing multiple NLP applications, the rule-based approach which shares the core engine of parsing demonstrates a significant saving on knowledge costs than machine learning.

Let us step back and, for argument's sake, accept that coding rules is indeed more costly than machine learning, so what? Like in any other commodities, hand-crafted products may indeed cost more, they also have better quality and value than products out of mass production.  For otherwise a commodity society will leave no room for craftsmen and their products to survive.  This is common sense, which also applies to NLP.  If not for better quality, no investors will fund any teams that can be replaced by machine learning.  What is surprising is that there are so many people, NLP experts included, who believe that machine learning necessarily performs better than hand-crafted systems not only in costs saved but also in quality achieved.  While there are low-level NLP tasks such as speech processing and document classification which are not experts' forte as we human have much more restricted memory than computers do, deep NLP involves much more linguistic expertise and design than a simple concept of learning from corpora to expect superior data quality.

In summary, the hand-crafted rule defect is largely a misconception circling around wildly in NLP and reinforced by the mainstream, due to incomplete induction or ignorance of the scene of modern day rule development.  It is based on the incorrect assumption that machine learning necessarily handles all NLP tasks with same or better quality but less or no knowledge bottleneck, in comparison with systems based on hand-crafted rules.

 

 

Note: This is the author's own translation, with adaptation, of part of our paper which originally appeared in Chinese in Communications of Chinese Computer Federation (CCCF), Issue 8, 2013

 

[Related]

Domain portability myth in natural language processing

Pride and Prejudice of NLP Main Stream

K. Church: A Pendulum Swung Too Far, Linguistics issues in Language Technology, 2011; 6(5)

Wintner 2009. What Science Underlies Natural Language Engineering? Computational Linguistics, Volume 35, Number 4

Pros and Cons of Two Approaches: Machine Learning vs Grammar Engineering

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

Pride and Prejudice of NLP Main Stream

[Abstract]

In the area of Computational Linguistics, there are two basic approaches to natural language processing, the traditional rule system and the mainstream machine learning.  They are complementary and there are pros and cons associated with both.  However, as machine learning is the dominant mainstream philosophy reflected by the overwhelming ratio of papers published in academia, the area seems to be heavily biased against the rule system methodology.  The tremendous success of machine learning as applied to a list of natural language tasks has reinforced the mainstream pride and prejudice in favor of one and against the other.   As a result, there are numerous specious views which are often taken for granted without check, including attacks on the rule system's defects based on incomplete induction or misconception.  This is not healthy for NLP itself as an applied research area and exerts an inappropriate influence on the young scientists coming to this area.  This is the first piece of a series of writings aimed at educating the public and confronting the prevalent prejudice, focused on the in-depth examination of the so-called hand-crafted defect of the rule system and the associated  knowledge bottleneck issue.

I. introduction

Over 20 years ago, the area of NLP (natural language processing) went through a process of replacing traditional rule-based systems by statistical machine learning as the mainstream in academia.  Put in a larger context of AI (Artificial Intelligence), this represents a classical competition, and their ups and downs, between the rational school and the empirical school (Church 2007 ).  It needs to be noted that the statistical approaches' dominance in this area has its historical inevitability.   The old school was confined to toy systems or lab for too long without scientific break-through while machine learning started showing impressive results in numerous fronts of NLP in a much larger scale, initially  very low level NLP such as POS (Part-of-Speech) tagging and speech recognition / synthesis, and later on expanded to almost all NLP tasks, including machine translation, search and ranking, spam filtering, document classification, automatic summarization, lexicon acquisition, named entity tagging, relationship extraction, event classification, sentiment analysis.  This dominance has continued to grow till today when the other school is largely "out" from almost all major NLP arenas, journals and top conferences.  New graduates hardly realize its existence.  There is an entire generation gap for such academic training or carrying on the legacy of the old school, with exceptions of very few survivors (including yours truly) in industry because few professors are motivated to teach it at all or even qualified with an in-depth knowledge of this when the funding and publication prospects for the old school are getting more and more impossible.   To many people's minds today, learning (or deep learning) is NLP, and NLP is learning, that is all.  As for the "last century's technology" of rule-based systems, it is more like a failure tale from a distant history.

The pride and prejudice of the mainstream were demonstrated the most in the recent incidence when Google announced its deep-learning-based SyntaxNet and proudly claimed it to be "the most accurate parser in the world", so resolute and no any conditions attached, and without even bothering to check the possible existence of the other school.  This is not healthy (and philosophically unbalanced too) for a broad area challenged by one of the most complex problems of mankind, i.e. to decode natural language understanding.  As there is only one voice heard, it is scaring to observe that the area is packed with prejudice and ignorance with regards to the other school, some from leaders of the area.  Specious comments are rampant and often taken for granted without check.

Prejudice is not a real concern as it is part of the real world around and involving ourselves, something to do with human nature and our innate limitation and ignorance. What is really scary is the degree and popularity of such prejudice represented in numerous misconceptions that can be picked up everywhere in this circle (I am not going to trace the sources of these as they are everywhere and people who are in this area for some time know this is not Quixote's windmill but a reality reflection).  I will list below some of the myths or fallacies so deeply rooted in the area that they seem to become cliche, or part of the community consensus.  If one or more statements below sound familiar to you and they do not strike you as opinionated or specious which cannot withstand scrutiny, then you might want to give a second study of the issue to make sure we have not been subconsciously brain-washed.  The real damage is to our next generation, the new scholars coming to this field, who often do not get a chance for doubt.

For each such statement to be listed, it is not difficult to cite a poorly designed stereotypical rule system that falls short of the point, but the misconception lies in its generalization of associating an accused defect to the entire family of a school, ignorant of the variety of designs and the progress made in that school.

There are two types of misconceptions, one might be called myth and the other is sheer fallacy.  Myths arise as a result of "incomplete induction".  Some may have observed or tried some old school rule systems of some sort, which show signs of the stated defect, then they jump to conclusions leading to the myths.  These myths call for in-depth examination and arguments to get the real picture of the truth.  As for fallacies, they are simply untrue.  It is quite a surprise, though, to see that even fallacies seem to be widely accepted as true by many, including some experts in this area.  All we need is to cite facts to prove them wrong.  For example, [Grammaticality Fallacy] says that the rule system can only parse grammatical text and cannot handle degraded text with grammar mistakes in it.  Facts speak louder than words: the sentiment engine we have developed for our main products is a parsing-supportedrule-based system that fully automatically extracts and mines public opinions and consumer insights from all types of social media, typical of degraded text.  Third-party evaluations show that this system is industry leader in data quality of sentiments, significantly better than competitions adopting machine learning. The large-scale operation of our system in the cloud in handling terabytes of real life social media big data (a year of social media in our index involve about 30 billion documents across more than 40 languages) also prove wrong what is stated in [Scalability Fallacy] below.

Let us now list these widely spread rumours collected from the community about the rule-based system to see if they ring the bell before we dive into the first two core myths to uncover the true picture behind in separate blogs.

II.  Top 10 Misconceptions against Rules

[Hand-crafted Myth]  Rule-based system faces a knowledge bottleneck of hand-crafted development while a machine learning system involves automatic training (implying no knowledge bottleneck). [see On Hand-crafted Myth of Knowledge Bottleneck.]

[Domain Portability Myth] The hand-crafted nature of a rule-based system leads to its poor domain portability as rules have to be rebuilt each time we shift to a new domain; but in case of machine learning, since the algorithm and system are universal, domain shift only involves new training data (implying strong domain portability). [see Domain Portability Myth]

[Fragility Myth]  A rule-based system is very fragile and it may break before unseen language data, so it cannot lead to a robust real life application.

[Weight Myth] Since there is no statistical weight associated with the results from a rule-based system, the data quality cannot be trusted with confidence.

[Complexity Myth] As a rule-based system is complex and intertwined, it is easy to get to a standstill, with little hope for further improvement.

[Scalability Fallacy]  The hand-crafted nature of a rule-based system makes it difficult to scale up for real life application; it is largely confined to the lab as a toy.

[Domain Restriction Fallacy]  A rule-based system only works in a narrow domain and it cannot work across domains.

[Grammaticality Fallacy] A rule-based system can only handle grammatical input in the formal text (such as news, manuals, weather broadcasts), it fails in front of degraded text involving misspellings and ungrammaticality such as social media, oral transcript, jargons or OCR output.

[Outdated Fallacy]  A rule-based system is a technology of last century, it is outdated (implying that it no longer works or can result in a quality system in modern days).

[Data Quality Fallacy]  Based on the data quality of results, a machine learning system is better than a rule based system. (cf: On Recall of Grammar Engineering Systems)

III.  Retrospect and Reflection of Mainstream

As mentioned earlier, a long list of misconceptions about the old school of rule-based systems have been around the mainstream for years in the field.   It may sound weird for an interdisciplinary field named Computational Linguistics to drift more and more distant from linguistics;  linguists play less and less a role in NLP dominated by statisticians today.  It seems widely assumed that with advanced deep learning algorithms, once data are available, a quality system will be trained without the need for linguistic design or domain expertise.

Not all main stream scholars are one-sided and near-sighted.  In recent years, insightful articles  (e.g., church 2007, Wintner 2009) began a serious retrospect and reflection process and called for the return of Linguistics: “In essence, linguistics is altogether missing in contemporary natural language engineering research. … I want to call for the return of linguistics to computational linguistics.”(Wintner 2009)Let us hope that their voice will not be completely muffled in this new wave of deep learning heat.

Note that the rule system which the linguists are good at crafting in industry is different from the classical linguistic study, it is formalized modeling of linguistic analysis.  For NLP tasks beyond shallow level, an effective rule system is not a simple accumulation of computational lexicons and grammars, but involves a linguistic processing strategy (or linguistic algorithm) for different levels of linguistic phenomena.  However, this line of study on the NLP platform design, system architecture and formalism has increasingly smaller space for academic discussion and publication, the research funding becomes almost impossible, as a result, the new generation faces the risk of a cut-off legacy, with a full generation of talent gap in academia.  Church (2007) points out that the statistical research is so dominant and one-sided that only one voice is now heard.  He is a visionary main stream scientist, deeply concerned about the imbalance of the two schools in NLP and AI.  He writes:

Part of the reason why we keep making the same mistakes, as Minsky and Papert mentioned above, has to do with teaching. One side of the debate is written out of the textbooks and forgotten, only to be revived/reinvented by the next generation.  ......

To prepare students for what might come after the low hanging fruit has been picked over, it would be good to provide today’s students with a broad education that makes room for many topics in Linguistics such as syntax, morphology, phonology, phonetics, historical linguistics and language universals. We are graduating Computational Linguistics students these days that have very deep knowledge of one particular narrow sub-area (such as machine learning and statistical machine translation) but may not have heard of Greenberg’s Universals, Raising, Equi, quantifier scope, gapping, island constraints and so on. We should make sure that students working on co-reference know about c-command and disjoint reference. When students present a paper at a Computational Linguistics conference, they should be expected to know the standard treatment of the topic in Formal Linguistics.

We ought to teach this debate to the next generation because it is likely that they will have to take Chomsky’s objections more seriously than we have. Our generation has been fortunate to have plenty of low hanging fruit to pick (the facts that can be captured with short ngrams), but the next generation will be less fortunate since most of those facts will have been pretty well picked over before they retire, and therefore, it is likely that they will have to address facts that go beyond the simplest ngram approximations.

 

 

About Author

Dr. Wei Li is currently Chief Scientist at Netbase Solutions in the Silicon Valley, leading the effort for the design and development of a multi-lingual sentiment mining system based on deep parsing.  A hands-on computational linguist with 30 years of professional experience in Natural Language Processing (NLP), Dr. Li has a track record of making NLP work robust. He has built three large-scale NLP systems, all transformed into real-life, globally distributed products.

 

Note: This is the author's own translation, with adaptation, of our paper in Chinese which originally appeared in W. Li & T. Tang, "Pride and Prejudice of Main Stream:  Rule-based System vs. Machine Learning", in Communications of Chinese Computer Federation (CCCF), Issue 8, 2013

 

[Related]

K. Church: A Pendulum Swung Too Far, Linguistics issues in Language Technology, 2011; 6(5)

Wintner 2009. What Science Underlies Natural Language Engineering? Computational Linguistics, Volume 35, Number 4

Domain portability myth in natural language processing

On Hand-crafted Myth and Knowledge Bottleneck

On Recall of Grammar Engineering Systems

Pros and Cons of Two Approaches: Machine Learning vs Grammar Engineering

It is untrue that Google SyntaxNet is the “world’s most accurate parser”

R. Srihari, W Li, C. Niu, T. Cornell: InfoXtract: A Customizable Intermediate Level Information Extraction Engine. Journal of Natural Language Engineering, 12(4), 1-37, 2006

Introduction of Netbase NLP Core Engine

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

 

【主流的傲慢与偏见:规则系统与机器学习】

I.          引言

有回顾NLP(Natural Language Processing)历史的知名学者介绍机器学习(machine learning)取代传统规则系统(rule-based system)成为学界主流的掌故,说20多年前好像经历了一场惊心动魄的宗教战争。必须承认,NLP 这个领域,统计学家的完胜,是有其历史必然性的。机器学习在NLP很多任务上的巨大成果和效益是有目共睹的:机器翻译,语音识别/合成,搜索排序,垃圾过滤,文档分类,自动文摘,词典习得,专名标注,词性标注等(Church 2007)。

然而,近来浏览几篇 NLP 领域代表人物的综述,见其中不乏主流的傲慢与偏见,依然令人惊诧。细想之下,统计学界的确有很多对传统规则系统根深蒂固的成见和经不起推敲但非常流行的蛮横结论。可怕的不是成见,成见无处不在。真正可怕的是成见的流行无阻。而在NLP这个领域,成见的流行到了让人瞠目结舌的程度。不假思索而认同接受这些成见成为常态。因此想到立此存照一下,并就核心的几条予以详论。下列成见随处可见,流传甚广,为免纷扰,就不列出处了,明白人自然知道这绝不是杜撰和虚立的靶子。这些成见似是而非,经不起推敲,却被很多人视为理所当然的真理。为每一条成见找一个相应的规则系统的案例并不难,但是从一些特定系统的缺陷推广到对整个规则系统的方法学上的批判,乃是其要害所在。

【成见一】规则系统的手工编制(hand-crafted)是其知识瓶颈,而机器学习是自动训练的(言下之意:没有知识瓶颈)。

【成见二】规则系统的手工编制导致其移植性差,转换领域必须重启炉灶,而机器学习因为算法和系统保持不变,转换领域只要改变训练数据即可(言下之意:移植性强)。

【成见三】规则系统很脆弱,遇到没有预测的语言现象系统就会 break(什么叫 break,死机?瘫痪?失效?),开发不了鲁棒(robust)产品。

【成见四】规则系统的结果没有置信度,鱼龙混杂。

【成见五】规则系统的编制越来越庞杂,最终无法改进,只能报废。

【成见六】规则系统的手工编制注定其无法实用,不能 scale up,只能是实验室里的玩具。

【成见七】规则系统只能在极狭窄的领域成事,无法实现跨领域的系统。

【成见八】规则系统只能处理规范的语言(譬如说明书、天气预报、新闻等),无法应对 degraded text,如社会媒体、口语、方言、黑话、OCR 文档。

【成见九】规则系统是上个世纪的技术,早已淘汰(逻辑的结论似乎是:因此不可能做出优质系统)。

【成见十】从结果上看,机器学习总是胜过规则系统。

所列“成见”有两类:一类是“偏”见,如【成见一】至【成见五】。这类偏见主要源于不完全归纳,他们也许看到过或者尝试过规则系统某一个类型, 浅尝辄止,然后遽下结论(jump to conclusions)。盗亦有道,情有可原,虽然还是应该对其一一纠“正”。本文即是拨乱反正的第一篇。成见的另一类是谬见,可以事实证明其荒谬。令人惊诧的是,谬见也可以如此流行。【成见五】以降均属不攻自破的谬见。譬如【成见八】说规则系统只能分析规范性语言。事实胜于雄辩,我们开发的以规则体系为主的舆情挖掘系统处理的就是非规范的社交媒体。这个系统的大规模运行和使用也驳斥了【成见六】,可以让读者评判这样的规则系统够不够资格称为实用系统:

 

以全球500强企业为主要客户的多语言客户情报挖掘系统由前后两个子系统组成。核心引擎是后台子系统(back-end indexing engine),用于对社交媒体大数据做自动分析和抽取。分析和抽取结果用开源的Apache Lucene文本搜索引擎(lucene.apache.org) 存储。生成后台索引的过程基于Map-Reduce框架,利用计算云(computing cloud) 中200台虚拟服务器进行分布式索引。对于过往一年的社会媒体大数据存档(约300亿文档跨越40多种语言),后台索引系统可以在7天左右完成全部索引。前台子系统(front-end app)是基于 SaaS 的一种类似搜索的应用。用户通过浏览器登录应用服务器,输入一个感兴趣的话题,应用服务器对后台索引进行分布式搜索,搜索的结果在应用服务器经过整合,以用户可以预设(configurable)的方式呈现给用户。这一过程立等可取,响应时间不过三四秒。

                                                                                                                                                II.         规则系统手工性的责难

【成见一】说:规则系统的手工编制(hand-crafted)是其知识瓶颈,而机器学习是自动训练的(言下之意:因此没有知识瓶颈)。

NLP主流对规则系统和语言学家大小偏见积久成堆,这第一条可以算是万偏之源。随便翻开计算语言学会议的论文,无论讨论什么语言现象,为了论证机器学习某算法的优越,在对比批评其他学习算法的同时,规则系统大多是随时抓上来陪斗的攻击对象,而攻击的理由往往只有这么一句话,规则系统的手工性决定了 “其难以开发”(或“其不能 scale up”,“其效率低下”,“其不鲁棒”,不一而足),或者干脆不给具体理由,直接说“文献【1】【2】【3】尝试了这个问题的不同方面,但这些系统都是手工编制的”,一句话判处死刑,甚至不用讨论它们的效果和质量。手工性几乎成了规则系统的“原罪”,编制这些系统的语言学家因此成为学术共同体背负原罪的二等公民。

手工编制(hand-crafted)又如何?在日常消费品领域,这是对艺人特别的嘉奖,是对批量机械化生产和千篇一律的反抗,是独特和匠心的代表,是高价格理直气壮的理由。缘何到了NLP领域,突然就成贬义词了呢?这是因为在NLP领域,代表主流的统计学家由于他们在NLP某些任务上的非凡成功,居功自傲,把成功无限夸大,给这个共同体施行了集体催眠术,有意无意引导人相信机器学习是万能的。换句话说,批判手工编制的劣根性,其隐含的前提是机器学习是万能的,有效的,首选的。而实际情况是,面对自然语言的复杂性,机器学习只是划过了语言学的冰山一角,远远没有到主流们自觉或不自觉吹嘘的万能境界。催眠的结果是,不独不少语言学家以及NLP相关利益方(如投资人和用户)被他们洗脑了,连他们自己也似乎逐渐相信了自己编制的神话。

真实世界中,NLP 是应用学科,最终结果体现在应用软件(applications)上,属于语言软件工程。作为一个产业,软件工程领域吸引了无数软件工程师,虽然他们自嘲为“码工”,社会共同体给予他们的尊重和待遇是很高的(Bill Gates 自封了一个 Chief Engineer,说明了这位软件大王对工匠大师的高度重视)。古有鲁班,现有码师(coding master)。这些码工谁不靠手工编制代码作为立足之本呢?没听说一位明星工程师因为编制代码的手工性质而被贬损。同是软件工程,为什么计算语言学家手工编制NLP代码与其他工程师手工编制软件代码,遭遇如此不同的对待。难道是因为NLP应用比其他应用简单?恰恰相反,自然语言的很多应用比起大多数应用(譬如图形软件、字处理软件等等)更加复杂和艰难。解释这种不同遭遇的唯一理由就是,作为大环境的软件领域没有NLP主流的小环境里面那么多的傲慢和偏见。软件领域的大师们还没有狂妄到以为可以靠自动编程取代手工编程。他们在手工编程的基础建设(编程架构和开发环境等)上下功夫,而不是把希望寄托在自动编程的万能上。也许在未来的某一天,一些简单的应用可以用代码自动化来实现,但是复杂任务的全自动化从目前来看是遥遥无期的。不管从什么标准来看,非浅层的自然语言分析和理解都是复杂任务的一种。因此,机器学习作为自动编程的一个体现是几乎不可能取代手工代码的。规则系统的NLP应用价值会长期存在。

自动是一个动听的词汇。如果一切人工智能都是自动学习的,前景该有多么美妙。机器学习因为与自动连接在一起,显得那么高高在上,让人仰视。它承载着人类对未来世界的幻想。这一切理应激励自动学习专家不断创新,而绝不该成为其傲慢和偏见的理由。

在下面具体论述所谓规则系统的知识瓶颈软肋之前,值得一提的是,本文所谓自动是指系统的开发,不要混淆为系统的应用。在应用层面,无论是机器学习出来的系统,还是手工编制的系统,都是全自动地服务用户的,这是软件应用的性质决定的。虽然这是显而易见的事实,可确实有人被误导,一听说手工编制,就引申为基于规则系统的应用也是手工的,或者半自动的。

手工编制NLP系统是不是规则系统的知识瓶颈?毋庸讳言,确实如此。这个瓶颈体现在系统开发的周期上。但是,这个瓶颈是几乎所有大型软件工程项目所共有的,是理所当然的资源成本,不独为 NLP “专美”。从这个意义上看,以知识瓶颈诟病规则系统是可笑的,除非可以证明对所有NLP项目,用机器学习开发系统比编制规则系统,周期短且质量高(个别的项目可能是这样,但一般而言绝非如此,后面还要详谈)。大体说来,对于NLP的浅层应用(譬如中文切词,专名识别,等等),没有三个月的开发,没有至少一位计算语言学家手工编制和调试规则和至少半个工程师的平台层面的支持,是出不来规则系统的。对于NLP的深层应用(如句法分析,舆情抽取等),没有至少一年的开发,涉及至少一位计算语言学家的手工编制规则,至少半个质量检测员的协助和半个工程师的平台支持,外加软件工程项目普遍具有的应用层面的用户接口开发等投入,也是出不来真正的软件产品的。当然需要多少开发资源在很大程度上决定于开发人员(包括作为知识工程师的计算语言学家)的经验和质量以及系统平台和开发环境的基础(infrastructures)如何。

计算语言学家编制规则系统的主体工作是利用形式化工具编写并调试语言规则、各类词典以及语言分析的流程调控。宏观上看,这个过程与软件工程师编写应用程序没有本质不同,不过是所用的语言、形式框架和开发平台(language,formalism and development platform)不同,系统设计和开发的测重点不同而已。这就好比现代的工程师用所谓高级语言 Java 或者 C,与30年前的工程师使用汇编语言的对比类似,本质是一样的编程,只是层次不同罢了。在为NLP特制的“高级”语言和平台上,计算语言学家可以不用为内存分配等非语言学的工程细节所羁绊,一般也不用为代码的优化和效率而烦扰,他们的注意力更多地放在面对自然语言的种种复杂现象,怎样设计语言处理的架构和流程,怎样平衡语言规则的条件宽窄,怎样与QA(质量检测)协调确保系统开发的健康,怎样保证语言学家团队编制规则的操作规范(unit testing,regression testing,code review,baselines,等等)以确保系统的可持续性,怎样根据语言开发需求对于现有形式框架的限制提出扩展要求,以及怎样保证复杂系统的鲁棒性,以及怎样突破规则系统的框架与其他语言处理包括机器学习进行协调,等等。一个领头的计算语言学家就是规则系统的架构师,系统的成败绝不仅仅在于语言规则的编制及其堆积,更多的决定于系统架构的合理性。明星工程师是软件企业的灵魂,NLP 规则系统的大规模成功也一样召唤着语言工程大师。

关于知识瓶颈的偏见,必须在对比中评估。自然语言处理需要语言学知识,把这些知识形式化是每个NLP系统的题中应有之义,机器学习绝不会自动免疫,无需知识的形式化。规则系统需要语言学家手工开发的资源投入,机器学习也同样需要资源的投入,不过是资源方式不同而已。具体说,机器学习的知识瓶颈在于需要大数量的训练数据集。排除研究性强实用性弱的无监督学习(unsupervised learning),机器学习中可资开发系统的方法是有监督的学习(supervised learning)。有监督的学习能开发知识系统成为应用的前提是必须有大量的手工标注的数据,作为学习的源泉。虽然机器学习的过程是自动的(学习算法的创新、调试和实现当然还是手工的),但是大量的数据标注则是手工的(本来就有现成标注不计,那是例外)。因此,机器学习同样面临知识瓶颈,不过是知识瓶颈的表现从需要少量的语言学家变成需要大量的低端劳动者(懂得语言及其任务的中学生或大学生即可胜任)。马克思说金钱是一般等价物,知识瓶颈的问题于是转化为高级劳动低级劳动的开销和转换问题:雇佣一个计算语言学家的代价大,还是雇佣10个中学生的代价大?虽然这个问题根据不同项目不同地区等因素答案会有不同,但所谓机器学习没有知识瓶颈的神话可以休矣。

另外,知识瓶颈的对比问题不仅仅是针对一个应用而言,而应该放在多应用的可移植性上来考察。我们知道大多数非浅层的NLP应用的技术支持都源于从自然语言做特定的信息抽取:抽取关系、事件、舆情等。由于机器学习把信息抽取看成一个直接对应输入和输出的黑匣子,所以一旦改变信息抽取目标和应用方向,以前的人工标注就废弃了,作为知识瓶颈的标注工作必须完全重来。可是规则系统不同,它通常设计成一个规则层级体系,由独立于领域的语言分析器(parser)来支持针对领域的信息抽取器(extractor)。结果是,在转移应用目标的时候,作为技术基础的语言分析器保持不变,只需重新编写不同的抽取规则而已。实践证明,对于规则系统,真正的知识瓶颈在语言分析器的开发上,而信息抽取本身花费不多。这是因为前者需要应对自然语言变化多端的表达方式,将其逻辑化,后者则是建立在逻辑形式(logical form)上,一条规则等价于底层规则的几百上千条。因此,从多应用的角度看,规则系统的知识成本趋小,而机器学习的知识成本则没有这个便利。

                                                                                                                                                III.        主流的反思

如前所述,NLP领域主流意识中的成见很多,积重难返。世界上还很少有这样的怪现象:号称计算语言学(Computational Linguistics)的领域一直在排挤语言学和语言学家。语言学家所擅长的规则系统,与传统语言学完全不同,是可实现的形式语言学(Formal Linguistics)的体现。对于非浅层的NLP任务,有效的规则系统不可能是计算词典和文法的简单堆积,而是蕴含了对不同语言现象的语言学处理策略(或算法)。然而,这一路研究在NLP讲台发表的空间日渐狭小,资助亦难,使得新一代学人面临技术传承的危险。Church (2007)指出,NLP研究统计一边倒的状况是如此明显,其他的声音已经听不见。在浅层NLP的低垂果实几乎全部采摘完毕以后,当下一代学人面对复杂任务时,语言学营养缺乏症可能导致统计路线捉襟见肘。

可喜的是,近年来主流中有识之士(如,Church 2007, Wintner 2009)开始了反思和呼吁,召唤语言学的归来:“In essence, linguistics is altogether missing in contemporary natural language engineering research. … I want to call for the return of linguistics to computational linguistics.”(Wintner 2009)。相信他们的声音会受到越来越多的人的注意。

 

参考文献
  • Church 2007. A Pendulum Swung Too Far.  Linguistics issues in Language Technology,  Volume 2, Issue 4.
  • Wintner 2009. What Science Underlies Natural Language Engineering? Computational Linguistics, Volume 35, Number 4

 

原载 《W. Li & T. Tang: 主流的傲慢与偏见:规则系统与机器学习》
【计算机学会通讯】2013年第8期(总第90期)

[Abstract]

Pride and Prejudice in Mainstream:  Rule System vs. Machine Learning

In the area of Computational Linguistics, there are two basic approaches to natural language processing, the traditional rule system and the mainstream machine learning.  They are complementary and there are pros and cons associated with both.  However, as machine learning is the dominant mainstream philosophy reflected by the overwhelming ratio of papers published in academia, the area seems to be heavily biased against the rule system methodology.  The tremendous success of machine learning as applied to a list of natural language tasks has reinforced the mainstream pride and prejudice in favor of one and against the other.   As a result, there are numerous specious views which are often taken for granted without check, including attacks on the rule system's defects based on incomplete induction or misconception.  This is not healthy for NLP itself as an applied research area and exerts an inappropriate influence on the young scientists coming to this area.  This is the first piece of a series of writings aimed at correcting the prevalent prejudice, focused on the in-depth examination of the so-called hand-crafted defect of the rule system and the associated  knowledge bottleneck issue.

【相关】

K. Church: A Pendulum Swung Too Far, Linguistics issues in Language Technology, 2011; 6(5)

【科普随笔:NLP主流的傲慢与偏见】

Pride and Prejudice of NLP Main Stream

On Hand-crafted Myth and Knowledge Bottleneck

Domain portability myth in natural language processing

关于NLP方法论以及两条路线之争】 专栏:NLP方法论

【置顶:立委NLP博文一览】

《朝华午拾》总目录