【Ni Salon Notes: Until we get past the hurdle of labeled big data, spare us the grand talk of an AI revolution】

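Li: A couple of days ago I was chatting with an authority from the NLP mainstream about the current state and prospects of man-made intelligence. I asked: if man-made intelligence is so impressive, find me one case in natural language where unsupervised learning, with no labeled big data, has been deployed successfully. Just one.

In fact not a single case can be found in the mainstream (there are some outside it, but people habitually look past them). Everything that has succeeded at scale in the mainstream is supervised learning, and all of it runs on big data. As the saying goes: as much human labor, as much intelligence.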

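Mao: That is too harsh a demand. Didn't we humans have to be taught by our parents when we were little? @wei

Li: Not the same. What parents teach is not big data. A child learns from small data, inferring many cases from one example rather than one case from a hundred examples. Of course, Chomsky holds that the credit goes neither to the parents nor to the child but to God: it is innate, hard-wired by heredity.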

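Bai: Whether the human labor goes into the corpus or into resources is what really distinguishes the technical approaches.

Li: Agreed. The former is simple and brute-force and easy to roll out; the latter takes careful design.

Until we get past the hurdle of labeled big data, let's not talk grandly about a revolution in man-made I.

There are cases of effective learning that seemingly need no labeled big data. We can go through them one by one and see whether unsupervised learning has really broken through and the knowledge bottleneck has quietly dissolved.

MT needs no discussion: it has an inexhaustible supply of labeled big data. Humans have been translating for ages and will keep on translating, or will use MT and then revise and edit. The spring never runs dry. The upside is that the data is free, a by-product of normal human translation activity.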

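Bai: Label the small data, cluster the big data, and let the small data generalize along the clusters. In practice that is just collaborative recommendation.

Li: Fine, let's look at clustering on big data. Clustering is unsupervised by nature; are there success stories? Clustering is a good thing, but cases where it has succeeded at scale on its own are almost nowhere to be seen.

Bai: With small data added, it is no longer pure clustering.

Li: Right. There used to be an approach that seemed partly successful: cluster first, then apply a little manual intervention (give each good cluster a name, kick out by hand the intruders that slipped into the ranks, and so on), then use those names as labels to turn the clustering into a classification that can be deployed and is worth something. It is a bait-and-switch that more or less gets around the knowledge bottleneck of scarce labeled data: clustering --> classification, reaching the goal by a detour.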
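The cluster-then-name route Li describes can be sketched roughly as follows. This is a toy illustration only, not anyone's production method: the 2-D points stand in for feature vectors, and the data, the cluster names, the rejected "intruder", and all function names (`kmeans`, `classify`, and so on) are hypothetical.

```python
from math import dist

def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))

def mean(group):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(group) for c in zip(*group))

def kmeans(points, centroids, iterations=10):
    """Plain k-means; returns the groups from the last assignment step."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        centroids = [mean(g) if g else centroids[i] for i, g in enumerate(groups)]
    return groups

# Step 1: unsupervised clustering of unlabeled points (toy data).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (2.6, 2.6),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
groups = kmeans(points, centroids=[points[0], points[4]])

# Step 2: light manual intervention: name the clusters, kick out intruders.
names = {0: "low", 1: "high"}   # names chosen by a human reviewer
intruders = {(2.6, 2.6)}        # members the reviewer rejects by hand

# Step 3: the named, cleaned clusters now serve as labeled training data.
labeled = {names[i]: [p for p in g if p not in intruders]
           for i, g in enumerate(groups)}
prototypes = {name: mean(members) for name, members in labeled.items()}

def classify(p):
    """Nearest-prototype classifier trained on the named clusters."""
    return min(prototypes, key=lambda name: dist(p, prototypes[name]))
```

The human effort here is two dictionary entries and one rejection, rather than a label for every point; that is the sense in which the detour sidesteps large-scale annotation.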

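Bai: Labeled small data matters more.

Li: That is another route, known as seeds or bootstrapping: find a way to propagate the labels. Used cleverly it has had partial successes too; it counts as weakly supervised learning.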
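As a rough illustration of the seed-and-propagate idea, here is a minimal sketch of label propagation over a similarity graph. It assumes a simple majority vote among already-labeled neighbors, with ties left unlabeled; the word graph, the seed labels, and the function name `bootstrap` are all made up for the example.

```python
from collections import Counter

def bootstrap(edges, seeds, rounds=5):
    """Spread seed labels along graph edges by majority vote of labeled neighbors."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    labels = dict(seeds)
    for _ in range(rounds):
        new = {}
        for node in neighbors:
            if node in labels:
                continue
            votes = Counter(labels[n] for n in neighbors[node] if n in labels)
            if votes:
                (top, count), *rest = votes.most_common()
                if not rest or count > rest[0][1]:   # leave ties unlabeled
                    new[node] = top
        if not new:          # nothing left to propagate
            break
        labels.update(new)   # apply one round's decisions together
    return labels

# Hypothetical "shares a context with" links between words.
edges = [("apple", "pear"), ("pear", "plum"),
         ("sedan", "truck"), ("truck", "van"),
         ("apple", "orange"), ("sedan", "orange")]
seeds = {"apple": "fruit", "sedan": "vehicle"}   # the labeled small data
labels = bootstrap(edges, seeds)
```

Two seeds end up labeling four more words, while the ambiguous node stays unlabeled. Whether such propagation holds up on real data depends heavily on how clean the links are, which is why Li only credits the approach with partial successes.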

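Bai: Clustering is a purely geometric operation; it does not know what to be sensitive to. Small data tells you what to be sensitive to. The two wheels drive together and neither can be neglected. Big-data clustering can supply the suspected hideouts, and small data takes them out in one strike. Not every suspected hideout is worth raiding. Clustering is a topology of equipotential lines (hypersurfaces) enclosing one another.

Mao: Liwei, aren't you just quibbling? Nobody is claiming that AI already equals human intelligence.

Xiao: User segmentation is very useful, and there are lots and lots of examples. Clustering for anomaly detection also has many successful applications, anti-fraud for instance.

Li: Coarse-grained applications of clustering results probably do exist. In settings where you would rather over-flag than miss a case, or where anything is better than nothing, clustering can be loose or tight and used off the shelf. It has a statistical basis, and as a reference it beats flying blind. At fine grain it falls apart. Anyone who has inspected clustering results with their own eyes and brain has mostly had this experience: call it wrong and it still looks about right; call it right and there are mouse droppings all through the pot of rice. The usual feeling is of a chicken rib, tasteless to eat but a pity to throw away. You don't dare use it, and hooking it up takes a lot of effort. Clustering is useful for lexicon acquisition, but in the end the hard part is still the hookup: labeling the clusters afterwards (naming them) and aligning those labels with the existing knowledge system.

Bai: No need; internal IDs are enough. Take XOR as an example. Clustering can produce the four quadrants, and there is no need to name each one. If the small data points to quadrants I and III, attach one external label to the internal names of those two clusters. Guided by the small data, clustering goes coarse where it should be coarse and fine where it should be fine. You cannot use just one scale; the success of wavelets is the lesson to draw on. I remember sheet-metal work in the factory: the master leads with his hammer, not hitting hard but always landing on the right spot; the apprentice has the strength but must not swing at random and has to follow the master. Small data is the master, big data is the apprentice.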
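Bai's XOR example can be sketched as follows, as a toy under stated assumptions: a plain k-means with hand-picked initial centroids stands in for "big-data clustering", clusters are known only by their internal IDs 0 to 3, and two labeled points play the role of the small data. The label name "same-sign", the data, and the function names are hypothetical.

```python
from math import dist

def nearest(p, centroids):
    """Internal cluster ID: index of the closest centroid."""
    return min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))

def kmeans(points, centroids, iterations=10):
    """Plain k-means; returns the final centroids, indexed by cluster ID."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        centroids = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return centroids

# Unlabeled "big data": the clustering finds four groups, one per quadrant.
unlabeled = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1),        # quadrant I
             (-1.0, 1.0), (-1.1, 0.9), (-0.9, 1.2),     # quadrant II
             (-1.0, -1.0), (-1.2, -0.8), (-0.9, -1.1),  # quadrant III
             (1.0, -1.0), (0.9, -1.2), (1.1, -0.9)]     # quadrant IV
centroids = kmeans(unlabeled, centroids=[unlabeled[i] for i in (0, 3, 6, 9)])

# Labeled small data: it happens to point at quadrants I and III.
small_data = [((0.7, 0.9), "same-sign"), ((-0.8, -1.0), "same-sign")]

# Attach the external label to the internal IDs the small data lands in.
tagged = {nearest(p, centroids): label for p, label in small_data}

def predict(p):
    """External label of p's cluster, or None if that cluster was never tagged."""
    return tagged.get(nearest(p, centroids))
```

No cluster is ever given a name of its own. Two labeled points are enough to merge clusters 0 and 2 under one external label, and the other two clusters simply stay untagged.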

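Li: That is a vivid image.

The recent successes in NLG (natural language generation) came because language models grew powerful under deep learning. The sentences they generate are smoother than what we ordinary people write. After all our years of schooling, we still make grammatical mistakes and write clumsy sentences. Machine-generated sentences increasingly "surpass" humans. How come?

It turns out NLG is even more uncanny and more formidable than MT. MT still needs bilingual translation data, whereas NLG deals with a single language, where the data is endless, a glut of text. It is an extreme situation in which a raw corpus turns out to be equivalent to a labeled corpus. Every one of us who writes an article is potentially providing labeling services for NLG. Everything that distinguishes a natural-language sentence from the output of a random word generator is embodied in every human-written article. It would be strange if NLG did not excel. NLG can be expected to develop a great deal, in practical writing and similar areas. Those with children can ease up on their language drills: in future, as long as the children learn to make good use of the machine, none of them will be unable to write a proper piece.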
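The point that a raw corpus already acts as a labeled corpus can be made concrete with a toy bigram language model: every word in running text serves as the "label" for the word before it, so the training pairs come for free. This is a minimal sketch with a made-up three-sentence corpus and hypothetical function names, nowhere near a neural language model.

```python
from collections import Counter, defaultdict

def training_pairs(sentences):
    """Turn raw text into (context, label) pairs: each next word is the label."""
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        yield from zip(tokens, tokens[1:])

def train_bigram(sentences):
    """Count how often each word follows each context word."""
    model = defaultdict(Counter)
    for prev, nxt in training_pairs(sentences):
        model[prev][nxt] += 1
    return model

def generate(model, max_len=10):
    """Greedy generation: always emit the most frequent next word."""
    word, out = "<s>", []
    for _ in range(max_len):
        word = model[word].most_common(1)[0][0]
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

# Raw, unannotated text is the only input (toy corpus).
corpus = ["machines write fluent text",
          "machines write fluent prose",
          "people write fluent text"]
model = train_bigram(corpus)
sentence = generate(model)   # "machines write fluent text"
```

Nobody annotated anything: the 15 training pairs fall out of three plain sentences, which is why the supply of "labels" for language modeling grows with every article people write.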

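Bai: If everyone's writing tastes the same, that is a problem too. Writers with strong styles should serve as attractors that form a number of vortices; once you come near a vortex, it keeps pulling you in. At the very least it should not be a thousand people with one face.

Xiao: (NLG) still cannot write summaries well.

Li: Children need not take a special course in practical writing; computers will help with the writing later anyway. It is like my fountain-pen handwriting when I was small: it looked like a dog's scrawl, and I was always miserable about it, envying the friends who practiced every day and whose writing was a pleasure to look at. (Back then, when I liked a girl, half of it was her face and half was her handwriting.) As it turned out, once I grew up I had almost no occasion to write with a pen except to sign my name.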

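In the past, what people most relished and revered in a great scholar (in the humanities) was:

(1) Memory: never forgetting what one has read, and being able to cite, eyes closed, the chapter and even the page of an allusion. The Academy of Social Sciences has many such widely told anecdotes, especially about old masters such as Qian Zhongshu and Lü Shuxiang.

Ma: I know a professor in science and engineering who, in conversation, will often say that a certain journal raised this question in such-and-such an issue on such-and-such a page.

Li: (2) Fine calligraphy. (3) Poetry and verse. For a machine, these turn out to be the easy ones.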

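Bai: Please don't bring up machine poetry; it kills the appetite.

Li: I feel that Tsinghua's "Jiuge" (《九歌》) writes better than a fair number of Guo Moruo's poems. As a kid I watched Guo Moruo take Sihanouk on an outing in the mountains; in a burst of poetic feeling he improvised a poem on the spot. That poem was purely occasional, dry and flavorless. The tonal patterns and such were presumably right, but as for mood or poetic flavor, there was nothing.

However off-putting machine poetry may be, it cannot rival that performance of Guo Moruo's (I forget which documentary it was), which left a very deep impression on me, and not a good one. Of course, artistic appreciation is a matter of taste and nothing is absolute. But looking ahead, machine poetry still has plenty of room to improve. For a person, learning the Three Hundred Tang Poems by heart is already hard; feeding a machine the Complete Tang Poems to imitate is trivial. A person who has to compose on the spot under time pressure has little room to improve. A genius like the author of the Seven-Step Poem is not even one in ten thousand.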

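Bai: End-to-end training, simple unstructured labels, and big data are the keys to commercializing deep learning. Whenever scenario-specific settings, complex structure, or small data are involved, deep learning is bound to fit poorly. Whether to make applications accommodate deep learning or to use technology to reshape deep learning is not a simple choice. I advocate: 1. move the object of annotation from the "corpus" to "resources"; 2. use labeled small data to lead unlabeled big data; 3. respect domain experts and integrate domain knowledge.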

【Related】

【Liwei's Sketches: AI is fake I】

【Semantic Computing: The Li-Bai Dialogues Series】

【Pinned: Overview of Liwei's NLP Blog Posts】

《朝华午拾》 Table of Contents

AI is fake I

The term Artificial Intelligence (AI), which traces its roots to the historic Dartmouth conference, was something of an afterthought on the part of the thought leaders of the time, with the emphasis on artificiality. In essence, it defines the true nature of AI as a fake intelligence that simulates human intelligence. But we often seem to forget that.

The soy products commonly known as "vegetarian chicken" or "vegetarian duck" are generally classified as "artificial protein". The gap between "artificial protein" and "animal protein" is very comparable to that between "artificial intelligence" and "human intelligence". Every vegetarian who eats "vegetarian chicken" knows perfectly well that it is fake meat, and is comfortable enjoying its great taste for what it is. In contrast, almost all media and the majority of users of AI products today rarely think of AI as fake intelligence. That is quite a surprise to me.

I don't know whether it is just tabloid hype or the truth, but the impression is fairly clear that the popular AI stars more and more often act like God. They seem to love big words and philosophical metaphors that lead the masses to put an equal sign between AI and human I. I don't think it is so much a sense of mission as a sense of superiority and ego: they simply feel too good about themselves for mastering some magic of AI algorithms. It occurs to me that if you act like God and talk like God, over time you will believe you are God. In times of AI bubbles, people buy that; more importantly, the media love it, and investors are willing to pay a premium.

My entire career has been devoted to "natural language understanding" (NLU), with a focus on "parsing", which was for a long time widely accepted as the key to language understanding, the crown of artificial intelligence as some experts put it. As practitioners developing industrial products, we know that all these AI terms, such as language understanding, machine learning, and neural networks, plus AI itself, are just analogies or metaphors. AI models are just simulations, mechanical programs attempting to mimic intelligent tasks. But that is apparently not the picture painted by the media in their "AI marketing", nor is it what the few AI stars in the spotlight teach the public. Public opinion, and even decision-makers, shaped or influenced by such media, lean more and more toward the opposite view. So it might be high time to air a different voice and bring the true nature back to light. Artificial intelligence is fake intelligence by its very nature, filled with "artful deception", as Pierce pointed out earlier in the history of AI. His criticism has never gone out of date. In fact, there has never been a time with this much "artful deception" built into products such as intelligent assistants, so artful that we are starting to get used to it for the convenience.

What is "understanding"? Strictly speaking, the computer has zero intelligence apart from its mechanical computation and memorization. Natural language understanding has always been a metaphor by convention; that is why the Turing test was purposely designed to define "artificial intelligence" by bypassing "understanding". This is by no means to deny the recent breakthroughs in the functional success stories of AI applications such as speech processing, image recognition, and machine translation.

We have all had moments when we were amazed at some function performed by a non-human. As a child, I was amazed for quite some time that the radio could "talk", and at how "intelligent" this box called a radio was. My mother was confined to a remote rural area in her childhood, and when she went to middle school in the nearby town, she saw an automobile running on the road for the first time. She ran away in awe, and years later she described to me her shock at seeing a non-human machine run so fast. To her mind that was beyond intelligent. We have all had those first-time "intelligence" shocks: the first time we had access to a calculator (I was a middle school kid), the first time we walked through an automatic door, the first time we used a toilet that flushed itself, not to mention the first time we used GPS. All those fake-intelligence behaviors look so real, and so superior to our modest selves, when we are first exposed to them. But by now all such "intelligence-like behavior" has dropped out of I; we all accept that it is non-I. By human nature, we tend to read too much into what we do not understand. We are shocked to see any "automatic" behavior or response from a non-human, whether the mechanism behind it is simple or a complex algorithm. Such shock is easily amplified, and it is hard not to be fooled by wonders if we do not understand the mechanisms and principles behind them, which happens a lot in media talk about AI. In recent years the media and industry have never tired of "man-machine competitions", in games and in shows of knowledge, to demonstrate that AI now beats humans. Sometimes in my dreams I am haunted by images of human weightlifting champions challenging a crane to see who can lift a ton of steel in a single grab.

In recent years, some celebrity CEOs in industry and legendary figures in the science community have begun to talk seriously about emotional machines and the threat from machines equipped with super-human AI. Such talk is often far-fetched, passing off functional AI success as autonomous intelligence or emotion. I would not be surprised if the topic were taken one step further, to discussing the next world problem: recreating hormones and reproductive systems in machines. Why not? If machines are believed to have grown this powerful by developing a neural network, it is only natural for them to become reproductive and even someday marry humans to produce a man-machine hybrid kind. Science fiction and reality get mingled into one mass too easily today.

Nowadays, artificial intelligence is just like a sexy model attracting all the eyeballs. When I talked to a veteran AI scholar the other day, he pointed out that AI is, in fact, a sad subject. A significant feature of AI is that it temporarily holds things whose mechanisms are not yet clear. Once the mechanisms are clear, the thing often becomes "non-artificial intelligence" and develops into a specialized discipline of its own. Planes are up in the air and submarines are under the water, and both have been in service for decades. Do people who design airplanes and submarines call themselves artificial intelligence researchers? No, they are experts in aerodynamics and fluid dynamics, and have little to do with AI. Autonomous driving today is still under the banner of AI, but it has less and less to do with AI as time moves on. Aircraft have long flown on autopilot for the most part; no one considers that artificial intelligence, right? Artificial intelligence is not a science that can hold a lot of branches on its own. The knowledge that really belongs to artificial intelligence is actually a very small circle, just as the part that really belongs to human intelligence is also a very small circle, and both are much smaller than we used to assume. What is the unchangeable part of AI, then? We might as well return to some original formulations by the forefathers of AI, one being a "general problem solver" (Simon 1959).

(Courtesy of youdao-MT for the first-draft translation of my recent Chinese blog post, without which I would not have had the energy and time to translate and rewrite it here.)

My original Chinese blog on this topic:

【Liwei's Sketches: AI is fake I】

Other English blogs

The Anti-Eliza Effect, New Concept in AI

From IBM's Jeopardy robot, Apple's Siri, to the new Google Translate

Question answering of the past and present

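【Liwei's Sketches: AI is fake I】

A question about translation: in Taiwan "AI" is rendered as "人工智慧"; on the mainland, I don't know who translated it first, but it has always been "人工智能". Neither side adopted the more fitting rendering "人造智能" (man-made intelligence).

The soy products commonly called vegetarian chicken and vegetarian duck are generally filed under "man-made protein". The distance and distinction between "man-made protein" (or plant protein) and "animal protein" is comparable to the difference between "man-made intelligence" and "human intelligence", and there the gulf is fairly clear. The rendering "人工智能/智慧" is very likely to mislead or be misread. Of course the media and the public are mostly happy to mislead or be misled, but that is another matter.

It suddenly occurred to me that Trump calls all mainstream media other than Fox "fake news"; one might likewise consider calling AI "fake intelligence", which is at least far more reasonable than Trump. Liren thinks so too: "AI has been hyped to the point where everyone takes it for real."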

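Whether it is tabloid embellishment or really so, the impression is that those hailed as AI stars often play God. In their tone of voice and in the blueprints they draw, it is less a sense of mission than a sense of superiority; they feel extremely good about themselves.

Seriously, it is plainly artificial, yet nowadays it is machine intelligence this and machine intelligence that, even autonomous reasoning and spontaneous emotion, as if it were all real. In an age of artful deception even heads of state can hardly escape the influence, and everyone gets jumpy, seeing an enemy behind every bush. The wolf seems to be at the door: machines are about to rule the world and the end of humanity is nigh.

I have spent my whole career on natural language understanding (NLU), specializing in automatic parsing, which was once widely regarded as the key to language understanding and the crown of man-made intelligence. The other day an old friend in the group said, "In understanding an article, machines are worse than 90% of people who finished high school, right?" It suddenly struck me that analogy and reality have blended so imperceptibly that even we may conflate and accept them without thinking, and I felt it might be time to attempt a little clarification:

What does "understanding" mean? Strictly speaking, a computer has zero understanding. So-called natural language understanding is only a figure of speech, which is why the Turing test defined "intelligence" from the outset by bypassing "understanding". The difference between the schools is this. The symbolic school uses a set of symbolic reasoning steps that look like a simulation of understanding; that is, it plays by itself inside the symbol system, a game of make-believe. The empirical school does not even want that simulated make-believe. End to end: don't talk to me about understanding. What do you want to do once you have understood? Define that task for me, then label data for me according to the definition, the more the better, with no upper limit. Then I will get the task done for you, by rote imitation. What would you say it has understood? Not a thing. Talk of understanding and intelligence is all metaphor, and it does not change the fact that artificial intelligence is fake by nature. This is not to deny the functional success of imitation.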

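As a child I was long amazed that the radio could "talk", and could not fathom how this wireless thing could be so "intelligent". I remember reading a book that described people of an ethnic minority in Liangshan being "shown a movie" for the first time, the audience terrified beyond words. My mother studied at an old-style private school in the countryside throughout her childhood; later, when she went to middle school in town, she saw a car running on the road for the first time, and she described to me the shock she felt. Twenty-five years ago a professor in the linguistics department at my alma mater called out "Open Computer" to an Apple computer for the first time, and I can still see how pleased he was at starting the computer by voice. Then there was our first time using a calculator, our first time walking through an automatic door, the toilet that flushed itself, not to mention the first time using GPS. All this fake intelligence felt so real. Yet all of it has now left the I and become non-I. Evidently "intelligence" is not only fuzzy at the edges but also startling. The first time people see any "automatic" behavior or response from something non-human, however simple the principle or complex the algorithm, they are shocked. That shock is easily amplified, and without understanding the mechanisms and principles behind it, it is hard not to be taken in.

Another thing from childhood stung: my hands were so clumsy that I never got far with the abacus or with fountain-pen handwriting. Luckily, once I started using a computer I hardly ever "wrote" by hand again, so nobody knows whether my handwriting is pretty or ugly; the weakness stays hidden. Back then, how I envied the kids who were quick at mental arithmetic or fast on the abacus. After calculators came out, I never heard of anyone organizing a man-versus-machine arithmetic contest. Yet later on chess, quiz-show question answering, and machine translation kept going into the annals as milestones of intelligence. Think back: aren't they on the same line extended from the calculator, relying more than half on memory and computing? For humans to compete with machines is silly. In recent years, for some reason, an eerie picture often comes to mind: human weightlifting champions teaming up to challenge a crane, to see who can make "those tons of steel rise with one light grab" (lifted from a song in the revolutionary model opera 《海港》). The "man-versus-machine battles" that the media and industry never tire of are, publicity stunt aside, mostly ill-defined: how can apples fight with pears?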

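Artificial intelligence is plainly fake intelligence, full of "artful deception" (more rampant today than ever; the veteran Pierce's famous historical critique of AI has in fact never lost its force, see 【Church: A Pendulum Swung Too Far (3): On Pierce】), yet now it is dressed up as if it were real. In philosophy, in ethics, in the media, and in international politics, everyone is now crying wolf.

Fortunately the term AI, traced back to that historic Dartmouth conference, was named with some modesty, stressing the artificial, the man-made, the imitation. But over time people dropped the modifier "artificial" and conflated the intelligence and understanding of computers with those of humans. Add the clamor from tech writers and science fiction, and analogy and reality came to seem the same. Even some celebrities have begun to talk in all seriousness about machines having emotions, from autonomous intelligence to spontaneous emotion, stopping just short of saying that machines will interbreed with humans. The next great world problem ought to be recreating hormones and reproductive systems in machines. (In theory it is not absolutely impossible. Inorganic matter turning into organic matter must have happened once in history. Why not let it happen again, spurred on by human high technology?)

I vaguely remembered discussing this topic before; a search shows that I did, and in some depth: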

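"人工智能 should really be translated as 人造智能 (man-made intelligence). Man-made translation (or human-imitating translation) is very different from human translation. But the old maxim that aiming at the best yields only the middling no longer quite holds, because it ignores quantity. When what you learn from is large enough, what you get is more than middling. AI replacing the mediocre is inevitable. Learn from the many and you get above average; that is a fact. But the best machine translation is not as good as the best human translation; that too is a fact. For the latter has intelligence and understanding, while the former, though billed as neural, does not even have "man-made understanding" (NLU, for example).

Nowadays artificial intelligence is like a sexy woman that anything remotely related tries to cling to. Today I talked with a veteran AI scholar, who said that AI is by nature a tragic discipline: it is a relay station, something like a postdoctoral station. How so? The nature of AI is to temporarily house things whose mechanisms are not yet understood. Once the mechanism is clear, the thing gets "de-AI-ed" (those who refuse to leave and wave the banner for publicity are another matter) and goes off to become a specialized discipline. Planes took to the sky and submarines went under water; once upon a time, how AI-like that looked. Do people who build planes and submarines still call themselves AI people? They belong to aerodynamics and fluid dynamics and have not a penny's worth to do with AI. Likewise, autonomous driving still flies the AI banner but really has little to do with AI any more. Planes have long flown on autopilot and nobody calls that AI; does it suddenly become intelligent when it is a car? That does not hold up. In short, AI is not a science that can hold many things under its banner; it sends off batch after batch of misfits, which is a good thing, and that is scientific progress. The learning that truly belongs to AI is a very small circle, just as the part that truly belongs to human intelligence is a very small circle, and both are much smaller than our intuition takes them to be. I asked, what then is the true, constant AI? My old friend laughed: go back to the forefathers' original definitions, a chief item of which is the "general problem solver" (Simon 1959)."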

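from 【Ni Salon Notes: Fluency over Fidelity, the Achievements and Shortcomings of Neural Machine Translation】

This topic is important enough to be worth discussing from different angles, again and again. Too many people have been misled in the fever; blow a little cold air, and every person who hears it counts.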

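【Related】

English: "AI is fake I"

【Ni Salon Notes: Fluency over Fidelity, the Achievements and Shortcomings of Neural Machine Translation】

Artificial Intelligence, a Science That Will Never Reach a Conclusion (Ma Shaoping)

【Church: A Pendulum Swung Too Far (3): On Pierce】

The Anti-Eliza Effect, a New Concept in Artificial Intelligence

【Semantic Computing: The Li-Bai Dialogues Series】

【Pinned: Overview of Liwei's NLP Blog Posts】

《朝华午拾》 Table of Contents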

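《Two Songs a Week: Captivating Female Voices from China and Abroad, Allison and Yu Wenhua》

First, with the weekend just past, a happy Mother's Day to everyone!

For Mother's Day, please enjoy 【Yu Wenhua: Thinking of My Dear Old Mother】. The backdrop is from my last visit home, right in the rapeseed-blossom season south of the Yangtze. The village next to the one where I was sent down, once a godforsaken hollow deep in the mountains, is now a model 【Beautiful Countryside】 village.

On Yu Wenhua, I have recommended her before: "Yu Wenhua has an embroidery-fine voice, sweet and delicate. How could Yu Junjian be a match for her? (Come to think of it, the gravelly-voiced Yin Xiangjie paired with Yu Wenhua's fine voice does have a flavor of its own.) Yu Junjian has never sung anything memorable; given his level, he honestly did quite well this time, and it cannot have been easy for him. But Yu Wenhua's singing is at its peak, and next to hers Yu Junjian's is too flat. We usually listen only to the first section, sung by Yu Wenhua, and as soon as Yu Junjian comes in we go back or skip ahead."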

Allison is my all-time favorite, with her unique voice. The footage I shot is from a Costco TV demo, plus footage from the Apple Store at the new headquarters.

【Related】

Music Appreciation Notes: Brad Paisley & Allison Krauss: Whiskey Lullaby

Liwei's Notes: Days on the Net, Gone Like Smoke