Q&A on NLP: Chapter I Natural Language and Linguistic Form

Guo: Professor Li, to ease into the discussion, let us begin with some foundational concepts. What exactly do we mean by natural language? What falls under the scope of the field, and where does it sit within the broader discipline of Artificial Intelligence (AI)?

Li:  Natural language refers to the everyday languages we humans speak—English, Russian, Japanese, Chinese, and so on;  in other words,  human language writ large.  It is distinct from computer languages.  Because human conversation is rife with ellipsis and ambiguity,  processing natural language on a computer poses formidable challenges.

Within AI, natural language is defined both as a problem domain and as the object we wish to manipulate.  Natural Language Processing (NLP) is an essential branch of AI, and parsing is its core technology—the crucial gateway to Natural Language Understanding (NLU). Parsing will therefore recur throughout this book.

Computational linguistics is the interdisciplinary field at the intersection of computer science and linguistics.  One might say that computational linguistics supplies the scientific foundations, whereas NLP represents the applied layer.

AI is often divided into perceptual intelligence and cognitive intelligence.  The former includes image recognition and speech processing.  Breakthroughs in big data and deep learning have allowed perceptual intelligence to reach—and in some cases surpass—human‑expert performance.  Cognitive intelligence, whose core is natural language understanding, is widely regarded as the crown jewel of AI.  Bridging the gap from perception to cognition is the greatest challenge—and opportunity—facing the field today.

The rationalist tradition formalises expert knowledge using symbolic logic to simulate human intellectual tasks.  In NLP, the classical counterpart to machine‑learning models comprises linguist‑crafted grammar rules, collectively called a computational grammar.  A system built atop such grammars is known as a rule‑based system. The grammar school decomposes linguistic phenomena with surgical precision, aiming at a deep structural analysis.  Rule‑based parsing is transparent and interpretable—much like the diagramming exercises once taught in a language school.

Figure 1-1 sketches the architecture of a natural-language parser core engine.  Without dwelling on minutiae, note that every major module, from shallow parsing through deep parsing, can in principle be realised via interpretable symbolic logic encoded as a computational grammar.  Through successive passes, the bewildering diversity of natural language is reduced first to syntactic relations and then to logical-semantic structure.  Since Chomsky introduced the distinction between surface structure and deep structure in the late 1950s, this layered view has gradually become the consensus within linguistics.

Guo: These days everyone venerates neural networks and deep learning. Does the grammar school still have room to live? Rationalism seems almost voiceless in current NLP scholarship. How should we interpret this history and the present trend?

Li:  Roughly thirty years ago, the empiricist school of machine learning began its ascent, fuelled by abundant data and ever‑cheaper computation.  In recent years, deep neural networks have achieved spectacular success across many AI tasks.  Their triumph reflects not only algorithmic innovation but also today’s unprecedented volumes of data and compute.

By contrast, the rationalist programme of symbolic logic has waned.  After a brief renaissance twenty years ago—centred on unification‑based phrase‑structure grammars (PSGs)—computational grammar gradually retreated from the mainstream.  Many factors contributed; among them, Noam Chomsky’s prolonged negative impact warrants sober reflection.

History reveals a pendulum swing between empiricism and rationalism. Kenneth Church famously illustrated the motion in his article A Pendulum Swung Too Far (Figure 1-2).

For three decades, the pendulum has tilted toward empiricism (black dots in Figure 1‑2); deep learning still commands the spotlight. Rationalism, though innovating quietly, is not yet strong enough to compete head‑to‑head.  When one paradigm dominates, the other naturally fades from view.

Guo:  I sense some conceptual confusion both inside and outside the field.  Deep learning, originally just one empiricist technique, has become synonymous with AI and NLP for many observers.  If its revolution sweeps every corner of AI, will we still see a rationalist comeback at all? As Professor Church warns, the pendulum may already have swung too far.

Li:  These are two distinct philosophies with complementary strengths and weaknesses; neither can obliterate the other.

While the current empiricist monoculture has understandable causes, it is unhealthy in the long run.  The two schools both compete and synergise.  Veterans like Church continue to caution against over‑reliance on empiricism, and new scholars are probing deep integrations of the two methodologies to crack the hardest problems in NLU.

Make no mistake: today's AI boom largely rests on deep-learning breakthroughs, especially in image recognition, speech, and machine translation.  Yet deep learning inherits a fundamental limitation of the statistical school: its dependence on large volumes of labelled data.  In many niche domains and tasks, for instance parsing minority languages or translating e-commerce data, such corpora simply do not exist.  This knowledge bottleneck severely constrains empiricist approaches to cognitive NLP tasks.  Without enough labelled data, machine learning is a bread-maker without flour, and deep learning's appetite is larger still than that of traditional machine learning.

Guo: So deep learning is no panacea, and rationalism deserves a seat at the table.  Since each paradigm has its merits and deficits, could you summarise the comparison?

Li: A concise inventory helps us borrow strengths and shore up weaknesses.

Advantages of machine learning

    1. Requires no domain experts (but does require vast labelled data).
    2. Excels at coarse‑grained tasks such as classification.
    3. High recall.
    4. Robust and fast to develop.

Advantages of the grammar school

    1. Requires no labelled data (but does require expert rule writing).
    2. Excels at fine‑grained tasks such as parsing and reasoning.
    3. High precision.
    4. Easy to localise errors; inherently interpretable.

Rule-based systems shine at granular, word-by-word and sentence-by-sentence dissection, whereas learned statistical models are naturally strong at global conclusions. Put bluntly, machine learning often "sees the forest but misses the trees," while computational grammars "see each tree yet risk losing the forest." Data-driven models boast robustness and high recall, but they may hit a precision ceiling on fine-grained tasks. (Robustness is what lets a system survive anomalies and edge cases.) Expert-coded grammars, by contrast, attain high precision, but raising recall takes many rounds of iterative rule writing, and whether a rule-based system is robust depends largely on its architectural design. Its symbolic substrate renders each inference step transparent and traceable, enabling targeted debugging. These are precisely the two pain points of machine learning, whose opaque decisions erode user trust and hamper defect localisation. Finally, a learning system scales readily to vast datasets, and its breakthroughs tend to ripple across an entire industry. Rule-based quality, by contrast, hinges on the individual craftsmanship of experts, much as in Chinese cuisine, where identical ingredients may yield dishes of very different calibre depending on the chef.
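
The precision and recall trade-off can be made concrete with the standard definitions. The sketch below is illustrative only: the counts are hypothetical numbers chosen to mimic a high-precision rule system and a high-recall learned model, not measurements from any real system.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Share of the system's outputs that are correct."""
    returned = true_pos + false_pos
    return true_pos / returned if returned else 0.0


def recall(true_pos: int, false_neg: int) -> float:
    """Share of the correct answers that the system actually found."""
    relevant = true_pos + false_neg
    return true_pos / relevant if relevant else 0.0


# Hypothetical counts, for illustration only.
rule_system = {"tp": 45, "fp": 5, "fn": 55}
learned_model = {"tp": 80, "fp": 40, "fn": 20}

for name, c in (("rules", rule_system), ("learned", learned_model)):
    p = precision(c["tp"], c["fp"])
    r = recall(c["tp"], c["fn"])
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

With these made-up counts the rule system is rarely wrong when it answers but misses more than half of the cases, while the learned model finds most cases at the cost of more false alarms.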

Both routes confront knowledge bottlenecks. One relies on mass unskilled labour (annotators), the other on a few skilled artisans (grammar experts). For machine learning, the bottleneck is the supply of domain‑specific labelled data. The rationalist route simulates human cognition and thus avoids surface‑level mimicry of datasets, but cannot escape the low efficiency of manual coding. Annotation is tedious yet teachable to junior workers; crafting and debugging rules is a costly skill to train and hard to scale. Talent gaps exacerbate the issue—three decades of empiricist dominance have left the grammar school with a thinning pipeline.

Guo: Professor Li, a basic question: grammar rules are grounded in linguistic form. If semantics is derived from that form, then what exactly is linguistic form?

Li: This strikes at the heart of formalising natural language. All grammar rules rest on linguistic form, yet not every practitioner—even within the grammar camp—has a crisp definition at hand.

In essence, natural language as a symbolic system expresses meaning through form. Different utterances of an idea vary only in form; their underlying semantics and logic must coincide, else communication—and translation—would be impossible. The intuition is commonplace, but pinning down "form" propels us into computational linguistics.

Token & Order — The First‑Level Abstraction
At first glance a sentence is merely a string of symbols, whether a stream of speech sounds or a string of written characters. True, but that answer is too coarse. The string has units: it is segmented into tokens, which are words or morphemes. A morpheme is the smallest unit that pairs sound with meaning. Thus our first abstraction decomposes linguistic form into a sequence of tokens plus their order (word order). Every rule in a computational grammar defines a pattern to be matched against such sequences. The simplest kind, a linear pattern, consists of two elements: token constraints and ordering constraints.
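
To make the idea of a linear pattern concrete, here is a minimal sketch. The function names and the sample pattern are illustrative assumptions, not the formalism of any particular parser: a pattern is an ordered list of token constraints, and it matches when consecutive tokens satisfy those constraints in order.

```python
from typing import Callable, List, Optional

# A token constraint is any predicate over a single token string.
Constraint = Callable[[str], bool]


def literal(word: str) -> Constraint:
    """Constraint satisfied only by one exact token."""
    return lambda tok: tok == word


def any_token() -> Constraint:
    """Constraint satisfied by every token (a wildcard)."""
    return lambda tok: True


def match_linear(pattern: List[Constraint], tokens: List[str]) -> Optional[int]:
    """Return the start index of the first contiguous span of tokens that
    satisfies the constraints in order, or None if no span matches."""
    n = len(pattern)
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(check(tok) for check, tok in zip(pattern, window)):
            return start
    return None


# Hypothetical pattern: literal "in", then any token, then literal "opinion".
in_x_opinion = [literal("in"), any_token(), literal("opinion")]
```

The token constraints live in the individual predicates, and the ordering constraint is simply the position of each predicate in the list.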

Guo: Word order seems straightforward, but tokens and morphemes hide much complexity.

Li: Indeed. Because tokens anchor the entire enterprise, machine‑readable dictionaries become foundational resources. (Here "dictionary" means an electronic lexicon.)

If natural language were a closed set—say only ten thousand fixed sentences—formal grammar would be trivial: store them all, and each complete string would serve as an explicit pattern. But language is open, generating unbounded sentences. How can a finite rule set parse an infinite language?

The first step is tokenisation—dictionary lookup that maps character strings to lexicon words or morphemes. Unlimited sentences decompose into a finite vocabulary plus occasional out‑of‑dictionary items. Together they form a token list, the initial data structure for parsing.
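
As a rough illustration of dictionary-based tokenisation, the sketch below uses greedy forward maximum matching. That strategy is an assumption made here for simplicity (the text only says "dictionary lookup"), and the tiny lexicon is hypothetical. Characters with no lexicon entry come out as one-character out-of-dictionary tokens.

```python
from typing import List, Set, Tuple


def tokenize(text: str, lexicon: Set[str]) -> List[Tuple[str, bool]]:
    """Greedy forward maximum matching.

    Returns (token, in_lexicon) pairs. A character that starts no lexicon
    word becomes a one-character out-of-dictionary token.
    """
    max_len = max((len(w) for w in lexicon), default=1)
    tokens: List[Tuple[str, bool]] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a lexicon word fits.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in lexicon:
                tokens.append((piece, True))
                i += size
                break
        else:
            # No lexicon word starts here: emit an unknown single character.
            tokens.append((text[i], False))
            i += 1
    return tokens


# Hypothetical toy lexicon.
toy_lexicon = {"三", "个", "三个", "兄弟", "没", "水", "喝"}
token_list = tokenize("三个兄弟没水喝", toy_lexicon)
```

The resulting token list is the kind of initial data structure a parser would then work on. A real tokeniser would also need to resolve segmentation ambiguity, which greedy matching ignores.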

We then enter classic linguistic sub‑fields. Morphology analyses the internal structure of multi‑morphemic words. Some languages exhibit rich morphology—noun declension, verb conjugation—e.g., Russian and Latin; others, such as English and Chinese, are comparatively poor. Note, however, that Chinese lacks inflection but excels at compounding. Compounds sit at the interface of morphology and syntax; many scholars treat them as part of "little syntax" rather than morphology proper.

Guo: Typologists speak of a spectrum—from isolating languages such as Classical Chinese (no morphology) to polysynthetic languages like certain Native American tongues (heavy morphology). Most languages fall between, with Modern Chinese and English leaning toward the isolating side: minimal morphology, rich syntax. Correct?

Li: Exactly. Setting aside the ratio of morphology to syntax, the first distinction we meet is between function words and affixes on the one hand and content words on the other. Function words (prepositions, pronouns, particles, conjunctions, basic non-derived adverbs, interrogatives, interjections) and affixes (prefixes, suffixes, endings) form a small, closed set.

Content words—nouns, verbs, adjectives, etc.—form an open set forever producing neologisms; a fixed dictionary can hardly keep up.

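Because function words and affixes are frequent yet limited in number, they can be enumerated directly as literals in pattern matching. Hence we already have at least three kinds of linguistic form that can serve as rule conditions: (i) word order; (ii) function-word literals; (iii) affix literals. Content words could in principle be enumerated as literals too, but because they form an open set, a rule generalises only if it matches them by category rather than word by word.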

Features — The Implicit Form
Explicit tokens are visible in the string, but parsers also rely on implicit features—category labels. Features encode part‑of‑speech, gender, number, case, tense, etc. They enter pattern matching as hidden conditions. Summarising: automatic parsing rests on (i) order, (ii) literals, (iii) features—two explicit, one implicit. Every language weaves these three in different proportions; grammar is but their descriptive calculus.
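
A small sketch of how implicit features might enter pattern matching. Everything here is a hypothetical illustration: the lexicon entries, the feature names, and the `matches` helper are assumptions, not the data or API of an actual system. Dictionary lookup attaches features to each token, and a rule states which features it requires at each position.

```python
from typing import Dict, List, Set

# Hypothetical machine-readable lexicon: word -> set of features.
LEXICON: Dict[str, Set[str]] = {
    "I": {"pronoun", "nominative"},
    "ate": {"verb", "past"},
    "fish": {"noun"},
    "the": {"determiner"},
}


def lookup(tokens: List[str]) -> List[Set[str]]:
    """Attach lexicon features to each token; unknown words get no features."""
    return [LEXICON.get(tok, set()) for tok in tokens]


def matches(pattern: List[Set[str]], tokens: List[str]) -> bool:
    """True when the token list has the pattern's length and every token
    carries all the features required at its position (order matters)."""
    feats = lookup(tokens)
    return len(feats) == len(pattern) and all(
        required <= have for required, have in zip(pattern, feats)
    )


# Order plus features: a nominative pronoun, then a verb, then a noun.
SVO_PATTERN = [{"pronoun", "nominative"}, {"verb"}, {"noun"}]
```

One pattern of this kind covers every content word that carries the required features, which is the generalisation that enumerating literals cannot give.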

Guo: By this metric, can we say European languages are more rigorous than Chinese?

Li: From the standpoint of explicit form, yes. European tongues vary internally—German and French more rigorous than English—but all possess ample explicit markers that curb ambiguity. Chinese offers fewer markers, increasing parsing difficulty.

Inflectional morphology supplies visible agreement cues: gender, number, and case for nouns; tense, aspect, and voice for verbs. Chinese lacks these. Languages with rich morphology enjoy freer word order (Russian is one example). The three words of the Esperanto sentence "Mi amas vin" (I love you) can be arranged in all six possible orders, because the accusative ending -n marks the object wherever it stands.
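
The effect of an overt case marker can be sketched in a few lines. This is a toy under stated assumptions, not a real Esperanto analyser: it assumes a three-word clause in which the verb ends in -as, the object ends in -n, and the remaining word is the subject.

```python
from itertools import permutations
from typing import Dict, List


def assign_roles(words: List[str]) -> Dict[str, str]:
    """Toy role assignment for a three-word Esperanto clause.

    Assumptions: the verb is the word ending in -as, the object is the
    word ending in -n, and the remaining word is the subject.
    """
    roles: Dict[str, str] = {}
    for w in words:
        if w.endswith("as"):
            roles["verb"] = w
        elif w.endswith("n"):
            roles["object"] = w
        else:
            roles["subject"] = w
    return roles


sentence = ["mi", "amas", "vin"]
# Analyse all six orderings of the three words.
analyses = [assign_roles(list(order)) for order in permutations(sentence)]
```

Every ordering yields the same role assignment, because the rule looks only at the endings and never at position.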

Chinese, conversely, evolved along the isolating path, leveraging word order and particles. Even so, morphology provides tighter agreement than particles. Hence morphology‑rich languages are structurally stringent, reducing reliance on implicit semantics.

Guo: People call Chinese a "paratactic" language—lacking hard grammar, leaning on meaning. Does that equate to your notion of implicit form?

Li: Precisely. Parataxis corresponds to semantic cohesion—especially collocational knowledge within predicate structures. For example, the predicate "eat" expects an object in the food category. Such commonsense often lives in a lexical ontology like HowNet (founded by the late Professor Dong Zhendong).
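
A collocational constraint of this kind can be sketched as a lookup in a small is-a hierarchy. The hierarchy and the preference table below are hypothetical toys; they do not reproduce HowNet's actual data or format.

```python
from typing import Dict, Optional

# Toy is-a hierarchy (hypothetical; not HowNet's actual data or format).
IS_A: Dict[str, str] = {
    "fish": "food",
    "rice": "food",
    "food": "thing",
    "stone": "mineral",
    "mineral": "thing",
}

# Hypothetical preference: the category a predicate expects of its object.
OBJECT_PREFERENCE: Dict[str, str] = {"eat": "food"}


def is_a(word: str, category: str) -> bool:
    """Walk up the hierarchy from word and report whether category is reached."""
    node: Optional[str] = word
    while node is not None:
        if node == category:
            return True
        node = IS_A.get(node)
    return False


def plausible_object(predicate: str, obj: str) -> bool:
    """True when obj belongs to the category the predicate expects.
    Predicates with no recorded preference accept any object."""
    expected = OBJECT_PREFERENCE.get(predicate)
    return expected is None or is_a(obj, expected)
```

A parser could consult such a check as an implicit condition when no overt marker decides which noun is the object.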

Consider how word class and plurality are expressed. In Chinese, 兄弟 (brother) is a noun only because the lexicon records that category, whereas Esperanto marks nouns overtly with the ending -o (frato) and plurals with -j (fratoj). Chinese may add the particle 们 (-men) for plural, but this marker is optional, and it is forbidden after a numeral phrase: "三个兄弟" (three brothers), not "*三个兄弟们". Here plurality is implicit, inferred from the numeral phrase.
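
The contrast between an explicit and an implicit plural cue can be sketched as follows. The numeral and classifier sets are hypothetical mini-lexicons, and the function only reports whether a cue is present; a bare noun with no cue may still be plural in context.

```python
from typing import List

# Hypothetical mini-lexicons for illustration.
PLURAL_NUMERALS = {"两", "三", "四", "五"}  # numerals greater than one
CLASSIFIERS = {"个"}


def has_plural_cue(tokens: List[str], noun_index: int) -> bool:
    """Report whether the noun at noun_index carries a plural cue.

    Explicit cue: the particle 们 directly after the noun.
    Implicit cue: a numeral greater than one plus a classifier directly
    before the noun.
    """
    after = tokens[noun_index + 1] if noun_index + 1 < len(tokens) else None
    if after == "们":
        return True
    if noun_index >= 2:
        numeral, classifier = tokens[noun_index - 2], tokens[noun_index - 1]
        return numeral in PLURAL_NUMERALS and classifier in CLASSIFIERS
    return False
```

In the first case the rule matches a literal; in the second it has to look at the surrounding numeral phrase, which is the extra work that implicit form imposes on a parser.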

Guo: Lacking morphology indeed complicates Chinese. Some even claim Chinese has no grammar.

Li: That is hyperbole. All languages have grammar; Chinese simply relies more on implicit forms. Overt devices—morphology, particles, word order—are fewer or more flexible.

Take the omission of function words as an illustration. Chinese frequently drops prepositions and conjunctions. Compare:

      1. 对于这件事, 依我的看法, 我们应该听其自然。
        As for this matter, in my opinion, we should let nature take its course.
      2. 这件事我的看法应该听其自然。
        * this matter my opinion should let nature take its course.
        (Unacceptable as a word‑for‑word English rendering.)

Example 2 is ubiquitous in spoken Chinese and sounds perfectly natural, yet its word-for-word rendering is ungrammatical in English. Systematic omission of function words makes Chinese harder to process automatically.

Guo: What about word order? Isolation theory says morphology‑poor languages have fixed order—Chinese is labelled SVO.

Li: Alas, reality defies the stereotype. Despite lacking morphology and often omitting function words, Chinese exhibits remarkable word-order flexibility. Consider the six theoretical permutations of S, V, and O. Esperanto, with a single object case marker -n, allows all six without altering semantics. Compare English (no case marking on nouns, though pronouns distinguish subject from object forms) and Chinese (no case marking at all):

Order | Esperanto        | English       | Chinese
SVO   | Mi manĝis fiŝon  | I ate fish    | 我吃了鱼
SOV   | Mi fiŝon manĝis  | * I fish ate  | 我鱼吃了
VOS   | Manĝis fiŝon mi  | * Ate fish I  | ? 吃了鱼我
VSO   | Manĝis mi fiŝon  | * Ate I fish  | * 吃了我鱼
OVS   | Fiŝon manĝis mi  | * Fish ate I  | ? 鱼吃了我
OSV   | Fiŝon mi manĝis  | Fish I ate    | 鱼我吃了

Chinese sanctions three orders outright, two marginally (marked “?”), and forbids one (“*”). English allows only two. By this count Chinese word order is about twice as free as English, even though English has case distinctions on pronouns (I/me) and Chinese has none. Morphological richness, then, does not map neatly onto freedom of word order.

Real corpora confirm that Chinese is more permissive than many assume. Greater flexibility inflates the rule count in sequence‑pattern grammars: every additional order multiplies pattern variants. Non‑sequential constraints can be encoded inside a single rule; order itself cannot.
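
The cost of flexible order for a sequence-pattern grammar can be illustrated by counting. The sketch below assumes one sequence pattern per permitted ordering and takes the fully acceptable orders from the comparison above; the helper name is hypothetical.

```python
from itertools import permutations
from typing import List, Set, Tuple

Order = Tuple[str, ...]


def sequence_patterns(elements: List[str], allowed: Set[Order]) -> List[Order]:
    """One sequence pattern is needed per permitted ordering of the elements."""
    return [p for p in permutations(elements) if p in allowed]


ROLES = ["S", "V", "O"]
# Orders marked fully acceptable in the three-language comparison.
ENGLISH: Set[Order] = {("S", "V", "O"), ("O", "S", "V")}
CHINESE: Set[Order] = {("S", "V", "O"), ("S", "O", "V"), ("O", "S", "V")}
ESPERANTO: Set[Order] = set(permutations(ROLES))

for name, allowed in (("English", ENGLISH), ("Chinese", CHINESE), ("Esperanto", ESPERANTO)):
    print(name, len(sequence_patterns(ROLES, allowed)))
```

For n freely ordered elements the upper bound is n! patterns, so the count grows quickly as more elements are allowed to move. A language with overt case marking can sidestep this by matching on the marker rather than on position.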

A classic example is the elastic placement of argument roles around "哭肿" (cry‑swollen):

张三眼睛哭肿了。
眼睛张三哭肿了。
哭肿张三眼睛了。
张三哭肿眼睛了。
哭得张三眼睛肿了。
张三哭得眼睛肿了。
…and so on.

Such data belie the notion of a rigid SVO Chinese. Heavy reliance on implicit form complicates automatic parsing. Were word order fixed, a few sequence patterns would suffice; each additional permitted order instead multiplies the number of rules.

壹 自然语言与语言形式

郭: 李老师, 由浅入深, 我们还是从一些基本概念开始谈 起吧。什么是自然语言? 自然语言领域包括哪些内容? 它在人工智能里面的定位是怎样的呢?

李: 自然语言 (natural language) 指的是我们日常使用的语言, 英语、俄语、日语、汉语等, 它与人类语言是同义词。自 然语言有别于计算机语言。人脑处理的自然语言常有省略和歧义, 这给电脑 (计算机) 的处理提出了挑战。

在人工智能界, 自然语言是作为问题领域和处理对象提出来的。自然语言处理是人工智能的重要分支, 自然语言解析是其核心技术和通向自然语言理解的关键。语言解析是我 们接下来要探讨的、贯穿全书始终的话题。

计算语言学是计算机科学与语言学的交叉学科. 计算语言学和自然语言处理是同一个专业领域的两个剖面. 可以 说, 计算语言学是自然语言处理的科学基础, 自然语言处理是计算语言学的应用层面。

人工智能主要有感知智能 (perceptual intelligence) 和认 知智能 (cognitive intelligence) 两大块. 前者包括图像识别  (image recognition) 和语音处理 (speech processing)。随着 大数据和深度学习 (deep learning) 算法的突破性进展, 感知智能很多方面已经达到甚至超过人类专家的水平。认知智能的核心是自然语言理解, 被一致认为是人工智能的皇冠。从感知跃升到认知是当前人工智能所面临的最大挑战和机遇。

理性主义直接把领域专家的经验形式化, 利用符号逻辑来模拟人的智能任务。在自然语言处理领域, 与机器学习模型平行的传统方法是语言学家手工编码的语言规则。这些规则的 集合称为计算文法。由计算文法支撑的系统叫作规则系统 (rule system)。文法学派把语言学家总结出来的语言规则形式化, 从而对语言现象条分缕析, 达到对自然语言深层次的结构 解析. 规则系统试图模拟人的语言分析理解过程。规则系统解析自然语言是透明的、可解释 (interpretable) 的。这个过程很 像是外语文法老师在课堂上教给学生的句子分析方法。

图1—1是一张自然语言解析器 (parser) 核心引擎 (core engine) 的架构图。不必深究细节, 值得说明的是, 从浅层解析 (shallow parsing) 到深层解析 (deep parsing) 里面的各主要模块, 都可以用可解释的符号逻辑 (symbolic logic) 以计算文法的形式实现。千变万化的自然语言表达, 就这样一步一 步地从句法关系 (syntactic relation) 的解析, 进而求解其深层 的逻辑语义 (logic semantics) 关系。这个道理早在1957年乔 姆斯基 (Chomsky) 语言学革命中提出表层结构 (surface structure) 到深层结构 (deep structure) 的转换之后, 就逐渐成为语言学界的共识了。

郭: 现在大家都在推崇神经网络 (neural network) 深度学习, 文法学派还有生存空间吗?  理性主义在自然语言领域已经听不到什么声音了。怎样看待这段历史与趋向呢?

李: 大约从30年前开始到现在, 经验主义机器学习这一 派, 随着数据和计算资源的发展, 天时地利, 一直在向上走。尤其是近年来深层神经网络的实践, 深度学习在不少人工智能任务上取得了突破性的成功。经验主义的这些成功, 除了 神经网络算法的创新, 也得益于今非昔比的大数据和大计算的能力。

与此对照, 理性主义符号逻辑则日趋式微。符号逻辑在自然语言领域表现为计算文法。文法学派在经历了20年前 基于合一 (unification) 的短语结构文法 (Phrase Structure  Grammar, PSG) 创新的短暂热潮以后, 逐渐退出了学界的主 流舞台。形成这一局面的原因有多个, 其中包括乔姆斯基对于文法学派长期的负面影响, 值得认真反思。

回顾人工智能和自然语言领域的历史, 经验主义和理性 主义两大学派此消彼长, 呈钟摆式跌宕起伏。肯尼斯丘吉 (Kenneth Church) 在他的「钟摆摆得太远」(A Pendulum  Swung Too Far) 一文中, 给出了一个形象的钟摆式跌宕图 (图1—2).

最近30年来, 经验主义钟摆的上扬趋势依然不减 (见图 1—2的黑点表示)。目前来看, 深度学习仍在风头上。理性主义积蓄多年, 虽然有其自身的传承和创新, 但还没有到可以与经验主义正面争锋的程度。当一派成为主流时, 另一派自然淡出视野。

郭: 我感觉业内业外有些认知上的混乱。深度学习本来只是经验主义学派的一种方法, 现在似乎在很多人心目中等价于人工智能和自然语言处理了。如果深度学习的革命席卷 人工智能的方方面面, 会不会真地要终结理性主义的回摆呢? 正如丘吉教授所言, 经验主义的钟摆已经摆得太远了。

李: 我的答案是否定的。这是两个不同的哲学和方法论, 各自带有其自身的天然优势和劣势, 不存在一派彻底消灭另 一派的问题。

当前学界经验主义一面倒的局面虽然事出有因, 但并不 是一个健康的状态。其实, 两派既有竞争性, 也有很强的互补 性。丘吉这样的老一辈有识之士一直在警示经验主义一边倒的弊端, 也不断有新锐学者在探索两种方法论的深度融合, 以 便合力解决理解自然语言的难题。

毫无疑问, 这一波人工智能的热潮很大程度上是建立在深度学习的突破上, 尤其是在图像识别、语音处理和机器翻译方面取得的成就上。但是, 深度学习的方法仍然保留了统计学派的一个根本局限, 就是对海量标注数据 (labeled data) 的依赖。在很多细分领域和任务场景, 譬如, 少数族裔语言的解 析、电商数据的机器翻译, 海量标注或领域翻译数据并不存 在。这个知识瓶颈严重限制了经验主义方法在自然语言认知任务方面的表现。没有足够的标注数据, 对于机器学习就是无米之炊。深度学习更是如此, 它的胃口比传统机器学习 更大。

郭 : 看来深度学习也不是万能的, 理性主义理应有自己的一席之地。说它们各有长处和短板, 您能够给个比较吗?

李: 归纳一下两派各自的优势与短板是很有必要的, 可以取长补短。

机器学习的优势包括:

(1) 不依赖领域专家 (但需要大量标注数据);
(2) 长于粗线条的任务, 如分类 (classification);
(3) 召回 (recall) 好;
(4) 鲁棒 (robust), 开发效率高。

与此对照, 文法学派的优势包括:

(1) 不依赖标注数据 (但需要专家编码);
(2) 长于细线条的任务, 譬如解析和推理;
(3) 精度(precision)好;
(4) 易于定点排错, 可解释。

专家编码的规则系统擅长逐字逐句的条分缕析, 而学习出来的统计模型则天然长于全局结论。如果说机器学习往往是见林不见木的话, 计算文法则是见木不见林。大数据驱动的机器学习虽然带来了鲁棒和召回的长处, 但对细线条的任务较易遭遇精度的天花板。所谓鲁棒, 是robust的音译, 也 就是强壮、稳健的意思, 它是在异常和危险情况下系统生存的关键。专家编写规则虽然容易保障精度, 但召回的提升则是一个漫长的迭代过程。鲁棒性则决定于规则系统的架构设计。规则系统的基础是可解释的符号逻辑, 容易追踪到出错的现 场, 并做出有针对性的排错。而这两点正是机器学习的短板。机器学习的结果不论是对是错, 都难以解释, 因而影响用户的体验和信赖。难以定点排错更是开发现场的极大困扰, 其原因是学习模型缺乏显性符号与结构表示 (structure representation)。最后, 学习系统能较快地规模化到大数据的应用场景, 成功易于复制, 方法的突破往往可带动整个行业的提升。相对而言, 规则系统的质量很大程度上取决于专家的个体经 验。这就好比中餐, 同样的食材, 不同的厨师做出来的菜肴品质常常相差很大。

两条路线各有自身的知识瓶颈。打个比喻, 一个是依赖海量的低级劳动, 另一个是依赖少数专家的高级劳动。对于 机器学习, 海量标注是领域化落地 (grounding,即落实到应 用) 的知识瓶颈。理性主义路线模拟人的认知过程, 无需依赖海量数据在表层模仿。但难以避免手工编码的低效率。标注 工作虽然单调, 可一般学生稍加培训即可上手。而手工编制、 调试规则, 培训成本高, 难以规模化。还有, 人才的断层也算是文法学派的一个现实的局限。30年正好是一代人。在过 去的30年, 经验主义在主流舞台的一枝独秀, 客观上造成了 理性主义阵营人才青黄不接。

郭: 李老师,我有个基本问题: 文法规则依据的是语言形式 (linguistic form)。那么, 通过这个形式解析出语义 (semantics), 到底什么是语言形式呢?

李: 这是自然语言形式化的根本问题。所有的文法规则都建立在语言形式的基础之上, 可并不是每个人, 包括从事文 法工作的人, 都能对语言形式有个清晰的认识。

不错, 自然语言作为符号系统, 说到底就是以语言形式来表达语义。话语的不同只是形式的不同, 背后的语义和逻辑一定是相同的, 否则人不可能交流思想, 语言的翻译也会失去根基。这个道理老少咸知, 那什么是语言形式的定义呢? 回答这个问题就进入计算语言学了。

语言形式, 顾名思义, 就是语言的表达手段。乍一看语言, 不就是符号串吗? 语音流也好, 文字串也好, 都可以归结为符号串。所以, 符号串就是语言形式。这个答案不算错, 但失之笼统。这个“串”是有单位的, 其基本单位叫 token (可译 作“文本符号”), 也就是单词或语素 (morpheme)。语素, 其定义是音义结合的最小符号单位。因此, 作为第一级抽象, 我们可以把语言形式分解为文本符号及其语序 (word order)。计算文法中的规则都要定义一个条件模式 (pattern), 就是为 了与语言符号串做匹配。最基本的条件模式叫线性模式 (linear pattern), 其构成的两个要素就是符号条件和次序条件。

郭 : 好, 语言形式的基本要素是词/语素和语序。语序就是符号的先后顺序, 容易界定; 但词和语素里面感觉有很多 学问。

李: 不错, 作为语言符号, 词和语素非常重要, 它们是语言学的起点。收录词和语素的词典因此成为语言解析的基础资源。顺便提一下, 我们在这所说的“词典”是指机器词典, 它是 以传统词典为基础的形式化资源。

如果自然语言表达是一个封闭的集合, 譬如, 一共就只有一万句话, 语言形式文法就简单了。建个库把这些语句词串全部收进去, 每个词串等价于一条“词加语序”的模式规则。全词串的集合就是一个完备的文法模型。但是, 自然语言是 一个开放集, 无法枚举无穷变化的文句。形式文法是如何依据语言形式形成规则, 并以有限规则完成对无限文句的自动解析呢?

以查词典为基础的分词 (tokenization), 是文句解析的第 一步。查词典的结果是“词典词” (lexicon word), 包括语素。无限文句主要靠查词典分解为有限的单位。词典词加上少量 超出词典范围的生词, 一起构成词节点序列 (tokenlist)。词节点序列很重要, 它是文句的形式化表示 (formalized representation)。作为初始的数据结构, 词节点序列是自动解析的 对象。

接 下来就进入语言学的基本分支了, 通常叫词法 (morphology), 目的是解析多语素词 (multi-morphemic word) 的内部结构。对于有些语种, 词法很繁复, 包括名词变格 (declension)、动词变位 (conjugation) 等, 譬如俄语、拉丁语; 有些语种的词法则较贫乏, 譬如英语、汉语。值得注意的是, 词法的繁简只是相对而言。譬如汉语缺乏形态 (inflection), 单词不变形, 但是汉语的多语素复合造词的能力却很强。不过, 语 言学里的复合词 (compound word) 历来有争议, 它处于词法与句法 (syntax) 接口的地带, 其复合方式也与句法短语的方式类似。所以, 很多人不把词的复合当成词法, 而是看成句法的前期部分, 或称小句法。

郭: 以前看语言类型方面的文章, 说有一个频谱, 一个极端叫孤立语 (isolating language), 以古汉语为代表。孤立语没有词法, 只有句法。另一个极端好像叫多式综合语 (poly-synthetic language), 以某些印第安语为代表, 基本上只有词 法, 没有句法。多数语言处在两个极端之间, 现代汉语和英语更多偏向孤立语这边, 小词法大句法. 是这样吗?

李: 对, 是这样的。撇开词法句法比例的差别, 我们在研究词和语素的时候, 第一眼看到的是它的两大类别: 一类是小 词 (function word) 和形态, 是个较小的封闭集合; 一类叫实词  (notional word), 是个开放集合。实词范畴永远存在“生词”, 词典是收不住口的。

小词, 其实只是俗称, 术语应该叫功能词、封闭类词或虚词, 指的是介词、代词、助词、连词、原生副词 (original adverb)、疑问词、感叹词之类。形态包括前缀 (prefix)、后缀  (suffix)、词尾 (ending) 等材料, 也是一个小的集合。小词和形态出现频率高, 但数量有限。作为封闭类语素, 小词和形态需要匹配的时候, 原则上可以直接枚举它们, 软件界称其为匹配直接量 (literal)。至此, 我们至少得到了下面几种语言形式可以作为规则的条件: ①语序; ②小词; ③形态。不同的语言类型对这些形式的倚重和比例不同。例如, 俄语形态丰富, 对于语序和小词的依赖较少; 英语形态贫乏, 语序就相对固 定, 小词也比较丰富。

那么实词呢? 实词当然也是语言形式, 也可以尝试在规 则模式中作为直接量来枚举。但是, 因为实词是个开放集, 最好给它们分类, 利用类别而不是直接量去匹配实词, 这样做才会有概括性。人脑对于实词也主要靠分类来总结抽象的. 给词分类并在词典中标注分类结果是形式化的基础工作。

形式系统里面, 分类结果通常以特征 (feature) 来表示和标注。特征是系统内部定义的隐性语言形式。隐性形式 (implicit form) 是相对于前面提到的显性形式 (explicit form) 而 言。很显然, 无论语序还是语素, 它们都是语言符号串中可以看得见的形式。分类特征则不然, 它们是不能直接感知的。这些特征作为词典查询的结果提供给解析器, 支持模式匹配  (pattern matching) 的形式条件。

总结一下自动解析所依据的语言形式, 主要有三种: ①语序; ②直接量 (尤其是小词和形态); ③特征。前两种是显性形式, 特征是隐性形式。语言形式这么一分, 自然语言一下子就豁然开朗了。管它什么语言, 不外乎这三种形式的交错使用, 搭配的比例和倚重不同而已。所谓文法, 也不外乎用这三种形式形成规则, 对语言现象及其背后的结构做描述而已。

三种语言形式可以嫁接。显性形式的嫁接包括重叠式 (reduplication), 如: “高高兴兴”“走一走”。它是语序与直接量嫁接的模式 (AABB、V 一V), 是中文词法句法中常用的形式手段。显性形式也可以特征化。特征化可以通过词典标注实现, 也可以通过规则模块或子程序赋值得出。例如, “形态特征” (如单数、第三人称、现在时等) 就是通过词法模块得出 的特征。形态解析所依据的条件主要是作为直接量的形态词尾 (inflectional ending) 以及词干 (stem) 的类型特征, 例如, 英语词尾“-ly”与形容词词干结合成为副词 (beautiful-ly)。可见, 形态特征也是显性形式与隐性形式的嫁接结果。

郭: 从语言形式的使用看, 可以说欧洲语言比汉语更加严 谨吗?

李: 是的。从语言形式的角度来看, 欧洲语言确实比汉语严谨。欧洲语言内部也有不小的区别, 例如, 德语、法语就比英语严谨, 尽管从语言形成的历史上看, 可以说英语是从德 语、法语杂交而来的。

这里的所谓“严谨”, 是指这些语言有比较充分的显性形式来表达结构关系, 有助于减少歧义。汉语显性形式不足, 因此增加了汉语解析 (Chinese parsing) 的难度。形态是重要的显性形式, 如名词的“性数格” (gender, number and case), 动词的“时体态”(tense, aspect and voice), 这些词法范畴是以显性的形态词尾来表达的。但是这类形态汉语里没有。形态丰富的语言语序比较自由, 譬如俄语。再如世界语 (Esperanto) 的“我爱你”有三个词, 可以用六种语序任意表达, 排列组合。为什么语序自由呢? 因为有宾格 (object case) 这样的形态形式, 它跑到哪里都逃不出动宾 (verb-object) 关系, 当然就不需要依赖固定的语序了。

汉语在发展过程中, 没有走形态化的道路, 而是利用语序和小词在孤立语的道路上演化. 英语的发展大体也是这个模式。从语言学的高度看, 形态也好, 小词也好, 二者都是可以感知的显性形式。但是, 形态词尾的范畴化, 比起小词 (主要是介词), 要发达得多。动词变位、名词变格等形态手段, 使得有结构联系的语词之间产生一种显性的一致关系  (agreement)。譬如, 主谓 (subject predicate) 在人称和数上的一致关系, 定语与中心词在性数格上的一致关系等。关系有形式标记, 形态语言的结构自然严谨得多, 减少了结构歧义的可能。丰富的形态减低了解析对于隐性形式和知识的依赖。

郭 : 常听人说,中文是“意合”式语言, 缺少硬性的文法规范, 是不是指的就是缺乏形态, 主要靠语义手段来分析理解它?

李: 是的. 从语言形式化的角度看, 语义手段表现为隐性形式。所谓“意合”, 其实就是关联句词之间的语义相谐, 特别是谓词 (predicate word) 结构里面语义之间的搭配  (collocation) 常识。譬如, 谓词“吃”的对象是“食品”。这种 常识通常编码在本体知识库 (ontology) 里面。董振东先生创立的“知网 (HowNet)”∗ 就是这样一个本体常识的知识库。

∗ “知网” (HowNet) 是中国自然语言处理前辈董振东先生发明的跨语言的语义机器词典。这套词典为词义的本体概念及其常识编码, 旨在设立一套形式化语义概念网络, 以此作为自然语言处理的基础支持。

再看形态与小词的使用。譬如, “兄弟”在汉语里是名词, 这个词性是在词典标注的。但是世界语的“frato (兄弟)”就不需要词典标注, 因为有名词词尾“-o”。再如复数, 汉语的 “兄弟们”用了小词“们”来表示复数的概念; 世界语呢, 用词尾 “-j”表示, 即“fratoj (兄弟们)”。乍一看, 这不一样么? 都是 用有限的语言材料, 做显性的表达。但是, 有“数”这个词法范 畴的欧洲语言 (包括世界语), 那个形态是不能省略的。而汉语的复数表达, 有时显性有时隐性,这个“们”不是必需的, 如:

三个兄弟没水喝。

这里的兄弟复数就没有小词“们”。实际上, 汉语文法规定了不允许在数量结构后面加复数的显性形式, 譬如不能说 “三个兄弟们”。换句话说, 中文“(三个)兄弟”里的复数是隐性的,需要前面的数量结构才能确定。

郭: 看来缺乏形态的确是中文的一个挑战。中文学起来难, 自动解析也难。有人甚至说, 中文根本就没有文法。

李: 那是偏激之词了。不存在没有文法的语言。假如语 言没有“法”, 那么人在使用时如何把握, 又如何理解呢? 只不 过是, 中文的文法更多地依赖隐性形式。

汉语文法的确比较宽松, 宽松表现在较少依赖显性形式。语句的顺畅靠的是上下文语义相谐, 而不是依靠严格的显性文法规则。譬如形态、小词、语序, 显性形式的三个手段, 对于 汉语来说, 形态基本上没有, 小词常常省略, 语序也很灵活。

先看小词,譬如, 介词、连词, 虽然英语有的汉语基本都有, 但是汉语省略小词的时候远远多于英语。这是有统计根据的, 也符合我们日常使用的感觉: 中文, 尤其是口语, 能省则省,显得非常自由。对比下列例句, 可见汉语中省略小词是普遍性的:

① 对于这件事, 依我的看法, 我们应该听其自然.
As for this matter, in my opinion, we should leave it to nature.

② 这件事我的看法应该听其自然.
∗ This matter my opinion should leave it to nature.

类似句子②在汉语口语里极为常见, 感觉很自然。如果尝试词对词译成英语, 则完全不合文法。汉语和英语都用介词短语 (prepositional phrase, PP) 做状语, 可是汉语介词常可 省略。这种缺少显性形式标记的所谓“意合”式表达, 确实使得中文的自动化处理比英文处理难了很多。

郭: 汉语利用语序的情况如何? 常听人说, 形态丰富的语言语序自由。汉语缺乏形态, 因此是语序固定的语言。中文一般被认为是“主谓宾(SVO)”固定的语言。

李: 可惜啊, 并非如此。按常理来推论, 缺乏形态又常常省掉小词, 那么, 语序总该固定吧? 可实际上, 汉语并不是持孤立语语序固定论者说的那样语序死板, 其语序的自由度常超出一般人的想象。

拿最典型的主谓宾句型的变式来看, SVO 三元素, 排列的极限是六种组合。世界语的形态不算丰富, 论变格只有一 个宾格“-n”的词尾, 主格 (subject case) 是零形式。它仍然可以采用六种变式的任意一个语序, 而不改变“SVO”的逻辑语义关系 (logic semantic relation)。比较一下形态贫乏的英语 (名词没有格变, 但是代词有) 和缺乏形态的汉语 (名词代词都没有格变), 是很有意思的。世界语、英语、汉语三种语言 SVO 句型的自由度对比如下:

①SVO:

Mi manĝis fiŝon.
I ate fish.
我吃了鱼。

②SOV:

Mi fiŝon manĝis.
∗ I fish ate.
我鱼吃了。

③VOS:

Manĝis fiŝon mi.
∗ Ate fish I.
? 吃了鱼我。(口语可以)

④VSO:

Manĝis mi fiŝon.
∗ Ate I fish.
∗ 吃了我鱼。(解读不是VSO, 而是“吃了我的鱼”)

⑤OVS:

Fiŝon manĝis mi.
∗ Fish ate I.(不允许, 尽管“I”有主格标记)
? 鱼吃了我。(合法解读是SVO,与OVS正好相反)

⑥OSV:

Fiŝon mi manĝis.
fish I ate.
鱼我吃了。

总结一下, 在六个语序中, 汉语有三个是合法的, 有两个在灰色地带 (前标“? ”, 口语中似可存在), 有一个是非法的 (前标 “∗ ”),英语呢? 只有两个合法, 其余皆非法。可见, 汉语的语序自由度在最常见的SVO句式中,  比英语要大一倍。虽然英语有代词的格变(I/me), 而汉语没有, 英语的语序灵活性反而不如汉语。可见, 形态的丰富性与语序自由度并非必然呼应。

汉语其实比很多人想象得具有更大的语序自由度和弹 性。常常是, 思维里什么概念先出现, 就可以直接蹦出来。再看一组例子:

张三眼睛哭肿了。
眼睛张三哭肿了。
哭肿张三眼睛了。
张三哭肿眼睛了。
哭得张三眼睛肿了。
张三哭得眼睛肿了。
张三眼睛哭得肿了。
张三的眼睛哭肿了。
............

若不研究实际数据的话, 我们很难相信汉语语序如此任性。汉语依赖隐性形式比显性形式更多, 这对自动解析显然不利。我们当然希望语言都是语序固定的, 这该省多少力气啊!  序列模式规则就是由符号加次序构成的, 语序灵活了, 规 则数量就得成倍增长。非语序的其他形式约束可以在既定的模式里面调控, 唯有语序是规则编码绕不过去的坎儿。

李维 郭进《自然语言处理答问》(商务印书馆 2020)

 

Prelude: Origins

Li Wei entered the Graduate School of the Chinese Academy of Social Sciences in 1983, studying under Professors Liu Yongquan and Liu Zhuo, pioneers of machine translation in China, and thus began a lifelong journey in NLP. After graduation, he continued MT research at the Institute of Linguistics (CASS), then pursued doctoral work in the United Kingdom and Canada, earning a PhD in Computational Linguistics from Simon Fraser University. Since 1997, he has served as an NLP system architect in Buffalo and Silicon Valley, investing more than two decades in large-scale industrial practice of Natural Language Understanding (NLU) on the front line of AI applications.

Guo Jin received his PhD in Computer Science from the National University of Singapore in 1994 with a focus on Chinese tokenization and statistical language modelling, work published in Computational Linguistics and related venues. Moving to the United States in 1998, he held research posts at Motorola, Amazon, and the JD Silicon Valley Research Center, exploring applications that fuse machine learning, NLP, and human–computer interaction across internet and IoT scenarios.

From the 1980s onward, the AI community has witnessed a “two‑track contest” between rationalism and empiricism in NLP. The ascendancy of machine learning has gradually eclipsed the grammar school, and computational grammar risks a generational break.

In 2018, over the course of ten extended conversations in Silicon Valley, Li and Guo revisited the symbolic legacy and debated paths forward. Those dialogues became the backbone of the present volume, calling for a rationalist renaissance to dismantle the cognitive citadels that still impede AI.

零 缘起

自20世纪80年代起, 人工智能领域见证了理性主义 (rationalism) 与经验主义(empiricism) 的“两条路线斗争”。其中, 自然语言学界的“斗争”结果是, 文法学派(grammar school) 与统计学派 (statistical school) 此消彼长, 机器学习渐 成主流, 计算文法 (computational grammar)则有断代之虞。

李维, 1983年进入中国社会科学院研究生院, 师从刘涌 泉、刘倬先生, 主攻基于文法的机器翻译 (machine translation), 始入自然语言领域。毕业后在中国社会科学院语言研究所从事机器翻译研究, 继而留学英国、加拿大, 获Simon Fraser University (SFU) 计算语言学 (Computational Linguistics) 博士。1997年起, 在美国水牛城、硅谷, 从事自然语言理解 (Natural Language Understanding, NLU) 工业实 践20余载, 为人工智能(Artificial Intelligence, AI) 应用第一 线的系统架构师。

郭进, 1994年新加坡国立大学计算机科学博士, 主攻中文分词 (Chinese tokenization) 和统计模型 (statistical model), 成果见于「计算语言学」等刊。1998年赴美, 先后在摩托罗拉、亚马逊、京东硅谷研究院等从事人工智能研究, 探索将机器学习 (machine learning)、自然语言处理  (Natural Language Processing, NLP) 等人机交互技术应用于互联网与物联网的解决方案。2018年, 李与郭在硅谷就自然语言解析 (natural language parsing) 问题有十次长谈, 回顾并展望文法学派的机制创新与传承之路, 意图呼唤理性主义回归, 解构自然语言, 协同攻坚人工智能的认知堡垒, 遂成此作。

李维 郭进《自然语言处理答问》(商务印书馆 2020)

 

Preface for "Q&A on NLP"

This modest volume, Questions & Answers on Natural Language Processing, now joins the Chinese Linguistic Knowledge Series alongside titles by Zhu Dexi, Li Rong, He Jiuying, Li Xinkui, Feng Zhiwei, and Xing Fuyi. To be included in such a lineage leaves me both honored and a little awed. In particular, Professor Zhu Dexi’s Q&A on Grammar was one of my earliest inspirations; I have revisited it countless times over the decades, always finding new heights to scale.

Symbolic Linguistic Legacy

Had the series permitted formal dedications, I would have inscribed this book to my mentors—Professors Liu Yongquan and Liu Zhuo—pioneers of machine translation in China. Their legacy impelled me to press on even when the manuscript seemed perpetually “stuck in revision hell.”

The book’s very existence also owes much to Feng Aizhen, my meticulous commissioning editor at The Commercial Press. Over three years of proofs, her insistence on perfection revealed how that venerable imprint earned its reputation for rigor.

Thanks, Colleagues & Friends

Professors Wang Jianjun, Song Rou, Zhang Guiping, Zhou Liuxi, and many industry comrades offered incisive comments. My long‑time engineering partners—Niu Cheng, Lokesh, Li Lei, Tang Tian, Ben, and Martin—translated symbolic NLP designs into scalable products.

Mirror’s Last‑Minute Miracle

Old friend Mirror scrutinized every line with the zeal of a textual scholar. “It reads a bit like Galileo’s scientific dialogues, only in NLP!” he said. Five days before typesetting, he begged to polish one more draft, and the result was transformative.

A Tale of Two Schools

Beyond theory, this book chronicles the dialectic between rationalist symbolism and empiricist machine learning—a pendulum that has swung since the 1980s. Co‑author Dr. Guo Jin saved the project more than once, re‑anchoring a drifting manuscript.

Family Footnotes

A lifetime craftsman, I never planned to “write a book,” yet my family shared every thrill. My daughter Tian Tian contributed two whimsical illustrations explaining the “dictionary black‑box” joke, adding warmth to these pages.

In Quiet Cupertino

And so, on a July night in Apple Town, with Secret Garden’s Sometimes When It Rains looping through my headphones, I penned the final punctuation. May these symbolic threads—fragile yet unbroken—echo through AI’s recurrent tides. Neural networks are no end of history; when the pendulum swings back, perhaps this book too will be rediscovered.

Cupertino, 15 July 2020 (midnight)

 

《写在NLP小书出版之时》

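This little NLP book, Q&A on Natural Language Processing (《自然语言处理答问》), is finally out, and I find myself quite moved. The authors chosen for The Commercial Press's Chinese Language Knowledge Series (《汉语知识丛书》) are all elders of Chinese linguistics whose names resound like thunder. These are small books by great masters, a gathering of the finest, and to be listed among them fills me with trepidation. In particular, Zhu Dexi's scholarly classic Q&A on Grammar (《语法答问》) was one of the books that initiated me into the field, and I have read it countless times over the decades. Every reading brings something new; it is a mountain I can only look up to.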

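Owing to the format of the series, the book has no place for a dedication or acknowledgements, which I rather regret. Looking back, from conception to the final stroke the book went through twist after twist and came close to a failed delivery; the dozens of rounds of proofreading and revision felt like an infinite loop. Now that it has finally gone to press, I think with gratitude of the teachers, colleagues, family, and friends who supported it in so many ways. Without their prodding and endorsement, their collaboration and corrections, this book would never have appeared.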

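I did give real thought to a dedication. In terms of scholarly initiation and lineage, it should without question go to my mentors, as a mark of the transmission and development of the symbolic-logic school in China. The design at the time was: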

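The first person to thank is naturally Feng Aizhen, the responsible editor at The Commercial Press. More than two years of planning, layout, and repeated correction reflect the dedication and rigor of a veteran Commercial Press expert. The quality and reputation that The Commercial Press enjoys in Chinese publishing, it turns out, rest on a corps of elite editors like her who let no character slip and keep striving for perfection. After nearly three years of countless editorial exchanges, her congratulations finally arrived: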

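Good news: congratulations on Liwei's major work, soon to appear, standing shoulder to shoulder with China's first-rate linguists

Zhu Dexi, Li Rong, He Jiuying, Li Xinkui, Feng Zhiwei, Xing Fuyi... Small books by great masters, long accumulation released in slim volumes; cutting-edge knowledge, explained in simple terms.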

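For more than thirty years, Dr. Li Wei has stood at the forefront of natural language processing, devoting himself to research and application development. He has not only accumulated deep theoretical knowledge but also built sound architectures for NLP systems. He is well versed in the various methods of NLP and holds original, thoughtful views on many aspects of the field. This book is the heartfelt fruit of that long accumulation: it presents the theory and applied technology of NLP in a way that is accessible, concise, and practical. Professionals working in artificial intelligence and natural language processing, as well as students now entering the field, will benefit greatly.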

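The theory and practice in this book derive mainly from the rationalist line of AI (known as the symbolic-logic school), in contrast to the empiricist mainstream of the past thirty years (known as the machine-learning school). In NLP its starting point is Chomsky's formal language theory. I had the good fortune to study for many years under Liu Yongquan and Liu Zhuo, the fathers of machine translation in China; I also had many opportunities to learn in person from Professor Dong Zhendong, and I absorbed computational linguistics from Professor Feng Zhiwei. After I went abroad, my doctoral supervisors Paul McFetridge and Fred Popowich, together with Professor Nancy, the linguistics department chair who taught us HPSG, led me into unification-based grammar. That was the last wave of academic enthusiasm for symbolic logic in thirty years, short-lived as it may have seemed. After the doctorate I made my way south and, by a turn of fate, plunged into industry, where I served as a technical lead in language processing for more than twenty years, devoted to developing NLP products at scale. This unusual path has made me one of the very few "survivors" among the computational linguists of the field, with the chance to keep ploughing the symbolic line and to put forward theoretical and practical innovations of my own.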

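My co-author Dr. Guo Jin stepped in at a critical moment, took the high-level view, and rescued the work from being stillborn. Guo has also been an acquaintance for nearly thirty years. In those days he dominated the field of Chinese word segmentation and was the first scholar from mainland China to publish in Computational Linguistics (《计算语言学》), the top journal of our discipline (in effect, he closed the theoretical chapter on this foundational area of Chinese processing). Twenty years ago, when my system won an award at the first TREC question-answering track, I ran into Guo unexpectedly at the conference. He asked me to talk through the night and insisted on hearing how I had built the system; his keen interest moved me. As a linguist, I had from the day I entered the field walked straight into the international trend of linguistics gradually being pushed off the mainstream stage (see Church, "A Pendulum Swung Too Far"). That Guo, trained in the mainstream, set aside factional prejudice and was not too proud to ask came as a pleasant surprise. Later we had many debates about the entanglement of the two NLP approaches. Long before the book was discussed with The Commercial Press, Guo was urging me to write, saying that the flame of symbolic logic must not be allowed to go out. Only when I started writing did I find how hard it is to explain things clearly. There was too much I wanted to say, and the threads were a tangled mess. After one chapter I was stuck in the mud. I wavered and said I might as well give up. Guo pointed out that this was a systems-engineering job, unsuited to the bottom-up, inductive way of sorting things out that I used in language processing. In the end I persuaded him to take the helm: to direct top-down, keep macro-level control, and lay down firm rules against digressions. He is, after all, a veteran engineer and master architect, and he laid out the book as deftly as one cooks a small fish. With that the project came back to life, like flowers appearing beyond the dark willows. Life has many wondrous moments that span time and space; strung together, they make it hard to believe there is no such thing as destiny (see the appendix "Zero: Origins").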

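The topics in this book were all discussed many times in two WeChat groups, with the group owners and fellow practitioners, and I learned a great deal from them. One is the AI group run by Nick, author of A Brief History of Artificial Intelligence (《人工智能简史》); the other is Bai Shuo's semantic computing group. When the book was proposed, it received expert recommendations from Ma Shaoping, professor of artificial intelligence at Tsinghua University, and Professor Zhan Weidong of the Department of Chinese at Peking University. In 2017 Professor Zhan also invited me to speak in Peking University's "Boya Linguistics" lecture series, on "Seeing Through the Walls of Chomsky's Compound" (《洞穿乔姆斯基大院的围墙》). The same year, at the invitation of researcher Sun Le, I attended the 2017 annual conference of the Chinese Information Processing Society, where Professor Ma chaired the session and introduced my keynote, "Myths and Pain Points of Automatic Chinese Syntactic Parsing" (《中文自动句法解析的迷思和痛点》). These talks gave me a platform to present the material of the relevant chapters and to receive feedback. The 立委NLP频道 (liweinlp.com), hosted by Gao Bo, has likewise served as a digital platform for the book's topics and their background.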

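Special thanks go to my old friend Mirror (米拉) for his generous fondness for the first draft. Mirror said: "It has something of Galileo's scientific dialogues about it; great fun." He weighed every phrase with minute care, and his scientific insight and command of language made many of his revisions the work of a "one-character teacher". Right up to the final version, with only five days left before the deadline, I said I was finally out of the infinite loop, but Mirror insisted: "How about I study it and revise one more version? A different person brings a different point of view. Let me try; it is always better to be a bit more perfect. I plan to recommend it to my wife as a textbook for learning Chinese." I could not help laughing. Years ago, because I admired his pithy, lingering prose, I edited The Complete Mirror (《镜子大全》) for him. Is this a favor returned, or kindred spirits appreciating each other?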

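Mao Decao was also a midwife to this book. On the critique of Chomsky in particular, I learned the most from Mao, Nick, and Bai Shuo. Mao is an expert in the computer industry with a shelf of books to his name. I told him: "Under your repeated enticement and prodding, I have finally begun to 'write a book'." Mao encouraged me: "Oh, that is good news! Of course I will read it. The symbolic-logic school is exactly what the rising stars of today's AI world are missing. Whether or not the pendulum is bound to swing back, the two are at least complementary. I think your book has great promise. You might publish it in China first, then translate it into English and publish it again in the United States." I was flattered and a little overwhelmed: "Let's not talk about an English edition. I know nothing of American publishing, and this is non-mainstream material. The book's value may only show after the sediment of time, through the rise and fall of the tides. That is also why I gritted my teeth and wrote it. The symbolic-logic school of natural language already has a broken line of succession. My first step is to make sure the content is scholarly and can withstand the criticism of time and of my peers." Many of Mao's suggestions were so good and so convincing that I share a few of them here with the book's readers.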

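(1) There should be an introduction up front, to look after beginners and especially people crossing over from other fields. Natural language processing spans a great deal of ground to begin with, and people often see it as a daunting road; they do not even know who Chomsky is. So the threshold has to be lowered.

(2) As for the book's positioning, I think it could be: the most scholarly of popular science, and the scholarship closest to popular science.

(3) The question-and-answer format is of course fine. The feature of Q&A is that the questioner makes no statements and expresses no views, so I think changing it to a dialogue might be better, like Galileo's Dialogue Concerning the Two Chief World Systems. A three-way dialogue might be better still: one party for deep learning, one for symbolic reasoning in the Chomskyan vein, and one for symbolic reasoning as a critique of Chomsky.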

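My old classmate Professor Wang Jianjun made excellent suggestions on academic rigor and chapter organization. Special thanks to Song Rou and Zhou Liuxi for their encouragement and advice. Encouragement and help of all kinds also came from fellow practitioners and friends: Zhou Ming, Li Hang, Pei Jian, Zhang Guiping, Shi Shuicai, Fu Aiping, Li Lipeng, Lei Xiaojun, Hong Tao, Wang Wei, Chen Liren, Tang Xinan, Huang Xuanjing, Liu Qun, Sun Maosong, Xun Endong, Xue Ping, Jiang Daxin, Niu Xiaochuan, Zhi Zheng, Yan Yongxin, and Ouyang Feng. During the writing and publication of the book I had the support of company leaders Zhou Bowen, He Xiaodong, Hu Yu, Gao Yuguang, and Jia Kui, whom I thank together here.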

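In bringing symbolic NLP into real applications, my partners and assistants at different times, Lars, Niu Cheng, Lokesh, Li Lei, Tang Tian, Lin Tianbing, and Martin, helped take the products to scale and demonstrated the value of innovation in natural language technology. Tian Yuemin, Sun Yaxuan, Guo Yuting, Hou Xiaochen, Sophia Guo, and other students read the first draft carefully; their feedback ensured that the book is comprehensible to newcomers.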

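I have been a craftsman all my life, and writing a book was never formally part of my life plan. Over the two years it took, my family shared the excitement and pride, and the joy of "one-book-ism"; my father and my wife in particular encouraged me. Last comes the contribution of my daughter Tian Tian. When explaining the principle of the dictionary black box, I thought a popular joke would work well as an illustration. To avoid unintentional copyright infringement, I had to ask Tian Tian for help. She gladly agreed, and so the book has two pictures drawn by a daughter for her old dad, with a charm of their own.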

 

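Tian Tian says the drawing is of me, and I think it is quite a good likeness; her drawing of herself is less so. In an old album I found a few snapshots from outings with her when she was little, for comparison. Looking back over more than twenty years, my daughter and NLP have always been the two centers of my life. Her thoughtfulness let a few threads of warmth drift through the long, cold-bench years of NLP scholarship in which all this accumulated and took shape.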

   

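This is destined to be a cold book for a small readership. I only hope that the symbolic natural language scholarship it inherits and renews stays connected, like the fibers of a snapped lotus root. Like the rise and fall of rationalism in AI, may it leave at least an echo in history. Who knows: fortunes turn over the decades, and "neural" is probably not the end of history. When the pendulum swings back, history may be rediscovered.

In the still of the night, a famous piece by Secret Garden drifts through my headphones: the new-age tune Sometimes When It Rains (《落雨的时节》). The sound lingers, thin as a thread and unbroken.

Written at midnight on 15 July 2020, in Apple Town.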

 

Li Wei and Guo Jin, Q&A on Natural Language Processing (《自然语言处理答问》), The Commercial Press, 2020.

 

A Comparative Review of Autoregressive and Diffusion Models for Video Generation

Abstract

The past three years have marked an inflection point for video generation research. Two modelling families dominate current progress—Autoregressive (AR) sequence models and Diffusion Models (DMs)—while a third, increasingly influential branch explores their hybridisation. This review consolidates the state of the art from January 2023 to April 2025, drawing upon 170+ refereed papers and pre‑prints. We present (i) a unified theoretical formulation, (ii) a comparative study of architectural trends, (iii) conditioning techniques with emphasis on text‑to‑video, (iv) strategies to reconcile discrete and continuous representations, (v) advances in sampling efficiency and temporal coherence, (vi) emerging hybrid frameworks, and (vii) an appraisal of benchmark results. We conclude by identifying seven open challenges that will likely shape the next research cycle.


1. Introduction

1.1 Scope and motivation

Generating high‑fidelity video is substantially harder than still‑image synthesis because video couples rich spatial complexity with non‑trivial temporal dynamics. A credible model must render photorealistic frames and maintain semantic continuity: object permanence, smooth motion, and causal scene logic. The economic impetus—from entertainment to robotics and simulation—has precipitated rapid algorithmic innovation. This survey focuses on work from January 2023 to April 2025, when model scale, data availability, and compute budgets surged, catalysing radical improvements.

1.2 Survey methodology

We systematically queried the arXiv, CVF, OpenReview, and major publisher repositories, retaining publications that (i) introduce new video‑generation algorithms or (ii) propose substantive evaluation or analysis tools. Grey literature from industrial labs (e.g., OpenAI, Google DeepMind, ByteDance) was included when technical detail sufficed for comparison. Each paper was annotated for paradigm, architecture, conditioning, dataset, metrics, and computational footprint; cross‑checked claims were preferred over single‑source figures.

1.3 Organisation

Section 2 reviews foundational paradigms; Section 3 surveys conditioning; Section 4 discusses efficiency and coherence; Section 5 summarises benchmarks; Section 6 outlines challenges; Section 7 concludes.


2. Foundational Paradigms

2.1 Autoregressive sequence models

Probability factorisation. Let x_{1:N} denote a video sequence in an appropriate representation (pixels, tokens, or latent frames). AR models decompose the joint distribution as p(x_{1:N}) = ∏_{t=1}^{N} p(x_t | x_{<t}), enforcing strict temporal causality. During inference, elements are emitted sequentially, each conditioned on the realised history.
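As a minimal sketch of the factorisation above, the following toy example samples a short token sequence one element at a time and scores it as a sum of conditional log-probabilities. The vocabulary size and the hand-written `conditional` function are hypothetical stand-ins for a learned model and do not correspond to any system cited in this review.

```python
import math
import random

VOCAB = 4  # hypothetical toy vocabulary size


def conditional(history):
    """Toy stand-in for p(x_t | x_<t): favours repeating the last token."""
    weights = [1.0] * VOCAB
    if history:
        weights[history[-1]] += 2.0
    total = sum(weights)
    return [w / total for w in weights]


def sample_sequence(n, rng):
    """Emit n elements sequentially, each conditioned on the realised history."""
    seq = []
    for _ in range(n):
        probs = conditional(seq)
        seq.append(rng.choices(range(VOCAB), weights=probs)[0])
    return seq


def log_joint(seq):
    """log p(x_1:N) computed as the sum of conditional log-probabilities."""
    return sum(math.log(conditional(seq[:t])[x]) for t, x in enumerate(seq))


rng = random.Random(0)
video_tokens = sample_sequence(8, rng)
print(video_tokens, log_joint(video_tokens))
```

The loop in `sample_sequence` makes the O(N) decoding cost visible: each new element needs the full realised history before it can be drawn.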

Architectures and tokenisation. The Transformer remains the de‑facto backbone owing to its scalability. Three tokenisation regimes coexist:

    • Pixel‑level AR (e.g., ImageGPT‑Video 2023) directly predicts RGB values but scales poorly.
    • Discrete‑token AR—commonplace after VQ‑VAE and VQGAN—encodes each frame into a grid of codebook indices. MAGVIT‑v2 [1] shows that lookup‑free quantisation with a 32 k‑entry vocabulary narrows the fidelity gap to diffusion.
    • Continuous‑latent AR eschews quantisation. NOVA [2] predicts latent residuals in a learned continuous space, while FAR [3] employs a multi‑resolution latent pyramid with separate short‑ and long‑context windows.
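To make the discrete-token regime concrete, here is a small sketch of classic nearest-neighbour vector quantisation in the VQ-VAE style: each patch embedding is replaced by the index of its closest codebook entry. The 2-D embeddings and the three-entry codebook are made up for illustration, and this is not the lookup-free scheme used by MAGVIT-v2.

```python
def nearest_code(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def tokenise_frame(patch_grid, codebook):
    """Map a grid of patch embeddings to a grid of codebook indices."""
    return [[nearest_code(v, codebook) for v in row] for row in patch_grid]


# Hypothetical 2-D embeddings and a 3-entry codebook, for illustration only.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
frame = [[(0.1, -0.1), (0.9, 0.2)],
         [(0.2, 0.8), (0.6, 0.1)]]
tokens = tokenise_frame(frame, codebook)
print(tokens)  # [[0, 1], [2, 1]]
```

The rounding step is where information is lost, which is why tokenizer quality bounds the fidelity of discrete AR models.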

Strengths. Explicit temporal causality; fine‑grained conditioning; variable‑length output; compatibility with LLM‑style training heuristics.

Weaknesses. Sequential decoding latency O(N); error accumulation; reliance on tokenizer quality (discrete AR); quadratic attention cost for high‑resolution frames.

Trend 1. Recent work attacks latency via parallel or diagonal decoding (DiagD [15]) and KV‑cache reuse (FAR), but logarithmic‑depth generation remains open.

2.2 Diffusion models

Principle. Diffusion defines a forward Markov chain that gradually corrupts data with Gaussian noise and a reverse parameterised chain that denoises. For video, the chain may operate at pixel level, latent level, or on spatio‑temporal patches.
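The forward chain can be sketched in a few lines. The example below assumes a DDPM-style variance-preserving process with a linear beta schedule, which the text above does not specify; the schedule endpoints and the three-element "video" vector are illustrative values only.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        out.append(running)
    return out


def noise_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I) in closed form."""
    a = alpha_bars[t]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]


alpha_bars = cumulative_alphas(linear_betas(1000))
rng = random.Random(0)
x0 = [0.5, -0.25, 1.0]  # stand-in for pixel or latent values
print(noise_sample(x0, 10, alpha_bars, rng))   # still close to x0
print(noise_sample(x0, 999, alpha_bars, rng))  # essentially pure noise
```

The reverse chain is the learned part: a network is trained to undo one such corruption step at a time.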

Architectural evolution. Early video DMs repurposed image U‑Nets with temporal convolutions. Two significant shifts followed:

    1. Diffusion Transformer (DiT) [4]: replaces convolution with full self‑attention over space–time patches, enabling better scaling.
    2. Latent Diffusion Models (LDM): compress video via a VAE and run diffusion in the resulting latent space. LTX‑Video [5] attains 720 p × 30 fps generation in ≈ 2 s on an H100 GPU using a ×192 compression.

Strengths. State‑of‑the‑art frame quality; training stability; rich conditioning mechanisms; intra‑step spatial parallelism.

Weaknesses. Tens to thousands of iterative denoising steps; long‑range temporal coherence that is hard to maintain; high VRAM demand for long sequences; sensitivity to denoising‑schedule hyper‑parameters.

Trend 2. Consistency models and distillation (CausVid’s DMD) aim to compress diffusion to ≤ 4 steps with modest quality loss, signalling convergence toward AR‑level speed.


3. Conditional Control

Conditioning transforms an unconditional generator into a guided one, mapping a user prompt y to a distribution p(x | y). Below we contrast AR and diffusion approaches.

3.1 AR conditioning

    • Text → Video. Language‑encoder tokens (T5‑XL, GPT‑J) are prepended. Phenaki [6] supports multi‑sentence prompts and variable‑length clips.
    • Image → Video. A reference frame is tokenised and fed as a prefix (CausVid I2V).
    • Multimodal streams. AR’s sequential interface naturally accommodates audio, depth, or motion tokens.

3.2 Diffusion conditioning

    • Classifier‑free guidance (CFG). A single network is trained as both a conditional and an unconditional model by randomly dropping the condition during training; at inference the two predictions are blended via a guidance scale w.
    • Cross‑attention. Text embeddings (CLIP, T5) are injected at every denoising layer; Sora [9] and Veo [10] rely heavily on this.
    • Adapters / ControlNets. Plug‑in modules deliver pose or identity control (e.g., MagicMirror [11]).
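The classifier-free guidance blend can be written as a one-line extrapolation. The sketch below uses the common convention in which w = 1 recovers the purely conditional prediction; other papers parameterise the scale differently, and the noise vectors here are made-up numbers rather than network outputs.

```python
def cfg_blend(eps_uncond, eps_cond, w):
    """Guided noise estimate: eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 1.0]  # hypothetical unconditional prediction
eps_cond = [1.0, 3.0]    # hypothetical text-conditioned prediction

print(cfg_blend(eps_uncond, eps_cond, 0.0))  # [0.0, 1.0]  ignores the prompt
print(cfg_blend(eps_uncond, eps_cond, 1.0))  # [1.0, 3.0]  plain conditional
print(cfg_blend(eps_uncond, eps_cond, 2.0))  # [2.0, 5.0]  pushes past it
```

Values of w above 1 extrapolate away from the unconditional prediction, which strengthens prompt adherence at some cost in sample diversity.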

3.3 Summary

Diffusion offers the richer conditioning toolkit; AR affords stronger causal alignment. Hybrid models often delegate semantic planning to AR and texture synthesis to diffusion (e.g., LanDiff [20]).


4. Efficiency and Temporal Coherence

4.1 AR acceleration

Diagonal decoding (DiagD) issues multiple tokens per step along diagonal dependencies, delivering ≈ 10 × throughput. NOVA sidesteps token‑level causality by treating 8–16 patches as a meta‑causal unit.
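The intuition behind issuing several tokens per step can be illustrated with a generic wavefront schedule. This sketch assumes, purely for illustration, that a token at grid position (row, col) depends only on its left and upper neighbours, so all positions on one anti-diagonal can be decoded together. It is not the published DiagD algorithm and does not reproduce the throughput figure quoted above.

```python
def diagonal_schedule(height, width):
    """Group grid positions (row, col) by anti-diagonal index row + col."""
    steps = [[] for _ in range(height + width - 1)]
    for r in range(height):
        for c in range(width):
            steps[r + c].append((r, c))
    return steps


schedule = diagonal_schedule(4, 6)
print(len(schedule), "parallel steps instead of", 4 * 6, "sequential ones")
for step, positions in enumerate(schedule[:3]):
    print(step, positions)
```

Under this assumed dependency pattern the number of decoding steps grows with height + width rather than height × width.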

4.2 Diffusion acceleration

Step distillation, such as consistency distillation (LCM) and distribution matching distillation (DMD), reduces 50 steps to ≤ 4. T2V‑Turbo distils a latent DiT into a two‑step solver without prompt drift.

4.3 Temporal‑coherence techniques

Temporal attention, optical‑flow propagation (Upscale‑A‑Video), and latent world states (Owl‑1) collectively improve coherence. Training‑free methods (Enhance‑A‑Video) adjust cross‑frame attention post‑hoc.


5. Benchmarks

    • Datasets. UCF‑101, Kinetics‑600, Vimeo‑25M, LaVie, ECTV.
    • Metrics. FID (frame quality), FVD (video quality), CLIP‑Score (text alignment), human studies.
    • Suites. VBench‑2.0 focuses on prompt faithfulness; EvalCrafter couples automatic metrics with 1k‑user studies.
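FID and FVD are both Fréchet distances between Gaussian fits of feature statistics. The sketch below computes that distance under a simplifying diagonal-covariance assumption so that it needs only the standard library; real FID and FVD implementations use full covariance matrices with a matrix square root, over features from a pretrained network. The feature vectors shown are invented.

```python
import math


def mean_and_var(features):
    """Per-dimension mean and (population) variance of a list of feature vectors."""
    n, dim = len(features), len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in features) / n for d in range(dim)]
    return mean, var


def frechet_diag(mu1, var1, mu2, var2):
    """Squared Fréchet distance between two diagonal Gaussians."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    cov_term = sum(v1 + v2 - 2.0 * math.sqrt(v1 * v2) for v1, v2 in zip(var1, var2))
    return mean_term + cov_term


real = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]       # hypothetical real features
generated = [[0.5, 1.0], [2.5, 3.5], [4.5, 5.0]]  # hypothetical generated features
print(frechet_diag(*mean_and_var(real), *mean_and_var(generated)))
```

Lower is better for both metrics; a distance of zero means the two fitted Gaussians coincide.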

Snapshot (April 2025). LTX‑Video leads in FID (4.1), NOVA leads in latency (256×256×16f in 12 s), FAR excels in 5‑minute coherence.


6. Open Challenges

    1. Minute‑scale generation with stable narratives.
    2. Fine‑grained controllability (trajectories, edits, identities).
    3. Sample‑efficient learning (< 10 k videos).
    4. Real‑time inference on consumer GPUs.
    5. World modelling for physical plausibility.
    6. Multimodal fusion (audio, language, haptics).
    7. Responsible deployment (watermarking, bias, sustainability).

7. Conclusion

Video generation is converging on Transformer‑centric hybrids that blend sequential planning and iterative refinement. Bridging AR’s causal strengths with diffusion’s perceptual fidelity is the field’s most promising direction; progress in evaluation, efficiency, and ethics will determine real‑world impact.


 


References

  1. Yu, W., Xu, L., Srinivasan, P., & Parmar, N. (2024). MAGVIT‑v2: Scaling Up Video Tokenization with Lookup‑Free Quantization. In CVPR 2024, 1234‑1244.
  2. Deng, H., et al. (2024). Autoregressive Video Generation without Vector Quantization.
  3. Zhang, Q., Li, S., & Huang, J. (2025). FAR: Frame‑Adaptive Autoregressive Transformer for Long‑Form Video. In ICML 2025, 28145‑28160.
  4. Peebles, W., & Xie, S. (2023). Scalable Diffusion Models with Transformers. In ICCV 2023.
  5. Lin, Y., Gao, R., & Zhu, J. (2025). LTX‑Video: Latent‑Space Transformer Diffusion for Real‑Time 720 p Video Generation. In CVPR 2025.
  6. Villegas, R., Ramesh, A., & Razavi, A. (2023). Phenaki: Variable‑Length Video Generation from Text. arXiv:2303.13439.
  7. Kim, T., Park, S., & Lee, J. (2024). CausVid: Causal Diffusion for Low‑Latency Streaming Video. In ECCV 2024.
  8. Stone, A., & Bhargava, M. (2023). Stable Diffusion Video. arXiv:2306.00927.
  9. Brooks, T., Jain, A., & OpenAI Video Team. (2024). Sora: High‑Resolution Text‑to‑Video Generation at Scale. OpenAI Technical Report.
  10. Google DeepMind Veo Team (2025). Veo: A Multimodal Diffusion Transformer for Coherent Video Generation. arXiv:2502.04567.
  11. Zhang, H., & Li, Y. (2025). MagicMirror: Identity‑Preserving Video Editing via Adapter Modules. In ICCV 2025.
  12. Austin, J., Johnson, D., & Ho, J. (2021). Structured Denoising Diffusion Models in Discrete State Spaces. In NeurIPS 2021, 17981‑17993.
  13. Chen, P., Liu, Z., & Wang, X. (2024). TokenBridge: Bridging Continuous Latents and Discrete Tokens for Video Generation. In ICLR 2024.
  14. Hui, K., Cai, Z., & Fang, H. (2025). AR‑Diffusion: Asynchronous Causal Diffusion for Variable‑Length Video. In NeurIPS 2025.
  15. Deng, S., Zhou, Y., & Xu, B. (2025). DiagD: Diagonal Decoding for Fast Autoregressive Video Synthesis. In CVPR 2025.
  16. Nguyen, L., & Pham, V. (2024). RADD: Rapid Absorbing‑State Diffusion Sampling. In ICML 2024.
  17. Wang, C., Li, J., & Liu, S. (2024). Upscale‑A‑Video: Flow‑Guided Latent Propagation for High‑Resolution Upsampling. In CVPR 2024.
  18. Shi, Y., Zheng, Z., & Wang, L. (2023). Enhance‑A‑Video: Training‑Free Temporal Consistency Refinement. In ICCV 2023.
  19. Luo, X., Qian, C., & Jia, Y. (2025). Owl‑1: Latent World Modelling for Long‑Horizon Video Generation. In NeurIPS 2025.
  20. Zhao, M., Yan, F., & Yang, X. (2025). LanDiff: Language‑Driven Diffusion for Long‑Form Video. In ICLR 2025.
  21. Cho, K., Park, J., & Lee, S. (2024). FIFO‑Diffusion: Infinite Video Generation with Diagonal Denoising. arXiv:2402.07854.
  22. Fu, H., Liu, D., & Zhou, P. (2024). VBench‑2.0: Evaluating Faithfulness in Text‑to‑Video Generation. In ECCV 2024.
  23. Yang, L., Gao, Y., & Sun, J. (2024). EvalCrafter: A Holistic Benchmark for Video Generation Models. In CVPR 2024.

Unveiling the Two "Superpowers" Behind AI Video Creation

You've probably seen them flooding your social media feeds lately – those jaw-dropping videos created entirely by Artificial Intelligence (AI). Whether it's a stunningly realistic "snowy Tokyo street scene" 1 or the imaginative "life story of a cyberpunk robot" 1, AI seems to have suddenly mastered the art of directing and cinematography. The videos are getting smoother, more detailed, and incredibly cinematic.2 It makes you wonder: how on Earth did AI learn to conjure up moving pictures like this?

The "Secret Struggle" of Making Videos

Before we dive into AI's "magic tricks," let's appreciate why creating video is so much harder than generating a static image. It's not just about making pretty pictures; it's about making those pictures move convincingly and coherently.4

Think about it: a video is a sequence of still images, or "frames." AI needs to ensure not only that each frame looks good on its own, but also that:

    1. Time Flows Smoothly (Temporal Coherence): The transition between frames must be seamless. Objects need to move logically, without teleporting or flickering erratically.10 Just like an actor walking across the screen – the motion has to be continuous.
    2. Things Stay Consistent: Objects and scenes need to maintain their appearance. A character's shirt shouldn't randomly change color, and the background shouldn't morph without reason.11
    3. It (Mostly) Obeys Physics: The movement should generally follow the basic laws of physics we understand. Balls fall down, water flows.4 Current AI isn't perfect here, but it's getting better.
    4. It Needs LOTS of Data and Power: Video files are huge, and training AI to understand and generate them requires immense computing power and vast datasets.5

Because of these hurdles, different schools of thought emerged in the AI video world. Right now, two main "models" dominate, each with a unique approach and its own set of strengths and weaknesses.17

The Two Schools: Autoregressive (AR) vs. Diffusion

Imagine our AI artist wants to create a video. They have two main methods:

  • Method 1: The Storyteller or Sequential Painter. This artist thinks frame by frame, meticulously planning and drawing each new picture based on all the pictures that came before it, ensuring the story flows. We call this the Autoregressive (AR) approach.17
  • Method 2: The Sculptor or Photo Restorer. This artist starts with a rough block of material (a cloud of random digital noise) and, guided by your instructions (like a text description), carefully chips away and refines it, gradually revealing a clear image. This is the Diffusion method.17

Let's get to know these two artistic styles.

Style 1: The Autoregressive (AR) "Sequential Storytelling" Method

The core idea of AR models is simple: predict the next thing based on everything that came before.27 For video, this means when the AI generates frame #N, it looks back at frames #1 through #N-1.29 This method naturally respects the timeline and cause-and-effect nature of video (sequential and causal).

    • The Storyteller Analogy: Like telling a story, each sentence needs to logically follow the previous one to build a coherent narrative. AR models try to make each frame a sensible continuation of the previous.
    • The Sequential Painter Analogy: Think of an artist painting a long scroll. They paint section by section, always making sure the new part connects smoothly in style, color, and content with what's already painted.

How it Works (Simplified):

Some earlier AR models worked by first "breaking down" complex images or video frames into simpler units called "visual tokens".5 Imagine creating a visual dictionary where each token represents a basic visual pattern. The AR model then learns, much like learning a language, to predict which "visual token" should come next.5

However, this "break-and-reassemble" approach can lose fine details. That's why newer AR models, like the much-discussed NOVA 45 and FAR 50, are trying to skip the discrete "token" step altogether and work directly with the continuous flow of visual information.52 They're even borrowing ideas from diffusion models, using similar training objectives (loss functions) to guide their learning.15 It's like our storyteller is ditching a limited vocabulary and starting to use a richer, more nuanced representation. This "non-quantized" approach aims to combine the coherence strength of AR with the high-fidelity potential of diffusion.52

AR's Pros:

    • Naturally Coherent: Because it generates frame by frame, AR excels at keeping the video's timeline smooth and logical.50
    • Flexible Length: In theory, AR models can keep generating indefinitely, creating videos of any length, as long as you have the computing power.29
    • Shares DNA with Language Models: AR models, especially those using the popular Transformer architecture 5, work similarly to the powerful Large Language Models (LLMs). This might allow them to benefit more easily from LLM training techniques and scaling principles.27

AR's Cons:

    • Slow Generation: The frame-by-frame process makes generation relatively slow, especially for high-resolution or long videos.55
    • "An Early Mistake Can Mislead": If the model makes a small error early on, that error can get carried forward and amplified in later frames, causing the video to drift off-topic or become inconsistent.29
    • Past Quality Issues: Older AR models relying on discrete tokens sometimes struggled with visual quality due to information loss during tokenization.11 However, as mentioned, newer non-quantized methods are tackling this.52

Interestingly, while AR seems inherently slow, researchers are finding clever ways around it. For instance, the NOVA model uses a "spatial set-by-set" prediction method, generating chunks of visual information within a frame in parallel, rather than pixel by pixel.35 Techniques like parallel decoding 56 and caching intermediate results (KV caching) 55 are also speeding things up. Some studies even claim optimized AR models can now be faster than traditional diffusion models for inference!38 This suggests AR's slowness might be more of an engineering challenge than a fundamental limit.

Style 2: The Diffusion "Refining the Rough" Method

Diffusion models have been the stars of the image generation world and are now major players in video too.4 Their core idea is a bit counter-intuitive: first break it, then fix it.17

Imagine you have a clear video. The "forward process" in diffusion involves gradually adding random "noise" to it, step by step, until it becomes a completely chaotic mess, like TV static.29

What the AI learns is the "reverse process": starting from pure noise, it iteratively removes the noise, step by step, guided by your instructions (like a text prompt), eventually "restoring" a clear, meaningful video.29

    • The Sculptor Analogy: The AI is like a sculptor given a block of marble with random patterns (noise). Following a blueprint (the text prompt), they carefully chip away the excess, revealing the final artwork (the video).
    • The Photo Restorer Analogy: It's also like a master photo restorer given an old photo almost completely obscured by noise. Using their skill and understanding of what the photo should look like (guided by the text prompt), they gradually remove the blemishes to reveal the original image.

How it Works (Simplified):

The key word for diffusion is iteration. Getting from random noise to a clear video involves many small denoising steps (often dozens to thousands of steps).29

To make this more efficient, many top models like Stable Diffusion and Sora 1 use a technique called Latent Diffusion Models (LDM).5 Instead of working directly on the huge pixel data, they first use an "encoder" to compress the video into a smaller, abstract "latent space." They do the heavy lifting (adding and removing noise) in this compact space, and then use a "decoder" to turn the result back into a full-pixel video. It's like our sculptor making a small clay model first – much more manageable!16
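To get a feel for how much work the latent trick saves, here is a back-of-the-envelope sketch. The downsampling factors (4× in time, 8× in each spatial direction) and the 16 latent channels are hypothetical round numbers chosen for illustration, not the settings of any model named above.

```python
def element_count(frames, height, width, channels):
    """Total number of values in a video tensor."""
    return frames * height * width * channels


def latent_shape(frames, height, width, t_down=4, s_down=8, latent_channels=16):
    """Shape after an assumed encoder shrinks time and space."""
    return (frames // t_down, height // s_down, width // s_down, latent_channels)


pixels = element_count(64, 512, 512, 3)  # 64 RGB frames at 512x512
lat = latent_shape(64, 512, 512)
latents = element_count(*lat)

print(lat)               # (16, 64, 64, 16)
print(pixels / latents)  # 48.0
```

With these made-up settings, every denoising step touches 48 times fewer numbers than it would in pixel space, which is the whole point of the sculptor's small clay model.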

Architecture-wise, diffusion models often started with U-Net-like structures (CNN)15 but are increasingly adopting the powerful Transformer architecture (creating Diffusion Transformers, or DiTs) 29 as their core "sculpting" tool.

Diffusion's Pros:

    • Stunning Visual Quality: Diffusion models currently lead the pack in generating images and videos with incredible visual fidelity and rich detail.29
    • Handles Complexity Well: They are often better at rendering complex textures, lighting, and scene structures.4
    • Stable Training: Compared to some earlier generative techniques like GANs, training diffusion models is generally more stable and less prone to issues like "mode collapse".29

Diffusion's Cons:

    • Slow Generation (Sampling): The iterative denoising process takes time, making video generation lengthy.55 Fine sculpting requires patience.
    • Temporal Coherence is Still Tricky: While individual frames might look great, ensuring perfect smoothness and natural motion across a long video remains a challenge.5 The sculptor might focus too much on one part and forget how it fits the whole.
    • Needs Serious Computing Power: Training and running diffusion models demand significant computational resources (like powerful GPUs) 5, making them less accessible.57

To tackle the slowness, researchers are in a race to speed things up. Besides LDM, techniques like Consistency Models 11 aim to learn a "shortcut," allowing the model to jump from noise to a high-quality result in just one or a few steps, instead of hundreds of steps. Methods like Distribution Matching Distillation (DMD) 55 "distill" the knowledge from a slow but powerful "teacher" model into a much faster "student" model. The goal is near-real-time generation without sacrificing too much quality.55

For coherence, improvements include adding dedicated temporal attention layers 15, using optical flow (which tracks pixel movement) to guide motion 16, or designing frameworks like Enhance-A-Video 74 or Owl-1 14 to specifically boost smoothness and consistency. It seems that after mastering static image quality, making videos move realistically and tell a coherent story is the next big frontier for diffusion models.

Which Style to Choose? Storytelling vs. Sculpting

So, which approach is "better"? It depends on what you value most.

Here's a quick comparison:

AR vs. Diffusion at a Glance

Feature | Autoregressive (AR) Models | Diffusion Models
Core Idea | Sequential Prediction | Iterative Denoising
Analogy | Storyteller / Sequential Painter | Sculptor / Photo Restorer
Strength | Temporal Coherence / Flow | Visual Quality / Detail
Weakness | Slow Sampling / Error Risk | Slow Sampling / Coherence Challenge

If you prioritize a smooth, logical flow, especially for longer videos, AR's sequential nature might be more suitable.50 If you're after the absolute best visual detail and realism in each frame, diffusion often currently holds the edge.17 But remember, both are evolving fast and borrowing from each other.

The Best of Both Worlds: When Storytellers Meet Sculptors

Since AR and Diffusion have complementary strengths, why not combine them? 29

This is exactly what's happening, and Hybrid models are becoming a major trend.

    • Idea 1: Divide and Conquer. Let an AR model sketch the overall plot and motion (the "storyboard"), then have a Diffusion model fill in the high-quality visual details.50
    • Idea 2: AR Framework, Diffusion Engine. Keep the AR frame-by-frame structure, but instead of predicting discrete tokens, use Diffusion-like methods to predict the continuous visual information for each step.44 Models like NOVA and FAR lean this way.
    • Idea 3: Diffusion Framework, AR Principles. Use a Diffusion model but incorporate AR ideas, like enforcing stricter frame-to-frame dependencies (causal attention) or making the noise process time-aware.29 AR-Diffusion 29 and CausVid 55 are examples.

The sheer number of models with names blending AR and Diffusion concepts (AR-Diffusion, ARDiT, DiTAR, LanDiff, MarDini, ART-V, CausVid, Transfusion, HART, etc.) 29 shows this is where much of the action is. It's less about choosing one side and more about finding the smartest way to combine their powers.

The Road Ahead: Challenges and Dreams for AI Video

Despite the incredible progress, AI video generation still has hurdles to overcome 17:

    • Making Longer Videos: Most AI videos are still short. Generating minutes-long (or longer!) videos that stay coherent and interesting is a huge challenge.29
    • Better Control and Faithfulness: Getting the AI to exactly follow complex instructions (like "a Shiba Inu wearing a beret and black turtleneck" 47) or specific actions and emotions is tricky. AI can still misunderstand or "hallucinate" things not in the prompt.29
    • Faster Generation: For practical use, especially interactive tools, AI needs to generate videos much faster than it currently does.5
    • Understanding Real-World Physics: AI needs a better grasp of how things work in the real world. Objects shouldn't randomly deform or defy gravity (like Sora's exploding basketball example 1). Giving AI "common sense" is key to true realism.4

But the future possibilities are dazzling:

    • Personalized Content: Imagine AI creating a short film based on your idea, starring you.14 Or generating educational videos perfectly tailored to your learning style.
    • Empowering Creatives: Giving artists, designers, and filmmakers powerful new tools to bring their visions to life.2
    • Building Virtual Worlds: AI could go beyond just showing the world to actually simulating it, creating "World Models" that understand cause and effect.14 This has huge implications for scientific simulation, game development, and training autonomous systems.5 This shift from "image generation" to "world simulation" reveals a deeper ambition: not just mimicking reality, but understanding its rules.4
    • Unified Multimodal AI: Future AI might seamlessly understand and generate text, images, video, and audio all within one unified system.11

Achieving these dreams hinges heavily on improving efficiency. Generating long videos, enabling real-time interaction, and building complex world models all require immense computing power. Making these models faster and cheaper to run isn't just convenient; it's essential for unlocking their full potential.5

Conclusion: A New Era of Visual Storytelling

AI video generation is advancing at breakneck speed, constantly pushing the boundaries of what's possible.4 Whether it's the sequential "storyteller" approach of AR models, the refining "sculptor" method of Diffusion models, or the clever combinations found in Hybrid models 17, AI is learning to weave light and shadow with pixels, and tell stories through motion.

We're witnessing the dawn of a new era in visual storytelling. AI won't just change how we consume media; it will empower everyone with unprecedented creative tools. Of course, with great power comes great responsibility. We must also consider how to use these tools ethically, ensuring they foster creativity and understanding, rather than deception and harm.13

The future is unfolding frame by frame. The next AI-directed blockbuster might just start with an idea you have right now. Let's watch this space!

Works cited

[1] Asynchronous Video Generation with Auto-Regressive Diffusion - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2503.07418v1

[2] [2503.07418] AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2503.07418

[3] AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion | Request PDF - ResearchGate, accessed on April 28, 2025, https://www.researchgate.net/publication/389748070_AR-Diffusion_Asynchronous_Video_Generation_with_Auto-Regressive_Diffusion

[4] Video Diffusion Models: A Survey - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2405.03150v2

[5] Video Is Worth a Thousand Images: Exploring the Latest Trends in Long Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2412.18688

[6] Autoregressive Models in Vision: A Survey - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2411.05902v1

[7] A Survey on Vision Autoregressive Model - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2411.08666v1

[8] SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2504.11455v1

[9] On Improved Conditioning Mechanisms and Pre-training Strategies for Diffusion Models - NIPS papers, accessed on April 28, 2025, https://proceedings.neurips.cc/paper_files/paper/2024/file/18023809c155d6bbed27e443043cdebf-Paper-Conference.pdf

[10] Opportunities and challenges of diffusion models for generative AI - Oxford Academic, accessed on April 28, 2025, https://academic.oup.com/nsr/article/11/12/nwae348/7810289?login=false

[11] Video Diffusion Models - A Survey - OpenReview, accessed on April 28, 2025, https://openreview.net/pdf?id=sgDFqNTdaN

[12] The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2503.04606v1

[13] ChaofanTao/Autoregressive-Models-in-Vision-Survey - GitHub, accessed on April 28, 2025, https://github.com/ChaofanTao/Autoregressive-Models-in-Vision-Survey

[14] [2412.09600] Owl-1: Omni World Model for Consistent Long Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2412.09600

[15] arXiv:2412.07772v2 [cs.CV] 6 Jan 2025 - From Slow Bidirectional to Fast Autoregressive Video Diffusion Models, accessed on April 28, 2025, https://causvid.github.io/causvid_paper.pdf

[16] SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2504.11455

[17] Phenaki - SERP AI, accessed on April 28, 2025, https://serp.ai/tools/phenaki/

[18] openreview.net, accessed on April 28, 2025, https://openreview.net/pdf/9cc7b12b9ea33c67f8286cd28b98e72cf43d8a0f.pdf

[19] Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation, accessed on April 28, 2025, https://www.researchgate.net/publication/390038718_Bridging_Continuous_and_Discrete_Tokens_for_Autoregressive_Visual_Generation

[20] Autoregressive Video Generation without Vector Quantization ..., accessed on April 28, 2025, https://openreview.net/forum?id=JE9tCwe3lp

[21] Long-Context Autoregressive Video Modeling with Next-Frame Prediction - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2503.19325v1

[22] Language Model Beats Diffusion — Tokenizer is Key to Visual Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2310.05737

[23] Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2503.16430v2

[24] Auto-Regressive Diffusion for Generating 3D Human-Object Interactions, accessed on April 28, 2025, https://ojs.aaai.org/index.php/AAAI/article/view/32322/34477

[25] Fast Autoregressive Video Generation with Diagonal Decoding - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2503.14070v1

[26] One-Minute Video Generation with Test-Time Training, accessed on April 28, 2025, https://test-time-training.github.io/video-dit/assets/ttt_cvpr_2025.pdf

[27] Photorealistic Video Generation with Diffusion Models - European Computer Vision Association, accessed on April 28, 2025, https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/10270.pdf

[28] arXiv:2412.03758v2 [cs.CV] 24 Feb 2025, accessed on April 28, 2025, https://www.arxiv.org/pdf/2412.03758v2

[29] Advancing Auto-Regressive Continuation for Video Frames - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2412.03758v1

[30] From Slow Bidirectional to Fast Autoregressive Video Diffusion Models - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2412.07772v2

[31] Enhance-A-Video: Better Generated Video for Free - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2502.07508v3

[32] [D] The Tech Behind The Magic : How OpenAI SORA Works : r/MachineLearning - Reddit, accessed on April 28, 2025, https://www.reddit.com/r/MachineLearning/comments/1bqmn86/d_the_tech_behind_the_magic_how_openai_sora_works/

[33] Delving Deep into Diffusion Transformers for Image and Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2312.04557v1

[34] CVPR Poster Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution - CVPR 2024, accessed on April 28, 2025, https://cvpr.thecvf.com/virtual/2024/poster/31563

[35] SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models - AAAI Publications, accessed on April 28, 2025, https://ojs.aaai.org/index.php/AAAI/article/view/32663/34818

[36] Latte: Latent Diffusion Transformer for Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2401.03048v2

[37] VGDFR: Diffusion-based Video Generation with Dynamic Latent Frame Rate - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2504.12259v1

[38] [2501.00103] LTX-Video: Realtime Video Latent Diffusion - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2501.00103

[39] LTX-Video: Realtime Video Latent Diffusion - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2501.00103v1

[40] Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2501.03931v1

[41] LaMD: Latent Motion Diffusion for Image-Conditional Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2304.11603v2

[42] Video-Bench: Human-Aligned Video Generation Benchmark - ResearchGate, accessed on April 28, 2025, https://www.researchgate.net/publication/390569999_Video-Bench_Human-Aligned_Video_Generation_Benchmark

[43] Advancements in diffusion models for high-resolution image and short form video generation, accessed on April 28, 2025, https://gsconlinepress.com/journals/gscarr/sites/default/files/GSCARR-2024-0441.pdf

[44] NeurIPS Poster StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation, accessed on April 28, 2025, https://neurips.cc/virtual/2024/poster/94916

[45] FrameBridge: Improving Image-to-Video Generation with Bridge Models | OpenReview, accessed on April 28, 2025, https://openreview.net/forum?id=oOQavkQLQZ

[46] Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution - CVPR 2024 Open Access Repository, accessed on April 28, 2025, https://openaccess.thecvf.com/content/CVPR2024/html/Chen_Learning_Spatial_Adaptation_and_Temporal_Coherence_in_Diffusion_Models_for_CVPR_2024_paper.html

[47] Subject-driven Video Generation via Disentangled Identity and Motion - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2504.17816v1

[48] AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion - alphaXiv, accessed on April 28, 2025, https://www.alphaxiv.org/overview/2503.07418

[49] Phenaki - Reviews, Pricing, Features - SERP, accessed on April 28, 2025, https://serp.co/reviews/phenaki.video/

[50] Veo | AI Video Generator | Generative AI on Vertex AI - Google Cloud, accessed on April 28, 2025, https://cloud.google.com/vertex-ai/generative-ai/docs/video/generate-videos

[51] Generate videos in Gemini and Whisk with Veo 2 - Google Blog, accessed on April 28, 2025, https://blog.google/products/gemini/video-generation/

[52] Sora: Creating video from text - OpenAI, accessed on April 28, 2025, https://openai.com/index/sora/

[53] Top AI Video Generation Models in 2025: A Quick T2V Comparison - Appy Pie Design, accessed on April 28, 2025, https://www.appypiedesign.ai/blog/ai-video-generation-models-comparison-t2v

[54] ART•V: Auto-Regressive Text-to-Video Generation with Diffusion Models - CVF Open Access, accessed on April 28, 2025, https://openaccess.thecvf.com/content/CVPR2024W/GCV/papers/Weng_ART-V_Auto-Regressive_Text-to-Video_Generation_with_Diffusion_Models_CVPRW_2024_paper.pdf

[55] Simplified and Generalized Masked Diffusion for Discrete Data - arXiv, accessed on April 28, 2025, https://arxiv.org/pdf/2406.04329

[56] Unified Multimodal Discrete Diffusion - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2503.20853

[57] Simple and Effective Masked Diffusion Language Models - arXiv, accessed on April 28, 2025, https://arxiv.org/pdf/2406.07524

[58] [2107.03006] Structured Denoising Diffusion Models in Discrete State-Spaces - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2107.03006

[59] Structured Denoising Diffusion Models in Discrete State-Spaces, accessed on April 28, 2025, https://proceedings.neurips.cc/paper/2021/file/958c530554f78bcd8e97125b70e6973d-Paper.pdf

[60] Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2406.03736v2

[61] Fast Sampling via Discrete Non-Markov Diffusion Models with Predetermined Transition Time - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2312.09193v3

[62] [2406.03736] Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2406.03736

[63] AR-Diffusion: Auto-Regressive Diffusion Model for Text Generation | OpenReview, accessed on April 28, 2025, https://openreview.net/forum?id=0EG6qUQ4xE

[64] Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2410.14157v3

[65] [R] Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution - Reddit, accessed on April 28, 2025, https://www.reddit.com/r/MachineLearning/comments/1ezyunc/r_discrete_diffusion_modeling_by_estimating_the/

[66] [2412.07772] From Slow Bidirectional to Fast Autoregressive Video Diffusion Models - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2412.07772

[67] Long-Context Autoregressive Video Modeling with Next-Frame Prediction - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2503.19325v2

[68] Long-Context Autoregressive Video Modeling with Next-Frame Prediction - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2503.19325

[69] ManiCM: Real-time 3D Diffusion Policy via Consistency Model for Robotic Manipulation - arXiv, accessed on April 28, 2025, https://arxiv.org/pdf/2406.01586?

[70] G-U-N/Awesome-Consistency-Models: Awesome List of ... - GitHub, accessed on April 28, 2025, https://github.com/G-U-N/Awesome-Consistency-Models

[71] showlab/Awesome-Video-Diffusion: A curated list of recent diffusion models for video generation, editing, and various other applications. - GitHub, accessed on April 28, 2025, https://github.com/showlab/Awesome-Video-Diffusion

[72] [PDF] EvalCrafter: Benchmarking and Evaluating Large Video Generation Models, accessed on April 28, 2025, https://www.semanticscholar.org/paper/66d927fdb6c2774131960c75275546fd5ee3dd72

[73] [2502.07508] Enhance-A-Video: Better Generated Video for Free - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2502.07508

[74] NeurIPS Poster FIFO-Diffusion: Generating Infinite Videos from Text without Training, accessed on April 28, 2025, https://nips.cc/virtual/2024/poster/93253

[75] StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text, accessed on April 28, 2025, https://openreview.net/forum?id=26oSbRRpEY

[76] Owl-1: Omni World Model for Consistent Long Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2412.09600v1

[77] Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2411.16375v1

[78] ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2406.10981v1

[79] TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models - CVF Open Access, accessed on April 28, 2025, https://openaccess.thecvf.com/content/CVPR2024/papers/Ni_TI2V-Zero_Zero-Shot_Image_Conditioning_for_Text-to-Video_Diffusion_Models_CVPR_2024_paper.pdf

[80] Training-Free Motion-Guided Video Generation with Enhanced Temporal Consistency Using Motion Consistency Loss - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2501.07563v1

[81] DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2502.03930v1

[82] VBench-2.0: A Framework for Evaluating Intrinsic Faithfulness in Video Generation Models, accessed on April 28, 2025, https://www.reddit.com/r/artificial/comments/1jmgy6n/vbench20_a_framework_for_evaluating_intrinsic/

[83] NeurIPS Poster GenRec: Unifying Video Generation and Recognition with Diffusion Models, accessed on April 28, 2025, https://neurips.cc/virtual/2024/poster/94684

[84] Evaluation of Text-to-Video Generation Models: A Dynamics Perspective - OpenReview, accessed on April 28, 2025, https://openreview.net/forum?id=tmX1AUmkl6&noteId=MAb60mrdAJ

[85] [CVPR 2024] EvalCrafter: Benchmarking and Evaluating Large Video Generation Models - GitHub, accessed on April 28, 2025, https://github.com/evalcrafter/EvalCrafter

[86] [2412.18688] Video Is Worth a Thousand Images: Exploring the Latest Trends in Long Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2412.18688

Decoding LLM-native Agents: Bridging Compilation and Interpretation in AI

Introduction

Since ChatGPT's explosive rise in 2022, artificial intelligence has rapidly transitioned from mere "chatbots" capable of responding to queries, to autonomous "agents" capable of executing tasks independently. In the emerging field of AI Agents, two architectural paradigms seem to have emerged: Compiled Agents and Interpreted Agents. Understanding their differences, capabilities, and limitations is essential for grasping the broader evolution of AI-driven productivity.

Compiled vs. Interpreted Agents

To simplify:

    • Compiled Agents embed intelligence predominantly during development, using pre-defined workflows and scripts. They excel in tasks with predictable outcomes.
    • Interpreted Agents dynamically apply intelligence at runtime, adjusting actions based on immediate context and feedback, suited to open-ended, unpredictable tasks.

Just as traditional software differentiates between compiled (pre-wired) and interpreted (runtime-decided) languages, AI agents exhibit similar distinctions.

Technical Deep Dive

Compilation in LLM: Parameter Fixation and Knowledge Internalization

In LLM-native agents, "compilation" occurs during model training. Vast textual data is compressed into fixed neural parameters. Post-deployment, these parameters act like "compiled" code, setting fixed probabilistic boundaries on potential behaviors.

Interpretation in AI: Dynamic Runtime Decisions

However, runtime inferences from LLMs reveal an "interpreted" quality, characterized by:

    • Dynamic CoT (Chain-of-Thought) generated spontaneously
    • Adaptive path planning reacting to real-time feedback
    • Probabilistic decisions, allowing the same prompt to yield different outcomes

Thus, LLMs represent a hybrid computational paradigm, combining "probabilistic compilation" and "constrained interpretation"—leveraging pre-trained parameters while dynamically interpreting and adapting at runtime.
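
The "same prompt, different outcomes" point can be made concrete with a minimal sketch. It assumes a toy set of fixed next-action scores (the hypothetical FROZEN_LOGITS, standing in for trained parameters) and a plain temperature-sampling function; neither comes from any particular model or library.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, rng=None):
    """Sample a token from fixed logits; temperature 0 means greedy (argmax)."""
    rng = rng or random.Random()
    if temperature == 0:
        return max(logits, key=logits.get)
    # Softmax over temperature-scaled logits (subtract the max for stability).
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]

# Hypothetical "compiled" scores for the next action given one fixed prompt.
FROZEN_LOGITS = {"search": 2.0, "ask_user": 1.5, "answer": 1.0}

# Greedy decoding always picks the same action.
greedy = {sample_next_token(FROZEN_LOGITS, temperature=0) for _ in range(20)}

# Sampling with different seeds picks different actions from the same scores.
sampled = {
    sample_next_token(FROZEN_LOGITS, temperature=1.0, rng=random.Random(seed))
    for seed in range(50)
}
```

The scores never change between calls, which is the "compiled" part; the variation comes entirely from the runtime sampling step.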

Architectural Comparison

Compiled Agents: Reliability and Predictability

Unlike LLM-native agents, compiled agents follow strict, pre-defined workflows:

    • Clear, predetermined logic paths
    • Fixed decision branches
    • Limited context management
    • Deterministic results

Examples: ByteDance's Coze platform exemplifies this model. Users visually design the agentic logic via drag-and-drop workflows, ensuring consistency and reliability. Ideal for well-defined business automation tasks like RPA (Robotic Process Automation), compiled agents excel in repeatable, predictable operations.

Limitations: Rigidity and an inability to adapt dynamically. Any unforeseen change in the environment or input can disrupt the workflow, necessitating manual reconfiguration and/or retraining of the underlying models.
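
As a rough illustration of the compiled style, the sketch below hard-wires a two-step workflow at design time. The step names, the "amount=..." input format, and the approval threshold are all hypothetical; this is not how Coze or any specific platform is implemented.

```python
from typing import Callable

Step = Callable[[dict], dict]

def extract_invoice(state: dict) -> dict:
    # Hypothetical step: parse the amount from a pre-agreed "amount=<number>" format.
    return {**state, "amount": float(state["raw"].split("=")[1])}

def approve(state: dict) -> dict:
    # Fixed decision branch, chosen at design time (threshold is illustrative).
    return {**state, "approved": state["amount"] <= 1000.0}

# The whole plan is known before any input arrives.
WORKFLOW: list[Step] = [extract_invoice, approve]

def run_compiled_agent(state: dict) -> dict:
    for step in WORKFLOW:
        state = step(state)
    return state
```

Calling run_compiled_agent({"raw": "amount=250"}) gives a deterministic approval, while an input in an unforeseen format such as {"raw": "250 USD"} raises an error instead of being handled, which mirrors the rigidity described above.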

Interpreted Agents: Runtime Autonomy and Flexibility

Interpreted agents are LLM-native autonomous agents that dynamically formulate and revise their execution plans:

    • Goal-driven, high-level task definitions
    • Real-time strategic planning
    • Environmental awareness
    • Autonomous decision-making with dynamic tool selection

Examples: Manus and AutoGPT embody interpreted agents. AutoGPT autonomously breaks tasks into subtasks, sequentially executes them, adapts based on interim results, and maintains persistent memory states to handle complex, multi-step operations. Manus, employing a multi-agent collaborative framework, autonomously executes complex workflows—from data analysis to report generation—demonstrating a complete "idea-to-execution" loop.

Strengths: Highly adaptive, capable of handling diverse, unforeseen scenarios. Ideal for research, creative tasks, and personal assistance.

Challenges: Unpredictability, higher computational resources, potential security risks, and more intricate development and testing procedures.
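
By contrast, an interpreted agent decides its next step only after seeing the result of the previous one. The loop below is a minimal sketch of that idea: plan_next_action is a hand-written stand-in for an LLM call, and the two tools are hypothetical stubs, so nothing here reflects the actual internals of Manus or AutoGPT.

```python
def plan_next_action(goal, history):
    """Stand-in for an LLM call: choose the next tool from the current context."""
    if not history:
        return ("search", goal)
    last_tool, last_result = history[-1]
    if last_tool == "search" and not last_result:
        # Revise the plan after empty feedback from the environment.
        return ("search", goal + " overview")
    if last_tool == "search":
        return ("summarize", last_result)
    return ("finish", last_result)

# Hypothetical tools; the search stub only finds results for "overview" queries.
TOOLS = {
    "search": lambda q: ["doc about " + q] if "overview" in q else [],
    "summarize": lambda docs: "summary of %d document(s)" % len(docs),
}

def run_interpreted_agent(goal, max_steps=8):
    history = []  # (tool, result) pairs act as the agent's working memory
    for _ in range(max_steps):
        tool, arg = plan_next_action(goal, history)
        if tool == "finish":
            return arg, history
        history.append((tool, TOOLS[tool](arg)))
    return None, history  # step budget exhausted without finishing
```

For the goal "agents", the first search returns nothing, the planner retries with a revised query, and only then summarizes. The sequence of steps is not written down anywhere in advance; it emerges from the feedback at runtime.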

Interface Strategies: Universal vs. Specialized

Agent capabilities heavily depend on interaction modes with external environments:

    • Universal Interfaces (browser-like interactions) grant agents broad compatibility but face efficiency, reliability, and security issues.
    • Specialized Interfaces (API calls) offer speed, stability, and security but lack flexibility and require direct integration.

Strategically, agents leveraging specialized APIs can form more robust, defensible positions, avoiding easy internalization by LLM providers.

Future Directions and Challenges

Emerging Hybrid Architectures

Future agents will increasingly blend compiled reliability with interpreted adaptability, embedding runtime-flexible modules within structured workflows. Such hybrids combine precise business logic adherence with adaptive problem-solving capabilities.

Technical Innovations

Advances needed include:

    • Further enhanced runtime reasoning and self-reflection via RL (Reinforcement Learning) post-training to improve decision accuracy
    • Integrated multimodal perception (visual, auditory, tactile) for richer environmental understanding
    • Robust resource management and runtime environments supporting persistent, background-running interpreted agents

Societal and Ethical Considerations

Widespread agent deployment raises security, privacy, and ethical issues, demanding stringent governance, transparent operational oversight, and responsible AI guidelines.

Conclusion

Compiled and interpreted agents represent complementary, evolving paradigms. Their convergence into hybrid architectures is forming the backbone of a new, powerful LLM-native agent ecosystem. As this evolution unfolds, humans will increasingly delegate routine cognitive tasks to agents, focusing instead on strategic, creative, and emotionally intelligent roles, redefining human-AI collaboration.

In essence, the future of AI agents lies in balancing the precision and predictability of compilation with the flexibility and creativity of interpretation, forging an unprecedented path forward in human-technology synergy.

 

[Related]

The Three-Stage Scaling Laws of Large Language Models

Mr. Huang's background features three S-curves, illustrating the scaling relay race across three stages of large language models, demonstrating a persistent spirit akin to the Chinese fable of the legendary Old Man Who Moved Mountains.

We know that large language models have three stages: pre-training, post-training, and online inference. The biggest change in recent months is the community consensus, following Ilya Sutskever's claim, that the pre-training era has ended. The famous empirical scaling laws for pre-training appear to have plateaued. This has led to the rise of reasoning models (OpenAI's O series and DeepSeek's R series, among others), which emphasize investment in chain-of-thought (CoT) reinforcement learning during post-training and the use of online inference time (so-called "test-time compute"). These reasoning models have indeed demonstrated unprecedented achievements in mathematics, coding, and creative writing.

The scaling of post-training for reasoning models has just begun, and it's unclear how far it can go. But we can gradually see this trajectory from O1 evolving to O3, and from R1 to the reportedly soon-to-be-released R2 and their enhanced capabilities. What about the test time scaling in the final inference stage?

Recently, I spoke with my old friend Junlin, one of the earliest advocates for the three S-curves of scaling in China. I mentioned that I hadn't seen any real test time scaling because no one can control the model's test time compute—how much time/computing power it uses and when it completes assigned tasks is determined by the model itself, so test time doesn't seem "scalable." Junlin agreed that this is currently the case.

These past few days, while playing with large models' deep research capabilities, I've gradually experienced some possibilities for test time scaling. The answer is emerging. Fundamentally, it's about whether there's a curve showing that if you give a query or topic more thinking and response time, it performs better. Specifically, with O3-mini, there's a button called "deep research" that users can choose to use or not to use. Without it, your question still follows a chain of thought because you initially selected the reinforced O3 reasoning model. The process for reasoning models typically takes a minute or two. However, if you also press the deep research button, the final reasoning time is extended by several times, potentially lasting up to 10 minutes. This shows us that even with the same model, different inference times produce different results. This should count as a precursor of test time scaling.

How does it work? How can users invest different amounts of test time compute based on the difficulty or challenge of their topic and their tolerance for waiting time to generate different results for the same topic? It turns out it uses an agent-like approach. The functionality provided by the deep research button is essentially a research reasoning agent. Agents are an additional LLM-native feature that doesn't require changing the model—it changes the interaction method during the inference stage. Currently, this interaction is very simple, just one round, but this test time scaling direction is expected to continue exploring longer and more interactions with users to help maximize the effect of test time compute.

If test time compute scaling doesn't quickly hit bottlenecks, we can imagine future deep research interacting with users for extended periods to complete highly complex projects. Perhaps we're moving beyond minute-level reasoning time investments—we can entirely envision large models investing hours or even days to complete challenging tasks, such as projects that would take human researchers months or years, or completing research projects humans cannot accomplish. The current deep research is very simple—after receiving the user's prompt/query, it immediately breaks down the problem and asks the user five or six simple questions to confirm the required sources, breadth, depth, and considerations for the research. After receiving user feedback, the model begins accepting updated materials (if any) and uses search to collect more relevant information. Then, following the decomposed tasks and the plan confirmed with the user, it analyzes each source and finally synthesizes everything into a research report. This naturally extends the required reasoning time because the task is no longer singular, and the materials aren't limited to knowledge already digested within the model but include more sources searched in real-time—processing all this takes time.

For both reinforcement learning in the post-training stage of reasoning models and the investment in test time compute during the inference stage, the scaling journey has just begun. Let's hope these two S-curves can continue to rise steadily for some time, allowing the scaling relay race to help us progress continuously on the path toward artificial general intelligence (AGI) and eventually artificial superintelligence (ASI).

 

[Related]

The Scaling-Law Relay Race across the Three Stages of Large Models

Zhang Junlin: Looking at Scaling Laws through DeepSeek R1

 

Does the New Reasoning Paradigm (Query+CoT+Answer) Support a New Scaling Law?

— Reflections on LLM Scaling Laws and DeepSeek's R1

My friend Zhang Junlin's article "Looking at the Future of Scaling Laws through DeepSeek R1" has sparked interesting discussions among peers.

Core Insights from Initial Discussions

Professor Bai summarised the key highlights as follows:

    • Infinite stacking won't lead to infinite growth (physical laws don't support this)
    • Only S-shaped growth is possible, with diminishing returns inevitably appearing
    • The initial emergence of language capabilities relates to the density of linguistic knowledge in training data
    • The next growth phase represents a second S-curve, driven by common sense knowledge, which requires more computing power due to lower knowledge density
    • The third phase involves learning logical reasoning (Chain of Thought), where natural data has even lower density of such knowledge. Brute-force mining with computing power becomes inefficient, making reinforcement learning with synthetic data a more rational approach

As Dr. Lu points out: The term "Scaling Law" is becoming overloaded. While S-curves (nonlinear curves characterized by sigmoid functions) can describe technology adoption lifecycles, they typically occur in succession (one technology hits its ceiling, making way for another). Large language models' multiple "Scaling Laws" confirm this pattern, with some overlap between Test-Time and Post-Training "Scaling Laws".
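
To make the S-curve picture concrete, here is a minimal sketch of a logistic curve and of three such curves stacked in succession. The x-axis can be read as resource investment on some log-like scale; the ceilings, midpoints, and steepness are purely illustrative assumptions, not fitted to any real model.

```python
import math

def s_curve(x, ceiling, midpoint, steepness=1.0):
    """Logistic curve: slow start, rapid growth, then saturation at `ceiling`."""
    return ceiling / (1.0 + math.exp(-steepness * (x - midpoint)))

def stacked_capability(x):
    # Illustrative parameters only: three successive curves, each taking off later.
    stages = [(1.0, 2.0), (1.0, 6.0), (1.0, 10.0)]  # (ceiling, midpoint)
    return sum(s_curve(x, ceiling, midpoint) for ceiling, midpoint in stages)

# Diminishing returns on a single curve: the same extra investment buys less
# once the curve has passed its midpoint.
early_gain = s_curve(3, 1.0, 2.0) - s_curve(2, 1.0, 2.0)
late_gain = s_curve(5, 1.0, 2.0) - s_curve(4, 1.0, 2.0)
```

Here late_gain is smaller than early_gain, while stacked_capability keeps rising because a later curve takes over as the earlier one saturates, which is the "relay" pattern the discussion describes.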

The Nature of LLM Scaling

Let's examine the fundamental logic behind LLM scaling. First, it's crucial to understand that LLMs are not databases - they don't aim to memorize long-tail data details. Large model training essentially compresses big data, or more precisely, compresses the knowledge systems behind the data (including common sense and encyclopedic knowledge), focusing on capturing patterns and regularities of various kinds (what we call generalizations).

Conventional intuition suggests that as data scale increases, redundancy increases too. Regardless of filtering, cleaning, and deduplication, growing redundancy seems to imply diminishing returns. So why do large models still appear "hungry" even at the unprecedented scale of hundreds of billions of tokens? Why does the scaling law remain effective from hundreds of billions to trillions of tokens?

The key lies in LLMs being sequence learning and sequence decoding systems. While sequences are one-dimensional, the patterns and regularities behind them are high-dimensional. For instance, even a simple sequence like "cat chases mouse" potentially involves multiple knowledge dimensions: species relationships, predatory behavior, spatial movement, actor-patient roles, etc. This multi-dimensional knowledge naturally leads to combinatorial explosion at the sequence level as information is flattened in language. The insatiable "appetite" for big data is effectively a response to this combinatorial explosion. As long as there isn't complete information redundancy, additional diverse sequences will help models abstract data patterns more precisely.
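
A toy calculation shows how quickly flattening multiplies surface forms. The four "knowledge dimensions" and their values below are made up for illustration; real linguistic dimensions are far more numerous and not independent.

```python
from itertools import product

# Hypothetical toy "knowledge dimensions", each with a few possible values.
dimensions = {
    "agent": ["cat", "dog", "owl"],
    "action": ["chases", "watches", "avoids"],
    "patient": ["mouse", "bird", "frog"],
    "place": ["indoors", "outdoors"],
}

# Flattening into one-dimensional sequences multiplies the possibilities.
sequences = [" ".join(combo) for combo in product(*dimensions.values())]

# The underlying knowledge is small: just the values along each dimension.
values_seen = sum(len(options) for options in dimensions.values())
```

Eleven underlying values already yield 54 distinct sequences (3 x 3 x 3 x 2). Each added dimension multiplies the count, so a model that only sees flattened sequences needs many of them to recover the compact structure behind them.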

The Two vs. Three S-curves Debate

Zhang Junlin observes that since OpenAI's O1, two other phases have gained recognition with their own Scaling Laws: the reinforcement learning Scaling Law (RL Scaling Law) for post-training, and the Inference Scaling Law (also called Test Time Scaling Law).

This raises a crucial question: Are there really three S-curves, or just two? How comparable is the reasoning model's S-curve to the pre-training S-curve?

While theoretically we can identify three phases:

    • Pre-training
    • Post-training (especially reasoning-focused reinforcement learning)
    • Inference phase

In practice, post-training and inference phases likely share a single S-curve; there aren't two independent growth curves.

DeepSeek R1's Insights: The Truth About "Slow Thinking"

Consider DeepSeek R1: users can activate "deepthink" mode to enable Chain-of-Thought (CoT) reasoning, but they can't actually control reasoning quality by increasing computation time. Why is this?

Let's examine a concrete example. When R1 solves a complex mathematical problem:

Traditional models might directly answer: "The result is 42"

R1 shows detailed reasoning: "Let's think step by step: 1) First consider... 2) Then we can... 3) Finally, we get 42"

While R1's response appears to demonstrate "slow thinking" (CoT), this reasoning process actually reflects a generation pattern fixed during training, not dynamic exploration of multiple potential reasoning paths during response time. In other words, CoT+answer might look like "slow thinking," but it doesn't fundamentally change the unidirectional next-token prediction paradigm. R1's CoT+answer creates an illusion of slow thinking, but the generative nature remains fundamentally the GPT "fast thinking" paradigm. At test time, unlike AlphaGo, the depth and scale of thinking aren't dynamically explored, though beam search, if applied, can provide implicit multi-path optimization internally.

Test Time Compute Constraints

The industry buzzword "test time compute" refers to reasoning models requiring more online computational resources than traditional non-reasoning models. For example, R1 with CoT enabled might need several times more computation time than its base model V3 for the same problem. However, this increased computation results from behavior patterns acquired during training, not dynamically adjustable compute investment. Without controllable scalability in test time compute, we can't really talk about a test time scaling law.

A major difference between pre-training and CoT reinforcement learning lies here: pre-training scaling laws can remain stable long-term because once training completes, it doesn't significantly impact online response time - the generation mode remains a simple query+answer. Therefore, offline training for months is acceptable if the resulting model shows significant capability improvements. However, reasoning models' post-training CoT reinforcement learning differs - it cultivates models' habits of responding with slow thinking, changing the generation mode to query+CoT+answer. Extending the CoT isn't just about the cost of training resources and time; more critically, it shows up as extended test time compute for each query during deployment, severely delaying system response time. Users generally have limited tolerance for slow-thinking computation time and delays when using an online system.
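
The latency argument reduces to simple arithmetic, sketched below under stated assumptions: every generated token costs one decoding step, and the token counts and decoding speed are illustrative figures, not measurements of R1 or V3.

```python
def response_seconds(answer_tokens, cot_tokens=0, tokens_per_second=50.0):
    """Decoding time when every generated token costs one decoding step."""
    return (cot_tokens + answer_tokens) / tokens_per_second

# query+answer: only the answer is generated (assumed 200 tokens).
direct = response_seconds(answer_tokens=200)

# query+CoT+answer: the same answer preceded by an assumed 3,000-token CoT.
with_cot = response_seconds(answer_tokens=200, cot_tokens=3000)

slowdown = with_cot / direct
```

With these assumed numbers the user waits 64 seconds instead of 4, a 16x slowdown paid on every query. Training cost, by contrast, is paid once offline, which is why the two kinds of scaling face different practical limits.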

The Sustainability Debate

OpenAI's Sam Altman and Anthropic's Dario Amodei might argue that for extremely complex problems (like proving the Riemann hypothesis or designing next-generation aerospace vehicles), even if a model needs a week of computation time, it's still a massive improvement over human teams requiring decades. However, this argument has two issues:

    • LLM feasibility for such super-complex problems remains far from validated
    • Extreme scenarios lack universality and can't serve as data points for sustainable scaling laws

This isn't to deny S-curves as effective models for describing scaling laws, nor to reject the rationality of S-curve stacking. The combination of pre-training and post-training growth curves (s1 and s2) might indeed reflect the overall relationship between resource investment and performance improvement. However, we should carefully examine whether CoT reasoning truly opens a sustainable scaling curve.

Conclusion: How Far Is the LLM Road to AGI?

If reasoning models' scaling laws lack sustainability, this raises a deeper question: Can we reach the promised land of Artificial General Intelligence (AGI) through these two scaling laws alone? Furthermore, is the technical ideal of Artificial Super Intelligence (ASI) - AI replacing human labor and dramatically improving productivity - truly feasible?

Current evidence suggests that while pre-training scaling laws have shown considerable sustainability, reasoning models' scaling laws may quickly hit practical constraints. This reminds us that the path to AGI/ASI likely requires more innovative breakthroughs, not just simple extrapolation of existing methods. In the next phase of artificial intelligence development, we might need to discover entirely new growth curves.


 

[Related]

Zhang Junlin: Looking at Scaling Laws through DeepSeek R1

Technical Deep Dive: Understanding DeepSeek R1's Reasoning Mechanism in Production

A detailed analysis of how DeepSeek R1's inference mechanism works in production, and how it differs from training-time reinforcement learning.

Training vs. Deployment: Key Questions

1. Training Phase (GRPO): Does the reinforcement learning mechanism generate multiple candidate CoT+answer sequences to optimize the policy and cultivate "slow thinking" habits?

- The answer is definitively yes.

2. Deployment Phase: Does R1 implicitly generate multiple paths during inference but only display one? If so, how does this mechanism compare to traditional ensemble methods?

3. Comparison with AlphaGo's MCTS: How does R1's mechanism fundamentally differ from Monte Carlo Tree Search?

1. Inference Mechanism in Production

DeepSeek R1's real-time reasoning can be characterized by two modes:

A. Implicit Multi-path Generation and Selection

- Generation: The model may implicitly generate multiple potential reasoning paths (CoT+Answers) during a single inference but outputs only one.

- Technical Implementation: Through decoding strategies (e.g., beam width adjustment), the model maintains multiple candidate sequences, ultimately selecting the highest-scoring path.

- User Experience: Users see only the final output, though internal multi-path exploration occurs.

- Efficiency Trade-off: With beam_width=1 (greedy search), the model follows a single path for the fastest response; increasing the beam width can improve quality at the cost of latency.
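The trade-off between a single path and multiple candidate paths can be illustrated with a toy beam search. This is only a sketch: `TOY_MODEL` is a made-up table of next-token probabilities, and nothing here reflects DeepSeek's actual decoder.

```python
import math

# Hypothetical toy "model": next-token probabilities given the prefix so far.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"z": 0.9, "w": 0.1},
}

def beam_search(model, beam_width, max_steps=2):
    """Keep the `beam_width` highest-scoring prefixes at each step."""
    beams = [((), 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(max_steps):
        candidates = []
        for seq, score in beams:
            next_probs = model.get(seq)
            if next_probs is None:
                candidates.append((seq, score))  # finished sequence, carry forward
                continue
            for token, p in next_probs.items():
                candidates.append((seq + (token,), score + math.log(p)))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

greedy = beam_search(TOY_MODEL, beam_width=1)[0]
wide = beam_search(TOY_MODEL, beam_width=2)[0]
print("beam_width=1:", greedy[0], round(math.exp(greedy[1]), 2))
print("beam_width=2:", wide[0], round(math.exp(wide[1]), 2))
```

With a width of 1, the search commits to "A" early and ends on a path with probability 0.30. With a width of 2, it keeps "B" alive and finds the path "B z" with probability 0.36. Returning the whole `beams` list instead of its first element is the toy equivalent of asking for several candidates at once.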

B. Explicit Multiple Candidate Generation (Optional)

- API Control: The num_return_sequences parameter allows explicit generation of multiple candidates.

- Practical Application: While not enabled by default in the DeepSeek App, this functionality may be available through enterprise APIs or open-source implementations.

2. Training Phase: Cultivating "Slow Thinking"

A. Role of Reinforcement Learning

- Objective: GRPO algorithm trains the model to generate more detailed, logical reasoning steps (longer CoT) to maximize rewards.

- Mechanism: Training generates multiple candidate answers, with rewards evaluating both answer correctness and format correctness.
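The "group" in GRPO refers to scoring each candidate relative to the other candidates sampled for the same prompt. Below is a minimal sketch of that normalization step, assuming a hypothetical rule-based reward (`combined_reward`, with an illustrative `format_weight`) and the population standard deviation; it leaves out the policy-gradient update entirely.

```python
import statistics

def combined_reward(answer_correct, format_ok, format_weight=0.5):
    """Hypothetical rule-based reward: correctness plus a smaller format bonus."""
    return (1.0 if answer_correct else 0.0) + (format_weight if format_ok else 0.0)

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group: (r - mean) / std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled CoT+answer candidates for one prompt: (answer_correct, format_ok)
group = [(True, True), (True, False), (False, True), (False, False)]
rewards = [combined_reward(c, f) for c, f in group]
advantages = group_relative_advantages(rewards)
for (correct, fmt), adv in zip(group, advantages):
    print(f"correct={correct!s:5} format_ok={fmt!s:5} advantage={adv:+.3f}")
```

Candidates that beat the group average get a positive advantage and are reinforced; those below it are discouraged. No separate value model is needed, because the group itself serves as the baseline.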

B. Driving Forces Behind CoT Growth

- Reward Design: Longer CoTs naturally emerge when they lead to better answers.

- Data Feedback: High-quality SFT data generated through rejection sampling enhances this pattern.
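Rejection sampling in this sense simply means sampling several candidates and keeping only the ones that pass a check. Here is a minimal sketch; `toy_generate` and the acceptance rule are hypothetical stand-ins for a real model and a real verifier.

```python
import random

def rejection_sample(generate, accept, num_candidates, seed=0):
    """Sample `num_candidates` outputs and keep only those that `accept` approves."""
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(num_candidates)]
    return [c for c in candidates if accept(c)]

def toy_generate(rng):
    """Hypothetical sampler: returns (reasoning, answer) for the prompt '17 + 25'."""
    answer = rng.choice([42, 42, 41, 32])  # right about half the time
    return (f"17 + 25 = {answer}", answer)

# Keep only candidates whose final answer is verifiably correct.
kept = rejection_sample(toy_generate, lambda c: c[1] == 42, num_candidates=16)
print(f"kept {len(kept)} of 16 candidates as SFT data")
```

The surviving (reasoning, answer) pairs are what would feed back into supervised fine-tuning.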

3. Comparison with Ensemble Methods

Similarities

- Multi-path generation conceptually similar to ensemble predictions

- Result filtering comparable to voting/weighted averaging

Key Differences

R1's implicit multi-path generation is fundamentally a dynamic decoding strategy within a single model, distinct from traditional ensemble's static combination of multiple models.

4. Fundamental Distinction from AlphaGo's MCTS

AlphaGo's MCTS

- Dynamic Planning: Builds search trees through simulation

- Online Learning: Adjusts search strategy based on real-time feedback

R1's Implicit Multi-path Generation

- Static Model: Fixed parameters during deployment

- No Reward Modeling: Path selection based on model probability rather than cumulative rewards

Key Insights

1. Training phase GRPO cultivates detailed CoT capabilities for effective single-pass inference.

2. Deployment allows flexible trade-off between single-path (for speed) and multi-path (for quality) generation.

3. While model parameters are fixed post-training, decoding strategies offer some runtime flexibility.

4. R1's multi-path generation fundamentally differs from both traditional ensembles and MCTS-style dynamic planning.

This architecture achieves a practical balance between efficiency and effectiveness for large-scale industrial applications, though it sacrifices some dynamic planning and global optimization capabilities.

#ArtificialIntelligence #MachineLearning #DeepLearning #LLM #DeepSeek

Related:

The Turbulent Second Chapter of Large Language Models: Has Scaling Stalled?

Guangmi's recent Chinese-language podcast, a quarterly report on large language models that discusses the "scaling paradigm shift" toward AGI (Artificial General Intelligence), is well worth a listen. It touches on many key topics related to the AI industry landscape, offering a unique perspective and style.

The term "paradigm shift" may sound a bit dramatic, but as a seasoned analyst, Guangmi uses it to describe the current turbulent landscape accurately. While the AI arms race among industry giants is still in full swing, real-world scalable applications of these models are struggling to materialize. The question of how to justify investments has become a significant pressure point, or perhaps even a looming bubble.

Let's revisit some AI basics. There are three main types of learning in LLMs (Large Language Models):

(i) supervised learning;
(ii) unsupervised learning (self-learning/pre-training); and
(iii) reinforcement learning (RL, self-play/post-training).

Ilya has emphasized the importance of RL in exploring new directions for LLMs. Guangmi's podcast highlights RL as the pathway to the paradigm shift in AGI through large models.

Historically, two key milestones in RL have stood out: AlphaGo's victory over top human Go players, which shocked the world, and RLHF (Reinforcement Learning from Human Feedback), which aligned models with human preferences and paved the way for ChatGPT’s explosive growth.

Currently, discussions revolve around the potential of a new RL-driven ecosystem for large models (though there's no broad consensus—it's primarily a conversation within small Silicon Valley circles) and the emerging trends in the "arms race" of large models. Here’s the context:

1. Pre-training scaling seems to have hit a bottleneck, with GPT-5 still unreleased;
2. The overall momentum of the arms race remains unchanged among the major players (the billionaire clubs/giants);
3. Key tech figures are proposing new roadmaps or trying to construct new scaling laws to continue the AGI journey.

Guangmi closely monitors trends in Silicon Valley. His small team conducts in-depth research in the Bay Area and has established extensive contacts. Having chatted with them over coffee a couple of times, I’ve found them to be a dynamic, young team under his leadership—a small but sharp presence.

Guangmi’s thoughts are well-structured, and his breadth of knowledge and understanding of the larger context are impressive. This is no small feat, as the landscape of large models, both in terms of the models themselves and the industry, is often akin to the parable of the blind men and the elephant. Even top experts and business leaders struggle to assess the full picture. Just recently, Meta’s Zuckerberg responded to a question about whether the AI arms race would deliver the expected AGI returns, essentially saying: “No one really knows, but we can’t afford to miss out,” reflecting a typical FOMO (Fear Of Missing Out) mindset.

We’re currently in a delicate phase with little consensus. However, the few tech giants that have propelled Nvidia’s stock to astronomical levels won’t allow the arms race to slow anytime soon, as it is central to their tech and business dominance. OpenAI continues to raise funds, and Ilya, with his new company, recently secured more investment, all of which keeps the race heated.

At the same time, the obsession with scaling among tech elites and the mainstream AGI circles in Silicon Valley persists. The endless demand for resources driven by this scaling wave of large models means that only a small circle of tech insiders has the opportunity and resources to experiment, sense, and adjust the roadmap.

According to Guangmi, the so-called self-play RL scaling is currently gaining traction within a small circle of about 200 tech elites in Silicon Valley, indicating that this is still a nascent trend—one that even management leaders have not fully aligned with yet.

It seems Guangmi adopts a “prophet” mentality at times, perhaps exaggerating this trend to alert his audience. He even suggests that if he were a large-model entrepreneur, he would focus 200% of resources on RL, betting on it as the future path to victory.

In reality, for most people, this advice is neither practical nor actionable—it’s likely aimed at tech giants or unicorns, though even for them, it may fall on deaf ears.

Reinforcement learning is inherently challenging. Even the open-source leader, Meta's Llama 3, has chosen to sidestep RLHF in post-training alignment. So it's even less realistic to expect large-model teams to bet fully on RL as the core of a new ecosystem. Furthermore, this trend is, at best, a “subtle undercurrent” in Silicon Valley. We’ll likely have to wait until OpenAI’s “Strawberry” or the new version of Claude is released later this year to fully assess its impact.

It seems the first chapter of LLM scaling has indeed come to an end. The actionable items in the so-called second chapter might not emerge from lofty, exploratory scaling directions with an uncertain roadmap. Instead, the focus should be on finding market entry points, accelerating applications, and addressing genuine market needs (PMF, product-market fit), especially as the inference costs of top models like GPT-4o/Claude 3.5 become more affordable, and multimodal capabilities (such as advancements in hyper-realistic full-duplex voice and video) further enhance application opportunities.

For the industry, the bottleneck in scaling large-model applications is the sword hanging over its future. This will determine whether the second chapter of the tech adoption curve ends with a soft landing and eventual recovery. As for the arms race, it’s best to leave that to Elon Musk, Zuckerberg, and the billionaire club to continue playing.

Reinforcement learning, as an extension of pre-training, belongs to the realm of “post-training.” When pre-training hits bottlenecks and diminishing returns, strengthening RL is a natural complement. In the simulation of human cognition, pre-training represents the accumulated knowledge of human civilization, while RL applies that knowledge in practice, learning from the environment. This overall approach to intelligent learning makes perfect sense and is the necessary direction for applying large models.

My old friend Lu said: “It’s intuitive that RL is the path we must take because there isn’t enough supervised learning data anymore.”

Indeed, utilizing regenerated data to varying degrees has become common practice. It’s inevitable. Models can already generate data of higher quality than humans, and this will only improve. However, this is not the same as self-play's proactive exploration and data regeneration.

As Mr. Mao pointed out: “RL aligns with the cognitive processes of humans and epistemology. It’s essentially the process of receiving external feedback and being tested in practice. RL is active learning, while training is passive.”

Guangmi's RL paradigm shift suggestion still lacks the necessary catalysts. But this potential trend is worth keeping in mind. It’s best to remain cautiously optimistic and open-minded while watching how things unfold.

 

Related original:

大模型风云诡谲的下半场:scaling 失效?

Decoupling to Resolve: Issue of Character Consistency in Video Generation

I’ve now become the go-to expert for AIGC (AI-generated content) "custom services" among my old friends and classmates, just for fun. Below are nostalgic videos made from old photos that two of my classmates asked me to create.

Whenever I find the time, I’m more than happy to provide this kind of emotional value for friends and family because it’s truly satisfying to see their reactions of surprise.

The pianist is now a world-class piano master, frequently touring and performing in Europe, America, and China. These are precious old photos of him practicing and performing with our mutual friend, Brother Sun, in Philadelphia back in the early days.

Dr. Bai Shuo, a seasoned expert in NLP and a multi-talented musician, commented humorously: “Looks real for someone who pulls on the bow in  Meditation as named, but the bowing and fingering are all wrong.”

Another old friend also left feedback noting that the visual model doesn’t understand music: "This needs improvement! It's obvious that the model was created by someone who doesn’t know how to play the violin or piano. The bowing and piano accompaniment are off. The first note has a two-and-a-half beat long tone, which should be played with a long bow. Additionally, the pianist’s right foot should never be raised or shaking like that—it should be on the sustain pedal.”

LOL

Even though the music's name Meditation was clearly specified in my prompt during generation, there is no model, in the foreseeable future, that can truly align the understanding of music with the intricate details of bodily movements during performance. Perhaps this can be reserved as one of the ultimate challenges for large models aiming for AGI, because theoretically, if enough alignment data of musical performance is available, based on the compression theory of "joint training", it’s possible to aim at perfect alignment across different modalities.

If simulating the objective world is the ultimate goal of visual models, then the current generation of visual models is at the level of “playing the piano to a cow” or “playing music to a tone-deaf audience”—completely unable to withstand scrutiny from musicians. For example, as someone with little musical knowledge, when I watch the nostalgic performance videos above, I wouldn’t notice the flaws as an expert would; instead, I find them vivid and emotionally engaging.

Of course, the standards of musicians might as well just be a "pseudo-demand" or a pseudo-goal (even if the visuals satisfy the picky “expert eye,” so what? Will it sell well?). It might not be worth the effort to pursue this. However, in theory, an ideal AGI should be capable of meeting these expert-level demands.

This is the challenge of musical performance alignment.  Another challenge to Sora-like video generation models is character consistency in videos.

Achieving facial consistency in generative visual models is extremely difficult. Don’t expect this issue to be resolved by video generation models alone in the short term, especially not through autoregressive methods.

Human eyes are extremely discerning when it comes to face recognition, especially with the familiar faces of friends and family; you can immediately tell when a character's appearance is off. For example, while playing with old photos recently, I used the KeLing model (a top-notch video model in China) to generate a video of myself. At the 5-second mark, it still looked passable, but by 10 seconds, it no longer resembled me.

In the second 10-second video, just a slight turn of the head, and it’s no longer me—it looks more like my brother. How can a model handle such fine details? Especially when the starting image for video generation is not even a straightforward frontal shot, making the character information incomplete—how could it not go off track?

While the videos I've made for friends and family using KeLing during its public testing phase have generally been met with passionate surprise and amazement, most of them suffer from this issue of character consistency, which is a regret.

The current one-click video generation products on the market (including our own recently launched YuanChuang Island) tend to rely mainly on anime or manga styles. This is to avoid user scrutiny, since these styles lack the distinct 3D characteristics of individual faces. As long as attire is consistent, there are no gender mix-ups, and age and race are aligned, most people will accept the result. The current one-click videos are generally rough, with entertainment value primarily in the story rather than in character portrayal akin to a Hollywood blockbuster. However, as this path progresses, it will inevitably encounter the challenge of maintaining the consistency of digital IP actors and their roles.

My colleague, Lu, mentioned, "the consistency issue might require cross-checking from multiple video angles, which more or less touches on the core issue of whether modeling is necessary."

Indeed, some form of cross-checking is required, not just monotonic correction over time/sequence; that is the key. There’s a need to decouple or separate the character's image from the storyline, rather than generating in a linear, one-way path. While sequence learning has indeed produced miracles in LLMs, sequence generation inherently has limitations, including random deviations over time. The problem is not as extreme as LeCun's criticism suggests, in which GPT's error accumulation means a tiny discrepancy leads to a significant miss. His claim isn't entirely accurate, because GPT's autoregressive operation also corrects and adjusts its course at every step in the context. Nevertheless, when it comes to fine-grained consistency, random deviations are almost impossible to handle, even with corrective mechanisms in place.

Hence decoupling, decoupling, decoupling! Decoupling can solve the problem. The world isn't limited to sequences. Beyond sequences and time, there is a constant abstraction (i.e., character image, or IP) that can be utilized. This is becoming increasingly clear. Take, for example, the digital IP character Maria (Xiao Ya) that I created using AIGC txt2img more than two years ago:

For viewers who aren't fans, my numerous Maria videos might cause aesthetic fatigue; someone even called her “Dr. Li's fairy” (LOL). But indeed, there are fans; several of my old classmates are among them.

Why? Because she is an IP, and she has been decoupled.

 

Related Links (original posts in Chinese):

视觉模型生成的极限对齐

解耦才能解套:再谈视频中的人物一致性问题

 

Professor Ma Claims to Have Fully Unveiled the Mysteries of Neural Networks

Professor Yi Ma’s white-box transformer paper is available here.

Professor Ma is a prominent figure, renowned for his distinctive style and leadership in the field. His name is widely recognized and respected. Of particular interest recently are his critiques of mainstream large models and the bold claims he has made about his own work (see his post in Chinese below).

Recently, at a conference in Shenzhen (where I also gave a talk), Professor Ma sharply criticized mainstream large models, Ilya, and Kolmogorov complexity theory, dismissing them as being on the level of high school students and claiming that they lack a true understanding of theoretical concepts. He asserted that he has achieved breakthroughs in both theory and practice, particularly with the white-box Transformer developed by his team. According to him, this model not only demystifies the complexity of large models but also offers an engineering-feasible alternative.

When someone speaks with such confidence, it usually indicates genuine expertise and a commanding presence. Just as Yann LeCun in the U.S. criticized GPT as being inferior to a dog and called it a dead end, proposing his world model as an alternative, China has Professor Ma. Their critiques balance the global discourse, making the world feel less excluding. There is indeed hope that their work might address the "slow thinking" and "interpretability" shortcomings of current mainstream large models and contribute to the overall advancement of AI. Professor Ma’s academic and practical work deserves close study, though we may have to wait for time and peer reviews to fully test and validate their findings.

At the Shenzhen conference, after delivering his talk and sharp critiques, Professor Ma left immediately, likely due to his busy schedule.

The paper is over 100 pages long and is said to be released in a few days. Based on the current outline, the key points are as follows:

Overall, CRATE is similar to a transformer, with two differences:

- In each attention head, the Q, K, and V weight matrices are tied, i.e., set to be equal.
- The nonlinearity following each attention layer is no longer a multi-layer perceptron (MLP) but rather a more structured operator (ISTA) with sparse outputs.
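To show what "tied" means in the first difference, here is a minimal sketch of one attention head in which a single weight matrix plays the role of Q, K, and V. The token values and `shared_weight` are illustrative assumptions, and this deliberately omits the rest of CRATE's attention block; it only demonstrates the weight tying.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def tied_attention_head(tokens, shared_weight):
    """One attention head in which Q, K, and V all use the same projection."""
    projected = [matvec(shared_weight, t) for t in tokens]  # Q = K = V
    scale = math.sqrt(len(projected[0]))
    outputs = []
    for query in projected:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in projected]
        weights = softmax(scores)
        # Weighted sum of the (shared) value vectors.
        mixed = [sum(w * value[d] for w, value in zip(weights, projected))
                 for d in range(len(query))]
        outputs.append(mixed)
    return outputs

tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
shared_weight = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]  # projects 3-d tokens to 2-d
out = tied_attention_head(tokens, shared_weight)
for row in out:
    print([round(v, 3) for v in row])
```

A standard head would learn three separate matrices; tying them cuts the parameters per head to a third and forces queries, keys, and values to live in the same subspace.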

Let's examine ISTA (Iterative Soft-Thresholding Algorithm), a widely used algorithm for solving sparse optimization problems in machine learning. In Professor Ma's CRATE architecture, ISTA replaces the traditional MLP in Transformers. Not long ago, KAN (Kolmogorov-Arnold Networks) also introduced innovations aimed at replacing the MLP. Both approaches amount to surgery within the Transformer architecture.

In my understanding, ISTA and KAN (for Science/Physics) share a common goal: through regularization or pruning, they ultimately fit a sparse path, thus achieving interpretability.

How it works

ISTA iteratively approaches the optimal solution of a problem. Each iteration involves two steps: a) a gradient descent step, which aligns with mainstream methods; and b) a soft-thresholding operation. This operation is added to balance two objectives:

a) Maximizing model accuracy;
b) Achieving model sparsity, i.e., simplicity (as overly complex models are difficult for humans to interpret).

The soft-thresholding operation encourages internal elements to become zero, resulting in sparse outputs and increased interpretability. The weight-tied attention mechanism, combined with ISTA, promotes a deeper understanding of the input data structure, resembling a human-like structured analysis process that prioritizes key elements while regularizing the data.
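The two steps can be sketched in a few lines. This is textbook ISTA applied to a tiny sparse least-squares problem, not the ISTA block inside CRATE; the matrix `A`, target `y`, penalty `lam`, and step size are illustrative assumptions.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero; anything within `threshold` of zero becomes 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def ista(A, y, lam, step, iterations=500):
    """Minimize 0.5 * ||A x - y||^2 + lam * ||x||_1 by iterative soft-thresholding."""
    n = len(A[0])
    x = [0.0] * n
    for _ in range(iterations):
        # (a) Gradient step on the data-fit term: gradient = A^T (A x - y).
        residual = [sum(a * xi for a, xi in zip(row, x)) - yi for row, yi in zip(A, y)]
        grad = [sum(A[i][j] * residual[i] for i in range(len(A))) for j in range(n)]
        x = [xi - step * g for xi, g in zip(x, grad)]
        # (b) Soft-thresholding step: pushes small coefficients to exactly zero.
        x = [soft_threshold(xi, step * lam) for xi in x]
    return x

A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [3.0, 0.05, -2.0]
x = ista(A, y, lam=0.1, step=0.5)
print(x)
```

With this identity matrix the result should come out close to [2.9, 0.0, -1.9]: the two large coefficients are kept (slightly shrunk), while the small one is set to exactly zero. That exact zero is the sparsity the text describes. For a general `A`, the step size needs to be no larger than the reciprocal of the largest eigenvalue of A^T A for the iteration to converge.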

Professor Ma claims that these two modifications naturally lead the model to learn the interpretability associated with human-like structuring and sparsity during supervised learning (and, as he later claimed, the approach was also applied successfully to self-supervised learning).

For example, in image recognition, it was observed that certain attention heads correspond to different parts of animals. What's more remarkable is that this correspondence remains consistent across different animals and even different categories of animals. For instance, an attention head focused on the "head" consistently pays attention to the head area when processing different kinds of animals. This consistency suggests that CRATE has learned a general representation of visual features across categories.

However, those studying LLM interpretability have long since discovered that at the end of MLP networks, various structured components (such as heads and feet) are also captured by attention mechanisms. Without this, it would be difficult to explain the generalization (or compression) capabilities exhibited by LLMs. The challenge lies in the early stages of the MLP network, where attention is more mixed, and mainstream researchers struggle to clarify what the attention heads are focusing on. It seems that they are vaguely paying attention to the relationships between basic elements like pixels/dots and lines.

The core idea behind explainable AI is consistent: transforming the tangled, black-box, multi-layer network's internal data fitting paths into structured paths that are enabled with various constraints and pruning, leading to a sparse representation.

Who wouldn’t want a model to be interpretable? However, achieving sparsity and simplicity is extremely challenging, which is why, so far, these approaches have struggled to compete with the black-box methods that involve randomness.

Professor Ma’s confidence stems from the fact that, in the past six months to a year, he has begun to train models using the explainable white-box methods mentioned above, achieving results comparable to traditional transformers. At the Shenzhen conference, he mentioned that while he had always been confident that this was the correct approach, he remained cautious until results were obtained. Now, he is satisfied enough with his cross-national team’s results to announce to the world that he has found a breakthrough in both theory and practice: the correct method for white-boxing transformers, which could lead to a paradigm shift in deep learning. This has made him both excited and confident. Therefore, he is no longer content with academic theoretical achievements alone; he feels compelled to take action in industry as well. Professor Ma has recently founded a company to advance this work on an engineering level. At Shenzhen, he announced a directionally significant project challenging the mainstream, for the first time under the banner of his new company.

However, based on my years of NLP experience and intuition, I must point out a challenge (or potential issue): human interpretability is built on a highly simplified finite set. If we consider symbolic features, a feature system with more than a few thousand elements becomes incomprehensible to humans. The number of parameters in transformers and the number of Q/K/V matrices across attention heads, on the other hand, are on a completely different scale. Reducing complexity of that scale to something humans can follow seems almost unimaginable.

KAN for Science succeeded because their target was extremely narrow—certain existing symbolic formulas in physics or potential formulas limited to a few parameters. With such a goal, pruning, along with scientist intervention or feedback, allowed KAN to claim interpretability.

Regardless, Professor Ma seems confident, so we will watch how his methods and results evolve, and whether they come to be accepted.

 

 

Related Links:

What did Ilya see? -- secret behind success of LLMs

 

The Challenge of Character Consistency in Video Generation

Facial recognition in the vast world of AI is a specialized and challenging task, as human eyes are exceptionally sensitive to facial features. Because facial recognition is so specialized and sensitive, it presents a much greater challenge than traditional image recognition tasks, like identifying animal types. Consequently, this field achieved breakthroughs earlier than others: even before the advent of contemporary large models such as GPTs, deep neural network-based facial recognition, powered by extensive datasets of facial images, had already surpassed human visual capabilities and sensitivity. It became widely adopted, leading to the rise of unicorns in the pre-large model era.

Now, as we transition to universal video foundation models that aim to handle all objects in the world, whether it's Sora or Keling, maintaining facial consistency remains a significant challenge. The public has little access to Sora, but by examining similar leading visual models like Keling, we can perceive its limitations. Typically, after about half a minute, the generated faces start to diverge, no longer resembling the original person as closely. Achieving long-term consistency in character appearance is difficult without specialized processing and targeted optimization; relying solely on the current general video consistency training efforts is unlikely to overcome this bottleneck. This limitation has been repeatedly observed during various tests with publicly available visual products like Keling.

In some videos, the differences between the generated face and the original might be impossible to detect in purely physical terms, were it not for the sensitivity of human eyes. This highlights the sharpness of human perception: the ability to instantly discern the real from the fake.

For example, in the videos generated below featuring Maria (Xiao Ya, the favorite text2image IP I have generated and maintained in my AIGC videos), her fans can immediately tell which one is genuine, even though Maria herself may present different appearances at different ages and in various settings. There exists an abstract, invariant facial characteristic that equips humans with an eagle-eyed ability to recognize faces. The secret lies in decoupling these characteristics, something the previous generation of facial recognition models already did quite well. Compare and contrast:

 

 

It's important to note that maintaining character consistency is a critical benchmark for generating cinematic and user-configurable video works. Without crossing this threshold, the field will struggle to achieve large-scale applications in video art creation. The dream of a fully virtual Hollywood production line, without physical filming, will remain a fantasy.

Why is it so difficult for visual models to achieve consistent character representation over long periods using brute force?

Video is a high-dimensional modality, and for large models (at least in the foreseeable future) to handle video, they must employ significant "lossy compression". The compression ratio of visual tokens is high, making it more feasible to align training/generation across entire frames over time within the hidden space. The higher the compression ratio, the stronger the temporal consistency across entire frames. Autoregressive models (GPT-like) or DiT (Diffusion Transformers) can achieve this. By doing so, videos that violate the physical laws of the real world can be effectively kept under control, reducing illogical hallucinations and making visual models appear to simulate the objective world (or so it seems). However, there is a trade-off: under lossy compression, the consistency of the overall frames and the consistency of detailed features of specific physical objects therein cannot be optimized simultaneously.

The current approach typically involves adding a super-resolution (SR) module/model after achieving overall contour (blueprint) consistency, attempting to restore discarded details. In general, super-resolution rendering has made significant progress so far, thanks to the accumulation of research in "deepfake"-like technology. However, deepfake technology essentially compensates for the losses incurred during compression, using the large visual foundation model's strength in imagination (or "hallucination") to reasonably and non-deterministically fill in the details, depicting how the world "should" look rather than how it is, often with amazingly detailed, lifelike results. But if the goal is to represent an individual entity, especially a finely detailed one like the human face of some IP, with individual features sensitive to human perception, it's inevitable that the generated image will drift over time. This is the crux of the problem. The solution should not rely on increasingly larger models and longer context windows with brute-force data and training. Brute force can only slow the deviation but cannot eliminate the non-deterministic bias that accumulates during the SR process over long video sequences. We need to think outside the box and exclude the time dimension as a factor, using a step-by-step alignment method, which may break the time cycle. I’ll stop here; don't say you weren't warned.

The prerequisite for achieving this is the decoupling of facial features. Features that cannot be decoupled cannot be aligned step by step. They have to, and can, be decoupled; otherwise, it would be impossible to explain how dozens of Hollywood actors can star in thousands of blockbuster films. The decoupling of faces from expressions and time still has room for improvement, but the technology has already matured considerably.  It is a matter of how to properly use it in the process.
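A toy simulation can illustrate why accumulated drift and step-by-step alignment behave so differently. This is only a sketch under strong simplifying assumptions: the face is reduced to a single hypothetical "identity feature", every generated frame adds Gaussian noise, and the decoupled reference is assumed to be perfectly known.

```python
import random

def drift_over_time(num_frames, noise, anchored, seed=0):
    """Return per-frame deviation of a 1-D 'identity feature' from its true value.

    anchored=False: each frame is generated from the previous frame (errors accumulate).
    anchored=True:  each frame is re-aligned to a fixed, decoupled reference.
    """
    rng = random.Random(seed)
    reference = 0.0  # the character's true (decoupled) identity feature
    current = reference
    deviations = []
    for _ in range(num_frames):
        base = reference if anchored else current
        current = base + rng.gauss(0.0, noise)
        deviations.append(abs(current - reference))
    return deviations

sequential = drift_over_time(500, noise=0.05, anchored=False)
aligned = drift_over_time(500, noise=0.05, anchored=True)
print("mean deviation, frame-to-frame generation:", sum(sequential) / len(sequential))
print("mean deviation, aligned to fixed reference:", sum(aligned) / len(aligned))
```

In the frame-to-frame case the deviation behaves like a random walk and tends to grow with the length of the video, while in the aligned case it stays at the level of a single step's noise no matter how long the sequence runs.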

Original Chinese post in

Llama 3 Notes and Llama MV with Llama 3.1 Legend

Notes on the 92-page Paper Released with Meta's Super Large Model Llama 3.1

The super-large model Llama 3.1 is a milestone in the open-source large model community. As a leader, Meta's project involved over 500 participants/contributors (the authors of this paper are listed alphabetically in the appendix, similar to how the Central Committee members' names are displayed by stroke order). This original text is full of implementation details:

Meta Llama 3.1 paper

AIGC MV made with Suno and Keling (just for fun, and to cheer the open-source milestone)

Notes:

  1. Llama 3.1 doesn't use sparse techniques; it is a dense model rather than a multi-expert (mixture-of-experts) system like model 4.
  2. 405B parameters, 15.6T tokens: the number of tokens is roughly 40 times the number of parameters (15.6T / 405B ≈ 38.5). Top large-scale models now emphasize data growth far exceeding parameter growth. Is this 15T-token dataset open source? (No: even if Meta were willing to release it, they wouldn't dare, as it could lead to countless data infringement lawsuits.)
  3. Emphasizes three major levers for super-large foundation models: data, scale, and managing complexity.
  4. Compared to the previous-generation system Llama 2, training compute has increased roughly 50 times (to 3.8 × 10^25 FLOPs).
  5. Complexity management: (1) Choosing a standard dense Transformer architecture instead of a mixture of experts model to maximize training stability. (2) Adopting a relatively simple post-training procedure: Supervised Fine-Tuning (SFT), Rejection Sampling (RS), and Direct Preference Optimization (DPO). In other words, algorithm design and implementation tend towards simplification. Not using sparse techniques and multi-expert systems is for stability (but training challenges are greater, though they're not afraid). Using simpler, easier-to-implement DPO in the post-training phase instead of reinforcement learning is also for stability, as reinforcement learning has always been difficult to handle.
  6. Benchmark tests cover: general, code, math, reasoning, tool use, long context, and multilingual. All performances are SOTA (state-of-the-art international level).
    • MMLU (Massive Multitask Language Understanding): 405B model achieves 87.3% (5-shot), 88.6% (0-shot, CoT).
    • Code generation (HumanEval): 405B model reaches 89.0%, close to GPT-4.
    • Math problems (GSM8K): 405B model achieves 96.8%, slightly higher than GPT-4.
    • Long context tasks: Excellent performance on some tasks, such as 95.2% on QuALITY.
    • Multilingual tasks (MGSM): 405B model reaches 91.6%, on par with top models.
     The 405B model is comparable or close to GPT-4 and Claude 3.5 Sonnet on many tasks. In short, open-source has caught up with closed-source.
  7. Pre-training started with an 8k window, expanded to a 128k window in the later stages of pre-training (continued training).
  8. After the foundation model pre-training was completed, multiple iterations of alignment "post-training" were performed. Including: (1) Aligning the model through human feedback, including multiple rounds of Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO); (2) Integrating new capabilities, such as tool use; (3) Enhancing coding and reasoning abilities (specialized optimization); (4) Safety alignment.
  9. Multimodal expansion (in progress, not yet released): Image, video, and speech capabilities. Including (1) Multimodal encoder pre-training: Image encoder trained on a large number of image-text pairs, aligning visual content and natural language in a unified space; (2) Speech self-training? (3) Experiments on video-text data alignment based on images.
  10. The language model is the core; other modalities are added later (whether in pre-training, post-training, or both). When expanding to multimodality, the language model parameters remain unchanged and the other modalities adapt to them, so multimodal alignment happens in the same semantic space, close to the language model. In other words, Llama follows a modular, step-by-step approach to expanding into multimodality. This is not the mainstream approach (mainly that of OpenAI and Google, at least in theory), which advocates "unified multimodal native data joint pre-training". The overall impression of Llama's algorithmic strategy is that it seeks stability rather than innovation or unification. It leans towards practicality and does not care about leading in algorithms. For example, speech integration first involves speech self-training (because speech is actually very similar to text, both being language systems), then alignment between speech and text (including Automatic Speech Recognition, ASR, and Text-to-Speech, TTS). Integrating step by step into a cross-modal large model is not cutting-edge, but it is steady progress that benefits engineering development, integration, and iteration. It's unclear when they will be able to release the multimodal capabilities.
  11. Data collection and cleaning are very complex, and the Llama team handles them meticulously; this is the data-side guarantee that lets its quality catch up with SOTA. To recap: (1) De-duplication: URL-level de-duplication; document-level de-duplication using the MinHash algorithm; line-level de-duplication, removing lines that appear more than 6 times in each bucket of 30M documents. (2) Filtering: removing low-quality documents, outliers, and excessively repetitive documents; using repetitive n-gram coverage to remove repetitive content (such as logs or error messages); using "dirty word" counts to filter adult websites not covered by blacklists; using the KL divergence of the token distribution to filter documents with too many abnormal tokens. (3) Controlling data quality: using a fasttext classifier to recognize text likely to be cited by Wikipedia; using a Roberta-based classifier trained on Llama 2's predictions; using DistilRoberta to generate document quality scores. In addition, a fasttext language classifier identifies 176 languages, and two types of information are specially filtered: adult content and personal identity/privacy information. Code and math web pages receive specialized, fine-grained processing.
  12. Data proportions: For example, downsampling over-represented data categories on the web (such as art and entertainment); data mixing ratios determined by a series of small model experiments, final data mix summary: About 50% of tokens correspond to general knowledge; 25% of tokens involve math and reasoning; 17% of tokens are code; 8% of tokens are multilingual content.
  13. Model architecture: Apart from empirical adjustments of detail, the basic architecture of the dense model remains unchanged, so it's data and scaling that create top models. Specific parameters of the 405B model: 126 layers; token representation dimension 16,384; 128 attention heads. The 405B size was chosen according to scaling laws as approximately the compute-optimal size for a training budget of 3.8 × 10^25 FLOPs.
  14. Vocabulary: Using a vocabulary of 128K tokens. Combines 100K tokens from the tiktoken tokenizer and 28K additional multilingual tokens to better support non-English languages.
  15. Computing resources, including GPU clusters of tens of thousands of cards, massive storage, and high-speed networks, represent huge resource investments. Specific data as follows:
     Computing resources:
    • Used up to 16,000 H100 GPUs (a very powerful graphics processor).
    • Each GPU has 80GB of high-bandwidth memory and a power draw of 700W.
    • These GPUs are installed in servers designed by Meta itself, with 8 GPUs and 2 CPUs per server.
     Storage system:
    • Uses a distributed file system called Tectonic.
    • Provides 240PB (1PB=1000TB) of storage space, distributed across 7,500 servers.
    • Sustains a throughput of 2TB per second, with a peak of 7TB per second.
    • A major challenge is handling the large bursts of writes generated when saving model checkpoints (snapshots of the model state).
  16. Three-step pre-training process: a) Initial pre-training; b) Long-context continued pre-training; c) Annealing with high-quality data sources.
     Key pre-training strategies:
    • Gradually increase batch size and sequence length to balance stability and efficiency.
    • Dynamically adjust data mixing to specifically enhance certain capabilities.
    • Increase context length in stages to avoid early computational overhead.
    • Use annealing and high-quality data in the late stages of training to fine-tune model performance.
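
As a rough cross-check of the figures in notes 2, 4 and 13, the sketch below recomputes the tokens-per-parameter ratio and estimates the training compute. The "6 FLOPs per parameter per token" rule is a widely used approximation for dense Transformers; it is an assumption brought in here for illustration, not something stated in these notes, and the function names are hypothetical.

```python
# Figures quoted in the notes above.
PARAMS = 405e9           # 405B parameters
TOKENS = 15.6e12         # 15.6T training tokens
REPORTED_FLOPS = 3.8e25  # training budget quoted in the notes


def tokens_per_parameter(tokens: float, params: float) -> float:
    """Ratio of training tokens to model parameters."""
    return tokens / params


def estimate_training_flops(params: float, tokens: float) -> float:
    """Rule-of-thumb estimate (assumption): about 6 FLOPs per parameter per token."""
    return 6.0 * params * tokens


if __name__ == "__main__":
    ratio = tokens_per_parameter(TOKENS, PARAMS)
    flops = estimate_training_flops(PARAMS, TOKENS)
    print(f"tokens per parameter: {ratio:.1f}")
    print(f"estimated training FLOPs: {flops:.2e} (quoted: {REPORTED_FLOPS:.1e})")
```

Under this approximation the ratio comes out a little under 40 and the estimate lands close to the quoted 3.8 × 10^25 FLOPs, so the three numbers in the notes are mutually consistent.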

 

[LLM Summary]

Llama 3: Meta's Open-Source Large Language Model Breakthrough

1. Introduction and Overview

Meta has introduced Llama 3, a series of foundation language models designed to support various tasks including multilingual processing, programming, reasoning, and tool use. This model series includes versions with 8B, 70B, and 405B parameters, with the largest 405B parameter model adopting a dense Transformer architecture and supporting context windows of up to 128K tokens. The development of Llama 3 highlights three key factors: data quality and scale, computational scale, and complexity management.

2. Model Architecture and Pre-training Strategy

2.1 Model Architecture

Llama 3 retains the standard dense Transformer architecture rather than adopting a mixture of experts model. This choice aims to maximize training stability, reflecting Meta's emphasis on simplifying design to manage complexity. Key architectural improvements include:
- Using Grouped-Query Attention (GQA) mechanism, with 8 key-value heads per attention layer.
- Introducing attention masks to prevent self-attention between different documents in the same sequence.
- Expanding the vocabulary to 128K tokens, combining 100K tokens from the tiktoken tokenizer and 28K additional multilingual tokens.
- Increasing the RoPE base frequency hyperparameter to 500,000 to support longer contexts.
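
To see why 8 key-value heads matter at long context, here is a back-of-the-envelope sketch of the key-value cache size, using the 405B figures from the notes earlier (126 layers, model dimension 16,384, 128 attention heads). The 2-byte (bf16) storage, the single-sequence batch, the reading of "128K" as 131,072 tokens, and the function name are all assumptions made for illustration.

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of key/value cache needed per token of context.

    The leading factor 2 accounts for one key vector and one value vector
    per KV head in every layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


N_LAYERS = 126
MODEL_DIM = 16_384
N_QUERY_HEADS = 128
N_KV_HEADS = 8                          # grouped-query attention
HEAD_DIM = MODEL_DIM // N_QUERY_HEADS   # per-head dimension
CONTEXT = 128 * 1024                    # assumption: "128K" means 131,072 tokens

gqa_per_token = kv_cache_bytes_per_token(N_LAYERS, N_KV_HEADS, HEAD_DIM)
mha_per_token = kv_cache_bytes_per_token(N_LAYERS, N_QUERY_HEADS, HEAD_DIM)

if __name__ == "__main__":
    gib = 2 ** 30
    print(f"GQA cache at full context: {gqa_per_token * CONTEXT / gib:.0f} GiB")
    print(f"MHA cache at full context: {mha_per_token * CONTEXT / gib:.0f} GiB")
    print(f"reduction factor: {mha_per_token // gqa_per_token}x")
```

With these assumptions, sharing each key-value head among 16 query heads shrinks the cache by a factor of 16 compared with one key-value head per query head.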

2.2 Pre-training Data Processing

Llama 3's pre-training data processing is extremely rigorous, including:
- Multi-level deduplication: URL-level, document-level (using MinHash algorithm), and row-level deduplication.
- Heuristic filtering: Removing low-quality documents, outliers, and excessively repetitive content.
- Model-based quality filtering: Using fasttext and Roberta-based classifiers for quality assessment.
- Special content processing: Developing specialized processing pipelines for code and mathematical content.
- Multilingual data processing: Using a fasttext-based language identification model, supporting 176 languages.
- Safety and privacy protection: Filtering website data containing personally identifiable information (PII) and unsafe content.
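
The document-level deduplication step can be illustrated with a minimal MinHash sketch. This shows the general technique only: the word 3-gram shingles, the 128 hash functions, the salted BLAKE2 hashes and all function names are assumptions chosen for the example, not details of Meta's pipeline.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set[str]:
    """Split a document into overlapping word n-grams."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(shingle: str, seed: int) -> int:
    """One member of a family of hash functions, selected by the seed."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str, num_hashes: int = 128) -> list[int]:
    """For each hash function, keep the minimum hash over all shingles."""
    doc_shingles = shingles(text)
    return [min(_hash(s, seed) for s in doc_shingles) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    doc1 = "the quick brown fox jumps over the lazy dog near the river bank today"
    doc2 = "the quick brown fox jumps over the lazy dog near the river bank tonight"
    doc3 = "large language models are trained on trillions of tokens from the web"
    s1, s2, s3 = (minhash_signature(d) for d in (doc1, doc2, doc3))
    print("near-duplicate:", estimated_jaccard(s1, s2))
    print("unrelated:", estimated_jaccard(s1, s3))
```

In a real pipeline, signatures would be bucketed (for example with locality-sensitive hashing) so that only likely duplicates are compared; that step is omitted here.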

2.3 Pre-training Strategy

The pre-training process is divided into three main stages:
1. Initial pre-training: Conducted on about 15T multilingual tokens, far exceeding Llama 2's 1.8T tokens.
2. Long context pre-training: Gradually expanding from initial 8K tokens to 128K tokens context window.
3. Annealing phase: Fine-tuning with high-quality data in the final stage, using Polyak averaging to generate the final model.
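
The checkpoint averaging mentioned in the annealing phase can be sketched as follows. The summary does not say which checkpoints are averaged or how they are weighted, so the uniform average and the toy dictionary-of-lists checkpoint format below are assumptions for illustration.

```python
def average_checkpoints(checkpoints: list[dict[str, list[float]]]) -> dict[str, list[float]]:
    """Uniformly average each parameter across checkpoints.

    Each checkpoint maps a parameter name to a flat list of values;
    all checkpoints are assumed to share the same names and shapes.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    count = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(values) / count for values in columns]
    return averaged


if __name__ == "__main__":
    late_checkpoints = [
        {"w": [1.0, 2.0], "b": [0.0]},
        {"w": [3.0, 4.0], "b": [1.0]},
    ]
    print(average_checkpoints(late_checkpoints))
```

Averaging several late checkpoints tends to smooth out the noise of individual optimizer steps, which is the usual motivation for this kind of averaging.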

Data mixing ratios are carefully designed:
- 50% general knowledge
- 25% mathematics and reasoning
- 17% code
- 8% multilingual content
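
A minimal sketch of what such a mix means in practice: it converts the percentages into token budgets and draws training documents from the categories in the stated proportions. The 15T total is the rounded figure from the text; the sampling scheme, the seed and the function names are assumptions for illustration only.

```python
import random

# Shares quoted above.
DATA_MIX = {
    "general knowledge": 0.50,
    "math and reasoning": 0.25,
    "code": 0.17,
    "multilingual": 0.08,
}


def token_budget(total_tokens: float, mix: dict[str, float]) -> dict[str, float]:
    """Number of tokens each category contributes to the total."""
    return {name: share * total_tokens for name, share in mix.items()}


def sample_sources(mix: dict[str, float], k: int, seed: int = 0) -> list[str]:
    """Draw k category labels with probability proportional to their share."""
    rng = random.Random(seed)
    names = list(mix)
    return rng.choices(names, weights=[mix[name] for name in names], k=k)


if __name__ == "__main__":
    for name, tokens in token_budget(15e12, DATA_MIX).items():
        print(f"{name}: {tokens / 1e12:.2f}T tokens")
    print(sample_sources(DATA_MIX, k=5))
```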

3. Training Infrastructure and Challenges

3.1 Computational Resources
- Using up to 16K H100 GPUs, each equipped with 80GB HBM3 memory.
- Adopting a 4D parallel strategy: tensor parallelism, pipeline parallelism, context parallelism, and data parallelism.

3.2 Storage System
- Using the Tectonic distributed file system, providing 240PB of storage space.
- Supporting 2TB/s sustained throughput, with peak capacity of 7TB/s.

3.3 Network Optimization
- Developing the NCCLX communication library to improve network efficiency.
- Designing specific network topologies and load balancing strategies.

3.4 Training Challenges
- Experiencing 466 job interruptions during the 54-day training period, 419 of which were unexpected.
- Developing automated systems and specialized tools to handle hardware failures and network issues.
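
To put the interruption figures in perspective, the short calculation below derives the rates they imply. It uses only the three numbers quoted above; the function name is hypothetical.

```python
def interruption_stats(days: int, total: int, unexpected: int) -> dict[str, float]:
    """Derive simple rates from the reported interruption counts."""
    return {
        "per_day": total / days,
        "hours_between": days * 24 / total,
        "unexpected_share": unexpected / total,
        "planned": total - unexpected,
    }


if __name__ == "__main__":
    stats = interruption_stats(days=54, total=466, unexpected=419)
    print(f"interruptions per day: {stats['per_day']:.1f}")
    print(f"average hours between interruptions: {stats['hours_between']:.1f}")
    print(f"share that were unexpected: {stats['unexpected_share']:.0%}")
    print(f"planned interruptions: {stats['planned']}")
```

That is more than eight interruptions a day, or one roughly every three hours, which explains the emphasis on automated recovery tooling.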

4. Post-training and Alignment

Llama 3 adopts a multi-round iterative post-training process, including:
1. Supervised Fine-Tuning (SFT)
2. Direct Preference Optimization (DPO)
3. Reward model training: Using human feedback data
4. Safety alignment: Implementing multiple rounds of safety measures

This process not only improves the model's instruction-following capabilities but also enhances safety and specific abilities (such as coding and reasoning).
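
For readers unfamiliar with DPO, here is a minimal sketch of the standard DPO loss for a single preference pair, as defined in the original DPO formulation. The summary does not describe any Llama-specific modifications, so none are modeled; the log-probabilities are made-up inputs and `beta=0.1` is an illustrative value.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    where each term is the log-probability of a full response.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # Numerically stable -log(sigmoid(logits)).
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


if __name__ == "__main__":
    # The policy has moved towards the chosen answer relative to the reference.
    print(dpo_loss(-1.0, -5.0, -3.0, -3.0))
    # The policy has moved towards the rejected answer instead.
    print(dpo_loss(-5.0, -1.0, -3.0, -3.0))
```

The loss falls below log 2 when the policy favors the chosen response more than the reference model does, and rises above it otherwise. No separate reinforcement learning loop is needed, which is the simplicity argument made in the notes.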

5. Multimodal Expansion

Although not officially released yet, Llama 3 demonstrates promising multimodal capabilities:
- Image recognition: Training independent image encoders, integrated with the language model through adapters.
- Video understanding: Adding video adapters based on image adapters.
- Speech processing: Independently training speech encoders, then aligning with the language model.

This modular approach allows flexible addition of new modalities while maintaining core language capabilities.

6. Performance Evaluation

Llama 3 performs excellently in multiple benchmark tests:
- MMLU (5-shot): 87.3%
- HumanEval (code generation): 89.0%
- GSM8K (math problems): 96.8%
- Long context tasks (like QuALITY): 95.2%
- MGSM (multilingual tasks): 91.6%

These results indicate that Llama 3 405B is comparable or close to GPT-4 and Claude 3.5 Sonnet on multiple tasks, particularly excelling in document understanding and long context tasks.

7. Safety Considerations

Meta highly prioritizes safety in the development of Llama 3:
- Implementing strict safety measures in both pre-training and post-training stages.
- Developing the Llama Guard system-level safety solution.
- Conducting extensive red team testing and risk assessments.

8. Open Source Impact and Future Directions

Meta's decision to publicly release the entire Llama 3 series, including the 405B parameter version, may have far-reaching impacts on the AI research community:
- Promoting open, responsible AI development.
- Accelerating AGI research progress.
- Providing researchers with opportunities to examine and improve large-scale language models.

Future development directions may include:
- Further improving multimodal integration.
- Expanding context length.
- Continuously enhancing data quality and model scale.

9. Conclusion

The development of Llama 3 demonstrates Meta's deep experience and forward-thinking in large-scale AI systems. By focusing on three key levers - data quality, computational scale, and complexity management - Llama 3 has reached or approached the current state-of-the-art level on several key benchmarks. Its open-source release may drive a wave of innovation across the entire AI field, paving the way for responsible AGI development.

Llama 3: Meta's AI Chef's Latest "Divine Delicacy"

Attention, all tech enthusiasts! The Michelin three-star AI chef Meta has just unveiled a new dish! This divine delicacy named "Llama 3" is not only spicy enough but will elevate your taste buds to new heights!

1. The Chef's Secret Weapon

Imagine Llama 3 as a super nanny who speaks 8 languages, writes code, does math, and can be your personal assistant. She can handle a kindergarten full of rambunctious kids (8B version), manage a mid-sized company (70B version), or even govern a small country (405B version)! This 405B big sister can remember 128,000 "gossips" (oh no, I mean context) simultaneously, essentially a walking encyclopedia + supercomputer!

2. Ingredient Selection: Only the Freshest!

Llama 3's chefs are masters at picking ingredients:

  • They "fished" 15 trillion words from the internet, nearly 10 times more than the previous generation!
  • Half of these words are everyday life seasonings, a quarter are math problems and brain teasers, nearly a fifth are programmer spells, and the rest are dialects learned from world travels.
  • They even invented a super weed remover, filtering out all the online garbage, repetitive, and unhealthy stuff.

3. Cooking Process: Three-Step Stir-Fry Method

Step 1: "Slow Simmer" - Start with a regular stove (8K context) to cook it halfway.

Step 2: "High Heat Stir-Fry" - Switch to a super stove (gradually increasing to 128K context), reducing the sauce to be thick and fragrant.

Step 3: "Low Heat Finish" - Finally, a gentle simmer with the best ingredients, the legendary "annealing" (even the chefs don't know why it's called that), bringing the flavor to its peak!

4. Kitchen Equipment: Top-of-the-Line Luxury Version

  • 16,000 super high-power induction cookers (H100 GPUs) firing simultaneously!
  • A refrigerator that could fit half the Pacific Ocean (240PB storage)!
  • A proprietary ingredient prep system faster than 5G (NCCLX communication library)!

Imagine all these stoves firing at once, making the kitchen feel like a sauna. But our chefs persevered through the heat, changing chef uniforms 466 times in 54 days to whip up this dish!

5. Training Method: Both Cute and Well-Mannered

Being a good cook isn't enough; you've got to have manners too! So our chefs began a long "training" process:

  • First came a round of "gentle education" (supervised fine-tuning)
  • Then the "carrot and stick" tactic (direct preference optimization)
  • Finally, they invited moral role models (safety alignment) for guidance

After all this fuss, Llama 3 not only cooks well but also knows how to please people, program, do math, and mind her manners - a true decathlon champion!

6. Special Side Dishes: Showcasing Multiple Talents

Don't think Llama 3 can only cook; she's a multi-talented "goddess":

  • Storytelling from images? Piece of cake!
  • Writing movie reviews? No problem!
  • Recognizing songs and even singing a bit? The karaoke queen!

Although these "talents" are still in practice, they already show the potential of Li Bai's "from black hair to snow white in a day"!

7. A True Powerhouse: Dazzling Test Scores

Llama 3 participated in a series of "Top Chef Competitions," with eye-popping scores:

  • College Entrance Exam (MMLU): 87.3 points (out of 100)
  • Programmer Interview (HumanEval): 89 points (out of 100)
  • Math Olympiad (GSM8K): 96.8 points (out of 100)
  • Long Novel Reading Comprehension (QuALITY): 95.2 points (out of 100)

Bring this report card home, and even a "Tiger Mom" would be grinning from ear to ear!

8. Safety First: AI's "Security Captain"

Meta's chefs know well the principle of "don't leave guns and ammo lying around." They've assigned Llama 3 a 24/7 bodyguard team (Llama Guard) to prevent her from accidentally saying or doing the wrong thing. They even arrange occasional "moral exams" to ensure she doesn't turn into a "Terminator."

9. Open Source Feast: Everyone Can Be a Master Chef!

The most impressive part is that Meta decided to make the recipe for this "divine delicacy" completely public! It's like a Michelin three-star restaurant putting their signature dish's recipe online. Now anyone who wants to can whip it up at home! This move not only shocked other master chefs but also made countless food lovers cheer with joy!

10. Future Outlook: Reaching New Heights

Meta's chefs aren't resting on their laurels; they're already pondering the next "divine delicacy":

  • Maybe a dancing Llama 4?
  • Or a painting Llama 5?
  • Who knows, one day we might see a Llama 6 composing symphonies!

In short, the AI world's "Michelin" journey has only just begun!

Epilogue

The birth of Llama 3 not only elevates Meta's status in the AI world but also brings a fresh breeze to the entire AI research community. This bowl of "Llama soup" is not only delicious but also brings unlimited imagination to everyone. What will the future of AI be like? Let's wait and see what flavor the next "divine delicacy" will be!

Mingjie Li: Debriefing report

In support of Application for Chief Surgeon

Since the resumption of professional journals and academic activities after the Cultural Revolution in 1979, I have published dozens of papers in journals such as Southern Anhui Medicine, Journal of Bengbu Medical College, Lectures of Provincial Medicine, Domestic Medicine (Surgery) and Jiaotong Medicine.  In 1979 and 1980, I participated in the preparation and re-founding of the Anhui Orthopedic Society and the Anhui Surgical Society respectively, and attended the annual meetings (sessions 1-6) of the two societies.  I also participated in many surgical academic activities held nationally and by the Ministry of Transportation.

In 1994, I was involved in the planning and organization of a symposium on orthopedics in the Yangtze River Basin area, helping to compile a special issue of Orthopedic Clinic for Journal of Southern Anhui Medical College (Vol-13 supplement, 1994) under the guidance of Professor Jingbin Xu, editor of Chinese Journal of Orthopedics. The issue carried over 100 published papers, with participants and contributions from all over the country.

In September 1995, I presented two papers at the National Academic Conference on Acute and Severe Surgery (Guilin, 1995), one of which, "Problems in the Treatment of Liver Trauma" (0190), won an excellent-paper certificate.  I have also published papers at the First International Academic Conference of Chinese Naturopathy (Chengdu, 1991) and in Naturopathy (published in Taiwan Province).

1 Professional path and deputy chief physician performance

(On evolution of several theoretical problems in surgery)

1.1 In the early 1960s, a large number of patients suffered from acute volvulus, ascaris lumbricoides intestinal obstruction and cholelithiasis.  Carrying out a large number of related operations for these cases consolidated my mastery of the basic surgical skills.  In addition, for the treatment of toxic shock in late cases, we practitioners followed an arduous zigzag path from vasoconstriction and raising blood pressure to volume expansion and improvement of microcirculation, which proved to be an epoch-making change and progress both theoretically and clinically.

1.2  In the early years, Southern Anhui had a large number of patients with portal hypertension, hypersplenism and upper gastrointestinal bleeding caused by late-stage schistosomiasis and late hepatitis cirrhosis.  The medical community has also gone through a process of repeated debate and re-understanding regarding the choice between shunt and devascularization.  In this regard, as early as 1975, I performed splenectomy, splenorenal vein anastomosis and various other shunts. Because shunts carried a high rate of postoperative embolism, reduced the blood supply to the liver and readily induced hepatic encephalopathy, I later switched to various types of portal-azygous devascularization, and drew many lessons and ideas for improvement from the treatment of this difficult problem.

1.3 Biliary lithiasis still troubles the surgical community. With the development of hepatobiliary surgery and the improvement of monitoring methods, surgical procedures for the challenging problem of intrahepatic calculi are constantly being updated and improved.  I began performing regular resection of the left lateral lobe of the liver for this disease in 1980 (the paper on five early cases was presented at the Annual Meeting of the Provincial Surgery in 1980 and published in Journal of Southern Anhui Medicine (80, 13; 51, “Regular resection of the left outer lobe of the liver for the treatment of intrahepatic stones”)).  Also starting in 1980, various types of choledocho-intestinal drainage (Finster, Longmire, Roux-en-Y, etc.) were successively performed.  In 1992 and 1995, three cases were treated with intrahepatic bile duct incision, stone removal and plasty, and "basin" biliary and intestinal drainage (the first case was reported in “Communication Medicine”,  93,7; 91, “A case of hepatobiliary basin type biliary enteric drainage”). This work extended the operation to the treatment of intrahepatic lesions, leading to improved clinical efficacy.

1.4 In recent years, the incidence rate of acute pancreatitis has increased. All severe pancreatitis patients in my department were cured through measures such as removal of the focus, pancreatic bed drainage, intraperitoneal lavage, inhibition of exocrine secretion with 5-Fu and somatostatin, and anti-shock and anti-infection treatment. In one recent case, a patient whose operation at another hospital was complicated by postoperative stress ulcer bleeding was successfully rescued in my department.

1.5 On the basis of treatment and operation for various thyroid diseases, hyperthyroidism operation was performed after 1980, and two cases of radical thyroidectomy (neck-mimicking surgery) were performed in 1994. One case was re-operated due to recurrence 3 years after the initial surgery was performed in an external hospital.  No further recurrence was observed during follow-up.  

1.6 In addition, there are surgeries such as excision and anastomosis of cervical aneurysm, thymopharyngeal duct cyst, thyroglossal duct cyst and cystic hygroma resection, etc.

1.7 Over the past 30 years, more than 1,000 cases of breast cancer, gastric cancer, colon cancer and rectal cancer have been treated, and many of them have survived for a long time.  

1.8  For the prevention and treatment of short bowel syndrome after extensive intestinal resection, I used the surgical method of interposing a distal reverse-peristaltic bowel loop; follow-up observation over 21 years shows no diarrhea or malnutrition. This work was published in the Journal of Bengbu Medical College (82; 7: 214, PEUTZ Syndrome) and Traffic Medicine (91; 1: 41, “Surgical treatment of short bowel syndrome”).

1.9 The management of duodenal injury has its own particularity and complexity, and retroperitoneal injury of the duodenum is especially prone to missed diagnosis and misdiagnosis.  The prognosis of patients who undergo surgery more than 24 hours after injury is grim.  In a case reported in 1994, following the principle of duodenal "diverticularization", I performed a Berne-like operation 28 hours after injury, and the recovery was smooth. My paper was published in Communication Medicine (“Experience in Diagnosis and Treatment of Closed Retroperitoneal Duodenal Injury”, by Mingjie Li).

1.10  Operations that I began performing years ago also include subdiaphragmatic total gastrectomy, jejunostomy, supradiaphragmatic esophagectomy, thoracic esophagogastrostomy, lobectomy, mediastinal thymoma removal, diaphragmatic hernia repair, etc.

2. Work involving various medicine disciplines

The two hospitals I have served are both grass-roots primary hospitals. The "major surgery" department covers general surgery, orthopedics, urology, chest surgery, obstetrics and gynecology, ophthalmology and otorhinolaryngology, anesthesia, radiology, laboratory testing and other related work.  As the academic leader of the department, I have long been engaged in work in all of the above areas, as outlined below.

2.1 Orthopedics is one of my key areas, second only to general surgery.  I have performed all major surgeries in this area, and participated in academic activities at all levels, including publication of numerous papers, professional talks and compilation of a special issue on Orthopedics.  My representative operations treating bone injury and bone disease include closed nailing of the femoral neck (for the paper, see Orthopedics Clinical 1994, 13:37, Closed nailing treatment of femoral neck fracture in 45 cases), surgery for surgical paraplegia (paper in Anhui Province Medical Lectures 1982; 4:21, Surgical paraplegia analysis of 14 cases), spinal tuberculosis surgery (paper Spinal tuberculosis a surgical therapy in Proceedings of First Provincial Orthopedic Annual Conference, 1979), lumbar disc surgery, spinal cord tumor enucleation, bone tumor removal and orthopedic surgery, etc.

2.2 Urological surgery: nephrectomy, stripping of renal pedicle lymph nodes, removal of calculi from various segments of the ureter, realignment repair of urethral trauma, ureteral transplantation, vasovasostomy, spermatic vein–inferior epigastric vein anastomosis, hypospadias repair, radical resection of bladder cancer and penile cancer, etc.

2.3 Gynaecology and obstetrics: I founded the department of obstetrics and gynecology of our hospital, and have performed Cesarean section (lower segment and extraperitoneal operation), hysterectomy (abdominal type and vaginal type), oophorectomy, repair of vesicovaginal fistula and cervical cancer resection, etc.

2.4 Ophthalmology and otorhinolaryngology: operations on the parotid gland, tonsil, maxillary sinus and mastoid, as well as cataract surgery, artificial pupil, enucleation, nasolacrimal duct anastomosis, strabismus correction, etc.

2.5 Anesthesiology: various segments of epidural block, cervical plexus block, brachial plexus block, intubation general anesthesia and intravenous compound anesthesia, etc.    

2.6 Radiology: I founded the department of radiology in 1960, and concurrently served as the head of the department for 2 years (1960-1962).  I am very familiar with its routine work and related angiography.

Environment trains people.  The wide range of issues encountered in long-term work at grass-roots hospitals led me to work across many subjects.  The knowledge and skills of these related areas complement each other, contributing to and deepening my surgical expertise.  I have performed various Level-4 and Level-5 surgeries, which has kept me at the forefront of contemporary surgery.

3  Continuous innovations and some experience to share

Over the past 40 years, with high technological development, diagnosis and monitoring methods are constantly updated.  With the change of social life, diseases are also changing. In an aging society, geriatrics takes a prominent position.  Many factors make the clinical work evolve too.  This requires physicians to constantly hunt for scientific and technological information, learn from the experience of others, study hard and embrace the courage for innovation, in order to improve the service quality for our patients.

3.1 Improvement and innovation

3.1.1 The key to the control of traumatic infection is complete debridement at the first diagnosis, rather than reliance on drainage and antibiotics.  Techniques involve washing with a large quantity of water, removal of foreign objects and devitalized tissue, disinfection, and no suture.  When a postoperative inflammatory reaction occurs, apply a local wet compress with alcohol, with or without supplementary antibiotics.  Following this strategy, surgery within 6 hours of trauma is almost completely free from infection.

3.1.2 Over the past 30 years, based on the experience of over 1,000 cases of gastrectomy I have performed, the preset gastric tube has basically been abandoned except for special needs, and there were no cases of failure.  This requires excellent anastomosis, perfect hemostasis, intraoperative emptying of the residual stomach, and attentive postoperative monitoring.

3.1.3 For extensive peritonitis, after the nidus and infectious substances are removed, abdominal cavity drainage can be abandoned to reduce postoperative adhesion.  The key for this to work is to wash it thoroughly during the operation.  As the drainage is quickly blocked by fibrin glue in the abdominal cavity and soon stops working, it only increases the pain of the patient. To be sure, however, in cases such as pancreatitis, abdominal abscess, etc., if continuous overflow is expected, double-cannula negative pressure drainage is still required.  

3.1.4  For any surgery, regardless of scale, its success or failure makes a big difference to the health and safety of patients.  As a surgery practitioner, I attach importance to the technical improvement of each and every "small" surgery.  Some of my technical innovations and experience are outlined below.

For inguinal hernia repair, the focus is the transversalis fascia. The traditional Bassini method should be replaced by the modified Madden procedure, which greatly reduces the pain caused by postoperative suture tension, is conducive to healing, and greatly reduces the recurrence rate.

For circumcision, the conventional procedure has plagued both doctors and patients with poor alignment of the inner and outer plates, hematoma, edema, and difficulty in removing stitches.  I modified the procedure, using local venous anesthesia to support neat cutting under a tourniquet, with perfect hemostasis, accompanied by careful suturing with human hair or absorbable thread.  The benefits include no pain during the operation, good alignment, fast healing, and no need to remove stitches (see my paper "Several improvements of circumcision", published in Jiaotong Medicine 90; 4(3): 66).

Anal fistula seton therapy or open resection both make patients suffer from postoperative pains with a long recovery period. I used long-acting anesthesia (with local injection of diluted methylene blue) to ensure the primary resection and suture. Most cases receiving this treatment result in primary healing, with the course of treatment greatly shortened.

3.2 Some General Experiences

Based on what I have learned from 40 years of hands-on surgical practice, I feel that a qualified surgeon must not only consolidate and continuously update basic knowledge, but also work meticulously with a high sense of responsibility, supported by logical thinking and a practical, orderly working style.  It is very difficult simply to follow a unified norm or standard procedure when real-world surgery involves so many factors to be weighed: the ever-changing condition of the patient, physical differences, the positive and negative effects of drugs, the advantages and disadvantages of the techniques under consideration, the reserve function of the organs, the length of the course of the disease, and even the natural environment and the mental and material conditions.  One must be highly adaptable.  It is no exaggeration to say that this adaptability determines a surgeon's level of diagnosis and treatment and the clinical results.

3.2.1 The entire process on the operating table involves a struggle between personal reputation and the interests of the patient.  The so-called principle of "safety first, and draw the line accordingly" is often not feasible in practice.  A competent physician must have the courage to take risks for his patients.  One is often fighting for a patient's chance of rescue, a chance that can be lost through a small lapse in one's thinking.  I have countless memories of such incidents, one of which is as follows.  In a patient's fifth biliary tract operation, cavernous blood vessels caused by portal hypertension due to biliary cirrhosis were distributed all over the hepatic hilus, and the tissue was also thickened by inflammation.  After a full 8 hours of struggle, I finally managed to open the bile duct and save the patient's life.  This was a victory of perseverance.

3.2.2  Adapt measures to real-world conditions, and keep an open mind about breaking routine to save a patient.  In the countryside, the key to saving lives in cases of liver or spleen trauma and massive hemorrhage from ectopic pregnancy lies in prompt reinfusion of the blood pooled in the abdomen.  To wait for a blood supply in these scenarios is to wait for death.  I remember a case of liver trauma in which 1700 ml of blood from the liver injury was transfused back on site to support a successful operation.  (See the paper "Related issues in the treatment of liver trauma (review)", in Proceedings of the National Academic Conference on Acute and Major Surgery, 95; 190.)

3.2.3 For difficult or new operations, one must build up the relevant knowledge and operating skills beforehand by reviewing the literature, consulting experienced experts, and observing such operations in person, so as to minimize potential oversights or accidents.  For my first hepatobiliary basin-type internal drainage operation, I asked for direct guidance from a professor of surgery. I then completed the subsequent two cases successfully on my own.

Looking back on my 40-year career in surgery, I deeply feel that clinical surgery is a combination of science, perseverance, determination, and a sense of responsibility.  It is like a small boat tossed up and down on the crest of the waves.  It is like walking on thin ice, where hidden dangers can strike at any time.  The hardships and risks of our career are among the highest in all trades.  Fortunately, I have not failed society.   Along the journey, there have been countless joys of success, together with many sleepless nights and moments of panic.  For the rest of my career, I am determined to maintain the service spirit of "healing the wounded and rescuing the dying" and to complete the journey to the end.

Appendix 1, Publications
Appendix 2, Relevant Materials and Records of Level III and Level IV surgeries

 

In Commemoration of Mingjie Li’s 66 Years of Medical Practice

 

      

Mingjie Li: My career as surgeon

I:  Career memoirs 

Before writing my debriefing report in support of my application for Chief Surgeon, let me start with three unforgettable orthopedic cases that I experienced in my medical practice. 

In 1970, Mr. Gui, an old schoolmate and close friend from junior high school who was then at Fanchang No.1 Middle School, brought his son's case to my attention.  His son, then aged 16, suffered from tuberculosis of the fifth cervical vertebra with a cold abscess that severely compressed the esophagus and trachea.  He was unable to eat and had difficulty breathing, with hoarseness, dehydration and hypoxia, and was in critical condition.

They had visited Yijishan Hospital, the largest hospital in Wuhu, but Dr. Chen, the director of the Department of Orthopaedics there, would not admit the case, saying that a similar patient had died during the operation a few days before.  He suggested sending the patient to the provincial hospital in Hefei, which would cost 800 yuan.   However, Mr. Gui's monthly salary was only 52 yuan, and he had to support a family of six on this income.  How could he afford it?  Besides, nobody knew whether the provincial hospital in Hefei could treat him.  In a hurry, Mr. Gui turned to the No. 127 Army Hospital, located in the suburbs of my town, Nanling, to try his luck there.  The orthopedics department of that hospital was headed by Dr. Xu Jingbin, a nationally recognized orthopedic authority, and this military hospital in a small town had a long tradition of helping the poor.  Unfortunately, Dr. Xu was on a business trip to Nanjing, and his subordinates were too afraid to accept this high-risk patient.

Feeling helpless, Mr. Gui came to me at Nanling County Hospital (the two hospitals are only 5 miles apart) to discuss possible rescue plans.  I was not sure how best to treat this condition either.  However, I had studied at No. 127 Hospital with Dr. Xu as my supervisor and was familiar with the personnel there.  I immediately called an ambulance. We went back to No. 127 Hospital, found doctors in orthopedics and surgery, and asked them to work together on this urgent case.  Mr. Gui, as the patient's family member, and I jointly signed the required consent form accepting the risk of the operation, and we discussed the details.  However, the plan was still not approved by the hospital.  Instead, the hospital asked me to help them out of this embarrassing predicament and promised a free vehicle to transfer the patient to a big-city hospital in Hefei or Nanjing.  But the patient's life was in danger at every moment. Distant water cannot put out a nearby fire, so transferring him to a faraway hospital was not advisable.

I decided to take on the challenge myself.  I thought that at the very least I could drain the pus to save his life first, relieving the compression of the esophagus and trachea and making it possible for him to eat and breathe.  So the patient was brought back to the county hospital where I worked.  Before he was even off the stretcher, I ordered fluid replacement and antituberculosis medication.  By this point in the evening, Mr. Gui had not eaten all day, so he was given dinner at my home.  I could not spare the time for dinner myself; I used it to review the relevant literature and anatomy.  Half an hour later, the patient was taken to the operating room, where I operated under local anesthesia. After careful dissection, I incised the pus cavity and released a large amount of pus.  The patient could immediately make sounds, sip water, and breathe smoothly, which showed that he was finally out of immediate danger.

The operation continued. Exposing the focus at the fifth cervical vertebra through an anterior approach, I removed the dead bone, scraped off the tuberculous granulation tissue, flushed the pus cavity, instilled streptomycin and isoniazid, placed a drain, and sutured the wound.  The operation went smoothly and was very effective.  The fever subsided 3 days after the operation.  The patient went to get a haircut, ate normally and recovered well. Twelve days after the operation he was discharged from hospital, and his medical expenses came to 32 yuan.  He continued antituberculosis treatment for half a year and recovered well.  For more than 40 years since, the patient has worked and lived normally, and he now enjoys a large family of children and grandchildren.

Beyond the complicated anatomy of the neck, with its dense blood vessels, nerves, thyroid gland, trachea and esophagus, this type of debridement for cervical tuberculosis is highly difficult because of the fragility of the cervical spine and the destruction caused by tuberculosis.  The slightest injury to the cervical spinal cord will lead to high-level paraplegia or even death.  It is a high-risk level 4 orthopedic operation.  Even in big hospitals, directors are extremely cautious in treating such cases.  I was still a newcomer to orthopaedics then, but I needed to save a life, knowing that transferring the patient to another hospital at that time was basically a dead end.  The patient was on the edge of an abyss.  But I also had strengths and preparation of my own that made this success possible.  I had many years of experience in thyroid surgery of the neck, was familiar with the anatomy, and had accumulated specialized knowledge in orthopedics.  This solid foundation finally enabled me to solve this rare problem successfully in a grass-roots hospital.  The life-threatening symptoms were treated by relieving the compression immediately, and the disease was cured, with the lesion eradicated.  It proved to be a cure for life.

The second case came at the end of the 1980s. Xiao Wei, a 14-year-old junior student at Wuhu No.1 Middle School, suffered from a tumor of the right humeral neck.  He had undergone two operations, at Yijishan Hospital and at Shanghai Zhongshan Hospital respectively.  Now the disease had struck the right scapula.  The director of orthopaedics at a hospital in our city said that it was a malignant tumor recurring and metastasizing, that amputation was necessary, and that it would be hard to save his life.   The family was desperate.  The patient's grandfather, Mr. Wu, had been my junior middle school teacher.    Mr. Wu knew how well I had treated the cervical tuberculosis of Mr. Gui's son, so he came to me for consultation.  I carefully examined the medical records and the earlier and later X-ray films, and diagnosed a new critical tumor that was neither a recurrence nor a metastasis of the original disease.  I personally performed a half-excision of the right scapula in my own hospital, and he recovered fully.   More than 20 years have passed, and Xiao Wei has enjoyed good health ever since.  He later earned a doctorate in the West and is now a high-end international talent in his field.  From time to time, he and his father still visit me to express their gratitude.

The third case came in the fall of 1975. A 35-year-old female patient, who had lost 40 kilograms, was admitted to our hospital with tuberculosis of the sixth and seventh thoracic vertebrae and paraplegia.  Under general anesthesia, through a transthoracic approach, the focus was cleared and the dead bone and necrotic intervertebral disc were removed.  The tuberculous granulation tissue in the spinal canal was 8 cm long and pressed on the thoracic spinal cord, causing spinal canal obstruction and paraplegia.  After curettage, this segment of the spinal cord could be seen pulsating again.  The focus area was thoroughly washed and antituberculosis drugs were instilled.     The ribs cut during thoracotomy were trimmed and embedded in the intervertebral defect, completing the anterior bone graft in one stage. After the operation, the patient recovered well and was cured.  The patient's husband was a blacksmith, who gave me a stainless steel kitchen knife and a spatula of his own making, which are still in use in my home today.  In orthopedic surgery, this belongs to the top category, level 4.  Curing thoracic tuberculosis complicated by paraplegia with one-stage lesion clearance and bone grafting through the anterior thoracic approach was certainly at the peak of what county-level hospitals could achieve.

Such cases have brought me a great sense of pride and accomplishment, and they are the motivation for my lifelong dedication to saving lives and relieving the pain of my countless patients.

 


Collected Works in Commemoration of Mingjie Li’s 66 Years of Medical Practice

 

© Mingjie Li

Dr. Mingjie Li has been practicing medicine for over 60 years. This collection, compiled to commemorate his amazing career, includes three sections: (i) career memoirs, (ii) medicine papers, and (iii) medicine education. The publication of his medicine papers is the culmination of his extensive experience and expertise in the field. His work has been recognized by his peers for its professional value and rigorous style. In addition to surgery, orthopedics, obstetrics, and gynecology, his work at times also incorporates elements of traditional Chinese medicine. The "Operation Records" section in the appendix provides detailed descriptions of operation procedures and emergency measures, making it a valuable reference for professionals in the field. The "Education Section" highlights Dr. Li's practical experiences and medical training materials he compiled, providing valuable insights into a range of clinical topics. Overall, this collection serves as a testament to Dr. Li's impressive career and contributions to the field of medicine.

August 2023, Wuhu, Anhui, China

[Collected Papers from Mingjie Li's 67 Years of Medical Practice (Electronic Edition)]

 

Table of contents

I:  Career memoirs

My career as surgeon

Debriefing report

Service beyond my hospital

Career Path and self review

Dad's medical career

II:  medicine papers

Regular resection of left lateral lobe of liver for intrahepatic calculi

PEUTZ syndrome

Surgical management study of hepatic injury

Surgical treatment of acute gastroduodenal perforation

Diagnosis and treatment of closed retroperitoneal duodenal injury

Surgical treatment of short bowel syndrome

Hepatobiliary basin type biliary-enteric drainage

Biliary enteric drainage

Several special problems in diagnosis and treatment of biliary tract surgery

Diagnosis and treatment of closed duodenal retroperitoneal injury

Misdiagnosis of subacute perforated peritonitis in gastric malignant lymphoma

Adult retroperitoneal teratoma infection complicated with chronic purulent fistula

Lighter foreign body in stomach

Primary repair of congenital omphalocele

Recurrent stones in common bile duct with suture as core

A case of plastic tube foreign body in bladder

Abdominal trauma

Subcutaneous heterotopic pancreas of abdominal wall

Several improvement measures of circumcision

Clinical observation of a new minimally invasive circumcision

A surgical treatment of spinal tuberculosis

Transpedicular tuberculosis complicated with paraplegia

Surgical analysis of surgical paraplegia

Lipoma under soft spinal membrane complicated with high paraplegia

Treatment of femoral neck fracture with closed nailing

Fifth metatarsal fracture caused by varus sprain

Intervertebral disc excision in community health centers

In commemoration of the 50th anniversary of Dr. Xu Jingbin's medical career

Intrauterine abortion combined with tubal pregnancy rupture

Rivanol induction of labour by amnion cavity injection

Extraperitoneal cesarean section

Prevention and treatment of trichomonas vaginalis and mold infection

Non-operative treatment of senile cholelithiasis with integrated traditional Chinese medicine

Treatment of acute soft tissue injury with moxibustion

Treatment of scapulohumeral periarthritis with acupuncture combined with warm moxibustion

III:  medicine education

Level 4 Surgery

New concept of modern surgical blood transfusion

Extrahepatic biliary injuries

Surgical treatment of thyroid cancer

Indications of splenectomy  and effects on body after splenectomy

Treatment of carcinoma of pancreas head  and carcinoma of ampulla

Treatment of cardiac cancer

Treatment of recurrent ulcer after subtotal gastrectomy

Treatment points of radical resection of colon cancer

Medicine Lecture Notes

Related Online Links

 

 

Interview 1/10: Critique of Chomsky's Formal Language Theory

Q: Lao Li, I have been following your academic work closely. I deeply admire your more than 30 years of in-depth, innovative study of symbolic logic in the field of natural language understanding. On your NLP Channel, I notice that you have been critical of Chomsky. Chomsky is the representative figure of the rationalist school. Like many others, I admire Chomsky. As far as I know, you are also a rationalist. So why do you, as a linguist who practices rationalism, criticize Chomsky?

A: First of all, although I have criticized Chomsky, pointing out problems in his theory and the ways in which he has in effect misled the field, these are "criticisms within the school". There is no doubt that Chomsky is the father of computational linguistics and the banner of rationalism in the field of artificial intelligence. His theory of formal language is the cornerstone of computational linguistics. All of us computational grammarians, as practitioners of the symbolic logic of rationalism in language, are his disciples. When we criticize him, we still use his formal mechanism as the frame of reference.

From the perspective of language formalization, Chomsky, with his deep mathematical background, brought mathematical rigor to the formal study of language. At least in terms of formalism, Chomsky unified human language with computer language, achieving a highly abstract symbolic system that no one else could dream of reaching. Without Chomsky's formal language theory, computer science could not have developed high-level languages, and all the achievements of the information industry would be unimaginable.

On the other hand, it can be said that Chomsky's negative impact on the field is as big as his revolutionary contribution to linguistics and computer science. His formal language hierarchy is a theory of pure genius that laid the foundation of language formalization. This formalism has become the theoretical basis of high-level computer languages and their compiling algorithms. It is at its best as a guide for creating, parsing and compiling computer languages. However, perfection is sometimes only one step from fallacy. Chomsky criticizes the finite state machine as unsuitable for modeling natural languages because it lacks a recursion mechanism. Too many people were misled by this and fell for the so-called "more powerful" context-free mechanism.

Such an intelligent and powerful figure, if he misleads, can affect an entire generation. The generation affected was that of my direct supervisors and predecessors when I entered this field (in the 1970s and 1980s). Their work in natural language understanding consisted almost exclusively of toy systems confined to labs, difficult to scale up and demonstrate in practical applications.  This directly led to the rebellion of the next generation. This is the well-known piece of artificial intelligence history: the competition between the rationalist symbolic school and the empirical statistical school, with long struggles between the two paths. The rationalists of the old generation were at a disadvantage in this competition and gradually withdrew from the mainstream stage.

All the advances of the statistical school over the last 30 years have been a practical critique of Chomsky, because almost all of these models are based on finite state models, which he repeatedly criticized as inappropriate for natural language. The context-free grammar he advocates has achieved only limited success in the field of natural language.

Q: Now that everyone is advocating neural networks and machine learning, is there still room for the symbolic rule school? Rationalism has lost its voice and visibility in the natural language community. What do you think of the history and current situation of the two?

A: Well, machine learning has been on the rise in natural language processing for about 30 years, driven by the rapid growth of data and computing resources. In recent years especially, deep neural networks have achieved breakthrough successes. The success of empiricism, in addition to innovation in neural network algorithms, also benefits from the unimaginably big data and computing power available today. In contrast, the rationalist school of symbolic logic, because of its limited practicality, gradually withdrew from the mainstream academic stage after a brief upsurge of unification-based phrase structure grammars about 20 years ago. There are several reasons for this situation, including Chomsky's long-term negative influence on computational grammars, which deserves serious reflection.

Looking back at the history of artificial intelligence and natural language, the pendulum has swung back and forth between empiricism and rationalism, but empiricism has been on the rise for the last 30 years (see the red dot in figure 1). In his article "A Pendulum Swung Too Far", Professor Church predicted and called for the resurgence of rationalism and presented the illustration below:

At present, due to the breakthrough of deep learning, empiricism is still in the limelight. Although rationalism has been accumulating efforts by itself for many years, it has not yet reached the tipping point where it can compete, head-on, with empiricism. When one school becomes mainstream, the other naturally fades out of sight.

Q: I have a feeling that there is some confusion both inside and outside the community. Deep learning, which is a method of empiricism, now seems to be regarded by many people as equivalent to artificial intelligence and natural language processing. If the revolution in deep learning sweeps through all aspects of artificial intelligence, will it end the pendulum swing toward rationalism? As Professor Church says, the pendulum of empiricism has swung too far, but it looks far from falling back.

A: My definite answer is no. These are two different philosophical bases and methodologies, each with its own natural advantages and disadvantages. Although there are reasons for the status quo of the existing one-sided empiricism in the current academic world, it is not a healthy state. In fact, both schools are competitive on one hand and also highly complementary on the other hand. Some older generation mainstream pioneers like Church have been warning about the disadvantages of one-sidedness in empiricism, and some new scholars in deep learning have been exploring the integration of the two methodologies to solve the problems of natural language.

Yes, much of the current surge in AI is based on breakthrough performance from deep learning, especially in image recognition, speech processing and machine translation, where AI systems have reached or exceeded human quality. This is indeed an unprecedented and amazing achievement. However, deep learning, like all the other currently successful empirical methods, still has a fundamental limitation: its dependence on massive annotated data, which we call the knowledge bottleneck. The reality is that in many fields and application scenarios, such as natural language parsing or machine translation of e-commerce data, massive annotated data or domain-specific translation data simply do not exist. This knowledge bottleneck severely limits the performance of the empiricist school in natural language understanding and other fine-grained cognitive tasks. There is not enough annotated data in many sub-fields, and without it, learning is like trying to make bricks without straw. This is especially true for deep learning, whose appetite for data is far larger than that of traditional machine learning, almost insatiable.

Q: So it seems that deep learning is not a cure-all. Rationalism has its place. You said the two schools have their respective strengths and weaknesses. Can you compare and contrast them? Why are they complementary?

A: Let me summarise the merits and demerits of the two for a serious contrast.

The advantages of empirical statistical models include: (1) strength in coarse-grained tasks, typically document classification, where statistical learning is naturally good at drawing an overall conclusion; (2) robustness; (3) high recall: because they lack structure and understanding, many tasks may face a ceiling on accuracy, but recall-wise, learning usually performs well; (4) development efficiency: they can quickly scale to real application scenarios with big data.

The main limitations of the statistical school are: (1) the dependence on massive annotated data: this is the biggest knowledge bottleneck; (2) targeted debugging is difficult: a statistical system is more like a black box, which is a big defect for the maintenance and iterative, incremental enhancement of a software system; (3) lack of interpretability: whether a result is right or wrong, it is difficult to explain, which affects user experience and confidence. The main reason is the lack of an explicit structural representation and symbolic logic in the algorithm that people can follow.

The rationalist approach simulates human cognitive processes rather than relying on massive labeled data to imitate surface strings. Rationalism directly formalizes the experience of domain experts and uses an explicit rule system based on symbolic logic to simulate human intelligence tasks. In terms of natural language understanding, the grammar school formalizes the rules summarized by linguists so as to parse natural language in detail at all levels and achieve deep syntactic-semantic analysis. In this respect, rationalism has its natural advantages.

To sum up, the advantages of the rationalist rule-based school include: (1) strength in fine-grained tasks: very detailed analysis, such as deep syntactic-semantic parsing with logical reasoning; (2) accuracy: a rule system written by experts can readily guarantee high precision, although improving recall is usually a long iterative process; (3) debuggability: because the rule system is based on symbolic logic, it is easier to trace an error to its root; (4) interpretability: this also benefits from the understandable symbolic logic basis.

The main defect of the rule school is the low efficiency of manual coding; the dependence on expert coding is the knowledge bottleneck of the rule school. Given the same platform and mechanism, different levels of expertise yield different levels of quality. The two paths each have their own knowledge bottleneck, so to speak. One relies on a large quantity of "low-level" labor: labeling, though very monotonous, is work that can be assigned to ordinary students with a little training. The other relies on "high-level" labor from a few experts who code and debug rules, much as in software engineering; the training costs for such knowledge engineers are high, making it more difficult to scale up to the real world. Finally, the talent gap can also be regarded as a severe practical limitation of the rationalist school. Thirty years is exactly one generation, during which empiricism has occupied the mainstream stage and attracted almost all newcomers, causing a generational shortage of talent in the rationalist camp.

As for recall, one cannot simply conclude that a rule system with high precision is bound to have low recall. In practice, on the one hand, it is not at all difficult to strike a balance between precision and recall by deliberately relaxing rule conditions and sacrificing some precision. On the other hand, high precision can also be maintained: the more rules are added to the system, the more phenomena are captured, so recall rises naturally and incrementally over the iterations. In other words, recall is a function of the time and development resources invested, and it need not come at the cost of precision.
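
The two routes to higher recall can be illustrated with a toy calculation. The sketch below is only an illustration under stated assumptions: the gold set and the three system outputs are hypothetical item ids, not data from any real rule system, and `precision_recall` is a helper name introduced here.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two sets of item ids."""
    true_positives = len(predicted & gold)
    # Convention assumed here: an empty set yields 0.0 rather than an error.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical gold standard: ten phenomena the system should capture.
GOLD = set(range(1, 11))

# Hypothetical outputs of a rule system at three points in development.
ITERATION_1 = {1, 2, 3}                      # a few strict rules
ITERATION_2 = {1, 2, 3, 4, 5, 6}             # more strict rules added
RELAXED = set(range(1, 10)) | {11, 12, 13}   # rule conditions loosened instead

for name, output in [("iteration 1", ITERATION_1),
                     ("iteration 2", ITERATION_2),
                     ("relaxed", RELAXED)]:
    p, r = precision_recall(output, GOLD)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

In this toy setting, adding strict rules raises recall while precision stays at 1.0, whereas loosening conditions raises recall faster but lets false positives in.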

Q: Since each has its own strengths, why doesn't Chomsky, as the rationalist pioneer and father of computational linguistics, exert his due influence in the field of natural language processing? His impact has been waning, and newcomers to the field hardly hear of him.

A: Indeed it is. Although I am a rationalist, I also see that there is a considerable historical burden from this school that needs to be seriously reflected on from the perspective of formalism architecture.

Chomsky is the founder of modern rationalism, but the theory and practice he developed also involve some misconceptions. We must recognize these so that we can advance linguistic rationalism in symbolic logic steadily and deeply for natural language. In fact, after decades of theoretical exploration and practical experiments, the grammar school has seen its own theoretical limitations fairly clearly. Those who stuck with symbolic rule systems have found a path of innovation within the rationalist tradition and have made their own breakthroughs in deep parsing, the very core of natural language understanding, and in scaling it up to big data for real-life information extraction and text mining applications. That is what we are going to focus on in this series of interviews.

Q: I know you have great faith in rationalist symbolic approaches in general. However, you have also seen a number of misconceptions in Chomsky's theories. Which are the most critical?

A: In his formal language theory, there are two fallacies to my mind, one I would name the Recursion Fallacy and the other the Monolayer Fallacy.  In his linguistic theories, one of the very basic propositions of his linguistic revolution is "syntactic autonomy" or "self-contained syntax".  It has serious potential consequences for the analysis of certain languages such as Chinese.  His phrase structure tree representation, with its X-bar theory of syntax, is also worthy of reflection and criticism, especially when compared with the alternative dependency grammar and its representations for NLU. Let's look at the Recursion Fallacy first.

In my view, Chomsky's most misleading move was to use the so-called recursive nature of natural language to criticize finite-state pattern matching. The English examples of center recursion he cites are far-fetched and rarely found in real life, which makes it difficult to argue that center recursion is the nature of natural language. Nevertheless, a generation still chose to believe in his theory, taking it for granted that finite-state devices had to be abandoned in order to parse natural language.

Q: Isn't it generally accepted that natural language is recursive? How can you call it a fallacy?

A: Precisely because it is widely accepted, it is all the more misleading and consequential, and hence requires more serious critique.

Recursion in natural languages typically comes in two types: (i) right (branching) recursion and (ii) center recursion. Many people do not consciously make that distinction, but in computational theory they are two very different things. Right recursion is linear by nature, while center recursion is nonlinear, a completely different monster of much greater computational complexity. In natural languages, right recursion is fairly common and can at times be nested as many as seven or eight levels deep while still reading naturally and remaining easy to comprehend. Take this example of VP nesting:

(to request A (to beg B (to ask C (to do something))))

With right-branching recursive structures, we usually feel no burden in communication. The reason is that, although the left boundary of each right-recursive layer is in an uncertain position, all layers end at the same point on the right boundary, like this: (... (... (... (... (...... ))))). Thus we do not need a "stack" mechanism in memory to deal with it; it remains finite-state.
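To make the finite-state point concrete, here is a minimal sketch in Python. It assumes a toy pattern (the names `RIGHT_NESTED_VP`, `is_right_nested_vp` and `nesting_depth` are hypothetical, not from the interview): any number of "to VERB OBJECT" layers closed by a single innermost "to do something". A plain regular expression with no recursive construct accepts the chain at any depth, because all layers share one right boundary.

```python
import re

# Hypothetical toy pattern: zero or more "to VERB OBJECT " layers followed by
# one innermost "to do something". This is an ordinary regular (finite-state)
# pattern; nothing in it counts or remembers how many layers were opened.
RIGHT_NESTED_VP = re.compile(r"(?:to \w+ \w+ )*to do something")


def is_right_nested_vp(text: str) -> bool:
    """Return True if the whole string is a right-nested VP chain."""
    return RIGHT_NESTED_VP.fullmatch(text) is not None


def nesting_depth(text: str) -> int:
    """Count VP layers by counting 'to' tokens; no stack is involved."""
    return text.split().count("to")


example = "to request A to beg B to ask C to do something"
print(is_right_nested_vp(example), nesting_depth(example))  # True 4
```

The same pattern accepts eight or eighty layers without modification, which is the sense in which right recursion stays within finite-state power.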

Chomsky cannot criticize finite-state devices on the basis of right recursion, so he needs to base his argument on center recursion, a rarity in language. The fact is that natural languages show little center recursion. Center recursion is much like matching parentheses. You want the parentheses to match each other so you can express and understand the proper nesting structures, like this: { ... [ ... ( ...... ) ... ]... }. After as many as three levels of center recursion, our brain can no longer cope with the pairing complexity, which is why it's hard to find such phenomena in real-life language data.
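The parenthesis analogy can be sketched as follows. This is an illustrative Python fragment, not part of the interview; the function name `center_nesting_depth` is hypothetical. It shows why unbounded center nesting calls for a stack: every opener must be remembered until its own closer arrives.

```python
# Closing bracket -> the opener it must pair with.
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def center_nesting_depth(text: str) -> int:
    """Return the deepest level of properly matched brackets in `text`.

    Raises ValueError when brackets do not pair up. The explicit stack is the
    memory device that center nesting of unbounded depth requires.
    """
    stack = []
    deepest = 0
    for ch in text:
        if ch in OPENERS:
            stack.append(ch)  # remember the opener until its closer shows up
            deepest = max(deepest, len(stack))
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                raise ValueError(f"unmatched closing bracket {ch!r}")
    if stack:
        raise ValueError("unclosed opening bracket")
    return deepest


print(center_nesting_depth("{ a [ b ( c ) d ] e }"))  # 3
```

The stack grows with the nesting depth, which is the memory load that the three-level limit discussed here keeps small in real language.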

Q: I remember some examples of center recursion in English:

      The man who the woman who had lost all the keys was calling all day finally came...

A: Is this "human" language? Chomsky repeatedly attempts to teach us that not only is this human speech, it is the very nature of human language. To my mind, hardly any hypothesis about language is as far-fetched as this.

Q:  Let me try to understand what you mean: center recursion does not exist, or does not exist over three levels, so natural language is finite-state?

A: Well, it's not that it does not exist; it's just so rare and far-fetched, and it's never more than three levels deep unless you're pulling a prank. Therefore, it can by no means be the "nature" of natural language.

The very idea of unbounded center recursion in language, far from the observable facts, in effect violates the limits of short-term memory established in psychology. Where in the world do people talk like that, opening door after door without closing any behind them, in a maze-like castle of nested substructures within substructures? After a path of three opened doors, an average person gets lost in the maze. Even if you're a super linguist and can stand it, your audience is bound to be trapped. Is natural language meant not to communicate, but to make it deliberately difficult for people to follow you?  This contradicts the consensus that language was born for communication and serves the ultimate purpose of communication.

Using pranks and verbal games as evidence of linguistic competence and the nature of language is one of the most misleading aspects of Chomsky's recursion theory. This recursion trap leads many people to accept automatically that natural language is recursive and that we must therefore discard finite-state devices. Those who believe him are, on the one hand, influenced by his authority as the father of modern linguistics; on the other hand, they often mistake the more common and deeper right recursion for center recursion and take it as evidence for Chomsky's recursion hypothesis. Chomsky himself is intelligent and rigorous enough not to use readily available right recursion as evidence; he uses only center recursion as his argument.  But the effect is misleading all the same.

Q: I guess this is a typical behavior of mathematicians and philosophers: they pursue formal perfection. As long as it is theoretically impossible to exclude multi-level center recursion, it is required that the formal mechanism must have a built-in recursion mechanism. But practitioners of natural language understanding do not have to be bound by that theory, do they?

A: After all, a theory should stand on real-life natural language and its data, right?

In fact, in corpus linguistics research, some scholars have conducted a very extensive survey and found that so-called center recursion in natural language never exceeds three levels, and that three-level recursion is extremely rare [reference]. Natural center recursion beyond three levels is simply not found in a very large running corpus, not a single case. So why inflate a very limited center recursion into what seems like an infinite level of recursion, consider it the essence of natural language, and use it as an argument for choosing the formal model for natural languages? This has had serious consequences for computing and for NLU moving beyond the lab into applications.

In order to deal with theoretically infinite center recursion, the human brain, or computer memory, must have a "stack" device and a "backtracking" algorithm. Without going into the technical definitions of these terms, computer science has shown that stack-based backtracking is computationally expensive. Using it as a basic device for natural language severely impedes language parsing from leaving the laboratory. Specifically, Chomsky's "context-free grammar", with its built-in recursive devices, has no known linear-time parsing algorithm in the general case. The absence of a linear algorithm means that computing time is beyond control, which becomes a limiting factor in practice once parsing leaves the lab and meets big data. This is one of the fundamental flaws in his formal language arguments for natural language.

Q: I agree with you: there are only very limited levels, we don't have to stick to recursive grammars. But I still have a question. Short-term memory is a psychological concept, and most of us in computational linguistics believe that psychology has no place in linguistics. Don't you agree?

A: I don't agree. Psychological limitations have a direct effect on real linguistic phenomena; that is, psychological effects are reflected in linguistic phenomena. Real language phenomena, not imaginary ones, are the goal and final foothold of our natural language study. What we're dealing with is a data set shaped by a psychological constraint, and it is obviously inappropriate to handle it with a mechanism based on a hypothesis that disregards that constraint.

Q: But even with psychological restrictions added, don't real corpora still have recursion? If so, how can a device without formal recursion, such as a finite-state machine, handle the center-recursive structures that do exist, however few they are?

A: Not a problem at all. As long as the recursive structure is bounded, finite-state devices have no problem dealing with it. All we need to do is cascade a few more finite-state machines. Since there are at most three levels of center recursion, three machines and 3x the time suffice, which is still linear. Even 10-level center recursion is no big deal: just stack up 10 finite-state automata. In our deep parsing practice, we have applied up to 100 cascaded finite-state machines for very deep parsing, with high efficiency. This kind of finite-state pipeline system, often called cascaded FSAs, is essentially the same pipeline concept used in software engineering.
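The cascading idea can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the author's production system: brackets stand in for center-embedded constituents, and the names `INNERMOST` and `run_cascade` are hypothetical. Each layer applies one flat, non-recursive pattern that closes only the innermost level, so k layers handle nesting up to depth k.

```python
import re

# One layer = one non-recursive pattern: a bracket pair containing no brackets.
INNERMOST = re.compile(r"\([^()]*\)")


def run_cascade(text: str, layers: int) -> str:
    """Apply the same flat pattern `layers` times.

    Each pass scans the string once and replaces every innermost group with
    the placeholder "X", so each pass closes exactly one nesting level.
    """
    for _ in range(layers):
        text = INNERMOST.sub("X", text)
    return text


print(run_cascade("(a (b (c) d) e)", 3))        # X
print(run_cascade("(a (b (c (d) e) f) g)", 3))  # (a X g)
```

Three layers fully reduce the three-level input, while the four-level input is left with one unreduced level. The depth bound is a property of the cascade's length, not of any recursive device inside a single layer.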

Q: The Chomsky Hierarchy, named after Chomsky, is the most famous result of his formal language theory. It divides grammars into four types, type 0 to type 3, corresponding to different automata. What do you think of his hierarchy?

A: Chomsky's formal language hierarchy is like a hierarchical castle with four enclosing walls safeguarding inner cities. Each formal device is like an internal forbidden city. Here we particularly recommend and quote an insightful study of Chomsky Hierarchy by Prof. Bai, which I call  a "caterpillar" theory of natural language (S. Bai: Natural Language Caterpillar Breaks through Chomsky's Castle):

If we agree that everything in parsing should be based on real-life natural language as the starting point and the ultimate landing point, it should be easy to see that the outward limited breakthrough and the inward massive compression should be the two sides of a coin.  We want to strive for a formalism that balances both sides.  In other words, our ideal natural language parsing formalism should look like a linguistic "caterpillar" breaking through the Chomsky walls in his castle, illustrated below:

Prof. Bai also clearly sees that Chomsky's recursion theory is too far removed from linguistic facts, so he puts special emphasis on "real-life natural language". After all, formal systems serve as formalized models for natural language; that is, they need to provide an appropriate framework for what natural language actually looks like. The common answer shared by Prof. Bai and me is that a suitable natural language model needs to break through the walls inside the Chomsky Castle. Any single device in Chomsky's existing formalisms, when used to model natural language, is either too small to fit or too large, lacking appropriate restrictions. In both theory and practice, it is necessary to penetrate the walls of the Chomsky Castle and form an innovative formal system, so as to lay a good foundation for the revival of grammars in natural language modeling. In the formalization process of penetrating the walls, Prof. Bai has his own innovation, and I have mine. My proposition is to extend and overlay the finite-state mechanism, so as to establish a multi-layer rule system, from shallow to deep, for natural language deep parsing and understanding.

Do not look down upon finite-state machines, which may seem a very simple mechanism for pattern matching. When they are stacked layer by layer in a well-designed pipeline architecture, they can cope with very complicated structures and phenomena and reach a depth of language parsing never before achieved by traditional context-free grammars or other devices. Of course, the mechanism itself can be reinvented and recrafted, for example by incorporating the unification operation to handle reduplication, e.g. in Chinese, "看一看": V 一 V (literally look-one-look: "take a look").  Pattern matching rules can also eliminate ambiguities effectively by adding post-context conditions to the pattern matching device, similar to the "look-ahead" effect in backtracking algorithms.
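Both extensions have rough analogues in Python's standard `re` module, which the sketch below uses as a stand-in; it is not the author's formalism. A backreference plays the role of unification in the V 一 V pattern, and a lookahead assertion plays the role of a post-context condition. The English rule for "book" and the helper names are hypothetical illustrations.

```python
import re

# V 一 V reduplication: the backreference \1 requires the second verb to be
# identical to the first, a simple stand-in for unification.
REDUPLICATION = re.compile(r"(\w)一\1")

# Post-context condition (hypothetical rule): treat "book" as a verb candidate
# only when a determiner follows. The lookahead checks the right context
# without consuming it.
VERB_BOOK = re.compile(r"\bbook\b(?= (?:a|the)\b)")


def find_reduplications(text: str) -> list:
    """Return every V 一 V reduplication found in `text`."""
    return [m.group(0) for m in REDUPLICATION.finditer(text)]


def verb_book_positions(text: str) -> list:
    """Return start offsets where "book" passes the post-context condition."""
    return [m.start() for m in VERB_BOOK.finditer(text)]


print(find_reduplications("我想看一看书"))          # ['看一看']
print(find_reduplications("看一听"))                # []
print(verb_book_positions("please book a flight"))  # [7]
print(verb_book_positions("read the book today"))   # []
```

In both cases the pattern stays flat: the extra condition narrows what matches without introducing a recursive rule.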

It is worth emphasizing that preserving linearity is the premise of any innovation in the formalism. No matter how we extend the finite-state mechanism, one goal remains unchanged: it must retain the essential characteristics of finite state to ensure "linear speed". We use a multilayer cascade to bypass the recursion trap, thereby eliminating the biggest hidden obstacle to linear speed. Since a constant multiple of linear time is still linear, cascading finite-state machines does not forfeit the system's linear-time benefit. Computationally, handling three-layer recursion takes only 3x the processing time, which does not affect the scalability of the system. In fact, the multi-layer systems we have deployed usually have more than 50 layers. Our Chinese system sometimes cascades up to 100 layers in the architecture, where capturing recursive structures is just one relatively simple task.
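A back-of-the-envelope version of this claim can be written down under two assumptions that the interview does not state explicitly: layer i makes a single pass over its input at a cost of at most c_i per token, and no layer lengthens its input. For an input of n tokens and a fixed number of layers k:

```latex
T(n) \;=\; \sum_{i=1}^{k} c_i\, n \;\le\; k\, c_{\max}\, n \;=\; O(n),
\qquad c_{\max} = \max_{1 \le i \le k} c_i .
```

The number of layers k enters only as a constant factor, so 3 layers cost roughly 3x and 100 layers roughly 100x the time of one pass, while growth in n stays linear.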

Q: That's fascinating, and very imaginative, too. Clearly you and Prof. Bai have both accumulated years of practice and dived deep into natural language, which is how you arrived at the insights summarised above on breaking through the internal walls of the Chomsky Castle. OK, so the first issue with Chomsky's formal language theory is the recursion fallacy. What's the second fallacy?

A: The second major problem with Chomsky's formal language theory, briefly mentioned above, is what I call the Single-layer Fallacy.

Turn to the chapter on parsing in a computational linguistics textbook: the typical parsing algorithm, known as chart parsing, is usually introduced using the formalism of a context-free grammar (CFG). A CFG contains recursive calls in its rules to cover recursive structures, a point Chomsky emphasizes as the key feature of natural language. The rule system is applied in a single search space on a single plane, so chart parsing can be illustrated on a flat chart. A successful parse is represented by one or more search paths that cover the entire sentence.

[consider a chart parsing sample.]

The essence of single-layer parsing is like cooking a hodgepodge.  Everything in an input string, from morpheme to word, from word to phrase, from phrase to clause, and from clause to complex sentence, is processed in the same space.

Q: So Chomsky wants to solve everything at once. Isn't that good?

A: The problem is that there are three main disadvantages. First, there is no linear algorithm. Many people have tried, but they just can't find one; the search is a combinatorial explosion.

The second disadvantage is that it is not suitable for modular development, because the surface or shallow level language phenomena and the deep language structures are all mixed on one plane.

The third disadvantage is the so-called "pseudo-ambiguity" issue. "Pseudo ambiguity" is in contrast to true ambiguity. If there is one true ambiguity in the input sentence, the correct identification is for the parser to produce two parses to express the ambiguity. "Pseudo-ambiguity" means that a sentence is not ambiguous in people's understanding, but the parser still outputs several parses, which are all considered to be grammatical.

The problem of pseudo-ambiguity is a recognized challenge for single-layer parsers. Even for a simple sentence, traditional parsers based on context-free grammars often produce dozens or even hundreds of parses. Most of the time, the differences are so subtle that they make no difference in communication. The consequence is that the very few true ambiguities are hidden among many false ones. In effect, the parser completely loses the ability to resolve ambiguity. Naturally, such a single-layer grammar approach is difficult to deploy for parsing and semantic decoding of big data.
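How quickly parses multiply can be seen in a small chart-style sketch. It assumes a deliberately tiny toy grammar, N -> N N | noun, for noun compounds (the grammar and the function name `count_parses` are illustrative assumptions, not taken from the interview). The chart is filled in the usual CYK manner, but each cell stores the number of trees rather than the trees themselves.

```python
def count_parses(n_words: int) -> int:
    """Count the trees that the toy grammar N -> N N | noun assigns to a
    compound of `n_words` nouns, using a CYK-style chart.

    chart[i][j] holds the number of N-trees spanning words i..j (end exclusive).
    """
    chart = [[0] * (n_words + 1) for _ in range(n_words + 1)]
    for i in range(n_words):
        chart[i][i + 1] = 1  # lexical rule: N -> noun
    for width in range(2, n_words + 1):          # span length
        for i in range(0, n_words - width + 1):  # span start
            j = i + width
            for k in range(i + 1, j):            # split point for N -> N N
                chart[i][j] += chart[i][k] * chart[k][j]
    return chart[0][n_words]


print([count_parses(n) for n in range(1, 9)])
# [1, 1, 2, 5, 14, 42, 132, 429]
```

Even this one-rule grammar assigns 429 structurally distinct trees to an eight-noun compound, most of which a human reader would never distinguish. The three nested loops are also the source of the cubic cost that standard chart parsing incurs.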

Q: Lao Li, I think I am starting to understand the drawbacks of the single-layer parsers you discussed. Could you elaborate on why it is not a feasible model for real-life applications?

A: Too big a search space, and too many parses.  In essence, the system makes all possibilities explicit, low-probability events as well as high-probability ones, all in the same search space. The idea makes sense in theory: any small possibility is still a possibility, so a perfect theoretical model cannot block any path in advance. As a result, you have to keep every search path alive until the global path is complete. The space in which parsing takes place therefore becomes a space of combinatorial explosion, so there is no efficient algorithm for it.

Q: Why isn't a single layer suitable for modularity?

A: There is no modularity at all in a single layer. A single-layer approach means that the whole parsing process is one module; single-layer means non-modular. Its theoretical basis does have some truth to it. It says that language phenomena are interdependent, and that a complete language analysis scheme cannot completely separate them. Even structures as shallow as word segmentation and the boundaries of basic phrases are difficult to determine outside the overall structure of the sentence. This is because a locally sound structure can always be overridden in a larger context.

(for instance)

From this interdependent, locally subordinated global perspective, structural analysis, once cut up, creates a chicken-and-egg problem. To deal with this problem of interdependency, theoretically, a single-layer model makes sense. In a single-layer system, all the interdependent phenomena are explored in the same plane according to the global paths as solutions. That forms, of course, an argument against multiple layers, that language phenomena are interrelated, so we can hardly treat them by first cutting them into multiple layers.  Interdependency in a modular pipeline is very susceptible to "premature pruning" of branches. To be honest, if we leave aside the pseudo-ambiguity problem and the non-linear speed from the single-layer system design for a moment, it is quite difficult to refute the above argument against the multi-layer system design. However, single-layer is not very feasible in practice. The consequences of a single layer far outweigh the benefits, and the concern on premature pruning in a multi-layer system actually has its own countermeasures.

Q: Your point of view is not quite the same as my understanding of modularity. In my understanding, a module is actually a concept without hierarchy. Just like with bricks, you can build roads, it's like a complete horizontal jigsaw puzzle of bricks. Of course, you can also build a wall in which case bricks are hierarchical. It goes up one level at a time. So, in my understanding, modularity and hierarchy do not have to be correlated. Does it make sense?

A: Yes, you're right. Modules are bricks. They do not have to have layers. If there are layers, as in building a wall, then the modules must be arranged in a sequential architecture. But it is also possible that there is no sequential dependency between modules and layers. Such modules are defined from an angle beyond layers, which is like paving a road. Road paving does not have to be serial; it can be parallel. In practice, modules may still be arranged in a uniform pipeline, combining the style of road paving with the style of wall building.

Modularity itself is a seasoned practice that comes from software engineering.  That is, when building a complex system, we always attempt to divide tasks into subtasks and sub-subtasks. Modularity makes the development process more tractable and easier to maintain. Natural language is undoubtedly a fairly complex system. Faced with a complex object like language, a good way is to emulate the approach that has worked in engineering for years. That is to say, the task should be reasonably decomposed and cut into modules as far as possible to implement modular development.

Thanks to http://fanyi.youdao.com/ based on which this translation is revised and polished by the author himself.  This is the first chapter of our book on NLU which consists of 10 interviews on key topics of AI symbolic logic as used in natural language parsing. Stay tuned.

[References]

S. Bai: Natural Language Caterpillar Breaks through Chomsky's Castle

 

S. Bai: Fight for New Portals

Author: Bai Shuo

Recently, Amazon's AI product Echo and its voice assistant Alexa set off a whirlwind in the industry.  It has drawn attention from not only the smart home industry but also the AI start-ups as well as the IT giants.  So, what exactly is unique about Alexa?

Some people say that Alexa has solved the challenging "cocktail party" problem in speech recognition: imagine a noisy cocktail party where a person is chatting with you; the voice is not loud, yet you can accurately capture the speech while ignoring the loud surrounding noise. Alexa models this amazing human capability well, which is said to be missing from other leading speech players, including the global speech leader USTC iFLYTEK Co.

Others say that behind Alexa lies very rich cross-domain know-how: one can ask Alexa for on-demand programs, buy goods and services through it, instruct it to control the various appliances in the home, or ask it about all kinds of news.  All in all, this is a voice assistant backed by a strong service (with some resources local, and more in the cloud).  Apple's Siri and Microsoft's Xiaoice ("Little Ice") are believed to be no match for Alexa in terms of these comprehensive capabilities.

The excellent performance of the end device, coupled with the huge cloud resources in support of the end, constitutes Alexa's expected success in customer stickiness, leading to its legendary value as an information portal for the family.  That seems to be a good reason for Alexa's impressive market performance in the US.  A considerable number of people seem to realize that this may represent a huge business opportunity, one that simply cannot be missed without regret.  Although Alexa's performance in markets beyond the United States is not as eye-catching as in the US market, the Alexa whirlwind has still been sweeping the world, creating the industry's greatest buzz and triggering a long list of smart speaker imitations.

Hence the questions: What are the effects of Alexa? Who will be affected or even replaced?  How should we evaluate Alexa's portal value? Where is this trend going as we look at its yesterday, today and tomorrow?

We may wish to reflect a bit on the development of portals in the history of the IT industry.  The so-called "portal" is an entry point or interface for an information network with large data flow, connecting consumers and services.  From the model perspective, we have experienced the "web portal" model, the "search engine" model and, more recently, the "social network" model, with the ongoing trend pointing to a portal in the "artificial intelligence" mode. From the carrier perspective, the carrier for the "web portal" and "search engine" models is basically a PC, while the carrier for the "social network" model is mainly smartphone-based end equipment. Does the "artificial intelligence" model have the potential to change the carrier? In other words, is it possible for the Echo-Alexa hardware-software combination, under the banner of artificial intelligence, to win the portal from the smartphone as the point of choice for the human-machine interface?

I don't think it is possible.  There are three reasons.

First, the scene is wrong. Even if Alexa is powerful with unique anti-noise ability and the skills of tracking specific people's speech, since its location is fixed, it is a huge regression from today's well-developed mobile scenes.  Just think about it, the biggest feature of a family scene is two or more individuals involved in it.  A family is a small society with an innate structure.  Who has the right to issue voice commands? Who has the authority to deny or revoke the voice commands that others have already issued? What happens if the authoritative person is not at home or keeps silent? What if a family member intends to send a private voice instruction? To my mind, voice instruction as a human-machine interaction vehicle by nature involves behaviors of an individual, rather than of a family, with privacy as a basic need in this setting.  Therefore, the family voice portal scene, where Alexa is now set, is likely to be a contradiction. The more voice commands that are parsed and understood, the less will be the proportion of the voice commands that take the home scenes as a necessary condition.

Second, the "horizontal" mode of portal faces "vertical" resistance.  Even if we agree that the "smart home central control" is a portal of access to end users that no player can afford to miss, smart speakers like Alexa also face challenges from other types of end equipment.  There are two types of data flow in the smart home environment.  The horizontal mode involves data flow across home equipment from different manufacturers.  The vertical mode gathers data from the same manufacturer's home equipment.  It can be seen that the "horizontal" effort is bound to face "vertical" resistance in a life-and-death struggle.  For example, Haier, which manufactures smart refrigerators and other smart home equipment, has no reason to let go of its valuable data and let it flow away to smart speaker manufacturers.

Third, the same struggle also comes from other competitions for the "horizontal" line of equipment, including house robots, home gateway / intelligent routers, smart TVs, intelligent pendants and so on.  The advantage of the house robots is that their locations need not be fixed in one place, the advantage of the home gateway is that  it always stays on, the TVs' advantage lies in their big screens, and intelligent pendants (such as picture frames, sculptures, watches, scales, etc.) have their respective advantage in being small.  In my opinion, smart speakers face all these "horizontal" competitions and there does not seem to be much of a chance in winning this competition.

In summary, the Echo-Alexa's success comes with a strong superposition characteristic. It is essentially a success of the Amazon business system, rather than the success of smart home appliances or the voice assistant technology. Ignoring the role of its supporting business system, we are likely to overestimate the value of the family information portal, and by simply mimicking or following the smart speaker technology, there is no way out.  Personally, I feel that the smart phone as the carrier of an entry point of information in the mobile Internet era still cannot be replaced.

Is the era of voice interaction really coming?

One important reason for the IT giants to look up to Alexa is that the voice interaction represented by Alexa perhaps opens a new paradigm of human-computer interaction.  Looking back in history, the rise of the click-mode and the rise of the touch-mode have both triggered a revolutionary paradigm shift for human-computer interaction, directly determining the rise and fall of the IT giants. The click-mode led to the rise of Wintel, the touch mode enabled Apple to subvert Wintel: we have witnessed all these changes with our own eyes.  So if the voice interaction really represents the next generation paradigm for human-computer interaction, then Alexa has a special meaning as the precursor of the human-computer interaction paradigm shift.  The giants simply cannot overlook such a shift and its potential revolutionary impact.

However, personally, I do not think that the speech interaction alone carries the weight for an "intergenerational revolution" for human-machine interaction.   There are three reasons to support this.

First, the speech itself does not constitute a complete human-computer interaction scene.  People's information intake, more than 80% of times, involves the visual information.  When speaking, we often take some visual information as basic context, through the use of a pronoun to refer to it.  For example, pointing to a book on the screen, one may say, "I want to buy this." In other words, a considerable part of the context in which the speech is delivered comes from the visual presentation, ranging from gestures, touches or eye movements that target some visual objects. This at least shows that we need multi-modal human-computer interaction, rather than using voice alone to replace other human-computer interaction vehicles.

Second, current speech recognition still cannot handle dialects well.  China is a big country with a variety of dialects.  Beyond the dialects themselves, people in dialect areas also speak Mandarin with a strong accent. To benefit the more than half of the total population living in dialect areas, speech technology still needs to go through a stage of further development and maturity.

Third, the current speech recognition still has difficulty in solving the "escape" problem. The so-called escape problem involves the identification of scenarios when the speech refers to itself.  When people find there is an error in the first utterance and there is a need to correct it, they may choose to use the next sentence to correct the previous sentence, then this new sentence is not part of the naturally continuous speech commands, hence the need for "being escaped".  But it is also possible that the latter sentence should not be escaped, and it is a sentence conjoined with the previous sentence, then it is part of the normal speech stream.  This "escape" identification to distinguish different levels of speech referents calls for more advanced semantic analysis technology, which is not yet mature.

So, considering the current level of speech technology, it seems too early to talk about the "intergenerational revolution".  Furthermore, speech may well be just one factor, and not necessarily a disruptive one.  It seems more reasonable to state that the future of human-computer interaction may enter an era of multi-modal input, rather than speech alone.

The semantic grounding is the key to the stickiness of users.

Semantics as a term seems abused in all kinds of interpretations.  Some even think that once words are identified, semantics is there, which is far from true. The semantics of natural languages is very deep and involves a lot.  I mean a lot!

From the academic point of view, semantics is divided into two parts.  One is called "symbol grounding", which concerns the relationship between a language symbol (signifier) and its referent, an entity in the real world (including the conceptual world).  The other is called "role assignment", which concerns the relationships among the referents of the language symbols in reality.  Siri pioneered mobile semantic grounding, realized in domain apps such as Address, Map and Weather.  The past few years have seen the scope of semantic grounding grow wider and wider.

Let me review what I said before: "the excellent performance by the end equipment, coupled with the huge cloud resources in support of the end, constitute the Alexa's expected success in users' stickiness".  We can explore further along this line in this section.  Between "the performance by the end equipment" and "the cloud resources in support of the end", which is the root cause of Alexa's stickiness with customers?  I do not intend to play the trick of dialectical balance by saying that both are important and neither can do the job without the other.  That is always true but cheap, and it gives no actionable insight. Its consequences include possible blind investment in both by copycats, and such investment may well lead to complete failure in the market.

The author argues that "the performance by the end equipment" is about the adaptability of the hardware to the scene.  This is at best about a "good live experience" for users. But a product with a "good user experience" and no real content will soon degrade into a toy, and not even a high-end toy.  If there is no real "meaningful service" associated with it, there will be no sustainable customer stickiness. Without user stickiness, such products cannot become sustainable data collection entry points as a data flow portal.  However, any associated "meaningful service" must come from semantic grounding, that is, the connection between a speech command and its corresponding actual service.  This is the essence behind Alexa's so-called "know-how."  Semantic grounding as mentioned hereafter always refers to this connection between a speech command and the infinitely many possible actual service resources.

Comprehensive semantic grounding requires a strong open-domain NLP engine. Service resources are so diverse in tens of thousands, and they can hardly be confined to one or only a few narrow domains.  An NLP engine functioning only in a narrow domain cannot do this job well.  To work in the open domain requires an engine to be equipped with extraordinary capacity in the semantic analysis, and it must be on the right path in the semantic knowledge representation and processing.  In this regard, even if an English engine is doing decently well, it does not necessarily mean the Chinese counterpart will work well.  For those who do not yet understand the difficulty and pain points of the Chinese NLP engine in the open domain, it is hardly possible to expect them to achieve large-scale semantic grounding effects. Such technology barriers can set apart a huge gap in products attempting to do the same thing in the market between companies equipped with or without deep semantic capabilities.

Semantic grounding requires engineering adaptation at the interface to the service resources.  This is also a very difficult task, and it involves competition in the scale of resources as well as in efficiency and management. Start-up companies can hardly have such resource integration capacity and engineering organization capabilities; these are the strengths of large companies. Some people ask: can I start small and gradually scale up? My answer is no; time waits for no one.  In the area of semantic grounding, if products are not developed in a relatively short time to capture the market, there is little chance of survival.

Semantic grounding also calls for the ability to manage the man-machine interactive scene itself. This involves a variety of technologies such as contextual perception, topic switching, sentiment analysis, language style selection, personality shaping and many others. A speech assistant is not necessarily the best if it only mimics human eloquence or seemingly likable ways of expression. Moderate profoundness or sharpness in argument, and even some rudeness at times, can all be selling points for an intelligent assistant.

Therefore, we would point out the key role of semantic grounding in the stickiness of Alexa users, emphasizing the decisive contribution of the large service resources behind Alexa's success story.  In China, unless Chinese IT giants with service resources comparable in size to Amazon's take the lead, backed by a solid open-domain Chinese NLP engine and a star team, speech technology alone has no way to generate the kind of user stickiness we see in Alexa.

Who will win then?

In essence, it is all about gathering user data through the end equipment.  Smartphones have dominated the industry for years, and all kinds of smart home solutions across the verticals have also been fighting for several years now.  Alexa's arrival has stirred the industry with a lot of excitement and revelations, but the outcome is far from settled.  We still have opportunities.  But keep in mind that it is impossible to overemphasize two issues: the combination of the end devices with the cloud, and the combination of the entry point with the entry point carrier, which together form a closed-loop data stream.  If we lose our sense of the directions and trends in these issues, the opportunity will not be ours.

So what is the direction and what are the trends? Let me give an analysis.

First, artificial intelligence is bound to be the next-generation portal. In other words, all kinds of service needs will inevitably travel from the end devices to the cloud through AI analysis of multi-channel input, leveraging the advantages of human-computer interaction.  The variety of service resources will eventually draw on the knowledge and cognitive decision-making ability of artificial intelligence to reach users from the cloud to the end. If you do not lay out a roadmap for developing artificial intelligence, the future portal is definitely not yours.

Second, the smartphone will stay the de facto chief carrier for a long time to come. Wherever a person goes, the communication node and the digital identity follow, and so do the perception of the life scene and the apps that act as service agents. No other end device matches the smartphone on the most critical dimensions that a portal carrier needs: individuality, privacy, and ubiquity.

Third, the communication function of a terminal device will be separated from the service function demanded of it. As services grow more and more diversified, it becomes impossible for one end device to handle all types of service needs.  But it is not desirable for each end device to come with its own communication function.  The relationship between Apple Watch and iPhone is intriguing in this regard: iPhone serves as the communication hub as well as the client information processing hub, while Apple Watch functions as a special device for information collection and limited information display.  They are connected through a short-range wireless link.  Of course, both are Apple products in one family, so the data flow is under unified control.  In such a setting they are tightly coupled, and the separation is always limited. However, this mode sheds light on a future in which all kinds of separation may be required while the devices still need to be connected in some way.  If mobile phone manufacturers keep an open mind, they can use blockchain technology when collecting data with a variety of ancillary equipment, making an objective record of the respective contributions and arranging the sharing of data and proceeds accordingly. A loose coupling of the separated functions will then evolve and mature, promoting the rapid ecological development of end devices in all kinds of forms. It is imaginable that, when we are in a new place, we can take a soft, thin, foldable electronic map out of our pocket.  This map, when unfolded, looks as big as a real paper map, but it works as conveniently as a mobile map app: it responds to touch operations and may even accept speech instructions by associating with our phone. Of course, this map can also simply be a virtual projection, not necessarily a real object.  Our phone only needs to take care of communication; all the control and display are accomplished on the map, and we do not even need to physically take out the phone. Such a phone may never need to be held in the hand. We may even wear it on the foot, and the handheld mobile device gradually evolves into a "foot phone" ...

Are you ready for the opportunity and inspirations brought by the Alexa whirlwind?

Translated by: Dr. Wei Li based on GNMT


Trap of Information Overdose

Today, my topic relates to the issue of information overload.

We are born in the era of big data and information overload. As an NLPer (Natural Language Processor), for years I have been stuck in the belief that my sole mission is to help solve this problem of information overload. Just as Alibaba’s Jack Ma envisions that there should be no barriers for any business in the e-commerce world, my colleagues and I in the community seem to share the vision that there should be no barriers to instant access to any information amid the big data. So Google appeared, with crude keywords as its basis and an insatiable appetite to cover as much data as possible, and solved the long-tail problem of information. Today, whatever your query and however rare your information need, you google it and you get some relevant info back. We do not want to stop there, so we began to criticize Google because its solution to the long tail has the defect of poor data quality. Hence AI (Artificial Intelligence) is proposed and practiced to enhance the deep processing of data (whether via deep learning or deep parsing), in an attempt both to handle big data for its long tail and to drastically raise data quality through natural language understanding (NLU). The aim is to satisfy any soul with information needs, whether explicitly requested or implicitly carried in the mind, with a steady flow of quality information. This is the perspective of us practitioners, currently mixed with lots of excitement and optimism.

Let us change our perspective and ask ourselves, as consumers: how have we benefited from this exciting AI battle on information overload? Indeed, what we now get is more and more data: to the point, high-quality, with constant and instant feeds, which we have never before been able to reach. Previously we were drowned in the information ocean, mostly garbage with an occasional pearl; nowadays we end up being choked to death by over-satisfaction with quality information, thanks to the incredible progress of information high-tech via AI. So the feelings are dramatically different, but the ending remains the same: both are an inescapable path to death, drowned or choked. Each day we spend more and more time on social media among our circles of friends, on all types of news apps or entertainment apps, with less and less time for real-life work, family and serious thinking. Numerous geniuses out there (many are my talented peers) rack their brains to study our preferences, how to make us stick to their apps, and what tricks they can apply to drive us crazy and addicted to their products.

It is the iron law that a person is no match for a calculated and dedicated world. Made of flesh and blood, each individual consumer is no match for an invisible legion of tech gurus (including myself) from businesses and their accomplices in the information industry, looking closely into our behavior and desires. So we are bound to sink to the bottom, and eventually become a slave of information. Some of us begin to see through this trap of information overdose, struggling hard to fight the addiction, and seeking self-salvation against the trend. Nevertheless, with the rapid progress of artificial intelligence and natural language technology, we see the trend clear, unstoppable and horrifying: more and more are trapped in the info, and those who can save themselves with a strong will are a definite minority.

The world has n billion people and m million organizations, each producing information non-stop every moment, which is now recorded one way or another (e.g. in social media). Even if we raise the bar for our information needs, for work and for pleasure, higher and higher, to an incredible ratio of something like one in ten million, using a variety of technology filters, we are still faced with info feeds from n hundred human entities and m organizations. There is simply no way in our lifetime to exhaust it all and catch up with its feeds. We end up feeling over-satisfied with information, most of which we feel we simply cannot and should not miss. We are living in the terrible bliss of an over-satisfying world. As consumers we are doomed in this battle against our own nature, trying to resist a temptation that by nature cannot be resisted.

Having pointed out the problem, I have no effective remedy to offer. What I do myself is, at times, simply shut down the channels and stay in info-diet or hungry mode, focusing on family and the accumulated to-do list of work. This seems to work: I often get my work done without feeling that I missed much during the "diet" period. But it is not a sustainable technique, with the exception perhaps of a very few super guys I know and admire; I really cannot tell whether that lifestyle is for the better, since shutting the info channels for too long has its own side effects, to my mind. In the end, most of us fall back to being willing slaves of information. The smarter minds among us have learned to shift between these two modes: shutting channels down for some time, then going back to the "normal" modern way of information life.

For people who want and need to kill time, for example the retired in lonely senior homes, the info age is God-sent: the quality of their time-killing has never been better. But what about the younger generation, who are most vulnerable to info overdose, much as they are to addiction to the crazily popular games of today? The "shutting the channels" technique is a survival skill of the middle-aged generation, who need to dedicate sufficient time to their daily work and life, making a living, supporting the family and keeping it running. But this technique is almost impossible for the young generation to practice, given that they were born in this info age, and social media and the like are part of their basic lifestyle. There is no shortage of struggle and helplessness, as we observe, when they are drowning in the sea of games, social media and the Internet in the face of academic pressure and career training competition. The external world is not in the least prepared and is basically unable to help them. Neither are we parents. Many times we ourselves cannot resist the temptation of the information trap, so how can we expect our next generation to learn the balancing skill easily, considering they are at the age of exploration, with tremendous curiosity and confusion?

Sometimes I ask myself: why should we work so hard on info technology if we know it has both positive effects and a huge negative impact that we have no clue how to fix? After all, we do not need to rush to have all of life and time engulfed by info, no matter how high-quality we can make it. Meanwhile, I really hope to see more and more study invested in how to help people resist the temptation of the information trap. The ideal world, in my understanding, is one where we are equipped both with intelligent tools that help us access quality information as nutrients to enrich our lives, and with tools that help us resist the temptation of info over-satisfaction.

Translated and recompiled from the original post in my Chinese blog: 【杞人忧天:可怕的信息极乐世界】 (Needless Worry: The Terrifying Blissful World of Information)

 


Small talk with Daughter on US Election

Just had a small talk with Tanya on the US election. She was super angry, and there was a big demonstration against Trump in her school too.

T:
I don't want him to win
I don't want him to do well
Or else another racist gets elected

Me:
neither did I
IF he does very badly, he will be impeached;
or at least he will not be reelected in 4 years.
But now that he is, we can keep an open mind.
There is an element of sentiment he is representing: so-called silent majority, that is why most polls were wrong.

By the way, many have praised my social media analysis just before the election; mine was way better than all the popular polls such as CNN's.  This is not by accident: it is the power of big data and high tech in the information age:

Final Update of Social Media Sentiment Statistics Before Election

With deep NLP and social media, we can pick up sentiments in a way that is far more reliable and statistically grounded than the traditional polls, which usually call only 500 to 1,000 people and hope they represent 200 million voters.  My mining and analysis are based on millions and millions of data points.  So in future, we have to bring automatic NLP into things like this as one important indicator of insights, public opinions and sentiments.

T:
daddy
you're amazing
Your technology is amazing

Me:
I got lots of compliments for that, but yours mean the most to me.

What happened in the election as I had been tracking using our NLP sentiment tool was:

1. Clinton was clearly leading in the period after the recording scandal of Trump and before the FBI started reopening Clinton's email case: Big data mining shows clear social rating decline of Trump last month.

2. Clinton has always been leading in Spanish-speaking communities and media, but that did not seem to be sufficient to reverse the outcome:  Trump sucks in social media big data in Spanish.

3. The FBI's re-opening of the email investigation did Clinton the most damage: Trump's scandal was cooling down and all the attention was drawn to Clinton's email case, so that sentiment for Clinton dropped sharply (【Social media mining: big data tells us Clinton's campaign is in trouble】).

4. When the FBI finally issued a new statement, only 2 days before the election, that there was no evidence to charge Clinton, time was too short to remedy the damage done by its reopening of the case: my big data tracking found that it helped somewhat, but not significantly (【Daily big data tracking of the US election: Clinton strikes back and drags Trump down】).

5. Then, just before the election, I did a final update of the big data sentiment tracking for the last 24 hours versus the last 3 months, and found that Trump had a clear lead in public opinion and sentiment. So I decided to let the world know, although at that point almost everyone believed that Clinton was almost sure to win.

T:
Oh my god dad your machine is the smartest tracker on the market
Dad your system is genius
This is exactly what media needs
You should start your own company
This is amazing
I think this would be the planets smartest machine

Me:
I do not disagree :=)

It was in fact a tight competition, and with good skills the result could have turned out differently.  In terms of the popular vote, the two are too close to be statistically different, so anything at the right timing could have changed the result.

In retrospect, the FBI did a terrible thing in messing with the election:
they reopened a case whose outcome they did not know
just 10 days before the election, which made a huge difference.
On the other hand, the recording scandal was released too early,
so that although it hurt Trump severely at the time, it allowed the FBI to divert the attention to Clinton.

In future, there should be a strict law disallowing a government agency,
which is politically neutral by nature, from messing with an election within a certain time frame. To my mind, Trump's win owes 80%+ of its credit to the FBI events.
What a shame

 


Pulse: Real-Time Public Opinion Tracking of the US Election, live feed, real time!

http://www.netbase.com/presidential-elections2016/

Clinton has been mostly leading the social media sentiment:

Screenshots at 4:50pm 11/8/2016: [five screenshot images]

Again go check our website live on Pulse:

http://www.netbase.com/presidential-elections2016/

 


 

Final Update of Social Media Sentiment Statistics Before Election

Final update before election:

brand-passion-index-1

timeline-comparison-2
Net sentiment in the last 24 hours: Trump +7; Clinton -9.  This is the last-day analysis of social media.  Buzz:

timeline-comparison-3
So contrary to popular belief, Trump is actually leading in social media just before election day.

Compare the above with last month's ups and downs to put it in a larger context:

brand-passion-index-2
Sentiment over the last 3 months: Trump -11; Clinton -18.
Buzz for Trump never fails:

timeline-comparison-4

Trump's Word Clouds:

sentiment-drivers-6

sentiment-drivers-7

sentiment-drivers-8

Clinton's Word Clouds:

sentiment-drivers-9

sentiment-drivers-10

sentiment-drivers-11
Trump 3-month summary:

trumpsummary3m

Clinton 3-month summary:

clintonsummary3m

Ethnicity:

ethinic

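RW:
Brother Wei's stuff is good as far as it goes, but it does not reflect the US electoral system.
Xin:
The main issue is that the proportions of white, black and Asian posters do not represent the actual percentages of voters.
RW:
In theory, as long as one side gets 23% of all the votes, he or she can be elected.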

 


Trump sucks in social media big data in Spanish

As promised, let us get down to the business of big data mining of public opinions and sentiments from Spanish social media on the US election campaign.

We know that in the automated mining of public opinions and sentiments for Trump and Clinton that we did before, Hispanic Americans are severely under-represented, with only 8% Hispanic posters in comparison with their 16% share of the population according to the 2010 census (widely believed to be more than 16% today), perhaps because of language and/or cultural barriers.  So we decided to use our multilingual mining tools to do a similar automated survey of Spanish social media to complement our earlier studies.

This is Trump as represented in Spanish social media for the last 30 days (09/29-10/29). The key is his social rating as reflected by his net sentiment of -33% (in comparison with his rating of -9% in English social media for the same period). That is way below the freezing point and it really sucks, as also illustrated by the concentration of negative Spanish expressions (red font) in his word cloud visualization.

The net sentiment of -33% corresponds to 242,672 negative mentions vs. 121,584 positive mentions. In other words, negative comments on Trump are about twice as many as positive comments in Spanish social media in the last 30 days.
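The relationship between the two mention counts and the -33% figure quoted above can be checked with a few lines of Python. The post does not spell out the exact net sentiment formula, so the definition used here, (positive - negative) / (positive + negative), is an assumption; it is chosen because it reproduces the reported value from the reported counts. The function name `net_sentiment` is hypothetical.

```python
def net_sentiment(positive: int, negative: int) -> float:
    """Assumed definition: (pos - neg) / (pos + neg), as a percentage."""
    total = positive + negative
    if total == 0:
        raise ValueError("no sentiment-bearing mentions")
    return 100.0 * (positive - negative) / total


# Mention counts reported for Trump in Spanish social media (09/29-10/29).
trump_spanish = net_sentiment(positive=121_584, negative=242_672)
print(round(trump_spanish))          # -33
print(round(242_672 / 121_584, 2))   # 2.0, the negative-to-positive ratio
```

Under this assumed definition, a score of zero (the "freezing point") means positive and negative mentions are balanced, and -100 would mean every mention is negative.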

This is the buzz in the last 30 days for Trump, in mentions and potential impressions (eyeballs): millions of data points, and indeed a very hot topic in social media.

This is the BPI (Brand Passion Index) graph for directly comparing Trump and Clinton for their social ratings in the Spanish social media in the last 30 days:

As seen, there is simply no comparison: to refresh our memory, let us contrast it with the BPI comparison in the English social media:

Earlier, in one of my election campaign mining posts on Chinese data, I said that if only the Chinese were to vote, Trump would fail horribly, as shown by Clinton's big lead over Trump:

This is even more true based on social media big data from Spanish.

This is the trend comparison of passion intensity between Trump and Clinton:

The visualization of the same passion intensity data by weeks, instead of by days, shows even more clearly that people are very passionate about both candidates in the Spanish social media discussions; the intensity of sentiment expressed for Clinton is slightly higher than for Trump:

This is the trends graph for their respective net sentiment, showing their social images in Spanish-speaking communities:

We already know that there is simply no comparison: in this 30-day period, even when Clinton dropped to her lowest point (close to zero) on Oct 9th, she was still way ahead of Trump, whose net sentiment at the time was -40%. In any other time segment, we see an even bigger margin (a gap as big as 40 to 80 points) between the two. Clinton has consistently been leading.

In terms of buzz, Trump generates more noise (mentions) than Clinton consistently, although the gap is not as large as that in English social media:

This is the geo graph: the social data come mostly from the US and Mexico, with some from other Latin American countries and Spain:

Since only Spanish speakers in the US may have voting power, we should exclude media from outside the US to get a clearer picture of how Spanish-speaking voters may affect this election. Before we do that filtering, we note that Trump sucks in the minds of the Mexican people, which is no surprise at all given his irresponsible comments about them.

Our social media tool is equipped with geo-filtering capabilities: you can add a geo-fence to a topic to retrieve all social media posts authored from within a fenced location. This allows you to analyze location-based content irrespective of post text. That is exactly what we need in order to study the Spanish-speaking communities in the US who are likely to be voters, excluding media from Mexico or other Spanish-speaking countries. It is also what we need in order to study the critical swing states, to get a true picture of public sentiments and opinions in the states that will decide the destiny of the candidates and the future of the US (stay tuned: swing-state social media mining will come shortly, thanks to our fully automated mining system based on natural language deep parsing).
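To make the idea of a geo-fence concrete, here is a minimal sketch that filters posts by authoring location using a rectangular bounding box. It does not describe how our tool is implemented: the `Post` and `GeoFence` classes, the `filter_by_fence` function, the box coordinates and the two sample posts are all hypothetical and serve only as illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    text: str
    lat: float  # latitude of the authoring location, in degrees
    lon: float  # longitude of the authoring location, in degrees


@dataclass(frozen=True)
class GeoFence:
    """A rectangular fence given by its south-west and north-east corners."""
    south: float
    west: float
    north: float
    east: float

    def contains(self, post: Post) -> bool:
        return (self.south <= post.lat <= self.north
                and self.west <= post.lon <= self.east)


def filter_by_fence(posts, fence):
    # Keep posts by where they were authored, regardless of what the text says.
    return [p for p in posts if fence.contains(p)]


# Rough box around the contiguous US. It is only illustrative: a rectangle
# also covers parts of Canada and northern Mexico.
CONTIGUOUS_US = GeoFence(south=24.5, west=-125.0, north=49.5, east=-66.9)

# Made-up sample posts, not data from this study.
posts = [
    Post("Voy a votar el martes", lat=29.42, lon=-98.49),     # San Antonio
    Post("Las elecciones de EE.UU.", lat=19.43, lon=-99.13),  # Mexico City
]
us_posts = filter_by_fence(posts, CONTIGUOUS_US)
print([p.text for p in us_posts])  # ['Voy a votar el martes']
```

A rectangle is the crudest possible fence. A fence for a single swing state would more plausibly be a polygon following the state border, which requires a point-in-polygon test instead of the two range checks above.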

Now that I have excluded Spanish data from outside America, it turns out that the social ratings are roughly the same as before: the reduction of the data does not change the general public opinion of the Spanish communities, US or beyond. This is US-only Spanish social media:

This is the summary of Trump for Spanish data within the US:

It is clear that Trump's image truly sucks in the Spanish-speaking communities in the US, which is no surprise; it is so natural and evident that we are simply confirming and verifying it with big data and high tech.

These are sentiment drivers (i.e. pros and cons as well as emotion expressions) of Trump :

We might need Google Translate to interpret them but the color coding remains universal: red is for negative comments and green is positive. More red than green means a poor image or social rating.

In contrast, Clinton's word clouds involve way more green than red, showing that her support rate remains high in the Spanish-speaking communities of the US.

It looks like the emotional sentiments for Clinton are not as good as her sentiment drivers for pros and cons.

Sources of this study:

Domains of this study:


Did Trump's Gettysburg speech enable the support rate to soar as claimed?

The last few days have seen tons of reports on Trump's Gettysburg speech and its impact on his support rate, which some of his campaign media claim has soared thanks to this powerful speech.  We would love to verify this and uncover the true picture through big data mining of social media.

First, here is one link on his speech:

DONALD J. TRUMP DELIVERS GROUNDBREAKING CONTRACT FOR THE AMERICAN VOTER IN GETTYSBURG. (The most widely circulated related post in Chinese social media seems to be this: Trump's heavyweight speech enables the soaring of the support rate and possible stock market crash).

In what is believed to be a historic speech in the last dash of his campaign, Trump basically said: I am willing to sign a contract with the American people on reforming politics and making America great again; here is the plan outline for my administration, and within the time frame I promise, once I am in office, I will make things happen, believe me.

Trump made the speech on the 22nd of this month. To mine the true public opinion on the impact of the speech, we can run the automated social media analysis on the data around the 22nd.  We believe that automated polling based on big data and language understanding technology is much more revealing and dependable than traditional manual polls, which phone something like 500 to 1,000 people.  The latter laughably lack sufficient data to be trustworthy.

timeline-comparison-14

What does the above trend graph tell us?

1. Trump was indeed on the rise in this time interval. The "soaring" claim this time does not come entirely out of nowhere, but there is a big BUT.

2. BUT a careful look at the public opinion represented by net sentiment (a measure reflecting the ratio of positive mentions to negative mentions in social media) shows that Trump basically stayed below the freezing point (i.e. more negative than positive) in this time interval, with only a brief rise above the zero point around the speech on the 22nd, after which he soon went underwater again.

3. The soaring claim cannot withstand scrutiny at all, as soaring implies a sharp rise in support after the speech in comparison with before, which is not the case.

4. The fact is, Uncle Trump's social media image dropped to the bottom on the 18th of this month (with a net sentiment of -20%).  From the 18th to the 22nd, when he delivered the speech, his net sentiment rose steadily (from -20% to 0), but from the 22nd to the 25th it no longer went up and instead fell back down. So there is no ground at all for the claim that support soared as an effect of his speech.

5. Although his support did not soar, Uncle Trump's speech did not lead to a sharp drop either. In terms of the buzz generated, the speech can be said to have been fairly well delivered. After the speech, the net sentiment of public opinion dropped slightly, basically holding close to zero.

6. The above big data investigation shows that the media campaign can be very misleading when set against objective evidence and real-life data.  This is all propaganda, which cannot be taken at face value, from the so-called "support rate soared" to the "possible stock market crash". It is basically nonsense or campaign noise, and it cannot be taken seriously.

The following figure is a summary of the surveyed interval:

trump1

As seen, the average public opinion net-sentiment for this interval is -9%, with positive rating consisting of 2.7 million mentions, and negative rating of 3.2 million mentions.

How do we interpret -9% as an indicator of public opinions and sentiments? According to our numerous previous automated surveys of political figures, this is certainly not a good public opinion rating, but not a particularly bad one either, as we have seen worse.  Basically, -9% is below the average line for politicians' public image in social media.  Nevertheless, compared with Trump's own earlier ratings, this interval records a 13-point jump, which is pretty good for him and his campaign.  But the progress is clearly not the effect of his speech.

This is the social media statistics on the data sources of this investigation:

trump2

In terms of share, Twitter ranks no. 1: it is surely the most dynamic social media platform on politics, with the largest number of tweets generated every minute. Among a total of 34.5 million mentions of Trump, Twitter accounted for 23.9 million.  In comparison, Facebook had 1.7 million mentions.

Well, let's zoom out to the last 30 days instead of only the days around the speech, to provide a bigger background for uncovering the overall trends of this political fight between Trump and Clinton in the 2016 US presidential campaign.

timeline-comparison-15

The 30 days range from 9/28 to 10/28. The two lines in the comparison trends chart contrast Trump and Clinton in their respective daily ups and downs of net sentiment (reflecting their social rating trends).  The general impression is that the fight seems to be fairly tight.  Both are scandal-ridden, both are tough and belligerent, and both are fairly poor in social ratings.  The trends might look a bit clearer if we visualize the data by week instead of by day:

timeline-comparison-16
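Switching a trend chart from days to weeks amounts to pooling the daily mention counts of each week before computing the rating. The sketch below shows one way to do it. It rests on two assumptions that the post does not confirm: net sentiment is taken as (positive - negative) / (positive + negative), and weeks are ISO calendar weeks. The function name `weekly_net_sentiment` is hypothetical, and the sample counts are made up for illustration, not taken from this study.

```python
from collections import defaultdict
from datetime import date


def weekly_net_sentiment(daily_counts):
    """daily_counts: iterable of (date, positive, negative) tuples.

    Returns {(iso_year, iso_week): net sentiment in percent}, pooling the
    raw counts of each week before taking the ratio.
    """
    pooled = defaultdict(lambda: [0, 0])
    for day, pos, neg in daily_counts:
        iso = day.isocalendar()
        key = (iso[0], iso[1])  # (ISO year, ISO week number)
        pooled[key][0] += pos
        pooled[key][1] += neg
    return {week: 100.0 * (pos - neg) / (pos + neg)
            for week, (pos, neg) in pooled.items() if pos + neg > 0}


# Made-up counts, for illustration only.
sample = [
    (date(2016, 10, 17), 40, 60),
    (date(2016, 10, 18), 30, 70),
    (date(2016, 10, 24), 55, 45),
]
print(weekly_net_sentiment(sample))
```

Pooling counts, rather than averaging the daily percentages, keeps a quiet day with few mentions from weighing as much as a very busy one, which is part of why the weekly curve looks smoother.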

No matter how much I dislike Trump, and regardless of my dislike of Clinton, whom I have decided to vote for anyway in order to make sure the annoying Trump is out of the race, as a data scientist I have to rely on data, which says that Hillary's recent situation is not too optimistic: Trump actually went a little ahead of Clinton at times (a troubling fact to recognize and see).

timeline-comparison-17

The graph above compares the mentions (buzz, so to speak).  In terms of buzz, Trump is a natural topic king, having generated the most noise and comments, good or bad.  Clinton is no match in this regard.

timeline-comparison-18

The above is a comparison of the passion intensity of public opinion: like/love or dislike/hate?  The passion intensity for Trump is really high, showing that he has some crazy fans and/or deep haters among the people.  Hillary Clinton has been controversial too, and it is not rare to come across people with very intense sentiments towards her.  But still, Trump is a sort of political anomaly, and he is more likely to cause fanaticism or controversy than his opponent Hillary.

In his recent Gettysburg speech, Trump highlighted the so-called danger of the election being manipulated.  He clearly exaggerated the procedural risks, more than past candidates who ran under the same election protocol and mechanism.  By doing so, he paved the way for future non-recognition of the election results.  He was even fooling the entire nation by publicly saying nonsense such as that he would totally accept the election results if he wins: this is not humor or a sense of humor; it depicts a dangerous political figure with unchecked ambition.  To my mind, it is a very troubling sign, and he is playing fairly dirty political tricks, playing with fire.  Now the situation is this: if Clinton has a substantial lead and beats him by a large margin, this old Uncle Trump would have no excuse or room for instigating incidents after the election.  But if it is closer to a see-saw, which is not unlikely given the trends analysis shown above, then our country might be in some trouble: Uncle Trump and his die-hard fans most certainly will make some trouble.  Given the seriousness of this situation and the pressing risk of political turmoil that might follow, quite a few people, including some conservative minds, have begun to call for electing Hillary for the sake of preventing Trump from possible trouble-making.  I am of that mind-set too, even though I do not like Hillary either.  If not for Trump, in an ordinary election in which I do not like the candidates of either major party, I would most likely vote for a third party or abstain from voting, but this election is different; it is too dangerous as it stands.  It is like a time bomb hidden somewhere in Trump's house, totally unpredictable.  In order to prevent it from going off, it is safer to vote for Clinton.

In comparison with my earlier automated sentiment analysis blogged about a week ago (Big data mining shows clear social rating decline of Trump last month), this updated, more recent BPI brand comparison chart looks more like a see-saw: Clinton's recent campaign seems to be stuck somewhere.

brand-passion-index-11

Over the last 30 days, Clinton's net sentiment rating is -17%, while Trump's is -19%.  Clinton is only slightly ahead of Trump.  Fortunately, Trump's speech did not really reverse the gap between the two, which is seen fairly clearly in the following historical trends, represented by three different circles in the brand comparison (the darker circle represents more recent data).  The general trend for Clinton is still there: she started out lagging behind, then improved, and is now a bit stuck, but still leading.

 

brand-passion-index-12

Yes, Clinton's most recent campaign activities are not making significant progress, despite more resources being put to use, as shown by the bigger, darker circle in the graph.  Among the three circles for Clinton, the smallest and lightest circle stands for the first 10 days of data in the past 30 days, starting obviously behind Trump.  The last two circles are the data of the last 20 days, seemingly in situ, although the circle becomes larger, indicating more campaign input and more buzz generated.  But the benefits are not so obvious.  On the other side, Trump's trends show a zigzag, with the overall trend actually declining in the past 30 days.  In the middle ten days there was a clear rise in his social rating, but in the last ten days it has been going back down.

Let us have a look at Trump's 30-day social media sentiment word clouds: the first is more about comments on his pros and cons, and the second shows more direct and emotional expressions about him:

sentiment-drivers-38

sentiment-drivers-37
One friend took a glance at the red-font expression "fuck" and asked: who are the subjects and objects of "fuck" here?  In fact, the subject generally does not appear in social posts; by default it is the poster himself, reflecting part of the general public.  The object of "fuck" is, of course, Trump, for otherwise our deep-linguistics-based system would not count it as a negative mention of Trump reflected in the graph.  Let us show some random samples side by side with the graph:

trumpfuck

trumpfuck2
My goodness, the "fuck" mentions account for 5% of the emotional data: poor old Uncle Trump is fucked 40 million times in social media within one month, showing how much this guy is hated by some of the people whom he is supposed to represent and govern if he takes office.  See how they actually express their strong dislike of Trump:

fucking moron
fucking idiot
asshole
shithead

You name it, to the point that even some Republicans curse him like crazy:

Trump is a fucking idiot. Thank you for ruining the Republican Party you shithead.

Looking at the following figure of popular media, it seems that the most widely circulated political posts in social media involve quite some political video works:

trumpmedia

The domains figure below shows that the Tumblr posts on politics contribute more than Facebook:

domains-6

In terms of the demographic background of social media posters, there is a fair balance between male and female: male 52%, female 48% (in contrast to Chinese social media, where only 25% of those posting political comments on the US presidential campaign are female).  The figure below shows the ethnic background of the posters, with 70% Caucasians, 13% African Americans, 8% Hispanics and 6% Asians.  It looks like Hispanic Americans and Asian Americans are under-represented in the English social media in comparison with their population ratios; there might be language or cultural reasons for this under-representation.  As a result, this study may have missed some of their voices.  (We do have another similar study using Chinese social media, which shows a clear and big lead of Clinton over Trump; given time, we should do another automated survey using our multilingual engine for Spanish social media.  Another suggestion from friends is to do a similar study on swing states, because after all these are the key states that will decide the outcome of this election; we can filter the data by the locations that posts come from to simulate that study.)

trumpethinics

This last table involves a bit of fun facts from the investigation.  In social media, people tend to talk most about the campaign on Wednesday and Sunday evenings, with 9 o'clock as the peak; for example, on the topic of Trump, nine o'clock on Sunday evening generated 1,357,766 messages within one hour.  No wonder there is no shortage of big data from social media on politics.  It is all about big data.  In contrast, with the traditional manual poll, no matter how the sampling is done, the limitation in the number of data points is a serious challenge: with typically 500 to 1,000 phone calls, how can we trust that the poll represents the public opinions of 200 million voters?  The data are laughably sparse.  Of course, in the pre-big-data age, there were simply no alternatives for collecting public opinion in a timely manner on a limited budget.  This is the beauty of the Automatic Survey, which is bound to outperform the manual survey and become the mainstream of polls.

trumpdayhour
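
A day-and-hour table of this kind can be approximated by bucketing post timestamps by weekday and hour.  The sketch below is only an illustration of that idea, not the system's actual implementation; the function name `peak_slot` and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

def peak_slot(timestamps):
    """Return ((weekday name, hour), count) for the busiest weekday/hour slot."""
    counts = Counter((DAYS[ts.weekday()], ts.hour) for ts in timestamps)
    return counts.most_common(1)[0]

# Hypothetical post timestamps, for illustration only.
sample = [
    datetime(2016, 10, 23, 21, 5),   # Sunday, 9 pm
    datetime(2016, 10, 23, 21, 40),  # Sunday, 9 pm
    datetime(2016, 10, 26, 21, 15),  # Wednesday, 9 pm
    datetime(2016, 10, 24, 9, 30),   # Monday, 9 am
]
slot, count = peak_slot(sample)
print(slot, count)
```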

Authors with most followers are:

trumpmedia2

Most mentioned authors are listed below:

trumpauthors

Tell me, when in history did we ever have this much data and information, together with such powerful capabilities for fully automated mining of public opinions and sentiments at scale?

trumppopularposts

 

[Related]

Big data mining shows clear social rating decline of Trump last month

Clinton, 5 years ago. How time flies …

Automated Survey

Dr Li’s NLP Blog in English

 

 

Big data mining shows clear social rating decline of Trump last month

Big data mining from last month's social media shows clear decline of Trump in comparison with Clinton


Our automatic big data mining for public opinions and sentiments from social media speaks loud and clear: Trump's social image sucks.

Look at the last 30 days of social media on Hillary's and Trump's social image and standing in our Brand Passion Index (BPI) comparison chart below:

brand-passion-index-8

Three points to note:
1. Trump has more than twice the buzz of Hillary in terms of social media coverage (the size of the circles indicates the volume of mentions);
2. The sentiments of the general public of netters are more intense for Trump than for Clinton: the Y-axis shows the passion intensity;
3. The social ratings and images of the two are both quite poor, but Trump is more criticized in social media: the X-axis of Net Sentiment shows the social sentiment ratings.  Both are below the freezing point (meaning more negative comments than positive).
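
The text does not spell out how net sentiment is computed.  A common definition, assumed here purely for illustration, is (positive - negative) / (positive + negative), so that a value below zero is the "freezing point" case with more negative than positive mentions.  The function name and the counts are hypothetical.

```python
def net_sentiment(positive: int, negative: int) -> float:
    """Assumed definition: (pos - neg) / (pos + neg), as a percentage."""
    total = positive + negative
    if total == 0:
        return 0.0  # no sentiment-bearing mentions at all
    return 100.0 * (positive - negative) / total

# Hypothetical counts, not taken from the study.
score = net_sentiment(positive=400, negative=600)
print(f"net sentiment: {score:+.0f}%")
```

Under this assumed definition, a candidate with 400 positive and 600 negative mentions lands at minus 20%.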

If we want to automatically investigate the trend of the past month and the ups and downs of their social images, we can divide the data into two or three segments.  The figure below contrasts the first 15 days of social media data with the second 15 days in the 30-day period (up to 10/21/2016):

brand-passion-index-7

See, in the past month, with the presidential election debates and scandals getting attention, Trump's media image significantly deteriorated, represented by the public opinion circles shifting from the right on the X-axis to the left side (for dislike or hate sentiments; the lighter circle represents older data than the darker circle).  His social rating was clearly better than Hillary's to start with and ended up worse than Hillary's.  At the same time, Hillary's social media image has improved: her circle moves a bit from left to right.  Both candidates have always been below the freezing point, as clearly shown in the figure, but just a month ago Clinton was rated even lower than Trump in the public opinion of social media: it is not that people liked Trump that much, but that the general public showed more dislike for Hillary, for whatever reasons.

As seen, our BPI brand comparison chart attempts to visualize four-dimensional information:
1. net sentiment for social ratings on the X-axis;
2. the passion intensity of public sentiments on the Y-axis;
3. buzz circle size, representing mentions of soundbites;
4. the two circles of the same brand, showing the coarse-grained time dimension for general trends.

It is not very easy to represent 4 dimensions of analytics in a two-dimensional graph.  Hope the above attempt in our patented visualization efforts is insightful and not confusing.
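
To make the four dimensions concrete, here is a minimal sketch of how one circle in such a chart could be represented as data.  The class `BpiPoint`, its field names, and the `shade` helper are assumptions for illustration; they are not taken from the actual BPI implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BpiPoint:
    brand: str
    net_sentiment: float      # X-axis, in percent
    passion_intensity: float  # Y-axis
    mentions: int             # circle size (buzz)
    segment: int              # time dimension: 0 = oldest segment

def shade(point: BpiPoint, n_segments: int) -> float:
    """Map a time segment to a darkness in (0, 1]; newer segments are darker."""
    if n_segments <= 1:
        return 1.0
    return (point.segment + 1) / n_segments

# Hypothetical values: the same brand in two consecutive time segments.
older = BpiPoint("Brand A", -10.0, 50.0, 1000, segment=0)
newer = BpiPoint("Brand A", -15.0, 55.0, 1500, segment=1)
print(shade(older, 2), shade(newer, 2))
```

Two spatial axes, one size channel and one shading channel are what allow four dimensions to share a flat chart.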

If we are not happy with the divide-into-two strategy for showing the trends in one month of data, how about cutting it into three pieces?  Here is the figure for three circles in the time dimension.

brand-passion-index-6

We should have used different colors for the two political brands to make the visualization a bit clearer.  Nevertheless, we see the trend for Clinton in her three circles of social media sentiments shifting from the lower left corner to the upper right in a zigzag path: getting better, then worse, and ending up somewhere in between at this point (more exactly, up to 10/21/2016).  For the same three segments of data, Trump's (brand) image started not bad, then went slightly better, and finally fell into the abyss.
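
The segmentation described above can be sketched as follows: split the period into k equal-length segments and compute a rating per segment.  This is a minimal illustration, assuming net sentiment is (positive - negative) / (positive + negative); the function names and the sample posts are hypothetical.

```python
from datetime import date

def segment_index(day: date, start: date, end: date, k: int) -> int:
    """Index (0..k-1) of the equal-length time segment that `day` falls into."""
    span = (end - start).days + 1   # inclusive number of days in the period
    offset = (day - start).days
    return min(offset * k // span, k - 1)

def trend(posts, start: date, end: date, k: int):
    """posts: iterable of (date, polarity), with polarity +1 or -1.
    Returns the assumed net sentiment (percent) per segment, oldest first."""
    pos = [0] * k
    neg = [0] * k
    for day, polarity in posts:
        i = segment_index(day, start, end, k)
        if polarity > 0:
            pos[i] += 1
        else:
            neg[i] += 1
    return [
        100.0 * (p - n) / (p + n) if p + n else 0.0
        for p, n in zip(pos, neg)
    ]

# Hypothetical posts within the 09/21/2016 to 10/21/2016 window.
posts = [
    (date(2016, 9, 22), +1), (date(2016, 9, 25), -1),
    (date(2016, 10, 5), +1), (date(2016, 10, 6), +1), (date(2016, 10, 7), -1),
    (date(2016, 10, 20), -1), (date(2016, 10, 21), -1),
]
result = trend(posts, date(2016, 9, 21), date(2016, 10, 21), k=3)
print(result)
```

Setting k to 2 gives the first-half versus second-half contrast; k set to 3 gives the three-circle view.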

The above uses our own brand comparison chart (BPI) to decode the changes and trends in the social images of the two US presidential candidates.  This analysis, entirely automated and based on deep Natural Language Parsing technology, is supported by many times more data points than traditional manual polls, which are by nature severely restricted in data size and response time.

What are the sources of the social media data for the above automated polling?  They are based on random sampling of social media big data, headed by Twitter, the most dynamic source, as shown below.

sources-5

sources-4

sources-3

This is a summary of the public opinions and sentiments:

川普希拉里 (Trump and Hillary)

As seen, it is indeed BIG data: a month of random sampling of social media data involves nearly 200 million mentions of the candidates, with a total of up to 3,600+ billion impressions (potential eyeballs).  Trump accounted for 70 percent of the buzz, and Clinton for only 30 percent.

As for the overall social rating during the period of 09/21/2016 through 10/21/2016, Trump's net sentiment is minus 20%, and Clinton's is minus 18%.  These measures show ratings much lower than those in most other VIP analyses we have done before using the same calculations.  Fairly nasty images, really.  And the big data trends show that Trump sucks most.

The following are some social media soundbites for Trump:

Bill Clinton disgraced the office with the very behavior you find appalling in...
In closing, yes, maybe Trump does suffer from a severe case of CWS.
Instead, in this alternate NY Times universe, Trump’s campaign was falling ...
Russian media often praise Trump for his business acumen.
This letter is the reason why Trump is so popular
Trump won
I'm proud of Trump for taking a stand for what's right.
Kudos to Trump for speaking THE TRUTH!
Trump won
I’m glad I’m too tired to write Trump/Putin fuckfic.
#trump won
Trump is the reason Trump will lose this election.
Trump is blamed for inciting violence.
Breaking that system was the reason people wanted Trump.
I hate Donald Trump for ruining my party.
>>32201754 Trump is literally blamed by Clinton supporters for being too friendly with Russia.
Another heated moment came when Trump delivered an aside in reponse to ...
@dka_gannongal I think Donald Trump is a hoax created by the Chinese....
Skeptical_Inquirer The drawing makes Trump look too normal.
I'm proud of Donald Trump for answering that honestly!
Donald grossing me out with his mouth features @smerconish ...
Controlling his sniffles seems to have left Trump extraordinarily exhausted
Trump all the way people trump trump trump
Trump wins
Think that posting crap on BB is making Trump look ridiculous.
I was proud of Trump for making America great again tonight.
MIL is FURIOUS at Trump for betraying her!
@realdonaldTrump Trump Cartel Trump Cartel America is already great, thanks to President Obama.
Kudos to Mr Trump for providing the jobs!!
The main reason to vote for Trump is JOBS!
Yes donal trump has angered many of us with his WORDS.
Trump pissed off a lot of Canadians with his wall comments.
Losing this election will make Trump the biggest loser the world has ever seen.
Billy Bush's career is merely collateral damage caused by Trump's wrenching ..
So blame Donald for opening that door.
The most important reason I am voting for Trump is Clinton is a crook.
Trump has been criticized for being overly complimentary of Putin.
Kudos to Trump for reaching out to Latinos with some Spanish.
Those statements make Trump's latest moment even creepier.
I'm mad at FBN for parroting the anti-Trump talking points.
Kudos to Trump for ignoring Barack today @realDonaldTrump
Trump has been criticized for being overly complimentary of Putin.
OT How Donald Trump's rhetoric has turned his precious brand toxic via ...
It's these kinds of remarks that make Trump supporters look like incredible ...
Trump is blamed for inciting ethnic tensions.
Trump is the only reason the GOP is competitive in this race.
Its why Republicans are furious at Trump for saying the voting process is rigged.
Billy Bush’s career is merely collateral damage caused by Trump’s wrenching ..
Donald Trump is the dumbest, worst presidential candidate your country ...
I am so disappointed in Colby Keller for supporting Trump.
Billy Bush’s career is merely collateral damage caused by Trump’s wrenching..
In swing states, Trump continues to struggle.
Trump wins
Co-host Jedediah Bila agreed, saying that the move makes Trump look desperate.
Trump wins
"Trump attacks Clinton for being bisexual!"
TRUMP win
Pence also praised Trump for apologizing following the tape’s disclosure.
In swing states, Trump continues to struggle.
the reason Trump is so dangerous to the establishment is he is unapologetical..

Here are some public social media soundbites for Clinton in the same period:

Hillary deserves worse than jail.
Congratulations to Hillary & her campaign staff for wining three Presidential ..
I HATE @chicanochamberofcommerce FOR INTRODUCING THAT HILLARY ...
As it turns out, Hillary creeped out a number of people with her grin.
Hillary trumped Trump
Trump won!  Hillary lost
Hillary violated the Special Access Program (SAP) for disclosing about the ...
I trust Flint water more than Hillary
Hillary continued to baffle us with her bovine feces.
NEUROLOGISTS HATE HILLARY FOR USING THIS TRADE SECRET DRUG!!!!...
CONGRATULATIONS TO HILLARY CLINTON FOR WINNING THE PRESIDENCY
Supreme Court: Hillary is our only choice for keeping LGBT rights.
kudos to hillary for remaining sane, I'd have killed him by now
How is he blaming Hillary for sexually assaulting women. He's such a shithead
The only reason I'm voting for Hillary is that Donald is the only other choice
Hillary creeps me out with that weird smirk.
Hillary is annoying asf with all of her laughing
I credit Hillary for the Cubs waking up
When you listen to Hillary talk it is really stupid
On the other hand, Hillary Clinton has a thorough knowledge by virtue of ...
Americans deserve better than Hillary
Certain family members are also upset with me for speaking out against ...
Hillary is hated by all her security detail for being so abusive
Hillary beat trump
The only reason to vote for Hillary is she's a woman.
Certain family members are also upset with me for speaking out against ....
I am glad you seem to be against Hillary as well Joe Pepe.
Hillary scares me with her acions.
Unfortunately Wikileaks is the monster created by Hillary & democrats.
I'm just glad you're down with evil Hillary.
Hillary was not mad at Bill for what he did.  She was mad he got caught.  ......
These stories are falling apart like Hillary on 9/11
Iam so glad he is finally admitting this about Hillary Clinton.
Why hate a man for doing nothing like Hillary Clinton
Hillary molested me with a cigar while Bill watched.
You are upset with Hillary for doing the same as all her predecessors.
I feel like Hillary Clinton is God's punishment on America for its sins.
Trumps beats Hillary
You seem so proud of Hillary for laughing at rape victims.
Of course Putin is going to hate Hillary for publicly announcing false ...
Russia is pissed off at Hillary for blaming the for wikileaks!
Hillary will not win.  Good faith is stronger than evil.  Trump wins??
I am proud of Hillary for standing up for what is good in the USA.
Hillarys plans are worse than Obama
Hillary is the nightmare "the people" have created.
Funny how the Hillary supporters are trashing Trump for saying the same ...
???????????? I am so proud of the USA for making Hillary Clinton president.
Hillary, you're a hoax created by the Chinese
Trump trumps Hillary
During the debate, Trump praised Hillary for having the will to fight.
Trump is better person than Hillary
Donald TRUMPED Hillary
Kudos to Hillary for her accomplishments.
He also praised Hillary for handling the situation with dignity.
During the debate, Trump praised Hillary for having the will to fight.
People like Hillary in senate is the reason this country is going downhill.
Hillary did worse than expectations.
Trump will prosecute Hillary for her crimes, TRUMP will!
Have to praise Hillary for keeping her focus.
a landslide victory for Hillary will restore confidence in American democracy ..
I was so proud of Hillary tonight for acting like a tough, independent woman.
I dislike Hillary Clinton, as I think she is a corrupt, corporate shill.
Hillary did worse than Timmy Kaine
Im so glad he finally brought Benghazi against Hillary
Hillary, thank you for confirmation that the Wikileaks documents are authentic
Supreme Court justices is the only reason why I'd vote for Hillary.
Massive kudos to Hillary for keeping her cool with that beast behind her.
Congrats to Hillary for actually answering the questions. She's spot on. #debate

 

[Related]

Social media mining: Did Trump’s Gettysburg speech enable the support rate to soar as claimed?

Big data mining shows clear social rating decline of Trump last month

Clinton, 5 years ago. How time flies …

Automated Survey

From IBM's Jeopardy robot, Apple's Siri, to the new Google Translate

Latest Headline News: Samsung acquires Viv, a next-gen AI assistant built by the creators of Apple's Siri.

Wei:
Some people are just smart, or shrewd, more than we can imagine.  I am talking about the Fathers of Siri, who have been so successful with their technology that they managed to sell the same type of technology twice, both times at astronomical prices, and both times to giants in the mobile and IT industry.  What is more amazing is that the companies they sold their tech assets to are direct competitors.  How did that happen?  How "nice" this world is to a really, really smart technologist with a sharp business mind.

What is more stunning is the fact that Siri and the like are so far regarded more as toys than must-carry tools, intended at least for now to satisfy curiosity more than to meet the rigid demand of the market.  Most surprising of all, the technology behind Siri is not unreachable rocket science by nature; similar technology and a similar level of performance are starting to surface from numerous teams and companies, big or small.

I am a tech guy myself, loving gadgets, always watching for new technology breakthroughs.  To my mind, some things in the world are sheer amazing, leaving us in awe, for example, the wonder of smartphones when the iPhone first came out.  But some other things in the tech world do not make us admire or wonder that much, although they may have left a deep footprint in history.  Take, for example, the question answering machine made by IBM Watson Lab that won Jeopardy.  It made it into the computer history exhibition as a major AI milestone.  More recently, there is the iPhone's Siri, which Apple managed to put into the hands of millions of people, for the first time, for seemingly live man-machine interaction.  Beyond that accomplishment, there is no magic or miracle that surprises me.  I have the feeling of "seeing through" these tools, both the IBM answering robot type, depending on big data, and Apple's intelligent agent Siri, depending on domain apps (plus a flavor of AI chatbot tricks).

Chen: @ Wei I bet the experts in rocket technology will not be impressed that much by SpaceX either.

Wei: Right, this is because we are in the same field: what appears magical to the outside world can hardly win the heart of an insider, who might think that, given a chance, they could do the same trick or better.

The Watson answering system can well be regarded as a milestone in engineering for massive, parallel big data processing, but it does not strike us as an AI breakthrough.  What shines in terms of engineering accomplishment is that all this happened before the big data age, before the infrastructure for indexing, storing and retrieving big data in the cloud was widely adopted.  In this regard, IBM was indeed the first to run ahead of the trend, with the ability to put a farm of servers to work so that the QA engine could be deployed on massive data.  But from a true AI perspective, neither the Watson robot nor the Siri assistant can be compared with the more recent launch of the new Google Translate based on neural networks.  So far I have tested this monster by using it to help translate three Chinese blogs of mine (including this one in the making), and I have to say that I have been blown away by what I see.  As a seasoned NLP practitioner who started MT training 30 years ago, I am still in disbelief before this wonder of a technology showcase.

Chen: wow, how so?

Wei:  What can I say?  It has exceeded the limits of my imagination, beyond all my dreams of what MT can be and should be since I entered this field many years ago.  While testing, I only needed to do limited post-editing to make the following Chinese blogs of mine presentable and readable in English, a language with no kinship whatsoever with the source language, Chinese.

Question answering of the past and present

Introduction to NLP Architecture

Hong: Wei seemed frightened by his own shadow.

Chen:  The effect is that impressive?

Wei:  Yes.  Before the deep neural network age, I also tested and tried to use SMT for the same job, having tried both Google Translate and Baidu MT; there is just no comparison with this new launch based on the technology breakthrough.  If you hit their sweet spot, that is, if the data you need translated are close to the data they trained the system on, Google Translate can save you at least 80% of the manual work.  80% of the time, it comes out so smooth that there is hardly a need for post-editing.  There are errors or crazy things going on in less than 20% of the translated crap, but who cares?  I can focus on that part and get my work done way more efficiently than before.  The most important thing is that SMT before deep learning rendered a text hardly readable, no matter how good a temper I have.  It was unbearable to work with.  Now, with this breakthrough in training the model on sentences instead of words and phrases, the translation magically sounds fairly fluent.

It is said that they are good at the news genre and at IT and technology articles, for which they have abundant training data.  The legal domain is said to be good too.  Other domains, such as spoken language, online chats, and literary works, remain a challenge for them, as there does not seem to be sufficient data available yet.

Chen: Yes, it all depends on how large and good the bilingual corpora are.

Wei:  That is true.  SMT stands on the shoulders of thousands of professional translators and their works.  An ordinary individual's head simply has no way of digesting this much linguistic and translation knowledge to compete with a machine in efficiency and consistency, and eventually in quality as well.

Chen: Google's major contribution is to explore and exploit the existence of huge human knowledge, including in search, where anchor text is the core.

Ma: I very much admire IBM's Watson, and I would not have dared to think it possible to make such an answering robot back in 2007.

Wei: But the underlying algorithm does not strike me as a breakthrough.  They were lucky in targeting the mass-media Jeopardy TV show to reach the world.  The Jeopardy quiz, in essence, pushes the human brain's memory to its extreme; it is largely a memorization test, not a true intelligence test by nature.  In memorization, a human has no way of competing with a machine, not even close.  The vast majority of quiz questions are so-called factoid questions in the QA area, asking about things like who did what, when and where, a very tractable task.  Factoid QA depends mainly on Named Entity technology, which matured long ago, coupled with the tractable task of parsing the question to identify its asking point, and backend support from IR, an area well studied and practised for over two decades now.  Another benefit in this task is that most knowledge questions asked in the test involve standard answers with huge redundancy in the text archive, expressed in various ways, some of which are bound to correspond closely to the way the question is asked.  All these factors contributed to IBM's huge success in its almost mesmerizing performance at the historic event.  The bottom line is that, shortly after open-domain QA was officially born in 1999 with the first TREC QA track, the technology of the core engine had been well researched and verified for factoid questions, given a large corpus as a knowledge source.  The rest is just how to operate such a project on a big engineering platform and how to fine-tune it to the Jeopardy-style scenario for best effect in the competition.  Really no magic whatsoever.
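
The factoid QA recipe described above (asking point, named entities, retrieval, redundancy) can be sketched in a few lines.  This is only a toy illustration, not Watson's method: the mapping table, the function names, and the pre-tagged snippets are hypothetical, and the retrieval and named entity tagging steps are assumed to have already happened.

```python
from collections import Counter

# Hypothetical mapping from question word to the expected entity type
# (the "asking point" of the question).
ASKING_POINT = {"who": "PERSON", "when": "DATE", "where": "LOCATION"}

def asking_point(question: str) -> str:
    words = question.strip().split()
    if not words:
        return "UNKNOWN"
    return ASKING_POINT.get(words[0].lower(), "UNKNOWN")

def answer(question: str, tagged_snippets):
    """tagged_snippets: one list of (entity_text, entity_type) pairs per
    retrieved snippet, assumed to come from an IR step plus an NE tagger.
    The most frequently seen entity of the wanted type wins (redundancy)."""
    wanted = asking_point(question)
    votes = Counter(
        text
        for snippet in tagged_snippets
        for text, etype in snippet
        if etype == wanted
    )
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical tagger output for three retrieved snippets.
snippets = [
    [("William Shakespeare", "PERSON"), ("London", "LOCATION")],
    [("William Shakespeare", "PERSON")],
    [("Christopher Marlowe", "PERSON")],
]
print(answer("Who wrote Hamlet?", snippets))
```

The more redundantly an answer is stated in the archive, the more votes it collects, which is why a large corpus helps this kind of question so much.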

Google Translated from【泥沙龙笔记:从三星购买Siri之父的二次创业技术谈起】, with post-editing by the author himself.

 

【Related】

Question answering of the past and present

Introduction to NLP Architecture

Newest GNMT: time to witness the miracle of Google Translate

Dr Li’s NLP Blog in English

 

Newest GNMT: time to witness the miracle of Google Translate

gnmt

Wei:
Recently, the microblogging (WeChat) community has been full of hot discussions and testing of the newest announcement of the Google Translate breakthrough in its NMT (neural network-based machine translation) offering, claimed to have achieved significant progress in data quality and readability.  It sounds like a major breakthrough worthy of attention and celebration.

The report says:

Ten years ago, we released Google Translate, the core algorithm behind this service is PBMT: Phrase-Based Machine Translation.  Since then, the rapid development of machine intelligence has given us a great boost in speech recognition and image recognition, but improving machine translation is still a difficult task.

Today, we announced the release of the Google Neural Machine Translation (GNMT) system, which utilizes state-of-the-art training techniques to maximize the quality of machine translation so far.  For a full review of our findings, please see our paper "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation."

A few years ago, we began using RNN (Recurrent Neural Networks) to directly learn the mapping of an input sequence (such as a sentence in a language) to an output sequence (the same sentence in another language).  Phrase-based machine translation (PBMT) breaks the input sentences into words and phrases, and then largely interprets them independently, while NMT interprets the entire sentence of the input as the basic unit of translation.

The advantage of this approach is that compared to the previous phrase-based translation system, this method requires less engineering design. When it was first proposed, the accuracy of the NMT on a medium-sized public benchmark data set was comparable to that of a phrase-based translation system.  Since then, researchers have proposed a number of techniques to improve NMT, including modeling external alignment models to handle rare words, using attention to align input and output words, and word decomposition into smaller units to cope with rare words. Despite these advances, the speed and accuracy of NMT has not been able to meet the requirements of a production system such as Google Translate.  Our new paper describes how to overcome many of the challenges of making NMT work on very large data sets and how to build a system that is both fast and accurate enough to deliver a better translation experience for Google users and services.

............

Using side-by-side comparisons of human assessments as a standard, the GNMT system translates significantly better than the previous phrase-based production system.  With the help of bilingual human assessors, we found in sample sentences from Wikipedia and news websites that GNMT reduced translation errors by 55% to 85% or more in the translation of multiple major language pairs.

In addition to publishing this research paper today, we have also announced that GNMT will be put into production in a very difficult language pair (Chinese-English) translation.

Now, the Chinese-English translations of the Google Translate for mobile and web versions have been translated at 100% using the GNMT machine - about 18 million translations per day.  GNMT's production deployment uses our open machine learning tool suite TensorFlow and our Tensor Processing Units (TPUs), which provide sufficient computational power to deploy these powerful GNMT models, meeting Google Translate strict latency requirements for products.

Chinese-to-English translation is one of the more than 10,000 language pairs supported by Google Translate. In the coming months, we will continue to extend our GNMT to far more language pairs.

Google Translate's GNMT achieves a major breakthrough!

As an old machine translation researcher, I cannot resist this temptation.  I cannot wait to try this latest version of Google Translate for Chinese-English.
Previously I tried Google's Chinese-to-English online translation multiple times; the overall output was not very readable and certainly not as good as that of its competitor Baidu.  With this newest breakthrough using deep learning with neural networks, it is said to be getting close to human translation quality.  I have a few hundred Chinese blogs on NLP waiting to be translated as a trial.  I was looking forward to this first attempt at using Google Translate on my science popularization blog titled Introduction to NLP Architecture.  My adventure is about to start.  Now is the time to witness the miracle, if a miracle does exist.

Dong:
I hope you will not be disappointed.  I have jokingly said before that rule-based machine translation is a fool and statistical machine translation is a madman, and now I continue the ridicule: neural machine translation is a "liar" (I am not referring to the developers behind NMT).  Language is not like recognizing a cat's face: surface fluency alone does not work, because the content must be faithful to the original!

Wei:
Let us experience the magic, please listen to this translated piece of my blog:

This is my Introduction to NLP Architecture fully automatically translated by Google Translate yesterday (10/2/2016) and fully automatically read out without any human interference.  I have to say, this is way beyond my initial expectation and belief.

Listen to it for yourself, the automatic speech generation of this science blog of mine is amazingly clear and understandable. If you are an NLP student, you can take it as a lecture note from a seasoned NLP practitioner (definitely clearer than if I were giving this lecture myself, with my strong accent). The original blog was in Chinese and I used the newest Google Translate claimed to be based on deep learning using sentence-based translation as well as character-based techniques.

Prof. Dong, you know my background and my originally skeptical mindset. However, in the face of such progress, in terms of both quality and robustness, far beyond the limits of what we could imagine for automatic translation when I started my NLP career with MT training 30 years ago, I have to say that it is a dream come true in every sense.

Dong:
In their terminology, it is "less adequate, but more fluent." Machine translation has gone through three paradigm shifts. When people find that it can only be a good information processing tool and cannot really replace human translation, they will choose the less costly option.

Wei:
In any case, this small test is revealing to me. I still feel overwhelmed to see such a miracle live. Of course, what I have just tested is in a formal style, on a computing and NLP topic, so it certainly hit the system's sweet spot with adequate training corpus coverage. But compared with the pre-NN time when I used both Google SMT and Baidu SMT to help with my translation, this breakthrough is amazing. As a senior old-school practitioner of rule-based systems, I would like to pay deep tribute to our "nerve-network" colleagues. They are a group of crazy geniuses. I would like to quote Jobs' famous words here:

“Here's to the crazy ones. The misfits. The rebels. The troublemakers. The round pegs in the square holes. The ones who see things differently. They're not fond of rules. And they have no respect for the status quo. You can quote them, disagree with them, glorify or vilify them. About the only thing you can't do is ignore them. Because they change things. They push the human race forward. And while some may see them as the crazy ones, we see genius. Because the people who are crazy enough to think they can change the world, are the ones who do.”

@Mao, this counts as my most recent feedback to the Google scientists and their work. A couple of months ago, when they released their parser and proudly claimed it to be "the most accurate parser in the world", I wrote a blog to ridicule them after performing a serious, apples-to-apples comparison with our own parser. This time, they have used the same underlying technology to announce this new MT breakthrough with similar pride, and I am happily expressing my deep admiration for their wonderful work. This contrast in my attitudes looks a bit weird, but it is actually all based on facts. In the case of parsing, this school suffers from a lack of naturally labeled data to use in perfecting the quality, especially when a parser has to be ported to new domains or genres beyond the news corpora. After all, what exists in the sea of language is corpora of raw text with linear strings of words, while the corresponding parse trees are only occasional, artificial objects made by linguists, limited in scope by nature (e.g. the Penn Treebank, or other news-genre parse trees by the Google annotation team). But MT is different: it is a unique NLP area with almost endless, high-quality, naturally occurring "labeled" data in the form of human translation, which has been produced without interruption for ages.

Mao: @wei That is to say, you now embrace or endorse a neuron-based MT, a change from your previous views?

Wei:
Yes, I do embrace and endorse the practice. But I have not really changed my general view of the pros and cons of the two schools in AI and NLP. They are complementary and, in the long run, some way of combining the two promises a world better than either one alone.

Mao: What is your real point?

Wei:
Despite the biases we are all more or less born with, conditioned by what we have done and where we come from in terms of technical background, we all need to observe and respect the basic facts. Just listen to the audio of their GNMT translation by clicking the link above: in my best judgment, its fluency and even its faithfulness to my original text have in fact outperformed an ordinary human translator. If I gave this lecture in a classroom and asked an average interpreter without sufficient knowledge of my domain to translate on the spot for me, I bet he would have a hard time performing better than the Google machine above (of course, human translation gurus are an exception). This miracle-like fact has to be observed and acknowledged. On the other hand, as I said before, no matter how deep the learning reaches, I still do not see how they can catch up with the quality of my deep parsing in the next few years, when they have no way of magically gaining access to the huge amount of labeled tree data they depend on, especially across the variety of domains and genres. They simply cannot "make bricks without straw" (as an old Chinese saying goes, even the most capable housewife can hardly cook a good meal without rice). In the natural world, there are no syntactic trees and structures for them to learn from; there are only linear sentences. The deep learning breakthrough seen so far is still mainly in supervised learning, which has an almost insatiable appetite for massive labeled data, and this forms its knowledge bottleneck.

Mao: I'm confused. Which one do you believe stronger? Who is the world's No. 0?

Wei:
Parsing-wise, I am happy to stay as No. 0 if Google insists on being No. 1 in the world. As for MT, from what I see, it is hard to judge between their breakthrough and some highly sophisticated rule-based MT systems out there. But what I can say is that, at a high level, the trend of mainstream statistical MT winning the space over old-school rule-based MT, in industry as well as in academia, is more evident today than before.  This is not to say that rule-based MT is no longer viable or is coming to an end. There are things at which SMT cannot beat rule-based MT. For example, certain types of seemingly stupid mistakes made by GNMT (quite a few laughable examples of totally wrong or opposite translations have been illustrated in this salon in the last few days) are almost never seen in rule-based MT systems.

Dong:
Here is my trial of GNMT from Chinese to English:

学习上,初二是一个分水岭,学科数量明显增多,学习方法也有所改变,一些学生能及时调整适应变化,进步很快,由成绩中等上升为优秀。但也有一部分学生存在畏难情绪,将心思用在学习之外,成绩迅速下降,对学习失去兴趣,自暴自弃,从此一蹶不振,这样的同学到了初三往往很难有所突破,中考的失利难以避免。

Learning, the second of a watershed, the number of subjects significantly significantly, learning methods have also changed, some students can adjust to adapt to changes in progress, progress quickly, from the middle to rise to outstanding. But there are some students there is Fear of hard feelings, the mind used in the study, the rapid decline in performance, loss of interest in learning, self-abandonment, since the devastated, so the students often difficult to break through the third day,

Mao: This translation cannot be said to be good at all.

Wei:
Right, that is why it calls for an objective comparison to answer your previous question. Currently, as I see it, the data for social media and casual text are certainly not enough, hence the translation of online messages is still not their forte.  As for the textual sample Prof. Dong showed us above, Mao said the Google translation is not of the quality one would expect. But even so, I still see impressive progress there. Before the deep learning era, SMT results from Chinese to English were hardly readable; now the output can generally be read aloud and roughly understood. There is a lot of progress worth noting here.

Ma:
In fields with big data, DL methods have advanced by leaps and bounds in recent years. I know a number of experts who used to be biased against DL and have changed their views after seeing the results. However, DL is still basically not effective in the IR field so far, although there are signs of it slowly penetrating IR.

Dong:
The key to NMT is "looking nice". So for people who do not understand the original source text, it sounds like a smooth translation. But isn't it a "liar" if a translation loses its faithfulness to the original? This is the Achilles' heel of NMT.

Ma: @Dong, I think all statistical methods have this weak point.

Wei:
Indeed, each side has its pros and cons. Today I have listened to the Google translation of my blog three times and am still amazed at what they have achieved. There are always some mistakes I can pick out here and there. But to err is human, let alone a machine, right? Not to mention that the community will not stop advancing and correcting mistakes. From the intelligibility and fluency perspectives, I have been more than satisfied today. And this occurs between two languages without any historical kinship whatsoever.

Dong:
Some leading managers said to me years ago, "In fact, even if machine translation is only 50 percent correct, it does not matter. The problem is that it cannot tell me which half it cannot translate well. If it could, I could always save half the labor and hire a human translator to translate only the other half." I replied that I was not able to make a system do that. I have been concerned about this issue ever since, up to today, when there is a lot of noise about MT replacing human translation any time now. It is kind of like saying that, now that we have McDonald's, we no longer need a fine restaurant for French delicacies. Not to mention that machine translation today still cannot be compared to McDonald's. Computers, with machine translation and the like, are in essence a toy given by God for us humans to play with. God never agreed to equip us with the ability to copy ourselves.

Why did GNMT first choose a language pair like Chinese-to-English to showcase, and not the other way round? This is very shrewd of them. Even if the translation is wrong or misses the point, it is usually at least fluent in this new model, unlike the traditional models, whose output looks and sounds broken, silly and erroneous. This is characteristic of NMT: it selects what is most similar in the translation corpus. As a vast number of English readers do not understand Chinese, it is easy to impress them with how great the new MT is, even for a difficult language pair.

Wei:
Correct. A closer look reveals that this "breakthrough" lies more in the fluency of the target language than in the faithfulness to the source language, achieving readability at the cost of accuracy. But this is just the beginning of a major shift. I can fully understand the GNMT people's joy and pride in the face of a breakthrough like this. In our career, we do not always have that type of moment for celebration.

Deep parsing is the crown of NLP. It remains to be seen how they can beat us in handling domains and genres lacking labeled data. I wish them good luck, and the day they prove they can make better parsers than mine will be the day of my retirement. To my mind, it does not look like this day is drawing near. I wish I were wrong, so that I could travel the world worry-free, knowing that my dream has been better realized by my colleagues.

Thanks to Google Translate at https://translate.google.com/ for helping to translate this Chinese blog into English, which was post-edited by myself. 

 

[Related]

Wei’s Introduction to NLP Architecture Translated by Google

"OVERVIEW OF NATURAL LANGUAGE PROCESSING"

"NLP White Paper: Overview of Our NLP Core Engine"

Introduction to NLP Architecture

It is untrue that Google SyntaxNet is the "world’s most accurate parser"

Announcing SyntaxNet: The World’s Most Accurate Parser Goes Open

Is Google SyntaxNet Really the World’s Most Accurate Parser?

Dr Li's NLP Blog in English

Introduction to NLP Architecture

(translated by Google Translate, post-edited by myself)

For natural language processing (NLP) and its applications, the system architecture is the core issue.  In my blog (OVERVIEW OF NATURAL LANGUAGE PROCESSING), I sketched four NLP system architecture diagrams, which are presented here one by one.

In my design philosophy, an NLP process is divided into four stages, from the core engine up to the applications, as reflected in the four diagrams.  At the bottom is deep parsing, the bottom-up processing of an automatic sentence analyzer.  This work is the most difficult, but it is the foundation and enabling technology for the vast majority of NLP systems.

[Figure: NLP architecture diagram 1, the deep parsing layer]

The purpose of parsing is to structure unstructured text.  Facing ever-changing language, only when text is structured in some logical form can we formulate patterns for the information we would like to extract to support applications.  This principle of linguistic structure became the consensus in the linguistics community after Chomsky proposed the transformation from surface structure to deep structure in his linguistic revolution of 1957.  A tree representing the logical form involves not only arcs that express syntactic-semantic relationships, but also nodes of words or phrases that carry various conceptual information.  Despite the importance of such deep trees, they generally do not directly support an NLP product.  They remain the internal representation of the parsing system, the result of language analysis and understanding, which then needs semantic grounding to the applications as their core support.
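
To make the idea of a logical form concrete, here is a minimal sketch in Python. It assumes a toy representation in which each arc is a (head, relation, dependent) triple; the relation labels and the mapping table are hypothetical illustrations, not the inventory of any actual parser. It shows how an active and a passive surface structure can be normalized to the same logical arcs.

```python
# Hypothetical relation labels; a real parser would use its own inventory.
SURFACE_TO_LOGICAL = {
    "subj": "S",          # active subject -> logical subject
    "obj": "O",           # active object -> logical object
    "subj_passive": "O",  # passive subject is the logical object
    "by_agent": "S",      # "by"-phrase of a passive is the logical subject
}

def to_logical_form(surface_arcs):
    """Map (head, relation, dependent) surface arcs to sorted logical arcs."""
    return sorted(
        (head, SURFACE_TO_LOGICAL[rel], dep) for head, rel, dep in surface_arcs
    )

# "Google released a parser" vs. "A parser was released by Google"
active = [("release", "subj", "Google"), ("release", "obj", "parser")]
passive = [("release", "subj_passive", "parser"), ("release", "by_agent", "Google")]

# Both surface variants share one logical form.
print(to_logical_form(active) == to_logical_form(passive))
```

Downstream layers then only need to handle the logical arcs, whichever surface variant the sentence used.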

[Figure: NLP architecture diagram 2, the extraction layer]

The next layer after parsing is the extraction layer, as shown in the above diagram.  Its input is the parse tree, and its output is the filled-in content of templates, similar to filling in a form: the information needed by the application is defined in advance as a table (so to speak), and the extraction system fills in the blanks with the related words or phrases extracted from the text on the basis of parsing. This layer moves from the domain-independent parser into application-oriented, product-driven tasks.

It is worth emphasizing that the extraction layer is geared towards a domain-oriented semantic focus, while the preceding parsing layer is domain-independent.  Therefore, a good framework performs a very thorough analysis of logical semantics in deep parsing, in order to reduce the burden on information extraction.  With the depth of analysis in the logical semantic structures supporting the extraction, one rule at the extraction layer is in essence equivalent to thousands of surface rules at the linear text layer.  This creates the conditions for efficient porting to new domains based on the same core parsing engine.

There are two types of extraction. One is traditional information extraction (IE), the extraction of facts or objective information: named entities, the relationships between entities, and events involving entities (which can answer questions like "who did what when and where").  This extraction of objective information is the core technology and foundation for the knowledge graph (nowadays such a hot area in industry).  After IE is completed, the next layer, information fusion (IF), is aimed at constructing the knowledge graph.  The other type of extraction is about subjective information; public opinion mining, for example, is based on this kind of extraction. My focus over the past five years has been along this line, on fine-grained extraction of public opinions (not just sentiment classification, but also exploring the reasons behind the opinions and sentiments to provide an insight basis for decision-making).  This is one of the hardest tasks in NLP, much more difficult than IE for objective information.  Extracted information is usually stored in a database. This provides a huge number of textual mentions of information to feed the mining layer.
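
As a rough illustration of why a single rule over logical structure can stand in for many surface rules, here is a minimal Python sketch. It assumes the parser has already produced logical (head, relation, dependent) triples; the template fields, the trigger word "acquire", and the company names are hypothetical.

```python
def extract_acquisition(triples):
    """Fill an acquisition template from logical (head, relation, dependent) triples."""
    template = {"event": "acquisition", "buyer": None, "target": None}
    for head, relation, dependent in triples:
        if head != "acquire":
            continue  # this rule only fires on the trigger word
        if relation == "S":
            template["buyer"] = dependent
        elif relation == "O":
            template["target"] = dependent
    # Report the event only when both required slots are filled.
    if template["buyer"] and template["target"]:
        return template
    return None

# The same triples would result from "CompanyA acquired CompanyB",
# "CompanyB was acquired by CompanyA", and similar surface variants.
triples = [("acquire", "S", "CompanyA"), ("acquire", "O", "CompanyB")]
print(extract_acquisition(triples))
```

Because the rule never looks at word order, it does not need to be rewritten for each surface variant; porting to a new domain means writing new templates and triggers on top of the same parser.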

Many people confuse information extraction and text mining, but they are in fact tasks at two different levels.  Extraction faces each individual language tree, embodied in each sentence, in order to find the information we want.  Mining, however, faces a corpus, or the data sources as a whole, gathering statistically significant insights from the language forest.  In the information age, the biggest challenge we face is information overload. We have no way to exhaust the information ocean for the insights we need; therefore, we must use computers to dig the critical intelligence out of the ocean to support different applications. Mining thus relies by nature on statistics: without statistics, the information remains scattered across the corpus even if it is identified.  There is a lot of redundancy in the extracted mentions of information, and mining can integrate them into valuable insights.
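
The difference between the two levels can be sketched in a few lines of Python. The records below are hypothetical extraction results (one per sentence), and the field names and threshold are assumptions made for illustration; the mining step simply merges redundant mentions and keeps those that recur often enough.

```python
from collections import Counter

def mine_mentions(mentions, min_count=2):
    """Merge redundant extracted mentions and keep those seen at least min_count times."""
    counts = Counter((m["brand"], m["sentiment"], m["reason"]) for m in mentions)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]

# Hypothetical output of the extraction layer, one record per sentence.
mentions = [
    {"brand": "BrandX", "sentiment": "negative", "reason": "battery life"},
    {"brand": "BrandX", "sentiment": "negative", "reason": "battery life"},
    {"brand": "BrandX", "sentiment": "positive", "reason": "screen"},
    {"brand": "BrandX", "sentiment": "negative", "reason": "battery life"},
]

print(mine_mentions(mentions))
```

A real mining layer would use more careful statistics than a raw count threshold, but the division of labor is the same: extraction works sentence by sentence, mining works over the whole collection.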

[Figure: NLP architecture diagram 3, the mining layer]

Many NLP systems do not perform deep mining. Instead, they simply use a query to search, in real time, the index of extracted information in the database and merge the retrieved information on the fly, presenting the top n results to the user. This is actually also mining, but a simple form achieved through retrieval to directly support an application.

There is a lot of work that can be done in the mining layer in order to mine well. Text mining not only improves the quality of existing extracted information pieces; it can also tap hidden information that is not explicitly expressed in the data sources, such as causal relationships between events, or statistical trends in public opinions or behaviours. This type of mining was first done in traditional data mining applications, which were aimed at structured data such as transaction records, making it easy to mine implicit associations (e.g., people who buy diapers often buy beer; this reflects the common behaviour of young fathers of newborns, and such a hidden association can be mined to optimize the layout and sales of goods). Nowadays, natural language is also structured thanks to deep parsing, hence data mining algorithms for hidden intelligence in the database can, in principle, also be applied to enhance the value of intelligence.
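
The diapers-and-beer example corresponds to the standard association measures of support, confidence, and lift. The following Python sketch computes them over a handful of made-up baskets; the transactions are purely illustrative, and the same function could in principle be run over sets of items extracted from text.

```python
def association(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    has_a = sum(1 for t in transactions if antecedent in t)
    has_c = sum(1 for t in transactions if consequent in t)
    has_both = sum(1 for t in transactions if antecedent in t and consequent in t)
    support = has_both / n                           # P(A and C)
    confidence = has_both / has_a if has_a else 0.0  # P(C | A)
    lift = confidence / (has_c / n) if has_c else 0.0  # P(C | A) / P(C)
    return support, confidence, lift

# Made-up baskets for illustration only.
baskets = [
    {"diapers", "beer"},
    {"diapers", "beer", "milk"},
    {"diapers", "bread"},
    {"milk", "bread"},
    {"beer"},
]

support, confidence, lift = association(baskets, "diapers", "beer")
print(support, confidence, lift)
```

A lift above 1 suggests the two items co-occur more often than chance would predict, which is the kind of hidden association the mining layer looks for.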

The fourth architectural diagram is the NLP application layer. In this layer, the results from parsing, extraction, and mining out of the unstructured text sources can be used to support a variety of NLP products and services, ranging from the QA (question answering) systems to the dynamic construction of the knowledge graph (this type of graph is visualized now in the Google search when we do a search for a star or VIP), from automatic polling of public opinions to customer intelligence about brands, from intelligent assistants (e.g. chatbots, Siri etc.) to automatic summarization and so on.
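
Putting the four layers together, the flow from raw text to an application can be sketched as a small Python pipeline. Every function here is a deliberately naive stand-in written for illustration (the "parser" just splits three-word sentences), and the list named database is a placeholder for a real store of extracted information.

```python
def parse(sentence):
    """Stand-in for deep parsing: returns (head, relation, dependent) triples."""
    # Toy assumption: every sentence has the shape "<subject> <verb> <object>".
    subject, verb, obj = sentence.split()
    return [(verb, "S", subject), (verb, "O", obj)]

def extract(triples):
    """Extraction layer: fill a flat record from the triples of one sentence."""
    record = {}
    for head, relation, dependent in triples:
        record["event"] = head
        record["agent" if relation == "S" else "object"] = dependent
    return record

def mine(database):
    """Mining layer: count how often each event occurs across the corpus."""
    counts = {}
    for record in database:
        counts[record["event"]] = counts.get(record["event"], 0) + 1
    return counts

def answer(database, event, agent):
    """Application layer: a toy QA lookup for 'what did <agent> <event>?'."""
    return [r["object"] for r in database if r["event"] == event and r["agent"] == agent]

corpus = ["Alice likes tea", "Bob likes coffee", "Alice drinks tea"]
database = [extract(parse(s)) for s in corpus]  # extraction results are stored

print(mine(database))
print(answer(database, "likes", "Alice"))
```

The point of the sketch is the layering: applications query stored extraction results and mined statistics, and never touch the raw text directly.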

[Figure: NLP architecture diagram 4, the application layer]

This is my overall presentation of the basic architecture of NLP and its applications, based on nearly 20 years of experience designing and developing NLP products in industry.  About 18 years ago, I presented a similar diagram of the NLP architecture to our first venture investor, who told us that it was a million-dollar slide.  The presentation here is a natural inheritance and extension of that diagram.

~~~~~~~~~~~~~~~~~~~
Here is the previously mentioned million-dollar slide story.  Under the Clinton administration before the turn of the century, the United States went through a "great leap forward" in Internet technology, known as the Dot Com Bubble, a time of hot money pouring into the IT industry while all kinds of Internet startups were springing up.  In this situation, my boss decided to seek venture capital for business expansion and asked me to draw an illustration of our implemented natural language system prototype for the pitch.  I then drew the following three-tier diagram of an NLP system: the bottom layer is parsing, from shallow to deep; the middle layer is information extraction, built on parsing; and the top layer illustrates some major categories of NLP applications, including QA.  Connecting the applications with the two lower layers of language processing is the database, used to store the results of information extraction, ready at any time to support the applications above.  This general architecture has not changed much since I made it years ago, although the details and layout have been redrawn no fewer than 100 times.  The architecture diagram below is one of the first 20 or so editions, mainly covering the backend core engine of the information extraction architecture rather than the front-end flowchart for the interface between applications and the database.  I still remember that early one morning my boss sent the slide to a Wall Street angel investor, and by noon we got his reply, saying that he was very interested.  In less than two weeks, we got our first million-dollar angel investment check.  The investors labeled it a million-dollar slide, believing that it showed not only the depth of the language technology but also its great potential for practical applications.

[Figure: Pre-Knowledge Graph: Architecture of Information Extraction Engine]

 

【Related Chinese Blogs】

NLP Overview

Pre-Knowledge Graph: The Architecture of Information Extraction Engine

Natural language parser is to reveal the mystery of the language like a LIGO-type detector

Dream come true

( translated from http://blog.sciencenet.cn/blog-362400-981742.html )

The speech generation of the fully automatically translated, un-edited science blog of mine is attached below (for your entertainment :=), it is amazingly clear and understandable (definitely clearer than if I were giving this lecture myself with my strong accent).  If you are an NLP student, you can listen to it as a lecture note from a seasoned NLP practitioner.

Thanks to the newest Google Translate service from Chinese into English at https://translate.google.com/ 

 

 


Not an ad. But a historical record.

Although it has not been updated for a long time, this wiki entry remains as follows as of today, 9/28/2016
from https://en.wikipedia.org/wiki/NetBase_Solutions,_Inc.

NetBase Solutions, Inc.

From Wikipedia, the free encyclopedia

Type: Private
Industry: Market Research
Founded: 2004
Founder: Jonathan Spier and Michael Osofsky
Headquarters: Mountain View, CA, USA
Area served: Worldwide
Key people: Peter Caswell, CEO; Mark Bowles, CTO; Lisa Joy Rosner, CMO; Dr. Wei Li, Chief Scientist
Products: NetBase Insight Workbench
Website: www.netbase.com

NetBase Solutions, Inc. is a Mountain View, CA based developer of natural language processing technology used to analyze social media and other web content. It was founded by two engineers from Ariba in 2004 as Accelovation, before changing names to NetBase in 2008. It has raised a total of $21 million in funding. It's sold primarily on a subscription basis to large companies to conduct market research and social media marketing analytics. NetBase has been used to evaluate the top reasons men wear stubble, the products Kraft should develop and the favorite tech company based on digital conversations.

History

NetBase was founded by Jonathan Spier and Michael Osofsky, both of whom were engineers at Ariba, in 2004 as Accelovation, based on the combination of the words “acceleration” and “innovation.”[1][2] It raised $3 million in funding in 2005, followed by another $4 million in 2007.[1][3] The company changed its name to NetBase in February 2008.[4][5]

It developed its analytics tools in March 2010 and began publishing monthly brand passion indexes (BPI) comparing brands in a market segment using the tool shortly afterwards.[6] In 2010 it raised $9 million in additional funding and another $2.5 million in debt financing.[1][3] NetBase Insight Workbench was released in March 2011 and a partnership was formed with SAP AG that December for SAP to resell NetBase's software.[7] In April 2011, a new CEO Peter Caswell was appointed.[8] Former TIBCO co-inventor, patent author and CTO Mark Bowles is now the CTO at NetBase and held responsible for many technical achievements in scalability.[9]

Software and services

[Figure: Screenshot of NetBase Insight Workbench dashboard]

NetBase sells a tool called NetBase Insight Workbench that gives market researchers and social marketers a set of analytics, charts and research tools on a subscription basis. ConsumerBase is what the company calls the back-end that collects and analyzes the data. NetBase targets market research firms and social media marketing departments, primarily at large enterprises with a price-point of around $100,000.[10][11] NetBase is also white-labeled by Reed Elsevier in a product called illumin8.[12]

Uses

For the average NetBase user, 12 months of activity is twenty billion sound bytes from just over seven billion digital documents. The company claims to index 50,000 sentences a minute from sources like public-facing Facebook, blogs, forums, Twitter and consumer review sites.[13][14]

According to a story in InformationWeek, Kraft uses NetBase to measure customer needs and conduct market research for new product ideas.[15] In 2011 the company released a report based on 18 billion postings over twelve months on the most loved tech companies. Salesforce.com, Cisco Systems and Netflix were among the top three.[16] Also in 2011, NetBase found that the news of Osama Bin Laden eclipsed the royal wedding and the Japan earthquake in online activity.[17]

References

  1. Matt Marshall, VentureBeat. “Accelovation Raises $4M for online software for IT market research.” December 3, 2007.
  2. BusinessWeek profile.
  3. Jon Xavier, BizJournals. “NetBase filters social media for what clients need to know.” June 3, 2011.
  4. Barbara Quint, Information Today. “Elsevier and NetBase Launch illumin8.” February 28, 2008.
  5. The Economist. “Improving Innovation.” February 29, 2008.
  6. Rachael King, BusinessWeek. “Most Loved -- And Hated -- Tech Companies.”
  7. Barb Darrow, GigaOm. “SAP taps NetBase for deep social media analytics.” December 12, 2011. Retrieved May 8, 2012.
  8. San Jose Mercury News. “People on the Move.” May 15, 2011.
  9. David F. Carr, InformationWeek. “How Much is your Brand Loved (or Hated)?” June 16, 2011.
  10. Eric Schoenfeld, TechCrunch. “NetBase Offers Powerful Semantic Indexing Platform That Reads The Web.” April 22, 2009.
  11. Jon Xavier, BizJournals. “NetBase filters social media for what clients need to know.” June 3, 2011.
  12. Barbara Quint, Newsbreak. “Elsevier and NetBase Launch illumin8.” February 28, 2008.
  13. Neil Glassman, Social Times. “What Every Social Media Marketer Should Know About NetBase.” August 24, 2010.
  14. Ryan Flinn, BusinessWeek. “Wanted: Social Media Sifters.” October 21, 2010.
  15. David F. Carr, InformationWeek. “How Kraft Foods Listens to Social Media.” June 30, 2011.
  16. Ryan Flinn, Bloomberg. “Tech companies measure online sentiment.” May 19, 2011.
  17. Geoffrey Fowler and Alexandra Berzon, Wall Street Journal. “Social Media Buzzes, Comes Into Its Own.” May 2, 2011.