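For years, NLU and NLP have had another widely recognized hard problem: coordination (conjoined structure). Coordination has no standing in the logic of thought; it is a product of linguistic expression. It is the most unruly gatecrasher in linguistics: it cuts in wherever it likes, willfully, at any level. All subcat arg structures and mod-head patterns must make way for it, or else it blocks traffic and breaks the chain of the parsing path. Yet without coordination, natural language would be unbearably monotonous and tedious, and would lose all its economy.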
Take a simple example. What does the following sentence look like once it is logically expanded?
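颈椎间盘突出症的最常见和最典型表现是一侧颈肩部及上肢的酸痛
(The most common and most typical manifestation of cervical disc herniation is soreness of one side of the neck and shoulder and of the upper limb.)
==>
颈椎间盘突出症的最常见表现是一侧颈肩部的酸痛
(The most common manifestation of cervical disc herniation is soreness of one side of the neck and shoulder.)
颈椎间盘突出症的最典型表现是一侧颈肩部的酸痛
(The most typical manifestation of cervical disc herniation is soreness of one side of the neck and shoulder.)
颈椎间盘突出症的最常见表现是上肢的酸痛
(The most common manifestation of cervical disc herniation is soreness of the upper limb.)
颈椎间盘突出症的最典型表现是上肢的酸痛
(The most typical manifestation of cervical disc herniation is soreness of the upper limb.)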
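And this involves only two coordinations; a sentence with five or six, or even more than ten, is not rare. Language is not logic. Without coordination, language would face a combinatorial explosion of verbosity. It is hard to imagine that a traditional single-layer parsing system, such as the classic textbook Chomsky-style CFG-based chart parsing, could handle all kinds of coordination properly.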
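The logical expansion of the example sentence can be pictured as a Cartesian product over its conjunct slots. The sketch below is only an illustration of that idea, not a parsing method: it assumes the coordinations have already been identified, and the names `expand_coordination`, `count_expansions`, `template`, and `slots` are hypothetical helpers introduced here.

```python
from itertools import product
from math import prod


def expand_coordination(template, slots):
    """Return one sentence per combination of conjuncts.

    `template` uses str.format placeholders, and `slots` maps each
    placeholder name to the list of conjuncts that can fill it.
    """
    names = list(slots)
    expanded = []
    for combo in product(*(slots[name] for name in names)):
        expanded.append(template.format(**dict(zip(names, combo))))
    return expanded


def count_expansions(slots):
    """Number of expanded sentences: the product of the conjunct counts."""
    return prod(len(conjuncts) for conjuncts in slots.values())


# Hypothetical, hand-built decomposition of the example sentence.
template = "颈椎间盘突出症的{mod}表现是{loc}的酸痛"
slots = {
    "mod": ["最常见", "最典型"],   # most common / most typical
    "loc": ["一侧颈肩部", "上肢"],  # one side of the neck and shoulder / upper limb
}

sentences = expand_coordination(template, slots)
for sentence in sentences:
    print(sentence)
print(count_expansions(slots))
```

Two binary coordinations already yield four sentences. Under the same assumption, ten binary coordinations would yield 2 to the power of 10, that is 1024, expanded sentences, which is the explosion of verbosity that coordination spares the speaker.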
Conjoined structures can be so f* hierarchical. Even for a very deep, multi-level parsing system, conjoining remains a challenge unless it is handled very carefully and skillfully by a very experienced linguist, because the boundaries are tough to identify and they appear at any level at will. The conjoined elements are semantically parallel, but this parallelism, which ideally should serve as a condition to help identify the conjoined structure and its scope, is unfortunately relative and fuzzy in practice and can hardly be enforced. Food can be conjoined with food, of course, but look at this:
我喜欢肥肉和哲学。 (I like fatty meat and philosophy.)
Food and knowledge, totally different monsters semantically, can also be conjoined. It is preference semantics at its worst.
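To make the point concrete, here is a toy sketch of why semantic parallelism cannot be enforced as a hard condition. It is not a proposed solution. The lexicon `SEMANTIC_TYPE`, its coarse type labels, the extra word 米饭, and the `mismatch_penalty` value are all invented for illustration.

```python
# Hypothetical toy lexicon with coarse semantic types (illustration only).
SEMANTIC_TYPE = {
    "肥肉": "food",       # fatty meat
    "米饭": "food",       # rice
    "哲学": "knowledge",  # philosophy
}


def hard_parallel(left, right):
    """Strict rule: license a coordination only if both conjuncts share a type."""
    return SEMANTIC_TYPE[left] == SEMANTIC_TYPE[right]


def soft_parallel_score(left, right, mismatch_penalty=0.5):
    """Preference-style rule: never reject, only score same-type pairs higher."""
    if SEMANTIC_TYPE[left] == SEMANTIC_TYPE[right]:
        return 1.0
    return 1.0 - mismatch_penalty


print(hard_parallel("肥肉", "米饭"))        # food with food
print(hard_parallel("肥肉", "哲学"))        # food with knowledge
print(soft_parallel_score("肥肉", "哲学"))  # dispreferred, but not ruled out
```

The strict rule would wrongly rule out the perfectly acceptable 肥肉和哲学, while the soft score can only rank candidate conjuncts and leaves the scope decision to other evidence.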
OK, I am not going to elaborate on solutions, which would take a long article of its own. This post serves as an introduction to this linguistic monster, to raise awareness of the linguistic challenges in natural language parsing.
【Related】