分类： 机器翻译

from NLP 历史上最大的媒体误导：成语难倒了电脑

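【Starting from the "Belt and Road" translation gaffe in machine simultaneous interpreting at Boao】

This is the AI gaffe that has been widely discussed and passed around online over the past two days (博鳌AI同传遭热议):

Just yesterday I was wondering how the "Belt and Road" (一带一路) translation gaffe could have happened. High-frequency new terms and idioms of this kind are exactly what machines are best at; handling them is nothing more than dictionary lookup from memory.

Today I read the 新智元 interview (博鳌AI同传遭热议！腾讯翻译君负责人李学朝、讯飞胡郁有话说), and it turns out the gaffe did not come from the idiom itself but from the ability to "generalize" the idiom. Capturing and translating "idiom generalization" is still a weak spot today.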

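For Chinese-to-English, translating "一带一路" poses no problem at all, because it is the most popular new term of recent years from President Xi's new-era policies, a household phrase that nobody generalizes. Machine translation naturally gets it right: however the mainstream translates it is how the machine translates it, no better and certainly no worse.

But once this Chinese term reaches the English-speaking world, not every listener remembers the exact wording. As a result, the "standard" popular rendering "one belt one road" gets misremembered by some foreigners as "one road one belt", "the road and belt", and so on. That is understandable: foreigners have no political study sessions and no current-affairs exams to pass, so remembering the gist is already good enough.

Although the wording differs and the order changes, both keywords, road and belt, are still there. This kind of idiom "generalization" poses no challenge for human interpreters, because the way foreigners misremember and "generalize" matches the interpreter's own cognition, so a human interpreter would never stumble on it. Big-data-driven machine translation, however, was stumped this time and really did go "neural" (神经, also slang for going haywire). These generalized variants are mostly sparse data from spoken language, so the system could not map them back to the Chinese "一带一路", and the gaffe followed.

Improving MT's "idiom generalization" ability is a pain point today, but it is not without leads, and it will become a breakthrough point in the future. It is just that mainstream systems and research have not yet gotten around to it. A typical case of idiom generalization that I mentioned before should be instructive: "1234应犹在，只是56改" (a fixed verse frame whose numbered slots can be refilled).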
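As a rough illustration of why these variants are recoverable in principle, here is a minimal Python sketch that maps the reordered English variants quoted above back to the canonical term by comparing content keywords and ignoring order. The lexicon, the stopword list, and the function names are all hypothetical, and real spoken input would need far more context than this naive set comparison.

```python
import re

# Hypothetical lexicon: canonical term -> its content keywords (illustrative only).
CANONICAL = {
    "一带一路": {"belt", "road"},
}

# Hypothetical stopword list: function words and fillers that carry no term identity.
STOPWORDS = {"the", "a", "an", "and", "one", "of", "initiative"}


def content_words(phrase):
    """Lowercase the phrase and keep only its non-stopword alphabetic tokens."""
    return {w for w in re.findall(r"[a-z]+", phrase.lower()) if w not in STOPWORDS}


def match_term(phrase):
    """Return the canonical term whose keyword set equals the phrase's, else None."""
    words = content_words(phrase)
    for term, keys in CANONICAL.items():
        if words == keys:
            return term
    return None


for variant in ["one belt one road", "one road one belt", "the road and belt"]:
    print(variant, "->", match_term(variant))
```

Requiring the keyword sets to be equal, rather than merely overlapping, keeps an unrelated phrase such as "the road to success" from being matched.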

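The widely circulated jokes about early machine translation also picked on idioms (The spirit is willing, but the flesh is weak, 心有余而力不足, was reportedly translated as "the whiskey is alright, but the meat is rotten"), because most people assume that idioms are the hardest thing to understand and must therefore be a challenge for machines. This is a thoroughly lay way of thinking. The essence of an idiom is memory, and when it comes to memory, the computer is a master and the human brain is tofu.

The earliest NLP practice was machine translation. Under the computer's mysterious halo, machine translation, seen as simulating or challenging human intelligence, naturally became a hot topic for the media. One widely circulated machine translation joke stands out as the most misleading of all:

The story goes that a reporter testing a machine translation system thought of using this idiom from the Bible:

The spirit is willing, but the flesh is weak (心有余而力不足)

Translated into Russian and then back into English, it became: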

The whiskey is alright, but the meat is rotten（威士忌没有问题，但肉却腐烂了）

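This is probably the most widely circulated joke in the media. For many years this classic has been repeated with ever more embellishment, becoming the standard laughingstock of NLP. Yet there is no simpler problem in natural language technology than idioms. The notion that idioms are a hard problem for NLP is pure conjecture by outsiders, and two factors have led many uncritical people to believe it.

The first is that when an NLP system's idiom dictionary is incomplete, it produces "jokes" like the one above, which seem to expose the machine's stupidity. In fact such "errors" are the easiest ones to debug: just complete the dictionary. Idioms are by definition listable, so the dictionary can be completed by hand or acquired automatically from corpora; either way it is a tractable task. Linguistics tells us that an idiom is characterized by little or no semantic compositionality: it must be memorized (stored) as a whole, which is what makes idioms a closed, listable set.

The second is a misunderstanding of machine "understanding" (really a kind of "artificial" intelligence): the assumption that whatever humans find hard to understand must also be hard for machines, when the two kinds of "understanding" are not the same thing at all. Many idioms have historical stories behind them and require historical knowledge to be truly understood; machines have no background knowledge; hence the conclusion that idioms are the bottleneck of NLP.

The fact is that for NLP, to recognize is to understand, and recognizing enumerable expressions is nothing more than memory; in the end it is a matter of storage capacity. Yet some people really are naive enough to believe that a "computer" made of cold inorganic materials has the same autonomous understanding ability or mechanism as the human brain.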
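The "idioms are just memory" point can be made concrete with a minimal Python sketch of longest-match lookup against a stored idiom list, falling back to word-by-word handling only when no stored idiom matches. The two-entry lexicon, the placeholder fallback, and the function names are assumptions for illustration, not any real system's design.

```python
# Hypothetical mini idiom lexicon; a real one would hold many thousands of entries.
IDIOMS = {
    ("the", "spirit", "is", "willing", "but", "the", "flesh", "is", "weak"): "心有余而力不足",
    ("long", "time", "no", "see"): "好久不见",
}
MAX_LEN = max(len(key) for key in IDIOMS)
PUNCT = ".,!?;:\"'"


def tokenize(text):
    """Split on whitespace, strip surrounding punctuation, lowercase."""
    return [w.strip(PUNCT).lower() for w in text.split() if w.strip(PUNCT)]


def translate(text, word_fallback=lambda w: f"<{w}>"):
    """Greedy longest-match idiom lookup; unmatched words go to word_fallback."""
    tokens = tokenize(text)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest possible span first so a whole idiom wins over its parts.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in IDIOMS:
                out.append(IDIOMS[key])
                i += n
                break
        else:
            # No stored idiom starts here: handle the single word compositionally.
            out.append(word_fallback(tokens[i]))
            i += 1
    return out


print(translate("The spirit is willing, but the flesh is weak"))
```

In this sketch, fixing a "whiskey and meat" style error amounts to adding one more entry to `IDIOMS`, which is the sense in which such errors are the easiest to debug.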

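On a suitable translation of the new-era "一带一路", I once also discussed it from the angle of linguistic word formation:

The official translation of "一带一路" is: one belt one road.

I could not make sense of it, and only yesterday figured out that China is advocating, with China in the lead, the development of new economic and trade zones along the ancient Silk Road: on the one hand to help absorb excess production capacity, on the other to drive the regional economy toward a win-win outcome, letting countries in the region share the locomotive benefits of China's rapid growth and thereby establishing an image of China's rise as a peaceful leader.

I feel there are more, and perhaps better, options. It is an idiom anyway, and nobody can work out the real meaning from the literal form alone; further explanation is always needed. So why not simply translate it as: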

一带一路 ===》 one Z one P （pronounced as：one zee one “pee”）

How about that? This translation is practically on a par with the classics "long time no see" (好久不见) and "people mountain people sea" (人山人海). Seriously, Zone is much better than Belt.

One zone one path.
One zone one road.
New zone old road.
New Silk Road Zone.

None of them feels as catchy as one Z one P.

from 【语言学随笔：从缩略语看汉字的优越性】

【相关】

博鳌AI同传遭热议！腾讯翻译君负责人李学朝、讯飞胡郁有话说

NLP 历史上最大的媒体误导：成语难倒了电脑

【语义计算：议一议“微软机器翻译提前七年达到专业翻译水平”】

李：
最近微信群疯传一条新智元的人工智能新闻专访，【机器翻译提前7年达到人类专业翻译水平，微软再现里程碑突破】。不少老友也特地转发给我这个“老机译”。微软这几天的营销好生了得。到处都是这个第一家超越人类的MT新闻而且提前了七年（这个提法有点扯，因为如果一个行业很多系统在某个时刻普遍大都可以达到一个水准，再说提前n年就不智了）！

这个微软 MT 是在哪里？比较过百度，谷歌，有道。有道似乎最好，所以现在就用有道。不妨也试试微软。

事到如今，这几家都可以 claim 新闻翻译超过业余翻译的水平，进入专业翻译的段位了。跟语音类似，这是整个行业的突破，神经翻译大幅度超越上一代统计翻译，尤其是顺畅度。眼见为实,这一点我们都是见证人。已经 n 多次测试过这些系统了。(【谷歌NMT，见证奇迹的时刻】; 【校长对话录：向有道机器翻译同仁致敬】). 如果是新闻文体，很少让人失望的。

说是第一个正式超越的系统云云，基本上是 marketing。

MT 的下一个突破点应该是：（i）对于缺乏直接对比语料的语言对的 MT（据说进展神速）；（ii）对于缺乏翻译语料的领域化 MT，譬如翻译电商领域，目前可用度差得一塌糊涂（20%左右），可有需求，无数据; (iii) 在保持目前NMT 目标语顺畅度的优势情况下，杜绝乱译，确保忠实可信。

这次他们严格测试的是汉译英，拿出数据来证明达到或超越了一般人的翻译水平。然后说，英译汉是类似的方法和原理，所以结论应该相同。这个我信。

有意思的是，在规则MT时代，绝不敢说这个话。汉译英比英译汉困难多了，因为汉语的解析比英语解析难，基于结构解析和转换的翻译自然效果很不相同。但目前的NMT 不需要依赖解析，所以语言的方向性对质量的影响很小。以前最头疼的汉译外，反而容易成为亮点。

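When I entered the field, I had two interests: one was foreign-to-Chinese MT (mainly English-Chinese), the other was exploring Chinese parsing. I was confident about the former: I knew the road was passable, that it was just hard labor, and that with time and resources the quality would keep improving. I was not confident about the latter; it felt like a long, long road where "who knows how long the red flag can keep flying" (【从产业角度说说NLP这个行当】), but it was simply too much fun. My dream back then was that once Chinese parsing was done well, I would use it for Chinese-to-foreign MT and make it as good as foreign-to-Chinese MT. How satisfying that would be.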

中文解析经过n多年的思索和实践，终于有底气了（【美梦成真】）。可是原先要落地MT的设想，却突然失去了这个需求和动力。好在 NLU 不仅仅在 MT 落地，还有许多可以落地的地方。

真所谓人算不如天算，看潮起潮落。老友谈养生之道，各种禁忌，颇不以为然，老了就老了，要那么长寿干嘛？最近找到一条长寿的理由，就是，可以看看这个世界怎么加速度变化的。今天见到的发生的许多事情，在 30 年前都是不可想象的：NMT，voice, image, parsing，iPhone，GPS, Tesla, you name it.

王:
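Chinese parsing, and of course parsing of other languages just the same, has very broad prospects and a huge market. Because at this stage it cannot yet reach human-level understanding, it is still at the stage of crossing the river on a limited number of small stones (landing in one application at a time); only later will it get onto the main road and the highway. I too hope for an all-purpose intelligent secretary that understands me accurately, gets things done quickly, and delivers the result I expect. But capabilities are still limited, and parsing naturally still falls well short.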

李:
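The application potential of parsing is great, and in principle nobody disputes it; after all, it is a foreseeable key support for AI on the road to cognition. In practice, though, there are two hurdles: it is hard to build and hard to use. Hard to build, because those who want to use it often cannot build it, and you cannot ask everyone to become a parsing expert. Hard to use, because so far I have not seen a successful large-scale use case of a standalone, off-the-shelf parser. The relatively successful uses are mostly in-house: built and used by the same team. That limits its range of application and its potential. Successful in-house experience is at most a feasibility proof showing that structural parsing really can empower applications. The road to platforms and domain adaptation is still long.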

核武器之所以在吆喝，是因为它还没爆炸，也似乎短期内不会爆炸。真爆炸了，听响声就够了，不需要吆喝了。

嘿，找到 MS Translator URL 了：https://www.bing.com/translator

做个现场测试和比较，用今天城里的新闻：

白宫发言人桑德斯14日表示，电视评论员柯德洛（Larry Kudlow）将出任国家经济会议主席。

桑德斯透过声明指出，川普向柯德洛提出担任总统经济政策助理，以及国家经济会议主席一事，柯德洛也接受了；白宫之后会宣布，柯德洛上任的时间。

川普月初宣布将对进口钢铁和铝材分别课征关税，其国家经济会议主席柯恩（Gary Cohn）因不认同川普此举，在6日请辞。

柯恩请辞后，柯德洛是接任的人选之一；在过去一年，柯德洛是川普的非正式顾问，也是川普2016年总统大选竞选期间的顾问。

川普13日向媒体表示，他有很高机率将任命柯德洛递补首席经济顾问柯恩辞职遗缺。川普说，“柯德洛成为我的好友一段时间了，他在我竞选早期就力挺我，是我最初的支持者之一。他是名非常有才华的人。”

70岁的柯德洛在前总统雷根任内监督白宫管理及预算局5年。柯德洛2001年加入财经媒体CNBC，之后成为知名评论家与主持人。

微软的自动翻译是：

White House spokesman Sanders said 14th that TV commentator Codro Larry Kudlow will serve as president of the National Economic Council.

Sanders said in a statement that Trump to Codro as president of the economic policy assistant, as well as the President of the National Economic Council, Codro also accepted, the White House will announce later, Codro time.

Cohn, the president of the National Economic Council, resigned on 6th for not agreeing to Trump's move, announced earlier this month that it would impose tariffs on imported steel and aluminium.

After Cohen resigned, Codro was one of the successors, and in the past year Codro was Trump's unofficial advisor and advisor to Trump during the 2016 presidential election campaign. Trump 13th to the media that he had a high chance of appointing Codro chief economic advisor Cohen to resign casual vacancies. Trump said, "Codro has been my best friend for a while, and he's been my first supporter of my early campaign." He is a very talented person.

” The 70-year-old Codro oversaw the White House administration and Budget Office for 5 years in former President Ronald Reagan. Codro joined the financial media CNBC in 2001 and became a well-known critic and moderator.

有道的译文是：

Larry Kudlow, a television commentator, will become President of the national economic council, White House spokesman jay sanders said Sunday.

In a statement, Mr. Sanders said he had accepted Mr. Cordelo as an assistant to the President's economic policy and the President of the national economic council. The White House would then announce that Mr. Cordelo took office.

The chairman of the national economic council, Gary Cohn, said he would resign on June 6 after he announced that he would impose tariffs on imported steel and aluminum.

When cohn resigned, he was one of his successors. Over the past year, he has been an informal adviser to trump and an adviser to trump during the 2016 presidential campaign.

Trump told the media on Wednesday that he had a high chance of appointing the resignation of coderot's chief economic adviser, korn/ferry. "He was a good friend for a while," trump said. "he supported me early in my campaign and was one of my first supporters. He is a very talented person.

The 70-year-old has overseen the White House administration and budget office for five years under former President Reagan. Cordelo joined CNBC in 2001 and became a well-known critic and host.

谷歌的译文是：

White House spokesman Saunders said on the 14th that television commentator Larry Kudlow will serve as chairman of the National Economic Council.

Saunders pointed out through the statement that Trump had proposed to Cadero as the president’s economic policy assistant and the chairman of the national economic conference. Caldero also accepted it; the White House will later announce the time when Calderon took office.

Trump announced at the beginning of the month that it would impose tariffs on imported steel and aluminum products. Gary Cohn, chairman of the National Economic Council, resigned on the 6th because he did not agree with Trump.

After Keen resigned, Kedlow was one of the candidates for the replacement. In the past year, Kedro was an informal advisor to Trump and a consultant during Trump's 2016 presidential campaign.

Trump stated to the media on the 13th that he has a high chance of appointing Kodlow to replace his chief economic advisor. Trump said, “Kadelo has been a friend of mine for some time. He supported me early in my campaign and was one of my initial supporters. He is a very talented person.”

Kodlow, 70, oversees the White House’s Office of Management and Budget for 5 years under former President Reagan. He joined the financial media CNBC in 2001 and later became a famous critic and host.

粗粗读一遍，都不错。可能是偏爱，还是更喜欢有道。

Here is another piece, though it is not really news:

列宁的导师普列汉诺夫的遗言：
一、随着生产力的发展，知识份子的队伍比无产阶级增加得更快，在生产力中的作用跃居首位，在电气时代马克思主义的无产阶级专政理论将会过时。
二、布尔什维克的无产阶级专政将迅速演变成一党专政，再变为领袖专政。而建立在欺骗和暴力基础上的社会，本身就包含着自我毁灭的炸药，一旦真相大白，便会立刻土崩瓦解。
三、“布”党将依次遇到四大危机：饥荒危机、意识形态危机、社会经济危机和崩溃危机，最后政权土崩瓦解，这一过程可能持续数十年，但这个结局谁也无法改变。
四、国家的伟大并不在于它的领土甚至它的历史，而是民主传统和公民的生活水平。只要公民还在受穷，只要没有民主，国家就难保不发生动荡，直至崩溃。

微软：
Plekhanov's last words:

With the development of productive forces, the team of intellectuals has increased faster than the proletariat, the role of the productive forces leaped to the top, and the Marxist proletarian dictatorship theory in the era of electricity will be outdated. Second, the Bolshevik dictatorship of the proletariat will rapidly evolve into a one-party dictatorship, then become a leader dictatorship.

A society based on deception and violence, in itself, contains self-destructive explosives that, once the truth is revealed, Fall apart immediately.

Third, the "cloth" party will encounter four major crises: Famine crisis, ideological crisis, social economic crisis and collapse crisis, the final regime fell apart, this process may last for decades, but this end can not change. The greatness of the state is not in its territory or even its history, but in its democratic traditions and the living standards of its citizens. As long as citizens are still poor, as long as there is no democracy, the state will not be in turmoil until it collapses.

谷歌：
The words of Plekhanov:
     First, with the development of productive forces, the ranks of intellectuals have grown faster than the proletariat, and they have taken the lead in productivity. In the age of electricity, the Marxist theory of the dictatorship of the proletariat will become obsolete.
     2. The dictatorship of the proletariat of the Bolsheviks will rapidly evolve into a one-party dictatorship and will become a dictatorship of leaders. The society based on deception and violence itself contains self-destructive explosives. Once the truth is revealed, it will soon fall apart.
     Third, the “cloth” party will encounter four major crises in sequence: the famine crisis, the ideological crisis, the socio-economic crisis, and the collapse crisis. The final regime collapsed. This process may last for decades, but no one can change this outcome.
     Fourth, the greatness of the country does not lie in its territory or even its history. It is the democratic tradition and the standard of living of its citizens. As long as citizens are still living in poverty, as long as there is no democracy, the country cannot guarantee that there will be no turmoil until collapse.

有道：

His last words:
One, with the development of productive forces, intellectual team increase faster than that of the proletariat, in the role as the first productivity, at the age of electricity of marxism's theory of the dictatorship of the proletariat will be out of date.

The dictatorship of the proletariat of the bolsheviks will quickly turn into a one-party state and become a leadership dictatorship. Society, based on deception and violence, contains self-destructive explosives, and when the truth comes out, it will fall apart.

Three, "cloth" party will, in turn, have four big crisis: hunger crisis, the collapse of the ideology, social and economic crisis and crisis, the regime collapse, this process may last for decades, but the end no one can change.

The greatness of a nation lies not in its territory or even its history, but in its democratic traditions and the living standards of its citizens. As long as the citizens are still poor, as long as there is no democracy, there will be no unrest until the country collapses.

梁：
这岂不让专业翻译人士好紧张吗？

李：
没办法，准备改行做编辑吧。作为职业，译员的市场的确在急剧萎缩中，大势所趋。

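Yesterday I told an old friend that one still has to look up and tell stories, not just keep one's head down doing real work, and one also has to watch for the right tailwind and timing for the story. Microsoft's publicity has become a household topic that even a grandmother can follow, and the general public is amazed. Amazed, first, that AI is closing in step by step and has become something you can see and touch. Amazed, second, that Microsoft is so formidable: the impression is that it has left its rivals far behind and, in this fast-moving AI era, reached the milestone of catching up with human language cognition seven years ahead of schedule.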

梁：
对，讲个好故事，比什么都重要！

李：
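In fact these systems are all of about the same quality, and there are also Sogou (搜狗) and some startups. Since deep neural networks arrived roughly two or three years ago, they have one after another reached the level where news translation surpasses amateur human translation. In other words, the whole industry has moved up. Any one of them could make this claim without blushing. But ordinary people and investors do not know that, so it comes down to who can tell the story.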

马：
大公司自己宣传，一帮不懂的媒体也愿意跟着捧，甚至捧得更卖力气。现在机器翻译拼的就是语料和平台，以前搜狗没有机器翻译，和我们实验室的刘洋合作后，不到一年就出了一个很不错的系统。

李：
AI 越来越像当年美苏的军备竞赛了，size matters.

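@马少平 If Sogou wants to do marketing, it could partner with a TV station on a live news translation grand prix: invite celebrities from the translation world as judges, find n professional translators and m amateur translators, and set timed translation tasks on the spot (the time limit is best set so that a skilled professional has no time to consult reference books and can just barely cope from memory alone).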

马：
@wei 比起其他公司来，搜狗不是太会营销。

李：
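There is no suspense in this kind of contest: in the end the machine is bound to win. Design it well, as a double-blind contest judged by experts, and nobody can call it unfair. If the machine happens to come second or third rather than first, the publicity effect is even better, since it sets up the climax for the next round. That human champion translator should be carefully protected and heavily promoted, with much written about his photographic memory and encyclopedic knowledge. When Watson crushed humans at the quiz show, it was essentially playing this same game, and it played it all the way into the milestone section of the computer history museum. If MT wants to play now, a similar effect is even easier to design, and pulling in the giants is easy too: their MT websites are all public and available on demand.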

Ben:
@wei youTube上《成都》有高圆圆的音乐，立委应该会喜欢！

李:
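It is a good song; I have listened to it many times, warm and comforting. 赵雷 has a voice full of character, but the guy gets too much screen time; 圆圆 makes a guest appearance with too little, and the silhouette at the beginning is a stand-in.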

成都是个养人的好地方，出国前呆过大半年，乐不思非蜀（见【立委外传】）：

1990 ：尝尽成都美食。茶馆火锅夫妻肺片。

赵雷草根天才啊，独领城市歌谣，能写出这样的绝妙好词：

【画】
为寂寞的夜空画上一个月亮
把我画在那月亮的下面歌唱
为冷清的房子画上一扇大窗
再画上一张床
画一个姑娘陪着我
再画个花边的被窝
画上灶炉与柴火
我们一起生来一起活
画一群鸟儿围着我
再画上绿岭和青坡
画上宁静与祥和
雨点儿在稻田上飘落
画上有你能用手触到的彩虹
画中有我决定不灭的星空
画上弯曲无尽平坦的小路
尽头的人家梦已入
画上母亲安详的姿势
还有橡皮能擦去的争执
画上四季都不愁的粮食
悠闲的人从没心事
我没有擦去争吵的橡皮
只有一只画着孤独的笔
那夜空的月也不再亮
只有个忧郁的孩子在唱
为寂寞的夜空画上一个月亮

我不知道如何翻译，劳有道机器翻译一下：

“Draw a moon for the lonely night sky.
Draw me under the moon and sing.
Draw a large window for the cold house.
Draw another bed.
Draw a girl with me.
Draw another lace bed.
Draw a stove and firewood.
We were born to live together.
Draw a flock of birds around me.
Let me draw green ridge and green slope.
Picture peace and serenity.
The rain fell on the rice fields.
There's a rainbow you can touch with your hands.
There are stars in the picture that I have decided not to destroy.
There are endless smooth paths.
The end of the family dream has entered.
Picture mother's peaceful pose.
There's also an eraser argument.
Paint food that is not sad in four seasons.
A leisurely person never worries.
I didn't wipe out the quarrel eraser.
There was only one painting of a lonely pen.
The night sky was no longer bright.
Only a sad child was singing.
Draw a moon for the lonely night sky.”

自然有错译的地方（如 there's also an eraser argument. I didn't wipe out the quarrel eraser），可是总体而言，专业出身的我也不敢说一定能译得更好，除非旬月踟蹰。机器翻译超越业余翻译，已经是不争的事实。

【相关】

【尼沙龙笔记：宁顺不信，神经机器翻译的成就和短板】

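After high praise for the revolutionary progress of neural machine translation, two shortcomings come up. The first is unfaithfulness: making things up out of nothing or making things vanish, passing the fake off as real, calling a deer a horse, with breathtaking audacity. The second is dependence on domain data: in a domain with no data, the system is immediately at a loss.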

李:
我用有道app里面的口译功能测试了一下字正腔圆的做节目的人，【文昭談古論今】，一边在 youTube 上看他的视频，一边把有道打开做现场口译，几乎完美无缺。

毛:
同声传译，以后是不需要的了。

李:
识别我的口音还是有误：识别我的英文比中文似乎更好一些。上面的那位是自媒体里面的很受欢迎的一位，文科背景，出口成章，比播音员说话还清晰。

语音识别的两个明显错误：neural network 错成了 neutral network，text 成了 tax（税），大概是我的英语发音的确不够好。但总体而言，句子蛮长，一口气说一大段，它也一样即时翻译（通过wifi接云端，立等可取）。

哈，text 与 taxi（出租车）也打起架来：

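Judging from these errors that no human would make, the huge success of neural MT and the huge success of speech transcription rest on exactly the same principle: both imitate from truly massive data, without any "understanding". Illogical or unreasonable sentences get presented in a way that sounds quite pleasing to the ear.

Even so, back then we never expected that this road could go so far without parsing and understanding. For a long time our belief was: without understanding, there can be no translation. A parrot can pick up a few scattered fragments, but it could never learn something as complex as natural language so vividly. In fact, however, the "parroting" approach, backed by powerful data and computing capacity, really can pass for the real thing over a very wide range.

The weak spot is also obvious: without data, no amount of computing power can train even a parrot. For example, machine translation for e-commerce scenarios, lacking large-scale Chinese-English parallel data, can hardly move a step.

Below is an experiment in which I spoke Chinese and had Youdao interpret it into English: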

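"二次大战" (World War II) was first mis-transcribed as "20大战" and then mistranslated as "20th century". This error deserves comment, because it shows why neural translation achieves "fluency" at the expense of "faithfulness". What I said was "大约是在二次大战以后" (roughly after World War II). The transcription went wrong in only one spot, yielding "大约是在20大战以后", yet the translation went wildly astray. This is not error propagation in the original sense; it is a manifestation of the current tendency of neural translation toward "wild translation", and it is by design: this wild translation does largely overcome the fatal flaw of the previous generation of statistical machine translation, its lack of fluency.

The root of "wild translation" (or "fluency") is that current machine translation contains a language model specifically for the target language, not just a bilingual model. In the target-language model, "beginning of 20th century" must be common enough to have been memorized, so even though the source sentence was "20大战以后", the system ignores that all the same. "大战" (great war) is transformed into century, which is calling a deer a horse, and "以后" (after) becomes its opposite, beginning, which is simply turning black into white. With the previous generation, statistical MT, or the generation before that, rule-based MT, this kind of error would never occur; the output would be something like "20 wars later" or "after 20th war". But the target-language training data simply contains no such phrase as "20 war", so rather than be faithful but awkward, the system makes things vanish or makes things up, even calling a deer (20 war) a horse (20th century), all for the sake of "fluency". This is the weak spot of current machine translation; many people have noticed and criticized it, and the research community is working on countermeasures.

In other words, for the sake of "fluency", current systems may ignore some material in the source. Equally for the sake of "fluency", the translation may add material out of nothing. This can be very misleading for people who do not know the source language. As critics say: I look for a translation precisely because I cannot read the original; your translation comes out sounding so fluent that I have no choice but to believe it, yet you have quietly slipped in fabrications. That is going too far.

This criticism is of course well founded. Of faithfulness, expressiveness, and elegance (信达雅), faithfulness is the foundation; if faithfulness cannot be had, what use are the other two? Without faithfulness, expressiveness and elegance only deceive more effectively, and it would be better not to translate at all. Make something up in one place, and I begin to doubt the whole text. This kind of presumptuous fabrication really does harm.

However, those who know the history and have lived through the different stages of machine translation see it from another angle. The reality is that the output of the previous two generations of machine translation was mostly painful to read, with no fundamental solution for readability and fluency (though there was incremental progress), even if the meaning could be conveyed after a fashion (that is, they would not dare to make things up or drop things where faithfulness is concerned). The problem was so serious that it put many people off using machine translation unless they had no choice, because reading machine output was simply too awkward and too unpleasant.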

毛:
能把谎说圆，这不正是逼近了人的智能吗？

李:
@毛德操问题是，鹦鹉学舌，哪里有什么“把谎说圆”。机器不会说谎，正如机器不会说真；同理，潜艇不会游泳。无中生有是真的，但“胡编乱造”不过是个比喻说法。机器没有歹心，正如机器没有良心。因为机器根本就没有心。有的不过是记忆和计算而已。硬要把计算说成智能，硬要把比喻当成真相，那也没辙。乔姆斯基的态度是，不理睬。还好，当年创造的是“人工智能”这个词，脱不开“人工”、“人为”、“模仿”的涵义。如果先驱们当年达特茅斯开会，不小心起个名字是“机器智能”，那可就糟透了。

Nick：
@wei 英国最早的说法就是machine intelligence。大概到七十年代才开始被美国带成人工了。

李：
达特茅斯会上呢？

马：
达特茅斯会上，还有一个词是复杂信息处理，不过最后还是AI占了上风。

李：
先驱们蛮“接地气”啊。其实，“复杂信息处理”很中肯，符合术语命名的严肃性。AI 还是太过“性感”了。

机器翻译更惨，很长时间是 “自动翻译”、“机器翻译” 混用，后来基本统一为机器翻译，因为自动翻译有多种用法什么全自动翻译半自动翻译等等。当然较真的话，自动翻译比机器翻译还不堪。其实应该叫做随大流翻译，或者叫做 NLU-free translation，简称无智翻译，and I was not kidding.

Nick：
自动/机器定理证明。mt就不太好说artificial translation，中文更不能说人工翻译。artificial本来就有点瞎编的意思。

李：
其实还真就是 artificial，本来就是仿造啊。译成汉语是仿人翻译。没有人的翻译样本，大量的样本，当今的MT根本就不可能。

马：
AI翻译

李：
人工智能其实应该翻译为人造智能。人造翻译（或仿人翻译）与人工翻译可大不相同。但取法乎上仅得其中的古训不大灵了，古训忽略了量的概念。被取法者足够大量的时候所得不止于中。AI 代替中庸势在必行。取法乎众可得中上，这是事实。但最好的机器翻译不如最好的人工翻译，这也是事实。因为后者有智能有理解。而前者虽然号称神经了，其实连“人造的理解”（譬如 NLU）都没有。

现如今人工智能好比一个性感女郎，沾点边的都往上面贴。今天跟一位老人工智能学者谈，他说，其实人工智能本性上就是一个悲催的学科，它是一个中继站，有点像博士后流动站。怎么讲？人工智能的本性就是暂时存放那些机理还没弄清楚的东西，一旦机理清楚了，就“非人工智能化”了（硬赖着不走，拉大旗作虎皮搞宣传的，是另一回事儿），独立出去成为一个专门的学科了。飞机上天了，潜艇下水了，曾几何时，这看上去是多么人工智能啊。现在还有做飞机潜艇的人称自己是搞人工智能的吗？他们属于空气动力学，流体动力学，与AI没有一毛钱的关系。同理，自动驾驶现如今还打着AI的招牌，其实已经与AI没啥关系了。飞机早就自动驾驶了，没人说是人工智能，到了汽车就突然智能起来？说不过去啊。总之，人工智能不是一个能 hold 住很多在它旗下的科学，它会送走一批批 misfits，这是好事儿，这是科学的进步。真正属于人工智能的学问，其实是一个很小的圈圈，就好比真正属于人类智能的部分也是很小的圈圈，二者都比我们直感上认为的范围，要小很多很多。我问，什么才是真正的恒定的AI呢？老友笑道，还是回到前辈们的原始定义吧，其中主要一项叫做“general problem solver”（西蒙 1959）。

马：
是这么回事。11年写的一篇博客：人工智能，一个永远没有结果的科学_马少平_新浪博客。

李：
好文。马老师科普起来也这么厉害啊堪比白居易写诗老妪能解。有说服力而且生动。

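"11年写的一篇博客" (a blog post written in '11). I must be obsessed: at first glance I unconsciously turned myself into a neural network, and inside the network it was encoded as "a blog post written 11 years ago", fluency over faithfulness. My big-data training first ruled out the 2011 reading and then added an "ago" out of nothing for the sake of fluency. In modern times, what is faithfulness worth? Hype is the money tree.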

马：
用时11年，?

洪：
人工智能是江湖，八仙过海都威武。武侠人物不绝出，很多虚晃都诈唬。

AI像狗头前置棍，棍拴骨头引狂奔。确实因之人前进，精髓却总不得啃。

李：
洪爷的诗没治了，大AI无疆，无处不诗啊。

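Back to preferring fluency over faithfulness (宁顺不信). Comparing the two, in all fairness, for most people on most occasions fluency seems to carry the greater weight. Three points need to be kept in mind, though: (1) Before serious use, human verification is needed: machine output only offers a quick read to get the gist; although the overall proportion of faithful output is actually not bad, any single spot may be wildly wrong. (2) Translators who do not learn to use machine translation and to work with machines to raise efficiency (being good at verification and post-editing) may well be eliminated before long. In fact the job market for translation is already shrinking sharply; the human translation service that Youdao itself offers has become remarkably fast and cheap, which suggests that the few human translators who survive must have adopted a human-machine collaborative way of working. (3) AI is still developing rapidly, so let us wait and see whether future systems can strike a better balance among faithfulness, expressiveness, and elegance. One conceivable possibility is that future systems will at least let users choose the weighting between "faithfulness" and "fluency": depending on preference, the system could produce different translations, either an option that favors faithfulness but is somewhat stiff (the "hard translation", 硬译, that 鲁迅 once practiced) or an option that favors fluency but may be unfaithful in places.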
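The user-selectable weighting imagined in point (3) can be sketched as a simple reranker over candidate translations. This is a toy Python illustration only: the two candidates come from the "20大战" example discussed earlier, but the faithfulness and fluency scores are made-up numbers and the function name is hypothetical; no real MT system is claimed to expose such a control.

```python
def rerank(candidates, faithfulness_weight):
    """Pick the candidate with the best weighted sum of faithfulness and fluency.

    faithfulness_weight is in [0, 1]; the fluency weight is its complement.
    """
    if not 0.0 <= faithfulness_weight <= 1.0:
        raise ValueError("faithfulness_weight must be between 0 and 1")

    def score(candidate):
        return (faithfulness_weight * candidate["faithful"]
                + (1.0 - faithfulness_weight) * candidate["fluent"])

    return max(candidates, key=score)["text"]


# Hypothetical scores for two renderings of the mis-transcribed "20大战以后".
CANDIDATES = [
    {"text": "after 20th war", "faithful": 0.9, "fluent": 0.2},
    {"text": "beginning of 20th century", "faithful": 0.1, "fluent": 0.95},
]

print(rerank(CANDIDATES, faithfulness_weight=0.8))  # stiff "hard translation" setting
print(rerank(CANDIDATES, faithfulness_weight=0.2))  # fluency-first setting
```

The hard part in practice is obtaining a trustworthy faithfulness score at all; the sketch simply assumes one is given.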

The Shallowness of Google Translate

It's pretty lengthy, pointing out the lack of understanding in deep learning. We all know that is true. What we did not know was how far a system could go without understanding or parsing, on an end-to-end deep neural network model. All the criticisms here are valid, but MT has still never been this impressive and useful in practice, unless you make the wrong choice of using it to translate literary works, or domain documents for which it has no human translation data to learn from.

【相关】

【校长对话录：向有道机器翻译同仁致敬】

人工智能，一个永远没有结果的科学_马少平_新浪博客

【谷歌NMT，见证奇迹的时刻】

The Shallowness of Google Translate

有道的机器翻译（http://fanyi.youdao.com/）

【语义计算：李白对话录系列】

神经机译：川普宣告，米国人民今天站起来了

川普宣告，人民当家作主，米国人民今天站起来了!

川普今天总统登基，发表就职演说，谷歌神经翻译如下，请听（作为一个老机译，给这篇机器翻译打分的话，我会给忠实度85分，顺畅度90分，可懂度95分，个人觉得已经超越人工现场翻译的平均水平。当然，演说一般属于翻译中容易的部分。演说写稿人为了效果，喜欢用短句、白话，喜欢重复）：

TRUMP：首席大法官罗伯茨，卡特总统，克林顿总统，布什总统，奥巴马总统，美国人和世界人民，谢谢。

我们，美国公民，现在加入了伟大的国家努力，重建我们的国家，恢复其对我们所有人民的承诺。
在一起，我们将决定美国和世界的路线许多，未来几年。我们将面临挑战，我们将面临艰难，但我们将完成这项工作。

每四年，我们将采取这些步骤，进行有秩序和和平的权力转移，我们感谢奥巴马总统和第一夫人米歇尔奥巴马在这一过渡期间的恩典援助。他们是壮观的。谢谢。

然而，今天的仪式具有非常特殊的意义，因为今天我们不仅仅是将权力从一个政府转移到另一个政府，或从一个政党转移到另一个政府，而是我们从华盛顿转移权力，并将其交还给你，人民。

长期以来，我们国家首都的一个小团体获得了政府的奖励，而人民承担了成本。华盛顿蓬勃发展，但人民没有分享其财富。政治家兴旺，但工作离开，工厂关闭。企业保护自己，但不是我们国家的公民。他们的胜利不是你的胜利。他们的胜利不是你的胜利。虽然他们在我们国家的首都庆祝，但没有什么可以庆祝在我们的土地上奋斗的家庭。

所有的变化从这里开始，现在，因为这一刻是你的时刻，它属于你。

它属于今天聚集在这里的每个人，每个人都在整个美国。这是你的一天。这是你的庆祝。而这个，美利坚合众国，是你的国家。

真正重要的不是哪个党控制我们的政府，而是我们的政府是否由人民控制。

2017年1月20日将被记住为人民成为这个国家的统治者的那一天。

我们国家被遗忘的男人和女人将不再被忘记。

每个人都在听你的。你来自成千上万的人成为历史运动的一部分，世界从未见过的那些喜欢。

在这个运动的中心是一个关键的信念，一个国家存在为其公民服务。美国人想要他们的孩子的伟大的学校，他们的家庭的安全的邻里，并为自己好的工作。这些是对义人和公义的公正和合理的要求。

但对于我们太多的公民，存在一个不同的现实：母亲和儿童陷入我们内部城市的贫困;生锈的工厂散落像墓碑横跨我们国家的景观;教育制度与现金齐齐，但使我们年轻美丽的学生失去了所有的知识;和犯罪，帮派和毒品偷走了太多的生命，抢夺了我们国家这么多未实现的潜力。

这美国大屠杀停在这里，现在停止。

我们是一个国家，他们的痛苦是我们的痛苦。他们的梦想是我们的梦想。他们的成功将是我们的成功。我们分享一颗心，一个家，一个光荣的命运。我今天所做的宣誓就是对所有美国人的忠诚宣誓。

几十年来，我们以牺牲美国工业为代价丰富了外国产业;补贴了其他国家的军队，同时允许我们的军队非常悲伤的消耗。我们捍卫了其他国家的边界，拒绝为自己辩护。

在海外花费了数万亿美元，美国的基础设施已经失修和腐烂。我们已经使其他国家富有，而我们国家的财富，实力和信心已经消失了地平线。

一个接一个地，工厂关闭了，离开了我们的岸边，甚至没有想到数百万和数百万留在美国工人。我们的中产阶级的财富已经从他们的家里被剥夺，然后再分配到世界各地。

但这是过去。现在，我们只看到未来。

我们今天聚集在这里，正在发布一项新法令，在每个城市，每个外国首都和每一个权力大厅上听到。从今天起，我们的土地将有一个新的愿景。从这一天开始，它将只有美国第一，美国第一。

每一项关于贸易，税收，移民，外交事务的决定都将使美国工人和美国家庭受益。我们必须保护我们的边界免受其他国家的蹂躏，使我们的产品，偷窃我们的公司和破坏我们的工作。

保护将导致巨大的繁荣和力量。我会为我的身体每一口气，为你而战，我永远不会让你失望。

美国将再次赢得胜利，赢得前所未有的胜利。

我们将带回我们的工作。

我们将带回我们的边界。

我们将会

Google Translated from:

TRUMP: Chief Justice Roberts, President Carter, President Clinton, President Bush, President Obama, fellow Americans and people of the world, thank you.

We, the citizens of America, are now joined in a great national effort to rebuild our country and restore its promise for all of our people.
Together, we will determine the course of America and the world for many, many years to come. We will face challenges, we will confront hardships, but we will get the job done.

Every four years, we gather on these steps to carry out the orderly and peaceful transfer of power, and we are grateful to President Obama and First Lady Michelle Obama for their gracious aid throughout this transition. They have been magnificent. Thank you.

Today's ceremony, however, has very special meaning because today, we are not merely transferring power from one administration to another or from one party to another, but we are transferring power from Washington, D.C. and giving it back to you, the people.

For too long, a small group in our nation's capital has reaped the rewards of government while the people have borne the cost. Washington flourished, but the people did not share in its wealth. Politicians prospered, but the jobs left and the factories closed. The establishment protected itself, but not the citizens of our country. Their victories have not been your victories. Their triumphs have not been your triumphs. And while they celebrated in our nation's capital, there was little to celebrate for struggling families all across our land.

That all changes starting right here and right now because this moment is your moment, it belongs to you.

It belongs to everyone gathered here today and everyone watching all across America. This is your day. This is your celebration. And this, the United States of America, is your country.

What truly matters is not which party controls our government, but whether our government is controlled by the people.

January 20th, 2017 will be remembered as the day the people became the rulers of this nation again.

The forgotten men and women of our country will be forgotten no longer.

Everyone is listening to you now. You came by the tens of millions to become part of a historic movement, the likes of which the world has never seen before.

At the center of this movement is a crucial conviction, that a nation exists to serve its citizens. Americans want great schools for their children, safe neighborhoods for their families, and good jobs for themselves. These are just and reasonable demands of righteous people and a righteous public.

But for too many of our citizens, a different reality exists: mothers and children trapped in poverty in our inner cities; rusted out factories scattered like tombstones across the landscape of our nation; an education system flush with cash, but which leaves our young and beautiful students deprived of all knowledge; and the crime and the gangs and the drugs that have stolen too many lives and robbed our country of so much unrealized potential.

This American carnage stops right here and stops right now.

We are one nation and their pain is our pain. Their dreams are our dreams. And their success will be our success. We share one heart, one home, and one glorious destiny. The oath of office I take today is an oath of allegiance to all Americans.

For many decades, we've enriched foreign industry at the expense of American industry; subsidized the armies of other countries, while allowing for the very sad depletion of our military. We've defended other nations' borders while refusing to defend our own.

And spent trillions and trillions of dollars overseas while America's infrastructure has fallen into disrepair and decay. We've made other countries rich, while the wealth, strength and confidence of our country has dissipated over the horizon.

One by one, the factories shuttered and left our shores, with not even a thought about the millions and millions of American workers that were left behind. The wealth of our middle class has been ripped from their homes and then redistributed all across the world.

But that is the past. And now, we are looking only to the future.

We assembled here today are issuing a new decree to be heard in every city, in every foreign capital, and in every hall of power. From this day forward, a new vision will govern our land. From this day forward, it's going to be only America first, America first.

Every decision on trade, on taxes, on immigration, on foreign affairs will be made to benefit American workers and American families. We must protect our borders from the ravages of other countries making our products, stealing our companies and destroying our jobs.

Protection will lead to great prosperity and strength. I will fight for you with every breath in my body, and I will never ever let you down.

America will start winning again, winning like never before.

We will bring back our jobs.

We will bring back our borders.

We will ......

【谷歌NMT，见证奇迹的时刻】

Question answering of the past and present

【立委NLP频道】

From IBM's Jeopardy robot, Apple's Siri, to the new Google Translate

Latest Headline News: Samsung acquires Viv, a next-gen AI assistant built by the creators of Apple's Siri.

Wei:
Some people are just smart, or shrewd, more than we can imagine. I am talking about Fathers of Siri, who have been so successful with their technology that they managed to sell the same type of technology twice, both at astronomical prices, and both to the giants in the mobile and IT industry. What is more amazing is, the companies they sold their tech-assets to are direct competitors. How did that happen? How "nice" this world is, to a really really smart technologist with sharp business in mind.

What is more stunning is the fact that, Siri and the like so far are regarded more as toys than must-carry tools, intended at least for now to satisfy more curiosity than to meet the rigid demand of the market. The most surprising is that the technology behind Siri is not unreachable rocket science by nature, similar technology and a similar level of performance are starting to surface from numerous teams or companies, big or small.

I am a tech guy myself, loving gadgets, always watching for new technology breakthrough. To my mind, something in the world is sheer amazing, taking us in awe, for example, the wonder of smartphones when the iPhone first came out. But some other things in the tech world do not make us admire or wonder that much, although they may have left a deep footprint in history. For example, the question answering machine made by IBM Watson Lab in winning Jeopardy. They made it into the computer history exhibition as a major AI milestone. More recently, the iPhone Siri, which Apple managed to put into hands of millions of people first time for seemingly live man-machine interaction. Beyond that accomplishment, there is no magic or miracle that surprises me. I have the feel of "seeing through" these tools, both the IBM answering robot type depending on big data and Apple's intelligent agent Siri depending on domain apps (plus a flavor of AI chatbot tricks).

Chek: @ Wei I bet the experts in rocket technology will not be impressed that much by SpaceX either,

Wei: Right, this is because we are in the same field, what appears magical to the outside world can hardly win an insider's heart, who might think that given a chance, they could do the same trick or better.

The Watson answering system can well be regarded as a milestone in engineering for massive, parallel big data processing, but it does not strike us as an AI breakthrough. What shines in terms of engineering accomplishment is that all this happened before the big data age, before the infrastructure for indexing, storing and retrieving big data in the cloud was widely adopted. In this regard, IBM was indeed ahead of the trend, with the ability to put a farm of servers to work so that the QA engine could be deployed on massive data. But from a true AI perspective, neither the Watson robot nor the Siri assistant can be compared with the more recent launch of the new Google Translate based on neural networks. So far I have tested this monster by having it help translate three Chinese blogs of mine (including this one in the making), and I have to say that I have been blown away by what I see. As a seasoned NLP practitioner who started MT training 30 years ago, I am still in disbelief before this wonder of a technology showcase.

Chen: wow, how so?

Wei: What can I say? It has exceeded my imagination limit for all my dreams of what MT can be and should be since I entered this field many years ago. While testing, I only needed to do limited post-editing to make the following Chinese blogs of mine presentable and readable in English, a language with no kinship whatsoever with the source language Chinese.

Introduction to NLP Architecture

Hong: Wei seemed frightened by his own shadow.

Chen: The effect is that impressive?

Wei: Yes. Before the deep neural network age, I also tested and tried to use SMT for the same job, having tried both Google Translate and Baidu MT; there is just no comparison with this new launch based on a technology breakthrough. If you hit their sweet spot, that is, if your data to translate are close to the data they have trained the system on, Google Translate can save you at least 80% of the manual work. 80% of the time, it comes out so smooth that there is hardly a need for post-editing. There are errors or crazy things going on in less than 20% of the translated text, but who cares? I can focus on that part and get my work done far more efficiently than before. The most important thing is that SMT before deep learning rendered a text hardly readable, no matter how good a temper I have. It was unbearable to work with. Now, with this breakthrough in training the model on sentences instead of words and phrases, the translation magically sounds fairly fluent.

It is said that they are good at the news genre and at IT and technology articles, for which they have abundant training data. The legal domain is said to be good too. Other domains, such as spoken language, online chats, and literary works, remain a challenge, as there does not seem to be sufficient data available yet.

Chen: Yes, it all depends on how large and good the bilingual corpora are.

Wei: That is true. SMT stands on the shoulder of thousands of professional translators and their works. An ordinary individual's head simply has no way in digesting this much linguistic and translation knowledge to compete with a machine in efficiency and consistency, eventually in quality as well.

Chen: Google's major contribution is to explore and exploit the existence of huge human knowledge, including search, anchor text is the core.

Ma: I very much admire IBM's Watson, and I would not dare to think it possible to make such an answering robot back in 2007.

Wei: But the underlying algorithm does not strike as a breakthrough. They were lucky in targeting the mass media Jeopardy TV show to hit the world. The Jeopardy quiz is, in essence, to push human brain's memory to its extreme, it is largely a memorization test, not a true intelligence test by nature. For memorization, a human has no way in competing with a machine, not even close. The vast majority of quiz questions are so-called factoid questions in the QA area, asking about things like who did what when and where, a very tractable task. Factoid QA depends mainly on Named Entity technology which was mature long ago, coupled with the tractable task of question parsing for identifying its asking point, and the backend support from IR, a well studied and practised area for over 2 decades now. Another benefit in this task is that most knowledge questions asked in the test involve standard answers with huge redundancy in the text archive expressed in various ways of expressions, some of which are bound to correspond to the way question is asked closely. All these factors contribute to IBM's huge success in its almost mesmerizing performance in the historical event. The bottom line is, shortly after the 1999 open domain QA was officially born with the first TREC QA track, the technology from the core engine has been researched well and verified for factoid questions given a large corpus as a knowledge source. The rest is just how to operate such a project in a big engineering platform and how to fine-tune it to adapt to the Jeopardy-style scenario for best effects in the competition. Really no magic whatsoever.
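To make the factoid pipeline described above more tangible, here is a minimal Python sketch: the question word determines the asking point (an expected entity type), and the answer is the entity of that type in the passage that shares the most words with the question. The passages and their entity tags are supplied by hand here as a stand-in for real Named Entity tagging and IR; all names and the tiny type table are illustrative assumptions, not Watson's actual design.

```python
import re

# Hypothetical mapping from question word to expected answer entity type.
ASKING_POINT = {"who": "PERSON", "when": "DATE", "where": "LOCATION"}


def asking_point(question):
    """Identify the expected answer type from the leading question word."""
    words = question.strip().lower().split()
    return ASKING_POINT.get(words[0]) if words else None


def words_of(text):
    """Lowercased word set, used for a crude overlap-based retrieval score."""
    return set(re.findall(r"\w+", text.lower()))


def answer(question, tagged_passages):
    """Return the entity of the asked-for type from the best-overlapping passage."""
    target = asking_point(question)
    if target is None:
        return None
    q_words = words_of(question)
    best, best_overlap = None, 0
    for text, entities in tagged_passages:
        overlap = len(q_words & words_of(text))
        for entity, entity_type in entities:
            if entity_type == target and overlap > best_overlap:
                best, best_overlap = entity, overlap
    return best


# Hand-tagged passages standing in for an NE-tagged text archive.
PASSAGES = [
    ("Larry Kudlow joined CNBC in 2001.",
     [("Larry Kudlow", "PERSON"), ("CNBC", "ORG"), ("2001", "DATE")]),
    ("The first TREC QA track took place in 1999.",
     [("1999", "DATE")]),
]

print(answer("When did Larry Kudlow join CNBC?", PASSAGES))
```

With a large archive, the redundancy mentioned above means that some passage is likely to phrase the fact close to the way the question does, which is why even crude overlap scoring can go a long way on factoid questions.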

Google Translated from【泥沙龙笔记：从三星购买Siri之父的二次创业技术谈起】, with post-editing by the author himself.

Introduction to NLP Architecture

Newest GNMT: time to witness the miracle of Google Translate

Dr Li’s NLP Blog in English

Newest GNMT: time to witness the miracle of Google Translate


Wei:
Recently, the microblogging (wechat) community has been full of heated discussion and testing of the newest announcement of the Google Translate breakthrough in its NMT (neural network-based machine translation) offering, which is claimed to have achieved significant progress in quality and readability. It sounds like a major breakthrough worthy of attention and celebration.

The report says:

Ten years ago, we released Google Translate, the core algorithm behind this service is PBMT: Phrase-Based Machine Translation. Since then, the rapid development of machine intelligence has given us a great boost in speech recognition and image recognition, but improving machine translation is still a difficult task.

Today, we announced the release of the Google Neural Machine Translation (GNMT) system, which utilizes state-of-the-art training techniques to maximize the quality of machine translation so far. For a full review of our findings, please see our paper "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation."

A few years ago, we began using RNN (Recurrent Neural Networks) to directly learn the mapping of an input sequence (such as a sentence in a language) to an output sequence (the same sentence in another language). Phrase-based machine translation (PBMT) breaks the input sentences into words and phrases, and then largely interprets them independently, while NMT interprets the entire sentence of the input as the basic unit of translation.

The advantage of this approach is that compared to the previous phrase-based translation system, this method requires less engineering design. When it was first proposed, the accuracy of the NMT on a medium-sized public benchmark data set was comparable to that of a phrase-based translation system. Since then, researchers have proposed a number of techniques to improve NMT, including modeling external alignment models to handle rare words, using attention to align input and output words, and word decomposition into smaller units to cope with rare words. Despite these advances, the speed and accuracy of NMT has not been able to meet the requirements of a production system such as Google Translate. Our new paper describes how to overcome many of the challenges of making NMT work on very large data sets and how to build a system that is both fast and accurate enough to deliver a better translation experience for Google users and services.

............

Using side-by-side comparisons of human assessments as a standard, the GNMT system translates significantly better than the previous phrase-based production system. With the help of bilingual human assessors, we found in sample sentences from Wikipedia and the news website that GNMT reduced translational errors by 55% to 85% or more in the translation of multiple major pairs of languages.

In addition to publishing this research paper today, we have also announced that GNMT will be put into production in a very difficult language pair (Chinese-English) translation.

Now, Chinese-to-English translation in the mobile and web versions of Google Translate is handled 100% by the GNMT system, about 18 million translations per day. GNMT's production deployment uses our open machine learning toolkit TensorFlow and our Tensor Processing Units (TPUs), which provide sufficient computational power to deploy these powerful GNMT models while meeting Google Translate's strict latency requirements.

Chinese-to-English translation is one of the more than 10,000 language pairs supported by Google Translate. In the coming months, we will continue to extend our GNMT to far more language pairs.

GNMT translated from Google Translate achieves a major breakthrough!

As an old machine translation researcher, I cannot resist this temptation. I cannot wait to try this latest version of Google Translate for Chinese-English.
Previously I tried Google's Chinese-to-English online translation multiple times; the output was not very readable overall and certainly not as good as that of its competitor Baidu. With this newest breakthrough using deep learning with neural networks, it is believed to be getting close to human translation quality. I have a few hundred Chinese blog posts on NLP waiting to be translated as a trial. I was looking forward to this first attempt at using Google Translate for my science popularization blog post titled Introduction to NLP Architecture. My adventure is about to start. Now is the time to witness the miracle, if the miracle does exist.

Dong:
I hope you will not be disappointed. I have jokingly said before that rule-based machine translation is a fool and statistical machine translation is a madman, and now I continue the ridicule: neural machine translation is a "liar" (I am not referring to the developers behind NMT). Language is not a cat face or the like; surface fluency alone does not work, and the content must be faithful to the original!

Wei:
Let us experience the magic, please listen to this translated piece of my blog:

This is my Introduction to NLP Architecture fully automatically translated by Google Translate yesterday (10/2/2016) and fully automatically read out without any human interference. I have to say, this is way beyond my initial expectation and belief.

Listen to it for yourself, the automatic speech generation of this science blog of mine is amazingly clear and understandable. If you are an NLP student, you can take it as a lecture note from a seasoned NLP practitioner (definitely clearer than if I were giving this lecture myself, with my strong accent). The original blog was in Chinese and I used the newest Google Translate claimed to be based on deep learning using sentence-based translation as well as character-based techniques.

Prof. Dong, you know my background and my originally doubtful mindset. However, in the face of such progress, far beyond the limits of what we could imagine for automatic translation in terms of both quality and robustness when I started my NLP career in MT training 30 years ago, I have to say that it is a dream come true in every sense.

Dong:
In their terminology, it is "less adequate, but more fluent." Machine translation has gone through three paradigm shifts. When people find that it can only be a good information processing tool and cannot really replace human translation, they will choose the less costly option.

Wei:
In any case, this small test is revealing to me. I am still feeling overwhelmed at seeing such a miracle live. Of course, what I have just tested is formal style on a computer and NLP topic, so it certainly hit the system's sweet spot, with adequate training corpus coverage. But compared with the pre-NN time, when I used both Google SMT and Baidu SMT to help with my translation, this breakthrough is amazing. As a senior old-school practitioner of rule-based systems, I would like to pay deep tribute to our "nerve-network" colleagues. They are a group of crazy geniuses. I would like to quote Jobs' famous words here:

“Here's to the crazy ones. The misfits. The rebels. The troublemakers. The round pegs in the square holes. The ones who see things differently. They're not fond of rules. And they have no respect for the status quo. You can quote them, disagree with them, glorify or vilify them. About the only thing you can't do is ignore them. Because they change things. They push the human race forward. And while some may see them as the crazy ones, we see genius. Because the people who are crazy enough to think they can change the world, are the ones who do.”

@Mao, this counts as my most recent feedback to the Google scientists and their work. Last time, a couple of months ago, when they released their parser, proudly claimed to be "the most accurate parser in the world", I wrote a blog post to ridicule them after performing a serious, apples-to-apples comparison with our own parser. This time, when they have used the same underlying technology to announce this new MT breakthrough with similar pride, I am happily expressing my deep admiration for their wonderful work. This contrast in my attitudes looks a bit weird, but it is actually all based on the facts of life. In the case of parsing, this school suffers from a lack of naturally labeled data to use in perfecting quality, especially when the parser has to be ported to new domains or genres beyond the news corpora. After all, what exists in the sea of language is corpora of raw text, linear strings of words, while the corresponding parse trees are only occasional, artificial objects made by linguists, limited in scope by nature (e.g. the Penn Treebank, or other news-genre parse trees by the Google annotation team). But MT is different: it is a unique NLP area with almost endless, high-quality, naturally occurring "labeled" data in the form of human translation, which has never stopped since ages ago.

Mao: @wei That is to say, you now embrace or endorse a neuron-based MT, a change from your previous views?

Wei:
Yes I do embrace and endorse the practice. But I have not really changed my general view wrt the pros and cons between the two schools in AI and NLP. They are complementary and, in the long run, some way of combining the two will promise a world better than either one alone.

Mao: What is your real point?

Wei:
Despite the biases we are all more or less born with by human nature, conditioned by what we have done and where we come from in terms of technical background, we all need to observe and respect the basic facts. Just listen to the audio of their GNMT translation by clicking the link above: the fluency and even the faithfulness to my original text have in fact outperformed an ordinary human translator, in my best judgment. If I gave this lecture in a classroom and asked an average interpreter without sufficient knowledge of my domain to translate on the spot for me, I bet he would have a hard time performing better than the Google machine above (of course, human translation gurus are an exception). This miracle-like fact has to be observed and acknowledged. On the other hand, as I said before, no matter how deep the learning reaches, I still do not see how they can catch up with the quality of my deep parsing in the next few years, when they have no way of magically gaining access to the huge amount of labeled tree data they depend on, especially across the variety of domains and genres. They simply cannot "make bricks without straw" (as an old Chinese saying goes, even the most capable housewife can hardly cook a good meal without rice). In the natural world, there are no syntactic trees and structures for them to learn from; there are only linear sentences. The deep learning breakthrough seen so far is still mainly in supervised learning, which has an almost insatiable appetite for massive labeled data, and this forms its knowledge bottleneck.

Mao: I'm confused. Which one do you believe is stronger? Who is the world's No. 0?

Wei:
Parsing-wise, I am happy to stay as No. 0 if Google insists on being No. 1 in the world. As for MT, it is hard to say, from what I see, how their breakthrough compares with some highly sophisticated rule-based MT systems out there. But what I can say is that, at a high level, the trend of mainstream statistical MT winning the space over old-school rule-based MT, in industry as well as in academia, is more evident today than before. This is not to say that rule-based MT systems are no longer viable or are coming to an end. There are things at which SMT cannot beat rule-based MT. For example, certain types of seemingly stupid mistakes made by GNMT (quite a few laughable examples of totally wrong or opposite translations have been illustrated in this salon in the last few days) are almost never seen in rule-based MT systems.

Dong:
Here is my trial of GNMT from Chinese to English:

学习上，初二是一个分水岭，学科数量明显增多，学习方法也有所改变，一些学生能及时调整适应变化，进步很快，由成绩中等上升为优秀。但也有一部分学生存在畏难情绪，将心思用在学习之外，成绩迅速下降，对学习失去兴趣，自暴自弃，从此一蹶不振，这样的同学到了初三往往很难有所突破，中考的失利难以避免。

Learning, the second of a watershed, the number of subjects significantly significantly, learning methods have also changed, some students can adjust to adapt to changes in progress, progress quickly, from the middle to rise to outstanding. But there are some students there is Fear of hard feelings, the mind used in the study, the rapid decline in performance, loss of interest in learning, self-abandonment, since the devastated, so the students often difficult to break through the third day,

Mao: This translation cannot be said to be good at all.

Wei:
Right, that is why it calls for an objective comparison to answer your previous question. Currently, as I see it, the data for social media and casual text are certainly not enough, hence the translation of online messages is still not their forte. As for the textual sample Prof. Dong showed us above, Mao said the Google translation is not of good quality, as expected. But even so, I still see impressive progress made there. Before the deep learning time, SMT results from Chinese to English were hardly readable, and now the output can generally be read aloud and roughly understood. There is a lot of progress worth noting here.

Ma:
In fields with big data, DL methods have advanced by leaps and bounds in recent years. I know a number of experts who used to be biased against DL and have changed their views on seeing the results. However, DL is still basically not effective in the IR field so far, though there are signs of it slowly penetrating IR.

Dong:
The key to NMT is "looking nice". So for people who do not understand the original source text, it sounds like a smooth translation. But isn't it a "liar" if a translation loses its faithfulness to the original? This is the Achilles' heel of NMT.

Ma: @Dong, I think all statistical methods have this weak point.

Wei:
Indeed, each has its pros and cons. Today I have listened to the Google translation of my blog three times and am still amazed at what they have achieved. There are always some mistakes I can pick out here and there. But to err is human, let alone a machine, right? Not to mention that the community will not stop advancing and trying to correct mistakes. From the intelligibility and fluency perspectives, I have been served super satisfactorily today. And this occurs between two languages with no historical kinship whatsoever.

Dong:
Some leading managers said to me years ago, "In fact, even if machine translation is only 50 percent correct, it does not matter. The problem is that it cannot tell me which half it cannot translate well. If it could, I could always save half the labor and hire a human translator to translate only the other half." I replied that I was not able to make a system do that. I have been concerned about this issue ever since, right up to today, when there is a lot of noise about MT replacing human translation any time now. It is kind of like having McDonald's and then saying you do not need a fine restaurant for French delicacies. Not to mention that machine translation today still cannot be compared to McDonald's. Computers, with machine translation and the like, are in essence a toy given by God for us humans to play with. God will never permit us to be equipped with the ability to copy ourselves.

Why did GNMT first choose a language pair like Chinese-to-English to showcase, and not the other way round? This is very shrewd of them. Even if the translation is wrong or misses the point, it is usually at least fluent in this new model, unlike the traditional models, whose output looks and sounds broken, silly and erroneous. This is characteristic of NMT: it selects what is most similar in the translation corpus. As a vast number of English readers do not understand Chinese, it is easy to impress them with how great the new MT is, even for a difficult language pair.

Wei:
Correct. A closer look reveals that this "breakthrough" lies more in fluency of the target language than in faithfulness to the source language, achieving readability at the cost of accuracy. But this is just the beginning of a major shift. I can fully understand the GNMT people's joy and pride in the face of a breakthrough like this. In our careers, we do not always have that type of moment to celebrate.

Deep parsing is the crown of NLP. It remains to be seen how they can beat us in handling domains and genres lacking labeled data. I wish them good luck, and the day they prove they can make better parsers than mine will be the day of my retirement. To my mind, that day does not look anywhere near. I wish I were wrong, so that I could travel the world worry-free, knowing that my dream has been better realized by my colleagues.

Thanks to Google Translate at https://translate.google.com/ for helping to translate this Chinese blog into English, which was post-edited by myself.

"OVERVIEW OF NATURAL LANGUAGE PROCESSING"

"NLP White Paper: Overview of Our NLP Core Engine"

Introduction to NLP Architecture

It is untrue that Google SyntaxNet is the "world’s most accurate parser"

Announcing SyntaxNet: The World’s Most Accurate Parser Goes Open

Is Google SyntaxNet Really the World’s Most Accurate Parser?

Dr Li's NLP Blog in English

Notes for An HPSG-style Chinese Reversible Grammar

ABSTRACT

Key words: Chinese parsing, Chinese generation, reversible grammar, HPSG

This paper presents a reversible Chinese unification grammar named CPSG. The lexicalized and integrated design of CPSG embodies the general spirit of the modern linguistic theory Head-driven Phrase Structure Grammar (HPSG, Pollard & Sag 1987, 1994). Using ALE formalism in Prolog (Carpenter & Penn 1994), we have implemented a prototype of CPSG.

CPSG covers Chinese morphology, Chinese syntax and semantics in a novel integrated language model (Figure 1; for the interface between morphology and syntax, see Li 1997; for the interface between syntax and semantics, see Li 1996). The CPSG model is in sharp contrast to the conventional clear-cut successive design of grammar components (Figure 2, see survey in Feng 1996). We will show that our model is much better suited and more efficient for Chinese analysis (or generation).


Grammar reversibility is a highly desired feature for multi-lingual machine translation applications (Hutchins & Somers 1992, Huang 1986, 1987). To test its reversible features, we have applied the CPSG prototype to an experiment in bi-directional machine translation between English and Chinese. The machine translation engine developed in our Natural Language Lab is based on the shake-and-bake design, a novel approach to machine translation suited for unification grammars (Whitelock 1992, 1994, Beaven 1992, Brew 1992). The experimental results meet our design objective and verify the feasibility of the CPSG approach.

~~~~~~~~~~~~~~~~~~~~~

Notes for NWLC-97, UBC, Vancouver

Outline of An HPSG-style Chinese Reversible Grammar

Wei LI

Linguistics Department, Simon Fraser University

Key words: lexicalist approach, integrated language model, HPSG, reversible grammar, bi-directional machine translation, Chinese computational grammar, Chinese word identification, Chinese parsing, Chinese generation

1. background

1.1. design philosophy

Two major obstacles in writing Chinese computational grammar:

(1) lack of serious study of the Chinese lexical base

    a well-designed lexicon is crucial for a successful computational system

    theoretical linguists have made fruitful efforts (e.g. Li Linding), but their results lack formalization

    computational linguists need more patience in adapting and formalizing these results: it is a huge amount of work, but it has to be done if a non-toy system is targeted

(2) lack of effective interaction between morphology, syntax and semantics

    ambiguity in word identification makes it hard to interface morphology and syntax: a theoretical defect of the morphology preprocessor (segmenter)

    e.g. ABC: ABC or A | BC or AB | C or A | B | C?

    active/passive isomorphic phenomena make semantic constraints necessary in parsing

    e.g. NP Vt: subject NP or object NP?
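
To make the word-identification ambiguity concrete, here is a minimal Python sketch that enumerates every way a character string can be cut into contiguous pieces. The function name `segmentations` is hypothetical and this is not part of CPSG; it only shows the search space that a stand-alone segmenter has to prune without syntactic help.

```python
def segmentations(s):
    """Return every way to split s into contiguous non-empty pieces."""
    if not s:
        return [[]]
    results = []
    for i in range(1, len(s) + 1):
        head = s[:i]
        # Combine this first piece with every segmentation of the remainder.
        for rest in segmentations(s[i:]):
            results.append([head] + rest)
    return results


for seg in segmentations("ABC"):
    print(" | ".join(seg))
```

For the three-character string this lists the same four candidates as the example above, and in general a string of n characters has 2^(n-1) candidate segmentations, which is why choosing among them before parsing is hard to do reliably.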

Solution: the lexicalized and integrated design of Chinese grammar

1.2. major theoretical foundation:

HPSG: lexicalist theory encouraging integration of different components

a desired framework matching our design philosophy

CPSG: HPSG-style unification grammar

CPSG: reversible grammar suited for both parsing and generation

CPSG: formalized grammar, a description that does not rely on undefined notions

2. integrated language model

2.1. CPSG versus conventional Chinese grammar

parse tree embodies both morphological and syntactic structures in CPSG

3. lexicalized formal grammar

3.1. formalized grammar, as required by a computational grammar: formulation of CPSG

readily implementable (theories, principles, rules, etc.);

precise definition for the very basic notions (e.g. sign, morpheme, word, phrase, sentence, NP, VP, etc.), rules (PS rules and lexical rules), lexical items (lexical hierarchy), typology (hierarchy embodied in feature structures)

(4.) Definition: sign

A sign is the most fundamental concept of grammar. Formally, a sign is defined by the type [a_sign], which introduces a set of linguistic features for its description, as shown below.

a_sign
INDEX index
KANJI kanji
MORPH1 expected
MORPH2 expected
CATEGORY category
COMP0 expected
COMP1 expected
COMP2 expected
MOD expected
KNOWLEDGE knowledge
CONTENT content
INDEX0 index
INDEX1 index
INDEX2 index
DTR dtr

(5.) Definition: word

In CPSG, a word is a sign satisfying the following two conditions: (1) its obligatory morphological expectation has all been saturated; (2) it is not a mother of any syntactic structures, hence no syntactic daughters. Formally, a word is defined as shown below.

(6.) word

a_sign
MORPH1 ~obligatory
MORPH2 ~obligatory
DTR no_syn_dtr
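
Read as a predicate, definitions (5) and (6) amount to a simple check on a feature structure. The following Python sketch encodes a sign as a plain dict and tests the two conditions. It is an illustration only, not the ALE implementation: the dict encoding and the function name `is_word` are assumptions, and the value name "optional" is a placeholder that does not come from CPSG.

```python
def is_word(sign):
    """Check the two conditions of definition (5) on a dict-encoded sign."""
    # (1) No obligatory morphological expectation is left unsaturated.
    morphology_saturated = all(
        sign.get(feature) != "obligatory" for feature in ("MORPH1", "MORPH2")
    )
    # (2) The sign is not the mother of any syntactic structure.
    no_syntactic_daughters = sign.get("DTR") == "no_syn_dtr"
    return morphology_saturated and no_syntactic_daughters


# Hypothetical signs, reduced to the three features the definition mentions.
free_morpheme = {"MORPH1": "saturated", "MORPH2": "optional", "DTR": "no_syn_dtr"}
bound_affix = {"MORPH1": "obligatory", "MORPH2": "saturated", "DTR": "no_syn_dtr"}
phrase = {"MORPH1": "saturated", "MORPH2": "saturated", "DTR": "comp0"}

print(is_word(free_morpheme), is_word(bound_affix), is_word(phrase))
```

Only the first sign counts as a word: the second still has an obligatory morphological expectation, and the third has syntactic daughters.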

3.2. lexicalized grammar

CPSG consists of two parts:

(1) a minimized general grammar:

only 11 phrase structure rules
(covering complement structure, modifier structure,
conjunctive structure and morphological structure)

(2) a feature enriched lexicon:

lexical entries;
lexical hierarchy and a set of lexical rules
(capturing lexical generalizations).

(7.) comp0 PS rule

MOTHER a_sign
COMP0 saturated
COMP1 [1]
COMP2 [2]
DTR comp0
MYSISTER [6]
LEFTMOD [7] category
RIGHTMOD [8] category
LEFTCOMP [9] category
RIGHTCOMP [10] category

===>

EXPECTING a_sign
COMP0 a_expected
DIRECTION left
ROLE [3]
SIGN [4]
COMP1 [1] ~obligatory
COMP2 [2] ~obligatory
INDEX [5]
DTR dtr
LEFTMOD [7]
RIGHTMOD [8]
RIGHTCOMP [10]

EXPECTED a_sign [4]
CONTENT content
MYHEAD [5]
MYROLE [3] comp_role
INDEX [6]
CATEGORY [9]

PRINCIPLE #head_feature

(8.) lexical entry: chi

a_sign
  KANJI one_character
    H1 chi
  CATEGORY v
  INDEX0 [1] index
  INDEX1 [2] index
  COMP0 a_expected
    DIRECTION left
    SIGN a_sign
      CATEGORY n
      INDEX [1]
  COMP1 a_expected
    DIRECTION right
    SIGN a_sign
      CATEGORY n
      INDEX [2]
  KNOWLEDGE eat
    U_OBJECT food
      MALE none
      PERSON 3
      SINGULAR bin
    U_SUBJECT animate
      MALE bin
      PERSON tri
      SINGULAR bin
Implementation and Application of CPSG

CPSG prototype implemented in ALE and Prolog, having parsed a corpus of 200 various types of sentences

ALE and Prolog: suitable for unification grammar
ALE: mechanism for typed feature structures: type polymorphism
a powerful tool in language modeling

CPSG prototype adapted for application to bi-directional MT, having generated the same corpus of 200 sentences

References

Beaven, John L. (1992): "Shake and Bake Machine Translation", Proceedings of the 14th International Conference on Computational Linguistics, pp. 603-609, Nantes, France.

Brew, Chris (1992): "Letting the Cat out of the Bag: Generation for Shake-and-bake MT", Proceedings of the 14th International Conference on Computational Linguistics, pp. 610-616, Nantes, France.

Carpenter, B. & Penn, G. (1994): ALE, The Attribute Logic Engine, User's Guide

Feng, Z. (1996): "COLIPS Lecture Series - Chinese Natural Language Processing", Communications of COLIPS, Vol.6, No.1 1996, Singapore (http://www.iscs.nus.sg/~colips/commcolips/paper/p96.html)

Huang, X-M. (1986): "A Bidirectional Grammar for Parsing and Generating Chinese". Proceedings of the International Conference on Chinese Computing, Singapore, pp. 46-54

Huang, X-M. (1987): XTRA: The Design and Implementation of A Fully Automatic Machine Translation System, Doctoral dissertation, University of Essex.

Hutchins, W.J. & H.L. Somers (1992): An Introduction to Machine Translation. London, Academic Press.

Li, W. (1996): Interaction of Syntax and Semantics in Parsing Chinese Transitive Patterns. Proceedings of International Conference on Chinese Computing (ICCC'96), Singapore

Li, W. (1997): Chart Parsing Chinese Character Strings. Proceedings of The Ninth North American Conference on Chinese Linguistics (NACCL-9, to be available), Victoria, Canada

Pollard, C. & I. Sag (1987): Information based Syntax and Semantics Vol. 1: Fundamentals. Centre for the Study of Language and Information, Stanford University, CA

Pollard, C. & I. Sag (1994): Head-Driven Phrase Structure Grammar, Centre for the Study of Language and Information, Stanford University, CA

Whitelock, Pete (1992): "Shake and Bake Translation", Proceedings of the 14th International Conference on Computational Linguistics, pp. 784-790, Nantes, France.

Whitelock, Pete (1994). "Shake and Bake Translation", C.J. Rupp, M.A. Rosner, and R.L. Johnson (eds.), Constraints, Language and Computation, pp. 339-359, London, Academic Press.

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

PhD Thesis: Chapter VII Concluding Remarks

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

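Liwei's Master's Thesis: EChA Experimental Results (11)

An Experiment in Automatic Translation from Esperanto into Chinese and English
-- Overview of the EChA Machine Translation System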

[References]

Heinz Dieter MAAS "Automata Tradukado en kaj el Esperanto" ( "Lingvo-kibernetiko kaj aliaj internacilingvaj aktoj de la IX-a Internacia Kongreso de Kibernetiko", pp 75-81, 1982 Gunter Narr Verlag Tubingen )

<<机器翻译论文选辑>> ( 科学技术文献出版社, 1979 )
Kalocsay-Waringhien <<Plena Analiza Gramatiko de Esperanto>> ( 中国世界语出版社, 1984 )
刘涌泉等著 <<中国的机器翻译>> ( 知识出版社, 1984 )
刘涌泉, 高祖舜, 刘倬著 <<机器翻译浅说>> ( 科学普及出版社, 1964 )
刘涌泉, 李维 <<巴贝尔通天塔必将建成>> ( 中国第一届世界语大会论文, 1985.8 )
刘倬 <<三次机器翻译试验>> ( 第一次机器翻译学术会议论文, 1980.9 )
<<论机器翻译规则系统的编制方法>> ( 1982.3 上海 )
<<JFY型英汉机器翻译系统的研制和试验>> ( 语言学会第二届年会论文, 1983.4 )

乔毅 <<开展语言的计算机处理和世界语类型的机器翻译>> ( 中国第一届世界语大会论文, 1985.8 )
魏原枢, 徐文琪编 <<世界语语法>> ( 上海外语教育出版社, 1982 )
叶蜚声, 徐通锵著 <<语言学纲要>> ( 北京大学出版社, 1981 )
<<语言和计算机>> (1) (中国社会科学出版社, 1982 )
<<语言和计算机>> (2) (中国社会科学出版社, 1985 )
张道真编著 <<实用英语语法>> ( 商务印书馆, 1984 )

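[Acknowledgments]

From the very beginning, the development of an Esperanto-type machine translation system received enthusiastic support from my teacher Liu Yongquan, who gave careful guidance on everything from the overall design to the handling of specific problems. During program design and on-machine debugging, my teacher Liu Zhuo also gave guidance on many occasions, and the algorithms for some basic operations were provided by him as well. Now that the EChA system has achieved its initial results, the author expresses deep gratitude to them. Special thanks also go to Teacher Han of the computer room for her help in many ways. Without the convenience she provided, the EChA system could not possibly have been tested successfully in so short a time.

[Appendix 1] EChA Experimental Results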

(1) LA ORIGINALA TEKSTO / THE ORIGINAL TEXT / 世界语原文

(001) TIEL EVOLUIGHIS PLI KAJ PLI LA PLANADO PER MASHINOJ . (002) TIUJ MASHINOJ KOMENCE NUR ELKALKULIS LA DIKTITAJN MATEMATIKAJN PROBLEMOJN , KONFORME AL LA ENPROGRAMIGO . (003) LA ELEKTRONIKAN PROGRAMIGON PRETIGIS HOMOJ . (004) PLI POSTE , KIAM LA SCIODISKETOJ ESTIS ELTROVITAJ , LA PLENAN INDIKARON , ENDISKIGITAN , ONI METIS EN MASHINOJN KAJ ILI TIAMANIERE POVIS EN SI MEM AKUMULI SCIENCAN STOKON , PLI GRANDAN OL LA HOMA CERBO. (005) KAJ SE TEMIS EKZEMPLE PRI LA PLANADO DE ELEKTROMOTORO , ONI ENMETIS LA SHABLONDISKETON DE LA ELEKTROMOTOR-PLANADO , DONIS LA INDIKOJN DE LA DEZIRATA MOTORO ( KILOVATO , TENSIO , ROTACIO , TIPO , KTP ) , (006) POST KIO LA MASHINO MEM PROGRAMIGIS SIN KAJ FARIS LA KALKULOJN . POST KELKAJ MINUTOJ GHI JAM PRETE ELDONIS LA MEZUROJN : LA DIAMETRON DE LA ROTACIA PARTO , GHIAN LONGON, LA MEZUROJN DE LA KANELOJ , DRATOJ , LA VOLVONOMBRON , ENTUTE CHION BEZONATAN . (007) ECH PLI : BALDAU ESTIS ATINGITE , KE LA MASHINO FARIS LA TUTAN DESEGNON KAJ TRANSDONIS GHIN AL LA FABRIKO . (008) KOMPRENEBLE TIUJ < DESEGNOJ > NE ESTIS IDENTAJ KUN NIAJ PAPERDESEGNOJ . (009) ILI ESTIS DISKETOJ , KIUJ ENTENIS CHIUN DETALON . (010) TIAMANIERE LA PLANADON KAJ FABRIKADON DE LA MASHINOJ JAM PLENUMIS SAME MASHINOJ . (011) ILI PLANIS LA MENDITAN MASHINON , FABRIKIS , ECH KONTROLPROVIS GHIN KAJ LA FUSHAN FORJHETIS . (012) SED CHIO CHI ANKORAU OKAZIS SUB HOMA GVIDADO KAJ PLEJ GRAVE ESTIS , KE CHIO CHI BAZIGHIS SUR LA HOMA SCIO .

LA TEKSTO TRADUKITA EN LA ANGLAN / THE TEXT TRANSLATED INTO ENGLISH / 英语译文

(001) SO DEVELOPED MORE AND MORE THE PLANNING BY MACHINES . (002) THOSE MACHINES AT BEGINNING ONLY CALCULATED OUT THE DICTATED MATHEMATICAL PROBLEMS , ACCORDING TO THE PROGRAMMING . (003) MEN PREPARED THE ELECTRONIC PROGRAMMING . (004) MORE LATER , WHEN THE KNOWLEDGE-DISKETTES HAD BEEN FOUND OUT , PEOPLE PUT THE FULL INDICATION , ENDISKED , INTO MACHINES AND THEY THEREFORE COULD IN THEMSELVES ACCUMULATE SCIENTIFIC STOCK, MORE GREAT THAN THE MAN'SBRAIN . (005) AND IF IT CONCERNED FOR EXAMPLE ABOUT THE PLANNING OF ELECTRIC MOTOR, PEOPLE INPUT THE SAMPLE DISKETTE OF THE MOTOR PLANNING , GAVE THE INDICATIONS OF THE DESIRED MOTOR (KILOWATT , VOLTAGE , ROTATION , TYPE , ETC ) , AFTER WHICH THE MACHINE ITSELF PROGRAMMED ITSELF AND DID THE CALCULATIONS . (006) AFTER SEVERAL MINUTES IT ALREADY READILY GAVE OUT THE MEASUREMENTS : THE DIAMETER OF THE ROTARY PART ,ITS LENGTH , THE MEASUREMENTS OF THE GROOVES , WIRES , THE WINDING NUMBER , IN TOTAL ALL REQUIRED . (007) EVEN MORE : SOON IT HAD BEEN ACHIEVED , THAT THE MACHINE DID THE TOTAL DESIGN AND OVERHANDED IT TO THE FACTORY . (008) OF COURSE THOSE < DESIGNS > WERE NOT IDENTICAL WITH OUR PAPERDESIGNS . (009) THEY WERE DISKETTES , WHICH CARRIED ALL DETAIL . (010) THEREFORE MACHINES ALREADY FULFILED THE PLANNING AND MANUFACTURING OF THE MACHINES SAMELY . (011) THEY PLANNED THE ORDERED MACHINE , MANUFACTURED , EVEN EXAMINED IT AND THREW AWAY THE USELESS . (012) BUT ALL THIS STILL HAPPENED UNDER MAN'S GUIDING AND IT WAS MOST IMPORTANT , THAT ALL THIS WAS BASED ON THE MAN'S KNOWLEDGE .

LA TEKSTO TRADUKITA EN LA CHINAN / THE TEXT TRANSLATED INTO CHINESE / 汉语译文

(001) 这样用机器设计越来越发展了. (002) 那些机器开始时仅仅按照输入程序计算出所命令的数学问题. (003) 人准备了电子程序设计. (004) 更以后,当微型知识磁盘被发明了时,人们把所写入磁盘的全套指令集合放到机器里面,他(它)们这样能在自己本身里面积累比人的头脑更大的科学贮蓄. (005) 如果涉及例如关于电动机的设计, 人们输入了电动机设计的微型样品磁盘, 给了所希望的电动机的指标(千瓦,电压,运转,型号,等等),在此以后机器本身把自己程序化了,做了计算. (006) 在几分钟以后它已经就能给出尺寸:运转部分的直径,它的长度,槽纹,导线的尺寸,圈数,总之所需要的一切. (007) 甚至更:很快达到了,机器做了整个图样,把它转交到工厂. (008) 当然那些<图样>与我们的图纸不是一样的. (009) 他(它)们是储有所有细节的微型磁盘. (010) 这样机器已经同样地完成了机器的设计和制造. (011) 他(它)们设计了所定购的机器,制造了,甚至检验了它,把废的抛弃了. (012) 但是这一切仍然在人的指导下进行,最重要的是,这一切以人的知识作为基础.

(2) DIVERSAJ FRAZOJ / VARIOUS SENTENCES / 各类文句

(016) KIAM MI ESTIS LUDANTA VIOLONON , MIA ONKLO VIZITIS NIAN HEJMON .
WHEN I WAS PLAYING VIOLIN , MY UNCLE VISITED OUR HOME .
当我(当时)正在拉小提琴时,我的叔叔访问了我的家.

(020) MI ESTOS FININTA LA EKSPERIMENTON PRI MASHINA TRADUKADO POST KELKAJ MONATOJ .
I WILL HAVE FINISHED THE EXPERIMENT ABOUT MACHINE'S TRANSLATING IN SEVERAL MONTHS.
我在几月以后将已经完成关于机器的翻译的实验.

(028) BABELO NE ESTIS ELKONSTRUITA.
BABEL HAD NOT BEEN BUILT UP .
巴贝尔塔没有被建成.

(029) NEPRE ESTOS ELKONSTRUITA LA NOVA BABELO .
ABSOLUTELY WILL HAVE BEEN BUILT UP THE NEW BABEL .
新巴贝尔塔必然地将被建成.

(040) KIAL VI LERNAS ESPERANTON ?
WHY DO YOU LEARN ESPERANTO ?
为什么你学习世界语?

(044) NE PROKRASTU LA HODIAUAN LABORON GHIS MORGAU .
DON'T PUT OFF THE TODAY'S WORK TILL TOMORROW .
别把今天的工作推迟到明天.

(045) KIEL BONE PENTRAS LA KNABO !
HOW WELL THE BOY PAINTS !
男孩多么好地画画啊!

(048) KIU ESTAS LA AUTORO DE LA LIBRO , KIUN VI JHUS LEGIS ?
WHO IS THE AUTHOR OF THE BOOK , WHICH YOU JUST READ ?
你刚刚读了的书的作者是谁?

(050) SE MI PARTOPRENUS EN VIA AMUZA AKTIVADO , MI ESTUS TRE GHOJA .
IF I WOULD TAKE PART IN YOUR RECREATIONAL ACTIVITY , I WOULD BE VERY GLAD .
如果我参加你(们)的文娱活动,我会是很高兴的.

(056) CHU VI MEMORAS LA TAGOJN , KIAM NI KUNE STUDIS EN LA UNIVERSITATO ?
DO YOU REMEMBER THE DAYS , WHEN WE TOGETHER STUDIED IN THE UNIVERSITY ?
你记得我们在一起在大学里面学习的日子吗?

(058) UNUIGHU PROLETOJ DE CHIUJ LANDOJ !
LET PROLETARIANS OF ALL COUNTRIES UNITE !
让所有国家的无产者联合吧!

(061) KIEL SAGHA VI ESTAS !
HOW WISE YOU ARE !
你是多么聪明啊!

(062) ESPERANTO ESTAS INTERNACIA HELPA LINGVO .
ESPERANTO IS INTERNATIONAL HELP LANGUAGE .
世界语是国际辅助语言.

(067) LIA PROPONO ESTAS , KE NI CHIUJ LIBERE ELMETU NIAJN OPINIOJN .
HIS PROPOSAL IS , THAT WE ALL FREELY OUTPUT OUR OPINIONS .
他的建议是,让我们所有人自由地提出我们的意见.

(068) MI NE SCIAS , KIAM KOMENCIGHOS NIAJ FERIOJ .
I DON'T KNOW , WHEN WILL BEGIN OUR HOLIDAYS .
我不知道,我们的假日什么时候将开始.

(069) LA LIBRO , KIU KUSHAS SUR LA TABLO , ESTAS VERDA .
THE BOOK , WHICH LIES ON THE TABLE , IS GREEN .
在桌子上躺的书是绿的.

(071) LA INFANO PLORAS , CHAR IU LIN BATIS .
THE CHILD CRIES , BECAUSE SOMEBODY BEAT HIM .
小孩哭,因为某人打了他.

(078) LERNI ESPERANTON NE ESTAS MALFACILE .
TO LEARN ESPERANTO IS NOT DIFFICULT .
学习世界语不是困难的.

(084) MI NE SCIAS , CHU VI POVAS PLENUMI TIUN CHI TASKON .
I DON'T KNOW , WHETHER YOU CAN FULFIL THIS TASK .
我不知道,是否你能完成这个任务.

(086) MULTAJ DIVERSLANDAJ ESPERANTISTOJ CHEESTOS LA UNIVERSALAN KONGRESON DE ESPERANTO OKAZONTAN PEKINE .
A LOT OF VARIOUS COUNTRY'S ESPERANTISTS WILL ATTEND THE UNIVERSAL CONGRESS OF ESPERANTO TO BE HELD IN BEIJING .
许多不同国家的世界语者将参加在北京将召开的世界语的国际大会.

(089) LIA PROPONO ELEKTI NOVAN PREZIDANTON NE ESTIS AKCEPTITA .
HIS PROPOSAL TO ELECT NEW PRESIDENT HAD NOT BEEN ACCEPTED .
他的选举新总统的建议没有被接受.

(090) SHI ESTAS LA PLEJ BELA EL LA KNABINOJ .
SHE IS THE MOST BEAUTIFUL OF THE GIRLS .
她在女孩里面是最漂亮的.

(092) FALINTE , LI NE POVIS RELEVIGHI .
HAVING FALLEN , HE COULD NOT GET UP .
摔倒了,他不能重新起来.

(093) FORIRONTE , LI PREMIS MIAN MANON .
TO GO AWAY , HE SHOOK MY HAND .
将要离去,他握了我的手.

(098) MI TRE AMAS ESPERANTON , MI PLI AMAS ESPERANTISTOJN , MI PLEJ AMAS LA IDEALON DE ESPERANTO .
I VERY MUCH LOVE ESPERANTO , I MORE LOVE ESPERANTISTS , I MOST LOVE THE IDEAL OF ESPERANTO .
我很爱世界语,我更爱世界语者,我最爱世界语的理想.

(116) NI LUDU , CHU BONE ?
LET'S PLAY , ALL RIGHT ?
让我们玩吧,好吗?

(119) KIA MIRAKLO TIO ESTAS , KE NIAJ ANTIKVULOJ KONSTRUIS LA GRANDAN MURON NUR PER SIAJ DU MANOJ !
WHAT MIRACLE IT IS , THAT OUR ANCESTORS BUILT THE GREAT WALL ONLY BY THEIR TWO HANDS !
我们的祖先仅仅用自己的两手建造了长城,这是怎样的奇迹啊!

(121) FORPASIS UNU TAGO , FORPASIS ANKAU LA DUA .
PASSED AWAY ONE DAY , PASSED AWAY ALSO THE SECOND .
一天过去了,第二也过去了.

(122) CHU ESTAS EBLE , KE VI NENION SCIAS ?
IS IT POSSIBLE , THAT YOU KNOW NOTHING ?
你不知道任何事,这是可能的吗?

(131) LA HOMON , PRI KIU VI PAROLAS , MI NENIAM VIDIS .
I NEVER SAW THE MAN , ABOUT WHOM YOU SPEAK .
我从未看见过你提到的人.

(132) NI , ESPERANTISTOJ , DEVAS LABORI PLI ENERGIE OL IAM .
WE , ESPERANTISTS , MUST WORK MORE HARD THAN EVER .
我们,世界语者,应该比任何时候更努力工作.

(133) SOMERE ESTAS TRE VARME .
IN SUMMER IT IS VERY HOT .
夏天是很热的.

(134) DOKTORO ZAMENHOF NASKIGHIS LA 15-AN DE DECEMBRO EN 1859 .
DOCTOR ZAMENHOF WAS BORN ON THE 15TH OF DECEMBER IN 1859 .
柴门霍夫博士1859年十二月的15号出生.

(135) SE VI SCIUS , KIU LI ESTAS , VI LIN PLI ESTIMUS .
IF YOU WOULD KNOW , WHO HE IS , YOU MORE WOULD ESTEEM HIM .
如果你知道,他是谁,你更会尊敬他.

(136) CENTOJ DA MALFERMAJ AUTOJ NIN PORTIS AL LA CENTRA LENIN-STADIONO, MALRAPIDE MOVIGHANTE TRA LA HOMA SVARMO .
HUNDREDS OF OPEN CARS CARRIED US TO THE CENTRAL LENIN STADIUM , SLOWLY MOVING THROUGH THE MAN'S SWARM .
成百敞篷汽车把我们带到中央列宁运动场,缓慢地通过人群运动.

(137) MI VIDIS , KE LI FALIS KAJ LIA VESTO MALPURIGHIS .
I SAW , THAT HE FELL AND HIS CLOTHES BECAME DIRTY .
我看见了,他摔倒了,他的衣服弄脏了.

(139) MI SCIIS , KE LI NE FAROS , KION LI PROMESIS .
I KNEW , THAT HE WOULD NOT DO WHAT HE PROMISED .
我知道,他将不做他允诺的.

(140) ESTAS PAULO , KIU ARANGHIS LA AFERON .
IT IS PAULO THAT ARRANGED THE AFFAIR .
是PAULO安排了事情.

(142) KUREGIS LA KNABO PER SIA TUTA FORTO , SED LI NE POVIS ATINGI LA PAPILION .
RAN THE BOY BY HIS TOTAL STRENGTH , BUT HE COULD NOT ACHIEVE THE BUTTERFLY .
男孩用自己的整个力量狂奔,但是他不能达到蝴蝶.

(144) LI DONIS AL MI MULTAJN INSTRUAJN LIBROJN .
HE GAVE ME A LOT OF TEACHING BOOKS .
他给了我许多教科书.

(145) CHU VI PAROLAS CHINE AU JAPANE ?
DO YOU SPEAK IN CHINESE OR IN JAPANESE ?
你用中文还是用日文说话?

(151) NUR TIU NE ERARAS , KIU NENIAM ION FARAS .
ONLY THAT PERSON IS NOT WRONG , WHO NEVER DOES SOMETHING .
仅仅从不做某事的那个人不犯错误.

(155) ESPERANTO ESTAS CHIES PROPRAJHO .
ESPERANTO IS EVERYBODY'S PROPERTY .
世界语是所有人的财产.

(156) MI MEMORAS CHIUN , KIUN MI VIDIS .
I REMEMBER ALL , WHOM I SAW .
我记得我看见了的所有人.

(157) ESTAS NENIU EN LA CHAMBRO .
THERE IS NOBODY IN THE ROOM .
在房间里面没有任何人.

(3) DU POEMOJ / TWO POEMS / 两首诗歌

(099) LA ESPERO : ESPERANTISTA HIMNO ( POEMO FAR ZAMENHOF ) .

(100) EN LA MONDON VENIS NOVA SENTO ,
TRA LA MONDO IRAS FORTA VOKO ;
(101) PER FLUGILOJ DE FACILA VENTO ,
NUN DE LOKO FLUGU GHI AL LOKO .

(102) NE AL GLAVO SANGONSOIFANTA ,
GHI LA HOMAN TIRAS FAMILION ;
(103) AL LA MOND' ETERNE MILITANTA ,
GHI PROMESAS SANKTAN HARMONION .

(099) THE HOPE : ESPERANTIST'S HYMN ( POEM BY ZAMENHOF ) .

(100) INTO THE WORLD CAME NEW FEELING ,
OVER THE WORLD GOES STRONG VOICE ;
(101) BY WINGS OF EASY WIND ,
NOW FROM PLACE LET IT FLY TO PLACE .

(102) NOT TO SWORD BLOODTHIRSTY ,
IT PULLS THE MAN FAMILY ;
(103) TO THE WORLD EVER FIGHTING ,
IT PROMISES SACRED HARMONY .

(099) 希望: 世界语者的颂歌 (柴门霍夫所作的诗歌).

(100) 新感觉来到了世界,
有力的声音走遍世界;
(101) 用顺风的翅膀,
现在让它从一个地方飞到另一个地方吧.

(102) 它不把人的家庭
引到渴血的刀剑;
(103) 向永远战争着的世界,
它允诺神圣的和谐.

(104) AL NIA KARA LINGVO ( FAR IU NOVA ESPERANTISTO ) .

(105) LA LINGVO GRACIA , KARA MIA ,
GHIS KIAM VI VENIS AL MI FINE FIN ?
(106) ATENDIS SOIFE MI , ETERNE VIA ,
MI AMAS VIN !

(107) MI AMAS VIN VERE , PRUVU DIO ,
KAJ MIA BON-KORO BATAS NUR POR VI ;
(108) NE PLU SEKRETETO ESTAS TIO :
VIN AMAS MI !

(109) CHU KREDAS VI MIAN AMON MARAN ?
(110) CHU KREDAS , KE MIA KORO FLAMAS ?
(111) CHU KREDAS LA VORTON PURE KARAN :
VIN MI AMAS !

(104) TO OUR DEAR LANGUAGE ( BY SOME NEW ESPERANTIST ) .

(105) THE LANGUAGE GRACEFUL , MY DEAR ,
TILL WHEN YOU CAME TO ME AT LAST ?
(106) WAITED LONGINGLY I , EVER YOURS ,
I LOVE YOU !

(107) I LOVE YOU TRUELY , LET GOD PROVE ,
AND MY GOOD HEART BEATS ONLY FOR YOU ;
(108) NO LONGER THAT IS LITTLE SECRET :
I LOVE YOU !

(109) DO YOU BELIEVE MY LOVE LIKE SEA ?
(110) DO BELIEVE , THAT MY HEART BURNS ?
(111) DO BELIEVE THE WORD PURELY DEAR :
I LOVE YOU !

(104) 献给我们的亲爱的语言(某新世界语者所作).

(105) 优美的语言,我的亲爱的,
到什么时候你最后来到了我这儿?
(106) 我渴望地等待,你的永远的,
我爱你!

(107) 我真实地爱你,让上帝证明吧,
我的善良的心仅仅为了你跳动;
(108) 那已经不再是小秘密:
我爱你!

(109) 你相信我的大海一样的爱吗?
(110) 相信,我的心燃烧吗?
(111) 相信纯粹地亲爱的词吗:
我爱你!

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)


Outline of an HPSG-style Chinese reversible grammar


Wei LI
Simon Fraser University
(NLWC97)

This paper presents the outline and the design philosophy of a lexicalized Chinese unification grammar named W‑CPSG. W‑CPSG covers Chinese morphology, syntax and semantics in a novel integrated language model. The grammar works reversibly and is suited to both parsing and generation. This work is developed in the general spirit of the linguistic theory Head-driven Phrase Structure Grammar (Pollard & Sag 1994). We identify the following two problems as major obstacles in formulating a precise and efficient Chinese grammar. First, there has been too little serious study of the Chinese lexical base, and researchers often jump too soon to linguistic generalizations. Second, there is a lack of effective interaction and adequate interface between morphology, syntax and semantics. We address these problems in depth with the lexicalized and integrated design of W‑CPSG. We also illustrate how W‑CPSG is formalized and how it works.

1. Background

Unification grammars have been extensively studied in the last decade (Shieber 1986). Implementations of such grammars for English are being used in a wide variety of applications. Attempts also have been made to write Chinese unification grammars (Huang 1986, among others). W‑CPSG (for Wei's Chinese Phrase Structure Grammar, Li, W. 1997b) is a new endeavor in this direction, with its unique design and characteristics.

1.1. Design philosophy

We identify the following two problems as major obstacles in formulating a precise and efficient Chinese grammar. First, there has been too little serious study of the Chinese lexical base, and researchers often jump too soon to linguistic generalizations. Second, there is a lack of effective interaction and adequate interface between morphology, syntax and semantics. We address these problems in depth with the lexicalized and integrated design of W‑CPSG.

1.1.1. Lexicalized design

It has been widely accepted that a well-designed lexicon is crucial for a successful grammar, especially for a natural language computational system. But Chinese linguistics in general, and Chinese computational grammars in particular, have generally lacked in-depth research on the Chinese lexical base. For many years, most dictionaries published in China did not even contain information on grammatical categories in their lexical entries (except for a few dictionaries intended for foreign readers learning Chinese). English has dictionaries of sophisticated design and rich linguistic information, such as the Oxford Advanced Learner's Dictionary and the Longman Dictionary of Contemporary English; Chinese linguistics is hampered by the lack of comparably reliable lexical resources.

In the last decade, however, Chinese linguists have achieved significant progress in this field. The publication of 800 Words in Contemporary Mandarin (Lü et al., 1980) marked a milestone for Chinese lexical research. This book is full of detailed linguistic description of the most frequently used Chinese words and their collocations. Since then, Chinese linguists have made fruitful efforts, marked by the publication of a series of valency dictionaries (e.g. Meng et al., 1987) and books (e.g. Li, L. 1986, 1990). But almost all such work was done by linguists with little knowledge of computational linguistics. Their description lacks formalization and consistency. Therefore, Chinese computational linguists require patience in adapting and formalizing these results, making them implementable.

1.1.2. Integrated design

Most conventional grammars assume a successive model of morphology, syntax and semantics. We argue that this design is not adequate for Chinese natural language processing. Instead, an integrated grammar of morphology, syntax and semantics is adopted in W‑CPSG.

Let us first discuss the rationale for integrating morphology and syntax in Chinese grammar. As it stands, a written Chinese sentence is a string of characters (morphemes) with no blanks to mark word boundaries. Conventional systems use a procedure-based Chinese morphology preprocessor (a so-called segmenter). The major purpose of the segmenter is to identify a string of words to feed to syntax. This is not an easy task because of possible segmentation ambiguity. For example, the string of 4 Chinese characters da xue sheng huo has the two segmentations shown in (1a) and (1b) below.

(1) da xue sheng huo

(a) da-xue | sheng-huo
university | life

(b) da-xue-sheng | huo
university-student | live

The resolution of the above ambiguity in the morphology preprocessor is a hopeless job because such structural ambiguity is syntactically conditioned. For sentences like da xue sheng huo you qu (university life is interesting), (1a) is the right identification. For sentences like da xue sheng huo bu xia qu le (university students cannot make a living), (1b) is right. So far there are no segmenters which can handle this properly and guarantee correct word segmentation (Feng 1996). In fact, there can never be such segmenters as long as syntax is not brought in. This is a theoretical defect of all Chinese analysis systems in the morphology-before-syntax architecture (Li, W. 1997a). I have solved this problem in our morphology-syntax integrated W‑CPSG (see 2.2. below).
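The ambiguity in (1) can be reproduced with a few lines of code. The following is a minimal sketch, not part of W‑CPSG: the toy lexicon and the function name `segmentations` are assumptions made for illustration. It shows that a dictionary-only segmenter finds both (1a) and (1b) and has no basis for choosing between them.

```python
from typing import List

# Toy lexicon (an assumption for illustration; not the W-CPSG lexicon).
LEXICON = {"da-xue", "sheng-huo", "da-xue-sheng", "huo"}


def segmentations(chars: List[str]) -> List[List[str]]:
    """Return every way to split the character list into lexicon words."""
    if not chars:
        return [[]]
    results = []
    for end in range(1, len(chars) + 1):
        word = "-".join(chars[:end])
        if word in LEXICON:
            # Keep this word and segment the remaining characters recursively.
            for rest in segmentations(chars[end:]):
                results.append([word] + rest)
    return results


candidates = segmentations(["da", "xue", "sheng", "huo"])
for candidate in candidates:
    print(" | ".join(candidate))
```

Both candidates survive the lexicon look-up; only the syntactic context (you qu versus bu xia qu le) can decide between them, which is the argument made above.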

Now we examine the motivation of integrating syntax and semantics in Chinese grammar. It has been observed that, compared with the analysis of Indo-European languages, proper Chinese analysis relies more heavily on semantic information (see, e.g. Chen 1996, Feng 1996). Chinese syntax is not as rigid as languages with inflections. Semantic constraint is called for in both structural and lexical disambiguation as well as in solving the problem of computational complexity. The integration of syntax and semantics helps establish flexible ways of their interaction in analysis (see 2.3. below).

1.2. Major theoretical foundation: HPSG

The work on W‑CPSG is developed in the spirit of the linguistic theory Head-driven Phrase Structure Grammar (HPSG, proposed by Pollard & Sag, 1987). HPSG is a highly lexicalist theory, which encourages the integration of different components. This matches our design philosophy for implementing our Chinese computational grammar. HPSG serves as a suitable framework with which to start this research. We benefit most from the general linguistic ideas in HPSG. However, W‑CPSG is not confined to the theory-internal formulations of principles, rules and other details in HPSG versions (e.g. Pollard & Sag 1987, 1994 or later developments). We borrow freely from other theoretical sources or form our own theories in W‑CPSG to meet our goal of Natural Language Processing in general and Chinese computing in particular. For example, treating morphology as an integrated part of parsing and placing it right into the grammar is our deliberate choice. In syntax, we formulate our own theory for configuration and word order. Our semantics departs furthest from the standard situation-semantics-based theory in HPSG. It is based on insights from Tesnière's Dependency Grammar (Tesnière 1959), Fillmore's Case Grammar (Fillmore 1968) and Wilks' Preference Semantics (Wilks 1975, 1978) as well as our own semantic view for knowledge representation and better coordination of syntax-semantics interaction (Li, W. 1996). Given these differences and other modifications, it is more accurate to regard W‑CPSG as an HPSG-style Chinese grammar, rather than an (adapted) version of Chinese HPSG.

2. Integrated language model

2.1. W‑CPSG versus conventional Chinese grammar

The lexicalized design sets the common basis for the organization of the grammar in W‑CPSG. This involves the interfaces of morphology, syntax and semantics.[1] W‑CPSG assumes an integrated language model of its components (see Figure 1). The W‑CPSG model is in sharp contrast to the conventional clear-cut successive design of grammar components (see Figure 2).

Figure 2. conventional language model (non-reversible)

2.2. Interfacing morphology and syntax

As shown in Figure 2 above, conventional systems take a two-step approach: a procedure-based preprocessor for word identification (without discovering the internal structure) and a grammar for word-based parsing. W‑CPSG takes an alternative one-step approach and the parsing is character- (i.e. morpheme-) based. A morphological PS (phrase structure) rule is designed not only to identify candidate words but to build word‑internal structures as well. In other words, W‑CPSG is a self-contained model, directly accepting the input of a character string for parsing. The parse tree embodies both the morphological analysis and the syntactic analysis, as illustrated by the following sample parsing chart.

[Figure: sample parsing chart for zhe ben shu de ke-du-xing]

Note: DET for determiner; CLA for classifier; N for noun; DE for particle de;
AF for affix; V for verb; A for adjective; CLAP for classifier phrase;
NP for noun phrase; DEP for DE-phrase

This is so-called bottom-up parsing. It starts with lexicon look-up. Simple edges 1 through 7 are lexical edges. Combined edges are phrasal edges. Each edge represents a sign, i.e. a character (morpheme), a word, a phrase or a sentence. Lexical edges result from a successful match between the signs in the input string and the entries in the lexicon during lexicon look-up. After look-up, the lexical information for the signs is made available to the parser. For the sake of concise illustration, we only show two crucial pieces of information for each edge in the chart, namely category and interpretation, separated by a colon (some function words are only labeled for category). The parser attempts to combine the edges according to the PS rules in the grammar until a parse is found. A parse is an edge which spans the whole string. The parse ((((1+2)+3)+4)+((5+6)+7)) represents the following binary structural tree embodying both the morphological and syntactic analysis of this NP.

[Figure: binary structural tree for the parse ((((1+2)+3)+4)+((5+6)+7))]

As seen, word identification is no longer a pre-condition for parsing. It becomes a natural by-product of parsing in this integrated grammar of morphology and syntax: a successful parse always embodies the right word identification. For example, the parse ((((1+2)+3)+4)+((5+6)+7)) includes the identification of the word string zhe (DET) ben (CLA) shu (N) de (DE) ke-du-xing (N). An argument against the conventional separation model is that the two-step approach has a theoretical ceiling on the precision of word identification. This is because proper word identification in Chinese is to a considerable extent syntactically conditioned, owing to the structural ambiguity that may be involved. Our strategy has advantages over the conventional approach in resolving word identification ambiguities and in handling productive word formation. It solves the problems inherent in the morphology-before-syntax architecture (for detailed argumentation, see Li, W. 1997a).
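The character-based bottom-up procedure described in this section can be approximated with a small chart parser. This is only a sketch under stated assumptions: the category table and the six binary rules below are toy stand-ins written for this one example, whereas W‑CPSG itself uses lexicalized typed feature structures and 11 abstract PS rules. The names `LEXICON`, `RULES` and `parse` are hypothetical.

```python
from collections import defaultdict

# Toy category assignments for the seven characters (illustrative assumption).
LEXICON = {"zhe": "DET", "ben": "CLA", "shu": "N", "de": "DE",
           "ke": "AF", "du": "V", "xing": "AF"}

# Toy binary rules, morphological and syntactic alike (illustrative assumption).
RULES = {
    ("DET", "CLA"): "CLAP",  # zhe ben
    ("CLAP", "N"): "NP",     # zhe ben shu
    ("NP", "DE"): "DEP",     # zhe ben shu de
    ("AF", "V"): "A",        # ke-du (readable), word-internal
    ("A", "AF"): "N",        # ke-du-xing (readability), word-internal
    ("DEP", "N"): "NP",      # the whole NP
}


def parse(chars):
    """Bottom-up chart parsing; returns the edges that span the whole string."""
    n = len(chars)
    chart = defaultdict(set)  # (start, end) -> {(category, bracketing)}
    for i, ch in enumerate(chars):
        chart[(i, i + 1)].add((LEXICON[ch], str(i + 1)))  # lexical edges 1..n
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for lcat, ltree in chart[(start, mid)]:
                    for rcat, rtree in chart[(mid, end)]:
                        mother = RULES.get((lcat, rcat))
                        if mother:
                            chart[(start, end)].add((mother, f"({ltree}+{rtree})"))
    return sorted(chart[(0, n)])


print(parse("zhe ben shu de ke du xing".split()))
```

In this toy setting the only edge spanning all seven characters is an NP with the bracketing ((((1+2)+3)+4)+((5+6)+7)), so the word boundaries fall out of the parse rather than being fixed beforehand.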

2.3. Interaction of syntax and semantics

The interface and interaction of syntax and semantics are of vital importance in a Chinese grammar. We are of the same opinion as Chen (1996) and many others that it is more effective to analyze Chinese in an environment where semantic constraints are enforced during the parsing, not after. The argument is based on the linguistic characteristics of Chinese. Chinese has no inflection (like English ‑'s, ‑s, ‑ing, ‑ed, etc.) and no such formatives as articles (like English a, the), an infinitivizer (like English to) or a complementizer (like English that). Instead, function words and word order are used as the major syntactic devices. But Chinese function words (prepositions, aspect particles, the passive particle, the plural suffix, conjunctions, etc.) can often be omitted (Lü et al. 1980, p.2). Moreover, the fixed word order that is usually assumed to mark syntactic functions in isolating languages is to a considerable extent absent in Chinese. In fact, there is remarkable freedom or flexibility in Chinese word order. One typical example is the numerous word order variations for the Chinese transitive patterns, although the default order is S‑V‑O (subject-verb-object) (Li, W. 1996). Taken together, these facts project a picture of Chinese as a language of loose syntactic constraint. A weak syntax requires some support beyond syntax to enhance grammaticality. Semantic constraints are therefore called for. I believe that an effective way to model this interaction between syntax and semantics is to integrate the two in one grammar.

One strong piece of evidence for this syntax-semantics integration argument is that Chinese has what I call syntactically crippled structures. These are structures which can hardly be understood on purely formal grounds and are usually judged ungrammatical unless supported by semantic constraints (i.e. the match of semantic selection restrictions). Some Chinese NP predicates (Li, W. & McFetridge 1995) and transitive patterns like S‑O‑V (Li, W. 1996), among others, are such structures. The NP predicate is a typical instance of semantic dependence. It is highly undesirable to assume a general rule like S --> NP1 NP2 in a Chinese grammar to capture such phenomena. This is because there is a semantic condition for NP2 to function as predicate, which makes the Chinese NP predicate a very restricted pattern. For example, in the sentence This table is three-legged: zhe (this) zhang (classifier) zhuo-zi (table) san (three) tiao (classifier) tui (leg), the subject must be of the semantic type animate or furniture (which can have legs). The general rule with no recourse to semantic constraints is simply too productive and may cause severe computational complexity. In the case of Chinese transitive patterns, formal means are decisive for some variations in their interpretation (i.e. role assignment) process. But others are heavily dependent on semantic constraints. Take chi (eat) as an example. There is no difference in syntactic form between sentences like wo (I) dianxin (Dim-Sum) chi (eat) le (perfect-aspect) and dianxin (Dim-Sum) wo (I) chi (eat) le (perfect-aspect): both are NP1 NP2 V. Who eats what? To properly assign roles to NP1 NP2 V as S-O-V versus O-S-V, the semantic constraint animate eats food needs to be enforced.
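The role assignment for the NP1 NP2 V pattern with chi (eat) can be sketched as a check of selection restrictions. The semantic type table and the function `assign_roles` below are hypothetical illustrations, not the W‑CPSG knowledge representation; they only show how the constraint "animate eats food" picks the reading when word order alone cannot.

```python
# Toy semantic types for two nouns (assumption for illustration).
SEMANTIC_TYPE = {"wo": "animate", "dianxin": "food"}

# Selection restrictions of chi (eat): the eater is animate, the eaten is food.
CHI_RESTRICTIONS = {"agent": "animate", "patient": "food"}


def assign_roles(np1, np2):
    """Assign agent and patient for the surface pattern NP1 NP2 chi.

    Both orders are tried; the one that satisfies the selection
    restrictions wins. Returns None when neither order fits.
    """
    for agent, patient in ((np1, np2), (np2, np1)):
        if (SEMANTIC_TYPE.get(agent) == CHI_RESTRICTIONS["agent"]
                and SEMANTIC_TYPE.get(patient) == CHI_RESTRICTIONS["patient"]):
            return {"agent": agent, "patient": patient}
    return None


print(assign_roles("wo", "dianxin"))   # S-O-V order
print(assign_roles("dianxin", "wo"))   # O-S-V order
```

Both calls yield the same roles, with wo as the eater, which is the intended effect: the surface form NP1 NP2 V is identical and the semantic constraint does the disambiguation.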

The conventional syntax-before-semantics model has lost popularity in the Chinese computing community. Researchers have been exploring various ways of integrating syntax and semantics in Chinese grammar (Chen 1996). In W‑CPSG, Chinese syntax is enhanced by the incorporation of a semantic constraint mechanism. This mechanism embodies a lexicalized knowledge representation, which parallels the syntactic representation in the lexicon. I have developed a way to dynamically coordinate the syntactic and semantic constraints in one model. This technique proves to be effective in handling rhetorical expressions and in making the grammar both precise and robust (Li, W. 1996).

3. Lexicalized formal grammar

3.1. Formalized grammar

The applied nature of this research requires that we pay equal attention to the practical issues of computational systems and to a sound theoretical design. All theories and rule formulations in W‑CPSG are implementable. In fact, most of them have been implemented in our W‑CPSG prototype. W‑CPSG is a strictly formalized grammar that does not rely on undefined notions. The whole grammar is represented by typed feature structures (TFS), as defined below based on Carpenter & Penn (1994).

(3) Definition: typed feature structure

A typed feature structure is a data structure adopted to model a certain object of a grammar. The necessary part for a typed feature structure is type. Type represents the classification of the feature structure. A simple feature structure contains only the type information, but a complex feature structure can introduce a set of feature-value pairs in addition to the type. A feature-value pair consists of a feature and a value. A feature reflects one aspect of an object. The value describes that aspect. A value is itself a feature structure (simple or complex). A feature determines which type of feature structures it takes as its value. Typed feature structures are finite in a grammar. Their definition constitutes the typology of the grammar.
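As a rough illustration of this definition, a typed feature structure can be modeled as a type plus a mapping from features to further feature structures. The sketch below is an assumption-laden toy, not ALE and not the W‑CPSG typology: the small type hierarchy, the appropriateness table and the names `TFS`, `subsumes` and `well_typed` are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict

# Toy type hierarchy, child -> parent (assumption for illustration).
SUPERTYPE = {"n": "category", "v": "category",
             "category": "bot", "kanji": "bot", "a_sign": "bot"}

# Each feature determines the type its value must have (toy table).
APPROPRIATE = {"CATEGORY": "category", "KANJI": "kanji"}


@dataclass
class TFS:
    type: str                                        # the obligatory part
    features: Dict[str, "TFS"] = field(default_factory=dict)

    def is_simple(self) -> bool:
        """A simple feature structure carries only type information."""
        return not self.features


def subsumes(general: str, specific: str) -> bool:
    """True if `specific` equals `general` or is one of its subtypes."""
    t = specific
    while t is not None:
        if t == general:
            return True
        t = SUPERTYPE.get(t)
    return False


def well_typed(fs: TFS) -> bool:
    """Check every feature-value pair against the appropriateness table."""
    for feat, value in fs.features.items():
        required = APPROPRIATE.get(feat)
        if required is None or not subsumes(required, value.type):
            return False
        if not well_typed(value):
            return False
    return True


shu = TFS("a_sign", {"KANJI": TFS("kanji"), "CATEGORY": TFS("n")})
print(well_typed(shu))
```

A value such as `TFS("n")` is itself a (simple) feature structure, which mirrors the recursive clause in the definition.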

With this formal device of typed feature structures, we formulate W‑CPSG by defining from the very basic notions (e.g. sign, morpheme, word, phrase, S, NP, VP, etc.) to rules (PS rules and lexical rules), lexical items, lexical hierarchy and typology (hierarchy embodied in feature structures) (Li, W. 1997b). The following sample definitions of some basic notions illustrate the formal nature of W‑CPSG. Please note that they are system-internal definitions and are used in W‑CPSG to serve the purpose of configurational constraints (see Chapter VI of Li, W. 1997b).

(4) Definition: sign [2]

a_sign
KANJI kanji
MORPH expected
CATEGORY category
COMP0 expected
COMP1 expected
COMP2 expected
MOD expected
KNOWLEDGE knowledge
CONTENT content
DTR dtr

A sign is the most fundamental concept of grammar. A sign is a dynamic unit of grammatical analysis. It can be a morpheme, a word, a phrase or a sentence. Formally, a sign is defined by the TFS a_sign, which introduces a set of linguistic features for its description, as shown above. These features include the orthographic feature KANJI; morphological feature MORPH; syntactic features CATEGORY, COMP0, COMP1, COMP2, and MOD; structural feature (for both morphology and syntax) DTR; semantic features KNOWLEDGE and CONTENT.

(5) Definition: morpheme

a_sign
MORPH ~saturated

A morpheme is a sign whose morphological expectation has not been saturated. In W‑CPSG, ~saturated is equivalent to obligatory/optional/null. For example, the suffix ‑xing (‑ness) is such a morpheme whose morphological expectation for a preceding adjective is obligatory. In W‑CPSG, a morpheme like ‑xing (‑ness) ceases to be a morpheme when its obligatory expectation, say the adjective ke-du (readable), is saturated. Therefore, the sign ke-du-xing (readability) is not a morpheme, but becomes a word per se.

(6) Definition: word

a_sign
MORPH ~obligatory
DTR no_syn_dtr

In W‑CPSG, ~obligatory is equivalent to saturated/optional/null. The specification [MORPH ~obligatory] defines a syntactic sign, i.e. a sign whose obligatory morphological expectation has been saturated. A word is a syntactic sign with no syntactic daughters, i.e. [DTR no_syn_dtr]. Obviously, word with [MORPH saturated/optional/null] overlaps morpheme with [MORPH obligatory/optional/null] in cases when the morphological expectation is optional or null.

Just like the overlapping of morpheme and word, there is also an intersection between word and phrase. Compare the following definition of phrase with the above definition of word.

(7) Definition: phrase

a_sign
MORPH ~obligatory
COMP0 ~obligatory
COMP1 ~obligatory
COMP2 ~obligatory

A phrase is a syntactic sign whose obligatory complement expectations have all been saturated, i.e. [COMP0 ~obligatory, COMP1 ~obligatory, COMP2 ~obligatory]. When a word has only optional complement expectations or no complement expectation, it is also a phrase. The overlapping relationship among morpheme, word and phrase can be shown by the following illustration of the three sets.

[Figure: overlapping sets of morpheme, word and phrase]

S is a syntactic sign satisfying the following 3 conditions: (1) its category is pred (which includes V and A); (2) its comp0 is saturated; (3) its obligatory comp1 and comp2 are saturated.
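Definitions (5) through (7) amount to three overlapping predicates over a sign's expectation features. The following sketch restates them in code under simplifying assumptions: expectation values are plain strings, the structural feature DTR is reduced to a boolean, and the sample signs (in particular the transitive verb) are hypothetical rather than taken from the W‑CPSG lexicon.

```python
from dataclasses import dataclass


@dataclass
class Sign:
    # Each expectation is one of: "obligatory", "optional", "null", "saturated".
    morph: str = "null"
    comp0: str = "null"
    comp1: str = "null"
    comp2: str = "null"
    has_syn_dtr: bool = False  # stands in for the DTR feature


def is_morpheme(s: Sign) -> bool:
    """(5): MORPH ~saturated."""
    return s.morph != "saturated"


def is_word(s: Sign) -> bool:
    """(6): MORPH ~obligatory and no syntactic daughters."""
    return s.morph != "obligatory" and not s.has_syn_dtr


def is_phrase(s: Sign) -> bool:
    """(7): MORPH, COMP0, COMP1 and COMP2 are all ~obligatory."""
    return all(v != "obligatory" for v in (s.morph, s.comp0, s.comp1, s.comp2))


xing = Sign(morph="obligatory")        # suffix -xing, still expecting an adjective
ke_du_xing = Sign(morph="saturated")   # ke-du-xing after the suffix is saturated
shu = Sign()                           # a free noun with no expectations
verb = Sign(comp1="obligatory")        # hypothetical verb lacking its object

for name, s in [("xing", xing), ("ke-du-xing", ke_du_xing), ("shu", shu), ("verb", verb)]:
    print(name, is_morpheme(s), is_word(s), is_phrase(s))
```

The sign shu satisfies all three predicates, which is the overlap among morpheme, word and phrase that the definitions describe.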

3.2. Lexicalized grammar

W‑CPSG takes a radical lexicalist approach. We started with individual words in the lexicon and have gradually built up a lexical hierarchy and the grammar prototype.

W‑CPSG consists of two parts: a minimized general grammar and an information-enriched lexicon. The general grammar contains only 11 PS rules, covering complement structure, modifier structure, conjunctive structure and morphological structure. We formulate a PS rule for illustration.

[Figure: the comp0 PS rule]

This comp0 PS rule is similar to the rule S ==> NP VP in the conventional phrase structure grammar. The feature COMP0 represents the expectation of the head daughter for its external complement (subject or specifier) on its left side, i.e. [DIRECTION left]. The nature of its expected comp0, NP or other types of sign, is lexically decided by the individual head (hence head-driven or lexicon-driven). It will always be warranted by the general grammar, here via the index [3]. This is the nature of lexicalized grammars. PS rules in such grammars are very abstract. Essentially, they say one thing, namely, 2 signs can combine so long as the lexicon so indicates. The indices [1] and [2] represent configurational constraint. They ensure that internal obligatory complements COMP1 and COMP2 must be saturated before this rule can be applied. Finally, Head Feature Principle (defined elsewhere in the grammar based on the adaptation of the Head Feature Principle in HPSG, Pollard & Sag, 1994) ensures that head features are percolated up from the head daughter to the mother sign.
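The behavior attributed to the comp0 rule can be paraphrased procedurally. The sketch below is a loose approximation written for illustration only: it replaces unification over typed feature structures with plain equality tests, and the class `Sign`, its fields and the function `comp0_rule` are assumptions, not the rule as coded in W‑CPSG.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Sign:
    category: str
    comp0: Optional[str] = None  # category the head expects on its left
    comp0_status: str = "null"   # obligatory / optional / null / saturated
    comp1_status: str = "null"
    comp2_status: str = "null"


def comp0_rule(left: Sign, head: Sign) -> Optional[Sign]:
    """Combine a left sign with a head that expects it as COMP0."""
    # The head must still have an open COMP0 expectation.
    if head.comp0_status not in ("obligatory", "optional"):
        return None
    # Configurational constraint: internal obligatory complements come first.
    if "obligatory" in (head.comp1_status, head.comp2_status):
        return None
    # What may fill COMP0 is decided lexically by the head.
    if left.category != head.comp0:
        return None
    # The mother keeps the head's category (a stand-in for the Head Feature
    # Principle) and records that COMP0 is now saturated.
    return replace(head, comp0_status="saturated")


np = Sign("NP")
vp = Sign("V", comp0="NP", comp0_status="obligatory", comp1_status="saturated")
print(comp0_rule(np, vp))
```

The rule itself names no categories; everything specific comes from the head's lexical entry, which is the sense in which such PS rules are abstract.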

The lexicon houses lexical entries with their linguistic description and knowledge representation. Potential morphological structures, as well as potential syntactic structures, are lexically encoded (in the feature MORPH for the former and in the features COMP0, COMP1, COMP2, MOD for the latter). Our knowledge representation is also embodied in the lexicon (in the feature KNOWLEDGE). I believe that this is an effective and realistic way of handling natural language phenomena and their disambiguation without having to resort to an encyclopedia-like knowledge base. The following sample formulation of the lexical entry chi (eat) projects a rough picture of what the W‑CPSG lexicon looks like.

[Figure: sample lexical entry for chi (eat)]

The lexicon also contains lexical generalizations. The generalizations are captured by the inheritance of the lexical hierarchy and by a set of lexical rules. Due to space limitations, I will not show them in this paper.

4. Implementation and application of W‑CPSG

A substantial Chinese computational grammar has been implemented in the W‑CPSG prototype. It covers all basic Chinese constructions. Particular attention is paid to the handling of function words and verb patterns. On the basis of the information-enriched lexicon and the general grammar, the system adequately handles the relationship between linguistic individuality and generality. The grammar formalism which I use to code W‑CPSG is ALE, a grammar compiler on top of Prolog, developed by Carpenter & Penn (1994). ALE is equipped with an inheritance mechanism on typed feature structures, a powerful tool in grammar modeling. I have made extensive use of the mechanism in the description of lexical categories as well as in knowledge representation. This seems to be an adequate way of capturing the inherent relationship between features in a grammar. Prolog is a programming environment particularly suitable for the development of unification and reversible grammars (Huang 1986, 1987). ALE compiles W‑CPSG into a Chinese parser, a Prolog program ready to accept a string of characters for analysis. In the first experiment, W‑CPSG parsed a corpus of 200 Chinese sentences of various types.

An important benefit of a unification-based grammar is that the same grammar can be used both for parsing and generation. Grammar reversibility is a highly desirable feature for multilingual machine translation applications. Following this line, I have successfully applied W‑CPSG to an experiment in bi-directional machine translation between English and Chinese. The machine translation system developed in our Natural Language Lab is based on the shake-and-bake design (Whitelock 1992, 1994). I used the same three grammar modules (W‑CPSG, an English grammar and a bilingual transfer lexicon) and the same corpus for the experiment. As part of the machine translation output, W‑CPSG successfully generated the 200 Chinese sentences. The experimental results meet our design objective and verify the feasibility of our approach.

References

Carpenter, B. & Penn, G. (1994): ALE, The Attribute Logic Engine, User's Guide

Chen, K-J. (1996): "Chinese sentence parsing" Tutorial Notes for International Conference on Chinese Computing ICCC'96, Singapore

Feng, Z-W. (1996): "COLIPS lecture series - Chinese natural language processing", Communications of COLIPS, Vol. 6, No. 1 1996, Singapore

Fillmore, C. J. (1968): "The case for case". Bach and Harms (eds.), Universals in Linguistic Theory. Holt, Rinehart and Winston, pp. 1-88.

Huang, X-M. (1986): "A bidirectional grammar for parsing and generating Chinese". Proceedings of the International Conference on Chinese Computing, Singapore, pp. 46-54

Huang, X-M. (1987): XTRA: The Design and Implementation of A Fully Automatic Machine Translation System, Doctoral dissertation, University of Essex.

Li, L-D. (1986): Xiandai Hanyu Juxing (Sentence Patterns in Contemporary Mandarin), Shangwu Yinshuguan, Beijing

Li, L-D. (1990): Xiandai Hanyu Dongci (Verbs in Contemporary Mandarin), Zhongguo Shehui Kexue Chubanshe, Beijing

Li, W. & P. McFetridge (1995): "Handling Chinese NP predicate in HPSG", Proceedings of PACLING-II, Brisbane, Australia

Li, W. (1996): "Interaction of syntax and semantics in parsing Chinese transitive patterns", Proceedings of International Conference on Chinese Computing (ICCC'96), Singapore

Li, W. (1997a): "Chart parsing Chinese character strings", Proceedings of The Ninth North American Conference on Chinese Linguistics (NACCL-9, to be available), Victoria, Canada

Li, W. (1997b): W‑CPSG: A Lexicalized Chinese Unification Grammar, Doctoral dissertation, Simon Fraser University (on-going)

Lü, S-X. et al. (ed.) (1980): Xiandai Hanyu Babai Ci (800 Words in Contemporary Mandarin), Shangwu Yinshuguan, Beijing

Meng, Z., H-D. Zheng, Q-H. Meng, & W-L. Cai (1987): Dongci Yongfa Cidian (Dictionary of Verb Usages), Shanghai Cishu Chubanshe, Shanghai

Pollard, C. & I. Sag (1987): Information based Syntax and Semantics Vol. 1: Fundamentals. Centre for the Study of Language and Information, Stanford University, CA

Pollard, C. & I. Sag (1994): Head-Driven Phrase Structure Grammar, Centre for the Study of Language and Information, Stanford University, CA

Shieber, S. (1986): An Introduction to Unification-Based Approaches to Grammar. Centre for the Study of Language and Information, Stanford University, CA

Tesnière, L. (1959): Éléments de Syntaxe Structurale, Paris: Klincksieck

Whitelock, Pete (1992): "Shake and bake translation", Proceedings of the 14th International Conference on Computational Linguistics, pp. 784-790, Nantes, France.

Whitelock, Pete (1994): "Shake and bake translation", C.J. Rupp, M.A. Rosner, and R.L. Johnson (eds.), Constraints, Language and Computation, pp. 339-359, London, Academic Press.

Wilks, Y.A. (1975): "A preferential pattern-seeking semantics for natural language inference". Artificial Intelligence, Vol. 6, pp. 53-74.

Wilks, Y.A. (1978): "Making preferences more active". Artificial Intelligence, Vol. 11, pp. 197-223

-------------------------------------

* This project was supported by the Science Council of British Columbia, Canada under G.R.E.A.T. Award (code: 61) and by my industry partner TCC Communications Corporation, British Columbia, Canada. I thank my academic advisors Paul McFetridge and Fred Popowich and my industry advisor John Grayson for their supervision and encouragement. Thanks also go to my colleagues Davide Turcato, James Devlan Nicholson and Olivier Laurens for their help during the implementation of this grammar in our Natural Language Lab. I am also grateful to the editors of the NWLC'97 Proceedings for their comments and corrections.

[1] We leave aside the other components such as discourse, pragmatics, etc. They are an important part of a grammar for a full analysis of language phenomena, but they are beyond what can be addressed in this research.

[2] In formulating W‑CPSG, we use uppercase for feature and lowercase for type; ~ for logical not and / for logical or; number in square brackets for unification.


立委硕士论文：目标语调序 (9)

世界语到汉语和英语的自动翻译试验
-- EChA机器翻译系统概述

目标语调序

在前面的虚词一线和形态生成一线, 已经做了一些局部调序并给了同号. 如:

CHIO (一切) CHI (这) ----> 这一切 (012);
DOKTORO (博士) ZAMENHOF (柴门霍夫) ----> 柴门霍夫博士 (134)

英语疑问句和否定句所需要的调序, 就放在形态生成的同时进行. 如:

NE (NOT) ESTIS (WERE) ----> WERE NOT (008)

CHU VIA (YOUR) AMIKO (FRIEND) ESTAS (IS) KURACISTO (DOCTOR) ?
----> IS YOUR FRIEND DOCTOR ? (039)

从综合第二线开始, 系统从句子整体着眼, 自底而上分别做各目标语的归约调序. 有了CDC和调序子程序, 建立目标语的归约生成算法就很简单了. 其基本思路是:

(1) 由句首至句末依次取词, 放过已加工和非终结节点.
(2) 若该词层号为一, 右链为零, 说明已经归约到顶层主轴心, 该句加工完毕.
(3) 若该词需要调序, 入调序子程序.
(4) 该词做已加工特征, 并视情况决定是否给该词以轴心词同号.
(5) 入子程序检查该词的姐妹词是否也都已加工.
(6) 若是, 则该词及其所有姐妹词给以轴心词同号, 轴心词做终结节点特征.
(7) 返回第(1)步.

对于英语, 问题特别简单, 只有一种情况需要调序, 即及物谓语所带的前置宾语和后置主语. (不及物谓语句中的后置主语无需调序.) 汉语的问题就复杂得多, 主要规则有:

(1) 存在 "有" (ESTI) 的主语应后置. 除此以外, 后置主语(包括多数主语从句)一律前移.

(2) 要求带 "把", "使" 等的汉语及物动词做谓语的句子, 其宾语在加上 "把", "使"等以后, 应置于谓语前. 除此以外, 前置宾语一律后移.

(3) 后置定语从句在两种情况下不需前移: 1. ESTAS + X, KIU 型强调句式; 2. 长15词以上的定语从句. 其余的所有后置定语一律前移. 各姐妹定语的相对位置主要由它们的语义特征决定, 具体是通过调序时给或不给同号来实现.

(4) 状语从句一般原位不动(但后置时间状语从句最好前移). 其余后置状语一律前移. 各姐妹状语相对位置的处理原则同上.


立委硕士论文：EChA试验结果分析 (10)


EChA试验结果分析

总的来说, 这次试验结果相当令人满意. 译文不但可读, 多数都很通顺. 由于比较重视修辞, 机器味儿也不浓. 当然, 这毕竟是小范围的实验, 虽然我们尽量照顾到各种可能出现的语言现象, 但也难说在今后的扩大试验中会出现什么问题, 好在该系统比较容易维护和改进.

第二首诗中有两处(110)(111)把疑问句错译成英语强调句:

CHU kredas la vorton pure karan: vin mi amas! (111)
DO BELIEVE the word purely dear: I love you!
Cf: 相信纯粹地亲爱的词吗:我爱你!

这是因为原诗句为了节奏的需要, 承前省略了主语 VI (YOU). 有意思的是, 译成强调句于诗意没有什么损害.

在EChA上机伊始, 我们由于专心于检验方案主体的可行性和合理性, 而忽略了修辞. 初期译文(1985.12)显得较粗糙, 比较后期结果(1986.2), 译文的改进是明显的. 例如:

形式主语IT的增加 (007)(012)(077)(122)(125)(133):

Sed chio chi ankorau okazis sub homa gvidado kaj PLEJ GRAVE ESTIS, KE chio chi bazighis sur la homa scio. (012)

1) But all this still happened under man's guiding and MOST IMPORTANT WAS, THAT all this was based on the man's knowledge.

2) But all this still happened under man's guiding and IT WAS MOST IMPORTANT, THAT all this was based on the man's knowledge.

不定式带TO跟不带TO的区分 (004)(019)(072)(078)(083)(084)(088)(089)(092)(095)(132)(142)(146):

LABORI estas necese.(072)
1) (TO) WORK is necessary.
2) TO WORK is necessary.
工作是必要的.

双宾语 (128)(143)(144):

Donu AL mi iom da kafo! (128)
1) Give TO me a little coffee!
2) Give me a little coffee!
给我一点咖啡!

表示存在的 ESTI 译 "有" 和 THERE TO BE (049)(157):

En unu jaro ESTAS kvar sezonoj: printempo, somero, autuno kaj vintro. (049)

1) In one year ARE four seasons: spring, summer, autumn and winter.
在一年里面 "是" 四季节:春季,夏季,秋季和冬季.

2) In one year THERE ARE four seasons: spring, summer, autumn and winter.
在一年里面 "有" 四季节:春季,夏季,秋季和冬季.

目标语词义的选择 (059)(067)(081)(046)(098)(013)(014)(027)(118)(130):

ELMETU viajn opiniojn pri nia laboro! (059)

1) "输出" 你们的关于我们的工作的意见!
2) "提出" 你们的关于我们的工作的意见!
OUTPUT your opinions about our work!

Chu mi FARIS multajn erarojn en mia hejmtasko? (081)

1) Did I DO a lot of mistakes in my homework?
我在我的家庭作业里面 "做" 了许多错误吗?

2) Did I MAKE a lot of mistakes in my homework?
我在我的家庭作业里面 "犯" 了许多错误吗?

La partio TRE zorgas la vivon de la popolamaso. (046)

1) The party VERY cares for the life of the masses.
2) The party VERY MUCH cares for the life of the masses.
党很关心人民群众的生活.

La suno levighas CHE oriento. (013)

1) The sun rises AT east.
2) The sun rises IN THE east.
太阳在东方升起.

POST unu monato komencighos la someraj ferioj. (014)

1) AFTER one month will begin the summer's holidays.
2) IN one month will begin the summer's holidays.
暑假在一月以后将开始.

La eksperimento pri mashina tradukado ANKORAU NE estas finita. (027)

1) The experiment about machine's translating STILL has been NOT finished.
关于机器的翻译的试验 "仍然没有" 被完成.

2) The experiment about machine's translating has been NOT finished YET.
关于机器的翻译的试验 "还没有" 被完成.

Ni esperas, ke li GAJNU championecon en la konkurso. (118)

1) We hope, that he WIN championship in the competition.
2) We hope, that he WILL WIN championship in the competition.
我们希望,让他在比赛里面赢得冠军.

Prenu la lingvon neutralan KIEL la bazon. (130)

1) Take the language neutral AS the base.
2) Take the language neutral FOR the base.
拿中立的语言作为基础.

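The EChA experiment made it very clear to us that conversion between languages of the same family is far easier than conversion across families. The closer the kinship, the lower the precision that machine translation demands of automatic analysis, and the easier it is to bring a system into practical use. English and Chinese are both analytic languages and share many features; even so, Esperanto-to-English conversion is much simpler than Esperanto-to-Chinese. With an Esperanto-English machine dictionary plus a set of morphological conversion algorithms, word-for-word Esperanto-English machine translation is possible without any analysis of layers or syntax. Such output is rough but usable to a considerable degree. We compiled statistics on the intermediate, not yet reordered translations produced by EChA's first synthesis line (morphological conversion), taking "does not cause misunderstanding" as the criterion: the English accuracy is about 95% (150/158), with eight obscure sentences (003)(010)(075)(095)(102)(108)(111)(141), and the Chinese accuracy is about 72% (113/158). If we exclude the part of morphological conversion that uses the results of syntactic analysis (but not the function-word analysis and conversion of the first line), the English accuracy is still above 80%. Marking preposed accusative nouns in the output would raise intelligibility further. Of course, our 158 test sentences are a limited sample, so these figures have only relative significance. Machine translation in China has from the outset dealt with automatic conversion between Indo-European and Sino-Tibetan, two unrelated families, which is very difficult. This is probably one of the main reasons our practical systems have been so slow to appear. Chinese machine translation researchers therefore carry a heavier burden and face a harder task, one that calls all the more for originality and dedication. This disadvantage has another side: the many special problems that arise when machine translation meets Chinese have objectively made our research go deeper. Machine translation research in China did not pass through the first generation of word-for-word translation as in Europe and America; it started directly from the second generation of sentence-for-sentence translation, from a higher starting point, and within a very short time (the early 1960s) caught up with the most advanced level in the world at that time. This is clearly related to the requirements of the specific language pairs we studied (Russian-Chinese, English-Chinese, etc.).[10]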

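Now to another question: can literary works be translated by machine? We say they certainly can, though with great difficulty. Formalizing the rules a human follows when translating literature (many of them subconscious) and turning them into algorithms is obviously not easy. Even if it could be done, it would not pay economically. So for a long time to come, apart from special experimental needs, people will generally not spend the effort. EChA translated two poems as a modest attempt in this direction, showing that a machine can also translate poetry. Judging from the output, the English is more beautiful than the Chinese: it retains more of the rhythm and rhyme and reads more like a poem. Apart from a few well-rendered lines (e.g. "向永远战争着的世界, / 它允诺神圣的和谐"), the Chinese translation on the whole reads more like prose. This is hardly surprising, since EChA was not designed specifically for translating poetry. The two most prominent formal features of poetry are rhythm and end rhyme. One can imagine that the dictionary of a poetry translation system would differ from an ordinary machine dictionary: under each sense of each entry it would gather a set of synonymous target-language equivalents of differing lengths and rhymes, for the machine to choose from when synthesizing verse, just as a person writing or translating poetry often consults a rhyming dictionary.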

一提机器翻译, 人们总爱问: 机器能够翻译文学作品吗? 为什么不能? 离散是对连续的逼近, 机器智能是对人的智能的模拟, 二者之间并没有一道不可逾越的鸿沟. 从功能上看, 机器和人没有什么不同. 机器不过是无机体的人罢了. 只要人会的事情, 机器迟早也能会. 机器的不会并不是它不能, 而是人没有使它会, 这正如文盲不会写字是因为没人教他一样. 不过, 机器胃口很刁, 不懂 "意会", 只有 "言传"(通过计算机语言)才能教会它. 可惜, 对很多事, 人至今还是知其然, 并不知其所以然, 无法传授. 可见, 机器的无能全由于人的无能. 可人今天不知其所以然的, 并不说明将来总也不知, 所以从发展的观点看, 机器和人一样是无所不能的. 事实上, 机器目前已能代替医生, 译员和作曲家做部分工作, 而且比技术较差的人做得还象样些, 因为它 "取法乎上". 即便人, 也只有很少一部分专家能够从事这些工作. 机器已经闯进了万物之灵的神圣禁地.

最后, 一般地谈谈修辞问题. 由于机器翻译至今多局限在实验室里, 所以未予修辞而产生的阅读障碍(包括心理障碍)还不突出. 但随着机器翻译的逐步实用化, 修辞的必要性将越来越明显. 前面所举的后期译文对初期译文的改进的实例, 主要涉及的就是修辞.

1) 什么是机器翻译修辞?

机器翻译修辞是保证译文通顺的一个重要手段. 它是机器语法之后译文综合的一部分, 是自动翻译过程的最后一个环节. 广义的修辞包括贯穿翻译全过程的, 一切旨在促使译文通顺和美化的手段, 譬如成语手段(通过成语词典), 虚词分析(通过虚词模块), 结构手段(通过搭配关系)等等. 有些所谓多义区分, 实际上也是一种修辞, 例如 LUDI (PLAY) 可分为 "玩", "打球)", "演奏(乐器)"等义项, 但 "演奏" 义下具体选择 "拉(提琴, 胡琴)"(016), "弹(钢琴)"(038) 还是 "吹(口琴)" 就属于修辞了. EChA对于涉及多义的修辞, 即目标语合适对等词的选择, 就把它当作多义问题解决(见EChA虚词模块, 词类词义区分表和多义区分模块). 一般来说, 跟具体的词汇或语法现象联系很紧的修辞, 以及其他个性较强的特例修辞, 应该放在相应的词典或语法部分同时处理, 而可以归出类别的修辞, 则由最后独立的修辞模块统一解决.

机器翻译修辞具有某种超语言学的特征, 属于翻译学范畴. 我们知道, 根据原语和译语的语言学角度的对比差异, 就可以对所译文句实现转换(主要是句型转换), 这是我们目前机器翻译的主体工作. 但这样直接转换的句子不能保证其通顺, 甚至也不能保证其正确(即不被误解), 因为语言间(尤其是没有亲属关系的语言间)除了词汇语法等差异外, 还有超语言学(表达习惯, 思维方式等等)的差异存在, 即翻译学角度的对比差异. 例如: nun DE LOKO flugu ghi AL LOKO (now FROM PLACE let it fly TO PLACE) (101) / 现在从 "一个" 地方让它飞到 "另一个" 地方吧("从地方到地方" 不符合汉语表达习惯). 修辞主要是为消除这种差异而设置的. 因此, 只有翻译学角度的语言对比差异, 才是修辞的根本依据.

2) 修辞的分类

可分作两大类: 必要修辞和美修辞. 必要修辞是保证译文正确可懂所必需的修辞, 它是修辞的初级阶段. 美修辞则是保证译文通顺畅达, 甚至产生某种美感或帮助形成译文风格所要求的修辞, 它是修辞的高级阶段. 机器翻译修辞首先是作为必要修辞提出来的. 必要修辞是基础, 具有更大的迫切性, 是所有实用系统的必要组成部分, 如形态修辞. 这部分修辞数量很有限, 一定量的研究就可以穷尽它. 美修辞可以说是锦上添花. 它是为机器译文不断提高质量, 使之朝成熟, 完美方向发展, 以期赶上人工翻译的手段. 可见, 美修辞是无限发展的, 它本身具有许多层次和侧面. 修修补补远不能满足美修辞发展的需要. 它要求体系和方法上的不断革新. 就机器翻译的前景来说, 美修辞的比重将逐渐变大. 从严格的意义上讲, 只有美修辞才真正体现修辞本身的特点和规律, 因为必要修辞在一定的意义上不过是语法的推广, 即可以算作广义的语法. 它的手段跟机器语法没有根本的不同. 在现行的EChA系统中, 必要修辞就常常跟语法混在一起.

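As for aesthetic rhetoric (美修辞), EChA has made only a small attempt. It should be pointed out that beauty in machine translation has its own priorities: it values most of all output that is "smooth and fluent, idiomatic, concise and natural", and only then the formation of a translation style. We believe it is entirely possible for machine output to develop a style gradually. Formally, style is carried mainly by vocabulary, especially small words (modal particles, structural words), and to a lesser extent by differences in grammatical form. The formal features of different styles can be recognized and adopted by a machine. Concretely, one can draw on the results of computational stylistics to design target-language rhetoric models for different styles, such as a formal style, an elegant style and a colloquial style. The formal style is regular in format, clear and simple, and gives an impression of objectivity and plainness. The elegant style uses classical function words (such as "则", "即", "乃", "便", "故", "且", "其", "及") and more idioms, and reads as concise and classical. The colloquial style is looser and freer and carries more modal particles (such as "吗", "呢", "可不", "是吗", "啊").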

____________________________________________________________________

附注: [10] 参见刘涌泉 <<中国的机器翻译>> ( <<情报科学>> 1980, 3 )

[致谢]

研制世界语类型的机器翻译系统, 从一开始就得到刘涌泉老师的热情支持, 从方案主体到具体问题的处理, 他都给以认真指导. 在程序设计和上机调试的的过程中, 刘倬老师也多次给予指导, 有些基本操作的算法也是刘倬老师提供的. 在EChA系统取得初步成果的时候, 笔者向他们表示深切的感谢. 另外, 还要特别感谢机房韩老师的多方协助. 没有她提供的方便, EChA系统根本不可能在这么短时间试验成功.


立委硕士论文：英语形态生成 (8)


英语形态生成

加尾算法跟削尾算法正好是逆过程. 建立一个完全的, 符合实用系统要求的英语加尾算法并不困难, 因为英语的形态比较简单. EChA把汉语形态修辞与英语形态生成放在一处进行.

原语和译语的对比差异是建立语言转换规则的依据. 这种对比差异可以归纳为下面五种情况: 1) 一一对应; 2) 此一彼多; 3) 此多彼一; 4) 此有彼无; 5) 此无彼有. 我们以世界语到英语的形态转换分别举例如下:

1) 一一对应

世界语派生副词(由逻辑类为形容词的词干加 "-E" 尾构成)
--------->英语相应形容词加 "-LY" 尾.

例: diligent-E ----> diligent-LY ; serioz-E ----> serious-LY ;
sincer-E ----> sincere-LY. (063)

例外: bon-E ----> well (045)
( 不是 good-LY, 这种情况在词典一线入词类词义区分表处理. )

显然, 一一对应的情形最好办.
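The one-to-one case above can be sketched in a few lines of Python. This is only an illustration, not EChA's code: the two tables and the function name `adverb_to_english` are hypothetical, and the only exception taken from the text is bon-E ----> well.

```python
# Hypothetical stem dictionary: Esperanto adjective stem -> English adjective.
STEM_TO_ENGLISH = {
    "diligent": "diligent",
    "serioz": "serious",
    "sincer": "sincere",
    "bon": "good",
}

# Exceptions are consulted before the regular -E -> -LY rule.
# Only bon-E -> well is taken from the text.
IRREGULAR_ADVERBS = {"bon": "well"}


def adverb_to_english(word: str) -> str:
    """Map an Esperanto derived adverb (adjective stem + -E) to English."""
    w = word.lower()
    if not w.endswith("e"):
        raise ValueError("not a derived adverb: " + word)
    stem = w[:-1]
    if stem in IRREGULAR_ADVERBS:
        return IRREGULAR_ADVERBS[stem]
    return STEM_TO_ENGLISH[stem] + "ly"
```

Checking the exception table first mirrors the way the text routes bon-E through the word-class/sense table before any regular rule applies.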

2) 此一彼多

世界语不定式 --------> 英语动词原形或 TO + 动词原形
世界语条件句(谓语动词以 "-US" 收尾) --------> 英语三种形式(过去, 现在, 将来).
Examples:
1. Se mi sci-US hierau, mi certe ven-US.
----> If I HAD KNOWN yesterday, I certainly SHOULD HAVE COME. (hypothesis contrary to past fact)

2. Se vi est-US mi, kion vi far-US?
----> If you WERE me, what WOULD you do? (contrary to present fact)

3. Se vi ven-US morgau, vi shin vid-US.
----> If you SHOULD come tomorrow, you WOULD see her. (contrary to future fact)

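This case is the most troublesome; it is the root of ambiguity in machine translation. If the examples above had no explicit time adverbial, the choice could only be inferred from cross-sentence context, which is far too hard for a machine. In such cases EChA simply replaces "-US" with "WOULD" across the board (050). This does not fully conform to English grammar, but it will have to do for now; fortunately the conversion causes no misunderstanding.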

此一彼多另一个常见的例子是, 世界语现在时简单式(-AS尾)对应于英语一般现在时和现在进行时两种. 虽然世界语复合时态有与英语现在进行时对应的形式( ESTAS x-ANTA ), 但是世界语的节约原则要求人们尽可能少用复杂形式. 我们一时还找不出足够可靠的形式规则, 来决定 "-AS" 究竟何时译作一般时态, 何时译作进行时态. EChA目前一律以一般现在时译之, 这使得部分译文不是很确切, 但并不造成误解或费解. 如:

Kien vi ir-AS? (158) ----> To where DO you go? ( CF: Where ARE you GOING? )
Chu kredas, ke mia koro flam-AS? (110) ----> Do believe, that my heart burn-S?
( CF: Do you believe that my heart IS BURNING? )

3) 此多彼一

世界语形动词或副动词的各种形式 --------> 英语分词的相应形式.

-ANTA 和 -ANTE ----> -ING ; -INTA 和 -INTE ----> HAVING+过去分词 ;
-OTA 和 -OTE ----> TO BE+过去分词; 等等.

[例] KURANTE sur la strato, li falis. (091) ----> RUNNING on the street, he fell.

Laboristoj estas KONSTRUANTAJ fabrikon. (015)
----> Workers are BUILDING factory.

这种情况好办. 世界语形态比较丰富, 而现代英语形态不发达, 所以世英形态转换中最经常出现的, 就是此多彼一或此有彼无的情形, 这对建立比较完全的EChA英语形态生成(加尾)算法是很有利的条件.

4) 此有彼无

世界语将来将来时 ( ESTOS x-ONTA(J) ) --------> 英语 ?

[例] Mi ESTOS LEGONTA la libron kiam shi venos. (023)
----> I WILL ( 或: WILL BE GOING TO ) read the book when she comes.

这种情况看上去似乎很不利, 实际上并不难处理. 因为现今存在的各种语言, 作为人们千百年来交流思想的工具, 一般都能够表达各种细微的语义差别. 虽然乙语言也许缺乏甲语言的某个特定的表达手段, 但如果必要, 它总可以找到代替的表达方式. 如上例 ESTOS LEGONTA 通常译作 WILL READ 已经足够, 如果一定要强调将来的将来, 也不妨译作 WILL BE GOING TO READ 这样繁冗的形式. 再如汉语缺乏形态, 但如果需要, 总可以用适当的助词或副词等来代替, 这就是所谓的形态修辞.

5) 此无彼有

世界语 ? --------> 英语完成进行时

[例] Mi atend-AS vin chi tie du horojn.
----> I HAVE BEEN WAITING here for you for two hours.
CF: I WAIT here for you for two hours.
I AM WAITING here for you for two hours.

此所无彼所有的, 如果在彼也是可有可无的, 或并不太影响语义, 那还好办, 如上例. 再如, 英语的不定冠词, 世界语就没有, EChA对此干脆不管, 也没造成严重的后果, 只是译文显得有些不顺: Is your friend (*) doctor? (039) This is (*) green star, and that is (*) red star. (152) ( * 处本应有不定冠词 A ) 最头痛的是此所无彼必有. 从完全没有冠词的语言(如汉语和俄语)译入有冠词的语言在很多情况下就是这样.

上述归纳在机器翻译的转换生成中具有普遍意义. 最困难的是此一彼多和此所无彼必有两种情况, 一般要通过精密的句法和语义的对比和分析来解决. 比如通过分析不定式所直接联系的英语轴心词的句型特征, 就可以决定该不定式采用带 TO 还是不带 TO 的形式. 实在不得已, 只好把几种可能的选择同时打印出来, 由用户自己决定----这当然是权宜之计, 但常常比编制一套不可靠的区分规则, 客观上更有利一些. 机器模拟人的智能, 在一定的阶段总还有某些局限. 上面的做法, 实际上就是把机器暂时还不具有的智能, 交还给人发挥, 特别是那些很难形式化, 但人凭经验和直感却很容易判断的部分. 然而, 人工智能的使命决定了, 人们应该尽最大努力提高机器智能化程度. 条件允许却不去努力是设计者的懒惰和失职.

在EChA形态生成一线, 还有词典化了的多义区分程序段(它在形态生成前执行), 用BASIC写起来很容易. 现举例介绍如下:

1) LUDI 玩 / 打(各类球) / 拉(提琴, 胡琴) / 弹(钢琴) / 吹(口琴)

2120 IF VT$(GC)<>"1" THEN 2160
( 若该词不及物则保留词典基本义项 "玩", 该词多义区分毕, 转2160. )

2130 IF HY$(ZC)="胡琴" OR RIGHT$(HY$(ZC),4)="提琴" THEN HY$(GC)="拉": GOTO 2160
( 若找到词为 "胡琴", 或找到词的后两字为 "提琴" (包括大提琴,小提琴,中音提琴等), 则该词取汉义 "拉", 该词毕, 转2160. )

2140 IF HY$(ZC)="钢琴" THEN HY$(GC)="弹": GOTO 2160
2145 IF HY$(ZC)="口琴" THEN HY$(GC)="吹": GOTO 2160
2150 IF RIGHT$(HY$(ZC),2)="球" THEN HY$(GC)="打"
2160 GC=GC+1: GOTO 1830 ( 放过该词, 取后一词, 转1830. )

2) BATI 打 / (心)跳动

1990 IF VT$(GC)="1" AND (RIGHT$(HY$(ZC),2)="心" OR HY$(ZC)="心脏") THEN HY$(GC)="跳动"
2000 GOTO 2160

3) OKAZI 进行 / 发生 / 召开

2450 IF RIGHT$(HY$(ZC),2)="事" THEN HY$(GC)="发生":GOTO 2160
2460 IF RIGHT$(HY$(ZC),2)="会" THEN HY$(GC)="召开":YY$(GC)="BE HELD": YTZ$(GC)="8": XX$(GC)="1"
2470 GOTO 2160

4) RIGARDI: LOOK AT / LOOK / WATCH (TV) / SEE (FILM)

2830 IF VT$(GC)<>"1" THEN YY$(GC)="LOOK": GOTO 2160
2840 IF YY$(ZC)="TELEVISION" OR YY$(ZC)="TV" THEN YY$(GC)="WATCH": GOTO 2160
2850 IF YY$(ZC)="FILM" THEN YY$(GC)="SEE": YTZ$(GC)="1"
2860 GOTO 2160

5) NENIAM 从不 / 从未

3070 IF ST$(ZC)="2" THEN HY$(GC)="从未": HY$(ZC)=HY$(ZC)+"过": JG$(ZC)="9"
3080 GOTO 2160
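For readers less used to line-numbered BASIC, the LUDI rules above can be restated as a small Python function. This is a sketch that assumes the first listing is read correctly: `transitive` stands in for the test VT$(GC)="1" and `object_gloss` for HY$(ZC), the Chinese gloss of the word that was found. The function name is hypothetical.

```python
def choose_ludi_sense(transitive: bool, object_gloss: str) -> str:
    """Pick the Chinese sense of LUDI from the gloss of its object."""
    if not transitive:
        return "玩"  # intransitive: keep the basic dictionary sense
    if object_gloss == "胡琴" or object_gloss.endswith("提琴"):
        return "拉"  # bowed instruments, including 大提琴 and 小提琴
    if object_gloss == "钢琴":
        return "弹"
    if object_gloss == "口琴":
        return "吹"
    if object_gloss.endswith("球"):
        return "打"  # all kinds of ball games
    return "玩"  # no rule fired: the basic sense stays
```

The BASIC listing tests the last bytes of a double-byte string with RIGHT$, whereas `str.endswith` works on characters, so the byte counts 2 and 4 have no counterpart here.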


立委硕士论文：世界语句法分析(6&7)


世界语句法分析(1): 虚词处理

虚词分析是世界语句法分析中最困难的部分. EChA的策略是分而治之, 各个击破. 每一个虚词的分析规则自成一体, 互相独立, 这样在充实或改进某一具体虚词的规则时, 便不致于影响其他虚词的规则, 这也就是规则和规则分开吧.[9] 语言规则和算法程序应该分开, 大家已经说了许多, 而规则和规则分开, 似乎还没有引起足够的重视. (不是指所有规则都分开: 具有普遍意义的抽象语法规则集合, 作为系统对于该语言充分形式化的逻辑描述, 是自动分析的枢纽, 本身就是一个可以做的很美的统一整体, 谈不上分开. (参考EChA句法分析第二线, 见第7节.) 一个优良的系统应该既能分得开, 又能合得拢.) 我们认为, 规则和规则分开, 对于研制实用性机译系统具有决定性意义. 没有什么系统从一开始研制就可以足够完善, 所以是否容易扩充和改进, 在很大程度上决定了一个系统的前途. 规则和算法分开, 固然大大增强了系统的扩充能力, 并且便于语言工作者和软件工作者充分合作. 但这还不够. 如果能实现规则和规则分开, 不但有利于遵循具体问题具体分析原则, 去解决语言这种特别复杂的现象中的许多个性问题, 从而大大提高翻译质量, 而且也为语言工作者和语言工作者的协作, 创造了必要的条件----这种协作, 对于研制大型实用系统是必不可少的.

规则和规则分开的主要方式是: 1) 词典语法化: 以词为基本单位, 把关于该词的各种用法及其分析规则, 以数据的形式写入词典(它建在外存贮器上). 这样的机器词典, 形式上很类似于我们案头的词典工具书, 如牛津, 韦式, LONGMAN等, 而且也较容易借鉴已有的这些词典的研究成果. 我们建议首先把虚词和动词的条目语法化. 2) 语法词典化: 在编写句法分析或综合程序(它在内存贮器中)时, 把规则落实到具体词或小类上, 并使这些规则独立开来. 这两种方法形式有别, 实质是一样的. 我们在EChA中采用的是第二种方法. (参见EChA虚词分析部分和EChA综合部分的多义词区分规则.) 说到底, EChA分析第一线不过是一个带有分析规则的虚词大词典.

当然, 应该指出, 规则和规则分开, 必然使规则量成倍增长. 然而, 由于边界分明, 这种增长并不影响系统结构上的逻辑清晰性, 这跟以前语言和算法, 规则和规则都没分开时的情形大不相同, 那时的规则无限膨胀, 只能致使系统最终报废. 不过规则量的增长, 涉及到机器的存贮容量问题. 但这实际上也不成问题, 因为现在的机器对于存贮节省的要求, 已经不是那么苛刻了. 即便是微型机, 中高挡的内存容量就能达到, 或很容易扩充到四兆到八兆字节. 值得强调的是, 规则量的增长, 一般并不影响系统的工作效率, 因为规则是附在具体的词或小类下, 只有所译文句出现了某词, 才会入该词一线.

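In EChA's function-word analysis line, we put the sense disambiguation of function words, and even some target-language rhetoric tied to the properties of particular function words, all together into the analysis rules of the individual function word. This is clearly simple and practical, and it greatly eases the burden on synthesis. But it is precisely here that EChA departs from the principle we strongly endorse, the independence of analysis and synthesis. For now we cannot think of a better or more reasonable approach. Still, our reasons for advocating independent analysis come down to two: 1) to deepen the analysis and thereby raise translation quality; 2) to let one independent analysis result serve synthesis into several languages. Carrying out the analysis and synthesis of function words together helps raise translation quality; and because function words are few and their analysis rules are mutually independent, extending those rules when a new target language is added will not be very difficult, still less affect the backbone of the system. Our current practice is therefore justified and does not go against our aims.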

世界语句法分析(2)

分析第(2)线与目标语综合充分独立, 逻辑性强, 是一个相当完整的语言分析模型. 它由一个主程序和几个以动词分析算法为核心的环环相扣的子程序构成. 主程序主要用来确定各语段的范围(前限后限)及其加工次序, 为它们进入动词子程序做好准备. 它必须对各种类型的世界语文句作出正确, 合理的处理, 才能保证系统的充分概括性和适应性. 从各类文句的试验结果看, EChA相当好地做到了这一点.

我们把世界语文句的类型归纳如下:

1.无谓句. 如:

Kia belega pejzagho ! (041) / What beautiful scenery ! 多么绝美的景色!

2.谓语句:

1) 简单句: 全句只有一个谓语. 如: Skribu klare ! (033) / Write clearly ! 写清楚!

2) 扩展的简单句: 全句至少有两个谓语, 但只有一个主句, 从句跟主句(以主轴心为代表)没有直接联系, 即从句处于2层以外 ( 其层号 >= 3 ). 这类从句往往是定语从句或同位语从句. 如:

La homon , pri kiu vi parolas , mi neniam vidis . (131)
The man(宾), about whom you speak , I never saw .
我从未见过你提到的人.

3) 主从句: 全句至少有两个谓语, 但只有一个主句, 从句跟主句发生直接联系. 如:

Se mi partoprenus en via amuza aktivado , mi estus tre ghoja . (050)
If I should take part in your recreational activity , I would be very glad .
如果我参加你们的文娱活动, 我会是很高兴的.

4) 并列句: 全句至少有两个谓语, 同时也至少有两个有并列关系的分句, 并且其中一个是主轴心. 如:

Mi miras , timas , tremas . (074)
I wonder, fear, tremble.
我惊奇, 害怕, 颤抖.

5) 交错句: 以上四类句子交错组合而成的复杂句. 如本文第3节举的例句(004)就是.

EChA在对付这些不同类型的句子时, 能够把复杂的句子分解成简单的句子处理. 分析程序首先查找从句. 如果查到, 先入并列从句子程序分解(若是光杆从句就放过, 返主), 然后确定每一个从句的前后限, 入动词子程序加工. 加工完毕, 做绝对放过标志. 所有从句处理完毕, 再行主句加工. 这时候, 句子呈或者简单句, 或者并列句的形式.

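In Esperanto, a relative clause that is echoed by a corresponding T-correlative is an appositive clause. When the T-correlative in the main clause is omitted, however, the clause becomes identical in form to an interrogative nominal clause, which makes recognition harder. The present system leaves this case aside for now. Such omission may look terse (it is common in proverbs and aphorisms), but it should not be encouraged, since even humans, especially speakers of non-Indo-European languages, often find it hard to understand.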

[例] Bone ridas , KIU laste ridas .
Well smiles, WHO smiles at last.
谁笑得最后, 笑得最好.

KIO pasis , ne revenos .
WHAT passed, will not return.
时不再来. (一去不复返.)

CF: Nur TIU ne eraras, KIU neniam ion faras.(151)
Only THAT PERSON is not wrong, WHO never does something.
仅仅从不做某事的那个人不犯错误.

第二线的关键是动词子程序的建立. (这儿所谓动词包括谓语动词, 形动词, 副动词和不定式, 但不包括-ADO词, 因为世界语的-ADO词已经完全名词化了, 不再具有动词的特性.) 如果说先从句后主句的加工过程, 实际上是自下而上的方法, 那么动词算法的路径正好反过来, 是自上而下. 动词子程序首先设三个开关. 一是检验是否可以构成动词短语 VP. 若不能, 如独词句及光杆的形动词, 副动词或不定式, 则给该词节点信息 J (终结节点), 该词加工完毕, 退出. 二是检验该词是否系词, 若是, 转系词子程序作适当处理, 再回动词子程序递归加工. 这是因为系动词有其特殊性, 比如一般动词谓语简单句, 只可能有一个前面没有介词的普通格名词(它当然是主语), 而系词谓语句却可以有两个(一主一表), 因而不能直接入动词子程序. 最后一个开关检验该动词短语是否扩展的 VP, 若不是, 即行分析. 扩展的 VP 定义为该动词的间接成分层中(所谓间接成分层是指其层号 >= 动词轴心的层号 + 2 的层次), 至少又包含一个 VP. 对于扩展的动词短语, 运用栈技术作递归加工. 这样动词子程序真正的加工单位便是不扩展的各类 VP (简单句, 形动词短语, 副动词短语, 不定式短语). 动词子程序在工作期间, 常常需要调用其他子程序. 各子程序间的逻辑关系是十分清楚的.

名词子程序也要设开关. 扩展的 NP 定义为带有至少一个 VP 的 NP, 它必须回动词子程序递归加工.

对于不扩展的动词短语, 一般来说加工次序如下:

丨动词子程序丨--------丨名词子程序丨------丨形容词子程序丨----丨副词子程序丨

这形象地体现了 "自顶而下" 的分析思想.

试验表明, EChA的两线分析程序, 一具体一抽象, 一个对付个性一个对付共性, 一个面向虚词一个面向实词, 一个尽量使句法分析词典化, 一个则努力使分析过程逻辑化, 二者相互配合, 很有效地实现了各类世界语文句的自动分析. EChA输出的中间结果158条CDC链中只发现一处分析错误. 它出现在第一首诗歌 "LA ESPERO" 的第三句:

Ne al glavo sangonsoifanta , ghi LA HOMAN tiras FAMILION . (102)
Not to sword bloodthirsty , it THE MAN'S (目的格) pulls FAMILY (目的格).

为了节奏和韵律的关系, 作者把形容词修饰语与其轴心词分开了(当然仍同格同数), 中间插进一个动词谓语. 于是系统误把二者都看作是动词谓语的宾语, 因为 "冠词+形容词" (后不跟名词) 结构一般总是代替 NP 的, 所以EChA也就这样分析了. 幸运的是, 这一分析错误没有导致译文错误, 因为中英文综合都把前置宾语移至动词轴心之后, 客观上恢复了修饰语与其中心词的正常词序, 当然这只是巧合.

_____________________________________________________________________

附注: [9] 这儿关于规则和规则分开的讨论, 很大程度上得益于与刘倬老师的几次谈话.


立委硕士论文：世界语形态分析 (5)


世界语形态分析

源语文句分析大体可以分形态分析和句法分析两大类. 前者研究的对象小于等于词, 而后者的对象大于等于词(句素). 分析的终极目的就是求解词的正确的CDC成分. 本节先讨论形态分析问题. 我们把构词分析的讨论也放在这一节.

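The core of Esperanto morphological analysis is building an ending-stripping algorithm. Esperanto has no morphological homographs, so once the ending is stripped correctly, morphological analysis is complete. EChA's stripping algorithm is given below. The algorithm is fairly complete and sound, and fully meets the needs of a practical system for the automatic analysis of Esperanto.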

世界语削尾算法

(1) 若该词最末字母为 "-O" 取 "名词 / 普通格 / 单数" 的结论, 该词削尾后查实词词干词典, 转下一步(2), 否则步骤(12).

(2) 若查词典成功, 取词典信息到加工场, 该词加工完毕, 否则下一步(3).

(3) 若该词最末二字母为 "-AD" 取 "AD词" 的结论, 该词削尾后查实词词干词典, 转下一步(4), 否则步骤(5).

(4) 若查词典成功, 取词典信息到加工场, 该词加工完毕, 否则步骤(11).

(5) 若该词最末三字母为 "-ANT" 取 "分词 / 进行式 / 主动式" 的结论, 该词削尾后查实词词干词典, 转步骤(4), 否则下一步(6).

(6) 若该词最末三字母为 "-INT" 取 "分词 / 完成式 / 主动式" 的结论, 该词削尾后查实词词干词典, 转步骤(4), 否则下一步(7).

(7) 若该词最末三字母为 "-ONT" 取 "分词 / 将来式 / 主动式" 的结论, 该词削尾后查实词词干词典, 转步骤(4), 否则下一步(8).

(8) 若该词最末二字母为 "-AT" 取 "分词 / 进行式 / 被动式" 的结论, 该词削尾后查实词词干词典, 转步骤(4), 否则下一步(9).

(9) 若该词最末二字母为 "-IT" 取 "分词 / 完成式 / 被动式" 的结论, 该词削尾后查实词词干词典, 转步骤(4), 否则下一步(10).

(10) 若该词最末二字母为 "-OT" 取 "分词 / 将来式 / 被动式" 的结论, 该词削尾后查实词词干词典, 转步骤(4), 否则下一步(11).

(11) 该词取 "生词" 的结论, 保留削尾结论, 在加工场的目标语语义项里复制该词, 该词加工完毕.

(12) 若该词最末字母为 "-'" 取 "名词 / 普通格 / 单数" 的结论, 该词削尾后查实词词干词典, 转步骤(2), 否则下一步(13).

(13) 若该词最末字母为 "-A" 取 "形容词 / 普通格 / 单数" 的结论, 该词削尾后查实词词干词典, 转步骤(2), 否则下一步(14).

(14) 若该词最末字母为 "-E" 取 "副词 / 普通格" 的结论, 该词削尾后查实词词干词典, 转步骤(2), 否则下一步(15).

(15) 若该词最末字母为 "-J" 取 "普通格 / 复数" 的结论, 该词削尾后转下一步(16), 否则步骤(18).

(16) 若该词最末字母为 "-O" 取 "名词" 的结论, 该词削尾后查实词词干词典, 转步骤(2), 否则下一步(17).

(17) 若该词最末字母为 "-A" 取 "形容词" 的结论, 该词削尾后查实词词干词典, 转步骤(2), 否则步骤(11).

(18) 若该词最末字母为 "-N" 取 "目的格" 的结论, 该词削尾后转下一步(19), 否则步骤(23).

(19) 若该词最末字母为 "-J" 取 "复数" 的结论, 该词削尾后转步骤(16), 否则下一步(20).

(20) 若该词最末字母为 "-O" 取 "名词 / 单数" 的结论, 该词削尾后查实词词干词典, 转步骤(2), 否则下一步(21).

(21) 若该词最末字母为 "-A" 取 "形容词 / 单数" 的结论, 该词削尾后查实词词干词典, 转步骤(2), 否则下一步(22).

(22) 若该词最末字母为 "-E" 取 "副词" 的结论, 该词削尾后查实词词干词典, 转步骤(2), 否则步骤(11).

(23) 若该词最末字母为 "-S" 转下一步(24), 否则转步骤(30).

(24) 若该词最末二字母为 "-AS" 取 "现在时" 的结论, 该词削尾后转步骤(28), 否则下一步(25).

(25) 若该词最末二字母为 "-IS" 取 "过去时" 的结论, 该词削尾后转步骤(28), 否则下一步(26).

(26) 若该词最末二字母为 "-OS" 取 "将来时" 的结论, 该词削尾后转步骤(28), 否则下一步(27).

(27) 若该词最末二字母为 "-US" 取 "虚拟式" 的结论, 该词削尾后转步骤(29), 否则步骤(32).

(28) 取 "陈述式" 的结论, 转下一步(29).

(29) 取 "动词 / 谓语 / 主动语态" 的结论, 查实词词干词典, 转步骤(2).

(30) 若该词最末字母为 "-I" 取 "动词 / 不定式" 的结论, 该词削尾后查实词词干词典, 转步骤(2), 否则下一步(31).

(31) 若该词最末字母为 "-U" 取 "命令式" 的结论, 该词削尾后转步骤(29), 否则下一步(32).

(32) 查虚词词典(因该词无尾可削). 若成功取词典信息到加工场, 该词加工完毕, 否则取 "名词 / 专有名词" 的结论, 返回步骤(11).

[注] 世界语基本法规第16条说: "名词和冠词末尾的元音字母可以省略, 用省略号 ' 来代替". 这种现象多出现在诗歌里, 如 MOND'(103). 我们在步骤(12)对它作了处理(冠词是长度小于 3 的虚词, 直接查虚词词典, 不入削尾一线, 故不予考虑).
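The 32 steps above compress into a short Python sketch. It is a restatement under assumptions, not EChA's BASIC code: the two miniature dictionaries, the English feature labels and all function names are hypothetical, and the default features "common case / singular" are left implicit instead of being listed.

```python
# Hypothetical miniature dictionaries; the real EChA dictionaries are not shown.
STEMS = {"libr", "leg", "kur", "mond", "bon"}
FUNCTION_WORDS = {"la", "mi", "sed"}

# Steps (3) and (5)-(10): suffixes tried, in this order, when a stem lookup fails.
PARTICIPLE_SUFFIXES = [
    ("ad", ["AD-word"]),
    ("ant", ["participle", "progressive", "active"]),
    ("int", ["participle", "perfect", "active"]),
    ("ont", ["participle", "future", "active"]),
    ("at", ["participle", "progressive", "passive"]),
    ("it", ["participle", "perfect", "passive"]),
    ("ot", ["participle", "future", "passive"]),
]
# Steps (24)-(28): finite verb endings.
FINITE_ENDINGS = {
    "as": ["indicative", "present"],
    "is": ["indicative", "past"],
    "os": ["indicative", "future"],
    "us": ["subjunctive"],
}
NOMINAL = {"o": "noun", "'": "noun", "a": "adjective", "e": "adverb"}


def lookup(stem, feats):
    """Steps (2)-(11): look the stem up; on failure try one participle suffix."""
    if stem in STEMS:
        return {"stem": stem, "feats": feats, "found": True}
    for suffix, extra in PARTICIPLE_SUFFIXES:
        if stem.endswith(suffix):
            inner = stem[: -len(suffix)]
            return {"stem": inner, "feats": feats + extra, "found": inner in STEMS}
    return {"stem": stem, "feats": feats, "found": False}  # step (11): unknown word


def lookup_function_word(word):
    """Step (32): nothing to strip, so consult the function-word dictionary."""
    if word in FUNCTION_WORDS:
        return {"stem": word, "feats": ["function word"], "found": True}
    return {"stem": word, "feats": ["noun", "proper noun"], "found": False}


def analyse(word):
    w = word.lower()
    if len(w) < 3:  # per the note: very short function words bypass stripping
        return lookup_function_word(w)
    rest, feats = w, []
    if rest.endswith("n"):  # step (18): accusative
        rest, feats = rest[:-1], ["accusative"]
    if rest.endswith("j"):  # steps (15) and (19): plural
        rest, feats = rest[:-1], feats + ["plural"]
    # Which part-of-speech endings may follow depends on what was stripped.
    allowed = "oa" if "plural" in feats else ("oae" if feats else "o'ae")
    if rest[-1] in allowed:
        return lookup(rest[:-1], [NOMINAL[rest[-1]]] + feats)
    if feats:  # -N or -J stripped but no nominal ending follows: step (11)
        return {"stem": rest, "feats": feats, "found": False}
    for ending, extra in FINITE_ENDINGS.items():
        if w.endswith(ending):
            return lookup(w[:-2], ["verb", "predicate", "active"] + extra)
    if w.endswith("i"):  # step (30): infinitive
        return lookup(w[:-1], ["verb", "infinitive"])
    if w.endswith("u"):  # step (31): imperative
        return lookup(w[:-1], ["verb", "predicate", "active", "imperative"])
    return lookup_function_word(w)
```

With these toy dictionaries, `analyse("librojn")` strips -N, -J and -O in turn and finds the stem `libr`, while `analyse("kurante")` fails on `kurant`, strips -ANT and finds `kur`. Like the listed steps, the sketch tests endings before the function-word dictionary, so a function word of three or more letters that happens to end in a grammatical letter would need to be intercepted earlier.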

我们谈谈构词分析问题, 这包括两个方面: 1. 关于建立削缀算法(派生词处理)的讨论; 2. 关于拆离合成词的讨论. 在现行的EChA系统中, 这两个问题都回避了. 我们建立的词典, 是以词干(包括合成词词干)作存贮单位的, 加工词只要削去语法词尾, 就可以查到. 但是, 应该指出, 这样做, 对于世界语这种构词特别灵活的语言并不合理. 以词干存词, 在做小型实验时还可应付, 如果是实用系统, 就会出现存不胜存的情况. 我们主张实词词典既存词根也存词干, 同时建立一个完全的世界语削缀算法和合成词拆离算法, 以便对付生词. (世界语除国际性的专业词汇外, 基本词根很有限. 所谓生词, 一般都是由基本词根及几十个词缀随机组合的派生词或合成词. 因此, 只要切分正确, 生词便不 "生".)

世界语后缀可以叠加(理论上无限), 但前缀通常只能有一个. 这样词典一线的加工路径应该是:

[Figure missing in the source: processing path of the dictionary line]

削缀与削尾不同, 并非有缀必削. 对于削尾, 机器是先削后查, 而对于削缀, 则是先查词典, 查不着的生词再去削缀. 这样处理便于我们根据设计要求(实验型还是实用型, 对于翻译速度, 质量, 成本的要求等等)和机器条件(内存容量, 运算速度等)决定实词词典收词干的标准.

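Today, with advances in computer technology, machines are growing ever more capable (in storage and speed) while costs fall sharply. Some in the machine translation community therefore now advocate storage units that are large rather than small (for example, the proposal to include as many idioms as possible[7]), using mass storage and fast lookup to lighten the burden of analysis. This is a perceptive view. The larger the unit, the greater the certainty, the lower the demands on analysis and synthesis (machine intelligence), the easier the development, and the better the translation. Since machine translation is a highly practical discipline, this position is all the more valuable. Of course, bigger is not always better, because with each step up in unit size (from root to stem, stem to word, word to phrase, phrase to sentence) the number of possible combinations grows exponentially.[8] Taken to the extreme, with the sentence as the storage unit, no analysis or synthesis is needed at all: the translation is simply looked up and output. The degree of artificial intelligence is then zero, yet translation quality can be optimal (taking human quality as the optimum). Unfortunately, however advanced the hardware, storage capacity and lookup speed are always finite and cannot cope with an infinite set of sentences. (For special needs within a limited domain the method is feasible, as in a tourist translation device. Does that still count as machine translation? It should, only not machine translation in the artificial-intelligence sense.) The opposite pole takes the morpheme (root, affix, ending) as the unit of analysis. It needs the smallest dictionary (roots only) and the highest level of artificial intelligence: not only syntactic analysis and synthesis but also word-formation analysis and synthesis. Yet for all that effort, quality is the least assured, because a sentence broken into too many small pieces (source analysis) inevitably shows some ugly seams when it is put back together (target synthesis). Existing machine translation systems therefore generally settle on some middle value between the two poles, according to the concrete conditions and the designer's views. We believe a good practical system should be able to do both: analyze thoroughly, and also handle common phrases (idioms) as unanalyzed wholes, going fine where fineness is called for and coarse where coarseness is called for. In general, a coarser treatment works better for common, fixed, idiosyncratic phenomena that can be enumerated, while regular, productive phenomena call for finer analysis. For a practical machine translation system that takes Esperanto as its object of analysis, we therefore both advocate including as many idioms and affixed stems as possible and fully affirm the need for a complete affix-stripping algorithm.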

那么, 世界语实词词典收多少派生词词干比较合理呢? 对于独立型机器翻译:

(1) 如果是小型实验系统, 目的是在有限的材料内试验系统的句法分析和综合能力, 那就词干全收; 否则:

(2) 凡是常用的派生词词干一律收进词典, 而不再入削缀子程序----常用性(出现频率高)是根本标准;

(3) 有助于区别同形多义的派生词词干, 应该收;

(4) 可收可不收的, 主张收;

(5) 在刚开始设计实用系统的机器词典时, 由于世界语词缀的极端灵活性和随机性, 很难一次收入许多带缀的词干, 这样, 削缀算法就显得更重要. 削下缀来, 虽然表义不是很确切, 甚至有时在目标语综合时, 还需要辅以说明性注释(见后面例释), 但总比直接打出生词来(信息量为零)强出百倍. 随着系统的不断扩充和完善, 收的词干自然会越来越多.

如果是具有特定的目标语的相关型机器翻译:

(1) 收多少派生词词干应该考虑目标语的构词特点及词汇状况;

(2) 在目标语中作为一个完整概念, 而不是词根和词缀意义简单相加所能反映的词干, 应该收入词典. 如: DOM-EGO 楼房, 大厦 (而不是一般的 "大-房子" );

(3) 如果以汉语为目标语, 削缀更多一些, 因为世汉构词法很相似, 汉族人的心理本能地习惯于理解词素与词素的组合. (这种民族偏爱心理在引进外来词时表现的很明显, 如 "德律风" 为 "电话" 取代, "莱塞" 为 "激光" 取代等.) 可以举出很多世汉构词神似的例子. 而且也有许多世界语派生词如 DOM-ACHO 虽然整个儿译作 "陋室" 更雅一些, 但也不妨用统一的削缀合成法组成新词 "鬼-房子", 与原义相去也不远. 特别是有些缀与汉字(词素)有很多一致性, 如 VIC-/副-, -IN-/女-, -EBL-/可- 等等, 就更有理由作削缀处理.

世汉构词对比例释(1): 派生词

(1) BO- 姻- : BO-PATRO 姻-父亲 (岳父或公公) , BO-FILO 姻-儿子 (女婿) , BO-FRATO 姻-兄弟 (内弟) ;

(2) GE- (男女)- : GE-AMIKOJ (男女)-朋友们 , GE-KAMARADOJ (男女)-同志们 , GE-AKTOROJ (男女)-演员们 ;

(3) EKS- 前- : EKS-OFICISTO 前-职员 , EKS-MINISTRO 前-部长 , EKS-INSTRUISTO 前-教师 ;

(4) MAL- [反义] : MAL-BONA [反义]好 (坏) , MAL-AMIKO [反义]朋友 (敌人) , MAL-SAGHE [反义]聪明 (愚笨) ;

[说明] MAL-是世界语中用得最广, 随机性最强的前缀之一, 具有极强的造词能力, 可惜, 中文没有对应的词素. 如果系统遇到某个MAL-型生词, 削下前缀后给出[反义]这样的说明性标识, 也还可以使人理解.

(5) VIC- 副- : VIC-PREZIDANTO 副-主席 , VIC-ESTRO 副-队长 , VIC-CHEFMINISTRO 副-总理 ;

(6) FI- 坏- : FI-INSEKTO 坏-虫 , FI-KOMERCISTO 坏-商人 (奸商) , FI-KUTIMO 坏-习惯 (恶习) ;

(7) SEN- 1. 若词根逻辑类为名词则 "无-" : SEN-GUSTA 无-味的 , SEN-SENCA 无-意义的 ;

2. 若词根逻辑类为动词则 "不-" : SEN-MORTA 不-死的 (不朽的) , SEN-ATENTA 不-注意的 ;

(8) NE- 若词根逻辑类为名词则 "非-" 否则 "不-" : NE-ESPERANTISTO 非-世界语者 , NE-BONA 不-好的 ;

(9) 介词性前缀: 1. SUR- -上: SUR-TABLE 桌子-上 ; 2. APUD- -旁: APUD-VOJA 路-旁的 ;

3. EN- -内: EN-LANDE 国-内 ; 4. LAU- 按-: LAU-VICE 按-次序 ; 5. DE- 从-: DE-NOVE 从-新 ;

(10) -ACH- 鬼- : DOM-ACHO 鬼-房子 (陋室) , KNAB-ACHO 鬼-男孩 (捣蛋鬼) , VETER-ACHO 鬼-天气 ;

(11) -AN- -成员 : KLUB-ANO 俱乐部-成员 , KURS-ANO 讲习班-成员 , KOMUNUM-ANO 公社-成员 ;

(12) -UL- -者 : BON-ULO 好-者 , KAR-ULO 亲爱-者 , JUN-ULO 年青-者 , LONG-KRUR-ULO 长/腿-者 ;

(13) -IN- 女- : KAMARAD-INO 女-同志 , INSTRUIST-INO 女-教师 , OFICIST-INO 女-职员 , AKTOR-INO 女-演员 ;

(14) -EBL- 可- : VID-EBLA 可-见的 , MANGH-EBLA 可-吃的 , UZ-EBLA 可-用的 , NE-ATING-EBLA 不-可-达到的 ;

(15) -EC- -性 : CERT-ECO 确实-性 , NECES-ECO 必要-性 , KLAR-ECO 清楚-性 , LIBER-ECO 自由-性 ;

(16) -EM- 爱- : LABOR-EMA 爱-工作的 (勤劳的) , PAROL-EMA 爱-说话的 , MENSOG-EMA 爱-撒谎的 ;

(17) -IND- 值得- : LERN-INDA 值得-学习的 , LAUD-INDE 值得-称赞 , LEG-INDA 值得-读的 , AM-INDA 值得-爱的 ;

(18) -ON- 1. 若 -ONO 则 "-分之一": DU-ONO 二-分之一 , TRI-ONO 三-分之一 , KVAR-ONO 四-分之一 ;

若 X+Y-ONOJ 则 "Y-分之X": TRI DEK-ONOJ 十-分之三 , KVIN OK-ONOJ 八-分之五 .
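The derivational examples above suggest how a "look up first, strip affixes only for unknown stems" routine could compose a Chinese gloss. The sketch below is an assumption-laden illustration, not the algorithm EChA implemented (the text says affix stripping was left out of the current system). The root table is hypothetical, the affix glosses are copied from the examples, and NE- is simplified to "不-" although the text gives "非-" for noun roots.

```python
# Hypothetical root glosses; affix glosses come from the examples above.
ROOTS = {"bon": "好", "amik": "朋友", "vid": "见", "ating": "达到",
         "dom": "房子", "kamarad": "同志"}
# NE- is simplified to "不-"; the text gives "非-" for noun roots.
PREFIXES = {"mal": "[反义]", "vic": "副-", "eks": "前-", "ne": "不-"}
# suffix -> (gloss, True if the gloss precedes the root gloss in Chinese)
SUFFIXES = {"ebl": ("可-", True), "in": ("女-", True),
            "ach": ("鬼-", True), "an": ("-成员", False)}


def strip_suffixes(stem):
    """Look the stem up first; only if that fails, peel stacked suffixes."""
    if stem in ROOTS:
        return ROOTS[stem]
    for suffix, (gloss, before) in SUFFIXES.items():
        if stem.endswith(suffix) and len(stem) > len(suffix):
            inner = strip_suffixes(stem[: -len(suffix)])
            if inner is not None:
                return gloss + inner if before else inner + gloss
    return None


def gloss_stem(stem):
    """Gloss a stem, allowing any number of suffixes but at most one prefix."""
    gloss = strip_suffixes(stem)
    if gloss is not None:
        return gloss
    for prefix, text in PREFIXES.items():
        if stem.startswith(prefix) and len(stem) > len(prefix):
            inner = strip_suffixes(stem[len(prefix):])
            if inner is not None:
                return text + inner
    return None  # still unknown: would be copied to the output as a new word
```

For instance, `gloss_stem("neatingebl")` fails as a whole stem, strips NE-, then -EBL-, and returns "不-可-达到", matching the NE-ATING-EBLA example apart from the adjectival "的".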

合成词 ("词根+词根") 也是一样. 比较固定的, 应该整个儿存入词典, 随机组合的, 应该拆开. 但这儿有一个困难, 世界语语法为了方便使用者, 即便对完全随机组合的合成词, 也不作加连字符的规定. 那么怎么拆呢? 词根的数量与词缀不能比, 长度也变化很大, 一个字母一个字母地削查比较, 显然不是办法. 如果坚持不要译前编辑, 还找不到一个合理的解决办法. 目前可以考虑先对中间有连字符的合成词作拆词加工. 我们提倡除比较固定常用的合成词外, 世界语者在运用随机合成词时,为读者的省力和机器的识辨计加上连字符. 鉴于世界语构词法与汉语构词法惊人的一致(组合方式及其高度随机性都很类似), 对于世汉机器翻译这一倡议更加必要.

世汉构词对比例释(2): 合成词

(1) AKVO-FONTO 水/源 ; (2) VARM-ENERGIO 热/能 ; (3) ARBO-BRANCHO 树/枝 ; (4) VAPOR-SHIPO 汽/船 ;

(5) SURD-MUT-ULO 聋/哑-者 ; (6) BLANK-HARA 白/发的 ; (7) NUD-PIEDA 光/脚的 ; (8) FISH-KAPTI 捕/鱼
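The proposal to split only hyphenated compounds is easy to picture in code. The following is a minimal sketch with a hypothetical root table and function names. It keeps the source order of the parts, so a case such as FISH-KAPTI (捕/鱼), where Chinese reverses the order, would need an extra rule that is not attempted here.

```python
# Hypothetical root glosses taken from examples (1)-(3) above.
ROOTS = {"akv": "水", "font": "源", "varm": "热", "energi": "能",
         "arb": "树", "branch": "枝"}


def root_of(part):
    """Drop a final grammatical vowel, if any, to get the bare root."""
    return part[:-1] if part.endswith(("o", "a", "e", "i")) else part


def split_compound(word):
    """Gloss a hyphenated compound part by part; return None if not possible."""
    w = word.lower()
    if "-" not in w:
        return None  # unhyphenated compounds are left alone, as proposed
    glosses = [ROOTS.get(root_of(part)) for part in w.split("-")]
    if None in glosses:
        return None
    return "/".join(glosses)
```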

______________________________________________________________

附注: [7] 参见:

刘涌泉 <<中国的机器翻译>> ( <<情报科学>> 1980, 3 )

王广义 <<机器翻译中的固定词组和固定结构问题>> ( <<语言和计算机>> (1), 1982 )

[8] 参看: 叶蜚声, 徐通锵 <<语言学纲要>> 第二章第二节 " 1. 语言的层级体系", PP.34-36 ( 北京大学出版社, 1981 )


立委硕士论文：EChA机器词典及词表 (4)


EChA机器词典及词表

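All EChA dictionaries and word tables are random-access data files, each with its own set of peripheral maintenance programs for modification and extension, which makes the system easy to improve. The definitions of the dictionaries and tables are introduced in turn below.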

1) 实词词干词典
格式:
词干丨逻辑类丨及物性丨带不定式丨支配词丨支配词汉义码丨汉义丨汉义特征丨英义丨
英义特征丨语义特征丨词类词义区分表记录号丨备用项丨

<逻辑类>::= { N, V, A, F, P, C, K, T, R, S, W, E, D, X }

N=名词 , V=动词 , A=形容词 , F=副词 , P=介词 , C=连词或标点 , K=K类相关词 ,
T=T类相关词 , R=其他相关词 , S=数词 , W=人称代词 , E=系词 , D=冠词 , X=万能词

[说明] 逻辑类用来表明词的静态词性. 世界语实词的语法词性是动态随机的, 只能由削尾决定. 但每个词一般具有一个基本词性, 这是单词的深层的逻辑特征. 语法词性不过是由它通过加词尾派生的表层的句法特征.

<汉义特征>::= { "...以后", "...的", "使...", "把...", "给...", "...下", "...上", "...里", "...时",
多义词特征, 构成成语特征, ... }

[说明] 汉义特征揭示了该词汉义的结构特性, 也给出了汉语生成的修辞信息.

<英义特征>::= { 不规则变化特征, 双写特征, 形式不变特征, ... }

[说明] 英义特征给出该词的英语形态生成方式信息.

<支配词汉义>::= { 零义, "给", "以", "到", ... }

[说明] 支配词汉义标示该词所支配的词(通常是介词)的汉义.

<语义特征>::= { HM, LK, TM, FX, ... }

HM=人类特征, LK=地点特征, TM=时间特征, FX=方向特征

2) 虚词词典

虚词词典除包含实词词典的各项信息外, 还揭示了部分CDC信息, 如词性, 格, 数, 关系, 分布, 节点等. 分析之前就能在词典里给出某些动态信息, 这是由虚词特点决定的. 例如: 介词永远处于非终结节点(节点"Y")上, 原副词和万能词一般是不扩展的, 所以总处于终结节点(节点"J")上. 万能词 ECH (EVEN) 永远位于其轴心词之前(分布"Q"). 原副词 JAM (ALREADY) 永远做状语(关系"F"). 从属连词 KE (THAT) 总是引导名词性从句(词类"K", 节点"K"), 而且总位于其轴心词之后(分布"H").

冠词LA永远做定语(关系"D"), 位于轴心词前(分布"Q"), 处于终结节点上(节点"J").

3) 成语词典

机器翻译界所谓的成语, 比其通常的意义要宽泛得多. 凡是常用的比较固定的词组都可收作成语. 世界语中纯粹的不可分析的习惯表达法较少, 所以成语词典容量相对不大. 成语词典的收词范围, 还在很大程度上决定于原语和译语的对比差异. 亲属关系相近的表达方法类似, 可以少收或不收成语. 在EChA中, 就没有设立世英成语词典, 只有一部世汉成语词典.

EChA成语例释:

MALFERMA(JN) AUTO(JN) ----- 敞蓬汽车 ( CF: OPEN CAR(S) )
SOMERA(JN) FERIO(JN) ----- 暑假 ( CF: SUMMER HOLIDAY(S) )
LA ANGLA(N) LINGVO(N) ---- 英语 ( CF: THE ENGLISH LANGUAGE )
INSTRUA(JN) LIBRO(JN) ---- 教科书 ( CF: TEACHING BOOK(S) )
LA GRANDA(N) MURO(N) ---- 长城 ( CF: THE GREAT WALL )
HOMA(N) SVARMO(N) ---- 人群 ( CF: MAN'S SWARM )
FACILA(N) VENTO(N) ---- 顺风 (CF: EASY WIND )

4) 词类词义区分表

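This table is essential for machine translation with Esperanto as the source language, and it greatly lightens the burden of sense disambiguation during synthesis. A word is entered in the table whenever its target-language sense changes with its part of speech and logical class and the change does not follow the regular morphological conversion. For example, MATEMATIK-A(JN) must be entered, whereas HOM-A(JN) need not be: the English for the former is MATHEMATICAL (not MATHEMATICS'), while the latter only requires generating, by rule, the target-language possessive ending -'S or the particle "的" from the source adjectival form (MAN-'S / "人-的"). In the content-word dictionary, every word that goes through the table carries a lookup record number (a random-access file address), so the system simply fetches the record at that address. When programming in BASIC, using the random-access file record number as the word's internal code is to be recommended.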

词类词义区分表例释:

实词词典                       词类词义区分表

ATING-I: ACHIEVE / 达到        ATING-O: ACHIEVEMENT / 成就
EKZEMPL-O: EXAMPLE / 例子      EKZEMPL-E: FOR EXAMPLE / 例如
KOMENC-I: BEGIN / 开始         KOMENC-E: AT BEGINNING / 开始时
MEZUR-I: MEASURE / 测量        MEZUR-O: MEASUREMENT / 尺寸
OKAZ-I: HAPPEN / 发生          OKAZ-O: OCCASION / 场合
SCI-I: KNOW / 知道             SCI-O: KNOWLEDGE / 知识
TIP-O: TYPE / 型号             TIP-A: TYPICAL / 典型的
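The lookup described above can be sketched as follows. This is an assumption-based illustration: EChA stores a record number in the stem dictionary and fetches the record from a random-access file, whereas the sketch keys a Python dict directly by stem and ending. The table rows are copied from the examples, and `STEM_DICT` and `translate` are hypothetical names.

```python
# Rows copied from the example table: (stem, ending) -> (English, Chinese).
SENSE_TABLE = {
    ("ating", "i"): ("ACHIEVE", "达到"),
    ("ating", "o"): ("ACHIEVEMENT", "成就"),
    ("ekzempl", "o"): ("EXAMPLE", "例子"),
    ("ekzempl", "e"): ("FOR EXAMPLE", "例如"),
    ("tip", "o"): ("TYPE", "型号"),
    ("tip", "a"): ("TYPICAL", "典型的"),
}
# Hypothetical stem dictionary for words that follow the regular conversion.
STEM_DICT = {"hom": ("MAN", "人")}


def translate(stem, ending):
    """Prefer the table entry; otherwise derive the form by the regular rule."""
    if (stem, ending) in SENSE_TABLE:
        return SENSE_TABLE[(stem, ending)]
    english, chinese = STEM_DICT[stem]
    if ending == "a":  # adjectival form -> possessive -'S / particle 的
        return english + "'S", chinese + "的"
    return english, chinese
```

`translate("tip", "a")` takes the listed sense TYPICAL, while `translate("hom", "a")` falls through to the regular rule and yields MAN'S / 人的.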

5) 英语不规则词表

这个词表跟一般英语词典附录中列的不规则表没什么两样, 不过为了简便, 我们把动词形式的不规则变化和名词复数的不规则变化放在一个表内. 不规则词表是供英语形态生成查用的.

英语不规则词表

原形             过去时                过去分词               名词复数

BEAT             BEAT                  BEATEN
BECOME           BECAME                BECOME
...              ...                   ...                    ...
CHILD                                                         CHILDREN
...              ...                   ...                    ...
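How such a table is consulted during English form generation might look like the sketch below. The three rows come from the table above. The regular rules are deliberately simplified assumptions (no consonant doubling, no -Y to -IED or -IES), and `inflect` is a hypothetical name.

```python
# Rows copied from the irregular table above; verbs and nouns share one table.
IRREGULAR = {
    "BEAT": {"past": "BEAT", "participle": "BEATEN"},
    "BECOME": {"past": "BECAME", "participle": "BECOME"},
    "CHILD": {"plural": "CHILDREN"},
}


def inflect(base, form):
    """Return the inflected form, consulting the irregular table first."""
    entry = IRREGULAR.get(base, {})
    if form in entry:
        return entry[form]
    # Simplified regular rules; a real ending-adding routine needs more cases.
    if form in ("past", "participle"):
        return base + ("D" if base.endswith("E") else "ED")
    if form == "plural":
        return base + ("ES" if base.endswith(("S", "X", "CH", "SH")) else "S")
    raise ValueError("unknown form: " + form)
```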

最后我们给出EChA句子加工场的格式:

目标语序号丨实词词典各项丨CDC信息丨已加工特征丨虚词特征丨
目标语调序信息丨目标语位移序号丨

[说明] 1. 目标语序号用来在综合阶段自底而上归约加工时给同号.

目标语位移序号用来在用搬家法作虚拟调序时代表整个词条. 用序号代替整个词条位移的虚拟调序, 比纯粹用搬家法效率高, 大约跟拉链法相仿. 鉴于BASIC不能处理组合项变量, 如果采用搬家法调序, 只能一项一项位移, 这种虚拟调序的技术更显出优越性. 但须注意, 跟位移序号一起移动的, 还必须包括该词的自然顺序号, 用它标示原词条位置, 这样查问时才无后顾之忧.
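The idea of moving sequence numbers instead of whole records can be shown with a short Python sketch. The record layout is a hypothetical simplification of the workspace format; the example pair comes from the reordering DOKTORO ZAMENHOF ----> 柴门霍夫博士 cited elsewhere in the thesis.

```python
def move(order, src_pos, dst_pos):
    """Virtually reorder by moving one sequence number; records stay in place."""
    index = order.pop(src_pos)
    order.insert(dst_pos, index)


def render(records, order, field):
    """Read the chosen field out of the records in their virtual order."""
    return "".join(records[i][field] for i in order)


# Hypothetical two-word workspace in source order.
records = [
    {"src": "DOKTORO", "zh": "博士"},
    {"src": "ZAMENHOF", "zh": "柴门霍夫"},
]
order = list(range(len(records)))  # displacement sequence numbers
move(order, 1, 0)                  # the name goes before the title in Chinese
```

Because `order` holds the natural sequence numbers themselves, each moved entry still identifies its original position, which is the point the note makes about keeping the natural order number together with the displacement number.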


立委硕士论文：层次递归成分体系 (3)


层次递归成分体系

在给出层次递归成分体系(CDC)的定义之前, 我们先说说该体系的来源及其理论依据.

CDC体系是机器翻译的一种中间语言, 我们试图提供一套更加合乎独立分析独立综合要求的机器翻译抽象文法. CDC是EChA系统的关键, 它体现了我们对语言结构的看法和对机器翻译的认识. CDC是直接从导师们的中介成分体系[2] 脱胎而来的, 它保留了中介成分的形式, 继承和改造了它的内容, 其思想基础是有向直接联系理论(或轴心词理论). 体现在CDC中的要点是:

1) 句子的最顶层是主句谓语, 它是全句的最大联系中心(主轴心), 所以谓语是全句的代表. 一个完整的句子的最简单也是最典型的形式, 就是独词祈使句. 如:

Venu! Come! 来!

任何其他句子(无谓句是不完整句, 除外)都是从上面的简单形式一层一层推衍出来的:

Venu! ... La studento venu chi tien! ... La studento, kiu parolis, venu chi tien! ......

Come! Let the student come here! Let the student, who spoke, come here!

反过来说, 对一个无论怎样复杂的句子层层归约, 归约的顶层必然是主句动词谓语:

                    VENU
                 /    \     \
         studento     tien   (!)
          /    \      /
        la   parolis chi
            /  /   \
         (,) kiu   (,)

2) 一个词只能跟另外的一个词发生直接联系, 但一个词可以带 N 个 ( N>=0 ) 直接联系词. 这就是句子结构的有向直接联系观点.[3] 带直接联系词的词叫轴心词, 当 N>0 时, 它是非终结节点词. 直接联系词本身也常常是低一层次的轴心词.

3) 主句谓语(主轴心)处在第一层. 与主句谓语发生直接联系的词位于第二层. 与第二层词直接联系的词在第三层. 这样一环扣一环, 组成句子的每一个词都处在某一个层次上. 理论上说, 句子的层次可以是无限的.

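4) "Function words are not empty." Function words (also called structural words) carry more syntactic-structure information than content words do, and some of them can likewise serve as pivot words. In a "preposition + noun" structure, for instance, the preposition is the pivot. Subordinating conjunctions such as SE (IF) and KVANKAM (ALTHOUGH) also serve as pivots: each represents its clause, is directly linked to the main-clause predicate, and takes the clause predicate as its subordinate directly linked word.

5) As the intermediate-language mapping of the source sentence, hierarchical recursive constituents should, and can, be assigned down to every single word. From the machine's point of view, a word is simply the character string between two spaces (Chinese is a separate matter). Strictly speaking, punctuation marks are words too (function words), and they also take part in the analysis and reduction of the sentence.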

建立CDC体系的两项基本原则是:

1) 层次递归原则: 有多少层次反映多少层次, 而且层次是递归的. 层次的递归性表现在: (1) 对文句可以自底而上层层归约(参见EChA系统的目标语生成算法); (2) 对文句可以自顶而下层层分析(参见EChA的源语分析算法).

2) 词本位原则:[4] 词到句子(以主句谓语为代表)是一个动态递归过程的两极, 其间的各个环节就是所谓层次. 贯彻词本位原则的实质, 就是在一切层次上都把成分(CDC)落实到词. 句子是, 也仅仅是由句素组成的. 而每一个大大小小的句素(词组, 短语, 从句等)按照我们的看法, 总是以一个轴心词来代表的.

现在, 我们给出层次递归成分体系的形式化定义:

层次递归成分体系是层次递归成分的集合.
层次递归成分是这样一个六元信息组:
形态信息 | 结构关系信息 | 节点信息 | 分布信息 | 层号信息 | 链号信息

<形态信息>::=
{ <词性>, <格>, <数>, <时态>, <语态>, <语式>, <非谓语形式>, <体>, <人称>, ... }

<词性>::= { N, V, A, F, P, Z, C, K, B }

N=名词, V=动词, A=形容词, F=副词, P=介词, Z=助动词, C=并列连词,
K=主从连词, B=标点符号

<格>::= { 非格, 普通格, 目的格 }

<数>::= { 非数, 单数, 复数 }

<时态>::= { 非时态, 现在时, 过去时, 将来时 }

<语态>::= { 非语态, 主动语态, 被动语态 }

<语式>::= { 非语式, 陈述语式, 命令语式, 虚拟语式 }

<非谓语形式>::= { 非非谓语形式, 分词, 不定式, 名动词 }

<体>::= { 非体, 进行体, 完成体, 将来体 }

<人称>::= { 非人称, 第一人称, 第二人称, 第三人称 }

<结构关系信息>::= { S, W, O, D, F, B, T, I, C, L, M, A, Z, V, R }

S=主语, W=谓语, O=宾语, D=定语, F=状语, B=补语, T=同位语,
I=独立成分, C=同等连词或标点, L=从句起始标点, M=从句末标点,
A=插入成分起始标点,Z=插入成分末标点, V=非结构意义标点, R=句末标点

<节点信息>::= { J, <非终结节点> }

J=终结节点

<非终结节点>::= { S, O, D, B, K, X, Y }

S=主语从句节点, O=宾语从句节点, D=定语从句节点, B=补语从句节点,
K=一般从句节点, X=动词性非终结节点, Y=其他非终结节点

<分布信息>::= { Q, H, G }

Q=位于轴心词前, H=位于轴心词后, G=轴心

<层号信息>::= { 非层号, <自然数> }

<自然数>::= { 1, 2, 3, ... }

<链号信息>::= { <左链号>, <右链号> }

<左链号>::= { 非左链号, 99, N }

N=大于句首号小于句末号的自然数

<右链号>::= { 非右链号, N }

[说明] 左链号的设置是为了处理同等成分的方便. 我们把同等成分的最右元素认作整个成分的代表(落脚点, 轴心). 左链号99是同等成分最左元素的标志. 有了左链号, 消除了后顾之忧, 同等成分就可以和其他句素一样, 参加文句的分析和归约.

下面是用这套成分体系作分析的例句(004):

CDC中形态信息略去, 余下依次是: 关系/节点/分布/层号/左链/右链, 例如:

FJQ 05 00 02 --->
状语/终结节点/位于其轴心词之前/处于第5层/没有左链(00是非左链号)
/右链号为02
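The decoding shown above can be sketched as a small function. The lookup tables follow the formal definition given earlier, but the English labels, the function name `decode`, and the choice to return `None` for the "no chain" value 00 are assumptions made for illustration.

```python
RELATIONS = {
    "S": "subject", "W": "predicate", "O": "object", "D": "attribute",
    "F": "adverbial", "B": "complement", "T": "appositive",
    "I": "independent element", "C": "coordinator",
    "L": "clause-opening punctuation", "M": "clause-closing punctuation",
    "A": "parenthetical-opening punctuation",
    "Z": "parenthetical-closing punctuation",
    "V": "non-structural punctuation", "R": "sentence-final punctuation",
}
NODES = {
    "J": "terminal", "S": "subject-clause node", "O": "object-clause node",
    "D": "attributive-clause node", "B": "complement-clause node",
    "K": "general clause node", "X": "verbal non-terminal",
    "Y": "other non-terminal",
}
DISTRIBUTION = {"Q": "before pivot", "H": "after pivot", "G": "pivot"}


def decode(cdc):
    """Decode a CDC string such as 'FJQ 05 00 02' (morphology omitted)."""
    code, layer, left, right = cdc.split()
    relation, node, distribution = code  # one letter per field, by position
    return {
        "relation": RELATIONS[relation],
        "node": NODES[node],
        "distribution": DISTRIBUTION[distribution],
        "layer": int(layer),
        "left": int(left) or None,   # 00 means "no left chain"
        "right": int(right) or None,  # 00 means "no right chain"
    }
```

The letters are decoded by position, because the same letter can mean different things in different fields (S is "subject" as a relation but "subject-clause node" as a node).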

No.  Esperanto      English           Chinese         CDC

01   Pli            More              更              FJQ 05 00 02
02   poste          later             以后            FYQ 04 00 17
03   ,              ,                 ,               LJQ 05 00 04
04   kiam           when              当(...时)       FKQ 04 00 17
05   la             the                               DJQ 07 00 06
06   sciodisketoj   knowledge-disks   微型知识磁盘    SYQ 06 00 07
07   estis          had been          被              WBH 05 00 04
08   eltrovitaj     found out         发明了          BJH 06 00 07
09   ,              ,                 ,               MJH 05 00 04
10   la             the                               DJQ 05 00 12
11   plenan         full              全套            DJQ 05 00 12
12   indikaron      indication        指令集合        OYQ 04 00 17   [note: accusative]
13   ,              ,                 ,               AJQ 06 00 14
14   endiskigitan   endisked          所写入磁盘的    DYH 05 00 12
15   ,              ,                 ,               ZJH 06 00 14
16   oni            people            人们            SJQ 04 00 17
17   metis          put               放              WXG 03 99 20
18   en             into              到(...里面)     BYH 04 00 17
19   mashinojn      machines          机器            OJH 05 00 18
20   kaj            and                               CJQ 02 17 23
21   ili            they              它们            SJQ 02 00 23
22   tiamaniere     therefore         这样            FJQ 02 00 23
23   povis          could             能              WXG 01 20 00
24   en             in                在(...里面)     FYQ 03 00 27
25   si             themselves        自己            BYH 04 00 24
26   mem                              本身            BJH 05 00 25
27   akumuli        accumulate        积累            BXH 02 00 23
28   sciencan       scientific        科学            DJQ 04 00 29
29   stokon         stock             贮蓄            OYH 03 00 27
30   ,              ,                 ,               VJQ 05 00 32
31   pli            more              更              FJQ 05 00 32
32   grandan        great             大              DYH 04 00 29
33   ol             than              比              FYH 05 00 32
34   la             the                               DJQ 07 00 36
35   homa           man's             人的            DJQ 07 00 36
36   cerbo          brain             头脑            BYH 06 00 33
37   .              .                 .               RJH 02 00 23

层次递归成分实质上就是不同层次的词之间直接联系关系的一种反映. 它揭示了文句结构的正确的句法树. 根据文句的CDC链, 我们很容易画出该句的句法树.
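As a sketch of how the syntax tree follows from the CDC chain, the code below reads each word's right-chain number as the number of its pivot word. This reading is an interpretation, not something the text states outright, but the layer numbers in example (004) are consistent with it: every word sits exactly one layer below the word its right chain points to. The function names are hypothetical, and left-chain numbers are ignored here.

```python
WORDS = ("Pli poste , kiam la sciodisketoj estis eltrovitaj , la plenan "
         "indikaron , endiskigitan , oni metis en mashinojn kaj ili "
         "tiamaniere povis en si mem akumuli sciencan stokon , pli grandan "
         "ol la homa cerbo .").split()

CHAIN = """
FJQ 05 00 02
FYQ 04 00 17
LJQ 05 00 04
FKQ 04 00 17
DJQ 07 00 06
SYQ 06 00 07
WBH 05 00 04
BJH 06 00 07
MJH 05 00 04
DJQ 05 00 12
DJQ 05 00 12
OYQ 04 00 17
AJQ 06 00 14
DYH 05 00 12
ZJH 06 00 14
SJQ 04 00 17
WXG 03 99 20
BYH 04 00 17
OJH 05 00 18
CJQ 02 17 23
SJQ 02 00 23
FJQ 02 00 23
WXG 01 20 00
FYQ 03 00 27
BYH 04 00 24
BJH 05 00 25
BXH 02 00 23
DJQ 04 00 29
OYH 03 00 27
VJQ 05 00 32
FJQ 05 00 32
DYH 04 00 29
FYH 05 00 32
DJQ 07 00 36
DJQ 07 00 36
BYH 06 00 33
RJH 02 00 23
""".strip().splitlines()


def parse(chain):
    """Map word number -> (relation, layer, pivot number); pivot 0 marks the root."""
    table = {}
    for number, cdc in enumerate(chain, start=1):
        code, layer, _left, right = cdc.split()
        table[number] = (code[0], int(layer), int(right))
    return table


def build_tree(table):
    """Return the root word number and a pivot -> dependents mapping."""
    root, children = None, {}
    for number, (_relation, _layer, pivot) in table.items():
        if pivot == 0:
            root = number
        else:
            children.setdefault(pivot, []).append(number)
    return root, children


def layers_consistent(table):
    """Check that each word is one layer below its pivot."""
    return all(layer == table[pivot][1] + 1
               for _relation, layer, pivot in table.values() if pivot != 0)


def render(words, table, node, children, depth=0):
    """Return the tree as indented lines, one word per line."""
    lines = ["  " * depth + f"{words[node - 1]} ({table[node][0]})"]
    for child in children.get(node, []):
        lines.extend(render(words, table, child, children, depth + 1))
    return lines


table = parse(CHAIN)
root, children = build_tree(table)
print("\n".join(render(WORDS, table, root, children)))
```

The root found this way is word 23 (povis), the main-clause predicate on layer 1, with the coordinator kaj, the subject ili, the adverbial tiamaniere, the infinitive akumuli, and the final period as its direct dependents.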

实验证明, 作为体现独立分析结果的机器翻译中间语言, 层次递归成分体系是比较有效的. 现在, 越来越多的专家呼吁建立能充分体现对源语分析的结果, 正确揭示文句的层次结构和语义信息的媒介语, 或类似媒介语的东西. 许多文章论证了分析和综合独立的必要性. 原语分析依赖译语, 或译语综合依赖原语, 使分析和综合都不能深入, 而且难免捉襟见肘.[5]

当然, 层次递归成分体系还处于草创时期, 必然存在不少问题, 有待于在实践中不断检验, 改进和完善. 通过时间的考验和我们的努力, 也许它最终能成为一个比较得心应手的机译工具, 而为人们乐于采用, 这当然是我们所希望的. 也许它不是一个好的方案, 很快便被淘汰了. 但无论如何, 总是一次有益的尝试.

这套体系的不足之处是, 它不大能够反映有向直接联系的语义性质, 而这对于高质量的机器翻译是比较关键的信息. 人类语言不管怎样千差万别, 总有某些共同的东西. 例如, 句素间的层次结构及其直接联系关系就具有很强的普遍性. 正是这些语言共性才使翻译成为可能, 从而它成为语言转换的基础. 句素与句素之间的逻辑语义联系, 也是重要的语言共性之一.[6] 逻辑语义的确定, 将大大有助于生成地道的目标语. 在CDC体系中, 结构关系一项基本上是传统语法中句法成分的继承, 反映的是句子表层结构的关系(主谓宾定状补等). 看来, 有必要扩充CDC, 再加一个逻辑语义元:

<逻辑语义信息>::= { Ag, Sb, Ob, Vb, Pl, Tl, Mn, Pp, Rs, Fr, Rg, Dg, Tm, Pr, Cl, Fn, Ms, Pm, Cd, Nb, Pt, Mt, Ps, Tg, Cs, Ex, Dt, Ct, Cn, Cc, Cp, Tw, Xx }

Ag=施事(Agent), Sb=主体(Subject), Ob=受事(Object), Vb=行为(Verb), Pl=地点(Place),
Tl=工具(Tool), Mn=方式(Manner), Pp=目的(Purpose), Rs=结果(Result),
Fr=频率(Frequency), Rg=范围(Range), Dg=程度(degree), Tm=时点(Time),
Pr=时段(Period), Cl=颜色(Colour), Fn=功能(Function), Ms=尺寸(Measurement),
Pm=后饰(Post-modifier), Cd=条件(Condition) , Nb=数量(Number),
Pt=属性(Property), Mt=质料(Material), Ps=领属(Possession), Tg=对象(Target),
Cs=原因(Cause), Ex=说明(Explanation), Dt=限定(Determiner),
Ct=环境(Circumstance), Cn=内容(Content), Cc=让步(Concession),
Cp=比较(Comparison), Tw=同位, Xx=非语义(或不定语义)

[注] Xx是所有无法确定, 或没有必要确定的成分的逻辑语义. 机器翻译跟自然语言理解不同, 并不一味要求分析得越具体越透彻越好. 机器翻译过程中的中间信息究竟要深入到怎样的程度, 应根据充分必要的原则来决定. 少则影响效果(质量), 多则白费功夫.

_____________________________________________________________

附注: [2] 关于中介成分体系, 参见:

刘涌泉, 刘倬, 高祖舜 <<俄汉机器翻译规则系统新旧方案比较>> ( <<中国语文>> 1962.2 )

刘涌泉 <<外汉机器翻译中的中介成分体系>> ( <<中国语文>> 1982.2 )

刘倬 <<三次机器翻译试验>> ( 第一次机器翻译学术会议论文, 1980.9 )

[3] 关于有向直接联系理论, 参见:

刘涌泉, 刘倬, 高祖舜 <<俄汉机器翻译规则系统新旧方案比较>> (同上)

刘涌泉, 刘倬, 高祖舜 <<机器翻译中的词序问题>> ( <<中国语文>> 1965.3 )

并请参阅 <<特斯尼埃的 <结构句法基础> 简介>> ( 张烈材, <<国外语言学>> 1985.2 )

[4] 参见: 刘涌泉 <<词>> ( 1984年机器翻译及自然语言处理学术讨论会论文, 1984.9 )

[5] 参见: 冯志伟 <<当前机器翻译的一些新特点>> ( <<情报学刊>> 1982. Vol 1 No.2 )

[6] 参见: 董振东 <<逻辑语义及其在机译中的应用>> ( <<中国的机器翻译>> pp.25-45 )


立委硕士论文：世界语: 语言学特点及其研究价值 (2)


世界语: 语言学特点及其研究价值

在进入EChA系统的细节和探讨机器翻译的一般理论和方法之前, 我们专列这一节讨论世界语本身, 这对说明本系统的设计思想和具体方法是很必要的. 毫无疑问, 我们的讨论主要是从语言学角度着眼.

世界语(Esperanto)是波兰的语言大师柴门霍夫博士( L.L.Zamenhof 1859.12.15 - 1917.4.14 )于1887年在印欧语系的基础上经过艰苦研究提出的一个人造语方案. 由于其科学, 简明, 逻辑性强, 由于日益增长的克服语言障碍的国际需要, 也由于其维护世界和平, 增进各民族相互了解, 实现世界大同的崇高理想的感召, 它逐渐为人们所接受. 目前, 世界上有2000多万人在学习和使用世界语. 世界语早已脱尽了人造的斧痕, 走上了自然发展的道路. 它不但能写也能说, 不但适于表达精密的科学思想, 而且在文学上也取得了令人赞叹的成就. 从莱勃尼茨的万国通用文字的设想开始, 先后提出的人造语方案达150多种, 唯有世界语经受住各种考验生存下来了. 现在, 越来越多的人认识到世界语作为国际辅助语的独特价值. 有些国际性学术会议(如控制论大会)已经采用世界语作为工作语言.

世界语中除数量有限的虚词外, 其他词都有非常规则的形态变化, 借以表现该词的词性, 格, 数, 时态, 语态, 语式, 分词形式等语法信息. 另外还有一整套前缀后缀, 用以表现词汇意义上的细微差别和修辞色彩. 世界语是典型的黏着语, 词尾和语缀的意义单一, 可以叠加. 这套词尾和语缀设计得非常巧妙, 规则, 特别容易掌握, 而且也非常适合机器的递归加工.

(EChA的削尾算法就体现了这种递归加工的优点, 见本文第5节.) 世界语没有语法同形词, 句法关系一目了然, 这不论对人还是对机器的识辨, 都是一个极为有利的条件(民族语机器翻译中同形判别的问题在这儿根本不存在了). 同时, 世界语的词类转换也特别灵活, 只要逻辑上说得过去, 不致引起误解, 同一个词干可以根据句法需要, 通过词尾变化随意改变词性. (我国古汉语词类活用也比较自由, 在一定程度上具有类似的灵活性, 可惜这种活用没有明确的形态标志, 常常要靠逻辑语义的分析才能确定.)
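The recursive character of ending stripping can be illustrated with the sketch below. It is not EChA's actual algorithm, which is described in Section 5 of the thesis and is not reproduced here. The function names and feature labels are assumptions, and the endings are those listed in the sixteen rules. Words are given in the H-transliteration used throughout.

```python
FINAL_ENDINGS = {
    "as": "present", "is": "past", "os": "future", "us": "conditional",
    "u": "imperative", "i": "infinitive",
    "o": "noun", "a": "adjective", "e": "adverb",
}
PARTICIPLES = {
    "ant": "active present participle", "int": "active past participle",
    "ont": "active future participle", "at": "passive present participle",
    "it": "passive past participle", "ot": "passive future participle",
}


def strip_participle(stem):
    """Peel one participle suffix, if present."""
    for suffix in sorted(PARTICIPLES, key=len, reverse=True):
        if stem.endswith(suffix) and len(stem) > len(suffix):
            return stem[: -len(suffix)], [PARTICIPLES[suffix]]
    return stem, []


def strip_endings(word):
    """Return (stem, features), peeling endings from the right recursively."""
    word = word.lower()
    if word.endswith("n") and word[-2:-1] in ("o", "a", "e", "j"):
        stem, features = strip_endings(word[:-1])
        return stem, features + ["accusative"]
    if word.endswith("j") and word[-2:-1] in ("o", "a"):
        stem, features = strip_endings(word[:-1])
        return stem, features + ["plural"]
    for ending in sorted(FINAL_ENDINGS, key=len, reverse=True):
        if word.endswith(ending) and len(word) > len(ending):
            stem, features = strip_participle(word[: -len(ending)])
            return stem, features + [FINAL_ENDINGS[ending]]
    return word, []
```

For example, `strip_endings("mashinojn")` peels -N, then -J, then -O. A sketch like this over-strips stems that happen to end in a participle-like string, so a real system would check each candidate stem against the dictionary.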

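Esperanto has few inflectional endings, yet the set is complete enough to rival morphologically rich languages, which is remarkable. Take case: Esperanto has only two, the common case (zero ending) and the accusative (ending -N). But because it neatly unifies the uses of part of speech and case, and has prepositions as an analytic backup, it is as flexible in expression as a richly inflected language. Russian, one of the most richly inflected modern languages, has six cases. Roughly speaking, its first case (nominative) corresponds to the Esperanto common case. Its second (genitive) corresponds to the Esperanto adjective, which we may provisionally call the adjectival case (ending -A). Its third (dative) has no inflectional counterpart in Esperanto and is generally replaced by the preposition AL. Its fourth (accusative) corresponds to the Esperanto accusative. Its fifth (instrumental) corresponds to the Esperanto adverb, which we may likewise call the adverbial case. The sixth is the prepositional case, used with prepositions such as о, на, в, and it expresses no specific semantic relation by itself. Interestingly, an Esperanto preposition may be followed by either the common case or the accusative, the former expressing a static sense and the latter a dynamic one (direction). Compared with the similar usage in Russian, the conciseness and completeness of Esperanto are evident.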

世界语基本语法规则共16条, 原则上没有例外.[0] 由此人们也许会推断这门语言很简陋, 刻板, 缺乏表现力. 这是一个极大的误解. 这里涉及世界语的另一个非常突出的语言学特点, 就是它兼有分析性语言和综合性语言的要素(虚词和形态都比较丰富), 同一种语义既可以用分析形式(借助于虚词), 又可以用综合形式(借助于屈折变化)来表示----当然, 这两种形式并不等同, 它们体现了不同的风格. 由于这一特点, 世界语兼容性强, 文体多样, 特别灵活, 富于弹性和表现力. 如果作为目标语, 它最能维妙维肖地模仿原文的语言特色. 它既可以反映语序自由, 文体柔美的斯拉夫风格, 又可以表现形态缺乏的语言(如汉语和英语)的单纯, 严谨, 密集的特点. 下面我们举几个例子来看一下分析形式和综合形式在世界语中的兼容并存情况:

                  Analytic form                        Synthetic form

Tense:            Mi ESTAS skrib-ANTA.                 Mi skrib-AS. / Mi skrib-ANTAS.
                  I AM writ-ING.   我 "在" 写字.

Voice:            Ghi ESTAS limig-ITA.                 Ghi limig-ITAS. / Ghi lim-IGHAS.
                  It IS limit-ED.   它 "被" 限定了.

Word meaning:     Tio estas MALGRANDA (ETA) sekreto.   Tio estas sekret-ETO.
                  That is a LITTLE secret.   那是 "小" 秘密.

Preposition and adverb (adverbial case):
                  Li parolas EN (PER) Esperanto.       Li parolas esperant-E. / Li parolas Esperant-ON.
                  He speaks IN Esperanto.              He speaks Esperanto.
                  他 "用" 世界语说话.                    他说世界语.

Preposition and case (accusative):
                  Shi parolis POR 30 minutoj.          Shi parolis 30 minut-OJN.
                  She spoke FOR 30 minutes.   她说了30分钟.

Conversion from analytic to synthetic form:
                  LAU kutimo ... LAU-kutim-E ... kutim-E

这种分析形式和综合形式并存的情形在世界语中极其普遍, 这一点跟民族语不一样. 虽然没有绝对不用分析形式的综合性语言, 也没有绝对不用综合形式的分析性语言, 但是, 每一个具体的民族语言总是以一种形式为主, 而且在多数场合总是一种形式排斥另一种形式, 一般不允许并存.

总之, 跟人们通常想象的正相反, 世界语是高度灵活的, 表达方式极其多样, 且能互相转换. 这种高度灵活性正好适应了人类思维模糊性的特点. 灵活性与规则性的高度统一, 这就是世界语的真正奇迹.

人造语言的规则性容易为人理解. 关于灵活性, 再补充几点. 由于篇幅关系, 我们不打算展开, 必要时辅以一两句例证.

在世界语中动词的及物与不及物的界限模糊了.

Mi IRAS. / IRU vian propran VOJON!

I GO. / GO your own WAY! 我行走. / 走你自己的路!

La tuta homaro PAROLOS nur unu LINGVON.
/ Mi PAROLAS esperante (en Esperanto, per Esperanto).

The whole mankind will SPEAK only one LANGUAGE.
/ I SPEAK in Esperanto.

全人类将说仅仅一种语言. / 我用世界语说话.

直接宾语(所谓宾格)与间接宾语(所谓与格)的界限模糊了.

informi ION al IU / informi IUN pri IO

tell sth. to sb. / tell sb. about sth. 向某人告诉某事 / 告诉某人关于某事

宾语与状语的界限模糊了. 世界语语法规定: 目的格(即通常所谓宾格)也可以表达某种状语意义(参见基本法规第14和第13条).

Mi invitas vin VOJAGHI kun mi PEKINON.

I invite you to TRAVEL with me TO PEKING. 我邀请你和我一起 "旅游北京".

词缀与词根的界限模糊了, 从而派生词与合成词的界限模糊了. 同时虚词与实词的界限也模糊了.

sekret-ET-o / ET-a sekreto JES, / mi JES-as vian opinion.

little secret 小秘密 Yes, I agree with you. 是的, 我同意你的意见.

ANTAU-vidi / Sinjorinoj ANTAU-as. Kred-IND-a
/ ne-IND-a , IND-igi , sen-IND-ulo

foresee / Ladies first. believ-able
/ not worthy, make worthy, good-for-nothing

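The catch-all preposition JE. When expressing a thought, people are often aware only that a subordinate element stands in some vague modifying relation to its head, without being able, or needing, to say exactly what the semantic link is. To accommodate this fuzziness of human thinking, Zamenhof introduced the preposition JE, an insightful creation. (Such vague relations can also be expressed by the inflectional accusative or by the adverb (adverbial case); see rule 14 of the basic grammar.)

Unity of part of speech and case in use. Part of speech and case are both dynamic syntactic features, fixed by the ending only once the word enters a sentence. Both can express fairly abstract semantic relations, and they complement each other. (This differs from the analytic form, the prepositional phrase: apart from JE, prepositions generally express more concrete and definite semantic relations.)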

Mi skribas plum-E.
CF: (俄) (五格)

极其灵活的词类转换.

La FLOR-OJ FLOR-AS. Li KANT-AS italan popolan KANT-ON.
Mi estas GHOJ-A. Mi GHOJ-AS.

The flowers blossom. He sings an Italian folk song.
I am glad.

词序的自由.

Mi amas vin. (106) / Mi vin amas. / Vin amas mi. (108)
/ Vin mi amas. (111) / Amas mi vin. / Amas vin mi.

I love you. 我爱你.

构词的灵活. 派生词: 词缀的丰富及其黏合特点; 合成词: 词根与词根的自由复合.

Shi rid-AS. Shi rid-ETAS. Shi estas rid-EMA.
Shi estas rid-EMULO. Shi estas rid-EMULINO ( rid-EMINO ).
Shi estas rid-EMULINETO ( rid-EMINETO ).......

她笑. 她微笑. 她爱笑.
她是爱笑的人. 她是爱笑的女人.
她是爱笑的小女孩儿 .......
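The stacking of single-meaning affixes in the series above can be sketched as plain concatenation. The function name `derive` and the English glosses are illustrative assumptions. The affixes themselves are the ones used in the examples.

```python
SUFFIX_GLOSS = {
    "em": "inclined to", "ul": "person", "in": "female", "et": "diminutive",
}
ENDING_GLOSS = {"o": "noun", "a": "adjective", "as": "present-tense verb"}


def derive(root, suffixes, ending):
    """Build a word from root + suffixes + ending, with a compositional gloss."""
    word = root + "".join(suffixes) + ending
    parts = [root.upper()] + [SUFFIX_GLOSS[s] for s in suffixes]
    parts.append(ENDING_GLOSS[ending])
    return word, " + ".join(parts)


print(derive("rid", ["em", "ul", "in", "et"], "o"))
```

Each affix contributes one meaning, so the gloss of the whole word is just the sequence of the glosses of its parts.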

INTER-lingvo 中间语言

fonto-lingvo celo-lingvo ponto-lingvo
naci-lingvo internaci-lingvo

source language   target language   intermediary language (bridge language)
national language   international language

完善的时态语态系统和精巧的相关词表. 世界语的时态语态系统和相关词表是两项绝妙的创造. 它们是如此地精巧完善, 富有逻辑的力量和美, 每一个世界语者都象化学家欣赏元素周期表一样体验到这种美, 并为此感到自豪. 借助于唯一的一个助动词ESTI, 世界语能表达各种复合时态语态. 相关词表所能表达的语义的简洁和丰富更是无与伦比的.

世界语的这些特点给人们的自由创造留下了很大的余地, 为人们充分发挥自己的语言才能提供了最好的条件. 这种灵活性并不影响作为世界语基础的16条基本法则的不可动摇的严格性. 在这儿, 自由和约束达到了完美的统一. 在世界语国里, 每个人都在不同程度上是创造者, 每一个世界语者都体验到这种创造的乐趣. 人们再也不是习惯的奴隶了.

然而, 不能不承认, 世界语的灵活和自由给机器的自动处理带来了一定的困难. 我们在研制EChA系统的过程中, 深深感到, 与民族语相比, 以世界语为源语的机器翻译虽然有其容易的一面, 也有其特有的难处, 总之要比我们预料的要复杂得多. 容易来自其高度规则性, 困难则源于其高度灵活性.

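As the only constructed language in actual use, Esperanto naturally has a research value of its own. Contrastive study of Esperanto and national languages yields many useful insights. Because of its unique position, the study of Esperanto may benefit research on the relations between thought and language, nation and language, society and language, the individual and language, and belief and language. It may equally benefit inquiry into language universals, the nature of language, the future of language (the language of future society), form and content in language, language typology, and language teaching. Moreover, the development of Esperanto itself calls for scientific study and synthesis by linguists. This will benefit the healthy growth of the language and help establish a theoretical framework for Esperanto linguistics, and it will also enrich general linguistic theory. Theoretical research on Esperanto by linguists began long ago, but it is still far from sufficient.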

对于机器翻译工作者, 世界语还有一层特殊的意义, 就是世界语作为民族语间机器翻译的媒介语的价值.[1] 这可以从两方面看: 1) 按照机器特点对世界语作必要改造, 定义一个作为媒介语的世界语子集, 再辅以一套高度形式化的成分体系. 这个设想我们在第一届中国世界语大会上提过. 我们也确实设计过一个以世界语作为媒介语的英汉机器翻译规则系统. 虽然由于时间等原因没有能上机试验, 但我们相信该方案是可行的, 也是值得尝试的. 拿世界语或其子集作媒介语, 尽管还远远不是最理想, 但如果研制的是印欧语系间多语言自动翻译, 或者是以这些语言为源语的多对一系统(如英/法/德/俄--汉系统), 相信会带来很多方便. 2) 虽然不直接采用世界语作媒介语, 但在设计机译媒介语时, 认真吸取世界语的优点, 可以少走弯路.

_______________________________________________________________________

附注: [0] 为便于查对, 这里把世界语16条基本法规转抄如下:

(1) 不存在不定冠词, 只存在定冠词 (LA), 其性数格不变.

(2) 名词词尾为 "-O", 复数形式加词尾 "-J". 只存在两个格: 普通格和目的格; 后者由普通格加词尾 "-N" 构成.

(3) 形容词以 "-A" 收尾, 其格数与名词同. 比较级用PLI和连词OL, 最高级用PLEJ.

(4) 基数词(没有词尾变化)是: UNU 1, DU 2, TRI 3, KVAR 4, KVIN 5, SES 6, SEP 7, OK 8, NAU 9, DEK 10, CENT 100, MIL 1000. 几十和几百由数词简单合并而成. 序数词加形容词词尾; 倍数加后缀 "-OBL-", 分数加 "-ON-", 集合数词加 "-OP-", 分配意义用介词 PO. 此外, 数词也可以有名词和副词形式.

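(5) Personal pronouns: MI, VI, LI, SHI, GHI (for things or animals), NI, VI, ILI. Their possessive forms are made by adding the adjective ending. Number and case are inflected as for nouns.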

(6) 动词没有人称和数的变化. 动词的各种形式: 现在时用词尾 "-AS"; 过去时 "-IS"; 将来时 "-OS"; 假定式 "-US"; 命令式 "-U"; 不定式 "-I". 分词(有形容词和副词的意义): 主动现在式 "-ANT-"; 主动过去式 "-INT-"; 主动将来式 "-ONT-"; 被动现在式 "-AT-"; 被动过去式 "-IT-"; 被动将来式 "-OT-". 被动语态的各种形式, 都借助于ESTI的相应形式和所需要的动词的被动分词构成; 被动式所用的介词是DE.

(7) 副词以 "-E" 收尾; 各比较等级与形容词同.

(8) 所有介词都要求普通格.

(9) 每个词读写一致.

(10) 单词重音永远在倒数第二个音节上.

(11) 合成词由词与词简单合并而成(主要的词放在后面); 语法词尾也被看作独立的词.

(12) 有其他否定词的时候, 就不再用 NE.

(13) 为了表示方向, 单词加目的格词尾.

(14) 每个介词都有确定不变的意义. 但是如果我们需要用一个介词, 而从意义上看不出应该用哪一个, 这时我们就用没有独立意义的介词JE. 介词JE也可以用没有介词的目的格来代替.

(15) 所谓外来词, 即大多数语言取自同一来源的词, 在世界语里不加变化地应用, 只需照世界语拼写法书写; 但如果一个词根派生几个不同的词时, 最好只不加变化地采用那个基本词, 并由此按照世界语的规则构造出其他的词来.

(16) 名词和冠词末尾的元音字母可以省略, 用省略号 ' 来代替.

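[1] See <<巴贝尔通天塔必将建成>> (刘涌泉, 李维; paper for the First Chinese Esperanto Congress). Its Section 4 is devoted to the advantages, drawbacks, feasibility, and prospects of Esperanto as an intermediary language for machine translation.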


硕士论文：世界语到汉语和英语的自动翻译试验(1)

世界语到汉语和英语的自动翻译试验
-- EChA机器翻译系统概述

本文是我在导师刘涌泉和刘倬先生指导下所做的毕业设计的论文总结. 共分十大部分:
1. EChA概况: 系统流程图; 2. 世界语: 语言学特点及其研究价值; 3. 层次递归成分体系CDC: 体现独立分析结果的EChA中间语言; 4. EChA机器词典, 句子加工场格式; 5. 世界语形态分析: 削尾算法, 关于削缀问题的讨论; 6. 句法分析第一线: 虚词处理, 规则和规则分开的讨论; 7. 句法分析第二线: CDC的求解, 中间结果分析; 8. 英语形态生成, 汉语形态修辞, 原语和译语对比差异的一般总结, 多义区分例释; 9. 调序: 自底而上加工; 10. EChA试验结果分析, 汉语和英语的机译文的比较, 关于文学作品可不可以跟机器翻译结合的问题, 修辞的讨论。

EChA Overview
Esperanto: Linguistic Features and Research Value
Hierarchical Recursive Constituent System
EChA Machine Dictionaries
Esperanto Morphological Analysis
Esperanto Syntactic Analysis (1)
Esperanto Syntactic Analysis (2)
English Morphological Generation
Target-Language Reordering
Analysis of the EChA Experimental Results

[Acknowledgments]

[References]

[Appendix 1] EChA Experimental Results

[Appendix 2] Esperanto Abstract

EChA概况

EChA (E-Ch/A: el Esperanto en la Chinan kaj Anglan Lingvojn) 系统是以世界语作为源语, 以汉语和英语作为目标语的一对多小型实验系统. 它是一个句对句的, 分析和综合有一定独立性的全文机器翻译系统. 本系统实现了翻译过程的完全自动化,不需要译前和译后编辑. (由于纯技术原因, 世界语中的几个戴帽字母暂时还需要用加 H 的复合字母来转写.) EChA系统从上机调试到打出译文只用了五个月, 全部工作历时近一年, 进展比较顺利. 本系统使用的是IBM-PC/XT微型机, 编程语言 BASIC (Version D2.00), 同时选用IBM公司的BASIC编译程序软件包. EChA由CCDOS操作系统(即带有汉字库的PC DOS 2.10)支持. 系统主体是六线分析和综合程序. 另外还建立了三部词典, 两个词表, 编制了词典的造查, 扩充和维护程序. 整个系统由近一万条BASIC语句构成. 编程时充分利用了BASIC串处理函数, 显得特别方便.

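This experiment translated more than 150 Esperanto sentences. The Chinese and English machine translations are all either fluent or intelligible, and the results are satisfactory (see the appendix). The source material came in three parts. The first is two consecutive passages (12 sentences, pp. 100-101) from "Mashinmondo" (<<机器世界>>, 中国展望出版社), an original Esperanto work by the well-known Esperanto writer Sandor Szathmari. Its sentences are fairly long and structurally complex. The second is a set of typical example sentences (more than 100) from <<世界语语法>> by 魏原枢 and 徐文琪 (上海外语教育出版社, 1982.10). These examples, some of them everyday expressions, were chosen for their linguistic features. They show different tenses (simple and compound), voices (active and passive), and moods (indicative, imperative, conditional). They cover different sentence types (simple, coordinate, and complex sentences, subjectless sentences, one-word sentences, yes-no questions, wh-questions, and so on), different sentence patterns, and the various forms of the verb. In short, they are quite representative and largely reflect the overall shape of Esperanto grammar. This makes up for the uniformity of the running text and gives a better test of EChA's capability and adaptability. Finally, as an experiment, two Esperanto poems were also translated (the first being the famous Esperantist anthem "希望之歌").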

EChA由三大部分组成: 1) 机器词典; 2) 源语分析; 3) 目标语生成. 源语分析部分包括了世界语的全部基本语法和常用句型. 然而, 由于机器条件和实验周期的限制, 本系统的规模(特别是词典的规模)还很小, 有待于进一步扩充和改进. ----准备从两方面来扩充EChA系统, 一是补充例句, 做扩大试验; 二是增加俄语和法语作为新的目标语, 进一步检验体现独立分析结果的中间语言CDC(层次递归成分体系, 第3节详述)的适应范围, 并探讨其完善的途径. 另外, 时间仓促给系统还带来一些问题: EChA的结构还不是很合理, 算法有待于进一步优化, 规则和算法还没能分开, 在分析和综合的独立性上下了不少功夫, 但还没有完全独立.

尽管还有上述问题, 然而按照设计要求, 只要适当扩充词典, 系统就有能力处理世界语的绝大多数语言现象. 在中国近三十年的机器翻译研究历史中, EChA是第一个以世界语为研究对象的机译系统. 在世界语跟机器翻译结合的过程中, EChA是一个成功的尝试和良好的开端. 我们热切希望得到专家学者, 世界语同志们的帮助和指导.

EChA system flowchart

                      Source text input
                             |
 Dictionary        1. Ending stripping; dictionary lookup (content-word
 (morphological       dictionary, function-word dictionary, idiom dictionary,
 analysis)            part-of-speech sense disambiguation table)
 ---------------             |
 Syntactic         2. Conjunctions and punctuation, segmentation, other
 analysis             function words
                             |
                   3. Solving for the intermediate language CDC
 ---------------             |
 Target-language   4. Polysemy disambiguation; English morphological
 generation           generation and Chinese morphological polishing;
                      lookup in the English irregular-word table
                             |
                   5. English reordering
                             |
                   6. Chinese reordering and other polishing
                             |
                      Translation output

源语文句输入以后, 作第一遍扫描. 首先判定加工词长度是否大于三. 若大于三, 转子程序削尾后查实词词干词典, 否则查虚词词典. 因为世界语虚词(无词尾变化)大多短小, 以三为界限最合理, 可以大大减少虚查次数. 词典查不着的作生词处理, 削尾信息保留. 查完词典及词表以后, 把削尾信息和词典信息移到计算机内存中所开辟的句子加工场.
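The first-pass routing described above can be sketched as follows. This is a minimal illustration: the name `look_up`, the toy dictionaries, and the one-step regular expression used for ending stripping are assumptions. The fallback that tries the function-word dictionary for long words not found in the stem dictionary is also an assumption, since the text does not spell out how function words longer than three letters are handled.

```python
import re

# One-step stripping of a grammatical ending (simplified; assumed).
ENDING = re.compile(r"(?:(?:[oa]j?|e)n?|[aiou]s|[iu])$")


def look_up(word, content_dict, function_dict):
    """Route a word to a dictionary by its length; return (kind, key, entry)."""
    w = word.lower()
    if len(w) > 3:
        stem = ENDING.sub("", w)
        if stem in content_dict:
            return "content", stem, content_dict[stem]
        if w in function_dict:  # assumed fallback for long function words
            return "function", w, function_dict[w]
        return "unknown", stem, None  # new word; stripped stem is kept
    if w in function_dict:
        return "function", w, function_dict[w]
    return "unknown", w, None


content = {"student": "STUDENT", "pov": "CAN"}
function = {"la": "THE", "kaj": "AND", "kiam": "WHEN"}
print(look_up("studento", content, function))
print(look_up("la", content, function))
```

Short words skip stripping and the stem dictionary entirely, which is how the length threshold saves wasted lookups.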

句法分析确定源语文句的层次结构和句法关系. 分析结果以一种高度形式化的层次递归成分体系CDC来体现. CDC是独立于目标语的机器翻译中间语言, 这种独立性对于一对多机译系统是必要的. CDC由形态, 成分, 节点, 分布, 链号和层次几部分信息构成. 它不但揭示了源语文句的正确的句法树, 而且还包含了其它的有用的信息. 事实上, 它为建立多目标语的生成系统奠定了良好的基础.

句法分析第一线处理虚词, 中心任务是加工连词和标点, 正确切分语段. 原则上为每一个虚词编制一套分析规则. 世界语虚词数量很有限, 但用法较多, 具有民族语功能词的类似的复杂性, 是语言个性的集中表现, 所以分别加工比较适宜, 这也有利于规则跟规则分开. 该线加工任务很重, 特别是连词KAJ和KE, 分析规则十分复杂. 在很大程度上, 虚词分析对了, 句法关系也就清楚了. 因此, 集中力量编制一套完备的针对具体虚词的分析系统, 对于世界语类型的机器翻译至关重要. 该线正确处理了虚词个性现象, 便可以保证下一线分析的充分抽象性和概括性, 这样做对于象世界语这样的科学而规则的语言显得特别有利. 句法分析第二线运用自顶而下的方法, 从句子的谓语轴心(第一层)着手, 一层一层往下递归加工, 直到最末层(终结节点层). 加工过程就是不断递归调用各子程序的过程. 其中以动词子程序为核心, 它充分反映了世界语语法的基本内容及其高度规则性. 分析完毕得出一条对应于源语文句的中间语言CDC的链.

综合第一线做英语形态生成和汉语形态修辞. 英语形态并不发达, 所以世英的形态转换规则也不复杂. 汉语缺乏形态, 一般用适当的虚词(助词, 副词等)来代替. 我们把多义词区分规则也放在这一线, 这是因为多义区分的条件至此已经具备. 一般来说, 根据多义词及其联系词的CDC成分和语义特征就可以得出该词的正确义项. 综合第二线和第三线分别做英语调序和汉语调序. 调序信息由CDC结合目标语语法规律得出, 调序的方法是自底而上, 层层归约, 这样就不至于调乱. 我们知道, 世界语语序极为灵活自由, 而汉语语序却很固定, 所以生成汉语的主要任务是调序. 对于英语, 调序的任务较轻, 主要是保证文句主干 "主谓宾" 次序不乱. 英语名词没有主宾格的区分, 所以关键是把前置宾语移到动词之后. "世界语是印欧语系的一个合理化的公分母", 与英语相似处毕竟很多, 比如同一句法层次的定语或状语的内部调序, 在译汉语时是一个难题, 而在印欧系诸语言中则不是大问题. 另外修辞加工的过程也可以免了. (世英转换中的成语和多义现象较之世汉转换也少得多.) 总之, 英语生成比汉语生成容易许多.
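Bottom-up reordering by reduction can be sketched as a post-order walk: each subtree is first reduced to a phrase, and the phrase is then placed before or after its pivot according to the target language. The function name `linearize` and the rule that English places subjects and attributes before the pivot and everything else after it are simplifying assumptions, not EChA's actual reordering rules.

```python
def linearize(node, children, words, relation, before):
    """Reduce the subtree at node to a phrase in target-language order."""
    left, right = [], []
    for child in children.get(node, []):
        # Reduce the dependent's own subtree first (bottom-up).
        phrase = linearize(child, children, words, relation, before)
        (left if relation[child] in before else right).append(phrase)
    return " ".join(left + [words[node]] + right)


# "La studenton vidis mi": object first, subject last in the source.
words = {1: "the", 2: "student", 3: "saw", 4: "I"}
relation = {1: "D", 2: "O", 4: "S"}  # CDC relation of each dependent
children = {3: [2, 4], 2: [1]}       # pivot -> dependents
print(linearize(3, children, words, relation, before={"S", "D"}))
```

Because a phrase is moved only after it has been fully reduced, a fronted object travels together with its modifiers, which is why the layer-by-layer approach does not scramble the sentence.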

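Although EChA is not a large system, it is fairly rich in content: it has morphological analysis and morphological generation, reordering and stylistic polishing, and its own constituent system. In the overall design we already allowed for extending the system with new target languages of different types. It can be expected that automatic translation from Esperanto into Chinese, English, French, and Russian could be achieved by adding two generation passes for Russian and French (mainly morphological generation) and slightly modifying the analysis part (mainly by fleshing out the function-word analysis rules that are not yet fully independent of synthesis). In short, EChA touches on nearly every problem a practical machine translation system can encounter, and each of its six main passes has its own character. It is a fairly representative model of a fully automatic one-to-many machine translation system.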