【A parse a day: parsing is the nuclear weapon of question answering systems】

A parse a day: today's is ...

0831d

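How do we know that the question and the answer here can be paired? With parsing and a knowledge graph built on top of it, this is easy. The graph has a professionOf relationship, and with parsing, extracting that relation is a piece of cake (this example is simple: just map the appositive relation to the professionOf relation). With parsing, the relation the question asks about can also be resolved into an asking point: the subtree (S: 李娜-从事 "Li Na - engage in", O: 从事-运动 "engage in - sport"; Mod: 什么-关系 "what - relation") determines that the asking point is a request for professionOf(“李娜”). Then do semantic matching, and the QA loop is closed. This is IE or knowledge-graph supported QA.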

Concretely, to make Q and A match, we can write subtree rules on both sides that fill (extract into) the professionOf relation, unifying the semantics; after that everything goes smoothly. The first subtree rule is:

"从事"
    O: (“职业|运动”)
        Mod: (“什么|何种”)
    S: ^Somebody
==> professionOf(^Somebody,?)

This is question parsing and asking point extraction. On the answer-source side there is likewise a set of rules that extract professionOf, including this one:

[person-NE]:^Person

equiv([profession_token]:^Profession)

==> professionOf(^Person,^Profession)

This is how Q and A come to match.
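The two rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: parse trees are represented as hand-written dictionaries rather than real parser output, the helper names (`Fact`, `extract_from_question`, `extract_from_answer`, `answer`) are hypothetical, and the profession token "网球运动员" is an invented answer-side example, since the original answer sentence is not reproduced in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fact:
    relation: str
    person: str
    value: Optional[str]  # None stands for the asking point "?"


def extract_from_question(tree: dict) -> Optional[Fact]:
    """Question rule: "从事" with O in 职业|运动, Mod in 什么|何种, S bound to ^Somebody."""
    obj = tree.get("O", {})
    if (tree.get("head") == "从事"
            and obj.get("head") in {"职业", "运动"}
            and obj.get("Mod") in {"什么", "何种"}
            and "S" in tree):
        return Fact("professionOf", tree["S"], None)
    return None


def extract_from_answer(tree: dict, profession_tokens: set) -> Optional[Fact]:
    """Answer rule: a [person-NE] in apposition (equiv) with a profession token."""
    if tree.get("type") == "person-NE" and tree.get("equiv") in profession_tokens:
        return Fact("professionOf", tree["head"], tree["equiv"])
    return None


def answer(question: Fact, facts: list) -> list:
    """Semantic matching: same relation and same person fill the asking point."""
    return [f.value for f in facts
            if f.relation == question.relation and f.person == question.person]


# Hand-written trees (assumed shapes, not real parser output).
q_tree = {"head": "从事", "S": "李娜", "O": {"head": "运动", "Mod": "什么"}}
a_tree = {"type": "person-NE", "head": "李娜", "equiv": "网球运动员"}

q = extract_from_question(q_tree)
extracted = [extract_from_answer(a_tree, {"网球运动员"})]
facts = [f for f in extracted if f is not None]
print(answer(q, facts))
```

Both sides are normalized to the same `professionOf` slot, so the final match is a plain comparison of relation and argument.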

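Without a dedicated knowledge graph, and without extraction of predefined relations, how can QA cope? SVO parsing alone can handle a good share of questions about events. For questions about relations and complex events, however, simple SVO matching falls short. Fortunately, in principle most complex semantics can be predefined as IE targets and extracted in a dedicated way. Simple semantics are open-ended, and linguistic parsing (subject, predicate, object, attributive, adverbial, complement, etc.) is enough to handle them.

Heaven does not deceive us.

Relative to SVO, IE is in essence (semantic) slot normalization. The original slots are linguistic: S, O, equiv (appositive), mod, and so on. The new slots carry pragmatic semantics, for example professionOf, locationOf, employeeOf, acquiringCompany, acquiredCompany, priceOfAcquisition, etc.

QA by SVO matching can also be illustrated with an example, such as asking how to do something: "do + something" is simply V+O: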

0831a

0831b

0831c

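However the wording changes, the VO (格式化 "format", 硬盘 "hard disk") stays constant. With this VO matching as the foundation, QA or human-machine dialogue is not far off. For example, an FAQ archive may well contain titles such as 格式化硬盘的步骤 (steps for formatting a hard disk) or 关于格式化硬盘 (about formatting a hard disk). Q and A are then basically matched on the SVO subtree: "格式化" ---O---> "硬盘".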
0901b
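A minimal sketch of this VO subtree matching, assuming parses are given as hand-written (head, relation, dependent) edges. The edge lists, the function names `vo_pairs` and `match_faq`, and the third FAQ title "安装打印机" are all hypothetical and only serve the illustration.

```python
def vo_pairs(edges):
    """Collect (verb, object) pairs from (head, relation, dependent) edges."""
    return {(head, dep) for head, rel, dep in edges if rel == "O"}


def match_faq(question_edges, faq):
    """Return FAQ titles that share at least one V-O pair with the question."""
    wanted = vo_pairs(question_edges)
    return [title for title, edges in faq.items() if wanted & vo_pairs(edges)]


# Assumed parse of a question like "请告诉我怎样格式化硬盘".
question = [
    ("告诉", "O", "我"),
    ("格式化", "O", "硬盘"),
    ("格式化", "Mod", "怎样"),
]

# Assumed parses of FAQ titles.
faq = {
    "格式化硬盘的步骤": [("步骤", "Mod", "格式化"), ("格式化", "O", "硬盘")],
    "关于格式化硬盘": [("格式化", "O", "硬盘")],
    "安装打印机": [("安装", "O", "打印机")],
}

print(match_faq(question, faq))
```

The carrier verb 告诉 also yields a V-O pair, but it finds no partner on the FAQ side, so only the 格式化 ---O---> 硬盘 subtree drives the match.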

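Let me extend this topic a little. IE stands for information extraction, and most of the time this "information" is equated with insights (intelligence, valuable information). In fact, IE can extract valuable intelligence, and it can equally extract worthless information (noise).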

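Why extract worthless information? The reason is simple: noise gets in the way, and to remove noise you must first identify it, that is, extract it in order to throw it away. The method can be exactly the same. The search community has stop words, which are discarded as noise. That is the simplest form of noise: it needs no context and consists purely of high-frequency function words. For parsing, these stop words are actually crucial, the necessary bridges for building structure, but for keyword search, which has no structure, they become pure noise. Using IE to remove noise means deciding from the contextual structure which information should be discarded. In the sentences above, for example, in the pragmatic setting of QA one can remove phrases such as “请告诉我” (please tell me) and “我不知道” (I don't know), so that the key VO “格式化-硬盘” stands out. For similarity computation, these words are all noise. Can “请告诉我” be treated as a 4-gram stop word? Yes, but if such expressions come in many variants, n-grams no longer work. That is when IE over subtrees becomes a very attractive way to extract noise. And because noise can mostly be handled in a word-driven way, this is quite reliable and hits the target almost every time.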
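The contrast between a fixed 4-gram stop word and word-driven noise extraction can be sketched as below. As a simplification, the sketch uses regular expressions over the raw string as a stand-in for subtree patterns; the patterns, the variant phrasings, and the function names are assumptions made for illustration, not rules from the original system.

```python
import re

# A fixed n-gram stop list covers exactly one surface form.
NGRAM_STOP = ["请告诉我"]

# Word-driven patterns keyed on 告诉 / 知道-type words cover many variants.
# (Hypothetical patterns; a real system would match on parse subtrees.)
NOISE_PATTERNS = [
    re.compile(r"(请|麻烦)?(你|您)?(能|可以)?告诉我(一下)?"),
    re.compile(r"我(还|也)?不(太)?(知道|清楚|明白)"),
]

PUNCT = ",,。?? "


def strip_ngrams(text):
    """Remove only the exact stop n-grams."""
    for gram in NGRAM_STOP:
        text = text.replace(gram, "")
    return text.strip(PUNCT)


def strip_noise(text):
    """Remove every span matched by a word-driven noise pattern."""
    for pattern in NOISE_PATTERNS:
        text = pattern.sub("", text)
    return text.strip(PUNCT)


plain = "请告诉我怎样格式化硬盘"
variant = "我不太清楚如何格式化硬盘,麻烦您告诉我一下"

print(strip_ngrams(plain), strip_noise(plain))
print(strip_ngrams(variant), strip_noise(variant))
```

On the plain question both approaches leave the same core, while on the variant only the word-driven patterns expose the V-O core 格式化 + 硬盘.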

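To sum up: in general, if Q and A are phrased similarly, e.g. “格式化” + “硬盘”, then matching on the SVO level is enough to couple Q and A. If the phrasings differ greatly, or a relation or event has too many variants, add a layer of IE and do the matching on the IE semantics. SVO-based QA matching is the essence of intelligent search and can handle unpredictable questions. IE-based QA matching is predefined and domain-specific; it is not only precise but also copes with variants. The two approaches complement each other: one excels at in-domain precision, the other at open-domain breadth and recall. Both are far better than keywords, because they have structure. Viewed as a backoff scheme, IE comes first, SVO second, and keywords form the floor. That way both precision and coverage are taken care of.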
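The backoff order (IE first, SVO second, keywords as the floor) can be sketched as a simple cascade. The data layout, with one set of items per layer, and the name `match_layer` are assumptions for this illustration.

```python
LAYERS = ("IE", "SVO", "keywords")  # most specific layer first


def match_layer(question, candidate):
    """Return the most specific layer on which Q and A overlap, or None."""
    for layer in LAYERS:
        if question[layer] & candidate[layer]:
            return layer
    return None


# Hypothetical analyses of two questions and three candidate answers.
q1 = {"IE": {("professionOf", "李娜")},
      "SVO": {("从事", "O", "运动")},
      "keywords": {"李娜", "运动"}}
a1 = {"IE": {("professionOf", "李娜")}, "SVO": set(), "keywords": {"李娜"}}

q2 = {"IE": set(),
      "SVO": {("格式化", "O", "硬盘")},
      "keywords": {"格式化", "硬盘"}}
a2 = {"IE": set(),
      "SVO": {("格式化", "O", "硬盘")},
      "keywords": {"格式化", "硬盘", "步骤"}}
a3 = {"IE": set(), "SVO": set(), "keywords": {"硬盘"}}

print(match_layer(q1, a1), match_layer(q2, a2), match_layer(q2, a3))
```

A ranking function could use the returned layer as its primary sort key, so that IE matches outrank SVO matches, which in turn outrank bare keyword overlap.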

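In the end, for QA and for dialogue systems, parsing is the key technology of the core engine. QA ultimately comes down to establishing a mapping between Q and A, and the basis of that mapping is semantic matching. Deep parsing, together with the IE built on it, is the nuclear weapon of semantic matching.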

 


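Liwei's master's thesis: EChA experimental results (11)

Experiments in automatic translation from Esperanto into Chinese and English
-- Overview of the EChA machine translation system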


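[Acknowledgments]

From the very beginning, the development of an Esperanto-type machine translation system received warm support from my teacher Liu Yongquan (刘涌泉), who gave careful guidance on everything from the overall design to the handling of specific problems. During program design and on-machine debugging, my teacher Liu Zhuo (刘倬) also gave guidance many times, and the algorithms for some basic operations were provided by him as well. Now that the EChA system has produced its first results, the author expresses deep gratitude to both of them. Special thanks also go to Teacher Han (韩老师) of the computer room for her help in many ways. Without the convenience she provided, the EChA system could not possibly have been tested successfully in such a short time.

[Appendix 1] EChA experimental results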

 

(1) LA ORIGINALA TEKSTO / THE ORIGINAL TEXT / 世界语原文

(001) TIEL EVOLUIGHIS PLI KAJ PLI LA PLANADO PER MASHINOJ . (002) TIUJ MASHINOJ KOMENCE NUR ELKALKULIS LA DIKTITAJN MATEMATIKAJN PROBLEMOJN , KONFORME AL LA ENPROGRAMIGO . (003) LA ELEKTRONIKAN PROGRAMIGON PRETIGIS HOMOJ . (004) PLI POSTE , KIAM LA SCIODISKETOJ ESTIS ELTROVITAJ , LA PLENAN INDIKARON , ENDISKIGITAN , ONI METIS EN MASHINOJN KAJ ILI TIAMANIERE POVIS EN SI MEM AKUMULI SCIENCAN STOKON , PLI GRANDAN OL LA HOMA CERBO. (005) KAJ SE TEMIS EKZEMPLE PRI LA PLANADO DE ELEKTROMOTORO , ONI ENMETIS LA SHABLONDISKETON DE LA ELEKTROMOTOR-PLANADO , DONIS LA INDIKOJN DE LA DEZIRATA MOTORO ( KILOVATO , TENSIO , ROTACIO , TIPO , KTP ) , (006) POST KIO LA MASHINO MEM PROGRAMIGIS SIN KAJ FARIS LA KALKULOJN . POST KELKAJ MINUTOJ GHI JAM PRETE ELDONIS LA MEZUROJN : LA DIAMETRON DE LA ROTACIA PARTO , GHIAN LONGON, LA MEZUROJN DE LA KANELOJ , DRATOJ , LA VOLVONOMBRON , ENTUTE CHION BEZONATAN . (007) ECH PLI : BALDAU ESTIS ATINGITE , KE LA MASHINO FARIS LA TUTAN DESEGNON KAJ TRANSDONIS GHIN AL LA FABRIKO . (008) KOMPRENEBLE TIUJ < DESEGNOJ > NE ESTIS IDENTAJ KUN NIAJ PAPERDESEGNOJ . (009) ILI ESTIS DISKETOJ , KIUJ ENTENIS CHIUN DETALON . (010) TIAMANIERE LA PLANADON KAJ FABRIKADON DE LA MASHINOJ JAM PLENUMIS SAME MASHINOJ . (011) ILI PLANIS LA MENDITAN MASHINON , FABRIKIS , ECH KONTROLPROVIS GHIN KAJ LA FUSHAN FORJHETIS . (012) SED CHIO CHI ANKORAU OKAZIS SUB HOMA GVIDADO KAJ PLEJ GRAVE ESTIS , KE CHIO CHI BAZIGHIS SUR LA HOMA SCIO .

LA TEKSTO TRADUKITA EN LA ANGLAN / THE TEXT TRANSLATED INTO ENGLISH / 英语译文

(001) SO DEVELOPED MORE AND MORE THE PLANNING BY MACHINES . (002) THOSE MACHINES AT BEGINNING ONLY CALCULATED OUT THE DICTATED MATHEMATICAL PROBLEMS , ACCORDING TO THE PROGRAMMING . (003) MEN PREPARED THE ELECTRONIC PROGRAMMING . (004) MORE LATER , WHEN THE KNOWLEDGE-DISKETTES HAD BEEN FOUND OUT , PEOPLE PUT THE FULL INDICATION , ENDISKED , INTO MACHINES AND THEY THEREFORE COULD IN THEMSELVES ACCUMULATE SCIENTIFIC STOCK, MORE GREAT THAN THE MAN'SBRAIN . (005) AND IF IT CONCERNED FOR EXAMPLE ABOUT THE PLANNING OF ELECTRIC MOTOR, PEOPLE INPUT THE SAMPLE DISKETTE OF THE MOTOR PLANNING , GAVE THE INDICATIONS OF THE DESIRED MOTOR (KILOWATT , VOLTAGE , ROTATION , TYPE , ETC ) , AFTER WHICH THE MACHINE ITSELF PROGRAMMED ITSELF AND DID THE CALCULATIONS . (006) AFTER SEVERAL MINUTES IT ALREADY READILY GAVE OUT THE MEASUREMENTS : THE DIAMETER OF THE ROTARY PART ,ITS LENGTH , THE MEASUREMENTS OF THE GROOVES , WIRES , THE WINDING NUMBER , IN TOTAL ALL REQUIRED . (007) EVEN MORE : SOON IT HAD BEEN ACHIEVED , THAT THE MACHINE DID THE TOTAL DESIGN AND OVERHANDED IT TO THE FACTORY . (008) OF COURSE THOSE < DESIGNS >  WERE NOT IDENTICAL WITH OUR PAPERDESIGNS . (009) THEY WERE DISKETTES , WHICH CARRIED ALL DETAIL . (010) THEREFORE MACHINES ALREADY FULFILED THE PLANNING AND MANUFACTURING OF THE MACHINES SAMELY . (011) THEY PLANNED THE ORDERED MACHINE , MANUFACTURED , EVEN EXAMINED IT AND THREW AWAY THE USELESS . (012) BUT ALL THIS STILL HAPPENED UNDER MAN'S GUIDING AND IT WAS MOST IMPORTANT , THAT ALL THIS WAS BASED ON THE MAN'S KNOWLEDGE .

LA TEKSTO TRADUKITA EN LA CHINAN / THE TEXT TRANSLATED INTO CHINESE / 汉语译文

(001) 这样用机器设计越来越发展了. (002) 那些机器开始时仅仅按照输入程序计算出所命令的数学问题. (003) 人准备了电子程序设计. (004) 更以后,当微型知识磁盘被发明了时,人们把所写入磁盘的全套指令集合放到机器里面,他(它)们这样能在自己本身里面积累比人的头脑更大的科学贮蓄. (005) 如果涉及例如关于电动机的设计, 人们输入了电动机设计的微型样品磁盘, 给了所希望的电动机的指标(千瓦,电压,运转,型号,等等),在此以后机器本身把自己程序化了,做了计算. (006) 在几分钟以后它已经就能给出尺寸:运转部分的直径,它的长度,槽纹,导线的尺寸,圈数,总之所需要的一切. (007) 甚至更:很快达到了,机器做了整个图样,把它转交到工厂. (008) 当然那些<图样>与我们的图纸不是一样的. (009) 他(它)们是储有所有细节的微型磁盘. (010) 这样机器已经同样地完成了机器的设计和制造. (011) 他(它)们设计了所定购的机器,制造了,甚至检验了它,把废的抛弃了. (012) 但是这一切仍然在人的指导下进行,最重要的是,这一切以人的知识作为基础.

(2) DIVERSAJ FRAZOJ / VARIOUS SENTENCES / 各类文句

(016) KIAM MI ESTIS LUDANTA VIOLONON , MIA ONKLO VIZITIS NIAN HEJMON .
WHEN I WAS PLAYING VIOLIN , MY UNCLE VISITED OUR HOME .
当我(当时)正在拉小提琴时,我的叔叔访问了我的家.

(020)  MI ESTOS FININTA LA EKSPERIMENTON PRI MASHINA TRADUKADO POST KELKAJ MONATOJ .
I WILL HAVE FINISHED THE EXPERIMENT ABOUT MACHINE'S TRANSLATING IN SEVERAL MONTHS.
我在几月以后将已经完成关于机器的翻译的实验.

(028)  BABELO NE ESTIS ELKONSTRUITA.
BABEL HAD NOT BEEN BUILT UP .
巴贝尔塔没有被建成.

(029)  NEPRE ESTOS ELKONSTRUITA LA NOVA BABELO .
ABSOLUTELY WILL HAVE BEEN BUILT UP THE NEW BABEL .
新巴贝尔塔必然地将被建成.

(040)  KIAL VI LERNAS ESPERANTON ?
WHY DO YOU LEARN ESPERANTO ?
为什么你学习世界语?

(044)  NE PROKRASTU LA HODIAUAN LABORON GHIS MORGAU .
DON'T PUT OFF THE TODAY'S WORK TILL TOMORROW .
别把今天的工作推迟到明天.

(045)  KIEL BONE PENTRAS LA KNABO !
HOW WELL THE BOY PAINTS !
男孩多么好地画画啊!

(048)  KIU ESTAS LA AUTORO DE LA LIBRO , KIUN VI JHUS LEGIS ?
WHO IS THE AUTHOR OF THE BOOK , WHICH YOU JUST READ ?
你刚刚读了的书的作者是谁?

(050)  SE MI PARTOPRENUS EN VIA AMUZA AKTIVADO , MI ESTUS TRE GHOJA .
IF I WOULD TAKE PART IN YOUR RECREATIONAL ACTIVITY , I WOULD BE VERY GLAD .
如果我参加你(们)的文娱活动,我会是很高兴的.

(056)  CHU VI MEMORAS LA TAGOJN , KIAM NI KUNE STUDIS EN LA UNIVERSITATO ?
DO YOU REMEMBER THE DAYS , WHEN WE TOGETHER STUDIED IN THE UNIVERSITY ?
你记得我们在一起在大学里面学习的日子吗?

(058)  UNUIGHU PROLETOJ DE CHIUJ LANDOJ !
LET PROLETARIANS OF ALL COUNTRIES UNITE !
让所有国家的无产者联合吧!

(061)  KIEL SAGHA VI ESTAS !
HOW WISE YOU ARE !
你是多么聪明啊!

(062)  ESPERANTO ESTAS INTERNACIA HELPA LINGVO .
ESPERANTO IS INTERNATIONAL HELP LANGUAGE .
世界语是国际辅助语言.

(067)  LIA PROPONO ESTAS , KE NI CHIUJ LIBERE ELMETU NIAJN OPINIOJN .
HIS PROPOSAL IS , THAT WE ALL FREELY OUTPUT OUR OPINIONS .
他的建议是,让我们所有人自由地提出我们的意见.

(068)  MI NE SCIAS , KIAM KOMENCIGHOS NIAJ FERIOJ .
I DON'T KNOW , WHEN WILL BEGIN OUR HOLIDAYS .
我不知道,我们的假日什么时候将开始.

(069)  LA LIBRO , KIU KUSHAS SUR LA TABLO , ESTAS VERDA .
THE BOOK , WHICH LIES ON THE TABLE , IS GREEN .
在桌子上躺的书是绿的.

(071)  LA INFANO PLORAS , CHAR IU LIN BATIS .
THE CHILD CRIES , BECAUSE SOMEBODY BEAT HIM .
小孩哭,因为某人打了他.

(078)  LERNI ESPERANTON NE ESTAS MALFACILE .
TO LEARN ESPERANTO IS NOT DIFFICULT .
学习世界语不是困难的.

(084)  MI NE SCIAS , CHU VI POVAS PLENUMI TIUN CHI TASKON .
I DON'T KNOW , WHETHER YOU CAN FULFIL THIS TASK .
我不知道,是否你能完成这个任务.

(086)  MULTAJ DIVERSLANDAJ ESPERANTISTOJ CHEESTOS LA UNIVERSALAN KONGRESON DE ESPERANTO OKAZONTAN PEKINE .
A LOT OF VARIOUS COUNTRY'S ESPERANTISTS WILL ATTEND THE UNIVERSAL CONGRESS OF ESPERANTO TO BE HELD IN BEIJING .
许多不同国家的世界语者将参加在北京将召开的世界语的国际大会.

(089)  LIA PROPONO ELEKTI NOVAN PREZIDANTON NE ESTIS AKCEPTITA .
HIS PROPOSAL TO ELECT NEW PRESIDENT HAD NOT BEEN ACCEPTED .
他的选举新总统的建议没有被接受.

(090)  SHI ESTAS LA PLEJ BELA EL LA KNABINOJ .
SHE IS THE MOST BEAUTIFUL OF THE GIRLS .
她在女孩里面是最漂亮的.

(092)  FALINTE , LI NE POVIS RELEVIGHI .
HAVING FALLEN , HE COULD NOT GET UP .
摔倒了,他不能重新起来.

(093)  FORIRONTE , LI PREMIS MIAN MANON .
TO GO AWAY , HE SHOOK MY HAND .
将要离去,他握了我的手.

(098)  MI TRE AMAS ESPERANTON , MI PLI AMAS ESPERANTISTOJN , MI PLEJ AMAS LA IDEALON DE ESPERANTO .
I VERY MUCH LOVE ESPERANTO , I MORE LOVE ESPERANTISTS , I MOST LOVE THE IDEAL OF ESPERANTO .
我很爱世界语,我更爱世界语者,我最爱世界语的理想.

(116)  NI LUDU , CHU BONE ?
LET'S PLAY , ALL RIGHT ?
让我们玩吧,好吗?

(119)  KIA MIRAKLO TIO ESTAS , KE NIAJ ANTIKVULOJ KONSTRUIS LA GRANDAN MURON NUR PER SIAJ DU MANOJ !
WHAT MIRACLE IT IS , THAT OUR ANCESTORS BUILT THE GREAT WALL ONLY BY THEIR TWO HANDS !
我们的祖先仅仅用自己的两手建造了长城,这是怎样的奇迹啊!

(121)  FORPASIS UNU TAGO , FORPASIS ANKAU LA DUA .
PASSED AWAY ONE DAY , PASSED AWAY ALSO THE SECOND .
一天过去了,第二也过去了.

(122)  CHU ESTAS EBLE , KE VI NENION SCIAS ?
IS IT POSSIBLE , THAT YOU KNOW NOTHING ?
你不知道任何事,这是可能的吗?

(131)  LA HOMON , PRI KIU VI PAROLAS , MI NENIAM VIDIS .
I NEVER SAW THE MAN , ABOUT WHOM YOU SPEAK .
我从未看见过你提到的人.

(132)  NI , ESPERANTISTOJ , DEVAS LABORI PLI ENERGIE OL IAM .
WE , ESPERANTISTS , MUST WORK MORE HARD THAN EVER .
我们,世界语者,应该比任何时候更努力工作.

(133)  SOMERE ESTAS TRE VARME .
IN SUMMER IT IS VERY HOT .
夏天是很热的.

(134)  DOKTORO ZAMENHOF NASKIGHIS LA 15-AN DE DECEMBRO EN 1859 .
DOCTOR ZAMENHOF WAS BORN ON THE 15TH OF DECEMBER IN 1859 .
柴门霍夫博士1859年十二月的15号出生.

(135)  SE VI SCIUS , KIU LI ESTAS , VI LIN PLI ESTIMUS .
IF YOU WOULD KNOW , WHO HE IS , YOU MORE WOULD ESTEEM HIM .
如果你知道,他是谁,你更会尊敬他.

(136)  CENTOJ DA MALFERMAJ AUTOJ NIN PORTIS AL LA CENTRA LENIN-STADIONO, MALRAPIDE MOVIGHANTE TRA LA HOMA SVARMO .
HUNDREDS OF OPEN CARS CARRIED US TO THE CENTRAL LENIN STADIUM , SLOWLY MOVING THROUGH THE MAN'S SWARM .
成百敞篷汽车把我们带到中央列宁运动场,缓慢地通过人群运动.

(137)  MI VIDIS , KE LI FALIS KAJ LIA VESTO MALPURIGHIS .
I SAW , THAT HE FELL AND HIS CLOTHES BECAME DIRTY .
我看见了,他摔倒了,他的衣服弄脏了.

(139)  MI SCIIS , KE LI NE FAROS , KION LI PROMESIS .
I KNEW , THAT HE WOULD NOT DO WHAT HE PROMISED .
我知道,他将不做他允诺的.

(140)  ESTAS PAULO , KIU ARANGHIS LA AFERON .
IT IS PAULO THAT ARRANGED THE AFFAIR .
是PAULO安排了事情.

(142)  KUREGIS LA KNABO PER SIA TUTA FORTO , SED LI NE POVIS ATINGI LA PAPILION .
RAN THE BOY BY HIS TOTAL STRENGTH , BUT HE COULD NOT ACHIEVE THE BUTTERFLY .
男孩用自己的整个力量狂奔,但是他不能达到蝴蝶.

(144)  LI DONIS AL MI MULTAJN INSTRUAJN LIBROJN .
HE GAVE ME A LOT OF TEACHING BOOKS .
他给了我许多教科书.

(145)  CHU VI PAROLAS CHINE AU JAPANE ?
DO YOU SPEAK IN CHINESE OR IN JAPANESE ?
你用中文还是用日文说话?

(151)  NUR TIU NE ERARAS , KIU NENIAM ION FARAS .
ONLY THAT PERSON IS NOT WRONG , WHO NEVER DOES SOMETHING .
仅仅从不做某事的那个人不犯错误.

(155)  ESPERANTO ESTAS CHIES PROPRAJHO .
ESPERANTO IS EVERYBODY'S PROPERTY .
世界语是所有人的财产.

(156)  MI MEMORAS CHIUN , KIUN MI VIDIS .
I REMEMBER ALL , WHOM I SAW .
我记得我看见了的所有人.

(157)  ESTAS NENIU EN LA CHAMBRO .
THERE IS NOBODY IN THE ROOM .
在房间里面没有任何人.

(3) DU POEMOJ / TWO POEMS / 两首诗歌

(099) LA ESPERO : ESPERANTISTA HIMNO ( POEMO FAR ZAMENHOF ) .

(100) EN LA MONDON VENIS NOVA SENTO ,
TRA LA MONDO IRAS FORTA VOKO ;
(101) PER FLUGILOJ DE FACILA VENTO ,
NUN DE LOKO FLUGU GHI AL LOKO .

(102) NE AL GLAVO SANGONSOIFANTA ,
GHI LA HOMAN TIRAS FAMILION ;
(103) AL LA MOND' ETERNE MILITANTA ,
GHI PROMESAS SANKTAN HARMONION .
(099) THE HOPE : ESPERANTIST'S HYMN ( POEM BY ZAMENHOF ) .

(100) INTO THE WORLD CAME NEW FEELING ,
OVER THE WORLD GOES STRONG VOICE ;
(101) BY WINGS OF EASY WIND ,
NOW FROM PLACE LET IT FLY TO PLACE .
(102) NOT TO SWORD BLOODTHIRSTY ,
IT PULLS THE MAN FAMILY ;
(103) TO THE WORLD EVER FIGHTING ,
IT PROMISES SACRED HARMONY .

(099) 希望: 世界语者的颂歌 (柴门霍夫所作的诗歌).

(100) 新感觉来到了世界,
有力的声音走遍世界;
(101) 用顺风的翅膀,
现在让它从一个地方飞到另一个地方吧.

(102) 它不把人的家庭
引到渴血的刀剑;
(103) 向永远战争着的世界,
它允诺神圣的和谐.

(104) AL NIA KARA LINGVO ( FAR IU NOVA ESPERANTISTO ) .

(105) LA LINGVO GRACIA , KARA MIA ,
GHIS KIAM VI VENIS AL MI FINE FIN ?
(106) ATENDIS SOIFE MI , ETERNE VIA ,
MI AMAS VIN !

(107) MI AMAS VIN VERE , PRUVU DIO ,
KAJ MIA BON-KORO BATAS NUR POR VI ;
(108) NE PLU SEKRETETO ESTAS TIO :
VIN AMAS MI !

(109) CHU KREDAS VI MIAN AMON MARAN ?
(110) CHU KREDAS , KE MIA KORO FLAMAS ?
(111) CHU KREDAS LA VORTON PURE KARAN :
VIN MI AMAS !

(104) TO OUR DEAR LANGUAGE ( BY SOME NEW ESPERANTIST ) .

(105) THE LANGUAGE GRACEFUL , MY DEAR ,
TILL WHEN YOU CAME TO ME AT LAST ?
(106) WAITED LONGINGLY I , EVER YOURS ,
I LOVE YOU !

(107) I LOVE YOU TRUELY , LET GOD PROVE ,
AND MY GOOD HEART BEATS ONLY FOR YOU ;
(108) NO LONGER THAT IS LITTLE SECRET :
I LOVE YOU !

(109) DO YOU BELIEVE MY LOVE LIKE SEA ?
(110) DO BELIEVE , THAT MY HEART BURNS ?
(111) DO BELIEVE THE WORD PURELY DEAR :
I LOVE YOU !

(104) 献给我们的亲爱的语言(某新世界语者所作).

(105) 优美的语言,我的亲爱的,
到什么时候你最后来到了我这儿?
(106) 我渴望地等待,你的永远的,
我爱你!

(107) 我真实地爱你,让上帝证明吧,
我的善良的心仅仅为了你跳动;
(108) 那已经不再是小秘密:
我爱你!

(109) 你相信我的大海一样的爱吗?
(110) 相信,我的心燃烧吗?
(111) 相信纯粹地亲爱的词吗:
我爱你!

 

 

 Outline of an HPSG-style Chinese reversible grammar*

Wei  LI
Simon Fraser University
(NLWC97)

This paper presents the outline and the design philosophy of a lexicalized Chinese unification grammar named W‑CPSG. W‑CPSG covers Chinese morphology, Chinese syntax and semantics in a novel integrated language model. The grammar works reversibly and is suited for both parsing and generation. This work is developed in the general spirit of the linguistic theory Head-driven Phrase Structure Grammar (Pollard & Sag 1994). We identify the following two problems as major obstacles in formulating a precise and efficient Chinese grammar. First, serious study of the Chinese lexical base is lacking, and we often jump too soon to linguistic generalization. Second, there is a lack of effective interaction and adequate interface between morphology, syntax and semantics. We address these problems in depth with the lexicalized and integrated design of W‑CPSG. We also illustrate how W‑CPSG is formalized and how it works.

 

  1. Background

Unification grammars have been extensively studied in the last decade (Shieber 1986). Implementations of such grammars for English are being used in a wide variety of applications. Attempts also have been made to write Chinese unification grammars (Huang 1986, among others). W‑CPSG (for Wei's Chinese Phrase Structure Grammar, Li, W. 1997b) is a new endeavor in this direction, with its unique design and characteristics.

1.1. Design philosophy

We identify the following two problems as major obstacles in formulating a precise and efficient Chinese grammar. First, serious study of the Chinese lexical base is lacking, and we often jump too soon to linguistic generalization. Second, there is a lack of effective interaction and adequate interface between morphology, syntax and semantics. We address these problems in depth with the lexicalized and integrated design of W‑CPSG.

1.1.1. Lexicalized design

It has been widely accepted that a well-designed lexicon is crucial for a successful grammar, especially for a natural language computational system. But Chinese linguistics in general, and Chinese computational grammars in particular, have generally lacked in-depth research on the Chinese lexical base. For many years, most dictionaries published in China did not even contain grammatical category information in their lexical entries (except for a few dictionaries intended for foreign readers learning Chinese). English dictionaries such as the Oxford Advanced Learner's Dictionary and the Longman Dictionary of Contemporary English embody a sophisticated design and rich linguistic information; Chinese linguistics, by comparison, is hampered by the lack of such reliable lexical resources.

In the last decade, however, Chinese linguists have achieved significant progress in this field. The publication of 800 Words in Contemporary Mandarin (Lü et al., 1980) marked a milestone for Chinese lexical research. This book is full of detailed linguistic description of the most frequently used Chinese words and their collocations. Since then, Chinese linguists have made fruitful efforts, marked by the publication of a series of valency dictionaries (e.g. Meng et al., 1987) and books  (e.g. Li, L. 1986, 1990). But almost all such work was done by linguists with little knowledge of computational linguistics. Their description lacks formalization and consistency. Therefore, Chinese computational linguists require patience in adapting and formalizing these results, making them implementable.

1.1.2. Integrated design

Most conventional grammars assume a successive model of morphology, syntax and semantics. We argue that this design is not adequate for Chinese natural language processing. Instead, an integrated grammar of morphology, syntax and semantics is adopted in W‑CPSG.

Let us first discuss the rationale of integrating morphology and syntax in Chinese grammar. As it stands, a written Chinese sentence is a string of characters (morphemes) with no blanks to mark word boundaries. In conventional systems, there is a procedure-based Chinese morphology preprocessor (so-called segmenter). The major purpose for the segmenter is to identify a string of words to feed syntax. This is not an easy task, due to the possible involvement of the segmentation ambiguity. For example, given a string of 4 Chinese characters da xue sheng huo, the segmentation ambiguity is shown in (1a) and (1b) below.

(1)        da xue sheng huo

(a)        da-xue                | sheng-huo
           university            | life

(b)        da-xue-sheng          | huo
           university-student    | live

The resolution of the above ambiguity in the morphology preprocessor is a hopeless job because such structural ambiguity is syntactically conditioned. For sentences like da xue sheng huo you qu (university life is interesting), (1a) is the right identification. For sentences like da xue sheng huo bu xia qu le (university students cannot make a living), (1b) is right. So far there are no segmenters which can handle this properly and guarantee correct word segmentation (Feng 1996). In fact, there can never be such segmenters as long as syntax is not brought in. This is a theoretical defect of all Chinese analysis systems in the morphology-before-syntax architecture (Li, W. 1997a). I have solved this problem in our morphology-syntax integrated W‑CPSG (see 2.2. below).
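The ambiguity in (1) can be reproduced with a small sketch that enumerates every segmentation licensed by a lexicon. The four-entry lexicon is an assumption restricted to the words in (1a) and (1b), and the function name `segmentations` is hypothetical. The point of the sketch is that a lexicon alone yields both readings; choosing between them needs the syntactic context.

```python
def segmentations(syllables, lexicon):
    """Return every way to split the syllable list into lexicon words."""
    if not syllables:
        return [[]]
    results = []
    for i in range(1, len(syllables) + 1):
        word = "-".join(syllables[:i])
        if word in lexicon:
            for rest in segmentations(syllables[i:], lexicon):
                results.append([word] + rest)
    return results


# Assumed toy lexicon: only the four words from (1a) and (1b).
LEXICON = {"da-xue", "sheng-huo", "da-xue-sheng", "huo"}

readings = segmentations(["da", "xue", "sheng", "huo"], LEXICON)
for reading in readings:
    print(" | ".join(reading))
```

Both readings survive lexicon look-up, which is why a segmenter that runs before syntax has no principled basis for preferring one of them.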

Now we examine the motivation for integrating syntax and semantics in Chinese grammar. It has been observed that, compared with the analysis of Indo-European languages, proper Chinese analysis relies more heavily on semantic information (see, e.g. Chen 1996, Feng 1996). Chinese syntax is not as rigid as that of languages with inflections. Semantic constraints are called for in both structural and lexical disambiguation, as well as in containing computational complexity. The integration of syntax and semantics helps establish flexible ways for them to interact in analysis (see 2.3. below).

1.2. Major theoretical foundation: HPSG

The work on W‑CPSG is developed in the spirit of the linguistic theory Head-driven Phrase Structure Grammar (HPSG, proposed by Pollard & Sag, 1987). HPSG is a highly lexicalist theory, which encourages the integration of different components. This matches our design philosophy for implementing our Chinese computational grammar. HPSG serves as a desired framework to start this research with. We benefit most from the general linguistic ideas in HPSG. However, W‑CPSG is not confined to the theory-internal formulations of principles and rules and other details in HPSG versions (e.g. Pollard & Sag 1987, 1994 or later developments). We borrow freely from other theoretical sources or form our own theories in W‑CPSG to meet our goal of Natural Language Processing in general and Chinese computing in particular. For example, treating morphology as an integrated part of parsing and placing it right into grammar is our deliberate choice. In syntax, we formulate our own theory for configuration and word order. Our semantics differs most from any standard version of situation-semantics-based theory in HPSG. It is based on insights from Tesnière's Dependency Grammar (Tesnière 1959), Fillmore's Case Grammar (Fillmore 1968) and  Wilks' Preference Semantics (Wilks 1975, 1978) as well as our own semantic view for knowledge representation and better coordination of syntax-semantics interaction (Li, W. 1996). For these differences and other modifications, it is more accurate to regard W‑CPSG as an HPSG-style Chinese grammar, rather than an (adapted) version of Chinese HPSG.

  2. Integrated language model

2.1. W‑CPSG versus conventional Chinese grammar

The lexicalized design sets the common basis for the organization of the grammar in W‑CPSG. This involves the interfaces of morphology, syntax and semantics.[1]   W‑CPSG assumes an integrated language model of its components (see Figure 1).  The W‑CPSG model is in sharp contrast to the conventional clear-cut successive design of grammar components (see Figure 2).

 

 lw1

Figure 2.  conventional language model (non-reversible)

2.2. Interfacing morphology and syntax

As shown in Figure 2 above, conventional  systems take a two-step approach: a procedure-based preprocessor for word identification (without discovering the internal structure) and a grammar for word-based parsing. W‑CPSG takes an alternative one-step approach and the parsing is character- (i.e. morpheme-) based. A morphological PS (phrase structure) rule is designed not only to identify candidate words but to build word‑internal structures as well. In other words, W‑CPSG is a self-contained model, directly accepting the input of a character string for parsing. The parse tree embodies both the morphological analysis and the syntactic analysis, as illustrated by the following sample parsing chart.

lw6

Note:    DET for determiner; CLA for classifier; N for noun; DE for particle de;
AF for affix; V for verb; A for adjective; CLAP for classifier phrase;
NP for noun phrase; DEP for DE-phrase

This is so-called bottom-up parsing. It starts with lexicon look-up. Simple edges 1 through 7 are lexical edges. Combined edges are phrasal edges. Each edge represents a sign, i.e. a character (morpheme), a word, a phrase or a sentence. Lexical edges result from a successful match between the signs in the input string and the entries in the lexicon during lexicon look-up. After lexicon look-up, the lexical information for the signs is made available to the parser. For the sake of concise illustration, we show only two crucial pieces of information for each edge in the chart, namely category and interpretation, separated by a colon (some function words are labeled for category only). The parser attempts to combine the edges according to the PS rules in the grammar until a parse is found. A parse is an edge which ranges over the whole string. The parse ((((1+2)+3)+4)+((5+6)+7)) represents the following binary structural tree embodying both the morphological and syntactic analysis of this NP phrase.

lw5

As seen, word identification is no longer a pre-condition for parsing. It becomes a natural by-product of parsing in this integrated grammar of morphology and syntax: a successful parse always embodies the right word identification. For example, the parse ((((1+2)+3)+4)+((5+6)+7)) includes the identification of a word-string zhe (DET) ben (CLA) shu (N) de (DE) ke-du-xing (N). An argument against the conventional separation model is that there exists in the two-step approach a theoretical threshold beyond which the precision for the correct word identification is not possible. This is because proper word identification in Chinese is to a considerable extent syntactically conditioned due to  possible structural ambiguity involved. Our strategy has advantages over the conventional approach  in  resolving word identification ambiguities and in handling the productive word formation. It has solved the problems inherent in the morphology-before-syntax architecture (for detailed argumentation, see Li, W. 1997a).
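The character-based bottom-up parse can be imitated with a small chart parser. This is a rough sketch, not the W‑CPSG implementation: the rules are category-only binary rules invented for this one example, the category assigned to each character is inferred from the Note and from the word string above (ke as AF and du as V are assumptions), and feature structures and semantic constraints are left out entirely.

```python
# Lexical categories for the seven characters (lexical edges 1 through 7).
LEXICON = {"zhe": "DET", "ben": "CLA", "shu": "N", "de": "DE",
           "ke": "AF", "du": "V", "xing": "AF"}

# Assumed binary PS rules: (left category, right category) -> mother.
RULES = {
    ("DET", "CLA"): "CLAP",
    ("CLAP", "N"): "NP",
    ("NP", "DE"): "DEP",
    ("AF", "V"): "A",    # morphological: prefix ke- + verb
    ("A", "AF"): "N",    # morphological: adjective + suffix -xing
    ("DEP", "N"): "NP",
}


def parse(chars):
    """Bottom-up chart parse; returns {category: bracketing} for the full span."""
    n = len(chars)
    chart = {}
    # Lexicon look-up creates the lexical edges.
    for i, ch in enumerate(chars):
        chart[(i, i + 1)] = {LEXICON[ch]: str(i + 1)}
    # Combine adjacent edges into ever wider phrasal edges.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = {}
            for k in range(i + 1, j):
                for lcat, ltree in chart[(i, k)].items():
                    for rcat, rtree in chart[(k, j)].items():
                        mother = RULES.get((lcat, rcat))
                        if mother:
                            cell[mother] = f"({ltree}+{rtree})"
            chart[(i, j)] = cell
    return chart[(0, n)]


result = parse(["zhe", "ben", "shu", "de", "ke", "du", "xing"])
print(result)
```

The same rule table handles word-internal structure (5+6, then +7) and phrase structure, so the word ke-du-xing is identified as a side effect of finding the parse rather than by a separate segmenter.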

2.3. Interaction of syntax and semantics

The interface and interaction of syntax and semantics are of vital importance in a Chinese grammar. We share the opinion of Chen (1996) and many others that it is more effective to analyze Chinese in an environment where semantic constraints are enforced during parsing, not after. The argument is based on the linguistic characteristics of Chinese. Chinese has no inflection (like English ‑'s, ‑s, ‑ing, ‑ed, etc.) and no formatives such as articles (like English a, the), an infinitivizer (like English to) or a complementizer (like English that). Instead, function words and word order are used as the major syntactic devices. But Chinese function words (prepositions, aspect particles, passive particle, plural suffix, conjunctions, etc.) can often be omitted (Lü et al. 1980, p.2). Moreover, the fixed word order that is usually assumed to mark syntactic functions in isolating languages is to a considerable extent untrue of Chinese. In fact, there is remarkable freedom or flexibility in Chinese word order. One typical example is the numerous word order variations of the Chinese transitive patterns (although the default order is S‑V‑O, subject-verb-object) (Li, W. 1996). Taken together, these facts project a picture of Chinese as a language of loose syntactic constraint. A weak syntax requires some support beyond syntax to enhance grammaticality. Semantic constraints are therefore called for. I believe that an effective way to model this interaction between syntax and semantics is to integrate the two in one grammar.

One strong piece of evidence for this syntax-semantics integration argument is that Chinese has what I call syntactically crippled structures. These are structures which can hardly be understood on purely formal grounds and are usually judged ungrammatical unless supported by semantic constraints (i.e. the match of semantic selection restrictions). Some Chinese NP predicates (Li, W. & McFetridge 1995) and transitive patterns like S‑O‑V (Li, W. 1996), among others, are such structures. The NP predicate is a typical instance of semantic dependence. It is highly undesirable to assume a general rule like S --> NP1 NP2 in a Chinese grammar to capture such phenomena. This is because there is a semantic condition for NP2 to function as predicate, which makes the Chinese NP predicate a very restricted pattern. For example, in the sentence This table is three-legged: zhe (this) zhang (classifier) zhuo-zi (table) san (three) tiao (classifier) tui (leg), the subject must be of the semantic type animate or furniture (which can have legs). The general rule with no recourse to semantic constraints is simply too productive and may cause severe computational complexity. In the case of Chinese transitive patterns, formal means are decisive for some variations in their interpretation (i.e. role assignment) process. But others depend heavily on semantic constraints. Take chi (eat) as an example. There is no difference in syntactic form in sentences like wo (I) chi (eat) dianxin (Dim-Sum) le (perfect-aspect) and dianxin (Dim-Sum) wo (I) chi (eat) le (perfect-aspect). Who eats what? To properly assign roles to NP1 NP2 V as S-O-V versus O-S-V, the semantic constraint animate eats food needs to be enforced.
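How the constraint animate eats food decides between S-O-V and O-S-V for an NP1 NP2 V string can be sketched as follows. The two lookup tables and the function `assign_roles` are hypothetical simplifications; W‑CPSG encodes such knowledge in lexicalized feature structures rather than flat dictionaries.

```python
# Assumed semantic types of the nouns in the example.
SEMANTIC_TYPE = {"wo": "animate", "dianxin": "food"}

# Assumed selection restrictions of the verb: animate eats food.
SELECTION = {"chi": {"agent": "animate", "patient": "food"}}


def assign_roles(np1, np2, verb):
    """Assign agent and patient to NP1 NP2 V using selection restrictions."""
    need = SELECTION[verb]
    t1, t2 = SEMANTIC_TYPE[np1], SEMANTIC_TYPE[np2]
    if t1 == need["agent"] and t2 == need["patient"]:
        return {"pattern": "S-O-V", "agent": np1, "patient": np2}
    if t2 == need["agent"] and t1 == need["patient"]:
        return {"pattern": "O-S-V", "agent": np2, "patient": np1}
    return None  # the semantic constraint is not satisfied: no analysis


print(assign_roles("wo", "dianxin", "chi"))
print(assign_roles("dianxin", "wo", "chi"))
```

The two inputs have the same syntactic form, and only the semantic types of the nouns tell the analyses apart.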

The conventional syntax-before-semantics model has lost popularity in the Chinese computing community. Researchers have been exploring various ways of integrating syntax and semantics in Chinese grammar (Chen 1996). In W‑CPSG, the Chinese syntax is enhanced by the incorporation of a semantic constraint mechanism. This mechanism embodies a lexicalized knowledge representation, which parallels the syntactic representation in the lexicon. I have developed a way to dynamically coordinate the syntactic constraint and the semantic constraint in one model. This technique proves to be effective in handling rhetorical expressions and in making the grammar both precise and robust (Li, W. 1996).

 

  3. Lexicalized formal grammar

3.1. Formalized grammar

The applied nature of this research requires that we pay as much attention to the practical issues of computational systems as to a sound theoretical design. All theories and rule formulations in W‑CPSG are implementable. In fact, most of them have been implemented in our prototype W‑CPSG. W‑CPSG is a strictly formalized grammar that does not rely on undefined notions. The whole grammar is represented by typed feature structures (TFS), as defined below based on Carpenter & Penn (1994).

(3)        Definition: typed feature structure 

A typed feature structure is a data structure adopted to model a certain object of a grammar. The necessary part for a typed feature structure is type. Type represents the classification of the feature structure. A simple feature structure contains only the type information, but a complex feature structure can introduce a set of feature-value pairs in addition to the type. A feature-value pair consists of a feature and a value. A feature reflects one aspect of an object. The value describes that aspect. A value is itself a feature structure (simple or complex). A feature determines which type of feature structures it takes as its value. Typed feature structures are finite in a grammar. Their definition constitutes the typology of the grammar.
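The definition can be mirrored by a small data structure. This is a simplified sketch, not the formalism of Carpenter & Penn (1994): type hierarchies and subtyping are ignored, appropriateness is a flat table of the value type each feature takes, and the names `TFS`, `APPROPRIATE` and `well_typed` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TFS:
    """A typed feature structure: a type plus optional feature-value pairs."""
    type: str
    features: dict = field(default_factory=dict)  # feature name -> TFS

    def is_simple(self):
        # A simple feature structure carries only its type.
        return not self.features


# Assumed fragment: which type of value each feature takes.
APPROPRIATE = {"CATEGORY": "category", "CONTENT": "content"}


def well_typed(fs):
    """Check that every feature takes a value of the type it determines."""
    return all(
        name in APPROPRIATE
        and value.type == APPROPRIATE[name]
        and well_typed(value)
        for name, value in fs.features.items()
    )


good = TFS("a_sign", {"CATEGORY": TFS("category"), "CONTENT": TFS("content")})
bad = TFS("a_sign", {"CATEGORY": TFS("content")})

print(good.is_simple(), well_typed(good), well_typed(bad))
```

Because a value is itself a `TFS`, complex structures nest to any depth, and the check recurses through them.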

With this formal device of typed feature structures, we formulate W‑CPSG by defining from the very basic notions (e.g. sign, morpheme, word, phrase, S, NP, VP, etc.) to rules (PS rules and lexical rules), lexical items, lexical hierarchy and typology (hierarchy embodied in feature structures) (Li, W. 1997b). The following sample definitions of some basic notions illustrate the formal nature of W‑CPSG. Please note that they are system-internal definitions and are used in W‑CPSG to serve the purpose of configurational constraints (see Chapter VI of Li, W. 1997b).

(4)        Definition: sign [2]

a_sign
KANJI kanji
MORPH expected
CATEGORY category
COMP0 expected
COMP1 expected
COMP2 expected
MOD expected
KNOWLEDGE knowledge
CONTENT content
DTR dtr

A sign is the most fundamental concept of grammar. A sign is a dynamic unit of grammatical analysis. It can be a morpheme, a word, a phrase or a sentence. Formally, a sign is defined by the TFS a_sign, which introduces a set of linguistic features for its description, as shown above. These features include the orthographic feature KANJI; morphological feature MORPH; syntactic features CATEGORY, COMP0, COMP1, COMP2, and MOD; structural feature (for both morphology and syntax) DTR; semantic features KNOWLEDGE and CONTENT.

(5)        Definition: morpheme

a_sign
MORPH ~saturated

A morpheme is a sign whose morphological expectation has not been saturated. In W‑CPSG, ~saturated is equivalent to obligatory/optional/null. For example, the suffix ‑xing (‑ness) is such a morpheme whose morphological expectation for a preceding adjective is obligatory.  In W‑CPSG, a morpheme like ‑xing (‑ness) ceases to be a morpheme when its obligatory expectation, say the adjective ke-du (readable), is saturated. Therefore, the sign ke-du-xing (readability) is not a morpheme, but becomes a word per se.

(6)        Definition: word

a_sign
MORPH ~obligatory
DTR no_syn_dtr

In W‑CPSG, ~obligatory is equivalent to saturated/optional/null. The specification [MORPH ~obligatory] defines a syntactic sign, i.e. a sign whose obligatory morphological expectation has been saturated. A word is a syntactic sign with no syntactic daughters, i.e. [DTR no_syn_dtr]. Obviously, word with [MORPH saturated/optional/null] overlaps morpheme with [MORPH obligatory/optional/null] in cases when the morphological expectation is optional or null.

Just like the overlapping of morpheme and word, there is also an intersection between word and phrase. Compare the following definition of phrase with the above definition of word.

(7)        Definition: phrase

a_sign
MORPH ~obligatory
COMP0 ~obligatory
COMP1 ~obligatory
COMP2 ~obligatory 

A phrase is a syntactic sign whose obligatory complement expectation has all been saturated, i.e. [COMP0 ~obligatory, COMP1 ~obligatory, COMP2 ~obligatory]. When a word has only optional complement expectation or no complement expectation, it is also a phrase. The overlapping relationship among morpheme, word and phrase can be shown by the following illustration of the three sets.

[Figure: the overlapping sets of morpheme, word and phrase]

S is a syntactic sign satisfying the following 3 conditions: (1) its category is pred (which includes V and A); (2) its comp0 is saturated; (3) its obligatory comp1 and comp2  are saturated.
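The definitions of morpheme, word, phrase and S above can be read as simple predicates over a sign's features. The following is a minimal sketch of that reading, not the ALE encoding used in W‑CPSG: the `Sign` class, the string constants and the category labels `"v"` and `"a"` are assumptions made for illustration, and only the features mentioned in the definitions are modeled.

```python
from dataclasses import dataclass

# Expectation states named in the definitions (assumed string encoding).
SATURATED, OBLIGATORY, OPTIONAL, NULL = "saturated", "obligatory", "optional", "null"
PRED_CATEGORIES = {"v", "a"}  # assumed labels for the pred categories V and A


@dataclass
class Sign:
    category: str = "n"
    morph: str = NULL
    comp0: str = NULL
    comp1: str = NULL
    comp2: str = NULL
    has_syn_dtr: bool = False  # False corresponds to [DTR no_syn_dtr]


def is_morpheme(s: Sign) -> bool:
    # [MORPH ~saturated]
    return s.morph != SATURATED


def is_syntactic_sign(s: Sign) -> bool:
    # [MORPH ~obligatory]
    return s.morph != OBLIGATORY


def is_word(s: Sign) -> bool:
    # A syntactic sign with no syntactic daughters.
    return is_syntactic_sign(s) and not s.has_syn_dtr


def is_phrase(s: Sign) -> bool:
    # A syntactic sign with no obligatory complement expectation left.
    return is_syntactic_sign(s) and OBLIGATORY not in (s.comp0, s.comp1, s.comp2)


def is_sentence(s: Sign) -> bool:
    # pred category, comp0 saturated, no obligatory comp1/comp2 left.
    return (is_syntactic_sign(s) and s.category in PRED_CATEGORIES
            and s.comp0 == SATURATED
            and s.comp1 != OBLIGATORY and s.comp2 != OBLIGATORY)
```

Under this reading, the suffix ‑xing with `morph=OBLIGATORY` is a morpheme but not a word, while ke-du-xing with `morph=SATURATED` is a word (and, having no complement expectations, also a phrase) but not a morpheme.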

3.2. Lexicalized grammar

W‑CPSG takes a radical lexicalist approach. We started with individual words in the lexicon and have gradually built up a lexical hierarchy and the grammar prototype.

W‑CPSG consists of two parts: a minimized general grammar and an information-enriched lexicon. The general grammar contains only 11 PS rules, covering complement structure, modifier structure, conjunctive structure and morphological structure. We formulate a PS rule for illustration.

[Figure: the comp0 PS rule]

This comp0 PS rule is similar to the rule S ==> NP VP in the conventional phrase structure grammar. The feature COMP0 represents the expectation of the head daughter for its external complement (subject or specifier) on its left side, i.e. [DIRECTION left]. The nature of its expected comp0, NP or other types of sign, is lexically decided by the individual head (hence head-driven or lexicon-driven). It will always be warranted by the general grammar, here via the index [3]. This is the nature of lexicalized grammars. PS rules in such grammars are very abstract. Essentially, they say one thing, namely, 2 signs can combine so long as the lexicon so indicates. The indices [1] and [2] represent configurational constraint. They ensure that internal obligatory complements COMP1 and COMP2 must be saturated before this rule can be applied. Finally, Head Feature Principle (defined elsewhere in the grammar based on the adaptation of the Head Feature Principle in HPSG, Pollard & Sag, 1994) ensures that head features are percolated up from the head daughter to the mother sign.
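As a rough illustration of the idea that a PS rule only says "two signs can combine so long as the lexicon so indicates", the comp0 rule can be sketched as a function that consults the head's own expectations. This is a simplified sketch under stated assumptions, not the W‑CPSG rule itself: the field names, the use of a bare category string for the expected comp0, and the omission of the DIRECTION feature and of full unification are all simplifications introduced here.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Sign:
    form: str
    category: str
    comp0: Optional[str] = None      # category the head expects on its left, if any
    comp0_saturated: bool = False
    comp1_obligatory: bool = False   # True while an obligatory COMP1 is unsaturated
    comp2_obligatory: bool = False   # True while an obligatory COMP2 is unsaturated


def comp0_rule(left: Sign, head: Sign) -> Optional[Sign]:
    """Combine `left` with `head` as its external complement, or return None."""
    # Configurational constraint: internal obligatory complements come first.
    if head.comp1_obligatory or head.comp2_obligatory:
        return None
    # The head must still be expecting a comp0.
    if head.comp0 is None or head.comp0_saturated:
        return None
    # What counts as a suitable comp0 is decided lexically by the head.
    if left.category != head.comp0:
        return None
    # Head features (here only the category) percolate to the mother sign.
    return replace(head, form=f"{left.form} {head.form}", comp0_saturated=True)
```

For example, with a hypothetical head `Sign("chi", "v", comp0="np")` and a left sister `Sign("wo", "np")`, the function returns a mother sign of category `"v"` whose comp0 is saturated; if the head still had an obligatory COMP1 pending, the rule would not apply.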

The lexicon houses lexical entries with their linguistic description and knowledge representation. Potential morphological structures, as well as potential syntactic structures, are lexically encoded (in the feature MORPH for the former and in the features COMP0, COMP1, COMP2, MOD for the latter). Our knowledge representation is also embodied in the lexicon (in the feature KNOWLEDGE). I believe that this is an effective and realistic way of handling natural language phenomena and their disambiguation without having to resort to an encyclopedia-like knowledge base. The following sample formulation of the lexical entry chi (eat) projects a rough picture of what the W‑CPSG lexicon looks like.

[Figure: the lexical entry for chi (eat)]

The lexicon also contains lexical generalizations. The  generalizations are captured by the inheritance of the lexical hierarchy and by a set of lexical rules. Due to space limitations, I will not show them in this paper.

4. Implementation and application of W‑CPSG

A substantial Chinese computational grammar has been implemented in the W‑CPSG prototype. It covers all basic Chinese constructions. Particular attention is paid to the handling of function words and verb patterns. On the basis of the information-enriched lexicon and the general grammar, the system adequately handles the relationship between linguistic individuality and generality. The grammar formalism I use to code W‑CPSG is ALE, a grammar compiler on top of Prolog, developed by Carpenter & Penn (1994). ALE is equipped with an inheritance mechanism on typed feature structures, a powerful tool in grammar modeling. I have made extensive use of this mechanism in the description of lexical categories as well as in knowledge representation. This seems to be an adequate way of capturing the inherent relationship between features in a grammar. Prolog is a programming environment particularly suitable for the development of unification and reversible grammars (Huang 1986, 1987). ALE compiles W‑CPSG into a Chinese parser, a Prolog program ready to accept a string of characters for analysis. In the first experiment, W‑CPSG parsed a corpus of 200 Chinese sentences of various types.

An important benefit of a unification-based grammar is that the same grammar can be used both for parsing and generation. Grammar reversibility is a highly desired feature for multi-lingual machine translation application. Following this line, I have successfully applied W‑CPSG to the experiment of bi-directional machine translation between English and Chinese. The machine translation system developed in our Natural Language Lab is based on the shake-and-bake design (Whitelock 1992, 1994). I used the same three grammar modules (W‑CPSG, an English grammar and a bilingual transfer lexicon) and the same corpus for the experiment. As part of machine translation output, W‑CPSG has successfully generated the 200 Chinese sentences. The experimental results meet our design objective and verify the feasibility of our approach.

 

References

 

Carpenter, B. & Penn, G. (1994): ALE, The Attribute Logic Engine, User's Guide

Chen, K-J.  (1996): "Chinese sentence parsing" Tutorial Notes for International Conference on Chinese Computing ICCC'96, Singapore

Feng, Z-W.  (1996): "COLIPS lecture series - Chinese natural language processing",  Communications of COLIPS, Vol. 6, No. 1 1996, Singapore

Fillmore, C. J. (1968): "The case for case". Bach and Harms (eds.), Universals in Linguistic Theory. Holt, Rinehart and Winston, pp. 1-88.

Huang, X-M. (1986): "A bidirectional grammar for parsing and generating Chinese".  Proceedings of the International Conference on Chinese Computing, Singapore, pp. 46-54

Huang, X-M. (1987): XTRA: The Design and Implementation of A Fully Automatic Machine Translation System, Doctoral dissertation, University of Essex.

Li, L-D. (1986): Xiandai Hanyu Juxing (Sentence Patterns in Contemporary Mandarin), Shangwu Yinshuguan, Beijing

Li, L-D. (1990): Xiandai Hanyu Dongci (Verbs in Contemporary Mandarin), Zhongguo Shehui Kexue Chubanshe, Beijing

Li, W. & P. McFetridge (1995): "Handling Chinese NP predicate in HPSG", Proceedings of PACLING-II, Brisbane, Australia

Li, W. (1996): "Interaction of syntax and semantics in parsing Chinese transitive patterns", Proceedings of International Conference on Chinese Computing (ICCC'96), Singapore

Li, W. (1997a): "Chart parsing Chinese character strings", Proceedings of The Ninth North American Conference on Chinese Linguistics (NACCL-9, to be available), Victoria, Canada

Li, W. (1997b): W‑CPSG: A Lexicalized Chinese Unification Grammar, Doctoral dissertation, Simon Fraser University (on-going)

Lü, S-X. et al. (ed.) (1980): Xiandai Hanyu Babai Ci (800 Words in Contemporary Mandarin), Shangwu Yinshuguan, Beijing

Meng, Z., H-D. Zheng, Q-H. Meng, & W-L. Cai (1987): Dongci Yongfa Cidian (Dictionary of Verb Usages), Shanghai Cishu Chubanshe, Shanghai

Pollard, C.  & I. Sag (1987): Information based Syntax and Semantics Vol. 1: Fundamentals. Centre for the Study of Language  and Information, Stanford University, CA

Pollard, C.  & I. Sag (1994): Head-Driven Phrase Structure Grammar,  Centre for the Study of Language and Information, Stanford University, CA

Shieber, S. (1986): An Introduction to Unification-Based Approaches to Grammar. Centre for the Study of Language  and Information, Stanford University, CA

Tesnière, L. (1959): Éléments de Syntaxe Structurale, Paris: Klincksieck

Whitelock, Pete (1992): "Shake and bake translation", Proceedings of the 14th International Conference on Computational Linguistics, pp. 784-790, Nantes, France.

Whitelock, Pete (1994). "Shake and bake translation", C.J. Rupp, M.A. Rosner, and R.L. Johnson (eds.), Constraints, Language and Computation, pp. 339-359, London, Academic Press.

Wilks, Y.A. (1975). "A preferential pattern-seeking semantics for natural language inference".  Artificial Intelligence, Vol. 6, pp. 53-74.

Wilks, Y.A. (1978). "Making preferences more active".  Artificial Intelligence, Vol. 11,  pp. 197-223

 

-------------------------------------

* This project was supported by the Science Council of British Columbia, Canada under G.R.E.A.T. Award (code: 61) and by my industry partner TCC Communications Corporation, British Columbia, Canada. I thank my academic advisors Paul McFetridge and Fred Popowich and my industry advisor John Grayson for their supervision and encouragement. Thanks also go to my colleagues Davide Turcato, James Devlan Nicholson and Olivier Laurens for their help during the implementation of this grammar in our Natural Language Lab. I am also grateful to the editors of the NWLC'97 Proceedings for their comments and corrections.

[1] We leave aside the other components such as discourse, pragmatics, etc. They are an important part of a grammar for a full analysis of language phenomena, but they are beyond what can be addressed in this research.

[2] In formulating W‑CPSG, we use uppercase for feature and lowercase for type; ~ for logical not and / for logical or; number in square brackets for unification.

 

[Related]

Outline of An HPSG-style Chinese Reversible Grammar ABSTRACT

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

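Liwei's Master's Thesis: Target-Language Reordering (9)

An Experiment in Automatic Translation from Esperanto into Chinese and English
-- An Overview of the EChA Machine Translation System

Target-language reordering

In the earlier function-word pass and morphological-generation pass, some local reordering has already been done, and the reordered words have been given a shared index number. For example: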

CHIO (一切) CHI (这) ----> 这一切 (012);
DOKTORO (博士) ZAMENHOF (柴门霍夫) ----> 柴门霍夫博士 (134)

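The reordering required by English interrogative and negative sentences is carried out together with morphological generation. For example: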

NE (NOT) ESTIS (WERE) ----> WERE NOT (008)

CHU VIA (YOUR) AMIKO (FRIEND) ESTAS (IS) KURACISTO (DOCTOR) ?
----> IS YOUR FRIEND DOCTOR ? (039)

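Starting from the second synthesis pass, the system looks at the sentence as a whole and performs reduction and reordering bottom-up, separately for each target language. Given the CDC and the reordering subroutine, building the target-language reduction and generation algorithm is straightforward. The basic idea is:

(1) Take the words one by one from the beginning of the sentence to the end, skipping words already processed and non-terminal nodes.
(2) If the word's level number is 1 and its right link is 0, reduction has reached the top-level main pivot and the sentence is finished.
(3) If the word needs reordering, enter the reordering subroutine.
(4) Mark the word as processed and decide, as the case may be, whether to give it the index number of its pivot word.
(5) Enter the subroutine that checks whether the word's sister words have all been processed as well.
(6) If so, give the word and all its sisters the index number of the pivot word, and mark the pivot word as a terminal node.
(7) Return to step (1).

For English the problem is particularly simple: reordering is needed in only one situation, namely the preposed object and postposed subject of a transitive predicate. (A postposed subject in an intransitive sentence needs no reordering.) Chinese is much more complicated. The main rules are:

(1) The subject of existential "有" (ESTI) should be postposed. All other postposed subjects (including most subject clauses) are moved forward.

(2) In sentences whose predicate is a Chinese transitive verb requiring "把", "使", etc., the object, once "把", "使", etc. has been added, should be placed before the predicate. All other preposed objects are moved back.

(3) A postposed attributive clause is not moved forward in two cases: 1. the emphatic pattern ESTAS + X, KIU; 2. attributive clauses longer than 15 words. All other postposed attributes are moved forward. The relative order of sister attributes is determined mainly by their semantic features, and is implemented by giving or withholding a shared index number during reordering.

(4) Adverbial clauses generally stay where they are (though a postposed temporal clause is better moved forward). All other postposed adverbials are moved forward. The relative order of sister adverbials is handled on the same principle as above.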

 

 

【相关】

硕士论文: 世界语到汉语和英语的自动翻译试验
立委硕士论文:1. EChA概况
立委硕士论文:2. 世界语: 语言学特点及其研究价值
立委硕士论文:3. 层次递归成分体系
立委硕士论文:4. EChA机器词典及词表
立委硕士论文:5. 世界语形态分析
立委硕士论文:6/7 世界语句法分析
立委硕士论文:8. 英语形态生成
立委硕士论文:9. 目标语调序
立委硕士论文:10. EChA 试验结果的分析
立委硕士论文【致谢】【参考书目】
立委硕士论文全文(世界语版)

《朝华午拾:shijie-师弟轶事(3)——疯狂世界语 》

灵感有如神授,巧夺岂止天工

《立委随笔:一小时学会世界语语法》

立委世界语文章 (1987): 《中国报道:通天塔必将建成》

立委世界语论文(1986): 《国际语到汉语和英语的自动翻译》

立委(1988)《世界科技:世界语到汉语和英语的自动翻译试验》

DLT项目背景介绍

立委硕士论文全文(世界语版)

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

【关于机器翻译】

【置顶:立委NLP博文一览】

《朝华午拾》总目录

Liwei's Master's Thesis: Analysis of the EChA Experimental Results (10)


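Analysis of the EChA experimental results

On the whole, the results of this experiment are quite satisfactory. The translations are not only readable; most of them are fluent. Because we paid considerable attention to rhetoric, they do not read very mechanically either. Of course, this is only a small-scale experiment. Although we tried to cover as many kinds of linguistic phenomena as possible, it is hard to say what problems will turn up when the experiment is scaled up. Fortunately, the system is fairly easy to maintain and improve.

In the second poem, two interrogative sentences (110)(111) were mistranslated as English emphatic sentences: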

CHU kredas la vorton pure karan: vin mi amas! (111)
DO BELIEVE the word purely dear: I love you!
Cf: 相信纯粹地亲爱的词吗:我爱你!

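This happened because, for the sake of rhythm, the original line omits the subject VI (YOU), which is understood from the preceding context. Interestingly, rendering it as an emphatic sentence does no harm to the meaning of the poem.

When EChA was first run on the machine, we concentrated on testing the feasibility and soundness of the core design and neglected rhetoric. The early output (December 1985) looks rough; compared with it, the later output (February 1986) is clearly improved. In each example below, 1) is the early output and 2) the later one:

  1. Adding the formal subject IT (007)(012)(077)(122)(125)(133):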

Sed chio chi ankorau okazis sub homa gvidado kaj PLEJ GRAVE ESTIS, KE chio chi bazighis sur la homa scio. (012)

1) But all this still happened under man's guiding and MOST IMPORTANT WAS, THAT all this was based on the man's knowledge.

2) But all this still happened under man's guiding and IT WAS MOST IMPORTANT, THAT all this was based on the man's knowledge.

  2. Distinguishing infinitives with TO from infinitives without TO (004)(019)(072)(078)(083)(084)(088)(089)(092)(095)(132)(142)(146):

LABORI estas necese.(072)
1) (TO) WORK is necessary.
2) TO WORK is necessary.
工作是必要的.

  3. Double objects (128)(143)(144):

Donu AL mi iom da kafo! (128)
1) Give TO me a little coffee!
2) Give me a little coffee!
给我一点咖啡!

  4. Rendering existential ESTI as "有" and THERE TO BE (049)(157):

En unu jaro ESTAS kvar sezonoj: printempo, somero, autuno kaj vintro. (049)

1) In one year ARE four seasons: spring, summer, autumn and winter.
在一年里面 "是" 四季节:春季,夏季,秋季和冬季.

2) In one year THERE ARE four seasons: spring, summer, autumn and winter.
在一年里面 "有" 四季节:春季,夏季,秋季和冬季.

  5. Choosing the target-language word sense (059)(067)(081)(046)(098)(013)(014)(027)(118)(130):

ELMETU viajn opiniojn pri nia laboro! (059)

1) "输出" 你们的关于我们的工作的意见!
2) "提出" 你们的关于我们的工作的意见!
OUTPUT your opinions about our work!

Chu mi FARIS multajn erarojn en mia hejmtasko? (081)

1) Did I DO a lot of mistakes in my homework?
我在我的家庭作业里面 "做" 了许多错误吗?

2) Did I MAKE a lot of mistakes in my homework?
我在我的家庭作业里面 "犯" 了许多错误吗?

La partio TRE zorgas la vivon de la popolamaso. (046)

1) The party VERY cares for the life of the masses.
2) The party VERY MUCH cares for the life of the masses.
党很关心人民群众的生活.

La suno levighas CHE oriento. (013)

1) The sun rises AT east.
2) The sun rises IN THE east.
太阳在东方升起.

POST unu monato komencighos la someraj ferioj. (014)

1) AFTER one month will begin the summer's holidays.
2) IN one month will begin the summer's holidays.
暑假在一月以后将开始.

La eksperimento pri mashina tradukado ANKORAU NE estas finita. (027)

1) The experiment about machine's translating STILL has been NOT finished.
关于机器的翻译的试验 "仍然没有" 被完成.

2) The experiment about machine's translating has been NOT finshed YET.
关于机器的翻译的试验 "还没有" 被完成.

Ni esperas, ke li GAJNU championecon en la konkurso. (118)

1) We hope, that he WIN championship in the competition.
2) We hope, that he WILL WIN championship in the competition.
我们希望,让他在比赛里面赢得冠军.

Prenu la lingvon neutralan KIEL la bazon. (130)

1) Take the language neutral AS the base.
2) Take the language neutral FOR the base.
拿中立的语言作为基础.

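The EChA experiment impressed on us that conversion between languages of the same family is much easier than between languages of different families. The closer the kinship, the lower the demands that machine translation places on the precision of automatic analysis, and hence the easier it is to bring to practical use. English and Chinese are both analytic languages and share many features; even so, Esperanto-to-English conversion is far simpler than Esperanto-to-Chinese. Given only an Esperanto-English machine dictionary and a set of morphological conversion algorithms, word-for-word Esperanto-English machine translation can be achieved without any analysis of levels or syntax. Such output is rough, but usable to a considerable degree. We tallied the intermediate translations, not yet reordered, that EChA's first synthesis pass (morphological conversion) outputs, taking "does not cause misunderstanding" as the criterion. The English accuracy is about 95% (150/158), with eight sentences that are hard to understand (003)(010)(075)(095)(102)(108)(111)(141); the Chinese accuracy is about 72% (113/158). If we exclude the part of morphological conversion that uses the results of syntactic analysis (but not the first-pass function-word analysis and conversion), the English accuracy is still above 80%. If preposed accusative nouns were flagged in the output, intelligibility could be raised further. Of course, the 158 sentences we tested have their limitations, so these statistics are only of relative significance.

Machine translation in China has from the outset studied automatic conversion between languages of two unrelated families, Indo-European and Sino-Tibetan, which is very difficult. This is probably one important reason why our practical systems have been so slow to appear. Chinese machine translation workers therefore bear a heavier burden and face a harder task, one that calls for more originality and dedication. This disadvantage has another side: the many special problems that arise when machine translation meets Chinese have, objectively, pushed our research deeper. Machine translation research in China did not go through a first generation of word-for-word translation as in Europe and America, but started directly from second-generation sentence-for-sentence translation. Its starting point was higher, and within a short time (the early 1960s) it caught up with the advanced world level of the day. This is clearly related to the requirements of the particular language pairs we study (Russian-Chinese, English-Chinese, etc.).[10]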

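Now to another question: can literary works be translated by machine? We say they certainly can, but with great difficulty. Formalizing into algorithms the rules that people follow when translating literature (many of them subconscious) is obviously not easy. Even if it were done, it would not pay economically. So for a long time to come, apart from special experimental needs, people will generally not spend the effort. EChA translated two poems as a tentative attempt in this direction, showing that a machine can translate poetry too. Judging from the output, the English is more beautiful than the Chinese, preserves more of the rhythm and rhyme, and reads more like a poem. The Chinese translation, apart from a few well-rendered lines (e.g. "向永远战争着的世界, / 它允诺神圣的和谐", "To the world forever at war, / it promises holy harmony"), reads on the whole more like prose. This is hardly surprising, since EChA was not designed for translating poetry. The two most salient formal features of poetry are rhythm and end rhyme. One can imagine that the dictionary of a poetry translation system would differ from an ordinary machine dictionary: under each sense of each entry it would gather a set of synonymous target-language equivalents. These words would differ in length and in rhyme, for the machine to choose from when synthesizing verse, just as people often consult a rhyming dictionary when writing or translating poetry.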

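Whenever machine translation comes up, people like to ask: can a machine translate literature? Why not? The discrete approximates the continuous, and machine intelligence simulates human intelligence; there is no unbridgeable gulf between the two. Functionally, machines and people are no different. A machine is simply an inorganic human. Whatever people can do, machines will sooner or later be able to do. What a machine cannot do is not beyond it; rather, nobody has made it able, just as an illiterate cannot write because nobody has taught him. But a machine is a picky learner: it cannot grasp what is merely "sensed", and can be taught only what is "put into words" (through a programming language). Unfortunately, for many things people still know how without knowing why, and so cannot pass them on. The machine's incapacity thus stems entirely from human incapacity. Yet what people do not understand today they need not fail to understand forever, so from a developmental point of view machines, like people, can do anything. In fact, machines can already take over part of the work of doctors, translators and composers, and do it better than less skilled practitioners, because they "model themselves on the best" (取法乎上). Even among people, only a small number of experts can do such work. The machine has already broken into the sacred preserve of the lord of creation.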

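Finally, some general remarks on rhetoric. Because machine translation has so far mostly been confined to the laboratory, the reading obstacles (including psychological ones) caused by unpolished output have not yet become prominent. But as machine translation becomes practical, the need for rhetoric will become ever more obvious. The examples given earlier of later output improving on early output mainly concern rhetoric.

1) What is machine translation rhetoric?

Machine translation rhetoric is an important means of ensuring fluent output. It is the part of target synthesis that follows the machine grammar, the last link in the automatic translation process. Rhetoric in the broad sense includes all means, throughout the translation process, that aim to make the output fluent and polished, such as idioms (via an idiom dictionary), function-word analysis (via the function-word module) and structural means (via collocation). Some so-called sense distinctions are really a kind of rhetoric too. For example, LUDI (PLAY) can be divided into senses such as "玩" (play), "打(球)" (play a ball game) and "演奏(乐器)" (play an instrument), but within the sense "演奏" the specific choice among "拉(提琴, 胡琴)" (016), "弹(钢琴)" (038) and "吹(口琴)" is a matter of rhetoric. EChA treats rhetoric that involves polysemy, i.e. choosing the appropriate target-language equivalent, as a polysemy problem (see the EChA function-word module, the part-of-speech and sense discrimination table, and the polysemy discrimination module). In general, rhetoric that is closely tied to particular lexical or grammatical phenomena, and other highly idiosyncratic special cases, should be handled together with the relevant dictionary or grammar component, while rhetoric that can be grouped into classes is handled uniformly by a separate rhetoric module at the end.

Machine translation rhetoric has a certain extra-linguistic character and belongs to translation studies. As we know, a sentence can be converted (mainly by sentence-pattern conversion) on the basis of the linguistic contrasts between source and target language; this is the bulk of our current machine translation work. But sentences converted directly in this way are not guaranteed to be fluent, or even correct (i.e. not misunderstood), because besides lexical and grammatical differences, languages (especially unrelated ones) also differ extra-linguistically (in habits of expression, ways of thinking, and so on), i.e. they show contrasts from the viewpoint of translation studies. For example: nun DE LOKO flugu ghi AL LOKO (now FROM PLACE let it fly TO PLACE) (101) / 现在从 "一个" 地方让它飞到 "另一个" 地方吧 (a literal "from place to place" does not fit Chinese habits of expression, so "一个" (one) and "另一个" (another) are added). Rhetoric is set up mainly to remove such differences. Hence only the contrasts seen from the viewpoint of translation studies form the fundamental basis of rhetoric.

2) Classification of rhetoric

Rhetoric falls into two classes: necessary rhetoric and aesthetic rhetoric. Necessary rhetoric is what is needed for the output to be correct and intelligible; it is the elementary stage of rhetoric. Aesthetic rhetoric is what is required for the output to be fluent and smooth, or even to produce a sense of beauty or help form a style; it is the advanced stage. Machine translation rhetoric was first raised as necessary rhetoric. Necessary rhetoric is the foundation, is more urgent, and is a necessary component of every practical system; morphological rhetoric is an example. The amount of such rhetoric is quite limited, and a certain amount of research can exhaust it. Aesthetic rhetoric is, so to speak, icing on the cake. It is a means of steadily raising the quality of machine output toward maturity and polish, in the hope of catching up with human translation. Aesthetic rhetoric therefore develops without limit, and has many levels and facets. Patching falls far short of what its development needs; it demands continual renewal of framework and method. As machine translation advances, the share of aesthetic rhetoric will gradually grow. Strictly speaking, only aesthetic rhetoric truly embodies the characteristics and regularities of rhetoric itself, since necessary rhetoric is in a sense just an extension of grammar, i.e. grammar in the broad sense. Its means do not differ fundamentally from those of machine grammar. In the current EChA system, necessary rhetoric is often mixed in with the grammar.

As for aesthetic rhetoric, EChA has made only a small attempt. It should be pointed out that beauty in machine translation has its own emphasis: it prizes above all being "fluent, idiomatic, concise and natural", and next the formation of a style. We believe it is entirely possible for machine output gradually to develop a style, because formally style is carried mainly by vocabulary, especially small words (modal particles, structural words), and secondarily by some differences in grammatical form. The formal features of different styles can be recognized and adopted by a machine. Concretely, one can draw on the results of computational stylistics to design target-language rhetoric models for different styles. Styles might include the formal, the elegant and the colloquial. The formal style is regular in format, clear and simple, and gives an impression of objectivity and impartiality, without embellishment. The elegant style is marked by archaic function words (such as "则", "即", "乃", "便", "故", "且", "其", "及") and by more frequent use of idioms, and appears concise and classical. The colloquial style is looser and freer, with more modal particles (such as "吗", "呢", "可不", "是吗", "啊").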

____________________________________________________________________

Note: [10] See Liu Yongquan, "Machine Translation in China" (<<中国的机器翻译>>), Qingbao Kexue (<<情报科学>>) 1980, 3.

 

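[Acknowledgments]

From the very beginning, the development of a machine translation system of the Esperanto type had the warm support of my teacher Liu Yongquan, who gave careful guidance on everything from the overall design to the handling of specific problems. During programming and debugging on the machine, my teacher Liu Zhuo also gave guidance many times, and some of the algorithms for basic operations were provided by him. Now that the EChA system has achieved preliminary results, the author expresses deep gratitude to them. Special thanks also go to Ms. Han of the computer room for her help in many ways. Without the convenience she provided, the EChA system could not possibly have been tested successfully in so short a time.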

 



Liwei's Master's Thesis: English Morphological Generation (8)


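English morphological generation

The ending-attachment algorithm is exactly the inverse of the ending-stripping algorithm. Building a complete English ending-attachment algorithm that meets the needs of a practical system is not difficult, because English morphology is fairly simple. EChA performs Chinese morphological rhetoric and English morphological generation in the same place.

The contrasts between source and target language are the basis for setting up language conversion rules. These contrasts can be summed up in five cases: 1) one-to-one; 2) one-to-many; 3) many-to-one; 4) present in the source, absent in the target; 5) absent in the source, present in the target. We illustrate each with Esperanto-to-English morphological conversion: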

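1) One-to-one

Esperanto derived adverb (formed by adding the ending "-E" to a stem whose logical class is adjective)
---------> the corresponding English adjective plus the ending "-LY".

Examples: diligent-E ----> diligent-LY ; serioz-E ----> serious-LY ;
sincer-E ----> sincere-LY. (063)

Exception: bon-E ----> well (045)
( Not good-LY; this case is handled in the dictionary pass, via the part-of-speech and sense discrimination table. )

Clearly, the one-to-one case is the easiest to handle.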
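The one-to-one rule and its dictionary-level exception can be sketched as follows. This is only an illustration in Python, not EChA's BASIC code: the two small dictionaries are hypothetical stand-ins for the machine dictionary and the discrimination table, and the naive "+ly" spelling rule is an assumption that ignores English spelling changes.

```python
# Hypothetical mini-lexicon: Esperanto adjective stem -> English adjective.
ADJECTIVES = {
    "diligent": "diligent",
    "serioz": "serious",
    "sincer": "sincere",
    "bon": "good",
}

# Exceptions are resolved by lookup before the general rule is tried,
# mirroring the idea of handling bon-E in the dictionary pass.
IRREGULAR_ADVERBS = {"bon": "well"}


def adverb_to_english(word: str) -> str:
    """Map an Esperanto derived adverb in -E to an English adverb."""
    lowered = word.lower()
    if not lowered.endswith("e"):
        raise ValueError(f"not an -E adverb: {word!r}")
    stem = lowered[:-1]
    if stem in IRREGULAR_ADVERBS:
        return IRREGULAR_ADVERBS[stem]
    # General one-to-one rule: English adjective + "-ly" (naive spelling).
    return ADJECTIVES[stem] + "ly"
```

With these tables, `adverb_to_english("serioze")` yields `"seriously"`, while `adverb_to_english("bone")` yields `"well"` rather than the regular but wrong "goodly".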

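2) One-to-many

Esperanto infinitive --------> English bare verb, or TO + bare verb
Esperanto conditional (predicate verb ending in "-US") --------> three English forms (past, present, future).

Examples:
  1. Se mi sci-US hierau, mi certe ven-US.
     ----> If I HAD KNOWN yesterday, I certainly SHOULD HAVE COME. (supposition contrary to past fact)
  2. Se vi est-US mi, kion vi far-US?
     ----> If you WERE me, what WOULD you do? (contrary to present fact)
  3. Se vi ven-US morgau, vi shin vid-US.
     ----> If you SHOULD come tomorrow, you WOULD see her. (contrary to future fact)

This case is the most troublesome; polysemy in machine translation all stems from it. If the examples above had no explicit time adverbial, one could only guess from the context beyond the sentence, which is really too hard for a machine. In such cases EChA simply uses "WOULD" for "-US" across the board (050). This does not quite conform to English grammar, but for now it will have to do. Fortunately, this conversion causes no misunderstanding.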

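Another common one-to-many case is that the Esperanto simple present (ending -AS) corresponds to both the English simple present and the present progressive. Although Esperanto has a compound tense that corresponds to the English present progressive ( ESTAS x-ANTA ), Esperanto's principle of economy asks speakers to use complex forms as little as possible. We have not yet found sufficiently reliable formal rules for deciding when "-AS" should be rendered as simple and when as progressive. EChA currently always renders it as the simple present, which makes some translations inexact but neither misleading nor obscure. For example: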

Kien vi ir-AS? (158) ----> To where DO you go? ( CF: Where ARE you GOING? )
Chu kredas, ke mia koro flam-AS? (110) ----> Do believe, that my heart burn-S?
( CF: Do you believe that my heart IS BURNING? )

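3) Many-to-one

The various forms of Esperanto adjectival or adverbial participles --------> the corresponding form of the English participle.

-ANTA and -ANTE ----> -ING ; -INTA and -INTE ----> HAVING + past participle ;
-OTA and -OTE ----> TO BE + past participle; etc.

[Example] KURANTE sur la strato, li falis. (091) ----> RUNNING on the street, he fell.

Laboristoj estas KONSTRUANTAJ fabrikon. (015)
----> Workers are BUILDING factory.

This case is easy to handle. Esperanto is morphologically rich while modern English morphology is poorly developed, so the cases that arise most often in Esperanto-English morphological conversion are many-to-one or present-in-source, absent-in-target. This is a very favorable condition for building a fairly complete EChA English morphological generation (ending-attachment) algorithm.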

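4) Present in the source, absent in the target

Esperanto future-in-the-future tense ( ESTOS x-ONTA(J) ) --------> English ?

[Example] Mi ESTOS LEGONTA la libron kiam shi venos. (023)
----> I WILL ( or: WILL BE GOING TO ) read the book when she comes.

This case looks unfavorable but is actually not hard to handle. Every existing language, as a tool people have used for centuries to exchange ideas, can generally express all sorts of fine semantic distinctions. Language B may lack a particular means of expression that language A has, but when necessary it can always find a substitute. In the example above, ESTOS LEGONTA is normally rendered adequately as WILL READ; if the future-in-the-future must be stressed, it can also be rendered by the cumbersome WILL BE GOING TO READ. Likewise, Chinese lacks morphology, but where needed suitable particles or adverbs can always stand in for it; this is what we call morphological rhetoric.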

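5) Absent in the source, present in the target

Esperanto ? --------> English perfect progressive

[Example] Mi atend-AS vin chi tie du horojn.
----> I HAVE BEEN WAITING here for you for two hours.
CF: I WAIT here for you for two hours.
I AM WAITING here for you for two hours.

When what the source lacks and the target has is optional in the target as well, or does not much affect the meaning, things are manageable, as in the example above. Similarly, Esperanto has no indefinite article, and EChA simply ignores the matter without serious consequences, though the output reads a little awkwardly: Is your friend (*) doctor? (039) This is (*) green star, and that is (*) red star. (152) ( the indefinite article A should appear at * ). The real headache is when the source lacks what the target requires. Translating from a language with no articles at all (such as Chinese or Russian) into a language with articles is often like this.

The classification above is of general significance for transfer and generation in machine translation. The hardest cases are one-to-many and absent-in-source-but-required-in-target, which generally have to be resolved through careful syntactic and semantic contrast and analysis. For example, by analyzing the sentence-pattern features of the English pivot word to which an infinitive is directly attached, one can decide whether the infinitive takes TO or not. As a last resort, the several possible choices can all be printed out for the user to decide. This is of course a stopgap, but it often serves better in practice than writing a set of unreliable discrimination rules. Machine simulation of human intelligence always has limitations at any given stage. The practice above amounts to handing back to people the intelligence that the machine does not yet have, especially the parts that are hard to formalize but easy for people to judge by experience and intuition. Nevertheless, the mission of artificial intelligence dictates that people should make every effort to raise the machine's level of intelligence. Not to try when conditions allow is laziness and dereliction on the designer's part.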

The morphological-generation pass of EChA also contains lexicalized polysemy-discrimination code segments (executed before morphological generation), which are easy to write in BASIC. Some examples:

1) LUDI: 玩 (play) / 打 (ball games of all kinds) / 拉 (violins, huqin) / 弹 (piano) / 吹 (harmonica)

2120 IF VT$(GC)<>"1" THEN 2160
( If the word is intransitive, keep the basic dictionary sense "玩"; sense discrimination for this word is done, go to 2160. )

2130 IF HY$(ZC)="胡琴" OR RIGHT$(HY$(ZC),4)="提琴" THEN HY$(GC)="拉": GOTO 2160
( If the word found is "胡琴", or the last two characters of the word found are "提琴" (covering 大提琴, 小提琴, 中音提琴, etc.), the word takes the Chinese sense "拉"; this word is done, go to 2160. )

2140 IF HY$(ZC)="钢琴" THEN HY$(GC)="弹": GOTO 2160
2145 IF HY$(ZC)="口琴" THEN HY$(GC)="吹": GOTO 2160
2150 IF RIGHT$(HY$(ZC),2)="球" THEN HY$(GC)="打"
2160 GC=GC+1: GOTO 1830 ( Pass over this word, take the next word, go to 1830. )

2) BATI: 打 (hit) / (心)跳动 ((of the heart) beat)

1990 IF VT$(GC)="1" AND (RIGHT$(HY$(ZC),2)="心" OR HY$(ZC)="心脏") THEN HY$(GC)="跳动"
2000 GOTO 2160

3) OKAZI: 进行 (proceed) / 发生 (happen) / 召开 (be convened)

2450 IF RIGHT$(HY$(ZC),2)="事" THEN HY$(GC)="发生":GOTO 2160
2460 IF RIGHT$(HY$(ZC),2)="会" THEN HY$(GC)="召开":YY$(GC)="BE HELD": YTZ$(GC)="8": XX$(GC)="1"
2470 GOTO 2160

4) RIGARDI: LOOK AT / LOOK / WATCH (TV) / SEE (FILM)

2830 IF VT$(GC)<>"1" THEN YY$(GC)="LOOK": GOTO 2160
2840 IF YY$(ZC)="TELEVISION" OR YY$(ZC)="TV" THEN YY$(GC)="WATCH": GOTO 2160
2850 IF YY$(ZC)="FILM" THEN YY$(GC)="SEE": YTZ$(GC)="1"
2860 GOTO 2160

5) NENIAM: 从不 (never, habitually) / 从未 (never yet)

3070 IF ST$(ZC)="2" THEN HY$(GC)="从未": HY$(ZC)=HY$(ZC)+"过": JG$(ZC)="9"
3080 GOTO 2160
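For readers unfamiliar with line-numbered BASIC, the LUDI segment can be restated as a small function. This is a hedged re-expression in Python, not part of EChA: the function name and parameters are invented here, `is_transitive` stands for the test `VT$(GC)="1"`, and `object_zh` for the Chinese equivalent `HY$(ZC)` of the word found. The original compares the last 4 or 2 bytes with `RIGHT$`, which corresponds to the last two or one Chinese characters in a double-byte encoding; the sketch uses `str.endswith` instead.

```python
def choose_ludi_sense(is_transitive: bool, object_zh: str) -> str:
    """Pick the Chinese verb for LUDI from the Chinese equivalent of its object."""
    if not is_transitive:
        return "玩"                      # line 2120: keep the basic dictionary sense
    if object_zh == "胡琴" or object_zh.endswith("提琴"):
        return "拉"                      # line 2130: bowed string instruments
    if object_zh == "钢琴":
        return "弹"                      # line 2140
    if object_zh == "口琴":
        return "吹"                      # line 2145
    if object_zh.endswith("球"):
        return "打"                      # line 2150: ball games
    return "玩"                          # no rule fired: basic sense is kept
```

As in the BASIC original, the rules are attached to one word only, so adding a new instrument means touching this function and nothing else.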

 

 

 


Liwei's Master's Thesis: Esperanto Syntactic Analysis (6 & 7)


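Esperanto syntactic analysis (1): handling function words

Function-word analysis is the hardest part of Esperanto syntactic analysis. EChA's strategy is divide and conquer. The analysis rules for each function word form a self-contained unit, independent of the others, so that enriching or improving the rules for one function word does not affect the rules for any other. This is what we might call separating rules from rules.[9] That linguistic rules should be separated from algorithmic programs has been much discussed, but separating rules from rules seems not yet to have received enough attention. (This does not mean that all rules are separated: the set of abstract grammar rules of general validity, as the system's fully formalized logical description of the language, is the hub of automatic analysis and is itself a unified whole that can be made very elegant, so there is no question of separating it. (See the second pass of EChA syntactic analysis, section 7.) A good system should be able both to separate and to combine.) We believe that separating rules from rules is of decisive importance for building practical machine translation systems. No system is complete enough from the start, so how easily it can be extended and improved largely determines its future. Separating rules from algorithms certainly greatly increases a system's extensibility and lets linguists and software developers cooperate fully. But that is not enough. Separating rules from rules not only helps us follow the principle of analyzing each concrete problem concretely, so as to resolve the many idiosyncratic problems in so complex a phenomenon as language and thereby greatly improve translation quality; it also creates the necessary conditions for cooperation among linguists themselves, which is indispensable for building large practical systems.

The main ways of separating rules from rules are: 1) Grammaticalizing the dictionary: taking the word as the basic unit, write the various uses of the word and their analysis rules into the dictionary (which is built on external storage) as data. Such a machine dictionary closely resembles in form the desk dictionaries we use, such as Oxford, Webster and Longman, and can readily draw on the research embodied in them. We suggest grammaticalizing the entries for function words and verbs first. 2) Lexicalizing the grammar: when writing the syntactic analysis or synthesis program (which resides in main memory), attach the rules to specific words or small classes and make these rules independent of one another. The two methods differ in form but are the same in essence. In EChA we use the second. (See the EChA function-word analysis component and the polysemy discrimination rules in the EChA synthesis component.) In the end, the first analysis pass of EChA is nothing but a large function-word dictionary with analysis rules attached.

Of course, it should be noted that separating rules from rules necessarily multiplies the number of rules. But because the boundaries are clear, this growth does not affect the logical clarity of the system's structure. This is quite unlike the earlier situation, when neither language and algorithm nor rule and rule were separated, and the unbounded growth of rules could only end in the system being scrapped. Growth in the number of rules does raise the question of machine storage capacity. But this is not really a problem, because today's machines are no longer so demanding about saving storage. Even among microcomputers, mid-range and high-end models reach, or can easily be expanded to, four to eight megabytes of memory. It is worth stressing that growth in the number of rules generally does not affect the system's efficiency, because rules are attached to specific words or small classes, and the pass for a word is entered only when that word occurs in the sentence being translated.

In EChA's function-word analysis pass, we put the sense discrimination of function words, and even some target-language rhetoric that involves the peculiarities of function words, all into the analysis rules of the individual function word. This is clearly simple and practical, and greatly eases synthesis. But it is precisely here that EChA violates the principle we strongly endorse, that analysis and synthesis be independent. We cannot yet think of a better and more reasonable way. However, our point in advocating independent analysis is twofold: 1) to deepen the analysis so as to improve translation quality; 2) to let one and the same independent analysis result serve synthesis into several languages. Doing function-word analysis and synthesis in step helps improve output quality. Moreover, because function words are limited in number and their analysis rules are mutually independent, extending these rules when a new target language is added will not be very difficult, still less affect the backbone of the whole system. Our present practice is therefore justified, and does not go against our aims.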

 

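Esperanto syntactic analysis (2)

The second analysis pass is fully independent of target-language synthesis, strongly logical, and a fairly complete model of language analysis. It consists of a main program and several interlocking subroutines centered on the verb-analysis algorithm. The main program mainly determines the extent of each segment (its front and back limits) and the order of processing, preparing the segments for the verb subroutine. It must handle all types of Esperanto sentence correctly and sensibly in order to guarantee the generality and adaptability of the system. Judging from the results on the various sentence types, EChA does this quite well.

We classify Esperanto sentence types as follows:

1. Predicate-less sentences. For example: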

Kia belega pejzagho ! (041) / What beautiful scenery ! 多么绝美的景色!

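2. Predicate sentences:

1) Simple sentence: the whole sentence has only one predicate. E.g.: Skribu klare ! (033) / Write clearly ! 写清楚!

2) Extended simple sentence: the sentence has at least two predicates but only one main clause, and the subordinate clause has no direct link to the main clause (represented by the main pivot), i.e. the subordinate clause lies beyond level 2 ( its level number >= 3 ). Such clauses are usually attributive or appositive clauses. E.g.: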

La homon , pri kiu vi parolas , mi neniam vidis . (131)
The man(宾), about whom you speak , I never saw .
我从未见过你提到的人.

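3) Main-subordinate sentence: the sentence has at least two predicates but only one main clause, and the subordinate clause is directly linked to the main clause. E.g.: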

Se mi partoprenus en via amuza aktivado , mi estus tre ghoja . (050)
If I should take part in your recreational activity , I would be very glad .
如果我参加你们的文娱活动, 我会是很高兴的.

4) Coordinate sentence: the sentence has at least two predicates and also at least two coordinated clauses, one of which is the main pivot. E.g.:

Mi miras , timas , tremas . (074)
I wonder, fear, tremble.
我惊奇, 害怕, 颤抖.

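5) Mixed sentence: a complex sentence formed by interleaving the four types above. Example sentence (004) in section 3 of this paper is one.

In dealing with these sentence types, EChA is able to break complex sentences down into simple ones. The analysis program first looks for subordinate clauses. If any are found, it first enters the coordinate-clause subroutine to split them (a bare clause is passed over and control returns to the main program), then determines the front and back limits of each clause and enters the verb subroutine for processing. When processing is complete, the clause is flagged as absolutely passed over. Once all subordinate clauses have been handled, the main clause is processed. By then the sentence has the form of either a simple sentence or a coordinate sentence.

An Esperanto relative clause that is answered by a corresponding T-correlative is an appositive clause. When the T-correlative in the main clause is omitted, the clause becomes identical in form to an interrogative nominal clause, which makes recognition harder. The system does not handle this for now. Such omission looks pithy (it is common in proverbs and maxims) but should not be encouraged, since even people (especially speakers of non-Indo-European languages) often find it hard to understand.

[Example] Bone ridas , KIU laste ridas .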
Well smiles, WHO smiles at last.
谁笑得最后, 笑得最好.

KIO pasis , ne revenos .
WHAT passed, will not return.
时不再来. (一去不复返.)

CF: Nur TIU ne eraras, KIU neniam ion faras.(151)
Only THAT PERSON is not wrong, WHO never does something.
仅仅从不做某事的那个人不犯错误.

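The key to the second pass is the verb subroutine. (Verb here covers predicate verbs, adjectival participles, adverbial participles and infinitives, but not -ADO words, because Esperanto -ADO words are fully nominalized and no longer have verbal properties.) If processing subordinate clauses before the main clause is in effect a bottom-up method, the path of the verb algorithm is exactly the reverse: top-down. The verb subroutine first sets three switches. The first tests whether a verb phrase VP can be formed. If not, as with one-word sentences and bare participles or infinitives, the word is given the node information J (terminal node), its processing is finished, and the subroutine exits. The second tests whether the word is a copula; if so, control passes to the copula subroutine for appropriate handling and then returns to the verb subroutine for recursive processing. This is because copulas are special: an ordinary verb-predicate simple sentence can have only one common-case noun not preceded by a preposition (which is of course the subject), whereas a copular sentence can have two (a subject and a predicative), so it cannot enter the verb subroutine directly. The last switch tests whether the verb phrase is an extended VP; if not, it is analyzed at once. An extended VP is defined as one in which the verb's indirect-constituent levels (those whose level number >= the level number of the verb pivot + 2) contain at least one further VP. Extended verb phrases are processed recursively using a stack. The units that the verb subroutine actually processes are thus the unextended VPs of the various kinds (simple sentences, adjectival-participle phrases, adverbial-participle phrases, infinitive phrases). While it runs, the verb subroutine often needs to call other subroutines. The logical relations among the subroutines are quite clear.

The noun subroutine also sets a switch. An extended NP is defined as an NP containing at least one VP; it must return to the verb subroutine for recursive processing.

For an unextended verb phrase, the processing order is generally as follows:

verb subroutine ----> noun subroutine ----> adjective subroutine ----> adverb subroutine

This vividly embodies the "top-down" idea of analysis.

The experiment shows that EChA's two analysis passes, one concrete and one abstract, one dealing with idiosyncrasy and one with generality, one oriented to function words and one to content words, one lexicalizing syntactic analysis as far as possible and the other striving to make the analysis process logical, work together to analyze all types of Esperanto sentence very effectively. Among the 158 CDC chains that EChA output as intermediate results, only one analysis error was found. It occurs in the third line of the first poem, "LA ESPERO":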

Ne al glavo sangonsoifanta , ghi LA HOMAN tiras FAMILION . (102)
Not to sword bloodthirsty , it THE MAN'S (accusative) pulls FAMILY (accusative).

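For the sake of rhythm and rhyme, the author separated the adjectival modifier from its pivot word (still agreeing in case and number, of course) and inserted the verb predicate between them. The system therefore mistook both for objects of the predicate verb: since the structure "article + adjective" (not followed by a noun) normally stands for an NP, EChA analyzed it that way. Fortunately, this analysis error did not lead to a translation error, because both Chinese and English synthesis move a preposed object to the position after the verb pivot, which in effect restores the normal order of modifier and head. Of course this is only a coincidence.

_____________________________________________________________________

Note: [9] The discussion here of separating rules from rules owes a great deal to several conversations with my teacher Liu Zhuo.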

 


Liwei's Master's Thesis: Esperanto Morphological Analysis (5)


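Esperanto morphological analysis

Source-sentence analysis can be broadly divided into morphological analysis and syntactic analysis. The former deals with objects no larger than the word, the latter with objects no smaller than the word (sentence elements). The ultimate goal of analysis is to determine the correct CDC constituent for each word. This section discusses morphological analysis first. The discussion of word-formation analysis is also placed in this section.

The core of Esperanto morphological analysis is the ending-stripping algorithm. Esperanto has no morphological homography, so once the endings are stripped correctly, morphological analysis is complete. EChA's ending-stripping algorithm is given below. The algorithm is fairly complete and sound, and fully meets the needs of a practical system for the automatic analysis of Esperanto.

Esperanto ending-stripping algorithm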

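(1) If the last letter of the word is "-O", conclude "noun / common case / singular", strip the ending, look the word up in the content-word stem dictionary and go to the next step (2); otherwise go to step (12).

(2) If the lookup succeeds, copy the dictionary information to the work area; the word is done. Otherwise go to the next step (3).

(3) If the last two letters of the word are "-AD", conclude "AD word", strip the ending, look the word up in the content-word stem dictionary and go to the next step (4); otherwise go to step (5).

(4) If the lookup succeeds, copy the dictionary information to the work area; the word is done. Otherwise go to step (11).

(5) If the last three letters of the word are "-ANT", conclude "participle / progressive / active", strip the ending, look the word up in the content-word stem dictionary and go to step (4); otherwise go to the next step (6).

(6) If the last three letters of the word are "-INT", conclude "participle / perfect / active", strip the ending, look the word up in the content-word stem dictionary and go to step (4); otherwise go to the next step (7).

(7) If the last three letters of the word are "-ONT", conclude "participle / future / active", strip the ending, look the word up in the content-word stem dictionary and go to step (4); otherwise go to the next step (8).

(8) If the last two letters of the word are "-AT", conclude "participle / progressive / passive", strip the ending, look the word up in the content-word stem dictionary and go to step (4); otherwise go to the next step (9).

(9) If the last two letters of the word are "-IT", conclude "participle / perfect / passive", strip the ending, look the word up in the content-word stem dictionary and go to step (4); otherwise go to the next step (10).

(10) If the last two letters of the word are "-OT", conclude "participle / future / passive", strip the ending, look the word up in the content-word stem dictionary and go to step (4); otherwise go to the next step (11).

(11) Conclude "unknown word", keep the conclusions reached by ending stripping, and copy the word into the target-language meaning slot of the work area; the word is done.

(12) If the last character of the word is "-'", conclude "noun / common case / singular", strip the ending, look the word up in the content-word stem dictionary and go to step (2); otherwise go to the next step (13).

(13) If the last letter of the word is "-A", conclude "adjective / common case / singular", strip the ending, look the word up in the content-word stem dictionary and go to step (2); otherwise go to the next step (14).

(14) If the last letter of the word is "-E", conclude "adverb / common case", strip the ending, look the word up in the content-word stem dictionary and go to step (2); otherwise go to the next step (15).

(15) If the last letter of the word is "-J", conclude "common case / plural", strip the ending and go to the next step (16); otherwise go to step (18).

(16) If the last letter of the word is "-O", conclude "noun", strip the ending, look the word up in the content-word stem dictionary and go to step (2); otherwise go to the next step (17).

(17) If the last letter of the word is "-A", conclude "adjective", strip the ending, look the word up in the content-word stem dictionary and go to step (2); otherwise go to step (11).

(18) If the last letter of the word is "-N", conclude "accusative", strip the ending and go to the next step (19); otherwise go to step (23).

(19) If the last letter of the word is "-J", conclude "plural", strip the ending and go to step (16); otherwise go to the next step (20).

(20) If the last letter of the word is "-O", conclude "noun / singular", strip the ending, look the word up in the content-word stem dictionary and go to step (2); otherwise go to the next step (21).

(21) If the last letter of the word is "-A", conclude "adjective / singular", strip the ending, look the word up in the content-word stem dictionary and go to step (2); otherwise go to the next step (22).

(22) If the last letter of the word is "-E", conclude "adverb", strip the ending, look the word up in the content-word stem dictionary and go to step (2); otherwise go to step (11).

(23) If the last letter of the word is "-S", go to the next step (24); otherwise go to step (30).

(24) If the last two letters of the word are "-AS", conclude "present tense", strip the ending and go to step (28); otherwise go to the next step (25).

(25) If the last two letters of the word are "-IS", conclude "past tense", strip the ending and go to step (28); otherwise go to the next step (26).

(26) If the last two letters of the word are "-OS", conclude "future tense", strip the ending and go to step (28); otherwise go to the next step (27).

(27) If the last two letters of the word are "-US", conclude "conditional mood", strip the ending and go to step (29); otherwise go to step (32).

(28) Conclude "indicative mood" and go to the next step (29).

(29) Conclude "verb / predicate / active voice", look the word up in the content-word stem dictionary and go to step (2).

(30) If the last letter of the word is "-I", conclude "verb / infinitive", strip the ending, look the word up in the content-word stem dictionary and go to step (2); otherwise go to the next step (31).

(31) If the last letter of the word is "-U", conclude "imperative mood", strip the ending and go to step (29); otherwise go to the next step (32).

(32) Look the word up in the function-word dictionary (since the word has no ending to strip). If the lookup succeeds, copy the dictionary information to the work area; the word is done. Otherwise conclude "noun / proper noun" and return to step (11).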

[注] 世界语基本法规第16条说: "名词和冠词末尾的元音字母可以省略, 用省略号 ' 来代替". 这种现象多出现在诗歌里, 如 MOND'(103). 我们在步骤(12)对它作了处理(冠词是长度小于 3 的虚词, 直接查虚词词典, 不入削尾一线, 故不予考虑).
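The 32 steps above can be condensed into a short sketch. This is not the original BASIC program: the function name `strip_ending`, the feature labels and the `lexicon` set of stems are assumptions made for illustration, and the routing of short function words (which the text sends to the function-word dictionary before any stripping) is left to the caller.

```python
# A minimal sketch of the ending-stripping steps; names and labels are hypothetical.
VERB_ENDINGS = {"AS": "present", "IS": "past", "OS": "future"}
PARTICIPLES = {
    "ANT": ("active", "progressive"), "INT": ("active", "perfect"),
    "ONT": ("active", "future"), "AT": ("passive", "progressive"),
    "IT": ("passive", "perfect"), "OT": ("passive", "future"),
}

def strip_ending(word, lexicon):
    """Return (stem, features), or None when no grammatical ending is recognised."""
    w = word.upper()
    feats = {}
    if w.endswith("N"):                       # step (18): accusative
        feats["case"] = "accusative"
        w = w[:-1]
    if w.endswith("J"):                       # steps (15)/(19): plural
        feats["number"] = "plural"
        w = w[:-1]
    nominal_only = bool(feats)                # after -N/-J only nominal endings may follow
    if w.endswith("O") or (w.endswith("'") and not nominal_only):
        feats["pos"] = "noun"                 # steps (1), (12), (16), (20)
        stem = w[:-1]
    elif w.endswith("A"):
        feats["pos"] = "adjective"            # steps (13), (17), (21)
        stem = w[:-1]
    elif w.endswith("E") and "number" not in feats:
        feats["pos"] = "adverb"               # steps (14), (22)
        stem = w[:-1]
    elif not nominal_only and w[-2:] in VERB_ENDINGS:
        feats.update(pos="verb", mood="indicative", tense=VERB_ENDINGS[w[-2:]])
        stem = w[:-2]                         # steps (24)-(26), (28)-(29)
    elif not nominal_only and w.endswith("US"):
        feats.update(pos="verb", mood="conditional")
        stem = w[:-2]                         # step (27)
    elif not nominal_only and w.endswith("I"):
        feats.update(pos="verb", form="infinitive")
        stem = w[:-1]                         # step (30)
    elif not nominal_only and w.endswith("U"):
        feats.update(pos="verb", mood="imperative")
        stem = w[:-1]                         # step (31)
    else:
        return None                           # no ending: function word or unknown word
    if stem not in lexicon:                   # steps (3)-(10): participle and -AD stems
        for suffix, (voice, aspect) in PARTICIPLES.items():
            if stem.endswith(suffix) and stem[:-len(suffix)] in lexicon:
                feats.update(participle=True, voice=voice, aspect=aspect)
                stem = stem[:-len(suffix)]
                break
        else:
            if stem.endswith("AD") and stem[:-2] in lexicon:
                feats["ad_word"] = True
                stem = stem[:-2]
    feats["known"] = stem in lexicon          # step (11): unknown word if still not found
    return stem, feats
```

For example, `strip_ending("skribantaj", {"SKRIB"})` peels the plural -J, the adjective -A and then the participle -ANT, which is the "strip first, look up afterwards" order the algorithm prescribes.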

我们谈谈构词分析问题, 这包括两个方面: 1. 关于建立削缀算法(派生词处理)的讨论; 2. 关于拆离合成词的讨论. 在现行的EChA系统中, 这两个问题都回避了. 我们建立的词典, 是以词干(包括合成词词干)作存贮单位的, 加工词只要削去语法词尾, 就可以查到. 但是, 应该指出, 这样做, 对于世界语这种构词特别灵活的语言并不合理. 以词干存词, 在做小型实验时还可应付, 如果是实用系统, 就会出现存不胜存的情况. 我们主张实词词典既存词根也存词干, 同时建立一个完全的世界语削缀算法和合成词拆离算法, 以便对付生词. (世界语除国际性的专业词汇外, 基本词根很有限. 所谓生词, 一般都是由基本词根及几十个词缀随机组合的派生词或合成词. 因此, 只要切分正确, 生词便不 "生".)

世界语后缀可以叠加(理论上无限), 但前缀通常只能有一个. 这样词典一线的加工路径应该是:

[Figure: processing path of the dictionary line (image lw9 not preserved)]

削缀与削尾不同, 并非有缀必削. 对于削尾, 机器是先削后查, 而对于削缀, 则是先查词典, 查不着的生词再去削缀. 这样处理便于我们根据设计要求(实验型还是实用型, 对于翻译速度, 质量, 成本的要求等等)和机器条件(内存容量, 运算速度等)决定实词词典收词干的标准.
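The "look up first, strip affixes only for unknown stems" policy can be sketched as follows. EChA itself did not implement affix stripping, so this is only an illustration under assumptions: the function `analyze_stem`, the short affix lists (taken from the examples later in this section) and the toy lexicons are all hypothetical. Suffixes may stack, while at most one prefix is allowed.

```python
# Hypothetical affix inventories, limited to affixes mentioned in this section.
PREFIXES = ("MAL", "VIC", "EKS", "BO", "GE", "FI")
SUFFIXES = ("ACH", "AN", "UL", "IN", "EBL", "EC", "EM", "IND")

def analyze_stem(stem, lexicon, allow_prefix=True):
    """Return the morphemes of `stem`, or None if it cannot be reduced to a known root."""
    if stem in lexicon:                      # lookup comes first: listed stems are never split
        return [stem]
    for suf in SUFFIXES:                     # suffixes can stack, so recurse on the remainder
        if stem.endswith(suf) and len(stem) > len(suf):
            inner = analyze_stem(stem[:-len(suf)], lexicon, allow_prefix)
            if inner is not None:
                return inner + [suf]
    if allow_prefix:                         # at most one prefix
        for pre in PREFIXES:
            if stem.startswith(pre) and len(stem) > len(pre):
                inner = analyze_stem(stem[len(pre):], lexicon, allow_prefix=False)
                if inner is not None:
                    return [pre] + inner
    return None
```

Because the lexicon is consulted before any splitting, adding a stem such as DOMACH to the dictionary automatically switches it from analyzed to whole-unit treatment, which is the tuning knob the paragraph above describes.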

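Today, thanks to advances in computer technology, machines are becoming ever more capable (in storage and speed) while costs fall sharply. Some in the machine translation community therefore argue that storage units should be large rather than small (for example, that as many idioms as possible should be listed [7]), so that mass storage and fast lookup lighten the burden of analysis. This is a perceptive view. The larger the unit, the more determinate it is, the lower the demands on analysis and synthesis (machine intelligence), the easier the system is to develop, and the better the translation. Since machine translation is a highly practical discipline, this position is all the more valuable. Of course, bigger is not always better, because each step up in unit size (root to stem, stem to word, word to phrase, phrase to sentence) increases the number of possible combinations exponentially.[8] Taken to the extreme, with the sentence as the storage unit, no analysis or synthesis is needed at all: the translation is simply looked up. The degree of artificial intelligence is then zero, yet translation quality can be optimal (taking human quality as optimal). Unfortunately, however advanced the hardware, storage capacity and lookup speed remain finite and cannot cope with an unbounded set of sentences. (For special needs within a limited domain the approach is feasible, as in a travel phrase translator. Does that still count as machine translation? It should, though not as machine translation in the artificial-intelligence sense.) At the other extreme, the unit of analysis is the morpheme (root, affix, ending). This needs the smallest dictionary (roots only) and the highest level of artificial intelligence: not only syntactic analysis and synthesis but also word-formation analysis and synthesis. Yet for all that effort, quality is the least assured, because a sentence broken into very small pieces (source analysis) inevitably shows some ugly seams when put back together (target synthesis). Existing MT systems therefore generally pick some point between the two extremes, according to circumstances and the designer's views. We believe a good practical system should be able to do both: analyze thoroughly, and also handle common phrases (idioms) as unanalyzed wholes, fine-grained where fineness is needed and coarse where coarseness serves. In general, frequent, fixed, idiosyncratic and enumerable phenomena are better treated coarsely, whereas regular, freely productive phenomena call for finer analysis. So for a practical MT system with Esperanto as the source language, we advocate listing as many idioms and affixed stems as possible, and we equally affirm the need for a complete affix-stripping algorithm.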

那么, 世界语实词词典收多少派生词词干比较合理呢? 对于独立型机器翻译:

(1) 如果是小型实验系统, 目的是在有限的材料内试验系统的句法分析和综合能力, 那就词干全收; 否则:

(2) 凡是常用的派生词词干一律收进词典, 而不再入削缀子程序----常用性(出现频率高)是根本标准;

(3) 有助于区别同形多义的派生词词干, 应该收;

(4) 可收可不收的, 主张收;

(5) 在刚开始设计实用系统的机器词典时, 由于世界语词缀的极端灵活性和随机性, 很难一次收入许多带缀的词干, 这样, 削缀算法就显得更重要. 削下缀来, 虽然表义不是很确切, 甚至有时在目标语综合时, 还需要辅以说明性注释(见后面例释), 但总比直接打出生词来(信息量为零)强出百倍. 随着系统的不断扩充和完善, 收的词干自然会越来越多.

如果是具有特定的目标语的相关型机器翻译:

(1) 收多少派生词词干应该考虑目标语的构词特点及词汇状况;

(2) 在目标语中作为一个完整概念, 而不是词根和词缀意义简单相加所能反映的词干, 应该收入词典. 如: DOM-EGO 楼房, 大厦 (而不是一般的 "大-房子" );

(3) 如果以汉语为目标语, 削缀更多一些, 因为世汉构词法很相似, 汉族人的心理本能地习惯于理解词素与词素的组合. (这种民族偏爱心理在引进外来词时表现的很明显, 如 "德律风" 为 "电话" 取代, "莱塞" 为 "激光" 取代等.) 可以举出很多世汉构词神似的例子. 而且也有许多世界语派生词如 DOM-ACHO 虽然整个儿译作 "陋室" 更雅一些, 但也不妨用统一的削缀合成法组成新词 "鬼-房子", 与原义相去也不远. 特别是有些缀与汉字(词素)有很多一致性, 如 VIC-/副-, -IN-/女-, -EBL-/可- 等等, 就更有理由作削缀处理.

世汉构词对比例释(1): 派生词

(1) BO- 姻- : BO-PATRO 姻-父亲 (岳父或公公) , BO-FILO 姻-儿子 (女婿) , BO-FRATO 姻-兄弟 (内弟) ;

(2) GE- (男女)- : GE-AMIKOJ (男女)-朋友们 , GE-KAMARADOJ (男女)-同志们 , GE-AKTOROJ (男女)-演员们 ;

(3) EKS- 前- : EKS-OFICISTO 前-职员 , EKS-MINISTRO 前-部长 , EKS-INSTRUISTO 前-教师 ;

(4) MAL- [反义] : MAL-BONA [反义]好 (坏) , MAL-AMIKO [反义]朋友 (敌人) , MAL-SAGHE [反义]聪明 (愚笨) ;

[说明] MAL-是世界语中用得最广, 随机性最强的前缀之一, 具有极强的造词能力, 可惜, 中文没有对应的词素. 如果系统遇到某个MAL-型生词, 削下前缀后给出[反义]这样的说明性标识, 也还可以使人理解.

(5) VIC- 副- : VIC-PREZIDANTO 副-主席 , VIC-ESTRO 副-队长 , VIC-CHEFMINISTRO 副-总理 ;

(6) FI- 坏- : FI-INSEKTO 坏-虫 , FI-KOMERCISTO 坏-商人 (奸商) , FI-KUTIMO 坏-习惯 (恶习) ;

(7) SEN- 1. if the root's logical class is noun, "无-" : SEN-GUSTA 无-味的 , SEN-SENCA 无-意义的 ;

  2. if the root's logical class is verb, "不-" : SEN-MORTA 不-死的 (不朽的) , SEN-ATENTA 不-注意的 ;

(8) NE- 若词根逻辑类为名词则 "非-" 否则 "不-" : NE-ESPERANTISTO 非-世界语者 , NE-BONA 不-好的 ;

(9) Prepositional prefixes:  1. SUR- -上: SUR-TABLE 桌子-上 ; 2. APUD- -旁: APUD-VOJA 路-旁的 ;

  3. EN- -内: EN-LANDE 国-内 ; 4. LAU- 按-: LAU-VICE 按-次序 ; 5. DE- 从-: DE-NOVE 从-新 ;

(10) -ACH- 鬼- : DOM-ACHO 鬼-房子 (陋室) , KNAB-ACHO 鬼-男孩 (捣蛋鬼) , VETER-ACHO 鬼天气 ;

(11) -AN- -成员 : KLUB-ANO 俱乐部-成员 , KURS-ANO 讲习班-成员 , KOMUNUM-ANO 公社-成员 ;

(12) -UL- -者 : BON-ULO 好-者 , KAR-ULO 亲爱-者 , JUN-ULO 年青-者 , LONG-KRUR-ULO 长/腿-者 ;

(13) -IN- 女- : KAMARAD-INO 女-同志 , INSTRUIST-INO 女-教师 , OFICIST-INO 女-职员 , AKTOR-INO 女-演员 ;

(14) -EBL- 可- : VID-EBLA 可-见的 , MANGH-EBLA 可-吃的 , UZ-EBLA 可-用的 , NE-ATING-EBLA 不-可-达到的 ;

(15) -EC- -性 : CERT-ECO 确实-性 , NECES-ECO 必要-性 , KLAR-ECO 清楚-性 , LIBER-ECO 自由-性 ;

(16) -EM- 爱- : LABOR-EMA 爱-工作的 (勤劳的) , PAROL-EMA 爱-说话的 , MENSOG-EMA 爱-撒谎的 ;

(17) -IND- 值得- : LERN-INDA 值得-学习的 , LAUD-INDE 值得-称赞 , LEG-INDA 值得-读的 , AM-INDA 值得-爱的 ;

(18) -ON- 1. for -ONO, "-分之一": DU-ONO 二-分之一 , TRI-ONO 三-分之一 , KVAR-ONO 四-分之一 ;

  2. for X + Y-ONOJ, "Y-分之X": TRI DEK-ONOJ 十-分之三 , KVIN OK-ONOJ 八-分之五 .
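Entries (7), (8) and (18) in the list above are conditional: the Chinese gloss depends on the logical class of the root or on whether a numerator is present. A small sketch of how such rules might be coded is given here; the function names, the class codes passed as arguments and the numeral table are assumptions for illustration, not part of EChA.

```python
# Hypothetical helpers for the conditional glosses of SEN-, NE- and -ON-.
NUMERALS = {"UNU": "一", "DU": "二", "TRI": "三", "KVAR": "四", "KVIN": "五",
            "SES": "六", "SEP": "七", "OK": "八", "NAU": "九", "DEK": "十"}

def gloss_prefix(prefix, root_class, root_gloss):
    """Pick the Chinese marker for SEN-/NE- from the root's logical class (N, V, A, ...)."""
    if prefix == "SEN":
        marker = {"N": "无-", "V": "不-"}.get(root_class)
        if marker is None:
            raise ValueError("the list only defines SEN- for noun and verb roots")
    elif prefix == "NE":
        marker = "非-" if root_class == "N" else "不-"
    else:
        raise ValueError("only SEN- and NE- are handled in this sketch")
    return marker + root_gloss

def gloss_fraction(denominator, numerator=None):
    """DU-ONO -> 二-分之一 ; TRI DEK-ONOJ -> 十-分之三."""
    top = NUMERALS[numerator] if numerator else "一"
    return NUMERALS[denominator] + "-分之" + top
```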

合成词 ("词根+词根") 也是一样. 比较固定的, 应该整个儿存入词典, 随机组合的, 应该拆开. 但这儿有一个困难, 世界语语法为了方便使用者, 即便对完全随机组合的合成词, 也不作加连字符的规定. 那么怎么拆呢? 词根的数量与词缀不能比, 长度也变化很大, 一个字母一个字母地削查比较, 显然不是办法. 如果坚持不要译前编辑, 还找不到一个合理的解决办法. 目前可以考虑先对中间有连字符的合成词作拆词加工. 我们提倡除比较固定常用的合成词外, 世界语者在运用随机合成词时,为读者的省力和机器的识辨计加上连字符. 鉴于世界语构词法与汉语构词法惊人的一致(组合方式及其高度随机性都很类似), 对于世汉机器翻译这一倡议更加必要.

世汉构词对比例释(2): 合成词

(1) AKVO-FONTO 水/源 ; (2) VARM-ENERGIO 热/能 ; (3) ARBO-BRANCHO 树/枝 ; (4) VAPOR-SHIPO 汽/船 ;

(5) SURD-MUT-ULO 聋/哑-者 ; (6) BLANK-HARA 白/发的 ; (7) NUD-PIEDA 光/脚的 ; (8) FISH-KAPTI 捕/鱼

______________________________________________________________

附注: [7] 参见:

刘涌泉 <<中国的机器翻译>> ( <<情报科学>> 1980, 3 )

王广义 <<机器翻译中的固定词组和固定结构问题>> ( <<语言和计算机>> (1), 1982 )

[8] 参看: 叶蜚声, 徐通锵 <<语言学纲要>> 第二章第二节 " 1. 语言的层级体系", PP.34-36 ( 北京大学出版社, 1981 )

 

 


立委硕士论文:EChA机器词典及词表 (4)

世界语到汉语和英语的自动翻译试验
-- EChA机器翻译系统概述

EChA机器词典及词表

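All EChA dictionaries and word tables are random-access data files, each with its own set of peripheral maintenance programs for revision and extension, which makes the system easier to improve. The definitions of the dictionaries and tables are given below.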

1) 实词词干词典
Format (one record; field names as used in the definitions below):

词干 丨 逻辑类 丨 及物性 丨 带不定式 丨 支配词 丨 支配词汉义码 丨 汉义 丨 汉义特征 丨 英义 丨 英义特征 丨 语义特征 丨 词类词义区分表记录号 丨 备用项

<逻辑类>::= { N, V, A, F, P, C, K, T, R, S, W, E, D, X }

N=名词 , V=动词 , A=形容词 , F=副词 , P=介词 , C=连词或标点 , K=K类相关词 ,
T=T类相关词 , R=其他相关词 , S=数词 , W=人称代词 , E=系词 , D=冠词 , X=万能词

[说明] 逻辑类用来表明词的静态词性. 世界语实词的语法词性是动态随机的, 只能由削尾决定. 但每个词一般具有一个基本词性, 这是单词的深层的逻辑特征. 语法词性不过是由它通过加词尾派生的表层的句法特征.

<汉义特征>::= { "...以后", "...的", "使...", "把...", "给...", "...下", "...上", "...里", "...时",
多义词特征, 构成成语特征, ... }

[说明] 汉义特征揭示了该词汉义的结构特性, 也给出了汉语生成的修辞信息.

<英义特征>::= { 不规则变化特征, 双写特征, 形式不变特征, ... }

[说明] 英义特征给出该词的英语形态生成方式信息.

<支配词汉义>::= { 零义, "给", "以", "到", ... }

[说明] 支配词汉义标示该词所支配的词(通常是介词)的汉义.

<语义特征>::= { HM, LK, TM, FX, ... }

HM=人类特征, LK=地点特征, TM=时间特征, FX=方向特征
2) 虚词词典

虚词词典除包含实词词典的各项信息外, 还揭示了部分CDC信息, 如词性, 格, 数, 关系, 分布, 节点等. 分析之前就能在词典里给出某些动态信息, 这是由虚词特点决定的. 例如: 介词永远处于非终结节点(节点"Y")上, 原副词和万能词一般是不扩展的, 所以总处于终结节点(节点"J")上. 万能词 ECH (EVEN) 永远位于其轴心词之前(分布"Q"). 原副词 JAM (ALREADY) 永远做状语(关系"F"). 从属连词 KE (THAT) 总是引导名词性从句(词类"K", 节点"K"), 而且总位于其轴心词之后(分布"H").

冠词LA永远做定语(关系"D"), 位于轴心词前(分布"Q"), 处于终结节点上(节点"J").

3) 成语词典

机器翻译界所谓的成语, 比其通常的意义要宽泛得多. 凡是常用的比较固定的词组都可收作成语. 世界语中纯粹的不可分析的习惯表达法较少, 所以成语词典容量相对不大. 成语词典的收词范围, 还在很大程度上决定于原语和译语的对比差异. 亲属关系相近的表达方法类似, 可以少收或不收成语. 在EChA中, 就没有设立世英成语词典, 只有一部世汉成语词典.

EChA成语例释:

MALFERMA(JN) AUTO(JN) ----- 敞蓬汽车 ( CF: OPEN CAR(S) )
SOMERA(JN) FERIO(JN) ----- 暑假 ( CF: SUMMER HOLIDAY(S) )
LA ANGLA(N) LINGVO(N) ---- 英语 ( CF: THE ENGLISH LANGUAGE )
INSTRUA(JN) LIBRO(JN) ---- 教科书 ( CF: TEACHING BOOK(S) )
LA GRANDA(N) MURO(N) ---- 长城 ( CF: THE GREAT WALL )
HOMA(N) SVARMO(N) ---- 人群 ( CF: MAN'S SWARM )
FACILA(N) VENTO(N) ---- 顺风 (CF: EASY WIND )

4) 词类词义区分表

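This table is very necessary for machine translation with Esperanto as the source language, since it greatly lightens the burden of sense disambiguation during synthesis. A word is entered in the table whenever its target-language sense changes with its part of speech relative to its logical class and the change does not follow the regular morphological conversion. For example, MATEMATIK-A(JN) must be entered, whereas HOM-A(JN) need not be: the English for the former is MATHEMATICAL (not MATHEMATICS' ), while for the latter it suffices to generate, by rule, the target-language possessive ending -'S or the particle "的" from the source adjectival form ( MAN-'S / "人-的" ). In the content-word dictionary, every word that needs the table carries the record number (random-file address) of its table entry, so the system simply fetches the record at that address. When programming in BASIC, using random-file record numbers as internal word codes is to be recommended.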

词类词义区分表例释:

实词词典                      词类词义区分表

ATING-I: ACHIEVE / 达到        ATING-O: ACHIEVEMENT / 成就
EKZEMPL-O: EXAMPLE / 例子      EKZEMPL-E: FOR EXAMPLE / 例如
KOMENC-I: BEGIN / 开始         KOMENC-E: AT BEGINNING / 开始时
MEZUR-I: MEASURE / 测量        MEZUR-O: MEASUREMENT / 尺寸
OKAZ-I: HAPPEN / 发生          OKAZ-O: OCCASION / 场合
SCI-I: KNOW / 知道             SCI-O: KNOWLEDGE / 知识
TIP-O: TYPE / 型号             TIP-A: TYPICAL / 典型的
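The lookup order implied by the table can be sketched as follows. EChA stored a record number in the content-word dictionary and fetched the table record by address; the sketch below instead keys the table by (stem, ending), which is a simplifying assumption, and the names `CONTENT_DICT`, `DISTINCTION` and `target_sense` are hypothetical. The data are taken from the examples above.

```python
# Hypothetical in-memory stand-ins for the content-word dictionary and the distinction table.
CONTENT_DICT = {
    "ATING": {"class": "V", "en": "ACHIEVE", "zh": "达到"},
    "TIP":   {"class": "N", "en": "TYPE", "zh": "型号"},
    "HOM":   {"class": "N", "en": "MAN", "zh": "人"},
}
DISTINCTION = {
    ("ATING", "O"): ("ACHIEVEMENT", "成就"),
    ("TIP", "A"):   ("TYPICAL", "典型的"),
}
ENDING_CLASS = {"O": "N", "A": "A", "E": "F", "I": "V"}   # ending -> part of speech

def target_sense(stem, ending):
    """Return (English, Chinese), or None when regular morphological conversion applies."""
    entry = CONTENT_DICT[stem]
    if ENDING_CLASS[ending] == entry["class"]:
        return entry["en"], entry["zh"]          # basic sense, no conversion needed
    if (stem, ending) in DISTINCTION:
        return DISTINCTION[(stem, ending)]       # irregular sense listed in the table
    return None                                  # e.g. HOM-A -> MAN'S / 人-的, generated by rule
```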

5) 英语不规则词表

这个词表跟一般英语词典附录中列的不规则表没什么两样, 不过为了简便, 我们把动词形式的不规则变化和名词复数的不规则变化放在一个表内. 不规则词表是供英语形态生成查用的.

英语不规则词表

原形             过去时                过去分词              名词复数

BEAT             BEAT                  BEATEN
BECOME       BECAME                BECOME
...              ...                   ...                    ...
CHILD                                                         CHILDREN
...              ...                   ...                    ...

最后我们给出EChA句子加工场的格式:

目标语序号丨实词词典各项丨CDC信息丨已加工特征丨虚词特征丨
目标语调序信息丨目标语位移序号丨

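[Notes] 1. The target-language sequence number is used during synthesis to give the same number to items that are reduced together in bottom-up processing.

  2. The target-language displacement number stands for the whole entry when virtual reordering is done by the moving method. Virtual reordering, which moves a number in place of the whole entry, is more efficient than the pure moving method and roughly comparable to the linked-list method. Since BASIC cannot handle composite (record) variables, reordering by the moving method would have to move an entry one field at a time, so virtual reordering is all the more advantageous. Note, however, that the word's natural sequence number must move together with the displacement number to mark the position of the original entry; only then can later lookups be made safely.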
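The idea of virtual reordering, moving a number instead of the whole record, can be illustrated with a few lines. This is a sketch under assumptions rather than the original BASIC code: the record layout, `move` and `realize` are hypothetical names, and the sentence is the word-order example "Vin amas mi" used elsewhere in the thesis.

```python
# Records stay where they are; only the list of sequence numbers is permuted.
workspace = [
    {"natural": 1, "word": "Vin", "en": "you"},
    {"natural": 2, "word": "amas", "en": "love"},
    {"natural": 3, "word": "mi", "en": "I"},
]

def move(order, number, new_position):
    """Move one sequence number to a new position in the order list."""
    order.remove(number)
    order.insert(new_position, number)
    return order

def realize(workspace, order):
    """Read the target words out in the permuted order via the natural numbers."""
    by_natural = {entry["natural"]: entry for entry in workspace}
    return [by_natural[n]["en"] for n in order]

order = [entry["natural"] for entry in workspace]   # starts in source order
move(order, 3, 0)                                   # subject to the front
move(order, 1, 2)                                   # object after the verb
```

After the two moves `order` is `[3, 2, 1]` and `realize(workspace, order)` yields the English order, while the multi-field records in `workspace` were never copied.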


 

立委硕士论文:层次递归成分体系 (3)

世界语到汉语和英语的自动翻译试验
-- EChA机器翻译系统概述

层次递归成分体系

在给出层次递归成分体系(CDC)的定义之前, 我们先说说该体系的来源及其理论依据.

CDC体系是机器翻译的一种中间语言, 我们试图提供一套更加合乎独立分析独立综合要求的机器翻译抽象文法. CDC是EChA系统的关键, 它体现了我们对语言结构的看法和对机器翻译的认识. CDC是直接从导师们的中介成分体系[2] 脱胎而来的, 它保留了中介成分的形式, 继承和改造了它的内容, 其思想基础是有向直接联系理论(或轴心词理论). 体现在CDC中的要点是:

1) 句子的最顶层是主句谓语, 它是全句的最大联系中心(主轴心), 所以谓语是全句的代表. 一个完整的句子的最简单也是最典型的形式, 就是独词祈使句. 如:

Venu! Come! 来!

任何其他句子(无谓句是不完整句, 除外)都是从上面的简单形式一层一层推衍出来的:

Venu! ... La studento venu chi tien! ... La studento, kiu parolis, venu chi tien! ......

Come!     Let the student come here!     Let the student, who spoke, come here!

反过来说, 对一个无论怎样复杂的句子层层归约, 归约的顶层必然是主句动词谓语:

VENU
 +-- studento
 |    +-- la
 |    +-- parolis
 |         +-- (,)
 |         +-- kiu
 |         +-- (,)
 +-- tien
 |    +-- chi
 +-- (!)

2) 一个词只能跟另外的一个词发生直接联系, 但一个词可以带 N 个 ( N>=0 ) 直接联系词. 这就是句子结构的有向直接联系观点.[3] 带直接联系词的词叫轴心词, 当 N>0 时, 它是非终结节点词. 直接联系词本身也常常是低一层次的轴心词.

3) 主句谓语(主轴心)处在第一层. 与主句谓语发生直接联系的词位于第二层. 与第二层词直接联系的词在第三层. 这样一环扣一环, 组成句子的每一个词都处在某一个层次上. 理论上说, 句子的层次可以是无限的.

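4) "Function words are not empty." Function words (also called structural words) carry more syntactic-structure information than content words. Some function words can likewise serve as head (axis) words. For example, in a "preposition + noun" structure the preposition is the head. Subordinating conjunctions such as SE (IF) and KVANKAM (ALTHOUGH) also serve as heads: as the representative of the subordinate clause, the conjunction is directly linked to the main-clause predicate, and the directly linked word below it is the predicate of the subordinate clause.

5) As the intermediate-language mapping of a source sentence, hierarchical recursive constituents should, and can, be assigned to every single word. From the machine's point of view a word is simply a string of characters between two spaces (Chinese is a separate matter). Strictly speaking, punctuation marks are also words (function words) and take part in the analysis and reduction of the sentence.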

建立CDC体系的两项基本原则是:

1) 层次递归原则: 有多少层次反映多少层次, 而且层次是递归的. 层次的递归性表现在: (1) 对文句可以自底而上层层归约(参见EChA系统的目标语生成算法); (2) 对文句可以自顶而下层层分析(参见EChA的源语分析算法).

2) 词本位原则:[4] 词到句子(以主句谓语为代表)是一个动态递归过程的两极, 其间的各个环节就是所谓层次. 贯彻词本位原则的实质, 就是在一切层次上都把成分(CDC)落实到词. 句子是, 也仅仅是由句素组成的. 而每一个大大小小的句素(词组, 短语, 从句等)按照我们的看法, 总是以一个轴心词来代表的.

现在, 我们给出层次递归成分体系的形式化定义:

  1. 层次递归成分体系是层次递归成分的集合.
  2. 层次递归成分是这样一个六元信息组:
    形态信息 | 结构关系信息 | 节点信息 | 分布信息 | 层号信息 | 链号信息
  1.  <形态信息>::=
    { <词性>, <格>, <数>, <时态>, <语态>, <语式>, <非谓语形式>, <体>, <人称>, ... }

<词性>::= { N, V, A, F, P, Z, C, K, B }

N=名词, V=动词, A=形容词, F=副词, P=介词, Z=助动词, C=并列连词,
K=主从连词, B=标点符号

<格>::= { 非格, 普通格, 目的格 }

<数>::= { 非数, 单数, 复数 }

<时态>::= { 非时态, 现在时, 过去时, 将来时 }

<语态>::= { 非语态, 主动语态, 被动语态 }

<语式>::= { 非语式, 陈述语式, 命令语式, 虚拟语式 }

<非谓语形式>::= { 非非谓语形式, 分词, 不定式, 名动词 }

<体>::= { 非体, 进行体, 完成体, 将来体 }

<人称>::= { 非人称, 第一人称, 第二人称, 第三人称 }

  2. <结构关系信息>::= { S, W, O, D, F, B, T, I, C, L, M, A, Z, V, R }

S=主语, W=谓语, O=宾语, D=定语, F=状语, B=补语, T=同位语,
I=独立成分, C=同等连词或标点, L=从句起始标点, M=从句末标点,
A=插入成分起始标点,Z=插入成分末标点, V=非结构意义标点, R=句末标点

  3. <节点信息>::= { J, <非终结节点> }

J=终结节点

<非终结节点>::= { S, O, D, B, K, X, Y }

S=主语从句节点, O=宾语从句节点, D=定语从句节点, B=补语从句节点,
K=一般从句节点, X=动词性非终结节点, Y=其他非终结节点

  4. <分布信息>::= { Q, H, G }

Q=位于轴心词前, H=位于轴心词后, G=轴心

  5. <层号信息>::= { 非层号, <自然数> }

<自然数>::= { 1, 2, 3, ... }

  6. <链号信息>::= { <左链号>, <右链号> }

<左链号>::= { 非左链号, 99, N }

N=大于句首号小于句末号的自然数

<右链号>::= { 非右链号, N }

[说明]   左链号的设置是为了处理同等成分的方便. 我们把同等成分的最右元素认作整个成分的代表(落脚点, 轴心).  左链号99是同等成分最左元素的标志. 有了左链号, 消除了后顾之忧, 同等成分就可以和其他句素一样, 参加文句的分析和归约.

下面是用这套成分体系作分析的例句(004):

CDC中形态信息略去, 余下依次是: 关系/节点/分布/层号/左链/右链, 例如:

FJQ 05 00 02 --->
状语/终结节点/位于其轴心词之前/处于第5层/没有左链(00是非左链号)
/右链号为02

(Position numbers are added for reference; "-" marks a word with no separate gloss in that language.)

No.  Esperanto      English          Chinese          CDC
01   Pli            More             更               FJQ 05 00 02
02   poste          later            以后             FYQ 04 00 17
03   ,              ,                ,                LJQ 05 00 04
04   kiam           when             当(...时)        FKQ 04 00 17
05   la             the              -                DJQ 07 00 06
06   sciodisketoj   knowledge-disks  微型知识磁盘     SYQ 06 00 07
07   estis          had been         被               WBH 05 00 04
08   eltrovitaj     found out        发明了           BJH 06 00 07
09   ,              ,                ,                MJH 05 00 04
10   la             the              -                DJQ 05 00 12
11   plenan         full             全套             DJQ 05 00 12
12   indikaron      indication       指令集合         OYQ 04 00 17   [note: accusative]
13   ,              ,                ,                AJQ 06 00 14
14   endiskigitan   endisked         所写入磁盘的     DYH 05 00 12
15   ,              ,                ,                ZJH 06 00 14
16   oni            people           人们             SJQ 04 00 17
17   metis          put              放               WXG 03 99 20
18   en             into             到(...里面)      BYH 04 00 17
19   mashinojn      machines         机器             OJH 05 00 18
20   kaj            and              -                CJQ 02 17 23
21   ili            they             它们             SJQ 02 00 23
22   tiamaniere     therefore        这样             FJQ 02 00 23
23   povis          could            能               WXG 01 20 00
24   en             in               在(...里面)      FYQ 03 00 27
25   si             themselves       自己             BYH 04 00 24
26   mem            -                本身             BJH 05 00 25
27   akumuli        accumulate       积累             BXH 02 00 23
28   sciencan       scientific       科学             DJQ 04 00 29
29   stokon         stock            贮蓄             OYH 03 00 27
30   ,              ,                ,                VJQ 05 00 32
31   pli            more             更               FJQ 05 00 32
32   grandan        great            大               DYH 04 00 29
33   ol             than             比               FYH 05 00 32
34   la             the              -                DJQ 07 00 36
35   homa           man's            人的             DJQ 07 00 36
36   cerbo          brain            头脑             BYH 06 00 33
37   .              .                .                RJH 02 00 23

层次递归成分实质上就是不同层次的词之间直接联系关系的一种反映. 它揭示了文句结构的正确的句法树. 根据文句的CDC链, 我们很容易画出该句的句法树.
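As a rough illustration of how a tree can be read off a CDC chain, the sketch below parses the codes of words 16 to 23 of example (004) and rebuilds the head-dependent links. It rests on one assumption that should be stated plainly: the right chain number is read as the position of the word's head, with 00 marking the main axis. The function names are hypothetical. Under that reading, the depth of each word in the rebuilt tree matches the layer number stored in its code.

```python
from collections import defaultdict

def parse_cdc(code):
    """Split a code such as 'FJQ 05 00 02' into its six fields (morphology omitted)."""
    tag, level, left, right = code.split()
    return {"relation": tag[0], "node": tag[1], "distribution": tag[2],
            "level": int(level), "left": int(left), "right": int(right)}

def build_tree(chain):
    """chain maps position -> (word, code); returns (root positions, children map)."""
    children = defaultdict(list)
    roots = []
    for pos in sorted(chain):
        head = parse_cdc(chain[pos][1])["right"]   # assumption: right chain = head position
        if head == 0:
            roots.append(pos)                      # 00: no head, i.e. the main axis
        else:
            children[head].append(pos)
    return roots, dict(children)

def depths(roots, children):
    """Layer of every word, counting the root as layer 1."""
    out = {}
    stack = [(r, 1) for r in roots]
    while stack:
        pos, depth = stack.pop()
        out[pos] = depth
        stack.extend((c, depth + 1) for c in children.get(pos, []))
    return out

chain = {
    16: ("oni", "SJQ 04 00 17"), 17: ("metis", "WXG 03 99 20"),
    18: ("en", "BYH 04 00 17"), 19: ("mashinojn", "OJH 05 00 18"),
    20: ("kaj", "CJQ 02 17 23"), 21: ("ili", "SJQ 02 00 23"),
    22: ("tiamaniere", "FJQ 02 00 23"), 23: ("povis", "WXG 01 20 00"),
}
roots, children = build_tree(chain)
layer = depths(roots, children)
```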

实验证明, 作为体现独立分析结果的机器翻译中间语言, 层次递归成分体系是比较有效的. 现在, 越来越多的专家呼吁建立能充分体现对源语分析的结果, 正确揭示文句的层次结构和语义信息的媒介语, 或类似媒介语的东西. 许多文章论证了分析和综合独立的必要性. 原语分析依赖译语, 或译语综合依赖原语, 使分析和综合都不能深入, 而且难免捉襟见肘.[5]

当然, 层次递归成分体系还处于草创时期, 必然存在不少问题, 有待于在实践中不断检验, 改进和完善. 通过时间的考验和我们的努力, 也许它最终能成为一个比较得心应手的机译工具, 而为人们乐于采用, 这当然是我们所希望的. 也许它不是一个好的方案, 很快便被淘汰了. 但无论如何, 总是一次有益的尝试.

这套体系的不足之处是, 它不大能够反映有向直接联系的语义性质, 而这对于高质量的机器翻译是比较关键的信息. 人类语言不管怎样千差万别, 总有某些共同的东西. 例如, 句素间的层次结构及其直接联系关系就具有很强的普遍性. 正是这些语言共性才使翻译成为可能, 从而它成为语言转换的基础. 句素与句素之间的逻辑语义联系, 也是重要的语言共性之一.[6] 逻辑语义的确定, 将大大有助于生成地道的目标语. 在CDC体系中, 结构关系一项基本上是传统语法中句法成分的继承, 反映的是句子表层结构的关系(主谓宾定状补等). 看来, 有必要扩充CDC, 再加一个逻辑语义元:

<逻辑语义信息>::= { Ag, Sb, Ob, Vb, Pl, Tl, Mn, Pp, Rs, Fr, Rg, Dg, Tm, Pr, Cl, Fn, Ms, Pm, Cd, Nb, Pt, Mt, Ps, Tg, Cs, Ex, Dt, Ct, Cn, Cc, Cp, Tw, Xx }

Ag=施事(Agent), Sb=主体(Subject), Ob=受事(Object), Vb=行为(Verb), Pl=地点(Place),
Tl=工具(Tool), Mn=方式(Manner), Pp=目的(Purpose), Rs=结果(Result),
Fr=频率(Frequency), Rg=范围(Range), Dg=程度(degree), Tm=时点(Time),
Pr=时段(Period), Cl=颜色(Colour), Fn=功能(Function), Ms=尺寸(Measurement),
Pm=后饰(Post-modifier), Cd=条件(Condition) , Nb=数量(Number),
Pt=属性(Property), Mt=质料(Material), Ps=领属(Possession), Tg=对象(Target),
Cs=原因(Cause), Ex=说明(Explanation), Dt=限定(Determiner),
Ct=环境(Circumstance), Cn=内容(Content), Cc=让步(Concession),
Cp=比较(Comparison), Tw=同位, Xx=非语义(或不定语义)

[注] Xx是所有无法确定, 或没有必要确定的成分的逻辑语义. 机器翻译跟自然语言理解不同, 并不一味要求分析得越具体越透彻越好. 机器翻译过程中的中间信息究竟要深入到怎样的程度, 应根据充分必要的原则来决定. 少则影响效果(质量), 多则白费功夫.

_____________________________________________________________

附注: [2] 关于中介成分体系, 参见:

刘涌泉, 刘倬, 高祖舜 <<俄汉机器翻译规则系统新旧方案比较>> ( <<中国语文>> 1962.2 )

刘涌泉 <<外汉机器翻译中的中介成分体系>> ( <<中国语文>> 1982.2 )

刘  倬 <<三次机器翻译试验>> ( 第一次机器翻译学术会议论文, 1980.9 )

[3] 关于有向直接联系理论, 参见:

刘涌泉, 刘倬, 高祖舜 <<俄汉机器翻译规则系统新旧方案比较>> (同上)

刘涌泉, 刘倬, 高祖舜 <<机器翻译中的词序问题>> ( <<中国语文>> 1965.3 )

并请参阅 <<特斯尼埃的 <结构句法基础> 简介>> ( 张烈材, <<国外语言学>> 1985.2 )

[4] 参见: 刘涌泉 <<词>> ( 1984年机器翻译及自然语言处理学术讨论会论文, 1984.9 )

[5] 参见: 冯志伟 <<当前机器翻译的一些新特点>> ( <<情报学刊>> 1982. Vol 1 No.2 )

[6] 参见: 董振东 <<逻辑语义及其在机译中的应用>> ( <<中国的机器翻译>> pp.25-45 )

 

 

 

 


 

 

立委硕士论文:世界语: 语言学特点及其研究价值 (2)

世界语到汉语和英语的自动翻译试验
-- EChA机器翻译系统概述

世界语: 语言学特点及其研究价值

在进入EChA系统的细节和探讨机器翻译的一般理论和方法之前, 我们专列这一节讨论世界语本身, 这对说明本系统的设计思想和具体方法是很必要的. 毫无疑问, 我们的讨论主要是从语言学角度着眼.

世界语(Esperanto)是波兰的语言大师柴门霍夫博士( L.L.Zamenhof 1859.12.15 - 1917.4.14 )于1887年在印欧语系的基础上经过艰苦研究提出的一个人造语方案. 由于其科学, 简明, 逻辑性强, 由于日益增长的克服语言障碍的国际需要, 也由于其维护世界和平, 增进各民族相互了解, 实现世界大同的崇高理想的感召, 它逐渐为人们所接受. 目前, 世界上有2000多万人在学习和使用世界语. 世界语早已脱尽了人造的斧痕, 走上了自然发展的道路. 它不但能写也能说, 不但适于表达精密的科学思想, 而且在文学上也取得了令人赞叹的成就. 从莱勃尼茨的万国通用文字的设想开始, 先后提出的人造语方案达150多种, 唯有世界语经受住各种考验生存下来了. 现在, 越来越多的人认识到世界语作为国际辅助语的独特价值. 有些国际性学术会议(如控制论大会)已经采用世界语作为工作语言.

世界语中除数量有限的虚词外, 其他词都有非常规则的形态变化, 借以表现该词的词性, 格, 数, 时态, 语态, 语式, 分词形式等语法信息. 另外还有一整套前缀后缀, 用以表现词汇意义上的细微差别和修辞色彩. 世界语是典型的黏着语, 词尾和语缀的意义单一, 可以叠加. 这套词尾和语缀设计得非常巧妙, 规则, 特别容易掌握, 而且也非常适合机器的递归加工.

(EChA的削尾算法就体现了这种递归加工的优点, 见本文第5节.) 世界语没有语法同形词, 句法关系一目了然, 这不论对人还是对机器的识辨, 都是一个极为有利的条件(民族语机器翻译中同形判别的问题在这儿根本不存在了). 同时, 世界语的词类转换也特别灵活, 只要逻辑上说得过去, 不致引起误解, 同一个词干可以根据句法需要, 通过词尾变化随意改变词性. (我国古汉语词类活用也比较自由, 在一定程度上具有类似的灵活性, 可惜这种活用没有明确的形态标志, 常常要靠逻辑语义的分析才能确定.)

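Esperanto does not have many endings, but the set is complete and can rival that of morphologically rich languages, which is truly remarkable. Take case: Esperanto has only two, the common case (zero form) and the accusative (ending -N), yet because it neatly unifies the uses of part of speech and case, and has prepositions as an analytic fallback, it is as flexible and free in expression as a morphologically rich language. Russian, one of the most richly inflected modern languages, has six cases. Roughly speaking, its first case (nominative) corresponds to the Esperanto common case; the second (genitive) to the Esperanto adjective, which we may provisionally call the adjectival case (ending -A); the third (dative) has no inflectional counterpart in Esperanto and is generally rendered with the preposition AL; the fourth (accusative) corresponds to the Esperanto accusative; the fifth (instrumental) corresponds to the Esperanto adverb, which we may likewise call the adverbial case. The sixth is the prepositional case, used with prepositions such as O, Ha, B, and does not itself express any particular semantic relation. Interestingly, an Esperanto preposition may be followed by either the common case or the accusative, the former expressing a static sense and the latter a dynamic one (direction). Compared with the similar usage in Russian, the conciseness and completeness of Esperanto are evident.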

世界语基本语法规则共16条, 原则上没有例外.[0] 由此人们也许会推断这门语言很简陋, 刻板, 缺乏表现力. 这是一个极大的误解. 这里涉及世界语的另一个非常突出的语言学特点, 就是它兼有分析性语言和综合性语言的要素(虚词和形态都比较丰富), 同一种语义既可以用分析形式(借助于虚词), 又可以用综合形式(借助于屈折变化)来表示----当然, 这两种形式并不等同, 它们体现了不同的风格. 由于这一特点, 世界语兼容性强, 文体多样, 特别灵活, 富于弹性和表现力. 如果作为目标语, 它最能维妙维肖地模仿原文的语言特色. 它既可以反映语序自由, 文体柔美的斯拉夫风格, 又可以表现形态缺乏的语言(如汉语和英语)的单纯, 严谨, 密集的特点. 下面我们举几个例子来看一下分析形式和综合形式在世界语中的兼容并存情况:

In each pair below the analytic form is given first and the synthetic form second.

  1. Tense:
     analytic:  Mi ESTAS skrib-ANTA.
     synthetic: Mi skrib-AS. / Mi skrib-ANTAS.
     I AM writ-ING. 我 "在" 写字.

  2. Voice:
     analytic:  Ghi ESTAS limig-ITA.
     synthetic: Ghi limig-ITAS. / Ghi lim-IGHAS.
     It IS limit-ED. 它 "被" 限定了.

  3. Lexical meaning:
     analytic:  Tio estas MALGRANDA (ETA) sekreto.
     synthetic: Tio estas sekret-ETO.
     That is a LITTLE secret. 那是 "小" 秘密.

  4. Preposition versus adverb (adverbial case):
     analytic:  Li parolas EN (PER) Esperanto.
     synthetic: Li parolas esperant-E. / Li parolas Esperant-ON.
     He speaks IN Esperanto. / He speaks Esperanto.
     他 "用" 世界语说话. / 他说世界语.

  5. Preposition versus case (accusative):
     analytic:  Shi parolis POR 30 minutoj.
     synthetic: Shi parolis 30 minut-OJN.
     She spoke FOR 30 minutes. 她说了30分钟.

  6. Conversion from the analytic to the synthetic form:
     LAU kutimo ... LAU-kutim-E ... kutim-E

这种分析形式和综合形式并存的情形在世界语中极其普遍, 这一点跟民族语不一样. 虽然没有绝对不用分析形式的综合性语言, 也没有绝对不用综合形式的分析性语言, 但是, 每一个具体的民族语言总是以一种形式为主, 而且在多数场合总是一种形式排斥另一种形式, 一般不允许并存.

总之, 跟人们通常想象的正相反, 世界语是高度灵活的, 表达方式极其多样, 且能互相转换. 这种高度灵活性正好适应了人类思维模糊性的特点. 灵活性与规则性的高度统一, 这就是世界语的真正奇迹.

人造语言的规则性容易为人理解. 关于灵活性, 再补充几点. 由于篇幅关系, 我们不打算展开, 必要时辅以一两句例证.

  1. In Esperanto the boundary between transitive and intransitive verbs is blurred.

Mi IRAS. / IRU vian propran VOJON!

I GO. / GO your own WAY!                 我行走. / 走你自己的路!

La tuta homaro PAROLOS nur unu LINGVON.
/ Mi PAROLAS esperante (en Esperanto, per Esperanto).

The whole mankind will SPEAK only one LANGUAGE.
/ I SPEAK in Esperanto.

全人类将说仅仅一种语言. / 我用世界语说话.

  2. The boundary between the direct object (the so-called accusative) and the indirect object (the so-called dative) is blurred.

informi ION al IU / informi IUN pri IO

tell sth. to sb. / tell sb. about sth.   向某人告诉某事 / 告诉某人关于某事

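  3. The boundary between object and adverbial is blurred. Esperanto grammar stipulates that the accusative (the usual object case) can also express certain adverbial meanings (see rules 13 and 14 of the basic grammar).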

Mi invitas vin VOJAGHI kun mi PEKINON.

I invite you to TRAVEL with me TO PEKING. 我邀请你和我一起 "旅游北京".

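  4. The boundary between affix and root is blurred, and with it the boundary between derived words and compounds. The boundary between function words and content words is blurred as well.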

sekret-ET-o / ET-a sekreto       JES, / mi JES-as vian opinion.

little secret  小秘密            Yes, I agree with you. 是的, 我同意你的意见.

ANTAU-vidi / Sinjorinoj ANTAU-as.  Kred-IND-a
/ ne-IND-a , IND-igi , sen-IND-ulo

foresee / Ladies first.            believ-able
/ not worthy, make worthy, good-for-nothing

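  5. The all-purpose preposition JE. When expressing a thought, people are often aware only that a dependent element modifies its head in some vague way, without being able, or needing, to say exactly what the semantic relation is. To accommodate this vagueness of human thought, Zamenhof introduced the preposition JE. This was a very insightful creation. (Such a vague relation can also be expressed by the inflectional accusative or by the adverb (adverbial case); see rule 14 of the basic grammar.)
  6. The unified use of part of speech and case. Both are dynamic syntactic features that are fixed by the ending only once the word enters a sentence, both can express fairly abstract semantic relations, and they complement each other. (This differs from the analytic prepositional phrase: apart from JE above, prepositions generally express more concrete and definite semantic relations.)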

Mi skribas plum-E.
CF: (俄)                       (五格)

  7. Extremely flexible conversion between parts of speech.

La FLOR-OJ FLOR-AS.     Li KANT-AS italan popolan KANT-ON.
Mi estas GHOJ-A. Mi GHOJ-AS.

The flowers blossom.    He sings an Italian folk song.
I am glad.

  8. Free word order.

Mi amas vin. (106) / Mi vin amas. / Vin amas mi. (108)
/ Vin mi amas. (111) / Amas mi vin.  / Amas vin mi.

I love you. 我爱你.

  9. Flexible word formation. Derivation: a rich set of affixes with agglutinative behavior; compounding: free combination of root with root.

Shi rid-AS. Shi rid-ETAS. Shi estas rid-EMA.
Shi estas rid-EMULO. Shi estas rid-EMULINO ( rid-EMINO ).
Shi estas rid-EMULINETO ( rid-EMINETO ).......

她笑.      她微笑.        她爱笑.
她是爱笑的人.        她是爱笑的女人.
她是爱笑的小女孩儿 .......

INTER-lingvo   中间语言

fonto-lingvo         celo-lingvo       ponto-lingvo
naci-lingvo       internaci-lingvo

源语                 目标语            媒介语(桥梁语言)
民族语             国际语

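  10. A complete tense and voice system and an ingenious table of correlatives. These are two superb creations of Esperanto. They are so well made, with such logical force and beauty, that every Esperantist experiences this beauty much as a chemist admires the periodic table, and takes pride in it. With the single auxiliary verb ESTI, Esperanto can express every compound tense and voice. The conciseness and richness of the meanings expressed by the correlative table are unmatched.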

世界语的这些特点给人们的自由创造留下了很大的余地, 为人们充分发挥自己的语言才能提供了最好的条件. 这种灵活性并不影响作为世界语基础的16条基本法则的不可动摇的严格性. 在这儿, 自由和约束达到了完美的统一. 在世界语国里, 每个人都在不同程度上是创造者, 每一个世界语者都体验到这种创造的乐趣. 人们再也不是习惯的奴隶了.

然而, 不能不承认, 世界语的灵活和自由给机器的自动处理带来了一定的困难. 我们在研制EChA系统的过程中, 深深感到, 与民族语相比, 以世界语为源语的机器翻译虽然有其容易的一面, 也有其特有的难处, 总之要比我们预料的要复杂得多. 容易来自其高度规则性, 困难则源于其高度灵活性.

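As the only constructed language in actual use, Esperanto naturally has a research value of its own. Comparing it with ethnic languages yields many useful insights. Because of its unique position, studying Esperanto may benefit research on the relations between thought and language, nation and language, society and language, the individual and language, belief and language, and so on, as well as inquiry into language universals, the nature of language, the future of language (the language of a future society), linguistic form and content, language typology, and language teaching. Moreover, the development of Esperanto itself needs scientific study and summary by linguists; this benefits the healthy development of the language, helps build a theoretical framework for Esperanto linguistics, and also enriches general linguistic theory. Theoretical study of Esperanto by linguists began long ago but is still far from sufficient.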

对于机器翻译工作者, 世界语还有一层特殊的意义, 就是世界语作为民族语间机器翻译的媒介语的价值.[1] 这可以从两方面看: 1) 按照机器特点对世界语作必要改造, 定义一个作为媒介语的世界语子集, 再辅以一套高度形式化的成分体系. 这个设想我们在第一届中国世界语大会上提过. 我们也确实设计过一个以世界语作为媒介语的英汉机器翻译规则系统. 虽然由于时间等原因没有能上机试验, 但我们相信该方案是可行的, 也是值得尝试的. 拿世界语或其子集作媒介语, 尽管还远远不是最理想, 但如果研制的是印欧语系间多语言自动翻译, 或者是以这些语言为源语的多对一系统(如英/法/德/俄--汉系统), 相信会带来很多方便. 2) 虽然不直接采用世界语作媒介语, 但在设计机译媒介语时, 认真吸取世界语的优点, 可以少走弯路.

_______________________________________________________________________

附注: [0] 为便于查对, 这里把世界语16条基本法规转抄如下:

(1) 不存在不定冠词, 只存在定冠词 (LA), 其性数格不变.

(2) 名词词尾为 "-O", 复数形式加词尾 "-J". 只存在两个格: 普通格和目的格; 后者由普通格加词尾 "-N" 构成.

(3) 形容词以 "-A" 收尾, 其格数与名词同. 比较级用PLI和连词OL, 最高级用PLEJ.

(4) 基数词(没有词尾变化)是: UNU 1, DU 2, TRI 3, KVAR 4, KVIN 5, SES 6, SEP 7, OK 8, NAU 9, DEK 10, CENT 100, MIL 1000. 几十和几百由数词简单合并而成. 序数词加形容词词尾; 倍数加后缀 "-OBL-", 分数加 "-ON-", 集合数词加 "-OP-", 分配意义用介词 PO. 此外, 数词也可以有名词和副词形式.

(5) Personal pronouns: MI, VI, LI, SHI, GHI (for things or animals), NI, VI, ILI. Possessive forms are made by adding the adjective ending. Number and case are marked as for nouns.

(6) 动词没有人称和数的变化. 动词的各种形式: 现在时用词尾 "-AS"; 过去时 "-IS"; 将来时 "-OS"; 假定式 "-US"; 命令式 "-U"; 不定式 "-I". 分词(有形容词和副词的意义): 主动现在式 "-ANT-"; 主动过去式 "-INT-"; 主动将来式 "-ONT-"; 被动现在式 "-AT-"; 被动过去式 "-IT-"; 被动将来式 "-OT-". 被动语态的各种形式, 都借助于ESTI的相应形式和所需要的动词的被动分词构成; 被动式所用的介词是DE.

(7) 副词以 "-E" 收尾; 各比较等级与形容词同.

(8) 所有介词都要求普通格.

(9) 每个词读写一致.

(10) 单词重音永远在倒数第二个音节上.

(11) 合成词由词与词简单合并而成(主要的词放在后面); 语法词尾也被看作独立的词.

(12) 有其他否定词的时候, 就不再用 NE.

(13) 为了表示方向, 单词加目的格词尾.

(14) 每个介词都有确定不变的意义. 但是如果我们需要用一个介词, 而从意义上看不出应该用哪一个, 这时我们就用没有独立意义的介词JE. 介词JE也可以用没有介词的目的格来代替.

(15) 所谓外来词, 即大多数语言取自同一来源的词, 在世界语里不加变化地应用, 只需照世界语拼写法书写; 但如果一个词根派生几个不同的词时, 最好只不加变化地采用那个基本词, 并由此按照世界语的规则构造出其他的词来.

(16) 名词和冠词末尾的元音字母可以省略, 用省略号 ' 来代替.

 

[1] 请参看 <<巴贝尔通天塔必将建成>> (刘涌泉 李维, 中国第一届世界语大会论文. 其中第四节专门讨论了世界语作为机译媒介语的优点, 缺点, 可能和前景.)

 

 


 

硕士论文:世界语到汉语和英语的自动翻译试验(1)

世界语到汉语和英语的自动翻译试验
-- EChA机器翻译系统概述

本文是我在导师刘涌泉和刘倬先生指导下所做的毕业设计的论文总结. 共分十大部分:
1. EChA概况: 系统流程图; 2. 世界语: 语言学特点及其研究价值; 3. 层次递归成分体系CDC: 体现独立分析结果的EChA中间语言; 4. EChA机器词典, 句子加工场格式; 5. 世界语形态分析: 削尾算法, 关于削缀问题的讨论; 6. 句法分析第一线: 虚词处理, 规则和规则分开的讨论; 7. 句法分析第二线: CDC的求解, 中间结果分析; 8. 英语形态生成, 汉语形态修辞, 原语和译语对比差异的一般总结, 多义区分例释; 9. 调序: 自底而上加工; 10. EChA试验结果分析, 汉语和英语的机译文的比较, 关于文学作品可不可以跟机器翻译结合的问题, 修辞的讨论。

Contents

  1. Overview of EChA
  2. Esperanto: linguistic features and research value
  3. The hierarchical recursive constituent system
  4. EChA machine dictionaries
  5. Esperanto morphological analysis
  6. Esperanto syntactic analysis (1)
  7. Esperanto syntactic analysis (2)
  8. English morphological generation
  9. Target-language reordering
  10. Analysis of the EChA test results

[Acknowledgements]

[References]

[Appendix 1] EChA test results

[Appendix 2] Esperanto abstract

EChA概况

EChA (E-Ch/A: el Esperanto en la Chinan kaj Anglan Lingvojn) 系统是以世界语作为源语, 以汉语和英语作为目标语的一对多小型实验系统. 它是一个句对句的, 分析和综合有一定独立性的全文机器翻译系统. 本系统实现了翻译过程的完全自动化,不需要译前和译后编辑. (由于纯技术原因, 世界语中的几个戴帽字母暂时还需要用加 H 的复合字母来转写.) EChA系统从上机调试到打出译文只用了五个月, 全部工作历时近一年, 进展比较顺利. 本系统使用的是IBM-PC/XT微型机, 编程语言 BASIC (Version D2.00), 同时选用IBM公司的BASIC编译程序软件包. EChA由CCDOS操作系统(即带有汉字库的PC DOS 2.10)支持. 系统主体是六线分析和综合程序. 另外还建立了三部词典, 两个词表, 编制了词典的造查, 扩充和维护程序. 整个系统由近一万条BASIC语句构成. 编程时充分利用了BASIC串处理函数, 显得特别方便.

这次试验共翻译了150多句世界语文句. 汉语和英语的机器译文都通顺或可懂, 结果令人满意. (见附录) 提供本系统试验的源语素材有三部分: 第一部分是选自著名世界语作家Sandor Szhatmari的世界语原文著作 "Mashinmondo" (<<机器世界>>, 中国展望出版社)上的两段连续文章(12句, P.100-101), 句子比较长, 结构也比较复杂. 第二部分选自魏原枢和徐文琪编著的 <<世界语语法>> (上海外语教育出版社, 1982.10)中的典型例句(100多句), 这些例句(其中有一部分是日常用语)都具有一定的语言学特点, 表现了不同时态(简单时态,复合时态), 语态(主动语态, 被动语态), 语式(陈述语式, 命令语式, 假定语式),不同的句式(简单句, 并列句, 复合句, 无主句, 独词句, 一般疑问句, 特殊疑问句, 等等),不同的句型以及动词的各种形式. 总之, 它们具有相当的代表性, 基本上反映了世界语语法概貌, 这就弥补了连续文句特点单一的不足, 更有利于试验EChA系统的能力和适应性. 最后作为一种尝试,还选译了两首世界语诗歌(第一首是著名的世界语者的颂歌"希望之歌").

EChA由三大部分组成: 1) 机器词典; 2) 源语分析; 3) 目标语生成. 源语分析部分包括了世界语的全部基本语法和常用句型. 然而, 由于机器条件和实验周期的限制, 本系统的规模(特别是词典的规模)还很小, 有待于进一步扩充和改进. ----准备从两方面来扩充EChA系统, 一是补充例句, 做扩大试验; 二是增加俄语和法语作为新的目标语, 进一步检验体现独立分析结果的中间语言CDC(层次递归成分体系, 第3节详述)的适应范围, 并探讨其完善的途径. 另外, 时间仓促给系统还带来一些问题:  EChA的结构还不是很合理, 算法有待于进一步优化, 规则和算法还没能分开, 在分析和综合的独立性上下了不少功夫, 但还没有完全独立.

尽管还有上述问题, 然而按照设计要求, 只要适当扩充词典, 系统就有能力处理世界语的绝大多数语言现象. 在中国近三十年的机器翻译研究历史中, EChA是第一个以世界语为研究对象的机译系统. 在世界语跟机器翻译结合的过程中, EChA是一个成功的尝试和良好的开端. 我们热切希望得到专家学者, 世界语同志们的帮助和指导.

EChA system flowchart

Source text input
   |
   v
Dictionary line (morphological analysis)
   1. Ending stripping; dictionary lookup (content-word dictionary, function-word dictionary, idiom dictionary, POS/sense distinction table)
   |
   v
Syntactic analysis
   2. Conjunctions and punctuation; segmentation; other function words
   3. Solving for the intermediate language CDC
   |
   v
Target-language generation
   4. Polysemy disambiguation; English morphological generation and Chinese morphological rhetoric; lookup in the English irregular-word table
   5. English reordering
   6. Chinese reordering and other rhetorical processing
   |
   v
Translation output

源语文句输入以后, 作第一遍扫描. 首先判定加工词长度是否大于三. 若大于三, 转子程序削尾后查实词词干词典, 否则查虚词词典. 因为世界语虚词(无词尾变化)大多短小, 以三为界限最合理, 可以大大减少虚查次数. 词典查不着的作生词处理, 削尾信息保留. 查完词典及词表以后, 把削尾信息和词典信息移到计算机内存中所开辟的句子加工场.
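The first-pass dispatch described above (words longer than three letters go through ending stripping and the content-word dictionary, shorter ones go straight to the function-word dictionary) might look like this in outline. The names `first_pass` and `toy_strip`, the tuple layout of the result and the toy dictionaries are assumptions for illustration; `toy_strip` is a deliberately tiny stand-in for the full ending-stripping routine.

```python
def toy_strip(word):
    """Stand-in stripper: only removes a final -AS or -O and reports which one."""
    for ending in ("AS", "O"):
        if word.endswith(ending):
            return word[:-len(ending)], ending
    return word, None

def first_pass(word, content_dict, function_dict, strip=toy_strip):
    """Return (kind, key, dictionary info, stripped ending) for one word."""
    w = word.upper()
    if len(w) > 3:                                   # long word: strip, then content dictionary
        stem, ending = strip(w)
        if stem in content_dict:
            return ("content", stem, content_dict[stem], ending)
        return ("unknown", stem, None, ending)       # unknown word keeps its ending information
    if w in function_dict:                           # short word: function-word dictionary
        return ("function", w, function_dict[w], None)
    return ("unknown", w, None, None)
```

The length threshold is what saves wasted lookups: short, uninflected function words never enter the stripping routine at all.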

句法分析确定源语文句的层次结构和句法关系. 分析结果以一种高度形式化的层次递归成分体系CDC来体现. CDC是独立于目标语的机器翻译中间语言, 这种独立性对于一对多机译系统是必要的. CDC由形态, 成分, 节点, 分布, 链号和层次几部分信息构成. 它不但揭示了源语文句的正确的句法树, 而且还包含了其它的有用的信息. 事实上, 它为建立多目标语的生成系统奠定了良好的基础.

句法分析第一线处理虚词, 中心任务是加工连词和标点, 正确切分语段. 原则上为每一个虚词编制一套分析规则. 世界语虚词数量很有限, 但用法较多, 具有民族语功能词的类似的复杂性, 是语言个性的集中表现, 所以分别加工比较适宜, 这也有利于规则跟规则分开. 该线加工任务很重, 特别是连词KAJ和KE, 分析规则十分复杂. 在很大程度上, 虚词分析对了, 句法关系也就清楚了. 因此, 集中力量编制一套完备的针对具体虚词的分析系统, 对于世界语类型的机器翻译至关重要. 该线正确处理了虚词个性现象, 便可以保证下一线分析的充分抽象性和概括性, 这样做对于象世界语这样的科学而规则的语言显得特别有利. 句法分析第二线运用自顶而下的方法, 从句子的谓语轴心(第一层)着手, 一层一层往下递归加工, 直到最末层(终结节点层). 加工过程就是不断递归调用各子程序的过程. 其中以动词子程序为核心, 它充分反映了世界语语法的基本内容及其高度规则性. 分析完毕得出一条对应于源语文句的中间语言CDC的链.

综合第一线做英语形态生成和汉语形态修辞. 英语形态并不发达, 所以世英的形态转换规则也不复杂. 汉语缺乏形态, 一般用适当的虚词(助词, 副词等)来代替. 我们把多义词区分规则也放在这一线, 这是因为多义区分的条件至此已经具备. 一般来说, 根据多义词及其联系词的CDC成分和语义特征就可以得出该词的正确义项. 综合第二线和第三线分别做英语调序和汉语调序. 调序信息由CDC结合目标语语法规律得出, 调序的方法是自底而上, 层层归约, 这样就不至于调乱. 我们知道, 世界语语序极为灵活自由, 而汉语语序却很固定, 所以生成汉语的主要任务是调序. 对于英语, 调序的任务较轻, 主要是保证文句主干 "主谓宾" 次序不乱. 英语名词没有主宾格的区分, 所以关键是把前置宾语移到动词之后. "世界语是印欧语系的一个合理化的公分母", 与英语相似处毕竟很多, 比如同一句法层次的定语或状语的内部调序, 在译汉语时是一个难题, 而在印欧系诸语言中则不是大问题. 另外修辞加工的过程也可以免了. (世英转换中的成语和多义现象较之世汉转换也少得多.) 总之, 英语生成比汉语生成容易许多.
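The bottom-up reordering for English can be sketched as a recursive linearization over the dependency links: each subtree is ordered first and then placed relative to its head, with subjects before the verb and objects after it. This is a simplified illustration, not EChA's procedure; `linearize` is a hypothetical name, relation codes follow the CDC letters (S subject, O object, D attribute, W predicate), and the sentence "La hundon vidas mi" is an invented example with a preposed object.

```python
def linearize(pos, words, children, relation):
    """Return the English word order for the subtree headed by position `pos`."""
    before, after = [], []
    for child in sorted(children.get(pos, [])):
        sub = linearize(child, words, children, relation)   # lower layers are ordered first
        # Subjects go before the head, objects after it;
        # any other dependent keeps the side it had in the source sentence.
        if relation[child] == "S" or (relation[child] != "O" and child < pos):
            before += sub
        else:
            after += sub
    return before + [words[pos]] + after

# "La hundon vidas mi": 1 la (D, head 2), 2 hundon (O, head 3), 3 vidas (W), 4 mi (S, head 3)
words = {1: "the", 2: "dog", 3: "see", 4: "I"}
children = {3: [2, 4], 2: [1]}
relation = {1: "D", 2: "O", 3: "W", 4: "S"}
english = linearize(3, words, children, relation)
```

Because each subtree is returned as a finished block, moving the object never scrambles the words inside it, which is the property the text attributes to reducing layer by layer.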

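Although EChA is a small system, it is fairly rich in content. It has morphological analysis and morphological generation, reordering and rhetorical processing, and its own constituent system. In the overall design we already considered the need to extend the system with new target languages of different types. It can be expected that, by adding two generation lines for Russian and French (mainly morphological generation) and slightly modifying the analysis part (mainly fleshing out the function-word analysis rules that are not yet fully independent of synthesis), automatic translation from Esperanto into Chinese, English, French and Russian could be achieved. In short, EChA touches on almost every problem a practical MT system may encounter, and each of the six main lines of the program has its own character; it is a fairly representative model of a one-to-many, fully automatic MT system.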

 


PhD Thesis: Chapter VII Concluding Remarks

This chapter summarizes the research conducted in this dissertation, including its contributions as well as its limitations.

7.0. Summary

The goal of this dissertation is to explore effective ways of formally approaching Chinese morpho-syntactic interface in a phrase structure grammar.  This research has led to the following results:  (i) the design of a Chinese grammar, namely CPSG95, which enables flexible coordination and interaction of morphology and syntax;  (ii) the solutions proposed in CPSG95 to a series of long-standing problems at the Chinese morpho-syntactic interface.

CPSG95 was designed in the general framework of HPSG (Pollard and Sag 1987, 1994).  The sign-based mono-stratal design of HPSG has the advantage of accommodating, and giving access to, information from the different components of a grammar.  One crucial feature of CPSG95 is its introduction of morphological expectation feature structures and the corresponding morphological PS rules into HPSG.  As a result, CPSG95 has been demonstrated to provide a favorable environment for solving morpho-syntactic interface problems.

Three types of morpho-syntactic interface problems have been studied extensively: (i) the segmentation ambiguity in Chinese word identification;  (ii) Chinese separable verbs, a borderline problem between compounding and syntax; and (iii) borderline phenomena between derivation morphology and syntax.

In the context of the CPSG95 design, the segmentation ambiguity is no longer a problem, as morphology and syntax are designed system-internally in the grammar to support morpho-syntactic parsing based on non-deterministic tokenization (W. Li 1997, 2000).  In other words, the design of CPSG95 itself entails an adequate solution to this long-standing problem, a problem which has been a central topic in Chinese NLP for the last two decades.  This is made possible because a full grammar, including both morphology and syntax, is accessible in the integrated process of Chinese parsing and word identification, while traditional word segmenters can at best access partial grammar knowledge.[1]

The second problem involves an interesting case between compounding and syntax:  different types of Chinese separable verbs demonstrate various degrees of separability in syntax while all these verbs, when used contiguously, are part of the Chinese verb vocabulary.  For each type of separable verb, arguments were presented for the proposed linguistic analysis, and a solution to the problem was then formulated in CPSG95 based on the analysis.  All the proposed solutions provide a way of capturing the link between the separated use and the contiguous use of separable verbs.  They are shown to be better solutions than previous approaches in the literature, which either cannot link the separated use and the contiguous use in the analysis or are not formalized.

The third problem, at the interface of derivation and syntax, involves two issues: (i) a considerable amount of ‘quasi-affix’ data, and (ii) the intriguing case of zhe-suffixation, which demonstrates an unusual combination of a phrase with a bound morpheme.  A generic analysis of Chinese derivation has been proposed in CPSG95.  This analysis has been demonstrated to be effective in handling both quasi-affixation and zhe-suffixation.

7.1. Contributions

The specific contributions are reflected in the study of the following five topics, each constituting a chapter.

On the topic of the Role of Grammar, the investigation leads to the central argument that knowledge from both morphology and syntax is required to properly handle the major types of morpho-syntactic interface problems.  This establishes the foundation for the general design of CPSG95 as consisting of morphology and syntax in one grammar formalism.

An in-depth study has been conducted in the area of the segmentation ambiguity in Chinese word identification.  The most important discovery from the study is that the disambiguation involves the analysis of the entire input string.  This means that the availability of a grammar is key to the solution of this problem.  A natural solution to this problem is the use of grammatical analysis to resolve, and/or prepare the basis for resolving, the segmentation ambiguity.

On the topic of the Design of CPSG95, a mono-stratal Chinese phrase structure grammar has been established in the spirit of the HPSG theory.  Components of a grammar such as morphology, syntax and semantics are all accommodated in distinct features of a sign.  CPSG95 is designed to provide a framework and means for formalizing the analysis of the linguistic problems at the morpho-syntactic interface.

The essential part of this work is the design of expectation feature structures.  Expectation feature structures are generalized from the HPSG feature structures for syntactic subcategorization and modification.  One characteristic of the CPSG95 structural expectation is the design of morphological expectation features to incorporate Chinese productive derivation, which covers a wide range of linguistic phenomena in Chinese word formation.

In order to meet the requirements induced by introducing morphology into the general grammar and by accommodating linguistic characteristics of Chinese, modifications from the standard HPSG are proposed in CPSG95.  The rationale and arguments for these modifications have been presented.  The design of CPSG95 is demonstrated to be a successful application of HPSG in the study of Chinese morpho-syntactic phenomena.

On the topic of Defining the Chinese Word, efforts have been made to reach a better understanding of Chinese wordhood in theory, methodology and formalization.

The theoretical inquiry follows the insight from Di Sciullo and Williams (1987) and Lü (1989).  Two notions of word, namely grammar word and vocabulary word, have been examined and distinguished.  While vocabulary word is easy to define once a lexicon is given, the object for linguistic study and generalization is actually grammar word.  Unfortunately, as there is a considerable amount of borderline phenomena between Chinese morphology and syntax, no precise definition of Chinese grammar word has been available across systems.  Therefore, an argument in favor of the system-internal wordhood definition and interface coordination within a grammar has been made.  This leads to a case-by-case approach to the analysis of specific Chinese morpho-syntactic interface problems.

On the other hand, three useful wordhood judgment methods have also been proposed as a complementary means to the case-by-case analysis.  These methods are (i) syntactic process test involving passivization and topicalization; (ii) keyword based judgment patterns for verbs, and (iii) a general expansion test named X-insertion.  These methods are demonstrated to be fairly operational and easy to apply.

In terms of formalization, a system-internal representation of word has been defined in CPSG95 feature structures.  This definition distinguishes a grammar word from both bound morphemes and syntactic constructions.  The formalization effort is necessary for the rigorous study of Chinese morpho-syntactic problems and ensures the implementability of the solutions to these problems as proposed in the dissertation.

On the topic of Chinese Separable Verbs, the task is to coordinate the idiomatic nature of separable verbs and their separated uses in various syntactic patterns.

Since there are different degrees of ‘separability’ for different types of Chinese separable verbs, there is no uniform analysis which can handle all separable verbs properly.  A case-by-case study for each type of separable verb has been conducted.  An essential part of this study is the arguments for the wordhood judgment for each type.  In the light of this judgment, CPSG95 provides formalized analyses of separable verbs which satisfy two criteria:  (i) they all capture both structural and semantic aspects of the constructions at issue; (ii) they all provide a way of capturing the link between the separated use and the contiguous use.

Finally, on the topic of Morpho-syntactic Interface Involving Derivation, a general approach to Chinese derivation has been proposed.  This approach not only enables us to handle quasi-affix phenomena, but is also flexible enough to provide an adequate treatment of the special problem in zhe-suffixation.

In the CPSG95 analysis, the affix serves as head of a derivative and can impose various constraints in the lexicon on its expected stem sign for the morphological expectation.  Coupled with only two PS rules formulated in the general grammar (Prefix PS Rule and Suffix PS Rule), it has been shown that various Chinese affixation phenomena can be captured equally well.  The PS rules ensure that all the lexical constraints be observed before the affix and the stem combine and that the output of derivation be a word.

As for the quasi-affixation problem, based on the observation that there is no fundamental structural difference between quasi-affixation and other affixation, a proper treatment of 'quasi-affixes' can be established in the same way as other affixes are handled in CPSG95; the individual difference in semantics is shown to be capturable in the lexicon.

The study of zhe-suffixation started with arguments for its analysis as VP+-zhe.  This is an unsolvable problem in any system which enforces sequential processing of morphology before syntax.  The solution which CPSG95 offers demonstrates the power of designing derivation morphology and syntax in a mono-stratal grammar.  With this novel design in modeling Chinese grammar, the CPSG95 general approach to derivation readily applies to the tough case of zhe-suffixation.  This is possible because an affix can place any lexicalized constraint, VP in this case, on the expected stem for morphological expectation.  In addition, the proposed lexicalized solution also captures the building of the semantic content for this morpho-syntactic borderline phenomenon.

7.2. Limitations

The major limitations of the work reported in this thesis lie in the following two aspects.

Limited by space, the thesis has presented only sample formulations of typical affixes and quasi-affixes to demonstrate the proposed general approach to Chinese derivation morphology.  As many affixes/quasi-affixes have their own distinctive semantic properties, a reader who wishes to experiment with this proposal in an implementation still has to work out the technical details for each affix.  However, it is believed that the general strategy has been presented in sufficient detail to allow for easy accommodation of individual aspects of an affix which have not been specifically addressed in the thesis.

Because the thesis focuses on a handful of major morpho-syntactic interface problems, the treatment of reduplication and unlisted proper names has not been taken up as a special topic for in-depth exploration.  These are only briefly discussed in Chapter II (Section 2.2) as cases of productive word formation which require syntax when segmentation ambiguity arises at their boundaries.  However, they are also long-standing word identification problems which affect the morpho-syntactic interface when segmentation ambiguity is involved.  In particular, it is felt that the treatment of transliterated foreign names requires further research before a satisfactory solution can be found in the framework of CPSG95.[2]

7.3. Final Notes

This last section is used to place the research reported in this thesis in a larger context.

Chinese NLP has reached a new stage marked by the publication of Guo’s series of papers on Chinese tokenization (Guo 1997a,b,c,d, Guo 1998).  There are signs that the major research focus is shifting from word segmentation to grammar design and development.  In this process, the morpho-syntactic interface will remain a hot topic for quite some time to come.  The work on CPSG95 can be seen as one of the efforts in this direction.

The design of CPSG95, a formal grammar capable of representing both morphology and syntax in a uniform formalism, is one successful application of the modern linguistic theory HPSG in the area of Chinese  morpho-syntactic interface research.  However, this is by no means to claim that CPSG95 is the only or best framework to capture the morpho-syntactic problems.   This is only one approach which has been shown to be feasible and effective.  Other equally good or better approaches may exist.

In terms of future directions, constraints from semantics and discourse should be made available in the grammatical analysis.  In Chapter II (Section 2.4), we have seen problems whose ultimate solutions depend on the access to the semantic or discourse constraints.  It is believed that the sign-based mono-stratal design of CPSG95 will be extensible to accommodate these constraints.  However, this will require years of future research before they can be formally modeled and properly introduced into the grammar.

 

--------------------------

[1] As a matter of fact, the CPSG95 experiment shows that most segmentation ambiguity is resolved automatically as a by-product of morpho-syntactic parsing and the remaining ambiguity is embodied in the multiple syntactic trees as the results of the analysis.

[2] However, in the CPSG95 implementation, the problem of handling Chinese person names, a special case of compounding, has been solved fairly satisfactorily.  The proposal is to use the surname as the head sign to expect the given name (of one or two characters) on its right to form potential full names.  As the right boundary of a person name is difficult to define without the support of sentential analysis, the conventional word segmenter frequently makes wrong segmentations in such cases.  In contrast, the approach implemented in CPSG95 is free from this problem because whether a potential name proposed by the surname ultimately survives as a proper name is decided by whether it contributes to a valid parse for the processed sentence.  In the last few years, there has been rapid progress on proper name identification in the area of information extraction, called named entity tagging (MUC-7 1998; Chen et al 1997).

 

BIBLIOGRAPHY

Bauer, Laurie (1988).  Introducing Linguistic Morphology.  Edinburgh:  Edinburgh University Press.

Bloomfield, Leonard (1933). Language, New York: Henry Holt & Co.

Borsley, Robert (1987).  Subjects and Complements in HPSG.   Technical report no. CSLI-107-87.  Stanford:  Center for the Study of Language and Information.

Carpenter, B. and G. Penn (1994).  ALE, The Attribute Logic Engine, User's Guide.  From http://www.sfs.nphil.uni-tuebingen.de/~gpenn/ale.html (accessed January 30, 2001).

Chao, Yuen-Ren (1968).  A Grammar of Spoken Chinese.  Berkeley:  University of California Press.

Chen, H.-H et al (1997).  Description of the NTU System used for MET-2.  Proceedings of MUC-7.  From http://perso.enst.fr/~monnier/lectures/IE/MUC7/muc_7_toc.html (accessed January 30, 2001).

Chen, K. and S. Liu (1992).  Word Identification for Mandarin Chinese Sentences.  Proceedings of 14th International Conference on Computational Linguistics (COLING’92). Nantes, France, 101-107.

Chen, M.Y. and W. S-Y. Wang (1975).  Sound Change:  Actuation and Implementation.  Language 51:2, 255-281.

Chen, Ping (1994).  “Shilun Hanyu zhong San Zhong Juzi Chengfen yu Yuyi Cheng Fen de Peiwei Yuanze” (On Mapping Principles of Relationship between Chinese Three Syntactic Constituents and Semantic Roles). Zhongguo Yuwen (Chinese Linguistics), No.3.

Chomsky, Noam (1970).  Remarks on Nominalization.  Readings in English Transformational Grammar, eds. by R. Jacobs and P. Rosenbaum, Waltham, Massachusetts:  Ginn and Company, 184-221.

Dai, John Xiang-ling (1993).  Chinese Morphology and its Interface with Syntax.  Ph.D. Dissertation, Ohio State University.

DeFrancis, John (1984).  The Chinese Language: Fact and Fantasy.  Honolulu:  University of Hawaii Press.

Di Sciullo, A.M. and E. Williams (1987).  On The Definition of Word.  The MIT Press, Cambridge, Massachusetts.

Ding, Shengshu (1953). “Hanyu Yufa Jianghua” (Lectures of Chinese Grammar), Zhongguo Yuwen (Chinese Linguistics), No. 3 and No. 4.

Dowty, D. (1982).  More on the Categorial Analysis of Grammatical Relations.  In A. Zaenen (Ed.), Subjects and Other Subjects:  Proceedings of the Harvard Conference on Grammatical Relations.  Bloomington:  Indiana University Linguistics Club.

Feng, Zhiwei (1996).  COLIPS Lecture Series - Chinese Natural Language Processing,  Communications of COLIPS, Vol.6, No.1, Singapore.

Gan, Kok Wee (1995).  Integrating Word Boundary Disambiguation with Sentence Understanding, Ph.D. Dissertation, National University of Singapore.

Gazdar, G., E. Klein, G.K. Pullum, and I.A. Sag (1985).  Generalized Phrase Structure Grammar.  Cambridge: Blackwell, and Cambridge, Mass.:  Harvard University Press.

Guo, Jin (1997a).  Critical tokenization and its properties.  Computational Linguistics, Vol. 23, No.4, 569-596.

Guo, Jin (1997b).  Chinese Language Modeling for Speech Recognition.  Ph.D. dissertation, Institute of Systems Science, National University of Singapore.

Guo, Jin (1997c).  A Comparative Study on Sentence Tokenization Generation Schemes.  In review for journal publication from http://sunzi.iss.nus.sg:1996/guojin/papers/ (accessed March 25, 1999).

Guo, Jin (1998).  One tokenization per source.  Proceedings of the 17th International Conference on Computational Linguistics and 36th Annual Meeting of the Association for Computational Linguistics (COLING-ACL ’98),  Montreal, Canada, 457-463.

He, K., H. Xu and B. Sun (1991).  Design Principles of an Expert System for Automatic Word Segmentation of Written Chinese Texts, Journal of Chinese Information Processing, Vol. 5, No. 2, 1-14.

Hockett, C.F. (1958).  A Course in Modern Linguistics.  New York:  Macmillan.

Hu, F. and L. Wen (1954).  “Ci de fanwei, xingtai, gongneng” (Scope, form and function of word). Zhongguo Yuwen (Chinese Linguistics), August issue.

Jackendoff, Ray (1972). Semantic Interpretation In Generative Grammar, Cambridge, Massachusetts:  MIT Press.

Jensen, John T. (1990).  Morphology:  Word Structure in Generative Grammar.  Amsterdam/Philadelphia:  John Benjamins Publishing Company.

Kathol, Andreas (1999).  Agreement and the Syntax-Morphology Interface in HPSG. In Robert Levine and Georgia Green (eds.) Studies in Current Phrase Structure Grammar. Cambridge University Press, 223-274.

Kolman, B. and R.C. Busby (1987). Discrete Mathematical Structures for Computer Science, 2nd edition. Prentice-Hall, Inc.

Krieger, Hans-Ulrich (1994). Derivation without Lexical Rules,  in C.J Rupp, M. Rosner and R. Johnson (eds), Constraints, Language, and Computation.  Academic Press, 277-313.

Li, C.N. and  S.A. Thompson (1981).  Mandarin Chinese:  A Functional Grammar.  Berkeley:  University of California Press.

Li, Linding (1986).  Xiandai Hanyu Juxing (Sentence Patterns in Contemporary Mandarin), Shangwu Yinshuguan (Commercial Press), Beijing.

Li, Linding (1990).  Xiandai Hanyu Dongci (Verbs in Contemporary Mandarin), Zhongguo Shehui Kexue Chubanshe, Beijing.

Li, Qinghua (1983).  “Tan liheci de tedian he yongfa” (On the characteristics and usages of separable words).  Yuyan Jiaoxue He Yan Jiu (Language Instruction and Research), No.3.

Li, Wei (1996).  Interaction of Syntax and Semantics in Parsing Chinese Transitive Patterns.  Proceedings of International Conference on Chinese Computing (ICCC'96), Singapore.

Li, Wei (1997).  Chart Parsing Chinese Character Strings.  Proceedings of the Ninth North American Conference on Chinese Linguistics (NACCL-9), Victoria, Canada.

Li, Wei (2000). On Chinese parsing without using a separate word segmenter.  Communication of COLIPS 10 (1): 19-68.

Liang, Nanyuan (1987).  CDWS -- A Written Chinese Automatic Word Segmentation System.  Journal of Chinese Information Processing, 1(2): 44-52.

Lieber, R. (1992).  Deconstructing Morphology. Chicago: University of Chicago Press.

Lin, Handa (1983).  “Shime shi ci – xiaoyu ci de bu shi ci” (What is a word – a unit smaller than a word is not a word). Zhongguo Yuwen (Chinese Linguistics), No.34.

Lu, Jianming (1988).  “Mingci-xing ‘laixin’ shi ci haishi cizu” (Nominal laixin: word or word group).  Zhongguo Yuwen (Chinese Linguistics), No. 5.

Lu, Zhiwei (1957).  Hanyu de Goucifa (Chinese Word Formation), Kexue Chubanshe (Science Publishing House).

Lü, Shuxiang. (1946). “Cong Zhuyu, Binyu de Fenbie Tan Guoyu Juzi de Fenxi” (On Sentence Analysis of Mandarin Chinese from the Angle of the Distinction between Subject and Object),  Kaiming Shudian Er Shi Zhounian Jiannian Wenji (Selected Works to Celebrate the 20th Anniversary of Kaiming Bookstore).

Lü, Shuxiang et al (ed.) (1980).  Xiandai Hanyu Babai Ci (800 Words in Contemporary Mandarin), Shangwu Yinshuguan (Commercial Press), Beijing.

Lü, Shuxiang (1989). “Hanyu Yufa Fenxi Wenti” (Issues on Chinese grammatical analysis),  Lü Shuxiang Zixuanji (Self-selected Works of Shuxiang Lü), Shang Hai Jiaoyu Chubanshe (Shanghai Education Publishing House), Shanghai, 93-180.

Lua, Kim Teng (1994).  Application of Information Theory Binding in Word Segmentation. Computer Processing of Chinese and Oriental Languages 8(1): 115-124.

Lyons, John (1968).  Introduction to Theoretical Linguistics.  Cambridge:  Cambridge University Press.

MUC-7 (1998).  Proceedings of the Seventh Message Understanding Conference (MUC-7).  From http://perso.enst.fr/~monnier/lectures/IE/MUC7/muc_7_toc.html (accessed January 30, 2001).

Pollard, C. and I. Sag (1987).  Information based Syntax and Semantics Vol. 1: Fundamentals.  Centre for the Study of Language  and Information, Stanford University, CA.

Pollard, C. and I. Sag (1994).  Head-Driven Phrase Structure Grammar.  The University of Chicago Press.

Riehemann, Susanne (1993). Word Formation in Lexical Type Hierarchies – A Case Study of bar-Adjectives in German. SfS-Report-02-93, University of Tübingen.

Riehemann, Susanne (1998). Type-based derivational morphology.  Journal of Comparative Germanic Linguistics 2. 49-77.

Sapir, Edward (1921).  Language:  An Introduction to the Study of Speech.  New York:  Harcourt, Brace, and World.

Selkirk, E. (1982).  The Syntax of Words.  Cambridge:  MIT Press.

Shi, Youwei (1992).  Huhuan Rouxing – Hanyu Yufa Tanyi (A Call for Flexibility – Peculiarities of Chinese Grammar), Hunan Publishing House.

Shieber, S. (1986).  An Introduction to Unification-Based Approaches to Grammar.  Centre for the Study of Language  and Information, Stanford University, CA.

Sproat, R., C. Shih, V. Gale, and N. Chang (1996).  A Stochastic Finite-State Word-Segmentation Algorithm for Chinese.  Computational Linguistics. Vol. 22, No. 3.

Sun, L. and P. Cole (1991).  The effect of morphology on long-distance reflexives.  Journal of Chinese Linguistics 19:1, 42-62.

Sun, M. and B. T’sou (1995).  Ambiguity resolution in Chinese word segmentation.  Proceedings of the 10th Pacific Asia Conference on Language, Information and Computation (PACLIC-95), Hong Kong, 121-126.

Sun, M. and C. Huang (1996).  Word Segmentation and Part-of-Speech Tagging for Unrestricted Chinese Texts, A Tutorial at the 1996 International Conference on Chinese Computing (ICCC96), Singapore.

Thompson, S.A. (1973).  Resultative Verb Compounds in Mandarin Chinese:  A Case of Lexical Rules. Language 49:2, 361-379.

Wang, Li (1955).  Zhongguo Yufa Lilun (Chinese Grammatical Theory), Zhonghua Shuju, Shanghai.

Wang, Xiaolong (1989).  Automatic Chinese Word Segmentation, in Word Separating and Mutual Translation of Syllable and Character Strings, Ph.D. Dissertation, Dept. of Computer Science and Engineering, Harbin Institute of Technology.

Webster, J. J. and C-Y Kit. (1992).  Tokenization as the Initial Phase in NLP.  Proceedings of the 14th International Conference on Computational Linguistics (COLING-92).  Nantes, France, 1106-1110.

Wu, A. and Z. Jiang (1998).  Word Segmentation in Sentence Analysis.  Proceedings of the 1998 International Conference on Chinese Information Processing.  Beijing, China, 169-180.

Wu, Dekai (1998).  A Position Statement on Chinese Segmentation.  Presented at the Chinese Language Processing Workshop, University of Pennsylvania. (Current draft at http://www.cs.ust.hk/~dekai/papers/segmentation.html, accessed January 30, 2001).

Wu, M. and K. Su (1993).  Corpus-Based Automatic Compound Extraction with Mutual Information and Relative Frequency Count.  Proceedings of R.O.C. Computational Linguistics Conference (ROCLING) VI, Taiwan, 207-216.

Xue, Ping (1991).  Syntactic Dependencies in Chinese and their Theoretical Implications.  Ph.D. dissertation, University of Victoria, Canada.

Yao, T., G. Zhang, and Y. Wu (1990).  A Rule-Based Chinese Automatic Segmentation System.  Journal of Chinese Information Processing 4(1): 37-43.

Yeh, C-L. and H-J. Lee (1991).  Rule-Based Word Identification For Mandarin Chinese Sentences -- A Unification Approach.  Computer Processing of Chinese and Oriental Languages. Vol. 5, No. 2, 97-118.

Yu, Shihong et al (1997).  Description of the Kent Ridge Digital Labs System Used for MUC-7.  Proceedings of MUC-7.  From http://perso.enst.fr/~monnier/lectures/IE/MUC7/muc_7_toc.html (accessed January 30, 2001).

Zhang, J., Z. Chen and S. Chen (1991).  A Method of Word Identification for Chinese by Constraint Satisfaction and Statistical Optimization Techniques.  Proceedings of R.O.C. Computational Linguistics Conference (ROCLING) IV, Taiwan, 147-165.

Zhang, Shoukang (1957).  “Lüetan hanyu goucifa” (A brief discussion on Chinese word formation).  Xiandai Hanyu Cankao Ziliao (Reference for Contemporary Chinese),  ed. by Yushu Hu (1981),  Shanghai:  Shanghai Jiaoyu Chubanshe (Shanghai Education Publishing Company), 241-256.

Zhao, S. and B. Zhang (1996).  “Liheci de queding yu liheci de xingzhi” (Determination and characteristics of separable words).  Yuyan Jiaoxue he Yanjiu (Language Instruction and Research), No.1, 40-51.

Zhu, Dexi (1985).  Yufa Wenda (Questions and Answers on Chinese Grammar).  Shangwu Yinshuguan (Commercial Press), Beijing.

Zwicky, A.M. (1987). Slashes in the Passive.  Linguistics 25, 639-669.

Zwicky, A.M. (1989).  Idioms and Constructions.  Eastern States Conference on Linguistics 5, 547-558.

 

[Related]

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

PhD Thesis: Chapter VII Concluding Remarks

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

6.0. Introduction

This chapter studies some challenging problems of Chinese derivation and its interface with syntax.  These problems have been a challenge to existing word segmenters; they are also long-standing problems for Chinese grammar research.

It is observed that a good number of signs have become more and more like affixes as the Chinese language develops.  Typical, indisputable examples include signs like the nominalizer 性 ‑xing (-ness) and the prefix 第 di- (-th).  While few people doubt the existence of affixes in Contemporary Chinese, there is no general agreement on the exact number of Chinese affixes, due to a considerable number of borderline cases often referred to as ‘quasi-affixes’ (类语缀 lei yu-zhui).[1]  It will be argued that the quasi-affixes belong to morphology and are structurally not different from other affixes.  The major difference between ‘quasi-affixes’ and the few generally honored (‘genuine’) affixes lies mainly in the following aspect.  The former retain some ‘solid’ meaning while the latter are more functionalized.  However, this does not prevent CPSG95 from providing a proper treatment of quasi-affixes in the same way as it handles other affixes.  It will be shown that the difference in semantics between affixes or quasi-affixes can be accommodated fairly easily in the CPSG95 lexicon.

Based on the examination of the common property of Chinese affixes and quasi-affixes, a general approach to Chinese derivation is proposed.  This approach not only enables us to handle quasi-affix phenomena, but is also flexible enough to provide an adequate treatment of a special problem in Chinese derivation, namely zhe-suffixation.  The affix status of 者 -zhe (-er) is generally acknowledged (it is classified as a suffix in authoritative books like Lü et al 1980):  it attaches to a verb sign and produces a word.  The peculiar aspect of this suffix is that the verb stem which it attaches to can be syntactically expanded.  In fact, there is a significant amount of evidence for the argument that this suffix expects a VP as its stem (see 6.5 for evidence).  Since a VP is only formed in syntax and derivation is within the domain of morphology, this phenomenon presents a highly challenging case for how morphology should be interfaced properly to syntax.  The solution which is offered in CPSG95 demonstrates the power of designing morphology and syntax in an integrated grammar formalism.  In contrast, this is an unsolvable problem in any system which enforces sequential processing of derivation morphology before syntax (as most traditional systems do).  There does not seem to be a way of enabling partial output of syntactic analysis (i.e. VP) to feed back to some derivation rule in the preprocessing stage.

In Section 6.1, the general approach to Chinese derivation is proposed first.  Following this proposal, prefixation is illustrated in 6.2 and suffixation in 6.3.  Section 6.4 shows that this general approach to derivation applies equally well to the 'quasi-affix' phenomena.  Section 6.5 investigates the suffixation of -zhe (-er).  The analysis is based on the argument that this suffixation involves the combination VP+-zhe.  The specific solution following the CPSG95 general approach will be presented based on this analysis.

6.1. General Approach to Derivation

This section examines the property of Chinese affixes and proposes a corresponding general approach to Chinese derivation.  This serves as the basis for the specific solutions to be presented in the remaining sections to various problems in Chinese derivation.

It is fairly easy to observe that in Chinese derivation it is the affix which selects the stem, not the other way round.  For example, the suffix 性 -xing (‑ness) expects an adjective to produce an (abstract) noun.   Based on the examination of the behavior of a variety of Chinese affixes or quasi-affixes, the following generalization has been reached.  That is, an affix lexically expects a sign of category x, with possible additional constraints, to form a derived word of category y.   This generalization is believed to capture the common property shared by Chinese affixes/quasi-affixes.  It seems to account for all Chinese derivational data, including typical affixation, quasi-affixation (see 6.4) and the special case of zhe-suffixation (see 6.5).  So far no counter evidence has been found to challenge this generalization.

The observation and the generalization above support the argument that in a grammar which relies on lexicalized expectation feature structures to drive the building of structures, affixes, not the stems, should be the selecting heads of the morphological structures.[2]   Leaving aside the non-productive affixation,[3] the general strategy for Chinese productive derivation is proposed as follows.  In the lexicon, the affix as head of the derivative is encoded with the following derivation information:  (i) what type of stem (constraints) it expects;  (ii) where to look for the expected stem, on its right or left;  (iii) what type of (derived) word it leads to (category, semantics, etc.).  Based on this lexical information, CPSG95 has two PS rules in the general grammar for derivation:  one for prefixation, one for suffixation.[4]  These rules ensure that all the constraints be observed before an affix and a stem are combined.  They also determine that the output of derivation, i.e. the mother sign, be a word.

Along this line, the key to a lexicalized treatment of Chinese derivation is to determine the structural and semantic property of the derivative and to impose proper constraints on the expected stem.  The constraints on the expected stem can be lexically specified in the morphological expectation feature [PREFIXING] or [SUFFIXING] of the affix.  The property (category, syntactic expectation, semantics, etc.) of the derivative can also be encoded directly in the lexical entry of the affix, seen as the head of a derivational structure in the CPSG95 analysis.  This property information, as part of head features, will be percolated up when the derivation rules are applied.
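
The lexical information (i)-(iii) and the two derivation rules can be illustrated with a small sketch.  The Python fragment below is only an approximation under simplifying assumptions:  the names `Sign`, `Affix`, `prefix_rule` and `suffix_rule`, and the use of flat category strings in place of typed feature structures, are hypothetical and are not part of the CPSG95 formalism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sign:
    form: str
    category: str          # flat stand-in for a typed feature structure
    is_word: bool = True


@dataclass(frozen=True)
class Affix:
    form: str
    expects: frozenset     # (i) categories allowed for the expected stem
    side: str              # (ii) "right" for a prefix, "left" for a suffix
    yields: str            # (iii) category of the derived word


def prefix_rule(affix, stem):
    """Combine affix + stem if the affix looks to its right and the stem fits."""
    if affix.side != "right" or stem.category not in affix.expects:
        return None
    # The mother sign is always a word, whatever the input was.
    return Sign(affix.form + stem.form, affix.yields, is_word=True)


def suffix_rule(stem, affix):
    """Combine stem + affix if the affix looks to its left and the stem fits."""
    if affix.side != "left" or stem.category not in affix.expects:
        return None
    return Sign(stem.form + affix.form, affix.yields, is_word=True)


# Hypothetical, simplified entries for two affixes discussed in this chapter.
DI = Affix("第", frozenset({"cardinal_num"}), "right", "ordinal_num")
XING = Affix("性", frozenset({"A", "V"}), "left", "N")

fifth = prefix_rule(DI, Sign("五", "cardinal_num"))
practicality = suffix_rule(Sign("实用", "A"), XING)
```

In this sketch the rules themselves carry no category information;  everything that distinguishes one derivation from another sits in the affix entry, which is the point of the lexicalized design.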

In the remaining part of this chapter, it will be demonstrated how this proposed general approach is applied to each specific derivation problem.

6.2. Prefixation

The purpose of this section is to present the CPSG95 solution to Chinese prefixation.  This is done by formulating a sample lexical entry for the ordinal prefix 第 di- (-th) in CPSG95.  It will be shown how the lexical information drives the prefix rule in the general grammar for the derivational combination.

Thanks to the productivity of the prefix 第 di- (-th), the ordinal numeral is always a derived word from the cardinal numeral via the following rule, informally formulated in (6-1).

(6-1.) 第 di- + cardinal numeral --> ordinal numeral

第22条军规
di-      22      tiao    jun-gui
-th     22      CLA   military-rule
the 22nd military rule (Catch-22)

第八个是铜像
di-      ba      ge      shi     tong-xiang
-th     eight  CLA   be      bronze-statue
The eighth is the bronze statue.

The basic function of the Chinese numeral, whether cardinal or ordinal,  is to combine with a classifier, as shown in the sample sentences above.

To capture this phenomenon, CPSG95 defines two subtypes for the category numeral [num], namely the [cardinal_num] and [ordinal_num].   The lexical entries of the prefix 第 di‑ (‑th) and the cardinal numeral 五 wu (five) are formulated in (6-2) and (6-3).  The prefix encodes the lexical expectation for the derivation 第 di- + [cardinal_num] ‑‑> [ordinal_num] plus the semantic composition of the combination.  Note that the constraint @numeral inherits all common property specified for the numeral macro.

[Figure: lexical entries (6-2) and (6-3)]

As indicated before, prefixation in CPSG95 is handled by the Prefix PS Rule based on the lexical specification.  More specifically, it is driven by the lexical expectation encoded in [PREFIXING].  The prefix rule is formulated in (6-4).

[Figure: Prefix PS Rule (6-4)]

Like all PS rules in CPSG95, whenever two adjacent signs satisfy all the constraints, this rule takes effect in combining them into a higher level sign in parsing.  For example, the prefix 第 di- (-th) and the sign 五 wu (five) will be combined into the sign as shown in (6-5).

[Figure: derived sign (6-5)]

The combination of 第五 di+wu in (6-5) demonstrates how the morphological structure is built in the CPSG95 approach to Chinese prefixation.

6.3. Suffixation

Like prefixation, the Suffix PS Rule for suffixation is driven by the lexically encoded expectation in [SUFFIXING].  Parallel to the Prefix PS Rule, the suffix rule is formulated in (6-6).

[Figure: Suffix PS Rule (6-6)]

With this PS rule in hand, all that is needed is to capture the individual derivational constraints in the lexical entries of the suffixes at issue.  For example, the suffix 性 -xing (-ness) changes an adjective or verb into an abstract noun:  A/V + -xing  --> N.  This information is contained in the formulation of the suffix 性 -xing (-ness) in the CPSG95 lexicon, as shown in (6-7).

[Figure: lexical entry (6-7)]

Note that abstract nouns are uncountable, hence the call to the uncountable_noun macro to inherit the common property of uncountable nouns.[5]

Suppose the suffix 性 -xing (-ness) appears immediately after the adjective 实用 shi-yong (practical) formulated in (6-8).  The Suffix PS Rule will then combine them into a noun, as shown in (6-9).

[Figure: lexical entry (6-8) and derived sign (6-9)]

The combination of 实用性 shi-yong+xing in (6-9) demonstrates how the morphological structure is built in the CPSG95 approach to Chinese suffixation.

6.4. Quasi-affixes

The purpose of this section is to propose an adequate treatment of the quasi-affix phenomena in Chinese.  This is an area which has not received enough investigation in the field of Chinese NLP.  Few Chinese NLP systems demonstrate where and how to handle these quasi-affixes.

To achieve the purpose, typical examples of ‘quasi-affixes’ are presented and compared with some ‘genuine’ affixes.  The comparison highlights the general property shared by both 'quasi-affixes' and other affixes and also shows their differences.  Based on this study, it is found to be a feasible proposal to treat quasi-affixes within the derivation morphology of CPSG95.  The proposed solution will be presented by demonstrating how a typical quasi-affix is represented in CPSG95 and how the general affix rules can work with the lexical entries of 'quasi-affixes' as well.

The tables in (6-10) and (6-11) list some representative quasi-affixes in Chinese.

(6-10.)         Table for sample quasi-prefixes

Prefixation (examples indented below each pattern)

lei (quasi-) + N --> N
    类前缀 lei-[qian-zhui]: quasi-[pre-fix]
    (前缀: qian (before, pre-, former-), zhui (...))
ban (semi-) + N --> N
    半文盲 ban-[wen-mang]: semi-illiterate
    (文盲: wen (written-language), mang (blind))
dan (mono-) + N --> N
    单音节 dan-[yin-jie]: mono-syllable
    (音节: yin (sound), jie (segment))
shuang (bi-) + N --> N
    双音节 shuang-[yin-jie]: bi-syllable
duo (multi-) + N --> N
    多音节 duo-[yin-jie]: multi-syllable
fei (non-) + N/A --> A
    非谓 fei-wei: non-predicate
    非正式 fei-[zheng-shi]: non-official
xiang (each other) + Vt (mono-syllabic) --> Vi
    相爱 xiang-ai: love each other
zi (self-) + Vt --> Vi
    自爱 zi-ai: self-love
    zi-xue-xi: self-learning
qian (former, ex-) + N --> N
    前夫人 qian-[fu-ren]: ex-wife
    前总统 qian-[zong-tong]: former president

(6-11.)         Table for sample quasi-suffixes

Suffixation (examples indented below each pattern)

N + shi (style) --> N
    美国式 [mei-guo]-shi: American-style
NUM/N + xing (model) --> N
    1980型 1980-xing: 1980 model
    IV型 IV-xing: Model IV
A/V + lü (rate) --> N
    准确率 [zhun-que]-lü: (percentage of) precision
NUM + liu (class) --> A
    一流 yi-liu: first class
    三流 san-liu: third class
N + mang ('blind', person who has little knowledge of) --> N
    法盲 fa-mang: person who has no knowledge of law
    计算机盲 [ji-suan-ji]-mang: computer-layman

A comparison of the above quasi-affixes with the few widely acknowledged affixes like 性 -xing (-ness) and 第 di- (-th) shows that the property generalized in Section 6.1 is shared by both affixes and quasi-affixes.  That is, in all cases of the combination, the affix or quasi-affix expects a sign of category x, with possible additional constraints, either on the right or on the left to form a derived word of category y (y may be equal to x).  For example, the quasi-prefix 自 zi- (self-) expects a transitive verb to produce an intransitive verb.  This property supports the following two points of view:  (i) the affix or quasi-affix is the selecting head of the combination;  (ii) both types of combination (affixation) should be properly contained in morphology since the output is always a word (derivative).

In terms of difference, it is observed that there are different degrees of the functionalization of the meaning between quasi-affixes and other affixes.  For example, the nominalizer 性 -xing (‑ness) seems to be semantically more functionalized than the quasi-suffix 盲 -mang (blind-man, person who has little knowledge of).  In the case of 性 -xing (-ness), there is believed to be little semantic contribution from the affix.  But in cases of affixation by quasi-affixes, the semantic contribution of the affixes is non-trivial, and it must be ensured that proper semantics be built based on semantic compositionality of both the stem and the affix.

Except for the different degrees of semantic abstractness, there is no essential grammatical difference observed between quasi-affixes and the few widely accepted affixes.  As the semantic variation can be easily accommodated in the lexicon, nothing needs to be changed in the  general approach to Chinese derivation as described before.  The text below demonstrates how the quasi-affix phenomena are handled in CPSG95, using a sample quasi-affix to show the derivation.

The quasi-prefix to examine is 相 xiang- (each other).  It is used before a mono-syllabic transitive verb, making it an intransitive verb: 相 xiang- + Vt (monosyllabic) ‑‑> Vi.  More precisely, the syntactic object of the transitive verb is morphologically satisfied so that the derivative becomes an intransitive verb.

Unlike the original verb, the verb derived via xiang-prefixation requires a plural subject, as shown in (6-12).  This is a linguistically interesting phenomenon.  In a sense, it is a version of subject-predicate agreement in Chinese.

(6-12.) (a)    他们相爱过。
ta-men         xiang-         ai       guo
they            each-other   love    GUO
They used to love each other.

(b)      他爱过。
ta       ai       guo
he      love    GUO
He used to love (someone).

(c) *   他相爱过。
ta       xiang-         ai       guo
he      each-other   love    GUO

This number agreement can help decode the plural semantics of the subject noun as shown in the first sentence (6-13a) in the following group.  Sentence (6-13a) illustrates a common, number-underspecified case where the NP has no plural marker.  This contrasts with (6-13b) which includes a plural marker 们 men (-s), and with (6-13c) which resorts to the use of a numeral-classifier construction.

(6-13.) (a)     孩子相爱了。
hai-zi           xiang-         ai       le
child           each-other   love    LE
The children have fallen in love with each other.

(b)      孩子们相爱了。
hai-zi men   xiang-         ai       le
child  PLU   each-other   love    LE
The children have fallen in love with each other.

(c)      两个孩子相爱了。
liang ge      hai-zi           xiang-         ai       le
two    CLA   child           each-other   love    LE
The two children have fallen in love with each other.

Following the practice for number agreement in HPSG, the agreement can be captured by enforcing an additional plural constraint on the subject expectation [SUBJ | SIGN | CONTENT | INDEX | NUMBER plural], as shown in the formulation of the lexical entry for 相 xiang- (each other) in (6-14) below.

[Figure: lexical entry (6-14)]

As shown above, the affixation also necessitates corresponding modification of the semantics in the argument structure:  the first argument is equal to the second via index [2].[6]  Note that the notation [ ], or more accurately, the most general feature structure, is used as a place holder.  For example, HANZI <[ ]> stands for the constraint of a mono-hanzi sign.  Another thing worth noticing is that the derivative requires that a subject must appear before it.  In other words, the subject expectation becomes obligatory.  This is based on the fact that this derived verb cannot stand by itself in syntax, unlike most original verbs in Chinese, say 爱 ai (love), whose subject expectation is optional.
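
How the plural constraint interacts with a number-underspecified subject can be sketched as unification over a small sort hierarchy.  The fragment below is an illustration only:  it assumes the [number] hierarchy described in note [5] ({a_number, no_number}, with a_number sub-typed into {singular, plural}), and the function names and the tiny subject table are hypothetical.

```python
# Child -> parent links of the assumed [number] sort hierarchy.
PARENT = {
    "a_number": "number",
    "no_number": "number",
    "singular": "a_number",
    "plural": "a_number",
}


def ancestors(sort):
    """Return the sort itself followed by all of its supersorts."""
    chain = [sort]
    while sort in PARENT:
        sort = PARENT[sort]
        chain.append(sort)
    return chain


def unify_sorts(a, b):
    """Return the more specific sort if one subsumes the other, else None."""
    if b in ancestors(a):
        return a
    if a in ancestors(b):
        return b
    return None


# Hypothetical NUMBER values for three subjects used in (6-12) and (6-13).
SUBJECT_NUMBER = {"孩子": "a_number", "孩子们": "plural", "他": "singular"}


def xiang_subject_number(subject):
    """NUMBER value of the subject after meeting the plural expectation."""
    return unify_sorts(SUBJECT_NUMBER[subject], "plural")
```

Under these assumptions the underspecified 孩子 is resolved to plural, 孩子们 stays plural, and the singular 他 fails to unify, mirroring the contrast between (6-13a), (6-13b) and the starred example in (6-12).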

With the lexical entries for the quasi-affixes taking care of the differences in the building of semantics, there is no need for any modification of the CPSG95 PS rules.  For example, the prefix 相 xiang- (each other) and the verb 爱 ai (love) formulated in (6-15) will be combined into the derivative 相爱 xiang-ai (love each other) shown in (6-16) via the Prefix PS Rule.

[Figure: lexical entry (6-15) and derived sign (6-16)]

In summary, the proposed approach to Chinese derivation is effective in handling quasi-affixes as well.  The general grammar rules for derivation remain unchanged while lexical constraints are accommodated in the lexicon.  This demonstrates the advantages of the lexicalized design for grammar development.

6.5. Suffix 者 zhe (-er)

This section analyzes zhe-suffixation, a highly challenging  case at the interface between morphology and syntax.  This is believed to be an unsolvable problem as long as a system is based on the sequential processing of derivation morphology and syntax.  The solution to be proposed in this section is based on the argument that this suffixation is a combination of VP+zhe.

The suffix 者 zhe (-er, person) is a very productive bound morpheme.   It is often compared to the English suffix ‑er or ‑or, as seen in the pairs in (6-17).

(6-17.)
工作 gong-zuo (work)      工作者 [gong-zuo]-zhe (work‑er)
劳动 lao-dong (labor)       劳动者 [lao-dong]-zhe (labor-er)
学习 xue-xi (learn)           学习者 [xue-xi]-zhe (learn-er)

But 者 -zhe is not an ordinary suffix;  it belongs to the category of so-called ‘phrasal affix’,[7] with characteristics very different from those of its English counterpart.  Although the output of the zhe-suffixation is a word, the input is a VP, not a lexical V.  In other words, it combines with a VP and produces a lexical N:  VP+zhe --> N.   The arguments to be presented below support this analysis.

The first thing is to demonstrate the word status of zhe‑suffixation.  This is fairly straightforward:  there are no observed facts to show that the zhe-derivative is different from other lexical nouns in the syntactic distribution.  For example, like other lexical nouns, the derivative can combine with an optional classifier construction to form a noun phrase.   Compare the following pairs of examples in (6-18) and (6-19).

(6-18.) (a)    两名违反这项规定者
liang  ming [[wei-fan      zhe    xiang gui-ding]     -zhe]
two    CLA   violate         this    CLA   regulation   -er
two persons who have violated this regulation

(b)    两名学生
liang  ming xue-sheng
two    CLA   student
two students

(6-19.) (a)    他是一位优秀工作者
ta       shi     yi       wei    you-xiu        [[gong-zuo]   -zhe]
he      be      one    CLA   excellent      work           -er
He is an excellent worker.

(b)    他是一位优秀工人。
ta       shi     yi       wei    you-xiu        gong-ren
he      be      one    CLA   excellent      worker
He is an excellent worker.

The next thing is to demonstrate the phrasal nature of the ‘stem’.[8]   The stem is judged as a VP because it can be freely expanded by syntactical complements or modifiers without changing the morphological relationship between the stem and the suffix, as shown in (6‑20) below.  (6-20a) involves a modifier (努力 nu-li) before the head verb.  The verb stem in (6-20b) and (6-20c) is a transitive VP consisting of a verb and an NP object.

(6-20.) (a)    努力工作者
[nu-li  gong-zuo]     -zhe
hard  work           ‑er
hard-worker, person who works hard

(b)      学习鲁迅者
[xue-xi         Lu Xun]       -zhe
learn           Lu Xun       -er
person who is learning from Lu Xun

(c)      违反这项规定者
[wei-fan       zhe    xiang           gui-ding]      -zhe
violate         this    CLA   regulation   -er
person who violates this rule

More examples with the head verb 雇 gu (employ) are given in (6-21), with the last two expressions involving passivized VP.

(6-21.) (a)    雇者
gu-zhe
employ-er

(b)      雇人者
[gu               ren]             -zhe
employ        person         -er
those who employ people, employer/recruiter

(c)      被雇者
[bei gu]                  -zhe
[be-employed]       -er
employee

(d)      被人雇者
[bei    ren              gu]               -zhe
by      person         employ        -er
those who are employed by (other) people

In fact, the stem VP is semantically equivalent to a relative clause.   A Chinese relative clause is normally expressed in the form of a DE-phrase: VP+de+N (Xue 1991).  In other words, 者 -zhe embodies the functions of two signs, an N (‘person’, by default) and a relative clause introducer de, something like English one that + VP (or person who + VP).[9]  Compare the two examples in (6-22) and (6-23), which have the same meaning:  (6-22) uses the suffix 者 -zhe, while (6-23) is the more colloquial expression.

(6-22.) 违反规定者,处以罚款。
wei-fan        gui-ding       zhe,            chu-yi                   fa-kuan
violate         regulation   one that      punish-by   fine

Those who violate the regulations will be punished by fines.

(6-23.) 违反规定的人,处以罚款。
wei-fan        gui-ding       de      ren,             chu-yi          fa-kuan
violate         regulation   DE     person         punish-by   fine
Those who violate the regulations will be punished by fines.

On further examination, it is found that VPs with attached aspect markers combine with the suffix 者 -zhe with difficulty, as seen in the following examples.

(6-24.) (a)    违反规定者
wei-fan        gui-ding       zhe
violate         regulation   -er
Those who violate the regulations

(b) ?  违反了规定者
wei-fan        le       gui-ding       zhe
violate         LE     regulation   one that

This means that some further constraint may be necessary in order to prevent the grammar from producing strings like (6-24b).  If CPSG95 is only used for parsing, such a constraint is not absolutely necessary because, in normal Chinese text, such input is almost never seen.  Since CPSG95 is intended to be procedure-neutral, for use in both parsing and generation, the further constraint is desirable.

This constraint is in fact not an isolated phenomenon in Chinese grammar.  In syntax, the constraint is commonly required when the VP is not in the predicate position.[10]  For example, when a verb, say 喜欢 xi-huan (like), or a preposition, say 为了 wei-le (in order to), subcategorizes for a VP as a complement, it actually expects a VP with no aspect markers attached.   The following pair of sentences demonstrates this point.

(6-25.) (a)    我喜欢打篮球。
wo     xi-huan       da      lan-qiu.
I         like              play   basket-ball
I like playing basket-ball.

(b) * 我喜欢打了篮球。
wo     xi-huan       da      le       lan-qiu
I         like              play   LE     basket-ball

To accommodate this common constraint in both Chinese morphology and syntax, a binary feature [FINITE] is designed for Chinese verbs in CPSG95.  In the lexicon, this feature is under-specified for each Chinese verb, i.e. [FINITE bin].  When an aspect marker 了着过 le/zhe/guo combines with the verb, this feature is unified to be [FINITE plus].  We can then enforce the required constraint [FINITE minus] in the morphological expectation or syntactic expectation to prevent an aspected VP from appearing in a position expecting a non-predicate, un-aspected VP.
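
The working of the [FINITE] feature can be sketched as follows.  This is a minimal illustration, not the CPSG95 implementation:  the dictionary representation of a VP and the names `unify_bin`, `build_vp` and `accepts_unaspected_vp` are assumptions made for the example.

```python
ASPECT_MARKERS = {"了", "着", "过"}


def unify_bin(a, b):
    """Unify two values of the sort bin = {plus, minus}; 'bin' is underspecified."""
    if a == "bin":
        return b
    if b == "bin":
        return a
    return a if a == b else None


def build_vp(verb, obj, aspect=None):
    """Build a toy VP; an attached aspect marker resolves FINITE to plus."""
    finite = "bin"                      # lexical verbs are [FINITE bin]
    if aspect in ASPECT_MARKERS:
        finite = unify_bin(finite, "plus")
    return {"form": verb + (aspect or "") + obj, "FINITE": finite}


def accepts_unaspected_vp(vp):
    """Check the expectation [FINITE minus] imposed by e.g. -zhe or xi-huan."""
    return unify_bin(vp["FINITE"], "minus") is not None


bare = build_vp("违反", "规定")
aspected = build_vp("违反", "规定", aspect="了")
```

Here `accepts_unaspected_vp(bare)` succeeds because the underspecified value is compatible with minus, while `accepts_unaspected_vp(aspected)` fails.  Note that the sketch treats the marginal case (6-24b) as a plain failure, which is stricter than the "?" judgment given in the text.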

Based on the above analysis, the lexical entry of the suffix 者 -zhe is formulated in (6-26).  Note the notation for the macro with a parameter (placed in parentheses), @common_noun(名|位|个).  This macro represents the following information.  The derivative is like any other common noun and inherits the common property;  it can combine with an optional classifier construction using the classifier 名 ming or 位 wei or 个 ge.[11]

[Figure: lexical entry (6-26)]

As seen, the VP expectation is realized by using the macro constraint @vp.  The semantics of the derivative is [np_semantics], an instance of -er with restriction from the event of VP, represented by [2].  The index [1] ensures that whatever is expected as a subject by the VP, which has no chance to be satisfied syntactically in this case, is semantically identical to this noun.[12]  In other words, this derived noun semantically fills an argument slot held by the subject in the VP semantics [v_content].  In the active case, say, 雇人者 [gu ren]-zhe (‘person who employs people’), the subject is the first argument, i.e. the index of this noun is the logical subject of employ.  However, when the VP is in the passive, say, 被人雇者 [bei ren gu]-zhe (‘person who is employed by other people’), the subject expected by the VP fills the second argument, i.e. the noun in this case is the logical object of the VP.  It is believed that this is the desired result for the semantic composition of zhe-derivation.
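
The coindexing just described can be made concrete with a toy example.  The sketch below assumes a much simplified VP representation (a relation name, two argument slots, and a record of which slot the unexpressed subject would fill);  the field names and the function `zhe_derive` are hypothetical and only approximate the feature structure in (6-26).

```python
def zhe_derive(vp):
    """Derive a noun from a VP: the noun's index fills the VP's subject slot."""
    noun_index = "x"                       # index of the derived noun
    args = dict(vp["args"])
    args[vp["subj_slot"]] = noun_index     # coindex noun with unexpressed subject
    return {
        "form": vp["form"] + "者",
        "category": "N",                   # the output is a lexical noun
        "index": noun_index,
        "restriction": {"relation": vp["relation"], **args},
    }


# Active VP: the unexpressed subject is the first argument of 'employ'.
ACTIVE_VP = {
    "form": "雇人",
    "relation": "employ",
    "args": {"ARG1": None, "ARG2": "person"},
    "subj_slot": "ARG1",
}

# Passive VP: the unexpressed subject is the second argument of 'employ'.
PASSIVE_VP = {
    "form": "被人雇",
    "relation": "employ",
    "args": {"ARG1": "person", "ARG2": None},
    "subj_slot": "ARG2",
}

employer = zhe_derive(ACTIVE_VP)
employee = zhe_derive(PASSIVE_VP)
```

The same derivation step yields a logical subject for 雇人者 and a logical object for 被人雇者, because the difference is already encoded in the VP rather than in the suffix.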

With the lexical expectation of the suffix as the basis, the general Suffix PS Rule is ready to work.  Remember that there is nothing restricting the input stem to the derivation in either of the derivation rules, formulated in (6-4) and (6-6) before.  In CPSG95, this is not considered part of the general grammar but rather a lexical property of the head affix.  It is up to the affix to decide what constraints such as category, wordhood status, semantic constraint, etc., to impose on the expected stem to produce a derivative.  In most cases of derivation, the input status of the stem is a word, but now we have an intricate case where the suffix zhe (-er) expects a verb phrase for derivation.  The general property for all cases of derivation is that regardless of the input, the output of derivation (as well as any other types of morphology) is always a word.

Before demonstrating by examples how zhe-derivation is implemented, there is a need to address the configurational constraints of CPSG95.  This is an important factor in realizing the flexible interaction between morphology and syntax as required in this case.

In all HPSG-style grammars, some type of configurational constraint is in place to ensure the proper order of rule application.  A typical constraint is that the subject rule should apply after the object rule.  This is implemented in CPSG95 by imposing the constraint in the subject PS rule that the head daughter must be a phrase and by imposing the constraint in the object PS rule that the subject of the head daughter may not be satisfied.[13]
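
The effect of these two constraints on rule ordering can be sketched with a toy model.  The representation below (a dictionary with an object status and a subject flag) and the function names are assumptions made for illustration;  the obligatory/optional distinction follows the discussion in note [13].

```python
def is_phrase(sign):
    """A sign with an obligatory object still pending is not a phrase."""
    return sign["obj"] != "obligatory"


def object_rule(head, obj):
    """Attach an object; the head's subject may not be satisfied yet."""
    if head["obj"] not in ("obligatory", "optional") or head["subj_satisfied"]:
        return None
    return {**head, "tree": f"[{head['tree']} {obj}]", "obj": "satisfied"}


def subject_rule(subj, head):
    """Attach a subject; the head daughter must be a phrase."""
    if head["subj_satisfied"] or not is_phrase(head):
        return None
    return {**head, "tree": f"[{subj} {head['tree']}]", "subj_satisfied": True}


def verb(obj_status):
    """Toy lexical verb whose object expectation is 'obligatory' or 'optional'."""
    return {"tree": "V", "obj": obj_status, "subj_satisfied": False}


# Obligatory object: the subject rule cannot apply to the bare verb.
blocked = subject_rule("S", verb("obligatory"))
full = subject_rule("S", object_rule(verb("obligatory"), "O"))

# Optional object: [S V] can be built, but no object can be added afterwards.
tentative = subject_rule("S", verb("optional"))
dead_end = object_rule(tentative, "O")
```

With either kind of verb the only complete analysis of an S V O string is `[S [V O]]`;  the tentative `[S V]` survives only when no object follows.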

Since derivation morphology and syntax are designed in the same framework in CPSG95, constraints are called for to ensure the ordering of rule application between morphological PS rules and syntactic PS rules as well.  In general, morphological rules apply before syntactic rules.  However, if this constraint is made absolute, to the extent that all morphological rules must apply before all syntactic rules, we in effect make morphology and syntax two independent, successive modules, just as in traditional systems.  The grammar will then lose the power of flexible interaction between morphology and syntax and cannot handle cases like zhe-derivation.  However, this is not a problem in CPSG95.

The proposed constraint regulating the rule application order between morphological PS rules and syntactic PS rules is as follows.  Only when a sign has both an obligatory morphological expectation and a syntactic expectation will CPSG95 have constraints ensuring that the morphological rule apply first.  For example, as formulated in (6-14) before, the sign 相 xiang- (each other) has both a morphological expectation in [PREFIXING] as a bound morpheme and a syntactic expectation for the subject in [SUBJ] as (head of) derivative.  If the input string is 他们相爱  ta-men (they) xiang- (each other) ai (love), the prefix rule will first combine 相 xiang- (each other) and the stem 爱 ai (love) before the subject rule can apply.  The result is the expected structure embodying the results of both morphological analysis and syntactic analysis, [ta-men [xiang- ai]].  This constraint is implemented by specifying in all syntactic PS rules that the head daughter cannot have an obligatory morphological expectation yet to be satisfied.  It effectively prevents a bound morpheme from being used as a constituent in syntax.   It should be emphasized that this constraint in the general grammar does not prohibit a bound morpheme from combining with any type of sign;  such constraints are only lexically decided in the expectation feature of the affix.
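
The ordering constraint can be illustrated with the 他们相爱 example.  The sketch below is a simplification under stated assumptions:  signs are plain dictionaries, the plural check on the subject is left out, and the names `prefix_rule` and `subject_rule` are hypothetical stand-ins for the corresponding CPSG95 PS rules.

```python
# Toy entries: xiang- has both a morphological and a syntactic expectation.
XIANG = {"form": "xiang-", "PREFIXING": "Vt", "yields": "Vi", "SUBJ": "plural"}
AI = {"form": "ai", "category": "Vt", "PREFIXING": None, "SUBJ": "any"}


def prefix_rule(prefix, stem):
    """Morphological rule: satisfy the prefix's PREFIXING expectation."""
    expected = prefix.get("PREFIXING")
    if expected is None or expected != stem.get("category"):
        return None
    return {
        "form": f"[{prefix['form']} {stem['form']}]",
        "category": prefix["yields"],
        "PREFIXING": None,              # morphological expectation now satisfied
        "SUBJ": prefix.get("SUBJ"),     # syntactic expectation percolates up
    }


def subject_rule(subject_form, head):
    """Syntactic rule: refuse a head with a pending morphological expectation."""
    if head.get("PREFIXING") is not None:
        return None                     # a bound morpheme cannot head syntax
    if head.get("SUBJ") is None:
        return None
    return {
        "form": f"[{subject_form} {head['form']}]",
        "category": "S",
        "PREFIXING": None,
        "SUBJ": None,
    }


too_early = subject_rule("ta-men", XIANG)
derived = prefix_rule(XIANG, AI)
sentence = subject_rule("ta-men", derived)
```

The subject rule is blocked on the bare prefix and succeeds only on the derivative, giving `[ta-men [xiang- ai]]`.  Nothing in `prefix_rule` itself restricts what the stem may be;  that decision is left to the value the affix stores in its expectation.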

The following text shows step by step the CPSG95 solution to the problem of zhe-derivation.  The chosen example is the derivation for the derived noun 违反规定者 [[wei-fan gui-ding]-zhe]  ‘persons violating (the) regulation’.  The lexical sign of the suffix 者 -zhe (-er) has already been formulated in (6-26) before.  The words 违反 wei-fan (violate) and 规定 gui-ding (regulation) in the CPSG95 lexicon are shown in (6-27) and (6-28) respectively.

[Figure: lexical entries (6-27) and (6-28)]

Note that all common nouns, specified as @common_noun, in the lexicon have the following INDEX features [PERSON 3, NUMBER number], i.e. third person with unspecified number.  As for the feature [GENDER], it is encoded in the noun itself with one of the following [male], [female], [have_gender], [no_gender] or unspecified as [gender].   The corresponding sort hierarchy is: [gender] consists of sub-sorts [no_gender] and [have_gender];  and [have_gender] is sub-typed into [male] and [female].  Of course, 规定 gui-ding (regulation) is lexically specified as [GENDER no_gender].

The following is the VP built by the object PS rule in the CPSG95 syntax.  As seen, the building of the semantics follows the practice in HPSG, with the argument slots filled by the [INDEX] feature of the subject and object.  In this VP case, [ARG2] has been realized.

[Figure: VP sign (6-29)]

The VP result in (6-29) and the suffix 者 -zhe will combine into the expected derived noun via the Suffix PS Rule, as shown in (6-30).

[Figure: derived noun (6-30)]

To summarize, it is the integrated model of derivational morphology and syntax in CPSG95 that makes the above analysis implementable.  Without the integration, there is no way that a suffix is allowed to expect a phrasal stem.[14]  The lexicalist approach adopted in CPSG95 facilitates the capturing of the individual feature of the phrase expectation for the few individual affixes like 者 -zhe. This enables the general PS rules for derivation in CPSG95 to be applicable to both typical cases of affixation and special cases of affixation.

6.6. Summary

This chapter has investigated some representative phenomena of Chinese derivation and their interface to syntax.  The solutions to these problems have been presented based on the arguments for the analysis.

The key to a lexicalized treatment of Chinese derivation is to determine the structural and semantic property of the derivative and to impose proper constraints on the expected stem.  The constraints on the expected stem are lexically specified in the corresponding morphological expectation feature structure of the affix.  The property of the derivative is also lexically encoded in the affix, seen as head of the derivational structure in the CPSG95 analysis.  This property information will be percolated up when the derivation rules are applied.  These rules ensure that the output of derivation is a word.  It has been shown that this approach applies equally well to derivation via ‘quasi-affixes’ and to the tough case of zhe-suffixation.

 

------------------------------------

[1] Some linguists (e.g. Li and Thompson 1981) hold the view that Chinese has only a few affixes;  others (e.g. Chao 1968) believe that the inventory of Chinese affixes should be extended to include quasi-affixes.  Interestingly, the sign lei (quasi-, original sense ‘class’) itself is a quasi-prefix in Chinese.  Phenomena similar to Chinese quasi-affixes, called ‘semi-affixes’ or ‘Affixoide’, also exist in German morphology (Riehemann 1998).

[2] This is similar to the practice in many grammars, including HPSG, that a functional sign preposition is the selecting head of the corresponding syntactic structure, namely Prepositional Phrase.

[3] Those affixes which are not or no longer productive, e.g. lao‑ (original meaning ‘old’) in lao‑hu (tiger) and lao‑shu (mouse),  are not a problem.  The corresponding derived words are simply listed in the CPSG95 lexicon.

[4] The CPSG95 phrase-structural approach to Chinese productive derivation was inspired by the implementation in HPSG of a word-syntactic approach in Krieger (1994).  Similar practice is also seen in Selkirk (1982), Riehemann (1993) and Kathol (1999) in an effort to explore alternatives to the lexical rule approach to morphology.

[5] The major common property is reflected in two aspects, formulated in the macro definition of uncountable_noun in CPSG95.  First, there is value setting for the [NUMBER] feature, i.e. [CONTENT|INDEX|NUMBER no_number].  The CPSG95 sort hierarchy for the type [number] is defined as {a_number, no_number} where [a_number] is further sub-typed into {singular, plural}.  [NUMBER no_number] applies to uncountable nouns while [NUMBER a_number] is used for countable nouns where the plurality is yet to be decided (i.e. under-specified for plurality).  Second, based on the syntactic difference between Chinese countable nouns and uncountable nouns, the classifier expected by uncountable nouns is exclusively zhong (kind/sort of).  That is, uncountable nouns may only combine with a preceding classifier construction using the classifier zhong.

[6] For the time being, the subtle difference in semantics between pairs like We love ourselves and We love each other is not represented in the content.  It requires a more elaborate system of semantics to reflect the nuance.  The elaboration of semantics is left for future research.

[7] Some linguists (e.g. Z. Lu 1957; Lü et al 1980; Lü 1989; Dai 1993) have briefly introduced the notion of ‘phrasal affix’ in Chinese.  Lü further indicates that these ‘phrasal affixes’ are a distinctive characteristic of the Chinese grammar.

[8] The English possessive morpheme 's is arguably a suffix which expects an NP instead of a lexical noun as its stem:  NP + -'s.  Unlike VP + -zhe, the result of this NP + -'s combination is generally regarded as a phrase, not a word.  In this sense, 's seems to be closer to a functional word, similar to a preposition or postposition, than to a suffix.

[9] Chinese zhe-suffixation is somewhat like the English phenomenon of the what-clause (as in ‘what he likes is not what interests her’).  ‘What’ in this use also embodies the functions of two signs, that which.  However, the English what-clause functions as an NP, whereas VP+zhe forms a lexical N.

[10] It is generally agreed in the circle of Chinese grammar research that Chinese predicate (or finite) verbs have aspect distinction, using or not using aspect markers.  This is in contrast to English where both finite and non-finite verbs have aspect distinction but only finite verbs are tensed.

[11] It is generally agreed that each Chinese common noun may only combine with a classifier construction using a specific set of classifiers.  This classifier specification is generally regarded as lexical, idiosyncratic information of nouns (Lü et al 1980).  Using the macro with the classifier parameter follows this general idea.  It is worth noticing that the lexical formulation for -zhe (-er) in CPSG95 does not rely on any specific NP analysis chosen in syntax, except that the classifier specification should be placed under the entry for nouns (or derived nouns).

[12] The proposal in building the semantics for the zhe-derivative is based on ideas similar to the assumption adopted for the complement control in HPSG that ‘the fundamental mechanism of control was coindexing between the unexpressed subject of an unsaturated complement and its controler’ (Pollard and Sag 1994:282).

[13] If the object expectation is obligatory, this constraint ensures the priority of the object rule over the subject rule in application, building the desirable structure [S [V O]] instead of [[S V] O].  This is because a verb with an obligatory object yet to be satisfied is by definition not a phrase.  If the object expectation is optional, the order of rule application is still in effect although the lexical V in this scenario does not violate the phrase definition.  There are two cases for this situation.  In case one, the object O happens to occur in the input string.  The subject rule will tentatively combine S and V, but the parse can go no further.  This is because the object rule cannot apply after the subject rule, due to the constraint in the object rule that the head cannot have a satisfied subject.  The successful parse will only build the expected structure [S [V O]].  In case two, the object O does not appear in the input string.  Then the tentative combination [S V] built by the subject rule becomes the final parse.

[14] For example, if the lexical rule approach were adopted for derivation, this problem could not be solved.

 


 

PhD Thesis: Chapter V Chinese Separable Verbs

 

5.0. Introduction

This chapter investigates the phenomena usually referred to as separable verbs (离合动词 lihe dongci) in the form V+X.  Separable verbs constitute a significant portion of Chinese verb vocabulary.[1]  These idiomatic combinations seem to show dual status (Z. Lu 1957; L. Li 1990).  When V+X is not separated, it is like an ordinary verb.   When V is separated from X, it seems to be more like a phrasal combination.  The co-existence of both the separated use and contiguous use for these constructions is recognized as a long-standing problem at the interface of Chinese morphology and syntax (L. Wang 1955;  Z. Lu 1957; Chao 1968; Lü 1989; Lin 1983;  Q. Li 1983; L. Li 1990; Shi 1992; Dai 1993; Zhao and Zhang 1996).

Some linguists (e.g. L. Li 1990; Zhao and Zhang 1996) have made efforts to classify different types of separable verbs and demonstrated different linguistic facts about these types.  There are two major types of separable verbs:  V+N idioms with the verb-object relation and V+A/V idioms with the verb-modifier relation - when X is A or non-conjunctive V.[2]

The V+N idiom is a typical case which demonstrates the mismatch between a vocabulary word and grammar word.  There have been three different views on whether V+N idioms are words or phrases in Chinese grammar.

Given the fact that the V and the N can be separated in usage, the most popular view (e.g. Z. Lu 1957; L. Li 1990; Shi 1992) is that they are words when V+N are contiguous and they are phrases otherwise.  This analysis fails to account for the link between the separated use and the contiguous use of the idioms.  In terms of the type of V+N idioms like 洗澡 xi zao (wash-bath: take a bath), this analysis also fails to explain why a different structural analysis should be given to this type of contiguous V+N idioms listed in the lexicon than the analysis to the also contiguous but non-listable combination of V and N (e.g. 洗碗 xi wan 'wash dishes').[3]  As will be shown in Section 5.1, the structural distribution for this type of V+N idioms and the distribution for the corresponding non-listable combinations are identical.

Other grammarians argue that V+N idioms are not phrases (Lin 1983;  Q. Li 1983; Zhao and Zhang 1996).  They insist that they are words, or a special type of words.  This argument cannot explain the demonstrated variety of separated uses.

There are scholars (e.g. Lü 1989; Dai 1993) who indicate that idioms like 洗澡 xi zao are phrases.  Their judgment is based on their observation of the linguistic variations demonstrated by such idioms.  But they have not given detailed formal analyses which account for the difference between these V+N idioms and the non-listable V+NP constructions in the semantic compositionality.  That seems to be the major reason why this insightful argument has not convinced people with different views.

As for V+A/V idioms, Lü (1989) offers a theory that these idioms are words and the insertable signs between V and A/V are Chinese infixes.  This is an insightful hypothesis.  But as in the case of the analyses proposed for V+N idioms, no formal solutions have been proposed based on the analyses in the context of phrase structure grammars.  As a general goal, a good solution should not only be implementable, but also offer an analysis which captures the linguistic link, both structural and semantic, between the separated use and the contiguous use of separable verbs.  It is felt that there is still a distance between the proposed analyses reported in literature and achieving this goal of formally capturing the linguistic generality.

Three types of V+X idioms can be classified based on their different degrees of 'separability' between V and X, to be explored in three major sections of this chapter.  Section 5.1 studies the first type of V+N idioms like 洗澡 xi zao (wash-bath: take a bath).  These idioms are freely separable.  It is a relatively easy case.  Section 5.2 investigates the second type of the V+N idioms represented by 伤心 shang xin (hurt-heart: sad or heartbroken).  These idioms are less separable.  This category constitutes the largest part of the V+N phenomena.  It is a more difficult borderline case.  Section 5.3 studies the V+A/V idioms.  These idioms are least separable:  only the two modal signs 得 de3 (can) and 不 bu (cannot) can be inserted inside them, and nothing else.  For all these problems, arguments for the wordhood judgment will be presented first.  A corresponding morphological or syntactic analysis will be proposed, together with the formulation of the solution in CPSG95 based on the given analysis.

5.1. Verb-object Idioms: V+N I

The purpose of this section is to analyze the first type of V+N idioms, represented by 洗澡 xi zao (wash‑bath: take a bath).  The basic arguments to be presented are that they are verb phrases in Chinese syntax and the relationship between the V and the N is syntactic.  Based on these arguments, formal solutions to the problems involved in this construction will be presented.

The idioms like 洗澡 xi zao are classified as V+N I, to be distinguished from another type of idioms V+N II (see 5.2).  The following is a sample list of this type of idioms.

(5-1.) V+N I: xi zao type

洗澡 xi (wash) zao (bath #)              take a bath
擦澡 ca (scrub) zao (bath #)             clean one's body by scrubbing
吃亏 chi (eat) kui (loss #)                   get the worst
走路 zou (go) lu (way $)                      walk
吃饭 chi (eat) fan (rice $)                    have a meal
睡觉 shui (V:sleep) jiao (N:sleep #)   sleep
做梦 zuo (make) meng (N:dream)     dream (a dream)
吵架  chao (quarrel) jia (N:fight #)    quarrel (or have a row)
打仗 da (beat) zhang (battle)              fight a battle
上当 shang (get) dang (cheating #)                be taken in
拆台 chai (pull down) tai (platform #)          pull away a prop
见面 jian (see) mian (face #)                            meet (face to face)
磕头 ke (knock) tou (head)                              kowtow
带头 dai (lead) tou (head $)                            take the lead
帮忙 bang (help) mang (business #)              give a hand
告状 gao (sue) zhuang (complaint #)            lodge a complaint

Note: Many nouns (marked with # or $) in this type of construction cannot be used independently of the corresponding V.[4]  But those with the mark $ have no such restriction in their literal sense.  For example, when the sign fan means 'meal', as it does in the idiom, it cannot be used in a context other than the idiom chi-fan (have a meal).  Only when it stands for the literal meaning ‘rice’ does it not have to co-occur with chi.

There is ample evidence for the phrasal status of the combinations like 洗澡 xi zao.  The evidence is of three types.  The first comes from the free insertion of some syntactic constituent X between the idioms in the form V+X+N: this involves keyword-based judgment patterns and other X‑insertion tests proposed in Chapter IV.  The second type of evidence resorts to some syntactic processes for the transitive VP, namely passivization and long-distance topicalization.  The V+N I idioms can be topicalized and passivized in the same way as ordinary transitive VP structures do.  The last piece of evidence comes from the reduplication process associated with this type of idiom.   All the evidence leads to the conclusion that V+N I idioms are syntactic in nature.

The first evidence comes from using the wordhood judgment pattern: V(X)+zhe/guo --> word(X).  It is a well-observed syntactic fact that Chinese aspectual markers appear right after a lexical verb (and before the direct object).  If 洗澡 xi zao were a lexical verb, the aspectual markers would appear after the combination, not inside it.  But that is not the case, as shown by the ungrammaticality of the example in (5-2b).  A productive transitive VP example is given in (5-3) to show its syntactic parallelism with V+N I idioms.

(5-2.) (a)      他正在洗着澡
ta       zheng-zai    xi      zhe    zao.
he      right-now    wash ZHE   bath
He is taking a bath right now.

(b) *   他正在洗澡着。
ta       zheng-zai    xi-zao         zhe.
he      right-now    wash-bath   ZHE

(5-3.) (a)      他正在洗着衣服。
ta       zheng-zai    xi      zhe    yi-fu.
he      right-now    wash ZHE   clothes
He is washing the clothes right now.

(b) *   他正在洗衣服着。
ta       zheng-zai    xi      yi-fu           zhe.
he      right-now    wash clothes        ZHE

The above examples show that the aspectual marker 着 zhe (ZHE) must be inserted inside the V+N idiom, just as it is in an ordinary transitive VP structure.
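The judgment pattern V(X)+zhe/guo --> word(X) can be read as a simple check over attested sentences.  The following is a minimal sketch, assuming that sentences are already given as lists of pinyin tokens and that the grammaticality judgments are supplied by the analyst; the function name and the data layout are illustrative and are not part of CPSG95.

```python
# Aspect markers used by the judgment pattern V(X)+zhe/guo --> word(X).
ASPECT_MARKERS = {"zhe", "guo"}

def passes_aspect_test(candidate, grammatical_sentences):
    """Return True if, in some grammatical sentence, the candidate token
    sequence is immediately followed by an aspect marker."""
    n = len(candidate)
    for tokens in grammatical_sentences:
        # Stop early enough that tokens[i + n] is always a valid index.
        for i in range(len(tokens) - n):
            if tokens[i:i + n] == candidate and tokens[i + n] in ASPECT_MARKERS:
                return True
    return False

# Only the grammatical (a) examples are listed; the starred (b) examples
# are excluded by hypothesis.
grammatical = [
    ["ta", "zheng-zai", "xi", "zhe", "zao"],    # (5-2a)
    ["ta", "zheng-zai", "xi", "zhe", "yi-fu"],  # (5-3a)
]

xi_is_lexical_verb = passes_aspect_test(["xi"], grammatical)
xi_zao_is_lexical_verb = passes_aspect_test(["xi", "zao"], grammatical)
print(xi_is_lexical_verb, xi_zao_is_lexical_verb)
```

On this toy data the check succeeds for xi alone and fails for xi zao as a unit, which mirrors the contrast between (5-2a) and (5-2b).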

Further evidence for X-insertion is given below.   This comes from the post-verbal modifier of ‘action-times’ (动量补语 dongliang buyu) like 'once', 'twice', etc.  In Chinese, action-times modifiers appear after the lexical verb and aspectual marker (but before the object), as shown in (5-4a) and (5-5a).

(5-4.) (a)      他洗了两次澡。
ta       xi      le       liang  ci       zao.
he      wash LE     two    time   bath
He has taken a bath twice.

(b) *   他洗澡了两次。
ta       xi-zao         le       liang  ci.
he      wash-bath   LE     two    time

(5-5.) (a)      他洗了两次衣服。
ta       xi      le       liang  ci       yi-fu.
he      wash LE     two    time   clothes
He has washed the clothes twice.

(b) *   他洗衣服了两次。
ta       xi      yi-fu           le       liang  ci.
he      wash clothes        LE     two    time

So far, evidence has been provided of syntactic constituents which are attached to the verb in the V+N I idioms.  To further argue for the VP status of the whole idiom, it will be demonstrated that the N in the V+N I idioms in fact fills the syntactic NP position in the same way as all other objects do in Chinese transitive VP structures.  In fact, N in the V+N I does not have to be a bare N:  it can be legitimately expanded to a full-fledged NP (although it does not normally do so).  A full-fledged NP in Chinese typically consists of a classifier phrase (and modifiers like de-construction) before the noun.  Compare the following pair of examples.  Just like an ordinary NP 一件崭新的衣服 yi jian zhan-xin de yi-fu (one piece of brand-new clothes), 一个痛快的澡 yi ge tong-kuai de zao (a comfortable bath) is a full-fledged NP.

(5-6.)           他洗了一个痛快的澡。
ta       xi      le       yi       ge      tong-kuai     de      zao.
he      wash LE     one    CLA   comfortable DE     bath
He has taken a comfortable bath.

(5-7.)           他洗了一件崭新的衣服。
ta       xi      le       yi       jian    zhan-xin       de      yi-fu.
he      wash LE     one    CLA   brand-new  DE     clothes
He has washed one piece of brand-new clothes.

It should be noted that the above evidence directly contradicts the widespread view that signs like 澡 zao, marked with # in (5-1), are 'bound morphemes' or ‘bound stems’ (e.g. L. Li 1990; Zhao and Zhang 1996).  As shown, like every other free morpheme noun (e.g. yi-fu), zao holds a lexical position in the typical Chinese NP sequence 'determiner + classifier + (de-construction) + N', e.g. 一个澡 yi ge zao (a bath), 一个痛快的澡 yi ge tong-kuai de zao (a comfortable bath).[5]  In fact, as long as the ‘V+N I phrase’ arguments are accepted (further evidence to come), by definition ‘bound morpheme’ is a misnomer for 澡 zao.  As a part of morphology, a bound morpheme cannot play a syntactic role:  it is inside a word and cannot be seen in syntax.  The analysis of 洗 xi (...) 澡 zao as a phrase entails the syntactic roles played by 澡 zao:  (i) 澡 zao is a free morpheme noun which fills the lexical position as the final N inside the possibly full-fledged NP;  (ii) 澡 zao plays the object role in the syntactic transitive structure 洗澡 xi zao.

This bound morpheme view is an argument used for demonstrating  the relevant V+N idioms to be words rather than phrases (e.g. L. Li 1990).  Further examination of this widely accepted view will help to strengthen the counter-arguments that all V+N I idioms are phrases.

Labeling signs like 澡 zao (bath) as bound morphemes seems to come from an inappropriate interpretation of the statement that bound morphemes cannot be ‘freely’, or ‘independently’, used in syntax.[6]  This interpretation places an equal sign between the idiomatic co-occurrence constraint and ‘not being freely used’.  It is true that 澡 zao is not an ordinary noun to be used in isolation.  There is a co-occurrence constraint in effect:  澡 zao cannot be used without the appearance of 洗 xi (or 擦 ca).  However, the syntactic role played by 澡 zao, the object in the syntactic VP structure, has full potential of being ‘freely’ used as any other Chinese NP object:   it can even be placed before the verb in long-distance constructions as shall be shown shortly.  A more proper interpretation of ‘not being freely used’ in terms of defining bound morphemes should be that a genuine bound morpheme, e.g. the suffix 性 -xing ‘-ness’, has to attach to another sign contiguously to form a word.

A comparison with similar phenomena in English may be helpful.  English also has similar idiomatic VPs, such as kick the bucket.[7]  By the same reasoning, it cannot be concluded that bucket (or the bucket) is a bound morpheme only because it demonstrates necessary co-occurrence with the verb literal kick.  Signs like bucket, zao (bath) are not of the same nature as bound morphemes like -less, -ly, un-, -xing (-ness), etc.

The second type of evidence shows some pattern variations for the V+N I idioms.  These variations are typical syntactic patterns for the transitive V+NP structure in Chinese.  One of the most frequently used patterns for transitive structures is the topical pattern of long-distance dependency.  This provides strong evidence for judging the V+N I idioms as syntactic rather than morphological, since, with the exception of clitics, morphological theories in general conceive of the parts of a word as being contiguous.[8]  Both the V+N I idiom and the normal V+NP structure can be topicalized, as shown in (5-8b) and (5-9b) below.

(5-8.) (a)      我认为他应该洗澡。
wo     ren-wei        ta       ying-gai       xi zao.
I         think           he      should        wash-bath
I think that he should take a bath.

(b)      澡我认为他应该洗
zao    wo     ren-wei        ta       ying-gai       xi.
bath  I         think           he      should        wash
The bath I think that he should take.

(5-9.) (a)       我认为他应该洗衣服。
wo     ren-wei        ta       ying-gai       xi      yi-fu.
I         think           he      should        wash clothes
I think that he should wash the clothes.

(b)      衣服我认为他应该洗。
yi-fu           wo     ren-wei        ta       ying-gai       xi.
clothes        I         think           he      should        wash
The clothes I think that he should wash.

The minimal pair of passive sentences in (5-10) and (5‑11) further demonstrates the syntactic nature of the V+N I structure.

(5-10.)         澡洗得很干净。
zao             xi      de3    hen    gan-jing.
bath            wash DE3   very   clean
A good bath was taken so that one was very clean.

(5-11.)         衣服洗得很干净。
yi-fu           xi      de3    hen    gan-jing.
clothes        wash DE3   very   clean
The clothes were washed clean.

The third type of evidence involves the nature of reduplication associated with such idioms.  For idioms like 洗澡 xi zao (take a bath), the first sign can be reduplicated to denote the shortness of the action:  洗澡 xi zao (take a bath) --> 洗洗澡 xi xi zao (take a short bath).  If 洗澡 xi zao is a word, by definition, 洗xi is a morpheme inside the word and 洗洗澡 xi-xi-zao belongs to morphological reduplication (AB-->AAB type).  However, this analysis fails to account for the generality of such reduplication:  it is a general rule in Chinese grammar that a verb reduplicates itself contiguously to denote the shortness of the action.  For example, 听音乐 ting (listen to) yin-yue (music) --> 听听音乐 ting ting yin-yue (listen to music for a while); 休息 xiu-xi (rest) --> 休息休息 xiu-xi xiu-xi (have a short rest), etc.  On the other hand, when we accept that 洗澡 xi zao is a verb-object phrase in syntax and the nature of this reduplication is accordingly judged as syntactic,[9] we come to a satisfactory and unified account for all the related data.  As a result, only one reduplication rule is required in CPSG95 to capture the general phenomena;[10]  there is no need to do anything special for V+N  idioms.

This AB ‑‑> AAB type reduplication problem for the V+N idioms poses a big challenge to traditional word segmenters (Sun and Huang 1996).  Moreover, even when a word segmenter successfully incorporates some procedure to cope with this problem, the essentially same rule has to be repeated in the grammar for the general VV reduplication.  This is not desirable in terms of capturing the linguistic generality.
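The generality argued for here, namely that a single V --> VV rule covers idiomatic and ordinary cases alike, can be illustrated with a minimal sketch.  It assumes the input is already tokenized and that the position of the lexical verb is known; the function name and the token representation are hypothetical and do not reproduce the CPSG95 rule itself.

```python
def reduplicate_verb(tokens, verb_index):
    """Apply V --> VV: repeat the lexical verb contiguously,
    leaving every other token in place."""
    verb = tokens[verb_index]
    return tokens[:verb_index] + [verb, verb] + tokens[verb_index + 1:]

# The same rule handles the V+N I idiom, an ordinary V+NP,
# and a disyllabic lexical verb.
print(reduplicate_verb(["xi", "zao"], 0))        # xi xi zao
print(reduplicate_verb(["ting", "yin-yue"], 0))  # ting ting yin-yue
print(reduplicate_verb(["xiu-xi"], 0))           # xiu-xi xiu-xi
```

Because 洗 xi is treated as a lexical verb, no idiom-specific AB --> AAB rule is needed in this sketch.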

All the evidence presented above indicates that idioms like 洗澡 xi zao, no matter whether V and N are used contiguously or not, are not words, but phrases.  The idiomatic nature of such combinations seems to be the reason why most native speakers, including some linguists, regard them as words.  Lü (1989: 113-114) suggests that vocabulary words like 洗澡 xi zao should be distinguished from grammar words.  He was one of the first Chinese grammarians who found that the V+N relation in idioms like 洗澡 xi zao is a syntactic verb-object relation.  But he did not provide full arguments for his view, nor did he offer a precise formalized analysis of this problem.[11]

As shown in the previous examples, the V+N I idioms do not differ from other transitive verb phrases in all major syntactic behaviors.   However, due to their idiomatic nature, the V+N I idioms are different from ordinary transitive VPs in the following two major aspects.  These differences need to be kept in mind when formulating the grammar to capture the phenomena.

  • Semantics:  the semantics of the idiom should be given directly in the lexicon, not as a result of the computation of the semantics of the parts based on some general principle of compositionality.
  • Co-occurrence requirement:  洗 xi (or 擦 ca) and 澡 zao must co-occur with each other;  走 zou (go) and 路 lu (way) must co-occur; etc.  This is a requirement specific to the idioms at issue.  For example, 洗 xi and 澡 zao must co-occur in order to stand as an idiom to mean ‘take a bath’.

Based on the study above, the CPSG95 solution to this problem is described below.  In order to enforce the co-occurrence of the V+N I idioms, it is specified in the CPSG95 lexicon that the head V obligatorily expects as its object an NP headed by a specific literal.  This treatment originates from the practice of handling collocations in HPSG.  In HPSG, there are features designed to enable the subcategorization for particular words, or phrases headed by particular words.  For example, the feature [NFORM there] and [NFORM it] refer to the expletive there and it respectively for the special treatment of existential constructions, cleft constructions, etc. (Pollard and Sag 1987:62).  The values of the feature PFORM distinguish individual prepositions like for, on, etc.  They are used in phrasal verbs like rely on NP, look for NP, etc.  In CPSG95, this approach is being generalized, as described below.

As presented before, the feature for orthography [HANZI] records the Chinese character string for each lexical sign.  When a specific lexical literal is required in an idiomatic expectation, the constraint is directly placed on the value of the feature [HANZI] of the expected sign, in addition to possible other constraints.  It is standard practice in a lexicalized grammar that the expected complement (object) for the transitive structure be coded directly in the entry of the head V in the lexicon.  Usually, the expected sign is just an ordinary NP.  In the idiomatic VP like 洗 xi (...) 澡 zao, one further constraint is placed:  the expected NP must be headed by the literal character 澡zao.  This treatment ensures that all pattern variations for transitive VP such as passive constructions, topicalized constructions, etc. in Chinese syntax will equally apply to the V+N I idioms.[12]

The difference in semantics is accommodated in the feature [CONTENT] of the head V with proper co-indexing.  In ordinary cases like 洗衣服 xi yi-fu (wash clothes), the argument structure is [vt_semantics] which requires two arguments, with the role [ARG2] filled by the semantics of the object NP.  In the idiomatic case 洗澡 xi zao (take a bath), the V and N form a semantic whole, coded as [RELN take_bath].[13]  The V+N I idioms are formulated like intransitive verbs in terms of composing the semantics - hence coded as [vi_semantics], with only one argument to be co-indexed with the subject NP.  Note that there are two lexical entries in the lexicon for the verb 洗 xi (wash), one for the ordinary use and the other for the idiom, shown in (5-12) and (5-13).

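[Figure (5-12), (5-13): the two lexical entries for 洗 xi (wash), ordinary and idiomatic; image not reproduced]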

The above solution takes care of the syntactic similarity of the V+N I idioms and ordinary V+NP structures.  It is also detailed enough to address their major differences.  In addition, the associated reduplication process (i.e. V+N --> V+V+N) is no longer a problem once this solution is adopted.  As the V in the V+N idioms is judged and coded as a lexical V (word) in this proposal, the reduplication rule which handles V --> VV will equally apply here.
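The two-entry treatment of 洗 xi can be pictured with a small sketch.  This is not the CPSG95 feature structure: the dictionaries below are a flattened, hypothetical rendering in which only [HANZI], [CONTENT], [RELN take_bath], [vt_semantics] and [vi_semantics] come from the text, while the keys OBJECT, HEAD_HANZI and OBLIGATORY and the value "wash" are assumptions made for illustration.

```python
# Ordinary transitive entry: expects any NP as object (assumed layout).
XI_ORDINARY = {
    "HANZI": "洗",
    "OBJECT": {"CATEGORY": "np", "HEAD_HANZI": None},
    "CONTENT": {"TYPE": "vt_semantics", "RELN": "wash"},
}

# Idiomatic entry: obligatorily expects an NP headed by the literal 澡,
# and carries the idiom's meaning directly, like an intransitive verb.
XI_IDIOM = {
    "HANZI": "洗",
    "OBJECT": {"CATEGORY": "np", "HEAD_HANZI": "澡", "OBLIGATORY": True},
    "CONTENT": {"TYPE": "vi_semantics", "RELN": "take_bath"},
}

def object_satisfies(entry, np_head_hanzi):
    """Check the literal constraint on the head of the expected object NP.
    None means the entry places no constraint on the head's characters."""
    required = entry["OBJECT"]["HEAD_HANZI"]
    return required is None or required == np_head_hanzi

# The head of 一个痛快的澡 is 澡, so the idiom entry accepts it,
# whether the NP follows the verb, is topicalized, or is passivized.
print(object_satisfies(XI_IDIOM, "澡"))
print(object_satisfies(XI_IDIOM, "衣服"))
print(object_satisfies(XI_ORDINARY, "衣服"))
```

The point of the design is that the idiom is distinguished only by a constraint on the object's head and by its semantics, so every syntactic pattern available to transitive VPs remains available to it.  The sketch does not address how the ordinary entry interacts with 澡 zao.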

5.2. Verb-object Idioms: V+N II

The purpose of this section is to provide an analysis of another type of V+N idiom and present the solution implemented in CPSG95 based on the analysis.

Examples like 洗澡 xi zao (take a bath) are in fact easy cases to judge.   There are more marginal cases.  When discussing Chinese verb-object idioms, L. Li (1990) and Shi (1992) indicate that the boundary between a word and a phrase in Chinese is far from clear-cut.  There is a remarkable “gray area” in between.  Examples in (5-14) are V+N II idioms, in contrast to the V+N I type, classified by L. Li (1990).

(5-14.) V+N II: 伤心 shang xin type

伤心 shang (hurt) xin (heart)             sad or break one's heart
担心 dan (carry) xin (heart)               worry
留神 liu (pay) shen (attention)           pay attention to
冒险 mao (take) xian (risk)                 take the risk
借光 jie (borrow) guang (light)           benefit from
劳驾 lao (bother) jia (vehicle)             beg the pardon
革命 ge (change) ming (life)                 make revolution
落后 luo (lag) hou (back)                      lag behind
放手 fang (release) shou (hand)          release one's hold

Compared with V+N I (洗澡xi zao type), V+N II has more characteristics of a word.  The lists below given by L. Li (1990) contrast their respective characteristics.[14]

(5-15.) V+N I (based on L. Li 1990:115-116)

as a word:  V-N

(a1) corresponds to one generalized sense (concept)
(a2) usually contains ‘bound morpheme(s)’

as a phrase:  V X N

(b1) may insert an aspectual particle (X=le/zhe/guo)
(b2) may insert all types of post-verbal modifiers (X=BUYU)
(b3) may insert a pre-nominal modifier de-construction (X=DEP)

(5-16.) V+N II (based on L. Li 1990:115)

as a word:  V-N X

(a1) corresponds to one generalized sense (concept)
(a2) usually contains ‘bound morpheme(s)’
(a3) (some) may be followed by an aspectual particle (X=le/zhe/guo)
(a4) (some) may be followed by a post-verbal modifier of duration or number of times (X=BUYU)
(a5) (some) may take an object (X=BINYU)

as a phrase:  V X N

(b1) may insert an aspectual particle (X=le/zhe/guo)
(b2) may insert all types of post-verbal modifiers (X=BUYU)
(b3) may insert a pre-nominal modifier de-construction (X=DEP)

For V+N I, the previous text has already given detailed analysis and evidence and decided that such idioms are phrases, not words.  This position is not affected by the demonstrated features (a1) and (a2) in (5‑15);  as argued before,  (a1) and (a2) do not contribute to the definition of a grammar word.

However, (a3), (a4) and (a5) are all syntactic evidence showing that V+N II idioms can be inserted in lexical positions.   On the other hand, these idioms also show the similarity with V+N I idioms in the features (b1), (b2) and (b3) as a phrase.  In particular, (a3) versus (b1) and (a4) versus (b2) demonstrate a 'minimal pair' of phrase features and word features.  The following is such a minimal pair example (with the same meaning as well) based on the feature pairs (a3) versus (b1), with a post-verbal modifier 透tou (thorough) and aspectual particle 了le (LE).  It demonstrates the borderline status of such idioms.  As before, a similar example of an ordinary transitive VP is also given below for comparison.

(5-17.)         V+N II: word or phrase?

伤心:sad; heart-broken
shang          xin
hurt            heart

(a)      我伤心透了
wo     shang-xin  tou              le.
I         sad              thorough     LE
I was extremely sad.

(b)      我伤透了心
wo     shang         tou              le       xin.
I         break          thorough     LE     heart
I was extremely sad.

(5-18.)         Ordinary V+NP phrase: 恨hen (hate) 他ta (he)

(a) *   我恨他透了
wo     hen   ta      tou              le.
I         hate   he      thorough     LE

(b)      我恨透了他
wo     hen   tou              le       ta.
I         hate   thorough     LE     he
I thoroughly hate him.

As shown in (5-18), in the common V+NP structure, the post-verbal modifier 透 tou (thorough) and the aspectual particle 了 le (perfect aspect) can only occur between the lexical V and NP.  But in many V+N II idioms, they may occur either after the V+N combination or in between.  In (5‑17a), 伤心 shang xin is in the lexical position because Chinese syntax requires that the post-verbal modifier attach to the lexical V, not to a VP, as indicated in (5-18a).  Following the same argument, 伤 shang (hurt) alone in (5-17b) must be a lexical V as well.  The sign 心 xin (heart) in (5‑17b) establishes itself in syntax as object of the V, playing the same role as 他 ta (he) in (5-18b).  These facts show clearly that V+N II idioms can be used both as lexical verbs and as transitive verb phrases.   In other words, before entering a context, while still in the lexicon, one cannot rule out either possibility.

However, there is a clear-cut condition for distinguishing its use as a word from its use as a phrase once a V+N II idiom is placed in a context.   It is observed that the only time a V+N II idiom assumes lexical status is when V and N are contiguous.  In all other cases, i.e. when V and N are not contiguous, they behave essentially like the V+N I type.
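The contiguity condition can be stated as a small decision procedure.  The sketch below assumes pinyin-tokenized input and looks only at the first occurrence of the verb; the function name and return labels are hypothetical, and the check is an illustration of the condition rather than the way CPSG95 selects between entries.

```python
def vn2_status(tokens, v, n):
    """Classify one use of a V+N II idiom by the contiguity condition:
    'word' if v is immediately followed by n, 'phrase' if both signs
    occur but not contiguously in that order, None if either is absent."""
    if v not in tokens or n not in tokens:
        return None
    i = tokens.index(v)  # first occurrence only (simplifying assumption)
    if i + 1 < len(tokens) and tokens[i + 1] == n:
        return "word"
    return "phrase"

print(vn2_status(["wo", "shang", "xin", "tou", "le"], "shang", "xin"))  # (5-17a)
print(vn2_status(["wo", "shang", "tou", "le", "xin"], "shang", "xin"))  # (5-17b)
```

Under this reading, (5-17a) is the lexical use and (5-17b) is the phrasal use of the same idiom.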

In addition to the examples in (5-17) above, two more examples are given below to demonstrate the separated phrasal use of V+N II.  The first is the case V+X+N where X is a possessive modifier attached to the head N.  Note also the post-verbal position of 透 tou (thorough) and 了le (LE).  The second is an example of passivization when N occurs before V.  These examples provide strong evidence for the syntactic nature of V+N II idioms when V and N are not used contiguously.

(5-19.) (a) *   你伤他的心透了
ni       shang         ta       de      xin    tou              le.
you    hurt            he      DE     heart thorough     LE

(b)      你伤透了他的心
ni       shang         tou              le       ta       de      xin.
you    hurt            thorough     LE     he      DE     heart
You broke his heart.

(5-20.)         V+N II: instance of passive with or without 被 bei (BEI)

心(被)伤透了
xin    (bei)   shang         tou              le.
heart BEI    break          thorough     LE
The heart was completely broken.
or: (Someone) was extremely sad.

Based on the above investigation, it is proposed in CPSG95 that two distinct entries be constructed for each such idiom, one as an inseparable lexical V, and the other as a transitive VP just like that of V+N I.  Each entry covers its own part of the phenomena.  In order to capture the semantic link between the two entries, a lexical rule called V_N_II Rule is formulated in CPSG95, shown in (5-21).

[Figure (5-21): V_N_II Lexical Rule; image not reproduced]

The input to the V_N_II Lexical Rule is an entry with [CATEGORY v_n_ii] where [v_n_ii] is a given sub-category in the lexicon for V+N II type verbs.  The output is another entry with the same information except for three features [HANZI], [CATEGORY] and [COMP1_RIGHT].  The new value for [HANZI] is a list concatenating the old [HANZI] and the [HANZI] for the expected [COMP1_RIGHT].  The new [CATEGORY] value is simply [v].  The value for [COMP1_RIGHT] becomes [null].  The outline of the two entries captured by this lexical rule is shown in (5-22) and (5-23).

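[Figure (5-22), (5-23): outline of the two entries related by the V_N_II Lexical Rule; image not reproduced]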
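The effect of the V_N_II Lexical Rule as described in the prose can be sketched in a few lines.  The feature names [HANZI], [CATEGORY] and [COMP1_RIGHT] and the three changes come from the description above; the dictionary encoding, the function name, the use of None for [null], and the sample [CONTENT] value are assumptions made for illustration only.

```python
import copy

def v_n_ii_rule(entry):
    """Derive the inseparable lexical V from a V+N II entry.
    Returns None when the entry is not of category v_n_ii."""
    if entry.get("CATEGORY") != "v_n_ii":
        return None
    out = copy.deepcopy(entry)  # all other information is carried over
    # New HANZI: old HANZI concatenated with the HANZI of the expected object.
    out["HANZI"] = entry["HANZI"] + entry["COMP1_RIGHT"]["HANZI"]
    out["CATEGORY"] = "v"
    out["COMP1_RIGHT"] = None   # stands in for [null]
    return out

# Hypothetical input entry for 伤 shang as the head of 伤心 shang xin.
SHANG = {
    "HANZI": ["伤"],
    "CATEGORY": "v_n_ii",
    "COMP1_RIGHT": {"CATEGORY": "np", "HANZI": ["心"]},
    "CONTENT": {"RELN": "sad"},  # illustrative value
}

SHANG_XIN = v_n_ii_rule(SHANG)
print(SHANG_XIN["HANZI"], SHANG_XIN["CATEGORY"], SHANG_XIN["COMP1_RIGHT"])
```

Since the semantics is copied unchanged, the separated and contiguous uses share one meaning, which is the semantic link the rule is meant to capture.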

It needs to be pointed out that the definition of [CATEGORY v_n_ii] in CPSG95 is narrower than L. Li’s definition of V+N II type idioms.  As indicated by L. Li (1990), not all V+N II idioms share the same set of lexical features (a3), (a4) and (a5) as a word.  The definition in CPSG95 does not include the idioms which share the lexical feature (a5), i.e. taking a syntactic object.  These are idioms like 担心dan-xin (carry-heart: worry about).  For such idioms, when they are used as inseparable compound words, they can take a syntactic object.  This is not possible for all other V+N idioms, as shown below.

(5-24.) (a)     她很担心你
ta       hen    dan-xin                ni.
she     very   worry (about)        you
She is very concerned about you.

(b) *   他很伤心你
ta       hen    shang-xin            ni.
he      very   sad                       you

In addition, these idioms do not demonstrate the full distributional potential of transitive VP constructions.  The separated uses of these idioms are far more limited than other V+N idioms.  For example, they can hardly be passivized or topicalized as other V+N idioms can, as shown by the following minimal pair of passive constructions.

(5-25.)(a) *   心(被)担透了
xin    (bei)   dan             tou              le.
heart BEI    carry           thorough     LE

(b)      心(被)伤透了
xin    (bei)   shang         tou              le.
heart BEI    break          thorough     LE
The heart was completely broken.
or: (Someone) was extremely sad.

In fact, the separated use ('phrasal use') for such V+N idioms seems only limited to some type of X-insertion, typically the appearance of aspect signs between V and N.[15]  Such separated use is the only thing shared by all V+N idioms, as shown below.

(5-26.)(a)     他担过心
ta       dan             guo    xin
he      carry           GUO  heart
He (once) was worried.

(b)      他伤过心
ta       shang         guo    xin
he      break          GUO  heart
He (once) was heart-broken.

To summarize, the V+N idioms like 担心 dan-xin which can take a syntactic object do not share sufficient generality with other V+N II idioms for a lexical rule to capture.  Therefore, such idioms are excluded from the [CATEGORY v_n_ii] type.  This makes these idioms not subject to the lexical rule proposed above.  It is left for future research to answer the question whether there is enough generality among this set of idioms to justify some general approach to this problem, say, another lexical rule or some other way of generalizing the phenomena.  For the time being, CPSG95 simply lists both the contiguous and separated uses of these idioms in the lexicon.[16]

It is worth noticing that leaving such idioms aside, this lexical rule still covers large parts of V+N II phenomena.  The idioms like 担心dan-xin only form a very small set which are in the state of transition to words per se (from the angle of language development) but which still retain some (but not complete) characteristics of a phrase.[17]

5.3. Verb-modifier Idioms: V+A/V

This section investigates the V+X idioms in the form of V+A/V.  The data for the interaction of V+A/V idioms and the modal insertion are presented first.  The subsequent text will argue for Lü's infix hypothesis for the modal insertion and accordingly propose a lexical rule to capture the idioms with or without modal insertion.

The following is a sample list of V+A/V idioms, represented by kan jian (look-see: have seen).

(5-27.) V+A/V: kan jian type

看见 kan (look) jian (see)                    have seen
看穿 kan (look) chuan (through)        see through
离开 li (leave) kai (off)                         leave
打倒 da (beat) dao (fall)                      down with
打败 da (beat) bai (fail)                       defeat
打赢 da (beat) ying (win)                    fight and win
睡着 shui (sleep) zhao (asleep)            fall asleep
进来 jin (enter) lai (come)                             enter
走开 zou (go) kai (off)                         go away
关上  guan (close) shang (up)             close

In the V+A/V idiom kan jian (have-seen), the first sign kan (look) is the head of the combination while the second jian (see) denotes the result.  So when we say, wo (I) kan-jian (see) ta (he), even without the aspectual marker le (LE) or guo (GUO), we know that it is a completed action:  'I have seen him' or 'I saw him'.[18]

Idioms like kan-jian (have-seen) function just as a lexical whole (transitive verb).  When there is an aspect marker, it is attached immediately after the idioms as shown in (5‑28).  This is strong evidence for judging V+A/V idioms as words, not as syntactic constructions.

(5-28.)         我看见了他
wo     kan jian     le       ta.
I         look-see       LE     he                   I have seen him.

The only observed separated use is that such idioms allow for two modal signs 得 de3 (can) and 不 bu (cannot) in between, shown by (5-29a) and (5-29b).  But no other signs, operations or processes can enter the internal structure of these idioms.

(5-29.) (a)     我看不见他
wo     kan bu jian         ta.
I         look cannot see     he
I cannot see him.

(b)      你看得见他吗?
ni       kan de3 jian       ta       me?
you    look can see          he      ME
Can you see him?

Note that English modal verbs ‘can’ and ‘cannot’ are used to translate these two modal signs.  In fact, Contemporary Mandarin also has corresponding modal verbs (能愿动词 neng-yuan dong-ci):  能 neng (can) and 不能 bu neng (cannot).  The major difference between Chinese modal verbs 能 neng / 不能 bu neng and the modal signs 得 de3 / 不 bu lies in their different distribution in syntax.  The use of modal signs 得 de3 (can) and 不 bu (cannot) is extremely restrictive:  they have to be inserted into V+BUYU combinations.  But Chinese modal verbs can be used before any VP structures.  It is interesting to see the cases when they are used together in one sentence, as shown in (5-30 a+b) below.  Note that the meaning difference between the two types of modal signs is subtle, as shown in the examples.

(5-30.)(a)     你看得见他吗?
ni       kan de3 jian         ta       me?
you    look can see          he      ME
Can you see him? (Is your eye-sight good enough?)

(b)      你能看见他吗?
ni       neng kan jian      ta       me?
you    can    see              he      ME
Can you see him?
(Note: This is used in more general sense. It covers (a) and more.)

(a+b)  你能看得见他吗?
ni       neng kan de3 jian         ta       me?
you    can    look can see          he      ME
Can you see him? (Is your eye-sight good enough?)

(5-31.)(a)     我看不见他
wo     kan bu jian           ta
I         look cannot see     he
I cannot see him. (My eye-sight is too poor.)

(b)      我不能看见他
wo     bu     neng kan jian      ta
I         not    can    see              he
I cannot see him. (Otherwise, I will go crazy.)

(a+b) 我不能看不见他
wo     bu     neng kan bu jian           ta.
I         not    can    look cannot see     he
I cannot stand not being able to see him.
(I have to keep him always within the reach of my sight.)

Lü (1989:127) indicates that the modal signs are in fact the only two infixes in Contemporary Chinese.  Following this infix hypothesis, there is a good account for all the data above.  In other words, the V+A/V idioms are V+BUYU compound words subject to the modal infixation.  The phenomena of 看得见 kan-de3-jian (can see) and 看不见 kan-bu-jian (cannot see) are therefore morphological by nature.  But Lü did not offer formal analysis for these idioms.

Thompson (1973) first proposed a lexical rule to derive the potential forms V+de3/bu+A/V from the V+A/V idioms.  The lexical rule approach seems to be most suitable for capturing the regularity of the V+A/V idioms and their infixation variants V+de3/bu+A/V.  The  approach taken in CPSG95 is similar to Thompson’s proposal.  More precisely, two lexical rules are formulated in CPSG95 to handle the infixation in V+A/V idioms.  This way, CPSG95 simply lists all V+A/V idioms in the lexicon as V+A/V type compound words, coded as [CATEGORY v_buyu].[19]  Such entries cover all the contiguous uses of the idioms.  It is up to the two lexical rules to produce two infixed entries to cover the separated uses of the idioms.

The change of the infixed entries from the original entry lies in the semantic contribution of the modal signs.  This is captured in the lexical rules in (5-32) and (5-33).  In case of V+de3+A/V, the Modal Infixation Lexical Rule I in (5-32) assigns the value [can] to the feature [MODAL] in the semantics.  As for V+bu+A/V, there is a setting  [POLARITY minus] used to represent the negation in the semantics, shown in (5-33).[20]

[Figure th003: modal infixation lexical rules (5-32) and (5-33)]

The following lexical entry shows the idiomatic compound 看见 kan-jian as coded in the CPSG95 lexicon (leaving some irrelevant details aside).   This entry satisfies the necessary condition for the proposed infixation lexical rules.

[Figure th004: lexical entry for 看见 kan-jian]

The modal infixation lexical rules will take this [v_buyu] type compound as input and produce two V+MODAL+BUYU entries.  As a result, new entries 看得见 kan-de3-jian (can see) and 看不见 kan-bu-jian (cannot see) as shown below are added to the lexicon.[21]

[Figures th005, th006: lexical entries for 看得见 kan-de3-jian and 看不见 kan-bu-jian]

The above proposal offers a simple, effective way of capturing the linguistic data of the interaction of V+A/V idioms and the modal insertion, since it eliminates the need for any change of the general grammar in order to accommodate this type of separable verbs interacting with 得 de3 / 不 bu, the only two infixes in Chinese.
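As a rough illustration of the procedural view of these two rules, the following Python sketch derives the infixed entries from a listed [v_buyu] entry.  It is only a sketch under stated assumptions:  the `Entry` class, its field names and the output category label `v_modal_buyu` are hypothetical stand-ins for the CPSG95 feature structures, and the text does not say explicitly that the second rule also carries [MODAL can].

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Entry:
    """A simplified lexical entry; field names are illustrative, not CPSG95's."""
    orth: Tuple[str, ...]            # morphemes in surface order
    category: str                    # e.g. "v_buyu"
    modal: Optional[str] = None      # value of the MODAL feature
    polarity: str = "plus"           # value of the POLARITY feature


def infix(entry: Entry, sign: str, **semantics) -> Entry:
    """Insert a modal sign between V and BUYU of a two-part v_buyu compound."""
    if entry.category != "v_buyu" or len(entry.orth) != 2:
        raise ValueError("rule applies only to two-part [v_buyu] compounds")
    verb, buyu = entry.orth
    # "v_modal_buyu" is a hypothetical label for the V+MODAL+BUYU output type.
    return replace(entry, orth=(verb, sign, buyu),
                   category="v_modal_buyu", **semantics)


def modal_infixation_rule_1(entry: Entry) -> Entry:
    # V+de3+A/V: the semantics gains [MODAL can].
    return infix(entry, "得", modal="can")


def modal_infixation_rule_2(entry: Entry) -> Entry:
    # V+bu+A/V: negation is recorded as [POLARITY minus].
    # Assumption: the potential reading [MODAL can] is kept here as well.
    return infix(entry, "不", modal="can", polarity="minus")


kan_jian = Entry(orth=("看", "见"), category="v_buyu")
lexicon = [kan_jian,
           modal_infixation_rule_1(kan_jian),
           modal_infixation_rule_2(kan_jian)]
for e in lexicon:
    print("".join(e.orth), e.category, e.modal, e.polarity)
```

The listed entry itself is left unchanged, so the contiguous use and the two separated uses coexist in the lexicon, and the general grammar needs no modification.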

5.4. Summary

This chapter has conducted an inquiry into the linguistic phenomena of Chinese separable verbs, a long-standing difficult problem at the interface of Chinese compounding and syntax.   For each type of separable verb, arguments for the wordhood judgment have been presented.  Based on this judgment, CPSG95 provides analyses which capture both structural and semantic aspects of the constructions at issue.  The proposed solutions are formal and implementable.  All the solutions provide a way of capturing the link between the separated use and contiguous use of the V+X idioms.  The proposals presented in this chapter cover the vast majority of separable verbs.  Some unsolved rare cases or potential problems are also identified for further research.

 

----------------------------------------------------------------------

[1] They are also called phrasal verbs (duanyu dongci) or compound verbs (fuhe dongci) among Chinese grammarians.  For linguists who believe that they are compounds, the V+N separable verbs are often called verb object compounds and the V+A/V separable verbs resultative compounds.  The want of a uniform term for such phenomena reflects the borderline nature of these cases.  According to Zhao and Zhang (1996), out of the 3590 entries in the frequently used verb vocabulary, there are 355 separable V+N idioms.

[2] As the term 'separable verbs' gives people an impression that these verbs are words (which is not necessarily true), they are better called V+X (or V+N or V+A/V) idioms.

[3] There is no disagreement among Chinese grammarians for the verb-object combinations like xi wan:  they are analyzed as transitive verb phrases in all analyses, no matter whether the head V and the N is contiguous (e.g. xi wan 'wash dishes') or not (e.g. xi san ge wan 'wash three dishes').

[4] Such signs as zao (bath), which are marked with # in (5-1), are often labeled as 'bound morphemes' among Chinese grammarians, appearing only in idiomatic combinations like xi zao (take a bath), ca zao (clean one's body by scrubbing).  As will be shown shortly, bound morpheme is an inappropriate classification for these signs.

[5] It is widely acknowledged that the sequence num+classifier+noun is one typical form of Chinese NP in syntax.  The argument that zao is not a bound morpheme does not rely on any particular analysis of such Chinese NPs.  The fact that such a combination is generally regarded as syntactic ensures the validity of this argument.

[6] The notion ‘free’ or ‘freely’ is linked to the generally accepted view of regarding word as a minimal ‘free’ form, which can be traced back to classical linguistics works such as Bloomfield (1933).

[7] It is generally agreed that idioms like kick the bucket are not compounds but phrases (Zwicky 1989).

[8] That is the rationale behind the proposal of inseparability as important criterion for wordhood judgment in Lü (1989).

[9] In Chinese, reduplication is a general mechanism used both in morphology and syntax.  This thesis only addresses certain reduplication issues when they are linked to the morpho-syntactic problems under examination, but cannot elaborate on the Chinese reduplication phenomena in general.  The topic of Chinese reduplication deserves the study of a full-length dissertation.     

[10] In the ALE implementation of CPSG95, there is a VV Diminutive Reduplication Lexical Rule in place for phenomena like xi zao (take a bath) --> xi xi zao (take a short bath);  ting yin-yue (listen to music) --> ting ting yin-yue (listen to music for a while);  xiu-xi (rest) --> xiu-xi xiu-xi (have a short rest).

[11] He observes that there are two distinct principles on wordhood.  The vocabulary principle requires that a word represent an integrated concept, not the simple composition of its parts.  Associated with the above is a tendency to regard as a word a relatively short string.  The grammatical principle, however, emphasizes the inseparability of the internal parts of a combination.  Based on the grammatical principle, xi zao is not a word, but a phrase.  This view is very insightful.

[12] The pattern variations are captured in CPSG95 by lexical rules following the HPSG tradition.  It is out of the scope of this thesis to present these rules in the CPSG95 syntax.  See W. Li (1996) for details.

[13] In the rare cases when the noun zao is realized in a full-fledged phrase like yi ge tong-kuai de zao (a comfortable bath), we may need some complicated special treatment in the building of the semantics.  Semantically, xi (wash) yi (one) ge (CLA) tong‑kuai (comfortable) de (DE) zao (bath): ‘take a comfortable bath’ actually means tong‑kuai (comfortable) de2 (DE2) xi (wash) yi (one) ci (time) zao (bath): ‘comfortably take a bath once’.  The syntactic modifier of the N zao is semantically a modifier attached to the whole idiom.  The classifier phrase of the N becomes the semantic 'action-times' modifier of the idiom.  The elaboration of semantics in such cases is left for future research.

[14] The two groups classified by L. Li (1990) are not restricted to the V+N combinations.  In order not to complicate the case, only the comparison of the two groups of V+N idioms is discussed here.  Note also that in the tables, he used the term ‘bound morpheme’ (inappropriately) to refer to the co-occurrence constraint of the idioms.

[15] Another type of X-insertion is that N can occasionally be expanded by adding a de‑phrase modifier.  However, this use is really rare.

[16] Since they are only a small, easily listable set of verbs, and they only demonstrate limited separated uses (instead of full pattern variations of a transitive VP construction), to list these words and all their separated uses in the lexicon seems to be a better way than, say, trying to come up with another lexical rule just for this small set.  Listing such idiosyncratic use of language in the lexicon is common practice in NLP.

[17] In fact, this set has been becoming smaller because some idioms, say zhu-yi 'focus-attention: pay attention to', which used to be in this set, have already lost all separated phrasal uses and have become words per se.  Other idioms including dan-xin (worry about) are in the process of transition (called ionization by Chao 1968) with their increasing frequency of being used as words.   There is a fairly obvious tendency that they combine more and more closely as words, and become transparent to syntax.  It is expected that some, or all, of them will ultimately become words proper in future, just as zhu-yi did.

[18] In general, one cannot use kan-jian to translate English future tense 'will see', instead one should use the single-morpheme word kan:  I will see him --> wo (I) jiang (will) kan (see) ta (he).

[19] Of course, [v_buyu] is a sub-type of verb [v].

[20] The use of this feature for representing negation was suggested in Footnote 18 in Pollard and Sag (1994:25).

[21] This is the procedural perspective of viewing the lexical rules.  As pointed out by Pollard and Sag (1987:209), “Lexical rules can be viewed from either a declarative or a procedural perspective: on the former view, they capture generalizations about static relationships between members of two or more word classes; on the latter view, they describe processes which produce the output from the input form.”

 


PhD Thesis: Chapter IV Defining the Chinese Word

 

4.0. Introduction

This chapter examines the linguistic definition of the Chinese word and establishes its formal representation in CPSG95.  This lays a foundation for the treatment of Chinese morpho-syntactic interface problems in later chapters.

To address issues on interfacing morphology and syntax in Chinese NLP, the fundamental question is:  what is a Chinese word?  A proper answer to this question defines the boundaries between morphology, the study of how morphemes combine into words, and syntax, the study of how words combine into phrases.  However, there is no easy answer to this question.

In fact, how to define Chinese words has been a central topic among Chinese grammarians for decades (Hu and Wen 1954; L. Wang 1955; Z. Lu 1957; Lin 1983; Lü 1989; Shi 1992; Dai 1993; Zhao and Zhang 1996).  In the late 1950s, there was a heated discussion in China on the definition of the Chinese word.  This discussion was prompted by the campaign for Chinese writing system reform (文字改革运动).  At that time, the government policy was ultimately to replace Chinese characters (hanzi) with a Romanized writing system.  The pinyin system, based on the Latin alphabet, was designed to represent the pronunciation of the characters in Contemporary Mandarin.  The simplest approach would be to use pinyin as a writing system and simply transcribe Chinese characters into pinyin syllables.  But this was soon found impractical because of the many-to-one correspondence from hanzi to syllable:  text in pinyin with no explicit word boundary delimiters is hardly comprehensible.  Linguists agree that the key issue for the feasibility of a pinyin-based writing system is to establish a standard or definition for Chinese words (Z. Lu 1957).  Once words can be identified by a common standard, the pinyin system can in principle be adopted for recording the Chinese language by using spaces and punctuation marks to separate words.  This is because the number of homophones at the word level is dramatically reduced when compared to the number of homophones at the hanzi (morpheme or monosyllabic) level.

But the definition of a Chinese word is a very complicated issue due to the existence of a considerable number of borderline cases.  It has never been possible to reach a precise definition which can be applied to all circumstances and which can be accepted by linguists from different schools.

There have been many papers addressing the Chinese wordhood issue (e.g. Z. Lu 1957; Lin 1983; Lü 1989; Dai 1993).  Although there are still many problems in defining Chinese words for borderline cases and more debate will continue for many years to come, the understanding of Chinese wordhood has been deepened in the general acknowledgement of the following key aspects:  (i) the distinct status of Chinese morphology;  (ii) the distinction of different notions of word;  and (iii) the lack of absolute definition across systems or theories.

Almost all Chinese grammarians agree that unlike Classical Chinese, Contemporary Chinese is not based on single-morpheme words.  In other words, the word and the morpheme are no longer coextensive in Contemporary Chinese.[1]  In fact, that is the reason why we need to define Chinese morphology.  If the word and the morpheme stand for the same linguistic object in a language, as in Classical Chinese, the definition of morpheme will entail the definition of word and there is no role for morphology.

As it stands, there is little debate on the definition of morpheme in Chinese.  It is generally acknowledged that each syllable (or its corresponding written form hanzi) corresponds to (at least) one morpheme.  In a characteristic ‘isolating language’ (Classical Chinese is close to this type), there is little or no morphology.[2]  However, Contemporary Chinese contains a significant number of bound morphemes in word formation (Dai 1993).  In particular, it is observed that many affixes are highly productive (Lü et al 1980).

It is widely acknowledged that the grammar of Contemporary Chinese is not complete without the component of morphology (Z. Lu 1957; Chao 1968; Li and Thompson 1981; Dai 1993; etc.).   Based on this widely accepted assumption, one major task for this thesis is to argue for the proper place to cut the line between morphology and syntax, and to explore effective ways of interleaving the two for analysis.

A significant development concerning the Chinese wordhood study is the  distinction between two different notions of word:  grammar word versus vocabulary word.  It is now clear that in terms of grammar analysis, a vocabulary word is not an appropriate notion (Lü 1989; more discussion to come in 4.1).

Decades of debate and discussion on the definition of a Chinese word have also shown that an operational definition for a grammar word precise enough to apply to all cases can hardly be established across systems or theories.  But a computational grammar of Chinese cannot be developed without precise definitions.  This leads to an argument in favor of the system internal wordhood definition and the interface coordination within a grammar.

The remaining sections of this chapter are organized as follows.  Section 4.1 examines two notions of word.  Once the right notion has been selected on the basis of an appropriate guideline, some operational methods for judging a Chinese grammar word are developed in 4.2.  Section 4.3 demonstrates the formal representation of a word in CPSG95.  This formalization is based on the design of expectation feature structures and the structural feature structure presented in Chapter III.

4.1. Two Notions of Word

This section examines the two notions of word which have caused confusion.  The first notion, namely vocabulary word, is easy to define.  For the second notion, namely grammar word, however, no operational definition has been available.  It will be argued that a feasible alternative is to define a grammar word, and the labor division between Chinese morphology and syntax, system internally.

A grammar word stands for the grammatical unit which fits in the hierarchy of morpheme, word and phrase in linguistic analysis.  This gives the general concept of this notion but it is by no means an operational definition.  Vocabulary word, on the other hand, refers to the listed entry in the lexicon.  This definition is simple and unambiguous once a lexicon is given.  The lexical lookup will generate vocabulary words as potential building blocks for analysis.

On one hand, vocabulary words come from the lexicon;  they are basic building blocks for linguistic analysis.  On the other hand, as the ‘resulting’ unit for morphological analysis as well as the ‘starting’ or ‘atomic’ unit for syntactic analysis, the grammar word is the notion for linguistic generalization.  But it is observed that a vocabulary word is not necessarily a grammar word and vice versa.  It is this possible mismatch between vocabulary word and grammar word that has caused a problem in both Chinese grammar research and Chinese NLP system development.

Lü (1989) indicates that not making a distinction between these two notions of word has caused considerable confusion on the definition of Chinese word in the literature.  He further points out that only the notion of grammar word should be used in grammar research.

Di Sciullo and Williams (1987) have similar ideas on these two notions of word.  They indicate that a sign listable in the lexicon corresponds to no certain grammatical unit.[3]   It can be a morpheme, a (grammar) word, or a phrase including sentence.  Some examples of different kinds of Chinese vocabulary words are given below to demonstrate this insight.

(4-1.) sample Chinese vocabulary words

(a) 性           bound morpheme, noun suffix, ‘-ness’
(b) 洗           free morpheme or word, V: ‘wash’
(c) 澡           word (only used in idioms), N: ‘bath’
(d) 澡盆        compound word, N: ‘bath-tub’
(e) 洗澡        idiom phrase, VP: ‘take a bath’
(f) 他们         pronoun as noun phrase, NP: ‘they’
(g) 城门失火,殃及池鱼        idiomatic sentence, S: ‘When the gate of a city is on fire, the fish in the canal around the gate is also endangered.’

The above signs are all Chinese vocabulary words.  But grammatically, they do not necessarily function as a grammar word.  For example, (4-1a) functions as a suffix, smaller than a word.  (4-1e) behaves like a transitive VP (see 5.1 for more evidence), and (4-1g) acts as a sentence, both larger than a word.  The consequence of mixing up these different units in a grammar is the loss of power for a grammar to capture the linguistic generality for each level of grammatical unit.
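To make the mismatch between the two notions concrete, here is a minimal Python sketch that stores the vocabulary words of (4-1) together with the rank at which each functions in the grammar.  The `Rank` enumeration and the helper names are illustrative assumptions, not part of CPSG95;  the rank assignments simply follow the labels given in (4-1).

```python
from enum import IntEnum


class Rank(IntEnum):
    MORPHEME = 1   # smaller than a word
    WORD = 2       # grammar word
    PHRASE = 3     # larger than a word
    SENTENCE = 4


# Vocabulary words from (4-1), each paired with the rank it functions at.
LEXICON = {
    "性": Rank.MORPHEME,                  # (a) bound morpheme, noun suffix
    "洗": Rank.WORD,                      # (b) free morpheme or word
    "澡": Rank.WORD,                      # (c) word, only used in idioms
    "澡盆": Rank.WORD,                    # (d) compound word
    "洗澡": Rank.PHRASE,                  # (e) idiom phrase, VP
    "他们": Rank.PHRASE,                  # (f) pronoun as noun phrase
    "城门失火,殃及池鱼": Rank.SENTENCE,    # (g) idiomatic sentence
}


def is_vocabulary_word(sign: str) -> bool:
    """A vocabulary word is simply a listed entry in the lexicon."""
    return sign in LEXICON


def is_grammar_word(sign: str) -> bool:
    """A listed sign is a grammar word only if it functions at word rank."""
    return LEXICON.get(sign) == Rank.WORD


for sign, rank in LEXICON.items():
    print(sign, rank.name, is_grammar_word(sign))
```

Every key passes `is_vocabulary_word`, but only (b), (c) and (d) pass `is_grammar_word`, which is the distinction the grammar has to respect.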

The definition of grammar word has been a contentious issue in general linguistics (Di Sciullo and Williams 1987).  Its precise definition is particularly difficult in Chinese linguistics as there is a considerable number of phenomena on the borderline between Chinese morphology and syntax (Zhu 1985; L. Li 1990; Sun and Huang 1996).  The morpheme-word-phrase transition is a continuous band in the linguistic reality.  Different grammars may well cut the division differently.  As long as there is no contradiction in coordinating these objects within the grammar, there does not seem to be any absolute judgment on which definition is right and which is wrong.

It is generally agreed that a grammar word is the smallest unit in syntax (Lü 1989), as also emphasized by Di Sciullo and Williams (1987) on the 'syntactic atomicity' of word.[4]  But this statement serves only as a guideline in theory;  it is not an operational definition, for the following reason.  It is logically circular to define word (the smallest unit in syntax) and syntax (the study of how words combine into phrases) one upon the other.

To avoid this 'circular definition' problem, a feasible alternative is to system internally define grammar word and the labor division between Chinese morphology and syntax, as in the case of CPSG95.  Of course, the system internal definition still needs to be justified based on the proposed morphological or syntactic analysis of borderline phenomena in terms of capturing the linguistic generality.  More specifically, three things need to be done:  (i) argue for the analysis case by case, e.g. why a certain construction should be treated as a morphological or syntactic phenomenon, what linguistic generality is captured by such a treatment, etc.;  (ii) establish some operational methods for wordhood judgment to cover similar cases;  (iii) use formalized data structures to represent the linguistic units after the wordhood judgment is made.  Section 4.2 will handle task (ii) and Section 4.3 is devoted to the formal definition of word required by task (iii).   The task in (i) will be pursued in the remaining chapters.

Another important notion related to grammar word is unlisted word.  Conceptually, an unlisted word is a novel construction formed via morphological rules, e.g. a derived word like 可读性 ke-du-xing (-able-read-ness: readability), foolish-ness, a compound person name (given name + family name) such as John Smith, 毛泽东 mao-ze-dong (Mao Zedong).  Unlisted words are often rule-based.  This is where productive word formation sets in.

However, unlisted word is not a crystal clear notion, just like the underlying concept grammar word.  Many grammarians have observed that phrases and unlisted words in Chinese are formed under similar rules (e.g. Zhu 1985; J. Lu 1988).  As both syntactic constructions and unlisted words are rule based, it can be difficult to judge a significant number of borderline constructions as morphological or syntactic.

There are fuzzy cases where a construction is regarded as a grammar word by one and judged as a syntactic construction by another.  For example, while san (three) ge (CLA) is regarded as a syntactic construction, namely numeral-classifier phrase, in many grammars including CPSG95, such constructions are treated as compound words by others (e.g. Chen and Liu 1992).  ‘Quasi-affixation’ presents another outstanding ‘gray area’ (see 6.2).

The difficulty in handling the borderline phenomena leads back to the argument that the labor division between Chinese morphology and syntax should be pursued system internally and argued case by case in terms of capturing the linguistic generality.  To implement the required system internal definition, it is desirable to investigate practical wordhood judgment methods in addition to case-by-case arguments.  Some judgment methods will be developed in 4.2.  Case-by-case arguments and analysis for specific phenomena will be presented in later chapters.  After the wordhood judgment is made, there is a need for the formal representation.  Section 4.3 defines the formal representation of word with illustrations.

4.2. Judgment Methods

This section proposes some operational wordhood judgment methods based on the notion of ‘syntactic atomicity’ (Di Sciullo and Williams 1987).  These methods should be applied in combination with arguments of the associated grammatical analysis.  In fact, whether a sign is judged as a morpheme, a grammar word or a phrase ultimately depends on the related grammatical analysis.  However, the operationality of these methods will help facilitate the later analysis for some individual problems and avoid unnecessary repetition of similar arguments.

Most methods proposed for Chinese wordhood judgment in the literature are not fully operational.  For example, Chao (1968) agrees with Z. Lu (1957) that a word can fill the functional frame of a typical syntactic structure.  Dai (1993) points out that while this method may effectively separate bound morphemes from free words, it cannot differentiate between words and phrases, as phrases may also be positioned in a syntactic frame.  In fact, whether this method can indeed separate bound morphemes from free words is still an open question.  This method cannot be made operational unless the definition of ‘frame of a typical syntactic structure’ is given.  The judgment methods proposed in this section try to avoid this ‘lack of operationality’ problem.

Dai (1993) made a serious effort in proposing a series of methods for cutting the line between morphemes and syntactic units in Chinese.  These methods have significantly advanced the study of this topic.  However, Dai admits that these proposals have limitations.  While each proposed method provides a sufficient (but not necessary) condition for judging whether a unit is a morpheme, none of the methods can further determine whether this unit is a word or a phrase.  For example, the method of syntactic independence tests whether a unit in a question can be used as a short answer to the question.  If yes, the syntactic independence is confirmed and this unit is not a morpheme inside a word.  Obviously, such a method tells nothing about the syntactic rank of the tested unit because a word, a phrase or a clause can all serve as an answer to a question.  To determine the rank, other methods and/or analyses need to be brought in.

The first judgment method proposed below involves passivization and topicalization tests.  In essence, it checks whether a string is subject to syntactic processes.  Because a word is an atomic unit, its internal structure is invisible to syntax.  It follows that no syntactic processes are allowed to exert effects on the internal structure of a word.[5]  As passivization and topicalization are generally acknowledged to be typical syntactic processes, if a potential combination A+B is subject to passivization B+bei+A and topicalization B+…+NP+A, it can be concluded that A+B is not a word:  the relation between A and B must be syntactic.

The second method is to define unambiguous patterns for wordhood judgment, namely, judgment patterns.  Judgment patterns are by no means a new concept.  In particular, keyword based judgment patterns have been frequently used in the literature of Chinese linguistics as a handy way for deterministic word category detection (e.g. L. Wang 1955;  Zhu 1985; Lü 1989).

The following keyword (i.e. aspect markers) based patterns are proposed for  judging a verb sign.

(4-2.)
(a) V(X)+着/过 --> word(X)
(b) V(X)+着/过/了+NP --> word(X)

The pattern (4-2a) states that if X is a verb sign, whether transitive or intransitive, and appears immediately before zhe/guo, then X is a word.  This proposal is backed by the following argument:  it is an important and widely acknowledged grammatical generalization in Chinese syntax that the aspect markers appear immediately after lexical verbs (Lü et al 1980).

Note that the aspect marker le (LE) is excluded from the pattern in (4-2a) because the same keyword le corresponds to two distinctive morphemes in Chinese:  the aspect le (LE) attaches to a lexical V while the sentence-final le (LEs) attaches to a VP (Lü et al 1980).  Therefore, judgment cannot be reliably made when a sentence ends in X+le, for example, when X is an intransitive verb or a transitive verb with the optional object omitted.  However, le in pattern (4-2b) has no problem since le is not in the ambiguous sentence final position.  This pattern says that if any of the three aspect markers appears between a sign X of verb and NP, X must be a word:  in fact, it is a lexical transitive verb.

There are two ways to use the judgment patterns.  If a sub-string of the input sentence matches a judgment pattern, one reaches the conclusion directly.  If the input string does not match a pattern directly, one can still make indirect use of the patterns for judgment.  The idiomatic combination xi (wash) zao (bath) is a representative example.  Assume that the vocabulary word xi zao is a grammar word.  It follows that it should be able to fill the lexical verb position in the judgment pattern (4-2a).  We then construct a sentence containing a substring that matches the pattern and check whether it is grammatical.  The result is ungrammatical:  * 他洗澡着 ta (he) xi-zao (V) zhe (ZHE);  * 他洗澡过 ta (he) xi-zao (V) guo (GUO).  Therefore, our assumption must be wrong:  洗澡 xi zao is not a grammar word.  We then change the assumption and try inserting aspect markers inside the combination (this is in fact an expansion test, to be discussed shortly).  The new assumption is that the verb xi alone is a grammar word.  The resulting sentences are perfectly grammatical and match the pattern (4-2b):  他洗着澡 ta (he) xi (V) zhe (ZHE) zao (bath): ‘He is taking a bath’;  他洗过澡 ta (he) xi (V) guo (GUO) zao (bath): ‘He has taken a bath’.  The new assumption is therefore confirmed.  This way, all V+X combinations can be judged based on the judgment patterns (4-2a) or (4-2b).
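The direct use of the patterns can be sketched in a few lines of Python.  This is only an illustration under simplifying assumptions:  the input is taken to be already segmented into (form, category) pairs, the tags "V", "ASP" and "NP" are hypothetical labels, and the markers are identified by their written form alone.  The indirect use described above relies on grammaticality judgments and is not modeled here.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]  # (form, category); the category tags are hypothetical

ASPECT_A = {"着", "过"}          # markers allowed in pattern (4-2a)
ASPECT_B = {"着", "过", "了"}    # markers allowed in pattern (4-2b)


def judged_words(tokens: List[Token]) -> Set[str]:
    """Return the verb signs X that directly match (4-2a) or (4-2b)."""
    words = set()
    for i, (form, cat) in enumerate(tokens[:-1]):
        if cat != "V":
            continue
        marker = tokens[i + 1][0]
        followed_by_np = i + 2 < len(tokens) and tokens[i + 2][1] == "NP"
        if marker in ASPECT_A:
            # (4-2a): V(X)+着/过 --> word(X)
            words.add(form)
        elif marker in ASPECT_B and followed_by_np:
            # (4-2b): V(X)+了+NP --> word(X); 了 counts only before an NP
            words.add(form)
    return words


# 他洗着澡: matches the pattern, so 洗 is judged a word.
print(judged_words([("他", "NP"), ("洗", "V"), ("着", "ASP"), ("澡", "NP")]))
# 我看见了他: matches (4-2b), so 看见 is judged a word.
print(judged_words([("我", "NP"), ("看见", "V"), ("了", "ASP"), ("他", "NP")]))
# Sentence-final 了 is ambiguous, so no judgment is made.
print(judged_words([("他", "NP"), ("看见", "V"), ("了", "ASP")]))
```

An empty result means only that no pattern matched directly;  it does not by itself show that a sign is not a word.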

The third method proposed below involves a more general expansion test.  As an atomic unit in syntax, the internal parts of a word are in principle not separable.[6]  Lü (1989) emphasized inseparability as a criterion for judging grammar words.  But he did not give instructions on how this criterion should be applied.  Nevertheless, many linguists (e.g. Bloomfield 1933; Z. Lu 1957;  Lyons 1968; Dai 1993) have discussed expansion tests in one way or another to assist wordhood judgment.

The method of expansion to be presented below for wordhood judgment is called X-insertion.  X-insertion is based on Di Sciullo and Williams’ thesis of the syntactic atomicity of word.  The rationale is that the internal parts of a word cannot be separated by syntactic constituents.

As a method, how to perform X-insertion is defined as follows.   Suppose that one needs to judge whether the combination A+B is a word.   If a sign X can be found to satisfy the following condition, then A+B is not a word, but a syntactic combination:  (i) A+X+B is a grammatical string,  (ii) X is not a bound morpheme, and (iii) the sub-structure [A+X] is headed by A or the sub-string [X+B] is headed by B.

The first constraint is self-evident:  a syntactic combination is necessarily a grammatical string.  The second constraint aims at  eliminating the danger of wrongly applying an infix here.  In fact, if X is a morphological infix, the conclusion would be just opposite:  A+B is a word.  The last constraint states that X must be a dependant of the head A (or B).  Otherwise, it results in a different structure.  There is no direct structural relation between A and B when A (or B) is a dependant of the head X in the structure.  Therefore, the question of whether A+B is a phrase or a word does not apply in the first place.
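The three conditions can be read as a simple search procedure over candidate signs X.  The sketch below is an illustration only:  the three judgment functions are assumed to be supplied by the analyst (they stand in for grammaticality, bound-morpheme and headedness judgments), and all names are hypothetical.

```python
def find_separating_x(a, b, candidates, is_grammatical, is_bound, head_of):
    """Return the first X showing that A+B is a syntactic combination, else None.

    is_grammatical(string) -> bool, is_bound(x) -> bool, and
    head_of(left, right) -> the head of that two-sign structure (or None)
    are caller-supplied judgments, not computed here.
    """
    for x in candidates:
        if not is_grammatical(a + x + b):               # condition (i)
            continue
        if is_bound(x):                                 # condition (ii)
            continue
        if head_of(a, x) == a or head_of(x, b) == b:    # condition (iii)
            return x
    return None


# Toy judgments over abstract signs "A", "B", "X1", "X2" (purely illustrative).
grammatical_strings = {"AX1B", "AX2B"}
bound_morphemes = {"X1"}
heads = {("A", "X2"): "A"}

witness = find_separating_x(
    "A", "B", ["X1", "X2"],
    is_grammatical=lambda s: s in grammatical_strings,
    is_bound=lambda x: x in bound_morphemes,
    head_of=lambda left, right: heads.get((left, right)),
)
print(witness)
```

A returned X establishes that A+B is a syntactic combination.  A result of None only means that no separating X was found among the candidates tried;  it does not by itself establish that A+B is a word.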

After the wordhood judgment is made on strings of signs based on the above judgment methods and/or the arguments for the analysis involved, the next step is to have them properly represented (coded) in the grammar formalism used.  This is the topic to be presented in 4.3 below.

4.3. Formal Representation of Word

The expectation feature structure and the structural feature structure in the mono-stratal design of CPSG95 presented in Chapter III provide the means for the formal definition of the basic unit word in CPSG95.  Once the wordhood judgment for a unit is made based on arguments for a structural analysis and/or using the methods presented in Section 4.2, a formal representation is required for coding it in CPSG95.

This type of formalization is required to ensure its implementability in enforcing a required configurational constraint.  For example, the suffix 性 -xing expects an adjective word to form an abstract noun.  Constraints such as [CATEGORY a] and @word can therefore be placed in the morphological expectation feature [SUFFIXING].  These constraints will permit, for example, the legitimately derived word 严肃性 [[yan-su]-xing] (serious-ness), but will block the following combination * 非常严肃性 [[fei-chang yan-su]-xing] (very-serious-ness).  This is because 非常严肃 [fei-chang yan-su] violates the formal constraint as given in the word definition:  it is not an atomic unit in syntax.

In CPSG95, word is defined as a syntactically atomic unit without obligatory morphological expectations, formally represented in the following macro.

word macro
a_sign
PREFIXING saturated | optional
SUFFIXING saturated | optional
STRUCT no_syn_dtr

Note that the above formal definition uses the sorted hierarchy [struct] for the structural feature structure and the sorted hierarchy [expected] for the expectation feature structure.  The definitions of these feature structures have been given in the preceding Chapter III.

Based on the sorted hierarchy struct: {syn_dtr, no_syn_dtr}, the constraint [no_syn_dtr] ensures that the word sign does not contain any syntactic daughter.[7]  This prevents syntactic constructions from being treated as words.  On the other hand, since [saturated], [obligatory] and [optional] are three subtypes of [expected], the constraint [saturated|optional] prevents a bound morpheme, say a prefix or suffix which has obligatory expectation in [PREFIXING] or [SUFFIXING], from being treated as a word.
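The effect of the macro can be illustrated with a small predicate.  This is a minimal sketch, not the ALE implementation of CPSG95:  the class Sign and the function is_word are hypothetical names, feature values are plain strings, and the subtypes within the sort hierarchies are deliberately not modeled.

```python
from dataclasses import dataclass

# Values of [expected] that the word macro admits; [obligatory] is excluded.
WORD_EXPECTATION = {"saturated", "optional"}


@dataclass
class Sign:
    """A drastically reduced sign: only the features the word macro constrains."""
    hanzi: str
    prefixing: str = "saturated"
    suffixing: str = "saturated"
    struct: str = "no_syn_dtr"


def is_word(sign):
    """True if the sign satisfies the three constraints of the word macro."""
    return (
        sign.prefixing in WORD_EXPECTATION
        and sign.suffixing in WORD_EXPECTATION
        and sign.struct == "no_syn_dtr"
    )


goose = Sign("鹅")                                  # free morpheme word
prefix_ke = Sign("可", prefixing="obligatory")      # bound morpheme
phrase = Sign("非常严肃", struct="syn_dtr")          # syntactic construction

print(is_word(goose), is_word(prefix_ke), is_word(phrase))
```

The two rejected cases correspond to the two things the macro is meant to exclude:  bound morphemes, through the expectation features, and syntactic constructions, through [STRUCT].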

This macro definition covers the representation of mono-morpheme words, e.g. 鹅 e ‘goose’, 读 du ‘read’, etc., or multi-morpheme words, e.g. 小看 xiao-kan ‘look down upon’, 天鹅 tian-e ‘swan’, etc., as well as unlisted words such as derived words whose internal morphological structures have already been formed.  Some typical examples of word are shown below.

[Figures: feature structures of sample words, referred to below as (4-5) and (4-6)]

For a derived word, note that the specification of [PREFIXING satisfied] and [STRUCT prefix], or [SUFFIXING satisfied] and [STRUCT suffix], assigned by the corresponding PS rule is compatible with the macro word definition.

The above word definition is an extension of the corresponding representation features from HPSG (Pollard and Sag 1987).  HPSG uses a binary structural feature [LEX] to distinguish lexical signs, [LEX +], and non-lexical signs, [LEX -].  In addition, [sign] is divided into [lexical_sign] and [phrasal_sign].[8]  Except for the one-to-one correspondence between [phrasal_sign] and [syn_dtr] in terms of rank (which stands for non-atomic syntactic constructs including phrases), neither of these HPSG binary divisions accounts for the distinction between a bound morpheme and a free morpheme.  Such a distinction is not necessary in HPSG because bound morphemes are assumed to be processed in the preprocessing stage (e.g. lexical rules for English inflection, Pollard and Sag 1987) and do not show themselves as independent input to the parser.  As CPSG95 involves both derivational morphology and syntax in an integrated general grammar, the HPSG binary divisions are no longer sufficient for formalizing the word definition.  ‘Word’ in CPSG95 needs to be distinguished with proper constraints not only from syntactic constructs, but also from affixes (bound morphemes).

In CPSG95, as productive derivation is designed to be an integrated component of the grammar, the word definition is both specified in the lexicon for some free morpheme words and assigned by the rules in morphological analysis.  This practice in essence follows one  suggestion in the original HPSG book:  "we might divide rules of grammar into two classes: rules of word formation, including compounding rules, which introduce the specification [LEX +] on the mother, and other rules, which introduce [LEX -] on the mother." (Pollard and Sag 1987:73).

It is worth noticing that words thus defined can fill either a morphological position or a syntactic position.  This reflects the interface nature of word:  word is an eligible unit in both morphology and syntax.  This is in contrast to bound morphemes which can only be internal parts of morphology.

In morphology, derivation combines a word and an affix into a derived word.  These derivatives are eligible to feed morphology again.  This is shown above by the examples in (4-5) and (4-6).  The adjective word 可读 ke-du (read-able) is derived from the prefix morpheme 可 ke- (-able) and the word 读 du (read).  Like other adjective words, this derived word can further combine with the suffix 性 -xing (-ness) in morphology.  It can also directly enter syntax, as all words do.

To syntax, all words are atomic units.  If a lexical position is specified, via the macro constraint @word in CPSG95, in a syntactic pattern, it makes no difference whether a filler of this position is a listed grammar word, or an unlisted word such as a derivative.  Such distinction is transparent to the syntactic structure.

4.4. Summary

Efforts have been made to reach a better understanding of Chinese wordhood in theory, methodology and formalization.  The main spirit of the HPSG theory and Di Sciullo and Williams' ‘syntactic atomicity’ theory has been applied to the study of Chinese wordhood and its formal representation.  Some effective wordhood judgment methods have also been proposed, based on theoretical guidelines.

The above work in the area of Chinese wordhood study provides a sound foundation for the analysis of the specific Chinese morpho-syntactic interface problems in Chapter V and Chapter VI.

 

 

-------------------------------------------------------

[1] For Classical Chinese, word, morpheme, syllable and hanzi are presumably all co-extensive.  This is the so-called Monosyllabic Myth of Chinese (DeFrancis 1984: ch.8).  The development of large numbers of homophones, mainly due to the loss of coda stops, has led to the development of large quantities of bi-syllabic and poly-syllabic word-like expressions (Chen and Wang 1975).

[2] Classical Chinese arguably allows for a certain degree of compounding.  In the linguistic literature, some linguists (e.g. Sapir 1921; Zhang 1957; Jensen 1990) did not strictly distinguish Contemporary/Modern Chinese from Classical Chinese and they held the general view that Chinese has little morphology except for limited compounding.  But this view of Contemporary Chinese has been criticized as a misconception (Dai 1993) and is no longer accepted by the community of Chinese grammarians.

[3] Di Sciullo and Williams call a sign listable in the lexicon listeme, equivalent to the notion vocabulary word.

[4] In the literature, variations of  this view include the Lexicalist position (Chomsky 1970), the Lexical Integrity Hypothesis (Jackendoff 1972), the Principle of Morphology-Free Syntax (Zwicky 1987), etc.

[5] This type of ‘atomicity’ constraint (Di Sciullo and Williams 1987) is generally known as Lexical Integrity Hypothesis (LIH, Jackendoff 1972), which states that syntactic rules or operations cannot refer to part of a word.  A more elaborate version of LIH is proposed by Zwicky (1987) as a Principle of Morphology-Free Syntax.  This principle states that syntactic rules cannot make reference to the internal morphological composition of words.  The only lexical properties accessible to syntax, according to Zwicky, are syntactic category, subcategory, and features like gender, case, person, etc.

[6] Of course, in theory a word may be separated by a morphological infix.  But except for the two modal signs de3 (can) and bu (cannot) (see Section 5.3 in Chapter V), there does not seem to exist infixation in Mandarin Chinese.

[7] In terms of rank, [no_syn_dtr] in CPSG95 corresponds to the type [lexical_sign] in HPSG (Pollard and Sag 1987).  A binary division between [lexical_sign] and [phrasal_sign] is enough in HPSG to distinguish the atomic unit word from syntactic construction.  But, as CPSG95 incorporates derivation in the general grammar, [no_syn_dtr] covers both free morphemes and bound morphemes.  That is why the [no_syn_dtr] constraint on [STRUCT] alone cannot define word in CPSG95;  it needs to involve constraints on morphological expectation structures as well, as shown in the macro definition.

[8] Note that there are [LEX -] signs which are not of the type [phrasal_sign].

 

PhD Thesis: Chapter III Design of CPSG95

3.0. Introduction

CPSG95 is the grammar designed to formalize the morpho-syntactic analysis presented in this dissertation.  This chapter presents the general design of CPSG95 with emphasis on three essential aspects related to the morpho-syntactic interface:  (i) the overall mono-stratal design of the sign;  (ii) the design of expectation feature structures;  (iii) the design of structural feature structures.

The HPSG-style mono-stratal design of the sign in CPSG95 provides a general framework for the information flow between different components of a grammar via unification.  Morphology, syntax and semantics are all accommodated in distinct features of a sign.  An example will be shown to illustrate the information flow between these components.

Expectation feature structures are designed to accommodate lexical information for the structural combination.  Expectation feature structures are vital to a lexicalized grammar like CPSG95.  The formal definition for the sort hierarchy [expected] for the expectation features will be given.  It will be demonstrated that the defined sort hierarchy provides means for imposing a proper structural hierarchy as defined by the general grammar.

One characteristic of the CPSG95 structural expectation is the unique design of morphological expectation features to incorporate Chinese productive derivation.  This design is believed to be a feasible and natural way of modeling Chinese derivation, as shall be presented shortly below and elaborated in section 3.2.1.  How this design benefits the interface coordination between derivation and syntax will be further demonstrated in Chapter VI.

The type [expected] for the expectation features is similar to the HPSG definition of [subcat] and [mod].  They both accommodate lexical expectation information to drive the analysis conducted via the general grammar.  In order to meet some requirements induced by introducing morphology into the general grammar and by accommodating linguistic characteristics of Chinese, three major modifications from the standard HPSG are proposed in CPSG95.  They are:  (i) the CPSG95 type [expected] is generalized so as to cover productive derivation in addition to syntactic subcategorization and modification;  (ii) unlike HPSG, which tries to capture word order phenomena as independent constraints, Chinese word order in CPSG95 is integrated in the definition of the expectation features and the corresponding morphological/syntactic relations;  (iii) in terms of handling the syntactic subcategorization, CPSG95 pursues a non-list alternative to the standard practice of HPSG relying on the list design of obliqueness hierarchy.  The rationale and arguments for these modifications are presented in the corresponding sections, with a brief summary given below.

The first modification is necessitated by meeting the needs of introducing Chinese productive derivation into the grammar.  It is observed that a Chinese affix acts as the head daughter of the derivative in terms of expectation (Dai 1993).  The expectation information that drives the analysis of a Chinese productive derivation is found to be capturable lexically by the affix sign;  this is very similar to how the information for the head-driven syntactic analysis is captured in HPSG.  The expansion of the expectation notion to include productive morphology can account for a wider range of linguistic phenomena.  The feasibility of this modification has been verified by the implementation of CPSG95 based on the generalized expectation feature structures.

One outstanding characteristic of all the expectation features designed in CPSG95 is that the word order information is implied in the definition of these features.[1]  Word order constraints in CPSG95 are captured by individual PS rules for the structural relationship between the constituents.  In other words, Chinese word order constraints are not treated as phenomena which have sufficient generalizations of themselves independent of the individual morphological or syntactic relations.  This is very different from the word order treatment in theories like HPSG (Pollard and Sag 1987) and GPSG (Gazdar, Klein, Pullum and Sag 1985).  However, a similar treatment can be found in the work from  the school of ‘categorial grammar’ (e.g. Dowty 1982).

The word order theory in HPSG and GPSG is based on the assumption that structural relations and syntactic roles can be defined without involving the factor of word order.  In other words, it is assumed that the structural nature of a constituent (subject, object, etc.) and its linear position in the related structures can be studied separately.  This assumption is found to be inappropriate in capturing Chinese structural relations.  So far, no one has been able to propose an operational definition for Chinese structural relations and morphological/syntactic roles without bringing in word order.[2]

As Ding (1953) points out, without the means of inflections and case markers, word order is a primary constraint for defining and distinguishing Chinese structural relations.[3]  In terms of expectation, it can always be lexically decided where the head sign should look for its expected daughter(s).  It is thus natural to design the expectation features directly on their expected word order.

The reason for the non-list design in capturing Chinese subcategorization can be summarized as follows:  (i) there has been no successful attempt by anyone, including the initial effort involved in the CPSG95 experiment, which demonstrates that the obliqueness design can be applied to Chinese grammar with sufficient linguistic generalizations;  (ii) it is found that the atomic approach with separate features for each complement is a feasible and flexible proposal in representing the relevant linguistic phenomena.

Finally, the design of the structural feature [STRUCT]  originates from [LEX + | -] in HPSG (Pollard and Sag 1987).  Unlike the binary type for [LEX], the type [struct] for [STRUCT] forms an elaborate sort hierarchy.  This is designed to meet the configurational requirements of introducing morphology into CPSG95.  This feature structure, together with the design of expectation feature structures, will help create a favorable framework for handling Chinese morpho-syntactic interface.  The proposed structural feature structure and the expectation feature structures contribute to the formal definition of linguistic units in CPSG95.  Such definitions enable proper lexical configurational constraints to be imposed on the expected signs when required.

3.1. Mono-stratal Design of Sign

This section presents the data structure involving the interface between morphology, syntax and semantics in CPSG95.  This is done by defining the mono-stratal design of the fundamental notion sign and by illustrating how different components, represented by the distinct features for the sign, interact.

As a dynamic unit of grammatical analysis, a sign can be a morpheme, a word, a phrase or a sentence.  It is the most fundamental object of HPSG-style grammars.  Formally, a sign is defined in CPSG95 by the type [a_sign], as shown below.[4]

(3-1.) Definition: a_sign

a_sign
HANZI          hanzi_list
CONTENT        content
CATEGORY       category
SUBJ           expected
COMP0_LEFT     expected
COMP1_RIGHT    expected
COMP2_RIGHT    expected
MOD_LEFT       expected
MOD_RIGHT      expected
PREFIXING      expected
SUFFIXING      expected
STRUCT         struct

The type [a_sign] introduces a set of linguistic features for the description of a sign.  These are features for orthography, morphology, syntax and semantics, etc.[5]  The types, which are eligible to be the values of these features, have their own definitions in the sort hierarchy.  An introduction of these features follows.

The orthographic feature [HANZI] contains a list of Chinese characters (hanzi or kanji).  The feature [CONTENT] embodies the semantic representation of the sign.  [CATEGORY] carries values like [n] for noun, [v] for verb, [a] for adjective, [p] for preposition, etc.  The structural feature [STRUCT] contains information on the relation of the structure to its sub-constituents, to be presented in detail in section 3.3.

The features whose appropriate value must be the type [expected] are called expectation features.  They are the essential part of a lexicalist grammar as these features contain information about various types of potential structures in both syntax and morphology.  They specify various constraints on the expected daughter(s) of a sign for structural analysis.   The design of these expectation features and their appropriate type [expected] will be presented shortly in section 3.2.

The definition of [a_sign] illustrates the HPSG philosophy of mono-stratal analysis interleaving different components.  As seen, different components of Chinese grammar are contained in different feature structures for the general linguistic unit sign.  Their interaction is effected via the unification of relevant feature structures during various stages of analysis.  This will unfold as the solutions to the morpho-syntactic interface problems are presented in Chapter V and Chapter VI.  For illustration, the prefix 可 ke (-able) is used as an example in the following discussion.

As is known, the prefix ke- (-able) makes an adjective out of a transitive verb:  ke- + Vt --> A.  This lexicalized rule is contained in the CPSG95 entry for the prefix ke-, shown in (3-2).  Following the ALE notation, @ is used for macro, a shorthand mechanism for a pre-defined feature structure.[6]

(3-2) [Figure: CPSG95 lexical entry for the prefix 可 ke-]

As seen, the prefix ke- morphologically expects a sign with [CATEGORY vt].  An affix is analyzed as the head of a derivational structure in CPSG95 (see section 6.1 for discussion), and [CATEGORY] is a representative head feature to be percolated up to the mother sign via the corresponding morphological PS rule as formulated in (6-4) of section 6.2.  This expectation therefore eventually leads to a derived word with [CATEGORY a].  Like most Chinese adjectives, the derived adjective has an optional expectation for a subject NP to account for sentences like 这本书很可读 zhe (this) ben (CLA) shu (book) hen (very) ke-du (read-able): ‘This book is very readable’.  This syntactic optional expectation for the derivative is accommodated in the head feature [SUBJ].

Note that before any structural combination of ke- with other expected signs, ke- is a bound morpheme, a sign which has obligatory morphological expectation in [PREFIXING].  Since ke- is the head of both the morphological combination ke+Vt and the potential syntactic combination NP+[ke+Vt], the interface between morphology and syntax in this case lies in the hierarchical structures which should be imposed.  That is, the morphological structure (derivation) should be established before its syntactic expected structure can be realized.  Such a configurational constraint is specified in the corresponding PS rules, i.e. the Subject PS Rule and the Prefix PS Rule.  It guarantees that the obligatory morphological expectation of ke- is saturated before the sign can be legitimately used in syntactic combination.

The interaction between morphology/syntax and semantics in this case is encoded by the information flow, i.e. structure-sharing indicated by the number index in square brackets, between the corresponding feature structures inside this sign.  The semantic compositionality involved in the morphological and syntactic grouping is represented like this.  There is a semantic predicate marked as [-able] (for worthiness) in the content feature [RELN];  this predicate has an argument which is co-indexed by [1] with the semantics of the expected Vt.  Note that the syntactic subject of the derived adjective, say ke-du (read-able) or ke-chi (eat-able), is the semantic (or logical) object of the stem verb, co-indexed by [2] in the sample entry above.  The head feature [CONTENT] which reflects the semantic compositionality will be percolated up to the mother sign when applicable morphological and syntactic PS rules take effect in structure building.
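The information flow just described can be made concrete with plain dictionaries standing in for typed feature structures.  This is a rough sketch under simplifying assumptions:  the function names and the dictionary keys are hypothetical, and values are copied where a unification-based grammar would use genuine structure-sharing.

```python
def derive_with_ke(stem):
    """Combine the prefix 可 ke- with a transitive verb stem.

    The stem's semantics becomes the argument of the [-able] predicate
    (the co-indexing marked [1] in the entry discussed above).
    """
    if stem["category"] != "vt":
        return None
    return {
        "hanzi": "可" + stem["hanzi"],
        "category": "a",
        "content": {"reln": "-able", "arg": stem["content"]},
        "subj": "optional",
    }


def add_subject(adjective, subject_content):
    """Realize the optional subject of the derived adjective.

    The subject fills the object slot of the stem verb (the co-indexing
    marked [2]): the syntactic subject is the logical object.
    """
    stem_content = dict(adjective["content"]["arg"], object=subject_content)
    return {
        "category": adjective["category"],
        "content": {"reln": adjective["content"]["reln"], "arg": stem_content},
        "subj": "saturated",
    }


du = {"hanzi": "读", "category": "vt", "content": {"reln": "read", "object": None}}
ke_du = derive_with_ke(du)
sentence = add_subject(ke_du, "book")
print(ke_du["hanzi"], ke_du["category"])
print(sentence["content"])
```

Note that add_subject is applied to the output of derive_with_ke, never to the bare prefix, which mirrors the requirement that the morphological structure be built before the syntactic one.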

In summary, embodied in CPSG95 is a mono-stratal grammar of morphology and syntax within the same formalism.  Both morphology and syntax use the same data structure (typed feature structure) and mechanisms (unification, sort hierarchy, PS rules, lexical rules, macros).  This design for Chinese grammar is original and is shown to be feasible in the CPSG95 experiments on various Chinese constructions.  The advantages of handling morpho-syntactic interface problems under this design will be demonstrated throughout this dissertation.

3.2. Expectation Feature Structures

This section presents the design of the expectation features in CPSG95.  In general, the expectation features contain information about various types of potential structures of the sign.  In CPSG95, various constraints on the expected daughter(s) of a sign are specified in the lexicon to drive both morphological and syntactic structural analysis.  This provides a favorable basis for interleaving Chinese morphology and syntax in analysis.

The expected daughter in CPSG95 is defined as one of the following grammatical constituents:  (i) subject in the feature [SUBJ];  (ii) first complement in the feature [COMP0_LEFT] or [COMP1_RIGHT];  (iii) second complement in [COMP2_RIGHT];   (iv) head of a modifier in the feature [MOD_LEFT] or [MOD_RIGHT];   (v) stem of an affix in the feature [PREFIXING] or [SUFFIXING].[7]  The first four are syntactic daughters which will be investigated in sections 3.2.2 and 3.2.3.  The last one is the morphological daughter for affixation, to be presented in section 3.2.1.  All these features are defined on the basis of the relative word order of the constituents in the structure.  The hierarchy for the structure at issue resorts to the configurational constraints which will be presented in section 3.2.4.

3.2.1. Morphological Expectation

One key characteristic of the CPSG95 expectation features is the design of morphological expectation features to incorporate Chinese productive derivation.

It is observed that a Chinese affix acts as the head daughter of the derivative in terms of expectation (see section 6.1 for more discussion).   An affix can lexically define what stem to expect and can predict the derivation structure to be built.  For example, the suffix 性 –xing demands that it combine with a preceding adjective to make an abstract noun, i.e. A+-xing --> N.  This type of information can be easily captured by the expectation feature structure in the lexicon, following the practice of the HPSG treatment of the syntactic expectation such as subcategorization and modification.

In the CPSG95 lexicon, each affix entry is encoded to provide the following derivation information:   (i) what type of stem it expects;  (ii) whether it is a prefix or suffix to decide where to look for the expected stem;  (iii) what type of (derived) word it produces.  Based on this lexical information, the general grammar only needs to include two PS rules for Chinese derivation:  one for prefixation, one for suffixation.  These rules will be formulated in Chapter VI (sections 6.2 and 6.3).  It will also be demonstrated that this lexicalist design for Chinese derivation works for both typical cases of affixation and for some difficult cases such as ‘quasi-affixation’ and zhe-suffixation.

In summary, the morphological combination for productive derivation in CPSG95 is designed to be handled by only two PS rules in the general grammar, based on the lexical specification in [PREFIXING] and [SUFFIXING].  Essentially, in CPSG95, productive derivation is treated like a ‘mini-syntax’;[8]  it becomes an integrated part of Chinese structural analysis.
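The division of labor between lexicon and grammar can be illustrated as follows.  This sketch is not the PS rules formulated in Chapter VI;  it only assumes that each affix entry records the three pieces of information listed above, and it uses a boolean is_word flag as a stand-in for the @word constraint.  All names are hypothetical.

```python
# Affix lexicon: expected stem category, position, and resulting category.
AFFIXES = {
    "可": {"position": "prefix", "expects": "vt", "yields": "a"},
    "性": {"position": "suffix", "expects": "a", "yields": "n"},
}


def _derive(affix, stem, position):
    """Shared logic of the two rules: check the affix's lexical expectation."""
    entry = AFFIXES.get(affix)
    if entry is None or entry["position"] != position:
        return None
    # The stem must have the expected category and must itself be a word.
    if stem["category"] != entry["expects"] or not stem["is_word"]:
        return None
    hanzi = affix + stem["hanzi"] if position == "prefix" else stem["hanzi"] + affix
    return {"hanzi": hanzi, "category": entry["yields"], "is_word": True}


def prefix_rule(affix, stem):
    """Prefixation: the affix looks to its right for the stem."""
    return _derive(affix, stem, "prefix")


def suffix_rule(stem, affix):
    """Suffixation: the affix looks to its left for the stem."""
    return _derive(affix, stem, "suffix")


du = {"hanzi": "读", "category": "vt", "is_word": True}
ke_du = prefix_rule("可", du)                      # derived adjective
ke_du_xing = suffix_rule(ke_du, "性")              # derived word feeds morphology again
phrase = {"hanzi": "非常严肃", "category": "a", "is_word": False}
print(ke_du_xing["hanzi"], suffix_rule(phrase, "性"))
```

Adding a new productive affix in this design means adding a lexical entry, not a new rule.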

3.2.2. Syntactic Expectation

This section presents the design of the expectation features to represent Chinese syntactic relations.  It will be demonstrated that constraints like word order and function words are crucial to the formalization of syntactic relations.  Based on them, four types of syntactic relations can be defined, which are accommodated in six syntactic expectation feature structures for each head word.

There is no general agreement on how to define Chinese syntactic relations.  In particular, the distinction between Chinese subject and object has been a long debated topic (e.g. Ding 1953; L. Li 1986, 1990; Zhu 1985; Lü 1989).  The major difficulty lies in the fact that Chinese does not have inflection to indicate subject-verb agreement and nominative case or accusative case, etc.

Theory-internally, there have been various proposals that Chinese syntactic relations be defined on the basis of one or more of the following factors:  (i) word order (more precisely, constituent order);  (ii) the function words associated with the constituents;  (iii) the semantic relations or roles.  The first two factors are linguistic forms while the third factor belongs to linguistic content.

L. Li (1986, 1990) relies mainly on the third factor to study Chinese verb patterns. The constituents in his proposal are named as NP-agent (ming-shi), NP-patient (ming-shou), etc. This practice amounts to placing an equal sign between the syntactic relation and semantic relation.  It implies that the syntactic relation is not an independent feature.  This makes syntactic generalization difficult.

Other Chinese grammarians (e.g. Ding 1953; Zhu 1985) emphasize the factor of word order in defining syntactic relations.  This school insists that syntactic relations be differentiated from semantic relations.  More precisely, semantic relations should be the result of the analysis of syntactic relations.  That is also the rationale behind the CPSG95 practice of using word order and other constraints (including function words) in the definition of Chinese relations.

In CPSG95, the expected syntactic daughter is defined as one of the following grammatical constituents:  (i) subject in the feature [SUBJ], typically an NP which is on the leftmost position relative to the head;  (ii) complements closer to the head in the feature [COMP0_LEFT] or [COMP1_RIGHT], in the form of an NP or a specific PP;  (iii) the second complement in [COMP2_RIGHT]: this complement is defined to be an XP (NP, a specific PP, VP, AP, etc.) farther away from the head than [COMP1_RIGHT] in word order;  (iv) head of a modifier in the feature [MOD_LEFT] or [MOD_RIGHT].  In this defined framework of four types of possible syntactic relations, for each head word, the lexicon is expected to specify the specific constraints in its corresponding expectation feature structures and map the syntactic constituents to the corresponding semantic roles in [CONTENT].  This is a secure way of linking syntactic structures and their semantic composition for the following reason.  Given a specific head word and a syntactic structure with its various constraints specified in the expectation feature structures, the decoding of semantics is guaranteed.[9]

A Chinese syntactic pattern can usually be defined by constraints from category, word order, and/or function words (W. Li 1996).  For example, NP+V, NP+V+NP, NP+PP(x)+NP, NP+V+NP+NP, NP+V+NP+VP, etc. are all such patterns.  With the design of the expectation features presented above, these patterns can be easily formulated in the lexicon under the relevant head entry, as demonstrated by the sample formulations given in (3-3) and (3-4).

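(3-3) [Figure: lexical entry for the transitive pattern NP1+Vt+NP2]

(3-4) [Figure: lexical entry for the transitive pattern NP+PP(x)+Vt]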

The structure in (3-3) is a Chinese transitive pattern in its default word order, namely NP1+Vt+NP2.  The representation in (3-4) is another transitive pattern NP+PP(x)+Vt.  This pattern requires a particular preposition x to introduce its object before the head verb.
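How word-order-based expectation features pick out the constituents of these two patterns can be sketched as below.  The sketch makes several simplifying assumptions that are not part of CPSG95:  entries are dictionaries from feature names to category labels, every listed expectation is treated as obligatory, the modifier features are left out, and "pp(x)" is a placeholder label for a PP headed by the particular preposition x.

```python
LEFT_FEATURES = ("SUBJ", "COMP0_LEFT")            # left of the head, in surface order
RIGHT_FEATURES = ("COMP1_RIGHT", "COMP2_RIGHT")   # right of the head, in surface order


def map_constituents(entry, left, right):
    """Pair constituents around the head with the entry's expectation features.

    left and right are lists of (form, category) pairs in surface order.
    Returns {feature: form}, or None if the pattern does not match.
    """
    roles = {}
    for features, constituents in ((LEFT_FEATURES, left), (RIGHT_FEATURES, right)):
        expected = [f for f in features if f in entry]
        if len(expected) != len(constituents):
            return None
        for feature, (form, category) in zip(expected, constituents):
            if entry[feature] != category:
                return None
            roles[feature] = form
    return roles


plain_vt = {"SUBJ": "np", "COMP1_RIGHT": "np"}    # NP1+Vt+NP2
pp_vt = {"SUBJ": "np", "COMP0_LEFT": "pp(x)"}     # NP+PP(x)+Vt

print(map_constituents(plain_vt, [("他", "np")], [("书", "np")]))
print(map_constituents(pp_vt, [("NP", "np"), ("PP", "pp(x)")], []))
```

Because each feature name already fixes the side and relative distance of its constituent, no separate linear-precedence statement is needed to match either pattern.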

The sample entry in (3-5) is an example of how modification is represented in CPSG95.  Following the HPSG semantics principle, the semantic content from the modifier will be percolated up to the mother sign from the head-modifier structure via the corresponding PS rule.  The added semantic contribution of the adverb chang-chang (often) is its specification of the feature [FREQUENCY] for the event at issue.

(3-5) [Figure: lexical entry for the adverb chang-chang (often)]

3.2.3. Chinese Subcategorization

This section presents the rationale behind the CPSG95 design for subcategorization.  Instead of a SUBCAT-list, a keyword approach with separate features for each complement is chosen for representing the subcategorization information, as shown in the corresponding expectation features in section 3.2.2.  This design has been found to be a feasible alternative to the standard practice of HPSG relying on the list design of obliqueness hierarchy and SUBCAT Principle when handling subject and complements.

The CPSG95 design for representing subcategorization follows one proposal from Pollard and Sag (1987:121), who point out:  “It may be possible to develop a hybrid theory that uses the keyword approach to subjects, objects and other complements, but which uses other means to impose a hierarchical structure on syntactic elements, including optional modifiers not subcategorized for in the same sense.”  There are two issues for such a hybrid theory:  the keyword approach to representing subject and complements and the means for imposing a hierarchical structure.  The former is discussed below while the latter will be addressed in the subsequent section 3.2.4.

The basic reason for abandoning the list design is the lack of an operational definition of obliqueness which captures generalizations of Chinese subcategorization.  In the English version of HPSG (Pollard and Sag 1987, 1994), the obliqueness ordering is established between the syntactic notions of subject, direct object and second object (or oblique object).[10]  But these syntactic relations themselves are by no means universal.  In order to apply this concept to the Chinese language, there is a need for an operational definition of obliqueness which can be applied to Chinese syntactic relations.  Such a definition has not been available.

In fact, how to define Chinese subject, object and other complements has been one of the central debated topics among Chinese grammarians for decades (Lü 1946, 1989; Ding 1953; L. Li 1986, 1990; Zhu 1985; P. Chen 1994).  No general agreement for an operational, cross-theory definition of Chinese subcategorization has been reached.  It is often the case that formal or informal definitions of Chinese subcategorization are given within a theory or grammar.   But so far no Chinese syntactic relations defined in a theory are found to demonstrate convincing advantages of a possible obliqueness ordering, i.e. capturing the various syntactic generalizations for Chinese.

Technically, however, as long as subject and complements are formally defined in a theory, one can impose an ordering of them in a SUBCAT list.  But if such a list does not capture significant generalizations, there is no point in doing so.[11]  It has turned out that the keyword approach is a promising alternative once proper means are developed for the required configurational constraint on structure building.

The keyword approach is realized in CPSG95 as follows.  Syntactic constituents for subcategorization, namely subject and complements, are directly accommodated in four parallel features [SUBJ], [COMP0_LEFT], [COMP1_RIGHT] and [COMP2_RIGHT].

The feasibility of the keyword approach proposed here has been tested during the implementation of CPSG95 in representing a variety of structures.  Particular attention has been given to the constructions or patterns related to Chinese subcategorization.  They include various transitive structures, di-transitive structures, pivotal construction (jianyu-shi), ba-construction (ba-zi ju), various passive constructions (bei-dong shi), etc.  It is found to be easy to  accommodate all these structures in the defined framework consisting of the four features.

We give a couple of typical examples below, in addition to the ones in (3-3) and (3-4) formulated before, to show how various subcategorization phenomena are accommodated in the CPSG95 lexicon within the defined feature structures for subcategorization.  The expected structure and example are shown before each sample formulation in (3‑6) through (3-8) (with irrelevant implementation details left out).

[Figures: sample formulations (3-6) through (3-8)]

Based on such lexical information, the desirable hierarchical structure on the related syntactic elements, e.g. [S [V O]] instead of [[S V] O], can be imposed via the configurational constraint based on the design of the expectation type.  This is presented in section 3.2.4 below.

3.2.4. Configurational Constraint

The key to the success of a keyword approach to structural constituents, including the subject and complements from subcategorization, is a means of imposing the desirable hierarchical morpho-syntactic structure defined by a grammar, namely the configurational constraint.  This section defines the sort hierarchy of the expectation type [expected].  The use of this design for flexible configurational constraint both in the general grammar and in the lexicon will be demonstrated.

As presented before, whether a sign has structural expectation, and what type of expectation a sign has, can be lexically decided:  they form the basis for a lexicalized grammar.  Four basic cases for  expectation are distinguished in the expectation type of CPSG95:  (i) obligatory: the expected sign must occur;  (ii) optional:  the expected sign may occur;  (iii) null:  no expectation;  (iv) satisfied: the expected sign has occurred.  Note that case (i), case (ii) and case (iii) are static information while (iv) is dynamic information, updated at the time when the daughters are combined into a mother sign.  In other words, case (iv) is only possible when the expected structure has actually been built.  In HPSG-style grammars, only the general grammar, i.e. the set of PS rules, has the power of building structures.  For each structure being built, the general grammar will set [satisfied] to the corresponding expectation feature of the mother sign.

Out of the four types, case (i) and case (ii) form a natural class, named [a_expected];  case (iii) and case (iv) form another class, named [saturated].  The formal definition of the type [expected] is given in (3-9).

(3-9.) Definition: sorted hierarchy for [expected]

expected: {a_expected, saturated}
a_expected: {obligatory, optional}
    ROLE role
    SIGN a_sign
saturated: {null, satisfied}

The type [a_expected] introduces two features:  [ROLE] and [SIGN].   [ROLE] specifies the semantic role which the expected sign plays in the structure.  [SIGN] houses various types of constraints on the expected sign.
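The hierarchy in (3-9) can be mirrored in a few lines of Python, which may help readers who are not used to typed feature structures. This is only a minimal sketch, not the CPSG95 implementation: the names `PARENT`, `subsumes` and `Expectation` are assumptions made for illustration, and the role value `patient` and the constraint `{"CATEGORY": "n"}` are hypothetical fillers.

```python
from dataclasses import dataclass
from typing import Optional

# Sort hierarchy from (3-9): each sort maps to its immediate supersort.
PARENT = {
    "a_expected": "expected",
    "saturated": "expected",
    "obligatory": "a_expected",
    "optional": "a_expected",
    "null": "saturated",
    "satisfied": "saturated",
}


def subsumes(general: str, specific: str) -> bool:
    """Return True if `specific` is `general` itself or one of its sub-sorts."""
    node: Optional[str] = specific
    while node is not None:
        if node == general:
            return True
        node = PARENT.get(node)
    return False


@dataclass
class Expectation:
    """Value of one expectation feature (illustrative, not the thesis code)."""
    sort: str
    role: Optional[str] = None   # ROLE: semantic role of the expected sign
    sign: Optional[dict] = None  # SIGN: constraints on the expected sign

    def __post_init__(self) -> None:
        if self.sort != "expected" and self.sort not in PARENT:
            raise ValueError(f"unknown sort: {self.sort}")
        # ROLE and SIGN are introduced by [a_expected], so only
        # [obligatory] and [optional] expectations may carry them.
        carries_features = self.role is not None or self.sign is not None
        if carries_features and not subsumes("a_expected", self.sort):
            raise ValueError("ROLE and SIGN are only defined for [a_expected]")


# Hypothetical object expectation of a transitive verb.
object_slot = Expectation("obligatory", role="patient", sign={"CATEGORY": "n"})
```

A constraint such as `saturated | optional` then reduces to `subsumes("saturated", v) or v == "optional"` for a feature value `v`.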

The type [expected] is designed to meet the requirement of the configurational constraint.  For example, in order to guarantee that syntactic structures for an expecting sign are built on top of its morphological structures if the sign has obligatory morphological expectation, the following configurational constraint is enforced in the general grammar.  (The notation | is used for logical OR.)

(3-10.)         configurational constraint in syntactic PS rules

PREFIXING                    saturated | optional
SUFFIXING                    saturated | optional

The constraint [saturated] means that syntactic rules are permitted to apply if a sign has no morphological expectation or after the morphological expectation has been satisfied.  The reason why the case [optional] does not block the application of syntactic rules is the following.  Optional expectation entails that the expected sign may or may not appear.  It does not have to be satisfied.

Similarly, within syntax, the constraints can be specified in the Subject PS Rule:

(3-11.)         configurational constraint in Subject PS rule

COMP0_LEFT                 saturated | optional
COMP1_RIGHT              saturated | optional
COMP2                           saturated | optional

This ensures that complement rules apply before the subject rule does.  This way of imposing a hierarchical structure between subcategorized elements corresponds to the use of SUBCAT Principle in HPSG based on the notion of obliqueness.
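The effect of (3-11) on rule ordering can be sketched as follows. The code is an illustration under simplifying assumptions: a sign is reduced to a plain dictionary from expectation features to sort names, and the helpers `passes`, `subject_rule_applicable` and `apply_complement_rule` are hypothetical names, not part of CPSG95.

```python
# Sorts that count as [saturated] in the hierarchy for [expected].
SATURATED = {"null", "satisfied"}

# Features checked by the Subject PS rule, named as in (3-11).
COMPLEMENT_FEATURES = ("COMP0_LEFT", "COMP1_RIGHT", "COMP2")


def passes(value: str) -> bool:
    """Check one feature value against the constraint `saturated | optional`."""
    return value in SATURATED or value == "optional"


def subject_rule_applicable(head: dict) -> bool:
    """The Subject PS rule may fire only if no complement is still obligatory."""
    return all(passes(head.get(f, "null")) for f in COMPLEMENT_FEATURES)


def apply_complement_rule(head: dict, feature: str) -> dict:
    """Build the mother sign: the consumed expectation becomes [satisfied]."""
    if head.get(feature, "null") not in ("obligatory", "optional"):
        raise ValueError(f"{feature} has no open expectation")
    mother = dict(head)
    mother[feature] = "satisfied"
    return mother


# Hypothetical transitive verb: obligatory subject and obligatory object.
verb = {
    "SUBJ": "obligatory",
    "COMP0_LEFT": "null",
    "COMP1_RIGHT": "obligatory",
    "COMP2": "null",
}

blocked = subject_rule_applicable(verb)        # False: the object is still pending
vp = apply_complement_rule(verb, "COMP1_RIGHT")
allowed = subject_rule_applicable(vp)          # True: [S [V O]] can now be built
```

Because the subject rule is blocked until the object expectation is satisfied, the only derivation available is [S [V O]], never [[S V] O].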

The configurational constraint is also used in CPSG95 for the formal definition of phrase, as formulated below.

phrase macro

a_sign
PREFIXING saturated | optional
SUFFIXING saturated | optional
COMP0_LEFT saturated | optional
COMP1_RIGHT saturated | optional
COMP2 saturated | optional

Despite the notational difference, this definition follows the spirit reflected in the phrase definition given in Pollard and Sag (1987:69) in terms of the saturation status of the subcategorized complements.  In essence, the above definition says that a phrase is a sign whose morphological expectation and syntactic complement expectation (except for subject) are both saturated.  The reason to include [optional] in the definition is to cover phrases whose head daughter has optional expectation, for example, a verb phrase consisting of just a verb with its optional object omitted in the text.

Together with the design of the structural feature [STRUCT] (section 3.3), the sort hierarchy of the type [expected] will also enable the formal definition for the representation of the fundamental notion word (see Section 4.3 in Chapter IV).  Definitions such as @word and @phrase are the basis for lexical configurational constraints to be imposed on the expected signs when required.  For example, -xing (-ness) will expect an adjective stem with the word constraint and -zhe (-er) can impose the phrase constraint on the expected verb sign based on the analysis proposed in section 6.5.

3.3. Structural Feature Structure

The design of the feature [STRUCT] serves important structural purposes in the formalization of the CPSG95 interface between morphology and syntax.  It is necessary to present the rationale of this design and the sort hierarchy of the type [struct] used in this feature.

The design of [STRUCT struct] originates from the binary structural feature structure [LEX + | -] in the original HPSG theory (Pollard and Sag 1987).  However, in the CPSG95 definition, the type [struct] forms an elaborate sort hierarchy.   It is divided into two types at the top level:  [syn_dtr] and [no_syn_dtr].  A sub-type of [no_syn_dtr] is [no_dtr].  The CPSG95 lexicon encodes the feature [STRUCT no_dtr] for all single morphemes.[12]  Another sub-type of [no_syn_dtr] is [affix] (for units formed via affixation) which is further sub-typed into [prefix] and [suffix], assigned by the Prefix PS rule and Suffix PS Rule.  In syntax, [syn_dtr] includes sub-types like [subj], [comp] and [mod].  Despite the hierarchical depth of the type, it is organized to follow the natural classification of the structural relation involved.  The formal definition is given below.

(3-12.)         Definition: sorted hierarchy for [struct]

struct: {syn_dtr, no_syn_dtr}
syn_dtr: {subj, comp, mod}
comp: {comp0_left, comp1_right, comp2_right}
mod: {mod_left, mod_right}
no_syn_dtr: {no_dtr, affix}
affix: {prefix, suffix}

In CPSG95, [STRUCT] is not a (head) feature which percolates up to the mother sign;  its value is solely decided by the structure being built.[13]   Each PS rule, whether syntactic or morphological, assigns the value of the [STRUCT] feature for the mother sign, according to the nature of combination.  When morpheme daughters are combined into a mother sign word, the value of the feature [STRUCT] for the mother sign remains a sub-type of [no_syn_dtr].  But when some syntactic rules are applied, the rules will assign the value to the mother sign as a sub-type of [syn_dtr] to show that the structure being built is a syntactic construction.

The design of the feature structure [STRUCT struct] is motivated by the new requirement caused by introducing morphology into the general grammar of  CPSG95.  In HPSG, a simple, binary type for [LEX] is sufficient to distinguish lexical signs, i.e. [LEX +], from signs created via syntactic rules, i.e. [LEX -].  But in CPSG95, as presented in section 3.2.1 before, productive derivation is also accommodated in the general grammar.  A simple distinction between a lexical sign and a syntactic sign cannot capture the difference between signs created via morphological rules and signs created via syntactic rules.  This difference plays an essential role in formalizing the morpho-syntactic interface, as shown below.

The following examples demonstrate the structural representation through the design of the feature [STRUCT].  In the CPSG95 lexicon, single Chinese characters like the prefix ke- (-able) and the free morphemes du (read) and bao (newspaper) are all coded as [STRUCT no_dtr].   When the Prefix PS Rule combines the prefix ke- and the verb du into the adjective ke-du, the rule assigns [STRUCT prefix] to the newly built derivative.  The structure may remain in the domain of morphology, as the value [prefix] is a sub-type of [no_syn_dtr].  However, when this structure is further combined with a subject, say, bao (newspaper), by the syntactic Subj PS Rule, the resulting structure [bao [ke-du]] (‘Newspapers are readable’) is syntactic, having [STRUCT subj] assigned by the Subj PS Rule;  in fact, this is a simple sentence.  Similarly, the syntactic Comp1_right PS Rule can combine the transitive verb du (read) and the object bao (newspaper) and assign [STRUCT comp1_right] to the resulting unit du bao (read newspapers).  In general, when signs whose [STRUCT] value is a sub-type of [no_syn_dtr] combine into a unit whose [STRUCT] is assigned a sub-type of [syn_dtr], this marks the jump from the domain of morphology to syntax.  This is the way the interface of Chinese morphology and syntax is formalized in the present formalism.
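The ke-du and du bao examples can be traced in a small Python sketch. It is an illustration only: the `Sign` class, the `combine` helper and the `crosses_into_syntax` check are assumed names, and a real PS rule would of course also unify category, expectation and content features.

```python
from dataclasses import dataclass
from typing import Optional

# Sort hierarchy from (3-12): each sort maps to its immediate supersort.
PARENT = {
    "syn_dtr": "struct", "no_syn_dtr": "struct",
    "subj": "syn_dtr", "comp": "syn_dtr", "mod": "syn_dtr",
    "comp0_left": "comp", "comp1_right": "comp", "comp2_right": "comp",
    "mod_left": "mod", "mod_right": "mod",
    "no_dtr": "no_syn_dtr", "affix": "no_syn_dtr",
    "prefix": "affix", "suffix": "affix",
}


def is_a(sort: str, ancestor: str) -> bool:
    """Return True if `sort` equals `ancestor` or is one of its sub-types."""
    node: Optional[str] = sort
    while node is not None:
        if node == ancestor:
            return True
        node = PARENT.get(node)
    return False


@dataclass(frozen=True)
class Sign:
    form: str
    struct: str  # value of [STRUCT]


def combine(rule_struct: str, left: Sign, right: Sign) -> Sign:
    """Each PS rule assigns its own [STRUCT] value to the mother sign."""
    return Sign(f"[{left.form} {right.form}]", rule_struct)


def crosses_into_syntax(daughters: list, mother: Sign) -> bool:
    """Morphological daughters combining into a syntactic mother."""
    return (all(is_a(d.struct, "no_syn_dtr") for d in daughters)
            and is_a(mother.struct, "syn_dtr"))


# Single morphemes carry [STRUCT no_dtr] from the lexicon.
ke, du, bao = Sign("ke-", "no_dtr"), Sign("du", "no_dtr"), Sign("bao", "no_dtr")

ke_du = combine("prefix", ke, du)            # Prefix PS Rule: still morphology
sentence = combine("subj", bao, ke_du)       # Subj PS Rule: syntax
du_bao = combine("comp1_right", du, bao)     # Comp1_right PS Rule: syntax
```

Here `crosses_into_syntax([ke, du], ke_du)` is false, while the same check for `sentence` and `du_bao` is true, which is the morphology-to-syntax jump described above.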

The use of this feature structure in the definition of Chinese word will be presented in Chapter IV.  Further advantages and flexibility of the design of this structural feature structure and the expectation feature structures will be demonstrated in later chapters in presenting solutions to some long-standing problems at the morpho-syntactic interface.

3.4. Summary

The major design issues for the proposed mono-stratal Chinese grammar CPSG95 have been addressed.  This provides a framework and means for formalizing the analysis of the linguistic problems at the morpho-syntactic interface.  It has been shown that the design of the CPSG95 expectation structures enables configurational constraints to be imposed on the structure hierarchy defined by the grammar.  This makes the keyword approach to Chinese subcategorization a feasible alternative to the list design based on the obliqueness hierarchy of subject and complements.

Within this defined framework of CPSG95, the subsequent Chapter IV will be able to formulate the system-internal, but strictly formalized definition of Chinese word.  Formal definitions such as @word and @phrase enable proper configurational constraints to be imposed on the expected signs when required.  This lays a foundation for implementing the proposed solutions to the morpho-syntactic interface problems to be explored in the remaining chapters.

 

---------------------------------------------------------------------------------

[1] More precisely, it is not ‘word’ order, it is constituent order, or linear precedence (LP) constraint between constituents.

[2]  L. Li (1986, 1990)’s definition on structural constituents does not involve word order.  However, his proposed definition is not an operational one from the angle of natural language processing.  He relies on the decoding of the semantic roles for the definitions of the proposed constituents like NP-agent (ming-shi), NP-patient (ming-shou), etc.  Nevertheless, his proposal has been reported to produce good results in the field of Chinese language teaching.  This seems to be understandable because the process of decoding semantic roles is naturally and subconsciously conducted in the mind of the language instructors/learners.

[3] Most linguists agree that Chinese has no inflectional morphology (e.g. Hockett 1958; Li and Thompson 1981; Zwicky 1987; Sun and Cole 1991).  The few linguists who believe that Chinese has developed or is developing inflection morphology include  Bauer (1988) and Dai (1993).  Typical examples cited as Chinese inflection morphemes are aspect markers le, zhe, guo and the plural marker men.

[4] A note for the notation: uppercase is used for feature and lowercase, for type.

[5] Phonology and discourse are not yet included in the definition.  The latter is a complicated area which requires further research before it can be properly integrated in the grammar analysis.  The former is not necessary because the object for CPSG95 is Written Chinese.   In the few cases where phonology affects structural analysis, e.g. some structural expectation needs to check the match of number of syllables, one can place such a constraint indirectly by checking the number of Chinese characters instead (as we know, a syllable roughly corresponds to a Chinese character or hanzi).

[6] The macro constraint @np in (3-2) is defined to be [CATEGORY n] and a call to another macro constraint @phrase to be defined shortly in Section 3.2.4.

[7] These expectation features defined for [a_sign] are a maximum set of possible expected daughters;  any specific sign may only activate a subset of them, represented by non-null value.

[8] This is similar to viewing morphology as ‘the syntax of words’ (Selkirk 1982; Lieber 1992; Krieger 1994).  It seems that at least affixation shares with syntax similar structural constraints on constituency and linear ordering in Chinese.  The same type of mechanisms (PS rules, typed feature structure for expectation, etc) can be used to capture both Chinese affixation and syntax (see Chapter VI).

[9] More precisely, the decoding of possible ways of semantic composition is guaranteed.  Syntactically ambiguous structures with the same constraints correspond to multiple ways of semantic compositionality.  These are expressed as different entries in the lexicon and the link between these entries is via corresponding lexical rules, following the HPSG practice. (W. Li 1996)

[10]  Borsley (1987) has proposed an HPSG framework where subject is posited as a feature distinct from other complements.  Pollard and Sag (1994:345) point out that “the overwhelming weight of evidence favors Borsley’s view of this matter”.

[11] The only possible benefit of such arrangement is that one can continue using the SUBCAT Principle for building complement structure via list cancellation.

[12] It also includes idioms whose internal morphological structure is unknown or has no grammatical relevance.

[13] The reader might have noticed that the assigned value is the same as the name of the PS rule which applies.  This is because there is correspondence between what type of structure is being built and what PS rule is building it.  Thus, the [STRUCT] feature actually records the rule application information.  For example, [STRUCT subj] reflects the fact that the Subj PS Rule is the most recently applied rule to the structure in point;  a structure built via the Prefix PS Rule has [STRUCT prefix] in place; etc.  This practice gives an extra benefit of the functionality of ‘tracing’ which rules have been applied in the process of debugging the grammar.  If there has never been a rule applied to a sign, it must be a morpheme carrying [STRUCT no_dtr] from the lexicon.

 

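【A Personal Journey: When Theory Meets Practice and a PhD Heads into Industry】

Over the past couple of days I opened up my PhD thesis on Chinese phrase structure grammar, written 20 years ago, and reread it. It stirred up some feelings.

My PhD was a hard slog; the twists and setbacks along the way are not worth recounting. In short, my experiments had covered many linguistic phenomena and I was reluctant to give any of them up, but a doctoral thesis demands a single main thread and depth on one point. I drafted I don't know how many outlines, all of which my supervisor ignored or shot down. Only after repeatedly discarding material, narrowing the focus, and hammering it over and over did I forge this so-called doctoral thesis with every edge ground smooth. My impression is that most doctoral theses are over-polished in this way and make dull reading, mine all the more so. But how many sleepless nights of struggle, hardship, blood and tears went into it, only heaven and earth know.

In fact, the title PhD, "Doctor of Philosophy", is a misnomer left over from history. The Chinese word for the degree, 博士, literally suggests broad learning, yet today's PhDs are almost all specialists, not broad at all, and rarely generalists. The years of hard research amount to digging three feet deep into one spot: depth over breadth. Outside one's own small plot one knows very little about other fields, to say nothing of philosophy. The North American doctoral system consumes the most creative period of a person's life, as long as 5 to 8 years, which feels excessive. I have seen many cases of PhDs whose drive was worn smooth, who achieved nothing, and who were at a loss when facing the real market. No wonder the saying goes, "as silly as a PhD". I leave the gains and losses here for education scholars to study and comment on.

Anyway, while holding down a job, I finally completed the final draft, and my supervisor approved it. Tian Tian had just turned four at the time.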

I should thank my four-year-old daughter, Tian Tian. I feel sorry for not being able to spend more time with her. What has supported me all these years is the idea that some day she will understand that as a first-generation immigrant, her dad has managed to overcome various challenges in order to create a better environment for her to grow.
PhD Thesis Dedication
To my daughter Tian Tian
whose babbling accompanied and inspired the writing of this work

I still remember that I was in tears when writing this to put the final touch on the thesis.

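I am now building a Chinese deep parser, which has already grown to considerable scale. This is a good moment to look back and see how the thinking of 20 years ago differs from the practice of 20 years later. Ever since I left school and started industrial development, I abandoned the automatic-analysis approach of my PhD without hesitation, even though I had argued for it so persuasively as a student. In truth it is a sublation: some things discarded, some inherited. What was discarded is the single-layer CFG; what was inherited is the seamless connection between morphology and syntax. This shift reflects the distance between theory and practice and the relationship between academia and industry.

When I was doing my PhD, unification systems were at the height of their popularity. So, following my supervisor, I used HPSG on a Prolog platform to run a bidirectional MT experiment with a Chinese grammar (the same Chinese grammar was used for both analysis and generation, supporting machine translation in both directions between Chinese and English), and built a toy system. When it came time to write the thesis, I had to keep narrowing down the range of phenomena I had worked on, and finally concentrated on the interface between Chinese morphology (including word segmentation) and syntax. The whole thesis argues for one idea: segmentation, morphology and syntax must be integrated. It used single-layer CFG parsing, and the argument sounded very convincing.

Integration certainly holds in theory, because interdependencies among linguistic phenomena can only be handled well within an integrated framework. Even if 90% of the phenomena are not interdependent and can be separated out, you can always use the remaining 10% to prove that integration is correct (in theory it does no harm to the other 90%).

And 20 years later? Forget it. I long ago abandoned the single-layer integrated approach. It is a dead end: fine for a toy, but hard to scale up, unable to go deep, and unable to yield a real-world system. What I inherited is the communication channel that integration provides, together with a patching mechanism along the lines of hibernation and wake-up. But I would rather patch and mend than pursue a perfect grammatical system.

Readers who are curious about HPSG, or interested in how HPSG applies to Chinese, can take a look at the PhD thesis I have put together here. It is an outdated formalism, though I recall that half a year ago Professor Feng Zhiwei was still compiling and presenting a series of HPSG lectures. Some readers asked: how is it applied to Chinese? In fact, for a so-called theoretical formalism like this, which involves a whole series of theoretical assumptions and technical details, unless you work through it once yourself you are basically looking at flowers through fog. Unification and typed data structures look beautiful logically and are fun to work with, but once I had done it I washed my hands of it. People who have played with Prolog may have had a similar experience.

I decided to take the examples listed in my PhD thesis as hard cases for syntactic analysis and parse them all as unit tests, to see whether a system built on a different design philosophy can still capture these linguistic phenomena.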

[Figures: parse screenshots 0824e, 0824d, 0824f, 0824h, 0824g, 0824i, 0824a, 0824b, 0824c, 0825d]

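“头羊” (similar cases include “个人” and “难过”) carries what is called hidden ambiguity in segmentation. Because it directly violates the longest-match principle, it is a pain point of Chinese word segmentation and also strong evidence for integration. In theory, any segmentation ambiguity (not just hidden ambiguity) can only be finally resolved by taking the whole sentence into account; local context always has loopholes, and you can always construct a context that defeats a local decision. In practice, however, local and global processing can be largely separated, and there is no need to carry segmentation ambiguity all the way to the end. Hidden ambiguity that does not affect the overall picture can be put to sleep, as in the example above. When necessary, a word-driven module applied after syntax can wake it up again.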

 

PhD Thesis: Chapter II Role of Grammar

 

2.0. Introduction

This chapter examines the role of grammar in handling the three major types of morpho-syntactic interface problems.  This investigation  justifies the mono-stratal design of CPSG95 which contains feature structures of both morphology and syntax.

The major observation from this study is:  (i) grammatical analysis, including both morphology and syntax, plays the fundamental role in contributing to the solutions of the morpho-syntactic problems;  (ii)  when grammar alone is not sufficient to reach the final solution, knowledge beyond morphology and syntax may come into play and serve as “filters” based on the grammatical analysis results.[1]  Based on this observation, a study in the direction of interleaving morphology and syntax will be pursued in the grammatical analysis.  Knowledge beyond morphology and syntax is left to future research.

Section 2.1 investigates the relationship between grammatical analysis and  the resolution of segmentation ambiguity.  Section 2.2 studies the role of syntax in handling Chinese productive word formation.  The borderline cases and their relationship with grammar are explored in 2.3.  Section 2.4 examines the relevance of knowledge beyond syntax to segmentation disambiguation.  Finally, a summary of the presented arguments and discoveries is given in 2.5.

2.1. Segmentation Ambiguity and Syntax

Segmentation ambiguity is one major problem which challenges the traditional word segmenter or an independent morphology.  The following study shows that this ambiguity is structural in nature, not fundamentally different from other structural ambiguity in grammar.  It will be demonstrated that sentential structural analysis is the key to this problem.

A huge amount of research effort in the last decade has been made on resolving segmentation ambiguity (e.g. Chen and Liu 1992; Gan 1995; He, Xu and Sun 1991; Liang 1987; Lua 1994; Sproat, Shih, Gale and Chang 1996; Sun and T’sou 1995; Sun and Huang 1996; X. Wang 1989; Wu and Su 1993; Yao, Zhang and Wu 1990; Yeh and Lee 1991; Zhang, Chen and Chen 1991; Guo 1997b).  Many (e.g. Sun and Huang 1996; Guo 1997b) agree that this is still an unsolved problem.  The major difficulty with most approaches reported in the literature lies in the lack of support from sufficient grammar knowledge.  To ultimately solve this problem, grammatical analysis is vital, a point to be elaborated in the subsequent sections.

2.1.1. Resolution of Hidden Ambiguity

The topic of this section is the treatment of hidden ambiguity.   The conclusion out of the investigation below is that the structural analysis of the entire input string provides a sound basis for handling this problem.

The following sample sentences illustrate a typical case involving the hidden ambiguity string 烤白薯 kao bai shu.

(2-1.) (a)      他吃烤白薯
ta         | chi  | kao-bai-shu
he      | eat  | baked-sweet-potato
[S [NP ta] [VP [V chi] [NP kao-bai-shu]]]
He eats the baked sweet potato.

(b) * ta       | chi  | kao          | bai-shu
he      | eat  | bake         | sweet-potato

(2-2.) (a) *    他会烤白薯
ta         | hui  | kao-bai-shu.
he      | can | baked-sweet-potato

(b)     ta       | hui  | kao          | bai-shu.
he      | can | bake         | sweet-potato
[S [NP ta] [VP [V hui] [VP [V kao] [NP bai-shu]]]]
He can bake sweet potatoes.

Sentences (2-1) and (2-2) are a minimal pair;  the only difference is the choice of the predicate verb, namely chi (eat) versus hui (can, be capable of).  But they have very different structures and assume different word identification.  This is because verbs like chi expect an NP object but verbs like hui require a VP complement.  The two segmentations of the string kao bai shu provide two possibilities, one as an NP kao-bai-shu and the other as a VP kao | bai-shu.  When the provided unit matches the expectation, it leads to a successful syntactic analysis, as illustrated by the parse trees in (2‑1a) and (2-2b).  When the expectation constraint is not satisfied, as in (2-1b) and (2-2a), the analysis fails.  These examples show that all candidate words in the input string should be considered for grammatical analysis.  The disambiguation choice can be made via the analysis, as seen in the examples above with the sample parse trees.  Correct segmentation results in at least one successful parse.
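The selection logic behind (2-1) and (2-2) can be sketched in Python. This is a deliberately tiny illustration, not a parser: `EXPECTS` and `CANDIDATES` are hand-written toy data standing in for lexical expectation features and for the two analyses of kao bai shu, and `analyze` is a hypothetical helper name.

```python
# Toy expectation table: the category of complement each head verb expects,
# as stated in the text (chi expects an NP object, hui a VP complement).
EXPECTS = {"chi": "NP", "hui": "VP"}

# The two candidate analyses of the hidden ambiguity string kao bai shu.
CANDIDATES = [
    (("kao-bai-shu",), "NP"),    # one word: baked sweet potato
    (("kao", "bai-shu"), "VP"),  # verb + object: bake sweet potatoes
]


def analyze(subject: str, verb: str) -> list:
    """Keep every segmentation whose category matches the verb's expectation.

    All candidates are considered; the choice falls out of the analysis
    rather than being made by the segmenter in advance.
    """
    wanted = EXPECTS[verb]
    return [(subject, verb) + words
            for words, category in CANDIDATES
            if category == wanted]


eats = analyze("ta", "chi")   # only the one-word NP reading survives
can = analyze("ta", "hui")    # only the verb + object VP reading survives
```

Swapping the verb is enough to flip the surviving segmentation, which is the minimal-pair effect described above.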

He, Xu and Sun (1991) indicate that a hidden ambiguity string requires a larger context for disambiguation.  But they did not define what the 'larger context' should be.  The following discussion attempts to answer this question.

The input string to the parser constitutes a basic context as well as the object for sentential analysis.[2]  It will be argued that this input string is the proper context for handling the hidden ambiguity problem.  The point to be made is, context smaller than the input string is not reliable for the hidden ambiguity resolution.  This point is illustrated by the following examples of the hidden ambiguity string ge ren in (2-3).[3]  In each successive case, the context is expanded to form a new input string.   As a result, the analysis and the associated interpretation of ‘person’ versus ‘individual’ change accordingly.

(2-3.)  input string                            reading(s)

(a)      人  ren                                       person (or man, human)
[N ren]

(b)      个人  ge ren                               individual
[N ge-ren]

(c)      三个人  san ge ren                               three persons
[NP [CLAP [NUM san] [CLA ge]] [N ren]]

(d)      人的力量  ren de li liang                      the human power
[NP [DEP [NP ren] [DE de]] [N li-liang]]

(e)      个人的力量  ge ren de li liang                        the power of an individual
[NP [DEP [NP ge-ren] [DE de]] [N li-liang]]

(f)       三个人的力量  san ge ren de li liang              the power of three persons
[NP [DEP [NP [CLAP [NUM san] [CLA ge]] [N ren]] [DE de]] [N li-liang]]

(g)      他不是个人  ta bu shi ge ren.
(1)  He is not a man. (He is a pig.)
[S [NP ta] [VP [ADV bu] [VP [V shi] [NP [CLAP ge] [N ren]]]]]
(2)  He is not an individual. (He represents a social organization.)
[S [NP ta] [VP [ADV bu] [VP [V shi] [NP ge-ren]]]]

Comparing (a), (b) with (c), and (d), (e) with (f), one can see the associated change of readings when each successively expanded input string leads to a different grammatical analysis.  Accordingly, one segmentation is chosen over the other on the condition that the grammatical analysis of the full string can be established based on the segmentation.  In (b), the ambiguous string is all that is input to the parser, therefore the local context becomes full context.  It then acquires the lexical reading individual as the other possible segmentation ge | ren does not form a legitimate combination.  This reading may be retained, as in (e), or changed to the other reading person, as in (c) and (f), or reduced to one of the possible interpretations, as in (g), when the input string is further lengthened.  All these changes depend on the sentential analysis of the entire input string, as shown by the associated structural trees above.  It demonstrates that the full context is required for the adequate treatment of the hidden ambiguity phenomena.  Full context here refers to the entire input string to the parser.

It is necessary to explain some of the analyses as shown in the sample parses  above.  In Contemporary Mandarin, a numeral cannot  combine with a noun without a classifier in between.[4]  Therefore, the segmentation san (three) | ge-ren (individual) is excluded in (c) and (f), and the correct segmentation san (three) | ge (CLA) | ren (person) leads to the NP analysis.  In general, a classifier alone cannot combine with the following noun either, hence the interpretation of ge ren as one word ge-ren (individual) in (b) and (e).  A classifier usually combines with a preceding numeral or determiner before it can combine with the noun.  But things are more complicated.  In fact, the Chinese numeral yi (one) can be omitted when the NP is in object position.  In other words, the classifier alone can combine with a noun in a very restricted syntactic environment.  That explains the two readings in (g).[5]
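The way the full input string decides between ge-ren and ge | ren in (2-3a) through (2-3f) can be reproduced with a small chart-style recognizer. The sketch below rests on stated assumptions: the toy `LEXICON` and the `BINARY` rules are read off the sample parse trees above rather than taken from CPSG95, `np_segmentations` is a hypothetical name, and the restricted object-position rule that licenses a bare classifier in (g) is not modeled.

```python
from functools import lru_cache

# Toy lexicon keyed by syllable sequences; both ge-ren and ge | ren are listed.
LEXICON = {
    ("san",): "NUM",
    ("ge",): "CLA",
    ("ren",): "N",
    ("ge", "ren"): "N",     # ge-ren 'individual'
    ("de",): "DE",
    ("li", "liang"): "N",   # li-liang 'power'
}

# Binary rules read off the sample trees: (left, right) -> mother category.
BINARY = {
    ("NUM", "CLA"): "CLAP",
    ("CLAP", "N"): "NP",
    ("NP", "DE"): "DEP",
    ("DEP", "N"): "NP",
}


def np_segmentations(syllables: list) -> list:
    """Return the word sequences under which the whole input parses as an NP."""
    syl = tuple(syllables)

    @lru_cache(maxsize=None)
    def span(i: int, j: int) -> frozenset:
        found = set()
        word = syl[i:j]
        if word in LEXICON:
            category = LEXICON[word]
            leaf = ("-".join(word),)
            found.add((category, leaf))
            if category == "N":
                found.add(("NP", leaf))  # a bare noun can serve as an NP
        for k in range(i + 1, j):
            for left_cat, left_words in span(i, k):
                for right_cat, right_words in span(k, j):
                    mother = BINARY.get((left_cat, right_cat))
                    if mother is not None:
                        found.add((mother, left_words + right_words))
        return frozenset(found)

    return sorted(words for cat, words in span(0, len(syl)) if cat == "NP")
```

With this grammar, `np_segmentations(["ge", "ren"])` keeps the one-word reading, while `np_segmentations(["san", "ge", "ren"])` only succeeds with the split ge | ren, since no rule lets a numeral combine directly with a noun.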

The following is a summary of the arguments presented above.   These arguments have been shown to account for the hidden ambiguity phenomena.  The next section will further demonstrate the validity of these arguments for overlapping ambiguity as well.

(2-4.) Conclusion
The grammatical analysis of the entire input string is required for the adequate treatment of the hidden ambiguity problem in word identification.

2.1.2. Resolution of Overlapping Ambiguity

This section investigates overlapping ambiguity and its resolution.  A previous influential theory is examined, which claims that the overlapping ambiguity string can be locally disambiguated.   However, this theory is found to be unable to account for a significant amount of data.  The conclusion is that both overlapping ambiguity and hidden ambiguity require a context of the entire input string and a grammar for disambiguation.

For overlapping ambiguity, comparing different critical tokenizations will be able to detect it, but such a technique cannot guarantee a correct choice without introducing other knowledge.  Guo (1997) pointed out:

As all critical tokenizations hold the property of minimal elements on the word string cover relationship, the existence of critical ambiguity in tokenization implies that the “most powerful and commonly used” (Chen and Liu 1992, page 104) principle of maximum tokenization would not be effective in resolving critical ambiguity in tokenization and implies that other means such as statistical inferencing or grammatical reasoning have to be introduced.

However, He, Xu and Sun (1991) claim that overlapping ambiguity can be resolved within the local context of the ambiguous string.  They classify overlapping ambiguity strings into nine types.  The classification is based on the categories of the presumably correctly segmented words in the ambiguous strings, as described below.

Suppose there is an overlapping ambiguous string consisting of ABC;  both AB and BC are entries listed in the lexicon.  There are two possible cases.  In case one, the category of A and the category of BC define the classification of the ambiguous string.  This is the case when the segmentation A|BC is considered correct.  For example, in the ambiguous string 白天鹅 bai tian e, the word AB is bai-tian (day-time) and the word BC is tian-e (swan).  The correct segmentation for this string is assumed to be A|BC, i.e. bai (A: white) | tian-e (N: swan) (in fact, this cannot be taken for granted, as shall be shown shortly);  it therefore belongs to the A-N type.  In case two, i.e. when the segmentation AB|C is considered correct, the category of AB and the category of C define the classification of the ambiguous string.  For example, in the ambiguous string 需求和 xu qiu he, the word AB is xu-qiu (requirement) and the word BC is qiu-he (sue for peace).  The correct segmentation for this string is assumed to be AB|C, i.e. xu-qiu (N: requirement) | he (CONJ: and) (again, this should not be taken for granted);  it therefore belongs to the N-CONJ type.

After classifying the overlapping ambiguous strings into one of nine types, using the two different cases described above, they claim to have discovered a rule.[6]  That is, the category of the correctly segmented word BC in case one (or AB in case two) is predictable from AB (or BC in case two) within the local ambiguous string.  For example, the category of tian-e (swan) in bai | tian-e (white swan) is a noun.  This information is predictable from bai tian within the respective local string bai tian e.  The idea is, if ever an overlapping ambiguity string is formed of bai tian and C, the judgment of bai | tian-C as the correct segmentation entails that the word tian-C  must be a noun.  Otherwise, the segmentation A|BC is wrong and the other segmentation AB|C is right.  For illustration, it is noted that tian-shi (angel) in the ambiguous string 白天使 bai | tian-shi (white angel) is, as expected, a noun.  This predictability of the category information from within the local overlapping ambiguous string is seen as an important discovery (Feng 1996).  Based on this assumed feature of the overlapping ambiguous strings, He,  Xu and Sun (1991) developed their theory that an overlapping ambiguity string can be disambiguated within the local string itself.

The proposed disambiguation process within the overlapping ambiguous string proceeds as follows.  In order to correctly segment an overlapping ambiguous string, say, bai tian e or bai tian shi, the following information needs to be given under the entry bai-tian (day-time) in the tokenization lexicon:  (i) an ambiguity label, to indicate the necessity to call a disambiguation rule;  (ii) the ambiguity type A-N, to indicate that it should call the rule corresponding to this type.  Then the following disambiguation rule can be formulated.

(2-5.) A-N type rule       (He,  Xu and Sun 1991)
In the overlapping ambiguous string A(1)...A(i) B(1)...B(j) C(1)...C(k),
if        B(1)...B(j) and C(1)...C(k) form a noun,
then  the correct segmentation is A(1)...A(i) | B(1)...B(j)-C(1)...C(k),
else    the correct segmentation is A(1)...A(i)-B(1)...B(j) | C(1)...C(k).

This way, bai tian e and bai tian shi will always be segmented as bai (white) | tian-e (swan) and bai (white) | tian-shi (angel) instead of bai-tian (daytime) | e (goose) and bai-tian (daytime) | shi (make).  This can be easily accommodated in a segmentation algorithm provided the above information is added to the lexicon and the disambiguation rules are implemented.  The whole procedure runs within the local context of the overlapping ambiguous string and uses only lexical information.  They therefore call overlapping ambiguity resolution morphology-based disambiguation, with no need to consult syntax, semantics or discourse.
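To make the locality of rule (2-5) concrete, here is a minimal Python sketch of the A-N type rule as described above.  The lexicon entries and the function name an_type_rule are illustrative assumptions, not part of He, Xu and Sun's system.

```python
# Toy lexicon: word -> category (hypothetical entries for illustration only).
LEXICON = {
    "白": "A", "白天": "N", "天鹅": "N", "天使": "N", "鹅": "N", "使": "V",
}

def an_type_rule(a, b, c, lexicon=LEXICON):
    """Apply the A-N type rule (2-5) to the overlapping string a+b+c.

    If b+c is a noun, choose a | b-c; otherwise choose a-b | c.
    Only the local string and the lexicon are consulted.
    """
    if lexicon.get(b + c) == "N":
        return [a, b + c]
    return [a + b, c]

print(an_type_rule("白", "天", "鹅"))  # bai | tian-e
print(an_type_rule("白", "天", "使"))  # bai | tian-shi
```

Note that the function has no parameter for the surrounding sentence, so its answer for bai tian e is the same in every context.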

Feng (1996) emphasizes that He, Xu and Sun's view on the overlapping ambiguous string constitutes a valuable contribution to the theory of Chinese word identification.  Indeed, this overlapping ambiguous string theory, if it were right, would be a breakthrough in this field.  It in effect suggests that the majority of the segmentation ambiguity is resolvable without and before a grammar module.  A handful of simple rules, like the A-N type rule formulated above, plus a lexicon would solve most ambiguity problems in word identification.[7]

Feng (1996) provides examples for all the nine types of overlapping ambiguous strings as evidence to support He, Xu and Sun (1991)'s theory.   In the case of the A-N type ambiguous string bai tian e, the correct segmentation is supposed to be bai | tian-e in this theory.  However, even with his own cited example, Feng ignores a perfect second reading (parse) when the time NP bai-tian (daytime) directly acts as a modifier for the sentence with no need for a preposition, as shown in (2‑6b) below.

(2-6.)           白天鹅游过来了
bai tian e you guo lai le       (Feng 1996)

(a)      bai     | tian-e       | you          | guo-lai      | le.
          white | swan        | swim        | over-here  | LE
[S [NP bai tian-e] [VP you guo-lai le]]
The white swan swam over here.

(b)      bai-tian       | e              | you          | guo-lai      | le.
          day-time      | goose        | swim        | over-here  | LE
[S [NP+mod bai-tian] [S [NP e] [VP you guo-lai le]]]
In the day time the geese swam over here.

In addition, one only needs to add a preposition zai (in) to the beginning of the sentence to make the abandoned segmentation bai-tian | e the only right one in the changed context.  The presumably correct segmentation, namely bai | tian-e, now turns out to be wrong, as shown in (2-7a) below.

(2-7.)           在白天鹅游过来了
zai bai tian e you guo lai le

(a) *   zai     | bai           | tian-e       | you          | guo-lai      | le.
          in      | white        | swan        | swim        | over-here  | LE

(b)      zai     | bai-tian    | e              | you          | guo-lai      | le.
          in      | day-time   | goose        | swim        | over-here  | LE
[S [PP+mod zai bai-tian] [S [NP e] [VP you guo-lai le]]]
In the day time the geese swam over here.

The above counter-example is by no means accidental.  In fact, for each cited ambiguous string in the examples given by Feng, there exist counter-examples.  It is not difficult to construct a different context where the preferred segmentation within the local string, i.e. the segmentation chosen according to one of the rules, is proven to be wrong.[8]  In the pairs of sample sentences (2‑8) through (2-10), (a) is an example which Feng (1996) cited to support the view that the local ambiguous string itself is enough for disambiguation.  Sentences in (b) are counter-examples to this theory.  It is a notable fact that the listed local string is often properly contained in a more complicated ambiguous string in an expanded context, seen in (2-9b) and (2-10b).  Therefore, even when the abandoned segmentation can never be linguistically correct in any context, as shown for tu-xing (graph) | shi (BM) in (2-9) where a bound morpheme still exists after the segmentation, it does not entail the correctness of the other segmentation in all contexts.  These data show that all possible segmentations should be retained for the grammatical analysis to judge.

(2-8.)  V-N type of overlapping ambiguous string

研究生命
          yan jiu sheng ming:
          yan-jiu (V:study) | sheng-ming (N:life)
yan-jiu-sheng (N:graduate student) | ming (life/destiny)

(a)      研究生命的本质
          yan-jiu    sheng-ming de      ben-zhi
          study          life               DE     essence
Study the essence of life.

(b)      研究生命金贵
           yan-jiu-sheng      ming  jin-gui
          graduate-student  life     precious
Life for graduate students is precious.

(2-9.)  CONJ-N type of overlapping ambiguous string
和平等 he ping deng:
          he (CONJ:and) | ping-deng (N:equality)
he-ping (N:peace) | deng (V:wait)?

(a)      独立自主和平等互利的原则
           du-li-zi-zhu           he      ping-deng-hu-li               de      yuan-ze
          independence       and    equal-reciprocal-benefit  DE     principle
the principle of independence and equal reciprocal benefit

(b)      和平等于胜利
          he-ping       deng-yu       sheng-li
          peace           equal           victory
Peace is equal to victory.

(2-10.)  V-P type of overlapping ambiguous string
看中和 kan zhong he:
          kan-zhong (V:target) | he (P:with)
kan (V:see) | zhong-he (V:neutralize)

(a)      他们看中和日本人做生意的机会
ta-men    kan-zhong   he      ri-ben          ren              zuo     sheng-yi      de      ji-hui    
they         target           with    Japan          person          do      business     DE   opportunity
They have targeted the opportunity to do business with the Japanese.

(b)      这要看中和作用的效果
zhe          yao    kan    zhong-he-zuo-yong                   de          xiao-guo
this    need  see     neutralization                DE     effect
This will depend on the effects of the neutralization.

The data in (b) above directly contradict the claim that an overlapping ambiguous string can be disambiguated within the local string itself.  While this approach is shown to be inappropriate in practice, the following comment attempts to reveal its theoretical motivation.

As reviewed in the previous text, He, Xu and Sun (1991)'s overlapping ambiguity theory is established on the classification of the overlapping ambiguous strings.  A careful examination of their proposed nine types of the overlapping ambiguous strings reveals an underlying assumption on which the classification is based.  That is, the correctly segmented words within the overlapping ambiguous string will automatically remain correct in a sentence containing the local string.   This is in general untrue, as shown by the counter-examples above.[9]   The following analysis reveals why.

Within the local context of the overlapping ambiguous string, the chosen segmentation often leads to a syntactically legitimate structure while the abandoned segmentation does not.  For example,  bai (white) | tian-e (swan) combines into a valid syntactic unit while there is no structure which can span bai-tian (daytime) | e (goose).  For another example,  yan-jiu (study) | sheng-ming (life) can be combined into a legitimate verb phrase [VP [V yan-jiu] [NP sheng-ming]], but  yan-jiu-sheng (graduate student) | ming (life/destiny) cannot.  But that legitimacy only stands locally within the boundary of the ambiguous string.  It does not necessarily hold true in a larger context containing the string.  As shown previously in (2-7a),  the locally legitimate structure bai | tian-e (white swan) does not lead to a successful parse for the sentence.  In contrast, the locally abandoned segmentation bai-tian (daytime) | e (goose) has turned out to be right with the parse in (2-7b).   Therefore, the full context instead of the local context of the ambiguous string is required for the final judgment on which segmentation can be safely abandoned.  Context smaller than the entire input string is not reliable for the overlapping ambiguity resolution.  Note that exactly the same conclusion has been reached for the hidden ambiguous strings in the previous section.

The following data in (2-11) further illustrate the point of the full context requirement for the overlapping ambiguity resolution, similar to what has been presented for the hidden ambiguity phenomena in (2-3).  In each successive case, the context is expanded to form a new input string.  As a result, the interpretation of ‘goose’ versus ‘swan’ changes accordingly.

(2-11.)  input string                reading(s)

(a)      鹅 e                                goose
[N e]

(b)      天鹅 tian e                                swan
[N tian-e]

(c)      白天鹅 bai tian e                       white swan
[N [A bai] [N tian-e]]

(d)      鹅游过来了 e you guo lai le.
The geese swam over here.
[S [NP e] [VP you guo-lai le]]

(e)      天鹅游过来了 tian e you guo lai le.
The swans swam over here.
[S [NP tian-e] [VP you guo-lai le]]

(f)      白天鹅游过来了 bai tian e you guo lai le.
          (i)       The white swan swam over here.
[S [NP bai tian-e] [VP you guo-lai le]]
          (ii)      In the daytime, the geese swam over here.
[S [NP+mod bai-tian] [S [NP e] [VP you guo-lai le]]]

(g)       在白天鹅游过来了 zai bai tian e you guo lai le.
            In the daytime, the geese swam over here.
[S [PP zai bai-tian] [S [NP e] [VP you guo-lai le]]]

(h)      三只白天鹅游过来了 san zhi bai tian e you guo lai le.
           Three white swans swam over here.
[S [NP san zhi bai tian-e] [VP you guo-lai le]]

It is interesting to compare (c) with (f), (g) and (h) to see their associated change of readings based on different ways of  segmentation.  In (c), the overlapping ambiguous string is all that is input to the parser, therefore the local context becomes full context.  It then acquires the reading white swan corresponding to the segmentation bai | tian-e.  This reading may be retained, or changed, or reduced to one of the possible interpretations when the input string is lengthened.  That is respectively the case in (h), (g) and (f).  All these changes depend on the grammatical analysis of the entire input string.  It shows that the full context and a grammar are required for the resolution of most ambiguities;  and when sentential analysis cannot disambiguate - in cases of ‘genuine’ segmentation ambiguity like (f), the structural analysis can make explicit the ambiguity in the form of multiple parses (readings).
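The dependence on the entire input string can be sketched in Python as follows.  This is only an illustration under stated assumptions: the lexicon, the category labels (T for a bare time noun, DIR for a directional complement, VG for a verb group) and the binary rules are toy inventions covering just the strings in (2-11), and the CKY-style chart recognizer stands in for whatever grammar formalism is actually used.  Every lexicon-compatible segmentation is kept as a candidate, and only those that yield a full-sentence parse survive.

```python
# Toy lexicon: word -> set of categories (illustrative, not a real dictionary).
LEXICON = {
    "白": {"A"}, "白天": {"T"}, "天鹅": {"N", "NP"}, "鹅": {"N", "NP"},
    "游": {"V"}, "过来": {"DIR"}, "了": {"LE"},
    "在": {"P"}, "三": {"NUM"}, "只": {"CLA"},
}

# Toy binary rules: (left, right) -> set of parent categories.
RULES = {
    ("A", "N"): {"NP"},
    ("NUM", "CLA"): {"CLAP"},
    ("CLAP", "NP"): {"NP"},
    ("V", "DIR"): {"VG"},
    ("VG", "LE"): {"VP"},
    ("NP", "VP"): {"S"},
    ("T", "S"): {"S"},    # bare time noun as sentence modifier
    ("P", "T"): {"PP"},
    ("P", "NP"): {"PP"},
    ("PP", "S"): {"S"},   # PP as sentence modifier
}

def segmentations(s):
    """Return every way of cutting s into words listed in LEXICON."""
    if not s:
        return [[]]
    result = []
    for end in range(1, len(s) + 1):
        word = s[:end]
        if word in LEXICON:
            for rest in segmentations(s[end:]):
                result.append([word] + rest)
    return result

def parses_as_sentence(words):
    """CKY-style recognition: can the whole word sequence be an S?"""
    n = len(words)
    if n == 0:
        return False
    chart = {(i, i + 1): set(LEXICON[w]) for i, w in enumerate(words)}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            cell = set()
            for k in range(i + 1, j):
                for left in chart[(i, k)]:
                    for right in chart[(k, j)]:
                        cell |= RULES.get((left, right), set())
            chart[(i, j)] = cell
    return "S" in chart[(0, n)]

def readings(s):
    """Keep only the segmentations licensed by a parse of the entire string."""
    return [seg for seg in segmentations(s) if parses_as_sentence(seg)]

for sentence in ["白天鹅游过来了", "在白天鹅游过来了", "三只白天鹅游过来了"]:
    print(sentence, readings(sentence))
```

With this toy grammar, the string in (f) keeps both segmentations, while (g) keeps only bai-tian | e and (h) keeps only bai | tian-e, mirroring the readings listed in (2-11).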

In the light of the inquiry in this section, the theoretical significance of the distinction between overlapping ambiguity and hidden ambiguity seems to have diminished.[10]  They are both structural in nature.  They both require full context and a grammar for proper treatment.

(2-12.) Conclusion

(i)  It is not necessarily true that an overlapping ambiguous string can be disambiguated within the local string.

(ii) The grammatical analysis of the entire input string is required for the adequate treatment of the overlapping ambiguity problem as well as the hidden ambiguity problem.

2.2. Productive Word Formation and Syntax

This section examines the connection between productive word formation and segmentation ambiguity.  The observation is that ambiguity can be involved in each type of word formation.  The point to be made is that no independent morphology system can resolve this ambiguity when syntax is unavailable.  This is because words formed via morphology, just like words looked up from the lexicon, only provide syntactic ‘candidate’ constituents for the sentential analysis.  The choice is decided by the structural analysis of the entire sentence.

Derivation is a major type of productive word formation in Chinese.   Section 1.2.2 has given an example of the involvement of hidden ambiguity in derivation, repeated below.

(2-13.)         这道菜没有吃头  zhe dao cai mei you chi tou.

(a)      zhe    | dao          | cai            | mei-you    | chi-tou
          this    | CLA                   | dish         | not-have   | worth-of-eating
[S [NP zhe dao cai] [VP [V mei-you] [NP chi-tou]]]
This dish is not worth eating.

(b) ?   zhe    | dao          | cai            | mei-you    | chi  | tou
          this    | CLA                   | dish         | not have   | eat  | head
[S [NP zhe dao cai] [VP [ADV mei-you] [VP [V chi] [NP tou]]]]
This dish did not eat the head.

(2-14.)         他饿得能吃头牛 ta e de neng chi tou niu.

(a) *   ta       | e              | de            | neng         | chi-tou               | niu
he      | hungry     | DE3         | can           | worth-of-eating  | ox

(b)      ta       | e              | de            | neng         | chi  | tou            | niu
he      | hungry     | DE3         | can           | eat  | CLA          | ox
[…[VP [V e] [DE3P [DE3 de] [VP [V neng] [VP [V chi] [NP tou niu]]]]]]
He is so hungry that he can eat an ox.

Some derivation rule like the one in (2-15) is responsible for combining the transitive verb stem and the suffix –tou (worth-of) into a derived noun for (2-13a) and (2-14a).

(2-15.)         X (transitive verb) + tou --> X-tou (noun, semantics: worth-of-X)

However, when syntax is not available, there is always a danger of wrongly applying this morphological rule due to possible ambiguity involved, as shown in (2-14a).  In other words, morphological rules only provide candidate words;  they cannot make the decision whether these words are legitimate in the context.
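A minimal Python sketch of rule (2-15) as a candidate generator may help here.  The verb list and the function name tou_candidates are hypothetical, and for simplicity only single-character verb stems are considered.

```python
# Hypothetical list of single-character transitive verb stems.
TRANSITIVE_VERBS = {"吃", "看", "听"}

def tou_candidates(chars):
    """Propose X-tou nouns per rule (2-15): transitive verb X + 头.

    Returns (start, end, word, category) tuples.  These are candidates
    only; the rule itself cannot tell whether they fit the sentence.
    """
    candidates = []
    for i in range(len(chars) - 1):
        if chars[i] in TRANSITIVE_VERBS and chars[i + 1] == "头":
            candidates.append((i, i + 2, chars[i] + "头", "N"))
    return candidates

print(tou_candidates("这道菜没有吃头"))  # candidate chi-tou, legitimate in (2-13a)
print(tou_candidates("他饿得能吃头牛"))  # same candidate, illegitimate in (2-14a)
```

The rule fires identically on both sentences, so deciding between (2-13a) and (2-14b) has to be left to the analysis of the whole sentence.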

Reduplication is another method for productive word formation in Chinese.  An outstanding problem is the AB --> AABB reduplication or AB --> AAB reduplication if AB is a listed word.   In these cases, some reduplication rules or procedures need to be involved to recognize AABB or AAB.  If reduplication is a simple process confined to a local small context, it may be possible to handle it by incorporating some procedure-based function calls during the lexical lookup.  For example, when a three-character string, say 分分心 fen fen xin, cannot be found in the lexicon, the reduplication function will check whether the first two characters are the same, and if yes, delete one of them and consult the lexicon again.  This method is expected to handle the AAB type reduplication, e.g. fen-xin (divide-heart: distract) --> fen-fen-xin (distract a bit).
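The lookup procedure just described can be sketched in Python as follows.  The lexicon contents and the function name lookup_with_aab are assumptions made for illustration.

```python
# Toy lexicon of listed words (illustrative entries only).
LEXICON = {"分心", "散步"}

def lookup_with_aab(s, lexicon=LEXICON):
    """Look up s; on failure, try undoing an AAB reduplication.

    Returns the listed base word, or None if nothing is found.
    """
    if s in lexicon:
        return s
    # AAB pattern: the first two characters are the same, so delete one
    # of them and consult the lexicon again.
    if len(s) == 3 and s[0] == s[1] and s[1:] in lexicon:
        return s[1:]
    return None

print(lookup_with_aab("分分心"))  # base form fen-xin
```

Like any procedure confined to a three-character window, this lookup sees only the local string and knows nothing about the words on either side of it.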

However, segmentation ambiguity can be involved in reduplication as well.  Compare the following examples in (2-16) and (2-17), both containing the sub-string fen fen xin:  the first is free of ambiguity but the second is ambiguous.  In fact, (2‑17) involves an overlapping ambiguous string shi fen fen xin:  shi (ten) | fen-fen-xin (distract a bit) and shi-fen (very) | fen-xin (distract).  Based on the conclusion presented in 2.1, grammatical analysis is required to resolve the segmentation ambiguity.  This is illustrated in (2‑17).

(2-16.)         让他分分心

rang     | ta    | fen-fen-xin
let      | he   | distracted-a-bit
Let him relax a while.

(2-17.)         这件事十分分心

zhe jian shi shi fen fen xin.

(a) *   zhe    | jian          | shi           | shi  | fen-fen-xin
          this    | CLA         | thing       | ten  | distracted a bit

(b)      zhe    | jian          | shi            | shi-fen     | fen-xin
           this    | CLA         | thing       | very         | distract
[S [NP zhe jian shi] [VP [ADV shi-fen] [V fen-xin]]]
This thing is very distracting.

Finally, there is also possible ambiguity involvement in the proper name formation.  Proper names for persons, locations, etc. that are not listed in the lexicon are recognized as another major problem in word identification (Sun and Huang 1996).[11]  This problem is complicated when ambiguity is involved.

A Chinese person name usually consists of a family name followed by a given name of one or two characters.  For example, the late Chinese chairman mao-ze-dong (Mao Zedong) used to have another name li-de-sheng (Li Desheng).  In the lexicon, li is a listed family name.  Both de-sheng and sheng mean ‘win’.  This may lead to three ways of word segmentation, a complicated case involving both overlapping ambiguity and hidden ambiguity:  (i) li | de-sheng;  (ii) li-de | sheng;  (iii) li-de-sheng, as shown in (2-18) below.

(2-18.)         李得胜了 li de sheng le.

(a)      li        | de-sheng  | le
           Li       | win          | LE
[S [NP li] [VP de-sheng le]]
Li won.

(b)      li-de   | sheng       | le
           Li De | win          | LE
[S [NP li de] [VP sheng le]]
Li De won.

(c) *    li-de-sheng  | le
           Li Desheng  | LE

For this particular type of compounding, the family name serves as the left boundary of a potential compound name of person and the length can be used to determine candidates.[12]  Again, the choice is decided by the grammatical analysis of the entire sentence, as illustrated in (2-18).
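As a rough Python sketch of this candidate generation, the family name can anchor the left boundary and the given-name length (one or two characters) can bound the right.  The family-name list and the function name person_name_candidates are hypothetical.

```python
# Hypothetical list of family names listed in the lexicon.
FAMILY_NAMES = {"李", "毛"}

def person_name_candidates(chars, max_given_len=2):
    """Propose person-name candidates: family name + 1..max_given_len chars.

    Returns (start, end, name) tuples.  The bare family name itself is
    already available from the lexicon and is not repeated here.
    """
    candidates = []
    for i, ch in enumerate(chars):
        if ch in FAMILY_NAMES:
            for given_len in range(1, max_given_len + 1):
                end = i + 1 + given_len
                if end <= len(chars):
                    candidates.append((i, end, chars[i:end]))
    return candidates

print(person_name_candidates("李得胜了"))  # li-de and li-de-sheng
```

Together with the listed word li, these candidates correspond to the three segmentations in (2-18); which one is legitimate is not something the generator can know.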

(2-19.) Conclusion

Due to the possible ambiguity involvement in productive word formation, a grammar containing both morphology and syntax is required for an adequate treatment.  An independent morphology system or separate word segmenter cannot solve ambiguity problems.

2.3. Borderline Cases and Grammar

This section reviews some outstanding morpho-syntactic borderline phenomena.  The points to be made are:  (i) each proposed morphological or syntactic analysis should be justified in terms of capturing the linguistic generality;  (ii) the design of a grammar should facilitate the access to the knowledge from both morphology and syntax in analysis.

The nature of the borderline phenomena calls for the coordination of morphology and syntax in a grammar.  The phenomena of Chinese separable verbs are one typical example.  The co-existence of their contiguous use and separate use leads to the confusion whether they belong to the lexicon and morphology, or whether they are syntactic phenomena.  In fact, as will be discussed in Chapter V, there are different degrees of ‘separability’ for different types of Chinese separable verbs;  there is no uniform analysis which can handle all separable verbs properly.  Different types of separable verbs may justify different approaches to the problems.  In terms of capturing linguistic generality, a good analysis should account for the demonstrated variety of separated uses and link the separated use and the contiguous use.

‘Quasi-affixation’ is another outstanding interface problem.  This problem requires careful morpho-syntactic coordination.  As presented in Chapter I, structurally, ‘quasi-affixes’ and ‘true’ affixes demonstrate very similar word formation potential, but ‘quasi-affixes’ often retain some ‘solid’ meaning while the meaning of ‘true’ affixes is functionalized.  Therefore, the key is how to coordinate the semantic contribution of the words derived via ‘quasi-affixation’ with the building of the semantics for the entire sentence.  This coordination requires flexible information flow between data structures for morphology, syntax and semantics during the morpho-syntactic analysis.

In short, the proper treatment of the morpho-syntactic borderline phenomena requires inquiry into each individual problem in order to reach a morphological or syntactic analysis which maximally captures linguistic generality.  It also calls for the design of a grammar where information between morphology and syntax can be effectively coordinated.

2.4. Knowledge beyond Syntax

This section examines the roles of knowledge beyond syntax in the resolution of segmentation ambiguity.  Despite the fact that further information beyond syntax may be necessary for a thorough solution to segmentation ambiguity,[13] it will be argued that syntax is the appropriate place for initiating this process due to the structural nature of segmentation ambiguity.

Depending on which type of information is essential for the disambiguation, disambiguation can be classified as structure-oriented, semantics-oriented and pragmatics-oriented.  This classification hierarchy is modified from that in He, Xu and Sun (1991).  They have classified the hidden ambiguity disambiguation into three categories:  syntax-based, semantics-based and pragmatics-based.  Together with the morphology-based disambiguation which is equivalent to the overlapping ambiguity resolution in their theory, they have built a hierarchy from morphology up to pragmatics.

A note on the technical details is called for here.  The term X‑oriented (where X is either syntax, semantics or pragmatics) is selected here instead of X-based in order to avoid the potential misunderstanding that X is the basis for the relevant disambiguation.  It will be shown that while information from X is required for the ambiguity resolution, the basis is always syntax.

Based on the study in 2.1, it is believed that there is no morphology-based (or morphology-oriented) disambiguation independent of syntax.  This is because the context of morphology is a local context, too small for resolving structural ambiguity.  There is little doubt that the morphological analysis is a necessary part of word identification in terms of handling productive word formation.  But this analysis cannot by itself resolve ambiguity, as argued in 2.2.  The notion 'structure' in structure-oriented disambiguation includes both syntax and morphology.

He, Xu and Sun (1991) exclude overlapping ambiguity resolution from the classification beyond morphology.  This exclusion is found to be inappropriate.  In fact, the resolution of both hidden ambiguity and overlapping ambiguity can be classified into this hierarchy.  In order to illustrate this point, for each such class, I will give examples from both hidden ambiguity and overlapping ambiguity.

Sentences in (2-20) and (2-21), which contain the hidden ambiguity string 阵风 zhen feng, are examples for the structure-oriented disambiguation.  This type of disambiguation relying on a grammar constitutes the bulk of the disambiguation task required for word identification.

(2-20.)         一阵风吹过来了
yi zhen feng chui guo lai le.          (Feng 1996)

(a)      yi       | zhen         | feng         | chui          | guo-lai      | le
          one    | CLA          | wind        | blow         | over-here  | LE
[S [NP [CLAP yi zhen] [N feng]] [VP chui guo-lai le]]
A gust of wind blew here.

(b) *   yi       | zhen-feng                    | chui                   | guo-lai      | le
          one    | gusts-of-wind    | blow         | over-here  | LE

(2-21.)         阵风会很快来临 zhen feng hui hen kuai lai lin.

(a)      zhen-feng              | hui  | hen          | kuai         | lai-lin
          gusts-of-wind       | will | very                   | soon         | come
[S [NP zhen-feng] [VP hui hen kuai lai-lin]]
Gusts of wind will come very soon.

(b) *   zhen  | feng                   | hui  | hen          | kuai         | lai-lin
          CLA   | wind        | will | very                   | soon         | come

Compare (2-20a) where the ambiguity string is identified as two words zhen (CLA) feng (wind) and (2-21a) where the string is taken as one word zhen-feng (gusts-of-wind).  Chinese syntax defines that a numeral cannot directly combine with a noun, neither can a classifier alone when it is in non-object position.  The numeral and the classifier must combine together before they can combine with a noun.  So (2-20b) and (2‑21b) are both ruled out while (2-20a) and (2-21a) are structurally well-formed.

For the structure-oriented overlapping ambiguity resolution,  numerous examples have been cited before, and one typical example is repeated below.

(2-22.)         研究生命金贵 yan jiu sheng ming jin gui

(a)      yan-jiu-sheng       | ming         | jin-gui
graduate student | life            | precious
[S [NP yan-jiu-sheng] [S [NP ming] [AP jin-gui]]]
Life for graduate students is precious.

(b) *   yan-jiu        | sheng-ming        | jin-gui
study          | life                     | precious

As a predicate, the adjective jin-gui (precious) syntactically expects an NP as its subject, which is saturated by the second NP ming (life) in (2-22a).   The first NP serves as a topic of the sentence and is semantically linked to the subject ming (life) as its possessive entity.[14]  But there is no parse for (2-22b) despite the fact that the sub-string yan-jiu sheng-ming (to study life) forms a verb phrase [VP [V yan-jiu] [NP sheng-ming]] and the sub-string sheng-ming jin-gui (life is precious) forms a sentence [S [NP sheng-ming] [AP jin-gui]].  On one hand, the VP in the subject position does not satisfy the syntactic constraint (the category NP) expected by the adjective jin-gui (precious) - although other adjectives, say zhong-yao 'important', may expect a VP subject.  On the other hand, the transitive verb yan-jiu (study) expects an NP object.  It cannot take an S object (embedded object clause) as do other verbs, say ren-wei (think).

The resolution of the following hidden ambiguity belongs to the semantics-oriented disambiguation.

(2-23.)         请把手抬高一点儿 qing ba shou tai gao yi dian er            (Feng 1996)

(a1)    qing             | ba   | shou         | tai            | gao | yi-dian-er
          please          | BA  | hand        | hold         | high| a-little
[VP [ADV qing] [VP ba shou tai gao yi-dian-er]]
Please raise your hand a little higher.

(a2) * qing   | ba   | shou         | tai            | gao           | yi-dian-er
          invite | BA  | hand        | hold         | high         | a-little

(b1) * qing             | ba-shou    | tai            | gao           | yi-dian-er
          please          | N:handle  | hold         | high         | a-little

(b2) ? qing   | ba-shou    | tai            | gao           | yi-dian-er
          invite | N:handle  | hold         | high         | a-little
[VP [VG [V qing] [NP ba-shou]] [VP tai gao yi-dian-er]]
Invite the handle to hold a little higher.

This is an interesting example.  The same character qing is both an adverb ‘please’ and a verb ‘invite’.  (2-23b2) is syntactically valid, but violates the semantic constraint or semantic selection restriction.  The logical object of qing (invite) should be human but ba-shou (handle)  is not human.  The two syntactically valid parses (2-23a1) and (2-23b2), which correspond to two ways of segmentation, are expected to be somehow disambiguated on the above semantic grounds.

The following case is an example of semantics-oriented resolution of the overlapping ambiguity.

(2-24.)         茶点心吃了 cha dian xin chi le.

(a1)    cha    | dian-xin   | chi  | le
tea     | dim sum  | eat  | LE
[S [NP+object cha dian-xin] [VP chi le]]
The tea dim sum was eaten.

(a2) ? cha    | dian-xin   | chi  | le
tea     | dim sum  | eat  | LE
[S [NP+agent cha dian-xin] [VP chi le]]
The tea dim sum ate (something).

(a3) ? cha    | dian-xin   | chi  | le
tea     | dim sum  | eat  | LE
[S [NP+object cha ] [S [NP+agent dian-xin] [VP chi le]]]
Tea, the dim sum ate.

(a4) ? cha    | dian-xin   | chi  | le
tea     | dim sum  | eat  | LE
[S [NP+agent cha ] [VP [NP+object dian-xin] [VP chi le]]]
The tea ate the dim sum.

(b1) ? cha-dian               | xin           | chi  | le
tea dim sum         | heart       | eat  | LE
[S [NP+object cha-dian] [S [NP+agent xin] [VP chi le]]]
The tea dim sum, the heart ate.

(b2) ? cha-dian               | xin           | chi  | le
tea dim sum         | heart        | eat  | LE
[S [NP+agent cha-dian] [VP [NP+object xin] [VP chi le]]]
The tea dim sum ate the heart.

Most Chinese dictionaries contain the listed compound noun cha-dian (tea-dim-sum), but not cha dian-xin which stands for the same thing, namely the snacks served with the tea.  As shown above, there are four analyses for one segmentation and two analyses for the other segmentation.  These are all syntactically legitimate, corresponding to six different readings.  But there is only one analysis which makes sense, namely the implicit passive construction with the compound noun cha dian-xin as the preceding (logical) object in (a1).  All the other five analyses are nonsense and can be disambiguated if the semantic selection restriction that animate being eats (i.e. chi) food is enforced.   Syntactically, (a2) is an active construction with the optional object omitted.  The constructions for (a3) and (b1) are of long distance dependency where the object is topicalized and placed at the beginning.   The SOV (Subject Object Verb) pattern for (a4) and (b2) is a very  restrictive construction in Chinese.[15]
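The filtering effect of the selection restriction can be sketched in Python as follows.  This is a toy illustration: the semantic class labels and the encoding of the six analyses as (label, agent, object) triples are assumptions made here, with None standing for an omitted argument.

```python
# Hypothetical semantic classes for the words involved in (2-24).
SEM_CLASS = {
    "茶点心": "food", "点心": "food", "茶点": "food",
    "茶": "drink", "心": "body-part",
}

# The six syntactically legitimate analyses: (label, agent, object).
ANALYSES = [
    ("a1", None, "茶点心"),
    ("a2", "茶点心", None),
    ("a3", "点心", "茶"),
    ("a4", "茶", "点心"),
    ("b1", "心", "茶点"),
    ("b2", "茶点", "心"),
]

def satisfies_chi(agent, obj):
    """Selection restriction for chi (eat): animate agent, food object.

    An omitted argument (None) does not violate the restriction.
    """
    agent_ok = agent is None or SEM_CLASS.get(agent) == "animate"
    obj_ok = obj is None or SEM_CLASS.get(obj) == "food"
    return agent_ok and obj_ok

surviving = [label for label, agent, obj in ANALYSES if satisfies_chi(agent, obj)]
print(surviving)
```

Under these assumptions only (a1) survives.  The semantic filter operates on the agent and object slots that the structural analysis has already provided.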

The pragmatics-oriented disambiguation is required for the case where ambiguity remains after the application of both structural and semantic constraints.[16]  The sentences containing this type of ambiguity are genuinely ambiguous within the sentence boundary, as shown with the multiple parses in (2-25) for the hidden ambiguity and (2-26) for the overlapping ambiguity below.

(2-25.)         他喜欢烤白薯 ta xi huan kao bai shu.

(a)      ta       | xi-huan    | kao          | bai-shu.
          he      | like           | bake         | sweet-potato
[S [NP ta] [VP [V xi-huan] [VP [V kao] [NP bai-shu]]]]
He likes baking sweet potatoes.

(b)      ta       | xi-huan    | kao-bai-shu.
          he      | like           | baked-sweet-potato
[S [NP ta] [VP [V xi-huan] [NP kao-bai-shu]]]
He likes the baked sweet potatoes.

(2-26.)         研究生命不好 yan jiu sheng ming bu hao

(a)      yan-jiu-sheng       | ming         | bu   | hao.
          graduate student | destiny     | not | good
[S [NP yan-jiu-sheng] [S [NP ming] [AP bu hao]]]
The destiny of graduate students is not good.

(b)      yan-jiu        | sheng-ming        | bu   | hao.
          study          | life                     | not | good
[S [VP yan-jiu sheng-ming] [AP bu hao]]
It is not good to study life.

An important distinction should be made among these classes of disambiguation.  Some ambiguity must be solved in order to get a reading during analysis.  Other ambiguity can be retained in the form of multiple parses, corresponding to multiple readings.  In either case, it demonstrates that at least a grammar (syntax and morphology) is required.  The structure-oriented ambiguity belongs to the former, and can be handled by the appropriate structural analysis.  The semantics-oriented ambiguity and the pragmatics-oriented ambiguity belong to the latter, so multiple parses are a way out.  The examples for different classes of ambiguity show that the structural analysis is the foundation for handling ambiguity problems in word identification.  It provides possible structures for the semantic constraints or pragmatic constraints to work on.

In fact, the resolution of segmentation ambiguity in Chinese word identification is but a special case of the resolution of structural ambiguity for NLP in general.  As a matter of fact, the grammatical analysis has been routinely used to resolve, and/or prepare the basis for resolving, the structural ambiguity like the PP attachment.[17]

2.5. Summary

The most important discovery in the field of Chinese word identification presented in this chapter is that the resolution of both types of segmentation ambiguity involves the analysis of the entire input string.  This means that the availability of a grammar is the key to the solution of this problem.

This chapter has also examined the ambiguity involvement in productive word formation and reached the following conclusion.  A grammar for morphological analysis as well as for sentential analysis is required for an adequate treatment of this problem.  This establishes the foundation for the general design of CPSG95 as consisting of morphology and syntax in one grammar formalism. [18]

The study of the morpho-syntactic borderline problems shows that a sophisticated grammar design is called for so that information between morphology and syntax can be effectively coordinated.  This is the work to be presented in Chapter III and Chapter IV.  It also demonstrates that each individual borderline problem should be studied carefully in order to reach a morphological or syntactic analysis which maximally captures linguistic generality.  This study will be pursued in Chapter V and Chapter VI.

 

 

----------------------------------------------------------

[1]  Constraints beyond morphology and syntax can be implemented as subsequent modules, or “filters”, in order to select the correct analysis when morpho-syntactic analysis leads to multiple results (parses).  Alternatively, such constraints can also be integrated into CPSG95 as components parallel to, and interacting with, morphology and syntax.  W. Li (1996) illustrates how semantic selection restriction can be integrated into syntactic constraints in CPSG95 to support Chinese parsing.

[2] In theory, if discourse is integrated in the underlying grammar, the input can be a unit larger than a sentence, say, a paragraph or even a full text.  But this will depend on further development in discourse theory and its formalization.  Most grammars in current use assume sentential analysis.

[3]  Similar examples for the overlapping ambiguity string will be shown in 2.1.2.

[4]  But in Ancient Chinese, a numeral can freely combine with countable nouns.

[5] These two readings in written Chinese correspond to an obvious difference in Spoken Chinese:  ge (CLA) in (g1) is weakened in pronunciation, marked by the dropping of the tone, while in (g2) it reads with the original 4th tone emphatically.

[6] It is likely that what they have found corresponds to Guo’s discovery of “one tokenization per source” (Guo 1998).  Guo’s finding is based on his experimental study involving domain (“source”) evidence and seems to account for the phenomena better.  In addition, Guo’s strategy in his proposal is also more effective, reported to be one of the best strategies for disambiguation in word segmenters.

[7] According to the statistics of He, Xu and Sun (1991) on a corpus of 50833 Chinese characters, the overlapping ambiguous strings make up 84.10%, and the hidden ambiguous strings 15.90%, of all ambiguous strings.

[8] Guo (1997b) goes to the other extreme to hypothesize that “every tokenization is possible”.   Although this seems to be too strong a statement, the investigation in this chapter shows that, at least domain independently, local context is very unreliable for making tokenization decisions one way or the other.

[9] However, this assumption may become statistically valid within a specific domain or source, as examined in Guo (1998).  But Guo did not give an operational definition of source/domain.  Without such a definition, it is difficult to decide where to collect the domain-specific information required for disambiguation based on the principle of one tokenization per source, as proposed by Guo (1998).

[10] This distinction is crucial in the theories of Liang (1987) and He,  Xu and Sun (1991).

[11] This work is now defined as one fundamental task, called Named Entity tagging, in the world of information extraction (MUC-7 1998).  There have been great advances in developing Named Entity taggers both for Chinese (e.g. Yu et al 1997; Chen et al 1997) and for other languages.

[12] That is what was actually done with the CPSG95 implementation.  More precisely, the family name expects a special sign with hanzi-length of 1 or 2 to form a full name candidate.

[13] A typical, sophisticated word segmenter making reference to knowledge beyond syntax is presented in Gan (1995).

[14] This is in fact one very common construction in Chinese in the form of NP1 NP2 Predicate.  Other examples include ta (he) tou (head) tong (ache): ‘he has a head-ache’ and ta (he) shen-ti (body) hao (good): 'he is good in health'.

[15] For the detailed analysis of these constructions, see W. Li (1996).

[16] It seems that it may be more appropriate to use terms like global disambiguation or discourse-oriented disambiguation instead of the term pragmatics-oriented disambiguation for the relevant phenomena.

[17] It seems that some PP attachment problems can be resolved via grammatical analysis alone.  For example, put something on the table; found the key to that door.  Others require information beyond syntax (semantics, discourse, etc.) for a proper solution.  For example, see somebody with telescope. In either case, the structural analysis provides a basis.  The same thing happens to the disambiguation in Chinese word identification.

[18] In fact, once morphology is incorporated in the grammar, the identification of both vocabulary words and non-listable words becomes a by-product during the integrated morpho-syntactic analysis.  Most ambiguity is resolved automatically and the remaining ambiguity will be embodied in the multiple syntactic trees as the results of the analysis.  This has been shown to be true and viable by W. Li (1997, 2000) and Wu and Jiang (1998).

 


PhD Thesis: Chapter I Introduction

1.0. Foreword

This thesis addresses the issue of the Chinese morpho-syntactic interface.  This study is motivated by the need for a solution to a series of long-standing problems at the interface.  These problems pose challenges to an independent morphology system or a separate word segmenter as there is a need to bring in syntactic information in handling these problems.

The key is to develop a Chinese grammar which is capable of representing sufficient information from both morphology and syntax.  On the basis of the theory of Head-Driven Phrase Structure Grammar (Pollard and Sag 1987, 1994), the thesis will present the design of a Chinese grammar, named CPSG95 (for Chinese Phrase Structure Grammar).  The interface between morphology and syntax is defined system internally in CPSG95.  For each problem, arguments will be presented for the linguistic analysis involved.  A solution to the problem will then be formulated based on the analysis.  The proposed solutions are formalized and implementable;  most of the proposals have been tested in the implementation of CPSG95.

In what follows, Section 1.1 reviews some important developments in the field of Chinese NLP (Natural Language Processing).  This serves as the background for this study.  Section 1.2 presents a series of long-standing problems related to the Chinese morpho-syntactic interface.  These problems are the focus of this thesis.  Section 1.3 introduces CPSG95 and sketches its morpho-syntactic interface by illustrating an example of the proposed morpho-syntactic analysis.

1.1. Background

This section presents the background for the work on the interface between morphology and syntax in CPSG95.  Major developments in Chinese tokenization and parsing, the two areas which are related to this study, will be reviewed.

1.1.1. Principle of Maximum Tokenization and Critical Tokenization

This section reviews the influential Theory of Critical Tokenization (Guo 1997a) and its implications.  The point to be made is that the results of Guo’s study can help us to select the tokenization scheme used in the lexical lookup phase in order to create the basis for morpho-syntactic parsing.

Guo (1997a,b,c) has conducted a comprehensive formal study on tokenization schemes in the framework of formal languages, including deterministic tokenization such as FT (Forward Maximum Tokenization) and BT (Backward Maximum Tokenization), and non-deterministic tokenization such as CT (Critical Tokenization), ST (Shortest Tokenization) and ET (Exhaustive Tokenization).  In particular, Guo has focused on the study of the rich family of tokenization strategies following the general Principle of Maximum Tokenization, or “PMT”.  Except for ET, all the tokenization schemes mentioned above are PMT-based.

In terms of lexical lookup, PMT can be understood as a heuristic by which a longer match overrides all shorter matches.  PMT has been widely adopted (e.g. Webster and Kit 1992; Guo 1997b) and is believed to be “the most powerful and commonly used disambiguation rule” (Chen and Liu 1992:104).
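The longest-match heuristic is easiest to see in the two deterministic schemes, FT and BT.  The following is a minimal sketch, not Guo's optimized algorithms: the toy lexicon, the function names and the max_len cap are all assumptions made for the illustration.

```python
# Toy lexicon, assumed for illustration only.
LEXICON = {"大学", "大学生", "学生", "生活", "大", "学", "生", "活"}

def forward_max(text, lexicon, max_len=4):
    """FT: scan left to right, always taking the longest lexicon match."""
    tokens, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            # Fall back to a single character if nothing longer matches.
            if text[i:i + n] in lexicon or n == 1:
                tokens.append(text[i:i + n])
                i += n
                break
    return tokens

def backward_max(text, lexicon, max_len=4):
    """BT: scan right to left, always taking the longest lexicon match."""
    tokens, j = [], len(text)
    while j > 0:
        for n in range(min(max_len, j), 0, -1):
            if text[j - n:j] in lexicon or n == 1:
                tokens.insert(0, text[j - n:j])
                j -= n
                break
    return tokens

print(forward_max("大学生活", LEXICON))   # ['大学生', '活']
print(backward_max("大学生活", LEXICON))  # ['大学', '生活']
```

Each scheme commits to exactly one segmentation, and on this string the two disagree, so at least one of them is wrong whatever the sentential context turns out to be.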

Shortest Tokenization, or “ST”, first proposed by X. Wang (1989), is a non-deterministic tokenization scheme following the Principle of Maximum Tokenization.  A segmented token string is shortest if it contains the minimum number of vocabulary words possible - “short” in the sense of the shortest word string length.

Exhaustive Tokenization, or “ET”, does not follow PMT.  As its name suggests, the ET set is the universe of all possible segmentations consisting of all candidate vocabulary words.  The mathematical definition of ET is contained in Definition 4 for “the character string tokenization operation”  in Guo (1997a).
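The relation between ET and ST can be made concrete with a small sketch.  It follows the informal descriptions above rather than Guo's formal definitions, and the toy lexicon and function names are assumptions.

```python
# Toy lexicon, assumed for illustration only.
LEXICON = {"大学", "大学生", "学生", "生活", "大", "学", "生", "活"}

def exhaustive(text, lexicon):
    """ET: every segmentation of text into lexicon words."""
    if not text:
        return [[]]
    results = []
    for n in range(1, len(text) + 1):
        head = text[:n]
        if head in lexicon:
            for rest in exhaustive(text[n:], lexicon):
                results.append([head] + rest)
    return results

def shortest(text, lexicon):
    """ST: the ET members with the minimum number of words."""
    et = exhaustive(text, lexicon)
    if not et:
        return []
    best = min(len(tokens) for tokens in et)
    return [tokens for tokens in et if len(tokens) == best]

print(len(exhaustive("大学生活", LEXICON)))  # 6
print(shortest("大学生活", LEXICON))         # [['大学', '生活'], ['大学生', '活']]
```

ST is non-deterministic in exactly the sense used above: it keeps both two-word candidates and leaves the choice to a later stage.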

The most important concept in Guo’s theory is Critical Tokenization, or “CT”.  Guo’s definition is based on the partially ordered set, or ‘poset’, theory in discrete mathematics (Kolman and Busby 1987).  Guo has found that different segmentations can be linked by the cover relationship to form a poset.   For example, abc|d and ab|cd both cover ab|c|d, but they do not cover each other.

Critical tokenization is defined as the set of minimal elements, i.e. tokenizations which are not covered by other tokenizations, in the tokenization poset.  Guo has given proof for a number of mathematical properties involving critical tokenization.  The major ones are listed below.

  • Every tokenization is a subtokenization of (i.e. covered by) a critical tokenization, but no critical tokenization has a true supertokenization;
  • The tokenization variations following the Principle of Maximum Tokenization proposed in the literature, such as FT, BT, FT+BT and ST, are all true sub-classes of CT.

Based on these properties, Guo concludes that CT is the precise mathematical description of the widely adopted Principle of Maximum Tokenization.
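The cover relationship and the resulting critical set can be checked on the abcd example.  This sketch reads "X covers Y" as "Y is obtained from X by adding cut points", which is an interpretation of the description above and not Guo's formal definition; the lexicon and function names are hypothetical.

```python
# Hypothetical lexicon over the abstract alphabet used in the example above.
LEXICON = {"a", "b", "c", "d", "ab", "cd", "abc"}

def exhaustive(text, lexicon):
    """All segmentations of text into lexicon words (the ET set)."""
    if not text:
        return [[]]
    return [[text[:n]] + rest
            for n in range(1, len(text) + 1) if text[:n] in lexicon
            for rest in exhaustive(text[n:], lexicon)]

def boundaries(tokens):
    """Internal cut positions of a tokenization."""
    cuts, pos = set(), 0
    for tok in tokens[:-1]:
        pos += len(tok)
        cuts.add(pos)
    return frozenset(cuts)

def covers(x, y):
    """x covers y if y keeps every cut of x and adds at least one more."""
    return boundaries(x) < boundaries(y)

def critical(text, lexicon):
    """CT: tokenizations in the ET set that no other tokenization covers."""
    et = exhaustive(text, lexicon)
    return [t for t in et if not any(covers(other, t) for other in et)]

print(critical("abcd", LEXICON))  # [['ab', 'cd'], ['abc', 'd']]
```

Both abc|d and ab|cd cover ab|c|d, neither covers the other, and every other member of the ET set is covered by one of the two.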

Guo (1997c) further reports his experimental studies on relative merits of these tokenization schemes in terms of three quality indicators, namely, perplexity, precision and recall.  The perplexity of a tokenization scheme gives the expected number of tokenized strings generated for average ambiguous fragments.  The precision score is the percentage of correctly tokenized strings among all possible tokenized strings while the recall rate is the percentage of correctly tokenized strings generated by the system among all correctly tokenized strings.  The main results are:

  • Both FT and BT can achieve perfect unity perplexity but have the worst precision and recall;
  • ET achieves perfect recall but has the lowest precision and highest perplexity;
  • ST and CT are simple with good computational properties.  Between the two, ST has lower perplexity but CT has better recall.

Guo (1997c) concludes, “for applications with moderate performance requirement, ST is the choice;  otherwise, CT is the solution.”
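The three quality indicators can be computed as follows.  The sketch assumes that precision is measured against the tokenized strings a scheme actually generates, which is one reading of the definition above; the two sample fragments and their gold answers are made up for the illustration and are not Guo's data.

```python
def evaluate(samples):
    """samples: list of (generated, gold) pairs of sets of tokenized strings."""
    n_generated = sum(len(generated) for generated, _ in samples)
    n_gold = sum(len(gold) for _, gold in samples)
    n_correct = sum(len(generated & gold) for generated, gold in samples)
    return {
        "perplexity": n_generated / len(samples),  # outputs per fragment
        "precision": n_correct / n_generated,
        "recall": n_correct / n_gold,
    }

# Hypothetical outputs for two ambiguous fragments.
deterministic = [
    ({"大学生|活"}, {"大学|生活"}),   # one guess, wrong in this context
    ({"个人"}, {"个人"}),             # one guess, right
]
non_deterministic = [
    ({"大学生|活", "大学|生活"}, {"大学|生活"}),
    ({"个人", "个|人"}, {"个人"}),
]

print(evaluate(deterministic))
# {'perplexity': 1.0, 'precision': 0.5, 'recall': 0.5}
print(evaluate(non_deterministic))
# {'perplexity': 2.0, 'precision': 0.5, 'recall': 1.0}
```

On this toy input the non-deterministic scheme buys its recall with higher perplexity, which is the tradeoff at stake in the choice between schemes.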

In addition to the above theoretical and experimental study, Guo (1997b) also develops a series of optimized algorithms for the implementation of these generation schemes.

The relevance and significance of Guo’s achievement to the research in this thesis lie in the following aspect.  The research on Chinese morpho-syntactic interface is conducted with the goal of  supporting Chinese morpho-syntactic parsing.  The input to a Chinese morpho-syntactic parser comes directly from the lexical lookup of the input string based on some non-deterministic tokenization scheme (W. Li 1997, 2000; Wu and Jiang 1998).  Guo’s research and algorithm development can help us to decide which tokenization schemes to use depending on the tradeoff between precision, recall and perplexity or the balance between reducing the search space and minimizing premature commitment.

1.1.2. Monotonicity Principle and Task-driven Segmentation

This section reviews recent developments in Chinese analysis systems involving the interface between morphology and syntax.  The research on the Chinese morpho-syntactic interface in this thesis echoes these new developments in the field of Chinese NLP.

In the last few years, projects have been proposed for implementing a Chinese analysis system which integrates word identification and parsing.  Both rule-based systems and statistical models have been attempted with good results.

Wu (1998) has addressed the drawbacks of the conventional practice on the development of Chinese word segmenters, in particular, the problem of premature commitment in handling segmentation ambiguity.  In his A Position Statement on Chinese Segmentation, Wu proposed a general principle:

Monotonicity Principle for segmentation:

A valid basic segmentation unit (segment or token) is a substring that no processing stage after the segmenter needs to decompose.

The rationale behind this principle is to prevent premature commitment and to avoid repetition of work between modules.   In fact, traditional word segmenters are modules independent of subsequent applications (e.g. parsing).  Due to the lack of means for accessing sufficient grammar knowledge, they suffer from premature commitment and repetition of work, hence violating this principle.

Wu’s proposal of the monotonicity principle is a challenge to the Principle of Maximum Tokenization.  These two principles are not always compatible.  Due to the existence of hidden ambiguity (see 1.2.1), the PMT-based segmenters by definition are susceptible to premature commitment leading to “too-long segments”.  If the target application is designed to solve the hidden ambiguity problem in the segments, “decomposition” of some segments is unavoidable.

In line with the Monotonicity Principle, Wu (1998) proposes an alternative approach which he claims “eliminates the danger of premature commitment”, namely task-driven segmentation.  Wu (1998) points out, “Task-driven segmentation is performed in tandem with the application (parsing, translating, named-entity labeling, etc.) rather than as a preprocessing stage.  To optimize accuracy, modern systems make use of integrated statistically-based scores to make simultaneous decisions about segmentation and parsing/translation.”  The HKUST parser, developed by Wu’s group, is such a statistical system employing the task-driven segmentation.

As for rule-based systems, similar practice of integrating word identification and parsing has also been explored.  W. Li (1997, 2000) proposed that the results of an ET-based lexical lookup directly feed the parser for the hanzi-based parsing.  More concretely, morphological rules are designed to build word internal structure for productive morphology and non-productive morphology is lexicalized via entry enumeration.[1]  This approach is the background for conducting the research on Chinese morpho-syntactic interface for CPSG95 in this dissertation.

The Chinese parser on the platform of multilingual NLPWin developed by Microsoft Research also integrates word identification and parsing (Wu and Jiang 1998).  They also use a hand-coded grammar for word identification as well as for sentential parsing.  The unique part of this system is the use of a certain lexical constraint on ET in the lexical lookup phase.  This effectively reduces the parsing search space as well as the number of syntactic trees produced by the parser, with minimal sacrifice in the recall of tokenization.  This tokenization strategy provides a viable alternative to the PMT-based tokenization schemes like CT or ST in terms of the overall balance between precision, recall and perplexity.

The practice of simultaneous word identification and parsing in implementing a Chinese analysis system calls for the support of a grammar (or statistical model) which contains sufficient information from both morphology and syntax.  The research on Chinese morpho-syntactic interface in this dissertation aims at providing this support.

1.2. Morpho-syntactic Interface Problems

This section presents a series of outstanding problems in Chinese NLP which are related to the morpho-syntactic interface.  One major goal of this dissertation is to argue for the proposed analyses of the problems and to provide solutions to them based on the analyses.

Sun and Huang (1996) have reviewed numerous cases which challenge the existing word segmenters.  As many of these cases call for an exchange of information between morphology and syntax, an appropriate solution can hardly be reached within the module of a separate word segmenter.  Three major problems at issue are presented below.

1.2.1. Segmentation ambiguity

This section presents the long-standing problem in Chinese tokenization, i.e. the resolution of the segmentation ambiguity.  Within a separate word segmenter, resolving the segmentation ambiguity is a difficult, sometimes hopeless job.  However, the majority of ambiguity can be resolved when a grammar is available.

Segmentation ambiguity has been the focus of extensive study in Chinese NLP for the last decade (e.g. Chen and Liu 1992; Liang 1987;  Sproat, Shih, Gale and Chang 1996; Sun and Huang 1996; Guo 1997b).  There are two types of segmentation ambiguities (Liang 1987; Guo 1997b):  (i) overlapping ambiguity:  e.g. da-xue | sheng-huo vs. da-xue-sheng | huo as shown in (1-1) and (1-2);  and (ii) hidden ambiguity:  ge-ren vs. ge | ren, as shown in (1-3) and (1-4).

(1-1.) 大学生活很有趣
da-xue         | sheng-huo          | hen          | you-qu
university    | life                     | very          | interesting
The university life is very interesting.

(1-2.)  大学生活不下去了
da-xue-sheng                 | huo          | bu | xia-qu      | le
university student          | live           | not | down        | LEs
University students can no longer make a living.

(1-3.)  个人的力量
ge-ren         | de   | li-liang
individual   | DE  | power
the power of an individual

(1-4.) 三个人的力量
san    |  ge            | ren           | de   | li-liang
three  | CLA          | person      |DE   | power
the power of three persons

These examples show that the resolution of segmentation ambiguity requires larger syntactic context and grammatical analysis.  There will be further arguments and evidence in Chapter II (2.1) for the following conclusion:  both types of segmentation ambiguity are structural by nature and require sentential analysis for their resolution.  Without access to a grammar, no matter how sophisticated the tokenization algorithm, a word segmenter will inevitably hit an upper bound on the precision of word identification.  However, in an integrated system, word identification becomes a natural by-product of parsing (W. Li 1997, 2000;  Wu and Jiang 1998).  More precisely, the majority of ambiguity can be resolved automatically during morpho-syntactic parsing;  the remaining ambiguity can be made explicit in the form of multiple syntactic trees.[2]  But in order to make this happen, the parser requires reliable support from a grammar which contains both morphology and syntax.
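The two ambiguity types can be detected mechanically from a lexicon, even though they cannot be resolved that way.  The sketch below is an illustration under assumptions: the toy lexicon is deliberately small, and the function names are hypothetical.

```python
# Toy lexicon, assumed for illustration only.
LEXICON = {"大学", "大学生", "生活", "活", "个人", "个", "人", "三", "的", "力量"}

def word_spans(text, lexicon):
    """(start, end) spans of all lexicon words found in text."""
    return [(i, j) for i in range(len(text))
            for j in range(i + 1, len(text) + 1) if text[i:j] in lexicon]

def overlapping(text, lexicon):
    """Pairs of lexicon words that share characters without nesting."""
    spans = word_spans(text, lexicon)
    return [(text[a:b], text[c:d])
            for a, b in spans for c, d in spans if a < c < b < d]

def can_split(word, lexicon):
    """True if word can be cut into two or more lexicon words."""
    return any(word[:k] in lexicon
               and (word[k:] in lexicon or can_split(word[k:], lexicon))
               for k in range(1, len(word)))

def hidden(text, lexicon):
    """Lexicon words in text that also decompose into smaller lexicon words."""
    return [text[i:j] for i, j in word_spans(text, lexicon)
            if j - i > 1 and can_split(text[i:j], lexicon)]

print(overlapping("大学生活", LEXICON))  # [('大学生', '生活')]
print(hidden("三个人的力量", LEXICON))    # ['个人']
```

Detection only says where a choice exists.  Whether 个人 should be split in a given sentence is decided by the surrounding structure, as (1-3) and (1-4) show.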

1.2.2. Productive Word Formation

Non-listable words created via productive morphology pose another challenge (Sun and Huang 1996).  There are two major problems involved in this issue:  (i) problem in identifying lexicon-unlisted words;  (ii) problem of possible segmentation ambiguity.

One important method of productive word formation is derivation.  For example, the derived word 可读性 ke-du-xing (-able-read-ness: readability) is created via morphological rules, informally formulated below.

(1-5.) derivation rules

ke + X (transitive verb) --> ke-X (adjective, semantics: X-able)

Y (adjective or verb) + xing --> Y-xing (abstract noun, semantics: Y-ness)

Rules like the above have to be incorporated properly in order to correctly identify such non-listable words.  However, there has been little research in the literature on what formalism should be adopted for Chinese morphology  and how it should be interfaced to syntax.
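As a rough illustration of how the two rules in (1-5) compose, the following sketch applies them to pinyin stems.  It is not the CPSG95 formalism: the category labels, the tiny lexicon and the function names are assumptions made for the example.

```python
# Hypothetical lexicon of stems with coarse categories (Vt: transitive verb).
LEXICON = {"du": "Vt", "chi": "Vt", "shu": "N"}

def derive_ke(stem, cat):
    """ke + X (transitive verb) --> ke-X (adjective, X-able)."""
    if cat == "Vt":
        return ("ke-" + stem, "A")
    return None

def derive_xing(stem, cat):
    """Y (adjective or verb) + xing --> Y-xing (abstract noun, Y-ness)."""
    if cat in {"A", "V", "Vt"}:
        return (stem + "-xing", "N")
    return None

def analyze_ke_x_xing(stem):
    """Build ke-X-xing bottom-up, or return None if a rule does not apply."""
    inner = derive_ke(stem, LEXICON.get(stem))
    if inner is None:
        return None
    return derive_xing(*inner)

print(analyze_ke_x_xing("du"))   # ('ke-du-xing', 'N')
print(analyze_ke_x_xing("shu"))  # None
```

The category change from verb to adjective to noun is what makes the output of morphology usable by syntax, so any formalism chosen for these rules has to expose it.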

To make the case more complicated, ambiguity may also be involved in productive word formation.  When the segmentation ambiguity is involved in word formation, there is always a danger of wrongly applying morphological rules.  For example, 吃头 chi-tou (worth of eating) is a derived word (transitive verb + suffix tou);   however, it can also be segmented as two separate tokens chi (eat) | tou (CLA), as shown in (1-6) and (1-7) below.

(1-6.)  这道菜没有吃头
zhe    | dao           | cai            | mei-you    | chi-tou
this    | CLA          | dish         | not have   | worth-of-eating
This dish is not worth eating.

(1-7.) 他饿得能吃头牛
ta       | e               | de             | neng        | chi  | tou           | niu
he      | hungry     | DE3         | can           | eat  | CLA                   | ox
He is so hungry that he can eat an ox.

To resolve this segmentation ambiguity, as indicated before in 1.2.1, the structural analysis of the complete sentences is required.  An independent morphology system or a separate word segmenter cannot handle this problem without accessing syntactic knowledge.

1.2.3. Borderline Cases between Morphology and Syntax

It is widely acknowledged that there is a remarkable gray area between Chinese morphology and Chinese syntax (L. Li 1990; Sun and Huang 1996).  Two typical cases are described below.  The first is the phenomenon of Chinese separable verbs.  The second case involves interfacing derivation and syntax.

Chinese separable verbs are usually in the form of V+N and V+V or V+A.  These idiomatic combinations are long-standing problems at the interface between compounding and syntax in Chinese grammar (L. Wang 1955; Z. Lu 1957; Lü 1989; Lin 1983; Q. Li 1983; L. Li 1990; Shi 1992; Zhao and Zhang 1996).

The separable verb 洗澡 xi zao (wash‑bath: take a bath) is a typical example.  Many native speakers regard xi zao as one word (verb), but the two morphemes are separable.  In fact, xi+zao shares the syntactic behavior and the pattern variations with the syntactic transitive combination V+NP:  not only can aspect markers appear between xi and zao,  but this structure can be passivized and topicalized as well.  The following is an example of topicalization (of long distance dependency) for xi zao.

(1-8.)(a)       我认为他应该洗澡
wo     ren-wei        ta       ying-gai       xi zao.
I         think           he      should        wash-bath
I think that he should take a bath.

(b)      澡我认为他应该洗
zao    wo     ren-wei        ta       ying-gai       xi.
bath  I         think           he      should        wash
The bath I think that he should take.

Although xi zao behaves like a syntactic phrase, it is a vocabulary word in the lexicon due to its idiomatic nature.  As a result, almost all word segmenters output xi-zao in (1-8a) as one word while treating the two signs[3] in (1-8b) as two words.  Thus the relationship between the separated use of the idiom and the non-separated use is lost.

The second case represents a considerable number of borderline cases often referred to as  ‘quasi-affixes’.  These are morphemes like 前 qian (former, ex-) in words like 前夫 qian-fu (ex-husband), 前领导 qian-[ling-dao] (former boss) and -盲 mang (person who has little knowledge of) in words like 计算机盲 [ji-suan-ji]-mang (computer layman), 法盲 fa-mang (person who has no knowledge of laws).

It is observed that 'quasi-affixes' are structurally not different from other affixes.  The major difference between 'quasi-affixes' and the few generally honored ('genuine') affixes like the nominalizer 性 -xing (-ness) lies mainly in the following aspect.  The former retain some 'solid' meaning while the latter are more functionalized.  Therefore, the key to this problem seems to lie in the appropriate way of coordinating the semantic contribution of the derived words using 'quasi-affixes' to the building of the semantics for the entire sentence.  This is an area which has not received enough investigation in the field of Chinese NLP.  While many word segmenters have included some type of derivational processing for a few typical affixes, few systems demonstrate where and how to handle these 'quasi-affixes'.

1.3. CPSG95:  HPSG-style Chinese Grammar in ALE

To investigate the interaction between morphological and syntactic information, it is important to develop a Chinese grammar which incorporates morphology and syntax in the same formalism.  This section gives a brief presentation on the design and background of CPSG95 (including lexicon).

1.3.1. Background and Overview of CPSG95

Shieber (1986) distinguishes two types of grammar formalism:  (i) theory-oriented formalism;  (ii) tool-oriented formalism.  In general, a language-specific grammar turns to a theory-oriented formalism for its foundation and a tool-oriented formalism for its implementation.  The work on CPSG95 is developed in the spirit of the theory-oriented formalism Head-driven Phrase Structure Grammar (HPSG, proposed by Pollard and Sag 1987).  The tool-oriented formalism used to implement CPSG95 is the Attribute Logic Engine (ALE, developed by Carpenter and Penn 1994).

The unique feature of CPSG95 is its incorporation of Chinese morphology in the HPSG framework.[4]  Like other HPSG grammars, CPSG95 is a heavily lexicalized unification grammar.  It consists of two parts:  a minimized general grammar and an information-enriched lexicon.  The general grammar contains a small number of Phrase Structure (PS) rules, roughly corresponding to the HPSG schemata tuned to the Chinese language.[5]  The syntactic PS rules capture the subject-predicate structure, complement structure, modifier structure, conjunctive structure and long-distance dependency.  The morphological PS rules cover morphological structures for productive word formation.  In one version of CPSG95 (its source code is  shown in APPENDIX I), there are nine PS rules:  seven syntactic rules and two morphological rules.

In CPSG95, potential morphological structures and potential syntactic structures are both lexically encoded.  In syntax, a word can expect (subcat-for or mod in HPSG terms) another sign to form a phrase.   Likewise, in Chinese morphology, a morpheme can expect another sign to form a word.[6]

One important modification of HPSG in designing CPSG95 is to use an atomic approach with separate features for each complement to replace the list design of obliqueness hierarchy among complements.  The rationale and arguments for this modification are presented in Section 3.2.3 in Chapter III.

1.3.2. Illustration

The example shown in (1-9) demonstrates the morpho-syntactic analysis  in CPSG95.

(1-9.) 这本书的可读性
zhe    ben    shu    de      ke               du      xing
this    CLA   book  DE     AF:-able      read   AF:-ness
this book’s readability
(Note: CLA for classifier; DE for particle de; AF for affix.)

Figure 1 illustrates the tree structure built by the morphological PS rules and the syntactic PS rules in CPSG95.

Figure 1. Sample Tree Structure for CPSG95 Analysis

As shown, the tree embodies both morphological analysis (the sub-tree for ke-du-xing) and syntactic analysis (the NP structure).  The results of the morphological analysis (the category change from V to A and to N and the building of semantics, etc.) are readily accessible in building syntactic structures.

1.4. Organization of the Dissertation

The remainder of this dissertation is divided into six chapters.

Chapter II presents arguments for the need to involve syntactic analysis for a proper solution to the targeted morpho-syntactic problems.   This establishes the foundation on which CPSG95 is based.

Chapter III presents the design of CPSG95.  In particular, the expectation feature structures will be defined.  They are used to encode the lexical expectation of both morphological and syntactic structures.  This design provides the necessary means for formally defining the Chinese word and the interface of morphology, syntax and semantics.

Chapter IV is on defining the Chinese word.  This is generally recognized as a basic issue in discussing the Chinese morpho-syntactic interface.  The investigation leads to a formalization of wordhood and a coherent, system-internal definition of the division of labor between morphology and syntax.

Chapter V studies Chinese separable verbs.  It discusses the wordhood judgment for each type of separable verb based on its distribution.  The corresponding morphological or syntactic solutions will then be presented.

Chapter VI investigates some outstanding problems of Chinese derivation and its interface with syntax.  It will be demonstrated that the general approach to Chinese derivation in CPSG95 works both for typical cases of derivation and the two special problems, namely 'quasi-affix' phenomena and zhe-affixation.

The last chapter, Chapter VII, concludes this dissertation.  In addition to a concise retrospect for what has been achieved, it also gives an account of the limitations of the present research and future research directions.

Finally, the three appendices give the source code of one version of the implemented CPSG95 and some tested results.[7]

 

--------------------------------------------------

[1] In line with the requirements by Chinese NLP, this thesis places emphasis on the analysis of productive morphology:  phenomena which are listable in the lexicon are not the major concern.  This is different from many previous works on Chinese morphology (e.g. Z. Lu 1957; Dai 1993) where the bulk of discussions is on unproductive morphemes (affixes or ‘bound stems’).

[2] Ambiguity which remains after sentential parsing may be resolved by using further semantic, discourse or pragmatic knowledge, or ‘filters’.

[3] In CPSG95 and other HPSG-style grammars, a ‘sign’ usually stands for the generalized notion of grammatical units such as morpheme, word, phrase, etc.

[4] Researchers have looked at the incorporation of morphology of other natural languages in the HPSG framework (e.g. Type-based Derivation Morphology by Riehemann 1998).  Arguments for the inclusion of morphological features in the definition of sign will be presented in detail in Chapter III.

[5] Note that ‘phrase structure’ in terms like Phrase Structure Grammar (PSG) or Phrase Structure rules (PS rules) does not necessarily refer to structures of (syntactic) phrases. It stands for surface-based constituency structure, in contrast to, say, dependency structure in Dependency Grammar.  In CPSG95, some productive morphological structures are also captured by PS rules.

[6] Note that in this dissertation, the term expect is used as a more generalized notion than the terms subcat-for (subcategorize for) and mod (modify).  ‘Expect’ is intended to be applied to morphology as well as to syntax.

[7]  There are differences in technical details between the proposed grammar in this dissertation and the implemented version.  This is because any implemented version was tested at a given time while this thesis evolved over a long period of time.  It is the author’s belief that it best benefits readers (including those who want to follow the CPSG practice) when a version was actually tested and given as was.

 

[Related]

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

PhD Thesis: Chapter VII Concluding Remarks

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

 

The Morpho-syntactic Interface in a Chinese Phrase Structure Grammar

by

 

Wei Li

B.A., Anqing Normal College, China, 1982

M.A., The Graduate School of Chinese Academy of

Social Sciences, China, 1986

 

 

Thesis submitted in partial fulfillment of

the requirements for the degree of

DOCTOR OF PHILOSOPHY

 

in the Department

of

Linguistics

Morpho-syntactic Interface in a Chinese Phrase Structure Grammar

 

Wei Li 2000

SIMON FRASER UNIVERSITY

November 2000

 

 

All rights reserved. This work may not be reproduced in whole or in part, by photocopy or other means, without permission of the author.

 

Approval

Name:                         Wei Li

Degree:                       Ph.D.

Title of thesis:             THE MORPHO-SYNTACTIC INTERFACE IN

A CHINESE PHRASE STRUCTURE GRAMMAR

 

(Approved January 12, 2001)

 

Abstract

This dissertation examines issues related to the morpho-syntactic interface in Chinese, specifically those issues related to the following long-standing problems in Chinese Natural Language Processing (NLP): (i) disambiguation in Chinese word identification;  (ii) Chinese productive word formation;  (iii) borderline phenomena between morphology and syntax, such as Chinese separable verbs and ‘quasi-affixation’.

All these problems pose challenges to an independent Chinese morphology system or separate word segmenter.  It is argued that there is a need to bring in the syntactic analysis in handling these problems.

To enable syntactic analysis in addition to morphological analysis in an integrated system, it is necessary to develop a Chinese grammar that is capable of representing sufficient information from both morphology and syntax.  The dissertation presents the design of such a Chinese phrase structure grammar, named CPSG95 (for Chinese Phrase Structure Grammar).  The unique feature of CPSG95 is its incorporation of Chinese morphology in the framework of Head-Driven Phrase Structure Grammar.  The interface between morphology and syntax is then defined system internally in CPSG95 and uniformly represented using the underlying grammar formalism used by the Attribute Logic Engine.  For each problem, arguments are presented for the proposed analysis to capture the linguistic generality;  morphological or syntactic solutions are formulated based on the analysis.  This provides a secure approach to solving problems at the interface of Chinese morphology and syntax.


Dedication

To my daughter Tian Tian

whose babbling accompanied and inspired the writing of this work

And to my most devoted friend Dr. Jianjun Wang

whose help and advice encouraged me to complete this work

Acknowledgments

First and foremost, I feel profoundly grateful to Dr. Paul McFetridge, my senior supervisor.  It was his support that brought me to SFU and the beautiful city Vancouver, which changed my life.  Over the years,  he introduced me into the HPSG study, and provided me with his own parser for testing grammar writing.  His mentorship and guidance have influenced my research fundamentally.  He critiqued my research experiments and thesis writing in many facets, from the development of key ideas, selection of topics, methodology, implementation details to writing and presentation style.  I feel guilty for not being able to promptly understand and follow his guidance at times.

I would like to thank Dr. Fred Popowich, my second advisor.  He has given me both general academic guidance on research methodology and numerous specific comments for the thesis revision which have helped shape the present version of the thesis as it is today.

I am also grateful to Dr. Nancy Hedberg from whom I have taken four graduate courses, including the course of HPSG.  I have not only learned a lot from her lectures in the classroom, but have benefited greatly from our numerous discussions on general linguistic topics as well as issues in Chinese linguistics.

Thanks to Davide Turkato, my friend and colleague in the Natural Language Lab.  He is always there whenever I need help.  We have also shared many happy hours in our common circle of Esperanto club in Vancouver.

I would like to thank Dr. Ping Xue, Dr. Zita McRobbie, Dr. Thomas Perry, Dr. Donna Gerdts and Dr. Richard DeArmond for the courses I have taken from them.  These courses were an important part of my linguistic training at SFU.

For various help and encouragement I have got during my graduate studies, I should also thank all the faculty, staff and colleagues of the linguistics department and the Natural Language Lab of SFU, in particular, Rita, Sheilagh, Dr. Ross Saunders, Dr. Wyn Roberts, Dr. Murray Munro and Dr. Olivier Laurens.  I am particularly thankful to Carol Jackson, our Graduate Secretary for her years of help.  She is remarkable, very caring and responsive.

I would like to extend my thanks to all my fellow students and friends in the linguistics department of SFU, in particular, Dr. Trude Heift, Dr. Janine Toole, Susan Russel, Dr. Baoning Fu, Zhongying Lu, Dr. Shuicai Zhou, Jianyi Yu, Jean Wang, Cliff Burgess and Kyoung-Ja Lee.  We have had so much fun together and have had many interesting discussions, both academic and non-academic.  Today, most of us have graduated, some are professors or professionals in different universities or institutions.  Our linguistics department is not big, but it is such a nice department where faculty, staff and the graduate student body form a very sociable community.  I have truly enjoyed my graduate life in this department.

Beyond SFU, I would like to thank Dr. De-Kang Lin for the insightful discussion on the possibilities of integrated Chinese parsing back in 1995.  Thanks to Gerald Penn, one of the authors of ALE, for providing the powerful tool ALE and for giving me instructions on modifying some functions in ALE to accommodate some needs for Chinese parsing during my experiment in implementing a Chinese grammar.

I am also grateful to Dr. Rohini Srihari, my current industry supervisor, for giving me an opportunity to manage NLP projects for real world applications at Cymfony.  This industrial experience has helped me to broaden my NLP knowledge, especially in the area of statistical NLP and the area of shallow parsing using Finite State Transducers.

Thanks to Carrie Pine and Walter Gadz from US Air Force Research Laboratory who have been project managers for the Small Business Innovation Research (SBIR) efforts ‘A Domain Independent Event Extraction Toolkit’ (Phase II), ‘Flexible Information Extraction Learning Algorithm’ (Phase I and Phase II) and ‘Intermediate-Level Event Extraction for Temporal and Spatial Analysis and Visualization’ (Phase I and Phase II).  I have been Principal Investigator for these government funded efforts at Cymfony Inc. and have had frequent and extremely beneficial contact with them.  With these projects, I have had an opportunity to apply the skills and knowledge I have acquired from my Ph.D. program at SFU.

My professional training at SFU was made possible by a grant that Dr. Paul McFetridge and Dr. Nick Cercone applied for.  The work reported in this thesis was supported in the later stage  by a Science Council of B.C. (CANADA) G.R.E.A.T. award.  I am grateful to both my academic advisor Paul McFetridge and my industry advisor John Grayson, CEO of TCC Communications Corporation of Victoria, for assisting me in obtaining this prestigious grant.

I would not have been able to start and continue my research career without the help I received from various sources, agencies and people over the last 15 years, for which I owe a big prayer of thanks.

I owe a great deal to Prof. Zhuo Liu and Prof. Yongquan Liu for leading me into the NLP area and supervising my master program in computational linguistics at CASS (Chinese Academy of Social Sciences, 1983-1986).  Their guidance in both research ideas and implementation details benefited me for life.  I am grateful to my former colleagues Prof. Aiping Fu, Prof. Zhenghui Xiong and Prof. Linding Li at the Institute of Linguistics of CASS for many insightful discussions on issues involving NLP and Chinese grammars.  Thanks also go to Ms. Fang Yang and the machine translation team at Gaoli Software Co. in Beijing for the very constructive and fruitful collaborative research and development work.  Our collaboration ultimately resulted in the commercialization of the GLMT English-to-Chinese machine translation system.

Thanks to Dr. Klaus Schubert, Dr. Dan Maxwell and Dr. Victor Sadler from BSO (Utrecht, The Netherlands) for giving me the project of writing a computational grammar of Chinese dependency syntax in 1988.  They gave me a lot of encouragement and guidance in the course of writing the grammar.  This work enabled me to study Chinese grammar in a formal and systematic way.  I have carried over this formal study of Chinese grammar to the work reported in this thesis.

I am also thankful to the Education Ministry of China, Sir Pao Foundation and British Council for providing me with the prestigious Sino-British Friendship Scholarship.  This scholarship enabled me to study computational linguistics at Centre for Computational Linguistics, UMIST, England (1992).  During my stay in UMIST, I had opportunities to attend lectures given by Prof. Jun-ichi Tsujii, Prof. Harold Somers and Dr. Paul Bennett.  I feel grateful to all of them for their guidance in and beyond the classroom.  In particular, I must thank Dr. Paul Bennett for his supervision, help and care.

I would like to thank Prof. Dong Zhen Dong and Dr. Lua Kim Teng for inviting and sponsoring me for a presentation at ICCC'96 in Singapore.  They are the leading researchers in the area of Chinese NLP.  I have benefited greatly from the academic contact and communication with them.

Thanks to anonymous reviewers of the international journals of  Communications of COLIPS, Journal of Chinese Information Processing, World Science and Technology and grkg/Humankybernetik.  Thanks also to reviewers of the International Conference on Chinese Computing (ICCC’96), North American Conference on Chinese Linguistics (NACCL‑9), Applied Natural Language Conference (ANLP’2000), Text Retrieval Conference (TREC-8), Machine Translation SUMMIT II, Conference of the Pacific Association for Computational Linguistics (PACLING-II) and North West Linguistics Conferences (NWLC).  These journals and conferences have provided a forum for publishing the NLP-related research work I and my colleagues have undertaken at different times of my research career.

Thanks to Dr. Jin Guo who has developed his influential theory of tokenization.  I have benefited enormously from exchanging ideas with him on tokenization and Chinese NLP.

In terms of research methodology and personal advice, I owe a great deal to my most devoted friend Dr. Jianjun Wang, Associate Professor at California State University, Bakersfield, and Fellow of the National Center for Education Statistics in US.  Although in a totally different discipline, there has never been an obstacle for him to understand the basic problem I was facing and to offer me professional advice.  At times when I was puzzled and confused, his guidance often helped me to quickly sort things out.  Without his advice and encouragement, I would not have been able to complete this thesis.

Finally, I wish to thank my family for their support.  All my family members, including my parents, brothers and sisters in China, have been so supportive and understanding.  In particular, my father has been encouraging me all the time.  When I went through hardships  in my pursuit,  he shared the same burden;  when I had some achievement,  he was as happy as I was.

I am especially grateful to my wife, Chunxi.  Without her love, understanding and support, it would have been impossible for me to complete this thesis.  I wish I had done a better job of keeping her less worried and frustrated.  I should thank my four-year-old daughter, Tian Tian.  I feel sorry for not being able to spend more time with her.  What has supported me all these years is the idea that some day she will understand that as a first-generation immigrant, her dad has managed to overcome various challenges in order to create a better environment for her to grow up in.


 

Approval

Abstract

Dedication

Acknowledgments

Chapter I  Introduction

1.0. Foreword

1.1. Background

  • Principle of Maximum Tokenization and Critical Tokenization
  • Monotonicity Principle and Task-driven Segmentation

1.2. Morpho-syntactic Interface Problems

1.2.1. Segmentation Ambiguity

1.2.2. Productive Word Formation

1.2.3. Borderline Cases between Morphology and Syntax

1.3. CPSG95:  HPSG-style Chinese Grammar in ALE

1.3.1. Background and Overview of CPSG95

1.3.2. Illustration

1.4. Organization of the Dissertation

Chapter II  Role of Grammar

2.0. Introduction

2.1. Segmentation Ambiguity and Syntax

2.1.1. Resolution of Hidden Ambiguity

2.1.2. Resolution of Overlapping Ambiguity

2.2. Productive Word Formation and Syntax

2.3. Borderline Cases and Grammar

2.4. Knowledge beyond Syntax

2.5. Summary

Chapter III  Design of CPSG95

3.0. Introduction

3.1. Mono-stratal Design of Sign

3.2. Expectation Feature Structures

3.2.1. Morphological Expectation

3.2.2. Syntactic Expectation

3.2.3. Chinese Subcategorization

3.2.4. Configurational Constraint

3.3. Structural Feature Structure

3.4. Summary

Chapter IV  Defining the Chinese Word

4.0. Introduction

4.1. Two Notions of Word

4.2. Judgment Methods

4.3. Formal Representation of Word

4.4. Summary

Chapter V  Chinese Separable Verbs

5.0. Introduction

5.1. Verb-object Idioms: V+N I

5.2. Verb-object Idioms: V+N II

5.3. Verb-modifier Idioms: V+A/V

5.4. Summary

Chapter VI  Morpho-syntactic Interface Involving Derivation

6.0. Introduction

6.1. General Approach to Derivation

6.2. Prefixation

6.3. Suffixation

6.4. Quasi-affixes

6.5. Suffix zhe (-er)

6.6. Summary

Chapter VII  Concluding Remarks

7.0. Summary

7.1. Contributions

7.2. Limitation

7.3. Final Notes

BIBLIOGRAPHY

APPENDIX I    Source Code of Implemented CPSG95

APPENDIX II  Source Code of Implemented CPSG95 Lexicon

APPENDIX III  Tested Results in Three Experiments Using CPSG95

 


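【A Challenge of Chinese Syntax, Part 1: The Condensed if-then Construction】

Me:
Chinese has an extremely common sentence pattern that on the surface looks like a subject-predicate relation (S-Pred) or Topic+Clause, but actually expresses a conditional or hypothetical cause-effect relation in condensed form. The first part is usually a VP or a clause; the second part is a Pred, occasionally a clause. I am considering giving this relation a name of its own: not S, not Next, not Conj. What should it be called? In essence it is a conditional adverbial, Cond-Adv.
A literature survey is in order, to see how Chinese grammarians have described this phenomenon.

他要来就好了 ==【 if】 他要来 【then it】 就好了 == 如果他要来,【那】就好了。 ("If he would come, that would be great.")

LOCK状态下连按两下HOME可以快速启动照相机 ==
【if】LOCK状态下【你】连按两下HOME【then 你就】可以快速启动照相机 ("In the LOCK state, [if you] press HOME twice, [then you] can quickly launch the camera.")

喝粥吃不饱 《-- 【if 我们】喝粥【then 我们就】吃不饱 ("[If we] eat porridge, [we] will not get full.")

This pattern seems to have small function words that trigger it, but they are subtle and varied in form and hard to pin down. The sentences above are offhand examples; the triggers in them probably include 就 ("then"), 可以 ("can") and 也 ("also, still"). 喝粥吃不饱 seems to depend on the resultative complement 不饱 ("not full"). There are also cases where both parts are clauses:

她来我走 == 【if】她来【then】我【就】走。 ("If she comes, I leave.")
她不来我也走 《-- 【if】她不来【then】我也【仍然要】走 ("Even if she does not come, I will still leave.")

她来我走 seems to rely on the parallel structure (她来, 我走) and the contrasting predicates (来 "come", 走 "leave"). English can hardly be this compact.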

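Bai:
@wei 要来 can also be a contraction of 要是来 ("if [he] comes").

Song:
The logical relations inside these condensed complex sentences are context-dependent. Someone who works desperately for money and neglects his health gets criticized as 要钱不要命, meaning "(for the sake of) money (he) disregards his life". From the mouth of a robber speaking to his victim, however, it means "(if you) want to keep your money, (you) forfeit your life". It could also be "(I) (only) want the money, not (your) life". This is ambiguity at the level of complex-sentence meaning.

Bai:
A melon seller says 不甜不要钱 ("not sweet, no charge"). It is clearly ambiguous, yet nobody reads it any other way, for example as "not sweet and free of charge"; doing so would look like quibbling.

Song:
"Because they are not sweet, they are free; help yourselves."

Bai:
Facing a robber, a victim who wants to stay safe will act on the reading most cooperative with the robber. Someone strong enough to take the robber out might deliberately adopt a reading unfavorable to the robber. Robbers are unreasonable to begin with.

Me:
This really is a hard one. It is hard to recognize, yet failing to recognize it has bad consequences. For example, if the semantic grounding task is sentiment extraction, more than half of such sentences are hypothetical in nature: they do not state facts and should not be extracted.
"x 不好不要钱" ("if x is no good, no charge") is not a judgment that x is bad. It is the hypothetical condition "if x is bad", and a condition carries no sentiment.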

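Song:
Not necessarily hypothetical. "(我)(只)要钱,不要(你的)命。" ("I only want the money, not your life.") is not hypothetical.

Me:
"x 不降价就不要买" ("do not buy x unless its price drops") does not assert whether x's price has dropped either. For x, 降价 ("price drop") is generally seen as a good thing, and 不要买 ("do not buy") as an act unfavorable to x. Take "x 不增加电池寿命就不喜欢了" ("if x does not improve its battery life, [I] will stop liking it"): the speaker may well have liked x, but a minor complaint is implied here, and its cause is battery life. All such cases require recognizing the conditional before the sentiment and its cause can be judged accurately.

How did Chinese come to develop this way of expressing things, without the explicit function words 如果, 要是, 假如, 倘若 (all meaning "if")? And it is especially common in speech. When English drops "if", it has to use inverted word order, which still leaves an explicit formal trace:
Had I done that I would have lost
Should they get there in time they would make it

Song:
Baidu Translate output:
Had I done that I would have lost
如果我这样做了,我会失去
Should they get there in time they would make it
如果他们及时赶到那里,他们会做到的

Me:
Quite good. For Chinese generation, the MT system used the explicit subordinating conjunction 如果 ("if").
Chinese analysis, on the other hand, cannot get around the fact that colloquial patterns routinely omit function words.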

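Wang:
The more familiar a matter is to everyone, the more people use abbreviations and colloquial speech. That is when grammatically ambiguous sentences arise, because everyone understands tacitly; it is all too familiar. This is called context. In less familiar territory people lean toward written style, where ambiguity is rare. So I do not think ambiguity is that serious a problem; when you run into a case, just memorize it as a special case. English must also have plenty of elliptical expressions in everyday speech. It is human nature: what is very common gets abbreviated, since everyone understands anyway.

Me:
What can be memorized is what can be lexicalized. Open-ended sentence patterns are hard to put into a lexicon, so rote memorization will not work. Function-word omission in Chinese is, on the whole, a challenge at the syntactic level, not something a lexicon can solve. Besides subordinating conjunctions, omission of prepositions is also extremely common:

这件事儿我的意见大家还是要往前看
== 【关于】这件事儿【依】我的意见【,】大家还是要往前看
("[On] this matter, [in] my opinion, we should all still look ahead.")

Translated into English, these prepositions must never be omitted, otherwise the result is Chinglish: on this matter in my opinion, .......

* this matter my opinion we all should look forward

Wang:
关于这件事我的意见是 ("on this matter my opinion is"). Couldn't it be 是 ("is") that was omitted? Spoken English surely has omissions too.

Me:
Possibly. 是 is also a function word, so that is still function-word omission.
What makes Chinese so troublesome is this: being a language poor in morphology, it should by rights rely more on function words and word order to make up for the lack of form. In fact, Chinese function words are often omitted, and word order is far freer and more flexible than we imagine.

Bai:
还是 has distinct adverb and verb senses.

Me:
It simply pushes explicit form to its limit, forcing us to resort to implicit form (including common sense) in order to communicate and understand. If Chinese can be tamed, most human languages will be a piece of cake.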

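Song:
It also relies on context.

Bai:
意见 ("opinion") involves three slots: who holds it, about whom or what, and with what content. In fact none of the slot fillers is missing.
Even without function words, the slots get filled. But when there is more than one way to fill them, problems arise. In situations involving fair trade, setting up slots and filling them are both implicit-form devices. Choosing a reading that is clearly unfavorable to the counterparty is too low a move. 不甜不要钱 is such a case.

Me:
When explicit form is available, dependence on implicit form (subcat and the like) is reduced.
Pragmatics is another layer and can be set aside at first. It is enough to parse out the if-then frame.

Bai:
"if ... then" is favorable to the counterparty; "... and ..." is unfavorable to the counterparty.

Me:
I plan to add a CondAdv link to the links, splitting off from the current S, Next and Conj the batch of cases that express this kind of condition. Next is increasingly narrowed from a default to pure [continuation]; Conj is [coordination], S is [subject], and CondAdv is [condition].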
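The link inventory just described can be sketched as a small enum plus one toy relabeling rule. This is only an illustration under stated assumptions: the names `Link` and `label_link` are hypothetical, the input is assumed to be already split into two parts, and the trigger list is the handful of words mentioned earlier in the discussion (就, 可以, 也), not a complete inventory and not the author's actual parser logic.

```python
from enum import Enum


class Link(Enum):
    S = "subject"
    NEXT = "continuation"
    CONJ = "coordination"
    COND_ADV = "condition"


# Trigger words taken from the discussion; illustrative, not exhaustive.
TRIGGERS = ("就", "可以", "也")


def label_link(first: str, second: str) -> Link:
    """Toy rule: label the link between the two parts of a condensed sentence.

    If the second part contains a trigger word, treat the first part as a
    conditional adverbial; otherwise fall back to the default continuation.
    """
    if any(trigger in second for trigger in TRIGGERS):
        return Link.COND_ADV
    return Link.NEXT


for first, second in [("他要来", "就好了"), ("她不来", "我也走"), ("她来", "我走")]:
    print(first, second, label_link(first, second).name)
```

The last pair, 她来我走, falls through to `NEXT` under this rule. That matches the observation in the discussion that such sentences have no trigger word and depend on parallel structure and contrasting predicates, which a word-driven rule alone cannot see.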

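Bai:
As a child I was already puzzled by 不学无术 ("unlearned and unskilled"). Is it "because one does not study, one has no skill", or "neither studies nor has skill", or "if one does not study, one has no skill"?

Me:
There is also 不破不立 ("no destruction, no construction"): without tearing down, how can one build? Without study, no skill.

Bai:
It really is not "then"; it is "because ... therefore".

Me:
[because-therefore] and [if-then] are already quite close. One can find a hypernym that covers both and blur the distinction. Both are completely different from [coordination].

Bai:
if-then is a potential relation; because-therefore is a relation that has already been grounded.

Me:
不学无术 also makes good sense as [coordination]. [if-then] is hypothetical, whereas [because-therefore] may or may not be realized: unless the verbs inside it carry tense-aspect particles, because-therefore does not emphasize realization.

Bai:
"If-then" is the gatekeeper, "because" is the ticket, "therefore" is the resident.

Me:
Lin Biao neither studied nor had skill: "不读书 不看报 一点马列主义都不讲" ("reads no books, reads no newspapers, speaks not a word of Marxism-Leninism"). Lines like this were common back in the campaign criticizing Lin.
"不甜不要钱,甜也不要钱。" ("Not sweet, no charge; sweet, still no charge.") The first clause omits "if"; the second omits "even if", a concession, yet the only formal marker is the nearly all-purpose 也.

Bai:
"他这人不学无术" ("That man is unlearned and unskilled.")
"这对冤家还真是不打不成交" ("Those two adversaries truly became friends only by fighting.")

Me:
Idioms are easy to handle, but sentences of this type need not be idioms at all: "这对冤家不打得头破血流不罢休" ("Those two will not stop until they have beaten each other bloody").
For a pattern like 【不 VP 不 Pred(result)】, a predicate such as 罢休 ("give up, stop") can be marked in the lexicon with an implicit-form feature indicating resultative meaning.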
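As a rough illustration of the 【不 VP 不 Pred】 pattern, the following sketch splits a string into its condition and result parts with a regular expression. It works on raw character strings, with no word segmentation or parsing, which is a simplifying assumption; the function name `split_bu_bu` is hypothetical, and a real grammar would test for this pattern on a subtree rather than on the surface string.

```python
import re

# 不 + condition + 不 + result. The lazy quantifier keeps the condition as
# short as possible, so the split happens at the first later 不.
BU_BU = re.compile(r"^不(.+?)不(.+)$")


def split_bu_bu(text: str):
    """Return (condition, result) if text fits the 不X不Y shape, else None."""
    match = BU_BU.match(text)
    if match is None:
        return None
    return match.group(1), match.group(2)


for example in ["不甜不要钱", "不破不立", "不打得头破血流不罢休", "不学无术"]:
    print(example, split_bu_bu(example))
```

不学无术 is deliberately not covered: its second half uses 无 rather than 不, so the surface pattern says nothing about whether it is conditional, causal, or coordinate, which is exactly the ambiguity raised above.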

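Although English has no counterpart to the condensed conditional pattern of Chinese, English too wavers between the [subject clause] and the [conditional adverbial clause]. Evidently [adverbial] and [subject] are sometimes really quite close:

it won't be good that you are not coming
it won't be good if u r not coming
你不来不好 ("It is not good if you do not come.")

This kind of pattern is limited to judgmental predicates such as not good, does not work. Sometimes I think that if the Chinese condensed conditional were translated word for word, ungrammatical as the result is, foreigners could probably still understand it. (This remains to be verified. If it is understandable, that shows the condensed form still carries traces of a logical chain, even without the help of explicit function words.)

? u come, I will leave
* u not come, i stay
* not work not succeed
* no work no success
* not leave won't work

Pidgin comes precisely from word-for-word translation of condensed forms like these:

不走不行 ("not leaving will not do")
不工作不成事儿 ("no work, nothing gets done")
你来我【就】走 ("you come, I leave")

The greatest benefit of the Chinese condensed form is the omission of meaningless subjects, which is obvious in comparison with English:

不劳动不得食 ("no work, no food")
if you/we/one/anyone/everyone/they/people do not work
you/we/one/they/he/she should not get food

English has to add these inexplicable subjects to both the main clause and the conditional clause; the Chinese condensed form simply drops them, defaulting to a universal truth that applies to everyone. The most redundant part is that, for political correctness, when filling in these meaningless subjects one often has to say he or she, or s/he: if a person does not work, he or she should not get food.

"no hard work no success" is probably the condensed pattern accepted in English that comes closest to the Chinese one.

Wang:
@wei What form does the output of your sentence parsing take? What do you use to express the semantics obtained from parsing?

Me:
tree (& roles) / IE Templates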

 

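【Related posts】

【The feeling of being just one step away from the pearl on the crown】

On parsing

【On Chinese NLP】

【Pinned: Overview of Liwei's NLP blog posts】

《朝华午拾》 Table of Contents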

 

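【Liwei's Popular Science: A Plain-Language Account of "A Dream Come True"】

I joined the fun and entered 【Essay Contest: A Dream Come True】, but some readers could not figure out what the dream was or in what sense it had come true. Evidently the dream I was so excited about was not explained in a way a layperson could follow, or even in a way scientists could understand. As popular science, that was quite a failure.

Let me see whether I can explain it in plain words:

Human language is simple in one sense and complex in another. It is simple enough that almost everyone, however slow, learns it in childhood and communicates without trouble. But people mostly know how to use language without knowing why it works. Only linguists, who study language professionally, have kept trying to explain the why. And language is the kind of thing that, once you start studying it, turns out to be God's prank: very complex and unfathomably deep.

Thousands of years of exploration produced something called grammar, which summarizes the inner regularities of language. With it, endlessly varied sentences can be analyzed into a finite set of sentence structures, which helps in understanding and mastering language. The instinctive human capacity to understand language thereby becomes traceable. This is the knowledge teachers gave us in school grammar class, in particular a way of drawing the structure of a sentence (grammar diagramming): methodically establishing the structural relations of subject, predicate, object, attributive, adverbial and so on. It has proven a very useful skill for analyzing language. All this was originally meant to strengthen our own language ability.

After computers appeared, artificial intelligence scientists thought of teaching computers human language. This field is called Natural Language Understanding, and its core is the automatic analysis (parsing) of human language, with results usually expressed as tree diagrams like those learned in grammar class. Automatic language analysis matters because it is the core technology of language processing. An automatic analysis engine that is high in quality, robust to noise, and able to run on big data is a nuclear weapon. With such analysis one can help accomplish many language tasks, such as human-machine dialogue, machine secretaries, intelligence extraction, public-opinion mining, automatic summarization, machine translation and hot-topic tracking. (There are also many everyday language applications, such as keyword search, spam filtering, document classification, authorship identification, and even automatic summarization and machine translation, that do no analysis and no understanding. They treat language as a black box, define the task as a mapping from input to output through the black box, and use statistical models to learn to simulate it. This can go a long way too. These approximate methods that bypass structure and understanding are, thanks to their robustness and other advantages, in fact the dominant mainstream practice.)

In automatic language analysis, English has been studied fairly thoroughly. Chinese is still just getting started. One reason is that Chinese is harder to learn than European languages: ambiguity is more severe, broad regularities are few, and small rules and exceptions are many, so it is hard to pin down. Hence a number of plausible-sounding but misleading popular claims, such as "words have no fixed category; the category is settled only within a sentence" and "sentences have no fixed grammar; it is all parataxis by meaning (意合)". In short, automatic analysis of Chinese is widely recognized as a meaningful but very hard task. Teaching a computer to analyze the full variety of sentences in real-world social-media big data is harder still. It is this dream of automatic Chinese analysis that has recently been realized.

Can such an achievement be called a dream come true?

[11]方锦清  2013-10-17 15:04

I cannot follow this. Could you explain further?

Blogger's reply (2013-10-17 19:18):

This is a true story of a research dream that finally came true after a quarter of a century. When the protagonist was an assistant researcher, full of enthusiasm and blissfully unaware of his limits, he drew up a blueprint for natural language understanding (NLU) of Modern Chinese, one of the most subtle languages in the world. Its core was automatic grammatical analysis of endlessly varied Chinese sentences. The blueprint was too far from reality, and realizing it seemed beyond human power. Yet a quarter of a century later, through accumulation and opportunity, the right time and the right place, the protagonist has finally realized this ideal, and it is now being put to use on real-world big data.
The mission impossible accomplished.

The contest entry is here; please support it: 【Essay contest entry: A Dream Come True】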

 

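【A Dream Come True】

  • This is a true story of a research dream that finally came true after a quarter of a century. When the protagonist was an assistant researcher, full of enthusiasm and blissfully unaware of his limits, he drew up a blueprint for natural language understanding (NLU) of Modern Chinese, one of the most subtle languages in the world. Its core was automatic grammatical analysis of endlessly varied Chinese sentences. The blueprint was too far from reality; realizing it seemed indefinitely far off and beyond human power. Yet a quarter of a century later, through accumulation and opportunity, the right time and the right place, the protagonist has finally realized this ideal, and it is now being put to use on real-world big data. The mission impossible accomplished.

Twenty-five years, and my heart for Chinese has dwelt with me as if in my own cottage; I have not forgotten it for a single day! To put it more grandly, for a linguist the heart for Chinese is the Chinese heart of an overseas wanderer, haunting his dreams.

   For many years, because of my work, I was plunged in the ocean of English processing. Only in the last two years has English left nothing more to do: what should be done has mostly been done, and what need not have been done I tried as well, tasting every herb like Shennong, through every hardship. The great mountains and rivers are behind me, and the work is already in practical use on big data, so I should be able to set it down. In recent years, as my white hair has flourished and time has slipped by, worry has crept up on me. In the snap of a finger, time flows on like this; I feared I would never in this life get the chance to return to Chinese processing, which would be a lifelong regret.
   Everyone says Chinese is the most elusive, most mysterious and least logical language in the world, and naturally also the hardest for machines to process. Some even claim that Chinese has no grammar and that understanding it depends entirely on "parataxis by meaning" (意合), an unprecedented challenge to machine natural language understanding and artificial intelligence. The destination is that lofty, yet the current state is rather bleak: the whole field of Chinese processing has been stuck for decades in the shallow whirlpool of segmenting character strings into words. What is word segmentation? At most the first ten steps of a Long March.
   Twenty-five years. Many thoughts and ideas have circled in my head for years without being realized; now is the time. Not to climb the Everest of Chinese in this life would make me unworthy of being an ethnic Chinese PhD in linguistics. As Master Tao said: "Homeward I go! My fields and garden are running to weeds; why not return?"

Let the mountains make way: Chinese processing, I am back!

At the start of my career in the 1980s, for the multilingual machine translation project of the Dutch company BSO, I designed a 【Chinese Dependency Grammar】 modeled on English dependency grammar. It covered nearly all the important sentence patterns of Modern Chinese, and I drew countless Chinese dependency trees that looked truly beautiful. But that was only strategy on paper. Although the grammar was designed for machine processing, actually implementing it was far from easy. In fact, at the time it could only be a research dream. And the dream lasted 25 years!

Looking back now at that blueprint and comparing it with the dependency parser recently implemented on the machine, I see one continuous line, and I am filled with emotion. Since my youth I have had a green dream: so fond of trees, admiring trees, obsessed with drawing trees, experiencing the beauty of green as if painting a scene of paradise, dreaming of one day cultivating this tree of linguistics with my own hands and creating miracles for information technology. Now the dawn of realization has finally come, with the right time, place and people; the toil and the enjoyment of R&D have merged into one. What a wonderful experience.

Please enjoy a few screenshots of the rough yet delicate syntactic-tree blueprints that the young Liwei "hand drew" back then. (Pitifully, the only way to "draw trees" at the time was to count spaces and Chinese characters in a plain text editor, just as on one Lunar New Year's Eve I sat in the machine room counting characters to render Yamaguchi Momoe and made a calendar on an IBM-PC.) Comparing them with the graceful trees generated fully automatically by the freshly built Chinese parser, I have to say that a dream come true is no longer a legend.

(1) The blueprint of 25 years ago (the dream):

The implementation 25 years later (come true):
(2) The blueprint of 25 years ago (the dream):

The implementation 25 years later (come true):

(3) The blueprint of 25 years ago (the dream):

The implementation 25 years later (come true):

(4) The blueprint of 25 years ago (the dream):

The implementation 25 years later (come true):

但那时我在上海也有一个惟一的不但敢于随便谈笑,而且还敢于托他办点私事的人,那就是送书去给白莽的柔石。 ("But at that time in Shanghai I did have one person, the only one, with whom I not only dared to chat and joke freely but whom I even dared to ask to handle private matters for me: Rou Shi, the one who took the books to Bai Mang.")

(5) The blueprint of 25 years ago (the dream):

The implementation 25 years later (come true):

(6) The blueprint of 25 years ago (the dream):

The implementation 25 years later (come true):

胶合板是把原木旋切或刨切成单片薄板, 经过干燥、涂胶,  并按木材纹理方向纵横交错相叠, 在加热或不加热的条件下压制而成的一种板材。 ("Plywood is a kind of board made by rotary-cutting or slicing logs into single thin veneers, which are dried, glued, stacked with their grain directions crossing, and pressed with or without heat.")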
 

 

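【Related posts】

First draft (2012-10-13): ScienceNet, 【Liwei's Essay: My Heart for Chinese Dwells at Home】

【Chinese Dependency Grammar: from Wei's archives (written 25 years ago; in a browser, please select the GB national-standard encoding to avoid garbled text and distorted diagrams)】:
ChineseDependencyGrammar1.txt
ChineseDependencyGrammar2.txt
ChineseDependencyGrammar3.txt

【Liwei's Popular Science: The Beauty of Grammatical Structure Trees (English examples)】

【Liwei's Popular Science: The Beauty of Grammatical Structure Trees (Chinese examples)】

【Liwei's Popular Science: A Plain-Language Account of "A Dream Come True"】

【Liwei's Popular Science: Natural-language parsers are LIGO-style detectors that reveal the mysteries of language】

【The feeling of being just one step away from the pearl on the crown】

On parsing

【On Chinese NLP】

【Pinned: Overview of Liwei's NLP blog posts】

《朝华午拾》 Table of Contents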

 

 

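【One Parse a Day: The Mountain Grows No Higher, So Why Fear It Cannot Be Leveled?】

"终于冰箱安装到位了, 欣喜之余发现有点儿小问题, 就联系了店家, 店家主动帮助联系客服上门查看, 虽然最终没有解决问题, 心里有点儿遗憾, 但是因为不影响使用, 所以也就无所谓了." ("The fridge was finally installed. Amid the joy we found a small problem and contacted the shop; the shop took the initiative to help contact customer service for a home inspection. Although the problem was not solved in the end and I felt a little regret, it does not affect use, so it does not really matter.") This sentence is complicated enough. Currently it parses like this:

Besides being the subject of 主动帮助 ("actively help"), 店家 ("the shop") was also made its logical object by the semantic middleware. That is overkill: the middleware assumed that the object slot in the subcat of 帮助 had not been saturated, but a verbal object (ObjV) also counts as an object. A small adjustment can fix this.
The last error is a long-distance one. 虽然 ("although") should find 但是 ("but"); the two form a strong collocation, but several clauses stand in the way. The clauses before 但是 do not matter: since the collocation is strong, one can gallop past them without fear of being offside. But after 但是 comes another 因为 ... 所以 ("because ... therefore"), and this nesting is a nuisance: the landing site of 但是 is therefore not its first clause but its second, the one with 所以. In other words, the human reading is that the concessive adverbial clause introduced by 虽然 should attach, over a long distance, to the final 无所谓 ("does not matter"); only that fits the syntactic and semantic logic. Sentences that seem to be written casually on social media also have this long-distance syntactic problem of intricately nested clauses. (The poster is probably an educated type; an unschooled person would not use so many 因为/所以 and 虽然/但是, let alone nest them.) Finally, 联系客服上门查看 ("contact customer service to come and inspect") has a bug where the subcat lexicon entry is not in place; a small case, not hard to fix. The small bugs have been debugged: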
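The attachment logic described here, where 虽然 must land on the 所以 clause because 但是 opens a nested 因为 ... 所以, can be sketched as follows. This is a toy under explicit assumptions: the sentence is already split into clauses, correlatives are detected by plain substring search, and the function name `attach` and the two-pair table are illustrative, not the author's parser.

```python
# Correlative pairs: the opener introduces a subordinate clause, the closer
# marks the clause that the subordinate clause depends on.
PAIRS = {"虽然": "但是", "因为": "所以"}


def attach(clauses):
    """Map the index of each opener clause to the index of the clause it modifies."""

    def find_closer(start, closer):
        # First later clause that contains the closing correlative.
        for j in range(start + 1, len(clauses)):
            if closer in clauses[j]:
                return j
        return None

    def head_of(j):
        # If clause j itself opens a nested pair, the real landing site is
        # the head of the clause that closes that nested pair.
        for opener, closer in PAIRS.items():
            if opener in clauses[j]:
                k = find_closer(j, closer)
                if k is not None:
                    return head_of(k)
        return j

    result = {}
    for i, clause in enumerate(clauses):
        for opener, closer in PAIRS.items():
            if opener in clause:
                j = find_closer(i, closer)
                if j is not None:
                    result[i] = head_of(j)
    return result


clauses = ["虽然最终没有解决问题", "心里有点儿遗憾", "但是因为不影响使用", "所以也就无所谓了"]
print(attach(clauses))
```

Both the 虽然 clause (index 0) and the 因为 clause (index 2) end up attached to the final clause (index 3), the one containing 无所谓, which is the reading argued for in the text.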

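Bai:
这问题问的 ("The way this question is asked ...")

Me:
这事儿做的。 ("The way this was done.")
这澡洗的。 ("What a bath that was.")
这牛吹的。 ("What bragging that was.")
这问题问的。那叫一个水平。 ("The way this question was asked: now that is class.")
这日子过的。那叫一个窝心。 ("The way life is going: how galling.")
这戏演的,那叫一个烂。 ("The way this play was acted: simply awful.")
这话说的,那叫一个高。 ("The way that was put: simply brilliant.")
This is a colloquial pattern of exclamation, syntactically subject-predicate but logically verb-object: 这OV的. The default seems to be negative, but positive uses are not rare.
这OV的 --》瞧人家这OV的 ("just look at how they ...")
--》【human】+这+OV+的+punctuation
The underlying structure should be: human+V+O+V+得+【】 (complement omitted)
他问问题问得【那叫一个水平】
他过日子过得【那叫一个窝心】
他演戏演得【烂】
他说话说得【高】

[Parse tree figures: 0822a, 0822b, 0822c, 0822d, 0822e]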

 

 


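【Semantic Computing Salon: From Segmenting "10年中学文化课" to System Design】

Me:
Elder Mao, the ten years of the Cultural Revolution, 1966-1976, were my ten years of primary and secondary school. Was that easy for me? 10年中学文化课的时间不到一半 ("in those ten years, less than half the time went to studying academic subjects"); the rest was learning from workers, peasants and soldiers, learning to be a barefoot doctor, learning to drive a walking tractor.
Why is it 【十年中】【学文化课】 ("during the ten years" + "study academic subjects") and not 【十年中学】【文化课】 ("ten years of secondary school" + "academic subjects")?

Guo:
@wei Taken alone, this sentence can indeed go either way. But you have so many 学 ("learn") after it ...
At least for this example, statistics and "deep neural" RNNs and the like do have merit. Of course, the two parses do not really differ in essence. No need to rack your brain over it.

Me:
How so? Because 学 is frequent, 中学 is less likely to form a word? How would a statistical model work in this case to show its merit? I would like to hear the details.
Big data says there is five-year secondary school and six-year secondary school, and ten-year secondary school is extremely rare, which reflects common knowledge about the school system. But this knowledge is not very powerful and can hardly be relied on, because it is not positive evidence. If a boundary dispute arises over 六年中学 ("six years of secondary school") in a sentence and gets direct support from big data, that is positive evidence, and it is strong. Negative evidence does not count for much, because it faces the ocean of [not six] (or [not five]), which is in theory boundless; that bit of evidence was drowned long ago.

Guo:
Statistics divides into long term / global vs short term / local.

The "big data" you describe is really the former.

Some of the currently hot "deep neural" approaches, intentionally or not, take more account of the latter. For example, LSTM, the "pearl on the crown" of deep neural networks, stands for Long Short Term Memory. Although it does not explicitly compute and use "on-the-fly statistics", that sense can still be felt.

Me:
@Guo Right. The relation between local and global is tricky.
[Parse tree figure 0821e]

This parse looks like a lucky hit and is probably pure dumb luck; let us not theorize about it.

Bai:
N+N scores low to begin with. A structure with an adverbial and a verb is more "typical". N+N is a last resort, falling back on word formation to tidy up fragments. When there is an adverbial and a verb, who bothers with N+N? No matter how many years of secondary school, it cannot compete against this structural factor. That is, even among rules, some are fit for the parlor and some only for the kitchen. If no parlor-grade rule applies, do whatever you like in the kitchen. But if a parlor-grade rule applies, nobody goes to the kitchen.

Me:
This is not just an N+N issue. In the vast majority of segmentation modules, processing has not even reached the N+N step, so this problem may actually challenge many existing segmenters: 十年/中学/文化课 or 十年中/学/文化课 ?
One commonly used segmentation heuristic prefers paths whose words have evenly distributed syllable counts, and the former is clearly far more even than the latter.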
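The evenness heuristic mentioned above can be made concrete with a small sketch. It assumes that one Chinese character counts as one syllable and uses the population variance of word lengths as the unevenness score; both the scoring choice and the function names are illustrative assumptions, since the discussion does not specify how any particular segmenter computes evenness.

```python
from statistics import pvariance


def unevenness(words):
    """Population variance of word lengths in characters; lower means more even."""
    return pvariance([len(word) for word in words])


def pick_most_even(candidates):
    """Return the candidate segmentation with the most even word lengths."""
    return min(candidates, key=unevenness)


reading_a = ["十年", "中学", "文化课"]   # word lengths 2, 2, 3
reading_b = ["十年中", "学", "文化课"]   # word lengths 3, 1, 3

print(pick_most_even([reading_a, reading_b]))
```

With these two candidates the heuristic picks 十年/中学/文化课, which is the reading the surrounding discussion treats as wrong in this context. That is the point being made: a segmenter relying on this heuristic alone would be misled by this sentence.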

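Bai:
In terms of syntactic layers too, it is 狗/咬吕洞宾 ("the dog / bites Lü Dongbin"), not 狗咬/吕洞宾.

Me:
The real counterexamples are the overlapping type.
How syntax describes layers is actually irrelevant, because multi-layer segmentation is just a technical strategy and (usually) takes no part in parsing itself; as long as the final result is 狗/咬/吕洞宾, it is fine. In fact, even in syntax the layering of SVO is quite contested in Chinese, unlike Western languages, where the evidence for V+NP is so ample.

Bai:
That is a bit circular.

Me:
That is how the interface currently is in most systems. The segmentation output has no layers, although segmentation internally can and should use layers. There must be research systems that do not adopt this interface, but most practical systems seem to be this simple.

Bai:
Keeping all of them is no big deal; just hand them over to syntax. Who says a unique result must be decided before passing it up? Many systems now accept a word lattice (词图) rather than a word stream (词流). For a formalism like neural networks that naturally accepts uncertainty, accepting a word lattice is no more of a burden than accepting a word stream.

Me:
The data structure gains a dimension, and for traditional systems that touches a great deal. A word is not just a word; the word itself is not a simple object. In earlier systems the word stream was a string or at most a list of token+POS pairs, and adding a dimension to such a simple structure is manageable.

Bai:
Words, like phrases, can lock and unlock positions and compete for the lock on a position.

Me:
True. The word is the source of all potential structure and holds great potential. In the design, the lexicon should even be allowed to build structure internally and be unified with the parsing mechanism. Under such a design, adding a dimension to words is dancing in shackles and not easy to handle well. Nondeterminism is an appealing-sounding strategy that does not work so well. Otherwise, in theory, no dormancy and wake-up would be needed at all.

Bai:
It can be parameterized, with a continuous transition. If it is handled well, the pipe is thicker; if not, the pipe is thinner; at the extreme, it goes back to a single line. How many words are allowed to compete for the lock at one position can be parameterized. Whatever exceeds the capacity of the pipe is then handled by dormancy and wake-up.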

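Me:
Nondeterministic structure in a multi-layer system is like Pandora's box. Letting the demons out is easy; subduing them is hard, and the more layers there are, the more so. Perhaps the machine learning side is not afraid, since it is not humans who subdue the demons.

Bai:
Actually, one word with multiple POS tags, or with multiple subcats, is the same mechanism. There is not only the side where combinations multiply, but also the side where constraints multiply. No human needs to subdue the demons; the demons fight among themselves, and those that cannot win have no face to show. Just set the standard for "showing face" and leave the rest to the demons.

Me:
That is Chairman Mao's line: great chaos under heaven leading to great order under heaven. In the ten years of Cultural Revolution chaos the national economy came close to the brink of collapse, but it did not crash completely as in 1960. Besides dumb luck, that was because there was an absolute authority. This authority was cold and ruthless and would turn on anyone. The Red Guard rebels riding high today would be in prison tomorrow.

Bai:
The demons' fighting is orderly too. It is not great chaos; it is distributed representation.

Me:
Such systems are mostly hard to debug. By the time the result shows its face, it is settled, good or bad. As Stalin said, victors are not to be blamed.

Bai:
Local interaction, high autonomy.

Me:
Although the demons fight by rules that humans set, the specific details are hard to trace and therefore hard to correct. Of course, this flaw is not new; it is the common ailment of all black-box strategies.

Bai:
It is not a black box; it is rule-based, with distributed representation and local autonomy. Every detail of the fighting can be explained linguistically. In theory, once the lexicon is fixed, all overlapping-type segmentation ambiguities are already determined; whether to use a word stream or a word lattice is just a matter of encoding. Add the limit on pipe thickness, and the encoding is highly controllable too.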
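The claim that a fixed lexicon already determines every overlapping ambiguity, and that the "pipe thickness" can be a parameter, can be illustrated with a toy word lattice. This is a sketch under several assumptions: the lexicon is a made-up five-word set, single characters are always allowed as fallback words, and the `width` parameter ranks competing words at a position by length (longest first), a ranking criterion chosen here only for illustration.

```python
def build_lattice(text, lexicon, width=None):
    """For each start position, list the end positions of words starting there.

    Single characters are always allowed. If width is given, at most that
    many words (longest first) may compete for each start position.
    """
    edges = {}
    for i in range(len(text)):
        ends = [j for j in range(i + 1, len(text) + 1)
                if j == i + 1 or text[i:j] in lexicon]
        ends.sort(reverse=True)  # longer words first
        edges[i] = ends[:width] if width else ends
    return edges


def paths(text, edges, start=0):
    """Enumerate every segmentation (word stream) encoded by the lattice."""
    if start == len(text):
        yield []
        return
    for end in edges[start]:
        for rest in paths(text, edges, end):
            yield [text[start:end]] + rest


lexicon = {"十年", "十年中", "中学", "文化", "文化课"}  # toy lexicon (assumption)
text = "十年中学文化课"

full = list(paths(text, build_lattice(text, lexicon)))
narrow = list(paths(text, build_lattice(text, lexicon, width=1)))
print(len(full), narrow)
```

With no width limit the lattice encodes both competing readings, 十年/中学/文化课 and 十年中/学/文化课, among others. With `width=1` the pipe shrinks to a single line, which under the longest-first ranking used here is the same as forward maximum matching.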

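Me:
As Diao Deyi said, "only at this point does the tea begin to have flavor." I have high hopes for Bai and his design.

Bai:
Whether "10年" refers to a time span with a duration of ten years or is short for the year 2010 also needs to be disambiguated.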

 

【Related】

【Pinned: Overview of Liwei's NLP blog posts】

《朝华午拾》 Table of Contents

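【One Parse a Day: Semantic Analysis of Chinese Monosyllabic Verbs Is Tough】

Bai:
“她拿来一根漂亮的海草,围在身上做装饰物。” ("She brought a beautiful strand of seaweed and wrapped it around her body as an ornament.")

Me:
[Parse tree figure 0821a]

The logical subjects of 围 ("wrap") and 做 ("serve as") are missing. One reason is that the subcats of these two verbs do not themselves require 她 ("she", [human]) or 海草 ("seaweed", [physical object]). The semantic middleware currently follows a conservative strategy, because logical slot filling creates something out of nothing: better to leave a slot empty than to fill it wrongly, rather underkill than overkill, precision first.

How does human understanding work here? The bare 围 is hard to pin down, but the VP 【围在身上】 ("wrap around the body") inherits an unfilled [human] slot from 身上 ("on the body"), which 她 fills nicely as logical subject. Likewise, 做 is an all-purpose verb with no slot carrying specific semantic requirements, but the VP 【做装饰物】 (act as NP) opens an appositive semantic slot [physical object] that 海草 can fill: 【human】 把 (用) 【physical object】 围在身上; 【physical object】 做装饰物.
The syntactic subject of 围在身上 can be [human] or [physical object]: 一根漂亮的海草围在身上 ("a beautiful strand of seaweed wrapped around the body"). But in the underlying logical semantics, [human] is the logical subject in both cases.

Bai:
This example comes from extracurricular reading at first-grade level.

围 belongs to the verb subclass whose subcat is "attach, fix". Under topicalization, that is, with the thing being fixed serving as topic, the verb can by itself express the state that remains after the initiating action is completed.

Me:
And 海草 can be seen as [instrument] (including [material] adverbials), or as the [patient] of 围 inside the VP 【围在身上】.

Bai:
It is the logical object.

Me:
This is a difference in logical role caused by a difference in level.
In fact, doing such fine-grained semantic analysis for this class of Chinese monosyllabic verbs is very challenging. They are too polysemous; only after they form compound verbs, or even VPs, is the polysemy gradually ruled out so that a sense settles. This dynamic process of determining and filling subcats is quite arduous, if not impossible.

Bai:
房子盖在山上做行宫 ("The house was built on the mountain to serve as a temporary palace.")

Me:
[Parse tree figure 0821b]

盖-房子 ("build a house") counts as a compound word.
Again, the logical subject (deep appositive) of 做 was not linked to 房子.

Bai:
他给你打了一副手镯当嫁妆 ("He had a pair of bracelets forged for you as a dowry.")

Me:
[Parse tree figure 0821c]
The SVO is all there, but the O of the main clause got disconnected. This is minding the tail and neglecting the head; it needs a good debug:

[Parse tree figure 0821d]

This one is fairly perfect now. 打手镯 ("forge bracelets") has also been entered as a separable verb, just like 打酱油 ("buy soy sauce"). This treatment matters, because 打 is an all-purpose verb with who knows how many senses (if senses within collocations are counted).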

 


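《朝华午拾: The Days of Growing Up Alongside the Women's Militia》

In 1976, after graduating from high school, I was sent down to You Village (尤村), next to Yandun Town (烟墩镇) in the mountains of southern Anhui. Before long came the "double rush" (双抢: rushing to harvest the early rice and rushing to transplant the late rice), which truly worked people half to death. The double rush was the best season of the year for earning work points: double points were given, sometimes even triple, for more than 20 days in a row, getting up before dawn and returning only at midnight. Even the strongest men had to collapse from exhaustion before they could rest for half a day. This disguised capitalist "material incentive" of double work points from the People's Commune was powerful: however tired, nobody dared slack off. If you feared fatigue and worked less, the points went to someone else, and at the year-end distribution your share of rice, sweet potatoes and sesame oil shrank accordingly. In truth, the wool comes off the sheep's own back: the production team's annual harvest is a fixed quantity, and giving more or fewer work points is merely a way of redistributing wealth. If one relied purely on the peasants' socialist zeal and paid the same points during the double rush as at other times, the total number of points would go down and the value of each point would go up, but the enthusiasm stirred up by material incentive would be gone. Who says economics had no use in the People's Commune, "large in size and collective in nature"?

The production team looked after us city kids and gave us higher work points. Each of us three educated youths was given seven and a half points, equal to the points of a full female laborer, and that included two hours of early work before breakfast; otherwise it was only six and a half. That year ten work points were worth RMB 0.65. I worked among the women for more than half a year, and at the year-end distribution I earned back all my grain ration, plus half a bed's worth of sweet potatoes and four or five jin of sesame oil.

The full female laborers were mostly young girls or young wives, all skilled hands at farm work. A dozen or so girls of You Village, in the prime of youth, had formed a "women's militia squad". Unwilling to be idle, they ran lively activities and were famous for a time. By the time I arrived, though, the squad had declined, because most of its core members had reached marrying age; with relatives and matchmakers near and far busy arranging matches, the group activities could not continue. Even so, growing up alongside the women's militia in the "vast world" (广阔天地) was, in those days, full of revolutionary romanticism, intoxicating and exciting. It took away more than half the hardship of farm work.

Our village head was shrewd but hot-tempered, and bald as well, unpleasant to look at. His three sisters, however, were each as pretty as a flower. The eldest and the second sister were both mainstays of the women's militia squad; the youngest, just 14 or 15, fair-skinned and quick to blush, worked in a commune-run workshop. The eldest sister had just married a tall young man from the village, good-looking but brainless. It was a love match, which made her the luckiest of the girls, yet it was still a flower stuck in the wrong place. Not long after I arrived, this eldest sister was given the lighter job of threshing on the threshing ground instead of going into the paddies. I worked with her, just the two of us on the ground, and she always looked after me. From then on I was prone to a wandering heart, until one day I noticed her belly growing bigger and bigger and realized that she was different from the other militia girls: she was already a wife. Later I worked with the second sister and a group of girls and young wives weeding the paddies (that is, using a rake to turn over the weeds in the paddy so that they cannot grow). The second sister kept invading my territory, reaching her rake over to help me. Without her help I probably could not have managed even half their speed. I kept scolding her, "No trespassing," and she would just smile without answering and carry on as before. She was very pretty, slightly plump and very sturdy, like an "iron girl", yet understanding, with a temperament that outdid Xue Baochai. She was the one I admired most. But a matchmaker was arranging a match for her at the time, and not long after I left the village she married far away. I felt quite upset when I heard the news.

To me these farm girls were all fairies. Raised from childhood in such harsh conditions, each was nonetheless in her prime, valiant and spirited, without losing a farm girl's kindness, simplicity and quick wit. I felt that no local man was worthy of them. They themselves tried to fight fate and the matchmakers, but in the end they were married off one by one and vanished into the sea of people.

【Related】

《The Last Educated Youth of the Mao Era》

【朝华午拾 Highlights: A Map of Liwei's Wanderings】

《朝华午拾》 Table of Contents (pinned)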

《毛时代的最后知青》

我是文革后最后一批插队的,算是赶上了末班车。当时岁数不够,按照政策可以留城,可是当年的情形是,留城待业常常是永久失业,不象插队,几年之后,还有上调招工或者升学(工农兵学员)的前途。另外就是,由于时代风尚的影响,留城的好像比下乡的矮人一截似的。我有一位同班好友,独子,留城以后,见面说话就没有我们下乡知青那样器宇轩昂。 

插队的故事对我是太久远了,恍如隔世。这也是我一直想写,却感觉心有余而力不足的原因。虽然如此,插队的片断却不时在心中翻腾。虽然连不成篇,这些记忆残片却是刻印在脑海最深处的。 

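I was sent to a fairly remote part of the mountains of southern Anhui, a village called You Village, right next to the town. Three of us went down to the village together. Brother Chen came from a family of traditional Chinese medicine doctors; mature and honest, he brought most of a trunk of medical books. Brother Zeng, the son of a demobilized soldier, had a somewhat slack and cynical air. I carried Bo Bing's 《简明英语语法》 (A Concise English Grammar) and a transistor medium-wave radio, hoping to keep up with the radio course 《广播英语》 (Broadcast English). As soon as the three of us got off the bus in town, the old Party secretary of You Village led a group to welcome us into the village with gongs and drums, and we were put up for the time being in the home of a commune barefoot doctor, where we stayed for two months. Later the village used the settling-in allowance the state had issued for the three of us to build three large rooms, as drafty and cold as a warehouse, and only then did we have a home of our own.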

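For the first month we ate at the villagers' expense, taking our meals with each household in turn. Most peasants were simple and hospitable, and on the day we came the host would usually prepare a few more dishes than usual. But households differed in means and the food was uneven; some of it was truly hard to swallow, yet for fear of being mocked as pampered city youths we forced it down. The worst thing was not the quality of the food but the hygiene. One evening at dusk I pushed open a door on the way to supper and came away with something sticky all over my hand. Back home the three of us compared notes and concluded it was the residue of either snot or thick phlegm, and we all felt like throwing up.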

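Later the three of us decided to cook for ourselves and divide the chores. I still remember getting up at dawn to carry water from the pond, my small, thin frame out of all proportion to the buckets, shivering in the cold wind of early spring. Still, cooking for ourselves was far more satisfying, and every working day I longed to knock off early and enjoy our own supper. Our most frequent and most delicious dish was soybeans stewed with salt pork. The salt pork was sent by our parents to improve our diet; each time we cut off a small piece of fat and rendered it over a slow fire, and the soft soybeans glistening with oil were irresistible. The soybeans and charcoal were rationed to us by the team as a favor to the educated youths. We would fill a small clay pot with soybeans, fat pork and water, set it over the charcoal before going to work, and come back to a house full of aroma.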

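Such delicacies could not last, of course, so we started growing our own vegetables. To save trouble we picked the easiest crop and planted two large plots of cucumbers. Once cucumbers start fruiting there is no stopping them; we were overrun. However fast we picked and ate, we could not keep up with their growth. We ate them raw whenever we were idle, and in the evening made cucumber soup or stir-fried cucumber, until we were sick of them. The aftereffect was considerable. For a very long time I regarded cucumber as the cheapest of vegetables: fine to nibble raw now and then, but never a proper dish. Yet times change, and at some point in my wandering years overseas cucumbers suddenly became precious. My wife and daughter both love them. Greenhouse-grown English cucumbers, at two or three US dollars apiece, became a staple in our house. When a meal is short on vegetables and we worry about our daughter's nutrition, we wash a cucumber for her, and she always munches it happily and never tires of it.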

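Cucumber really is hard to make into a dish. With eggs it is fine, stir-fried or in soup; on its own it is not a proper dish and does not go with rice. Eggs were very precious. We kept no chickens and so had no eggs, and we could not bring ourselves to buy any. Eventually some villagers who had borrowed money from us in an emergency and could not repay it dug eggs out from under their hens to settle the debt, and only then did we get a treat. One day the bald team leader came by on inspection, saw our cucumber patch and gave us a fierce dressing-down: “You lazy lot, who told you to plant cucumbers? Not one proper vegetable. What the hell are you going to eat?” By proper vegetables he meant things like peppers and eggplants, which with a little rapeseed oil, and no eggs or meat, can be made mouth-watering. But they are not easy to tend: besides watering they need fertilizing, ideally with manure diluted in water.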

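When we had had enough of cucumbers and there was nothing better to eat, we switched to stir-fried shanyu (called hongshu, sweet potato, in the north). The trick was taught to me by the village cowherd boy. He was a clever lad who, ever since we arrived, kept looking for chances to make friends with us. It was he who told me that sweet potato could be cooked as a dish too, just like stir-fried shredded potato. Sweet potato was a staple ration and we had plenty, so we tried shredding and stir-frying it with oil and salt, and it turned out far tastier and better with rice than cucumber. One thing differs from shredded potato, though: you must stop cooking at the right moment, or it turns to mush and is no good.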

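From the cowherd boy I learned to ride an ox. The old ox may look clumsy, but it walks with great steadiness, one firm footprint after another. At first, looking at the narrow winding paths along the field ridges, I was sure the ox would slip into a ditch or a paddy at any moment; in fact it never put a foot wrong. At a shout from the boy the old ox would obediently lean forward and lower its horns. With the boy's help and encouragement I stepped on a horn, swung myself onto its back, and rode forward in fear and trembling. The main impression of riding an ox is discomfort: its backbone creaked and ground under me, and I could feel no flesh, only bone against my backside. It was nothing like the cowherd's pleasure I had once imagined.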

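The old team leader, who had welcomed us into the village with gongs and drums, was the person closest to us. He liked to look into everything, big or small, and naturally took on the role of guardian of the educated youths. Our time in the village coincided with the heyday of his large family: five children, three boys and two girls, a thriving household. His wife ran the home, hardworking and uncomplaining, and was warm to everyone. The eldest son, Shanhu, was one of our buddies. He was slightly older than me but a good deal shorter and smaller, as if his growth had been stunted, yet he was no slouch at work: a full laborer, and the team's work-point recorder. Shanhu had returned to farming right after primary school and, as the eldest son, shared the family burden with his father alongside his elder sister, a militia member. With three laborers, plus two younger brothers who collected manure, herded ducks and did odd jobs after school, the family earned plenty of work points. Its prosperity, together with the old team leader's prestige, rivaled that of the bald team leader with his three “golden flowers”; these two were the prominent families among the village's sixty or seventy households. The old team leader's home was our home too, and there we felt as much at ease as in our own. They were all warm-hearted, down to the youngest daughter, six years old, who danced and cheered whenever we came. Whenever the family cooked something good, the old team leader called us over. His wife never complained; she always stood by quietly, smiling, and served us food and drink.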

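Shanhu was lively, sincere and warm-hearted; he was like a brother to us and helped us a great deal. He always carried his work-point book, filled with his scrawl: work-point records that only he could read. I have known many people with poor handwriting, and my own is terrible, but to write Chinese characters as ugly, as oddly shaped and as illegible as his takes some doing. We stayed in touch while I was away at university, and each of his letters took a long time to decipher before I could guess most of it. At the end of every letter he drew a little picture, two hands clasped, or a heart tied to a string, a simple way of wishing that our friendship would last forever.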

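The old team leader was a man of high standing in the village: lean and dark, with a pointed chin, a hale constitution and a laugh like a great bell. In which year he became team leader, and in which year he handed over to the bald young leader, we never quite knew. We only knew that he was a demobilized soldier, literate and widely experienced, and the heart of You Village. Our arrival excited him enormously. He ran around arranging everything, leaving nothing to chance. Only one thing struck me as a little comical, and I resisted it inwardly without daring to show it: without fail, once a week, he organized a political study and discussion session for us, and each one lasted the whole evening. On those occasions he sent all the children out of the house and turned the kerosene lamp up bright, not grudging the oil in the least. Unsmiling, sitting bolt upright, he looked especially solemn and thoughtful. I remember him leading us through the Critique of the Gotha Programme, reading it out word by word, very much like a professor, though I never once heard him offer an explanation of his own. As for Marxism-Leninism, I had been fascinated only by “political economy” in middle school and did not understand the other works well. What I could not follow, he did not really understand either; after all, he had only a primary-school education. I wondered at the time what was going on in his mind, why he was so keen on those abstruse Marxist-Leninist originals, and why he always wore that pensive look. I saw myself as a mere kid then, and him as a respected and commanding elder on whom we depended, so whatever doubts I had, I never dared to ask. The study sessions continued until I left You Village.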

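The old team leader's singing voice was magnetic, slightly weathered, and full of charm. I remember the paddy-weeding season: warm sunshine, lush green seedlings, a gentle spring breeze. As he weeded, the old team leader would break into song, with a measured ebb and flow. It sounded a little like a boatmen's work chant, the voice rising and falling, carried on the wind in wave after wave, lilting and drawn out, lingering on and on. What a vivid, harmonious and evocative picture of farming life that was.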

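Many years have passed, but the old team leader's song has stayed in my memory, though I never found out where it came from. Then last year a song newly added to my daughter's iPod seized me at once. It was of course not the old team leader's song, but its melody and inner feeling were uncannily like his, and it revived the song long buried in my heart. Whenever it plays I sink into reverie: the old team leader's face and figure, the fresh breeze and mild sun of the open countryside, the simple, unhurried scenes of farm life and labor all rise before my eyes. I asked my daughter what the tune was. With a look of amazement at what a bumpkin I was, she said it was Akon, of course, the smash hit “Don’t Matter”. Released in 2007, the song was soon in heavy rotation on the radio and topped the chart for two weeks running. I was delighted, and also astonished, that across such a distance a mysterious old Chinese folk tune should match an African American song so closely. I even see in Akon himself a faint trace of the lean, dark, capable old team leader.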

http://www.tudou.com/programs/view/FfPXSuKQ6Jw/?spm=0.0.FfPXSuKQ6Jw.A.ouYVF5
Akon “Don’t Matter”

For a high-energy live performance, see (requires getting past the firewall):
https://youtu.be/JWA5hJl4Dv0

 

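When I graduated from university I went back to visit the village. By then the old team leader had passed away; cancer had taken his life. His daughter had married far away, and word came back that her husband had been locked up for gambling. The second son had died of hepatitis. After all these blows, his widow looked aged and had little to say. The laughter was gone from the household. Only Shanhu held the family together, nearly thirty and still unmarried. When the subject came up he would give a bitter smile and say there was no hurry: he would first put his younger brother and sister through school, and his own affairs could wait. My heart was heavy. I grieved at how fickle life is: so flourishing a family had first lost its pillar, then suffered one misfortune after another, and was now so diminished. The song deep in my memory took on an added note of weathered bitterness and helplessness.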

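As for the original song, I have by now forgotten the actual tune; only the impression of being entranced remains. Having settled on Akon, I am not sure I could recognize the real melody even if it were played today. So let it be Akon. The memory has been externalized and has somewhere to rest, and that is good.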

【Related】

《朝华午拾:与女民兵一道成长的日子》

【朝华午拾集锦:立委流浪图】

《朝华午拾》总目录(置顶)

【离皇冠上的明珠只有一步之遥的感觉】


Parsing is the best game there is, and a practical one too.

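They say fun games are useless and useful things cannot be made into games. For AI people, however, parsing is exactly such a game: the most fun and also the most useful. To lose oneself in it is joy well placed, and a life well spent.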

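Yu:
Professor Li, do you ever find parsers too mentally taxing?
(I meant building parsers; I dropped the word “building.”) My impression is that you start with a fairly elegant rule set, then discover lots of exceptions outside the rules, then start tuning rules and resolving conflicts, and the tidying of rules still has to be done by hand. After years of this, doesn't it get tedious?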

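Me:
Not tedious at all; it is great fun. To play with what AI recognizes as a world-class hard problem, with the summit in sight: what is tedious about that?
What is tedious is an easy language where, after a while, there is nothing left to do. English has become a bit like that. Chinese is not tedious: plenty of territory has yet to submit, and taking a city or a hilltop feels like a general winning a campaign. It is deeply satisfying.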

Liang:
Recovering lost territory?

Me:

【Cross the Yangtze and liberate all of China!】

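Parsing is the best game. First cast a default net and rake in as much as possible. It cannot really be called an “elegant rule set”; it is a guerrilla strategy, nothing elegant about it. It is more like a land grab in the era of primitive accumulation: the more you rake in, the better. Only then does the lexicalist push for precision begin, and that is the real Foolish-Old-Man-moves-the-mountain work. Once a dynamic communication channel is set up between the default and lexicalist strategies, the whole game comes alive.
For instance, Chinese separable words (离合词) are a major campaign. Classifier collocation is a small-to-medium one. Reduplication patterns such as ABAB and AABB are positional warfare. The boundaries of relative clauses are hard to pin down, so that counts as a major campaign. Long-distance gap filling, by contrast, is not a major campaign: once the syntax is basically in place the distance is no longer long, and the gaps being filled are logical SVO slots, which mostly require semantic compatibility. The work becomes fiddly but is not actually hard. (This is what Professor Bai means by letting training on big data automatically replace the fiddly manual work of the semantic middleware, and that big data needs no annotation. I wonder when Professor Bai's grand RNN plan will break ground, or whether it already has.)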
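To make the ABAB and AABB reduplication patterns concrete, here is a minimal sketch that recognizes four-character reduplications and recovers the two-character base. It assumes the input is already segmented into a single word; the function name `reduplication_base` is hypothetical, and a real system would also check the recovered base against a lexicon, which this sketch does not do.

```python
import re

# Backreferences tie the repeated characters together.
AABB = re.compile(r"^(.)\1(.)\2$")  # e.g. 高高兴兴 from 高兴
ABAB = re.compile(r"^(.)(.)\1\2$")  # e.g. 研究研究 from 研究


def reduplication_base(word):
    """Return (pattern, base) for an ABAB or AABB word, else None.

    Words made of one repeated character (AAAA) are rejected, since
    they match both patterns and have no two-character base.
    """
    for name, pattern in (("AABB", AABB), ("ABAB", ABAB)):
        m = pattern.match(word)
        if m and m.group(1) != m.group(2):
            return (name, m.group(1) + m.group(2))
    return None


print(reduplication_base("高高兴兴"))
print(reduplication_base("研究研究"))
print(reduplication_base("高兴"))
```

Pattern matching of this kind only proposes candidates; deciding whether the base is an adjective (typical for AABB) or a verb (typical for ABAB) is left to the lexicon.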

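Parsing is the best game. On the one hand, it is not really the seemingly endless mountain that the Foolish Old Man faced, although this monster still looks rather frightening. Broadly speaking, the structures can be exhausted, even if the details can be wrangled over forever. On the other hand, it is a recognized world-class hard problem. Many people say that natural language understanding (NLU) is the ultimate challenge of artificial intelligence (AI), and deep parsing is widely regarded as the necessary road to NLU. Its importance is comparable to the “1+2” result that Chen Jingrun achieved in his climb toward the summit of the Goldbach conjecture. Our generation will not forget the brilliant pen of Mr. Xu Chi when the “springtime of science” arrived more than thirty years ago: “The queen of the natural sciences is mathematics. The crown of mathematics is number theory. The Goldbach conjecture is the jewel in that crown. ...... Now, only one step remains to the jewel in the crown.” (As one of the last educated youths of the Mao era, I read Xu Chi's long reportage 【哥德巴赫猜想】 (The Goldbach Conjecture) on a tractor jolting along the mountain road back to the county town. I finished it in one sitting, dizzy and bleary-eyed but thrilled.)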

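Even Lin Biao, a once-in-a-generation talent, gave in to pessimism and asked how long the red flag could be kept flying. For deep parsing, though, it can be said plainly right now that the red flag is in sight of the summit: one year at the shortest, three to five years at the longest. Reaching the summit can be defined as about 95% precision and recall (f-score, near-human performance) on formal text in the open domain. In other words, structural analysis at that level surpasses the average person and falls only slightly short of a linguist. For English, for example, we reached the summit five or six years ago.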
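For readers who want the f-score mentioned above spelled out, the following sketch computes the standard F-measure from precision and recall. The function name `f_score` and the sample numbers are illustrative only; the post does not give separate precision and recall figures for the author's parser.

```python
def f_score(precision, recall, beta=1.0):
    """Standard F-measure; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid dividing by zero when nothing is correct
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative values only: balanced 95% precision and recall.
print(f_score(0.95, 0.95))
# A lopsided system is pulled toward its weaker side.
print(f_score(1.0, 0.5))
```

Because the harmonic mean is dominated by the smaller of the two values, a summit defined by f-score requires both precision and recall to be high at the same time.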

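What matters most is that parsing is genuinely useful; calling it the nuclear weapon of natural language applications is no exaggeration. With it or without it, the work goes very differently. Shallow parsing lets one do the work of ten; with deep parsing, one does the work of a hundred or more. In other words, this is a technology that is already mature (90+ accuracy can be considered mature) and has almost unlimited potential.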

Liu:
@wei Your dedication to parsing is admirable.

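Me:
Thanks for the encouragement. Whether parsing finally lands in products depends not on a gap of three to five percentage points in the technology, but on having a good product manager, one who knows the market and the customers and also appreciates and understands the technology's potential.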

Liu:
That is true of any technology.

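Me:
Quantitative change leads to qualitative change. Above 90, a difference of four or five percentage points may not matter much to products and customers. But ten-plus percentage points is a very different story. For example, in open-domain sentiment analysis of social media, our precision with deep parsing support is nearly 20 percentage points higher than that of competitors using machine learning. The results are worlds apart. The reports produced may look equally fancy, but once you actually try to use the sentiment data for concrete analysis and decision support, a gap like that cannot be papered over. Statistical filtering over big data can tolerate a certain amount of error, but not a system with only 60 to 70 percent precision.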

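Of course, some customers only produce reports to follow the trend, rather than using insights to adjust marketing strategy or to ground decisions. For them, precision and quality are less persuasive than a product that is easy to use, fancy and cheap, and such customers are still far from few. In that situation solid technology alone gets no traction. This is really a sign that the market is not yet mature: embracing big data has become the fashion, but the market's capacity to digest, discriminate and apply it has not caught up. Seen from this angle, the North American market is clearly much more mature than the Chinese one.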

 

【Related】

泥沙龙笔记:parsing 是引擎的核武器,再论NLP与搜索

泥沙龙笔记:从 sparse data 再论parsing乃是NLP应用的核武器

It is untrue that Google SyntaxNet is the “world’s most accurate parser”

【立委科普:NLP核武器的奥秘】

徐迟:【哥德巴赫猜想】

《朝华点滴:插队的日子(一)》

关于 parsing

【关于中文NLP】

【置顶:立委NLP博文一览】

《朝华午拾》总目录